From 58a4ddff8973074581717fd712a1c9ca58f853f4 Mon Sep 17 00:00:00 2001
From: myungsub
Date: Mon, 28 Mar 2016 03:37:36 +0900
Subject: [PATCH 001/199] Eng2Kor initial settings

---
 _config.yml | 8 +++---
 classification.md | 33 +++++++++++++------------
 css/main.css | 62 +++++++++++++++++++++++++++++++++++++++++++++++
 glossary.md | 46 +++++++++++++++++++++++++++++++++++
 index.html | 36 +++++++++++++++------------
 5 files changed, 149 insertions(+), 36 deletions(-)
 create mode 100644 glossary.md

diff --git a/_config.yml b/_config.yml
index 79d50b82..e703ff55 100644
--- a/_config.yml
+++ b/_config.yml
@@ -1,11 +1,11 @@
 # Site settings
 title: CS231n Convolutional Neural Networks for Visual Recognition
-email: karpathy@cs.stanford.edu
+email: team.aikorea@gmail.com
 description: "Course materials and notes for Stanford class CS231n: Convolutional Neural Networks for Visual Recognition."
 baseurl: ""
-url: "http://cs231n.github.io"
-twitter_username: cs231n
-github_username: cs231n
+url: "http://aikorea.org/cs231nNotes"
+twitter_username: kjw6612
+github_username: aikorea
 
 # Build settings
 markdown: redcarpet

diff --git a/classification.md b/classification.md
index efeeb8a2..645acc78 100644
--- a/classification.md
+++ b/classification.md
@@ -4,23 +4,23 @@ mathjax: true
 permalink: /classification/
 ---
 
-This is an introductory lecture designed to introduce people from outside of Computer Vision to the Image Classification problem, and the data-driven approach. The Table of Contents:
+본 강의노트는 컴퓨터비전 외의 분야를 공부하던 사람들에게 영상 분류 (Image Classification) 문제와 데이터 기반 방법론(data-driven approach)을 소개한다. 목차는 다음과 같다.
 
-- [Intro to Image Classification, data-driven approach, pipeline](#intro)
-- [Nearest Neighbor Classifier](#nn)
-  - [k-Nearest Neighbor](#knn)
-- [Validation sets, Cross-validation, hyperparameter tuning](#val)
-- [Pros/Cons of Nearest Neighbor](#procon)
-- [Summary](#summary)
-- [Summary: Applying kNN in practice](#summaryapply)
-- [Further Reading](#reading)
+- [영상 분류, 데이터 기반 방법론, 파이프라인](#intro)
+- [Nearest Neighbor 분류기](#nn)
+  - [k-Nearest Neighbor 알고리즘](#knn)
+- [Validation sets, Cross-validation, hyperparameter 튜닝](#val)
+- [Nearest Neighbor의 장단점](#procon)
+- [요약](#summary)
+- [요약: 실제 문제에 kNN 적용하기](#summaryapply)
+- [읽을 자료](#reading)
 
-## Image Classification
+## 영상 분류
 
-**Motivation**. In this section we will introduce the Image Classification problem, which is the task of assigning an input image one label from a fixed set of categories. This is one of the core problems in Computer Vision that, despite its simplicity, has a large variety of practical applications. Moreover, as we will see later in the course, many other seemingly distinct Computer Vision tasks (such as object detection, segmentation) can be reduced to image classification.
+**동기**. 이 섹션에서는 영상 분류 문제에 대해 다룰 것이다. 영상 분류 문제란 입력 이미지를 미리 정해진 카테고리 중 하나로 분류하는 문제로, 문제 정의는 매우 간단하지만 다양한 활용 가능성이 있는 컴퓨터 비전 분야의 핵심적인 문제 중 하나이다. 강의의 나중 파트에서도 살펴보겠지만, 영상 분류와 멀어 보이는 다른 컴퓨터 비전 분야의 여러 문제들 (물체 검출, 영상 분할 등)도 영상 분류 문제로 환원하여 해결할 수 있다.
 
-**Example**. For example, in the image below an image classification model takes a single image and assigns probabilities to 4 labels, *{cat, dog, hat, mug}*. As shown in the image, keep in mind that to a computer an image is represented as one large 3-dimensional array of numbers. In this example, the cat image is 248 pixels wide, 400 pixels tall, and has three color channels Red,Green,Blue (or RGB for short). Therefore, the image consists of 248 x 400 x 3 numbers, or a total of 297,600 numbers. Each number is an integer that ranges from 0 (black) to 255 (white). Our task is to turn this quarter of a million numbers into a single label, such as *"cat"*.
+**예시**. 예를 들어, 아래 그림의 영상 분류 모델은 하나의 이미지를 입력으로 받아서 4개의 라벨 *{cat, dog, hat, mug}* 각각에 대한 확률을 계산한다. 그림에서 볼 수 있듯이, 컴퓨터에게 이미지는 하나의 커다란 3차원 숫자 배열로 표현된다는 점을 기억하자. 이 예시에서 고양이 이미지는 가로 248픽셀, 세로 400픽셀이고, Red, Green, Blue (줄여서 RGB) 3개의 색상 채널로 이루어져 있다. 따라서 이 이미지는 248 x 400 x 3개, 총 297,600개의 숫자로 구성된다. 각 숫자는 0 (검정)부터 255 (흰색)까지의 정수이다. 우리가 풀어야 할 문제는 이 25만 개의 숫자들을 *"cat"*과 같은 하나의 라벨로 바꾸는 것이다.
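To make the "grid of numbers" view concrete, here is a minimal numpy sketch; the file name and the use of Pillow are illustrative assumptions, not part of the course code:

```python
# An image really is just a 3D array of integers in [0, 255].
import numpy as np
from PIL import Image  # assumes Pillow is installed

img = np.array(Image.open('cat.jpg'))  # 'cat.jpg' is a hypothetical path
print(img.shape)  # e.g. (400, 248, 3): height x width x RGB channels
print(img.dtype)  # uint8: integers from 0 (black) to 255 (white)
print(img.size)   # 400 * 248 * 3 = 297,600 numbers to turn into one label
```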
@@ -59,7 +60,7 @@ A good image classification model must be invariant to the cross product of all ### Nearest Neighbor Classifier -As our first approach, we will develop what we call a **Nearest Neighbor Classifier**. This classifier has nothing to do with Convolutional Neural Networks and it is very rarely used in practice, but it will allow us to get an idea about the basic approach to an image classification problem. +As our first approach, we will develop what we call a **Nearest Neighbor Classifier**. This classifier has nothing to do with Convolutional Neural Networks and it is very rarely used in practice, but it will allow us to get an idea about the basic approach to an image classification problem. **Example image classification dataset: CIFAR-10.** One popular toy image classification dataset is the CIFAR-10 dataset. This dataset consists of 60,000 tiny images that are 32 pixels high and wide. Each image is labeled with one of 10 classes (for example *"airplane, automobile, bird, etc"*). These 60,000 images are partitioned into a training set of 50,000 images and a test set of 10,000 images. In the image below you can see 10 random example images from each one of the 10 classes: @@ -137,7 +138,7 @@ class NearestNeighbor(object): If you ran this code, you would see that this classifier only achieves **38.6%** on CIFAR-10. That's more impressive than guessing at random (which would give 10% accuracy since there are 10 classes), but nowhere near human performance (which is [estimated at about 94%](http://karpathy.github.io/2011/04/27/manually-classifying-cifar10/)) or near state-of-the-art Convolutional Neural Networks that achieve about 95%, matching human accuracy (see the [leaderboard](http://www.kaggle.com/c/cifar-10/leaderboard) of a recent Kaggle competition on CIFAR-10). -**The choice of distance.** +**The choice of distance.** There are many other ways of computing distances between vectors. Another common choice could be to instead use the **L2 distance**, which has the geometric interpretation of computing the euclidean distance between two vectors. The distance takes the form: $$ @@ -190,7 +191,7 @@ Ytr = Ytr[1000:] # find hyperparameters that work best on the validation set validation_accuracies = [] for k in [1, 3, 5, 10, 20, 50, 100]: - + # use a particular value of k and evaluation on validation data nn = NearestNeighbor() nn.train(Xtr_rows, Ytr) @@ -257,7 +258,7 @@ In summary: - We saw that the correct way to set these hyperparameters is to split your training data into two: a training set and a fake test set, which we call **validation set**. We try different hyperparameter values and keep the values that lead to the best performance on the validation set. - If the lack of training data is a concern, we discussed a procedure called **cross-validation**, which can help reduce noise in estimating which hyperparameters work best. - Once the best hyperparameters are found, we fix them and perform a single **evaluation** on the actual test set. -- We saw that Nearest Neighbor can get us about 40% accuracy on CIFAR-10. It is simple to implement but requires us to store the entire training set and it is expensive to evaluate on a test image. +- We saw that Nearest Neighbor can get us about 40% accuracy on CIFAR-10. It is simple to implement but requires us to store the entire training set and it is expensive to evaluate on a test image. 
- Finally, we saw that the use of L1 or L2 distances on raw pixel values is not adequate since the distances correlate more strongly with backgrounds and color distributions of images than with their semantic content. In next lectures we will embark on addressing these challenges and eventually arrive at solutions that give 90% accuracies, allow us to completely discard the training set once learning is complete, and they will allow us to evaluate a test image in less than a millisecond.

diff --git a/css/main.css b/css/main.css
index 9be974a7..b495ff28 100644
--- a/css/main.css
+++ b/css/main.css
@@ -48,6 +48,10 @@ a:visited { color: #205caa; }
   margin-bottom: 5px;
 }
 
+.module-header a{
+  color: #8C1515;
+}
+
 .materials-wrap {
   font-size: 18px;
 }
@@ -63,6 +67,64 @@ a:visited { color: #205caa; }
   background-color: #f7f6f1;
 }
 
+#hor-minimalist-a
+{
+  font-family: "Lucida Sans Unicode", "Lucida Grande", Sans-Serif;
+  font-size: 14px;
+  background: #fff;
+  margin: 45px;
+  width: 480px;
+  border-collapse: collapse;
+  text-align: left;
+}
+#hor-minimalist-a th
+{
+  font-size: 16px;
+  font-weight: normal;
+  color: #039;
+  padding: 10px 8px;
+  border-bottom: 2px solid #6678b1;
+}
+#hor-minimalist-a td
+{
+  color: #669;
+  padding: 9px 8px 0px 8px;
+}
+#hor-minimalist-a tbody tr:hover td
+{
+  color: #009;
+}
+
+
+#hor-minimalist-b
+{
+  font-family: "Lucida Sans Unicode", "Lucida Grande", Sans-Serif;
+  font-size: 12px;
+  background: #fff;
+  margin: 45px;
+  width: 480px;
+  border-collapse: collapse;
+  text-align: left;
+}
+#hor-minimalist-b th
+{
+  font-size: 14px;
+  font-weight: normal;
+  color: #039;
+  padding: 10px 8px;
+  border-bottom: 2px solid #6678b1;
+}
+#hor-minimalist-b td
+{
+  border-bottom: 1px solid #ccc;
+  color: #669;
+  padding: 6px 8px;
+}
+#hor-minimalist-b tbody tr:hover td
+{
+  color: #009;
+}
+
 /* Custom CSS rules for content */
 
 .embedded-video {

diff --git a/glossary.md b/glossary.md
new file mode 100644
index 00000000..df433504
--- /dev/null
+++ b/glossary.md
@@ -0,0 +1,46 @@
+---
+layout: page
+mathjax: true
+permalink: /glossary/
+---
+
+영어 --> 한글 번역시 용어의 통일성을 위한 단어장입니다. 새로운 용어에 대한 추가는 GitHub에 이슈를 파서 서로 논의해 보고 정하도록 하면 좋을 것 같습니다.
+
+Markdown 형식의 table이 제대로 렌더링이 안되네요.. 그래서 우선 그냥 html로 표를 그려놓았습니다. 더 깔끔한 방안이 떠오르시는 분들께서는 역시 이슈/PR 부탁드립니다.
+
+<table id="hor-minimalist-b">
+  <thead>
+    <tr>
+      <th>English</th>
+      <th>한글</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr><td>Image</td><td>영상, 이미지 (혼용)</td></tr>
+    <tr><td>Neural network</td><td>신경망, 뉴럴 네트워크</td></tr>
+    <tr><td>Activation function</td><td>활성 함수</td></tr>
+    <tr><td>node</td><td>노드</td></tr>
+    <tr><td>Nearest neighbor</td><td>(영어 그대로)</td></tr>
+    <tr><td>Backpropagation</td><td>(영어 그대로)</td></tr>
+    <tr><td>Chain rule</td><td>연쇄 법칙</td></tr>
+    <tr><td>Classification</td><td>분류</td></tr>
+    <tr><td>Convolutional neural network</td><td>컨볼루션 신경망</td></tr>
+    <tr><td>Regression</td><td>회귀</td></tr>
+  </tbody>
+</table>

diff --git a/index.html b/index.html
index 0d1f9bba..aab102e1 100644
--- a/index.html
+++ b/index.html
@@ -4,16 +4,20 @@
- These notes accompany the Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition. + 스탠포드 CS231n 강의 CS231n: Convolutional Neural Networks for Visual Recognition에 대한 강의노트의 한글 번역 프로젝트입니다.
- For questions/concerns/bug reports regarding contact Justin Johnson regarding the assignments, or contact Andrej Karpathy regarding the course notes. You can also submit a pull request directly to our git repo. + 질문/논의거리/이슈 등은 AI Korea 이메일로 연락주시거나, GitHub 레포지토리에 pull request, 또는 이슈를 열어주세요.
We encourage the use of the hypothes.is extension to annotate comments and discuss these notes inline.
+    <div class="module-header">
+      <a href="/glossary/">Glossary</a>
+    </div>
+
Winter 2016 Assignments
@@ -85,20 +89,20 @@
-    <div class="module-header">Module 1: Neural Networks</div>
+    <div class="module-header">Module 1: 신경망 구조</div>
- Linear classification: Support Vector Machine, Softmax + 선형 분류: Support Vector Machine, Softmax
parametric approach, bias trick, hinge loss, cross-entropy loss, L2 regularization, web demo
- Optimization: Stochastic Gradient Descent + 최적화: Stochastic Gradient Descent
optimization landscapes, local search, learning rate, analytic/numerical gradient @@ -116,34 +120,34 @@
- Backpropagation, Intuitions + Backpropagation, Intuition
- chain rule interpretation, real-valued circuits, patterns in gradient flow + 연쇄 법칙 (chain rule) 해석, real-valued circuits, patterns in gradient flow
- Neural Networks Part 1: Setting up the Architecture + 신경망 파트 1: Setting up the Architecture
- model of a biological neuron, activation functions, neural net architecture, representational power + 생물학적 뉴런 모델, activation functions, 신경망 구조, representational power
- Neural Networks Part 2: Setting up the Data and the Loss + 신경망 파트 2: 데이터 준비 및 Loss
- preprocessing, weight initialization, batch normalization, regularization (L2/dropout), loss functions + 전처리, weight 초기값 설정, batch normalization, regularization (L2/dropout), loss 함수
- Neural Networks Part 3: Learning and Evaluation + 신경망 파트 3: 학습 및 평가
gradient checks, sanity checks, babysitting the learning process, momentum (+nesterov), second-order methods, Adagrad/RMSprop, hyperparameter optimization, model ensembles @@ -186,6 +190,6 @@ Transfer Learning and Fine-tuning Convolutional Neural Networks
From 735d379bea885e8d7ddae2c0bb7e9a3c5816c675 Mon Sep 17 00:00:00 2001 From: Myungsub Choi <91mschoi@hanmail.net> Date: Mon, 28 Mar 2016 05:05:35 +0900 Subject: [PATCH 002/199] Update Readme.md --- Readme.md | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/Readme.md b/Readme.md index dfc3cdf7..1e082731 100644 --- a/Readme.md +++ b/Readme.md @@ -1,3 +1,19 @@ -Notes and assignments for Stanford CS class [CS231n: Convolutional Neural Networks for Visual Recognition](http://vision.stanford.edu/teaching/cs231n/) +English to Korean translation project for the notes and assignments for Stanford CS class [CS231n: Convolutional Neural Networks for Visual Recognition](http://vision.stanford.edu/teaching/cs231n/). + +## How to Participate + +1. Fork this repository +2. Translate the assigned file (markdown, ipython-notebook, etc.) into Korean - Please refer to the [glossary](http://aikorea.org/cs231n/glossary) +3. Send PR + +## Local Development Instructions + +To view the rendered site in your browser, + +1. Install Jekyll - follow the instructions [[here](https://jekyllrb.com/docs/installation/)] +2. `git clone yourusername@cs231n.github.io` +3. `cd cs231n.github.io` +4. `jekyll serve` +5. View the website at http://localhost:4000 From 9ee9d9281d70de26c93918f15ef18a1501056090 Mon Sep 17 00:00:00 2001 From: Myungsub Choi <91mschoi@hanmail.net> Date: Mon, 28 Mar 2016 05:31:15 +0900 Subject: [PATCH 003/199] Create CNAME --- CNAME | 1 + 1 file changed, 1 insertion(+) create mode 100644 CNAME diff --git a/CNAME b/CNAME new file mode 100644 index 00000000..44d8135d --- /dev/null +++ b/CNAME @@ -0,0 +1 @@ +aikorea.org From 488a81e7afade858a19e79d2b9064bae8c19bf63 Mon Sep 17 00:00:00 2001 From: Myungsub Choi <91mschoi@hanmail.net> Date: Mon, 28 Mar 2016 05:40:59 +0900 Subject: [PATCH 004/199] Update _config.yml --- _config.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_config.yml b/_config.yml index e703ff55..1150c2ed 100644 --- a/_config.yml +++ b/_config.yml @@ -3,7 +3,7 @@ title: CS231n Convolutional Neural Networks for Visual Recognition email: team.aikorea@gmail.com description: "Course materials and notes for Stanford class CS231n: Convolutional Neural Networks for Visual Recognition." baseurl: "" -url: "http://aikorea.org/cs231nNotes" +url: "http://aikorea.org/cs231n" twitter_username: kjw6612 github_username: aikorea From 8150e4044a4b9beba7f04c428846a535c9586a7f Mon Sep 17 00:00:00 2001 From: Myungsub Choi <91mschoi@hanmail.net> Date: Mon, 28 Mar 2016 05:56:44 +0900 Subject: [PATCH 005/199] Fix baseurl --- _config.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_config.yml b/_config.yml index 1150c2ed..880816d4 100644 --- a/_config.yml +++ b/_config.yml @@ -2,7 +2,7 @@ title: CS231n Convolutional Neural Networks for Visual Recognition email: team.aikorea@gmail.com description: "Course materials and notes for Stanford class CS231n: Convolutional Neural Networks for Visual Recognition." 
-baseurl: "" +baseurl: "cs231n" url: "http://aikorea.org/cs231n" twitter_username: kjw6612 github_username: aikorea From 4896cfae9167c57294a96c26a4835a4638c84cf6 Mon Sep 17 00:00:00 2001 From: Myungsub Choi <91mschoi@hanmail.net> Date: Mon, 28 Mar 2016 05:57:11 +0900 Subject: [PATCH 006/199] Fix baseurl --- _config.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_config.yml b/_config.yml index 1150c2ed..3f928e85 100644 --- a/_config.yml +++ b/_config.yml @@ -2,7 +2,7 @@ title: CS231n Convolutional Neural Networks for Visual Recognition email: team.aikorea@gmail.com description: "Course materials and notes for Stanford class CS231n: Convolutional Neural Networks for Visual Recognition." -baseurl: "" +baseurl: "/cs231n" url: "http://aikorea.org/cs231n" twitter_username: kjw6612 github_username: aikorea From e867fb0ecc5ebd9958f4b1d357328ab3012adf14 Mon Sep 17 00:00:00 2001 From: myungsub Date: Mon, 28 Mar 2016 06:34:00 +0900 Subject: [PATCH 007/199] fix broken links --- Readme.md | 9 ++++----- _config.yml | 4 ++-- index.html | 4 +--- 3 files changed, 7 insertions(+), 10 deletions(-) diff --git a/Readme.md b/Readme.md index 1e082731..751e2e60 100644 --- a/Readme.md +++ b/Readme.md @@ -9,11 +9,10 @@ English to Korean translation project for the notes and assignments for Stanford ## Local Development Instructions -To view the rendered site in your browser, +To view the rendered site in your browser, 1. Install Jekyll - follow the instructions [[here](https://jekyllrb.com/docs/installation/)] -2. `git clone yourusername@cs231n.github.io` -3. `cd cs231n.github.io` +2. `git clone yourusername@cs231n` +3. `cd cs231n` 4. `jekyll serve` -5. View the website at http://localhost:4000 - +5. View the website at http://127.0.0.1:4000/cs231n/ diff --git a/_config.yml b/_config.yml index 880816d4..f675dff5 100644 --- a/_config.yml +++ b/_config.yml @@ -1,8 +1,8 @@ # Site settings title: CS231n Convolutional Neural Networks for Visual Recognition email: team.aikorea@gmail.com -description: "Course materials and notes for Stanford class CS231n: Convolutional Neural Networks for Visual Recognition." -baseurl: "cs231n" +description: "스탠포드 CS231n: Convolutional Neural Networks for Visual Recognition 수업자료 번역사이트" +baseurl: "/cs231n" url: "http://aikorea.org/cs231n" twitter_username: kjw6612 github_username: aikorea diff --git a/index.html b/index.html index aab102e1..7c935e25 100644 --- a/index.html +++ b/index.html @@ -7,15 +7,13 @@ 스탠포드 CS231n 강의 CS231n: Convolutional Neural Networks for Visual Recognition에 대한 강의노트의 한글 번역 프로젝트입니다.
질문/논의거리/이슈 등은 AI Korea 이메일로 연락주시거나, GitHub 레포지토리에 pull request, 또는 이슈를 열어주세요. -
- We encourage the use of the hypothes.is extension to annotate comments and discuss these notes inline.
Winter 2016 Assignments
From 540ef6cbafa958b870879f8577042029b6b6163b Mon Sep 17 00:00:00 2001
From: Myungsub Choi <91mschoi@hanmail.net>
Date: Mon, 28 Mar 2016 06:49:55 +0900
Subject: [PATCH 008/199] Update _config.yml

---
 _config.yml | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/_config.yml b/_config.yml
index 2bfa62cf..f675dff5 100644
--- a/_config.yml
+++ b/_config.yml
@@ -1,11 +1,7 @@
 # Site settings
 title: CS231n Convolutional Neural Networks for Visual Recognition
 email: team.aikorea@gmail.com
-<<<<<<< HEAD
 description: "스탠포드 CS231n: Convolutional Neural Networks for Visual Recognition 수업자료 번역사이트"
-=======
-description: "Course materials and notes for Stanford class CS231n: Convolutional Neural Networks for Visual Recognition."
->>>>>>> origin/gh-pages
 baseurl: "/cs231n"
 url: "http://aikorea.org/cs231n"
 twitter_username: kjw6612
 github_username: aikorea

From 2d4348b1482cefdb4f165afc1cb9b0295f703f60 Mon Sep 17 00:00:00 2001
From: myungsub
Date: Mon, 28 Mar 2016 17:22:55 +0900
Subject: [PATCH 009/199] fix image links

---
 _config.yml | 2 +-
 _includes/head.html | 2 +-
 aws-tutorial.md | 24 ++++++++++++------------
 classification.md | 22 +++++++++++-----------
 convolutional-networks.md | 20 ++++++++++----------
 ipython-tutorial.md | 12 ++++++------
 linear-classify.md | 14 +++++++-------
 neural-networks-1.md | 20 ++++++++++----------
 neural-networks-2.md | 8 ++++----
 neural-networks-3.md | 18 +++++++++---------
 neural-networks-case-study.md | 6 +++---
 optimization-1.md | 12 ++++++------
 python-numpy-tutorial.md | 12 ++++++------
 terminal-tutorial.md | 8 ++++----
 understanding-cnn.md | 14 +++++++-------
 15 files changed, 97 insertions(+), 97 deletions(-)

diff --git a/_config.yml b/_config.yml
index f675dff5..8765dc22 100644
--- a/_config.yml
+++ b/_config.yml
@@ -3,7 +3,7 @@ title: CS231n Convolutional Neural Networks for Visual Recognition
 email: team.aikorea@gmail.com
 description: "스탠포드 CS231n: Convolutional Neural Networks for Visual Recognition 수업자료 번역사이트"
 baseurl: "/cs231n"
-url: "http://aikorea.org/cs231n"
+url: "http://aikorea.org"
 twitter_username: kjw6612
 github_username: aikorea

diff --git a/_includes/head.html b/_includes/head.html
index 7222af0e..18a47c31 100644
--- a/_includes/head.html
+++ b/_includes/head.html
@@ -23,5 +23,5 @@ ga('send', 'pageview');

diff --git a/aws-tutorial.md b/aws-tutorial.md
index 15ab379f..52b1be5d 100644
--- a/aws-tutorial.md
+++ b/aws-tutorial.md
@@ -21,7 +21,7 @@ Console" button. It will direct you to a signup page which looks like the
 following.
Select the "I am a new user" checkbox, click the "Sign in using our secure @@ -34,13 +34,13 @@ click on "Sign In to the Console", and this time sign in using your username and password.
Once you have signed in, you will be greeted by a page like this:
Make sure that the region information on the top right is set to N. California. @@ -55,14 +55,14 @@ Next, click on the EC2 link (first link under the Compute category). You will go to a dashboard page like this:
Click the blue "Launch Instance" button, and you will be redirected to a page like the following:
Click on the "Community AMIs" link on the left sidebar, and search for "cs231n" @@ -71,19 +71,19 @@ in the search box. You should be able to see the AMI AMI, and continue to the next step to choose your instance type.
Choose the instance type `g2.2xlarge`, and click on "Review and Launch".
In the next page, click on Launch.
You will be then prompted to create or use an existing key-pair. If you already @@ -94,11 +94,11 @@ somewhere that you won't accidentally delete. Remember that there is **NO WAY** to get to your instance if you lose your key.
Once you download your key, you should change the permissions of the key to @@ -113,7 +113,7 @@ After this is done, click on "Launch Instances", and you should see a screen showing that your instances are launching:
Click on "View Instances" to see your instance state. It should change to @@ -121,7 +121,7 @@ Click on "View Instances" to see your instance state. It should change to are now ready to ssh into the instance.
First, note down the Public IP of the instance from the instance listing. Then, diff --git a/classification.md b/classification.md index 645acc78..d0c902a2 100644 --- a/classification.md +++ b/classification.md @@ -24,7 +24,7 @@ permalink: /classification/ For example, in the image below an image classification model takes a single image and assigns probabilities to 4 labels, *{cat, dog, hat, mug}*. As shown in the image, keep in mind that to a computer an image is represented as one large 3-dimensional array of numbers. In this example, the cat image is 248 pixels wide, 400 pixels tall, and has three color channels Red,Green,Blue (or RGB for short). Therefore, the image consists of 248 x 400 x 3 numbers, or a total of 297,600 numbers. Each number is an integer that ranges from 0 (black) to 255 (white). Our task is to turn this quarter of a million numbers into a single label, such as *"cat"*.
The task in Image Classification is to predict a single label (or a distribution over labels as shown here to indicate our confidence) for a given image. Images are 3-dimensional arrays of integers from 0 to 255, of size Width x Height x 3. The 3 represents the three color channels Red, Green, Blue.
@@ -41,14 +41,14 @@ For example, in the image below an image classification model takes a single ima A good image classification model must be invariant to the cross product of all these variations, while simultaneously retaining sensitivity to the inter-class variations.
**Data-driven approach**. How might we go about writing an algorithm that can classify images into distinct categories? Unlike writing an algorithm for, for example, sorting a list of numbers, it is not obvious how one might write an algorithm for identifying cats in images. Therefore, instead of trying to specify what every one of the categories of interest look like directly in code, the approach that we will take is not unlike one you would take with a child: we're going to provide the computer with many examples of each class and then develop learning algorithms that look at these examples and learn about the visual appearance of each class. This approach is referred to as a *data-driven approach*, since it relies on first accumulating a *training dataset* of labeled images. Here is an example of what such a dataset might look like:
An example training set for four visual categories. In practice we may have thousands of categories and hundreds of thousands of images for each category.
@@ -65,7 +65,7 @@ As our first approach, we will develop what we call a **Nearest Neighbor Classif **Example image classification dataset: CIFAR-10.** One popular toy image classification dataset is the CIFAR-10 dataset. This dataset consists of 60,000 tiny images that are 32 pixels high and wide. Each image is labeled with one of 10 classes (for example *"airplane, automobile, bird, etc"*). These 60,000 images are partitioned into a training set of 50,000 images and a test set of 10,000 images. In the image below you can see 10 random example images from each one of the 10 classes:
Left: Example images from the CIFAR-10 dataset. Right: first column shows a few test images and next to each we show the top 10 nearest neighbors in the training set according to pixel-wise difference.
@@ -80,7 +80,7 @@ $$ Where the sum is taken over all pixels. Here is the procedure visualized:
An example of using pixel-wise differences to compare two images with L1 distance (for one color channel in this example). Two images are subtracted elementwise and then all differences are added up to a single number. If two images are identical the result will be zero. But if the images are very different the result will be large.
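The same computation in numpy, on toy 2x2 "images" with arbitrary values (one color channel), just to make the figure's arithmetic concrete:

```python
import numpy as np

I1 = np.array([[56, 32], [90, 23]])
I2 = np.array([[10, 20], [24, 17]])
d1 = np.sum(np.abs(I1 - I2))  # |56-10| + |32-20| + |90-24| + |23-17| = 130
print(d1)
```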
@@ -161,7 +161,7 @@ Note that I included the `np.sqrt` call above, but in a practical nearest neighb You may have noticed that it is strange to only use the label of the nearest image when we wish to make a prediction. Indeed, it is almost always the case that one can do better by using what's called a **k-Nearest Neighbor Classifier**. The idea is very simple: instead of finding the single closest image in the training set, we will find the top **k** closest images, and have them vote on the label of the test image. In particular, when *k = 1*, we recover the Nearest Neighbor classifier. Intuitively, higher values of **k** have a smoothing effect that makes the classifier more resistant to outliers:
An example of the difference between Nearest Neighbor and a 5-Nearest Neighbor classifier, using 2-dimensional points and 3 classes (red, blue, green). The colored regions show the decision boundaries induced by the classifier with an L2 distance. The white regions show points that are ambiguously classified (i.e. class votes are tied for at least two classes). Notice that in the case of a NN classifier, outlier datapoints (e.g. green point in the middle of a cloud of blue points) create small islands of likely incorrect predictions, while the 5-NN classifier smooths over these irregularities, likely leading to better generalization on the test data (not shown). Also note that the gray regions in the 5-NN image are caused by ties in the votes among the nearest neighbors (e.g. 2 neighbors are red, next two neighbors are blue, last neighbor is green).
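For concreteness, here is a minimal sketch of the voting step. The function and variable names are ours, not the assignment code, and it assumes the distances were already computed (e.g. with the L1 code above):

```python
import numpy as np

def knn_predict(dists, train_labels, k=5):
    # dists: (num_test, num_train) distances; vote among the k closest train points
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = train_labels[nearest]                 # shape (num_test, k)
    return np.array([np.bincount(v).argmax() for v in votes])

dists = np.array([[3.0, 1.0, 2.5, 0.5, 4.0]])     # one test point, five train points
labels = np.array([0, 1, 1, 0, 2])
print(knn_predict(dists, labels, k=3))            # nearest labels are 0,1,1 -> predicts [1]
```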
@@ -212,7 +212,7 @@ By the end of this procedure, we could plot a graph that shows which values of * In cases where the size of your training data (and therefore also the validation data) might be small, people sometimes use a more sophisticated technique for hyperparameter tuning called **cross-validation**. Working with our previous example, the idea is that instead of arbitrarily picking the first 1000 datapoints to be the validation set and rest training set, you can get a better and less noisy estimate of how well a certain value of *k* works by iterating over different validation sets and averaging the performance across these. For example, in 5-fold cross-validation, we would split the training data into 5 equal folds, use 4 of them for training, and 1 for validation. We would then iterate over which fold is the validation fold, evaluate the performance, and finally average the performance across the different folds.
Example of a 5-fold cross-validation run for the parameter k. For each value of k we train on 4 folds and evaluate on the 5th. Hence, for each k we receive 5 accuracies on the validation fold (accuracy is the y-axis, each result is a point). The trend line is drawn through the average of the results for each k and the error bars indicate the standard deviation. Note that in this particular case, the cross-validation suggests that a value of about k = 7 works best on this particular dataset (corresponding to the peak in the plot). If we used more than 5 folds, we might expect to see a smoother (i.e. less noisy) curve.
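A sketch of the loop behind this figure. It assumes `Xtr_rows` and `Ytr` from the code earlier in these notes, plus a k-aware `predict`, which is an assumed extension of the `NearestNeighbor` class rather than the official solution:

```python
import numpy as np

num_folds = 5
X_folds = np.array_split(Xtr_rows, num_folds)
y_folds = np.array_split(Ytr, num_folds)

for k in [1, 3, 5, 7, 10, 20, 50, 100]:
    accs = []
    for i in range(num_folds):
        # fold i is the validation fold, the remaining folds are training data
        X_tr = np.concatenate(X_folds[:i] + X_folds[i+1:])
        y_tr = np.concatenate(y_folds[:i] + y_folds[i+1:])
        nn = NearestNeighbor()
        nn.train(X_tr, y_tr)
        Yval_predict = nn.predict(X_folds[i], k=k)  # assumes predict accepts k
        accs.append(np.mean(Yval_predict == y_folds[i]))
    print('k = %d, mean accuracy = %f' % (k, np.mean(accs)))
```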
@@ -221,7 +221,7 @@ In cases where the size of your training data (and therefore also the validation **In practice**. In practice, people prefer to avoid cross-validation in favor of having a single validation split, since cross-validation can be computationally expensive. The splits people tend to use is between 50%-90% of the training data for training and rest for validation. However, this depends on multiple factors: For example if the number of hyperparameters is large you may prefer to use bigger validation splits. If the number of examples in the validation set is small (perhaps only a few hundred or so), it is safer to use cross-validation. Typical number of folds you can see in practice would be 3-fold, 5-fold or 10-fold cross-validation.
Common data splits. A training and test set is given. The training set is split into folds (for example 5 folds here). The folds 1-4 become the training set. One fold (e.g. fold 5 here in yellow) is denoted as the Validation fold and is used to tune the hyperparameters. Cross-validation goes a step further iterates over the choice of which fold is the validation fold, separately from 1-5. This would be referred to as 5-fold cross-validation. In the very end once the model is trained and all the best hyperparameters were determined, the model is evaluated a single time on the test data (red).
@@ -235,15 +235,15 @@ As an aside, the computational complexity of the Nearest Neighbor classifier is The Nearest Neighbor Classifier may sometimes be a good choice in some settings (especially if the data is low-dimensional), but it is rarely appropriate for use in practical image classification settings. One problem is that images are high-dimensional objects (i.e. they often contain many pixels), and distances over high-dimensional spaces can be very counter-intuitive. The image below illustrates the point that the pixel-based L2 similarities we developed above are very different from perceptual similarities:
Pixel-based distances on high-dimensional data (and images especially) can be very unintuitive. An original image (left) and three other images next to it that are all equally far away from it based on L2 pixel distance. Clearly, the pixel-wise distance does not correspond at all to perceptual or semantic similarity.
Here is one more visualization to convince you that using pixel differences to compare images is inadequate. We can use a visualization technique called t-SNE to take the CIFAR-10 images and embed them in two dimensions so that their (local) pairwise distances are best preserved. In this visualization, images that are shown nearby are considered to be very near according to the L2 pixelwise distance we developed above:
CIFAR-10 images embedded in two dimensions with t-SNE. Images that are nearby on this image are considered to be close based on the L2 pixel distance. Notice the strong effect of background rather than semantic class differences. Click here for a bigger version of this visualization.
In particular, note that images that are nearby each other are much more a function of the general color distribution of the images, or the type of background rather than their semantic identity. For example, a dog can be seen very near a frog since both happen to be on white background. Ideally we would like images of all of the 10 classes to form their own clusters, so that images of the same class are nearby to each other regardless of irrelevant characteristics and variations (such as the background). However, to get this property we will have to go beyond raw pixels. diff --git a/convolutional-networks.md b/convolutional-networks.md index c55ea1f0..c7d41224 100644 --- a/convolutional-networks.md +++ b/convolutional-networks.md @@ -35,8 +35,8 @@ So what does change? ConvNet architectures make the explicit assumption that the *3D volumes of neurons*. Convolutional Neural Networks take advantage of the fact that the input consists of images and they constrain the architecture in a more sensible way. In particular, unlike a regular Neural Network, the layers of a ConvNet have neurons arranged in 3 dimensions: **width, height, depth**. (Note that the word *depth* here refers to the third dimension of an activation volume, not to the depth of a full Neural Network, which can refer to the total number of layers in a network.) For example, the input images in CIFAR-10 are an input volume of activations, and the volume has dimensions 32x32x3 (width, height, depth respectively). As we will soon see, the neurons in a layer will only be connected to a small region of the layer before it, instead of all of the neurons in a fully-connected manner. Moreover, the final output layer would for CIFAR-10 have dimensions 1x1x10, because by the end of the ConvNet architecture we will reduce the full image into a single vector of class scores, arranged along the depth dimension. Here is a visualization:
Left: A regular 3-layer Neural Network. Right: A ConvNet arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers. Every layer of a ConvNet transforms the 3D input volume to a 3D output volume of neuron activations. In this example, the red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be 3 (Red, Green, Blue channels).
@@ -66,7 +66,7 @@ In summary: - Each Layer may or may not have additional hyperparameters (e.g. CONV/FC/POOL do, RELU doesn't)
The activations of an example ConvNet architecture. The initial volume stores the raw image pixels and the last volume stores the class scores. Each volume of activations along the processing path is shown as a column. Since it's difficult to visualize 3D volumes, we lay out each volume's slices in rows. The last layer volume holds the scores for each class, but here we only visualize the sorted top 5 scores, and print the labels of each one. The full web-based demo is shown in the header of our website. The architecture shown here is a tiny VGG Net, which we will discuss later.
@@ -88,8 +88,8 @@ The Conv layer is the core building block of a Convolutional Network, and its ou *Example 2*. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3\*3\*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in space (e.g. 3x3), but full along the input depth (20).
Left: An example input volume in red (e.g. a 32x32x3 CIFAR-10 image), and an example volume of neurons in the first Convolutional layer. Each neuron in the convolutional layer is connected only to a local region in the input volume spatially, but to the full depth (i.e. all color channels). Note, there are multiple neurons (5 in this example) along the depth, all looking at the same region in the input - see discussion of depth columns in text below. Right: The neurons from the Neural Network chapter remain unchanged: They still compute a dot product of their weights with the input followed by a non-linearity, but their connectivity is now restricted to be local spatially.
@@ -104,7 +104,7 @@ The Conv layer is the core building block of a Convolutional Network, and its ou We can compute the spatial size of the output volume as a function of the input volume size (\\(W\\)), the receptive field size of the Conv Layer neurons (\\(F\\)), the stride with which they are applied (\\(S\\)), and the amount of zero padding used (\\(P\\)) on the border. You can convince yourself that the correct formula for calculating how many neurons "fit" is given by \\((W - F + 2P)/S + 1\\). If this number is not an integer, then the strides are set incorrectly and the neurons cannot be tiled so that they "fit" across the input volume neatly, in a symmetric way. An example might help to get intuitions for this formula:
Illustration of spatial arrangement. In this example there is only one spatial dimension (x-axis), one neuron with a receptive field size of F = 3, the input size is W = 5, and there is zero padding of P = 1. Left: The neuron strided across the input in stride of S = 1, giving output of size (5 - 3 + 2)/1+1 = 5. Right: The neuron uses stride of S = 2, giving output of size (5 - 3 + 2)/2+1 = 3. Notice that stride S = 3 could not be used since it wouldn't fit neatly across the volume. In terms of the equation, this can be determined since (5 - 3 + 2) = 4 is not divisible by 3.
The neuron weights are in this example [1,0,-1] (shown on very right), and its bias is zero. These weights are shared across all yellow neurons (see parameter sharing below). @@ -124,7 +124,7 @@ It turns out that we can dramatically reduce the number of parameters by making Notice that if all neurons in a single depth slice are using the same weight vector, then the forward pass of the CONV layer can in each depth slice be computed as a **convolution** of the neuron's weights with the input volume (Hence the name: Convolutional Layer). Therefore, it is common to refer to the sets of weights as a **filter** (or a **kernel**), which is convolved with the input. The result of this convolution is an *activation map* (e.g. of size [55x55]), and the set of activation maps for each different filter are stacked together along the depth dimension to produce the output volume (e.g. [55x55x96]).
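The fit/doesn't-fit arithmetic above is easy to script; a small helper of our own, purely for illustration:

```python
def conv_output_size(W, F, S, P):
    """Number of neurons that fit: (W - F + 2P)/S + 1; must be an integer."""
    n = (W - F + 2 * P) / S + 1
    assert n == int(n), 'neurons do not fit neatly across this input'
    return int(n)

print(conv_output_size(W=5, F=3, S=1, P=1))  # 5, the left example
print(conv_output_size(W=5, F=3, S=2, P=1))  # 3, the right example
# conv_output_size(W=5, F=3, S=3, P=1) raises: (5 - 3 + 2) = 4 is not divisible by 3
```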
Example filters learned by Krizhevsky et al. Each of the 96 filters shown here is of size [11x11x3], and each one is shared by the 55*55 neurons in one depth slice. Notice that the parameter sharing assumption is relatively reasonable: If detecting a horizontal edge is important at some location in the image, it should intuitively be useful at some other location as well due to the translationally-invariant structure of images. There is therefore no need to relearn to detect a horizontal edge at every one of the 55*55 distinct locations in the Conv layer output volume.
@@ -175,7 +175,7 @@ A common setting of the hyperparameters is \\(F = 3, S = 1, P = 1\\). However, t **Convolution Demo**. Below is a running demo of a CONV layer. Since 3D volumes are hard to visualize, all the volumes (the input volume (in blue), the weight volumes (in red), the output volume (in green)) are visualized with each depth slice stacked in rows. The input volume is of size \\(W\_1 = 5, H\_1 = 5, D\_1 = 3\\), and the CONV layer parameters are \\(K = 2, F = 3, S = 2, P = 1\\). That is, we have two filters of size \\(3 \times 3\\), and they are applied with a stride of 2. Therefore, the output volume size has spatial size (5 - 3 + 2)/2 + 1 = 3. Moreover, notice that a padding of \\(P = 1\\) is applied to the input volume, making the outer border of the input volume zero. The visualization below iterates over the output activations (green), and shows that each element is computed by elementwise multiplying the highlighted input (blue) with the filter (red), summing it up, and then offsetting the result by the bias.
@@ -211,8 +211,8 @@ It is worth noting that there are only two commonly seen variations of the max p **General pooling**. In addition to max pooling, the pooling units can also perform other functions, such as *average pooling* or even *L2-norm pooling*. Average pooling was often used historically but has recently fallen out of favor compared to the max pooling operation, which has been shown to work better in practice.
Pooling layer downsamples the volume spatially, independently in each depth slice of the input volume. Left: In this example, the input volume of size [224x224x64] is pooled with filter size 2, stride 2 into output volume of size [112x112x64]. Notice that the volume depth is preserved. Right: The most common downsampling operation is max, giving rise to max pooling, here shown with a stride of 2. That is, each max is taken over 4 numbers (little 2x2 square).
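A tiny numpy sketch of max pooling with a 2x2 filter and stride 2 on one depth slice (toy values; the reshape trick is just one way to express the non-overlapping 2x2 maxima):

```python
import numpy as np

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=np.float32)

# group into non-overlapping 2x2 blocks, then take each block's max
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[6. 8.]
               #  [3. 4.]]
```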
diff --git a/ipython-tutorial.md b/ipython-tutorial.md index 1c894162..6eb157a1 100644 --- a/ipython-tutorial.md +++ b/ipython-tutorial.md @@ -29,13 +29,13 @@ see a screen like this, showing all available IPython notebooks in the current directory:
If you click through to a notebook file, you will see a screen like this:
An IPython notebook is made up of a number of **cells**. Each cell can contain @@ -45,14 +45,14 @@ will be displayed beneath the cell. For example, after running the first cell the notebook looks like this:
Global variables are shared between cells. Executing the second cell thus gives the following result:
By convention, IPython notebooks are expected to be run from top to bottom. @@ -60,14 +60,14 @@ Failing to execute some cells or executing cells out of order can result in errors:
After you have modified an IPython notebook for one of the assignments by modifying or executing some of its cells, remember to **save your changes!**
This has only been a brief introduction to IPython notebooks, but it should

diff --git a/linear-classify.md b/linear-classify.md
index 01d206a0..1540214a 100644
--- a/linear-classify.md
+++ b/linear-classify.md
@@ -53,7 +53,7 @@ There are a few things to note:
 
 Notice that a linear classifier computes the score of a class as a weighted sum of all of its pixel values across all 3 of its color channels. Depending on precisely what values we set for these weights, the function has the capacity to like or dislike (depending on the sign of each weight) certain colors at certain positions in the image. For instance, you can imagine that the "ship" class might be more likely if there is a lot of blue on the sides of an image (which could likely correspond to water). You might expect that the "ship" classifier would then have a lot of positive weights across its blue channel weights (presence of blue increases score of ship), and negative weights in the red/green channels (presence of red/green decreases the score of ship).
An example of mapping an image to class scores. For the sake of visualization, we assume the image only has 4 pixels (4 monochrome pixels, we are not considering color channels in this example for brevity), and that we have 3 classes (red (cat), green (dog), blue (ship) class). (Clarification: in particular, the colors here simply indicate 3 classes and are not related to the RGB channels.) We stretch the image pixels into a column and perform matrix multiplication to get the scores for each class. Note that this particular set of weights W is not good at all: the weights assign our cat image a very low cat score. In particular, this set of weights seems convinced that it's looking at a dog.
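A few lines of numpy reproduce the figure's pipeline. The sizes are toy and the values random, purely to show the shapes involved, not the learned weights of any real classifier:

```python
import numpy as np

np.random.seed(0)
x = np.random.randint(0, 256, size=(4, 1)).astype(float)  # 4 pixel values, stretched into a column
W = 0.01 * np.random.randn(3, 4)                          # 3 classes x 4 pixels
b = 0.01 * np.random.randn(3, 1)
scores = W.dot(x) + b                                     # one score per class
print(scores.ravel())
```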
@@ -62,7 +62,7 @@ Notice that a linear classifier computes the score of a class as a weighted sum Since we defined the score of each class as a weighted sum of all image pixels, each class score is a linear function over this space. We cannot visualize 3072-dimensional spaces, but if we imagine squashing all those dimensions into only two dimensions, then we can try to visualize what the classifier might be doing:
Cartoon representation of the image space, where each image is a single point, and three classifiers are visualized. Using the example of the car classifier (in red), the red line shows all points in the space that get a score of zero for the car class. The red arrow shows the direction of increase, so all points to the right of the red line have positive (and linearly increasing) scores, and all points to the left have a negative (and linearly decreasing) scores.
@@ -74,7 +74,7 @@ As we saw above, every row of \\(W\\) is a classifier for one of the classes. Th Another interpretation for the weights \\(W\\) is that each row of \\(W\\) corresponds to a *template* (or sometimes also called a *prototype*) for one of the classes. The score of each class for an image is then obtained by comparing each template with the image using an *inner product* (or *dot product*) one by one to find the one that "fits" best. With this terminology, the linear classifier is doing template matching, where the templates are learned. Another way to think of it is that we are still effectively doing Nearest Neighbor, but instead of having thousands of training images we are only using a single image per class (although we will learn it, and it does not necessarily have to be one of the images in the training set), and we use the (negative) inner product as the distance instead of the L1 or L2 distance.
Skipping ahead a bit: Example learned weights at the end of learning for CIFAR-10. Note that, for example, the ship template contains a lot of blue pixels as expected. This template will therefore give a high score once it is matched against images of ships on the ocean with an inner product.
@@ -97,7 +97,7 @@ $$ With our CIFAR-10 example, \\(x\_i\\) is now [3073 x 1] instead of [3072 x 1] - (with the extra dimension holding the constant 1), and \\(W\\) is now [10 x 3073] instead of [10 x 3072]. The extra column that \\(W\\) now corresponds to the bias \\(b\\). An illustration might help clarify:
Illustration of the bias trick. Doing a matrix multiplication and then adding a bias vector (left) is equivalent to adding a bias dimension with a constant of 1 to all input vectors and extending the weight matrix by 1 column - a bias column (right). Thus, if we preprocess our data by appending ones to all vectors we only have to learn a single matrix of weights instead of two matrices that hold the weights and the biases.
@@ -144,7 +144,7 @@ A last piece of terminology we'll mention before we finish with this section is
The Multiclass Support Vector Machine "wants" the score of the correct class to be higher than all other scores by at least a margin of delta. If any class has a score inside the red region (or higher), then there will be accumulated loss. Otherwise the loss will be zero. Our objective will be to find the weights that will simultaneously satisfy this constraint for all examples in the training data and give a total loss that is as low as possible.
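The margin condition translates directly to code; a hedged, unvectorized sketch of the per-example loss (the function name is ours, and delta = 1.0 is just an example setting):

```python
import numpy as np

def svm_loss_i(x, y, W, delta=1.0):
    scores = W.dot(x)                                   # class scores for one example
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0                                      # the correct class contributes no loss
    return np.sum(margins)
```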
@@ -306,7 +306,7 @@ p = np.exp(f) / np.sum(np.exp(f)) # safe to do, gives the correct answer A picture might help clarify the distinction between the Softmax and SVM classifiers:
Example of the difference between the SVM and Softmax classifiers for one datapoint. In both cases we compute the same score vector f (e.g. by matrix multiplication in this section). The difference is in the interpretation of the scores in f: The SVM interprets these as class scores and its loss function encourages the correct class (class 2, in blue) to have a score higher by a margin than the other class scores. The Softmax classifier instead interprets the scores as (unnormalized) log probabilities for each class and then encourages the (normalized) log probability of the correct class to be high (equivalently the negative of it to be low). The final loss for this example is 1.58 for the SVM and 1.04 for the Softmax classifier, but note that these numbers are not comparable; They are only meaningful in relation to loss computed within the same classifier and with the same data.
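As a sanity check, the figure's numbers can be reproduced in a few lines. The raw scores below are read off the figure and should be treated as assumed inputs:

```python
import numpy as np

f = np.array([-2.85, 0.86, 0.28])   # scores from the figure; correct class is index 2
y = 2
margins = np.maximum(0, f - f[y] + 1.0)
margins[y] = 0
print(np.sum(margins))              # about 1.58, the SVM loss
p = np.exp(f) / np.sum(np.exp(f))   # safe here; see the numeric stability note above
print(-np.log(p[y]))                # about 1.04, the Softmax (cross-entropy) loss
```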
@@ -331,7 +331,7 @@ where the probabilites are now more diffuse. Moreover, in the limit where the we
We have written an interactive web demo to help your intuitions with linear classifiers. The demo visualizes the loss functions discussed in this section using a toy 3-way classification on 2D data. The demo also jumps ahead a bit and performs the optimization, which we will discuss in full detail in the next section.
diff --git a/neural-networks-1.md b/neural-networks-1.md index 6ffaafba..c7a9094c 100644 --- a/neural-networks-1.md +++ b/neural-networks-1.md @@ -38,8 +38,8 @@ The area of Neural Networks has originally been primarily inspired by the goal o The basic computational unit of the brain is a **neuron**. Approximately 86 billion neurons can be found in the human nervous system and they are connected with approximately 10^14 - 10^15 **synapses**. The diagram below shows a cartoon drawing of a biological neuron (left) and a common mathematical model (right). Each neuron receives input signals from its **dendrites** and produces output signals along its (single) **axon**. The axon eventually branches out and connects via synapses to dendrites of other neurons. In the computational model of a neuron, the signals that travel along the axons (e.g. \\(x\_0\\)) interact multiplicatively (e.g. \\(w\_0 x\_0\\)) with the dendrites of the other neuron based on the synaptic strength at that synapse (e.g. \\(w\_0\\)). The idea is that the synaptic strengths (the weights \\(w\\)) are learnable and control the strength of influence (and its direction: excitory (positive weight) or inhibitory (negative weight)) of one neuron on another. In the basic model, the dendrites carry the signal to the cell body where they all get summed. If the final sum is above a certain threshold, the neuron can *fire*, sending a spike along its axon. In the computational model, we assume that the precise timings of the spikes do not matter, and that only the frequency of the firing communicates information. Based on this *rate code* interpretation, we model the *firing rate* of the neuron with an **activation function** \\(f\\), which represents the frequency of the spikes along the axon. Historically, a common choice of activation function is the **sigmoid function** \\(\sigma\\), since it takes a real-valued input (the signal strength after the sum) and squashes it to range between 0 and 1. We will see details of these activation functions later in this section.
A cartoon drawing of a biological neuron (left) and its mathematical model (right).
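In code, one forward pass of this model neuron is only a few lines; a sketch of a single sigmoid neuron (class and variable names are illustrative):

```python
import numpy as np

class Neuron(object):
    def __init__(self, weights, bias):
        self.weights = weights  # synaptic strengths w
        self.bias = bias

    def forward(self, inputs):
        # weighted sum in the cell body, then the sigmoid activation (firing rate)
        cell_body_sum = np.sum(inputs * self.weights) + self.bias
        return 1.0 / (1.0 + np.exp(-cell_body_sum))

neuron = Neuron(weights=np.array([0.5, -1.2, 0.3]), bias=0.1)
print(neuron.forward(np.array([1.0, 0.5, -0.5])))  # a firing rate in (0, 1)
```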
@@ -79,8 +79,8 @@ The mathematical form of the model Neuron's forward computation might look famil Every activation function (or *non-linearity*) takes a single number and performs a certain fixed mathematical operation on it. There are several activation functions you may encounter in practice:
Left: Sigmoid non-linearity squashes real numbers to range between [0,1] Right: The tanh non-linearity squashes real numbers to range between [-1,1].
@@ -92,8 +92,8 @@ Every activation function (or *non-linearity*) takes a single number and perform **Tanh.** The tanh non-linearity is shown on the image above on the right. It squashes a real-valued number to the range [-1, 1]. Like the sigmoid neuron, its activations saturate, but unlike the sigmoid neuron its output is zero-centered. Therefore, in practice the *tanh non-linearity is always preferred to the sigmoid nonlinearity.* Also note that the tanh neuron is simply a scaled sigmoid neuron, in particular the following holds: \\( \tanh(x) = 2 \sigma(2x) -1 \\).
@@ -120,8 +120,8 @@ This concludes our discussion of the most common types of neurons and their acti **Neural Networks as neurons in graphs**. Neural Networks are modeled as collections of neurons that are connected in an acyclic graph. In other words, the outputs of some neurons can become inputs to other neurons. Cycles are not allowed since that would imply an infinite loop in the forward pass of a network. Instead of an amorphous blobs of connected neurons, Neural Network models are often organized into distinct layers of neurons. For regular neural networks, the most common layer type is the **fully-connected layer** in which neurons between two adjacent layers are fully pairwise connected, but neurons within a single layer share no connections. Below are two example Neural Network topologies that use a stack of fully-connected layers:
Left: A 2-layer Neural Network (one hidden layer of 4 neurons (or units) and one output layer with 2 neurons), and three inputs. Right: A 3-layer neural network with three inputs, two hidden layers of 4 neurons each and one output layer. Notice that in both cases there are connections (synapses) between neurons across layers, but not within a layer.
@@ -177,7 +177,7 @@ The full story is, of course, much more involved and a topic of much recent rese How do we decide on what architecture to use when faced with a practical problem? Should we use no hidden layers? One hidden layer? Two hidden layers? How large should each layer be? First, note that as we increase the size and number of layers in a Neural Network, the **capacity** of the network increases. That is, the space of representable functions grows since the neurons can collaborate to express many different functions. For example, suppose we had a binary classification problem in two dimensions. We could train three separate neural networks, each with one hidden layer of some size and obtain the following classifiers:
Larger Neural Networks can represent more complicated functions. The data are shown as circles colored by their class, and the decision regions by a trained neural network are shown underneath. You can play with these examples in this ConvNetsJS demo.
@@ -190,7 +190,7 @@ The subtle reason behind this is that smaller networks are harder to train with To reiterate, the regularization strength is the preferred way to control the overfitting of a neural network. We can look at the results achieved by three different settings:
The effects of regularization strength: Each neural network above has 20 hidden neurons, but changing the regularization strength makes its final decision regions smoother with a higher regularization. You can play with these examples in this ConvNetsJS demo.
diff --git a/neural-networks-2.md b/neural-networks-2.md index 6c064227..8e8bd789 100644 --- a/neural-networks-2.md +++ b/neural-networks-2.md @@ -29,7 +29,7 @@ There are three common forms of data preprocessing a data matrix `X`, where we w In case of images, the relative scales of pixels are already approximately equal (and in range from 0 to 255), so it is not strictly necessary to perform this additional preprocessing step.
Common data preprocessing pipeline. Left: Original toy, 2-dimensional input data. Middle: The data is zero-centered by subtracting the mean in each dimension. The data cloud is now centered around the origin. Right: Each dimension is additionally scaled by its standard deviation. The red lines indicate the extent of the data - they are of unequal length in the middle, but of equal length on the right.
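The figure's two steps are each one line of numpy, assuming a data matrix `X` of size [N x D] as elsewhere in these notes (toy data here so the snippet runs standalone):

```python
import numpy as np

X = np.random.randn(100, 2) * np.array([3.0, 0.5]) + np.array([5.0, -2.0])  # toy cloud
X -= np.mean(X, axis=0)  # zero-center: the data cloud moves to the origin (middle panel)
X /= np.std(X, axis=0)   # scale: each dimension gets unit standard deviation (right panel)
```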
@@ -72,14 +72,14 @@ Xwhite = Xrot / np.sqrt(S + 1e-5) *Warning: Exaggerating noise.* Note that we're adding 1e-5 (or a small constant) to prevent division by zero. One weakness of this transformation is that it can greatly exaggerate the noise in the data, since it stretches all dimensions (including the irrelevant dimensions of tiny variance that are mostly noise) to be of equal size in the input. This can in practice be mitigated by stronger smoothing (i.e. increasing 1e-5 to be a larger number).
PCA / Whitening. Left: Original toy, 2-dimensional input data. Middle: After performing PCA. The data is centered at zero and then rotated into the eigenbasis of the data covariance matrix. This decorrelates the data (the covariance matrix becomes diagonal). Right: Each dimension is additionally scaled by the eigenvalues, transforming the data covariance matrix into the identity matrix. Geometrically, this corresponds to stretching and squeezing the data into an isotropic gaussian blob.
We can also try to visualize these transformations with CIFAR-10 images. The training set of CIFAR-10 is of size 50,000 x 3072, where every image is stretched out into a 3072-dimensional row vector. We can then compute the [3072 x 3072] covariance matrix and compute its SVD decomposition (which can be relatively expensive). What do the computed eigenvectors look like visually? An image might help:
- +
Left: An example set of 49 images. 2nd from Left: The top 144 out of 3072 eigenvectors. The top eigenvectors account for most of the variance in the data, and we can see that they correspond to lower frequencies in the images. 2nd from Right: The 49 images reduced with PCA, using the 144 eigenvectors shown here. That is, instead of expressing every image as a 3072-dimensional vector where each element is the brightness of a particular pixel at some location and channel, every image above is only represented with a 144-dimensional vector, where each element measures how much the corresponding eigenvector contributes to making up the image. In order to visualize what image information has been retained in the 144 numbers, we must rotate back into the "pixel" basis of 3072 numbers. Since U is a rotation, this can be achieved by multiplying by U.transpose()[:144,:], and then visualizing the resulting 3072 numbers as the image. You can see that the images are slightly more blurry, reflecting the fact that the top eigenvectors capture lower frequencies. However, most of the information is still preserved. Right: Visualization of the "white" representation, where the variance along every one of the 144 dimensions is squashed to equal length. Here, the whitened 144 numbers are rotated back to the image pixel basis by multiplying by U.transpose()[:144,:]. The lower frequencies (which accounted for most variance) are now negligible, while the higher frequencies (which account for relatively little variance originally) become exaggerated.

@@ -139,7 +139,7 @@ There are several ways of controlling the capacity of Neural Networks to prevent **Dropout** is an extremely effective, simple and recently introduced regularization technique by Srivastava et al. in [Dropout: A Simple Way to Prevent Neural Networks from Overfitting](http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf) (pdf) that complements the other methods (L1, L2, maxnorm). While training, dropout is implemented by only keeping a neuron active with some probability \\(p\\) (a hyperparameter), or setting it to zero otherwise.
- +
Figure taken from the Dropout paper that illustrates the idea. During training, Dropout can be interpreted as sampling a Neural Network within the full Neural Network, and only updating the parameters of the sampled network based on the input data. (However, the exponential number of possible sampled networks are not independent because they share the parameters.) During testing there is no dropout applied, with the interpretation of evaluating an averaged prediction across the exponentially-sized ensemble of all sub-networks (more about ensembles in the next section).
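A minimal sketch of the commonly used "inverted dropout" formulation on a single hidden layer, assuming toy shapes; scaling by p at training time means no extra work is needed at test time:

```python
import numpy as np

p = 0.5  # probability of keeping a unit active; higher = less dropout

def train_step(X, W, b):
  H = np.maximum(0, np.dot(X, W) + b)        # forward pass through the layer
  mask = (np.random.rand(*H.shape) < p) / p  # inverted dropout: scale at train time
  return H * mask                            # zero out the dropped units

def predict(X, W, b):
  return np.maximum(0, np.dot(X, W) + b)     # no dropout, no rescaling at test time

W, b = 0.01 * np.random.randn(2, 50), np.zeros(50)
H = train_step(np.random.randn(5, 2), W, b)
```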
diff --git a/neural-networks-3.md b/neural-networks-3.md index e30706ee..9f9052b8 100644 --- a/neural-networks-3.md +++ b/neural-networks-3.md @@ -104,8 +104,8 @@ The x-axis of the plots below are always in units of epochs, which measure how m The first quantity that is useful to track during training is the loss, as it is evaluated on the individual batches during the forward pass. Below is a cartoon diagram showing the loss over time, and especially what the shape might tell you about the learning rate:
- - + +
Left: A cartoon depicting the effects of different learning rates. With low learning rates the improvements will be linear. With high learning rates they will start to look more exponential. Higher learning rates will decay the loss faster, but they get stuck at worse values of loss (green line). This is because there is too much "energy" in the optimization and the parameters are bouncing around chaotically, unable to settle in a nice spot in the optimization landscape. Right: An example of a typical loss function over time, while training a small network on CIFAR-10 dataset. This loss function looks reasonable (it might indicate a slightly too small learning rate based on its speed of decay, but it's hard to say), and also indicates that the batch size might be a little too low (since the cost is a little too noisy).
@@ -123,7 +123,7 @@ Sometimes loss functions can look funny [lossfunctions.tumblr.com](http://lossfu The second important quantity to track while training a classifier is the validation/training accuracy. This plot can give you valuable insights into the amount of overfitting in your model:
- +
The gap between the training and validation accuracy indicates the amount of overfitting. Two possible cases are shown in the diagram on the left. The blue validation error curve shows very small validation accuracy compared to the training accuracy, indicating strong overfitting (note, it's possible for the validation accuracy to even start to go down after some point). When you see this in practice you probably want to increase regularization (stronger L2 weight penalty, more dropout, etc.) or collect more data. The other possible case is when the validation accuracy tracks the training accuracy fairly well. This case indicates that your model capacity is not high enough: make the model larger by increasing the number of parameters.
@@ -158,8 +158,8 @@ An incorrect initialization can slow down or even completely stall the learning Lastly, when one is working with image pixels it can be helpful and satisfying to plot the first-layer features visually:
- - + +
Examples of visualized weights for the first layer of a neural network. Left: Noisy features could be a symptom of an unconverged network, an improperly set learning rate, or a very low weight regularization penalty. Right: Nice, smooth, clean and diverse features are a good indication that the training is proceeding well.
@@ -203,7 +203,7 @@ Here we see an introduction of a `v` variable that is initialized at zero, and a The core idea behind Nesterov momentum is that when the current parameter vector is at some position `x`, then looking at the momentum update above, we know that the momentum term alone (i.e. ignoring the second term with the gradient) is about to nudge the parameter vector by `mu * v`. Therefore, if we are about to compute the gradient, we can treat the future approximate position `x + mu * v` as a "lookahead" - this is a point in the vicinity of where we are soon going to end up. Hence, it makes sense to compute the gradient at `x + mu * v` instead of at the "old/stale" position `x`.
- +
Nesterov momentum. Instead of evaluating gradient at the current position (red circle), we know that our momentum is about to carry us to the tip of the green arrow. With Nesterov momentum we therefore instead evaluate the gradient at this "looked-ahead" position.
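In update-rule form, the lookahead described above might be sketched as follows; the quadratic toy gradient is a placeholder so that the loop runs end to end:

```python
import numpy as np

mu = 0.9                 # momentum hyperparameter
learning_rate = 1e-2
x = np.random.randn(10)  # parameter vector
v = np.zeros_like(x)     # velocity, initialized at zero

def grad(x):
  return 2 * x  # placeholder gradient (of f(x) = sum(x**2)) for illustration

for step in xrange(100):
  x_ahead = x + mu * v      # the "lookahead" position
  dx_ahead = grad(x_ahead)  # evaluate the gradient at x + mu * v, not at x
  v = mu * v - learning_rate * dx_ahead
  x += v
```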
@@ -304,8 +304,8 @@ Additional References: - [Unit Tests for Stochastic Optimization](http://arxiv.org/abs/1312.6055) proposes a series of tests as a standardized benchmark for stochastic optimization.
- - + +
Animations that may help your intuitions about the learning process dynamics. Left: Contours of a loss surface and time evolution of different optimization algorithms. Notice the "overshooting" behavior of momentum-based methods, which make the optimization look like a ball rolling down the hill. Right: A visualization of a saddle point in the optimization landscape, where the curvature along different dimensions has different signs (one dimension curves up and another down). Notice that SGD has a very hard time breaking symmetry and gets stuck on the top. Conversely, algorithms such as RMSprop will see very low gradients in the saddle direction. Due to the denominator term in the RMSprop update, this will increase the effective learning rate along this direction, helping RMSProp proceed. Image credit: Alec Radford.
@@ -331,7 +331,7 @@ But as we saw, there are many more relatively less sensitive hyperparameters, for e **Prefer random search to grid search**. As argued by Bergstra and Bengio in [Random Search for Hyper-Parameter Optimization](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf), "randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid". As it turns out, this is also usually easier to implement.
- +
Core illustration from Random Search for Hyper-Parameter Optimization by Bergstra and Bengio. It is very often the case that some of the hyperparameters matter much more than others (e.g. top hyperparam vs. left one in this figure). Performing random search rather than grid search allows you to much more precisely discover good values for the important ones.
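A sketch of what random search might look like in practice; `train_and_evaluate` is a hypothetical helper standing in for a full training run, and the sampling ranges are illustrative:

```python
import numpy as np

for trial in xrange(10):
  learning_rate = 10 ** np.random.uniform(-6, -1)  # sample on a log scale
  reg = 10 ** np.random.uniform(-5, 1)             # sample on a log scale
  # val_acc = train_and_evaluate(learning_rate, reg)  # hypothetical helper
  print 'trial %d: lr = %e, reg = %e' % (trial, learning_rate, reg)
```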
diff --git a/neural-networks-case-study.md b/neural-networks-case-study.md index aa9246aa..450c412e 100644 --- a/neural-networks-case-study.md +++ b/neural-networks-case-study.md @@ -40,7 +40,7 @@ plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral) ```
- +
The toy spiral data consists of three classes (blue, red, yellow) that are not linearly separable.
@@ -253,7 +253,7 @@ print 'training accuracy: %.2f' % (np.mean(predicted_class == y)) This prints **49%**. Not very good at all, but also not surprising given that the dataset is constructed so it is not linearly separable. We can also plot the learned decision boundaries:
- +
Linear classifier fails to learn the toy spiral dataset.
@@ -403,7 +403,7 @@ print 'training accuracy: %.2f' % (np.mean(predicted_class == y)) Which prints **98%**! We can also visualize the decision boundaries:
- +
Neural Network classifier crushes the spiral dataset.
diff --git a/optimization-1.md b/optimization-1.md index bc0bab7c..e3fb1b90 100644 --- a/optimization-1.md +++ b/optimization-1.md @@ -41,9 +41,9 @@ We saw that a setting of the parameters \\(W\\) that produced predictions for ex The loss functions we'll look at in this class are usually defined over very high-dimensional spaces (e.g. in CIFAR-10 a linear classifier weight matrix is of size [10 x 3073] for a total of 30,730 parameters), making them difficult to visualize. However, we can still gain some intuitions about one by slicing through the high-dimensional space along rays (1 dimension), or along planes (2 dimensions). For example, we can generate a random weight matrix \\(W\\) (which corresponds to a single point in the space), then march along a ray and record the loss function value along the way. That is, we can generate a random direction \\(W\_1\\) and compute the loss along this direction by evaluating \\(L(W + a W\_1)\\) for different values of \\(a\\). This process generates a simple plot with the value of \\(a\\) as the x-axis and the value of the loss function as the y-axis. We can also carry out the same procedure with two dimensions by evaluating the loss \\( L(W + a W\_1 + b W\_2) \\) as we vary \\(a, b\\). In a plot, \\(a, b\\) could then correspond to the x-axis and the y-axis, and the value of the loss function can be visualized with a color:
- - - + + +
Loss function landscape for the Multiclass SVM (without regularization) for one single example (left, middle) and for a hundred examples (right) in CIFAR-10. Left: one-dimensional loss by only varying a. Middle, Right: two-dimensional loss slice, Blue = low loss, Red = high loss. Notice the piecewise-linear structure of the loss function. The losses for multiple examples are combined by averaging, so the bowl shape on the right is the average of many piece-wise linear bowls (such as the one in the middle).
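The slicing procedure described above is only a few lines of numpy; the quadratic `L` below is a stand-in for a real loss function such as the SVM loss:

```python
import numpy as np

def L(W):
  return np.sum(W * W)  # placeholder loss; any loss function works here

W = 0.001 * np.random.randn(10, 3073)   # a random point in weight space
W1 = 0.001 * np.random.randn(10, 3073)  # a random direction

for a in np.arange(-1.0, 1.1, 0.25):
  print 'a = %.2f, loss = %f' % (a, L(W + a * W1))  # loss along the ray
```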
@@ -69,7 +69,7 @@ $$ Since these examples are 1-dimensional, the data \\(x\_i\\) and weights \\(w\_j\\) are numbers. Looking at, for instance, \\(w\_0\\), some terms above are linear functions of \\(w\_0\\) and each is clamped at zero. We can visualize this as follows:
- +
1-dimensional illustration of the data loss. The x-axis is a single weight and the y-axis is the loss. The data loss is a sum of multiple terms, each of which is either independent of a particular weight, or a linear function of it that is thresholded at zero. The full SVM data loss is a 30,730-dimensional version of this shape.
@@ -256,7 +256,7 @@ for step_size_log in [-10, -9, -8, -7, -6, -5,-4,-3,-2,-1]: **Effect of step size**. The gradient tells us the direction in which the function has the steepest rate of increase, but it does not tell us how far along this direction we should step. As we will see later in the course, choosing the step size (also called the *learning rate*) will become one of the most important (and most headache-inducing) hyperparameter settings in training a neural network. In our blindfolded hill-descent analogy, we feel the hill below our feet sloping in some direction, but the step length we should take is uncertain. If we shuffle our feet carefully we can expect to make consistent but very small progress (this corresponds to having a small step size). Conversely, we can choose to make a large, confident step in an attempt to descend faster, but this may not pay off. As you can see in the code example above, at some point taking a bigger step gives a higher loss as we "overstep".
- +
Visualizing the effect of step size. We start at some particular spot W and evaluate the gradient (or rather its negative - the white arrow) which tells us the direction of the steepest decrease in the loss function. Small steps are likely to lead to consistent but slow progress. Large steps can lead to better progress but are more risky. Note that eventually, for a large step size we will overshoot and make the loss worse. The step size (or as we will later call it - the learning rate) will become one of the most important hyperparameters that we will have to carefully tune.
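The step-size sweep referenced in the hunk above can be sketched end to end with a toy objective; the quadratic `loss_fun` and its analytic gradient are placeholders for the real CIFAR-10 loss:

```python
import numpy as np

def loss_fun(W):
  return np.sum((W - 1.0) ** 2)  # toy loss with its minimum at W = 1

def evaluate_gradient(W):
  return 2 * (W - 1.0)  # analytic gradient of the toy loss

W = 0.001 * np.random.randn(10, 3073)  # random starting weights
for step_size_log in [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1]:
  step_size = 10 ** step_size_log
  W_new = W - step_size * evaluate_gradient(W)  # step along the negative gradient
  print 'step size %e, new loss: %f' % (step_size, loss_fun(W_new))
```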
@@ -324,7 +324,7 @@ The extreme case of this is a setting where the mini-batch contains only a singl ### Summary
- +
Summary of the information flow. The dataset of pairs of (x,y) is given and fixed. The weights start out as random numbers and can change. During the forward pass the score function computes class scores, stored in vector f. The loss function contains two components: The data loss computes the compatibility between the scores f and the labels y. The regularization loss is only a function of the weights. During Gradient Descent, we compute the gradient on the weights (and optionally on the data if we wish) and use it to perform a parameter update.
diff --git a/python-numpy-tutorial.md b/python-numpy-tutorial.md index 65e630a7..ef6f4afc 100644 --- a/python-numpy-tutorial.md +++ b/python-numpy-tutorial.md @@ -958,8 +958,8 @@ imsave('assets/cat_tinted.jpg', img_tinted) ```
- - + +
Left: The original image. Right: The tinted and resized image. @@ -1033,7 +1033,7 @@ plt.show() # You must call plt.show() to make graphics appear. Running this code produces the following plot:
- +
With just a little bit of extra work we can easily plot multiple lines @@ -1058,7 +1058,7 @@ plt.legend(['Sine', 'Cosine']) plt.show() ```
- +
You can read much more about the `plot` function @@ -1096,7 +1096,7 @@ plt.show() ```
- +
You can read much more about the `subplot` function @@ -1129,5 +1129,5 @@ plt.show() ```
- +
diff --git a/terminal-tutorial.md b/terminal-tutorial.md index 771efb68..5c7fa6ef 100644 --- a/terminal-tutorial.md +++ b/terminal-tutorial.md @@ -12,25 +12,25 @@ For each assignment, we will provide you a link to a shared terminal snapshot. T Here's an example of what a snapshot page looked like for an assignment in 2015:
- +
Yours will look similar. Click the "Start" button in the lower right corner. This will clone the shared snapshot to your own account. Now you should be able to find the terminal under the [My Terminals](https://www.stanfordterminalcloud.com/terminals) tab.
- +
Yours will look similar. You are all set! To work on the assignments, click the link to your terminal (shown in the red box in the above image). This link will open up the user interface layer over an AWS machine. It will look something like this:
- +
We have set up the Jupyter Notebook and other dependencies in the terminal. Launch a new console window with the small + sign (if you don't already have one), navigate around and look for the assignment folder and code. Launch a Jupyter notebook and work on the assignment. If you're a student enrolled in the class you will submit your assignment through Coursework:
- +
For more information about [Terminal](https://www.stanfordterminalcloud.com), check out the [FAQ](https://www.stanfordterminalcloud.com/faq) page. diff --git a/understanding-cnn.md b/understanding-cnn.md index 24eb909c..a8e3fa23 100644 --- a/understanding-cnn.md +++ b/understanding-cnn.md @@ -16,8 +16,8 @@ Several approaches for understanding and visualizing Convolutional Networks have **Layer Activations**. The most straight-forward visualization technique is to show the activations of the network during the forward pass. For ReLU networks, the activations usually start out looking relatively blobby and dense, but as the training progresses the activations usually become more sparse and localized. One dangerous pitfall that can be easily noticed with this visualization is that some activation maps may be all zero for many different inputs, which can indicate *dead* filters, and can be a symptom of high learning rates.
- - + +
Typical-looking activations on the first CONV layer (left), and the 5th CONV layer (right) of a trained AlexNet looking at a picture of a cat. Every box shows an activation map corresponding to some filter. Notice that the activations are sparse (most values are zero, in this visualization shown in black) and mostly local.
@@ -26,8 +26,8 @@ Several approaches for understanding and visualizing Convolutional Networks have **Conv/FC Filters.** The second common strategy is to visualize the weights. These are usually most interpretable on the first CONV layer which is looking directly at the raw pixel data, but it is possible to also show the filter weights deeper in the network. The weights are useful to visualize because well-trained networks usually display nice and smooth filters without any noisy patterns. Noisy patterns can be an indicator of a network that hasn't been trained for long enough, or possibly a very low regularization strength that may have led to overfitting.
- - + +
Typical-looking filters on the first CONV layer (left), and the 2nd CONV layer (right) of a trained AlexNet. Notice that the first-layer weights are very nice and smooth, indicating a nicely converged network. The color/grayscale features are clustered because the AlexNet contains two separate streams of processing, and an apparent consequence of this architecture is that one stream develops high-frequency grayscale features and the other low-frequency color features. The 2nd CONV layer weights are not as interpretable, but it is apparent that they are still smooth, well-formed, and free of noisy patterns.
@@ -38,7 +38,7 @@ Several approaches for understanding and visualizing Convolutional Networks have Another visualization technique is to take a large dataset of images, feed them through the network and keep track of which images maximally activate some neuron. We can then visualize the images to get an understanding of what the neuron is looking for in its receptive field. One such visualization (among others) is shown in [Rich feature hierarchies for accurate object detection and semantic segmentation](http://arxiv.org/abs/1311.2524) by Ross Girshick et al.:
- +
Maximally activating images for some POOL5 (5th pool layer) neurons of an AlexNet. The activation values and the receptive field of the particular neuron are shown in white. (In particular, note that the POOL5 neurons are a function of a relatively large portion of the input image!) It can be seen that some neurons are responsive to upper bodies, text, or specular highlights.
@@ -53,7 +53,7 @@ ConvNets can be interpreted as gradually transforming the images into a represen To produce an embedding, we can take a set of images and use the ConvNet to extract the CNN codes (e.g. in AlexNet the 4096-dimensional vector right before the classifier, and crucially, including the ReLU non-linearity). We can then plug these into t-SNE and get a 2-dimensional vector for each image. The corresponding images can then be visualized in a grid:
- +
t-SNE embedding of a set of images based on their CNN codes. Images that are nearby each other are also close in the CNN representation space, which implies that the CNN "sees" them as being very similar. Notice that the similarities are more often class-based and semantic rather than pixel- and color-based. For more details on how this visualization was produced, the associated code, and more related visualizations at different scales, refer to t-SNE visualization of CNN codes.
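As a sketch of the pipeline, assuming scikit-learn's t-SNE implementation as a stand-in for whichever one was actually used, and random numbers in place of real CNN codes:

```python
import numpy as np
from sklearn.manifold import TSNE

codes = np.random.randn(500, 4096)  # placeholder for the real [N x 4096] CNN codes
coords = TSNE(n_components=2).fit_transform(codes)  # one 2-D point per image
print coords.shape  # (500, 2)
```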
@@ -64,7 +64,7 @@ To produce an embedding, we can take a set of images and use the ConvNet to extr Suppose that a ConvNet classifies an image as a dog. How can we be certain that it's actually picking up on the dog in the image as opposed to some contextual cues from the background or some other miscellaneous object? One way of investigating which part of the image some classification prediction is coming from is by plotting the probability of the class of interest (e.g. dog class) as a function of the position of an occluder object. That is, we iterate over regions of the image, set a patch of the image to be all zero, and look at the probability of the class. We can visualize the probability as a 2-dimensional heat map. This approach has been used in Matthew Zeiler's [Visualizing and Understanding Convolutional Networks](http://arxiv.org/abs/1311.2901):
- +
Three input images (top). Notice that the occluder region is shown in grey. As we slide the occluder over the image we record the probability of the correct class and then visualize it as a heatmap (shown below each image). For instance, in the left-most image we see that the probability of Pomeranian plummets when the occluder covers the face of the dog, giving us some level of confidence that the dog's face is primarily responsible for the high classification score. Conversely, zeroing out other parts of the image is seen to have relatively negligible impact.
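The occlusion experiment itself is easy to sketch; `class_probability` below is a placeholder for a forward pass through a real ConvNet, and the occluder size and stride are illustrative choices:

```python
import numpy as np

def class_probability(img):
  return np.random.rand()  # placeholder for a real ConvNet's class probability

img = np.random.randn(227, 227, 3)  # the input image
patch, stride = 32, 16              # occluder size and stride
H = (img.shape[0] - patch) / stride + 1
heatmap = np.zeros((H, H))
for i in xrange(H):
  for j in xrange(H):
    occluded = img.copy()
    y, x = i * stride, j * stride
    occluded[y:y+patch, x:x+patch, :] = 0        # zero out one patch
    heatmap[i, j] = class_probability(occluded)  # record the class probability
```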
From c933ba2cd49d38a8790a9a317a67b7067028fe04 Mon Sep 17 00:00:00 2001 From: Myungsub Choi <91mschoi@hanmail.net> Date: Mon, 28 Mar 2016 18:29:14 +0900 Subject: [PATCH 010/199] Update _config.yml --- _config.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_config.yml b/_config.yml index 8765dc22..133b1587 100644 --- a/_config.yml +++ b/_config.yml @@ -8,5 +8,5 @@ twitter_username: kjw6612 github_username: aikorea # Build settings -markdown: redcarpet +markdown: kramdown permalink: pretty From 35c3821512600f5c3be77d2bc03fee825347eaf8 Mon Sep 17 00:00:00 2001 From: Myungsub Choi <91mschoi@hanmail.net> Date: Mon, 28 Mar 2016 18:29:37 +0900 Subject: [PATCH 011/199] Update Readme.md --- Readme.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Readme.md b/Readme.md index 751e2e60..ce5adcf8 100644 --- a/Readme.md +++ b/Readme.md @@ -12,7 +12,7 @@ English to Korean translation project for the notes and assignments for Stanford To view the rendered site in your browser, 1. Install Jekyll - follow the instructions [[here](https://jekyllrb.com/docs/installation/)] -2. `git clone yourusername@cs231n` +2. `git clone https://github.com/yourUserName/cs231n.git` 3. `cd cs231n` 4. `jekyll serve` 5. View the website at http://127.0.0.1:4000/cs231n/ From 484c43d753b5f11b34f244d64df987ff675d744d Mon Sep 17 00:00:00 2001 From: myungsub Date: Sun, 3 Apr 2016 23:56:55 +0900 Subject: [PATCH 012/199] Fix broken equations by changing to kramdown --- Readme.md | 2 +- assignment1.md | 8 +- assignment2.md | 12 +-- assignment3.md | 12 +-- assignments2016/assignment1.md | 12 +-- assignments2016/assignment2.md | 12 +-- assignments2016/assignment3.md | 12 +-- aws-tutorial.md | 16 +-- classification.md | 26 ++--- convolutional-networks.md | 101 +++++++++--------- glossary.md | 101 ++++++++++++++---- ipython-tutorial.md | 8 +- linear-classify.md | 112 ++++++++++---------- neural-networks-1.md | 51 +++++----- neural-networks-2.md | 88 ++++++++-------- neural-networks-3.md | 60 +++++------ neural-networks-case-study.md | 108 ++++++++++---------- optimization-1.md | 70 ++++++------- optimization-2.md | 50 ++++----- python-numpy-tutorial.md | 180 ++++++++++++++++----------------- 20 files changed, 553 insertions(+), 488 deletions(-) diff --git a/Readme.md b/Readme.md index ce5adcf8..01875a65 100644 --- a/Readme.md +++ b/Readme.md @@ -12,7 +12,7 @@ English to Korean translation project for the notes and assignments for Stanford To view the rendered site in your browser, 1. Install Jekyll - follow the instructions [[here](https://jekyllrb.com/docs/installation/)] -2. `git clone https://github.com/yourUserName/cs231n.git` +2. Assuming that you have already forked this repo, `git clone https://github.com/yourUserName/cs231n.git` 3. `cd cs231n` 4. `jekyll serve` 5. View the website at http://127.0.0.1:4000/cs231n/ diff --git a/assignment1.md b/assignment1.md index 599e5a54..c3d52595 100644 --- a/assignment1.md +++ b/assignment1.md @@ -30,7 +30,7 @@ for the project. If you choose not to use a virtual environment, it is up to you to make sure that all dependencies for the code are installed on your machine. 
To set up a virtual environment, run the following: -```bash +~~~bash cd assignment1 sudo pip install virtualenv # This may already be installed virtualenv .env # Create a virtual environment @@ -38,16 +38,16 @@ source .env/bin/activate # Activate the virtual environment pip install -r requirements.txt # Install dependencies # Work on the assignment for a while ... deactivate # Exit the virtual environment -``` +~~~ **Download data:** Once you have the starter code, you will need to download the CIFAR-10 dataset. Run the following from the `assignment1` directory: -```bash +~~~bash cd cs231n/datasets ./get_datasets.sh -``` +~~~ **Start IPython:** After you have the CIFAR-10 data, you should start the IPython notebook server from the diff --git a/assignment2.md b/assignment2.md index f35b2375..9a8750b8 100644 --- a/assignment2.md +++ b/assignment2.md @@ -26,7 +26,7 @@ for the project. If you choose not to use a virtual environment, it is up to you to make sure that all dependencies for the code are installed on your machine. To set up a virtual environment, run the following: -```bash +~~~bash cd assignment2 sudo pip install virtualenv # This may already be installed virtualenv .env # Create a virtual environment @@ -34,7 +34,7 @@ source .env/bin/activate # Activate the virtual environment pip install -r requirements.txt # Install dependencies # Work on the assignment for a while ... deactivate # Exit the virtual environment -``` +~~~ You can reuse the virtual environment that you created for the first assignment, but you will need to run `pip install -r requirements.txt` after activating it @@ -44,16 +44,16 @@ to install additional dependencies required by this assignment. Once you have the starter code, you will need to download the CIFAR-10 dataset. Run the following from the `assignment2` directory: -```bash +~~~bash cd cs231n/datasets ./get_datasets.sh -``` +~~~ **Compile the Cython extension:** Convolutional Neural Networks require a very efficient implementation. We have implemented of the functionality using [Cython](http://cython.org/); you will need to compile the Cython extension before you can run the code. From the `cs231n` directory, run the following command: -```bash +~~~bash python setup.py build_ext --inplace -``` +~~~ **Start IPython:** After you have the CIFAR-10 data, you should start the IPython notebook server from the diff --git a/assignment3.md b/assignment3.md index 52dddd32..caa2f08a 100644 --- a/assignment3.md +++ b/assignment3.md @@ -28,7 +28,7 @@ for the project. If you choose not to use a virtual environment, it is up to you to make sure that all dependencies for the code are installed on your machine. To set up a virtual environment, run the following: -```bash +~~~bash cd assignment3 sudo pip install virtualenv # This may already be installed virtualenv .env # Create a virtual environment @@ -36,7 +36,7 @@ source .env/bin/activate # Activate the virtual environment pip install -r requirements.txt # Install dependencies # Work on the assignment for a while ... deactivate # Exit the virtual environment -``` +~~~ You can reuse the virtual environment that you created for the first or second assignment, but you will need to run `pip install -r requirements.txt` after @@ -52,18 +52,18 @@ Run the following from the `assignment3` directory: NOTE: After downloading and unpacking, the data and pretrained models will take about 900MB of disk space. 
-```bash +~~~bash cd cs231n/datasets ./get_datasets.sh ./get_tiny_imagenet_splits.sh ./get_pretrained_models.sh -``` +~~~ **Compile the Cython extension:** Convolutional Neural Networks require a very efficient implementation. We have implemented of the functionality using [Cython](http://cython.org/); you will need to compile the Cython extension before you can run the code. From the `cs231n` directory, run the following command: -```bash +~~~bash python setup.py build_ext --inplace -``` +~~~ **Start IPython:** After you have downloaded the data and compiled the Cython extensions, diff --git a/assignments2016/assignment1.md b/assignments2016/assignment1.md index 0bc7efc4..e99ca6da 100644 --- a/assignments2016/assignment1.md +++ b/assignments2016/assignment1.md @@ -17,7 +17,7 @@ In this assignment you will practice putting together a simple image classificat - get a basic understanding of performance improvements from using **higher-level representations** than raw pixels (e.g. color histograms, Histogram of Gradient (HOG) features) ## Setup -You can work on the assignment in one of two ways: locally on your own machine, or on a virtual machine through Terminal.com. +You can work on the assignment in one of two ways: locally on your own machine, or on a virtual machine through Terminal.com. ### Working in the cloud on Terminal @@ -32,7 +32,7 @@ The preferred approach for installing all the assignment dependencies is to use **[Option 2] Manual install, virtual environment:** If you'd like to (instead of Anaconda) go with a more manual and risky installation route you will likely want to create a [virtual environment](http://docs.python-guide.org/en/latest/dev/virtualenvs/) for the project. If you choose not to use a virtual environment, it is up to you to make sure that all dependencies for the code are installed globally on your machine. To set up a virtual environment, run the following: -```bash +~~~bash cd assignment1 sudo pip install virtualenv # This may already be installed virtualenv .env # Create a virtual environment @@ -40,16 +40,16 @@ source .env/bin/activate # Activate the virtual environment pip install -r requirements.txt # Install dependencies # Work on the assignment for a while ... deactivate # Exit the virtual environment -``` +~~~ **Download data:** Once you have the starter code, you will need to download the CIFAR-10 dataset. Run the following from the `assignment1` directory: -```bash +~~~bash cd cs231n/datasets ./get_datasets.sh -``` +~~~ **Start IPython:** After you have the CIFAR-10 data, you should start the IPython notebook server from the @@ -79,7 +79,7 @@ The IPython Notebook **svm.ipynb** will walk you through implementing the SVM cl The IPython Notebook **softmax.ipynb** will walk you through implementing the Softmax classifier. ### Q4: Two-Layer Neural Network (25 points) -The IPython Notebook **two\_layer\_net.ipynb** will walk you through the implementation of a two-layer neural network classifier. +The IPython Notebook **two_layer_net.ipynb** will walk you through the implementation of a two-layer neural network classifier. ### Q5: Higher Level Representations: Image Features (10 points) diff --git a/assignments2016/assignment2.md b/assignments2016/assignment2.md index aedc082f..e5b36cef 100644 --- a/assignments2016/assignment2.md +++ b/assignments2016/assignment2.md @@ -54,7 +54,7 @@ for the project. If you choose not to use a virtual environment, it is up to you to make sure that all dependencies for the code are installed globally on your machine. 
To set up a virtual environment, run the following: -```bash +~~~bash cd assignment2 sudo pip install virtualenv # This may already be installed virtualenv .env # Create a virtual environment @@ -62,16 +62,16 @@ source .env/bin/activate # Activate the virtual environment pip install -r requirements.txt # Install dependencies # Work on the assignment for a while ... deactivate # Exit the virtual environment -``` +~~~ **Download data:** Once you have the starter code, you will need to download the CIFAR-10 dataset. Run the following from the `assignment2` directory: -```bash +~~~bash cd cs231n/datasets ./get_datasets.sh -``` +~~~ **Compile the Cython extension:** Convolutional Neural Networks require a very efficient implementation. We have implemented of the functionality using @@ -79,9 +79,9 @@ efficient implementation. We have implemented of the functionality using before you can run the code. From the `cs231n` directory, run the following command: -```bash +~~~bash python setup.py build_ext --inplace -``` +~~~ **Start IPython:** After you have the CIFAR-10 data, you should start the IPython notebook server diff --git a/assignments2016/assignment3.md b/assignments2016/assignment3.md index 8b7e7aad..b35b748c 100644 --- a/assignments2016/assignment3.md +++ b/assignments2016/assignment3.md @@ -48,7 +48,7 @@ for the project. If you choose not to use a virtual environment, it is up to you to make sure that all dependencies for the code are installed globally on your machine. To set up a virtual environment, run the following: -```bash +~~~bash cd assignment3 sudo pip install virtualenv # This may already be installed virtualenv .env # Create a virtual environment @@ -56,17 +56,17 @@ source .env/bin/activate # Activate the virtual environment pip install -r requirements.txt # Install dependencies # Work on the assignment for a while ... deactivate # Exit the virtual environment -``` +~~~ **Download data:** Once you have the starter code, you will need to download the processed MS-COCO dataset, the TinyImageNet dataset, and the pretrained TinyImageNet model. Run the following from the `assignment3` directory: -```bash +~~~bash cd cs231n/datasets ./get_coco_captioning.sh ./get_tiny_imagenet_a.sh ./get_pretrained_model.sh -``` +~~~ **Compile the Cython extension:** Convolutional Neural Networks require a very efficient implementation. We have implemented of the functionality using @@ -74,9 +74,9 @@ efficient implementation. We have implemented of the functionality using before you can run the code. From the `cs231n` directory, run the following command: -```bash +~~~bash python setup.py build_ext --inplace -``` +~~~ **Start IPython:** After you have the data, you should start the IPython notebook server diff --git a/aws-tutorial.md b/aws-tutorial.md index 52b1be5d..e357872b 100644 --- a/aws-tutorial.md +++ b/aws-tutorial.md @@ -104,9 +104,9 @@ to get to your instance if you lose your key. Once you download your key, you should change the permissions of the key to user-only RW, In Linux/OSX you can do it by: -``` +~~~ $ chmod 600 PEM_FILENAME -``` +~~~ Here `PEM_FILENAME` is the full file name of the .pem file you just downloaded. After this is done, click on "Launch Instances", and you should see a screen @@ -127,26 +127,26 @@ are now ready to ssh into the instance. First, note down the Public IP of the instance from the instance listing. Then, do: -``` +~~~ ssh -i PEM_FILENAME ubuntu@PUBLIC_IP -``` +~~~ Now you should be logged in to the instance. 
You can check that Caffe is working by doing: -``` +~~~ $ cd caffe $ ./build/tools/caffe time --gpu 0 --model examples/mnist/lenet.prototxt -``` +~~~ We have Caffe, Theano, Torch7, Keras and Lasagne pre-installed. Caffe python bindings are also available by default. We have CUDA 7.5 and CuDNN v3 installed. If you encounter any error such as -``` +~~~ Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered -``` +~~~ you might want to terminate your instance and start over again. I have observed this rarely, and I am not sure what causes this. diff --git a/classification.md b/classification.md index d0c902a2..d929a9e3 100644 --- a/classification.md +++ b/classification.md @@ -71,10 +71,10 @@ As our first approach, we will develop what we call a **Nearest Neighbor Classif Suppose now that we are given the CIFAR-10 training set of 50,000 images (5,000 images for every one of the labels), and we wish to label the remaining 10,000. The nearest neighbor classifier will take a test image, compare it to every single one of the training images, and predict the label of the closest training image. In the image above and on the right you can see an example result of such a procedure for 10 example test images. Notice that in only about 3 out of 10 examples an image of the same class is retrieved, while in the other 7 examples this is not the case. For example, in the 8th row the nearest training image to the horse head is a red car, presumably due to the strong black background. As a result, this image of a horse would in this case be mislabeled as a car. -You may have noticed that we left unspecified the details of exactly how we compare two images, which in this case are just two blocks of 32 x 32 x 3. One of the simplest possibilities is to compare the images pixel by pixel and add up all the differences. In other words, given two images and representing them as vectors \\( I\_1, I\_2 \\) , a reasonable choice for comparing them might be the **L1 distance**: +You may have noticed that we left unspecified the details of exactly how we compare two images, which in this case are just two blocks of 32 x 32 x 3. One of the simplest possibilities is to compare the images pixel by pixel and add up all the differences. In other words, given two images and representing them as vectors $$ I_1, I_2 $$ , a reasonable choice for comparing them might be the **L1 distance**: $$ -d\_1 (I\_1, I\_2) = \sum\_{p} \left| I^p\_1 - I^p\_2 \right| +d_1 (I_1, I_2) = \sum_{p} \left| I^p_1 - I^p_2 \right| $$ Where the sum is taken over all pixels. Here is the procedure visualized: @@ -86,27 +86,27 @@ Where the sum is taken over all pixels. Here is the procedure visualized: Let's also look at how we might implement the classifier in code. First, let's load the CIFAR-10 data into memory as 4 arrays: the training data/labels and the test data/labels. 
In the code below, `Xtr` (of size 50,000 x 32 x 32 x 3) holds all the images in the training set, and a corresponding 1-dimensional array `Ytr` (of length 50,000) holds the training labels (from 0 to 9): -```python +~~~python Xtr, Ytr, Xte, Yte = load_CIFAR10('data/cifar10/') # a magic function we provide # flatten out all images to be one-dimensional Xtr_rows = Xtr.reshape(Xtr.shape[0], 32 * 32 * 3) # Xtr_rows becomes 50000 x 3072 Xte_rows = Xte.reshape(Xte.shape[0], 32 * 32 * 3) # Xte_rows becomes 10000 x 3072 -``` +~~~ Now that we have all images stretched out as rows, here is how we could train and evaluate a classifier: -```python +~~~python nn = NearestNeighbor() # create a Nearest Neighbor classifier class nn.train(Xtr_rows, Ytr) # train the classifier on the training images and labels Yte_predict = nn.predict(Xte_rows) # predict labels on the test images # and now print the classification accuracy, which is the average number # of examples that are correctly predicted (i.e. label matches) print 'accuracy: %f' % ( np.mean(Yte_predict == Yte) ) -``` +~~~ Notice that as an evaluation criterion, it is common to use the **accuracy**, which measures the fraction of predictions that were correct. Notice that all classifiers we will build satisfy this one common API: they have a `train(X,y)` function that takes the data and the labels to learn from. Internally, the class should build some kind of model of the labels and how they can be predicted from the data. And then there is a `predict(X)` function, which takes new data and predicts the labels. Of course, we've left out the meat of things - the actual classifier itself. Here is an implementation of a simple Nearest Neighbor classifier with the L1 distance that satisfies this template: -```python +~~~python import numpy as np class NearestNeighbor(object): @@ -134,7 +134,7 @@ class NearestNeighbor(object): Ypred[i] = self.ytr[min_index] # predict the label of the nearest example return Ypred -``` +~~~ If you ran this code, you would see that this classifier only achieves **38.6%** on CIFAR-10. That's more impressive than guessing at random (which would give 10% accuracy since there are 10 classes), but nowhere near human performance (which is [estimated at about 94%](http://karpathy.github.io/2011/04/27/manually-classifying-cifar10/)) or near state-of-the-art Convolutional Neural Networks that achieve about 95%, matching human accuracy (see the [leaderboard](http://www.kaggle.com/c/cifar-10/leaderboard) of a recent Kaggle competition on CIFAR-10). @@ -142,14 +142,14 @@ If you ran this code, you would see that this classifier only achieves **38.6%** There are many other ways of computing distances between vectors. Another common choice could be to instead use the **L2 distance**, which has the geometric interpretation of computing the euclidean distance between two vectors. The distance takes the form: $$ -d\_2 (I\_1, I\_2) = \sqrt{\sum\_{p} \left( I^p\_1 - I^p\_2 \right)^2} +d_2 (I_1, I_2) = \sqrt{\sum_{p} \left( I^p_1 - I^p_2 \right)^2} $$ In other words we would be computing the pixelwise difference as before, but this time we square all of them, add them up and finally take the square root. In numpy, using the code from above we would need to only replace a single line of code. 
The line that computes the distances: -```python +~~~python distances = np.sqrt(np.sum(np.square(self.Xtr - X[i,:]), axis = 1)) -``` +~~~ Note that I included the `np.sqrt` call above, but in a practical nearest neighbor application we could leave out the square root operation because square root is a *monotonic function*. That is, it scales the absolute sizes of the distances but it preserves the ordering, so the nearest neighbors with or without it are identical. If you ran the Nearest Neighbor classifier on CIFAR-10 with this distance, you would obtain **35.4%** accuracy (slightly lower than our L1 distance result). @@ -180,7 +180,7 @@ Luckily, there is a correct way of tuning the hyperparameters and it does not to Here is what this might look like in the case of CIFAR-10: -```python +~~~python # assume we have Xtr_rows, Ytr, Xte_rows, Yte as before # recall Xtr_rows is 50,000 x 3072 matrix Xval_rows = Xtr_rows[:1000, :] # take first 1000 for validation @@ -202,7 +202,7 @@ for k in [1, 3, 5, 10, 20, 50, 100]: # keep track of what works on the validation set validation_accuracies.append((k, acc)) -``` +~~~ By the end of this procedure, we could plot a graph that shows which values of *k* work best. We would then stick with this value and evaluate once on the actual test set. diff --git a/convolutional-networks.md b/convolutional-networks.md index c7d41224..d08fbb36 100644 --- a/convolutional-networks.md +++ b/convolutional-networks.md @@ -26,6 +26,7 @@ Convolutional Neural Networks are very similar to ordinary Neural Networks from So what does change? ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduces the amount of parameters in the network. + ### Architecture Overview *Recall: Regular Neural Nets.* As we saw in the previous chapter, Neural Networks receive an input (a single vector), and transform it through a series of *hidden layers*. Each hidden layer is made up of a set of neurons, where each neuron is fully connected to all neurons in the previous layer, and where neurons in a single layer function completely independently and do not share any connections. The last fully-connected layer is called the "output layer" and in classification settings it represents the class scores. @@ -43,19 +44,20 @@ So what does change? ConvNet architectures make the explicit assumption that the > A ConvNet is made up of Layers. Every Layer has a simple API: It transforms an input 3D volume to an output 3D volume with some differentiable function that may or may not have parameters. -### Layers used to build ConvNets -As we described above, every layer of a ConvNet transforms one volume of activations to another through a differentiable function. We use three main types of layers to build ConvNet architectures: **Convolutional Layer**, **Pooling Layer**, and **Fully-Connected Layer** (exactly as seen in regular Neural Networks). We will stack these layers to form a full ConvNet **architecture**. +### Layers used to build ConvNets + +As we described above, every layer of a ConvNet transforms one volume of activations to another through a differentiable function. We use three main types of layers to build ConvNet architectures: **Convolutional Layer**, **Pooling Layer**, and **Fully-Connected Layer** (exactly as seen in regular Neural Networks). We will stack these layers to form a full ConvNet **architecture**. 
*Example Architecture: Overview*. We will go into more details below, but a simple ConvNet for CIFAR-10 classification could have the architecture [INPUT - CONV - RELU - POOL - FC]. In more detail: - INPUT [32x32x3] will hold the raw pixel values of the image, in this case an image of width 32, height 32, and with three color channels R,G,B. - CONV layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and the region they are connected to in the input volume. This may result in volume such as [32x32x12]. -- RELU layer will apply an elementwise activation function, such as the \\(max(0,x)\\) thresholding at zero. This leaves the size of the volume unchanged ([32x32x12]). +- RELU layer will apply an elementwise activation function, such as the $$max(0,x)$$ thresholding at zero. This leaves the size of the volume unchanged ([32x32x12]). - POOL layer will perform a downsampling operation along the spatial dimensions (width, height), resulting in volume such as [16x16x12]. - FC (i.e. fully-connected) layer will compute the class scores, resulting in volume of size [1x1x10], where each of the 10 numbers correspond to a class score, such as among the 10 categories of CIFAR-10. As with ordinary Neural Networks and as the name implies, each neuron in this layer will be connected to all the numbers in the previous volume. -In this way, ConvNets transform the original image layer by layer from the original pixel values to the final class scores. Note that some layers contain parameters and other don't. In particular, the CONV/FC layers perform transformations that are a function of not only the activations in the input volume, but also of the parameters (the weights and biases of the neurons). On the other hand, the RELU/POOL layers will implement a fixed function. The parameters in the CONV/FC layers will be trained with gradient descent so that the class scores that the ConvNet computes are consistent with the labels in the training set for each image. +In this way, ConvNets transform the original image layer by layer from the original pixel values to the final class scores. Note that some layers contain parameters and other don't. In particular, the CONV/FC layers perform transformations that are a function of not only the activations in the input volume, but also of the parameters (the weights and biases of the neurons). On the other hand, the RELU/POOL layers will implement a fixed function. The parameters in the CONV/FC layers will be trained with gradient descent so that the class scores that the ConvNet computes are consistent with the labels in the training set for each image. In summary: @@ -75,6 +77,7 @@ In summary: We now describe the individual layers and the details of their hyperparameters and their connectivities. + #### Convolutional Layer The Conv layer is the core building block of a Convolutional Network, and its output volume can be interpreted as holding neurons arranged in a 3D volume. We now discuss the details of the neuron connectivities, their arrangement in space, and their parameter sharing scheme. @@ -101,25 +104,25 @@ The Conv layer is the core building block of a Convolutional Network, and its ou 2. Second, we must specify the **stride** with which we allocate depth columns around the spatial dimensions (width and height). When the stride is 1, then we will allocate a new depth column of neurons to spatial positions only 1 spatial unit apart. 
This will lead to heavily overlapping receptive fields between the columns, and also to large output volumes. Conversely, if we use higher strides then the receptive fields will overlap less and the resulting output volume will have smaller dimensions spatially. 3. As we will soon see, sometimes it will be convenient to pad the input with zeros spatially on the border of the input volume. The size of this **zero-padding** is a hyperparameter. The nice feature of zero padding is that it will allow us to control the spatial size of the output volumes. In particular, we will sometimes want to exactly preserve the spatial size of the input volume. -We can compute the spatial size of the output volume as a function of the input volume size (\\(W\\)), the receptive field size of the Conv Layer neurons (\\(F\\)), the stride with which they are applied (\\(S\\)), and the amount of zero padding used (\\(P\\)) on the border. You can convince yourself that the correct formula for calculating how many neurons "fit" is given by \\((W - F + 2P)/S + 1\\). If this number is not an integer, then the strides are set incorrectly and the neurons cannot be tiled so that they "fit" across the input volume neatly, in a symmetric way. An example might help to get intuitions for this formula: +We can compute the spatial size of the output volume as a function of the input volume size ($$W$$), the receptive field size of the Conv Layer neurons ($$F$$), the stride with which they are applied ($$S$$), and the amount of zero padding used ($$P$$) on the border. You can convince yourself that the correct formula for calculating how many neurons "fit" is given by $$(W - F + 2P)/S + 1$$. If this number is not an integer, then the strides are set incorrectly and the neurons cannot be tiled so that they "fit" across the input volume neatly, in a symmetric way. An example might help to get intuitions for this formula:
- Illustration of spatial arrangement. In this example there is only one spatial dimension (x-axis), one neuron with a receptive field size of F = 3, the input size is W = 5, and there is zero padding of P = 1. Left: The neuron strided across the input in stride of S = 1, giving output of size (5 - 3 + 2)/1+1 = 5. Right: The neuron uses stride of S = 2, giving output of size (5 - 3 + 2)/2+1 = 3. Notice that stride S = 3 could not be used since it wouldn't fit neatly across the volume. In terms of the equation, this can be determined since (5 - 3 + 2) = 4 is not divisible by 3. + Illustration of spatial arrangement. In this example there is only one spatial dimension (x-axis), one neuron with a receptive field size of F = 3, the input size is W = 5, and there is zero padding of P = 1. Left: The neuron strided across the input in stride of S = 1, giving output of size (5 - 3 + 2)/1+1 = 5. Right: The neuron uses stride of S = 2, giving output of size (5 - 3 + 2)/2+1 = 3. Notice that stride S = 3 could not be used since it wouldn't fit neatly across the volume. In terms of the equation, this can be determined since (5 - 3 + 2) = 4 is not divisible by 3.
The neuron weights are in this example [1,0,-1] (shown on very right), and its bias is zero. These weights are shared across all yellow neurons (see parameter sharing below).
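The arithmetic in this example can be wrapped in a small helper (an illustrative sketch, not code from the notes), which also checks the divisibility condition discussed above:

~~~python
def conv_output_size(W, F, S, P):
  # number of neurons that "fit" along one spatial dimension
  assert (W - F + 2 * P) % S == 0, 'hyperparameters do not fit neatly'
  return (W - F + 2 * P) / S + 1

print conv_output_size(5, 3, 1, 1)  # 5, the stride-1 case on the left
print conv_output_size(5, 3, 2, 1)  # 3, the stride-2 case on the right
# conv_output_size(5, 3, 3, 1) would fail the assert: (5 - 3 + 2) = 4 is not divisible by 3
~~~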
-*Use of zero-padding*. In the example above on left, note that the input dimension was 5 and the output dimension was equal: also 5. This worked out so because our receptive fields were 3 and we used zero padding of 1. If there was no zero-padding used, then the output volume would have had spatial dimension of only 3, because that it is how many neurons would have "fit" across the original input. In general, setting zero padding to be \\(P = (F - 1)/2\\) when the stride is \\(S = 1\\) ensures that the input volume and output volume will have the same size spatially. It is very common to use zero-padding in this way and we will discuss the full reasons when we talk more about ConvNet architectures. +*Use of zero-padding*. In the example above on left, note that the input dimension was 5 and the output dimension was equal: also 5. This worked out so because our receptive fields were 3 and we used zero padding of 1. If there was no zero-padding used, then the output volume would have had spatial dimension of only 3, because that it is how many neurons would have "fit" across the original input. In general, setting zero padding to be $$P = (F - 1)/2$$ when the stride is $$S = 1$$ ensures that the input volume and output volume will have the same size spatially. It is very common to use zero-padding in this way and we will discuss the full reasons when we talk more about ConvNet architectures. -*Constraints on strides*. Note that the spatial arrangement hyperparameters have mutual constraints. For example, when the input has size \\(W = 10\\), no zero-padding is used \\(P = 0\\), and the filter size is \\(F = 3\\), then it would be impossible to use stride \\(S = 2\\), since \\((W - F + 2P)/S + 1 = (10 - 3 + 0) / 2 + 1 = 4.5\\), i.e. not an integer, indicating that the neurons don't "fit" neatly and symmetrically across the input. Therefore, this setting of the hyperparameters is considered to be invalid, and a ConvNet library would likely throw an exception. As we will see in the ConvNet architectures section, sizing the ConvNets appropriately so that all the dimensions "work out" can be a real headache, which the use of zero-padding and some design guidelines will significantly alleviate. +*Constraints on strides*. Note that the spatial arrangement hyperparameters have mutual constraints. For example, when the input has size $$W = 10$$, no zero-padding is used $$P = 0$$, and the filter size is $$F = 3$$, then it would be impossible to use stride $$S = 2$$, since $$(W - F + 2P)/S + 1 = (10 - 3 + 0) / 2 + 1 = 4.5$$, i.e. not an integer, indicating that the neurons don't "fit" neatly and symmetrically across the input. Therefore, this setting of the hyperparameters is considered to be invalid, and a ConvNet library would likely throw an exception. As we will see in the ConvNet architectures section, sizing the ConvNets appropriately so that all the dimensions "work out" can be a real headache, which the use of zero-padding and some design guidelines will significantly alleviate. -*Real-world example*. The [Krizhevsky et al.](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks) architecture that won the ImageNet challenge in 2012 accepted images of size [227x227x3]. On the first Convolutional Layer, it used neurons with receptive field size \\(F = 11\\), stride \\(S = 4\\) and no zero padding \\(P = 0\\). Since (227 - 11)/4 + 1 = 55, and since the Conv layer had a depth of \\(K = 96\\), the Conv layer output volume had size [55x55x96]. 
Each of the 55\*55\*96 neurons in this volume was connected to a region of size [11x11x3] in the input volume. Moreover, all 96 neurons in each depth column are connected to the same [11x11x3] region of the input, but of course with different weights. +*Real-world example*. The [Krizhevsky et al.](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks) architecture that won the ImageNet challenge in 2012 accepted images of size [227x227x3]. On the first Convolutional Layer, it used neurons with receptive field size $$F = 11$$, stride $$S = 4$$ and no zero padding $$P = 0$$. Since (227 - 11)/4 + 1 = 55, and since the Conv layer had a depth of $$K = 96$$, the Conv layer output volume had size [55x55x96]. Each of the 55\*55\*96 neurons in this volume was connected to a region of size [11x11x3] in the input volume. Moreover, all 96 neurons in each depth column are connected to the same [11x11x3] region of the input, but of course with different weights. **Parameter Sharing.** Parameter sharing scheme is used in Convolutional Layers to control the number of parameters. Using the real-world example above, we see that there are 55\*55\*96 = 290,400 neurons in the first Conv Layer, and each has 11\*11\*3 = 363 weights and 1 bias. Together, this adds up to 290400 * 364 = 105,705,600 parameters on the first layer of the ConvNet alone. Clearly, this number is very high. -It turns out that we can dramatically reduce the number of parameters by making one reasonable assumption: That if one patch feature is useful to compute at some spatial position (x,y), then it should also be useful to compute at a different position (x2,y2). In other words, denoting a single 2-dimensional slice of depth as a **depth slice** (e.g. a volume of size [55x55x96] has 96 depth slices, each of size [55x55]), we are going to constrain the neurons in each depth slice to use the same weights and bias. With this parameter sharing scheme, the first Conv Layer in our example would now have only 96 unique set of weights (one for each depth slice), for a total of 96\*11\*11\*3 = 34,848 unique weights, or 34,944 parameters (+96 biases). Alternatively, all 55*55 neurons in each depth slice will now be using the same parameters. In practice during backpropagation, every neuron in the volume will compute the gradient for its weights, but these gradients will be added up across each depth slice and only update a single set of weights per slice. +It turns out that we can dramatically reduce the number of parameters by making one reasonable assumption: That if one patch feature is useful to compute at some spatial position (x,y), then it should also be useful to compute at a different position (x2,y2). In other words, denoting a single 2-dimensional slice of depth as a **depth slice** (e.g. a volume of size [55x55x96] has 96 depth slices, each of size [55x55]), we are going to constrain the neurons in each depth slice to use the same weights and bias. With this parameter sharing scheme, the first Conv Layer in our example would now have only 96 unique set of weights (one for each depth slice), for a total of 96\*11\*11\*3 = 34,848 unique weights, or 34,944 parameters (+96 biases). Alternatively, all 55*55 neurons in each depth slice will now be using the same parameters. In practice during backpropagation, every neuron in the volume will compute the gradient for its weights, but these gradients will be added up across each depth slice and only update a single set of weights per slice. 
Notice that if all neurons in a single depth slice are using the same weight vector, then the forward pass of the CONV layer can in each depth slice be computed as a **convolution** of the neuron's weights with the input volume (hence the name: Convolutional Layer). Therefore, it is common to refer to the sets of weights as a **filter** (or a **kernel**), which is convolved with the input. The result of this convolution is an *activation map* (e.g. of size [55x55]), and the set of activation maps for each different filter are stacked together along the depth dimension to produce the output volume (e.g. [55x55x96]).

@@ -135,9 +138,9 @@ Note that sometimes the parameter sharing assumption may not make sense. This is

**Numpy examples.** To make the discussion above more concrete, let's express the same ideas but in code and with a specific example. Suppose that the input volume is a numpy array `X`. Then:

- A *depth column* at position `(x,y)` would be the activations `X[x,y,:]`.
-- A *depth slice*, or equivalently an *activation map* at depth `d` would be the activations `X[:,:,d]`. 
+- A *depth slice*, or equivalently an *activation map* at depth `d` would be the activations `X[:,:,d]`.

-*Conv Layer Example*. Suppose that the input volume `X` has shape `X.shape: (11,11,4)`. Suppose further that we use no zero padding (\\(P = 0\\)), that the filter size is \\(F = 5\\), and that the stride is \\(S = 2\\). The output volume would therefore have spatial size (11-5)/2+1 = 4, giving a volume with width and height of 4. The activation map in the output volume (call it `V`), would then look as follows (only some of the elements are computed in this example):
+*Conv Layer Example*. Suppose that the input volume `X` has shape `X.shape: (11,11,4)`. Suppose further that we use no zero padding ($$P = 0$$), that the filter size is $$F = 5$$, and that the stride is $$S = 2$$. The output volume would therefore have spatial size (11-5)/2+1 = 4, giving a volume with width and height of 4. The activation map in the output volume (call it `V`) would then look as follows (only some of the elements are computed in this example):

- `V[0,0,0] = np.sum(X[:5,:5,:] * W0) + b0`
- `V[1,0,0] = np.sum(X[2:7,:5,:] * W0) + b0`

@@ -157,22 +160,22 @@ where we see that we are indexing into the second depth dimension in `V` (at ind

**Summary**. To summarize, the Conv Layer:

-- Accepts a volume of size \\(W\_1 \times H\_1 \times D\_1\\)
-- Requires four hyperparameters: 
-  - Number of filters \\(K\\), 
-  - their spatial extent \\(F\\), 
-  - the stride \\(S\\), 
-  - the amount of zero padding \\(P\\).
-- Produces a volume of size \\(W\_2 \times H\_2 \times D\_2\\) where: 
-  - \\(W\_2 = (W\_1 - F + 2P)/S + 1\\)
-  - \\(H\_2 = (H\_1 - F + 2P)/S + 1\\) (i.e. width and height are computed equally by symmetry)
-  - \\(D\_2 = K\\)
-- With parameter sharing, it introduces \\(F \cdot F \cdot D\_1\\) weights per filter, for a total of \\((F \cdot F \cdot D\_1) \cdot K\\) weights and \\(K\\) biases.
-- In the output volume, the \\(d\\)-th depth slice (of size \\(W\_2 \times H\_2\\)) is the result of performing a valid convolution of the \\(d\\)-th filter over the input volume with a stride of \\(S\\), and then offset by \\(d\\)-th bias.
+- Accepts a volume of size $$W_1 \times H_1 \times D_1$$
+- Requires four hyperparameters: 
+  - Number of filters $$K$$, 
+  - their spatial extent $$F$$, 
+  - the stride $$S$$, 
+  - the amount of zero padding $$P$$. 
+- Produces a volume of size $$W_2 \times H_2 \times D_2$$ where:
+  - $$W_2 = (W_1 - F + 2P)/S + 1$$
+  - $$H_2 = (H_1 - F + 2P)/S + 1$$ (i.e. width and height are computed equally by symmetry)
+  - $$D_2 = K$$
+- With parameter sharing, it introduces $$F \cdot F \cdot D_1$$ weights per filter, for a total of $$(F \cdot F \cdot D_1) \cdot K$$ weights and $$K$$ biases.
+- In the output volume, the $$d$$-th depth slice (of size $$W_2 \times H_2$$) is the result of performing a valid convolution of the $$d$$-th filter over the input volume with a stride of $$S$$, and then offset by the $$d$$-th bias.

-A common setting of the hyperparameters is \\(F = 3, S = 1, P = 1\\). However, there are common conventions and rules of thumb that motivate these hyperparameters. See the [ConvNet architectures](#architectures) section below.
+A common setting of the hyperparameters is $$F = 3, S = 1, P = 1$$. However, there are common conventions and rules of thumb that motivate these hyperparameters. See the [ConvNet architectures](#architectures) section below.

-**Convolution Demo**. Below is a running demo of a CONV layer. Since 3D volumes are hard to visualize, all the volumes (the input volume (in blue), the weight volumes (in red), the output volume (in green)) are visualized with each depth slice stacked in rows. The input volume is of size \\(W\_1 = 5, H\_1 = 5, D\_1 = 3\\), and the CONV layer parameters are \\(K = 2, F = 3, S = 2, P = 1\\). That is, we have two filters of size \\(3 \times 3\\), and they are applied with a stride of 2. Therefore, the output volume size has spatial size (5 - 3 + 2)/2 + 1 = 3. Moreover, notice that a padding of \\(P = 1\\) is applied to the input volume, making the outer border of the input volume zero. The visualization below iterates over the output activations (green), and shows that each element is computed by elementwise multiplying the highlighted input (blue) with the filter (red), summing it up, and then offsetting the result by the bias.
+**Convolution Demo**. Below is a running demo of a CONV layer. Since 3D volumes are hard to visualize, all the volumes (the input volume (in blue), the weight volumes (in red), the output volume (in green)) are visualized with each depth slice stacked in rows. The input volume is of size $$W_1 = 5, H_1 = 5, D_1 = 3$$, and the CONV layer parameters are $$K = 2, F = 3, S = 2, P = 1$$. That is, we have two filters of size $$3 \times 3$$, and they are applied with a stride of 2. Therefore, the output volume has spatial size (5 - 3 + 2)/2 + 1 = 3. Moreover, notice that a padding of $$P = 1$$ is applied to the input volume, making the outer border of the input volume zero. The visualization below iterates over the output activations (green), and shows that each element is computed by elementwise multiplying the highlighted input (blue) with the filter (red), summing it up, and then offsetting the result by the bias.
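The arithmetic of the demo can also be written out directly. The following is a small numpy sketch (ours, not part of the original notes; `X`, `W`, `b` and `V` are made-up names) of the same setting, a 5x5x3 input with two 3x3 filters, stride 2 and zero-padding 1, producing a 3x3x2 output:

~~~
import numpy as np

X = np.random.randn(5, 5, 3)                  # input volume: W1=5, H1=5, D1=3
W = np.random.randn(2, 3, 3, 3)               # K=2 filters, each of size F=3 over 3 channels
b = np.random.randn(2)                        # one bias per filter
F, S, P = 3, 2, 1

Xpad = np.pad(X, ((P, P), (P, P), (0, 0)), mode='constant')  # zero border
out = (5 - F + 2 * P) // S + 1                # (5 - 3 + 2)/2 + 1 = 3
V = np.zeros((out, out, 2))
for k in range(2):                            # each filter produces one depth slice
    for i in range(out):
        for j in range(out):
            patch = Xpad[i*S:i*S+F, j*S:j*S+F, :]
            V[i, j, k] = np.sum(patch * W[k]) + b[k]
~~~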
@@ -183,7 +186,7 @@

1. The local regions in the input image are stretched out into columns in an operation commonly called **im2col**. For example, if the input is [227x227x3] and it is to be convolved with 11x11x3 filters at stride 4, then we would take [11x11x3] blocks of pixels in the input and stretch each block into a column vector of size 11\*11\*3 = 363. Iterating this process in the input at a stride of 4 gives (227-11)/4+1 = 55 locations along both width and height, leading to an output matrix `X_col` of *im2col* of size [363 x 3025], where every column is a stretched out receptive field and there are 55*55 = 3025 of them in total. Note that since the receptive fields overlap, every number in the input volume may be duplicated in multiple distinct columns.
2. The weights of the CONV layer are similarly stretched out into rows. For example, if there are 96 filters of size [11x11x3] this would give a matrix `W_row` of size [96 x 363].
-3. The result of a convolution is now equivalent to performing one large matrix multiply `np.dot(W_row, X_col)`, which evaluates the dot product between every filter and every receptive field location. In our example, the output of this operation would be [96 x 3025], giving the output of the dot product of each filter at each location. 
+3. The result of a convolution is now equivalent to performing one large matrix multiply `np.dot(W_row, X_col)`, which evaluates the dot product between every filter and every receptive field location. In our example, the output of this operation would be [96 x 3025], giving the output of the dot product of each filter at each location.
4. The result must finally be reshaped back to its proper output dimension [55x55x96].

This approach has the downside that it can use a lot of memory, since some values in the input volume are replicated multiple times in `X_col`. However, the benefit is that there are many very efficient implementations of Matrix Multiplication that we can take advantage of (for example, in the commonly used [BLAS](http://www.netlib.org/blas/) API). Moreover, the same *im2col* idea can be reused to perform the pooling operation, which we discuss next.

@@ -195,18 +198,18 @@ This approach has the downside that it can use a lot of memory, since some value

It is common to periodically insert a Pooling layer in-between successive Conv layers in a ConvNet architecture. Its function is to progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the network, and hence to also control overfitting. The Pooling Layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation. The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations. Every MAX operation would in this case be taking a max over 4 numbers (a little 2x2 region in some depth slice). The depth dimension remains unchanged.
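As a quick illustration of this 2x2, stride-2 max pooling, here is a minimal numpy sketch (ours, not part of the original notes; names are made up). Each depth slice is downsampled independently while the depth stays unchanged, as the more general description below also states:

~~~
import numpy as np

X = np.random.randn(8, 8, 64)                 # input volume
F, S = 2, 2                                   # 2x2 receptive field, stride 2
H2 = (8 - F) // S + 1                         # (8 - 2)/2 + 1 = 4
out = np.zeros((H2, H2, X.shape[2]))
for i in range(H2):
    for j in range(H2):
        # max over each little 2x2 region, taken per depth slice
        out[i, j, :] = X[i*S:i*S+F, j*S:j*S+F, :].max(axis=(0, 1))
~~~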
More generally, the pooling layer:

-- Accepts a volume of size \\(W\_1 \times H\_1 \times D\_1\\)
-- Requires three hyperparameters: 
-  - their spatial extent \\(F\\), 
-  - the stride \\(S\\), 
-- Produces a volume of size \\(W\_2 \times H\_2 \times D\_2\\) where: 
-  - \\(W\_2 = (W\_1 - F)/S + 1\\)
-  - \\(H\_2 = (H\_1 - F)/S + 1\\)
-  - \\(D\_2 = D\_1\\)
+- Accepts a volume of size $$W_1 \times H_1 \times D_1$$
+- Requires two hyperparameters: 
+  - their spatial extent $$F$$, 
+  - the stride $$S$$, 
+- Produces a volume of size $$W_2 \times H_2 \times D_2$$ where: 
+  - $$W_2 = (W_1 - F)/S + 1$$
+  - $$H_2 = (H_1 - F)/S + 1$$
+  - $$D_2 = D_1$$
- Introduces zero parameters since it computes a fixed function of the input
- Note that it is not common to use zero-padding for Pooling layers

-It is worth noting that there are only two commonly seen variations of the max pooling layer found in practice: A pooling layer with \\(F = 3, S = 2\\) (also called overlapping pooling), and more commonly \\(F = 2, S = 2\\). Pooling sizes with larger receptive fields are too destructive.
+It is worth noting that there are only two commonly seen variations of the max pooling layer found in practice: a pooling layer with $$F = 3, S = 2$$ (also called overlapping pooling), and more commonly $$F = 2, S = 2$$. Pooling sizes with larger receptive fields are too destructive.

**General pooling**. In addition to max pooling, the pooling units can also perform other functions, such as *average pooling* or even *L2-norm pooling*. Average pooling was often used historically but has recently fallen out of favor compared to the max pooling operation, which has been shown to work better in practice.

@@ -238,20 +241,20 @@ Many types of normalization layers have been proposed for use in ConvNet archite

Neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks. Their activations can hence be computed with a matrix multiplication followed by a bias offset. See the *Neural Network* section of the notes for more information.

-#### Converting FC layers to CONV layers 
+#### Converting FC layers to CONV layers

It is worth noting that the only difference between FC and CONV layers is that the neurons in the CONV layer are connected only to a local region in the input, and that many of the neurons in a CONV volume share parameters. However, the neurons in both layers still compute dot products, so their functional form is identical. Therefore, it turns out that it's possible to convert between FC and CONV layers:

- For any CONV layer there is an FC layer that implements the same forward function. The weight matrix would be a large matrix that is mostly zero except for at certain blocks (due to local connectivity) where the weights in many of the blocks are equal (due to parameter sharing).
- Conversely, any FC layer can be converted to a CONV layer.
For example, an FC layer with $$K = 4096$$ that is looking at some input volume of size $$7 \times 7 \times 512$$ can be equivalently expressed as a CONV layer with $$F = 7, P = 0, S = 1, K = 4096$$. In other words, we are setting the filter size to be exactly the size of the input volume, and hence the output will simply be $$1 \times 1 \times 4096$$ since only a single depth column "fits" across the input volume, giving an identical result to the initial FC layer.

**FC->CONV conversion**. Of these two conversions, the ability to convert an FC layer to a CONV layer is particularly useful in practice. Consider a ConvNet architecture that takes a 224x224x3 image, and then uses a series of CONV layers and POOL layers to reduce the image to an activations volume of size 7x7x512 (in an *AlexNet* architecture that we'll see later, this is done by use of 5 pooling layers that downsample the input spatially by a factor of two each time, making the final spatial size 224/2/2/2/2/2 = 7). From there, an AlexNet uses two FC layers of size 4096 and finally the last FC layer with 1000 neurons that compute the class scores. We can convert each of these three FC layers to CONV layers as described above:

-- Replace the first FC layer that looks at [7x7x512] volume with a CONV layer that uses filter size \\(F = 7\\), giving output volume [1x1x4096].
-- Replace the second FC layer with a CONV layer that uses filter size \\(F = 1\\), giving output volume [1x1x4096]
-- Replace the last FC layer similarly, with \\(F=1\\), giving final output [1x1x1000]
+- Replace the first FC layer that looks at the [7x7x512] volume with a CONV layer that uses filter size $$F = 7$$, giving output volume [1x1x4096].
+- Replace the second FC layer with a CONV layer that uses filter size $$F = 1$$, giving output volume [1x1x4096].
+- Replace the last FC layer similarly, with $$F=1$$, giving final output [1x1x1000].

-Each of these conversions could in practice involve manipulating (e.g. reshaping) the weight matrix \\(W\\) in each FC layer into CONV layer filters. It turns out that this conversion allows us to "slide" the original ConvNet very efficiently across many spatial positions in a larger image, in a single forward pass.
+Each of these conversions could in practice involve manipulating (e.g. reshaping) the weight matrix $$W$$ in each FC layer into CONV layer filters. It turns out that this conversion allows us to "slide" the original ConvNet very efficiently across many spatial positions in a larger image, in a single forward pass.

For example, if a 224x224 image gives a volume of size [7x7x512] - i.e. a reduction by 32, then forwarding an image of size 384x384 through the converted architecture would give the equivalent volume of size [12x12x512], since 384/32 = 12. Following through with the next 3 CONV layers that we just converted from FC layers would now give the final volume of size [6x6x1000], since (12 - 7)/1 + 1 = 6. Note that instead of a single vector of class scores of size [1x1x1000], we're now getting an entire 6x6 array of class scores across the 384x384 image.

@@ -266,7 +269,7 @@ Lastly, what if we wanted to efficiently apply the original ConvNet over the ima

### ConvNet Architectures

-We have seen that Convolutional Networks are commonly made up of only three layer types: CONV, POOL (we assume Max pool unless stated otherwise) and FC (short for fully-connected). We will also explicitly write the RELU activation function as a layer, which applies elementwise non-linearity.
In this section we discuss how these are commonly stacked together to form entire ConvNets.
+We have seen that Convolutional Networks are commonly made up of only three layer types: CONV, POOL (we assume Max pool unless stated otherwise) and FC (short for fully-connected). We will also explicitly write the RELU activation function as a layer, which applies an elementwise non-linearity. In this section we discuss how these are commonly stacked together to form entire ConvNets.

#### Layer Patterns

@@ -281,7 +284,7 @@ where the `*` indicates repetition, and the `POOL?` indicates an optional poolin

- `INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC`. Here we see that there is a single CONV layer between every POOL layer.
- `INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL]*3 -> [FC -> RELU]*2 -> FC`. Here we see two CONV layers stacked before every POOL layer. This is generally a good idea for larger and deeper networks, because multiple stacked CONV layers can develop more complex features of the input volume before the destructive pooling operation.

-*Prefer a stack of small filter CONV to one large receptive field CONV layer*. Suppose that you stack three 3x3 CONV layers on top of each other (with non-linearities in between, of course). In this arrangement, each neuron on the first CONV layer has a 3x3 view of the input volume. A neuron on the second CONV layer has a 3x3 view of the first CONV layer, and hence by extension a 5x5 view of the input volume. Similarly, a neuron on the third CONV layer has a 3x3 view of the 2nd CONV layer, and hence a 7x7 view of the input volume. Suppose that instead of these three layers of 3x3 CONV, we only wanted to use a single CONV layer with 7x7 receptive fields. These neurons would have a receptive field size of the input volume that is identical in spatial extent (7x7), but with several disadvantages.
First, the neurons would be computing a linear function over the input, while the three stacks of CONV layers contain non-linearities that make their features more expressive. Second, if we suppose that all the volumes have $$C$$ channels, then it can be seen that the single 7x7 CONV layer would contain $$C \times (7 \times 7 \times C) = 49 C^2$$ parameters, while the three 3x3 CONV layers would only contain $$3 \times (C \times (3 \times 3 \times C)) = 27 C^2$$ parameters. Intuitively, stacking CONV layers with tiny filters as opposed to having one CONV layer with big filters allows us to express more powerful features of the input, and with fewer parameters. As a practical disadvantage, we might need more memory to hold all the intermediate CONV layer results if we plan to do backpropagation.

#### Layer Sizing Patterns

@@ -290,9 +293,9 @@ Until now we've omitted mentions of common hyperparameters used in each of the l

The **input layer** (that contains the image) should be divisible by 2 many times. Common numbers include 32 (e.g. CIFAR-10), 64, 96 (e.g. STL-10), or 224 (e.g. common ImageNet ConvNets), 384, and 512.

The **conv layers** should be using small filters (e.g. 3x3 or at most 5x5), using a stride of $$S = 1$$, and crucially, padding the input volume with zeros in such a way that the conv layer does not alter the spatial dimensions of the input. That is, when $$F = 3$$, then using $$P = 1$$ will retain the original size of the input. When $$F = 5$$, $$P = 2$$. For a general $$F$$, it can be seen that $$P = (F - 1) / 2$$ preserves the input size. If you must use bigger filter sizes (such as 7x7 or so), it is only common to see this on the very first conv layer that is looking at the input image.

The **pool layers** are in charge of downsampling the spatial dimensions of the input. The most common setting is to use max-pooling with 2x2 receptive fields (i.e. $$F = 2$$), and with a stride of 2 (i.e. $$S = 2$$). Note that this discards exactly 75% of the activations in an input volume (due to downsampling by 2 in both width and height). Another slightly less common setting is to use 3x3 receptive fields with a stride of 2.
It is very uncommon to see receptive field sizes for max pooling that are larger than 3 because the pooling is then too lossy and aggressive. This usually leads to worse performance.

*Reducing sizing headaches.* The scheme presented above is pleasing because all the CONV layers preserve the spatial size of their input, while the POOL layers alone are in charge of down-sampling the volumes spatially. In an alternative scheme where we use strides greater than 1 or don't zero-pad the input in CONV layers, we would have to very carefully keep track of the input volumes throughout the CNN architecture and make sure that all strides and filters "work out", and that the ConvNet architecture is nicely and symmetrically wired.

@@ -317,7 +320,7 @@ There are several architectures in the field of Convolutional Networks that have

**VGGNet in detail**. Let's break down the [VGGNet](http://www.robots.ox.ac.uk/~vgg/research/very_deep/) in more detail. The whole VGGNet is composed of CONV layers that perform 3x3 convolutions with stride 1 and pad 1, and of POOL layers that perform 2x2 max pooling with stride 2 (and no padding). We can write out the size of the representation at each step of the processing and keep track of both the representation size and the total number of weights:

-```
+~~~
INPUT: [224x224x3] memory: 224*224*3=150K weights: 0
CONV3-64: [224x224x64] memory: 224*224*64=3.2M weights: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M weights: (3*3*64)*64 = 36,864
@@ -343,12 +346,13 @@ FC: [1x1x1000] memory: 1000 weights: 4096*1000 = 4,096,000

TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters
-```
+~~~

As is common with Convolutional Networks, notice that most of the memory is used in the early CONV layers, and that most of the parameters are in the last FC layers. In this particular case, the first FC layer contains 100M weights, out of a total of 138M.


#### Computational Considerations

The largest bottleneck to be aware of when constructing ConvNet architectures is the memory bottleneck. Many modern GPUs have a limit of 3/4/6GB memory, with the best GPUs having about 12GB of memory. There are three major sources of memory to keep track of:

@@ -364,6 +368,7 @@ Once you have a rough estimate of the total number of values (for activations, g

In the [next section](../understanding-cnn/) of these notes we look at visualizing and understanding Convolutional Neural Networks.


### Additional Resources

Additional resources related to implementation:

diff --git a/glossary.md b/glossary.md
index df433504..621779fc 100644
--- a/glossary.md
+++ b/glossary.md
@@ -8,27 +8,86 @@ permalink: /glossary/

Markdown tables do not seem to render properly here, so for now the table is laid out in plain HTML. If anyone can think of a cleaner approach, please open an issue or send a PR.

<table>
<tr><th>English</th><th>한글</th></tr>
<tr><td>Image</td><td>영상, 이미지 (혼용)</td></tr>
<tr><td>Neural network</td><td>신경망, 뉴럴 네트워크</td></tr>
<tr><td>Activation function</td><td>활성 함수</td></tr>
<tr><td>node</td><td>노드</td></tr>
<tr><td>Nearest neighbor</td><td>(영어 그대로)</td></tr>
<tr><td>Backpropagation</td><td>(영어 그대로)</td></tr>
<tr><td>Chain rule</td><td>연쇄 법칙</td></tr>
<tr><td>Classification</td><td>분류</td></tr>
<tr><td>Convolutional neural network</td><td>컨볼루션 신경망</td></tr>
<tr><td>Regression</td><td>회귀</td></tr>
</table>
<table>
<tr><th>English</th><th>한글</th></tr>
<tr><td>Accuracy</td><td>정확도, 성능</td></tr>
<tr><td>Activation function</td><td>활성 함수</td></tr>
<tr><td>Architecture</td><td>구조</td></tr>
<tr><td>Backpropagation</td><td>(영어 그대로)</td></tr>
<tr><td>Batch</td><td>배치</td></tr>
<tr><td>Batch normalization</td><td>배치 정규화</td></tr>
<tr><td>Bias</td><td></td></tr>
<tr><td>Binary</td><td>이진</td></tr>
<tr><td>Chain rule</td><td>연쇄 법칙</td></tr>
<tr><td>Class</td><td>클래스</td></tr>
<tr><td>Classification</td><td>분류</td></tr>
<tr><td>Classifier</td><td>분류기</td></tr>
<tr><td>Column vector</td><td>열 벡터</td></tr>
<tr><td>Convolution</td><td>컨볼루션</td></tr>
<tr><td>Convolutional neural network</td><td>컨볼루션 신경망</td></tr>
<tr><td>Covariance</td><td>공분산</td></tr>
<tr><td>Cross entropy</td><td></td></tr>
<tr><td>Cross validation</td><td></td></tr>
<tr><td>Depth</td><td>깊이</td></tr>
<tr><td>Derivative</td><td>미분값, 도함수</td></tr>
<tr><td>Dropout</td><td>(영어 그대로)</td></tr>
<tr><td>Error</td><td>에러, 오차</td></tr>
<tr><td>Evaluate</td><td>평가하다</td></tr>
<tr><td>Feature</td><td>특징, 표현(?)</td></tr>
<tr><td>Filter</td><td>필터</td></tr>
<tr><td>Forward propagation</td><td></td></tr>
<tr><td>Fully-connected</td><td></td></tr>
<tr><td>Gate</td><td>게이트</td></tr>
<tr><td>Gradient</td><td>그라디언트</td></tr>
<tr><td>GRU</td><td>(영어 그대로)</td></tr>
<tr><td>Hyperparameter</td><td></td></tr>
<tr><td>Image</td><td>이미지</td></tr>
<tr><td>Initialization</td><td>초기화</td></tr>
<tr><td>Iteration</td><td>반복</td></tr>
<tr><td>Label</td><td>라벨</td></tr>
<tr><td>Layer</td><td>레이어(?)</td></tr>
<tr><td>Learning</td><td>러닝, 학습</td></tr>
<tr><td>Loop</td><td>루프</td></tr>
<tr><td>Loss</td><td>?</td></tr>
<tr><td>LSTM</td><td>(영어 그대로)</td></tr>
<tr><td>Matrix</td><td>행렬</td></tr>
<tr><td>Nearest neighbor</td><td>(영어 그대로)</td></tr>
<tr><td>Network</td><td>네트워크</td></tr>
<tr><td>Neural network</td><td>신경망, 뉴럴 네트워크</td></tr>
<tr><td>Neuron</td><td>뉴런</td></tr>
<tr><td>Node</td><td>노드</td></tr>
<tr><td>Non-linearity</td><td>비선형~</td></tr>
<tr><td>Optimization</td><td>최적화</td></tr>
<tr><td>Overfitting</td><td></td></tr>
<tr><td>Padding</td><td></td></tr>
<tr><td>Parameter</td><td>파라미터</td></tr>
<tr><td>Performance</td><td>성능</td></tr>
<tr><td>Pooling</td><td>풀링</td></tr>
<tr><td>Preprocessing</td><td>전처리</td></tr>
<tr><td>Receptive Field</td><td></td></tr>
<tr><td>Regression</td><td>회귀</td></tr>
<tr><td>Regularization</td><td></td></tr>
<tr><td>ReLU</td><td></td></tr>
<tr><td>Representation</td><td>표현</td></tr>
<tr><td>Recurrent neural network (RNN)</td><td>회귀신경망(?)</td></tr>
<tr><td>Row vector</td><td>행 벡터</td></tr>
<tr><td>Score</td><td>스코어, 점수</td></tr>
<tr><td>Sigmoid</td><td></td></tr>
<tr><td>Softmax</td><td></td></tr>
<tr><td>Spatial</td><td></td></tr>
<tr><td>Training</td><td>학습, 트레이닝</td></tr>
<tr><td>Validation</td><td></td></tr>
<tr><td>Variable</td><td>변수</td></tr>
<tr><td>Visualization</td><td>시각화</td></tr>
<tr><td>Weights</td><td>파라미터 값</td></tr>
</table>
00:00:04,129
+trust us
+
+2
+00:00:04,129 --> 00:00:12,109
+ok that works ok good we'll get started
+soon so today we'll be talking about
+
+3
+00:00:12,109 --> 00:00:15,199
+recurrent neural networks which is one
+of my favorite topics one of my favorite
+
+4
+00:00:15,199 --> 00:00:18,960
+models to play with you can put them into neural
+networks just everywhere they're a lot of
+
+5
+00:00:18,960 --> 00:00:23,009
+fun to play with in terms of
+administrative items recall that
+
+6
+00:00:23,009 --> 00:00:26,089
+your midterm is on Wednesday this
+Wednesday you can tell that I'm really
+
+7
+00:00:26,089 --> 00:00:32,738
+excited I don't know if you guys are excited
+I'm very excited so assignment 3 will
+
+8
+00:00:32,738 --> 00:00:37,979
+be out due this Wednesday so it
+will be out on Wednesday it's due two
+
+9
+00:00:37,979 --> 00:00:40,429
+weeks from now on Monday but I think
+since we're shifting it I think to
+
+10
+00:00:40,429 --> 00:00:43,399
+Wednesday we planned to have released it
+today but we're gonna be shipping it
+
+11
+00:00:43,399 --> 00:00:47,129
+roughly Wednesday so we'll probably shift the
+first deadline for a few days and
+
+12
+00:00:47,130 --> 00:00:51,179
+assignment 2 if I'm not mistaken was due on
+Friday so if you're using 3 late days then
+
+13
+00:00:51,179 --> 00:00:55,119
+you'd be handing it in today hopefully
+not too many of you are doing that are
+
+14
+00:00:55,119 --> 00:01:01,089
+people done with assignment 2 are many people
+done okay most of you looking great for
+
+15
+00:01:01,090 --> 00:01:04,549
+doing well so currently in the class
+we're talking about convolutional neural
+
+16
+00:01:04,549 --> 00:01:07,820
+networks last class specifically we
+looked at visualizing and understanding
+
+17
+00:01:07,819 --> 00:01:11,618
+convolutional neural networks so we looked
+at a whole bunch of pretty pictures and
+
+18
+00:01:11,618 --> 00:01:14,938
+videos so we had a lot of fun trying to
+interpret exactly what these convolutional
+
+19
+00:01:14,938 --> 00:01:17,828
+networks are doing what they're
+learning how they're working and so on
+
+20
+00:01:17,828 --> 00:01:24,188
+and so we debugged this through several
+ways that you may recall from last
+
+21
+00:01:24,188 --> 00:01:27,408
+lecture actually over the weekend I
+stumbled by some other visualizations
+
+22
+00:01:27,409 --> 00:01:32,569
+that are new I found these on Twitter and
+they look really cool and I'm not sure
+
+23
+00:01:32,569 --> 00:01:37,118
+how people made these because
+there's not too much description to it
+
+24
+00:01:37,118 --> 00:01:43,099
+but it looks like this is a turtle a tarantula
+and then this is a chain and some kind
+
+25
+00:01:43,099 --> 00:01:47,468
+of a dog and so the way you do this I
+think it's something like doing the
+
+26
+00:01:47,468 --> 00:01:50,509
+optimization into images again but
+they're using a different regularizer on
+
+27
+00:01:50,509 --> 00:01:53,679
+the image in this case I think they're
+using a bilateral filter which is this
+
+28
+00:01:53,679 --> 00:01:57,049
+kind of a fancy filter so if you put
+that regularization on the image then my
+
+29
+00:01:57,049 --> 00:01:59,420
+impression is that these are the kinds
+of visualizations that you achieve
+
+30
+00:01:59,420 --> 00:02:03,659
+instead so that looks pretty cool but I
+am not sure exactly what's going on I
+
+31
+00:02:03,659 --> 00:02:04,549
+guess we'll find out soon
+
+32
+00:02:04,549 --> 00:02:10,360
+ok so today we're going to be talking
+about recurrent neural networks what's
+33
+00:02:10,360 --> 00:02:13,520
+nice about recurrent neural networks is
+that they offer a lot of flexibility in
+
+34
+00:02:13,520 --> 00:02:15,870
+how to wire up your network
+architectures
+
+35
+00:02:15,870 --> 00:02:18,650
+normally when you work with neural nets
+here's the case on the very left here
+
+36
+00:02:18,650 --> 00:02:22,849
+where you are given a fixed size vector
+here in red then you process it with
+
+37
+00:02:22,848 --> 00:02:27,639
+some hidden layers in green and then
+produce a fixed size output vector so an
+
+38
+00:02:27,639 --> 00:02:30,738
+image comes in which is a fixed size
+image and we're producing a fixed
+
+39
+00:02:30,739 --> 00:02:34,469
+size vector which is the class scores
+now with recurrent neural networks we
+
+40
+00:02:34,469 --> 00:02:38,239
+can actually operate over sequences
+sequences at the input output or both at
+
+41
+00:02:38,239 --> 00:02:41,319
+the same time so for example in the case
+of image captioning and we'll see some
+
+42
+00:02:41,318 --> 00:02:44,689
+of it today you're given a fixed size
+image and then through a recurrent
+
+43
+00:02:44,689 --> 00:02:47,829
+neural network we're going to produce a
+sequence of words that describe the
+
+44
+00:02:47,829 --> 00:02:52,560
+content of that image so that's going to
+be a sentence that is the caption for
+
+45
+00:02:52,560 --> 00:02:55,969
+that and in the case of sentiment
+classification in NLP for example
+
+46
+00:02:55,969 --> 00:02:59,759
+we're consuming a number of words in
+sequence and we will try to classify
+
+47
+00:02:59,759 --> 00:03:03,828
+whether the sentiment of that sentence is
+positive or negative in the case of
+
+48
+00:03:03,829 --> 00:03:07,590
+machine translation we can have a
+recurrent neural network that takes a
+
+49
+00:03:07,590 --> 00:03:12,069
+number of words in say English and is then
+asked to produce a number of words in
+
+50
+00:03:12,068 --> 00:03:17,119
+French as a translation so we'd feed this into
+a recurrent neural network in what
+
+51
+00:03:17,120 --> 00:03:20,280
+we call a sequence to sequence kind of
+setup and so this recurrent network will
+
+52
+00:03:20,280 --> 00:03:25,169
+just perform translation on arbitrary
+sentences in English into French and in
+
+53
+00:03:25,169 --> 00:03:28,000
+the last case for example we have video
+classification where you might want to
+
+54
+00:03:28,000 --> 00:03:31,699
+imagine classifying every single frame
+of a video with some number of classes
+
+55
+00:03:31,699 --> 00:03:35,429
+but crucially you don't want the
+prediction to be only a function of the
+
+56
+00:03:35,430 --> 00:03:38,739
+current time step the current frame of
+the video but also all the things that
+
+57
+00:03:38,739 --> 00:03:41,909
+have come before it in the video and so
+recurrent neural networks allow you to
+
+58
+00:03:41,909 --> 00:03:44,680
+wire up an architecture where the
+prediction at every single time step
+
+59
+00:03:44,680 --> 00:03:48,760
+is a function of all the frames that
+have come in up to that point now even
+
+60
+00:03:48,759 --> 00:03:52,388
+if you don't have sequences at input
+or output you can still use recurrent
+
+61
+00:03:52,389 --> 00:03:55,250
+neural networks even in the case on the
+very left because you can process your
+
+62
+00:03:55,250 --> 00:04:01,560
+fixed size inputs or outputs sequentially
+for example one of my favorite examples
+
+63
+00:04:01,560 --> 00:04:05,189
+of this is from people from DeepMind
+from a while ago who were trying to
+
+64
+00:04:05,189 --> 00:04:09,750
+transcribe house numbers and instead of
+just having this big image fed into a
+
+65
+00:04:09,750 --> 00:04:13,530
+convnet and trying to classify exactly what
+house numbers are in there they came up
+
+66
+00:04:13,530 --> 00:04:16,649
+with a recurrent neural network policy
+where there's a small convnet and it
+
+67
+00:04:16,649 --> 00:04:19,779
+steered around the image spatially with a
+recurrent neural network and so their
+
+68
+00:04:19,779 --> 00:04:23,969
+recurrent network learned to basically
+read out house numbers from left to right
+
+69
+00:04:23,970 --> 00:04:26,870
+sequentially and so we have a fixed size
+input but we're processing it
+
+70
+00:04:26,870 --> 00:04:32,019
+sequentially conversely we can think
+about this is also a well-known paper called
+
+71
+00:04:32,019 --> 00:04:35,879
+DRAW this is a generative model what you're
+seeing here are samples from the model
+
+72
+00:04:35,879 --> 00:04:39,490
+where it's coming up with these digit
+samples but crucially we're not just
+
+73
+00:04:39,490 --> 00:04:42,860
+predicting these digits at a single time
+step but we have a recurrent network and we
+
+74
+00:04:42,860 --> 00:04:47,540
+think of the output as a canvas and the
+network goes in and paints it over time and
+
+75
+00:04:47,540 --> 00:04:50,200
+so you're giving yourself more chance to
+actually do some computation before you
+
+76
+00:04:50,199 --> 00:04:53,479
+actually produce your output so it's a
+more powerful kind of form of processing
+
+77
+00:04:53,480 --> 00:05:14,189
+data was there a question over the specifics
+of exactly what this means for now
+
+78
+00:05:14,189 --> 00:05:19,310
+arrows just indicate functional
+dependence so things are so things are
+
+79
+00:05:19,310 --> 00:05:23,139
+a function of things before and we're
+going to go into exactly what that looks like in
+
+80
+00:05:23,139 --> 00:05:37,168
+a bit okay so these are generated house
+numbers so the network looked at a lot
+
+81
+00:05:37,168 --> 00:05:41,219
+of house numbers and came up with a way
+of painting them and so these are not in
+
+82
+00:05:41,220 --> 00:05:44,830
+the training data these are made up
+numbers from the model none of these are
+
+83
+00:05:44,829 --> 00:05:48,219
+actually in the training set these are made
+up
+
+84
+00:05:48,220 --> 00:05:51,689
+yeah they look quite real but they're
+actually made up from the model
+
+85
+00:05:51,689 --> 00:05:55,809
+so a recurrent neural network is
+basically this thing here marked in
+
+86
+00:05:55,809 --> 00:06:00,979
+green and it has a state and it
+basically receives through time it
+
+87
+00:06:00,978 --> 00:06:04,859
+receives vectors so every single time
+step we can feed in an input vector into
+
+88
+00:06:04,860 --> 00:06:08,538
+the RNN and it has some state
+internally and then it can modify that
+
+89
+00:06:08,538 --> 00:06:12,988
+state as a function of what it what it
+receives every single time step and so
+
+90
+00:06:12,988 --> 00:06:17,258
+there will of course be weights in the RNN
+and so when we tune those weights the
+
+91
+00:06:17,259 --> 00:06:20,829
+RNN will have different behavior in terms of
+how its state evolves as it receives
+
+92
+00:06:20,829 --> 00:06:25,769
+inputs usually we can also be
+interested in producing an output
+
+93
+00:06:25,769 --> 00:06:30,429
+based on the RNN's state so we can produce
+these vectors on top of the RNN but
+
+94
+00:06:30,428 --> 00:06:33,988
+so you'll see me show pictures like this
+but I'd just like to note that the RNN
+
+95
+00:06:33,988 --> 00:06:36,688
+is really just the block in the middle
+
+96
+00:06:36,689 --> 00:06:39,489
+which has a state and it can receive
+vectors over time and then we can base
+
+97
+00:06:39,488 --> 00:06:44,838
+some prediction on top of its state in
+some applications so concretely the way
+
+98
+00:06:44,838 --> 00:06:50,610
+this will look like is the RNN has some
+kind of a state which here I'm denoting
+
+99
+00:06:50,610 --> 00:06:55,399
+as a vector h and this can be also a
+collection of vectors or just some more
+
+100
+00:06:55,399 --> 00:07:00,939
+general state and we're going to compute it
+as a function of the previous hidden
+
+101
+00:07:00,939 --> 00:07:05,769
+state h at time t minus one
+and the current input vector x t and this
+
+102
+00:07:05,769 --> 00:07:08,338
+is going to be done through a function
+which I'll call a recurrence function
+
+103
+00:07:08,338 --> 00:07:13,728
+and that function will have parameters W
+and so as we change those Ws we're
+
+104
+00:07:13,728 --> 00:07:16,228
+going to see that the RNN will have different
+behaviors and then of course we want
+
+105
+00:07:16,228 --> 00:07:19,338
+some specific behavior out of the RNN so
+we're going to be training those weights
+
+106
+00:07:19,338 --> 00:07:23,639
+and we'll see examples of that soon for
+now I'd like to note that the same
+
+107
+00:07:23,639 --> 00:07:28,209
+function is used at every single time
+step with a fixed set of weights W
+
+108
+00:07:28,209 --> 00:07:31,778
+and we apply that single function at
+every single time step and that allows
+
+109
+00:07:31,778 --> 00:07:35,928
+us to use the recurrent neural network on
+sequences without having to commit to
+
+110
+00:07:35,928 --> 00:07:38,778
+the size of the sequence because we
+apply the exact same function at every
+
+111
+00:07:38,778 --> 00:07:43,528
+single time step no matter how long the
+input or output sequences are so in the
+
+112
+00:07:43,528 --> 00:07:46,769
+specific case of a recurrent neural
+network of a vanilla recurrent neural network the
+
+113
+00:07:46,769 --> 00:07:50,309
+simplest way you can set this up in the
+simplest recurrence you can use is what
+
+114
+00:07:50,309 --> 00:07:54,569
+I'll refer to as a vanilla RNN in this case
+the state of the recurrent neural network is
+
+115
+00:07:54,569 --> 00:08:00,569
+just a single vector h and then we have a
+recurrence formula that basically tells you
+
+116
+00:08:00,569 --> 00:08:04,039
+how you should update your hidden state
+h as a function of the previous hidden
+
+117
+00:08:04,038 --> 00:08:04,688
+state
+
+118
+00:08:04,689 --> 00:08:08,369
+and the current input x t and in
+particular in the simplest case we're
+
+119
+00:08:08,369 --> 00:08:10,349
+going to have these weight matrices
+W h h
+
+120
+00:08:10,348 --> 00:08:15,238
+and W x h and they're going to basically
+project both the hidden state from
+
+121
+00:08:15,238 --> 00:08:18,238
+the previous time step and the current
+input and then those are going to add
+
+122
+00:08:18,238 --> 00:08:21,978
+and then we squash them with a tanh
+and that's how we update the state at
+
+123
+00:08:21,978 --> 00:08:26,199
+time t so this recurrence is telling you
+how h will change as a function of its
+
+124
+00:08:26,199 --> 00:08:29,769
+history and also the current input at
+this time step and then we can make
+
+125
+00:08:29,769 --> 00:08:34,129
+predictions we can base predictions on
+top of h for example using just another
+
+126
+00:08:34,129 --> 00:08:37,528
+matrix projection on top of
the hidden
+state so this is the simplest complete
+
+127
+00:08:37,528 --> 00:08:42,288
+case in which you can wire up a
+neural network so just to give you an example of
+
+128
+00:08:42,288 --> 00:08:46,639
+how this will work right now I've just
+talked about x h and y in abstract
+
+129
+00:08:46,639 --> 00:08:49,299
+terms as vectors we could
+actually endow these vectors with
+
+130
+00:08:49,299 --> 00:08:53,059
+semantics and so one of the ways in
+which we can use a recurrent neural
+
+131
+00:08:53,059 --> 00:08:56,149
+network is in the case of character
+level language models and this is one of
+
+132
+00:08:56,149 --> 00:08:59,899
+my favorite ways of explaining RNNs
+because it's intuitive and fun to look at
+
+133
+00:08:59,899 --> 00:09:04,698
+so in this case we have character level
+language models using RNNs and the
+
+134
+00:09:04,698 --> 00:09:07,859
+way this will work is we will feed a
+sequence of characters into the
+
+135
+00:09:07,860 --> 00:09:10,899
+recurrent neural network and at every
+single time step we'll ask the recurrent
+
+136
+00:09:10,899 --> 00:09:14,299
+neural network to predict the next
+character in the sequence we'll predict
+
+137
+00:09:14,299 --> 00:09:16,909
+an entire distribution for what it
+thinks should come next in the sequence
+
+138
+00:09:16,909 --> 00:09:21,120
+that it has seen so far so suppose that
+in this very simple example we have the
+
+139
+00:09:21,120 --> 00:09:25,610
+training sequence hello and so we have a
+vocabulary of four characters
+
+140
+00:09:25,610 --> 00:09:29,870
+h e l o and we're going to try to get a
+recurrent neural network to learn to
+
+141
+00:09:29,870 --> 00:09:33,289
+predict the next character in a sequence
+on this training data so the way this
+
+142
+00:09:33,289 --> 00:09:37,000
+will work is we'll feed in
+every one of these characters one at a
+
+143
+00:09:37,000 --> 00:09:40,509
+time into a recurrent neural network
+so we'll feed in h at the first time
+
+144
+00:09:40,509 --> 00:09:47,110
+step and here the x-axis is
+time so we'll feed in h e l l and
+
+145
+00:09:47,110 --> 00:09:50,629
+we are encoding characters using what we
+call a one hot representation where we
+
+146
+00:09:50,629 --> 00:09:53,889
+just turn on the bit that corresponds
+to that character's order in the vocabulary
+
+147
+00:09:53,889 --> 00:09:58,129
+now we're going to use the recurrence
+formula that I have shown you where at
+
+148
+00:09:58,129 --> 00:10:01,860
+every single time step suppose we start
+off with h 0 and then we apply this
+
+149
+00:10:01,860 --> 00:10:04,720
+recurrence to compute the hidden state
+vector at every single time step using
+
+150
+00:10:04,720 --> 00:10:08,790
+this fixed recurrence formula so suppose
+here we have only three numbers in the hidden state
+
+151
+00:10:08,789 --> 00:10:11,099
+we're going to end up with a
+three-dimensional representation that
+
+152
+00:10:11,100 --> 00:10:13,040
+basically at any point in time
+
+153
+00:10:13,039 --> 00:10:15,759
+summarizes all the characters that have
+come until then
+
+154
+00:10:15,759 --> 00:10:20,159
+and so we apply this recurrence
+at every single time step and now
+
+155
+00:10:20,159 --> 00:10:23,139
+we're going to predict at every single
+time step what should be the next
+
+156
+00:10:23,139 --> 00:10:27,569
+character in the sequence so for example
+since we had four characters in this
+
+157
+00:10:27,570 --> 00:10:32,100
+vocabulary we're going to predict four
+numbers at
every single time step so for
+
+158
+00:10:32,100 --> 00:10:37,139
+example in the very first time step we
+fed in the letter h and the RNN with
+
+159
+00:10:37,139 --> 00:10:40,799
+its current setting of weights computed
+these unnormalized log probabilities
+
+160
+00:10:40,799 --> 00:10:42,959
+here for what it thinks should come next
+
+161
+00:10:42,960 --> 00:10:47,950
+so it thinks that h is 1.0 likely to come
+next it thinks that e is 2.2 likely l
+
+162
+00:10:47,950 --> 00:10:52,640
+as negative 3.0 likely and o as 4.1
+likely right now in terms of these unnormalized log
+
+163
+00:10:52,639 --> 00:10:56,409
+probabilities of course we know that in
+this training sequence we know that e
+
+164
+00:10:56,409 --> 00:11:00,669
+should follow h so in fact this 2.2
+which is shown in green is the correct
+
+165
+00:11:00,669 --> 00:11:04,559
+answer in this case and so we want that
+to be high and we want all these
+
+166
+00:11:04,559 --> 00:11:07,799
+other numbers to be low and so on every
+single time step we have basically a
+
+167
+00:11:07,799 --> 00:11:12,209
+target for what next character should
+come in the sequence and so we just want
+
+168
+00:11:12,210 --> 00:11:15,470
+all these numbers to be high and all the
+other numbers to be low and so that's of
+
+169
+00:11:15,470 --> 00:11:19,950
+course included in the
+loss function and then that
+
+170
+00:11:19,950 --> 00:11:23,220
+gets back propagated through these
+connections so another way to think
+
+171
+00:11:23,220 --> 00:11:26,600
+about it is that every single time step
+we basically have a softmax classifier
+
+172
+00:11:26,600 --> 00:11:31,300
+so every one of these is a softmax
+classifier over the next character and
+
+173
+00:11:31,299 --> 00:11:34,269
+at every single point we know what the
+next character should be and so we just
+
+174
+00:11:34,269 --> 00:11:37,879
+get all those losses flowing down from
+the top and they will all flow through
+
+175
+00:11:37,879 --> 00:11:41,179
+this graph backwards through all the arrows
+we're going to get gradients on all the
+
+176
+00:11:41,179 --> 00:11:44,479
+weight matrices and then we'll know how
+to shift the matrices so that the
+
+177
+00:11:44,480 --> 00:11:50,039
+correct probabilities are coming out of the
+RNN so we'd be shaping those weights
+
+178
+00:11:50,039 --> 00:11:53,599
+so that the RNN
+has the correct behaviour as you're feeding in
+
+179
+00:11:53,600 --> 00:11:57,750
+characters so you can imagine how we can
+train this ok are there any questions about
+
+180
+00:11:57,750 --> 00:12:02,879
+the diagram
+
+181
+00:12:02,879 --> 00:12:08,750
+yeah thank you so as I
+mentioned as we've seen the recurrence is the
+
+182
+00:12:08,750 --> 00:12:13,320
+same function always so we have a
+single W x h at every time step we
+
+183
+00:12:13,320 --> 00:12:17,010
+have a single W h y at every time step and
+the same W h h applied at every time step
+
+184
+00:12:17,009 --> 00:12:23,830
+here so we've used W x h W h y and W h h four
+times in this diagram and in back
+
+185
+00:12:23,830 --> 00:12:27,720
+propagation we will have to
+account for that because we'll have all
+
+186
+00:12:27,720 --> 00:12:30,750
+these gradients adding up to the same
+weight matrix because it has been used
+
+187
+00:12:30,750 --> 00:12:35,879
+at multiple time steps and this is what
+allows us to process you know variably
+
+188
+00:12:35,879 --> 00:12:38,960
+sized inputs because every time step
+we're doing the same thing so
not a
+function of the absolute amount of
+time steps and your question is what are common
+
+190
+00:12:48,539 --> 00:12:52,579
+ways for initializing the first h 0 I
+think setting it to zero is quite quite
+
+191
+00:12:52,580 --> 00:13:00,650
+common in the beginning but does the
+order in which we receive the data
+
+192
+00:13:00,649 --> 00:13:01,289
+that matter
+
+193
+00:13:01,289 --> 00:13:11,299
+yes because so are you asking if we fed these
+characters in a different order so if
+
+194
+00:13:11,299 --> 00:13:14,359
+you see if this was a longer sequence
+the order in this case in this case
+
+195
+00:13:14,360 --> 00:13:17,870
+the order does matter because at every
+single point in time if you think about
+
+196
+00:13:17,870 --> 00:13:21,299
+it functionally like this the vector
+at this time step is a function of
+
+197
+00:13:21,299 --> 00:13:26,859
+everything that has come before it right
+and so the order does matter for as long
+
+198
+00:13:26,860 --> 00:13:31,590
+as you're reading it in and we're going to
+go through some specific
+
+199
+00:13:31,590 --> 00:13:36,149
+examples which I think will clarify some
+of these points so let's look at a specific
+
+200
+00:13:36,149 --> 00:13:38,980
+example in fact if you want to try a
+character level language model it's
+
+201
+00:13:38,980 --> 00:13:43,350
+quite short so I wrote a gist that you
+can find on GitHub where this is
+
+202
+00:13:43,350 --> 00:13:47,220
+a hundred line implementation in numpy of
+a character level RNN and then you can go
+
+203
+00:13:47,220 --> 00:13:49,840
+through and I'll actually step through
+this with you so you can see concretely
+
+204
+00:13:49,840 --> 00:13:53,220
+how we could train a recurrent neural
+network on text and so I'm going to
+
+205
+00:13:53,220 --> 00:13:58,250
+step through this so we're going to go
+through all the blocks in the beginning
+
+206
+00:13:58,250 --> 00:14:02,389
+as you'll see the only dependency here
+is numpy we're loading in some text data so
+
+207
+00:14:02,389 --> 00:14:05,569
+our input here is just a large
+collection of a large sequence of
+
+208
+00:14:05,570 --> 00:14:10,090
+characters in this case a text file
+input dot txt and then we get all the
+
+209
+00:14:10,090 --> 00:14:14,810
+characters in that file and we find all
+the unique characters in that file
+
+210
+00:14:14,809 --> 00:14:18,179
+create these mapping dictionaries that
+map from characters to indices and
+
+211
+00:14:18,179 --> 00:14:23,120
+from indices to characters we basically
+order our characters so say we read in
+
+212
+00:14:23,120 --> 00:14:27,350
+a whole bunch of a file and a whole
+bunch of data and we have a hundred
+
+213
+00:14:27,350 --> 00:14:30,860
+characters or something like that and
+order them in a in a sequence so we
+
+214
+00:14:30,860 --> 00:14:36,300
+associate indices to every character then
+here we're going to initialize
+
+215
+00:14:36,299 --> 00:14:39,899
+first our hidden size is a hyperparameter as
+you'll see with recurrent neural
+
+216
+00:14:39,899 --> 00:14:43,100
+networks so here I'm setting it to be a
+hundred here we have a learning rate
+
+217
+00:14:43,100 --> 00:14:46,720
+sequence length here is set to
+twenty-five this is a parameter that
+
+218
+00:14:46,720 --> 00:14:51,019
+you'll become aware of in a bit
+the problem is if our input data is way
+
+219
+00:14:51,019 --> 00:14:53,899
+too large say like millions of time
+steps there's no way you can put
+
+220
+00:14:53,899 --> 00:14:56,870
+an RNN on top of all of it because
+we need to maintain all of this stuff in
+
+221
+00:14:56,870 --> 00:15:00,070
+memory so that you can do back
+propagation in fact we won't be able to
+
+222
+00:15:00,070 --> 00:15:03,540
+keep all of it in memory and
+backprop through all of it so we'll go
+
+223
+00:15:03,539 --> 00:15:07,139
+in chunks through our input data in this
+case we're going through chunks of 25 at
+
+224
+00:15:07,139 --> 00:15:09,230
+a time so as you'll see in a bit
+
+225
+00:15:09,230 --> 00:15:14,769
+we have this entire dataset but we'll be
+going in chunks of 25 characters at a
+
+226
+00:15:14,769 --> 00:15:19,509
+time and every time we're just going to
+backpropagate through 25 characters at a time
+
+227
+00:15:19,509 --> 00:15:22,149
+because we can't afford to do back
+propagation for longer because we have
+
+228
+00:15:22,149 --> 00:15:26,899
+to remember all that stuff and so we're
+going in chunks here of 25 and then we
+
+229
+00:15:26,899 --> 00:15:30,789
+have all these W matrices that here I'm
+initializing randomly and some biases so W x
+
+230
+00:15:30,789 --> 00:15:34,709
+h W h h and W h y and those are all of our
+all of our parameters that we're going
+
+231
+00:15:34,710 --> 00:15:36,790
+to train with backprop
+
+232
+00:15:36,789 --> 00:15:40,699
+I'm going to skip over the loss function
+here and I'm going to get to the bottom
+
+233
+00:15:40,700 --> 00:15:44,020
+of the script here we have a main loop
+and I'm going to go through some of this
+
+234
+00:15:44,019 --> 00:15:48,399
+main loop now so there are some
+initializations here of various things to zero
+
+235
+00:15:48,399 --> 00:15:50,829
+in the beginning and then we're looping
+forever
+
+236
+00:15:50,830 --> 00:15:54,960
+what we're doing here is sampling a batch
+of data so here is where I actually
+
+237
+00:15:54,960 --> 00:15:58,970
+take a batch of 25 characters out of
+this dataset so that's in the list
+
+238
+00:15:58,970 --> 00:16:03,019
+inputs and the list inputs basically
+just has 25 integers corresponding to the
+
+239
+00:16:03,019 --> 00:16:06,919
+characters the targets as you'll see are
+just all the same characters but offset
+
+240
+00:16:06,919 --> 00:16:09,909
+by one because those are the indices
+that we're trying to predict at every
+
+241
+00:16:09,909 --> 00:16:15,269
+single time step so inputs and
+targets are just lists of 25 characters
+
+242
+00:16:15,269 --> 00:16:20,689
+targets is offset by one into the future
+so that's how we sample basically a batch
+
+243
+00:16:20,690 --> 00:16:26,480
+from the data here this is some sample
+code so at every single point in time
+
+244
+00:16:26,480 --> 00:16:30,659
+training this we of course try
+to generate some samples of what it's
+
+245
+00:16:30,659 --> 00:16:35,370
+currently thinks the characters should
+actually what these sequences look like
+
+246
+00:16:35,370 --> 00:16:40,320
+the way we use character level
+RNNs at test time is that we're
+
+247
+00:16:40,320 --> 00:16:43,570
+going to seed it with some characters
+and then this RNN always gives us the
+
+248
+00:16:43,570 --> 00:16:46,379
+distribution of the next character in a
+sequence so you can imagine sampling
+
+249
+00:16:46,379 --> 00:16:49,259
+from it and then you feed in the next
+character getting a sample from the
+
+250
+00:16:49,259 --> 00:16:52,769
+distribution and keep doing it and
+keep feeding all the samples into the
+
+251
+00:16:52,769 --> 00:16:56,549
+RNN and you can
just generate arbitrary
+text data that's what this code will do
+
+252
+00:16:56,549 --> 00:17:00,549
+and it calls the sample function so
+we'll go to that in a bit then here
+
+253
+00:17:00,549 --> 00:17:04,250
+I'm calling the loss function the loss
+function receives the inputs the targets
+
+254
+00:17:04,250 --> 00:17:09,160
+and it receives also this h prev h
+prev is short for the hidden state vector
+
+255
+00:17:09,160 --> 00:17:13,900
+from the previous chunk so we're going
+in batches of 25 and we are keeping
+
+256
+00:17:13,900 --> 00:17:18,179
+track of what is the latest h vector at
+the end of the 25 letters so that we
+
+257
+00:17:18,179 --> 00:17:22,400
+can when we feed in the next batch we can
+feed that in as the initial h at that
+
+258
+00:17:22,400 --> 00:17:26,140
+time so we're making sure that the
+hidden states are basically correctly
+
+259
+00:17:26,140 --> 00:17:30,700
+propagated from batch to batch through
+that but we're only back propagating
+
+260
+00:17:30,700 --> 00:17:35,558
+those 25 time steps so we get from this
+function the loss and gradients on
+
+261
+00:17:35,558 --> 00:17:39,319
+all the weight matrices and all the
+biases and we're just printing the loss
+
+262
+00:17:39,319 --> 00:17:44,149
+and then here's a parameter update we've
+got all of the gradients and here we are
+
+263
+00:17:44,150 --> 00:17:47,429
+actually performing the update which you
+should recognize as an adagrad update
+
+264
+00:17:47,429 --> 00:17:53,100
+so I have all these cached all
+these cached
+
+265
+00:17:53,099 --> 00:17:56,819
+variables for the gradient squared which
+I'm accumulating and then perform the
+
+266
+00:17:56,819 --> 00:18:00,639
+adagrad update now I'm going to go into the
+loss function and what that looks like
+
+267
+00:18:00,640 --> 00:18:05,790
+now the loss function is this block of
+code it really consists of forward and
+
+268
+00:18:05,789 --> 00:18:08,990
+backward method so we're computing the
+forward pass and then the backward
+
+269
+00:18:08,990 --> 00:18:13,130
+pass in green so I'll go through those
+two steps the forward pass you should
+
+270
+00:18:13,130 --> 00:18:18,919
+recognize basically we get those inputs and
+targets we receive these 25
+
+271
+00:18:18,919 --> 00:18:23,360
+indices and we're iterating through
+them from 1 to 25 we create this x
+
+272
+00:18:23,359 --> 00:18:27,500
+input vector which is just zeros and
+then we set the one hot encoding so
+
+273
+00:18:27,500 --> 00:18:32,169
+whatever the index in inputs is we turn
+it on with a one so we're feeding in
+
+274
+00:18:32,169 --> 00:18:34,110
+the character with that one hot encoding
+
+275
+00:18:34,109 --> 00:18:39,229
+here I'm computing the recurrence formula
+using this equation so hs at t
+
+276
+00:18:39,230 --> 00:18:42,210
+there are hs and all these dicts to keep
+track of everything at every single
+
+277
+00:18:42,210 --> 00:18:46,910
+time step so we compute the state
+vector and the output using the
+
+278
+00:18:46,910 --> 00:18:50,779
+recurrence formula in these two lines
+and then over there I'm computing the
+
+279
+00:18:50,779 --> 00:18:54,440
+softmax function so normalizing this so
+that we get probabilities and then
+
+280
+00:18:54,440 --> 00:18:58,190
+the loss is the negative log probability
+of the correct answer so that's just a
+
+281
+00:18:58,190 --> 00:19:02,779
+softmax classifier loss over there so
+that's the forward pass and we're going to
+
+282
+00:19:02,779 --> 00:19:06,899
+back propagate through
Then we backpropagate through the graph. In the backward pass we go backwards through that sequence, from 25 all the way back to 1, and, I don't know how much detail I want to go into here, but you'll recognize what's happening: we backpropagate through the softmax, we backpropagate through the activation functions, we backpropagate through all of it, and we're just adding up all the gradients on all the parameters. One thing to note here especially is that for the gradients on the weight matrices, like Whh, I'm using a plus-equals: at every single time step all of these weight matrices receive a gradient, and we need to accumulate into all of them, because we keep using the same weight matrices at every time step. So we just backprop into them over time, and that gives us the gradients, which the parameter update then uses.
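A matching sketch of that backward pass, again paraphrasing min-char-rnn rather than quoting it (the biases are omitted for brevity); the plus-equals accumulation into the shared weight matrices is the point being emphasized:

```python
import numpy as np

def backward(inputs, targets, xs, hs, ps, Wxh, Whh, Why):
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dhnext = np.zeros_like(hs[0])               # gradient arriving from the future
    for t in reversed(range(len(inputs))):      # 25 down to 1
        dy = np.copy(ps[t])
        dy[targets[t]] -= 1                     # backprop through the softmax loss
        dWhy += dy @ hs[t].T                    # += : the same matrix is reused at every step
        dh = Why.T @ dy + dhnext                # from the output and from the next time step
        dhraw = (1 - hs[t] ** 2) * dh           # backprop through the tanh
        dWxh += dhraw @ xs[t].T
        dWhh += dhraw @ hs[t - 1].T
        dhnext = Whh.T @ dhraw                  # pass gradient to the previous time step
    return dWxh, dWhh, dWhy
```

(The real min-char-rnn additionally clips these gradients element-wise before the Adagrad update.)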
Finally, we have a sampling function. This is where we try to actually get the RNN to generate new text data based on what it has seen in training, based on the statistics of the characters and how they follow each other in the training data. We initialize with some random character, and then we go on for as long as we want: we compute the recurrence formula, we get the probability distribution, we sample from that distribution, we encode the sample in a one-hot representation, and we feed it in at the next time step, and we keep doing this until we get, say, 200 characters. So, are there any questions about the rough layout of how this works?

(In response to a question:) Right, there are effectively 25 softmax classifiers at every batch, and we backprop all of them at the same time, and they all add up in the connections going backwards. (Question: why is there no regularization here?) You'll see that I probably don't use it; I guess I skipped it here, but in general you can. I've sometimes tried regularization, and I don't think it's that common in recurrent nets; as an aside, it sometimes gave me slightly worse results, so sometimes I skip it. It's kind of a hyperparameter. (Question about the chunks of 25:) Yeah, that's right: in these sequences of 25 we are very low-level, on the character level, and we don't actually care about words. We don't know that words exist; there are just character indices. In fact this RNN doesn't know anything about characters or language or anything like that; it just sees indices and sequences of indices, and that's what we're modeling. (Question: could you use, say, spaces to delimit the sequences instead of constant batches of 25?) I think you maybe could, but then you're making assumptions about language. We'll see soon why you might not want to do that, because you can plug anything into this, and we'll see that we can have a lot of fun with that.

OK, so now we can take a whole bunch of text, and we don't care where it came from, it's just a sequence of characters; we feed it into the RNN, and we can train the RNN to create text like it. For example, you can take all of William Shakespeare's works, concatenate it all into one giant sequence of characters, put it into the recurrent neural network, and try to predict the next character in a sequence of William Shakespeare. When you do this, of course, in the beginning the recurrent neural network has random parameters, so it's just producing garble, just random characters. But then, as you train the RNN, it starts to understand that OK, there are actually things like spaces and words; it starts to experiment with quotes, and it basically learns some of the very short words like "here" or "on" and so on. As you train more and more, the samples become more and more refined, and the recurrent neural network learns that when you open a quote you should close it later, or that sentences end with a dot. It learns all this stuff statistically, just from the raw patterns, without us having to hard-code anything, and in the end you can sample entire Shakespeare on a character level. Just to give an idea of the kind of stuff that comes out: "I think he shall become approached and the gang will strain, would be attained into being never fed, and who is but the chain and subject of his death, I should not sleep." That's the kind of stuff you would get out of this recurrent network. (In response to a comment:) You're bringing up a very subtle point, which I'd like to get back to in a bit.

OK, so we can run this on Shakespeare, but we can run it on basically anything. We were playing with this, with Justin, roughly a year ago, and Justin found this book on algebraic geometry, which is just a
large LaTeX source file. We took that LaTeX source file for this geometry book and trained the RNN on it, and the RNN can learn to basically generate mathematics. So this is a sample: the RNN just spits out LaTeX, and then we can compile it. Of course it doesn't work right away, we had to tune it a tiny bit, but basically, after we tweaked some of the mistakes the RNN made, you can compile it and get generated mathematics. You'll see that it basically creates all these proofs, it puts these little squares at the end of proofs, it creates lemmas, and so on. Sometimes it even tries to create diagrams, with varying amounts of success. My favorite part about this is that on the top left the proof is omitted; the RNN is just lazy. But otherwise this stuff is quite indistinguishable, I would say, from actual algebraic geometry. "Let X be a scheme of X": OK, I'm not sure about that part, but otherwise the gestalt of this looks very good.

You can throw arbitrary things at it, so I tried to find the hardest arbitrary thing I could throw at the character level, and I decided that source code is actually very difficult. I took all of the Linux source, which is just all C code; you can concatenate it and you end up with, I think, some hundred megabytes of C code and header files, and you just throw it into the RNN. It can then learn to generate code, and this is generated code from the RNN. You can see that it basically creates function declarations, it knows about inputs, syntactically it makes very few mistakes, it knows about variables and sort of how to use them, sometimes it indents the code, and it creates its own bogus comments. Syntactically, it's very rare to find that it opens a bracket and doesn't close it, and so on; this is actually relatively easy for the RNN to learn. Some of the mistakes it does make: for example, it declares some variables that it never ends up using, or it uses variables that it never declared, so some of this high-level stuff is still missing, but otherwise it does just fine. It has also learned to recite the GNU GPL license character by character, which it learned from the data, and it knows that after the GPL license there are some include files, there are some macros, and then there's some code, so that's
basically what it has learned. Now, min-char-rnn in turn is just a very small toy thing to show you what's going on. Then there's char-rnn, which is a more serious implementation in Torch; it's basically min-char-rnn scaled up, and it runs on a GPU, so you can play with that yourself. These samples in particular were generated by, I believe, a three-layer LSTM, and we'll see what that means; it's a more complex kind of recurrent network.

I'd like to give you an idea of how this works internally. There's a paper we did just last year where we basically pretend that we're neuroscientists: we run a trained RNN over some test text, so the RNN is reading this text, a snippet of code, and we look at a specific cell in the hidden state, coloring the text based on whether or not that cell is excited. You can see that many of the hidden state neurons are not interpretable; they kind of fire in seemingly weird ways, because some of them have to do quite low-level character-level stuff, like how often an "e" comes after an "h" and things like that. But some of the cells are quite interpretable. For example, we found cells like a quote-detection cell: this cell just turns on when it sees a quote, and then it stays on until the quote closes. It quite reliably keeps track of this, and it just comes out of backpropagation: the RNN decides that the character-level statistics are different inside and outside of quotes, and that this is a useful feature to learn, so it dedicates some of its hidden state to keeping track of whether or not you're inside a quote. And this goes back to your question, which I want to point out here: this RNN was trained on, I think, sequence length 100, but if you measure the length of this quote, it's actually much more than 100; I think it's like 250. We only backpropagated up to 100, so that's the only horizon over which the cell can actually learn, because it wouldn't be able to spot dependencies much longer than that. But this seems to show that you can train this quote-detection cell to be useful on sequences of less than a hundred, and then it generalizes properly to longer
sequences. So this cell seems to work for more than a hundred steps, even if it was only able to spot the dependencies on less than a hundred during training. Here's another dataset; this is, I think, Leo Tolstoy's War and Peace. In this dataset there's a newline character roughly every 80 characters, and there's a line-length tracking cell that we found: it starts off at, like, 1, and then it slowly decays over time. You might imagine that a cell like this is actually very useful for predicting the newline character at the end, because this RNN needs to count roughly 80 time steps so that it knows when a newline character is likely to come next. So there are line-length tracking cells; we found cells that only respond inside if statements, we found cells that only respond inside quotes and strings, we found cells that get more excited the deeper you nest in an expression, all kinds of interesting cells that you can actually find inside these RNNs, and they completely come out just from the backpropagation. That's quite magical, I suppose.

(Question: how do you find these cells?) This LSTM has, I think, about 2,100 cells, so you just go through them; some of them look like this, but I would say in roughly 5 percent of them you spot something interesting. So you go through it manually. (Question about what is being run:) Sorry, so we are running the entire RNN completely intact, but we're only looking at the firing of one single cell of the hidden state. We run the RNN normally, but we're just kind of recording from one cell in the hidden state, if that makes sense. The entire RNN runs; we're just visualizing one part of the hidden state. There are many other hidden cells that are involved in different ways; they're all behaving in different ways and doing different things inside the RNN's hidden state. (Question about depth:) This was a multi-layer model, but you can get similar results with one layer. (Question about the scale:) These cells were always between negative one and one; this is from an LSTM, which we haven't covered yet, but the firing of the cells is between negative one and one, so that's the scale of this picture.
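The "record from one cell" procedure just described amounts to running the trained network unchanged and logging a single coordinate of the hidden state per character. A hypothetical sketch, where step_fn stands in for one forward step of whatever trained model is being probed (not a function from the paper):

```python
import numpy as np

def record_cell(text, char_to_ix, step_fn, h0, cell_idx):
    """Run the RNN over `text` normally, recording one hidden unit's firing
    per character (the value that gets mapped to a text color)."""
    h = np.copy(h0)
    firings = []
    for ch in text:
        h = step_fn(char_to_ix[ch], h)      # one ordinary forward step of the trained RNN
        firings.append(float(h[cell_idx]))  # e.g. in [-1, 1] for these tanh-squashed cells
    return firings
```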
So RNNs are pretty cool, and you can actually train these sequence models over all kinds of data. Roughly one year ago, several people came to realize that you can use them in a very neat application in the context of computer vision, to perform image captioning. In this context we take a single image and we'd like to describe it with a sequence of words, and these RNNs are very good at understanding how sequences develop over time. The particular model I'm going to describe is actual work from roughly a year ago; it happens to be my paper, so I have pictures from my paper and I'm going to use those. We are feeding an image into a convolutional neural network, and you'll see that this whole model is really just made up of two modules: there's the ConvNet, which is doing the processing of the image, and the recurrent net, which is very good at modeling sequences. If you remember my analogy from the very beginning of the course, where this is kind of like playing with Lego blocks, we're going to take those two modules and stick them together; that corresponds to the arrow in between. What we're doing effectively is conditioning this RNN generative model: we're not just telling it to sample text at random, but we're conditioning the generating process on the output of the convolutional network, and I'll show you exactly what that looks like.

So let me show you what the forward pass of this model is. Suppose we have a test image and we're trying to describe it with a sequence of words. The way this model processes the image is as follows: we take the image and plug it into a convolutional neural network, in this case a VGGNet, and we go through a whole bunch of conv and pool layers and so on, until we arrive at the end. Normally at the end we'd have the softmax classifier, which gives you a probability distribution over, say, 1000 categories of images. In this case we're going to get rid of that classifier, and instead we're going to redirect the representation at the top of the convolutional network into the recurrent neural network. We begin the generation of the RNN with a special start vector: the inputs to the RNN here are, I think, 300-dimensional, and this is a special 300-dimensional vector that we always plug in at the first iteration, which tells the RNN that this is the beginning of the sequence.
Then we're going to perform the recurrence formula that I showed you before for a recurrent neural network. Normally we compute this recurrence, which we've seen already: we compute Wxh times x plus Whh times h. But now we want to additionally condition this recurrent neural network not only on the current input and the current hidden state, which we initialize to zero, so that term goes away at the first time step, but also on the image: we condition it at the initial time step by adding Wih times v. Here v is the representation from the top of the ConvNet, and we've added this interaction with an added weight matrix Wih, which tells us how the image information merges into the very first time step of the recurrent neural network. Now, there are many ways to play with this recurrence and many ways to plug the image into the RNN; this is only one of them, and one of the simpler ones, perhaps.
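As a sketch, the conditioned first step might look like this, where v is the vector from the top of the ConvNet and Wih is the added weight matrix; this is the one simple wiring the lecture describes, not the only possible one:

```python
import numpy as np

def captioner_step(x, h_prev, v, Wxh, Whh, Wih, bh, first_step):
    a = Wxh @ x + Whh @ h_prev + bh   # the ordinary recurrence
    if first_step:                    # h_prev is all zeros here, so Whh @ h_prev drops out
        a += Wih @ v                  # merge the image information in, once
    return np.tanh(a)
```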
At the very first time step, this y0 vector is the distribution over the first word in the sequence. The way this works, you might imagine, is that the textures in this man's hat can be recognized by the convolutional network as straw-like stuff, and then, through this Wih interaction, that can condition the hidden state to go into a particular state where the probability of the word "straw" is slightly higher. So you might imagine that the straw-like textures influence the probability of "straw", making one of the numbers inside y0 higher because of those textures. From now on the RNN has to kind of juggle two tasks: it has to predict the next thing in the sequence, in this case the next word, and it has to remember the image information. So we sample from that softmax, and suppose the most likely word we sampled from that distribution was indeed the word "straw". We take "straw" and plug it into the recurrent neural network at the bottom again. In this case, I think, we're using word-level embeddings: the word "straw" is associated with a 300-dimensional vector, and we're going to learn that 300-dimensional representation for every single unique word in the vocabulary. We plug those 300 numbers into the RNN and forward again to get the distribution over the second word in the sequence, inside y1; we get all these probabilities and we sample from it again. Suppose the word "hat" is likely now: we take hat's 300-dimensional representation, get the distribution over there, and we sample again, and we keep sampling until we sample a special END token, which is really like the period at the end of a sentence. That tells us the RNN is now done generating, and at this point the RNN would have described this image as "straw hat." So the number of dimensions in this y vector is the number of words in your vocabulary, plus 1 for the special END token. We are always feeding in 300-dimensional vectors that correspond to different words, plus the special start token, and then we just backpropagate through the whole thing as a single model. You can initialize at random, or you can initialize your VGGNet with pretrained weights; you compute your distributions, you compute the gradients, and then you backprop through the whole thing as a single model, train it all jointly, and you get an image captioner. Lots of questions, OK.

(Question about the word embeddings:) Yes, the 300-dimensional embeddings are just independent of the image: every word has 300 numbers associated with it, and we're going to backprop into them. You initialize them at random, and then you backprop into these vectors, so the embeddings shift around; they're just parameters. Another way to think about it is that it's equivalent to having a one-hot representation for all the words, and then a giant W matrix: if you multiply W by that one-hot representation, and W has 300 outputs, then it's effectively plucking out a single row of W, and that row is the embedding. So if you don't like these embeddings, just think of it as a one-hot representation; you can think of it that way.
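That one-hot view of the embeddings can be checked in a couple of lines; the sizes match the ones mentioned in the lecture, the values are of course made up:

```python
import numpy as np

vocab_size, embed_dim = 10000, 300                       # hypothetical vocabulary size
W_embed = 0.01 * np.random.randn(vocab_size, embed_dim)  # learned; backprop shifts its rows

ix = 42                                # index of some word, say "straw"
one_hot = np.zeros(vocab_size)
one_hot[ix] = 1

# multiplying the one-hot vector by W_embed just plucks out row ix:
assert np.allclose(one_hot @ W_embed, W_embed[ix])
```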
(Question: does the model learn to output the END token?) Yes: in the training data, the correct sequence that we expect from the RNN is the first word, the second word, and so on, and every single training example has the special END token at the end. Go ahead. (Question: could you feed the image in at every time step?) You can wire it differently and plug it into every single time step; it turns out that actually works worse. It actually works better if you just plug it in at the very first time step, and then the RNN has to juggle both of those tasks: it has to remember about the image what it needs to remember, carrying it through the recurrence, and it also has to produce all these outputs, and somehow it wants to do that. There are some hand-wavy reasons I can give you after class. (Question about training instances:) Right, that's true: a single training instance corresponds to an image and a sequence of words, so we would plug in those words here, and we would plug in that image, and we unroll this graph. At train time you have all those word indices coming in at the bottom and the image coming in, you unroll the graph, you have your losses at the top, and then you backprop. You can also do batches of images if you're careful: the sequences in the training data sometimes have different lengths, so you have to be careful with that. You have to say, OK, I'm willing to process batches of up to twenty words, maybe, and then some of those sentences will be shorter or longer, and in your code you need to worry about that, because some sentences are longer than others.
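One common way to handle that in code, an assumption on my part rather than a recipe from the lecture, is to pad or truncate each sentence to the chosen maximum length and keep a boolean mask so padded positions are excluded from the loss:

```python
import numpy as np

def pad_batch(seqs, max_len=20, pad_ix=0):
    """seqs: list of lists of word indices. Returns (batch, mask)."""
    batch = np.full((len(seqs), max_len), pad_ix, dtype=np.int64)
    mask = np.zeros((len(seqs), max_len), dtype=bool)
    for i, s in enumerate(seqs):
        s = s[:max_len]                  # truncate sentences longer than max_len
        batch[i, :len(s)] = s
        mask[i, :len(s)] = True          # only these positions contribute to the loss
    return batch, mask
```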
We have way too many questions and I have stuff to get through, but yes, thank you: we backpropagate everything completely jointly, end to end. You can pretrain the ConvNet on ImageNet and then put those weights there, but then you just want to train everything jointly, and that's a big advantage, actually, because the model can figure out what features to look for in order to better describe the image at the end.

When you train this in practice, you train on image-sentence datasets; one of the more common ones is called Microsoft COCO. Just to give you an idea of what it looks like, it's roughly 120,000 images with five sentence descriptions for each image, obtained using Amazon Mechanical Turk: you just ask people, please give us a sentence description for this image, you record those, and you end up with your dataset. When you train this model, the kinds of results you can expect look roughly like this; this is our RNN describing these images. It says this is a "man in black shirt playing guitar", or "construction worker in orange safety vest is working on the road", or "two young girls are playing with lego toy", or "boy is doing backflip on a wakeboard", and of course that's not a wakeboard, but it's close. There are also some very funny failure cases, which I also like to show: this is a "young boy holding a baseball bat", this is a "cat sitting on a couch with a remote control", that's a "woman holding a teddy bear in front of a mirror", and I'm pretty sure the texture here is probably what made it think it's a teddy bear. The last one is a "horse standing in the middle of a road", and there's no horse, obviously; I'm not sure what happened there.

So this is just the simplest kind of model, which came out last year, and many people have tried to build on top of these kinds of models and make them more complex. I'd like to give you one development that I find interesting, just to show how people play with this basic architecture. This is a paper from last year: if you noticed, in the current model we only feed in the image a single time, at the beginning, and one way you can play with this is to actually allow the recurrent neural network to look back at the image and reference parts of it while it's describing it with words. As you're generating every single word, you allow the RNN to look back at the image and look for different features of what it might want to describe next, and you can do this in a fully trainable way: the RNN not only generates the words but also decides where to look next in the image. The way this works is that the RNN doesn't just output the probability distribution over the next word in the sequence; the ConvNet also gives you a volume. Say in this case we forwarded the ConvNet and got a 14 by 14 by 512 activation volume. At every single time step we don't just emit the word distribution; we also emit a 512-dimensional vector, which is kind of like a lookup key for what we want to look for next in the image. Actually, I don't think this is exactly what they did in this particular paper, but this is one way you could wire something like this up. So this vector is emitted from the RNN, just predicted using some weights, and then it can be dot-producted with all of the 14 by 14 locations: we do all these dot products, we compute basically a 14 by 14 compatibility map, and then we put a softmax on it, so we normalize it all, and we get what we call attention over the image: a 14 by 14
probability map over what's currently interesting to the RNN in the image. Then we use this probability map to do a weighted sum of those 512-dimensional features, weighted by this saliency. So the RNN can basically emit a key for what it currently finds interesting, that goes back to the image, and you end up doing a weighted sum of the different kinds of features that the LSTM wants to look at at this point in time. For example, the RNN is generating words and might decide, OK, I'd like to look for something object-like now; it emits a vector of numbers that keys on object-like stuff, that interacts with the ConvNet's activations, and maybe the object-like regions of that activation volume light up in the saliency map, in this 14 by 14 map, and you end up focusing your attention on that part of the image through this interaction. So you can basically do lookups into the image while you're describing the sentence. This is something we refer to as soft attention, and we'll actually go into it in a few lectures; we're going to cover things like this, where the RNN can have selective attention over its inputs as it's processing them. I just wanted to bring it up roughly now, to give you a preview of what that looks like.
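Here is a NumPy sketch of that soft-attention lookup, with the 14 by 14 by 512 volume from the lecture; again, this is one plausible wiring rather than the paper's exact mechanism:

```python
import numpy as np

def soft_attention(features, key):
    """features: (14, 14, 512) ConvNet volume; key: (512,) lookup vector from the RNN.
    Returns the 14x14 attention map and the attended 512-d context vector."""
    H, W, D = features.shape
    flat = features.reshape(H * W, D)      # 196 locations, one 512-d feature each
    scores = flat @ key                    # dot-product compatibility per location
    scores -= scores.max()                 # stabilize the softmax
    attn = np.exp(scores) / np.sum(np.exp(scores))   # normalize into a probability map
    context = attn @ flat                  # saliency-weighted sum of the features
    return attn.reshape(H, W), context
```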
OK, now, if we want to make our models more complex, one way to do that is to stack the RNNs up in layers; this gives you more depth, and deeper stuff usually works better. The way we stack them, at least one of the ways people use in practice (there are many), is that you can straight up just plug RNNs into each other: the input vector for one RNN is the hidden state vector of the RNN below it. In this image we have the time axis going horizontally, and going upwards we have the different RNNs, so in this particular image there are three separate recurrent neural networks, each with their own set of weights, and these recurrent networks just feed into each other. This is all still trained jointly: there's no train-the-first-one-then-the-second-one; it's all just a single computational graph, and you backpropagate through it.

The recurrence formula at the top here I've rewritten slightly to make it more general, but we're still doing the exact same thing as before: we take a vector from below in depth and a vector from before in time, we concatenate them, put them through this W transformation, and squash with the tanh. If you're slightly confused by this: there was a Wxh times x plus Whh times h, and you can rewrite that as a concatenation of x and h multiplied by a single matrix. It's as if I stack x and h into a single column vector, and then I have this W matrix where Wxh is the first part and Whh is the second part, so the two-term formula can be rewritten as one where you stack your inputs and apply a single W transformation; it's the same formula. So that's how we can stack these RNNs, and they are now indexed by both the time and the layer at which they occur.
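That concatenation rewrite is easy to verify numerically; a tiny sketch with made-up sizes:

```python
import numpy as np

n = 4
x, h = np.random.randn(n, 1), np.random.randn(n, 1)
Wxh, Whh = np.random.randn(n, n), np.random.randn(n, n)

two_matrices = np.tanh(Wxh @ x + Whh @ h)
W = np.hstack([Wxh, Whh])                    # single matrix: [Wxh | Whh]
one_matrix = np.tanh(W @ np.vstack([x, h]))  # stacked [x; h] column vector
assert np.allclose(two_matrices, one_matrix)
```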
Now, another way we can make these more complex is not just by stacking them, but by using a slightly better recurrence formula. So far we've seen a very simple recurrence formula for the recurrent neural network; in practice you will actually rarely ever use a formula like this, and this basic RNN is very rarely used. Instead you'll use what we call an LSTM, a Long Short-Term Memory. This is basically what's used in all the papers now, and this is the formula you would also be using in your project if you were to use recurrent networks. What I'd like you to notice at this point is that everything is exactly the same as with an RNN; it's just that the recurrence formula is a slightly more complex function. We're still taking the vector from below in depth and the vector from before in time, the previous hidden state; we're concatenating them and putting them through a W transform, but now we have this more complex way of computing the new hidden state at this point in time. So we're just being slightly more complex about how we combine the vector from below and the vector from before to perform the update on the hidden state. We'll go into what motivates this formula and why it might be a better idea to use an LSTM, and it makes sense, trust me; we'll go through it right now.

If you look LSTMs up in some online video, or you go to Google Images, you'll find diagrams like this, which I think are really not helping anyone. The first time I saw these I was really scared; this one really scared me, I wasn't really sure what was going on. I understand LSTMs, and I still don't know what these two diagrams are. So I'm going to try to break down the LSTM; it's kind of a tricky thing to put into a diagram, you really have to step through it, so the lecture format is actually perfect for the LSTM.

Here we have the LSTM equations, and I'm going to focus on the first part at the top, where we take these two vectors, from below and from before: x and h, where h is our previous hidden state and x is our input, and we map them through that transformation W. If both x and h are of size n, so there are 2n numbers between them, we're going to end up producing 4n numbers through this W matrix, which is 4n by 2n. So we have these four n-dimensional vectors: i, f, o, and g. They're short for input, forget, and output; and g, I'm not sure what that's short for, it's just g. The i, f, and o go through sigmoid gates, and g goes through a tanh gate.

Now, for the way this actually works, one thing I forgot to mention on the previous slide: normally a recurrent network has a single h vector at every single time step, but an LSTM actually has two vectors at every single time step, the h and this c, the cell state vector. At every single time step we have both an h and a c vector, shown here in yellow, so we basically have two vectors at every single point, and what i, f, o, g are doing is basically operating over this cell state. Depending on what's before you and below you, and that is your context, you end up operating over the cell state with these i, f, o, g elements.
The way to think about this, and I'm going to go through it a lot: think of i, f, and o as binary, either 0 or 1. We want them to have a gate-like interpretation, so it's nice to think of them as zeros and ones; of course we actually make them sigmoids, because we want all of this to be differentiable so we can backpropagate through everything, but just think of i, f, o as binary values that we compute based on our context. Now look at what this formula is doing with c: based on what these gates are and what g is, we're going to end up updating this c value. In particular, this f is what we call the forget gate, and it will be used to shut off, that is, reset, some of the cells to zero. The cells are best thought of as counters. We can either reset these counters to zero with this f interaction (this is an element-wise multiplication; my laser pointer is running out of battery): if the f interaction is 0, you can see that we zero out the cell, so we reset the counter. Or we can add to the counter, through this interaction i times g: since i is between 0 and 1 and g is between negative one and one, we're basically adding a number between negative one and one to every cell. So at every single time step we have these counters in all the cells, and we can reset the counters to zero with the forget gate, or choose to add a number between negative one and one to every single cell. OK, so that's how we perform the cell update, and then the hidden state update ends up being a squashed cell, tanh of c, modulated by the output gate: only some of the cell state leaks into the hidden state, as modulated by this vector o, so we only choose to reveal some of the cells into the hidden state, in a learnable way.

There are several things to highlight here. Maybe the most confusing part is that we're adding a number between negative one and one with this i times g. That's kind of confusing because, if we only had a g there instead, g is already between negative one and one; so why do we need i times g, what is that actually giving us, when all we want is to increment c by a number between negative one and one? That's maybe the most puzzling part of the LSTM. I think one answer is that g is a linear function of your context, squashed by a tanh (does anyone have a laser pointer, by chance? OK). So g is a linear function of your previous context, squashed by the tanh, and if we were adding just g instead of i times g, that would be a rather simple function; by adding this i and having a multiplicative interaction, you get a richer function that you can express in terms of what we're adding to the cell state as a function of the previous hidden state. Another way to think about it is that we're decoupling two concepts: how much we want to add to the cell state, which is g, and whether we want to add to the cell state at all, which is i. So i is, roughly, whether we want this operation to go through, and g is what we want to add; decoupling these two maybe also gives the dynamics some nice properties in terms of how the LSTM trains. In any case, that's the LSTM formulas we end up with, and I'm going to go through this in more detail as well.
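Putting the equations just discussed into one sketch of a single LSTM step (biases omitted; W is the 4n by 2n matrix from the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    """x, h_prev, c_prev: (n, 1) vectors; W: (4n, 2n)."""
    n = h_prev.shape[0]
    z = W @ np.vstack([x, h_prev])     # 2n numbers in, 4n gate pre-activations out
    i = sigmoid(z[0 * n:1 * n])        # input gate, in (0, 1)
    f = sigmoid(z[1 * n:2 * n])        # forget gate, in (0, 1)
    o = sigmoid(z[2 * n:3 * n])        # output gate, in (0, 1)
    g = np.tanh(z[3 * n:4 * n])        # candidate, in (-1, 1)
    c = f * c_prev + i * g             # reset the counters and/or add to them
    h = o * np.tanh(c)                 # reveal part of the squashed cell state
    return h, c
```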
OK, so think of the cell c as flowing through. The first interaction here is the f dot c: f comes from a sigmoid, so we're basically gating the cells with a multiplicative interaction, and if f is zero you shut off the cell and reset the counter. The i times g part is basically adding to the cell state, and then the cell state leaks into the hidden state, but only through a tanh, and that gets gated by o, so the o vector decides which parts of the cell state to actually reveal into the hidden state. You'll also notice that this hidden state not only goes to the next iteration of the LSTM; it also flows up to the higher layers, because this is the hidden state vector that we end up plugging into the LSTMs above us, or it goes into a prediction. When you unroll this, it basically looks kind of like this, where now I have a confusing diagram of my own, I guess, that we ended up with: you get your input vectors from below, you have your hidden state from before, they determine your gates f, i, g, o, which are all n-dimensional vectors, and those end up modulating how you operate over the cell state. Once you reset some counters and add numbers between negative one and one to your counters, the cell state leaks out, some of it, in a learnable way, and then it can either go up to a prediction or go on to the next iteration of the LSTM going forward.

So that's the LSTM, and this looks ugly, so the question probably on your mind is: why did we go through all of this, why does it look this particular way? I should note at this point that
there are many variants of the LSTM at this point. People have played a lot with these equations, and we've kind of converged on this as being a reasonable thing, but there are many little tweaks you can make that don't deteriorate the performance by much: you can remove some of those gates, like maybe the input gate, and so on. It turns out the tanh of c there can be just a c and it works just fine normally, but with the tanh of c it's slightly better sometimes, and I don't think we have very good reasons for why. So we end up with a bit of a monster, but I think it actually kind of makes sense in terms of just these counters that can be reset to zero, or have small numbers between negative one and one added to them; seen that way, it's actually relatively simple.

Now, to understand exactly why this is much better than an RNN, we have to go to a slightly different picture to draw the distinction. The recurrent neural network has some state vector, and you're operating over it and completely transforming it through this recurrence formula, so you end up changing your state vector from time step to time step. You'll notice that the LSTM instead has the cell state flowing through, and what we're doing, effectively, is looking at the cells: some of it leaks into the hidden state, we use the hidden state to decide how to operate over the cell, and, if you ignore the forget gates for a moment, we end up basically just tweaking the cell by an additive interaction. There's some stuff we compute as a function of the cell state, and whatever it is, we end up adding it to the cell state instead of transforming the cell state outright; it's an additive instead of a transformative interaction, or something like that.

Now, does this remind you of something we've already covered in the class? That's right, yeah: in fact this is basically the same thing we saw in ResNets. Normally with a ConvNet we're transforming the representation, but a ResNet has these skip connections, and you'll see that ResNets have this same additive interaction: we have this x here, we do some computation based on x, and then we have an additive interaction with x, and so that's the
basic block of a ResNet, and that's in fact what happens in an LSTM as well. We have these additive interactions, where here the x is your cell: we go off, compute some function, and then choose to add to the cell state. The LSTM, unlike ResNets, also has these forget gates, and the forget gates can choose to shut off some parts of the signal as well, but otherwise it looks very much like a ResNet. I think it's kind of interesting that we're converging on very similar-looking architectures that work well both in ConvNets and in recurrent neural networks, where it seems like, dynamically, it's somehow much nicer to have these additive interactions that allow you to backpropagate much more effectively.

To that point, think about the backpropagation dynamics of the RNN versus the LSTM. In the LSTM especially, it's very clear that if I inject some gradient at some time step, say at the end of this diagram, then these plus interactions are just like a gradient superhighway: the gradients will flow through all of these addition interactions, because addition distributes the gradient equally, so if I plug in a gradient at any point in time here, it's just going to flow all the way back. Of course, the gradient also flows through the other branches, and they contribute their gradients into the flow, but you never end up with what we refer to in RNNs as the vanishing gradient problem, where the gradients just die off and go to zero as you backpropagate through; I'll show you concretely why that happens in a bit. So the plain RNN has this vanishing gradient problem, which I'll demonstrate, while in the LSTM, because of this superhighway of additions, the gradients that we inject from above at every single time step just flow through the cells, and your gradients don't end up vanishing. At this point maybe I'll take some questions: is anything confusing about the LSTM? After that I'm going to go into why RNNs have such poor gradient flow.

(Question, about whether one particular piece of the formula is important:) It turns out that, I think, that one specifically is not super important. There's a paper I'm going to show
+816
+00:58:16,659 --> 00:58:21,719
+They really played with this, took stuff out, put stuff in; there are also
+
+817
+00:58:21,719 --> 00:58:25,588
+these peephole connections you can add, so this cell state here can
+
+818
+00:58:25,588 --> 00:58:29,538
+actually be fed in together with the hidden state as an input. So people have really played
+
+819
+00:58:29,539 --> 00:58:32,049
+with this architecture and tried lots of variations of exactly these
+
+820
+00:58:32,048 --> 00:58:37,230
+equations, and what you end up with is that almost everything works about equally well; some
+
+821
+00:58:37,230 --> 00:58:40,490
+of it works slightly better or worse sometimes, so it's kind of confusing in
+
+822
+00:58:40,489 --> 00:58:45,699
+that way. There's also a paper where they treated the LSTM update
+
+823
+00:58:45,699 --> 00:58:49,538
+equations as just expression trees over the update equations, and then they did
+
+824
+00:58:49,539 --> 00:58:52,950
+this random mutation stuff and tried all kinds of different graphs and
+
+825
+00:58:52,949 --> 00:58:57,028
+updates you can have, and most of them work okay, some of them break, some of
+
+826
+00:58:57,028 --> 00:58:59,858
+them work about the same, but nothing really does much better than
+
+827
+00:58:59,858 --> 00:59:08,150
+an LSTM. Now let's get into why recurrent neural networks have
+
+828
+00:59:08,150 --> 00:59:15,389
+terrible backward flow. There's a video also
+
+829
+00:59:15,389 --> 00:59:22,000
+showing the vanishing gradients problem in recurrent neural networks
+
+830
+00:59:22,000 --> 00:59:29,250
+compared to LSTMs. What we're showing here is a recurrent
+
+831
+00:59:29,250 --> 00:59:33,039
+neural network over many, many time steps, and we're injecting gradient
+
+832
+00:59:33,039 --> 00:59:36,760
+at, say, the hundred and twenty-eighth time step, and we're backpropagating
+
+833
+00:59:36,760 --> 00:59:40,028
+gradients through the network, and we're looking at what the gradient is
+
+834
+00:59:40,028 --> 00:59:44,699
+for, I think, the input-to-hidden matrix, one of the weight matrices, at every
+
+835
+00:59:44,699 --> 00:59:49,009
+single time step. Remember that to actually get the full update in the
+
+836
+00:59:49,010 --> 00:59:52,289
+backward pass, we're actually adding up all those gradients. So what's
+
+837
+00:59:52,289 --> 00:59:56,760
+being shown here is that we only inject a gradient at
+
+838
+00:59:56,760 --> 01:00:00,799
+the 128th time step, we do backprop back through time, and we draw the slices
+
+839
+01:00:00,798 --> 01:00:04,088
+of that backpropagation. What you're seeing is that the LSTM gives you lots of
+
+840
+01:00:04,088 --> 01:00:06,699
+gradient throughout this backpropagation, so there's lots of
+
+841
+01:00:06,699 --> 01:00:11,000
+information flowing back, while in the RNN it just instantly dies off; the
+
+842
+01:00:11,000 --> 01:00:15,210
+gradient, as we say, vanishes: it just becomes tiny numbers, there's no
+
+843
+01:00:15,210 --> 01:00:18,750
+gradient. So in this case, I think after about
+
+844
+01:00:18,750 --> 01:00:22,679
+10 time steps or so, all the information that we injected did not
+
+845
+01:00:22,679 --> 01:00:26,149
+flow through the network, and you can't learn very long dependencies, because all
+
+846
+01:00:26,150 --> 01:00:29,720
+the correlation structure has just died down. So we'll see why this
+847
+01:00:29,719 --> 01:00:39,399
+happens dynamically in a bit. [There are some comments on the video channel — so funny, it's like
+
+848
+01:00:39,400 --> 01:00:40,490
+YouTube or something.]
+
+849
+01:00:40,489 --> 01:00:44,779
+OK.
+
+850
+01:00:44,780 --> 01:00:53,170
+OK, so let's look at a very simple example. Here we have a recurrent neural network
+
+851
+01:00:53,170 --> 01:00:56,300
+that I'm going to unroll for you. In this recurrent neural network I'm not showing
+
+852
+01:00:56,300 --> 01:01:03,960
+any inputs; we only have the state updates, so the weight matrix Whh is the
+
+853
+01:01:03,960 --> 01:01:07,260
+hidden-to-hidden interaction, and I'm going to basically forward this recurrent
+
+854
+01:01:07,260 --> 01:01:12,380
+neural network for some T time steps; here I'm using T = 50. So
+
+855
+01:01:12,380 --> 01:01:16,260
+what I'm doing is Whh times the previous hidden state, and then on top of that
+
+856
+01:01:16,260 --> 01:01:20,570
+a threshold. This is just a forward pass, ignoring any input vectors coming in: it's
+
+857
+01:01:20,570 --> 01:01:25,280
+just Whh times h, threshold, Whh times h, threshold, and so on.
+
+858
+01:01:25,280 --> 01:01:29,500
+That's the forward pass, and then the backward pass here, where I'm injecting a
+
+859
+01:01:29,500 --> 01:01:33,820
+random gradient at the last time step, the 50th time step. I'm
+
+860
+01:01:33,820 --> 01:01:37,880
+injecting a gradient which is random, and then I go backwards and I backprop. So
+
+861
+01:01:37,880 --> 01:01:41,059
+when you backprop through this, right, you have to backprop through — I'm using a
+
+862
+01:01:41,059 --> 01:01:46,170
+ReLU — and then backprop through a Whh multiply, then another Whh multiply, and so on.
+
+863
+01:01:46,170 --> 01:01:51,800
+So the thing to note here is: here I'm doing the ReLU backward pass,
+
+864
+01:01:51,800 --> 01:01:54,980
+backpropagating through the ReLU by just zeroing out anything where the inputs
+
+865
+01:01:54,980 --> 01:02:02,309
+were less than zero, and here I'm backpropping the Whh times h operation,
+
+866
+01:02:02,309 --> 01:02:06,570
+where we actually multiplied by the Whh matrix before we did the nonlinearity. So
+
+867
+01:02:06,570 --> 01:02:09,570
+there's something very funky going on when you actually look at what happens
+
+868
+01:02:09,570 --> 01:02:13,300
+to these dhs, which are the gradients on the h's, as you go backwards through time:
+
+869
+01:02:13,300 --> 01:02:18,160
+it has a kind of funny structure that is very worrying when you look at
+
+870
+01:02:18,159 --> 01:02:22,210
+how this gets chained up in the loop — look at what we're doing here across
+
+871
+01:02:22,210 --> 01:02:33,409
+these time steps.
+
+872
+01:02:33,409 --> 01:02:43,849
+[A student suggests zeros.] Yes, I think at some time steps maybe the outputs of the ReLUs were all
+
+873
+01:02:43,849 --> 01:02:47,630
+dead, and that would have killed it, but that's not really the issue here. The
+
+874
+01:02:47,630 --> 01:02:51,470
+more worrying issue — well, that would be an issue as well, but I think the one worrying
+
+875
+01:02:51,469 --> 01:02:55,500
+issue that people can easily spot is that you'll see that we're
+
+876
+01:02:55,500 --> 01:03:00,380
+multiplying by this Whh matrix over and over and over again, because in the
+
+877
+01:03:00,380 --> 01:03:04,840
+forward pass we multiply by Whh at every single iteration.
+
+878
+01:03:04,840 --> 01:03:09,670
+So as the gradient backpropagates through all the hidden states, we end up backpropagating through that same recurrence over and over.
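The toy experiment being walked through here can be reconstructed roughly as follows — a hedged sketch in the spirit of the slide, not a verbatim copy. It forwards T = 50 steps of `Whh` times `h` plus a ReLU threshold, then injects a random gradient at the last step and backprops, so you can watch the gradient norms vanish or explode:

```python
# Hedged reconstruction of the toy example: an RNN with no inputs,
# just repeated Whh multiplies and thresholding, then a backward pass.
import numpy as np

H, T = 5, 50                       # hidden size, number of time steps
Whh = np.random.randn(H, H)        # hidden-to-hidden weight matrix

# forward pass (ignoring any input vectors)
hs = {-1: np.random.randn(H)}
for t in range(T):
    hs[t] = np.maximum(0, Whh.dot(hs[t - 1]))   # Whh times h, then ReLU threshold

# backward pass: inject a random gradient at the last time step
dhs = {T - 1: np.random.randn(H)}
for t in reversed(range(T)):
    dss = (hs[t] > 0) * dhs[t]                  # backprop through the ReLU
    dhs[t - 1] = Whh.T.dot(dss)                 # backprop through the Whh multiply

# watch the gradient norms vanish or explode going backwards in time
print([float(np.linalg.norm(dhs[t])) for t in range(-1, T - 1)])
```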
+879
+01:03:09,670 --> 01:03:13,820
+The backprop through this formula, which gets chained, turns out to actually be that
+
+880
+01:03:13,820 --> 01:03:19,000
+you take your gradient signal and multiply it by the Whh matrix, and so we end
+
+881
+01:03:19,000 --> 01:03:26,199
+up with: the gradient gets multiplied by Whh, thresholded, then multiplied by Whh,
+
+882
+01:03:26,199 --> 01:03:32,019
+thresholded — so we end up multiplying by this matrix Whh fifty times. And so the
+
+883
+01:03:32,019 --> 01:03:37,509
+issue with this is that, for the gradient signal, basically two things can happen. If
+
+884
+01:03:37,510 --> 01:03:41,080
+you think about working with scalar values — suppose these were scalars, not matrices —
+
+885
+01:03:41,079 --> 01:03:45,469
+if I take a number that's random, and then I have a second number, and I keep
+
+886
+01:03:45,469 --> 01:03:48,509
+multiplying the first number by the second number, so again and again and
+
+887
+01:03:48,510 --> 01:03:55,990
+again, what does that sequence go to? If I keep multiplying by the same
+
+888
+01:03:55,989 --> 01:04:01,849
+number, either it dies off to zero or it just explodes — unless your second number is exactly
+
+889
+01:04:01,849 --> 01:04:05,119
+one, which is the only case where you don't actually explode or die; otherwise
+
+890
+01:04:05,119 --> 01:04:09,679
+really bad things are happening: we either die or we explode. And here we have
+
+891
+01:04:09,679 --> 01:04:12,659
+matrices, we don't have a single number, but in fact the same thing happens; a
+
+892
+01:04:12,659 --> 01:04:16,599
+generalization of it happens: if the spectral radius of the Whh matrix,
+
+893
+01:04:16,599 --> 01:04:21,839
+which is the largest eigenvalue of that matrix, is greater than one, then this
+
+894
+01:04:21,840 --> 01:04:25,220
+gradient signal will explode; if it's lower than one, the gradient will completely die.
+
+895
+01:04:25,219 --> 01:04:30,549
+And so basically, because of this recurrence
+
+896
+01:04:30,550 --> 01:04:34,680
+formula, we end up with these just terrible dynamics: it's very unstable
+
+897
+01:04:34,679 --> 01:04:39,949
+and just dies or explodes. And so in practice, the way this was handled:
+
+898
+01:04:39,949 --> 01:04:44,439
+you can control the exploding gradients with one simple hack — if your gradient is
+
+899
+01:04:44,440 --> 01:04:45,720
+exploding, you clip it.
+
+900
+01:04:45,719 --> 01:04:50,789
+So people actually do this in practice; it's like a very patchy solution, but if
+
+901
+01:04:50,789 --> 01:04:55,119
+your gradient goes above five in norm, then clamp it to a norm of five, element-wise or
+
+902
+01:04:55,119 --> 01:04:58,150
+something like that. So you can do this gradient clipping; that's how you
+
+903
+01:04:58,150 --> 01:05:01,829
+address the exploding gradient problem, and then your gradients don't
+
+904
+01:05:01,829 --> 01:05:06,049
+explode anymore. But the gradients can still vanish in a vanilla RNN, and the
+
+905
+01:05:06,050 --> 01:05:08,310
+LSTM is very good with the vanishing gradient problem because of these
+
+906
+01:05:08,309 --> 01:05:12,429
+highways of cells that are only changed with additive interactions: the
+
+907
+01:05:12,429 --> 01:05:17,309
+gradients just flow, they never die down the way they do when you're
+
+908
+01:05:17,309 --> 01:05:21,000
+multiplying by the same matrix over and over. That's roughly why LSTMs are
+
+909
+01:05:21,000 --> 01:05:26,909
+just better dynamically. So we always use LSTMs, and we usually do gradient clipping.
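A hedged sketch of the gradient clipping hack just described; the threshold of five in norm is the example value from the lecture, not a universal constant:

```python
# Hedged sketch of gradient norm clipping
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)   # rescale so the norm is exactly max_norm
    return grad
```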
+910
+01:05:26,909 --> 01:05:30,149
+That's because the gradients in an LSTM can still potentially explode,
+
+911
+01:05:30,150 --> 01:05:33,400
+even though they don't usually vanish.
+
+912
+01:05:33,400 --> 01:05:48,608
+[Student question about using ReLUs in recurrent neural networks as well.] For LSTMs it's not clear where you
+
+913
+01:05:48,608 --> 01:05:53,769
+would plug one in; it's not clear in these equations exactly how you would plug
+
+914
+01:05:53,769 --> 01:06:00,619
+in a ReLU, and where. Maybe instead of the tanh for g you could put a
+
+915
+01:06:00,619 --> 01:06:08,690
+ReLU here, but then the cells would only grow in a single direction, right? So
+
+916
+01:06:08,690 --> 01:06:11,980
+then you can't actually end up making the cell smaller, so that's not a great
+
+917
+01:06:11,980 --> 01:06:18,539
+idea, I suppose. So basically there's no clear way to plug
+
+918
+01:06:18,539 --> 01:06:25,380
+in a ReLU here. One thing to note, in terms of these gradient superhighways:
+
+919
+01:06:25,380 --> 01:06:29,780
+this viewpoint actually breaks down when you have forget gates,
+
+920
+01:06:29,780 --> 01:06:33,310
+because when you have forget gates, where we can forget some of the cell state
+
+921
+01:06:33,309 --> 01:06:37,150
+with a multiplicative interaction, then whenever a forget gate turns on and
+
+922
+01:06:37,150 --> 01:06:41,470
+kills the gradient, of course the backward flow will stop. So these super-
+
+923
+01:06:41,469 --> 01:06:45,250
+highways are only really true if you don't have any forget gates; if you
+
+924
+01:06:45,250 --> 01:06:50,000
+have a forget gate there, then it can kill the gradient. And so in practice,
+
+925
+01:06:50,000 --> 01:06:54,710
+when we play with LSTMs, sometimes when
+
+926
+01:06:54,710 --> 01:06:58,099
+people initialize the forget gate, they initialize it with a positive bias, because
+
+927
+01:06:58,099 --> 01:06:58,769
+that biases
+
+928
+01:06:58,769 --> 01:07:05,699
+the forget gates toward being turned on, so the forgetting is kind of turned off in the beginning.
+
+929
+01:07:05,699 --> 01:07:08,679
+So in the beginning the gradients flow very well, and then the LSTM can learn how
+
+930
+01:07:08,679 --> 01:07:12,779
+to shut them off once it wants to later on. So people play with that bias on the
+
+931
+01:07:12,780 --> 01:07:17,530
+forget gates sometimes. And the last thing here I wanted to mention about LSTMs:
+
+932
+01:07:17,530 --> 01:07:21,580
+many people have basically played with this quite a bit. There's the Search
+
+933
+01:07:21,579 --> 01:07:26,119
+Space Odyssey paper, where they tried various changes to the architecture; there's a
+
+934
+01:07:26,119 --> 01:07:32,829
+paper here that tries to do a search over a huge number of potential changes to
+
+935
+01:07:32,829 --> 01:07:36,940
+the LSTM equations, and they did a large search and they didn't find anything
+
+936
+01:07:36,940 --> 01:07:42,300
+that works substantially better than just an LSTM. And then there's
+
+937
+01:07:42,300 --> 01:07:45,560
+the GRU, which is actually relatively popular, and I would actually
+
+938
+01:07:45,559 --> 01:07:50,159
+recommend that you might want to use it. The GRU is a change on the LSTM; it
+
+939
+01:07:50,159 --> 01:07:54,460
+also has these additive interactions, and what's nice about it is that it's a shorter,
+
+940
+01:07:54,460 --> 01:07:59,400
+smaller formula, and it only has a single vector: it doesn't have a cell state c, it
+
+941
+01:07:59,400 --> 01:08:03,130
+only has an h.
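For reference, here is a hedged NumPy sketch of a single GRU step, using the standard GRU equations rather than anything shown in the lecture. There is only the single state vector `h`, and the final interpolation plays the role of the additive update:

```python
# Hedged sketch of a GRU step; weight names are assumptions
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gru_step(x, h, Wxz, Whz, Wxr, Whr, Wxh, Whh):
    z = sigmoid(Wxz.dot(x) + Whz.dot(h))             # update gate
    r = sigmoid(Wxr.dot(x) + Whr.dot(h))             # reset gate
    h_tilde = np.tanh(Wxh.dot(x) + Whh.dot(r * h))   # candidate state
    return (1 - z) * h + z * h_tilde                 # interpolation: the additive-style update
```

Swapping a step like this in for the LSTM step is what gives the "smaller formula, single vector" simplification praised above.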
+942
+01:08:03,130 --> 01:08:07,590
+So implementation-wise it's just nicer: you keep just a single vector, not two vectors, in your forward pass; it's just a smaller, simpler thing
+
+943
+01:08:07,590 --> 01:08:12,190
+that seems to have most of the benefits of an LSTM. So it's called the GRU, and it
+
+944
+01:08:12,190 --> 01:08:16,730
+almost always works about as well as an LSTM in my experience, so you might
+
+945
+01:08:16,729 --> 01:08:19,939
+want to use it, or you can use the LSTM; they both kind of work about the same.
+
+946
+01:08:19,939 --> 01:08:28,088
+So the summary is that RNNs are very nice, but the raw RNN does not actually
+
+947
+01:08:28,088 --> 01:08:29,130
+work very well,
+
+948
+01:08:29,130 --> 01:08:32,420
+so use LSTMs instead. What's nice about them is that we're having
+
+949
+01:08:32,420 --> 01:08:36,000
+these additive interactions that allow gradients to flow much better, and you don't
+
+950
+01:08:36,000 --> 01:08:39,579
+get a vanishing gradient problem. We still have to worry a bit about the exploding
+
+951
+01:08:39,579 --> 01:08:44,269
+gradient problem, so it's common to see people clip these gradients sometimes. And I
+
+952
+01:08:44,270 --> 01:08:46,670
+would say the open directions are better, simpler architectures and really trying to
+
+953
+01:08:46,670 --> 01:08:50,838
+understand how come — there's something deeper going on with the connection
+
+954
+01:08:50,838 --> 01:08:53,899
+between ResNets and LSTMs, and there's something deeper about these
+
+955
+01:08:53,899 --> 01:08:57,579
+additive interactions that I think we're not fully understanding yet: exactly why this
+
+956
+01:08:57,579 --> 01:09:02,210
+works so well and which parts of it are crucial. So I think we need to
+
+957
+01:09:02,210 --> 01:09:05,119
+understand this space both theoretically and empirically, and it's a very
+
+958
+01:09:05,119 --> 01:09:10,979
+wide open area of research.
+
+959
+01:09:10,979 --> 01:09:23,469
+[Question at the end of class about whether LSTM gradients can explode.] I suppose they can explode; it's not
+
+960
+01:09:23,470 --> 01:09:27,020
+as clear why they would, but you keep injecting gradient into the cell state,
+
+961
+01:09:27,020 --> 01:09:30,069
+and so maybe the gradient can sometimes get larger.
+
+962
+01:09:30,069 --> 01:09:33,960
+It's common to clip them, but I think it's maybe not as important as in an RNN,
+
+963
+01:09:33,960 --> 01:09:40,829
+and I'm not a hundred percent sure about that point. [Question about a biological basis:] I
+
+964
+01:09:40,829 --> 01:09:46,640
+have no idea — that's interesting. Yeah, I think we should end here, but I'm
+
+965
+01:09:46,640 --> 01:09:47,569
+happy to take your questions here.
+
diff --git a/captions/En/Lecture11_en.srt b/captions/En/Lecture11_en.srt
new file mode 100644
index 00000000..4aefae64
--- /dev/null
+++ b/captions/En/Lecture11_en.srt
@@ -0,0 +1,4833 @@
+1
+00:00:00,000 --> 00:00:03,428
+All right, we have a lot of stuff to get through today, so I'd like to get started.
+
+2
+00:00:03,428 --> 00:00:08,669
+Today we're going to talk about CNNs in practice, and talk about a lot of
+
+3
+00:00:08,669 --> 00:00:12,050
+really low-level implementation details that really come up in getting
+
+4
+00:00:12,050 --> 00:00:15,980
+these things to work when you're actually training things. But first, as
+
+5
+00:00:15,980 --> 00:00:20,189
+usual, we have some administrative stuff to talk about. Number one is that, through
+
+6
+00:00:20,189 --> 00:00:24,600
+a really heroic effort by all the TAs, all the
midterms are graded, so you
+
+7
+00:00:24,600 --> 00:00:27,740
+guys should definitely thank them for that, and you can either pick them up
+
+8
+00:00:27,739 --> 00:00:34,920
+after class today or in any of these office hours that are up here. Also keep
+
+9
+00:00:34,920 --> 00:00:38,609
+in mind that your project milestones are going to be due tonight at midnight, so —
+
+10
+00:00:38,609 --> 00:00:41,628
+I hope you've been working on your projects
+
+11
+00:00:41,628 --> 00:00:45,579
+for the last week or so and have made some really exciting progress —
+
+12
+00:00:45,579 --> 00:00:51,289
+make sure to write that up and put it in the assignments tab on Dropbox — no, no, not
+
+13
+00:00:51,289 --> 00:00:55,460
+on Dropbox, but on the assignments tab on Coursework. Sorry, I know this is
+
+14
+00:00:55,460 --> 00:00:58,910
+really confusing, but the assignments tab, just like assignment two.
+
+15
+00:00:58,909 --> 00:01:04,000
+Speaking of assignment two, we're working on grading; hopefully we'll have that done sometime
+
+16
+00:01:04,000 --> 00:01:10,140
+this week. And remember that assignment three is out. So how's that been going?
+
+17
+00:01:10,140 --> 00:01:17,159
+Anyone? Anyone done? Okay, that's good, one person's done, so the rest of you should get
+
+18
+00:01:17,159 --> 00:01:22,740
+started, because it's due in a week. So we have some fun stats from the midterm.
+
+19
+00:01:22,739 --> 00:01:26,379
+Don't freak out when you see your grade, because we actually had this really nice,
+
+20
+00:01:26,379 --> 00:01:30,759
+beautiful Gaussian distribution with a beautiful standard deviation; we don't
+
+21
+00:01:30,759 --> 00:01:34,549
+need to batch-normalize this thing, it's already perfect. I'd also like to point
+
+22
+00:01:34,549 --> 00:01:38,049
+out that someone got a max score of a hundred and three, which means they got
+
+23
+00:01:38,049 --> 00:01:43,470
+everything right plus the bonus, so that means it wasn't hard enough, maybe.
+
+24
+00:01:43,469 --> 00:01:49,500
+We also have some per-question stats: that's the per-question breakdown of average score
+
+25
+00:01:49,500 --> 00:01:52,450
+for every single question in the midterm. So if you got something
+
+26
+00:01:52,450 --> 00:01:55,510
+wrong and you want to see if everyone else got it wrong too, you can go check
+
+27
+00:01:55,510 --> 00:01:59,380
+these stats later on your own time. We have stats for the true/false
+
+28
+00:01:59,379 --> 00:02:00,959
+and the multiple choice;
+
+29
+00:02:00,959 --> 00:02:04,729
+keep in mind that for two of the true/false questions, we decided during grading
+
+30
+00:02:04,730 --> 00:02:07,090
+that they were a little bit unfair, so we threw them out and just gave you all the
+
+31
+00:02:07,090 --> 00:02:12,960
+points, which is why two of those are at a hundred percent. We have these stats for
+
+32
+00:02:12,960 --> 00:02:19,810
+all the individual questions, so go ahead and have fun with those later.
+
+33
+00:02:19,810 --> 00:02:24,379
+Last time — I know it's been a while, we had a midterm and we had a holiday,
+
+34
+00:02:24,379 --> 00:02:28,030
+but if you can remember, like over a week ago, we were talking about recurrent
+
+35
+00:02:28,030 --> 00:02:31,509
+networks. We talked about how recurrent networks can be used for modeling
+
+36
+00:02:31,509 --> 00:02:35,500
+sequences: normally these feedforward networks take an input and
+
+37
+00:02:35,500 --> 00:02:39,139
+model one fixed function, but for these recurrent networks
we talked about how
+
+38
+00:02:39,139 --> 00:02:43,208
+they can model different kinds of sequence problems. We talked about two
+
+39
+00:02:43,209 --> 00:02:48,319
+particular implementations of recurrent networks, vanilla RNNs and LSTMs, and you
+
+40
+00:02:48,319 --> 00:02:51,539
+implement both of those on the assignment, so you should know what they
+
+41
+00:02:51,539 --> 00:02:56,079
+are. We talked about how these recurrent neural networks can be
+
+42
+00:02:56,080 --> 00:03:01,010
+used for language models, and had some fun showing some sampled generated text
+
+43
+00:03:01,009 --> 00:03:06,329
+on things like Shakespeare and algebraic geometry. We talked about how
+
+44
+00:03:06,330 --> 00:03:09,590
+we can combine recurrent networks with convolutional networks to do image
+
+45
+00:03:09,590 --> 00:03:14,180
+captioning, and we played a little bit this game of being RNN neuroscientists,
+
+46
+00:03:14,180 --> 00:03:17,700
+diving into the cells of the RNNs and trying to interpret what
+
+47
+00:03:17,699 --> 00:03:21,879
+they're doing, and we saw that sometimes we have these interpretable cells that
+
+48
+00:03:21,879 --> 00:03:27,049
+are, for example, activating inside if statements, which is pretty cool. But
+
+49
+00:03:27,049 --> 00:03:28,890
+today we're going to talk about something totally different.
+
+50
+00:03:28,889 --> 00:03:33,339
+Today we're gonna talk about really a lot of low-level things that
+
+51
+00:03:33,340 --> 00:03:37,830
+you need to know to get CNNs working in practice. So there are three major themes;
+
+52
+00:03:37,830 --> 00:03:41,600
+it's a little bit of a potpourri, but we're going to try to tie it together. So
+
+53
+00:03:41,599 --> 00:03:45,349
+the first is really squeezing all the juice that you can out of your data. I
+
+54
+00:03:45,349 --> 00:03:48,219
+know a lot of you, especially for projects, don't have large datasets;
+
+55
+00:03:48,219 --> 00:03:51,789
+we're going to talk about data augmentation and transfer learning, which
+
+56
+00:03:51,789 --> 00:03:55,079
+are two really powerful, useful techniques, especially when you're
+
+57
+00:03:55,080 --> 00:03:56,350
+working with small datasets.
+
+58
+00:03:56,349 --> 00:04:00,889
+We're going to really dive deep into convolutions and talk a lot more about
+
+59
+00:04:00,889 --> 00:04:05,959
+those: both how you can design efficient architectures using convolutions and
+
+60
+00:04:05,960 --> 00:04:10,480
+also how convolutions are efficiently implemented in practice. And then finally
+
+61
+00:04:10,479 --> 00:04:13,269
+we're gonna talk about something that usually gets lumped under implementation
+
+62
+00:04:13,270 --> 00:04:17,480
+details and doesn't even make it into papers, but that's stuff like: what are a
+
+63
+00:04:17,480 --> 00:04:21,750
+CPU and GPU, what kinds of bottlenecks you experience in training, how you
+
+64
+00:04:21,750 --> 00:04:26,069
+distribute training over multiple devices. That's a lot of stuff;
+
+65
+00:04:26,069 --> 00:04:31,620
+we should get started. So first let's talk about data augmentation. I think
+
+66
+00:04:31,620 --> 00:04:34,910
+we've sort of mentioned this maybe in passing so far in the lectures, but never
+
+67
+00:04:34,910 --> 00:04:39,780
+really talked about it. So normally when you're training CNNs, you're really
+
+68
+00:04:39,779 --> 00:04:44,179
+familiar with this type of pipeline: during training you're gonna load images
+
+69
+00:04:44,180 --> 00:04:48,379
+and labels off the disk, you're gonna feed the image through to your CNN, then
+
+70
+00:04:48,379 --> 00:04:51,009
+you're going to use the image together with the label to compute some loss
+
+71
+00:04:51,009 --> 00:04:55,610
+function, and backpropagate, update the CNN, and repeat forever. So you
+
+72
+00:04:55,610 --> 00:05:00,970
+should be really familiar with that by now. The thing about data augmentation is
+
+73
+00:05:00,970 --> 00:05:05,960
+we just add one little step to this pipeline, which is here: after we load
+
+74
+00:05:05,959 --> 00:05:09,849
+the image off the disk, we're going to transform it in some way before passing
+
+75
+00:05:09,850 --> 00:05:13,910
+it to the CNN, and this transformation should preserve the label.
+
+76
+00:05:13,910 --> 00:05:19,090
+Then we compute the loss and backpropagate into the CNN as before. So it's really simple, and the trick is just
+
+77
+00:05:19,089 --> 00:05:24,089
+what kinds of transformations you should be using for data augmentation. The idea is
+
+78
+00:05:24,089 --> 00:05:27,679
+really simple: it's sort of this way that lets you artificially expand your training
+
+79
+00:05:27,680 --> 00:05:32,030
+set through clever usage of different kinds of transformations. If you
+
+80
+00:05:32,029 --> 00:05:35,409
+remember, the computer is really seeing these images as these giant grids of
+
+81
+00:05:35,410 --> 00:05:39,189
+pixels, and there are these different kinds of transformations we can make
+
+82
+00:05:39,189 --> 00:05:43,230
+that should preserve the label but which will change all the pixels. If you
+
+83
+00:05:43,230 --> 00:05:46,770
+imagine, like, shifting that cat one pixel to the left, it's still a cat, but all the
+
+84
+00:05:46,769 --> 00:05:50,539
+pixels are going to change. So when you talk about data augmentation,
+
+85
+00:05:50,540 --> 00:05:54,680
+you sort of imagine that you're expanding your training set; these
+
+86
+00:05:54,680 --> 00:05:58,629
+new basically-free training samples will be correlated, but it will still
+
+87
+00:05:58,629 --> 00:06:03,389
+help you train bigger models while preventing overfitting. And this is very,
+
+88
+00:06:03,389 --> 00:06:04,959
+very widely used in practice:
+
+89
+00:06:04,959 --> 00:06:08,668
+pretty much any CNN you see that's winning competitions or doing well on benchmarks is using some data
+
+91
+00:06:09,810 --> 00:06:15,889
+augmentation. So the easiest form of data augmentation is horizontal flipping: if
+
+92
+00:06:15,889 --> 00:06:18,699
+we think about this cat, when you look at the mirror image, the mirror image should
+
+93
+00:06:18,699 --> 00:06:22,949
+still be a cat. And this is really, really easy to implement: in numpy you can just
+
+94
+00:06:22,949 --> 00:06:27,159
+do it with a single call, a single line of code; it's similarly easy in Torch and other
+
+95
+00:06:27,160 --> 00:06:32,040
+frameworks. This is really easy and very widely used. Something else that's very
+
+96
+00:06:32,040 --> 00:06:37,120
+widely used is to take random crops from the training images. So at training time
+
+97
+00:06:37,120 --> 00:06:40,949
+we're gonna load up our image, and we're gonna take a patch of that image at a
+
+98
+00:06:40,949 --> 00:06:42,629
+random scale and location,
+
+99
+00:06:42,629 --> 00:06:47,189
+resize it to whatever fixed size our CNN is expecting, and then use that as our
+
+100
+00:06:47,189 --> 00:06:51,389
+training example. And again, this is very, very widely used.
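A hedged sketch of these two augmentations in NumPy; the crop size of 224 is just an example value, and the image is assumed to be at least that large in each dimension:

```python
# Hedged sketch: horizontal flip (one call) plus a random crop
import numpy as np

def random_flip_and_crop(image, crop_size=224):
    if np.random.rand() < 0.5:
        image = np.fliplr(image)                 # horizontal flip: a single call
    H, W, _ = image.shape                        # assumes H, W >= crop_size
    y = np.random.randint(0, H - crop_size + 1)  # random crop location
    x = np.random.randint(0, W - crop_size + 1)
    return image[y:y + crop_size, x:x + crop_size]
```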
+101
+00:06:51,389 --> 00:06:56,610
+To give you a flavour of how exactly this is used, I looked up the details for ResNets. They
+
+102
+00:06:56,610 --> 00:07:01,639
+actually, at training time, for each training image pick a random number,
+
+103
+00:07:01,639 --> 00:07:05,620
+resize the whole image so that the shorter side is that number, then sample
+
+104
+00:07:05,620 --> 00:07:09,720
+a random 224 by 224 crop from the resized image, and then use that as
+
+105
+00:07:09,720 --> 00:07:13,990
+their training sample. So that's pretty easy to implement and usually helps
+
+106
+00:07:13,990 --> 00:07:20,560
+quite a bit. When you're using this form of data augmentation, usually things
+
+107
+00:07:20,560 --> 00:07:25,269
+change a little bit at test time. At training time, when using this form of
+
+108
+00:07:25,269 --> 00:07:29,079
+data augmentation, the network is not really trained on full images; it's trained
+
+109
+00:07:29,079 --> 00:07:34,219
+on these crops. So it doesn't really make sense, or seem fair, to try to force the
+
+110
+00:07:34,220 --> 00:07:38,900
+network to look at the whole image at test time. So usually in practice, when
+
+111
+00:07:38,899 --> 00:07:42,879
+you're doing this kind of random cropping for data augmentation, at test
+
+112
+00:07:42,879 --> 00:07:48,379
+time you'll have some fixed set of crops and use these for testing. Very
+
+113
+00:07:48,379 --> 00:07:52,019
+commonly you'll see ten crops: you'll take the upper left hand
+
+114
+00:07:52,019 --> 00:07:52,649
+corner,
+
+115
+00:07:52,649 --> 00:07:56,189
+the upper right hand corner, the two bottom corners, and the center — that gives you
+
+116
+00:07:56,189 --> 00:08:00,800
+five — together with their horizontal flips, which gives you 10. You'll take those 10 crops at
+
+117
+00:08:00,800 --> 00:08:06,460
+test time, pass them through the network, and average the scores of those 10 crops.
+
+118
+00:08:06,459 --> 00:08:09,519
+ResNet actually takes this a little bit one step further and actually does
+
+119
+00:08:09,519 --> 00:08:14,759
+multiple scales at test time as well. This is something that tends to
+
+120
+00:08:14,759 --> 00:08:20,649
+help performance in practice, and again it's very easy to implement and very widely used.
+
+121
+00:08:20,649 --> 00:08:26,418
+Another thing that we usually do for data augmentation is color jittering. If
+
+122
+00:08:26,418 --> 00:08:29,529
+you take this picture of a cat — maybe it was a little bit cloudier that
+
+123
+00:08:29,529 --> 00:08:33,348
+day, a little bit sunnier that day — and if we had taken the picture then, a lot
+
+124
+00:08:33,349 --> 00:08:37,070
+of the colors would have been quite different. So one thing that's very
+
+125
+00:08:37,070 --> 00:08:40,360
+common to do is just change the color of our training images a little bit before
+
+126
+00:08:40,360 --> 00:08:45,539
+they get to the CNN. A very simple way is just to change the contrast; this is
+
+127
+00:08:45,539 --> 00:08:50,469
+very easy to implement, very simple to do. But actually in practice you'll see
+
+128
+00:08:50,470 --> 00:08:55,759
+that this contrast jittering is a little bit less common, and what you see instead is
+
+129
+00:08:55,759 --> 00:09:01,259
+this slightly more complex pipeline using principal component analysis over
+
+130
+00:09:01,259 --> 00:09:06,439
+all the pixels of the training data. The idea is that each pixel in the
+
+131
+00:09:06,440 --> 00:09:11,390
+training data is this vector of length 3, an RGB value.
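A hedged sketch of the simple contrast jittering just mentioned; the jitter range of 0.8 to 1.2 and the 0-255 pixel range are assumptions, not values from the lecture:

```python
# Hedged sketch: scale pixel deviations from the mean by a random factor
import numpy as np

def jitter_contrast(image, low=0.8, high=1.2):
    alpha = np.random.uniform(low, high)             # random contrast factor
    mean = image.mean(axis=(0, 1), keepdims=True)    # per-channel mean color
    return np.clip((image - mean) * alpha + mean, 0, 255)
```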
+132
+00:09:11,389 --> 00:09:15,129
+If we collect those pixels over the entire training data, you get a sense of what kinds of colors
+
+133
+00:09:15,129 --> 00:09:19,330
+generally exist in the training data. Then principal component analysis
+
+134
+00:09:19,330 --> 00:09:23,930
+gives us three principal component directions in color space that kind of
+
+135
+00:09:23,929 --> 00:09:27,879
+tell us what the directions are along which color tends to vary in the dataset.
+
+136
+00:09:27,879 --> 00:09:32,429
+Then, at training time, for color augmentation,
+
+137
+00:09:32,429 --> 00:09:35,889
+we can actually use these principal components of the color of the training
+
+138
+00:09:35,889 --> 00:09:41,419
+set to choose exactly how to jitter the color at training time. This is again a
+
+139
+00:09:41,419 --> 00:09:46,719
+little bit more complicated, but it is pretty widely used. This type of PCA-
+
+140
+00:09:46,720 --> 00:09:51,580
+driven data augmentation for color, I think, was introduced with the AlexNet
+
+141
+00:09:51,580 --> 00:09:58,310
+paper in 2012, and it's also used in ResNet, for example. So data augmentation
+
+142
+00:09:58,309 --> 00:10:02,829
+is really this very general thing, right? You just want to think about, for your
+
+143
+00:10:02,830 --> 00:10:06,420
+dataset, what kinds of transformations you want your classifier to be
+
+144
+00:10:06,419 --> 00:10:11,179
+invariant to, and then you want to introduce those types of variations to
+
+145
+00:10:11,179 --> 00:10:15,229
+your training data at training time. And you can really go crazy here and get
+
+146
+00:10:15,230 --> 00:10:18,740
+creative and really think about your data and what types of invariances
+
+147
+00:10:18,740 --> 00:10:23,659
+make sense for your data. So you might want to try, like, maybe random
+
+148
+00:10:23,659 --> 00:10:27,708
+rotations — depending on your data, maybe rotations of a couple degrees make sense —
+
+149
+00:10:27,708 --> 00:10:31,399
+or you could try different kinds of stretching and shearing to simulate
+
+150
+00:10:31,399 --> 00:10:33,189
+maybe affine transformations of your data,
+
+151
+00:10:33,190 --> 00:10:36,990
+and you could really go crazy here and try to get creative and think of
+
+152
+00:10:36,990 --> 00:10:43,840
+interesting ways to augment your data. Another thing I'd like to point out
+
+153
+00:10:43,840 --> 00:10:49,009
+is that this idea of data augmentation really fits into a larger theme that we've now
+
+154
+00:10:49,009 --> 00:10:54,090
+seen repeated many times throughout the course, and this theme is that one thing
+
+155
+00:10:54,090 --> 00:10:58,420
+that's really useful in practice for preventing overfitting, as a
+
+156
+00:10:58,419 --> 00:11:02,209
+regularizer, is that during the forward pass, during training, when we're training our
+
+157
+00:11:02,210 --> 00:11:05,930
+network, we add some kind of weird stochastic noise to kind of mess with
+
+158
+00:11:05,929 --> 00:11:10,629
+the network. For example, with data augmentation we're actually modifying
+
+159
+00:11:10,629 --> 00:11:14,210
+the training data that we put into the network; with things like dropout or
+
+160
+00:11:14,210 --> 00:11:18,860
+DropConnect you're taking random parts of the network and setting
+
+161
+00:11:18,860 --> 00:11:22,730
+the activations or the weights to zero randomly.
+
+162
+00:11:22,730 --> 00:11:28,450
+This also appears, kind of, with batch normalization: with batch
+
+163
+00:11:28,450 --> 00:11:31,930
+normalization, your normalization constants depend on the other things in the batch.
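Before moving on, here is a hedged sketch of the PCA-driven color augmentation described a moment ago, in the spirit of the AlexNet scheme. The jitter scale of 0.1 is an assumption, and `alpha` would be resampled for each training image:

```python
# Hedged sketch of PCA ("fancy") color augmentation
import numpy as np

def fit_color_pca(images):                        # images: (N, H, W, 3) float RGB
    pixels = images.reshape(-1, 3)
    pixels = pixels - pixels.mean(axis=0)
    cov = np.cov(pixels, rowvar=False)            # 3x3 color covariance
    eigvals, eigvecs = np.linalg.eigh(cov)        # principal directions in color space
    return eigvals, eigvecs

def jitter_color(image, eigvals, eigvecs, scale=0.1):
    alpha = np.random.normal(0, scale, size=3)    # random strength per direction
    offset = eigvecs.dot(alpha * eigvals)         # offset along the principal directions
    return image + offset                         # broadcast over all pixels
```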
+164
+00:11:31,930 --> 00:11:35,000
+So during training you're normalizing over whatever random batch an image lands in;
+
+165
+00:11:35,000 --> 00:11:39,440
+the same image might end up appearing in many batches with different other images,
+
+166
+00:11:39,440 --> 00:11:43,840
+and that actually introduces this type of noise at training time. But for all of
+
+167
+00:11:43,840 --> 00:11:47,690
+these examples, at test time we average out this noise: for data augmentation
+
+168
+00:11:47,690 --> 00:11:52,790
+we can take averages over many different samples of the training data; for dropout
+
+169
+00:11:52,789 --> 00:11:56,870
+and DropConnect you can sort of evaluate and marginalize this out a
+
+170
+00:11:56,870 --> 00:12:01,090
+little more analytically; and for batch normalization we keep these running
+
+171
+00:12:01,090 --> 00:12:05,269
+means. So I just think that's kind of a nice way to unify a lot of these ideas for
+
+172
+00:12:05,269 --> 00:12:08,960
+regularization: you add noise in the forward pass and then
+
+173
+00:12:08,960 --> 00:12:13,540
+marginalize over it at test time. So keep that in mind if you're trying to come up with
+
+174
+00:12:13,539 --> 00:12:20,250
+other creative ways to regularize your networks. So the main takeaways
+
+175
+00:12:20,250 --> 00:12:24,149
+for data augmentation are that, one, it's usually really simple to implement,
+
+176
+00:12:24,149 --> 00:12:28,329
+so you should almost always be using it; there's not really any excuse not to.
+
+177
+00:12:28,330 --> 00:12:32,730
+It's very, very useful, especially for small datasets, which I think many of you
+
+178
+00:12:32,730 --> 00:12:36,850
+are using for your projects, and it also fits in nicely with this framework of
+
+179
+00:12:36,850 --> 00:12:41,509
+noise at training time and marginalization at test time. So I think that's pretty
+
+180
+00:12:41,509 --> 00:12:45,360
+much all there is to say about data augmentation; if there are any questions
+
+181
+00:12:45,360 --> 00:12:45,840
+about that,
+
+182
+00:12:45,840 --> 00:13:01,840
+I'm happy to talk about them now. [Question about precomputing augmented data.] Yeah, a lot of the time it would take
+
+183
+00:13:01,840 --> 00:13:05,790
+a lot of disk space to try to dump these things to disk, so sometimes
+
+184
+00:13:05,789 --> 00:13:08,879
+people get creative and even have, like, background threads that are batching data
+
+185
+00:13:08,879 --> 00:13:16,799
+and doing augmentation. Right, so I think that's clear; we can talk about
+
+186
+00:13:16,799 --> 00:13:21,069
+the next idea. So there's this myth floating around that when you work with
+
+187
+00:13:21,070 --> 00:13:25,770
+CNNs you really need a lot of data, but it turns out that with transfer
+
+188
+00:13:25,769 --> 00:13:33,029
+learning this myth is busted. There's this really simple recipe that you can
+
+189
+00:13:33,029 --> 00:13:37,769
+use for transfer learning, and that's: first you take whatever your favorite
+
+190
+00:13:37,769 --> 00:13:42,879
+CNN architecture is — AlexNet, VGG, or what have you — and you either train it on
+
+191
+00:13:42,879 --> 00:13:46,970
+ImageNet yourself or, more commonly, you download a pretrained model
+
+192
+00:13:46,970 --> 00:13:51,360
+from the internet. That's easy to do: it just takes 20 minutes to download, many hours
+
+193
+00:13:51,360 --> 00:13:56,590
+to train, but you probably won't do that part. Next, there are sort of two general
+
+194
+00:13:56,590 --> 00:14:00,910
+cases. One: if your dataset is really small and you
really don't have many images whatsoever, then you can just
+
+195
+00:14:00,909 --> 00:14:05,019
+treat this classifier as a fixed feature extractor. One way to look at this is
+
+196
+00:14:05,019 --> 00:14:10,110
+that you'll take the last layer of the network — the softmax classification
+
+197
+00:14:10,110 --> 00:14:15,580
+layer of the original model — you'll take it away, and you'll
+
+198
+00:14:15,580 --> 00:14:18,370
+replace it with some kind of linear classifier for the task that you
+
+199
+00:14:18,370 --> 00:14:21,810
+actually care about, and now you'll freeze the rest of the network and
+
+200
+00:14:21,809 --> 00:14:26,969
+retrain only that top layer. This is sort of equivalent to just training a
+
+201
+00:14:26,970 --> 00:14:31,230
+linear classifier directly on top of features extracted from the network.
+
+202
+00:14:31,230 --> 00:14:35,149
+What you'll see a lot of times in practice for this case is that, sort of
+
+203
+00:14:35,149 --> 00:14:38,399
+as a preprocessing step, you'll just dump features to disk for all of your
+
+204
+00:14:38,399 --> 00:14:42,100
+training images and then work entirely on top of those cached features. That
+
+205
+00:14:42,100 --> 00:14:48,110
+can help speed things up quite a bit, and it's quite easy to use. It's very, very
+
+206
+00:14:48,110 --> 00:14:51,250
+common and usually provides a very strong baseline for a lot of problems
+
+207
+00:14:51,250 --> 00:14:56,169
+that you might encounter in practice. And if you have a little bit more data,
+
+208
+00:14:56,169 --> 00:14:58,599
+then you can actually afford to train more of the
+
+209
+00:14:58,600 --> 00:15:03,949
+model. So depending on the size of your dataset, usually you'll freeze
+
+210
+00:15:03,948 --> 00:15:07,669
+some of the lower layers of the network, and then instead of retraining only the
+
+211
+00:15:07,669 --> 00:15:11,919
+last layer, you'll pick some number of the last layers to train, depending on how
+
+212
+00:15:11,919 --> 00:15:16,349
+large your dataset is. Generally, when you have a larger dataset available for
+
+213
+00:15:16,350 --> 00:15:21,350
+training, you can afford to train more of these final layers. And again,
+
+214
+00:15:21,350 --> 00:15:26,060
+similar to the trick over here, what you'll see very commonly is
+
+215
+00:15:26,059 --> 00:15:29,729
+that instead of actually explicitly computing this part, you'll just dump
+
+216
+00:15:29,730 --> 00:15:35,019
+these last-layer features to disk and then work on this part in memory, and that
+
+217
+00:15:35,019 --> 00:15:47,490
+can speed things up quite a lot. [Student question.] Sometimes you basically
+
+218
+00:15:47,490 --> 00:15:51,959
+have to try it and see, but especially for this type of small dataset, simple
+
+219
+00:15:51,958 --> 00:15:55,799
+approaches work: if you just want to do image retrieval, a pretty
+
+220
+00:15:55,799 --> 00:16:01,338
+strong baseline is to just use L2 distance on CNN features. So with this
+
+221
+00:16:01,339 --> 00:16:05,110
+type of approach, it depends on how many samples you'd expect to need to train
+
+222
+00:16:05,110 --> 00:16:10,470
+a model like an SVM or something; and for these, if you have
+
+223
+00:16:10,470 --> 00:16:15,310
+more data than you would expect to need for an SVM, then try that. [On whether features are recomputed each time:] It depends; sometimes you actually run through the forward pass each time, but sometimes you just run the forward pass once and dump these features to disk. That's pretty common, and it actually saves compute.
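A hedged sketch of this fixed-feature-extractor recipe, written with modern PyTorch and scikit-learn for concreteness (the lecture itself predates these tools, and `loader` stands in for whatever iterates your images and labels):

```python
# Hedged sketch: run the pretrained CNN once, cache features to disk,
# then train a linear classifier on top of the cached features.
import numpy as np
import torch
import torchvision
from sklearn.linear_model import LogisticRegression

model = torchvision.models.resnet18(pretrained=True)
model.fc = torch.nn.Identity()       # chop off the classification layer
model.eval()

feats, labels = [], []
with torch.no_grad():
    for x, y in loader:              # `loader` is an assumed (image, label) DataLoader
        feats.append(model(x).numpy())
        labels.append(y.numpy())
feats = np.concatenate(feats)
labels = np.concatenate(labels)
np.save("features.npy", feats)       # "dump features to disk", as described above

clf = LogisticRegression(max_iter=1000).fit(feats, labels)
```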
+228
+00:16:41,458 --> 00:16:59,729
+[Question about the new last layer.] You initialize it from random, since you'll probably have different classes — or you're on
+
+229
+00:16:59,730 --> 00:17:03,350
+a regression problem or something — but then these other intermediate layers
+
+230
+00:17:03,350 --> 00:17:08,750
+you initialize from whatever was in the previous model. And actually, in
+
+231
+00:17:08,750 --> 00:17:15,068
+practice, when you fine-tune, a nice tip: there will
+
+232
+00:17:15,068 --> 00:17:18,588
+be, I guess, three types of layers when you're fine-
+
+233
+00:17:18,588 --> 00:17:22,349
+tuning. There will be the frozen layers, which you can think of as having a
+
+234
+00:17:22,349 --> 00:17:27,448
+learning rate of zero; there are these new layers that you initialize
+
+235
+00:17:27,449 --> 00:17:32,548
+from scratch, and typically those have maybe a higher learning rate, but not too
+
+236
+00:17:32,548 --> 00:17:36,528
+high — maybe one tenth of what the network was originally trained with; and
+
+237
+00:17:36,528 --> 00:17:40,079
+then we'll have these intermediate layers that you are initializing from
+
+238
+00:17:40,079 --> 00:17:43,269
+the pretrained network but that you're planning to modify during the joint optimization
+
+239
+00:17:43,269 --> 00:17:47,470
+of fine-tuning. For these intermediate layers you'll tend to use a very small
+
+240
+00:17:47,470 --> 00:17:56,589
+learning rate, maybe one one-hundredth of the original. [Question about data similarity.] Yeah,
+
+241
+00:17:56,589 --> 00:18:04,319
+some people have tried to investigate that and found that generally
+
+242
+00:18:04,319 --> 00:18:08,079
+this type of transfer-learning, fine-tuning approach works
+
+243
+00:18:08,079 --> 00:18:11,710
+better when the network was originally trained with similar types of data,
+
+244
+00:18:11,710 --> 00:18:16,610
+whatever that means. But in fact these very low-level features are things
+
+245
+00:18:16,609 --> 00:18:20,308
+like edges and colors and Gabor filters, which are probably gonna be applicable
+
+246
+00:18:20,308 --> 00:18:24,190
+to just about any type of visual data. So especially these lower-level features, I
+
+247
+00:18:24,190 --> 00:18:29,009
+think, are generally pretty applicable to almost anything. And by the way, another
+
+248
+00:18:29,009 --> 00:18:33,788
+tip that you sometimes see in practice for fine-tuning is that you
+
+249
+00:18:33,788 --> 00:18:37,609
+might actually have a multi-stage approach, where first you freeze the
+
+250
+00:18:37,609 --> 00:18:42,079
+entire network and only train this last layer, and then after this last
+
+251
+00:18:42,079 --> 00:18:46,939
+layer seems to be converging, you go back and actually fine-tune the earlier layers too. You can
+
+252
+00:18:46,940 --> 00:18:51,519
+sometimes have this problem that, because this last layer is initialized
+
+253
+00:18:51,519 --> 00:18:54,690
+randomly, you might have very large gradients that kind of mess up the
+
+254
+00:18:54,690 --> 00:18:59,070
+initialization. So there are two ways to get around that: freeze first, or vary the learning rates.
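A hedged sketch of the three learning-rate tiers described above, written in modern PyTorch for concreteness; the lecture predates PyTorch, and the specific rates and the choice to unfreeze `layer4` are illustrative assumptions:

```python
# Hedged sketch: frozen layers, intermediate fine-tuned layers, and a new head
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True)

for p in model.parameters():
    p.requires_grad = False             # frozen layers: learning rate effectively zero

for p in model.layer4.parameters():
    p.requires_grad = True              # intermediate layers we plan to fine-tune

model.fc = torch.nn.Linear(512, 10)     # new last layer, initialized from scratch

optimizer = torch.optim.SGD([
    {"params": model.layer4.parameters(), "lr": 1e-4},  # ~1/100 of a base rate of 1e-2
    {"params": model.fc.parameters(),     "lr": 1e-3},  # ~1/10 of the base rate
], momentum=0.9)
```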
+255
+00:18:59,069 --> 00:19:02,788
+Either you freeze the rest at first until the last layer converges, or you have this varying learning rate
+
+256
+00:19:02,788 --> 00:19:08,658
+between the two regimes of the network. So this idea of transfer learning
+
+257
+00:19:08,659 --> 00:19:14,470
+actually works really well. There were a couple of pretty early papers from 2013,
+
+258
+00:19:14,470 --> 00:19:19,390
+2014, when CNNs first started getting popular; this one in particular,
+
+259
+00:19:19,390 --> 00:19:24,490
+the "astounding baseline" paper, was pretty cool. What they did is they took what
+
+260
+00:19:24,490 --> 00:19:26,009
+at the time was one of the best
+
+261
+00:19:26,009 --> 00:19:30,470
+CNNs out there, which was OverFeat; they just extracted features from OverFeat and
+
+262
+00:19:30,470 --> 00:19:33,640
+applied these features to a bunch of different standard datasets and standard
+
+263
+00:19:33,640 --> 00:19:38,679
+problems in computer vision. The idea
+
+264
+00:19:38,679 --> 00:19:42,210
+is that they compared against what were at the time these very specialized
+
+265
+00:19:42,210 --> 00:19:45,298
+pipelines and very specialized architectures for each individual
+
+266
+00:19:45,298 --> 00:19:49,408
+problem and dataset, and for each problem they just replaced this very
+
+267
+00:19:49,409 --> 00:19:54,380
+specialized pipeline with very simple linear models on top of features from
+
+268
+00:19:54,380 --> 00:19:58,559
+OverFeat. They did this for a whole bunch of different datasets and found
+
+269
+00:19:58,558 --> 00:20:01,940
+that, in general, overall, these OverFeat features were a very, very
+
+270
+00:20:01,940 --> 00:20:06,080
+strong baseline: for some problems they were actually better than existing
+
+271
+00:20:06,079 --> 00:20:08,428
+methods, and for some problems they were
+
+272
+00:20:08,429 --> 00:20:12,879
+a bit worse but still quite competitive. So this was a really cool paper that
+
+273
+00:20:12,878 --> 00:20:16,118
+just demonstrated that these are really strong features that can be used in a
+
+274
+00:20:16,118 --> 00:20:19,949
+lot of different tasks and tend to work quite well. Another paper along those
+
+275
+00:20:19,950 --> 00:20:25,419
+lines was from Berkeley, the DeCAF paper — and DeCAF later became
+
+276
+00:20:25,419 --> 00:20:33,610
+caffeinated and became Caffe, so that's kind of the lineage there. So
+
+277
+00:20:33,609 --> 00:20:37,388
+the recipe for transfer learning is that you should think about
+
+278
+00:20:37,388 --> 00:20:43,398
+this little two-by-two matrix: how similar is your dataset to what the pretrained
+
+279
+00:20:43,398 --> 00:20:47,989
+model was trained on, and how much data do you have — and what should you do in those four
+
+280
+00:20:47,990 --> 00:20:53,240
+different cells? So generally, if you have a very similar dataset and very
+
+281
+00:20:53,240 --> 00:20:57,538
+little data, just using the network as a fixed feature extractor and training
+
+282
+00:20:57,538 --> 00:21:02,429
+simple linear models on top of those features tends to work very well. If you
+
+283
+00:21:02,429 --> 00:21:06,470
+have a little bit more data, then you can try fine-tuning: try actually
+
+284
+00:21:06,470 --> 00:21:10,509
+initializing the network from pretrained weights and running the
+
+285
+00:21:10,509 --> 00:21:15,868
+optimization from there. In the other column there are little tricks: here in this box
+
+286
+00:21:15,868 --> 00:21:20,099
+you might be in trouble; you can
try to get creative, and maybe instead of
+
+287
+00:21:20,099 --> 00:21:23,998
+extracting features from the very last layer, you might try extracting features
+
+288
+00:21:23,999 --> 00:21:27,470
+from different layers of the convnet, and that can sometimes help.
+
+289
+00:21:27,470 --> 00:21:32,819
+The intuition there is that, maybe for something like MRI data, these
+
+290
+00:21:32,819 --> 00:21:37,178
+very top-level features are very specific to ImageNet categories, but these
+
+291
+00:21:37,179 --> 00:21:42,059
+very low-level features are things like edges and stuff like that, which may be
+
+292
+00:21:42,058 --> 00:21:47,980
+more transferable to datasets that don't look like ImageNet. And obviously,
+
+293
+00:21:47,980 --> 00:21:51,099
+in this box you're in better shape, and again you can just sort of initialize and
+
+294
+00:21:51,099 --> 00:21:57,928
+fine-tune. So another thing I'd like to point out is that this idea of initializing
+
+295
+00:21:57,929 --> 00:22:01,590
+with pretrained models and fine-tuning is actually not the exception: this is
+
+296
+00:22:01,589 --> 00:22:05,439
+pretty much standard practice in almost any larger system that you'll see in
+
+297
+00:22:05,440 --> 00:22:09,070
+computer vision these days, and we've actually seen two examples of this
+
+298
+00:22:09,069 --> 00:22:13,220
+already in the course. For example, if you remember, a few lectures ago
+
+299
+00:22:13,220 --> 00:22:17,220
+we talked about object detection, where we had a CNN looking at the image,
+
+300
+00:22:17,220 --> 00:22:21,620
+region proposals, and all this other crazy stuff — but this part
+
+301
+00:22:21,619 --> 00:22:25,529
+was a CNN looking at the image. And in image captioning we had a CNN looking at the
+
+302
+00:22:25,529 --> 00:22:29,399
+image. So in both of those cases, the CNNs were initialized from ImageNet
+
+303
+00:22:29,400 --> 00:22:34,080
+models, and that really helps to solve these other, more specialized problems,
+
+304
+00:22:34,079 --> 00:22:38,839
+even without gigantic datasets. Also, for the image captioning model in
+
+305
+00:22:38,839 --> 00:22:42,829
+particular, part of that model includes these word embeddings that you should
+
+306
+00:22:42,829 --> 00:22:47,500
+have seen by now on the homework, if you've started on it; those word vectors
+
+307
+00:22:47,500 --> 00:22:50,099
+you can actually initialize from something else that was maybe
+
+308
+00:22:50,099 --> 00:22:54,019
+pretrained on a bunch of text, and that can sometimes help,
+
+309
+00:22:54,019 --> 00:22:58,668
+maybe in some situations where you might not have a lot of captioning data available.
+
+310
+00:22:58,669 --> 00:23:15,490
+[Question.] Yeah, it can help sometimes; it depends on the problem, depends on the
+
+311
+00:23:15,490 --> 00:23:18,859
+network, but it's definitely something you can try, and it especially might
+
+312
+00:23:18,859 --> 00:23:27,548
+help when you're in this box. But yeah, that's a good trick too. The takeaway
+
+313
+00:23:27,548 --> 00:23:31,210
+about fine-tuning is that you should really use it; it's a really good idea.
+
+314
+00:23:31,210 --> 00:23:35,950
+It works really well in practice; you should probably almost
+
+315
+00:23:35,950 --> 00:23:39,900
+always be using it, and to some extent you generally don't want to be training
+
+316
+00:23:39,900 --> 00:23:42,519
+these things from scratch unless you have really, really large datasets
+available. In almost all circumstances it's much more convenient
+
+318
+00:23:45,970 --> 00:23:52,279
+to fine-tune an existing model. And by the way, Caffe has this model zoo;
+
+319
+00:23:52,279 --> 00:23:58,230
+you can download many famous ImageNet models —
+
+320
+00:23:58,230 --> 00:24:01,880
+actually, for the residual networks, the official model got released recently, so
+
+321
+00:24:01,880 --> 00:24:06,130
+you can even download that and play with it, which would be pretty cool. And these Caffe
+
+322
+00:24:06,130 --> 00:24:09,020
+model zoo models are sort of a little bit of a standard in the
+
+323
+00:24:09,019 --> 00:24:13,759
+community, so you can even load Caffe models into other frameworks like
+
+324
+00:24:13,759 --> 00:24:17,658
+Torch. So that's something to keep in mind: these Caffe models are quite
+
+325
+00:24:17,659 --> 00:24:21,030
+useful. Right,
+
+326
+00:24:21,029 --> 00:24:26,889
+any further questions on fine-tuning or transfer learning?
+
+327
+00:24:26,890 --> 00:24:46,650
+[Question about very high-dimensional features.] Yeah, that's quite large — so you might try a highly
+
+328
+00:24:46,650 --> 00:24:50,250
+regularized linear model on top of that, or you might try putting a small conv
+
+329
+00:24:50,250 --> 00:24:53,109
+net on top of that, maybe reduce the dimensionality; you can get creative here,
+
+330
+00:24:53,109 --> 00:24:56,399
+but I think there are things you can try that might work
+
+331
+00:24:56,400 --> 00:25:03,640
+for your data, depending on it. Right, so I think we should talk more about
+
+332
+00:25:03,640 --> 00:25:07,740
+convolutions. For all these networks we've talked about, really the
+
+333
+00:25:07,740 --> 00:25:11,920
+convolutions are the computational workhorse that's doing a lot of the work
+
+334
+00:25:11,920 --> 00:25:18,090
+in the network. So we need to talk about two things about convolutions. The first
+
+335
+00:25:18,089 --> 00:25:22,809
+is how to stack them: how can we design efficient network architectures that
+
+336
+00:25:22,809 --> 00:25:28,789
+combine many layers of convolution to achieve some nice results. So here's
+
+337
+00:25:28,789 --> 00:25:33,230
+a question: suppose that we have a network that has two layers of three-by-
+
+338
+00:25:33,230 --> 00:25:37,190
+three convolutions. This would be the input, this would be the activation map
+
+339
+00:25:37,190 --> 00:25:40,120
+in the first layer, this would be the activation map after two layers of
+
+340
+00:25:40,119 --> 00:25:45,959
+convolution. The question is: for a neuron on this second layer, how big of a region
+
+341
+00:25:45,960 --> 00:25:49,640
+of the input does it see? This was on your midterm, so I hope you guys
+
+342
+00:25:49,640 --> 00:25:53,920
+all know the answer to this.
+
+343
+00:25:53,920 --> 00:26:01,298
+Anyone? OK, maybe that was a hard exam question.
+
+344
+00:26:01,298 --> 00:26:05,230
+It's a five-by-five, and it's pretty easy to see from this
+
+345
+00:26:05,230 --> 00:26:08,989
+diagram why: this neuron in the second layer is looking at this
+
+346
+00:26:08,989 --> 00:26:13,619
+entire volume in the intermediate layer, and in particular this pixel in the
+
+347
+00:26:13,618 --> 00:26:18,138
+intermediate layer is looking at this three-by-three region in the input. So when you
+
+348
+00:26:18,138 --> 00:26:22,738
+look at all three of these together, then this
+
+349
+00:26:22,739 --> 00:26:26,200
+layer — this neuron in the second layer — is actually looking at this
+
+350
+00:26:26,200 --> 00:26:32,669
+entire five-by-five volume in the input. OK, so now the question is: if we had
+
+351
+00:26:32,669 --> 00:26:36,820
+three three-by-three convolutions stacked in a row, how big of a region in the
+
+352
+00:26:36,819 --> 00:26:43,700
+input would they see? Yeah — seven by seven, by the same kind of reasoning: these receptive fields
+
+353
+00:26:43,700 --> 00:26:49,739
+just kind of build up with successive convolutions. So the point to make here
+
+354
+00:26:49,739 --> 00:26:53,940
+is that three three-by-three convolutions actually give you a very
+
+355
+00:26:53,940 --> 00:26:57,919
+similar representational power — that's my claim — to a single seven-by-seven
+
+356
+00:26:57,919 --> 00:27:02,619
+convolution. You might debate the exact semantics of this, and you could
+
+357
+00:27:02,618 --> 00:27:05,528
+try to prove theorems about it and things like that, but just in an
+
+358
+00:27:05,528 --> 00:27:09,940
+intuitive sense, three three-by-three convolutions can represent similar types
+
+359
+00:27:09,940 --> 00:27:14,100
+of functions as a single seven-by-seven convolution, since they're looking at the
+
+360
+00:27:14,099 --> 00:27:22,189
+same region in the input. So now we can actually dig further
+
+361
+00:27:22,190 --> 00:27:27,399
+into this idea, and we can compare more concretely between a single seven-by-seven
+
+362
+00:27:27,398 --> 00:27:32,618
+convolution versus a stack of three three-by-three convolutions. Let's suppose
+
+363
+00:27:32,618 --> 00:27:38,638
+that we have an input image that's H by W by C, and we want convolutions
+
+364
+00:27:38,638 --> 00:27:43,329
+that preserve the depth, so we have C filters, and we want to have them
+
+365
+00:27:43,329 --> 00:27:48,019
+preserve height and width, so we just set the padding appropriately. Then we want
+
+366
+00:27:48,019 --> 00:27:51,528
+to compare concretely: what is the difference between a single seven-by-
+
+367
+00:27:51,528 --> 00:27:56,648
+seven versus a stack of three-by-threes? So first, how many weights do each of these
+
+368
+00:27:56,648 --> 00:28:01,748
+two things have? Anyone have a guess on how many weights the single seven-by-seven
+
+369
+00:28:01,749 --> 00:28:09,519
+convolution has? And you can forget about biases; they're confusing.
+
+370
+00:28:09,519 --> 00:28:19,869
+I heard some answers, but my answer — I hope I got it right —
+
+371
+00:28:19,869 --> 00:28:24,319
+was 49 C squared: you've got seven-by-seven filters, each one looking
+
+372
+00:28:24,319 --> 00:28:29,809
+at a depth of C, and you've got C such filters, so 49 C squared. But now for the
+
+373
+00:28:29,809 --> 00:28:34,649
+three-by-three convolutions: we have three layers of convolution,
+
+374
+00:28:34,650 --> 00:28:38,990
+each filter is three by three by C, and each layer has C filters. When you
+
+375
+00:28:38,990 --> 00:28:43,980
+multiply that all out, we see that the three three-by-three convolutions only have 27 C squared
+
+376
+00:28:43,980 --> 00:28:49,079
+parameters. And assuming that we have ReLUs in between each of these
+
+377
+00:28:49,079 --> 00:28:54,049
+convolutions, we see that the stack of three three-by-three convolutions actually has
+
+378
+00:28:54,049 --> 00:28:58,649
+fewer parameters, which is good, and more nonlinearity, which is good.
+
+379
+00:28:58,650 --> 00:29:02,960
+This kind of gives you some intuition for why stacking small filters can be a good idea.
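A quick arithmetic check of the parameter and multiply-add claims above — a worked example, not lecture code; the sizes C = 64 and H = W = 56 are arbitrary assumptions:

```python
# Parameters and multiply-adds: one 7x7 conv vs. a stack of three 3x3 convs,
# both mapping C channels to C channels on an H x W input with padding.
C, H, W = 64, 56, 56

params_7x7 = C * (7 * 7 * C)               # 49 C^2
params_3x3_stack = 3 * C * (3 * 3 * C)     # 27 C^2

flops_7x7 = H * W * params_7x7             # each filter applied at every position
flops_3x3_stack = H * W * params_3x3_stack

print(params_7x7, params_3x3_stack)        # 200704 vs 110592 for C = 64
print(flops_7x7, flops_3x3_stack)
```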
+00:29:02,960 --> 00:29:06,440 +three convolutions might actually be +preferable to a single seven by seven + +381 +00:29:06,440 --> 00:29:11,559 +competition and we can actually take +this one step further and think about + +382 +00:29:11,559 --> 00:29:14,750 +not just the number below normal +parameters but actually honey floating + +383 +00:29:14,750 --> 00:29:19,099 +point operations to these things take so +anyone have a gas for how many + +384 +00:29:19,099 --> 00:29:29,669 +operations these things to take just now +sounds hard writes actually this is + +385 +00:29:29,670 --> 00:29:33,740 +pretty easy because for each of these +filters were gonna be using it at every + +386 +00:29:33,740 --> 00:29:37,819 +position in the end in the image so +actually the number of multiply ads is + +387 +00:29:37,819 --> 00:29:42,099 +just gonna be Heights times with times +the number of burnable filters so you + +388 +00:29:42,099 --> 00:29:47,789 +can see that actually over here again +not only do we have some between + +389 +00:29:47,789 --> 00:29:52,440 +comparing between these two the seven by +seven action not only has more learnable + +390 +00:29:52,440 --> 00:29:57,460 +parameters but it actually costs a lot +more to computers well so the stack of + +391 +00:29:57,460 --> 00:30:03,140 +33 by frequent allusions again gives us +more nonlinearity for less compute so + +392 +00:30:03,140 --> 00:30:06,170 +that kinda gives you some intuition for +why actually having multiple layers of + +393 +00:30:06,170 --> 00:30:12,300 +three bay three convolutions is actually +preferable to large filters but then you + +394 +00:30:12,299 --> 00:30:15,750 +can think of another question you know +we've been pushing towards smaller and + +395 +00:30:15,750 --> 00:30:20,109 +smaller filters but why stop at three by +three right we can actually go smaller + +396 +00:30:20,109 --> 00:30:21,859 +than that may be the same logic would +expand + +397 +00:30:21,859 --> 00:30:27,798 +shaking your head you don't believe it +that's true it's true you don't get the + +398 +00:30:27,798 --> 00:30:33,539 +receptive field so actually what we're +going to do here is compared to a single + +399 +00:30:33,539 --> 00:30:39,019 +33 convolution versus a slightly fancier +architecture the bottleneck architecture + +400 +00:30:39,019 --> 00:30:45,150 +so here we're gonna assume I can input +of HW see and hear we can actually do + +401 +00:30:45,150 --> 00:30:50,070 +this is a cool trick we do a single +one-by-one convolution with see over to + +402 +00:30:50,069 --> 00:30:54,609 +filters to actually reduce the +dimensionality of the volume so now this + +403 +00:30:54,609 --> 00:30:57,990 +thing is going to have the same spatial +extent but half the number of features + +404 +00:30:57,990 --> 00:31:03,480 +in-depth now after we do this bottleneck +we're gonna do a three by three + +405 +00:31:03,480 --> 00:31:08,929 +convolution at this reduced +dimensionality so now this this three by + +406 +00:31:08,929 --> 00:31:13,610 +three convolution takes over to input +features and produces over to output + +407 +00:31:13,609 --> 00:31:18,000 +features and now we restore the +dimensionality with another one by one + +408 +00:31:18,000 --> 00:31:23,558 +convolution to go from see over to back +to see this is kind of a kind of a funky + +409 +00:31:23,558 --> 00:31:27,910 +architecture this idea of using +one-by-one convolutions everywhere is + +410 +00:31:27,910 --> 00:31:31,669 +sometimes called network and network +because it has this 
intuition that + +411 +00:31:31,669 --> 00:31:35,730 +you're a one-by-one convolution is kinda +similar to sliding a fully connected + +412 +00:31:35,730 --> 00:31:42,480 +network over each part of your input +volume and this idea also appears in + +413 +00:31:42,480 --> 00:31:46,259 +Google Matt and in ResNet this idea of +using these one-by-one bottleneck + +414 +00:31:46,259 --> 00:31:52,679 +contributions so we can compare this +this bottleneck sandwich to a single + +415 +00:31:52,679 --> 00:31:56,390 +three by three convolution with C +filters and run through the same logic + +416 +00:31:56,390 --> 00:32:01,270 +so I won't I won't force you to +computers in your heads but you'll have + +417 +00:32:01,269 --> 00:32:02,720 +to trust me on this + +418 +00:32:02,720 --> 00:32:08,700 +that this bottleneck stack has three and +a quarter C squared parameters where is + +419 +00:32:08,700 --> 00:32:12,360 +this one over here has nine C squared +parameters and again if we're sticking + +420 +00:32:12,359 --> 00:32:15,879 +rallies in between each of these +contributions than this bottleneck + +421 +00:32:15,880 --> 00:32:20,620 +sandwich is giving us more more +nonlinearity for fewer number of + +422 +00:32:20,619 --> 00:32:28,899 +parameters and actually as we similar to +we saw on the three by three versus + +423 +00:32:28,900 --> 00:32:33,200 +seven by seven the number of parameters +is tied directly to the computation so + +424 +00:32:33,200 --> 00:32:35,389 +this bottleneck sandwich is also + +425 +00:32:35,388 --> 00:32:39,788 +much faster to compute so this to this +idea of one-by-one bottlenecks has + +426 +00:32:39,788 --> 00:32:52,669 +received quite a lot of usage recently +in Google Matt and especially yeah so + +427 +00:32:52,669 --> 00:32:56,579 +you might think of it as you sometimes +you think of it as as a projection from + +428 +00:32:56,578 --> 00:33:00,308 +like a lower dimensional feature back to +a higher dimensional space and then if + +429 +00:33:00,308 --> 00:33:03,868 +you think about stacking many of these +things on top of each other as happens + +430 +00:33:03,868 --> 00:33:09,499 +and residents than than you have been +coming immediately after this one is + +431 +00:33:09,499 --> 00:33:11,088 +going to be another one by one + +432 +00:33:11,088 --> 00:33:14,858 +you're kind of stuck in many many one +people one by one convolutions on top of + +433 +00:33:14,858 --> 00:33:18,918 +each other and one-by-one convolution is +a little bit like sliding a fully a + +434 +00:33:18,919 --> 00:33:23,409 +multi-layer fully connected network over +each double channel to think maybe think + +435 +00:33:23,409 --> 00:33:27,229 +about that when a little bit but it +turns out that actually you don't really + +436 +00:33:27,229 --> 00:33:31,200 +need the spatial extent and even just +comparing the sandwich to a single three + +437 +00:33:31,200 --> 00:33:35,769 +by three Khans you're sort of having the +same input output volume sizes but + +438 +00:33:35,769 --> 00:33:41,429 +what's more nonlinearity and cheaper to +compute and animal parameters so they're + +439 +00:33:41,429 --> 00:33:46,089 +all kind of nice features but there's +there's one problem with this is that + +440 +00:33:46,088 --> 00:33:49,668 +that's we're still using a three by +three convolution in there somewhere and + +441 +00:33:49,669 --> 00:33:54,709 +you might wonder if we if we really need +this and the answer is No it turns out + +442 +00:33:54,709 --> 00:33:59,808 +so one crazy thing that I've seen 
+recently is that you can you can factor + +443 +00:33:59,808 --> 00:34:05,608 +the street by three convolution in 2003 +by one and won by three and compared to + +444 +00:34:05,608 --> 00:34:09,469 +the single three by three convolution +this ends up saving you some parameters + +445 +00:34:09,469 --> 00:34:14,428 +as well so that you might if you really +go crazy you can come by in this one by + +446 +00:34:14,429 --> 00:34:18,019 +three and three by one together with +this bottleneck an idea and things just + +447 +00:34:18,018 --> 00:34:22,358 +get really cheap and that's basically +what Google has done in their most + +448 +00:34:22,358 --> 00:34:27,038 +recent version of Inception so there's +this kind of crazy paper rethinking the + +449 +00:34:27,039 --> 00:34:30,389 +inception architecture for computer +vision where they play a lot of these + +450 +00:34:30,389 --> 00:34:34,169 +crazy tricks about factoring +convolutions in weird ways and having a + +451 +00:34:34,168 --> 00:34:37,138 +lot of one-by-one bottlenecks and then +projections backup to different + +452 +00:34:37,139 --> 00:34:40,608 +dimensions and then if you thought the +original Google met with with their + +453 +00:34:40,608 --> 00:34:42,699 +inception module was was crazy + +454 +00:34:42,699 --> 00:34:46,118 +this one's these are the inception +modules that Google is now using in + +455 +00:34:46,119 --> 00:34:47,329 +their newest inception at + +456 +00:34:47,329 --> 00:34:50,739 +and the interesting features here are +that they have these one-by-one + +457 +00:34:50,739 --> 00:34:55,819 +bottlenecks everywhere and make sure you +have these asymmetric filters to against + +458 +00:34:55,820 --> 00:35:01,519 +Avon computation so this stuff is not +super widely used yet but it's it's it's + +459 +00:35:01,519 --> 00:35:05,079 +out there and it's a Google Matt +psychotics it something cool to mention + +460 +00:35:05,079 --> 00:35:14,610 +so quickly recap from convolutions and +how to stack them is that it's usually + +461 +00:35:14,610 --> 00:35:18,530 +better instead of having a single large +convolution with a large filter size + +462 +00:35:18,530 --> 00:35:22,740 +it's usually better to break up into +multiple smaller filters and that even + +463 +00:35:22,739 --> 00:35:26,339 +maybe helps explain the difference +between something like BGG which has + +464 +00:35:26,340 --> 00:35:30,059 +many many three by three filters with +something like Alex net that have fewer + +465 +00:35:30,059 --> 00:35:35,119 +smaller filters and other thing that's +actually become pretty common i think is + +466 +00:35:35,119 --> 00:35:38,829 +this idea of one by one bottle necking +you see that in both versions of Google + +467 +00:35:38,829 --> 00:35:42,579 +not and also in ResNet and that actually +helps you save a lot on parameters I + +468 +00:35:42,579 --> 00:35:46,340 +think that's a useful trick to keep in +mind and this idea of factoring + +469 +00:35:46,340 --> 00:35:50,890 +convolutions into these asymmetric +filters i think is maybe not so widely + +470 +00:35:50,889 --> 00:35:54,629 +used right now but it may become more +commonly used in the future I'm not sure + +471 +00:35:54,630 --> 00:36:00,160 +and the basic over overarching theme for +all of these tracks is that it lets you + +472 +00:36:00,159 --> 00:36:04,289 +have fewer learnable parameters and +fewer and less compute and more + +473 +00:36:04,289 --> 00:36:07,739 +nonlinearity which are all sort of nice +features to having your architectures + +474 
+00:36:07,739 --> 00:36:18,779 +such as any questions about these these +convolution architecture designs to + +475 +00:36:18,780 --> 00:36:21,300 +bring her too obvious + +476 +00:36:21,300 --> 00:36:26,340 +ok so then the next thing is that once +you've actually decided on how you want + +477 +00:36:26,340 --> 00:36:30,760 +to wire up your stack of convolutions +you actually to compute them and this + +478 +00:36:30,760 --> 00:36:33,630 +there's actually been a lot of work on +different ways to implement + +479 +00:36:33,630 --> 00:36:37,950 +contributions we asked you to implement +of the assignments using for loops and + +480 +00:36:37,949 --> 00:36:43,960 +that as you may have guessed doesn't +scale too well so this a pretty a pretty + +481 +00:36:43,960 --> 00:36:47,720 +easy approach that's pretty easy to +implement is this idea of a name to call + +482 +00:36:47,719 --> 00:36:52,269 +method so the intuition here is that we +know matrix multiplication is really + +483 +00:36:52,269 --> 00:36:56,809 +fast and pretty much any computing +architecture out there someone has + +484 +00:36:56,809 --> 00:37:00,949 +written a really really well optimized +matrix multiplication retainer library + +485 +00:37:00,949 --> 00:37:06,230 +so the idea of him to call is stinking +well given that matrix multiplication is + +486 +00:37:06,230 --> 00:37:07,400 +really fast + +487 +00:37:07,400 --> 00:37:11,420 +is there some way that we can take this +convolution operation and recast as a + +488 +00:37:11,420 --> 00:37:17,800 +matrix multiply and it turns out that +this is actually pretty somewhat easy to + +489 +00:37:17,800 --> 00:37:22,930 +do once you think about it so the idea +is that we have an input volume that's + +490 +00:37:22,929 --> 00:37:28,549 +hiw by sea and we have a filter bank of +convolutions of convolutional filters + +491 +00:37:28,550 --> 00:37:32,730 +each one of these is going to be a +case-by-case by see volume so it has a + +492 +00:37:32,730 --> 00:37:36,659 +case-by-case receptive field and +adaptive see two matched to match the + +493 +00:37:36,659 --> 00:37:39,989 +input over here and we're gonna have to +deal with these filters and then we want + +494 +00:37:39,989 --> 00:37:44,809 +to turn this into a into a matrix +multiply problem so the idea is that + +495 +00:37:44,809 --> 00:37:48,829 +we're going to take one of their we're +going to take the first receptive field + +496 +00:37:48,829 --> 00:37:54,019 +of the image which is gonna be this kay +by Kay by CEE region in the region in + +497 +00:37:54,019 --> 00:37:58,130 +the end up in football you I'm going to +reshape it into this column of case + +498 +00:37:58,130 --> 00:38:01,910 +whereby see elements and then we're +going to repeat this for every possible + +499 +00:38:01,909 --> 00:38:05,909 +receptive field in the image so we're +going to take this little guy I'm going + +500 +00:38:05,909 --> 00:38:09,359 +to shift him over all possible regions +in the image and here I'm just saying + +501 +00:38:09,360 --> 00:38:12,680 +that there's going to be maybe end +region and different receptive field + +502 +00:38:12,679 --> 00:38:18,389 +locations so now we've taken our image +and we've taken reshaped into this giant + +503 +00:38:18,389 --> 00:38:25,139 +matrix oh and by I mean and in my case +whereby see anyone see what a potential + +504 +00:38:25,139 --> 00:38:28,139 +problem with this maybe + +505 +00:38:28,139 --> 00:38:36,829 +yeah that's true so best this tends to +use a lot of memory right so many + +506 
+00:38:36,829 --> 00:38:41,380 +elements in this volume if it appears +and multiple receptive fields then it's + +507 +00:38:41,380 --> 00:38:45,010 +going to be duplicated in multiple of +these columns so and this is going to + +508 +00:38:45,010 --> 00:38:49,220 +get worse the more overlap there is +between your receptive fields but it + +509 +00:38:49,219 --> 00:38:52,839 +turns out that in practice that's +actually not too big of a deal and it + +510 +00:38:52,840 --> 00:38:57,910 +works fine then we're gonna run a +similar check on these convolutional + +511 +00:38:57,909 --> 00:39:01,699 +filters so if you remember what a +convolution is doing we want to take + +512 +00:39:01,699 --> 00:39:06,039 +each of these convolutional weights and +take our products with each + +513 +00:39:06,039 --> 00:39:10,889 +convolutional weight against each +receptive field location in the image so + +514 +00:39:10,889 --> 00:39:16,420 +each of these convolutional weights is +is this kay by Kay buy seats answer so + +515 +00:39:16,420 --> 00:39:21,059 +we're going to reshape each of those to +be a case where by Ciro now we have D + +516 +00:39:21,059 --> 00:39:26,420 +filters so we got a deal by case whereby +seat matrix now this is great + +517 +00:39:26,420 --> 00:39:31,750 +now this guide contains all the recep +each each column as a receptive field we + +518 +00:39:31,750 --> 00:39:37,039 +have one column receptive field in the +image and now this matrix has one has + +519 +00:39:37,039 --> 00:39:42,679 +one each row is a different weight so +now we can easily compute all of these + +520 +00:39:42,679 --> 00:39:49,069 +inner products all at once with the +single matrix multiply and I apologize + +521 +00:39:49,070 --> 00:39:52,809 +for these dimensions not working out the +probably should swap is to make it more + +522 +00:39:52,809 --> 00:39:59,219 +obvious but I think you get the idea so +this this gives und by end result that + +523 +00:39:59,219 --> 00:40:03,659 +that D is our number of output filters +and that n is for all the receptive + +524 +00:40:03,659 --> 00:40:07,469 +field locations in the image then you +play a similar trek to take this and + +525 +00:40:07,469 --> 00:40:13,000 +reshape it into your interior 3d +appetizer you can actually stand this + +526 +00:40:13,000 --> 00:40:16,219 +too many batches quite easily if you +have a mini batch of any of these + +527 +00:40:16,219 --> 00:40:24,099 +elements you just add more rows and how +one set of rows per me back element this + +528 +00:40:24,099 --> 00:40:28,589 +actually is pretty easy to implement so +yeah + +529 +00:40:28,590 --> 00:40:35,090 +depends that-that's then it depends on +your implementation right but then you + +530 +00:40:35,090 --> 00:40:39,910 +have to worry about things like memory +layout and stuff like that but sometimes + +531 +00:40:39,909 --> 00:40:45,099 +you even do that reshape operation on +the GPUs you can do it in parallel but + +532 +00:40:45,099 --> 00:40:50,089 +as a as a case study so this is really +easy to implement so a lot of if if if + +533 +00:40:50,090 --> 00:40:53,470 +you don't have a convolution technique +available and you need to implement one + +534 +00:40:53,469 --> 00:40:57,869 +passed this is probably the one to +choose and if you look at actual cafe in + +535 +00:40:57,869 --> 00:41:01,119 +earlier versions of cafe this is the +method that they used for doing + +536 +00:41:01,119 --> 00:41:07,730 +contributions so this is the convolution +forward code for the GPU conflict the + +537 
+00:41:07,730 --> 00:41:12,630 +native GPU convolution in you can see in +this red chunk they're calling into the + +538 +00:41:12,630 --> 00:41:18,070 +same to call method is taking their +input image rights this is taking their + +539 +00:41:18,070 --> 00:41:22,900 +input image somewhere this is so this is +their intention and then they're going + +540 +00:41:22,900 --> 00:41:27,050 +to reshape this calling the same to call +method and then store it in this in this + +541 +00:41:27,050 --> 00:41:33,519 +column GPU tenser than they're gonna +take to a matrix matrix multiply calling + +542 +00:41:33,519 --> 00:41:37,980 +it could last through the matrix +multiply and then a bias so that's + +543 +00:41:37,980 --> 00:41:42,840 +that's how that's i mean these things +tend to work quite well in practice and + +544 +00:41:42,840 --> 00:41:45,850 +also has another case study if you +remember the fast layers we gave you any + +545 +00:41:45,849 --> 00:41:51,500 +assignments actually uses this exact +same strategy so here we actually do nm + +546 +00:41:51,500 --> 00:41:55,940 +to call operation was some crazy numpy +tricks and then now we can actually do + +547 +00:41:55,940 --> 00:42:00,230 +the convolution inside The FAST layers +with a single call to the numpy matrix + +548 +00:42:00,230 --> 00:42:03,900 +multiplication and you sign your +homework this usually gives me a couple + +549 +00:42:03,900 --> 00:42:07,740 +hundred times faster than using for +loops this actually works pretty well + +550 +00:42:07,739 --> 00:42:18,209 +and it's it's pretty easy to implement +any questions about him to call + +551 +00:42:18,210 --> 00:42:24,949 +think about it a little bit but if you +think if you think really hard you'll + +552 +00:42:24,949 --> 00:42:28,219 +realize that the backward pass on a +convolution is actually also a + +553 +00:42:28,219 --> 00:42:33,358 +convolution which you may have figured +out a few if you're thinking about it on + +554 +00:42:33,358 --> 00:42:37,269 +your homework but the backward pass a +convolution is actually also a type of + +555 +00:42:37,269 --> 00:42:41,070 +convolution over the over the upstream +gradients you can actually use a similar + +556 +00:42:41,070 --> 00:42:45,789 +type of image to call method for the +tobacco passes well the only trick there + +557 +00:42:45,789 --> 00:42:51,259 +is that one once you do in a backward +pass you need to some gradients from + +558 +00:42:51,260 --> 00:42:54,940 +across overlapping receptive fields in +the upstream so you need to be careful + +559 +00:42:54,940 --> 00:43:02,889 +about the call Tim you need to summon +the call Tim in the backward pass and + +560 +00:43:02,889 --> 00:43:06,150 +you can actually check out in the fast +lane is on the homework implements that + +561 +00:43:06,150 --> 00:43:11,050 +too although actually further in the +fast layers on the homework the call tim + +562 +00:43:11,050 --> 00:43:18,910 +is in sight on I couldn't find a way to +get it fast enough in there's actually + +563 +00:43:18,909 --> 00:43:22,710 +another way that sometimes people use +for convolutions and that's this idea of + +564 +00:43:22,710 --> 00:43:27,400 +a Fast Fourier Transform so if you have +some memories from like a signal + +565 +00:43:27,400 --> 00:43:30,700 +processing class or something like that +you might remember this thing called the + +566 +00:43:30,699 --> 00:43:34,639 +convolution theorem met says that you if +you have two signals and you want to + +567 +00:43:34,639 --> 00:43:38,779 +call them either 
discreetly are +continuously with another girl then + +568 +00:43:38,780 --> 00:43:44,130 +taking a convolution of these two +signals is the same as rather the + +569 +00:43:44,130 --> 00:43:47,820 +Fourier transform of the convolutions is +the same as the elements product of the + +570 +00:43:47,820 --> 00:43:51,859 +Fourier transforms so if you have you +unpacked out and stare the symbols I + +571 +00:43:51,858 --> 00:43:56,779 +think it'll make sense and if also you +might remember from again a signal + +572 +00:43:56,780 --> 00:44:00,240 +processing class or an algorithm class +there's this amazing thing called the + +573 +00:44:00,239 --> 00:44:04,299 +fast Fourier transform that actually +likes lets us to compute Fourier + +574 +00:44:04,300 --> 00:44:08,080 +transforms an inverse Fourier transforms +really really fast + +575 +00:44:08,079 --> 00:44:11,679 +you may have seen it bears versions of +this in one day in 2d and they're all + +576 +00:44:11,679 --> 00:44:17,129 +really fast so we can actually applied a +stricter convolutions so the way this + +577 +00:44:17,130 --> 00:44:20,660 +works is that first we're going to +compute use the Fast Fourier Transform + +578 +00:44:20,659 --> 00:44:24,899 +to compute the Fourier transform the +weights also compute the Fourier + +579 +00:44:24,900 --> 00:44:30,320 +transform of our activation map and now +in Fourier space we just do an element + +580 +00:44:30,320 --> 00:44:35,050 +multiplication which is really really +fast and efficient and then we just come + +581 +00:44:35,050 --> 00:44:40,269 +again and use the pass for a transformed +to do the inverse transform the output + +582 +00:44:40,269 --> 00:44:44,420 +of that elements product and that +implements convolutions for us in this + +583 +00:44:44,420 --> 00:44:52,550 +kinda cool fancy clever way and this is +actually been used and face some folks + +584 +00:44:52,550 --> 00:44:55,940 +that Facebook had a paper about this +last year and they actually released a + +585 +00:44:55,940 --> 00:44:57,650 +GPU library to do this + +586 +00:44:57,650 --> 00:45:03,329 +compute these things but the sad thing +about these Fourier transforms this that + +587 +00:45:03,329 --> 00:45:07,819 +they actually give you really really big +speedups over other methods but really + +588 +00:45:07,820 --> 00:45:11,970 +only four large boulders and when you're +working on these small three by three + +589 +00:45:11,969 --> 00:45:15,829 +filters the overhead of computing the +Fourier transform just towards the + +590 +00:45:15,829 --> 00:45:20,449 +computation of doing the computation +directly in the in the input pixel space + +591 +00:45:20,449 --> 00:45:25,579 +and as we just talked about earlier in +the lecture small contributions are + +592 +00:45:25,579 --> 00:45:30,389 +really really nice and appealing and +great for lots of reasons so it's a + +593 +00:45:30,389 --> 00:45:33,489 +little bit of a shame that this for a +trick doesn't work out too well impact + +594 +00:45:33,489 --> 00:45:38,439 +us but if for some reason you do want to +compute really large contributions then + +595 +00:45:38,440 --> 00:45:46,019 +this is something you can try yeah + +596 +00:45:46,019 --> 00:46:02,489 +too involved in stuff but I imagine if +you think it's a problem is probably a + +597 +00:46:02,489 --> 00:46:04,639 +problem + +598 +00:46:04,639 --> 00:46:12,900 +ya another thing to point out is that +one kind of balance out about Fourier + +599 +00:46:12,900 --> 00:46:17,430 +transforms conclusions is that they 
+don't handle striding too well so far + +600 +00:46:17,429 --> 00:46:21,219 +normal computer with your computing +strident convolutions in sort of normal + +601 +00:46:21,219 --> 00:46:25,409 +input space you only compute a small +subset of those in our products so you + +602 +00:46:25,409 --> 00:46:28,489 +actually save a lot of computation when +you strike the convolutions + +603 +00:46:28,489 --> 00:46:32,199 +directly on the input space but the way +you tend to implement strident + +604 +00:46:32,199 --> 00:46:36,649 +convolutions in Fourier transform space +is you just compute the whole thing and + +605 +00:46:36,650 --> 00:46:43,180 +then you throw out part of the data so +that ends up not being very efficient so + +606 +00:46:43,179 --> 00:46:47,969 +there's another trick that has not +really become too I think too widely + +607 +00:46:47,969 --> 00:46:51,989 +known yet but I really liked it so I +thought I wanted to talk about that so + +608 +00:46:51,989 --> 00:46:55,909 +you may remember from algorithms class +something called stratton's algorithm + +609 +00:46:55,909 --> 00:47:00,789 +right there's this idea that when you do +a naive matrix multiplication of to end + +610 +00:47:00,789 --> 00:47:04,869 +by and matrices kind of if you count up +although although all the modifications + +611 +00:47:04,869 --> 00:47:08,630 +and additions that you need to do it's +going to take about its gonna take any + +612 +00:47:08,630 --> 00:47:12,950 +cute operations and stratton's algorithm +is this like really crazy thing we + +613 +00:47:12,949 --> 00:47:16,839 +compute all these crazy intermediates +and it somehow magically works out to + +614 +00:47:16,840 --> 00:47:22,289 +compute the output asymptotically faster +than the naive method and you know from + +615 +00:47:22,289 --> 00:47:26,869 +him to call me know that matrix +multiplication is this we can implement + +616 +00:47:26,869 --> 00:47:31,339 +convolution as matrix multiplication to +intuitively you might expect that these + +617 +00:47:31,340 --> 00:47:35,110 +similar types of tricks might +theoretically maybe be applicable to + +618 +00:47:35,110 --> 00:47:41,320 +convolution and it turns out they are so +there's this really cool paper that just + +619 +00:47:41,320 --> 00:47:46,370 +came out over the summer where these two +guys worked out very explicitly that + +620 +00:47:46,369 --> 00:47:50,670 +something very special cases 43 by +frequent allusions and it involves this + +621 +00:47:50,670 --> 00:47:54,659 +obviously I'm not going to go into +details here but it's a similar flavor + +622 +00:47:54,659 --> 00:47:58,539 +to stress and computing very clever +intermediate + +623 +00:47:58,539 --> 00:48:03,630 +and Henry combining them to actually +save a lot on the computation and these + +624 +00:48:03,630 --> 00:48:08,220 +guys are actually really really intense +and they're not just mathematicians they + +625 +00:48:08,219 --> 00:48:11,959 +actually wrote also highly highly +optimized CUDA kernels to compute these + +626 +00:48:11,960 --> 00:48:17,570 +things and were able to speed up BGG by +a factor of two so that's really really + +627 +00:48:17,570 --> 00:48:21,890 +impressive so I think that these these +type this type of truck might become + +628 +00:48:21,889 --> 00:48:26,019 +pretty popular in the future but for the +time being I think it's not very widely + +629 +00:48:26,019 --> 00:48:30,650 +used but these numbers are crazy +especially for small batch sizes they're + +630 +00:48:30,650 --> 00:48:35,010 +getting 
a six speed up on BGG that's +that's really really impressive and I + +631 +00:48:35,010 --> 00:48:38,770 +think it's a really cool method the +downside is that you kinda have to work + +632 +00:48:38,769 --> 00:48:43,009 +out these explicit special cases each +different size of convolution but maybe + +633 +00:48:43,010 --> 00:48:45,850 +if we only care about three by three +convolutions that's not such a big deal + +634 +00:48:45,849 --> 00:48:54,719 +so the recap computing convolutions in +practice is that the sort of the really + +635 +00:48:54,719 --> 00:48:58,579 +fast easy quick and dirty way to +implement these things is in to call + +636 +00:48:58,579 --> 00:49:02,869 +matrix multiplication is passed it does +it's usually not too hard to implement + +637 +00:49:02,869 --> 00:49:06,609 +these things so if for some reason you +really need to implement competitions + +638 +00:49:06,610 --> 00:49:11,400 +yourself I'd really recommend into call +activity is something that coming from + +639 +00:49:11,400 --> 00:49:15,230 +signal processing you might think would +be really cool and really useful but it + +640 +00:49:15,230 --> 00:49:19,719 +turns out that it's it does give speed +ups but only for big filters so it's not + +641 +00:49:19,719 --> 00:49:24,000 +as useful as you might have hoped but +there is hope because these fast + +642 +00:49:24,000 --> 00:49:25,440 +algorithms are really good + +643 +00:49:25,440 --> 00:49:29,650 +filters and there already exists code +somewhere in the world to do it so + +644 +00:49:29,650 --> 00:49:35,889 +hopefully these these things will catch +on and become more widely used so if + +645 +00:49:35,889 --> 00:49:41,529 +there's any questions about computing +convolutions + +646 +00:49:41,530 --> 00:49:50,940 +ok so next we're gonna talk about some +implementation details so first question + +647 +00:49:50,940 --> 00:49:55,710 +how do you guys ever built your own +computer + +648 +00:49:55,710 --> 00:50:01,710 +ok so you guys are prevented from this +answer on this next slide so who can + +649 +00:50:01,710 --> 00:50:07,869 +spot the CPU anyone on a point out + +650 +00:50:07,869 --> 00:50:17,210 +the CPU is this little guy right so +actually this this thing is actually a + +651 +00:50:17,210 --> 00:50:22,179 +lot of it is the cooler so the CPU +itself is a little tiny part inside of + +652 +00:50:22,179 --> 00:50:28,730 +here a lot of this is actually the +heatsink cooling the next spot the GPU + +653 +00:50:28,730 --> 00:50:38,320 +yes it's the thing that says GeForce on +and so this GPU is is for one thing it's + +654 +00:50:38,320 --> 00:50:43,180 +it's much larger and the CPU so you +might so it may be is is more powerful I + +655 +00:50:43,179 --> 00:50:48,679 +know but at least it's taking up more +space in the case so that's that's kind + +656 +00:50:48,679 --> 00:50:54,309 +of an indication that something exciting +is happening so I'm another question and + +657 +00:50:54,309 --> 00:50:57,029 +you gotta play video games + +658 +00:50:57,030 --> 00:51:05,390 +ok then you probably have opinions about +this so turns out a lot of people in + +659 +00:51:05,389 --> 00:51:09,809 +machine learning and deep learning have +really strong opinions too and most + +660 +00:51:09,809 --> 00:51:15,639 +people are on the side so Nvidia is +actually much much more widely used then + +661 +00:51:15,639 --> 00:51:21,179 +AMD for you using GPUs and US and the +reason is that + +662 +00:51:21,179 --> 00:51:25,599 +NVIDIA has really done a lot in the last +couple 
of years to really dive into deep + +663 +00:51:25,599 --> 00:51:30,710 +learning and make it a really core part +of their focus so as a cool example of + +664 +00:51:30,710 --> 00:51:34,769 +that last year at GTC which is an + +665 +00:51:34,769 --> 00:51:39,869 +videos sort of yearly big gigantic +conference for the announce new products + +666 +00:51:39,869 --> 00:51:44,230 +Jensen Hong who is the CEO of in video +and actually also stanford alarm + +667 +00:51:44,230 --> 00:51:49,059 +introduced this latest and greatest +amazing new GPU capitation acts like + +668 +00:51:49,059 --> 00:51:53,400 +their flagship thing and the benchmark +he used to sell it was how fast the + +669 +00:51:53,400 --> 00:51:56,800 +country and Alex met so this was crazy + +670 +00:51:56,800 --> 00:52:00,140 +this was a gigantic room with like +hundreds and hundreds of people and + +671 +00:52:00,139 --> 00:52:04,279 +journalists and like this gigantic +highly polished presentation and the CEO + +672 +00:52:04,280 --> 00:52:07,890 +of in video was talking about Alex net +and convolutions and I thought that was + +673 +00:52:07,889 --> 00:52:11,690 +really exciting and it kind of shows you +that Nvidia really cares a lot about + +674 +00:52:11,690 --> 00:52:15,300 +getting these things to work and they +pushed a lot of their efforts into + +675 +00:52:15,300 --> 00:52:22,150 +getting into making it work so just to +give you an idea a CPU as you probably + +676 +00:52:22,150 --> 00:52:26,900 +know is really good at fast sequential +processing and they tend to have a small + +677 +00:52:26,900 --> 00:52:31,019 +number of cores your laptop probably +have like maybe between one and four + +678 +00:52:31,019 --> 00:52:36,920 +corners and big things on a server might +have up to 16 quarters and these things + +679 +00:52:36,920 --> 00:52:39,610 +are really good at computing things +really really fast + +680 +00:52:39,610 --> 00:52:45,349 +and in sequence GPU is on the other hand +tend to have many many many course for a + +681 +00:52:45,349 --> 00:52:49,759 +big guy like a tax it can have up to +thousands of quarters but they tend each + +682 +00:52:49,760 --> 00:52:53,500 +core can do last May 10 2010 lower clock +speed and be able to do less per + +683 +00:52:53,500 --> 00:52:59,429 +instruction cycle so these GPUs again we +actually were originally developed for + +684 +00:52:59,429 --> 00:53:05,230 +processing graphics graphics processing +units so they're really good at doing + +685 +00:53:05,230 --> 00:53:09,699 +sort of highly paralyzed operations are +you wanna do many many things in + +686 +00:53:09,699 --> 00:53:15,460 +parallel independently and since they +were originally designed for computer + +687 +00:53:15,460 --> 00:53:19,590 +graphics but since then they've sort of +evolved as a more general computing + +688 +00:53:19,590 --> 00:53:23,100 +platform so there are different +frameworks that allow you to write + +689 +00:53:23,099 --> 00:53:28,929 +generic code to run directly on the GPU +so from Nvidia we have this framework + +690 +00:53:28,929 --> 00:53:33,509 +that lets you write a variant of seats +actually write code that runs directly + +691 +00:53:33,510 --> 00:53:37,990 +on the GPU and there's a similar +framework called OpenCL that works on + +692 +00:53:37,989 --> 00:53:43,569 +pretty much any any computational +platform but I mean open standards are + +693 +00:53:43,570 --> 00:53:48,890 +nice and it's quite nice that OpenCL +works everywhere but in practice open so + +694 +00:53:48,889 --> 
00:53:52,559 +that tends to be a lot more performance +and how a little bit nicer library + +695 +00:53:52,559 --> 00:53:57,420 +support so at least four deep learning +most people use could instead and if + +696 +00:53:57,420 --> 00:54:01,309 +you're interested in actually learning +how to write G Piko G Piko yourself + +697 +00:54:01,309 --> 00:54:05,230 +there's a really cool nasty course I +would it's it's pretty cool have fun + +698 +00:54:05,230 --> 00:54:09,409 +assignments all that lets you write code +to run things on GPU although in + +699 +00:54:09,409 --> 00:54:12,730 +practice if all you want to do is train +come nuts and do research and that sort + +700 +00:54:12,730 --> 00:54:16,409 +of thing you end up usually not having +to write any of this code yourself you + +701 +00:54:16,409 --> 00:54:20,139 +just rely on external libraries + +702 +00:54:20,139 --> 00:54:33,440 +right so could I is like this this raw +so cute and higher higher level library + +703 +00:54:33,440 --> 00:54:38,599 +kind of like glass right so one thing +that GPUs are really really good at is + +704 +00:54:38,599 --> 00:54:43,420 +matrix multiplication so here's here's a +benchmark I mean this is from Nvidia's + +705 +00:54:43,420 --> 00:54:49,550 +website so it's a little bit biased but +this is showing matrix multiplication + +706 +00:54:49,550 --> 00:54:54,789 +time as a function of matrix eyes on a +pretty beefy CPU this is a 12 corps guy + +707 +00:54:54,789 --> 00:55:00,079 +that would live in a server that's like +quite a quite a healthy CPU and this is + +708 +00:55:00,079 --> 00:55:04,000 +running the same date science matrix +multiply on a test like a 40 which is a + +709 +00:55:04,000 --> 00:55:11,000 +pretty beefy GPU and it's much faster I +mean that's no big surprise right and + +710 +00:55:11,000 --> 00:55:15,119 +GPUs are also really gotta convolutions +so as you mentioned and video has a + +711 +00:55:15,119 --> 00:55:19,909 +library called today announced that is +specifically optimized optimist CUDA + +712 +00:55:19,909 --> 00:55:26,139 +kernels for convolution so compared to +CPU I mean it's it's WAY faster and this + +713 +00:55:26,139 --> 00:55:30,139 +is actually comparing him to call +contributions from campaign with the + +714 +00:55:30,139 --> 00:55:34,920 +crew tienen convolutions I think these +graphs are actually from the first + +715 +00:55:34,920 --> 00:55:41,030 +version of CNN version for just came out +a few weeks ago and but this is the only + +716 +00:55:41,030 --> 00:55:44,600 +version where they actually had a CPU +benchmark since then the benchmark civil + +717 +00:55:44,599 --> 00:55:49,699 +me been against previous versions so +it's got a lot faster since then since + +718 +00:55:49,699 --> 00:55:54,769 +here but the way this witness fits and +is that something like two blasts or to + +719 +00:55:54,769 --> 00:56:00,090 +DNN is a C library so it provides +functions and see that just sort of + +720 +00:56:00,090 --> 00:56:05,309 +abstract away the GPU as a C library so +if you have a tensor sort of in in + +721 +00:56:05,309 --> 00:56:09,429 +memory and see you can just pass a +pointer to the Korean library and it'll + +722 +00:56:09,429 --> 00:56:13,299 +return the conf little running on GPU +maybe asynchronously and return the + +723 +00:56:13,300 --> 00:56:19,440 +result so frameworks like cafe and torch +all have now integrated the Q tienen + +724 +00:56:19,440 --> 00:56:23,750 +stuff into their own frameworks you can +utilize these efficient solutions in any + 
+725 +00:56:23,750 --> 00:56:30,340 +of these frameworks know but the problem +is that even when once we have these + +726 +00:56:30,340 --> 00:56:33,430 +really powerful GPUs training big models +is still kind + +727 +00:56:33,429 --> 00:56:39,409 +slow so VG nett was famously train for +something like two to three weeks on for + +728 +00:56:39,409 --> 00:56:43,759 +Titan what was a Titan black sandals +aren't cheap and it was actually a + +729 +00:56:43,760 --> 00:56:47,280 +recommendation of ResNet recently +there's a really cool right up this + +730 +00:56:47,280 --> 00:56:51,839 +really cool blog post describing it here +and they actually retrained the ResNet + +731 +00:56:51,838 --> 00:56:56,400 +hundred and one layer model and it also +took about two weeks to train on for + +732 +00:56:56,400 --> 00:57:03,880 +GPUs so that's not good and the one way +that people the way that the easy way to + +733 +00:57:03,880 --> 00:57:08,269 +split up training across multiple GPUs +is just to split your money back across + +734 +00:57:08,269 --> 00:57:14,230 +the GPUs so normally you might have you +especially for someone like BGG it takes + +735 +00:57:14,230 --> 00:57:17,679 +a lot of memory so you can't compete +with very large me batch sizes on a + +736 +00:57:17,679 --> 00:57:23,649 +single GPU so what you'll do you have +any batch of images may be a 6:00 128 or + +737 +00:57:23,650 --> 00:57:24,700 +something like that + +738 +00:57:24,699 --> 00:57:30,338 +than any match into four equal chunks +each GPU compute a forward and backward + +739 +00:57:30,338 --> 00:57:35,190 +pass for that many batch in your compute +pramit gradients on the weights while + +740 +00:57:35,190 --> 00:57:39,470 +some of those weights inside your some +of those weights after all for GPU + +741 +00:57:39,469 --> 00:57:44,548 +Spanish and make an update your model so +this is a really simple way that people + +742 +00:57:44,548 --> 00:57:53,599 +tend to implement distribution on GPUs +yeah yeah + +743 +00:57:53,599 --> 00:57:59,089 +yeah yeah so that's why they claim that +they can automate this process and + +744 +00:57:59,090 --> 00:58:03,039 +really really efficiently distribute it +which is really exciting I think but I + +745 +00:58:03,039 --> 00:58:07,820 +haven't played much myself and also at +least in torch there's a data parallel + +746 +00:58:07,820 --> 00:58:11,059 +there that you can just drop in and use +that all sort of automatically do with + +747 +00:58:11,059 --> 00:58:14,070 +this type of parallelism very easily + +748 +00:58:14,070 --> 00:58:18,930 +a slightly more complex idea for multi +GPU training actually comes from Alex + +749 +00:58:18,929 --> 00:58:21,279 +Alex not fame + +750 +00:58:21,280 --> 00:58:26,670 +guess that's kind of cool kind of a +funny title but the idea but the idea is + +751 +00:58:26,670 --> 00:58:31,409 +that we want to actually do data +parallelism on the lower layers so on + +752 +00:58:31,409 --> 00:58:35,980 +the lower layers will take our image +many batch split up across two GPUs and + +753 +00:58:35,980 --> 00:58:42,059 +eat and GPU one will compute the +convolutions for the first part first + +754 +00:58:42,059 --> 00:58:46,279 +part of the many batch and just released +just this comp convolution part will be + +755 +00:58:46,280 --> 00:58:49,960 +distributed equally across the GPUs but +once you get to the fully connected + +756 +00:58:49,960 --> 00:58:50,760 +layers + +757 +00:58:50,760 --> 00:58:54,800 +he found it's actually more efficient if +you are just 
really big matrix + +758 +00:58:54,800 --> 00:58:58,810 +multiplies then it's more efficient +actually have the GPS work together to + +759 +00:58:58,809 --> 00:59:02,869 +compute this matrix multiply this is +kind of a cool track it's not very + +760 +00:59:02,869 --> 00:59:09,480 +commonly used but I thought it's it's +fun to mention another idea from Google + +761 +00:59:09,480 --> 00:59:13,800 +is before it before there was tenser +flow they had this thing called + +762 +00:59:13,800 --> 00:59:18,380 +disbelief which was their their previous +system which was entirely CPU based + +763 +00:59:18,380 --> 00:59:22,630 +which from the benchmarks a few slides +ago you can imagine was going to be + +764 +00:59:22,630 --> 00:59:26,250 +really slow but actually the first +version of Google Matt was all trained + +765 +00:59:26,250 --> 00:59:30,800 +in disbelief on CPU so they actually so +they had to do massive amounts of + +766 +00:59:30,800 --> 00:59:35,800 +distribution on CPU to get these things +to train so here there's this cool paper + +767 +00:59:35,800 --> 00:59:39,530 +from jap teen a couple years ago that +describes this and a lot more detail but + +768 +00:59:39,530 --> 00:59:43,640 +you use data parallelism or you have +each machine have an independent copy of + +769 +00:59:43,639 --> 00:59:48,710 +the model and each machine as computing +forward and backward on patches of data + +770 +00:59:48,710 --> 00:59:52,659 +but now i text you actually have this +parameters server that's storing the + +771 +00:59:52,659 --> 00:59:55,739 +parameters of the model and these +independent workers are making + +772 +00:59:55,739 --> 01:00:01,209 +communication with the parameters server +to make updates on the model and they + +773 +01:00:01,210 --> 01:00:05,740 +contrast this with model parallelism +which is where you type 1 + +774 +01:00:05,739 --> 01:00:09,879 +model and you have different different +workers computing different parts of the + +775 +01:00:09,880 --> 01:00:14,650 +model so and in disbelief they really +did a really good job + +776 +01:00:14,650 --> 01:00:18,110 +optimizing this to work really well +across many many CPUs and many many + +777 +01:00:18,110 --> 01:00:23,170 +machines but now they have cancer flow +which hopefully should do these things + +778 +01:00:23,170 --> 01:00:28,639 +more automatically and once you're doing +these these these updates there's this + +779 +01:00:28,639 --> 01:00:34,949 +idea between asynchronous STD and +synchronous STD so synchronous STD is + +780 +01:00:34,949 --> 01:00:39,299 +one of the things like the naive thing +you might expect you have any batch you + +781 +01:00:39,300 --> 01:00:42,880 +split up across multiple workers each +worker does forward and backward + +782 +01:00:42,880 --> 01:00:46,710 +computes gradients when you add up all +the gradients and make a single model + +783 +01:00:46,710 --> 01:00:51,220 +updates this will this will sort of +exactly simulate + +784 +01:00:51,219 --> 01:00:55,029 +just computing but many batch on a +larger machine but it could be kind of + +785 +01:00:55,030 --> 01:00:59,619 +slow since you to synchronize across +machines this tends to be too much of a + +786 +01:00:59,619 --> 01:01:03,610 +big deal when you're working with +multiple GPUs on a single note but once + +787 +01:01:03,610 --> 01:01:08,430 +you're distributed across many many CPUs +that district that I'm synchronization + +788 +01:01:08,429 --> 01:01:12,569 +can actually be quite expensive so +instead at least they also have this + 
+789 +01:01:12,570 --> 01:01:17,500 +concept of asynchronous STD where each +model is just sort of making updates to + +790 +01:01:17,500 --> 01:01:21,599 +the to its own copy of the parameters +and those have some notion of an + +791 +01:01:21,599 --> 01:01:25,480 +eventual consistency where they +sometimes periodically synchronize with + +792 +01:01:25,480 --> 01:01:29,530 +each other and it's seems really +complicated and hard to debug but they + +793 +01:01:29,530 --> 01:01:35,619 +got it to work so that's that's pretty +cool and one of the really cool pictures + +794 +01:01:35,619 --> 01:01:39,430 +so these two figures are both in the +tensor flow paper and one of the + +795 +01:01:39,429 --> 01:01:42,549 +pictures of tensor flow is that it +should really make this type of + +796 +01:01:42,550 --> 01:01:46,510 +distribution much more transparent to +the user that if you do happen to have + +797 +01:01:46,510 --> 01:01:51,580 +access to a big cluster of GPUs and CPUs +and whatnot tenser flow should + +798 +01:01:51,579 --> 01:01:54,840 +automatically be able to figure out the +best way to do these kinds of + +799 +01:01:54,840 --> 01:01:58,970 +distributions combining data and model +parallelism and just do it all for you + +800 +01:01:58,969 --> 01:02:03,399 +so that's that's really cool and I think +that's that's the really exciting part + +801 +01:02:03,400 --> 01:02:11,050 +about 1000 any any questions about the +stupid training yeah + +802 +01:02:11,050 --> 01:02:16,120 +and CN TK I haven't even taken a look at +it yet + +803 +01:02:16,119 --> 01:02:22,130 +ok so next time there's a couple +bottlenecks you should be aware of in + +804 +01:02:22,130 --> 01:02:27,500 +practice so expect like usually when +you're training these things like this + +805 +01:02:27,500 --> 01:02:30,769 +distributed stuff is nice and great but +you can actually go a long way with just + +806 +01:02:30,769 --> 01:02:34,840 +a single GPU on a single machine and +there there's a lot of bottlenecks that + +807 +01:02:34,840 --> 01:02:39,160 +can get in the way one is the +communication between the CPU and GPU + +808 +01:02:39,159 --> 01:02:44,759 +actually and a lot of cases especially +when the data is small the most + +809 +01:02:44,760 --> 01:02:48,000 +expensive part of the pipeline is +copying the data onto the GPU and then + +810 +01:02:48,000 --> 01:02:51,579 +copy it back once you get things under +the GPU you can do + +811 +01:02:51,579 --> 01:02:55,719 +computation really really fast and +efficiently but the copying is the + +812 +01:02:55,719 --> 01:03:01,089 +really slow part so 11 idea as you want +to make sure to avoid the memory copy + +813 +01:03:01,090 --> 01:03:06,570 +like one thing that sometimes you see +you all at each layer of the network is + +814 +01:03:06,570 --> 01:03:10,460 +copying back and forth from CPU GPU and +I'll be really inefficient and slow + +815 +01:03:10,460 --> 01:03:14,170 +everything down so ideally you want the +whole forward and backward pass to run + +816 +01:03:14,170 --> 01:03:17,159 +on a GPU at once + +817 +01:03:17,159 --> 01:03:21,139 +another thing you'll sometimes see is +multithreaded approach where you'll have + +818 +01:03:21,139 --> 01:03:27,849 +a CPU thread that is prefetching data +many memory in one thread in the + +819 +01:03:27,849 --> 01:03:28,690 +background + +820 +01:03:28,690 --> 01:03:34,070 +possibly also appointed augmentations +online and then this this background CPU + +821 +01:03:34,070 --> 01:03:37,470 +throughout will be sort of 
preparing me +batches and possibly also shipping them + +822 +01:03:37,469 --> 01:03:41,669 +over to GPU you can kind of coordinate +this loading of data and computing + +823 +01:03:41,670 --> 01:03:44,680 +preprocessing and shipping memory +shipping + +824 +01:03:44,679 --> 01:03:48,940 +many batch data to the GPU and actually +doing the computations and actually you + +825 +01:03:48,940 --> 01:03:51,980 +can get pretty involved with some +courting I'll be all these things in a + +826 +01:03:51,980 --> 01:03:57,719 +multithreaded way and I can give you +some good speedups so cafe in particular + +827 +01:03:57,719 --> 01:04:01,059 +I think already implements this +prefetching date on there for certain + +828 +01:04:01,059 --> 01:04:04,199 +types of data storages and other +frameworks you just have to roll your + +829 +01:04:04,199 --> 01:04:11,839 +own another problem is that the CPU disk +model Mac so these these things are kind + +830 +01:04:11,840 --> 01:04:17,820 +of slow they're cheap and they're big +but they actually are not the best so so + +831 +01:04:17,820 --> 01:04:22,220 +these are hard disks that now the solid +state drives are much more common + +832 +01:04:22,219 --> 01:04:25,730 +but the problem is a solid state drives +are you know smaller and more expensive + +833 +01:04:25,730 --> 01:04:30,590 +but they're a lot faster so they get +used a lot in practice so what's really + +834 +01:04:30,590 --> 01:04:35,710 +although one 1 common feature to both +hard disks and solid-state drives as + +835 +01:04:35,710 --> 01:04:39,889 +they work best when you're reading data +sequentially off the desk so a lot of + +836 +01:04:39,889 --> 01:04:44,108 +times what you're right so one thing +that would be really bad for example is + +837 +01:04:44,108 --> 01:04:48,569 +to have a big folder full of JPEG images +because now each of these images could + +838 +01:04:48,570 --> 01:04:52,309 +be located in different parts on the +desk so it could be really up to a + +839 +01:04:52,309 --> 01:04:56,619 +random seek to read any individual JPEG +image and now also once you read the + +840 +01:04:56,619 --> 01:05:01,150 +JPEG you have to decompress it into +pixels that's quite inefficient so what + +841 +01:05:01,150 --> 01:05:05,079 +you'll see a lot of times in practice is +that you'll actually preprocessor data + +842 +01:05:05,079 --> 01:05:10,059 +by decompressing it and just riding out +the raw pixels entire data sat in one + +843 +01:05:10,059 --> 01:05:15,940 +giant contiguous files to desk so that +that takes a lot of disk space but we do + +844 +01:05:15,940 --> 01:05:22,230 +it anyway because it's all for the good +of a calmness right so this is kinda so + +845 +01:05:22,230 --> 01:05:27,400 +in cafe we do this with a coupled with +like a level d be is one commonly used + +846 +01:05:27,400 --> 01:05:33,599 +format I've also used I also use html5 +files a lot for us but the idea is that + +847 +01:05:33,599 --> 01:05:39,280 +you want to just get your data all +sequentially on desk and already turned + +848 +01:05:39,280 --> 01:05:43,180 +into pixels Senate training when you're +training you can store all your data in + +849 +01:05:43,179 --> 01:05:46,230 +memory you have to read off desk when +you wanna make that read as fast as + +850 +01:05:46,230 --> 01:05:50,679 +possible and again with clever amounts +of prefetching and multi-threaded stuff + +851 +01:05:50,679 --> 01:05:54,829 +you might have you might have won prized +pitching a top desk while other + +852 +01:05:54,829 --> 
01:05:57,460 +competition is happening in the +background + +853 +01:05:57,460 --> 01:06:05,019 +another thing to keep in mind is GPU +memory bottlenecks so GPUs big ones have + +854 +01:06:05,019 --> 01:06:10,559 +big ones have a lot of memory but not +that much so the biggest GPUs you can + +855 +01:06:10,559 --> 01:06:15,539 +buy right now that I tax and the key +forty have 12 gigs of memory and that's + +856 +01:06:15,539 --> 01:06:18,139 +pretty much as big as you're going to +get right now + +857 +01:06:18,139 --> 01:06:22,679 +NextGen should be bigger but you can +actually bump up against this limit + +858 +01:06:22,679 --> 01:06:26,989 +without too much trouble especially if +you're training something like a BG or + +859 +01:06:26,989 --> 01:06:31,608 +if you're having recurrent networks were +very very very very long time stops it's + +860 +01:06:31,608 --> 01:06:34,929 +actually not too hard to bump up against +this memory limit that's something you + +861 +01:06:34,929 --> 01:06:35,598 +need to keep + +862 +01:06:35,599 --> 01:06:39,130 +mind when you're training these things +and some of these planes about you know + +863 +01:06:39,130 --> 01:06:43,450 +these efficient convolutions and +cleverly creating architectures actually + +864 +01:06:43,449 --> 01:06:47,068 +helps with this memory as well if you +can have a bigger more powerful model + +865 +01:06:47,068 --> 01:06:52,268 +with smaller amounts of with don't use +less memory than you'll be able to train + +866 +01:06:52,268 --> 01:06:58,129 +things faster and use bigger matches and +everything is good and even just just a + +867 +01:06:58,130 --> 01:07:01,588 +sense of scale Alex Knight is pretty +small compared to a lot of the models + +868 +01:07:01,588 --> 01:07:05,608 +that are state of the art now but Alex +net with a 256 back sides already takes + +869 +01:07:05,608 --> 01:07:09,469 +about 3 gigabytes GB memory so once you +have to these bigger networks it's + +870 +01:07:09,469 --> 01:07:15,738 +actually not too hard to bump up against +the 12 Dec limits so another thing we + +871 +01:07:15,739 --> 01:07:20,978 +should talk about is floating point +precision so when I'm writing code a lot + +872 +01:07:20,978 --> 01:07:24,788 +of times I like to imagine that you know +these things are just real numbers and + +873 +01:07:24,789 --> 01:07:27,960 +they just work but in practice that's +not true and you need to think about + +874 +01:07:27,960 --> 01:07:32,889 +things like how many bits of +floating-point are using so most types + +875 +01:07:32,889 --> 01:07:37,159 +are a lot of types of numeric code that +you might write sort of is with a double + +876 +01:07:37,159 --> 01:07:43,278 +precision by default this is using 64 +bits and a lot of also wrote more + +877 +01:07:43,278 --> 01:07:47,449 +commonly used for deep learning is this +idea of single precision so this is only + +878 +01:07:47,449 --> 01:07:52,710 +32 bets so the idea is that if each +number takes fewer bets then you can + +879 +01:07:52,710 --> 01:07:56,469 +store more of those numbers within the +same amount of memory so that's good and + +880 +01:07:56,469 --> 01:08:00,559 +also with fewer bets you need less +computes operate on those numbers that's + +881 +01:08:00,559 --> 01:08:05,210 +also good so in general we would like to +have smaller data types because they're + +882 +01:08:05,210 --> 01:08:11,150 +faster to compute and the useless memory +and as a as a case study this was + +883 +01:08:11,150 --> 01:08:15,489 +actually even an issue on 
homework so +you may have noticed that and the + +884 +01:08:15,489 --> 01:08:16,960 +default data type is this + +885 +01:08:16,960 --> 01:08:21,289 +64 bit double precision but for all of +these models that we provided you on + +886 +01:08:21,289 --> 01:08:25,789 +homework we had this cast or 32 bit +floating point number and you can + +887 +01:08:25,789 --> 01:08:28,670 +actually go back on the homework and try +switching between these two and you'll + +888 +01:08:28,670 --> 01:08:32,908 +see that switching to the 32 bit +actually gives you some decent some + +889 +01:08:32,908 --> 01:08:39,670 +decent speed ups so bad and the obvious +question is that if 32 bets are better + +890 +01:08:39,670 --> 01:08:42,829 +than 64 bet spend maybe we can use less +than + +891 +01:08:42,829 --> 01:08:52,199 +so there's this right + +892 +01:08:52,199 --> 01:09:01,010 +16 bets but it was ordered to do these +great ok so in addition to 32 bit + +893 +01:09:01,010 --> 01:09:05,420 +floating point there's also a standard +for 16 bit floating point which is + +894 +01:09:05,420 --> 01:09:09,699 +sometimes called the half precision and +actually recent versions of cunanan do + +895 +01:09:09,699 --> 01:09:17,199 +support computing things in a position +that's cool and actually there there are + +896 +01:09:17,199 --> 01:09:20,050 +some other other existing +implementations from a company called + +897 +01:09:20,050 --> 01:09:23,850 +their bana who also has these +sixteen-bit implementations so these are + +898 +01:09:23,850 --> 01:09:28,350 +the fastest convolutions out there right +now so these there's this nice get + +899 +01:09:28,350 --> 01:09:31,850 +hungry poll that has different kinds of +comment benchmarks for different types + +900 +01:09:31,850 --> 01:09:35,160 +of convolutions and frameworks and +everything and pretty much everything + +901 +01:09:35,159 --> 01:09:38,319 +winning all these benchmarks right now +are these 16 bit floating point + +902 +01:09:38,319 --> 01:09:42,279 +operations from Nirvana which is not +surprising right because I can you have + +903 +01:09:42,279 --> 01:09:47,479 +your bets so it's faster to compete but +right now there's actually not yet + +904 +01:09:47,479 --> 01:09:51,479 +framework support in things like cafe or +torch for utilizing the sixteen-bit + +905 +01:09:51,479 --> 01:09:57,299 +computation but it should be coming very +soon but the problem is that even if we + +906 +01:09:57,300 --> 01:10:01,420 +can compute it's it's it's pretty +obvious that if you have 16 but numbers + +907 +01:10:01,420 --> 01:10:05,880 +you can compete with them very fast but +once you get to 16 better than you might + +908 +01:10:05,880 --> 01:10:10,380 +actually be worried about numeric +precision because two of the sixteen is + +909 +01:10:10,380 --> 01:10:13,550 +not that big of a number anymore it's +actually not too many real numbers you + +910 +01:10:13,550 --> 01:10:20,360 +can even represent so there is this +paper from a couple years ago that did + +911 +01:10:20,359 --> 01:10:25,339 +some experiments low precision floating +point and they found that actually just + +912 +01:10:25,340 --> 01:10:28,710 +using the experiment they actually use a +fixed with a floating-point + +913 +01:10:28,710 --> 01:10:34,819 +implementation and they found that +actually with these very with with this + +914 +01:10:34,819 --> 01:10:38,659 +sort of naive implementation of oslo of +these low precision methods the networks + +915 +01:10:38,659 --> 01:10:43,689 +had a hard time converging 
+
+916
+01:10:43,689 --> 01:10:46,710
+issues that kind of accumulate over
+multiple rounds of multiplication and
+
+917
+01:10:46,710 --> 01:10:50,989
+whatnot. But they found that a simple trick
+was actually this idea of stochastic
+
+918
+01:10:50,989 --> 01:10:54,559
+rounding. So for their
+multiplications, all their
+
+919
+01:10:54,560 --> 01:10:55,200
+parameters and
+
+920
+01:10:55,199 --> 01:10:59,079
+activations are stored in 16 bits, but
+when they perform a multiplication they
+
+921
+01:10:59,079 --> 01:11:03,269
+up-convert to a slightly higher
+precision floating-point value, and then
+
+922
+01:11:03,270 --> 01:11:07,570
+they cast and round that back down
+to the lower precision, and actually doing
+
+923
+01:11:07,569 --> 01:11:11,789
+that rounding in a stochastic way, that
+is, not rounding to the nearest number
+
+924
+01:11:11,789 --> 01:11:16,479
+but probabilistically rounding to
+different numbers depending on how
+
+925
+01:11:16,479 --> 01:11:17,549
+close you are,
+
+926
+01:11:17,550 --> 01:11:21,860
+tends to work better in practice. So
+they found that, for example, when you're
+
+927
+01:11:21,859 --> 01:11:26,710
+using these sixteen-bit fixed-point
+numbers, with two bits for the integer part and
+
+928
+01:11:26,710 --> 01:11:31,170
+then between 12 and 14 bits for the
+
+929
+01:11:31,170 --> 01:11:35,239
+fractional part, that when you use this
+idea of always rounding to the nearest
+
+930
+01:11:35,239 --> 01:11:40,359
+number, these networks tend to diverge, but
+when you use these stochastic rounding
+
+931
+01:11:40,359 --> 01:11:43,599
+techniques, you can actually get
+these networks to converge quite nicely
+
+932
+01:11:43,600 --> 01:11:47,170
+even with these very low precision
+
+933
+01:11:47,170 --> 01:11:52,859
+numbers. But you might
+want to ask: well, sixteen bits is great,
+
+934
+01:11:52,859 --> 01:11:59,089
+but can we go even lower than that? There
+was another paper in 2015 that got down
+
+935
+01:11:59,090 --> 01:12:04,560
+to 10 and 12 bits. So here, I mean,
+from the previous paper we already had
+
+936
+01:12:04,560 --> 01:12:08,039
+this intuition that maybe when you're
+using very low precision floating-point
+
+937
+01:12:08,039 --> 01:12:11,359
+numbers you actually need to use more
+precision in some parts of the network
+
+938
+01:12:11,359 --> 01:12:15,909
+and lower precision in other parts of
+the network, so in this paper they were
+
+939
+01:12:15,909 --> 01:12:22,149
+able to get away with storing the
+activations in 10-bit values and
+
+940
+01:12:22,149 --> 01:12:27,500
+then computing gradients using 12
+bits, and they got this to work, which
+
+941
+01:12:27,500 --> 01:12:34,800
+is pretty amazing. But does anyone think
+that that's the limit? Can we go further?
+
+942
+01:12:34,800 --> 01:12:36,310
+Yes.
+
+943
+01:12:36,310 --> 01:12:44,180
+There was actually a paper just last
+week, so this is actually from the same
+
+944
+01:12:44,180 --> 01:12:49,200
+author as the previous paper, and this is
+crazy, I was amazed about this, and
+
+945
+01:12:49,199 --> 01:12:53,539
+here the idea is that all activations
+and weights of the network use only one
+
+946
+01:12:53,539 --> 01:12:58,819
+bit, either one or negative one. That's
+pretty fast to compute; now you don't
+
+947
+01:12:58,819 --> 01:13:02,429
+even really have to do multiplication,
+you can just do like XNOR
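To pin down the stochastic rounding idea, here is a minimal numpy sketch under assumed settings (a fixed-point grid with 14 fractional bits, matching the range discussed above); this is an illustration, not the paper's code.

```python
import numpy as np

def stochastic_round(x, frac_bits=14):
    """Round x onto a fixed-point grid with `frac_bits` fractional bits,
    rounding up or down at random with probability proportional to how
    close x is to each neighboring grid point."""
    scale = 2.0 ** frac_bits
    scaled = x * scale
    floor = np.floor(scaled)
    # P(round up) = how far past the lower grid point we are
    round_up = np.random.rand(*x.shape) < (scaled - floor)
    return (floor + round_up) / scale

x = np.random.randn(1000)
r = stochastic_round(x)
# unbiased in expectation: the mean rounding error stays near zero
print(abs((r - x).mean()) < 1e-4)
```

The point of rounding probabilistically is that the error is zero in expectation, so it does not accumulate with the systematic bias that round-to-nearest can introduce over many multiply rounds.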
+
+948
+01:13:02,430 --> 01:13:07,240
+and bit-count operations instead, that's pretty cool. But
+the trick is that on the forward pass
+
+949
+01:13:07,239 --> 01:13:11,199
+all of the weights and activations are
+either one or minus one, so the
+
+950
+01:13:11,199 --> 01:13:15,399
+forward pass is super super fast and
+efficient, but now on the backward pass
+
+951
+01:13:15,399 --> 01:13:20,179
+they actually compute gradients using
+higher precision, and then these higher
+
+952
+01:13:20,180 --> 01:13:24,150
+precision gradients are used to actually
+make updates to these single bit
+
+953
+01:13:24,149 --> 01:13:28,059
+parameters. So it's actually a
+really cool paper and I'd encourage you
+
+954
+01:13:28,060 --> 01:13:33,310
+to check it out, but the pitch is that
+maybe at training time you can afford to
+
+955
+01:13:33,310 --> 01:13:36,600
+use a bit more floating point precision,
+but then at test time you want your
+
+956
+01:13:36,600 --> 01:13:41,250
+network to be super super fast and all
+binary. So I think this is a really,
+
+957
+01:13:41,250 --> 01:13:45,010
+really cool idea; I mean, the
+paper just came out two weeks ago, so I
+
+958
+01:13:45,010 --> 01:13:50,460
+don't know, but I think it's a pretty
+cool thing. So the recap for
+
+959
+01:13:50,460 --> 01:13:52,199
+implementation details
+
+960
+01:13:52,199 --> 01:13:56,960
+is that overall GPUs are much, much
+faster than CPUs; sometimes people use
+
+961
+01:13:56,960 --> 01:14:00,739
+distributed training, and distributing over
+multiple GPUs in one system is pretty
+
+962
+01:14:00,739 --> 01:14:04,840
+common; if you're Google and using TensorFlow
+then distributing over multiple
+
+963
+01:14:04,840 --> 01:14:10,239
+nodes is maybe more common. Be aware of
+the potential bottlenecks between the
+
+964
+01:14:10,239 --> 01:14:15,739
+CPU and GPU, between the GPU and the disk,
+and between the GPU and its memory, and also pay
+
+965
+01:14:15,739 --> 01:14:19,510
+attention to floating point precision; it
+might not be the most glamorous thing,
+
+966
+01:14:19,510 --> 01:14:23,409
+but I think it actually makes huge
+differences in practice, and maybe binary
+
+967
+01:14:23,409 --> 01:14:28,639
+nets will be the next big thing, that'd
+be pretty exciting. So yeah, just to recap
+
+968
+01:14:28,640 --> 01:14:32,690
+everything we talked about today: we
+talked about data augmentation as a trick
+
+969
+01:14:32,689 --> 01:14:37,449
+for improving things when you have small
+datasets and helping prevent overfitting; we
+
+970
+01:14:37,449 --> 01:14:40,859
+talked about transfer learning as a way to
+initialize from existing models to help
+
+971
+01:14:40,859 --> 01:14:44,399
+with your training; we
+talked in a lot of detail about
+
+972
+01:14:44,399 --> 01:14:48,159
+convolutions and how to combine them to
+make efficient models;
+
+973
+01:14:48,159 --> 01:14:52,840
+and we talked about all these
+implementation details. So I think that's
+
+974
+01:14:52,840 --> 01:14:57,319
+all we have for today. Are there any
+last minute questions?
+
+975
+01:14:57,319 --> 01:15:02,840
+Alright, so I guess we're done a couple
+minutes early, and, uh, the midterms
+
diff --git a/captions/En/Lecture12_en.srt b/captions/En/Lecture12_en.srt
new file mode 100644
index 00000000..d2bbc809
--- /dev/null
+++ b/captions/En/Lecture12_en.srt
@@ -0,0 +1,5373 @@
+1
+00:00:00,000 --> 00:00:02,990
+Today we're going to go over these four
+major software packages that people
+
+2
+00:00:02,990 --> 00:00:10,919
+commonly use. As usual, a couple
+of administrative things: the milestones
+
+3
+00:00:10,919 --> 00:00:14,798
+were actually due last week, so hopefully
+you turned those in; we will try to take a look
+
+4
+00:00:14,798 --> 00:00:19,089
+at those this week. Also remember that
+assignment 3, the final assignment, is
+
+5
+00:00:19,089 --> 00:00:23,160
+gonna be due on Wednesday, so... are you
+guys done already?
+
+6
+00:00:23,160 --> 00:00:30,870
+OK, that's good; if you have late
+days you should be fine.
+
+7
+00:00:30,870 --> 00:00:34,230
+Another thing that I should
+point out is that if you're actually
+
+8
+00:00:34,229 --> 00:00:37,619
+planning on using Terminal for your
+projects, which I think a lot of you are,
+
+9
+00:00:37,619 --> 00:00:42,049
+then make sure you're backing up
+your code and data and things off of
+
+10
+00:00:42,049 --> 00:00:46,659
+your Terminal instances every once in a while.
+We've had some problems where the
+
+11
+00:00:46,659 --> 00:00:50,529
+instances will crash randomly, and in
+most cases the Terminal folks have been
+
+12
+00:00:50,530 --> 00:00:53,989
+able to get the data back, but it
+sometimes takes a couple days, and
+
+13
+00:00:53,988 --> 00:00:57,570
+there's been a couple of cases where
+people actually lost data because it was
+
+14
+00:00:57,570 --> 00:01:01,558
+just on Terminal and that crashed. So I
+think if you are planning to use
+
+15
+00:01:01,558 --> 00:01:04,569
+Terminal, then make sure that you have
+some alternative backup strategy for
+
+16
+00:01:04,569 --> 00:01:10,250
+your code and your data. Like I said,
+today we're talking about these four
+
+17
+00:01:10,250 --> 00:01:16,049
+software packages that are commonly used
+for deep learning: Caffe, Torch, Theano and
+
+18
+00:01:16,049 --> 00:01:20,269
+TensorFlow. And as a little bit of a
+disclaimer at the beginning: I feel like,
+
+19
+00:01:20,269 --> 00:01:24,179
+personally, I've mostly worked with Caffe
+and Torch, so those are the ones that I know
+
+20
+00:01:24,180 --> 00:01:27,710
+the most about; I'll do my best to give
+you a good flavor for the others as well,
+
+21
+00:01:27,709 --> 00:01:35,939
+but just throwing that disclaimer out
+there. So the first one is Caffe. We saw in
+
+22
+00:01:35,939 --> 00:01:39,509
+the last lecture that Caffe really sprung
+out of this paper at Berkeley that was
+
+23
+00:01:39,510 --> 00:01:44,040
+trying to re-implement AlexNet and use
+AlexNet features for other things, and since
+
+24
+00:01:44,040 --> 00:01:47,550
+then Caffe has really grown into a
+really, really popular, widely used
+
+25
+00:01:47,549 --> 00:01:53,759
+package, especially for convolutional
+neural networks. So Caffe is from Berkeley,
+
+26
+00:01:53,760 --> 00:01:56,859
+as I think a lot of you people
+know,
+
+27
+00:01:56,859 --> 00:02:01,989
+and it's mostly written in C++, and there
+are actually bindings for Caffe so
+
+28
+00:02:01,989 --> 00:02:04,939
+you can access the nets and whatnot in
+Python and MATLAB, which are super useful,
+
+29
+00:02:04,939 --> 00:02:09,969
+and in general Caffe is really widely used,
+and it's really, really good if you just
+
+30
+00:02:09,969 --> 00:02:15,289
+want to train sort of standard
+feedforward convolutional networks. And
+
+31
+00:02:15,289 --> 00:02:17,489
+actually Caffe is somewhat different from the
+
+32
+00:02:17,490 --> 00:02:21,610
+other frameworks in this respect: you can
+actually train big, powerful models in
+
+33
+00:02:21,610 --> 00:02:26,150
+Caffe without writing any code yourself.
+So for example the ResNet image
+
+34
+00:02:26,150 --> 00:02:29,760
+classification model that won ImageNet,
+that won everything last year: you can
+
+35
+00:02:29,759 --> 00:02:33,189
+actually train a ResNet using Caffe
+without writing any code, which is pretty
+
+36
+00:02:33,189 --> 00:02:37,579
+amazing. So the most
+important tip when you're working with
+
+37
+00:02:37,580 --> 00:02:41,860
+Caffe is that the documentation is
+sometimes out of date and not always
+
+38
+00:02:41,860 --> 00:02:45,980
+perfect, so you need to not be afraid to
+just dive in there and read the source
+
+39
+00:02:45,979 --> 00:02:52,359
+code yourself. It's C++, so hopefully you
+can read that and understand it, but in
+
+40
+00:02:52,360 --> 00:02:56,080
+general the C++ code that they have
+is pretty well structured,
+
+41
+00:02:56,080 --> 00:03:00,270
+pretty well organized and pretty easy to
+understand, so if you have doubts about
+
+42
+00:03:00,270 --> 00:03:04,459
+how things work in Caffe, your best
+bet is just to go on GitHub and read the
+
+43
+00:03:04,459 --> 00:03:11,229
+source code. So Caffe is this huge, big
+project with, like, probably thousands,
+
+44
+00:03:11,229 --> 00:03:14,369
+tens of thousands of lines of code, and
+it's a little bit scary to understand
+
+45
+00:03:14,370 --> 00:03:18,730
+how everything fits together, but there are
+really four major classes in Caffe that
+
+46
+00:03:18,729 --> 00:03:24,310
+you need to know about. The first one is
+the Blob, so Blobs store all of your
+
+47
+00:03:24,310 --> 00:03:27,939
+data and your weights and your
+activations in the network, so these
+
+48
+00:03:27,939 --> 00:03:34,870
+Blobs are the things in the network:
+your weights
+
+49
+00:03:34,870 --> 00:03:38,680
+are stored in a Blob; your data, which
+would be like your pixel values, is
+
+50
+00:03:38,680 --> 00:03:43,189
+stored in a Blob; and your labels, your
+y's, are stored in a Blob; and also all of
+
+51
+00:03:43,189 --> 00:03:47,319
+your intermediate activations will also
+be stored in Blobs. So Blobs are these
+
+52
+00:03:47,319 --> 00:03:51,069
+n-dimensional tensors, sort of like
+you've seen in numpy, except that they
+
+53
+00:03:51,069 --> 00:03:56,150
+actually have four copies of an
+n-dimensional tensor inside: they have a
+
+54
+00:03:56,150 --> 00:03:57,370
+data
+
+55
+00:03:57,370 --> 00:04:02,450
+version of the tensor, which is
+storing the actual raw data, and they
+
+56
+00:04:02,449 --> 00:04:07,449
+also have a parallel
+tensor called diffs that Caffe uses to
+
+57
+00:04:07,449 --> 00:04:12,459
+store gradients with respect to that
+data, and that gives you two, and then you
+
+58
+00:04:12,459 --> 00:04:16,280
+actually have four, because there's a CPU
+and GPU version of each of those things;
+
+59
+00:04:16,279 --> 00:04:21,228
+so between data and diffs on CPU and GPU,
+there are actually four n-dimensional
+
+60
+00:04:21,228 --> 00:04:26,159
+tensors per Blob. The next important
+class that you need to know about in
+
+61
+00:04:26,160 --> 00:04:30,930
+Caffe is the Layer, and a Layer is sort of a
+function, similar to the ones you
+
+62
+00:04:30,930 --> 00:04:35,329
+wrote on the homeworks, that receives
+some input Blobs, which Caffe calls bottoms,
+
+63
+00:04:35,329 --> 00:04:41,269
+and then produces output Blobs that Caffe
+calls tops. The idea is that your
+
+64
+00:04:41,269 --> 00:04:45,349
+layer will receive pointers to the bottom
+Blobs with the data already filled in, and
+
+65
+00:04:45,350 --> 00:04:49,229
+then it'll also receive a pointer to the
+top Blobs, and on the forward
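A toy Python rendering of the Blob and Layer contract just described; the real versions are C++ and also keep separate CPU/GPU copies, so treat this purely as an illustration of the data/diff and bottom/top conventions.

```python
import numpy as np

class Blob:
    """Toy stand-in for a Caffe Blob: an n-d array for data plus a
    parallel array ("diffs") for gradients."""
    def __init__(self, shape):
        self.data = np.zeros(shape, dtype=np.float32)
        self.diff = np.zeros(shape, dtype=np.float32)

class ReLULayer:
    """Toy layer: forward fills the top blob's data from the bottom;
    backward fills the bottom blob's diff from the top's diff."""
    def forward(self, bottom, top):
        top.data[...] = np.maximum(bottom.data, 0)
    def backward(self, top, bottom):
        bottom.diff[...] = top.diff * (bottom.data > 0)

x, y = Blob((2, 3)), Blob((2, 3))
x.data[...] = np.random.randn(2, 3)
layer = ReLULayer()
layer.forward(x, y)        # fills y.data
y.diff[...] = 1.0          # pretend upstream gradient
layer.backward(y, x)       # writes gradients into x.diff
```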
+
+66
+00:04:49,228 --> 00:04:53,759
+pass it's expected to fill in the
+values for the data elements of your top
+
+67
+00:04:53,759 --> 00:04:58,959
+Blobs. On the backward pass the layers
+will compute gradients, so they expect to
+
+68
+00:04:58,959 --> 00:05:03,649
+receive a pointer to the top Blobs with
+the gradients and the activations filled
+
+69
+00:05:03,649 --> 00:05:07,359
+in, and then they'll also receive a
+pointer to the bottom Blobs and fill
+
+70
+00:05:07,360 --> 00:05:12,650
+in gradients for the bottoms. And Layer is
+this pretty well structured abstract
+
+71
+00:05:12,649 --> 00:05:17,019
+class that you can go look at; I have
+the links for the source file here,
+
+72
+00:05:17,019 --> 00:05:21,139
+and there are a lot of subclasses that
+implement different types of layers, and,
+
+73
+00:05:21,139 --> 00:05:26,750
+like I said, as a common Caffe problem,
+there's no really good list of all the
+
+74
+00:05:26,750 --> 00:05:30,490
+layer types; you pretty much just need to
+look at the code and see what types of
+
+75
+00:05:30,490 --> 00:05:36,280
+.cpp files there are. The next thing you
+need to know about is the Net, so a
+
+76
+00:05:36,279 --> 00:05:40,859
+Net just combines multiple layers; a
+Net is basically a directed acyclic graph
+
+77
+00:05:40,860 --> 00:05:44,598
+of layers, and is responsible for running
+the forward and backward methods of the
+
+78
+00:05:44,598 --> 00:05:49,519
+layers in the correct order. So
+you probably don't need to touch this
+
+79
+00:05:49,519 --> 00:05:52,560
+class ever yourself, but it's kind of
+nice to look at to get a flavour of how
+
+80
+00:05:52,560 --> 00:05:56,139
+everything fits together. And the final
+class that you need to know about is the
+
+81
+00:05:56,139 --> 00:06:00,720
+Solver. So, you know, we had
+this thing called a solver on the homework
+
+82
+00:06:00,720 --> 00:06:04,710
+that was really inspired by Caffe's; a
+solver is intended to take in
+
+83
+00:06:04,709 --> 00:06:05,288
+the net,
+
+84
+00:06:05,288 --> 00:06:08,889
+run the net forward and backward on
+data, actually update the
+
+85
+00:06:08,889 --> 00:06:11,319
+parameters of the network, and handle
+checkpointing and resuming from
+
+86
+00:06:11,319 --> 00:06:15,520
+checkpoints and all that sort of stuff.
+And in Caffe, Solver is this abstract
+
+87
+00:06:15,519 --> 00:06:20,278
+class, and different update rules are
+implemented by different subclasses, so
+
+88
+00:06:20,278 --> 00:06:24,598
+there is, for example, a stochastic gradient
+descent solver, there's an Adam solver, an
+
+89
+00:06:24,598 --> 00:06:28,209
+RMSProp solver, all of that sort of stuff,
+and again, just to see what kinds of
+
+90
+00:06:28,209 --> 00:06:32,438
+options are available you should look at
+the source code. So this kind of gives
+
+91
+00:06:32,439 --> 00:06:35,639
+you a nice overview of how these things
+all fit together: this whole thing
+
+92
+00:06:35,639 --> 00:06:40,069
+on the right would be the Net; the Net
+contains, in the green boxes, Blobs; each
+
+93
+00:06:40,069 --> 00:06:44,250
+Blob contains data and diffs; the red
+boxes are layers that are connecting
+
+94
+00:06:44,250 --> 00:06:51,038
+Blobs together; and the whole thing
+would get optimized by the Solver. So
+
+95
+00:06:51,038 --> 00:06:55,538
+Caffe makes heavy use of this funny thing
+called protocol buffers. Any of you guys
+
+96
+00:06:55,538 --> 00:07:00,938
+ever interned at Google? Then
+you guys know about these. So protocol
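And a toy sketch of the Net idea: a container that runs its layers' forward methods in order and their backward methods in reverse. The Scale layer here is made up just so the example runs; it is not a Caffe layer.

```python
class Net:
    """Toy (linear) graph of layers, run forward in order and
    backward in reverse order."""
    def __init__(self, layers):
        self.layers = layers
    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x
    def backward(self, dout):
        for layer in reversed(self.layers):
            dout = layer.backward(dout)
        return dout

class Scale:
    """Made-up layer that multiplies by a constant."""
    def __init__(self, s): self.s = s
    def forward(self, x): return self.s * x
    def backward(self, dout): return self.s * dout

net = Net([Scale(2.0), Scale(3.0)])
print(net.forward(1.0))   # 6.0
print(net.backward(1.0))  # 6.0: gradient of the composed function
```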
+
+97
+00:07:00,939 --> 00:07:05,099
+buffers are this almost like a binary,
+strongly typed JSON, is how I sort of like to
+
+98
+00:07:05,098 --> 00:07:08,550
+think about them, that are used very widely
+inside Google for serializing data to
+
+99
+00:07:08,550 --> 00:07:14,750
+disk and over the network. So with protocol
+buffers there's this .proto file that
+
+100
+00:07:14,750 --> 00:07:18,639
+defines the different kinds of fields
+that different types of objects have; so
+
+101
+00:07:18,639 --> 00:07:22,819
+in this example a person has a
+name and an ID and an email, and this lives
+
+102
+00:07:22,819 --> 00:07:26,300
+in a .proto file. .proto files
+
+103
+00:07:26,300 --> 00:07:31,490
+define a type of a class, and you
+can actually serialize instances to
+
+104
+00:07:31,490 --> 00:07:37,379
+human readable .prototxt files; so for
+example this fills in the name, it gives
+
+105
+00:07:37,379 --> 00:07:40,968
+you the ID, gives you the email, and
+this is an instance of a person that can
+
+106
+00:07:40,968 --> 00:07:45,930
+be saved into this text file. Then
+protobuf includes this compiler that
+
+107
+00:07:45,930 --> 00:07:49,579
+actually lets you generate classes in
+various programming languages to access
+
+108
+00:07:49,579 --> 00:07:55,418
+these data types; so after running the
+protobuf compiler on this .proto file, it
+
+109
+00:07:55,418 --> 00:08:01,038
+produces classes that you can import in
+Java and C, C++ and Python and Go and
+
+110
+00:08:01,038 --> 00:08:05,300
+just about everything. So actually Caffe
+makes wide use of these
+
+111
+00:08:05,300 --> 00:08:08,270
+protocol buffers, and they use them
+to store pretty much everything in
+
+112
+00:08:08,269 --> 00:08:16,008
+Caffe. So, like I said, to understand Caffe
+you need to read the code,
+
+113
+00:08:16,009 --> 00:08:20,480
+and Caffe has this one giant file called
+caffe.proto
+
+114
+00:08:20,480 --> 00:08:24,470
+where they just define all of the
+protocol buffer types that are used in
+
+115
+00:08:24,470 --> 00:08:29,170
+Caffe. This is a gigantic file, I think
+it's a couple thousand lines long, but
+
+116
+00:08:29,170 --> 00:08:32,200
+it's actually pretty well documented and
+is, I think, the most up-to-date
+
+117
+00:08:32,200 --> 00:08:35,890
+documentation of what the layer types
+are, what the options for those layers
+
+118
+00:08:35,889 --> 00:08:39,629
+are, how you specify all the
+options for solvers and layers and
+
+119
+00:08:39,629 --> 00:08:43,100
+all of that, so I really encourage
+you to check out this file and read
+
+120
+00:08:43,100 --> 00:08:48,019
+through it if you have any questions
+about how things work in Caffe. And just to
+
+121
+00:08:48,019 --> 00:08:53,120
+give you a flavour: on my left here, this
+shows you, this defines the NetParameter,
+
+122
+00:08:53,120 --> 00:08:58,519
+which is the type of protocol buffer
+that Caffe uses to represent a net, and
+
+123
+00:08:58,519 --> 00:09:03,970
+on the right is this SolverParameter,
+which is used to represent solvers; so the
+
+124
+00:09:03,970 --> 00:09:09,009
+SolverParameter, for
+example, takes a reference to a net, and
+
+125
+00:09:09,009 --> 00:09:12,409
+it also includes things like learning
+rate and how often to checkpoint and
+
+126
+00:09:12,409 --> 00:09:19,549
+other things like that. Right, so when
+you're working in Caffe, actually it's
+
+127
+00:09:19,549 --> 00:09:23,729
+pretty cool: you don't need to write any
+code in order to train models.
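For concreteness, this is roughly what the slide's Person example serializes to in the human-readable text format; the values are made up, and in real use you would compile the .proto with protoc and parse this with the protobuf library rather than printing a string.

```python
# hypothetical Person instance (name/id/email fields) in the
# .prototxt-style text serialization described above
person_txt = '''name: "Alice"
id: 1234
email: "alice@example.com"
'''

# with the real toolchain you would run `protoc` on the .proto definition
# and parse this text with google.protobuf.text_format; here we only show
# the serialized form itself
print(person_txt)
```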
+
+128
+00:09:23,730 --> 00:09:27,889
+So when working with Caffe you generally have
+this four-step process. First you will
+
+129
+00:09:27,889 --> 00:09:31,960
+convert your data, and especially if you
+just have an image classification problem
+
+130
+00:09:31,960 --> 00:09:34,540
+you don't have to write any code for
+this, you just use one of the existing
+
+131
+00:09:34,539 --> 00:09:40,240
+binaries Caffe ships with. Then you'll define
+your net, which you'll do by just
+
+132
+00:09:40,240 --> 00:09:45,230
+writing or editing one of these prototxts.
+Then you'll define your solver, which again
+
+133
+00:09:45,230 --> 00:09:49,509
+will just live in a prototxt file
+that you can just work with in a text
+
+134
+00:09:49,509 --> 00:09:54,200
+editor, and then you'll pass all of these
+things to this existing binary to train
+
+135
+00:09:54,200 --> 00:09:57,990
+the model, and it'll spit out your trained
+Caffe model to disk that you can then
+
+136
+00:09:57,990 --> 00:10:02,820
+use for other things. So even if you want
+to train ResNet on ImageNet, you could
+
+137
+00:10:02,820 --> 00:10:06,000
+just follow this simple procedure and
+train a giant network without writing
+
+138
+00:10:06,000 --> 00:10:12,110
+any code, and that's really cool. So step
+one generally is to convert your data.
+
+139
+00:10:12,110 --> 00:10:17,259
+So, I know we've talked a
+little bit about HDF5 as a format for
+
+140
+00:10:17,259 --> 00:10:21,460
+storing pixels on disk contiguously and
+then reading from them efficiently, but
+
+141
+00:10:21,460 --> 00:10:26,940
+by default Caffe uses this other file
+format called LMDB. So the deal is, if
+
+142
+00:10:26,940 --> 00:10:30,570
+you, if all you have is a bunch of images,
+each image with a label, then you can
+
+143
+00:10:30,570 --> 00:10:31,480
+just call,
+
+144
+00:10:31,480 --> 00:10:35,370
+Caffe just has a script to convert that
+whole dataset into a giant LMDB
+
+145
+00:10:35,370 --> 00:10:42,169
+that you can use for training. So just
+to give you an idea of the way it works, this is
+
+146
+00:10:42,169 --> 00:10:46,240
+really easy: you just create a text file
+that has the paths to your images,
+
+147
+00:10:46,240 --> 00:10:49,959
+separated by the label, and you just
+pass it to the Caffe script, wait a couple
+
+148
+00:10:49,958 --> 00:10:56,018
+hours if your dataset's big, and you get a giant
+LMDB file on disk. And if you're working with
+
+149
+00:10:56,019 --> 00:11:01,860
+something else, like HDF5, then you'll
+have to create it yourself, probably. So Caffe
+
+150
+00:11:01,860 --> 00:11:06,060
+does actually have a couple of options
+for reading data: there's this data
+
+151
+00:11:06,059 --> 00:11:11,888
+layer, a window data layer for detection,
+and it actually can read from HDF5, and
+
+152
+00:11:11,889 --> 00:11:14,350
+there's an option for reading stuff
+directly from memory that's especially
+
+153
+00:11:14,350 --> 00:11:18,480
+useful with the Python interface, but, at
+least in my point of view, all of these
+
+154
+00:11:18,480 --> 00:11:22,339
+other methods of reading
+data into Caffe are a little bit
+
+155
+00:11:22,339 --> 00:11:26,120
+second-class citizens in the Caffe
+ecosystem, and LMDB is really the
+
+156
+00:11:26,120 --> 00:11:30,669
+easiest thing to work with, so if you can,
+you should probably try to convert your
+
+157
+00:11:30,669 --> 00:11:40,179
+data into LMDB format. So step two for
+Caffe is to define your net,
+
+158
+00:11:40,179 --> 00:11:44,609
+and like I said, you'll just write a big
+prototxt to define your net.
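A sketch of that "text file of image paths and labels" step, assuming a hypothetical images/<class>/*.jpg layout; the label numbering is arbitrary, and the convert_imageset call in the comment is only indicative.

```python
import os

# hypothetical layout: images/<label_name>/*.jpg; we emit the
# "path label" listing that Caffe's convert_imageset tool expects
labels = {'cat': 0, 'dog': 1}
with open('train_list.txt', 'w') as f:
    for name, idx in sorted(labels.items()):
        class_dir = os.path.join('images', name)
        for fname in sorted(os.listdir(class_dir)):
            # paths are relative to the root folder passed to the tool
            f.write('%s %d\n' % (os.path.join(name, fname), idx))

# then, roughly:
#   build/tools/convert_imageset --shuffle images/ train_list.txt train_lmdb
```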
+
+159
+00:11:44,610 --> 00:11:48,818
+So here, this is just a simple model for logistic
+regression. You can see that I did not
+
+160
+00:11:48,818 --> 00:11:53,948
+follow my own advice and I'm reading
+data out of an HDF5 file here; then I
+
+161
+00:11:53,948 --> 00:11:59,278
+have a fully connected layer, which is
+called InnerProduct in Caffe, and
+
+162
+00:11:59,278 --> 00:12:03,588
+the fully connected layer
+tells you the number of classes and how
+
+163
+00:12:03,589 --> 00:12:10,399
+to initialize the values, and then I have
+a softmax loss function that reads the
+
+164
+00:12:10,399 --> 00:12:15,458
+labels and produces loss and gradients
+from the output of the connected layer. So a
+
+165
+00:12:15,458 --> 00:12:20,009
+couple things to point out about this
+file are that, one, every layer typically
+
+166
+00:12:20,009 --> 00:12:23,588
+includes some blobs which store the
+data and the gradients and the weights,
+
+167
+00:12:23,589 --> 00:12:28,680
+and the layer's blobs and the layer itself
+typically have the same name, which can be
+
+168
+00:12:28,679 --> 00:12:34,269
+a little bit confusing. Another thing is
+that a lot of these layers will have two
+
+169
+00:12:34,269 --> 00:12:39,250
+blobs, one for the weight and one for the bias,
+and actually in this network, right in here, you'll
+
+170
+00:12:39,250 --> 00:12:43,149
+find the learning rates for those two
+blobs, so that's the learning rate and
+
+171
+00:12:43,149 --> 00:12:44,769
+regularization for both the weight and the
+
+172
+00:12:44,769 --> 00:12:50,198
+bias of that layer. Another thing to note
+is that you specify the number of output
+
+173
+00:12:50,198 --> 00:12:51,568
+classes as just the number of
+
+174
+00:12:51,568 --> 00:12:57,378
+outputs on this fully connected layer
+parameter. And finally, the quick and
+
+175
+00:12:57,379 --> 00:13:01,139
+dirty way to freeze layers in Caffe is
+just to set the learning rate to 0
+
+176
+00:13:01,139 --> 00:13:08,048
+for the blobs associated with that layer's
+weights or biases. Another thing to point
+
+177
+00:13:08,048 --> 00:13:12,600
+out is that for ResNet and other large
+models like GoogLeNet this can get
+
+178
+00:13:12,600 --> 00:13:17,110
+really out of hand really quickly;
+Caffe doesn't really let you define
+
+179
+00:13:17,110 --> 00:13:20,989
+compositionality, so for ResNet they
+just repeat the same pattern over and
+
+180
+00:13:20,989 --> 00:13:26,459
+over and over in the prototxt file, so the
+ResNet prototxt is almost 7,000 lines
+
+181
+00:13:26,458 --> 00:13:31,219
+long. So you could write that by hand, but
+in practice people tend to write
+
+182
+00:13:31,220 --> 00:13:35,470
+little Python scripts to generate these
+things automatically, so that's a
+
+183
+00:13:35,470 --> 00:13:41,879
+little bit gross. If you want to
+fine-tune a network rather than starting
+
+184
+00:13:41,879 --> 00:13:46,509
+from scratch, then you'll typically
+download some existing prototxt and
+
+185
+00:13:46,509 --> 00:13:50,230
+some existing weights file and work from
+there. So the way you should think about
+
+186
+00:13:50,230 --> 00:13:54,139
+it is that the prototxt file that
+we've seen here before defines the
+
+187
+00:13:54,139 --> 00:13:58,159
+architecture of the network, and then
+the pretrained weights live in this
+
+188
+00:13:58,159 --> 00:14:03,230
+.caffemodel file; that's a binary thing
+and you can't really inspect it, but the
+
+189
+00:14:03,230 --> 00:14:07,869
+way it works is that it's basically key-value
+pairs, where it matches names.
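Here is a minimal sketch of the kind of net prototxt being described: an HDF5 data layer, one InnerProduct layer with per-blob learning rates (set lr_mult to 0 to freeze them), and a softmax loss. The names and values are illustrative; caffe.proto is the authoritative reference.

```python
# a minimal net.prototxt like the logistic-regression example above,
# written out from Python (field names per caffe.proto; values illustrative)
net = """name: "LogReg"
layer {
  name: "data"
  type: "HDF5Data"
  top: "data"
  top: "label"
  hdf5_data_param { source: "train_h5_list.txt" batch_size: 64 }
}
layer {
  name: "fc1"
  type: "InnerProduct"
  bottom: "data"
  top: "fc1"
  # one param block each for the weight and bias blobs;
  # setting lr_mult to 0 here is the quick way to freeze them
  param { lr_mult: 1 }
  param { lr_mult: 2 }
  inner_product_param {
    num_output: 10
    weight_filler { type: "gaussian" std: 0.01 }
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "fc1"
  bottom: "label"
  top: "loss"
}
"""
with open('net.prototxt', 'w') as f:
    f.write(net)
```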
+
+190
+00:14:07,869 --> 00:14:13,790
+Inside the .caffemodel it matches these
+names that are scoped to layers, so this
+
+191
+00:14:13,789 --> 00:14:19,389
+fc7 weight would be the
+weights corresponding to this final
+
+192
+00:14:19,389 --> 00:14:24,048
+fully connected layer in AlexNet. So
+then, when you want to fine-tune on your
+
+193
+00:14:24,048 --> 00:14:29,600
+own data, when you start up Caffe and you
+load a model and a prototxt, Caffe
+
+194
+00:14:29,600 --> 00:14:33,459
+just tries to match the key-value pairs
+of names and weights between the .caffemodel
+
+195
+00:14:33,458 --> 00:14:35,008
+and the prototxt,
+
+196
+00:14:35,009 --> 00:14:39,209
+so if the names are the same, then your
+new network gets initialized from the
+
+197
+00:14:39,208 --> 00:14:43,008
+pretrained values, which is really,
+really useful and convenient for fine-
+
+198
+00:14:43,009 --> 00:14:49,230
+tuning, but if the names
+don't match, then those layers actually get
+
+199
+00:14:49,230 --> 00:14:52,980
+initialized from scratch. So this is how,
+for example, you can reinitialize the
+
+200
+00:14:52,980 --> 00:14:57,810
+output layer in Caffe. To be a little bit
+more concrete: if you've
+
+201
+00:14:57,809 --> 00:15:02,250
+maybe downloaded an ImageNet model, then
+this final fully
+
+202
+00:15:02,250 --> 00:15:06,289
+connected layer that's outputting class
+scores will have a thousand outputs, but
+
+203
+00:15:06,289 --> 00:15:09,480
+now maybe for some problem you care
+about you only want to have 10 outputs, so
+
+204
+00:15:09,480 --> 00:15:13,149
+you're gonna need to reinitialize
+that final layer, initialize it
+
+205
+00:15:13,149 --> 00:15:17,309
+randomly, and fine-tune the network. So
+the way that you do that is you need to
+
+206
+00:15:17,309 --> 00:15:22,088
+change the name of the layer in the
+prototxt file to make sure that it's actually
+
+207
+00:15:22,089 --> 00:15:26,890
+initialized randomly and not read from
+the .caffemodel, and if you
+
+208
+00:15:26,889 --> 00:15:30,919
+forget to do this then it'll actually
+crash, and it'll give you a weird error
+
+209
+00:15:30,919 --> 00:15:35,419
+message about the shapes not aligning,
+because it'll be trying to store this
+
+210
+00:15:35,419 --> 00:15:39,299
+thousand dimensional weight matrix into
+this ten dimensional thing from your new
+
+211
+00:15:39,299 --> 00:15:46,129
+file, and it won't work. So the next step
+when working with Caffe is to define the
+
+212
+00:15:46,129 --> 00:15:51,100
+solver. The solver is also just a prototxt
+file; you can see all the options for it
+
+213
+00:15:51,100 --> 00:15:56,620
+in that giant .proto file that I gave a link
+to. It'll look something like this for
+
+214
+00:15:56,620 --> 00:16:00,169
+AlexNet, maybe. So this will define
+your learning rate and your learning
+
+215
+00:16:00,169 --> 00:16:04,809
+rate decay and your regularization, how
+often to checkpoint, everything like that, but
+
+216
+00:16:04,809 --> 00:16:10,169
+these end up being much less
+complex than these prototxts for the
+
+217
+00:16:10,169 --> 00:16:15,069
+networks; this AlexNet one is just maybe
+fourteen lines. Although, what you will
+
+218
+00:16:15,070 --> 00:16:18,530
+see sometimes in practice is that if
+people want to have sort of complex
+
+219
+00:16:18,529 --> 00:16:22,299
+training pipelines, where they first want to
+train with one learning rate in certain
+
+220
+00:16:22,299 --> 00:16:25,039
+parts of the network, then they want to train
+with another learning rate in certain parts
+
+221
+00:16:25,039 --> 00:16:28,389
+of the network, then you might end up
+with a cascade of different solver files,
+
+222
+00:16:28,389 --> 00:16:31,490
+and actually run them all independently,
+where you're sort of fine-tuning your own
+
+223
+00:16:31,490 --> 00:16:38,070
+model in separate stages using different
+solvers. So once you've done all that,
+
+224
+00:16:38,070 --> 00:16:43,550
+then you just train your model: if you
+followed my advice and just used
+
+225
+00:16:43,549 --> 00:16:49,208
+LMDB and all these things, then you just
+call this binary that exists
+
+226
+00:16:49,208 --> 00:16:55,569
+in Caffe already. So here you just
+pass your solver prototxt and
+
+227
+00:16:55,570 --> 00:16:59,540
+your pretrained weights file if you're
+fine-tuning, and it'll run, maybe for a day,
+
+228
+00:16:59,539 --> 00:17:03,659
+maybe for a long time, just checkpointing
+and saving things to disk, and you'll be happy.
+
+229
+00:17:03,659 --> 00:17:08,549
+One thing to point out here is that you
+specify which GPU it runs on; this is
+
+230
+00:17:08,549 --> 00:17:11,209
+a flag at the end, but you can actually run on CPU
+
+231
+00:17:11,209 --> 00:17:17,288
+by setting this flag to negative one. And
+actually, sometime in the last
+
+232
+00:17:17,288 --> 00:17:21,048
+year, Caffe added data parallelism to let
+you split up minibatches across
+
+233
+00:17:21,048 --> 00:17:26,318
+multiple GPUs in your system; you can
+actually list multiple GPUs on this flag,
+
+234
+00:17:26,318 --> 00:17:29,710
+and if you just say "all", then Caffe will
+automatically split up minibatches
+
+235
+00:17:29,710 --> 00:17:33,600
+across all the GPUs on your machine, so
+that's really cool: you've done multi-GPU
+
+236
+00:17:33,599 --> 00:17:51,689
+training without writing a single line
+of code. Pretty cool, Caffe. Oh yeah,
+
+237
+00:17:51,690 --> 00:17:57,230
+yeah, I think so. The question is, how
+would you go about doing some more
+
+238
+00:17:57,230 --> 00:18:00,778
+complex initialization strategy, where
+maybe you want to initialize the weights
+
+239
+00:18:00,778 --> 00:18:04,019
+from a pretrained model and use those
+same weights in multiple parts of your
+
+240
+00:18:04,019 --> 00:18:07,710
+network, and the answer is that you
+probably can't do that with a simple
+
+241
+00:18:07,710 --> 00:18:11,278
+mechanism; you can kind of monkey with the
+weights in Python, and that's probably
+
+242
+00:18:11,278 --> 00:18:17,669
+how you'd go about doing it. Right, so I
+think we've mentioned this before, that
+
+243
+00:18:17,669 --> 00:18:21,710
+Caffe has this really great model zoo; you
+can download lots of different types
+
+244
+00:18:21,710 --> 00:18:25,919
+of pretrained models on ImageNet and
+other datasets, so this model zoo is
+
+245
+00:18:25,919 --> 00:18:29,659
+really top-notch: you've got AlexNet
+and VGG, you've got ResNet up there
+
+246
+00:18:29,659 --> 00:18:33,840
+already, pretty much lots and lots of
+really good models are up there, so
+
+247
+00:18:33,839 --> 00:18:37,359
+that's a really, really strong
+point about Caffe: it's really easy
+
+248
+00:18:37,359 --> 00:18:40,428
+to download someone else's model and run
+it on your data, or fine-tune it on your
+
+249
+00:18:40,429 --> 00:18:42,350
+data.
+
+250
+00:18:42,349 --> 00:18:46,298
+Caffe has a Python interface, but, like I
+mentioned,
+
+251
+00:18:46,298 --> 00:18:49,069
+since there are so many things to cover
+I don't think I can dive into detail
+
+252
+00:18:49,069 --> 00:18:53,378
+here. As is kind of par for the course
+in Caffe, the Python interface doesn't really have great
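Pulling the solver and training steps together: a hypothetical AlexNet-flavored solver.prototxt written out from Python, with the sort of caffe train invocation described above in a trailing comment. Field names follow caffe.proto; the values are illustrative, not a recipe.

```python
# a hypothetical AlexNet-flavored solver.prototxt
solver = """net: "train_val.prototxt"
base_lr: 0.01
lr_policy: "step"
gamma: 0.1
stepsize: 100000
momentum: 0.9
weight_decay: 0.0005
max_iter: 450000
snapshot: 10000
snapshot_prefix: "snapshots/alexnet"
solver_mode: GPU
"""
with open('solver.prototxt', 'w') as f:
    f.write(solver)

# training (optionally from pretrained weights, on chosen GPUs)
# then looks roughly like:
#   ./build/tools/caffe train -solver solver.prototxt \
#       -weights pretrained.caffemodel -gpu 0     # or: -gpu all
```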
+
+253
+00:18:53,378 --> 00:18:57,980
+documentation. So you need to read the code; the
+
+254
+00:18:57,980 --> 00:18:58,690
+whole
+
+255
+00:18:58,690 --> 00:19:02,730
+Python interface to Caffe is
+mostly defined in these two
+
+256
+00:19:02,730 --> 00:19:08,399
+files: this .cpp file uses Boost.Python, if
+you've ever used that before, to
+
+257
+00:19:08,398 --> 00:19:13,369
+wrap up some of the C++ classes and
+expose them to Python, and then this
+
+258
+00:19:13,369 --> 00:19:17,648
+.py file actually attaches additional
+methods and gives you more of the Python
+
+259
+00:19:17,648 --> 00:19:22,469
+interface. So if you wanna know what
+kinds of methods and data types are
+
+260
+00:19:22,470 --> 00:19:27,000
+available in the Caffe Python interface,
+your best bet is to just read through
+
+261
+00:19:27,000 --> 00:19:31,339
+these two files, and they're not too long,
+so it's pretty easy to do.
+
+262
+00:19:31,339 --> 00:19:37,038
+The Python interface in general
+is pretty useful: it lets you do maybe
+
+263
+00:19:37,038 --> 00:19:40,558
+crazy weight initialization strategies
+if you need to do something more complex
+
+264
+00:19:40,558 --> 00:19:44,960
+than just copying from a trained model; it
+also makes it really easy to just get a
+
+265
+00:19:44,960 --> 00:19:48,710
+network and then run it forward and
+backward with numpy arrays,
+
+266
+00:19:48,710 --> 00:19:53,129
+so for example you can implement
+things like DeepDream and class
+
+267
+00:19:53,128 --> 00:19:56,798
+visualizations, similar to what you did
+on the homework; you can do that
+
+268
+00:19:56,798 --> 00:20:01,349
+quite easily using the Python interface
+of Caffe, where you just need to take data
+
+269
+00:20:01,349 --> 00:20:03,899
+and then run it forward and backward
+through different parts of the network.
+
+270
+00:20:03,900 --> 00:20:08,720
+The Python interface is also quite
+nice if you just want to extract
+
+271
+00:20:08,720 --> 00:20:12,220
+features, like you have some data and
+you have some pretrained model and you
+
+272
+00:20:12,220 --> 00:20:15,610
+want to extract features from some part of
+the network and then maybe save them to
+
+273
+00:20:15,609 --> 00:20:20,259
+disk, maybe to an HDF5 file, for some
+downstream processing; that's quite easy
+
+274
+00:20:20,259 --> 00:20:25,660
+to do with the Python interface. Also,
+Caffe has a kind of a new
+
+275
+00:20:25,660 --> 00:20:29,970
+feature where you can actually define
+layers entirely in Python;
+
+276
+00:20:29,970 --> 00:20:33,600
+I've never done it myself, but
+it seems cool, it seems nice. The
+
+277
+00:20:33,599 --> 00:20:37,259
+downside is that those layers will be
+CPU-only, and we talked about
+
+278
+00:20:37,259 --> 00:20:41,809
+communication bottlenecks between the
+CPU and GPU: if you write layers in
+
+279
+00:20:41,809 --> 00:20:46,460
+Python, then every forward and backward
+pass you'll be incurring overhead from
+
+280
+00:20:46,460 --> 00:20:51,289
+that transfer. Although one nice place
+where Python layers could be useful is
+
+281
+00:20:51,289 --> 00:20:58,450
+custom loss functions, so that's maybe
+something that you could keep in mind.
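A short pycaffe-flavored sketch of the feature-extraction use just described. The file names and the 'fc7' blob name are assumptions for illustration; the part being shown is the general net.blobs[...].data / net.forward() pattern.

```python
import numpy as np
import caffe  # Caffe's Python bindings

# hypothetical files; any deploy prototxt + matching weights would do
net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)

x = np.random.randn(*net.blobs['data'].data.shape).astype(np.float32)
net.blobs['data'].data[...] = x      # copy our batch into the input blob
net.forward()                        # run the net forward

feat = net.blobs['fc7'].data.copy()  # grab features from some layer
print(feat.shape)                    # e.g. save these to HDF5 downstream
```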
+
+282
+00:20:58,450 --> 00:21:02,450
+So, a quick overview of Caffe's pros and
+cons. Really, from my point of view,
+
+283
+00:21:02,450 --> 00:21:06,049
+if all you wanna do is
+train a kind of simple, basic feedforward network,
+
+284
+00:21:06,049 --> 00:21:09,730
+especially for classification, then Caffe
+makes it really, really easy to get things up
+
+285
+00:21:09,730 --> 00:21:12,880
+and running: you don't have to write any
+code yourself, you just use all these
+
+286
+00:21:12,880 --> 00:21:17,660
+pre-built tools, and it's quite easy to
+run. It has a Python interface, which is
+
+287
+00:21:17,660 --> 00:21:21,259
+quite nice for
+a little bit more complex use cases,
+
+288
+00:21:21,259 --> 00:21:25,329
+but it can be cumbersome when things get
+really crazy: when you have these really
+
+289
+00:21:25,329 --> 00:21:29,299
+big networks like ResNet, especially
+with repeated module patterns, it can
+
+290
+00:21:29,299 --> 00:21:33,450
+be tedious, and for things like
+recurrent networks, where you want to
+
+291
+00:21:33,450 --> 00:21:37,519
+share weights between different parts of
+the network, it can be kind
+
+292
+00:21:37,519 --> 00:21:41,559
+of cumbersome in Caffe; it is possible, but
+it's probably not the best thing to use
+
+293
+00:21:41,559 --> 00:21:46,250
+for that. And the other downside, the
+other big downside from my point of view,
+
+294
+00:21:46,250 --> 00:21:50,220
+is that when you want to define your own
+type of layer in Caffe, you end up having
+
+295
+00:21:50,220 --> 00:21:55,440
+to write C++ code, so that doesn't
+give you a very quick development cycle;
+
+296
+00:21:55,440 --> 00:22:00,769
+it's kind of
+painful to write your own layers. So that's
+
+297
+00:22:00,769 --> 00:22:04,750
+our whirlwind tour of Caffe, so,
+are there any quick questions?
+
+298
+00:22:04,750 --> 00:22:06,669
+Yeah?
+
+299
+00:22:06,669 --> 00:22:14,028
+Cross-validation in Caffe: so in the
+train_val prototxt you can actually define
+
+300
+00:22:14,028 --> 00:22:19,159
+a training phase and a testing phase, so
+generally you'll write like a train_val
+
+301
+00:22:19,159 --> 00:22:20,269
+prototxt
+
+302
+00:22:20,269 --> 00:22:24,960
+and a deploy prototxt, and deploy will be
+used for the task at hand, but the test
+
+303
+00:22:24,960 --> 00:22:33,409
+phase of the train_val prototxt will be
+used for validation. OK, that's all
+
+304
+00:22:33,409 --> 00:22:39,820
+there is to know about Caffe. So the
+next one is Torch. Torch is really my
+
+305
+00:22:39,819 --> 00:22:42,980
+personal favorite, so I have a little bit
+of bias here, just to get that out in the
+
+306
+00:22:42,980 --> 00:22:46,259
+open: I've pretty much been using
+Torch almost exclusively on my own
+
+307
+00:22:46,259 --> 00:22:51,749
+projects in the last year or so. So
+Torch is originally from NYU, it's
+
+308
+00:22:51,749 --> 00:22:56,450
+written in C and in Lua, and it's
+used a lot at Facebook and DeepMind
+
+309
+00:22:56,450 --> 00:23:02,409
+especially; I think also a lot of folks
+at Twitter use Torch. So one of the big
+
+310
+00:23:02,409 --> 00:23:05,309
+things that freaks people out, of course,
+is that you have to write in Lua,
+
+311
+00:23:05,308 --> 00:23:11,038
+which I had never heard of or
+used before starting to work with Torch,
+
+312
+00:23:11,038 --> 00:23:16,700
+but it actually isn't too bad. Lua
+is basically this high-level scripting
+
+313
+00:23:16,700 --> 00:23:20,999
+language that is really intended for
+embedded devices, so it can run more
+
+314
+00:23:20,999 --> 00:23:24,720
+efficiently, and it's
+very similar to JavaScript in a lot of ways.
+
+315
+00:23:24,720 --> 00:23:29,749
+So another cool thing about Lua is that,
+because it's meant to be run on embedded
+
+316
+00:23:29,749 --> 00:23:33,929
+devices, you can actually do for
+loops really fast in Torch. You know
+
+317
+00:23:33,929 --> 00:23:37,149
+how in Python, if you write a for loop,
+it's going to be really slow? That's
+
+318
+00:23:37,148 --> 00:23:40,798
+actually totally fine to do in Torch,
+because Lua actually uses just-in-time
+
+319
+00:23:40,798 --> 00:23:46,249
+compilation to make these things really
+fast. And Lua is also, much like
+
+320
+00:23:46,249 --> 00:23:50,200
+JavaScript, in that it is a
+functional language: functions are
+
+321
+00:23:50,200 --> 00:23:54,058
+first-class citizens, and it's very common to
+pass callbacks around to different
+
+322
+00:23:54,058 --> 00:24:01,200
+parts of your code. Lua also has this
+idea of prototypal inheritance, where
+
+323
+00:24:01,200 --> 00:24:05,200
+there's sort of one data structure, which
+in Lua is a table, which you can think of
+
+324
+00:24:05,200 --> 00:24:09,558
+as being very similar to an object in
+JavaScript, and you can implement things
+
+325
+00:24:09,558 --> 00:24:13,378
+like object oriented programming using
+prototypal inheritance in a similar
+
+326
+00:24:13,378 --> 00:24:18,428
+way as you would in JavaScript. One
+of the downsides is that
+
+327
+00:24:18,429 --> 00:24:19,929
+the standard library
+
+328
+00:24:19,929 --> 00:24:24,820
+is kind of annoying sometimes, and things
+like handling strings and whatnot can be
+
+329
+00:24:24,819 --> 00:24:28,999
+kind of cumbersome, and maybe most
+annoying is that it's one-indexed, so all of
+
+330
+00:24:28,999 --> 00:24:33,058
+your intuition about for loops will be
+a little bit off for a while. But other
+
+331
+00:24:33,058 --> 00:24:37,528
+than that, it's pretty easy to pick up,
+and I gave a link here to this website
+
+332
+00:24:37,528 --> 00:24:41,618
+claiming that you can learn Lua in 15
+minutes; it might be a little bit of an
+
+333
+00:24:41,618 --> 00:24:45,209
+oversell, they might be overselling it a
+little bit, but I think it is pretty easy
+
+334
+00:24:45,210 --> 00:24:50,298
+to pick up and start writing code in
+pretty fast. So the main idea behind
+
+335
+00:24:50,298 --> 00:24:55,398
+Torch is this Tensor class. So you guys
+have been working in numpy a lot on your
+
+336
+00:24:55,398 --> 00:24:59,548
+assignments, and the way the assignments
+are kind of structured is that the numpy
+
+337
+00:24:59,548 --> 00:25:03,329
+array gives you this really easy way to
+manipulate data in whatever way you want,
+
+338
+00:25:03,329 --> 00:25:06,798
+and then you can use those numpy
+arrays to build up other abstractions, like
+
+339
+00:25:06,798 --> 00:25:10,720
+neural net libraries and whatnot, but
+really the numpy array just lets you
+
+340
+00:25:10,720 --> 00:25:16,909
+manipulate data numerically in whatever
+way you want, in complete flexibility. So
+
+341
+00:25:16,909 --> 00:25:20,580
+if you recall, then maybe here's
+an example of some numpy
+
+342
+00:25:20,579 --> 00:25:24,918
+code that should be very familiar by now:
+we're just computing a simple forward pass
+
+343
+00:25:24,919 --> 00:25:31,990
+of a two-layer ReLU network. So maybe black
+wasn't the best choice here, but
+
+344
+00:25:31,990 --> 00:25:36,569
+what we're doing is we're defining some
+constants, we're defining some
+
+345
+00:25:36,569 --> 00:25:40,408
+weights, we're getting some random data, and
+we're doing a matrix multiply, a ReLU, then
+
+346
+00:25:40,409 --> 00:25:44,789
+another matrix multiply. So that's
+very easy to write in numpy, and
+
+347
+00:25:44,788 --> 00:25:49,538
+actually this has almost a one-to-one
+translation into Torch tensors. So now
+
+348
+00:25:49,538 --> 00:25:53,970
+on the right, this is the exact same code
+but using Torch tensors, and so here
+
+349
+00:25:53,970 --> 00:25:58,509
+we're defining our batch size, input size
+and all that; we're defining our weights,
+
+350
+00:25:58,509 --> 00:26:02,929
+which are just Torch tensors; we're
+getting a random input vector; we're
+
+351
+00:26:02,929 --> 00:26:07,929
+doing a forward pass; this is doing a
+matrix multiply on two Torch tensors; this
+
+352
+00:26:07,929 --> 00:26:09,179
+cmax is an
+
+353
+00:26:09,179 --> 00:26:13,149
+element-wise maximum, that's a ReLU;
+and then we can compute scores using
+
+354
+00:26:13,148 --> 00:26:17,089
+another matrix multiply. So in general,
+pretty much any kind of code you'd
+
+355
+00:26:17,089 --> 00:26:18,689
+write in numpy
+
+356
+00:26:18,690 --> 00:26:22,460
+pretty much has an almost one-to-one,
+line-by-line translation into using
+
+357
+00:26:22,460 --> 00:26:25,400
+Torch tensors instead.
+
+358
+00:26:25,400 --> 00:26:28,880
+Also remember, in numpy it's
+really easy to swap in and use different
+
+359
+00:26:28,880 --> 00:26:33,690
+data types; we talked about this ad
+nauseam in the last lecture, but at least in
+
+360
+00:26:33,690 --> 00:26:38,500
+numpy, to switch to maybe 32-bit
+floating point, all you need to do is
+
+361
+00:26:38,500 --> 00:26:43,049
+cast your data to this other data type,
+and it turns out that that's very, very
+
+362
+00:26:43,049 --> 00:26:47,589
+easy to do in Torch as well: our
+data type is now this other string, and
+
+363
+00:26:47,589 --> 00:26:52,990
+then we can easily cast our data to
+another data type. But here's where Torch
+
+364
+00:26:52,990 --> 00:26:56,130
+really shines: this next slide is the
+real reason why Torch is infinitely
+
+365
+00:26:56,130 --> 00:27:02,020
+better than numpy, and that's that the
+GPU is just another data type. So when
+
+366
+00:27:02,019 --> 00:27:07,879
+you want to run code on
+the GPU in Torch, you import
+
+367
+00:27:07,880 --> 00:27:11,630
+another package, and you have
+another data type, which is
+
+368
+00:27:11,630 --> 00:27:16,810
+torch.CudaTensor, and now you cast your tensors
+to this other data type, and now they
+
+369
+00:27:16,809 --> 00:27:21,819
+live on the GPU, and running any kind of
+numerical operations on those tensors just
+
+370
+00:27:21,819 --> 00:27:26,500
+runs on the GPU. So it's really, really
+easy in Torch to just write generic
+
+371
+00:27:26,500 --> 00:27:34,220
+tensor scientific computing code that runs
+on a GPU and is really fast. So, like I
+
+372
+00:27:34,220 --> 00:27:37,819
+said, these tensors, you should
+think of them as similar to numpy arrays,
+
+373
+00:27:37,819 --> 00:27:41,689
+and there's a lot of documentation on
+the different kinds of methods that you
+
+374
+00:27:41,690 --> 00:27:46,250
+can work with on tensors, up here on
+GitHub; this documentation isn't super
+
+375
+00:27:46,250 --> 00:27:53,950
+complete, but it's not bad, so you
+should take a look at it. But
+
+376
+00:27:53,950 --> 00:27:58,200
+in practice you end up not really using
+raw tensors too much in Torch.
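For reference, here is the numpy side of that slide reconstructed as a sketch (the sizes are assumed); per the lecture, the Torch tensor version is a near line-by-line translation of the same steps.

```python
import numpy as np

# a two-layer ReLU net forward pass, like the slide's numpy example
N, D, H = 64, 1000, 100          # batch size, input dim, hidden dim (assumed)
w1 = np.random.randn(D, H)       # first layer weights
w2 = np.random.randn(H, D)       # second layer weights
x = np.random.randn(N, D)        # some random input data

a = np.maximum(x.dot(w1), 0)     # matrix multiply + elementwise max (ReLU)
scores = a.dot(w2)               # another matrix multiply
print(scores.shape)              # (64, 1000)
```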
+
+377
+00:27:58,200 --> 00:28:02,880
+Instead you use this other package called
+nn, for neural networks. So nn is this
+
+378
+00:28:02,880 --> 00:28:06,800
+pretty thin wrapper that actually
+defines a neural network package just in
+
+379
+00:28:06,799 --> 00:28:10,930
+terms of these tensor objects. You should think of
+
+380
+00:28:10,930 --> 00:28:15,049
+this as being like a beefier, more industrial
+strength version of the homework code
+
+381
+00:28:15,049 --> 00:28:20,240
+base, where you have this
+array, this tensor abstraction,
+
+382
+00:28:20,240 --> 00:28:24,480
+and then you implement a neural network
+library on top of that in a nice, clean
+
+383
+00:28:24,480 --> 00:28:30,410
+interface. So here's the same two-layer
+ReLU network using the nn package: we
+
+384
+00:28:30,410 --> 00:28:33,900
+define our network as a Sequential, so
+it's gonna be a stack of sequential
+
+385
+00:28:33,900 --> 00:28:38,360
+operations; we're gonna first
+have a Linear, which is a fully connected
+
+386
+00:28:38,359 --> 00:28:41,759
+layer from our input dimension to our hidden
+dimension; we're gonna have a ReLU and
+
+387
+00:28:41,759 --> 00:28:48,420
+another Linear. Now we can actually get
+the weights and gradients, in a single
+
+388
+00:28:48,420 --> 00:28:52,070
+tensor for each, using this
+getParameters method, so now weights will be a
+
+389
+00:28:52,069 --> 00:28:55,750
+single Torch tensor that holds all the
+weights of the network, and gradients
+
+390
+00:28:55,750 --> 00:29:00,490
+will be a single Torch tensor for all
+of the gradients. We can generate
+
+391
+00:29:00,490 --> 00:29:05,730
+some random data; now to do a forward pass
+we just call the forward method on the
+
+392
+00:29:05,730 --> 00:29:11,599
+object using our data, and this gives us our
+scores. To compute a loss we have a
+
+393
+00:29:11,599 --> 00:29:16,769
+separate criterion object that is our
+loss function, so we compute the loss by
+
+394
+00:29:16,769 --> 00:29:21,289
+calling the forward method of the
+criterion. Now we've done our forward pass,
+
+395
+00:29:21,289 --> 00:29:27,279
+easy, and for the backward pass we first zero the
+gradients, call backward on the loss function,
+
+396
+00:29:27,279 --> 00:29:31,609
+and then backward on the network. Now this
+has updated all of the gradients for the
+
+397
+00:29:31,609 --> 00:29:35,319
+network in the grad params, so we can
+just make a gradient step very easily:
+
+398
+00:29:35,319 --> 00:29:40,419
+this would be multiplying the
+gradients by the negative of the learning
+
+399
+00:29:40,420 --> 00:29:44,130
+rate and then adding them to the weights;
+that's a simple gradient descent update.
+
+400
+00:29:44,130 --> 00:29:50,400
+Right, so that's all of it; that would
+maybe have been a little bit
+
+401
+00:29:50,400 --> 00:29:53,560
+more clear, but: we have
+weights and gradients, we have a loss function,
+
+402
+00:29:53,559 --> 00:30:00,730
+we get random data, run forward and
+backward, and make an update. And, as you
+
+403
+00:30:00,730 --> 00:30:03,930
+might expect from looking at the tensor code,
+it's quite easy to make this thing run
+
+404
+00:30:03,930 --> 00:30:09,570
+on a GPU. So to run these networks on
+the GPU we import a couple new packages,
+
+405
+00:30:09,569 --> 00:30:14,519
+cutorch and cunn, which are CUDA
+versions of everything, and then we
+
+406
+00:30:14,519 --> 00:30:17,930
+just need to cast our network and our
+loss function to this other data type,
+
+407
+00:30:17,930 --> 00:30:23,490
+and we also need to cast our data and labels,
+and now this whole network will run and train on the GPU.
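The same forward / criterion / backward / update cycle, sketched in numpy with assumed sizes and an MSE-style criterion, just to pin down what each call computes; this is homework-style illustration, not the Torch code itself.

```python
import numpy as np

# Linear -> ReLU -> Linear, MSE criterion, vanilla gradient descent
N, D, H, lr = 64, 100, 50, 1e-3
w1, w2 = np.random.randn(D, H) * 0.01, np.random.randn(H, D) * 0.01
x, y = np.random.randn(N, D), np.random.randn(N, D)

h = np.maximum(x.dot(w1), 0)          # forward through the "modules"
pred = h.dot(w2)
loss = ((pred - y) ** 2).mean()       # the "criterion"

dpred = 2 * (pred - y) / pred.size    # backward: criterion first...
dw2 = h.T.dot(dpred)
dh = dpred.dot(w2.T) * (h > 0)        # ...then back through the net
dw1 = x.T.dot(dh)

w1 -= lr * dw1                        # the gradient descent update
w2 -= lr * dw2
```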
+
+408
+00:30:23,490 --> 00:30:28,660
+So it's pretty easy: in, what was that, like 40
+
+409
+00:30:28,660 --> 00:30:31,320
+lines of code, we've written a fully
+connected network and we can train it on
+
+410
+00:30:31,319 --> 00:30:37,089
+the GPU. But one problem here is that
+we're just using vanilla gradient
+
+411
+00:30:37,089 --> 00:30:41,000
+descent, which is not so great, and as you
+saw on the assignments, other things like
+
+412
+00:30:41,000 --> 00:30:45,329
+Adam or RMSProp tend to work much
+better in practice. So to solve that,
+
+413
+00:30:45,329 --> 00:30:50,319
+Torch gives us the optim package. So
+optim is quite easy to use: again, we
+
+414
+00:30:50,319 --> 00:30:51,799
+just import a new package up
+
+415
+00:30:51,799 --> 00:30:57,799
+here, and now what changes is that we
+actually need to define this callback
+
+416
+00:30:57,799 --> 00:31:02,569
+function. So before, we were just calling
+forward and backward explicitly
+
+417
+00:31:02,569 --> 00:31:06,960
+ourselves; instead we're going to define this
+callback function that will run the
+
+418
+00:31:06,960 --> 00:31:10,750
+network forward and backward on data and
+then return the loss and the gradient,
+
+419
+00:31:10,750 --> 00:31:15,400
+and now, to make an update step on our
+network, we'll actually pass this callback
+
+420
+00:31:15,400 --> 00:31:21,259
+function to this adam method from the
+optim package. So this is maybe a
+
+421
+00:31:21,259 --> 00:31:26,940
+little bit awkward, but, you know, we can
+use any kind of update rule with just a
+
+422
+00:31:26,940 --> 00:31:31,430
+couple lines of change from what we had
+before, and again this is very easy to
+
+423
+00:31:31,430 --> 00:31:38,900
+run on the GPU by just casting
+everything to CUDA. Right, so as we saw,
+
+424
+00:31:38,900 --> 00:31:44,220
+Caffe sort of implements everything
+in terms of Nets and Layers, and Caffe has
+
+425
+00:31:44,220 --> 00:31:48,750
+this really hard distinction between
+the Net and the Layer. In Torch
+
+426
+00:31:48,750 --> 00:31:52,400
+we don't really draw this distinction:
+everything is just a module, so the entire
+
+427
+00:31:52,400 --> 00:31:59,750
+network is a module, and also each
+individual layer is a module. So modules
+
+428
+00:31:59,750 --> 00:32:03,650
+are just classes that are defined in
+Lua, that are implemented
+
+429
+00:32:03,650 --> 00:32:08,880
+using the tensor API, and because these modules
+are written in Lua they're quite
+
+430
+00:32:08,880 --> 00:32:13,260
+easy to understand. So what we have here is
+the fully
+
+431
+00:32:13,259 --> 00:32:17,039
+connected layer, and this is the
+constructor: you can see it's just
+
+432
+00:32:17,039 --> 00:32:23,210
+setting up tensors for the weight and
+the bias, and because this tensor API in
+
+433
+00:32:23,210 --> 00:32:28,100
+Torch lets us easily run the same code
+on GPU and CPU, all of these layers
+
+434
+00:32:28,099 --> 00:32:32,359
+will just be written in terms of the
+tensor API and then easily run on both
+
+435
+00:32:32,359 --> 00:32:37,529
+devices. So these modules need to
+implement a forward and backward; for the
+
+436
+00:32:37,529 --> 00:32:42,670
+forward, maybe they decided to call it
+updateOutput, so here's the example of the
+
+437
+00:32:42,670 --> 00:32:47,250
+updateOutput for the fully connected layer;
+there are actually a couple different
+
+438
+00:32:47,250 --> 00:32:50,480
+cases they need to deal with here.
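The callback pattern being described, sketched in numpy terms: the update rule only ever sees a function mapping parameters to (loss, gradient). The toy objective and the sgd_step helper are made up for illustration; optim's real adam works analogously on a flat parameter tensor.

```python
import numpy as np

def feval(w):
    """Callback: run 'the network' and return (loss, gradient)."""
    loss = float(((w - 3.0) ** 2).sum())   # toy objective
    grad = 2.0 * (w - 3.0)
    return loss, grad

def sgd_step(feval, w, lr=0.1):
    """Update rule that only knows about the callback."""
    loss, grad = feval(w)
    w -= lr * grad                          # update the params in place
    return loss

w = np.zeros(5)
for _ in range(100):
    loss = sgd_step(feval, w)
print(np.round(w, 3))                       # converges toward 3.0
```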
+
+439
+00:32:50,480 --> 00:32:55,170
+Batch versus non-batch inputs, but other
+than that it should be quite easy to read. For
+
+440
+00:32:55,170 --> 00:33:00,830
+the backward pass there's a pair of
+methods: updateGradInput, which receives
+
+441
+00:33:00,829 --> 00:33:03,970
+the upstream gradients and computes the
+gradients with respect to the
+
+442
+00:33:03,970 --> 00:33:09,160
+input, and again this is just implemented
+in the tensor API, so it's very easy to
+
+443
+00:33:09,160 --> 00:33:14,279
+understand; it's just the same
+type of thing you saw on the homework. And they
+
+444
+00:33:14,279 --> 00:33:17,990
+also implement accGradParameters,
+which computes the gradients
+
+445
+00:33:17,990 --> 00:33:21,480
+with respect to the weights of the
+network; as you saw in the constructor,
+
+446
+00:33:21,480 --> 00:33:25,610
+the weights and the biases are held in
+instance variables of this module, and
+
+447
+00:33:25,609 --> 00:33:30,309
+accGradParameters will receive
+gradients from upstream and accumulate
+
+448
+00:33:30,309 --> 00:33:34,940
+gradients of the parameters with respect
+to the upstream gradients, and again this
+
+449
+00:33:34,940 --> 00:33:39,809
+is very simple, just using the tensor API.
+
+450
+00:33:39,809 --> 00:33:44,200
+Torch actually has a ton of different
+modules available; the documentation here
+
+451
+00:33:44,200 --> 00:33:46,980
+can be a little bit out of date, but if
+you just go on GitHub you can see all
+
+452
+00:33:46,980 --> 00:33:51,460
+the files that give you all the goodies
+to play with, and these actually get
+
+453
+00:33:51,460 --> 00:33:55,930
+updated a lot. Just to point out a
+couple: these were just added
+
+454
+00:33:55,930 --> 00:34:00,750
+last week, so Torch is always adding new
+modules that you can add to your networks,
+
+455
+00:34:00,750 --> 00:34:06,390
+which is pretty fun. But when these
+existing modules aren't good enough, it's
+
+456
+00:34:06,390 --> 00:34:10,579
+actually very easy to write your own,
+because you can just implement these
+
+457
+00:34:10,579 --> 00:34:13,989
+things using the tensor API
+and just implement the
+
+458
+00:34:13,989 --> 00:34:17,259
+forward and backward; it's not much
+harder than implementing layers on the
+
+459
+00:34:17,260 --> 00:34:21,890
+homeworks. So here's just a small example:
+this is a silly module that just takes
+
+460
+00:34:21,889 --> 00:34:28,210
+its input and multiplies it by two, and you
+can see we implement the updateOutput
+
+461
+00:34:28,210 --> 00:34:31,849
+and updateGradInput, and now we've implemented a new
+layer in Torch in just twenty lines of
+
+462
+00:34:31,849 --> 00:34:35,929
+code. So that's really easy, and
+then it's very easy to use in other code:
+
+463
+00:34:35,929 --> 00:34:40,710
+just import it and you can add it to
+networks and so on. And the really cool
+
+464
+00:34:40,710 --> 00:34:44,920
+thing about this is, because this is just
+the tensor API, you can do whatever kind
+
+465
+00:34:44,920 --> 00:34:48,579
+of arbitrary thing you want inside of
+these forward and backward: if you need
+
+466
+00:34:48,579 --> 00:34:52,730
+to do for loops, or complicated
+imperative code, or anything, or maybe
+
+467
+00:34:52,730 --> 00:34:56,980
+stochastic things for dropout or
+batch normalization, then
+
+468
+00:34:56,980 --> 00:34:59,949
+whatever kind of code you want in the
+forward and backward pass, you just
+
+469
+00:34:59,949 --> 00:35:03,500
+implement it yourself inside these modules.
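The times-two module described above, mimicked in Python: forward (updateOutput) doubles the input, and backward (updateGradInput) doubles the upstream gradient. The method names echo Torch's, but this is just an illustration.

```python
import numpy as np

class MulByTwo:
    """Silly module: multiply the input by two."""
    def forward(self, x):          # plays the role of updateOutput
        self.output = 2.0 * x
        return self.output
    def backward(self, x, dout):   # plays the role of updateGradInput
        self.grad_input = 2.0 * dout
        return self.grad_input

m = MulByTwo()
x = np.random.randn(3)
print(m.forward(x))               # 2 * x
print(m.backward(x, np.ones(3)))  # the local gradient is just 2
```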
+470
+00:35:03,500 --> 00:35:11,500
+It's easy to implement your own new
+types of layers in Torch. But of course,
+
+471
+00:35:11,500 --> 00:35:14,250
+using individual layers on their own
+isn't so useful;
+
+472
+00:35:14,250 --> 00:35:16,960
+we need to be able to stitch them
+together into larger networks.
+
+473
+00:35:16,960 --> 00:35:21,220
+For this, Torch uses containers. We
+already saw one in the previous example,
+
+474
+00:35:21,219 --> 00:35:26,549
+which was this Sequential container. A
+Sequential container is just a stack
+
+475
+00:35:26,550 --> 00:35:29,950
+of modules where each one receives
+the output from the previous
+
+476
+00:35:29,949 --> 00:35:35,639
+one; that's probably the most
+commonly used. Another one you might
+
+477
+00:35:35,639 --> 00:35:40,799
+see is this ConcatTable: maybe
+if you have an input and you
+
+478
+00:35:40,800 --> 00:35:44,289
+want to apply different
+modules to the same input, then the
+
+479
+00:35:44,289 --> 00:35:49,099
+ConcatTable lets you do that and you
+receive the outputs as a list. Another one
+
+480
+00:35:49,099 --> 00:35:53,280
+you might see is a ParallelTable: if you
+have a list of inputs and you want to
+
+481
+00:35:53,280 --> 00:35:57,500
+apply a different module to
+each element of the list, then you can
+
+482
+00:35:57,500 --> 00:36:04,588
+use a ParallelTable for that sort
+of construction. But when things get
+
+483
+00:36:04,588 --> 00:36:08,980
+really complicated — those
+containers that I just told you about
+
+484
+00:36:08,980 --> 00:36:13,480
+should in theory make it possible to
+implement just about any topology you
+
+485
+00:36:13,480 --> 00:36:16,980
+want, but it can be really hairy in
+practice to wire up really complicated
+
+486
+00:36:16,980 --> 00:36:21,480
+things using those containers. So Torch
+provides another package called
+
+487
+00:36:21,480 --> 00:36:23,230
+nngraph that lets you hook up
+
+488
+00:36:23,230 --> 00:36:28,210
+things in more complicated
+topologies pretty easily. So
+
+489
+00:36:28,210 --> 00:36:32,400
+here's an example: maybe we have
+three inputs and we want to produce one
+
+490
+00:36:32,400 --> 00:36:36,930
+output, and we want to produce it with
+this pretty simple update rule that
+
+491
+00:36:36,929 --> 00:36:40,379
+corresponds to this type of
+computational graph that we've seen many
+
+492
+00:36:40,380 --> 00:36:44,869
+times in lecture for different types of
+problems. You could actually implement
+
+493
+00:36:44,869 --> 00:36:49,430
+this just fine using Parallel and
+Sequential and ConcatTable, but it
+
+494
+00:36:49,429 --> 00:36:53,009
+could be kind of a mess, so when you
+want to do things like this it's very
+
+495
+00:36:53,010 --> 00:36:58,470
+common to use nngraph instead. This
+nngraph code is quite easy: here this
+
+496
+00:36:58,469 --> 00:37:03,179
+function is going to build a module
+using nngraph and then return it. Here
+
+497
+00:37:03,179 --> 00:37:09,129
+we import the nngraph package, and then
+inside here — this is a bit of funny
+
+498
+00:37:09,130 --> 00:37:14,329
+syntax — this is actually not a tensor,
+this is defining a symbolic variable.
+
+499
+00:37:14,329 --> 00:37:19,480
+This is saying that our graph module
+is going to receive x, y and z as
+
+500
+00:37:19,480 --> 00:37:25,300
+inputs, and now here we're actually doing
+symbolic operations on those inputs. So
+
+501
+00:37:25,300 --> 00:37:26,840
+here we're saying that
+
+502
+00:37:26,840 --> 00:37:32,700
+we want to have a pointwise addition of
+x and y; we want a pointwise
+
+503
+00:37:32,699 --> 00:37:38,159
+multiplication of a and z, stored in b;
+and now a pointwise addition of a and b,
+
+504
+00:37:38,159 --> 00:37:42,159
+stored in c. And again these are not
+actual tensor objects; these are
+
+505
+00:37:42,159 --> 00:37:45,109
+sort of symbolic references that are
+being used to build up this
+
+506
+00:37:45,110 --> 00:37:50,420
+computational graph in the background,
+and now we can actually return a
+
+507
+00:37:50,420 --> 00:37:55,159
+module here, where we say that our module
+will have inputs x, y and z and output
+
+508
+00:37:55,159 --> 00:38:00,920
+c, and this nn.gModule will actually
+give us an object conforming to
+
+509
+00:38:00,920 --> 00:38:05,559
+the module API that implements this
+computation. So then after we build the
+
+510
+00:38:05,559 --> 00:38:10,619
+module we can construct concrete
+Torch tensors and then feed them into
+
+511
+00:38:10,619 --> 00:38:19,170
+the module, which will actually compute
+the function.
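+
+(For reference, here is the function this graph computes, written as plain
+numpy rather than the Lua nngraph code on the slide; variable names follow
+the lecture's example.)
+
+```python
+import numpy as np
+
+def graph_fn(x, y, z):
+    a = x + y      # pointwise addition of x and y
+    b = a * z      # pointwise multiplication of a and z
+    c = a + b      # pointwise addition of a and b
+    return c
+
+graph_fn(np.ones(3), np.ones(3), np.ones(3))   # -> array([4., 4., 4.])
+```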
+512
+00:38:19,170 --> 00:38:22,670
+Torch is actually quite good
+on pretrained models: there is a
+
+513
+00:38:22,670 --> 00:38:27,050
+package called loadcaffe that lets
+you load up many different types of
+
+514
+00:38:27,050 --> 00:38:31,590
+pretrained models from Caffe, and it'll
+convert them into their Torch
+
+515
+00:38:31,590 --> 00:38:35,539
+equivalents. You can load up the Caffe
+prototxt and the Caffe model file and
+
+516
+00:38:35,539 --> 00:38:39,929
+it'll turn them into a giant stack of
+sequential layers. loadcaffe is not super
+
+517
+00:38:39,929 --> 00:38:44,649
+general, though, and only works for certain
+types of networks, but in particular
+
+518
+00:38:44,650 --> 00:38:49,660
+loadcaffe will let you load up AlexNet and
+CaffeNet and VGG, so those are probably
+
+519
+00:38:49,659 --> 00:38:54,259
+some of the most commonly used. There are
+also a couple of different implementations
+
+520
+00:38:54,260 --> 00:38:58,520
+that let you load up pretrained GoogLeNet
+models into Torch, and actually very
+
+521
+00:38:58,519 --> 00:39:01,869
+recently Facebook went ahead and
+reimplemented the residual networks
+
+522
+00:39:01,869 --> 00:39:07,900
+straight up in Torch and they released
+pretrained models for that. So between
+
+523
+00:39:07,900 --> 00:39:11,849
+AlexNet, CaffeNet, VGG, GoogLeNet and
+ResNet, I think that's probably everything you
+
+524
+00:39:11,849 --> 00:39:17,869
+need — all the pretrained models that most
+people want to use. Another point is that
+
+525
+00:39:17,869 --> 00:39:21,549
+because Torch is using Lua we can't use
+pip to install packages, but there's
+
+526
+00:39:21,550 --> 00:39:24,920
+another very similar idea called
+luarocks that easily installs new
+
+527
+00:39:24,920 --> 00:39:26,750
+packages and updates packages,
+
+528
+00:39:26,750 --> 00:39:29,650
+and it's quite easy to use.
+
+529
+00:39:29,650 --> 00:39:34,079
+And this is just a list of some
+packages that I find very useful in
+
+530
+00:39:34,079 --> 00:39:38,349
+Torch, which you can find by name:
+you can read and write HDF5 files,
+
+531
+00:39:38,349 --> 00:39:44,640
+you can read and write JSON; there's this
+funny one from Twitter, autograd, that is a
+
+532
+00:39:44,639 --> 00:39:47,980
+little bit like Theano, which we'll talk
+about in a bit — I haven't used
+
+533
+00:39:47,980 --> 00:39:52,369
+it, but it's kind of cool to look at. And
+actually Facebook has a pretty useful
+
+534
+00:39:52,369 --> 00:39:57,849
+library for Torch as well that
+implements FFT convolutions and also
+
+535
+00:39:57,849 --> 00:40:01,548
+implements data parallelism and model
+parallelism,
+
+536
+00:40:01,548 --> 00:40:07,449
+so that's a pretty nice thing to
+have. So a very typical workflow in Torch
+
+537
+00:40:07,449 --> 00:40:11,239
+is that you'll have some preprocessing
+script, often in Python, that'll
+
+538
+00:40:11,239 --> 00:40:15,818
+preprocess your data and dump it in
+some nice format on disk, usually HDF5
+
+539
+00:40:15,818 --> 00:40:20,528
+for big things and JSON for little things.
+Then you'll typically write a
+
+540
+00:40:20,528 --> 00:40:25,318
+training script in Lua that will read from
+the HDF5, train the model and optimize
+
+541
+00:40:25,318 --> 00:40:30,088
+the model and save checkpoints to disk,
+and then usually I have some evaluation
+
+542
+00:40:30,088 --> 00:40:35,019
+script that loads up a trained model and
+uses it for something useful. So a case
+
+543
+00:40:35,019 --> 00:40:39,000
+study for this type of workflow is this
+project I put up on GitHub a week ago
+
+544
+00:40:39,000 --> 00:40:43,969
+that implements character-level language
+models in Torch. So here there's a
+
+545
+00:40:43,969 --> 00:40:48,239
+preprocessing script that converts text
+files into HDF5 files; there's a
+
+546
+00:40:48,239 --> 00:40:52,889
+training script that loads from HDF5 and
+trains these recurrent networks; and then
+
+547
+00:40:52,889 --> 00:40:57,190
+there's a sampling script that loads up
+the checkpoints and generates text.
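+
+(A minimal sketch of that preprocessing step, assuming the h5py package;
+the filename, dataset name and shapes are made up for illustration.)
+
+```python
+import h5py
+import numpy as np
+
+data = np.random.randn(10000, 3, 64, 64).astype('float32')  # stand-in for real data
+
+with h5py.File('data.h5', 'w') as f:
+    # the Lua training script would read this dataset back by name
+    f.create_dataset('train_data', data=data)
+```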
+548
+00:40:57,190 --> 00:41:03,720
+That's kind of my typical workflow
+in Torch. So the quick pros and cons I
+
+549
+00:41:03,719 --> 00:41:07,169
+would say about Torch: Lua is a
+big turnoff for people, but I don't
+
+550
+00:41:07,170 --> 00:41:11,690
+think it's actually that big a deal. It's
+definitely less plug-and-play than Caffe, so
+
+551
+00:41:11,690 --> 00:41:15,760
+you'll typically end up writing a lot of
+your own code, which maybe is a little
+
+552
+00:41:15,760 --> 00:41:20,028
+bit more overhead but also gives you
+more flexibility. It has a lot of modular
+
+553
+00:41:20,028 --> 00:41:24,278
+pieces that are easy to plug
+together, and the standard library,
+
+554
+00:41:24,278 --> 00:41:26,880
+because it's all written in Lua, is
+quite easy to read and quite easy to
+
+555
+00:41:26,880 --> 00:41:31,740
+understand. There are a lot of pretrained
+models, which is quite nice, but
+
+556
+00:41:31,739 --> 00:41:34,598
+unfortunately it's a little bit
+awkward to use for recurrent networks in
+
+557
+00:41:34,599 --> 00:41:38,640
+general. So when you want
+to have multiple modules
+
+558
+00:41:38,639 --> 00:41:42,028
+that share weights with each other, you
+can actually do this in Torch but it's
+
+559
+00:41:42,028 --> 00:41:42,469
+kind of
+
+560
+00:41:42,469 --> 00:41:47,199
+brittle and you can run into subtle bugs
+there. So that's probably the
+
+561
+00:41:47,199 --> 00:41:49,649
+biggest caveat: recurrent
+networks can be tricky.
+
+562
+00:41:49,650 --> 00:42:15,800
+Any questions about Torch? [student
+question] Yeah... yeah, it's not out of the question.
+
+563
+00:42:15,800 --> 00:42:21,570
+The question was about how bad for
+loops are — Python is interpreted, right? So
+
+564
+00:42:21,570 --> 00:42:24,359
+that's really why for loops are really
+bad in Python: because it's interpreted,
+
+565
+00:42:24,358 --> 00:42:27,139
+every for loop is actually doing quite
+a lot of memory allocation and other
+
+566
+00:42:27,139 --> 00:42:31,960
+things behind the scenes. But if you've
+ever used JavaScript, then loops in
+
+567
+00:42:31,960 --> 00:42:35,059
+JavaScript tend to be pretty fast,
+because the runtime actually just
+
+568
+00:42:35,059 --> 00:42:39,759
+compiles the code on the fly down to
+native code, so loops in JavaScript are
+
+569
+00:42:39,760 --> 00:42:44,520
+really fast. And LuaJIT actually
+has a similar mechanism where it'll sort
+
+570
+00:42:44,519 --> 00:42:49,588
+of automatically and magically compile your
+code down to native code, so your loops
+
+571
+00:42:49,588 --> 00:42:53,608
+can be really fast. But writing
+custom vectorized code still can
+
+572
+00:42:53,608 --> 00:43:01,619
+give you a lot of speedup. All right,
+we've got now maybe half an hour left to
+
+573
+00:43:01,619 --> 00:43:06,420
+cover two more frameworks, so we're
+running out of time. Next up is
+
+574
+00:43:06,420 --> 00:43:12,000
+Theano. Theano is from Yoshua Bengio's
+group at the University of Montreal, and
+
+575
+00:43:12,000 --> 00:43:16,250
+it's really all about computational
+graphs. We saw a little bit in nngraph
+
+576
+00:43:16,250 --> 00:43:19,559
+from Torch that computational graphs are
+this pretty nice way to stitch together
+
+577
+00:43:19,559 --> 00:43:24,139
+big complicated architectures, and Theano
+really takes this idea of computational
+
+578
+00:43:24,139 --> 00:43:29,409
+graphs and runs with it to the extreme.
+It also has some high-level libraries,
+
+579
+00:43:29,409 --> 00:43:33,940
+Keras and Lasagne, that we'll touch on
+as well. So here's the same computational
+
+580
+00:43:33,940 --> 00:43:38,570
+graph we saw in the context of nngraph
+before, and we can actually walk through an
+
+581
+00:43:38,570 --> 00:43:43,400
+implementation of this in Theano. You
+can see that here we're importing
+
+582
+00:43:43,400 --> 00:43:49,440
+theano and the theano.tensor module, and
+now here we're defining x, y and z as
+
+583
+00:43:49,440 --> 00:43:53,099
+symbolic variables. This is
+actually very similar to the nn-
+
+584
+00:43:53,099 --> 00:43:55,530
+graph example we saw just a few slides
+ago,
+
+585
+00:43:55,530 --> 00:43:59,500
+so these are actually not numpy arrays;
+these are sort of symbolic objects
+
+586
+00:43:59,500 --> 00:44:05,690
+in the computational graph. Then we
+can actually compute these outputs
+
+587
+00:44:05,690 --> 00:44:11,679
+symbolically: x, y and z are these symbolic
+things and we can compute a, b and c
+
+588
+00:44:11,679 --> 00:44:15,769
+just using these overloaded operators,
+and that'll be building up this
+
+589
+00:44:15,769 --> 00:44:19,929
+computational graph in the background.
+Then once we've built up our
+
+590
+00:44:19,929 --> 00:44:23,839
+computational graph, we actually want to
+be able to run certain parts of it on
+
+591
+00:44:23,840 --> 00:44:29,240
+real data. So we call this
+theano.function thing; this is saying that
+
+592
+00:44:29,239 --> 00:44:33,269
+our function will take inputs
+x, y and z and it'll produce
+
+593
+00:44:33,269 --> 00:44:38,329
+output c. This will return an actual
+Python function that we can evaluate on real data.
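+
+(A minimal sketch of the Theano graph just described, assuming Theano is
+installed; the names x, y, z, a, b, c follow the lecture's example.)
+
+```python
+import numpy as np
+import theano
+import theano.tensor as T
+
+x = T.vector('x')          # symbolic inputs, not actual arrays
+y = T.vector('y')
+z = T.vector('z')
+
+a = x + y                  # pointwise addition, builds graph nodes
+b = a * z                  # pointwise multiplication
+c = a + b
+
+f = theano.function(inputs=[x, y, z], outputs=c)   # compile the graph
+
+# evaluate on real numpy arrays
+out = f(np.ones(4, dtype='float32'),
+        np.ones(4, dtype='float32'),
+        np.ones(4, dtype='float32'))               # -> array of 4.0s
+```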
+594
+00:44:38,329 --> 00:44:42,239
+And I'd like to point out that this
+is really where all the magic in
+
+595
+00:44:42,239 --> 00:44:46,319
+Theano is happening: when you call
+theano.function it can be doing crazy
+
+596
+00:44:46,320 --> 00:44:49,580
+things. It can simplify your
+computational graph to make it more
+
+597
+00:44:49,579 --> 00:44:54,199
+efficient, it can actually symbolically
+derive gradients and other things, and
+
+598
+00:44:54,199 --> 00:44:58,319
+it can actually generate native code. So
+when you call theano.function it
+
+599
+00:44:58,320 --> 00:45:02,450
+actually sometimes compiles code on the
+fly to run efficiently on the GPU. So
+
+600
+00:45:02,449 --> 00:45:06,389
+all the magic in Theano is really coming
+from this little innocent-
+
+601
+00:45:06,389 --> 00:45:11,750
+looking statement in Python, but there's
+a lot going on under the hood here. And
+
+602
+00:45:11,750 --> 00:45:14,710
+now once we've gotten this magic
+function, through all this crazy stuff,
+
+603
+00:45:14,710 --> 00:45:19,159
+then we can just run it on actual numpy
+arrays. So here we instantiate
+
+604
+00:45:19,159 --> 00:45:25,440
+xx, yy and zz as actual numpy
+arrays, and then we can just call
+
+605
+00:45:25,440 --> 00:45:30,639
+our function, passing in these actual numpy
+arrays to get the values out. And this
+
+606
+00:45:30,639 --> 00:45:35,359
+is doing the same thing as doing these
+computations explicitly in Python,
+
+607
+00:45:35,360 --> 00:45:39,289
+except that the Theano version could be
+much more efficient due to all the magic
+
+608
+00:45:39,289 --> 00:45:42,840
+under the hood, and the Theano version
+actually could be running on the GPU if
+
+609
+00:45:42,840 --> 00:45:47,289
+you have it configured. But
+unfortunately we don't really care about
+
+610
+00:45:47,289 --> 00:45:51,659
+computing things like this; we want
+neural nets. So here's an example of a
+
+611
+00:45:51,659 --> 00:45:57,629
+simple two-layer neural net in Theano.
+The idea is the same: we're going to
+
+612
+00:45:57,630 --> 00:46:02,860
+declare our inputs, but now instead of
+just x, y and z we have our input pixels x,
+
+613
+00:46:02,860 --> 00:46:06,490
+our labels y, which are a vector, and
+
+614
+00:46:06,489 --> 00:46:11,009
+our two weight matrices w1 and w2. So we're
+just sort of setting up these symbolic
+
+615
+00:46:11,010 --> 00:46:17,540
+variables that will be elements in our
+computational graph. Now for the forward pass,
+
+616
+00:46:17,539 --> 00:46:21,179
+it looks kind of like numpy but it's not;
+these are operations on the symbolic
+
+617
+00:46:21,179 --> 00:46:24,669
+objects that are building up the graph
+in the background. So here we're computing
+618
+00:46:24,670 --> 00:46:28,909
+the activations with this .dot method,
+which is a matrix multiply between symbolic
+
+619
+00:46:28,909 --> 00:46:33,210
+objects; we're doing a ReLU using
+a library function; and we're
+
+620
+00:46:33,210 --> 00:46:37,769
+doing another matrix multiply. And then
+we can actually compute the
+
+621
+00:46:37,769 --> 00:46:41,210
+probabilities and the loss using a couple
+of other library functions. Again these
+
+622
+00:46:41,210 --> 00:46:44,349
+are all operations on the symbolic
+objects that are building up the
+
+623
+00:46:44,349 --> 00:46:50,420
+computational graph. So then we can just
+compile this function: our function
+
+624
+00:46:50,420 --> 00:46:54,570
+is going to take our data, our labels
+and our two weight matrices as
+
+625
+00:46:54,570 --> 00:46:58,890
+inputs, and as outputs it will return
+the loss as a scalar and our
+
+626
+00:46:58,889 --> 00:47:04,109
+classification scores as a vector. And
+now we can run this thing on real data
+
+627
+00:47:04,110 --> 00:47:07,559
+just like we saw in the previous slide:
+we can instantiate some actual numpy
+
+628
+00:47:07,559 --> 00:47:13,759
+arrays and then pass them to the function.
+So this is great, but this is only the
+
+629
+00:47:13,760 --> 00:47:17,820
+forward pass; to actually be able to train
+this network we need to compute gradients. So
+
+630
+00:47:17,820 --> 00:47:23,000
+here we just need to add a couple of lines
+of code to do that. So this is the same
+
+631
+00:47:23,000 --> 00:47:27,170
+as before: we're defining our
+symbolic variables for our inputs and
+
+632
+00:47:27,170 --> 00:47:29,510
+our weights and so forth, and we're
+
+633
+00:47:29,510 --> 00:47:33,980
+running the same forward pass as before
+to compute the loss
+
+634
+00:47:33,980 --> 00:47:37,920
+symbolically. Now the difference is that
+we actually can do
+
+635
+00:47:37,920 --> 00:47:43,680
+symbolic differentiation here. So for
+dw1 and dw2 we're telling Theano
+
+636
+00:47:43,679 --> 00:47:47,129
+that we want those to be the
+gradients of the loss
+
+637
+00:47:47,130 --> 00:47:52,280
+with respect to those other symbolic
+variables w1 and w2. So this is
+
+638
+00:47:52,280 --> 00:47:52,930
+really cool:
+
+639
+00:47:52,929 --> 00:47:56,549
+Theano just lets you take arbitrary
+gradients of any part of the graph with
+
+640
+00:47:56,550 --> 00:48:00,289
+respect to any other part of the graph,
+and introduces those as new
+
+641
+00:48:00,289 --> 00:48:05,190
+symbolic variables in the graph, so
+you can really go crazy with that. But
+
+642
+00:48:05,190 --> 00:48:09,470
+here in this case we're just going to
+return those gradients as outputs. So now
+
+643
+00:48:09,469 --> 00:48:14,049
+we're going to compile a new function that
+again is going to take our
+
+644
+00:48:14,050 --> 00:48:19,510
+input pixels x and our labels
+y along with the two weight matrices,
+
+645
+00:48:19,510 --> 00:48:23,140
+and now it's going to return our loss,
+the classification scores and also these
+
+646
+00:48:23,139 --> 00:48:28,250
+two gradients, which we can now use
+to train a very simple neural network.
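+
+(A rough sketch of that two-layer net with symbolic gradients; layer sizes
+and the loss here are made up — the lecture used a classification loss — but
+the .dot / relu / T.grad pattern is the one just described.)
+
+```python
+import theano
+import theano.tensor as T
+
+x = T.matrix('x')                    # input data
+y = T.matrix('y')                    # targets
+w1 = T.matrix('w1')                  # weight matrices passed in as inputs
+w2 = T.matrix('w2')
+
+a = x.dot(w1)                        # matrix multiply via the .dot method
+h = T.nnet.relu(a)                   # ReLU from the library
+scores = h.dot(w2)
+loss = T.mean((scores - y) ** 2)     # stand-in loss for illustration
+
+dw1, dw2 = T.grad(loss, [w1, w2])    # symbolic gradients w.r.t. the weights
+
+f = theano.function([x, y, w1, w2], [loss, scores, dw1, dw2])
+```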
+647
+00:48:28,250 --> 00:48:32,809
+So we can actually just
+implement gradient
+
+648
+00:48:32,809 --> 00:48:36,630
+descent in just a couple of lines
+using this computational
+
+649
+00:48:36,630 --> 00:48:38,990
+graph. So here we're
+
+650
+00:48:38,989 --> 00:48:43,599
+instantiating actual numpy arrays
+for the dataset and the vectors, and
+
+651
+00:48:43,599 --> 00:48:45,489
+some random matrices, again as
+
+652
+00:48:45,489 --> 00:48:49,839
+actual numpy arrays, and now every
+time we make this call to f we
+
+653
+00:48:49,840 --> 00:48:50,519
+get back
+
+654
+00:48:50,519 --> 00:48:54,710
+numpy arrays containing the loss and the
+scores and the gradients. So now that we
+
+655
+00:48:54,710 --> 00:48:57,800
+have the gradients we can just make a
+simple gradient update on our weights,
+
+656
+00:48:57,800 --> 00:49:01,970
+and we just run this in a loop to
+train our network. But there's actually a
+
+657
+00:49:01,969 --> 00:49:06,039
+bit of a problem with this, especially if
+you're running on a GPU. Can
+
+658
+00:49:06,039 --> 00:49:15,599
+anyone spot the problem? The problem is
+that this is actually incurring a lot of
+
+659
+00:49:15,599 --> 00:49:21,059
+communication overhead between the
+CPU and GPU, because every time we
+
+660
+00:49:21,059 --> 00:49:24,799
+call this function and we get back
+these gradients, that's copying the
+
+661
+00:49:24,800 --> 00:49:29,720
+gradients from the GPU back to the CPU,
+and that can be an expensive operation. And
+
+662
+00:49:29,719 --> 00:49:35,000
+now we're actually making our gradient
+step as CPU computation in numpy. So
+
+663
+00:49:35,000 --> 00:49:38,190
+it would be really nice if we could make those
+gradient updates to our parameters directly on the GPU.
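+
+(Continuing the sketch above: a naive training loop. Every call to f copies
+the gradients back to the CPU before the numpy update, which is exactly the
+overhead being discussed; the sizes are made up.)
+
+```python
+import numpy as np
+
+N, D, H, C = 64, 1000, 100, 10      # made-up sizes
+xx = np.random.randn(N, D).astype('float32')
+yy = np.random.randn(N, C).astype('float32')
+ww1 = np.random.randn(D, H).astype('float32')
+ww2 = np.random.randn(H, C).astype('float32')
+
+learning_rate = 1e-5
+for t in range(50):
+    loss_val, scores_val, dww1, dww2 = f(xx, yy, ww1, ww2)
+    ww1 -= learning_rate * dww1     # gradient step happens on the CPU in numpy
+    ww2 -= learning_rate * dww2
+```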
+664
+00:49:38,190 --> 00:49:45,389
+The way that we do
+that in Theano is with
+
+665
+00:49:45,389 --> 00:49:50,619
+this cool thing called a shared
+variable. A shared variable is another
+
+666
+00:49:50,619 --> 00:49:54,230
+part of the network; it's a value
+that lives inside the computational
+
+667
+00:49:54,230 --> 00:49:59,340
+graph and actually persists from call to
+call. So here this is actually
+
+668
+00:49:59,340 --> 00:50:04,150
+quite similar to before: we're
+defining our same symbolic variables x and y
+
+669
+00:50:04,150 --> 00:50:08,769
+for the data and labels, and now we're
+defining a couple of these new funky
+
+670
+00:50:08,769 --> 00:50:13,809
+shared variables for our two weight
+matrices, and we're initializing
+
+671
+00:50:13,809 --> 00:50:19,110
+these weight matrices with numpy arrays.
+And now this is the same as before; this
+
+672
+00:50:19,110 --> 00:50:22,910
+is the exact same code as before, where we're
+computing the forward pass using these
+
+673
+00:50:22,909 --> 00:50:24,980
+library functions and symbolically computing
+
+674
+00:50:24,980 --> 00:50:30,940
+gradients. But now the difference is in
+how we define our function. Now this
+
+675
+00:50:30,940 --> 00:50:32,269
+compiled function
+
+676
+00:50:32,269 --> 00:50:36,780
+does not receive the weights
+as inputs; those actually live
+
+677
+00:50:36,780 --> 00:50:41,320
+inside the computational graph. Instead
+we just receive the data
+
+678
+00:50:41,320 --> 00:50:45,210
+and the labels, and now we are going to
+output the loss, but rather than output the
+
+679
+00:50:45,210 --> 00:50:49,639
+gradients explicitly, we instead
+provide these update rules that
+
+680
+00:50:49,639 --> 00:50:53,819
+should be run every time the function is
+called. So these update rules are
+
+681
+00:50:53,820 --> 00:50:57,920
+little functions that operate on the
+symbolic variables. This is just
+
+682
+00:50:57,920 --> 00:51:02,010
+saying that we should make these gradient
+descent steps to update w1 and w2
+
+683
+00:51:02,010 --> 00:51:09,290
+every time we run this computational
+graph. So now to
+
+684
+00:51:09,289 --> 00:51:12,880
+train this network all we need to do is
+call this function repeatedly, and every
+
+685
+00:51:12,880 --> 00:51:16,869
+time we call the function it will
+make a gradient step on the weights. So
+
+686
+00:51:16,869 --> 00:51:21,210
+we can just train this network by
+calling this thing repeatedly. In
+
+687
+00:51:21,210 --> 00:51:23,769
+practice, when you're doing this
+kind of thing in Theano, you'll
+
+688
+00:51:23,769 --> 00:51:27,579
+often define a training function that will
+update the weights, and then also an
+
+689
+00:51:27,579 --> 00:51:31,719
+evaluation function that will just output the
+scores and not make any updates. You can
+
+690
+00:51:31,719 --> 00:51:34,609
+actually have multiple of these compiled
+functions that evaluate different parts of the same graph.
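+
+(Continuing the sketch: the same net with theano.shared, so the weights live
+inside the graph and the update rules run inside the compiled function;
+sizes and learning rate are again made up.)
+
+```python
+x = T.matrix('x')
+y = T.matrix('y')
+w1 = theano.shared(np.random.randn(D, H).astype('float32'), name='w1')
+w2 = theano.shared(np.random.randn(H, C).astype('float32'), name='w2')
+
+h = T.nnet.relu(x.dot(w1))
+scores = h.dot(w2)
+loss = T.mean((scores - y) ** 2)
+
+dw1, dw2 = T.grad(loss, [w1, w2])
+lr = 1e-5
+train = theano.function([x, y], loss,
+                        updates=[(w1, w1 - lr * dw1),   # update rules run on
+                                 (w2, w2 - lr * dw2)])  # every call
+
+for t in range(50):
+    train(xx, yy)   # weights update in place, no gradient copies back to numpy
+```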
+691
+00:51:34,610 --> 00:51:47,220
+[student question] Yeah — the question
+is how we compute gradients,
+
+692
+00:51:47,219 --> 00:51:51,119
+and whether it does it symbolically, sort
+of parsing out the AST. Well, it's not
+
+693
+00:51:51,119 --> 00:51:55,219
+actually parsing the AST, because every
+time you make these calls it's sort of
+
+694
+00:51:55,219 --> 00:51:58,769
+building up this computational graph
+object, and then you can compute
+
+695
+00:51:58,769 --> 00:52:06,090
+gradients by just adding nodes onto the
+computational graph object. So yeah,
+
+696
+00:52:06,090 --> 00:52:09,360
+it needs to know, for each of these
+basic operators, what the
+
+697
+00:52:09,360 --> 00:52:12,500
+derivative is, and it's
+still the normal
+
+698
+00:52:12,500 --> 00:52:17,309
+back-propagation that you've seen; it
+works. But the pitch with
+
+699
+00:52:17,309 --> 00:52:21,299
+Theano is that it works on these
+very low-level basic operations,
+
+700
+00:52:21,300 --> 00:52:24,920
+like these elementwise things and matrix
+multiplies, and it is hoping that
+
+701
+00:52:24,920 --> 00:52:27,800
+it can compile efficient code to
+combine those and simplify it
+
+702
+00:52:27,800 --> 00:52:32,210
+symbolically. I'm not sure how
+well it works, but that's at least what
+
+703
+00:52:32,210 --> 00:52:37,110
+they claim to do. So there are a
+lot of other advanced things that you
+
+704
+00:52:37,110 --> 00:52:40,309
+can do in Theano that we just
+don't have time to talk about: you can
+
+705
+00:52:40,309 --> 00:52:43,610
+actually include conditionals directly
+inside your computational graph using
+
+706
+00:52:43,610 --> 00:52:44,809
+these ifelse
+
+707
+00:52:44,809 --> 00:52:49,029
+and switch commands; you can actually
+include loops inside your computational
+
+708
+00:52:49,030 --> 00:52:52,370
+graph using this funny scan
+function that I don't really understand,
+
+709
+00:52:52,369 --> 00:52:57,409
+but theoretically it lets
+you implement recurrent networks quite
+
+710
+00:52:57,409 --> 00:53:01,909
+easily. As you can imagine, to implement
+a recurrent network in one of these
+
+711
+00:53:01,909 --> 00:53:05,539
+computational graphs, all you're doing is
+passing the same weight matrix into
+
+712
+00:53:05,539 --> 00:53:10,110
+multiple nodes, and scan actually lets
+you sort of do that in a loop and have
+
+713
+00:53:10,110 --> 00:53:14,680
+the loop be an explicit part of
+the graph. And we can actually go crazy
+
+714
+00:53:14,679 --> 00:53:17,909
+with derivatives: we can compute
+derivatives of any
+
+715
+00:53:17,909 --> 00:53:21,149
+part of the graph with respect to any
+other part; we can also compute Jacobi-
+
+716
+00:53:21,150 --> 00:53:24,300
+ans by computing derivatives of
+derivatives; we can use L- and R-
+
+717
+00:53:24,300 --> 00:53:29,140
+operators to efficiently do big
+matrix-vector multiplies against
+
+718
+00:53:29,139 --> 00:53:32,500
+Jacobians. You can do a lot of
+pretty cool derivative-taking
+
+719
+00:53:32,500 --> 00:53:36,610
+stuff in Theano that's maybe beyond the
+other frameworks, and it also has some
+
+720
+00:53:36,610 --> 00:53:40,180
+support for sparse matrices; it tries to
+optimize your code on the fly and
+
+721
+00:53:40,179 --> 00:53:45,669
+do some other cool things. Theano does
+have multi-GPU support: there's this
+
+722
+00:53:45,670 --> 00:53:50,599
+package that I have not used but that
+claims you can get data parallelism,
+
+723
+00:53:50,599 --> 00:53:54,500
+so distribute a minibatch split up
+over multiple GPUs, and there's
+
+724
+00:53:54,500 --> 00:53:57,260
+experimental support for model
+parallelism, where the computational
+
+725
+00:53:57,260 --> 00:54:01,320
+graph will be divided among
+different devices. But the documentation
+
+726
+00:54:01,320 --> 00:54:08,030
+says it's experimental, so it's probably
+really experimental. So you saw
+
+727
+00:54:08,030 --> 00:54:11,730
+when working with Theano that the API
+is a little bit low-level and we need
+
+728
+00:54:11,730 --> 00:54:15,769
+to sort of implement the update rules
+and everything ourselves. So Lasagne is
+
+729
+00:54:15,769 --> 00:54:19,900
+this high-level wrapper around Theano
+that sort of abstracts away some of
+
+730
+00:54:19,900 --> 00:54:24,660
+those details for you. So again we're
+sort of defining symbolic matrices, and
+
+731
+00:54:24,659 --> 00:54:28,659
+Lasagne now has these layer functions
+that will automatically set up the
+
+732
+00:54:28,659 --> 00:54:32,489
+shared variables and that sort of thing.
+We can compute the probabilities and the
+
+733
+00:54:32,489 --> 00:54:38,469
+loss using these convenient things from
+the library, and Lasagne can actually
+
+734
+00:54:38,469 --> 00:54:41,969
+write these update rules for us to
+implement Nesterov momentum and
+
+735
+00:54:41,969 --> 00:54:47,109
+other fancy things. And now when we
+compile our function we actually just
+
+736
+00:54:47,110 --> 00:54:51,390
+pass in these update rules that were
+written for us by Lasagne, and all of
+
+737
+00:54:51,389 --> 00:54:51,839
+the weight
+
+738
+00:54:51,840 --> 00:54:56,309
+objects are taken care of
+for us by Lasagne as well,
+
+739
+00:54:56,309 --> 00:54:59,579
+and at the end of the day we just end
+up with one of these compiled Theano
+
+740
+00:54:59,579 --> 00:55:04,599
+functions and we use it the same way as
+before.
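+
+(A minimal Lasagne sketch along the lines just described, assuming Lasagne
+is installed; sizes and hyperparameters are made up.)
+
+```python
+import lasagne
+import theano
+import theano.tensor as T
+
+x = T.matrix('x')
+y = T.ivector('y')
+
+l_in = lasagne.layers.InputLayer((None, 1000), input_var=x)
+l_hid = lasagne.layers.DenseLayer(l_in, num_units=100,
+                                  nonlinearity=lasagne.nonlinearities.rectify)
+l_out = lasagne.layers.DenseLayer(l_hid, num_units=10,
+                                  nonlinearity=lasagne.nonlinearities.softmax)
+
+probs = lasagne.layers.get_output(l_out)
+loss = lasagne.objectives.categorical_crossentropy(probs, y).mean()
+
+params = lasagne.layers.get_all_params(l_out, trainable=True)
+updates = lasagne.updates.nesterov_momentum(loss, params,      # Lasagne writes
+                                            learning_rate=1e-2,  # the update
+                                            momentum=0.9)        # rules for us
+train = theano.function([x, y], loss, updates=updates)
+```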
+741
+00:55:04,599 --> 00:55:10,480
+There's another wrapper for
+Theano that's pretty popular
+
+742
+00:55:10,480 --> 00:55:15,730
+called Keras, which is even
+more high-level. So here we're
+
+743
+00:55:15,730 --> 00:55:20,559
+making a Sequential container and adding
+a stack of layers to it, so this is kind
+
+744
+00:55:20,559 --> 00:55:25,789
+of like Torch, and now we're
+making this SGD object that is going to
+
+745
+00:55:25,789 --> 00:55:29,759
+handle the updates for us, and now we can
+train our network by just using the
+
+746
+00:55:29,760 --> 00:55:36,570
+model.fit method. So this is super
+high-level and you can't even tell that it's
+
+747
+00:55:36,570 --> 00:55:40,289
+using Theano; in fact Keras supports TensorFlow
+as a backend as well, so you don't have to use Theano with it.
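+
+(A sketch of the Keras version just described, using the 1.x-era Sequential
+API from around the time of this lecture; argument names like nb_epoch differ
+in newer Keras, and the arrays here are stand-ins.)
+
+```python
+import numpy as np
+from keras.models import Sequential
+from keras.layers import Dense, Activation
+from keras.optimizers import SGD
+
+model = Sequential()
+model.add(Dense(100, input_dim=1000))
+model.add(Activation('relu'))
+model.add(Dense(10))
+model.add(Activation('softmax'))
+
+sgd = SGD(lr=1e-2, momentum=0.9, nesterov=True)
+model.compile(loss='categorical_crossentropy', optimizer=sgd)
+
+X = np.random.randn(64, 1000)
+Y = np.zeros((64, 10))                       # labels must be one-hot here!
+Y[np.arange(64), np.random.randint(10, size=64)] = 1
+
+model.fit(X, Y, batch_size=16)               # super high-level training
+```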
+748
+00:55:40,289 --> 00:55:44,500
+But there's actually one big problem with
+this piece of code, and I don't know if you
+
+749
+00:55:44,500 --> 00:55:49,219
+have experience with Theano, but this code
+actually crashes, and it crashes in a
+
+750
+00:55:49,219 --> 00:55:54,750
+really bad way. This is the error message:
+we get this giant stack trace, none of
+
+751
+00:55:54,750 --> 00:55:58,380
+which is through any of the code that we
+wrote, and we get this giant ValueError
+
+752
+00:55:58,380 --> 00:56:03,440
+that doesn't make any sense to me. I'm
+not really an expert in Theano, so this
+
+753
+00:56:03,440 --> 00:56:07,039
+was really confusing to me: we wrote
+this simple-looking code in Ke-
+
+754
+00:56:07,039 --> 00:56:11,259
+ras, but because it's using Theano as a
+backend it crapped out and gave us this
+
+755
+00:56:11,260 --> 00:56:15,030
+really confusing error message. So that's,
+I think, one of the common pain points
+
+756
+00:56:15,030 --> 00:56:18,730
+and failure cases with anything that uses
+Theano as a backend: debugging can
+
+757
+00:56:18,730 --> 00:56:24,949
+be kind of hard. So like any good developer
+I googled the error, and I found out that I
+
+758
+00:56:24,949 --> 00:56:28,659
+was feeding in the y variable
+wrong, and I was
+
+759
+00:56:28,659 --> 00:56:32,579
+supposed to use this other function
+to convert my y variable and
+
+760
+00:56:32,579 --> 00:56:35,690
+make the problem go away — but that was
+not obvious from the error message.
+
+761
+00:56:35,690 --> 00:56:41,139
+That's something to be worried
+about when using Theano. Theano
+
+762
+00:56:41,139 --> 00:56:44,699
+actually has pretrained models: we talked
+about Lasagne, and Lasagne
+
+763
+00:56:44,699 --> 00:56:48,539
+actually has a pretty good model zoo with
+a lot of different popular model
+
+764
+00:56:48,539 --> 00:56:52,820
+architectures that you might want. So
+in Lasagne you can use AlexNet and Goog-
+
+765
+00:56:52,820 --> 00:56:56,190
+LeNet and VGG — I don't think they have
+ResNet yet, but they have quite a lot
+
+766
+00:56:56,190 --> 00:57:00,320
+of useful things there. And there are a
+couple of other packages I found that
+
+767
+00:57:00,320 --> 00:57:04,550
+also seem good — one of them was clearly
+awesome, because it
+
+768
+00:57:04,550 --> 00:57:07,030
+was a CS231n project from last year —
+
+769
+00:57:07,030 --> 00:57:10,330
+but if you're going to pick one, I think
+probably the Lasagne model zoo is
+
+770
+00:57:10,329 --> 00:57:16,139
+really good. So from my one-day
+experience of playing with Theano,
+
+771
+00:57:16,139 --> 00:57:20,029
+the pros and cons that I could see
+were that it's Python and numpy —
+
+772
+00:57:20,030 --> 00:57:20,890
+that's great —
+
+773
+00:57:20,889 --> 00:57:23,920
+and this computational graph seems like a
+really powerful idea, especially around
+
+774
+00:57:23,920 --> 00:57:28,760
+computing gradients symbolically and all
+these optimizations. Especially with R-
+
+775
+00:57:28,760 --> 00:57:32,070
+NNs, I think they would be much easier to
+implement using this computational graph.
+
+776
+00:57:32,070 --> 00:57:37,570
+Raw Theano is kind of ugly and gross, but
+especially Lasagne looks pretty good to
+
+777
+00:57:37,570 --> 00:57:41,470
+me and sort of takes away some of the
+pain. The error messages can be pretty
+
+778
+00:57:41,469 --> 00:57:46,279
+painful, as we saw, and big models, from
+what I've heard, can have really long
+
+779
+00:57:46,280 --> 00:57:51,190
+compile times. So when we're
+compiling that function on the fly, for
+
+780
+00:57:51,190 --> 00:57:54,579
+all these simple examples it pretty
+much runs instantaneously, but when we're
+
+781
+00:57:54,579 --> 00:57:58,159
+doing big complicated things like neural
+Turing machines, I've heard stories that
+
+782
+00:57:58,159 --> 00:58:01,969
+it could actually take maybe half an
+hour to compile, and that's not
+
+783
+00:58:01,969 --> 00:58:06,239
+good for iterating
+quickly on your models. Another sort
+
+784
+00:58:06,239 --> 00:58:10,509
+of pain point is that the API is much
+fatter than Torch's, and it's doing all
+
+785
+00:58:10,510 --> 00:58:13,470
+this complicated stuff in the background,
+so it's kind of hard to understand and
+
+786
+00:58:13,469 --> 00:58:17,969
+debug what's actually happening to your
+code. And then pretrained models are maybe
+
+787
+00:58:17,969 --> 00:58:22,569
+not quite as good as Caffe or Torch, but
+it looks like Lasagne is pretty good.
+
+788
+00:58:22,570 --> 00:58:30,320
+OK, so we've got fifteen minutes now to
+talk about TensorFlow, although first, if
+
+789
+00:58:30,320 --> 00:58:38,309
+there are any questions about Theano I
+can try... OK. So TensorFlow:
+
+790
+00:58:38,309 --> 00:58:42,809
+TensorFlow is from Google. It's really
+cool and shiny and new and everyone's
+
+791
+00:58:42,809 --> 00:58:47,829
+excited about it, and it's actually very
+similar to Theano in a lot of ways:
+
+792
+00:58:47,829 --> 00:58:51,170
+they're really taking this idea of a
+computational graph and building on
+
+793
+00:58:51,170 --> 00:58:55,650
+that for everything. So TensorFlow and
+Theano are actually very closely linked
+
+794
+00:58:55,650 --> 00:58:59,090
+in my mind, and that's sort of why
+Keras can get away with using either
+
+795
+00:58:59,090 --> 00:59:04,760
+one as a backend. Also, maybe one
+point to make about TensorFlow is that
+
+796
+00:59:04,760 --> 00:59:07,200
+it's sort of the first one of these
+frameworks that was designed from the
+
+797
+00:59:07,199 --> 00:59:10,750
+ground up by professional engineers.
+
+798
+00:59:10,750 --> 00:59:14,000
+A lot of the other frameworks sort of
+spun out of academic research labs, and
+
+799
+00:59:14,000 --> 00:59:17,320
+they're really great and they let you do
+things really well, but they were sort of
+
+800
+00:59:17,320 --> 00:59:23,120
+maintained by grad students. So Torch
+especially is maintained by
+
+801
+00:59:23,119 --> 00:59:26,500
+some engineers at Twitter and Facebook
+now, but it was originally an academic
+
+802
+00:59:26,500 --> 00:59:30,070
+project, and of all of these I think
+TensorFlow was the first one that was
+
+803
+00:59:30,070 --> 00:59:35,000
+built from the ground up in an
+industrial place, so maybe theoretically
+
+804
+00:59:35,000 --> 00:59:37,989
+that could lead to better code quality
+or test coverage or something, I don't know,
+
+805
+00:59:37,989 --> 01:00:04,519
+I'm not sure. [inaudible question] It
+seemed pretty scary. So here's our favorite two-layer
+
+806
+01:00:04,519 --> 01:00:07,389
+ReLU net that we did in all the other
+frameworks; let's do it in TensorFlow.
+807
+01:00:07,389 --> 01:00:12,769
+This is actually really similar
+to Theano. You can see that we're
+
+808
+01:00:12,769 --> 01:00:17,320
+importing tensorflow, and in Theano,
+remember, we have these matrix and vector
+
+809
+01:00:17,320 --> 01:00:21,019
+symbolic variables; in TensorFlow
+they're called placeholders, but it's the
+
+810
+01:00:21,019 --> 01:00:26,380
+same idea — these are just creating input
+nodes in our computational graph. We're
+
+811
+01:00:26,380 --> 01:00:30,650
+also going to define the weight matrices:
+in Theano we have these shared things
+
+812
+01:00:30,650 --> 01:00:34,490
+that live inside the computational graph;
+the same idea in TensorFlow is called
+
+813
+01:00:34,489 --> 01:00:40,359
+variables. Just like in Theano,
+we compute our forward pass using
+
+814
+01:00:40,360 --> 01:00:44,610
+these library methods that operate
+symbolically on these things
+
+815
+01:00:44,610 --> 01:00:48,289
+and build up a computational graph, so
+that lets you easily compute the
+
+816
+01:00:48,289 --> 01:00:52,210
+probabilities and the loss and
+everything like that symbolically. This
+
+817
+01:00:52,210 --> 01:00:56,190
+actually, I think, to me looks a little
+bit more
+
+818
+01:00:56,190 --> 01:01:00,740
+like Keras or Lasagne than raw Theano.
+But we're using this gradient descent
+
+819
+01:01:00,739 --> 01:01:04,669
+optimizer and we're telling it to
+minimize the loss. So here we're not
+
+820
+01:01:04,670 --> 01:01:08,970
+explicitly spitting out gradients
+and we're not explicitly writing out
+
+821
+01:01:08,969 --> 01:01:13,489
+the update rules; we're instead using this
+optimizer thing that just sort of adds
+
+822
+01:01:13,489 --> 01:01:19,250
+whatever it needs to into the graph in
+order to minimize that loss. And now, just
+
+823
+01:01:19,250 --> 01:01:23,059
+like in Theano, we can actually
+instantiate actual numpy
+
+824
+01:01:23,059 --> 01:01:23,779
+arrays,
+
+825
+01:01:23,780 --> 01:01:29,470
+some small datasets, and then we can run
+in a loop. So in TensorFlow, when
+
+826
+01:01:29,469 --> 01:01:33,750
+you actually want to run your code,
+you need to wrap it in this
+
+827
+01:01:33,750 --> 01:01:39,199
+session thing. I don't fully understand
+what it's doing, but you have to do it;
+
+828
+01:01:39,199 --> 01:01:42,599
+actually, what it's doing is that
+everything up to this point was just
+
+829
+01:01:42,599 --> 01:01:45,869
+setting up your computational graph, and
+this session is actually doing
+
+830
+01:01:45,869 --> 01:01:48,440
+whatever optimization it needs
+to actually run it.
+
+831
+01:01:48,440 --> 01:01:58,110
+[student question] Yeah — the question
+is what one-hot means. If you remember, in
+
+832
+01:01:58,110 --> 01:02:01,840
+your assignments, when you did the
+softmax loss function, y was
+
+833
+01:02:01,840 --> 01:02:06,170
+always an integer telling you which
+class you wanted, but in some of these
+
+834
+01:02:06,170 --> 01:02:11,420
+frameworks, instead of an integer it
+should be a vector where everything is
+
+835
+01:02:11,420 --> 01:02:15,090
+zero except for the entry of the correct
+class. That was actually the
+
+836
+01:02:15,090 --> 01:02:20,420
+bug that tripped me up in Keras back
+there: the difference between one-hot
+
+837
+01:02:20,420 --> 01:02:28,710
+and not one-hot, and it turns out
+TensorFlow wants one-hot here. Right.
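+
+(The one-hot conversion just described, sketched in plain numpy: integer
+labels become vectors with a single 1 at the correct class.)
+
+```python
+import numpy as np
+
+def to_one_hot(y, num_classes):
+    out = np.zeros((len(y), num_classes), dtype='float32')
+    out[np.arange(len(y)), y] = 1.0   # set the entry of the correct class
+    return out
+
+to_one_hot(np.array([0, 2, 1]), 3)
+# array([[1., 0., 0.],
+#        [0., 0., 1.],
+#        [0., 1., 0.]], dtype=float32)
+```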
+838
+01:02:28,710 --> 01:02:34,250
+So then, when we actually want to train this
+network — in Theano, remember, we actually
+
+839
+01:02:34,250 --> 01:02:37,610
+compiled this function object and then
+called the function over and over again;
+
+840
+01:02:37,610 --> 01:02:41,940
+the equivalent in TensorFlow is that
+we call the run method on the
+
+841
+01:02:41,940 --> 01:02:46,409
+session object and we tell it which
+outputs we want to compute. So
+
+842
+01:02:46,409 --> 01:02:50,349
+here we're telling it that we want to
+compute the train step output and the
+
+843
+01:02:50,349 --> 01:02:54,769
+loss output, and we're going to feed
+these numpy arrays into these inputs. So
+
+844
+01:02:54,769 --> 01:02:57,699
+this is kind of the same idea as Theano,
+except we're just calling the run method
+
+845
+01:02:57,699 --> 01:03:02,210
+rather than explicitly compiling
+a function, and in the process
+
+846
+01:03:02,210 --> 01:03:06,179
+of evaluating this train step object it'll
+actually make a gradient descent step on the
+
+847
+01:03:06,179 --> 01:03:10,690
+weights. So then we just run this thing
+in a loop, and the loss goes down,
+
+848
+01:03:10,690 --> 01:03:16,450
+and everything is beautiful.
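+
+(A sketch of the TensorFlow version just walked through, in the 1.x-era
+placeholder/session API — newer TF removed this style, and the sizes and loss
+here are made up.)
+
+```python
+import numpy as np
+import tensorflow as tf
+
+x = tf.placeholder(tf.float32, shape=[None, 1000])   # input nodes in the graph
+y = tf.placeholder(tf.float32, shape=[None, 10])
+
+w1 = tf.Variable(tf.random_normal([1000, 100]))      # weights live in the graph
+w2 = tf.Variable(tf.random_normal([100, 10]))
+
+h = tf.nn.relu(tf.matmul(x, w1))
+scores = tf.matmul(h, w2)
+loss = tf.reduce_mean(tf.square(scores - y))         # stand-in loss
+
+# the optimizer adds whatever nodes it needs to minimize the loss
+train_step = tf.train.GradientDescentOptimizer(1e-5).minimize(loss)
+
+xx = np.random.randn(64, 1000).astype('float32')
+yy = np.random.randn(64, 10).astype('float32')
+
+with tf.Session() as sess:                           # the session actually runs it
+    sess.run(tf.global_variables_initializer())
+    for t in range(50):
+        loss_val, _ = sess.run([loss, train_step], feed_dict={x: xx, y: yy})
+```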
+849
+01:03:16,449 --> 01:03:20,519
+So one of the really cool things about
+TensorFlow is this thing called Tensor-
+
+850
+01:03:20,519 --> 01:03:24,880
+Board, which lets you easily visualize what's
+going on in your network. So here is
+
+851
+01:03:24,880 --> 01:03:29,150
+pretty much the same code that we had
+before, except we've added these three
+
+852
+01:03:29,150 --> 01:03:34,280
+little lines — hopefully you can see them; if
+not, you'll have to trust me. So here
+
+853
+01:03:34,280 --> 01:03:37,200
+we're computing a scalar summary of the
+loss, and that's giving us a new symbolic
+
+854
+01:03:37,199 --> 01:03:40,929
+variable, the loss summary; and we're computing
+histogram summaries of the weight matrices
+
+855
+01:03:40,929 --> 01:03:46,049
+w1 and w2, also giving us new
+symbolic variables w1_hist and w2_
+
+856
+01:03:46,050 --> 01:03:51,390
+hist. Now we're getting another
+symbolic variable, called merged, that
+
+857
+01:03:51,389 --> 01:03:54,349
+merges all those summaries
+together using some magic I don't
+
+858
+01:03:54,349 --> 01:03:58,929
+understand, and we're getting this
+summary writer object that we can use to
+
+859
+01:03:58,929 --> 01:04:03,000
+actually dump those summaries to
+disk. And now in our loop, when we're
+
+860
+01:04:03,000 --> 01:04:06,570
+actually running the network,
+we tell it to evaluate the
+
+861
+01:04:06,570 --> 01:04:10,460
+training step and the loss like before,
+and also this merged summary object.
+
+862
+01:04:10,460 --> 01:04:14,190
+In the process of evaluating the
+merged summary object, it'll compute
+
+863
+01:04:14,190 --> 01:04:17,690
+histograms of the weights
+and produce those summaries,
+
+864
+01:04:17,690 --> 01:04:22,019
+and then we tell our writer to actually
+add the summaries — I guess that's where
+
+865
+01:04:22,019 --> 01:04:26,610
+the writing to disk happens. So once
+you run this thing, while
+
+866
+01:04:26,610 --> 01:04:28,890
+it's running it's sort of
+constantly streaming all this
+
+867
+01:04:28,889 --> 01:04:33,069
+information about what's going on in
+your network to disk, and then you just
+
+868
+01:04:33,070 --> 01:04:37,480
+start up this web server that ships
+with TensorFlow, TensorBoard, and you get
+
+869
+01:04:37,480 --> 01:04:41,420
+these beautiful visualizations of
+what's going on in your network.
+
+870
+01:04:41,420 --> 01:04:42,539
+So here on the left,
+
+871
+01:04:42,539 --> 01:04:46,230
+remember, we were getting a
+scalar summary of the loss, so this
+
+872
+01:04:46,230 --> 01:04:49,360
+actually shows that the loss was going
+down — I mean, it was a big
+
+873
+01:04:49,360 --> 01:04:52,760
+network and a small dataset, but that
+means everything is working — and this
+
+874
+01:04:52,760 --> 01:04:56,860
+over here on the right-hand side is showing
+you histograms over time of the
+
+875
+01:04:56,860 --> 01:05:00,900
+distributions of the values in your
+weight matrices. So this stuff is
+
+876
+01:05:00,900 --> 01:05:04,579
+really cool, and I think this is a
+really beautiful debugging tool.
+
+877
+01:05:04,579 --> 01:05:09,289
+When I've been working on
+projects in Torch, I've written this
+
+878
+01:05:09,289 --> 01:05:11,250
+kind of stuff myself by hand,
+
+879
+01:05:11,250 --> 01:05:14,900
+just kind of dumping JSON blobs out of
+Torch and then writing my own custom
+
+880
+01:05:14,900 --> 01:05:18,369
+visualizers to view these
+kinds of statistics, because they're
+
+881
+01:05:18,369 --> 01:05:21,609
+really useful. With TensorBoard you
+don't have to write any of that yourself:
+
+882
+01:05:21,610 --> 01:05:25,019
+you just add a couple of lines of code to
+your training script, run their thing, and
+
+883
+01:05:25,019 --> 01:05:27,489
+you get all these beautiful
+visualizations to help your debugging.
+
+884
+01:05:27,489 --> 01:05:35,059
+TensorBoard can also help
+you visualize what your network
+
+885
+01:05:35,059 --> 01:05:39,820
+structure looks like. So here we've
+annotated our variables with these names,
+
+886
+01:05:39,820 --> 01:05:43,510
+and now when we're doing the forward
+pass we can actually scope some of the
+
+887
+01:05:43,510 --> 01:05:47,450
+computations under a namespace, and that
+sort of lets us group together
+
+888
+01:05:47,449 --> 01:05:48,949
+computations that
+
+889
+01:05:48,949 --> 01:05:52,519
+should belong together semantically.
+Other than that it's the
+
+890
+01:05:52,519 --> 01:05:56,949
+same thing that we saw before, and now if
+we run this network and load up Tensor-
+
+891
+01:05:56,949 --> 01:06:00,909
+Board, we can actually get this
+beautiful visualization of
+
+892
+01:06:00,909 --> 01:06:04,789
+what our network actually looks like, and
+we can actually click around and see
+
+893
+01:06:04,789 --> 01:06:07,820
+what's going on with the scores, which
+really helps debug what's going on inside
+
+894
+01:06:07,820 --> 01:06:12,170
+this network. Here you see these
+loss and scores nodes;
+
+895
+01:06:12,170 --> 01:06:15,030
+these are the semantic namespaces that
+we defined during the forward pass,
+
+896
+01:06:15,030 --> 01:06:18,940
+and if we click on the scores, for
+example, it opens up and lets us see all
+
+897
+01:06:18,940 --> 01:06:22,679
+the operations that happen inside the
+computational graph at that
+
+898
+01:06:22,679 --> 01:06:28,108
+node. So I thought this was really cool:
+it lets you really easily debug
+
+899
+01:06:28,108 --> 01:06:31,039
+what's going on inside your network while
+it's running, without having to write any
+
+900
+01:06:31,039 --> 01:06:39,300
+of that visualization code yourself.
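+
+(Continuing the TensorFlow sketch above with the TensorBoard hooks just
+described; these are the TF 1.x summary names — the release at the time of
+this lecture used slightly different ones like tf.scalar_summary.)
+
+```python
+loss_summ = tf.summary.scalar('loss', loss)      # scalar summary of the loss
+w1_hist = tf.summary.histogram('w1', w1)         # histogram summaries of the
+w2_hist = tf.summary.histogram('w2', w2)         # weight matrices
+merged = tf.summary.merge_all()                  # merge all summaries together
+
+writer = tf.summary.FileWriter('/tmp/logs')      # dumps summaries to disk
+
+with tf.Session() as sess:
+    sess.run(tf.global_variables_initializer())
+    for t in range(50):
+        summary, _ = sess.run([merged, train_step], feed_dict={x: xx, y: yy})
+        writer.add_summary(summary, t)
+# then run: tensorboard --logdir=/tmp/logs
+```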
+901
+01:06:39,300 --> 01:06:42,750
+So TensorFlow does have multi-GPU support:
+it has data parallelism like you might
+
+902
+01:06:42,750 --> 01:06:45,809
+expect. I'd like to point out that
+this distribution part is probably one of the
+
+903
+01:06:45,809 --> 01:06:50,460
+other major selling points of Tensor-
+Flow: it can try to actually
+
+904
+01:06:50,460 --> 01:06:53,338
+distribute the computational graph in
+different ways across different devices,
+
+905
+01:06:53,338 --> 01:06:57,828
+and actually place that graph
+smartly to minimize communication
+
+906
+01:06:57,829 --> 01:07:02,839
+overhead and so on. So one thing that you
+can do is data parallelism, where you
+
+907
+01:07:02,838 --> 01:07:05,559
+just split your minibatch across
+different devices and run each one
+
+908
+01:07:05,559 --> 01:07:08,409
+forward and backward, and then either
+sum the gradients to do
+
+909
+01:07:08,409 --> 01:07:12,068
+synchronous distributed training, or
+just make asynchronous updates to your
+
+910
+01:07:12,068 --> 01:07:16,730
+parameters and do asynchronous training.
+The white paper claims you can
+
+911
+01:07:16,730 --> 01:07:21,300
+do both of these things in TensorFlow,
+but I didn't try it out. You can
+
+912
+01:07:21,300 --> 01:07:25,000
+also actually do model parallelism in
+TensorFlow as well, which lets you
+
+913
+01:07:25,000 --> 01:07:27,829
+split up the same model and compute
+different parts of the same model on
+
+914
+01:07:27,829 --> 01:07:32,190
+different devices. So here's an example:
+one place where that might be useful is
+
+915
+01:07:32,190 --> 01:07:36,510
+a multi-layer recurrent network, where it
+might actually be a good idea to run
+
+916
+01:07:36,510 --> 01:07:39,900
+different layers of the network on
+different GPUs, because those things can
+
+917
+01:07:39,900 --> 01:07:42,838
+actually take a lot of memory. So that's
+the type of thing that you can
+
+918
+01:07:42,838 --> 01:07:47,599
+do in TensorFlow
+without too much pain.
+
+919
+01:07:47,599 --> 01:07:51,599
+TensorFlow is also the only one of these
+frameworks that can run in distributed
+
+920
+01:07:51,599 --> 01:07:56,000
+mode: not just distributed across one machine
+and multiple GPUs, but actually
+
+921
+01:07:56,000 --> 01:07:58,309
+distributing the training of a model
+across many machines.
+
+922
+01:07:58,309 --> 01:08:04,709
+The caveat here is that that part is
+not open source yet; as of today,
+
+923
+01:08:04,708 --> 01:08:08,328
+the open-source release of TensorFlow
+can only do single-machine multi-GPU
+
+924
+01:08:08,329 --> 01:08:13,890
+training, but hopefully soon that part
+will be released, which would be really
+
+925
+01:08:13,889 --> 01:08:16,500
+cool. Right, so here
+
+926
+01:08:16,500 --> 01:08:22,069
+the idea is that TensorFlow is
+aware of communication costs, both
+
+927
+01:08:22,069 --> 01:08:26,489
+between GPU and CPU but also between
+different machines on the network, so
+
+928
+01:08:26,488 --> 01:08:30,118
+that it can try to smartly distribute
+the computational graph across different
+
+929
+01:08:30,118 --> 01:08:33,750
+machines, and across different devices
+within those machines, to compute
+
+930
+01:08:33,750 --> 01:08:37,649
+everything as efficiently as possible.
+So that's, I think, really cool, and
+
+931
+01:08:37,649 --> 01:08:41,629
+that's something the other
+frameworks just can't do right now. One
+
+932
+01:08:41,630 --> 01:08:46,409
+pain point with TensorFlow is pretrained
+models: I did a thorough
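+
+(A hedged sketch of manual device placement for model parallelism in TF 1.x,
+reusing names from the earlier sketch; whether splitting layers like this
+actually helps depends entirely on the model.)
+
+```python
+with tf.device('/gpu:0'):
+    h = tf.nn.relu(tf.matmul(x, w1))   # first layer on one GPU
+with tf.device('/gpu:1'):
+    scores = tf.matmul(h, w2)          # second layer on another
+```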
+933
+01:08:46,408 --> 01:08:51,448
+Google search, and the only thing I could
+come up with was an Inception model, a
+
+934
+01:08:51,448 --> 01:08:56,028
+pretrained Inception model, but it's only
+accessible through this Android example,
+
+935
+01:08:56,029 --> 01:08:59,569
+this Android demo. So that's something
+I would have expected to have
+
+936
+01:08:59,569 --> 01:09:04,219
+clearer documentation, but at least
+you have that one pretrained model.
+
+937
+01:09:04,219 --> 01:09:09,109
+Other than that, I'm not really
+aware of other pretrained models
+
+938
+01:09:09,109 --> 01:09:12,109
+in TensorFlow, but maybe
+they're out there and I
+
+939
+01:09:12,109 --> 01:09:13,230
+just don't know about them.
+
+940
+01:09:13,229 --> 01:09:19,729
+[someone says no] So I googled correctly.
+So, TensorFlow pros and cons,
+
+941
+01:09:19,729 --> 01:09:23,689
+again from my quick one-day experiment:
+it's really good because it's Python and
+
+942
+01:09:23,689 --> 01:09:27,928
+numpy, that's really cool. Like Theano, it
+has this idea of computational graphs,
+
+943
+01:09:27,929 --> 01:09:32,289
+which I think is super powerful, and it
+actually takes this idea of
+
+944
+01:09:32,289 --> 01:09:35,948
+computational graphs even farther than
+Theano, really; things like
+
+945
+01:09:35,948 --> 01:09:40,000
+checkpointing and distributing across
+devices all end up as just nodes
+
+946
+01:09:40,000 --> 01:09:46,380
+inside the computational graph for Tensor-
+Flow, and that's really cool. It also claims
+
+947
+01:09:46,380 --> 01:09:49,520
+to have much faster compile times than
+Theano — I've heard horror
+
+948
+01:09:49,520 --> 01:09:53,670
+stories about neural Turing machines
+taking half an hour to compile, so maybe
+
+949
+01:09:53,670 --> 01:09:59,219
+that should be faster in TensorFlow.
+And as we saw, TensorBoard looks
+
+950
+01:09:59,219 --> 01:10:03,369
+awesome — that looks amazing, I
+want to use that everywhere.
+
+951
+01:10:03,369 --> 01:10:07,340
+It has really cool data and model
+parallelism, I think much more advanced
+
+952
+01:10:07,340 --> 01:10:11,079
+than the other frameworks, although the
+distributed stuff is still secret sauce
+
+953
+01:10:11,079 --> 01:10:15,689
+at Google, but hopefully it'll come out
+to the rest of us eventually. But I guess,
+
+954
+01:10:15,689 --> 01:10:19,989
+as was just said, it's maybe even the
+scariest codebase to actually dig into
+
+955
+01:10:19,989 --> 01:10:24,409
+and understand what's happening under the
+hood. So at least my fear about Tensor-
+
+956
+01:10:24,409 --> 01:10:29,010
+Flow is that if you want to write some kind
+of crazy weird imperative code, and you
+
+957
+01:10:29,010 --> 01:10:32,690
+cannot easily work it into their
+computational graph abstraction, then
+
+958
+01:10:32,689 --> 01:10:38,159
+it seems like you could be in a lot of
+trouble; whereas maybe in Torch
+
+959
+01:10:38,159 --> 01:10:40,659
+you can just write whatever imperative
+code you want inside the forward and
+
+960
+01:10:40,659 --> 01:10:44,659
+backward passes of your own custom
+layers. That seems like the biggest
+
+961
+01:10:44,659 --> 01:10:49,979
+worry for me about working with
+TensorFlow in practice. Another
+
+962
+01:10:49,979 --> 01:10:52,959
+kind of awkward thing is the lack of
+pretrained models, so that's kind
+
+963
+01:10:52,960 --> 01:11:12,239
+of gross.
+
+964
+01:11:12,239 --> 01:11:22,019
+[question] Even installing it on Ubuntu
+was a little bit painful: they claimed to have a
+965
+01:11:22,020 --> 01:11:25,680
+Python wheel that you can just download
+and install with pip, but it broke and I
+
+966
+01:11:25,680 --> 01:11:29,150
+had to change the filename manually to
+get it to install, and then they had a
+
+967
+01:11:29,149 --> 01:11:32,479
+broken dependency that I had to update
+manually, and like download some random
+
+968
+01:11:32,479 --> 01:11:36,759
+zip file and unpack it and copy some
+random files around. It eventually
+
+969
+01:11:36,760 --> 01:11:41,520
+worked, but installation was tough, even
+on my own machine where I have sudo,
+
+970
+01:11:41,520 --> 01:11:47,400
+so they should get their act together
+on that. So I put together this quick
+
+971
+01:11:47,399 --> 01:11:51,529
+overview table that kind of covers what
+I think people care about, the major
+
+972
+01:11:51,529 --> 01:11:55,529
+points of comparison between the frameworks:
+the languages, what kind of pretrained models are
+
+973
+01:11:55,529 --> 01:11:56,210
+available...
+
+974
+01:11:56,210 --> 01:12:05,029
+[student question]
+
+975
+01:12:05,029 --> 01:12:09,988
+The question is which of these
+support Windows. I'm sorry, but I don't
+
+976
+01:12:09,988 --> 01:12:11,769
+know.
+
+977
+01:12:11,770 --> 01:12:16,830
+I think you're on your own...
+
+978
+01:12:16,829 --> 01:12:24,439
+Ah — you can use AWS from Windows. OK.
+
+979
+01:12:24,439 --> 01:12:29,359
+OK, so I put together this quick
+comparison chart between the
+
+980
+01:12:29,359 --> 01:12:32,198
+frameworks that I think covers some of
+the major bullet points that people care
+
+981
+01:12:32,198 --> 01:12:37,460
+about: what the languages are, whether
+they have pretrained models, what
+
+982
+01:12:37,460 --> 01:12:41,300
+kind of parallelism you get, how readable
+the source code is, and whether
+
+983
+01:12:41,300 --> 01:12:47,029
+they're good for RNNs. So I had a couple
+of use cases — let's see, we've got, holy
+
+984
+01:12:47,029 --> 01:12:52,939
+crap, we got through 250 slides and we still
+have a few minutes left, so let's
+
+985
+01:12:52,939 --> 01:12:56,710
+play a little game. Suppose that all you
+wanted to do was extract AlexNet or VGG
+
+986
+01:12:56,710 --> 01:12:58,619
+features — which framework would you pick?
+
+987
+01:12:58,619 --> 01:13:06,969
+Yeah, me too. Let's say all we wanted to
+do was fine-tune AlexNet on some
+
+988
+01:13:06,969 --> 01:13:19,189
+new data? Yeah. Let's say we want to do
+image captioning with fine-tuning? OK, I
+
+989
+01:13:19,189 --> 01:13:22,889
+heard a good distribution of answers. So this
+is my thought process — I'm not saying this is
+
+990
+01:13:22,890 --> 01:13:26,289
+the right answer, but the way I think
+about this is that for this problem we
+
+991
+01:13:26,289 --> 01:13:30,969
+need pretrained models; for pretrained models
+we're looking at Caffe or Torch or Lasagne. We
+
+992
+01:13:30,969 --> 01:13:36,239
+need RNNs, so Caffe is pretty much
+out — even though people have
+
+993
+01:13:36,238 --> 01:13:39,869
+implemented this stuff in Caffe, it's just kind
+of painful. So I'd probably use Torch,
+
+994
+01:13:39,869 --> 01:13:44,869
+maybe Lasagne. What about semantic
+segmentation? We want to classify every
+
+995
+01:13:44,869 --> 01:13:49,880
+pixel, right? So here we want to read an
+input image, and instead of giving a
+
+996
+01:13:49,880 --> 01:13:57,900
+label to the whole image, we want
+to label every pixel independently. OK,
+
+997
+01:13:57,899 --> 01:14:01,969
+that's good. So again my thought process
+was that we need a pretrained model here,
+
+998
+01:14:01,969 --> 01:14:06,800
+most likely, and here we're talking
+about kind of a weird use case where we
+
+999
+01:14:06,800 --> 01:14:10,739
+might need to define some of our own
+layers. So if this layer happens to
+
+1000
+01:14:10,738 --> 01:14:14,738
+exist in Caffe, it would be a good fit;
+otherwise we'd have to write it ourself, and
+
+1001
+01:14:14,738 --> 01:14:23,109
+writing this thing ourself seems least
+painful in Torch. How about object detection? No
+
+1002
+01:14:23,109 --> 01:14:24,329
+idea?
+
+1003
+01:14:24,329 --> 01:14:30,750
+Yes, OK, Caffe is an idea. So my thought
+process: again, we're looking at pretrained
+
+1004
+01:14:30,750 --> 01:14:33,149
+models, so we need Caffe,
+
+1005
+01:14:33,149 --> 01:14:38,069
+Torch, or Lasagne. Actually, with
+detection you could need a lot of
+
+1006
+01:14:38,069 --> 01:14:41,609
+funky imperative code that might be
+possible to put in a computational
+
+1007
+01:14:41,609 --> 01:14:47,799
+graph, but that seems scary to me, so Caffe
+plus Python is one choice; some of the
+
+1008
+01:14:47,800 --> 01:14:52,529
+papers we talked about actually went
+this route. And I've actually done a
+
+1009
+01:14:52,529 --> 01:14:56,939
+similar project like this, and I chose
+Torch and it worked out well for me. But
+
+1010
+01:14:56,939 --> 01:14:59,809
+if you want to do language modeling, like
+you want to do funky RNNs and you
+
+1011
+01:14:59,810 --> 01:15:06,270
+want to play with the recurrence
+relation, what do you guys think? Yeah, I
+
+1012
+01:15:06,270 --> 01:15:09,550
+would actually not use Torch for this at
+all. So here, if we just want to do
+
+1013
+01:15:09,550 --> 01:15:13,650
+language modeling and do funky kinds of
+recurrence relationships, then we're not
+
+1014
+01:15:13,649 --> 01:15:17,109
+talking about images at all, this is just
+on text, so we don't need any pretrained
+
+1015
+01:15:17,109 --> 01:15:22,309
+models, and we really want to play with
+this recurrence relationship and easily
+
+1016
+01:15:22,310 --> 01:15:25,430
+work with recurrent networks. So there
+I think that maybe Theano or
+
+1017
+01:15:25,430 --> 01:15:32,570
+TensorFlow might be a good choice.
+What if you want to implement batch norm?
+
+1018
+01:15:32,569 --> 01:15:39,769
+OK, OK, slides, sorry about that. Right, so
+here, if you don't want to derive
+
+1019
+01:15:39,770 --> 01:15:42,230
+the gradient yourself, you could rely on these
+
+1020
+01:15:42,229 --> 01:15:46,899
+computational graph things like Theano or
+TensorFlow. But because of the way those things
+
+1021
+01:15:46,899 --> 01:15:50,089
+work, as you saw on the homework, for batch
+norm you can actually simplify the
+
+1022
+01:15:50,090 --> 01:15:54,900
+gradient quite a lot, and I'm not sure if
+these computational graph frameworks
+
+1023
+01:15:54,899 --> 01:15:57,589
+would correctly simplify the gradient
+down to this most efficient form.
+
+1024
+01:15:57,590 --> 01:16:09,489
+Question?
+
+1025
+01:16:09,488 --> 01:16:13,009
+I think the question is how
+easy is it to combine
+
+1026
+01:16:13,010 --> 01:16:18,860
+like a Torch model with a Theano model,
+and I think it seems painful. But at
+
+1027
+01:16:18,859 --> 01:16:22,819
+least in Theano you can use Lasagne
+to access pretrained models, so
+
+1028
+01:16:22,819 --> 01:16:26,498
+hooking together a Lasagne model with
+something else I think theoretically
+
+1029
+01:16:26,498 --> 01:16:31,748
+maybe should be easier. So here, if you
+have some really, really
+
+1030
+01:16:31,748 --> 01:16:35,429
+good knowledge about exactly how you
+want the backward pass to be computed,
+
+1031
+01:16:35,429 --> 01:16:38,179
+and you want to implement it yourself to
+be efficient, then you probably want to use
+
+1032
+01:16:38,179 --> 01:16:43,300
+Torch; you can just implement that backward
+pass yourself. So my recommendations on
+
+1033
+01:16:43,300 --> 01:16:46,949
+frameworks are that if you just want to do
+feature extraction, or maybe
+
+1034
+01:16:46,948 --> 01:16:51,248
+fine-tuning of existing models, or just
+transfer learning on a vanilla, straightforward
+
+1035
+01:16:51,248 --> 01:16:54,929
+task, then Caffe is probably the right way
+to go; it's really easy to use, you don't
+
+1036
+01:16:54,929 --> 01:16:58,649
+have to write any code. If you want to
+work with pretrained models but
+
+1037
+01:16:58,649 --> 01:17:02,738
+maybe do weird stuff with pretrained models
+and not just fine-tune them, you
+
+1038
+01:17:02,738 --> 01:17:07,209
+might have a better time in Lasagne or
+Torch, where it's easier to kind of
+
+1039
+01:17:07,210 --> 01:17:11,328
+mess with the structure of pretrained
+models. If you really,
+
+1040
+01:17:11,328 --> 01:17:14,788
+really want to write your own layers for
+whatever reason, and you don't think you
+
+1041
+01:17:14,788 --> 01:17:18,788
+can easily fit them into these computational
+graphs, then you probably should use Torch. If
+
+1042
+01:17:18,788 --> 01:17:22,948
+you really want to use RNNs, and
+maybe other types of fancy things that
+
+1043
+01:17:22,948 --> 01:17:26,138
+depend on the computational graph, then
+probably maybe Theano or
+
+1044
+01:17:26,139 --> 01:17:30,090
+TensorFlow. Also, if you have a
+gigantic model and you need to
+
+1045
+01:17:30,090 --> 01:17:33,449
+distribute it across an entire cluster, and
+you have access to Google's internal
+
+1046
+01:17:33,448 --> 01:17:36,169
+code base, then you should use TensorFlow,
+
+1047
+01:17:36,170 --> 01:17:39,989
+although hopefully, like I said, that
+part will be released for the rest of us
+
+1048
+01:17:39,988 --> 01:17:44,889
+soon. Also, if you want to use
+TensorBoard, you've got to use TensorFlow. So
+
+1049
+01:17:44,890 --> 01:17:48,810
+that's pretty much my overview,
+my quick whirlwind tour of all the
+
+1050
+01:17:48,810 --> 01:17:58,210
+frameworks. So, any last-minute
+questions? Question about
+
+1051
+01:17:58,210 --> 01:18:02,630
+speed: so there's actually a really nice
+page that compares the
+
+1052
+01:18:02,630 --> 01:18:06,039
+benchmark speed of all the different
+frameworks, and right now the one that
+
+1053
+01:18:06,039 --> 01:18:10,488
+wins is none of these. The one that wins
+is this thing called Neon from Nervana
+
+1054
+01:18:10,488 --> 01:18:15,049
+Systems. So these guys have actually
+written, these guys are crazy, they
+
+1055
+01:18:15,050 --> 01:18:20,119
+actually wrote their own custom
+assembler for NVIDIA hardware. They
+
+1056
+01:18:20,119 --> 01:18:22,448
+were not happy with, like, NVIDIA's
+
+1057
+01:18:22,448 --> 01:18:26,500
+toolchain, so they reverse engineered the
+hardware, wrote their own assembler, and then
+
+1058
+01:18:26,500 --> 01:18:30,948
+implemented all these kernels in
+assembly themselves. So these guys are
+
+1059
+01:18:30,948 --> 01:18:35,859
+crazy, and their stuff is really, really
+fast. So their
+
+1060
+01:18:35,859 --> 01:18:39,309
+stuff is actually the fastest right now,
+but I've
+
+1061
+01:18:39,310 --> 01:18:42,510
+never really used their framework myself,
+and I think it's a little less commonly
+
+1062
+01:18:42,510 --> 01:18:47,010
+used. Although, for the ones that are
+using cuDNN, the speed is roughly the
+
+1063
+01:18:47,010 --> 01:18:52,030
+same right now. I think TensorFlow is quite
+a bit slower than the others for some
+
+1064
+01:18:52,029 --> 01:18:55,609
+silly reasons that I think will be
+cleaned up in subsequent releases, but at
+
+1065
+01:18:55,609 --> 01:18:58,729
+least fundamentally there's no reason
+it should be
+
+1066
+01:18:58,729 --> 01:19:04,209
+slower than the others.
+
+1067
+01:19:04,210 --> 01:19:07,319
+You people are packing up already,
+
+1068
+01:19:07,319 --> 01:19:24,279
+alright.
+
+1069
+01:19:24,279 --> 01:19:27,198
+That's actually not crazy; there were
+quite a few teams last
+
+1070
+01:19:27,198 --> 01:19:29,738
+year that actually used assignment code
+for their projects, and it was fine.
+
+1071
+01:19:29,738 --> 01:19:34,658
+Yeah, I should also mention that there
+are other frameworks;
+
+1072
+01:19:34,658 --> 01:19:45,359
+I just think these are the
+most common ones. Question?
+
+1073
+01:19:45,359 --> 01:19:52,299
+So the question is about graphing in
+Torch. Torch actually has an iPython
+
+1074
+01:19:52,300 --> 01:19:56,770
+kernel, you can actually use iTorch
+notebooks, that's kind of cool, and
+
+1075
+01:19:56,770 --> 01:20:00,150
+you can actually do some simple
+graphing in iTorch notebooks. But in
+
+1076
+01:20:00,149 --> 01:20:04,899
+practice what I usually do is
+run my Torch model, dump the data
+
+1077
+01:20:04,899 --> 01:20:09,899
+out to JSON or HDF5, and visualize it in
+Python, which is a little bit
+
+1078
+01:20:09,899 --> 01:20:19,359
+painful, but you can just get the job
+done.
+
+1079
+01:20:19,359 --> 01:20:23,309
+The question is whether TensorBoard
+lets you dump the raw data so you can plot
+
+1080
+01:20:23,310 --> 01:20:28,300
+it yourself. They're actually
+dumping all this stuff into some log
+
+1081
+01:20:28,300 --> 01:20:33,050
+files in a temp directory. I'm not sure
+how easy those are to parse, but you could
+
+1082
+01:20:33,050 --> 01:20:45,900
+try; it could be easy or not, I'm not sure.
+Question: the question is whether there
+
+1083
+01:20:45,899 --> 01:20:49,899
+are other third-party tools, sort of like
+TensorBoard, for monitoring networks. There
+
+1084
+01:20:49,899 --> 01:20:53,269
+might be some out there, but I've never
+really used them; I just wrote my own in
+
+1085
+01:20:53,270 --> 01:20:58,159
+the past. Any other questions?
+
+1086
+01:20:58,158 --> 01:21:00,319
+Alright, I think that's it.
+
diff --git a/captions/En/Lecture13_en.srt b/captions/En/Lecture13_en.srt
new file mode 100644
index 00000000..bd3e4fa5
--- /dev/null
+++ b/captions/En/Lecture13_en.srt
@@ -0,0 +1,4558 @@
+1
+00:00:00,000 --> 00:00:06,878
+So, our administrative points for
+today: assignment 3 is due tonight. So,
+
+2
+00:00:06,878 --> 00:00:14,399
+who's done? It's going to be easier than
+assignment 2. OK, that's good. Hopefully that
+
+3
+00:00:14,400 --> 00:00:18,320
+gives you more time to work on your
+projects. Also remember your
+
+4
+00:00:18,320 --> 00:00:22,500
+milestones were turned
+in last week, so we're in the
+
+5
+00:00:22,500 --> 00:00:25,028
+process of looking through the
+milestones to make sure those are OK, and
+
+6
+00:00:25,028 --> 00:00:28,609
+also we're working on assignment 2
+grading, so we should have that done
+
+7
+00:00:28,609 --> 00:00:32,289
+sometime this week or early next week.
+
+8
+00:00:32,289 --> 00:00:36,329
+Last time we had a whirlwind tour of
+all the common software
+
+9
+00:00:36,329 --> 00:00:40,058
+packages that people use for deep
+learning, and we saw a lot of code on
+
+10
+00:00:40,058 --> 00:00:43,468
+slides and a lot of stepping through
+code, and hopefully you found it
+
+11
+00:00:43,469 --> 00:00:48,730
+useful for your projects. Today we're
+going to talk about two other topics.
+
+12
+00:00:48,729 --> 00:00:53,308
+We're gonna talk about segmentation;
+within segmentation there are two
+
+13
+00:00:53,308 --> 00:00:57,488
+subproblems, semantic and instance
+segmentation. We're also going to talk
+
+14
+00:00:57,488 --> 00:01:01,509
+about soft attention, and within soft
+attention, again, there are sort of two
+
+15
+00:01:01,509 --> 00:01:07,069
+buckets that we've divided things into.
+But first, before we go into these
+
+16
+00:01:07,069 --> 00:01:12,849
+details, there was something else I
+wanted to bring up briefly. So this is the
+
+17
+00:01:12,849 --> 00:01:16,769
+ImageNet classification error; I think at
+this point in the class you've seen this
+
+18
+00:01:16,769 --> 00:01:23,079
+type of figure many times, right? So in
+2012 AlexNet crushed it, 2013 ZFNet,
+
+19
+00:01:23,079 --> 00:01:29,118
+more recently GoogLeNet, and later ResNet
+went and won the classification
+
+20
+00:01:29,118 --> 00:01:37,400
+challenge in 2015. But it turns out, as of
+today, there is a new ImageNet result:
+
+21
+00:01:37,400 --> 00:01:41,140
+this paper came out last night.
+
+22
+00:01:41,140 --> 00:01:48,609
+So Google actually now has state of the
+art on ImageNet with 3.08% top-5 error,
+
+23
+00:01:48,609 --> 00:01:55,560
+which is crazy, and the way they do this
+is with this thing that they call
+
+24
+00:01:55,560 --> 00:01:59,900
+Inception-v4. This is a little bit of a
+monster, so I don't want to go into too
+
+25
+00:01:59,900 --> 00:02:05,280
+much detail, but you can see that it's
+this really deep network that has these
+
+26
+00:02:05,280 --> 00:02:11,150
+repeated modules. So here there's the
+stem; the stem is this guy over here. A
+
+27
+00:02:11,150 --> 00:02:14,789
+couple interesting things to point out
+about this architecture: they actually
+
+28
+00:02:14,789 --> 00:02:18,979
+use some valid convolutions, which
+means they have no padding, so that makes
+
+29
+00:02:18,979 --> 00:02:22,229
+all the math more complicated, but
+they're smart and figured things out.
+
+30
+00:02:22,229 --> 00:02:27,299
+They also have an interesting feature
+here: they actually have, in parallel,
+
+31
+00:02:27,300 --> 00:02:31,459
+a strided convolution and also max pooling,
+so they kind of do these two operations
+
+32
+00:02:31,459 --> 00:02:34,900
+in parallel to downsample images and
+then concatenate the results. And
+
+33
+00:02:34,900 --> 00:02:39,909
+another thing is they're really
+going all out on these efficient
+
+34
+00:02:39,909 --> 00:02:43,389
+convolution tricks that we talked about
+a couple lectures ago. So as you can see,
+
+35
+00:02:43,389 --> 00:02:47,518
+they actually have these asymmetric
+filters, like one by seven and seven by
+
+36
+00:02:47,519 --> 00:02:51,750
+one convolutions, and they also make heavy
+use of these one-by-one convolutional
+
+37
+00:02:51,750 --> 00:02:56,449
+bottlenecks to reduce computational
+costs. So this is just the stem of the
+
+38
+00:02:56,449 --> 00:03:01,939
+network, and actually each of these
+parts is sort of different.
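To make the valid-convolution arithmetic above concrete, here is a minimal Python sketch; this is not code from the paper, and the input resolution and channel counts are made-up examples. It shows how output sizes work with no padding, and why the parallel strided-convolution and max-pooling branches can be concatenated:

```python
# Minimal sketch (illustrative numbers only): spatial arithmetic for
# valid (no-padding) convolutions and a parallel conv/pool downsample.

def valid_size(n, k, stride=1):
    # valid convolution/pooling: output = floor((n - k) / stride) + 1
    return (n - k) // stride + 1

n = 299                     # hypothetical input resolution
n = valid_size(n, 3, 2)     # 3x3 conv, stride 2 -> 149
n = valid_size(n, 3, 1)     # 3x3 conv, stride 1 -> 147

# Parallel downsampling: a stride-2 conv branch and a stride-2 max-pool
# branch shrink the input to the same spatial size, so their outputs can
# be concatenated along the channel dimension.
conv_branch = valid_size(n, 3, 2)
pool_branch = valid_size(n, 3, 2)   # pooling obeys the same size formula
assert conv_branch == pool_branch   # both 73 here
out_channels = 96 + 64              # conv filters + pooled channels (made up)
```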
+39
+00:03:01,939 --> 00:03:07,769
+So they've got four of these Inception
+modules, then this downsampling module, then seven of
+
+40
+00:03:07,769 --> 00:03:11,599
+these guys, and then another
+downsampling module, and then three more of
+
+41
+00:03:11,599 --> 00:03:16,889
+these guys, and then finally they have
+dropout and a fully connected layer
+
+42
+00:03:16,889 --> 00:03:20,919
+for the class labels. Another thing to
+point out is, again, there are no sort of
+
+43
+00:03:20,919 --> 00:03:24,859
+fully connected hidden layers here; they just
+have this global average pooling to compute
+
+44
+00:03:24,860 --> 00:03:29,320
+the final feature vector. And another
+cool thing they did in this paper was
+
+45
+00:03:29,319 --> 00:03:34,900
+Inception-ResNet: they propose this
+residual version of the Inception
+
+46
+00:03:34,900 --> 00:03:39,579
+architecture, which is also pretty big
+and scary. The stem is the same as before,
+
+47
+00:03:39,579 --> 00:03:43,950
+and now these repeated
+Inception blocks that they repeat
+
+48
+00:03:43,949 --> 00:03:48,289
+throughout the network actually
+have these residual connections. So
+
+49
+00:03:48,289 --> 00:03:51,409
+that's kind of cool; they're kind of
+jumping on this residual idea
+
+50
+00:03:51,409 --> 00:03:55,609
+and now improved state of the art on
+ImageNet. So again, they have many repeated
+
+51
+00:03:55,610 --> 00:04:00,880
+modules, and when you add this thing all
+up it's about 75 layers, assuming I did
+
+52
+00:04:00,879 --> 00:04:07,939
+the math right last night. They
+also show that between their new
+
+53
+00:04:07,939 --> 00:04:12,680
+version 4 of Inception /
+GoogLeNet and the residual
+
+54
+00:04:12,680 --> 00:04:17,079
+version of GoogLeNet, actually
+both of them perform about the same. So
+
+55
+00:04:17,079 --> 00:04:22,909
+this is top-5 error as a function
+of epochs on ImageNet, and you can see that
+
+56
+00:04:22,910 --> 00:04:28,070
+the Inception-ResNet
+actually converges a bit faster, but
+
+57
+00:04:28,069 --> 00:04:33,180
+over time both of them sort of
+converge to about the same value. So
+
+58
+00:04:33,180 --> 00:04:38,340
+that's kind of interesting, that's
+kind of cool. Another thing that's kind
+
+59
+00:04:38,339 --> 00:04:42,369
+of interesting to point out is
+the raw numbers on the x-axis here:
+
+60
+00:04:42,370 --> 00:04:46,030
+these are epochs on ImageNet; these
+things are being trained for a hundred
+
+61
+00:04:46,029 --> 00:04:52,089
+and sixty epochs on ImageNet, so
+that's a lot of training time. But that's
+
+62
+00:04:52,089 --> 00:04:55,469
+enough of current events, and
+let's go back to our regularly scheduled
+
+63
+00:04:55,470 --> 00:05:02,710
+programming. So today, oh yeah, question?
+I don't know, I think it might be in the
+
+64
+00:05:02,709 --> 00:05:11,789
+paper, but I didn't read it carefully. Any
+other questions? The question is about
+
+65
+00:05:11,790 --> 00:05:16,600
+dropout, whether it's only in the last
+layer; I'm not sure, again, I didn't read the
+
+66
+00:05:16,600 --> 00:05:21,620
+paper too carefully yet, but
+the link is here, you should check it out.
+
+67
+00:05:21,620 --> 00:05:29,600
+OK, so today we're going to talk about
+two other topics that are
+
+68
+00:05:29,600 --> 00:05:33,970
+considered common things in research
+these days: those are segmentation,
+
+69
+00:05:33,970 --> 00:05:37,490
+which is this sort of classic computer
+vision topic, and also this idea of
+
+70
+00:05:37,490 --> 00:05:41,550
+attention, which I think has been a
+really popular thing to work on
+
+71
+00:05:41,550 --> 00:05:46,060
+in deep learning over the past year
+especially. So first we're gonna talk
+
+72
+00:05:46,060 --> 00:05:50,889
+about segmentation. You may
+remember this slide from a couple
+
+73
+00:05:50,889 --> 00:05:53,649
+lectures ago, when we talked about object
+detection; it was talking about
+
+74
+00:05:53,649 --> 00:05:58,000
+different tasks that people work on in
+computer vision. We spent a lot of
+
+75
+00:05:58,000 --> 00:06:02,259
+time in the class talking about
+classification, and back in that lecture we
+
+76
+00:06:02,259 --> 00:06:03,750
+talked about different models for
+
+77
+00:06:03,750 --> 00:06:08,339
+localization and for object detection,
+but today we're actually gonna focus on
+
+78
+00:06:08,339 --> 00:06:12,239
+this idea of segmentation that we
+skipped over in that previous
+
+79
+00:06:12,240 --> 00:06:18,189
+lecture. So within segmentation there
+are sort of two different subtasks that we
+
+80
+00:06:18,189 --> 00:06:21,870
+need to define, and
+people actually work on these things a
+
+81
+00:06:21,870 --> 00:06:26,389
+little bit separately. The first task is
+this idea called semantic segmentation.
+
+82
+00:06:26,389 --> 00:06:32,370
+So here we have an input
+image, and we have some fixed number of
+
+83
+00:06:32,370 --> 00:06:38,000
+classes, things like buildings and trees
+and ground and cow, and whatever kind of
+
+84
+00:06:38,000 --> 00:06:42,629
+semantic labels you want; usually you have
+some small fixed number of classes. Also,
+
+85
+00:06:42,629 --> 00:06:46,199
+typically you'll have some background
+class for things that don't fit
+
+86
+00:06:46,199 --> 00:06:51,360
+into these classes. And then the task is
+that we want to take as input an image,
+
+87
+00:06:51,360 --> 00:06:55,240
+and then we want to label every pixel in
+that image with one of these semantic
+
+88
+00:06:55,240 --> 00:06:59,850
+classes. So here we have taken this input
+image of these cows in the field,
+
+89
+00:06:59,850 --> 00:07:05,490
+and the ideal output is this image where,
+instead of RGB values, we actually
+
+90
+00:07:05,490 --> 00:07:11,228
+have one class label per pixel. We can do
+this on other images and maybe segment
+
+91
+00:07:11,228 --> 00:07:16,789
+out the trees and the sky and the road and
+the grass. So this type of task is pretty
+
+92
+00:07:16,790 --> 00:07:19,950
+cool; I think it gives you sort of a
+higher level of understanding of what's
+
+93
+00:07:19,949 --> 00:07:23,029
+going on in images, compared to just
+putting a single label on the whole
+
+94
+00:07:23,029 --> 00:07:28,668
+image. And this is actually a very old
+problem in computer vision
+
+95
+00:07:28,668 --> 00:07:32,649
+that predates sort of the deep learning
+revolution; this figure actually comes
+
+96
+00:07:32,649 --> 00:07:37,259
+from a computer vision paper back in
+2007 that didn't use any deep learning
+
+97
+00:07:37,259 --> 00:07:43,728
+at all; people had other methods for this
+a couple years ago. The other task that
+
+98
+00:07:43,728 --> 00:07:48,949
+people work on is this thing. The
+thing to point out here is that this
+
+99
+00:07:48,949 --> 00:07:54,310
+first task is not aware of instances. So here,
+this image actually has four
+
+100
+00:07:54,310 --> 00:07:58,329
+cows: there are actually three cows
+standing up and one cow kinda lying on
+
+101
+00:07:58,329 --> 00:08:02,300
+the grass taking a nap. But here in this
+output it's not really clear how many
+
+102
+00:08:02,300 --> 00:08:07,560
+cows there are; these different cows'
+pixels actually just merge together. So here in
+
+103
+00:08:07,560 --> 00:08:11,540
+the output there is no notion that
+there are different cows.
+
+104
+00:08:11,540 --> 00:08:15,480
+In this output we're just labeling every
+pixel, so it's maybe not as informative
+
+105
+00:08:15,480 --> 00:08:20,009
+as you might like, and that could
+actually lead to some problems for some
+
+106
+00:08:20,009 --> 00:08:23,409
+downstream applications. So to overcome
+this,
+
+107
+00:08:23,410 --> 00:08:28,080
+people have also separately worked on this
+other problem called instance segmentation;
+
+108
+00:08:28,079 --> 00:08:32,039
+this also sometimes gets called
+simultaneous detection and segmentation.
+
+109
+00:08:32,039 --> 00:08:37,879
+So here the problem is again similar
+to before: we have some set of classes
+
+110
+00:08:37,879 --> 00:08:43,370
+that we're trying to recognize, and given
+an input image, we want to output all
+
+111
+00:08:43,370 --> 00:08:48,370
+instances of those classes, and for each
+instance we want to segment out the pixels
+
+112
+00:08:48,370 --> 00:08:52,970
+that belong to that instance. So here in
+this input image there are
+
+113
+00:08:52,970 --> 00:08:57,509
+actually three different people, the
+two parents and the kid, and now in
+
+114
+00:08:57,509 --> 00:09:00,860
+the output we actually distinguish
+between those different people in the
+
+115
+00:09:00,860 --> 00:09:05,279
+input image: those three
+people are now shown in different colors
+
+116
+00:09:05,279 --> 00:09:09,360
+to indicate different instances, and again,
+for each of those instances we're going
+
+117
+00:09:09,360 --> 00:09:14,009
+to label all the pixels in the input
+image that belong to that instance. So
+
+118
+00:09:14,009 --> 00:09:18,639
+these two tasks of semantic segmentation
+and instance segmentation, people actually
+
+119
+00:09:18,639 --> 00:09:22,409
+have worked on them a little bit
+separately. So first we're gonna talk
+
+120
+00:09:22,409 --> 00:09:27,269
+about some models for semantic
+segmentation. Remember, this is the
+
+121
+00:09:27,269 --> 00:09:30,399
+task where you want to just label all the
+pixels in the image and you don't care
+
+122
+00:09:30,399 --> 00:09:38,230
+about instances. So here the idea is
+actually pretty simple. Given some input
+
+123
+00:09:38,230 --> 00:09:43,269
+image, this is the same image with the
+cows, we're gonna take some little patch of
+
+124
+00:09:43,269 --> 00:09:48,720
+the input image and extract this patch,
+which sort of gives local information in the
+
+125
+00:09:48,720 --> 00:09:53,340
+image. Then we're gonna take this patch
+and we're gonna feed it through some
+
+126
+00:09:53,340 --> 00:09:57,230
+convolutional neural network; this could
+be any of the architectures that we've
+
+127
+00:09:57,230 --> 00:10:01,070
+talked about so far in the class. And now this
+
+128
+00:10:01,070 --> 00:10:04,890
+convolutional neural network will
+actually classify the center pixel of the
+
+129
+00:10:04,889 --> 00:10:10,080
+patch. So this neural network is just doing
+classification, that's something we know
+
+130
+00:10:10,080 --> 00:10:14,379
+how to do, so this thing is just going to
+say that the center pixel of this patch
+
+131
+00:10:14,379 --> 00:10:19,769
+actually is a cow.
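As a minimal sketch of this patch-based idea (not the paper's code; `classify_patch` is a hypothetical stand-in for a trained CNN that labels the center pixel of a fixed-size patch):

```python
import numpy as np

def classify_patch(patch):
    # placeholder: a real model would run a convnet forward pass here
    return 0  # class index, e.g. "background"

def segment_naive(img, patch=101, n_classes=10):
    H, W = img.shape[:2]
    r = patch // 2
    # pad so every pixel has a full patch around it
    padded = np.pad(img, ((r, r), (r, r), (0, 0)), mode='reflect')
    labels = np.zeros((H, W), dtype=np.int64)
    for y in range(H):          # one CNN forward pass per pixel:
        for x in range(W):      # this is why it is so expensive naively
            labels[y, x] = classify_patch(padded[y:y + patch, x:x + patch])
    return labels

labels = segment_naive(np.zeros((32, 32, 3)))
```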
+132
+00:10:19,769 --> 00:10:20,710
+Then we can imagine taking this network
+
+133
+00:10:20,710 --> 00:10:26,019
+that works on patches and labels the center
+pixel, and we just run it over the entire image, and that
+
+134
+00:10:26,019 --> 00:10:33,269
+will give us a label for each pixel in
+the image. So this actually is a very
+
+135
+00:10:33,269 --> 00:10:36,699
+expensive operation, right? Because now
+there are many, many patches in the image,
+
+136
+00:10:36,700 --> 00:10:40,120
+and it would be super expensive to
+run this network independently for all
+
+137
+00:10:40,120 --> 00:10:44,139
+of them. So in practice, people use the
+same trick that we saw in object
+
+138
+00:10:44,139 --> 00:10:48,639
+detection, where you run this thing
+fully convolutionally and get all the
+
+139
+00:10:48,639 --> 00:10:54,220
+outputs for the whole image all at once.
+But the problem here is that if your
+
+140
+00:10:54,220 --> 00:10:58,879
+convolutional network contains any kind of
+downsampling, either through pooling or
+
+141
+00:10:58,879 --> 00:11:02,899
+through strided convolutions, then
+your output image will
+
+142
+00:11:02,899 --> 00:11:07,139
+actually have a smaller spatial size than
+your input image, so that's
+
+143
+00:11:07,139 --> 00:11:09,929
+something that people need to work
+around when they're using this type of
+
+144
+00:11:09,929 --> 00:11:14,629
+approach. So, any questions on this
+kind of basic setup for semantic
+
+145
+00:11:14,629 --> 00:11:28,208
+segmentation? Yeah.
+
+146
+00:11:28,208 --> 00:11:32,979
+The question is whether the patch-based
+thing just doesn't give you enough
+
+147
+00:11:32,980 --> 00:11:37,800
+information in some cases, and that's
+true. So sometimes for these
+
+148
+00:11:37,799 --> 00:11:41,688
+networks people actually have a separate
+offline refinement stage, where they take
+
+149
+00:11:41,688 --> 00:11:44,980
+this output and then feed it to some
+kind of graphical model to
+
+150
+00:11:44,980 --> 00:11:48,028
+clean up the output a little bit, so
+sometimes that can help boost your
+
+151
+00:11:48,028 --> 00:11:52,838
+performance a little. But just this sort of
+input-output model setup tends to
+
+152
+00:11:52,839 --> 00:12:09,600
+work pretty well, and it's something easy
+to implement. Yeah, the patch size? I'm not
+
+153
+00:12:09,600 --> 00:12:13,019
+sure exactly, probably
+pretty big, maybe a couple hundred
+
+154
+00:12:13,019 --> 00:12:19,919
+pixels, that order of magnitude. So one
+extension that people have used to this
+
+155
+00:12:19,919 --> 00:12:23,289
+basic approach is this idea of
+multiscale processing; actually, sometimes a
+
+156
+00:12:23,289 --> 00:12:28,230
+single scale isn't enough. So here we're
+going to take our input image and we'll
+
+157
+00:12:28,230 --> 00:12:33,009
+actually resize it to multiple different
+sizes. This is sort of a common trick
+
+158
+00:12:33,009 --> 00:12:36,688
+that people use in computer vision a lot,
+called an image pyramid: you just take
+
+159
+00:12:36,688 --> 00:12:41,458
+the same image and you resize it
+to many different scales. And now, for
+
+160
+00:12:41,458 --> 00:12:44,528
+each of these scales, we're gonna run it
+through a convolutional neural network
+
+161
+00:12:44,528 --> 00:12:49,568
+that is going to predict these pixel-wise
+labels for these different images
+
+162
+00:12:49,568 --> 00:12:52,969
+at these different resolutions.
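A minimal sketch of this image-pyramid idea, assuming some fully convolutional model `fcn` (a placeholder here) that maps an image to a coarse map of class scores:

```python
import numpy as np

def resize(img, h, w):
    # nearest-neighbor resize, just to keep the sketch dependency-free
    ys = (np.arange(h) * img.shape[0] / h).astype(int)
    xs = (np.arange(w) * img.shape[1] / w).astype(int)
    return img[ys][:, xs]

def fcn(img):  # placeholder network: 4x downsampling, 10 classes
    return np.zeros((img.shape[0] // 4, img.shape[1] // 4, 10))

def multiscale_scores(img, scales=(1.0, 0.75, 0.5)):
    H, W = img.shape[:2]
    outs = []
    for s in scales:
        scores = fcn(resize(img, int(H * s), int(W * s)))
        outs.append(resize(scores, H, W))   # upsample back to input size
    return np.mean(outs, axis=0)            # stack / combine the scales

scores = multiscale_scores(np.zeros((64, 64, 3)))
labels = scores.argmax(axis=2)              # one label per pixel
```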
+163
+00:12:52,970 --> 00:12:56,249
+Another thing to point out here, along
+the lines of your question: if each
+
+164
+00:12:56,249 --> 00:12:59,639
+of these networks actually has the same
+architecture, then each of these outputs
+
+165
+00:12:59,639 --> 00:13:04,490
+will have a different effective
+receptive field in the input, due to the
+
+166
+00:13:04,490 --> 00:13:08,720
+image pyramid. So now that we've gotten
+these differently sized pixel labels for
+
+167
+00:13:08,720 --> 00:13:13,660
+the image, we can take all of
+them and just do some offline
+
+168
+00:13:13,659 --> 00:13:18,129
+upsampling to upsample those responses
+to the same size as the input image. So
+
+169
+00:13:18,129 --> 00:13:24,319
+now we've gotten our three outputs of
+different sizes, upsampled, and we stack
+
+170
+00:13:24,318 --> 00:13:29,139
+them. And this is actually a paper
+from LeCun's group back in 2013; they
+
+171
+00:13:29,139 --> 00:13:33,709
+actually also have this separate
+offline processing step where they do
+
+172
+00:13:33,708 --> 00:13:39,119
+this idea of a bottom-up segmentation
+using these superpixel
+
+173
+00:13:39,120 --> 00:13:41,370
+methods. These are sort of more
+classic computer vision, image processing
+
+174
+00:13:41,370 --> 00:13:45,470
+type methods that actually look at the
+differences between adjacent pixels in
+
+175
+00:13:45,470 --> 00:13:48,589
+images and then try to merge them
+together to give you these coherent
+
+176
+00:13:48,589 --> 00:13:52,900
+regions where there is not much change
+in the image. So this method actually
+
+177
+00:13:52,899 --> 00:13:56,519
+runs the image offline
+through these other, more traditional
+
+178
+00:13:56,519 --> 00:14:02,230
+image processing techniques to get
+either a set of superpixels, or trees
+
+179
+00:14:02,230 --> 00:14:06,629
+saying which pixels ought to be merged
+together in the image, and they have this
+
+180
+00:14:06,629 --> 00:14:09,519
+somewhat complicated process for merging
+all these different things together,
+
+181
+00:14:09,519 --> 00:14:13,028
+because now we've got this sort of
+low-level information saying which
+
+182
+00:14:13,028 --> 00:14:14,110
+pixels in the image
+
+183
+00:14:14,110 --> 00:14:18,909
+actually are similar to each other, based
+on sort of color and gradient information,
+
+184
+00:14:18,909 --> 00:14:22,439
+and we've got these outputs of different
+resolutions from the convolutional
+
+185
+00:14:22,440 --> 00:14:25,810
+neural networks telling us semantically
+what the labels are at different points
+
+186
+00:14:25,809 --> 00:14:29,929
+in the image. And they actually
+explore a couple different ideas for
+
+187
+00:14:29,929 --> 00:14:33,870
+merging these things together to give
+you your final output. This actually
+
+188
+00:14:33,870 --> 00:14:38,419
+also addresses one of the
+earlier questions about the convnet not
+
+189
+00:14:38,419 --> 00:14:43,809
+being enough on its own: using these
+external superpixel methods or the
+
+190
+00:14:43,809 --> 00:14:47,729
+segmentation trees is another thing that
+sort of gives you additional information
+
+191
+00:14:47,730 --> 00:14:55,649
+about maybe larger context in the input
+images. So, any questions about this
+
+192
+00:14:55,649 --> 00:15:03,879
+model? OK, so another sort of cool
+idea that people have used for semantic
+
+193
+00:15:03,879 --> 00:15:08,299
+segmentation is this idea of
+iterative refinement. We actually saw
+
+194
+00:15:08,299 --> 00:15:12,809
+this a few lectures ago when we
+briefly mentioned pose estimation. But
+
+195
+00:15:12,809 --> 00:15:17,149
+the idea is that we're gonna have an
+input image, here they separated out the
+
+196
+00:15:17,149 --> 00:15:20,929
+three channels, and we're gonna run that
+thing through our favorite sort of
+
+197
+00:15:20,929 --> 00:15:24,929
+convolutional neural network,
+
+198
+00:15:24,929 --> 00:15:30,309
+to predict this low-resolution
+segmentation of the image. And now we're
+
+199
+00:15:30,309 --> 00:15:34,899
+gonna take that output from the CNN,
+together with a downsampled version of
+
+200
+00:15:34,899 --> 00:15:38,829
+the original image, and we'll just repeat
+the process again. So this allows the
+
+201
+00:15:38,830 --> 00:15:43,990
+network to sort of increase the
+effective receptive field of the output,
+
+202
+00:15:43,990 --> 00:15:48,399
+and also to perform more processing on
+the input image. And then we can
+
+203
+00:15:48,399 --> 00:15:54,009
+repeat this process again. So this is
+kinda cool: if these three
+
+204
+00:15:54,009 --> 00:15:54,769
+convolutional
+
+205
+00:15:54,769 --> 00:15:58,249
+networks actually share weights, then
+this becomes a recurrent convolutional
+
+206
+00:15:58,249 --> 00:16:03,489
+network, where it's sort of operating on
+the same input over time, but
+
+207
+00:16:03,489 --> 00:16:07,528
+actually each of these update steps is
+a whole convolutional network. That's
+
+208
+00:16:07,528 --> 00:16:10,139
+actually a very similar idea to the
+recurrent networks that we saw
+
+209
+00:16:10,139 --> 00:16:18,789
+previously. And the idea behind this
+paper, which was in 2014, is that if you
+
+210
+00:16:18,789 --> 00:16:22,558
+actually do more iterations of the same
+type of thing, then hopefully it allows
+
+211
+00:16:22,558 --> 00:16:28,219
+the network to sort of iteratively
+refine its outputs. So here, if we have
+
+212
+00:16:28,220 --> 00:16:32,220
+this raw input image, then after one
+iteration you can see that actually
+
+213
+00:16:32,220 --> 00:16:35,959
+there's quite a bit of noise, especially
+on the boundaries of the objects, but as
+
+214
+00:16:35,958 --> 00:16:39,359
+we run for two and three iterations
+through this recurrent convolutional
+
+215
+00:16:39,360 --> 00:16:42,769
+network, it actually allows the network
+to clean up a lot of that sort of
+
+216
+00:16:42,769 --> 00:16:46,989
+low-level noise and produce much
+cleaner and nicer results.
+
+217
+00:16:46,989 --> 00:16:51,119
+So I thought that was quite a cool
+idea, sort of merging together this
+
+218
+00:16:51,119 --> 00:16:55,199
+idea of recurrent networks and sharing
+weights over time with this idea of
+convolutional networks to process images.
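A minimal sketch of that iterative-refinement loop; `cnn_step` is a hypothetical stand-in for the shared-weight network, and the update shown is a dummy with the right shapes:

```python
import numpy as np

def cnn_step(image, prev_scores):
    # placeholder: a real model would run shared-weight convolutions on
    # the image together with its own previous class-score prediction
    return prev_scores * 0.5 + 0.5  # dummy update, correct shape only

def refine(image, n_classes=10, n_iters=3):
    H, W = image.shape[:2]
    scores = np.zeros((H, W, n_classes))   # initial (blank) prediction
    for _ in range(n_iters):               # the same weights are reused
        scores = cnn_step(image, scores)   # every step, like an RNN
    return scores

out = refine(np.zeros((64, 64, 3)))
```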
+219
+00:16:55,198 --> 00:17:03,479
+So another very widely,
+
+220
+00:17:03,480 --> 00:17:07,470
+very well-known paper for semantic
+segmentation is this one from Berkeley
+
+221
+00:17:07,470 --> 00:17:12,419
+that was published at CVPR last year.
+Here it's a very similar model as
+
+222
+00:17:12,419 --> 00:17:16,850
+before: we're going to take an input
+image and run it through some number of
+
+223
+00:17:16,849 --> 00:17:22,259
+convolutions, and eventually extract
+some feature map for the pixels. But in
+
+224
+00:17:22,259 --> 00:17:26,638
+contrast, the previous methods all rely
+on this sort of hard-coded upsampling
+
+225
+00:17:26,638 --> 00:17:31,138
+to actually produce the final
+segmentation for the image, and in this
+
+226
+00:17:31,138 --> 00:17:34,668
+paper they proposed that, well, we're
+deep learning people, we want to
+
+227
+00:17:34,669 --> 00:17:39,149
+learn everything, so we're gonna learn
+the upsampling as part of the network. So
+
+228
+00:17:39,148 --> 00:17:43,298
+their network includes, at the
+last layer, this learnable upsampling
+
+229
+00:17:43,298 --> 00:17:50,798
+layer that actually upsamples the
+feature map in a learnable way. So yes,
+
+230
+00:17:50,798 --> 00:17:55,179
+they have an upsampling layer at the
+end, and the way their model kind of
+
+231
+00:17:55,179 --> 00:17:59,940
+looks is that, at the time
+it was an AlexNet, so they have their
+
+232
+00:17:59,940 --> 00:18:04,090
+input image running through many phases
+of convolution and pooling, and
+
+233
+00:18:04,089 --> 00:18:08,028
+eventually they produce this pool5
+output, which has quite a
+
+234
+00:18:08,028 --> 00:18:12,048
+downsampled spatial size
+compared to the input image,
+
+235
+00:18:12,048 --> 00:18:16,999
+and then their learnable upsampling
+layer upsamples that back to the
+
+236
+00:18:16,999 --> 00:18:19,460
+original size of the input image.
+
+237
+00:18:19,460 --> 00:18:25,909
+Another cool feature of this paper was
+this idea of skip connections. They
+
+238
+00:18:25,909 --> 00:18:30,489
+actually don't use only these pool5
+features; they actually use the
+
+239
+00:18:30,489 --> 00:18:34,598
+convolutional features from different
+layers in the network, which sort of
+
+240
+00:18:34,598 --> 00:18:39,200
+exist at different scales. So you can
+imagine that once you're at the pool4
+
+241
+00:18:39,200 --> 00:18:42,649
+layer of AlexNet, that's actually a
+bigger feature map than the pool5,
+
+242
+00:18:42,648 --> 00:18:48,069
+and pool3 is even bigger than pool4.
+So the intuition is that these lower
+
+243
+00:18:48,069 --> 00:18:52,148
+convolutional layers might actually help
+you capture finer-grained structure in the
+
+244
+00:18:52,148 --> 00:18:56,408
+input image, since they have a smaller
+receptive field. So in fact they
+
+245
+00:18:56,409 --> 00:18:59,889
+take these different convolutional
+feature maps and apply a separate
+
+246
+00:18:59,888 --> 00:19:03,428
+learned upsampling to each of these
+feature maps, and then combine them all
+
+247
+00:19:03,429 --> 00:19:09,070
+to produce the final output. And in the
+results they show that actually adding these
+
+248
+00:19:09,069 --> 00:19:15,408
+skip connections tends to help a lot
+with these low-level details. So over
+
+249
+00:19:15,409 --> 00:19:19,979
+here on the left, these are the results
+that only use these pool5 outputs,
+
+250
+00:19:19,979 --> 00:19:24,919
+and you can see that it sort of got
+the rough idea of a person on a bicycle,
+
+251
+00:19:24,919 --> 00:19:29,330
+but it's kinda blobby and missing a lot
+of the fine details on the edges. But
+
+252
+00:19:29,329 --> 00:19:31,819
+then when you add in these skip
+connections from these lower
+
+253
+00:19:31,819 --> 00:19:35,468
+convolutional layers, that gives you a
+lot more fine-grained information about
+
+254
+00:19:35,469 --> 00:19:39,940
+the spatial locations of things in the
+image. So adding those
+
+255
+00:19:39,940 --> 00:19:43,919
+skip connections in the lower layers
+really helps you clean up the boundaries
+
+256
+00:19:43,919 --> 00:19:51,159
+in some cases for these outputs.
+Question? So the question is how to
+
+257
+00:19:51,159 --> 00:19:55,070
+measure accuracy. I think the two
+metrics people typically use for this:
+
+258
+00:19:55,069 --> 00:19:58,829
+since you're classifying every pixel, you
+can use plain per-pixel classification
+
+259
+00:19:58,829 --> 00:20:03,968
+accuracy; also, sometimes people use
+intersection over union. So for each
+
+260
+00:20:03,969 --> 00:20:09,058
+class you compute the region of
+the image that I predicted as that class,
+
+261
+00:20:09,058 --> 00:20:12,368
+and the ground-truth
+region of the image that had that class,
+
+262
+00:20:12,368 --> 00:20:17,158
+and then compute an intersection over
+union between those two. I'm not sure
+
+263
+00:20:17,159 --> 00:20:20,510
+which metric this paper used in
+particular.
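A minimal sketch of the per-class intersection-over-union metric just described, computed from predicted and ground-truth label maps:

```python
import numpy as np

def class_iou(pred, gt, cls):
    # intersection over union for one class, given integer label maps
    p = (pred == cls)
    g = (gt == cls)
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    return inter / union if union > 0 else float('nan')

pred = np.array([[0, 1], [1, 1]])
gt   = np.array([[0, 1], [0, 1]])
print(class_iou(pred, gt, 1))  # intersection 2, union 3 -> 0.666...
```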
outputs +question so the question is how to + +257 +00:19:51,159 --> 00:19:55,070 +classify accuracy I think the two +metrics people typically used for this + +258 +00:19:55,069 --> 00:19:58,829 +are even just classification as you're +classifying every pixel classification + +259 +00:19:58,829 --> 00:20:03,968 +my tracks also sometimes people use +intersection of a union so for each + +260 +00:20:03,969 --> 00:20:09,058 +class you compute what is the region of +the image that I predicted us that class + +261 +00:20:09,058 --> 00:20:12,368 +and what was the reach the ground troops +region of the image that had that class + +262 +00:20:12,368 --> 00:20:17,158 +and then compute an intersection of a +union between those two I'm not sure + +263 +00:20:17,159 --> 00:20:20,510 +which which metric this paper used in +particular + +264 +00:20:20,509 --> 00:20:26,609 +so this idea of learnable up sampling is +actually really cool and since since + +265 +00:20:26,609 --> 00:20:30,839 +this paper has been applied and a lot of +other contacts cuz we know we've seen + +266 +00:20:30,839 --> 00:20:35,839 +that we can down sample our feature maps +in a variety of different ways but being + +267 +00:20:35,839 --> 00:20:39,689 +able to up sample them inside the +network could actually be very useful + +268 +00:20:39,690 --> 00:20:44,750 +and a very valuable thing to do so this +sometimes gets called deconvolution + +269 +00:20:44,750 --> 00:20:48,980 +that's not a very good terms so we all +talk about that in a couple minutes but + +270 +00:20:48,980 --> 00:20:54,130 +it's a very common term so just just to +recap sort of when you're doing a normal + +271 +00:20:54,130 --> 00:20:59,870 +sort of stride stride 1353 convolution +we we have this we have this picture + +272 +00:20:59,869 --> 00:21:04,489 +that should be pretty familiar by now +that given our four-by-four input we + +273 +00:21:04,490 --> 00:21:08,710 +have some three by three filter and we +plot that three by three filter over + +274 +00:21:08,710 --> 00:21:10,059 +part of the input + +275 +00:21:10,059 --> 00:21:14,539 +product and that gives us one element of +the output and now because the sting + +276 +00:21:14,539 --> 00:21:19,240 +asteroid one to compute the next element +of the output we we moved the filter + +277 +00:21:19,240 --> 00:21:22,599 +over one slot in the input again +computer dot product and that gives us + +278 +00:21:22,599 --> 00:21:29,409 +are one element in the output and now +for stride true convolution it's it's a + +279 +00:21:29,410 --> 00:21:32,360 +very similar type of idea where now + +280 +00:21:32,359 --> 00:21:36,099 +output is going to be a Down sampled +version 2 by two output for a + +281 +00:21:36,099 --> 00:21:40,459 +four-by-four in place and again it's the +same idea we take our filter we plopped + +282 +00:21:40,460 --> 00:21:44,279 +down on the image computer dot product +gives us one element of the output the + +283 +00:21:44,279 --> 00:21:48,450 +only difference is that now we slide the +convolutional filter over two slots and + +284 +00:21:48,450 --> 00:21:53,610 +the input to compute one on into the +outputs the deconvolution elaire + +285 +00:21:53,609 --> 00:21:57,439 +actually does something a little bit +different so here we want to take a low + +286 +00:21:57,440 --> 00:22:02,490 +resolution input and produce a higher +resolution output so this would be maybe + +287 +00:22:02,490 --> 00:22:08,309 +a few by free deconvolution with a +straight up to an appt at one so here + +288 +00:22:08,309 
+288
+00:22:08,309 --> 00:22:12,659
+So here, this is a little bit weird. In a
+normal convolution you imagine you
+
+289
+00:22:12,660 --> 00:22:16,750
+have your three by three filter and you
+take dot products with the input, but here
+
+290
+00:22:16,750 --> 00:22:21,000
+you want to imagine taking your three by
+three filter and just copying it over to
+
+291
+00:22:21,000 --> 00:22:26,230
+the output. The only difference is that
+this one scalar value
+
+292
+00:22:26,230 --> 00:22:27,579
+in your input
+
+293
+00:22:27,579 --> 00:22:31,788
+gives you a weight, and you're going
+to reweight that filter when you stamp it
+
+294
+00:22:31,788 --> 00:22:38,298
+into the output. And now when we stride
+this thing along, we're gonna step one step
+
+295
+00:22:38,298 --> 00:22:43,298
+over in the input and two steps over in the
+output. Now we're going to take the same
+
+296
+00:22:43,298 --> 00:22:47,798
+learned convolutional filter,
+and we're gonna plop it down in the
+
+297
+00:22:47,798 --> 00:22:53,378
+output. So now we're
+taking the same convolutional filter and
+
+298
+00:22:53,378 --> 00:22:56,928
+we're plopping it down twice in the
+output, the difference being that for the
+
+299
+00:22:56,929 --> 00:23:02,139
+red box, that convolutional filter is
+weighted by the red scalar value in the
+
+300
+00:23:02,138 --> 00:23:06,148
+input, and for the blue box, that
+convolutional filter is weighted by the
+
+301
+00:23:06,148 --> 00:23:10,978
+blue scalar value in the input. And where
+these regions overlap, you
+
+302
+00:23:10,979 --> 00:23:16,590
+just add. So this kind of allows you to
+learn an upsampling inside the network.
+
+303
+00:23:16,589 --> 00:23:23,118
+So if you remember from
+implementing convolutions on the
+
+304
+00:23:23,118 --> 00:23:27,999
+assignment, this idea of sort of
+spatially striding and adding in
+
+305
+00:23:27,999 --> 00:23:31,348
+overlapping regions should remind
+you of the backward pass for a normal
+
+306
+00:23:31,348 --> 00:23:36,729
+convolution. And it turns out that these
+are completely equivalent: this
+
+307
+00:23:36,729 --> 00:23:40,440
+deconvolution forward pass is exactly
+the same as the normal convolution
+
+308
+00:23:40,440 --> 00:23:44,840
+backward pass, and the
+deconvolution backward pass is the same
+
+309
+00:23:44,839 --> 00:23:50,238
+as the normal convolution forward pass.
+So because of that, the term
+
+310
+00:23:50,239 --> 00:23:54,989
+deconvolution is maybe not so great, and
+if you have a signal processing
+
+311
+00:23:54,989 --> 00:23:58,700
+background you may have seen that
+deconvolution already has a very
+
+312
+00:23:58,700 --> 00:24:03,308
+well-defined meaning, and that is the
+inverse of convolution. So a
+
+313
+00:24:03,308 --> 00:24:07,470
+deconvolution should undo a convolution
+operation, which is quite different from
+
+314
+00:24:07,470 --> 00:24:11,909
+what this is actually doing. So probably
+better names for this, instead of
+
+315
+00:24:11,909 --> 00:24:17,609
+deconvolution, that you'll sometimes see,
+will be convolution transpose, or
+
+316
+00:24:17,608 --> 00:24:22,148
+backwards strided convolution, or
+fractionally strided convolution, or
+
+317
+00:24:22,148 --> 00:24:27,148
+upconvolution. So I think those are kind
+of weird names; I think deconvolution is
+
+318
+00:24:27,148 --> 00:24:30,988
+popular just because it's easiest to say,
+even though it may be less
+
+319
+00:24:30,989 --> 00:24:35,369
+technically correct.
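A minimal single-channel sketch of this stamp-and-add picture for a 3x3 transposed (fractionally strided) convolution with stride 2; the filter values are dummies:

```python
import numpy as np

def conv_transpose2d(x, w, stride=2):
    k = w.shape[0]
    n = x.shape[0]
    out_size = (n - 1) * stride + k   # no output padding/cropping here
    out = np.zeros((out_size, out_size))
    for i in range(n):
        for j in range(n):
            # one step in the input = `stride` steps in the output;
            # each input scalar weights a stamped copy of the filter,
            # and overlapping stamps simply add
            oi, oj = i * stride, j * stride
            out[oi:oi+k, oj:oj+k] += x[i, j] * w
    return out

x = np.array([[1., 2.], [3., 4.]])    # low-resolution input
w = np.ones((3, 3))                    # learned filter (dummy values)
print(conv_transpose2d(x, w))          # 5x5 upsampled output
# This forward pass is exactly the input-gradient (backward) pass of a
# stride-2 convolution with the same filter.
```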
+320
+00:24:35,368 --> 00:24:38,699
+Although actually, if you read papers,
+you'll see that some people get angry
+
+321
+00:24:38,700 --> 00:24:43,539
+about this, so it's more proper to say
+convolution transpose instead of deconvolution, and
+
+322
+00:24:43,539 --> 00:24:47,529
+this other paper really wants it to be
+called fractionally strided convolution.
+
+323
+00:24:47,529 --> 00:24:51,750
+So I think the community is
+still deciding on the right terminology
+
+324
+00:24:51,750 --> 00:24:55,240
+here, but I kind of agree with them;
+deconvolution is probably not very
+
+325
+00:24:55,240 --> 00:25:00,309
+technically correct. And this paper
+in particular, they felt very
+
+326
+00:25:00,309 --> 00:25:04,139
+strongly about this issue, and they had a
+one-page appendix to the paper
+
+327
+00:25:04,140 --> 00:25:09,230
+actually explaining why convolution
+transpose is the proper term. So if you're
+
+328
+00:25:09,230 --> 00:25:11,849
+interested, then I would really recommend
+checking that out; it's a pretty good
+
+329
+00:25:11,849 --> 00:25:16,289
+explanation, actually. So, any
+questions about this?
+
+330
+00:25:16,289 --> 00:25:26,299
+Yeah. So the question is how much
+faster is this relative to a patch-based
+
+331
+00:25:26,299 --> 00:25:29,930
+thing. The answer is that in practice
+nobody even thinks to run this thing in a
+
+332
+00:25:29,930 --> 00:25:34,820
+naive patch-based mode; that would
+just be way, way too slow. So actually, all
+
+333
+00:25:34,819 --> 00:25:36,000
+of the papers that I've seen
+
+334
+00:25:36,000 --> 00:25:39,109
+do some kind of fully convolutional
+thing in one way or another.
+
+335
+00:25:39,109 --> 00:25:44,729
+Actually, there is sort of another
+trick, instead of upsampling, that people
+
+336
+00:25:44,730 --> 00:25:49,309
+sometimes use, and that is: suppose
+that your network is actually
+
+337
+00:25:49,309 --> 00:25:52,599
+gonna downsample by a factor of four.
+Then one thing you can do is take your
+
+338
+00:25:52,599 --> 00:25:57,199
+input image shifted by one pixel and
+run it through the network again, and
+
+339
+00:25:57,200 --> 00:26:00,710
+you get another output, and you repeat
+this for sort of four different one-
+
+340
+00:26:00,710 --> 00:26:04,870
+pixel shifts of the input, and now you've
+gotten four output maps, and you can sort
+
+341
+00:26:04,869 --> 00:26:08,339
+of interleave those to reconstruct an
+original resolution map. So that's
+
+342
+00:26:08,339 --> 00:26:12,279
+another trick that people sometimes use
+to get around that problem, but I think
+
+343
+00:26:12,279 --> 00:26:19,740
+that this learned upsampling is quite a
+bit cleaner.
+
+344
+00:26:19,740 --> 00:26:28,440
+So, deconvolution I think is really nice,
+it just rolls off the tongue. Backwards strided, I
+
+345
+00:26:28,440 --> 00:26:33,799
+think fractionally strided convolution
+is actually pretty cool, right? I think
+
+346
+00:26:33,799 --> 00:26:36,928
+it's the longest name, but it's really
+descriptive, right?
+
+347
+00:26:36,929 --> 00:26:40,910
+Normally, with a strided convolution, you
+move x elements in the input when you move
+
+348
+00:26:40,910 --> 00:26:45,808
+one element in the output, and here you're
+moving half a step in the
+
+349
+00:26:45,808 --> 00:26:48,940
+input when you move one step in the
+
+350
+00:26:48,940 --> 00:26:55,140
+output, so that captures the idea quite
+nicely.
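A minimal 1D sketch of that shift-and-interleave trick, using a toy "network" that just downsamples by a factor of two:

```python
import numpy as np

def net_1d(x):                 # placeholder: "network" = 2x downsampling
    return x[::2]

x = np.arange(8.)
out_a = net_1d(x)              # outputs for even input positions
out_b = net_1d(x[1:])          # outputs for odd positions (shift by 1)
full = np.zeros(len(out_a) + len(out_b))
full[0::2] = out_a             # interleave the two coarse outputs
full[1::2] = out_b
print(full)                    # matches x: full resolution recovered
```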
+351
+00:26:55,140 --> 00:27:02,790
+So I'm not sure what I'll call it when
+I use it in a paper; we'll have to see
+
+352
+00:27:02,789 --> 00:27:06,440
+about that. But despite the concerns about
+people calling it deconvolution, people just
+
+353
+00:27:06,440 --> 00:27:10,980
+call it that anyway. So there was this
+paper from ICCV that takes this idea,
+
+354
+00:27:10,980 --> 00:27:16,319
+this deconvolution / fractionally strided
+convolution idea, and sort of pushes it
+
+355
+00:27:16,319 --> 00:27:21,428
+to the extreme. So here they took what
+amounts to two entire VGG networks.
+
+356
+00:27:21,429 --> 00:27:28,170
+This is the exact same setup as before: we want to
+input an image and output pixel-wise predictions for
+
+357
+00:27:28,170 --> 00:27:33,720
+the semantic segmentation task, but here we
+initialize with VGG, and over here is
+
+358
+00:27:33,720 --> 00:27:40,220
+an upside-down VGG, and it trains for six
+days on a Titan X. So this thing is pretty
+
+359
+00:27:40,220 --> 00:27:44,509
+slow, but it actually got really, really good
+results, and I think it's also a very
+
+360
+00:27:44,509 --> 00:27:51,160
+beautiful figure. So that's pretty
+much all that I have to say about
+
+361
+00:27:51,160 --> 00:27:54,308
+semantic segmentation, if there are any
+questions about that. Yeah?
+
+362
+00:27:54,308 --> 00:27:59,799
+Question.
+
+363
+00:27:59,799 --> 00:28:04,909
+The question is how was this figure made.
+The answer is: I took a screenshot from
+
+364
+00:28:04,910 --> 00:28:09,090
+their paper, so I don't know. But you can
+try TensorFlow; we saw in the last
+
+365
+00:28:09,089 --> 00:28:15,069
+lecture that it lets you make figures,
+but they're not as nice as this. Yeah,
+
+366
+00:28:15,069 --> 00:28:22,579
+question about training data. Yes, there
+exist datasets with this kind of thing;
+
+367
+00:28:22,579 --> 00:28:28,449
+I think a common one is
+the PASCAL segmentation dataset. It
+
+368
+00:28:28,450 --> 00:28:31,380
+just has ground truth: you have an
+image, and they have
+
+369
+00:28:31,380 --> 00:28:37,780
+every pixel labeled. Yeah, it's kind
+of expensive to get that data, so the
+
+370
+00:28:37,779 --> 00:28:43,049
+datasets tend to be a little smaller, but
+in practice there's a famous interface
+
+371
+00:28:43,049 --> 00:28:46,299
+called LabelMe, where you can upload
+an image and then sort of draw contours
+
+372
+00:28:46,299 --> 00:28:49,240
+around different
+regions of the image, and then you
+
+373
+00:28:49,240 --> 00:28:54,140
+can convert those contours into sort of
+these segmentation masks; that's how you
+
+374
+00:28:54,140 --> 00:29:02,130
+tend to label these things. If there are no
+other questions, then I think we'll
+
+375
+00:29:02,130 --> 00:29:07,290
+move on to instance segmentation. So just
+to recap, instance segmentation is this
+
+376
+00:29:07,289 --> 00:29:11,089
+generalization where we not only want
+to label the pixels of the image, but we
+
+377
+00:29:11,089 --> 00:29:15,089
+also want to distinguish
+instances. So we're going to
+
+378
+00:29:15,089 --> 00:29:18,419
+detect the different instances of our
+classes, and for each one we want to
+
+379
+00:29:18,420 --> 00:29:25,320
+label the pixels of that instance. So
+these models end up
+
+380
+00:29:25,319 --> 00:29:28,419
+looking a lot like the detection models
+that we talked about a few lectures ago.
+
+381
+00:29:28,420 --> 00:29:34,150
+So one of the earliest papers that I
+know of here, and I should also
+
+382
+00:29:34,150 --> 00:29:38,040
+point out that this is, I think, a much
+more recent task: this idea of
+
+383
+00:29:38,039 --> 00:29:42,319
+semantic segmentation has been used in
+computer vision for a long, long time, but
+
+384
+00:29:42,319 --> 00:29:45,409
+I think this idea of instance
+segmentation has gotten a lot more
+
+385
+00:29:45,410 --> 00:29:50,970
+popular, especially in the last couple of
+years. So this paper from 2014 sort of
+
+386
+00:29:50,970 --> 00:29:53,890
+took this, I think they call it
+simultaneous detection and segmentation,
+
+387
+00:29:53,890 --> 00:29:59,600
+or SDS, that's kind of a nice name, and
+this is actually very similar to the R-CNN
+
+388
+00:29:59,599 --> 00:30:03,839
+model that we saw for detection. So here
+we're gonna take an input image,
+
+389
+00:30:03,839 --> 00:30:09,399
+and if you remember, in R-CNN we rely
+on these external region proposals,
+
+390
+00:30:09,400 --> 00:30:12,269
+these sort of offline computer
+vision
+
+391
+00:30:12,269 --> 00:30:16,538
+methods that compute predictions on
+where it thinks objects in the image might
+
+392
+00:30:16,538 --> 00:30:17,658
+be located.
+
+393
+00:30:17,659 --> 00:30:21,419
+Well, it turns out that there are other
+methods for proposing segments instead
+
+394
+00:30:21,419 --> 00:30:25,419
+of boxes, so we just download one of
+those existing segment proposal methods
+
+395
+00:30:25,419 --> 00:30:30,879
+and use that instead. Now, for each of
+these
+
+396
+00:30:30,878 --> 00:30:35,398
+proposed segments, we can extract a
+bounding box by just fitting a box to
+
+397
+00:30:35,398 --> 00:30:40,298
+the segment, and then crop out that
+chunk of the input image and run it
+
+398
+00:30:40,298 --> 00:30:47,108
+through a box CNN to extract features
+for that box. Then, in parallel, we'll run it
+
+399
+00:30:47,108 --> 00:30:52,358
+through a region CNN: again, we take
+that chunk from the input
+
+400
+00:30:52,358 --> 00:30:57,168
+image and crop it out, but here,
+because we actually have this proposal
+
+401
+00:30:57,169 --> 00:31:01,320
+for the segment, we're going to mask
+out the background region using the mean
+
+402
+00:31:01,319 --> 00:31:05,700
+color of the dataset. So this is kind
+of a hack that lets you take these kind
+
+403
+00:31:05,700 --> 00:31:09,838
+of weird-shaped inputs and feed them into a
+CNN: you just mask out the background
+
+404
+00:31:09,838 --> 00:31:14,479
+part with the mean color. So then we
+take these masked inputs and run them
+
+405
+00:31:14,479 --> 00:31:18,769
+through a separate region CNN. Now we've
+gotten two different feature vectors, one
+
+406
+00:31:18,769 --> 00:31:22,739
+sort of incorporating the whole box, and
+one incorporating only the
+
+407
+00:31:22,739 --> 00:31:26,328
+proposed foreground pixels. We
+concatenate these things, and then, just
+
+408
+00:31:26,328 --> 00:31:30,638
+like in R-CNN, we make a classification
+to decide what class
+
+409
+00:31:30,638 --> 00:31:37,128
+this segment actually should be. And then
+they also have
+
+410
+00:31:37,128 --> 00:31:42,108
+this region refinement step, where we
+want to refine the proposed regions a
+
+411
+00:31:42,108 --> 00:31:45,218
+little bit. So, I don't know how well
+you remember the R-CNN framework, but
+
+412
+00:31:45,219 --> 00:31:52,909
+this is actually very similar to R-CNN, just
+applied to this simultaneous detection and segmentation task.
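A minimal sketch of this two-stream feature extraction; `box_cnn`, `region_cnn`, and the mean color here are hypothetical placeholders, not the paper's code:

```python
import numpy as np

def box_cnn(crop):      # placeholder feature extractors standing in
    return np.zeros(4096)

def region_cnn(crop):   # for two separately trained convnets
    return np.zeros(4096)

def sds_features(img, box, mask, mean_color):
    y0, x0, y1, x1 = box
    crop = img[y0:y1, x0:x1]
    f_box = box_cnn(crop)                     # whole-box features
    masked = crop.copy()
    masked[~mask[y0:y1, x0:x1]] = mean_color  # blank out background
    f_region = region_cnn(masked)             # foreground-only features
    return np.concatenate([f_box, f_region])  # fed to a classifier

img = np.zeros((100, 100, 3))
mask = np.zeros((100, 100), dtype=bool)
mask[20:60, 30:70] = True                     # toy segment proposal
feats = sds_features(img, (20, 30, 60, 70), mask,
                     mean_color=np.array([104., 117., 123.]))  # example mean
```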
task so this + +413 +00:31:52,909 --> 00:31:56,950 +idea for this region refinement step +there's actually a follow-up paper that + +414 +00:31:56,950 --> 00:32:03,288 +proposes a pretty nice way to do it so +here is the paper from the same folks at + +415 +00:32:03,288 --> 00:32:07,578 +berkeley though the following conference +and here we want to take this this input + +416 +00:32:07,578 --> 00:32:12,940 +which is this proposed to segment this +proposed segment and want to clean it up + +417 +00:32:12,940 --> 00:32:17,778 +somehow so we're actually gonna take a +very similar approach very similar type + +418 +00:32:17,778 --> 00:32:20,230 +multiscale approach that we saw in the + +419 +00:32:20,230 --> 00:32:24,839 +in the semantic segmentation model a +while ago so here we're going to take + +420 +00:32:24,839 --> 00:32:30,139 +our our image crop out the prop up the +box corresponding to that segment and + +421 +00:32:30,140 --> 00:32:34,350 +then pass it through and Alex net and +we're going to extract convolutional + +422 +00:32:34,349 --> 00:32:37,849 +features from several different layers +of that Alex NAT for each of those + +423 +00:32:37,849 --> 00:32:42,139 +feature maps will up sampled I'm and +combine them together and now will + +424 +00:32:42,140 --> 00:32:48,370 +produce this this figure this proposed +figure ground segmentation so this this + +425 +00:32:48,369 --> 00:32:52,308 +is actually kind of a funny output but +it's it's really easy to predict the + +426 +00:32:52,308 --> 00:32:55,910 +idea is that invests this output image +we're just gonna do a logistic + +427 +00:32:55,910 --> 00:33:00,990 +classifier inside each independent pixel +so given these features we just have a + +428 +00:33:00,990 --> 00:33:04,410 +whole bunch of independent logistic +classify hairs that are predicting how + +429 +00:33:04,410 --> 00:33:08,250 +much each pixel of this output is likely +to be in the foreground are in the + +430 +00:33:08,250 --> 00:33:13,390 +background and they show that this this +type of multiscale refinement step + +431 +00:33:13,390 --> 00:33:16,610 +actually cleans up the other parts of +the previous system and gives quite + +432 +00:33:16,609 --> 00:33:27,899 +quite nice results question + +433 +00:33:27,900 --> 00:33:34,390 +fractionally stride and convolution I +think it was instead of some kind of a + +434 +00:33:34,390 --> 00:33:37,870 +fix up sampling like a bilinear +interpolation or something like that or + +435 +00:33:37,869 --> 00:33:41,449 +maybe even a nearest-neighbor just +something fixed and variable but I could + +436 +00:33:41,450 --> 00:33:44,170 +be wrong but you could definitely +imagine swapping and some learnable + +437 +00:33:44,170 --> 00:33:46,250 +think they're too + +438 +00:33:46,250 --> 00:33:52,980 +ok so this this this actually is very +similar to our CNN but in the detection + +439 +00:33:52,980 --> 00:33:57,049 +lecture we saw that our CNN was just the +start of the story there's all these + +440 +00:33:57,049 --> 00:34:03,329 +faster versions right so it turns out +that a similar intuition from faster our + +441 +00:34:03,329 --> 00:34:08,090 +CNN has actually been applied to this +instance segmentation problem as well so + +442 +00:34:08,090 --> 00:34:12,050 +this is work from Microsoft that action +and this model actually won the cocoa + +443 +00:34:12,050 --> 00:34:16,860 +instance segmentation challenge this +year so they they took their giant + +444 +00:34:16,860 --> 00:34:20,000 +resonance and they stuck this model on +top 
+
+445
+00:34:20,000 --> 00:34:25,489
+And they crushed everyone else in the COCO
+instance segmentation challenge. So this
+
+446
+00:34:25,489 --> 00:34:28,668
+model actually is very similar to Faster
+R-CNN. So we're going to take our
+
+447
+00:34:28,668 --> 00:34:34,148
+input image, and just like in Fast and
+Faster R-CNN, our input image might
+
+448
+00:34:34,148 --> 00:34:37,730
+be pretty high resolution, and we'll get
+this giant convolutional feature map
+
+449
+00:34:37,730 --> 00:34:44,260
+over our high-resolution image, and then
+from this high-resolution feature map we're actually
+
+450
+00:34:44,260 --> 00:34:48,700
+going to produce our own region
+proposals. In the previous method we
+
+451
+00:34:48,699 --> 00:34:52,319
+relied on these external segment
+proposals, but here we're just going to
+
+452
+00:34:52,320 --> 00:34:56,870
+learn our own region proposals, just like
+Faster R-CNN. So here we just stick a
+
+453
+00:34:56,869 --> 00:35:00,859
+couple of extra convolutional layers
+on top of our convolutional feature map,
+
+454
+00:35:00,860 --> 00:35:04,740
+and each one of those is going to
+predict several regions of interest in
+
+455
+00:35:04,739 --> 00:35:11,109
+the image, using this idea of anchor boxes
+that we saw in the detection lecture.
+
+456
+00:35:11,110 --> 00:35:15,200
+The difference is that now, once we have
+these region proposals, we're
+
+457
+00:35:15,199 --> 00:35:18,559
+going to segment them using a very similar
+approach to what we just saw on the last
+
+458
+00:35:18,559 --> 00:35:24,380
+slide. So for each of these proposed
+regions we're going to use what
+
+459
+00:35:24,380 --> 00:35:28,579
+they call RoI warping or pooling, and
+squish them all down to a fixed square
+
+460
+00:35:28,579 --> 00:35:33,000
+size, and then run each of them through a
+convolutional neural network to produce
+
+461
+00:35:33,000 --> 00:35:36,710
+these coarse figure-ground segmentation
+masks, like we just saw on the previous
+
+462
+00:35:36,710 --> 00:35:41,909
+slide. So now at this
+point we've got our image, we've got a
+
+463
+00:35:41,909 --> 00:35:45,859
+bunch of region proposals, and now for
+each region proposal we have a rough
+
+464
+00:35:45,860 --> 00:35:49,240
+idea of which part of that box is
+foreground and which part is background.
+
+465
+00:35:49,239 --> 00:35:54,489
+Now we're going to take this idea of
+masking: now that we've predicted the
+
+466
+00:35:54,489 --> 00:35:57,709
+foreground and background for each of these
+segments, we're going to mask out the
+
+467
+00:35:57,710 --> 00:36:02,889
+predicted background and only keep the
+pixels from the predicted foreground, and
+
+468
+00:36:02,889 --> 00:36:07,179
+pass those through another couple of layers
+to actually classify
+
+469
+00:36:07,179 --> 00:36:13,629
+that segment into our different categories.
+So this entire thing can
+
+470
+00:36:13,630 --> 00:36:18,380
+just be learned jointly, end-to-end,
+with the idea that we've got these three
+
+471
+00:36:18,380 --> 00:36:22,490
+semantically interpretable outputs at
+intermediate layers of our network, and
+
+472
+00:36:22,489 --> 00:36:26,589
+each of them we can just supervise with
+ground-truth data. So for these regions of
+
+473
+00:36:26,590 --> 00:36:29,900
+interest, we know where the ground-truth
+objects are in the image,
+
+474
+00:36:29,900 --> 00:36:34,349
+so we can provide supervision
+on those outputs.
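+
+A toy numpy sketch of the RoI warping/pooling idea above: crop a box out of a convolutional feature map and squish it to a fixed square size with nearest-neighbor sampling. The function name and shapes are illustrative, not the paper's implementation:
+
+```python
+import numpy as np
+
+def roi_warp(feature_map, box, out_size=14):
+    """Crop a region of a (H, W, C) feature map and resize it to a fixed
+    out_size x out_size grid. box is (x0, y0, x1, y1) in feature-map coords."""
+    x0, y0, x1, y1 = box
+    rows = np.linspace(y0, y1 - 1, out_size).astype(int)
+    cols = np.linspace(x0, x1 - 1, out_size).astype(int)
+    return feature_map[rows][:, cols]   # (out_size, out_size, C)
+
+fmap = np.random.randn(40, 60, 512)            # conv feature map over a high-res image
+warped = roi_warp(fmap, box=(10, 5, 34, 29))   # fixed-size input for the mask head
+```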
+
+475
+00:36:34,349 --> 00:36:37,929
+For these segmentation masks, we know what
+the true foreground and background are, so we can give supervision
+
+476
+00:36:37,929 --> 00:36:42,759
+there, and we obviously know the
+classes of those different segments, so
+
+477
+00:36:42,760 --> 00:36:46,760
+we just provide supervision at different
+layers of the network and try to trade
+
+478
+00:36:46,760 --> 00:36:50,420
+off all those different loss terms and
+hopefully get the thing to converge. And
+
+479
+00:36:50,420 --> 00:36:53,670
+this actually was trained end-to-end
+and fine-tuned, and it
+
+480
+00:36:53,670 --> 00:36:59,809
+works really, really well. So here is
+the results figure that we have to show.
+
+481
+00:36:59,809 --> 00:37:04,519
+These results are, at least to me,
+really impressive. For example,
+
+482
+00:37:04,519 --> 00:37:09,159
+this input image has all these different
+people sitting in this room, and the
+
+483
+00:37:09,159 --> 00:37:12,539
+predicted outputs do a really good job
+of separating out all those different
+
+484
+00:37:12,539 --> 00:37:15,360
+people, even though they overlap and
+there are a lot of them and they're very
+
+485
+00:37:15,360 --> 00:37:16,500
+close.
+
+486
+00:37:16,500 --> 00:37:20,699
+Same with these cars, maybe that's a little
+easier, but especially this people
+
+487
+00:37:20,699 --> 00:37:24,629
+one I was pretty impressed by. But you
+can see it's not perfect: this potted
+
+488
+00:37:24,630 --> 00:37:28,840
+plant, it thought it was bigger than it
+really was, and it confused this chair on
+
+489
+00:37:28,840 --> 00:37:32,230
+the right for a person, and it missed a
+person there, but overall these results
+
+490
+00:37:32,230 --> 00:37:36,300
+are very, very impressive, and like I said,
+this model won the COCO segmentation
+
+491
+00:37:36,300 --> 00:37:43,250
+challenge this year. So the overview of
+segmentation is that we've got
+
+492
+00:37:43,250 --> 00:37:47,519
+these two different tasks, semantic
+segmentation and instance segmentation:
+
+493
+00:37:47,519 --> 00:37:52,210
+for semantic segmentation it's very
+common to use this
+
+494
+00:37:52,210 --> 00:37:56,800
+conv-deconv approach, and then for
+instance segmentation you end up with
+
+495
+00:37:56,800 --> 00:38:02,180
+these pipelines that look more similar
+to object detection. So if there are any
+
+496
+00:38:02,179 --> 00:38:08,338
+last-minute questions about segmentation
+I can try to answer those now... super
+
+497
+00:38:08,338 --> 00:38:14,329
+clear, I guess. So we're going to move on
+to another pretty cool and
+
+498
+00:38:14,329 --> 00:38:18,150
+exciting topic, and that's attention
+models. So this is something that I think
+
+499
+00:38:18,150 --> 00:38:24,550
+has gotten a lot of attention in the last
+year in the community. So as a kind of case
+
+500
+00:38:24,550 --> 00:38:29,780
+study, we're going to talk about the model
+from the citation here, okay. But as
+
+501
+00:38:29,780 --> 00:38:32,349
+a sort of case study,
+
+502
+00:38:32,349 --> 00:38:35,190
+we're going to talk about the idea of
+attention as applied to image captioning.
+
+503
+00:38:35,190 --> 00:38:39,530
+I think this model was previewed in
+the recurrent networks lecture, but
+
+504
+00:38:39,530 --> 00:38:43,740
+I want to step into a lot more
+detail here. But first, as a recap, just
+
+505
+00:38:43,739 --> 00:38:47,029
+so we're on the same page: hopefully you
+know how image captioning works by
+
+506
+00:38:47,030 --> 00:38:51,540
+now, since the homework is due in a few
+hours.
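+
+To make the "trade off all those different loss terms" point from a moment ago concrete, here is a minimal sketch of a jointly supervised multi-task objective; the weights and loss names are illustrative, not from the paper:
+
+```python
+def joint_loss(loss_proposal, loss_mask, loss_cls, w=(1.0, 1.0, 1.0)):
+    """Weighted sum of the three supervised stages: region proposals,
+    figure-ground masks, and segment classification. Tuning the weights w
+    is part of getting the end-to-end training to converge."""
+    return w[0] * loss_proposal + w[1] * loss_mask + w[2] * loss_cls
+
+total = joint_loss(loss_proposal=0.7, loss_mask=1.2, loss_cls=0.4, w=(1.0, 0.5, 1.0))
+```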
+
+507
+00:38:51,539 --> 00:38:54,869
+We're going to take our input image and run
+it through a convolutional net and get some features;
+
+508
+00:38:54,869 --> 00:38:58,869
+those features will be used, maybe, to
+initialize the first hidden state of our
+
+509
+00:38:58,869 --> 00:39:03,780
+recurrent network. Then from our start token,
+our first word, and that hidden
+
+510
+00:39:03,780 --> 00:39:06,609
+state, we're going to produce this
+distribution over words in our
+
+511
+00:39:06,608 --> 00:39:11,940
+vocabulary. Then to generate a word we'll
+just sample from that distribution, and we'll
+
+512
+00:39:11,940 --> 00:39:16,429
+just sort of repeat this process
+over time to generate captions. The
+
+513
+00:39:16,429 --> 00:39:20,199
+problem here is that this network only
+sort of gets one chance to look at the
+
+514
+00:39:20,199 --> 00:39:23,899
+input image, and when it does, it's
+looking at the entire input image all at
+
+515
+00:39:23,900 --> 00:39:29,970
+once. It might be cooler if it
+actually had the ability to, one, look at
+
+516
+00:39:29,969 --> 00:39:33,809
+the input image multiple times, and also
+if it could focus on different parts of
+
+517
+00:39:33,809 --> 00:39:41,969
+the input image as it ran. So one pretty
+cool paper that came out last year was
+
+518
+00:39:41,969 --> 00:39:46,409
+this one, called Show, Attend and Tell; the
+original one was Show and Tell, so they added
+
+519
+00:39:46,409 --> 00:39:51,289
+the "Attend" part. The idea is pretty
+straightforward: we're going to take
+
+520
+00:39:51,289 --> 00:39:54,750
+our input image and we're still going to
+run it through a convolutional network,
+
+521
+00:39:54,750 --> 00:39:58,440
+but instead of extracting the features
+from the last fully connected layer,
+
+522
+00:39:58,440 --> 00:40:01,659
+we're going to pull features from
+one of the earlier
+
+523
+00:40:01,659 --> 00:40:05,549
+convolutional layers, and that's going to
+give us this grid of features
+
+524
+00:40:05,550 --> 00:40:09,160
+rather than a single feature vector.
+Because these are coming from a
+
+525
+00:40:09,159 --> 00:40:13,460
+convolutional layer, you can imagine
+that maybe the upper left-hand part of the grid
+
+526
+00:40:13,460 --> 00:40:17,320
+corresponds to the upper left of the image; you
+can think of this as a 2D spatial
+
+527
+00:40:17,320 --> 00:40:21,130
+grid of features, and each point in the
+grid gives you features corresponding to some part of
+
+528
+00:40:21,130 --> 00:40:26,890
+the input image. So now, again, we'll use
+these features to initialize the
+
+529
+00:40:26,889 --> 00:40:30,099
+hidden state of our network in some way,
+and now here's where things get
+
+530
+00:40:30,099 --> 00:40:34,400
+different: now we're going to use our
+hidden state to compute not a
+
+531
+00:40:34,400 --> 00:40:38,220
+distribution over words, but instead a
+distribution over these different
+
+532
+00:40:38,219 --> 00:40:43,459
+positions in our convolutional
+feature map. This would
+
+533
+00:40:43,460 --> 00:40:47,050
+probably be implemented with
+maybe an affine
+
+534
+00:40:47,050 --> 00:40:51,260
+layer or two and then a softmax to
+give you a distribution; we just end
+
+535
+00:40:51,260 --> 00:40:54,410
+up with this L-dimensional vector
+giving us a probability distribution
+
+536
+00:40:54,409 --> 00:41:01,019
+over these L different
+locations in our input.
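+
+A minimal numpy sketch of that step, assuming an L = 49 location grid from a 7x7 conv feature map; the affine weights here are placeholders:
+
+```python
+import numpy as np
+
+def softmax(x):
+    e = np.exp(x - x.max())
+    return e / e.sum()
+
+# conv5-style feature grid: 7 x 7 spatial positions, 512-dim features each
+features = np.random.randn(7, 7, 512).reshape(-1, 512)   # (L=49, D=512)
+
+h = np.random.randn(256)                 # current RNN hidden state
+W_att = np.random.randn(49, 256) * 0.01  # affine layer: hidden state -> L scores
+
+p = softmax(W_att @ h)                   # (49,) attention over grid locations
+```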
+
+537
+00:41:01,019 --> 00:41:05,780
+Now we take this probability distribution and
+actually use it to compute a weighted sum of those
+
+538
+00:41:05,780 --> 00:41:10,810
+feature vectors at the different points
+in our grid. So once we take this
+
+539
+00:41:10,809 --> 00:41:15,849
+weighted combination of features, that
+takes our grid and summarizes it down to
+
+540
+00:41:15,849 --> 00:41:22,420
+a single vector, and this
+z vector summarizes the input
+
+541
+00:41:22,420 --> 00:41:26,909
+image in some way. Due to this
+
+542
+00:41:26,909 --> 00:41:30,619
+probability distribution, it gives the
+network the capacity to focus on
+
+543
+00:41:30,619 --> 00:41:35,299
+different parts of the image as it goes.
+So now this weighted summary vector that's
+
+544
+00:41:35,300 --> 00:41:39,730
+produced from the input features gets
+fed in together with the first word, and now
+
+545
+00:41:39,730 --> 00:41:43,960
+when we make a recurrence in the recurrent
+network we actually have three input parts:
+
+546
+00:41:43,960 --> 00:41:49,139
+we have our previous hidden state, we
+have this attended feature vector, and we
+
+547
+00:41:49,139 --> 00:41:52,929
+have this first word. All of
+these together are used to produce our
+
+548
+00:41:52,929 --> 00:41:56,929
+new hidden state, and from this
+hidden state we're actually going to
+
+549
+00:41:56,929 --> 00:42:01,419
+produce two outputs: we're going to
+produce a new distribution over
+
+550
+00:42:01,420 --> 00:42:04,940
+locations in our input image, and
+we're also going to produce our standard
+
+551
+00:42:04,940 --> 00:42:08,599
+distribution over words. These would
+probably be implemented as just a couple
+
+552
+00:42:08,599 --> 00:42:13,679
+of affine layers on top of the hidden
+state. And now this process repeats:
+
+553
+00:42:13,679 --> 00:42:17,739
+given this new probability distribution, we
+go back to the input feature grid
+
+554
+00:42:17,739 --> 00:42:22,949
+and compute a new summarization
+vector for the image.
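+
+A sketch of that three-input recurrence in numpy: previous hidden state, attended context vector, and current word embedding. A vanilla-RNN-style update is assumed here purely for illustration:
+
+```python
+import numpy as np
+
+def attend_step(h_prev, z, x_word, Whh, Wzh, Wxh, b):
+    """One recurrence with three inputs: previous hidden state h_prev,
+    attended context vector z, and current word embedding x_word."""
+    return np.tanh(Whh @ h_prev + Wzh @ z + Wxh @ x_word + b)
+
+H, D, E = 256, 512, 128
+h = attend_step(np.zeros(H), np.random.randn(D), np.random.randn(E),
+                Whh=np.random.randn(H, H) * 0.01,
+                Wzh=np.random.randn(H, D) * 0.01,
+                Wxh=np.random.randn(H, E) * 0.01,
+                b=np.zeros(H))
+```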
+
+555
+00:42:22,949 --> 00:42:25,618
+We take that vector together with the next word
+in the sentence to compute the new hidden
+
+556
+00:42:25,619 --> 00:42:34,930
+state and produce outputs. Okay, that jumped
+ahead a little bit, but then we'll actually
+
+557
+00:42:34,929 --> 00:42:50,109
+repeat this process over time to generate
+captions. Yeah? So the question is
+
+558
+00:42:50,110 --> 00:42:54,190
+where this feature grid comes
+from, and the answer is, when
+
+559
+00:42:54,190 --> 00:42:57,510
+you're doing an AlexNet, for example,
+you have conv1, conv2, conv3,
+
+560
+00:42:57,510 --> 00:43:01,670
+conv4, and by the time you get
+to conv5 the shape of that tensor is
+
+561
+00:43:01,670 --> 00:43:05,960
+now something like seven by seven by
+five hundred and twelve. So that
+
+562
+00:43:05,960 --> 00:43:11,050
+corresponds to a seven-by-seven spatial
+grid over the input, and at each grid
+
+563
+00:43:11,050 --> 00:43:15,450
+position there's a 512-dimensional
+feature vector. So those are just pulled
+
+564
+00:43:15,449 --> 00:43:27,858
+out of one of the convolutional layers
+of the network. Question?
+
+565
+00:43:27,858 --> 00:43:33,219
+So the question is about these probability
+distributions. We're actually
+
+566
+00:43:33,219 --> 00:43:37,899
+producing two different probability
+distributions at every time step: the
+
+567
+00:43:37,900 --> 00:43:42,400
+first is one of these D vectors in blue, so
+those are a probability distribution over
+
+568
+00:43:42,400 --> 00:43:46,920
+words in your vocabulary, like we had in
+normal image captioning, and also at
+
+569
+00:43:46,920 --> 00:43:50,759
+every time step we'll produce a second
+probability distribution over these
+
+570
+00:43:50,759 --> 00:43:55,170
+locations in the input image
+that's telling us where we want to
+
+571
+00:43:55,170 --> 00:43:59,690
+look at the next time step. This is
+actually quite... right, so if you were tuned in
+
+572
+00:43:59,690 --> 00:44:05,200
+a couple of lectures ago, as a quiz we wanted
+to see what framework you'd want to use
+
+573
+00:44:05,199 --> 00:44:09,679
+for different models, and we talked about
+maybe how weird RNNs would be a good fit for
+
+574
+00:44:09,679 --> 00:44:16,288
+Torch or TensorFlow, and I think this
+qualifies as a crazy RNN. So I
+
+575
+00:44:16,289 --> 00:44:19,749
+wanted to maybe talk in a little bit
+more detail about how these attention vectors,
+
+576
+00:44:19,748 --> 00:44:24,308
+how these summarization vectors, get
+produced. So this paper actually talks
+
+577
+00:44:24,309 --> 00:44:29,278
+about two different methods for
+generating these vectors. The idea, as
+
+578
+00:44:29,278 --> 00:44:33,559
+we saw on the last slide, is that we'll
+take our input image and get this grid
+
+579
+00:44:33,559 --> 00:44:38,019
+of features coming from one of the
+convolutional layers in our network, and
+
+580
+00:44:38,018 --> 00:44:41,899
+then at each time step our network will
+produce this probability distribution
+
+581
+00:44:41,900 --> 00:44:45,789
+over locations. This would be a fully
+connected layer and a softmax to
+
+582
+00:44:45,789 --> 00:44:50,329
+normalize it. And now the idea is that we
+want to take this grid of feature
+
+583
+00:44:50,329 --> 00:44:54,249
+vectors together with these probability
+distributions and produce a single
+
+584
+00:44:54,248 --> 00:44:59,798
+D-dimensional vector that
+summarizes the input image.
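+
+A sketch of the two heads just described: both are plausibly just affine layers plus softmaxes on top of the hidden state; the shapes and weight names are illustrative:
+
+```python
+import numpy as np
+
+def softmax(x):
+    e = np.exp(x - x.max())
+    return e / e.sum()
+
+H, V, L = 256, 10000, 49        # hidden size, vocab size, grid locations
+h = np.random.randn(H)
+W_vocab = np.random.randn(V, H) * 0.01
+W_loc = np.random.randn(L, H) * 0.01
+
+p_words = softmax(W_vocab @ h)  # distribution over the vocabulary
+p_locs = softmax(W_loc @ h)     # distribution over grid locations for the next step
+```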
+
+585
+00:44:59,798 --> 00:45:04,159
+The paper actually explores two different
+ways of solving this problem.
+
+586
+00:45:04,159 --> 00:45:08,969
+The easy way is to use what they call
+soft attention. Here our D-dimensional
+
+587
+00:45:08,969 --> 00:45:13,518
+vector z will just be a weighted sum
+of all the elements in the grid, where
+
+588
+00:45:13,518 --> 00:45:18,028
+each vector is weighted
+by its predicted probability.
+
+589
+00:45:18,028 --> 00:45:23,318
+This is actually very easy to implement;
+it sort of fits nicely as just another layer
+
+590
+00:45:23,318 --> 00:45:28,599
+in a neural network, and these gradients,
+like the derivative of this context
+
+591
+00:45:28,599 --> 00:45:32,588
+vector with respect to our predicted
+probabilities p, are quite nice and easy
+
+592
+00:45:32,588 --> 00:45:36,818
+to compute, so we can actually train
+this thing just using normal gradient
+
+593
+00:45:36,818 --> 00:45:40,019
+descent and backpropagation.
+
+594
+00:45:40,019 --> 00:45:44,559
+But they actually explore
+another option for computing this
+
+595
+00:45:44,559 --> 00:45:48,210
+feature vector, and that's something
+called hard attention. So instead of
+
+596
+00:45:48,210 --> 00:45:52,630
+taking this weighted sum, we might want
+to select just a single element of the
+
+597
+00:45:52,630 --> 00:45:57,940
+grid to attend to. So you might
+imagine, one simple thing to do is
+
+598
+00:45:57,940 --> 00:46:02,440
+just to pick the element of the grid
+with the highest probability and just
+
+599
+00:46:02,440 --> 00:46:07,269
+pull out the feature vector
+corresponding to that argmax position.
+
+600
+00:46:07,269 --> 00:46:13,150
+The problem is, if you think
+about this argmax case, if
+
+601
+00:46:13,150 --> 00:46:16,829
+you think about this derivative, the
+derivative with respect to our
+
+602
+00:46:16,829 --> 00:46:18,360
+distribution p,
+
+603
+00:46:18,360 --> 00:46:22,980
+it turns out that this is not very
+friendly for backpropagation anymore. So
+
+604
+00:46:22,980 --> 00:46:29,059
+imagine in our argmax case, suppose
+that p_a is actually the largest
+
+605
+00:46:29,059 --> 00:46:33,119
+element in our input; now what
+happens if we change p_a just a little
+
+606
+00:46:33,119 --> 00:46:40,130
+bit? Right, so if p_a is the argmax and
+then we just jiggle the probability
+
+607
+00:46:40,130 --> 00:46:44,869
+distribution just a little bit, then p_a will
+still be the argmax, so we'll still
+
+608
+00:46:44,869 --> 00:46:49,400
+select the same vector from the input,
+which means that actually the derivative
+
+609
+00:46:49,400 --> 00:46:53,990
+of this vector z with respect to our
+predicted probabilities is going to be 0
+
+610
+00:46:53,989 --> 00:46:58,689
+almost everywhere. So that's very
+bad, because now we can't really use
+
+611
+00:46:58,690 --> 00:47:02,970
+backpropagation anymore to train this
+thing. So it turns out that they propose
+
+612
+00:47:02,969 --> 00:47:06,549
+another method based on reinforcement
+learning to actually train the model in
+
+613
+00:47:06,550 --> 00:47:12,710
+this context where you want to select a
+single element, but that's a little bit
+
+614
+00:47:12,710 --> 00:47:16,260
+more complex, so we're not going to talk
+about that in this lecture. Just be
+
+615
+00:47:16,260 --> 00:47:18,900
+aware that that is something you'll
+see: the difference between soft
+
+616
+00:47:18,900 --> 00:47:26,010
+attention and hard attention,
+where you actually pick one.
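+
+A quick numpy sketch contrasting the two: the soft context vector is differentiable in p, while the hard argmax selection has zero gradient with respect to p almost everywhere. Shapes are illustrative:
+
+```python
+import numpy as np
+
+L, D = 49, 512
+feats = np.random.randn(L, D)   # grid of feature vectors
+p = np.random.rand(L)
+p /= p.sum()                    # attention distribution over locations
+
+z_soft = p @ feats              # soft attention: dz/dp = feats, easy to backprop
+z_hard = feats[np.argmax(p)]    # hard attention: jiggling p rarely changes the
+                                # argmax, so the gradient w.r.t. p is zero
+                                # almost everywhere
+```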
+
+617
+00:47:26,010 --> 00:47:30,450
+So now we can look at some pretty results
+from this model. Since we're actually generating a
+
+618
+00:47:30,449 --> 00:47:34,480
+probability distribution over grid
+locations at every time step, we can
+
+619
+00:47:34,480 --> 00:47:38,519
+visualize that probability distribution
+as we generate each word of our
+
+620
+00:47:38,519 --> 00:47:44,039
+generated caption. So for this input
+image that shows a bird, they ran
+
+621
+00:47:44,039 --> 00:47:48,279
+both their hard attention model and their
+soft attention model, and in this case both
+
+622
+00:47:48,280 --> 00:47:51,650
+produced the caption "a bird flying over
+a body of water."
+
+623
+00:47:51,650 --> 00:47:57,090
+For these two models they visualize
+what that probability distribution looks
+
+624
+00:47:57,090 --> 00:48:01,690
+like. The top shows the
+soft attention, so you can
+
+625
+00:48:01,690 --> 00:48:04,849
+see that it's sort of diffuse, since it's
+spreading probability over every
+
+626
+00:48:04,849 --> 00:48:09,309
+location in the image, and the bottom
+is just showing the one single element
+
+627
+00:48:09,309 --> 00:48:16,289
+that it pulled out. And they actually have
+quite nice semantic meanings: you
+
+628
+00:48:16,289 --> 00:48:19,779
+can see that with the model, and
+especially the soft attention on the top,
+
+629
+00:48:19,780 --> 00:48:23,340
+which I think gives very nice results, when
+it's talking about the bird and talking
+
+630
+00:48:23,340 --> 00:48:26,610
+about flying, it sort of focuses right
+on the bird, and then when it's talking
+
+631
+00:48:26,610 --> 00:48:30,820
+about the water it kind of focuses on
+everything else. Another thing to
+
+632
+00:48:30,820 --> 00:48:34,269
+point out is that it didn't receive any
+supervision at training time for which
+
+633
+00:48:34,269 --> 00:48:38,869
+parts of the image it should be attending
+to; it just made up its own mind to
+
+634
+00:48:38,869 --> 00:48:43,289
+attend to those parts based on whatever
+would help it caption things better, and
+
+635
+00:48:43,289 --> 00:48:46,480
+it's pretty cool that we actually get
+these interpretable results just out of
+
+636
+00:48:46,480 --> 00:48:51,920
+this captioning task. We can look at
+a couple of other results because
+
+637
+00:48:51,920 --> 00:48:56,340
+they're fun. We can see that when we have
+the woman throwing the
+
+638
+00:48:56,340 --> 00:49:01,079
+frisbee in the park, it focuses on the
+frisbee, and when talking about the dog it actually recognizes the
+
+639
+00:49:01,079 --> 00:49:05,259
+dog. And especially interesting is this
+giraffe in the bottom right: when it
+
+640
+00:49:05,260 --> 00:49:08,790
+generates the word "trees" it's actually
+focusing on all the stuff in the
+
+641
+00:49:08,789 --> 00:49:13,440
+background and not just the giraffe. And
+again, these are just coming out with no
+
+642
+00:49:13,440 --> 00:49:22,179
+supervision at all, just based on the
+captioning task. Question? Yes, so the
+
+643
+00:49:22,179 --> 00:49:27,440
+question is when you would prefer hard
+versus soft attention. So there are,
+
+644
+00:49:27,440 --> 00:49:31,380
+I think, sort of two motivations that
+people usually give for wanting to even
+
+645
+00:49:31,380 --> 00:49:33,530
+do attention at all in the first place.
+
+646
+00:49:33,530 --> 00:49:37,580
+One of those is just to give nice
+interpretable outputs, and I think you get
+
+647
+00:49:37,579 --> 00:49:42,710
+nice interpretable outputs in either
+case, at least theoretically; maybe the hard
+
+648
+00:49:42,710 --> 00:49:46,130
+attention figure wasn't
+quite as pretty. But the other motivation for
+
+649
+00:49:46,130 --> 00:49:49,970
+using attention is to relieve
+computational burden: especially when you
+
+650
+00:49:49,969 --> 00:49:54,989
+have a very, very large input, it might
+be computationally expensive to actually
+
+651
+00:49:54,989 --> 00:49:58,619
+process that whole input on every time
+step, and it might be more efficient
+
+652
+00:49:58,619 --> 00:50:02,869
+computationally if we can just focus on
+one part of the input at each time step
+
+653
+00:50:02,869 --> 00:50:07,380
+and only process a small subset per time
+step. So with soft attention, because
+
+654
+00:50:07,380 --> 00:50:10,730
+we're doing this sort of averaging over
+all positions, we don't get any
+
+655
+00:50:10,730 --> 00:50:14,369
+computational savings; we're still
+processing the whole input on every time
+
+656
+00:50:14,369 --> 00:50:17,799
+step. But with hard attention we
+actually do get computational savings,
+
+657
+00:50:17,800 --> 00:50:22,680
+since we're explicitly picking out
+some small subset of the input. So I
+
+658
+00:50:22,679 --> 00:50:26,289
+think that's the big benefit. Also,
+hard attention takes reinforcement
+
+659
+00:50:26,289 --> 00:50:41,420
+learning, and using reinforcement learning
+makes you look smarter. That's kind of it. Question? Yeah, so
+
+660
+00:50:41,420 --> 00:50:46,150
+the question is how this works at
+all, and I think the answer is, it's
+
+661
+00:50:46,150 --> 00:50:49,789
+really learning sort of correlation
+structures in the input, right? It's
+
+662
+00:50:49,789 --> 00:50:54,779
+seen many examples of images with dogs
+and many sentences with dogs, but
+
+663
+00:50:54,780 --> 00:50:57,480
+for those different images with dogs, the
+dogs tend to appear in different
+
+664
+00:50:57,480 --> 00:51:01,349
+positions in the input, and I guess it
+turns out through the optimization
+
+665
+00:51:01,349 --> 00:51:05,659
+procedure that actually putting more
+weight on the places where the dog
+
+666
+00:51:05,659 --> 00:51:10,399
+actually exists helps the
+captioning task in some way. So I don't
+
+667
+00:51:10,400 --> 00:51:14,460
+think there's a very good answer; it
+just happens to work. Also, I'm
+
+668
+00:51:14,460 --> 00:51:18,500
+not sure... obviously these
+are figures from a
+
+669
+00:51:18,500 --> 00:51:23,300
+paper, not like random results, so I'm not
+sure how well it works on random images.
+
+670
+00:51:23,300 --> 00:51:31,870
+Another thing to really point out
+about this, especially this soft
+
+671
+00:51:31,869 --> 00:51:35,739
+attention model, is that it's sort of
+constrained to this fixed grid from the
+
+672
+00:51:35,739 --> 00:51:41,199
+convolutional feature map. We're
+getting these nice diffuse-
+
+673
+00:51:41,199 --> 00:51:44,449
+looking things, but those are just sort
+of blurring out this
+
+674
+00:51:44,449 --> 00:51:48,210
+distribution, and the model does not
+really have the capacity to look at
+
+675
+00:51:48,210 --> 00:51:52,220
+arbitrary regions of the input; it's only
+allowed to look at these fixed grid
+
+676
+00:51:52,219 --> 00:51:55,959
+regions.
+
+677
+00:51:55,960 --> 00:51:59,690
+I should also point out that this idea
+of soft attention was not really
+
+678
+00:51:59,690 --> 00:52:04,789
+introduced in this paper; I think the
+first paper that really had this notion
+
+679
+00:52:04,789 --> 00:52:09,159
+of soft attention came from machine
+translation. Here it's a similar
+
+680
+00:52:09,159 --> 00:52:13,299
+motivation: we want to take some
+input sentence, here in Spanish, and then
+
+681
+00:52:13,300 --> 00:52:17,960
+produce an output sentence in English,
+and this would be done with a recurrent
+
+682
+00:52:17,960 --> 00:52:22,179
+neural network sequence-to-sequence
+model, where we would first read in our
+
+683
+00:52:22,179 --> 00:52:26,588
+input sentence with a recurrent network
+and then generate an output sentence,
+
+684
+00:52:26,588 --> 00:52:29,269
+very similar to what we
+would do in captioning.
+
+685
+00:52:29,269 --> 00:52:33,119
+But in this paper they wanted to
+actually have attention over the input
+
+686
+00:52:33,119 --> 00:52:38,599
+sentence as they generated their output.
+So the exact mechanism is a little bit
+
+687
+00:52:38,599 --> 00:52:43,080
+different, but the intuition is the same:
+now when we generate this first
+
+688
+00:52:43,079 --> 00:52:47,469
+word, "my", we want to compute a
+probability distribution not over
+
+689
+00:52:47,469 --> 00:52:52,000
+regions in an image, but instead over
+words in the input sentence. So we're
+
+690
+00:52:52,000 --> 00:52:55,289
+going to get a distribution that hopefully
+will focus on the first word of the Spanish
+
+691
+00:52:55,289 --> 00:52:59,170
+sentence, and then we'll take some
+features from each word, reweight
+
+692
+00:52:59,170 --> 00:53:03,780
+them, and feed them back into the next
+time step, and this process repeats
+
+693
+00:53:03,780 --> 00:53:08,820
+at every time step of the network. So
+this idea of soft attention is very
+
+694
+00:53:08,820 --> 00:53:12,230
+easily applicable not only to image
+captioning but also to machine
+
+695
+00:53:12,230 --> 00:53:18,990
+translation. Question? The question is how
+you do this for variable-length
+
+696
+00:53:18,989 --> 00:53:23,409
+sentences, and that's something I glossed
+over a little bit, but the idea is you
+
+697
+00:53:23,409 --> 00:53:26,980
+use what's called content-based
+addressing. For image captioning
+
+698
+00:53:26,980 --> 00:53:31,559
+we know ahead of time that there is
+this fixed, maybe seven-by-seven, grid, so
+
+699
+00:53:31,559 --> 00:53:35,579
+we just produce a probability
+distribution directly. Instead, in this
+
+700
+00:53:35,579 --> 00:53:40,440
+model, as the encoder reads the input
+sentence it produces some vector that
+
+701
+00:53:40,440 --> 00:53:45,320
+encodes each word in the input
+sentence. Now in the decoder, instead
+
+702
+00:53:45,320 --> 00:53:49,300
+of directly producing a
+probability distribution, it
+
+703
+00:53:49,300 --> 00:53:52,900
+instead spits out sort of a vector that
+gets dot-producted with each of
+
+704
+00:53:52,900 --> 00:53:57,000
+those encoded vectors in the input, and
+then those dot products get
+
+705
+00:53:57,000 --> 00:54:02,159
+renormalized and converted
+to a distribution.
+
+706
+00:54:02,159 --> 00:54:06,940
+So this idea of soft attention is
+actually pretty easy to implement and
+
+707
+00:54:06,940 --> 00:54:10,970
+pretty easy to train, so it's been very
+popular in the last year or so, and
+
+708
+00:54:10,969 --> 00:54:14,489
+there's a whole bunch of papers that
+apply this idea of soft attention to a
+
+709
+00:54:14,489 --> 00:54:18,349
+whole bunch of different problems. There
+have been a couple of papers looking
+
+710
+00:54:18,349 --> 00:54:22,360
+at soft attention for machine
+translation, as we saw.
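+
+A numpy sketch of the content-based addressing just described for variable-length inputs: dot-product the decoder's query vector against each encoder state, then renormalize with a softmax. All names are illustrative:
+
+```python
+import numpy as np
+
+def softmax(x):
+    e = np.exp(x - x.max())
+    return e / e.sum()
+
+T, D = 7, 256                       # input sentence length (can vary), state size
+enc_states = np.random.randn(T, D)  # one encoded vector per input word
+query = np.random.randn(D)          # vector produced by the decoder this time step
+
+scores = enc_states @ query         # (T,) dot products, one per input word
+p = softmax(scores)                 # attention over the input words
+context = p @ enc_states            # (D,) weighted summary fed back to the decoder
+```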
+
+711
+00:54:22,360 --> 00:54:24,230
+There have been a couple of
+papers that actually want to do
+
+712
+00:54:24,230 --> 00:54:28,179
+speech transcription, where they read in
+an audio signal and then output the
+
+713
+00:54:28,179 --> 00:54:32,589
+words in English, so there have been a
+couple of papers that use soft attention
+
+714
+00:54:32,590 --> 00:54:37,130
+over the input audio sequence to help
+with that task. There's been at
+
+715
+00:54:37,130 --> 00:54:41,300
+least one paper on using soft attention
+for video captioning: here you read in
+
+716
+00:54:41,300 --> 00:54:45,260
+some sequence of frames and then you
+output some sequence of words, and you
+
+717
+00:54:45,260 --> 00:54:49,110
+want to have attention over the
+frames of the input sequence as you're
+
+718
+00:54:49,110 --> 00:54:53,050
+generating your caption. You can see
+that maybe for this little video
+
+719
+00:54:53,050 --> 00:54:57,240
+sequence they output "someone is frying
+a fish in a pot", and when they generate
+
+720
+00:54:57,239 --> 00:55:01,169
+the word "someone" it actually attends much
+more to this second frame in the video
+
+721
+00:55:01,170 --> 00:55:05,590
+sequence, and when they generate the word
+"frying" it attends much more to this last
+
+722
+00:55:05,590 --> 00:55:11,480
+element in the video sequence. There have
+also been a couple of papers for this task
+
+723
+00:55:11,480 --> 00:55:16,059
+of question answering. Here the
+setup is that you read in a natural
+
+724
+00:55:16,059 --> 00:55:20,590
+language question and you also read in
+an image, and the model needs to
+
+725
+00:55:20,590 --> 00:55:22,870
+produce an answer to that question,
+
+726
+00:55:22,869 --> 00:55:28,139
+produce the answer to that question in
+natural language. And there have been a
+
+727
+00:55:28,139 --> 00:55:31,869
+couple of papers that explore the idea of
+spatial attention over the image in
+
+728
+00:55:31,869 --> 00:55:35,420
+order to help with this problem of
+question answering. Another thing to
+
+729
+00:55:35,420 --> 00:55:38,860
+point out is that some of these papers
+have great names: there was
+
+730
+00:55:38,860 --> 00:55:43,000
+Show and Tell, there was Show, Attend
+and Tell, there was Listen, Attend and
+
+731
+00:55:43,000 --> 00:55:45,039
+Spell,
+
+732
+00:55:45,039 --> 00:55:49,999
+and this one is Ask, Attend and
+Answer. So I really enjoy the
+
+733
+00:55:49,998 --> 00:55:56,808
+creativity with naming in this line of
+work. And this idea of soft
+
+734
+00:55:56,809 --> 00:55:59,910
+attention is pretty easy to implement, so
+a lot of people have just applied it to
+
+735
+00:55:59,909 --> 00:56:05,899
+tons of tasks. But remember, we saw this
+problem with this sort of implementation
+
+736
+00:56:05,900 --> 00:56:09,709
+of soft attention, and that's that we
+cannot attend to arbitrary regions in
+
+737
+00:56:09,708 --> 00:56:14,038
+the input; instead we're constrained and
+can only attend to this fixed grid given
+
+738
+00:56:14,039 --> 00:56:18,699
+by the convolutional feature map. So the
+question is whether we can overcome this
+
+739
+00:56:18,699 --> 00:56:23,559
+restriction and still attend to
+arbitrary input regions somehow in a
+
+740
+00:56:23,559 --> 00:56:28,089
+different way, and I think a
+
+741
+00:56:28,088 --> 00:56:32,900
+precursor to this type of work is this
+paper from Alex Graves back in 2013. So
+
+742
+00:56:32,900 --> 00:56:38,249
+here he wanted to read as input a natural
+language sentence and then generate as
+
+743
+00:56:38,248 --> 00:56:43,598
+output actually an image that would be
+handwriting, like writing out
+
+744
+00:56:43,599 --> 00:56:48,528
+that sentence in handwriting. And the
+way that he actually has attention
+
+745
+00:56:48,528 --> 00:56:53,418
+over this output image is kind of
+cool: he's actually predicting
+
+746
+00:56:53,418 --> 00:56:57,608
+the parameters of a Gaussian mixture
+model over the output image, and then he
+
+747
+00:56:57,608 --> 00:57:02,739
+uses that to actually attend to
+arbitrary parts of the output image. And
+
+748
+00:57:02,739 --> 00:57:07,028
+this actually works really, really well.
+On the right, some of these are
+
+749
+00:57:07,028 --> 00:57:12,259
+actually written by people and the rest
+of them were written by his
+
+750
+00:57:12,259 --> 00:57:16,269
+network. So can you tell the difference
+between the generated and the real
+
+751
+00:57:16,268 --> 00:57:24,418
+handwriting? I couldn't. So it turns out
+that the top one is real and the
+
+752
+00:57:24,418 --> 00:57:31,049
+bottom four are all generated
+by the network.
+
+753
+00:57:31,050 --> 00:57:35,580
+Yeah, maybe the real ones have more
+variance between the letters or
+
+754
+00:57:35,579 --> 00:57:39,380
+something like that. But these results
+work really well, and actually he has an
+
+755
+00:57:39,380 --> 00:57:42,820
+online demo that you can go and try
+that runs in your browser: you can just
+
+756
+00:57:42,820 --> 00:57:46,800
+type in words and it will generate the
+handwriting for you, which is kind of fun.
+
+757
+00:57:46,800 --> 00:57:52,840
+Another paper that we saw
+already is DRAW, which sort of takes
+
+758
+00:57:52,840 --> 00:57:56,500
+this idea of arbitrary attention
+and extends it to a couple more
+
+759
+00:57:56,500 --> 00:58:01,050
+real-world problems, not just handwriting
+generation. One task they consider is
+
+760
+00:58:01,050 --> 00:58:05,960
+image classification: here we want to
+classify these digits, but in the process
+
+761
+00:58:05,960 --> 00:58:09,920
+of classifying we're actually going to
+attend to arbitrary regions of the input
+
+762
+00:58:09,920 --> 00:58:14,639
+image in order to help with this
+classification task. So this is
+
+763
+00:58:14,639 --> 00:58:17,909
+kind of cool: it sort of learns on its
+own that it needs to attend to these
+
+764
+00:58:17,909 --> 00:58:22,710
+digits in order to help with image
+classification. And with DRAW they also
+
+765
+00:58:22,710 --> 00:58:27,849
+consider the idea of generating
+arbitrary output images, with a similar
+
+766
+00:58:27,849 --> 00:58:31,589
+sort of motivation as the handwriting
+generation, where we're going to have
+
+767
+00:58:31,590 --> 00:58:35,740
+arbitrary attention over the output
+image and just generate this output over
+
+768
+00:58:35,739 --> 00:58:42,589
+time. I think we saw this video
+before, but it's really cool. This is
+
+769
+00:58:42,590 --> 00:58:47,190
+the DRAW network from DeepMind.
+You can see that here we're
+
+770
+00:58:47,190 --> 00:58:51,200
+doing a classification task; it sort of
+learns to attend to arbitrary regions in
+
+771
+00:58:51,199 --> 00:58:55,439
+the input, and when we generate, we're
+going to attend to arbitrary regions in
+
+772
+00:58:55,440 --> 00:58:59,579
+the output to actually generate these
+digits. So it can generate multiple
+
+773
+00:58:59,579 --> 00:59:04,000
+digits at a time, and it can actually
+generate these
+
+774
+00:59:04,000 --> 00:59:10,639
+house numbers. So this is really cool,
+and as you can see, the region it
+
+775
+00:59:10,639 --> 00:59:13,920
+was attending to was actually growing
+and shrinking over time and sort of
+
+776
+00:59:13,920 --> 00:59:17,430
+moving continuously over the image; it
+was definitely not constrained to a
+
+777
+00:59:17,429 --> 00:59:21,690
+fixed grid like we saw with Show, Attend
+and Tell. So the way that this
+
+778
+00:59:21,690 --> 00:59:26,840
+paper works is a little bit
+weird, and some follow-up work from
+
+779
+00:59:26,840 --> 00:59:34,260
+DeepMind I think was actually more clear.
+Why is this... oh, my focus is off... okay,
+
+780
+00:59:34,260 --> 00:59:38,630
+right. So there's this follow-up paper
+that uses a very similar
+
+781
+00:59:38,630 --> 00:59:43,500
+attention mechanism, called spatial
+transformer networks, which I think is much
+
+782
+00:59:43,500 --> 00:59:44,500
+easier to understand
+
+783
+00:59:44,500 --> 00:59:49,039
+and presented in a very clean way. So
+the idea is that we have this
+
+784
+00:59:49,039 --> 00:59:53,369
+input image, our favorite bird, and
+then we want to have this sort of
+
+785
+00:59:53,369 --> 00:59:57,589
+continuous set of variables telling us
+where we want to attend. You might
+
+786
+00:59:57,590 --> 01:00:01,579
+imagine that we have the coordinates of the
+center and the width and height of some box
+
+787
+01:00:01,579 --> 01:00:06,170
+for the region we want to attend to, and
+then we want to have some function that
+
+788
+01:00:06,170 --> 01:00:10,240
+takes our input image and these
+continuous attention coordinates and
+
+789
+01:00:10,239 --> 01:00:14,919
+then produces some fixed-size output, and
+we want to be able to do this in a
+
+790
+01:00:14,920 --> 01:00:21,840
+differentiable way. So this seems
+kind of hard, right? You can imagine that, at
+
+791
+01:00:21,840 --> 01:00:26,250
+least with the idea of cropping, these
+inputs can't really be continuous;
+
+792
+01:00:26,250 --> 01:00:30,590
+they need to be sort of pixel values, so
+they're constrained to integers, and it's not
+
+793
+01:00:30,590 --> 01:00:34,550
+really clear exactly how we can make
+this function continuous or differentiable.
+
+794
+01:00:34,550 --> 01:00:39,210
+They actually came up with a very
+nice solution, and the idea is that we're
+
+795
+01:00:39,210 --> 01:00:44,679
+going to write down a parametrized function
+that maps from coordinates of pixels
+
+796
+01:00:44,679 --> 01:00:50,469
+in the output to coordinates of pixels
+in the input. So here we're going to say
+
+797
+01:00:50,469 --> 01:00:54,839
+that this upper right-hand pixel
+in the output has
+
+798
+01:00:54,840 --> 01:00:59,700
+the coordinates (x^t, y^t) in the output,
+and we're going to compute the
+
+799
+01:00:59,699 --> 01:01:04,480
+coordinates (x^s, y^s) in the input
+image using this parametrized
+
+800
+01:01:04,480 --> 01:01:08,900
+affine function. So that's a nice
+differentiable function that we can
+
+801
+01:01:08,900 --> 01:01:13,349
+differentiate with respect to these
+affine transform coordinates. Then we can
+
+802
+01:01:13,349 --> 01:01:17,059
+repeat this process again: for maybe
+the upper left-hand pixel in the
+
+803
+01:01:17,059 --> 01:01:21,219
+output image, we can use this parametrized
+function to map to the coordinates of
+
+804
+01:01:21,219 --> 01:01:27,199
+the pixel in the input. Now we can repeat
+this for all pixels in our output.
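+
+In equations, following the spatial transformer formulation, the parametrized mapping from output (target) pixel coordinates to input (source) coordinates is the affine transform:
+
+```latex
+\begin{pmatrix} x^s \\ y^s \end{pmatrix}
+=
+\begin{pmatrix}
+\theta_{11} & \theta_{12} & \theta_{13} \\
+\theta_{21} & \theta_{22} & \theta_{23}
+\end{pmatrix}
+\begin{pmatrix} x^t \\ y^t \\ 1 \end{pmatrix}
+```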
+
+805
+01:01:27,199 --> 01:01:31,689
+That gives us something called a sampling
+grid. The idea is that this will be
+
+806
+01:01:31,690 --> 01:01:36,480
+our output image, and then for each pixel
+in the output the sampling grid tells us
+
+807
+01:01:36,480 --> 01:01:41,610
+where in the input that pixel should
+come from. How many of you have taken a
+
+808
+01:01:41,610 --> 01:01:47,590
+computer graphics course? Not many. So this
+looks kind of like texture
+
+809
+01:01:47,590 --> 01:01:52,510
+mapping, doesn't it? So they take this
+idea from texture mapping in computer
+
+810
+01:01:52,510 --> 01:01:56,300
+graphics and just use bilinear
+interpolation to compute the output once
+
+811
+01:01:56,300 --> 01:01:57,720
+we have the sampling grid.
+
+812
+01:01:57,719 --> 01:02:02,669
+So now this allows our
+network to actually
+
+813
+01:02:02,670 --> 01:02:07,450
+attend to arbitrary parts of the input
+in a nice differentiable way, where our
+
+814
+01:02:07,449 --> 01:02:11,789
+network will now just predict these
+transform parameters theta, and that
+
+815
+01:02:11,789 --> 01:02:16,639
+will allow the whole thing to attend to
+arbitrary regions of the input image. So
+
+816
+01:02:16,639 --> 01:02:20,199
+they put this all together into a
+nice little self-contained module that
+
+817
+01:02:20,199 --> 01:02:24,608
+they call a spatial transformer. The
+spatial transformer receives some input,
+
+818
+01:02:24,608 --> 01:02:29,679
+which you can think of as our raw
+input image, and then actually runs this
+
+819
+01:02:29,679 --> 01:02:33,949
+small localization network, which could
+be a small fully connected network or a
+
+820
+01:02:33,949 --> 01:02:38,409
+very shallow convolutional network, and
+this localization network will
+
+821
+01:02:38,409 --> 01:02:44,500
+actually produce as output these affine
+transform parameters theta. Now these
+
+822
+01:02:44,500 --> 01:02:48,829
+affine transform parameters will be
+used to compute a sampling grid: now
+
+823
+01:02:48,829 --> 01:02:51,750
+that we've predicted this affine
+transform from the localization
+
+824
+01:02:51,750 --> 01:02:56,280
+network, we map the coordinates
+of each pixel in the
+
+825
+01:02:56,280 --> 01:03:02,280
+output back to the input, and this is a
+nice, smooth, differentiable function. Now,
+
+826
+01:03:02,280 --> 01:03:06,230
+once we have the sampling grid, we can
+just apply bilinear interpolation to
+
+827
+01:03:06,230 --> 01:03:11,309
+compute the values of the pixels in the
+output. And if you think about
+
+828
+01:03:11,309 --> 01:03:15,588
+what this thing is doing, it's clear that
+every single part of this network is, one,
+
+829
+01:03:15,588 --> 01:03:21,159
+continuous and, two, differentiable, so this
+thing can be trained jointly without any
+
+830
+01:03:21,159 --> 01:03:26,579
+crazy reinforcement learning stuff, which
+is quite nice. Although, one sort of
+
+831
+01:03:26,579 --> 01:03:31,789
+caveat to know about bilinear sampling: if
+you know how bilinear sampling works, it
+
+832
+01:03:31,789 --> 01:03:36,449
+means that every pixel in the output is
+going to be a weighted average of four
+
+833
+01:03:36,449 --> 01:03:41,639
+pixels in the input, so those gradients
+are actually very local. So this is
+
+834
+01:03:41,639 --> 01:03:45,549
+continuous and differentiable, which is nice,
+but I don't think you get a whole lot of
+
+835
+01:03:45,550 --> 01:03:50,300
+gradient signal through
+the bilinear sampling.
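+
+A self-contained numpy sketch of the two pieces just described: generating the sampling grid from the predicted theta and then bilinear sampling. Normalized coordinates follow the usual spatial transformer convention, but the specific names and shapes are illustrative:
+
+```python
+import numpy as np
+
+def affine_grid(theta, out_h, out_w):
+    """Map each output pixel (x_t, y_t) to input coords via theta @ [x_t, y_t, 1].
+    theta is the 2x3 affine matrix predicted by the localization net;
+    coordinates are normalized to [-1, 1]."""
+    ys, xs = np.meshgrid(np.linspace(-1, 1, out_h),
+                         np.linspace(-1, 1, out_w), indexing='ij')
+    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(out_h * out_w)])  # (3, H*W)
+    return (theta @ coords).T.reshape(out_h, out_w, 2)   # (H, W, 2) of (x_s, y_s)
+
+def bilinear_sample(img, grid):
+    """Each output pixel is a weighted average of the 4 input pixels
+    around its sampled location, so gradients w.r.t. the input are very local."""
+    H, W = img.shape
+    x = (grid[..., 0] + 1) * (W - 1) / 2   # back to pixel coordinates
+    y = (grid[..., 1] + 1) * (H - 1) / 2
+    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
+    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
+    dx, dy = x - x0, y - y0
+    return (img[y0, x0] * (1 - dx) * (1 - dy) + img[y0, x0 + 1] * dx * (1 - dy) +
+            img[y0 + 1, x0] * (1 - dx) * dy + img[y0 + 1, x0 + 1] * dx * dy)
+
+img = np.random.rand(28, 28)                # grayscale input
+theta = np.array([[0.5, 0.0, 0.1],          # zoom in 2x, shift right a bit
+                  [0.0, 0.5, 0.0]])
+out = bilinear_sample(img, affine_grid(theta, 14, 14))   # (14, 14) attended crop
+```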
+
+836
+01:03:50,300 --> 01:03:54,410
+But once you have this nice spatial
+transformer module, we can just insert it into
+
+837
+01:03:54,409 --> 01:03:58,739
+existing networks to sort of let them
+learn to attend to things. So they
+
+838
+01:03:58,739 --> 01:04:03,739
+consider this classification task, very
+similar to the DRAW paper, where they
+
+839
+01:04:03,739 --> 01:04:08,118
+want to classify these warped versions
+of MNIST digits. They actually
+
+840
+01:04:08,119 --> 01:04:09,519
+consider several other,
+
+841
+01:04:09,519 --> 01:04:13,610
+more complicated transforms, not just
+these affine transforms. You can also
+
+842
+01:04:13,610 --> 01:04:18,260
+vary the mapping from your
+output pixels to input pixels: on
+
+843
+01:04:18,260 --> 01:04:21,470
+the previous slide we showed an affine
+transform, but they also consider
+
+844
+01:04:21,469 --> 01:04:25,339
+projective transforms and also thin
+plate splines; the idea is you just
+
+845
+01:04:25,340 --> 01:04:28,970
+want some parametrized and
+differentiable function, and you could go
+
+846
+01:04:28,969 --> 01:04:34,829
+crazy with that part. So here on the left,
+the network is just trying to classify
+
+847
+01:04:34,829 --> 01:04:38,380
+these digits that are warped: on the
+left we have different versions of
+
+848
+01:04:38,380 --> 01:04:43,340
+warped digits, this middle column is
+showing the different thin plate
+
+849
+01:04:43,340 --> 01:04:47,460
+splines that it's using to attend to
+part of the image, and then the right
+
+850
+01:04:47,460 --> 01:04:51,590
+column shows the output of the spatial
+transformer module, which has not only
+
+851
+01:04:51,590 --> 01:04:56,250
+attended to that region but also
+unwarped it according to those splines.
+
+852
+01:04:56,250 --> 01:05:01,730
+And on the right
+they're using
+
+853
+01:05:01,730 --> 01:05:05,559
+an affine transform, not thin plate splines.
+You can see that this is actually doing
+
+854
+01:05:05,559 --> 01:05:09,369
+more than just attending to the input;
+it's actually transforming the input as well.
+
+855
+01:05:09,369 --> 01:05:14,849
+For example, in this middle column
+this is a four, but it's actually rotated
+
+856
+01:05:14,849 --> 01:05:19,069
+by something like ninety
+degrees, so by using this affine
+
+857
+01:05:19,070 --> 01:05:23,140
+transform, the network can not only
+attend to the four but also rotate it into
+
+858
+01:05:23,139 --> 01:05:27,839
+the proper position for the downstream
+classification network. And this is all
+
+859
+01:05:27,840 --> 01:05:31,930
+very cool, and similar to the
+soft attention, we don't need
+
+860
+01:05:31,929 --> 01:05:35,949
+explicit supervision; it can just decide
+for itself where it wants to attend in
+
+861
+01:05:35,949 --> 01:05:41,710
+order to solve problems. These guys
+have a fancy video as well, which is very
+
+862
+01:05:41,710 --> 01:05:53,860
+impressive. So this is the transformer
+module that we just unpacked, and here
+
+863
+01:05:53,860 --> 01:05:58,930
+what we're showing right now is
+actually a classification task,
+
+864
+01:05:58,929 --> 01:06:03,389
+but we're varying the input continuously.
+You can see that for these different inputs
+
+865
+01:06:03,389 --> 01:06:08,429
+the network learns to attend to them and
+then actually canonicalizes the
+
+866
+01:06:08,429 --> 01:06:13,169
+digit to sort of a fixed, known pose, and
+as we vary that input and move it around
+
+867
+01:06:13,170 --> 01:06:18,500
+the image, the network still does a good
+job of locking onto the digit. And on the
+
+868
+01:06:18,500 --> 01:06:23,059
+right you can see that sometimes it can
+fix rotations as well, right? So
+
+869
+01:06:23,059 --> 01:06:26,809
+here on the left we're actually rotating
+that digit, and the network actually
+
+870
+01:06:26,809 --> 01:06:31,619
+learns to un-rotate the digit and
+canonicalize the pose, and again both
+
+871
+01:06:31,619 --> 01:06:36,420
+with affine transforms or thin plate
+splines. This is using even crazier
+
+872
+01:06:36,420 --> 01:06:40,389
+warping with projective transforms; you
+can see that it does a really good job
+
+873
+01:06:40,389 --> 01:06:48,099
+of learning to attend and also to unwarp.
+And they do quite a lot of other
+
+874
+01:06:48,099 --> 01:06:52,829
+experiments: instead of classification,
+they learn to add together two warped
+
+875
+01:06:52,829 --> 01:06:58,369
+digits, which is a kind of weird task, but
+it works. So their network is receiving
+
+876
+01:06:58,369 --> 01:07:05,389
+two input images and outputs their
+sum, and it learns, even though
+
+877
+01:07:05,389 --> 01:07:08,679
+this is kind of a weird task, that it
+needs to attend to and unwarp
+
+878
+01:07:08,679 --> 01:07:15,659
+those images. So this is during
+optimization, right. This is a task
+
+879
+01:07:15,659 --> 01:07:20,009
+called co-localization: the idea is that
+the network is going to receive two
+
+880
+01:07:20,010 --> 01:07:25,560
+images as input, maybe two different
+images of fours, and the task is to say
+
+881
+01:07:25,559 --> 01:07:31,179
+whether those images are the
+same or different, and then also,
+
+882
+01:07:31,179 --> 01:07:34,750
+by using spatial transformers inside, they
+end up learning to localize those things as
+
+883
+01:07:34,750 --> 01:07:38,139
+well. You can see that over the course
+of training it learns to actually
+
+884
+01:07:38,139 --> 01:07:42,239
+localize these things very, very
+precisely; even when we move closer to the
+
+885
+01:07:42,239 --> 01:07:50,479
+image, these networks still learn to
+localize very precisely. That's a
+
+886
+01:07:50,480 --> 01:07:58,280
+recent paper from DeepMind
+that is pretty cool.
+
+887
+01:07:58,280 --> 01:08:11,519
+So, any other last-minute questions about
+spatial transformers? Yeah. So the
+
+888
+01:08:11,519 --> 01:08:13,989
+question is what the task
+is that these things are
+
+889
+01:08:13,989 --> 01:08:17,420
+doing, and at least in the vanilla
+version it's just classification. So it
+
+890
+01:08:17,420 --> 01:08:21,810
+receives this sort of input, which could
+be warped or cluttered or whatnot, and
+
+891
+01:08:21,810 --> 01:08:26,060
+all it needs to do is classify that
+digit, and sort of in the process of
+
+892
+01:08:26,060 --> 01:08:29,839
+learning to classify, it also learns to
+attend to the correct part. So that's
+
+893
+01:08:29,838 --> 01:08:40,189
+a really cool feature of this
+work. Right, so sort of my overview
+
+894
+01:08:40,189 --> 01:08:44,588
+of attention is that we have soft
+attention, which is really easy to
+
+895
+01:08:44,588 --> 01:08:49,119
+implement, especially in this context of
+fixed input positions, where we just
+
+896
+01:08:49,119 --> 01:08:53,039
+produce distributions over the input,
+then we weight and feed those
+
+897
+01:08:53,039 --> 01:08:56,850
+vectors back into the network somehow.
+This is really easy to implement in many
+
+898
+01:08:56,850 --> 01:08:59,930
+different contexts and has been
+applied to a lot of different tasks.
+
+899
+01:08:59,930 --> 01:09:04,770
+When you want to attend to arbitrary
+regions, then you need to get a little
+
+900
+01:09:04,770 --> 01:09:09,130
+bit fancier, and I think spatial
+transformers are a very nice, elegant
+
+901
+01:09:09,130 --> 01:09:13,949
+way of attending to arbitrary regions of
+input images. There are a lot of papers
+
+902
+01:09:13,949 --> 01:09:17,889
+that actually work on hard attention, and
+this is quite a bit more challenging due
+
+903
+01:09:17,890 --> 01:09:21,579
+to this problem with the gradients, so
+hard attention papers typically use
+
+904
+01:09:21,579 --> 01:09:26,199
+reinforcement learning, which we didn't
+really talk about today. So, any
+
+905
+01:09:26,199 --> 01:09:39,429
+other questions about attention?
+... Okay, sure.
+
+906
+01:09:39,429 --> 01:09:51,958
+The question is about the captioning model,
+before we got to transformers. And yeah, those
+
+907
+01:09:51,958 --> 01:09:56,649
+captions are produced using this grid-based
+thing, but in that network in
+
+908
+01:09:56,649 --> 01:10:01,299
+particular I think it was a 14-by-14
+grid, so it's actually quite a lot of
+
+909
+01:10:01,300 --> 01:10:04,550
+locations, but it's still constrained
+in where it's allowed to look. ... So the
+
+910
+01:10:04,550 --> 01:10:22,800
+question is about interpolating between
+soft attention and hard attention. Yeah,
+
+911
+01:10:22,800 --> 01:10:26,279
+one thing you might imagine is you train
+the network in a soft way, and then
+
+912
+01:10:26,279 --> 01:10:29,929
+during training you kind of push its
+distribution to be sharper and sharper, and
+
+913
+01:10:29,929 --> 01:10:32,949
+then at test time you just
+switch over and use hard attention
+
+914
+01:10:32,948 --> 01:10:37,938
+instead. I can't remember
+which paper did that, but I'm pretty
+
+915
+01:10:37,939 --> 01:10:43,130
+sure I've seen that idea somewhere. But
+in practice I think training with hard
+
+916
+01:10:43,130 --> 01:10:46,099
+attention tends to work better than the
+sharpening approach, but it's definitely
+
+917
+01:10:46,099 --> 01:10:51,800
+something you could try. Okay, if there are
+no other questions, then I think we're done a
+
+918
+01:10:51,800 --> 01:10:54,179
+couple minutes early today, so get your
+homework done.
diff --git a/captions/En/Lecture14_en.srt b/captions/En/Lecture14_en.srt
new file mode 100644
index 00000000..f35caf2e
--- /dev/null
+++ b/captions/En/Lecture14_en.srt
@@ -0,0 +1,5240 @@
+1
+00:00:00,000 --> 00:00:04,990
+Administrative points: everyone should be
+done with assignment 3 now. If you're not done, I
+
+2
+00:00:04,990 --> 00:00:07,790
+think you're late and you're in trouble.
+
+3
+00:00:07,790 --> 00:00:11,280
+Milestone grades will be out very soon;
+we're still going through them, and they
+
+4
+00:00:11,279 --> 00:00:13,779
+are basically, I think, done, but we have
+to double-check a few things. I will send
+
+5
+00:00:13,779 --> 00:00:14,199
+them out.
+
+6
+00:00:14,199 --> 00:00:18,820
+Okay, so in terms of reminding you where we
+are in the class: last time we looked very
+
+7
+00:00:18,820 --> 00:00:22,629
+briefly at segmentation, and we looked at
+some soft attention models. Attention
+
+8
+00:00:22,629 --> 00:00:25,829
+models are a way of selectively paying
+attention to different parts of the
+
+9
+00:00:25,829 --> 00:00:28,028
+image as you're processing it with
+something like a recurrent neural
+
+10
+00:00:28,028 --> 00:00:32,020
+network, so that you selectively pay
+attention to some parts of the scene and
+
+11
+00:00:32,020 --> 00:00:35,450
+enhance those features. And we talked
+about spatial transformers, which are this
+
+12
+00:00:35,450 --> 00:00:38,929
+very nice way of basically, in a
+differentiable way, cropping parts of an image
+
+13
+00:00:38,929 --> 00:00:43,769
+or some features, either on a grid or
+in any kind of warped shape, anything
+
+14
+00:00:43,770 --> 00:00:48,579
+like that. So it's a very interesting kind of
+piece that you can slot into neural network
+
+15
+00:00:48,579 --> 00:00:52,049
+architectures. So today we'll
+talk about videos.
+
+16
+00:00:52,049 --> 00:00:56,229
+Specifically, in image classification
+you should be familiar by now with
+
+17
+00:00:56,229 --> 00:00:59,390
+the basic ConvNet setup: you have an
+image that comes in and we're processing it to,
+
+18
+00:00:59,390 --> 00:01:03,239
+for example, classify it. In the case of
+videos we won't have just a single image,
+
+19
+00:01:03,238 --> 00:01:07,728
+but we'll have multiple frames. So instead of
+an image of 32 by 32, we'll actually have
+
+20
+00:01:07,728 --> 00:01:13,829
+an entire block of video frames, so 32 by 32
+by T, with T some time extent. Okay, so
+
+21
+00:01:13,829 --> 00:01:17,340
+before I dive into how we approach these
+problems with ConvNets, I'd like to talk
+
+22
+00:01:17,340 --> 00:01:21,170
+very briefly about how we used to
+address them before ConvNets, using
+
+23
+00:01:21,170 --> 00:01:25,629
+feature-based methods. Some of the most
+popular features right before ConvNets
+
+24
+00:01:25,629 --> 00:01:30,019
+took over were these dense
+trajectory features
+
+25
+00:01:30,019 --> 00:01:34,140
+developed by Wang et al., and I'd just
+like to give you a brief taste of
+
+26
+00:01:34,140 --> 00:01:36,989
+exactly how these features worked,
+because it's kind of interesting and
+
+27
+00:01:36,989 --> 00:01:39,609
+they inspired some of the later
+developments in terms of how convolutional
+
+28
+00:01:39,609 --> 00:01:43,429
+networks actually operate on videos. So
+in dense trajectories, what we
+
+29
+00:01:43,430 --> 00:01:47,140
+do is we have this video playing and
+we're going to be detecting these key
+
+30
+00:01:47,140 --> 00:01:50,709
+points that are good to track in a video,
+and then we're going to be tracking them,
+
+31
+00:01:50,709 --> 00:01:54,679
+and you end up with all these little
+tracklets that we actually track
+
+32
+00:01:54,680 --> 00:01:57,759
+across the video, and then we compute lots of
+features about those tracklets and
+
+33
+00:01:57,759 --> 00:02:01,868
+about the surrounding regions, which get
+accumulated into histograms. So just to give
+
+34
+00:02:01,868 --> 00:02:06,549
+you an idea about how this worked, there
+are basically three rough steps: we
+
+35
+00:02:06,549 --> 00:02:10,868
+detect feature points at different
+scales in the image, and I'll tell you briefly
+
+36
+00:02:10,868 --> 00:02:11,960
+about how that's done;
+
+37
+00:02:11,960 --> 00:02:16,810
+then we track those features over
+time using optical flow methods. Optical
+
+38
+00:02:16,810 --> 00:02:20,270
+flow methods, to explain very briefly,
+basically give you a motion field
+
+39
+00:02:20,270 --> 00:02:23,800
+from one frame to another; they tell
+you how the scene moved from one frame
+
+40
+00:02:23,800 --> 00:02:28,070
+to the next frame. And then we're going
+to extract a whole bunch of features, but
+
+41
+00:02:28,069 --> 00:02:30,609
+importantly, we're not just going to
+extract those features at fixed
+
+42
+00:02:30,610 --> 00:02:33,930
+positions in the image; we're
+actually going to be extracting these
+
+43
+00:02:33,930 --> 00:02:37,700
+features in the local coordinate
+system of every single tracklet. And so
+
+44
+00:02:37,699 --> 00:02:41,869
+these histograms of gradients, histograms
+of flow, and so on, we're going to
+
+45
+00:02:41,870 --> 00:02:45,610
+be extracting them in the coordinate
+system of a tracklet. So HOG here:
+
+46
+00:02:45,610 --> 00:02:49,200
+we saw histograms of gradients in
+two-dimensional images; these are basically
+
+47
+00:02:49,199 --> 00:02:51,750
+generalizations of that to
+
+48
+00:02:51,750 --> 00:02:54,780
+videos, and so that's the kind of thing
+that people used to encode these
+
+49
+00:02:54,780 --> 00:03:01,009
+spatio-temporal volumes. In terms of the
+keypoint detection part, there's been
+
+50
+00:03:01,009 --> 00:03:04,239
+quite a lot of work on exactly how to
+detect good features in videos to track,
+
+51
+00:03:04,240 --> 00:03:07,930
+and intuitively you don't want to track
+parts of a video that are too smooth, because you
+
+52
+00:03:07,930 --> 00:03:11,580
+can't lock onto any visual feature; there
+are ways of basically getting a
+
+53
+00:03:11,580 --> 00:03:16,620
+set of points that are easy to track in
+a video, so there are some papers on this.
+
+54
+00:03:16,620 --> 00:03:19,509
+So you detect a bunch of features like
+this, and then you run
+
+55
+00:03:19,509 --> 00:03:23,039
+optical flow algorithms on these videos.
+
+56
+00:03:23,659 --> 00:03:28,060
+An optical flow algorithm will take a frame
+and a second frame, and it will solve for a motion field:
+
+57
+00:03:28,060 --> 00:03:32,409
+a displacement vector at every single
+position telling you where it traveled, how
+
+58
+00:03:32,409 --> 00:03:35,919
+the frame moved. Here are some
+examples of optical flow results:
+
+59
+00:03:36,439 --> 00:03:42,270
+basically, every single pixel is
+colored by the direction in which that
+
+60
+00:03:42,270 --> 00:03:46,260
+part of the image is currently moving
+in the video. So for example this girl is
+
+61
+00:03:46,259 --> 00:03:49,939
+all yellow, meaning that she's probably
+translating horizontally or something
+
+62
+00:03:49,939 --> 00:03:53,680
+like that. Of the most common methods
+for computing optical flow,
+
+63
+00:03:53,680 --> 00:03:58,069
+one of the most common ones here is
+the one from Brox and Malik; that's
+
+64
+00:03:58,069 --> 00:04:00,949
+the one that is kind of like the default
+to use. So if you are
+
+65
+00:04:00,949 --> 00:04:03,399
+computing optical flow in your own
+project, I would encourage you to use
+
+66
+00:04:03,400 --> 00:04:08,950
+this large displacement optical flow
+method. So we
+
+67
+00:04:08,949 --> 00:04:12,199
+have all these keypoints; using optical
+flow we also know how they move, and we
+
+68
+00:04:12,199 --> 00:04:15,859
+end up tracking these little tracklets of
+maybe roughly fifteen frames at a time,
+
+69
+00:04:15,860 --> 00:04:20,509
+so we end up with these roughly half-second
+tracklets through the video, and
+
+70
+00:04:20,509 --> 00:04:21,519
+then we encode
+
+71
+00:04:21,519 --> 00:04:26,129
+the regions around these tracklets with
+all these descriptors, and then we
+
+72
+00:04:26,129 --> 00:04:29,710
+accumulate all these visual features into
+histograms. And people used to play with
+
+73
+00:04:29,709 --> 00:04:34,668
+different choices of exactly how you
+cut up the video spatially, because
+
+74
+00:04:34,668 --> 00:04:37,359
+we're going to have an independent
+histogram in every one of
+
+75
+00:04:37,360 --> 00:04:40,389
+these bins, and then we're going to
+basically create all these histograms
+
+76
+00:04:40,389 --> 00:04:45,220
+over all these visual features, and
+all of this goes into an SVM. And
+
+77
+00:04:45,220 --> 00:04:48,050
+that's kind of the rough layout in terms
+of how people addressed these problems in
+
+78
+00:04:48,050 --> 00:04:55,720
+the past. Your tracklet, just think of
+it as fifteen frames, and it's
+
+79
+00:04:55,720 --> 00:05:01,639
+just x-y positions, so fifteen (x, y)
+coordinates is the tracklet, and then we
+
+80
+00:05:01,639 --> 00:05:07,168
+extract features in its local coordinate
+system. Now in terms of how we actually approach
+
+81
+00:05:07,168 --> 00:05:13,859
+these problems with ConvNets: remember,
+in AlexNet the very first layer
+
+82
+00:05:13,860 --> 00:05:17,560
+will receive an image that is, for
+example, 227 by 227 by 3, and
+
+83
+00:05:17,560 --> 00:05:22,310
+we're processing it with 96 filters that are
+11 by 11, applied at stride 4. And so
+
+84
+00:05:22,310 --> 00:05:27,978
+we saw that with AlexNet this results
+in a 55 by 55 by 96 volume, in
+
+85
+00:05:27,978 --> 00:05:30,468
+which we actually have all these
+responses of all the filters at every
+
+86
+00:05:30,468 --> 00:05:34,788
+single spatial position. So now, what
+would be a reasonable approach if you
+
+87
+00:05:34,788 --> 00:05:38,158
+wanted to generalize a convolutional
+network to a case where you don't just have a
+
+88
+00:05:38,158 --> 00:05:42,579
+227 by 227 by 3 image, but you maybe
+have N frames that you'd like to encode?
+
+89
+00:05:42,579 --> 00:05:47,278
+So you have an entire block of 227 by 227
+by 3 by 15 that's coming into
+
+90
+00:05:47,278 --> 00:05:50,180
+the convolutional network, and you're
+trying to encode both the spatial and
+
+91
+00:05:50,180 --> 00:05:54,209
+temporal patterns inside this little
+block of volume. So what would be one
+
+92
+00:05:54,209 --> 00:05:57,379
+idea for how to change the
+convolutional network,
+
+93
+00:05:57,379 --> 00:06:00,379
+to generalize it to this case?
+
+94
+00:06:03,899 --> 00:06:27,609
+...and arrange them as like two blocks? Okay,
+that's interesting; I would expect that
+
+95
+00:06:27,610 --> 00:06:33,870
+to not work very well. The
+problem with that is kind of interesting:
+
+96
+00:06:33,870 --> 00:06:36,850
+basically, all these neurons are looking
+at only a single frame, and then by the
+
+97
+00:06:36,850 --> 00:06:39,720
+end of the ConvNet you end up with
+neurons that are looking at larger and
+
+98
+00:06:39,720 --> 00:06:43,310
+larger regions, so eventually these
+neurons would see all of
+
+99
+00:06:43,310 --> 00:06:46,470
+your input, but they would not be
+able to very easily relate,
+
+100
+00:06:47,589 --> 00:06:52,589
+like, a little spatio-temporal patch in
+this image. So I'm not sure it's actually
+
+101
+00:06:52,589 --> 00:07:04,149
+a really good idea. ...I think
+so; we'll get to some of those,
+
+102
+00:07:04,149 --> 00:07:07,149
+methods that do something like that.
+
+103
+00:07:09,930 --> 00:07:25,199
+...take 45 channels effectively and you
+could put a ConvNet on that? So that's
+
+104
+00:07:25,199 --> 00:07:28,919
+something that I'll get to; I think you
+could do that, I don't think it's the
+
+105
+00:07:28,918 --> 00:07:44,049
+best idea. Yes, so you're saying that
+for things in one slice of time, you
+
+106
+00:07:44,050 --> 00:07:48,379
+want to extract similar kinds of
+features at one time and at a different
+
+107
+00:07:48,379 --> 00:07:48,990
+time.
+
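+
+A tiny numpy sketch of the "treat the frames as channels" idea from the question above, an early-fusion reshape where 15 frames times 3 color channels become 45 input channels; whether this is a good idea is exactly what the discussion goes on to consider:
+
+```python
+import numpy as np
+
+# One way to feed a video block to a 2D ConvNet: stack the 15 frames along
+# the channel axis, giving 15 * 3 = 45 channels, so the first conv layer's
+# filters span the full temporal extent at once.
+clip = np.random.rand(15, 227, 227, 3)   # (T, H, W, C) block of frames
+early_fused = clip.transpose(1, 2, 0, 3).reshape(227, 227, 15 * 3)  # (H, W, 45)
+```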
+
+43
+00:02:33,930 --> 00:02:37,700
+features in the local coordinate
+system of every single tracklet. And so
+
+44
+00:02:37,699 --> 00:02:41,869
+these histogram-of-gradient, histogram-of-flow,
+and MBH descriptors, we're going to
+
+45
+00:02:41,870 --> 00:02:45,610
+be extracting them in the coordinate
+system of the tracklet. And so where
+
+46
+00:02:45,610 --> 00:02:49,200
+we saw histograms of gradients in
+two-dimensional images, these are basically
+
+47
+00:02:49,199 --> 00:02:51,750
+generalizations of that to
+
+48
+00:02:51,750 --> 00:02:54,780
+videos, and so that's the kind of thing
+that people used to encode these
+
+49
+00:02:54,780 --> 00:03:01,009
+spatio-temporal volumes. In terms of the
+key point detection part, there's been
+
+50
+00:03:01,009 --> 00:03:04,239
+quite a lot of work on exactly how to
+detect good features in videos to track,
+
+51
+00:03:04,240 --> 00:03:07,930
+and intuitively you don't want to track
+parts of a video that are too smooth, because you
+
+52
+00:03:07,930 --> 00:03:11,580
+can't lock onto any visual feature, so
+there are ways of basically getting a
+
+53
+00:03:11,580 --> 00:03:16,620
+set of points that are easy to track in
+a video, and there are some papers on this.
+
+54
+00:03:16,620 --> 00:03:19,509
+So you detect a bunch of features like
+this, and then we run
+
+55
+00:03:19,509 --> 00:03:23,039
+optical flow algorithms on these videos.
+
+56
+00:03:23,659 --> 00:03:28,060
+An optical flow algorithm takes a frame and a
+second frame and solves for a motion field:
+
+57
+00:03:28,060 --> 00:03:32,409
+a displacement vector at every single
+position, telling you where it traveled, how
+
+58
+00:03:32,409 --> 00:03:35,919
+the frame moved. Here are some
+examples of optical flow results:
+
+59
+00:03:36,439 --> 00:03:42,270
+basically, here every single pixel is
+colored by the direction in which that
+
+60
+00:03:42,270 --> 00:03:46,260
+part of the image is currently moving
+in the video. So for example this girl is
+
+61
+00:03:46,259 --> 00:03:49,939
+all yellow, meaning that she's probably
+translating horizontally or something
+
+62
+00:03:49,939 --> 00:03:53,680
+like that. Of the most common methods
+for computing optical flow,
+
+63
+00:03:53,680 --> 00:03:58,069
+at least one of the most common ones
+here is the one from Brox and Malik; that's
+
+64
+00:03:58,069 --> 00:04:00,949
+the one that people kind of default
+to using. So if you are
+
+65
+00:04:00,949 --> 00:04:03,399
+computing optical flow in your own
+project, I would encourage you to use
+
+66
+00:04:03,400 --> 00:04:08,950
+this large displacement optical flow
+method.
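+
+[A minimal sketch of the two ingredients just described, keypoint detection
+and dense optical flow, using OpenCV. Farneback's method stands in here for
+the Brox large-displacement flow mentioned above, which is not in core
+OpenCV; the frame paths are hypothetical.]
+
+```python
+import cv2
+import numpy as np
+
+prev = cv2.cvtColor(cv2.imread('frame_000.png'), cv2.COLOR_BGR2GRAY)
+nxt = cv2.cvtColor(cv2.imread('frame_001.png'), cv2.COLOR_BGR2GRAY)
+
+# Shi-Tomasi "good features to track": points that are easy to lock onto.
+pts = cv2.goodFeaturesToTrack(prev, maxCorners=500,
+                              qualityLevel=0.01, minDistance=7)
+
+# Dense optical flow: an (H, W, 2) field of per-pixel (dx, dy) displacements.
+flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
+                                    0.5, 3, 15, 3, 5, 1.2, 0)
+
+# Visualize as in the lecture: hue encodes direction, brightness magnitude.
+mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
+hsv = np.zeros(prev.shape + (3,), dtype=np.uint8)
+hsv[..., 0] = ang * 180 / np.pi / 2
+hsv[..., 1] = 255
+hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)
+cv2.imwrite('flow.png', cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR))
+```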
+
+67
+00:04:08,949 --> 00:04:12,199
+So we have all these key points, and using
+optical flow we now also know how they move, so we
+
+68
+00:04:12,199 --> 00:04:15,859
+end up tracking these little tracklets
+over maybe roughly fifteen frames at a time.
+
+69
+00:04:15,860 --> 00:04:20,509
+So we end up with these roughly half-second
+tracklets through the video, and
+
+70
+00:04:20,509 --> 00:04:21,519
+then we encode the
+
+71
+00:04:21,519 --> 00:04:26,129
+regions around these tracklets with
+all these descriptors, and then we
+
+72
+00:04:26,129 --> 00:04:29,710
+accumulate all these visual features into
+histograms. And people used to play with
+
+73
+00:04:29,709 --> 00:04:34,668
+different kinds of, like, how do you
+exactly chunk up the video spatially, because
+
+74
+00:04:34,668 --> 00:04:37,359
+we're going to have an independent
+histogram in every one of
+
+75
+00:04:37,360 --> 00:04:40,389
+these bins, and then we basically
+create all these histograms
+
+76
+00:04:40,389 --> 00:04:45,220
+over all these visual features, and
+all of this goes into an SVM. And
+
+77
+00:04:45,220 --> 00:04:48,050
+that was kind of the rough layout of how
+people addressed these problems in
+
+78
+00:04:48,050 --> 00:04:55,720
+the past. Your tracklet, just think of it
+as fifteen frames, and it's
+
+79
+00:04:55,720 --> 00:05:01,639
+just x-y positions, so fifteen x-y
+coordinates: that's your tracklet. And then we
+
+80
+00:05:01,639 --> 00:05:07,168
+extract features in its local coordinate
+system. Now, in terms of how we actually approach
+
+81
+00:05:07,168 --> 00:05:13,859
+these problems with convnets: you'll
+recall that AlexNet on the very first layer
+
+82
+00:05:13,860 --> 00:05:17,560
+will receive an image that is, for
+example, 227 by 227 by three, and
+
+83
+00:05:17,560 --> 00:05:22,310
+we process it with 96 filters that are
+11 by 11, applied at stride four. And so
+
+84
+00:05:22,310 --> 00:05:27,978
+we saw that with AlexNet this results
+in a 55 by 55 by 96 volume, in
+
+85
+00:05:27,978 --> 00:05:30,468
+which we have all the responses of all
+the filters at every spatial position.
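+
+[A quick sanity check of the arithmetic quoted above, under the standard
+convolution output-size formula; `conv_output_size` is just a helper defined
+here, not part of any library.]
+
+```python
+def conv_output_size(input_size, filter_size, stride, pad=0):
+    # Standard formula: (W - F + 2P) / S + 1
+    return (input_size - filter_size + 2 * pad) // stride + 1
+
+# AlexNet conv1: 227x227x3 input, 96 filters of 11x11x3 at stride 4.
+out = conv_output_size(227, 11, 4)
+print(out, out, 96)   # 55 55 96: the 55 x 55 x 96 activation volume
+```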
+
+86
+00:05:30,468 --> 00:05:34,788
+So now, what would be a reasonable
+approach if you
+
+87
+00:05:34,788 --> 00:05:38,158
+wanted to generalize a convolutional
+network to a case where we don't just have a
+
+88
+00:05:38,158 --> 00:05:42,579
+227 by 227 by 3 image, but maybe
+fifteen frames that you'd like to encode?
+
+89
+00:05:42,579 --> 00:05:47,278
+So you have an entire block of 227 by 227
+by 3 by 15 that's coming into
+
+90
+00:05:47,278 --> 00:05:50,180
+a convolutional network, and you're trying
+to encode both the spatial and
+
+91
+00:05:50,180 --> 00:05:54,209
+temporal patterns inside this little
+block of volume. So what would be one
+
+92
+00:05:54,209 --> 00:05:57,379
+idea for how to change a convolutional
+network to
+
+93
+00:05:57,379 --> 00:06:00,379
+generalize it to this case?
+
+94
+00:06:03,899 --> 00:06:27,609
+...and arrange them as, like, two blocks?
+OK, that's interesting. I would expect that
+
+95
+00:06:27,610 --> 00:06:33,870
+to not work very well. The
+problem with that is kind of interesting:
+
+96
+00:06:33,870 --> 00:06:36,850
+basically, all these neurons are looking
+at only a single frame, and then by the
+
+97
+00:06:36,850 --> 00:06:39,720
+end of the convnet you end up with
+neurons that are looking at larger and
+
+98
+00:06:39,720 --> 00:06:43,310
+larger regions of your input, so
+eventually these neurons will see all of
+
+99
+00:06:43,310 --> 00:06:46,470
+your input, but they would not be able
+to very easily relate,
+
+100
+00:06:47,589 --> 00:06:52,589
+like, a little spatiotemporal patch in
+this image. So I'm not sure it's actually a
+
+101
+00:06:52,589 --> 00:07:04,149
+really good idea. [A follow-up question.]
+I think so; we'll get to some of those
+
+102
+00:07:04,149 --> 00:07:07,149
+approaches that do something like that.
+
+103
+00:07:09,930 --> 00:07:25,199
+[A student suggests:] take 45 channels,
+effectively, and you could put a convnet on that.
+
+104
+00:07:25,199 --> 00:07:28,919
+That's something we'll get to. I think you
+could do that; I don't think it's the
+
+105
+00:07:28,918 --> 00:07:44,049
+best idea. And, yes, so you're saying that
+for things in one slice of time,
+
+106
+00:07:44,050 --> 00:07:48,379
+you want to extract similar kinds of
+features as at a different
+
+107
+00:07:48,379 --> 00:07:48,990
+time,
+
+108
+00:07:48,990 --> 00:07:52,829
+similar to the motivation for weight
+sharing in space, because features useful
+
+109
+00:07:52,829 --> 00:07:55,909
+here are useful down there as well. So
+you have the same kind of property, where
+
+110
+00:07:55,910 --> 00:07:58,910
+you'd like to share weights in time, not
+only in space.
+
+111
+00:07:59,689 --> 00:08:03,550
+OK, so building on top of that idea,
+the basic thing that people usually do
+
+112
+00:08:03,550 --> 00:08:06,400
+when they want to apply convolutional
+networks to videos is they extend these
+
+113
+00:08:06,399 --> 00:08:10,138
+filters: you don't only have
+filters in space, but you also take these
+
+114
+00:08:10,139 --> 00:08:14,840
+filters and extend them small amounts in
+time. So before we had 11 by 11
+
+115
+00:08:14,839 --> 00:08:15,750
+filters;
+
+116
+00:08:15,750 --> 00:08:21,709
+now we have 11 by 11 by T filters, where
+T is some small temporal extent, say we
+
+117
+00:08:21,709 --> 00:08:28,759
+can use a T of 2 up to 15. In this particular
+case these were 11 by 11 filters, and
+
+118
+00:08:28,759 --> 00:08:33,979
+then by three because we have RGB. And
+so with these filters you're now
+
+119
+00:08:33,979 --> 00:08:36,969
+thinking of sliding filters not only in
+space, carving out an entire
+
+120
+00:08:36,969 --> 00:08:40,469
+activation map, but you're actually
+sliding the filters not only in space but
+
+121
+00:08:40,469 --> 00:08:44,450
+also in time, where they have a small,
+finite temporal extent, and you
+
+122
+00:08:44,450 --> 00:08:48,379
+end up carving out an entire activation
+volume. OK, so you're introducing this
+
+123
+00:08:48,379 --> 00:08:51,909
+time dimension into all your kernels,
+and all the early stages have an
+
+124
+00:08:51,909 --> 00:08:55,899
+additional time dimension along which
+we're performing the convolutions. So
+
+125
+00:08:55,899 --> 00:08:59,659
+that's usually how people extract these
+features, and then you get this property,
+
+126
+00:08:59,659 --> 00:09:04,009
+where say T is three here, that
+when we do the spatiotemporal
+
+127
+00:09:04,009 --> 00:09:07,230
+convolution we end up with this
+parameter-sharing scheme going in time
+
+128
+00:09:07,230 --> 00:09:11,639
+as well, as you mentioned. So basically
+we extend all the filters in time, and
+
+129
+00:09:11,639 --> 00:09:14,360
+then we do convolutions not only in
+space but also in time, and we
+
+130
+00:09:14,360 --> 00:09:18,800
+wind up with activation volumes instead
+of just activation maps.
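+
+[A sketch of that idea in code, assuming PyTorch: the filters get a small
+temporal extent T and slide in both space and time, carving out an
+activation volume. The exact sizes are illustrative.]
+
+```python
+import torch
+import torch.nn as nn
+
+clip = torch.randn(1, 3, 15, 227, 227)   # (batch, RGB, time, height, width)
+
+# 96 filters of 11x11 in space, extended T=3 steps in time, stride 4 in space.
+conv3d = nn.Conv3d(in_channels=3, out_channels=96,
+                   kernel_size=(3, 11, 11), stride=(1, 4, 4))
+
+out = conv3d(clip)
+print(out.shape)   # torch.Size([1, 96, 13, 55, 55]): an activation volume
+```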
+
+131
+00:09:18,799 --> 00:09:22,818
+So some of these approaches were proposed
+quite early on. For example, one of the earlier ones
+
+132
+00:09:22,818 --> 00:09:28,238
+for activity recognition is maybe from
+2010. The idea here was that this is
+
+133
+00:09:28,239 --> 00:09:31,798
+just a convnet, but instead of
+getting a single input of sixty by forty
+
+134
+00:09:31,798 --> 00:09:36,108
+pixels, we are getting in fact seven
+frames of sixty by forty, and then the
+
+135
+00:09:36,109 --> 00:09:40,119
+convolutions are 3D convolutions, as
+we refer to them. So these filters, for
+
+136
+00:09:40,119 --> 00:09:44,220
+example, might be seven by seven, but
+now by three as well, so we end up with a 3D
+
+137
+00:09:44,220 --> 00:09:49,499
+conv, and the 3D convolutions are
+applied at every single stage here.
+
+138
+00:09:50,649 --> 00:09:55,208
+A similar paper, also from 2011, has the
+same idea: we have a block of frames
+
+139
+00:09:55,208 --> 00:09:59,518
+coming in, and we process it with 3D
+convolutions, three-dimensional filters, at
+
+140
+00:09:59,519 --> 00:10:03,229
+every single point in this convolutional
+network. So this is in 2011,
+
+141
+00:10:04,948 --> 00:10:08,748
+a very similar idea as well. These are
+actually from before AlexNet; these
+
+142
+00:10:08,749 --> 00:10:12,889
+approaches are kind of like smaller
+neural networks, smaller convnets. So
+
+143
+00:10:12,889 --> 00:10:16,829
+the first kind of large-scale
+application of this was from this
+
+144
+00:10:16,828 --> 00:10:19,828
+awesome paper in 2014 by Karpathy et al.
+
+145
+00:10:20,830 --> 00:10:27,540
+This is for processing videos. So the
+model here on the very right, the one we
+
+146
+00:10:27,539 --> 00:10:31,159
+called slow fusion, that is the same
+idea that I presented so far: these are
+
+147
+00:10:31,159 --> 00:10:35,750
+three-dimensional convolutions happening
+in both space and time. And so that's
+
+148
+00:10:35,750 --> 00:10:38,879
+slow fusion, as we refer to it, because
+you're slowly fusing this temporal
+
+149
+00:10:38,879 --> 00:10:43,649
+information, just as before we were
+slowly fusing the spatial information. Now
+
+150
+00:10:43,649 --> 00:10:47,100
+there are other ways that you could also
+wire up convolutional networks, and just
+
+151
+00:10:47,100 --> 00:10:51,769
+to give you some context, historically:
+this was Google research, and AlexNet
+
+152
+00:10:51,769 --> 00:10:55,039
+had just come out, and everyone was super
+excited because convnets worked extremely well on
+
+153
+00:10:55,039 --> 00:11:00,579
+images, and I was in the video analysis
+team at Google, and we wanted to run on
+
+154
+00:11:00,580 --> 00:11:04,060
+the YouTube videos, but it was not
+quite clear exactly how to generalize
+
+155
+00:11:04,059 --> 00:11:07,809
+convolutional networks from
+images to videos. So we explored several
+
+156
+00:11:07,809 --> 00:11:11,389
+kinds of architectures for how you can
+actually wire this up. So slow
+
+157
+00:11:11,389 --> 00:11:17,889
+fusion is the 3D conv kind of approach;
+early fusion is this idea that someone
+
+158
+00:11:17,889 --> 00:11:21,230
+described earlier, where you take a chunk
+of frames and just concatenate them
+
+159
+00:11:21,230 --> 00:11:25,430
+along channels: you might end up with a
+227 by 227 by, like, 45,
+
+160
+00:11:25,429 --> 00:11:29,500
+everything just stacked up, and you do
+a single conv over it. So it's kind of
+
+161
+00:11:29,500 --> 00:11:35,200
+like your filters on the very first conv
+layer have a large temporal extent, but
+
+162
+00:11:35,200 --> 00:11:38,780
+from then on everything else is
+two-dimensional convolution. In fact we
+
+163
+00:11:38,779 --> 00:11:42,139
+call it early fusion because you fuse the
+temporal information very early on, in
+
+164
+00:11:42,139 --> 00:11:45,879
+the very first layer; from then on
+everything is just 2D convolution.
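+
+[A minimal sketch of the early-fusion wiring just described: stack T frames
+along the channel axis and treat the result as one many-channel image for an
+ordinary 2D convnet. Sizes match the example above.]
+
+```python
+import numpy as np
+
+T, H, W = 15, 227, 227
+frames = np.random.rand(T, H, W, 3)   # 15 RGB frames standing in for a clip
+
+# Early fusion: concatenate all frames along channels: 15 * 3 = 45 channels.
+stacked = frames.transpose(1, 2, 0, 3).reshape(H, W, T * 3)
+print(stacked.shape)   # (227, 227, 45), fed to a standard 2D convnet
+```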
+
+165
+00:11:45,879 --> 00:11:49,490
+You can imagine architectures like late
+fusion: here the idea is we take two AlexNets,
+
+166
+00:11:49,490 --> 00:11:53,169
+we place them, say, ten frames apart, so
+they're both computed independently on
+
+167
+00:11:53,169 --> 00:11:57,169
+frames ten steps apart, and then we merge
+them much later in the fully connected
+
+168
+00:11:57,169 --> 00:12:00,620
+layers. And then we had a single-frame
+baseline that is only looking at a
+
+169
+00:12:00,620 --> 00:12:03,830
+single frame of the video. So you can
+play with exactly how you wire up
+
+170
+00:12:03,830 --> 00:12:08,440
+these models. For the slow fusion model,
+you can imagine that it had three-
+
+171
+00:12:08,440 --> 00:12:13,130
+dimensional kernels in the first layer;
+you can actually visualize them, and
+
+172
+00:12:13,129 --> 00:12:16,210
+these are the kinds of features you end
+up learning on videos. These are
+
+173
+00:12:16,210 --> 00:12:18,990
+basically features we're familiar
+with, except they're moving, because now
+
+174
+00:12:18,990 --> 00:12:22,680
+these filters are also extended a small
+amount in time, so you have these little
+
+175
+00:12:22,679 --> 00:12:26,049
+moving blobs, and some of them are static
+and some of them are moving, and they're
+
+176
+00:12:26,049 --> 00:12:30,729
+basically detecting motion in the very
+first layer. And so you end up with these nice
+
+177
+00:12:30,730 --> 00:12:31,960
+moving volumes.
+
+178
+00:12:31,960 --> 00:12:48,090
+The question is... we'll get to that, and
+I think the answer is probably yes:
+
+179
+00:12:48,090 --> 00:12:53,269
+just as in space it works better if
+you have smaller filters and more depth,
+
+180
+00:12:53,269 --> 00:12:56,370
+the same applies, I think, in time, and
+we'll see an architecture that does that.
+
+181
+00:12:56,370 --> 00:13:07,220
+[Inaudible student question.]
+
+182
+00:13:08,190 --> 00:13:13,580
+Classifying: so we have a video, and we're
+still classifying some number of categories
+
+183
+00:13:13,580 --> 00:13:17,970
+at every single frame, but now you're not
+only a function of that single frame but also
+
+184
+00:13:17,970 --> 00:13:23,740
+of a small number of frames around it, on
+both sides. So maybe your prediction is
+
+185
+00:13:23,740 --> 00:13:28,539
+actually a function of, say, 15 frames,
+half a second of video. You end up with fun
+
+186
+00:13:28,539 --> 00:13:32,909
+moving pictures. With this paper we also
+released a video dataset of over one
+
+187
+00:13:32,909 --> 00:13:36,639
+million videos and about 500 classes. Just
+to give some context for why:
+
+188
+00:13:36,639 --> 00:13:41,759
+it's kind of difficult to work with
+videos right now, I think, because the
+
+189
+00:13:41,759 --> 00:13:45,480
+problem right now, I think, is that
+there are not too many very large-scale
+
+190
+00:13:45,480 --> 00:13:49,820
+datasets, like the millions of very varied
+images that you see in ImageNet; there
+
+191
+00:13:49,820 --> 00:13:53,230
+is no really good equivalent of that in
+the video domain. And so we tried with
+
+192
+00:13:53,230 --> 00:13:56,730
+this dataset back in 2013, but I
+don't think we actually fully achieved
+
+193
+00:13:56,730 --> 00:14:00,519
+that, and I think we're still not seeing
+very good, really large-scale datasets for
+
+194
+00:14:00,519 --> 00:14:03,579
+videos. And that's partly why we're also
+slightly discouraging some of you from
+
+195
+00:14:03,580 --> 00:14:08,050
+working on this in your projects: because
+you can't pre-train these very powerful
+
+196
+00:14:08,049 --> 00:14:12,969
+features, because the datasets are just
+not quite there. Another kind of
+
+197
+00:14:12,970 --> 00:14:16,100
+interesting thing that you see, and this
+is why we also sometimes caution people
+
+198
+00:14:16,100 --> 00:14:21,490
+against working on videos, is getting very
+elaborate very quickly with them, because
+
+199
+00:14:21,490 --> 00:14:24,490
+sometimes people think they have videos
+and get very excited: they want to do
+
+200
+00:14:24,490 --> 00:14:27,810
+3D convs, LSTMs, and they just
+think about all the possibilities that
+
+201
+00:14:27,809 --> 00:14:31,469
+open up for them. But it actually turns
+out that single-frame methods are a very
+
+202
+00:14:31,470 --> 00:14:34,820
+strong baseline, and I would always
+encourage you to run that first. So don't
+
+203
+00:14:34,820 --> 00:14:37,710
+worry about the motion in your video, and
+just try single-frame networks first.
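+
+[The single-frame baseline sketched in Python: classify frames independently
+and average the class scores over the video. The `convnet` function is a
+hypothetical stand-in for any trained image classifier.]
+
+```python
+import numpy as np
+
+def classify_video_single_frame(frames, convnet, num_classes):
+    """Run a 2D convnet on each frame independently and average
+    the class scores; no motion information is used at all."""
+    scores = np.zeros(num_classes)
+    for frame in frames:           # in practice you might subsample frames
+        scores += convnet(frame)
+    return scores / len(frames)
+```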
+
+204
+00:14:37,710 --> 00:14:40,990
+So for example, in this paper we found
+that the single-frame baseline was about
+
+205
+00:14:40,990 --> 00:14:44,610
+59.3% classification accuracy on our
+dataset,
+
+206
+00:14:44,610 --> 00:14:48,600
+and then we tried our best to actually
+take into account small local motions, but
+
+207
+00:14:48,600 --> 00:14:54,440
+we only ended up bumping that up by about
+1.6%. So all this extra work, all the extra compute,
+
+208
+00:14:54,440 --> 00:14:57,529
+and you end up with relatively
+small gains. I'm going to try to tell you
+
+209
+00:14:57,528 --> 00:15:02,088
+why that might be: basically, video is
+not always as useful as you might
+
+210
+00:15:02,089 --> 00:15:07,230
+intuitively think. And so here are some
+examples of the kinds of predictions: we
+
+211
+00:15:07,230 --> 00:15:11,800
+had different datasets of sports, and
+our predictions, and I think this kind of
+
+212
+00:15:11,799 --> 00:15:15,528
+highlights slightly why adding video
+might not be as helpful in some settings.
+
+213
+00:15:15,528 --> 00:15:19,740
+In particular, here, if you're trying to
+distinguish sports, think about
+
+214
+00:15:19,740 --> 00:15:23,930
+trying to distinguish, say, tennis from
+swimming or something like that: it turns
+
+215
+00:15:23,929 --> 00:15:26,729
+out that you actually don't need very
+fine local motion information if you're
+
+216
+00:15:26,730 --> 00:15:29,610
+trying to distinguish tennis from
+swimming, right? Lots of blue stuff, lots
+
+217
+00:15:29,610 --> 00:15:33,350
+of red stuff; the images actually
+have a huge amount of information. And so
+
+218
+00:15:33,350 --> 00:15:36,240
+you're putting in a lot of additional
+parameters in trying to go after these
+
+219
+00:15:36,240 --> 00:15:40,959
+local motions, but in most
+classes the local motions are
+
+220
+00:15:40,958 --> 00:15:44,289
+not very important; they're only
+important if you have very fine-grained
+
+221
+00:15:44,289 --> 00:15:47,919
+categories, where the small motions
+actually really matter a lot. So a lot
+
+222
+00:15:47,919 --> 00:15:52,419
+of you, if you have videos, will be
+inclined to use crazy spatiotemporal
+
+223
+00:15:52,419 --> 00:15:56,860
+video networks, but think very hard
+about whether the local motion is extremely
+
+224
+00:15:56,860 --> 00:15:59,980
+important in your setting, because if
+it isn't, you might end up with results
+
+225
+00:15:59,980 --> 00:16:04,070
+like this, where you put in a lot of work
+and it may not work well. Let's look at
+
+226
+00:16:04,070 --> 00:16:10,180
+some other video classification
+networks. So this is April 2015; it's
+
+227
+00:16:10,179 --> 00:16:14,698
+relatively popular; it's called C3D,
+and the idea here was basically: VGG-
+
+228
+00:16:14,698 --> 00:16:18,528
+Net has this very nice architecture,
+it's 3 by 3 conv and 2 by 2
+
+229
+00:16:18,528 --> 00:16:22,110
+pool throughout; the idea here is,
+let's do the exact same thing but
+
+230
+00:16:22,110 --> 00:16:25,169
+extend everything in time. So going back
+to your point, you want very small
+
+231
+00:16:25,169 --> 00:16:29,069
+filters, so here everything is 3 by
+3 by 3 conv and 2 by 2 by 2
+
+232
+00:16:29,070 --> 00:16:33,100
+pool throughout the architecture. So it's
+a very simple kind of big VGGNet-in-3D
+
+233
+00:16:33,100 --> 00:16:36,528
+kind of approach, and it works
+reasonably well, and you can look at this
+
+234
+00:16:36,528 --> 00:16:38,429
+paper for reference.
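+
+[The C3D pattern sketched in PyTorch: 3x3x3 convolutions and 2x2x2 pooling
+throughout, a VGG-like recipe extended in time. Channel widths and depth
+here are illustrative, not the paper's exact configuration.]
+
+```python
+import torch.nn as nn
+
+c3d_style = nn.Sequential(
+    nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
+    nn.MaxPool3d(kernel_size=2, stride=2),
+    nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
+    nn.MaxPool3d(kernel_size=2, stride=2),
+    # ... repeated, then fully connected layers and a classifier
+)
+```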
+
+235
+00:16:38,429 --> 00:16:42,389
+Another form of approach that actually
+works quite well is from Karen Simonyan
+
+236
+00:16:42,389 --> 00:16:43,778
+in 2014.
+
+237
+00:16:43,778 --> 00:16:48,299
+Simonyan, by the way, is the same
+person who came up with VGGNet; he
+
+238
+00:16:48,299 --> 00:16:51,828
+also has a very nice paper on video
+classification, and the idea here is that
+
+239
+00:16:51,828 --> 00:16:54,299
+he didn't want to do three-dimensional
+convolutions, because it's kind of
+
+240
+00:16:54,299 --> 00:16:55,219
+painful to have to
+
+241
+00:16:55,220 --> 00:17:00,360
+train and fine-tune them and so on, so he
+only used two-dimensional convolutions. But the idea
+
+242
+00:17:00,360 --> 00:17:05,179
+here is that we have two convnets: one
+is looking at the image, and the other one is
+
+243
+00:17:05,179 --> 00:17:10,298
+looking at the optical flow of the video.
+So both of these are just images, but the
+
+244
+00:17:10,298 --> 00:17:14,699
+optical flow basically tells you how
+things are moving in the image,
+
+245
+00:17:14,699 --> 00:17:19,120
+and so both of these are just, kind of,
+VGGNet-like or AlexNet-like
+
+246
+00:17:19,119 --> 00:17:23,139
+networks, one of them on the flow, one of
+them on the image, and you extract
+
+247
+00:17:23,140 --> 00:17:28,059
+optical flow with, say, the Brox method
+from before, and then you only fuse that
+
+248
+00:17:28,058 --> 00:17:31,720
+information very late, in the end. So both
+of these come up with some idea about
+
+249
+00:17:31,720 --> 00:17:34,850
+what they are seeing in terms of the
+classes in the video, and then we fuse
+
+250
+00:17:34,849 --> 00:17:37,859
+them, and there are different ways of
+fusing them. So they found, for example,
+
+251
+00:17:37,859 --> 00:17:42,979
+that if you just use the spatial convnet,
+only looking at images, you get some
+
+252
+00:17:42,980 --> 00:17:47,120
+performance; if you use a convnet on just
+the optical flow, it actually performs even
+
+253
+00:17:47,119 --> 00:17:49,558
+slightly better than just looking at the
+raw images;
+
+254
+00:17:49,558 --> 00:17:54,178
+the optical flow actually here
+contains a lot of information; and then
+
+255
+00:17:54,179 --> 00:17:58,538
+if you fuse them, you actually end up even
+better. Now an interesting point to make here, by the
+
+256
+00:17:58,538 --> 00:18:01,879
+way, is that if you have this kind of
+architecture, especially here,
+
+257
+00:18:01,880 --> 00:18:05,700
+convnets with three by three filters,
+you might imagine... I mean, you'd
+
+258
+00:18:05,700 --> 00:18:10,038
+think: why does it even help to
+feed in optical flow explicitly? You'd
+
+259
+00:18:10,038 --> 00:18:13,158
+imagine that in an end-to-end framework
+we're hoping that these convnets learn
+
+260
+00:18:13,159 --> 00:18:16,049
+everything from scratch; in particular,
+they should be able to learn something
+
+261
+00:18:16,048 --> 00:18:20,599
+that simulates the computation of
+optical flow. And it turns out
+
+262
+00:18:20,599 --> 00:18:24,230
+that that might not be the case, because
+sometimes when you compare video
+
+263
+00:18:24,230 --> 00:18:29,440
+networks, the one on only the optical flow
+works better. And so I think the
+
+264
+00:18:29,440 --> 00:18:34,169
+reason for that probably comes back
+to data, actually: since we don't have
+
+265
+00:18:34,169 --> 00:18:37,900
+enough data, we have a small amount of
+data, I think you actually probably don't have
+
+266
+00:18:37,900 --> 00:18:42,730
+enough data to learn very good
+optical-flow-like features. And so that
+
+267
+00:18:42,730 --> 00:18:45,599
+would be my best answer for why
+actually hard-coding optical flow into the
+
+268
+00:18:45,599 --> 00:18:48,819
+network is probably helping in many
+cases. If you guys are working on your
+
+269
+00:18:48,819 --> 00:18:51,839
+project with videos, I would encourage
+you to actually try this kind of
+
+270
+00:18:51,839 --> 00:18:52,779
+architecture:
+
+271
+00:18:52,779 --> 00:18:57,480
+take optical flow, then pretend that it's
+an image and train a convnet on it;
+
+272
+00:18:57,480 --> 00:19:01,808
+that seems like a relatively reasonable
+approach.
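+
+[A sketch of the two-stream fusion step: each stream produces class scores
+and the two are fused late. Averaging the softmax outputs is one of the
+fusion schemes tried in the paper; an SVM on the scores is another.]
+
+```python
+import numpy as np
+
+def softmax(s):
+    e = np.exp(s - s.max())
+    return e / e.sum()
+
+def two_stream_predict(rgb_scores, flow_scores):
+    """Late fusion: average class probabilities from the spatial (RGB)
+    stream and the temporal (optical flow) stream."""
+    return 0.5 * (softmax(rgb_scores) + softmax(flow_scores))
+```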
+
+273
+00:19:01,808 --> 00:19:06,339
+OK, so so far we've only talked about
+the little local information in time,
+
+274
+00:19:06,339 --> 00:19:07,398
+right? So we have these little
+
+275
+00:19:07,398 --> 00:19:10,069
+pieces, like half a second, and we've tried
+to take advantage of them to get better
+
+276
+00:19:10,069 --> 00:19:13,739
+classification. But what happens if you
+have videos that actually have much
+
+277
+00:19:13,739 --> 00:19:14,489
+longer
+
+278
+00:19:14,489 --> 00:19:19,700
+temporal dependencies that you'd
+like to model? So it's not only that the
+
+279
+00:19:19,700 --> 00:19:22,319
+local motion is important, but actually
+there are some events throughout the
+
+280
+00:19:22,319 --> 00:19:25,548
+video that happen at much larger time
+scales than your network sees, and they actually
+
+281
+00:19:25,548 --> 00:19:29,618
+matter: so event two happening after event
+one can be very indicative of some class,
+
+282
+00:19:29,618 --> 00:19:33,999
+and you'd like to actually model that in
+your network. So what are the kinds of
+
+283
+00:19:33,999 --> 00:19:39,659
+approaches that you might think of for
+trying to, you know, how would
+
+284
+00:19:39,659 --> 00:19:42,659
+you actually model these kinds of
+much longer-term events?
+
+285
+00:19:44,618 --> 00:19:54,009
+OK, so an attention model, perhaps: so
+maybe you'd like to have attention; you're
+
+286
+00:19:54,009 --> 00:19:56,729
+trying to classify this entire video, and
+maybe you'd like to have attention over
+
+287
+00:19:56,729 --> 00:19:58,129
+different parts of the video.
+
+288
+00:19:58,128 --> 00:20:12,689
+Yeah, that's a good idea. I see, so you're
+saying that we have these multiscale
+
+289
+00:20:12,690 --> 00:20:16,479
+approaches where we process images at
+a very detailed level, but also sometimes
+
+290
+00:20:16,479 --> 00:20:20,298
+we resize the images and process them at
+a global level, so maybe with the frames
+
+291
+00:20:20,298 --> 00:20:23,710
+we can actually, like, speed up the video
+and put a convnet on that. I don't think
+
+292
+00:20:23,710 --> 00:20:28,048
+that's very common, but it's a
+sensible idea, I think. Yeah, so the
+
+293
+00:20:28,048 --> 00:20:33,618
+problem, roughly, is that basically this
+extent is maybe ten times too short; it
+
+294
+00:20:33,618 --> 00:20:37,019
+doesn't span very many seconds. So how do
+we make architectures that are a
+
+295
+00:20:37,019 --> 00:20:40,179
+function of much longer time scales in
+their prediction?
+
+296
+00:20:42,150 --> 00:20:48,300
+Yes: so the one idea here is, we have this
+video and we have different classes that
+
+297
+00:20:48,299 --> 00:20:50,599
+we'd like to predict at every single
+point in time, but we want that
+
+298
+00:20:50,599 --> 00:20:54,849
+prediction to be a function not only of a
+little chunk of, say, 15 frames, but actually
+
+299
+00:20:54,849 --> 00:20:59,149
+a much longer time span. So the idea
+that is sensible is: you actually use a
+
+300
+00:20:59,150 --> 00:21:01,769
+recurrent neural network somewhere in the
+architecture, because recurrent
+
+301
+00:21:01,769 --> 00:21:04,990
+networks allow you to have infinite
+context, in principle, over everything
+
+302
+00:21:04,990 --> 00:21:08,579
+that has happened before, up till
+that time. Especially if you go back to
+
+303
+00:21:08,579 --> 00:21:12,119
+this paper that I've already shown you,
+from 2011: it turns out that they have an
+
+304
+00:21:12,119 --> 00:21:16,289
+entire section where they take this,
+and they actually have an LSTM that
+
+305
+00:21:16,289 --> 00:21:21,109
+does exactly that. This is a paper from
+2011 using 3D conv and LSTMs, so way
+
+306
+00:21:21,109 --> 00:21:25,899
+before they were cool, in 2011, and so
+this paper basically has it all:
+
+307
+00:21:25,900 --> 00:21:29,920
+they model little local motions with 3D
+conv, and they model global motion
+
+308
+00:21:29,920 --> 00:21:34,860
+with LSTMs, and so they put LSTMs on
+top of the fully connected layers, so
+
+309
+00:21:34,859 --> 00:21:37,849
+they string together fully connected
+layers with this recurrence, and then
+
+310
+00:21:37,849 --> 00:21:40,939
+when you're predicting classes at every
+single frame, you have infinite context.
+
+311
+00:21:40,940 --> 00:21:45,930
+This paper is, I think, quite ahead of
+its time, and it basically has it all,
+
+312
+00:21:45,930 --> 00:21:49,900
+except it's only cited 65 times. I'm not
+sure why it was not more popular; I think
+
+313
+00:21:49,900 --> 00:21:54,680
+it's basically a way-ahead-of-its-time
+paper that recognized both of these,
+
+314
+00:21:54,680 --> 00:21:59,380
+3D convnets and LSTMs, way before
+people knew about them. So since then, there are
+
+315
+00:21:59,380 --> 00:22:02,990
+several more recent papers that actually
+kind of take a similar approach. So in
+
+316
+00:22:02,990 --> 00:22:07,190
+2015, by Jeff Donahue et al. from
+Berkeley, the idea here is that you have
+
+317
+00:22:07,190 --> 00:22:08,610
+video and you'd like to, again,
+
+318
+00:22:08,609 --> 00:22:11,819
+classify every single frame, and they
+have these convnets that look at
+
+319
+00:22:11,819 --> 00:22:14,809
+individual frames, but then they also have
+an LSTM that strings this
+
+320
+00:22:14,809 --> 00:22:19,389
+together temporally. A similar idea also
+comes from a paper from, I think, Google,
+
+321
+00:22:19,390 --> 00:22:24,160
+and so the idea here is that they have
+optical flow and images, processed by
+
+322
+00:22:24,160 --> 00:22:28,930
+convnets, and then again you have an LSTM
+that merges that over time. So again,
+
+323
+00:22:28,930 --> 00:22:34,680
+there's this combination of local and
+global. So, so far we've looked at kind of
+
+324
+00:22:34,680 --> 00:22:37,789
+two architectural patterns in
+convolutional video classification that
+
+325
+00:22:37,789 --> 00:22:43,170
+actually take into account motion
+information: modeling local motion, which
+
+326
+00:22:43,170 --> 00:22:47,289
+we do for example with 3D conv or
+optical flow, and more global motion,
+
+327
+00:22:47,289 --> 00:22:51,059
+where we have LSTMs stringing together
+sequences over many time steps,
+
+328
+00:22:51,059 --> 00:22:54,418
+or a fusion of the two.
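+
+[A sketch of the local-plus-global pattern, assuming PyTorch: per-frame
+convnet features are strung together in time by an LSTM, giving a class
+prediction at every frame that depends on everything seen so far. The
+dimensions are hypothetical.]
+
+```python
+import torch
+import torch.nn as nn
+
+T, B, D, H, C = 16, 1, 4096, 256, 101   # frames, batch, feature, hidden, classes
+frame_feats = torch.randn(T, B, D)      # e.g. fc7 features from a 2D convnet
+
+lstm = nn.LSTM(input_size=D, hidden_size=H)   # global, long-term modeling
+fc = nn.Linear(H, C)                          # class scores per time step
+
+hidden_seq, _ = lstm(frame_feats)   # (T, B, H)
+scores = fc(hidden_seq)             # (T, B, C): each frame's prediction is a
+print(scores.shape)                 # function of all frames before it
+```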
+
+329
+00:22:54,419 --> 00:22:59,879
+Now I'd actually like to make the point
+that there's another, cleaner, very nice
+
+330
+00:22:59,878 --> 00:23:03,689
+interesting idea that I saw in a recent
+paper, and that I like much more. So here's
+
+331
+00:23:03,690 --> 00:23:08,330
+basically the rough picture of what
+things look like right now: we have some
+
+332
+00:23:08,329 --> 00:23:13,038
+video, and we have a 3D convnet, say
+one that is using optical flow, maybe,
+
+333
+00:23:13,038 --> 00:23:17,898
+or using 3D conv, or both, on little
+chunks of frames, chunking up your video,
+
+334
+00:23:17,898 --> 00:23:20,979
+and then you have an RNN or LSTM on top,
+or something like that, that is doing
+
+335
+00:23:20,980 --> 00:23:24,950
+the long-term modeling. And so the kind
+of not-very-nice, unsettling thing
+
+336
+00:23:24,950 --> 00:23:29,499
+about this is that there's sort of this
+ugly asymmetry between these components: you
+
+337
+00:23:29,499 --> 00:23:33,079
+have these neurons inside the 3D
+conv that are only a function of some
+
+338
+00:23:33,079 --> 00:23:35,849
+small local chunk of video, and then you
+have these neurons at the very top that
+
+339
+00:23:35,849 --> 00:23:40,808
+are a function of everything in the video,
+because they're recurrent units that are a
+
+340
+00:23:40,808 --> 00:23:45,288
+function of everything that's come
+before, and so it's kind of like an
+
+341
+00:23:45,288 --> 00:23:48,720
+unsettling asymmetry, or something like
+that. So there's a paper with a very
+
+342
+00:23:48,720 --> 00:23:54,249
+clever idea from a few weeks ago
+that is much nicer and more homogeneous,
+
+343
+00:23:54,249 --> 00:23:58,118
+where everything is very nice
+and uniform and simple. So I don't
+
+344
+00:23:58,118 --> 00:24:06,819
+know if anyone can think of what we could
+do to make everything much
+
+345
+00:24:06,819 --> 00:24:09,019
+cleaner. And I couldn't; I
+
+346
+00:24:09,019 --> 00:24:22,399
+didn't come up with this idea, but I
+thought it was cool when I read it.
+
+347
+00:24:22,398 --> 00:24:25,288
+[A student suggests putting the recurrence
+before the convnet starts processing the images.]
+
+348
+00:24:25,288 --> 00:24:30,169
+I'm not sure what that would give you:
+you'd have the RNN under the optical
+
+349
+00:24:30,169 --> 00:24:34,090
+information and convnets on top of that,
+so you would certainly
+
+350
+00:24:34,089 --> 00:24:37,388
+have neurons that are a function of
+everything, but it's not clear what the
+
+351
+00:24:37,388 --> 00:24:51,678
+LSTM would be doing in that case; it would
+likely just be blurring the pixels. It's
+
+352
+00:24:51,679 --> 00:24:56,389
+too low-level for that kind of processing at
+that point. [Another student proposes a network
+
+353
+00:24:56,388 --> 00:25:04,038
+that works at different temporal resolutions:
+one part looking at every frame,
+
+354
+00:25:04,038 --> 00:25:07,009
+another looking at, say, every third or
+every tenth frame.] I see, so your
+
+355
+00:25:07,009 --> 00:25:10,179
+idea is, I think, similar to what
+someone pointed out, where you take this
+
+356
+00:25:10,179 --> 00:25:14,778
+video and you work on multiple scales of
+that video: you speed up the video and
+
+357
+00:25:14,778 --> 00:25:23,989
+you slow down the video, and then you
+have 3D convnets on these different,
+
+358
+00:25:23,989 --> 00:25:26,669
+like, speeds, or something like that; it's
+a sensible idea. Can you do background
+
+359
+00:25:26,669 --> 00:25:30,639
+subtraction and only look at things that
+are interesting to look at? I think that's a
+
+360
+00:25:30,638 --> 00:25:33,868
+reasonable idea; I think it kind of goes
+against this idea of having end-to-end
+
+361
+00:25:33,868 --> 00:25:37,759
+learning, because you're introducing
+this explicit computation that you think
+
+362
+00:25:37,759 --> 00:25:42,288
+is useful.
+
+363
+00:25:42,288 --> 00:25:48,658
+[A student asks about weight] sharing between
+the 3D convs and the RNN. That's interesting; I'm not
+
+364
+00:25:48,659 --> 00:25:52,139
+a hundred percent sure, because the RNN
+is just a hidden state vector and matrix
+
+365
+00:25:52,138 --> 00:25:55,678
+multiplies and things like that, but in
+the conv layers we have this, like, spatial
+
+366
+00:25:55,679 --> 00:26:05,369
+structure; I'm not actually sure how the
+sharing would work, but yeah. OK, so the
+
+367
+00:26:05,368 --> 00:26:11,319
+idea is that we're going to
+get rid of the RNN: we're
+
+368
+00:26:11,319 --> 00:26:14,408
+going to basically take the convnet, and
+we're going to make every single neuron
+
+369
+00:26:14,409 --> 00:26:17,379
+in that convnet a small
+recurrent neural network, like every
+
+370
+00:26:17,378 --> 00:26:21,648
+single neuron becomes recurrent in the
+convnet. OK, so the way this will work,
+
+371
+00:26:21,648 --> 00:26:27,178
+and I think it's beautiful, but their
+picture is kind of ugly, so
+
+372
+00:26:27,179 --> 00:26:29,730
+much of this makes no sense; let me
+try to explain it in a slightly
+
+373
+00:26:29,730 --> 00:26:36,278
+different way. What we'll do instead is:
+we have a conv layer somewhere in the
+
+374
+00:26:36,278 --> 00:26:40,278
+neural network, and it takes input from
+below, the output of a previous conv layer or
+
+375
+00:26:40,278 --> 00:26:43,398
+something, and we're doing convolutions
+over this to compute the output at this
+
+376
+00:26:43,398 --> 00:26:47,528
+layer, right? So the idea here is we're
+going to make every single conv
+
+377
+00:26:47,528 --> 00:26:53,058
+layer a kind of recurrent layer, and
+the way we do that is: just as
+
+378
+00:26:53,058 --> 00:26:57,528
+before, we take the input from below us
+and we do convs over it, but we also take our
+
+379
+00:26:57,528 --> 00:27:00,778
+previous output from the previous time
+step, this
+
+380
+00:27:00,778 --> 00:27:05,638
+layer's own output, this conv layer
+from the previous time step, in addition to
+
+381
+00:27:05,638 --> 00:27:09,408
+the current input at this time step, and
+we do convolutions over both this
+
+382
+00:27:09,409 --> 00:27:13,830
+one and that one. And then, you
+know, we have
+
+383
+00:27:13,829 --> 00:27:19,490
+these activations from the current input and
+activations from our previous output, and
+
+384
+00:27:19,490 --> 00:27:24,649
+we add them up or something like that; we
+do a recurrent-network-like merge
+
+385
+00:27:24,648 --> 00:27:28,719
+of those two to produce our output. And so
+we're a function of the current input
+
+386
+00:27:28,720 --> 00:27:34,730
+but also a function of our previous
+activations, if that makes sense. And
+
+387
+00:27:34,730 --> 00:27:37,200
+what's very nice about this is that we're
+in fact only using two-dimensional
+
+388
+00:27:37,200 --> 00:27:41,149
+convolutions here; there is no 3D
+conv anywhere, because both of these are
+
+389
+00:27:41,148 --> 00:27:44,678
+width by height by depth, right? The
+previous conv volume is just width by height by
+
+390
+00:27:44,679 --> 00:27:49,309
+depth from the previous layer, and ours is
+width by height by depth from the previous time, and
+
+391
+00:27:49,308 --> 00:27:52,408
+all of these are two-dimensional
+convolutions, but we end up with, kind of,
+
+392
+00:27:52,409 --> 00:27:57,710
+a recurrent process in here. And so
+one way to see this, going back to recurrent
+
+393
+00:27:57,710 --> 00:28:00,659
+neural networks, which we looked at, is
+that you have this recurrence where
+
+394
+00:28:00,659 --> 00:28:03,980
+you're trying to compute a new state as
+a function of your previous state
+
+395
+00:28:03,980 --> 00:28:07,878
+and the current input x. And we looked
+at many different ways of actually
+
+396
+00:28:07,878 --> 00:28:14,058
+wiring up that recurrence: so there's the
+vanilla RNN, or the LSTM, or the GRU; the GRU
+
+397
+00:28:14,058 --> 00:28:17,950
+is a simpler version of the LSTM, if you
+recall, but it almost always has similar
+
+398
+00:28:17,950 --> 00:28:21,548
+performance to an LSTM. So the GRU has
+slightly different update formulas for
+
+399
+00:28:21,548 --> 00:28:24,499
+actually performing that recurrence. And
+what they do in this paper is,
+
+400
+00:28:24,499 --> 00:28:27,950
+basically, they take the GRU, because it's
+a simpler version of an LSTM that
+
+401
+00:28:27,950 --> 00:28:31,899
+works just as well, but every
+single matrix multiply is, kind of,
+
+402
+00:28:31,898 --> 00:28:36,758
+replaced with a conv. So you can
+imagine that every single matrix
+
+403
+00:28:36,759 --> 00:28:41,819
+multiply here just becomes a conv: we
+convolve over our input, and we convolve
+
+404
+00:28:41,819 --> 00:28:45,798
+over our previous output, that's the
+before and the below, and then we combine
+
+405
+00:28:45,798 --> 00:28:50,329
+them with the recurrence, just as in the
+GRU, to actually get our activations. And
+
+406
+00:28:50,329 --> 00:28:57,158
+so before it looked like this, and now it
+just looks like that: we don't have
+
+407
+00:28:57,159 --> 00:29:01,179
+some parts of the net infinite in extent and
+some parts finite; we just have this
+
+408
+00:29:01,179 --> 00:29:05,679
+RNN-convnet where every single layer is
+recurrent: it's computing what it did before, but
+
+409
+00:29:05,679 --> 00:29:06,410
+it's also a function of
+
+410
+00:29:06,410 --> 00:29:11,610
+its previous output. And so this whole
+convnet is a function of everything, and
+
+411
+00:29:11,609 --> 00:29:19,799
+it's very uniform and clean: you just take
+a 2D convnet and make it recurrent, and
+maybe that's the simplest thing.
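+
+[A sketch of the recurrent conv layer just described, assuming PyTorch:
+the GRU update equations with every matrix multiply replaced by a 2D
+convolution, so the hidden state is a whole width x height x depth volume.
+This follows the idea of that paper, not its exact code.]
+
+```python
+import torch
+import torch.nn as nn
+
+class ConvGRUCell(nn.Module):
+    def __init__(self, channels, k=3):
+        super().__init__()
+        p = k // 2
+        self.conv_z = nn.Conv2d(2 * channels, channels, k, padding=p)  # update gate
+        self.conv_r = nn.Conv2d(2 * channels, channels, k, padding=p)  # reset gate
+        self.conv_h = nn.Conv2d(2 * channels, channels, k, padding=p)  # candidate
+
+    def forward(self, x, h_prev):
+        # x: input from below at this time step; h_prev: this layer's own
+        # output at the previous time step. Both are (N, C, H, W) volumes.
+        xh = torch.cat([x, h_prev], dim=1)
+        z = torch.sigmoid(self.conv_z(xh))
+        r = torch.sigmoid(self.conv_r(xh))
+        h_tilde = torch.tanh(self.conv_h(torch.cat([x, r * h_prev], dim=1)))
+        return (1 - z) * h_prev + z * h_tilde
+```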
+
+412
+00:29:19,799 --> 00:29:27,579
+So, to summarize: if you'd like to use
+spatiotemporal convolutional networks
+
+413
+00:29:27,579 --> 00:29:30,819
+in your projects, and you're very excited
+because you have videos, the first thing to
+
+414
+00:29:30,819 --> 00:29:34,359
+do is stop, and you should think about
+whether or not you really need to
+
+415
+00:29:34,359 --> 00:29:37,740
+process motion, local or global, and
+whether motion is really important to your
+
+416
+00:29:37,740 --> 00:29:41,839
+classification task. If you really think
+motion is important to you, then think
+
+417
+00:29:41,839 --> 00:29:44,829
+about whether you need to model
+local motions, whether those are important
+
+418
+00:29:44,829 --> 00:29:46,929
+for you, or whether the global motion is
+very important;
+
+419
+00:29:46,930 --> 00:29:50,370
+based on that, you get a hint of what you
+should try. But you always have to
+
+420
+00:29:50,369 --> 00:29:54,069
+compare to a single-frame baseline, I
+would say. And then you should try using
+
+421
+00:29:54,069 --> 00:29:57,539
+optical flow, because it seems that,
+especially with a smaller amount of data, it
+
+422
+00:29:57,539 --> 00:30:02,039
+actually is very important: it's a very
+nice signal to explicitly encode, and to
+
+423
+00:30:02,039 --> 00:30:06,099
+explicitly specify that optical flow
+is a useful feature to look at. And you
+
+424
+00:30:06,099 --> 00:30:09,609
+can try this GRU-RCN network that I
+mentioned just now, but I think it
+
+425
+00:30:09,609 --> 00:30:12,599
+is too recent and experimental, so I'm
+actually not sure if I can fully
+
+426
+00:30:12,599 --> 00:30:16,589
+endorse it or say that it works; it seems like
+a very nice idea, but it hasn't been
+
+427
+00:30:16,589 --> 00:30:21,849
+proven yet. And so that's kind of
+the rough layout of how we process
+
+428
+00:30:21,849 --> 00:30:25,339
+videos in the field. So let me know if there
+are any questions, because Justin is going
+
+429
+00:30:25,339 --> 00:30:28,339
+to come next.
+
+430
+00:30:33,980 --> 00:30:43,289
+You're asking whether this one has been
+used for NLP? That's a good question; I don't
+
+431
+00:30:43,289 --> 00:30:46,879
+think so. I'm not a super-duper expert on
+NLP, but I haven't seen this idea before,
+
+432
+00:30:46,880 --> 00:30:52,980
+so I would guess that... I haven't
+seen it; I don't think so. Good.
+
+433
+00:31:18,880 --> 00:31:26,660
+[On combining video with audio:] I would
+say that's definitely something people would
+
+434
+00:31:26,660 --> 00:31:31,810
+want to do. You don't see too many papers
+that do both of them, just because people
+
+435
+00:31:31,809 --> 00:31:35,639
+like to kind of isolate problems
+and tackle them maybe not jointly, but
+
+436
+00:31:35,640 --> 00:31:38,620
+certainly if a company were trying to get
+something working in a real system, you
+
+437
+00:31:38,619 --> 00:31:42,869
+would do something like that. But I don't
+think there's anything special you would do;
+
+438
+00:31:42,869 --> 00:31:45,449
+you'd probably do this with a late-
+fusion approach, where you have
+
+439
+00:31:45,450 --> 00:31:49,039
+whatever works best on videos, whatever
+works best on audio, and then merge them
+
+440
+00:31:49,039 --> 00:31:55,029
+somewhere later, somehow. That's
+something you can do, and with
+
+441
+00:31:55,029 --> 00:31:57,639
+neural networks it's very simple,
+because you just have a layer that's
+
+442
+00:31:57,640 --> 00:32:00,410
+looking at the output of both at some
+point, and then you're classifying as a
+
+443
+00:32:00,410 --> 00:32:09,860
+function of both. So we're going to
+switch over now, and I guess we have to get
+
+444
+00:32:09,859 --> 00:32:11,179
+here;
+
+445
+00:32:11,180 --> 00:32:14,180
+hopefully it works.
+
+446
+00:32:29,148 --> 00:32:34,108
+OK, so I guess we're going to switch gears
+completely and entirely, and talk about
+
+447
+00:32:34,108 --> 00:32:38,199
+unsupervised learning. So I'd like to
+make a little bit of a contrast here:
+
+448
+00:32:38,200 --> 00:32:42,460
+first we're going to talk about some
+sort of basic definitions of
+
+449
+00:32:42,460 --> 00:32:46,009
+unsupervised learning, and we're going to
+talk about two different sort of ways
+
+450
+00:32:46,009 --> 00:32:50,858
+that unsupervised learning has recently
+been attacked by deep learning people. So in
+
+451
+00:32:50,858 --> 00:32:53,408
+particular we're going to talk about
+autoencoders, and then this idea of
+
+452
+00:32:53,409 --> 00:32:58,679
+adversarial networks. And I guess I need
+my clicker. Right, so pretty much
+
+453
+00:32:58,679 --> 00:33:03,259
+everything we've seen in this class so
+far is supervised learning.
+
+454
+00:33:03,259 --> 00:33:07,128
+So the basic setup behind pretty much
+all supervised learning problems is that
+
+455
+00:33:07,128 --> 00:33:11,769
+we assume that in our dataset each
+data point has sort of two distinct parts: we have
+
+456
+00:33:11,769 --> 00:33:15,858
+our data x, and then we have some
+label or output y that we want to
+
+457
+00:33:15,858 --> 00:33:20,028
+produce from that input. And
+our whole goal in supervised learning is
+
+458
+00:33:20,028 --> 00:33:24,888
+to learn some function that takes in our
+input x and then produces this output
+
+459
+00:33:24,888 --> 00:33:29,538
+or label y. And if you really think
+about it, pretty much everything
+
+460
+00:33:29,538 --> 00:33:33,088
+we've seen in this class is some
+instance of this supervised learning
+
+461
+00:33:33,088 --> 00:33:37,358
+setup. So in something like image
+classification, x is an image and then
+
+462
+00:33:37,358 --> 00:33:41,960
+y is a label; for something like object
+detection, x is an image and then y
+
+463
+00:33:41,960 --> 00:33:46,119
+is maybe a set of objects in the image
+that you want to find; y could be a
+
+464
+00:33:46,118 --> 00:33:50,238
+caption, as we saw in captioning; x
+could be a video, and now y could be
+
+465
+00:33:50,239 --> 00:33:55,838
+either a label or a caption or pretty
+much anything. So I just want to
+
+466
+00:33:55,838 --> 00:33:59,450
+make the point that supervised learning
+is this very, very powerful
+
+467
+00:33:59,450 --> 00:34:03,819
+and generic framework that
+encompasses everything we've done in the
+
+468
+00:34:03,819 --> 00:34:08,960
+class so far. And the other point is that
+supervised learning actually makes
+
+469
+00:34:08,960 --> 00:34:12,639
+systems that work really well
+in practice, and is very useful for
+
+470
+00:34:12,639 --> 00:34:14,628
+practical applications.
+
+471
+00:34:14,628 --> 00:34:17,898
+Unsupervised learning, I think, is a
+little bit more of an open research
+
+472
+00:34:17,898 --> 00:34:22,338
+question at this point in time. So it's
+really cool, and I think it's really
+
+473
+00:34:22,338 --> 00:34:26,199
+important for solving AI in general,
+but at this point it's maybe a little
+
+474
+00:34:26,199 --> 00:34:30,028
+bit more of a research-focused type of
+area. It's also a little bit less
+
+475
+00:34:30,028 --> 00:34:34,568
+well-defined. So in unsupervised
+learning we generally assume that we
+
+476
+00:34:34,568 --> 00:34:37,579
+have just data: we only have the x's, we
+don't have any y,
+
+477
+00:34:38,349 --> 00:34:44,009
+and the goal of unsupervised learning is
+to do something with that data x, and
+
+478
+00:34:44,009 --> 00:34:48,199
+the something that we're trying to do
+really depends on the problem. So, in
+
+479
+00:34:48,199 --> 00:34:51,939
+general, we hope that we can discover
+some type of latent structure in the
+
+480
+00:34:51,940 --> 00:34:56,710
+data x without explicitly knowing
+anything about the labels. Some
+
+481
+00:34:56,710 --> 00:34:59,650
+classical examples that you might have
+seen in previous machine learning
+
+482
+00:34:59,650 --> 00:35:04,009
+classes would be things like clustering:
+something like k-means, where we have a
+
+483
+00:35:04,009 --> 00:35:07,728
+bunch of points and we discover
+structure by grouping them into
+
+484
+00:35:07,728 --> 00:35:13,268
+clusters.
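+
+[A tiny illustration with scikit-learn: just data x, no labels y, and we
+discover group structure by clustering. The random data is a stand-in.]
+
+```python
+import numpy as np
+from sklearn.cluster import KMeans
+
+X = np.random.rand(1000, 50)                        # only x's, no y's
+cluster_ids = KMeans(n_clusters=10).fit_predict(X)  # latent group structure
+print(cluster_ids.shape)                            # (1000,)
+```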
+
+485
+00:35:13,268 --> 00:35:18,248
+Some other classical examples of unsupervised
+learning would be principal component
+
+486
+00:35:18,248 --> 00:35:22,098
+analysis, where x is just a bunch of
+points, and we want to discover some
+
+487
+00:35:22,099 --> 00:35:27,170
+low-dimensional representation of that
+input data. So unsupervised learning is
+
+488
+00:35:27,170 --> 00:35:30,519
+this really sort of cool area, but a
+little bit more problem-specific and a
+
+489
+00:35:30,518 --> 00:35:37,228
+little bit less well-defined than
+supervised learning. So two architectures in particular that
+
+490
+00:35:37,228 --> 00:35:42,358
+people in deep learning have used for
+unsupervised learning: first, this idea
+
+491
+00:35:42,358 --> 00:35:46,048
+of an autoencoder; we'll talk about
+sort of traditional autoencoders,
+
+492
+00:35:46,048 --> 00:35:49,318
+which have a very, very long history;
+we'll also talk about variational
+
+493
+00:35:49,318 --> 00:35:54,308
+autoencoders, which are this sort of
+newer Bayesian twist on them; we'll also
+
+494
+00:35:54,309 --> 00:35:57,729
+talk about generative adversarial
+networks, which are this really nice
+
+495
+00:35:57,728 --> 00:36:06,718
+idea that lets you generate images and
+sample from a model of natural images. So the
+
+496
+00:36:06,719 --> 00:36:09,548
+idea with an autoencoder is pretty
+simple:
+
+497
+00:36:09,548 --> 00:36:14,088
+we have our input x, which is some
+data, and we're going to pass this input
+
+498
+00:36:14,088 --> 00:36:19,710
+data through some kind of encoding
+network to produce some features, some
+
+499
+00:36:19,710 --> 00:36:24,440
+latent features z. You could think of
+this stage a little
+
+500
+00:36:24,440 --> 00:36:28,219
+bit like a learnable principal component
+analysis: we're going to take our input
+
+501
+00:36:28,219 --> 00:36:33,298
+data and then convert it into some other
+feature representation. So many
+
+502
+00:36:33,298 --> 00:36:38,940
+times these x's will be images, like
+the CIFAR-10 images shown here. And this
+
+503
+00:36:38,940 --> 00:36:42,989
+encoder network could be something
+very complicated: for something like
+
+504
+00:36:42,989 --> 00:36:47,228
+PCA it's just a simple linear transform,
+but in general this might be a fully
+
+505
+00:36:47,228 --> 00:36:51,799
+connected network. Originally, sort of
+maybe five or ten years ago, this was
+
+506
+00:36:51,800 --> 00:36:56,130
+often a single-layer fully connected
+network with sigmoid units; now it's
+
+507
+00:36:56,130 --> 00:37:00,410
+often a deep network with ReLU
+units, and this could also be something
+
+508
+00:37:00,409 --> 00:37:09,230
+like a convolutional network. Right,
+so we also have this idea that z,
+
+509
+00:37:09,230 --> 00:37:13,820
+the features that we learn, are
+usually smaller in size than x, so
+
+510
+00:37:13,820 --> 00:37:18,789
+we want z to be some kind of useful
+features about the data x: we
+
+511
+00:37:18,789 --> 00:37:22,610
+don't want the network to just transform
+the data into some
+
+512
+00:37:22,610 --> 00:37:26,370
+useless representation; we want to force
+it to actually crush the data down and
+
+513
+00:37:26,369 --> 00:37:29,900
+summarize its statistics in some useful
+way that could hopefully be useful
+
+514
+00:37:29,900 --> 00:37:34,720
+for downstream processing. But the
+problem is that we don't really have any
+
+515
+00:37:34,719 --> 00:37:39,219
+explicit labels to use for this
+downstream processing.
+
+516
+00:37:39,219 --> 00:37:43,159
+So instead we need to invent some kind
+of surrogate task that we can solve using just the data
+
+517
+00:37:43,159 --> 00:37:50,159
+itself. And the surrogate task that we
+often use for autoencoders is this idea
+
+518
+00:37:50,159 --> 00:37:55,719
+of reconstruction: since we don't have
+any y's to learn a mapping to, instead
+
+519
+00:37:55,719 --> 00:38:00,119
+we're just going to try to reproduce the
+data x from those features z, and
+
+520
+00:38:00,119 --> 00:38:05,119
+especially if those features are smaller
+in size, then hopefully that will force the
+
+521
+00:38:05,119 --> 00:38:07,139
+network to summarize,
+
+522
+00:38:07,139 --> 00:38:11,420
+to summarize the useful statistics of
+the input data, and hopefully discover
+
+523
+00:38:11,420 --> 00:38:16,289
+some useful features that could be, one,
+useful for reconstruction, but more
+
+524
+00:38:16,289 --> 00:38:19,920
+generally might be useful for some
+other tasks if we
+
+525
+00:38:19,920 --> 00:38:26,340
+later get some supervised data. So again,
+this decoder network could be pretty
+
+526
+00:38:26,340 --> 00:38:30,050
+complicated. When autoencoders
+first came about,
+
+527
+00:38:30,050 --> 00:38:33,720
+oftentimes these were just either
+a simple linear network or a small
+
+528
+00:38:33,719 --> 00:38:37,459
+sigmoid network, but now they can be
+deep networks, and oftentimes these
+
+529
+00:38:37,460 --> 00:38:43,220
+will be upconvolutional networks, which
+we've seen before. So
+
+530
+00:38:43,219 --> 00:38:46,869
+oftentimes this decoder nowadays will be
+one of these upconvolutional networks
+
+531
+00:38:46,869 --> 00:38:50,529
+that takes your features, that again
+are smaller in size than your input data,
+
+532
+00:38:50,530 --> 00:38:56,880
+and kind of blows them back up in size to
+reproduce your original data. And I'd
+
+533
+00:38:56,880 --> 00:39:00,579
+like to make the point that these things
+are actually pretty easy to train. So
+
+534
+00:39:00,579 --> 00:39:04,610
+right here is a quick example that I
+just cooked up in Torch: this is a four-
+
+535
+00:39:04,610 --> 00:39:05,050
+layer
+
+536
+00:39:05,050 --> 00:39:09,210
+encoder, which is a convolutional network,
+and a four-layer decoder, which is an up-
+
+537
+00:39:09,210 --> 00:39:12,420
+convolutional network, and you can see
+that it actually learns to reconstruct
+
+538
+00:39:12,420 --> 00:39:19,159
+the data pretty well. Another thing that
+you sometimes see is that the encoder
+
+539
+00:39:19,159 --> 00:39:23,799
+and decoder networks will sometimes
+share weights, just sort of as a
+
+540
+00:39:23,800 --> 00:39:27,740
+regularization strategy, and with the
+intuition that these are opposite
+
+541
+00:39:27,739 --> 00:39:32,329
+operations, so maybe it might make sense
+to try to use the same weights for both.
+
+542
+00:39:32,329 --> 00:39:36,659
+Just as a concrete example: if
+you think about a fully connected
+
+543
+00:39:36,659 --> 00:39:39,980
+network, then maybe your input data has
+some dimension D,
+
+544
+00:39:39,980 --> 00:39:44,070
+and then your latent data will
+have some smaller dimension H, and if
+
+545
+00:39:44,070 --> 00:39:47,769
+this encoder is just a fully connected
+network, then the weights will just be
+
+546
+00:39:47,769 --> 00:39:51,630
+this matrix of D by H. And now, when we
+want to do the decoding and try to
+reconstruct the original data,
+
+548
+00:39:54,469 --> 00:39:59,129
+then that mapping goes back from H to D,
+so we can just reuse the same weights in these
+
+549
+00:39:59,130 --> 00:40:06,420
+two layers: we just take the transpose of
+the matrix.
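+
+[A minimal numpy sketch of the weight sharing just described: one matrix W
+of shape D x H encodes, and its transpose decodes. The dimensions and the
+ReLU nonlinearity are illustrative choices.]
+
+```python
+import numpy as np
+
+D, H = 784, 128                       # input dim, smaller latent dim
+W = 0.01 * np.random.randn(D, H)      # shared weights, D x H
+
+x = np.random.rand(32, D)             # a batch of inputs
+z = np.maximum(x @ W, 0)              # encoder: D -> H (ReLU)
+x_hat = z @ W.T                       # decoder reuses W transposed: H -> D
+
+loss = np.mean(np.sum((x_hat - x) ** 2, axis=1))   # L2 reconstruction loss
+```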
+
+550
+00:40:06,420 --> 00:40:10,300
+So when we're training this thing, we
+need some kind of loss function
+
+551
+00:40:10,300 --> 00:40:15,400
+that we can use to compare our
+reconstructed data with our original
+
+552
+00:40:15,400 --> 00:40:20,220
+data, and oftentimes we'll see a simple
+L2 Euclidean loss used to train this thing.
+
+553
+00:40:20,219 --> 00:40:24,659
+So once we've chosen our encoder network,
+our decoder network, and loss function,
+
+554
+00:40:24,659 --> 00:40:28,329
+then we can train this thing just like
+any other normal neural network: we
+
+555
+00:40:28,329 --> 00:40:32,420
+get some data, we pass it through the
+encoder, we pass it through the decoder, we
+
+556
+00:40:32,420 --> 00:40:37,900
+compute the loss, we backpropagate, and
+everything's good. So once we've trained this
+
+557
+00:40:37,900 --> 00:40:41,880
+thing, then oftentimes we'll take this
+decoder network that we spent so much
+
+558
+00:40:41,880 --> 00:40:46,700
+time learning and just throw it
+away, which seems kind of weird, but the
+
+559
+00:40:46,699 --> 00:40:52,129
+reason is that reconstruction on its own
+is not such a useful task. So instead we
+
+560
+00:40:52,130 --> 00:40:56,349
+want to apply these networks to some
+kind of actually useful task, which is
+
+561
+00:40:56,349 --> 00:41:01,099
+probably a supervised learning task. So
+here the setup is that we've learned
+
+562
+00:41:01,099 --> 00:41:05,179
+this encoder network, which hopefully,
+from all this unsupervised data, has
+
+563
+00:41:05,179 --> 00:41:08,799
+learned to compress the
+data and extract some useful features,
+
+564
+00:41:08,800 --> 00:41:13,190
+and then we're going to use this encoder
+network to initialize part of a larger
+
+565
+00:41:13,190 --> 00:41:17,650
+supervised network. And now, if we actually
+do have access to maybe some smaller
+
+566
+00:41:17,650 --> 00:41:18,280
+dataset
+
+567
+00:41:18,280 --> 00:41:22,590
+that has some labels, then hopefully
+most of the work here could have
+
+568
+00:41:22,590 --> 00:41:26,309
+been done by this unsupervised training
+at the beginning, and then we can just
+
+569
+00:41:26,309 --> 00:41:29,699
+use that to initialize this bigger
+network and then fine-tune the whole
+
+570
+00:41:29,699 --> 00:41:35,509
+thing with hopefully a very small amount
+of supervised data. So this is kind of,
+
+571
+00:41:35,510 --> 00:41:39,380
+this is one of the dreams of
+unsupervised feature learning: that you
+
+572
+00:41:39,380 --> 00:41:43,410
+have these really, really large datasets
+with no labels (you can just go on Google
+
+573
+00:41:43,409 --> 00:41:46,409
+and download images forever, and it's
+really easy to get a lot of images;
+
+574
+00:41:46,969 --> 00:41:51,399
+the problem is the labels are expensive
+to collect), so you'd want some system
+
+575
+00:41:51,400 --> 00:41:54,960
+that could take advantage of both a
+huge amount of unsupervised data
+
+576
+00:41:54,960 --> 00:41:59,570
+and also just a small amount of
+supervised data. So autoencoders are at
+
+577
+00:41:59,570 --> 00:42:03,940
+least one thing that has been proposed
+that has this nice property, but in
+
+578
+00:42:03,940 --> 00:42:07,670
+practice I think it tends not to work
+too well, which is a little bit
+
+579
+00:42:07,670 --> 00:42:12,010
+unfortunate, because it's such a
+beautiful idea.
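+
+[A sketch of that fine-tuning setup, assuming PyTorch: keep the pretrained
+encoder, throw away the decoder, and bolt a classifier on top. The encoder
+architecture and sizes here are hypothetical; it assumes 32x32 inputs.]
+
+```python
+import torch.nn as nn
+
+encoder = nn.Sequential(              # pretend this was trained by
+    nn.Conv2d(3, 32, 3, padding=1),   # unsupervised reconstruction
+    nn.ReLU(), nn.MaxPool2d(2),
+    nn.Conv2d(32, 64, 3, padding=1),
+    nn.ReLU(), nn.MaxPool2d(2),
+)
+
+classifier = nn.Sequential(           # initialize from the encoder, then
+    encoder,                          # fine-tune on a small labeled set
+    nn.Flatten(),
+    nn.Linear(64 * 8 * 8, 10),
+)
+```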
+579
+00:42:07,670 --> 00:42:12,010
+unfortunate because it's such a
+beautiful idea another thing that I
+
+580
+00:42:12,010 --> 00:42:15,890
+should point out almost as a side note
+is that if you go back and read the
+
+581
+00:42:15,889 --> 00:42:21,179
+literature on these things from the
+mid-2000s in the last 10 years then
+
+582
+00:42:21,179 --> 00:42:25,129
+people had this funny thing called
+greedy layer-wise pre-training that
+
+583
+00:42:25,130 --> 00:42:30,010
+they used for training autoencoders and
+here the idea was that at the time in
+
+584
+00:42:30,010 --> 00:42:35,410
+2006 training very deep networks was
+challenging and you can find
+
+585
+00:42:35,409 --> 00:42:39,429
+quotes in papers like this that say
+that even when you have maybe 4 or 5 hidden
+
+586
+00:42:39,429 --> 00:42:44,359
+layers it was extremely challenging for
+people in those days to train networks so
+
+587
+00:42:44,360 --> 00:42:48,760
+to get around that problem they
+instead had this paradigm where they
+
+588
+00:42:48,760 --> 00:42:53,560
+would try to train just one layer at a
+time and they used this thing that I
+
+589
+00:42:53,559 --> 00:42:57,139
+don't want to get too much into called the
+restricted Boltzmann machine which is a
+
+590
+00:42:57,139 --> 00:43:01,279
+type of graphical model and they would use
+these restricted Boltzmann machines to
+
+591
+00:43:01,280 --> 00:43:05,880
+kind of train these layers
+one at a time so first we would have our
+
+592
+00:43:05,880 --> 00:43:12,070
+input image maybe of size W1
+and this would be maybe something
+
+593
+00:43:12,070 --> 00:43:16,630
+like PCA or some other kind of fixed
+transform and then we would hopefully
+
+594
+00:43:16,630 --> 00:43:19,990
+learn using a restricted Boltzmann
+machine some kind of relationship
+
+595
+00:43:19,989 --> 00:43:25,359
+between those first-layer features and
+some higher level features and once
+
+596
+00:43:25,360 --> 00:43:27,940
+we learned this layer we'd freeze it
+
+597
+00:43:27,940 --> 00:43:30,840
+and learn another restricted Boltzmann
+machine on top of those features
+
+598
+00:43:30,840 --> 00:43:36,000
+connecting them to the next level features
+so by using this type of approach it let
+
+599
+00:43:36,000 --> 00:43:40,050
+them train just one layer at a time in
+this sort of greedy way and that let
+
+600
+00:43:40,050 --> 00:43:43,980
+them hopefully find a really good
+initialization for this larger network
+
+601
+00:43:43,980 --> 00:43:48,369
+so after this greedy pre-training stage
+they would stick the whole thing
+
+602
+00:43:48,369 --> 00:43:52,099
+together into this giant autoencoder
+and then fine-tune the autoencoder
+
+603
+00:43:52,099 --> 00:44:00,469
+jointly so nowadays we don't really need
+to do this with things like ReLU and
+
+604
+00:44:00,469 --> 00:44:04,139
+proper initialization and batch
+normalization and slightly fancier
+
+605
+00:44:04,139 --> 00:44:08,730
+optimizers this type of thing is
+not really necessary anymore so as an
+
+606
+00:44:08,730 --> 00:44:12,659
+example on the previous slide we saw
+this four-layer convolutional
+
+607
+00:44:12,659 --> 00:44:16,409
+deconvolutional autoencoder that I
+trained on CIFAR and this was just
+
+608
+00:44:16,409 --> 00:44:17,429
+trained directly
+
+609
+00:44:17,429 --> 00:44:20,149
+using all these modern neural network
+techniques you don't have to mess around
+610
+00:44:20,150 --> 00:44:25,039
+with this layer-wise training so this is not
+something that really gets done anymore
+
+611
+00:44:25,039 --> 00:44:27,800
+but I thought we should at least
+mention it since you'll probably
+
+612
+00:44:27,800 --> 00:44:35,990
+encounter this idea if you read back in
+the literature about these things so the
+
+613
+00:44:35,989 --> 00:44:39,949
+basic idea of an autoencoder is I
+think pretty simple it's this beautiful
+
+614
+00:44:39,949 --> 00:44:44,009
+idea where we can just use a lot of
+unsupervised data to hopefully learn
+
+615
+00:44:44,010 --> 00:44:49,710
+some nice features unfortunately that
+doesn't work too well but that's OK there's
+
+616
+00:44:49,710 --> 00:44:53,639
+maybe some other nice type of task we
+would want to do with unsupervised data
+
+617
+00:44:53,639 --> 00:44:56,639
+question first
+
+618
+00:44:59,068 --> 00:45:10,308
+yes so the question is what's
+going on here right so this is
+
+619
+00:45:10,309 --> 00:45:14,880
+maybe you could think about it as a
+three-layer neural network so our input
+
+620
+00:45:14,880 --> 00:45:18,410
+is gonna be the same as the output so
+we're just hoping that this is a neural
+
+621
+00:45:18,409 --> 00:45:22,788
+network that will learn the identity
+function and in
+
+622
+00:45:22,789 --> 00:45:26,099
+order to learn the identity function we
+have some loss function at the end
+
+623
+00:45:26,099 --> 00:45:29,989
+something like an L2 loss that is
+encouraging our input and output
+
+624
+00:45:29,989 --> 00:45:35,429
+to be the same and learning the identity
+function is probably a really easy thing
+
+625
+00:45:35,429 --> 00:45:39,379
+to do but instead we're going to force
+the network to not take the easy route
+
+626
+00:45:39,380 --> 00:45:43,410
+and instead hopefully rather than just
+regurgitating the data and learning the
+
+627
+00:45:43,409 --> 00:45:46,909
+identity function in the easy way
+instead we're gonna bottleneck the
+
+628
+00:45:46,909 --> 00:45:51,268
+representation through this hidden layer
+in the middle so then it's gonna learn
+
+629
+00:45:51,268 --> 00:45:54,798
+the identity function but in the middle
+the network is gonna have to squeeze
+
+630
+00:45:54,798 --> 00:45:59,829
+down and summarize and compress the data
+and hopefully that compression will
+
+631
+00:45:59,829 --> 00:46:04,339
+give rise to features that are useful
+for other tasks does that make it a little
+
+632
+00:46:04,338 --> 00:46:14,719
+bit more clear OK question the claim
+was that PCA is just the answer for this
+
+633
+00:46:14,719 --> 00:46:19,259
+problem so it's true that PCA is optimal
+in certain senses if you're only allowed
+
+634
+00:46:19,259 --> 00:46:25,278
+to do one linear transform if your encoder
+and your decoder are just a single
+
+635
+00:46:25,278 --> 00:46:30,259
+linear transform then indeed PCA is
+optimal in some sense but if your
+
+636
+00:46:30,259 --> 00:46:34,170
+encoder and decoder are potentially
+larger more complicated functions that
+
+637
+00:46:34,170 --> 00:46:39,059
+are maybe multi-layer neural
+networks then maybe PCA is no
+
+638
+00:46:39,059 --> 00:46:43,209
+longer the right solution another point
+to make is that PCA is only optimal in
+
+639
+00:46:43,208 --> 00:46:44,308
+certain senses
+
+640
+00:46:44,309 --> 00:46:48,670
+particularly talking about L2
+reconstruction but in practice we don't
+
+641
+00:46:48,670 --> 00:46:51,798
+actually care about reconstruction we're
+just hoping that this thing will learn
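+As a worked illustration of that claim, here is a small numpy sketch of PCA
+acting as the optimal linear encoder/decoder under L2; the toy data and the
+bottleneck size are invented for the example.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+X = rng.normal(size=(1000, 20))      # toy data, one example per row
+Xc = X - X.mean(axis=0)              # PCA works on centered data
+U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
+
+k = 5                                # size of the "bottleneck"
+Z = Xc @ Vt[:k].T                    # linear encode: project onto top-k directions
+X_hat = Z @ Vt[:k]                   # linear decode: map back with the transpose
+err = np.mean((Xc - X_hat) ** 2)     # no other rank-k linear map gets lower L2 error
+```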
+642
+00:46:51,798 --> 00:46:56,538
+useful features for other tasks so in
+practice and we'll see this a bit later
+
+643
+00:46:56,539 --> 00:47:00,259
+that people don't always use L2
+anymore because L2 is maybe not
+
+644
+00:47:00,259 --> 00:47:04,719
+quite the right loss for actually
+learning features yeah
+
+645
+00:47:04,719 --> 00:47:14,348
+so the RBM is this
+kind of generative model of the data
+
+646
+00:47:14,349 --> 00:47:18,250
+where you imagine that you have
+sort of two sets of units and you
+
+647
+00:47:18,250 --> 00:47:19,108
+want to do this
+
+648
+00:47:19,108 --> 00:47:23,579
+generative modeling of the two
+things so then you need to get into
+
+649
+00:47:23,579 --> 00:47:26,440
+quite a lot of Bayesian statistics to
+figure out exactly what the loss function
+
+650
+00:47:26,440 --> 00:47:31,260
+is but it ends up being something like
+the likelihood of the data with these
+
+651
+00:47:31,260 --> 00:47:35,470
+latent states that you don't observe and
+that's actually a cool idea that we will
+
+652
+00:47:35,469 --> 00:47:40,868
+sort of revisit in the variational
+autoencoder so one of the
+
+653
+00:47:40,869 --> 00:47:45,280
+problems with this traditional
+autoencoder is that it's hoping to learn
+
+654
+00:47:45,280 --> 00:47:49,590
+features that's a cool thing but
+there's this other thing that we would
+
+655
+00:47:49,590 --> 00:47:54,670
+like to not just learn features but
+also be able to generate new data a cool
+
+656
+00:47:54,670 --> 00:47:59,320
+task that we could potentially learn
+from unsupervised data is that hopefully
+
+657
+00:47:59,320 --> 00:48:03,030
+our model could slurp in a bunch of
+images and after it does that it sort of
+
+658
+00:48:03,030 --> 00:48:06,990
+learns what natural images look like and
+then after it's learned this distribution
+
+659
+00:48:06,989 --> 00:48:11,449
+then it could hopefully spit out sort of
+fake images that look like the original
+
+660
+00:48:11,449 --> 00:48:17,949
+images but are fake and this is maybe
+not a task which is directly
+
+661
+00:48:17,949 --> 00:48:22,319
+applicable to things like classification
+but it seems like an important thing for
+
+662
+00:48:22,320 --> 00:48:26,588
+AI since humans are pretty good at
+looking at data and summarizing it and
+
+663
+00:48:26,588 --> 00:48:31,199
+getting the idea of what it looks like
+so hopefully if our models could also do
+
+664
+00:48:31,199 --> 00:48:34,969
+this sort of task then hopefully they'll
+have learned some useful
+
+665
+00:48:34,969 --> 00:48:41,299
+summarization or some useful statistics
+of the data so the variational
+
+666
+00:48:41,300 --> 00:48:45,539
+autoencoder is this kind of neat twist on
+the original autoencoder that lets us
+
+667
+00:48:45,539 --> 00:48:50,690
+hopefully actually generate novel images
+from our learned data so here we need to
+
+668
+00:48:50,690 --> 00:48:54,849
+dive into a little bit of Bayesian
+statistics so this is something that we
+
+669
+00:48:54,849 --> 00:48:58,320
+haven't really talked about at all in
+this class up to this point
+
+670
+00:48:58,320 --> 00:49:02,420
+but there's this whole other side of
+machine learning that doesn't do neural
+
+671
+00:49:02,420 --> 00:49:05,250
+networks and deep learning but thinks
+really hard about probability
+
+672
+00:49:05,250 --> 00:49:09,260
+distributions and how probability
+distributions can fit together to
+673
+00:49:09,260 --> 00:49:13,190
+generate data sets and then reason
+probabilistically about your data and
+
+674
+00:49:13,190 --> 00:49:16,670
+this type of paradigm is really nice
+because it lets you sort of state
+
+675
+00:49:16,670 --> 00:49:17,970
+explicit probabilistic
+
+676
+00:49:17,969 --> 00:49:22,000
+assumptions about how you think your
+data was generated and then given those
+
+677
+00:49:22,000 --> 00:49:25,858
+probabilistic assumptions you try to
+fit a model to the data that follows
+
+678
+00:49:25,858 --> 00:49:30,199
+your assumptions so with the variational
+autoencoder we're assuming this
+
+679
+00:49:30,199 --> 00:49:35,589
+particular type of process by which
+our data was generated so we assume that
+
+680
+00:49:35,590 --> 00:49:39,800
+there exists out there in the
+world some prior distribution which is
+
+681
+00:49:39,800 --> 00:49:44,440
+generating these latent states z and
+we assume some conditional
+
+682
+00:49:44,440 --> 00:49:49,789
+distribution so that once we have the
+latent states we can generate samples
+
+683
+00:49:49,789 --> 00:49:54,389
+from some other distribution to generate
+the data so the variational autoencoder
+
+684
+00:49:54,389 --> 00:49:58,170
+really imagines that our data was
+generated by this pretty simple process
+
+685
+00:49:58,170 --> 00:50:03,639
+that first we sample from some prior
+distribution to get our z then we
+
+686
+00:50:03,639 --> 00:50:10,940
+sample from this conditional to get our
+x so the intuition is that x is
+
+687
+00:50:10,940 --> 00:50:15,240
+something like an image and z maybe
+summarizes some useful stuff about that
+
+688
+00:50:15,239 --> 00:50:19,649
+image so if these were CIFAR images
+then maybe that latent state z could
+
+689
+00:50:19,650 --> 00:50:23,800
+be something like the class of the image
+whether it's a frog or a deer or a cat and
+
+690
+00:50:23,800 --> 00:50:27,690
+also might contain variables about how
+that cat is oriented or what color it is
+
+691
+00:50:27,690 --> 00:50:29,269
+or something like that
+
+692
+00:50:29,269 --> 00:50:33,719
+so this is kind of a nice
+pretty simple idea
+
+693
+00:50:33,719 --> 00:50:37,279
+but makes a lot of sense for how you
+might imagine images to be
+
+694
+00:50:37,280 --> 00:50:43,670
+generated so the problem now is that we
+want to estimate these parameters
+
+695
+00:50:43,670 --> 00:50:48,470
+theta of both the prior and the
+conditional without actually having
+
+696
+00:50:48,469 --> 00:50:52,598
+access to these latent states z and
+that's a challenging
+
+697
+00:50:52,599 --> 00:50:57,588
+problem so to make it simple we're gonna
+do something that you see a lot in
+
+698
+00:50:57,588 --> 00:51:00,769
+Bayesian statistics and we'll just assume
+that the prior is a unit Gaussian that's
+
+699
+00:51:00,769 --> 00:51:07,088
+easy to handle and the conditional
+will also be Gaussian but it's gonna be a
+
+700
+00:51:07,088 --> 00:51:11,489
+little bit fancier so we'll assume that
+it's a Gaussian with a diagonal
+
+701
+00:51:11,489 --> 00:51:16,729
+covariance and some mean and
+we're not gonna fix those
+
+702
+00:51:16,730 --> 00:51:19,650
+instead the way that we're going to get
+those is we're going to compute them
+
+703
+00:51:19,650 --> 00:51:24,800
+with a neural network so suppose that
+we had the latent state z for some piece
+
+704
+00:51:24,800 --> 00:51:27,579
+of data we assume that that latent
+state
+705
+00:51:27,579 --> 00:51:32,160
+will go into some decoder network which
+could be some big complicated neural
+
+706
+00:51:32,159 --> 00:51:36,078
+network and now that neural network is
+gonna spit out two things it's gonna
+
+707
+00:51:36,079 --> 00:51:40,079
+spit out the mean of the data
+x and it's also gonna spit out
+
+708
+00:51:40,079 --> 00:51:45,068
+the variance of
+the data x so you should think that
+
+709
+00:51:45,068 --> 00:51:48,958
+this looks very much like the top half
+of a normal autoencoder in that we have
+
+710
+00:51:48,958 --> 00:51:52,699
+this latent state we have some network
+that's operating on the latent state but
+
+711
+00:51:52,699 --> 00:51:57,588
+now instead of just directly spitting
+out the data instead it's spitting out
+
+712
+00:51:57,588 --> 00:52:01,690
+the mean of the data and the variance of
+the data but other than that this looks
+
+713
+00:52:01,690 --> 00:52:07,528
+very much like the decoder of the normal
+autoencoder so this decoder
+
+714
+00:52:07,528 --> 00:52:11,518
+network sort of thinking back to the
+normal autoencoder might be a simple
+
+715
+00:52:11,518 --> 00:52:14,578
+fully connected thing or it might be
+this very big powerful deconvolutional
+
+716
+00:52:14,579 --> 00:52:22,269
+network and both of those are pretty
+common so now the problem is that
+
+717
+00:52:22,268 --> 00:52:26,679
+by Bayes' rule given the prior and given
+the conditional Bayes' rule tells us the
+
+718
+00:52:26,679 --> 00:52:31,578
+posterior so if we want to
+actually use this model we need to be
+
+719
+00:52:31,579 --> 00:52:35,209
+able to estimate the latent state from
+the input data and the way that we
+
+720
+00:52:35,208 --> 00:52:38,659
+estimate the latent state from the
+input data is by writing down this
+
+721
+00:52:38,659 --> 00:52:42,899
+posterior distribution which is the
+probability of the latent state z given our
+
+722
+00:52:42,900 --> 00:52:47,519
+observed data and using Bayes' rule we can
+easily flip this around and write it in
+
+723
+00:52:47,518 --> 00:52:54,189
+terms of our prior over z and in terms
+of our conditional given z and so we
+
+724
+00:52:54,190 --> 00:52:57,249
+can use Bayes' rule to actually flip this
+thing around and write it in terms of
+
+725
+00:52:57,248 --> 00:53:02,409
+these three things so after we apply
+Bayes' rule we can break down these
+
+726
+00:53:02,409 --> 00:53:06,818
+three terms and we can see that for the
+conditional we just use our decoder
+
+727
+00:53:06,818 --> 00:53:11,558
+network and we easily have access to
+that and this prior again we have access
+
+728
+00:53:11,559 --> 00:53:15,569
+to the prior we assumed it's a unit
+Gaussian so that's easy to handle but
+
+729
+00:53:15,568 --> 00:53:19,458
+this denominator this probability of
+x it turns out if you work out
+
+730
+00:53:19,458 --> 00:53:22,828
+the math and write it out this ends up
+being this giant intractable integral
+
+731
+00:53:22,829 --> 00:53:26,579
+over the entire latent state space so
+that's completely intractable there's no
+
+732
+00:53:26,579 --> 00:53:29,479
+way you could ever perform that integral
+and even approximating it would be a
+
+733
+00:53:29,478 --> 00:53:33,399
+giant disaster so instead we won't
+even try to evaluate that integral
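+Written out, the generative model and the Bayes' rule manipulation just described
+are as follows (theta denotes the decoder parameters; this is standard VAE
+notation, not notation taken from the slide):
+
+```latex
+z \sim p(z) = \mathcal{N}(0, I), \qquad
+x \mid z \sim p_\theta(x \mid z)
+  = \mathcal{N}\big(\mu_\theta(z), \operatorname{diag}(\sigma^2_\theta(z))\big)
+
+p_\theta(z \mid x) = \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(x)}, \qquad
+p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz
+  \quad \text{(the intractable integral)}
+```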
+734
+00:53:33,400 --> 00:53:38,759
+instead we're going to introduce some
+encoder network that will try to
+
+735
+00:53:38,759 --> 00:53:40,179
+directly perform
+
+736
+00:53:40,179 --> 00:53:45,210
+inference for us so this encoder network
+is going to take in a data point
+
+737
+00:53:45,210 --> 00:53:48,599
+and it's going to spit out a
+distribution over the latent state
+
+738
+00:53:48,599 --> 00:53:53,210
+space so again this looks very much
+looking back at the original
+
+739
+00:53:53,210 --> 00:53:57,449
+autoencoder from a few slides ago this looks
+very much the same as sort of the bottom
+
+740
+00:53:57,449 --> 00:54:01,449
+half of a traditional autoencoder
+where we're taking in data and now
+
+741
+00:54:01,449 --> 00:54:04,789
+instead of directly spitting out the
+latent state we're gonna spit out a mean
+
+742
+00:54:04,789 --> 00:54:09,519
+and variance of the latent state and
+again this encoder network might be
+
+743
+00:54:09,519 --> 00:54:13,639
+some simple fully connected
+network or maybe some deep
+
+744
+00:54:13,639 --> 00:54:21,159
+convolutional network so sort of the
+intuition is that this encoder network
+
+745
+00:54:21,159 --> 00:54:25,259
+will be a separate totally different
+distribution but we're going to
+
+746
+00:54:25,260 --> 00:54:29,180
+try to train it in a way that it
+approximates this posterior distribution
+
+747
+00:54:29,179 --> 00:54:35,799
+that we don't actually have access to
+and so when we put all the pieces together
+
+748
+00:54:35,800 --> 00:54:40,700
+then we can stitch this
+all together and give rise to this
+
+749
+00:54:40,699 --> 00:54:44,808
+variational autoencoder so once we put
+these things together then we have this
+
+750
+00:54:44,809 --> 00:54:49,559
+input data point x we're gonna pass it
+through our encoder network and the
+
+751
+00:54:49,559 --> 00:54:52,819
+encoder network will spit out a
+distribution over the latent states
+
+752
+00:54:52,818 --> 00:54:57,789
+once we have this distribution over
+the latent states you
+
+753
+00:54:57,789 --> 00:55:01,650
+could imagine sampling from that
+distribution to get some latent
+
+754
+00:55:01,650 --> 00:55:07,700
+state of high probability for
+that input then once
+
+755
+00:55:07,699 --> 00:55:11,889
+we have some concrete example of a
+latent state then we can pass it through
+
+756
+00:55:11,889 --> 00:55:16,409
+this decoder network which will then
+spit
+
+757
+00:55:16,409 --> 00:55:20,469
+out the probability of the data
+again and then once we have this
+
+758
+00:55:20,469 --> 00:55:24,439
+distribution over the data we could
+sample from it to actually get something
+
+759
+00:55:24,440 --> 00:55:29,950
+that hopefully looks like the original
+data point so this ends up looking
+
+760
+00:55:29,949 --> 00:55:34,269
+very much like a normal autoencoder
+where we're taking our input data we're
+
+761
+00:55:34,269 --> 00:55:37,829
+running it through this encoder to get
+some latent state we're passing it into this
+
+762
+00:55:37,829 --> 00:55:42,200
+decoder to try to reconstruct the original
+data and when you go about training this
+
+763
+00:55:42,199 --> 00:55:46,149
+thing it's actually trained in a very
+similar way as a normal autoencoder
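+A minimal numpy sketch of that forward pass, with tiny random stand-ins for the
+trained encoder and decoder; every name and size here is a hypothetical
+placeholder, not the lecture's model.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+D, H = 784, 2                                  # made-up image and latent sizes
+We = rng.normal(0, 0.1, (D, 2 * H))            # stand-in encoder weights
+Wd = rng.normal(0, 0.1, (H, 2 * D))            # stand-in decoder weights
+
+def encoder(x):                                # x -> (mu_z, logvar_z)
+    out = np.tanh(x @ We)
+    return out[:H], out[H:]
+
+def decoder(z):                                # z -> (mu_x, logvar_x)
+    out = np.tanh(z @ Wd)
+    return out[:D], out[D:]
+
+x = rng.random(D)                              # stand-in input image
+mu_z, logvar_z = encoder(x)                    # distribution over latent states
+z = mu_z + np.exp(0.5 * logvar_z) * rng.standard_normal(H)      # sample a latent
+mu_x, logvar_x = decoder(z)                    # distribution over reconstructions
+x_hat = mu_x + np.exp(0.5 * logvar_x) * rng.standard_normal(D)  # sample an image
+```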
+764
+00:55:46,150 --> 00:55:50,230
+we have this forward pass and this backward
+pass the only difference is in the loss
+
+765
+00:55:50,230 --> 00:55:55,490
+function so at the top we have this
+reconstruction loss rather than being
+
+766
+00:55:55,489 --> 00:56:01,078
+described by a simple L2 loss instead we want
+this distribution to be close to the true
+
+767
+00:56:01,079 --> 00:56:07,349
+input data and we also have this loss
+term coming in the middle that we want
+
+768
+00:56:07,349 --> 00:56:11,230
+this generated distribution over the
+latent states to hopefully be very similar
+
+769
+00:56:11,230 --> 00:56:16,579
+to our stated prior distribution that we
+wrote down at the very beginning so once
+
+770
+00:56:16,579 --> 00:56:19,200
+you put these pieces together you can
+just train this thing like a normal
+
+771
+00:56:19,199 --> 00:56:22,969
+autoencoder with a normal forward
+pass and backward pass
+
+772
+00:56:22,969 --> 00:56:29,058
+the only difference is where you put the
+loss and how you interpret the loss so
+
+773
+00:56:29,059 --> 00:56:32,500
+any questions about the setup
+we went through it kind of
+
+774
+00:56:32,500 --> 00:56:39,608
+fast yeah the question is why do you choose a
+diagonal covariance and the answer is because it's
+
+775
+00:56:39,608 --> 00:56:44,199
+really easy to work with but
+actually people have tried I think
+
+776
+00:56:44,199 --> 00:56:50,210
+slightly fancier things too but that's
+something you can play around with OK so
+
+777
+00:56:50,210 --> 00:56:53,530
+once we've actually trained this
+kind of
+
+778
+00:56:53,530 --> 00:56:56,920
+variational autoencoder we can
+actually use it to generate new data
+
+779
+00:56:56,920 --> 00:57:00,510
+that looks kind of like the original dataset
+so here
+
+780
+00:57:00,510 --> 00:57:04,430
+the idea is that remember we wrote down
+this prior that might be a unit Gaussian
+
+781
+00:57:04,429 --> 00:57:07,960
+or maybe something a little bit fancier
+but at any rate this prior is
+
+782
+00:57:07,960 --> 00:57:12,039
+some distribution that we can easily
+sample from for a unit Gaussian it's very
+
+783
+00:57:12,039 --> 00:57:15,989
+easy to draw random samples from that
+distribution so to generate new data
+
+784
+00:57:15,989 --> 00:57:20,459
+we'll start by just sort of following
+this data generation process
+
+785
+00:57:20,460 --> 00:57:24,849
+that we had imagined so first we'll
+sample from our prior
+
+786
+00:57:24,849 --> 00:57:28,430
+distribution over the latent states and
+then we'll pass it through our decoder
+
+787
+00:57:28,429 --> 00:57:32,078
+network that we have learned during
+training and this decoder network will
+
+788
+00:57:32,079 --> 00:57:36,190
+now spit out a distribution over data
+points in terms of both
+
+789
+00:57:36,190 --> 00:57:40,460
+a mean and covariance and once we have a
+mean and covariance this is just a
+
+790
+00:57:40,460 --> 00:57:44,548
+diagonal Gaussian we can easily sample from
+this thing again to generate a data point
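+Sketching that generation loop, again with a hypothetical stand-in for the
+trained decoder; the grid at the end previews the dense latent-space scan
+discussed next.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(1)
+H, D = 2, 784                              # made-up latent and image sizes
+Wd = rng.normal(0, 0.1, (H, 2 * D))        # stand-in for trained decoder weights
+
+def decoder(z):                            # z -> (mu_x, logvar_x)
+    out = np.tanh(z @ Wd)
+    return out[:D], out[D:]
+
+z = rng.standard_normal(H)                 # draw z from the unit Gaussian prior
+mu_x, logvar_x = decoder(z)                # diagonal Gaussian over images
+x_new = mu_x + np.exp(0.5 * logvar_x) * rng.standard_normal(D)
+
+# The dense "scan" replaces the random z with a grid over the 2-D latent space:
+grid = [np.array([zx, zy]) for zy in np.linspace(-3, 3, 20)
+                           for zx in np.linspace(-3, 3, 20)]
+images = [decoder(zg)[0] for zg in grid]   # decode the mean image at each point
+```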
+791
+00:57:44,548 --> 00:57:50,369
+so now once you train this thing
+then another thing you can do is sort of
+
+792
+00:57:50,369 --> 00:57:54,440
+scan out the latent space
+rather than sampling from the latent
+
+793
+00:57:54,440 --> 00:57:58,490
+distribution instead you just densely
+sample latents from the latent
+
+794
+00:57:58,489 --> 00:58:01,979
+space to kind of get an idea of what type
+of structure the network has
+
+795
+00:58:01,980 --> 00:58:09,280
+learned so this is doing exactly that on
+this dataset so here we trained this
+
+796
+00:58:09,280 --> 00:58:12,990
+variational autoencoder where
+the latent state is just a
+
+797
+00:58:12,989 --> 00:58:17,959
+two-dimensional thing and now we can
+actually scan out this latent space we
+
+798
+00:58:17,960 --> 00:58:22,490
+can explore densely this two-dimensional
+latent space and for each point in the
+
+799
+00:58:22,489 --> 00:58:26,519
+latent space pass it through the decoder
+and use it to generate some image you
+
+800
+00:58:26,519 --> 00:58:30,599
+can see that it's actually discovered
+this beautiful structure that sort
+
+801
+00:58:30,599 --> 00:58:34,618
+of smoothly interpolates between the
+different digit classes so up
+
+802
+00:58:34,619 --> 00:58:38,530
+here at the left you see sixes that kind
+of morph into zeros as you go down you
+
+803
+00:58:38,530 --> 00:58:42,690
+see sixes that turn into sevens and
+maybe nines and the eights are
+
+804
+00:58:42,690 --> 00:58:46,159
+hanging out in the middle somewhere and
+the ones are down here so this latent
+
+805
+00:58:46,159 --> 00:58:50,049
+space actually learned this beautiful
+disentanglement of the data in this very
+
+806
+00:58:50,050 --> 00:58:55,910
+nice unsupervised way we can also run
+this thing on a faces dataset and it's
+
+807
+00:58:55,909 --> 00:58:59,199
+the same sort of story where we're just
+training this two-dimensional variational
+
+808
+00:58:59,199 --> 00:59:02,679
+autoencoder and then once we train it
+we densely sample from that latent
+
+809
+00:59:02,679 --> 00:59:05,679
+space to try to see what it has learned
+question
+
+810
+00:59:13,018 --> 00:59:19,458
+yeah so the question is whether people
+ever try to force specific latent
+
+811
+00:59:19,458 --> 00:59:23,139
+variables to have some exact
+meaning and yeah there has been some
+
+812
+00:59:23,139 --> 00:59:27,058
+follow-up work that does exactly that
+there is a paper called Deep Inverse
+
+813
+00:59:27,059 --> 00:59:31,890
+Graphics Networks from MIT that
+does exactly this setup where they
+
+814
+00:59:31,889 --> 00:59:36,199
+want to learn sort
+of a renderer as a neural network so
+
+815
+00:59:36,199 --> 00:59:41,568
+they want to learn to like render 3D
+images of things and they force some
+
+816
+00:59:41,568 --> 00:59:44,619
+of the
+variables in the latent space to
+
+817
+00:59:44,619 --> 00:59:49,289
+correspond to the 3D angles of the
+object and maybe the class and the
+
+818
+00:59:49,289 --> 00:59:53,009
+pose of the object and the rest of
+them are left to learn whatever it
+
+819
+00:59:53,009 --> 00:59:56,099
+wants and they have some cool
+experiments where now they can do exactly
+
+820
+00:59:56,099 --> 01:00:00,809
+as you said and by setting those
+specific values of the latent variables
+
+821
+01:00:00,809 --> 01:00:03,869
+they can render and actually rotate the
+object and those are pretty
+
+822
+01:00:03,869 --> 01:00:09,390
+cool but that's a lot
+fancier than this but these
+
+823
+01:00:09,389 --> 01:00:11,908
+faces are still pretty cool you can see
+it sort of interpolating between
+
+824
+01:00:11,909 --> 01:00:16,689
+different faces in this very nice way
+and I think that there's actually a very
+
+825
+01:00:16,688 --> 01:00:21,759
+nice motivation here and one of the
+reasons we pick a diagonal Gaussian is
+
+826
+01:00:21,759 --> 01:00:26,079
+that that has the probabilistic
+interpretation that
+
+827
+01:00:26,079 --> 01:00:29,179
+the different variables in our
+latent space
+
+828
+01:00:29,179 --> 01:00:33,918
+actually should be
+independent so I
+829
+01:00:33,918 --> 01:00:37,219
+think that helps to explain why there
+actually is this very nice separation
+
+830
+01:00:37,219 --> 01:00:40,858
+between the axes when you end up
+sampling from the latent space it's due
+
+831
+01:00:40,858 --> 01:00:45,630
+to this probabilistic independence
+assumption embedded in the prior so this
+
+832
+01:00:45,630 --> 01:00:51,139
+idea of a prior is very powerful and
+lets you sort of bake those types of
+
+833
+01:00:51,139 --> 01:00:54,028
+things directly into the model so I
+wrote down a bunch of math and I don't
+
+834
+01:00:54,028 --> 01:00:57,849
+think we really have time to go through
+it but the idea is that sort of
+
+835
+01:00:57,849 --> 01:01:01,130
+classically when you're training
+generative models there's this thing
+
+836
+01:01:01,130 --> 01:01:04,608
+called maximum likelihood where you want
+to maximize the likelihood of your data
+
+837
+01:01:04,608 --> 01:01:09,018
+under the model and pick the model
+that makes your data most likely
+
+838
+01:01:09,018 --> 01:01:13,068
+but it turns out that if you just try to
+run maximum likelihood
+
+839
+01:01:13,068 --> 01:01:17,708
+using this generative process that we
+had imagined for the variational autoencoder
+
+840
+01:01:17,708 --> 01:01:21,009
+then you end up
+needing to marginalize this joint
+
+841
+01:01:21,009 --> 01:01:24,289
+distribution which becomes this giant
+intractable integral over the entire
+
+842
+01:01:24,289 --> 01:01:25,890
+latent state space that's not something
+that we can do
+
+843
+01:01:25,889 --> 01:01:29,659
+so instead the variational autoencoder
+does this thing called
+
+844
+01:01:29,659 --> 01:01:34,259
+variational inference which is a pretty
+cool idea and the math is here in case
+
+845
+01:01:34,260 --> 01:01:38,150
+you want to go through it but the idea
+is that instead of maximizing the log
+
+846
+01:01:38,150 --> 01:01:42,619
+probability of the data we're gonna
+cleverly insert this extra constant and
+
+847
+01:01:42,619 --> 01:01:47,429
+break it up into these two different
+terms so this is an exact
+
+848
+01:01:47,429 --> 01:01:50,419
+equivalence that you can maybe work
+through on your own but this log
+
+849
+01:01:50,420 --> 01:01:54,710
+likelihood we can write in terms of this
+term that we call the ELBO and this
+
+850
+01:01:54,710 --> 01:01:58,869
+other term which is a KL divergence
+between two distributions and we know
+
+851
+01:01:58,869 --> 01:02:03,029
+that KL divergence is always
+non-negative so we know that the KL divergence
+
+852
+01:02:03,030 --> 01:02:07,120
+between these distributions is non-negative so
+we know that this term has to be non-negative
+
+853
+01:02:07,119 --> 01:02:12,420
+which means that this ELBO term
+actually is a lower bound on the log
+
+854
+01:02:12,420 --> 01:02:16,480
+likelihood of our data and notice that
+in the process of writing down this
+
+855
+01:02:16,480 --> 01:02:20,889
+ELBO we introduce this additional
+parameter phi that we can interpret as
+
+856
+01:02:20,889 --> 01:02:25,710
+the parameters of this encoder
+network that is sort of approximating
+
+857
+01:02:25,710 --> 01:02:30,909
+this hard posterior distribution so now
+instead of trying to directly maximize
+
+858
+01:02:30,909 --> 01:02:34,319
+the log likelihood of our data instead
+we'll try to maximize this variational
+
+859
+01:02:34,320 --> 01:02:39,539
+lower bound of the data and because the
+ELBO is a lower bound of the log likelihood
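+Concretely, the decomposition being described is the standard VAE identity
+(q with parameters phi is the encoder's approximate posterior):
+
+```latex
+\log p_\theta(x)
+  = \underbrace{\mathbb{E}_{z \sim q_\phi(z \mid x)}\!\big[\log p_\theta(x \mid z)\big]
+    - D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)}_{\text{ELBO}(\theta,\, \phi)}
+  \;+\; \underbrace{D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)}_{\ge 0}
+```
+
+And because both q and the prior are Gaussian, the KL term has the well-known
+closed form used below when the lecture says it can be evaluated explicitly:
+
+```latex
+D_{KL}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big)
+  = \tfrac{1}{2} \sum_j \big(\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2\big)
+```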
+860
+01:02:39,539 --> 01:02:43,769
+then maximizing the ELBO
+will also have the effect of raising up
+
+861
+01:02:43,769 --> 01:02:49,059
+the log likelihood and these
+two terms of the ELBO actually have
+
+862
+01:02:49,059 --> 01:02:53,360
+this beautiful interpretation that this
+one at the front is the
+
+863
+01:02:53,360 --> 01:02:57,849
+expectation over the latent states z
+over the latent state space of the
+
+864
+01:02:57,849 --> 01:03:01,889
+probability of x given the latent state
+so if you think about it that's
+
+865
+01:03:01,889 --> 01:03:05,559
+actually a data reconstruction term
+that's saying that if we averaged over
+
+866
+01:03:05,559 --> 01:03:08,789
+all possible latent states then we
+should end up with something that is
+
+867
+01:03:08,789 --> 01:03:13,639
+similar to our original data and
+this other term is actually a
+
+868
+01:03:13,639 --> 01:03:17,940
+regularization term this is the KL
+divergence between our approximate
+
+869
+01:03:17,940 --> 01:03:22,059
+posterior and the prior so this
+is a regularizer trying to force
+
+870
+01:03:22,059 --> 01:03:27,019
+those two things together so in practice
+this first term you can approximate
+
+871
+01:03:27,019 --> 01:03:31,590
+by sampling using this trick from the
+paper that I won't get into and this other
+
+872
+01:03:31,590 --> 01:03:35,600
+term again because everything is
+Gaussian here you can just
+
+873
+01:03:35,599 --> 01:03:38,489
+evaluate the KL divergence explicitly
+
+874
+01:03:38,489 --> 01:03:44,509
+so I think this is the most math-heavy
+slide in the class so that's kind
+
+875
+01:03:44,510 --> 01:03:50,020
+of fun it looks
+scary but it's actually just
+
+876
+01:03:50,019 --> 01:03:54,150
+exactly this autoencoder idea we
+have a reconstruction and then you have
+
+877
+01:03:54,150 --> 01:03:59,050
+this penalty penalizing divergence
+from the prior so any questions
+
+878
+01:03:59,050 --> 01:04:08,840
+on the variational autoencoder so in
+general the idea of an autoencoder is
+
+879
+01:04:08,840 --> 01:04:12,180
+that we want to force a network to try
+to reconstruct our data and hopefully
+
+880
+01:04:12,179 --> 01:04:16,089
+it will learn sort of useful
+representations of the data for traditional
+
+881
+01:04:16,090 --> 01:04:19,470
+autoencoders this is used for feature
+learning but once we move to variational
+
+882
+01:04:19,469 --> 01:04:23,569
+autoencoders we make this thing Bayesian
+so we can actually generate samples that
+
+883
+01:04:23,570 --> 01:04:29,440
+are similar to our data so then this
+idea of generating samples from data
+
+884
+01:04:29,440 --> 01:04:32,690
+is really cool and everyone loves
+looking at these kinds of pictures so
+
+885
+01:04:32,690 --> 01:04:37,119
+there's another idea maybe we can
+generate really cool samples without all
+
+886
+01:04:37,119 --> 01:04:41,100
+this scary Bayesian math and it turns
+out that there's this idea called a
+
+887
+01:04:41,099 --> 01:04:45,219
+generative adversarial network that is a
+sort of different idea a different twist
+
+888
+01:04:45,219 --> 01:04:49,799
+that lets you still generate samples
+that look like your data but sort of a
+
+889
+01:04:49,800 --> 01:04:52,560
+little bit more explicitly without
+having to worry about divergences
+
+890
+01:04:52,559 --> 01:04:54,340
+and priors and this sort of stuff
+891
+01:04:54,340 --> 01:04:58,920
+the idea is that we're gonna have a
+generator network but first we're
+
+892
+01:04:58,920 --> 01:05:02,780
+gonna start with some random noise
+probably drawn from a unit Gaussian or
+
+893
+01:05:02,780 --> 01:05:07,060
+something like that and then we're going
+to have a generator network and this
+
+894
+01:05:07,059 --> 01:05:11,079
+generator network actually looks very
+much like the decoder in the variational
+
+895
+01:05:11,079 --> 01:05:15,849
+autoencoder or like the second half of
+a normal autoencoder in that we're
+
+896
+01:05:15,849 --> 01:05:20,449
+taking this random noise and we're gonna
+spit out an image that is going to be
+
+897
+01:05:20,449 --> 01:05:26,379
+some fake not real image that we're just
+generating using this trained network then
+
+898
+01:05:26,380 --> 01:05:29,410
+we're also going to hook up a
+discriminator network that is going to
+
+899
+01:05:29,409 --> 01:05:32,679
+look at this fake image and try to
+decide whether or not that
+
+900
+01:05:32,679 --> 01:05:34,769
+generated image is real or fake
+
+901
+01:05:34,769 --> 01:05:38,679
+so this second network is
+just doing this binary classification
+
+902
+01:05:38,679 --> 01:05:42,949
+task where it receives an input and it
+just needs to say whether or not it's
+
+903
+01:05:42,949 --> 01:05:46,739
+a real image or not that's just sort of a
+
+904
+01:05:46,739 --> 01:05:49,739
+classification task that you can hook up
+like anything else
+
+905
+01:05:50,730 --> 01:05:55,349
+so then we can train this whole thing
+jointly altogether
+
+906
+01:05:55,960 --> 01:06:01,179
+where our generator network will receive
+mini-batches of random noise and it'll
+
+907
+01:06:01,179 --> 01:06:06,629
+spit out fake images and our
+discriminator network will receive mini
+
+908
+01:06:06,630 --> 01:06:12,640
+batches of partially these fake images and
+partially real images from a dataset and
+
+909
+01:06:12,639 --> 01:06:16,039
+it will try to do
+this classification task to say which
+
+910
+01:06:16,039 --> 01:06:21,358
+are real and which are fake and so this
+is sort of now another way that we can
+
+911
+01:06:21,358 --> 01:06:25,880
+hook up this kind of supervised learning
+problem-ish without any real labels so we
+
+912
+01:06:25,880 --> 01:06:30,390
+hook this thing up and we train the
+whole thing jointly
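+A minimal PyTorch sketch of that joint training loop; the network sizes,
+optimizers, and the random "real" batch are placeholders invented for the
+example, not details from the lecture.
+
+```python
+import torch
+import torch.nn as nn
+
+G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
+D = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 1))
+opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
+opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
+bce = nn.BCEWithLogitsLoss()
+
+for step in range(1000):
+    real = torch.rand(32, 784)                  # stand-in for a real mini-batch
+    fake = G(torch.randn(32, 64))               # generator maps noise to images
+
+    # Discriminator: classify real vs. fake
+    d_loss = bce(D(real), torch.ones(32, 1)) + \
+             bce(D(fake.detach()), torch.zeros(32, 1))
+    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
+
+    # Generator: try to make the discriminator call its fakes real
+    g_loss = bce(D(fake), torch.ones(32, 1))
+    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
+```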
+913
+01:06:30,389 --> 01:06:34,730
+so we can look at some examples from
+the original generative adversarial
+
+914
+01:06:34,730 --> 01:06:38,840
+networks paper and so these are fake
+images that are generated by the network
+
+915
+01:06:38,840 --> 01:06:41,829
+and you can see that it's done a very
+good job of actually generating fake
+
+916
+01:06:41,829 --> 01:06:46,549
+digits they look like real digits and
+here this middle column is showing
+
+917
+01:06:46,550 --> 01:06:50,080
+actually the nearest neighbor in the
+training set of those digits to
+
+918
+01:06:50,079 --> 01:06:53,599
+hopefully let you know that it doesn't
+just memorize the training set so for
+
+919
+01:06:53,599 --> 01:06:57,389
+example this two has a little dot and
+this guy doesn't have a dot so it's not
+
+920
+01:06:57,389 --> 01:07:01,079
+just memorizing training data and it
+also does a pretty good job of
+
+921
+01:07:01,079 --> 01:07:05,849
+generating faces but you know as
+people who've worked in machine learning
+
+922
+01:07:05,849 --> 01:07:10,440
+know these digits and face datasets
+tend to be pretty easy to generate
+
+923
+01:07:10,440 --> 01:07:16,869
+samples from and when we apply this
+task to CIFAR then our samples don't
+
+924
+01:07:16,869 --> 01:07:21,840
+quite look as nice and clean so here
+it's clearly got some idea about CIFAR
+
+925
+01:07:21,840 --> 01:07:25,108
+data it's making blue stuff and green
+stuff but they don't really look like
+
+926
+01:07:25,108 --> 01:07:32,429
+real objects so that's a problem
+so some follow-up work
+
+927
+01:07:32,429 --> 01:07:35,599
+on generative
+adversarial networks has tried to make
+
+928
+01:07:35,599 --> 01:07:38,529
+these architectures bigger and more
+powerful to hopefully be able to
+
+929
+01:07:38,530 --> 01:07:44,080
+generate better samples on these more
+complex datasets so one idea is this
+
+930
+01:07:44,079 --> 01:07:48,949
+idea of multiscale processing so rather
+than generating the image all at once
+
+931
+01:07:48,949 --> 01:07:53,919
+we're actually gonna generate our image
+at multiple scales in this way so first
+
+932
+01:07:53,920 --> 01:07:58,170
+we're gonna have a generator that
+receives noise and then generates
+
+933
+01:07:58,170 --> 01:08:03,670
+a low resolution image and then we'll up
+sample that low res guy and apply a second
+
+934
+01:08:03,670 --> 01:08:04,200
+generator
+
+935
+01:08:04,199 --> 01:08:08,230
+that receives a new batch of random
+noise and computes some delta on top of
+
+936
+01:08:08,230 --> 01:08:12,070
+the low res image then we'll upsample
+that again and repeat the process
+
+937
+01:08:12,070 --> 01:08:16,810
+several times until we've
+finally generated our
+
+938
+01:08:16,810 --> 01:08:22,219
+final result so this is again a very
+similar idea to the
+
+939
+01:08:22,219 --> 01:08:25,329
+original generative adversarial network
+we're just generating at multiple scales
+
+940
+01:08:25,329 --> 01:08:30,199
+simultaneously and the training here is
+a little bit more complex you actually have a
+
+941
+01:08:30,199 --> 01:08:35,710
+discriminator at each scale and
+hopefully that helps so
+
+942
+01:08:35,710 --> 01:08:39,039
+when we look at the samples from this
+trained guy they're actually a lot better so here
+
+943
+01:08:39,039 --> 01:08:43,869
+they actually trained a separate model
+per class on CIFAR-10 so here they've
+
+944
+01:08:43,869 --> 01:08:48,599
+trained this adversarial network on just
+planes from CIFAR and you can see
+
+945
+01:08:48,600 --> 01:08:51,460
+that they're starting to look like real
+planes so that's getting
+
+946
+01:08:51,460 --> 01:08:52,210
+somewhere
+
+947
+01:08:52,210 --> 01:08:56,689
+these look almost like real cars and
+these maybe look kind of like real
+
+948
+01:08:56,689 --> 01:09:04,278
+birds so in the following year people
+actually threw away this multiscale idea
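+The multiscale pipeline just described can be sketched roughly as follows; the
+generator stand-ins, the upsampling choice, and the scales are all hypothetical,
+and training details are omitted.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+def upsample(img):
+    return img.repeat(2, axis=0).repeat(2, axis=1)   # 2x nearest-neighbor
+
+def make_gen(h, w):
+    # Stand-in "generator": a fixed random map from 16-D noise to an h-by-w image
+    W = rng.normal(0, 0.1, (16, h * w))
+    return lambda z: np.tanh(z @ W).reshape(h, w)
+
+G0 = make_gen(8, 8)                        # coarsest scale: noise -> small image
+refiners = [make_gen(16, 16), make_gen(32, 32)]
+
+img = G0(rng.standard_normal(16))          # generate at the lowest resolution
+for Gk in refiners:                        # then repeatedly upsample and add a
+    img = upsample(img) + Gk(rng.standard_normal(16))  # noise-driven residual
+# (a real refiner would also condition on the upsampled image, and each scale
+# would get its own discriminator during training; both are omitted here)
+```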
+949
+01:09:04,279 --> 01:09:09,339
+instead they just used a simpler more
+principled convnet so here the idea
+
+950
+01:09:09,338 --> 01:09:14,318
+is forget about this multiscale stuff
+and just use batch norm don't use
+
+951
+01:09:14,319 --> 01:09:17,739
+fully connected layers sort of all these
+architectural constraints that have
+
+952
+01:09:17,738 --> 01:09:22,759
+become common practice in the last couple
+years just use those and it turns out that
+
+953
+01:09:22,759 --> 01:09:27,969
+these adversarial networks work really
+well so here their generator is this
+
+954
+01:09:27,969 --> 01:09:33,088
+pretty simple
+pretty small convolutional network and
+
+955
+01:09:33,088 --> 01:09:38,539
+the discriminator is again just a simple
+network with batch normalization and all
+
+956
+01:09:38,539 --> 01:09:42,180
+these other bells and whistles and once
+you hook up this thing they get some
+
+957
+01:09:42,180 --> 01:09:47,810
+amazing samples in this paper so these
+are generated bedrooms from the network
+
+958
+01:09:47,810 --> 01:09:53,450
+so these actually are pretty impressive
+results these look almost like real data
+
+959
+01:09:53,449 --> 01:09:57,529
+so you can see that it's done a really
+good job of capturing
+
+960
+01:09:57,529 --> 01:10:00,920
+really detailed structure about bedrooms
+like there's a bed there's a window
+
+961
+01:10:00,920 --> 01:10:07,710
+there's a light switch so these are
+really amazing samples but it
+
+962
+01:10:07,710 --> 01:10:12,579
+turns out that rather than just
+generating samples we can play the same
+
+963
+01:10:12,579 --> 01:10:16,260
+trick as the variational autoencoder
+and try to play
+
+964
+01:10:16,260 --> 01:10:16,670
+around
+
+965
+01:10:16,670 --> 01:10:21,739
+with the latent space because these
+adversarial networks are receiving this
+
+966
+01:10:21,738 --> 01:10:25,579
+noise input and we can cleverly try to
+move around in that noise space and
+
+967
+01:10:25,579 --> 01:10:29,920
+try to change the type of things that
+these networks generate so one example
+
+968
+01:10:29,920 --> 01:10:36,050
+that we can try is interpolating between
+bedrooms so here the idea is
+
+969
+01:10:36,050 --> 01:10:40,119
+that for these images on the left hand side
+we've drawn
+
+970
+01:10:40,119 --> 01:10:43,550
+a random point from our noise
+distribution and then used it to generate
+
+971
+01:10:43,550 --> 01:10:47,690
+an image and now on the right hand side
+we've done the same and we generate
+
+972
+01:10:47,689 --> 01:10:51,259
+another random point from our noise
+distribution and use it to generate an
+
+973
+01:10:51,260 --> 01:10:57,710
+image so now these two guys on the
+opposite sides are sort of
+
+974
+01:10:57,710 --> 01:11:01,760
+two points on a line and we want to
+interpolate in the latent space between
+
+975
+01:11:01,760 --> 01:11:08,210
+those two latent vectors and along that
+line we're gonna use
+
+976
+01:11:08,210 --> 01:11:11,859
+the generator to generate images and
+hopefully this will interpolate between
+
+977
+01:11:11,859 --> 01:11:16,439
+the latent states of those two guys and
+you can see that this is pretty crazy
+
+978
+01:11:16,439 --> 01:11:22,169
+that these bedrooms morph sort
+of in a very nice smooth continuous way
+
+979
+01:11:22,170 --> 01:11:28,020
+from one bedroom to another and
+one thing to point out is that this
+
+980
+01:11:28,020 --> 01:11:32,300
+morphing is actually happening in kind of
+a nice semantic way if you imagine what
+
+981
+01:11:32,300 --> 01:11:35,460
+this would look like in pixel space then
+it would just be kind of this
+
+982
+01:11:35,460 --> 01:11:39,100
+fading effect and it would not look very
+good at all but here you can see that
+
+983
+01:11:39,100 --> 01:11:42,690
+actually the shapes of these things and
+colors are sort of continuously
+
+984
+01:11:42,689 --> 01:11:50,119
+deforming from one side to the other
+which is quite fun
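+The interpolation experiment itself is tiny in code; a sketch, with a
+hypothetical stand-in generator (the names and sizes are invented):
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+W = rng.normal(0, 0.1, (100, 784))         # stand-in for trained generator weights
+
+def G(z):
+    return np.tanh(z @ W)                  # noise vector -> image
+
+z0, z1 = rng.standard_normal(100), rng.standard_normal(100)   # two random latents
+frames = [G((1 - a) * z0 + a * z1)         # walk along the line between them
+          for a in np.linspace(0.0, 1.0, 8)]
+```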
+985
+01:11:50,119 --> 01:11:53,939
+so another experiment
+they have in this paper is
+
+986
+01:11:53,939 --> 01:11:58,069
+using vector math to play around with the
+type of things that these networks
+
+987
+01:11:58,069 --> 01:12:02,189
+generate so here the idea is that they
+generated a whole bunch of random
+
+988
+01:12:02,189 --> 01:12:05,789
+samples from the noise distribution then
+pushed them all through the generator to
+
+989
+01:12:05,789 --> 01:12:09,698
+generate a whole bunch of samples and
+then using their own human
+
+990
+01:12:09,698 --> 01:12:14,500
+intelligence they tried to make some
+semantic judgments about what those
+
+991
+01:12:14,500 --> 01:12:18,050
+random samples look like and then grouped
+them into a couple of meaningful
+
+992
+01:12:18,050 --> 01:12:21,739
+semantic categories so here
+would be three images
+
+993
+01:12:21,738 --> 01:12:25,529
+that were generated from the network
+that all kind of look like a smiling
+
+994
+01:12:25,529 --> 01:12:26,819
+woman and those are human-provided
+labels
+
+995
+01:12:26,819 --> 01:12:30,309
+here in the middle are three samples
+from the network of a neutral woman
+
+996
+01:12:30,310 --> 01:12:35,010
+that's not smiling and here on the
+right are three samples of a man that
+
+997
+01:12:35,010 --> 01:12:40,289
+is not smiling so each of these guys was
+produced from some latent state vector
+
+998
+01:12:40,289 --> 01:12:45,729
+so we'll just average those latent state
+vectors to compute this sort of
+
+999
+01:12:45,729 --> 01:12:51,269
+average latent state of a smiling woman
+a neutral woman and a neutral man now once
+
+1000
+01:12:51,270 --> 01:12:55,220
+we have these latent state vectors we can
+do some vector math so we can take the
+
+1001
+01:12:55,220 --> 01:13:01,050
+smiling woman subtract the neutral woman
+and add the neutral man so what would
+
+1002
+01:13:01,050 --> 01:13:06,070
+that give you you'd hope that it would
+give you a smiling man and this is what
+
+1003
+01:13:06,069 --> 01:13:12,649
+it generates so this actually does
+kinda look like a smiling man that's
+
+1004
+01:13:12,649 --> 01:13:19,199
+pretty amazing we can do another
+experiment we can take a man with
+
+1005
+01:13:19,199 --> 01:13:25,099
+glasses a man without glasses and a
+woman without glasses we take the man with
+
+1006
+01:13:25,100 --> 01:13:31,140
+glasses subtract the man without glasses
+and add the woman without glasses this is confusing
+
+1007
+01:13:31,140 --> 01:13:38,630
+stuff so what would
+this little equation give us
+
+1008
+01:13:38,630 --> 01:13:47,369
+look at that so that's pretty
+crazy so even though
+
+1009
+01:13:47,369 --> 01:13:51,279
+we're not sort of forcing an explicit
+prior on this latent space these
+
+1010
+01:13:51,279 --> 01:13:54,869
+adversarial networks have somehow still
+managed to learn some really nice useful
+representations there
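+In code, that arithmetic is just averaging and adding latent vectors; a sketch
+with hypothetical stand-ins for the generator and the hand-grouped samples:
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+W = rng.normal(0, 0.1, (100, 784))           # stand-in generator weights
+G = lambda z: np.tanh(z @ W)
+
+# Pretend these were the hand-grouped samples: 3 latents per semantic category
+z_smiling_woman = rng.standard_normal((3, 100))
+z_neutral_woman = rng.standard_normal((3, 100))
+z_neutral_man = rng.standard_normal((3, 100))
+
+z = (z_smiling_woman.mean(axis=0)            # average latent per category,
+     - z_neutral_woman.mean(axis=0)          # then do the vector math:
+     + z_neutral_man.mean(axis=0))           # smiling woman - neutral woman + neutral man
+img = G(z)                                   # hopefully decodes to a "smiling man"
+```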
+1011
+01:13:54,869 --> 01:13:59,960
+so also very quickly I think
+there's a pretty cool
+
+1012
+01:13:59,960 --> 01:14:04,220
+paper that just came out a week or two
+ago that puts all of these ideas
+
+1013
+01:14:04,220 --> 01:14:07,820
+together we covered a lot of
+different ideas in this lecture and
+
+1014
+01:14:07,819 --> 01:14:11,239
+let's just stick them all together so
+first we're gonna take a variational
+
+1015
+01:14:11,239 --> 01:14:15,659
+autoencoder as our starting point and
+this will have sort of the
+
+1016
+01:14:15,659 --> 01:14:20,130
+normal losses from the variational
+autoencoder but we saw that these adversarial
+
+1017
+01:14:20,130 --> 01:14:24,220
+networks give really amazing samples so
+why don't we add an adversarial network
+
+1018
+01:14:24,220 --> 01:14:29,630
+to the variational autoencoder so we do
+that so now in addition to having our
+
+1019
+01:14:29,630 --> 01:14:33,710
+variational autoencoder we also have
+this discriminator network that's
+
+1020
+01:14:33,710 --> 01:14:35,949
+trying to tell the difference between
+the
+
+1021
+01:14:35,949 --> 01:14:40,689
+real data and the samples from the
+variational autoencoder but that's not
+
+1022
+01:14:40,689 --> 01:14:47,099
+cool enough so why don't we also
+download AlexNet and then pass these
+
+1023
+01:14:47,100 --> 01:14:47,930
+two images
+
+1024
+01:14:47,930 --> 01:14:53,730
+through AlexNet and extract AlexNet features
+for both the original image and for our
+
+1025
+01:14:53,729 --> 01:14:59,079
+generated image and now in addition to
+having a pixel-wise loss here and
+
+1026
+01:14:59,079 --> 01:15:02,340
+fooling the discriminator we're also
+hoping to generate samples that have
+
+1027
+01:15:02,340 --> 01:15:06,900
+similar AlexNet features as measured by
+L2 and once you stick all these
+
+1028
+01:15:06,899 --> 01:15:10,859
+things together hopefully you'll get
+some really beautiful samples right so
+
+1029
+01:15:10,859 --> 01:15:17,069
+here are the examples from the paper so
+they just trained the entire
+
+1030
+01:15:17,069 --> 01:15:21,109
+thing on ImageNet and I think these
+are actually quite nice
+
+1031
+01:15:21,109 --> 01:15:26,029
+samples and if you contrast this with
+the multiscale samples on CIFAR that we
+
+1032
+01:15:26,029 --> 01:15:29,609
+saw before for those samples remember
+they were actually training a separate
+
+1033
+01:15:29,609 --> 01:15:34,380
+model per class on CIFAR and
+those beautiful bedroom samples that you
+
+1034
+01:15:34,380 --> 01:15:35,760
+saw were again
+
+1035
+01:15:35,760 --> 01:15:40,270
+training one model that's specific to
+bedrooms but here they actually trained
+
+1036
+01:15:40,270 --> 01:15:45,050
+one model on all of ImageNet and still
+like these aren't real images but they're
+
+1037
+01:15:45,050 --> 01:15:50,489
+definitely getting towards realistic
+looking images so I think these
+
+1038
+01:15:50,489 --> 01:15:54,170
+are pretty cool I also think it's kind
+of fun to just take all these things and
+
+1039
+01:15:54,170 --> 01:16:00,020
+stick them together and hopefully get
+some really nice samples so I think
+
+1040
+01:16:00,020 --> 01:16:02,460
+that's pretty much all we have to say
+about unsupervised learning so if
+
+1041
+01:16:02,460 --> 01:16:05,460
+there's any questions
+
+1042
+01:16:07,100 --> 01:16:17,110
+what is going on here
+
+1043
+01:16:18,680 --> 01:16:23,500
+yeah so the question is are we maybe
+linearizing the bedroom space
+
+1044
+01:16:23,500 --> 01:16:28,079
+and that's maybe one way to think about
+it that here remember we're just
+
+1045
+01:16:28,079 --> 01:16:30,729
+sampling from noise and passing it
+through the
+
+1046
+01:16:30,729 --> 01:16:35,319
+generator and then the generator has
+
+1047
+01:16:35,319 --> 01:16:40,630
+just decided to use these different
+noise channels in nice ways such that
+
+1048
+01:16:40,630 --> 01:16:44,510
+if you interpolate between the noise you
+end up interpolating between the images
+
+1049
+01:16:44,510 --> 01:16:49,110
+in sort of a nice smooth way so
+hopefully that lets you know that it's
+
+1050
+01:16:49,109 --> 01:16:51,799
+not just sort of memorizing training
+examples it's actually managing to
+
+1051
+01:16:51,800 --> 01:17:00,310
+generalize in a nice way right
+so just to recap everything we talked
+
+1052
+01:17:00,310 --> 01:17:04,430
+about today we gave you a lot of really
+useful practical tips for working with
+
+1053
+01:17:04,430 --> 01:17:08,470
+videos and then I gave you a lot of very
+non-practical tips for generating
+
+1054
+01:17:08,470 --> 01:17:16,119
+beautiful images so I think this stuff
+is really cool but I'm not sure what the
+
+1055
+01:17:16,119 --> 01:17:19,840
+uses are other than generating images but
+it's cool so it's fun and definitely
+
+1056
+01:17:19,840 --> 01:17:24,640
+stick around next time because we'll
+have a guest lecture from Jeff Dean so if
+
+1057
+01:17:24,640 --> 01:17:27,310
+you're watching on the internet maybe
+you might wanna come to class for that
+
+1058
+01:17:27,310 --> 01:17:31,500
+one so I think that's everything we have
+today and see you guys later
+
diff --git a/captions/En/Lecture15_en.srt b/captions/En/Lecture15_en.srt
new file mode 100644
index 00000000..711039c0
--- /dev/null
+++ b/captions/En/Lecture15_en.srt
@@ -0,0 +1,4238 @@
+1
+00:00:00,000 --> 00:00:03,370
+I'd like to point out that what I'll be
+presenting today is partly my work in
+
+2
+00:00:03,370 --> 00:00:06,919
+collaboration with others and sometimes
+I'm presenting work done by people in my
+
+3
+00:00:06,919 --> 00:00:10,929
+group that I wasn't really involved in
+but it's joint work with many many people
+
+4
+00:00:10,929 --> 00:00:14,740
+you'll see lots of names throughout the
+talk so take that with a grain of salt
+
+5
+00:00:14,740 --> 00:00:20,920
+so what I'm gonna tell you about is kind
+of how Google got to where it is today
+
+6
+00:00:20,920 --> 00:00:26,310
+in terms of using deep learning in a lot of
+different places the project that I'm
+
+7
+00:00:26,309 --> 00:00:30,608
+involved in actually started in 2011
+when Andrew Ng was spending one day a
+
+8
+00:00:30,609 --> 00:00:36,340
+week at Google and I happened to bump into
+him in the micro kitchen and I said
+
+9
+00:00:36,340 --> 00:00:39,420
+oh what are you doing and he was like I
+don't know I haven't figured it out yet but
+
+10
+00:00:39,420 --> 00:00:44,170
+neural nets are interesting and I was
+like oh cool it turns out I'd done
+
+11
+00:00:44,170 --> 00:00:49,120
+my thesis on parallel training
+of neural nets like ages ago I don't want
+
+12
+00:00:49,119 --> 00:00:50,250
+to tell you how long ago
+
+13
+00:00:50,250 --> 00:00:56,350
+back kind of in the first exciting
+period of neural nets and I always kind of really
+
+14
+00:00:56,350 --> 00:01:00,660
+liked the computational model they
+provided but at that time it was a
+
+15
+00:01:00,659 --> 00:01:03,599
+little too early like we didn't have a
+big enough data set or the amount of
+
+16
+00:01:03,600 --> 00:01:08,879
+computation to really make them sing and
+Andrew kind of said oh it would be interesting
+
+17
+00:01:08,879 --> 00:01:13,579
+to train really big ones now and I was
+like OK so we kind of collaboratively
+
+18
+00:01:13,579 --> 00:01:20,209
+started the Brain project to push the
+size and scale of neural net training
+19
+00:01:20,209 --> 00:01:24,059
+and in particular we were really
+interested in using big data sets and
+
+20
+00:01:24,060 --> 00:01:27,890
+large amounts of computation to tackle
+perception problems and imagery
+
+21
+00:01:27,890 --> 00:01:34,400
+problems and Andrew then went off to found
+Coursera and kind of moved away from
+
+22
+00:01:34,400 --> 00:01:39,719
+Google but since then we've been doing a
+lot of interesting work in both kind of
+
+23
+00:01:39,719 --> 00:01:43,408
+research areas and in a lot of different
+domains you know one of the nice things
+
+24
+00:01:43,409 --> 00:01:46,859
+about neural nets is they're incredibly
+applicable to many many different kinds
+
+25
+00:01:46,859 --> 00:01:52,478
+of problems as I'm sure you've seen in this
+class and we've also deployed production
+
+26
+00:01:52,478 --> 00:01:56,530
+systems using neural nets in a pretty wide
+variety of different products I'll kind
+
+27
+00:01:56,530 --> 00:02:00,049
+of give you a sampling of some of the
+research some of the production aspects
+
+28
+00:02:00,049 --> 00:02:04,579
+some of the systems that we've built
+underneath the covers including kind of
+
+29
+00:02:04,578 --> 00:02:08,030
+some of the implementation stuff that we
+do in TensorFlow to make these kinds
+
+30
+00:02:08,030 --> 00:02:12,959
+of models run fast and I'll focus on
+neural nets but a lot of the techniques are
+
+31
+00:02:12,959 --> 00:02:13,349
+more
+
+32
+00:02:13,349 --> 00:02:17,699
+general so that you can
+train lots of different kinds of
+
+33
+00:02:17,699 --> 00:02:22,159
+reinforcement learning algorithms or other
+kinds of machine learning models
+
+34
+00:02:22,159 --> 00:02:29,099
+is that OK can everyone hear me even in
+the back I'll take questions as they come up
+
+35
+00:02:29,099 --> 00:02:32,560
+one of the things I really like about the
+team we've put together is that we have
+
+36
+00:02:32,560 --> 00:02:36,479
+a really broad mix of different kinds of
+expertise so we have people who are really
+
+37
+00:02:36,479 --> 00:02:40,709
+experts at machine learning research you
+know people like Geoffrey Hinton and other
+
+38
+00:02:40,710 --> 00:02:45,820
+people and then we have large-scale
+distributed systems builders I kind of
+
+39
+00:02:45,819 --> 00:02:50,169
+consider myself more in that
+mold and then we have people with
+
+40
+00:02:50,169 --> 00:02:54,989
+a mix of those skills and often on some of
+the projects we work on you collectively
+
+41
+00:02:54,990 --> 00:03:00,870
+put together people with these different
+kinds of expertise and collectively you
+
+42
+00:03:00,870 --> 00:03:03,580
+do something that none of you could do
+individually because often you need both
+
+43
+00:03:03,580 --> 00:03:09,670
+kind of large-scale systems thinking and
+machine learning ideas so that's always
+
+44
+00:03:09,669 --> 00:03:13,539
+fun and you often kind of pick up and
+learn new things from other people
+
+45
+00:03:13,539 --> 00:03:22,280
+so here's kind of an outline actually this
+is a bit old so you can kind of
+
+46
+00:03:22,280 --> 00:03:26,080
+see the progress of how Google has been
+applying deep learning across lots of
+
+47
+00:03:26,080 --> 00:03:28,540
+different areas this is sort of when
+we started the project and we
+
+48
+00:03:28,539 --> 00:03:32,209
+started collaborating with the speech team
+a bit and started doing it with some
+
+49
+00:03:32,210 --> 00:03:37,830
+kind of early computer vision kinds of
+problems and as we had success in some
+
+50
+00:03:37,830 --> 00:03:42,770
+of the other teams at Google would say
+hey I have a problem too and they
+51
+00:03:42,770 --> 00:03:46,550
+they would come to us, or we would go to them and say, hey, we think this could help
+
+52
+00:03:46,550 --> 00:03:50,610
+with your particular problem, and over time we've kind of gradually, or not so
+
+53
+00:03:50,610 --> 00:03:54,670
+gradually, expanded the set of teams and areas that we've been applying these
+
+54
+00:03:54,669 --> 00:03:58,539
+kinds of models to, and you see the breadth of
+
+55
+00:03:58,539 --> 00:04:03,689
+different kinds of areas; it's not like it's only computer vision problems, so
+
+56
+00:04:03,689 --> 00:04:08,150
+that's kind of nice, and we're continuing to grow, which is good. And
+
+57
+00:04:08,150 --> 00:04:12,920
+part of the reason for that broad spectrum of things is that you can
+
+58
+00:04:12,919 --> 00:04:18,229
+really think of neural nets as this nice, really universal system that you can put
+
+59
+00:04:18,230 --> 00:04:21,359
+lots of different kinds of inputs into and get lots of different kinds of
+
+60
+00:04:21,358 --> 00:04:22,129
+outputs
+
+61
+00:04:22,129 --> 00:04:27,300
+out of, with, you know, slight differences in the model you try, but in
+
+62
+00:04:27,300 --> 00:04:32,270
+general the same fundamental techniques work pretty well across all these
+
+63
+00:04:32,269 --> 00:04:36,990
+different domains. And they give state-of-the-art results, as you've heard about in
+
+64
+00:04:36,990 --> 00:04:40,400
+this class, in lots of different areas: now pretty much any computer vision
+
+65
+00:04:40,399 --> 00:04:46,219
+problem, any speech problem these days, and it's starting to be more the case in lots of
+
+66
+00:04:46,220 --> 00:04:51,880
+language understanding areas; lots of other areas of science, like drug
+
+67
+00:04:51,879 --> 00:04:54,519
+discovery, are starting to have interesting neural models that are better
+
+68
+00:04:54,519 --> 00:05:05,930
+than the alternatives. Yeah, I like them; they're good. Along the way we've built
+
+69
+00:05:05,930 --> 00:05:10,040
+two different generations of our underlying system software for training
+
+70
+00:05:10,040 --> 00:05:14,640
+and deploying neural nets. The first was called DistBelief; we published a paper about
+
+71
+00:05:14,639 --> 00:05:20,479
+it at NIPS 2012. It had the advantage that it was really scalable: one
+
+72
+00:05:20,480 --> 00:05:23,759
+of the first uses we put it to was doing some unsupervised training, which I'll
+
+73
+00:05:23,759 --> 00:05:27,319
+tell you about in a minute, which used 16,000 cores to train. It could
+
+74
+00:05:27,319 --> 00:05:31,209
+handle a lot of parameters, it was good for production use, but it wasn't super
+
+75
+00:05:31,209 --> 00:05:35,819
+flexible for research; like, it was kind of hard to express weirder or more
+
+76
+00:05:35,819 --> 00:05:38,949
+esoteric kinds of models, reinforcement learning algorithms were hard to express,
+
+77
+00:05:38,949 --> 00:05:43,349
+and it had this kind of much more layer-driven approach with up-and-down
+
+78
+00:05:43,350 --> 00:05:48,770
+messages. It worked well for what it did, but we kind of took a step back
+
+79
+00:05:48,769 --> 00:05:52,639
+about a year and a little bit ago and started building our second generation
+
+80
+00:05:52,639 --> 00:05:57,339
+system, TensorFlow, which is based on what we learned from the first generation and
+
+81
+00:05:57,339 --> 00:06:02,289
+what we learned from other sorts of available open source packages, and we
+
+82
+00:06:02,290 --> 00:06:06,620
+rethought it. It retained a lot of the good features of DistBelief but also made it
+83
+00:06:06,620 --> 00:06:13,329
+pretty flexible for a wide variety of research, and we open sourced it, which I gather you've
+
+84
+00:06:13,329 --> 00:06:19,120
+heard about. One of the really nice properties of neural nets: I grabbed
+
+85
+00:06:19,120 --> 00:06:23,459
+this from a particular paper because it had graphs on both scaling the size of the
+
+86
+00:06:23,459 --> 00:06:27,819
+training data and how accuracy increases, and also scaling the size of the neural
+
+87
+00:06:27,819 --> 00:06:30,279
+net and how accuracy increases.
+
+88
+00:06:30,279 --> 00:06:33,109
+The exact details aren't important; you can find these kinds of trends in hundreds
+
+89
+00:06:33,110 --> 00:06:37,509
+of papers. But one of the really nice properties is, if you have more data and
+
+90
+00:06:37,509 --> 00:06:42,180
+you can make your model bigger, generally scaling both of those things is even
+
+91
+00:06:42,180 --> 00:06:47,019
+better than scaling just one of them. You need a really big model in order to
+
+92
+00:06:47,019 --> 00:06:49,810
+capture the more subtle trends that appear in larger and larger
+
+93
+00:06:49,810 --> 00:06:54,180
+datasets: you know, any neural net will capture the obvious trends or
+
+94
+00:06:54,180 --> 00:06:57,370
+obvious kinds of patterns, but the more subtle ones are the ones where you need a
+
+95
+00:06:57,370 --> 00:07:04,189
+bigger model to capture them, and that extra scale, of course,
+
+96
+00:07:04,189 --> 00:07:09,579
+requires a lot more computation. So we focus a lot on scaling the computation
+
+97
+00:07:09,579 --> 00:07:17,689
+we need to be able to train big models on big datasets. One of the first
+
+98
+00:07:17,689 --> 00:07:22,699
+things we did in this project was we said, oh, unsupervised learning is going to be
+
+99
+00:07:22,699 --> 00:07:28,879
+really important, and we had a big focus on that initially. Quoc and others
+
+100
+00:07:28,879 --> 00:07:34,870
+said, what would happen if we did unsupervised learning on random YouTube
+
+101
+00:07:34,870 --> 00:07:38,519
+frames? So the idea is we're going to take ten million random YouTube frames, single
+
+102
+00:07:38,519 --> 00:07:42,990
+frames from a bunch of random videos, and we're going to essentially train an
+
+103
+00:07:42,990 --> 00:07:47,418
+autoencoder. Everyone knows what an autoencoder is? It's like a fancy multi-level
+
+104
+00:07:47,418 --> 00:07:51,788
+autoencoder: at this level we're just trying to reconstruct the image, and
+
+105
+00:07:51,788 --> 00:07:54,459
+then we're trying to reconstruct the representation here from the representation
+
+106
+00:07:54,459 --> 00:08:01,629
+there, and so on. And we used sixteen thousand cores; we didn't have GPUs in the
+
+107
+00:08:01,629 --> 00:08:07,459
+datacenter at the time, so we compensated by throwing more CPUs at it. We
+
+108
+00:08:07,459 --> 00:08:11,870
+used async SGD, which I'll talk about in a minute, for optimization. It actually
+
+109
+00:08:11,870 --> 00:08:17,189
+had a lot of parameters, because it was not convolutional; this was prior to convnets
+
+110
+00:08:17,189 --> 00:08:20,199
+being all the rage. So we said, well, we'll have local receptive fields, but
+
+111
+00:08:20,199 --> 00:08:24,168
+they won't be convolutional, and we'll learn, like, a separate representation for
+
+112
+00:08:24,168 --> 00:08:28,269
+this part of the image and this part of the image, which is kind of an
+
+113
+00:08:28,269 --> 00:08:31,038
+interesting twist.
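(A minimal sketch of the stacked, layer-by-layer reconstruction idea described above, in plain NumPy. The layer sizes, tied weights, learning rate, and toy data are illustrative assumptions, not the actual 16,000-core, local-receptive-field setup.)

```python
import numpy as np

def train_autoencoder_layer(data, n_hidden, lr=0.01, epochs=5):
    """Train one reconstruction layer (tied weights); return weights and codes."""
    n_in = data.shape[1]
    rng = np.random.RandomState(0)
    W = rng.randn(n_in, n_hidden) * 0.01   # encoder weights (decoder uses W.T)
    b = np.zeros(n_hidden)                 # encoder bias
    c = np.zeros(n_in)                     # decoder bias
    for _ in range(epochs):
        h = np.tanh(data @ W + b)          # encode
        recon = h @ W.T + c                # decode
        err = recon - data                 # reconstruction error
        dW = data.T @ ((err @ W) * (1 - h ** 2)) + err.T @ h
        W -= lr * dW / len(data)
        b -= lr * ((err @ W) * (1 - h ** 2)).mean(axis=0)
        c -= lr * err.mean(axis=0)
    return W, b, np.tanh(data @ W + b)

# Stack layers: each level reconstructs the representation below it.
frames = np.random.rand(1000, 64)          # stand-in for YouTube frames
reps = frames
for size in (32, 16, 8):                   # a small "multi-level" stack
    W, b, reps = train_autoencoder_layer(reps, size)
```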
+114
+00:08:31,038 --> 00:08:37,330
+I think it'd actually be an interesting experiment to
+
+115
+00:08:37,330 --> 00:08:40,590
+redo this work but with convolutional parameter sharing; that would be kind of cool. In
+
+116
+00:08:40,590 --> 00:08:45,580
+any case, the representation it learned at the top, after like nine layers
+
+117
+00:08:45,580 --> 00:08:50,750
+of these non-convolutional local receptive fields, was 60,000 neurons at the top level,
+
+118
+00:08:50,750 --> 00:08:54,799
+and one of the things we thought might happen is that it would learn kind of
+
+119
+00:08:54,799 --> 00:08:58,929
+high-level feature detectors: so in particular, training on pixels, could it
+
+120
+00:08:58,929 --> 00:09:04,349
+learn high-level concepts? We had a dataset that was half faces and
+
+121
+00:09:04,350 --> 00:09:08,120
+half not faces, and we looked around for neurons that were good
+
+122
+00:09:08,120 --> 00:09:13,850
+selectors of whether or not the image being shown contained a face, and we
+
+123
+00:09:13,850 --> 00:09:19,610
+found several such neurons. For the best one, those are some of the sample
+
+124
+00:09:19,610 --> 00:09:24,240
+images that caused that neuron to get the most excited, and then if you look
+
+125
+00:09:24,240 --> 00:09:32,669
+around for what stimulus will cause the neuron to get the most excited, there's
+
+126
+00:09:32,669 --> 00:09:38,399
+creepy face guy. And that's kind of interesting: we had no labels on
+
+127
+00:09:38,399 --> 00:09:43,029
+the images in the dataset at all that we were training on, and a neuron in this
+
+128
+00:09:43,029 --> 00:09:48,399
+model has picked up on the fact that faces are things: I'm going to get excited
+
+129
+00:09:48,399 --> 00:09:55,179
+when I see kind of a Caucasian face from head on. It's YouTube, so we also have a
+
+130
+00:09:55,179 --> 00:10:03,019
+cat neuron, on a dataset with half cats and half not cats; this is the average tabby, I
+
+131
+00:10:03,019 --> 00:10:07,659
+call him. And then you can take that unsupervised model and start a
+
+132
+00:10:07,659 --> 00:10:11,669
+supervised training task. In particular, at this time we were training on the
+
+133
+00:10:11,669 --> 00:10:14,939
+ImageNet twenty-two thousand class task, which is not the one that most ImageNet
+
+134
+00:10:14,940 --> 00:10:21,490
+results are reported on; that's one thousand classes. This one tries to
+
+135
+00:10:21,490 --> 00:10:26,340
+distinguish an image as one of 22,000 classes; it's a much harder task. And
+
+136
+00:10:26,340 --> 00:10:29,300
+then we trained, and then looked around at what kinds of images cause different
+
+137
+00:10:29,299 --> 00:10:33,819
+output neurons to get excited, and you see they're picking up on very high-level
+
+138
+00:10:34,620 --> 00:10:41,080
+concepts, you know, yellow flowers only, or waterfowl.
+
+139
+00:10:41,080 --> 00:10:44,080
+And this pretraining actually increased the state-of-the-art accuracy
+
+140
+00:10:45,129 --> 00:10:50,500
+on that particular task by a fair amount at the time.
+
+141
+00:10:50,500 --> 00:10:54,860
+Then we kind of lost our excitement about unsupervised learning, because
+
+142
+00:10:54,860 --> 00:11:00,100
+supervised learning worked so darn well, and so we started working with the speech
+
+143
+00:11:00,100 --> 00:11:06,570
+team, who at the time had a non-neural-net-based acoustic model,
+
+144
+00:11:06,570 --> 00:11:09,420
+essentially trying to go from a small segment of audio data, like a hundred and
+fifty milliseconds of time, and predict what sound is being uttered in
+145
+00:11:09,419 --> 00:11:17,809
+the middle 10 milliseconds. So we just decided to try a multi-layer, fully
+
+146
+00:11:17,809 --> 00:11:21,879
+connected neural net that predicts one of fourteen thousand triphones at the top.
+
+147
+00:11:22,549 --> 00:11:27,939
+That worked fairly well; we basically could train it pretty quickly, and it gave a
+
+148
+00:11:27,940 --> 00:11:31,530
+huge reduction in word error rate; one of the people on the speech team said
+
+149
+00:11:31,529 --> 00:11:34,339
+it was like the biggest single improvement they'd seen in their 20 years of
+
+150
+00:11:34,340 --> 00:11:47,970
+speech research, and that launched as part of the Android voice search system in 2012. So
+
+151
+00:11:47,970 --> 00:11:51,990
+one of the things we often find is that we have a lot of data for some
+
+152
+00:11:51,990 --> 00:11:57,149
+tasks but not very much data for other tasks, and so for that we often
+
+153
+00:11:57,149 --> 00:12:02,949
+deploy systems that make use of multitask and transfer learning in
+
+154
+00:12:02,950 --> 00:12:09,030
+various ways. So let's look at an example where we used this in speech. Obviously
+
+155
+00:12:09,029 --> 00:12:13,110
+with English we have a lot of data, and we get a really nice low word error
+
+156
+00:12:13,110 --> 00:12:17,350
+rate. For Portuguese, on the other hand, at the time we didn't have
+
+157
+00:12:17,350 --> 00:12:21,310
+that much training data, something like a hundred hours of Portuguese, so the word error rate is a
+
+158
+00:12:21,309 --> 00:12:27,129
+lot worse, which is bad. So one of the first and simplest things you can do,
+
+159
+00:12:27,129 --> 00:12:30,620
+which is kind of what you do when you take a model that has been pretrained on
+
+160
+00:12:30,620 --> 00:12:33,509
+ImageNet and apply it to some other problem where you don't have as much data, is
+
+161
+00:12:33,509 --> 00:12:37,610
+you just start training from those weights instead of totally random weights,
+
+162
+00:12:37,610 --> 00:12:41,700
+and that actually improves your word error rate for Portuguese. It does because
+
+163
+00:12:41,700 --> 00:12:45,210
+there are enough similarities in the kinds of features you want for speech in
+
+164
+00:12:45,210 --> 00:12:50,570
+general, regardless of language. A more complicated thing you can do is actually
+
+165
+00:12:50,570 --> 00:12:55,390
+jointly train models that share the lower layers across all languages, or in
+
+166
+00:12:55,389 --> 00:12:56,360
+this case all
+
+167
+00:12:56,360 --> 00:13:04,680
+European languages, I think, is what we used. And so there you see we're
+
+168
+00:13:04,679 --> 00:13:07,939
+jointly training on this data, and we actually got a pretty significant
+
+169
+00:13:07,940 --> 00:13:13,310
+improvement even over just copying the weights into the Portuguese model, but
+
+170
+00:13:13,309 --> 00:13:17,739
+surprisingly we actually got a small improvement in English, because in total,
+
+171
+00:13:17,740 --> 00:13:20,889
+across all the other languages, we actually almost doubled the amount of
+
+172
+00:13:20,889 --> 00:13:25,399
+training data we were able to use in this model compared to just English
+
+173
+00:13:25,399 --> 00:13:30,379
+alone. So basically languages without much data all improved a lot;
+
+174
+00:13:30,379 --> 00:13:35,850
+languages with a lot of data even improved a little bit. And then we had a
+
+175
+00:13:35,850 --> 00:13:39,350
+language-specific top layer; there was a little bit of fiddling to figure out
+
+176
+00:13:39,350 --> 00:13:44,620
+whether it makes sense to have two language-specific top layers or one.
+
+177
+00:13:44,620 --> 00:13:47,620
+Those are the kinds of human-guided choices you end up making.
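(A rough sketch of the joint multilingual training idea just described: lower layers shared across every language, with a language-specific softmax head on top. All the sizes, the ReLU trunk, and the toy training schedule are made-up stand-ins, not the production acoustic model.)

```python
import numpy as np

rng = np.random.RandomState(0)
D, H, C = 40, 128, 50                     # acoustic features, hidden units, phone classes

# Lower layers shared across every language; one softmax head per language.
W_shared = rng.randn(D, H) * 0.01
heads = {lang: rng.randn(H, C) * 0.01 for lang in ("en", "pt", "fr")}

def train_step(lang, x, labels, lr=0.1):
    global W_shared
    h = np.maximum(0, x @ W_shared)       # shared trunk (ReLU)
    W_head = heads[lang]                  # language-specific top layer
    logits = h @ W_head
    p = np.exp(logits - logits.max(1, keepdims=True))
    p /= p.sum(1, keepdims=True)
    p[np.arange(len(labels)), labels] -= 1        # softmax cross-entropy gradient
    dh = (p @ W_head.T) * (h > 0)
    heads[lang] = W_head - lr * h.T @ p / len(x)
    W_shared -= lr * x.T @ dh / len(x)    # every language updates the trunk

# Each mini-batch comes from whichever language has data; low-resource
# languages benefit from trunk updates driven by the high-resource ones.
for lang in ("en", "en", "pt"):           # toy schedule skewed toward English
    x = rng.randn(32, D)
    labels = rng.randint(0, C, size=32)
    train_step(lang, x, labels)
```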
+178
+00:13:48,269 --> 00:13:53,149
+The production speech models have evolved a lot from those really simple
+
+179
+00:13:53,149 --> 00:13:57,778
+feedforward models; they now use LSTMs to deal with time, and
+
+180
+00:13:57,778 --> 00:14:02,490
+convolutions to make them invariant to different frequencies. There
+
+181
+00:14:02,490 --> 00:14:06,769
+was a paper published on this; you don't necessarily need to understand all
+
+182
+00:14:06,769 --> 00:14:11,459
+the details, but there's a lot more complexity in that kind of model, and it's
+
+183
+00:14:11,458 --> 00:14:15,088
+using much more sophisticated recurrent models and convolutional models.
+
+184
+00:14:15,089 --> 00:14:22,100
+A recent trend has been that you can use LSTMs completely end-to-end, and so rather
+
+185
+00:14:22,100 --> 00:14:26,730
+than having an acoustic model and then a language model that kind of takes the
+
+186
+00:14:26,730 --> 00:14:30,550
+output of the acoustic model and is trained somewhat separately, you can go
+
+187
+00:14:30,549 --> 00:14:34,879
+directly from audio waveforms to producing a transcript, a character at a
+
+188
+00:14:34,879 --> 00:14:38,120
+time. And I think that's going to be a really big trend,
+
+189
+00:14:38,809 --> 00:14:44,169
+both in speech and more generally: in a lot of deep learning systems you have
+
+190
+00:14:44,169 --> 00:14:49,338
+today, a lot of systems are kind of composed of a bunch of subsystems, each
+
+191
+00:14:49,339 --> 00:14:54,350
+perhaps with some machine-learned pieces and some hand-coded pieces, and then
+
+192
+00:14:54,350 --> 00:14:58,000
+usually a big pile of glue code to hold it all together. And
+
+193
+00:14:58,509 --> 00:15:04,600
+often those separately developed pieces have independent optimization
+
+194
+00:15:04,600 --> 00:15:08,800
+criteria: you optimize your subsystem in the context of some metric, but that
+
+195
+00:15:08,799 --> 00:15:12,699
+metric might not be the right thing for the final task you care about, which
+
+196
+00:15:12,700 --> 00:15:22,370
+might be transcribing correctly. So having a much bigger single system, like a
+
+197
+00:15:22,370 --> 00:15:25,649
+single neural net that goes directly from the audio waveform all the way to the end
+
+198
+00:15:25,649 --> 00:15:29,929
+objective you care about, the transcription, one that you can optimize end-to-end,
+
+199
+00:15:29,929 --> 00:15:34,579
+where there's not a lot of hand-written code in the middle: that is going
+
+200
+00:15:34,580 --> 00:15:37,440
+to be a big trend, I think. You'll see it in speech, you'll see it in machine
+
+201
+00:15:37,440 --> 00:15:46,250
+translation, and in a lot of other kinds of domains. So, on to convolutions: we
+202
+00:15:46,250 --> 00:15:48,919
+have tons of vision problems that we've been using various kinds of
+
+203
+00:15:48,919 --> 00:15:54,849
+convolutional models for. You know, the big excitement around convolutional
+
+204
+00:15:54,850 --> 00:15:59,220
+neural nets: well, first it started with Yann LeCun's check-reading system; that
+
+205
+00:15:59,220 --> 00:16:05,110
+kind of subsided for a while, and then Alex Krizhevsky, Ilya Sutskever, and
+
+206
+00:16:05,110 --> 00:16:10,200
+Geoff Hinton's paper in 2012, which kind of blew the other competitors out of
+
+207
+00:16:10,200 --> 00:16:16,470
+the water in the ImageNet 2012 challenge using a convnet. I think that put
+
+208
+00:16:16,470 --> 00:16:20,500
+those things on everyone's map again, saying, well, we should be using
+
+209
+00:16:20,500 --> 00:16:24,399
+these things for vision because they work really well. And the next year,
+
+210
+00:16:24,399 --> 00:16:28,100
+something like twenty of the entries were
+
+211
+00:16:28,100 --> 00:16:34,550
+convnets; previously it was just Alex. We've had a bunch of people at Google
+
+212
+00:16:34,549 --> 00:16:38,529
+looking at various kinds of architectures for doing better and
+
+213
+00:16:38,529 --> 00:16:41,829
+better ImageNet classification. The Inception architecture has this
+
+214
+00:16:41,830 --> 00:16:45,889
+complicated module of different-size convolutions that are all kind of
+
+215
+00:16:45,889 --> 00:16:50,419
+concatenated together, and then you replicate those modules a bunch of times
+
+216
+00:16:50,419 --> 00:16:51,319
+and
+
+217
+00:16:51,320 --> 00:16:55,810
+you end up with a very deep convnet that turned out to be quite good at ImageNet
+
+218
+00:16:56,789 --> 00:17:01,870
+classification. There have been some slight additions and slight changes since, to make it even more accurate.
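(A toy illustration of the module shape just described: different-size convolutions run in parallel over the same input, then concatenated. The real Inception module also has 1x1 bottleneck convolutions and a pooling branch, omitted here for brevity; the naive loop is for clarity, not speed.)

```python
import numpy as np

def conv2d_same(x, k):
    """Naive single-channel 'same' convolution (illustrative, not fast)."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def inception_like_module(x, kernels):
    """Parallel different-size convolution branches, concatenated as channels."""
    branches = [np.maximum(0, conv2d_same(x, k)) for k in kernels]  # ReLU each
    return np.stack(branches, axis=0)

rng = np.random.RandomState(0)
image = rng.rand(16, 16)
kernels = [rng.randn(s, s) * 0.1 for s in (1, 3, 5)]   # 1x1, 3x3, 5x5 branches
features = inception_like_module(image, kernels)       # shape (3, 16, 16)
# A deep net stacks many such modules; stacking them repeatedly is what
# gives the "very deep convnet" mentioned above.
```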
+219
+00:17:01,870 --> 00:17:07,740
+Have you seen a slide like that in class already?
+
+220
+00:17:07,740 --> 00:17:17,120
+OK, so I was lazy and just took my slides from a folder. Have I ever told
+
+221
+00:17:17,119 --> 00:17:19,549
+you the story about Andrej sitting down and labeling?
+
+222
+00:17:19,549 --> 00:17:26,559
+OK, so since Andrej was helping to administer the ImageNet contest, he
+
+223
+00:17:26,559 --> 00:17:31,269
+decided he would sit down and subject himself to many hours of training on a training
+
+224
+00:17:31,269 --> 00:17:38,099
+and test split: like, is this an Australian Shepherd dog? I don't know. And
+
+225
+00:17:38,099 --> 00:17:41,449
+he did convince one of his labmates to do it too, but they weren't as diligent;
+
+226
+00:17:41,450 --> 00:17:45,309
+Andrej did about a hundred and twenty hours of training on images,
+
+227
+00:17:45,980 --> 00:17:52,380
+and his labmate got tired after 12 hours or something, so Andrej got 5.1 percent error
+
+228
+00:17:52,380 --> 00:17:55,380
+and the labmate got, I think, 12%
+
+229
+00:17:56,269 --> 00:18:12,918
+human error, Andrej having trained diligently all over the weekend
+
+230
+00:18:12,919 --> 00:18:19,690
+and come back a hundred and twenty hours later, or whatever. Anyway, there's a great
+
+231
+00:18:19,690 --> 00:18:23,220
+blog post about it; I encourage you to check it out. He has a lot of parameters:
+
+232
+00:18:23,220 --> 00:18:34,279
+typical humans have, you know, like 80 trillion connections or so, many more than these models. One
+
+233
+00:18:34,279 --> 00:18:37,918
+point about these models: models with a small number of parameters fit well
+
+234
+00:18:37,919 --> 00:18:43,440
+on mobile devices; Andrej doesn't fit well on a mobile phone. But the general
+
+235
+00:18:43,440 --> 00:18:47,029
+trend, other than Andrej, is toward smaller numbers of parameters compared to AlexNet,
+
+236
+00:18:47,029 --> 00:18:52,509
+mostly because AlexNet had these two giant fully connected layers at the top that
+
+237
+00:18:52,509 --> 00:18:57,000
+had a lot of parameters, and later work has kind of done away with those for
+
+238
+00:18:57,000 --> 00:19:02,220
+the most part, and so they use, you know, a small number of parameters but
+
+239
+00:19:02,220 --> 00:19:07,829
+more floating-point operations per parameter, more reuse of each convolutional parameter, which is
+
+240
+00:19:07,829 --> 00:19:12,379
+good for putting them on phones. We released, as part of the TensorFlow
+
+241
+00:19:12,380 --> 00:19:18,549
+release, a pretrained Inception model which you can use; there's a tutorial about
+
+242
+00:19:18,548 --> 00:19:24,089
+it. There it is classifying Grace Hopper, although it thinks it's a military uniform, which is not
+
+243
+00:19:24,089 --> 00:19:29,859
+terribly inaccurate. One of the nice things about these models is they're
+
+244
+00:19:29,859 --> 00:19:32,589
+really good at doing very fine-grained classifications. I think one of the things
+
+245
+00:19:32,589 --> 00:19:35,959
+that's in Andrej's blog is that the computer models are actually much, much
+
+246
+00:19:35,960 --> 00:19:40,880
+better than people at distinguishing exact breeds of dogs, but humans are
+
+247
+00:19:40,880 --> 00:19:42,179
+better at
+
+248
+00:19:42,179 --> 00:19:49,150
+often picking out a small object: you know, if the label is ping pong ball and it's
+
+249
+00:19:49,150 --> 00:19:52,190
+like a giant scene of people playing ping pong, humans are better at that;
+
+250
+00:19:52,829 --> 00:20:00,250
+models tend to focus on things with more pixels. If you train models with the
+
+251
+00:20:00,250 --> 00:20:01,109
+right kind of data,
+
+252
+00:20:01,109 --> 00:20:05,019
+they generalize well: these scenes look nothing alike, but they actually
+
+253
+00:20:05,019 --> 00:20:08,690
+both get labeled as "meal" if your training data is representative.
+
+254
+00:20:08,690 --> 00:20:14,710
+They make acceptable errors, which is kind of nice: no, it's not a snake, but
+
+255
+00:20:14,710 --> 00:20:19,230
+you understand why the model said that; and I know it's not a dog, but I actually
+
+256
+00:20:19,230 --> 00:20:25,190
+had to think carefully whether the front animal there is a donkey, and I'm
+
+257
+00:20:25,190 --> 00:20:27,490
+still not entirely sure.
+
+258
+00:20:27,490 --> 00:20:37,900
+Any votes? So one of the production uses we've put these kinds of models to is
+
+259
+00:20:37,900 --> 00:20:42,850
+Google Photos search. We launched the Google Photos product, and you can search
+
+260
+00:20:42,849 --> 00:20:46,539
+the photos that you've uploaded without tagging them at all: you just type "ocean"
+
+261
+00:20:46,539 --> 00:20:51,639
+and all of a sudden all of your ocean photos show up. So for example, this user
+
+262
+00:20:51,640 --> 00:20:56,870
+posted a screenshot publicly: hey, I didn't tag these, and
+
+263
+00:20:56,869 --> 00:21:04,879
+statues of Buddha showed up for that query. You know, this is a tough one because
+
+264
+00:21:04,880 --> 00:21:09,520
+it's got a lot of texture compared to most statues, so we were pretty pleased it
+
+265
+00:21:09,519 --> 00:21:18,339
+retrieved that photo. We have a lot of other kinds of more specific
+
+266
+00:21:18,339 --> 00:21:21,730
+visual tasks: like, essentially one of the things we want to do in our Street View
+
+267
+00:21:21,730 --> 00:21:25,819
+imagery, where these cars drive around the world and take pictures of all the roads
+
+268
+00:21:25,819 --> 00:21:29,609
+and street scenes, is we want to be able to read all the text that we find.
+
+269
+00:21:29,609 --> 00:21:34,909
+So first you have to find the text, and well, one of the first things you want to
+
+270
+00:21:34,910 --> 00:21:39,720
+do is find all the addresses and match them to the map, and then you want to read
+
+271
+00:21:39,720 --> 00:21:43,829
+all the other text too.
+272
+00:21:43,829 --> 00:21:47,799
+So you can see what it does: we have a model that does a
+
+273
+00:21:47,799 --> 00:21:53,819
+pretty good job of predicting, at a pixel level, which pixels contain
+
+274
+00:21:53,819 --> 00:21:58,289
+text or not, and it does pretty well.
+
+275
+00:21:58,289 --> 00:22:03,019
+Well, first of all, it finds lots of text; the training data had different kinds of
+
+276
+00:22:03,019 --> 00:22:08,569
+characters represented, so it has no problem recognizing Chinese characters,
+
+277
+00:22:08,569 --> 00:22:12,889
+English characters, Roman or Latin characters; it does pretty well with
+
+278
+00:22:12,890 --> 00:22:17,200
+different colors of text, different fonts and sizes; and some of them are
+
+279
+00:22:17,970 --> 00:22:24,809
+very close to the camera, some are very far away. And this is data
+
+280
+00:22:24,809 --> 00:22:27,809
+from human labelers who just drew polygons around pieces of text and then
+transcribed them, and then we have an OCR model as well.
+
+281
+00:22:30,880 --> 00:22:34,500
+We've been kind of gradually releasing other kinds of products. We just launched the
+
+282
+00:22:34,500 --> 00:22:39,799
+Cloud Vision API; you can do lots of things like label images. This is meant
+
+283
+00:22:39,799 --> 00:22:44,859
+for people who don't necessarily want, or have, machine learning expertise; they
+
+284
+00:22:44,859 --> 00:22:48,349
+just kind of want to do cool stuff with images: you want to be able to, you know,
+
+285
+00:22:48,349 --> 00:22:54,990
+run the thing that does the OCR and finds text in any image you
+
+286
+00:22:54,990 --> 00:22:58,650
+upload. You basically give it an image, and it'll run OCR and label
+
+287
+00:22:58,650 --> 00:23:03,820
+generation on that image, and people have been pretty happy with
+
+288
+00:23:03,819 --> 00:23:06,689
+that.
+
+289
+00:23:06,690 --> 00:23:10,220
+Internally, people have been thinking of more creative ways to use
+
+290
+00:23:10,220 --> 00:23:13,600
+computer vision, essentially now that computer vision sort of really actually
+
+291
+00:23:13,599 --> 00:23:19,819
+works, compared to five years ago. This is something that our geo team, which
+
+292
+00:23:19,819 --> 00:23:23,250
+processes satellite imagery, put together and released, which is basically
+
+293
+00:23:23,250 --> 00:23:28,740
+a way of predicting the slope of roofs from multiple satellite views of a
+
+294
+00:23:28,740 --> 00:23:32,769
+location: you have, you know, new satellite imagery every few months,
+
+295
+00:23:32,769 --> 00:23:36,099
+so we have multiple views of the same location, and we can predict what the
+
+296
+00:23:36,099 --> 00:23:40,109
+slope of the roof is given all those different views of the same location, and
+
+297
+00:23:40,109 --> 00:23:43,589
+how much sun exposure it gets, and then predict, you know, if you were to
+
+298
+00:23:43,589 --> 00:23:48,490
+install solar panels, how much energy you could generate. That's
+
+299
+00:23:48,490 --> 00:23:53,930
+kinda cool; it's the kind of small random thing you can do now that vision
+
+300
+00:23:53,930 --> 00:24:03,160
+works. OK, so this class has been mostly about vision, so I'm going to talk
+
+301
+00:24:03,160 --> 00:24:08,029
+now about other kinds of problems, like language understanding. One of the most
+
+302
+00:24:08,029 --> 00:24:16,779
+important problems is search, obviously, so we care a lot about search. And in
+303
+00:24:16,779 --> 00:24:20,700
+particular, if I do the query "car parts for sale", I'd like to determine which of
+
+304
+00:24:20,700 --> 00:24:25,400
+these two documents is more relevant. If you just look at the surface forms of
+
+305
+00:24:25,400 --> 00:24:28,019
+the words, the first document looks pretty darn relevant,
+
+306
+00:24:28,019 --> 00:24:34,609
+like lots of the words occur in it, but actually the second document is much
+
+307
+00:24:34,609 --> 00:24:41,189
+more relevant, and we'd like to be able to understand that. So how
+
+308
+00:24:41,190 --> 00:24:47,269
+much have you talked about embedding models? Awesome, so you know the
+
+309
+00:24:47,269 --> 00:24:47,879
+basics
+
+310
+00:24:47,880 --> 00:24:54,680
+of embeddings too. So I will go quickly, but basically you want to
+
+311
+00:24:54,680 --> 00:24:58,200
+take words or things that are sparse, high-dimensional objects and
+
+312
+00:24:58,200 --> 00:25:03,559
+map them into a dense space, some hundred-dimensional or thousand-dimensional space, so
+
+313
+00:25:03,559 --> 00:25:11,440
+that things that have similar
+
+314
+00:25:11,440 --> 00:25:15,029
+meanings end up near each other in that space. So for
+
+315
+00:25:15,029 --> 00:25:17,769
+example, you might want porpoises and dolphins to be very near each other in the
+
+316
+00:25:17,769 --> 00:25:20,099
+embedding space, because they're quite similar words and have some
+
+317
+00:25:20,099 --> 00:25:23,099
+meanings they share,
+
+318
+00:25:24,909 --> 00:25:27,420
+OK,
+
+319
+00:25:27,420 --> 00:25:32,620
+and "SeaWorld" would be kind of nearby, and unrelated words pretty far away.
+
+320
+00:25:32,619 --> 00:25:39,069
+And you can train embeddings in a couple of ways: one is to have it just be the first
+
+321
+00:25:39,069 --> 00:25:42,519
+thing you do when you're feeding data into an LSTM. An even simpler
+
+322
+00:25:42,519 --> 00:25:47,859
+thing is a technique my former colleague Tomas Mikolov came up with; we
+
+323
+00:25:47,859 --> 00:25:51,969
+published a paper about it. It's called the word2vec model, and
+
+324
+00:25:51,970 --> 00:25:55,870
+essentially you pick a window of words, maybe twenty words wide; you pick the
+
+325
+00:25:55,869 --> 00:26:00,119
+center word, and then you pick another random nearby word, and try to use the embedding
+
+326
+00:26:00,119 --> 00:26:06,419
+representation of that center word to predict it. You can train that with
+
+327
+00:26:06,420 --> 00:26:11,230
+backprop: essentially you adjust the weights of a softmax classifier, and then
+
+328
+00:26:11,230 --> 00:26:17,190
+in turn, through backpropagation, you make little adjustments to the
+
+329
+00:26:17,190 --> 00:26:20,830
+embedding representation of that center word so that next time you'll be able to
+
+330
+00:26:20,829 --> 00:26:25,919
+better predict the word "parts" from "automobile".
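(A minimal word2vec-style skip-gram update in NumPy, with negative sampling standing in for the full softmax classifier mentioned above; the vocabulary size, dimensions, and word indices are hypothetical.)

```python
import numpy as np

rng = np.random.RandomState(0)
V, D = 5000, 100                    # vocabulary size, embedding dimension
W_in = rng.randn(V, D) * 0.01       # embedding for the center word
W_out = rng.randn(V, D) * 0.01      # weights of the classifier

def sgd_step(center, context, lr=0.05, k=5):
    """Nudge the center word's embedding so it better predicts a nearby word
    (against k random negative samples)."""
    v = W_in[center]
    targets = np.concatenate(([context], rng.randint(0, V, size=k)))
    labels = np.zeros(k + 1); labels[0] = 1.0        # 1 = the real neighbor
    scores = W_out[targets] @ v
    p = 1.0 / (1.0 + np.exp(-scores))                # sigmoid
    g = p - labels                                   # gradient of the log loss
    grad_in = g @ W_out[targets]
    W_out[targets] -= lr * np.outer(g, v)            # adjust the classifier
    W_in[center] -= lr * grad_in                     # adjust the embedding

# e.g. from a window like "... the automobile parts store ...":
automobile, parts = 1234, 2345       # hypothetical vocabulary indices
sgd_step(center=automobile, context=parts)
```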
+331
+00:26:25,920 --> 00:26:29,930
+And it actually works: one of the really nice things about embeddings is, given enough training
+
+332
+00:26:29,930 --> 00:26:34,070
+data, you get really phenomenal representations of words. So these are the
+
+333
+00:26:34,069 --> 00:26:39,759
+nearest neighbors for these three different words or phrases as vocabulary
+
+334
+00:26:39,759 --> 00:26:44,319
+items. In this particular one, "tiger shark": you can think of it as an embedding
+
+335
+00:26:44,319 --> 00:26:48,480
+vector, and these are the nearest neighbors; you see it got the sense of
+
+336
+00:26:48,480 --> 00:26:55,529
+shark-ness. "Car" is interesting, right; you can see why this is useful for search,
+
+337
+00:26:55,529 --> 00:27:01,000
+because you have things that people often hand-coded into information retrieval
+
+338
+00:27:01,000 --> 00:27:07,079
+systems, like plurals and stemming and some kinds of simple synonyms, but
+
+339
+00:27:07,079 --> 00:27:10,750
+here it just learned, oh, I know "car", "automobile", "pickup truck", "racing car",
+
+340
+00:27:10,750 --> 00:27:15,470
+"passenger car", "dealership" are all kind of related. You just see that it has this
+
+341
+00:27:15,470 --> 00:27:19,200
+kind of nice, smooth representation of the concept of car rather than
+
+342
+00:27:19,200 --> 00:27:26,509
+explicitly matching only the literal characters c-a-r. And it turns out that if you
+
+343
+00:27:26,509 --> 00:27:29,980
+train using the word2vec approach, directions turn out to be
+
+344
+00:27:29,980 --> 00:27:35,730
+meaningful in these spaces. So not only is proximity interesting, but directions
+
+345
+00:27:35,730 --> 00:27:38,730
+are interesting. So it turns out, if you look at
+
+346
+00:27:39,720 --> 00:27:43,860
+capital and country pairs, you go
+
+347
+00:27:43,859 --> 00:27:47,288
+roughly the same direction and distance to get from a country to its corresponding
+
+348
+00:27:47,288 --> 00:27:56,029
+capital, or vice versa, for any country-capital pair, like France and Paris. And you can also see
+
+349
+00:27:56,029 --> 00:27:59,298
+some semblance of other structure: these are the embeddings mapped down to two
+
+350
+00:27:59,298 --> 00:28:05,889
+dimensions with principal components analysis, and you see kind of
+
+351
+00:28:05,890 --> 00:28:12,788
+interesting structure around verb tenses, regardless of the verb. Which
+
+352
+00:28:12,788 --> 00:28:18,210
+means you can solve analogies like "queen is to king as woman is to man" by doing
+
+353
+00:28:18,210 --> 00:28:21,279
+simple vector arithmetic: you're literally just taking the embedding
+
+354
+00:28:21,279 --> 00:28:26,029
+vector and then adding the difference vector to land approximately at the
+
+355
+00:28:26,029 --> 00:28:35,269
+right point. So, in collaboration with the search team, we launched kind of
+
+356
+00:28:35,269 --> 00:28:40,668
+one of the biggest search ranking changes in the last few years; we called
+
+357
+00:28:40,669 --> 00:28:44,640
+it RankBrain: essentially just a deep neural net that uses embeddings and a
+
+358
+00:28:44,640 --> 00:28:50,059
+bunch of layers to give you a score for how relevant this document is for this
+
+359
+00:28:50,058 --> 00:28:51,730
+particular query,
+
+360
+00:28:51,730 --> 00:28:58,308
+and it's the third most important ranking signal out of the hundreds we use.
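(The analogy-by-vector-arithmetic trick, sketched in NumPy. The random vectors below are stand-ins for real trained embeddings; with real word2vec vectors, king - man + woman lands near queen.)

```python
import numpy as np

def nearest(vocab, vecs, query, k=3):
    """Return the k vocabulary items whose embeddings are closest to `query`."""
    q = query / np.linalg.norm(query)
    sims = (vecs / np.linalg.norm(vecs, axis=1, keepdims=True)) @ q
    return [vocab[i] for i in np.argsort(-sims)[:k]]

vocab = ["king", "queen", "man", "woman", "paris", "france"]
rng = np.random.RandomState(0)
vecs = rng.randn(len(vocab), 50)          # stand-in for trained embeddings
w = {word: vecs[i] for i, word in enumerate(vocab)}

# "queen is to king as woman is to man": take a vector, add the difference.
print(nearest(vocab, vecs, w["king"] - w["man"] + w["woman"]))
```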
and this is a message one of my +colleagues received a project that from + +368 +00:29:35,409 --> 00:29:37,720 +his brother he said we want to invite +you to join us for an early Thanksgiving + +369 +00:29:37,720 --> 00:29:43,220 +probable bob we've been your favorite +dish RCP next week so then the model + +370 +00:29:43,220 --> 00:29:48,100 +predicts countess and will be there or +sorry won't be able to make it + +371 +00:29:49,660 --> 00:29:54,810 +great if you get a lot of email it's +fantastic although your replies will be + +372 +00:29:54,809 --> 00:29:58,169 +somewhat curse of them which is nice + +373 +00:30:02,250 --> 00:30:07,329 +you know we can do interesting things +like this is a mobile app that actually + +374 +00:30:07,329 --> 00:30:11,779 +runs in airplane mode so it's actually +running the models on the phone and it's + +375 +00:30:11,779 --> 00:30:19,430 +actually got a lot of interesting things +entirely realized so you're essentially + +376 +00:30:19,430 --> 00:30:25,670 +using the camera image for detecting +text in your finding what the words are + +377 +00:30:25,670 --> 00:30:28,830 +doing OCR on it here then running it +through a translation model you can + +378 +00:30:28,829 --> 00:30:31,980 +figure it in a particular about this is +just cycling through different languages + +379 +00:30:31,980 --> 00:30:38,779 +but normally you'd set on Spanish money +but only show you Spanish but the thing + +380 +00:30:38,779 --> 00:30:43,460 +that in realizes there's actually an +interesting fun selection problem like + +381 +00:30:43,460 --> 00:30:49,210 +choose what I want to show you the +output so kind of call good if you're + +382 +00:30:49,210 --> 00:30:50,410 +traveling + +383 +00:30:50,410 --> 00:30:55,590 +interesting place I'm actually going to +Korea untiring so I am i'm looking + +384 +00:30:55,589 --> 00:31:04,549 +forward to using my translator up as +they don't be so one of the things we do + +385 +00:31:04,549 --> 00:31:09,000 +a bit of work on is reducing insurance +costs there's like nothing worse than + +386 +00:31:09,000 --> 00:31:15,789 +this feeling that wow my model is so +awesome with great it's just sad dreams + +387 +00:31:15,789 --> 00:31:18,309 +my phone's battery in Germany + +388 +00:31:18,309 --> 00:31:22,769 +or you know I can't afford the +temptation to run it at you know I keep + +389 +00:31:22,769 --> 00:31:27,039 +you in my data center even though I have +gotten machines so there's lots of + +390 +00:31:27,039 --> 00:31:31,720 +tricks you can use in particular the +simplest wanna news in for instance + +391 +00:31:31,720 --> 00:31:39,430 +generally much more forgiving of even +much lower precision computation dan + +392 +00:31:39,430 --> 00:31:44,120 +training so far in France we usually +find we can quantized all the way to get + +393 +00:31:44,119 --> 00:31:48,319 +through even less a bit too sista nice +quality but cheap you'd like to deal + +394 +00:31:48,319 --> 00:31:52,139 +with really you could do six that's +prolly but that doesn't help that much + +395 +00:31:52,140 --> 00:31:57,930 +that gives you like a nice Forex memory +reduction in storing the parameters and + +396 +00:31:57,930 --> 00:32:01,850 +also give you for a competition +efficiency cuz you can use CPU vector + +397 +00:32:01,849 --> 00:32:08,809 +instructions to 24 multiplies instead of +1:30 but why suddenly got to tell you + +398 +00:32:08,809 --> 00:32:13,879 +about kind of a cuter more exotic way of +getting more efficiency out of a mobile + +399 +00:32:13,880 --> 
+398
+00:32:08,809 --> 00:32:13,879
+But now I want to tell you about kind of a cuter, more exotic way of getting more efficiency out of a mobile
+
+399
+00:32:13,880 --> 00:32:14,310
+phone:
+
+400
+00:32:14,309 --> 00:32:19,169
+a technique called distillation that Geoffrey Hinton, Oriol Vinyals, and I worked
+
+401
+00:32:19,170 --> 00:32:24,910
+on. So suppose you have a really, really giant model for the problem I just described,
+
+402
+00:32:24,910 --> 00:32:30,660
+this fantastic model you're really pleased with, maybe an ensemble of those, and
+
+403
+00:32:30,660 --> 00:32:36,430
+now you want a smaller, cheaper model with almost the same accuracy. So here's
+
+404
+00:32:36,430 --> 00:32:41,480
+your giant expensive model: you feed it an image and it gives you fantastic
+
+405
+00:32:41,480 --> 00:32:47,630
+predictions, like 0.95 jaguar, I'm pretty sure, and I'm definitely sure that's not
+
+406
+00:32:47,630 --> 00:32:48,530
+a car,
+
+407
+00:32:48,529 --> 00:32:57,769
+and if I'm hedging my bets, it could be a lion, right. So that's
+
+408
+00:32:57,769 --> 00:33:02,900
+what a really accurate model will do. That's the main idea; unfortunately we later
+
+409
+00:33:02,900 --> 00:33:07,380
+discovered that Rich Caruana in 2006 had published a similar idea in a paper
+
+410
+00:33:07,380 --> 00:33:13,310
+called model compression. So the ensemble, or your giant accurate model, implements
+
+411
+00:33:13,309 --> 00:33:18,669
+this interesting function from input to output. If you forget the fact
+
+412
+00:33:18,670 --> 00:33:22,720
+that there's some structure there, and you just try to use the information
+
+413
+00:33:22,720 --> 00:33:27,500
+that's contained in that function: how can we transfer the knowledge in that
+
+414
+00:33:27,500 --> 00:33:30,730
+really accurate function into a smaller
+
+415
+00:33:30,730 --> 00:33:36,339
+version of the function? So when you're training a model, typically what you do
+
+416
+00:33:36,339 --> 00:33:40,740
+is you feed in an image like this and then you give it targets to try to achieve,
+
+417
+00:33:40,740 --> 00:33:47,109
+and you give it the target: one for jaguar, zero for everything else. I'm going to
+
+418
+00:33:47,109 --> 00:33:52,819
+call that a hard target. So that's kind of the ideal your model is striving to
+
+419
+00:33:52,819 --> 00:33:56,298
+achieve, and you give it, you know, hundreds of thousands or millions of
+
+420
+00:33:56,298 --> 00:34:00,918
+training images, and it tries to approximate all these target vectors from the
+
+421
+00:34:00,919 --> 00:34:05,160
+corresponding images. In actual fact, it doesn't quite do that, because it gives you this nice
+
+422
+00:34:05,160 --> 00:34:09,990
+probability distribution over different classes
+
+423
+00:34:09,989 --> 00:34:17,579
+for the same image. So let's take our giant expensive model, and one of the
+
+424
+00:34:17,579 --> 00:34:22,079
+things we can do is actually soften that distribution a bit, and this
+
+425
+00:34:22,079 --> 00:34:30,940
+is what Geoffrey Hinton calls dark knowledge. If you soften this by
+
+426
+00:34:30,940 --> 00:34:34,500
+essentially dividing all the logits by a temperature, which might
+
+427
+00:34:34,500 --> 00:34:38,820
+be like five or ten or something, you then get a softer representation of this
+
+428
+00:34:38,820 --> 00:34:44,159
+probability distribution, where you say, OK, it's a jaguar, but I'll also kind of hedge
+
+429
+00:34:44,159 --> 00:34:48,950
+a little and call it a bit of a lion, maybe even less of a cow, and still call
+
+430
+00:34:48,949 --> 00:34:56,878
+it definitely not a car. And that's something you can then use, and this
+
+431
+00:34:56,878 --> 00:35:00,139
+soft distribution has a lot more information about the image, about the
+
+432
+00:35:00,139 --> 00:35:04,429
+function being implemented by this large ensemble: the ensemble is trying to
+
+433
+00:35:04,429 --> 00:35:08,169
+hedge its bets and do a really good job of giving you a probability
+
+434
+00:35:08,170 --> 00:35:15,559
+distribution for that image. So then you can train the small model: normally
+
+435
+00:35:15,559 --> 00:35:19,070
+when you train, you just train on hard targets, but instead you can train on
+
+436
+00:35:19,070 --> 00:35:25,640
+some combination of the hard targets plus the soft targets, and the training
+
+437
+00:35:25,639 --> 00:35:32,089
+objective is going to try to match some weighted combination of those two things.
+
+438
+00:35:32,090 --> 00:35:37,579
+This works surprisingly well. Here's an experiment we did on a large speech
+
+439
+00:35:37,579 --> 00:35:42,039
+model: we started with a model that classified 58.9 percent of frames
+
+440
+00:35:42,039 --> 00:35:46,190
+correctly; that's our big, accurate model. Now we're going to use that model
+
+441
+00:35:46,190 --> 00:35:50,829
+to provide soft targets for a smaller model; it also gets to see the hard
+
+442
+00:35:50,829 --> 00:35:57,690
+targets, and we're going to train on only 3% of the data. The new model with the
+
+443
+00:35:57,690 --> 00:36:04,599
+soft targets kept almost all that accuracy, 57%; with just hard targets, it
+
+444
+00:36:05,210 --> 00:36:12,800
+drastically overfits: 44.5% accuracy, and then it goes south. So soft targets are a really,
+
+445
+00:36:12,800 --> 00:36:17,700
+really good regularizer. The other thing is that, because the soft targets
+
+446
+00:36:17,699 --> 00:36:21,739
+have so much information in them, compared to just a single one-hot label, you
+
+447
+00:36:21,739 --> 00:36:27,889
+train much, much faster: you get to that accuracy in a much shorter amount of
+
+448
+00:36:27,889 --> 00:36:33,358
+time. That's pretty nice. And you can use this approach for taking
+
+449
+00:36:33,358 --> 00:36:37,889
+ensembles and mapping them into a model the size of one ensemble member, or you can go from a large
+
+450
+00:36:37,889 --> 00:36:45,269
+model into a smaller one. It's a somewhat under-appreciated technique.
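(A sketch of the distillation objective as described: cross-entropy on the hard targets mixed with cross-entropy against the teacher's temperature-softened "dark knowledge" distribution. The mixing weight, temperature, and random logits here are illustrative.)

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.exp((logits - logits.max(1, keepdims=True)) / T)
    return z / z.sum(1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      T=5.0, alpha=0.5):
    """Weighted combination of hard-target and soft-target cross-entropy."""
    n = len(hard_labels)
    soft_targets = softmax(teacher_logits, T)     # divide teacher logits by T
    p_hard = softmax(student_logits)              # T = 1 for the hard targets
    p_soft = softmax(student_logits, T)
    hard_ce = -np.log(p_hard[np.arange(n), hard_labels]).mean()
    soft_ce = -(soft_targets * np.log(p_soft)).sum(1).mean()
    return alpha * hard_ce + (1 - alpha) * soft_ce

rng = np.random.RandomState(0)
teacher = rng.randn(32, 10) * 4        # confident big-model logits (stand-in)
student = rng.randn(32, 10)
labels = teacher.argmax(1)             # hard targets
print(distillation_loss(student, teacher, labels))
```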
+451
+00:36:45,269 --> 00:36:51,980
+OK, let's see. So one of the things we did when we thought about building TensorFlow was,
+
+452
+00:36:51,980 --> 00:36:56,309
+we kind of took a step back from where we were, and we said, what do you really
+
+453
+00:36:56,309 --> 00:36:59,259
+want in a research system? You want a lot of different things, and it's kind of
+
+454
+00:36:59,260 --> 00:37:04,740
+hard to balance all of those things, but really one of the things you really care
+
+455
+00:37:04,739 --> 00:37:08,489
+about as a researcher is expressiveness: I want to be able to take any
+
+456
+00:37:08,489 --> 00:37:12,589
+old research idea and try it out.
+
+457
+00:37:15,119 --> 00:37:37,219
+[Answering a question] It was considerably smaller: instead of thousand-wide fully connected layers,
+
+458
+00:37:37,219 --> 00:37:43,409
+it was like 600 or 500 wide, which is actually a big difference, but check
+
+459
+00:37:43,409 --> 00:37:51,399
+that paper for the details; I've probably misremembered. Right, and then you want to
+
+460
+00:37:51,400 --> 00:37:55,490
+be able to take your research idea and run it quickly. You want to be able
+
+461
+00:37:55,489 --> 00:38:00,689
+to run it on both datacenters and iPhones; it's nice to be able to reproduce
+
+462
+00:38:00,690 --> 00:38:04,269
+things, and you want to go from a good research idea to a production system
+
+463
+00:38:04,269 --> 00:38:10,730
+without having to rewrite it in some other system. Those were kind of the main
+
+464
+00:38:10,730 --> 00:38:15,659
+things we were considering when building TensorFlow and open sourcing it. As you're
+
+465
+00:38:15,659 --> 00:38:25,519
+aware, our first emphasis is flexibility. So for the core bits of TensorFlow, we
+
+466
+00:38:25,519 --> 00:38:30,769
+have a notion of different devices; it is portable, it runs on many different
+
+467
+00:38:30,769 --> 00:38:34,340
+operating systems; we have this core graph execution engine, and then on top
+
+468
+00:38:34,340 --> 00:38:37,700
+of that we have different front ends where you express the kinds of computations
+
+469
+00:38:37,699 --> 00:38:41,819
+you're trying to do. We have a C++ front end, which most people don't use in my
+
+470
+00:38:41,820 --> 00:38:45,700
+experience; we have the Python front end, which I'm sure most of you are
+
+471
+00:38:45,699 --> 00:38:49,339
+probably more familiar with; but there's nothing preventing people from putting
+
+472
+00:38:49,340 --> 00:38:55,750
+other languages on top. We wanted it to be fairly language neutral, so there is some work
+
+473
+00:38:55,750 --> 00:38:58,269
+going on to put a Go front end on there, and
+
+474
+00:38:58,269 --> 00:39:03,980
+other kinds of languages. And you want to be able to take that model and run it
+
+475
+00:39:03,980 --> 00:39:09,440
+on a pretty wide variety of different platforms. The basic computational model
+
+476
+00:39:09,440 --> 00:39:12,710
+is the graph; I don't know how much you talked about this in your overview of
+
+477
+00:39:12,710 --> 00:39:17,179
+TensorFlow... a little bit, OK. So in this graph, the things that flow along the edges are tensors:
+
+478
+00:39:17,179 --> 00:39:25,469
+arbitrary n-dimensional arrays with a primitive type, like float or int. Unlike
+
+479
+00:39:25,469 --> 00:39:29,269
+pure dataflow models, there's actually state in this graph: you have things
+
+480
+00:39:29,269 --> 00:39:33,219
+like "biases", which is a variable, and then you have operations that can update
+
+481
+00:39:33,219 --> 00:39:37,019
+that state. So you can go through the whole graph, compute some
+
+482
+00:39:37,019 --> 00:39:45,329
+gradient, and then adjust the biases based on the gradient.
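(What that stateful graph looks like in the open-source TensorFlow 1.x-style Python front end; the little softmax model, shapes, and learning rate are illustrative stand-ins.)

```python
import numpy as np
import tensorflow as tf   # 1.x-style graph API

x = tf.placeholder(tf.float32, [None, 4])        # tensors flow along the edges
y = tf.placeholder(tf.float32, [None, 2])
weights = tf.Variable(tf.random_normal([4, 2]))  # stateful nodes in the graph
biases = tf.Variable(tf.zeros([2]))              # the "biases" variable
logits = tf.matmul(x, weights) + biases
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
# minimize() adds the gradient computation plus the ops that update the state.
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch_x = np.random.rand(8, 4).astype(np.float32)
    batch_y = np.eye(2)[np.random.randint(0, 2, 8)].astype(np.float32)
    for _ in range(100):       # set the graph up once, run it a lot
        _, l = sess.run([train_op, loss], feed_dict={x: batch_x, y: batch_y})
```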
+483
+00:39:45,329 --> 00:39:50,809
+The graph goes through a series of stages; one important stage is deciding, given a whole bunch of
+
+484
+00:39:50,809 --> 00:39:55,670
+computational devices and the graph, where we are going to run each of the different
+
+485
+00:39:55,670 --> 00:40:01,369
+nodes in the graph, in terms of computation. For example, here we might have a CPU in
+
+486
+00:40:01,369 --> 00:40:06,650
+blue and a GPU card in green, and we might want to run the graph in such a
+
+487
+00:40:06,650 --> 00:40:13,160
+way that all of the expensive computation happens on the GPU. Actually, as an
+
+488
+00:40:13,159 --> 00:40:17,259
+aside, these placement decisions are kind of tricky; we allow users to provide hints
+
+489
+00:40:17,260 --> 00:40:22,760
+to guide this a bit, and then, given the hints, which are not necessarily hard
+
+490
+00:40:22,760 --> 00:40:26,750
+constraints pinning a node to a device, but might be something like, you should
+
+491
+00:40:26,750 --> 00:40:33,300
+really try to run this on a GPU, or place it on task seven and I don't care which
+
+492
+00:40:33,300 --> 00:40:40,200
+device, we then want to basically
+minimize the time for the graph, subject to
+
+493
+00:40:40,199 --> 00:40:44,159
+all kinds of other constraints, like the memory we have available on each GPU
+
+494
+00:40:44,159 --> 00:40:51,199
+card or on CPUs. I think it'd be interesting actually to attack this placement problem with
+
+495
+00:40:51,199 --> 00:40:54,639
+some reinforcement learning, because you can actually measure an objective here:
+
+496
+00:40:54,639 --> 00:40:58,759
+you know, if I place this node and this node and this node in this way, how
+
+497
+00:40:58,760 --> 00:41:02,500
+fast is my graph? I think that would be a pretty interesting reinforcement
+
+498
+00:41:02,500 --> 00:41:02,929
+learning
+
+499
+00:41:02,929 --> 00:41:09,139
+problem. Once we've made decisions about where to place things, then we insert the send and
+
+500
+00:41:09,139 --> 00:41:12,500
+receive nodes, which essentially encapsulate all the communication in the system.
+
+501
+00:41:12,500 --> 00:41:16,800
+Basically, if you want to move a tensor from one place to another, the send node
+
+502
+00:41:16,800 --> 00:41:21,200
+kind of just holds onto the tensor until the receive node checks in and says, I'd
+
+503
+00:41:21,199 --> 00:41:26,669
+really love that data now, and you do this for all the edges that cross
+
+504
+00:41:26,670 --> 00:41:32,150
+device boundaries. And you have different implementations of send/receive pairs
+
+505
+00:41:32,150 --> 00:41:36,220
+depending on the devices involved: for example, if the GPUs are on the same
+
+506
+00:41:36,219 --> 00:41:39,779
+machine, you can often do a DMA directly from one GPU's memory to the other;
+
+507
+00:41:39,780 --> 00:41:44,410
+if they're on different machines, you do a cross-machine RPC; your network might
+
+508
+00:41:44,409 --> 00:41:50,868
+support RDMA across the network, in which case you would just directly reach
+
+509
+00:41:50,869 --> 00:41:56,920
+into the other GPU's memory on the other machine. And you can
+
+510
+00:41:56,920 --> 00:42:00,210
+define new operations and kernels pretty easily.
+
+511
+00:42:00,210 --> 00:42:06,920
+The session interface is essentially how you run the graph: typically you
+
+512
+00:42:06,920 --> 00:42:10,940
+set up a graph once and then you run it a lot, and that allows us to have
+
+513
+00:42:10,940 --> 00:42:17,068
+the system do a lot of optimization, and make decisions about essentially how it wants
+
+514
+00:42:17,068 --> 00:42:22,199
+to place computation, and then perhaps do some experiments, like, does it make
+
+515
+00:42:22,199 --> 00:42:26,068
+more sense to put this here or there, because it can amortize that analysis over many runs. What
+
+516
+00:42:26,068 --> 00:42:30,969
+my colleague Brian calls the single-process configuration: everything runs in one
+
+517
+00:42:30,969 --> 00:42:35,509
+process and it's just sort of simple procedure calls. In a distributed setting,
+
+518
+00:42:35,510 --> 00:42:38,440
+there's a client process, a master process, and then a bunch of workers that
+
+519
+00:42:38,440 --> 00:42:43,608
+have devices. The client says to the master, I'd like to run this subgraph; the
+
+520
+00:42:43,608 --> 00:42:47,568
+master says, OK, that means I need to talk to processes one and two to tell them to
+
+521
+00:42:47,568 --> 00:42:54,808
+do stuff. You can feed in and fetch data, and that means that I might sort of have a
+
+522
+00:42:54,809 --> 00:42:59,619
+more complex graph but only need to run little bits of it, because I only need
+
+523
+00:42:59,619 --> 00:43:05,440
+to run the parts of the computation that the requested outputs depend on.
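(A sketch of placement hints and the session interface, again in TensorFlow 1.x-style code. With soft placement enabled, the device strings really are hints rather than hard constraints, as described above; the shapes here are arbitrary.)

```python
import tensorflow as tf   # 1.x-style graph API

with tf.device('/cpu:0'):
    a = tf.random_normal([1000, 1000])    # input prep placed on the CPU
with tf.device('/gpu:0'):                 # "really try to run this on a GPU"
    b = tf.matmul(a, a)                   # the heavy compute
# Crossing the CPU/GPU boundary above is where a send/receive node pair
# gets inserted for you when the graph is partitioned across devices.

config = tf.ConfigProto(allow_soft_placement=True,   # hints, not constraints
                        log_device_placement=True)   # print the decisions made
with tf.Session(config=config) as sess:              # set up once, run a lot
    sess.run(b)
```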
+524
+00:43:05,940 --> 00:43:14,940
+That's the basic execution story. We focused a lot on being able to scale to this
+
+525
+00:43:14,940 --> 00:43:19,099
+distributed environment. Actually, one of the biggest gaps when we first open
+
+526
+00:43:19,099 --> 00:43:23,210
+sourced TensorFlow was we hadn't quite carved apart an open-sourceable
+
+527
+00:43:23,210 --> 00:43:28,269
+distributed implementation, so that was GitHub issue number 23, which got
+
+528
+00:43:28,269 --> 00:43:33,259
+filed within like a day of our release: hey, where's the distributed version?
+
+529
+00:43:33,260 --> 00:43:39,839
+We did the initial release last Thursday, so that's good. It'll get better
+
+530
+00:43:39,838 --> 00:43:43,619
+packaging, but at the moment you kind of configure multiple processes with
+
+531
+00:43:43,619 --> 00:43:48,710
+the names of the other processes involved, IP addresses and ports; we're
+
+532
+00:43:48,710 --> 00:43:55,150
+going to package that up better in the next couple of weeks, but that's good. And the
+
+533
+00:43:55,150 --> 00:43:59,250
+whole reason to have that is that you want much better turnaround time for
+
+534
+00:43:59,250 --> 00:44:05,889
+experiments. So if you're in the mode where your training and experiment
+
+535
+00:44:05,889 --> 00:44:09,769
+iteration is kind of minutes or hours, that's really, really good; if you're in
+
+536
+00:44:09,769 --> 00:44:15,159
+the mode of multiple weeks, that's kind of hopeless; more than a month,
+
+537
+00:44:15,159 --> 00:44:19,279
+you generally won't do it, or if you do, you're like, oh, my training's done;
+
+538
+00:44:19,280 --> 00:44:26,130
+why did I do that again? So we really emphasize a lot in our group just making
+
+539
+00:44:26,130 --> 00:44:31,269
+it so people can do experiments as fast as is reasonable.
+
+540
+00:44:33,920 --> 00:44:39,250
+So the two main things we do are model parallelism and data parallelism; I'll talk about
+
+541
+00:44:39,250 --> 00:44:46,588
+both. You've talked about this a little bit already? OK. So the best way you can
+
+542
+00:44:46,588 --> 00:44:52,279
+decrease training time is to decrease the step time, and one of the really nice
+
+543
+00:44:52,280 --> 00:44:56,329
+properties of most neural nets is there's lots and lots of inherent parallelism, right; like
+
+544
+00:44:56,329 --> 00:44:59,329
+if you think about a convolutional model, there's lots of parallelism in
+
+545
+00:45:00,539 --> 00:45:04,119
+each of the layers, because all the spatial positions are mostly independent,
+
+546
+00:45:04,119 --> 00:45:06,280
+and you can just run them
+
+547
+00:45:06,280 --> 00:45:10,680
+in parallel on different devices. The problem is figuring out how to communicate,
+
+548
+00:45:10,679 --> 00:45:17,889
+how to distribute that computation in such a way that the communication doesn't kill you. The
+
+549
+00:45:17,889 --> 00:45:21,389
+thing that helps you here is local connectivity: convolutional neural
+
+550
+00:45:21,389 --> 00:45:25,299
+nets have this nice property that they're generally looking at, like, a five by
+
+551
+00:45:25,300 --> 00:45:31,070
+five patch of data below them, and they don't need anything else, and the neuron
+
+552
+00:45:31,070 --> 00:45:35,289
+next to it has a whole lot of overlap with the data it needs compared to that
+
+553
+00:45:35,289 --> 00:45:41,099
+first neuron. You can have towers with little or no connectivity between the towers, so
+
+554
+00:45:41,099 --> 00:45:46,179
+every few layers you might communicate a little bit, but mostly you don't. The AlexNet
+555
+00:45:46,179 --> 00:45:50,399
+paper did that: it essentially had two separate towers that mostly ran
+
+556
+00:45:50,400 --> 00:45:55,880
+independently on two different GPUs and occasionally exchanged some information,
+
+557
+00:45:55,880 --> 00:45:59,220
+and you often got specialized parts of the model that way, for example.
+
+558
+00:45:59,219 --> 00:46:06,759
+There are lots of levels at which to exploit parallelism: when you're just naively
+
+559
+00:46:06,760 --> 00:46:10,630
+compiling matrix multiply code with gcc or something, it'll probably already
+
+560
+00:46:10,630 --> 00:46:16,880
+take advantage of the instruction-level parallelism present on Intel CPUs; across cores
+
+561
+00:46:16,880 --> 00:46:23,420
+you can use thread parallelism; and across devices... communicating
+
+562
+00:46:23,420 --> 00:46:27,760
+between devices is often pretty limited: you have like a factor of 30 to 40
+
+563
+00:46:27,760 --> 00:46:31,950
+better bandwidth to your local GPU memory than you do to, like, another
+
+564
+00:46:31,949 --> 00:46:36,750
+GPU card's memory on the same machine, and across machines it's in general even
+
+565
+00:46:36,750 --> 00:46:41,519
+worse. So it's pretty important to keep as much data local as you can and
+
+566
+00:46:41,519 --> 00:46:48,159
+avoid communicating too much. In model parallelism, the basic idea is you're
+
+567
+00:46:48,159 --> 00:46:51,929
+just going to partition the computational model somehow, maybe
+
+568
+00:46:51,929 --> 00:47:01,710
+spatially like this, maybe layer by layer, and then, in this case for example,
+
+569
+00:47:01,710 --> 00:47:05,730
+the only communication I need to do is at this boundary: you know, some of the
+
+570
+00:47:05,730 --> 00:47:09,039
+data from partition two is needed for the input of partition one, but
+
+571
+00:47:09,039 --> 00:47:16,949
+mostly everything is local.
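(A toy NumPy picture of model parallelism as just described: one layer's neurons partitioned across two hypothetical devices, with communication needed only at the boundary where the halves are joined.)

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(64, 512)                 # one mini-batch of activations

# Partition one fully connected layer's neurons across two "devices":
# each holds half the weight columns and computes half the outputs.
W = rng.randn(512, 1024) * 0.01
W_dev0, W_dev1 = W[:, :512], W[:, 512:]

h0 = np.maximum(0, x @ W_dev0)         # runs on device 0
h1 = np.maximum(0, x @ W_dev1)         # runs on device 1, in parallel

# Only the boundary requires communication: concatenating the two halves
# (in a real system, this is where the cross-device transfer happens).
h = np.concatenate([h0, h1], axis=1)
assert np.allclose(h, np.maximum(0, x @ W))   # same result as one device
```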
+00:48:19,900 --> 00:48:24,950 +that my behind and 27 machines been +stopped and then you know you might have + +587 +00:48:24,949 --> 00:48:29,259 +five and replicas of the models down +there and before every model replica + +588 +00:48:29,260 --> 00:48:34,430 +doesn't match its gonna grab the +parameters so it says okay you hundred + +589 +00:48:34,429 --> 00:48:39,179 +and twenty-seven machines give me the +parameters and then it does a + +590 +00:48:39,179 --> 00:48:44,289 +combination of around the mini badge and +because I would agree it should be it + +591 +00:48:44,289 --> 00:48:47,869 +doesn't apply to rate of time degrading +back to the parameters servers routers + +592 +00:48:47,869 --> 00:48:52,829 +servers then update the current +parameter values and then before the + +593 +00:48:52,829 --> 00:48:58,039 +next step we did the same thing really +network intensive depending on your + +594 +00:48:58,039 --> 00:49:01,690 +model things that help here are modeled +the don't have very many parameters + +595 +00:49:01,690 --> 00:49:06,068 +competitions are really nice in that +respect Ella standardise in that respect + +596 +00:49:06,068 --> 00:49:11,250 +because you're essentially than reusing +every parameter lock them up the time so + +597 +00:49:11,250 --> 00:49:16,929 +you're already using you know however +bigger batch size is on the model of + +598 +00:49:16,929 --> 00:49:20,088 +your child's 228 you're gonna bring +pressure over you can use a hundred and + +599 +00:49:20,088 --> 00:49:23,900 +twenty eight times for all the columns +in the match but have a convolutional + +600 +00:49:23,900 --> 00:49:28,970 +model now you're gonna get an additional +factor of reuse of maybe like $10 in + +601 +00:49:28,969 --> 00:49:30,019 +different positions + +602 +00:49:30,019 --> 00:49:34,769 +in a layer that you're going to use it +an analysis p.m. 
+603
+00:49:34,769 --> 00:49:41,460
+If you unroll a hundred time steps, you can
+reuse each parameter a hundred times just for the unrolling. Those kinds

+604
+00:49:41,460 --> 00:49:47,220
+of models that have lots of
+computation and fewer parameters to sort

+605
+00:49:47,219 --> 00:49:50,109
+of drive that computation generally
+work better in data-parallel

+606
+00:49:50,110 --> 00:49:57,340
+environments. Now, there's an obvious
+issue depending on how you do this. So

+607
+00:49:57,340 --> 00:50:00,720
+one way you can do it is completely
+asynchronously: every model replica is just

+608
+00:50:00,719 --> 00:50:05,459
+sitting in a loop, getting the
+parameters, doing a mini-batch, computing

+609
+00:50:05,460 --> 00:50:09,210
+a gradient, sending it up there. And if you
+do that asynchronously, then the gradient

+610
+00:50:09,210 --> 00:50:13,710
+it computes may be somewhat stale with
+respect to where the parameters are

+611
+00:50:13,710 --> 00:50:17,030
+now, right? It computed it with respect
+to this parameter value, but

+612
+00:50:17,030 --> 00:50:20,810
+meanwhile ten other replicas have made
+updates, so the parameters have meandered over

+613
+00:50:20,809 --> 00:50:27,529
+here, and now you apply the gradient that
+you thought was for here. This makes

+614
+00:50:27,530 --> 00:50:31,080
+mathematicians incredibly uncomfortable;
+they're already uncomfortable because it's

+615
+00:50:31,079 --> 00:50:38,619
+a completely non-convex problem. But the
+good news is it works, up to a certain

+616
+00:50:38,619 --> 00:50:43,670
+level. It would be really good to understand
+the conditions under which this

+617
+00:50:43,670 --> 00:50:48,059
+works on a theoretical basis, but in
+practice it does seem to work pretty

+618
+00:50:48,059 --> 00:50:51,710
+well. The other thing you can do is do
+this completely synchronously: you can

+619
+00:50:51,710 --> 00:50:55,800
+have one driving loop that says, OK,
+everyone go. They all get the parameters,

+620
+00:50:55,800 --> 00:50:58,610
+they all compute gradients, and then you
+wait for the gradients to show up and do

+621
+00:50:58,610 --> 00:51:03,820
+something with the gradients, like average
+them, and that effectively just

+622
+00:51:03,820 --> 00:51:09,269
+looks like a giant batch: with R replicas
+it looks like, you know, R times each

+623
+00:51:09,269 --> 00:51:14,300
+individual one's batch size, which
+sometimes works. You kind of get

+624
+00:51:14,300 --> 00:51:18,950
+diminishing returns from larger and
+larger batch sizes, but the more training

+625
+00:51:18,949 --> 00:51:21,169
+examples you have,

+626
+00:51:21,170 --> 00:51:26,159
+the more tolerant you are of bigger batch
+sizes. Generally, if you have a trillion training

+627
+00:51:26,159 --> 00:51:30,420
+examples, a batch size of a thousand is
+OK; if you have a million training

+628
+00:51:30,420 --> 00:51:36,068
+examples, a batch size of a thousand is
+not so great, right?

+629
+00:51:36,639 --> 00:51:41,289
+And there are even more
+complicated choices: you can have,

+630
+00:51:41,289 --> 00:51:52,650
+like, asynchronous groups of synchronous
+replicas. Right, I said that recurrent models are good:

+631
+00:51:52,650 --> 00:51:57,829
+they reuse the parameters a lot. So data
+parallelism is actually really, really

+632
+00:51:57,829 --> 00:52:02,740
+important for almost all of our models;
+that's how we get to the point of

+633
+00:52:02,739 --> 00:52:10,669
+training models in like half a day or a
+day, generally.

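The synchronous and asynchronous regimes above differ in one line of arithmetic. A small sketch (toy least-squares loss, illustrative names): averaging the replicas' gradients synchronously is exactly the gradient of one giant combined batch, while an asynchronous replica may apply a gradient that was computed at stale parameters.

```python
import numpy as np

def grad(w, X, y):                     # least-squares gradient on one mini-batch
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
w = rng.standard_normal(5)
batches = [(rng.standard_normal((32, 5)), rng.standard_normal(32))
           for _ in range(4)]          # 4 replicas, batch size 32 each

# Synchronous: averaging replica gradients == gradient of a 128-example batch.
g_sync = np.mean([grad(w, X, y) for X, y in batches], axis=0)
X_all = np.concatenate([X for X, _ in batches])
y_all = np.concatenate([y for _, y in batches])
print(np.allclose(g_sync, grad(w, X_all, y_all)))    # True

# Asynchronous: the gradient was computed where this replica *thought* w was,
# but it gets applied to wherever the parameters have meandered to since.
w_stale = w + 0.1 * rng.standard_normal(5)
w = w - 0.05 * grad(w_stale, *batches[0])            # works, up to a point
```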
+634
+00:52:10,670 --> 00:52:19,180
+So you've seen the rough kinds of setups
+we use, and here's an example training graph of

+635
+00:52:19,179 --> 00:52:25,489
+an ImageNet model: one GPU, 10 GPUs, 50
+GPUs, and there's the kind of speedup

+636
+00:52:25,489 --> 00:52:26,239
+you get.

+637
+00:52:26,239 --> 00:52:29,759
+Sometimes these graphs are
+deceiving: like, the difference between 10

+638
+00:52:29,760 --> 00:52:34,220
+and 50 doesn't seem that big; the
+lines are kind of close to each other.

+639
+00:52:34,219 --> 00:52:39,489
+But in actual fact the
+difference between 10 and 50 is like a

+640
+00:52:39,489 --> 00:52:43,798
+factor of four point one or something. So
+that doesn't look like a factor-of-4.1

+641
+00:52:43,798 --> 00:52:51,920
+difference, does it? But it is. Yeah, the way
+you read it off is you look at where

+642
+00:52:51,920 --> 00:52:59,150
+one curve crosses, say, point six, and where
+the other one crosses that point. OK,

+643
+00:52:59,150 --> 00:53:04,490
+so let me show you some of the slight
+tweaks you make to TensorFlow models to

+644
+00:53:04,489 --> 00:53:08,149
+exploit these different kinds of
+parallelism. One of the things we wanted

+645
+00:53:08,150 --> 00:53:13,280
+was for these kinds of parallelism
+notions to be pretty easy to express, and

+646
+00:53:13,280 --> 00:53:17,500
+one of the things I like about TensorFlow
+is it maps pretty well to the kind of

+647
+00:53:17,500 --> 00:53:22,949
+things you might see in a research paper.
+So I don't expect you to read all that, but

+648
+00:53:22,949 --> 00:53:30,189
+it's not too different from what you
+would see in a paper, which is kind of nice,

+649
+00:53:30,190 --> 00:53:37,940
+like a simple LSTM cell. This is the
+sequence-to-sequence model that Ilya

+650
+00:53:37,940 --> 00:53:43,079
+Sutskever, Oriol Vinyals and Quoc Le
+published at NIPS 2014, where you're essentially

+651
+00:53:43,079 --> 00:53:47,849
+trying to take an input sequence and map
+it to an output sequence. This is a

+652
+00:53:47,849 --> 00:53:51,679
+really big area of research; it turns out
+these kinds of models are applicable to

+653
+00:53:51,679 --> 00:53:56,849
+lots and lots of kinds of problems, and
+there are lots of different groups doing

+654
+00:53:56,849 --> 00:54:07,369
+interesting, active work in this area.
+So here are just some examples of recent

+655
+00:54:07,369 --> 00:54:13,269
+work in the last year and a half in this
+area from different labs around

+656
+00:54:13,269 --> 00:54:17,630
+the world. You've already talked about

+657
+00:54:17,630 --> 00:54:26,320
+image captioning, I gather. So instead of a
+sequence, you can put in pixels: you

+658
+00:54:26,320 --> 00:54:31,890
+put in pixels, run them through a CNN,
+that's your initial state, and then you

+659
+00:54:31,889 --> 00:54:34,889
+can generate captions. Pretty amazing.

+660
+00:54:36,030 --> 00:54:42,019
+Could we have done this five years ago?
+I don't think so, not for a while.

+661
+00:54:42,019 --> 00:54:46,730
+And I should say it's
+a generative model, so you can generate

+662
+00:54:46,730 --> 00:54:51,320
+different sentences by exploring the
+distribution. You know, I think both of these

+663
+00:54:51,320 --> 00:54:56,870
+are OK captions; not quite as
+sophisticated as the human one. You don't

+664
+00:54:56,869 --> 00:55:01,230
+often see this; one of the things is,

+665
+00:55:01,230 --> 00:55:07,639
+if you train the model a little bit,
+it's really important to train the

+666
+00:55:07,639 --> 00:55:13,210
+model to convergence, because, like, that
+one's not so bad, but if you train the

+667
+00:55:13,210 --> 00:55:17,070
+model longer, the same model just gets a
+lot better.

+668
+00:55:21,079 --> 00:55:25,139
+Same thing here, right: "a train that is
+sitting on the tracks". Yes, that's true,

+669
+00:55:25,139 --> 00:55:30,909
+and that one's better, but you still see
+the human has a lot more sophistication,

+670
+00:55:30,909 --> 00:55:35,480
+right? Like, they know that the tracks
+cross near a depot; that's

+671
+00:55:35,480 --> 00:55:42,199
+sort of a more subtle thing for the
+model to pick up on. Another kind of cute

+672
+00:55:42,199 --> 00:55:48,750
+use: you can actually use these to solve
+all kinds of cool graph problems. So Oriol

+673
+00:55:48,750 --> 00:55:56,440
+Vinyals, Meire Fortunato and Navdeep Jaitly
+did this work where you start with a set of

+674
+00:55:56,440 --> 00:56:03,059
+points and then you try to predict the
+traveling-salesman tour for those points,

+675
+00:56:03,059 --> 00:56:11,559
+or the convex hull, or the Delaunay
+triangulation. The point is, you

+676
+00:56:11,559 --> 00:56:14,199
+know, it's just a sequence-to-sequence
+problem where you feed in the sequence of

+677
+00:56:14,199 --> 00:56:18,129
+points, and then the output is the right
+set of points for whatever problem you

+678
+00:56:18,130 --> 00:56:21,130
+care about.

+679
+00:56:21,780 --> 00:56:28,519
+OK, so LSTMs: once you have that
+LSTM cell code that I showed you

+680
+00:56:28,519 --> 00:56:35,530
+up there, you can unroll it in time twenty
+time steps. Let's say you wanted four

+681
+00:56:35,530 --> 00:56:37,680
+layers per time step instead of one:

+682
+00:56:37,679 --> 00:56:42,389
+well, you'd make a little bit of a
+change to your code, and you'd do that. Now you

+683
+00:56:42,389 --> 00:56:47,690
+have four layers of computation too, and one of
+the things you might want to do is run

+684
+00:56:47,690 --> 00:56:51,840
+each of those layers on a different GPU,
+so that's the change you'd make to your

+685
+00:56:51,840 --> 00:56:56,869
+TensorFlow code to do that. And that allows
+you to have a model like this: so this is

+686
+00:56:56,869 --> 00:57:01,289
+my sequence, these are the different deep
+LSTM layers I have per time step,

+687
+00:57:01,289 --> 00:57:08,190
+and after the first little bit I can
+start getting more and more GPUs

+688
+00:57:08,190 --> 00:57:10,349
+involved in the process,

+689
+00:57:10,349 --> 00:57:15,579
+and you essentially pipeline the entire
+thing. There's a giant softmax at the

+690
+00:57:15,579 --> 00:57:19,710
+top you can split across GPUs
+pretty easily; you do that with model

+691
+00:57:19,710 --> 00:57:25,500
+parallelism, right? We've now got six GPUs
+in this picture; we actually split

+692
+00:57:25,500 --> 00:57:30,909
+that softmax across multiple GPUs, and
+so every replica would be eight GPU

+693
+00:57:30,909 --> 00:57:36,109
+cards on the same machine, all kind of
+humming along, and then you might use

+694
+00:57:36,110 --> 00:57:37,849
+data parallelism in addition to that

+695
+00:57:37,849 --> 00:57:45,989
+to train a bunch of eight-GPU-card replicas,
+to train quickly. We have this notion of queues,

+696
+00:57:45,989 --> 00:57:50,509
+where you can have part of your graph
+do a bunch of stuff and then stuff it in

+697
+00:57:50,510 --> 00:57:55,860
+a queue, and then later you have another bit
+of your graph that starts by de-

+698
+00:57:55,860 --> 00:58:00,789
+queueing things and then does more
+work with them.

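A toy version of that enqueue/dequeue pattern, in plain Python rather than TensorFlow's queue ops (names are illustrative): one thread pretends to fetch and decode inputs and enqueues them, while the consumer loop, standing in for the training step, dequeues them, so input preparation overlaps computation.

```python
import queue
import threading
import time

q = queue.Queue(maxsize=8)                 # bounded prefetch queue

def producer():
    for i in range(32):
        # pretend this is JPEG decoding, whitening, cropping, flipping...
        example = "decoded-image-%d" % i
        q.put(example)                     # enqueue
    q.put(None)                            # sentinel: no more data

threading.Thread(target=producer, daemon=True).start()

while True:
    batch = q.get()                        # dequeue on the "training" side
    if batch is None:
        break
    time.sleep(0.001)                      # pretend this is a training step
```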
+699
+00:58:00,789 --> 00:58:04,650
+So one example is: you might want to prefetch
+inputs, and then do the JPEG decoding to convert them

+700
+00:58:04,650 --> 00:58:09,240
+into sort of arrays, and maybe do some
+whitening and cropping and random

+701
+00:58:09,239 --> 00:58:16,149
+flipping and stuff like that, and then
+you can dequeue that on a different GPU

+702
+00:58:16,150 --> 00:58:22,769
+card or something. We also can group
+similar examples: for translation work we

+703
+00:58:22,769 --> 00:58:27,869
+actually bucket by length of sentence, so
+that your batch has a bunch of examples

+704
+00:58:27,869 --> 00:58:32,449
+that are all roughly the same sentence
+length, all 13-to-16-word sentences or

+705
+00:58:32,449 --> 00:58:37,539
+something. That just means we then need
+only execute exactly that many unrolled

+706
+00:58:37,539 --> 00:58:42,210
+steps, rather than, you know, some
+arbitrary max sentence length. And for

+707
+00:58:42,210 --> 00:58:46,099
+randomization, a shuffling queue just
+holds a whole bunch of

+708
+00:58:46,099 --> 00:58:49,099
+examples and then gives you random ones out.

+709
+00:58:55,130 --> 00:59:02,269
+Data parallelism, right: so again, we want
+to be able to have many replicas of this

+710
+00:59:02,269 --> 00:59:09,309
+thing, and so you make a modest amount of
+changes to your code. We're not quite as happy

+711
+00:59:09,309 --> 00:59:13,769
+with this amount of change, but this is
+kind of what you do: there's a supervisor

+712
+00:59:13,769 --> 00:59:19,429
+that handles a bunch of things; you now say
+there's a parameter device, and you prepare the

+713
+00:59:19,429 --> 00:59:25,509
+session, and then each one of these
+runs a local loop, and you keep

+714
+00:59:25,510 --> 00:59:28,000
+track of how many steps have been
+applied globally across all the

+715
+00:59:28,000 --> 00:59:32,500
+different replicas, and as soon as the
+cumulative sum of all those is big

+716
+00:59:32,500 --> 00:59:38,829
+enough, you stop. Asynchronous training looks
+kind of like that: three separate client

+717
+00:59:38,829 --> 00:59:43,929
+threads driving three separate replicas,
+all with shared parameters. So one of the big

+718
+00:59:43,929 --> 00:59:47,119
+simplifications from DistBelief to TensorFlow
+is we don't have a separate

+719
+00:59:47,119 --> 00:59:54,359
+parameter-server notion: we have tensors,
+and variables (variables that contain

+720
+00:59:54,360 --> 00:59:59,590
+tensors), and they're just other parts of
+the graph, and typically you map them

+721
+00:59:59,590 --> 01:00:04,250
+onto a small set of devices that are
+going to hold your parameters. But it's all

+722
+01:00:04,250 --> 01:00:07,269
+kind of unified in the same framework:
+whether I'm sending a tensor that contains

+723
+01:00:07,269 --> 01:00:12,829
+parameters, or activations, or whatever,
+doesn't matter. This is kind of the

+724
+01:00:12,829 --> 01:00:16,750
+synchronous version: you have one client, and
+I just split my batch across three

+725
+01:00:16,750 --> 01:00:22,989
+replicas, add the gradients, and apply
+them. Models turn out to be pretty

+726
+01:00:22,989 --> 01:00:31,239
+tolerant of reduced precision, so you can
+convert to FP16. There's actually an IEEE

+727
+01:00:31,239 --> 01:00:36,869
+standard for 16-bit floating
+point now. Most CPUs and GPUs don't quite

+728
+01:00:36,869 --> 01:00:42,719
+support that yet, so we implemented our
+own sixteen-bit format, which is

+729
+01:00:42,719 --> 01:00:45,719
+essentially: we take a 32-bit float and
+lop off two bytes of the mantissa.

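That truncation scheme is essentially what is now known as bfloat16: keep the top two bytes of an IEEE float32 (the sign bit, the 8 exponent bits, and 7 mantissa bits) and restore by zero-filling the rest. A minimal NumPy sketch of the idea (function names are mine; this is plain truncation, with no stochastic rounding, as the talk notes):

```python
import numpy as np

def compress(x32):
    bits = np.asarray(x32, dtype=np.float32).view(np.uint32)
    return (bits >> 16).astype(np.uint16)   # lop off two bytes of mantissa

def decompress(x16):
    bits = x16.astype(np.uint32) << 16      # fill the dropped bytes with zeros
    return bits.view(np.float32)

x = np.array([3.14159, -0.001, 42.0], dtype=np.float32)
print(decompress(compress(x)))              # about 2-3 significant digits survive
```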
+730
+01:00:47,429 --> 01:00:55,889
+And you should in principle do stochastic
+rounding, but we don't, so it's sort of OK; it's

+731
+01:00:55,889 --> 01:01:01,389
+just lossy, and you convert it back to
+32 bits on the other side by filling in

+732
+01:01:01,389 --> 01:01:15,098
+zeros for the missing bits. It's very simple
+and very network-friendly. So, model and data

+733
+01:01:15,099 --> 01:01:19,500
+parallelism used in conjunction really
+let you train models quickly, and

+734
+01:01:19,500 --> 01:01:24,639
+that's what this is all really about:
+being able to take a research idea, try

+735
+01:01:24,639 --> 01:01:28,250
+it out on a large dataset that is
+representative of a problem you care

+736
+01:01:28,250 --> 01:01:29,000
+about,

+737
+01:01:29,000 --> 01:01:34,199
+figure out whether it worked, and figure out the
+next set of experiments. It's pretty easy

+738
+01:01:34,199 --> 01:01:38,039
+to express in TensorFlow; the part we're
+not too happy with is the boilerplate for

+739
+01:01:38,039 --> 01:01:44,889
+asynchronous parallelism, but in general
+it's not too bad. We have open-sourced

+740
+01:01:44,889 --> 01:01:49,480
+TensorFlow because we think that'll
+make it easier to share research ideas,

+741
+01:01:49,480 --> 01:01:56,338
+and we think, you know, having lots of
+people using the system outside of

+742
+01:01:56,338 --> 01:01:59,849
+Google is a good thing, to
+improve it and bring in ideas that we don't

+743
+01:01:59,849 --> 01:02:05,200
+necessarily have. It makes it pretty easy
+to deploy machine learning systems into

+744
+01:02:05,199 --> 01:02:09,298
+real products, because you can go from
+a research idea to something running

+745
+01:02:09,298 --> 01:02:13,059
+on a phone relatively easily. The
+community of TensorFlow users outside

+746
+01:02:13,059 --> 01:02:16,609
+Google is growing, which is nice, and
+they're doing all kinds of cool things. I

+747
+01:02:16,608 --> 01:02:21,130
+picked a few random examples of things
+people have done that are posted on Git-

+748
+01:02:21,130 --> 01:02:28,769
+Hub. This is one: Andrej has this
+ConvNetJS thing that runs in

+749
+01:02:28,769 --> 01:02:32,920
+your browser using JavaScript, and one of
+the things he has is a little game where he's

+750
+01:02:32,920 --> 01:02:38,798
+doing reinforcement learning: the yellow
+dot learns to eat

+751
+01:02:38,798 --> 01:02:42,769
+the green dots and to avoid the red
+dots. So someone reimplemented that in

+752
+01:02:42,769 --> 01:02:47,059
+TensorFlow and actually added orange
+dots that are really bad.

+753
+01:02:50,650 --> 01:02:54,550
+Someone implemented this really nice
+paper from the University of Tübingen and the

+754
+01:02:54,550 --> 01:02:59,590
+Max Planck Institute (you've probably seen
+this work) where you take a picture and,

+755
+01:02:59,590 --> 01:03:05,269
+typically, a painting, and it then renders
+that picture in the style of that painting,

+756
+01:03:05,269 --> 01:03:14,820
+and you end up with cool stuff like
+that. And, you know, there's a Keras

+757
+01:03:14,820 --> 01:03:19,550
+model here; that's the popular sort
+of higher-level library to make it

+758
+01:03:19,550 --> 01:03:25,640
+easier to express neural nets. Someone
+implemented the neural captioning model

+759
+01:03:25,639 --> 01:03:31,099
+in TensorFlow. There's an effort
+underway to translate the docs into Mandarin.

+760
+01:03:31,099 --> 01:03:39,349
+Cool, great. The last thing I'll talk about
+is the Brain residency program: we've

+761 +01:03:39,349 --> 01:03:44,349 +started this program a bit of an +experiment this year and so this is more + +762 +01:03:44,349 --> 01:03:47,769 +as an FYI for next year cause or +applications are closed rectory + +763 +01:03:47,769 --> 01:03:53,420 +selecting our final candidates this week +and then the idea is the people will + +764 +01:03:53,420 --> 01:03:57,789 +spend a year in our group doing deep +learning research and the hope is + +765 +01:03:57,789 --> 01:04:02,750 +they'll come out and have published a +couple of papers on archiver submitted + +766 +01:04:02,750 --> 01:04:08,039 +to Companies is and learn a lot about +doing sort of interesting machine + +767 +01:04:08,039 --> 01:04:16,170 +learning research and now we're looking +for people for next year obviously about + +768 +01:04:16,170 --> 01:04:24,670 +our strong in you know anyone taking the +class will reopen applications in the + +769 +01:04:24,670 --> 01:04:25,990 +fall + +770 +01:04:25,989 --> 01:04:34,439 +graduating like next year opportunity +there you go there's a bunch more + +771 +01:04:34,440 --> 01:04:36,909 +reading there + +772 +01:04:36,909 --> 01:04:42,949 +start your cuz I did a lot of work in +the white paper to make the whole set of + +773 +01:04:42,949 --> 01:04:52,169 +references clickable and then click your +way through 250 other figures ok so I + +774 +01:04:52,170 --> 01:04:53,820 +have been done early + +775 +01:04:53,820 --> 01:04:56,820 +hundred and sixty-five + +776 +01:05:02,730 --> 01:05:31,599 +yes so those kind of things are actually +tricky and we have an actually a pretty + +777 +01:05:31,599 --> 01:05:37,329 +extensive detailed process for things +that are you know talking about you're + +778 +01:05:37,329 --> 01:05:43,119 +using a user's private data for these +kinds of things so smart reply + +779 +01:05:43,119 --> 01:05:47,559 +essentially all the replies that word +that it ever will generate are things + +780 +01:05:47,559 --> 01:05:52,710 +that have been said by thousands of +users so the input to the model for + +781 +01:05:52,710 --> 01:05:57,380 +training is an email which is typically +not about how the people at but the only + +782 +01:05:57,380 --> 01:06:02,480 +things will ever suggest are things that +are generated in response by you know + +783 +01:06:02,480 --> 01:06:07,670 +suspicion number of unique users to +protect the privacy of users that put + +784 +01:06:07,670 --> 01:06:10,710 +kind of things you're thinking about +when designing products like cotton and + +785 +01:06:10,710 --> 01:06:16,400 +is actually a lot of Karen thought going +into you know we think this will be a + +786 +01:06:16,400 --> 01:06:22,119 +great feature but how can we do this in +a way that ensures the people's privacy + +787 +01:06:22,119 --> 01:06:25,119 +is protected + +788 +01:06:52,670 --> 01:07:30,108 +as much as we probably should have +assured it's just kind of been one of + +789 +01:07:30,108 --> 01:07:32,548 +the things on the back burner compared +to all the other things we've been + +790 +01:07:32,548 --> 01:07:37,679 +working on I do think the notion of +specialists so I didn't talk about that + +791 +01:07:37,679 --> 01:07:42,489 +at all but essentially we had a model +that was sort of arbitrary image that + +792 +01:07:42,489 --> 01:07:46,868 +classification model like JFT which is +like seventeen thousand losses or + +793 +01:07:46,869 --> 01:07:51,220 +something it's an internal data that we +trained a good general model that could + +794 +01:07:51,219 --> 01:07:57,539 
+deal with all of those classes, and then we

+795
+01:07:57,539 --> 01:08:01,719
+found interesting confusable
+classes algorithmically, like

+796
+01:08:01,719 --> 01:08:06,539
+all the kinds of mushrooms in the world,
+and we then trained specialists on data

+797
+01:08:06,539 --> 01:08:11,909
+sets that were enriched: mostly
+mushroom data, primarily, with an

+798
+01:08:11,909 --> 01:08:16,179
+occasional random image. And we could
+train fifty such models that were each good

+799
+01:08:16,179 --> 01:08:24,440
+at different kinds of things and get
+pretty significant accuracy increases. At

+800
+01:08:24,439 --> 01:08:27,588
+the time we were able to distill them
+into a single model pretty well; we

+801
+01:08:27,588 --> 01:08:31,899
+haven't really pursued that too much.
+It turned out just the mechanics of

+802
+01:08:31,899 --> 01:08:34,899
+training fifty separate models and then
+distilling them was a bit unwieldy.

+803
+01:08:38,170 --> 01:09:20,630
+...for exploration and further research?
+As you say, this clearly demonstrates

+804
+01:09:20,630 --> 01:09:25,920
+that, I mean, it's a different
+objective we're telling the model to

+805
+01:09:25,920 --> 01:09:31,048
+optimize, right? We're telling it: use this
+hard label, or use this hard label but also

+806
+01:09:31,048 --> 01:09:36,189
+use this incredibly rich gradient which
+says, like, here are a hundred other signals

+807
+01:09:36,189 --> 01:09:41,379
+of information. So in some sense it's an unfair
+comparison, right? You're telling it a lot

+808
+01:09:41,380 --> 01:09:46,829
+more stuff about every example in one
+case. So sometimes it's not so much an

+809
+01:09:46,829 --> 01:09:49,119
+unfair comparison; it's that maybe we should be

+810
+01:09:49,119 --> 01:09:53,960
+figuring out how to feed richer
+signals than just a single binary label

+811
+01:09:53,960 --> 01:09:59,569
+to our models. I think that's probably an
+interesting area to pursue. We've thought

+812
+01:09:59,569 --> 01:10:05,349
+about ideas of having a big ensemble of
+models all training collectively and

+813
+01:10:05,350 --> 01:10:08,449
+sort of exchanging information in the
+form of their predictions rather

+814
+01:10:08,448 --> 01:10:12,779
+than their parameters, as that might be a
+much cheaper, more network-friendly way

+815
+01:10:12,779 --> 01:10:19,099
+of collaboratively training on a
+really big dataset: you train on 1% of the

+816
+01:10:19,100 --> 01:10:22,100
+data or something and swap predictions.

+817
+01:10:39,729 --> 01:10:49,779
+Yeah, I mean, I think all these kinds of
+ideas are worth pursuing. The captioning

+818
+01:10:49,779 --> 01:10:55,039
+work is interesting, but you
+tend to have many fewer labels with

+819
+01:10:55,039 --> 01:11:02,550
+captions than we have images with sort
+of hard labels like "cheetah" or "jaguar", at

+820
+01:11:02,550 --> 01:11:06,810
+least ones that are prepared in a clean way.
+Actually, I'm aware there are a lot

+821
+01:11:06,810 --> 01:11:11,539
+of images with sentences written about
+them; the trick is identifying which

+822
+01:11:11,539 --> 01:11:26,430
+sentence is about which image. For some
+problems, you know, you don't need to

+823
+01:11:26,430 --> 01:11:29,510
+really train online. Like, speech
+recognition is a good example: it's not

+824
+01:11:29,510 --> 01:11:35,670
+like human vocal cords change that often.
+The words you say change a little bit, so

+825
+01:11:35,670 --> 01:11:38,670
+word distributions tend to be not
+very stationary.

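The "hard label plus rich gradient" objective described above is the distillation setup: the student matches both the true label and the teacher's temperature-softened class probabilities. A toy version of the combined loss (names and numbers are illustrative; the T^2 factor keeps the soft-target gradients on the same scale as the hard-label ones):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max()                # for numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = np.array([5.0, 2.0, 0.1, -1.0])
student_logits = np.array([2.0, 1.0, 0.5, 0.0])
T = 4.0                            # temperature softens the teacher's targets
hard_label = 0

soft_targets = softmax(teacher_logits, T)    # the "hundred other signals"
student_soft = softmax(student_logits, T)

hard_loss = -np.log(softmax(student_logits)[hard_label])
soft_loss = -np.sum(soft_targets * np.log(student_soft))
loss = hard_loss + (T ** 2) * soft_loss
print(loss)
```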
+ +826 +01:11:39,640 --> 01:11:45,460 +like the words everyone collectively +says tomorrow are pretty similar to ones + +827 +01:11:45,460 --> 01:11:50,640 +they say today but subtly different like +Long Island Chocolate Festival might + +828 +01:11:50,640 --> 01:11:55,220 +suddenly become more and more prominent +over the next two weeks and those kinds + +829 +01:11:55,220 --> 01:11:58,930 +of things you know you need to be +cognizant of the fact that you want to + +830 +01:11:58,930 --> 01:12:03,079 +capture those kinds of effects and one +of the ways to do it is to train your + +831 +01:12:03,079 --> 01:12:07,380 +model and a minor sometime he doesn't +need to be so online but you like + +832 +01:12:07,380 --> 01:12:10,770 +getting an example and immediately +update your model but you know the + +833 +01:12:10,770 --> 01:12:16,180 +pentium problem every five minutes or +ten minutes or hour or day is sufficient + +834 +01:12:16,180 --> 01:12:23,940 +for most problems but it is pretty +important to do that for non-stationary + +835 +01:12:23,939 --> 01:12:28,949 +problems like ads or search queries or +things that change over time like that + +836 +01:12:28,949 --> 01:12:33,738 +right + +837 +01:12:33,738 --> 01:12:42,428 +the third most important I can't say yes + +838 +01:12:45,819 --> 01:12:57,170 +yeah I mean noise in training datasets +actually happens all the time greats + +839 +01:12:57,170 --> 01:13:01,340 +like even if you look at the image that +examples occasionally you'll come across + +840 +01:13:01,340 --> 01:13:02,328 +one in your life + +841 +01:13:02,328 --> 01:13:06,670 +actually I was just sitting in a meeting +with some people who are working on + +842 +01:13:06,670 --> 01:13:10,929 +visualization techniques and one of the +things that were visualizing was see far + +843 +01:13:10,929 --> 01:13:14,779 +input data and they had this kind of +core presentation of all the C four + +844 +01:13:14,779 --> 01:13:18,920 +examples all mapped onto like +four-by-four pixels each one month on + +845 +01:13:18,920 --> 01:13:22,819 +their screen for sixty thousand images +and Mike you could kind of pick things + +846 +01:13:22,819 --> 01:13:28,219 +out and select them toward and here's +one that liked the model predicted with + +847 +01:13:28,219 --> 01:13:33,948 +high confidence but it got wrong and it +said her plane as the model that + +848 +01:13:33,948 --> 01:13:40,518 +airplane and you look at the image and +it's an airplane and the label is not + +849 +01:13:40,519 --> 01:13:49,690 +heavily you like I understand why I +gotta run so it's you know you want to + +850 +01:13:49,689 --> 01:13:53,288 +make sure your dataset is as clean as +possible cuz training and noisy data is + +851 +01:13:53,288 --> 01:13:56,488 +generally not as good as + +852 +01:13:56,488 --> 01:14:00,819 +cleaned it out but on the other hand +expending too much effort to clean that + +853 +01:14:00,819 --> 01:14:06,969 +it is often more more effort than its +worth to kind of do some filtering kinds + +854 +01:14:06,969 --> 01:14:12,788 +of things you don't throw out the +obvious bad stuff and generally more + +855 +01:14:12,788 --> 01:14:15,788 +noisy data is often better than less +clean it up + +856 +01:14:18,739 --> 01:14:28,649 +depends on the problem but only about +one thing to try and then if you're + +857 +01:14:28,649 --> 01:14:34,159 +unhappy with the result then investigate +why the question + +858 +01:14:34,159 --> 01:14:39,210 +okay thank you + diff --git a/captions/En/Lecture1_en.srt 
b/captions/En/Lecture1_en.srt new file mode 100644 index 00000000..0802c5d6 --- /dev/null +++ b/captions/En/Lecture1_en.srt @@ -0,0 +1,3523 @@ +1 +00:00:00,000 --> 00:00:03,899 +there's more seats on the side + +2 +00:00:03,899 --> 00:00:19,868 +people are walking in late so just to +make sure you're in cs2 31 and deep + +3 +00:00:19,868 --> 00:00:23,969 +learning on your network class for +visual recognition any money in the + +4 +00:00:23,969 --> 00:00:33,549 +wrong class good I so welcome and happy +new year happy first day of winter break + +5 +00:00:33,549 --> 00:00:41,069 +so the classiest 231 and this is the +second offering of this class when we + +6 +00:00:41,070 --> 00:00:48,738 +have literally doubled our enrollment +from 480 people last time we offered to + +7 +00:00:48,738 --> 00:00:55,939 +about 350 of you sign up a couple of +words to to to make us all legally + +8 +00:00:55,939 --> 00:01:02,570 +covered way our video recording this +class so you know if you're + +9 +00:01:02,570 --> 00:01:10,680 +uncomfortable about this for today just +go behind that camera or go to the + +10 +00:01:10,680 --> 00:01:18,280 +cameras litter but we are going to send +out forms for you to fill out in terms + +11 +00:01:18,280 --> 00:01:25,228 +of allowing video recording so that's +that's just one bit of housekeeping so + +12 +00:01:25,228 --> 00:01:32,200 +alright when they fail him a professor +at the computer science department so + +13 +00:01:32,200 --> 00:01:37,960 +this class and co-teaching with through +senior graduate students and one of them + +14 +00:01:37,959 --> 00:01:45,839 +is areas under a profit greatly we have +what I don't think Andre needs too much + +15 +00:01:45,840 --> 00:01:48,659 +introduction all of you know his work + +16 +00:01:48,659 --> 00:01:53,960 +follow his blog his Twitter follower + +17 +00:01:53,959 --> 00:02:02,509 +under has way more followers than I do +very popular Justin Johnson is still + +18 +00:02:02,510 --> 00:02:08,200 +traveling internationally but will be +back in a few days so andre land just so + +19 +00:02:08,199 --> 00:02:14,509 +we'll be picking up the bulk of the +lecture teaching today I'll be giving + +20 +00:02:14,509 --> 00:02:20,039 +her structure but as you probably can +see that I'm expecting the newborn ratio + +21 +00:02:20,039 --> 00:02:28,239 +speaking of weeks so you'll see more of +undrained Justin in lecture time we will + +22 +00:02:28,239 --> 00:02:34,189 +also introduce a whole team of TACE +towards the end of this lecture again + +23 +00:02:34,189 --> 00:02:38,959 +people who are looking for seats you go +out of that before and come back there's + +24 +00:02:38,959 --> 00:02:47,039 +a whole bunch of seats on the side so so +this for this lecture we're going to + +25 +00:02:47,039 --> 00:02:53,519 +give the introduction of the class what +kind of problems we work on and the + +26 +00:02:53,519 --> 00:03:03,530 +tools will be learning so again welcome +to see us 231 and this is a vision in + +27 +00:03:03,530 --> 00:03:09,140 +class it's based on a very specific +modeling architecture called your + +28 +00:03:09,139 --> 00:03:16,000 +network and the more specifically mostly +a convolution on your network and a lot + +29 +00:03:16,000 --> 00:03:23,799 +of you hear this term maybe through a +popular Press article we we or or or + +30 +00:03:23,799 --> 00:03:34,239 +coverage we've had to call this the deep +learning network growing field of + +31 +00:03:34,239 --> 00:03:40,920 +artificial intelligence in fact the +school has 
estimated and and we are + +32 +00:03:40,919 --> 00:03:50,018 +underway for of this by 2016 which we +already have arrived more than 85% of + +33 +00:03:50,019 --> 00:03:56,230 +the internet cyberspace data is in the +form of pixels + +34 +00:03:56,229 --> 00:04:05,329 +or what they call multimedia so so we +basically have entered an age of vision + +35 +00:04:05,330 --> 00:04:12,530 +of images and video calls why why is +this so while partly to a large extent + +36 +00:04:12,530 --> 00:04:20,858 +is speakers of the explosion of both the +Internet as a carrier of data as well as + +37 +00:04:20,858 --> 00:04:25,930 +answers we have more sensors that the +number of people on Thurs these days + +38 +00:04:25,930 --> 00:04:32,000 +every one of you is carrying some kind +of smart phones digital cameras and and + +39 +00:04:32,000 --> 00:04:37,879 +and and you know cars around on the +street with cameras so so the soldiers + +40 +00:04:37,879 --> 00:04:46,500 +have really enabled the explosion of +visual data on the internet but visual + +41 +00:04:46,500 --> 00:04:55,209 +data or pixel data is also the hardest +data to harness so if you have heard my + +42 +00:04:55,209 --> 00:05:07,810 +previous talks and and some other parks +by the dark matter of the Internet + +43 +00:05:07,810 --> 00:05:13,879 +why is this the dark matter just like +the universe is closest to 85% dark + +44 +00:05:13,879 --> 00:05:19,409 +matter dark energy is these matters that +energy that is very hard to observe week + +45 +00:05:19,410 --> 00:05:25,919 +and weekend by mathematical models in +the universe internet these are the + +46 +00:05:25,918 --> 00:05:30,649 +matters pixel data the other data that +we don't know we have a hard time + +47 +00:05:30,649 --> 00:05:36,239 +grasping the continent's here's one very +very simple suspects for you to consider + +48 +00:05:36,240 --> 00:05:39,090 +so today + +49 +00:05:39,089 --> 00:05:49,560 +YouTube servers every 60 seconds we'll +have more than $150 of videos uploaded + +50 +00:05:49,560 --> 00:05:54,089 +onto YouTube servers for every 60 +seconds + +51 +00:05:54,089 --> 00:06:02,739 +think about the amount of data there is +no way that human eyes can sift through + +52 +00:06:02,740 --> 00:06:07,829 +this massive amount of data and make it +a lot Asians + +53 +00:06:07,829 --> 00:06:14,009 +labeling it and and and and described +the contacts soul singer from the + +54 +00:06:14,009 --> 00:06:20,980 +perspective of the YouTube team or or +Google company if they want to help us + +55 +00:06:20,980 --> 00:06:25,640 +to search index managed and of course +for their purpose + +56 +00:06:25,639 --> 00:06:31,529 +advertisement or or whatever manipulate +the content of the data were at a loss + +57 +00:06:31,529 --> 00:06:38,919 +because nobody can take this the only +hope we can do this is true vision + +58 +00:06:38,920 --> 00:06:44,640 +technology to be able to label the +objects financings vinyl frames + +59 +00:06:44,639 --> 00:06:50,349 +you know lo que where basketball video +were Kobe Bryant's making like that + +60 +00:06:50,350 --> 00:06:57,320 +awesome shot and social these are the +problems we are facing today that the + +61 +00:06:57,319 --> 00:07:02,860 +massive amount of data and the the +challenges of the dark matter so + +62 +00:07:02,860 --> 00:07:07,379 +comfortable vision as a field that +touches upon many other fields of + +63 +00:07:07,379 --> 00:07:12,740 +studies so I am sure that even City +heater sitting here + +64 +00:07:12,740 --> 00:07:18,050 
+mania vehicle from computer size but +many of you come from biology psychology + +65 +00:07:18,050 --> 00:07:24,389 +are specializing in natural language +processing or graphics for robotics or + +66 +00:07:24,389 --> 00:07:30,680 +or you know medical imaging and so on so +I love field computer vision is really a + +67 +00:07:30,680 --> 00:07:37,329 +truly interdisciplinary field what the +problems we work on the models we use + +68 +00:07:37,329 --> 00:07:43,849 +such as engineering physics biology +psychology compare size of mathematics + +69 +00:07:43,850 --> 00:07:51,030 +so just a little bit of a more personal +touch I am the director of the division + +70 +00:07:51,029 --> 00:07:58,589 +lab stuff in our lab will I work with +graduate students and postdocs and even + +71 +00:07:58,589 --> 00:08:04,669 +under ladders students on the number of +topics and most dear to our own research + +72 +00:08:04,670 --> 00:08:10,540 +who some of them you know the great just +come from my lab + +73 +00:08:10,540 --> 00:08:17,780 +number of two years come from my lab we +work on machine learning which is part + +74 +00:08:17,779 --> 00:08:26,109 +of a superset of deep learning we work a +lot of science and neuroscience as well + +75 +00:08:26,110 --> 00:08:31,270 +as the intersection between an LPN +speech so that's that's the kind of + +76 +00:08:31,269 --> 00:08:40,399 +landscape of computer vision research +that my lab works so also to put things + +77 +00:08:40,399 --> 00:08:45,600 +in a little more perspective what other +computer vision classes now we offer + +78 +00:08:45,600 --> 00:08:51,050 +here at stuff or through the computer +science department are clearly you're in + +79 +00:08:51,049 --> 00:08:59,629 +this class es 21 so you some of you who +have never taken computer vision + +80 +00:08:59,629 --> 00:09:06,220 +probably heard of commuters and for the +first time probably should have already + +81 +00:09:06,220 --> 00:09:14,730 +done vs 131 that's a cool class of +previous quarter we offer and then and + +82 +00:09:14,730 --> 00:09:19,779 +then next quarter which normally is +offer this quarter but this year as a + +83 +00:09:19,779 --> 00:09:25,069 +little shifted there is an important +graduate-level computer vision class + +84 +00:09:25,070 --> 00:09:31,840 +'cause cs2 30180 offered by professors +so he'll suffer si who works in robotic + +85 +00:09:31,840 --> 00:09:47,230 +3d vision and a lot of you ask the +question that this classiest 231 versus + +86 +00:09:47,230 --> 00:09:56,639 +the S two thirty 18 and the other is +know if you're interested in a broader + +87 +00:09:56,639 --> 00:10:03,220 +coverage of tools and topics of computer +vision as well as some of the + +88 +00:10:03,220 --> 00:10:11,009 +fundamental fundamental topics that come +that relay 223 division robotic vision + +89 +00:10:11,009 --> 00:10:17,269 +and visual recognition you should +consider taking in 23188 that is the + +90 +00:10:17,269 --> 00:10:26,039 +more general class 231 end which will go +into starting today more deeply focuses + +91 +00:10:26,039 --> 00:10:33,329 +on a specific Ando of both problem and +model model is your network and the + +92 +00:10:33,330 --> 00:10:38,580 +undertow is visual recognition mostly +but of course they have a little bit of + +93 +00:10:38,580 --> 00:10:47,990 +overlap but that's the major difference +next next quarter we also have possibly + +94 +00:10:47,990 --> 00:10:55,590 +a couple of a couple of advanced seminar +level class but that's still in the + +95 
+00:10:55,590 --> 00:11:01,649 +formations so you just have to check the +syllabus so that's the kind of curcumin + +96 +00:11:01,649 --> 00:11:11,409 +division curricula we offer this year at +Stanford in question so far yes + +97 +00:11:11,409 --> 00:11:20,879 +131 is not a strict requirement for this +class but you also see that if you've + +98 +00:11:20,879 --> 00:11:25,570 +never heard of computer vision for the +first time I suggest you find a way to + +99 +00:11:25,570 --> 00:11:33,830 +catch up because this class has shrooms +a basic level of understanding of of + +100 +00:11:33,830 --> 00:11:42,560 +computer vision you can browse the notes +and so on + +101 +00:11:42,559 --> 00:11:49,619 +today is that I will give a very brief +broad stroke history of computer vision + +102 +00:11:49,620 --> 00:11:55,519 +and then we'll talk about 231 and a +little bit in terms of the organization + +103 +00:11:55,519 --> 00:12:01,409 +of the class they really care about in +with you this brief history of computer + +104 +00:12:01,409 --> 00:12:07,480 +vision because you know you might be +here primarily because of your interest + +105 +00:12:07,480 --> 00:12:11,990 +in this really interesting tool called +deeply and this is the purpose of this + +106 +00:12:11,990 --> 00:12:16,370 +class will offer you an in-depth look +and then + +107 +00:12:16,370 --> 00:12:22,470 +and just journey through the of the what +this deeply model is but without + +108 +00:12:22,470 --> 00:12:28,050 +understanding the problem domain without +thinking deeply about what this problem + +109 +00:12:28,049 --> 00:12:37,849 +is it's very hard for you to to go out +to be an inventor of the next model that + +110 +00:12:37,850 --> 00:12:43,320 +really solve the big problems vision or +to be you know developing developing + +111 +00:12:43,320 --> 00:12:52,379 +making impactful work in solving a heart +problem and also in general problem + +112 +00:12:52,379 --> 00:12:58,860 +domain and model the modeling tools +themselves are never never fully + +113 +00:12:58,860 --> 00:13:00,129 +decoupled + +114 +00:13:00,129 --> 00:13:05,360 +inform each other and you'll see through +the history of deep learning a little + +115 +00:13:05,360 --> 00:13:13,000 +bit that the coalition on your network +architecture come from the meat to solve + +116 +00:13:13,000 --> 00:13:15,289 +a vision problem + +117 +00:13:15,289 --> 00:13:23,449 +vision problem helps the the planning +algorithm to evolve and I'm back and + +118 +00:13:23,450 --> 00:13:29,350 +forth so is really important to to you +know I want you to finish this course I + +119 +00:13:29,350 --> 00:13:34,300 +feel proud that you're still enough +vision and of deep learning so you you + +120 +00:13:34,299 --> 00:13:39,528 +have this bullshit all set and the +in-depth understanding of how to use the + +121 +00:13:39,528 --> 00:13:46,750 +tools to to to to tackle important +problems so it's a brief history but + +122 +00:13:46,750 --> 00:13:54,149 +does it mean so short history so we're +gonna go all the way back to 200 540 + +123 +00:13:54,149 --> 00:14:00,110 +million years ago so why why did I +picked this you know on the other scale + +124 +00:14:00,110 --> 00:14:09,240 +of Earth history this is a fairly +specific range of years while so I don't + +125 +00:14:09,240 --> 00:14:14,049 +know if you have heard of this but this +is a very very curious period of the + +126 +00:14:14,049 --> 00:14:23,539 +Earth's history biologists call this the +big bag of evolution before 503 + +127 
+00:14:23,539 --> 00:14:27,679 +for 540 million years ago + +128 +00:14:27,679 --> 00:14:37,989 +a very peaceful of its pretty big pot of +water so we have very simple organisms + +129 +00:14:37,990 --> 00:14:46,049 +these are like animals that just floats +in the water and the way the eastern and + +130 +00:14:46,049 --> 00:14:53,838 +now a daily basis is you know the flow +to move some kind of food comes by near + +131 +00:14:53,839 --> 00:15:01,160 +their house or whatever they just open +their mouths grabbed it and we don't + +132 +00:15:01,159 --> 00:15:09,969 +have too many different types of animals +but really strange happened around 540 + +133 +00:15:09,970 --> 00:15:18,430 +million solely from the fossils we study +there's a huge explosive of species + +134 +00:15:18,429 --> 00:15:27,729 +biologist car speciation like suddenly +for some reason something that animal + +135 +00:15:27,730 --> 00:15:35,230 +start to diversify and they got a +complex the start 2022 you start a + +136 +00:15:35,230 --> 00:15:41,039 +predators and praising and they have all +kind of tools to to to survive what was + +137 +00:15:41,039 --> 00:15:46,698 +the triggering force of those was a huge +question because people received no did + +138 +00:15:46,698 --> 00:15:53,269 +you know another SAT whatever meteoroid +earth or or you know the environment + +139 +00:15:53,269 --> 00:16:00,198 +change they talk about one of the most +convincing theory is by this guy call + +140 +00:16:00,198 --> 00:16:03,159 +Andrew Parker + +141 +00:16:03,159 --> 00:16:09,490 +largest in Australia from Australia he +he studied a lot of fun + +142 +00:16:09,490 --> 00:16:19,278 +fossils and he's theory is that it was +the onset of the ice so one of the first + +143 +00:16:19,278 --> 00:16:25,688 +trial bites developed and I really +really simple I it's almost like a + +144 +00:16:25,688 --> 00:16:30,779 +pinhole camera that just catches light +and make some projections in + +145 +00:16:30,779 --> 00:16:34,750 +register some information from the +environment + +146 +00:16:34,750 --> 00:16:41,080 +suddenly life is no longer so medal +because once you have that I the first + +147 +00:16:41,080 --> 00:16:44,889 +thing you can do is you could go patch +for you actually know where food it's + +148 +00:16:44,889 --> 00:16:51,809 +not just like blind them floating in the +water and was you can go cat food + +149 +00:16:51,809 --> 00:16:57,399 +guess what the food best better develop +eyes and to run away from you otherwise + +150 +00:16:57,399 --> 00:17:02,590 +they'll be gone you know your your so +the first of all who had had eyes were + +151 +00:17:02,590 --> 00:17:11,380 +like in a limited both Google and so +just like has the best time you think + +152 +00:17:11,380 --> 00:17:18,170 +everything they can but because those +are all set of lies what we what the + +153 +00:17:18,170 --> 00:17:28,400 +college's realize is the biological arms +race begin every single animal needs to + +154 +00:17:28,400 --> 00:17:34,170 +needs to learn to develop things to +survive or to you know you you you + +155 +00:17:34,170 --> 00:17:40,190 +suddenly have praised and predators and +all this and the speciation so that's + +156 +00:17:40,190 --> 00:17:47,870 +one vision begun 540 million years and +not only religion begun visual was one + +157 +00:17:47,869 --> 00:17:53,189 +of the major driving force of the +speciation or that the big fan of + +158 +00:17:53,190 --> 00:17:58,980 +evolution or I so so we're not gonna +fall evolution for with too much 
detail + +159 +00:17:58,980 --> 00:18:08,710 +another big important work that the +engineering of vision happened around + +160 +00:18:08,710 --> 00:18:19,220 +the Renaissance and of course it's +attributed to this amazing guy so before + +161 +00:18:19,220 --> 00:18:23,740 +other songs you know throughout human +civilization from Asia to Europe to + +162 +00:18:23,740 --> 00:18:30,400 +India to Arabic world we have seen +models of cameras so Aristotle has + +163 +00:18:30,400 --> 00:18:36,360 +proposed the camera through the Leafs +Chinese philosopher moses have proposed + +164 +00:18:36,359 --> 00:18:40,939 +the camera through a box with the whole +but + +165 +00:18:40,940 --> 00:18:47,750 +if you look at the first documentation +really modern looking camera it's called + +166 +00:18:47,750 --> 00:18:49,180 +camera obscura + +167 +00:18:49,180 --> 00:18:56,610 +and that is documented by Leonardo da da +Vinci I'm not gonna get into the details + +168 +00:18:56,609 --> 00:19:07,240 +but this is you know you get the idea +that there is some kind of a whole to + +169 +00:19:07,240 --> 00:19:12,240 +capture light reflected from the real +world and then there is some kind of + +170 +00:19:12,240 --> 00:19:20,319 +protection to capture the information of +the of the of the real-world image so + +171 +00:19:20,319 --> 00:19:27,779 +that's the beginning of the modern you +know engineering + +172 +00:19:27,779 --> 00:19:36,170 +vision it started with one team to copy +the world and wanted to make a copy of + +173 +00:19:36,170 --> 00:19:42,350 +the visual world it hasn't gone anywhere +close to wanting to engineer the + +174 +00:19:42,349 --> 00:19:46,879 +understanding of the visual world right +now we're just talking about duplicating + +175 +00:19:46,880 --> 00:19:53,760 +the visual world so that's one important +work to remember and of course after + +176 +00:19:53,759 --> 00:20:01,299 +camera obscura although we we we start +to see a whole series of successful in + +177 +00:20:01,299 --> 00:20:07,539 +all some film gets developed you know +like kodak was one of the first + +178 +00:20:07,539 --> 00:20:12,329 +companies developing commercial cameras +and then we start to have camcorders and + +179 +00:20:12,329 --> 00:20:21,889 +and and all this very important +important piece of work that I want you + +180 +00:20:21,890 --> 00:20:28,050 +to be aware of as vision student is +absolutely nothing engineering work but + +181 +00:20:28,049 --> 00:20:32,710 +do you think science piece of science +work that's starting to ask the question + +182 +00:20:32,710 --> 00:20:38,130 +is how does Visual work in our +biological bring you know we + +183 +00:20:38,130 --> 00:20:45,760 +we now know that it took 540 million +years of evolution to get to really + +184 +00:20:45,759 --> 00:20:54,579 +fantastic visual system in humans but +what did evolution do during this time + +185 +00:20:54,579 --> 00:21:01,759 +what kind of architecture did develop +from that simple trilobite to today + +186 +00:21:01,759 --> 00:21:07,950 +yours and mine while very important +piece of work happened at Harvard lied + +187 +00:21:07,950 --> 00:21:12,690 +to that time too young to very young +ambitious pulls the occupant in the + +188 +00:21:12,690 --> 00:21:21,500 +vehicle what they did is that they used +awake but anaesthetized cats and then + +189 +00:21:21,500 --> 00:21:28,529 +there was not technology to build this +little needle electrode to push the + +190 +00:21:28,529 --> 00:21:35,129 +electrons through to the the the the 
+skull is open until the bringing of the + +191 +00:21:35,130 --> 00:21:42,180 +cut into an area what we already know +come primary visual cortex primary + +192 +00:21:42,180 --> 00:21:49,490 +visual cortex area do a lot of things +for for visual processing but before + +193 +00:21:49,490 --> 00:21:54,779 +visa we don't really know what primary +visual cortex is to be winter snow is + +194 +00:21:54,779 --> 00:22:02,369 +one of the earliest stage of the UI is +of course but earliest stage for visual + +195 +00:22:02,369 --> 00:22:07,299 +processing then there is tons and tons +of new orleans working on vision then we + +196 +00:22:07,299 --> 00:22:12,419 +really alter our to know what this is +because that's the beginning of vision + +197 +00:22:12,420 --> 00:22:20,300 +visual process in the bring so they they +put this electrode into the primary + +198 +00:22:20,299 --> 00:22:25,930 +visual cortex and an interesting this is +another interesting fact I don't drop my + +199 +00:22:25,930 --> 00:22:34,880 +stuff for sure you probably visual +cortex the first they come from being + +200 +00:22:34,880 --> 00:22:40,910 +very very rough prosecutor first aid of +your cortical visual processing stage is + +201 +00:22:40,910 --> 00:22:47,180 +in the back of your bring not near your +I know it's very interesting because + +202 +00:22:47,180 --> 00:22:51,788 +your own factory in cortical processing +is right + +203 +00:22:51,788 --> 00:22:58,519 +behind her nose your auditory is right +behind every year but your primary + +204 +00:22:58,519 --> 00:23:05,798 +visual cortex is the furthest from your +eye and another very interesting that in + +205 +00:23:05,798 --> 00:23:11,099 +fact not only the primary there's a huge +area working on vision almost 50% of + +206 +00:23:11,099 --> 00:23:17,888 +your brain is a love division this in +this the hardest and most important + +207 +00:23:17,888 --> 00:23:22,608 +sensory perceptual cognitive system in +the break and I'm not saying anything + +208 +00:23:22,608 --> 00:23:29,839 +else does is not useful clearly but it +take nature of this long to develop this + +209 +00:23:29,839 --> 00:23:37,579 +this sensory system and it takes the +troop this much realist a space to be + +210 +00:23:37,579 --> 00:23:43,148 +used for the system why because it's so +important and it's so damn hard that's + +211 +00:23:43,148 --> 00:23:50,959 +why we need to get back to human reason +they were really ambitious they wanna + +212 +00:23:50,960 --> 00:23:56,028 +know what primary visual cortex is doing +because this is the beginning of our + +213 +00:23:56,028 --> 00:24:02,878 +knowledge for deep learning neural +network cats social then put the cats in + +214 +00:24:02,878 --> 00:24:07,709 +this room and they were recording your +activities when I say recording your + +215 +00:24:07,710 --> 00:24:11,659 +activities fair trial there basically +trying to see you know if I put the + +216 +00:24:11,659 --> 00:24:18,059 +electrode here like to the new office to +the new house fire when they see + +217 +00:24:18,058 --> 00:24:25,308 +something so for example if they show if +they show cat their ideas if I showed + +218 +00:24:25,308 --> 00:24:30,519 +this kind of fish you know apparently at +that time comes to eat fish rather than + +219 +00:24:30,519 --> 00:24:42,019 +these beings with the cats no I like +yellow happy and spikes and here's a + +220 +00:24:42,019 --> 00:24:48,128 +story of scientific discovery is +scientific discovery takes both luck and + +221 +00:24:48,128 --> 
00:24:52,449 +care and thoughtfulness they were shown + +222 +00:24:52,450 --> 00:24:58,740 +whatever mouse flower it just doesn't +work the cats new are in the primary + +223 +00:24:58,740 --> 00:25:02,839 +visual cortex was silent there was no +spiking + +224 +00:25:02,839 --> 00:25:09,079 +very little spike in there were really +frustrated but the good news is that + +225 +00:25:09,079 --> 00:25:14,509 +there was no computer at that time so +what they have to do when they showed us + +226 +00:25:14,509 --> 00:25:21,740 +cats is they have to use a slight +protector so they put his foot a slide + +227 +00:25:21,740 --> 00:25:26,799 +of a fish and then wait till the new on +Spike if the new imposes bike they take + +228 +00:25:26,799 --> 00:25:29,960 +the slide out putting another slight + +229 +00:25:29,960 --> 00:25:38,630 +notice I would have liked this like you +know this film I don't you remember to + +230 +00:25:38,630 --> 00:25:46,890 +use glasser film whatever the Douro +spikes that's weird you know like the + +231 +00:25:46,890 --> 00:25:51,940 +actual mouse official flower didn't +drive the new excite the new role but + +232 +00:25:51,940 --> 00:25:59,759 +the the the movement of taking the slide +out nor could he has sliding did excite + +233 +00:25:59,759 --> 00:26:03,140 +the new I can be the catalyst think +you'll finally they're changing the new + +234 +00:26:03,140 --> 00:26:13,410 +you know new objects for me so it turned +out there is created by this life that + +235 +00:26:13,410 --> 00:26:18,240 +they're changing right slide the +whatever it's a square rectangular plate + +236 +00:26:18,240 --> 00:26:28,120 +and that moving edge or excited the +nuance so they're really taste the after + +237 +00:26:28,119 --> 00:26:34,859 +that observations you know if they were +too frustrated or to have missed that + +238 +00:26:34,859 --> 00:26:41,359 +but they're not the they're really taste +after that and realize new songs in the + +239 +00:26:41,359 --> 00:26:48,279 +primary visual cortex are organized in +columns and for every column of the new + +240 +00:26:48,279 --> 00:27:01,309 +Alice they'd like to see a specific +orientation of the of the bars rather + +241 +00:27:01,309 --> 00:27:02,980 +than the Fisher a mouse + +242 +00:27:02,980 --> 00:27:07,519 +you know I'm a bit of a simple story +because there are still numerous in + +243 +00:27:07,519 --> 00:27:10,940 +primary visual cortex we don't know what +they like they don't like simple + +244 +00:27:10,940 --> 00:27:17,570 +oriented but by large with a human +visitor found that the beginning of + +245 +00:27:17,569 --> 00:27:23,779 +visual processing is not a holistic fish +or malice the beginning of visual + +246 +00:27:23,779 --> 00:27:29,178 +processing is simple structures of the +world + +247 +00:27:29,179 --> 00:27:40,890 +oriented and this is a very deep deep +implications are signs as well as + +248 +00:27:40,890 --> 00:27:47,870 +engineering modeling it's later when we +visualize our dealer network features + +249 +00:27:47,869 --> 00:27:57,069 +will see that simple like structure in +emerging from our from our model and + +250 +00:27:57,069 --> 00:28:03,298 +even though the discovery was later +fifties and early sixties they won the + +251 +00:28:03,298 --> 00:28:12,039 +nobel medical price for this work in +1981 so that was another very important + +252 +00:28:12,039 --> 00:28:25,928 +piece of work related to vision and +visual processing so that's another + +253 +00:28:25,929 --> 00:28:35,620 +interesting story 
+the precursor of computer vision as a modern field was

254
+00:28:35,619 --> 00:28:42,779
+this particular dissertation by Larry Roberts in 1963, called Block World.

255
+00:28:42,779 --> 00:28:49,889
+Just as Hubel and Wiesel were discovering that the visual world in our

256
+00:28:49,890 --> 00:29:00,380
+brain is organized by simple edge-like structures, Larry Roberts, then a PhD

257
+00:29:00,380 --> 00:29:06,350
+student, was trying to extract these edge-like structures from

258
+00:29:06,349 --> 00:29:08,980
+images

259
+00:29:08,980 --> 00:29:16,210
+as a piece of engineering work. In this particular case his goal was this:

260
+00:29:16,210 --> 00:29:22,210
+you and I as humans can recognize blocks no matter how they are

261
+00:29:22,210 --> 00:29:28,009
+turned; we know these two are the same block even

262
+00:29:28,009 --> 00:29:33,019
+though the lighting has changed and the orientation has changed. His conjecture

263
+00:29:33,019 --> 00:29:40,720
+was that, just as Hubel and Wiesel taught us, it's the edges that define the

264
+00:29:40,720 --> 00:29:46,419
+structure, the edges that define the shape, and they don't change

265
+00:29:46,419 --> 00:29:53,290
+regardless of all these other variations. So Larry Roberts wrote a PhD dissertation to

266
+00:29:53,289 --> 00:29:59,250
+just extract these edges. If you work as a PhD student in computer

267
+00:29:59,250 --> 00:30:03,990
+vision today, this looks more like undergraduate computer vision than

268
+00:30:03,990 --> 00:30:10,210
+a PhD thesis, but it was the first precursor of a computer vision PhD

269
+00:30:10,210 --> 00:30:18,819
+thesis. Larry Roberts then changed interests; he gave up computer vision afterwards

270
+00:30:18,819 --> 00:30:27,189
+and went to DARPA, where he was one of the inventors of the Internet. He didn't do too badly by

271
+00:30:27,190 --> 00:30:34,490
+giving up computer vision. But we always like to say that the birth of computer

272
+00:30:34,490 --> 00:30:43,960
+vision as a modern field came in the summer of 1966. By the summer of 1966 the MIT

273
+00:30:43,960 --> 00:30:49,548
+Artificial Intelligence Lab had been established. Before that, actually, here is one

274
+00:30:49,548 --> 00:30:55,819
+piece of history Stanford students should feel proud of: there were two

275
+00:30:55,819 --> 00:31:02,579
+pioneering artificial intelligence labs established in the world in the early

276
+00:31:02,579 --> 00:31:10,329
+1960s, one by Marvin Minsky at MIT and one by John McCarthy at Stanford. At Stanford,

277
+00:31:10,329 --> 00:31:15,369
+the artificial intelligence lab was established before the computer science

278
+00:31:15,369 --> 00:31:21,479
+department, and Professor John McCarthy, who founded the AI lab, is the one who is

279
+00:31:21,480 --> 00:31:22,490
+responsible for

280
+00:31:22,490 --> 00:31:26,450
+the term 'artificial intelligence'. So that's a little bit of proud

281
+00:31:26,450 --> 00:31:31,720
+Stanford history. But anyway, we have to give MIT credit for starting the field of

282
+00:31:31,720 --> 00:31:41,380
+computer vision, because in the summer of 1966 a professor at MIT decided it was

283
+00:31:41,380 --> 00:31:46,630
+time to solve vision. The AI lab had been established, they would start to understand

284
+00:31:46,630 --> 00:31:55,010
+intelligence, and I think they figured that, anyway,

285
+00:31:55,009 --> 00:32:01,109
+vision is so easy: you open
+your eyes, you see the world, so how hard can it be? Let's solve it in one

286
+00:32:01,109 --> 00:32:04,109
+summer. So

287
+00:32:04,109 --> 00:32:18,729
+the Summer Vision Project was an attempt to use the summer workers

288
+00:32:18,730 --> 00:32:24,329
+effectively to construct a significant part of a visual system; that was the proposal. Maybe they didn't use their summer workers

289
+00:32:24,329 --> 00:32:30,490
+effectively, but in any case vision was not solved in that one summer.

290
+00:32:30,490 --> 00:32:35,740
+Since then, though, computer vision has become one of the fastest growing fields, and

291
+00:32:35,740 --> 00:32:43,679
+if you go to today's premier computer vision conferences, CVPR or ICCV, we

292
+00:32:43,679 --> 00:32:52,160
+have something like 2,000 to 2,500 researchers worldwide attending. And

293
+00:32:52,160 --> 00:33:00,620
+on a very practical note for students: if you are a good computer vision / machine

294
+00:33:00,619 --> 00:33:05,369
+learning student, you will not have to worry about jobs in Silicon Valley or

295
+00:33:05,369 --> 00:33:11,569
+anywhere else. So it's actually one of the most exciting fields. But that was the

296
+00:33:11,569 --> 00:33:19,210
+birth of computer vision, which means this year is the fiftieth anniversary of

297
+00:33:19,210 --> 00:33:25,829
+computer vision. That makes it a very exciting year; we have

298
+00:33:25,829 --> 00:33:28,529
+come a long, long way.

299
+00:33:28,529 --> 00:33:31,660
+OK, so continuing the history of computer vision:

300
+00:33:31,660 --> 00:33:38,169
+here is a person to remember, David Marr. He was also at MIT at that time,

301
+00:33:38,169 --> 00:33:50,240
+working with people like Shimon Ullman and Tomaso Poggio. Marr died

302
+00:33:50,240 --> 00:33:58,808
+young; in the seventies he wrote a very influential book called Vision, a book

303
+00:33:58,808 --> 00:34:08,148
+full of smart thinking about vision. He took a lot of insights from neuroscience:

304
+00:34:08,148 --> 00:34:14,868
+Hubel and Wiesel gave us the concept that visual processing

305
+00:34:14,869 --> 00:34:16,539
+starts with

306
+00:34:16,539 --> 00:34:23,259
+simple structures, not with a holistic fish or a holistic mouse. Marr

307
+00:34:23,260 --> 00:34:28,679
+gave us the next important insight, and these two insights together are the

308
+00:34:28,679 --> 00:34:35,740
+beginning of deep learning architectures: that vision is hierarchical.

309
+00:34:35,739 --> 00:34:44,029
+Hubel and Wiesel said we start simple, but the visual world is extremely

310
+00:34:44,030 --> 00:34:49,540
+complex. In fact, suppose I take a regular picture today with my iPhone;

311
+00:34:49,539 --> 00:34:58,309
+I don't know my iPhone's resolution, but suppose it's around ten megapixels. The

312
+00:34:58,309 --> 00:35:05,059
+number of potential pixel combinations in a picture that size is bigger than the total

313
+00:35:05,059 --> 00:35:11,429
+number of atoms in the universe. That's how complex vision can be.
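
[Aside: a rough check of that claim. The ten-megapixel figure and the usual ~10^80 estimate for atoms in the observable universe are illustrative assumptions, not figures from the lecture.]

import math

pixels = 10_000_000                      # assume a 10-megapixel RGB sensor
per_pixel = 256 ** 3                     # 256 levels for each of R, G, B
digits = pixels * math.log10(per_pixel)  # log10 of the number of distinct images
print(f"distinct images ~ 10^{digits:,.0f}")  # ~10^72,000,000, versus ~10^80 atoms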
314
+00:35:11,429 --> 00:35:18,539
+It's really, really complex. Hubel and Wiesel told us to start simple; David Marr told

315
+00:35:18,539 --> 00:35:25,130
+us to build a hierarchical model. Of course, Marr didn't tell us to build it as

316
+00:35:25,130 --> 00:35:29,400
+a convolutional neural network, which we will cover for the rest of the quarter, but

317
+00:35:29,400 --> 00:35:36,990
+his idea was to represent, or to think about, an image in

318
+00:35:36,989 --> 00:35:42,129
+several layers. The first one, he thought, is the edge image,

319
+00:35:42,130 --> 00:35:49,110
+which clearly took its inspiration from Hubel and Wiesel, and

320
+00:35:49,110 --> 00:35:52,579
+he personally called this the primal sketch;

321
+00:35:52,579 --> 00:35:55,730
+the name is self-explanatory.

322
+00:35:55,730 --> 00:36:02,400
+Then you move to what he called the two-and-a-half-D sketch; this is where you

323
+00:36:02,400 --> 00:36:08,829
+start to reconcile your 2-D image with the 3-D world. You recognize there are

324
+00:36:08,829 --> 00:36:15,679
+layers: I look at you right now, and I don't conclude that half of you is only a head

325
+00:36:15,679 --> 00:36:17,239
+and a neck,

326
+00:36:17,239 --> 00:36:22,799
+even though that's all I see, because I know you're all occluded by the row in

327
+00:36:22,800 --> 00:36:29,680
+front of you. This occlusion poses an ill-posed problem to solve;

328
+00:36:29,679 --> 00:36:38,118
+nature had an ill-posed problem to solve too, because it projects the 3-D world onto a 2-D

329
+00:36:38,119 --> 00:36:45,210
+retina. Nature solved it first with a hardware trick, the use of two

330
+00:36:45,210 --> 00:36:49,389
+eyes, but there is also a whole bunch of software tricks to merge the

331
+00:36:49,389 --> 00:36:53,868
+information from the two eyes and all that. It's the same with computer vision: we

332
+00:36:53,869 --> 00:36:59,280
+have to solve that 2-D-to-3-D problem too, and eventually we have to

333
+00:36:59,280 --> 00:37:03,180
+put everything together so that we actually have a good 3-D model of the

334
+00:37:03,179 --> 00:37:08,629
+world. Why do we need a 3-D model of the world? Because we have to survive,

335
+00:37:08,630 --> 00:37:15,309
+navigate and manipulate the world. When I shake your hand, I really need to know

336
+00:37:15,309 --> 00:37:16,509
+how to

337
+00:37:16,510 --> 00:37:22,320
+extend my hand and grab your hand the right way, and that requires a 3-D model of

338
+00:37:22,320 --> 00:37:26,000
+the world; otherwise I won't be able to grab it the right way. When I

339
+00:37:26,000 --> 00:37:34,219
+pick up a mug, the same thing. So that's David Marr's

340
+00:37:34,219 --> 00:37:39,899
+architecture for vision. It's a high-level, abstract architecture; it

341
+00:37:39,900 --> 00:37:45,490
+doesn't really tell us exactly what kind of mathematical modeling we should use,

342
+00:37:45,489 --> 00:37:51,439
+it doesn't tell us the learning procedure, and it doesn't really give us the

343
+00:37:51,440 --> 00:37:55,599
+inference procedure, all of which we will get to through the deep learning

344
+00:37:55,599 --> 00:38:02,759
+network architecture. But it is the high-level view, and

345
+00:38:02,760 --> 00:38:06,250
+it's an important concept to learn

346
+00:38:06,250 --> 00:38:08,619
+in vision; we call this the

347
+00:38:08,619 --> 00:38:16,859
+representation. Really important work. Now a short historical tour to

348
+00:38:16,860 --> 00:38:25,180
+show you how, as soon as Marr laid out this important way of thinking, the

349
+00:38:25,179 --> 00:38:31,879
+first wave of visual recognition algorithms went after the 3-D model,

350
+00:38:31,880 --> 00:38:38,280
+because that's the goal, right? No matter how you represent the stages, the
351
+00:38:38,280 --> 00:38:45,519
+goal is to reconstruct and recognize the object, and that is really sensible,

352
+00:38:45,519 --> 00:38:52,380
+because that's what we do when we go out into the world. Both of these pieces of work

353
+00:38:52,380 --> 00:38:58,829
+come from Palo Alto. One of them is from Stanford: Tom Binford was

354
+00:38:58,829 --> 00:39:00,440
+a professor at Stanford

355
+00:39:00,440 --> 00:39:05,760
+in the AI lab, and he and his student Rodney Brooks proposed one of the first

356
+00:39:05,760 --> 00:39:10,430
+such models, the so-called generalized cylinder model. I'm not going to get into the details, but

357
+00:39:10,429 --> 00:39:17,129
+the idea is that the world is composed of simple shapes, like

358
+00:39:17,130 --> 00:39:23,150
+cylinders and blocks, and any real-world object is just a combination of these

359
+00:39:23,150 --> 00:39:28,340
+simple shapes at particular viewing angles. That was a very

360
+00:39:28,340 --> 00:39:37,970
+influential visual recognition model in the seventies. Rodney Brooks went on to become the

361
+00:39:37,969 --> 00:39:47,239
+director of the MIT AI Lab, and he was also a founding member of iRobot, the company behind the Roomba,

362
+00:39:47,239 --> 00:39:51,379
+and all that. So he went on doing very influential

363
+00:39:51,380 --> 00:39:56,930
+work. Another interesting model came from the local

364
+00:39:56,929 --> 00:40:05,009
+research institute SRI, which I think is just across El Camino. This

365
+00:40:05,010 --> 00:40:15,260
+pictorial structure model has less of a 3-D flavor but more of a probabilistic

366
+00:40:15,260 --> 00:40:21,570
+flavor: objects are made of still-simple parts,

367
+00:40:21,570 --> 00:40:28,059
+like a person's head being made of eyes, a nose and a mouth, and the parts are

368
+00:40:28,059 --> 00:40:34,679
+connected by springs, allowing for some deformation. This captures the sense that when we

369
+00:40:34,679 --> 00:40:40,069
+recognize the world, not every one of you has exactly the same eyes or the same

370
+00:40:40,070 --> 00:40:45,150
+distance between the eyes, so we allow for some kind of variability. This

371
+00:40:45,150 --> 00:40:50,450
+concept of variability started to get introduced in models like this. And

372
+00:40:50,449 --> 00:40:56,309
+using models like this, and the reason I want to show you this is for you to

373
+00:40:56,309 --> 00:41:02,710
+see how simple the world looked at the time, this was one of the most influential

374
+00:41:02,710 --> 00:41:09,670
+models of the eighties for recognizing real-world objects: the entire

375
+00:41:09,670 --> 00:41:18,900
+real world was built from these seemingly simple pieces,

376
+00:41:18,900 --> 00:41:26,010
+using edges and simple shapes formed by edges to recognize

377
+00:41:26,010 --> 00:41:33,980
+things like razors and other objects. So that's kind of the early

378
+00:41:33,980 --> 00:41:39,699
+world of computer vision, which by and large was seeing black-and-white

379
+00:41:39,699 --> 00:41:46,529
+or even synthetic images. Starting in the nineties, we finally moved to

380
+00:41:46,530 --> 00:41:55,210
+color images of the real world, and it was a big change. Another very

381
+00:41:55,210 --> 00:42:01,150
+influential work here is not about recognizing an object in particular;

382
+00:42:01,150 --> 00:42:08,990
+it's about how to carve an image up into sensible parts. If you enter this room,
383
+00:42:08,989 --> 00:42:15,559
+there's no way your visual system tells you 'my god, I see so many pixels'; what it does is group

384
+00:42:15,559 --> 00:42:22,259
+things: you see heads, chairs, a stage, a platform, pieces

385
+00:42:22,260 --> 00:42:26,640
+of furniture and all that. This is called perceptual grouping. Perceptual

386
+00:42:26,639 --> 00:42:28,309
+grouping is one of the

387
+00:42:28,309 --> 00:42:34,779
+most important problems in vision, biological or artificial. If we don't

388
+00:42:34,780 --> 00:42:39,420
+know how to solve the perceptual grouping problem, we will have a

389
+00:42:39,420 --> 00:42:46,690
+really hard time deeply understanding the visual world. And you will hear, towards

390
+00:42:46,690 --> 00:42:53,450
+the end of this course, that a problem as fundamental as this is still not

391
+00:42:53,449 --> 00:42:57,859
+solved in computer vision, even though we have made a lot of progress before

392
+00:42:57,860 --> 00:43:04,390
+deep learning and after deep learning; we're still grasping for the final solution to a problem

393
+00:43:04,389 --> 00:43:10,650
+like this. So this is again why I want to give you this introduction:

394
+00:43:10,650 --> 00:43:16,950
+for you to be aware of the deep problems in vision and also the current

395
+00:43:16,949 --> 00:43:22,730
+challenges in vision. We cannot solve all the problems, despite

396
+00:43:22,730 --> 00:43:29,079
+whatever the news says; we're far from developing Terminators who can

397
+00:43:29,079 --> 00:43:34,860
+do everything. This piece of work is called Normalized Cuts, and it is one of

398
+00:43:34,860 --> 00:43:42,390
+the first computer vision works that takes real-world images and tries to

399
+00:43:42,389 --> 00:43:52,420
+solve this grouping problem. Jitendra Malik is a senior computer vision researcher, now a professor at

400
+00:43:52,420 --> 00:43:56,000
+Berkeley, and also a Stanford graduate.

401
+00:43:56,000 --> 00:44:01,989
+The results were not great, and I will not cover much segmentation in this class,

402
+00:44:01,989 --> 00:44:08,459
+but you can see we are making progress; this was the beginning of

403
+00:44:08,460 --> 00:44:15,510
+that line of work. Another very crucial work that I want to bring up and pay

404
+00:44:15,510 --> 00:44:22,410
+tribute to, even though we're not covering it in the rest of the

405
+00:44:22,409 --> 00:44:26,679
+course, because as a vision student it's pretty important for you to be

406
+00:44:26,679 --> 00:44:31,199
+aware of it: not only does it introduce an important problem we want

407
+00:44:31,199 --> 00:44:36,730
+to solve, it also gives you a perspective on the development of the field. This

408
+00:44:36,730 --> 00:44:40,480
+work is called the Viola-Jones face detector.

409
+00:44:40,480 --> 00:44:46,030
+It's very dear to my heart, because as a fresh graduate student

410
+00:44:46,030 --> 00:44:51,650
+at Caltech it was one of the first papers I read when

411
+00:44:51,650 --> 00:44:56,150
+I entered the lab, and I didn't know anything yet; my adviser said this is an

412
+00:44:56,150 --> 00:45:02,090
+amazing piece of work that we're all trying to understand. By

413
+00:45:02,090 --> 00:45:08,690
+the time I graduated from Caltech, this very work had been transferred into the first

414
+00:45:08,690 --> 00:45:16,510
+smart digital camera, by Fujifilm in 2006,
+as the first digital camera that has a

415
+00:45:16,510 --> 00:45:22,390
+face detector. So from a technology transfer point of view it was

416
+00:45:22,389 --> 00:45:28,789
+extremely fast, and it was one of the first successful high-level visual

417
+00:45:28,789 --> 00:45:35,849
+recognition algorithms to be used in a consumer product. This work just

418
+00:45:35,849 --> 00:45:41,059
+learns to detect faces, faces in the wild; these are no longer

419
+00:45:41,059 --> 00:45:47,920
+simulations or very contrived settings, these are arbitrary pictures. And even though

420
+00:45:47,920 --> 00:45:53,329
+it didn't use a deep learning network, it has a lot of the deep learning flavor:

421
+00:45:53,329 --> 00:46:01,179
+the features were learned. The algorithm learns to find simple features,

422
+00:46:01,179 --> 00:46:06,919
+these black-and-white rectangle filter features, that give us the best

423
+00:46:06,920 --> 00:46:14,639
+localization of faces. So this is a very influential piece of work. It's also one

424
+00:46:14,639 --> 00:46:24,679
+of the first computer vision works that was deployed on a chip and could run in real

425
+00:46:24,679 --> 00:46:31,019
+time. Before that, computer vision algorithms were very slow. The paper actually is

426
+00:46:31,019 --> 00:46:36,699
+called real-time face detection; it ran on a Pentium chip, I don't know if

427
+00:46:36,699 --> 00:46:41,409
+anybody remembers that kind of chip, and it was not a fast chip, but nevertheless

428
+00:46:41,409 --> 00:46:48,569
+it ran in real time. That was another very important piece of work. And one more

429
+00:46:48,570 --> 00:46:53,380
+thing to point out: around this time, and this is not the only such work

430
+00:46:53,380 --> 00:46:59,170
+but it is a really good representative of its time, the focus of

431
+00:46:59,170 --> 00:47:06,250
+computer vision was shifting. Remember that David Marr's framework and the

432
+00:47:06,250 --> 00:47:14,699
+early work was trying to model the 3-D shape of the object; now we were

433
+00:47:14,699 --> 00:47:23,439
+shifting to recognizing what the object is, caring a little less about whether we can really

434
+00:47:23,440 --> 00:47:27,400
+reconstruct these faces or not. There's a whole branch of computer vision and

435
+00:47:27,400 --> 00:47:34,200
+graphics that continues to work on reconstruction, but a big part of computer vision

436
+00:47:34,199 --> 00:47:38,730
+at this time, around the turn of the century, turned to focus on recognition,

437
+00:47:38,730 --> 00:47:47,539
+and that brings computer vision to today, where the most important parts of

438
+00:47:47,539 --> 00:47:55,480
+computer vision work focus on these cognitive questions, like recognition, and

439
+00:47:55,480 --> 00:47:57,369
+AI questions.

440
+00:47:57,369 --> 00:48:06,150
+Another very important line of work started to focus on features. Around

441
+00:48:06,150 --> 00:48:12,950
+the time of face detection, people started to realize it's really, really hard

442
+00:48:12,949 --> 00:48:19,829
+to recognize an object by describing the whole thing. Like I just said, I

443
+00:48:19,829 --> 00:48:25,960
+see you heavily occluded: I don't see the rest of your torso, I

444
+00:48:25,960 --> 00:48:31,690
+really don't see any of your legs except in the first row, but I recognize you and

445
+00:48:31,690 --> 00:48:39,230
+can infer each of you as an object. So some people started to
+realize that maybe it isn't

446
+00:48:39,230 --> 00:48:44,240
+really the global shape we have to go after in order to recognize an object;

447
+00:48:44,239 --> 00:48:50,319
+maybe it's the features. If we recognize the important features of an object, we can

448
+00:48:50,320 --> 00:48:53,090
+go a long way. And that makes a lot of sense:

449
+00:48:53,090 --> 00:48:57,930
+think about evolution. If you are out hunting, you don't need to recognize the

450
+00:48:57,929 --> 00:49:03,909
+tiger's full body and shape to decide you need to run away; a few

451
+00:49:03,909 --> 00:49:06,588
+patches of the fur of the tiger through the

452
+00:49:06,588 --> 00:49:12,679
+leaves probably alarm you enough. So we need this kind of quick

453
+00:49:12,679 --> 00:49:16,429
+decision-making, because survival-level vision is really quick,

454
+00:49:16,429 --> 00:49:22,308
+and a lot of it relies on important features. So this work, called SIFT, by

455
+00:49:22,309 --> 00:49:28,539
+David Lowe, and you just saw that name, is about learning the important

456
+00:49:28,539 --> 00:49:34,009
+features of an object. Once you have learned these important features, just a few of

457
+00:49:34,009 --> 00:49:38,400
+them on the object, you can actually recognize the object from a totally

458
+00:49:38,400 --> 00:49:45,548
+different angle and in totally cluttered scenes. So up until deep learning's

459
+00:49:45,548 --> 00:49:54,880
+resurrection around 2010 or 2012, for about ten years the entire field of

460
+00:49:54,880 --> 00:50:00,229
+computer vision was focused on using these features to build models to

461
+00:50:00,228 --> 00:50:05,538
+recognize objects and scenes, and we did a great job; we went a long way.

462
+00:50:05,539 --> 00:50:12,609
+One of the reasons the deep learning network became more and more convincing to

463
+00:50:12,608 --> 00:50:17,690
+a lot of people is that the features a deep learning network

464
+00:50:17,690 --> 00:50:22,880
+learns turn out to be very similar to these features engineered by brilliant

465
+00:50:22,880 --> 00:50:30,229
+engineers. So they kind of confirmed each other: we needed people like

466
+00:50:30,228 --> 00:50:34,929
+David Lowe to first tell us that these features work, and then we developed better

467
+00:50:34,929 --> 00:50:38,978
+mathematical models to learn these features automatically, and they confirmed

468
+00:50:38,978 --> 00:50:46,210
+each other. So the historical importance of this work should not be

469
+00:50:46,210 --> 00:50:52,028
+diminished; this work is one of the

470
+00:50:52,028 --> 00:50:57,858
+intellectual foundations for us to realize how critical and how useful

471
+00:50:57,858 --> 00:51:07,018
+the features are that deep learning learns for us. Let me just briefly say that, because

472
+00:51:07,018 --> 00:51:12,379
+of the features that David Lowe and many other researchers gave us, we could use

473
+00:51:12,380 --> 00:51:18,239
+them to learn scene recognition, and around that time the machine learning

474
+00:51:18,239 --> 00:51:24,719
+tools we mostly used were either graphical models or support vector machines;

475
+00:51:24,719 --> 00:51:29,479
+this is one influential work on using support vector machines and kernel

476
+00:51:29,478 --> 00:51:43,358
+models to recognize scenes, but I'll be brief here. The last pre-deep-learning model I want to show
477
+00:51:43,358 --> 00:51:50,578
+you is this feature-based model called the deformable part model, where we

478
+00:51:50,579 --> 00:51:57,420
+learn the parts of an object, like the parts of a person, and we learn how they

479
+00:51:57,420 --> 00:52:08,519
+configure with each other in space, using a

480
+00:52:08,518 --> 00:52:16,179
+support-vector-machine kind of model to recognize objects like humans and

481
+00:52:16,179 --> 00:52:21,419
+bottles. Around this time, and that's 2009, 2010, the field of computer vision was

482
+00:52:21,420 --> 00:52:25,659
+mature enough that we were working on important, hard problems:

483
+00:52:25,659 --> 00:52:30,828
+recognizing pedestrians, recognizing cars. These were no longer

484
+00:52:30,829 --> 00:52:37,219
+contrived problems. Something else was needed, too: benchmarks. As a field advances,

485
+00:52:37,219 --> 00:52:44,039
+if we don't have good benchmarks, then everybody tests on their own set of images and it's really hard to

486
+00:52:44,039 --> 00:52:50,369
+set a global standard. One of the most important benchmarks is called the PASCAL

487
+00:52:50,369 --> 00:52:57,608
+VOC object recognition benchmark. It's a European effort in which

488
+00:52:57,608 --> 00:53:04,190
+researchers put together tens of thousands of images from 20 classes of

489
+00:53:04,190 --> 00:53:13,019
+objects, and these are examples of the objects:

490
+00:53:13,018 --> 00:53:17,808
+cats, dogs, cows, airplanes, bottles,

491
+00:53:17,809 --> 00:53:20,048
+horses, trains

492
+00:53:20,048 --> 00:53:27,268
+and all that. Then, annually, computer vision researchers

493
+00:53:27,268 --> 00:53:34,948
+and labs came to compete on the object recognition task in the PASCAL object

494
+00:53:34,949 --> 00:53:41,188
+recognition challenge, and over the years the

495
+00:53:41,188 --> 00:53:47,949
+performance just kept increasing. That was when we started to feel

496
+00:53:47,949 --> 00:53:52,929
+excited about the progress of the field at that time.

497
+00:53:52,929 --> 00:53:59,729
+Here's a story a little closer to us: my lab and my

498
+00:53:59,728 --> 00:54:05,718
+students were thinking, you know, the real world is not about 20 objects; the real

499
+00:54:05,719 --> 00:54:12,489
+world is a lot more than 20 objects. So following the work of the PASCAL visual

500
+00:54:12,489 --> 00:54:18,239
+object recognition challenge, we put together this massive, massive project

501
+00:54:18,239 --> 00:54:23,889
+called ImageNet, which some of you may have heard of. In this class you will be

502
+00:54:23,889 --> 00:54:30,098
+using a tiny portion of ImageNet in some of your assignments. ImageNet

503
+00:54:30,099 --> 00:54:36,759
+is a dataset of 15 million images, all cleaned by hand and

504
+00:54:36,759 --> 00:54:47,000
+annotated over about 20,000 object classes. The students who cleaned it,

505
+00:54:47,000 --> 00:54:54,469
+various members of my lab, remember the crowdsourcing platform, Amazon Mechanical

506
+00:54:54,469 --> 00:54:59,969
+Turk, that we used; the graduate students also suffered through putting together this platform.

507
+00:54:59,969 --> 00:55:08,599
+But it's a very exciting dataset. And we started to put together

508
+00:55:08,599 --> 00:55:15,900
+annual competitions, called the ImageNet challenge, for object recognition, and
509
+00:55:15,900 --> 00:55:22,440
+for example, the standard image classification competition of ImageNet has a

510
+00:55:22,440 --> 00:55:28,710
+thousand object classes over almost 1.5 million images, and algorithms compete on

511
+00:55:28,710 --> 00:55:34,220
+performance. Actually, I just heard somebody on social media

512
+00:55:34,219 --> 00:55:38,589
+referring to the ImageNet challenge as the Olympics of computer vision, which I found very

513
+00:55:38,590 --> 00:55:40,240
+flattering.

514
+00:55:40,239 --> 00:55:55,649
+That brings us close to the history-making moment. We started the challenge in 2010;

515
+00:55:55,650 --> 00:56:00,369
+that's actually around the time our PASCAL colleagues

516
+00:56:00,369 --> 00:56:05,309
+told us they were going to start phasing out their 20-object challenge, so we phased

517
+00:56:05,309 --> 00:56:12,039
+in the thousand-object ImageNet challenge. The y-axis here is error rate,

518
+00:56:12,039 --> 00:56:18,199
+and we started with a very significant error rate. Of course,

519
+00:56:18,199 --> 00:56:28,029
+every year it decreased, but there is one particular year in which it really dropped.
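
[Aside: the metric on that y-axis for ImageNet classification is conventionally top-5 error: the fraction of images whose true label is not among the model's five highest-scoring guesses. A minimal numpy sketch; the arrays are made-up toy data, not challenge results.]

import numpy as np

scores = np.random.randn(4, 1000)          # toy classifier scores: 4 images, 1000 classes
labels = np.array([3, 917, 42, 0])         # toy ground-truth class ids
top5 = np.argsort(-scores, axis=1)[:, :5]  # five highest-scoring classes per image
top5_err = np.mean([y not in row for y, row in zip(labels, top5)])
print(f"top-5 error: {top5_err:.2f}")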
520
+00:56:28,030 --> 00:56:38,960
+It was cut almost in half. That year is 2012. 2012 is the year the winning architecture

521
+00:56:38,960 --> 00:56:45,769
+of the ImageNet challenge was a convolutional neural network. I will talk

522
+00:56:45,769 --> 00:56:53,250
+about it more; it was not invented in 2012, despite how all the news pieces make it feel

523
+00:56:53,250 --> 00:56:58,190
+like the newest thing on the block. It's not; it was invented back in

524
+00:56:58,190 --> 00:56:59,349
+the seventies and eighties.

525
+00:56:59,349 --> 00:57:05,279
+But with a convergence of things we will talk about, the convolutional neural

526
+00:57:05,280 --> 00:57:10,519
+network showed its massive power as a high-capacity, end-to-end training

527
+00:57:10,519 --> 00:57:18,219
+architecture, and won the ImageNet challenge by a huge margin. That was

528
+00:57:18,219 --> 00:57:24,829
+quite a historical moment. From a mathematical point of view it wasn't

529
+00:57:24,829 --> 00:57:30,079
+that new, but from an engineering, solving-real-world-problems point of view, this

530
+00:57:30,079 --> 00:57:35,090
+was a historical moment. That piece of work has been covered numerous

531
+00:57:35,090 --> 00:57:42,400
+times since, and all that. This is the onset, the beginning, of the deep learning

532
+00:57:42,400 --> 00:57:48,869
+revolution, if you want to call it that, and it is the premise of this class. So at this

533
+00:57:48,869 --> 00:57:54,609
+point I'm going to switch gears. We went through a brief history of computer

534
+00:57:54,610 --> 00:57:59,539
+vision, covering 540 million years, and now for an

535
+00:57:59,539 --> 00:58:05,869
+overview of this class. Is there any other question?

536
+00:58:05,869 --> 00:58:13,969
+Alright. Even though it was kind of overwhelming, we talked a lot

537
+00:58:13,969 --> 00:58:20,559
+about the different tasks in computer vision. CS231n is going to focus

538
+00:58:20,559 --> 00:58:27,849
+on the visual recognition problem, and by and large, especially through most of the

539
+00:58:27,849 --> 00:58:29,509
+foundational lectures,

540
+00:58:29,510 --> 00:58:35,750
+on image classification. Almost everything we talk about is going to be

541
+00:58:35,750 --> 00:58:41,480
+based on that ImageNet classification setup. We will get to other

542
+00:58:41,480 --> 00:58:47,900
+visual recognition scenarios, but the image classification problem is the main

543
+00:58:47,900 --> 00:58:52,780
+problem we will focus on in this class, which means please keep in mind:

544
+00:58:52,780 --> 00:58:56,600
+visual recognition is not just image classification, right? There was 3-D

545
+00:58:56,599 --> 00:59:01,339
+modeling, there was grouping and segmentation and all this, but that's

546
+00:59:01,340 --> 00:59:06,250
+what we'll focus on. And I don't need to convince you that just

547
+00:59:06,250 --> 00:59:11,000
+application-wise, image classification is an extremely useful problem:

548
+00:59:11,000 --> 00:59:17,929
+from the point of view of big commercial Internet companies to

549
+00:59:17,929 --> 00:59:22,449
+startup ideas, you want to recognize objects, you want to recognize

550
+00:59:22,449 --> 00:59:29,119
+food, do online shopping, mobile shopping, you want to sort your photo albums; image

551
+00:59:29,119 --> 00:59:35,710
+classification is, or can be, a bread-and-butter task for many, many

552
+00:59:35,710 --> 00:59:44,650
+important problems. There is a family of problems related to classification, and

553
+00:59:44,650 --> 00:59:49,329
+today I don't expect you to understand the differences, but I want you to hear

554
+00:59:49,329 --> 00:59:55,659
+that this class will make sure you learn to understand the nuances and

555
+00:59:55,659 --> 01:00:01,879
+the details of the different flavors of visual recognition: what is image

556
+01:00:01,880 --> 01:00:07,700
+classification, what's object detection, what's image captioning. These have

557
+01:00:07,699 --> 01:00:14,529
+different flavors. For example, while image classification might

558
+01:00:14,530 --> 01:00:19,740
+focus on the whole big image, object detection will tell you where things

559
+01:00:19,739 --> 01:00:23,579
+exactly are, like where the car is, the pedestrian,

560
+01:00:23,579 --> 01:00:30,159
+the hammer, and also the relationships between objects, and so on.

561
+01:00:30,159 --> 01:00:35,529
+So there are nuances and details that you will be learning about in this class.

562
+01:00:35,530 --> 01:00:43,840
+And I already said that the CNN, or convolutional neural network, is one type of deep learning

563
+01:00:43,840 --> 01:00:50,910
+architecture, but it's the overwhelmingly successful deep learning architecture, and

564
+01:00:50,909 --> 01:00:54,909
+it is the architecture we will be focusing on. To just go back to the

565
+01:00:54,909 --> 01:01:02,849
+ImageNet challenge: I said the historic year is 2012. This is the year

566
+01:01:02,849 --> 01:01:14,349
+Alex Krizhevsky and Geoff Hinton proposed this convolutional, I think it's a seven-

567
+01:01:14,349 --> 01:01:20,500
+layer, convolutional neural network to win the ImageNet challenge. The winning model before

568
+01:01:20,500 --> 01:01:22,318
+that year was

569
+01:01:22,318 --> 01:01:30,548
+a SIFT-feature-plus-support-vector-machine architecture. It is still a hierarchy,

570
+01:01:30,548 --> 01:01:38,449
+but it doesn't have that flavor of end-to-end learning. Fast forward to 2015:

571
+01:01:38,449 --> 01:01:43,798
+the winning architecture is still a convolutional neural network. It's a

572
+01:01:43,798 --> 01:01:56,599
+hundred-and-fifty-one layers, by Microsoft Research Asia researchers, and it's called
573
+01:01:56,599 --> 01:02:03,048
+the residual net. I'm not going to show you or cover

574
+01:02:03,048 --> 01:02:09,369
+that in detail, and I definitely don't expect you to know what every single layer does; actually

575
+01:02:09,369 --> 01:02:17,269
+they largely repeat themselves. But every year since 2012, the winning architecture

576
+01:02:17,268 --> 01:02:23,548
+of the ImageNet challenge has been a deep-learning-based architecture. So, like I

577
+01:02:23,548 --> 01:02:32,369
+said, I also want you to respect history: this was not invented overnight. There are a lot

578
+01:02:32,369 --> 01:02:37,979
+of influential players today, but there are also a lot of people who built

579
+01:02:37,978 --> 01:02:41,879
+the foundation. I actually don't have the slide, but one important person to remember

580
+01:02:41,880 --> 01:02:50,910
+is Kunihiko Fukushima. Fukushima was a Japanese scientist who built a

581
+01:02:50,909 --> 01:02:58,798
+model called the neocognitron, and that was the beginning of the neural network

582
+01:02:58,798 --> 01:03:04,318
+architecture. Yann LeCun is also a very influential person, and the really

583
+01:03:04,318 --> 01:03:10,248
+groundbreaking work, in my opinion, of Yann LeCun was published in

584
+01:03:10,248 --> 01:03:16,348
+the nineteen-nineties. Before that, mathematicians and researchers, with Geoff Hinton,

585
+01:03:16,349 --> 01:03:22,479
+Alex Krizhevsky's adviser, involved, worked out the backpropagation learning

586
+01:03:22,478 --> 01:03:28,088
+strategy; if this sounds like gibberish, Andrej will tell you about it in a couple

587
+01:03:28,088 --> 01:03:34,528
+of weeks. The mathematical model was roughed out in the eighties and

588
+01:03:34,528 --> 01:03:34,920
+the

589
+01:03:34,920 --> 01:03:40,869
+nineties, and Yann LeCun was working at Bell Labs at AT&T, which was an

590
+01:03:40,869 --> 01:03:47,160
+amazing place at that time; there are no Bell Labs like that today anymore. They were

591
+01:03:47,159 --> 01:03:50,949
+working on really ambitious projects, and he needed to recognize digits,

592
+01:03:50,949 --> 01:03:57,019
+because eventually the product was shipped to banks and the US Post

593
+01:03:57,019 --> 01:04:03,380
+Office to recognize digits on mail and on checks. And he constructed this convolutional

594
+01:04:03,380 --> 01:04:08,068
+neural network, where, inspired by Hubel and Wiesel, he

595
+01:04:08,068 --> 01:04:14,500
+starts by looking at simple edge-like structures in the image, not the

596
+01:04:14,500 --> 01:04:20,099
+whole letter '8' but really its edges, and then, layer by layer,

597
+01:04:20,099 --> 01:04:25,539
+filters these edges, pools them together, filters and pools again, and so on.
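
[Aside: a minimal sketch of that filter-pool-filter-pool stack. The layer sizes follow the classic published LeNet-5 layout, but treat the snippet as illustrative, not as the lecture's own material.]

# LeNet-5-style stack: alternate convolution ("filter") and pooling layers,
# then a fully-connected layer producing the ten digit scores.
layers = [
    ("conv", {"filters": 6,  "size": 5}),   # detect simple edge-like structures
    ("pool", {"size": 2}),                  # pool them together
    ("conv", {"filters": 16, "size": 5}),   # filter again over the pooled maps
    ("pool", {"size": 2}),
    ("fc",   {"units": 10}),                # scores for digits 0-9
]

side = 32  # LeNet-5 takes 32x32 grayscale digit images
for kind, p in layers:
    if kind == "conv":
        side = side - p["size"] + 1         # a valid 5x5 convolution shrinks the map
        print(f"conv {p['filters']}@{p['size']}x{p['size']} -> {side}x{side} maps")
    elif kind == "pool":
        side //= p["size"]                  # 2x2 pooling halves the resolution
        print(f"pool {p['size']}x{p['size']} -> {side}x{side} maps")
    else:
        print(f"fc -> {p['units']} class scores")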
598
+01:04:25,539 --> 01:04:36,230
+Then, in 2012, Alex Krizhevsky and Geoff Hinton used almost exactly the same

599
+01:04:36,230 --> 01:04:40,900
+architecture to participate in the

600
+01:04:40,900 --> 01:04:47,900
+ImageNet challenge. There are a few changes, but that became the winning

601
+01:04:47,900 --> 01:04:54,920
+architecture. We'll tell you more about the detailed changes later: the

602
+01:04:54,920 --> 01:05:02,380
+capacity of the model did grow a little bit, because Moore's law helped us, and there is

603
+01:05:02,380 --> 01:05:08,220
+also one very small detail, the nonlinearity function changed its shape a

604
+01:05:08,219 --> 01:05:14,828
+little from the sigmoid used before. But whatever, there are a couple of

605
+01:05:14,829 --> 01:05:19,130
+small changes; by and large nothing had changed

606
+01:05:19,130 --> 01:05:26,490
+mathematically. But important things did change, and they grew the deep learning

607
+01:05:26,489 --> 01:05:35,379
+architecture into its renaissance. One, like I said, is Moore's law and

608
+01:05:35,380 --> 01:05:41,180
+hardware: hardware made a huge difference, because these are extremely high-

609
+01:05:41,179 --> 01:05:44,669
+capacity models. When Yann LeCun was training,

610
+01:05:44,670 --> 01:05:50,720
+it was painfully slow because of the bottleneck of computation; he couldn't

611
+01:05:50,719 --> 01:05:55,209
+build the model too big, and if you cannot make it big, you cannot fully

612
+01:05:55,210 --> 01:06:00,670
+realize its potential. From a machine learning standpoint there was overfitting and

613
+01:06:00,670 --> 01:06:07,780
+all these problems as well. But now we have much faster, bigger

614
+01:06:07,780 --> 01:06:16,410
+transistor microchips, and GPUs from NVIDIA made a huge difference in deep

615
+01:06:16,409 --> 01:06:22,358
+learning history: we can now train these models in a reasonable amount

616
+01:06:22,358 --> 01:06:27,358
+of time even if they're huge. The other thing, and I think we do need to take credit for part of

617
+01:06:27,358 --> 01:06:37,159
+this work, is data: the availability of data, big data. Data by itself

618
+01:06:37,159 --> 01:06:41,078
+doesn't mean anything if you don't know how to use it, but in this

619
+01:06:41,079 --> 01:06:45,869
+deep learning architecture, data became the driving force for the high-capacity

620
+01:06:45,869 --> 01:06:52,390
+model, enabling the end-to-end training and helping avoid overfitting when

621
+01:06:52,389 --> 01:06:57,608
+you have enough data. If you look at the number of pixels that

622
+01:06:57,608 --> 01:07:05,639
+machine learning people had in 2012 versus what Yann LeCun had in 1998, it's a huge

623
+01:07:05,639 --> 01:07:06,469
+difference,

624
+01:07:06,469 --> 01:07:14,469
+orders of magnitude. So that is the focus of CS231n,

625
+01:07:14,469 --> 01:07:21,098
+but it's also important, one last time, that I drill in this idea

626
+01:07:21,099 --> 01:07:27,048
+that visual intelligence does go beyond object recognition. I don't want any of

627
+01:07:27,048 --> 01:07:31,039
+you coming out of this course thinking we've done everything, that we've

628
+01:07:31,039 --> 01:07:38,889
+conquered the entire space of visual recognition. It's not true. There

629
+01:07:38,889 --> 01:07:44,460
+are still a lot of cool problems to solve. For example, labeling an

630
+01:07:44,460 --> 01:07:51,650
+entire scene with perceptual grouping, so I know where every single pixel belongs

631
+01:07:51,650 --> 01:07:52,329
+to:

632
+01:07:52,329 --> 01:07:56,900
+that's still an ongoing problem. Combining

633
+01:07:56,900 --> 01:08:02,740
+recognition with 3-D: there's a lot of excitement happening at the

634
+01:08:02,739 --> 01:08:09,349
+intersection of vision and robotics, and this is definitely one such area.

635
+01:08:09,349 --> 01:08:15,039
+And then anything to do with motion and video: this is another

636
+01:08:15,039 --> 01:08:33,289
+big open area of research. Beyond just recognizing a scene, you actually want to
637
+01:08:33,289 --> 01:08:35,689
+deeply understand the picture:

638
+01:08:35,689 --> 01:08:39,489
+what people are doing, what the relationships between the objects are,

639
+01:08:39,489 --> 01:08:45,029
+the relations between people and objects. This is an ongoing project

640
+01:08:45,029 --> 01:08:49,759
+called Visual Genome in my lab, in which Justin and a number of my students are

641
+01:08:49,760 --> 01:08:55,739
+involved, and it goes far beyond the image classification we talked about.

642
+01:08:55,739 --> 01:09:03,639
+And what is one of our holy grails? One of the holy grails of the community is

643
+01:09:03,640 --> 01:09:09,260
+to be able to tell the story of a scene. Think about yourself as a human:

644
+01:09:09,260 --> 01:09:11,180
+you open your eyes,

645
+01:09:11,180 --> 01:09:17,840
+and the moment you open your eyes you're able to describe what you see. In fact, in

646
+01:09:17,840 --> 01:09:24,940
+psychology experiments we find that even if you show people a picture for only

647
+01:09:24,939 --> 01:09:30,659
+five hundred milliseconds, that's literally half a second, people can

648
+01:09:30,659 --> 01:09:36,769
+write essays about it. We paid them $10 an hour, so the essays

649
+01:09:36,770 --> 01:09:42,410
+weren't that long, but I figure if we had paid more money they

650
+01:09:42,409 --> 01:09:47,970
+could probably have written longer essays. The point is that our visual system is

651
+01:09:47,970 --> 01:09:54,390
+extremely powerful; we can tell stories. And my dream, and this was

652
+01:09:54,390 --> 01:10:02,560
+Andrej's dissertation work, is that we give a computer one picture and

653
+01:10:02,560 --> 01:10:03,960
+out comes a

654
+01:10:03,960 --> 01:10:09,159
+description like this. We are getting there; you'll see work where you give the

655
+01:10:09,159 --> 01:10:15,149
+computer a picture and it gives you one sentence, or you give it a picture and

656
+01:10:15,149 --> 01:10:20,319
+it produces a few short sentences. We're not fully there yet, but that's one of the holy

657
+01:10:20,319 --> 01:10:26,250
+grails. And the other holy grail is continuing beyond this; I think it's

658
+01:10:26,250 --> 01:10:33,659
+summarized really well by Andrej's blog post about this picture: there

659
+01:10:33,659 --> 01:10:42,300
+is so much nuance in this picture that you get to enjoy. Not only

660
+01:10:42,300 --> 01:10:47,890
+do you recognize the global scene, and it would be very boring if all the computer could tell you is

661
+01:10:47,890 --> 01:10:53,650
+that it's a room or a locker room, whatever, and that's it;

662
+01:10:53,649 --> 01:10:58,238
+here you also recognize who the people are, you recognize the trick

663
+01:10:58,238 --> 01:11:00,569
+being played,

664
+01:11:00,569 --> 01:11:06,009
+the joke Obama is making, you recognize the kind of interaction, you

665
+01:11:06,010 --> 01:11:11,250
+recognize the humor. There's just so much nuance. This is what our visual world is about: we

666
+01:11:11,250 --> 01:11:18,719
+use our ability of visual understanding not only to survive, navigate and

667
+01:11:18,719 --> 01:11:26,000
+manipulate, but we use it to socialize, to entertain, to understand the world. And

668
+01:11:26,000 --> 01:11:32,929
+these are the big goals of vision. And I

669
+01:11:32,929 --> 01:11:39,630
+don't need to convince you that computer vision technology will make our world a
--> 01:11:46,550 +better place despite some scary talks +out there you know even though you home + +671 +01:11:46,550 --> 01:11:51,029 +today in the industry as well as +research world we're using computer + +672 +01:11:51,029 --> 01:11:58,349 +vision to build better robots to save +lives to go deep exploring analyst now + +673 +01:11:58,350 --> 01:12:02,860 +ok so I have like what two minutes 35 +minutes left + +674 +01:12:02,859 --> 01:12:10,839 +great time let me introduce the team and +justice are the color instructors with + +675 +01:12:10,840 --> 01:12:16,989 +me tienes please stand up to say hi to +him + +676 +01:12:16,989 --> 01:12:22,639 +can you like this safe your name quickly +and you're like what you just don't give + +677 +01:12:22,640 --> 01:12:49,180 +a speech but yes + +678 +01:12:49,180 --> 01:13:42,240 +because because people class action and +help us to process I respect a person is + +679 +01:13:42,239 --> 01:14:04,739 +confidential personal issues but again +I'm going on our terms and leave for a + +680 +01:14:04,739 --> 01:14:09,939 +few weeks starting the end of January +social please if you decide you just + +681 +01:14:09,939 --> 01:14:15,379 +want to send email to me unless somebody +like you they will take + +682 +01:14:15,380 --> 01:14:20,770 +I'm likely to a reply you promptly sorry +about that + +683 +01:14:20,770 --> 01:14:25,420 +priorities + +684 +01:14:25,420 --> 01:14:34,739 +about our philosophy and we're not +getting to the details we really want + +685 +01:14:34,738 --> 01:14:39,448 +this to be a very hands-on project this +is really I give a lot of credit to + +686 +01:14:39,448 --> 01:14:46,419 +Justin and Andre they are extremely good +at walking through these hands-on + +687 +01:14:46,420 --> 01:14:51,840 +details with you so that when you come +out of this class you not only have I + +688 +01:14:51,840 --> 01:14:57,719 +love understanding but you have a you +have a really good ability to to build + +689 +01:14:57,719 --> 01:15:02,010 +your own deep learning code we want you +to be exposed to state of the art + +690 +01:15:02,010 --> 01:15:08,730 +material you're gonna be learning things +really that's as freshest 2015 and it'll + +691 +01:15:08,729 --> 01:15:11,859 +be fun you get to do things like this + +692 +01:15:11,859 --> 01:15:18,960 +not not all the time but like time the +picture into one goal or or this weird + +693 +01:15:18,960 --> 01:15:27,489 +thing it'll be a fun class in addition +to all the important tasks you you you + +694 +01:15:27,488 --> 01:15:33,589 +you learn we do have grading policies +these are all on our website another + +695 +01:15:33,590 --> 01:15:44,929 +eatery those again one very clear you +are grown ups which grew to like + +696 +01:15:44,929 --> 01:15:51,989 +grown-ups we do not take anything at the +end of the course is my professors want + +697 +01:15:51,988 --> 01:15:56,359 +me to go to this conference and I have +to have like three more late they say no + +698 +01:15:56,359 --> 01:16:03,630 +you are responsible for using your total +eight days you have 7 late you can use + +699 +01:16:03,630 --> 01:16:11,079 +them in whatever way you all 10 penalty +be all those you have to take a penalty + +700 +01:16:11,079 --> 01:16:18,069 +is like really really exceptional +medical family emergency + +701 +01:16:18,069 --> 01:16:21,799 +talk to us on the individual basis but +anything else + +702 +01:16:21,800 --> 01:16:29,539 +conference that why other finally you +know like missing cat or whatever is + +703 
703
+01:16:29,539 --> 01:16:37,850
+we budgeted that into the seven late days. Another thing is the honor code; this is one

704
+01:16:37,850 --> 01:16:43,190
+thing I have to say with a really straight face. You are at such a privileged

705
+01:16:43,189 --> 01:16:50,710
+institution, and you are grown-ups; I want you to be responsible about the honor

706
+01:16:50,710 --> 01:16:55,239
+code. Every single Stanford student taking this class should know the honor

707
+01:16:55,239 --> 01:16:58,619
+code; if you don't, there's no excuse, you should go back

708
+01:16:58,619 --> 01:17:04,840
+and read it. We take collaboration extremely seriously. I almost hate to say it, but statistically,

709
+01:17:04,840 --> 01:17:10,380
+given a class this big, we will have a few cases; but I want you to be an

710
+01:17:10,380 --> 01:17:16,210
+exceptional class: even with a size this big, we do not want to see anything that

711
+01:17:16,210 --> 01:17:22,399
+infringes on the academic honor code, so read the collaboration policy and respect it;

712
+01:17:22,399 --> 01:17:31,960
+this is really about respecting yourself. I think with all that, the prerequisites you can

713
+01:17:31,960 --> 01:17:38,149
+read on your own. That's all I wanted to say. Are there any burning

714
+01:17:38,149 --> 01:17:47,569
+questions that you feel are worth asking? Yes?

715
+01:17:47,569 --> 01:18:06,689
+OK.

diff --git a/captions/En/Lecture2_en.srt b/captions/En/Lecture2_en.srt
new file mode 100644
index 00000000..52b86a53
--- /dev/null
+++ b/captions/En/Lecture2_en.srt
@@ -0,0 +1,3644 @@
+1
+00:00:00,000 --> 00:00:03,750
+And we're recording, so let me just remind you again:

2
+00:00:03,750 --> 00:00:08,160
+we are recording this class, so if you're uncomfortable speaking on camera, be aware:

3
+00:00:08,160 --> 00:00:15,929
+you're not in the picture, but your voice might be on the recording. OK, great. As you can

4
+00:00:15,929 --> 00:00:19,589
+see, the screen is wider than it should be and I'm not sure how to fix it,

5
+00:00:19,589 --> 00:00:21,300
+so we'll have to live with it;

6
+00:00:21,300 --> 00:00:25,269
+luckily your visual cortex is very good and quite invariant to stretching, so

7
+00:00:25,268 --> 00:00:26,118
+this is not a problem.

8
+00:00:26,118 --> 00:00:32,259
+OK, let's start with some administrative things before we dive into the class. The

9
+00:00:32,259 --> 00:00:36,100
+first assignment will come out tonight or early tomorrow. It is due January

10
+00:00:36,100 --> 00:00:41,289
+20, so you have exactly two weeks. You will be writing a k-nearest-neighbor classifier, a linear classifier,

11
+00:00:41,289 --> 00:00:44,159
+and a small two-layer neural network; you'll be writing the entirety of the

12
+00:00:44,159 --> 00:00:47,979
+backpropagation algorithm for a two-layer neural network. We will cover all that

13
+00:00:47,979 --> 00:00:54,459
+material in the next two weeks. One warning, by the way: the assignments from last

14
+00:00:54,460 --> 00:00:57,350
+year are up as well, and we're changing the assignments, so please do not

15
+00:00:57,350 --> 00:01:02,890
+complete the 2015 assignment; that's something to be aware of. For your

16
+00:01:02,890 --> 00:01:07,109
+computation you will be using Python and numpy, and we will also be offering

17
+00:01:07,109 --> 00:01:11,030
+terminal.com, which is basically virtual machines in the

18
+00:01:11,030 --> 00:01:13,939
+cloud that you can use if you don't have a very good laptop, and so on.
19
+00:01:13,938 --> 00:01:17,250
+I won't go into detail about it now, but I'd just like to point out that for the first

20
+00:01:17,250 --> 00:01:21,090
+assignment we assume that you'll be relatively familiar with Python. You'll

21
+00:01:21,090 --> 00:01:24,859
+be writing these optimized numpy expressions, where you manipulate

22
+00:01:24,859 --> 00:01:28,438
+matrices and vectors in very efficient forms. So, for example, if you're

23
+00:01:28,438 --> 00:01:31,908
+seeing this code and it doesn't mean anything to you, then please have a look

24
+00:01:31,909 --> 00:01:35,880
+at our Python tutorial, which is up on the website as well. It's written by Justin

25
+00:01:35,879 --> 00:01:39,489
+and is very good, so go through it and familiarize yourself with the

26
+00:01:39,489 --> 00:01:42,328
+notation, because you'll be seeing and writing a lot of code that looks like

27
+00:01:42,328 --> 00:01:47,048
+this, where we do all these optimized operations so they're fast enough to run

28
+00:01:47,049 --> 00:01:51,610
+on the CPU. Now, in terms of terminal, basically what this amounts to is that

29
+00:01:51,609 --> 00:01:54,599
+we'll give you a link for the assignment; you'll go to a web page and you'll see

30
+00:01:54,599 --> 00:01:58,309
+something like this. This is a virtual machine in the cloud that has been set

31
+00:01:58,310 --> 00:02:01,420
+up with all the dependencies of the assignment already installed,

32
+00:02:01,420 --> 00:02:05,618
+and the data is already there. So you click on 'launch machine' and this will

33
+00:02:05,618 --> 00:02:09,580
+basically bring you to something like this, running in your browser;

34
+00:02:09,580 --> 00:02:13,060
+it's basically a thin layer on top of an AWS

35
+00:02:13,060 --> 00:02:17,209
+machine, a UI layer, and so you have an IPython notebook and a little

36
+00:02:17,209 --> 00:02:20,739
+terminal, and you can look around; this is just like a machine in the cloud.

37
+00:02:20,739 --> 00:02:24,310
+They have some CPU offerings and they also have some GPU machines that you can

38
+00:02:24,310 --> 00:02:25,539
+use, and so on.

39
+00:02:25,539 --> 00:02:29,090
+Normally you have to pay for terminal, but we'll be distributing credits to you:

40
+00:02:29,090 --> 00:02:33,709
+you just talk to a specific TA, whom we will decide on in a bit; you email the TA and

41
+00:02:33,709 --> 00:02:36,950
+ask for credits, we'll send you credits, and we keep track of how much we've sent to

42
+00:02:36,949 --> 00:02:40,799
+everyone, so you have to be responsible with the funds. So this is

43
+00:02:40,800 --> 00:02:55,689
+also an option for you to use if you like. Any other details you can read

44
+00:02:55,689 --> 00:02:57,680
+up on if you like; it's not required for your assignment,
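
[Aside: to give a feel for the kind of vectorized numpy expression just mentioned. The snippet is illustrative, not taken from the assignment: it computes all pairwise Euclidean distances between two sets of vectors with no Python loops.]

import numpy as np

X = np.random.randn(500, 3072)   # e.g. 500 images, 32*32*3 values each
Y = np.random.randn(50, 3072)
# Expand (x - y)^2 = x.x - 2 x.y + y.y and broadcast over all pairs at once.
d2 = (X**2).sum(1)[:, None] - 2 * X @ Y.T + (Y**2).sum(1)[None, :]
dists = np.sqrt(np.maximum(d2, 0))  # clamp tiny negatives from round-off
print(dists.shape)                  # (500, 50)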
45
+00:02:57,680 --> 00:03:03,879
+but you can probably figure it out. OK, that's it for the admin; let's get to the lecture now.

46
+00:03:03,879 --> 00:03:07,870
+Today we'll be talking about image classification, and especially we'll

47
+00:03:07,870 --> 00:03:13,219
+start off on linear classifiers. When we talk about image classification, the basic

48
+00:03:13,219 --> 00:03:17,560
+task is that we have some number of categories, say dog, cat, truck, plane and so

49
+00:03:17,560 --> 00:03:20,799
+on; we get to decide what these are. The classifier then has to take an image,

50
+00:03:20,799 --> 00:03:24,950
+which is a giant grid of numbers, and transform it into one of these

51
+00:03:24,949 --> 00:03:29,169
+labels; we have to bin it into one of the categories. We will spend

52
+00:03:29,169 --> 00:03:32,548
+most of our time talking about this problem specifically, but if you'd like to do any

53
+00:03:32,549 --> 00:03:36,349
+other task in computer vision, such as detection, image captioning, segmentation

54
+00:03:36,349 --> 00:03:40,108
+or whatever else, you'll find that once you know about image classification and how

55
+00:03:40,109 --> 00:03:43,569
+it's done, everything else is just a tiny delta built on top of it. So you'll be in

56
+00:03:43,568 --> 00:03:47,060
+a great position to do any of the other tasks, so it's really good for conceptual

57
+00:03:47,060 --> 00:03:50,840
+understanding, and we'll work through it as a specific example to simplify

58
+00:03:50,840 --> 00:03:54,819
+things in the beginning. Now, why is this problem hard? Just to give you an idea: the

59
+00:03:54,818 --> 00:03:58,518
+problem is what we refer to as the semantic gap. This image here is a giant

60
+00:03:58,519 --> 00:04:01,739
+grid of numbers. The way images are represented in the computer is

61
+00:04:01,739 --> 00:04:06,299
+basically, say, roughly 300 by 100 by 3 pixel values, a three-

62
+00:04:06,299 --> 00:04:09,620
+dimensional array, where the three comes from the three color channels, red, green and blue.

63
+00:04:09,620 --> 00:04:13,590
+So when you zoom in on a part of that image, it's basically a giant grid of

64
+00:04:13,590 --> 00:04:18,728
+numbers between 0 and 255. That's what we have to work with. These numbers

65
+00:04:18,728 --> 00:04:21,370
+indicate the amount of brightness in each of the three color channels at every

66
+00:04:21,370 --> 00:04:25,569
+single position in the image, and that is why image classification is

67
+00:04:25,569 --> 00:04:26,269
+difficult:

68
+00:04:26,269 --> 00:04:29,519
+when you think about what we have to work with, it's millions of

69
+00:04:29,519 --> 00:04:33,899
+numbers of that form, and we have to classify things like cats from them. It quickly

70
+00:04:33,899 --> 00:04:38,339
+becomes apparent how complex the task is. For example, the camera can be

71
+00:04:38,339 --> 00:04:42,689
+rotated around this cat, it can be zoomed in, it can be shifted; the

72
+00:04:42,689 --> 00:04:46,769
+focal properties and the translation of the camera can be different. Think about

73
+00:04:46,769 --> 00:04:49,769
+what happens to the brightness values in that grid as you actually do all

74
+00:04:49,769 --> 00:04:52,779
+these transformations with a camera: they completely shift, all the patterns are

75
+00:04:52,779 --> 00:04:56,559
+changing, and we have to be robust to all of this. There are also many other

76
+00:04:56,560 --> 00:05:00,709
+challenges, for example changes of illumination. Here we have a cat,

77
+00:05:00,709 --> 00:05:07,728
+a white cat; we actually have two of them, and you can see that one cat is

78
+00:05:07,728 --> 00:05:11,098
+clearly in shade quite a bit and the other is not, but you can still recognize

79
+00:05:11,098 --> 00:05:14,750
+the two cats. So think again about the brightness values at the level of the

80
+00:05:14,750 --> 00:05:18,329
+grid and what happens to them as the lighting changes, and about all

81
+00:05:18,329 --> 00:05:21,279
+the possible lighting schemes that we can have in the world: we have to be robust to all of that.
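
[Aside: to make the 'giant grid of numbers' concrete. The snippet assumes the pillow package, and 'cat.jpg' is a placeholder path, not a course file.]

import numpy as np
from PIL import Image

img = np.array(Image.open("cat.jpg"))  # height x width x 3 array of values in 0..255
print(img.shape, img.dtype)            # e.g. (100, 300, 3) uint8
print(img[0, 0])                       # R, G, B brightness at the top-left pixel
# Brightening the scene shifts every number in the grid, which is why the
# classifier has to be robust to illumination changes.
brighter = np.clip(img.astype(np.int32) + 60, 0, 255).astype(np.uint8)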
+deformation many classes come in lots of
+
+83
+00:05:28,180 --> 00:05:33,668
+strange arrangements these objects we'd like
+to recognize so cats come in very
+
+84
+00:05:33,668 --> 00:05:37,468
+different poses by the way when I create
+these slides they're quite dry there's a
+
+85
+00:05:37,468 --> 00:05:41,449
+lot of math and science so this is the only
+time I get to have fun that's what I
+
+86
+00:05:41,449 --> 00:05:45,939
+do so somehow we have to be
+robust to all of these deformations
+
+87
+00:05:45,939 --> 00:05:50,189
+you can still recognize a cat in all
+of these images despite the strange poses
+
+88
+00:05:50,189 --> 00:05:54,240
+there's also occlusion sometimes we might not see the
+full object but you still recognize that's a
+
+89
+00:05:54,240 --> 00:06:00,340
+cat there's a cat behind a water bottle and
+there's also a cat there inside a couch
+
+90
+00:06:00,339 --> 00:06:06,068
+even though you're seeing just tiny
+pieces of this class basically there are
+
+91
+00:06:06,069 --> 00:06:10,500
+problems of background clutter so things
+can blend into the environment we have
+
+92
+00:06:10,500 --> 00:06:15,300
+to be robust to that and there's also
+intra-class variation so for cats
+
+93
+00:06:15,300 --> 00:06:19,728
+there's a huge number of cat
+species and so they can look different
+
+94
+00:06:19,728 --> 00:06:23,240
+ways and we have to be robust to all of that so I'd
+just like you to appreciate the
+
+95
+00:06:23,240 --> 00:06:26,718
+complexity of the task any one of these
+considered independently is difficult
+
+96
+00:06:26,718 --> 00:06:31,908
+but when you consider the cross product
+of all these different things and have
+
+97
+00:06:31,908 --> 00:06:35,769
+to work across all of that it's actually
+quite amazing that anything works at all
+
+98
+00:06:35,769 --> 00:06:39,539
+in fact not only does it work but it
+works really really well almost at human
+
+99
+00:06:39,540 --> 00:06:43,740
+accuracy for categories like this and we
+can do that in a few dozen milliseconds
+
+100
+00:06:43,740 --> 00:06:49,040
+with the current technology and so
+that's what you'll learn about in this class
+
+101
+00:06:49,040 --> 00:06:54,390
+so what does a classifier look like basically we're
+taking this 3D array and we'd like
+
+102
+00:06:54,389 --> 00:06:57,539
+to produce a class label and what I'd
+like you to notice is that there is no
+
+103
+00:06:57,540 --> 00:07:01,569
+obvious way of actually hard-coding
+any of these classifiers right
+
+104
+00:07:01,569 --> 00:07:04,790
+there's no simple algorithm like say
+you're taking an algorithms class early
+
+105
+00:07:04,790 --> 00:07:08,379
+in a computer science curriculum you're writing
+bubble sort or you're writing something
+
+106
+00:07:08,379 --> 00:07:11,939
+else to do some particular task you can
+intuit all the possible steps and you
+
+107
+00:07:11,939 --> 00:07:15,300
+can enumerate them and list them and
+play with it and analyze it but here
+
+108
+00:07:15,300 --> 00:07:18,530
+there's no algorithm for detecting a cat
+under all these variations or it's
+
+109
+00:07:18,529 --> 00:07:21,509
+extremely difficult to think about how
+you'd actually write that up what is the
+
+110
+00:07:21,509 --> 00:07:26,039
+sequence of operations you would do on an
+arbitrary image to detect a cat that's
+
+111
+00:07:26,040 --> 00:07:28,629
+not to say that people haven't tried
+especially in the early days of computer vision
+
+112
+00:07:28,629 --> 00:07:32,719
+there were these explicit approaches as
+I'd like to call them where you
+think
+
+113
+00:07:32,720 --> 00:07:37,240
+okay a cat has ears so maybe we
+look for little ear
+
+114
+00:07:37,240 --> 00:07:40,910
+pieces so what we'll do is we'll detect
+all the edges we'll trace the edges we'll
+
+115
+00:07:40,910 --> 00:07:45,380
+classify the different types of edges
+and their junctions we'll create you know
+
+116
+00:07:45,379 --> 00:07:48,350
+libraries of these and we'll try to find
+their arrangements and if we ever see
+
+117
+00:07:48,350 --> 00:07:52,150
+anything like that we'll detect the cat or if
+we see any particular texture of some
+
+118
+00:07:52,149 --> 00:07:55,899
+particular frequencies we'll detect the
+cat so you can come up with some rules
+
+119
+00:07:55,899 --> 00:07:59,870
+but the problem is that once I tell you
+okay now I'd like you to recognize a
+
+120
+00:07:59,870 --> 00:08:03,569
+boat or a person you have to go back to
+the drawing board and you have to be like ok
+
+121
+00:08:03,569 --> 00:08:06,719
+what makes a boat exactly what are the
+edge pieces right it's a completely
+
+122
+00:08:06,720 --> 00:08:11,590
+unscalable approach to classification so
+the approach we're adopting in this class an
+
+123
+00:08:11,589 --> 00:08:16,699
+approach that works much better is the
+data-driven approach which comes from the
+
+124
+00:08:16,699 --> 00:08:20,170
+framework of machine learning and just
+to point out that in those days actually
+
+125
+00:08:20,170 --> 00:08:23,840
+in the early days they did not have the
+luxury of using data because at that
+
+126
+00:08:23,839 --> 00:08:27,060
+point in time you're taking your
+grayscale images of very low resolution
+
+127
+00:08:27,060 --> 00:08:30,250
+and you're trying to recognize
+things it's obviously not going to work
+
+128
+00:08:30,250 --> 00:08:33,769
+but with the availability of the Internet and
+huge amounts of data I can search for
+
+129
+00:08:33,769 --> 00:08:38,460
+example for cat on Google and I get lots
+of cats everywhere and we know that
+
+130
+00:08:38,460 --> 00:08:42,840
+these are cats based on the surrounding
+text in the web pages so there's a lot
+
+131
+00:08:42,840 --> 00:08:46,060
+of data so the way that this now looks
+is that we have a training phase
+
+132
+00:08:46,059 --> 00:08:49,079
+where you give me lots of training
+examples of cats
+
+133
+00:08:49,080 --> 00:08:52,900
+and you tell me that they're cats you
+give me lots of examples of any
+
+134
+00:08:52,899 --> 00:08:54,230
+other category you're interested in
+
+135
+00:08:54,230 --> 00:08:59,920
+then I go away and I train a model the
+model is our classifier and I can then use that
+
+136
+00:08:59,919 --> 00:09:04,250
+model to actually classify new data so
+when I'm given a new image I can look at
+
+137
+00:09:04,250 --> 00:09:07,500
+my training data and I can do something
+with it based on just pattern
+
+138
+00:09:07,500 --> 00:09:13,759
+matching and statistics or so on as
+a simple example that works within this
+
+139
+00:09:13,759 --> 00:09:17,279
+framework consider the nearest neighbor
+classifier the way the nearest neighbor
+
+140
+00:09:17,279 --> 00:09:20,939
+classifier works is that effectively
+we're given the training data and what we'll
+
+141
+00:09:20,940 --> 00:09:23,970
+do at training time is we'll just remember
+all the training data so I have all the
+
+142
+00:09:23,970 --> 00:09:27,820
+training data just stored here and I
+remember it now when you give me a test
+
+143
+00:09:27,820 --> 00:09:32,060
+image what we'll do is we'll compare the
+test image to every single
+one of the
+
+144
+00:09:32,059 --> 00:09:36,729
+images we saw in the training data and we'll
+just transfer the label over so I'll
+
+145
+00:09:36,730 --> 00:09:41,149
+just look through all the images we'll
+work with a specific case as I go through
+
+146
+00:09:41,149 --> 00:09:43,740
+this I like to be as concrete as
+possible so we'll work with the specific
+
+147
+00:09:43,740 --> 00:09:47,740
+case of something called the CIFAR-10
+dataset we'll use it for today it has 10
+
+148
+00:09:47,740 --> 00:09:53,129
+labels there are 50,000 training
+images that you have access to and then
+
+149
+00:09:53,129 --> 00:09:57,159
+there's a test set of 10,000 images
+where we're going to evaluate how well
+
+150
+00:09:57,159 --> 00:10:00,669
+the classifier is working and these images
+are quite tiny it's a
+
+151
+00:10:00,669 --> 00:10:05,009
+dataset of little 32 by 32 thumbnail
+images so the way the nearest neighbor
+
+152
+00:10:05,009 --> 00:10:07,809
+classifier would work is we take all
+this training data that's given to us
+
+153
+00:10:07,809 --> 00:10:12,589
+fifty thousand images and
+suppose we have these ten different examples
+
+154
+00:10:12,590 --> 00:10:15,920
+here as our test images along the first
+column here what we'll do is we'll look
+
+155
+00:10:15,919 --> 00:10:19,909
+up the nearest neighbors in the training
+set the things that are most similar to
+
+156
+00:10:19,909 --> 00:10:24,139
+every one of those test images independently
+so there you see a ranked list of images
+
+157
+00:10:24,139 --> 00:10:30,220
+in the training data that are most
+similar to every one
+
+158
+00:10:30,220 --> 00:10:32,700
+of those test images over there so in
+the first row we see that there's a
+
+159
+00:10:32,700 --> 00:10:36,230
+truck I think as the test image and
+there's quite a few images that look
+
+160
+00:10:36,230 --> 00:10:40,490
+similar to it we'll see exactly how
+we define similarity in a bit but you can
+
+161
+00:10:40,490 --> 00:10:44,269
+see that the first retrieved result is in
+fact a horse not a truck and that's
+
+162
+00:10:44,269 --> 00:10:48,289
+because of just the arrangement of the
+blue sky it was thrown off so you can
+
+163
+00:10:48,289 --> 00:10:52,480
+see that this will probably not work
+very well so how do we define the distance
+
+164
+00:10:52,480 --> 00:10:55,470
+metric how do we actually do the
+comparison there are several ways one of
+
+165
+00:10:55,470 --> 00:10:59,940
+the simplest ones might be the Manhattan
+distance and I'll use the L1 and Manhattan
+
+166
+00:10:59,940 --> 00:11:01,180
+distance
+
+167
+00:11:01,179 --> 00:11:04,429
+terms interchangeably what it
+does is you have a test image you're
+
+168
+00:11:04,429 --> 00:11:07,639
+interested in classifying and consider
+one single training image that we want
+
+169
+00:11:07,639 --> 00:11:11,919
+to compare this image to what we'll
+do is we'll element-wise compare all
+
+170
+00:11:11,919 --> 00:11:15,959
+the pixel values so we'll form the
+absolute value differences and then we
+
+171
+00:11:15,960 --> 00:11:20,040
+just add all that up so we're just looking
+at every single position subtracting
+
+172
+00:11:20,039 --> 00:11:24,139
+it off seeing what the differences are
+at every single spatial position adding it
+
+173
+00:11:24,139 --> 00:11:30,169
+all up and that's our similarity so
+these two images are 456 apart and
+
+174
+00:11:30,169 --> 00:11:33,809
+we'd get a zero if we had identical
+images
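+A minimal numpy sketch of this L1 (Manhattan) distance, using made-up pixel
+values rather than the slide's actual images:
+
+```python
+import numpy as np
+
+# two tiny made-up "images" standing in for the slide's example
+I1 = np.array([[56, 32], [10, 25]], dtype=np.int64)
+I2 = np.array([[10, 20], [24, 17]], dtype=np.int64)
+
+# element-wise absolute differences, summed over all pixel positions
+d1 = np.sum(np.abs(I1 - I2))
+print(d1)  # 80 for these made-up values
+```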
+
+175
+00:11:33,809 --> 00:11:36,959
+now just to show you the code this is
+a full implementation of a
+
+176
+00:11:36,960 --> 00:11:42,930
+nearest neighbor classifier where I've
+filled in the actual body of the two methods I
+
+177
+00:11:42,929 --> 00:11:46,799
+talked about and what we do here at
+training time is we're given this
+
+178
+00:11:46,799 --> 00:11:52,709
+dataset X and y which usually denote the
+data and the labels so given data and labels all
+
+179
+00:11:52,710 --> 00:11:56,530
+we do is just assign them to fields of the
+class instance so we just remember the
+
+180
+00:11:56,529 --> 00:12:01,439
+data nothing else is being done at predict
+time though what we're doing here is
+
+181
+00:12:01,440 --> 00:12:06,080
+we're getting a new test set of images X
+and I'm not going to go through the full
+
+182
+00:12:06,080 --> 00:12:09,320
+details but you can see there's a for
+loop over every single test image
+
+183
+00:12:09,320 --> 00:12:13,020
+independently we're getting the
+distances to every single training image
+
+184
+00:12:13,019 --> 00:12:18,360
+and notice that that's only a single
+line of vectorized numpy code so in
+
+185
+00:12:18,360 --> 00:12:21,750
+a single line of code we're comparing
+that test image to every single training
+
+186
+00:12:21,750 --> 00:12:26,370
+image in the database computing the
+distance from the previous slide and I think
+
+187
+00:12:26,370 --> 00:12:30,720
+that's quite nice that's concise code we didn't
+have to write out all those for loops that
+
+188
+00:12:30,720 --> 00:12:35,860
+would be involved in processing the pixels
+and then we compute the instance that is
+
+189
+00:12:35,860 --> 00:12:40,659
+closest so we're getting the min index
+the index of the training example that has
+
+190
+00:12:40,659 --> 00:12:45,719
+the lowest distance and then we're just
+predicting for this image the label of
+
+191
+00:12:45,720 --> 00:12:51,210
+whatever was closest so here's a question for you in
+terms of the nearest neighbor classifier
+
+192
+00:12:51,210 --> 00:12:56,639
+how does its speed depend on the
+training data size what happens as I
+
+193
+00:12:56,639 --> 00:13:02,779
+scale up the training data slower
+
+194
+00:13:02,779 --> 00:13:07,789
+yes it's actually it's actually really
+slow right because I just have
+
+195
+00:13:07,789 --> 00:13:12,129
+to compare to every single training example
+independently so it slows down linearly
+
+196
+00:13:12,129 --> 00:13:16,370
+and actually as we'll see as we go through the
+class this is actually backwards
+
+197
+00:13:16,370 --> 00:13:19,590
+because what we really care about in
+most practical applications is we care
+
+198
+00:13:19,590 --> 00:13:23,330
+about the test time performance of these
+classifiers that means that we want the
+
+199
+00:13:23,330 --> 00:13:27,240
+classifier to be very efficient at test time
+and so there's a tradeoff between
+
+200
+00:13:27,240 --> 00:13:30,419
+how much compute we put into the train
+method and how much we put into the predict
+
+201
+00:13:30,419 --> 00:13:35,240
+method a nearest neighbor is instant to train but
+then it's expensive at test time and as we'll
+
+202
+00:13:35,240 --> 00:13:38,570
+see soon convnets actually flip this
+completely the other way around
+
+203
+00:13:38,570 --> 00:13:41,510
+we'll see that we do a huge amount of
+compute at train time we'll be training
+
+204
+00:13:41,509 --> 00:13:45,409
+a convolutional network but test time
+performance will be super efficient
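+A sketch of the nearest neighbor classifier just walked through, assuming
+numpy, an X of shape N x D with one flattened image per row, and a label
+vector y of length N:
+
+```python
+import numpy as np
+
+class NearestNeighbor:
+    def train(self, X, y):
+        # nearest neighbor "training" just memorizes all the data
+        self.Xtr = X
+        self.ytr = y
+
+    def predict(self, X):
+        num_test = X.shape[0]
+        Ypred = np.zeros(num_test, dtype=self.ytr.dtype)
+        for i in range(num_test):
+            # one vectorized line: L1 distance to every training image
+            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
+            min_index = np.argmin(distances)  # index of the closest training image
+            Ypred[i] = self.ytr[min_index]    # transfer its label
+        return Ypred
+```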
+205
+00:13:45,409 --> 00:13:49,589
+in fact it will be a constant amount of
+compute for every single test image a constant
+
+206
+00:13:49,590 --> 00:13:53,149
+amount of computation no matter if you
+have millions billions or trillions of
+
+207
+00:13:53,149 --> 00:13:57,669
+training images I could have a
+trillion trillion trillion training images
+
+208
+00:13:57,669 --> 00:14:01,579
+no matter how large the training dataset
+is we'll do a constant amount of compute to
+
+209
+00:14:01,580 --> 00:14:05,250
+classify any single test example so
+that's very nice practically speaking
+
+210
+00:14:05,250 --> 00:14:10,370
+now I'd just like to point out that
+there are ways of speeding up nearest neighbor
+
+211
+00:14:10,370 --> 00:14:13,669
+classifiers there are these approximate
+nearest neighbor methods FLANN is an
+
+212
+00:14:13,669 --> 00:14:16,879
+example library that people use in
+practice that allows you to speed up
+
+213
+00:14:16,879 --> 00:14:22,909
+this process of nearest neighbor
+matching but that's just a side note ok
+
+214
+00:14:22,909 --> 00:14:27,490
+so let's go back to the design of this
+classifier we saw that we've defined
+
+215
+00:14:27,490 --> 00:14:32,200
+this distance and I arbitrarily chose
+to show you the Manhattan distance which
+
+216
+00:14:32,200 --> 00:14:35,720
+compares the absolute value differences
+there are in fact many ways you can
+
+217
+00:14:35,720 --> 00:14:38,879
+formulate a distance metric and so
+there are many different choices of
+
+218
+00:14:38,879 --> 00:14:42,700
+exactly how we do this comparison
+another choice that people
+
+219
+00:14:42,700 --> 00:14:46,000
+like to use in practice is what we call
+the Euclidean or L2 distance which
+
+220
+00:14:46,000 --> 00:14:49,850
+instead sums up the
+squares of these differences
+
+221
+00:14:49,850 --> 00:14:55,690
+between images and so this choice
+
+222
+00:14:55,690 --> 00:15:02,730
+there was a question over there in the back
+
+223
+00:15:02,730 --> 00:15:07,850
+ok so this choice of how exactly we
+compute the distance is a discrete choice
+
+224
+00:15:07,850 --> 00:15:11,769
+that we have control over that's something
+we call a hyperparameter it's not really
+
+225
+00:15:11,769 --> 00:15:14,990
+obvious how you set it it's a hyper
+parameter we have to decide later on
+
+226
+00:15:14,990 --> 00:15:19,120
+exactly how to set this somehow another
+sort of hyperparameter that I'll talk
+
+227
+00:15:19,120 --> 00:15:22,828
+about in the context of this classifier is when
+we generalize nearest neighbor to
+
+228
+00:15:22,828 --> 00:15:26,159
+what we call a k-nearest neighbor
+classifier so in a k-nearest neighbor
+
+229
+00:15:26,159 --> 00:15:29,328
+classifier instead of retrieving for
+every test image the single nearest
+
+230
+00:15:29,328 --> 00:15:33,958
+training example we'll in fact retrieve
+several examples and we'll have them do a
+
+231
+00:15:33,958 --> 00:15:37,069
+majority vote over the labels to
+actually classify every test instance
+
+232
+00:15:37,070 --> 00:15:41,829
+so say for 5-nearest neighbor we would be retrieving
+the five most similar images in the
+
+233
+00:15:41,828 --> 00:15:45,528
+training data and doing a majority vote
+over the labels here's a simple
+
+234
+00:15:45,528 --> 00:15:48,970
+two-dimensional dataset to illustrate
+the point so here we have a three-class
+
+235
+00:15:48,970 --> 00:15:53,430
+dataset in 2D and here I am drawing
+what we call the decision regions of this
+
+236
+00:15:53,429 --> 00:15:57,429
+nearest neighbor classifier what
+this refers to is we've trained
+237
+00:15:57,429 --> 00:16:02,838
+on those points over there and we're coloring the
+entire 2D plane by what class this nearest
+
+238
+00:16:02,839 --> 00:16:05,430
+neighbor classifier would assign to
+every single point so
+
+239
+00:16:05,429 --> 00:16:08,698
+suppose you had a test example somewhere
+here then we're just saying that it would
+
+240
+00:16:08,698 --> 00:16:12,549
+have been classified as the blue class based
+on the nearest neighbor and one thing to
+
+241
+00:16:12,549 --> 00:16:16,708
+note is that here is a point that is a
+green point inside the blue cluster and
+
+242
+00:16:16,708 --> 00:16:19,708
+it has its own little region
+where it would have classified a lot of
+
+243
+00:16:19,708 --> 00:16:23,750
+test points around it as green because
+if anything falls there then that
+
+244
+00:16:23,750 --> 00:16:27,879
+green point is the nearest neighbor now
+when you move to higher numbers for k
+
+245
+00:16:27,879 --> 00:16:30,809
+such as a 5-nearest neighbor classifier
+what you find is that the boundaries
+
+246
+00:16:30,809 --> 00:16:36,619
+start to smooth out it's a kind of nice
+effect where even though there's just one
+
+247
+00:16:36,620 --> 00:16:37,339
+point
+
+248
+00:16:37,339 --> 00:16:41,550
+sitting kind of randomly as noise or an outlier
+in the blue cluster it's actually not
+
+249
+00:16:41,549 --> 00:16:44,539
+influencing the predictions too much
+because we're always retrieving five
+
+250
+00:16:44,539 --> 00:16:49,679
+nearest neighbors and so they get to
+overwhelm the green point so in practice
+
+251
+00:16:49,679 --> 00:16:53,088
+you'll find that usually k-nearest neighbor
+classifiers offer better
+
+252
+00:16:53,089 --> 00:16:58,180
+performance at test time but again the
+choice of k is again a hyperparameter
+
+253
+00:16:58,179 --> 00:17:03,088
+right so I'll come back to this in a bit
+just to show you an example of what this looks
+
+254
+00:17:03,089 --> 00:17:06,169
+like here I'm retrieving the ten most similar
+examples they're ranked by their
+
+255
+00:17:06,169 --> 00:17:08,939
+distance and we would actually do a
+majority vote over these training
+
+256
+00:17:08,939 --> 00:17:13,089
+examples here to classify every test
+example here
+
+257
+00:17:13,088 --> 00:17:20,649
+ok so let's do a bit of questions here
+just consider what is the accuracy of
+
+258
+00:17:20,650 --> 00:17:24,259
+the nearest neighbor classifier on the
+training data when we're using Euclidean
+
+259
+00:17:24,259 --> 00:17:29,700
+distance so suppose our test set is
+exactly the training data and we're
+
+260
+00:17:29,700 --> 00:17:32,580
+trying to find the accuracy in other
+words how often would we get
+
+261
+00:17:32,579 --> 00:17:34,750
+the correct answer
+
+262
+00:17:34,750 --> 00:17:44,808
+a hundred percent good ok
+yeah that's correct we'd always find
+
+263
+00:17:44,808 --> 00:17:48,450
+a training example exactly on top of that
+test image which has zero distance and
+
+264
+00:17:48,450 --> 00:17:52,870
+then its label will be transferred over
+what if we're using the Manhattan
+
+265
+00:17:52,869 --> 00:18:00,949
+distance instead
+
+266
+00:18:00,950 --> 00:18:04,680
+the Manhattan distance doesn't do sums of
+squares it uses absolute values
+
+267
+00:18:04,680 --> 00:18:12,110
+of the differences but it would still be
+a hundred percent the distance is still
+
+268
+00:18:12,109 --> 00:18:14,169
+zero only for identical images
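+A rough sketch of the k-nearest neighbor prediction described above for a
+single test image x, assuming the same Xtr, ytr arrays as before and
+non-negative integer labels:
+
+```python
+import numpy as np
+
+def predict_knn(Xtr, ytr, x, k=5):
+    distances = np.sum(np.abs(Xtr - x), axis=1)  # L1 distance to every training image
+    nearest = np.argsort(distances)[:k]          # indices of the k closest images
+    votes = np.bincount(ytr[nearest])            # count labels among the k neighbors
+    return np.argmax(votes)                      # majority vote wins
+```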
+269
+00:18:14,170 --> 00:18:18,820
+ok what is the accuracy of the k-nearest
+neighbor classifier on the training data
+
+270
+00:18:18,819 --> 00:18:25,339
+say with k equals 5 is it a hundred
+percent not necessarily right because
+
+271
+00:18:25,339 --> 00:18:29,230
+basically the points around you could
+overwhelm you even though your nearest
+
+272
+00:18:29,230 --> 00:18:35,269
+example is actually of the correct class ok so
+we've discussed two different
+
+273
+00:18:35,269 --> 00:18:39,740
+hyperparameters we have the distance metric
+and we have k and we're not sure how
+
+274
+00:18:39,740 --> 00:18:45,160
+to set them should k be 1 2 3 10 and so on
+we're not exactly sure how to set these
+
+275
+00:18:45,160 --> 00:18:48,750
+in fact they're problem dependent you'll
+find that you can't find a consistently
+
+276
+00:18:48,750 --> 00:18:52,250
+best choice for these hyperparameters in
+some applications some k might work
+
+277
+00:18:52,250 --> 00:18:56,930
+better than in other applications so we're
+not really sure how to set this so
+
+278
+00:18:56,930 --> 00:19:00,799
+here's an idea we could basically try
+out lots of different hyperparameters so what I'm
+
+279
+00:19:00,799 --> 00:19:05,649
+going to do is I'm going to take my training
+data and then I'm going to try out lots
+
+280
+00:19:05,650 --> 00:19:11,550
+of different choices so I might
+try out k equals 1 2 3 4 5 6 up to 100 I
+
+281
+00:19:11,549 --> 00:19:14,529
+try all the different distance metrics and
+whatever works best that's what I'll
+
+282
+00:19:14,529 --> 00:19:26,670
+take so would that work very well
+why is it not a good idea ok
+
+283
+00:19:26,670 --> 00:19:36,170
+so basically yes the test
+data is your proxy for the
+
+284
+00:19:36,170 --> 00:19:40,039
+generalization of your algorithm you
+should not touch the test data in
+
+285
+00:19:40,039 --> 00:19:43,509
+fact you should forget that you ever
+had the test data so when someone gives
+
+286
+00:19:43,509 --> 00:19:46,079
+you a dataset always set aside the
+test data pretend you don't have it
+
+287
+00:19:46,079 --> 00:19:50,129
+it tells you how well your algorithm
+generalizes to unseen data points and
+
+288
+00:19:50,130 --> 00:19:52,730
+that's important because you're trying to
+develop your algorithm and then you're
+
+289
+00:19:52,730 --> 00:19:56,120
+hoping to eventually deploy it in some
+setting and you'd like an understanding of
+
+290
+00:19:56,119 --> 00:20:01,159
+exactly how well you'd expect it to
+work in practice right and so you'll see
+
+291
+00:20:01,160 --> 00:20:03,830
+that for example sometimes you can
+perform very very well on training data but
+
+292
+00:20:03,829 --> 00:20:05,579
+not generalize very well to the test data when
+
+293
+00:20:05,579 --> 00:20:08,659
+you're overfitting a lot of
+this by the way CS229 is a requirement for
+
+294
+00:20:08,660 --> 00:20:11,750
+this class so you should be quite
+familiar with this already to a large
+
+295
+00:20:11,750 --> 00:20:16,519
+extent this is more of a
+review for you but basically this test
+
+296
+00:20:16,519 --> 00:20:20,940
+data is used very sparingly forget that
+you have it instead what we do is we
+
+297
+00:20:20,940 --> 00:20:25,930
+separate our training data into what we
+call folds so we separate say we use a
+
+298
+00:20:25,930 --> 00:20:29,900
+five-fold validation so we use twenty
+percent of the training data as an
+
+299
+00:20:29,900 --> 00:20:35,120
+imagined test set a validation set and then we
+only train on part of it and we test
+
+300
+00:20:35,119 --> 00:20:39,279
+our hyperparameter choices primarily
+on this validation set so I'm going to
+
+301
+00:20:39,279 --> 00:20:42,569
+train on my four folds and try out
+different k's and different distance
+
+302
+00:20:42,569 --> 00:20:45,329
+metrics and whatever else if you're
+using approximate nearest neighbor you have
+
+303
+00:20:45,329 --> 00:20:48,750
+many other choices you try it all out see
+what works best on that validation data
+
+304
+00:20:48,750 --> 00:20:51,859
+if you're feeling uncomfortable because
+you have very few training data points
+
+305
+00:20:51,859 --> 00:20:54,939
+people also sometimes use
+cross-validation where you actually
+
+306
+00:20:54,940 --> 00:20:58,640
+iterate the choice of your
+validation fold across these choices
+
+307
+00:20:58,640 --> 00:21:03,840
+so I'll first use folds 1 to 4 for my
+training and try out on fold 5 and then I
+
+308
+00:21:03,839 --> 00:21:07,519
+cycle the choice of the validation
+fold across all the five choices and I
+
+309
+00:21:07,519 --> 00:21:11,789
+look at what works best across all the
+possible choices of my validation fold and
+
+310
+00:21:11,789 --> 00:21:14,839
+then I just take whatever works best
+across all the possible scenarios
+
+311
+00:21:14,839 --> 00:21:19,039
+that's how you'd run cross-validation
+in practice the
+
+312
+00:21:19,039 --> 00:21:21,769
+way this would look when we cross-validate
+over k for a nearest neighbor
+
+313
+00:21:21,769 --> 00:21:26,049
+classifier is we're trying out
+different values of k and this is our
+
+314
+00:21:26,049 --> 00:21:31,690
+performance across the five choices of the
+fold so you can see that for every
+
+315
+00:21:31,690 --> 00:21:35,759
+single k we have five data points
+there and this is the accuracy so
+
+316
+00:21:35,759 --> 00:21:40,240
+high is good and I'm plotting a line
+through the mean and also showing bars for
+
+317
+00:21:40,240 --> 00:21:44,190
+the standard deviations so what we see here
+is that the performance
+
+318
+00:21:44,190 --> 00:21:49,240
+across these folds goes up as k goes up but at
+some point it starts to decay so for this
+
+319
+00:21:49,240 --> 00:21:53,460
+particular dataset it seems that k equal
+to 7 is the best choice so that's what
+
+320
+00:21:53,460 --> 00:21:58,440
+I'd use and I do this for all my hyperparameters
+also for the distance metric and so on I do my
+
+321
+00:21:58,440 --> 00:22:03,650
+cross-validation once I've fixed
+them I evaluate a single time on the test
+
+322
+00:22:03,650 --> 00:22:07,800
+set and whatever number I get that's
+what I report as the accuracy of a
+
+323
+00:22:07,799 --> 00:22:11,490
+k-nearest neighbor classifier on this dataset
+that's what goes into a paper that's
+
+324
+00:22:11,490 --> 00:22:15,539
+what goes into your final report that's
+the final generalization result of
+
+325
+00:22:15,539 --> 00:22:16,519
+what you've done
+
+326
+00:22:16,519 --> 00:22:36,048
+any questions about this basically it's
+about the statistics of the distribution
+
+327
+00:22:36,048 --> 00:22:42,378
+of these data points in your label
+space and so sometimes it's hard to
+
+328
+00:22:42,378 --> 00:22:47,769
+say exactly but in this picture
+you see roughly what happens as you
+
+329
+00:22:47,769 --> 00:22:52,209
+go to more and more k and it just
+depends on how clumpy your data
+
+330
+00:22:52,209 --> 00:22:55,129
+is that's really what it comes down
+to is how
+
+331
+00:22:55,128 --> 00:23:01,569
+globby it is or how specific it is I know
+that's a very hand-wavy answer
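+A rough sketch of the hyperparameter search over validation folds described
+above; all names are illustrative, and it assumes the Xtr, ytr arrays and the
+predict_knn helper from the earlier sketches:
+
+```python
+import numpy as np
+
+num_folds = 5
+folds_X = np.array_split(Xtr, num_folds)
+folds_y = np.array_split(ytr, num_folds)
+
+for k in [1, 3, 5, 7, 10, 20, 50, 100]:
+    accs = []
+    for i in range(num_folds):  # cycle the validation fold across all five choices
+        X_train = np.concatenate(folds_X[:i] + folds_X[i+1:])
+        y_train = np.concatenate(folds_y[:i] + folds_y[i+1:])
+        preds = np.array([predict_knn(X_train, y_train, x, k) for x in folds_X[i]])
+        accs.append(np.mean(preds == folds_y[i]))
+    print(k, np.mean(accs))  # pick the k with the best mean validation accuracy
+```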
+
+332
+00:23:01,569 --> 00:23:04,769
+but that's roughly what it comes down to so
+different datasets will have different
+
+333
+00:23:04,769 --> 00:23:27,230
+amounts of clumpiness right now
+
+334
+00:23:27,230 --> 00:23:31,769
+because
+
+335
+00:23:31,769 --> 00:23:37,308
+and different datasets will
+require different choices and you need to
+
+336
+00:23:37,308 --> 00:23:40,629
+see what works best you actually try
+out different algorithms
+
+337
+00:23:40,630 --> 00:23:43,580
+you're not sure what's going to work
+best on your data the choice of
+
+338
+00:23:43,579 --> 00:23:47,699
+algorithm is also kind of like a hyperparameter
+so you're just not sure what works
+
+339
+00:23:47,700 --> 00:23:52,019
+different approaches will give different
+
+340
+00:23:52,019 --> 00:23:55,190
+generalization boundaries they look
+different and some datasets have
+
+341
+00:23:55,190 --> 00:23:58,330
+different structure than others some things
+work better than others you
+
+342
+00:23:58,329 --> 00:24:05,298
+just have to try it out ok I'd just like to
+point out that k-nearest neighbor
+
+343
+00:24:05,298 --> 00:24:09,389
+is basically never used no one uses this I'm
+going through this just to get us used
+
+344
+00:24:09,390 --> 00:24:12,480
+to this approach of how this
+works with training data splits and so
+
+345
+00:24:12,480 --> 00:24:13,450
+on
+
+346
+00:24:13,450 --> 00:24:17,610
+the reason this is never used is because
+first of all it's very inefficient but
+
+347
+00:24:17,609 --> 00:24:21,139
+second of all distances on raw
+images which are very high dimensional
+
+348
+00:24:21,140 --> 00:24:28,179
+objects act in very unnatural and
+unintuitive ways what I've done here is taken an
+
+349
+00:24:28,179 --> 00:24:32,370
+original image and I changed it in three
+different ways and all these three
+
+350
+00:24:32,369 --> 00:24:37,168
+different images here actually have the
+exact same distance to this one in an L
+
+351
+00:24:37,169 --> 00:24:42,100
+2 Euclidean sense so just think about
+this one here is slightly shifted to the
+
+352
+00:24:42,099 --> 00:24:46,359
+left it's shifted slightly and the pixels
+here end up completely different because
+
+353
+00:24:46,359 --> 00:24:49,329
+they're not matching up exactly and
+that's introducing all these
+
+354
+00:24:49,329 --> 00:24:53,109
+errors into your distance this one
+is slightly darkened so you get a small
+
+355
+00:24:53,109 --> 00:24:57,629
+delta across all spatial locations and
+this one is untouched so you get zero distance
+
+356
+00:24:57,630 --> 00:25:01,650
+errors everywhere except in those
+positions over there where we've taken
+
+357
+00:25:01,650 --> 00:25:05,900
+out critical pieces of the image and
+the nearest neighbor classifier
+
+358
+00:25:05,900 --> 00:25:08,030
+will not really be able to tell the
+difference between these settings
+
+359
+00:25:08,029 --> 00:25:11,230
+because it's based on these distances
+that don't really work very well in this
+
+360
+00:25:11,230 --> 00:25:16,009
+case so very unintuitive things happen
+when you try to put distances on very
+
+361
+00:25:16,009 --> 00:25:21,349
+high dimensional objects that's partly
+why we don't use it so in summary so far
+
+362
+00:25:21,349 --> 00:25:26,230
+we're looking at image classification
+as a specific case we went through the
+
+363
+00:25:26,230 --> 00:25:29,679
+nearest neighbor and k-nearest
+neighbor classifiers and the idea of
+
+364
+00:25:29,679 --> 00:25:33,110
+having different splits of your data
+and the hyperparameters we need to pick
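+A tiny made-up illustration of how brittle raw pixel distances are: a
+one-pixel shift or a slight darkening of the same image already produces a
+large L2 distance (random values here, not the slide's images):
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+img = rng.integers(0, 256, size=(32, 32)).astype(np.float64)
+
+shifted = np.roll(img, 1, axis=1)      # same content shifted by one pixel
+darkened = np.clip(img - 20, 0, 255)   # same content slightly darkened
+
+l2 = lambda a, b: np.sqrt(np.sum((a - b) ** 2))
+print(l2(img, shifted), l2(img, darkened))  # both large despite tiny visual changes
+```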
+
+365
+00:25:33,109 --> 00:25:37,240
+and we use cross-validation for this
+usually most of the
+
+366
+00:25:37,240 --> 00:25:39,909
+time people don't actually run the entire
+cross-validation they just use a single
+
+367
+00:25:39,909 --> 00:25:40,519
+validation set
+
+368
+00:25:40,519 --> 00:25:43,778
+and they try out on the validation set
+whatever works best in terms of the
+
+369
+00:25:43,778 --> 00:25:47,999
+hyperparameters and once you have the best
+hyperparameters you evaluate a single
+
+370
+00:25:47,999 --> 00:25:54,569
+time on the test set so I'm going to go into
+linear classification but any questions at
+
+371
+00:25:54,569 --> 00:26:04,229
+this point I see great we're going to
+look at linear classification this is the
+
+372
+00:26:04,229 --> 00:26:07,649
+point where we're starting to work
+towards convolutional networks it'll be a
+
+373
+00:26:07,648 --> 00:26:11,148
+series of lectures we'll start with linear
+classification and we'll build up to an
+
+374
+00:26:11,148 --> 00:26:15,888
+entire convolutional network analyzing an
+image I'd just like to say that I motivated
+
+375
+00:26:15,888 --> 00:26:20,178
+the class yesterday from a task-specific
+view this class is a computer vision class
+
+376
+00:26:20,179 --> 00:26:25,489
+interested in giving machines sight
+another way to motivate this class would be
+
+377
+00:26:25,489 --> 00:26:29,409
+from a model-based point of view in the
+sense that we're teaching you guys
+
+378
+00:26:29,409 --> 00:26:34,339
+about neural networks and deep
+learning these are wonderful algorithms
+
+379
+00:26:34,338 --> 00:26:38,178
+that you can apply to many different
+domains not just vision in particular over
+
+380
+00:26:38,179 --> 00:26:42,469
+the last few years we've seen that neural
+networks cannot only see that's what
+
+381
+00:26:42,469 --> 00:26:46,479
+you'll learn a lot about in this class but
+they can also hear they're used quite a bit in
+
+382
+00:26:46,479 --> 00:26:50,828
+speech recognition now so when you talk
+to your phone that's a neural network they can
+
+383
+00:26:50,828 --> 00:26:56,678
+also do machine translation so here you
+are feeding a neural network a set of
+
+384
+00:26:56,679 --> 00:27:00,700
+words one by one in English and the
+neural network produces the translation
+
+385
+00:27:00,700 --> 00:27:05,328
+in French or whatever other target
+language you'd like they can also perform control so
+
+386
+00:27:05,328 --> 00:27:09,308
+we've seen neural network applications
+in robotics and manipulation
+
+387
+00:27:09,308 --> 00:27:14,209
+and in playing Atari games a network learned
+how to play Atari games just by seeing
+
+388
+00:27:14,209 --> 00:27:18,089
+the raw pixels of the screen and they seem
+to be very successful in a
+
+389
+00:27:18,088 --> 00:27:23,878
+variety of domains and even more than
+that and we're uncertain exactly
+
+390
+00:27:23,878 --> 00:27:27,988
+where this will take us and I'd
+like to also say that we're exploring
+
+391
+00:27:27,989 --> 00:27:31,749
+ways for networks to think maybe
+that's just wishful thinking
+
+392
+00:27:31,749 --> 00:27:35,700
+but there are some hints that maybe they
+can do that as well
+
+393
+00:27:35,700 --> 00:27:39,479
+neural networks are very nice because
+they're just fun modular things to play
+
+394
+00:27:39,479 --> 00:27:42,450
+with when I think about working with
+neural networks this picture
+
+395
+00:27:42,450 --> 00:27:46,548
+comes to mind for me here we have a
+neural networks
+practitioner and she's
+
+396
+00:27:46,548 --> 00:27:51,519
+building what looks to be roughly a 10
+layer network at this point
+
+397
+00:27:51,519 --> 00:27:55,269
+it's very fun really the best way to
+think about playing with neural networks is
+
+398
+00:27:55,269 --> 00:27:58,619
+like Lego blocks you'll see that we're
+building these little function pieces
+
+399
+00:27:58,619 --> 00:28:02,579
+these little Lego blocks that we can stick together
+to create entire architectures and they
+
+400
+00:28:02,579 --> 00:28:06,309
+very easily talk to each other and so we
+can just create these modules and
+
+401
+00:28:06,309 --> 00:28:11,519
+stick them together and play with this
+very easily one piece of work that I think
+
+402
+00:28:11,519 --> 00:28:16,039
+exemplifies this is my own work on image
+captioning from roughly a year ago so
+
+403
+00:28:16,039 --> 00:28:20,289
+here the task was to take an image
+and we're trying to get the network to
+
+404
+00:28:20,289 --> 00:28:23,639
+produce a sentence description of the
+image so for example in the top left these
+
+405
+00:28:23,640 --> 00:28:27,810
+are test set results it would say that this
+is a man in a black shirt playing guitar
+
+406
+00:28:27,809 --> 00:28:32,480
+or a construction worker in an orange safety
+vest working on the road and so on so
+
+407
+00:28:32,480 --> 00:28:36,670
+it can look at the image and create
+this description for every single image
+
+408
+00:28:36,670 --> 00:28:41,100
+and when you go into the details of this
+model the way this works is we're taking
+
+409
+00:28:41,099 --> 00:28:45,079
+a convolutional neural network which
+we know so there are two modules here in
+
+410
+00:28:45,079 --> 00:28:49,480
+this system diagram for the image captioning
+model we have a convolutional neural
+
+411
+00:28:49,480 --> 00:28:52,880
+network which we know can see and we're
+taking a recurrent neural network which
+
+412
+00:28:52,880 --> 00:28:56,150
+we know is very good at modeling
+sequences in this case sequences of
+
+413
+00:28:56,150 --> 00:28:59,720
+words that will be describing the image
+and then just as if we were playing with
+
+414
+00:28:59,720 --> 00:29:02,930
+Legos we take those two pieces and we
+stick them together that corresponds to
+
+415
+00:29:02,930 --> 00:29:06,560
+this arrow here in between the two
+modules and these networks learn to
+
+416
+00:29:06,559 --> 00:29:10,639
+talk to each other and in the process of
+trying to describe the images these
+
+417
+00:29:10,640 --> 00:29:13,110
+gradients will be flowing through the
+convolutional network and the whole
+
+418
+00:29:13,109 --> 00:29:16,689
+system will be adjusting itself to
+better see the images in order to
+
+419
+00:29:16,690 --> 00:29:20,200
+describe them at the end and so this
+whole system will work together as one
+
+420
+00:29:20,200 --> 00:29:24,920
+so we'll be working towards this model
+we'll actually cover it in this class you'll
+
+421
+00:29:24,920 --> 00:29:28,279
+have a full understanding of exactly
+both this part and this part about
+
+422
+00:29:28,279 --> 00:29:31,849
+halfway through the course roughly
+you'll see how that image captioning model
+
+423
+00:29:31,849 --> 00:29:34,909
+works but that's just a motivation for
+what we're really building up to
+
+424
+00:29:34,910 --> 00:29:40,290
+they're really nice models to work
+with ok but for now back to CIFAR-10 and
+
+425
+00:29:40,289 --> 00:29:43,159
+linear classification
+
+426
+00:29:43,160 --> 00:29:47,930
+just to remind you we're working with this
+CIFAR-10 dataset and its ten labels and we're
+
+427
+00:29:47,930 --> 00:29:50,960
+going to approach image classification
+from what we call the parametric approach
+
+428
+00:29:50,960 --> 00:29:55,079
+remember that what we just discussed
+is an instance of what we call a
+
+429
+00:29:55,079 --> 00:29:57,439
+nonparametric approach there were no
+parameters that we were going to be
+
+430
+00:29:57,440 --> 00:30:02,430
+optimizing over this distinction will
+become clearer in a minute it's also
+
+431
+00:30:02,430 --> 00:30:04,240
+relevant to the project what we're doing is
+
+432
+00:30:04,240 --> 00:30:09,089
+thinking about constructing a function
+that takes an image and produces the
+
+433
+00:30:09,089 --> 00:30:12,769
+scores for the classes right this is what we
+want to do we want to take an image
+
+434
+00:30:12,769 --> 00:30:17,109
+and we'd like to figure out which one of
+the ten classes it is so we'd like to write
+
+435
+00:30:17,109 --> 00:30:21,169
+down a function an expression that
+takes an image and gives you those ten
+
+436
+00:30:21,170 --> 00:30:24,529
+numbers but the expression is not only a
+function of that image but critically
+
+437
+00:30:24,529 --> 00:30:28,339
+it'll also be a function of these
+parameters that are called W sometimes
+
+438
+00:30:28,339 --> 00:30:33,189
+also called the weights so really it's a
+function that goes from 3072 numbers
+
+439
+00:30:33,190 --> 00:30:37,308
+which make up this image to 10 numbers
+that's what we're doing we're defining a
+
+440
+00:30:37,308 --> 00:30:42,049
+function and we'll go through several
+choices of this function in the
+
+441
+00:30:42,049 --> 00:30:45,589
+first case we'll look at linear functions
+and then we'll extend it to neural networks
+
+442
+00:30:45,589 --> 00:30:49,579
+and then we'll extend that to
+convolutional networks but intuitively what
+
+443
+00:30:49,579 --> 00:30:53,379
+we're building up to is that what we'd
+like is when we put this image through
+
+444
+00:30:53,380 --> 00:30:57,690
+our function we'd like the 10 numbers
+that correspond to the scores of the 10
+
+445
+00:30:57,690 --> 00:31:01,150
+classes we'd like the number that
+corresponds to the cat class to be high
+
+446
+00:31:01,150 --> 00:31:06,330
+and all the other numbers to be low
+we don't have a choice over X
+
+447
+00:31:06,329 --> 00:31:11,428
+X acts as our image that's given we have a
+choice over W we're free to set
+
+448
+00:31:11,429 --> 00:31:15,179
+it to whatever we want and we'll
+want to set it so that this function
+
+449
+00:31:15,179 --> 00:31:19,050
+gives us the correct answers for every
+single image in our training data that's
+
+450
+00:31:19,049 --> 00:31:23,230
+roughly the approach we're building
+towards so suppose that we use the simplest
+
+451
+00:31:23,230 --> 00:31:29,789
+function the simplest just a linear
+classifier here so X is our image in
+
+452
+00:31:29,789 --> 00:31:34,200
+this case what I'm doing is I'm taking this
+array this image that makes up the cat
+
+453
+00:31:34,200 --> 00:31:38,750
+and I'm stretching out all the
+pixels in that image into a giant column
+
+454
+00:31:38,750 --> 00:31:46,920
+vector so that X there is a column vector
+of 3072 numbers and so if you know your
+
+455
+00:31:46,920 --> 00:31:52,100
+matrix vector operations which you
+should that's a prerequisite for this
+
+456
+00:31:52,099 --> 00:31:55,149
+class you'll see that this is just a matrix
+multiplication which should be familiar
+
+457
+00:31:55,150 --> 00:32:00,100
+and basically we're taking X which is a
+3072-dimensional column vector and we're
+
+458
+00:32:00,099 --> 00:32:03,569
+trying to get 10 numbers out of a linear
+function so you can go backwards
+
+459
+00:32:03,569 --> 00:32:08,399
+and figure out that the dimensions of this W
+are basically 10 by 3072 so there are
+
+460
+00:32:08,400 --> 00:32:14,370
+30,720 numbers that go into W
+and that's what we have control over
+
+461
+00:32:14,369 --> 00:32:16,658
+that's what we have to tweak to find
+what works
+
+462
+00:32:16,659 --> 00:32:21,710
+so those are the parameters in this
+particular case what I'm leaving out is
+
+463
+00:32:21,710 --> 00:32:26,919
+there's also an appended + b sometimes
+so you have biases these biases are
+
+464
+00:32:26,919 --> 00:32:31,999
+again 10 more parameters and we have
+to also find those so usually in a
+
+465
+00:32:31,999 --> 00:32:36,098
+linear classifier you have a W and a b and we have to
+find exactly what works best and this
+
+466
+00:32:36,098 --> 00:32:39,950
+b is not a function of the image
+it's just independent weights on
+
+467
+00:32:39,950 --> 00:32:44,989
+how likely any one of those classes might
+be to go back to your question if you
+
+468
+00:32:44,989 --> 00:32:50,239
+have a very unbalanced dataset so
+maybe you have mostly cats but some dogs
+
+469
+00:32:50,239 --> 00:32:54,710
+or something like that then you might
+expect that the bias for the
+
+470
+00:32:54,710 --> 00:32:58,200
+cat class might be slightly higher
+because by default the classifier wants
+
+471
+00:32:58,200 --> 00:33:04,009
+to predict the cat class unless something
+in the
+
+472
+00:33:04,009 --> 00:33:08,069
+image tells it otherwise to make this
+more concrete I'd just like to
+
+473
+00:33:08,069 --> 00:33:11,398
+break it down but of course I can't
+visualize it very explicitly with 3072
+
+474
+00:33:11,398 --> 00:33:17,459
+numbers so imagine that our input image
+has only four pixels and imagine those pixels
+
+475
+00:33:17,460 --> 00:33:21,419
+also stretched out into the column X and
+imagine that we have three classes
+
+476
+00:33:21,419 --> 00:33:27,109
+a red green and blue class or a cat
+dog and ship class so in this case W will
+
+477
+00:33:27,108 --> 00:33:30,868
+be only a three by four matrix and what
+we're doing here is we're trying to
+
+478
+00:33:30,868 --> 00:33:36,398
+compute the scores as this matrix times X so
+this is the matrix multiplication going on here
+
+479
+00:33:36,398 --> 00:33:40,608
+to give us the output f which is
+the scores we get the three scores for
+
+480
+00:33:40,608 --> 00:33:45,348
+the three different classes so this is a
+random setting of W just random weights
+
+481
+00:33:45,348 --> 00:33:50,739
+here and we'll get some scores out in
+particular you can see that this
+
+482
+00:33:50,739 --> 00:33:55,639
+setting of W is not very good right
+because with this setting of W the cat
+
+483
+00:33:55,638 --> 00:34:00,449
+score of -96.8 is much less than any of the
+other classes right so this was not
+
+484
+00:34:00,450 --> 00:34:04,720
+correctly classified for this training
+image so that's not a very good
+
+485
+00:34:04,720 --> 00:34:07,220
+classifier so we'd want to change
+to a different W
+
+486
+00:34:07,220 --> 00:34:10,250
+we want to use a different W so that that
+score comes out higher than the other
+
+487
+00:34:10,250 --> 00:34:14,409
+ones and we have to do that consistently
+across all the training examples
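+A minimal sketch of this score function f(x, W) = Wx + b with CIFAR-10
+shapes, 10 classes and 3072-dimensional flattened images; the values below
+are random, just to show the shapes:
+
+```python
+import numpy as np
+
+D, K = 3072, 10
+rng = np.random.default_rng(0)
+W = rng.normal(0, 0.01, size=(K, D))  # one row of weights per class
+b = np.zeros(K)                       # one bias per class
+
+x = rng.integers(0, 256, size=D).astype(np.float64)  # a flattened image
+scores = W.dot(x) + b   # 10 class scores; we want the true class to be highest
+print(scores.shape)     # (10,)
+```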
+488
+00:34:14,409 --> 00:34:20,389
+but one thing to notice here as well is
+that basically W
+
+489
+00:34:20,389 --> 00:34:25,700
+this function is in parallel
+evaluating all ten classifiers
+
+490
+00:34:25,699 --> 00:34:28,230
+but really there are ten independent
+classifiers
+
+491
+00:34:28,230 --> 00:34:32,210
+to some extent here and every one of
+these classifiers say the cat
+
+492
+00:34:32,210 --> 00:34:36,918
+classifier is just the first row of W
+right the first row and the first
+
+493
+00:34:36,918 --> 00:34:41,789
+bias give you the cat score and the dog
+classifier is the second row of W and the
+
+494
+00:34:41,789 --> 00:34:46,840
+ship classifier is the third row of W the W matrix
+has all these different classifiers
+
+495
+00:34:46,840 --> 00:34:50,889
+stacked in rows and they're all being
+dot producted with the image to
+
+496
+00:34:50,889 --> 00:34:56,269
+give you the scores so here's a
+question for you what does a linear
+
+497
+00:34:56,269 --> 00:35:02,599
+classifier do in English we saw the
+functional form it's doing this
+
+498
+00:35:02,599 --> 00:35:07,589
+funny operation there how would we
+interpret in English what this
+
+499
+00:35:07,590 --> 00:35:28,640
+is doing
+
+500
+00:35:28,639 --> 00:35:39,048
+X being a high-dimensional data point
+and W is really putting planes through
+
+501
+00:35:39,048 --> 00:35:43,038
+that space yes we'll come back to that
+interpretation but either way can
+
+502
+00:35:43,039 --> 00:35:59,420
+we think about it this other way where every
+single one of these rows of W
+
+503
+00:35:59,420 --> 00:36:03,630
+effectively is like a template that
+we're matching against the image and the
+
+504
+00:36:03,630 --> 00:36:08,608
+dot product is really a way of
+seeing what parts of the image align
+
+505
+00:36:08,608 --> 00:36:17,960
+with it what other ways
+
+506
+00:36:17,960 --> 00:36:42,088
+attention to spatial positions because
+what we can do is if
+
+507
+00:36:42,088 --> 00:36:44,838
+we have zero weights then the classifier
+
+508
+00:36:44,838 --> 00:36:50,329
+doesn't care what's in that part of the image
+so with zero weights for this part here then nothing
+
+509
+00:36:50,329 --> 00:36:53,389
+affects the score but for some other parts of the
+image where you have positive or negative
+
+510
+00:36:53,389 --> 00:36:58,118
+weights something's going to happen there
+and contribute to the score in other
+
+511
+00:36:58,119 --> 00:37:23,200
+words it's a way of mapping from
+image space to a label space
+
+512
+00:37:23,199 --> 00:37:33,009
+so the question is this image is a
+three-dimensional array where we have
+
+513
+00:37:33,010 --> 00:37:37,369
+all these channels and you just stretch it
+out you stretch it out in
+
+514
+00:37:37,369 --> 00:37:41,849
+whatever way you like say you stack the
+red green and blue portions side by side
+
+515
+00:37:41,849 --> 00:37:46,030
+you can stretch it out in whatever way
+you like but in a consistent way across
+
+516
+00:37:46,030 --> 00:37:49,930
+all the images you figure out a way to
+serialize in which way you want to read
+
+517
+00:37:49,929 --> 00:37:55,779
+off the pixels into the column
+
+518
+00:37:55,780 --> 00:38:05,060
+ok so let's say we have a four pixel
+grayscale image which is a terrible
+
+519
+00:38:05,059 --> 00:38:09,420
+example as you might think because I don't
+want to confuse people especially because
+
+520
+00:38:09,420 --> 00:38:12,539
+someone pointed out to me later after I
+made this figure that red green and blue
+
+521
+00:38:12,539 --> 00:38:15,150
+usually are color channels but here
+the red
+green and blue correspond to the classes
+
+522
+00:38:15,150 --> 00:38:21,380
+this is a complete screw-up on my part
+so I apologize they're not color channels just
+
+523
+00:38:21,380 --> 00:38:33,769
+three different colored classes sorry
+about that okay
+
+524
+00:38:33,769 --> 00:38:47,309
+if the images are large exactly how do we make
+this all be a single sized column vector
+
+525
+00:38:47,309 --> 00:38:52,369
+the answer is you always always resize
+images to be basically the same size we
+
+526
+00:38:52,369 --> 00:38:56,190
+can't easily deal with images of different
+sizes we might go into
+
+527
+00:38:56,190 --> 00:38:59,789
+that later but the simplest thing to
+do is just resize every single
+
+528
+00:38:59,789 --> 00:39:04,460
+image to the exact same size that's the simplest
+thing because we want to ensure that all
+
+529
+00:39:04,460 --> 00:39:08,470
+of them are kind of comparable made of the
+same stuff so that we can make these
+
+530
+00:39:08,469 --> 00:39:12,049
+columns and we can learn these score
+patterns that are aligned in the space
+
+531
+00:39:12,050 --> 00:39:18,380
+in fact state of the art classifiers the
+way they actually work is they
+
+532
+00:39:18,380 --> 00:39:21,650
+only work on square images so if you have a
+very long image these methods will
+
+533
+00:39:21,650 --> 00:39:25,480
+actually work worse because many of them
+what they do is they squash it that's what
+
+534
+00:39:25,480 --> 00:39:30,789
+we do it still works fairly well so if I feed in
+something very long like a panorama and try to
+
+535
+00:39:30,789 --> 00:39:34,059
+put that through some online
+service chances are it might work worse
+
+536
+00:39:34,059 --> 00:39:36,679
+because they'll probably want to put it
+through a convnet and they'll make it a
+
+537
+00:39:36,679 --> 00:39:41,129
+square because these convnets always
+work on squares you can make them work
+
+538
+00:39:41,130 --> 00:39:45,490
+on anything but that's just in practice
+what happens usually any other questions
+
+539
+00:39:45,489 --> 00:39:58,199
+on interpreting the W of the classifier yeah
+yeah so each image gets scored through this
+
+540
+00:39:58,199 --> 00:40:04,109
+would anyone else like to interpret this
+or so another way to actually put it one
+
+541
+00:40:04,110 --> 00:40:07,150
+way that I didn't hear but it's also a
+nice way of looking at it is that
+
+542
+00:40:07,150 --> 00:40:12,769
+basically every single score is just a
+weighted sum of all the pixel values in
+
+543
+00:40:12,769 --> 00:40:16,489
+the image and these weights we get to
+choose eventually but it's just a
+
+544
+00:40:16,489 --> 00:40:20,559
+giant weighted sum really all it's
+doing is it's counting up colors right
+
+545
+00:40:20,559 --> 00:40:25,779
+it's counting up colors at different
+spatial positions so one way
+
+546
+00:40:25,780 --> 00:40:29,500
+that was brought up in terms of how we
+can interpret this W classifier concretely
+
+547
+00:40:29,500 --> 00:40:33,170
+is that it's kind of like a
+template matching thing so here what
+
+548
+00:40:33,170 --> 00:40:37,059
+I've done is I've trained a classifier and I
+haven't shown you how to do that yet but I
+
+549
+00:40:37,059 --> 00:40:41,920
+trained my weight matrix and we'll come
+back to that in a second I'm taking out every
+
+550
+00:40:41,920 --> 00:40:45,010
+single one of those rows that we've
+learned every single classifier and I'm
+
+551
+00:40:45,010 --> 00:40:46,599
+reshaping it back into an image
+
+552
+00:40:46,599 --> 00:40:51,809
+so that I can visualize it
+so I'm taking
+what was originally just a giant blob of 3072
+
+553
+00:40:51,809 --> 00:40:55,650
+numbers and I reshape it back into the image
+to undo the distortion we've done and
+
+554
+00:40:55,650 --> 00:40:59,660
+then I have all these templates and so
+for example what you see here is that the
+
+555
+00:40:59,659 --> 00:41:04,659
+plane template is like a blue blob here the
+reason you see a blue blob is that if you
+
+556
+00:41:04,659 --> 00:41:08,278
+looked at the color channels of this
+plane template you'd see that in the
+
+557
+00:41:08,278 --> 00:41:11,440
+blue channel you have lots of positive
+weights because those positive weights
+
+558
+00:41:11,440 --> 00:41:15,479
+if they see blue values then they interact
+with those and they give a little
+
+559
+00:41:15,478 --> 00:41:19,338
+contribution to the score so this plane
+classifier is really just counting up the
+
+560
+00:41:19,338 --> 00:41:23,159
+amount of blue stuff in the image across
+all these spatial positions and if you
+
+561
+00:41:23,159 --> 00:41:26,368
+look at the red and green channels of
+the plane classifier you might find
+
+562
+00:41:26,369 --> 00:41:30,499
+zero values or even negative values
+right that's the plane classifier
+
+563
+00:41:30,498 --> 00:41:35,098
+likewise for all these other classes for say
+the frog you can almost see the template
+
+564
+00:41:35,099 --> 00:41:38,900
+of a frog there right it's looking for
+some green stuff the green stuff has
+
+565
+00:41:38,900 --> 00:41:42,849
+positive weights in here and then we see
+some brownish things on the sides
+
+566
+00:41:42,849 --> 00:41:49,599
+so if that gets dotted over an image of
+a frog it will get a high score one
+
+567
+00:41:49,599 --> 00:41:51,430
+thing to note here is look at
+
+568
+00:41:51,429 --> 00:41:56,588
+the car classifier that's not a very
+nice template of a car and also here
+
+569
+00:41:56,588 --> 00:42:01,679
+the horse looks a bit weird what's up with
+that why is the car looking weird why is
+
+570
+00:42:01,679 --> 00:42:11,048
+the horse looking weird yeah yeah
+basically that's what's going on in the
+
+571
+00:42:11,048 --> 00:42:14,998
+data the horses are sometimes facing left
+sometimes right and this classifier
+
+572
+00:42:14,998 --> 00:42:19,028
+really is not a very powerful classifier
+and it has to combine the two modes it has
+
+573
+00:42:19,028 --> 00:42:22,179
+to do both things at the same time so it
+ends up with this two-headed horse in
+
+574
+00:42:22,179 --> 00:42:25,879
+there and you can in fact say just from
+this result that there are probably more
+
+575
+00:42:25,880 --> 00:42:30,599
+left-facing horses in CIFAR-10 than right-facing
+because that side is stronger the same is true
+
+576
+00:42:30,599 --> 00:42:35,219
+for cars right we can have a car at 45
+degrees to the left or right or front
+
+577
+00:42:35,219 --> 00:42:40,588
+and this classifier here is the optimal
+way of merging all
+
+578
+00:42:40,588 --> 00:42:43,608
+those modes into a single template
+because that's what we're forcing it to do
+
+579
+00:42:43,608 --> 00:42:46,900
+that's where neural networks come in
+they don't have this
+
+580
+00:42:46,900 --> 00:42:50,239
+downside in principle they can have
+a template for
+
+581
+00:42:50,239 --> 00:42:53,338
+this car or that car and they can combine
+across them giving them more power
+
+582
+00:42:53,338 --> 00:42:56,478
+to actually carry out this
+classification more properly but for now
+
+583
+00:42:56,478 --> 00:42:57,808
+we are constrained by this
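+A rough sketch of the template visualization just described, assuming a
+trained weight matrix W of shape 10 x 3072 for 32x32x3 CIFAR-10 images
+(matplotlib is used only for display; names are illustrative):
+
+```python
+import numpy as np
+import matplotlib.pyplot as plt
+
+def show_templates(W, class_names):
+    for i, row in enumerate(W):
+        img = row.reshape(32, 32, 3)  # undo the flattening into a column vector
+        img = (img - img.min()) / (img.max() - img.min())  # rescale weights for display
+        plt.subplot(1, len(W), i + 1)
+        plt.imshow(img)
+        plt.axis('off')
+        plt.title(class_names[i])
+    plt.show()
+```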
+
+584
+00:42:57,809 --> 00:43:08,239
+question
+
+585
+00:43:08,239 --> 00:43:18,389
+yes so at train time we would
+not be taking just exactly these images we'd be
+
+586
+00:43:18,389 --> 00:43:21,349
+jittering them stretching them skewing
+them and we'll be putting all of that in
+
+587
+00:43:21,349 --> 00:43:25,979
+that's going to become a huge part of
+getting this to work very well so yes we will
+
+588
+00:43:25,978 --> 00:43:30,038
+be doing a huge amount of that stuff
+we're going
+
+589
+00:43:30,039 --> 00:43:33,469
+to hallucinate many other training
+examples with shifts rotations and
+
+590
+00:43:33,469 --> 00:43:47,009
+skews and that works much better how would
+these templates change if you took the average
+
+591
+00:43:47,009 --> 00:43:56,969
+of each class so you want to explicitly set a
+template and the way you'd set the
+
+592
+00:43:56,969 --> 00:44:01,068
+template is you average across all the
+images and that becomes your template
+
+593
+00:44:01,068 --> 00:44:13,918
+yeah so this classifier
+would do something similar I would guess
+
+594
+00:44:13,918 --> 00:44:18,489
+it would work worse though because this
+classifier when you look at what it
+
+595
+00:44:18,489 --> 00:44:22,028
+formally optimizes for I
+don't think the optimum would be
+
+596
+00:44:22,028 --> 00:44:26,179
+what you described just the mean of the
+images but that would be an intuitively
+
+597
+00:44:26,179 --> 00:44:30,079
+decent heuristic to perhaps set
+the weights in the initialization or
+
+598
+00:44:30,079 --> 00:44:34,239
+something related to it
+
+599
+00:44:34,239 --> 00:44:40,349
+yeah but we might be going into that
+I'll return to that there are several
+
+600
+00:44:40,349 --> 00:44:43,980
+related things
+
+601
+00:44:43,980 --> 00:45:06,650
+the template has lots of red which is saying
+that there's probably more red cars in
+
+602
+00:45:06,650 --> 00:45:11,750
+the dataset and it may not work for yellow
+cars in fact yellow cars might be
+
+603
+00:45:11,750 --> 00:45:16,909
+misclassified this thing just does not have
+the capacity to do all of that which is why
+
+604
+00:45:16,909 --> 00:45:19,989
+neural networks are powerful enough they can
+capture all these different modes correctly and so
+
+605
+00:45:19,989 --> 00:45:23,689
+this one will just go after the numbers
+if there are more red cars that's where it
+
+606
+00:45:23,690 --> 00:45:28,389
+will go if this was grayscale I'm not
+sure if that would work better we'll
+
+607
+00:45:28,389 --> 00:45:40,368
+come back to that actually you might
+expect as I mentioned for imbalanced
+
+608
+00:45:40,369 --> 00:45:42,190
+datasets what you might expect
+
+609
+00:45:42,190 --> 00:45:49,150
+if you have lots
+of cats is that the cat bias would be
+
+610
+00:45:49,150 --> 00:45:53,750
+higher because the
+classifier just gets used to large numbers
+
+611
+00:45:53,750 --> 00:45:57,980
+for cats based on the loss but we have to go into
+the loss function to see exactly how that
+
+612
+00:45:57,980 --> 00:46:01,929
+will play out so it's hard to say right
+now
+
+613
+00:46:01,929 --> 00:46:05,960
+another interpretation of the classifier
+that someone else pointed out and that
+
+614
+00:46:05,960 --> 00:46:09,869
+I'd like to point out is you can think
+of these images as very high-dimensional
+
+615
+00:46:09,869 --> 00:46:17,619
+points in a 3072-dimensional space right in a
+3072-dimensional pixel space every image
+
+616
+00:46:17,619 --> 00:46:22,130
+is a point and these linear
617
+00:46:22,130 --> 00:46:25,070
+These linear classifiers describe score
+gradients across that

618
+00:46:25,070 --> 00:46:28,580
+3072-dimensional space: the scores go from
+negative to positive along some linear direction,

619
+00:46:28,579 --> 00:46:33,670
+and so for example here, for the car
+classifier, I'm taking the first row of W, which is

620
+00:46:33,670 --> 00:46:37,750
+the car class, and the line here is
+indicating the zero level set of the

621
+00:46:37,750 --> 00:46:42,739
+classifier. In other words, along that
+line the car classifier has a score of zero,

622
+00:46:42,739 --> 00:46:46,849
+so the car classifier there gives zero,
+and the arrows indicate the

623
+00:46:46,849 --> 00:46:51,730
+direction along which it will color the
+space with more and more

624
+00:46:51,730 --> 00:46:56,400
+car score. Similarly, we have three
+different classifiers in this example;

625
+00:46:56,400 --> 00:46:59,900
+they also correspond to such gradients,
+each with a particular level set, and

626
+00:46:59,900 --> 00:47:05,650
+they're basically trying to carve up
+all these points in the space. And

627
+00:47:05,650 --> 00:47:08,970
+these linear classifiers we initialize
+randomly, so this car classifier would

628
+00:47:08,969 --> 00:47:11,969
+have its level set at random, and then
+you'll see, when we actually do the

629
+00:47:11,969 --> 00:47:16,449
+optimization, as we optimize, these will
+start to shift and turn and try to

630
+00:47:16,449 --> 00:47:20,239
+isolate the car class. It's quite
+fun to watch these classifiers

631
+00:47:20,239 --> 00:47:25,038
+train, because the line will rotate and
+snap into the car class region and will

632
+00:47:25,039 --> 00:47:28,528
+try to separate out all the cars
+from everything else; it's

633
+00:47:28,528 --> 00:47:33,289
+really amusing to watch. So that's
+another way of interpreting it. OK,

634
+00:47:33,289 --> 00:47:37,130
+here's a question for you: given all
+these interpretations of how

635
+00:47:37,130 --> 00:47:43,028
+such a classifier works, what would
+you expect to work really, really not

636
+00:47:43,028 --> 00:47:51,909
+well with a linear classifier?

637
+00:47:51,909 --> 00:48:05,230
+[Answer: concentric circles, where one class
+surrounds the other.] I see, so what you're

638
+00:48:05,230 --> 00:48:10,349
+describing is, in this interpretation
+of the space, your images

639
+00:48:10,349 --> 00:48:15,630
+in one class would be in a blob, and
+your other class is around it. I'm

640
+00:48:15,630 --> 00:48:19,880
+not sure exactly what that would look
+like in the actual pixel space, but yes,

641
+00:48:19,880 --> 00:48:22,869
+you're right: in that case a linear classifier
+will not be able to separate those out.

642
+00:48:22,869 --> 00:48:26,920
+But what about in terms of what
+the images would look like? Could you

643
+00:48:26,920 --> 00:48:31,079
+look at a set of images and clearly
+say that a linear classifier will probably

644
+00:48:31,079 --> 00:49:02,380
+not do very well here? Yeah:

645
+00:49:02,380 --> 00:49:39,210
+[take a] trained classifier and then take
+the negative of an image for that

646
+00:49:39,210 --> 00:49:42,699
+class. You'd still see the edges and
+you'd say, okay, that's an airplane,

647
+00:49:42,699 --> 00:49:45,710
+obviously, by the shape.
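The "negative image" failure case from this discussion, as a sketch with random stand-ins: inverting the colors leaves the shape intact, but a color-based template's dot product changes completely:

```python
import numpy as np

img = np.random.randint(0, 256, size=3072).astype(np.float64)  # flattened image stand-in
negative = 255.0 - img                       # color-inverted version
w = np.random.randn(3072) * 0.01             # hypothetical template row
print(w.dot(img), w.dot(negative))           # generally very different scores
```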
648
+00:49:45,710 --> 00:49:49,760
+But to the linear classifier all the colors
+would be exactly wrong, and so it would hate that airplane.

649
+00:49:49,760 --> 00:50:02,330
+[Request for another example.]

650
+00:50:02,329 --> 00:50:12,630
+[Suggestion: one class is dogs in the center,
+another class is] dogs on the right, and you think that would be

651
+00:50:12,630 --> 00:50:27,090
+a problem, right?

652
+00:50:27,090 --> 00:50:32,829
+On a white background or something; would
+that be a problem? It wouldn't be a problem.

653
+00:50:32,829 --> 00:50:37,059
+It wouldn't be a problem.

654
+00:50:37,059 --> 00:50:52,570
+[Question about transformations.]

655
+00:50:52,570 --> 00:50:56,789
+You're saying that maybe a more difficult
+case would be if your dogs were warped

656
+00:50:56,789 --> 00:51:00,309
+in some way according to class. Why
+wouldn't the first case be a problem? If you

657
+00:51:00,309 --> 00:51:04,279
+have something in the center for one class
+and something on the right for another, the classifier doesn't need an

658
+00:51:04,280 --> 00:51:08,840
+understanding of spatial layout; that's
+actually fine. It would be

659
+00:51:08,840 --> 00:51:15,769
+relatively easy, because you would have
+positive weights in the middle.

660
+00:51:15,769 --> 00:51:25,219
+OK.

661
+00:51:25,219 --> 00:51:34,348
+Yes, so really what this is doing is

662
+00:51:34,349 --> 00:51:38,619
+counting up colors at spatial
+positions; anything that messes

663
+00:51:38,619 --> 00:51:41,800
+with that will be really hard. Actually,
+to go back to your point, if you had a

664
+00:51:41,800 --> 00:51:44,300
+grayscale dataset, by the way, that would
+work

665
+00:51:44,300 --> 00:51:48,070
+not very well with linear classifiers;
+it would probably not work. If you took CIFAR-

666
+00:51:48,070 --> 00:51:53,250
+10 and made it grayscale, then doing
+the exact same classification on grayscale

667
+00:51:53,250 --> 00:51:56,059
+images would probably work really
+terribly, because you can't pick up on

668
+00:51:56,059 --> 00:52:00,739
+the colors; you'd have to pick up on
+textures and fine details, and you

669
+00:52:00,739 --> 00:52:03,848
+just can't localize them, because they
+could be at various positions; you can't

670
+00:52:03,849 --> 00:52:08,400
+count them consistently across the image.
+It would be kind of a disaster.

671
+00:52:08,400 --> 00:52:11,660
+Another example would be different
+textures: if, say, all of your

672
+00:52:11,659 --> 00:52:16,989
+textures are blue but they could be of
+different types, then this can't really

673
+00:52:16,989 --> 00:52:20,799
+tell apart the different types, because
+they are spatially invariant;

674
+00:52:20,800 --> 00:52:29,740
+that would be terrible as well. OK, so just
+to remind you where we are: the linear classifier

675
+00:52:29,739 --> 00:52:35,269
+computes this score function, so with a
+specific setting of W we're looking at

676
+00:52:35,269 --> 00:52:38,588
+some test images and getting some
+scores out. And where

677
+00:52:38,588 --> 00:52:43,070
+we're headed now is: with some setting of
+W we're getting scores for all these

678
+00:52:43,070 --> 00:52:47,470
+images, and so for example with this
+setting of W, in this image we're seeing

679
+00:52:47,469 --> 00:52:51,319
+that the cat score is 2.9, but some
+classes got a higher score,

680
+00:52:51,320 --> 00:52:54,588
+like dog, so that's not very good, while
+some classes have negative scores,

681
+00:52:54,588 --> 00:52:59,909
+which is good. So for this image this is
+kind of a medium result for these weights.
+ +682 +00:52:59,909 --> 00:53:04,199 +for this image in here we see that the +car class just correct for their has the + +683 +00:53:04,199 --> 00:53:08,439 +highest score which is going to write so +visiting W work too well on this image + +684 +00:53:08,440 --> 00:53:14,940 +here we see that the class is a very low +score so terribly on that so we're + +685 +00:53:14,940 --> 00:53:19,990 +headed now is we're going to define what +we call a loss function and this loss + +686 +00:53:19,989 --> 00:53:23,899 +function will quantify this intuition of +what we considered good or bad right now + +687 +00:53:23,900 --> 00:53:26,440 +we're just eyeballing these numbers are +saying what's good what's + +688 +00:53:26,440 --> 00:53:29,490 +which actually write down the +mathematical expression that tells us + +689 +00:53:29,489 --> 00:53:35,949 +exactly like these setting up w across +our test is 12.5 bad or 1220 whatever + +690 +00:53:35,949 --> 00:53:40,469 +bad or 110 bad because then once we have +a defined specifically we're going to be + +691 +00:53:40,469 --> 00:53:44,318 +looking forw that minimize the loss and +it will be set up in such a way that + +692 +00:53:44,318 --> 00:53:48,500 +when you have a loss of very low numbers +like say even zero and then your + +693 +00:53:48,500 --> 00:53:53,760 +correctly classifying all your images +but if you have a very high loss then + +694 +00:53:53,760 --> 00:53:56,970 +everything is messed up in W is not good +at all so we're going to find a lot of + +695 +00:53:56,969 --> 00:54:01,059 +action and then look for different w's +that actually do very well across all of + +696 +00:54:01,059 --> 00:54:03,469 +it so that's roughly what's coming up + +697 +00:54:03,469 --> 00:54:09,108 +well-defined loss function which is a +quantify a way to quantify how bad HW is + +698 +00:54:09,108 --> 00:54:13,328 +on our dataset the loss function as a +function of your entire training set and + +699 +00:54:13,329 --> 00:54:19,900 +your rates we don't have control over +the transfer of control of weeds then + +700 +00:54:19,900 --> 00:54:22,960 +we're going to look at the process of +optimization how to efficiently find the + +701 +00:54:22,960 --> 00:54:27,420 +set of weights w that works across all +of the images and gives us a very low + +702 +00:54:27,420 --> 00:54:30,940 +loss and then eventually what we'll do +is we'll go back and look at this + +703 +00:54:30,940 --> 00:54:34,250 +expression classifier that we saw we're +going to start meddling with the + +704 +00:54:34,250 --> 00:54:38,260 +function here so we're going to expend +effort to not be that simple your + +705 +00:54:38,260 --> 00:54:41,349 +expression but we're going to make it +slightly more complex will get a workout + +706 +00:54:41,349 --> 00:54:44,630 +and then we can slightly more complex +and will get a coalition that work out + +707 +00:54:44,630 --> 00:54:48,789 +but otherwise the entire framework will +stay unchanged all the time will be + +708 +00:54:48,789 --> 00:54:52,389 +competing these course dysfunctional +formal be changing but we're going to + +709 +00:54:52,389 --> 00:54:56,909 +some sort of course through some +function and will make it more elaborate + +710 +00:54:56,909 --> 00:55:01,179 +overtime and then we're identifying some +loss function and we're looking at what + +711 +00:55:01,179 --> 00:55:04,449 +waits what primaries are given a very +low loss and that's a setup will be + +712 +00:55:04,449 --> 00:55:09,710 +working with going forward so next class +will look into 
713
+00:55:09,710 --> 00:55:13,730
+Then we'll go into neural networks and
+convolutional networks. I guess this is my last slide,

714
+00:55:13,730 --> 00:55:23,920
+so I can take any last questions.

715
+00:55:23,920 --> 00:55:36,068
+Sorry, sorry, I didn't hear.

716
+00:55:36,068 --> 00:55:41,969
+[Question about whether the optimization can
+be solved directly.] So in some optimization settings you can

717
+00:55:41,969 --> 00:55:45,429
+avoid these iterative approaches, but
+basically the way this will work: we'll

718
+00:55:45,429 --> 00:55:49,598
+always start off with a random W, so that
+will give us some loss,

719
+00:55:49,599 --> 00:55:53,249
+and then we don't have a process for
+finding the best set of

720
+00:55:53,248 --> 00:55:57,509
+weights right away, but what we do have is
+a process for iteratively, slightly improving the

721
+00:55:57,509 --> 00:56:01,309
+weights. So we'll see that we look at the
+loss function, we find a gradient

722
+00:56:01,309 --> 00:56:06,380
+in weight space, and we march down. What
+we do know how to do is slightly

723
+00:56:06,380 --> 00:56:09,890
+improve a set of weights; we don't know
+how to solve the problem of finding the

724
+00:56:09,889 --> 00:56:12,858
+best weights right away. We don't
+know how to do that, especially

725
+00:56:12,858 --> 00:56:17,108
+when these functions are very complex,
+like a convnet; that's a huge landscape

726
+00:56:17,108 --> 00:56:31,038
+and it's just an intractable problem.
+Was that your question? I'm not sure...

727
+00:56:31,039 --> 00:56:40,170
+[Question:] how do we deal with the color
+problem? OK, so here we saw that the linear

728
+00:56:40,170 --> 00:56:44,809
+classifier for car was this red template
+for a car, and a neural network, basically,

729
+00:56:44,809 --> 00:56:47,619
+you can look at it as stacking linear

730
+00:56:47,619 --> 00:56:50,818
+classifiers to some degree. So what it
+will end up doing is having all

731
+00:56:50,818 --> 00:56:55,748
+these little templates: red cars, green
+cars, cars going this way or

732
+00:56:55,748 --> 00:56:58,248
+that way; there will be a template to
+detect every one of

733
+00:56:58,248 --> 00:57:01,399
+these different modes, and then they will
+be combined on the second

734
+00:57:01,400 --> 00:57:04,739
+layer. So basically these are looking
+for different types of cars,

735
+00:57:04,739 --> 00:57:08,588
+and then the next neuron is just like: OK,
+I take a weighted sum of what you guys

736
+00:57:08,588 --> 00:57:13,548
+are doing, an OR operation over you, and
+then we can detect cars in all of their modes

737
+00:57:13,548 --> 00:57:17,498
+and all of their positions. Does that make
+sense? That's roughly how it would work.

diff --git a/captions/En/Lecture3_en.srt b/captions/En/Lecture3_en.srt
new file mode 100644
index 00000000..e708148a
--- /dev/null
+++ b/captions/En/Lecture3_en.srt
@@ -0,0 +1,4444 @@
+1
+00:00:00,000 --> 00:00:05,400
+So before we get into some of the
+material today on loss functions and

2
+00:00:05,400 --> 00:00:09,429
+optimization, I wanted to go over some
+administrative things first.

3
+00:00:09,429 --> 00:00:12,859
+Just as a reminder, the first assignment
+is due next Wednesday, so you have

4
+00:00:12,859 --> 00:00:18,100
+roughly nine days left. And just as a
+warning, Monday is a holiday, so there will

5
+00:00:18,100 --> 00:00:23,050
+be no class or office hours, so plan out
+your time accordingly to make sure that

6
+00:00:23,050 --> 00:00:25,920
+you can complete the assignment in time.
+Of course, you also have some late

7
+00:00:25,920 --> 00:00:29,960
+days that you can use and allocate among
+your assignments as you see fit.

8
+00:00:29,960 --> 00:00:35,149
+OK, so diving into the material: first
+I'd like to remind you where we are

9
+00:00:35,149 --> 00:00:39,100
+currently. Last time we looked at the
+problem of visual recognition,

10
+00:00:39,100 --> 00:00:42,950
+specifically at image classification, and
+we talked about the fact that this

11
+00:00:42,950 --> 00:00:45,780
+is actually a very difficult problem,
+right? If you just consider the cross

12
+00:00:45,780 --> 00:00:50,829
+product of all the possible variations
+that we have to be robust to when we

13
+00:00:50,829 --> 00:00:54,198
+recognize any of these categories, such
+as cat, it just seems like an

14
+00:00:54,198 --> 00:00:58,049
+intractable, impossible problem. And not
+only do we know how to solve these

15
+00:00:58,049 --> 00:01:02,108
+problems now, but we can solve them
+for thousands of categories, and

16
+00:01:02,109 --> 00:01:05,859
+the state-of-the-art methods work almost
+at human accuracy, or even slightly

17
+00:01:05,859 --> 00:01:11,829
+surpassing it on some of those classes,
+and it also runs nearly in real time

18
+00:01:11,829 --> 00:01:16,539
+on your phone. Basically, all
+of this happened in the last three

19
+00:01:16,540 --> 00:01:19,790
+years, and you'll be experts by the
+end of the class on all of this

20
+00:01:19,790 --> 00:01:23,609
+technology, so it's really cool and
+exciting. OK, so that's the problem of

21
+00:01:23,609 --> 00:01:27,140
+image classification. We talked
+specifically about the data-driven

22
+00:01:27,140 --> 00:01:30,450
+approach and the fact that we can't just
+explicitly hardcode these classifiers, so

23
+00:01:30,450 --> 00:01:34,100
+we have to actually train them from
+data. And so we looked at the idea of

24
+00:01:34,099 --> 00:01:37,188
+splitting the training data, having
+the validation splits where we

25
+00:01:37,188 --> 00:01:41,408
+tune our hyperparameters, and a test set
+that you don't touch too much. We

26
+00:01:41,409 --> 00:01:44,810
+looked specifically at the example of the
+nearest neighbor classifier

27
+00:01:44,810 --> 00:01:48,618
+and the k-nearest neighbor classifier,
+and I talked about the CIFAR-10 dataset,

28
+00:01:48,618 --> 00:01:52,938
+which is our toy dataset that we play
+with during this class. Then I introduced

29
+00:01:52,938 --> 00:01:58,438
+the idea of an approach that I termed
+the parametric approach, which is really that

30
+00:01:58,438 --> 00:02:03,639
+we're writing a function from the image
+directly to the 10 class scores, if we

31
+00:02:03,640 --> 00:02:07,618
+have 10 classes, and the parametric form
+we chose to be linear, so we just

32
+00:02:07,618 --> 00:02:11,520
+have f = Wx, and we talked about the
+interpretations of this linear

33
+00:02:11,520 --> 00:02:12,850
+classifier: the fact that you can

34
+00:02:12,849 --> 00:02:16,039
+interpret it as matching templates, or
+that you can interpret the

35
+00:02:16,039 --> 00:02:18,449
+images as points in a very
+high-dimensional space, with our linear

36
+00:02:18,449 --> 00:02:23,560
+classifiers going in and coloring this
+space by class scores, so to speak.
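The recapped parametric setup in code, with CIFAR-10 shapes assumed (10 classes, 32x32x3 images flattened to 3072); the bias term is omitted for brevity:

```python
import numpy as np

W = np.random.randn(10, 3072) * 0.001   # small random weights, one row per class
x = np.random.randn(3072)               # a flattened image stand-in
scores = W.dot(x)                       # f = Wx: one score per class
print(scores.shape)                     # -> (10,)
```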
37
+00:02:23,560 --> 00:02:28,740
+And so by the end of the class we got to
+this picture, where we suppose

38
+00:02:28,740 --> 00:02:32,240
+we have a training dataset of
+just three images here

39
+00:02:32,240 --> 00:02:36,530
+along the columns, and we have some
+classes, say the 10 classes of CIFAR-10, and

40
+00:02:36,530 --> 00:02:40,740
+basically this function assigns scores
+to every single one of these images

41
+00:02:40,740 --> 00:02:44,510
+with some particular setting of weights,
+which I've chosen randomly here. We got

42
+00:02:44,509 --> 00:02:47,939
+some scores out, and some of these
+results are good and some of them are

43
+00:02:47,939 --> 00:02:51,419
+bad. So if you inspect the scores, for
+example in the first image, you can see

44
+00:02:51,419 --> 00:02:55,509
+that the correct class, cat, got a
+score of 2.9, and that's kind of in the

45
+00:02:55,509 --> 00:03:00,060
+middle: some classes received a
+higher score, which is not very good;

46
+00:03:00,060 --> 00:03:03,289
+some classes received a much lower score,
+which is good for that particular image.

47
+00:03:03,289 --> 00:03:09,019
+The car was very well classified, because
+the car score was much higher than all of the

48
+00:03:09,020 --> 00:03:12,980
+other ones, and the frog was not very
+well classified. All right, so we have

49
+00:03:12,979 --> 00:03:18,199
+this notion that different
+weights work better or

50
+00:03:18,199 --> 00:03:21,389
+worse on different images, and of course
+we're trying to find weights that

51
+00:03:21,389 --> 00:03:26,209
+give us scores consistent with
+all the ground-truth labels in

52
+00:03:26,210 --> 00:03:30,490
+the data. And so what we're going to do
+now: so far we've only eyeballed it, like I

53
+00:03:30,490 --> 00:03:33,590
+just described, saying this is good and
+that's not so good and so on, but we have

54
+00:03:33,590 --> 00:03:34,900
+to actually

55
+00:03:34,900 --> 00:03:38,710
+quantify this notion. We have to say that
+this particular set of weights

56
+00:03:38,710 --> 00:03:44,189
+W is, say, 12 bad or 1.5 bad or whatever,
+and then once we have this loss function

57
+00:03:44,189 --> 00:03:47,710
+we're going to minimize it: we're going
+to find the W that gets us the lowest

58
+00:03:47,710 --> 00:03:50,830
+loss. We're going to look into that
+today; we're going to look specifically

59
+00:03:50,830 --> 00:03:55,830
+at how we can define a loss function
+that measures this unhappiness, and then

60
+00:03:55,830 --> 00:04:00,030
+we're going to look at two
+different cases, an SVM loss and a softmax

61
+00:04:00,030 --> 00:04:04,840
+loss, and then we're going to look into
+the process of optimization, which is how

62
+00:04:04,840 --> 00:04:08,000
+you start off with these random weights
+and how you actually find very

63
+00:04:08,000 --> 00:04:13,110
+good settings of weights efficiently. So
+I'm going to downsize this example so that

64
+00:04:13,110 --> 00:04:16,620
+we have a nice working example: suppose
+we only had three classes

65
+00:04:16,620 --> 00:04:18,030
+instead of, you know,

66
+00:04:18,029 --> 00:04:22,009
+ten or thousands, and we have these
+three images and these are our scores

67
+00:04:22,009 --> 00:04:23,360
+for some setting of W.

68
+00:04:23,360 --> 00:04:27,949
+We're now going to try to write down
+exactly our unhappiness with this result.
69
+00:04:27,949 --> 00:04:32,680
+The first loss we're going to look
+into is termed the multiclass SVM loss.

70
+00:04:32,680 --> 00:04:36,629
+This is a generalization of the binary
+support vector machine that you may have

71
+00:04:36,629 --> 00:04:42,379
+seen in other classes; I think CS229
+covers it as well. And the setup here is

72
+00:04:42,379 --> 00:04:47,710
+that we have this score function, so s
+is the vector of class scores; these are

73
+00:04:47,709 --> 00:04:50,948
+our score vectors, and there's a
+specific expression here:

74
+00:04:50,949 --> 00:04:55,348
+the loss equals this sum. I'm going to
+interpret this loss first, and

75
+00:04:55,348 --> 00:04:59,978
+then we're going to see, through a
+specific example, why this expression

76
+00:04:59,978 --> 00:05:06,158
+works. Effectively, what the SVM loss is
+saying is that it's summing across all

77
+00:05:06,158 --> 00:05:11,399
+the incorrect classes: we sum
+across all the incorrect

78
+00:05:11,399 --> 00:05:17,209
+classes, so for every single example we
+have this loss, and it's summing across

79
+00:05:17,209 --> 00:05:20,769
+all the incorrect classes, comparing the
+score the correct class

80
+00:05:20,769 --> 00:05:25,209
+received with the score the incorrect
+class received: s_j minus s_yi,

81
+00:05:25,209 --> 00:05:31,269
+y_i being the correct label, plus one,
+and then a max with zero. So what's

82
+00:05:31,269 --> 00:05:35,838
+going on here is we're comparing the
+difference in these scores, and this

83
+00:05:35,838 --> 00:05:40,338
+particular loss is saying that not only
+do I want the correct score to be higher

84
+00:05:40,338 --> 00:05:43,918
+than the incorrect score, but there's
+actually a safety margin that we're putting

85
+00:05:43,918 --> 00:05:46,079
+in; we're using a safety margin of

86
+00:05:46,079 --> 00:05:53,198
+exactly one, and we're going to go into
+why one makes sense to use, as opposed to

87
+00:05:53,199 --> 00:05:56,900
+some other hyperparameter we'd have to
+choose there. You can

88
+00:05:56,899 --> 00:06:00,508
+look in the notes for a much more rigorous
+derivation of exactly why that one

89
+00:06:00,509 --> 00:06:04,278
+doesn't matter, but the intuition is
+that the scores are kind of

90
+00:06:04,278 --> 00:06:08,500
+scale-free: I can scale my W, make it
+larger or smaller, and you're

91
+00:06:08,500 --> 00:06:12,490
+going to get larger or smaller scores, so
+really there's this free parameter of

92
+00:06:12,490 --> 00:06:16,550
+how large or small the scores can
+be, which is tied to how large the

93
+00:06:16,550 --> 00:06:19,930
+weights are in magnitude, and so the
+scores are kind of arbitrary; using

94
+00:06:19,930 --> 00:06:25,269
+one is just an arbitrary choice to some
+extent. OK, so let's see specifically how

95
+00:06:25,269 --> 00:06:29,128
+this expression works with a concrete
+example. Here I'm going to evaluate

96
+00:06:29,129 --> 00:06:33,899
+the loss for the first example: we're
+computing by plugging in the

97
+00:06:33,899 --> 00:06:35,949
+scores, and we see that we're comparing

98
+00:06:35,949 --> 00:06:40,829
+the score we got for car, 5.1, minus 3.2,
+the score of the correct class cat,

99
+00:06:40,829 --> 00:06:45,219
+then adding our safety margin of one and
+taking the max with 0.

100
+00:06:45,220 --> 00:06:48,770
+What the max is really doing is clamping
+values at zero.
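Written out in the notation of the course notes (s_j are the class scores and y_i the correct label), the per-example loss being described is:

$$
L_i = \sum_{j \neq y_i} \max(0,\; s_j - s_{y_i} + 1)
$$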
101
+00:06:48,769 --> 00:06:53,759
+So if we get a negative result, we clamp
+it at 0. Then for the second,

102
+00:06:53,759 --> 00:06:55,089
+incorrect class, frog:

103
+00:06:55,089 --> 00:06:59,699
+-1.7 minus 3.2 plus the safety margin
+gives a negative number, which clamps to zero,

104
+00:06:59,699 --> 00:07:03,629
+and when you work this through you
+get a loss of 2.9.

105
+00:07:03,629 --> 00:07:07,209
+Intuitively, the way this
+worked out is: the

106
+00:07:07,209 --> 00:07:12,930
+cat score is 3.2, so according to the SVM
+loss, what we would ideally like is

107
+00:07:12,930 --> 00:07:16,100
+that the scores for all the other
+classes are at most

108
+00:07:16,100 --> 00:07:21,370
+2.2. But the car class actually had a
+much higher score, and

109
+00:07:21,370 --> 00:07:24,620
+the difference between what we would have
+liked, which is 2.2, and what actually

110
+00:07:24,620 --> 00:07:30,939
+happened, which is 5.1, is exactly this
+value of 2.9, which is how bad this

111
+00:07:30,939 --> 00:07:36,129
+score outcome was. In the other
+case, the frog case, you can see the frog

112
+00:07:36,129 --> 00:07:40,139
+score was quite a bit lower than 2.2,
+and the way that works out in the

113
+00:07:40,139 --> 00:07:43,289
+math is that you end up getting a
+negative number when you compare the

114
+00:07:43,290 --> 00:07:48,110
+scores, and then the max gives a zero loss
+contribution for that particular term,

115
+00:07:48,110 --> 00:07:54,439
+so you end up with a loss of 2.9. OK, so
+that's the loss for the first image. For

116
+00:07:54,439 --> 00:07:57,050
+the second image we're going to do
+the same thing again:

117
+00:07:57,050 --> 00:08:01,689
+plug in the numbers. We're comparing the
+cat score to the car score, so we get

118
+00:08:01,689 --> 00:08:07,329
+1.3 minus 4.9, add the safety
+margin, and the same for the other class,

119
+00:08:07,329 --> 00:08:11,659
+and when you plug it in you actually end
+up with a loss of 0.

120
+00:08:11,660 --> 00:08:17,280
+Intuitively, that's because the car
+score here is

121
+00:08:17,279 --> 00:08:22,479
+higher than all the other scores for
+that image by at least one, right? That's

122
+00:08:22,480 --> 00:08:27,490
+why we got a loss of 0: all the margin
+constraints were satisfied, so nothing

123
+00:08:27,490 --> 00:08:31,310
+adds to the loss. And in the last case we
+end up with a very bad loss, because

124
+00:08:31,310 --> 00:08:34,470
+the frog class received a very low score
+but the other classes received quite

125
+00:08:34,470 --> 00:08:39,349
+high scores, so this adds up to an
+unhappiness of 10.9. Now, if we

126
+00:08:39,349 --> 00:08:42,520
+actually want to combine all of this
+into a single loss function, we're going

127
+00:08:42,519 --> 00:08:45,929
+to do the relatively intuitive
+thing here: we just take the

128
+00:08:45,929 --> 00:08:48,049
+average across all the losses we obtain

129
+00:08:48,049 --> 00:08:51,458
+over the training set, and so we would
+say that the loss at the end, when you

130
+00:08:51,458 --> 00:08:56,369
+average these numbers, is 4.6. So this
+particular setting of W on this training

131
+00:08:56,370 --> 00:09:01,320
+data gives us some scores, which we plug
+into the loss function, and we get

132
+00:09:01,320 --> 00:09:06,170
+an unhappiness of 4.6 with this result.
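A quick numeric check of the first two examples in this walkthrough, with the scores as read off the slide (assumed: cat/car/frog scores of 3.2/5.1/-1.7 for image one and 1.3/4.9/2.0 for image two):

```python
import numpy as np

def svm_loss_i(scores, y):
    margins = np.maximum(0.0, scores - scores[y] + 1.0)  # hinge terms
    margins[y] = 0.0                                     # skip j == y
    return margins.sum()

print(svm_loss_i(np.array([3.2, 5.1, -1.7]), 0))  # image 1, cat correct -> 2.9
print(svm_loss_i(np.array([1.3, 4.9, 2.0]), 1))   # image 2, car correct -> 0.0
```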
133
+00:09:06,169 --> 00:09:08,939
+OK, so now I'm going to ask you a series
+of questions to test your understanding

134
+00:09:08,940 --> 00:09:12,390
+of how this works. I'll get to your
+questions in a bit; let me just pose my

135
+00:09:12,389 --> 00:09:20,230
+own questions first. First of all, what
+about that sum over there, the sum over

136
+00:09:20,230 --> 00:09:25,560
+the incorrect classes j: what if it were
+a sum over all the classes, not just

137
+00:09:25,559 --> 00:09:29,799
+the incorrect ones? So what if we allowed
+j to equal y_i? Why am I

138
+00:09:29,799 --> 00:09:39,149
+adding that small constraint in the
+sum there? Yes, so in fact what would

139
+00:09:39,149 --> 00:09:43,139
+happen is: the reason we add j not
+equal to y_i is that if we allowed j = y_i,

140
+00:09:43,139 --> 00:09:46,539
+then the score s_yi cancels with itself,

141
+00:09:46,539 --> 00:09:49,828
+you end up with max(0, 1), and really what
+you're doing is adding a constant

142
+00:09:49,828 --> 00:09:53,549
+of one. So if that sum went over all the
+classes, then we would really just

143
+00:09:53,549 --> 00:09:59,250
+be inflating the loss by a constant of
+one. That's why that's there. Second: what if

144
+00:09:59,250 --> 00:10:03,940
+we used a mean instead of a sum?
+Right, so I'm summing over all these

145
+00:10:03,940 --> 00:10:10,500
+constraints; what if I used a mean, just
+like I'm using a mean to average

146
+00:10:10,500 --> 00:10:13,389
+over the losses of all the examples?
+What if I used the mean over the scores?

147
+00:10:13,389 --> 00:10:28,000
+[Answer: the loss would be divided by the
+number of classes.] You're right in that the

148
+00:10:28,000 --> 00:10:33,870
+absolute value of the loss would be lower

149
+00:10:33,870 --> 00:10:37,879
+by a constant factor. Why?

150
+00:10:37,879 --> 00:10:52,689
+If we actually did an average here, we'd
+be averaging over the number of classes,

151
+00:10:52,690 --> 00:10:56,220
+but there's a constant number of
+classes, say three in this specific

152
+00:10:56,220 --> 00:10:56,889
+example.

153
+00:10:56,889 --> 00:11:01,000
+It amounts to putting a constant of
+one-third in front of the loss, and

154
+00:11:01,000 --> 00:11:04,450
+that would make the loss lower, just
+like you pointed out. But

155
+00:11:04,450 --> 00:11:07,820
+in the end, what we're always interested
+in is minimizing over W with

156
+00:11:07,820 --> 00:11:12,470
+that loss, so if you're shifting your
+loss by one, or scaling it by

157
+00:11:12,470 --> 00:11:15,350
+a constant, it actually doesn't change
+the solutions: you're still going to

158
+00:11:15,350 --> 00:11:19,420
+end up at the same optimal W. So these
+choices are basically free

159
+00:11:19,419 --> 00:11:23,169
+parameters; it doesn't matter. For
+convenience I'm adding j not equal to y_i,

160
+00:11:23,169 --> 00:11:26,299
+and I'm not actually taking the mean,
+although it would be the same thing, and the

161
+00:11:26,299 --> 00:11:33,329
+same goes for whether we average or
+sum across the examples. OK,

162
+00:11:33,330 --> 00:11:38,410
+next question: what if we instead used
+not the formulation up there, but a

163
+00:11:38,409 --> 00:11:42,669
+very similar-looking formulation with an
+additional square at the end?

164
+00:11:42,669 --> 00:11:47,809
+So we're taking the difference between
+the scores, plus the margin of one, and then

165
+00:11:47,809 --> 00:11:54,509
+we're squaring that. Do we obtain the same
+or a different loss? Do you think we

166
+00:11:54,509 --> 00:11:57,710
+obtain the same or a different loss, in
+the sense that if you were to optimize and

167
+00:11:57,710 --> 00:12:05,759
+find the best W, would we get the same
+result or not?

168
+00:12:05,759 --> 00:12:20,340
+Yes, in fact you get a different loss.
+It's not as obvious to see, but one way

169
+00:12:20,340 --> 00:12:26,639
+to see it is that we're not just
+linearly scaling the loss

170
+00:12:26,639 --> 00:12:30,710
+up or down by a constant, or shifting it
+by a constant; we're actually changing the

171
+00:12:30,710 --> 00:12:35,580
+differences, changing the
+tradeoffs nonlinearly in terms of how

172
+00:12:35,580 --> 00:12:38,920
+the support vector machine is going
+to trade off the different

173
+00:12:38,919 --> 00:12:43,519
+score margins on different examples.
+It's not obvious to see, but basically

174
+00:12:43,519 --> 00:12:46,829
+I want to illustrate that such changes
+to this loss

175
+00:12:46,830 --> 00:12:53,320
+are not free. The second formulation
+here is in fact something we call the

176
+00:12:53,320 --> 00:12:57,530
+squared hinge loss, as opposed to the one
+on top, which we call the hinge loss, and

177
+00:12:57,529 --> 00:13:01,480
+which one to use is a kind of
+hyperparameter. Most often you see the

178
+00:13:01,480 --> 00:13:04,750
+first formulation; that's what we use
+most of the time, but sometimes you can

179
+00:13:04,750 --> 00:13:07,950
+see SVMs where the squared hinge loss
+works better, so that's something you

180
+00:13:07,950 --> 00:13:12,550
+can play with; it's really a hyperparameter,
+but most often the first one is used.

181
+00:13:12,549 --> 00:13:18,919
+Let's also think about the scale of this
+loss: what's the min and max possible loss

182
+00:13:18,919 --> 00:13:23,149
+you can achieve with the
+multiclass SVM on your entire dataset?

183
+00:13:23,149 --> 00:13:26,759
+What is the smallest value?

184
+00:13:26,759 --> 00:13:35,029
+Zero, good. What is the highest value?
+Infinity, basically: scores could be arbitrarily

185
+00:13:35,029 --> 00:13:39,870
+terrible, so if the score you assign to
+the correct class is very, very small,

186
+00:13:39,870 --> 00:13:45,230
+then your loss is going
+to go to infinity. And one more question, which

187
+00:13:45,230 --> 00:13:49,480
+becomes kind of important when we start
+doing optimization: usually when we

188
+00:13:49,480 --> 00:13:53,200
+actually optimize these loss functions,
+we start off with an initialization of W

189
+00:13:53,200 --> 00:13:56,430
+with very small weights, so what ends
+up happening is that the scores at the

190
+00:13:56,429 --> 00:14:00,819
+very beginning of optimization are
+roughly near zero; all of them are

191
+00:14:00,820 --> 00:14:05,650
+small numbers near zero. So what is the
+loss when all the scores are near zero, in this

192
+00:14:05,649 --> 00:14:12,329
+particular case? That's right: the number
+of classes minus one. If all the scores are

193
+00:14:12,330 --> 00:14:16,639
+zero, then with this particular loss
+written down here, taking the average

194
+00:14:16,639 --> 00:14:21,269
+this way, we would have achieved a
+loss of two.

195
+00:14:21,269 --> 00:14:24,429
+OK, so this is not very important in
+itself.
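The initialization check just described, as a sketch: with all scores near zero, each of the C - 1 incorrect classes contributes max(0, 0 - 0 + 1) = 1, so the per-example loss is C - 1 (two for three classes):

```python
import numpy as np

scores = np.zeros(3)                            # near-zero scores at init
margins = np.maximum(0, scores - scores[0] + 1) # class 0 taken as correct
margins[0] = 0
print(margins.sum())                            # -> 2.0 = num_classes - 1
```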
196
+00:14:24,429 --> 00:14:28,399
+What's important is the sanity check:
+when you're starting

197
+00:14:28,399 --> 00:14:31,389
+optimization with a W of very small
+numbers, you print out

198
+00:14:31,389 --> 00:14:34,279
+your first loss as you start iterating,
+and you want to make sure that

199
+00:14:34,279 --> 00:14:38,929
+you understand the functional form and
+can think through

200
+00:14:38,929 --> 00:14:42,799
+whether the number you get makes sense.
+So if I'm seeing two in this case,

201
+00:14:42,799 --> 00:14:46,990
+then I'm happy that the loss may be
+implemented correctly. Not a hundred percent

202
+00:14:46,990 --> 00:14:51,730
+sure, but at least there's nothing
+obviously wrong with it right away. So it's

203
+00:14:51,730 --> 00:14:55,950
+useful to think about these. I'm going to
+go into this loss a tiny

204
+00:14:55,950 --> 00:15:10,870
+bit more, but is there a question on
+this slide right now? [Audience question.]

205
+00:15:10,870 --> 00:15:15,029
+Would it be more efficient to not have
+this constraint j not equal to y_i, because it

206
+00:15:15,029 --> 00:15:19,049
+makes it more difficult to do these
+easy vectorized implementations

207
+00:15:19,049 --> 00:15:23,799
+of this loss? That actually
+predicts my next slide to some

208
+00:15:23,799 --> 00:15:27,459
+degree. So let me show some numpy code
+for how we would write out

209
+00:15:27,460 --> 00:15:33,290
+this loss function. Here we're evaluating
+L_i in a vectorized way;

210
+00:15:33,289 --> 00:15:37,759
+we're given a single example, so x is a
+single column vector, y is

211
+00:15:37,759 --> 00:15:42,279
+an integer specifying the label, and W is
+our weight matrix. What we do is we

212
+00:15:42,279 --> 00:15:45,799
+evaluate the scores, which is just W
+times x, then we compute these

213
+00:15:45,799 --> 00:15:50,179
+margins, which are the differences between
+the scores we obtained and the correct

214
+00:15:50,179 --> 00:15:55,569
+score, plus one, clamped at zero: numbers
+between 0 and whatever. And then you see this

215
+00:15:55,570 --> 00:16:03,360
+line margins[y] = 0; why is that
+there?

216
+00:16:03,360 --> 00:16:07,320
+Yeah, exactly. So basically I'm doing this
+efficient vectorized computation, which

217
+00:16:07,320 --> 00:16:11,209
+goes to your point, and then I want to
+erase that margin there, because I'm

218
+00:16:11,208 --> 00:16:15,569
+certain that margins[y] currently holds
+one, and I don't want to inflate my

219
+00:16:15,570 --> 00:16:18,360
+loss, so I'll set it to 0.

220
+00:16:18,360 --> 00:16:27,269
+Yes, I suppose you could subtract one at
+the end as well; we can optimize this if we

221
+00:16:27,269 --> 00:16:31,200
+want, but we're not going to think about
+it too much. If you do it in

222
+00:16:31,200 --> 00:16:35,050
+your assignment, that's very welcome.

223
+00:16:35,049 --> 00:16:40,859
+So we sum those margins and we get the
+loss. Going back to the slide: any more questions

224
+00:16:40,860 --> 00:16:45,320
+about this formulation? By the way,
+if you actually write this

225
+00:16:45,320 --> 00:16:49,430
+formulation down for just two classes,
+you'll see that it

226
+00:16:49,429 --> 00:16:57,229
+reduces to the binary support vector
+machine loss.
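A reconstruction of the per-example loss code being described (argument names follow the spoken description: x is one column image, y an integer label, W the weight matrix); treat it as a sketch of the slide, not a verbatim copy:

```python
import numpy as np

def L_i_vectorized(x, y, W):
    scores = W.dot(x)                               # class scores
    margins = np.maximum(0, scores - scores[y] + 1) # all hinge terms at once
    margins[y] = 0                                  # erase the j == y term (it is 1)
    return np.sum(margins)
```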
+ +227 +00:16:57,230 --> 00:17:00,190 +function soon and then we're going to +look at the comparisons of them as well + +228 +00:17:00,190 --> 00:17:05,400 +but for now actually so at this point +what we have is we have this + +229 +00:17:05,400 --> 00:17:08,699 +wrapping up its course and then we have +this loss function which have not + +230 +00:17:08,699 --> 00:17:11,870 +written out and its full form where we +have these differences between this + +231 +00:17:11,869 --> 00:17:18,178 +course +1 some of her closest and the +Sun and the average across hold examples + +232 +00:17:18,179 --> 00:17:21,309 +right so that's the loss function right +now I'd like to convince you that + +233 +00:17:21,308 --> 00:17:25,149 +there's actually a bug with this loss +function in other words if I'd like to + +234 +00:17:25,150 --> 00:17:31,798 +use this loss on Sunday as in practice I +might get some not very nice properties + +235 +00:17:31,798 --> 00:17:36,589 +ok if this if this was the only thing I +was using my phone and it's not + +236 +00:17:36,589 --> 00:17:39,709 +completely obvious to see exactly what +the issue is so I'll give you guys a + +237 +00:17:39,710 --> 00:17:43,620 +hint in particular suppose that we found +the W + +238 +00:17:43,619 --> 00:17:55,058 +getting zero loss ok on something and +now the question is is this w unique or + +239 +00:17:55,058 --> 00:18:00,329 +face another way can you give me aww +that would be different but also + +240 +00:18:00,329 --> 00:18:04,210 +definitely achieve zero loss in the back + +241 +00:18:04,210 --> 00:18:12,410 +that's right and so you're saying we can +scale it by some constant and in + +242 +00:18:12,410 --> 00:18:20,009 +particular all formats are based on +constraint you probably want to meet + +243 +00:18:20,009 --> 00:18:24,259 +young greater than one right so +basically what I can do as I can change + +244 +00:18:24,259 --> 00:18:28,119 +my weight and make them larger and +larger all I would be doing is I'm just + +245 +00:18:28,119 --> 00:18:31,639 +create making the score differences +larger and larger as I came up w right + +246 +00:18:31,640 --> 00:18:35,890 +because of the liquor law sport here so +basically it's not a very desirable + +247 +00:18:35,890 --> 00:18:40,370 +property because we have the entire +subspace of W that is optimal and all of + +248 +00:18:40,369 --> 00:18:44,319 +them are according to this loss function +completely the same but intuitively + +249 +00:18:44,319 --> 00:18:48,019 +that's not what I can burn as property +to pass and so just to see this in + +250 +00:18:48,019 --> 00:18:51,920 +america to convince yourself that this +is the case I taking this example of + +251 +00:18:51,920 --> 00:18:58,480 +what we achieved previously 0 loss there +before and I suppose I W I twice I mean + +252 +00:18:58,480 --> 00:19:02,360 +this is a very simple math going on here +but basically I would be conflicting or + +253 +00:19:02,359 --> 00:19:07,000 +my scores by two times and so their +difference would also becomes larger so + +254 +00:19:07,000 --> 00:19:11,019 +if all your score differences inside the +max 50 well already negative then + +255 +00:19:11,019 --> 00:19:14,389 +there's going to become more and more +negative and so you end up with larger + +256 +00:19:14,390 --> 00:19:18,040 +and larger negative values inside them +access and just become zero all the time + +257 +00:19:18,039 --> 00:19:32,159 +but the scale factor would have to be +larger than 1 because + +258 +00:19:32,160 --> 00:19:56,940 +another 
259
+00:19:56,940 --> 00:19:58,309
+Right, yes:

260
+00:19:58,309 --> 00:20:06,589
+for simplicity I'm forgetting the bias;
+basically the scores are Wx + b, and I'm

261
+00:20:06,589 --> 00:20:10,250
+just scaling W by itself. OK, so the way to
+fix this: intuitively,

262
+00:20:10,250 --> 00:20:13,269
+we have this entire subspace of W's that
+all work the same according to this

263
+00:20:13,269 --> 00:20:17,170
+loss function, and what we'd like is to
+have a preference for

264
+00:20:17,170 --> 00:20:21,430
+some W's over others based purely on
+intrinsic properties: what do we

265
+00:20:21,430 --> 00:20:26,110
+desire W to look like, forgetting the
+data, just what are nice properties to

266
+00:20:26,109 --> 00:20:29,319
+have? And so this introduces the notion
+of regularization, which we're going to

267
+00:20:29,319 --> 00:20:33,309
+be appending to our loss function. We have
+an additional term there, which is

268
+00:20:33,309 --> 00:20:37,500
+lambda times a regularization function
+of W, and the regularization function

269
+00:20:37,500 --> 00:20:43,279
+measures the niceness of your W. So we
+don't only want to fit the data;

270
+00:20:43,279 --> 00:20:47,549
+we also want W to be nice, and we're
+going to see some ways of framing

271
+00:20:47,549 --> 00:20:52,509
+exactly why that makes sense. Intuitively,
+regularization is a way of

272
+00:20:52,509 --> 00:20:56,589
+trading off your training loss against
+your generalization

273
+00:20:56,589 --> 00:21:00,899
+performance on a test set. So regularization
+is a set of techniques where

274
+00:21:00,900 --> 00:21:04,560
+we're adding objectives to the loss that
+will be fighting with this guy:

275
+00:21:04,559 --> 00:21:07,879
+this guy just wants to fit your training
+data, and that guy wants W to look some

276
+00:21:07,880 --> 00:21:11,730
+particular way, and so they're sometimes
+fighting each other in your objective,

277
+00:21:11,730 --> 00:21:14,470
+because we want to simultaneously
+achieve both of them. But it turns out

278
+00:21:14,470 --> 00:21:18,319
+that adding these regularization
+techniques, even if it makes your

279
+00:21:18,319 --> 00:21:21,599
+training error worse, so we're not
+correctly classifying all training examples,

280
+00:21:21,599 --> 00:21:26,089
+what you notice is that the test set
+performance ends up better, and we'll see an

281
+00:21:26,089 --> 00:21:29,109
+example of why that might be on the next
+slide. For now I just want to

282
+00:21:29,109 --> 00:21:33,019
+point out that the most
+common form of regularization is what we

283
+00:21:33,019 --> 00:21:37,539
+call L2 regularization, or weight

284
+00:21:37,539 --> 00:21:42,230
+decay. Really what we're doing is, if W is
+a 2D matrix, we have a sum over k and l,

285
+00:21:42,230 --> 00:21:44,230
+the rows and columns, of the
+element-wise square of W,

286
+00:21:44,230 --> 00:21:48,019
+and we're just adding all of that into
+the loss. OK, so this particular

287
+00:21:48,019 --> 00:21:55,069
+regularization likes W's to be 0: when W is
+all zeros, the regularization is happiest,

288
+00:21:55,069 --> 00:21:58,649
+but of course you can't have that, because
+then you can't classify anything,

289
+00:21:58,650 --> 00:22:03,140
+so these two terms will fight each
+other.
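The L2 penalty just defined, in code: lambda times the sum of all squared entries of W, added on top of the data loss:

```python
import numpy as np

def l2_regularization(W, lam):
    """lam * sum over k, l of W[k, l]^2 -- the 'weight decay' term."""
    return lam * np.sum(W * W)
```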
290
+00:22:03,140 --> 00:22:08,570
+There are different forms of regularization
+with different pros and cons; we'll go into some of

291
+00:22:08,569 --> 00:22:12,548
+them much later in the class. I'd just
+like to say that L2 regularization is the

292
+00:22:12,548 --> 00:22:17,569
+most common form, and that's what you'll
+use quite often in this class as well.

293
+00:22:17,569 --> 00:22:20,529
+Now I'd like to convince you that this
+is a reasonable thing to want of a W,

294
+00:22:20,529 --> 00:22:25,779
+that its weights are small. Consider
+this very simple cooked-up example to

295
+00:22:25,779 --> 00:22:30,149
+get the intuition: suppose we're in
+four-dimensional

296
+00:22:30,150 --> 00:22:32,370
+space doing this classification, and we
+have an input

297
+00:22:32,369 --> 00:22:36,139
+vector x of all ones, and now suppose we
+have these two candidate

298
+00:22:36,140 --> 00:22:37,880
+weight matrices, or weight

299
+00:22:37,880 --> 00:22:44,780
+vectors, I suppose, right now. One of them
+is [1, 0, 0, 0] and the other is 0.25

300
+00:22:44,779 --> 00:22:49,200
+everywhere. Since we have linear score
+functions, you'll see that their effects

301
+00:22:49,200 --> 00:22:55,080
+are the same: we're evaluating scores
+by Wx, and the dot product with

302
+00:22:55,079 --> 00:22:59,109
+x is identical for both of these, so the
+scores come out the same. But the

303
+00:22:59,109 --> 00:23:03,469
+regularization will strictly favor one
+of these over the other. Which one will

304
+00:23:03,470 --> 00:23:07,720
+the regularization cost favor, even
+though their effects are the same? Which

305
+00:23:07,720 --> 00:23:13,548
+one is better in terms of regularization?
+The second one, right. And so the

306
+00:23:13,548 --> 00:23:15,740
+regularization tells you that even
+though they achieve the same

307
+00:23:15,740 --> 00:23:19,109
+effect in terms of the data loss, down
+the road we actually

308
+00:23:19,109 --> 00:23:22,629
+significantly prefer the second one.
+What's better about the second one? Why is

309
+00:23:22,630 --> 00:23:27,340
+that a good weight vector to have?

310
+00:23:27,339 --> 00:23:38,230
+That's correct. The interpretation I
+like the most is that it

311
+00:23:38,230 --> 00:23:43,549
+takes into account the largest number of
+things in your x vector, right? What

312
+00:23:43,549 --> 00:23:47,859
+this L2 regularization wants to do is
+spread out your W's as much as possible,

313
+00:23:47,859 --> 00:23:51,169
+so that you're taking into account all
+the input features, or all the pixels;

314
+00:23:51,170 --> 00:23:55,900
+it wants to use as many of the
+dimensions as it can while

315
+00:23:55,900 --> 00:23:57,600
+achieving the same effect,

316
+00:23:57,599 --> 00:24:01,439
+intuitively speaking, and that's better
+than focusing on just one

317
+00:24:01,440 --> 00:24:06,990
+dimension. It's just nice; it's something
+that often works better in practice,

318
+00:24:06,990 --> 00:24:11,880
+basically just because of the way
+datasets are arranged and the properties they

319
+00:24:11,880 --> 00:24:17,230
+usually have. Any questions about
+regularization being a good idea?

320
+00:24:17,230 --> 00:24:22,130
+OK. So basically our losses will always
+have this form, where

321
+00:24:22,130 --> 00:24:25,350
+we have a data loss plus a regularization
+term; it's a very common thing to have in practice.
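The cooked-up example above, verified numerically: both weight vectors score x identically, but the L2 cost is four times smaller for the diffuse one, so the regularization prefers it:

```python
import numpy as np

x  = np.ones(4)
w1 = np.array([1.0, 0.0, 0.0, 0.0])
w2 = np.array([0.25, 0.25, 0.25, 0.25])
print(w1.dot(x), w2.dot(x))          # same effect on the scores: 1.0 and 1.0
print(np.sum(w1**2), np.sum(w2**2))  # L2 cost: 1.0 vs 0.25 -> w2 preferred
```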
322
+00:24:25,349 --> 00:24:29,529
+OK, now I'm going to go into the second

323
+00:24:29,529 --> 00:24:34,629
+classifier, the softmax classifier, and
+we'll see some differences between the support

324
+00:24:34,630 --> 00:24:38,070
+vector machine and this softmax
+classifier. In practice, these are kind of

325
+00:24:38,069 --> 00:24:41,369
+the two choices you have, either an SVM
+or a softmax, as the most

326
+00:24:41,369 --> 00:24:47,629
+commonly used linear classifiers. Often
+you'll see that softmax is preferred, and

327
+00:24:47,630 --> 00:24:51,480
+I'm not exactly sure why, because they
+usually end up working about the same. I'd

328
+00:24:51,480 --> 00:24:54,420
+just like to mention that this is also
+sometimes called multinomial logistic

329
+00:24:54,420 --> 00:24:57,019
+regression, so if you're familiar with
+logistic regression, this is just the

330
+00:24:57,019 --> 00:25:00,190
+generalization of it into multiple
+dimensions, or in this case multiple

331
+00:25:00,190 --> 00:25:12,009
+classes. [Question over there.]

332
+00:25:12,009 --> 00:25:32,150
+Why do we want low W's? We'd like to
+pick between the W's in some way, and I

333
+00:25:32,150 --> 00:25:36,820
+think what we're going for is that
+wanting low W's is a reasonable way to pick

334
+00:25:36,819 --> 00:25:42,700
+among them, and L2 will favor diffuse
+W's, like in this case here. And

335
+00:25:42,700 --> 00:25:47,900
+one of the intuitive ways I can
+try to pitch why this is a good idea is

336
+00:25:47,900 --> 00:25:54,290
+that with diffuse weights... basically, see,
+this w1 is completely ignoring your inputs

337
+00:25:54,289 --> 00:25:58,220
+two, three and four, but w2 is using all
+of the inputs, because the weights

338
+00:25:58,220 --> 00:26:04,480
+are diffuse. And intuitively this just
+usually ends up working better at test

339
+00:26:04,480 --> 00:26:10,150
+time, because more evidence is being
+accumulated into your decisions, instead

340
+00:26:10,150 --> 00:26:21,470
+of just one single piece of evidence, one
+single feature. That's right.

341
+00:26:21,470 --> 00:26:28,140
+That's right. So the idea here is that
+these two, w1 and w2, are

342
+00:26:28,140 --> 00:26:32,630
+achieving the same effect on the data
+loss, so the data loss

343
+00:26:32,630 --> 00:26:35,650
+doesn't distinguish between the two, but
+the regularization expresses a preference

344
+00:26:35,650 --> 00:26:39,169
+between them, and since we have the full
+objective and we end up optimizing

345
+00:26:39,169 --> 00:26:42,240
+over this loss function, we're going to
+find the W that simultaneously

346
+00:26:42,240 --> 00:26:46,659
+accomplishes both of those. So we end up
+with a W that not only classifies correctly

347
+00:26:46,659 --> 00:26:50,360
+but also has the added property that we
+wanted: it will

348
+00:26:50,359 --> 00:27:05,668
+be as diffuse as possible. [Question about
+L1.] It could also be different with L1; L1 has some nice

349
+00:27:05,669 --> 00:27:09,240
+properties that I don't want to go into
+right now; we might cover it later. L

350
+00:27:09,240 --> 00:27:16,579
+1 has sparsity-inducing properties:
+if you end up

351
+00:27:16,579 --> 00:27:20,240
+having L1 in your objective, you'll
+find that lots of W entries end up being

352
+00:27:20,240 --> 00:27:25,329
+exactly zero, for reasons that we might
+go into later.
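For contrast, the two penalties side by side on the same made-up weights; with L1 in the objective many entries tend toward exactly zero (the sparsity mentioned above), while L2 prefers many small ones:

```python
import numpy as np

W = np.array([0.5, -0.2, 0.0, 0.1])
print(np.sum(np.abs(W)))   # L1 penalty: 0.8
print(np.sum(W ** 2))      # L2 penalty: 0.3
```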
353
+00:27:25,329 --> 00:27:30,629
+That sometimes acts almost like feature
+selection, so L1 is another alternative

354
+00:27:30,630 --> 00:27:45,760
+that we might go into a bit more later.

355
+00:27:45,759 --> 00:27:54,220
+[Question:] isn't it maybe a good thing
+that we're ignoring features and just using

356
+00:27:54,220 --> 00:28:02,960
+one of them? Yeah, there are many technical
+reasons why regularization is a good idea; I

357
+00:28:02,960 --> 00:28:09,090
+wanted to give you just the basic intuition,
+so that may not be the full story, but I think

358
+00:28:09,089 --> 00:28:59,740
+that's a fair point; there may be cases
+where ignoring some features is fine.

359
+00:28:59,740 --> 00:29:25,980
+If you look at learning theory, and you
+saw some of that in CS229,

360
+00:29:25,980 --> 00:29:29,710
+there are some results on why
+regularization is a good idea in

361
+00:29:29,710 --> 00:29:33,650
+those areas, but I don't think I'll
+go into that; it's also beyond the

362
+00:29:33,650 --> 00:29:37,610
+scope of this class. For this class, just
+trust that regularization will make your

363
+00:29:37,609 --> 00:29:44,139
+test error better. OK, so now on to the
+softmax classifier, which is just a

364
+00:29:44,140 --> 00:29:49,309
+generalization of logistic regression.
+The way this will work is: it's

365
+00:29:49,308 --> 00:29:53,049
+just a different functional form for
+how the loss is specified on top of the

366
+00:29:53,049 --> 00:29:58,539
+scores. In particular, there's an
+interpretation that this classifier puts on

367
+00:29:58,539 --> 00:30:02,170
+top of the scores: these are not just
+some arbitrary scores where we want

368
+00:30:02,170 --> 00:30:05,769
+margins to be met, but we have a specific
+interpretation that is maybe more

369
+00:30:05,769 --> 00:30:10,549
+principled, from a probabilistic
+point of view, where we actually

370
+00:30:10,549 --> 00:30:14,490
+interpret the scores not just as things
+that need margins; these are

371
+00:30:14,490 --> 00:30:17,880
+actually the unnormalized log
+probabilities assigned to the

372
+00:30:17,880 --> 00:30:23,140
+different classes. OK, we're going to
+go into exactly what this means in a bit:

373
+00:30:23,140 --> 00:30:28,880
+these are unnormalized log probabilities
+of the classes y given the image. In

374
+00:30:28,880 --> 00:30:34,490
+other words, we are assuming the scores
+are unnormalized log probabilities. Then the

375
+00:30:34,490 --> 00:30:38,799
+way to get probabilities of the classes
+is that we take the

376
+00:30:38,799 --> 00:30:39,690
+scores,

377
+00:30:39,690 --> 00:30:45,029
+exponentiate all of them to get the
+unnormalized probabilities, and we normalize

378
+00:30:45,029 --> 00:30:48,849
+them to get the normalized
+probabilities: we divide by the sum

379
+00:30:48,849 --> 00:30:54,209
+over all the exponentiated scores, and
+that's how we actually get this

380
+00:30:54,210 --> 00:30:58,240
+expression for the probability of a class
+given the image. And this function

381
+00:30:58,240 --> 00:31:02,880
+here is called the softmax function: e to
+the

382
+00:31:02,880 --> 00:31:07,840
+element we're currently interested in,
+divided by the sum over all exponentiated

383
+00:31:07,839 --> 00:31:11,918
+scores. The way this will work, basically,
+is that in this probabilistic

384
+00:31:11,919 --> 00:31:13,040
+framework we're really saying
385
+00:31:13,039 --> 00:31:16,869
+that these are the probabilities of the
+different classes, and that

386
+00:31:16,869 --> 00:31:19,619
+makes sense in terms of what you really
+want to do in this setting:

387
+00:31:19,619 --> 00:31:23,809
+we have probabilities over the different
+classes, one of them is correct, so we want to

388
+00:31:23,809 --> 00:31:25,429
+maximize the log-likelihood

389
+00:31:25,430 --> 00:31:32,900
+of the true class; and since we want a
+loss function, we want to minimize the

390
+00:31:32,900 --> 00:31:38,140
+negative log-likelihood of the true

391
+00:31:38,140 --> 00:31:42,980
+class. So you end up with a series of

392
+00:31:42,980 --> 00:31:46,599
+expressions here. Really, our loss
+function says: you want the log-likelihood of

393
+00:31:46,599 --> 00:31:51,169
+the correct class to be high, so the
+negative of it should be low, and the

394
+00:31:51,170 --> 00:31:54,820
+log-likelihood is this expression over the
+scores. Let's look at a specific example

395
+00:31:54,819 --> 00:32:00,599
+to make this more concrete. Here I've
+actually plugged in that expression, so

396
+00:32:00,599 --> 00:32:04,839
+this is the loss, negative log of that
+expression; let's look at how this

397
+00:32:04,839 --> 00:32:07,859
+works, and I think it'll give
+you a better intuition of exactly what

398
+00:32:07,859 --> 00:32:12,009
+it's doing and what it's computing. So
+suppose here we have these scores

399
+00:32:12,009 --> 00:32:16,379
+that came out of our neural network or
+our linear classifier; these are

400
+00:32:16,380 --> 00:32:19,780
+the unnormalized log probabilities. As I
+mentioned, we want to exponentiate them

401
+00:32:19,779 --> 00:32:22,879
+first, because under this interpretation
+that gives us the unnormalized

402
+00:32:22,880 --> 00:32:28,150
+probabilities, and these always need to sum
+to one, so we have to divide by the sum of

403
+00:32:28,150 --> 00:32:33,310
+all of these: we add up these guys and
+we divide, to actually get probabilities out.

404
+00:32:33,309 --> 00:32:37,609
+Under this interpretation, we've carried
+out this set of transformations, and what

405
+00:32:37,609 --> 00:32:41,219
+it's saying is that the probability
+assigned

406
+00:32:41,220 --> 00:32:47,029
+to this image being a cat is 13%, car is
+87%, and frog is very unlikely, 0%.

407
+00:32:47,029 --> 00:32:51,399
+These are the probabilities, and now,
+normally in this setting, you want to

408
+00:32:51,400 --> 00:32:54,960
+maximize the log probability, because it
+turns out that maximizing just the raw

409
+00:32:54,960 --> 00:32:58,049
+probability is not as nice
+mathematically, so almost always you see

410
+00:32:58,049 --> 00:33:03,460
+people maximizing log probabilities; and
+for a loss, you minimize the negative log probability.

411
+00:33:03,460 --> 00:33:08,850
+So the correct class here is cat, which
+only has a 13 percent chance

412
+00:33:08,849 --> 00:33:14,679
+under this interpretation, so negative
+log of 0.13 gets us 0.89, and

413
+00:33:14,680 --> 00:33:21,180
+that's the final loss we would achieve
+for this example under

414
+00:33:21,180 --> 00:33:25,529
+this interpretation of the classifier:
+0.89.

415
+00:33:25,529 --> 00:33:32,869
+Now let's go over some questions
+related to this.
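The worked example above, checked in numpy with the quoted scores; note that the quoted 0.89 matches -log10(0.13), while the natural log, the more common convention in code, gives about 2.04:

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])              # cat, car, frog (assumed from the slide)
probs = np.exp(scores) / np.sum(np.exp(scores))  # softmax
print(probs.round(2))                            # -> [0.13 0.87 0.  ]
print(-np.log10(probs[0]), -np.log(probs[0]))    # -> ~0.89 and ~2.04
```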
416
00:33:32,869 --> 00:33:34,219
+they'll help us interpret exactly how this works

417
00:33:34,220 --> 00:33:38,519
+first what is the min and max possible
loss with this loss function so with this

418
00:33:38,519 --> 00:33:44,460
+loss function what is the smallest value
and the highest value think about

419
00:33:44,460 --> 00:33:49,809
+this what is the smallest value that we
can achieve zero and how would that happen

420
00:33:49,809 --> 00:33:57,220
+how can we get zero if your correct class is
getting a probability of one where we have

421
00:33:57,220 --> 00:34:02,890
+a one going into the log and we're getting
negative log of one which is zero and the

422
00:34:02,890 --> 00:34:09,030
+highest possible loss so just as with the SVM
we're getting the same zero as minimum

423
00:34:09,030 --> 00:34:14,250
+and infinity as maximum so infinite loss
would be achieved if you end up giving

424
00:34:14,250 --> 00:34:18,769
+your cat score a very tiny probability and
then log of 0 gives you negative

425
00:34:18,769 --> 00:34:24,679
+infinity so negative of that is just
infinity so yeah the same bounds as

426
00:34:24,679 --> 00:34:28,159
+the SVM and there's also this question

427
00:34:28,159 --> 00:34:33,440
+normally when we initialize W with
roughly small weights we wind up with

428
00:34:33,440 --> 00:34:37,550
+all these scores being nearly zero what ends
up being the loss in this case

429
00:34:37,550 --> 00:34:40,419
+check this at the beginning of your
optimization what do you expect to see

430
00:34:40,418 --> 00:34:47,000
+as your first loss

431
00:34:47,000 --> 00:34:59,449
+one over the number of classes right so you
get all zeros here you get all

432
00:34:59,449 --> 00:35:04,139
+ones here and so here it's one over the
number of classes and then you take the

433
00:35:04,139 --> 00:35:07,599
+negative log of that and that's your final
loss so actually for myself whenever

434
00:35:07,599 --> 00:35:11,569
+I run an optimization I sometimes take
note of my number of classes and I

435
00:35:11,570 --> 00:35:14,970
+evaluate negative log of one over the number
of classes and I'm trying to see what

436
00:35:14,969 --> 00:35:18,429
+first beginning loss I expect and so when
I start the optimization I make

437
00:35:18,429 --> 00:35:21,159
+sure that I'm getting roughly that
otherwise I know some things may be

438
00:35:21,159 --> 00:35:24,399
+slightly off I expect to get something on
that order

439
00:35:24,400 --> 00:35:28,630
+moreover as I'm optimizing I expect to
go from that towards zero and if I'm seeing

440
00:35:28,630 --> 00:35:31,039
+negative numbers then I know from the
functional form that something very

441
00:35:31,039 --> 00:35:32,590
+strange is going on right

442
00:35:32,590 --> 00:35:37,070
+you never actually expect to get negative
numbers out of this softmax loss

443
00:35:37,070 --> 00:35:40,630
+I'll show you one more slide and then take
some questions just to reiterate the

444
00:35:40,630 --> 00:35:44,599
+difference between them and really what
they look like so we have the score

445
00:35:44,599 --> 00:35:48,909
+function which gives us W times x we get
our scores vector and now the difference

446
00:35:48,909 --> 00:35:54,420
+is just how they interpret the
scores coming out from this function

447
00:35:54,420 --> 00:35:58,500
+so the SVM just treats these as scores with
no interpretation whatsoever we just want

448
00:35:58,500 --> 00:36:02,710
+the score of the correct class
to be some margin above 
the + +449 +00:36:02,710 --> 00:36:07,240 +incorrect course or interpreted to be +these unless lott probabilities and then + +450 +00:36:07,239 --> 00:36:10,569 +in this framework we first went to get +the probabilities and then we want to + +451 +00:36:10,570 --> 00:36:14,450 +maximize the public in the crack losses +or the log of them and so that ends up + +452 +00:36:14,449 --> 00:36:19,250 +giving us the loss function or something +so they start off at the same way but + +453 +00:36:19,250 --> 00:36:22,780 +they just happened to get to the +difference less results we're going to + +454 +00:36:22,780 --> 00:36:31,150 +exactly what the differences are in a +bit there are questions + +455 +00:36:31,150 --> 00:36:41,579 +they take your classified as near +instantaneous to evaluate most of the + +456 +00:36:41,579 --> 00:36:45,949 +work is done in the convolutions and so +will see that the classifier and + +457 +00:36:45,949 --> 00:36:51,629 +especially the losses roughly the same +of course South max involve some XP and + +458 +00:36:51,630 --> 00:36:56,200 +so on so these operations are slightly +more expensive perhaps but usually it + +459 +00:36:56,199 --> 00:36:57,439 +completely washes away + +460 +00:36:57,440 --> 00:36:59,320 +compared to everything else you're +worried about which is all the + +461 +00:36:59,320 --> 00:37:15,260 +competitions over the image of God + +462 +00:37:15,260 --> 00:37:32,600 +probably + +463 +00:37:32,599 --> 00:37:42,210 +exact same problem and so maximizing the +property and maximizing the locality + +464 +00:37:42,210 --> 00:37:46,119 +give you the identical result but in +terms of the match everything comes out + +465 +00:37:46,119 --> 00:37:49,279 +too much nicer looking when you actually +put a lot of there but it's the exact + +466 +00:37:49,280 --> 00:37:51,310 +same optimization problem + +467 +00:37:51,309 --> 00:37:56,539 +ok let's get some interpretations of +these two and exactly how they differ + +468 +00:37:56,539 --> 00:38:01,230 +max vs SEM and trying to give you an +idea about one property that actually + +469 +00:38:01,230 --> 00:38:03,559 +quite different between the two + +470 +00:38:03,559 --> 00:38:08,059 +these two different functional analysis +team that we have these three examples + +471 +00:38:08,059 --> 00:38:12,710 +all three examples and suppose there are +three closest three different examples + +472 +00:38:12,710 --> 00:38:15,980 +and these are discourse of these +examples for every one of these examples + +473 +00:38:15,980 --> 00:38:19,659 +the first class here is the correct +class so 10 is the correct class score + +474 +00:38:19,659 --> 00:38:24,509 +and the other scores are these guys +either the first one second or third one + +475 +00:38:24,510 --> 00:38:30,970 +and now just think about how would these +losses tell you about how desirable + +476 +00:38:30,969 --> 00:38:36,480 +outcomes are in terms of that w in +particular one way to think about it for + +477 +00:38:36,480 --> 00:38:39,530 +example is suppose I think this data +point the third one tenth of a hundred + +478 +00:38:39,530 --> 00:38:44,700 +and eight hundred and suppose I jiggle +it move it around a bit and my input + +479 +00:38:44,699 --> 00:38:58,159 +space what is happening to the losses as +I do that + +480 +00:38:58,159 --> 00:39:03,339 +I do so they increase and decrease as I +would go around do they both increase or + +481 +00:39:03,340 --> 00:39:10,050 +decrease for the third appointment for +example as me and remains the same + +482 
+00:39:10,050 --> 00:39:13,740 +correct and why is that it's because the +margin was met by a huge amount so + +483 +00:39:13,739 --> 00:39:17,659 +there's just added robustness when I +take the day off on a sheet around the + +484 +00:39:17,659 --> 00:39:22,379 +SVM is already very happy because the +margins were met by you know we desire + +485 +00:39:22,380 --> 00:39:27,809 +margin of one and here we have a margin +of two hundred and there's a huge margin + +486 +00:39:27,809 --> 00:39:32,299 +ESPN doesn't express a preference over +these examples where this course come + +487 +00:39:32,300 --> 00:39:37,010 +out very negative ads no additional +preference over do I want to be negative + +488 +00:39:37,010 --> 00:39:43,890 +2009 200,000 PSP and wound care but the +s but the South max could always see you + +489 +00:39:43,889 --> 00:39:46,659 +will always get an improvement for +something that's right so soft max + +490 +00:39:46,659 --> 00:39:49,480 +function express a preference for +everyone's needs to be negative about + +491 +00:39:49,480 --> 00:39:53,590 +two hundred or five hundred or thousand +of them will give you better loss right + +492 +00:39:53,590 --> 00:39:58,530 +but the SVM at this point doesn't care +if the other examples I don't know if + +493 +00:39:58,530 --> 00:40:03,320 +it's as clear distinction rights of the +FBI has decided robustness to it once + +494 +00:40:03,320 --> 00:40:07,120 +this margin to be met but beyond that it +doesn't micromanage your course where + +495 +00:40:07,119 --> 00:40:11,400 +soft max will always want peace course +to be you know everything here nothing + +496 +00:40:11,400 --> 00:40:15,300 +there and so that's one kind of very +clear difference between the two + +497 +00:40:15,300 --> 00:40:20,548 +it was a question + +498 +00:40:20,548 --> 00:40:28,568 +yes the margin of one I mentioned very +briefly that that's not a hyper primary + +499 +00:40:28,568 --> 00:40:34,528 +you can fix it to be one reason for that +is that lease course they're the kind of + +500 +00:40:34,528 --> 00:40:40,048 +the absolute values of those course are +kind of don't really mattered because my + +501 +00:40:40,048 --> 00:40:45,088 +W I can make it a larger or smaller and +I can achieve different sizes course and + +502 +00:40:45,088 --> 00:40:49,759 +so one turns out to work better and in +the notes I have a longer duration go + +503 +00:40:49,759 --> 00:40:54,699 +into details exactly why one is safe to +choose so refer to that but i dont wanna + +504 +00:40:54,699 --> 00:41:03,239 +spend time on it in like 20 would be if +you wanna 20 there would be trouble you + +505 +00:41:03,239 --> 00:41:07,358 +can use any positive number and that +would give you a nice p.m. if he was 0 + +506 +00:41:07,358 --> 00:41:14,328 +that would look different + +507 +00:41:14,329 --> 00:41:18,259 +for example this added constant there +one property gives you when you actually + +508 +00:41:18,259 --> 00:41:21,920 +go through the mathematical analysis +likes in the ass p.m. 
in CST 29 as + +509 +00:41:21,920 --> 00:41:26,269 +you'll see that the chief suspects +margin property where the Eskimo playing + +510 +00:41:26,268 --> 00:41:29,698 +that the best margin when you actually +have a plus + +511 +00:41:29,699 --> 00:41:33,539 +constant their combined with the altar +regularization on the way it's very + +512 +00:41:33,539 --> 00:41:38,499 +small weights that meet specific margin +and as well give you this very nice mix + +513 +00:41:38,498 --> 00:41:42,259 +margin property that I didn't really +going to in this in this lecture right + +514 +00:41:42,259 --> 00:41:46,818 +now but I basically do want a positive +number there otherwise things would + +515 +00:41:46,818 --> 00:41:51,480 +break + +516 +00:41:51,480 --> 00:42:14,780 +numbers that are real numbers and we're +kind of free to get in this course out + +517 +00:42:14,780 --> 00:42:18,200 +and it's up to you to endow them with +interpretation right we can have + +518 +00:42:18,199 --> 00:42:21,669 +different losses in this specific case I +showed you the closest p.m. there's + +519 +00:42:21,670 --> 00:42:25,180 +multiple versions of a multi-class SVM +you can paddle around with exactly the + +520 +00:42:25,179 --> 00:42:30,750 +Los expression in one of the one of the +interpretations we can put on this + +521 +00:42:30,750 --> 00:42:34,510 +course then there'd be some normalized +block probably say they can't be + +522 +00:42:34,510 --> 00:42:37,590 +normalized because they just came we +have to explicitly because there's no + +523 +00:42:37,590 --> 00:42:42,180 +constraint that the output of your +function will be normalized and they + +524 +00:42:42,179 --> 00:42:45,579 +have to be the camp probably because +you're out that in just his real numbers + +525 +00:42:45,579 --> 00:42:51,309 +that can be positive or negative so we +interpret them as a problem peace and + +526 +00:42:51,309 --> 00:42:52,699 +and done + +527 +00:42:52,699 --> 00:42:58,329 +requires us to treat them some very bad +kind of explanation of it but I think + +528 +00:42:58,329 --> 00:43:05,889 +he's got + +529 +00:43:05,889 --> 00:43:57,139 +energy and losses like kind of an +equivalent of all about what you're + +530 +00:43:57,139 --> 00:44:05,690 +saying look at this one here right here +saying if I googled this around + +531 +00:44:05,690 --> 00:44:09,460 +nothing's changing I think the +difference is the loss would definitely + +532 +00:44:09,460 --> 00:44:12,800 +change for max even though it wouldn't +change a lot but I would definitely + +533 +00:44:12,800 --> 00:44:16,660 +change the subject to express preference +whereas p.m. 
guess you identically zero + +534 +00:44:16,659 --> 00:44:27,339 +wouldn't be very big blunder differently +as preference but in practice basically + +535 +00:44:27,340 --> 00:44:32,720 +this distinction the interaction of +trying to give you is that the SPM has a + +536 +00:44:32,719 --> 00:44:38,469 +very local part of the space immature +classifying that it cares about and + +537 +00:44:38,469 --> 00:44:40,279 +beyond it + +538 +00:44:40,280 --> 00:44:43,700 +its environment and a soft max kind of +physical action of the full data cloud + +539 +00:44:43,699 --> 00:44:48,129 +it cares about it cares about all the +points in your data cloud not just you + +540 +00:44:48,130 --> 00:44:50,590 +know there's like a small class here +that you're trying to separate out from + +541 +00:44:50,590 --> 00:44:51,410 +everything else + +542 +00:44:51,409 --> 00:44:55,659 +assault Maxwell kind of concerned the +full data closet getting your plane and + +543 +00:44:55,659 --> 00:44:59,059 +SPM just want to separate out that tiny +piece from the immediate part of the + +544 +00:44:59,059 --> 00:45:04,219 +data cloud like that in practice when +you actually run the state can give + +545 +00:45:04,219 --> 00:45:09,569 +nearly identical results almost always +so really when trying to I'm not trying + +546 +00:45:09,570 --> 00:45:12,640 +to pitch one or the other I'm just +trying to give you this notion that + +547 +00:45:12,639 --> 00:45:16,809 +you're in charge of the loss function +you get some scores out and you can + +548 +00:45:16,809 --> 00:45:19,199 +write down nearly any mathematical +expression + +549 +00:45:19,199 --> 00:45:23,279 +is differentiable into what you want +your scores to be like and there are + +550 +00:45:23,280 --> 00:45:26,619 +different ways of actually formulating +this and actually two examples that are + +551 +00:45:26,619 --> 00:45:30,579 +coming to see practice but in practice +we can put down any losses for what you + +552 +00:45:30,579 --> 00:45:34,619 +want your scores to be and that's a very +nice picture because we can optimize + +553 +00:45:34,619 --> 00:45:46,700 +overall let me show you an interactive +web them at this point + +554 +00:45:46,699 --> 00:45:54,289 +alright see this so this is an +interactive seminar class page you can + +555 +00:45:54,289 --> 00:45:58,409 +find it at this URL I wrote it last year +and I have to show it to all of you guys + +556 +00:45:58,409 --> 00:46:04,279 +to justify spending one day on +developing ok but some that your last + +557 +00:46:04,280 --> 00:46:12,440 +year not too many people looked at this +vehicle is one day of my life so we have + +558 +00:46:12,440 --> 00:46:18,000 +here is a two-dimensional problem with +three classes and I'm showing here three + +559 +00:46:18,000 --> 00:46:22,139 +classes each has three examples over +here in two dimensions and I'm showing + +560 +00:46:22,139 --> 00:46:24,969 +the three classifiers here because the +level set aside for example the red + +561 +00:46:24,969 --> 00:46:29,659 +classifier is as scores of 0 along the +line and then I'm showing the arrows + +562 +00:46:29,659 --> 00:46:35,509 +which scores increased right here's RW +matrix so as you recall the W matrix + +563 +00:46:35,510 --> 00:46:38,609 +double rows of that w matrix are the +different classifiers so we have the + +564 +00:46:38,608 --> 00:46:42,289 +blue classifier red and green classifier +and Brett classifier and we have both + +565 +00:46:42,289 --> 00:46:47,349 +the weights for both the X&Y component +and also 
the bias and then here we have + +566 +00:46:47,349 --> 00:46:50,609 +the data said so we have the X&Y +coordinates of all the data points there + +567 +00:46:50,608 --> 00:46:55,779 +correct label and thus course as well as +the loss achieved by all those data + +568 +00:46:55,780 --> 00:46:59,769 +points right now with this setting up w +and so you can see that I'm taking the + +569 +00:46:59,769 --> 00:47:04,568 +mean overall the loss so right now our +data losses 2.77 regularization loss for + +570 +00:47:04,568 --> 00:47:08,509 +this w is 3.5 and talk hola 6.27 + +571 +00:47:08,510 --> 00:47:14,810 +and so basically you can fiddle around +with this so so as I change my W you can + +572 +00:47:14,809 --> 00:47:19,328 +see that here I'm making my W one of the +WC bigger and you can see what that does + +573 +00:47:19,329 --> 00:47:25,940 +in in their order bias you can see the +bias basically shut these high plains + +574 +00:47:25,940 --> 00:47:32,639 +okay and then what we can do is we can +we're going to work this kind of a + +575 +00:47:32,639 --> 00:47:35,848 +preview of what's going to happen we're +getting the loss here and there were + +576 +00:47:35,849 --> 00:47:38,829 +going to do back propagation which has +given us the gradient over how we want + +577 +00:47:38,829 --> 00:47:44,359 +to adjust these W it's in order to make +the law smaller and so we're going to do + +578 +00:47:44,358 --> 00:47:48,838 +is this repeated states where we start +off with this w but now I can improve I + +579 +00:47:48,838 --> 00:47:54,460 +can improve this set of W's so when I do +a perimeter update this is actually + +580 +00:47:54,460 --> 00:47:57,568 +using these gradients which are shown +here in the right now and it's actually + +581 +00:47:57,568 --> 00:47:59,900 +making a tiny changed everything + +582 +00:47:59,900 --> 00:48:03,088 +according to this gradient right so as I +do + +583 +00:48:03,088 --> 00:48:07,699 +primary update you can see that the loss +here is decreasing special the total + +584 +00:48:07,699 --> 00:48:11,338 +loss here so the lost just keeps getting +better and better as I do primary date + +585 +00:48:11,338 --> 00:48:16,639 +so this is the process of optimization +that we're going to go into in a bit so + +586 +00:48:16,639 --> 00:48:20,989 +I can also start a repeated update and +then basically we keep improving this w + +587 +00:48:20,989 --> 00:48:24,808 +over and over until our loss it started +off was roughly three or something that + +588 +00:48:24,809 --> 00:48:29,579 +you are mean loss over the data is point +one like that and we're correctly + +589 +00:48:29,579 --> 00:48:39,068 +classifying all these buttons here so I +can also randomized randomized W so just + +590 +00:48:39,068 --> 00:48:41,980 +kind of knocks it off and then there's +always converges these acting point + +591 +00:48:41,980 --> 00:48:47,650 +through the process optimization and you +can play here with the regularization as + +592 +00:48:47,650 --> 00:48:51,730 +well you have different forms of loss so +the one I shown you right now is there + +593 +00:48:51,730 --> 00:48:55,990 +was a consensus p.m. 
formulation there's +a few more SPM formulations and there's + +594 +00:48:55,989 --> 00:49:01,098 +also soft max here you'll see that when +I Swisher soft max loss our losses look + +595 +00:49:01,099 --> 00:49:06,670 +different and but the solution and are +being roughly the same so when I switch + +596 +00:49:06,670 --> 00:49:10,700 +back to him you know the type of players +move around the tiny bit but really it's + +597 +00:49:10,699 --> 00:49:21,558 +it's mostly the same and so this is just +a size so this is how much how big steps + +598 +00:49:21,559 --> 00:49:25,650 +are we making when we get the gradient +on how to improve things so much promise + +599 +00:49:25,650 --> 00:49:29,119 +we should start with the very biggest +upside the scenes are giggling trying to + +600 +00:49:29,119 --> 00:49:32,309 +separate out these data points and then +over time we're going to be doing in a + +601 +00:49:32,309 --> 00:49:36,430 +position as we're going to decrease our +updates eyes and this thing or just + +602 +00:49:36,429 --> 00:49:43,298 +slowly converging on the premise that we +want in the end and so so you can play + +603 +00:49:43,298 --> 00:49:47,170 +with us and you can see how he scores to +go around and what the losses and if I + +604 +00:49:47,170 --> 00:49:53,358 +stop repeated update you can also drag +these points but I think on the Mac it + +605 +00:49:53,358 --> 00:49:58,598 +doesn't work so I tried to drag this +point it disappears so good + +606 +00:49:58,599 --> 00:50:02,479 +but it works on a desktop so I don't go +in and figure out exactly what happened + +607 +00:50:02,478 --> 00:50:14,480 +there but they can play with this + +608 +00:50:14,480 --> 00:50:30,840 +we have as mean loss over data plus +regularization this is one other diagram + +609 +00:50:30,840 --> 00:50:35,240 +to show you how did what this looks like +I don't think it's a very good diagram + +610 +00:50:35,239 --> 00:50:38,858 +and there's something confusing about it +that I can't remember from last year but + +611 +00:50:38,858 --> 00:50:45,269 +basically you have this data and why +your images your labels and there's W + +612 +00:50:45,269 --> 00:50:49,719 +and keeping this course and getting the +lawsuit and the regularization losses + +613 +00:50:49,719 --> 00:50:54,939 +only function of the weights not of the +data and mister what we want to do now + +614 +00:50:54,940 --> 00:50:58,608 +is we don't have control over the data +set right that's given to us we have + +615 +00:50:58,608 --> 00:51:04,130 +control over that w and as we changed at +W the loss will be different so for any + +616 +00:51:04,130 --> 00:51:08,340 +W give me I can compute the loss and +that lost is linked to how well we're + +617 +00:51:08,340 --> 00:51:12,730 +classifying all of our examples so one +thing a low loss means world-class find + +618 +00:51:12,730 --> 00:51:15,880 +them very very well on the training data +and then we're crossing our fingers that + +619 +00:51:15,880 --> 00:51:20,809 +also works on some test data that we +haven't seen so here's one strategy for + +620 +00:51:20,809 --> 00:51:26,139 +optimization it's a random search so +because we can evaluate loss for any + +621 +00:51:26,139 --> 00:51:30,500 +arbitrary W when I can afford to do and +I'm not sure if i dont im go through + +622 +00:51:30,500 --> 00:51:34,480 +this in full detail but effectively I +randomly sampled and I can check their + +623 +00:51:34,480 --> 00:51:37,460 +loss and I can just keep track of the W +that works best + +624 
+00:51:37,460 --> 00:51:43,090
+ok so that's an amusing process of
optimization guess and check and it

625
00:51:43,090 --> 00:51:46,760
+turns out if you do this I think I tried
a thousand times if you do this a

626
00:51:46,760 --> 00:51:50,970
+thousand times and take the best W found
at random and you run it on your CIFAR-10

627
00:51:50,969 --> 00:51:56,108
+test data you end up
with about 15.5 percent accuracy and

628
00:51:56,108 --> 00:52:01,150
+since there are ten classes the
baseline is at 10% chance

629
00:52:01,150 --> 00:52:06,559
+performance so at 15.5 there's some signal
there actually and the state of the art

630
00:52:06,559 --> 00:52:10,219
+is at ninety-five which is a convnet
so we have some gap to close over

631
00:52:10,219 --> 00:52:10,980
+the next

632
00:52:10,980 --> 00:52:17,670
+two weeks or so so don't use
this just because it's on the slides one

633
00:52:17,670 --> 00:52:21,659
+interpretation of exactly what this
process of optimization looks like is

634
00:52:21,659 --> 00:52:25,399
+that we have this loss landscape right
this loss landscape is in this high

635
00:52:25,400 --> 00:52:32,619
+dimensional W space so here we see it
in 3d and your loss is the height then

636
00:52:32,619 --> 00:52:38,369
+you only have two W's in this case and
you're here and you're this blindfolded person you

637
00:52:38,369 --> 00:52:42,269
+can't see where the valleys are but you're
trying to find low loss as you're

638
00:52:42,269 --> 00:52:45,699
+blindfolded and you have an altitude
meter and so you can tell what your

639
00:52:45,699 --> 00:52:49,029
+loss is at any single point and you're
trying to get to the bottom of the

640
00:52:49,030 --> 00:52:55,430
+valley right and so that's really the
process of optimization and what I've

641
00:52:55,429 --> 00:52:59,399
+shown you so far is this
random optimization where you teleport

642
00:52:59,400 --> 00:53:03,309
+around and you just check your altitude
right so not the best idea so what we're

643
00:53:03,309 --> 00:53:06,940
+going to do instead is we're going to
use what I refer to as the gradient or

644
00:53:06,940 --> 00:53:12,800
+really we're just computing the slope
along every single direction so I'm

645
00:53:12,800 --> 00:53:17,990
+trying to compute the slope and then I'm
going to go downhill ok so we're

646
00:53:17,989 --> 00:53:21,289
+following the slope I'm not going to go
into too much detail on this but

647
00:53:21,289 --> 00:53:24,779
+basically there's an expression for the
gradient which is defined like that

648
00:53:24,780 --> 00:53:31,859
+there's the derivative definition from
calculus 101 and in multiple dimensions if

649
00:53:31,858 --> 00:53:35,409
+you have a vector of derivatives
that's referred to as the gradient right

650
00:53:35,409 --> 00:53:39,589
+so because we have multiple dimensions
multiple W's we have a gradient vector

651
00:53:39,590 --> 00:53:45,660
+ok so this is the expression and in fact
we can numerically evaluate this

652
00:53:45,659 --> 00:53:48,769
+expression before I go into the analytic
version I'll show you what it would look

653
00:53:48,769 --> 00:53:54,190
+like to evaluate the gradient at some W
suppose we have some current W and we're

654
00:53:54,190 --> 00:53:58,500
+getting some loss ok and what we want now
is to get an idea about the slope here.

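Backing up for a moment, a throwaway sketch of the random-search strategy described above, assuming hypothetical X_train, y_train arrays and some loss function L; as stated, don't actually use this:

    bestloss, bestW = float('inf'), None
    for _ in range(1000):
        W = np.random.randn(10, 3073) * 0.0001  # random CIFAR-10-sized weights
        loss = L(X_train, y_train, W)           # L, X_train, y_train assumed given
        if loss < bestloss:                     # keep the best W seen so far
            bestloss, bestW = loss, W
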
655
00:53:58,500 --> 00:54:03,239
+so at this point we're going to
basically look at this formula and we're

656
00:54:03,239 --> 00:54:07,329
+just going to evaluate it so I'm going to
go in the first dimension and I'm going

657
00:54:07,329 --> 00:54:11,840
+to and really what this is telling you to
do is evaluate f of x plus h your altitude

658
00:54:11,840 --> 00:54:15,590
+at x plus h subtract f of x from it and divide
by h

659
00:54:15,590 --> 00:54:19,800
+what that corresponds to is me being on
this landscape taking a small step in

660
00:54:19,800 --> 00:54:23,130
+some direction and looking whether or
not my foot went up or down

661
00:54:23,130 --> 00:54:27,340
+right that's what the gradient is
telling me so I took a small step and

662
00:54:27,340 --> 00:54:32,150
+the loss there is 1.25 then I can use
that formula with a finite difference

663
00:54:32,150 --> 00:54:36,230
+approximation where we use a small h to
actually derive that the gradient here

664
00:54:36,230 --> 00:54:41,199
+is negative 2.5 the slope is downwards so I
took a step and the loss decreased so

665
00:54:41,199 --> 00:54:45,480
+the slope is downwards in terms of the
loss function so negative 2.5 in that

666
00:54:45,480 --> 00:54:49,369
+particular dimension so I can do this
for every single dimension independently

667
00:54:49,369 --> 00:54:53,210
+right so I go into the second dimension
I add a small amount so I step in a

668
00:54:53,210 --> 00:54:56,869
+different direction I look at what
happened to the loss I use that formula

669
00:54:56,869 --> 00:55:00,969
+and it's telling me that the gradient the
slope is 2.6 I can do that in the third

670
00:55:00,969 --> 00:55:06,429
+dimension and I get the gradient ok so
what I'm referring to here is basically

671
00:55:06,429 --> 00:55:11,149
+evaluating the numerical gradient
which is using this finite difference

672
00:55:11,150 --> 00:55:14,539
+approximation where for every single
dimension independently I can take a

673
00:55:14,539 --> 00:55:18,500
+small step look at the loss and that tells
me the slope is it going upwards or

674
00:55:18,500 --> 00:55:23,829
+downwards for every single one of these
parameters and so this is the numerical

675
00:55:23,829 --> 00:55:28,500
+gradient the way this would look like
is a Python function here it looks ugly

676
00:55:28,500 --> 00:55:32,630
+because it turns out it's slightly
tricky to iterate over all the W's but

677
00:55:32,630 --> 00:55:36,780
+basically we're just adding h
comparing to f of x and dividing by

678
00:55:36,780 --> 00:55:41,200
+h and we're getting the gradient now the
problem with this is if you want to use

679
00:55:41,199 --> 00:55:44,960
+the numerical gradient then of course we
have to do this for every single

680
00:55:44,960 --> 00:55:47,949
+dimension to get a sense of what the
gradient is in every single dimension

681
00:55:47,949 --> 00:55:53,079
+and right when you have a convnet you
have hundreds of millions of parameters

682
00:55:53,079 --> 00:55:58,139
+right so we can't afford to actually
check the loss in hundreds of millions

683
00:55:58,139 --> 00:56:02,920
+of dimensions before we do a single step
so this approach where we would try to

684
00:56:02,920 --> 00:56:06,869
+evaluate the gradient numerically is
approximate because we're using a finite

685
00:56:06,869 --> 00:56:11,119
+difference approximation and second it's
also extremely slow.

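A sketch of what that Python function might look like (the version on the slide differs in details), assuming f is a loss function of a single weight array W:

    def eval_numerical_gradient(f, W, h=1e-5):
        # finite differences: slope along each dimension is (f(W + h) - f(W)) / h
        fW = f(W)
        grad = np.zeros_like(W)
        it = np.nditer(W, flags=['multi_index'])
        while not it.finished:
            ix = it.multi_index
            old = W[ix]
            W[ix] = old + h              # bump one dimension by a small amount
            grad[ix] = (f(W) - fW) / h   # slope in that dimension
            W[ix] = old                  # restore the original value
            it.iternext()
        return grad
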
686
00:56:11,119 --> 00:56:15,460
+I need to do millions of checks on the loss
function on the convnet before I know what the

687
00:56:15,460 --> 00:56:20,519
+gradient is and I can take a parameter update
so it's very slow and approximate it turns

688
00:56:20,519 --> 00:56:26,730
+out that this is also all silly right
because the loss is a function of W as

689
00:56:26,730 --> 00:56:29,800
+we've written it out and really what
we want is we want the gradient of the

690
00:56:29,800 --> 00:56:33,220
+loss with respect to W and luckily we can
just write that down

691
00:56:33,219 --> 00:56:42,598
+thanks to these guys does anyone know who
those guys are that's right you

692
00:56:42,599 --> 00:56:49,400
+know which is which good they
look remarkably similar but basically

693
00:56:49,400 --> 00:56:54,289
+we have these two inventors of
calculus there's actually controversy

694
00:56:54,289 --> 00:56:59,429
+over who really invented calculus and
these guys hated each other over it but

695
00:56:59,429 --> 00:57:03,799
+basically calculus is this powerful
hammer and so what we can do is instead

696
00:57:03,800 --> 00:57:06,440
+of doing this silly thing where we're
evaluating the numerical gradient we can

697
00:57:06,440 --> 00:57:10,230
+actually use calculus and we can write
down an expression for what the gradient

698
00:57:10,230 --> 00:57:14,880
+of the loss function is in weight
space so basically instead of fumbling

699
00:57:14,880 --> 00:57:18,289
+around and checking is it going up or
is it going down by checking the loss I

700
00:57:18,289 --> 00:57:22,509
+just have an expression where I take the
gradient of this and I can simply

701
00:57:22,510 --> 00:57:26,500
+evaluate what the gradient is and that's
the only way that you can actually run

702
00:57:26,500 --> 00:57:30,159
+this in practice right we just write an
expression for the gradient and we can

703
00:57:30,159 --> 00:57:35,149
+do the updates and so on so in summary
basically the numerical gradient is approximate and

704
00:57:35,150 --> 00:57:39,800
+slow but very easy to write because
you're just doing this very simple

705
00:57:39,800 --> 00:57:44,190
+process for any arbitrary loss function
I can get the gradient vector the analytic

706
00:57:44,190 --> 00:57:47,659
+gradient where you actually do
calculus is exact no finite

707
00:57:47,659 --> 00:57:52,210
+difference approximations it's very fast but
it's error-prone because you actually have to

708
00:57:52,210 --> 00:57:57,300
+do math right so in practice what you
see is we always use the analytic gradient

709
00:57:57,300 --> 00:58:01,380
+we do calculus we figure out what the
gradient should be but then you always

710
00:58:01,380 --> 00:58:04,789
+check your implementation using a
numerical gradient check as it's referred

711
00:58:04,789 --> 00:58:10,480
+to so I derive what I think the
loss function gradient should be I write an

712
00:58:10,480 --> 00:58:15,500
+expression for the gradient I evaluate it
in my code so I get the analytic

713
00:58:15,500 --> 00:58:18,769
+gradient and then I also evaluate the
numerical gradient on the side and that

714
00:58:18,769 --> 00:58:22,280
+takes a while but you compare the two
and you make

715
00:58:22,280 --> 00:58:25,890
+sure that those two are the same and
then we say that you passed the gradient

716
00:58:25,889 --> 00:58:29,500
+check ok so that's what you see in
practice.

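A minimal sketch of such a gradient check, reusing the eval_numerical_gradient helper sketched earlier; analytic_grad stands in for whatever expression you derived by hand:

    num_grad = eval_numerical_gradient(loss_fun, W)  # slow but trustworthy
    ana_grad = analytic_grad(W)                      # fast, hopefully correct

    # compare with a relative error; tiny values (say below 1e-6) pass the check
    rel_err = np.abs(num_grad - ana_grad) / np.maximum(1e-8,
              np.abs(num_grad) + np.abs(ana_grad))
    print('max relative error:', rel_err.max())
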
717
00:58:29,500 --> 00:58:32,519
+whenever you try to develop a new
module for a neural network you write

718
00:58:32,519 --> 00:58:35,759
+the forward pass you write the
backward pass that computes the

719
00:58:35,760 --> 00:58:40,250
+gradient and then you have to make sure
you gradient check it just to make sure

720
00:58:40,250 --> 00:58:43,980
+that your calculus is correct and then I
already referred to this process of

721
00:58:43,980 --> 00:58:45,838
+optimization which we saw nicely in the
web demo where we have this

722
00:58:45,838 --> 00:58:49,548
+loop where we optimize where we simply
evaluate the gradient on your loss

723
00:58:49,548 --> 00:58:53,759
+function and then knowing the gradient
we can perform a parameter update where we

724
00:58:53,759 --> 00:58:58,509
+change the W by a tiny amount in particular
we want to update with the negative

725
00:58:58,509 --> 00:59:04,509
+step size times the gradient the
negative is there because the gradient

726
00:59:04,509 --> 00:59:07,478
+tells you the direction of the greatest
increase it tells you which way the

727
00:59:07,478 --> 00:59:10,848
+loss is increasing and we want to minimize
it which is where the negative is

728
00:59:10,849 --> 00:59:14,298
+coming from we want to go in the negative
gradient direction the step size here is a

729
00:59:14,298 --> 00:59:17,818
+hyperparameter that will cause you a huge
amount of headaches the step size or

730
00:59:17,818 --> 00:59:23,298
+learning rate is the most critical
parameter to basically worry about

731
00:59:23,298 --> 00:59:27,778
+really there's two that you have to
worry about the most the step size or

732
00:59:27,778 --> 00:59:31,539
+learning rate and there's the
regularization strength lambda that

733
00:59:31,539 --> 00:59:35,180
+we saw already those two parameters are
really the two largest headaches and

734
00:59:35,179 --> 00:59:45,219
+that's usually what we cross-validate over
there was a question yes but it's not the

735
00:59:45,219 --> 00:59:50,849
+gradient itself the gradient just tells you the
slope in every single direction and then

736
00:59:50,849 --> 00:59:56,109
+we just take a step along it so the
process of optimization in the weight

737
00:59:56,108 --> 01:00:00,768
+space is you're somewhere at your W you
get your gradient and you march some amount

738
01:00:00,768 --> 01:00:05,228
+in the direction of the gradient but you
don't know how much so that's the step

739
01:00:05,228 --> 01:00:08,449
+size and you saw that when I increased
the step size in the demo things were

740
01:00:08,449 --> 01:00:11,248
+jittering around quite a lot
right there was a lot of energy in the

741
01:00:11,248 --> 01:00:15,449
+system that's because I was taking huge
jumps all over the space and so here the

742
01:00:15,449 --> 01:00:19,578
+loss function is minimal at the blue
part there and it's high in the red parts

743
01:00:19,579 --> 01:00:23,920
+so we want to get to the lowest part of the
basin this is actually what the loss

744
01:00:23,920 --> 01:00:28,579
+function looks like for a linear SVM 
or +discretion is our complex problems so + +745 +01:00:28,579 --> 01:00:31,729 +it's really just a bowl and we're trying +to get to the bottom of it but this bowl + +746 +01:00:31,728 --> 01:00:35,009 +is like 30,000 dimensional so that's why +takes awhile + +747 +01:00:35,010 --> 01:00:39,640 +ok so we take a step and we we evaluate +the gradient and repeat this over and + +748 +01:00:39,639 --> 01:00:44,980 +over in practice there's this additional +part I wanted to mention where we don't + +749 +01:00:44,980 --> 01:00:49,860 +actually evaluate the loss for the +entire training did in fact all we do is + +750 +01:00:49,860 --> 01:00:53,370 +we only use what's called me back +reading the st. where we have this + +751 +01:00:53,369 --> 01:00:58,670 +entire dataset but we sample batches +from it so we sample sale like say + +752 +01:00:58,670 --> 01:01:02,300 +thirty two examples out of my training +data I evaluate the loss of the gradient + +753 +01:01:02,300 --> 01:01:05,940 +on this batch of 32 and then I knew my +primary update and I keep doing this + +754 +01:01:05,940 --> 01:01:09,619 +over and over again and make sure what +ends up happening is if you only sample + +755 +01:01:09,619 --> 01:01:14,699 +very few data points from training data +then your estimate of the gradient of + +756 +01:01:14,699 --> 01:01:18,109 +course over the entire training set is +kind of noisy because you're only + +757 +01:01:18,110 --> 01:01:21,970 +estimating based on a small subset of +your data but it allows me to step more + +758 +01:01:21,969 --> 01:01:25,689 +so you can do more steps with +approximate gradient or you can do few + +759 +01:01:25,690 --> 01:01:30,179 +steps with exact gradient and practice +what ends up working better if used me + +760 +01:01:30,179 --> 01:01:35,049 +back and it's much more efficient of +course and it's impractical to actually + +761 +01:01:35,050 --> 01:01:41,550 +do fullback gradient descent so come in +many sizes 32 64 128 256 this is not + +762 +01:01:41,550 --> 01:01:45,940 +usually hyper primarily worry about too +much usually settled based on whatever + +763 +01:01:45,940 --> 01:01:49,380 +fits on your GPU we're going to be +talking about BP's in a bit but they + +764 +01:01:49,380 --> 01:01:53,030 +have finite amount of memory say about +like 6 gigabytes or talk about its good + +765 +01:01:53,030 --> 01:01:58,030 +GPU and usually choose a backside such +that a small me back to example spits in + +766 +01:01:58,030 --> 01:02:01,150 +your memory so that's how usually that's +the term and it's not a primary that + +767 +01:02:01,150 --> 01:02:09,570 +actually matters a lot and optimization +sense + +768 +01:02:09,570 --> 01:02:14,789 +and we're going to get the momentum in a +bit but if you want to use momentum then + +769 +01:02:14,789 --> 01:02:18,969 +this is just fine we always been about +trying to send but momentum very common + +770 +01:02:18,969 --> 01:02:23,799 +to do so just to give you an idea of +what will this look like in practice if + +771 +01:02:23,800 --> 01:02:28,510 +I'm running optimization overtime and +i'm looking at the Los evaluated on just + +772 +01:02:28,510 --> 01:02:32,700 +a small many batch of data and you can +see that basically my loss goes down + +773 +01:02:32,699 --> 01:02:37,309 +over time on these many batches from the +training data so as an optimizing I'm + +774 +01:02:37,309 --> 01:02:42,119 +going downhill now of course if I was +doing pullbacks gradient descent so this + +775 +01:02:42,119 --> 01:02:44,839 +was not 
just me back a sample from the +data you wouldn't expect as much noise + +776 +01:02:44,840 --> 01:02:48,550 +you just expect this to be aligned to +just goes down but because we use me + +777 +01:02:48,550 --> 01:02:51,730 +back if you get this noise in there +because something about you are better + +778 +01:02:51,730 --> 01:03:01,980 +than others but over time they could all +go down there question + +779 +01:03:01,980 --> 01:03:07,539 +yes sir you're wondering about the shape +of this loss function you're used to + +780 +01:03:07,539 --> 01:03:11,420 +maybe seeing more rapid improvement +quick are these loss functions come in + +781 +01:03:11,420 --> 01:03:17,079 +different shapes sizes so it really +depends it's not necessarily the case + +782 +01:03:17,079 --> 01:03:21,940 +that loss function must look very sharp +in the beginning although sometimes they + +783 +01:03:21,940 --> 01:03:25,929 +do they have different shapes for +example it also matters on your + +784 +01:03:25,929 --> 01:03:29,618 +initialization if I'm careful with my +initialization I would expect less of a + +785 +01:03:29,619 --> 01:03:34,990 +jump but if I initialize very +incorrectly then you would expect that + +786 +01:03:34,989 --> 01:03:38,649 +that's going to be fixed very early on +in the optimization we're going to get + +787 +01:03:38,650 --> 01:03:43,309 +to some of those parts I think much +later I also want to show you a lot of + +788 +01:03:43,309 --> 01:03:49,710 +the effects of learning rate on your +loss function and the still learning + +789 +01:03:49,710 --> 01:03:53,820 +rate is the step size basically a very +high learning rates or step sizes you + +790 +01:03:53,820 --> 01:03:59,240 +start rushing around in your W space and +so i dont converge or you explode if you + +791 +01:03:59,239 --> 01:04:02,618 +have a very low learning rate then +you're barely doing any updates and also + +792 +01:04:02,619 --> 01:04:07,869 +it takes a very long time to actually +converge and if you have a high learning + +793 +01:04:07,869 --> 01:04:11,150 +rate sometimes you can basically get a +kind of stuck in a bad position of a + +794 +01:04:11,150 --> 01:04:14,950 +loss so these loss functions kind of you +need to get down to the minimum so if + +795 +01:04:14,949 --> 01:04:17,929 +you have too much energy in your +stocking too quickly when you don't have + +796 +01:04:17,929 --> 01:04:21,679 +you don't allow your problem to kind of +settle in on the smaller local minima + +797 +01:04:21,679 --> 01:04:25,480 +your objective in general when you talk +about neural networks and optimization + +798 +01:04:25,480 --> 01:04:28,320 +you'll see a lot of hand waving because +that's the only way we communicate about + +799 +01:04:28,320 --> 01:04:32,350 +these losses and distance so just +imagine like a Big Basin of loss and + +800 +01:04:32,349 --> 01:04:36,069 +there are these like smaller pockets of +smaller loss and so if you're thrashing + +801 +01:04:36,070 --> 01:04:39,480 +around and you can settle in on a +smaller loss parts and converter for + +802 +01:04:39,480 --> 01:04:43,730 +their so that's why the learning rate so +good and so you need to find the correct + +803 +01:04:43,730 --> 01:04:47,150 +learning rate which will cause a lot of +headaches and what people do most of the + +804 +01:04:47,150 --> 01:04:49,970 +time is sometimes you start off with a +high learning rates we get some benefits + +805 +01:04:49,969 --> 01:04:55,319 +and then UDK it over time to start up +with high and then we decadence 
learning + +806 +01:04:55,320 --> 01:05:00,780 +read over time as we're settling in on +the good solution and I also want to + +807 +01:05:00,780 --> 01:05:03,550 +point out who's going to this in much +more detail but the way I'm doing the + +808 +01:05:03,550 --> 01:05:07,890 +update here which is how to use the +gradient to actually modify your W + +809 +01:05:07,889 --> 01:05:12,789 +that's called an update firmware update +there are many different forms of doing + +810 +01:05:12,789 --> 01:05:14,869 +it this is the simplest way which were + +811 +01:05:14,869 --> 01:05:20,299 +just STD simplest custom greeting cent +but there are many formulas such as + +812 +01:05:20,300 --> 01:05:23,740 +momentum that was already mentioned in +momentum you basically imagine as you're + +813 +01:05:23,739 --> 01:05:27,949 +doing this optimization you imagine +keeping track of this blog city so as + +814 +01:05:27,949 --> 01:05:31,389 +I'm stepping am also keeping track of my +velocity so if I keep seeing a positive + +815 +01:05:31,389 --> 01:05:35,519 +reading some direction I will accumulate +velocity in that direction so I don't + +816 +01:05:35,519 --> 01:05:39,550 +need someone to go faster at the russian +and so there are several from Los will + +817 +01:05:39,550 --> 01:05:46,100 +look and shortly the class but Thomas +prop Adam or commonly used so just to + +818 +01:05:46,099 --> 01:05:50,569 +show you what these look like these +different choices and what they might do + +819 +01:05:50,570 --> 01:05:56,760 +in your loss function this is a figure +from Alec so here we have a loss + +820 +01:05:56,760 --> 01:06:02,390 +function and these are low-level clerks +and we start off opposition over there + +821 +01:06:02,389 --> 01:06:06,920 +and we're trying to get to the basin and +different update formulas will give you + +822 +01:06:06,920 --> 01:06:10,670 +better or worse convergence in different +problems so you can see for example this + +823 +01:06:10,670 --> 01:06:15,369 +momentum in green it built up momentum +as it went down and then it overshot and + +824 +01:06:15,369 --> 01:06:19,259 +then it kind of go back go back and this +as UD takes forever to converge can read + +825 +01:06:19,260 --> 01:06:23,370 +that's what I presented you so far as +she takes forever to emerge and are + +826 +01:06:23,369 --> 01:06:27,489 +different ways of actually performing +this primary up there are more or less + +827 +01:06:27,489 --> 01:06:35,259 +efficient in modernization will see much +more of this I also wanted to mention at + +828 +01:06:35,260 --> 01:06:39,950 +this point as likely yes I want to go +slightly into I'm to explain obviously + +829 +01:06:39,949 --> 01:06:43,049 +like your classification we know how to +set up the problem we know that are + +830 +01:06:43,050 --> 01:06:47,070 +different loss functions me know how to +optimize them so we can kind of do at + +831 +01:06:47,070 --> 01:06:51,050 +this point across I wanted to mention +that I want to give you a sense of what + +832 +01:06:51,050 --> 01:06:53,710 +computer vision looked like before +comments came about so that you have a + +833 +01:06:53,710 --> 01:06:57,920 +bit of historical perspective because we +used a linear classifiers all the time + +834 +01:06:57,920 --> 01:07:01,019 +but of course you don't usually your +classic cars on the road original image + +835 +01:07:01,019 --> 01:07:06,759 +because that's all you want to believe +we solve the problems with it like you + +836 +01:07:06,760 --> 01:07:10,250 +have to cover 
all the modes and so on I +thought the police to do as they used to + +837 +01:07:10,250 --> 01:07:14,380 +compute all these different feature +types of images and then you can view + +838 +01:07:14,380 --> 01:07:17,160 +different descriptors in different +feature types and you get these + +839 +01:07:17,159 --> 01:07:22,049 +statistical summaries of what the image +looks like what the frequencies are like + +840 +01:07:22,050 --> 01:07:26,160 +and so on and then we can capitated all +those into large vectors and then we put + +841 +01:07:26,159 --> 01:07:27,710 +those into linear classifiers + +842 +01:07:27,710 --> 01:07:32,050 +so different feature types all of them +concatenated and then that went until + +843 +01:07:32,050 --> 01:07:35,369 +your classifiers that was usually the +pipeline so just to give you an idea of + +844 +01:07:35,369 --> 01:07:39,088 +really what these talks were like one +very simple feature type you might + +845 +01:07:39,088 --> 01:07:43,269 +imagine is just a color histogram so I +go over all the pixels in the image and + +846 +01:07:43,269 --> 01:07:47,449 +i'd in them and to say how many bands +are there are different colors depending + +847 +01:07:47,449 --> 01:07:50,750 +on the hue of the color as you can +imagine this is kind of like one + +848 +01:07:50,750 --> 01:07:54,250 +statistical summary of what's in the +image is just a number of colors each + +849 +01:07:54,250 --> 01:07:57,400 +been so this will become one of my +teachers that I would eventually become + +850 +01:07:57,400 --> 01:08:03,440 +cutting with many different feature +types and other kind of intimately the + +851 +01:08:03,440 --> 01:08:06,530 +classifier if you think about it the +linear classifier can use these features + +852 +01:08:06,530 --> 01:08:09,690 +to actually perform the classification +because the linear classifier can like + +853 +01:08:09,690 --> 01:08:14,320 +or dislike seeing lots of different +colors in the image with positive or + +854 +01:08:14,320 --> 01:08:17,930 +negative what's very common features +also include things like what we call + +855 +01:08:17,930 --> 01:08:22,440 +610 hawk features basically these were +you go in local neighborhoods in the + +856 +01:08:22,439 --> 01:08:26,539 +invention and you look at whether or not +there are lots of different orientations + +857 +01:08:26,539 --> 01:08:30,588 +so are there lots of horizontal or +vertical edges we make up histograms + +858 +01:08:30,588 --> 01:08:35,850 +over that and so when you end up with +just the summary of what kinds of edges + +859 +01:08:35,850 --> 01:08:40,338 +are wherein the image and you can +calculate all those together there was + +860 +01:08:40,338 --> 01:08:45,250 +lots of different types of our proposed +over to over the years just I'll be + +861 +01:08:45,250 --> 01:08:50,359 +taxed on lots of different ways of +measuring what kinds of things are there + +862 +01:08:50,359 --> 01:08:54,850 +in the image and statistics of them and +then we had these pipelines call back + +863 +01:08:54,850 --> 01:08:59,660 +over to my place where you look at +different points in your + +864 +01:08:59,659 --> 01:09:04,250 +you describe a little local patch with +something that you come up with like + +865 +01:09:04,250 --> 01:09:08,329 +looking at the frequencies are looking +at the colors or whatever and then we + +866 +01:09:08,329 --> 01:09:12,269 +came up with these dictionaries for ok +here's the stuff we're seeing images + +867 +01:09:12,270 --> 01:09:16,250 +like there's lots of 
high-frequency stop +for low-frequency stuff in blue and so + +868 +01:09:16,250 --> 01:09:16,699 +on + +869 +01:09:16,699 --> 01:09:21,338 +to end up with the centroids using +k-means of what kind of stuff to be seen + +870 +01:09:21,338 --> 01:09:25,818 +in a just and then we express every +single image as statistics over how much + +871 +01:09:25,819 --> 01:09:29,660 +of each thing we see in the image so for +example this image has lots of + +872 +01:09:29,659 --> 01:09:33,949 +high-frequency green stuff so you might +see some feature vector that basically + +873 +01:09:33,949 --> 01:09:38,568 +will have a higher value and high +frequency and green and then we did is + +874 +01:09:38,569 --> 01:09:40,760 +we basically took these feature vectors + +875 +01:09:40,760 --> 01:09:45,210 +needed them and put a linear classifier +on them so really the context for what + +876 +01:09:45,210 --> 01:09:49,090 +we're doing is as follows what it looked +like mostly computer vision before + +877 +01:09:49,090 --> 01:09:52,840 +roughly 2012 will let you take your +image and you have a step of feature + +878 +01:09:52,840 --> 01:09:57,409 +extraction where we decided what are +important things to you know about an + +879 +01:09:57,409 --> 01:10:01,859 +image different frequencies different +tents and we decided on what are + +880 +01:10:01,859 --> 01:10:05,109 +interesting features and you see people +take like 10 different feature types in + +881 +01:10:05,109 --> 01:10:09,369 +every paper and just woke up need all of +it just hit you can double one giant + +882 +01:10:09,369 --> 01:10:12,640 +feature vector over your image and then +you put a linear classifier on top of it + +883 +01:10:12,640 --> 01:10:15,920 +just like we saw it right now and so you +play a train sale in your ass p.m. 
on + +884 +01:10:15,920 --> 01:10:20,109 +top of all these feature types and what +we're replacing it since then we found + +885 +01:10:20,109 --> 01:10:24,869 +that works much better as you start with +the raw image and you think of the whole + +886 +01:10:24,869 --> 01:10:28,979 +thing you're not designing some part of +it in isolation of what you think is a + +887 +01:10:28,979 --> 01:10:33,479 +good idea or not we come up with an +architecture that can simulate a lot of + +888 +01:10:33,479 --> 01:10:38,189 +different features so to speak and since +everything is just a single function we + +889 +01:10:38,189 --> 01:10:41,879 +don't just trying to top it on top of +the features of we can actually train + +890 +01:10:41,880 --> 01:10:45,400 +all the way down to the pixels and we +can train our feature extractors + +891 +01:10:45,399 --> 01:10:49,989 +effectively so that was a big innovation +and how you approach this problem is we + +892 +01:10:49,989 --> 01:10:53,300 +try to eliminate a lot of hand +engineered components are trying to have + +893 +01:10:53,300 --> 01:10:56,779 +a single the principal blob so that we +can fully trained to pull things + +894 +01:10:56,779 --> 01:11:01,550 +starting at the Rock Texas that's what +historically this is coming from and + +895 +01:11:01,550 --> 01:11:06,760 +what we will be doing and so next last +will be looking specifically at this + +896 +01:11:06,760 --> 01:11:10,520 +problem of we need to compute analytic +gradients and so we're going to go into + +897 +01:11:10,520 --> 01:11:14,860 +backpropagation which is an efficient +way of computing analytics gradient and + +898 +01:11:14,859 --> 01:11:18,839 +so that's backdrop and you're going to +become good at it and then we're going + +899 +01:11:18,840 --> 01:11:20,039 +to go slightly works + diff --git a/captions/En/Lecture4_en.srt b/captions/En/Lecture4_en.srt new file mode 100644 index 00000000..46ec2630 --- /dev/null +++ b/captions/En/Lecture4_en.srt @@ -0,0 +1,4876 @@ +1 +00:00:02,740 --> 00:00:07,000 +ok so let me dive into some +administrator + +2 +00:00:07,000 --> 00:00:14,669 +went first so I can recall that +assignment one is due next Wednesday + +3 +00:00:14,669 --> 00:00:19,050 +yeah but hundred and fifty hours left +and I use ours because there's a more + +4 +00:00:19,050 --> 00:00:23,320 +common sense of doom and remember that +third of those hours he'll be + +5 +00:00:23,320 --> 00:00:29,278 +unconscious so you don't have that much +time it's really running out and you + +6 +00:00:29,278 --> 00:00:31,768 +know you might think that you have a +late day Sun so on but these images get + +7 +00:00:31,768 --> 00:00:38,640 +harder over time so you want to see +those and so so start now likely so + +8 +00:00:38,640 --> 00:00:43,109 +there's no office hours or anything like +that on Monday I'll hold make up office + +9 +00:00:43,109 --> 00:00:45,839 +hours on Wednesday because I want you +guys to be able to talk to me about the + +10 +00:00:45,840 --> 00:00:49,260 +special projects and so on so I'll be +moving my office hours from Monday to + +11 +00:00:49,259 --> 00:00:52,820 +wednesday usually I had my office starts +at 6 p.m. instead I'll have them at 5 + +12 +00:00:52,820 --> 00:00:59,909 +p.m. 
and usually think gates 260 but now +be engaged to 39-1 them both and yeah + +13 +00:00:59,909 --> 00:01:03,429 +and also to note when you're going to be +studying for midterm that's coming up in + +14 +00:01:03,429 --> 00:01:04,170 +a few weeks + +15 +00:01:04,170 --> 00:01:07,109 +make sure you go through the lecture +notes as well which are really part of + +16 +00:01:07,109 --> 00:01:09,819 +this class and a kind of pick and choose +some of the things that I think are most + +17 +00:01:09,819 --> 00:01:13,579 +valuable to present a lecture but +there's quite a bit of a more material + +18 +00:01:13,579 --> 00:01:16,548 +to beware of that might pop up in the +mid term even though I'm comin some of + +19 +00:01:16,549 --> 00:01:19,610 +the most important stuff usually no +larger than URI through those lecture + +20 +00:01:19,609 --> 00:01:25,618 +notes their complimentary to the actress +and so the material for the material be + +21 +00:01:25,618 --> 00:01:32,269 +drawn from both the lectures and its ok +so having said all that we're going to + +22 +00:01:32,269 --> 00:01:36,769 +dive into the material so where we are +right now just as a reminder we have + +23 +00:01:36,769 --> 00:01:39,989 +this core function we looked at several +loss functions such as the SP loss + +24 +00:01:39,989 --> 00:01:44,359 +function last time and we look at the +full lost that you achieve for any + +25 +00:01:44,359 --> 00:01:49,379 +particular set of weights on over your +training data and this loss made up of + +26 +00:01:49,379 --> 00:01:53,509 +two components there's a data loss and +loss right and really what we want to do + +27 +00:01:53,509 --> 00:01:57,200 +is we want to do right now the gradient +expression of the loss of respect to the + +28 +00:01:57,200 --> 00:02:01,118 +weights and we want to do this so that +we can actually perform the optimization + +29 +00:02:01,118 --> 00:02:07,069 +process optimization process we're doing +in dissent where we iterate in a leading + +30 +00:02:07,069 --> 00:02:11,030 +the gradient on your weights during a +primary update and just repeating this + +31 +00:02:11,030 --> 00:02:14,259 +over and over again so that were +converging to + +32 +00:02:14,259 --> 00:02:17,929 +the low points on that loss function and +when we arrived at a loss that's + +33 +00:02:17,930 --> 00:02:20,799 +equivalent to making good predictions +over our training data in terms of this + +34 +00:02:20,799 --> 00:02:25,030 +course that come out now we also saw +that are too kind of waste evaluate the + +35 +00:02:25,030 --> 00:02:29,019 +gradient there's an American gradient +and this is very easy to write but it's + +36 +00:02:29,019 --> 00:02:32,840 +extremely slow to evaluate and there's +an elegy gradient which is which you + +37 +00:02:32,840 --> 00:02:36,658 +obtained by using calculus and will be +going into that in this lecture quite a + +38 +00:02:36,658 --> 00:02:41,318 +bit more and so it's fast exact which is +great but it's not you can get it wrong + +39 +00:02:41,318 --> 00:02:45,969 +sometimes and so we always the following +week already in check where we write all + +40 +00:02:45,969 --> 00:02:48,639 +the expressions to complete the analytic +gradients and then we double check its + +41 +00:02:48,639 --> 00:02:51,828 +correctness with numerical gradient and +so I'm not sure if you're going to see + +42 +00:02:51,829 --> 00:02:59,250 +that you're going to see that definitely +assignments ok so now you might be + +43 +00:02:59,250 --> 00:03:04,378 +tempted to when you see the 
OK, so now, when you see this setup, where we just want to derive the gradient of the loss function with respect to the weights, you might be tempted to just write out the full loss and start taking gradients as you've seen in your calculus class. But the point I'd like to make is that you should think of this much more in terms of computational graphs, instead of thinking of one giant expression whose gradient you're going to derive with pen and paper. So here we're thinking about values flowing through a computational graph: these operations, the circles, are little function pieces that transform your inputs all the way to the loss function at the end. We start off with our data and our parameters as inputs; they feed through the computational graph, which is just a series of functions along the way, and at the end we get a single number, which is the loss. The reason I like to think about it this way is that the expressions right now look very small, and you might be able to derive these gradients by hand, but these expressions, these computational graphs, are about to get very big. For example, convolutional neural networks will have hundreds of operations, so we'll have all these images flowing through a big computational graph to get our loss, and it becomes impractical to just write out the expressions. And convolutional networks are not even the worst of it. Once you start to do something like a Neural Turing Machine, which is a paper from DeepMind, you basically have a differentiable Turing machine: the whole procedure that the computer performs on the tape is made smooth and differentiable, and the computational graph for this is huge. And it doesn't stop there, because what you end up doing (we'll get to recurrent neural networks in a bit) is unrolling this graph: think of that graph copied over many hundreds of time steps. You end up with a giant monster of hundreds of thousands of nodes and little computational units, and it's impossible to write out the loss for the Neural Turing Machine as one expression; it would take, like, billions of pages.
So we have to think about this more in terms of structure: little functions transforming intermediate variables, leading to the loss at the very end. We're going to look specifically at computational graphs and at how we can derive the gradient on the inputs with respect to the loss function at the very end.

Let's start simple and concrete, with a very small computational graph. We have three scalars as inputs to this graph, x, y, and z, and in this example they take on the specific values of negative 2, 5, and negative 4. We have this very small graph, or circuit; you'll hear me refer to these interchangeably, either a graph or a circuit. At the end, this graph gives us the output negative 12. What I've done here is what we'll call the forward pass of the graph, where I set the inputs and then compute the outputs. What we'd like to do is derive the gradients of the expression with respect to the inputs, and to do that we'll introduce an intermediate variable q. There's a plus gate and a times gate, as I'll refer to them; the plus gate computes this output q, so q is the intermediate result x plus y, and then f is the multiplication of q and z. What I've written out here is what we want, the derivatives df/dx, df/dy, and df/dz, and I've also written out the little local gradients for each of those two expressions separately. So now we've performed the forward pass, going from left to right, and what we'll do next is derive the backward pass: we'll go from the back to the front, computing gradients of all the intermediates in our circuit, until at the very end we build up the gradients on the inputs.

We start at the very right, and as the base case of this recursive procedure we consider the gradient of f with respect to f itself. That's just the identity function, so what is its derivative? It's one, right? The identity mapping has a gradient of one. That's our base case: we start off with a one, and now we go backwards through the graph. Next we want the gradient of f with respect to z. So what is that in this computational graph? It's q; we have it written out right here. And what is q in this particular example? It's 3, so the gradient on z will become just 3. I'm going to be writing the gradients under the lines in red, and the values are in green above the lines. So the gradient at the front is one, and now the gradient on z is 3. What that 3 is telling you, intuitively (keep in mind the interpretation of a gradient), is that the influence of z on the final value is positive, with a sort of force of 3: if I increment z by a small amount h, then the output of the circuit will react by increasing, because it's a positive 3, by 3h. So a small positive change in z results in a positive change in the output. Now, the gradient on q in this case will be df/dq, which is z, and what is z? It's negative 4. So we get a gradient of negative 4 on that part of the circuit, and what that's saying is that if q increases, the output of the circuit will decrease: if you increase q by h, the output of the circuit will decrease by 4h. The slope is negative 4.

Now we continue this process backwards through the plus gate, and this is where things get slightly interesting, I suppose. We'd like to compute the gradient on y, and the way I'd like you to think about this is by applying the chain rule. The chain rule says that if you'd like the gradient of f with respect to y, it's equal to df/dq times dq/dy. We've computed both of those expressions: in particular, df/dq we know is negative 4, so that's the influence of q on the final output; and now we'd like to know the local influence of y on q. That local influence, dq/dy, is one; that's what I'll refer to as the local derivative of y for the plus gate. The chain rule tells us that the correct thing to do, to chain these two gradients, the local gradient of y on q and the kind of global gradient of q on the output of the circuit, is to multiply them. So we get negative 4 times one. And this is really the crux of how backpropagation works.
It's very important to understand what's happening here: there are two pieces that we keep multiplying through when we perform the chain rule. We computed q = x plus y, and the derivative of that single expression with respect to x and y is one and one. Keeping in mind the interpretation of the gradient, that's saying that x and y have a positive influence on q, with a slope of one: increasing x by h will increase q by h. And eventually what we'd like is the influence of y on the final output of the circuit. So the way this ends up working is that you take the influence of y on q, and we know the influence of q on the final loss, which is what we are recursively computing here through the graph, and the correct thing to do is to multiply them. So we end up with negative 4 times one, which is negative 4. What this is saying is that the influence of y on the final output of the circuit is negative 4: increasing y should decrease the output of the circuit by four times the small change you've made. And the way that works out is that y has a positive influence on q, so increasing y slightly increases q, which slightly decreases the output of the circuit. The chain rule is giving us this correspondence.

[A student asks a question.] We're going to get into this; you'll see many, many instantiations of it, and I'll drill this into you by the end of the class, so you will understand it. There won't be any symbolic expressions anywhere once we actually start implementing this; when you see implementations of it later, it will always just be vectors and numbers, raw numbers.

OK, and looking at x, the very same thing happens. We want df/dx, that's our final objective, and we have to combine what we know: the influence of x on q, and the influence of q on the end of the circuit, and that ends up being the chain rule again. So we take negative 4 times one, and x also gets a gradient of negative 4.
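As a minimal sketch of this whole worked example (my own transcription of the numbers on the slide), the forward and backward passes for f(x, y, z) = (x + y) * z look like this:

```python
# forward pass for f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y                   # q = 3
f = q * z                   # f = -12

# backward pass: chain rule applied from the output back to the inputs
df_df = 1.0                 # base case: gradient of f w.r.t. itself
df_dz = q * df_df           # local gradient d(q*z)/dz = q  -> 3
df_dq = z * df_df           # local gradient d(q*z)/dq = z  -> -4
df_dx = 1.0 * df_dq         # dq/dx = 1, chain rule -> -4
df_dy = 1.0 * df_dq         # dq/dy = 1, chain rule -> -4
print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```

Note how every backward line is just "local gradient times the gradient from above".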
So, to generalize a bit from this example, the way to think about it is as follows. You are a gate embedded in a circuit, and this is a very large computational graph or circuit. You receive some inputs, some particular numbers x and y come in, you perform some operation on them, and you compute some output z. Now this z goes off into the computational graph and something happens to it, but you're just a gate hanging out in the circuit, and you're not sure what happens. By the end of the circuit, the loss gets computed; that's the forward pass. Then we proceed recursively in the reverse order, backwards. But before we actually get to that part: right away, the thing I'd like to point out is that during the forward pass, if you're this gate and you get your values x and y, you compute your output z, and there's another thing you can compute right away, and that is the local gradients of z on x and y. I can compute those immediately, because I'm just a gate and I know what operation I'm performing, say addition or multiplication, so I know the influence that x and y have on my output. I can compute those guys right away.

But then what happens near the end? The loss gets computed, and now we're going backwards; eventually I learn what my influence is on the final output of the circuit, the loss. I learn what dL/dz is: that gradient will flow into me. What I have to do is chain that gradient through this recursive case: I have to make sure to chain the gradient through the operation I performed. And it turns out that the correct thing to do here, by the chain rule, really what it's saying, is that the correct thing to do is to multiply your local gradient with that gradient from above, and that actually gives you dL/dx, the influence of x on the final output of the circuit. So really, the chain rule is just this extra multiplication, where we take what we could call the global gradient of this gate's output on the loss, and we chain it through the local gradient. The same thing goes for y; it's just a multiplication of that incoming gradient by your local gradient, if you're a gate. And then remember that these x's and y's are themselves coming from different gates, right? So you end up recursing this process through the entire computational graph. These gates just basically communicate to each other their influence on the final loss: they tell each other, OK, if this is a positive gradient, that means you're positively influencing the loss, and if it's a negative gradient, you're negatively influencing the loss. And these just get multiplied through the circuit by the local gradients, and this process is called backpropagation.
It's a way of computing, through a recursive application of the chain rule through the computational graph, the influence of every single intermediate value in that graph on the final loss function. We'll see many examples of this throughout the lecture; I'll go into a specific example that is slightly larger, and we'll work through it in detail. But are there any questions at this point that you'd like to ask?

[A student asks what happens when a value is used in several places.] I'm going to come back to that: you add the gradients. If z is being used in multiple places in the circuit, the backward flows will add. We'll come back to that point.

[Another question.] Right, we're going to get to all of those issues, and we're going to see, yeah, you're going to get what we call vanishing gradient problems and so on. We'll see.

Let's go through another example to make this more concrete. Here we have another circuit; it happens to be computing a little two-dimensional neuron, but for now don't worry about that interpretation. Just think of this as an expression: one over one plus e to the minus (w0 x0 + w1 x1 + w2). There are several inputs to this function, and we have a single output over there. I translated that mathematical expression into this computational graph form, so we have to recursively, from the inside out, compute the expression: first we do all the little w-times-x pieces, then we add them all up, then we take the negative of it, then we exponentiate it, then we add one, and then we finally divide, and we get the result of the expression. What we're going to do now is backpropagate through this expression: we're going to compute what the influence of every single input value is on the output of this expression, that is, the gradient.

[A question about the gates.] So for now this is just a binary plus, it's a binary + gate, and we have a plus-one gate. I'm making up these gates on the spot, and we'll see that what is or is not a gate is kind of up to you; I'll come back to that point in a bit. For now, we just have several more gates that we're using throughout, and I'd like to write out, as we go through this example, several of these derivatives: exponentiation, where d/dx of e^x is e^x; one over x, whose derivative is negative one over x squared; and so on. For every little local gate we know what the local gradients are, using calculus. So these are all the operations, and also addition and multiplication, which I'm assuming you have memorized
in terms of what the gradients look like. We're going to start off at the end of the circuit, and I've already filled in a one point zero zero at the back, because that's how we always start this recursion: with a 1.0, right, since that's the gradient on the identity function. Now we're going to backpropagate through this one-over-x operation. The derivative of one over x, the local gradient, is negative one over x squared. The one-over-x gate, during the forward pass, received the input 1.37, and right away that one-over-x gate could have computed its local gradient: negative one over x squared. Now, during backpropagation, it has to, by the chain rule, multiply that local gradient by the gradient of its output on the final value of the circuit, which is easy, because that happens to be one. So what ends up being the expression for the backpropagated gradient here, from the one-over-x gate? It always has two pieces: the local gradient times the gradient from above. That gives negative one over 1.37 squared, multiplied by 1.0, the gradient from above, which is really just one because we just started. So, applying the chain rule right away here, the answer is negative 0.53. That's the gradient on that piece of the wire where this value was flowing. So it has a negative effect on the output, and you might expect that, right? Because if you were to increase this value, and it then goes through a gate of one over x, then as the input to one-over-x increases, the output gets smaller. That's why you're seeing a negative gradient.

We're going to continue backpropagation here, at the next gate in the circuit: it's adding a constant of one. The local gradient, if you look at adding a constant to a value, the gradient on x, is just one, right, from calculus. And so the chained gradient here, which we continue along the wire, will be the local gradient, which is one, times the gradient from above the gate, which it has just learned is negative 0.53. So negative 0.53 continues along the wire unchanged, and intuitively that makes sense, right? Because this value flows through, it has some influence on the final circuit, and if you're adding one, then its influence, its rate of change, its slope toward the final value, doesn't change:
if you increase this input by some amount, the effect at the end will be the same, because the rate of change doesn't change through a plus-one gate; it's just a constant offset. So we continue backpropagation here: the gradient of e^x is e^x, so now we're going to backpropagate through the exp gate. The exp gate received an input of negative one, so right away it could have computed its local gradient, and now it knows that the gradient from above is negative 0.53. So, continuing backpropagation here and applying the chain rule (that was a rhetorical question, I'm not sure if it came across), we basically take e to the negative one, the local derivative evaluated at the input to this exp gate, times, by the chain rule, the negative 0.53. So we keep multiplying: what is the effect on me, and what is my effect on the final end of the circuit; those always get multiplied. We get negative 0.2 at this point.

So now we have a times-negative-one gate. What ends up happening to the gradient when you go through it? It flips around, right? Because we basically have a multiplication by a constant, which happened to be a constant of negative one: negative one times one gave us negative one in the forward pass. So now we have to multiply by a; that's the local gradient, times the gradient from above, which is negative 0.2; so we end up with just positive 0.2.

Now we continue backpropagating this plus 0.2, and this plus operation has multiple inputs here. The local gradient for the plus gate is one and one, so what ends up happening to the gradients flowing along the upper wires? [The class responds.] The plus gate has a local gradient of one on all of its inputs, always, because if you just have a function x plus y, then for that function the gradient on either x or y is just one. So what you end up getting is just one times 0.2. In fact, for a plus gate you always see the same effect: the local gradient on all of its inputs is one, so whatever gradient it gets from above, it just always distributes that gradient equally to all of its inputs, because in the chain rule you multiply, and multiplying by one leaves things unchanged. So the plus gate is kind of like a gradient distributor: when something flows in from the top, it just spreads all the gradients equally to all of its children.
So we've already received, on one of the inputs, a gradient of 0.2 from the very final output of the circuit, and this influence has been computed through a series of applications of the chain rule along the way. There was another plus gate that I kind of skipped over, and at that point it distributes the 0.2 to both of its inputs equally. So we've already done that plus gate, and there's a multiply gate there, and now we're going to backpropagate through that multiply operation. So what will be the gradient for w0, and what for x0? The gradient on w0 will basically be x0 times 0.2, which is negative one times 0.2, so negative 0.2; and the gradient on x0 will be w0 times 0.2. By the way, there is a bug in the slide that I noticed just a few minutes before I actually came to class: you see 0.39 there; it should be 0.4. It's because of a bug in the visualization that truncates the digits. It should really be 0.4, because the way you get it is two times 0.2, which gives you the 0.4, just as I've written out there.

OK, so with that we've backpropagated this circuit, and we've gone through this expression. You might imagine that in actual downstream applications we'll have the data and all the parameters as inputs, and loss functions at the end. There will be a forward pass to evaluate the loss function, and then we'll backpropagate through every piece of computation we've done along the way, backpropagating through every gate to get the gradients on our inputs. Backpropagation just means applying the chain rule many, many times, and we'll see how that's implemented in a bit.

[A student asks a question.] I'm going to skip that, because it's the same point, and I'm going to skip the other questions for now. [On the relative cost of the two passes:] The costs of forward and backward propagation almost always end up being basically equal when you look at timings; usually the backward pass is slightly slower.

OK, so let's see. One thing I want to point out before moving on is that the setting of these gates is arbitrary. One thing I could have done, and some of you may know this, is collapse several of these gates into one gate if I wanted to. For example, there's something called the sigmoid function, which has this particular form, sigma of x,
which computes one over one plus e to the negative x. So I could have rewritten that expression, and I could have collapsed all of those gates that make up the sigmoid into a single sigmoid gate. So there's a sigmoid gate here, and I could have done all of that in a single go, sort of. What I would have had to do, if I wanted to have that gate, is compute an expression for its local gradient: what is the gradient of the sigmoid gate on its input. You have to go through some math, which I'm not going to go into in detail, but you end up with that expression over there: it ends up being (1 minus sigma of x) times sigma of x. That's the local gradient, and that allows me to put this piece into a computational graph, because once I know how to compute the local gradient, everything else is defined just through the chain rule, multiplying everything together. So we can backpropagate through the sigmoid gate now, and the way that would look is: the input to the gate was 1.0, that's what flowed into the gate, and 0.73 went out, so 0.73 is sigma of x, OK? And we want the local gradient, which, as we've seen from the math there, is (1 minus sigma of x) times sigma of x: you get 0.73 times one minus 0.73; that's the local gradient; and then times the gradient from above, and we happen to be at the end of the circuit, so times 1.0, which I'm not even writing. So we end up with 0.2, and of course we get the same answer, 0.2, as we received before, because calculus works. So basically, we could have broken this expression down one piece at a time, or we could just have a single sigmoid gate, and it's kind of up to us at what level of hierarchy we break these expressions apart. You'd like to intuitively cluster these expressions into single gates if it's very efficient or easy to derive the local gradients, because then those become your pieces.

[A student asks whether libraries typically do this.] The question is whether libraries typically do that, whether they worry about what's easy or convenient to compute, and the answer is yes, I would say so. If you notice that there's some piece of operation you'd like to do over and over again, and it has a very simple local gradient, then it's very appealing to actually create a single unit out of it, and we'll see some of those examples in a bit.
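To tie the two views together, here is a minimal sketch of this exact circuit (the input values are the ones from the slide): a forward pass through the collapsed sigmoid gate, then a backward pass using the (1 - sigma) * sigma local gradient.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# forward pass for f(w, x) = sigmoid(w0*x0 + w1*x1 + w2)
w = [2.0, -3.0, -3.0]            # w2 acts as the bias
x = [-1.0, -2.0]
dot = w[0] * x[0] + w[1] * x[1] + w[2]   # 1.0 flows into the sigmoid gate
f = sigmoid(dot)                          # 0.73 flows out

# backward pass: the sigmoid gate's local gradient is (1 - f) * f
ddot = (1 - f) * f               # ~0.2, times the gradient from above (1.0)
dx = [w[0] * ddot, w[1] * ddot]                 # chain rule into the x's
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]     # chain rule into the w's
```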
I'd also like to point out that the reason I like to think about these computational graphs is that they really help your intuition about how gradients flow in a neural network. You don't want this to be a black box; you want to understand intuitively how this happens, and after a while of looking at computational graphs you start to develop intuitions about how these gradients flow. This can help you debug some issues, like, say, the vanishing gradient problem we'll get to later: it's much easier to understand exactly what's going wrong in your optimization if you understand how gradients flow in networks; it will help you debug these networks much more efficiently.

Some intuitions, for example: we already saw that the add gate has a local gradient of one to all of its inputs, so it's just a gradient distributor; that's a nice way to think about it. Whenever you have a plus operation anywhere in your score function, or your ConvNet, or anywhere else, it distributes gradients. The max gate is instead a gradient router, and the way this works is, if you look at an expression (these markers don't work, so bear with me), a very simple binary expression, max of x and y, as a gate, then if you think about it, the gradient on the larger one of your inputs, whichever is larger, is one, and the gradient on the smaller one is zero. Intuitively, that's because if one of them is smaller, then it has no effect on the output: the other guy is larger, and that's what ends up getting through the gate. So you end up with a gradient of one on the larger one of the inputs, and that's why the max gate is a gradient router: if I'm a max gate and I received several inputs, one of them was the largest of all of them, and that's the value that I propagated through the circuit. At backpropagation time, I'm just going to receive my gradient from above and route it to whoever was my largest input. And the multiply gate is a gradient switcher; actually, I don't think that's a very good way to look at it, but I'm referring to the fact that it swaps things around... never mind about that part.
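A tiny sketch of the routing behavior just described (my own illustration, scalar case):

```python
# max gate: routes the full gradient to whichever input was larger
x, y = 4.0, 2.0
z = max(x, y)                        # forward: z = 4
dz = 2.0                             # some upstream gradient dL/dz
dx = dz * (1.0 if x >= y else 0.0)   # larger input receives the gradient: 2.0
dy = dz * (1.0 if y > x else 0.0)    # smaller input receives zero
```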
[A student asks what happens if the two inputs to a max gate are equal.] The question is what happens if the two inputs are equal when you go through the max gate. I don't think it's correct to distribute the gradient to both of them; I think you'd have to pick one. But that basically never happens in actual practice. So, for the max gate, we actually have an example here: z was larger than w, so only z has an influence on the output of this max gate, right? So when the 2 flows into the max gate, z gets routed that gradient, and w gets a zero gradient, because its effect on the circuit is nothing. It's zero, because when you change w, it doesn't matter: it's not the larger value going through the computational graph.

I have another note that is related to backpropagation, which we already addressed through a question; I just want to briefly point out, with a terribly drawn figure here, that if you have these circuits, and sometimes you have a value that branches out and is used in multiple parts of the circuit, the correct thing to do, by the multivariate chain rule, is to actually add up the contributions at that operation. So gradients add when they flow backwards through the circuit: if they ever meet in the backward flow, they add up.

OK, we're going to go into implementation now; it's very simple. But first, just a couple of questions. [A student asks whether these graphs can contain cycles.] Thank you for the question; the question is whether there is ever a loop in these graphs. There will never be loops. You might think that if you use a recurrent neural network there are loops in there, but there are actually none, because what we'll do is take the recurrent neural network and unfold it through the time steps: there will never be a loop in the unfolded graph; it's that small piece copy-pasted over time steps. You'll see that more when we actually get into recurrent networks, but these are always acyclic; there are no loops.

So let's look at how this is actually implemented in practice; I think it will help make this more concrete as well. We always have these graphs; these are the best way to think about structuring neural networks. What we end up with is all these gates, which we're going to see in a bit, but on top of the gates there's something that needs to maintain the connectivity structure of the entire graph: which gates are connected to each other. Usually that's handled by a graph or net object, and the net object needs two main pieces, a forward piece and a backward piece. This is just rough pseudocode, but basically, the idea is that in
the forward pass we iterate over all the gates in the circuit, and they're sorted in topological order. What that means is that all the inputs must arrive at every node before its output can be consumed; the gates are just ordered from left to right. And we're just calling forward on every single gate along the way: we iterate over that graph and just go forward on every single piece, and this object makes sure that happens in the proper connectivity pattern. In the backward pass, we go in the exact reversed order, and we call backward on every single gate; these gates end up communicating gradients to each other, they all get chained together, and that computes the analytic gradient at the back. So really, a net object is a very thin wrapper around all these gates, or, as we'll see, they're also called layers; layers and gates are terms I'm going to use interchangeably. They're just very thin wrappers around the connectivity structure of these gates, calling forward and backward on them.
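In code, a minimal sketch of that wrapper might look as follows. This is my own paraphrase of the pseudocode on the slide, assuming hypothetical gate objects that each expose forward() and backward() and store their output on themselves:

```python
class Net:
    def __init__(self, gates_topologically_sorted):
        # every gate's inputs must be produced before the gate itself runs
        self.gates = gates_topologically_sorted

    def forward(self):
        # visit gates left to right, calling forward on each one
        for gate in self.gates:
            gate.forward()
        return self.gates[-1].output    # the loss at the very end

    def backward(self):
        # visit gates in exact reverse order; each gate chains the gradient
        # from its output back onto its inputs
        for gate in reversed(self.gates):
            gate.backward()
```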
Now let's look at a specific example of one of the gates and how it might be implemented, and this is not just pseudocode; this is actually more like a correct implementation; something like this might run at the end. Let's consider a multiply gate and how it could be implemented. A multiply gate, in this case, is just a binary multiply: it receives two inputs, x and y, it computes their multiplication, z equals x times y, and returns it. All these gates must satisfy this API of a forward and a backward call: how do you behave in the forward pass, and how do you behave in the backward pass. In the forward pass we just compute the output; in the backward pass we eventually find out what our gradient on the final loss is. What we learn is represented in this variable dz, and right now everything is scalars, so x, y, and z are just numbers, and dz is also a number, telling us the influence. What this gate is in charge of in the backward pass is performing its little piece of the chain rule: what we have to compute is how to chain this gradient dz into the inputs x and y. We compute dx and dy and return those from the backward pass, and then the computational graph will make sure they get routed properly to all the other gates; and if there are any branches where gradients meet, the computational graph might add all the gradients together.

OK, so how would we implement dx and dy? For example, what is dx in this case? In the implementation it would be equal to y times dz; and dy is x times dz. An additional point to make here, by the way, is that I added some lines in the forward pass: we have to remember these values of x and y, because we end up using them in the backward pass, so I'm assigning them to a self.x and self.y. I need to remember what x and y are, because I need access to them in my backward pass. In general, in backpropagation, when you actually do the forward pass, every single gate must remember its inputs and any kind of intermediate calculations it performed that it needs access to in the backward pass. So when we end up running these networks at runtime, just always keep in mind that as you're doing the forward pass, a huge amount of stuff gets cached in your memory, and it all has to stick around, because during backpropagation you need access to some of those variables. So your memory ends up ballooning during the forward pass; in the backward pass it all gets consumed, and we need all those intermediates to actually compute the proper backward pass.

[A student asks whether the caching can be skipped when no backward pass is needed.] Yes, then you can get rid of many of these things; you don't have to cache them, so you can save on memory, for sure. But I don't think most implementations actually worry about that; I don't think there's a lot of logic that deals with it; they usually end up remembering everything anyway. Yes, I think if you're on an embedded device, for example, and you're really worried about memory constraints, this is something you might take advantage of: if you know that the neural network only has to run at test time, then you might want to go into the code and make sure nothing gets cached, in case you'd otherwise want a backward pass. [Another question: could you remember only the local gradients in the forward pass?] If we remember the local gradients in the forward pass, then we don't have to remember the other intermediates; I think that might only be the case in some simple expressions like this one; I'm not actually sure that's true in general. But you're in charge of remembering whatever you need to perform the backward pass, on a gate-by-gate basis; you can remember whatever you feel like. It has a memory footprint, and you can be clever with that.
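Putting those pieces together, the multiply gate comes out along the lines of the slide (scalar version; the class name and exact fields here are my reconstruction, so treat the details as approximate):

```python
class MultiplyGate:
    def forward(self, x, y):
        z = x * y
        self.x, self.y = x, y   # cache the inputs; the backward pass needs them
        return z

    def backward(self, dz):
        # chain rule: local gradient times the gradient dz from above
        dx = self.y * dz        # dz/dx = y
        dy = self.x * dz        # dz/dy = x
        return [dx, dy]
```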
Let's see an example of what this looks like in practice. We're going to look at specific examples in Torch. Torch is a deep learning framework, which we might go into a bit near the end of the class, and which some of you might end up using for your projects. If you go to the GitHub repo for Torch and look at it, basically it's just a giant collection of these layer objects, and these are the gates; gates and layers are the same thing. There are all these layers; that's really what a deep learning framework is: just a whole bunch of layers, and a very thin computational graph thing that keeps track of all the connectivity. So really, the image to have in mind is that all these layers are your Lego blocks, and we're building up these graphs out of the Lego blocks, out of the layers, putting them together in various ways depending on what you want to achieve, and you end up building all kinds of stuff. That's how you work with neural networks: every library is just a whole set of layers that you might want to compute, and every layer implements a small function piece, and that function piece knows how to do a forward and knows how to do a backward.

So, as a specific example, let's look at the MulConstant layer in Torch. The MulConstant layer performs just a scaling by a scalar: it takes some tensor x; so this is not a scalar now, but actually an array of numbers, basically, because when we actually work with these, we do a lot of vectorized operations; we receive a tensor, which is really just an n-dimensional array, and we scale it by a constant. You can see that this layer is actually just about forty lines. There's some initialization stuff (this is Lua, by the way, if it looks foreign to you), where you actually pass in the a that you want to use as your scaling constant. Then, during the forward pass, which they call updateOutput, all they do is multiply x by a and return it. And in the backward pass, which they call updateGradInput, there's an if statement here, but really, when you look at the three lines that are most important, you can see that all it's doing is copying into a variable gradInput what it needs to compute: the gradient that you're passing along. You take the gradient from above, your gradient on the final loss, you copy it over into gradInput,
and you multiply it by the scalar, which is what you should be doing, because your local gradient is just a. So you take the gradient from above and just scale it by a, which is what those three lines are doing, and that's your gradInput; that's what you return. So that's one of the hundreds of layers that are in Torch.

You can also look at examples in Caffe; Caffe is also a deep learning framework, specifically for images, that you might be working with. Again, if you go into the layers directory, you just see all these layers, and all of them implement the forward/backward API. So, just to give you an example, there's a sigmoid layer. The sigmoid layer takes a blob; Caffe likes to call these tensors blobs. So it takes a blob, which is just an n-dimensional array of numbers, and it passes it elementwise through a sigmoid function. So in the forward pass it's computing a sigmoid, which you can see there (let me use my pointer). A lot of this stuff is just boilerplate, getting pointers to all the data; then we have the bottom blob, and we're calling the sigmoid function on the bottom, and that's just the sigmoid function right there; that's what we compute in the forward pass. In the backward pass, there's some boilerplate stuff, but what's really important is that we need to multiply the gradient by the chain rule piece, and that's what you see in this line; that's where the magic happens. They call the gradients diffs, and you compute the bottom diff as the top diff times this piece, which is really the local gradient of the sigmoid. So this is the chain rule happening right there, through that multiplication. So again, every single layer is just a forward/backward API, and then you have a computational graph on top, or a net object, that handles the connectivity.

Any questions about these implementations and so on? [A student asks how training uses this.] Right: once I've done a backward pass, I have a gradient, and I can do an update right away; I take my gradient and change my weights a tiny bit in the negative direction of the gradient. So forward computes the loss, backward computes the gradient, and then the update uses the gradient to increment your weights a bit. That's what keeps happening in a loop when you train a neural network; that's all that's happening: forward, backward, update; forward, backward, update. We'll see that.
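For reference, here is a rough Python analogue of the layer just described (the real MulConstant is Lua/Torch code; this is only a sketch of the same idea):

```python
import numpy as np

class MulConstant:
    """Scale-by-constant layer: forward computes a*x, backward scales the
    upstream gradient by the same constant a."""
    def __init__(self, a):
        self.a = a                      # the scaling constant

    def forward(self, x):               # x is a tensor (numpy array)
        return self.a * x               # elementwise scaling

    def backward(self, grad_output):
        # local gradient of a*x w.r.t. x is just a, so by the chain rule we
        # scale the gradient from above elementwise and pass it along
        return self.a * grad_output
```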
[A student asks about the for loop in the Caffe code.] You're asking about the for loop there; I do notice it, OK, yeah, they have a for loop. Yes, you'd like this to be vectorized, and sure; this is C++, so I think they just go for it. And this is a CPU implementation, by the way; I should mention that this is the CPU implementation of the sigmoid layer, and there's a second file that implements the sigmoid layer on the GPU, and that's CUDA code. So that's a separate file; it would be sigmoid.cu or something like that; I'm not showing you that.

OK, great. So another point I'd like to make is that we will, of course, be working with vectors, so these things flowing along our graphs are not just scalars; they're going to be entire vectors. Nothing changes; the only thing that is different now, since x, y, and z are vectors, is that the local gradient, which before used to be just a scalar, is now, in general, for general expressions, a full Jacobian matrix. So it's a two-dimensional matrix that basically tells you what the influence of every single element of x is on every single element of z. That's what the Jacobian matrix stores, and the gradient is the same expression as before, but now, say, dL/dz is a vector, and dz/dx is an entire Jacobian matrix, so you end up with an entire matrix-vector multiply to actually chain the gradient.

[A student asks whether you form that matrix.] No; I'll come back to this point in a bit. You never actually end up forming the Jacobian; you'll almost never actually do this matrix multiply explicitly. This is just a general way of looking at an arbitrary function, to keep track of what's going on. And I think these two factors are actually out of order, because dz/dx, the Jacobian, should be on the left side; so that's a mistake in the slide; it should be a matrix-vector multiply.

So let me show you why you don't actually need to perform those Jacobian products. Let's work with a specific example that is relatively common in neural networks. Suppose we have this nonlinearity, max of zero and x. Really, what this operation is doing is receiving a vector, say of 4096 numbers, which is a typical size you might work with, 4096 real-valued numbers, and it's computing an elementwise thresholding at zero, so anything that is lower than zero gets clamped to zero, and that's the
function you're computing, and the output vector has the same dimension. The question I'd like to ask is: what is the size of the Jacobian matrix for this layer? It's 4096 by 4096; in principle, every single number in the input could have influenced every single number in the output, but that's not necessarily the case here. So the second question: this is a huge matrix, sixteen million numbers, but why would you never form it? What does the Jacobian actually look like?

[00:45:57] [Answer from audience: a diagonal matrix.] Right. Formally the Jacobian is still a giant 4096-by-4096 matrix, but it has special structure: because this is an elementwise operation, there are only elements on the diagonal. Moreover, they're not all ones: whichever element was less than zero got clamped to 0, so some of those diagonal entries are actually zeros, namely for the elements that had a value lower than zero during the forward pass. So the Jacobian is almost an identity matrix, except some diagonal entries are zero. You would never actually want to form the full Jacobian, because that's silly, and you never want to carry out the backward pass as an actual matrix-vector multiply, because there's special structure that we want to take advantage of. In particular, the backward pass for this operation is very easy: you just look at all the dimensions where the input was less than zero, and you kill the gradient in those dimensions; you set the gradient to 0 there. So you take the grad from above, and whichever inputs were less than zero, you set those gradients to zero, and you pass the result on.

[00:47:26] [Question] If you want, you can do that, but that's internal to the gate, and you can use it to do backprop; the other gates only care about the gradient vector you hand them. [Question: what if the output at the very end is a vector?] We'll never actually run into that case, because we almost always have a single scalar output at the end, since we're interested in loss functions; we just have a single number at the end that we're computing gradients with respect to. If we had multiple outputs, we'd have to keep track of gradients for all of those in parallel when we do the backpropagation; but we just have a scalar loss function, so we don't have to worry about that.
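In code, that shortcut is a couple of lines. A minimal numpy sketch, assuming the input x was cached during the forward pass and dout is the gradient arriving from above:

    import numpy as np

    def relu_forward(x):
        out = np.maximum(0, x)   # elementwise threshold at zero
        cache = x                # keep the input around for the backward pass
        return out, cache

    def relu_backward(dout, cache):
        x = cache
        dx = dout.copy()
        dx[x < 0] = 0            # kill the gradient where the input was negative
        return dx                # same result as multiplying by the diagonal Jacobian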
[00:48:45] I also want to make the point that 4096 isn't even crazy: usually we use minibatches, so say a minibatch of 100 elements going through at the same time, and then you end up with 100 4096-dimensional vectors all going through in parallel. All the examples in the minibatch are processed independently of each other, in parallel, so the Jacobian would really end up being 409,600 by 409,600. Huge. So you never form it; you take care to actually take advantage of the sparsity structure in the Jacobian, and you hand-code the operations; you don't write out the fully general expression inside any gate implementation.

[00:49:25] OK, so I'd like to point out that in your assignment you'll be writing SVMs and softmaxes and so on, and I just wanted to give you a hint on the design of how you should approach this problem. What you should do is think about it as backpropagation, even if you're doing this for classification optimization. Roughly, your structure should look something like this: stage your computation into units that you know the local gradient of, and then do backprop when you actually evaluate these gradients. In the first assignment your code will look something like this, where we don't have any graph structure, because you're doing everything inline; no graph objects or anything like that. You'll do that in the second assignment, where you'll actually build a graph object and implement your layers; in the first assignment you're just doing it inline, straight up. So: compute your scores based on W and X; compute the margins, which are max of 0 and the score differences; compute the loss; and then do backprop. In particular, I would really advise you to have this intermediate scores variable, a matrix that you create, and then compute the gradient on the scores before you compute the gradient on your weights; so, chain rule here. You might be tempted to just try to derive dW, the gradient on W, as one single expression and then implement that, but that's an unhealthy way of approaching the problem. So stage your computation and do backprop through the scores, and that will help you out.
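As an illustration of that staging advice, here is a sketch of a multiclass SVM loss with the backward pass staged through the intermediate scores. The shapes and names are hypothetical, not the assignment's actual starter code:

    import numpy as np

    def svm_loss_staged(W, X, y, delta=1.0):
        N = X.shape[0]
        # stage 1: scores
        scores = X.dot(W)                               # shape (N, C)
        correct = scores[np.arange(N), y][:, None]      # shape (N, 1)
        # stage 2: margins, then the loss
        margins = np.maximum(0, scores - correct + delta)
        margins[np.arange(N), y] = 0
        loss = margins.sum() / N
        # backprop stage 2: gradient on the scores
        dscores = (margins > 0).astype(float)
        dscores[np.arange(N), y] -= dscores.sum(axis=1)
        dscores /= N
        # backprop stage 1: chain rule into the weights
        dW = X.T.dot(dscores)
        return loss, dW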
[00:51:05] So, to summarize so far: these networks get hopelessly large, so we end up with these computational graph structures and these intermediate nodes, with a forward/backward API both for the nodes and for the graph structure. The graph structure is usually a very thin wrapper around all these layers, and it handles the communication between them; that communication is always along vectors being passed around. In practice, when we write these implementations, what we're passing around are these n-dimensional tensors, and really what that means is just an n-dimensional array; those are what go between the gates, and then internally every single gate knows what to do in the forward and backward pass. OK, so at this point I'm going to end with backpropagation and go into neural networks. Any questions before we move on from backprop?

[00:51:50] [Question about the assignment] The challenging part of the assignment, almost, is how you make sure you do all of this sufficiently nicely with vectorized operations in numpy; that's going to be something that takes some practice. [Question about designing your own gates] ...and what you want them to be. I don't think you'd want to do that. Yeah, I'm not sure; maybe that works, but it's up to you to design it and to backprop through it.

[00:52:30] OK, so let's go to neural networks. This is exactly what they look like; you'll be working with these. This is what happens when you search for "neural networks" on Google Images; this is, I think, the first result, or something like that. So let's look at neural networks, and before we dive in, I'd actually like to do it first without all the brain stuff. So forget that they're neural; forget that they have any relation whatsoever to the brain. They don't. Forget it if you thought that they did. Let's just look at score functions. Before, f = Wx is what we've been working with so far, but now, as I said, we're going to start to make that f more complex. If you want to use a neural network, you're going to change that equation to f = W2 max(0, W1 x). This is a two-layer neural network, and that's what it looks like; it's just a more complex mathematical expression of x. What's happening here is: you receive your input x, and you multiply it by a matrix, just like we did before. What comes next is a nonlinearity, or activation function, and I'm going to go into several choices you might make for these.
[00:53:44] In this case I'm using the threshold at zero as the activation function. So basically we're doing a matrix multiply, we threshold everything below zero to zero, and then we do one more matrix multiply, and that gives us our scores. If I were to draw this, say in the case of CIFAR-10, we have 3072 numbers going in, the pixel values. Before, we went through one single matrix multiply straight to the scores; we went right away to 10 numbers. But now we get to go through this intermediate representation; hidden vectors, hidden states; we'll call them hidden layers. So h is, say, a hundred numbers, or whatever you want the size of your network to be; this is a hyperparameter, and say it's a hundred. We go through this intermediate representation: a matrix multiply gives us a hundred numbers, we threshold at zero, and then one more matrix multiply gives us the scores. And since we have more numbers in play, we have more wiggle room to do more interesting things.

[00:54:33] One particular example of something interesting you might want the network to do: going back to the example of interpreting linear classifiers on CIFAR-10, we saw that the car class has this red car template that tries to merge all the modes, different cars facing in different directions. In that case one single linear classifier had to cover all those modes, and we couldn't deal with, for example, cars of different colors; that wasn't very natural to do. But now we have a hundred numbers in this intermediate representation, and you might imagine, for example, that one of those numbers could be picking up on a red car facing forward; its job is just to detect whether there is a red car facing forward. Another one could be a red car facing slightly to the left; another a red car facing slightly to the right; and those elements of h would only become positive if they find that thing in the image; otherwise they stay at zero. Another h might look for green cars, or yellow cars, or whatever else, in different orientations. So now we can have a template for each of these different modes, and these neurons turn on or off if they find the specific thing they're looking for. And then this W2 matrix can sum across all those little car templates; say we have twenty car templates of what cars could look like.
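A minimal numpy sketch of that two-layer score function, using the CIFAR-10 sizes from the example (3072 inputs, a hidden layer of 100, 10 classes); the names and the random data are illustrative:

    import numpy as np

    D, H, C = 3072, 100, 10              # input dim, hidden size, number of classes
    W1 = 0.01 * np.random.randn(H, D)    # first-layer weights
    W2 = 0.01 * np.random.randn(C, H)    # second-layer weights

    x = np.random.randn(D)               # stand-in for one flattened CIFAR-10 image
    h = np.maximum(0, W1.dot(x))         # matrix multiply, then threshold at zero
    scores = W2.dot(h)                   # one more matrix multiply: 10 class scores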
To complete the score of the car classifier, there's one additional matrix multiply: a weighted sum over those templates. If any one of them turned on, then through my weighted sum, with presumably positive weights, I would be adding it up and getting a higher score. So now I can have this multimodal car classifier through this additional hidden layer in between. That's a hand-wavy reason for why these networks can do something more interesting.

[00:56:15] [Question] That was actually a question; for extra points in the assignment you can do something fun or extra, whatever you think is an interesting experiment, and we'll give you some bonus points; that's a good candidate for something you might want to investigate, whether it works or not. Other questions?

[00:57:08] [Question: how do the hidden units get allocated over the different modes of the dataset?] I don't have a good answer for that. Since we're going to train this fully with backpropagation, I think it's naive to expect there will be an exact template for, say, a red car facing left. You'll probably find these kinds of mixes and weird intermediates, where the network optimally finds a way to cover the data with its boundaries, and the weights just adjust so it comes out right. It's really hard to say; they become entangled, I think that's right.

[00:58:10] [Question: what is the hundred?] That's the size of the hidden layer, and it's a hyperparameter, so we get to choose it; I chose a hundred. We'll go into this a lot, but usually you want these to be as big as possible, as big as fits on your computer and so on; so more is better. I'll get to that. [Question: do we always take max(0, x)?] We don't; I get to this in about five slides, when we go into neural networks proper, so maybe I should take questions near the end. If you wanted this to be a three-layer neural network, by the way, there's a very simple way to extend this: we just keep continuing the same pattern, with all these intermediate hidden nodes, and we can keep making the network deeper and deeper, and you can compute more interesting functions, because you're giving yourself more time to compute something interesting, in a hand-wavy way.

[00:59:03] One other slide I want to flash: training a two-layer neural network,
I mean, it's actually quite simple when it comes down to it. This is borrowed from a blog post, and basically it takes roughly eleven lines of Python to implement a two-layer neural network doing binary classification. You have a small data matrix X with three-dimensional inputs, you have binary labels y, and syn0 and syn1 are your weight matrices, weight one and weight two; I think they're called "syn" for synapse, but whatever. And then this is the optimization loop here. What you're seeing, and I should use my pointer more, is that we're computing the first-layer activations; but this is using a sigmoid nonlinearity, not the max(0, x); we'll go into what these nonlinearities might be in a bit. So: the first layer, then the second layer, and then it computes, right away, the backward pass: this l2_delta is the gradient on l2, then the gradient on l1, and then this here is the update. So they're doing the update at the same time as the final piece of backprop, where they formulate the gradient on the W and right away add it into the weights. So it's really eleven lines of Python that suffice to train a neural network on binary classification.

[01:00:25] The reason this loss may look slightly different from what you've seen so far is that this is a logistic regression loss. You saw a generalization of it, the softmax classifier, into multiple classes; this is basically a logistic loss being optimized here, and you can go through it in more detail by yourself; the logistic regression loss just looks slightly different, and that's what's inside there. But otherwise, yes: this is not too crazy of a computation, and very few lines of code suffice to actually train these networks. Everything else is plumbing: how you make it efficient, the cross-validation pipeline you need to have, all this stuff that goes on top and produces these large code bases. But the kernel of it is quite simple: we compute the layers, forward pass, backward pass, do an update, and we keep repeating this. Training begins by creating your initial random weights; you need to start somewhere, so you generate a random W.
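The snippet in question looks roughly like this. This is a reconstruction in the spirit of that blog post, not the exact slide, and the toy data here is made up:

    import numpy as np

    X = np.array([[0,0,1], [0,1,1], [1,0,1], [1,1,1]])   # toy inputs
    y = np.array([[0, 1, 1, 0]]).T                       # binary labels
    syn0 = 2 * np.random.random((3, 4)) - 1              # weight matrix 1
    syn1 = 2 * np.random.random((4, 1)) - 1              # weight matrix 2
    for j in range(60000):
        l1 = 1 / (1 + np.exp(-np.dot(X, syn0)))          # first layer (sigmoid)
        l2 = 1 / (1 + np.exp(-np.dot(l1, syn1)))         # second layer (sigmoid)
        l2_delta = (y - l2) * (l2 * (1 - l2))            # gradient on l2
        l1_delta = l2_delta.dot(syn1.T) * (l1 * (1 - l1))  # backprop into l1
        syn1 += l1.T.dot(l2_delta)                       # update fused with backprop
        syn0 += X.T.dot(l1_delta)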
[01:01:24] Now, I want to mention that you'll also be training a two-layer neural network in this class, so you'll be doing something very similar to this, but you're not using logistic regression, and you might have different activation functions. Again, my advice when you implement this: stage your computation into intermediate results, and then do proper backpropagation into every one of those intermediate results. So: you receive these weight matrices, and also the biases; I don't believe you have explicit biases in your SVM and softmax, but here you'll have biases. Take your weight matrices and the biases, compute the first hidden layer, compute the scores, compute your loss, and then do the backward pass: backprop into the scores, then backprop into the weights of the second layer, and backprop into this h1 vector; and then through h1, backprop into the first weight matrix and its biases. Do proper backpropagation here. Otherwise, if you try to write down right away what dW1, the gradient on W1, is, as one single expression, it will be way too large, and headaches. So do it through a series of steps of backpropagation. That's just a hint.
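Staged out in numpy, that advice looks roughly like this; a sketch under assumed shapes, not the assignment's actual code. Here dscores would come out of whatever loss function sits on top:

    import numpy as np

    # assumed shapes: X (N,D), W1 (D,H), b1 (H,), W2 (H,C), b2 (C,), dscores (N,C)
    def two_layer_grads(X, W1, b1, W2, b2, dscores):
        h1 = np.maximum(0, X.dot(W1) + b1)   # first layer (recomputed or cached)
        dW2 = h1.T.dot(dscores)              # backprop into second-layer weights
        db2 = dscores.sum(axis=0)
        dh1 = dscores.dot(W2.T)              # backprop into the hidden vector h1
        dh1[h1 <= 0] = 0                     # then through the ReLU threshold
        dW1 = X.T.dot(dh1)                   # finally into the first-layer weights
        db1 = dh1.sum(axis=0)
        return dW1, db1, dW2, db2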
[01:02:36] OK, now: that was the presentation of neural networks without all the brain stuff, and it looks fairly simple. So now we're going to make it slightly more insane by folding in all kinds of motivations, mostly historical, about how this came about and how it relates to the brain at all. So we have neural networks, and we have neurons inside these neural networks. This is what comes up when you do an image search for "neuron", so there you go. Now, your actual biological neurons don't quite look like this; they're more like that. Just very briefly, to give you an idea of where this is all coming from: you have the cell body, or soma, as people like to call it, and it's got all these dendrites that are connected to other neurons; there's a cluster of other neurons and cell bodies over here, and the dendrites are really these appendages that listen to them. So these are the inputs to a neuron, and then it's got a single axon that comes out of the neuron and carries the output of the computation that the neuron performs. Usually the neuron receives inputs, and if many of them align, then this cell, this neuron, can choose to spike: it sends an action potential down the axon, and the axon branches out to connect to the dendrites of other neurons downstream. So neurons are connected through these synapses in between; the dendrites are the inputs to a neuron, and the axon carries the neuron's output.

[01:04:08] So you can come up with a very crude model of a neuron, and it looks something like this. We have the cell body here, and imagine an axon coming from a different neuron somewhere in the network; this neuron is connected to that neuron through a synapse, and every one of these synapses has a weight associated with it: how much this neuron likes that neuron, basically. The axon carries the value x; it interacts with the synapse, and in this crude model they multiply, so you get w0 times x0 flowing to the soma. That happens for many inputs, so you have lots of w times x flowing in, and the cell body just takes a sum, offset by a bias; and then the sum passes through an activation function to actually compute the output on the axon.

[01:05:02] In biological models, historically, people liked to use the sigmoid nonlinearity, and the reason is that you get a number between 0 and 1, which you can interpret as the rate at which this neuron is firing for that particular input; a rate between zero and one coming out of the activation function. So if this neuron sees something it likes in the neurons connected to it, it starts to spike a lot, and the rate is described by f of the input. OK, so that's the crude model of a neuron. If I wanted to implement it, it would look something like this: a neuron class with a forward pass function; it receives some inputs, a vector; we perform the sum at the cell body, just a weighted sum; we compute the firing rate as a sigmoid of that sum, and return the firing rate.
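The lecture's pseudocode for that crude neuron, fleshed out into a runnable sketch (the class shape is illustrative):

    import numpy as np

    class Neuron(object):
        """Crude neuron: weighted sum at the cell body, sigmoid firing rate."""
        def __init__(self, weights, bias):
            self.weights = weights           # one weight per input synapse
            self.bias = bias

        def forward(self, inputs):
            cell_body_sum = np.sum(inputs * self.weights) + self.bias
            firing_rate = 1.0 / (1.0 + np.exp(-cell_body_sum))  # in (0, 1)
            return firing_rate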
And then this can plug into other neurons, right? You can actually see that this looks very similar to a linear classifier: we're forming a weighted linear sum and passing it through a nonlinearity. So every single neuron in this model is really like a small linear classifier, except these classifiers plug into each other, and they can work together to do interesting things.

[01:06:15] Now, one point to make about these neurons: they're not like biological neurons. Biological neurons are super complex, so if you go around saying that neural networks work like the brain, people will start frowning at you, and that's because neurons are complex dynamical systems; there are many different types of neurons; they function differently; the dendrites can perform lots of interesting computation (a good review article is "Dendritic Computation", which I really enjoyed); the synapses are complex dynamical systems, not just a single weight; and we're not really sure the brain uses a rate code to communicate. So it's a very crude mathematical model; don't push the analogy too much, though it's good for media articles, and I suppose that's why this keeps coming up again and again when we explain that this works like your brain. I'm not going to go deeper into this.

[01:07:05] To go back to a question that was asked earlier: there's an entire set of nonlinearities we can choose from. Historically, the sigmoid has been used quite a bit, and we're going to go into much more detail about what these nonlinearities are, what their tradeoffs are, and why you might want to use one or the other. For now, just as a preview: there are many to choose from. Historically people used sigmoid and tanh; as of 2012, ReLU became quite popular; it makes your networks train quite a bit faster, so right now, if you want a default choice for a nonlinearity, ReLU is the current default recommendation. Then there are a few more activation functions here: Leaky ReLU was proposed a few years ago; Maxout is interesting; and very recently, ELU. So you can come up with different activation functions and argue about why they might work better or not, and this is an active area of research: trying to come up with activation functions that perform better, that have better properties in one way or another. We'll go into this in much more detail soon, but for now: we have these neurons, we have a choice of activation function, and then we
arrange these neurons into neural networks: we just connect them together so they can talk to each other. So here is an example of what you'd call a two-layer neural net. When you want to count the number of layers in a neural net, you count the number of layers that have weights. The input layer does not count as a layer, because the inputs are just single values; no neurons there actually do any computation. So we have two layers here that have weights in them, and we call these layers fully-connected layers.

[01:08:45] I've shown you that a single neuron computes this little weighted sum and passes it through a nonlinearity. In a neural network, the reason we arrange neurons into layers is that doing so lets us perform the computation much more efficiently. Instead of an amorphous blob of neurons, every one of which has to be computed independently, having them in layers allows us to use vectorized operations, so we can compute an entire hidden layer of neurons with one single matrix multiply. That's why we arrange them in these layers, where the neurons inside a layer can be evaluated completely in parallel; they all compute the same kind of thing. It's a computational trick. This is a three-layer neural net, and this is how you would compute it: just a bunch of matrix multiplies, each followed by an activation function.
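The vectorized evaluation of such a three-layer net, as a short sketch in the spirit of the course notes; the sizes here are a made-up example:

    import numpy as np

    f = lambda x: 1.0 / (1.0 + np.exp(-x))     # sigmoid activation, elementwise

    x = np.random.randn(3, 1)                   # a random input vector
    W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
    W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
    W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

    h1 = f(np.dot(W1, x) + b1)                  # whole first hidden layer at once
    h2 = f(np.dot(W2, h1) + b2)                 # second hidden layer
    out = np.dot(W3, h2) + b3                   # output scores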
[01:09:35] Now I'd like to show you a demo of how these neural networks work. I'll pull up the demo in a bit, but basically this is an example of a two-layer neural network doing a binary classification task, two classes, red and green. We have these points in two dimensions, and I'm drawing the decision boundaries learned by the neural network. What you can see is that the more hidden neurons I have in my hidden layer, the more wiggle the neural network has; the crazier the functions it can compute. Let me also show you the regularization strength; this is how much you penalize large W. You can see that when you insist that your W be very small, you end up with very smooth functions; they don't have as much variance, so there's not as much wiggle these networks can give you; and as you decrease the regularization, they can do more and more complex things, squeezing in to cover outlier points in the training data. So let me show you what this looks like during training.

[01:10:47] There's some stuff to explain here, and by the way, you can play with this yourself, because it's all in Javascript. All right, so what we're doing here: we have six neurons, and this is a binary classification dataset with circle data; a little cluster of green dots surrounded by red dots, and we're training a neural network to classify this dataset. If I restart the neural network, it starts off with a random W and then converges its decision boundary to actually classify the data. What I'm showing on the right, which is the cool part, is one interpretation of what the network is doing: I'm taking the grid here and showing how this space gets warped by the neural network. So you can interpret the neural network as using its hidden layer to transform your input data in such a way that the second layer can come in with a linear classifier and classify your data. Here you see that the network warps the space such that the second layer, which is really a linear classifier on top of the first layer, can put a plane through it. It's warping the space so that you can put a plane through it and separate out the points.

[01:12:06] Let's look at this again; you can really see how the space gets warped so that you can linearly classify the data. This is something people sometimes also mention in connection with the kernel trick: changing your data representation to a space where it's linearly separable.

[01:12:19] Now here's a question. Right now we have six neurons in the intermediate layer, and that allows us to separate these points; you can actually roughly see those six neurons in these lines here; they're kind of like the functions of individual neurons. So here's a question for you: what is the minimum number of neurons for which this dataset is separable with a neural network? That is, if I want the network to correctly classify this, what's the minimum?
[01:13:15] [Audience: three? four?] So, intuitively, the way this works with four is: there's one neuron here that cuts from this side to that side, another from this side to that side, another from this side to that side; there are four neurons cutting up this plane, and then there's an additional layer doing a weighted sum of them. In fact, the lowest number here would be three, which would work. So with three neurons: one plane, a second plane, a third plane; three linear functions with a nonlinearity; and with three lines you can carve out the space so that the second layer can just combine them. When the number is two, certainly, this will break, because two lines are not enough. Actually, it found something pretty decent here, but with two, basically, it finds the optimal way of just using those two lines: they kind of create this tunnel, and that's the best you can do.

[01:14:30] [Question: what would this look like with ReLU?] I think if I were using ReLU here, I think you'd see sharp boundaries. Yeah, let's do it with four. In some of these regions, more than one of those ReLUs is active, so there are really three lines, I think, like one, two, three, but in some of the corners two ReLUs are active, and so those weights add up; it gets kind of funky; you have to think about it.

[01:15:12] OK, so let's look at, say, twenty neurons; change it to 20, so we have lots of capacity. And let's look at a different dataset, like the spiral. You can see how, as I'm running these updates, it just goes in there and figures it out. Very simple data. Now the circle. And then random data; it kind of goes in there and covers up the green ones and the red ones. And with fewer, say, let me go with five, I'm going to break it now: yes, this starts working worse and worse, because you don't have enough capacity to separate out this data. So you can play with this in your free time.

[01:16:05] As a summary: we arrange these neurons into neural networks, into fully-connected layers; we looked at backprop and how this gets computed with computational graphs; they're not really neural; and, as you'll see soon, the bigger the better, and we'll go into that a lot. I want to take questions before I end; we have two more minutes. Yes, thank you.
[01:16:41] So: is it always better to have more neurons in a neural network? The answer to that is yes; more is always better; it's usually a computational constraint. More will always work better, but then you have to be careful to regularize it properly. The correct way to constrain your network so it doesn't overfit your data is not by making the network smaller; the correct way is to increase the regularization. So you always want to use as large a network as you can, but then you have to make sure to regularize it properly. Most of the time, though, for computational reasons, since we don't have time to wait forever to train our networks, we use smaller ones for practical reasons.

[01:17:15] [Question: do you regularize every layer equally?] Usually you do, as a simplification; most often, when you see networks trained in practice, they're regularized the same way throughout, but you don't have to, necessarily.

[01:17:33] [Question: is anybody using second-order methods to optimize these networks?] There is value sometimes: when your datasets are small you can use things like L-BFGS, which I won't go into too much; it's a second-order method. But usually the datasets are really large, and that's when L-BFGS doesn't work very well; when you have millions of data points you can't do L-BFGS. And L-BFGS is not very good with minibatches; you almost always have to fall back to first-order methods by default.

[01:18:02] [Question: how do you choose the number of layers?] I don't have a good answer for that, unfortunately. Depth is good, but maybe after, say, ten layers, on a simple dataset, it's not really adding too much. One minute left, so I can still take some questions. [Question: the tradeoff between depth and width, where do I allocate my capacity?] Not a very good answer to that either. Usually, especially with images, we find that more layers are critical; but sometimes, when you have simple datasets, like 2-D data or other small problems, depth is not as critical; so it's kind of slightly data-dependent.

[01:19:02] [Question: can different layers use different activation functions?] Usually it's not done; usually you just pick one and go with it. For example, with ReLU, most networks use it throughout, and there's no real benefit to switching them around, so people don't play with that too much; though in principle there's nothing preventing you.
[01:19:22] OK, so it is 4:20, so we're going to end here, but we'll see a lot more of neural networks, and a lot of these questions we'll get to as we go.

diff --git a/captions/En/Lecture5_en.srt b/captions/En/Lecture5_en.srt
new file mode 100644
index 00000000..1ced507c
--- /dev/null
+++ b/captions/En/Lecture5_en.srt
@@ -0,0 +1,5289 @@

[00:00:00] [Opening remarks partly unintelligible.] ...OK, most of you finished it, some of you haven't, but OK. I'll be holding makeup office hours right after this class. Assignment 2 will be released tomorrow or the day after tomorrow; we haven't fully finalized the date, we're still working on it; we're changing it from last year, so we're in the process of developing it, and we hope to have it out as soon as possible. It's a bigger one, so you do want to get started on it ASAP once it's released. We might be adjusting the due date or something, too, because it is slightly larger; so we'll be shuffling some of these things around, and the grading scheme and such is just tentative and subject to change, because we're still trying to figure out the course; it's still relatively new, and a lot of it is changing. So those are just some heads-ups before we start.

[00:00:53] In terms of your project proposal, by the way, which is due in roughly ten days: I wanted to bring up a few points, because you'll be thinking about your projects, and some of you might have misconceptions about what makes a good or bad project. Just two of them. The most common one, probably, is that people are hesitant to work with small datasets, because they think ConvNets require a huge amount of training data. And this is true: there are hundreds of millions of parameters in ConvNets, and they need training. But actually, for your purposes in the project, this is not something you have to worry about a lot; you can work with smaller datasets, and it's OK. The reason it's OK is that we have this process, which we'll go into in much more detail later in the class, called fine-tuning. In practice you rarely ever train these giant ConvNets from scratch; you almost always do this pre-training and fine-tuning process. The way it works is: you almost always take a convolutional network, train it on some large dataset, say ImageNet, which is a huge amount of data, and then, when you're interested in some other dataset, you don't train your ConvNet on
comment on + +29 +00:01:54,618 --> 00:01:58,430 +your small business that will turn it +here and then we'll transfer it over + +30 +00:01:58,430 --> 00:02:01,240 +there and the way this transfer works +like it is + +31 +00:02:01,239 --> 00:02:05,359 +so here's a schematic of a comedy show +network we start for the image and talk + +32 +00:02:05,359 --> 00:02:09,000 +and we'll go through a series of layers +down to a classifier so you're used to + +33 +00:02:09,000 --> 00:02:12,150 +this but we haven't of course talked +about the specific players here but we + +34 +00:02:12,150 --> 00:02:16,120 +take that image net free trade network +we trained on a minute and then we + +35 +00:02:16,120 --> 00:02:20,129 +chopped off the top layer the classifier +with chopped off take it away and we + +36 +00:02:20,129 --> 00:02:24,150 +train the entire commercial network has +a fixed feature extractor and so you can + +37 +00:02:24,150 --> 00:02:27,219 +put that feature extractor on top of +your new dataset and you're just going + +38 +00:02:27,219 --> 00:02:30,739 +to swap in a different layer that +performs a classification on top and so + +39 +00:02:30,739 --> 00:02:34,810 +depending on how much data you have your +own going to train the last layer of + +40 +00:02:34,810 --> 00:02:38,159 +your network or you can do fine tuning +where you actually back propagate + +41 +00:02:38,159 --> 00:02:41,379 +through some portions of the combat and +get more data you're going to do back + +42 +00:02:41,379 --> 00:02:47,229 +propagation deeper through the network +and in particular the spring training + +43 +00:02:47,229 --> 00:02:51,649 +sample image net people do this for you +so there's a huge line of people who've + +44 +00:02:51,650 --> 00:02:55,400 +trained comes home networks will +reluctance of time weeks on different + +45 +00:02:55,400 --> 00:02:58,939 +datasets and then they upload the weight +of the comment on line is there + +46 +00:02:58,939 --> 00:03:02,229 +something called a couple models who for +example and these are all these + +47 +00:03:02,229 --> 00:03:05,629 +commercial networks have been preaching +on large data sets they already have + +48 +00:03:05,629 --> 00:03:09,310 +lots of the parameters learned and see +just take the surrounding swapping your + +49 +00:03:09,310 --> 00:03:12,769 +datacenter you find him through the +network so basically if you don't have a + +50 +00:03:12,769 --> 00:03:16,799 +lot of data that's okay and you just +take a preacher in combat and just fine + +51 +00:03:16,799 --> 00:03:20,500 +tune it and so don't be afraid to work +with small dataset it's going to work + +52 +00:03:20,500 --> 00:03:27,239 +out of the second thing that we saw some +problems with last time is that people + +53 +00:03:27,239 --> 00:03:31,209 +think they have infinite computer and +this is also a metal just like to point + +54 +00:03:31,209 --> 00:03:35,000 +out don't be overly ambitious and what +you propose these things take a while to + +55 +00:03:35,000 --> 00:03:37,959 +train you don't have too many GPUs +you're going to have to hyper + +56 +00:03:37,959 --> 00:03:41,780 +optimization there's a few things you +have to worry about here so we had some + +57 +00:03:41,780 --> 00:03:45,840 +projects last year where people proposed +projects of training on very large data + +58 +00:03:45,840 --> 00:03:51,889 +sets and you just don't have the time so +be mindful of that and yeah you'll get a + +59 +00:03:51,889 --> 00:03:54,980 +better sense as we go through the class +and what is or is 
[00:03:55] OK, we're going to dive into the lecture. Are there any administrative things I maybe left out that you'd like to ask about? OK, good. So we're going to dive into the material; we have quite a bit of it today. Just as a reminder of where we are: we're in the middle of training neural networks, and the four-step process of training a neural network is as simple as one, two, three, four. You sample your data, a batch from the dataset; you forward it through your network to compute the loss; you backpropagate to compute your gradients; and then you do a parameter update, where you tweak your weights slightly in the direction of the negative gradient. When you keep repeating this process, what it really comes down to is an optimization problem, where, in weight space, we're converging into areas of the weight space with low loss, and that means we're correctly classifying our training set.

[00:04:48] We saw that these can get very large; I flashed this image of a Neural Turing Machine; basically these are huge computational graphs, and we need to do backpropagation through them. So we talked about the intuition behind backpropagation, and the fact that it's really just a recursive application of the chain rule, from the back of the circuit to the front, where we're chaining gradients through all the local operations. We looked at some implementations of this, specifically the forward/backward API, both for the computational graph and for its nodes, which implement the same API and do forward propagation and backward propagation. We looked at specific examples in Torch and in Caffe, and I drew the analogy that these layers, or gates, are kind of like Lego blocks, the little blocks from which you build out these entire networks.

[00:05:30] Then we talked about neural networks, first without the brain stuff, where what it amounts to is that we make this function, which goes from your image to your class scores, more complex; and then we looked at neural networks from the brain-stuff perspective, where there's a crude analogy to a neuron, and what we're doing is stacking these neurons into layers. So that's roughly where we are right now, and in this class we're going to talk about the process of training neural networks effectively.
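The four-step loop from the recap above, written out as a minimal runnable sketch; the tiny softmax model and random data are made-up stand-ins:

    import numpy as np

    rng = np.random.default_rng(0)
    X, y = rng.standard_normal((1000, 20)), rng.integers(0, 3, 1000)
    W = 0.01 * rng.standard_normal((20, 3))      # a toy linear "network"
    learning_rate = 1e-2

    for step in range(100):
        idx = rng.integers(0, X.shape[0], 64)            # 1. sample a batch
        Xb, yb = X[idx], y[idx]
        scores = Xb.dot(W)                               # 2. forward ...
        scores -= scores.max(axis=1, keepdims=True)
        probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
        loss = -np.log(probs[np.arange(64), yb]).mean()  # ... to the loss
        dscores = probs.copy()                           # 3. backprop the gradient
        dscores[np.arange(64), yb] -= 1
        dW = Xb.T.dot(dscores) / 64
        W -= learning_rate * dW                          # 4. parameter update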
[00:06:02] Before I dive into the details of that, I just wanted to pull back and give you a zoomed-out view, a bit of history of how this evolved over time. If you try to find where this field comes from, where these ideas were first proposed and so on, you'll probably go back to roughly the 1950s: Frank Rosenblatt, in 1957, was playing around with something called the perceptron. The perceptron basically ended up being an implementation in hardware; they didn't just write code like we do now; they actually had to build these things out of circuits and electronics in those days, for the most part. The perceptron was roughly this function here, and it looks very similar to what we're familiar with, written out explicitly; but the activation function, where we're used to something like a sigmoid, was actually a step function: it output either 1 or 0, a binary step. And since this is a step function, you'll notice it's not a differentiable operation, so they were not able to backpropagate through it; in fact the concept of backpropagation for training neural networks came much later. So they had these binary stepwise functions, perceptrons, and they came up with learning rules: a kind of ad hoc, hand-specified rule that tweaked the weights to make the output of the perceptron match the desired target values. But there was no concept of a loss function, no concept of backpropagation; these were ad hoc rules which, when you look at them, almost do backprop, but it's funny, because the step function is not differentiable.

[00:07:32] Then people started to stack these. In 1960, with the advent of Adaline and Madaline by Widrow and Hoff, they started to take these perceptron-like units and stack them into the first multilayer perceptron networks. This was still all done in electronics, actually building it out of hardware, and there was still no backpropagation at this time; it was all rules they came up with, along the lines of flipping weights and seeing if it works better or not; there was no view of backpropagation at this point. So, roughly in the 1960s, people got very excited about
Then people started to stack these. In 1960, with the advent of ADALINE and MADALINE by Widrow and Hoff, they started to take these perceptron-like things and stack them into the first multilayer perceptron networks. This was still all done in electronics, actually building things out in hardware, and there was still no backpropagation at this time; these were all rules they came up with by, say, flipping a connection and seeing whether it worked better or not. So around 1960 people got very excited, building up these circuits, and they thought this could go really far: we can have circuits that learn. You have to remember that back then the concept of programming was very explicit, you write a series of instructions for a computer, and this was the first time people were thinking about a data-driven approach, where you have some kind of circuit that can learn. At the time that was a huge conceptual leap that people were very excited about, but these networks did not actually end up working very well right away; people got slightly overexcited, overpromised, and slightly underdelivered, and so throughout the 1970s the field was very quiet and not much research was done.

The next boost came around 1986, when there was an influential paper that was basically the first time you see backpropagation-like rules in a nicely presented format. This is Rumelhart, Hinton, and Williams. They were playing with multilayer perceptrons, and this is the first time that, when you go to the paper, you actually see something that looks like backpropagation. At this point they had already discarded the idea of ad hoc rules: they actually had a loss function and talked about backpropagation, gradient descent, and so on. So people got excited again in 1986, because they felt they now had a principled, nice credit-assignment scheme through backpropagation and they could train networks. The problem, unfortunately, was that when they tried to scale these networks up, to make them deeper or larger, they didn't work very well compared to some of the other things in the machine learning toolkit; they just did not give very good results at this time. Training would get stuck, and the optimization basically did not work very well, especially if you wanted large, deep networks. This was the case for roughly twenty years, where again there was less research on neural networks because somehow it wasn't working very well and you couldn't train them. Then in 2006 the research was once again reinvigorated by a paper in Science by Hinton and Ruslan Salakhutdinov (I can never quite say his name). Basically, what they found here
was roughly the first time we could actually have, say, a ten-layer neural network that trains properly. What they did was, instead of training all ten layers by backpropagation in a single pass, they came up with an unsupervised pre-training scheme using what's called a restricted Boltzmann machine. What this amounts to is: you train your first layer using an unsupervised objective, then you train your second layer on top of it, then the third and the fourth, and once all of these are trained you put them all together and then start backpropagation, the fine-tuning step. So it was a two-step process: first pre-train stepwise through the layers, then put them together, and then backpropagation works. This was the first sign that backpropagation basically needed this initialization from the unsupervised pre-training; otherwise it would not work out of the box from scratch. We're going to see why in this lecture: it's kind of tricky to get these deep networks to train from scratch using just backprop, and you have to really think about it. It turned out later that you actually don't need the unsupervised pre-training; you can train with backprop right away, but you have to be very careful with initialization. And they were using sigmoid networks at this point, and sigmoids are just not a great option. So basically backprop works, but you have to be careful in how you use it. That was 2006, and a bit more research came back to the area; it was rebranded as deep learning, though really it's still synonymous with neural networks, just a better word for marketing. At this point things started to work properly and people could actually train networks, but still not too many people paid attention.

When people started to really pay attention was roughly around 2010 and 2012. Specifically, in 2010 there was the first really big result where neural networks worked really well compared to everything else in the machine learning toolkit, kernel machines, SVMs and so on. This was in the speech recognition area, where they took the GMM-HMM framework, swapped out one part for a neural network, and it gave huge improvements in 2010.
This was work from Microsoft, and people started to pay attention because it was the first time neural networks really delivered large improvements. Then we saw it again in 2012, where it played out even more dramatically in the domain of visual recognition and computer vision: this 2012 network by Alex Krizhevsky and Geoff Hinton basically crushed all the competition built on hand-engineered features. There was a really large improvement from these neural networks, that's when people really started to pay attention, and since then the field has kind of exploded; there's a lot of activity in this area now. We'll go into the details a bit later in the class of why it started to work around 2010. It's a combination of things, but I think we figured out better ways of initializing and better activation functions, we had GPUs, and we had much more data; a lot of the earlier work didn't quite pan out simply because the compute and the data weren't there, and some of the ideas needed tweaking. So that's the rough historical setting: we basically went through cycles of overpromising and underdelivering, and now it seems like things are actually starting to work really well. That's where we are at this point.

OK, I'm going to dive into the specifics now, and we'll see exactly how we actually train neural networks properly. The overview of what we're going to cover over the next few lectures is a whole bunch of independent things, so I'll just be peppering you with all these little areas that we have to understand; we'll see what people do in practice and go through the pros and cons of all the choices, and how you actually properly train neural networks on real-world datasets.

The first thing we're going to talk about is activation functions, as I promised a lecture or so ago. The activation function is this function at the top of the neuron, and we saw that it can take many different forms; these are all different proposals for what activation functions can look like. We're going to go through their pros and cons, and how you think about what the desirable properties of an
activation function are. Historically, the one that has been used the most is the sigmoid nonlinearity, which looks like this. It's basically a squashing function: it takes a real-valued number and squashes it to be between 0 and 1. The first problem with the sigmoid, as was pointed out a few lectures ago, is that saturated neurons, neurons whose output is either very close to 0 or very close to 1, kill gradients during backpropagation. I'd like to expand on exactly what this means, because it contributes to something we're going to call the vanishing gradient problem. So let's look at a sigmoid gate somewhere in the circuit: it receives some input x, and the sigmoid output comes out; in backprop we have dL/dσ flowing in, and we'd like to backprop through the sigmoid gate using the chain rule, so that we end up with dL/dx. The chain rule basically tells us to multiply those two quantities. So think about what happens when this sigmoid gate receives an input of -10, or 0, or 10: it computes its value, it's getting some gradient from the top, and what happens to that gradient as you backprop through the circuit? Where is the possible problem in some of these cases?

[Student answer] Yes, you're saying the gradient is very low when x is -10 or 10, and the way to see this is that we have this local gradient, dσ/dx, that gets multiplied with the incoming gradient. When you're at -10 you can see that the local gradient is basically zero, because the slope at that point is zero, and the gradient at 10 will also be near zero. So the issue is that a gradient will flow in from above, but if you're in the saturated regime, basically at 0 or at 1, then the gradient gets killed: it's multiplied by a very tiny number, and gradient flow stops through that sigmoid neuron. You can imagine that if you have a large network of sigmoid neurons and many of them are in a saturated regime, either 0 or 1, gradients can't backpropagate through the network, because they'll be stopped. Gradients only flow if you're in the safe zone, what we call the active region of the sigmoid.
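To make the saturation concrete, here is a small numerical check of the sigmoid's local gradient at the three inputs mentioned above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Local gradient of the sigmoid: dsigma/dx = sigma(x) * (1 - sigma(x))
for x in [-10.0, 0.0, 10.0]:
    s = sigmoid(x)
    local_grad = s * (1 - s)
    print(x, s, local_grad)
# x = -10 or 10: local gradient ~ 4.5e-05, so the upstream gradient is
# multiplied by nearly zero and effectively killed (saturated neuron).
# x = 0: local gradient is 0.25, the sigmoid's maximum; gradients flow.
```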
So that's kind of a problem, and we'll see more about it soon.

Another problem with the sigmoid is that its outputs are not zero-centered. We'll talk about preprocessing soon, but when you preprocess your data you always want to make sure it's zero-centered; in this case, suppose you have a big network with several layers of sigmoid neurons: they're outputting these non-zero-centered values between 0 and 1, and we're basically feeding them into more linear classifiers stacked on top of each other. Let me try to give you a bit of intuition on what goes wrong with non-zero-centered outputs. Consider a neuron that computes this function, the sigmoid of w·x + b, and think about what you can say about the gradients on w during backpropagation if your x's are all positive, in this case between 0 and 1; say this is a neuron somewhere deep in the network. What can you say about the weight gradients if all the x's are positive numbers?

[Student answer] Right, they're constrained: the gradients on w are either all positive or all negative. That's because the gradient flows in from the top, and if you think about the expression for the w gradients, they're basically x times the gradient from above; so if the gradient on the output of the neuron is positive, then all your w gradients will be positive, and vice versa. So you end up in this situation, suppose you have just two weights, where as gradients come through, the updates to the weights are either all positive or all negative. The issue is that you're constrained in the kind of update you can make, and you end up with an undesirable zig-zagging path if you want to reach parts of the weight space outside of those two quadrants.
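A tiny numerical illustration of that constraint (the values here are made up for the example):

```python
import numpy as np

# If all inputs x are positive (as sigmoid outputs are), then
# dL/dw = upstream_grad * x has the same sign for every weight:
x = np.array([0.2, 0.7, 0.9])   # all positive, e.g. sigmoid outputs

upstream_grad = +1.5            # gradient flowing in from above
print(upstream_grad * x)        # [0.3  1.05  1.35] -- all positive

upstream_grad = -0.5
print(upstream_grad * x)        # [-0.1 -0.35 -0.45] -- all negative
# Updates can only push all weights up together or down together,
# which forces the zig-zag path described above.
```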
This is a slightly hand-wavy argument, just to give you intuition, but you can see it empirically: when you train with things that are not zero-centered, you observe slower convergence. If you want to go much deeper into why, you can, and people have analyzed this, but then you have to reason about the mathematics of Fisher matrices and natural gradients, and it gets a bit more complex. I just wanted to give you the intuition that you want zero-centered inputs and zero-centered values throughout the network, so that things train nicely. So that's another downside of the sigmoid neuron. The last one is that the exp() function inside this expression is kind of expensive to compute compared to some of the alternative nonlinearities. It's a small detail: when you actually train large convolutional networks, most of the compute time is in the convolutions and the dot products, not in this exponential, so it's a vanishingly small contribution, but it's still a bit of a downside compared to the other choices.

Now, the tanh is an attempt to fix one of these problems, in particular the fact that the sigmoid is not zero-centered. LeCun wrote a very nice paper in 1991 on how to optimize neural networks (I link to it from the syllabus), and he recommended that people use the tanh. The tanh is basically like two sigmoids put together: you end up with outputs between -1 and 1, so they're zero-centered, but otherwise you still suffer from the other problems; for example, you still have these regions where, if you get saturated, no gradients flow. So we haven't really fixed that, but the tanh is, I think, strictly preferred to the sigmoid.
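A quick check of the "two sigmoids put together" picture: tanh is exactly a rescaled, shifted sigmoid, which is why it squashes to [-1, 1] and is zero-centered:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# tanh(x) = 2 * sigmoid(2x) - 1
x = np.linspace(-3, 3, 7)
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True
```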
Let's continue, and then maybe we can take more questions. Around 2012, in the paper by Alex Krizhevsky, the first of these large convolutional networks papers, they noted that this nonlinearity, where you use max(0, x) instead of a sigmoid or tanh, just makes your networks converge much quicker, in their experiments by almost a factor of 6. We can go back and try to think about why. You can see that it works better in practice, but explaining why is not always as easy; here are some of the reasons people think it works much better. One is that this ReLU neuron does not saturate, at least in the positive region, so at least in that region you don't have the vanishing gradient problem where your gradients just kind of die. You don't have the issue where the neuron is only active in a small area that is bounded from both sides: these neurons are actually active, in the sense of backpropagating correctly, in at least half of their input region. They're also much more computationally efficient; you're just thresholding at zero. And experimentally, you can see that this converges much, much faster. So this is called the ReLU, the rectified linear unit; it was pointed out in that paper for the first time that it works much better, and it's kind of the default recommendation for what you should use at this point.

At the same time, there are several problems with the ReLU neuron. One thing to notice, again, is that it's not zero-centered, so it's not completely ideal either. And here's a slight annoyance of the ReLU that we can talk through: what happens when the input is negative, say -10? What happens during backpropagation if a ReLU does not become active in the forward pass and stays inactive; during backprop, what does it do? It kills the gradient, right. The way to see this is that, in the same picture, if your input is negative, say -10, then your local gradient here will just be zero, because the gradient is identically zero in that region; you don't just squish the gradient down, you actually kill it completely. So any ReLU that did not activate will not backpropagate downward: its weights will not be updated, and nothing below it gets a gradient contribution from it. And if the input was, say, 10, what was the local gradient? It's just one, so it just passes the gradient through. The ReLU acts like a gradient gate: if its input was positive, it passes the gradient through; otherwise, it kills it. By the way, what happens when x is 0, what is your gradient at that point? It's actually undefined, that's right; the gradient does not exist at that point. Whenever I talk about the gradient here, just assume I always mean the subgradient, which is a generalization of the gradient to functions that are sometimes not differentiable: at the kink the limit does not exist, but there's a whole family of subgradients, which could be 0 or 1, and that's what we use in practice.
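A minimal sketch of the ReLU as a gradient gate; the subgradient choice of 0 at x == 0 is ours, and as discussed it doesn't matter in practice:

```python
import numpy as np

def relu_forward(x):
    out = np.maximum(0, x)
    return out, x          # cache the input for the backward pass

def relu_backward(upstream_grad, x):
    # Where the input was positive, the local gradient is 1 and the
    # upstream gradient passes through unchanged; where it was
    # negative (or exactly 0, our subgradient choice) it is killed.
    return upstream_grad * (x > 0)

x = np.array([-10.0, 0.0, 10.0])
out, cache = relu_forward(x)
print(relu_backward(np.array([2.0, 2.0, 2.0]), cache))  # [0. 0. 2.]
```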
The distinction doesn't really matter too much, but I want to address it because, in the case of a binary max gate, max(x, y), someone asked what happens if x and y are equal: in that case you also have a kink in the function, and it's not differentiable there. In practice these things don't really matter; just pick a subgradient of 0 or 1 there and things will work just fine, roughly because these are very unlikely cases to land on exactly.

OK, so here's the issue with ReLUs, the problem that happens in practice when you train with ReLU units. One thing you have to be aware of is that these neurons, if they don't output anything, won't get any gradient and won't get an update. So what's the issue? What can happen is that when you initialize your ReLU neurons, you can initialize them in a not very lucky way. Suppose this is your data cloud of inputs to your ReLU neurons: what you can end up with is what we call a dead ReLU, a dead neuron. If a neuron only activates in a region outside of your data cloud, then this dead ReLU will never become activated, and then it will never update. This can happen in one of two ways. Either during initialization you were really, really unlucky and happened to sample weights for a ReLU neuron in such a way that the neuron will never turn on; in that case the neuron will never train. Or, more often, it happens during training: if your learning rate is high, think of these neurons as bouncing around; it can happen, sometimes just by chance, that they get knocked off the data manifold, and when that happens they will never get activated again and will never come back to the data manifold. You actually see this in practice. Sometimes you can train a big neural net with ReLUs, it seems to work fine, and then you stop the training, pass your entire training dataset through the network, and look at the statistics of every single neuron; what you can find is that as much as 10 or 20 percent of your network is dead: neurons that never turned on for anything in the training data. This can actually happen, and usually it's because your learning rate was high.
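A hypothetical sketch of that diagnostic; `hidden_activations` is a stand-in for whatever function gives you one layer's post-ReLU outputs over a set of examples:

```python
import numpy as np

def fraction_dead(hidden_activations, X):
    # Pass the whole training set through the network and check
    # which ReLU units never turn on for any example.
    acts = hidden_activations(X)          # shape (num_examples, num_units)
    ever_active = (acts > 0).any(axis=0)  # did each unit ever fire?
    return 1.0 - ever_active.mean()

# If this returns something like 0.1-0.2 (10-20% dead), the learning
# rate was likely too high, or the initialization was unlucky.
```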
Those are just dead parts of your network. You could concoct hacky schemes for re-initializing these neurons and so on; people don't usually do that much, but it's something to be aware of, and it's a problem with this nonlinearity. Because of this dead-ReLU problem, especially at initialization, what people like to do is initialize the biases not with 0 but with slightly positive numbers, like 0.01, because that makes it more likely that at initialization these ReLU neurons turn on and get updates; it makes it less likely that a neuron just never becomes activated throughout training. I think this is actually a slightly controversial point: some people claim it helps, some people say it doesn't help at all. So, just something to think about. Any questions at this point? OK, we're going to go into some other ones.

So let's look at how people have tried to fix the ReLU. One issue with ReLUs is that these dead neurons are not ideal, so here's one proposal, called the leaky ReLU. The idea of the leaky ReLU is basically: we want the kink, we want the piecewise linearity, and we want the efficiency, but the issue is that in the negative region your gradients die; so instead, let's give the function a small positive slope in that region. You end up with this function, and that's called a leaky ReLU. Some people have shown that this works slightly better, since you don't have the issue of neurons dying, but I don't think it's completely established that it always works better. And then some people have played with this even more: right now this slope is 0.01, but it can actually be an arbitrary parameter, and then you get something called the parametric rectifier, or PReLU. The idea here is to make this 0.01 a parameter of your network that can be learned: you can backpropagate into it, and so these neurons can basically choose what slope to have in the negative region. They can become a ReLU if they want to, or they can become leaky; roughly, every neuron has the choice. These are the kinds of things people play with when they try to design a good activation function. [Question from the audience] The alpha is learned in just the very normal way in your computational graph; every neuron has its own alpha, just like it has its own bias.
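Minimal sketches of both variants; the 0.01 default matches the slope mentioned above, and in a PReLU the `alpha` would be a learned parameter rather than a constant:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # small positive slope in the negative region, so some gradient
    # always flows and the unit cannot die completely
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):
    # Parametric ReLU: same form, but alpha is a learnable parameter
    # (e.g. one per neuron), trained by backpropagation like a weight
    return np.where(x > 0, x, alpha * x)
```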
Go ahead. [Question: what if alpha becomes one?] If alpha ends up being one, then you get the identity, and that's probably not something backpropagation will want, in the sense that an identity wouldn't be computationally useful there; so you might expect that backpropagation should not actually take you to those regions of the parameter space. If I remember correctly, there's nothing specific in the paper where people worried about that too much, but I could be wrong; I read the paper a while ago, and I don't use these too much in my own work.

So those are different schemes for fixing the dead ReLU neurons. There's another paper that only came out roughly two months ago, which gives you a sense of how new this field is: there are papers coming out just months ago proposing new activation functions. One of them is the exponential linear unit, the ELU, and I mention it just to give you an idea of what people play with: it tries to keep all the benefits of the ReLU but to get rid of the downside of being non-zero-centered. What they end up with is this blue function here: it looks like a ReLU, but in the negative region it doesn't just go to zero, and it doesn't just go down as a leak; it has this funny shape, and there are two pages of math in the paper partly justifying why you want that. Roughly, when you do this you end up with zero-mean outputs, and they claim this trains better. I think there's some controversy about this; we're basically all still trying to figure it out. It's an active area of research, and we're not sure what to do yet, but ReLUs right now are a safe recommendation, if you're careful with them.
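A sketch of the ELU's commonly used form; the default alpha = 1 here is an assumption for illustration, not something stated in the lecture:

```python
import numpy as np

def elu(x, alpha=1.0):
    # identity for x > 0; for x < 0 it decays smoothly toward -alpha
    # instead of being exactly zero, giving closer-to-zero-mean outputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))
```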
So those are ELUs. One more I'd like to mention, because it's relatively common and you'll see it if you read about this area, is the maxout neuron from Goodfellow et al. It's basically a very different neuron: it's not just an activation function that looks different, it actually changes what the neuron computes. It doesn't just have the form f(w·x + b); it has two sets of weights, and it computes the max of w1·x + b1 and w2·x + b2, so you end up with two hyperplanes that you take a max over, and that's what the neuron computes. You can see there are many ways of playing with these activation functions. The maxout doesn't have some of the downsides, like neurons dying, and it's still piecewise linear and still efficient; but now every single neuron has two sets of weights, so you kind of double the number of parameters per neuron, and maybe that's not as ideal. Some people use this, but I think it's not super common; I would say ReLUs are still the most common.
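A minimal sketch of what one maxout unit computes, under the two-weight-set form described above:

```python
import numpy as np

def maxout(x, W1, b1, W2, b2):
    # Each output computes the max of two separate linear functions,
    # so it carries two sets of weights (doubling the parameter count).
    # Shapes: x is (D,), W1 and W2 are (H, D), b1 and b2 are (H,).
    return np.maximum(W1.dot(x) + b1, W2.dot(x) + b2)
```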
there's + +468 +00:33:42,990 --> 00:33:46,640 +some pros and cons and many of them come +down to thinking about how the gradient + +469 +00:33:46,640 --> 00:33:50,690 +flows through your network and discuss +these issues like dead relatives and yet + +470 +00:33:50,690 --> 00:33:54,808 +to really know about the gradient flow +if you try to debug your networks + +471 +00:33:54,808 --> 00:33:59,428 +and a to understand what's going on +let's look at a price + +472 +00:33:59,429 --> 00:34:03,710 +processing very briefly so + +473 +00:34:03,710 --> 00:34:07,440 +processing just very briefly normally +suppose you just have a cloud of + +474 +00:34:07,440 --> 00:34:11,829 +original data and two dimensions here +very common 20 Center your data so that + +475 +00:34:11,829 --> 00:34:15,230 +just means that along every single +picture was to track the mean people + +476 +00:34:15,230 --> 00:34:18,889 +sometimes also when you go through +machine learning literature try to + +477 +00:34:18,889 --> 00:34:22,720 +normalize the data so in every single +dimension you normalize say by standard + +478 +00:34:22,719 --> 00:34:23,759 +deviation + +479 +00:34:23,760 --> 00:34:28,990 +standardizing are you can make sure that +the min and max are within and so on + +480 +00:34:28,989 --> 00:34:33,098 +there are several schemes for doing so +in images it's not as common because you + +481 +00:34:33,099 --> 00:34:35,760 +don't have to separate different +features that can be a different units + +482 +00:34:35,760 --> 00:34:39,619 +everything is just pixels and their own +boundary between 0 and 255 it's not as + +483 +00:34:39,619 --> 00:34:43,970 +common to normalize the data but it's +very common 20 Center your data you can + +484 +00:34:43,969 --> 00:34:44,719 +go further + +485 +00:34:44,719 --> 00:34:48,730 +normally in machine learning you can go +ahead and your data has some covariance + +486 +00:34:48,730 --> 00:34:52,079 +structure by default you can go ahead +and make that communist Russia be + +487 +00:34:52,079 --> 00:34:55,740 +diagonal say for example by applying PCA +or you can go even further and you can + +488 +00:34:55,739 --> 00:35:00,309 +wipe your data and what that means is +you kind of even squish after primed PCR + +489 +00:35:00,309 --> 00:35:05,159 +you also squish your data so that your +various metrics becomes just a diagonal + +490 +00:35:05,159 --> 00:35:08,699 +and so that's another form of +preprocessing I see people talk about it + +491 +00:35:08,699 --> 00:35:14,480 +and these are both I go much more detail +in the class notes on BC I don't want to + +492 +00:35:14,480 --> 00:35:17,500 +go into too many details on that because +it turns out that in images we don't + +493 +00:35:17,500 --> 00:35:20,960 +actually end up using these even the +order coming in machine learning + +494 +00:35:20,960 --> 00:35:25,659 +images specifically what's common is +just a means centering and then a + +495 +00:35:25,659 --> 00:35:28,519 +particular variant of me centering that +is slightly more convenient to practice + +496 +00:35:28,519 --> 00:35:34,780 +so I mean centering we say 330 to buy +three images subsea far if you want to + +497 +00:35:34,780 --> 00:35:38,869 +center your data that for every single +pixel you compete W overtraining such a + +498 +00:35:38,869 --> 00:35:43,318 +track that out so what you end up with +is this mean image that has basically + +499 +00:35:43,318 --> 00:35:47,219 +the mission of 32 by 32 by three so I +think that mean image for example for + +500 +00:35:47,219 
For images specifically, what's common is just mean centering, and then a particular variant of mean centering that's slightly more convenient in practice. For mean centering, say with the 32x32x3 images of CIFAR-10: to center your data, for every single pixel you compute its mean over the training set and subtract that out, so what you end up with is a mean image with dimensions 32x32x3. For this image data the mean image is basically just an orange blob, and you subtract it from every single image to center your data and get better training dynamics. The other form, which is slightly more convenient, is subtracting just a per-channel mean: you go into the red, green, and blue channels and compute the mean across all of space, so you end up with basically three numbers, the means of the red, green, and blue channels, and you subtract those out. Some networks use that instead. Those are the two common schemes; the per-channel one is more convenient because you only have to worry about those three numbers, rather than a giant array of means that you have to ship around everywhere when you're actually deploying this.
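Both schemes in code, on stand-in data:

```python
import numpy as np

# Suppose X_train holds CIFAR-10-style images, shape (N, 32, 32, 3).
X_train = np.random.rand(5000, 32, 32, 3)    # stand-in data

# Scheme 1: subtract the mean image (a full 32x32x3 array)
mean_image = X_train.mean(axis=0)            # shape (32, 32, 3)
X_centered = X_train - mean_image

# Scheme 2: subtract a per-channel mean -- just three numbers,
# much more convenient to ship around at deployment time
channel_mean = X_train.mean(axis=(0, 1, 2))  # shape (3,)
X_centered = X_train - channel_mean
```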
There's not too much more I want to say about this: basically, subtract the mean; in computer vision applications things don't get much more complex than that. In particular, PCA and whitening used to be slightly common, but you can't apply them to full images, because images are very high-dimensional objects with lots of pixels, so these covariance matrices would be huge. People tried things like whitening only locally, sliding a whitening filter through the image spatially; that was done several years ago, but it's not as common now and doesn't seem to matter too much.

OK: weight initialization. A very, very important topic; one of the reasons I think early neural networks didn't quite work as well is that people were not careful enough with this. One of the first things to look at is how not to do initialization. In particular, you might be tempted to just say: OK, let's start off with all the weights equal to zero. Suppose you do that in your network, say a 10-layer neural network, and you set all the weights to 0; why doesn't that work, why isn't that a good idea? Go ahead. [Student answer] Right: basically all your neurons compute the same thing, and in backprop they will all behave the same way, so there's no, what do we call it, symmetry breaking. All the neurons compute the same outputs, they'll compute the same gradients, and so on. So instead, the next best thing people do is use small random numbers. One relatively common way to do that is to sample from a Gaussian with zero mean and 0.01 standard deviation, small random numbers, and that's how you would initialize your W matrices. The issue with this initialization is that it works OK for small networks, but as you go deeper and deeper you have to be much more careful about the initialization, and I'd like to go into exactly what breaks, how it breaks, and why it breaks when you try these naive initialization strategies with deep networks.

So let's look at what goes wrong. What I've written here is a small script; let me step through it briefly. I'm sampling a dataset of 1,000 points that are 500-dimensional, and then I'm creating a whole bunch of hidden layers with nonlinearities: say right now we have 10 layers of 500 units each, and we're using tanh. Then I'm taking unit Gaussian data and forwarding it through the network with this particular initialization strategy, which right now is the one from the previous slide: sample from a Gaussian scaled by 0.01. So I'm forward-propagating through this network, which is made up of a series of layers of the same size, 10 layers of 500 units, and what I want to look at is what happens to the statistics of the hidden neurons' activations throughout the network with this initialization. Specifically, we'll look at the mean and standard deviation at every layer, plot those, and also plot the histograms: we take all this data, and at, say, the fifth or sixth or seventh layer, we look at what values the activations took on and make histograms of them.
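Here is a minimal reconstruction of that experiment; the exact script isn't reproduced in the transcript, so the details below are an approximation of what's described:

```python
import numpy as np

# Forward unit-Gaussian data through a 10-layer, 500-unit tanh net
# initialized with small random weights, recording layer statistics.
D = np.random.randn(1000, 500)                    # 1000 points, 500-dim
hidden_layer_sizes = [500] * 10

Hs = {}
H = D
for i, fan_out in enumerate(hidden_layer_sizes):
    fan_in = H.shape[1]
    W = np.random.randn(fan_in, fan_out) * 0.01   # the naive scaling
    H = np.tanh(H.dot(W))                         # forward one layer
    Hs[i] = H

print([H.mean() for H in Hs.values()])  # means stay near zero (tanh is symmetric)
print([H.std() for H in Hs.values()])   # stds collapse toward zero, layer by layer
```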
If you run this experiment, it ends up looking as follows. We start off with a mean of zero and a standard deviation of one; that's our data. Now I forward-propagate, and as I go toward the 10th layer the mean, since we're using tanh, which is symmetric, stays around zero, as you might expect. But look at what happens to the standard deviation: it starts off at 1, drops to something like 0.2, then 0.04, and it plummets down to zero; the standard deviation of these activations just collapses. Looking at the histograms at every single layer: at the first layer the histogram is reasonable, we have a spread of numbers between -1 and 1, and then it collapses to a tight distribution at exactly zero. So what ends up happening with this initialization, applied to a deep network, is that all the tanh neurons end up just outputting zero: at the last layers these are tiny numbers near zero, so all the activations basically become zero.

And why is this an issue? Think about what happens to the dynamics of the backward pass, to the gradients, when you have tiny numbers in the activations: your x's are tiny numbers at the last few layers. What do the gradients on the weights in these layers look like, and what happens in the backward pass? First of all, suppose there's a layer here that looks at some layer before it, and almost all its inputs are tiny numbers: the x's are tiny. What do you expect the gradients on W to be for those layers? [Student answer] Very small. And why would they be very small? Because the gradient on W is equal to x times the gradient from the top,
so if the x's are tiny numbers, then your gradients on W are tiny numbers as well, and those weights will accumulate almost no gradient. We can also look at what happens through the W matrices themselves. Again, we took data that was distributed as a unit Gaussian at the beginning, we kept multiplying it by W and applying the activation function, and we saw that everything goes to zero; it just collapses over time. Now think about the backward pass: as we chain the gradient through these layers in backpropagation, at each layer some of the gradient forks off into the gradient on W, and then, to keep backpropagating, we compute the gradient on x, which amounts to multiplying by W again and again at every single layer, just as in the forward pass. If you take unit Gaussian data and keep multiplying by W at this scale, everything goes to zero, and the same thing happens in the backward pass: we successively multiply by W as we backpropagate to x at every layer, so this gradient, which started off with reasonable numbers from your loss function, ends up going toward zero, and you end up with gradients that are basically just tiny, tiny numbers. So you get very, very low gradients throughout the network, and this is what we refer to as the vanishing gradient problem: as the gradient travels through, with this particular initialization, the magnitude of the gradient just decays, because we used this 0.01 scaling.

So we can try the other extreme: instead of scaling by 0.01, we can try a different scale for the W matrix at initialization, say 1.0. Now we see another funny thing happen, because we've overshot the other way. It's probably best to look at the distributions here: everything is completely saturated; these tanh units are either all -1 or all 1. I mean, the distributions show that everything is super-saturated: neurons throughout the entire network are at either -1 or 1, because the weights are too large, so the scores that go into the nonlinearity are very large in magnitude. Everything is super-saturated, and the gradients flowing through your network are just terrible; it's a complete disaster, just zeros everywhere, exponentially vanishing, and you die. You can train for a very long time, and what you'll see when this happens is that your loss just doesn't move at all, because nothing is backpropagating: all the neurons are saturated and nothing is being updated. So this scale, as you might expect, is actually super tricky to set; in this particular case it needs to be somewhere between 0.01 and 1.0.
So you can be slightly more principled instead of just trying different values, and there are papers written on this. For example, in 2010 there was a proposal for what we now call Xavier initialization, from Glorot et al. They went through and looked at the expression for the variance of your neurons, and you can work this out and propose a specific initialization strategy from it, so that I don't have to try 0.01, or 1, or whatever else. They recommend this kind of initialization, where we divide by the square root of the number of inputs to every single neuron. If you have lots of inputs, you end up with smaller weights, and intuitively that makes sense, because more things go into your weighted sum, so you want less of an interaction with each of them; and if a smaller number of units feeds into your layer, you want larger weights, because there are only a few of them and you still want to end up with a variance of one. To back up a bit: the idea here is that they were looking at a single neuron with no activation function, just a linear neuron, and all they're saying is that if you're getting unit Gaussian data as input, and you'd like this linear neuron to have an output variance of one, then you should initialize your weights with this scale. In the notes I go into exactly how this is derived. So this is a reasonable initialization, and we can use it instead; you can see that if I use it here, the distributions end up being more sensible. Again looking at the histograms between -1 and 1 of these tanh units, you get more sensible spreads, and you're actually within the active region of all these tanh units; so you can expect this to be a much better initialization, because things are in the active regions and things will train from the start; nothing is super-saturated at the beginning. The reason this doesn't end up perfectly clean, the reason we still see the distributions contract further down, is that this paper doesn't take into account the nonlinearities, in this case the tanh: the tanh nonlinearity ends up deforming the statistics of the variance as you go through, so the standard deviation still goes down, but not nearly as dramatically as if you had set the scale by just trial and error.
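In code, the Xavier recommendation next to the two naive scalings tried earlier:

```python
import numpy as np

fan_in, fan_out = 500, 500

# naive guesses that required trial and error:
W = np.random.randn(fan_in, fan_out) * 0.01   # activations vanish
W = np.random.randn(fan_in, fan_out) * 1.0    # activations saturate

# Xavier initialization (Glorot et al., 2010): scale by the number of
# inputs so each neuron's output variance starts out around one
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
```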
+658
+00:47:08,179 --> 00:47:11,299
+And so this is like a reasonable
+initialization
+
+659
+00:47:11,300 --> 00:47:15,280
+to use in tanh networks, compared to just
+setting the scale to 0.01, and so people
+
+660
+00:47:15,280 --> 00:47:20,760
+end up using this in practice sometimes.
+So this works in the case of tanh,
+
+661
+00:47:20,760 --> 00:47:24,349
+it does something reasonable; it turns out
+if you try to put it into a rectified
+
+662
+00:47:24,349 --> 00:47:30,019
+linear unit network, it doesn't work as
+well, and the decrease in the distributions will be
+
+663
+00:47:30,019 --> 00:47:34,679
+much more rapid. So looking at a ReLU
+neuron: in the first layer it has some
+
+664
+00:47:34,679 --> 00:47:37,769
+distribution, and then the distribution,
+as you can see, just gets more and more
+
+665
+00:47:37,769 --> 00:47:43,130
+peaky at zero, so fewer and fewer neurons
+are activated with this initialization,
+
+666
+00:47:43,130 --> 00:47:48,440
+so using the Xavier initialization in a
+ReLU network does not do good things,
+
+667
+00:47:48,440 --> 00:47:52,659
+and so again, thinking about this paper:
+they don't actually talk about
+
+668
+00:47:52,659 --> 00:47:57,578
+nonlinearities, and the ReLU neurons
+compute this weighted sum, which is
+
+669
+00:47:57,579 --> 00:48:02,068
+within their derivation here, but then
+there's the ReLU thresholding, and so you
+
+670
+00:48:02,068 --> 00:48:05,858
+kill half of the distribution, you set it
+to zero, and intuitively what that does to
+
+671
+00:48:05,858 --> 00:48:10,380
+the distribution of your outputs is it
+basically halves their variance, and so it
+
+672
+00:48:10,380 --> 00:48:14,849
+turns out, as was proposed in this paper
+just last year in fact, someone said:
+
+673
+00:48:14,849 --> 00:48:19,000
+basically, look, there's a factor of two
+you're not accounting for, because these
+
+674
+00:48:19,000 --> 00:48:22,809
+ReLU neurons effectively
+halve your variance each time,
+
+675
+00:48:22,809 --> 00:48:26,510
+because you take everything, you have your
+Gaussian inputs, you take them through
+
+676
+00:48:26,510 --> 00:48:29,960
+your nonlinearity, and half of your Gaussian
+output gets clamped to zero, so you really do
+
+677
+00:48:29,960 --> 00:48:35,530
+halve the variance, and you need to
+account for it with a factor of two, and
+
+678
+00:48:35,530 --> 00:48:38,859
+when you do that, then you get proper
+distributions, specifically for the ReLU
+
+679
+00:48:38,858 --> 00:48:43,719
+neuron, and so with this initialization,
+when you're using ReLU nets, if you worry about
+
+680
+00:48:43,719 --> 00:48:48,618
+that extra factor of two, then everything
+will come out nicely, and you won't get
+
+681
+00:48:48,619 --> 00:48:52,358
+this factor of two that keeps building
+up and screws up your activations
+
+682
+00:48:52,358 --> 00:48:56,769
+exponentially. So basically this is
+tricky, tricky stuff, and it really
+
+683
+00:48:56,769 --> 00:49:01,159
+matters in practice; in their paper,
+for example, they compare having the
+
+684
+00:49:01,159 --> 00:49:04,519
+factor of two versus not having the factor
+of two, and it matters when you have really deep
+
+685
+00:49:04,519 --> 00:49:08,500
+networks; in this case I think they had a
+few dozen layers, and if you account for the
+
+686
+00:49:08,500 --> 00:49:12,940
+factor of two you converge, and if you
+don't account for it, you get nothing, just
+
+687
+00:49:12,940 --> 00:49:14,950
+zero loss, OK?
+
+688
+00:49:14,949 --> 00:49:19,469
+So very important stuff; you really need
+to think it through and you need to be
+careful with initialization.
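The ReLU correction discussed in these cues amounts to one extra factor of two under the square root. Again a hedged sketch with hypothetical sizes:

```python
import numpy as np

# He et al. initialization: ReLU zeroes out half the distribution, which
# halves the variance, so double the variance of the weights to compensate.
def he_init(fan_in, fan_out):
    return np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)

x = np.random.randn(1000, 500)                 # hypothetical input batch
acts = np.maximum(0, x.dot(he_init(500, 500))) # one ReLU layer
print(acts.std())  # stays healthy across depth where plain Xavier shrinks
```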
+689
+00:49:19,469 --> 00:49:24,789
+If it's set incorrectly, bad things
+happen, and so specifically in
+
+690
+00:49:24,789 --> 00:49:28,108
+the case where you have neural networks
+with ReLU units, there is a correct
+
+691
+00:49:28,108 --> 00:49:36,460
+answer to use, and that's this
+initialization from Kaiming He. So this is
+
+692
+00:49:36,460 --> 00:49:40,220
+partly, this is partly why things didn't
+work for a long time: I think
+
+693
+00:49:40,219 --> 00:49:46,088
+people didn't fully appreciate just how
+difficult this was to get right,
+
+694
+00:49:46,088 --> 00:49:51,219
+and how tricky. And so I'd just like to point
+out that proper initialization is basically an
+
+695
+00:49:51,219 --> 00:49:54,419
+active area of research; you can see that
+papers are still being published on this,
+
+696
+00:49:54,420 --> 00:49:58,849
+a large number of papers just proposing
+different ways of initializing your
+
+697
+00:49:58,849 --> 00:50:03,019
+networks. These last few are interesting
+as well, because they don't give you a
+
+698
+00:50:03,019 --> 00:50:06,659
+formula for initializing: they have these
+data-driven ways of initializing
+
+699
+00:50:06,659 --> 00:50:10,399
+networks, where you take a batch of data,
+you forward it through your network, which is now
+
+700
+00:50:10,400 --> 00:50:13,530
+an arbitrary network, and you look at the
+variances at every single point in
+
+701
+00:50:13,530 --> 00:50:16,690
+your network, and intuitively you don't
+want your variances to go to zero, you
+
+702
+00:50:16,690 --> 00:50:20,200
+don't want them to explode, you want
+everything to be roughly, say, like a
+
+703
+00:50:20,199 --> 00:50:24,328
+unit Gaussian throughout your network,
+and so they iteratively scale these
+
+704
+00:50:24,329 --> 00:50:28,349
+weights in your network so that you have
+roughly unit-Gaussian activations everywhere, on
+
+705
+00:50:28,349 --> 00:50:33,568
+the order of that, basically. And so there
+are some data-driven techniques and a line
+
+706
+00:50:33,568 --> 00:50:39,139
+of work on how to properly initialize.
+OK, so I'm going to go into a
+
+707
+00:50:39,139 --> 00:50:41,848
+technique that alleviates
+a lot of these problems, but
+
+708
+00:50:41,849 --> 00:50:55,369
+right now I can take some questions.
+[inaudible student question]
+
+709
+00:50:55,369 --> 00:50:59,800
+Only by dividing by the variance,
+possibly, but then you're not
+
+710
+00:50:59,800 --> 00:51:02,710
+doing backpropagation, because if you
+mess with the gradient, then it's not
+
+711
+00:51:02,710 --> 00:51:06,710
+clear what your objective is anymore,
+and so you're not necessarily getting a
+
+712
+00:51:06,710 --> 00:51:11,170
+gradient; so that would be the only concern.
+I'm not sure what would happen if you
+
+713
+00:51:11,170 --> 00:51:13,730
+tried to normalize the gradient. I think
+the method I'm going to propose in
+
+714
+00:51:13,730 --> 00:51:19,960
+a bit is actually doing something to the
+effect of that, but in a clean way.
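The data-driven initialization idea mentioned a few cues back could look roughly like this. It is an editor's illustration of the general recipe (ignoring nonlinearities for brevity), not any specific paper's algorithm, and every name in it is hypothetical:

```python
import numpy as np

# Forward a batch layer by layer and rescale each W until that layer's
# output has roughly unit standard deviation.
def data_driven_init(layer_sizes, batch):
    weights, acts = [], batch
    for fan_out in layer_sizes:
        W = np.random.randn(acts.shape[1], fan_out)
        W /= acts.dot(W).std()   # rescale so this layer's outputs have std ~ 1
        acts = acts.dot(W)
        weights.append(W)
    return weights

Ws = data_driven_init([500, 500, 500], np.random.randn(256, 500))
```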
+715
+00:51:19,960 --> 00:51:23,550
+So now I'm going to go into something
+that actually fixes a lot of these problems
+
+716
+00:51:23,550 --> 00:51:26,630
+in practice; it's called batch
+normalization, and it was only
+
+717
+00:51:26,630 --> 00:51:30,809
+proposed last year, so I couldn't even
+have covered this last year in this class, but
+
+718
+00:51:30,809 --> 00:51:37,119
+now I can, and it actually helps a lot.
+OK, and the basic idea of the batch
+
+719
+00:51:37,119 --> 00:51:42,039
+normalization paper is: OK, you want
+roughly unit-Gaussian activations in every
+single part of your network? Then just do that:
+
+720
+00:51:42,039 --> 00:51:46,369
+just make them unit Gaussian, OK? You
+can do that, because making something
+
+721
+00:51:46,369 --> 00:51:50,720
+unit Gaussian is a completely differentiable
+function, and so it's OK, you can back-
+
+722
+00:51:50,719 --> 00:51:54,980
+propagate through it. And so what they
+do is: you take a mini-batch of your data,
+
+723
+00:51:54,980 --> 00:51:57,480
+and as you're propagating it through your
+network, we're going to be
+
+724
+00:51:57,480 --> 00:52:00,900
+inserting these batch normalization
+layers into your network, and these batch
+
+725
+00:52:00,900 --> 00:52:06,400
+normalization layers take your input X,
+and they make sure that for every
+
+726
+00:52:06,400 --> 00:52:10,420
+single feature dimension, across the
+batch, you have unit-Gaussian activations.
+
+727
+00:52:10,420 --> 00:52:15,909
+So say you had a batch of a hundred
+examples going through the network; maybe this is
+
+728
+00:52:15,909 --> 00:52:19,779
+a good example here: these are your
+activations, so you have N things in your mini-
+
+729
+00:52:19,780 --> 00:52:25,530
+batch, and you have D features, or D
+activations of neurons, at some point, at some
+
+730
+00:52:25,530 --> 00:52:28,869
+part of the network, and this is the input
+to your batch norm layer,
+
+731
+00:52:28,869 --> 00:52:32,550
+so this is an N by D matrix of
+activations, and batch normalization
+
+732
+00:52:32,550 --> 00:52:39,390
+effectively evaluates the empirical mean
+and variance along every single feature,
+
+733
+00:52:39,389 --> 00:52:44,989
+and it just normalizes by them: so whatever
+your X was, it just makes sure that every
+
+734
+00:52:44,989 --> 00:52:49,088
+single column here is a unit
+Gaussian, and so that's a perfectly
+
+735
+00:52:49,088 --> 00:52:54,219
+differentiable function, and it just applies
+it at every single feature, or activation,
+
+736
+00:52:54,219 --> 00:53:02,818
+independently across the batch. So you can
+do that, and it turns out to be a very good
+
+737
+00:53:02,818 --> 00:53:08,548
+idea. Now, one problem with this:
+so the way this will work is, we
+
+738
+00:53:08,548 --> 00:53:11,670
+normally have fully connected layers,
+each followed by a nonlinearity,
+
+739
+00:53:11,670 --> 00:53:15,900
+in a neural network like this; now we're
+going to be inserting these batch normalization
+
+740
+00:53:15,900 --> 00:53:19,670
+layers right after fully connected layers,
+or equivalently after convolutional layers,
+
+741
+00:53:19,670 --> 00:53:24,490
+as we'll see with convolutional networks,
+and basically we can insert them there,
+
+742
+00:53:24,489 --> 00:53:28,159
+and they make sure that everything is
+Gaussian at every single step of the
+
+743
+00:53:28,159 --> 00:53:30,190
+network, because we just make it so.
+
+744
+00:53:30,190 --> 00:53:36,500
+And one problem I think of with this
+is that it seems like an unnecessary
+
+745
+00:53:36,500 --> 00:53:41,088
+constraint: so when you put batch norm here,
+after the layer, the outputs will definitely
+
+746
+00:53:41,088 --> 00:53:45,389
+be Gaussian, because you normalized them,
+but it's not clear that tanh actually
+
+747
+00:53:45,389 --> 00:53:50,288
+wants to receive unit-Gaussian inputs: if
+you think about the form of tanh, it
+
+748
+00:53:50,289 --> 00:53:54,450
+has a specific scale to it, and it's not
+clear that the network wants
+
+749
+00:53:54,449 --> 00:53:59,730
+this hard constraint of making sure
+that the outputs are exactly unit Gaussian
+
+750
+00:53:59,730 --> 00:54:06,009
+before the tanh, because you'd like the
+network to pick, if it wants, for its tanh
+
+751
+00:54:06,009 --> 00:54:10,429
+outputs to be more or less diffuse,
+or more or less saturated.
+752
+00:54:10,429 --> 00:54:14,268
+Right now it wouldn't be able to do that,
+so there's a small patch on top of it; this is the
+
+753
+00:54:14,268 --> 00:54:19,429
+second part of batch normalization: not only
+do you normalize X, but after normalization you
+
+754
+00:54:19,429 --> 00:54:25,068
+allow the network to scale by a gamma and
+shift by a beta, for every single feature, and so
+
+755
+00:54:25,068 --> 00:54:28,358
+this allows the network to adjust; these are
+parameters, so gamma and beta here are
+
+756
+00:54:28,358 --> 00:54:33,869
+parameters that we're going to
+backpropagate into, and they just allow the
+
+757
+00:54:33,869 --> 00:54:38,690
+network to shift after you've normalized
+to unit Gaussian: they allow this bump to shift
+
+758
+00:54:38,690 --> 00:54:44,108
+and scale if the network wants to, and so
+we initialize these presumably with 1 and 0, or
+
+759
+00:54:44,108 --> 00:54:48,250
+something like that, and then the network
+can choose to adjust them, and by
+
+760
+00:54:48,250 --> 00:54:51,239
+adjusting these, you can imagine that
+once we feed into the tanh,
+
+761
+00:54:51,239 --> 00:54:54,719
+the network can choose, through the
+backprop signal, to make it more or
+
+762
+00:54:54,719 --> 00:54:58,618
+less peaky or saturated in whatever way
+it wants, but you're not going to get into
+
+763
+00:54:58,619 --> 00:55:01,910
+this trouble where things just
+completely die or explode in the
+
+764
+00:55:01,909 --> 00:55:06,359
+beginning of optimization, and so things
+will train right away, and then back-
+
+765
+00:55:06,360 --> 00:55:10,579
+propagation can take over and can fine-
+tune it over time. And one more
+
+766
+00:55:10,579 --> 00:55:16,170
+important feature is that if you set
+these gamma and beta, if you train them
+
+767
+00:55:16,170 --> 00:55:20,230
+by backpropagation, so that they end up
+taking on the empirical variance and the
+
+768
+00:55:20,230 --> 00:55:24,829
+mean, then you can see that basically the
+network has the capacity to undo the
+
+769
+00:55:24,829 --> 00:55:30,519
+batch normalization: this part can learn
+to undo that part, and so that's why batch
+
+770
+00:55:30,519 --> 00:55:34,059
+normalization can act as an identity
+function, or can learn to be an
+
+771
+00:55:34,059 --> 00:55:37,599
+identity, whereas before it couldn't; and
+so when you have these batch norm
+
+772
+00:55:37,599 --> 00:55:42,460
+layers in there, the network can, through
+backpropagation, learn to take it out, or
+
+773
+00:55:42,460 --> 00:55:45,110
+it can learn to take advantage of it if
+it finds it helpful;
+
+774
+00:55:45,110 --> 00:55:51,010
+through backprop this will kind of work
+out, so that's just a nice point to
+
+775
+00:55:51,010 --> 00:55:58,470
+have. And so basically there are several
+nice properties; so this is the right way to
+
+776
+00:55:58,469 --> 00:56:03,639
+implement it, as I described, and the
+properties are that it improves the gradient flow
+
+777
+00:56:03,639 --> 00:56:09,049
+through the network; it allows for higher
+learning rates, so your network can learn
+
+778
+00:56:09,050 --> 00:56:13,080
+faster; it reduces, and this is an important
+one, it reduces the strong dependence on
+
+779
+00:56:13,079 --> 00:56:16,269
+initialization: as you sweep through
+different choices of your initialization
+
+780
+00:56:16,269 --> 00:56:19,659
+scale, you'll see that with and without
+batch norm there's a huge difference;
+
+781
+00:56:19,659 --> 00:56:23,469
+with batch norm, things will work across a
+much larger range of initial scales.
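Putting the pieces of these captions together, a minimal train-time batch normalization forward pass might look like this. The variable names and the eps constant are our assumptions, not the paper's reference code:

```python
import numpy as np

# Normalize an N x D batch per feature, then apply the learnable scale/shift.
def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                    # empirical mean, per feature
    var = x.var(axis=0)                    # empirical variance, per feature
    x_hat = (x - mu) / np.sqrt(var + eps)  # unit-Gaussian-ish activations
    return gamma * x_hat + beta            # gamma/beta can undo this if useful

x = np.random.randn(100, 50) * 3 + 2       # hypothetical pre-activations
out = batchnorm_forward(x, np.ones(50), np.zeros(50))
print(out.mean(axis=0)[:3], out.std(axis=0)[:3])   # ~0 and ~1 per feature
```

Initializing gamma to ones and beta to zeros starts the layer as a pure normalizer; learning gamma toward the data's std and beta toward its mean recovers the identity, as the captions note.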
+782
+00:56:23,469 --> 00:56:27,539
+So you don't have to worry about it
+as much; it
+
+783
+00:56:27,539 --> 00:56:34,139
+really helps out with this point. And
+one more subtle thing to point out here
+
+784
+00:56:34,139 --> 00:56:39,299
+is that it kind of acts as a funny form
+of regularization, and it reduces the need for
+
+785
+00:56:39,300 --> 00:56:43,900
+dropout, which we'll go into a bit
+later in class. The way it acts as a
+
+786
+00:56:43,900 --> 00:56:51,559
+funny regularization is: when you have
+some kind of an input X and it goes through
+
+787
+00:56:51,559 --> 00:56:55,849
+the network, then its representation at
+some layer in the network is basically
+
+788
+00:56:55,849 --> 00:56:59,858
+not only a function of it, but it's also
+a function of whatever other examples
+
+789
+00:56:59,858 --> 00:57:02,049
+happen to be in the batch; so
+
+790
+00:57:02,050 --> 00:57:05,570
+whereas whatever other examples are with
+you in that batch would be processed completely
+
+791
+00:57:05,570 --> 00:57:09,840
+independently, in parallel, batch norm
+actually ties them together, and so your
+
+792
+00:57:09,840 --> 00:57:12,880
+representation at, say, the fifth layer
+of the network is actually a function
+
+793
+00:57:12,880 --> 00:57:16,539
+of whatever batch you happen to be
+sampled in, and what that does is it jitters your
+
+794
+00:57:16,539 --> 00:57:19,809
+place in the representation space at
+that layer, and this actually has a nice
+
+795
+00:57:19,809 --> 00:57:26,139
+regularizing effect: this stochastic
+jitter from whichever batch you
+
+796
+00:57:26,139 --> 00:57:31,609
+happen to be in has this effect, and so
+batch normalization actually seems to
+
+797
+00:57:31,610 --> 00:57:33,920
+help out a bit with regularization too.
+
+798
+00:57:33,920 --> 00:57:38,950
+OK, and at test time the batch norm
+layer, by the way, functions a bit differently:
+
+799
+00:57:38,949 --> 00:57:42,699
+at test time you want this to be a
+deterministic function, so just a
+
+800
+00:57:42,699 --> 00:57:46,500
+quick point that at test time you use
+the batch norm layer differently; in
+
+801
+00:57:46,500 --> 00:57:52,019
+particular, you have this mu and sigma
+that you keep normalizing by, so at test
+
+802
+00:57:52,019 --> 00:57:55,519
+time, just remember your mu and sigma
+across the dataset: you can either
+
+803
+00:57:55,519 --> 00:57:59,250
+compute them, like what is the mean and
+sigma at every single point in the
+
+804
+00:57:59,250 --> 00:58:02,309
+network, once over your entire
+training set, or you can
+
+805
+00:58:02,309 --> 00:58:05,759
+just keep a running sum of mu and sigma
+while you're training, and then
+
+806
+00:58:05,760 --> 00:58:08,800
+make sure to remember them in the batch
+norm layer, because at test time you don't
+
+807
+00:58:08,800 --> 00:58:12,460
+want to actually estimate the empirical
+mean and variance across your batch: you
+
+808
+00:58:12,460 --> 00:58:17,000
+want to just use those remembered values
+directly, so that you get a deterministic
+
+809
+00:58:17,000 --> 00:58:26,179
+forward pass at test time. So that's just
+a small detail. And so, are there any questions
+
+810
+00:58:26,179 --> 00:58:29,049
+about batch normalization? So this is a
+good thing:
+
+811
+00:58:29,050 --> 00:58:35,559
+use it, and you'll implement it actually
+in your assignment.
+
+812
+00:58:35,559 --> 00:58:41,039
+Thank you. So the question is, does it
+slow things down at all? It does: there is a
+
+813
+00:58:41,039 --> 00:58:44,219
+runtime penalty that you have to pay
+for it, unfortunately.
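The test-time behavior just described can be sketched with running averages. The momentum constant and all names here are illustrative assumptions, not the paper's reference implementation:

```python
import numpy as np

# Keep running estimates of mu/var during training; at test time use them
# as fixed constants so the forward pass is deterministic.
running_mu, running_var = np.zeros(50), np.ones(50)
momentum = 0.9   # a typical but illustrative decay value

def bn_train_step(x):
    global running_mu, running_var
    mu, var = x.mean(axis=0), x.var(axis=0)
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return (x - mu) / np.sqrt(var + 1e-5)      # batch statistics at train time

def bn_test(x):
    # no batch statistics are computed here, only the remembered ones
    return (x - running_mu) / np.sqrt(running_var + 1e-5)
```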
+814
+00:58:44,219 --> 00:58:49,088
+I don't know exactly how expensive
+it is; I heard someone say
+
+815
+00:58:49,088 --> 00:58:54,318
+even like 30 percent, and so I don't know
+actually, I haven't fully checked this,
+
+816
+00:58:54,318 --> 00:58:58,548
+but basically there is a penalty, because
+you have to do this; normally it's
+
+817
+00:58:58,548 --> 00:59:02,458
+very common to put it after every
+convolutional layer, and if we have 150 conv-
+
+818
+00:59:02,458 --> 00:59:16,719
+like layers, you end up having all this
+stuff build up. [inaudible student question
+about the price we pay] I suppose so, yes. So when
+
+819
+00:59:16,719 --> 00:59:20,249
+can you tell that you maybe need batch
+norm? I think I'll come back to that in a
+
+820
+00:59:20,248 --> 00:59:24,228
+few slides; we'll see how you can
+detect that your network is not healthy,
+
+821
+00:59:24,228 --> 00:59:30,318
+and then maybe you want to try batch norm.
+OK, so the learning process: I have 20
+
+822
+00:59:30,318 --> 00:59:36,489
+minutes, and I think I can do this, it's
+like 7:00, so I think we're fine. So we've preprocessed
+
+823
+00:59:36,489 --> 00:59:41,420
+our data, and we've decided, let's
+decide on a setup: for the purposes of
+
+824
+00:59:41,420 --> 00:59:44,719
+these experiments I'm going to work with
+CIFAR-10, and I'm going to use a
+
+825
+00:59:44,719 --> 00:59:48,688
+two-layer neural network with, say, 50
+hidden units, and I'd like to give you an idea
+
+826
+00:59:48,688 --> 00:59:51,538
+of how this looks in practice:
+when you're training neural networks,
+
+827
+00:59:51,539 --> 00:59:52,699
+how do you play with it,
+
+828
+00:59:52,699 --> 00:59:56,849
+how do you actually cross-validate
+the hyperparameters, what does this
+
+829
+00:59:56,849 --> 00:59:59,380
+process of playing with the data and
+getting things to work look like in
+
+830
+00:59:59,380 --> 01:00:03,019
+practice? And so I decided to try out a
+small neural network;
+
+831
+01:00:03,018 --> 01:00:08,248
+I preprocess my data, and the first
+kinds of things that I would look at, if
+
+832
+01:00:08,248 --> 01:00:11,728
+I want to make sure that my prediction
+is correct and things are working:
+
+833
+01:00:11,728 --> 01:00:16,028
+first of all, I'm going to be
+initializing here a two-layer neural
+
+834
+01:00:16,028 --> 01:00:19,679
+network, so weights and biases; the
+initialization here is just a naive
+
+835
+01:00:19,679 --> 01:00:23,969
+initialization, because this is just
+a very small network, so I can afford to
+
+836
+01:00:23,969 --> 01:00:28,259
+maybe just do a naive sample from
+a Gaussian, and then this is a function
+
+837
+01:00:28,259 --> 01:00:31,329
+that's basically going to train a neural
+network; I'm not showing you the
+
+838
+01:00:31,329 --> 01:00:35,949
+implementation, obviously, but the one
+thing that matters is it returns your loss,
+
+839
+01:00:35,949 --> 01:00:39,170
+and it returns your gradients on your
+model parameters. And so the first
+
+840
+01:00:39,170 --> 01:00:42,869
+thing I try, for example, is I disable
+the regularization, that's passed in at the
+
+841
+01:00:42,869 --> 01:00:45,818
+end, and I make sure that my loss comes
+out
+
+842
+01:00:45,818 --> 01:00:49,358
+right. So I mentioned this in previous
+lectures: say I have 10 classes, and
+
+843
+01:00:49,358 --> 01:00:53,318
+suppose I'm using a softmax classifier,
+then I know that I'm expecting a loss of
+
+844
+01:00:53,318 --> 01:00:59,099
+negative log of one over 10, because
+that's the expression for the loss.
+845
+01:00:59,099 --> 01:01:03,180
+And that turns out to be 2.3, and so I
+run this and I get a loss of 2.3, so I
+
+846
+01:01:03,179 --> 01:01:05,708
+know that basically the neural network
+is currently giving me a diffuse
+
+847
+01:01:05,708 --> 01:01:09,728
+distribution over the classes, because it
+doesn't know anything, it's just been initialized,
+
+848
+01:01:09,728 --> 01:01:12,778
+so that checks out. The next thing I might
+check is that, for example, I crank up
+
+849
+01:01:12,778 --> 01:01:17,318
+the regularization, and of course I expect
+my loss to go up, right, because now we
+
+850
+01:01:17,318 --> 01:01:20,380
+have this additional term in the
+objective, and so that checks out, so
+
+851
+01:01:20,380 --> 01:01:20,940
+that's nice.
+
+852
+01:01:20,940 --> 01:01:25,409
+Then the next thing I would usually
+try, and it's a very good sanity check
+
+853
+01:01:25,409 --> 01:01:28,478
+when you're working on neural networks,
+is to take a small piece of your data
+
+854
+01:01:28,478 --> 01:01:32,139
+and make sure you can overfit it;
+you're trying to fit just that small
+
+855
+01:01:32,139 --> 01:01:36,608
+piece. So say I take a sample
+of like 20 training examples and
+
+856
+01:01:36,608 --> 01:01:41,858
+their 20 labels, and I just make sure
+that I train on that small piece, and I just
+
+857
+01:01:41,858 --> 01:01:45,179
+make sure that I can get a loss of
+basically near zero: I can fully overfit
+
+858
+01:01:45,179 --> 01:01:48,379
+that, because if I can't overfit a tiny
+piece of my data, then things are
+
+859
+01:01:48,380 --> 01:01:54,608
+definitely broken. And so here I am
+starting the training, and I'm starting
+
+860
+01:01:54,608 --> 01:01:58,969
+with some random setting of hyperparameters
+here; I'm not going to go into full
+
+861
+01:01:58,969 --> 01:02:04,150
+detail there, but basically I make sure
+that my cost can go down to zero and
+
+862
+01:02:04,150 --> 01:02:08,519
+that I'm getting accuracy 100% on this
+tiny piece of data, and that gives me
+
+863
+01:02:08,518 --> 01:02:12,659
+confidence that probably backprop is
+working, probably the update is working,
+
+864
+01:02:12,659 --> 01:02:16,798
+the learning rate is set somehow
+reasonably, and so I can fit a small
+
+865
+01:02:16,798 --> 01:02:21,190
+dataset, and I'm happy at this point;
+maybe I'm thinking about scaling up to a
+
+866
+01:02:21,190 --> 01:02:28,079
+larger dataset or something.
+[inaudible student question]
+
+867
+01:02:28,079 --> 01:02:33,960
+So, you should be able to overfit it;
+sometimes you can try, like, say, one or
+
+868
+01:02:33,960 --> 01:02:37,409
+two or three examples, you can really
+scale it down, and you should be able to
+
+869
+01:02:37,409 --> 01:02:40,460
+overfit even with smaller networks, and
+that's a very good sanity check, because
+
+870
+01:02:40,460 --> 01:02:45,289
+you can afford to use small networks,
+and if you can't overfit, your
+
+871
+01:02:45,289 --> 01:02:49,039
+implementation is probably incorrect;
+something very funky must be wrong. So you
+
+872
+01:02:49,039 --> 01:02:52,039
+should not be scaling up to your full
+dataset before you can pass this sanity
+
+873
+01:02:52,039 --> 01:03:02,380
+check. So basically the way I try to
+approach this is: take a small piece of
+
+874
+01:03:02,380 --> 01:03:05,990
+data, make sure we can overfit it,
+and then we're scaling up to the
+
+875
+01:03:05,989 --> 01:03:10,049
+bigger dataset, and I'm trying to find
+the learning rate that works.
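The two sanity checks walked through here can be written down compactly. The `train` helper below is a hypothetical stand-in for the lecture's training function, so its calls are left commented:

```python
import numpy as np

num_classes = 10
expected = -np.log(1.0 / num_classes)   # ~2.303 for a diffuse softmax
print(expected)

# 1) With regularization disabled, the initial loss should be ~2.3:
# loss, _ = train(X_train, y_train, reg=0.0, num_iters=1)
# assert abs(loss - expected) < 0.1

# 2) Take ~20 examples and drive the loss to ~0 (100% training accuracy);
#    if you cannot overfit a tiny subset, the implementation is broken:
# X_tiny, y_tiny = X_train[:20], y_train[:20]
# loss, _ = train(X_tiny, y_tiny, reg=0.0, num_iters=200)
```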
+876
+01:03:10,050 --> 01:03:13,289
+And you have to really play with this, right:
+you can't just eyeball the learning rate, you have to
+
+877
+01:03:13,289 --> 01:03:17,219
+find the scale roughly, so I'm trying
+first a small learning rate, like one e negative
+
+878
+01:03:17,219 --> 01:03:22,559
+six, and I see that the loss is barely,
+barely going down, so this
+
+879
+01:03:22,559 --> 01:03:27,509
+learning rate of one e negative six
+is probably too small, right? Nothing is
+
+880
+01:03:27,510 --> 01:03:30,250
+changing. Of course there could be many
+other things wrong, because the loss
+
+881
+01:03:30,250 --> 01:03:34,409
+could be constant for like a million reasons,
+but we passed the small sanity checks, so
+
+882
+01:03:34,409 --> 01:03:38,339
+I'm thinking that this learning rate is
+probably too low and I need to increase it. By
+
+883
+01:03:38,340 --> 01:03:43,130
+the way, this is a fine example here of
+something funky going on that is fun to
+
+884
+01:03:43,130 --> 01:03:48,280
+think about: my loss just barely went
+down, but actually my training accuracy
+
+885
+01:03:48,280 --> 01:03:54,000
+shot up to 20% from the default 10%. How
+does that make any sense? How can it be
+
+886
+01:03:54,000 --> 01:03:58,050
+that my loss just barely changed, but my
+accuracy is so good,
+
+887
+01:03:58,050 --> 01:04:08,130
+well, much much better than 10%? Is that
+even possible?
+
+888
+01:04:08,130 --> 01:04:38,860
+[students offer guesses]
+
+889
+01:04:38,860 --> 01:04:46,120
+OK, maybe not quite; so think about how
+accuracy is computed and how this cost is
+
+890
+01:04:46,119 --> 01:05:04,799
+computed. Right now, what's happening is,
+as you're training, these scores are slightly
+
+891
+01:05:04,800 --> 01:05:08,769
+shifting; your loss is still roughly
+diffuse, so you end up with the same loss, but now
+
+892
+01:05:08,769 --> 01:05:12,619
+your correct answers are a tiny bit more
+probable, and so when we actually
+
+893
+01:05:12,619 --> 01:05:16,210
+compute the accuracy, the argmax class
+actually ends up being the correct one
+
+894
+01:05:16,210 --> 01:05:19,530
+more often. These are some of the fun things
+you run into when you actually train some
+
+895
+01:05:19,530 --> 01:05:24,900
+of this stuff; you do have to think about
+the expressions. OK, so now:
+
+896
+01:05:24,900 --> 01:05:27,619
+I tried a very low learning rate, things are
+barely happening, so I'm going to go to
+
+897
+01:05:27,619 --> 01:05:30,719
+the other extreme and I'm going to try
+out a learning rate of one million; what could
+
+898
+01:05:30,719 --> 01:05:36,199
+possibly go wrong? So what happens in
+that case is you get some weird errors and
+
+899
+01:05:36,199 --> 01:05:40,429
+things explode, you get NaNs; really fun
+stuff happens. So OK, one e six
+
+900
+01:05:40,429 --> 01:05:44,639
+is probably too high, is what I'm
+thinking at this point, so then I try
+
+901
+01:05:44,639 --> 01:05:48,179
+to narrow in on the rough region that
+actually gives me a decrease in my cost;
+
+902
+01:05:48,179 --> 01:05:51,409
+that's what I'm trying to do with
+my binary search here, and so at some
+
+903
+01:05:51,409 --> 01:05:54,739
+point I get some idea about, you know,
+roughly where I should be cross-
+
+904
+01:05:54,739 --> 01:05:55,929
+validating.
+
+905
+01:05:55,929 --> 01:06:00,019
+Then I do a proper optimization: at this
+point I'm trying to find the best hyperparameters
+
+906
+01:06:00,019 --> 01:06:04,030
+for my network. What we like to do in
+practice is go with a coarse-to-fine
+
+907
+01:06:04,030 --> 01:06:07,820
+strategy, so first I just get a rough
+idea, by playing with it, of where the learning
+908
+01:06:07,820 --> 01:06:11,550
+rate should be, then I do a coarse search
+over learning rates, like over a bigger
+
+909
+01:06:11,550 --> 01:06:16,180
+segment, and then I repeat this process: I
+look at what works, and then I narrow in
+
+910
+01:06:16,179 --> 01:06:20,500
+on the regions that work well, OK? When you
+do this, here's a quick tip: in your code, for
+
+911
+01:06:20,500 --> 01:06:23,719
+example, detect explosions and break out
+early; it's like a nice trick in terms of
+
+912
+01:06:23,719 --> 01:06:28,339
+implementation. So what I'm doing
+effectively here is, I have a loop where
+
+913
+01:06:28,340 --> 01:06:31,579
+I sample my hyperparameters, in this
+case the regularization and learning
+
+914
+01:06:31,579 --> 01:06:36,849
+rate; I sample them, I train, I get some
+results here, so these are the accuracies
+
+915
+01:06:36,849 --> 01:06:40,179
+on the validation data, and these are the
+hyperparameters that produced them, and
+
+916
+01:06:40,179 --> 01:06:44,440
+some of the accuracies, as you can see,
+work quite well, so 50%, 40%; some of
+
+917
+01:06:44,440 --> 01:06:47,409
+them don't work well at all. So this
+gives me an idea about what range of
+
+918
+01:06:47,409 --> 01:06:50,659
+learning rates and regularizations is
+working relatively well,
+
+919
+01:06:50,659 --> 01:06:55,079
+and when you do this optimization, you
+can start out first with just a small
+
+920
+01:06:55,079 --> 01:06:58,090
+number of epochs; you don't have to run
+for a very long time, just run for a few
+
+921
+01:06:58,090 --> 01:07:02,680
+minutes and you can already get a sense
+of what's working better than some other
+
+922
+01:07:02,679 --> 01:07:08,259
+things. And also one note: when you're
+optimizing over regularization and learning
+
+923
+01:07:08,260 --> 01:07:12,320
+rate, it's best to sample in log space: you
+don't just want to sample from a uniform
+
+924
+01:07:12,320 --> 01:07:16,510
+distribution, because these learning
+rates and regularizations act
+
+925
+01:07:16,510 --> 01:07:20,180
+multiplicatively on the dynamics of
+your backpropagation, and so that's why
+
+926
+01:07:20,179 --> 01:07:25,319
+you want to do this in log space: you can
+see that I'm sampling from negative 3 to negative 6
+
+927
+01:07:25,320 --> 01:07:28,350
+for the exponent of the learning rate,
+and then I'm raising it to the power of 10,
+
+928
+01:07:28,349 --> 01:07:33,319
+raising 10 to the power of it, and so you
+don't want to just be sampling from a
+
+929
+01:07:33,320 --> 01:07:38,610
+uniform 0.001 to, like, 100, because then
+most of your samples are kind of in a
+
+930
+01:07:38,610 --> 01:07:41,820
+bad region, right, because the learning
+rate is a multiplicative interaction;
+
+931
+01:07:41,820 --> 01:07:50,050
+something to be aware of. Seeing what works
+relatively well, I'm doing a second pass
+
+932
+01:07:50,050 --> 01:07:52,950
+where I'm kind of going in and I'm
+changing these ranges again a bit, and I'm
+
+933
+01:07:52,949 --> 01:07:58,139
+looking at what works, so I find that I
+can now get to 53%; some of these work
+
+934
+01:07:58,139 --> 01:08:02,460
+really well. One thing to be aware of:
+sometimes you get a result like this,
+
+935
+01:08:02,460 --> 01:08:06,920
+53 is working quite well, and this is
+actually worrying: if I see this, I'm
+
+936
+01:08:06,920 --> 01:08:11,440
+actually worried at this point, because
+through this cross-validation
+
+937
+01:08:11,440 --> 01:08:14,490
+here, I have a result here, and there's
+something actually wrong about this
+
+938
+01:08:14,489 --> 01:08:21,880
+result that hints at some issue.
+What's the
+
+939
+01:08:21,880 --> 01:08:31,279
+problem?
+940
+01:08:31,279 --> 01:08:54,109
+[students offer guesses] Right, it's
+actually quite consistent; look: I'm sampling the learning
+
+941
+01:08:54,109 --> 01:08:58,759
+rate between 10 to the negative 3 and 10 to the
+negative 4, and I end up with a very good result that is
+
+942
+01:08:58,760 --> 01:09:00,690
+just at the boundary of what I'm
+
+943
+01:09:00,689 --> 01:09:06,960
+optimizing over: so this is almost 1e-3,
+it's almost 0.001, which is
+
+944
+01:09:06,960 --> 01:09:10,510
+really at the boundary of what I'm
+searching over, so I'm getting a really good result
+
+945
+01:09:10,510 --> 01:09:14,780
+at an edge of what I'm looking at, and
+that's not good, because maybe the range,
+
+946
+01:09:14,779 --> 01:09:18,719
+the way I've defined it, is not actually
+optimal, and so I want to make sure that
+
+947
+01:09:18,720 --> 01:09:21,560
+I spot these things and I adjust my ranges,
+because there might be even better
+
+948
+01:09:21,560 --> 01:09:22,520
+results
+
+949
+01:09:22,520 --> 01:09:26,390
+going slightly this way: so maybe I want
+to change negative 3 to negative 2, or
+
+950
+01:09:26,390 --> 01:09:32,570
+2.5. But for the regularization I see that
+this range is working quite well, so maybe I'm
+
+951
+01:09:32,569 --> 01:09:38,529
+in a slightly better spot there, and I'm only
+worried about the learning rate. One thing I'd like to
+
+952
+01:09:38,529 --> 01:09:42,739
+point out is, you'll see me sample these
+randomly, also 10 to the uniform like
+
+953
+01:09:42,739 --> 01:09:46,639
+this, sampling random regularizations and
+learning rates. Instead of doing this, what you
+
+954
+01:09:46,640 --> 01:09:49,829
+might see sometimes is people do
+what's called a grid search; so really
+
+955
+01:09:49,829 --> 01:09:53,920
+the difference here is, instead of
+sampling randomly, people like to go in
+
+956
+01:09:53,920 --> 01:09:58,789
+steps of fixed amounts in both the
+learning rate and the regularization, and so
+
+957
+01:09:58,789 --> 01:10:02,519
+you end up with this double loop here,
+over some settings of learning rate and some
+
+958
+01:10:02,520 --> 01:10:03,740
+settings of regularization,
+
+959
+01:10:03,739 --> 01:10:07,590
+trying to be exhaustive, and this is
+actually a bad idea: it doesn't actually
+
+960
+01:10:07,590 --> 01:10:12,720
+work as well as if you sample randomly,
+and it's unintuitive, but you actually always
+
+961
+01:10:12,720 --> 01:10:16,280
+want to sample randomly, you don't want
+to go in exact steps, and here's the reason
+
+962
+01:10:16,279 --> 01:10:23,319
+for that; it's kind of subtle to think about,
+but: this is the grid search way, so I've sampled at
+
+963
+01:10:23,319 --> 01:10:31,579
+set intervals, and I can kind of, you
+know, sweep out the space; and here's
+
+964
+01:10:31,579 --> 01:10:35,090
+random sampling, where I just randomly
+sample. The issue is that in
+
+965
+01:10:35,090 --> 01:10:38,930
+optimization and training neural
+networks, what often happens is that
+
+966
+01:10:38,930 --> 01:10:41,800
+one of the parameters can be much
+much more important than the other
+
+967
+01:10:41,800 --> 01:10:43,039
+parameter.
+
+968
+01:10:43,039 --> 01:10:45,989
+So say that this is an important
+parameter: the
+
+969
+01:10:45,989 --> 01:10:49,349
+performance of your loss function is not
+really a function of the y dimension,
+
+970
+01:10:49,350 --> 01:10:52,510
+but it's really a function of the x
+dimension: you get much better results in
+
+971
+01:10:52,510 --> 01:10:58,699
+a specific range along the x-axis. And if
+this is true, which is often the
+
+972
+01:10:58,699 --> 01:11:02,170
+case, then with random sampling you're
+actually going to end up sampling lots of
+
+973
+01:11:02,170 --> 01:11:06,300
+different x values, and you end up finding
+a better spot than here, where you've
+
+974
+01:11:06,300 --> 01:11:09,850
+sampled at exact spots and you're not
+getting any kind of information across
+
+975
+01:11:09,850 --> 01:11:14,910
+the x-axis, if that makes sense. So always
+use random, because in these cases, which are
+
+976
+01:11:14,909 --> 01:11:24,220
+common, random will actually give you
+more bang for the buck.
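The sampling strategy described in this stretch of the lecture, log-space and random rather than gridded, might be sketched as follows; `train_and_eval` is a hypothetical helper, so its use is left commented:

```python
import numpy as np

results = []
for _ in range(100):
    # Sample in log space: learning rate and regularization act
    # multiplicatively, so do NOT use uniform(0.000001, 0.001).
    lr  = 10 ** np.random.uniform(-6, -3)
    reg = 10 ** np.random.uniform(-5, 5)
    # val_acc = train_and_eval(lr=lr, reg=reg, num_epochs=2)  # few epochs first
    # results.append((val_acc, lr, reg))
    results.append((lr, reg))

# If the best result sits at an edge of a sampled range (e.g. lr near 1e-3
# above), widen that range and re-run rather than trusting the boundary.
```

Random sampling, unlike a fixed grid, explores many distinct values along each axis, which is exactly the argument made in the cues above.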
+977
+01:11:24,220 --> 01:11:28,520
+So, the hyperparameters you want to
+play with: the most common ones
+
+978
+01:11:28,520 --> 01:11:32,920
+are probably the learning rate, the
+update type, maybe, which we're going
+
+979
+01:11:32,920 --> 01:11:36,899
+to go into in a bit, the
+regularization, and the dropout
+
+980
+01:11:36,899 --> 01:11:42,979
+amount, which we're also going to go into.
+So this is really, it's so much fun; so in
+practice, the way this looks is: we have,
+
+981
+01:11:42,979 --> 01:11:46,679
+for example, a computer vision cluster,
+we have so many machines, so I can just
+
+982
+01:11:46,680 --> 01:11:49,829
+distribute my training across all these
+machines, and I've written myself, for
+
+983
+01:11:49,829 --> 01:11:53,100
+example, a command-center interface, where
+these are all the loss functions on all
+
+984
+01:11:53,100 --> 01:11:56,880
+the different machines and computers in
+the cluster; these are all the hyperparameters I'm
+
+985
+01:11:56,880 --> 01:12:01,270
+searching over, and I can see basically
+what's working and what isn't, and I can
+
+986
+01:12:01,270 --> 01:12:04,370
+send commands to my workers, so I can say:
+OK, this isn't working at all, these settings,
+
+987
+01:12:04,369 --> 01:12:07,399
+resample, you're not doing well at all;
+and some of these are doing very well,
+
+988
+01:12:07,399 --> 01:12:10,960
+and I look at what exactly is working
+well, and I'm adjusting; it's a dynamic
+
+989
+01:12:10,960 --> 01:12:14,020
+process that I have to go through to
+actually get this stuff to work well,
+
+990
+01:12:14,020 --> 01:12:17,490
+because you just have too much stuff to
+optimize over, and you can't afford to just
+
+991
+01:12:17,489 --> 01:12:21,569
+spray and pray; you have to work with it.
+
+992
+01:12:21,569 --> 01:12:25,759
+OK, so as you're optimizing, you're
+looking at loss functions;
+
+993
+01:12:25,760 --> 01:12:29,289
+loss functions can take various
+different forms, and you need to be able
+
+994
+01:12:29,289 --> 01:12:34,510
+to read into what that means, so you'll
+get quite good at looking at a
+
+995
+01:12:34,510 --> 01:12:38,289
+loss function and interpreting what
+happened. This one, for example, as someone was
+
+996
+01:12:38,289 --> 01:12:42,409
+pointing out in a previous lecture: it's
+not as exponential as maybe I'm used to with my
+
+997
+01:12:42,409 --> 01:12:47,359
+loss functions; it looks a little
+too linear, and so that
+
+998
+01:12:47,359 --> 01:12:50,949
+maybe tells me that the learning rate is
+maybe slightly too low; that doesn't
+
+999
+01:12:50,949 --> 01:12:53,069
+mean the learning rate is too low, it just
+means that I might want to consider
+
+1000
+01:12:53,069 --> 01:12:54,359
+trying
+
+1001
+01:12:54,359 --> 01:12:58,549
+a higher one. Sometimes you get all kinds
+of funny things: so you can have a plateau,
+
+1002
+01:12:58,550 --> 01:13:04,199
+where at some point the loss decides
+that now it wants to optimize. So
+
+1003
+01:13:04,198 --> 01:13:15,948
+what is the prime suspect in these kinds
+of cases? Just take a guess.
+[students guess] And I think the prime
+1004
+01:13:15,948 --> 01:13:19,388
+suspect is that you initialized
+incorrectly: the gradients are barely
+
+1005
+01:13:19,389 --> 01:13:23,579
+flowing, but at some point they add up
+and the training suddenly takes off. So,
+
+1006
+01:13:23,579 --> 01:13:27,420
+lots of fun; in fact it's so much fun
+that I started an entire tumblr a while
+
+1007
+01:13:27,420 --> 01:13:34,260
+ago on loss functions, so you can go
+through these; people contribute these,
+
+1008
+01:13:34,260 --> 01:13:38,300
+which is nice. And so you'll see, when
+training, especially fancier networks,
+
+1009
+01:13:38,300 --> 01:13:43,550
+and we're going to go into that, all kinds
+of exotic shapes; I don't exactly
+
+1010
+01:13:43,550 --> 01:13:48,730
+know, at some point you're not really
+sure what any of this means. It's going
+
+1011
+01:13:48,729 --> 01:13:52,569
+so well...
+[inaudible student question]
+
+1012
+01:13:52,569 --> 01:14:04,469
+Yeah, so here there are several tasks that
+are training at the same time. And this one,
+
+1013
+01:14:04,469 --> 01:14:08,139
+by the way, I know what happened here:
+this is actually training a
+
+1014
+01:14:08,139 --> 01:14:11,170
+reinforcement learning agent; the problem
+in reinforcement learning is you don't
+
+1015
+01:14:11,170 --> 01:14:14,679
+have a stationary distribution, you don't
+have a fixed dataset: you have an
+
+1016
+01:14:14,679 --> 01:14:17,800
+agent interacting with the environment,
+and if your policy changes, then you end up,
+
+1017
+01:14:17,800 --> 01:14:21,199
+like, staring at a wall, or you end up
+looking at different parts of your space:
+
+1018
+01:14:21,198 --> 01:14:24,629
+you end up with different data
+distributions, and so suddenly the agent is
+
+1019
+01:14:24,630 --> 01:14:27,109
+looking at something very different from
+what it used to be looking at, and as I'm
+
+1020
+01:14:27,109 --> 01:14:30,098
+training my agent, the loss goes up
+because the agent is unfamiliar with
+
+1021
+01:14:30,099 --> 01:14:33,569
+that kind of input, and so you have
+all kinds of fun stuff happening there.
+
+1022
+01:14:33,569 --> 01:14:40,578
+And then this one is one of my favorites:
+I have no idea what basically happened
+
+1023
+01:14:40,578 --> 01:14:45,988
+here; this loss oscillates, but roughly
+does fine, and then it just explodes;
+
+1024
+01:14:45,988 --> 01:14:53,238
+clearly something was not right in this
+case. And also here: this one just, at some
+
+1025
+01:14:53,238 --> 01:14:57,789
+point, decides to converge, and I have no idea
+what was going on. So you get all kinds of funny
+
+1026
+01:14:57,789 --> 01:15:01,368
+things; if you end up with funny plots in
+your assignment, please do send them to
+
+1027
+01:15:01,368 --> 01:15:02,948
+the lossfunctions tumblr.
+
+1028
+01:15:02,948 --> 01:15:06,219
+OK, so during training,
+
+1029
+01:15:06,219 --> 01:15:09,899
+don't only look at the loss function;
+another thing to look at is your accuracy,
+
+1030
+01:15:09,899 --> 01:15:14,929
+especially accuracies, for example. You
+sometimes prefer looking at the accuracy
+
+1031
+01:15:14,929 --> 01:15:18,248
+over loss functions, because accuracies
+are interpretable: I know what these
+
+1032
+01:15:18,248 --> 01:15:22,519
+classification accuracies mean in
+absolute terms, whereas a loss function is
+
+1033
+01:15:22,519 --> 01:15:27,369
+maybe not as interpretable. And so in
+particular, I have an accuracy for my
+
+1034
+01:15:27,368 --> 01:15:31,589
+validation data and my training data, and
+so, for example, in this case I'm seeing that
+
+1035
+01:15:31,590 --> 01:15:35,288
+my training data accuracy is getting
+much much better, and the validation accuracy
+
+1036
+01:15:35,288 --> 01:15:38,929
+has stopped improving, and so based on
+this, that can give you hints on what
+
+1037
+01:15:38,929 --> 01:15:42,380
+might be going on under the hood; in this
+particular case there's a huge gap here,
+
+1038
+01:15:42,380 --> 01:15:44,440
+so maybe I'm thinking I'm overfitting:
+
+1039
+01:15:44,439 --> 01:15:48,069
+I'm not 100% sure, but I might be overfitting,
+and I might want to try to regularize more strongly.
+
+1040
+01:15:48,069 --> 01:15:57,038
+One other thing you might be looking at
+is tracking the difference between the
+
+1041
+01:15:57,038 --> 01:16:01,988
+scale of your parameters and the scale
+of your updates to those parameters. So
+
+1042
+01:16:01,988 --> 01:16:06,748
+say, suppose that your weights are on
+the order of a unit Gaussian;
+
+1043
+01:16:06,748 --> 01:16:10,599
+then intuitively, the update that you're
+incrementing your weights by in
+
+1044
+01:16:10,599 --> 01:16:14,349
+backpropagation: you don't want those
+updates to be much larger than the
+
+1045
+01:16:14,349 --> 01:16:16,679
+weights, obviously, and you don't want
+them to be tiny; you don't want
+
+1046
+01:16:16,679 --> 01:16:20,529
+your updates to be on the order of 1e-7
+when your weights are on the order of
+
+1047
+01:16:20,529 --> 01:16:25,359
+1e-2. And so look at the update
+that you're about to increment
+
+1048
+01:16:25,359 --> 01:16:29,439
+onto your weights, and just look at its
+norm, for example the sum of squares, and
+
+1049
+01:16:29,439 --> 01:16:34,129
+compare the scale of the update to the
+scale of your parameters, and usually a good rule of
+
+1050
+01:16:34,130 --> 01:16:38,550
+thumb is that this ratio should be roughly
+1e-3, so basically with every update you're
+
+1051
+01:16:38,550 --> 01:16:41,360
+modifying on the order of the third
+significant digit for every single
+
+1052
+01:16:41,359 --> 01:16:44,118
+parameter, right: you're not making huge
+updates, you're not making very small
+
+1053
+01:16:44,118 --> 01:16:49,708
+updates. So that's one thing to look at:
+roughly 1e-3 usually works OK; if this is
+
+1054
+01:16:49,708 --> 01:16:53,038
+too high, I want to maybe decrease my
+learning rate; if it's way too low, like say
+
+1055
+01:16:53,038 --> 01:17:00,069
+it's 1e-7, maybe I want to increase my
+learning rate. And so, in summary, today we
+
+1056
+01:17:00,069 --> 01:17:05,308
+looked at a whole bunch of things to do
+with training neural networks; the
+
+1057
+01:17:05,309 --> 01:17:09,729
+TLDRs of all of them are basically:
+use ReLU, subtract the mean, use Xavier
+
+1058
+01:17:09,729 --> 01:17:11,869
+initialization,
+
+1059
+01:17:11,869 --> 01:17:15,750
+or if you think you have a small network,
+you can maybe get away with just
+
+1060
+01:17:15,750 --> 01:17:20,399
+choosing your scale, like 0.01, or maybe
+you want to play with that a bit; there's
+
+1061
+01:17:20,399 --> 01:17:26,719
+no strong recommendation here, I think;
+just use batch norm. And when you're doing
+
+1062
+01:17:26,720 --> 01:17:34,110
+hyperparameter optimization, make sure to
+sample your hyperparameters, and do it in
+log space when appropriate, and
+
+1063
+01:17:34,109 --> 01:17:39,449
+that's something to be aware of. And this
+is what we still have to cover, and that
+
+1064
+01:17:39,449 --> 01:17:44,269
+will be next. We do have two more minutes,
+so I will take questions, if there are
+
+1065
+01:17:44,270 --> 01:18:01,520
+any.
+
+1066
+01:18:01,520 --> 01:18:11,120
+[inaudible question about a
+correlation between hyperparameters]
+
+1067
+01:18:11,119 --> 01:18:15,729
+I don't think there's any, obviously,
+that I can recommend; you'd have to go
+and check it. I don't think there's
+anything that jumps out at me, that's obvious.
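The update-to-parameter ratio heuristic above is easy to monitor. A sketch with made-up weights, gradient, and learning rate:

```python
import numpy as np

W = np.random.randn(500, 500) * 0.01    # hypothetical weights
dW = np.random.randn(500, 500) * 0.1    # hypothetical gradient from backprop
lr = 1e-4

update = -lr * dW
ratio = np.linalg.norm(update) / np.linalg.norm(W)
print(ratio)   # rule of thumb: aim for ~1e-3; much higher suggests lowering
               # the learning rate, much lower suggests raising it
```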
+
+1069
+01:18:18,770 --> 01:18:35,210
+[time for] another couple? OK, great.
+Questions?
+
+1070
+01:18:35,210 --> 01:18:35,949
+question regarding
+

diff --git a/captions/En/Lecture6_en.srt b/captions/En/Lecture6_en.srt
new file mode 100644
index 00000000..5c5c51f7
--- /dev/null
+++ b/captions/En/Lecture6_en.srt
@@ -0,0 +1,4497 @@
+1
+00:00:00,000 --> 00:00:07,009
+OK, let's start. First, today we'll talk
+about training neural networks again, and
+
+2
+00:00:07,009 --> 00:00:10,449
+then I'll give you a bit of an intro to
+convolutional networks. Before we dive
+
+3
+00:00:10,449 --> 00:00:15,489
+into that material, just some
+administrative things first. First, I
+
+4
+00:00:15,490 --> 00:00:18,618
+didn't get a chance to actually
+introduce Justin last lecture; Justin is
+
+5
+00:00:18,618 --> 00:00:21,579
+your instructor also for this class, and
+he was missing for the first two weeks,
+
+6
+00:00:21,579 --> 00:00:28,409
+and you can ask him anything about
+anything; he's very knowledgeable, maybe
+
+7
+00:00:28,410 --> 00:00:29,428
+that's an understatement.
+
+8
+00:00:29,428 --> 00:00:37,960
+OK, and assignment 2 is out; as a reminder,
+it's quite long, so I encourage you to start
+
+9
+00:00:37,960 --> 00:00:43,850
+early, and it's due basically next
+Friday, so get started on that as soon as
+
+10
+00:00:43,850 --> 00:00:47,679
+possible. You'll implement neural networks
+with the proper API of forward/
+
+11
+00:00:47,679 --> 00:00:50,429
+backward passes, and you'll see the
+abstraction of a computational graph,
+
+12
+00:00:50,429 --> 00:00:54,820
+and you'll go into batch normalization,
+dropout, and then you'll actually implement
+
+13
+00:00:54,820 --> 00:00:57,770
+convolutional networks, so by the end of
+this assignment you'll actually have a
+
+14
+00:00:57,770 --> 00:01:00,770
+fairly good understanding of all the
+low-level details of how convolutional
+
+15
+00:01:00,770 --> 00:01:06,530
+network classifiers work. OK, so where we
+are in this class, just as a reminder
+
+16
+00:01:06,530 --> 00:01:10,140
+again: we're training neural networks, and
+it turns out that training neural networks is
+
+17
+00:01:10,140 --> 00:01:15,590
+really a four-step process: you have an
+entire dataset of images and labels; we
+
+18
+00:01:15,590 --> 00:01:18,920
+sample a small batch from the dataset; we
+forward-propagate it through the network
+
+19
+00:01:18,920 --> 00:01:23,060
+to get the loss, which is telling us
+how well we're currently classifying
+
+20
+00:01:23,060 --> 00:01:26,390
+this batch of data; and we backpropagate
+to compute the gradient of all the
+
+21
+00:01:26,390 --> 00:01:29,969
+weights, and this gradient is telling us
+how we should nudge every single weight
+
+22
+00:01:29,969 --> 00:01:33,789
+in the network so that we're better
+classifying these images; and then once
+
+23
+00:01:33,790 --> 00:01:36,700
+we have the gradient, we can use it for a
+parameter update, where we actually do that
+
+24
+00:01:36,700 --> 00:01:38,930
+small nudge.
+
+25
+00:01:38,930 --> 00:01:42,659
+Last class we looked into activation
+functions, an entire zoo of activation
+
+26
+00:01:42,659 --> 00:01:45,368
+functions, and some pros and cons of
+using any of these inside a neural
+
+27
+00:01:45,368 --> 00:01:49,060
+network. A good question came up on
+Piazza, where someone asked why you would even
+
+28
+00:01:49,060 --> 00:01:53,939
+use an activation function, why not
+just skip it; the question was posed,
+
+29
+00:01:53,938 --> 00:01:57,618
+and it got addressed really nicely in
+the last lecture: basically,
+
+30
+00:01:57,618 --> 00:02:00,790
+if you don't use an activation function,
+then your entire neural network ends up
+
+31
+00:02:00,790 --> 00:02:05,500
+being one single linear sandwich, and so
+your capacity is equal to that of just a
+
+32
+00:02:05,500 --> 00:02:10,080
+linear classifier. So those activation
+functions are really critical to have
+
+33
+00:02:10,080 --> 00:02:13,880
+in between, and they are the ones that
+give you all these wiggles that you can use
+
+34
+00:02:13,879 --> 00:02:17,490
+to actually fit your data. We talked
+briefly about the preprocessing
+
+35
+00:02:17,490 --> 00:02:21,860
+techniques, very briefly; we also
+looked at the activation functions and
+
+36
+00:02:21,860 --> 00:02:24,830
+their distributions throughout the
+neural network, and so the problem here, if you
+
+37
+00:02:24,830 --> 00:02:31,370
+recall, is that we have to choose these
+initial weights, and in particular the
+
+38
+00:02:31,370 --> 00:02:34,930
+scale of how large you want those
+weights to be in the beginning, and we saw
+
+39
+00:02:34,930 --> 00:02:38,260
+that if those weights are too
+small, then your activations in a neural
+
+40
+00:02:38,259 --> 00:02:41,909
+network, as you have a deep network, go
+toward zero, and if you set that scale
+
+41
+00:02:41,909 --> 00:02:45,129
+slightly too high, then all of them will
+explode instead, and so you end up with
+
+42
+00:02:45,129 --> 00:02:48,939
+either super-saturated networks or you
+end up with networks that output all
+
+43
+00:02:48,939 --> 00:02:54,189
+zeros, and so that scale is a very very
+tricky thing to set. We looked into the
+
+44
+00:02:54,189 --> 00:02:59,579
+Xavier initialization, which gives you a
+reasonable kind of thing to use in that
+
+45
+00:02:59,580 --> 00:03:03,290
+form, and that gives you basically
+roughly good activations, or
+
+46
+00:03:03,289 --> 00:03:06,459
+distributions of activations, throughout
+the network in the beginning of training,
+
+47
+00:03:06,459 --> 00:03:10,959
+and then we went into batch normalization,
+which is this thing that alleviates a lot
+
+48
+00:03:10,959 --> 00:03:14,120
+of these headaches with actually setting
+that scale properly, and so batch
+
+49
+00:03:14,120 --> 00:03:16,689
+normalization makes this a much more
+robust choice: you don't have to
+
+50
+00:03:16,689 --> 00:03:20,550
+precisely get that initial scale correct.
+And we went through all of its pros and cons,
+
+51
+00:03:20,550 --> 00:03:23,620
+and we talked about that for a while, and
+then we talked about the learning
+
+52
+00:03:23,620 --> 00:03:26,920
+process, where I tried to show you some
+tips and tricks for how you actually baby-
+
+53
+00:03:26,919 --> 00:03:29,809
+sit these neural networks, how you get
+them to train properly, and also how you
+
+54
+00:03:29,810 --> 00:03:34,860
+run cross-validations and how you slowly,
+over time, narrow in on good hyperparameter ranges.
+
+55
+00:03:34,860 --> 00:03:37,769
+So we talked about all that last time; so
+this time we're going to go into some of
+
+56
+00:03:37,769 --> 00:03:41,060
+the remaining items for training neural
+networks, in particular parameter update
+
+57
+00:03:41,060 --> 00:03:44,989
+schemes, I think for the most part, and then
+we'll talk a bit about model ensembles, dropout,
+
+58
+00:03:44,989 --> 00:03:49,480
+and so on. So before I dive into that, are
+there any administrative things, by the way, that I'm
+
+59
+00:03:49,479 --> 00:03:53,509
+forgetting? Not necessarily. So,
+
+60
+00:03:53,509 --> 00:03:58,030
+parameter updates. There's a process
+to training a neural network, and
+
+61
+00:03:58,030 --> 00:04:01,199
+this is the pseudocode, really, of what it
+looks like: we evaluate the
+
+62
+00:04:01,199 --> 00:04:04,419
+loss, we evaluate the gradient, and we
+perform a parameter update. When I talk about
+
+63
+00:04:04,419 --> 00:04:08,030
+parameter updates, we're specifically
+looking at this last line here, where
+
+64
+00:04:08,030 --> 00:04:12,129
+we're trying to make that more complex.
+So right now what we're doing is
+
+65
+00:04:12,129 --> 00:04:17,129
+just plain gradient descent, where we take
+the gradient that we computed, and we
+
+66
+00:04:17,129 --> 00:04:21,639
+just multiply it, scaled by the learning
+rate, onto our parameter vector. We can be
+
+67
+00:04:21,639 --> 00:04:23,159
+much more elaborate with how we
+
+68
+00:04:23,160 --> 00:04:27,960
+do that update, and so I flashed this image
+briefly in the last few lectures, where
+
+69
+00:04:27,959 --> 00:04:30,759
+you can see different parameter update
+schemes and how quickly they actually
+
+70
+00:04:30,759 --> 00:04:35,129
+optimize this simple loss function here,
+and so in particular you can see that SGD,
+
+71
+00:04:35,129 --> 00:04:38,550
+which is what we're using right now in
+the fourth line here, that's SGD in
+
+72
+00:04:38,550 --> 00:04:41,710
+red, you can see that it's actually
+the slowest one of all of them. So in
+
+73
+00:04:41,709 --> 00:04:45,139
+practice you rarely ever use just basic
+SGD; there are better schemes that we
+
+74
+00:04:45,139 --> 00:04:48,979
+can use. We're going to go into those in
+this lecture, so let's look at what the
+
+75
+00:04:48,980 --> 00:04:54,810
+problem is with SGD: why is it so slow? So
+consider this particular, slightly
+
+76
+00:04:54,810 --> 00:04:58,589
+contrived example here, where we have a
+loss function surface: the level sets of our
+
+77
+00:04:58,589 --> 00:05:02,099
+loss are elongated along one
+direction much more than the other
+
+78
+00:05:02,100 --> 00:05:05,500
+direction, so basically this loss
+function here is very shallow
+
+79
+00:05:05,500 --> 00:05:10,199
+horizontally but very steep vertically,
+and we want, of course, to minimize this,
+
+80
+00:05:10,199 --> 00:05:13,469
+and right now we're at the red X,
+trying to get to the minimum, denoted by
+
+81
+00:05:13,470 --> 00:05:19,240
+the smiley face; that's where we're happy.
+But think about what the trajectory of
+
+82
+00:05:19,240 --> 00:05:22,980
+this looks like in both the x and y
+directions with
+
+83
+00:05:22,980 --> 00:05:30,650
+SGD: if we try to optimize this
+landscape, what would it look like? What
+
+84
+00:05:30,649 --> 00:05:35,729
+would it look like horizontally and
+vertically? I see someone gesturing, so what
+
+85
+00:05:35,730 --> 00:05:43,540
+you're pointing out there, and why: it's
+going to bounce up and down like
+
+86
+00:05:43,540 --> 00:05:52,030
+that, and why is it not making a lot of
+progress? Right: this basically has this
+
+87
+00:05:52,029 --> 00:05:56,969
+form where, when we look at the gradient
+horizontally, we see that the gradient is
+
+88
+00:05:56,970 --> 00:06:00,680
+very small, because this is a shallow
+function horizontally, but we have a
+
+89
+00:06:00,680 --> 00:06:03,439
+large gradient vertically, because it's a
+very steep function, so what's going to happen
+
+90
+00:06:03,439 --> 00:06:06,389
+when you run SGD in these
+kinds of cases is you end up with this
+
+91
+00:06:06,389 --> 00:06:10,250
+kind of pattern where you're going way
+too slow in the horizontal direction.
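For reference, the four-step loop and the plain SGD update being discussed reduce to a few lines. Everything below is a toy stand-in; the gradient comes from a made-up quadratic loss, and the sampling/loss helpers are named only in comments:

```python
import numpy as np

x = np.random.randn(10)          # stand-in parameter vector
learning_rate = 1e-3             # illustrative value
for step in range(100):
    # data_batch = sample_batch(dataset, batch_size=256)      # hypothetical
    # loss, dx = compute_loss_and_grads(x, data_batch)        # hypothetical
    dx = 2 * x                   # toy gradient of sum(x**2), for illustration
    x += -learning_rate * dx     # vanilla SGD: step against the gradient
```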
+92
+00:06:10,250 --> 00:06:13,300
+But you're going way too fast in the
+vertical direction, so you end up with this
+
+93
+00:06:13,300 --> 00:06:17,918
+jitter here. So one way of remedying this
+kind of situation is what we call the momentum
+
+94
+00:06:17,918 --> 00:06:22,189
+update. The momentum update will
+change our update in the following way:
+
+95
+00:06:22,189 --> 00:06:25,319
+so right now we're just computing the
+gradient,
+
+96
+00:06:25,319 --> 00:06:28,409
+taking the gradient, and we're
+incrementing our current position by the
+
+97
+00:06:28,410 --> 00:06:34,220
+gradient in the update. Instead, we're
+going to take the gradient that we computed, and
+
+98
+00:06:34,220 --> 00:06:36,449
+instead of incrementing the position
+directly,
+
+99
+00:06:36,449 --> 00:06:40,840
+we're going to increment this variable v,
+which we call v for velocity, and
+
+100
+00:06:40,839 --> 00:06:44,049
+we're going to see why that is in a bit.
+So we increment this
+
+101
+00:06:44,050 --> 00:06:48,020
+velocity variable v, and instead,
+we're basically building up this
+
+102
+00:06:48,019 --> 00:06:53,278
+exponentially decaying sum of gradients
+from the past, and that's what's incrementing the position.
+
+103
+00:06:53,278 --> 00:06:58,610
+This mu here is a hyperparameter, and mu
+is kind of a number between 0 and 1,
+
+104
+00:06:58,610 --> 00:07:03,629
+and what we're doing is decaying the
+previous v and adding on the current gradient. So
+
+105
+00:07:03,629 --> 00:07:07,180
+what's nice about the momentum update is
+you can interpret it in very physical
+
+106
+00:07:07,180 --> 00:07:14,310
+terms, in the following way: basically,
+using the momentum update corresponds to
+
+107
+00:07:14,310 --> 00:07:18,899
+interpreting this loss surface as really
+a ball rolling around this
+
+108
+00:07:18,899 --> 00:07:22,459
+landscape, and the gradient in this case
+is the force that the particle is
+
+109
+00:07:22,459 --> 00:07:26,408
+feeling. So this particle is feeling some
+force due to the gradient, and instead of
+
+110
+00:07:26,408 --> 00:07:31,158
+directly integrating the position, this
+force, in physics, force is equivalent
+
+111
+00:07:31,158 --> 00:07:36,019
+to acceleration there, and so
+acceleration is what we're computing, and
+
+112
+00:07:36,019 --> 00:07:39,938
+so the velocity gets integrated by the
+acceleration here, and then the mu times
+
+113
+00:07:39,939 --> 00:07:43,039
+v has the interpretation of friction in
+that case, because at every single
+
+114
+00:07:43,038 --> 00:07:47,759
+iteration we're slightly slowing down, and
+intuitively, if this mu times v were not
+
+115
+00:07:47,759 --> 00:07:51,550
+there, then this ball would never come to
+rest, because it would just jostle around the loss
+
+116
+00:07:51,550 --> 00:07:54,509
+surface forever; there would be no
+loss of energy, and it would never settle at
+
+117
+00:07:54,509 --> 00:07:58,158
+the minimum of the loss function. And so
+the momentum update is taking this
+
+118
+00:07:58,158 --> 00:08:01,810
+physical interpretation of optimization,
+where we have a ball rolling around
+
+119
+00:08:01,810 --> 00:08:08,249
+and it's slowing down over time. And so
+the way this works, what's very nice
+
+120
+00:08:08,249 --> 00:08:11,669
+about this update, is you end up building
+up this velocity, in particular in
+
+121
+00:08:11,668 --> 00:08:14,959
+the shallow directions: it's very easy to
+see that if you have a shallow but
+
+122
+00:08:14,959 --> 00:08:18,449
+consistent direction, then the momentum
+update will slowly build up the velocity
+vector along that direction.
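The momentum update described in these cues, in the same toy setting as the SGD sketch; `mu` and the learning rate are typical but illustrative values:

```python
import numpy as np

x = np.random.randn(10)
v = np.zeros_like(x)             # velocity, initialized at zero
mu, learning_rate = 0.9, 1e-3    # illustrative hyperparameters
for step in range(100):
    dx = 2 * x                       # toy gradient, as before
    v = mu * v - learning_rate * dx  # mu*v acts as friction; dx is the force
    x += v                           # integrate position by velocity, not gradient
```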
+
+124
+00:08:21,360 --> 00:08:24,999
+But in a very steep direction, what's
+going to happen is you start off,
+
+125
+00:08:24,999 --> 00:08:28,919
+of course, oscillating around, but then
+you're always being pulled back the other
+
+126
+00:08:28,918 --> 00:08:32,429
+way, toward the center, and with the
+damping you kind of oscillate into
+
+127
+00:08:32,429 --> 00:08:36,338
+the middle. So it's kind of damping these
+oscillations in the steep directions,
+
+128
+00:08:36,339 --> 00:08:41,139
+and it's encouraging
+progress in the consistent,
+
+129
+00:08:41,139 --> 00:08:44,889
+shallow directions, and that's why it
+ends up improving the convergence in
+
+130
+00:08:44,889 --> 00:08:49,600
+many cases. So for example, here in this
+visualization we see the SGD update, and the
+
+131
+00:08:49,600 --> 00:08:53,459
+momentum update is in green, and you
+can see what happens: the green one
+
+132
+00:08:53,458 --> 00:08:57,008
+overshoots, because it built up all
+this velocity.
+
+133
+00:08:57,009 --> 00:09:00,909
+It overshoots the minimum, but then it
+eventually ends up converging anyway, and
+
+134
+00:09:00,909 --> 00:09:04,169
+of course it has overshot, but once it
+comes back you can see that it's
+
+135
+00:09:04,169 --> 00:09:07,879
+converging much quicker than the
+basic SGD update. So you end up
+
+136
+00:09:07,879 --> 00:09:11,230
+building up a bit too much speed, but
+you eventually get there quicker than if
+
+137
+00:09:11,230 --> 00:09:17,110
+you did not have the velocity. We're
+going to see a
+
+138
+00:09:17,110 --> 00:09:20,430
+particular variation of the momentum
+update in a bit. I just wanted to take
+
+139
+00:09:20,429 --> 00:09:34,289
+questions about the momentum update.
+[Question] Yes, mu is a single hyperparameter, and
+
+140
+00:09:34,289 --> 00:09:40,078
+usually it takes values of
+roughly 0.5 to 0.9, and —
+
+141
+00:09:40,078 --> 00:09:43,219
+it's not super common, but people
+sometimes anneal it from 0.5 to 0.99
+
+142
+00:09:43,220 --> 00:09:54,200
+slowly over time. But it's just a single
+number.
+
+143
+00:09:54,200 --> 00:09:57,180
+[Question] Yes, you could avoid those oscillations
+with a smaller learning rate, but then the issue
+
+144
+00:09:57,179 --> 00:10:03,000
+is that a smaller learning rate is
+applied globally, to all directions in
+
+145
+00:10:03,000 --> 00:10:06,070
+the gradient, and so then you would
+basically make no progress in the
+
+146
+00:10:06,070 --> 00:10:09,390
+horizontal direction. You wouldn't
+oscillate as much, but then it would take you
+
+147
+00:10:09,389 --> 00:10:12,710
+forever to go horizontally with a small
+learning rate, so there's a trade-off
+
+148
+00:10:12,710 --> 00:10:25,350
+there.
+[Question] The question is how to initialize
+
+149
+00:10:25,350 --> 00:10:29,050
+v: usually at zero, and it doesn't
+matter too much, because you end up
+
+150
+00:10:29,049 --> 00:10:32,490
+building it up in the first few steps.
+If you
+
+151
+00:10:32,490 --> 00:10:35,480
+expand out this recurrence, you'll see
+that basically it's an exponentially
+
+152
+00:10:35,480 --> 00:10:39,330
+decaying sum of your previous gradients,
+so it warms up over the first
+
+153
+00:10:39,330 --> 00:10:46,020
+several steps. OK, so a particular variation
+of momentum is something called
+
+154
+00:10:46,019 --> 00:10:53,449
+Nesterov momentum, or Nesterov accelerated
+gradient, and the idea here is: we have the
+
+155
+00:10:53,450 --> 00:10:57,550
+ordinary momentum update here, and the
+way to think about it is that your
+
+156
+00:10:57,549 --> 00:10:59,789
+x is incremented by really two parts:
+
+157
+00:10:59,789 --> 00:11:03,279
+there's a part where you've built up some
+momentum in a particular direction, so
+
+158
+00:11:03,279 --> 00:11:06,799
+that's the momentum step in green, that's
+the mu times v, and that's where the
+
+159
+00:11:06,799 --> 00:11:09,959
+momentum is currently trying to carry
+you; and then you have the second
+
+160
+00:11:09,960 --> 00:11:12,610
+contribution from the gradient: the
+gradient is pulling you this way, towards
+
+161
+00:11:12,610 --> 00:11:17,450
+the decrease of the loss function, and the
+actual step ends up being the vector sum
+
+162
+00:11:17,450 --> 00:11:21,350
+of the two. So the blue step you end
+up with is just the green plus the red.
+
+163
+00:11:21,350 --> 00:11:24,840
+The idea behind Nesterov momentum — and
+this ends up working better in practice —
+
+164
+00:11:24,840 --> 00:11:29,629
+is the following: we know at this point,
+regardless of what the current gradient turns
+
+165
+00:11:29,629 --> 00:11:33,439
+out to be — so before we've computed the
+gradient — we know that we've built up some
+
+166
+00:11:33,440 --> 00:11:37,240
+momentum, and we know we're definitely
+going to take this green step. OK, so
+
+167
+00:11:37,240 --> 00:11:41,220
+we're definitely taking the green step. Rather
+than evaluating the gradient here at our
+
+168
+00:11:41,220 --> 00:11:45,310
+current spot, Nesterov momentum
+wants to look ahead, and instead
+
+169
+00:11:45,309 --> 00:11:49,379
+evaluates the gradient at this point:
+the point at the tip of the green arrow.
+
+170
+00:11:49,379 --> 00:11:53,679
+So what you end up with is the following
+difference: we know we're going to
+
+171
+00:11:53,679 --> 00:11:57,089
+go this way anyway, so why not just
+look ahead to that part of the
+
+172
+00:11:57,090 --> 00:12:00,420
+objective and evaluate the gradient at
+that point. And of course your
+
+173
+00:12:00,419 --> 00:12:02,309
+gradient is going to be slightly
+different, because you're at a different
+
+174
+00:12:02,309 --> 00:12:05,669
+position in the loss function, and this one
+step of lookahead gives you a slightly better
+
+175
+00:12:05,669 --> 00:12:06,259
+direction
+
+176
+00:12:06,259 --> 00:12:11,109
+over there and a slightly
+different update. Now, you can
+
+177
+00:12:11,109 --> 00:12:14,379
+show that this actually enjoys
+better theoretical guarantees on
+
+178
+00:12:14,379 --> 00:12:18,069
+convergence rates, and not only is it true
+in theory, but also in practice it
+
+179
+00:12:18,068 --> 00:12:23,068
+almost always works better than plain
+momentum. So the difference, roughly,
+
+180
+00:12:23,068 --> 00:12:28,358
+is the following. Here I've written it in
+notation instead of code, but we still
+
+181
+00:12:28,359 --> 00:12:29,589
+have the mu
+
+182
+00:12:29,589 --> 00:12:33,089
+times the previous velocity vector, and
+the gradient that you're currently
+
+183
+00:12:33,089 --> 00:12:37,629
+evaluating, and then we do an update here.
+In the Nesterov update, the only
+
+184
+00:12:37,629 --> 00:12:41,720
+difference is this theta plus mu times
+v term that we're appending: when we
+
+185
+00:12:41,720 --> 00:12:44,949
+evaluate the gradient, we evaluate it at
+a slightly different position, at this
+
+186
+00:12:44,948 --> 00:12:48,278
+lookahead position. And so that's
+really Nesterov momentum.
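+
+(A sketch of that lookahead form, with the same assumed toy setup as before:)
+
+import numpy as np
+
+df = lambda x: np.array([0.02, 2.0]) * x   # same assumed toy gradient
+x, v = np.array([10.0, 1.0]), np.zeros(2)
+mu, learning_rate = 0.9, 0.1
+for _ in range(100):
+    x_ahead = x + mu * v                       # where the momentum is about to carry us
+    v = mu * v - learning_rate * df(x_ahead)   # gradient evaluated at the lookahead point
+    x += v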
+
+187
+00:12:48,278 --> 00:12:51,698
+It almost always works better. Now, there's
+a slight technicality here which I don't
+
+188
+00:12:51,698 --> 00:12:57,068
+think I'm going to go into too much, but
+it's slightly inconvenient that
+
+189
+00:12:57,068 --> 00:13:00,418
+normally we think about just doing a
+forward and backward pass, so what we end
+
+190
+00:13:00,418 --> 00:13:04,288
+up with is a parameter vector theta
+and the gradient at that point. But
+
+191
+00:13:04,288 --> 00:13:09,088
+Nesterov wants us to have
+parameters here and the gradient at a
+
+192
+00:13:09,089 --> 00:13:12,600
+different point, so it doesn't quite fit
+in with a simple API where you only
+
+193
+00:13:12,600 --> 00:13:16,019
+hand gradients to your code. It turns out
+there's a way — I don't want to
+
+194
+00:13:16,019 --> 00:13:19,899
+spend too much time on this — but
+there's a way to basically do a variable
+
+195
+00:13:19,899 --> 00:13:23,379
+transform, where you rename the lookahead
+parameters; you do some rearrangement and then you get
+
+196
+00:13:23,379 --> 00:13:26,079
+something that looks much more like a
+vanilla update that you can just
+
+197
+00:13:26,078 --> 00:13:29,538
+swap in for momentum or
+plain SGD, because you end up with
+
+198
+00:13:29,538 --> 00:13:34,119
+only needing the gradient at phi, and
+you update phi, and this phi is
+
+199
+00:13:34,119 --> 00:13:35,209
+really the lookahead
+
+200
+00:13:35,208 --> 00:13:38,159
+version of the parameters instead of
+the raw parameter vector. That's
+
+201
+00:13:38,159 --> 00:13:40,608
+just a technicality; you can go into
+the notes to check this out.
+
+202
+00:13:40,609 --> 00:13:46,709
+OK, so here Nesterov accelerated gradient
+is in magenta, and you can see the
+
+203
+00:13:46,708 --> 00:13:50,208
+original momentum here overshoots quite
+a lot, but because Nesterov accelerated
+
+204
+00:13:50,208 --> 00:13:53,958
+momentum has this one-step lookahead, you'll
+see that it curls around much more
+
+205
+00:13:53,958 --> 00:13:57,738
+quickly, and that's because all these
+tiny contributions of a slightly better
+
+206
+00:13:57,739 --> 00:14:01,619
+gradient at where you're about to be end
+up adding up, and you almost always
+
+207
+00:14:01,619 --> 00:14:08,600
+converge faster. So that's Nesterov. Until
+recently, SGD with momentum was the
+
+208
+00:14:08,600 --> 00:14:11,329
+standard default way of training
+convolutional networks, and many people
+
+209
+00:14:11,328 --> 00:14:14,658
+still train using just the momentum
+update; this is a common thing to see in
+
+210
+00:14:14,658 --> 00:14:17,610
+practice, and even better is Nesterov —
+
+211
+00:14:17,610 --> 00:14:20,990
+NAG here stands for Nesterov accelerated gradient.
+
+212
+00:14:20,990 --> 00:14:44,350
+[Question about local minima] So I
+think it's slightly incorrect: when people
+
+213
+00:14:44,350 --> 00:14:46,990
+think about loss functions for
+neural networks, they usually think about
+
+214
+00:14:46,990 --> 00:14:50,350
+these crazy ravines and lots of local
+minima everywhere. That's actually not a
+
+215
+00:14:50,350 --> 00:14:53,670
+correct way to look at it; it's a
+reasonable picture to have
+
+216
+00:14:53,669 --> 00:14:56,278
+in mind when you have very small
+neural networks, and people used to think
+
+217
+00:14:56,278 --> 00:14:59,769
+that local minima were an issue in
+optimizing networks, but it actually turns
+
+218
+00:14:59,769 --> 00:15:04,269
+out, with a lot of recent theoretical
+work, that as you scale up your models
+
+219
+00:15:04,269 --> 00:15:10,740
+these local minima become less and
+less of an issue. So the picture to
+
+220
+00:15:10,740 --> 00:15:14,389
+have in mind is: there are lots of local
+minima, but they're all at about the same
+
+221
+00:15:14,389 --> 00:15:18,958
+loss. That's a better way to look
+at it. So these loss functions of neural
+
+222
+00:15:18,958 --> 00:15:22,078
+networks in practice end up
+looking much more like a bowl
+
+223
+00:15:22,078 --> 00:15:25,599
+instead of a crazy ravine landscape,
+and you can show that as you scale up
+
+224
+00:15:25,600 --> 00:15:28,360
+the neural network, the difference
+between your worst and your best
+
+225
+00:15:28,360 --> 00:15:29,259
+local minima
+
+226
+00:15:29,259 --> 00:15:32,448
+actually kind of shrinks down.
+Some researchers also argue that
+
+227
+00:15:32,448 --> 00:15:36,120
+basically there are no bad local minima;
+those only happen in very small networks.
+
+228
+00:15:36,120 --> 00:15:41,409
+And in fact, in practice what you find
+is if you initialize with different
+
+229
+00:15:41,409 --> 00:15:44,610
+random initializations, you almost always
+end up getting the same answer, the same
+
+230
+00:15:44,610 --> 00:15:48,009
+loss in the end. So there are no
+really bad local minima that you
+
+231
+00:15:48,009 --> 00:15:57,429
+get stuck in, especially when you have
+big networks. Next question?
+
+232
+00:15:57,429 --> 00:16:10,849
+[Question about whether Nesterov has an
+oscillating feature] Which part?
+
+233
+00:16:10,850 --> 00:16:14,819
+OK, I think you're jumping ahead maybe by
+several slides; we're going to go into
+
+234
+00:16:14,818 --> 00:16:19,849
+second-order methods in a bit. OK, let
+me jump into another update that is very
+
+235
+00:16:19,850 --> 00:16:23,069
+common to see in practice. It's called
+AdaGrad, and it was originally developed
+
+236
+00:16:23,068 --> 00:16:25,969
+in the convex optimization literature, and
+then it was kind of ported over to
+
+237
+00:16:25,970 --> 00:16:30,019
+neural networks, and people sometimes use
+it. So the AdaGrad update looks as
+
+238
+00:16:30,019 --> 00:16:30,560
+follows:
+
+239
+00:16:30,559 --> 00:16:35,619
+we have this update as we normally see
+in basic stochastic gradient descent
+
+240
+00:16:35,620 --> 00:16:37,500
+here, learning rate times the
+
+241
+00:16:37,500 --> 00:16:42,259
+gradient, but now we're scaling this
+gradient by this additional variable
+
+242
+00:16:42,259 --> 00:16:47,589
+that we keep accumulating. Note here that
+this cache which we're building up is
+
+243
+00:16:47,589 --> 00:16:52,199
+a sum of gradients squared, so this
+cache contains positive numbers only,
+
+244
+00:16:52,198 --> 00:16:55,599
+and note that the cache variable here is
+a giant vector of the same size as your
+
+245
+00:16:55,600 --> 00:17:00,730
+parameter vector. So this cache ends up
+building up, and in a per-parameter way we're
+
+246
+00:17:00,730 --> 00:17:03,839
+keeping track of the sum of squares of
+the gradients, or as we sometimes
+
+247
+00:17:03,839 --> 00:17:07,679
+call it, the second moment of the gradient
+(the uncentered second moment). So we keep
+
+248
+00:17:07,679 --> 00:17:12,409
+building up this cache, and then we divide,
+element-wise, the step by the
+
+249
+00:17:12,409 --> 00:17:21,709
+square root of the cache. So what ends
+up happening here — that's the reason
+
+250
+00:17:21,709 --> 00:17:26,189
+people call it a per-parameter
+adaptive learning rate method — is that
+
+251
+00:17:26,189 --> 00:17:31,090
+every single dimension
+of your parameter space now
+
+252
+00:17:31,089 --> 00:17:34,569
+has its own kind of learning rate
+that is scaled dynamically, based on what
+
+253
+00:17:34,569 --> 00:17:39,079
+kinds of gradients it's seeing, in
+terms of their scale. So with this
+
+254
+00:17:39,079 --> 00:17:42,859
+interpretation, what happens with
+AdaGrad in this particular case? If we
+
+255
+00:17:42,859 --> 00:17:47,019
+do this, what happens in the horizontal
+and the vertical direction, with this kind of
+
+256
+00:17:47,019 --> 00:17:51,359
+dynamics?
+
+257
+00:17:51,359 --> 00:18:03,789
+What you'll see is we have a large
+gradient vertically, and that large
+
+258
+00:18:03,789 --> 00:18:07,259
+gradient will be added into the cache, and
+then we end up dividing by larger and
+
+259
+00:18:07,259 --> 00:18:11,359
+larger numbers, so we'll get smaller and
+smaller updates in the vertical direction.
+
+260
+00:18:11,359 --> 00:18:14,798
+Since we're seeing lots of large gradients
+vertically, this will decay the learning
+
+261
+00:18:14,798 --> 00:18:18,859
+rate, and we'll make smaller and smaller
+steps in the vertical direction. But
+
+262
+00:18:18,859 --> 00:18:22,009
+the horizontal direction is a very
+shallow direction, so we end up with
+
+263
+00:18:22,009 --> 00:18:25,750
+smaller numbers in the denominator, and
+you'll see that relative to the y
+
+264
+00:18:25,750 --> 00:18:29,058
+dimension we're going to end up making
+faster progress. So we have this equalizing
+
+265
+00:18:29,058 --> 00:18:35,058
+effect that accounts for the steepness
+in steep and shallow directions: you
+
+266
+00:18:35,058 --> 00:18:40,319
+actually end up making much larger steps
+in the horizontal than in the vertical
+
+267
+00:18:40,319 --> 00:18:48,048
+direction. But there's one problem
+with AdaGrad: think about what
+
+268
+00:18:48,048 --> 00:18:53,009
+happens to the step size over time, as we
+keep updating the position. If we want to
+
+269
+00:18:53,009 --> 00:18:55,900
+train an entire deep neural network,
+that takes a long time, and we're
+
+270
+00:18:55,900 --> 00:19:01,970
+training for a long time, so what's
+going to happen to the AdaGrad step?
+
+271
+00:19:01,970 --> 00:19:05,169
+Your cache ends up building up all the
+time, you add all these positive numbers
+
+272
+00:19:05,169 --> 00:19:09,100
+into the denominator, your step literally
+just decays to zero, and you end up stopping
+
+273
+00:19:09,099 --> 00:19:14,579
+learning completely. And
+that's OK in convex problems:
+
+274
+00:19:14,579 --> 00:19:17,970
+perhaps you just have a bowl, you kind
+of decay down to the optimum and you're
+
+275
+00:19:17,970 --> 00:19:21,919
+done. But a neural network is more like a
+thing that keeps shuttling around the landscape;
+
+276
+00:19:21,919 --> 00:19:24,549
+it's a dynamic process — that's a
+better way to think of it — and so this
+
+277
+00:19:24,548 --> 00:19:28,329
+thing needs continuous energy to get to your
+optimum; you don't want it to just decay to a halt.
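+
+(A sketch of AdaGrad on the same assumed toy problem; note that the cache
+only ever grows, which is exactly the problem just described:)
+
+import numpy as np
+
+df = lambda x: np.array([0.02, 2.0]) * x
+x, cache = np.array([10.0, 1.0]), np.zeros(2)
+learning_rate = 1.0
+for _ in range(100):
+    dx = df(x)
+    cache += dx**2                         # per-parameter sum of squared gradients
+    x += -learning_rate * dx / (np.sqrt(cache) + 1e-7)   # element-wise scaled step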
+
+278
+00:19:28,329 --> 00:19:33,009
+So there's a very simple change to
+AdaGrad that was
+
+279
+00:19:33,009 --> 00:19:37,829
+proposed by Geoff Hinton recently, and the
+idea here is that instead of keeping
+
+280
+00:19:37,829 --> 00:19:42,289
+just a raw sum of squares in every
+dimension, we make that
+
+281
+00:19:42,289 --> 00:19:46,250
+counter a leaky counter. So instead we
+end up with this decay rate hyperparameter,
+
+282
+00:19:46,250 --> 00:19:52,500
+which we set to something like 0.99.
+We still keep a sum of squares, but it
+
+283
+00:19:52,500 --> 00:19:57,750
+is leaking slowly. And that's OK, because
+we still maintain this nice
+
+284
+00:19:57,750 --> 00:20:01,569
+effect of equalizing the step sizes in
+steep or shallow directions,
+
+285
+00:20:01,569 --> 00:20:05,869
+but we're not going to just decay
+completely to zero updates. So that's RMS-
+
+286
+00:20:05,869 --> 00:20:10,299
+Prop. A fun bit of historical context
+about RMSProp: the way it was
+
+287
+00:20:10,299 --> 00:20:11,430
+introduced to us —
+
+288
+00:20:11,430 --> 00:20:14,340
+you'd think that it would be a paper that
+proposed this method, but in fact it was
+
+289
+00:20:14,339 --> 00:20:18,789
+a slide in Geoff Hinton's Coursera class
+just a few years ago. Geoff
+
+290
+00:20:18,789 --> 00:20:22,240
+was giving this Coursera class and
+flashed a slide saying, this is
+
+291
+00:20:22,240 --> 00:20:25,630
+unpublished, but it usually works well
+in practice, do this — and that's
+
+292
+00:20:25,630 --> 00:20:29,920
+basically RMSProp. So I
+implemented it, and I saw better
+
+293
+00:20:29,920 --> 00:20:34,060
+results on my optimization right away,
+and I thought that was really funny. And
+
+294
+00:20:34,059 --> 00:20:37,769
+so in fact, in papers — not only my
+papers but many other papers — people
+
+295
+00:20:37,769 --> 00:20:44,559
+have cited the slide from Coursera, just
+"lecture 6, the slide on this
+
+296
+00:20:44,559 --> 00:20:48,389
+method". Since then this has actually become
+an actual paper, and there are more results
+
+297
+00:20:48,390 --> 00:20:52,300
+on exactly what it's doing and so on,
+but for a while this was really funny.
+
+298
+00:20:52,299 --> 00:20:57,609
+So in this optimization picture we can
+see AdaGrad here in blue, and RMS-
+
+299
+00:20:57,609 --> 00:20:58,579
+Prop is
+
+300
+00:20:58,579 --> 00:21:02,490
+in black, and we can see that both of
+them converge quite quickly down here.
+
+301
+00:21:02,490 --> 00:21:07,519
+In this particular case AdaGrad
+is converging slightly faster than
+
+302
+00:21:07,519 --> 00:21:11,589
+RMSProp, but that's not always the
+case. Usually what you see in
+
+303
+00:21:11,589 --> 00:21:15,839
+practice when you train deep networks
+is that AdaGrad stops too early, and RMS-
+
+304
+00:21:15,839 --> 00:21:21,329
+Prop usually ends up winning out
+between these methods. Questions
+
+305
+00:21:21,329 --> 00:21:24,509
+about RMSProp? Go ahead.
+
+306
+00:21:24,509 --> 00:21:55,150
+[Question] The issue is that in very steep
+directions you probably don't want to speed;
+
+307
+00:21:55,150 --> 00:21:58,800
+this method is saying not to make very fast
+updates in that direction, to slow yourself down. Maybe in
+
+308
+00:21:58,799 --> 00:22:02,220
+this particular case you'd like to go
+faster, but you're kind of reading into
+
+309
+00:22:02,220 --> 00:22:05,019
+this particular example, and that's not
+true in general of the
+
+310
+00:22:05,019 --> 00:22:09,940
+optimization landscapes that neural networks
+are made up of; it's a good strategy to apply
+
+311
+00:22:09,940 --> 00:22:22,930
+in those cases.
+
+312
+00:22:22,930 --> 00:22:25,730
+Oh, by the way, I skipped over this
+explanation of the 1e-7, but you can
+
+313
+00:22:25,730 --> 00:22:30,380
+hopefully see that the 1e-7 is there
+just to prevent a division by zero.
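+
+(RMSProp as just described, again as a sketch on the same assumed toy setup;
+note the 1e-7 smoothing term in the denominator:)
+
+import numpy as np
+
+df = lambda x: np.array([0.02, 2.0]) * x
+x, cache = np.array([10.0, 1.0]), np.zeros(2)
+learning_rate, decay_rate = 0.1, 0.99
+for _ in range(100):
+    dx = df(x)
+    cache = decay_rate * cache + (1 - decay_rate) * dx**2   # leaky sum of squares
+    x += -learning_rate * dx / (np.sqrt(cache) + 1e-7)      # steps no longer decay to zero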
+
+314
+00:22:30,380 --> 00:22:34,550
+It's a hyperparameter; usually we set
+it to 1e-5, or -6, or -7, or
+
+315
+00:22:34,549 --> 00:22:39,139
+something like that. In the beginning
+your cache is zero, so this keeps you
+
+316
+00:22:39,140 --> 00:22:46,540
+from dividing by zero. [Question] What you get
+is this adaptive behavior, but the scale
+
+317
+00:22:46,539 --> 00:22:50,420
+of it is still in your control; the
+absolute scale of the step is still the
+
+318
+00:22:50,420 --> 00:22:57,370
+learning rate. The adaptive part is
+something
+
+319
+00:22:57,369 --> 00:23:00,989
+to look at more as a relative thing
+between the different parameters, how
+
+320
+00:23:00,990 --> 00:23:12,190
+you're equalizing the steps; the absolute
+global step size is still up to you.
+
+321
+00:23:12,190 --> 00:23:18,710
+[Question] Yes, effectively it's
+doing what you're describing,
+
+322
+00:23:18,710 --> 00:23:23,038
+because it ends up forgetting
+gradients from a very long time ago, and
+
+323
+00:23:23,038 --> 00:23:27,750
+its expression at time t is
+really only a function of the last few
+
+324
+00:23:27,750 --> 00:23:36,480
+gradients, in an exponentially
+decaying weighted sum. We're going to
+
+325
+00:23:36,480 --> 00:23:43,819
+go into one last update in a bit.
+
+326
+00:23:43,819 --> 00:24:03,039
+[Question] That would be similar to the exponentially
+weighted way, except you want to have a
+
+327
+00:24:03,039 --> 00:24:09,789
+hard, finite window on this? I don't think
+people have really tried that. You could, yeah, but it
+
+328
+00:24:09,789 --> 00:24:19,889
+takes too much memory when you're
+optimizing networks — we'll see that some have, for
+
+329
+00:24:19,890 --> 00:24:23,560
+example, 240 million parameters, so that's
+taking up quite a lot of memory, and
+
+330
+00:24:23,559 --> 00:24:29,659
+you don't want to keep track of 10
+previous gradients as well. OK, then we're
+
+331
+00:24:29,660 --> 00:24:37,540
+going to go on. [Question] Sure — what if you
+combine AdaGrad and momentum? Thank you for the
+
+332
+00:24:37,539 --> 00:24:45,269
+question, that's exactly this slide. So
+roughly what's happening is: Adam is this
+
+333
+00:24:45,269 --> 00:24:49,119
+last update. It was actually
+proposed very recently, and it has
+
+334
+00:24:49,119 --> 00:24:52,959
+elements of both. As you'll notice, the
+momentum part is keeping track of the
+
+335
+00:24:52,960 --> 00:24:57,190
+first moment of your
+gradient: it sums up the raw gradients,
+
+336
+00:24:57,190 --> 00:25:02,350
+keeping this exponential sum, and the
+AdaGrad part is keeping track of the second
+
+337
+00:25:02,349 --> 00:25:07,869
+moment of the gradient. And what you end
+up with in the Adam update is
+
+338
+00:25:07,869 --> 00:25:13,389
+a step that's basically —
+yeah, it's kind of like
+
+339
+00:25:13,390 --> 00:25:16,980
+RMSProp with momentum, a bit. You
+end up with this thing where
+
+340
+00:25:16,980 --> 00:25:21,650
+you basically keep track of this
+velocity in a decaying way, and that's
+
+341
+00:25:21,650 --> 00:25:25,420
+your step, but then you're also scaling it
+down by this exponentially built-up
+
+342
+00:25:25,420 --> 00:25:29,490
+leaky counter of your squared gradients,
+and so you end up with both in the same
+
+343
+00:25:29,490 --> 00:25:36,009
+formula. That's the update combining those
+two: you're doing both momentum and
+
+344
+00:25:36,009 --> 00:25:41,759
+you're also doing this adaptive scaling.
+And let's see — so here's RMSProp;
+
+345
+00:25:41,759 --> 00:25:44,789
+actually I should have flashed this
+earlier so we can compare. This,
+
+346
+00:25:44,789 --> 00:25:46,339
+in red, is basically RMSProp:
+
+347
+00:25:46,339 --> 00:25:52,079
+it's the same thing as here, except
+we've replaced dx, which there was just
+
+348
+00:25:52,079 --> 00:25:56,220
+the raw current gradient. We're
+replacing this gradient dx
+
+349
+00:25:56,220 --> 00:25:56,630
+with m,
+
+350
+00:25:56,630 --> 00:26:01,170
+which is this running counter of dx.
+So one way to
+
+351
+00:26:01,170 --> 00:26:04,090
+look at it is: you're in a stochastic
+setting, you're sampling mini-batches,
+
+352
+00:26:04,089 --> 00:26:07,359
+there's going to be lots of randomness in the
+forward pass, and you get all these noisy
+
+353
+00:26:07,359 --> 00:26:10,990
+gradients. So instead of using the exact
+gradient at every single time step, we're
+
+354
+00:26:10,990 --> 00:26:14,309
+actually going to be using this decayed
+sum of previous gradients, and that can
+
+355
+00:26:14,309 --> 00:26:19,139
+stabilize your gradient direction a bit;
+that's the function of the momentum
+
+356
+00:26:19,140 --> 00:26:23,720
+part here. And the scaling here is to make
+sure that the step sizes work out relative
+
+357
+00:26:23,720 --> 00:26:29,940
+to each other in steep and shallow directions.
+[Question] The betas are
+
+358
+00:26:29,940 --> 00:26:31,269
+hyperparameters:
+
+359
+00:26:31,269 --> 00:26:36,119
+beta1 is usually 0.9, beta2 usually 0.995,
+
+360
+00:26:36,119 --> 00:26:42,869
+somewhere around there. So they're hyperparameters
+to cross-validate. In my own work I've found
+
+361
+00:26:42,869 --> 00:26:45,719
+these are relatively robust
+settings; I don't usually
+
+362
+00:26:45,720 --> 00:26:50,690
+end up tuning them, I just set them
+to those values, but you can play
+
+363
+00:26:50,690 --> 00:27:04,259
+with them a bit and sometimes it can help.
+[Question] For the momentum part, we saw that
+
+364
+00:27:04,259 --> 00:27:08,789
+Nesterov works better — can we do that here?
+Yes, you can; I actually just read about
+
+365
+00:27:08,789 --> 00:27:12,849
+this yesterday — actually it wasn't a
+paper, it was a project report from CS229.
+
+366
+00:27:12,849 --> 00:27:17,149
+I'm not sure if there's a
+paper about it, but you can
+
+367
+00:27:17,150 --> 00:27:20,250
+play with that; it's simply not
+being done here.
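+
+(As described so far, Adam roughly looks like this sketch — same assumed toy
+names; note, as the next remark says, this version is still incomplete:)
+
+import numpy as np
+
+df = lambda x: np.array([0.02, 2.0]) * x
+x, m, v = np.array([10.0, 1.0]), np.zeros(2), np.zeros(2)
+beta1, beta2, learning_rate = 0.9, 0.995, 0.1
+for _ in range(100):
+    dx = df(x)
+    m = beta1 * m + (1 - beta1) * dx      # momentum-like decaying sum of gradients
+    v = beta2 * v + (1 - beta2) * dx**2   # RMSProp-like decaying sum of squares
+    x += -learning_rate * m / (np.sqrt(v) + 1e-7)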
+
+368
+00:27:20,250 --> 00:27:25,759
+OK, and one more thing: I have to
+make Adam slightly more complex here,
+
+369
+00:27:25,759 --> 00:27:30,849
+because as you see it, it's incomplete. So
+let me put up the complete version of Adam:
+
+370
+00:27:30,849 --> 00:27:33,949
+there's one more thing you might be
+confused by when you see it. There's this
+
+371
+00:27:33,950 --> 00:27:38,220
+thing called a bias correction
+inserted there, and this bias correction —
+
+372
+00:27:38,220 --> 00:27:40,920
+the reason I'm expanding out the loop is
+that the bias correction depends on your
+
+373
+00:27:40,920 --> 00:27:46,940
+absolute time step t; you see t is used here.
+And the reason for that — what this is
+
+374
+00:27:46,940 --> 00:27:49,730
+doing is kind of a minor point, and
+I don't want you to be confused about it
+
+375
+00:27:49,730 --> 00:27:54,049
+too much — is that it's compensating
+for the fact that m and v
+
+376
+00:27:54,049 --> 00:27:58,659
+are initialized at zero, so their statistics are
+incorrect in the beginning. What it's doing is
+
+377
+00:27:58,660 --> 00:28:01,269
+really just scaling up your m and v in
+
+378
+00:28:01,269 --> 00:28:04,250
+the first few iterations, so you
+don't end up with a very biased
+
+379
+00:28:04,250 --> 00:28:07,359
+estimate of the first and the second
+moment. So don't worry about that
+
+380
+00:28:07,359 --> 00:28:11,279
+too much; this is only changing
+your update in the very first
+
+381
+00:28:11,279 --> 00:28:15,190
+few time steps, as Adam is warming
+up, and it's done in a proper
+
+382
+00:28:15,190 --> 00:28:18,210
+way in terms of the statistics of m and v.
+
+383
+00:28:18,210 --> 00:28:23,380
+I won't go too much into that. OK, so we
+talked about several different updates,
+
+384
+00:28:23,380 --> 00:28:26,710
+and we saw that all these updates still have
+this learning rate hyperparameter in them,
+
+385
+00:28:26,710 --> 00:28:31,279
+so I just want to briefly talk about
+the fact that all of them still require a
+
+386
+00:28:31,279 --> 00:28:34,369
+learning rate. We saw what happens with
+different learning rates, for all
+
+387
+00:28:34,369 --> 00:28:37,639
+of these methods, and the question I'd
+like to pose is: which one of these
+
+388
+00:28:37,640 --> 00:28:47,290
+learning rates is best to use?
+
+389
+00:28:47,289 --> 00:28:55,509
+So when you're training neural networks —
+this is a slide about learning rate decay —
+
+390
+00:28:55,509 --> 00:28:59,819
+the trick answer is that none of those
+is the one learning rate to use. What
+
+391
+00:28:59,819 --> 00:29:04,259
+you should do is use the high
+learning rate first, because it optimizes
+
+392
+00:29:04,259 --> 00:29:07,869
+faster than the lower learning rates;
+you make very fast progress. But
+
+393
+00:29:07,869 --> 00:29:10,779
+at some point you're going to be too
+stochastic and you can't converge into
+
+394
+00:29:10,779 --> 00:29:13,829
+your minimum very nicely, because you
+have too much energy in your system and
+
+395
+00:29:13,829 --> 00:29:17,869
+you can't settle down into the nice narrow
+parts of your loss function. So what
+
+396
+00:29:17,869 --> 00:29:21,399
+you do then is you decay your learning
+rate, and then you can kind of ride this
+
+397
+00:29:21,400 --> 00:29:26,269
+envelope of decreasing learning rates and
+do better than all of them. There are many
+
+398
+00:29:26,269 --> 00:29:28,670
+different ways that people decay
+learning rates over time, and you should
+
+399
+00:29:28,670 --> 00:29:32,400
+also decay it in your assignment. There's
+step decay, which is kind of the
+
+400
+00:29:32,400 --> 00:29:36,810
+simplest one perhaps: after one epoch
+of training — an epoch means you've
+
+401
+00:29:36,809 --> 00:29:41,619
+seen every single training sample one
+time — so after one epoch, you decay the
+
+402
+00:29:41,619 --> 00:29:45,219
+learning rate by, say, 0.9 or
+something like that. You can also use
+
+403
+00:29:45,220 --> 00:29:49,600
+exponential decay, or 1/t decay;
+there are several of them, and they're
+
+404
+00:29:49,599 --> 00:29:54,379
+in the notes, slightly expanding on some of
+the theoretical properties that are proven
+
+405
+00:29:54,380 --> 00:29:58,260
+about these different decays. Unfortunately
+not many of them apply, because I think
+
+406
+00:29:58,259 --> 00:30:01,150
+they're mostly from the convex optimization
+literature and we're dealing with very
+
+407
+00:30:01,150 --> 00:30:05,160
+different objectives. But usually in
+practice I just use step decay or something
+
+408
+00:30:05,160 --> 00:30:12,330
+like that. Was there a question?
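+
+(Putting the bias correction together with the earlier pieces, and folding in
+a simple step decay of the learning rate as just discussed — again a sketch
+with assumed toy names, not the slide's exact code:)
+
+import numpy as np
+
+df = lambda x: np.array([0.02, 2.0]) * x
+x, m, v = np.array([10.0, 1.0]), np.zeros(2), np.zeros(2)
+beta1, beta2, learning_rate = 0.9, 0.995, 0.1
+for t in range(1, 101):                  # bias correction needs the absolute step t
+    dx = df(x)
+    m = beta1 * m + (1 - beta1) * dx
+    v = beta2 * v + (1 - beta2) * dx**2
+    mt = m / (1 - beta1**t)              # scales m up while it warms up from zero
+    vt = v / (1 - beta2**t)              # same warm-up correction for v
+    x += -learning_rate * mt / (np.sqrt(vt) + 1e-7)
+    if t % 25 == 0:
+        learning_rate *= 0.9             # e.g. a simple step decay (assumed schedule)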
+
+409
+00:30:12,329 --> 00:30:25,259
+[Question] About not committing to any one of these
+decay schedules, but switching between them during training?
+
+410
+00:30:25,259 --> 00:30:28,470
+Yeah, I don't think that's the standard
+at all.
+
+411
+00:30:28,470 --> 00:30:32,990
+It's an interesting point; I'm not
+sure when you'd want to do that, yeah.
+
+412
+00:30:32,990 --> 00:30:37,839
+It's not clear to me. You could try it;
+it's something to try in practice. I'd like to
+
+413
+00:30:37,839 --> 00:30:42,079
+make the point that — I find, at least,
+in practice right now — Adam is
+
+414
+00:30:42,079 --> 00:30:46,189
+usually the nice default to go with.
+So I use Adam for everything now, and it
+
+415
+00:30:46,190 --> 00:30:49,840
+seems to work quite well, better than
+momentum or RMSProp or
+
+416
+00:30:49,839 --> 00:30:56,638
+anything like that. So these are all
+first-order methods, as we call them, because they
+
+417
+00:30:56,638 --> 00:31:00,579
+only use the gradient information of
+your loss function: we've evaluated
+
+418
+00:31:00,579 --> 00:31:03,720
+the gradient, so we basically know the
+slope in every single direction, and
+
+419
+00:31:03,720 --> 00:31:05,710
+that's the only thing that we use.
+
+420
+00:31:05,710 --> 00:31:09,600
+There's an entire set of second-order
+methods for optimization that you should
+
+421
+00:31:09,599 --> 00:31:13,168
+be aware of. In second-order optimization —
+I don't want to go into too much detail —
+
+422
+00:31:13,169 --> 00:31:17,919
+you end up forming a larger
+approximation to your loss function:
+
+423
+00:31:17,919 --> 00:31:20,820
+you don't only approximate it with this
+hyperplane of which way it
+
+424
+00:31:20,819 --> 00:31:26,069
+is sloping, but you also approximate it by
+the Hessian, which is telling you how the
+
+425
+00:31:26,069 --> 00:31:29,710
+surface is curving. So you don't only need
+the gradient, you also need the Hessian, and you
+
+426
+00:31:29,710 --> 00:31:36,808
+need to compute that as well. You
+may have seen, say in
+
+427
+00:31:36,808 --> 00:31:38,500
+CS229, the
+
+428
+00:31:38,500 --> 00:31:44,190
+Newton method: it's basically giving you an
+update where, once you've formed your bowl-
+
+429
+00:31:44,190 --> 00:31:47,259
+like Hessian approximation to your
+objective, you can use this update
+
+430
+00:31:47,259 --> 00:31:54,259
+to jump directly to the minimum
+of that approximation. So
+
+431
+00:31:54,259 --> 00:31:58,490
+what's nice about second-order methods?
+Why do people like and use them,
+
+432
+00:31:58,490 --> 00:32:02,099
+especially the Newton method as
+presented here? What's nice about this
+
+433
+00:32:02,099 --> 00:32:05,399
+update, for convergence?
+
+434
+00:32:05,400 --> 00:32:13,410
+You'll notice: no learning rate, no
+hyperparameter, in this update. And that's
+
+435
+00:32:13,410 --> 00:32:17,220
+because you don't only see your gradient
+at this point of the loss function,
+
+436
+00:32:17,220 --> 00:32:20,480
+you also know the curvature at that
+place. And so if you approximate it with
+
+437
+00:32:20,480 --> 00:32:23,920
+this bowl, you know exactly where to go:
+to the minimum of your approximation. So
+
+438
+00:32:23,920 --> 00:32:26,900
+there's no need for a learning rate; you
+can jump directly to the minimum of the
+
+439
+00:32:26,900 --> 00:32:30,610
+approximating bowl. So that's a very nice
+feature — I think those are the two points I
+
+440
+00:32:30,609 --> 00:32:32,969
+had in mind: you have fast convergence
+because you're using second-order
+
+441
+00:32:32,970 --> 00:32:38,839
+information as well. Now, why is it
+impractical to use this update in
+
+442
+00:32:38,839 --> 00:32:47,069
+training neural networks? The issue of
+course is the Hessian: say you have a hundred-
+
+443
+00:32:47,069 --> 00:32:48,500
+million-parameter network —
+
+444
+00:32:48,500 --> 00:32:52,299
+the Hessian is a hundred-million by hundred-million
+matrix, and then you want to invert it,
+
+445
+00:32:52,299 --> 00:32:59,259
+so good luck with that. This is not going
+to happen. So there are several
+
+446
+00:32:59,259 --> 00:33:02,480
+algorithms I'd just like you to be aware
+of; you're not going to use them in this
+
+447
+00:33:02,480 --> 00:33:05,650
+class. There's something
+called BFGS, which basically
+
+448
+00:33:05,650 --> 00:33:08,360
+lets you get away with not inverting
+the Hessian: you build up an
+
+449
+00:33:08,359 --> 00:33:11,819
+approximation of the Hessian through
+successive updates that are all rank
+
+450
+00:33:11,819 --> 00:33:15,000
+one, and it kind of builds up the Hessian,
+but you still need to store it
+
+451
+00:33:15,000 --> 00:33:18,279
+in memory, so it's still no good for large
+networks. And then there's something
+
+452
+00:33:18,279 --> 00:33:22,710
+called L-BFGS, short for limited-memory BFGS,
+which does not actually store the full
+
+453
+00:33:22,710 --> 00:33:26,980
+Hessian or its approximation in memory, and
+that's what people use in practice
+
+454
+00:33:26,980 --> 00:33:33,549
+sometimes. Now, L-BFGS you'll see sometimes
+mentioned in the optimization literature,
+
+455
+00:33:33,549 --> 00:33:37,769
+and it works really, really
+well if you have a single, small,
+
+456
+00:33:37,769 --> 00:33:42,450
+deterministic function, like a bowl:
+there's no stochastic noise, there's
+
+457
+00:33:42,450 --> 00:33:47,920
+no subsampling, and everything fits in your
+memory. L-BFGS can usually crush such loss
+
+458
+00:33:47,920 --> 00:33:53,200
+functions very easily. What's tricky
+is to extend L-BFGS to very, very
+
+459
+00:33:53,200 --> 00:33:56,539
+large datasets, and the reason is that
+we're subsampling mini-batches,
+
+460
+00:33:56,539 --> 00:33:59,730
+because we can't fit all the training
+data into memory. So we subsample mini-
+
+461
+00:33:59,730 --> 00:34:02,930
+batches, and then L-BFGS works
+on these mini-batches, and its
+
+462
+00:34:02,930 --> 00:34:06,810
+approximation ends up being incorrect
+as you swap in different mini-batches.
+
+463
+00:34:06,809 --> 00:34:10,449
+It also has aspects you have to
+be careful with: you have to make
+
+464
+00:34:10,449 --> 00:34:12,539
+sure that you fix your dropout masks
+
+465
+00:34:12,539 --> 00:34:17,690
+and any other randomness in your function,
+because internally L-BFGS calls your
+
+466
+00:34:17,690 --> 00:34:20,679
+function many, many different times — it's
+doing all these approximations and line
+
+467
+00:34:20,679 --> 00:34:24,480
+searches and stuff like that; it's a very
+heavy procedure — and so you have to make
+
+468
+00:34:24,480 --> 00:34:26,668
+sure that when you use it you
+disable all sources of
+
+469
+00:34:26,668 --> 00:34:29,889
+randomness, because it's really not going
+to like it. So basically, in practice, we
+
+470
+00:34:29,889 --> 00:34:33,779
+don't use L-BFGS, because it seems
+to not work really well right
+
+471
+00:34:33,780 --> 00:34:36,970
+now compared to other methods — there's
+basically too much stuff
+
+472
+00:34:36,969 --> 00:34:41,529
+happening — and it's better to just do
+cheap, noisy SGD-type steps, but do more of them.
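+
+(For the Newton step from a couple of slides back, a tiny sketch on a toy
+quadratic where the Hessian actually is computable and invertible:)
+
+import numpy as np
+
+# toy quadratic loss f(x) = 0.5 * x.T @ A @ x, so gradient = A @ x, Hessian = A
+A = np.array([[2.0, 0.0],
+              [0.0, 0.02]])
+x = np.array([1.0, 10.0])
+grad, hess = A @ x, A
+x = x - np.linalg.solve(hess, grad)   # Newton step: no learning rate; for a
+print(x)                              # quadratic it jumps straight to the minimum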
+
+473
+00:34:41,530 --> 00:34:47,880
+That's the trade-off. So, in summary:
+Adam is a good choice, and if you can
+
+474
+00:34:47,880 --> 00:34:51,570
+afford to do full-batch updates —
+maybe your dataset is not
+
+475
+00:34:51,570 --> 00:34:55,419
+very large, it can fit in memory, and
+the forward and backward passes fit in
+
+476
+00:34:55,418 --> 00:35:00,460
+memory — then you can look into L-BFGS,
+but you won't see it used in practice in
+
+477
+00:35:00,460 --> 00:35:05,220
+large-scale settings right now, although
+it's a research direction right now.
+
+478
+00:35:05,219 --> 00:35:10,009
+So that concludes my discussion of the
+different parameter updates. Decay your
+
+479
+00:35:10,010 --> 00:35:14,830
+learning rates; we're not going to look
+into L-BFGS in this class. There's
+
+480
+00:35:14,829 --> 00:35:24,739
+a question in the very back?
+
+481
+00:35:24,739 --> 00:35:34,609
+[Question] You're asking: AdaGrad, for
+example, automatically decays your
+
+482
+00:35:34,610 --> 00:35:38,510
+learning rate over time, so would you also
+use learning rate decay if you're
+
+483
+00:35:38,510 --> 00:35:41,930
+using AdaGrad or Adam? Usually you see
+learning rate decay most commonly when you
+
+484
+00:35:41,929 --> 00:35:55,379
+use SGD or momentum; actually, I'm not sure
+you'd use it with AdaGrad or Adam. Yeah, it's
+
+485
+00:35:55,380 --> 00:36:04,900
+not a very good answer — you can
+certainly do it. But Adam will not
+
+486
+00:36:04,900 --> 00:36:08,910
+just monotonically decay your learning rate
+to zero the way AdaGrad does, because
+
+487
+00:36:08,909 --> 00:36:12,339
+it's leaky; with AdaGrad, though, also
+decaying the learning rate
+
+488
+00:36:12,340 --> 00:36:15,170
+probably does not make sense, because it's
+decayed automatically to zero in the end.
+
+489
+00:36:15,170 --> 00:36:22,710
+Alright, OK, we're going to go into
+model ensembles. I'd just very briefly like
+
+490
+00:36:22,710 --> 00:36:24,829
+to talk about it, because it's quite
+simple.
+
+491
+00:36:24,829 --> 00:36:28,750
+It turns out that if you train multiple
+independent models on your training data
+
+492
+00:36:28,750 --> 00:36:32,949
+instead of just a single one, and then
+you average their results at test time, you
+
+493
+00:36:32,949 --> 00:36:39,929
+almost always get about two percent extra
+performance. OK, this is not really a theoretical
+
+494
+00:36:39,929 --> 00:36:43,289
+result; it's kind of an empirical result,
+just what happens in practice.
+
+495
+00:36:43,289 --> 00:36:46,570
+Basically this is a good thing to
+do; it almost always works better.
+
+496
+00:36:46,570 --> 00:36:48,850
+The downside, of course, is now you have
+to have all these different independent
+
+497
+00:36:48,849 --> 00:36:52,259
+models, and you need to do forward and
+backward passes for all of them, and you
+
+498
+00:36:52,260 --> 00:36:56,850
+have to train all of them. So that's not
+ideal, and presumably you're slowed down
+
+499
+00:36:56,849 --> 00:37:00,989
+at test time linearly with the number of
+models in your ensemble. So there are some tips
+
+500
+00:37:00,989 --> 00:37:05,689
+and tricks for making ensembles
+cheaper. One approach, for
+
+501
+00:37:05,690 --> 00:37:08,619
+example: as you're training your neural
+network, you have all these different
+
+502
+00:37:08,619 --> 00:37:11,680
+checkpoints. Usually you're saving them;
+every single epoch you save a checkpoint
+
+503
+00:37:11,679 --> 00:37:14,750
+and you check what your
+validation performance was. So one thing you
+
+504
+00:37:14,750 --> 00:37:18,119
+can do, for example — and it turns out this
+actually helps sometimes — is the following.
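+
+(A sketch of the checkpoint-ensembling trick described next; checkpoints and
+predict(params, X) are assumed names, not part of the lecture's code:)
+
+import numpy as np
+
+def ensemble_predict(checkpoints, predict, X):
+    # average the predicted probabilities of several saved checkpoints
+    probs = [predict(params, X) for params in checkpoints]
+    return np.mean(probs, axis=0)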
+
+505
+00:37:18,119 --> 00:37:23,420
+You just take some different checkpoints
+of your model, and you ensemble those; that
+
+506
+00:37:23,420 --> 00:37:26,349
+actually turns out to sometimes improve
+things a bit. And that way you don't
+
+507
+00:37:26,349 --> 00:37:29,730
+have to train seven independent models;
+you've trained one, but you ensemble some of
+
+508
+00:37:29,730 --> 00:37:34,809
+its different checkpoints. Related to
+that, there's a trick —
+
+509
+00:37:34,809 --> 00:37:39,739
+let's see what's happening here: this is
+your update step as we've seen before, and
+
+510
+00:37:39,739 --> 00:37:44,709
+I'm keeping another set of parameters
+here, x_test, and this x_test is a running,
+
+511
+00:37:44,710 --> 00:37:49,590
+exponentially decaying sum of my
+actual parameter vector x. And when I use
+
+512
+00:37:49,590 --> 00:37:52,750
+x_test on validation or test data, it
+turns out that this almost always
+
+513
+00:37:52,750 --> 00:37:57,199
+performs slightly better than using x
+alone. So this is kind of doing a
+
+514
+00:37:57,199 --> 00:38:00,919
+small, weighted ensemble of the last
+few parameter vectors. It's kind
+
+515
+00:38:00,920 --> 00:38:05,309
+of difficult to interpret,
+actually, but one way to
+
+516
+00:38:05,309 --> 00:38:08,329
+get intuition about why this is
+actually a good thing to do
+
+517
+00:38:08,329 --> 00:38:12,900
+is: think about optimizing your bowl-shaped
+function while stepping too much
+
+518
+00:38:12,900 --> 00:38:16,849
+around your minimum; then actually taking
+the average of all those steps gets you
+
+519
+00:38:16,849 --> 00:38:20,980
+closer to the minimum. That's the intuition
+for why this can work slightly
+
+520
+00:38:20,980 --> 00:38:25,639
+better. So that's ensembles. I had to
+rush through that, because we're going to
+
+521
+00:38:25,639 --> 00:38:29,759
+look into dropout, and this is a very
+important technique that you will be
+
+522
+00:38:29,760 --> 00:38:34,590
+using and implementing and so on. So the
+idea of dropout is very interesting:
+
+523
+00:38:34,590 --> 00:38:38,620
+what you do with dropout is, as
+you're doing your forward pass of the
+
+524
+00:38:38,619 --> 00:38:45,429
+neural network, you randomly set some
+neurons to zero in the forward pass. Just
+
+525
+00:38:45,429 --> 00:38:49,839
+to clarify, what you do is: as you're
+doing a forward pass on your data X, you're
+
+526
+00:38:49,840 --> 00:38:52,670
+computing, say, in this function —
+
+527
+00:38:52,670 --> 00:38:57,010
+your first hidden layer is the
+nonlinearity of W1 times X plus b1, so
+
+528
+00:38:57,010 --> 00:39:02,830
+that's a hidden layer — and then I
+compute here a mask of binary numbers,
+
+529
+00:39:02,829 --> 00:39:05,230
+either 0 or 1, based on whether or not random
+
+530
+00:39:05,230 --> 00:39:09,469
+numbers between 0 and 1 are smaller
+than p, which here we set to 0.5. So
+
+531
+00:39:09,469 --> 00:39:13,469
+this U1 is a binary mask of zeros
+and ones, half and half, and then we
+
+532
+00:39:13,469 --> 00:39:17,469
+multiply it into our hidden activations,
+effectively dropping half of them. So we
+
+533
+00:39:17,469 --> 00:39:21,349
+compute all the activations of the first
+hidden layer, and then we drop half the
+
+534
+00:39:21,349 --> 00:39:25,730
+units at random; then we do the second layer,
+and then we drop half of those at random.
+
+535
+00:39:25,730 --> 00:39:30,699
+OK. And of course this is only the
+forward pass; the backward pass has to be
+
+536
+00:39:30,699 --> 00:39:35,719
+appropriately adjusted as well: these drops
+have to be backpropagated through too.
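+
+(A sketch of that two-hidden-layer forward pass with dropout masks — the toy
+shapes and weights here are assumed, mirroring the structure just described
+rather than reproducing the slide's exact code:)
+
+import numpy as np
+
+p = 0.5                                   # probability of keeping a unit
+X = np.random.randn(3)                    # toy input and toy weights
+W1, b1 = np.random.randn(4, 3), np.zeros(4)
+W2, b2 = np.random.randn(4, 4), np.zeros(4)
+W3, b3 = np.random.randn(2, 4), np.zeros(2)
+
+H1 = np.maximum(0, W1 @ X + b1)
+U1 = np.random.rand(*H1.shape) < p        # first binary mask
+H1 *= U1                                  # drop half the first hidden layer
+H2 = np.maximum(0, W2 @ H1 + b2)
+U2 = np.random.rand(*H2.shape) < p        # second binary mask
+H2 *= U2                                  # drop half the second hidden layer
+out = W3 @ H2 + b3                        # backward pass must apply U1, U2 too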
+
+537
+00:39:35,719 --> 00:39:39,309
+So remember to do that when you
+implement dropout: it's not only in
+
+538
+00:39:39,309 --> 00:39:41,980
+the forward pass that you drop; in the
+backward pass, the backpropagation
+
+539
+00:39:41,980 --> 00:39:45,829
+also multiplies by U2 and by U1, so you
+kill gradients basically in the places
+
+540
+00:39:45,829 --> 00:39:46,559
+where you dropped.
+
+541
+00:39:46,559 --> 00:39:52,179
+OK, so you might be thinking, when you
+see this for the first time: how
+
+542
+00:39:52,179 --> 00:39:56,799
+does this make any sense at all, and how is
+this a good idea? Why would you want to
+
+543
+00:39:56,800 --> 00:40:00,390
+compute your neurons and then set them to
+zero at random? It doesn't seem to make any sense.
+
+544
+00:40:00,389 --> 00:40:12,369
+So, I don't know — what do you guys think?
+[A student: to prevent overfitting.] In
+
+545
+00:40:12,369 --> 00:40:23,880
+what sense?
+
+546
+00:40:23,880 --> 00:40:27,170
+Right — so you're saying it will
+
+547
+00:40:27,170 --> 00:40:31,240
+prevent overfitting because, if I'm only
+using half of my network, then roughly I
+
+548
+00:40:31,239 --> 00:40:34,500
+have a smaller capacity. I'm only using
+half of my network at any one time,
+
+549
+00:40:34,500 --> 00:40:37,739
+and with smaller networks,
+there's only so much
+
+550
+00:40:37,739 --> 00:40:40,209
+they can do compared with the
+full network. So it's kind of
+
+551
+00:40:40,210 --> 00:40:44,798
+like controlling the variance, in
+terms of what you can represent.
+
+552
+00:40:44,798 --> 00:40:55,619
+Yeah, I'd phrase that in terms of the
+bias-variance trade-off, which I haven't
+
+553
+00:40:55,619 --> 00:40:59,480
+really gone into too much, but with
+a smaller model it's harder
+
+554
+00:40:59,480 --> 00:41:08,579
+to overfit. And as for having an ensemble of
+many different neural networks — we're going
+
+555
+00:41:08,579 --> 00:41:34,289
+to go into that point in a bit.
+
+556
+00:41:34,289 --> 00:41:38,119
+OK, I have a better way of
+phrasing that point on my next slide.
+
+557
+00:41:38,119 --> 00:41:43,028
+Let's look at a particular example.
+Suppose that we are trying to
+
+558
+00:41:43,028 --> 00:41:47,130
+compute the cat score in a neural
+network, and the idea here is that you
+
+559
+00:41:47,130 --> 00:41:51,380
+have all these different units, and
+dropout is doing the following — there are many
+
+560
+00:41:51,380 --> 00:41:54,920
+ways to look at dropout, but one of them
+is: it's forcing your code, your
+
+561
+00:41:54,920 --> 00:41:59,608
+representation of what the image was
+about, to be redundant. You need
+
+562
+00:41:59,608 --> 00:42:03,318
+that redundancy, because you're about to,
+in a way that you can't control, get half
+
+563
+00:42:03,318 --> 00:42:06,710
+of your network dropped. So you need to
+base your cat score on many more
+
+564
+00:42:06,710 --> 00:42:09,900
+features if you're going to
+correctly compute the cat score, because
+
+565
+00:42:09,900 --> 00:42:14,000
+you can't rely on any one of them —
+it might be dropped. So
+
+566
+00:42:14,000 --> 00:42:17,068
+that's one way to look at it. In this
+case we can still classify the cat
+
+567
+00:42:17,068 --> 00:42:22,639
+properly even if we don't have access to,
+say, whether it's furry. So
+
+568
+00:42:22,639 --> 00:42:24,768
+that's one interpretation of dropout.
+
+569
+00:42:24,768 --> 00:42:29,088
+Another interpretation of dropout, as was
+mentioned, is in terms of ensembles. So
+
+570
+00:42:29,088 --> 00:42:33,358
+dropout effectively can be looked at
+as training a large ensemble of models
+
+571
+00:42:33,358 --> 00:42:36,420
+that are basically sub-networks of
+
+572
+00:42:36,420 --> 00:42:43,099
+one large network, but they all share
+parameters. To
+
+573
+00:42:43,099 --> 00:42:46,650
+understand this, you have to notice the
+following: if we do a forward pass and we
+
+574
+00:42:46,650 --> 00:42:49,970
+randomly drop some of the units, then
+in the backward pass, think about what
+
+575
+00:42:49,969 --> 00:42:53,669
+happens with the gradient. Suppose
+we've randomly dropped
+
+576
+00:42:53,670 --> 00:42:57,409
+these units; in the backward pass we're
+backpropagating through the masks that
+
+577
+00:42:57,409 --> 00:43:01,879
+were induced by the dropout. So in
+particular, only the neurons that were
+
+578
+00:43:01,880 --> 00:43:05,349
+used in the forward pass will actually be
+updated, or have any gradient flowing
+
+579
+00:43:05,349 --> 00:43:09,599
+through them, because for any neuron that
+was set to zero, no gradient will flow
+
+580
+00:43:09,599 --> 00:43:13,650
+through it, and its weights to the
+previous layer will not be updated. So
+
+581
+00:43:13,650 --> 00:43:18,550
+effectively, any neuron that was dropped
+out — its connections to the previous layer
+
+582
+00:43:18,550 --> 00:43:22,750
+will not be updated, and it's as
+if it wasn't there. So really, with the
+
+583
+00:43:22,750 --> 00:43:27,230
+dropout masks you're subsampling a part
+of your neural network, and you're only
+
+584
+00:43:27,230 --> 00:43:30,789
+training that sub-network on the single
+example that you happen to have at that
+
+585
+00:43:30,789 --> 00:43:44,980
+point in time. So it's one model, and it
+gets trained on only one data point.
+
+586
+00:43:44,980 --> 00:43:51,250
+OK, I can try to repeat that.
+
+587
+00:43:51,250 --> 00:44:04,239
+[Question from somewhere here] I want to
+make sure you guys understand this.
+
+588
+00:44:04,239 --> 00:44:10,789
+OK, so when you drop a neuron —
+
+589
+00:44:10,789 --> 00:44:14,429
+if I drop it, I multiply its output
+by zero — then its effect
+
+590
+00:44:14,429 --> 00:44:17,918
+on the loss function is no effect,
+right? So its gradient is zero, because its
+
+591
+00:44:17,918 --> 00:44:21,668
+output was not used in computing the
+loss, and so its weights will not get an
+
+592
+00:44:21,668 --> 00:44:25,679
+update. So it's as if we've subsampled a
+part of the network, and we only train it
+
+593
+00:44:25,679 --> 00:44:28,959
+on the single data point that currently
+came into the network;
+
+594
+00:44:28,958 --> 00:44:32,348
+and every time we do a forward pass, we
+subsample a different part of the
+
+595
+00:44:32,349 --> 00:44:35,899
+neural network, but they all share
+parameters. So it's kind of like a weird
+
+596
+00:44:35,898 --> 00:44:39,778
+ensemble of lots of different models, each
+trained on one data point, but they all
+
+597
+00:44:39,778 --> 00:44:48,458
+share parameters. So that's roughly the
+idea here. Does that make sense?
+
+598
+00:44:48,458 --> 00:45:07,108
+[Question] Usually, yes, 50% is a very
+common choice. And note that in this
+
+599
+00:45:07,108 --> 00:45:09,798
+forward pass we actually compute all of H:
+
+600
+00:45:09,798 --> 00:45:14,009
+we compute the activations just as we did
+before, all of them, and then half of
+
+601
+00:45:14,009 --> 00:45:17,119
+the values get dropped to zero.
+
+602
+00:45:17,119 --> 00:45:29,250
+Nothing changes there. Good.
+
+603
+00:45:29,250 --> 00:45:38,349
+[Question] Could you save computation by
+only computing the rows
+
+604
+00:45:38,349 --> 00:45:42,150
+that are not being dropped? In that case
+you'd want to do sparse updates; you could
+
+605
+00:45:42,150 --> 00:45:44,950
+in theory, but I don't think that's
+usual — in practice we don't worry about
+
+606
+00:45:44,949 --> 00:46:12,369
+it too much. And so this is how dropout
+training always goes: every single iteration we
+
+607
+00:46:12,369 --> 00:46:15,469
+get a mini-batch, we sample a noise
+pattern for what we're going to drop,
+
+608
+00:46:15,469 --> 00:46:19,359
+we do the forward and backward pass, get the
+gradient, and we keep churning this over
+
+609
+00:46:19,360 --> 00:46:31,360
+and over again. [Question] So your question is:
+could you somehow cleverly choose the binary mask, in
+
+610
+00:46:31,360 --> 00:46:35,829
+a way that best optimizes the model,
+or something? Not really — I don't
+
+611
+00:46:35,829 --> 00:46:44,769
+think that's done, or that anyone has looked
+into it too much. Sorry — yes, I'm going to
+
+612
+00:46:44,769 --> 00:46:47,389
+get into that in one slide, the next slide:
+
+613
+00:46:47,389 --> 00:46:57,618
+we're going to look at test time. I'll
+take one last question.
+
+614
+00:46:57,619 --> 00:47:04,519
+[Question] Can one drop out different
+amounts in different layers? You can;
+
+615
+00:47:04,518 --> 00:47:05,459
+there's nothing stopping you.
+
+616
+00:47:05,460 --> 00:47:09,338
+Intuitively, you want to apply
+stronger dropout where you need more
+
+617
+00:47:09,338 --> 00:47:12,690
+regularization. So if there's a layer that
+has a huge number of parameters — we'll see
+
+618
+00:47:12,690 --> 00:47:16,349
+that in convnets, for example — you want
+to hit it with strong dropout there;
+
+619
+00:47:16,349 --> 00:47:20,269
+conversely, there might be some layers —
+we'll see that in convnets the early
+
+620
+00:47:20,268 --> 00:47:24,248
+convolutional layers are very small — where
+you don't really apply as much dropout
+
+621
+00:47:24,248 --> 00:47:27,368
+there. It's quite common, for example, in
+convolutional networks — we'll see this in a bit —
+
+622
+00:47:27,369 --> 00:47:30,740
+to start off with a low dropout and ramp
+it up over time. So the answer to that is
+
+623
+00:47:30,739 --> 00:47:38,848
+yes. And I forgot your second question —
+can you, instead of units, drop out just
+
+624
+00:47:38,849 --> 00:47:41,880
+individual weights? You can, and that's
+something called DropConnect. We won't
+
+625
+00:47:41,880 --> 00:47:46,349
+go into it too much in this class, but
+there's a way to do that as well.
+
+626
+00:47:46,349 --> 00:47:52,829
+Now, at test time: ideally, what you
+want to do — we've introduced all this
+
+627
+00:47:52,829 --> 00:47:56,940
+noise into the forward pass — so what
+you would like to do at test time is
+
+628
+00:47:56,940 --> 00:48:00,349
+to integrate out all that
+noise. A Monte Carlo approximation
+
+629
+00:48:00,349 --> 00:48:03,318
+to that would be something like: you have
+a test image that you'd like to classify,
+
+630
+00:48:03,318 --> 00:48:06,909
+you do many forward passes with many
+different settings of your binary masks,
+
+631
+00:48:06,909 --> 00:48:10,558
+each using a different sub-network, and then
+you average the resulting distributions, as sketched below.
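+
+(A sketch of that Monte Carlo averaging; forward_with_dropout is an assumed
+function doing one forward pass with freshly sampled dropout masks:)
+
+import numpy as np
+
+def mc_dropout_predict(forward_with_dropout, x_test, n_samples=100):
+    # average many random sub-network predictions at test time
+    outs = [forward_with_dropout(x_test) for _ in range(n_samples)]
+    return np.mean(outs, axis=0)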
+
+632
+00:48:10,559 --> 00:48:14,329
+That would be great, but
+unfortunately it's not
+
+633
+00:48:14,329 --> 00:48:17,818
+very efficient. So it turns out that you
+can actually approximate this process, to
+
+634
+00:48:17,818 --> 00:48:22,338
+some degree, as Geoff Hinton pointed out
+when he first introduced dropout. And the way
+
+635
+00:48:22,338 --> 00:48:26,170
+we'll do this — intuitively, you want to
+take advantage of all your neurons; you
+
+636
+00:48:26,170 --> 00:48:29,509
+don't want to be dropping them at random.
+We're going to see whether we
+
+637
+00:48:29,509 --> 00:48:33,548
+can leave all the neurons turned on, so
+no dropout in the forward pass on a
+
+638
+00:48:33,548 --> 00:48:39,920
+test image — but we have to actually be
+careful with how we do this.
+
+639
+00:48:39,920 --> 00:48:43,480
+In the forward pass on your test images we're
+not going to drop any units, but we have
+
+640
+00:48:43,480 --> 00:48:48,028
+to be careful with something.
+Basically, one way to see what the
+
+641
+00:48:48,028 --> 00:48:54,880
+issue is: suppose we have a neuron a, and
+it's got two inputs, x and y. Suppose
+
+642
+00:48:54,880 --> 00:48:59,079
+that all these inputs are present at
+test time, so we're not dropping units; so
+
+643
+00:48:59,079 --> 00:49:02,630
+at test time these two have some
+activations, and the neuron
+
+644
+00:49:02,630 --> 00:49:06,400
+computes some value a. We have
+to compare this
+
+645
+00:49:06,400 --> 00:49:12,608
+value of a to what the neuron's output
+would have been during training time, in
+
+646
+00:49:12,608 --> 00:49:18,440
+expectation, because at training time
+the dropout masks vary randomly, and so
+
+647
+00:49:18,440 --> 00:49:21,170
+there are many different cases that
+could have happened, and in
+
+648
+00:49:21,170 --> 00:49:27,068
+those cases the output would have a different
+scale. We have to worry about this. Let
+
+649
+00:49:27,068 --> 00:49:32,259
+me show you exactly what this means.
+Say this neuron
+
+650
+00:49:32,260 --> 00:49:35,539
+computes — say there's no nonlinearity,
+we're only looking at a linear neuron —
+
+651
+00:49:35,539 --> 00:49:39,990
+during test time, this activation becomes
+w0, which is the weight here, times
+
+652
+00:49:39,989 --> 00:49:44,848
+x, plus w1 times y. OK, so that's
+what it computes at test time. And the
+
+653
+00:49:44,849 --> 00:49:48,420
+reason I have to be careful is that
+during training time the expected output
+
+654
+00:49:48,420 --> 00:49:51,528
+of a in this particular case would have
+been quite different. We have four
+
+655
+00:49:51,528 --> 00:49:55,619
+possibilities: we could drop one or the
+other input, or both, or none. In those four
+
+656
+00:49:55,619 --> 00:49:56,720
+possibilities,
+
+657
+00:49:56,719 --> 00:50:00,750
+we compute different values, and if you actually
+crunch through this math, you'll see that when
+
+658
+00:50:00,750 --> 00:50:01,659
+you reduce it,
+
+659
+00:50:01,659 --> 00:50:07,548
+you end up with one half of (w0 times x plus
+w1 times y). So in expectation, at training
+
+660
+00:50:07,548 --> 00:50:15,630
+time, the output of this neuron was actually
+half of what it is at test time. So when you want
+
+661
+00:50:15,630 --> 00:50:19,640
+to use all the neurons at test time, you
+have to compensate for this. And this one
+
+662
+00:50:19,639 --> 00:50:22,730
+half is coming from the fact that
+we've dropped units with probability one
+
+663
+00:50:22,730 --> 00:50:29,219
+half, and so that's why this ends up being one
+half (the expectation is written out below).
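+
+(Writing the expectation out, with keep probability p = 0.5 and the linear
+neuron a = w_0 x + w_1 y; the four equally likely masks are: both inputs
+kept, only x kept, only y kept, neither kept:)
+
+\mathbb{E}[a] = \tfrac{1}{4}(w_0 x + w_1 y) + \tfrac{1}{4}(w_0 x)
+              + \tfrac{1}{4}(w_1 y) + \tfrac{1}{4}\cdot 0
+              = \tfrac{1}{2}(w_0 x + w_1 y)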
+665
+00:50:35,358 --> 00:50:39,019
+So basically, if we did not do this, we'd end up
+with outputs that are too large compared to
+
+666
+00:50:39,019 --> 00:50:42,960
+what we had in expectation during
+training time, the output distribution will change, and basically
+
+667
+00:50:42,960 --> 00:50:45,639
+things later in the network would break
+because they're not used to seeing such
+
+668
+00:50:45,639 --> 00:50:49,368
+large outputs from these neurons, so you
+have to compensate for that: you have to
+
+669
+00:50:49,369 --> 00:50:53,798
+squish things down. You're now using all your
+neurons instead of just half of them,
+
+670
+00:50:53,798 --> 00:50:57,480
+so you have to scale the
+activations down to recover your
+
+671
+00:50:57,480 --> 00:51:03,099
+expected output. Okay, this is actually a
+tricky point, but I was told once
+
+672
+00:51:03,099 --> 00:51:06,559
+a story that when Geoff Hinton came up
+with dropout in the beginning, he
+
+673
+00:51:06,559 --> 00:51:10,710
+actually didn't fully come up with this
+part, so he tried dropout and it didn't
+
+674
+00:51:10,710 --> 00:51:16,088
+work, and actually the reason it didn't
+work was that he missed this tricky
+
+675
+00:51:16,088 --> 00:51:19,340
+point, admittedly. And so you have
+to scale your activations
+
+676
+00:51:19,340 --> 00:51:24,070
+down because of this effect, and
+then everything works much better. So
+
+677
+00:51:24,070 --> 00:51:28,500
+I'm just going to show you what this
+looks like: we basically compute these
+
+678
+00:51:28,500 --> 00:51:33,449
+neural nets as normal, so we compute the
+first and second hidden layers, but now at test time we
+
+679
+00:51:33,449 --> 00:51:38,869
+have to multiply by p. So for example, if
+p is a half, the dropping probability, we scale
+
+680
+00:51:38,869 --> 00:51:43,139
+down the activations so that the
+expected output now is the
+
+681
+00:51:43,139 --> 00:51:46,969
+same as the expected output at training
+time, and so at test time you actually
+
+682
+00:51:46,969 --> 00:51:52,449
+correct for dropout: the expected outputs
+are matching, and this actually works
+
+683
+00:51:52,449 --> 00:52:18,069
+really well. [Question.] So the issue with dropout is
+just the discrepancy between train and
+
+684
+00:52:18,070 --> 00:52:20,780
+test: whether you're using all your neurons
+or dropping them, there's a discrepancy,
+
+685
+00:52:20,780 --> 00:52:24,580
+so either you can correct it at test
+time, or you can use what we call
+
+686
+00:52:24,579 --> 00:52:29,469
+inverted dropout, which I'll show you in a
+bit, so we'll get to that in a bit.
+
+687
+00:52:29,469 --> 00:52:34,319
+Dropout summary: if you want dropout,
+drop your units,
+
+688
+00:52:34,320 --> 00:52:38,210
+keeping each with a probability of p, and then
+at test time don't forget to scale them. If you do
+
+689
+00:52:38,210 --> 00:52:40,820
+this, your network will work better,
+
+690
+00:52:40,820 --> 00:52:44,190
+okay, and don't forget to also backpropagate
+through the masks, which I'm not showing.
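As a reference, here is a minimal numpy sketch of the scheme just described: drop at training time, scale by p at test time. The three-layer shapes and the names (X, W1, b1, ...) are illustrative, not taken from the slide:

    import numpy as np

    p = 0.5  # probability of keeping a unit active (assumed, as in the example)

    def train_step(X, W1, b1, W2, b2, W3, b3):
        H1 = np.maximum(0, X.dot(W1) + b1)
        U1 = np.random.rand(*H1.shape) < p   # first dropout mask
        H1 *= U1                             # drop roughly half the units
        H2 = np.maximum(0, H1.dot(W2) + b2)
        U2 = np.random.rand(*H2.shape) < p   # second dropout mask
        H2 *= U2
        return H2.dot(W3) + b3               # (backward pass must also use U1, U2)

    def predict(X, W1, b1, W2, b2, W3, b3):
        H1 = np.maximum(0, X.dot(W1) + b1) * p  # scale by p at test time so the
        H2 = np.maximum(0, H1.dot(W2) + b2) * p # expected outputs match training
        return H2.dot(W3) + b3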
+691
+00:52:44,190 --> 00:52:49,710
+Inverted dropout, by the way, what it
+does is take care of this
+
+692
+00:52:49,710 --> 00:52:53,349
+discrepancy between the train and test
+distributions in a slightly different way. In
+
+693
+00:52:53,349 --> 00:52:57,710
+particular, what we'll do is change
+things here: so before, U1 was
+
+694
+00:52:57,710 --> 00:53:01,250
+a binary mask of zeros and ones; what we're
+going to do now is do the
+
+695
+00:53:01,250 --> 00:53:04,980
+scaling here at training time. So rather than
+scaling the activations down at
+
+696
+00:53:04,980 --> 00:53:07,960
+test time, we scale them up at training time,
+because if p is 0.5 then we're
+
+697
+00:53:07,960 --> 00:53:12,079
+boosting the activations at train time by two,
+and then at test time we can leave our code
+
+698
+00:53:12,079 --> 00:53:16,029
+untouched, right? So we're doing the
+boosting of the activations at train time,
+
+699
+00:53:16,030 --> 00:53:20,880
+we're making everything artificially
+greater by 2x, and then at test time
+
+700
+00:53:20,880 --> 00:53:24,450
+we run the network as usual, and we're
+just going to recover the clean
+
+701
+00:53:24,449 --> 00:53:27,819
+expressions, because we've done the
+scaling at training time. So now you'll
+
+702
+00:53:27,820 --> 00:53:31,010
+have properly calibrated
+expectations between the train and test
+
+703
+00:53:31,010 --> 00:53:39,290
+outputs for every neuron, and it'll work,
+that's right.
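And a matching numpy sketch of inverted dropout: the division by p happens in the training-time mask, so the test-time code stays untouched. Again, the names and two-layer shape are illustrative:

    import numpy as np

    p = 0.5  # probability of keeping a unit active

    def train_step(X, W1, b1, W2, b2):
        H1 = np.maximum(0, X.dot(W1) + b1)
        U1 = (np.random.rand(*H1.shape) < p) / p  # mask scaled up by 1/p here
        H1 *= U1                                  # 2x boost at train time when p = 0.5
        return H1.dot(W2) + b2

    def predict(X, W1, b1, W2, b2):
        H1 = np.maximum(0, X.dot(W1) + b1)  # no scaling needed at test time
        return H1.dot(W2) + b2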
+704
+00:53:39,289 --> 00:53:42,779
+So use inverted dropout; that's the most common
+one to use in practice. So in fact it really
+
+705
+00:53:42,780 --> 00:53:47,300
+comes down to a few lines, and then the
+backward pass changes a bit, but networks
+
+706
+00:53:47,300 --> 00:54:15,070
+almost always work better with
+this, unless you're severely under-
+
+707
+00:54:15,070 --> 00:54:17,230
+fitting. [Question.] That's exactly right, and
+that's why, as I mentioned here,
+
+708
+00:54:17,230 --> 00:54:22,039
+this is an approximation to the
+ensemble, and one of the reasons it's an
+
+709
+00:54:22,039 --> 00:54:25,029
+approximation is because once you
+actually have nonlinearities in the picture, then
+
+710
+00:54:25,030 --> 00:54:27,769
+these expected outputs are all kind of
+screwed up because of the nonlinear
+
+711
+00:54:27,769 --> 00:54:37,500
+effects on top of these expectations. Thank
+you for pointing that out. Go ahead.
+
+712
+00:54:37,500 --> 00:54:44,769
+[Question.] I see, you're saying that
+inverted dropout and regular dropout are not
+
+713
+00:54:44,769 --> 00:54:49,039
+equivalent, so doing one or the
+other might be a problem because of the
+
+714
+00:54:49,039 --> 00:54:59,309
+nonlinearities. I'd have to think about it;
+maybe, maybe you're right, you may be
+
+715
+00:54:59,309 --> 00:55:37,949
+right here. And I think all of this is
+just about expectations: in expectation
+
+716
+00:55:37,949 --> 00:55:41,349
+you're dropping a half, and so that's the
+correct thing to use, even though there's
+
+717
+00:55:41,349 --> 00:55:44,049
+some randomness in exactly the amount
+that actually ends up being dropped.
+
+718
+00:55:44,050 --> 00:55:47,370
+Okay, great.
+
+719
+00:55:47,369 --> 00:55:51,869
+Oh yeah, I'd like to tell you a
+fun story about dropout. So I was at a
+
+720
+00:55:51,869 --> 00:55:55,509
+deep learning summer school in 2012, and
+Geoff Hinton was, for the first time, or at
+
+721
+00:55:55,510 --> 00:55:56,590
+least the first time I saw it,
+
+722
+00:55:56,590 --> 00:56:00,930
+presenting dropout. And so he's basically
+just saying, okay, set your neurons to zero at
+
+723
+00:56:00,929 --> 00:56:04,589
+random, just drop the
+activations, and this always works
+
+724
+00:56:04,590 --> 00:56:07,750
+better. And we're like, wow, that's
+interesting. And a friend of mine sitting
+
+725
+00:56:07,750 --> 00:56:10,469
+next to me, he just pulled up his laptop
+right there; he has a session to his
+
+726
+00:56:10,469 --> 00:56:13,959
+university machines, and implemented it
+right there during the talk, and by the
+
+727
+00:56:13,960 --> 00:56:17,340
+time Geoff Hinton finished the talk he was
+getting better results, getting
+
+728
+00:56:17,340 --> 00:56:18,950
+actually state-of-the-art results
+
+729
+00:56:18,949 --> 00:56:25,189
+on the data that he was working with. It's the
+fastest I've seen someone go and get an
+
+730
+00:56:25,190 --> 00:56:30,490
+extra 5%: it was right there and then,
+while Geoff Hinton was still giving the talk. I
+
+731
+00:56:30,489 --> 00:56:33,589
+thought that was really funny. There are
+very few times actually that something
+
+732
+00:56:33,590 --> 00:56:36,590
+like this happens. Dropout is a
+great thing, because it's one of those
+
+733
+00:56:36,590 --> 00:56:42,390
+few inventions that is very simple and it
+always just works better, and there are
+
+734
+00:56:42,389 --> 00:56:45,579
+very few of those kinds of tips and
+tricks that we've picked up, and I guess
+
+735
+00:56:45,579 --> 00:56:49,659
+the question is: how many more simple
+things like dropout are there, that
+
+736
+00:56:49,659 --> 00:56:50,879
+just give you a two percent boost,
+
+737
+00:56:50,880 --> 00:56:54,140
+always? So, we don't know.
+
+738
+00:56:54,139 --> 00:57:01,199
+Okay, so I was going to go on at this point
+into gradient checking, but I think I...
+
+739
+00:57:01,199 --> 00:57:04,588
+actually, I decided I'm going to skip this,
+because I'm tired of all the neural
+
+740
+00:57:04,588 --> 00:57:07,130
+network stuff; we've been talking about
+lots of details of training neural
+
+741
+00:57:07,130 --> 00:57:10,180
+networks, and I think you guys are tired as
+well, and so I'm going to skip gradient
+
+742
+00:57:10,179 --> 00:57:13,469
+checking, because it's quite well
+described in the notes. I encourage you
+
+743
+00:57:13,469 --> 00:57:19,028
+to go through it; it's kind of a tricky
+process, it takes a bit of time to
+
+744
+00:57:19,028 --> 00:57:23,190
+appreciate all the difficulties with the
+process, and so just read through it. I
+
+745
+00:57:23,190 --> 00:57:27,250
+don't think there's anything I can do
+in lecture to make it more interesting to
+
+746
+00:57:27,250 --> 00:57:29,469
+you, so I would encourage you to just
+check it out.
+
+747
+00:57:29,469 --> 00:57:33,118
+Meanwhile, we're going to jump right ahead
+into convolutional networks and
+
+748
+00:57:33,119 --> 00:57:42,358
+look at pictures. So they look like this: this
+is LeNet-5, from 1998
+
+749
+00:57:42,358 --> 00:57:46,538
+roughly, and we're going to go into
+details of how convolutional networks work,
+
+750
+00:57:46,539 --> 00:57:49,609
+and in this class we're not actually
+going to do any of the low-level details;
+
+751
+00:57:49,608 --> 00:57:52,768
+I'm just going to try to give you
+intuition about how this field came about,
+
+752
+00:57:52,768 --> 00:57:56,868
+some of its historical context, and just
+talk about convnets in general. So if
+
+753
+00:57:56,869 --> 00:57:59,559
+you'd like to talk about the history of
+convolutional networks, you have to go back
+
+754
+00:57:59,559 --> 00:58:04,910
+to roughly the 1960s, to the experiments
+of Hubel and Wiesel. So in particular,
+
+755
+00:58:04,909 --> 00:58:10,449
+they were studying the primary visual
+cortex in the cat: they were recording in an
+
+756
+00:58:10,449 --> 00:58:14,710
+early visual area of the cat brain as
+the cat was looking at patterns on a
+
+757
+00:58:14,710 --> 00:58:19,500
+screen, and they ended up actually
+winning a Nobel Prize sometime
+
+758
+00:58:19,500 --> 00:58:23,449
+later for these experiments. I'd like
+to show you what these experiments look
+759
+00:58:23,449 --> 00:58:27,518
+like, just because they're really fun to look
+at, so I pulled up a video here.
+
+760
+00:58:27,518 --> 00:58:32,258
+Let's see. What's going on here is the cat
+is fixed in position and we're recording
+
+761
+00:58:32,259 --> 00:58:35,900
+from its cortex, somewhere in the early visual
+processing area, which is in the back of your
+
+762
+00:58:35,900 --> 00:58:39,809
+brain, called V1, and now we're showing
+different light patterns to the cat and
+
+763
+00:58:39,809 --> 00:58:43,519
+we're recording and seeing how the neurons
+fire for different stimuli. Let's look at
+
+764
+00:58:43,518 --> 00:58:48,039
+what this experiment looks like
+
+765
+00:58:48,039 --> 00:59:14,050
+here.
+
+766
+00:59:14,050 --> 00:59:27,410
+[Video plays.] In experiments like these they found
+cells, and the cells seem to turn on for edges in a
+
+767
+00:59:27,409 --> 00:59:30,279
+particular orientation: they get
+excited about edges in one
+
+768
+00:59:30,280 --> 00:59:36,360
+orientation, and another orientation
+does not excite them. And so, going
+
+769
+00:59:36,360 --> 00:59:42,150
+through a long process (it's like a 10-minute
+video, so we're not going to watch it for
+
+770
+00:59:42,150 --> 00:59:45,450
+a long time), as they experimented they came
+up with a model of how the visual cortex
+
+771
+00:59:45,449 --> 00:59:52,349
+processes information in the brain, and
+they discovered several things that ended up
+
+772
+00:59:52,349 --> 00:59:56,059
+leading to the Nobel Prize. For example,
+they figured out that the cortex is
+
+773
+00:59:56,059 --> 00:59:56,759
+arranged
+
+774
+00:59:56,760 --> 01:00:02,570
+retinotopically, the visual cortex, and what
+that means is that, roughly speaking,
+
+775
+01:00:02,570 --> 01:00:06,920
+nearby cells in the cortex (so this
+is cortical tissue, unfolded) — nearby
+
+776
+01:00:06,920 --> 01:00:11,389
+cells in your cortex are actually processing
+nearby areas of your visual field. So
+
+777
+01:00:11,389 --> 01:00:15,049
+whatever is nearby on the retina is
+processed nearby in your brain: this
+
+778
+01:00:15,050 --> 01:00:20,510
+locality is preserved in your processing.
+And they also figured out that there was
+
+779
+01:00:20,510 --> 01:00:23,790
+an entire class of these cells,
+called the simple cells, and they
+
+780
+01:00:23,789 --> 01:00:27,659
+responded to a particular orientation of
+an edge, and then there were all these
+
+781
+01:00:27,659 --> 01:00:31,809
+other cells that had more complex
+responses, so for example some cells
+
+782
+01:00:31,809 --> 01:00:34,949
+would be turning on for a specific
+orientation but were slightly
+
+783
+01:00:34,949 --> 01:00:38,159
+translation invariant, so they don't care
+about the specific position of the edge,
+
+784
+01:00:38,159 --> 01:00:41,839
+they only cared about the
+orientation. And so they hypothesized,
+
+785
+01:00:41,840 --> 01:00:44,120
+through all of these experiments, that
+the visual cortex has this kind of
+
+786
+01:00:44,119 --> 01:00:48,269
+hierarchical organization, where you end
+up with simple cells feeding into other
+
+787
+01:00:48,269 --> 01:00:52,679
+cells called complex cells, and so on, and
+these cells are built on top of each
+
+788
+01:00:52,679 --> 01:00:56,369
+other, and the simple cells in particular
+have these relatively local receptive
+
+789
+01:00:56,369 --> 01:01:00,019
+fields, and they were building up more
+and more complex kinds of representations
+
+790
+01:01:00,019 --> 01:01:04,320
+in the brain through successive layers
+of representation. And so these are
+791
+01:01:04,320 --> 01:01:09,240
+experiments that, of course, inspired some people
+who were trying to reproduce this in
+
+792
+01:01:09,239 --> 01:01:14,649
+computers and trying to model the visual
+cortex with code. And so one of the first
+
+793
+01:01:14,650 --> 01:01:19,389
+examples of this was the neocognitron from
+Fukushima, and he basically ended up
+
+794
+01:01:19,389 --> 01:01:20,429
+setting up an
+
+795
+01:01:20,429 --> 01:01:26,710
+architecture with these local receptive
+fields, cells that basically look at a small
+
+796
+01:01:26,710 --> 01:01:31,760
+region of the input, and then he stacked
+up layers and layers of these, and so he
+
+797
+01:01:31,760 --> 01:01:34,750
+had these simple cells and then complex
+cells, simple cells, complex cells, a
+
+798
+01:01:34,750 --> 01:01:39,000
+sandwich of simple and complex cells
+building up into a hierarchy. Now back then,
+
+799
+01:01:39,000 --> 01:01:41,849
+though, in the 1980s, back-
+propagation was still not really around,
+
+800
+01:01:41,849 --> 01:01:45,380
+and so Fukushima had an unsupervised
+learning procedure for training these
+
+801
+01:01:45,380 --> 01:01:49,599
+networks, with something like a clustering scheme,
+so this was not backpropagated at the
+
+802
+01:01:49,599 --> 01:01:54,150
+time, but it had this idea of successive
+layers of small cells building up on top of
+
+803
+01:01:54,150 --> 01:02:00,039
+each other. And then Yann LeCun took these ideas
+further, and he kind of built on top of this
+
+804
+01:02:00,039 --> 01:02:04,739
+work: he kept the architectural
+layout, but what he did was actually
+
+805
+01:02:04,739 --> 01:02:09,009
+train these networks with backpropagation,
+and so for example he trained different
+
+806
+01:02:09,010 --> 01:02:12,770
+classifiers, for digits or letters and
+so on, and he trained all of it with
+
+807
+01:02:12,769 --> 01:02:16,769
+backprop, and they actually ended up
+using this in complex systems that read
+
+808
+01:02:16,769 --> 01:02:23,469
+checks and read digits from the
+postal mail service, and so on. And so
+
+809
+01:02:23,469 --> 01:02:27,239
+that actually goes back quite a long
+time ago, to the 1990s, and
+
+810
+01:02:27,239 --> 01:02:33,199
+people were using convnets back then, but
+they were quite small. Okay, and so 2012
+
+811
+01:02:33,199 --> 01:02:37,559
+is when these convnets started to get quite a
+bit bigger, so this was the paper from
+
+812
+01:02:37,559 --> 01:02:43,549
+2012 that I keep referring to, AlexNet.
+They took ImageNet, which is a
+
+813
+01:02:43,550 --> 01:02:48,200
+dataset that comes actually from our lab,
+so it's a million images with a thousand
+
+814
+01:02:48,199 --> 01:02:51,339
+classes, a huge amount of data. You take
+this model, which is roughly 60 million
+
+815
+01:02:51,340 --> 01:02:56,380
+parameters, and it's called AlexNet, based on
+the first name of Alex Krizhevsky. These
+
+816
+01:02:56,380 --> 01:02:59,260
+networks, we're going to see that they
+have names: so this is AlexNet, there's a
+
+817
+01:02:59,260 --> 01:03:05,560
+VGGNet, there's a GoogLeNet, there are
+several of these. So just like this one is
+
+818
+01:03:05,559 --> 01:03:09,630
+LeNet, we give them names. So this was AlexNet,
+and it was the one
+
+819
+01:03:09,630 --> 01:03:13,090
+that actually outperformed the other
+algorithms by quite a bit. What's
+
+820
+01:03:13,090 --> 01:03:17,530
+interesting to note historically is the
+difference between AlexNet in 2012 and
+
+821
+01:03:17,530 --> 01:03:21,850
+LeNet in the 1990s: there's
+basically very, very little difference.
+822
+01:03:21,849 --> 01:03:25,940
+When you look at these two different
+networks, this one used, I think, sigmoids
+
+823
+01:03:25,940 --> 01:03:31,789
+or tanh nonlinearities probably, and this one
+used ReLU, and it was bigger and deeper, and
+
+824
+01:03:31,789 --> 01:03:33,460
+was trained on GPU and had more data,
+
+825
+01:03:33,460 --> 01:03:38,889
+and that's basically it; that's
+roughly the only difference. And
+
+826
+01:03:38,889 --> 01:03:41,098
+so really what we've done is we've
+figured out better ways of, of course,
+
+827
+01:03:41,099 --> 01:03:45,000
+initializing them, and it works better
+with batch norm, and ReLUs work much
+
+828
+01:03:45,000 --> 01:03:49,480
+better, but other than that it was just
+scaling up both the data and compute;
+
+829
+01:03:49,480 --> 01:03:53,740
+for the most part the architecture was
+quite similar. And we've done a few more
+
+830
+01:03:53,739 --> 01:03:56,719
+tricks, like for example they used big
+filters; we'll see that we use much
+
+831
+01:03:56,719 --> 01:04:01,379
+smaller filters now. Also, this had only
+a handful of layers; we now have
+
+832
+01:04:01,380 --> 01:04:05,059
+150-layer convnets, so we've really
+scaled things up quite a bit in
+
+833
+01:04:05,059 --> 01:04:08,150
+some respects, but otherwise the basic
+concept of how you process information
+
+834
+01:04:08,150 --> 01:04:09,789
+is similar.
+
+835
+01:04:09,789 --> 01:04:15,150
+Okay, so convnets are now basically
+everywhere, so they can do all kinds of
+
+836
+01:04:15,150 --> 01:04:19,280
+things: they can classify things, of course;
+they're very good at retrieval, so if you
+
+837
+01:04:19,280 --> 01:04:24,119
+show them an image they can retrieve
+other images like it; they can also do
+
+838
+01:04:24,119 --> 01:04:29,809
+detection, so here they're detecting
+dogs or horses or people and so on;
+
+839
+01:04:29,809 --> 01:04:33,230
+this might be used, for example, in some
+German cars that all have this. On the next
+
+840
+01:04:33,230 --> 01:04:36,588
+slide: they can also do
+segmentation, so every single pixel is
+
+841
+01:04:36,588 --> 01:04:41,409
+labeled as, for example, a person or a road
+or a tree or sky or a building; segmentation
+
+842
+01:04:41,409 --> 01:04:47,529
+is used in cars, for example. Here's
+an NVIDIA Tegra, which is a small embedded
+
+843
+01:04:47,530 --> 01:04:51,480
+GPU that we can run convnets on; one reason,
+for example, this might be useful is in a
+
+844
+01:04:51,480 --> 01:04:55,480
+car, where you can identify...
+you can do perception of the
+
+845
+01:04:55,480 --> 01:04:57,219
+things around you.
+
+846
+01:04:57,219 --> 01:05:02,039
+Convnets are identifying faces: probably
+if some of your friends are tagged
+
+847
+01:05:02,039 --> 01:05:04,909
+on Facebook automatically, it's almost
+certainly, I would guess at this point,
+
+848
+01:05:04,909 --> 01:05:10,069
+a convnet. Video classification on YouTube,
+identifying what's inside YouTube videos.
+
+849
+01:05:10,070 --> 01:05:14,900
+They're used in... this is a project from
+Google that was very successful, where
+
+850
+01:05:14,900 --> 01:05:17,900
+basically Google was really interested
+in taking Street View images and
+
+851
+01:05:17,900 --> 01:05:20,809
+automatically reading out house numbers
+from them,
+
+852
+01:05:20,809 --> 01:05:25,019
+okay, and it turns out this is a perfect
+task for a convnet, so they had lots of human
+
+853
+01:05:25,019 --> 01:05:30,289
+labelers who labeled huge amounts of data,
+and then they put a giant convnet on it, and
+854
+01:05:30,289 --> 01:05:33,429
+it ended up working almost as well as a
+human. And that's the theme that we'll
+
+855
+01:05:33,429 --> 01:05:37,710
+see throughout: that this stuff works
+really, really well. They can estimate
+
+856
+01:05:37,710 --> 01:05:41,730
+poses, they can play computer games,
+
+857
+01:05:41,730 --> 01:05:46,559
+they detect all kinds of cancer
+and things like that from
+
+858
+01:05:46,559 --> 01:05:53,519
+medical images; they can read Chinese characters,
+recognize street signs; this is, I think,
+
+859
+01:05:53,519 --> 01:05:57,690
+segmentation of neural tissue. They can
+also do things that are not visual, so
+
+860
+01:05:57,690 --> 01:06:02,510
+for example they can recognize speech,
+for speech processing; they've been used
+
+861
+01:06:02,510 --> 01:06:07,780
+also for text documents, so you can feed
+text into convnets as well; they've
+
+862
+01:06:07,780 --> 01:06:11,400
+been used to recognize different
+types of galaxies; they've been used
+
+863
+01:06:11,400 --> 01:06:15,570
+in a recent Kaggle competition to
+recognize different whales. This is a
+
+864
+01:06:15,570 --> 01:06:18,420
+particular whale; there were like a hundred
+whales or something like that, and that's
+
+865
+01:06:18,420 --> 01:06:24,409
+just one specific individual, so this whale, by
+the pattern of its white spots on
+
+866
+01:06:24,409 --> 01:06:28,179
+its head, which is arranged in a particular way,
+a convnet can recognize it; it's amazing that
+
+867
+01:06:28,179 --> 01:06:32,618
+this works at all. They're used on satellite
+images quite a bit, because now there are
+
+868
+01:06:32,619 --> 01:06:35,280
+several companies that have lots of
+satellite data, so this is all analyzed
+
+869
+01:06:35,280 --> 01:06:39,530
+with large convnets; in this case it's
+finding roads, but you can also look at
+
+870
+01:06:39,530 --> 01:06:43,850
+agriculture applications and so on. They
+can also do image captioning; you might
+
+871
+01:06:43,849 --> 01:06:48,829
+have seen some of these results, my work
+included as well, where we take images and
+
+872
+01:06:48,829 --> 01:06:53,369
+caption them with whole sentences instead of
+just a single category. And they can also
+
+873
+01:06:53,369 --> 01:06:56,150
+be used for various artistic endeavors,
+
+874
+01:06:56,150 --> 01:06:59,800
+so this is something called Deep Dream,
+and we're going to go into how this
+
+875
+01:06:59,800 --> 01:07:00,350
+works.
+
+876
+01:07:00,349 --> 01:07:04,440
+Actually, you may be implementing it in your
+third assignment. Okay, maybe you will
+
+877
+01:07:04,440 --> 01:07:08,099
+implement it in your third assignment: you
+give it an image, and using the network you can
+
+878
+01:07:08,099 --> 01:07:11,349
+make it do weird stuff,
+
+879
+01:07:11,349 --> 01:07:17,380
+in particular a lot of hallucinations of
+dogs, and we're going to go into why dogs
+
+880
+01:07:17,380 --> 01:07:20,349
+appear: it has to do with the fact that
+ImageNet, which is where these networks
+
+881
+01:07:20,349 --> 01:07:25,579
+get trained, ends up having a
+lot of dogs in it, and so these networks
+
+882
+01:07:25,579 --> 01:07:28,259
+end up hallucinating dogs; it's
+kind of like they're used to certain
+
+883
+01:07:28,260 --> 01:07:32,440
+patterns, and then you show them a
+different image, and if you put
+
+884
+01:07:32,440 --> 01:07:36,710
+them in a loop with the image they'll
+hallucinate things. So we'll see how this
+
+885
+01:07:36,710 --> 01:07:42,769
+works in a bit. I'm not going to explain
+this slide, but it looks cool, so you can
+886
+01:07:42,769 --> 01:07:47,559
+imagine that a convnet is probably involved
+somewhere. I also want to point out,
+
+887
+01:07:47,559 --> 01:07:51,579
+what's interesting, there's this paper called
+"Deep neural networks rival the representation
+
+888
+01:07:51,579 --> 01:07:55,420
+of primate IT cortex for core visual
+object recognition". What they did
+
+889
+01:07:55,420 --> 01:08:00,250
+here is basically, I think this
+was a macaque monkey, and they're
+
+890
+01:08:00,250 --> 01:08:05,280
+recording from the IT area of the cortex
+here: they're recording neural
+
+891
+01:08:05,280 --> 01:08:09,030
+activations as the monkey is looking at images,
+and then they fed the same images to
+
+892
+01:08:09,030 --> 01:08:12,660
+a convolutional neural network, and what
+they're trying to do is, from the
+
+893
+01:08:12,659 --> 01:08:16,960
+convolutional network code, or from
+the population of neurons, only a sparse
+
+894
+01:08:16,960 --> 01:08:21,560
+population of the cortex, they're trying to
+perform classification of some concepts,
+
+895
+01:08:21,560 --> 01:08:25,820
+and what you see is that decoding
+from the IT cortex and classifying
+
+896
+01:08:25,819 --> 01:08:30,519
+images is almost as good as using this
+neural network from 2013, in terms of the
+
+897
+01:08:30,520 --> 01:08:35,400
+information they carry about the image:
+they're almost equal in performance
+
+898
+01:08:35,399 --> 01:08:40,279
+for classification. Perhaps even more
+striking results here: we're comparing...
+
+899
+01:08:40,279 --> 01:08:43,759
+they fed a lot of images through the
+convolutional network, and they got this
+
+900
+01:08:43,760 --> 01:08:46,720
+monkey to look at a lot of images, and then
+you look at how these images are
+
+901
+01:08:46,720 --> 01:08:48,789
+represented in the brain, or in the
+convnet.
+
+902
+01:08:48,789 --> 01:08:53,019
+So these are two spaces of representation,
+of how images are arranged in the space
+
+903
+01:08:53,020 --> 01:08:57,520
+by the convnet, and you can compare the
+similarity matrices, and statistically
+
+904
+01:08:57,520 --> 01:09:00,450
+you'll see that the IT cortex and the
+convnet
+
+905
+01:09:00,449 --> 01:09:04,099
+have basically a very, very similar
+representation; there's a mapping between
+
+906
+01:09:04,100 --> 01:09:08,440
+them. It almost seems like similar things
+are being computed: the way they arrange
+
+907
+01:09:08,439 --> 01:09:12,399
+the visual space of different concepts, and
+what's close and what's far, is very,
+
+908
+01:09:12,399 --> 01:09:16,809
+very remarkably similar to what you see
+in the brain. And so some people
+
+909
+01:09:16,810 --> 01:09:20,780
+think that this is just some evidence
+that convnets are doing something brain-
+
+910
+01:09:20,779 --> 01:09:23,769
+like, and that's very interesting. So the
+only question that remains, then, in that
+
+911
+01:09:23,770 --> 01:09:24,330
+case,
+
+912
+01:09:24,329 --> 01:09:27,210
+is: how does this all work?
+
+913
+01:09:27,210 --> 01:09:28,609
+And we'll find out in the next class.
+

diff --git a/captions/En/Lecture8_en.srt b/captions/En/Lecture8_en.srt
new file mode 100644
index 00000000..62222deb
--- /dev/null
+++ b/captions/En/Lecture8_en.srt
@@ -0,0 +1,4377 @@
+1
+00:00:00,000 --> 00:00:07,519
+Okay, let's get started. So in
+lecture today, a little bit of a break: so
+
+2
+00:00:07,519 --> 00:00:11,269
+last time we talked
+about, we saw all the parts of
+
+3
+00:00:11,269 --> 00:00:14,439
+convnets, we put everything together;
+today we're going to see some
+
+4
+00:00:14,439 --> 00:00:16,250
+applications of convnets,
+5
+00:00:16,250 --> 00:00:20,550
+and actually dive inside images and
+talk about spatial localization and
+
+6
+00:00:20,550 --> 00:00:25,550
+detection. We actually moved this
+lecture up a little bit; we had it later
+
+7
+00:00:25,550 --> 00:00:29,080
+in the schedule, but we saw a lot of you guys
+were interested in this type of project, so
+
+8
+00:00:29,079 --> 00:00:31,839
+we wanted to move it earlier, to kind of
+give you an idea of what's
+
+9
+00:00:31,839 --> 00:00:38,378
+feasible. So first, a couple administrative
+things: the project proposals were
+
+10
+00:00:38,378 --> 00:00:41,988
+due on Saturday; my inbox kind of exploded
+over the weekend, so I think most of you
+
+11
+00:00:41,988 --> 00:00:45,909
+submitted it, but if you didn't you should
+probably get on that. We're in the
+
+12
+00:00:45,909 --> 00:00:49,328
+process of looking through those; we'll
+make sure that the project proposals
+
+13
+00:00:49,329 --> 00:00:52,530
+are reasonable and that everyone submitted one,
+so we'll hopefully get back to you on
+
+14
+00:00:52,530 --> 00:01:02,149
+your projects this week. Also, homework two
+is due on Friday, so who's done, who's
+
+15
+00:01:02,149 --> 00:01:04,519
+stuck on batch norm?
+
+16
+00:01:04,519 --> 00:01:09,820
+Okay, good, good, that's fewer hands than
+we saw last week, so we're making
+
+17
+00:01:09,819 --> 00:01:13,688
+progress. Also keep in mind that we're
+asking you to actually train a pretty
+
+18
+00:01:13,688 --> 00:01:17,798
+big convnet on CIFAR for this homework,
+so if you're starting to train on
+
+19
+00:01:17,799 --> 00:01:22,570
+Thursday night that might be tough, so
+maybe start early on that last part. Also,
+
+20
+00:01:22,569 --> 00:01:25,618
+homework 1 we're in the process of
+grading; hopefully we'll have those back
+
+21
+00:01:25,618 --> 00:01:30,540
+to you this week, so you can get feedback before
+homework two is due. Also keep in mind
+
+22
+00:01:30,540 --> 00:01:35,450
+we actually have an in-class midterm next
+week on Wednesday, so that's a week from
+
+23
+00:01:35,450 --> 00:01:41,159
+Wednesday, so be ready; in class, it should be
+a lot of fun.
+
+24
+00:01:41,159 --> 00:01:46,359
+All right, so last lecture we were talking
+about convolutional networks: we
+
+25
+00:01:46,358 --> 00:01:50,438
+saw all the pieces, we spent a long time
+understanding how this convolution
+
+26
+00:01:50,438 --> 00:01:53,699
+operator works, how we're sort of
+transforming feature maps from one to
+
+27
+00:01:53,700 --> 00:01:58,329
+another by computing inner products as we
+slide this window over the map,
+
+28
+00:01:58,328 --> 00:02:01,809
+computing products and actually
+transforming our representation through
+
+29
+00:02:01,810 --> 00:02:05,759
+many layers of processing, and if you
+remember, these lower
+
+30
+00:02:05,759 --> 00:02:09,299
+layers of convolutions tend to learn
+things like edges and colors, and higher
+
+31
+00:02:09,299 --> 00:02:14,790
+layers tend to learn more complex object
+parts. We talked about pooling, which is
+
+32
+00:02:14,789 --> 00:02:18,509
+used to subsample and downsize our
+feature representations inside networks;
+
+33
+00:02:18,509 --> 00:02:24,209
+that's a common ingredient we saw. We
+also did case studies on particular
+
+34
+00:02:24,209 --> 00:02:27,479
+convnet architectures, so you could see how
+these things tend to get hooked up in
+
+35
+00:02:27,479 --> 00:02:31,568
+practice. So we talked about LeNet, which
+is something from '98; it's a little five-layer
+36
+00:02:31,568 --> 00:02:35,189
+convnet that was used for digit
+recognition. We talked about AlexNet, that
+
+37
+00:02:35,189 --> 00:02:38,949
+kind of kicked off the big deep
+learning boom in 2012 by winning the Image-
+
+38
+00:02:38,949 --> 00:02:45,568
+Net competition. We talked about ZFNet, that
+won ImageNet classification in 2013 and was
+
+39
+00:02:45,568 --> 00:02:51,108
+pretty similar to AlexNet, and then we
+saw that deeper is often better for
+
+40
+00:02:51,109 --> 00:02:55,709
+classification: we looked at GoogLeNet
+and VGG, that did really well in the 2014
+
+41
+00:02:55,709 --> 00:03:00,609
+competitions, that were much, much deeper
+than AlexNet and did a lot better. And we
+
+42
+00:03:00,609 --> 00:03:05,430
+also saw this new fancy crazy thing from
+Microsoft called the ResNet, that won
+
+43
+00:03:05,430 --> 00:03:10,909
+just in December 2015, with a hundred-
+and-fifty-layer architecture. And as you
+
+44
+00:03:10,909 --> 00:03:14,579
+recall, just over the last couple years
+these different architectures have been
+
+45
+00:03:14,579 --> 00:03:19,109
+getting deeper and getting a lot better,
+but this is just for classification. So
+
+46
+00:03:19,109 --> 00:03:23,980
+now in this lecture we're going to talk
+about localization and detection, which
+
+47
+00:03:23,979 --> 00:03:28,500
+is actually another really big, important
+problem in computer vision, and this idea
+
+48
+00:03:28,500 --> 00:03:32,699
+of deeper networks doing better, we
+kind of will revisit that a lot
+
+49
+00:03:32,699 --> 00:03:37,798
+in these new tasks as well. So, so far
+in the class we've really been talking
+
+50
+00:03:37,799 --> 00:03:42,639
+about classification, which is, sort of,
+given an image we want to classify it into
+
+51
+00:03:42,639 --> 00:03:47,049
+some number of object categories; that's
+the nice basic problem in
+
+52
+00:03:47,049 --> 00:03:50,340
+computer vision that we've been using
+to understand convnets and
+
+53
+00:03:50,340 --> 00:03:53,800
+such, but there are actually a lot of other
+tasks that people work on.
+
+54
+00:03:53,800 --> 00:03:59,350
+So one of these is classification plus
+localization: now, instead of just
+
+55
+00:03:59,349 --> 00:04:03,699
+classifying an image with some
+category label, we also want to draw a
+
+56
+00:04:03,699 --> 00:04:07,349
+bounding box on the image to say where that
+class occurs.
+
+57
+00:04:07,349 --> 00:04:11,549
+Another problem people work on is
+detection: so here there's again some
+
+58
+00:04:11,550 --> 00:04:15,689
+fixed number of object categories, but we
+actually want to find all instances of
+
+59
+00:04:15,689 --> 00:04:20,238
+those categories inside the image and
+draw boxes around them. Another more recent
+
+60
+00:04:20,238 --> 00:04:24,189
+task that people have started to work on
+a bit is this crazy thing called instance
+
+61
+00:04:24,189 --> 00:04:27,490
+segmentation, where again you
+have some fixed number of object
+
+62
+00:04:27,490 --> 00:04:30,829
+categories and you want to find all
+instances of those categories in your image,
+
+63
+00:04:30,829 --> 00:04:35,319
+but instead of using a box you actually
+want to draw a little contour around, and
+
+64
+00:04:35,319 --> 00:04:37,279
+identify all the pixels
+
+65
+00:04:37,279 --> 00:04:41,549
+belonging to each instance. Instance
+segmentation is kind of crazy, so we're not
+
+66
+00:04:41,550 --> 00:04:44,710
+going to talk about that today; I just
+thought you should be aware of it, and
+67
+00:04:44,709 --> 00:04:47,959
+we're going to really focus on these
+localization and detection tasks today,
+
+68
+00:04:47,959 --> 00:04:52,009
+and the big difference between these is
+the number of objects that we're finding.
+
+69
+00:04:52,009 --> 00:04:56,250
+So in localization there's sort of one
+object, or in general a fixed number of
+
+70
+00:04:56,250 --> 00:05:00,129
+objects, whereas in detection we might
+have multiple objects, or a variable
+
+71
+00:05:00,129 --> 00:05:04,000
+number of objects, and this seems like a
+small difference, but it'll turn out to
+
+72
+00:05:04,000 --> 00:05:05,360
+actually
+
+73
+00:05:05,360 --> 00:05:10,480
+be pretty important for the architectures. So
+we're going to first talk about
+
+74
+00:05:10,480 --> 00:05:15,610
+classification and localization, because it's
+kind of the simplest. So just to recap
+
+75
+00:05:15,610 --> 00:05:16,389
+what I just said:
+
+76
+00:05:16,389 --> 00:05:21,849
+classification maps an image to a category
+label, localization is image to a box, and
+
+77
+00:05:21,850 --> 00:05:26,730
+classification plus localization means we're
+going to do both at the same time. Just to give
+
+78
+00:05:26,730 --> 00:05:30,669
+you an idea of the kinds of datasets that
+people use for this: we've
+
+79
+00:05:30,668 --> 00:05:33,849
+talked about the ImageNet
+classification challenge; ImageNet also
+
+80
+00:05:33,850 --> 00:05:37,810
+runs a classification + localization
+challenge, so here,
+
+81
+00:05:37,810 --> 00:05:42,269
+similar to the classification task,
+there's a thousand classes, and each
+
+82
+00:05:42,269 --> 00:05:46,319
+training instance in those classes
+actually has one class and several
+
+83
+00:05:46,319 --> 00:05:51,069
+bounding boxes for that class inside the
+image, and now at test time your algorithm
+
+84
+00:05:51,069 --> 00:05:55,709
+makes five guesses, where instead of your
+guesses just being class labels, it's a
+
+85
+00:05:55,709 --> 00:05:59,370
+class label together with a bounding
+box, and to get it right you need to get
+
+86
+00:05:59,370 --> 00:06:03,288
+the class label right and the bounding
+box right, where getting a bounding box
+
+87
+00:06:03,288 --> 00:06:06,589
+right just means you're close, in some-
+thing called intersection over union
+
+88
+00:06:06,589 --> 00:06:11,310
+that you don't need to care about too
+much right now. So again, as with ImageNet
+
+89
+00:06:11,310 --> 00:06:15,259
+classification, you get it right if
+one of your 5 guesses is correct, and
+
+90
+00:06:15,259 --> 00:06:18,129
+this is kind of the main dataset people
+work on for classification +
+
+91
+00:06:18,129 --> 00:06:25,159
+localization. So one really fundamental
+paradigm that's really useful when
+
+92
+00:06:25,160 --> 00:06:28,700
+thinking about localization is this idea
+of regression. So,
+
+93
+00:06:28,699 --> 00:06:31,219
+thinking back to a machine learning
+class, you kind of saw
+
+94
+00:06:31,220 --> 00:06:36,160
+classification and regression, maybe
+linear regression or something fancier,
+
+95
+00:06:36,160 --> 00:06:39,689
+and when we're talking about
+localization, it really implies we can
+
+96
+00:06:39,689 --> 00:06:42,980
+really just frame this as a regression
+problem, where we have an image that's
+
+97
+00:06:42,980 --> 00:06:46,700
+coming in, that image is going to go
+through some processing, and we're
+
+98
+00:06:46,699 --> 00:06:49,990
+eventually going to produce four
+real-valued numbers that parameterize
+
+99
+00:06:49,990 --> 00:06:53,829
+this box. There are different
+parameterizations people use; common is
+100
+00:06:53,829 --> 00:06:57,759
+the XY coordinates of the upper-left-hand
+corner and the width and height of the
+
+101
+00:06:57,759 --> 00:07:01,000
+box, but you'll see some other variants
+as well, but always four numbers for a
+
+102
+00:07:01,000 --> 00:07:04,680
+bounding box. And then there's some
+ground-truth bounding box, which again is
+
+103
+00:07:04,680 --> 00:07:08,810
+just four numbers, and now we can
+compute a loss: maybe Euclidean
+
+104
+00:07:08,810 --> 00:07:12,699
+loss is a pretty standard choice,
+between the numbers that we produced and
+
+105
+00:07:12,699 --> 00:07:16,339
+the correct numbers. And now we can just
+train this thing just like we did our
+
+106
+00:07:16,339 --> 00:07:20,489
+classification networks, where we sample
+some minibatch of data with some ground-
+
+107
+00:07:20,490 --> 00:07:24,210
+truth boxes, we propagate forward, compute
+the loss between our predictions
+
+108
+00:07:24,209 --> 00:07:29,359
+and the correct predictions, back-
+propagate, and just update the network. So
+
+109
+00:07:29,360 --> 00:07:33,250
+this paradigm is really easy, and it
+actually makes this localization task
+
+110
+00:07:33,250 --> 00:07:37,269
+actually pretty easy to implement. So
+here's a really simple recipe for how
+
+111
+00:07:37,269 --> 00:07:41,289
+you could implement classification +
+localization: so first you just download
+
+112
+00:07:41,290 --> 00:07:44,370
+some existing pretrained model, or you
+train it yourself if you're ambitious,
+
+113
+00:07:44,370 --> 00:07:48,139
+something like AlexNet, VGG, GoogLe-
+Net, all these things we talked about
+
+114
+00:07:48,139 --> 00:07:53,180
+last lecture. Now we're going to take
+those fully connected layers that were
+
+115
+00:07:53,180 --> 00:07:57,100
+producing our class scores, we're going to
+set those aside for the moment, and we're
+
+116
+00:07:57,100 --> 00:08:00,410
+going to attach a couple new fully
+connected layers to some point in the
+
+117
+00:08:00,410 --> 00:08:04,840
+network. We'll call this a
+regression head, but I mean, it's basically
+
+118
+00:08:04,839 --> 00:08:08,119
+the same thing: a couple fully
+connected layers that output some
+
+119
+00:08:08,120 --> 00:08:13,889
+real-valued numbers. Now we train this
+thing just like we trained our
+
+120
+00:08:13,889 --> 00:08:17,209
+classification network; the only
+difference is that now, instead of class
+
+121
+00:08:17,209 --> 00:08:18,359
+scores
+
+122
+00:08:18,360 --> 00:08:24,550
+and ground-truth classes, we use an L2 loss
+and ground-truth boxes instead. We train this
+
+123
+00:08:24,550 --> 00:08:28,918
+network exactly the same way. Now at test
+time we just use both heads to do
+
+124
+00:08:28,918 --> 00:08:32,218
+classification and localization: we have
+an image, we've trained the
+
+125
+00:08:32,219 --> 00:08:36,700
+classification head, we've trained
+the localization head, we pass it through,
+
+126
+00:08:36,700 --> 00:08:40,620
+we get class scores, we get boxes, and
+then we're done. Like, really, that's all
+
+127
+00:08:40,620 --> 00:08:44,259
+you need to do. So this is kind of a
+really nice, simple recipe that you guys
+
+128
+00:08:44,259 --> 00:08:50,208
+could use for classification +
+localization on your projects.
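Here's a minimal numpy sketch of the recipe's regression head and its L2 loss; "feats" stands for the features at whatever attachment point you choose, and all names and shapes are illustrative rather than from the lecture code:

    import numpy as np

    def regression_head(feats, W, b):
        # feats: (N, F) features at the attachment point; W: (F, 4); b: (4,)
        return feats.dot(W) + b                # predicted boxes (x, y, w, h)

    def l2_loss(pred_boxes, gt_boxes):
        diff = pred_boxes - gt_boxes           # (N, 4)
        N = diff.shape[0]
        loss = 0.5 * np.sum(diff ** 2) / N     # Euclidean (L2) loss
        dpred = diff / N                       # gradient w.r.t. predictions
        return loss, dpred

    # dpred then backpropagates into W, b (and optionally the whole CNN),
    # exactly like the softmax gradient from the classification head would.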
+129
+00:08:50,208 --> 00:08:54,750
+So one slight detail with this approach:
+there are sort of two main ways that
+
+130
+00:08:54,750 --> 00:08:59,990
+people do this regression task. You could
+imagine a class-agnostic regressor or a
+
+131
+00:08:59,990 --> 00:09:04,190
+class-specific regressor. You could
+imagine that, no matter what class, I'm
+
+132
+00:09:04,190 --> 00:09:07,760
+going to use the same architecture, the
+same weights in those fully connected
+
+133
+00:09:07,759 --> 00:09:11,600
+layers, to produce my bounding box; then
+you're sort of outputting
+
+134
+00:09:11,600 --> 00:09:15,379
+always four numbers, which are just the
+box, no matter the class. An
+
+135
+00:09:15,379 --> 00:09:19,139
+alternative you'll see sometimes is
+class-specific regression, where now
+
+136
+00:09:19,139 --> 00:09:23,389
+you're going to output C times four numbers;
+that's sort of like one bounding box per
+
+137
+00:09:23,389 --> 00:09:27,569
+class, and different people have found
+that sometimes these work better in
+
+138
+00:09:27,570 --> 00:09:31,269
+different cases. But, I mean,
+intuitively it kind of makes sense:
+
+139
+00:09:31,269 --> 00:09:35,470
+the way you might think
+about localizing a cat could be a little
+
+140
+00:09:35,470 --> 00:09:38,129
+bit different than the way you localize a
+train, so maybe you want to have
+
+141
+00:09:38,129 --> 00:09:42,289
+different parts of your network that are
+responsible for those things. But it's
+
+142
+00:09:42,289 --> 00:09:45,569
+pretty easy: it just changes
+the way you compute the loss a
+
+143
+00:09:45,570 --> 00:09:49,329
+little bit; you compute the loss only using
+the ground-truth class,
+
+144
+00:09:49,328 --> 00:09:52,809
+the box for the ground-truth class, but
+even that's still basically the same idea.
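A sketch of what the class-specific variant changes, under the same illustrative naming: the head outputs C boxes per image, and the loss only touches the ground-truth class's box, so the gradient is zero for every other class's box:

    import numpy as np

    def class_specific_l2_loss(pred_boxes, gt_boxes, gt_classes):
        # pred_boxes: (N, C, 4); gt_boxes: (N, 4); gt_classes: (N,) int labels
        N = pred_boxes.shape[0]
        picked = pred_boxes[np.arange(N), gt_classes]   # GT class's box only
        diff = picked - gt_boxes
        loss = 0.5 * np.sum(diff ** 2) / N
        dpred = np.zeros_like(pred_boxes)
        dpred[np.arange(N), gt_classes] = diff / N      # zero grad elsewhere
        return loss, dpred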
+145
+00:09:52,809 --> 00:09:57,750
+Another design choice here is where
+exactly you attach the regression head;
+
+146
+00:09:57,750 --> 00:10:01,360
+again, this isn't too important, and
+you'll see different people do
+
+147
+00:10:01,360 --> 00:10:05,120
+it in different ways. Some common choices
+would be to attach it right after the
+
+148
+00:10:05,120 --> 00:10:09,948
+last convolutional layer, which
+sort of means you're re-
+
+149
+00:10:09,948 --> 00:10:14,909
+initializing new fully connected layers;
+we'll see things like OverFeat and VGG
+
+150
+00:10:14,909 --> 00:10:18,909
+localization work this way. Another
+common choice is to just attach your
+
+151
+00:10:18,909 --> 00:10:22,939
+regression head after the last
+fully connected layers from the
+
+152
+00:10:22,940 --> 00:10:27,310
+classification network, and you'll see
+some other things like DeepPose and R-CNN
+
+153
+00:10:27,309 --> 00:10:31,099
+kind of work in this way, but either
+one works fine;
+
+154
+00:10:31,100 --> 00:10:38,129
+you could attach it just about anywhere
+and do something reasonable. So, as an aside,
+
+155
+00:10:38,129 --> 00:10:42,029
+we can actually generalize this
+framework to localizing more than one
+
+156
+00:10:42,029 --> 00:10:46,610
+object. So normally, with this
+classification-localization task that we
+
+157
+00:10:46,610 --> 00:10:50,440
+sort of set up, given an image we care
+about producing exactly one object
+
+158
+00:10:50,440 --> 00:10:54,620
+bounding box for the input image, but in
+some cases you might know ahead of time
+
+159
+00:10:54,620 --> 00:10:59,279
+that you always want to localize some
+fixed number of objects. So here this is
+
+160
+00:10:59,279 --> 00:11:03,730
+really easy to generalize: now your
+regression head just outputs a box for each
+
+161
+00:11:03,730 --> 00:11:07,039
+of those objects that you care about, and
+again you train the network in the same
+
+162
+00:11:07,039 --> 00:11:12,839
+way. And this idea of actually localizing
+multiple objects at the same time is pretty
+
+163
+00:11:12,840 --> 00:11:16,790
+general and pretty powerful: so for
+example, this kind of approach has been
+
+164
+00:11:16,789 --> 00:11:21,559
+used for human pose estimation. So the
+idea is we want to input a cropped,
+
+165
+00:11:21,559 --> 00:11:25,299
+close-up view of a person, and we want to
+figure out what's the pose of that
+
+166
+00:11:25,299 --> 00:11:29,789
+person. Well, people sort of generally
+have a fixed number of joints, like their
+
+167
+00:11:29,789 --> 00:11:34,370
+wrists and their neck and their elbows
+and that sort of stuff, so we just know
+
+168
+00:11:34,370 --> 00:11:39,060
+that we need to find all the joints: so
+we input our image, we run it through a
+
+169
+00:11:39,059 --> 00:11:43,829
+convolutional network, and we regress XY
+coordinates for each joint location, and
+
+170
+00:11:43,830 --> 00:11:47,490
+that gives us our answer; that actually
+lets you predict a whole human pose
+
+171
+00:11:47,490 --> 00:11:52,409
+using this sort of localization framework.
+And there's a paper from
+
+172
+00:11:52,409 --> 00:11:55,819
+Google from a year or two ago that does
+this sort of approach with a couple
+
+173
+00:11:55,820 --> 00:11:59,740
+other bells and whistles, but the basic
+idea was just regressing with a CNN to
+
+174
+00:11:59,740 --> 00:12:05,100
+these joint positions. So overall, this
+idea of localization, and treating it as
+
+175
+00:12:05,100 --> 00:12:09,769
+regression to a fixed number of objects, is
+really, really simple. So I know some of
+
+176
+00:12:09,769 --> 00:12:12,659
+you guys on your projects have been
+thinking that you want to actually run
+
+177
+00:12:12,659 --> 00:12:16,850
+detection, because you want to understand,
+like, parts of your images, or find
+
+178
+00:12:16,850 --> 00:12:21,290
+parts inside the image, and if you're
+thinking of a project along those lines,
+
+179
+00:12:21,289 --> 00:12:25,019
+I really encourage you to think about
+this localization framework instead:
+
+180
+00:12:25,019 --> 00:12:27,750
+if there's actually a fixed number of
+objects that you know you want to
+
+181
+00:12:27,750 --> 00:12:31,929
+localize in every image, you should try
+to frame it as a localization problem;
+
+182
+00:12:31,929 --> 00:12:38,129
+that tends to be a lot easier to set up.
+All right, so this simple idea of
+
+183
+00:12:38,129 --> 00:12:42,019
+localization via regression really is
+simple, and it'll actually work; I
+
+184
+00:12:42,019 --> 00:12:44,120
+would really encourage you to try it for
+projects.
+
+185
+00:12:44,120 --> 00:12:47,330
+But if you want to win competitions like
+ImageNet, you need to add a little bit
+
+186
+00:12:47,330 --> 00:12:52,330
+of other fancy stuff. So another thing
+that people do for localization is this
+
+187
+00:12:52,330 --> 00:12:56,410
+idea of sliding window. So we'll step
+through this in more detail, but the idea
+
+188
+00:12:56,409 --> 00:13:00,809
+is that you still have your
+classification-localization two-headed
+
+189
+00:13:00,809 --> 00:13:04,929
+network, but you're actually going to run it
+not once on the image, but at multiple
+
+190
+00:13:04,929 --> 00:13:08,269
+positions on the image, and you're going to
+aggregate across those different
+
+191
+00:13:08,269 --> 00:13:13,100
+positions, and you can actually do this
+in an efficient way. So to sort of
+
+192
+00:13:13,100 --> 00:13:17,290
+see more concretely how this sliding-
+window localization works, we're going to
+193
+00:13:17,289 --> 00:13:21,980
+look at the OverFeat architecture. So
+OverFeat was actually the winner of the
+
+194
+00:13:21,980 --> 00:13:25,399
+ImageNet localization challenge in
+2013.
+
+195
+00:13:25,399 --> 00:13:29,730
+This setup looks basically
+like what we saw a
+
+196
+00:13:29,730 --> 00:13:33,839
+couple slides ago: we have an AlexNet at
+the beginning, then we have a
+
+197
+00:13:33,839 --> 00:13:37,820
+classification head and a regression
+head; the classification head is spitting out
+
+198
+00:13:37,820 --> 00:13:38,740
+class scores,
+
+199
+00:13:38,740 --> 00:13:44,450
+the regression head is spitting out boxes,
+and this thing, because it's an AlexNet-
+
+200
+00:13:44,450 --> 00:13:51,120
+type of architecture, is expecting an
+input of 221 by 221. But actually we can run
+
+201
+00:13:51,120 --> 00:13:55,679
+this on larger images, and this can help
+sometimes. So suppose we have a
+
+202
+00:13:55,679 --> 00:14:02,799
+larger image, let's say 257
+by 257. Now we could imagine taking our
+
+203
+00:14:02,799 --> 00:14:06,659
+classification + localization network
+and running it just on the upper corner
+
+204
+00:14:06,659 --> 00:14:11,799
+of this image, and that'll give us some
+class score and also some regressed
+
+205
+00:14:11,799 --> 00:14:15,979
+bounding box. And we're going to
+repeat this: take our same classification
+
+206
+00:14:15,980 --> 00:14:21,820
++ localization network and run it on all
+four corners of this image, and after
+
+207
+00:14:21,820 --> 00:14:26,230
+doing so we'll end up with four regressed
+bounding boxes, one from each of those
+
+208
+00:14:26,230 --> 00:14:30,509
+four locations, together with a
+classification score for each location.
+
+209
+00:14:30,509 --> 00:14:35,700
+But we actually want just a single
+bounding box, so then they use some
+
+210
+00:14:35,700 --> 00:14:39,770
+heuristics to merge these bounding boxes and
+scores; it's a little bit ugly, and I
+
+211
+00:14:39,769 --> 00:14:42,809
+don't want to go into the details here
+(they have it in the paper), but the idea
+
+212
+00:14:42,809 --> 00:14:46,699
+is that combining and aggregating
+these boxes across multiple locations
+
+213
+00:14:46,700 --> 00:14:50,959
+can help the model sort of
+correct its own errors, and this tends to work
+
+214
+00:14:50,958 --> 00:14:55,058
+really well, and that won them
+the challenge that year.
+
+215
+00:14:55,058 --> 00:14:58,149
+But in practice they actually use many
+more than four locations.
+
+216
+00:14:58,149 --> 00:15:08,989
+[Question: does the box have to lie fully
+within the image?]
+
+217
+00:15:08,989 --> 00:15:12,939
+Well, I mean, it's actually a good point: since
+you're doing regression, you're just
+
+218
+00:15:12,938 --> 00:15:15,498
+predicting four numbers, so the box
+could be produced
+
+219
+00:15:15,499 --> 00:15:20,149
+anywhere; it doesn't have to be inside
+the image. Although, that brings up
+
+220
+00:15:20,149 --> 00:15:23,698
+a good point: when you're doing this,
+especially when you're
+
+221
+00:15:23,698 --> 00:15:27,088
+training this network in this sliding-
+window way, you actually need to shift the
+
+222
+00:15:27,089 --> 00:15:30,429
+ground-truth box a little bit, shift
+the coordinate frame, for those
+
+223
+00:15:30,428 --> 00:15:35,999
+different slices; that's kind of an ugly
+detail to worry about. Yeah. But in
+
+224
+00:15:35,999 --> 00:15:39,428
+practice they use many more than four
+image locations, and they actually do
+225
+00:15:39,428 --> 00:15:43,629
+multiple scales as well, as you can see.
+This is actually a figure from their paper:
+
+226
+00:15:43,629 --> 00:15:47,129
+on the left you see all the different
+positions where they evaluated
+
+227
+00:15:47,129 --> 00:15:52,058
+this network; in the middle you see the
+output regressed boxes, one for each of
+
+228
+00:15:52,058 --> 00:15:55,678
+those positions; on the bottom you see the
+score map for each of those positions;
+
+229
+00:15:55,678 --> 00:16:00,139
+and then, I mean, they're pretty noisy, but
+they kind of converge, they're kind of
+
+230
+00:16:00,139 --> 00:16:03,899
+generally over the bear. So they run
+this fancy aggregation method and they
+
+231
+00:16:03,899 --> 00:16:07,839
+get a final box for the bear, and they
+decide that the class is a bear, and they
+
+232
+00:16:07,839 --> 00:16:12,869
+actually won the challenge with this. But
+one problem you might anticipate is it
+
+233
+00:16:12,869 --> 00:16:15,759
+could be pretty expensive to actually
+run the network on every one of those
+
+234
+00:16:15,759 --> 00:16:20,259
+crops, but there's actually something more
+efficient we can do. So we
+
+235
+00:16:20,259 --> 00:16:23,489
+normally think of these networks as
+having convolutional layers and then
+
+236
+00:16:23,489 --> 00:16:26,048
+fully connected layers, but when you think
+about it,
+
+237
+00:16:26,048 --> 00:16:31,108
+a fully connected layer is just 4096
+numbers, right? It's just a vector. But
+
+238
+00:16:31,109 --> 00:16:34,679
+instead of thinking of it as a vector, we
+could think of it as just another
+
+239
+00:16:34,678 --> 00:16:39,269
+convolutional feature map. It's kind of crazy:
+we just reshape it to have
+
+240
+00:16:39,269 --> 00:16:45,019
+one-by-one spatial dimensions. So now the idea
+is that we can take our fully
+
+241
+00:16:45,019 --> 00:16:49,499
+connected layers and convert them into
+convolutional layers. So imagine in
+
+242
+00:16:49,499 --> 00:16:54,339
+our fully connected network we had this
+convolutional feature map, and we had one
+
+243
+00:16:54,339 --> 00:16:57,749
+weight from each element of that
+convolutional feature map to produce
+
+244
+00:16:57,749 --> 00:17:02,048
+each element of our 4096-dimensional
+vector; well, instead of thinking about
+
+245
+00:17:02,048 --> 00:17:06,288
+reshaping and having an affine layer, that's
+sort of equivalent to just having a five-
+
+246
+00:17:06,288 --> 00:17:06,970
+by-five
+
+247
+00:17:06,970 --> 00:17:10,120
+convolution. It's a little bit weird, but if
+you think about it, it should make sense
+
+248
+00:17:10,119 --> 00:17:16,318
+eventually. All right, so then we take
+this fully connected layer and turn it into a
+
+249
+00:17:16,318 --> 00:17:21,899
+five-by-five convolution; then,
+where we previously had another fully
+
+250
+00:17:21,900 --> 00:17:26,409
+connected layer going from 4096 to 4096,
+this is actually a one-by-one
+
+251
+00:17:26,409 --> 00:17:30,570
+convolution, right? That's kind of
+weird, but if you think hard and
+
+252
+00:17:30,569 --> 00:17:35,369
+work out the math on paper and go sit in a
+quiet room, you'll figure it out. And so
+
+253
+00:17:35,369 --> 00:17:38,769
+we basically can turn each of these
+fully connected layers in our network
+
+254
+00:17:38,769 --> 00:17:43,509
+into a convolutional layer, and now
+this is pretty cool, because now our
+
+255
+00:17:43,509 --> 00:17:47,589
+network is composed entirely of just
+convolutions and pooling and elementwise
+256
+00:17:47,589 --> 00:17:51,819
+operations, so now we can actually run
+the network on images of different sizes,
+
+257
+00:17:51,819 --> 00:17:56,889
+and this sort of gives us, very
+cheaply, the equivalent of
+
+258
+00:17:56,890 --> 00:18:01,840
+evaluating the network independently at
+different locations. So to kind of see
+
+259
+00:18:01,839 --> 00:18:02,609
+how that works:
+
+260
+00:18:02,609 --> 00:18:07,219
+imagine at training time you may be
+working over a 14-by-14 template; you run
+
+261
+00:18:07,220 --> 00:18:11,960
+some convolutions, and then here are our
+fully connected layers that we're now
+
+262
+00:18:11,960 --> 00:18:17,140
+re-imagining as convolutional layers, and
+we have this five-by-five conv block
+
+263
+00:18:17,140 --> 00:18:22,600
+that gets turned into these one-by-one
+spatially-sized elements. So we're
+
+264
+00:18:22,599 --> 00:18:26,449
+not showing the depth
+dimension here, but, like, this one-
+
+265
+00:18:26,450 --> 00:18:30,900
+by-one would be one by one by 4096,
+right? So we're just converting these layers
+
+266
+00:18:30,900 --> 00:18:35,259
+into convolutional layers. Now that we
+know that they're convolutions, we can
+
+267
+00:18:35,259 --> 00:18:39,700
+actually run on an input of a larger size,
+and you can see that now we've
+
+268
+00:18:39,700 --> 00:18:43,558
+added a couple extra pixels, and now when we
+actually run all these things, the
+
+269
+00:18:43,558 --> 00:18:47,869
+convolutions, we get a two-by-two output.
+But what's really cool here is that
+
+270
+00:18:47,869 --> 00:18:52,058
+we're able to share computation to make
+this really efficient: so now our output
+
+271
+00:18:52,058 --> 00:18:56,428
+is four times as big, but we've done much
+less than four times the compute, because if
+
+272
+00:18:56,429 --> 00:19:00,360
+you think about the difference between
+where we're doing computation here, the
+
+273
+00:19:00,359 --> 00:19:04,449
+only extra computation happened in these
+yellow parts. So now we're actually very
+
+274
+00:19:04,450 --> 00:19:08,610
+efficiently evaluating the network at
+many, many different positions without
+
+275
+00:19:08,609 --> 00:19:11,918
+actually spending much computation, so
+this is how they're able to evaluate
+
+276
+00:19:11,919 --> 00:19:15,240
+the network in that very, very dense,
+multiscale way that you saw a couple
+
+277
+00:19:15,240 --> 00:19:19,388
+slides ago. Does that make sense? Any questions
+on this?
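Here's a small numpy check of the fully-connected-to-convolution equivalence described above, for a hypothetical 5x5xC feature map: at a single spatial position, the reshaped FC weights act exactly like 4096 filters of size 5x5xC. The sizes C and D are assumptions for illustration:

    import numpy as np

    C, D = 16, 4096                       # hypothetical depth / FC width
    feat = np.random.randn(5, 5, C)       # a 5x5xC conv feature map
    W_fc = np.random.randn(D, 5 * 5 * C)  # FC layer producing 4096 numbers

    fc_out = W_fc.dot(feat.reshape(-1))   # the usual "flatten + affine" output

    # Reinterpret the same weights as D conv filters of size 5x5xC and
    # evaluate the convolution at the single valid position:
    W_conv = W_fc.reshape(D, 5, 5, C)
    conv_out = np.einsum('dhwc,hwc->d', W_conv, feat)

    assert np.allclose(fc_out, conv_out)  # identical computation
    # On a larger input the conv form simply slides, sharing work between
    # overlapping positions -- the trick behind the dense evaluation above.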
+288
+00:20:08,288 --> 00:20:12,878
+but they actually decreased the
+error quite a bit so basically the only
+
+289
+00:20:12,878 --> 00:20:17,868
+difference between OverFeat and VGG here
+is that VGG used the deeper network so here
+
+290
+00:20:17,868 --> 00:20:20,858
+we could see that these really powerful
+image features actually improve the
+
+291
+00:20:20,858 --> 00:20:24,098
+localization performance quite a bit
+without needing to change the localization
+
+292
+00:20:24,098 --> 00:20:28,418
+architecture at all we just swapped out
+a better CNN and it improved results a
+
+293
+00:20:28,419 --> 00:20:34,169
+lot and then this year in 2015 Microsoft
+swept everything and that'll be a theme
+
+294
+00:20:34,169 --> 00:20:39,239
+in this lecture as well this
+hundred fifty layer ResNet from Microsoft
+
+295
+00:20:39,239 --> 00:20:43,629
+crushed localization here and dropped
+the error from 25 all the way
+
+296
+00:20:43,628 --> 00:20:48,738
+down to nine but here it's a
+little bit hard to really
+
+297
+00:20:48,739 --> 00:20:52,798
+isolate the deep features so yes they
+did have deeper features but Microsoft
+
+298
+00:20:52,798 --> 00:20:56,398
+actually used a different localization
+method called RPNs region proposal
+
+299
+00:20:56,398 --> 00:21:00,699
+networks so it's not really clear
+whether it's the
+
+300
+00:21:00,700 --> 00:21:04,929
+better localization strategy or whether
+it's the better features but at any rate they
+
+301
+00:21:04,929 --> 00:21:10,139
+did really well that's pretty much all I
+want to say about classification +
+
+302
+00:21:10,138 --> 00:21:13,848
+localization just consider doing it for
+projects and if there's any questions
+
+303
+00:21:13,848 --> 00:21:19,509
+about this task we should talk about
+that now before moving on yeah
+
+304
+00:21:19,509 --> 00:21:32,890
+performance especially with the loss
+right so the L2 loss when you have
+
+305
+00:21:32,890 --> 00:21:37,050
+outliers is actually really bad so
+sometimes people don't use an L2 loss
+
+306
+00:21:37,049 --> 00:21:40,609
+instead you can try an L1 loss
+that can help with outliers a little bit
+
+307
+00:21:40,609 --> 00:21:45,279
+people also will sometimes do a smooth
+L1 loss where it looks like L1
+
+308
+00:21:45,279 --> 00:21:49,339
+out at the tails but then near zero it'll
+be quadratic so actually swapping out
+
+309
+00:21:49,339 --> 00:21:53,319
+that regression loss function can help a
+bit with outliers sometimes but also if
+
+310
+00:21:53,319 --> 00:21:56,399
+you have a little bit of noise sometimes
+you just hope for the best
+
+311
+00:21:56,400 --> 00:22:14,380
+cross your fingers and don't think too hard
+any other questions so people do both
+
+312
+00:22:14,380 --> 00:22:18,560
+actually for OverFeat I
+don't remember exactly
+
+313
+00:22:18,559 --> 00:22:23,409
+what OverFeat did but VGG actually
+backprops into the entire network so
+
+314
+00:22:23,410 --> 00:22:27,230
+it'll be faster and it'll
+actually work fine if you just train
+
+315
+00:22:27,230 --> 00:22:30,289
+the regression head but you'll tend to
+get a little bit better results if you
+
+316
+00:22:30,289 --> 00:22:34,049
+backprop into the whole network and VGG
+did this experiment and they got maybe
+
+317
+00:22:34,049 --> 00:22:37,659
+one or two points extra by backpropping
+through the whole thing.
+
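+To picture the smooth L1 loss mentioned a moment ago, here is a quick
+NumPy sketch (the delta parameter and the sample residuals are made up):
+
+```python
+import numpy as np
+
+def smooth_l1(x, delta=1.0):
+    """Smooth L1: quadratic near zero, linear (L1-like) in the tails,
+    so a single outlier contributes far less than under an L2 loss."""
+    absx = np.abs(x)
+    return np.where(absx < delta, 0.5 * absx ** 2 / delta, absx - 0.5 * delta)
+
+residuals = np.array([0.1, 0.5, 5.0])   # the 5.0 plays the outlier
+print(0.5 * residuals ** 2)             # L2 losses: the outlier costs 12.5
+print(smooth_l1(residuals))             # smooth L1: the outlier costs only 4.5
+```
+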
+318
+00:22:37,660 --> 00:22:41,320
+but at the expense of a lot more
+computation and training time so I would
+
+319
+00:22:41,319 --> 00:22:44,769
+say as a first thing just
+try not backpropping into the
+
+320
+00:22:44,769 --> 00:22:50,440
+network
+
+321
+00:22:50,440 --> 00:22:57,110
+generally not right because you're testing
+on the same classes that you saw at
+
+322
+00:22:57,109 --> 00:23:00,839
+training time you're gonna see different
+instances obviously but I mean you're
+
+323
+00:23:00,839 --> 00:23:04,759
+still seeing bears at test time if you saw
+bears at training time we're not expecting you to
+
+324
+00:23:04,759 --> 00:23:07,370
+generalize across classes that'd be pretty
+hard
+
+325
+00:23:07,369 --> 00:23:20,638
+yeah good question yes so sometimes
+people will do that they'll train with
+
+326
+00:23:20,638 --> 00:23:24,349
+both simultaneously also sometimes
+people will just end up with separate
+
+327
+00:23:24,349 --> 00:23:27,089
+networks one that's sort of only
+responsible for regression one that's
+
+328
+00:23:27,089 --> 00:23:38,089
+only responsible for classification
+those both work well glad you asked
+
+329
+00:23:38,089 --> 00:23:40,558
+that's actually the next thing
+we're gonna talk about that's a
+
+330
+00:23:40,558 --> 00:23:50,740
+different task of object detection so
+
+331
+00:23:50,740 --> 00:23:56,808
+well yeah so I mean it kinda
+depends on the training strategy it
+
+332
+00:23:56,808 --> 00:23:59,920
+also kind of goes
+back to this idea of class agnostic
+
+333
+00:23:59,920 --> 00:24:03,610
+versus class specific regression with class
+agnostic regression it doesn't matter
+
+334
+00:24:03,609 --> 00:24:06,889
+you just regress to the boxes no matter
+the class with class specific you're sort of
+
+335
+00:24:06,890 --> 00:24:13,950
+training separate regressors for each
+class right let's talk about object
+
+336
+00:24:13,950 --> 00:24:19,220
+detection so object detection is much
+fancier much cooler but also a lot
+
+337
+00:24:19,220 --> 00:24:22,890
+hairier so the idea is that again we
+have an input image we have some sort of
+
+338
+00:24:22,890 --> 00:24:26,660
+classes we want to find all instances of
+those classes in that input
+
+339
+00:24:26,660 --> 00:24:31,670
+image so I mean you know regression
+worked pretty well for localization why
+
+340
+00:24:31,670 --> 00:24:37,470
+don't we try it for detection too
+so in this image we have these dogs
+
+341
+00:24:37,470 --> 00:24:41,429
+and cats and we have four things we have
+16 numbers that looks
+
+342
+00:24:41,429 --> 00:24:46,250
+like regression right image in numbers
+out but if we look at another image then
+
+343
+00:24:46,250 --> 00:24:50,609
+you know this one only has two things
+coming out so it has eight numbers and if we
+
+344
+00:24:50,609 --> 00:24:54,589
+look at this one there's a whole bunch
+of cats we need a bunch of numbers so I
+
+345
+00:24:54,589 --> 00:24:57,519
+mean it's kind of hard to treat
+detection as straight-up regression
+
+346
+00:24:57,519 --> 00:25:01,450
+because we have this problem of variable
+size outputs so we're gonna have to do
+
+347
+00:25:01,450 --> 00:25:04,460
+something fancier although actually
+there is a method we'll talk about later
+
+348
+00:25:04,460 --> 00:25:09,539
+that sort of does this anyway and does
+treat it as regression but we'll get
+
+349
+00:25:09,539 --> 00:25:12,950
+to that later but in
+general you wanna not treat this as
+
+350
+00:25:12,950 --> 00:25:18,360
+regression because you have variable
+sized outputs so a really
+
+351
+00:25:18,359 --> 00:25:22,779
+easy way to solve this is to
+think of detection not as regression but
+
+352
+00:25:22,779 --> 00:25:25,960
+as classification right in machine
+learning regression and classification
+
+353
+00:25:25,960 --> 00:25:29,929
+are your two hammers you just want to
+use those to hit all your problems right
+
+354
+00:25:29,929 --> 00:25:34,250
+so if regression doesn't work we'll do
+classification instead we know how to
+
+355
+00:25:34,250 --> 00:25:38,558
+classify image regions we just run a CNN
+right so what we're going to do is we're gonna
+
+356
+00:25:38,558 --> 00:25:43,349
+take many of these input regions of the
+image run a classifier there and say like
+
+357
+00:25:43,349 --> 00:25:46,129
+alright this region of the image is it a
+cat no
+
+358
+00:25:46,130 --> 00:25:50,770
+is it a dog no then over a little bit we
+found a cat that's great but over a
+
+359
+00:25:50,769 --> 00:25:54,460
+little bit that's not anything so
+then we can actually just try out a
+
+360
+00:25:54,460 --> 00:25:58,558
+whole bunch of different image regions run
+a classifier on each one and this will
+
+361
+00:25:58,558 --> 00:26:02,490
+basically solve our variable size output
+problem
+
+362
+00:26:02,490 --> 00:26:11,160
+so the question is how do we
+decide
+
+363
+00:26:11,160 --> 00:26:14,558
+what the window size should be and the answer
+is we just try them all right just literally
+
+364
+00:26:14,558 --> 00:26:18,879
+try them all so that's
+actually a big problem right because we
+
+365
+00:26:18,880 --> 00:26:21,910
+need to try windows of different sizes
+at different positions at different
+
+366
+00:26:21,910 --> 00:26:25,290
+scales and if we do this properly at test
+time this is going to be really expensive right
+
+367
+00:26:25,289 --> 00:26:39,089
+there's a whole lot of places we need to
+look yeah also when you're doing this
+
+368
+00:26:39,089 --> 00:26:45,058
+you add an extra two things one you can
+add an extra class to say background and
+
+369
+00:26:45,058 --> 00:26:49,569
+say like oh there's nothing here another
+thing you can do is to actually do
+
+370
+00:26:49,569 --> 00:26:54,159
+multi-label classification where you can
+output multiple positive things right
+
+371
+00:26:54,160 --> 00:26:56,950
+that's actually pretty easy to do
+instead of a softmax loss you have
+
+372
+00:26:56,950 --> 00:27:01,390
+independent logistic
+regression losses per class so
+
+373
+00:27:01,390 --> 00:27:05,100
+that actually lets you say yes to
+multiple classes at one point but that's
+
+374
+00:27:05,099 --> 00:27:10,189
+just swapping out a loss function so
+that's pretty easy to do right so
+
+375
+00:27:10,190 --> 00:27:13,220
+actually what we see as a problem with
+this approach is that there's a whole
+
+376
+00:27:13,220 --> 00:27:17,690
+bunch of different positions we need to
+evaluate and the solution as of a couple
+
+377
+00:27:17,690 --> 00:27:21,308
+of years ago was to just
+use really fast
+
+378
+00:27:21,308 --> 00:27:26,299
+classifiers and try them all so actually
+detection is this really old problem in
+
+379
+00:27:26,299 --> 00:27:29,119
+computer vision so we should probably
+get a little bit more historical
+
+380
+00:27:29,119 --> 00:27:34,109
+perspective.
+
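+The brute-force "try every window" idea above, as a minimal sketch (the
+classify_region callable, window sizes, and stride are all hypothetical
+stand-ins for whatever classifier you plug in):
+
+```python
+import numpy as np
+
+def sliding_window_detect(image, classify_region, window_sizes, stride=16):
+    """Detection as classification: score a window at every position and
+    size; classify_region returns a (label, score) pair for a crop."""
+    H, W = image.shape[:2]
+    detections = []
+    for wh, ww in window_sizes:
+        for y in range(0, H - wh + 1, stride):
+            for x in range(0, W - ww + 1, stride):
+                label, score = classify_region(image[y:y + wh, x:x + ww])
+                if label != 'background':    # keep everything non-background
+                    detections.append((x, y, ww, wh, label, score))
+    return detections
+
+# A dummy classifier just to show the flow; real ones are the expensive part.
+dummy = lambda crop: ('background', 0.0)
+print(sliding_window_detect(np.zeros((480, 640)), dummy, [(128, 128), (64, 64)]))
+```
+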
+381
+00:27:34,109 --> 00:27:38,490
+so starting in about 2005 there was this
+really successful approach to detection
+
+382
+00:27:38,490 --> 00:27:42,039
+that used this feature representation called
+histograms of oriented gradients so if you recall
+
+383
+00:27:42,039 --> 00:27:46,609
+back to homework 1 you actually used this
+feature on the last part to actually do
+
+384
+00:27:46,609 --> 00:27:50,979
+classification as well so this was
+actually sort of the best feature that
+
+385
+00:27:50,980 --> 00:27:55,670
+we had in computer vision circa 2005
+the idea is we're just gonna put linear
+
+386
+00:27:55,670 --> 00:27:59,550
+classifiers on top of this feature and
+that's going to be our classifier so
+
+387
+00:27:59,549 --> 00:28:03,460
+linear classifiers are really fast so the
+way this works is that we compute our
+
+388
+00:28:03,460 --> 00:28:08,250
+oriented gradient feature for the whole
+image at multiple scales and we run this
+
+389
+00:28:08,250 --> 00:28:12,660
+linear classifier at every scale every
+position just do it really fast just do
+
+390
+00:28:12,660 --> 00:28:13,210
+it everywhere
+
+391
+00:28:13,210 --> 00:28:15,329
+it's a linear classifier and it's fast to evaluate
+
+392
+00:28:15,329 --> 00:28:21,029
+and this worked really well in 2005 so
+people took this idea and worked on
+
+393
+00:28:21,029 --> 00:28:25,029
+it a little bit more in the next couple
+of years so one of the most
+
+394
+00:28:25,029 --> 00:28:29,879
+important detection paradigms pre deep
+learning is this thing called DPM the
+
+395
+00:28:29,880 --> 00:28:34,470
+deformable parts model so I don't wanna
+go too much into the details here
+
+396
+00:28:34,470 --> 00:28:39,309
+but I mean the basic idea is that we're
+still working on these histogram of oriented
+
+397
+00:28:39,309 --> 00:28:42,619
+gradient features but now our model
+rather than just being a linear
+
+398
+00:28:42,619 --> 00:28:46,659
+classifier we have this
+linear sort of template for the
+
+399
+00:28:46,660 --> 00:28:51,370
+object and we also have these templates
+for parts that are allowed to sort of
+
+400
+00:28:51,369 --> 00:28:57,119
+vary over spatial positions and deform a
+little bit and they have some fancy
+
+401
+00:28:57,119 --> 00:29:01,939
+thing kind of like a latent SVM to
+learn these things and really fancy
+
+402
+00:29:01,940 --> 00:29:07,190
+dynamic programming algorithms to actually
+evaluate this thing really fast at test
+
+403
+00:29:07,190 --> 00:29:11,100
+time it's actually kind of fun if you
+enjoy algorithms this
+
+404
+00:29:11,099 --> 00:29:16,119
+part is kind of fun to think about but
+the end result is that it's a much
+
+405
+00:29:16,119 --> 00:29:19,209
+more powerful classifier that allows a
+little bit of deformability in your
+
+406
+00:29:19,210 --> 00:29:23,079
+model and you can still evaluate it
+really fast so we're still just going to
+
+407
+00:29:23,079 --> 00:29:26,490
+evaluate it everywhere every scale every
+position every aspect ratio just do it
+
+408
+00:29:26,490 --> 00:29:33,039
+everywhere it's fast and this actually
+worked really well in 2010 or around there
+
+409
+00:29:33,039 --> 00:29:37,619
+that was sort of state of the art in
+detection for many problems at the time so
+
+410
+00:29:37,619 --> 00:29:40,509
+I won't spend too much time on
+this but there was a really cool paper
+
+411
+00:29:40,509 --> 00:29:45,049
+last year that argued that these DPM
+models are actually just a certain type
+of ConvNet.
+
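+The 2005-era recipe described above, HOG features plus a fast linear
+classifier, in a minimal sketch (the crop size, HOG parameters, and toy
+labels here are all made up):
+
+```python
+import numpy as np
+from skimage.feature import hog      # HOG implementation from scikit-image
+from sklearn.svm import LinearSVC
+
+# Hypothetical data: 64x64 grayscale crops labeled object (1) / background (0).
+crops = np.random.rand(20, 64, 64)
+labels = np.array([0, 1] * 10)
+
+# Histograms of oriented gradients, one descriptor per crop...
+feats = np.array([hog(c, orientations=9, pixels_per_cell=(8, 8)) for c in crops])
+
+# ...and a linear classifier on top: cheap enough to run at every
+# position and scale of an image.
+clf = LinearSVC().fit(feats, labels)
+print(clf.decision_function(feats[:3]))
+```
+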
+412
+00:29:45,049 --> 00:29:47,480
+right and so
+
+413
+00:29:47,480 --> 00:29:51,329
+these histograms of oriented gradients
+are like little edge filters we can view
+
+414
+00:29:51,329 --> 00:29:55,539
+them as convolutions and the histogramming
+is kinda like pooling that sort of thing so if
+
+415
+00:29:55,539 --> 00:30:00,349
+you're interested check out this paper
+it's kind of fun to think about right
+
+416
+00:30:00,349 --> 00:30:02,250
+but we really want to
+
+417
+00:30:02,250 --> 00:30:06,259
+make this thing work with classifiers that
+are not fast to evaluate like maybe
+
+418
+00:30:06,259 --> 00:30:11,809
+a CNN so here this problem is
+still hard right we have many different
+
+419
+00:30:11,809 --> 00:30:14,940
+positions we want to try and we
+probably can't actually afford to try
+
+420
+00:30:14,940 --> 00:30:19,220
+them all so the solution is that we
+don't try them all we have some other
+
+421
+00:30:19,220 --> 00:30:23,380
+thing that sort of guesses where we
+want to look and then we only apply our
+
+422
+00:30:23,380 --> 00:30:28,720
+expensive classifier at that smaller
+number of locations so that idea
+
+423
+00:30:28,720 --> 00:30:35,419
+is called region proposals so a
+region proposal method is this thing
+
+424
+00:30:35,419 --> 00:30:39,900
+that takes in an image and then outputs
+a whole bunch of regions where maybe
+
+425
+00:30:39,900 --> 00:30:45,280
+possibly an object might be located so
+one way you can think about region
+
+426
+00:30:45,279 --> 00:30:48,428
+proposals is that they're kinda like a
+really fast
+
+427
+00:30:48,429 --> 00:30:53,038
+class agnostic object detector right
+they don't care about the class they're
+
+428
+00:30:53,038 --> 00:30:56,038
+not very accurate but they're pretty
+fast to run and they give us a whole
+
+429
+00:30:56,038 --> 00:31:00,769
+bunch of boxes and the general intuition
+behind these region proposal
+
+430
+00:31:00,769 --> 00:31:04,639
+methods is that they're kinda looking
+for blob-like structures in an image right
+
+431
+00:31:04,640 --> 00:31:09,740
+so like objects are generally blobby the dog
+I mean if you squint looks kinda
+
+432
+00:31:09,740 --> 00:31:13,940
+like a white blob the cat looks like a
+white blob flowers are kinda blobby the
+
+433
+00:31:13,940 --> 00:31:17,929
+eyes and nose are kinda blobby so
+with any of these region proposal methods a
+
+434
+00:31:17,929 --> 00:31:21,650
+lot of times what you'll see is they kind
+of put boxes around a lot of these
+
+435
+00:31:21,650 --> 00:31:27,820
+blobby regions in the image so probably
+the most famous region proposal method
+
+436
+00:31:27,819 --> 00:31:31,538
+is called selective search you don't
+really need to know in too much
+
+437
+00:31:31,538 --> 00:31:36,980
+detail how this works but the idea is
+that you start from your pixels and you
+
+438
+00:31:36,980 --> 00:31:40,919
+kind of merge adjacent pixels together
+if they have similar color and texture
+
+439
+00:31:40,919 --> 00:31:45,770
+to form these connected
+blob-like regions and then
+
+440
+00:31:45,769 --> 00:31:50,740
+you merge these blob-like regions to
+get bigger and bigger blobby parts and
+
+441
+00:31:50,740 --> 00:31:53,829
+then for each of these different scales
+you can actually convert each of these
+
+442
+00:31:53,829 --> 00:31:58,710
+blobby regions into a box by just drawing
+a box around it so then by doing this
+
+443
+00:31:58,710 --> 00:32:02,548
+over multiple scales you end up with a
+whole bunch of boxes around sort of a
+
+444
+00:32:02,548 --> 00:32:06,359
+lot of blobby stuff in the image and it's
+reasonably fast to compute and
+
+445
+00:32:06,359 --> 00:32:11,500
+actually cuts down the search space
+quite a lot but selective search certainly
+
+446
+00:32:11,500 --> 00:32:14,720
+isn't the only game in town it's just maybe
+the most famous there's a whole bunch
+
+447
+00:32:14,720 --> 00:32:18,319
+of different region proposal methods
+that people have developed there was
+
+448
+00:32:18,319 --> 00:32:21,509
+this paper last year that actually did a
+really cool thorough scientific
+
+449
+00:32:21,509 --> 00:32:25,890
+evaluation of all these different region
+proposal methods and sort of gave you
+
+450
+00:32:25,890 --> 00:32:29,950
+the pros and the cons of each and all
+that kind of stuff but I mean my
+
+451
+00:32:29,950 --> 00:32:33,620
+takeaway from this paper was just use
+EdgeBoxes if you had to pick one
+
+452
+00:32:33,619 --> 00:32:37,459
+it's really fast you can
+run it in about a third of a second
+
+453
+00:32:37,460 --> 00:32:40,950
+per image compared to about 10 seconds
+for selective search
+
+454
+00:32:40,950 --> 00:32:49,000
+and more stars is better and it gets a
+lot of stars so that's good right so now
+
+455
+00:32:49,000 --> 00:32:51,970
+that we have this idea of region proposals
+and we have this idea of a CNN
+
+456
+00:32:51,970 --> 00:32:56,679
+classifier let's just put everything
+all together so this
+
+457
+00:32:56,679 --> 00:33:02,830
+idea was sort of first put together in a
+really nice way in 2014 in this method
+
+458
+00:33:02,829 --> 00:33:08,740
+called R-CNN the idea is it's a
+region-based CNN method so it's
+
+459
+00:33:08,740 --> 00:33:12,179
+pretty simple and we've seen
+all the pieces we have an input image
+
+460
+00:33:12,179 --> 00:33:17,028
+we're gonna run a region proposal method
+like selective search to get maybe
+
+461
+00:33:17,028 --> 00:33:21,929
+two thousand boxes of different scales
+and positions I mean 2000 is still a lot but
+
+462
+00:33:21,929 --> 00:33:26,380
+it's a lot less than all possible boxes
+in the image now for each of those boxes
+
+463
+00:33:26,380 --> 00:33:31,510
+we're gonna crop and warp that
+image region to some fixed size and then
+
+464
+00:33:31,509 --> 00:33:35,898
+run it forward through a CNN to classify
+and then this CNN is going to have a
+
+465
+00:33:35,898 --> 00:33:41,199
+regression head and
+a classification head they used
+
+466
+00:33:41,200 --> 00:33:46,259
+SVMs here so the idea is that this
+regression head can sort of correct
+
+467
+00:33:46,259 --> 00:33:50,369
+for region proposals that were a little
+bit off right this actually works
+
+468
+00:33:50,369 --> 00:33:55,219
+really well it's really simple yeah it's
+pretty cool but unfortunately
+
+469
+00:33:55,220 --> 00:33:59,460
+the training pipeline is a
+little bit complicated so the way that
+
+470
+00:33:59,460 --> 00:34:03,788
+you end up training an R-CNN
+model is you know like many
+
+471
+00:34:03,788 --> 00:34:06,970
+models you first start by downloading a
+model from the internet that works well
+
+472
+00:34:06,970 --> 00:34:13,240
+for classification originally they were
+using AlexNet then next we
+
+473
+00:34:13,239 --> 00:34:16,868
+actually want to fine tune this model
+for detection.
+
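+Putting the test-time pieces just described into one sketch; every helper
+here is a hypothetical stand-in (random boxes, a fake CNN, fake SVM
+scores) purely to show the data flow, not the real implementation:
+
+```python
+import numpy as np
+
+propose_regions = lambda img: np.random.randint(0, 200, size=(2000, 4))
+cnn_features = lambda crop: np.random.randn(4096)       # one CNN pass per box
+svm_scores = lambda f: np.random.randn(20)              # per-class SVM scores
+bbox_regress = lambda f, box: box + np.random.randn(4)  # correct the proposal
+
+def rcnn_detect(image):
+    detections = []
+    for box in propose_regions(image):   # ~2000 selective-search proposals
+        # crop the region and warp it to a fixed size (crudely faked here)
+        crop = np.resize(image[box[1]:box[1] + 50, box[0]:box[0] + 50],
+                         (224, 224, 3))
+        feats = cnn_features(crop)
+        detections.append((bbox_regress(feats, box), svm_scores(feats)))
+    return detections                    # a variable number of scored boxes
+
+print(len(rcnn_detect(np.zeros((480, 640, 3)))))   # 2000
+```
+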
+474
+00:34:16,869 --> 00:34:20,780
+because this classification model was
+probably trained on ImageNet with its thousand classes but
+
+475
+00:34:20,780 --> 00:34:24,019
+your detection dataset has a different
+number of classes and the images look
+
+476
+00:34:24,019 --> 00:34:28,398
+a little bit different so you still
+train this network
+
+477
+00:34:28,398 --> 00:34:29,679
+for classification
+
+478
+00:34:29,679 --> 00:34:33,429
+you have to add a couple new layers at
+the end to deal with your classes and to
+
+479
+00:34:33,429 --> 00:34:38,068
+help you deal with the slightly different
+statistics of your image data so here
+
+480
+00:34:38,068 --> 00:34:41,579
+you're just doing classification still
+but you're not running on whole images
+
+481
+00:34:41,579 --> 00:34:44,869
+you're running on positive and
+negative regions of your images from
+
+482
+00:34:44,869 --> 00:34:49,950
+your detection dataset right so you
+initialize the new layers and you
+
+483
+00:34:49,949 --> 00:34:53,599
+train this thing again for your dataset
+
+484
+00:34:53,599 --> 00:34:57,889
+next we actually want to cache these
+features to disk so for every
+
+485
+00:34:57,889 --> 00:35:02,230
+image in your dataset you run
+selective search on that image you
+
+486
+00:35:02,230 --> 00:35:07,079
+extract those regions you warp them and run
+them through the CNN and you cache those features
+
+487
+00:35:07,079 --> 00:35:12,319
+to disk and something important for this
+step is to have a large hard drive the
+
+488
+00:35:12,320 --> 00:35:16,289
+PASCAL dataset is not too big maybe on the
+order of a couple tens of thousands of
+
+489
+00:35:16,289 --> 00:35:20,170
+images but extracting these features
+actually takes hundreds of gigabytes so
+
+490
+00:35:20,170 --> 00:35:26,869
+that's not so great and then next we
+want to train our SVMs to
+
+491
+00:35:26,869 --> 00:35:30,909
+actually be able to classify the
+different classes based on these
+
+492
+00:35:30,909 --> 00:35:35,649
+features so here we want to train a bunch
+of
+
+493
+00:35:35,650 --> 00:35:40,760
+different binary SVMs to classify
+image regions as to whether or not they
+
+494
+00:35:40,760 --> 00:35:45,220
+contain or don't contain that one
+object this goes back to a question a
+
+495
+00:35:45,219 --> 00:35:49,029
+little bit ago that sometimes you
+actually might want one region to
+
+496
+00:35:49,030 --> 00:35:53,460
+be able to output yes
+on multiple classes for the same image
+
+497
+00:35:53,460 --> 00:35:56,889
+region and one way that they do that is
+just by training separate binary SVMs
+
+498
+00:35:56,889 --> 00:36:01,579
+per class right so then this is sort
+of an offline process where they just train these SVMs.
+
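+A sketch of that offline per-class SVM step (the cached feature matrix and
+labels are randomly faked here; the 4096 dimension matches AlexNet's fc7
+but everything else is made up):
+
+```python
+import numpy as np
+from sklearn.svm import LinearSVC
+
+# Pretend these are CNN features cached to disk for 2000 warped proposals.
+feats = np.random.randn(2000, 4096).astype(np.float32)
+region_labels = np.random.randint(0, 3, size=2000)   # 0 bg, 1 cat, 2 dog
+
+# One independent binary SVM per class, positives vs everything else:
+svms = {name: LinearSVC().fit(feats, (region_labels == cls).astype(int))
+        for cls, name in [(1, 'cat'), (2, 'dog')]}
+print(svms['cat'].decision_function(feats[:3]))   # per-region cat scores
+```
+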
+499
+00:36:01,579 --> 00:36:08,230
+so you have these features
+these are maybe the positive
+
+500
+00:36:08,230 --> 00:36:11,820
+samples for a cat yeah the picture doesn't
+make any sense right but you get the
+
+501
+00:36:11,820 --> 00:36:14,700
+idea right you have these different
+image
+
+502
+00:36:14,699 --> 00:36:18,599
+regions you have these features that you
+saved to disk for those regions and then
+
+503
+00:36:18,599 --> 00:36:22,029
+you divide them into positive and
+negative samples for each class
+
+504
+00:36:22,030 --> 00:36:27,269
+and you just train these binary
+SVMs and you do the same
+
+505
+00:36:27,269 --> 00:36:33,239
+thing for dog and you just do this for
+every class in your dataset right now
+
+506
+00:36:33,239 --> 00:36:37,029
+there's another step
+there's this idea of box regression so
+
+507
+00:36:37,030 --> 00:36:40,450
+sometimes your region proposals aren't
+perfect so what we actually want to do
+
+508
+00:36:40,449 --> 00:36:45,549
+is be able to regress from these cached
+features to a correction on top of the
+
+509
+00:36:45,550 --> 00:36:50,269
+region proposal and that correction has
+this kind of funny parametrized
+
+510
+00:36:50,269 --> 00:36:54,320
+normalized representation you can read the
+details in the paper but the kind of
+
+511
+00:36:54,320 --> 00:36:58,300
+intuition is that maybe
+this region proposal was
+
+512
+00:36:58,300 --> 00:37:02,030
+pretty good we don't really need to make
+any corrections but maybe this one
+
+513
+00:37:02,030 --> 00:37:06,250
+in the middle that proposal was too far
+to the left the
+
+514
+00:37:06,250 --> 00:37:09,510
+correct ground truth is a little bit to
+the right so we want to regress to this
+
+515
+00:37:09,510 --> 00:37:12,530
+correction factor that actually tells us
+that we need to shift a little bit to
+
+516
+00:37:12,530 --> 00:37:15,780
+the right or maybe this one is a little
+bit too wide
+
+517
+00:37:15,780 --> 00:37:19,100
+it included too much of the stuff
+outside the cat so we want to regress to
+
+518
+00:37:19,099 --> 00:37:21,880
+this correction factor that tells us we
+need to shrink the
+
+519
+00:37:21,880 --> 00:37:26,539
+region proposal a little bit so again
+this is just linear
+
+520
+00:37:26,539 --> 00:37:30,340
+regression which you
+know from CS229 you have these features
+
+521
+00:37:30,340 --> 00:37:35,490
+you have these targets you just run
+linear regression.
+
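+That "funny parametrized, normalized representation" of the correction can
+be sketched concretely, following the parameterization in the R-CNN paper
+(shifts measured in units of the proposal's size, scales in log space);
+the example boxes here are made up:
+
+```python
+import numpy as np
+
+def box_regression_targets(proposal, ground_truth):
+    """Targets (tx, ty, tw, th) for correcting a proposal toward the truth."""
+    px, py, pw, ph = proposal       # center x, center y, width, height
+    gx, gy, gw, gh = ground_truth
+    return np.array([(gx - px) / pw, (gy - py) / ph,
+                     np.log(gw / pw), np.log(gh / ph)])
+
+# A proposal that sits a bit left of, and wider than, the ground truth:
+print(box_regression_targets((50, 50, 40, 40), (55, 50, 30, 40)))
+# [ 0.125  0.  -0.2877  0. ]  i.e. shift right a little, shrink the width
+```
+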
+522
+00:37:35,489 --> 00:37:39,219
+right so before we look at the results
+we should talk a little bit about the
+
+523
+00:37:39,219 --> 00:37:42,769
+different datasets that people use for
+detection there's kind of three that you'll see in
+
+524
+00:37:42,769 --> 00:37:48,489
+practice one is the PASCAL VOC
+dataset it was pretty important I think
+
+525
+00:37:48,489 --> 00:37:53,399
+in the early 2000s but now it's
+a little bit small this one's about 20
+
+526
+00:37:53,400 --> 00:37:57,820
+classes and about 20,000 images and it
+has about two objects per image
+
+527
+00:37:57,820 --> 00:38:01,550
+so because this is a relatively small
+ish dataset you'll see a lot of
+
+528
+00:38:01,550 --> 00:38:05,860
+detection papers work on this just because
+it's easier to handle but there's also
+
+529
+00:38:05,860 --> 00:38:09,970
+an ImageNet detection dataset ImageNet
+runs a whole bunch of challenges as
+
+530
+00:38:09,969 --> 00:38:13,109
+you've probably seen by now we saw
+classification we saw localization
+
+531
+00:38:13,110 --> 00:38:17,820
+there's also an ImageNet detection
+challenge for detection there's only
+
+532
+00:38:17,820 --> 00:38:21,600
+two hundred classes not the thousand
+from classification but it's very
+
+533
+00:38:21,599 --> 00:38:25,619
+big almost half a million images so you
+don't see as many papers work on it just
+
+534
+00:38:25,619 --> 00:38:29,819
+because it's kind of annoying to handle and
+there's only about one object per image and
+
+535
+00:38:29,820 --> 00:38:32,760
+then more recently there's
+this one from Microsoft called COCO
+
+536
+00:38:32,760 --> 00:38:36,660
+which has fewer classes and images but
+actually has a lot more objects
+
+537
+00:38:36,659 --> 00:38:42,649
+per image so people like to work on it
+now it's more interesting right
+
+538
+00:38:42,650 --> 00:38:45,300
+also when you're
+talking about detection there's this
+
+539
+00:38:45,300 --> 00:38:49,000
+funny evaluation metric we use called
+mean average precision I don't really wanna
+
+540
+00:38:49,000 --> 00:38:52,000
+get too much into the details what
+you really need to know is that it's a
+
+541
+00:38:52,000 --> 00:38:56,570
+number between 0 and a hundred and
+higher is good
+
+542
+00:38:56,570 --> 00:38:59,940
+and the kind of
+intuition is that you want to have
+
+543
+00:38:59,940 --> 00:39:04,079
+true positives
+get high scores and you also
+
+544
+00:39:04,079 --> 00:39:08,230
+have some threshold that the boxes you
+produce need to be within some
+
+545
+00:39:08,230 --> 00:39:12,090
+threshold of a correct box and
+usually that threshold is 0.5
+
+546
+00:39:12,090 --> 00:39:15,420
+intersection over union but you'll see
+different challenges use slightly
+
+547
+00:39:15,420 --> 00:39:19,740
+different things for that.
+
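+Intersection over union, the overlap measure behind that threshold, in a
+few lines (the boxes here are made-up (x1, y1, x2, y2) tuples):
+
+```python
+def iou(a, b):
+    """Intersection over union of two (x1, y1, x2, y2) boxes."""
+    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
+    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
+    inter = ix * iy
+    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
+    return inter / float(area(a) + area(b) - inter)
+
+# A detection whose IoU with a ground-truth box clears the threshold
+# (0.5 for PASCAL VOC) counts as a true positive:
+print(iou((0, 0, 100, 100), (50, 0, 150, 100)))   # 0.333... -> not a match
+```
+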
+548
+00:39:19,739 --> 00:39:24,679
+right so now that we understand the
+datasets and the evaluation let's see how R-CNN
+
+549
+00:39:24,679 --> 00:39:27,779
+did right so this is on two
+versions of the PASCAL dataset like I
+
+550
+00:39:27,780 --> 00:39:32,730
+said it's smaller so you'll see a lot of
+results on this there's different
+
+551
+00:39:32,730 --> 00:39:35,990
+versions one in 2007 one in 2010 you often see
+people use those just because the test
+
+552
+00:39:35,989 --> 00:39:37,169
+set is publicly available so it's easy to
+evaluate
+
+553
+00:39:37,170 --> 00:39:42,380
+so this deformable
+parts model that we saw from 2011 a
+
+554
+00:39:42,380 --> 00:39:48,579
+couple slides ago is getting
+about 30 mean average precision there's
+
+555
+00:39:48,579 --> 00:39:52,069
+this other method called Regionlets
+from 2013 that was sort of the state of
+
+556
+00:39:52,070 --> 00:39:55,280
+the art that I could find right before
+deep learning but it's sort of a
+
+557
+00:39:55,280 --> 00:39:58,130
+similar flavor you have these features
+and classifiers on top of the features
+
+558
+00:39:58,130 --> 00:40:02,840
+and R-CNN is this pretty simple thing
+we just saw and it
+
+559
+00:40:02,840 --> 00:40:06,789
+actually improves the performance quite
+a lot so the first thing to see is we had
+
+560
+00:40:06,789 --> 00:40:10,509
+a big improvement when we just switched to
+this pretty simple framework using CNNs
+
+561
+00:40:10,510 --> 00:40:15,160
+and actually this result here is
+without the bounding box regressions
+
+562
+00:40:15,159 --> 00:40:19,029
+this is only using the region proposals
+and SVMs actually if you include this
+
+563
+00:40:19,030 --> 00:40:23,550
+additional bounding box regression step it
+actually helps quite a bit another fun
+
+564
+00:40:23,550 --> 00:40:26,820
+thing to note is that if you take R-CNN
+and you do everything the same
+
+565
+00:40:26,820 --> 00:40:31,080
+except use VGG-16 instead of AlexNet
+you get another pretty big boost in
+
+566
+00:40:31,079 --> 00:40:34,059
+performance so this is kind of similar
+to what we've seen before that just
+
+567
+00:40:34,059 --> 00:40:39,650
+using these more powerful features tends
+to help a lot on different tasks right
+
+568
+00:40:39,650 --> 00:40:42,840
+this is really good right we've
+made a huge improvement on
+
+569
+00:40:42,840 --> 00:40:47,829
+detection compared to 2013 that's
+amazing but R-CNN is not perfect it
+
+570
+00:40:47,829 --> 00:40:53,150
+has some problems right so it's pretty
+slow at test time right we saw that we
+
+571
+00:40:53,150 --> 00:40:57,110
+have maybe two thousand regions and we need
+to evaluate the CNN for each region that's
+
+572
+00:40:57,110 --> 00:41:02,910
+kinda slow we also have this
+slightly subtle problem where our SVMs
+
+573
+00:41:02,909 --> 00:41:07,009
+and regressions were sort of trained
+offline with linear SVMs
+
+574
+00:41:07,010 --> 00:41:10,930
+and linear regression so the
+weights of our CNN didn't really
+
+575
+00:41:10,929 --> 00:41:14,960
+have the chance to update in response to
+what those parts of the network what
+
+576
+00:41:14,960 --> 00:41:19,039
+those objectives wanted to do and we
+also had this kind of complicated
+
+577
+00:41:19,039 --> 00:41:24,309
+training pipeline that was a bit of a
+mess so to fix these problems a year
+
+578
+00:41:24,309 --> 00:41:29,690
+later we have this thing called Fast
+R-CNN so Fast R-CNN was presented
+
+579
+00:41:29,690 --> 00:41:34,950
+pretty recently at ICCV just in
+December but the idea is really simple
+
+580
+00:41:34,949 --> 00:41:39,819
+we're just gonna swap the order of
+extracting regions and running the CNN
+
+581
+00:41:39,820 --> 00:41:43,550
+this is kind of related to the
+sliding window idea we saw with
+
+582
+00:41:43,550 --> 00:41:48,450
+OverFeat so here the pipeline at
+test time looks kinda similar we have
+
+583
+00:41:48,449 --> 00:41:52,299
+this input image
+we're going to take this high-resolution input
+
+584
+00:41:52,300 --> 00:41:55,920
+image and run it through the
+convolutional layers of our network and
+
+585
+00:41:55,920 --> 00:42:00,150
+now we're gonna get this high-resolution
+convolutional feature map and now for our
+
+586
+00:42:00,150 --> 00:42:03,940
+region proposals we're gonna extract
+features directly for those region
+
+587
+00:42:03,940 --> 00:42:07,610
+proposals from this convolutional
+feature map using this thing called ROI
+
+588
+00:42:07,610 --> 00:42:10,530
+pooling and then
+
+589
+00:42:10,530 --> 00:42:14,269
+the convolutional features for those
+regions will be fed
+
+590
+00:42:14,269 --> 00:42:17,829
+into our fully connected layers and we'll
+again have a classification head and a
+
+591
+00:42:17,829 --> 00:42:22,670
+regression head like we saw before so
+this is really cool it's pretty
+
+592
+00:42:22,670 --> 00:42:26,930
+great it solves a lot of the problems
+that we just saw with R-CNN so R-CNN
+
+593
+00:42:26,929 --> 00:42:31,039
+is really slow at test time we solve this
+problem by just sharing this
+
+594
+00:42:31,039 --> 00:42:37,289
+computation of convolutional features
+across the region proposals
+
+595
+00:42:37,289 --> 00:42:40,519
+R-CNN also had these problems at training
+time where we had this messy
+
+596
+00:42:40,519 --> 00:42:44,920
+training pipeline we had this
+problem where we're training different
+
+597
+00:42:44,920 --> 00:42:48,760
+parts of the network separately and the
+solution is pretty simple we just you
+
+598
+00:42:48,760 --> 00:42:50,480
+know train it all together all at once
+
+599
+00:42:50,480 --> 00:42:53,800
+don't have this complicated
+pipeline which we can actually do now
+
+600
+00:42:53,800 --> 00:42:58,140
+that we have this pretty nice
+function from inputs to outputs right so
+
+601
+00:42:58,139 --> 00:43:01,299
+you can see that Fast
+R-CNN actually solves quite a lot of the
+
+602
+00:43:01,300 --> 00:43:06,340
+problems that we saw with R-CNN sort
+of the one really interesting
+
+603
+00:43:06,340 --> 00:43:10,530
+technical bit in Fast R-CNN was this
+problem of ROI region of interest
+
+604
+00:43:10,530 --> 00:43:15,519
+pooling so the idea is that we have this
+input image that's probably high
+
+605
+00:43:15,519 --> 00:43:19,068
+resolution and we have this region
+proposal that's coming from
+
+606
+00:43:19,068 --> 00:43:23,969
+selective search boxes or something like
+that and we can put this
+
+607
+00:43:23,969 --> 00:43:27,199
+high resolution image through our
+convolutional and pooling layers just
+
+608
+00:43:27,199 --> 00:43:30,880
+fine because those are sort of
+scale-invariant they can still operate on
+
+609
+00:43:30,880 --> 00:43:34,318
+different sizes of inputs but now the
+problem is that the fully connected
+
+610
+00:43:34,318 --> 00:43:39,630
+layers from our pre-trained network are
+expecting these pretty low res conv
+
+611
+00:43:39,630 --> 00:43:46,068
+features whereas these features from the
+whole image are high res so now we solve
+
+612
+00:43:46,068 --> 00:43:50,038
+this problem in a pretty straightforward
+way so given this region proposal we're
+
+613
+00:43:50,039 --> 00:43:53,930
+gonna project it onto sort of the
+spatial part of that conv feature
+
+614
+00:43:53,929 --> 00:43:59,368
+volume now we're going to divide that
+conv feature volume into a little grid right
+
+615
+00:43:59,369 --> 00:44:04,910
+divide that thing into the h by w grid
+that downstream layers are expecting and
+
+616
+00:44:04,909 --> 00:44:09,798
+we do max pooling within each of those
+grid cells so now we have
+
+617
+00:44:09,798 --> 00:44:14,349
+this pretty simple strategy we've taken
+this region proposal and we've shared
+
+618
+00:44:14,349 --> 00:44:19,430
+convolutional features and extracted this
+fixed size output for that region for
+
+619
+00:44:19,429 --> 00:44:23,629
+that region proposal right this is
+basically just swapping the order of
+
+620
+00:44:23,630 --> 00:44:28,108
+convolution and warping and cropping
+that's one way to think about it and
+
+621
+00:44:28,108 --> 00:44:31,538
+also this is a pretty nice operation
+because since this thing is basically
+
+622
+00:44:31,539 --> 00:44:35,249
+just max pooling and we know how to back
+propagate through max pooling you can
+
+623
+00:44:35,248 --> 00:44:38,368
+back propagate through these region of
+interest pooling layers just fine.
+
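+A toy version of that RoI pooling step (the grid size, feature-map shape,
+and example region are all made up; real implementations differ in the
+rounding details):
+
+```python
+import numpy as np
+
+def roi_pool(fmap, roi, out_h=2, out_w=2):
+    """Max-pool an arbitrary region of a (C, H, W) feature map down to a
+    fixed (C, out_h, out_w) grid, as in Fast R-CNN's RoI pooling."""
+    x1, y1, x2, y2 = roi                              # feature-map coordinates
+    ys = np.linspace(y1, y2, out_h + 1).astype(int)   # grid cell boundaries
+    xs = np.linspace(x1, x2, out_w + 1).astype(int)
+    out = np.zeros((fmap.shape[0], out_h, out_w), dtype=fmap.dtype)
+    for i in range(out_h):
+        for j in range(out_w):
+            cell = fmap[:, ys[i]:max(ys[i + 1], ys[i] + 1),
+                           xs[j]:max(xs[j + 1], xs[j] + 1)]
+            out[:, i, j] = cell.max(axis=(1, 2))      # max within each cell
+    return out
+
+fmap = np.random.randn(512, 32, 32)            # shared conv feature map
+print(roi_pool(fmap, (4, 6, 20, 30)).shape)    # (512, 2, 2): fixed size
+```
+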
+624
+00:44:38,369 --> 00:44:42,269
+and that's what really allows
+us to train this whole thing in a joint
+
+625
+00:44:42,268 --> 00:44:46,758
+way right let's see some results and
+these are actually pretty
+
+626
+00:44:46,759 --> 00:44:50,858
+amazing right so for training time R-CNN
+had this complicated pipeline we'd
+
+627
+00:44:50,858 --> 00:44:54,098
+save all this stuff to disk we had
+to do all this stuff independently and
+
+628
+00:44:54,099 --> 00:44:57,789
+even on that pretty small PASCAL dataset
+it took eighty four hours to train
+
+629
+00:44:57,789 --> 00:45:05,229
+Fast R-CNN is much faster to train
+and as far as test time an R-
+
+630
+00:45:05,228 --> 00:45:09,318
+CNN is pretty slow because again we're
+running these independent forward passes
+
+631
+00:45:09,318 --> 00:45:14,469
+of the CNN for each region proposal
+whereas for Fast R-CNN we can
+
+632
+00:45:14,469 --> 00:45:17,979
+sort of share the computation between
+different region proposals and get this
+
+633
+00:45:17,978 --> 00:45:23,439
+gigantic speedup at test time a hundred
+and forty-six x that's great amazing and
+
+634
+00:45:23,440 --> 00:45:26,690
+in fact in terms of performance I mean
+it does a little bit better it's not a
+
+635
+00:45:26,690 --> 00:45:30,048
+drastic difference in performance but
+this could probably be attributed to
+
+636
+00:45:30,048 --> 00:45:32,130
+this fine tuning property that
+
+637
+00:45:32,130 --> 00:45:35,140
+with Fast R-CNN you can actually fine tune
+all parts of the convolutional
+
+638
+00:45:35,139 --> 00:45:38,969
+network jointly to help with these output
+tasks and that's probably why you see a
+
+639
+00:45:38,969 --> 00:45:43,230
+bit of an increase here right so this is
+great right what could possibly
+
+640
+00:45:43,230 --> 00:45:45,730
+be wrong with Fast R-CNN it looks
+amazing
+
+641
+00:45:45,730 --> 00:45:51,699
+the big problem is that these test time
+speeds don't include region proposals
+
+642
+00:45:51,699 --> 00:45:55,669
+right so now Fast R-CNN is so good
+that actually the bottleneck is
+
+643
+00:45:55,670 --> 00:46:00,750
+computing region proposals that's pretty
+funny so once you factor in the time for
+
+644
+00:46:00,750 --> 00:46:04,789
+actually computing these region
+proposals on the CPU you can see that a lot
+
+645
+00:46:04,789 --> 00:46:09,190
+of our speed benefits disappear right
+only 25x faster and we kind of lost
+
+646
+00:46:09,190 --> 00:46:15,030
+that beautiful hundred x speedup also now
+because it takes about two seconds to run
+
+647
+00:46:15,030 --> 00:46:18,560
+per image you can't
+really use this in real time it's still kind of
+
+648
+00:46:18,559 --> 00:46:23,750
+an offline processing thing right so the
+solution to this should be pretty
+
+649
+00:46:23,750 --> 00:46:27,340
+obvious right you're already
+using a convolutional network for
+
+650
+00:46:27,340 --> 00:46:32,620
+regression using it for classification
+why not use it for region proposals too
+
+651
+00:46:32,619 --> 00:46:39,569
+right it should work maybe it's kind of crazy
+so that's a paper anyone want to
+
+652
+00:46:39,570 --> 00:46:46,570
+guess the name yes it's Faster R-CNN
+
+653
+00:46:46,570 --> 00:46:50,789
+yes they were really creative
+here right but the idea is pretty simple
+
+654
+00:46:50,789 --> 00:46:55,460
+right we start from Fast R-CNN where
+we're taking our input image and we're
+
+655
+00:46:55,460 --> 00:46:59,630
+computing these big convolutional
+feature maps over the entire input image
+
+656
+00:46:59,630 --> 00:47:05,170
+so that instead of using some external
+method to compute region proposals they
+
+657
+00:47:05,170 --> 00:47:09,010
+add this little thing called the region
+proposal network that looks directly at
+
+658
+00:47:09,010 --> 00:47:13,060
+these last layer
+convolutional features and is able to
+
+659
+00:47:13,059 --> 00:47:17,599
+produce region proposals directly from
+that convolutional feature map and
+
+660
+00:47:17,599 --> 00:47:21,190
+then once you have region proposals you
+just do the same thing as Fast R-CNN
+
+661
+00:47:21,190 --> 00:47:25,880
+you use this ROI pooling and all the
+upstream stuff is the same as Fast R-CNN
+
+662
+00:47:25,880 --> 00:47:31,130
+so really the novel bit here is
+this region proposal network it's
+
+663
+00:47:31,130 --> 00:47:34,180
+really cool right we're doing the whole
+thing in one giant convolutional network
+
+664
+00:47:34,179 --> 00:47:40,500
+right so the way this region proposal
+network works is that we
+
+665
+00:47:40,500 --> 00:47:43,880
+receive as input this convolutional
+feature map this may be coming out of
+
+666
+00:47:43,880 --> 00:47:47,820
+the last layer of our convolutional
+features and we're going to add
+
+667
+00:47:47,820 --> 00:47:52,570
+like most things recently the region
+proposal network itself is a convolutional network right
+
+668
+00:47:52,570 --> 00:47:57,570
+so actually this is a typo this is a
+three by three conv net right so we have
+
+669
+00:47:57,570 --> 00:48:01,809
+sort of a sliding window approach over
+our convolutional feature map but a
+
+670
+00:48:01,809 --> 00:48:06,820
+sliding window is just a
+convolution right so we just have a three
+
+671
+00:48:06,820 --> 00:48:10,920
+by three convolution on top of this
+feature map and then we have this
+
+672
+00:48:10,920 --> 00:48:14,599
+familiar two head
+structure inside the region proposal
+
+673
+00:48:14,599 --> 00:48:19,670
+network where we're doing classification
+where we just want to say whether
+
+674
+00:48:19,670 --> 00:48:25,430
+or not it's an object and also
+regression to regress from this sort of
+
+675
+00:48:25,429 --> 00:48:29,829
+position onto an actual region
+proposal so the idea is that the
+
+676
+00:48:29,829 --> 00:48:33,909
+position of the sliding window relative
+to the feature map sort of tells us
+
+677
+00:48:33,909 --> 00:48:38,239
+where we are in the image and then these
+regression outputs sort of give us
+
+678
+00:48:38,239 --> 00:48:43,619
+corrections on top of this position
+in the feature map but actually they
+
+679
+00:48:43,619 --> 00:48:46,940
+make it a little bit more complicated
+than that so instead of regressing
+
+680
+00:48:46,940 --> 00:48:51,110
+directly from this position in the
+convolutional feature map they have
+
+681
+00:48:51,110 --> 00:48:55,280
+this notion of these different anchor
+boxes you can imagine taking these
+
+682
+00:48:55,280 --> 00:48:59,910
+different sized and shaped anchor boxes
+and sort of pasting them in the original
+
+683
+00:48:59,909 --> 00:49:03,538
+image at the point of the image
+corresponding to this point in the
+
+684
+00:49:03,539 --> 00:49:08,020
+feature map right like in Fast R-CNN we were
+projecting forward from the image into
+
+685
+00:49:08,019 --> 00:49:11,519
+the feature map now we're doing the
+opposite we're projecting from the
+686
+00:49:11,519 --> 00:49:17,288
+feature map back into the image for
+these boxes so then for each of these
+
+687
+00:49:17,289 --> 00:49:21,640
+anchor boxes the anchor boxes are sort of
+convolutional they use the
+
+688
+00:49:21,639 --> 00:49:27,400
+same ones at every position in the image
+and for each of these anchor boxes
+
+689
+00:49:27,400 --> 00:49:32,119
+they produce a score as to whether or not
+that anchor box corresponds to an object
+
+690
+00:49:32,119 --> 00:49:36,809
+and they also produce four regression
+coordinates that can correct that anchor
+
+691
+00:49:36,809 --> 00:49:41,880
+box in similar ways that we saw before
+and now this region proposal network you
+
+692
+00:49:41,880 --> 00:49:45,700
+can just train to try to predict
+it's sort of a class agnostic object
+
+693
+00:49:45,699 --> 00:49:52,058
+detector so Faster R-CNN in the
+original paper they train this thing in
+
+694
+00:49:52,059 --> 00:49:55,490
+kind of a funny way where first they
+train the region proposal network then they
+
+695
+00:49:55,489 --> 00:49:59,500
+train Fast R-CNN then they do some
+magic to merge them together and at the end
+
+696
+00:49:59,500 --> 00:50:03,530
+of the day they have one network that
+produces everything so this is a
+
+697
+00:50:03,530 --> 00:50:07,880
+little bit messy but in the original paper
+they describe this thing but since then
+
+698
+00:50:07,880 --> 00:50:10,470
+they've had some unpublished work where
+they actually just train the whole
+
+699
+00:50:10,469 --> 00:50:14,909
+thing jointly where they sort of have
+one big network where you have an image
+
+700
+00:50:14,909 --> 00:50:19,679
+coming in and inside the
+region proposal network you have a
+
+701
+00:50:19,679 --> 00:50:23,538
+classification loss to classify whether
+each region proposal is or is not an
+
+702
+00:50:23,539 --> 00:50:27,670
+object you have these bounding box
+regressions inside the region proposal
+
+703
+00:50:27,670 --> 00:50:33,500
+network on top of your convolutional
+anchors and then we do
+
+704
+00:50:33,500 --> 00:50:37,190
+our ROI pooling and do this Fast
+R-CNN trick and then at the end of the
+
+705
+00:50:37,190 --> 00:50:41,200
+network we have this classification loss
+to say which class that is and this
+
+706
+00:50:41,199 --> 00:50:47,659
+regression loss to produce a correction
+on top of the region proposal so this
+
+707
+00:50:47,659 --> 00:50:53,170
+big thing is just one big network
+with four losses yeah
+
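+The two-headed region proposal network just described, as a small sketch
+(the channel counts here are assumptions; k = 9 anchors per position,
+3 scales times 3 aspect ratios, as in the Faster R-CNN setup):
+
+```python
+import torch
+import torch.nn as nn
+
+k = 9                                             # anchor boxes per position
+sliding = nn.Conv2d(512, 256, kernel_size=3, padding=1)   # the 3x3 "window"
+cls_head = nn.Conv2d(256, 2 * k, kernel_size=1)   # object / not object per anchor
+reg_head = nn.Conv2d(256, 4 * k, kernel_size=1)   # 4 box corrections per anchor
+
+feats = torch.randn(1, 512, 40, 60)               # conv features for one image
+h = torch.relu(sliding(feats))
+print(cls_head(h).shape, reg_head(h).shape)
+# torch.Size([1, 18, 40, 60]) torch.Size([1, 36, 40, 60]): every position
+# scores and corrects the same k anchors, pasted at that point in the image.
+```
+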
+708
+00:50:53,170 --> 00:51:04,019
+so the proposal scores and regression
+coordinates are produced by a three by
+
+709
+00:51:04,019 --> 00:51:07,588
+three and then a pair of
+one-by-one convolutions off the feature
+
+710
+00:51:07,588 --> 00:51:12,358
+map right so the idea is that we're
+looking at these different anchor boxes
+
+711
+00:51:12,358 --> 00:51:16,400
+of different positions and scales but
+we're actually looking at the same
+
+712
+00:51:16,400 --> 00:51:20,139
+position in the feature map to classify
+those different anchor boxes but you
+
+713
+00:51:20,139 --> 00:51:26,179
+learn different
+weights for the different anchors I
+
+714
+00:51:26,179 --> 00:51:29,969
+think it's mostly empirical right so for the
+three by three the idea is just you want to
+
+715
+00:51:29,969 --> 00:51:33,429
+have a little bit of nonlinearity you
+could imagine just doing sort of a
+
+716
+00:51:33,429 --> 00:51:38,098
+direct one-by-one convolution directly
+off the feature maps but I think they
+
+717
+00:51:38,099 --> 00:51:40,990
+don't discuss this in the paper but I'm
+guessing the three by three tends to
+
+718
+00:51:40,989 --> 00:51:44,669
+work a bit better but there's no
+really deep reason why you do
+
+719
+00:51:44,670 --> 00:51:47,450
+that it could be more it could be less
+it could be a bigger kernel it's just
+
+720
+00:51:47,449 --> 00:51:50,548
+sort of you have this little convolutional
+network with two heads that's the main
+
+721
+00:51:50,548 --> 00:51:53,710
+point any other questions
+
+722
+00:51:53,710 --> 00:52:18,380
+yeah I understand because the feature map
+
+723
+00:52:18,380 --> 00:52:22,140
+corresponds to the whole image
+
+724
+00:52:22,139 --> 00:52:26,098
+the point is that we don't actually want
+to process the whole image we want to
+
+725
+00:52:26,099 --> 00:52:29,960
+pick out some regions of the image to do
+more processing on but we need to choose
+
+726
+00:52:29,960 --> 00:52:36,048
+those regions somehow
+
+727
+00:52:36,048 --> 00:52:42,188
+yes that's basically this
+idea of using external
+
+728
+00:52:42,188 --> 00:52:46,428
+region proposals right so when you use
+external region proposals you're
+
+729
+00:52:46,429 --> 00:52:50,929
+sort of picking them first before you do
+the convolutions but it's just sort of a
+
+730
+00:52:50,929 --> 00:52:54,858
+nice thing if you can do it all at once
+so convolutions are kind of
+
+731
+00:52:54,858 --> 00:52:58,748
+this really general
+processing that you can do to
+
+732
+00:52:58,748 --> 00:53:01,608
+the image and you're kinda hoping that if
+convolutions are good enough for
+
+733
+00:53:01,608 --> 00:53:04,869
+classification and regression then the
+types of information that you have in
+
+734
+00:53:04,869 --> 00:53:07,439
+those convolutions are probably good
+enough for classifying regions as well
+
+735
+00:53:07,438 --> 00:53:11,958
+so it's actually a
+computational savings because at the end
+
+736
+00:53:11,958 --> 00:53:15,719
+of the day you end up using that same
+convolutional feature map for everything
+
+737
+00:53:15,719 --> 00:53:18,938
+for the region proposals for the
+downstream classification for the
+
+738
+00:53:18,938 --> 00:53:23,389
+downstream regression that's actually
+why you get the speedup here
+
+739
+00:53:23,389 --> 00:53:29,788
+question yes we have this big network we
+train with four losses and now we can do
+
+740
+00:53:29,789 --> 00:53:31,569
+object detection sort of all at once
+
+741
+00:53:31,568 --> 00:53:37,858
+pretty cool so if we look at results
+comparing the three R-CNNs of various
+
+742
+00:53:37,858 --> 00:53:43,630
+velocities then we have original R-CNN
+it took about 50 seconds at test time per
+
+743
+00:53:43,630 --> 00:53:47,150
+image this is counting the region
+proposals this is counting running the
+
+744
+00:53:47,150 --> 00:53:52,439
+CNN separately for each region proposal
+that's pretty slow now Fast R-CNN we
+
+745
+00:53:52,438 --> 00:53:56,909
+saw was sort of bottlenecked by the
+region proposal time but once we move to
+
+746
+00:53:56,909 --> 00:54:01,768
+Faster R-CNN those region
+proposals are basically coming for free
+
+747
+00:54:01,768 --> 00:54:06,139
+since the way we compute
+region proposals is just a tiny three by
+
+748
+00:54:06,139 --> 00:54:09,199
+three convolution and a couple of
+one-by-one convolutions so they're very
+
+749
+00:54:09,199 --> 00:54:13,229
+cheap to evaluate so at test time
+Faster R-CNN runs in about a fifth of a
+
+750
+00:54:13,228 --> 00:54:23,849
+second on a pretty high resolution image
+that's actually yeah
+
+751
+00:54:23,849 --> 00:54:36,739
+well I mean one of the ideas
+behind zero padding is you're hoping not
+
+752
+00:54:36,739 --> 00:54:40,699
+to throw away information from the edges
+so I think maybe you might have more of a
+
+753
+00:54:40,699 --> 00:54:45,299
+problem if you didn't do the zero
+padding but I
+
+754
+00:54:45,300 --> 00:54:48,430
+mean as we sort of discussed before
+the fact that you're adding that zero
+
+755
+00:54:48,429 --> 00:54:52,519
+padding might affect the statistics of
+those features so it could maybe be a
+
+756
+00:54:52,519 --> 00:54:56,900
+bit of a problem but in practice it
+seems to work just fine but actually
+
+757
+00:54:56,900 --> 00:55:00,099
+yeah that kind of analysis of
+where we have failure cases where
+
+758
+00:55:00,099 --> 00:55:02,949
+we get things wrong is a really
+important process when you develop new
+
+759
+00:55:02,949 --> 00:55:08,419
+algorithms and it can give you insight
+into what might make things better
+
+760
+00:55:08,420 --> 00:55:26,940
+yeah
+
+761
+00:55:26,940 --> 00:55:35,858
+so maybe it might help but it's actually
+kinda hard to do that
+
+762
+00:55:35,858 --> 00:55:40,108
+experiment because the datasets are
+different right when
+
+763
+00:55:40,108 --> 00:55:43,789
+you work on a classification dataset
+like ImageNet that's one thing but then
+
+764
+00:55:43,789 --> 00:55:47,259
+when you work on detection it's this
+other dataset and you
+
+765
+00:55:47,260 --> 00:55:51,000
+could imagine trying to classify the
+detection images based on what objects
+
+766
+00:55:51,000 --> 00:55:54,500
+are present but I haven't really seen
+any really good comparisons that try to
+
+767
+00:55:54,500 --> 00:56:00,630
+study that but I mean try
+the experiment in your project
+
+768
+00:56:00,630 --> 00:56:18,088
+yeah that's a very good question so then
+you have this problem with ROI
+
+769
+00:56:18,088 --> 00:56:22,119
+pooling right because of the way that
+the ROI pooling works by
+
+770
+00:56:22,119 --> 00:56:25,720
+dividing that thing into a fixed grid
+and doing max pooling once you do
+
+771
+00:56:25,719 --> 00:56:29,949
+rotations it's actually kind of
+difficult there's this really cool paper
+
+772
+00:56:29,949 --> 00:56:33,159
+from DeepMind over the
+summer called spatial transformer
+
+773
+00:56:33,159 --> 00:56:39,250
+networks that actually introduces a
+really cool way to solve this problem and
+
+774
+00:56:39,250 --> 00:56:42,239
+the idea is that instead of doing ROI
+pooling we're gonna do bilinear
+
+775
+00:56:42,239 --> 00:56:46,699
+interpolation kinda like you might
+use for textures in graphics so once
+
+776
+00:56:46,699 --> 00:56:50,009
+you do bilinear interpolation you
+actually can handle maybe these crazy
+
+777
+00:56:50,010 --> 00:56:53,609
+rotated regions so yeah that's definitely
+something people are thinking about but
+
+778
+00:56:53,608 --> 00:56:56,848
+it hasn't been incorporated into the
+whole pipeline yet
+
+779
+00:56:56,849 --> 00:57:00,338
+yeah
+
+780
+00:57:00,338 --> 00:57:11,728
+you could but we'd be slowed down we're back
+in this sort of R-CNN regime right and
+
+781
+00:57:11,728 --> 00:57:12,449
+look at that
+
+782
+00:57:12,449 --> 00:57:16,828
+it's 250 times slower do you really want to
+pay that price I mean I think another
+
+783
+00:57:16,829 --> 00:57:20,690
+practical concern with rotated objects
+is that we don't really have the ground
+
+784
+00:57:20,690 --> 00:57:25,318
+truth datasets for most of these
+detection datasets the only
+
+785
+00:57:25,318 --> 00:57:29,190
+ground truth information we have are
+these axis-aligned bounding boxes so
+
+786
+00:57:29,190 --> 00:57:33,150
+it's hard you don't have a ground truth
+position that's kind of a practical
+
+787
+00:57:33,150 --> 00:57:39,219
+concern I think people haven't really
+explored this so much so the end
+
+788
+00:57:39,219 --> 00:57:43,009
+story with Faster R-CNN is it's super
+fast and the accuracy was about the same right
+
+789
+00:57:43,009 --> 00:57:49,798
+that's good and what's actually really
+interesting is that at this point
+
+790
+00:57:49,798 --> 00:57:52,949
+you can actually understand the state
+of the art in object detection so this
+
+791
+00:57:52,949 --> 00:57:55,669
+is one of the best object
+detectors in the world it crushed
+
+792
+00:57:55,670 --> 00:58:00,479
+everyone at the ImageNet
+and COCO challenges in December
+
+793
+00:58:00,478 --> 00:58:06,710
+and like most other things it's this
+deep residual network so the best object
+
+794
+00:58:06,710 --> 00:58:10,548
+detector in the world right now is a hundred
+and one layer residual network plus Faster
+
+795
+00:58:10,548 --> 00:58:17,298
+R-CNN plus a couple other goodies here
+right so we talked about
+
+796
+00:58:17,298 --> 00:58:23,670
+Faster R-CNN we saw ResNet last lecture
+but they do add a couple extra things
+
+797
+00:58:23,670 --> 00:58:26,389
+for competitions you always need to add a
+couple of crazy things to get a little
+
+798
+00:58:26,389 --> 00:58:30,348
+boost in performance right so here
+this box refinement actually does
+
+799
+00:58:30,349 --> 00:58:33,528
+multiple steps of refining the bounding
+box
+
+800
+00:58:33,528 --> 00:58:38,818
+you saw that in the Fast R-CNN
+framework you're doing this correction on
+
+801
+00:58:38,818 --> 00:58:41,929
+top of your region proposal you can
+actually feed that back into the network
+
+802
+00:58:41,929 --> 00:58:46,298
+and reclassify and get another
+prediction so that's this box refinement
+
+803
+00:58:46,298 --> 00:58:50,929
+step it gives you a little bit of a boost
+they add context so in addition to
+
+804
+00:58:50,929 --> 00:58:55,710
+classifying just the region they
+add a vector that gives you the
+
+805
+00:58:55,710 --> 00:59:00,309
+features for the entire image that
+sort of gives you more context than
+
+806
+00:59:00,309 --> 00:59:03,999
+just that little crop and gives you a
+little bit more performance and they also
+
+807
+00:59:03,998 --> 00:59:08,179
+do multiscale testing kinda like we saw
+in OverFeat so they actually run
+
+808
+00:59:08,179 --> 00:59:10,730
+the thing on images at different sizes
+at test time
+
+809
+00:59:10,730 --> 00:59:13,949
+and aggregate over those different sizes
+and when you put all those things
+
+810
+00:59:13,949 --> 00:59:21,129
+together you win a lot of competitions
+so this thing won on COCO
+
+811
+00:59:21,130 --> 00:59:24,960
+Microsoft COCO actually runs a detection
+challenge and they won the detection
+
+812
+00:59:24,960 --> 00:59:29,199
+challenge on COCO we can also look at
+the rapid progress on the ImageNet
+
+00:59:29,199 --> 00:59:32,909 +detection challenges over the last +couple of years so you can see in 2013 + +814 +00:59:32,909 --> 00:59:38,949 +was sort of the first time that we had +these deep learning detection models so + +815 +00:59:38,949 --> 00:59:43,789 +over feat that we saw for localisation +they actually submitted version of their + +816 +00:59:43,789 --> 00:59:47,949 +system that works on detection as well +by sort of changing the logic with by + +817 +00:59:47,949 --> 00:59:51,849 +which they merge bounding boxes and they +did pretty good but they were actually + +818 +00:59:51,849 --> 00:59:57,319 +outperformed by this other this other +group called you vision that was sort of + +819 +00:59:57,320 --> 01:00:02,289 +not a deep learning approach to use a +lot of features but none in 2014 we + +820 +01:00:02,289 --> 01:00:05,840 +actually saw both of these were deep +learning approaches and Google actually + +821 +01:00:05,840 --> 01:00:09,740 +won that one by using a Google Map plus +some other detection stuff on top of + +822 +01:00:09,739 --> 01:00:15,029 +Google not and then in 2015 things went +crazy and these residual networks plus + +823 +01:00:15,030 --> 01:00:19,410 +passer I CNN just crushed everything so +I think that action especially over the + +824 +01:00:19,409 --> 01:00:22,409 +last couple years has been a really +exciting thing because we've seen this + +825 +01:00:22,409 --> 01:00:25,429 +really rapid progress over the last +couple years in detection like most + +826 +01:00:25,429 --> 01:00:29,129 +other things and another point I think +it's kind of fun to make is that + +827 +01:00:29,130 --> 01:00:33,800 +actually for all I can to win +competitions you know Andre said you + +828 +01:00:33,800 --> 01:00:37,830 +ensemble and get 2% so you always win +competitions with an ensemble but + +829 +01:00:37,829 --> 01:00:42,829 +actually sort of fun microsoft also +submitted their best single resident + +830 +01:00:42,829 --> 01:00:47,440 +model this was not an ensemble and just +a single resident model actually be all + +831 +01:00:47,440 --> 01:00:52,400 +the other things from all the other +years that's actually pretty cool yeah + +832 +01:00:52,400 --> 01:00:58,130 +that's that's the best actor out there +so this is kind of a funny thing right + +833 +01:00:58,130 --> 01:01:03,240 +so this is a really so we we we talked +about this idea of localisation as + +834 +01:01:03,239 --> 01:01:08,439 +regression so this funny thing called +Yolo you only look once actually tries + +835 +01:01:08,440 --> 01:01:13,519 +to oppose the detection problem directly +as a regression problem so the idea is + +836 +01:01:13,519 --> 01:01:18,389 +that we actually are going to take our +input image and we're gonna divided into + +837 +01:01:18,389 --> 01:01:22,190 +some spatial grid they used to seven by +seven and then within + +838 +01:01:22,190 --> 01:01:26,480 +each element about spatial grid we're +gonna make six number of bounding box + +839 +01:01:26,480 --> 01:01:31,039 +predictions they use be equal to I think +in most of the experiments so then + +840 +01:01:31,039 --> 01:01:36,489 +within each grid you're going to predict +maybe to be bounding boxes that's four + +841 +01:01:36,489 --> 01:01:41,229 +numbers are also going to protect US +single score for how much you believe + +842 +01:01:41,230 --> 01:01:44,969 +that bounding box and you're also going +to protect classification score for each + +843 +01:01:44,969 --> 01:01:49,659 +class near Davis at so then you can sort 
+of take this this detection problem and + +844 +01:01:49,659 --> 01:01:53,969 +it ends up being regression your input +is an image in your output is this maybe + +845 +01:01:53,969 --> 01:01:59,529 +seven by seven by five B plus see answer +right now just a regression problem and + +846 +01:01:59,530 --> 01:02:04,820 +just try it and that's pretty cool and +it's it's sort of a new approach to a + +847 +01:02:04,820 --> 01:02:07,900 +bit different than these region proposal +things that we've seen before + +848 +01:02:07,900 --> 01:02:12,300 +of course sort of a problem with this is +that there's an upper bound in the + +849 +01:02:12,300 --> 01:02:15,930 +number of outputs that your model can +have so that might be a problem if + +850 +01:02:15,929 --> 01:02:20,279 +you're testing data has many many more +ground truth boxes in your training data + +851 +01:02:20,280 --> 01:02:27,180 +so this this yellow detector actually is +really fast it's actually faster and + +852 +01:02:27,179 --> 01:02:32,460 +then faster our CNN which is pretty +crazy but unfortunately it tends to work + +853 +01:02:32,460 --> 01:02:36,769 +a little bit worse so bad this other +thing called fast yellow that i dont + +854 +01:02:36,769 --> 01:02:39,460 +wanna talk about but + +855 +01:02:39,460 --> 01:02:45,170 +right but just as our number these are +mean AP numbers on passed on one of the + +856 +01:02:45,170 --> 01:02:49,619 +Pascal data sets that we saw you can see +yellow actually gets 64 that's pretty + +857 +01:02:49,619 --> 01:02:53,329 +good and runs at forty five frames per +second that this is obviously on a + +858 +01:02:53,329 --> 01:02:58,840 +powerful GPU but still that's that's +pretty much real time that's amazing + +859 +01:02:58,840 --> 01:03:03,960 +was also I don't wanna talk about that +right now knows these different versions + +860 +01:03:03,960 --> 01:03:09,309 +of past and Pastor are CNN's you can see +that these actually pretty much all beat + +861 +01:03:09,309 --> 01:03:14,119 +yo in terms of performance but are quite +a bit slower yeah that's that's actually + +862 +01:03:14,119 --> 01:03:20,119 +kind of a neat twist on the detection +problem actually all these all these + +863 +01:03:20,119 --> 01:03:22,779 +different detection metric all these +different detection models that we + +864 +01:03:22,780 --> 01:03:26,780 +talked about today they all pretty much +have code up their released you should + +865 +01:03:26,780 --> 01:03:30,800 +maybe consider using them for projects +probably don't use our CNN it's too slow + +866 +01:03:30,800 --> 01:03:36,090 +fast are seen on pretty good but +requires MATLAB pastor our CNN there is + +867 +01:03:36,090 --> 01:03:39,720 +actually a Persian a pastor our CNN that +doesn't require MATLAB is just Pipeline + +868 +01:03:39,719 --> 01:03:44,379 +Cafe I haven't personally used it but +it's something you might want to try to + +869 +01:03:44,380 --> 01:03:48,070 +use for your projects I'm not sure how +difficult it is to get running and + +870 +01:03:48,070 --> 01:03:52,050 +yellow as actually I think maybe a good +choice for some of your projects because + +871 +01:03:52,050 --> 01:03:55,810 +it's so fast that it might be easier to +work with if you have not be really big + +872 +01:03:55,809 --> 01:03:59,860 +powerful GPUs and actually have caught +up as well + +873 +01:03:59,860 --> 01:04:03,480 +yes that's actually I got through things +a little bit faster than expected so is + +874 +01:04:03,480 --> 01:04:10,559 +there any questions on detection 
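To make the YOLO output shape concrete, here is a minimal NumPy sketch of the S x S x (5B + C) regression target just described. The sizes are assumptions taken from the YOLO setup mentioned above (S = 7, B = 2, C = 20 PASCAL classes), and the score combination at the end is illustrative rather than the paper's exact parameterization.

import numpy as np

# Minimal sketch of a YOLO-style output volume (assumed sizes: S=7 grid,
# B=2 boxes per cell, C=20 classes, so the net regresses S x S x (5B+C)).
S, B, C = 7, 2, 20
out = np.random.randn(S, S, 5 * B + C)   # stand-in for the network output

cell = out[3, 4]                         # predictions for one grid cell
boxes = cell[:5 * B].reshape(B, 5)       # each row: (x, y, w, h, confidence)
class_scores = cell[5 * B:]              # C class scores for this cell

# One common way to score box b for class c: confidence times the
# softmax class probability (illustrative, not the exact paper recipe).
p = np.exp(class_scores - class_scores.max())
p /= p.sum()
scores = boxes[:, 4:5] * p               # shape (B, C) detection scores
print(out.shape, scores.shape)           # (7, 7, 30) (2, 20)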
+ +875 +01:04:10,559 --> 01:04:15,880 +yeah + +876 +01:04:15,880 --> 01:04:22,630 +yes in terms of model like model size +it's pretty much about the same as a + +877 +01:04:22,630 --> 01:04:26,039 +classification model because when when +you're running on bigger image + +878 +01:04:26,039 --> 01:04:29,109 +especially for faster our CNN right +cause your convolutions you don't really + +879 +01:04:29,108 --> 01:04:32,558 +introduce any more parameters the full +impact of layers are not really anymore + +880 +01:04:32,559 --> 01:04:35,829 +parameters you have a couple extra +parameters for the region proposal + +881 +01:04:35,829 --> 01:04:38,798 +network but it's basically the same +number primaries as a classification + +882 +01:04:38,798 --> 01:04:45,619 +model right I guess I guess we're done a +little early today + diff --git a/captions/Ko/Lecture10_ko.srt b/captions/Ko/Lecture10_ko.srt new file mode 100644 index 00000000..48dad07c --- /dev/null +++ b/captions/Ko/Lecture10_ko.srt @@ -0,0 +1,3860 @@ +1 +00:00:00,000 --> 00:00:04,129 + 우리를 신뢰 + +2 +00:00:04,129 --> 00:00:12,109 + 확인 그것은 우리가 곧 그래서 오늘 우리가 얘기 할 수 있습니다 시작합니다 확인 좋은 작품 + +3 +00:00:12,109 --> 00:00:15,199 + 내가 가장 좋아하는 주제 내 좋아하는 일 중 하나입니다 재발 성 신경 네트워크 + +4 +00:00:15,199 --> 00:00:18,960 + 모델은 도처에 많이 신경 네트워크에 입력으로 재생 + +5 +00:00:18,960 --> 00:00:23,009 + 재미 관리 높은 임시 직원의 관점에서 놀 리콜 + +6 +00:00:23,009 --> 00:00:26,089 + 수요일에 여러분의 중간 고사는이 수요일 당신은 정말이야 말할 수 있습니다 + +7 +00:00:26,089 --> 00:00:32,738 + 너희들은 나에게 매우 흥분 흥분하는 경우 내가 아는 흥분 무엇 묘지는 것 + +8 +00:00:32,738 --> 00:00:37,979 + 그렇게 그는이 그것 때문에 수요일에 밖으로있을 것이다 것이 수요일 인해 외출 + +9 +00:00:37,979 --> 00:00:40,429 + 월요일에 지금부터 주 그러나 나는 우리가 그것을 이동하고 이후가 생각하는 생각 + +10 +00:00:40,429 --> 00:00:43,399 + 우리가 계획 수요일은 오늘 발표했다합니다 그러나 우리는거야에 출하 할 수있어 + +11 +00:00:43,399 --> 00:00:47,129 + 대략 수요일 그래서 우리거야 아마 몇 일에 대한 첫 번째 마감일과 + +12 +00:00:47,130 --> 00:00:51,179 + 그런 다음 삼십팔일를 사용하는 경우, 그래서 오해의 그에게 할당 금요일에 기인 + +13 +00:00:51,179 --> 00:00:55,119 + 당신은 일을 당신의 너무 많은 희망을 갖고 오늘을 낳게 될 것입니다 우리의 + +14 +00:00:55,119 --> 00:01:01,089 + 72 또는 여러 사람과 사람 아래로는 대부분의 위대​​한 찾고 좋아 완료 + +15 +00:01:01,090 --> 00:01:04,549 + 클래스에 너무 현재 잘하는 해변 신경오고에 대해 얘기했다 + +16 +00:01:04,549 --> 00:01:07,820 + 네트워크 라스 카사스는 특히 우리는 시각화 이해를 보았다 + +17 +00:01:07,819 --> 00:01:11,618 + 길쌈 신경망 우리는 예쁜 그림의 모두 볼 수 있도록하고 + +18 +00:01:11,618 --> 00:01:14,938 + 비디오는 그래서 우리는 많은 재미 그가 달성 정확히 어떤 해석을 시도했다 + +19 +00:01:14,938 --> 00:01:17,828 + 모든 네트워크는 그들이 그렇게 작업하고있는 방법을 학습하는 일을하고 있습니다 + +20 +00:01:17,828 --> 00:01:24,188 + 그래서 우리는 당신이에서 호출 될 수있는 몇 가지 방법을 통해이 문제를 디버깅 + +21 +00:01:24,188 --> 00:01:27,408 + 구조는 실제로 주말에 나는 다른 시각화에 의해 발견 + +22 +00:01:27,409 --> 00:01:32,569 + 내가 트위터에서 이러한 발견 새로운 그들은 정말 근사하고 잘 모르겠어요 + +23 +00:01:32,569 --> 00:01:37,118 + 어떻게 너무 많은 설명이 아니기 때문에 사람들이 만든 방법 + +24 +00:01:37,118 --> 00:01:43,099 + 이 거북 독 거미이고, 다음이 연결 및 어떤 종류처럼 만 보인다 + +25 +00:01:43,099 --> 00:01:47,468 + 이렇게 개 등 방식 나는 그것이 견과류 같은 생각 + +26 +00:01:47,468 --> 00:01:50,509 + 다시 이미지로 최적화 그러나 그들은에서 다른 정례화를 사용하는 + +27 +00:01:50,509 --> 00:01:53,679 + 이 경우 이미지가 나는 그들이이있는 양​​자 필터를 사용하고 생각 + +28 +00:01:53,679 --> 00:01:57,049 + 멋진 필터의 종류는 해당 이미지에 해당 정규화를 넣어 그래서 만약 내 + +29 +00:01:57,049 --> 00:01:59,420 + 느낌이 당신이 달성 시각화의 종류가 있다는 것입니다 + +30 +00:01:59,420 --> 00:02:03,659 + 대신에 꽤 멋진 보이지만 내가가는 정확히 무엇인지 확실하지 않다 그래서 + +31 +00:02:03,659 --> 00:02:04,549 + 우리는 곧 알게 될 같아요 + +32 +00:02:04,549 --> 00:02:10,360 + 확인 그래서 오늘 우리는 재발 성 신경 네트워크 무엇의에 대해 얘기 할거야 + +33 +00:02:10,360 --> 00:02:13,520 + 재발 성 신경 네트워크에 대한 좋은들은 많은 유연성에 제공한다는 것입니다 + +34 +00:02:13,520 --> 00:02:15,870 + 네트워크 아키텍처를 배선하는 방법 + 
+35 +00:02:15,870 --> 00:02:18,650 + 더 작동하지 않을 때 보통의가 바로 여기 왼쪽에있는 경우를 들어 보자 + +36 +00:02:18,650 --> 00:02:22,849 + 당신이 빨간색으로 여기에 고정 된 크기의 사진을 제공하는 경우 당신은 그것을 처리 + +37 +00:02:22,848 --> 00:02:27,639 + 녹색 다음 몇 가지 숨겨진 레이어 내가 산에 더 나은을보고 수정을 생산 + +38 +00:02:27,639 --> 00:02:30,738 + 이미지가 제공하는 수정 I 문이며 우리는 고정을 생산하고 + +39 +00:02:30,739 --> 00:02:34,469 + 가장 가까운 코스가 크기 사진 때 재발 신경망 우리 + +40 +00:02:34,469 --> 00:02:38,239 + 실제로 입출력 또는 둘 모두에서에서 시퀀스 순서를 통해 작동 할 수 있습니다 + +41 +00:02:38,239 --> 00:02:41,319 + 동시에 영상 자막의 경우 예를 들어, 우리가 일부를 볼 수 있도록 + +42 +00:02:41,318 --> 00:02:44,689 + 그것을 오늘 당신은 반복을 통해 다음 고정 된 크기의 이미지를 부여하고 + +43 +00:02:44,689 --> 00:02:47,829 + 신경망 우리는를 설명하는 단어의 시퀀스를 생성하는 것 + +44 +00:02:47,829 --> 00:02:52,560 + 그래서 그 이미지의 내용에 대한 캡션입니다 문장이 될 것 + +45 +00:02:52,560 --> 00:02:55,969 + 그 예를 들면 로비 감정 분류의 경우, + +46 +00:02:55,969 --> 00:02:59,759 + 단어와 장식 조각의 수를 소모하고, 그들은 클래스에 시도 할 것이다 + +47 +00:02:59,759 --> 00:03:03,828 + 드라이버 그 문장의 감정의 경우 긍정적 또는 부정적 + +48 +00:03:03,829 --> 00:03:07,590 + 기계 번역 우리는 우리 소요 재발 성 신경 네트워크를 가질 수있다 + +49 +00:03:07,590 --> 00:03:12,069 + 다음 말 영어 단어의 수는 단어의 개수를 생성하라는 + +50 +00:03:12,068 --> 00:03:17,119 + 프랑스어 번역 그래서 우리는이 말 앤드류 재발 성 신경 네트워크를 공급했던 것과 + +51 +00:03:17,120 --> 00:03:20,280 + 우리는 설정의 순서 종류의 순서를 호출 등이 작업 여부는 작업 + +52 +00:03:20,280 --> 00:03:25,169 + 단지 프랑스어와 영어에 임의의 문장에 대한 번역을 수행 + +53 +00:03:25,169 --> 00:03:28,000 + 당신이 할 수있는 경우, 예를 들어 지난 경우 우리는 비디오 분류를 + +54 +00:03:28,000 --> 00:03:31,699 + 클래스의 일부 번호와 비디오의 매 프레임을 분류하는 상상 + +55 +00:03:31,699 --> 00:03:35,429 + 하지만 결정적으로의 전용 함수로 예측 싶지 않아 + +56 +00:03:35,430 --> 00:03:38,739 + 현재 시간은 모든 것을 비디오의 현재 프레임 단계하지만 + +57 +00:03:38,739 --> 00:03:41,909 + 재발 성 신경 네트워크가 당신을 허용하는 비디오에 전에왔다 + +58 +00:03:41,909 --> 00:03:44,680 + 건축 와이어 최대 위치를 예측하는 매 시간 단계 + +59 +00:03:44,680 --> 00:03:48,760 + 지금까지도 그 지점까지 들어오는 모든 프레임의 함수이다 + +60 +00:03:48,759 --> 00:03:52,388 + 만약 입력 또는 출력 여전히 재발을 사용할 수있는 서열을 가지고 있지 않다면 + +61 +00:03:52,389 --> 00:03:55,250 + 심지어 당신이 처리 할 수​​ 있기 때문에 매우 왼쪽에있는 경우 신경망 당신의 + +62 +00:03:55,250 --> 00:04:01,560 + 예를 들어, 입력 또는 출력 순차적으로 내가 가장 좋아하는 예제 중 하나를 수정 + +63 +00:04:01,560 --> 00:04:05,189 + 이 잠시 전에 우리가 노력하고 대한 깊은 광산에서 사람들된다 + +64 +00:04:05,189 --> 00:04:09,750 + 집 번호를 전사 단지에이 큰 이미지 발을 갖는 대신 + +65 +00:04:09,750 --> 00:04:13,530 + 의견 그들이 와서 집 번호가에 정확히 분류하는 시도 + +66 +00:04:13,530 --> 00:04:16,649 + 작은 거기에 재발 신경 네트워크 정책과 그 와서 + +67 +00:04:16,649 --> 00:04:19,779 + 특히 때문에 자신의 재발 성 신경 네트워크와 함께 이미지 주위 조향 + +68 +00:04:19,779 --> 00:04:23,969 + 왼쪽에서 오른쪽으로 현재 작업은 기본적으로 집 번호를 판독을 배운 + +69 +00:04:23,970 --> 00:04:26,870 + 순차적 그래서 우리는 입력으로 사진을 가지고 있지만 우리는 그것을 처리하고 + +70 +00:04:26,870 --> 00:04:32,019 + 순차적으로 반대로 우리가 이것에 대해 생각할 수도 잘 알려진 사람이다 + +71 +00:04:32,019 --> 00:04:35,879 + 이것은 당신이 모델에서 샘플을 여기서 볼 수 있습니다하는지 일반 모델 그리기 + +72 +00:04:35,879 --> 00:04:39,490 + 이들 숫자 샘플과 함께오고 있지만 결정적으로 우리는 단지하지 않은 경우 + +73 +00:04:39,490 --> 00:04:42,860 + 한 번에이 숫자를 예측하지만 우리는 우리의 현재 네트워크와 우리가 + +74 +00:04:42,860 --> 00:04:47,540 + 캔버스로까지 생각하고 커널에가는 시간이 지남에 그린과 + +75 +00:04:47,540 --> 00:04:50,200 + 그래서 당신은 자신에게 실제로 전에 몇 가지 계산을 할 수있는 더 많은 기회를 제공하고 있습니다 + +76 +00:04:50,199 --> 00:04:53,479 + 실제로 당신이 처리 형태의 더 강력한 종류의 것을있어 생산 + +77 +00:04:53,480 --> 00:05:14,189 + 데이터는이 지금은 무엇을 의미하는지 정확히의 특성을 통해 질문이었다 + +78 +00:05:14,189 --> 00:05:19,310 + 일이 일이 너무 그래서 그냥 표시 에로스는 기능 의존도를 나타냅니다 + +79 +00:05:19,310 --> 00:05:23,139 + 거친만큼 일하기 전에 우리가가는 것을 정확하게 그처럼 보인다 + +80 +00:05:23,139 --> 00:05:37,168 + 네트워크가 많이 보았다 그래서이 너무 좋아 조금 집 번호를 생성 + +81 +00:05:37,168 --> 00:05:41,219 + 집 숫자와 이러한 그림의 방법으로 와서 그래서 이들에없는 + +82 +00:05:41,220 --> 00:05:44,830 + 이들의 훈련 
일이이의 모델 없음에서 숫자를 만들어 + +83 +00:05:44,829 --> 00:05:48,219 + 실제로 교육이 만들어집니다 세트 + +84 +00:05:48,220 --> 00:05:51,689 + 그래, 그들은 아주 진짜 보이지만 그들은 실제로 현지에서 만든 것 + +85 +00:05:51,689 --> 00:05:55,809 + 그래서 재발 성 신경 네트워크 그가 발언이 것은 기본적이며, + +86 +00:05:55,809 --> 00:06:00,979 + 녹색 및 그 상태를 가지며, 그것은 기본적으로 시간을 통해 수신하고 그것을 + +87 +00:06:00,978 --> 00:06:04,859 + 우리가에 입력 벡터에 공급 할 수있는 여배우 그래서 매번 수신 + +88 +00:06:04,860 --> 00:06:08,538 + 무장 한 남자와 내부적으로 어떤 상태를 가지고 있으며, 다음은 그 수정할 수 있습니다 + +89 +00:06:08,538 --> 00:06:12,988 + 그것은 매 시간 단계를 받고 그래서 것의​​ 함수로서 상태 + +90 +00:06:12,988 --> 00:06:17,258 + 그들은 우리가 그 폐기물를 켤 때 물론 모든 무게와 CNN 등 수있어 + +91 +00:06:17,259 --> 00:06:20,829 + 명시된 목표는 수신 한 방법의 측면에서 아놀드 다른 동작 + +92 +00:06:20,829 --> 00:06:25,769 + 나는 보통 우리는 또한 생산에 관심이있을 수 있습니다 면제 및 모든하지만, + +93 +00:06:25,769 --> 00:06:30,429 + 우리가 지금 만 시간의 상단에 이러한 문제를 생성 할 수 있도록 R & S 상태에 따라 + +94 +00:06:30,428 --> 00:06:33,988 + 그래서 당신은 이런 쇼 사진을 볼 수 있지만, 난 그냥 아르 논 것을 알고 싶다 + +95 +00:06:33,988 --> 00:06:36,688 + 정말 블록은 중간에 + +96 +00:06:36,689 --> 00:06:39,489 + 상태로 근무하고 시간이 지남에 사진을받을 수 있으며, 우리는 할 수 있습니다 + +97 +00:06:39,488 --> 00:06:44,838 + 그래서 완전히 일부 응용 프로그램에서 상태의 상단에 몇 가지 예측 방법 + +98 +00:06:44,838 --> 00:06:50,610 + 육군은 지적 이하 상태의 일종을 가지고있는 것처럼이 보일 것이다 + +99 +00:06:50,610 --> 00:06:55,399 + 빅터 H하고이 또한 될 수있는 의사의 집합은 두 개있다 + +100 +00:06:55,399 --> 00:07:00,939 + 일반 상태와 우리는 이전의 함수로의 기반하지 않은거야 + +101 +00:07:00,939 --> 00:07:05,769 + 하나 뺀 상태 관리 시간 IT 및 현재의 입력 벡터 (60)이 + +102 +00:07:05,769 --> 00:07:08,338 + 내가 재발 함수를 호출합니다 함수를 통해 수행 할 것입니다 + +103 +00:07:08,338 --> 00:07:13,728 + 그 기능은 W 매개 변수가되고 우리는 우리 (W) 그 변경으로 우리는있어 + +104 +00:07:13,728 --> 00:07:16,228 + 물론 우리가 원하는 다음 아놀드 다른 행동을보고 가서 + +105 +00:07:16,228 --> 00:07:19,338 + 일부 특정 동작은 아르 논은 우리가 그 무게를 훈련 할 겁니다된다 + +106 +00:07:19,338 --> 00:07:23,639 + 씰 지금은 그 노래의 예를 참조에서 나는 같은 점에 유의하고 싶습니다 + +107 +00:07:23,639 --> 00:07:28,209 + 함수는 무게의 고정 기능 w와 함께 매 시간 단계에서 사용된다 + +108 +00:07:28,209 --> 00:07:31,778 + 우리는 매번 물건에 그 하나의 기능을 담당하고 허용 + +109 +00:07:31,778 --> 00:07:35,928 + 우리는 커밋하지 않고 일련의 외부 네트워크를 사용하는 + +110 +00:07:35,928 --> 00:07:38,778 + 시퀀스의 크기는 우리의 모든 동일한 기능을 적용하기 때문에 + +111 +00:07:38,778 --> 00:07:43,528 + 한 시간 간격에 상관없이 시간을 입력 또는 출력 순서가 그렇게되어 + +112 +00:07:43,528 --> 00:07:46,769 + 재발 신경망의 재발 성 신경 네트워크의 특정한 경우 + +113 +00:07:46,769 --> 00:07:50,309 + 당신이 사용할 수있는 간단한 재발이를 설정할 수 있습니다 간단한 방법은 무엇입니까 + +114 +00:07:50,309 --> 00:07:54,569 + 재발 성 신경 네트워크의 상태가이 경우에 약간의 경고로 (42) 기타 + +115 +00:07:54,569 --> 00:08:00,569 + 단지 하나의 상태 시간 후 우리는 기본적으로 알려주는 크로스 공식이 + +116 +00:08:00,569 --> 00:08:04,039 + 당신은 이전 머리의 함수로 숨겨진 상태 나이를 업데이트하는 방법 + +117 +00:08:04,038 --> 00:08:04,688 + 국가의 + +118 +00:08:04,689 --> 00:08:08,369 + 현재 입력 엑스타인 특히 우리가있어 간단한 경우와 + +119 +00:08:08,369 --> 00:08:10,349 + 이러한 가중치 행렬의 whaaa을해야 할 것 + +120 +00:08:10,348 --> 00:08:15,238 + WX 연령 그들은 기본적으로 숨겨진 상태에서 모두 프로젝트거야 + +121 +00:08:15,238 --> 00:08:18,238 + 다음 현재의 입력과 이들 추가하려고하는 이전의 시간 + +122 +00:08:18,238 --> 00:08:21,978 + 그리고, 우리는 모든 연령에서 그들을 뭉개 버려 그것은 우리가 상태를 업데이트하는 방법 + +123 +00:08:21,978 --> 00:08:26,199 + 이 재발 그래서 시간 t는의 함수로 어떻게 전체의 변화를 말하고 그 + +124 +00:08:26,199 --> 00:08:29,769 + 역사와 또한이 시간에 현재 입력 한 다음 우리는 할 수 있습니다 + +125 +00:08:29,769 --> 00:08:34,129 + 예측은 우리는 또 다른를 사용하여 예를 들어 H의 상단에 예측을 기반으로 할 수 있습니다 + +126 +00:08:34,129 --> 00:08:37,528 + 언덕 국가의 상단에 매트릭스 투영 그래서 이것은 단순한 완료 + +127 +00:08:37,528 --> 00:08:42,288 + 당신의 인생 작업에 연결할 수있는 경우는, 그래서 그냥 당신의 예를 제공합니다 + +128 +00:08:42,288 --> 00:08:46,639 + 이 지금 작동하는 방법 난 그냥 섹스의 나이와 왜 추상을 이야기 해요 + +129 +00:08:46,639 --> 00:08:49,299 + 우리가 실제로 이러한 요인 끝낼 수 배우면에서 형태 + +130 
+00:08:49,299 --> 00:08:53,059 + 의미와 방법 그래서 하나는 우리는 재발 성 신경을 사용할 수 있습니다 + +131 +00:08:53,059 --> 00:08:56,149 + 문자 레벨 언어 모델의 경우에서와 같이, 네트워크 및이 중 하나이다 + +132 +00:08:56,149 --> 00:08:59,899 + 직관적 인과 보는 재미 때문에 우리의 다음 설명의 나의 마음에 드는 방법 + +133 +00:08:59,899 --> 00:09:04,698 + 그래서이 경우에 우리는 우리의 춤과를 사용하여 문자 수준의 언어 모델을 + +134 +00:09:04,698 --> 00:09:07,859 + 우리는에 문자의 시퀀스를 공급하므로 방법이 작동 + +135 +00:09:07,860 --> 00:09:10,899 + 직장과 매 시간 단계에서 역할을 반복하면 재발을 요청합니다 + +136 +00:09:10,899 --> 00:09:14,299 + 신경망은 시퀀스의 다음 문자를 예측한다 예측할 + +137 +00:09:14,299 --> 00:09:16,909 + 이 시퀀스에서 다음에 와야 어떻게 생각하는지에 대한 전체 분포 + +138 +00:09:16,909 --> 00:09:21,120 + 즉, 지금까지 나는이 아주 간단한 예에서 우리는이 있다고 가정 있도록 보았다 + +139 +00:09:21,120 --> 00:09:25,610 + 트레이닝 시퀀스 안녕하세요 그리고 우리는에있는 문자 어휘가 + +140 +00:09:25,610 --> 00:09:29,870 + ATL의와 우리가 배울 수있는 재발 성 신경 네트워크를 얻으려고하는거야 + +141 +00:09:29,870 --> 00:09:33,289 + 이 훈련 데이터에 시퀀스 내의 다음 문자를 예측하는 방법이 너무 + +142 +00:09:33,289 --> 00:09:37,000 + (A)에서 이러한 문자의 모든 하나 하나에 공급됩니다 설정합니다으로 작동합니다 + +143 +00:09:37,000 --> 00:09:40,509 + 재발 성 신경 네트워크에 시간은 처음 채팅에서 볼 수 있습니다 + +144 +00:09:40,509 --> 00:09:47,110 + 단계 및 x 축 듣고 그래서 우리는 H II L & L 및하겠습니다 시간 시간입니다 + +145 +00:09:47,110 --> 00:09:50,629 + 히로미 코팅 문자는 우리가 하나의 뜨거운 표현 여기서 우리가 부르는 사용 + +146 +00:09:50,629 --> 00:09:53,889 + 그냥 쓴 그 문자에 대응 주문 있고 어휘 켜져 + +147 +00:09:53,889 --> 00:09:58,129 + 지금 우리는 내가 보여 재발 수식 당신이 그것을 착용을 사용하는거야 + +148 +00:09:58,129 --> 00:10:01,860 + 매 시간 단계는 우리가 80로 시작하고 우리는이 적용된 가정 + +149 +00:10:01,860 --> 00:10:04,720 + 사용하여 숨겨진 상태 유권자 매 시간 단계를 계산 요청 + +150 +00:10:04,720 --> 00:10:08,790 + 이 수정 재발의 공식은 그래서 우리는 상태에서 단지 3 %가 여기에 가정 + +151 +00:10:08,789 --> 00:10:11,099 + 우리는 세 가지 차원 표현으로 끝낼거야 그 + +152 +00:10:11,100 --> 00:10:13,040 + 기본적으로 어느 시점에서 + +153 +00:10:13,039 --> 00:10:15,759 + 온까지 모든 문자 요약 + +154 +00:10:15,759 --> 00:10:20,159 + 그래서 우리는 이것이 필요 적용 할 필요가 매 시간 스텝 지금 + +155 +00:10:20,159 --> 00:10:23,139 + 우리는 매번 다음해야 어떤 단계 것으로 예측거야 + +156 +00:10:23,139 --> 00:10:27,569 + 우리는이 네 개의 문자를 가지고 있기 때문에 그래서 예를 들어 시퀀스에서 문자 + +157 +00:10:27,570 --> 00:10:32,100 + 이것은 우리에 대한 그래서 매 시간에 전화 번호를 보호하기 위해거야 + +158 +00:10:32,100 --> 00:10:37,139 + 우리는 편지 H와 RNN과에서 말한 아주 처음에 예 + +159 +00:10:37,139 --> 00:10:40,799 + 무게 컴퓨터의 현재 설정이 그가의 정규화 된 잠금 문제입니다 + +160 +00:10:40,799 --> 00:10:42,959 + 여기 옆에 와서해야한다고 생각하는 것에 대해 + +161 +00:10:42,960 --> 00:10:47,950 + H 가능성이 아니라 2.2으로 먹고 다음 일을 올 110 가능성이 있으므로 일 + +162 +00:10:47,950 --> 00:10:52,640 + 하지 않는 한 로트의 측면에서 가능성이 지금 가능성 및 OS 4.1 세 부정적 + +163 +00:10:52,639 --> 00:10:56,409 + 물론 확률은 우리는이 훈련 순서로 우리가 알고있는 것을 알고 우리가 + +164 +00:10:56,409 --> 00:11:00,669 + 녹색으로 표시되어이 2.2가 정확한 사실 때문에 각을 따라야한다 + +165 +00:11:00,669 --> 00:11:04,559 + 이 경우에 답하고 그래서 우리는 그 높은 것으로 원하는 우리는이 모든 작업을 수행합니다 + +166 +00:11:04,559 --> 00:11:07,799 + 다른 숫자는 우리가 기본적으로 가지고 매 시간에 낮아야합니다 + +167 +00:11:07,799 --> 00:11:12,209 + 다음 문자가 순서대로 제공해야하는지에 대한 목표 그래서 우리는 단지 원하는 + +168 +00:11:12,210 --> 00:11:15,470 + 모든 숫자는 높은하고 다른 모든 숫자는 낮은 것으로 그래서 그의의 + +169 +00:11:15,470 --> 00:11:19,950 + 상기 녹색 신호 손실 함수에 포함하고 있다는 점에서 포함 과정 + +170 +00:11:19,950 --> 00:11:23,220 + 다시 이러한 연결 그렇게 생각하는 또 다른 방법을 통해 전파됩니다 + +171 +00:11:23,220 --> 00:11:26,600 + 그것은 매 시간 단계는 우리가 기본적으로 소프트 맥스 분류를 가지고있다에 대해 + +172 +00:11:26,600 --> 00:11:31,300 + 그래서이 모든 일 다음 문자를 통해 소프트 맥스 분류 및 + +173 +00:11:31,299 --> 00:11:34,269 + 모든 단일 지점에서 우리는 다음 문자가 있어야한다 그래서 우리가 알고 + +174 +00:11:34,269 --> 00:11:37,879 + 모든 손실은 상단에서 아래로 둔화 얻을 그들은 모두를 통과한다 + +175 +00:11:37,879 --> 00:11:41,179 + 모든 화살을 거꾸로이 그래프에서 기울기를 얻기 위하여려고 된 모든 + +176 +00:11:41,179 --> 00:11:44,479 + 체중 행렬 그리고, 우리는 매트릭스를 이동하는 방법을 알게되도록 + +177 
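The per-timestep softmax loss described above, as a small sketch; the example scores mirror the slide's numbers after seeing 'h', where 'e' (index 1) is the correct next character.

import numpy as np

# Normalize the scores and take the negative log probability of the
# correct next character; one such loss is computed at every time step.
def softmax_loss(y, target):
    p = np.exp(y - np.max(y))              # shifted for numerical stability
    p /= np.sum(p)
    return -np.log(p[target]), p

y = np.array([1.0, 2.2, -3.0, 4.1])        # scores over [h, e, l, o]
loss, p = softmax_loss(y, target=1)        # correct next char is 'e'
# The sequence loss is the sum of these per-step losses; each step's
# gradient flows back into the same shared W_xh, W_hh, W_hy matrices,
# which is why the gradients are accumulated with += in the backward pass.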
+00:11:44,480 --> 00:11:50,039 + 정확한 문제는 우리가 그 무게를 형성 할 것 아르 논에서 나오는된다 + +178 +00:11:50,039 --> 00:11:53,599 + 그 올바른 행동 때문에 군대는 먹이 올바른 동작을 + +179 +00:11:53,600 --> 00:11:57,750 + 당신과 같은 문자는 우리에 대한 다른 질문이 뒤집 수있는 방법을 상상할 수있다 + +180 +00:11:57,750 --> 00:12:02,879 + 그림 + +181 +00:12:02,879 --> 00:12:08,750 + 그래 나는 장면을 되풀이을 언급 한 바와 같이 그렇게 필사적으로 거짓말을 주셔서 감사합니다 + +182 +00:12:08,750 --> 00:12:13,320 + 항상 동일한 기능 그래서 우리는 하나의 WX 환자마다 단계 우리가 + +183 +00:12:13,320 --> 00:12:17,010 + 같은 whah의 모든 시간 단계에서 하나의 WHYY마다 단계에 적용했습니다 + +184 +00:12:17,009 --> 00:12:23,830 + 여기에서 우리는 WX를 사용했습니다 AWH 이유도 및 다시 awhh 네 번 + +185 +00:12:23,830 --> 00:12:27,720 + 우리 모두를해야하기 때문에 전파 우리는 YouTube에 계정을 통해 얻을 때 + +186 +00:12:27,720 --> 00:12:30,750 + 동일한 가중치 행렬을 더하여 이들 구배는 사용 되었기 때문에 + +187 +00:12:30,750 --> 00:12:35,879 + 여러 시간 단계와에 이것은 당신이 가변적으로 알고 처리 할 수​​있게 해준다 것입니다 + +188 +00:12:35,879 --> 00:12:38,960 + 크기 입력 우리는 같은 일을 그렇게하지 ​​않는 일을하는지마다 때문에 + +189 +00:12:38,960 --> 00:12:48,540 + 사물의 절대 금액의 기능과 공통 무엇인지 질문 + +190 +00:12:48,539 --> 00:12:52,579 + 제 80 초기화하는 일이 내가 생각하는 미국 상원 (20)이 아주 아주 + +191 +00:12:52,580 --> 00:13:00,650 + 처음에 공통되지만 순서에 따라 데이터를 수신처 않는다 + +192 +00:13:00,649 --> 00:13:01,289 + 그 문제 + +193 +00:13:01,289 --> 00:13:11,299 + 예 있기 때문에 그래서 당신은 만약 그렇다면 다른 순서로 이러한 문자를 요구하고 + +194 +00:13:11,299 --> 00:13:14,359 + 이이 경우에는이 경우의 순서 긴 시퀀스 않았는지 + +195 +00:13:14,360 --> 00:13:17,870 + 당신에 대해 생각하면 항상 시간에하기 때문에 하나 하나 점을 중요하지 않습니다 + +196 +00:13:17,870 --> 00:13:21,299 + 그것은 기능적 같이 이것은 팩터의 함수로서 시간이 단계에서의 + +197 +00:13:21,299 --> 00:13:26,859 + 그것은 바로 그래서 장애는 그냥 오랫동안 중요한 전에 온 모든 + +198 +00:13:26,860 --> 00:13:31,590 + 당신이 그것을 읽고있는 것처럼 우리는 몇 가지 구체적인 통해 체를 통해 갈거야 + +199 +00:13:31,590 --> 00:13:36,149 + 나는 이러한 점들을 명확히 생각 예는 특정보고하기 + +200 +00:13:36,149 --> 00:13:38,980 + 당신은 그것의 언어 모델의 특성을 시도 할 경우 실제로 예 + +201 +00:13:38,980 --> 00:13:43,350 + 이 곳에 아주 짧은 그래서 난 그냥 당신이 좋은 가정을 찾을 수를 썼다 + +202 +00:13:43,350 --> 00:13:47,220 + 정확도 수준 NumPy와 백에 줄 응용 프로그램입니다 그리고 당신은 갈 수 있습니다 + +203 +00:13:47,220 --> 00:13:49,840 + 당신이를 통해 실제 활성 단계 그래서 당신은 구체적으로 볼 수 있습니다 + +204 +00:13:49,840 --> 00:13:53,220 + 우리는이 재발 신경 네트워크에 미치는 영향을 훈련 할 수 그래서 내가 갈거야 방법 + +205 +00:13:53,220 --> 00:13:58,250 + 이 때문에 우리는 처음에 모든 블록을 통해 갈거야 단계별로 + +206 +00:13:58,250 --> 00:14:02,389 + 당신은 여기에만 의존은 우리가 일부 텍스트 데이터를로드하는 것을 볼 수 있습니다로 + +207 +00:14:02,389 --> 00:14:05,569 + 여기에 우리의 입력의 큰 순서 그냥 큰 모음입니다 + +208 +00:14:05,570 --> 00:14:10,090 + 이 경우 문자 파일을 TXT하고 우리 모두가 얻을 텍스트 입력 + +209 +00:14:10,090 --> 00:14:14,810 + 해당 파일의 문자 우리는 해당 파일의 모든 고유 문자를 찾을 수 + +210 +00:14:14,809 --> 00:14:18,179 + 계절의 특성에 매핑이 매핑 사전을 만들 + +211 +00:14:18,179 --> 00:14:23,120 + 인덱스에서 두 문자 우리는 기본적으로 우리의 문자가 너무 빵을 보이는 주문 + +212 +00:14:23,120 --> 00:14:27,350 + 파일로의 전체 무리와 데이터의 전체 무리 우리 백 한 + +213 +00:14:27,350 --> 00:14:30,860 + 문자 나처럼 뭔가 그래서 우리는 순서에 그들을 주문 + +214 +00:14:30,860 --> 00:14:36,300 + 여기에 모든 문자 남자에 연관 지수는 우리는 라이센스를 감소거야 + +215 +00:14:36,299 --> 00:14:39,899 + 당신이 재발 성 신경로 볼 수 있습니다로 첫 번째 하이퍼 기본 크기 숨겨져 있습니다 + +216 +00:14:39,899 --> 00:14:43,100 + 그렇게하면 네트워크 우리는 학습율이 여기 백으로 사용하지 않을 + +217 +00:14:43,100 --> 00:14:46,720 + 스물 다섯이이 매개 변수가 최대 여기 시퀀스 길이입니다 + +218 +00:14:46,720 --> 00:14:51,019 + 우리의 입력 데이터가 길 경우 당신은 문제가 무엇인지 알게 될 것입니다 수 있습니다 + +219 +00:14:51,019 --> 00:14:53,899 + 시간의 수백만 같은 너무 큰 말은 UPS는 당신이 넣을 수있는 방법은 없습니다 + +220 +00:14:53,899 --> 00:14:56,870 + 다란과의 모든 위에 우리가 물건을 모두 유지할 필요가 있기 때문에 + +221 +00:14:56,870 --> 00:15:00,070 + 당신이 실제로 전파를 다시 할 수 있도록 메모리 우리는 할 수 없습니다 + +222 +00:15:00,070 --> 00:15:03,540 + 우리가 갈거야 그것 모두를 통해 그것의 모든 메모리와 다시 문질러 두 남자를 유지 + +223 +00:15:03,539 --> 00:15:07,139 + 이 경우 
우리의 입력 데이터를 통해 덩어리로 우리는 25의 덩어리에서 통해거야 + +224 +00:15:07,139 --> 00:15:09,230 + 당신은 약간의 시간을 볼 수 있도록 + +225 +00:15:09,230 --> 00:15:14,769 + 우리는이 전체 데이터 집합을 가지고 있지만 25 문자의 덩어리로 갈 것 + +226 +00:15:14,769 --> 00:15:19,509 + 시간과 우리가 백업거야 때마다 시간에 25 자 통과 + +227 +00:15:19,509 --> 00:15:22,149 + 우리는 우리가 가지고 있기 때문에 이상에 대한 전파을 다시 할 여유가 없기 때문에 + +228 +00:15:22,149 --> 00:15:26,899 + 모든 물건을 기억하고 우리는 여기에 25 그리고 우리 덩어리거야 + +229 +00:15:26,899 --> 00:15:30,789 + 모두 여기에 내가 무작위로 분석하고있어 이러한 W 행렬과 일부 상자 WX 그래서이 + +230 +00:15:30,789 --> 00:15:34,709 + HHH와 HY 그 우리의 매개 변수의 모든 과대 광고의 모든 우리는 거 야된다 + +231 +00:15:34,710 --> 00:15:36,790 + backrub를 양성하는 + +232 +00:15:36,789 --> 00:15:40,699 + 나는 여기에 손실 함수를 통해 건너 갈거야 내가 바닥에 도착하는거야 + +233 +00:15:40,700 --> 00:15:44,020 + 여기에 스크립트의 우리는 메인 루프를 가지고 있고이 중 일부를 통해 갈거야 + +234 +00:15:44,019 --> 00:15:48,399 + 여기에 다양한 몇 가지 초기화가 그래서 20 지금 보일 수 있습니다 + +235 +00:15:48,399 --> 00:15:50,829 + 다음 처음에 우리는 영원히 찾고 + +236 +00:15:50,830 --> 00:15:54,960 + 우리가 여기서하고있는 것은 그래서 여기에 데이터의 배치가 어디 실제로 샘플링 + +237 +00:15:54,960 --> 00:15:58,970 + 그 목록에, 그래서이 데이터 세트에서 25 문자의 배치를 취할 + +238 +00:15:58,970 --> 00:16:03,019 + 입력하고 목록 및 둔다는 기본적으로 단지가 25 정수는 대응 + +239 +00:16:03,019 --> 00:16:06,919 + 당신이 볼로 문자 대상은 모두 같은 문자하지만 오프셋 + +240 +00:16:06,919 --> 00:16:09,909 + 하나 그 때문에 우리는 모든을 예측하려는 인덱스는 + +241 +00:16:09,909 --> 00:16:15,269 + 한 시간에 물건을 너무 중요한 목표는 25 자에 불과 목록입니다 + +242 +00:16:15,269 --> 00:16:20,689 + 즉 우리가 기본적으로 샘플링 무엇 때문에 대상은 같은 미래에 의해 오프셋 (offset) + +243 +00:16:20,690 --> 00:16:26,480 + 여기에 데이터에서 우리이는 그래서 몇 가지 예제 코드는 시간의 모든 단일 지점 + +244 +00:16:26,480 --> 00:16:30,659 + 이번 주 훈련 물론 내가하려고하는 것은 그것이 무엇의 일부 샘플을 생성하는 + +245 +00:16:30,659 --> 00:16:35,370 + 현재 감사 문자는 이러한 순서는 다음과 같이 보일 것입니다 실제로 무엇을 + +246 +00:16:35,370 --> 00:16:40,320 + 우리는 문자 낮은 수준의 예술가와 테스트 시간을 사용하는 방법은 우리가 걸이다 + +247 +00:16:40,320 --> 00:16:43,570 + 다음 몇 가지 문자와 함께이 항상 아니라는 것을 보게 될 것은 우리에게주는 + +248 +00:16:43,570 --> 00:16:46,379 + 당신이 샘플링을 상상할 수 있도록 시퀀스에서 다음 문자의 분포 + +249 +00:16:46,379 --> 00:16:49,259 + 그것에서 다음 다음 문자의 위업은에서 샘플을 받고 + +250 +00:16:49,259 --> 00:16:52,769 + 분포와에 모든 샘플을 공급 유지에 그 일을 계속 + +251 +00:16:52,769 --> 00:16:56,549 + 철, 당신은이 코드가 무엇을 할 것입니다 그 임의의 텍스트 데이터를 생성 할 수 있습니다 + +252 +00:16:56,549 --> 00:17:00,549 + 그리고 우리가 여기에 약간의 그 거 야 그래서 샘플 기능을 발생 + +253 +00:17:00,549 --> 00:17:04,250 + I는 손실 함수는 입력 대상을받는 손실 함수를 호출있어 + +254 +00:17:04,250 --> 00:17:09,160 + 그리고이 H 준비 H 압력이 자신의 상태 벡터에 대한 짧은 또한 수신 + +255 +00:17:09,160 --> 00:17:13,900 + 이전 트렁크에서 우리는 (25)의 일괄거야 그리고 우리는 유지됩니다 + +256 +00:17:13,900 --> 00:17:18,179 + 당신의 25 편지의 끝 부분에있는 최신 사진이 무엇인지를 추적하는 우리 + +257 +00:17:18,179 --> 00:17:22,400 + 우리가 다음에 다시 만날 때 우리는 그의 초기 시간으로 그에서 볼 수 있습니다 + +258 +00:17:22,400 --> 00:17:26,140 + 우리가 숨겨진 상태가 제대로 기본적으로있는 것을 확인하고, 그래서 시간 + +259 +00:17:26,140 --> 00:17:30,700 + 그 통해 일괄 배치에서 전파 그러나 우리는 다시 전파하는 + +260 +00:17:30,700 --> 00:17:35,558 + 그 25 시간 단계 그래서 우리는 손실 및 그라디언트의 기능에 적합하고 + +261 +00:17:35,558 --> 00:17:39,319 + 모든 무게 행렬과 모든 상자와 당신은 손실을 인쇄하고 + +262 +00:17:39,319 --> 00:17:44,149 + 그리고 여기에 우리가 여기에 우리가 우리에게 나이 인사를 듣는다 프라이머 업데이트 그리고 + +263 +00:17:44,150 --> 00:17:47,429 + 실제로 당신이 대학원에 업데이트로 인식해야 업데이트를 수행 + +264 +00:17:47,429 --> 00:17:53,100 + 그래서 나는이 현금으로 모든 생각하는 모든 현금이 + +265 +00:17:53,099 --> 00:17:56,819 + 그라데이션에 대한 변수는 내가 축적 한 다음를 수행하고있어 어느 제곱 + +266 +00:17:56,819 --> 00:18:00,639 + 독재 날짜 누군가가 손실 함수로 이동하고 어떻게 그처럼 보인다 + +267 +00:18:00,640 --> 00:18:05,790 + 이제 손실 함수는 정말 앞으로 구성 코드 블록이며, + +268 +00:18:05,789 --> 00:18:08,990 + 우리가 앞으로 패스 다음의 뒷면을 비교하는, 그래서 뒤로 방법 + +269 +00:18:08,990 --> 00:18:13,130 + 녹색 그래서이 두 단계를 통해 갈거야 통과하면 앞으로해야 당신에게 
전달 + +270 +00:18:13,130 --> 00:18:18,919 + 기본적으로 우리는 우리가 기다리고있어 그 적자 대상이 25를받을 얻을 인식 + +271 +00:18:18,919 --> 00:18:23,360 + 인덱스 우리는 25 일에서 그들을 통해 거래하지 않는 우리는이 텍스트를 만들 + +272 +00:18:23,359 --> 00:18:27,500 + 그럼 그냥 제로이며, 입력 벡터 우리는 그래서 하나의 뜨거운 인코딩을 설정 + +273 +00:18:27,500 --> 00:18:32,169 + 어떤 인덱스 및 자극 우리는 하나 우리가에 공급하고 그것을 설정 + +274 +00:18:32,169 --> 00:18:34,110 + 그 하나의 뜨거운 인코딩 문자 + +275 +00:18:34,109 --> 00:18:39,229 + 여기에이 식 HSI T 그래서를 사용하여 재발 수식을 계산에 + +276 +00:18:39,230 --> 00:18:42,210 + 자신의 연령이 모든 것을 다하고 하나 하나 추적하기 + +277 +00:18:42,210 --> 00:18:46,910 + 시간 물건 그래서 우리는 상태 벡터와를 사용하여 출력을 계산 + +278 +00:18:46,910 --> 00:18:50,779 + 재발 수식이 두 줄 다음 저기 난을 계산 해요 + +279 +00:18:50,779 --> 00:18:54,440 + 그래서 용의자 그래서이 정상화 작동하는지 우리는 확률을 얻을 경우 + +280 +00:18:54,440 --> 00:18:58,190 + 그 그냥 그래서 당신의 손실은 정답의 부정적인 잠금 확률 + +281 +00:18:58,190 --> 00:19:02,779 + 부드러움의 분류는 그 목적 그래서 거기 잃고 우리는 거 야 + +282 +00:19:02,779 --> 00:19:06,899 + 우리가 뒤로 이동 뒤로 패스 그래서 다시 그래프를 통해 전파 + +283 +00:19:06,900 --> 00:19:08,530 + (25)로부터의 순서를 통해 + +284 +00:19:08,529 --> 00:19:12,899 + 당신은 내가 인식합니다 다시 하나 어쩌면 모든 방법은 얼마나 많은 세부 사항을 모르는 I + +285 +00:19:12,900 --> 00:19:16,509 + 여기에 가고 싶어하지만 당신은 소프트 맥스를 통해 전파를 다시 인식합니다 + +286 +00:19:16,509 --> 00:19:19,089 + 내가 통해 전파하고 있지 않다 활성화 기능을 통해 전파 + +287 +00:19:19,089 --> 00:19:23,379 + 그것의 모든 난 그냥 모든 인사 및 모든 총리을 추가 해요 + +288 +00:19:23,380 --> 00:19:27,210 + 특히 여기에서주의해야 할 한 가지는 이러한 재료와 무게를 만드는 것입니다 + +289 +00:19:27,210 --> 00:19:31,210 + 내가 플러스를 사용하고 woahh 같은 행렬에 해당 그것은 매 시간 스텝 때문에 + +290 +00:19:31,210 --> 00:19:34,590 + 이 무게의 모든 그라데이션을 받고 행렬 우리는 축적해야 + +291 +00:19:34,589 --> 00:19:37,449 + 우리는 이러한 모든 체중 행렬을 계속 사용하기 때문에 모든 체중 행렬에 적합 + +292 +00:19:37,450 --> 00:19:43,980 + 시간이 지남에 그들로 때마다 단계에서 동일한 그래서 우리 그냥 배경에서와 + +293 +00:19:43,980 --> 00:19:48,130 + 그것은 우리에게 생기를 제공하고 우리는에서 그 손실 기능을 사용할 수 있습니다 + +294 +00:19:48,130 --> 00:19:52,580 + 기본 및 여기에 우리는 마침내 그래서 여기에 샘플링 기능은 어디 한 우리 + +295 +00:19:52,579 --> 00:19:55,960 + 실제로 그 내용에 기초하여 새로운 텍스트 데이터를 생성하는 아티스트 가려고 + +296 +00:19:55,960 --> 00:19:59,058 + 캐릭터와 방법의 통계에 변호사를 보았고를 기반으로하고있다 + +297 +00:19:59,058 --> 00:20:02,048 + 우리는 약간의 비와 함께 초기화 그래서 그들은 훈련 데이터에서 서로를 따라 + +298 +00:20:02,048 --> 00:20:06,759 + 문자, 그리고, 우리는 우리가 피곤 때까지 가서 우리가 재발을 계산 + +299 +00:20:06,759 --> 00:20:09,289 + 식 문제로부터 배포 샘플이 + +300 +00:20:09,289 --> 00:20:10,450 + 분포 + +301 +00:20:10,450 --> 00:20:15,640 + 핫 케이트 (11) 핫 표현으로 인코딩 한 후 우리는을 받​​았는데 + +302 +00:20:15,640 --> 00:20:22,460 + 우리가 실제로 200 텍스트를 얻을 때까지 우리가이 일을 계속 그래서 다음에 시간이 너무 거기 어떤 + +303 +00:20:22,460 --> 00:20:27,190 + 그냥이 작동하는 방법의 거친 레이아웃 등을 통해 질문 + +304 +00:20:27,190 --> 00:21:04,680 + 다시 $ (25) 남부 최대의 모든 배치에서 분류 우리 같은에서 사람들의 모든 + +305 +00:21:04,680 --> 00:21:14,910 + 시간과 모든 우리가 사용하는 왜 거꾸로 그건가는 연결에 추가 + +306 +00:21:14,910 --> 00:21:19,259 + 여기 정규화 당신은 내가 그것을 생략 추측 아마하지 않는 것을 확인할 수 있습니다 + +307 +00:21:19,259 --> 00:21:23,720 + 때때로 나는 정규화를 시도 여기에 있지만 일반적으로 내가 생각할 수있는 난 몰라 + +308 +00:21:23,720 --> 00:21:27,269 + 때로는 그것을 외부로 반복 너트를 사용하는 것이 일반적이다 생각 + +309 +00:21:27,269 --> 00:21:38,379 + 최악의 결과처럼 내게 준 것은 그래서 가끔 그것을 싸움 발기인의 그것의 종류를 건너 + +310 +00:21:38,380 --> 00:21:48,260 + 그래 그건 그래 그건 우리가 바로 여기에 25 샷의 순서 그래서 바로 + +311 +00:21:48,259 --> 00:21:51,839 + 매우 낮은 캐릭터 레벨에 대한 수준과 우리가 실제로 단어에 대해 걱정하지 않는다 우리는하지 않습니다 + +312 +00:21:51,839 --> 00:21:56,289 + 그 단어는 사실은 그렇지 않습니다에서처럼 문자 인덱스가 너무 arnelle를 그리워 존재 알고 + +313 +00:21:56,289 --> 00:21:58,569 + 같은 언어 또는 아무것도 그냥 그렇게 문자에 대해 아는 + +314 +00:21:58,569 --> 00:22:08,009 + 시리즈와 시퀀스 부록에 그 우리가 사용하는 조각을 모델링하고 무엇 + +315 +00:22:08,009 --> 00:22:13,460 + 대신 그 같은 문자로 공간 또는 뭔가를 사용할 수 있습니다 + +316 
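A sketch of the sampling loop described above, reusing rnn_step and the weight matrices from the earlier vanilla-RNN snippet (assumed in scope): seed with one character, softmax the scores, sample, and feed the sample back in as a one-hot vector.

import numpy as np

def sample(seed_ix, h, n, V=4):
    x = np.zeros(V)
    x[seed_ix] = 1
    ixes = []
    for _ in range(n):
        h, y = rnn_step(x, h)              # scores for the next character
        p = np.exp(y - y.max())
        p /= p.sum()
        ix = np.random.choice(V, p=p)      # sample from the distribution
        x = np.zeros(V)
        x[ix] = 1                          # feed the sample back in, one-hot
        ixes.append(ix)
    return ixes                            # indices, decoded via a mapping
                                           # like min-char-rnn's ix_to_char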
+00:22:13,460 --> 00:22:18,630 + 25 일정 배치는 그가 아마 할 수 생각하지만 다음 종류의 단지, 당신은 + +317 +00:22:18,630 --> 00:22:22,530 + 당신이 그렇게 할 이유 언어에 대한 가정이 곧 볼 수 있도록합니다 + +318 +00:22:22,529 --> 00:22:25,359 + 이에 아무것도 연결할 수 있습니다 그리고 우리는 우리가 많이 가질 수 있음을 볼 수 있기 때문에 + +319 +00:22:25,359 --> 00:22:31,539 + 그 확인과 재미 이제 우리는 우리가 텍스트의 전체 무리를 우리가하지 걸릴 수 있습니다 할 수있는 + +320 +00:22:31,539 --> 00:22:34,889 + 이 문자의 순서 어디에서 왔는지 신경 그리고 우리는 아르 논에 공급 + +321 +00:22:34,890 --> 00:22:40,670 + 우리는 철을 훈련 할 수 있으며 같은 텍스트를 작성하고 그래서 예를 들어, 당신은 할 수 있습니다 + +322 +00:22:40,670 --> 00:22:44,789 + 당신이 그것을 모두 잡을 수 있습니다 윌리엄 셰익스피어의 작품을 모두 가지고 그냥 거대한입니다 + +323 +00:22:44,789 --> 00:22:48,289 + 문자의 순서 당신은 재발 성 신경 네트워크에 넣고 시도 + +324 +00:22:48,289 --> 00:22:51,909 + 윌리엄 셰익스피어의 지지자에 대한 시퀀스에서 다음 문자를 예측하고 + +325 +00:22:51,910 --> 00:22:54,650 + 그래서 당신은 처음에 재발 신경망을 물론 그 작업을 수행 할 때 + +326 +00:22:54,650 --> 00:22:59,030 + 그래서 그냥 바로 종료 그래서에서 왜곡을 생산 무작위 임의의 매개 변수가 + +327 +00:22:59,029 --> 00:23:03,200 + 그냥 임의의 문자를이다 그러나 당신이 훈련 할 때 다음 아르 논가에 시작됩니다 + +328 +00:23:03,200 --> 00:23:06,930 + 그 확인을 이해 거기의 말을 시작 공백이 같은 일이 실제로있다 + +329 +00:23:06,930 --> 00:23:11,490 + 따옴표와 함께 실험은 그것은 기본적으로 매우 짧은 일부 내용 + +330 +00:23:11,490 --> 00:23:16,420 + 당신이 더 많은 질병 훈련으로 여기거나 등등 다음과 같은 단어 + +331 +00:23:16,420 --> 00:23:18,820 + 되고 점점 더 세련되고 재발 신경 네트워크를 배운다 + +332 +00:23:18,819 --> 00:23:22,609 + 당신이 견적을 열 때 나중에 닫거나해야하는 그 문장 + +333 +00:23:22,609 --> 00:23:26,379 + 점과 한잔과 함께 그냥 바위에서 통계적으로 모든 물건을 배운다 + +334 +00:23:26,380 --> 00:23:29,630 + 실제로 코치 아무것도 머리를하지 않고 당신이 할 수있는 말의 패턴 + +335 +00:23:29,630 --> 00:23:30,580 + 샘플 전체 + +336 +00:23:30,579 --> 00:23:34,349 + 캐릭터 레벨이 기반으로 셰익스피어 그래서 그냥에 대한 아이디어를 줄 것 + +337 +00:23:34,349 --> 00:23:38,740 + 물건의 종류 나는 그가 접근 될 것이다 생각을 많이하고 갱을 제공 + +338 +00:23:38,740 --> 00:23:42,900 + 존재에 달성 될 것이다 변형됩니다 공급하지 않고 자신 만 체인 결코 + +339 +00:23:42,900 --> 00:23:45,460 + 내가 잠을 안 그의 죽음의 주제 + +340 +00:23:45,460 --> 00:23:56,909 + 즉, 당신이있어 이와 관련하여 네트워크의 나갈 것입니다 물건의 종류 + +341 +00:23:56,909 --> 00:24:02,679 + 내가 좋아 그래서 비트에 다시 연락하고 싶은 아주 미묘한 점을 의미 + +342 +00:24:02,679 --> 00:24:05,980 + 우리는 셰익스피어에서이 작업을 실행할 수 있지만 우리는 그래서 기본적으로 아무것도 태양을 실행할 수 있습니다 + +343 +00:24:05,980 --> 00:24:08,960 + 우리는 내가 대략 년 전 등처럼 생각 저스틴과 함께이 함께 연주 + +344 +00:24:08,960 --> 00:24:12,990 + 저스틴 턱 그는 대수 기하학에서이 책을 발견하고이 단지입니다 + +345 +00:24:12,990 --> 00:24:18,069 + 대형 라텍스 소스 파일 우리는이 형상에 대해 그 라텍스 소스 파일을 가져다 + +346 +00:24:18,069 --> 00:24:23,398 + 예술을 재정 작가는 기본적으로 수학을 그렇게 생성을 배울 수 + +347 +00:24:23,398 --> 00:24:27,199 + 이 아침에 제출 된 샘플 그냥 다음 늦은 체크 아웃 뱉어이며 우리 + +348 +00:24:27,200 --> 00:24:30,009 + 파일럿 물론 바로 작동하지 않습니다으로 우리는 조정 그것은 작은 비트를 가지고 있습니다 + +349 +00:24:30,009 --> 00:24:33,890 + 하지만 기본적으로 아르 논은 우리가 당신을 만든 실수의 일부를 불통 후 + +350 +00:24:33,890 --> 00:24:37,200 + 컴파일 할 수 있습니다 당신은 당신이 그것을 볼로 수학을 생성 얻을 수 있습니다 + +351 +00:24:37,200 --> 00:24:42,460 + 그것은 기본적으로는 그녀의 바보 같은 작은 사각형을두고 이러한 모든 증거를 생성 + +352 +00:24:42,460 --> 00:24:47,090 + 군대의 끝은 그렇게에 우리를 보자 생성 + +353 +00:24:47,089 --> 00:24:52,428 + 때때로 우리는 성공의 다양한 양으로 다이어그램을 만들 예정 + +354 +00:24:52,429 --> 00:24:56,720 + 그리고 이것에 대해 최선 나의 마음에 드는 부분은 상단에 여기 증거가 남아 있다는 것입니다 + +355 +00:24:56,720 --> 00:24:59,650 + 방출된다 + +356 +00:24:59,650 --> 00:25:05,780 + Sarno는 게으른하지만 그렇지 않으면이 물건은 확실히 구별 I입니다 + +357 +00:25:05,779 --> 00:25:12,480 + 실제 형상에서에서 말을 그래서 X의 X 10 방식을하자 확인 나는 확실하지 않다 + +358 +00:25:12,480 --> 00:25:16,160 + 그 부분에 대한하지만 그렇지 않으면이의 게슈탈트 매우 좋아 보인다 + +359 +00:25:16,160 --> 00:25:19,529 + 그것이 내가 가장 어려운 임의의 일을 찾기 위해 노력 임의의 물건이 I + +360 +00:25:19,529 --> 00:25:22,769 + 캐릭터 레벨을 던질 수 있었다 나는 소스 코드를 실제로 결정 + +361 +00:25:22,769 --> 00:25:27,879 + 매우 어려운 그래서 C 코드와 같은 단지 
이전 인 리눅스 소스의 모든했다 + +362 +00:25:27,880 --> 00:25:30,850 + 당신은 그것을 복사 할 수 있습니다 당신은 내가 몇 백 메가 바이트 생각으로 끝낼 단지 + +363 +00:25:30,849 --> 00:25:35,079 + 코드와 헤더 파일을 참조하고 단지 아르 논에 던져 그리고, 그것은 할 수 + +364 +00:25:35,079 --> 00:25:39,849 + 아르 논 당신의 코드 등이 생성 된 코드를 생성하는 법을 배워야 + +365 +00:25:39,849 --> 00:25:42,949 + 그것을 볼 수 있습니다 기본적으로는 입력에 대해 알고 함수 선언을 만듭니다 + +366 +00:25:42,950 --> 00:25:47,460 + 구문 그것은 일종의 변수에 대해 알고 거의 실수를하는 방법을 + +367 +00:25:47,460 --> 00:25:53,230 + 그들이 때때로 그것을 코딩 할 계획 사용은 자신의 가짜 코멘트를 작성 + +368 +00:25:53,230 --> 00:25:58,089 + 구문은 브라켓을 열고 닫습니다하지 않을 것을 발견하는 것은 매우 드문 일이다 + +369 +00:25:58,089 --> 00:26:01,808 + dornin 그래서 몇 가지를 배우고하는 등이에 실제로 상대적으로 쉽다 + +370 +00:26:01,808 --> 00:26:04,058 + 실제로 만드는 실수는 그 예를 들어 그 + +371 +00:26:04,058 --> 00:26:07,240 + 그것을 사용하여 결코 끝나지 않아 몇 가지 변수를 선언하거나 동일한 변수를 할 + +372 +00:26:07,240 --> 00:26:09,929 + 이 선언되지 않습니다 그래서 이러한 높은 수준의 물건 중 일부는 아직 행방 불명된다 + +373 +00:26:09,929 --> 00:26:12,509 + 하지만 그렇지 않으면 잘 할 수있는 + +374 +00:26:12,509 --> 00:26:17,460 + 그것은 또한 더 적대적는 지프 새로운 GOP에게 문자로 허가 된 문자를 암송하지 + +375 +00:26:17,460 --> 00:26:22,009 + 그는 데이터에서 배운하고는 GPL 라이센스 후이 알고있다 + +376 +00:26:22,009 --> 00:26:25,779 + 일부는이 파일의 일부 매크로를 포함하고는, 그래서 다음 몇 가지 코드가있다 + +377 +00:26:25,779 --> 00:26:33,879 + 기본적으로 그냥 쇼에 교대로 매우 작은 것이 무엇인지를 배웠다 + +378 +00:26:33,880 --> 00:26:37,169 + 그냥 장난감 일이 일어나고 다음 문자 거기하고 있는지를 보여 + +379 +00:26:37,169 --> 00:26:41,230 + 그냥 충전 된 구현 및 토치의 많은 종류는이다 + +380 +00:26:41,230 --> 00:26:45,009 + 과 및 실행과 GPU를 확장 그래서 당신은 자신을 재생할 수 등 + +381 +00:26:45,009 --> 00:26:49,269 + 이 특히 그것이 세 계층 앨리스의 다음 후자에 의해이 가고 있었다 + +382 +00:26:49,269 --> 00:26:52,289 + 팀 그리고 우리는 그게 전화의 더 복잡한 종류의 의미를 볼 수 있습니다 + +383 +00:26:52,289 --> 00:26:58,839 + 난 그냥이 어떻게 작동하는지에 대한 아이디어를 제공 네트워크는 그래​​서 종이가 있음을 우리 + +384 +00:26:58,839 --> 00:27:02,089 + 많은 연주 그러나 이것은 단지 작년 우리는 기본적으로 노력하고 + +385 +00:27:02,089 --> 00:27:08,949 + 우리는 신경 과학자있어 척 그리고 우리는 몇 가지 테스트 텍스트에 미용실을 던졌다 + +386 +00:27:08,950 --> 00:27:13,110 + 그래서 아덴의 코드 스 니펫에서이 텍스트를 읽고 우리가보고있는 + +387 +00:27:13,109 --> 00:27:17,119 + 특정 셀의 여부에 기초하여 상기 텍스트 착색 당해 그의 상태 + +388 +00:27:17,119 --> 00:27:18,699 + 하지 그 흥분 판매 여부 + +389 +00:27:18,700 --> 00:27:23,470 + 확인 그래서 당신은 국가의 많은 볼 수 있습니다 + +390 +00:27:23,470 --> 00:27:27,110 + 뉴런은 이상한 방법으로 아무것도의 종류에 화재의 종류에 해석되지 않습니다 + +391 +00:27:27,109 --> 00:27:29,829 + 그들이해야하기 때문에 그들 중 일부는 매우 낮은 수준의 문자를해야 + +392 +00:27:29,829 --> 00:27:33,859 + 그녀는하자 모두 같은 나이와 물건 후에 오는가 얼마나 자주 같은 수준의 물건 + +393 +00:27:33,859 --> 00:27:37,928 + 우리는 빠른처럼 자신을 찾을 예를 들면 있도록 세포는 아주 해석입니다 + +394 +00:27:37,929 --> 00:27:41,830 + 검출 있도록이 셀은 그냥 인용 한 때 온 후는 유지 + +395 +00:27:41,829 --> 00:27:46,460 + 인용 옷장까지에 등이 매우 안정적이 추적을 유지하고 + +396 +00:27:46,460 --> 00:27:50,610 + 그냥 역 전파에서이 크기의 섬을 나오는 그 + +397 +00:27:50,609 --> 00:27:54,329 + 캐릭터 레벨 통계 물론 내외 다르며이다 + +398 +00:27:54,329 --> 00:27:57,639 + 유용한 기능은 학습하고 그래서 그것의 머리 상태의 일부를 바칩니다 + +399 +00:27:57,640 --> 00:28:00,650 + 당신이 따옴표 안에있어 여부를 추적하고이로 돌아갑니다 + +400 +00:28:00,650 --> 00:28:05,159 + 나는이 RNN가 I에 훈련 것을 여기에서 지적하고 싶은 질문 + +401 +00:28:05,159 --> 00:28:06,500 + 시퀀스 길이를 생각한다 + +402 +00:28:06,500 --> 00:28:10,269 + 백하지만 당신은이 인용문의 길이가 실제로보다 훨씬 더 측정 할 경우 + +403 +00:28:10,269 --> 00:28:16,220 + 내가 생각 백 (250)처럼 우리는 다시 최대 전파에 그래서 우리는 일 + +404 +00:28:16,220 --> 00:28:20,190 + 백은 그래서는 셀 수 실제로 로렌 같은 유일한 장소 + +405 +00:28:20,190 --> 00:28:23,460 + 자체는 더 이상이 부록을 발견 할 수 없습니다 때문에 + +406 +00:28:23,460 --> 00:28:27,809 + 그러나보다 기본적으로 내가이이 훈련을 수 있다는 것을 보여 것 같아요 + +407 +00:28:27,809 --> 00:28:31,159 + 캐릭터 레벨 검출 백보다 작은 시퀀스에 유용한로 판매 + +408 +00:28:31,160 --> 00:28:36,580 + 다음은이 때문에이 셀 수 있도록 긴 시퀀스에 제대로 일반화 + +409 
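A skeleton of the truncated backpropagation-through-time loop described above (25-character chunks in min-char-rnn, 100 for the quote-cell experiment); data and loss_fun here are stand-ins for the script's actual variables, so this shows the chunking structure only.

import numpy as np

H = 100
seq_length = 25
data = [0, 1, 2, 2, 3] * 50                # stand-in character indices
h_prev = np.zeros(H)                       # carried across chunks, not reset
for p in range(0, len(data) - seq_length - 1, seq_length):
    inputs = data[p:p + seq_length]
    targets = data[p + 1:p + seq_length + 1]   # same chars, offset by one
    # loss, grads, h_prev = loss_fun(inputs, targets, h_prev)
    # backprop runs only inside this chunk; an Adagrad update on the
    # shared weights would follow here, as in the min-char-rnn script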
+00:28:36,579 --> 00:28:39,859 + 그것은 단지에도 교육을받은 경우 더 이상 백 단계를 작동하는 것 같다 + +410 +00:28:39,859 --> 00:28:44,759 + 그것보다 수백이의 종속성을 발견 할 만 할 수 있다면 + +411 +00:28:44,759 --> 00:28:48,890 + 이 여기에 다른 데이터 세트 내가 레오 톨스토이의 전쟁과 평화는이에 생각이다 + +412 +00:28:48,890 --> 00:28:52,460 + 이 데이터 세트는 대략 80 매에서 새 줄 문자있다 + +413 +00:28:52,460 --> 00:28:57,819 + 80 자 문자 대략 새로운 라인이있다 그리고 거기에있다 + +414 +00:28:57,819 --> 00:29:02,470 + 그 다음 하나 같은에서 시작은 우리가 찾을 수 있도록 라인 링크 추적 + +415 +00:29:02,470 --> 00:29:06,539 + 천천히 시간이 지남에 따라 구분하고이 같은 세포가 있음을 상상 + +416 +00:29:06,539 --> 00:29:09,019 + 당신이 말 때문에에 캐릭터를 좋아하는 예측 실제로 매우 유용 + +417 +00:29:09,019 --> 00:29:13,059 + 이 때 새로운 라인을 알 수 있도록이 애니의 티 타임 단계를 계산하는 + +418 +00:29:13,059 --> 00:29:15,149 + 문자는 다음에 올 가능성이 높습니다 + +419 +00:29:15,150 --> 00:29:19,280 + 확인 그래서 추적처럼 거기에 우리가 실제로 단지 응답 세포를 발견 알려 + +420 +00:29:19,279 --> 00:29:23,970 + 갑자기 문을 우리는 응답자가 시세 및 문자열을 인용 세포 발견 + +421 +00:29:23,970 --> 00:29:28,710 + 우리는 내가 깊은 당신이 네 스틴 표현 등을 더 흥분 세포 발견 + +422 +00:29:28,710 --> 00:29:33,150 + 실제로이 내부에서 찾을 수 있습니다 흥미로운 세포의 모든 종류는 아니다 + +423 +00:29:33,150 --> 00:29:36,710 + 완전하게 다시 전파에서 나와서 그래서 아주 마법의 + +424 +00:29:36,710 --> 00:29:42,130 + 나는 생각하지만, + +425 +00:29:42,130 --> 00:29:49,110 + 당신은 그냥 통과거야 그래서 내가 생각하는이 앨리스 팀은 약 2,100 세포있어 + +426 +00:29:49,109 --> 00:29:54,589 + 그들과 그들 중 일부는 다음과 같이하지만 그 중 약 5 %를 말할 것입니다 + +427 +00:29:54,589 --> 00:30:00,429 + 그냥 수동으로 통과, 그래서 당신은 뭔가 흥미로운 것을 발견 + +428 +00:30:00,430 --> 00:30:05,310 + 미안 우리가 완전히 전체를 실행하는 온전한에 있지만 우리는있어 + +429 +00:30:05,309 --> 00:30:09,679 + US 하나의 셀의 소성시 하나의 숨겨진 상태 화재를보고 + +430 +00:30:09,680 --> 00:30:14,470 + 다란 그래서 일반적으로하지만 우리는 하나에서 기록의 단지 친절 실행 + +431 +00:30:14,470 --> 00:30:20,900 + 전지 등이 단지 전체가 판매하는 의미가 숨겨진 상태 + +432 +00:30:20,900 --> 00:30:23,940 + 숨겨진 상태의 그 상승 한 부분 사이에 기본적으로 많은있다 + +433 +00:30:23,940 --> 00:30:27,740 + 다른 하나는 여전히 다른 방법에 관련된 세포를 잤지 숨겨진 그들이있어 + +434 +00:30:27,740 --> 00:30:30,349 + 모든 다른 시간에 믿고 그들은 모두 다른 일을하고있는 + +435 +00:30:30,349 --> 00:30:41,899 + 아르 논 숨겨진 상태 내부 + +436 +00:30:41,900 --> 00:30:50,150 + 하지만 당신은 하나의 층으로 비슷한 결과를 얻을 수 있습니다 + +437 +00:30:50,150 --> 00:31:00,490 + 이 세포는 (110) 각각에 부정적 일 사이에 항상 있었고,이은이다 + +438 +00:31:00,490 --> 00:31:04,120 + 우리가 아직 덮여 있지만, 살사의 소성 사이하지 않은 분석 팀 + +439 +00:31:04,119 --> 00:31:11,869 + 하나 하나는 그래서 꽤 있습니다 정도로되어 우리에게이 사진의 규모이다 + +440 +00:31:11,869 --> 00:31:15,609 + 시원하고 트렌디 한 시퀀스 모델 시간이 지남에 실제로 수에 대한 대략 + +441 +00:31:15,609 --> 00:31:19,039 + 일년 전 여러 사람이 실제로 사용할 수 있음을 깨닫게했다 + +442 +00:31:19,039 --> 00:31:22,039 + 수행 할 수있는 컴퓨터 비전의 맥락에서 같은 매우 깔끔한 응용 프로그램 + +443 +00:31:22,039 --> 00:31:25,210 + 하나의 복용이 상황에서 캡처 이미지는 우리가하고 싶은 상상 + +444 +00:31:25,210 --> 00:31:27,840 + 보증의 순서로 설명하고이 있습니다 수녀은 아주 좋다 + +445 +00:31:27,839 --> 00:31:32,490 + 이 특정 모델 그들 있도록 시퀀스는 시간이 지남에 따라 개발 방법을 이해 + +446 +00:31:32,490 --> 00:31:36,240 + 이 실제로 대략에서 일을 설명하려고 전년 될 일이 내 + +447 +00:31:36,240 --> 00:31:43,039 + 나는 내 용지에서 사진이있는 종이 그래서 나는 그 정도 사용하려고 해요 우리 + +448 +00:31:43,039 --> 00:31:46,629 + 다음 네트워크에서 수행하고 수수료 및 누락을 먹이 + +449 +00:31:46,630 --> 00:31:48,990 + 당신이 휴대 전화 모델은 실제로 단지 두 개의 모듈로 구성된 것을 확인할 수 있습니다 + +450 +00:31:48,990 --> 00:31:51,750 + 이미지와 자신의 처리를하고있다 코멘트가있다 + +451 +00:31:51,750 --> 00:31:55,460 + 그래서 같은 모델링 시퀀스와 아주 좋은 것입니다 매우 될 것입니다 현재 부채 + +452 +00:31:55,460 --> 00:31:58,470 + 이 어디 있는지 과정의 처음부터 나의 비유를 기억한다면 + +453 +00:31:58,470 --> 00:32:01,039 + 좀 레고 블록과 재생처럼 우리는 그 두 개의 모듈을거야 + +454 +00:32:01,039 --> 00:32:04,509 + 사이 그래서 무엇을의 화살표에 해당하는 함께 스틱 + +455 +00:32:04,509 --> 00:32:07,829 + 우리가 효과적으로 여기서하고있는 것은 어디 조절이 RNN 생식 모델 + +456 +00:32:07,829 --> 00:32:11,349 + 아니면 그냥 무작위로 그 샘플 텍스트를 이야기하지만 우리는 에어컨 
아니에요 그 + +457 +00:32:11,349 --> 00:32:14,939 + 네트워크 해변 와서 상단으로 프로세스를 생성하고 난 당신을 정확하게 보여주지 + +458 +00:32:14,940 --> 00:32:21,220 + 그 모습 어떻게 그래서 앞으로가 통과 무엇을 보여 드리겠습니다 가정 + +459 +00:32:21,220 --> 00:32:24,110 + 자신의 우리가 테스트 이미지를 가지고 우리가 설명하려는 생각된다 + +460 +00:32:24,109 --> 00:32:27,679 + 프로세스 방법 이렇게 단어의 시퀀스가​​ 모델 화상을 US + +461 +00:32:27,680 --> 00:32:31,240 + 어떤 플러그인이에서 왼쪽 작업을 수행하는 것을 가지고 정책 + +462 +00:32:31,240 --> 00:32:35,250 + 우리가에 만화의 모두 통과 수영장, 그래서 경우는 VG 정가입니다 + +463 +00:32:35,250 --> 00:32:37,349 + 우리는 단부에 도달 할 때까지 + +464 +00:32:37,349 --> 00:32:40,149 + 일반적으로 마지막에 우리는 당신을주고있다이 자동 분류가 + +465 +00:32:40,150 --> 00:32:44,440 + 이익 분배를 통해 우리가있어이 경우 이미지 1000 카테고리 말 + +466 +00:32:44,440 --> 00:32:47,420 + 실제로 분류 없애가는 대신에 우리가 갈거야 + +467 +00:32:47,420 --> 00:32:50,750 + 재발에 연합 부재의 상단에있는 표현을 리디렉션 + +468 +00:32:50,750 --> 00:32:54,880 + 신경 네트워크는 그래​​서 우리는 특정 함께 아르 논의 생성에 시작 + +469 +00:32:54,880 --> 00:33:00,410 + 자극이 내가 생각 그럼에도 불구하고 그래서 예술 벡터 (300), 정서적, + +470 +00:33:00,410 --> 00:33:02,700 + 이것은 우리가 항상 플러그 특별한 삼백 감정적 인 승리입니다 + +471 +00:33:02,700 --> 00:33:05,750 + 첫 번째 반복이 나에게 이야기에이 시퀀스의 시작입니다 + +472 +00:33:05,750 --> 00:33:09,039 + 그리고, 우리는 당신을 나타낸 재발 수식을 수행 할거야 + +473 +00:33:09,039 --> 00:33:13,769 + 재발 성 신경 네트워크의 전에 일반적으로 우리는이 재발 계산 + +474 +00:33:13,769 --> 00:33:18,779 + 우리는 WSH 시간 섹스를 계산하지만 whhhy 지금 우리가 원하는 위치에있는 우리가했습니다 연대 + +475 +00:33:18,779 --> 00:33:23,500 + 또한 현재에뿐만 아니라 재발 성 신경 네트워크로 조절합니다 + +476 +00:33:23,500 --> 00:33:28,089 + 우리가 20을 좋아합니다 상태에서 입력 전류는 그래서 그 용어는 멀리 간다 + +477 +00:33:28,089 --> 00:33:33,649 + 하지만 우리는 처음에 단지 사랑의 시간이 될 추가하여 컨디셔닝 처음으로 + +478 +00:33:33,650 --> 00:33:38,040 + 등이 여기에 주석의 정상이며 우리가 추가 한 상호 작용과 + +479 +00:33:38,039 --> 00:33:43,399 + 이 이미지 정보가에 나오는 방법을 우리에게 말해 추가 무게 행렬 W + +480 +00:33:43,400 --> 00:33:46,380 + 처음으로 직장에서 재발 역할 때문에 지금 여러 가지 방법이 있습니다 + +481 +00:33:46,380 --> 00:33:48,940 + 실제로 실제로 이미지를 연결하는 여러 가지 방법이 재발 플레이 + +482 +00:33:48,940 --> 00:33:51,690 + 이 지금이 중 하나만 및 간단한 것 중 하나에 + +483 +00:33:51,690 --> 00:33:55,750 + 아마도이 와인에 여기 난생 처음 단계에서 0 벡터 인 + +484 +00:33:55,750 --> 00:34:00,009 + 이 작동 방식하도록 시퀀스의 첫 번째 단어에 걸쳐 분포 + +485 +00:34:00,009 --> 00:34:05,490 + 예를 들어 당신이 볼 수있다 당신이 상상할 수있는 그 질량이 구조 + +486 +00:34:05,490 --> 00:34:09,699 + 모자는 다음 연합 네트워크 강한 같은 물건에 의해 인식 될 수있다 + +487 +00:34:09,699 --> 00:34:12,939 + 내 조건 사랑의이 상호 작용을 통해 들어갈 상태에서 칠 + +488 +00:34:12,940 --> 00:34:17,039 + 단어 짚의 확률이 약간 높을 수있다 특정 상태 + +489 +00:34:17,039 --> 00:34:20,519 + 바로 그래서 당신은 강한 같은 텍스처가 영향을 미칠 수 있다는 것을 상상 + +490 +00:34:20,519 --> 00:34:23,940 + 강력한 그래서 번호 중 하나 (10) 내부의 가능성이 있기 때문에 높은 것으로 + +491 +00:34:23,940 --> 00:34:28,470 + 그들의 구조와는 그래서 지금부터 군대는 정글이 작업의 종류에있다 + +492 +00:34:28,469 --> 00:34:32,269 + 그것은이 케이스에 시퀀스 내의 다음 치료 및 다음 단어를 예측한다 + +493 +00:34:32,269 --> 00:34:36,550 + 그래서 우리는 최대 그 양말로부터 전송 된 화상 정보를 기억하고 + +494 +00:34:36,550 --> 00:34:40,629 + 아마 우리가 그 분포에서 샘플링 가능성이 가장 높은 단어였다 + +495 +00:34:40,628 --> 00:34:44,710 + 참으로 강한 단어 우리는 강한 걸립니다 우리는에 연결하려고 할 것 + +496 +00:34:44,710 --> 00:34:47,519 + 본인은 생각이 경우 다시 그렇게 바닥에 모든 작업을 기록 + +497 +00:34:47,519 --> 00:34:52,190 + 강한 강한 단어와 연관 그래서 우리는 단어 수준과 침구를 사용하는 + +498 +00:34:52,190 --> 00:34:55,750 + 삼백 국가 박사는 우리는 삼백를 표현하는 법을 배워야거야 + +499 +00:34:55,750 --> 00:35:00,010 + 국가마다 하나의 고유 한 보석에 대한 표현과 우리는 그 플러그 + +500 +00:35:00,010 --> 00:35:02,940 + 삼백 아르 논에 숫자와 설명을 얻기 위해 다시 전달 + +501 +00:35:02,940 --> 00:35:07,090 + 하나는 우리가 이러한 모든 특성을 우리가 얻을 왜 내 두 번째 세계와 순서 + +502 +00:35:07,090 --> 00:35:08,010 + 그것에서 샘플을 다시 + +503 +00:35:08,010 --> 00:35:12,490 + 워드 모자 가능성이 있다고 가정 지금 우리는 모자 400 훨씬 나이 프리젠 테이션을 + +504 +00:35:12,489 --> 00:35:18,299 + 
그리고 거기의 분포를 얻을 후 우리는 다시 샘플링하고 우리는 때까지 샘플 + +505 +00:35:18,300 --> 00:35:21,350 + 우리는 특별한 샘플 및 진정의 끝에있는 기간 토큰 + +506 +00:35:21,349 --> 00:35:24,900 + 문장하고는 arnaz 지금이에서 생성 할 것을 우리에게 알려줍니다 + +507 +00:35:24,900 --> 00:35:30,280 + 군대는 그렇게 확인 밀짚 모자 기간이 이미지를 설명했을 포인트 + +508 +00:35:30,280 --> 00:35:34,010 + 치수와 그의 아내 사진의 수는 단어의 숫자 당신의 + +509 +00:35:34,010 --> 00:35:39,220 + 특수 토큰과 우리가 항상 먹이 산업을위한 어휘 +1 + +510 +00:35:39,219 --> 00:35:43,609 + 다른 단어에 해당하는 부문과 얘기 특별한 시작과 + +511 +00:35:43,610 --> 00:35:46,250 + 우리는 언제나 그 전부 단일 통해 전파 + +512 +00:35:46,250 --> 00:35:49,769 + 시간은 무작위로이 국유화하거나 당신은 무료로 BG 그물을 초기화 할 수 있습니다 + +513 +00:35:49,769 --> 00:35:52,099 + 다음 분을 위해 무역 + +514 +00:35:52,099 --> 00:35:56,319 + 배포판은 다음 그라데이션을 인코딩 한 다음이를 통해 백업 + +515 +00:35:56,320 --> 00:35:59,700 + 전체 단일 모델로 것이나 그냥 모든 공동에서 훈련하고 얻을 + +516 +00:35:59,699 --> 00:36:08,389 + 캡션 또는 이미지 캡처 확인 질문을 많이하지만 네 삼백 + +517 +00:36:08,389 --> 00:36:12,609 + 감정 묻어은 너무 이미지 모든 단어의 단지 독립적있어 + +518 +00:36:12,610 --> 00:36:18,430 + 그렇게 우리가 그것으로 얻을 파산거야와 관련된 300 번호를 가지고 + +519 +00:36:18,429 --> 00:36:21,769 + 당신은 무작위로 초기화 한 다음이 더 나은 섹스에 들어갈 백업 할 수 있습니다 + +520 +00:36:21,769 --> 00:36:25,360 + 그 묻어은 주위 그냥 매개 변수를 다른 이동합니다 오른쪽 그래서 + +521 +00:36:25,360 --> 00:36:30,530 + 그것에 대해 생각하는 방법은 모두를위한 하나의 홉 표현을 데입니다입니다 + +522 +00:36:30,530 --> 00:36:34,960 + 단어는 당신은 거대한 W 매트릭스 곳 하나 하나가 + +523 +00:36:34,960 --> 00:36:40,130 + 그 백 농장과 W 곱셈과 승 300 밖으로하지만 크기가 + +524 +00:36:40,130 --> 00:36:43,530 + 효과적으로 하나가 부러 밖으로 따 버릴거야있는 뭔가 w + +525 +00:36:43,530 --> 00:36:47,560 + 나는 당신이 그 마음에 들지 않는 경우 그래서 그냥 생각이 한랭 전선의 종류의 걸거야 + +526 +00:36:47,559 --> 00:36:50,279 + 침대에서 단지 하나의 호퍼 프리젠 테이션으로 생각하고 수행 할 수 있습니다 + +527 +00:36:50,280 --> 00:36:58,920 + 교육에 토큰 네 말에 최대 네 그것의 모델러를 그런 식으로 생각 + +528 +00:36:58,920 --> 00:37:02,769 + 데이터는 우리가 예술에서 기대하는 올바른 순서는 내가 할 수있는 첫 번째 단어입니다 + +529 +00:37:02,769 --> 00:37:07,969 + 기대 때문에 매일 훈련 예 일종의 특별이 + +530 +00:37:07,969 --> 00:37:10,288 + 그리고 진행 토큰 + +531 +00:37:10,289 --> 00:37:28,929 + 당신이 유선 수 다르게 우리는 모든 단일 상태로 연결이 밝혀 + +532 +00:37:28,929 --> 00:37:32,999 + 그것은 실제로 당신이 단지에 연결하면 실제로 잘 작동 악화 때문에 작동 + +533 +00:37:32,998 --> 00:37:36,718 + 시간 단계 최초의 다음 아르 논은이이 두 작업을 저글링하는 + +534 +00:37:36,719 --> 00:37:40,829 + 그것은 예술과 그것을 통해 기억 할 필요가 무엇 이미지에 대한 기억 + +535 +00:37:40,829 --> 00:37:45,179 + 또한 이러한 모든 의상을 생산해야하고 어떻게 든 거기에 그렇게하고 싶어 + +536 +00:37:45,179 --> 00:38:04,209 + 일부는 사실 클래스 직후 나는 당신을 줄 수있는 이유를 전진 + +537 +00:38:04,208 --> 00:38:10,208 + 단일 인스턴스는 이미지와 단어의 순서와 우리가 대응합니다 + +538 +00:38:10,208 --> 00:38:16,328 + 여기에 그 단어를 연결 것이고, I를 우리는 이미지를 이야기하고 우리가하여야한다 + +539 +00:38:16,329 --> 00:38:22,159 + 그래서 와서 당신이 모든 사람들은 바닥에 계획되지 않은 한 기차 시간 + +540 +00:38:22,159 --> 00:38:25,528 + 이미지 런던과 다음이 그래프를 풀다 당신은 당신의 손실을 + +541 +00:38:25,528 --> 00:38:29,389 + 당신이 조심 있다면 배경이 다음 이미지의 배치를 할 수 있으며, + +542 +00:38:29,389 --> 00:38:33,108 + 그래서 당신의 이미지를 한 경우에는 때로는 서로 다른 길이의 시퀀스가 + +543 +00:38:33,108 --> 00:38:36,199 + 당신이 난 것을 확인 말을해야하기 때문에 훈련 데이터는 조심해야 + +544 +00:38:36,199 --> 00:38:41,059 + 아마 다음의 몇 가지를 최대 스무 단어의 배치를 처리하고자 + +545 +00:38:41,059 --> 00:38:44,499 + 코드에서 당신이 알고에 그 문장이 짧거나 더 이상 필요가있을 것입니다 + +546 +00:38:44,498 --> 00:38:48,188 + 일부 일부 일부 문장은 다른 사람보다 더 오래 있기 때문에 걱정 + +547 +00:38:48,188 --> 00:38:55,368 + 우리는 내가 갈 물건이 너무 많은 질문이 + +548 +00:38:55,369 --> 00:39:03,450 + 그 완전히 공동으로이 모든 것을 전파하도록 네 감사합니다 + +549 +00:39:03,449 --> 00:39:07,538 + 훈련은 인터넷으로 기차를 미리 할 수​​ 있도록 한 다음 그 단어를 넣어 + +550 +00:39:07,539 --> 00:39:10,190 + 이하지만 당신은 공동으로 모든 훈련을 원하고 그 큰이야 + +551 +00:39:10,190 --> 00:39:15,429 + 우리는 우리가 검색 기능을 알아낼 수 있기 때문에 실제로 이점 + +552 +00:39:15,429 --> 00:39:20,368 + 더 
좋은 말은 그래서 당신은이 훈련하는 이미지를 설명하기 위해 + +553 +00:39:20,369 --> 00:39:23,890 + 실제로 우리가 인구 조사 자료에이 시도는 일반적인 욕구 중 하나를 설정합니다 + +554 +00:39:23,889 --> 00:39:27,368 + 마이크로 소프트 코코라고하는 것은, 그래서 그냥 당신이처럼 보이는 무엇의 아이디어를 제공합니다 + +555 +00:39:27,369 --> 00:39:31,499 + 대략 각 이미지 80 이미지와 다섯 문장의 설명이 있었다 + +556 +00:39:31,498 --> 00:39:35,288 + 그래서 당신은 단지 사람들에게 아마존 기계 터크를 사용하여 얻은 것은 우리에게주세요 + +557 +00:39:35,289 --> 00:39:39,710 + 문장 이미지에 대한 설명과 기록 및 데이터 세트를 종료하고 + +558 +00:39:39,710 --> 00:39:43,249 + 그래서 당신은 당신이 예상 할 수있는이 모델에게 결과의 종류를 훈련 할 때 또는 + +559 +00:39:43,248 --> 00:39:49,078 + 약 좀이 같은이 너무 이러한 이미지를 설명하는 우리의 무엇이다 + +560 +00:39:49,079 --> 00:39:52,329 + 이 이것이 검은 셔츠 연주 기타 또는 건설 사람이다라고 말한다 + +561 +00:39:52,329 --> 00:39:55,710 + 도로 또는 두 젊은 여자에 작업 오렌지 시티 웨스트에서 노동자 재생 + +562 +00:39:55,710 --> 00:40:00,528 + 레고 장난감이나 소년 그건 아니에요 웨이크 보드에 물론 공중제비를하고있다 + +563 +00:40:00,528 --> 00:40:04,650 + 웨이크 보드는하지만 매우 재미 실패 사례도 있습니다 가까이있는 + +564 +00:40:04,650 --> 00:40:07,680 + 또한이 야구 방망이를 들고 어린 소년입니다 보여주고 싶은 + +565 +00:40:07,679 --> 00:40:12,338 + 이 고양이는 여자의 원격 제어와 함께 소파에 앉아있다 + +566 +00:40:12,338 --> 00:40:15,710 + 거울 앞의 테디 베어를 들고 + +567 +00:40:15,710 --> 00:40:22,400 + 여기 질감은 아마 무슨 일이 것은 그것을 만든 것입니다 확신 해요 + +568 +00:40:22,400 --> 00:40:26,289 + 이 테디 베어가 있다고 생각하고 마지막은 서 창녀입니다 + +569 +00:40:26,289 --> 00:40:30,409 + 거리 도로의 중간 그래서 분명히 일부 확실하지 아무 말 없다 무엇 + +570 +00:40:30,409 --> 00:40:34,858 + 이 나온 모델의 단지 간단한 종류 그래서 거기에 무슨 일이 있었 + +571 +00:40:34,858 --> 00:40:37,619 + 작년 모델의 이러한 종류의 상단에 작업하려고 많은 사람들이 있었다 + +572 +00:40:37,619 --> 00:40:41,559 + 난 그냥 당신에게 11 레벨의 아이디어를 제공하고자 그들을 더 복잡하게 + +573 +00:40:41,559 --> 00:40:44,929 + 흥미로운 단지 사람들이 기본 아키텍처를 연주하는 방법에 대한 아이디어를 얻을 수 + +574 +00:40:44,929 --> 00:40:51,329 + 그래서 이것은 현재 모델에서 발견 경우 지난해 종이는 우리 + +575 +00:40:51,329 --> 00:40:55,608 + 단지 처음에 시간을 이미지로 한 시간을 공급 한 경우를 + +576 +00:40:55,608 --> 00:40:59,480 + 이 놀 수있는 것은 실제로 다시 볼 수있는 난폭 한 재발 성 신경 네트워크입니다 + +577 +00:40:59,480 --> 00:41:03,130 + 무선 않는 작동 기술 화상의 화상 및 참조 부 + +578 +00:41:03,130 --> 00:41:07,180 + 당신이 허용 등이 모든 단어를 생성하는 등의 단어가 없습니다 + +579 +00:41:07,179 --> 00:41:10,460 + 실제로 이미지 옆 모습을하고 다른 기능을 찾아 + +580 +00:41:10,460 --> 00:41:13,470 + 그것은 다음에 설명 할 수 있습니다 당신은 실제로 완전히에서이 작업을 수행 할 수있는 작업 + +581 +00:41:13,469 --> 00:41:17,899 + 그들은 단지이 말뿐만 아니라 측면을 생성하지 않도록 학습 가능한 방법 + +582 +00:41:17,900 --> 00:41:21,289 + 여기서 이미지에 다음보고하는 등이 작동하는 방식 만을 수행하지 않습니다 + +583 +00:41:21,289 --> 00:41:24,259 + 아웃 아르 논하지만 당신은 아마 다음 하나의 시퀀스에 대한 분배있어 + +584 +00:41:24,260 --> 00:41:29,250 + 하지만 제공이 오는 당신은 발륨은 우리가 전달이 경우 말을 않는 + +585 +00:41:29,250 --> 00:41:37,389 + 512 활성화 부피 (512) (14)에 의해 14를 얻었고에서 모든 및 주석 + +586 +00:41:37,389 --> 00:41:40,179 + 우리는 단지 그 분포를 인정하지 않습니다하지만 당신은 또한을 방출 한 시간 + +587 +00:41:40,179 --> 00:41:44,358 + 모양까지 키처럼 좀입니다 오백열둘 차원 사진 + +588 +00:41:44,358 --> 00:41:48,019 + 당신은 이미지 옆에 그래서 실제로 나는이 생각하지 않습니다 찾기 위해 원하는 것을 + +589 +00:41:48,019 --> 00:41:51,210 + 그들은이 특별한 종이에 무슨 짓을하지만, 이것은 당신이 연결할 수 있습니다 한 방법입니다 + +590 +00:41:51,210 --> 00:41:54,510 + 이 위로이 사진을보고 뭔가는 아르 논에서 방출되는 단지 + +591 +00:41:54,510 --> 00:41:58,430 + 그냥 약간의 무게와 다음이 그림은 점 수를 사용하여 예측처럼 + +592 +00:41:58,429 --> 00:42:03,618 + 제품이 모든 (14) (14)에 의해 위치가 그래서 우리는 이러한 모든 점 제품을 함께 + +593 +00:42:03,619 --> 00:42:09,108 + 우리는 우리가 지금 우리가 다음 우리 (14)의 호환성에 의해 기본적으로 14 계산 달성 + +594 +00:42:09,108 --> 00:42:13,949 + 그것은 모두 당신의 있도록 그래서 기본적으로 우리는이 모든 것을 정상화 이것에 부드러운 최대를 넣어 + +595 +00:42:13,949 --> 00:42:17,149 + 이 14 (14)에 의해, 그래서 우리는 이미지를 통해 긴장 부르는이를 얻을 수 + +596 +00:42:17,150 --> 00:42:21,230 + 아마 이미지에 지금 아르 논에 대한 흥미로운 내용을 통해지도, + +597 +00:42:21,230 --> 00:42:25,889 + 우리는이와이 사람의 가중 합을 수행하라는 메시지가이 문제를 사용 + 
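A sketch of the soft attention lookup described above, assuming a 14 x 14 x 512 conv feature volume and a 512-dimensional key emitted by the RNN at each step; all names are illustrative, not the paper's code.

import numpy as np

# Dot the RNN's emitted key against the conv features at all 14 x 14
# locations, softmax into an attention map, and feed back the weighted
# sum of features as the context for the next step.
feats = np.random.randn(14 * 14, 512)      # one 512-d feature per location
key = np.random.randn(512)                 # vector emitted by the RNN

scores = feats @ key                       # compatibility per location
a = np.exp(scores - scores.max())
a /= a.sum()                               # 196 weights that sum to one
context = a @ feats                        # 512-d weighted-sum context

# The plain captioning model instead injects the image only once, at the
# first step, roughly: h = tanh(W_xh x + W_hh h + W_ih v), where v is the
# CNN feature vector and W_ih is the extra matrix mentioned above.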
+598 +00:42:25,889 --> 00:42:27,239 + 현출 + +599 +00:42:27,239 --> 00:42:30,929 + 그래서 오늘 아침은 기본적으로는 어떻게 생각하는지의 신화는 현재 수 + +600 +00:42:30,929 --> 00:42:36,089 + 그것에 대한 흥미가 돌아갑니다 당신은의 가중 합을하고 결국 + +601 +00:42:36,090 --> 00:42:39,850 + 엘리스 팀이 시점에서보고 싶은 기능의 종류 + +602 +00:42:39,849 --> 00:42:44,809 + 시간 등 섬의 생성 물건, 예를 들어 그것을 결정할 수 있습니다 + +603 +00:42:44,809 --> 00:42:49,400 + 지금과 같은 객체에 대한보고 싶은 그 확인은 벡터 파일을 인정 + +604 +00:42:49,400 --> 00:42:53,220 + 물건 같은 개체의 숫자는이 때의 정액과 상호 작용 + +605 +00:42:53,219 --> 00:42:57,379 + 위원회 주석 어쩌면 그 지역 같은 개체의 일부는 오는 + +606 +00:42:57,380 --> 00:43:01,700 + 점등 및 천장처럼 떨어지는 정품 인증에서이지도를 참조 + +607 +00:43:01,699 --> 00:43:05,949 + 4514 화나게하고 당신은 그 부분에 관심을 집중 결국 + +608 +00:43:05,949 --> 00:43:10,059 + 이 상호 작용을 통해 그래서 당신은 기본적으로 그냥 할 수있는 조회 이미지 + +609 +00:43:10,059 --> 00:43:14,130 + 이미지에 당신은 문장을 설명하고 그래서이 뭔가 우리 동안 + +610 +00:43:14,130 --> 00:43:17,360 + 부드러운 구금으로 참조 실제로 몇 강연이가는 것 + +611 +00:43:17,360 --> 00:43:21,050 + 그래서 우리는 군대가 실제로하지 않은 수있는이 같은 일을 다루려고 + +612 +00:43:21,050 --> 00:43:26,880 + 선택적 입력을 처리하는 등의 수입을 통해 관심과 그 그래서 I + +613 +00:43:26,880 --> 00:43:30,030 + 그냥 당신에게 그 무엇의 미리보기를 제공하기 위해 약 한 시간 그것을 가지고 싶어 + +614 +00:43:30,030 --> 00:43:34,490 + 우리가 중 한 가지 방법으로 우리의 삶을 더 복잡하게하려면 이제 괜찮아 보이는 + +615 +00:43:34,489 --> 00:43:39,259 + 이 당신을 제공합니다, 그래서 우리가 그 층을 쌓아하는 것입니다 할 수있는 당신이 더 많은 것을 알고 + +616 +00:43:39,260 --> 00:43:43,570 + 깊은 물건은 일반적으로 더 나은 우리가에 가지 방법 중 하나를이를 시작하는 방법을 작동 + +617 +00:43:43,570 --> 00:43:46,809 + 적어도 당신은 재발 성 신경 네트워크를 쌓을 수 많은 방법이있다 그러나이 + +618 +00:43:46,809 --> 00:43:49,409 + 사람들이 당신이 할 수 실제로 사용하는 것이 바로 그 중 하나입니다 + +619 +00:43:49,409 --> 00:43:53,339 + 똑바로 그냥 서로 그렇게 한 아르 논에 대한 자극이에 하네스를 연결 + +620 +00:43:53,340 --> 00:43:59,170 + 우리가 이전에 주 사진의 디렉터 등이 이미지 + +621 +00:43:59,170 --> 00:44:02,750 + 시간 축이 수평으로 이동 한 다음 우리가 다른이 위쪽으로가는 + +622 +00:44:02,750 --> 00:44:05,960 + 이 특정 이미지의 의식 등 세 가지 별도의 재발이 있습니다 + +623 +00:44:05,960 --> 00:44:09,858 + 신경 네트워크는 무게의 자신의 세트와 각각이 대령이다 그 + +624 +00:44:09,858 --> 00:44:16,299 + 난 그냥 서로 먹이를하지 그래서이 항상 공동으로 더 거기에 훈련되어 작동합니다 + +625 +00:44:16,300 --> 00:44:19,119 + 기차는 먼저 모든 단지 하나의 경쟁 성장의 두 번째 임기 하나 원 + +626 +00:44:19,119 --> 00:44:22,700 + 배경으로는 상단이 재발 식을 통해 얻을 수 + +627 +00:44:22,699 --> 00:44:25,980 + 상아 영국은 여전히​​ 우리는 여전히있어 더 일반적인 규칙을 만들 가능성이 높습니다 + +628 +00:44:25,980 --> 00:44:29,280 + 똑같은 일을하면 우리는 우리가 복용하고있는 같은 공식을하지 않았다된다 + +629 +00:44:29,280 --> 00:44:35,390 + 우린 시간 전에에서 아래 아래 깊이와 효과에서 강의 + +630 +00:44:35,389 --> 00:44:39,469 + 를 절단하고 퍼팅이 w 변환과를 통해 지원 + +631 +00:44:39,469 --> 00:44:40,519 + 스매싱 10 각 + +632 +00:44:40,519 --> 00:44:44,509 + 당신이 이것에 대해 약간 혼란스러워하는 경우에 당신이 기억한다면, 그래서 거기있다 + +633 +00:44:44,510 --> 00:44:51,760 + WRX H 시간의 X 플러스 당신이 다시 작성할 수 있습니다 whah 시간의 H는 엑손의 연결입니다 + +634 +00:44:51,760 --> 00:44:56,260 + 하나의 행렬 곱 H 바로 그래서 난에 침을 국가 스틱 것처럼 + +635 +00:44:56,260 --> 00:45:03,680 + 기본적으로 무슨 일이 끝나는 다음 하나의 열 벡터와 나는이 w 행렬이 + +636 +00:45:03,679 --> 00:45:07,690 + 최대 일어나고 당신의 WX 연령이 행렬과 WH의 첫 번째 부분 + +637 +00:45:07,690 --> 00:45:12,700 + 미국에서 두 번째로 당신의 매트릭스의 일부 등 식의이 종류는 기록 될 수있다 + +638 +00:45:12,699 --> 00:45:16,099 + 식으로 당신은 당신의 입력을 쌓아 단일 W가 어디 + +639 +00:45:16,099 --> 00:45:24,759 + 변환은 같은 식 있도록 그래서 우리가이는 중지 할 수 있습니다 방법 + +640 +00:45:24,760 --> 00:45:29,780 + 두 시간 색인되는 이후로 지금 다음이 발표하고 + +641 +00:45:29,780 --> 00:45:33,510 + 우리는 또한이 더 복잡한이 적층 공유되지 수 있습니다 지금은 한 방향으로 발생 + +642 +00:45:33,510 --> 00:45:37,030 + 그들을 실제로 그렇게 지금 약간 더 반복 공식을 사용하여 + +643 +00:45:37,030 --> 00:45:40,300 + 지금까지 우리는 복귀에 대한 매우 간단한 재발 수식으로 보았다 + +644 +00:45:40,300 --> 00:45:44,480 + 실제로 작품은 실제로 거의 지금과 같은 공식을 사용하고 + +645 +00:45:44,480 --> 00:45:48,170 
+ basic recurrent network formula is actually very rarely used; instead we use what we call an
+
+646
+00:45:48,170 --> 00:45:52,059
+ LSTM, or Long Short-Term Memory, so this is what's basically used in all the papers
+
+647
+00:45:52,059 --> 00:45:56,500
+ now, this is also the formula you'll want to use if you're doing a project
+
+648
+00:45:56,500 --> 00:46:00,989
+ that uses recurrent networks, but all I want you to notice at this point is
+
+649
+00:46:00,989 --> 00:46:04,729
+ that, exactly as with the plain RNN, this recurrence formula is just a
+
+650
+00:46:04,730 --> 00:46:09,050
+ slightly more complex function; we're still taking the vector from below
+
+651
+00:46:09,050 --> 00:46:13,789
+ and the vector from before in time, the input from below and the previous hidden state,
+
+652
+00:46:13,789 --> 00:46:18,309
+ and combining them through a transformation W, but now we have this more
+
+653
+00:46:18,309 --> 00:46:21,869
+ complex way of actually computing the new hidden state at this point in
+
+654
+00:46:21,869 --> 00:46:25,539
+ time, so we're just being slightly more elaborate in how we combine the vectors from
+
+655
+00:46:25,539 --> 00:46:28,900
+ below and from before to actually perform the update on the hidden state
+
+656
+00:46:28,900 --> 00:46:33,050
+ so we're going to go into some detail on exactly what motivates this complicated
+
+657
+00:46:33,050 --> 00:46:41,609
+ formula and why it might actually be a better idea to use an LSTM
+
+658
+00:46:41,608 --> 00:46:49,909
+ so trust me that it makes sense, and we're going to go through it right now; if you
+
+659
+00:46:49,909 --> 00:46:56,480
+ go watch some online videos or go to Google Images, you'll find diagrams
+
+660
+00:46:56,480 --> 00:47:00,989
+ like this one, which I really don't find helpful; when I first saw this I thought
+
+661
+00:47:00,989 --> 00:47:04,048
+ it was really scary, like whoever made it wasn't really sure what was going on
+
+662
+00:47:04,048 --> 00:47:08,170
+ I understand LSTMs, and I still don't know what these two diagrams are
+
+663
+00:47:08,170 --> 00:47:14,289
+ so I'm going to try to break the LSTM down; it's kind of tricky stuff, so OK
+
+664
+00:47:14,289 --> 00:47:18,329
+ let me kind of step through it so you can really see what's going on in these drawings
+
+665
+00:47:18,329 --> 00:47:24,220
+ so here we have the LSTM equations in the form we had them before, and
+
+666
+00:47:24,219 --> 00:47:28,238
+ we'll focus on the first part up here, where we take these two vectors,
+
+667
+00:47:28,239 --> 00:47:32,720
+ x from below and h from before in time, the previous hidden state, and
+
+668
+00:47:32,719 --> 00:47:37,848
+ we map both of them through the transformation W; now if x and h are of size n
+
+669
+00:47:37,849 --> 00:47:40,950
+ then we're going to end up producing 4n numbers, for some
+
+670
+00:47:40,949 --> 00:47:46,068
+ n, through this W matrix, which is 4n by 2n, OK, so we have these
+
+671
+00:47:46,068 --> 00:47:51,108
+ four n-dimensional vectors i, f, o and g, which are short for input,
+
+672
+00:47:51,108 --> 00:47:57,328
+ forget, output, and g; I'm not sure what g is short for; i, f and o go through sigmoids
+
+673
+00:47:57,329 --> 00:48:05,859
+ as gates, and g goes through a tanh; now, for how this actually works,
+
+674
+00:48:05,858 --> 00:48:09,420
+ the best way to think about it... one thing I actually forgot to mention on
+
+675
+00:48:09,420 --> 00:48:15,028
+ the previous slide is that a normal network just has a single h vector at
+
+676
+00:48:15,028 --> 00:48:18,018
+ every time step, but an LSTM actually has two vectors at every
+
+677
+00:48:18,018 --> 00:48:23,618
+ time step: there's also what we call the cell state vector c, so at every
+
+678
+00:48:23,619 --> 00:48:29,470
+ time step we have both the hidden vector h and the cell vector c, shown here in
+
+679
+00:48:29,469 --> 00:48:33,558
+ yellow, so we basically have two vectors at every single point in space here
+
+680
+00:48:33,559 --> 00:48:37,849
+ and what these gates do is they basically operate over this cell state
+
+681
+00:48:37,849 --> 00:48:41,680
+ so based on your context from before and the input from below, you end up
+
+682
+00:48:41,679 --> 00:48:45,199
+ operating over the cell state with these
+
+683
+00:48:45,199 --> 00:48:50,509
+ i, f, o and g elements; and the way to think about it, as I'll go through,
+
+684
+00:48:50,510 --> 00:48:58,290
+ is to think of i, f and o as binary, either 0 or 1; we'd like
+
+685
+00:48:58,289 --> 00:49:01,199
+ them to be binary so we can have this interpretation of them as gates,
+
+686
+00:49:01,199 --> 00:49:05,449
+ but of course we make them sigmoids instead, because we want
+
+687
+00:49:05,449 --> 00:49:08,348
+ this to be differentiable so we can backpropagate through everything, but
+
+688
+00:49:08,349 --> 00:49:11,960
+ think of them as binary things that we've computed based on our context
+
+689
+00:49:11,960 --> 00:49:17,740
+ so always keep in mind what we're doing here: based on what we see,
+
+690
+00:49:17,739 --> 00:49:22,250
+ these gates and this g get decided, and we end up updating the value of the cell;
+
+691
+00:49:22,250 --> 00:49:29,289
+ in particular, this f, the forget gate, is used to shut off and
+
+692
+00:49:29,289 --> 00:49:34,869
+ reset the cells to zero; the cells are best thought of as
+
+693
+00:49:34,869 --> 00:49:38,700
+ counters, basically, and with this f interaction, which is
+
+694
+00:49:38,699 --> 00:49:42,368
+ an element-wise multiplication (my laser pointer is running out of
+
+695
+00:49:42,369 --> 00:49:45,530
+ battery, sorry),
+
+696
+00:49:45,530 --> 00:49:50,140
+ if f is zero you can see it will zero out that cell, so we can reset the
+
+697
+00:49:50,139 --> 00:49:53,969
+ counter; and we can also add to the counter, which we do through this
+
+698
+00:49:53,969 --> 00:50:00,459
+ i times g interaction; since i is between 0 and 1 and g is between negative one
+
+699
+00:50:00,460 --> 00:50:05,900
+ and one, we're basically adding a number between -1 and 1 to every cell, so at every
+
+700
+00:50:05,900 --> 00:50:09,338
+ single time step we have these counters in all the cells; we can reset these
+
+701
+00:50:09,338 --> 00:50:13,588
+ counters to zero with the forget gate, or we can add a number between -1
+
+702
+00:50:13,588 --> 00:50:18,039
+ and 1 to each cell; OK, so that's the cell update, and then how do we perform the
+
+703
+00:50:18,039 --> 00:50:24,029
+ hidden update? the hidden state ends up being a squashed cell, tanh of c,
+
+704
+00:50:24,030 --> 00:50:28,760
+ modulated by this output gate o, so only some parts of the cell state leak up into
+
+705
+00:50:28,760 --> 00:50:33,500
+ the hidden state, modulated by this vector o; so we choose to reveal only some of
+
+706
+00:50:33,500 --> 00:50:39,530
+ the cells into the hidden state, in a learnable way; there are a few
+
+707
+00:50:39,530 --> 00:50:43,910
+ things to kind of highlight here; probably the most confusing part is that we're
+
+708
+00:50:43,909 --> 00:50:47,500
+ adding this i times g here, a number between -1 and 1, but it's kind of
+
+709
+00:50:47,500 --> 00:50:51,809
+ confusing because if we just had g there instead, then g is already between
+
+710
+00:50:51,809 --> 00:50:56,679
+ -1 and 1, so why do we need i times g at all? what does that actually give us? all we
+
+711
+00:50:56,679 --> 00:50:58,279
+ wanted was to add a
+
+712
+00:50:58,280 --> 00:51:02,330
+ number between -1 and 1, so that's kind of the puzzling part of it, and
+
+713
+00:51:02,329 --> 00:51:08,989
+ finally one answer I landed on is that if you think about g, it's a
+
+714
+00:51:08,989 --> 00:51:16,159
+ linear function of your context, squashed by a single tanh (I have no laser pointer, right,
+
+715
+00:51:16,159 --> 00:51:26,649
+ OK), so g is a linear function of the previous hidden state, squashed by a tanh, and
+
+716
+00:51:26,650 --> 00:51:30,579
+ if we only added g, we'd be adding a function of the previous context
+
+717
+00:51:30,579 --> 00:51:35,349
+ that's a very simple kind of function; by instead adding i times g,
+
+718
+00:51:35,349 --> 00:51:38,929
+ we have a multiplicative interaction, and that's actually a
+
+719
+00:51:38,929 --> 00:51:42,710
+ richer function that we can express for what we're adding to the
+
+720
+00:51:42,710 --> 00:51:47,010
+ cell state as a function of the previous context; another way to think about
+
+721
+00:51:47,010 --> 00:51:50,620
+ this is that it basically decouples these two concepts: how
+
+722
+00:51:50,619 --> 00:51:54,159
+ much do we want to add to the cell state, which is g, and do we want
+
+723
+00:51:54,159 --> 00:51:58,129
+ to address that cell at all, which is i; so this possibly allows the
+
+724
+00:51:58,130 --> 00:52:03,280
+ LSTM, by decoupling these two, to
+
+725
+00:52:03,280 --> 00:52:08,470
+ have some nicer properties in terms of the dynamics of how this all trains, but
+
+726
+00:52:08,469 --> 00:52:12,039
+ we just end up with that LSTM formula; so I'm actually going to go
+
+727
+00:52:12,039 --> 00:52:14,059
+ through this in more detail as well
+
+728
+00:52:14,059 --> 00:52:21,400
+ OK, so think about this as the cell c flowing through; the first interaction
+
+729
+00:52:21,400 --> 00:52:28,269
+ here is the f dot c: f came out of a sigmoid and it's
+
+730
+00:52:28,269 --> 00:52:32,559
+ gating the cell with a multiplicative interaction, so if f is zero you
+
+731
+00:52:32,559 --> 00:52:38,409
+ shut off the cell and reset the counter; the i dot g part basically gives
+
+732
+00:52:38,409 --> 00:52:44,799
+ you the addition of a number between -1 and 1, and then the cell state leaks
+
+733
+00:52:44,800 --> 00:52:51,100
+ into the hidden state, but only after going through a tanh and being gated by o, so
+
+734
+00:52:51,099 --> 00:52:55,380
+ o decides which parts of the cell state in fact end up
+
+735
+00:52:55,380 --> 
00:52:59,610
+ revealed into the hidden state; and notice that this h not only
+
+736
+00:52:59,610 --> 00:53:03,720
+ goes to the next iteration of the LSTM, but it also actually flows up
+
+737
+00:53:03,719 --> 00:53:07,159
+ to the higher layers, because it's this hidden state h that we end up
+
+738
+00:53:07,159 --> 00:53:11,250
+ feeding to the LSTM above us, or it goes into a prediction
+
+739
+00:53:11,250 --> 00:53:14,510
+ so when you unroll this, it basically looks something like
+
+740
+00:53:14,510 --> 00:53:19,270
+ this; now here's my own diagram, which has its own degree of confusion, I suppose, but
+
+741
+00:53:19,269 --> 00:53:24,550
+ basically you get an input vector from below, and your own hidden state from
+
+742
+00:53:24,550 --> 00:53:26,090
+ before
+
+743
+00:53:26,090 --> 00:53:31,030
+ these are all n-dimensional vectors, and you determine the four gates i, f, o, g
+
+744
+00:53:31,030 --> 00:53:35,110
+ which operate over the cell state, and determine how the cell state ends up modulated
+
+745
+00:53:35,110 --> 00:53:38,610
+ so once you actually do that, basically we reset some cells and add a number between -1
+
+746
+00:53:38,610 --> 00:53:42,630
+ and 1 to the cell state, and some of the cell state leaks out, in a learnable
+
+747
+00:53:42,630 --> 00:53:45,840
+ way, and it can then either go up to a prediction or go to the next
+
+748
+00:53:45,840 --> 00:53:52,269
+ iteration of the LSTM in the future; so that's what this looks like; it's kind of ugly
+
+749
+00:53:52,269 --> 00:53:58,429
+ and the question that's probably in your mind is why we went
+
+750
+00:53:58,429 --> 00:54:02,649
+ through all of this, and why it looks this particular way; I
+
+751
+00:54:02,650 --> 00:54:05,639
+ do want you to know that there are many variants of the LSTM at
+
+752
+00:54:05,639 --> 00:54:09,309
+ this point; people have played around a lot with these equations
+
+753
+00:54:09,309 --> 00:54:12,840
+ and we've kind of converged on this as something that seems reasonable, but
+
+754
+00:54:12,840 --> 00:54:15,510
+ there are many small tweaks you can make to it that don't actually
+
+755
+00:54:15,510 --> 00:54:18,930
+ degrade the performance by much; like, you can take out some of those gates,
+
+756
+00:54:18,929 --> 00:54:20,359
+ and so on, probably
+
+757
+00:54:20,360 --> 00:54:25,200
+ and it turns out it still works well
+
+758
+00:54:25,199 --> 00:54:28,619
+ usually, though sometimes slightly better or worse; I
+
+759
+00:54:28,619 --> 00:54:33,869
+ don't think we have a very good reason for why we ended up with this bit of a
+
+760
+00:54:33,869 --> 00:54:37,039
+ monster, but it actually does kind of make sense in terms of these counters
+
+761
+00:54:37,039 --> 00:54:40,739
+ that can be reset to zero, or have small numbers between -1 and 1 added to them
+
+762
+00:54:40,739 --> 00:54:46,039
+ so it's kind of nice that it's actually relatively simple to understand; now,
+
+763
+00:54:46,039 --> 00:54:49,300
+ exactly why this is so much better than the RNN; we should go to a slightly
+
+764
+00:54:49,300 --> 00:54:55,330
+ different picture to draw the distinction: in a recurrent neural
+
+765
+00:54:55,329 --> 00:54:59,259
+ network, you have some state vector, right, and you're operating over it, and you're
+
+766
+00:54:59,260 --> 00:55:02,260
+ completely transforming it through this recurrence formula, so you end
+
+767
+00:55:02,260 --> 00:55:06,280
+ up changing your state vector from time step to time step; you'll notice that in the
+
+768
+00:55:06,280 --> 00:55:11,140
+ LSTM, instead, we have the cell state flowing through, and what we're effectively doing
+
+769
+00:55:11,139 --> 00:55:15,250
+ is looking at the cell, some of which leaks into the hidden state,
+
+770
+00:55:15,250 --> 00:55:19,329
+ and the state decides how to operate over the cell; if we forget about the forget gate,
+
+771
+00:55:19,329 --> 00:55:22,869
+ we basically just end up adjusting the cell by
+
+772
+00:55:22,869 --> 00:55:28,509
+ additive interactions: so some stuff here, computed as a function of
+
+773
+00:55:28,510 --> 00:55:33,040
+ whatever we were looking at, ends up changing the cell state, whatever it is; so
+
+774
+00:55:33,039 --> 00:55:37,190
+ instead of transforming it, these are additive, right, so instead of a
+
+775
+00:55:37,190 --> 00:55:38,429
+ transformation
+
+776
+00:55:38,429 --> 00:55:42,929
+ we have these additive interactions; now, does this actually remind you of something
+
+777
+00:55:42,929 --> 00:55:48,839
+ that we've already covered in this class? ... yeah, that's right:
+
+778
+00:55:48,840 --> 00:55:53,240
+ so this is in fact the same kind of thing as a ResNet, basically
+
+779
+00:55:53,239 --> 00:55:56,299
+ normally in a residual network we compute some transformation, and we have
+
+780
+00:55:56,300 --> 00:56:00,019
+ this skip connection here, and you can see that a residual block basically
+
+781
+00:56:00,019 --> 00:56:04,690
+ does an additive interaction: we have this x here, we do some computation based on
+
+782
+00:56:04,690 --> 00:56:10,240
+ x, and then we have an additive interaction with it; and so that's
+
+783
+00:56:10,239 --> 00:56:12,959
+ the basic block of ResNets, and the same thing in fact happens, which is kind of neat,
+
+784
+00:56:12,960 --> 00:56:18,440
+ in the LSTM: of course here the x we have is the cell, and we go
+
+785
+00:56:18,440 --> 00:56:22,619
+ off and compute some function of it, and we can add onto this cell state; only,
+
+786
+00:56:22,619 --> 00:56:26,900
+ unlike ResNets, the LSTM also has these forget gates added in, where
+
+787
+00:56:26,900 --> 00:56:31,519
+ the forget gate has control and can choose to shut off some parts of the signal, but
+
+788
+00:56:31,519 --> 00:56:33,679
+ otherwise it looks very much like a residual network, so I think it's kind of
+
+789
+00:56:33,679 --> 00:56:36,710
+ funny that we're converging on architectures that look very similar
+
+790
+00:56:36,710 --> 00:56:40,429
+ in residual networks and recurrent neural networks alike; it seems
+
+791
+00:56:40,429 --> 00:56:43,809
+ that dynamically it's somehow just much nicer to actually have these additive
+
+792
+00:56:43,809 --> 00:56:48,739
+ interactions that let you backpropagate much more effectively; so
+
+793
+00:56:48,739 --> 00:56:49,779
+ on that point,
+
+794
+00:56:49,780 --> 00:56:53,860
+ think about the backpropagation dynamics in the LSTM,
+
+795
+00:56:53,860 --> 00:56:57,760
+ in particular when we inject some gradient into the LSTM here; it's very clear that
+
+796
+00:56:57,760 --> 00:57:01,120
+ if I inject some gradient at the end of this picture here, so,
+
+797
+00:57:01,119 --> 00:57:05,239
+ then these plus interactions are just like a gradient highway right here
+
+798
+00:57:05,239 --> 00:57:09,299
+ the gradients will just flow through all these additive interactions, like
+
+799
+00:57:09,300 --> 00:57:13,240
+ addition distributes the gradient equally, so if I plug in gradient at any point in time
+
+800
+00:57:13,239 --> 00:57:16,849
+ here, the gradient is of course just going to blow all the way back through
+
+801
+00:57:16,849 --> 00:57:20,809
+ and gradients also flow through these f's, which end up contributing their part into
+
+802
+00:57:20,809 --> 00:57:25,630
+ the gradient flow, but you're not going to end up with what we refer to in RNNs
+
+803
+00:57:25,630 --> 00:57:30,110
+ as the vanishing gradient problem, where the gradients just die off and go to zero
+
+804
+00:57:30,110 --> 00:57:32,880
+ as you backpropagate through; I'll show you an example
+
+805
+00:57:32,880 --> 00:57:36,640
+ of exactly why this happens in a plain RNN in a bit; so we have this vanishing
+
+806
+00:57:36,639 --> 00:57:40,670
+ gradient problem, which I will show you, and the reason it doesn't happen in an LSTM is
+
+807
+00:57:40,670 --> 00:57:45,210
+ that we have these highways of cells: the gradients of every time step that
+
+808
+00:57:45,210 --> 00:57:47,130
+ we inject into the LSTM from above
+
+809
+00:57:47,130 --> 00:57:54,829
+ just flow through the cells, and the gradients don't end up dying down
+
+810
+00:57:54,829 --> 00:57:57,339
+ maybe there are questions about the confusing parts; let me take some questions
+
+811
+00:57:57,338 --> 00:58:01,849
+ here, but one last one, and after that I'll go into why RNNs are in
+
+812
+00:58:01,849 --> 00:58:03,059
+ such bad shape
+
+813
+00:58:03,059 --> 00:58:09,789
+ [student asks whether the all-zero initial vector is important]
+
+814
+00:58:09,789 --> 00:58:13,400
+ it turns out, I think, that that one isn't particularly important
+
+815
+00:58:13,400 --> 00:58:16,660
+ there's a paper, the 'Search Space Odyssey' one, I'll show you, where they answer
+
+816
+00:58:16,659 --> 00:58:21,719
+ things like that: they really take stuff out and also play with pieces, like
+
+817
+00:58:21,719 --> 00:58:25,588
+ the peephole connections people add, so this cell state can feed in here
+
+818
+00:58:25,588 --> 00:58:29,538
+ as an input as well... so people really play
+
+819
+00:58:29,539 --> 00:58:32,049
+ with these architectures; they've tried lots of these recurrence
+
+820
+00:58:32,048 --> 00:58:37,230
+ equations, and almost everything works about the same; some things end up
+
+821
+00:58:37,230 --> 00:58:40,490
+ slightly better or worse, so it's all kind of confusing; sometimes we've had,
+
+822
+00:58:40,489 --> 00:58:45,699
+ and I'll show you, a paper where they treated these update
+
+823
+00:58:45,699 --> 00:58:49,538
+ equations... they built trees over the update equations, and they did
+
+824
+00:58:49,539 --> 00:58:52,950
+ random mutations and things like that, trying all kinds of different
+
+825
+00:58:52,949 --> 00:58:57,028
+ updates you could make; most of them work about the same, some of them break,
+
+826
+00:58:57,028 --> 00:58:59,858
+ but nothing works really much better than
+
+827
+00:58:59,858 --> 00:59:08,150
+ the LSTM; OK, so the next question is why recurrent neural networks are going
+
+828
+00:59:08,150 --> 00:59:15,389
+ to have such terrible backward flow; here's a video
+
+829
+00:59:15,389 --> 00:59:22,000
+ showing the vanishing gradient problem in recurrent neural networks
+
+830
+00:59:22,000 --> 
00:59:29,250 + 모두에 대해 우리가 재발보고있는 것처럼 우리가 여기에 표시하고 줄기 + +831 +00:59:29,250 --> 00:59:33,039 + 많은 기간 많은 시간 단계에 걸쳐 신경망 다음 주입 그라데이션 + +832 +00:59:33,039 --> 00:59:36,760 + 그것은 백 스물여덟번째 시간 단계의 말을 우리는 파산하고 + +833 +00:59:36,760 --> 00:59:40,028 + 네트워크를 통해 재료와 우리는 그라데이션이 무엇인지보고있는 + +834 +00:59:40,028 --> 00:59:44,699 + 용 나는 체중의 입력 타입 숨겨진 매트릭스 하나에 모든 행렬 생각 + +835 +00:59:44,699 --> 00:59:49,009 + 한 시간 간격 때문에 실제로 통해 전체 업데이트를 얻기 위해 그 기억 + +836 +00:59:49,010 --> 00:59:52,289 + 다시 우리가 실제로 여기에 모든 그라디언트를 추가하고 그래서 무엇 무엇이다 + +837 +00:59:52,289 --> 00:59:56,760 + 어떻게 여기에 표시되는 것은 배경으로 우리는 단지에서 성분을 주입하는 것입니다 + +838 +00:59:56,760 --> 01:00:00,799 + 우리가 시간과 강한 조각을 통해 배경을 120 시간 단계 + +839 +01:00:00,798 --> 01:00:04,088 + 그 전파의 당신이보고있는 것은 미국 팀이 당신을 많이 준다이다 + +840 +01:00:04,088 --> 01:00:06,699 + 많이있다, 그래서이 역 전파에 걸쳐 그라데이션 + +841 +01:00:06,699 --> 01:00:11,000 + 단지 바로이 기술을 통해 흐르는되는 정보는 전원 사망 + +842 +01:00:11,000 --> 01:00:15,210 + 그냥 욕심 우리는 추방은 그냥 아무 거기에 작은 숫자가된다라고 + +843 +01:00:15,210 --> 01:00:18,750 + 내가 단계 그렇게되는 시간에 대해 표시를 생각이 경우 너무 그라데이션 + +844 +01:00:18,750 --> 01:00:22,679 + 우리가하지 않았다 주입 모든 정보와 10 배 단계 등 + +845 +01:00:22,679 --> 01:00:26,149 + 네트워크를 통해 흘러 모든 때문에 매우 긴 종속성을 배울 수 있습니다 + +846 +01:00:26,150 --> 01:00:29,720 + 우리가 왜이 볼 수 있도록 상관 관계 구조는 아래가 사망 한 + +847 +01:00:29,719 --> 01:00:39,399 + 조금 동적으로 발생이 채널이 너무 재미 그가처럼 몇 가지 코멘트 + +848 +01:00:39,400 --> 01:00:40,490 + YouTube 또는 뭔가 + +849 +01:00:40,489 --> 01:00:44,779 + 그래 + +850 +01:00:44,780 --> 01:00:53,170 + 확인 그래서 우리가 재발 성 신경 네트워크가 여기 아주 간단한 예를 살펴 보자 + +851 +01:00:53,170 --> 01:00:56,300 + 내가 보여주는 아니에요이 재발 성 신경 네트워크에 당신을 위해 전개거야 것을 + +852 +01:00:56,300 --> 01:01:03,960 + 우리가있어 모든 입력은 자신의 상태 업데이트가 너무 whaaa 교회와 대기 상태가 + +853 +01:01:03,960 --> 01:01:07,260 + 상호 작용을 칠 숨겨진 나는 기본적으로 재발을 전달하려고 해요 + +854 +01:01:07,260 --> 01:01:12,380 + 신경망 때문에 T-오십를 사용하고 여기에 내가 어떤 차 시간 단계를하지를 않습니다 + +855 +01:01:12,380 --> 01:01:16,260 + 내가 무슨 일을하고있어 WHAS 시간을 그 위에 다음 이전 세입자와 물건과입니다 + +856 +01:01:16,260 --> 01:01:20,570 + 그래서 이것은 모든 입력 벡터를 무시 들어오는 단지 전진 패스입니다 + +857 +01:01:20,570 --> 01:01:25,280 + 단지 WHAS 시간 H 임계 값 WHAS 시간 세이 임계 값 등 + +858 +01:01:25,280 --> 01:01:29,500 + 그 전진 패스의 다음 뒤로 여기가 연출하고있어 여기서 통과 + +859 +01:01:29,500 --> 01:01:33,820 + 마지막 단계에서 여기에 임의의 기울기에 의해 50 시간 단계에서 매우 + +860 +01:01:33,820 --> 01:01:37,880 + 뒤쪽으로 이동 한 후 무작위 및 그라데이션을 주입 나는 그렇게 백업 + +861 +01:01:37,880 --> 01:01:41,059 + 당신은 백업이 권한을 통해 여기 내가 사용하고 있습니다 통해 백업해야 할 때 + +862 +01:01:41,059 --> 01:01:46,170 + 오히려 곱셈 등 400 WH보다 곱셈 어를 통해 배경을 얻을 + +863 +01:01:46,170 --> 01:01:51,800 + 그래서 여기서주의 할 것은 여기에서 매우이다 나는 개발자 브라운 백을하고있는 중이 야 + +864 +01:01:51,800 --> 01:01:54,980 + 수입을 어디에서 관련 바로 잡고 아무것도 통해 전파 + +865 +01:01:54,980 --> 01:02:02,309 + 나는 WH 시간마다 작업을 제로보다 작은 여기서 포기하고 있었다 + +866 +01:02:02,309 --> 01:02:06,570 + 우리가 실제로 WH 행렬 곱 경우 우리는 그렇게 비선형 성을하기 전에 + +867 +01:02:06,570 --> 01:02:09,570 + 당신이 실제로 무슨 일을 볼 때가는 매우 펑키 뭔가가있다 + +868 +01:02:09,570 --> 01:02:13,300 + 당신이 시간을 통해 뒤로 이동으로 NHS의 구배이 DHS에 + +869 +01:02:13,300 --> 01:02:18,160 + 당신이 보는 것처럼 매우 걱정입니다 재미있는 구조의 매우 종류가 있습니다 + +870 +01:02:18,159 --> 01:02:22,210 + 등이 우리가 여기 무슨 일을하는지와 같은 루프에 연결되는 방식 + +871 +01:02:22,210 --> 01:02:33,409 + 두 시간 간격 + +872 +01:02:33,409 --> 01:02:43,849 + 제로 그래 나는 생각하고 가끔 어쩌면 반군이 모든 있었다 출력의 + +873 +01:02:43,849 --> 01:02:47,630 + 죽은 당신을 죽일 수 듯하지만 그건 정말 문제 아니다 + +874 +01:02:47,630 --> 01:02:51,470 + 더 걱정 문제는 그 모든 쇼가 될 것 잘하지만 착용 한 생각 + +875 +01:02:51,469 --> 01:02:55,500 + 사람들이 쉽게 우리가 걸 볼 수 있습니다뿐만 아니라 발견 할 수 있습니다 문제 + +876 +01:02:55,500 --> 01:03:00,380 + 때문에에 또 다시 이상이 whah 행렬 곱 + +877 +01:03:00,380 --> 01:03:04,840 + 앞으로 우리가 매일 반복에 
awhh 곱 통과 + +878 +01:03:04,840 --> 01:03:09,670 + 다시 우리가이 전파 결국 모든 숨겨진 상태를 통해 전파 + +879 +01:03:09,670 --> 01:03:13,820 + 무형 문화 유산 konnte 체스와 backrub 어 공식은 실제로 것을 밝혀 + +880 +01:03:13,820 --> 01:03:19,000 + 당신은 whah 행렬 곱 인사말 신호를 가지고 우리는 종료 + +881 +01:03:19,000 --> 01:03:26,199 + 그라데이션이 whah 유지를 곱한 도착까지 그 다음 WH 관계자를 곱한 + +882 +01:03:26,199 --> 01:03:32,019 + 그렇게 우리는 그렇게하지 ​​매트릭스 W​​H 나이 오십 번 곱 결국 + +883 +01:03:32,019 --> 01:03:37,509 + 이 가진 문제는 녹색 신호는 기본적으로 두 가지 경우처럼 일어날 수 있다는 것입니다 + +884 +01:03:37,510 --> 01:03:41,080 + 당신은 아마 규모 행렬없는 스칼라 값 작업에 대한 생각 + +885 +01:03:41,079 --> 01:03:45,469 + 그때 임의의 번호를 가지고 있다면 두 번째 번호가 나는 유지 + +886 +01:03:45,469 --> 01:03:48,509 + 그래서 또 다시 두 번째 숫자에 의해 첫 번째 숫자를 곱한 + +887 +01:03:48,510 --> 01:03:55,990 + 다시 그 순서는 바로 같은 플레이 자신의 경우에 무엇을 이동 않습니다 + +888 +01:03:55,989 --> 01:04:01,849 + 번호 하나 내가 죽거나 아직 경우 두 번째 번호를 정확히 절전 모드로 전환 + +889 +01:04:01,849 --> 01:04:05,119 + 일년 실제로 폭발하지만, 그렇지 않는 경우에만 위치하도록 + +890 +01:04:05,119 --> 01:04:09,679 + 정말 나쁜 일이 죽을 중 하나 일어나고 또는 우리는 우리가 큰이 여기 폭발 + +891 +01:04:09,679 --> 01:04:12,659 + 도시 우리는 하나의 번호가없는 있지만, 사실은이 같은 일이 일어난다이다 + +892 +01:04:12,659 --> 01:04:16,599 + 그것의 일반화는 WHS 장축 반경 스펙트럼에서 일어나는 + +893 +01:04:16,599 --> 01:04:21,839 + 이는 그 행렬의 최대 고유 한 후보다 큰 것이다 + +894 +01:04:21,840 --> 01:04:25,220 + 이 시민은 완전히 사망의 1도 이하의 경우 무선 신호가 폭발 + +895 +01:04:25,219 --> 01:04:30,549 + 그래서 기본적으로 박사 탄 때문에이 재발이 매우 이상한이 있기 때문에 + +896 +01:04:30,550 --> 01:04:34,680 + 공식 우리는 매우 끔찍 역학에 결국 그리고 그것은 매우 불안정입니다 + +897 +01:04:34,679 --> 01:04:39,949 + 그냥 그렇게 연습이 처리 된 방법을 폭발하고 또는 사망했다 + +898 +01:04:39,949 --> 01:04:44,439 + 당신은 폭발 그라디언트에게 인사말 마치 하나의 간단한 하키를 제어 할 수 있습니다 + +899 +01:04:44,440 --> 01:04:45,720 + 폭발 당신은 그것을 클릭 + +900 +01:04:45,719 --> 01:04:50,789 + 그래서 사람들은 실제로 매우 누덕 누덕 기운 솔루션처럼하지만 경우에이 관행을 + +901 +01:04:50,789 --> 01:04:55,119 + 두 번 다섯 분 노먼 린 크램 펫 (25) 요소 위에합니까을 읽고있는 나 + +902 +01:04:55,119 --> 01:04:58,150 + 당신이 저하되어 클리핑을 수행 할 수 있도록 그런 일이 그 방법을의 + +903 +01:04:58,150 --> 01:05:01,829 + 폭발 등급을 매기는 문제를 해결하고 당신은 당신이 기록하고있어하지 않습니다 + +904 +01:05:01,829 --> 01:05:06,049 + 더 이상 폭발 그러나 녹색당은 여전히​​ 직장과 엘리스에서 카니발에서 사라질 수 있습니다 + +905 +01:05:06,050 --> 01:05:08,310 + 팀 때문에 이들의 사라지는 그라데이션 문제에 아주 좋은 것입니다 + +906 +01:05:08,309 --> 01:05:12,429 + 단지와 첨가제의 상호 작용에 따라 변화되는 세포의 고속도로 + +907 +01:05:12,429 --> 01:05:17,309 + 당신은 당신이이기 때문에 경우에 당신이 경우 구배는 단지 그들이 아래로 죽지 않을 날려 + +908 +01:05:17,309 --> 01:05:21,000 + 이러한 이유 대략이다처럼 같은 나이 또는 무언가에 의해 곱 + +909 +01:05:21,000 --> 01:05:26,909 + 단지 더 동적으로 우리는 항상 팀 그래서 우리는 그라데이션 클리핑을 수행 할 + +910 +01:05:26,909 --> 01:05:30,149 + 일반적으로 달라스 팀의 기울기가 잠재적으로 폭발 할 수 있기 때문에 + +911 +01:05:30,150 --> 01:05:33,400 + 여전히 그들은 일반적으로 사라하지 않는했다 + +912 +01:05:33,400 --> 01:05:48,608 + 재발 성 신경 네트워크뿐만 아니라에 대한 엘리스 팀은 분명하지 않다 어디를 + +913 +01:05:48,608 --> 01:05:53,769 + 당신이 플러그 것입니다 정확히 같은이 식의 명확하지에 뛰어들 것 + +914 +01:05:53,769 --> 01:06:00,619 + 상대적으로 어디에 아마 대신 G에서 월의 많은 다음에 참석하기 때문에 + +915 +01:06:00,619 --> 01:06:08,690 + 여기 huug하지만 재판매는 바로 이렇게 하나의 방향으로 성장할 것 + +916 +01:06:08,690 --> 01:06:11,980 + 어쩌면 당신은 실제로 좋은 아니에요 작게 있도록 만드는 끝낼 수 없다 + +917 +01:06:11,980 --> 01:06:18,539 + 난 당신이 알고있는 가정 아이디어는 이렇게 연결하는 명확한 방법이 없습니다 기본적으로됩니다 + +918 +01:06:18,539 --> 01:06:25,380 + 여기에 행을 너무 좋아 한 것은 나는이 초 고속도로의 측면에서 그 통지 + +919 +01:06:25,380 --> 01:06:29,780 + 네 개의 얻을 문이있을 때이 그라디언트 이러한 관점은 실제로 고장 + +920 +01:06:29,780 --> 01:06:33,310 + 네 개의 얻을 때 때문에 케이트의 우리는 이러한 행위의 일부를 잊을 수있는 곳 + +921 +01:06:33,309 --> 01:06:37,150 + 내가 문을 잊지 때마다 곱셈 상호 작용은 다음에 그것과 세가와 + +922 +01:06:37,150 --> 01:06:41,470 + 다음 그라데이션을 죽이고 물론 역류 때문에 이러한 슈퍼 중단됩니다 + +923 +01:06:41,469 --> 01:06:45,250 + 당신이없는 
경우 고속도로 가지 사실 어느 문을 잊지하지만 당신은 경우 + +924 +01:06:45,250 --> 01:06:50,000 + a는 다음 그라디언트를 죽일 수 그들의줬고, 그래서 실제로 잊지했다 + +925 +01:06:50,000 --> 01:06:54,710 + 우리는 우리와 함께 연주 할 때 팀은 우리가 가끔 사람들이 때 가정 오스틴의 사용이다 + +926 +01:06:54,710 --> 01:06:58,099 + 긍정적 인 편견 때문에 함께 초기화에 그들이 처음 잊지 얻을 + +927 +01:06:58,099 --> 01:06:58,769 + 에 의한 + +928 +01:06:58,769 --> 01:07:05,699 + 나에 설정하는 것을 잊지 항상 종류의 내가 처음에 생각 해제 + +929 +01:07:05,699 --> 01:07:08,679 + 그래서 처음에 녹색 아주 잘 이야기하고 미국 팀은 배울 수있는 방법 + +930 +01:07:08,679 --> 01:07:12,779 + 그 해당 바이어스 용으로 나중에 사람들이 재생되도록 한 번에 그들을 차단하기 + +931 +01:07:12,780 --> 01:07:17,530 + 수십 년 때때로 그래서 여기에 지난 밤 나는 그 비용을 언급하고 싶었다 + +932 +01:07:17,530 --> 01:07:21,580 + 공간이 그래서 많은 사람들은 기본적으로이 꽤 플레이 한 + +933 +01:07:21,579 --> 01:07:26,119 + 그들이 아키텍처로 다양한 변화를 시도 오디세이 용지 거기 + +934 +01:07:26,119 --> 01:07:32,829 + 잠재적 인 변화의 큰 숫자 이상이 검색을 수행하려고 여기에 종이 + +935 +01:07:32,829 --> 01:07:36,940 + LST 방정식 그리고 그들은 많은 검색을했고, 그들은 아무것도 찾지 못했습니다 + +936 +01:07:36,940 --> 01:07:42,300 + 그건 그냥 애널리스트 오전 너무 좋아하고있어보다 실질적으로 더 잘 작동 + +937 +01:07:42,300 --> 01:07:45,560 + 또한 상대적으로 실제로 인기가 있고 내가 실제로 것 GRU + +938 +01:07:45,559 --> 01:07:50,159 + 당신이 콜로세움 그것의 변화를 개의 DRU 사용 할 수 있습니다 것이 좋습니다 + +939 +01:07:50,159 --> 01:07:54,460 + 그것은 짧은 점이다 대해도 좋은 상호 작용으로 결정했다 + +940 +01:07:54,460 --> 01:07:59,400 + 작은 공식과 단지 하나있는 테네시을 갖지 않는 트랙터 + +941 +01:07:59,400 --> 01:08:03,130 + 구현은 현명한 단지 하나가 가진 기억 단지 좋네요 있도록 만 H가 + +942 +01:08:03,130 --> 01:08:07,590 + 단지 작은 간단한 일이 같은 앞으로 과​​거 두 가지 요인에 차질 + +943 +01:08:07,590 --> 01:08:12,190 + 그 불쾌한의 혜택의 대부분을 갖고있는 것 같아요하지만 그래서는 GRU과라고 + +944 +01:08:12,190 --> 01:08:16,730 + 거의 항상 멋진에 대한 내 경험에 작동하고 그래서 당신은 수도 + +945 +01:08:16,729 --> 01:08:19,939 + 그것을 사용하려는 또는 당신은 그들이 모두 좀 동일한 대해 알고 마지막 시간을 사용할 수 있습니다 + +946 +01:08:19,939 --> 01:08:28,088 + 그래서 누군가가 마구는 아주 좋은하지만의 RaWR하고 실제로하지 않는 것입니다 + +947 +01:08:28,088 --> 01:08:29,130 + 아주 잘 작동 + +948 +01:08:29,130 --> 01:08:32,420 + 소유즈 미국 팀은 무엇을 그들에 대해 좋은 데요 것은 이상한 갖는 것입니다 대신 사용된다 + +949 +01:08:32,420 --> 01:08:36,000 + 그리스을 허용 이러한 첨가제의 상호 작용은 매우 잘 재생 당신은하지 않습니다 + +950 +01:08:36,000 --> 01:08:39,579 + 사라지는 품종 문제를 얻을 우리는 여전히 폭발에 대해 조금 걱정 + +951 +01:08:39,579 --> 01:08:44,269 + 이 사람들은 때때로 내가이 여자 클립을 참조하는 것이 일반적 그래서 문제를 공급 + +952 +01:08:44,270 --> 01:08:46,670 + 더 간단한 구조가 정말하려고하는 말 것 + +953 +01:08:46,670 --> 01:08:50,838 + 연결과 무슨 깊은 거기에 뭔가 오는 방법을 이해 + +954 +01:08:50,838 --> 01:08:53,899 + 주민과 엘리스 팀 사이에 이들에 대해 뭔가 깊은있다 + +955 +01:08:53,899 --> 01:08:57,579 + 나는 우리가 아직 정확히 그 이유는 완전히 이해되지 것 같아요 상호 작용 + +956 +01:08:57,579 --> 01:09:02,210 + 그래서 잘 작동하고 어떤 부분은 시원했고, 그래서 우리가 필요하다고 생각 + +957 +01:09:02,210 --> 01:09:05,119 + 공간 이론과 경험을 모두 이해하고 그것은 매우이야 + +958 +01:09:05,119 --> 01:09:10,979 + 벌리고 연구의 영역과 그래서 그래서 + +959 +01:09:10,979 --> 01:09:23,469 + 스포츠 (10) 그러나 나는 내가 그렇지 않은 그래서 폭발 가정 할 수 클래스의 끝 + +960 +01:09:23,470 --> 01:09:27,020 + 명확 왜 것이라고하지만 당신은 세포 상태로 그라데이션을 주입 유지 + +961 +01:09:27,020 --> 01:09:30,069 + 그래서 어쩌면 때때로 큰 얻을 수 있습니다 저하 + +962 +01:09:30,069 --> 01:09:33,960 + 그것은 그들을 수집하는 것이 일반적이지만 중요 할 수 있으므로 한 시간으로 아마 생각 + +963 +01:09:33,960 --> 01:09:40,829 + 그리고, 나는 그 시점하지만 비뇨기과 기초 I에 대해 확실히 백퍼센트 아니에요 + +964 +01:09:40,829 --> 01:09:46,640 + 흥미로운 무슨 생각 그래 나는 우리가 여기까지해야한다고 생각하지 않습니다 있지만 난 + +965 +01:09:46,640 --> 01:09:47,569 + 여기에 질문을 드리겠습니다 + diff --git a/captions/Ko/Lecture11_ko.srt b/captions/Ko/Lecture11_ko.srt new file mode 100644 index 00000000..e0d1c5ca --- /dev/null +++ b/captions/Ko/Lecture11_ko.srt @@ -0,0 +1,3900 @@ +1 +00:00:00,000 --> 00:00:03,428 + 오늘날 우리가 통과하는 물건을 많이 가지고 오른쪽 그래서 나는 시작하고 싶습니다 + +2 +00:00:03,428 --> 00:00:08,669 + 그래서 오늘 우리는 CNN의 실천에 대해 이야기 
할 것과 많이 이야기하고 + +3 +00:00:08,669 --> 00:00:12,050 + 정말 얻​​기 위해 언급되는 구현 세부 사항의 정말 낮은 수준의 종류 + +4 +00:00:12,050 --> 00:00:15,980 + 당신이 실제로하지만 처음으로 일을 훈련 할 때 이런 일들이 작동합니다 + +5 +00:00:15,980 --> 00:00:20,189 + 보통 우리는 번호 하나에 대해 이야기하는 일부 관리 물건을 그 통해이 + +6 +00:00:20,189 --> 00:00:24,600 + 모든 TA에 의해 정말 영웅적인 노력은 모든 중간 고사가되도록 저하되어있다 + +7 +00:00:24,600 --> 00:00:27,740 + 사람은 확실히 그 대해 감사해야하고 당신도 그들을 선택할 수 있습니다 + +8 +00:00:27,739 --> 00:00:34,920 + 클래스 오늘 또는 여기있는이 근무 시간 중 어느 한 후에도 계속 + +9 +00:00:34,920 --> 00:00:38,609 + 그 마음에 프로젝트 이정표는 자정 때문에 오늘 밤 수 있도록거야 + +10 +00:00:38,609 --> 00:00:41,628 + 난 당신이 마지막에 대한 귀하의 프로젝트를 진행했습니다 희망 있는지 확인 + +11 +00:00:41,628 --> 00:00:45,579 + 지난 주 정도에 대한 부부 등 몇 가지 정말 흥미로운 진전을 + +12 +00:00:45,579 --> 00:00:51,289 + 그를 쓸 수 있는지 확인하고 더 더 드롭 박스에 할당 탭에서 그것을하지 넣지 + +13 +00:00:51,289 --> 00:00:55,460 + 드롭 박스에 있지만 과정에서 할당 탭에 나는이 알고 죄송 것을 + +14 +00:00:55,460 --> 00:00:58,910 + 정말 혼란하지만 단지 단지 등의 지정과 같은 탭을 할당 + +15 +00:00:58,909 --> 00:01:04,000 + 할당이 잘하면 우리가 언젠가 수행해야합니다 그레이딩 작업을 하였다 + +16 +00:01:04,000 --> 00:01:10,140 + 이번 주에 그 과제 세 가지가 너무 밖으로 기억하는 것은 어떻게이 진행되고있어 + +17 +00:01:10,140 --> 00:01:17,159 + 나머지는 당신이 얻을해야하므로 좋은 사람이 누구 괜찮 수행 한 사람은 이루어집니다 + +18 +00:01:17,159 --> 00:01:22,740 + 우리는 그래서 중간에서 약간의 재미 통계를 그래서 1 주일에 의한 때문에 시작 + +19 +00:01:22,739 --> 00:01:26,379 + 당신이 당신의 성적을 볼 때 우리가 실제로이 정말 좋은했다 사촌 흥분하지 않습니다 + +20 +00:01:26,379 --> 00:01:30,759 + 우리가하지 않는 아름다운 표준 편차 아름다운 가우스 분포 + +21 +00:01:30,759 --> 00:01:34,549 + 그것이 내가 또한 지적하고 싶습니다 이미 완벽이 일을 정상화 비난해야 + +22 +00:01:34,549 --> 00:01:38,049 + 그 사람이 일어나서에서 최대는 백 세 그들이있어 의미 점수 + +23 +00:01:38,049 --> 00:01:43,470 + 즉, 그래서 바로 보너스의 모든는 아마에 하드 충분하지 의미 + +24 +00:01:43,469 --> 00:01:49,500 + 우리는 또한 질문 당 일부는 평균 점수에 내 타악기 고장을 먹으 렴이 + +25 +00:01:49,500 --> 00:01:52,450 + 당신이 뭔가를 가지고있는 경우 중간에 매 질문마다 그래서 만약 당신이 원하는 + +26 +00:01:52,450 --> 00:01:55,510 + 잘못 당신은 다른 사람이 당신에게 잘못에 확인 갈 수 있어요 있는지 확인하려면 + +27 +00:01:55,510 --> 00:01:59,380 + 당신이 당신의 자신의 시간에있어 이들 통계 리더 우리는 참 거짓에 대한 통계가 + +28 +00:01:59,379 --> 00:02:00,959 + 와 객관식 + +29 +00:02:00,959 --> 00:02:04,729 + 실제로 우리가 등급 중 결정 진정한 거짓이 해고 염두에 두어야 + +30 +00:02:04,730 --> 00:02:07,090 + 그들은 그것을 버리고 당신에게 줄 약간 불공정라고 모든 + +31 +00:02:07,090 --> 00:02:12,960 + 그 두 가지가 백퍼센트 왜 우리가이 통계에 대한이되는 점 + +32 +00:02:12,960 --> 00:02:19,810 + 모든 개별 질문은 그래서 가서 그 이상으로 재미를 + +33 +00:02:19,810 --> 00:02:24,379 + 내가 아는 마지막으로 그것은 동안이었다 그러나 우리는 중간을했고, 우리는 휴일이 있었다 + +34 +00:02:24,379 --> 00:02:28,030 + 하지만 당신은 일주일 전에 우리가 재발에 대해 얘기했다 위에처럼 기억할 수있는 경우 + +35 +00:02:28,030 --> 00:02:31,509 + 우리가 어떻게 재발 네트워크에 대해 이야기 네트워크 모델링에 사용할 수 있습니다 + +36 +00:02:31,509 --> 00:02:35,500 + 이러한 피드 포워드 네트워크와 당신이 일반적으로 알고 서열 그들은 그것을 바쳐 그들이 + +37 +00:02:35,500 --> 00:02:39,139 + 이 키보드 기능을 모델링하지만이 재발 네트워크를 우리는 방법에 대해 이야기 + +38 +00:02:39,139 --> 00:02:43,208 + 그들은 우리가 이야기 순서 문제의 다른 종류를 모델링 할 수 + +39 +00:02:43,209 --> 00:02:48,319 + 재발 네트워크 (10)의 특정 구현이 발표되고 앨리스와 + +40 +00:02:48,319 --> 00:02:51,539 + 당신이 알아야 할 수 있도록 할당에 그 모두를 구현 무엇 + +41 +00:02:51,539 --> 00:02:56,079 + 우리가이 올바른 재발 성 신경 네트워크가 될 수있는 방법에 대해 이야기하다 + +42 +00:02:56,080 --> 00:03:01,010 + 언어 모델에 사용되는 일부 샘플 생성 된 텍스트를 보여주는 재미 있었다 + +43 +00:03:01,009 --> 00:03:06,329 + 에 우리는 방법에 대해 이야기를의 셰익스피어와 대수 기하학이 무엇인지 + +44 +00:03:06,330 --> 00:03:09,590 + 우리는 이미지를 할 수있는 길쌈 네트워크와 재발 네트워크를 결합 할 수 있습니다 + +45 +00:03:09,590 --> 00:03:14,180 + 캡처 우리는 RNN의 신경 과학자 인의 조금이 게임을 + +46 +00:03:14,180 --> 00:03:17,700 + 그리고 아르덴의 세포로 다이빙은 무엇을 해석하려고 + +47 +00:03:17,699 --> 00:03:21,879 + 그들은 일을하는지 우리는 때때로 우리가이 끝없는 세포를 가지고 보았다 + +48 +00:03:21,879 --> 00:03:27,049 + 꽤 멋진 문을 예를 들어 활성화를위한 선동하지만 + +49 +00:03:27,049 --> 
00:03:28,890 + 오늘 우리는 완전히 다른 무언가에 대해 이야기하는거야 + +50 +00:03:28,889 --> 00:03:33,339 + 우리는 낮은 수준의 것들을 정말 많이 약거야 말거야 세 가지가 그 + +51 +00:03:33,340 --> 00:03:37,830 + 세 가지 주요 테마가 그래서 당신은 실제로 CNN의 작업을 얻기 위해 알아야 할 + +52 +00:03:37,830 --> 00:03:41,600 + 그것은 포푸리 약간의 그러나 우리는 그래서 그것을 함께 묶어하려고거야 + +53 +00:03:41,599 --> 00:03:45,349 + 첫 번째는 정말 모든 주스를 압착된다 당신이 할 수없는 데이터의 정말 + +54 +00:03:45,349 --> 00:03:48,219 + 당신이 큰 데이터 세트를 필요가 없습니다 특히 프로젝트에 대해 많이 알고 + +55 +00:03:48,219 --> 00:03:51,789 + 우리는 데이터 증가 및 전송 학습에 대해 이야기 할 것중인 + +56 +00:03:51,789 --> 00:03:55,079 + 당신이있어 특히이 정말 강력한 유용한 기술입니다 + +57 +00:03:55,080 --> 00:03:56,350 + 작은 데이터 세트로 작업 + +58 +00:03:56,349 --> 00:04:00,889 + 우리가 정말 회선에 깊은 다이빙에 대한 더 많은 이야기거야 + +59 +00:04:00,889 --> 00:04:05,959 + 그 모두가 회선을 사용하여 효율적인 아키텍처를 설계 할 수있는 방법과 + +60 +00:04:05,960 --> 00:04:10,480 + 기여 효율적으로 연습 한 후 최종적으로 구현되는 방법도 + +61 +00:04:10,479 --> 00:04:13,269 + 우리는 뭔가에 대해 이야기거야하지만 일반적으로 구현에서 집중됩니다 + +62 +00:04:13,270 --> 00:04:17,480 + 자세한 내용은 심지어 종이로하지 않고 사람과 같은 그 물건은이 + +63 +00:04:17,480 --> 00:04:21,750 + CPU와 경험 병목 어떤 종류의 GPU하고 당신에게 얼마나 많은 훈련 + +64 +00:04:21,750 --> 00:04:26,069 + 물건을 많이의 여러 장치를 통해 여러 걸쳐 비가 배포 + +65 +00:04:26,069 --> 00:04:31,620 + 우리의 내가 생각 데이터 증가에 대해 이야기 할 수 있도록 먼저 시작해야 + +66 +00:04:31,620 --> 00:04:34,910 + 우리는 종류의이 강의 그러나 결코 지금까지 전달 될 수 있습니다 언급 한 + +67 +00:04:34,910 --> 00:04:39,780 + 정말 당신이 CNN의 당신이 정말로있어 훈련 때 그래서 일반적으로 그것에 대해 이야기 + +68 +00:04:39,779 --> 00:04:44,179 + 때 훈련 도중 파이프 라인의이 유형에 익숙해 거 부하 이미지를이야 + +69 +00:04:44,180 --> 00:04:48,379 + 당신이거야 책상 오프 최대 레이블은 다음 CNN을 통해 이미지를 지불 + +70 +00:04:48,379 --> 00:04:51,009 + 일부 손실을 계산하기 위해 라벨과 함께 이미지를 사용하는거야 + +71 +00:04:51,009 --> 00:04:55,610 + 기능 및 역 전파를 업데이트 CNN과 이전의 반복 그 때문에 + +72 +00:04:55,610 --> 00:05:00,970 + 문서상의가에 대해 이제 일에 의해 그와 정말 잘 알고 있어야합니다 + +73 +00:05:00,970 --> 00:05:05,960 + 우리는 단지 우리가로드 한 후, 그래서 여기에이 파이프 라인에 하나의 작은 단계가 있었다 + +74 +00:05:05,959 --> 00:05:09,849 + 책상 위의 이미지는 우리가 전달하기 전에 어떤 방법으로 그것을 변환하는거야 + +75 +00:05:09,850 --> 00:05:13,910 + 그것을 CNN에이 변환은 라벨을 보존한다 + +76 +00:05:13,910 --> 00:05:19,090 + 정말 간단하고 트릭 그냥 그래서 거 전파 돌아올와 CNN + +77 +00:05:19,089 --> 00:05:24,089 + 당신은 데이터 증가를 생각되어 사용되어야한다 변압기의 종류 + +78 +00:05:24,089 --> 00:05:27,679 + 정말 간단 당신이 인위적으로 교육을 확장기 할 수 있습니다이 방법의 일종 + +79 +00:05:27,680 --> 00:05:32,030 + 변환의 다른 종류의 현명한 사용을 통해 설정된 경우 그래서 당신 + +80 +00:05:32,029 --> 00:05:35,409 + 이러한 시도로 컴퓨터가 정말로 이러한 이미지를보고있다 기억하고 몇 가지를 얻을 수 + +81 +00:05:35,410 --> 00:05:39,189 + 픽셀 우리가 할 수있는 변환의 서로 다른 종류가있다 + +82 +00:05:39,189 --> 00:05:43,230 + 그 레이블을 유지해야하지만 이는 모든 픽셀이 변경됩니다 당신이 경우 + +83 +00:05:43,230 --> 00:05:46,770 + 그것을 왼쪽으로 그 고양이 1 픽셀을 출하처럼 상상하는 것은 여전히​​ 고양이하지만 모든입니다 + +84 +00:05:46,769 --> 00:05:50,539 + 픽셀을 사용하면 문서에 대해 이야기 할 때 그래서 그 변경 예정 + +85 +00:05:50,540 --> 00:05:54,680 + 당신은 종류의 당신이 당신의 훈련을 확대하고 상상하고 + +86 +00:05:54,680 --> 00:05:58,629 + 교육 및 새로운 기본 훈련 샘플은 상관 관계하지만 여전히 것 + +87 +00:05:58,629 --> 00:06:03,389 + 당신은 방지와 더 큰 모델과 모델을 훈련하는 데 도움이 매우이다 + +88 +00:06:03,389 --> 00:06:04,959 + 매우 광범위하게 연습에 사용 + +89 +00:06:04,959 --> 00:06:08,668 + 거의 모든 CNN 당신은 그 대회에서 우승거나 잘하고있어 참조 + +90 +00:06:08,668 --> 00:06:09,810 + 벤치 마크는 일부를 사용하고 있습니다 + +91 +00:06:09,810 --> 00:06:15,889 + 비트 증가에서 가장 쉬운 그래서 역은 수평 뒤집기 경우입니다 + +92 +00:06:15,889 --> 00:06:18,699 + 당신이 거울 이미지가해야 미러 이미지를 볼 때 우리는이 고양이를 생각한다 + +93 +00:06:18,699 --> 00:06:22,949 + 여전히 고양이 일이 방금 할 수있는 심판을 구현하기 위해 정말 정말 간단합니다 + +94 +00:06:22,949 --> 00:06:27,159 + 단일 통화 마찬가지로 쉽고 다른 토치 단 한 줄의 코드와 함께 할 + +95 +00:06:27,160 --> 00:06:32,040 + 프레임 워크이 정말 쉽다는 아주의 다른 널리 사용되는 뭔가를 변화 + +96 +00:06:32,040 --> 00:06:37,120 + 널리 교육 시간 있도록 훈련 이미지에서 
임의의 작물을하는 데 사용 + +97 +00:06:37,120 --> 00:06:40,949 + (A)에서 이미지에 대한 패치를 가지고 우리는 그녀의 이미지를로드거야 그리고 우리는거야 + +98 +00:06:40,949 --> 00:06:42,629 + 임의의 규모와 위치 + +99 +00:06:42,629 --> 00:06:47,189 + 우리는 CNN의는 어떤 크기를 기대하고 고정으로 크기를 조정 한 다음과 같은 것을 사용하여 우리의 + +100 +00:06:47,189 --> 00:06:51,389 + 예를 훈련하고 다시이 아주 아주 널리 그냥 사용하는 것은 당신에게 맛을 제공 + +101 +00:06:51,389 --> 00:06:56,610 + 이 사용 방법을 정확하게의 전 주민에 대한 세부 사항을 그들이 그렇게 고개 + +102 +00:06:56,610 --> 00:07:01,639 + 실제로 각 교육 이미지 크기 조정 종이 스티커 난수 교육 시간을 가졌다 + +103 +00:07:01,639 --> 00:07:05,620 + 짧은면 해당 번호는 다음 샘플이되도록 전체 이미지 크기를 조정 + +104 +00:07:05,620 --> 00:07:09,720 + 다음 임의의 크기 조정 차원에서 224 작물로 224 그리고 같은 것을 사용 + +105 +00:07:09,720 --> 00:07:13,990 + 훈련 샘플은 그래서 도움이 일반적으로 구현하는 데 아주 쉽게 그리고 + +106 +00:07:13,990 --> 00:07:20,560 + 꽤 당신은 데이터 증가 일반적으로 사물의 형태를 사용하고 그렇게 할 때 + +107 +00:07:20,560 --> 00:07:25,269 + 약간의 테스트 시간이 양식을 사용하므로 교육 시간 변경 + +108 +00:07:25,269 --> 00:07:29,079 + 데이터 증가는 네트워크가 정말 긴장 전체 이미지에 훈련되지 않습니다 + +109 +00:07:29,079 --> 00:07:34,219 + 그의 작물에 정말 이해하거나 강제로 시도 공정하지 않는 것 때문에 + +110 +00:07:34,220 --> 00:07:38,900 + 네트워크는 내가 연습 할 때 그렇게 보통 해요 테스트로 전체 이미지를 볼 수 있습니다 + +111 +00:07:38,899 --> 00:07:42,879 + 당신은 미국에서 데이터 증가에 대한 임의 자르기 이런 종류의 일을하고 + +112 +00:07:42,879 --> 00:07:48,379 + 시간 당신은 작물의 일부 고정 세트가 그래서 매우 테스트하기 위해 다음을 사용합니다 + +113 +00:07:48,379 --> 00:07:52,019 + 일반적으로 당신은 당신이 왼쪽 손을 잡고 것이다 열 작물을 볼 것을 볼 수 있습니다 + +114 +00:07:52,019 --> 00:07:52,649 + 모서리 + +115 +00:07:52,649 --> 00:07:56,189 + 당신 하단 모서리와 중심을 제공하는 오른쪽 상단 모서리 + +116 +00:07:56,189 --> 00:08:00,800 + 함께 수평 플립에서 오는 10 그가 그 (10) 작물 할게요 제공하는 + +117 +00:08:00,800 --> 00:08:06,460 + 그래서 그 (10) 작물의 네트워크를 통해 평균 점수를 통과 시험 시간 + +118 +00:08:06,459 --> 00:08:09,519 + 공진 실제로 더 그 조금 한 단계 소요 실제로 수행 + +119 +00:08:09,519 --> 00:08:14,759 + 다중 스케일 다중 스케일이가 경향이 뭔가뿐만 아니라 시간을 증명 + +120 +00:08:14,759 --> 00:08:20,649 + 실제로 성능이 도움 구현 다시 아주 쉽게 널리 사용되는 다양 + +121 +00:08:20,649 --> 00:08:26,418 + 우리가 일반적으로 48 증가 할 또 다른 점은 그렇다면 컬러 생성입니다 + +122 +00:08:26,418 --> 00:08:29,529 + 당신은 아마 어쩌면 고양이의이 사진을 찍을는 그 조금 cloudier했다 + +123 +00:08:29,529 --> 00:08:33,348 + 그 날 우스운 우리가 많이보다 사진을 촬영했을 경우 하루 조금 + +124 +00:08:33,349 --> 00:08:37,070 + 색상의 너무 한 가지 아주의 그 상당히 달라졌을 것이다 + +125 +00:08:37,070 --> 00:08:40,360 + 방금 전에 색상 우리 교육의 이미지를 조금 변경하면된다하는 것이 일반적 + +126 +00:08:40,360 --> 00:08:45,539 + 내가 아주 간단한 방법은 그냥이가있는 반면 변화입니다 해요 그래서 우리는 CNN에 도착 + +127 +00:08:45,539 --> 00:08:50,469 + 매우 수행하는 매우 간단한을 쉽게 구현할 수 있지만, 실제로는 실제로 당신은 볼 수 있습니다 + +128 +00:08:50,470 --> 00:08:55,759 + 이 흔하지 조금 동안 계약 대신 당신이 볼은 그 + +129 +00:08:55,759 --> 00:09:01,259 + 이상의 주성분 분석을 사용하여이 약간 더 복잡한 파이프 라인 + +130 +00:09:01,259 --> 00:09:06,439 + 학습 데이터의 모든 화소 아이디어이라는 각 화소에 대한 우리 + +131 +00:09:06,440 --> 00:09:11,390 + 훈련 데이터는 길이 3의 RGB이 벡터이며 우리는 그 화소를 수집하는 경우 + +132 +00:09:11,389 --> 00:09:15,129 + 전체 훈련 데이터를 통해 당신은 색상의 종류의 감각을 얻을 것을 + +133 +00:09:15,129 --> 00:09:19,330 + 일반적으로 다음 주성분 분석을 사용하여 트레이닝 데이터에 존재 + +134 +00:09:19,330 --> 00:09:23,930 + 우리에게 색 공간이 종류의 세 가지 주요 구성 요소의 방향을 제공합니다 + +135 +00:09:23,929 --> 00:09:27,879 + 색상이 데이터 세트에서 변화하는 경향이 따라 방향이 무엇인지를 알려 + +136 +00:09:27,879 --> 00:09:32,429 + 색상 확대 술에 대한 교육 시간에 시험보다 너무 + +137 +00:09:32,429 --> 00:09:35,889 + 우리는 실제로 훈련의 색깔이 주요 구성 요소를 사용할 수 있습니다 + +138 +00:09:35,889 --> 00:09:41,419 + 성별 훈련시에 색이 다시 얼마나 사이트 정확하게 선택할 + +139 +00:09:41,419 --> 00:09:46,719 + 조금 더 복잡하지만 꽤 널리 PCA 이러한 유형의 있도록 사용 + +140 +00:09:46,720 --> 00:09:51,580 + 내가 생각하는 색상의 구동 데이터의 증가는 알렉스 도입 한 것 + +141 +00:09:51,580 --> 00:09:58,310 + 2012 년 종이 및 또한 예를 들어 ResNet에 사용되는 데이터의 증가 때문에 + +142 +00:09:58,309 --> 00:10:02,829 + 이 매우 일반적인 일이 바로 당신이 
단지에 대해 생각하고 싶지 이스라엘 당신의 + +143 +00:10:02,830 --> 00:10:06,420 + 변환의 종류는 당신이 당신의 클래스 불에 원하는 작업 데이터 세트 + +144 +00:10:06,419 --> 00:10:11,179 + 다양한 너무 그리고 당신은에 변화의 그 유형을 소개하고 싶습니다 + +145 +00:10:11,179 --> 00:10:15,229 + 훈련 데이터 교육 시간 그리고 당신이 정말로 여기에 미쳐 얻을 수 있습니다 + +146 +00:10:15,230 --> 00:10:18,740 + 정말 창의적이고는 데이터에 대해 생각하고 어떤 종류의 편차를 + +147 +00:10:18,740 --> 00:10:23,659 + 당신은 아마 랜덤처럼 그것을 시도 할 수 있습니다 귀하의 데이터에 대한 의미가 있습니다 + +148 +00:10:23,659 --> 00:10:27,708 + 몇도 회전 할 수있다 데이터에 따라 회전은 의미가 + +149 +00:10:27,708 --> 00:10:31,399 + 당신은 스트레칭과 시뮬레이션하기 위해 전단의 다른 종류처럼 시도 할 수 있습니다 + +150 +00:10:31,399 --> 00:10:33,189 + 데이터의 아마 아핀 변환 + +151 +00:10:33,190 --> 00:10:36,990 + 그리고 당신이 정말로 여기에 미쳐과 창의력을하려고 생각할 수 + +152 +00:10:36,990 --> 00:10:43,840 + 내가 지적하고 싶은에 대한 데이터와 다른 일을 할 수있는 흥미로운 방법 + +153 +00:10:43,840 --> 00:10:49,009 + 데이터 증가의이 아이디어는 정말 지금 우리가했습니다 큰 테마에 맞는 것입니다 + +154 +00:10:49,009 --> 00:10:54,090 + 볼이 과정을 통해 여러 번 반복이 팀은 하나의 방법이다 + +155 +00:10:54,090 --> 00:10:58,420 + 즉 정기적으로 overfitting 방지하기위한 연습에 정말 유용 + +156 +00:10:58,419 --> 00:11:02,209 + 라이더는 훈련 도중 네 번째 패스 동안 우리가 훈련 할 때이다 + +157 +00:11:02,210 --> 00:11:05,930 + 네트워크 우리와 종류 혼란에 이상한 확률 잡음의 일종했다 + +158 +00:11:05,929 --> 00:11:10,629 + 네트워크는 데이터 증가와 예를 들어 우리가 실제로 수정하고 + +159 +00:11:10,629 --> 00:11:14,210 + 우리가 떨어 뜨리거나 같은 것을 사용하여 네트워크에 넣어 학습 데이터 + +160 +00:11:14,210 --> 00:11:18,860 + 네트워크의 임의의 부분을 복용하고 그는 그들이 설정하는 연결 drop하여 + +161 +00:11:18,860 --> 00:11:22,730 + 프로세서의 활성화 임의로 가중치 20 아르 + +162 +00:11:22,730 --> 00:11:28,450 + 이것은 또한 이것은 또한 패치 보쉬 정상화와 종류의 표시가 + +163 +00:11:28,450 --> 00:11:31,930 + 정규화하여 정규화 내용의 다른 것들에 의존 + +164 +00:11:31,929 --> 00:11:35,000 + 그래서 정상적인 훈련 중에 배치 + +165 +00:11:35,000 --> 00:11:39,440 + 같은 이미지는 서로 다른 다른 이미지와 많은 일괄 적으로 나타나는 끝낼 수 있습니다 + +166 +00:11:39,440 --> 00:11:43,840 + 실제로 나는 시간을 훈련 잡음 만의 모든 유형을 소개합니다 + +167 +00:11:43,840 --> 00:11:47,690 + 이러한 예는 테스트 시간 우리는 데이터 증가에 대한 그래서이 소음을 평균 + +168 +00:11:47,690 --> 00:11:52,790 + 우리 모두가 드롭 아웃에 대한 훈련 데이터의 다양한 샘플에 걸쳐 평균을 + +169 +00:11:52,789 --> 00:11:56,870 + 그리고 일종의 평가할 수를 연결 삭제하고이를 소외 + +170 +00:11:56,870 --> 00:12:01,090 + 우리가 자신을 계속 실행 계속 더 분석적으로 작은 및 전망 정상화 + +171 +00:12:01,090 --> 00:12:05,269 + 그래서 난 그냥 그 이러한 아이디어를 많이 통합 할 수있는 좋은 방법의 종류 생각 의미 + +172 +00:12:05,269 --> 00:12:08,960 + 정규화는 다음 전진 패스에 노이즈를 추가 할 수 있습니다 때이다 + +173 +00:12:08,960 --> 00:12:13,540 + 당신이 올하려는 경우 한 번에 이상 소외 너무 마음에 계속 + +174 +00:12:13,539 --> 00:12:20,250 + 그 주요 테이크 아웃을 그래서 다른 창조적 인 방법은 네트워크를 정례화하기 + +175 +00:12:20,250 --> 00:12:24,149 + 데이터 증가에 대한 하나 그것을 구현하는 것이 정말 간단하다는 있습니다 + +176 +00:12:24,149 --> 00:12:28,329 + 그래서 당신은 거의 항상 변명하지에 그것을 정말이 아니에요 사용해야합니다 + +177 +00:12:28,330 --> 00:12:32,730 + 그것은 내가 당신을 많이 생각하는 작은 데이터 세트를 위해 특별히 아주 아주 유용합니다 + +178 +00:12:32,730 --> 00:12:36,850 + 프로젝트에 사용하고 또한이 프레임 워크와 멋지게에 맞는 + +179 +00:12:36,850 --> 00:12:41,509 + 교육 및 소외에서 잡음이 난 테스트 그래서 나는 그 꽤의 생각 + +180 +00:12:41,509 --> 00:12:45,360 + 많은 모든 질문이 그래서 데이터 증가에 대해 말할 수있다 + +181 +00:12:45,360 --> 00:12:45,840 + 그것에 대해 + +182 +00:12:45,840 --> 00:13:01,840 + 네 시간 훈련에 많은 시간이 걸릴 것 내가 지금 얘기 행복 해요 + +183 +00:13:01,840 --> 00:13:05,790 + 디스크 공간이 많이 난 그렇게 가끔 해요 있도록 책상이 일을 덤프 시도 + +184 +00:13:05,789 --> 00:13:08,879 + 사람들은 창의력과 배경이 그들의 일치하는 데이터를 스레드처럼도 있습니다 + +185 +00:13:08,879 --> 00:13:16,799 + 및 문서 바로 그래서 나는 그것이 우리가 이야기 할 수는 분명 생각 + +186 +00:13:16,799 --> 00:13:21,069 + 다음 생각은 그래서 당신이 작업 할 때 그 주위에 떠있는이 신화있다 + +187 +00:13:21,070 --> 00:13:25,770 + CNN의 당신은 정말 많은 양의 데이터가 필요하지만 그 이전 것이 밝혀 + +188 +00:13:25,769 --> 00:13:33,029 + 이 정말 간단 레시피가 당신이 할 수있다, 그래서이 신화를 학습하는 파열된다 + +189 +00:13:33,029 --> 00:13:37,769 + 전송 
학습에 사용할 그게 먼저 당신은 무엇이든 좋아하는을 + +190 +00:13:37,769 --> 00:13:42,879 + CNN 아키텍처는 알렉스 문제 BG 또는 무엇을 당신과 당신이 훈련 중 하나가 있습니다 + +191 +00:13:42,879 --> 00:13:46,970 + 이미지가 아래로 더 일반적으로에 대한 자신 또는 당신은 자유 무역 병을 다운로드하지 + +192 +00:13:46,970 --> 00:13:51,360 + 불과 20 분 거리에 쉽게 인터넷이 수행하는 많은 시간을 다운로드 + +193 +00:13:51,360 --> 00:13:56,590 + 훈련 할 수 있지만, 일반적으로 두 종류의 거기 옆에 당신은 아마 그 부분을하지 않습니다 + +194 +00:13:56,590 --> 00:14:00,910 + 경우 하나의 데이터 세트가 정말 작고, 당신이 정말로 어떤이없는 경우 + +195 +00:14:00,909 --> 00:14:05,019 + 이미지는 무엇이든지 당신은 단지 고정 된 기능으로이 분류를 처리 할 수 + +196 +00:14:05,019 --> 00:14:10,110 + 추출기 그래서이 보는 한 가지 방법은 당신이 마지막 층 할게요이다 + +197 +00:14:10,110 --> 00:14:15,580 + 소프트 맥스 병원 아시아 모델 멀리 걸릴 것 네트워크 그는거야 + +198 +00:14:15,580 --> 00:14:18,370 + 작업에 대한 선형 분류의 일종으로 대체 당신을 + +199 +00:14:18,370 --> 00:14:21,810 + 실제로 걱정 이제 네트워크의 나머지 부분을 동결거야, 그리고 + +200 +00:14:21,809 --> 00:14:26,969 + 해당 상위 레이어를 재교육하는 것은 그래서 이것은 일종의 그냥 훈련에의 것과 동일 + +201 +00:14:26,970 --> 00:14:31,230 + 네트워크에서 추출 된 기능의 상단에 직접 선형 분류 그래서 + +202 +00:14:31,230 --> 00:14:35,149 + 당신이이 경우에 대한 연습에 시간을 많이 볼 것은 그런 종류입니다 + +203 +00:14:35,149 --> 00:14:38,399 + 전처리 단계로 당신은 모든 테스트하는 기능을 덤프합니다 당신의 + +204 +00:14:38,399 --> 00:14:42,100 + 이미지를 훈련하고 있기 때문에 그 캐스트 기능의 상단에 완전히 작동 + +205 +00:14:42,100 --> 00:14:48,110 + 꽤 속도 일을 도와 그것은 매우 매우를 사용하여 아주 쉽게 할 수 있습니다 + +206 +00:14:48,110 --> 00:14:51,250 + 일반적인 보통 많은 문제를위한 매우 강력한 기준을 제공한다 + +207 +00:14:51,250 --> 00:14:56,169 + 당신은 실제로 발생하고 수도 당신보다 조금 더 많은 데이터가있는 경우 + +208 +00:14:56,169 --> 00:14:58,599 + 당신은 실제로 더 편안 훈련을 감당할 수 + +209 +00:14:58,600 --> 00:15:03,949 + 모델은 그래서 일반적으로 일부 부품을 고정합니다 데이터 집합의 크기에 따라 + +210 +00:15:03,948 --> 00:15:07,669 + 다음 하부 네트워크 층과 일부 대신 오직 재교육의 + +211 +00:15:07,669 --> 00:15:11,919 + 마지막 은신처 당신은 방법에 따라 훈련 마지막 편지의 일부 번호를 선택합니다 + +212 +00:15:11,919 --> 00:15:16,349 + 당신이 더 큰 데이터 세트를 사용할 수있을 때 더 큰 데이터 세트는 일반적이며, + +213 +00:15:16,350 --> 00:15:21,350 + 당신이 훈련을 다음 최종 그들의 더 ​​많은 훈련을 감당할 다시이라면 할 수 있습니다 + +214 +00:15:21,350 --> 00:15:26,060 + 당신은 매우 일반적입니다 볼 수 있습니다 무엇을 여기에 트릭에 비슷한 유사 + +215 +00:15:26,059 --> 00:15:29,729 + 그 대신 실제로 명시 적으로이 부분을 계산 당신은 덤프합니다 + +216 +00:15:29,730 --> 00:15:35,019 + 이 마지막 층은 책상에 기능하고 메모리에이 부분에서 작동되도록 + +217 +00:15:35,019 --> 00:15:47,490 + 꽤 많은 일들을 빠르게 때로는 나는 기본적 것을 질문 할 수 있습니다 + +218 +00:15:47,490 --> 00:15:51,959 + 그것을 시도하고 볼 수 있지만 특히 작은 데이터 세트 이러한 유형의 작동해야 + +219 +00:15:51,958 --> 00:15:55,799 + 당신이 원한다면 그냥 할 것처럼 당신이있는 경우에 경우에 이미지가 꽤을 검색하도록 + +220 +00:15:55,799 --> 00:16:01,338 + 그렇게이 될 수 있도록 강력한베이스 라인은 CNN의 기능에 LTE 거리를 사용한다 + +221 +00:16:01,339 --> 00:16:05,110 + 당신이 훈련해야 할 것으로 예상 얼마나 많은 샘플 방식의 유형은 내가 하모니를 의미 + +222 +00:16:05,110 --> 00:16:10,470 + 당신이 당신이 가진 것보다 더 많은이 경우 연방 수사 국 (FBI)이나 뭐 같은 이들에 대한 많은 + +223 +00:16:10,470 --> 00:16:15,310 + 당신보다 더 많은 데이터는 그래서 좋은시는 그 시도에 필요한 기대 + +224 +00:16:15,309 --> 00:16:28,879 + 전혀 어쩌면 내가 그래 당신은 가끔 미안 당신이 실제로 것입니다 의존하고있어 + +225 +00:16:28,879 --> 00:16:32,309 + 전진 패스를 통해 실행하지만 때로는 당신은 단지 네 개의 패스를 실행 + +226 +00:16:32,309 --> 00:16:36,818 + 한 번이 꽤 일반적입니다 좀 그건의 두 책상을 덤프 + +227 +00:16:36,818 --> 00:16:41,458 + 실제로 계산을 절약 + +228 +00:16:41,458 --> 00:16:59,729 + 랜덤 하우스에서 당신은 아마 당신이 다른 수업을해야합니다 + +229 +00:16:59,730 --> 00:17:03,350 + 러시아어 문제이나 뭐하지만이 이러한 다른 중간층 + +230 +00:17:03,350 --> 00:17:08,750 + 당신은 실제로와의 이전 모델에 있던 어떤에서 초기화 + +231 +00:17:08,750 --> 00:17:15,068 + 당신이 실제로 할 수있는 좋은 팁을 찾아 연습 나쁜있다가 + +232 +00:17:15,068 --> 00:17:18,588 + 당신은 괜찮아요 때 레이어의 세 가지 유형을 추측하는 경우에만 층 두 가지 유형의 수 + +233 +00:17:18,588 --> 00:17:22,349 + 그들은 당신이있는 것으로 생각할 수있는 냉동 층 수 있습니다 튜닝은 + +234 +00:17:22,349 --> 00:17:27,448 + 제로의 속도를 학습이 새로운 래리가 유리 초기화한다는 것이다있다 + +235 
+00:17:27,449 --> 00:17:32,548
+ from scratch; typically people will give it maybe a higher learning rate, but only
+
+236
+00:17:32,528 --> 00:17:36,528
+ maybe a tenth of whatever the network was originally trained with, and
+
+237
+00:17:36,528 --> 00:17:40,079
+ then you have these intermediate layers, which you initialize from
+
+238
+00:17:40,079 --> 00:17:43,269
+ the pre-trained network but still plan to modify during optimization, and
+
+239
+00:17:43,269 --> 00:17:47,470
+ for these intermediate layers you tend to make the fine-tuning learning rate very small,
+
+240
+00:17:47,470 --> 00:17:56,589
+ maybe one hundredth of the original learning rate; yeah?
+
+241
+00:17:56,589 --> 00:18:04,319
+ [student question] so, what people trying to investigate that have generally found is
+
+242
+00:18:04,319 --> 00:18:08,079
+ that this type of transfer learning, this fine-tuning approach, works
+
+243
+00:18:08,079 --> 00:18:11,710
+ better when the network was originally trained on a similar type of data,
+
+244
+00:18:11,710 --> 00:18:16,610
+ which means something, but it's also the case that these very low-level features,
+
+245
+00:18:16,609 --> 00:18:20,308
+ things like edges and colors and Gabor filters, probably apply
+
+246
+00:18:20,308 --> 00:18:24,190
+ to just about any type of visual data, so these low-level features in particular, I
+
+247
+00:18:24,190 --> 00:18:29,009
+ think, tend to apply pretty generally to almost anything; and by the way, another
+
+248
+00:18:29,009 --> 00:18:33,788
+ tip you'll sometimes see in practice for fine-tuning is that you
+
+249
+00:18:33,788 --> 00:18:37,609
+ can actually have a multi-stage approach, where you first freeze
+
+250
+00:18:37,609 --> 00:18:42,079
+ the whole network and train only this last layer, and then after it
+
+251
+00:18:42,079 --> 00:18:46,939
+ seems to converge, go back and actually fine-tune the rest; people have found you can
+
+252
+00:18:46,940 --> 00:18:51,519
+ sometimes benefit from this, because this last layer is initialized
+
+253
+00:18:51,519 --> 00:18:54,690
+ randomly, so you can have very large gradients that kind of mess up the
+
+254
+00:18:54,690 --> 00:18:59,070
+ pre-trained initialization; so you get around this either by freezing
+
+255
+00:18:59,069 --> 00:19:02,788
+ at first until this layer converges, or by having these varying learning rates
+
+256
+00:19:02,788 --> 00:19:08,658
+ between the two regimes of the network; so this idea of transfer learning
+
+257
+00:19:08,659 --> 00:19:14,470
+ actually works really well; there were a couple of pretty early papers, from 2013
+
+258
+00:19:14,470 --> 00:19:19,390
+ and 2014, when CNNs were just starting to get popular; this one in particular,
+
+259
+00:19:19,390 --> 00:19:24,490
+ the 'astounding baseline' paper, was very straightforward; what they did
+
+260
+00:19:24,490 --> 00:19:26,009
+ was take what was one of the best
+
+261
+00:19:26,009 --> 00:19:30,470
+ off-the-shelf CNNs at the time, OverFeat; they just extracted features and
+
+262
+00:19:30,470 --> 00:19:33,640
+ applied those features to a bunch of different standard datasets and standard
+
+263
+00:19:33,640 --> 00:19:38,679
+ computer vision problems, and compared against the prior ideas, right,
+
+264
+00:19:38,679 --> 00:19:42,210
+ which, at the time, were very specialized:
+
+265
+00:19:42,210 --> 00:19:45,298
+ specialized pipelines and very specialized architectures for each individual
+
+266
+00:19:45,298 --> 00:19:49,408
+ problem and dataset; and for each problem they just replaced this very
+
+267
+00:19:49,409 --> 00:19:54,380
+ specialized pipeline with a very simple linear model on top of the features
+
+268
+00:19:54,380 --> 00:19:58,559
+ from OverFeat; they did this for all of the different datasets and found
+
+269
+00:19:58,558 --> 00:20:01,940
+ that in general these off-the-shelf features were a very, very
+
+270
+00:20:01,940 --> 00:20:06,080
+ strong baseline, and on some problems they were actually better than the existing
+
+271
+00:20:06,079 --> 00:20:08,428
+ methods, and on some problems they were
+
+272
+00:20:08,429 --> 00:20:12,879
+ a bit worse but still very competitive; so this was a really cool paper that
+
+273
+00:20:12,878 --> 00:20:16,118
+ demonstrated that these really strong features can just be used on
+
+274
+00:20:16,118 --> 00:20:19,949
+ lots of different tasks and tend to work very well; another paper along those
+
+275
+00:20:19,950 --> 00:20:25,419
+ lines was the DeCAF paper, which was from Berkeley, and DeCAF later became
+
+276
+00:20:25,419 --> 00:20:33,610
+ Caffe, so there's a kind of lineage there
+
+277
+00:20:33,609 --> 00:20:37,388
+ so the kind of recipe for transfer learning is that you should think about
+
+278
+00:20:37,388 --> 00:20:43,398
+ this two-by-two matrix: how similar is your dataset to what the pre-trained
+
+279
+00:20:43,398 --> 00:20:47,989
+ model was trained on, and how much data do you have; what should you do in those four
+
+280
+00:20:47,990 --> 00:20:53,240
+ different cases? so generally, if you have a very similar dataset and very
+
+281
+00:20:53,240 --> 00:20:57,538
+ little data, just use the network as a fixed feature extractor and train
+
+282
+00:20:57,538 --> 00:21:02,429 + 이러한 기능의 상단에 간단한 선형 모델은 경우에 아주 잘 작동하는 경향이있다 + +283 +00:21:02,429 --> 00:21:06,470 + 당신이 미세 조정을 시도하고 실제로 시도하려고 할 수있는 것보다 조금 더 많은 데이터를 가지고 + +284 +00:21:06,470 --> 00:21:10,509 + 사전 심사 무게와 실행에서 미세 조정에서 네트워크를 초기화 + +285 +00:21:10,509 --> 00:21:15,868 + 다른 열이에서 최적화이 상자 여기에 약간의 트릭이다 + +286 +00:21:15,868 --> 00:21:20,099 + 당신은 당신이 창의력을 시도하고 아마 대신 할 수 있습니다 문제가 될 수 있습니다 + +287 +00:21:20,099 --> 00:21:23,998 + 맨 마지막 층으로부터 특징을 추출하면이 기능을 추출 시도 할 수 있습니다 + +288 +00:21:23,999 --> 00:21:27,470 + 다른 대륙의 레이어와 그 때로는 때로는 도움이 될 수 있습니다에서 + +289 +00:21:27,470 --> 00:21:32,819 + 직관이있다 어쩌면 아마도 이러한 MRI 데이터 등 뭔가 + +290 +00:21:32,819 --> 00:21:37,178 + 매우 최고 수준의 기능은 매우 구체적인 이미지 지금 범주는하지만, 이러한 + +291 +00:21:37,179 --> 00:21:42,059 + 매우 낮은 수준의 기능을 할 수처럼 가장자리와 물건 같은 것들입니다 + +292 +00:21:42,058 --> 00:21:47,980 + 켜 분명 이미지 네트 기술 데이터 세트를 켜 더 양도 + +293 +00:21:47,980 --> 00:21:51,099 + 이 상자에 당신은 더 나은 모양에있어 다시 당신은 일종의 초기화 할 수 있습니다 + +294 +00:21:51,099 --> 00:21:57,928 + 미세 곧 내가 지적하고 싶은 다른 일이 초기화이 좋습니다 + +295 +00:21:57,929 --> 00:22:01,590 + 초반 이었죠 모델과 미세 조정으로 실제로는 예외 아니다 + +296 +00:22:01,589 --> 00:22:05,439 + 당신이 볼 수있는 거의 모든 큰 시스템에서 거의 표준 연습 + +297 +00:22:05,440 --> 00:22:09,070 + 컴퓨터 비전 요즘 우리는 실제로이 두 가지 예를 본 적이 + +298 +00:22:09,069 --> 00:22:13,220 + 이미 분기 예를 들어, 당신은 몇 강의 전에서 기억한다면, 그래서 + +299 +00:22:13,220 --> 00:22:17,220 + 우리는 이미지를보고 우리가 CNN했다 물체 검출에 대해 이야기 + +300 +00:22:17,220 --> 00:22:21,620 + 지역의 제안이 다른 호출이이 모든 미친 것들하지만이 부분 + +301 +00:22:21,619 --> 00:22:25,529 + CNN과 우리가 CNN이보고 한 이미지 및 이미지 캡션보고 있었다 + +302 +00:22:25,529 --> 00:22:29,399 + 이미지는 이러한 경우 모두에서, 그래서 CNN의 처음 imagefap에서입니다 되었더라도 + +303 +00:22:29,400 --> 00:22:34,080 + 모델과 그 정말이 다른보다 전문적인 문제를 해결하는 데 도움이 + +304 +00:22:34,079 --> 00:22:38,839 + 심지어 거대한 데이터 세트없이 또한 이미지 캡션 모델에 대한 + +305 +00:22:38,839 --> 00:22:42,829 + 이 모델의 특정 부분이 당신이해야 할 것을 요구 하였다 포함 + +306 +00:22:42,829 --> 00:22:47,500 + 당신이 그것을 시작하지만 사람들은 벡터 아니었다면 숙제를 지금까지 본 + +307 +00:22:47,500 --> 00:22:50,099 + 당신은 실제로 아마이었다 뭔가에서 초기화 할 수 있습니다 + +308 +00:22:50,099 --> 00:22:54,019 + 과세의 무리를 사전에 훈련하고 때로는 일부 검색에 아마 도움이 될 수 있습니다 + +309 +00:22:54,019 --> 00:22:58,668 + 어떤 상황에서 사용 가능한 캡처 데이터를 많이 가지고하지 않을 수 있습니다 경우 + +310 +00:22:58,669 --> 00:23:15,490 + 그래 나는 문제가에 따라 달라집니다에 의존 때로는 도움이 왔어요 + +311 +00:23:15,490 --> 00:23:18,859 + 네트워크는하지만 확실히 당신이 시도 할 수 있습니다 뭔가하고 특히 수도 + +312 +00:23:18,859 --> 00:23:27,548 + 이 상자에있을 때 도움이되지만 그래 그것은 테이크 아웃에 좋은 트릭 + +313 +00:23:27,548 --> 00:23:31,210 + 대한 미세 조정은 정말 정말 좋은 생각을 사용해야한다는 것입니다 + +314 +00:23:31,210 --> 00:23:35,950 + 그것은 실제로 정말 잘 작동하도록 그래 당신이해야 아마 거의 + +315 +00:23:35,950 --> 00:23:39,900 + 항상 그것을 사용하고 어느 정도는 일반적으로 교육되고 싶지 않아 + +316 +00:23:39,900 --> 00:23:42,519 + 처음부터이 일이 정말 정말 큰 데이터 집합이없는 경우 + +317 +00:23:42,519 --> 00:23:45,970 + 거의 모든 상황에서 사용할 수있는 것이 훨씬 더 편리 + +318 +00:23:45,970 --> 00:23:52,279 + 카페 기존 모델과 방식에 의해 발견하면이 기존 모델이 + +319 +00:23:52,279 --> 00:23:58,230 + 당신은 많은 사람들이 많은 유명한 이미지로 모델을 존재 다운로드 할 수 있습니다 + +320 +00:23:58,230 --> 00:24:01,880 + 실제로 잔류 네트워크 공식 모델은 아주 최근에 출시있어 + +321 +00:24:01,880 --> 00:24:06,130 + 당신도 다운로드 정말 멋진 이들 카페 것 그것으로 재생할 수 있습니다 + +322 +00:24:06,130 --> 00:24:09,020 + 새로운 모델 모델은 일종의 표준의 약간처럼 + +323 +00:24:09,019 --> 00:24:13,759 + 지역 사회 당신은 심지어 같은 다른 다른 프레임 워크에 카페 모델을로드 할 수 있도록 + +324 +00:24:13,759 --> 00:24:17,658 + 토치는 그래서 그 다음 카페 모델이 상당히 있다는 사실을 양지해야 할 무언가이다 + +325 +00:24:17,659 --> 00:24:21,030 + 유용 못했습니다 + +326 +00:24:21,029 --> 00:24:26,889 + 미세 조정 또는 전송 학습에 어떤 추가 질문 + +327 +00:24:26,890 --> 00:24:46,650 + 당신이 높게을 시도 할 수 있도록 그래 그건 아주 크고 낮은 차원의 + +328 +00:24:46,650 --> 00:24:50,250 + 그 꼭대기에 선형 모델을 
정례화하거나 작은 올 퍼팅 시도 할 수 있습니다 + +329 +00:24:50,250 --> 00:24:53,109 + 어쩌면 차원을 줄일 그 꼭대기에에서 여기 창조적 얻을 수 있습니다 + +330 +00:24:53,109 --> 00:24:56,399 + 하지만 난 당신이 그 일을 할 수 시도 할 수있는 일이있다가 있다고 생각 + +331 +00:24:56,400 --> 00:25:03,640 + 데이터가에 따라 적합한 그래서 우리가 더 이야기해야한다고 생각 + +332 +00:25:03,640 --> 00:25:07,740 + 회선이 모든 네트워크에 대해 우리는 그것에 대해 정말 이야기했습니다 있도록 + +333 +00:25:07,740 --> 00:25:11,920 + 회선은 많은 작업을하고있어 계산 주력이다 + +334 +00:25:11,920 --> 00:25:18,090 + 네트워크는 그래​​서 우리는 회선 처음에 대해 약 두 가지를 얘기해야 + +335 +00:25:18,089 --> 00:25:22,809 + 우리는 효율적인 네트워크 아키텍처를 설계 할 수있는 방법 그래서 그들을 막을 방법입니다 + +336 +00:25:22,809 --> 00:25:28,789 + 여기에, 그래서 몇 가지 멋진 결과를 달성하기 위해 회선의 많은 레이어를 결합 + +337 +00:25:28,789 --> 00:25:33,230 + 질문은 우리가 사람들의 두 개의 층이있는 네트워크를 가지고 있다고 가정 내가 + +338 +00:25:33,230 --> 00:25:37,190 + 이 같은 세 개의 기여가 입력 될이 활성화 될지도 + +339 +00:25:37,190 --> 00:25:40,120 + 제 1 층이 두 층의 후 활성화 NAP 것 + +340 +00:25:40,119 --> 00:25:45,959 + 회선 문제는이 두 번째 층 상에이란에 대한 지역의 얼마나 큰 + +341 +00:25:45,960 --> 00:25:49,640 + 입력이 표시되지 않습니다에이 내가 희망 나는 귀하의 중간에 있던 나는 U 너희들 희망 + +342 +00:25:49,640 --> 00:25:53,920 + 모두가 이에 대한 답변을 알고 + +343 +00:25:53,920 --> 00:26:01,298 + 확인 사람은 아마 그 힘든 시험 문제였다 + +344 +00:26:01,298 --> 00:26:05,230 + 하지만이 다섯으로 다섯입니다 그리고 그것은이에서 볼 꽤 쉽게이다 + +345 +00:26:05,230 --> 00:26:08,989 + 그림은 왜 두 번째 층까지이 신경이보고 될 수 있도록 + +346 +00:26:08,989 --> 00:26:13,619 + 일부 특정이 픽셀의 중간에서 전체 볼륨 + +347 +00:26:13,618 --> 00:26:18,138 + 중간 우리가 그렇게 할 때를 입력 세 지역에 따라이 세 가지에서 찾고 + +348 +00:26:18,138 --> 00:26:22,738 + 이보다 다음의 세 가지를 모두 볼 때 모두에서 평균 + +349 +00:26:22,739 --> 00:26:26,200 + 두 번째 또는 세 번째 층에있는이 신경을 은신처는 사실이보고있다 + +350 +00:26:26,200 --> 00:26:32,669 + 우리가 있던 경우에 입력 다섯 부피 다섯 전체 괜찮 이제 질문은 + +351 +00:26:32,669 --> 00:26:36,820 + 에서 지역의 얼마나 큰 행에 쌓여 세 개의 회선으로 삼피트 + +352 +00:26:36,819 --> 00:26:43,700 + 입력 이유로 같은 종류의 그가 수용 필드가 있다고 그래서 그들은 나중에 무엇을보고 + +353 +00:26:43,700 --> 00:26:49,739 + 여기에 포인트를 만들 수 있도록 단지 종류의 연속 공헌 구축 + +354 +00:26:49,739 --> 00:26:53,940 + 당신이 실제로 매우 줄 33 세에 의해 회선을 알고있다 + +355 +00:26:53,940 --> 00:26:57,919 + 비슷한 표현 능력은 하나의 일곱 일곱하여 내 주장이다 + +356 +00:26:57,919 --> 00:27:02,619 + 컨볼 루션은이의 정확한 의미에 대한 토론을 할 수 있도록 당신은 할 수 + +357 +00:27:02,618 --> 00:27:05,528 + 하지만 단지에서 같은 그것에 대해 정리하고 물건을 증명하려고 + +358 +00:27:05,528 --> 00:27:09,940 + 직관적 인 감각 그들은 333 회선 비슷한 유형을 나타낼 수 있습니다 + +359 +00:27:09,940 --> 00:27:14,100 + 그것은에서 찾고 있기 때문에 일곱 기여하여 유사한 칠 등의 기능 + +360 +00:27:14,099 --> 00:27:22,189 + 입력에 동일한 입력 영역 그래서 지금 생각은 지금 실제로 우리는 더 팔 수있다 + +361 +00:27:22,190 --> 00:27:27,399 + 이 아이디어로 우리는 하나 797 사이에보다 구체적으로 비교할 수 있습니다 + +362 +00:27:27,398 --> 00:27:32,618 + 그래서이 가정하자 세 공헌 33의 스택 대 컨볼 루션 + +363 +00:27:32,618 --> 00:27:38,638 + 우리는 바다로 HIW의 입력 이미지를 가지고 우리는 회선을하도록 + +364 +00:27:38,638 --> 00:27:43,329 + 우리가 필터를 볼 그래서 그 깊이를 보존하고 우리는 그들을 갖고 싶어 + +365 +00:27:43,329 --> 00:27:48,019 + 그래서 우리가 적절하고 우리가 원하는 두드리며 말했다으로 높게 보존 식품 + +366 +00:27:48,019 --> 00:27:51,528 + 구체적으로 비교 한 일곱의 차이에 의한 것입니다하기 + +367 +00:27:51,528 --> 00:27:56,648 + 이러한 각각의 세 가지 그래서 처음 몇 주에 의해 세의 스택 대 칠 + +368 +00:27:56,648 --> 00:28:01,748 + 두 가지 사람이 얼마나 많은 무게 단일 일곱 일곱으로의 가스가있다 + +369 +00:28:01,749 --> 00:28:09,519 + 컨볼 루션 집과는 편견에 대해 잊을 수는 혼동 + +370 +00:28:09,519 --> 00:28:19,869 + 나는 약간의 여름을 들었지만 나의 내 대답 나는 바로 그것을 가지고 희망을 들어 + +371 +00:28:19,869 --> 00:28:24,319 + 49 C는 각자가 찾고있는 일곱 일곱하여 회선을 가지고로 제곱했다 + +372 +00:28:24,319 --> 00:28:29,809 + 볼의 깊이에서 당신은 지금 49 C 제곱 있도록 이러한 필터를 보게되었다하지만 + +373 +00:28:29,809 --> 00:28:34,649 + 세 개의 회선 세에 의해 우리는 회선의 세 층 각 하나가 + +374 +00:28:34,650 --> 00:28:38,990 + 각 필터는 스티브으로 세 가지로 세 가지이며, 각 플레이어는 때 필터를 볼 수 있습니다 + +375 +00:28:38,990 --> 00:28:43,980 + 
모든 아웃 우리가 무료로 회선 33 만 제곱 (27) C가 볼 것을 곱 + +376 +00:28:43,980 --> 00:28:49,079 + 매개 변수와 우리는 이러한 각각의 사이에 후 레이 루이스가 있다고 가정 + +377 +00:28:49,079 --> 00:28:54,049 + 기여는 우리는 세 개의 회선으로 최대 33 스택이 실제로 있는지 참조 + +378 +00:28:54,049 --> 00:28:58,649 + 이런 종류의 좋은 좋은 더 비선형이다 적은 수의 매개 변수 + +379 +00:28:58,650 --> 00:29:02,960 + 의 여러 세에 의해 당신에게 세 가지의 이유 스택에 대한 몇 가지 직관을 제공합니다 + +380 +00:29:02,960 --> 00:29:06,440 + 세 개의 회선은 실제로 하나의 일곱 칠로하는 것이 바람직 할 수있다 + +381 +00:29:06,440 --> 00:29:11,559 + 경쟁은 우리가 실제로 더이 한 단계 걸릴에 대해 생각 할 수 있습니다 + +382 +00:29:11,559 --> 00:29:14,750 + 하지 일반 매개 변수 아래 단지 수 있지만, 실제로 꿀 부동 + +383 +00:29:14,750 --> 00:29:19,099 + 이 일에 소수점 연산 그래서 사람이 얼마나 많은 가스가 걸릴 + +384 +00:29:19,099 --> 00:29:29,669 + 이러한 일들이 지금 수행 할 작업이이 실제로 하드 쓰기 소리 + +385 +00:29:29,670 --> 00:29:33,740 + 때문에 이러한 필터 각각에 대해 매우 쉽다는 거의 모든 IT를 사용했다 + +386 +00:29:33,740 --> 00:29:37,819 + 이미지의 단부에 위치하므로 실제 곱셈의 광고의 수는 + +387 +00:29:37,819 --> 00:29:42,099 + 단지거야 시간이 Heights의 배 가연성 필터의 수를 당신 때문에 + +388 +00:29:42,099 --> 00:29:47,789 + 실제로 여기에 그것을 볼 수 있습니다 다시뿐만 아니라 우리 사이에 있나요 + +389 +00:29:47,789 --> 00:29:52,440 + 이 두 가지 사이에 일곱으로 칠 작업을 비교하는 것은뿐만 아니라 더 많은 학습 가능이 + +390 +00:29:52,440 --> 00:29:57,460 + 매개 변수하지만 실제로 잘 스택 있도록 더 많은 컴퓨터에 비용 + +391 +00:29:57,460 --> 00:30:03,140 + 자주 암시로 (33)는 다시 적은 컴퓨팅 우리에게 더 비선형 성을 제공하므로 + +392 +00:30:03,140 --> 00:30:06,170 + 그 좀 당신에게 왜 실제로 여러 층을 갖는 몇 가지 직관을 제공합니다 + +393 +00:30:06,170 --> 00:30:12,300 + 세 베이 세 회선하지만 다음 큰 필터 실제로 바람직하다 + +394 +00:30:12,299 --> 00:30:15,750 + 우리가 작은쪽으로 밀어 봤는데 알고 또 다른 질문을 생각할 수 있으며, + +395 +00:30:15,750 --> 00:30:20,109 + 작은 필터하지만 왜 바로 우리가 실제로 작은 갈 수있는 세 가지에 의해 세에서 정지 + +396 +00:30:20,109 --> 00:30:21,859 + 그건있을 수 있습니다보다 같은 논리 확장 것 + +397 +00:30:21,859 --> 00:30:27,798 + 머리를 흔들어 당신은 당신이 얻을하지 않습니다 그것은 사실 그 사실을 믿지 않는다 + +398 +00:30:27,798 --> 00:30:33,539 + 우리가 여기서 할거야 그래서 실제로 무엇을 수용 필드는 단일 비교된다 + +399 +00:30:33,539 --> 00:30:39,019 + 약간 애호가 아키텍처 병목 아키텍처 대 33 회선 + +400 +00:30:39,019 --> 00:30:45,150 + 그래서 여기에 우리는 내가 HW의 입력이보고 우리가 실제로 할 수있는들을 수있는 거 가정하고 + +401 +00:30:45,150 --> 00:30:50,070 + 이것은 우리가까지 볼 수있는 단일 하나씩 회선을 멋진 트릭 + +402 +00:30:50,069 --> 00:30:54,609 + 필터는 실제로 지금이 볼륨의 차원을 줄이기 위해 + +403 +00:30:54,609 --> 00:30:57,990 + 것은 동일한 공간 범위이지만 기능 절반이 예정 + +404 +00:30:57,990 --> 00:31:03,480 + 심층 지금 우리는거야이 병목을 수행 한 후 3 × 3 할 + +405 +00:31:03,480 --> 00:31:08,929 + 이 감소 차원에서 컨볼 루션 지금이이 세 가지에 의해 + +406 +00:31:08,929 --> 00:31:13,610 + 세 개의 회선 입력 기능을 통해 받아 출력에 이상 발생 + +407 +00:31:13,609 --> 00:31:18,000 + 기능과 이제 우리는 하나 다른 하나를 사용하여 차원을 복원 + +408 +00:31:18,000 --> 00:31:23,558 + 이 펑키 한 가지 종류의 볼 백업을 통해 볼에서 회선은 이동 + +409 +00:31:23,558 --> 00:31:27,910 + 아키텍처는 하나씩 회선을 사용하는이 아이디어는 어디 에나있다 + +410 +00:31:27,910 --> 00:31:31,669 + 그것은이 직관을 가지고 있기 때문에 때로는 네트워크와 네트워크라는 것을 + +411 +00:31:31,669 --> 00:31:35,730 + 당신은 하나씩 회선이 완전히 연결 슬라이딩에 좀 유사하다있어 + +412 +00:31:35,730 --> 00:31:42,480 + 당신의 입력 볼륨의 각 부분에 걸쳐 네트워크와이 아이디어도 나타납니다 + +413 +00:31:42,480 --> 00:31:46,259 + 구글 매트와이 하나씩 병목 현상을 사용 ResNet이 생각 + +414 +00:31:46,259 --> 00:31:52,679 + 기여 그래서 우리는 하나이이 병목 샌드위치를​​ 비교할 수 있습니다 + +415 +00:31:52,679 --> 00:31:56,390 + C 필터와 세 개의 회선에 의해 세와 같은 논리를 통해 실행 + +416 +00:31:56,390 --> 00:32:01,270 + 그래서 나는 당신의 머리에있는 컴퓨터에 강제하지 않습니다하지만 당신은 할 수 있습니다 + +417 +00:32:01,269 --> 00:32:02,720 + 이 날 믿어하기 + +418 +00:32:02,720 --> 00:32:08,700 + 이 병목 스택은 세 가지를 가지고 어디 분기 C는 매개 변수를 제곱 것을 + +419 +00:32:08,700 --> 00:32:12,360 + 여기이 사람은 우리가 고집하는 경우 구 C가 다시 매개 변수를 제곱있다 + +420 +00:32:12,359 --> 00:32:15,879 + 이 병목 이상이 기여 각각의 사이에 집회 + +421 +00:32:15,880 --> 00:32:20,620 + 샌드위치는 적은 수의 우리에게 더 많은 비선형 성을주고있다 + +422 +00:32:20,619 
--> 00:32:28,899 + 우리는 실제로 같은 우리 유사한 매개 변수 및 대 세에 의해 세에서 본 + +423 +00:32:28,900 --> 00:32:33,200 + 일곱으로 칠 매개 변수의 수는 그렇게 계산에 직접 연결되어 + +424 +00:32:33,200 --> 00:32:35,389 + 이 병목 샌드위치도 + +425 +00:32:35,388 --> 00:32:39,788 + 훨씬 빠른 하나씩 병목 현상이 아이디어에 따라서이를 계산한다 + +426 +00:32:39,788 --> 00:32:52,669 + 구글 매트에서 최근에 사용 꽤 많이 받고 특히 그래 그렇게 + +427 +00:32:52,669 --> 00:32:56,579 + 당신은 때때로 당신이에서 투사로로 생각 당신은 그것의 생각 + +428 +00:32:56,578 --> 00:33:00,308 + 다시 더 높은 차원 공간에 다음의 경우 낮은 차원 기능 등 + +429 +00:33:00,308 --> 00:33:03,868 + 당신은 어떻게 서로의 상단에 이런 일을 많이 쌓아 생각 + +430 +00:33:03,868 --> 00:33:09,499 + 당신이 한 직후오고 것보다보다 주민 + +431 +00:33:09,499 --> 00:33:11,088 + 하나 또 하나가 될 것 + +432 +00:33:11,088 --> 00:33:14,858 + 당신은 가지 위에 많은 많은 한 사람들 하나 하나 회선 붙어있어 + +433 +00:33:14,858 --> 00:33:18,918 + 서로 하나씩 회선 조금 슬라이딩 같이하는 완전히 + +434 +00:33:18,919 --> 00:33:23,409 + 각각의 이중 채널을 통해 다층 완전히 연결 네트워크는 아마 생각 생각하는 + +435 +00:33:23,409 --> 00:33:27,229 + 그것에 대해 때 조금하지만 실제로 당신이 정말 안 밝혀 + +436 +00:33:27,229 --> 00:33:31,200 + 공간적 범위를 필요로하고, 심지어 단 하나의 세 가지로 샌드위치를​​ 비교 + +437 +00:33:31,200 --> 00:33:35,769 + 세 Khans하여 정렬의 동일한 입력 출력 볼륨 크기를 갖는하고 있지만, + +438 +00:33:35,769 --> 00:33:41,429 + 더 비선형 그들이있어 그래서 저렴 계산과 동물 매개 변수는 무엇입니까 + +439 +00:33:41,429 --> 00:33:46,089 + 이 것입니다 모든 좋은 기능의 종류 만이 거기에 하나의 문제입니다 + +440 +00:33:46,088 --> 00:33:49,668 + 즉, 우리는 여전히 어딘가에서 3 × 3 회선을 사용하고 그리고 + +441 +00:33:49,669 --> 00:33:54,709 + 우리는 우리가 정말이 필요하면 경우에 당신이 궁금해 할 수 대답은 그것이 나오는 것에 없음입니다 + +442 +00:33:54,709 --> 00:33:59,808 + 내가 최근에 본 적이 한 미친 것은 당신이 당신이 인수 분해 할 수 있다는 것입니다 + +443 +00:33:59,808 --> 00:34:05,608 + 하나 2003 년에 세 개의 회선으로 거리와 세에 의해 원에 비해 + +444 +00:34:05,608 --> 00:34:09,469 + 단일 세에 의해 세 개의 회선이 몇 가지 매개 변수를 저장 끝 + +445 +00:34:09,469 --> 00:34:14,428 + 뿐만 아니라 당신이 정말로 당신으로이 일에 의해 올 수 미쳐 경우 수도 있기 때문에 + +446 +00:34:14,429 --> 00:34:18,019 + 세와 함께이 병목와 하나 세 생각과 사물 단지 + +447 +00:34:18,018 --> 00:34:22,358 + 정말 저렴받을 수 있도록 구글이 가장에서 무슨 짓을했는지 기본적이다 + +448 +00:34:22,358 --> 00:34:27,038 + 인 셉션의 최신 버전은 그렇게 미친 종이의이 종류가를 다시 생각있다 + +449 +00:34:27,039 --> 00:34:30,389 + 그들이이 많이 재생 컴퓨터 비전에 대한 개시 아키텍처 + +450 +00:34:30,389 --> 00:34:34,169 + 이상한 방법으로 회선을 감안하고있는에 관하여 미친 트릭 + +451 +00:34:34,168 --> 00:34:37,138 + 다른에 하나씩 병목 현상을 많이하고 예측 백업 + +452 +00:34:37,139 --> 00:34:40,608 + 당신이 생각 치수 다음 경우 원래 구글과 만나 자신의 + +453 +00:34:40,608 --> 00:34:42,699 + 처음 모듈은 미친이었다 + +454 +00:34:42,699 --> 00:34:46,118 + 이 일이 구글이 지금에 사용하고있는 개시 모듈입니다 것 + +455 +00:34:46,119 --> 00:34:47,329 + 그들의 최신 개시 + +456 +00:34:47,329 --> 00:34:50,739 + 여기에 흥미로운 기능은 다음 하나씩을 가지고 있습니다 + +457 +00:34:50,739 --> 00:34:55,819 + 모든 곳에서 병목 현상이에 대한 이러한 비대칭 필터를 가지고 있는지 확인 + +458 +00:34:55,820 --> 00:35:01,519 + 에이 계산 그래서이 물건은 슈퍼 널리 아직 사용되지 않지만, 그것은 그것의의의 + +459 +00:35:01,519 --> 00:35:05,079 + 거기 그것은 구글 매트는 그것을 언급 멋진 무언가를 정신병이야 + +460 +00:35:05,079 --> 00:35:14,610 + 이렇게 빨리 회선에서 개괄하고는 보통의 것입니다 스택하는 방법 + +461 +00:35:14,610 --> 00:35:18,530 + 대신 큰 필터 크기의 하나의 큰 회선을 갖는의 더 나은 + +462 +00:35:18,530 --> 00:35:22,740 + 그것은 여러 개의 작은 필터로하고 있음을 해체하는 것이 더 나은에도 + +463 +00:35:22,739 --> 00:35:26,339 + 어쩌면이 BGG 같은 사이의 차이를 설명하는 데 도움이 + +464 +00:35:26,340 --> 00:35:30,059 + 적은이 알렉스 그물 같은 많은 많은 3 × 3 필터 + +465 +00:35:30,059 --> 00:35:35,119 + 실제로 내가 생각 꽤 공통되고있다 작은 필터와 다른 것은 + +466 +00:35:35,119 --> 00:35:38,829 + 당신을 네킹 한 병 하나의이 아이디어는 볼 구글의 두 버전에서 + +467 +00:35:38,829 --> 00:35:42,579 + 되지도 ResNet에서 그것은 실제로 당신이 매개 변수를 많이 절약 할 수 있습니다 I + +468 +00:35:42,579 --> 00:35:46,340 + 이 염두에 두어야 할 유용한 트릭과 인수 분해의이 아이디어 생각 + +469 +00:35:46,340 --> 00:35:50,890 + 내가 생각이 비대칭 필터로 회선 어쩌면 너무 광범위하지 않습니다 + +470 +00:35:50,889 --> 00:35:54,629 + 지금 사용하지만, 
더 일반적으로 미래에 사용되는 잘 모르겠어요 될 수 있습니다 + +471 +00:35:54,630 --> 00:36:00,160 + 이러한 모든 트랙에 대한 지배적 인 테마를 통해 기본은 당신을 할 수 있다는 것입니다 + +472 +00:36:00,159 --> 00:36:04,289 + 적은 학습 가능 매개 변수 적은 적은 컴퓨팅 등을 가지고 + +473 +00:36:04,289 --> 00:36:07,739 + 좋은 기능을 모든 종류의 당신의 아키텍처를 데있다 비선형 + +474 +00:36:07,739 --> 00:36:18,779 + 이러한 이러한 이러한 회선 아키텍처 설계에 대한 질문으로 + +475 +00:36:18,780 --> 00:36:21,300 + 그녀가 너무 분명 가져 + +476 +00:36:21,300 --> 00:36:26,340 + 확인 그래서 그 다음 것은 당신이했습니다되면 실제로 당신이 원하는 방법에 대한 결정이다 + +477 +00:36:26,340 --> 00:36:30,760 + 회선의 스택을 묶는 당신은 실제로 그들을 계산하고이하는 + +478 +00:36:30,760 --> 00:36:33,630 + 실제로 구현하는 여러 가지 방법에 많은 일이있었습니다 + +479 +00:36:33,630 --> 00:36:37,950 + 기부는 우리가 루프를 사용하여 과제의 구현을 물어 + +480 +00:36:37,949 --> 00:36:43,960 + 당신은 너무 잘 확장되지 않습니다 추​​측 수 있으므로이 너무 예쁜 예쁜 + +481 +00:36:43,960 --> 00:36:47,720 + 구현하기가 매우 쉽다 쉬운 방법은 이름의이 아이디어는 호출하는 것입니다 + +482 +00:36:47,719 --> 00:36:52,269 + 방법은 그래서 여기 직관은 우리가 행렬 곱셈 정말 알고있다 + +483 +00:36:52,269 --> 00:36:56,809 + 빠르고 거기에 누군가가 거의 모든 컴퓨팅 아키텍처 + +484 +00:36:56,809 --> 00:37:00,949 + 정말 정말 잘 최적화 행렬 곱셈 고정 라이브러리를 작성 + +485 +00:37:00,949 --> 00:37:06,230 + 그래서 전화를 그의 생각이 잘 행렬 곱셈이 주어진 악취된다 + +486 +00:37:06,230 --> 00:37:07,400 + 정말 빠른 + +487 +00:37:07,400 --> 00:37:11,420 + 우리는이 컨볼 루션 연산을하고 같은 개주 수있는 몇 가지 방법이있다 + +488 +00:37:11,420 --> 00:37:17,800 + 행렬 곱셈과 이것이 꽤 다소 쉽게가 있음을 밝혀 + +489 +00:37:17,800 --> 00:37:22,930 + 아이디어가 그래서 당신은 우리의 입력 볼륨을 가지고 그것에 대해 생각하면 + +490 +00:37:22,929 --> 00:37:28,549 + 바다 HIW 우리는 컨볼 루션 필터 회선의 필터 뱅크가 + +491 +00:37:28,550 --> 00:37:32,730 + 그것은이 때문에 이들 각각 볼 볼륨으로 사례 별을 될 것입니다 + +492 +00:37:32,730 --> 00:37:36,659 + 사례 별 수용 필드와 적응 두가 일치 일치 참조 + +493 +00:37:36,659 --> 00:37:39,989 + 입력 여기에 우리는거야 것은 이러한 필터 처리해야하고 우리가 원하는 + +494 +00:37:39,989 --> 00:37:44,809 + 아이디어는 점이다 있도록 행렬 곱셈 문제로이 점을 켭니다 + +495 +00:37:44,809 --> 00:37:48,829 + 우리는 우리가 먼저 수용 필드를 취할려고하고 자신의 일을하는거야 + +496 +00:37:48,829 --> 00:37:54,019 + 거 지역에서 CEE 영역에 의해 케이하여이 케이를 할 수있는 이미지의 + +497 +00:37:54,019 --> 00:37:58,130 + 축구에서 최종까지 나는 사건이 컬럼에 그것을 바꿀거야 당신 + +498 +00:37:58,130 --> 00:38:01,910 + 이에 요소를 확인하고 우리는 가능한 모든이를 반복하는거야 + +499 +00:38:01,909 --> 00:38:05,909 + 이미지의 수용 필드 그래서 우리는 내가 갈거야이 작은 사람을거야 + +500 +00:38:05,909 --> 00:38:09,359 + 이미지에 가능한 모든 영역에 걸쳐 그를 이동하고 여기에 그냥 말하는거야 + +501 +00:38:09,360 --> 00:38:12,680 + 될 것이 아마 지역과 다른 수용 필드를 종료하는 것이 + +502 +00:38:12,679 --> 00:38:18,389 + 위치는 지금 우리는 우리의 이미지를 촬영했습니다 우리는이 거대한으로 재편 촬영했습니다 + +503 +00:38:18,389 --> 00:38:25,139 + 매트릭스 오 누구나 볼 볼 수있다 내 말 및 내 경우 어떤 가능성을 + +504 +00:38:25,139 --> 00:38:28,139 + 이 아마와 문제 + +505 +00:38:28,139 --> 00:38:36,829 + 그래, 그게 사실은 그래서 최선이 바로 많은 메모리를 많이 사용하는 경향이 + +506 +00:38:36,829 --> 00:38:41,380 + 이 책의 요소가 나타나는 경우 여러 수용 필드 다음이다 + +507 +00:38:41,380 --> 00:38:45,010 + 가고 그래서 이러한 열 여러 중복되는 및이에 가고 + +508 +00:38:45,010 --> 00:38:49,220 + 당신의 수용 필드하지만 사이가 overlap은 더 더를 얻을 수 + +509 +00:38:49,219 --> 00:38:52,839 + 실제로이 실제로 거래의 너무 큰 아니에요 및 밝혀 그 + +510 +00:38:52,840 --> 00:38:57,910 + 우리는거야이 길쌈에 유사한 검사를 잘 실행하고 작품 + +511 +00:38:57,909 --> 00:39:01,699 + 필터는 그래서 만약 당신이 회선 우리가 먹고 싶어 무엇을하고 있는지 기억 + +512 +00:39:01,699 --> 00:39:06,039 + 이러한 길쌈 무게의 각 각으로 우리의 제품을 + +513 +00:39:06,039 --> 00:39:10,889 + 이미지 때문에 각 수용 필드 위치에 대한 길쌈 무게 + +514 +00:39:10,889 --> 00:39:16,420 + 이 길쌈 무게의 각이 케이에 의해이 케이 것은 좌석이 그렇게 대답 구입 + +515 +00:39:16,420 --> 00:39:21,059 + 우리는 치로로 이제 우리는 D를 경우로 그 각각을 바꿀거야 + +516 +00:39:21,059 --> 00:39:26,420 + 좌석 행렬은 지금이 좋은해진다 필터 그래서 우리는 경우에 의해 계약을 얻었다 + +517 +00:39:26,420 --> 00:39:31,750 + 지금이 가이드는 수용 필드로 각 열의 모든 레셉 각을 포함 우리 + +518 +00:39:31,750 --> 00:39:37,039 + 이미지에 하나의 컬럼 수용 필드가 지금이 행렬은 하나가있다 + +519 
+00:39:37,039 --> 00:39:42,679 + 하나의 각 행은 우리가 쉽게이 모든 계산할 수 해주기 때문에 다른 무게 + +520 +00:39:42,679 --> 00:39:49,069 + 내부의 제품은 한 번에 하나의 행렬 곱 나는 사과 + +521 +00:39:49,070 --> 00:39:52,809 + 이러한 차원의 아마 교체해야 밖으로 작동하지 않는 것은 더 만드는 것입니다 + +522 +00:39:52,809 --> 00:39:59,219 + 분명하지만 난 당신이 아이디어를 얻을 생각 때문에이이 최종 결과에 의해 싶게를 제공하는 + +523 +00:39:59,219 --> 00:40:03,659 + 그 D 출력 필터의 우리의 수를하고, n은 모두 받아 들일입니다 + +524 +00:40:03,659 --> 00:40:07,469 + 이미지 필드의 위치는 다음이 걸릴 비슷한 여행을 재생 + +525 +00:40:07,469 --> 00:40:13,000 + 실제로이 참을 수있는 실내 3D 전채로 모양을 변경 + +526 +00:40:13,000 --> 00:40:16,219 + 아주 쉽게 당신이 이들의 미니 배치가있는 경우 너무 많은 배치 + +527 +00:40:16,219 --> 00:40:24,099 + 요소는 당신은 행의 한 세트 나 다시 요소이 당 더 많은 행과 방법을 추가 + +528 +00:40:24,099 --> 00:40:28,589 + 실제로 그래 그렇게 구현하는 매우 간단합니다 + +529 +00:40:28,590 --> 00:40:35,090 + 그 것을 - 그 다음 구현 오른쪽에 달려 있지만, 다음의 따라 달라집니다 + +530 +00:40:35,090 --> 00:40:39,910 + 그 때로는 같은 메모리 레이아웃 및 물건 같은 것들에 대해 걱정할 필요가 + +531 +00:40:39,909 --> 00:40:45,099 + 당신도 당신이 병렬로 그것을 할 수있는 GPU에서 그 모양 변경 작업을 수행하지만, + +532 +00:40:45,099 --> 00:40:50,089 + 그래서 이것은 정말 쉬운 사례 연구로는 너무 많이 구현하는 등의 경우 경우 경우 + +533 +00:40:50,090 --> 00:40:53,470 + 사용 가능한 컨벌루션 기술이 없어 하나를 구현해야 + +534 +00:40:53,469 --> 00:40:57,869 + 이것은 아마도 선택할 수있는 하나 통과하면 실제 카페를 보면 + +535 +00:40:57,869 --> 00:41:01,119 + 카페의 이전 버전이 그들이 무엇에 사용되는 방법이다 + +536 +00:41:01,119 --> 00:41:07,730 + 기부금이는 GPU 충돌에 대한 회선 앞으로 코드가 있도록 + +537 +00:41:07,730 --> 00:41:12,630 + 당신의 기본 GPU의 컨볼 루션들은으로 전화하는거야이 붉은 덩어리를 볼 수 있습니다 + +538 +00:41:12,630 --> 00:41:18,070 + 전화를 같은 방법이 복용 그들의 입력 영상 권한을 가지고 그들의 + +539 +00:41:18,070 --> 00:41:22,900 + 입력 영상 어딘가에이 그래서 이것은 그들의 의도 된 후 그들은거야 + +540 +00:41:22,900 --> 00:41:27,050 + 이것은이이 방법으로 호출하고이를 저장하는 동일한 전화 재정비 + +541 +00:41:27,050 --> 00:41:33,519 + 그들이 호출 곱 행렬 행렬에 거 가지고있어보다 열 GPU의 tenser + +542 +00:41:33,519 --> 00:41:37,980 + 즉, 그래서 그것은 곱셈 후 바이어스 행렬을 지속 할 수 + +543 +00:41:37,980 --> 00:41:42,840 + 그 그 내가 이러한 일들이 실제로 아주 잘 작동하는 경향이 의미의 방법과 + +544 +00:41:42,840 --> 00:41:45,850 + 당신은 우리가 당신 하나를 준 빠른 레이어를 기억한다면 또 다른 사례 연구가 + +545 +00:41:45,849 --> 00:41:51,500 + 과제 실제로 우리가 실제로 나노 수행 그래서 여기에이 동일한 전략을 사용 + +546 +00:41:51,500 --> 00:41:55,940 + 작업을 호출하는 지금 우리가 실제로 할 수있는 다음 어떤 미친 NumPy와 트릭이었고, + +547 +00:41:55,940 --> 00:42:00,230 + NumPy와 매트릭스에 단일 통화와 FAST 층 내부의 회선 + +548 +00:42:00,230 --> 00:42:03,900 + 곱셈 당신과이 보통 나에게 몇 가지를 제공합니다 숙제에 서명 + +549 +00:42:03,900 --> 00:42:07,740 + 이 꽤 잘 작동 루프를 사용하는 것보다 더 빨리 백 번 + +550 +00:42:07,739 --> 00:42:18,209 + 그리고는 전화 그에 대해 질문을 구현하는 데 아주 쉽게이다 + +551 +00:42:18,210 --> 00:42:24,949 + 그것에 대해 조금 생각하지만, 당신이 생각하는 경우에 당신이 정말 열심히 생각하면 당신은거야 + +552 +00:42:24,949 --> 00:42:28,219 + 컨볼 루션의 뒤로 패스도 실제로 실현 + +553 +00:42:28,219 --> 00:42:33,358 + 당신이 그것에 대해 생각한다면 당신은 몇 가지 알아 낸 수 컨볼 루션 + +554 +00:42:33,358 --> 00:42:37,269 + 당신의 숙제하지만 이전 버전과는 회선도 실제로의 유형입니다 통과 + +555 +00:42:37,269 --> 00:42:41,070 + 상류 구배를 통해 실제로 비슷한을 사용할 수있는 이상 컨볼 루션 + +556 +00:42:41,070 --> 00:42:45,789 + 담배가 아니라 유일한 트릭을 전달하기위한 이미지의 유형은 메소드를 호출합니다 + +557 +00:42:45,789 --> 00:42:51,259 + 당신이 뒤로 패스 할 후에는 일부 그라디언트에 필요 한 것입니다 + +558 +00:42:51,260 --> 00:42:54,940 + 상류에서 수용 필드를 중복 통해 당신은주의해야하므로 + +559 +00:42:54,940 --> 00:43:02,889 + 통화 팀에 대해 당신이 뒤로 패스에서 호출 팀을 소환 필요 + +560 +00:43:02,889 --> 00:43:06,150 + 숙제는 것을 구현에 실제로 빠른 차선에서 확인하실 수 있습니다 것은 + +561 +00:43:06,150 --> 00:43:11,050 + 너무 실제로 비록 더 숙제를 호출 팀에 빠른 레이어 + +562 +00:43:11,050 --> 00:43:18,910 + 실제로 거기에 내가 충분히 빨리 그것을 얻을 수있는 방법을 찾을 수 없습니다에 눈에 + +563 +00:43:18,909 --> 00:43:22,710 + 때로는 사람들이 회선 사용하고는이 아이디어 또 다른 방법 + +564 +00:43:22,710 --> 00:43:27,400 + 당신이 신호 등으로부터 추억이있는 경우 고속 푸리에 그래서 변환 + +565 +00:43:27,400 --> 
00:43:30,700 + 처리 클래스 또는 호출이 일을 기억 수도 같은 + +566 +00:43:30,699 --> 00:43:34,639 + 충족 회선 정리는 두 개의 신호를 가지고 있다면 당신은 당신이 원하는 것을 말한다 + +567 +00:43:34,639 --> 00:43:38,779 + 하나 신중하게 한 후 다른 여자와 계속되어 그들에게 전화 + +568 +00:43:38,780 --> 00:43:44,130 + 이들 두 신호의 콘볼 루션을 복용하면 오히려와 동일 + +569 +00:43:44,130 --> 00:43:47,820 + 회선의 푸리에 변환은 요소 제품과 동일 + +570 +00:43:47,820 --> 00:43:51,859 + 당신은 당신이 밖으로 압축을 푼 가지고 기호를 응시 그래서 만약 푸리에 변환 I + +571 +00:43:51,858 --> 00:43:56,779 + 이 의미가있을 거라 생각하고 또한 경우 다시 신호에서 기억하고 있습니다 + +572 +00:43:56,780 --> 00:44:00,240 + 처리 클래스 또는 알고리즘 클래스 호출이 놀라운 일이있다 + +573 +00:44:00,239 --> 00:44:04,299 + 고속 푸리에 실제로 좋아 푸리에 변환을 계산하기 위해 우리가 할 수 변환 + +574 +00:44:04,300 --> 00:44:08,080 + 역 푸리에 정말 정말 빠른 변환 변환 + +575 +00:44:08,079 --> 00:44:11,679 + 당신은 2D에서 하루에이 버전의 곰 볼 수 있으므로 그들은 모든 것 + +576 +00:44:11,679 --> 00:44:17,129 + 정말 빠른 그래서 우리는 실제로 엄격한 회선을 적용 할 수있는 방법이 너무 + +577 +00:44:17,130 --> 00:44:20,660 + 작동 처음 우리가 고속 푸리에 변환을 사용하여 계산하는거야 것입니다 + +578 +00:44:20,659 --> 00:44:24,899 + 푸리에 푸리에을 계산에도 가중치를 계산하는 변환 + +579 +00:44:24,900 --> 00:44:30,320 + 우리의 활성화지도의 변환 지금 푸리에 공간에서 우리는 단지 요소를 할 + +580 +00:44:30,320 --> 00:44:35,050 + 정말 정말 빠르고 효율적이며 다음 우리가 올 곱셈 + +581 +00:44:35,050 --> 00:44:40,269 + 다시의 패스를 사용하여 상기 역 출력을 변환 할 변환 + +582 +00:44:40,269 --> 00:44:44,420 + 그 요소 제품의이에 우리를 위해 회선을 구현 + +583 +00:44:44,420 --> 00:44:52,550 + 멋진 영리한 방법 좀 시원하고이 실제로 사용하고 몇몇 사람에 직면하고있다 + +584 +00:44:52,550 --> 00:44:55,940 + 페이스 북이 작년에 관한 논문을했다 그리고 그들은 실제로 출시 것을 + +585 +00:44:55,940 --> 00:44:57,650 + GPU 라이브러리는이 작업을 수행하는 + +586 +00:44:57,650 --> 00:45:03,329 + 이 일을 계산하지만, 이러한 푸리에에 대한 슬픈 일이 변환이 + +587 +00:45:03,329 --> 00:45:07,819 + 그들은 실제로 당신에게 정말 다른 방법하지만 통해 정말 큰 속도 향상을 제공 + +588 +00:45:07,820 --> 00:45:11,970 + 당신은이 작은 3 × 3에 최선을 다하고 네 개의 큰 바위와 때 + +589 +00:45:11,969 --> 00:45:15,829 + 푸리에 변환을 계산하는 오버 헤드를 향해 바로 변환 필터 + +590 +00:45:15,829 --> 00:45:20,449 + 입력 화소 공간에서 직접 연산을하는 연산 + +591 +00:45:20,449 --> 00:45:25,579 + 우리가 강의에 앞서 이야기로 작은 기여는 + +592 +00:45:25,579 --> 00:45:30,389 + 그것은을, 그래서 많은 이유에 대해 정말 정말 멋지고 매력과 큰 + +593 +00:45:30,389 --> 00:45:33,489 + 트릭이 너무 잘 영향을 작동하지 않습니다 수치의 조금 + +594 +00:45:33,489 --> 00:45:38,439 + 우리하지만 어떤 이유로 경우 당신은 정말 큰 기여를 계산하고 싶어 + +595 +00:45:38,440 --> 00:45:46,019 + 이 그래 당신이 시도 할 수있는 일입니다 + +596 +00:45:46,019 --> 00:46:02,489 + 너무 물건에 관여하지만 당신이 문제가 아마이다 생각하면 내가 상상 + +597 +00:46:02,489 --> 00:46:04,639 + 문제 + +598 +00:46:04,639 --> 00:46:12,900 + 나중에 지적하는 또 다른 한가지는 그 푸리에에 대한 균형 아웃 한 종류의 + +599 +00:46:12,900 --> 00:46:17,430 + 결론을 변환하는 것은 그들이 지금까지 너무 잘 월쯤 처리하지 않는다는 것입니다 + +600 +00:46:17,429 --> 00:46:21,219 + 정상의 종류에 계산 귀에 거슬리는 회선과 일반 컴퓨터 + +601 +00:46:21,219 --> 00:46:25,409 + 입력 공간 만 우리의 제품에 사람들의 작은 하위 집합을 계산하면 이렇게 + +602 +00:46:25,409 --> 00:46:28,489 + 당신이 회선을 공격 할 때 실제로 계산을 많이 저장 + +603 +00:46:28,489 --> 00:46:32,199 + 직접 입력 공간하지만 방법에 당신은 귀에 거슬리는 구현하는 경향이 + +604 +00:46:32,199 --> 00:46:36,649 + 푸리에의 회선 공간이 당신은 단지 전체를 계산한다 변환 및 + +605 +00:46:36,650 --> 00:46:43,180 + 그 때문에 매우 효율적되지 않는 끝, 그래서 당신은 데이터의 일부를 밖으로 던져 + +606 +00:46:43,179 --> 00:46:47,969 + 정말 너무 넓게 생각도하게되지 않은 또 다른 트릭이있다 + +607 +00:46:47,969 --> 00:46:51,989 + 알려진 아직하지만 난 정말 그렇게 내가 그것에 대해 그렇게 얘기하고 싶었 생각 좋아 + +608 +00:46:51,989 --> 00:46:55,909 + 당신은 스트래튼의 알고리즘이라는 알고리즘 클래스에서 뭔가 기억하고있다 + +609 +00:46:55,909 --> 00:47:00,789 + 바로 당신의 순진 행렬 곱셈을 수행 할 때 종료하는 것이이 아이디어가있다 + +610 +00:47:00,789 --> 00:47:04,869 + 에 의해 종류의 행렬 당신은 카운트 경우 비록 모든 수정 있지만, + +611 +00:47:04,869 --> 00:47:08,630 + 당신이 그것을 할 필요가 추가는거야 어떤을 정도 걸릴 것 + +612 +00:47:08,630 --> 00:47:12,950 + 귀여운 운영 및 스트래튼의 알고리즘은 정말 미친 것처럼 이것이다 우리 + +613 +00:47:12,949 --> 
00:47:16,839 + 이 모든 미친 중간체를 계산하고 어떻게 든 마술에 밖으로 작동 + +614 +00:47:16,840 --> 00:47:22,289 + 순진한 방법보다 점근 적으로 빠른 출력을 계산하고 당신은 알고 + +615 +00:47:22,289 --> 00:47:26,869 + 그 날 행렬 곱셈이 우리가 구현할 수있는 것을 알고 호출 + +616 +00:47:26,869 --> 00:47:31,339 + 행렬 곱셈 같은 회선은 직관적으로 이러한 것을 예상합니다 + +617 +00:47:31,340 --> 00:47:35,110 + 트릭 비슷한 유형의 이론적 아마에 적용 할 수 있습니다 + +618 +00:47:35,110 --> 00:47:41,320 + 컨볼 루션 그것은 그들이 그렇게이 정말 멋진 용지가있을 수 있습니다 밝혀 그 단지 + +619 +00:47:41,320 --> 00:47:46,370 + 이 두 사람이 아주 명시 적으로 밖으로 일 여름 동안 나왔다 + +620 +00:47:46,369 --> 00:47:50,670 + 뭔가 아주 특별한 경우를 자주 암시 43 그것은이 포함됩니다 + +621 +00:47:50,670 --> 00:47:54,659 + 분명 여기 세부 사항으로 이동하지 않을거야하지만 비슷한 맛이다 + +622 +00:47:54,659 --> 00:47:58,539 + 스트레스와 중간 아주 영리한 계산에 + +623 +00:47:58,539 --> 00:48:03,630 + 헨리 실제로 계산과 이들에 많이 저장을 결합 + +624 +00:48:03,630 --> 00:48:08,220 + 사람은 실제로 정말 강렬하고 그들은 단지 수학자하지 않은 그들 + +625 +00:48:08,219 --> 00:48:11,959 + 실제로 또한 매우 높은이를 계산하기위한 CUDA 커널을 최적화 쓴 + +626 +00:48:11,960 --> 00:48:17,570 + 두 배 BGG을 단축 할 수 있었다 물건 그래서 정말 정말 + +627 +00:48:17,570 --> 00:48:21,890 + 인상적인 그래서 나는이 이러한 유형의 트럭이 유형이 될 수 있다는 생각 + +628 +00:48:21,889 --> 00:48:26,019 + 인 미래하지만 시간이 꽤 인기가 나는 그것이 매우 광범위하지 생각 + +629 +00:48:26,019 --> 00:48:30,650 + 사용하지만이 숫자는 특히 그들이있어 작은 배치 크기에 미친 + +630 +00:48:30,650 --> 00:48:35,010 + 그게 정말 인상적이고 I의의 BGG에 여섯 속도를 점점 + +631 +00:48:35,010 --> 00:48:38,770 + 단점은 당신이 좀 일을해야한다는 것입니다 그것은 정말 멋진 방법이라고 생각 + +632 +00:48:38,769 --> 00:48:43,009 + 이러한 명시 적으로 특별한 경우 외부 회선 각각 다른 크기 그러나 아마 + +633 +00:48:43,010 --> 00:48:45,850 + 우리는 3 × 3 회선에 대해 신경 경우 그는 큰 문제가 아니다 + +634 +00:48:45,849 --> 00:48:54,719 + 그래서 실제로 회선 컴퓨팅 정리해은 그 진짜의 종류 + +635 +00:48:54,719 --> 00:48:58,579 + 이러한 것들을 구현하는 빠르고, 쉽고 신속하고 더러운 방법은 전화에서입니다 + +636 +00:48:58,579 --> 00:49:02,869 + 행렬 곱셈이 전달됩니다 그것을 구현하는 것이 너무 어렵지 않다 않습니다 + +637 +00:49:02,869 --> 00:49:06,609 + 이 것을 어떤 이유로 당신이 정말로 대회를 구현해야하는 경우 이렇게 + +638 +00:49:06,610 --> 00:49:11,400 + 자신을 정말 통화 활동에 권 해드립니다은 오는 뭔가 + +639 +00:49:11,400 --> 00:49:15,230 + 당신이 생각하는 신호 처리 정말 시원하고 정말 유용하지만 것 + +640 +00:49:15,230 --> 00:49:19,719 + 그것은하지 그래서는하지만 큰 필터의 속도 업을 주는가 있다고 밝혀 + +641 +00:49:19,719 --> 00:49:24,000 + 당신이 희망하지만 한 수만큼이 빠른 때문에 희망 유용있다 + +642 +00:49:24,000 --> 00:49:25,440 + 알고리즘은 정말 좋아 + +643 +00:49:25,440 --> 00:49:29,650 + 이미 세계 어딘가에 코드가 존재하고 필터는 그렇게 할 일 + +644 +00:49:29,650 --> 00:49:35,889 + 희망이 이러한 것들에 잡아 더 널리 그래서 만약 사용 될 것입니다 + +645 +00:49:35,889 --> 00:49:41,529 + 계산 회선에 대해 질문이있다 + +646 +00:49:41,530 --> 00:49:50,940 + 확인을 우리는 첫 번째 질문 그래서 일부 구현 세부 사항에 대한 거 이야기를 그렇게 옆에있어 + +647 +00:49:50,940 --> 00:49:55,710 + 어떻게 너희들은 지금까지 자신의 컴퓨터를 구축 할 + +648 +00:49:55,710 --> 00:50:01,710 + 확인 그래서 너희들은 너무 사람이 할 수있는이 다음 슬라이드에이 답변 방지 할 수있다 + +649 +00:50:01,710 --> 00:50:07,869 + 아웃 지점의 CPU 사람을 발견 + +650 +00:50:07,869 --> 00:50:17,210 + CPU는이 작은 사람이 바로 그래서 실제로는이 일이 사실이다 + +651 +00:50:17,210 --> 00:50:22,179 + 는 CPU 자체의 내부에 약간의 작은 부분이 그래서 그것의 많은 쿨러입니다 + +652 +00:50:22,179 --> 00:50:28,730 + 여기 많은 다음 스폿에게 GPU 냉각 방열판 실제로 + +653 +00:50:28,730 --> 00:50:38,320 + 네, 그것은에 등이 GPU는 지포스를 말하는 것은 그것의 한 가지입니다입니다 + +654 +00:50:38,320 --> 00:50:43,180 + 그것은 훨씬 더 큰 그리고 당신은 그래서 할 수 있습니다 있도록 CPU는 더 강력한 I하다 + +655 +00:50:43,179 --> 00:50:48,679 + 알고 있지만 그 종류의 그, 그래서 적어도이 경우에 더 많은 공간을 복용 + +656 +00:50:48,679 --> 00:50:54,309 + 흥미로운 일이 너무 일어나고 있다는 표시로 나는 또 다른 질문을하고 있는데 + +657 +00:50:54,309 --> 00:50:57,029 + 당신 돼 플레이 비디오 게임 + +658 +00:50:57,030 --> 00:51:05,390 + 확인 후, 당신은 아마 그래서 사람들이 많이 밝혀 이에 대한 의견 + +659 +00:51:05,389 --> 00:51:09,809 + 기계 학습과 깊은 학습도 정말 강한 의견을 가지고 대부분의 + +660 +00:51:09,809 --> 00:51:15,639 + 사람들은 그래서 엔비디아 실제로는 훨씬 더 널리 후 사용 측면에 있습니다 + 
+661 +00:51:15,639 --> 00:51:21,179 + AMD는 당신이 GPU는 미국을 사용하는 이유는 것입니다 + +662 +00:51:21,179 --> 00:51:25,599 + NVIDIA는 정말 정말 깊이로 다이빙 지난 몇 년에 많은 일을하고있다 + +663 +00:51:25,599 --> 00:51:30,710 + 정도의 멋진 예를 들어 학습과 자신의 초점이 정말 핵심 부분 확인 + +664 +00:51:30,710 --> 00:51:34,769 + 입니다 GTC에서 작년 + +665 +00:51:34,769 --> 00:51:39,869 + 발표 새로운 제품에 대한 연간 큰 거대한 회의의 비디오 정렬 + +666 +00:51:39,869 --> 00:51:44,230 + 비디오에서의 CEO 실제로 또한 스탠포드 경보입니다 젠슨 홍콩 + +667 +00:51:44,230 --> 00:51:49,059 + 이 최신 가장 놀라운 새로운 GPU의 인두세 행위 등 소개 + +668 +00:51:49,059 --> 00:51:53,400 + 자신의 주력 것은 그가 그것을 판매하는 데 사용되는 벤치 마크는 얼마나 빨리 + +669 +00:51:53,400 --> 00:51:56,800 + 국가와 알렉스는 그렇게 만난이 미쳤다 + +670 +00:51:56,800 --> 00:52:00,140 + 이 같은 수백 수백명의 사람들과 함께 거대한 방이었다 + +671 +00:52:00,139 --> 00:52:04,279 + 이 거대한 높은 광택 프리젠 테이션 및 CEO 등 언론인과 + +672 +00:52:04,280 --> 00:52:07,890 + 비디오에서 알렉스 그물 및 회선에 대해 얘기하고, 나는 그것이라고 생각했다 + +673 +00:52:07,889 --> 00:52:11,690 + 정말 흥분하고 가지 방법을 보여줍니다 엔비디아는 정말 많은 약을 걱정하는 것이 + +674 +00:52:11,690 --> 00:52:15,300 + 이 일을 얻는 것은 일을하고 그들에 노력을 많이 밀어하기 + +675 +00:52:15,300 --> 00:52:22,150 + 그렇게 그냥 작동 제작에 들어가는 것은 아마 같은 생각을 당신에게 CPU를 제공합니다 + +676 +00:52:22,150 --> 00:52:26,900 + 빠른 순차 처리에 정말 좋은 알고 그들은 작은을 갖는 경향이 + +677 +00:52:26,900 --> 00:52:31,019 + 코어의 수는 노트북은 아마 어쩌면 사이에 하나 사처럼이 + +678 +00:52:31,019 --> 00:52:36,920 + 서버의 모서리와 큰 일 최대 4분의 16과 이런 일이있을 수 있습니다 + +679 +00:52:36,920 --> 00:52:39,610 + 컴퓨팅 물건 정말 정말 빨리에게 정말 좋은 + +680 +00:52:39,610 --> 00:52:45,349 + 그리고 순서 GPU 반면에는 많은 많은 많은 과정을 갖는 경향 + +681 +00:52:45,349 --> 00:52:49,759 + 세금 등 큰 사람은 분기 수천까지 가질 수 있지만 각 경향 + +682 +00:52:49,760 --> 00:52:53,500 + 코어는 지난 2010년 5월 10일 낮은 클럭 속도를하고 당 더 적은을 할 수 있습니다 + +683 +00:52:53,500 --> 00:52:59,429 + 이 GPU는 다시 우리가 실제로 처음으로 개발되었다 있도록 명령어 사이클 + +684 +00:52:59,429 --> 00:53:05,230 + 처리 그래픽 그래픽 처리 장치는 그래서 그들은 일에 정말 좋은거야 + +685 +00:53:05,230 --> 00:53:09,699 + 일종의 고도의 마비 작업은 싶어 많은 많은 일을 할 수 있습니다 + +686 +00:53:09,699 --> 00:53:15,460 + 독립적으로 평행하고 원래 컴퓨터를 위해 설계 되었기 때문에 + +687 +00:53:15,460 --> 00:53:19,590 + 그래픽은 그러나 그 이후 그들은 종류의보다 일반적인 컴퓨팅으로 진화했습니다 + +688 +00:53:19,590 --> 00:53:23,100 + 당신이 쓸 수 다른 프레임 워크가 플랫폼 있도록 + +689 +00:53:23,099 --> 00:53:28,929 + 엔비디아에서 우리는이 프레임 워크를 그래서 일반적인 코드는 GPU에서 직접 실행하기 + +690 +00:53:28,929 --> 00:53:33,509 + 그것은 당신이 실제로 직접 실행되는 코드를 작성 시트의 변형을 작성할 수 있습니다 + +691 +00:53:33,510 --> 00:53:37,990 + GPU에서와에 작동 오픈 CL이라는 비슷한 프레임 워크가있다 + +692 +00:53:37,989 --> 00:53:43,569 + 거의 모든 컴퓨팅 플랫폼 어느 정도하지만 개방형 표준이되는 의미 + +693 +00:53:43,570 --> 00:53:48,890 + 좋은 그것은 OpenCL을 사방에 작동하는지 아주 좋다하지만 실제로는 그렇게 열 + +694 +00:53:48,889 --> 00:53:52,559 + 즉, 더 많은 성능과 방법을 조금 더 멋진 도서관 경향이있다 + +695 +00:53:52,559 --> 00:53:57,420 + 지원 적어도 네 깊은 학습 대부분의 사람들이 사용할 수 있도록 할 수 대신하고있는 경우 + +696 +00:53:57,420 --> 00:54:01,309 + 실제로 G PIKO G PIKO를 직접 작성하는 방법을 학습에 관심 + +697 +00:54:01,309 --> 00:54:05,230 + 나는 그것이 꽤 멋지다의 것 정말 멋진 불쾌한 과정이 재미 있어요 + +698 +00:54:05,230 --> 00:54:09,409 + 당신이 코드는 GPU에 일을 실행에 쓸 수 있습니다 할당 모두 있지만, + +699 +00:54:09,409 --> 00:54:12,730 + 당신이 원하는 모든 경우 방법은 기차 너트를 와서 연구와 그와 같은 작업을 수행 + +700 +00:54:12,730 --> 00:54:16,409 + 일의 당신은 일반적으로이 코드를 직접 당신이 중 하나를 작성하지 않아도 결국 + +701 +00:54:16,409 --> 00:54:20,139 + 단지 외부 라이브러리에 의존 + +702 +00:54:20,139 --> 00:54:33,440 + 바로 그렇게 할 수 나는이 너무 귀엽다이 원 이상 높은 수준의 도서관처럼 + +703 +00:54:33,440 --> 00:54:38,599 + 종류의 유리 바로 그래서 한 가지 같은 GPU는에 정말 정말 좋은 것을 + +704 +00:54:38,599 --> 00:54:43,420 + 행렬 곱셈은 그래서 여기 여기 난이 엔비디아의에서 인 벤치 마크 뜻이야 + +705 +00:54:43,420 --> 00:54:49,550 + 웹 사이트는 그래서 조금 편견이다 그러나 이것은 행렬 곱셈을 보이고있다 + +706 +00:54:49,550 --> 00:54:54,789 + 이 꽤 살이 찐 CPU에 매트릭스 눈의 함수로 시간은 12 군단의 사람입니다 + +707 +00:54:54,789 --> 
00:55:00,079 + 그것은 아주 아주 건강한 CPU처럼 서버에 살고있는 것이입니다 + +708 +00:55:00,079 --> 00:55:04,000 + 를 인 40 같은 시험에서 곱셈 같은 날짜 과학 행렬을 실행 + +709 +00:55:04,000 --> 00:55:11,000 + 꽤 살이 찐 GPU 그것은 훨씬 더 빨리 내가​​ 그 더 큰 놀라움 오른쪽 없습니다 의미이고 + +710 +00:55:11,000 --> 00:55:15,119 + 당신이 언급 한 비디오는이되도록 GPU는 또한 정말 꼭 회선입니다 + +711 +00:55:15,119 --> 00:55:19,909 + 오늘 호출 라이브러리는 특별히 최적화 된 낙관주의 CUDA를 발표했다 + +712 +00:55:19,909 --> 00:55:26,139 + 회선에 대한 커널 그래서 내 말은 CPU에 비해​​ 그것은 WAY 빨리이의의 + +713 +00:55:26,139 --> 00:55:30,139 + 실제로 그를 비교하는 것은 함께 캠페인의 기여를 호출 + +714 +00:55:30,139 --> 00:55:34,920 + 승무원 티에 넨 회선 나는이 그래프는 처음부터 실제로 생각 + +715 +00:55:34,920 --> 00:55:41,030 + CNN 버전의 버전은 단지 몇 주 전에 나와서 그러나 이것은 단지입니다 + +716 +00:55:41,030 --> 00:55:44,600 + 실제로 다음 벤치 마크 토목 때문에 CPU 벤치 마크했다 버전 + +717 +00:55:44,599 --> 00:55:49,699 + 더 빨리 그 이후부터 많이있어, 그래서 나 이전 버전에 대해되었습니다 + +718 +00:55:49,699 --> 00:55:54,769 + 여기에 있지만 증거가 맞는 방법은 두 개의 폭발 같은 또는 뭔가 + +719 +00:55:54,769 --> 00:56:00,090 + 이 기능을 제공하고 단지 종류의 것을 볼 수 있도록 DNN은 C 라이브러리입니다 + +720 +00:56:00,090 --> 00:56:05,309 + C 라이브러리와 GPU 멀리 추상 그래서 당신은 일종의에서의 텐서이있는 경우 + +721 +00:56:05,309 --> 00:56:09,429 + 메모리와 방금 한국 라이브러리에 대한 포인터를 전달할 수 있습니다보고는거야 + +722 +00:56:09,429 --> 00:56:13,299 + conf의 작은 아마 비동기 적으로 GPU에서 실행 돌아가서를 반환 + +723 +00:56:13,300 --> 00:56:19,440 + 결과 카페와 토치와 같은 프레임 워크 있도록 모든 이제 Q의 티에 넨을 통합 한 + +724 +00:56:19,440 --> 00:56:23,750 + 물건을 자신의 프레임 워크에 당신은 어떤에서 이러한 효율적인 솔루션을 활용할 수 + +725 +00:56:23,750 --> 00:56:30,340 + 이 프레임 워크는 알고 있지만, 문제는 그 우리는이를 일단 경우에도 + +726 +00:56:30,340 --> 00:56:33,430 + 정말 큰 모델을 훈련 강력한 GPU는 종류 여전히 + +727 +00:56:33,429 --> 00:56:39,409 + VG 정가가에 2 ~ 3 주 같은 훈련을 유명하게되었다 그래서 천천히 + +728 +00:56:39,409 --> 00:56:43,759 + 타이탄 무엇 타이탄 블랙 샌들은 싸지 않다이었고, 그것은 실제로이었다 + +729 +00:56:43,760 --> 00:56:47,280 + ResNet의 추천은 최근이 최대 정말 멋진 바로 거기 + +730 +00:56:47,280 --> 00:56:51,839 + 정말 멋진 블로그 게시물 여기를 설명하고 실제로 ResNet을 재교육 + +731 +00:56:51,838 --> 00:56:56,400 + 백 한 레이어 모델과 또한 동안 훈련을 2 주 정도 걸렸다 + +732 +00:56:56,400 --> 00:57:03,880 + 그 좋지 않다, 그래서 GPU를하고 한 방향으로 그 사람들이 길을 쉬운 방법으로 그 + +733 +00:57:03,880 --> 00:57:08,269 + 당신의 돈을 돌려 걸쳐 분할되어 여러 GPU에 걸쳐 교육을 분할 + +734 +00:57:08,269 --> 00:57:14,230 + 당신이 특히 BGG 같은 사람을 위해 당신이 수있는 GPU를 정상적으로 있도록 소요 + +735 +00:57:14,230 --> 00:57:17,679 + 많은 메모리 그래서 당신은 매우 큰 나 배치 크기와 경쟁 할 수 + +736 +00:57:17,679 --> 00:57:23,649 + 단일 GPU 당신은 당신이 할 거 야 그래서 어떤 이미지의 배치가 될 수있는 6 128 + +737 +00:57:23,650 --> 00:57:24,700 + 그런 일 + +738 +00:57:24,699 --> 00:57:30,338 + 네 개의 동일한 조각으로 어떤 경기보다 각각의 GPU는 순방향 및 역방향을 계산 + +739 +00:57:30,338 --> 00:57:35,190 + 가중치 반면에 계산 pramit 구배에 많은 배치에 대해 통과 + +740 +00:57:35,190 --> 00:57:39,470 + GPU에 대한 후 모든 무게 당신의 일부 내부에 그 무게의 일부 + +741 +00:57:39,469 --> 00:57:44,548 + 스페인어와이 정말 간단한 방법 즉 있도록 업데이 트 모델을 만드는 사람들 + +742 +00:57:44,548 --> 00:57:53,599 + 그래 GPU에서 분포를 구현하는 경향이 + +743 +00:57:53,599 --> 00:57:59,089 + 그들은이 과정을 자동화 할 수 있다고 주장 왜 그래 그래서 그건 및 + +744 +00:57:59,090 --> 00:58:03,039 + 정말 정말 효율적으로 내가 생각하는 정말 흥분되는에게 배포하지만 + +745 +00:58:03,039 --> 00:58:07,820 + 적어도 토치도 자신을 많이 연주되지 않은 데이터 병렬있다 + +746 +00:58:07,820 --> 00:58:11,059 + 당신이 그냥 드롭과는 자동으로 모든 종류의 사용할 수있는 + +747 +00:58:11,059 --> 00:58:14,070 + 병렬 이런 종류의 아주 쉽게 + +748 +00:58:14,070 --> 00:58:18,930 + 멀티 GPU 훈련을위한 약간 더 복잡한 아이디어는 실제로 알렉스에서 온다 + +749 +00:58:18,929 --> 00:58:21,279 + 알렉스되지 명성 + +750 +00:58:21,280 --> 00:58:26,670 + 그 재미있는 제목의 멋진 가지 종류하지만 생각 생각하지만, 생각은 + +751 +00:58:26,670 --> 00:58:31,409 + 우리는 실제로 하위 계층에 등을 데이터 병렬 처리를 수행하도록 + +752 +00:58:31,409 --> 00:58:35,980 + 우리의 이미지를 여러 배치를 취할 것 하위 계층은 두 개의 GPU를 통해 분할 및 + +753 +00:58:35,980 --> 00:58:42,059 + 먹고 GPU 하나는 먼저 첫 번째 
부분에 대해 컨볼 루션을 계산하는 것 + +754 +00:58:42,059 --> 00:58:46,279 + 많은 배치의 일부 단지 발표 그냥 빌려 컨볼 루션 부분이 될 것입니다 + +755 +00:58:46,280 --> 00:58:49,960 + GPU에 걸쳐 균등하게 분포하지만 당신이 일단 완전 연결 + +756 +00:58:49,960 --> 00:58:50,760 + 층 + +757 +00:58:50,760 --> 00:58:54,800 + 그는 당신이 정말 큰 매트릭스 경우 실제로 더 효율적 발견 + +758 +00:58:54,800 --> 00:58:58,810 + 승산은 실제로 서로에 GPS 작업이 더 효율적 + +759 +00:58:58,809 --> 00:59:02,869 + 매우 아니다이 행렬이 멋진 트랙의 종류 곱 계산 + +760 +00:59:02,869 --> 00:59:09,480 + 일반적으로 사용하지만 그것이 구글에서 다른 생각을 언급하는 재미의 생각 + +761 +00:59:09,480 --> 00:59:13,800 + tenser 흐름이되기 전에 그들이이 일이라고했다 전에이다 + +762 +00:59:13,800 --> 00:59:18,380 + 전체 CPU 기반이었다 자신의 이전 시스템이었다 불신 + +763 +00:59:18,380 --> 00:59:22,630 + 몇 슬라이드 전에 벤치 마크에서 당신이 될 줄 상상도 할 수있는 + +764 +00:59:22,630 --> 00:59:26,250 + 정말 느리지 만 실제로 구글 매트의 첫 번째 버전은 모든 훈련을했다 + +765 +00:59:26,250 --> 00:59:30,800 + CPU에 대한 불신 때문에 실제로 그래서 그들은 엄청난 양의 작업을 수행했다 + +766 +00:59:30,800 --> 00:59:35,800 + CPU에 분포는이 멋진 종이 거기에 이런 일이 그래서 여기에 훈련을받을 + +767 +00:59:35,800 --> 00:59:39,530 + 이 설명 몇 년 전 JAP 청소년과 더 많은 세부 사항 만에서 + +768 +00:59:39,530 --> 00:59:43,640 + 당신은 데이터 병렬 처리를 사용하거나 각 시스템이 독립적 인 복사본이 + +769 +00:59:43,639 --> 00:59:48,710 + 데이터의 패치에 모델과 앞으로 컴퓨팅 각 기계 및 이전 버전 + +770 +00:59:48,710 --> 00:59:52,659 + 하지만 지금은 당신이 실제로 저장있어이 매개 변수 서버가 텍스트 + +771 +00:59:52,659 --> 00:59:55,739 + 모델의 매개 변수와 이러한 독립적 인 노동자 만들고있다 + +772 +00:59:55,739 --> 01:00:01,209 + 파라미터 서버와의 통신 모델을 업데이트 할 수 있으며하도록 + +773 +01:00:01,210 --> 01:00:05,740 + 당신이 1을 입력 어디 모델 병렬로이 대조 + +774 +01:00:05,739 --> 01:00:09,879 + 모델 당신은의 다른 부분을 계산하는 다른 다른 노동자 + +775 +01:00:09,880 --> 01:00:14,650 + 그래서 모델과 불신에 그들은 정말 정말 좋은 일을했다 + +776 +01:00:14,650 --> 01:00:18,110 + 이 많은 많은 CPU와 많은 많은 걸쳐 정말 잘 작동하도록 최적화 + +777 +01:00:18,110 --> 01:00:23,170 + 기계는하지만 지금은 희망이 일을해야 암의 흐름이 + +778 +01:00:23,170 --> 01:00:28,639 + 더 자동으로 당신이 이러한 업데이트를하고있어 일단이있다 + +779 +01:00:28,639 --> 01:00:34,949 + 비동기 STD 및 동기 STD 사이의 생각은 그래서 동기 STD입니다 + +780 +01:00:34,949 --> 01:00:39,299 + 순진한 것 같은 것들 중 하나는 당신이 어떤 배치를 예상하면 + +781 +01:00:39,300 --> 01:00:42,880 + 각 노동자는 다수의 근로자에​​ 걸쳐 전후 않습니다 분할 + +782 +01:00:42,880 --> 01:00:46,710 + 당신은 모든 그라디언트를 추가하고 단일 모델을 그라데이션을 계산 + +783 +01:00:46,710 --> 01:00:51,220 + 업데이트이이 정확하게의 정렬 시뮬레이션 할 것 + +784 +01:00:51,219 --> 01:00:55,029 + 그냥 계산하지만 더 큰 기계에 많은 배치하지만 종류의 수 + +785 +01:00:55,030 --> 01:00:59,619 + 당신이 기계를 통해 동기화 할 수 있기 때문에 속도가 느린 이것은 너무 많은 경향이있다 + +786 +01:00:59,619 --> 01:01:03,610 + 단일 노트에 여러 GPU를 작업 만 한 번하고 큰 문제 + +787 +01:01:03,610 --> 01:01:08,430 + 당신은 많은 많은 CPU를 통해 내가 동기있어 그 지역을 분산하고 + +788 +01:01:08,429 --> 01:01:12,569 + 사실은 꽤 비쌀 수 있으므로 대신 적어도 그들은이있다 + +789 +01:01:12,570 --> 01:01:17,500 + 각 모델은 단지 종류의 제조 업데이트 인 비동기 STD의 개념 + +790 +01:01:17,500 --> 01:01:21,599 + 매개 변수의 복사본에 그는 몇 가지 개념이 + +791 +01:01:21,599 --> 01:01:25,480 + 그들은 때로는 주기적으로 동기화 최종 일관성 + +792 +01:01:25,480 --> 01:01:29,530 + 서로 그것의 디버그하지만 정말 복잡하고 어려운 것 같다 + +793 +01:01:29,530 --> 01:01:35,619 + 그렇게 작동하려면 그것은 정말 멋진 그림 중 하나가 꽤 멋진이고있어 + +794 +01:01:35,619 --> 01:01:39,430 + 그래서이 두 숫자는 텐서 흐름 종이 모두이며 하나 + +795 +01:01:39,429 --> 01:01:42,549 + 텐서 흐름의 이미지는 실제로 이러한 유형해야한다는 것이다 + +796 +01:01:42,550 --> 01:01:46,510 + 당신이 일어날 경우 그 사용자에게 훨씬 더 투명 유통 + +797 +01:01:46,510 --> 01:01:51,580 + 의 GPU와 CPU를 큰 클러스터에 액세스하고 이것 저것 tenser 흐름해야 + +798 +01:01:51,579 --> 01:01:54,840 + 자동으로 이러한 종류의 할 수있는 최선의 방법을 알아낼 수 + +799 +01:01:54,840 --> 01:01:58,970 + 데이터 및 모델의 병렬 처리와 결합 분포는 당신을 위해 모든 것을 할 + +800 +01:01:58,969 --> 01:02:03,399 + 즉 그것은 정말 멋진이고, 그래서 나는 그게 정말 흥미로운 부분이라고 생각 + +801 +01:02:03,400 --> 01:02:11,050 + 바보 훈련 약 1000 어떤 질문이 그래 + +802 +01:02:11,050 --> 01:02:16,120 
+ 및 CN TK 나는 아직 그것을 살펴 촬영하지 않은 + +803 +01:02:16,119 --> 01:02:22,130 + 확인 그래서 다음 번에 ​​몇 병목 현상은 당신이 알고 있어야 거기 + +804 +01:02:22,130 --> 01:02:27,500 + 연습은 그래서 일반적으로이 같은이 일을 훈련 할 때처럼 기대 + +805 +01:02:27,500 --> 01:02:30,769 + 분산 물건 좋은 큰하지만 당신은 실제로 단지와 함께 먼 길을 갈 수 있습니다 + +806 +01:02:30,769 --> 01:02:34,840 + 하나의 단일 시스템에 GPU 및 병목 현상이 많이있다 그 + +807 +01:02:34,840 --> 01:02:39,160 + 그런데 하나 얻을 수 것은 GPU와 CPU 사이의 통신이며 + +808 +01:02:39,159 --> 01:02:44,759 + 실제로 그리고 많은 경우, 데이터는 가장 작고, 특히 + +809 +01:02:44,760 --> 01:02:48,000 + 파이프 라인의 고가의 제품은 GPU에 다음의 데이터를 복사하는 + +810 +01:02:48,000 --> 01:02:51,579 + 당신은 GPU에서 일을 일단 당신이 할 수있는 다시 복사 + +811 +01:02:51,579 --> 01:02:55,719 + 계산 정말 정말 빠르고 효율적으로하지만 복사는이다 + +812 +01:02:55,719 --> 01:03:01,089 + 정말 느린 부분은 메모리 복사를 방지 할 수 있는지 확인하려면로 11 아이디어 그래서 + +813 +01:03:01,090 --> 01:03:06,570 + 같은 때때로 네트워크의 각 계층에서 모두 표시 한 것입니다 + +814 +01:03:06,570 --> 01:03:10,460 + CPU에 GPU에서 앞뒤로 복사하고 정말 비효율적이고 속도가 느려질 수 있습니다 + +815 +01:03:10,460 --> 01:03:14,170 + 그래서 이상적으로 모든 것을 아래로 실행하기 위해 앞으로 전체와 후방 패스를 할 + +816 +01:03:14,170 --> 01:03:17,159 + GPU에에 한 번 + +817 +01:03:17,159 --> 01:03:21,139 + 당신이 볼 수있는 곳 가끔 볼 수 있습니다 또 다른 한가지는 접근 방식을 다중 스레드 + +818 +01:03:21,139 --> 01:03:27,849 + 에 하나의 스레드에서 데이터 많은 메모리를 프리 페치 된 CPU 스레드 + +819 +01:03:27,849 --> 01:03:28,690 + 배경 + +820 +01:03:28,690 --> 01:03:34,070 + 아마도 온라인 보강을 임명하고이이 배경 CPU + +821 +01:03:34,070 --> 01:03:37,470 + 전역 종류의를 발송도 가능 나에게 배치를 준비 할 것 + +822 +01:03:37,469 --> 01:03:41,669 + 이상 GPU에 당신은 종류의 데이터와 컴퓨팅이로드를 조정할 수 있습니다 + +823 +01:03:41,670 --> 01:03:44,680 + 전처리 및 배송 메모리 배송 + +824 +01:03:44,679 --> 01:03:48,940 + 많은 배치 데이터를 GPU에 실제로 계산을하고 실제로 + +825 +01:03:48,940 --> 01:03:51,980 + 아주 약간의 구애에 참여할 수 난의 모든 것을 알 수있을 것입니다 + +826 +01:03:51,980 --> 01:03:57,719 + 멀티 스레드 방식으로 나는 특히 카페 있도록 당신에게 좋은 속도 향상을 제공 할 수 있습니다 + +827 +01:03:57,719 --> 01:04:01,059 + 나는 이미 생각 특정 거기에이 프리 페치 날짜를 구현 + +828 +01:04:01,059 --> 01:04:04,199 + 데이터 스토리지 및 기타 프레임 워크의 유형 당신은 롤이 당신의 + +829 +01:04:04,199 --> 01:04:11,839 + 또 다른 문제는 CPU 디스크 모델 Mac 그래서이 이러한 일들이 친절이다 소유 + +830 +01:04:11,840 --> 01:04:17,820 + 느린 그들은 저렴하고있어 그들은 큰이야하지만 그들은 실제로 그렇게 그렇게 가장하지 않습니다 + +831 +01:04:17,820 --> 01:04:22,220 + 이제 이러한 고체 드라이브는 훨씬 더 일반적인 하드 디스크이다 + +832 +01:04:22,219 --> 01:04:25,730 + 그러나 문제는 고체 상태 드라이브는 더 작고 비용이 알고있는 것이다 + +833 +01:04:25,730 --> 01:04:30,590 + 하지만 그들은 실제로 많이 익숙해 빨리 너무 많이 그래서 무슨 일이 정말이야 + +834 +01:04:30,590 --> 01:04:35,710 + 비록 하드 디스크 및 고체 상태 드라이브와 같은 양으로 하나 하나 공통점 + +835 +01:04:35,710 --> 01:04:39,889 + 당신이 그렇게 책상을 많이 떨어져 데이​​터를 순차적으로 읽고있는 때 가장 잘 작동 + +836 +01:04:39,889 --> 01:04:44,108 + 예는 정말 나쁜 것 당신이 바로 그렇게 한 일을하는지 번 + +837 +01:04:44,108 --> 01:04:48,569 + 이제 이러한 이미지를 각각 때문에 JPEG 이미지의 전체 큰 폴더를해야합니다 수 + +838 +01:04:48,570 --> 01:04:52,309 + 그것까지 정말이 될 수 있도록 책상에 다른 부분에 위치 할 + +839 +01:04:52,309 --> 01:04:56,619 + 임의의 사용자가 읽어도 일단 지금은 개별 JPEG 이미지를 읽어 추구하고 + +840 +01:04:56,619 --> 01:05:01,150 + JPEG는 그렇게 무엇을 매우 비효율적 그 픽셀에 압축을 해제해야 + +841 +01:05:01,150 --> 01:05:05,079 + 당신은 연습에 많은 시간이 표시됩니다 당신거야 실제로 처리기 데이터가 + +842 +01:05:05,079 --> 01:05:10,059 + 그것을 압축 해제 단지 전체 데이터를 하나에 앉아 원시 픽셀을 타고 + +843 +01:05:10,059 --> 01:05:15,940 + 그래서 책상에 거대한 연속 파일이 디스크 공간을 많이 걸리지 만 우리가 할 + +844 +01:05:15,940 --> 01:05:22,230 + 그것은 어쨌든 그것은 평온의 좋은 모든 때문에 바로 그래서 이것은 좀 너무하다 + +845 +01:05:22,230 --> 01:05:27,400 + 카페에서 우리가 할 레벨 D 등을 결합하여이 작업을 수행하는 것은 일반적으로 사용되는 하나입니다 + +846 +01:05:27,400 --> 01:05:33,599 + 형식은 또한 나 또한 HTML5를 사용하는 사용하는 우리를 위해 많은 파일하지만 아이디어는 것입니다했습니다 + +847 +01:05:33,599 --> 01:05:39,280 + 당신은 모든 순차적으로 책상에 이미 설정 데이터 싶어 + +848 +01:05:39,280 --> 01:05:43,180 + 픽셀 상원 훈련에 당신은 
당신이 모든 데이터를 저장할 수있는 훈련 할 때 + +849 +01:05:43,179 --> 01:05:46,230 + 메모리 당신은 당신이 그 빠른 속도로 읽을 만들고 싶어 할 때 책상을 읽을 필요 + +850 +01:05:46,230 --> 01:05:50,679 + 프리 페의 영리한 금액과 멀티 스레드 물건을 다시 가능하고 + +851 +01:05:50,679 --> 01:05:54,829 + 당신은 당신이 원 다른 동안 최고 책상 투구 소중히 수도있을 수 있습니다 + +852 +01:05:54,829 --> 01:05:57,460 + 경쟁은 백그라운드에서 일어나는 + +853 +01:05:57,460 --> 01:06:05,019 + GPU가 큰 사람은 그래서 기억해야 할 또 다른 점은 GPU 메모리 병목 현상입니다 + +854 +01:06:05,019 --> 01:06:10,559 + 큰 사람은 많은 메모리를 가지고 있지만 그 정도의 가장 큰 GPU는 그래서 당신은 할 수 있습니다 + +855 +01:06:10,559 --> 01:06:15,539 + 지금 내가 세금 것을 구입하고 키 마흔 메모리 12 기가를 가지고 그건 + +856 +01:06:15,539 --> 01:06:18,139 + 당신이 지금받을거야으로 큰로서 꽤 많은 + +857 +01:06:18,139 --> 01:06:22,679 + NextGen 더 큰해야하지만 실제로는이 제한에 반대 충돌 할 수 있습니다 + +858 +01:06:22,679 --> 01:06:26,989 + 당신은 BG 또는 같은 것을 훈련하고 특히 너무 많은 문제없이 + +859 +01:06:26,989 --> 01:06:31,608 + 당신이 발생하는 경우 재발 네트워크는 매우 매우 매우 매우 긴 시간이 그것의 중지했다 + +860 +01:06:31,608 --> 01:06:34,929 + 실제로 뭔가가있어이 메모리 제한에 반대 충돌하지 너무 열심히 + +861 +01:06:34,929 --> 01:06:35,598 + 유지해야 + +862 +01:06:35,599 --> 01:06:39,130 + 당신이 알고에 대해 당신이이 비행기의 일부를이 일을 훈련하고있을 때 마음 + +863 +01:06:39,130 --> 01:06:43,450 + 이러한 효율적인 회선 실제로 영리하게 만드는 구조 + +864 +01:06:43,449 --> 01:06:47,068 + 당신이 더 큰 더 강력한 모델을 가지고 할 수있는 경우뿐만 아니라이 메모리에 도움 + +865 +01:06:47,068 --> 01:06:52,268 + 와 적은 양의 메모리를 적게 사용하지 않는 당신은 훈련 할 수 있습니다보다 + +866 +01:06:52,268 --> 01:06:58,129 + 일이 더 빠르고 더 큰 일치를 사용하고 모든 것이 좋고, 다만 단지 + +867 +01:06:58,130 --> 01:07:01,588 + 규모 알렉스 기사의 의미는 모델의 많은에 비해 매우 작다 + +868 +01:07:01,588 --> 01:07:05,608 + 이미 소요 256 다시 양쪽 이제 최첨단하지만 알렉스 그물이 있습니다 + +869 +01:07:05,608 --> 01:07:09,469 + 대한 3기가바이트 GB 메모리 당신은 그것의이 더 큰 네트워크가 한 번 있도록 + +870 +01:07:09,469 --> 01:07:15,738 + 실제로 그래서 다른 일이 12 월 12 한계에 부딪하지 너무 열심히 우리 + +871 +01:07:15,739 --> 01:07:20,978 + 나는 많은 코드를 작성하고있을 때 너무 소수점 정밀도를 떠 대해 이야기한다 + +872 +01:07:20,978 --> 01:07:24,788 + 시간의 나는 당신이 이러한 일들이 그냥 실수 알고 상상하기 좋아하고 + +873 +01:07:24,789 --> 01:07:27,960 + 그들은 단지 작동하지만 실제로 그것은 사실이 아니에요 당신은 생각해야 + +874 +01:07:27,960 --> 01:07:32,889 + 부동 소수점의 얼마나 많은 비트 같은 것들 때문에 대부분의 유형을 사용하는 + +875 +01:07:32,889 --> 01:07:37,159 + 당신이 일종의 작성할 수 있습니다 숫자 코드의 유형이 많이 이중으로되어 있습니다 + +876 +01:07:37,159 --> 01:07:43,278 + 또한 작성의 기본이에 의해 정밀 64 비트를 많이 사용하고 더 + +877 +01:07:43,278 --> 01:07:47,449 + 이것은 단지 그래서 일반적으로 깊은 학습에 사용되는 단일 정밀도의이 아이디어는 + +878 +01:07:47,449 --> 01:07:52,710 + 32 내기 때문에 아이디어는 각 번호는 다음 적은 베팅 걸리는 경우 당신이 할 수있는 그 + +879 +01:07:52,710 --> 01:07:56,469 + 그 좋은, 그래서 같은 양의 메모리 내에서 그 숫자를 더 저장하고 + +880 +01:07:56,469 --> 01:08:00,559 + 또한 적은 베팅으로 당신은 더 적은 계산하여이야 그 숫자에서 작동해야 + +881 +01:08:00,559 --> 01:08:05,210 + 또한 우리는 그들이이기 때문에 작은 데이터 유형이 좋은 싶습니다 일반적 있도록 + +882 +01:08:05,210 --> 01:08:11,150 + 빠른 계산하고 불필요한 메모리와 케이스 등의 연구로이 있었다 + +883 +01:08:11,150 --> 01:08:15,489 + 실제로 숙제에 심지어 문제는 그과를 눈치 챘을 수 있도록 + +884 +01:08:15,489 --> 01:08:16,960 + 기본 데이터 타입이있다 + +885 +01:08:16,960 --> 01:08:21,289 + 64 비트 배정 밀도하지만 우리가 당신을 제공하는이 모델의 모든 + +886 +01:08:21,289 --> 01:08:25,789 + 숙제 우리는이 캐스트 또는 32 비트 부동 소수점 숫자를 가지고 있었고, 당신은 할 수 + +887 +01:08:25,789 --> 01:08:28,670 + 실제로 숙제에 돌아가서이 두 당신은거야 사이를 전환 시도 + +888 +01:08:28,670 --> 01:08:32,908 + 32 비트로 전환하는 것은 실제로 당신에게 몇 가지 괜찮은 몇 가지를 제공합니다 볼 + +889 +01:08:32,908 --> 01:08:39,670 + 괜찮은 속도 업 그렇게 나쁜 명백한 문제는 그 32 베팅이 더 나은 경우 + +890 +01:08:39,670 --> 01:08:42,829 + 64 내기 아마 지출보다 우리는 이하를 사용할 수 있습니다 + +891 +01:08:42,829 --> 01:08:52,199 + 그래서이 권리가있다 + +892 +01:08:52,199 --> 01:09:01,010 + 16 베팅은 있지만 32 비트뿐만 아니라, 그래서이 큰 확인을 수행하도록 명령했다 + +893 +01:09:01,010 --> 01:09:05,420 + 부동 소수점 16 비트 부동 소수점에 대한 표준도있다 + +894 +01:09:05,420 --> 01:09:09,699 + 때로는 반 정밀도와 
cunanan 실제로 최신 버전이 할라고 + +895 +01:09:09,699 --> 01:09:17,199 + 멋진 실제로 거기에있는 위치에서 지원 컴퓨팅 것들 + +896 +01:09:17,199 --> 01:09:20,050 + 라는 회사에서 다른 기존의 구현 + +897 +01:09:20,050 --> 01:09:23,850 + 이러한 그래서이 여섯 비트 구현되는이 자신의 바나 + +898 +01:09:23,850 --> 01:09:28,350 + 이 좋은 GET있다이 때문에 지금 거기에 가장 빠른 회선 + +899 +01:09:28,350 --> 01:09:31,850 + 다른 유형의 주석 벤치 마크의 종류가 배고픈 설문 조사 + +900 +01:09:31,850 --> 01:09:35,160 + 회선 및 프레임 워크 및 모든과 거의 모든의 + +901 +01:09:35,159 --> 01:09:38,319 + 우승이 모든 벤치 마크 지금이 16 비트 부동 소수점입니다 + +902 +01:09:38,319 --> 01:09:42,279 + 난 당신이 가질 수 있기 때문에 놀라운 일이 옳지 않다 너바나에서 작업 + +903 +01:09:42,279 --> 01:09:47,479 + 베팅 더 빨리, 그래서 경쟁하지만 지금 아직 사실이 아니다합니다 + +904 +01:09:47,479 --> 01:09:51,479 + 열 여섯 비트를 이용하기위한 카페 또는 토치 같은 것들에 프레임 워크 지원 + +905 +01:09:51,479 --> 01:09:57,299 + 계산하지만 곧 다가오는되어야하지만, 문제는 우리하더라도 + +906 +01:09:57,300 --> 01:10:01,420 + 계산할 수 그것은 꽤 분명이야이야 당신은 16 만 번호가있는 경우 + +907 +01:10:01,420 --> 01:10:05,880 + 당신은 매우 빠른 그들과 경쟁하지만, 일단 (16)이 수도보다 더 얻을 수 있습니다 + +908 +01:10:05,880 --> 01:10:10,380 + 열 여섯의 두 가지이기 때문에 실제로 숫자 정밀도에 대한 걱정 + +909 +01:10:10,380 --> 01:10:13,550 + 다수의 큰되지는 더 이상 실제로 너무 많은 실수 당신입니다 + +910 +01:10:13,550 --> 01:10:20,360 + 심지어 표현 그래서 몇 년 전에했던 것과이 논문이있다 할 수 있습니다 + +911 +01:10:20,359 --> 01:10:25,339 + 일부 실험 낮은 정밀도 부동 소수점 그들은 발견 실제로 단지 + +912 +01:10:25,340 --> 01:10:28,710 + 실험을 사용하여 그들이 실제로는 부동 소수점 고정 사용 + +913 +01:10:28,710 --> 01:10:34,819 + 구현 및 그들이 발견 사실이 매우이와 + +914 +01:10:34,819 --> 01:10:38,659 + 이러한 낮은 정밀 방법 네트워크의 오슬로의 순진 구현의 종류 + +915 +01:10:38,659 --> 01:10:43,689 + 때문에이 낮은 정밀 Americare의 숫자에 아마 수렴 힘든 시간을했다 + +916 +01:10:43,689 --> 01:10:46,710 + 종류의 곱셈의 여러 라운드를 통해 축적 문제와 + +917 +01:10:46,710 --> 01:10:50,989 + 이것 저것 그러나 그들은 간단한 트릭이 실제로 확률이 아이디어 발견 + +918 +01:10:50,989 --> 01:10:54,559 + 자신의 곱셈의 일부는 그렇게 것 때문에 모든 라운딩 자신의 + +919 +01:10:54,560 --> 01:10:55,200 + 매개 변수 + +920 +01:10:55,199 --> 01:10:59,079 + 정품 인증은 16 건에 저장되지만, 그들은 곱셈 그들은을 수행 할 때 + +921 +01:10:59,079 --> 01:11:03,269 + 가입은 약간 높은 정밀도 부동 소수점 값으로 변환 + +922 +01:11:03,270 --> 01:11:07,570 + 그들은 여전히​​ 낮은 위치까지 다시 라운드를 캐스팅하고 실제로 일을 + +923 +01:11:07,569 --> 01:11:11,789 + 그 가장 가까운 숫자로 반올림되지 않은 확률 적 방법으로 반올림 + +924 +01:11:11,789 --> 01:11:16,479 + 그러나 확률 적 방식에 따라 서로 다른 번호를 라운딩 + +925 +01:11:16,479 --> 01:11:17,549 + 당신이 닫습니다 + +926 +01:11:17,550 --> 01:11:21,860 + 더 나은 일을하고 그렇게 연습을하는 경향이 그들은 예를 들어 당신이있을 때 발견 + +927 +01:11:21,859 --> 01:11:26,710 + 여섯 비트 고정 번호는 정수 두 개의 침대가 있었다 이러한 사용 + +928 +01:11:26,710 --> 01:11:31,170 + 에 대한에 대한 부동 소수점 12, 14이 사이 서 + +929 +01:11:31,170 --> 01:11:35,239 + 당신의이 아이디어를 사용할 때 항상 가장 가까운 반올림 것을 소수 부분 + +930 +01:11:35,239 --> 01:11:40,359 + 수 이러한 네트워크 및 분기 만하는 이러한 확률 접지를 사용하는 경우 + +931 +01:11:40,359 --> 01:11:43,599 + 실제로 이러한 네트워크는 아주 잘 수렴 얻을 수있는 기술 + +932 +01:11:43,600 --> 01:11:47,170 + 심지어이 매우 낮은 밀도 부동 소수점 기술 낮은 정밀도 + +933 +01:11:47,170 --> 01:11:52,859 + 부동 소수점 수 있지만 16 개의 비트가됩니다 대단한 문의 할 수 있습니다 + +934 +01:11:52,859 --> 01:11:59,089 + 그러나 우리는 아래로있어 2015 년 다른 용지가보다하는 것이 더 낮은 갈 수 있습니다 + +935 +01:11:59,090 --> 01:12:04,560 + 10 그래서 여기에 우리가 이미 가지고 있던 이전의 논문에서 의미하는 것으로 12 베팅 + +936 +01:12:04,560 --> 01:12:08,039 + 이 직관 어쩌면 당신은 매우 낮은 정밀도를 사용하는 부동 소수점 + +937 +01:12:08,039 --> 01:12:11,359 + 숫자는 실제로 네트워크의 일부 지역에서 더 정밀도를 사용할 필요가 + +938 +01:12:11,359 --> 01:12:15,909 + 네트워크의 다른 부분에서 낮은 정밀이 논문에서 그들은했다 그래서 + +939 +01:12:15,909 --> 01:12:22,149 + (10)의 활성화에 이야기를 사용하여 멀리 얻을 수 10 비트 값을 비트 및 + +940 +01:12:22,149 --> 01:12:27,500 + 12 베팅을 사용하여 컴퓨팅 그라데이션을하고 서서 그들은이 일을 가지고있는 + +941 +01:12:27,500 --> 01:12:34,800 + 꽤 훌륭하지만 사람이 그 한계는 우리가 더 갈 수 있다고 생각 + +942 +01:12:34,800 --> 01:12:36,310 + 예 + +943 
+01:12:36,310 --> 01:12:44,180 + 이 같은에서 실제로 있도록 용지는 지난 주에 실제로 있었다 + +944 +01:12:44,180 --> 01:12:49,200 + 이전의 종이로 제작이는 내가 이것에 대해 놀랐다했다 미친이며, + +945 +01:12:49,199 --> 01:12:53,539 + 개념 네트워크의 모든 활성화 및 가중치 하나만​​을 사용하는 것이 듣는 + +946 +01:12:53,539 --> 01:12:58,819 + 지금은 그렇지 계산하기 위해 꽤 빨리 둘 중 하나 또는 음을 내기 + +947 +01:12:58,819 --> 01:13:02,429 + 심지어 정말 그냥 탐험 왜 같이 할 수 곱셈을해야하고 + +948 +01:13:02,430 --> 01:13:07,240 + 꽤 멋진 것들을 곱하지만 트릭은 앞으로 패스를 그이다 + +949 +01:13:07,239 --> 01:13:11,199 + 이 슈퍼 그래서 기울기 및 정품 인증 모두가 하나 또는 마이너스 하나 + +950 +01:13:11,199 --> 01:13:15,399 + 슈퍼 슈퍼 신속하고 효율적인하지만 지금은 뒤로 패스에 물건 4 패스 + +951 +01:13:15,399 --> 01:13:20,179 + 그들은 실제로 높은 정밀도와 다음이 이상을 사용하여 그라데이션을 계산 + +952 +01:13:20,180 --> 01:13:24,150 + 정밀 그라디언트 실제로 이러한 단일 비트에 대한 업데이트를 확인하는 데 사용됩니다 + +953 +01:13:24,149 --> 01:13:28,059 + 그것은 그래서 매개 변수는 실제로 정말 멋진 종이 그리고 내가 당신을 격려 것입니다 + +954 +01:13:28,060 --> 01:13:33,310 + 그것을 확인하지만 피치는 그 당신이 감당할 수있는 훈련 시간이 될 수있다합니다 + +955 +01:13:33,310 --> 01:13:36,600 + 어쩌면 부동 소수점 정밀도를 사용하지만 시험 시간은 당신이 원하는 마십시오 + +956 +01:13:36,600 --> 01:13:41,250 + 나는 이것이 정말 생각 때문에 네트워크 슈퍼 슈퍼 빠른 모든 이진 될 수 있습니다 + +957 +01:13:41,250 --> 01:13:45,010 + 나는 그것을 용지 뜻 정말 멋진 아이디어는 내가 2 주 전에 나온 + +958 +01:13:45,010 --> 01:13:50,460 + 알고하지만 난 그것을에서 정리 해보 그래서 정말 멋진 일이 생각하지 않습니다 + +959 +01:13:50,460 --> 01:13:52,199 + 구현 세부 사항 + +960 +01:13:52,199 --> 01:13:56,960 + 전체 GPU는 CPU가 때때로 사람들이 사용보다 훨씬 더 빠르게 있다는 것입니다 + +961 +01:13:56,960 --> 01:14:00,739 + 하나의 시스템에서 여러 GPU에 걸쳐 배포 배포 훈련은 예쁜 + +962 +01:14:00,739 --> 01:14:04,840 + 일반 사용자의 구글과 사용 텐서 여러를 통해 배포 한 후 흐르는 경우 + +963 +01:14:04,840 --> 01:14:10,239 + 노드는 어쩌면 더 일반적인 사이의 잠재적 인 병목을 알고 있어야한다 + +964 +01:14:10,239 --> 01:14:15,739 + 책상에서 GPU 사이와 GPU 메모리 사이도 지불 CPU와 GPU + +965 +01:14:15,739 --> 01:14:19,510 + 부동 소수점 정밀도에 대한 관심은 가장 매력적인 일이 될하지 않을 수 있습니다 + +966 +01:14:19,510 --> 01:14:23,409 + 하지만 실제로 나는 연습과 어쩌면 진에 큰 차이를 만드는 생각 + +967 +01:14:23,409 --> 01:14:28,639 + 그래서 그래 그냥 정리해하는 너트 꽤 흥미로운 것 다음 큰 일이 될 것입니다 + +968 +01:14:28,640 --> 01:14:32,690 + 모든 우리는 우리가 속임수로 날짜 확대 술 이야기 오늘 이야기 + +969 +01:14:32,689 --> 01:14:37,449 + 당신이 작은 데이터 세트를 가지고 우리를 overfitting 방지 때 개선 + +970 +01:14:37,449 --> 01:14:40,859 + 도움이 기존 모델에서 초기화하는 방법으로 전송 학습에 대해 이야기 + +971 +01:14:40,859 --> 01:14:44,399 + 훈련과 당신의 도움으로 우리가에 대한 세부 사항을 많이 이야기 + +972 +01:14:44,399 --> 01:14:48,159 + 회선 모두 효율적인 모델을 만들기 위해 그들을 결합하는 방법과 + +973 +01:14:48,159 --> 01:14:52,840 + 나는 그 생각 때문에 우리는 모든 구현 세부 사항에 대해 이야기 + +974 +01:14:52,840 --> 01:14:57,319 + 그것은 우리가 최대한 빨리 인쇄 한 모든 임의의 마지막 분 질문입니다입니다 + +975 +01:14:57,319 --> 01:15:02,840 + 좋아, 그래서 우리가 몇 초 분, 우리 가운데 중간 고사를 마친 것 같아요 + diff --git a/captions/Ko/Lecture12_ko.srt b/captions/Ko/Lecture12_ko.srt new file mode 100644 index 00000000..70b0b27d --- /dev/null +++ b/captions/Ko/Lecture12_ko.srt @@ -0,0 +1,4344 @@ +1 +00:00:00,000 --> 00:00:02,990 + 오늘 우리는이 네 개의 주요 소프트웨어 패키지를 통해 갈거야 그 사람들 + +2 +00:00:02,990 --> 00:00:10,919 + 일반적으로 보통 몇 관리자 가지 이정표로 사용 + +3 +00:00:10,919 --> 00:00:14,798 + 실제로 그렇게 희망을 살펴하려고합니다 남자를 반환 지난 주 않는 한 + +4 +00:00:14,798 --> 00:00:19,089 + 최종 할당은 3 사람들에 이번 주 또한 할당을 기억 + +5 +00:00:19,089 --> 00:00:23,160 + 거야 그래서 수요일에 기인하고 너희들은 아직 + +6 +00:00:23,160 --> 00:00:30,870 + 확인 그건 그 다음 당신은 당신이 잘 늦게 일을해야했습니다 좋은 + +7 +00:00:30,870 --> 00:00:34,230 + 내가 지적해야 다른 또 다른 한가지는 당신이 실제로 있다면 + +8 +00:00:34,229 --> 00:00:37,619 + 내가 당신을 많이 생각 프로젝트에 대한 터미널을 사용할 계획 + +9 +00:00:37,619 --> 00:00:42,049 + 당신은 당신이 떨어져 코드 및 데이터 물건을 백업하고 있는지 확인 + +10 +00:00:42,049 --> 00:00:46,659 + 아버지의 경우는 가끔씩 우리는 어떤 데 문제 했어 + +11 +00:00:46,659 --> 00:00:50,529 + 인스턴스는 무작위로 충돌하고 대부분의 경우 단말기 사람들은왔다 + +12 
+00:00:50,530 --> 00:00:53,989 + 데이터를 다시 얻을 수 있지만 때로는 몇 일이 소요 및 + +13 +00:00:53,988 --> 00:00:57,570 + 이 때문에 사람들이 손실 된 데이터를 실제 사례 몇 가지가있었습니다 + +14 +00:00:57,570 --> 00:01:01,558 + 그냥 터미널에 그 당신이 사용하려는 경우 내가 생각하는 그래서 추락 + +15 +00:01:01,558 --> 00:01:04,569 + 단말기는 당신이 어떤 다른 백업 전략을 가지고 있는지 확인 + +16 +00:01:04,569 --> 00:01:10,250 + 코드와 나는 같은 데이터는 우리가이 불쌍한 대해 얘기 밝혔다 + +17 +00:01:10,250 --> 00:01:16,049 + 일반적으로 깊은 학습 카페 토치 피아노에 사용되는 소프트웨어 패키지와 + +18 +00:01:16,049 --> 00:01:20,269 + 텐서 흐름과 내가 같은 느낌이 처음에 부인의 조금으로 + +19 +00:01:20,269 --> 00:01:24,179 + 개인적으로 나는 주로 내가 아는 그 사람 때문에 카페와 성화와 함께 일했습니다 + +20 +00:01:24,180 --> 00:01:27,710 + I에 대한 가장 당신에게 다른 사람에 대한 좋은 맛을뿐만 아니라 제공하기 위해 최선을 다하겠습니다 + +21 +00:01:27,709 --> 00:01:35,939 + 하지만 단지 첫 번째, 그래서 거기에 그 부인을 던지는 것은 우리가 본 카페입니다 + +22 +00:01:35,939 --> 00:01:39,509 + 정말 카페이었다 버클리에서 본 논문에서 튀어 마지막 강의 + +23 +00:01:39,510 --> 00:01:44,040 + 재 고용 알렉스 NAT와 알렉스하려고하는 것은 다른 것들과 이후 기능 + +24 +00:01:44,040 --> 00:01:47,550 + 다음 캐시는 정말 널리 사용되는 정말 인기로 성장했다 + +25 +00:01:47,549 --> 00:01:53,759 + 카페 버클리에서 그래서 특히 길쌈 신경망을위한 패키지 + +26 +00:01:53,760 --> 00:01:56,859 + 난 당신이 많은 사람들은 더없는이 생각 + +27 +00:01:56,859 --> 00:02:01,989 + 그것은 대부분 C ++로 작성 실제로 카페에 대한 일이 구입되어, + +28 +00:02:01,989 --> 00:02:04,939 + 당신은 매우 유용하다 matlab에 파이썬에서 그물과 이것 저것에 액세스 할 수 있습니다 + +29 +00:02:04,939 --> 00:02:09,969 + 일반 카페에서 정말 널리 사용하고 그냥 경우는 정말 정말 좋은 + +30 +00:02:09,969 --> 00:02:15,289 + 일종의 표준 피드 포워드 컨볼 루션 네트워크를 훈련하고 싶은 + +31 +00:02:15,289 --> 00:02:17,489 + 실제로 카페는 다른 사람보다 약간 다르다 + +32 +00:02:17,490 --> 00:02:21,610 + 이 점에서 다른 프레임 워크는 당신이 실제로 큰 강력한 모델을 훈련하고 + +33 +00:02:21,610 --> 00:02:26,150 + 예제 ResNet 이미지를 그래서 어떤 코드를 직접 작성하지 않고 유지 + +34 +00:02:26,150 --> 00:02:29,760 + 분류 모델을 하나의 이미지 하나 다 작년에 당신이 할 수있는 그 + +35 +00:02:29,759 --> 00:02:33,189 + 실제로 꽤 임의의 코드를 작성하지 않고 카페를 사용하여 공진 훈련 + +36 +00:02:33,189 --> 00:02:37,579 + 놀라운 가장 그러나 당신이 작업하는 가장 중요한 팁 그래서 + +37 +00:02:37,580 --> 00:02:41,860 + 카페 설명서를 항상 최신 상태로 때때로되지 않고하지 않는 것이있다 + +38 +00:02:41,860 --> 00:02:45,980 + 완벽한 그래서 당신은 단지 거기에 다이빙 소스를 읽을 두려워하지 필요 + +39 +00:02:45,979 --> 00:02:52,359 + 그것은 ++ C의 너무 잘하면 당신이 그것을 읽고 이해 만에 할 수 있습니다 자신을 코드 + +40 +00:02:52,360 --> 00:02:56,080 + 그들이 인터페이스가 일반 C ++ 코드를 꽤 잘 구성되어 + +41 +00:02:56,080 --> 00:03:00,270 + 꽤 잘 조직하고 대한 의심이있는 경우 아주 쉽게 그렇게 이해하기 + +42 +00:03:00,270 --> 00:03:04,459 + 일이 카페에서 일하는 당신은 어떻게 당신의 가장 좋은 건에 가서 일어나서을 읽을 그냥 할 + +43 +00:03:04,459 --> 00:03:11,229 + 카페 그래서 소스 코드는 아마 수천 마이크와 함께이 거대한 큰 프로젝트입니다 + +44 +00:03:11,229 --> 00:03:14,369 + 수십 줄의 코드 수천하고 이해하기 무서운 약간의 + +45 +00:03:14,370 --> 00:03:18,730 + 모든 것을 함께 맞는하지만 카페에서 정말 네 개의 주요 클래스를 거기에 어떻게 + +46 +00:03:18,729 --> 00:03:24,310 + 첫 번째 일에 대해 알 필요가있는 얼룩 때문에 모든 군대 저장소를 모양 당신의 + +47 +00:03:24,310 --> 00:03:27,939 + 데이터와 무게와 네트워크에서 활성화 그래서 이러한 + +48 +00:03:27,939 --> 00:03:34,870 + 모양은 그래서 당신의 무게가 차단 한있는 네트워크의 것들 당신의 비율은 + +49 +00:03:34,870 --> 00:03:38,680 + 블롭에 저장되어있는 데이터는 픽셀 값처럼 될 것이다있다 + +50 +00:03:38,680 --> 00:03:43,189 + 블롭에 저장 레이블 당신의 아내 또는 BLOB에 저장된 모든의 + +51 +00:03:43,189 --> 00:03:47,319 + 모양이입니다 귀하의 중간 정품 인증은 모양에 저장됩니다 + +52 +00:03:47,319 --> 00:03:51,069 + 과 차원 텐서는 일종의 당신이 본 것 같은 심판은 허용 + +53 +00:03:51,069 --> 00:03:56,150 + 그들이 가진 내부 실제로 무 차원 tenser 네 사본을 + +54 +00:03:56,150 --> 00:03:57,370 + 데이터 + +55 +00:03:57,370 --> 00:04:02,450 + 실제 원시 데이터하고 저장하는 텐서의 데이터 버전 + +56 +00:04:02,449 --> 00:04:07,449 + 또한 카페가 사용하는 병렬 일을 가지고 있지만 10 평행 죽음을 원 + +57 +00:04:07,449 --> 00:04:12,459 + 저장소는 데이터에 대한 그라디언트 그것은 당신에게 당신에게 두 가지를 제공하고 + +58 +00:04:12,459 --> 00:04:16,280 + 그런 것들의 각각의 CPU와 GPU 버전이 있기 때문에 실제로 사를 + +59 +00:04:16,279 --> 00:04:21,228 + 
그래서 당신은 CPU의 데이터 유형이 있고 GPU 실제로 사 및 치수있다 + +60 +00:04:21,228 --> 00:04:26,159 + 텐트 당신이 알아야 할 그 다음 중요한 클래스 로브 뛰어난이며, + +61 +00:04:26,160 --> 00:04:30,930 + 은신처를 카페와 래리는 것과 유사한에서 함수의 일종이다 사람 + +62 +00:04:30,930 --> 00:04:35,329 + 일부 입력 모양을받는 특징에 글을 입력 바닥을 야유 + +63 +00:04:35,329 --> 00:04:41,269 + 다음 구멍 정지를 유지 출력 모양을 생각한다는 것입니다 LOB를 생성하여 + +64 +00:04:41,269 --> 00:04:45,349 + 은신처가 채워 데이터 (RD)와 바닥 모양에 포인터를 받게되며 + +65 +00:04:45,350 --> 00:04:49,229 + 다음은 상단 모양에 대한 포인터를 받게됩니다 그리고 포트에 겁니다 + +66 +00:04:49,228 --> 00:04:53,759 + 열정적으로 상위의 데이터 요소의 값을 입력 할 것으로 예상 + +67 +00:04:53,759 --> 00:04:58,959 + 층을지나 다시 도로에 블로그 래디언스 검은 담비가 기대 계산합니다 + +68 +00:04:58,959 --> 00:05:03,649 + 기울기와 상부 작업에 대한 포인터를 수신하고 활성화 쏟 + +69 +00:05:03,649 --> 00:05:07,359 + 를 사용하며 그들은 또한 바닥 모양에 대한 포인터까지를 받게됩니다 + +70 +00:05:07,360 --> 00:05:12,650 + 재료 바닥과 블레어 총리는이 추상 꽤 잘 구성되어 + +71 +00:05:12,649 --> 00:05:17,019 + 클래스 당신은 갈 수 있고 내가 여기에 소스 파일에 대한 링크를했다 그 + +72 +00:05:17,019 --> 00:05:21,139 + 그리고 그들의의 다른 유형을 구현하는 몇 가지 클래스가 많이있다 및 + +73 +00:05:21,139 --> 00:05:26,750 + 같은 나는 일반적인 캡이 문제가 모든 전혀 정말 좋은 목록이 없습니다 말했다 + +74 +00:05:26,750 --> 00:05:30,490 + 유형의 은신처 당신은 거의 그냥 코드를보고 어떤 종류를 볼 필요가 + +75 +00:05:30,490 --> 00:05:36,280 + CPP 파일은 자연 있도록 당신이 알아야 할 다음 일은있다 + +76 +00:05:36,279 --> 00:05:40,859 + 그것은 단지 다수의 상속인을 결합하고는 기본적으로 비순환 그래프에 관한 것이다 + +77 +00:05:40,860 --> 00:05:44,598 + 층과하면의 전후 방법을 실행하기위한 책임 + +78 +00:05:44,598 --> 00:05:49,519 + 올바른 순서 층은 그래서 이것은 당신은 아마이 터치하지 않아도된다 + +79 +00:05:49,519 --> 00:05:52,560 + 자신 만이에 가지 좋은 볼의 어느 클래스는 방법의 맛을 얻을 수 + +80 +00:05:52,560 --> 00:05:56,139 + 모든 당신이 알아야 할 마지막 클래스에서 함께 맞는 + +81 +00:05:56,139 --> 00:06:00,720 + 솔버가 솔버 있도록 우리가 숙제 해결사이라는 것을 알고 + +82 +00:06:00,720 --> 00:06:04,710 + 그 정말 재주 넘기를 캡에서 영감을받은 나에 찍어하기위한 것입니다 + +83 +00:06:04,709 --> 00:06:05,288 + 순 + +84 +00:06:05,288 --> 00:06:08,889 + 실제로 업데이트 데이터의 다음 전후 실행할 + +85 +00:06:08,889 --> 00:06:11,319 + 네트워크와 핸들 검사 점과에서 다시 시작의 소유자 + +86 +00:06:11,319 --> 00:06:15,520 + 체크 포인트 및 물건의 카페 해결사의 모든 종류의이 추상입니다 + +87 +00:06:15,519 --> 00:06:20,278 + 클래스와 다른 갱신 규칙은 다른 서브 클래스에 의해 구현된다 그래서 + +88 +00:06:20,278 --> 00:06:24,598 + 예를 확률 그라데이션 하강을 위해 존재하는 것은 해석 원자 폭탄 엉덩이있다 + +89 +00:06:24,598 --> 00:06:28,209 + 문제 해결사 다시 물건의 종류의 모든과는 어떤 종류를 볼 수 있습니다 + +90 +00:06:28,209 --> 00:06:32,438 + 옵션을 제공합니다 당신은 이런 종류의 소스 코드를 찾아야한다 사용할 수 있습니다 + +91 +00:06:32,439 --> 00:06:35,639 + 당신이 일 모두가이 모든 일이 있음을 함께 맞는 방법의 좋은 개요 + +92 +00:06:35,639 --> 00:06:40,069 + 인터넷에서 할 것이다 오른쪽 녹색 상자에 포함 각을 얼룩 + +93 +00:06:40,069 --> 00:06:44,250 + 블로그는 데이터를 포함하고 빨간색 상자가 연결되어있는 층이다 텍스트 + +94 +00:06:44,250 --> 00:06:51,038 + 함께 블록과 모든 일이 시편에 최적화 얻을 것이다 그래서 + +95 +00:06:51,038 --> 00:06:55,538 + 카페 프로토콜이라는 재미있는 것은 많이 사용 너희들의 버퍼 수 + +96 +00:06:55,538 --> 00:07:00,938 + 이제까지 숫자 후 구글에 다시 너희들은이 폭탄 그러나 프로토콜에 대해 알고 + +97 +00:07:00,939 --> 00:07:05,099 + 권총은 거의 이진 강하게 나는 종류의에 같은 JSON을 입력처럼 이쪽 + +98 +00:07:05,098 --> 00:07:08,550 + 구글이 처음에 데이터를 이용하여 내부에 매우 널리 사용되는 그것에 대해 생각 + +99 +00:07:08,550 --> 00:07:14,750 + 네트워크를 통해 죽음은 그래서 프로토콜이있다 버퍼링한다. 그 프로필 + +100 +00:07:14,750 --> 00:07:18,639 + 서로 다른 객체의 형태를 어떻게 그렇게 느낌의 종류를 정의 + +101 +00:07:18,639 --> 00:07:22,819 + 이 예에서 사용자는 이름 및 ID 및 이메일이 생명을 가지고있을 것 + +102 +00:07:22,819 --> 00:07:26,300 + 최고의 프로필입니다. 프로필 + +103 +00:07:26,300 --> 00:07:31,490 + 클래스의 유형을 찾기 위해 주어진 당신은 실제로에 인스턴스를 실현 볼 수 있습니다 + +104 +00:07:31,490 --> 00:07:37,379 + 사람이 읽을. 
예를 들어 이것은주는 이름 총 TXT 파일을 채우고 있으므로 + +105 +00:07:37,379 --> 00:07:40,968 + 당신은 아이디어가 당신에게 이메일을 제공하고이 사람의 인스턴스입니다 수 + +106 +00:07:40,968 --> 00:07:45,930 + 이 텍스트 파일로 저장 한 후 제품이 컴파일러를 포함 + +107 +00:07:45,930 --> 00:07:49,579 + 실제로 액세스 다양한 프로그래밍 언어에서 클래스를 생성 할 수 있습니다 + +108 +00:07:49,579 --> 00:07:55,418 + 당신이 할 수있는 이러한 데이터 형식 포토 북 컴파일러를 실행 한 후이 그것을 프로필 + +109 +00:07:55,418 --> 00:08:01,038 + 당신은 자바와 C C에 가져 ++과 파이썬과 갈 수있는 클래스를 생성 + +110 +00:08:01,038 --> 00:08:05,300 + 모든 그래서 실제로 카페가 있습니다 단지에 대해 왜 이러한 프로브를 말합니까 + +111 +00:08:05,300 --> 00:08:08,270 + 이러한 프로토콜 버퍼 그들은 거의 모든 것을 저장을 사용하여 + +112 +00:08:08,269 --> 00:08:16,008 + 캐시는 내가 말했듯 있도록 카페를 이해하는 코드를 읽을 필요가 이해하기 + +113 +00:08:16,009 --> 00:08:20,480 + 카페이 하나의 거대한 파일이라는 카페 어두운 도로가 + +114 +00:08:20,480 --> 00:08:24,470 + 그들은 단지에서 사용되는 프로토콜 버퍼 유형 모두를 정의하지만 + +115 +00:08:24,470 --> 00:08:29,170 + 카페는이 나는 그것이 몇 생각의 거대한 파일 만 라인 긴하지만 + +116 +00:08:29,170 --> 00:08:32,200 + 실제로 꽤 잘 문서화 그리고 내가 가장 최신의 생각입니다 + +117 +00:08:32,200 --> 00:08:35,890 + 은신처 유형이 무엇인지의 문서는 어떤 이들 계층에 대한 옵션입니다 + +118 +00:08:35,889 --> 00:08:39,629 + 당신이 솔버 및 레이어와마다 모든 옵션을 지정하는 방법입니다 + +119 +00:08:39,629 --> 00:08:43,100 + 그 때문에 난 정말이 파일을 체크 아웃하고 읽어 보시기 바랍니다 모든 나쁘지 않다 + +120 +00:08:43,100 --> 00:08:48,019 + 일이 카페 단지 어떻게 작동하는지 그것을 통해 당신에 대한 질문이있는 경우 + +121 +00:08:48,019 --> 00:08:53,120 + 이 매개 변수보다 정의이 당신을 보여줍니다 당신이 내 왼쪽 귀에서 맛을 제공 + +122 +00:08:53,120 --> 00:08:58,519 + 이는 카페 도끼를 나타내는 데 사용 프로토콜 버퍼의 종류 및 + +123 +00:08:58,519 --> 00:09:03,970 + 오른쪽 해법을 나타내는 데,이 해석 파라미터 있도록 + +124 +00:09:03,970 --> 00:09:09,009 + 예를 들어 솔버 프로모터의 경계가 그물에 대한 참조를 취하고 + +125 +00:09:09,009 --> 00:09:12,409 + 또한 학습 속도와 빈도 점을 확인하는 방법과 같은 것들을 포함 + +126 +00:09:12,409 --> 00:09:19,549 + 당신이 카페에서 작업 할 때 바로 그래서 실제로 있다는 같은 다른 것들 + +127 +00:09:19,549 --> 00:09:23,729 + 정말 멋진 그렇게 할 때 모델을 양성하기 위해 코드를 작성할 필요가 없습니다 + +128 +00:09:23,730 --> 00:09:27,889 + 카페 작업 당신은 일반적으로 당신은 그래서 먼저이 4 단계 프로세스를 + +129 +00:09:27,889 --> 00:09:31,960 + 데이터를 변환하고 그냥 이미지 분류 문제가 발생할 경우 특히 + +130 +00:09:31,960 --> 00:09:34,540 + 당신은 단지 기존 중 하나를 사용이를 위해 당신은 어떤 코드를 작성할 필요가 없습니다 + +131 +00:09:34,539 --> 00:09:40,240 + 다니엘 이진 카파 배는 방금로 할 것이다 당신의 파일을 정의 + +132 +00:09:40,240 --> 00:09:45,230 + 작성 또는 편집이 단백질 다니엘 중 하나는 다시 해석을 정의 + +133 +00:09:45,230 --> 00:09:49,509 + 단지 프로보 TXT TXT 파일에 살 것이다 당신은 텍스트 내에서 작동 할 수 + +134 +00:09:49,509 --> 00:09:54,200 + 편집기 다음은 훈련이 기존 바이너리에이 일을 모두 통과합니다 + +135 +00:09:54,200 --> 00:09:57,990 + 모델과 전투는 기차가 그 다음에 할 수 있는지 테스트하는 모델을 계속 뱉어 + +136 +00:09:57,990 --> 00:10:02,820 + 당신은 당신이 할 수 이미지에 ResNet 훈련을 할 경우에도, 그래서 다른 것들에 사용 + +137 +00:10:02,820 --> 00:10:06,000 + 그냥 간단한 절차를 따르 쓰기없이 거대한 네트워크를 훈련 + +138 +00:10:06,000 --> 00:10:12,110 + 정말 멋진 등 만 데이터를 변환하기 위해 일반적으로 한 단계 코드 + +139 +00:10:12,110 --> 00:10:17,259 + 그래서 카페 나는 우리가 형식으로 HTML5에 대해 조금 얘기했습니다 알고 사용 + +140 +00:10:17,259 --> 00:10:21,460 + 지속적으로 책상에 픽셀을 저장하고 효율적으로 읽는하지만, + +141 +00:10:21,460 --> 00:10:26,940 + 경우가 물었다 그래서 기본적으로 캐시는 LM TV라는이 다른 파일 형식을 사용 + +142 +00:10:26,940 --> 00:10:30,570 + 당신이 당신이 가진 모든 레이블이 각 이미지 다음 이미지의 무리 인 경우 당신은 할 수 + +143 +00:10:30,570 --> 00:10:31,480 + 롤 호출 + +144 +00:10:31,480 --> 00:10:35,370 + 카페는 수 거대한 alamoudi에 그 전체 데이터 집합을 변환하는 스크립트가 있습니다 + +145 +00:10:35,370 --> 00:10:42,169 + 젠은 당신에게 방법의 아이디어를 제공 할 수 있도록이는 그것의 훈련을 위해 사용할 수 있습니다 + +146 +00:10:42,169 --> 00:10:46,240 + 당신은 당신의 이미지에 대한 경로를 가진 텍스트 파일을 작성 정말 쉽고 + +147 +00:10:46,240 --> 00:10:49,959 + 라벨로 구분하고 그냥 승객은 스크립트가 몇 기다려 유지 + +148 +00:10:49,958 --> 00:10:56,018 + 시간 데이터가 디스크에 큰 거대한 IMDB 파일을 설정하고있는 경우에는 작업하는 경우 + +149 +00:10:56,019 --> 00:11:01,860 + HBO 오 같은 뭔가 다른 당신은 아마 카페 그래서 자신을 
만들어야합니다 + +150 +00:11:01,860 --> 00:11:06,060 + 실제로 데이터를 읽는 몇 가지 옵션을 가지고이 날짜에있다 않습니다 + +151 +00:11:06,059 --> 00:11:11,888 + 자신의 윈도우 다토 보호를위한 시장 실제로는 HDL 5에서 읽을 수 있습니다 + +152 +00:11:11,889 --> 00:11:14,350 + 특히의 직접 메모리에서 읽기 물건에 대한 옵션이있다 + +153 +00:11:14,350 --> 00:11:18,480 + 이 모든 파이썬 인터페이스하지만 내 관점에서 적어도 유용 + +154 +00:11:18,480 --> 00:11:22,339 + 캠페인에 읽기 및 데이터의 다른 방법의 종류가 조금 있습니다 + +155 +00:11:22,339 --> 00:11:26,120 + 두 번째 수준의 카페 생태계에서 시민과 엘렌 DBA 정말입니다 + +156 +00:11:26,120 --> 00:11:30,669 + 가장 쉬운 것은 당신이 당신이 아마 변환을 시도해야 할 수 있습니다 그래서 만약 작동하는 방법 + +157 +00:11:30,669 --> 00:11:40,179 + 그래서 24 캠페인 단계와 mp3 형식으로 데이터가 있으므로 객체를 정의하는 것입니다 + +158 +00:11:40,179 --> 00:11:44,609 + 같은 나는 그가 그냥하지 그래서 여기에이를 찾기 위해 큰 프로모션 TXT를 작성하는 것입니다 말했다 + +159 +00:11:44,610 --> 00:11:48,818 + 로지스틱 회귀 분석이 단순한 모델 당신은 내가하지 않았다 것을 알 수있다 + +160 +00:11:48,818 --> 00:11:53,948 + 내 자신의 조언을 따라 난 다음 여기에 HDL (5) 파일에서 데이터를 읽고 있어요 + +161 +00:11:53,948 --> 00:11:59,278 + 내적 및 캐세이보다라고 완전히 연결 층을 + +162 +00:11:59,278 --> 00:12:03,588 + 완전히 은신처를 연결되어 자신의 권리는 당신에게 수업 방법의 수를 알려줍니다 + +163 +00:12:03,589 --> 00:12:10,399 + 값을 초기화하고 내가 읽어 부드러운 최대 손실 함수를합니다 + +164 +00:12:10,399 --> 00:12:15,458 + 라벨과는 반대 선출 된 리더에서 손실 성분을 생성하므로 + +165 +00:12:15,458 --> 00:12:20,009 + 이 파일에 대해 지적하는 몇 가지 있습니다 일반적으로 한 모든 층이 + +166 +00:12:20,009 --> 00:12:23,588 + 가중치의 데이터를 저장하기 위해 약간의 블로그와 기울기를 포함 + +167 +00:12:23,589 --> 00:12:28,680 + 그리고 층의 모양과 벨레 자체가 일반적으로 할 수있는 같은 이름을 가진 + +168 +00:12:28,679 --> 00:12:34,269 + 다른 일을 혼란 조금 이들 층의 많은이있을 것입니다 + +169 +00:12:34,269 --> 00:12:39,250 + 바로 당신이거야 여기에이 네트워크에 실제로 14 무게 14 바이어스와 모양 + +170 +00:12:39,250 --> 00:12:43,149 + 즉 학습 속도가 그래서 그 두 모양의 학습 속도를 찾아 + +171 +00:12:43,149 --> 00:12:44,769 + 방법 모두 정규화 + +172 +00:12:44,769 --> 00:12:50,198 + 참고로 나중에 또 하나의 바이어스는 출력의 수를 지정하는 것입니다 + +173 +00:12:50,198 --> 00:12:51,568 + 클래스는 단지 숫자입니다 + +174 +00:12:51,568 --> 00:12:57,378 + 이 완전히 연결 은신처 주변에 출력하고 마지막으로 신속하고 + +175 +00:12:57,379 --> 00:13:01,139 + 레이어와 카페를 동결 더러운 방법은 학습 속도 (204)를 설정하는 것입니다 + +176 +00:13:01,139 --> 00:13:08,048 + 그런 식으로 연결된 모양에 대한 우리의 편견 지적하는 또 다른 일이 + +177 +00:13:08,048 --> 00:13:12,600 + 밖으로 구글과 같은 ResNet 및 기타 대형 모델이 얻을 수 있다는 것입니다 + +178 +00:13:12,600 --> 00:13:17,110 + 정말 정말 빨리 손에서 카페 정말 당신처럼 정의 할 수 없습니다 있도록 + +179 +00:13:17,110 --> 00:13:20,989 + 조성의 ality ResNet을 위해 그들은 단지 동일한 패턴을 통해 반복되도록 + +180 +00:13:20,989 --> 00:13:26,459 + 반복해서 ResNet의 프로토 TXT 그래서 프로 txt 파일에 거의 7,000 선이다 + +181 +00:13:26,458 --> 00:13:31,219 + 당신이 손으로 그것을 쓸 수 있지만 중간 연습 사람들이 쓰는 경향이 긴 있도록 + +182 +00:13:31,220 --> 00:13:35,470 + 그가가의 그래서 작은 파이썬 스크립트는 자동으로이 일을 생성하는 + +183 +00:13:35,470 --> 00:13:41,879 + 네트워크에 찾을 것이 아니라 시작하려면 약간의 총 당신을 + +184 +00:13:41,879 --> 00:13:46,509 + 처음부터 당신은 일반적으로 몇 가지 기존 제품 내선을 다운로드 할 수 있습니다 및 + +185 +00:13:46,509 --> 00:13:50,230 + 일부 기존 무게 파일과 작업 거기에서 그래서 당신이 생각해야하는 방법 + +186 +00:13:50,230 --> 00:13:54,139 + 우리가 여기에 본 적이 제품 txt 파일이이를 정의하기 전에 것입니다 + +187 +00:13:54,139 --> 00:13:58,159 + 설교자와 무게이 살고있는 네트워크와 멘델의 아키텍처 + +188 +00:13:58,159 --> 00:14:03,230 + 진 일이 그리고 당신이 정말로 검사를하지만, 수 없습니다 카페 모델 파일 + +189 +00:14:03,230 --> 00:14:07,869 + 그것은 그 작동 방법은 이름 위치를 일치 기본적으로 키 - 값 쌍의 + +190 +00:14:07,869 --> 00:14:13,790 + 카페 모델 안에이 때문에 가정의 수호신으로 범위가 이들의 이름과 일치 + +191 +00:14:13,789 --> 00:14:19,389 + 그런데 그가 마지막에 해당 불구과 XC70 무게는-것 + +192 +00:14:19,389 --> 00:14:24,048 + 완전히 연결 층과 알렉스하지 그래서 다음에 당신을 찾을 때 + +193 +00:14:24,048 --> 00:14:29,600 + 자신의 데이터는 카페를 시작하고 당신은 모델과 제품 내선을로드 할 때 + +194 +00:14:29,600 --> 00:14:33,459 + 단지 이름의 키 - 값 쌍을 일치 시키려고하고 카페 사이 대기 + +195 +00:14:33,458 --> 00:14:35,008 + 모델 제품 EXT + +196 +00:14:35,009 
--> 00:14:39,209 + 그래서 같은 이름은 다음 새로운 네트워크는에서 초기화되는 경우 + +197 +00:14:39,208 --> 00:14:43,008 + 가치와 정말 정말 유용하고 좋은 편리하다 프로토 TXT + +198 +00:14:43,009 --> 00:14:49,230 + 조정하지만, 층 이름은 실제로 해당 계층보다 일치하지 않으면하면 + +199 +00:14:49,230 --> 00:14:52,980 + 이 예를 들어 당신이 국유화 읽을 수있는 방법 그래서 처음부터 초기화 + +200 +00:14:52,980 --> 00:14:57,810 + 당신이했습니다 경우 카페에서 출력 그래서 조금 더 구체적으로 + +201 +00:14:57,809 --> 00:15:02,250 + 아마 모델은 다음이 래리가 완전히 마지막에가는 이미지를 다운로드 + +202 +00:15:02,250 --> 00:15:06,289 + 출력 클래스 과정에서의 연결 층은 천 출력이되지만 + +203 +00:15:06,289 --> 00:15:09,480 + 지금 어쩌면 당신은 당신에 대해 걱정 몇 가지 문제에 대한 10 출력을 원하는 + +204 +00:15:09,480 --> 00:15:13,149 + 당신은 마지막 층을 reindustrialize하는 거 필요가있어, 그것을 실현 + +205 +00:15:13,149 --> 00:15:17,309 + 무작위로 미세 조정 네트워크는 그래​​서 당신이 할 방법은 당신이 필요하다 + +206 +00:15:17,309 --> 00:15:22,088 + 실제로입니다 있는지 확인하기 위해 프로 txt 파일에있는 은신처의 이름을 변경 + +207 +00:15:22,089 --> 00:15:26,890 + 무작위가 아닌 카페 모델과 경우에서 읽는 초기화 + +208 +00:15:26,889 --> 00:15:30,919 + 다음은 실제로 충돌합니다이 작업을 수행하는 것을 잊지 그것은 당신에게 이상한 오류를 줄 것이다 + +209 +00:15:30,919 --> 00:15:35,419 + 모양이이를 저장하려고 할 것이다 원인 정렬하지에 대한 메시지 + +210 +00:15:35,419 --> 00:15:39,299 + 새에서이 열 차원 일에 천 차원의 가중치 행렬 + +211 +00:15:39,299 --> 00:15:46,129 + 파일 및 카페로 작업하는을 정의 할 때 그렇게 다음 단계를 작동하지 않습니다 + +212 +00:15:46,129 --> 00:15:51,100 + 해석 솔버는 당신이 그것을위한 모든 옵션을 볼 수 있습니다 그냥 프로 txt 파일입니다 + +213 +00:15:51,100 --> 00:15:56,620 + 나는이 같은 작은 모양의 무언가에 대한 링크를 준 거 프로필 + +214 +00:15:56,620 --> 00:16:00,169 + 알렉스 밤 그 학습 속도를 정의 할 것이다 당신은 배우고 어쩌면 있도록 + +215 +00:16:00,169 --> 00:16:04,809 + 방법은 k로하고 정규화가 얼마나 자주 그런 모든 것을 확인하지만, + +216 +00:16:04,809 --> 00:16:10,169 + 이러한 덜 훨씬 덜 복잡 그는위한 프로 TXT 있다는보다 끝나게 + +217 +00:16:10,169 --> 00:16:15,069 + 네트워크이 알렉스 넥타이 다만 어쩌면 십사 라인 당신이되지만 + +218 +00:16:15,070 --> 00:16:18,530 + 실제로 몇 번을 참조 것은 그 사람들은 복잡한의 종류가하려는 경우 + +219 +00:16:18,529 --> 00:16:22,299 + 거래 파이프 라인 위치를 특정의 속도를 학습 나는 11 훈련 그들이 첫 번째 + +220 +00:16:22,299 --> 00:16:25,039 + 네트워크의 일부는 다른 학습 속도 특정 부분과 함께 훈련 할 + +221 +00:16:25,039 --> 00:16:28,389 + 서로 다른 해석 파일의 폭포로 끝날 수있는 네트워크의 + +222 +00:16:28,389 --> 00:16:31,490 + 실제로 우리가 일종의 자신을 미세 조정하는 가장 독립적 인 저를 실행 + +223 +00:16:31,490 --> 00:16:38,070 + 다른 솔버를 사용하여 별도의 단계에서 모델은 모든 일을 한 번, 그래서 그 + +224 +00:16:38,070 --> 00:16:43,550 + 다음은 단지 트레이너 모델 내 조언을 따라하면 당신이 단지를 사용하는 경우 있도록 + +225 +00:16:43,549 --> 00:16:49,208 + MTB 당신은 그냥 존재, 즉 바이너리 전화에 대한 모든 것을 + +226 +00:16:49,208 --> 00:16:55,569 + 아직 여기 캠페인에 그냥 통과하여 해석하고 TXT 및 + +227 +00:16:55,570 --> 00:16:59,540 + 유럽​​은 미세 조정 인 경우 무게가 파일을 재교육과 금요일 아마 실행합니다 + +228 +00:16:59,539 --> 00:17:03,659 + 아마 오랜 시간 동안 그냥 확인하고 책상을 절감하고 알 수있을 것 + +229 +00:17:03,659 --> 00:17:08,549 + 여기에서 지적하는 한 것은이가에 실행 GPU를 지정할 것입니다 + +230 +00:17:08,549 --> 00:17:11,209 + 마지막 텍스트하지만 당신은 실제로 CPR에서 실행할 수 있습니다 + +231 +00:17:11,209 --> 00:17:17,288 + 마지막에 언젠가 내 음 하나에이 플래그를 설정하고 실제로 최근에 의해 + +232 +00:17:17,288 --> 00:17:21,048 + 올해 카페는 걸쳐 많은 배치를 분할 할 수있는 데이터 병렬 처리를 추가 + +233 +00:17:21,048 --> 00:17:26,318 + 시스템에서 여러 GPU는 실제로이 플래그에 여러 개의 GPU를 추가 할 수 있습니다 + +234 +00:17:26,318 --> 00:17:29,710 + 그냥 모든되어 카페를 말하면 자동으로 여러 배치를 분할합니다 + +235 +00:17:29,710 --> 00:17:33,600 + 컴퓨터에있는 모든 GPU를 통해이 정말 당신은 멀티 GPU를 수행 한 멋진 있도록 + +236 +00:17:33,599 --> 00:17:51,689 + 오 코드 정말 멋진 카페의 한 줄을 작성하지 않고 훈련 그래 + +237 +00:17:51,690 --> 00:17:57,230 + 그래, 정말 문제는 당신이 좀 더 일에 대해 이동하는 방법을 생각 + +238 +00:17:57,230 --> 00:18:00,778 + 당신은 아마 무게를 초기화 할 복잡한 초기화 전략 + +239 +00:18:00,778 --> 00:18:04,019 + 설교자와 모델에서의 여러 부분과 그 같은 방법을 사용하여 + +240 +00:18:04,019 --> 00:18:07,710 + 네트워크는 대답은 당신이 아마 간단한으로 그렇게 할 수 없다는 것입니다 + +241 +00:18:07,710 --> 00:18:11,278 + 당신이 무게와 파이썬에 돈의 종류 할 수있는 메커니즘 그것은 아마 + +242 
+00:18:11,278 --> 00:18:17,669 + 내가 우리가 전에 언급 한 것 같아 바로 그래서 그 일에 대해 가지 방법 + +243 +00:18:17,669 --> 00:18:21,710 + 카페는 서로 다른 종류의 많은 다운로드 할 수 있습니다이 정말 좋은 모델이있다 + +244 +00:18:21,710 --> 00:18:25,919 + 이 때문에의 임무에 초반 이었죠 모델과 다른 데이터 세트의이 모델은이다 + +245 +00:18:25,919 --> 00:18:29,659 + 정말 최고의 당신이 알렉스 natin BGG을 가지고 있었다 당신은 거기 주민있어 + +246 +00:18:29,659 --> 00:18:33,840 + 이미 꽤 많이 많이와 정말 좋은 모델을 많이 그래서 거기있다 + +247 +00:18:33,839 --> 00:18:37,359 + 그는 그것이 정말 쉽습니다 카페에 대한 정말 정말 장점입니다입니다 + +248 +00:18:37,359 --> 00:18:40,428 + 누군가 다른 사람의 모델을 다운로드를 가리키는 데이터에서 실행하는 방법 + +249 +00:18:40,429 --> 00:18:42,350 + 데이터 + +250 +00:18:42,349 --> 00:18:46,298 + 내가 언급 한 것처럼 카페는 파이프 라인 인터페이스를 가지고 + +251 +00:18:46,298 --> 00:18:49,069 + 나는 세부 사항으로 뛰어들 수 있다고 생각하지 않습니다 충당하기 위해 너무 많은 일이 있기 때문에 + +252 +00:18:49,069 --> 00:18:53,378 + 여기에 있지만 코스와 카페 파의 일종으로 정말 정말 좋은이 아니다 + +253 +00:18:53,378 --> 00:18:57,980 + 코드를 읽을 필요하고, 그래서 파이썬 인터페이스에 대한 설명서 + +254 +00:18:57,980 --> 00:18:58,690 + 모든 + +255 +00:18:58,690 --> 00:19:02,730 + 파이썬 인터페이스 스트리트 카페는 대부분이 두에 두에 정의되어 있습니다 + +256 +00:19:02,730 --> 00:19:08,399 + 파일이 CPP 파일은 적에게 이야기하기 전에 것을 사용한 경우 파이썬을 향상 사용 + +257 +00:19:08,398 --> 00:19:13,369 + 는 C ++ 클래스의 일부를 마무리하고이에에 다음 걸릴에 노출 + +258 +00:19:13,369 --> 00:19:17,648 + . 평 실제로 추가 방법을 첨부하고 더 많은 파이썬를 제공 파일 + +259 +00:19:17,648 --> 00:19:22,469 + 인터페이스 당신은 어떤 방법과 데이터 타입의 종류 알고 싶어요 그래서 만약 + +260 +00:19:22,470 --> 00:19:27,000 + 카페 파이프에서 사용할 수있는 것이 가장 좋은 인터페이스 단지를 통해 3을 읽는 것입니다 + +261 +00:19:27,000 --> 00:19:31,339 + 이 두 파일 그리고 그들은 너무 오래 그것은 할 매우 쉽게 그래서 아니에요 + +262 +00:19:31,339 --> 00:19:37,038 + '예 일반적으로 파이썬 인터페이스는 당신이 아마 수행 할 수 있습니다 꽤 유용 + +263 +00:19:37,038 --> 00:19:40,558 + 미친 체중 초기화 전략은 당신이 뭔가 더 복잡한 작업을 수행해야하는 경우 + +264 +00:19:40,558 --> 00:19:44,960 + 단지 체인 모델 복사보다 그것은 또한 정말 쉽게 단지를 얻을 수 있습니다 + +265 +00:19:44,960 --> 00:19:48,710 + 네트워크는 다음 NumPy와 배열 NumPy와 함께 앞으로 뒤로 실행 + +266 +00:19:48,710 --> 00:19:53,129 + 그래서 예를 들어, 당신은 깊은 꿈과 클래스 등을 구현할 수있다 + +267 +00:19:53,128 --> 00:19:56,798 + 당신이 숙제에 당신도 그렇게 할 수 않았다 유사한 시각화 + +268 +00:19:56,798 --> 00:20:01,349 + 아주 쉽게 그냥 데이터를 취할 필요가 카페에 파이썬 인터페이스를 사용하여 + +269 +00:20:01,349 --> 00:20:03,899 + 다음 네트워크의 다른 부분을 통해 순방향 및 역방향 실행 + +270 +00:20:03,900 --> 00:20:08,720 + 파이썬 인터페이스도 아주 좋은 일을 그냥 추출 할 경우 경우 + +271 +00:20:08,720 --> 00:20:12,220 + 당신 같은 기능을 사용하면 일부 자유 무역 모델이 일부 데이터를 가지고 있고 + +272 +00:20:12,220 --> 00:20:15,610 + 네트워크의 일부에서 기능을 추적 할 다음 어쩌면에 저장 + +273 +00:20:15,609 --> 00:20:20,259 + 디스크가 아마 2005 파일은 아주 쉽게 약간의 다운 스트림 처리했다 + +274 +00:20:20,259 --> 00:20:25,660 + 당신이 할 수있는 파이썬 인터페이스를 수행하는 것은 실제로 카페는 새의 종류가 + +275 +00:20:25,660 --> 00:20:29,970 + 실제로 파이썬에서 레이어를 완전히 정의 할 수 있지만이 기능 + +276 +00:20:29,970 --> 00:20:33,600 + 내가 직접 해본 적이 있지만, 그것은 좋은 것 같지만 냉각 보인다 년대 + +277 +00:20:33,599 --> 00:20:37,259 + 단점은 그 층이 나 CPU가 될 것입니다 그래서 우리는 대한 이야기​​입니다 + +278 +00:20:37,259 --> 00:20:41,809 + 당신이 편지를 쓰는 경우 CPU와 GPU 사이의 통신 병목 + +279 +00:20:41,809 --> 00:20:46,460 + 파이썬은 모든 전후 패스가 난 오버 헤드가 고정됩니다 + +280 +00:20:46,460 --> 00:20:51,289 + 파이프와 전선으로 도움이 될 수 있습니다 하나의 좋은 장소 있지만 전송하지 + +281 +00:20:51,289 --> 00:20:58,450 + 사용자 정의 손실 함수 그건 당신이 마음에 그렇게 유지 수있는 일이 어쩌면 그래서 + +282 +00:20:58,450 --> 00:21:02,450 + 가톨릭 장점과 단점의 빠른 개요이 정말 내 관점에서 + +283 +00:21:02,450 --> 00:21:06,049 + 당신 싶어하는 모든 경우 종류의 간단한 기본 피드 포워드 네트워크를 훈련 + +284 +00:21:06,049 --> 00:21:09,730 + 특히 분류 및 캐시는 물건을 얻을 정말 쉽습니다 + +285 +00:21:09,730 --> 00:21:12,880 + 당신을 실행하면 자신이 방금 이러한 모든 사용하는 코드를 작성할 필요가 없습니다 + +286 +00:21:12,880 --> 00:21:17,660 + 도구를 사전이 내장 그것은 파이썬 인터페이스를 실행하는 것은 매우 쉽다 + +287 +00:21:17,660 --> 00:21:21,259 + 조금 사용하기위한 아주 좋은 조금 더 복잡한 사용 사례에 
대해 작동합니다 + +288 +00:21:21,259 --> 00:21:25,329 + 당신이 정말로이있을 때 일이 정말 미친 얻을 때하지만 성가신 될 수 있습니다 + +289 +00:21:25,329 --> 00:21:29,299 + 특히 그들이 할 수있는 반복 모듈 패턴 회장 같은 큰 네트워크 + +290 +00:21:29,299 --> 00:21:33,450 + 당신이 원하는 위치 지루한와 같은 재발 네트워크와 같은 것들에 대한 수 + +291 +00:21:33,450 --> 00:21:37,519 + 네트워크의 다른 부분 사이의 공유 기다립니다 종류 회사의 종류 일 수있다 + +292 +00:21:37,519 --> 00:21:41,559 + 카페에서 성가신의 그것은 가능하지만 아마 사용할 수있는 가장 좋은 것은 아니다 + +293 +00:21:41,559 --> 00:21:46,250 + 내 관점에서 다른 큰 단점이 및 다른 단점 + +294 +00:21:46,250 --> 00:21:50,220 + 당신이 카페에서 은신처의 자신의 유형을 찾을 때 갖는 끝날 것입니다 + +295 +00:21:50,220 --> 00:21:55,440 + 그것이 당신에게 매우 빠른 개발주기를 제공하지 않습니다하지 그래서 C ++ 코드를 작성하는 + +296 +00:21:55,440 --> 00:22:00,769 + 그래서 그 그건 그래서 고통의 종류의 많은 종류의 당신에게 편지를 쓰고 있어요 + +297 +00:22:00,769 --> 00:22:04,750 + 우리의 세계는 카페의 여행을하는 간단한 질문이 있다면, 그래서 선풍 이유 + +298 +00:22:04,750 --> 00:22:06,669 + 네 + +299 +00:22:06,669 --> 00:22:14,028 + 교차 검증 및 카페 찾을 시도 할 수 있습니다이 txt 기차 발파라이소에서 매우 + +300 +00:22:14,028 --> 00:22:19,159 + 교육 단계와 테스트 단계는 그래서 일반적으로 약 기차처럼 좋아 + +301 +00:22:19,159 --> 00:22:20,269 + 제품 내선 + +302 +00:22:20,269 --> 00:22:24,960 + 제품 내선을 적용하고 손하지만 테스트에서 작업에에서 사용되는 배포 + +303 +00:22:24,960 --> 00:22:33,409 + 흔적 제품 내선의 위상은 그게 다야의 유효성 확인에 사용됩니다 + +304 +00:22:33,409 --> 00:22:39,820 + 다음 하나는 토치 그래서 토치 정말 내 그래서 캐비닛에 대해 알고있다 + +305 +00:22:39,819 --> 00:22:42,980 + 내가 여기 편견 조금 그래서 개인적으로 좋아 단지에 그를 얻을 수 + +306 +00:22:42,980 --> 00:22:46,259 + 오픈 나는 꽤 많이 내 자신에 거의 독점적으로 토치를 사용했던 것을 + +307 +00:22:46,259 --> 00:22:51,749 + 작년에 정도 횃불 그래서 프로젝트는 것 NYU 출신 + +308 +00:22:51,749 --> 00:22:56,450 + C에서 최대 대신에 기록하고 페이스 북 참으로 마음에 많이 사용되는 + +309 +00:22:56,450 --> 00:23:02,409 + 특히 나는 큰 중 하나 있도록 토치를 사용하는 트위터에서도 사람들을 많이 생각 + +310 +00:23:02,409 --> 00:23:05,309 + 과정에서 사람들을 괴물 일들이 당신이 대신 작성해야한다는 것입니다 + +311 +00:23:05,308 --> 00:23:11,038 + 나는 듣지 또는 토치 작업을 시작하기 전에 사용 된 적이 적이있는 + +312 +00:23:11,038 --> 00:23:16,700 + 하지만 실제로 유혹 가장 높은이 높은 수준의 스크립트는 것을 너무 나쁘지 않다 + +313 +00:23:16,700 --> 00:23:20,999 + 더 실행할 수 있도록 정말 임베디드 디바이스를위한 것입니다 언어 + +314 +00:23:20,999 --> 00:23:24,720 + 효율적으로는 많은 방법에서 자바 스크립트와 매우 유사 많이있어 + +315 +00:23:24,720 --> 00:23:29,749 + 그래서 루에 대한 또 다른 좋은 점은 그것이 의미가 있기 때문에 내장에서 실행할 수 있다는 것입니다 + +316 +00:23:29,749 --> 00:23:33,929 + 당신이 실제로 루프를 위해 할 수있는 장치는 정말 빨리하고 성화 당신은 알고있다 + +317 +00:23:33,929 --> 00:23:37,149 + 당신은 진짜로 그건 천천히 것 for 루프에서 어떻게 파이썬에서라면 + +318 +00:23:37,148 --> 00:23:40,798 + 실제로 적시를 사용하기 때문에 성화에의 할 사실은 완전히 잘 + +319 +00:23:40,798 --> 00:23:46,249 + 이러한 일들이 정말 빨리 만들 수 컴파일과 성화는 우리의 최신 가장입니다 + +320 +00:23:46,249 --> 00:23:50,200 + 이 기능 언어 기능이 인 것을 중요 자바 스크립트 + +321 +00:23:50,200 --> 00:23:54,058 + 그것은 다른에 주위에 콜백을 통과 전달하는 매우 흔한 일류 시민 + +322 +00:23:54,058 --> 00:24:01,200 + 코드의 일부는 또한 프로토콜 상속 곳의이 아이디어를 가지고있다 + +323 +00:24:01,200 --> 00:24:05,200 + 그들은 당신이 생각할 수있는 테이블 루아에있는 일종의 하나의 데이터 구조의 것 + +324 +00:24:05,200 --> 00:24:09,558 + 당신이 일을 구현할 수있는 자바 스크립트 객체와 매우 유사되고있는 + +325 +00:24:09,558 --> 00:24:13,378 + 객체 지향 프로그래밍 같은 비슷한에 프로토 타입 상속을 사용하여 + +326 +00:24:13,378 --> 00:24:18,428 + 방법이 자바 스크립트에서와 그리고 도시의 하나 단점 중의 하나로서 + +327 +00:24:18,429 --> 00:24:19,929 + 실제로 표준 라이브러리 + +328 +00:24:19,929 --> 00:24:24,820 + 취급 문자열과 이것 저것 할 수있다처럼 때로는 성가신 종류의 물건입니다 + +329 +00:24:24,819 --> 00:24:28,999 + 아마 가장 성가신 가시고의 종류는 그렇게 모든 인덱스의 하나입니다 + +330 +00:24:28,999 --> 00:24:33,058 + 네 루프에 대한 당신의 직관은 한 동안 있지만 다른에 대해 조금 꺼집니다 + +331 +00:24:33,058 --> 00:24:37,528 + 보다가 데리러 아주 쉽게 그리고 나는이 웹 사이트에 여기에 링크를 준 + +332 +00:24:37,528 --> 00:24:41,618 + 당신이 15 분 루아를 배울 수 있다고 주장하는 것은 그것은 약간있을 수 있습니다 + +333 +00:24:41,618 --> 00:24:45,209 + 그들은 그것을 조금 지나치게 될 수 있으므로 이상하지만 난 그것을 아주 쉽게 생각 + 
+334 +00:24:45,210 --> 00:24:50,298 + 그래서 주요 아이디어 뒤에 꽤 빨리 그것을 선택하고 코드를 작성 시작하기 + +335 +00:24:50,298 --> 00:24:55,398 + 토치 그래서 너희들에 NumPy와 많이 작업 된이 텐서 클래스 당신의 + +336 +00:24:55,398 --> 00:24:59,548 + 할당 및 할당 종류의 구조화하는 방법은 NumPy와 것입니다 + +337 +00:24:59,548 --> 00:25:03,329 + 배열은 당신에게 어떤 방법으로 당신이 원하는 데이터를 조작하는이 정말 쉬운 방법을 제공합니다 + +338 +00:25:03,329 --> 00:25:06,798 + 다음과 같은 축적 다른 추상화의 수 높은 비율을 사용할 수 있습니다 + +339 +00:25:06,798 --> 00:25:10,720 + 라이브러리와 이것 저것하지만 정말 NumPy와 배열이 당신을 할 것으로 알려져 + +340 +00:25:10,720 --> 00:25:16,909 + 당신이 그렇게 완벽한 유연성에 원하는 방식으로 숫자 데이터를 조작 + +341 +00:25:16,909 --> 00:25:20,580 + 통화하는 경우 어쩌면 여기 봐 여기에 일부 NumPy와의 예 + +342 +00:25:20,579 --> 00:25:24,918 + 지금 우리가 그냥 패스에 대한 간단한을 계산하고 의해 아주 잘 알고 있어야 코드 + +343 +00:25:24,919 --> 00:25:31,990 + 냉기 철도 네트워크의 어쩌면 블랙은 여기 최선의 선택은 아니었지만 우리는있어 + +344 +00:25:31,990 --> 00:25:36,569 + 우리는 우리는 우리가 어떤 어떤 상수 일부 경쟁하고 경쟁하는 일을하고있어 + +345 +00:25:36,569 --> 00:25:40,408 + 가중치는 어떤 임의의 데이터를 얻고있다 그리고 우리는 매트릭스에서 집회를 곱하고있는 + +346 +00:25:40,409 --> 00:25:44,789 + 곱 또 다른 주요 그래서 그 그 심판을 작성하는 것은 매우 간단이고 + +347 +00:25:44,788 --> 00:25:49,538 + 실제로이 이제 불을 지른 답변에 거의 120 번역이 + +348 +00:25:49,538 --> 00:25:53,970 + 오른쪽이 동일한 코드를하지만, 그래서 여기에 불을 지른 답변을 사용하고 + +349 +00:25:53,970 --> 00:25:58,509 + 우리의 가중치를 우리는 우리의 뒷부분 입력 크기를 정의하고 모든 우리는 정의하고 그 + +350 +00:25:58,509 --> 00:26:02,929 + 이는 단지 불을 지른 대답은 우리가있어 임의의 입력 벡터를 얻고 있었다있다 + +351 +00:26:02,929 --> 00:26:07,929 + 이 행렬을하고있다 전진 패스를하고 우리의 스폰서까지 번식이 + +352 +00:26:07,929 --> 00:26:09,179 + C-최대 + +353 +00:26:09,179 --> 00:26:13,149 + 진짜 문제이고 우리가 사용하는 코어를 계산할 수 요소 현명한 최대 + +354 +00:26:13,148 --> 00:26:17,089 + 코드의 일반적인 거의 모든 종류의 사용에 있도록 다른 행렬 곱셈 + +355 +00:26:17,089 --> 00:26:18,689 + 심판을 거래하는 것은 매우 간단합니다 + +356 +00:26:18,690 --> 00:26:22,460 + 꽤 많이 사용에 거의 하나씩 라인 별 번역이 + +357 +00:26:22,460 --> 00:26:25,400 + 불을 지른 대답 대신 + +358 +00:26:25,400 --> 00:26:28,880 + 그래서 또한 교환 및 다른 사용하기 정말 쉬운 것이 심판의 기억 + +359 +00:26:28,880 --> 00:26:33,690 + 우리는이 광고에 대해 얘기 데이터 유형은 최소한의 마지막 강의를 싫증하지만, + +360 +00:26:33,690 --> 00:26:38,500 + 당신이 할 필요가 아마 32 비트 부동 소수점로 전환 할 NumPy와 + +361 +00:26:38,500 --> 00:26:43,049 + 이 다른 데이터 유형으로 데이터를 캐스팅하고는 아주 아주 있다고 밝혀 + +362 +00:26:43,049 --> 00:26:47,589 + 뿐만 아니라 우리의 데이터 유형이이 강도가 지금 있음을 고문 할 쉽게 + +363 +00:26:47,589 --> 00:26:52,990 + 우리는 쉽게 다른 데이터 유형으로 우리의 데이터를 전송할 수 있지만, 여기 어디이있어 + +364 +00:26:52,990 --> 00:26:56,130 + 년 그래서 진짜 이유 등이 다음 슬라이드 고문은 무한 이유하지만 + +365 +00:26:56,130 --> 00:27:02,020 + NumPy와보다 나은 즉 GPU는 또 다른 데이터가 그렇게 할 때 입력 한 것을이다 + +366 +00:27:02,019 --> 00:27:07,879 + 당신이 횃불의 GPU에서 코드를 실행할 싶어 할 때 당신은 당신이 가져 이것을 사용 옳다 + +367 +00:27:07,880 --> 00:27:11,630 + 다른 패키지와 다른 시간을 지른 또 다른 데이터 형식이 + +368 +00:27:11,630 --> 00:27:16,810 + 텐서와 지금이 다른 데이터 유형에 텐서를 캐스팅하고 지금은 + +369 +00:27:16,809 --> 00:27:21,819 + GPU에서 살고 텐서에 수치 연산의 종류를 실행 단지 + +370 +00:27:21,819 --> 00:27:26,500 + 정말 정말 간단하고 토치 그냥 일반 쓸 수 있도록 GPU에서 실행 + +371 +00:27:26,500 --> 00:27:34,220 + tenser 과학 컴퓨팅 코드는 I GPU를 실행하고 정말 빨리 그래서 이런 식으로 내가 할 수 + +372 +00:27:34,220 --> 00:27:37,819 + 이 텐서 정말 당신은 유사한 그들을 제기 NumPy와 생각해야되는 것을 + +373 +00:27:37,819 --> 00:27:41,689 + 및 방법의 종류 만에 문서의 많은 거기에 당신 + +374 +00:27:41,690 --> 00:27:46,250 + 여기까지 10 서비스 내에서 작동하고이 문서를 얻을 수있는 것은 슈퍼 아니다 + +375 +00:27:46,250 --> 00:27:53,950 + 완전한하지만 당신이 다음 있도록에서 살펴 보셔야합니다 있도록이 나쁘지 않아 야하지만, + +376 +00:27:53,950 --> 00:27:58,200 + 실제로 당신은 정말 대신 횃불에 너무 많이 텐서를 사용하지 결국 + +377 +00:27:58,200 --> 00:28:02,880 + 그래서 신경 네트워크의 끝이라는 다른 패키지를 사용하고 이것입니다 + +378 +00:28:02,880 --> 00:28:06,800 + 실제로 그냥 신경 네트워크 패키지를 정의하는 매우 얇은 래퍼 + +379 +00:28:06,799 --> 00:28:10,930 + 이 텐트의 관점에서이 텐트의 조건은 당신이 생각해야 
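The captions above describe a small numpy example: create some random data and random weight matrices, do a matrix multiply, take an elementwise max for the ReLU, and do another matrix multiply to get scores for a simple two-layer network, and they note this translates almost line by line into Torch tensor calls. Here is a minimal sketch of that numpy side; the sizes and variable names are assumptions, since the slide code itself is not reproduced in the captions.

```python
import numpy as np

# Assumed sizes: batch, input dim, hidden dim, number of classes.
N, D, H, C = 64, 1000, 100, 10

# Random data and random weight matrices.
x = np.random.randn(N, D)
w1 = np.random.randn(D, H)
w2 = np.random.randn(H, C)

# Forward pass: matrix multiply, elementwise max (ReLU), matrix multiply.
h = np.maximum(x.dot(w1), 0)
scores = h.dot(w2)
```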
개체 + +380 +00:28:10,930 --> 00:28:15,049 + 이 숙제 코드의 BPR 더 산업용 강도 버전처럼 인 것으로 + +381 +00:28:15,049 --> 00:28:20,240 + 기본이이이 열 번째와 배열이 텐서 추상화가 어디 + +382 +00:28:20,240 --> 00:28:24,480 + 다음은 좋은 깨끗한에서 그 꼭대기에 방향족 라이브러리를 구현 + +383 +00:28:24,480 --> 00:28:30,410 + 인터페이스는 그래서 여기에 우리를 N 패키지를 사용 래리 애들러 네트워크에 동일합니다 + +384 +00:28:30,410 --> 00:28:33,900 + 이 거 순차적의 스택 수 그래서 우리의 네트워크가 순차적가 정의 + +385 +00:28:33,900 --> 00:28:38,360 + 작업을 우리가 완전히 연결되어 선형이 처음거야 거입니다 + +386 +00:28:38,359 --> 00:28:41,759 + 우리의 입력이 마케팅을 언급에서 우리는거야 난간을 가지고 언급 + +387 +00:28:41,759 --> 00:28:48,420 + 다른 대출은 지금 우리가 실제로 두 번째의 무게와 그라디언트를 얻을 수 있습니다 + +388 +00:28:48,420 --> 00:28:52,070 + 지금 대기가 될 것이 얻을 매개 변수 방법을 사용하여 각각에 대한 대답 + +389 +00:28:52,069 --> 00:28:55,750 + 네트워크 및 졸업생 모든 방법이있을 것이다 단일 불을 지른 답변 + +390 +00:28:55,750 --> 00:29:00,490 + 우리가 생성 할 수 있습니다 위의 모든 재료에 대해 하나의 횃불 대답을 할 것이다 + +391 +00:29:00,490 --> 00:29:05,730 + 어떤 임의의 데이터는 이제 앞으로 패스에 우리는 단지의 형식 매트를 호출 + +392 +00:29:05,730 --> 00:29:11,599 + 이 우리에게 컴퓨터 손실에 대한 우리의 점수 우리는이를 제공합니다 우리의 데이터를 사용하여 객체 + +393 +00:29:11,599 --> 00:29:16,769 + 그래서 우리는 컴퓨터에 의해 잃어버린 우리의 손실 함수입니다 별도의 기준 객체 + +394 +00:29:16,769 --> 00:29:21,289 + 기준의 네 번째 메서드를 호출하면 지금 우리는 우리의 예측을 수행 한 + +395 +00:29:21,289 --> 00:29:27,279 + 우리가 처음 설정 쉽고 뒤로 패스 20 호출 손실 함수에 역 + +396 +00:29:27,279 --> 00:29:31,609 + 내가 지금 일에있어 다음 역이는의 그라데이션을 모두 업데이트했습니다 + +397 +00:29:31,609 --> 00:29:35,319 + 대학원에서 네트워크는 우리가 그냥 아주 쉽게 그라데이션 물건을 만들 수 있습니다 params는 + +398 +00:29:35,319 --> 00:29:40,419 + 그래서 이것은 학습의 반대에 의해 졸업생을 곱한 것 + +399 +00:29:40,420 --> 00:29:44,130 + 레이트하고 간단한 그래디언트 디센트 갱신의 방법에 추가 + +400 +00:29:44,130 --> 00:29:50,400 + 그것은 어쩌면 조금했을 모든 권한을의 권리 + +401 +00:29:50,400 --> 00:29:53,560 + 더 명확하지만 우리는 우리가 기능을 잃은 무게 졸업생을하지 않은 + +402 +00:29:53,559 --> 00:30:00,730 + 우리는 앞으로에서 임의의 데이터를 얻을 뒤로 업데이 트를 확인하고 같이 + +403 +00:30:00,730 --> 00:30:03,930 + 그 대답보고에서 기대할 수있는 것이이 일을 실행할 수 있도록하는 것은 매우 쉽다 + +404 +00:30:03,930 --> 00:30:09,570 + GPU에 우리가 몇 가지 새로운 패키지를 가져 GPU에서 이러한 네트워크에서 실행 + +405 +00:30:09,569 --> 00:30:14,519 + 고문을 통해 모든 것을 두 가지 버전이 끝과 우리 + +406 +00:30:14,519 --> 00:30:17,930 + 그냥이 다른 데이터 유형으로 우리의 네트워크와 우리의 손실 함수를 캐스팅해야 + +407 +00:30:17,930 --> 00:30:23,490 + 우리는 또한 우리의 데이터와 레이블을 캐스팅 할 필요가 지금이 모든 네트워크는 것 + +408 +00:30:23,490 --> 00:30:28,660 + 그것은 그 40 어땠는지 지금 매우 쉽다는 그래서 실행하고 GPU에 대한 교육을 + +409 +00:30:28,660 --> 00:30:31,320 + 코드 라인은 우리가 완전히 연결 네트워크를 작성했습니다 우리는에 훈련 할 수있다 + +410 +00:30:31,319 --> 00:30:37,089 + 여기에 GPU하지만 하나의 문제는 우리가 그냥 바닐라 그라데이션을 사용하는 것입니다 + +411 +00:30:37,089 --> 00:30:41,000 + 너무 큰되지 않고 하강 당신은 할당과 같은 다른 것들에 본대로 + +412 +00:30:41,000 --> 00:30:45,329 + 아웃 실제로는 훨씬 더 나은 작업에 불쑥 우리의 혼란에 이렇게 것을 해결하기 위해 + +413 +00:30:45,329 --> 00:30:50,319 + 성화는 우리에게 다시 사용하기 때문에 낙관적 아주 쉽게 기회 패키지를 제공합니다 우리 + +414 +00:30:50,319 --> 00:30:51,799 + 바로 여기에 새로운 패키지를 가져 + +415 +00:30:51,799 --> 00:30:57,799 + 여기 그리고 지금 무엇을 변경하는 것은 우리가 실제로이 콜백을 정의 할 필요가있다 + +416 +00:30:57,799 --> 00:31:02,569 + 우리가 앞으로 전화 및 역방향 명시 적으로 제외되었다 전에 있도록 기능 + +417 +00:31:02,569 --> 00:31:06,960 + 해결할 대신 우리는을 실행이 콜백 함수를 찾을거야 + +418 +00:31:06,960 --> 00:31:10,750 + 네트워크는 앞으로 데이터를 뒤로하고 손실과 기울기를 반환 + +419 +00:31:10,750 --> 00:31:15,400 + 지금 우리의 네트워크에 업데이트 정지 실제로이 콜백을 통과 할 수 있도록하는 + +420 +00:31:15,400 --> 00:31:21,259 + Optim을 패키지에서이 아담 방법과 기능 때문에이이 어쩌면이다 + +421 +00:31:21,259 --> 00:31:26,940 + 조금 어색하지만 당신은 우리가 단지를 사용하여 업데이트 규칙의 어떤 종류를 사용할 수 있습니다 알고 + +422 +00:31:26,940 --> 00:31:31,430 + 우리가 전에 한 어떤에서 변화의 몇 라인은 다시는 매우 간단합니다 + +423 +00:31:31,430 --> 00:31:38,900 + 단지 우리가 본 바로 위하여 가고 모든 캐스팅하여 GPU에서 실행에 추가 + +424 +00:31:38,900 --> 00:31:44,220 + 카페 카페 일종의 다음의 용어와 레이어와 
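The nn walkthrough above (an nn.Sequential stack of Linear and ReLU modules, a separate criterion object for the loss, explicit forward and backward calls, a vanilla gradient step on the flattened parameters, and then the optim package with a callback for fancier update rules) is Lua Torch on the slides. As a rough analogue only, and not the API shown in the lecture, the same pattern looks like this in present-day PyTorch:

```python
import torch
import torch.nn as nn

N, D, H, C = 64, 1000, 100, 10  # assumed sizes

# A stack of modules, as in the Lua nn.Sequential example in the captions.
model = nn.Sequential(nn.Linear(D, H), nn.ReLU(), nn.Linear(H, C))
criterion = nn.MSELoss()  # a separate criterion object computes the loss

x = torch.randn(N, D)
y = torch.randn(N, C)

learning_rate = 1e-4
for t in range(100):
    scores = model(x)            # forward pass through the whole stack
    loss = criterion(scores, y)  # loss via the criterion object
    model.zero_grad()
    loss.backward()              # backward pass fills in all gradients
    with torch.no_grad():
        for p in model.parameters():
            p -= learning_rate * p.grad  # vanilla gradient descent step
```

The main difference from the Lua version in the captions is that autograd tracks a gradient per parameter here, instead of handing back a single flattened params/gradParams pair from getParameters.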
카페가 모든 것을 구현 + +425 +00:31:44,220 --> 00:31:48,750 + 그 사이 정말 열심히 구분 및 토치의 은신처 그들은 우리는하지 않습니다 + +426 +00:31:48,750 --> 00:31:52,400 + 정말이 구별 모든 것을 그릴 수없는 것은 전체 그래서 그냥 모델입니다 + +427 +00:31:52,400 --> 00:31:59,750 + 네트워크 모듈이며, 또한 각각의 모듈 래리 그래서 모듈은 + +428 +00:31:59,750 --> 00:32:03,650 + 구현되는 틀에 박힌 생활 대신에 정의되어 단지 클래스는 + +429 +00:32:03,650 --> 00:32:08,880 + 그 대답의 API를 사용하므로 이러한 모듈은 꽤있어 서면 법부터입니다 + +430 +00:32:08,880 --> 00:32:13,260 + 여기에 많은 이해하기 쉬운 완전 이제 연결된 완전하다 + +431 +00:32:13,259 --> 00:32:17,039 + 래리 연결이 당신이 그냥 볼 수있는 생성자 + +432 +00:32:17,039 --> 00:32:23,210 + 가중치 및 바이어스 용으로이 텐서 API로 인해 텐트 설정 + +433 +00:32:23,210 --> 00:32:28,100 + 성화는 우리가 쉽게 이러한 모든 층보다 GPU와 CPU에 동일한 코드를 실행할 수 있습니다 + +434 +00:32:28,099 --> 00:32:32,359 + 단지 텐서 API의 관점에서 작성하고 Heasley 모두에서 실행됩니다 + +435 +00:32:32,359 --> 00:32:37,529 + 장치이므로 이러한 모듈은 전후 원경 구현해야 + +436 +00:32:37,529 --> 00:32:42,670 + 앞으로 아기 때문에 여기의 예제는 출력을 업데이트 부르기로 결정 + +437 +00:32:42,670 --> 00:32:47,250 + 의 전체 텍스트에 대한 업데이트 출력은 나중에 실제로 몇 가지 경우가있어 그들이 + +438 +00:32:47,250 --> 00:32:50,480 + 다시 비 대 나와 함께 여기 몇 가지 경우를 처리해야하는 날 + +439 +00:32:50,480 --> 00:32:55,170 + 다시 부분에 있지만, 다른 것보다 있지만, 사용하기 전에 반드시 숙지 아주 쉬워야한다 + +440 +00:32:55,170 --> 00:33:00,830 + 더 뒤로받는 방법 업데이트 대학원 입력 한 쌍의 거기 통과 + +441 +00:33:00,829 --> 00:33:03,970 + 상류 구배 및 계산하는 그라디언트 존중 + +442 +00:33:03,970 --> 00:33:09,160 + 입력 및 다시 그냥 텐서 API로 구현됩니다 그래서 그것은 매우 쉽게 + +443 +00:33:09,160 --> 00:33:14,279 + 일의 그 조금을 이해 그냥 같은 유형은 숙제 우리에보고 + +444 +00:33:14,279 --> 00:33:17,990 + 또한 구현하고 기울기를 계산하는 잡아 매개 변수를 축적 + +445 +00:33:17,990 --> 00:33:21,480 + 네트워크의 무게에 대하여 당신은 생성자에서 본대로 + +446 +00:33:21,480 --> 00:33:25,610 + 편견의 무게가 인스턴스 변수이 모듈에서 개최되며, + +447 +00:33:25,609 --> 00:33:30,309 + 대학원 매개 변수는 업스트림에서 그라디언트를 받게됩니다 축적과 축적 + +448 +00:33:30,309 --> 00:33:34,940 + 다시 상류 라디안과 관련하여 파라미터 그라디언트 + +449 +00:33:34,940 --> 00:33:39,809 + 단지 텐서 API를 사용하여 매우 간단합니다 + +450 +00:33:39,809 --> 00:33:44,200 + 토치 실제로 사용 가능한 다른 모듈의 톤 여기에 문서를 가지고 + +451 +00:33:44,200 --> 00:33:46,980 + 당신은 단지에 가면 날짜가 조금있을 수 있지만, 당신은 모든 확인할 수 있습니다 일어나서 + +452 +00:33:46,980 --> 00:33:51,460 + 파일은 당신에게 놀 수있는 모든 케이크를 제공하는 그는 실제로 얻을 수있어 + +453 +00:33:51,460 --> 00:33:55,930 + 많은 그래서 그냥 포인트 아웃이이 전쟁 전 그냥 날 추가 몇 업데이트 + +454 +00:33:55,930 --> 00:34:00,750 + 횃불은 항상 당신이 당신의 네트워크를 추가 할 수있는 새로운 모듈을 추가, 그래서 지난 주 + +455 +00:34:00,750 --> 00:34:06,390 + 이는 꽤 재미 있지만, 이러한 기존의 모듈이 충분하지 않은 때이다 + +456 +00:34:06,390 --> 00:34:10,579 + 그냥이를 구현할 수 있기 때문에, 그래서 자신을 쓰기 실제로 매우 쉽게 + +457 +00:34:10,579 --> 00:34:13,989 + 텐서 API를 사용하여 단지를 구현하는이 tenser를 사용하는 것 + +458 +00:34:13,989 --> 00:34:17,259 + 전후 그것은에 구현 층보다 훨씬 더 어렵다 + +459 +00:34:17,260 --> 00:34:21,890 + 그래서 여기에 숙제 단지 작은 예입니다 이것은 단지 걸리는 바보 모듈입니다 + +460 +00:34:21,889 --> 00:34:28,210 + 입력과 2로 곱하면 우리가 업데이트 그래프를 구현 볼 수 있습니다 + +461 +00:34:28,210 --> 00:34:31,849 + 템플릿과 지금 우리는 단지 스물 라인 새 레이어과 토크를 구현 한 + +462 +00:34:31,849 --> 00:34:35,929 + 코드는 그 정말 쉽게 후 다른 코드를 사용하는 것은 매우 쉽다 + +463 +00:34:35,929 --> 00:34:40,710 + 그냥 가져 와서 나는 당신의 네트워크를 추가 등등과 정말 멋진 할 수 있습니다 + +464 +00:34:40,710 --> 00:34:44,920 + 이것은 당신이 할 수있는 단지 텐서 API이기 때문에 이것에 대해 것은 무엇이든 종류 + +465 +00:34:44,920 --> 00:34:48,579 + 임의의 일이 당신 앞으로 이러한 내부 원하고 필요한 경우 뒤로 + +466 +00:34:48,579 --> 00:34:52,730 + 아마 코드 또는 아무것도 또는 루프 또는 복잡하고 부모를 위해해야​​ 할 일 + +467 +00:34:52,730 --> 00:34:56,980 + 어떤 어떤 종류의보다 밖으로 드롭 또는 합리화 확률 일 + +468 +00:34:56,980 --> 00:34:59,949 + 당신은 기대 할 뒤로 당신을 통과 코드의 어떤 종류 + +469 +00:34:59,949 --> 00:35:03,500 + 그것은 일반적으로 매우 간단 매우 그래서 이러한 모듈 내부에 직접 구현 + +470 +00:35:03,500 --> 00:35:11,500 + 토치 있도록하지만, 물론 선수들과 토치의 자신의 새로운 유형을 쉽게 구현할 수 + +471 +00:35:11,500 --> 
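For the custom-layer example above (a toy module that just multiplies its input by two, implementing updateOutput and updateGradInput against the tensor API in about twenty lines of Lua), here is a hedged Python analogue using torch.autograd.Function, where forward and backward play the roles of those two Lua methods. This is an illustration, not the lecture's code:

```python
import torch

class MulByTwo(torch.autograd.Function):
    """Toy layer in the spirit of the 'multiply the input by two'
    module the captions describe for Lua Torch."""

    @staticmethod
    def forward(ctx, x):
        return 2 * x  # analogous to updateOutput

    @staticmethod
    def backward(ctx, grad_output):
        # Analogous to updateGradInput: upstream gradient times the
        # local gradient, which is 2 everywhere for this layer.
        return 2 * grad_output

x = torch.randn(4, 3, requires_grad=True)
y = MulByTwo.apply(x).sum()
y.backward()
print(x.grad)  # a tensor of twos
```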
00:35:14,250 + 자신의 개별 레이어를 사용하는 것은 매우 유용하지 않습니다 + +472 +00:35:14,250 --> 00:35:16,960 + 우리는 더 큰 네트워크로 함께 스티치 할 사람이 필요 + +473 +00:35:16,960 --> 00:35:21,220 + 지금까지이 성화는 우리가 이미 앞의 예에서 일을보고 용기를 사용 + +474 +00:35:21,219 --> 00:35:26,549 + 이는이 순차적 컨테이너 그래서 결과적 컨테이너 그냥 스택이다 + +475 +00:35:26,550 --> 00:35:29,950 + 모든 우리는 이전의 출력을 수신하여 하나있어 모듈 + +476 +00:35:29,949 --> 00:35:35,639 + 하나는 그냥 가서 그게 아마 당신이 수도 가장 일반적으로 사용되는 또 다른 하나 + +477 +00:35:35,639 --> 00:35:40,799 + 볼이 입력이 있고 경우이 부모가 어쩌면 테이블이 성기입니다 + +478 +00:35:40,800 --> 00:35:44,289 + 댄 동일한 입력에 다른 모듈에 다른 적용 할 + +479 +00:35:44,289 --> 00:35:49,099 + 콘텐츠 테이블 당신은 그렇게 당신은 다른 출력 셀레스트를받는 + +480 +00:35:49,099 --> 00:35:53,280 + 당신이 입력의 목록이있는 경우는 병렬 테이블로 볼 수 있습니다 당신이 원하는 + +481 +00:35:53,280 --> 00:35:57,500 + 다음 다른에리스트의 각 요소를 다른 모듈을 수행 할 수 있습니다 적용 + +482 +00:35:57,500 --> 00:36:04,588 + 상황이 오면 건설의 종류에 대한 병렬 타 보르 테이블을 사용하지만, + +483 +00:36:04,588 --> 00:36:08,980 + 정말 그렇게 실제로 내가 당신에게하는 그 용기를 복잡 + +484 +00:36:08,980 --> 00:36:13,480 + 이론적으로 쉽게해야하는 보조 당신을 사과 단지에 대해 구현하는 것이 가능합니다 + +485 +00:36:13,480 --> 00:36:16,980 + 원하는하지만 정말 복잡 묶는 연습에 정말 털이 수 있습니다 + +486 +00:36:16,980 --> 00:36:21,480 + 토치 페넌트라는 다른 패키지를 제공하므로 그 용기를 사용하는 것 + +487 +00:36:21,480 --> 00:36:23,230 + 당신이 훅 수 있습니다 그래프 + +488 +00:36:23,230 --> 00:36:28,210 + 컨테이너 훅 가지 더 복잡한 토폴로지 아주 쉽게 그렇게 + +489 +00:36:28,210 --> 00:36:32,400 + 우리가있는 경우 여기에 우리가 3 개의 입력이있는 경우 어쩌면 우리가 하나를 생성 할 예입니다 + +490 +00:36:32,400 --> 00:36:36,930 + 출력과 우리는 아주 간단 업데이트 규칙을 생성하려는 + +491 +00:36:36,929 --> 00:36:40,379 + 우리는 많은 봤어요 계산 그래프의이 유형에 해당 + +492 +00:36:40,380 --> 00:36:44,869 + 당신이 실제로 구현할 수 있도록 문제의 다른 유형에 대한 강의에서 배 + +493 +00:36:44,869 --> 00:36:49,430 + 이 단지 사용하여 고급 병렬 및 순차 및 테이블 성기하지만 그것을 + +494 +00:36:49,429 --> 00:36:53,009 + 당신 싶어이 같은 일을 할 때 그래서 아주의 질량의 종류 수 + +495 +00:36:53,010 --> 00:36:58,470 + 이 그래프 코드가 대신 있도록 그래프를 전송하는 것이 일반적 그래서 여기이 매우 쉽습니다 + +496 +00:36:58,469 --> 00:37:03,179 + 함수는 그래프를 사용하여 모듈을 구축하고 그래서 여기를 반환하는 것입니다 + +497 +00:37:03,179 --> 00:37:09,129 + 우리는 그래프 패키지를 가져온 다음 여기 안에이 돈의 비트입니다 + +498 +00:37:09,130 --> 00:37:14,329 + 구문은 그래서 이것은 실제로는이 상징적 인 변수를 발견하는 텐서 아니다 + +499 +00:37:14,329 --> 00:37:19,480 + 그래서 이것은 우리의 우리 텐트 또는 객체로 XY 및 Z를 받으려고하는 것을 말하고있다 + +500 +00:37:19,480 --> 00:37:25,300 + 입력과 현재 점유율은 실제로 그렇게 그 입력에 상징적 인 작업을하고 있었다 + +501 +00:37:25,300 --> 00:37:26,840 + 여기에 우리가 말을하는지 + +502 +00:37:26,840 --> 00:37:32,700 + 우리는 우리가 두 번 연주 한 할 X & Y의 점별 버전을 가지고 싶어 + +503 +00:37:32,699 --> 00:37:38,159 + 및 수 ANZ 저장소의 곱셈과 A & B의 지금 점별 판과 + +504 +00:37:38,159 --> 00:37:42,159 + 이러한 실제적인 tenser이 지금의 객체이다 다시 저장하고보고 + +505 +00:37:42,159 --> 00:37:45,109 + 당신이 구축하는 데 사용되는 기호 참조의 종류 + +506 +00:37:45,110 --> 00:37:50,420 + 계산 백그라운드에서 그래프와 지금 우리가 실제로 반환 할 수 있습니다 + +507 +00:37:50,420 --> 00:37:55,159 + 우리는 우리의 모듈은 입력 XY 및 Z 출력을 것이라고 여기 모듈 + +508 +00:37:55,159 --> 00:38:00,920 + 볼이 엔드 IG 모듈은 실제로 우리에 부합하는 객체를 제공합니다 + +509 +00:38:00,920 --> 00:38:05,559 + 그럼 우리가 구축 한 후 그 계산을 구현하는 모듈 API + +510 +00:38:05,559 --> 00:38:10,619 + 몬트리올 우리는 콘크리트 법원이 답변을 지른 구성 할 수 있습니다 다음에 그들을 먹여 + +511 +00:38:10,619 --> 00:38:19,170 + 실제로 기능을 너무 횃불 사실은 꽤 계산됩니다 모듈 + +512 +00:38:19,170 --> 00:38:22,670 + 초반 이었죠 모델을 잘 할 수있는 패키지라는로드 캠페인이있다 + +513 +00:38:22,670 --> 00:38:27,050 + 당신은 카페에서 사전 시험 모델의 많은 다른 유형을로드하고는거야 + +514 +00:38:27,050 --> 00:38:31,590 + 그들의 고문의 등가물로 변환 당신은 카페를로드 할 수 있습니다 + +515 +00:38:31,590 --> 00:38:35,539 + 제품 내선 및 카페 모델 파일과는의 거대한 스택에 설정합니다 + +516 +00:38:35,539 --> 00:38:39,929 + 연속 시장은, 카페 슈퍼 일반 보하지로드하고 특정 작동 + +517 +00:38:39,929 --> 00:38:44,649 + 네트워크 그러나 특정 부하 카페의 종류는 알렉스를하지로드 할 것이며, + +518 +00:38:44,650 --> 
00:38:49,660 + 그들은 아마 가장 일반적으로 몇 가지있어 수 있도록 캠페인 및 PGG이 있습니다 사용 + +519 +00:38:49,659 --> 00:38:54,259 + 또한 몇 가지 다른 구현 당신은 횃불로 구글 매트에로드 + +520 +00:38:54,260 --> 00:38:58,520 + 당신이 재시도 구글을로드 할 수 있도록 그 모델 성화에 실제로 매우 + +521 +00:38:58,519 --> 00:39:01,869 + 최근 페이스 북은 잔류 네트워크를 나서서 다시 구현 + +522 +00:39:01,869 --> 00:39:07,900 + 바로 횃불 최대 알렉스 사이 때문에 그들은 그것을 위해 초반 이었죠 모델 출시 + +523 +00:39:07,900 --> 00:39:11,849 + BG 그룹과 ResNet에서 캠페인되지 그게 아마 모든 것을 당신을 생각 + +524 +00:39:11,849 --> 00:39:17,869 + 이다 대부분의 사람들이 다른 점을 이용하려는 모든 초반 이었죠 모델 필요 + +525 +00:39:17,869 --> 00:39:21,549 + 횃불이 미끼를 사용하기 때문에 우리는 패키지를 설치 핍 사용하고있을 수 없습니다 + +526 +00:39:21,550 --> 00:39:24,920 + 쉽게 새로운 설치할 것 막사라는 또 다른 매우 유사한 아이디어 + +527 +00:39:24,920 --> 00:39:26,750 + 업데이트 패키지를 패키지 + +528 +00:39:26,750 --> 00:39:29,650 + 그것은 아주 사용하기 매우 쉽습니다 + +529 +00:39:29,650 --> 00:39:34,079 + 이 종류의 나는 매우 유용한 몇 가지 패키지 단지 목록입니다 + +530 +00:39:34,079 --> 00:39:38,349 + 이름으로이 취소 할 수 있도록 성화는 5 파일을 읽고 HDR 쓸 수 있습니다 + +531 +00:39:38,349 --> 00:39:44,640 + 당신은 인 트위터 autorad에서이 재미 하나가 거기에 읽기 및 JSON을 쓸 수 있습니다 + +532 +00:39:44,639 --> 00:39:47,980 + 조금하지만 사용하지 않은 얘기 할 동물처럼 조금 + +533 +00:39:47,980 --> 00:39:52,369 + 하지만 가지를보고 멋진 실제로 페이스 북은 꽤 유용이 + +534 +00:39:52,369 --> 00:39:57,849 + 횃불에 대한 라이브러리는 또한 오십 회선과를 구현하면서 + +535 +00:39:57,849 --> 00:40:01,548 + 데이터 병렬 모델 병렬 처리를 구현 + +536 +00:40:01,548 --> 00:40:07,449 + 그래서 그 횃불에 이렇게 아주 일반적인 워크 플로우가 꽤 꽤 좋은 일이 + +537 +00:40:07,449 --> 00:40:11,239 + 당신이 있습니다 종종 몇 가지 전처리 스크립트를 가지고 그 피칸 것이다 + +538 +00:40:11,239 --> 00:40:15,818 + 전처리 데이터와는 일반적으로 책상 HDL 5에 몇 가지 좋은 형식으로 그것을 덤프 + +539 +00:40:15,818 --> 00:40:20,528 + 큰 일 제이슨 작은 것들에 대해 당신은 내가 일반적으로 쓰기 것이다 것 + +540 +00:40:20,528 --> 00:40:25,318 + 최대 낮은에서 훈련에있는 모든 HDL 5에서 읽고 모델을 학습하고 최적화 + +541 +00:40:25,318 --> 00:40:30,088 + 모델과는 체크 포인트에게 책상을 저장하고 일반적으로 좀 평가가 + +542 +00:40:30,088 --> 00:40:35,019 + 기차 모델을로드하고 그래서 뭔가 유용한 경우 그것을하지 스크립트 + +543 +00:40:35,019 --> 00:40:39,000 + 워크 플로우의이 유형에 대한 연구는 내가 일주일 전에 GitHub의에 올려이 프로젝트 + +544 +00:40:39,000 --> 00:40:43,969 + 즉 문자 수준의 언어 모델과 토치 그래서 여기에 거기를 구현 + +545 +00:40:43,969 --> 00:40:48,239 + HTML 5에 텍스트 파일을 변환 전처리 스크립트는 거기에 파일을 + +546 +00:40:48,239 --> 00:40:52,889 + HTML5에 대한로드하고이 재발 네트워크와 열차 훈련 스크립트 + +547 +00:40:52,889 --> 00:40:57,190 + 그건 있도록 검사 점은 세금이 생성 최대로드 샘플링 스크립트가있다 + +548 +00:40:57,190 --> 00:41:03,720 + 즉, 빠른 장점과 단점 나는 나의 일반적인 워크 플로우 및 토치와 같은 종류의 + +549 +00:41:03,719 --> 00:41:07,169 + 그 유혹은 사람들을위한 큰 분기점하지만 나는하지 않는 것이 고문에 대해 말할 것입니다 + +550 +00:41:07,170 --> 00:41:11,690 + 큰 거래는 확실히 덜 플러그가 있다고 실제로 생각하고 그래서 카페에서 재생 + +551 +00:41:11,690 --> 00:41:15,760 + 당신은 아마 조금 전형적으로 자신의 코드를 많이 작성하게 될 겁니다 + +552 +00:41:15,760 --> 00:41:20,028 + 또한 더 많은 오버 헤드하지만 비트는 당신이 모듈을 많이 가지고 더 많은 유연성을 제공 + +553 +00:41:20,028 --> 00:41:24,278 + 플러그 앤 플레이하기 쉽고 조각 표준 라이브러리처럼 + +554 +00:41:24,278 --> 00:41:26,880 + 이 모든 푸른로 작성하기 때문에 읽기가 매우 간단하고 아주 쉽게 + +555 +00:41:26,880 --> 00:41:31,740 + 아주 좋은 초반 이었죠 모델이 많이있어 이해하지만, + +556 +00:41:31,739 --> 00:41:34,598 + 불행하게도 그것의 재발 네트워크를 사용하는 것이 조금 어색입니다 + +557 +00:41:34,599 --> 00:41:38,640 + 일반 그래서 당신은 당신이 여러 개의 모듈을하고자 할 때 한 달 가지고 싶어 할 때 + +558 +00:41:38,639 --> 00:41:42,028 + 서로 공유하는 가중치는 실제로이와 횃불을 할 수 있지만입니다 + +559 +00:41:42,028 --> 00:41:42,469 + 그것은 종류의 + +560 +00:41:42,469 --> 00:41:47,199 + 취성 그 아마의 그, 그래서 당신은 거기에 미묘한 버그로 실행할 수 있습니다 + +561 +00:41:47,199 --> 00:41:49,649 + 가장주의해야 할 점은 재발 네트워크가 까다로운 일이 될 수 있다는 것입니다 + +562 +00:41:49,650 --> 00:42:15,800 + 어떤 어떤 토치에 대한 질문이 그래 그래하지만 질문에서이 아니다 + +563 +00:42:15,800 --> 00:42:21,570 + 그, 그래서 네 루프와 피칸를 잘 해석하는 방법 나쁜 방법에 대해이었다 + +564 +00:42:21,570 --> 00:42:24,359 + 즉, 해석 
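On the workflow described above (a Python preprocessing script that dumps data to HDF5 for the Torch training script to read, as in the torch-rnn style project mentioned in the captions), here is a minimal sketch of the preprocessing side with h5py; the file and dataset names are made up for illustration:

```python
import h5py
import numpy as np

# Hypothetical preprocessing output: integer-encoded training data,
# dumped to HDF5 so a separate (Lua) Torch training script can read it.
data = np.random.randint(0, 128, size=(1000000,))
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('train', data=data)
```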
있기 때문에이 파이썬에서 정말 나쁜 이유에 대해 정말 + +565 +00:42:24,358 --> 00:42:27,139 + 실제로 메모리 할당과 다른 꽤 많은 일 루푸스에 대한 모든 + +566 +00:42:27,139 --> 00:42:31,960 + 무대 뒤에서 일하지만 만약 당신이 자바 스크립트를 사용 한 경우 다음 루프와 + +567 +00:42:31,960 --> 00:42:35,059 + 자바 스크립트는 꽤 빨리하는 경향이 있기 때문에 런타임 실제로 단지 + +568 +00:42:35,059 --> 00:42:39,759 + 자바 스크립트에서 루프 그래서 네이티브 코드까지 즉석에서 코드를 컴파일 + +569 +00:42:39,760 --> 00:42:44,520 + 정말 빠르고 유체 및 루 실제로 어디 정렬거야 유사한 메커니즘을 가지고 + +570 +00:42:44,519 --> 00:42:49,588 + 인간의 유전자 코드의 자동 마술 컴파일 된 코드 입술 때문에 + +571 +00:42:49,588 --> 00:42:53,608 + 정말 빨리하지만 난 여전히 사용자 지정 벡터화 코드를 쓰고 있다고 할 수 있습니다 + +572 +00:42:53,608 --> 00:43:01,619 + 우리가 가진 모든 권한을 당신에게 속도 업을 많이주고 지금 아마 반 시간은 왼쪽 + +573 +00:43:01,619 --> 00:43:06,420 + 아니 우리가 그래서 옆에 시간이 부족하고, 그래서 두 개 더 프레임 워크를 커버 + +574 +00:43:06,420 --> 00:43:12,000 + 내가 아는 같은 건 몬트리올 대학에서 여호수아 밴조 그룹에서이며, + +575 +00:43:12,000 --> 00:43:16,250 + 우리는 그래프 innn 조금을 보았다 그래서 계산 그래프에 대해 정말 전부 + +576 +00:43:16,250 --> 00:43:19,559 + 토치에서 그 계산 공예 함께 스티치이 꽤 좋은 방법입니다 + +577 +00:43:19,559 --> 00:43:24,139 + 큰 복잡한 아키텍처와 Fionna 정말 계산이 아이디어에 소요 + +578 +00:43:24,139 --> 00:43:29,409 + 그래픽 및 실행 그것 극단적하고 또한 약간 높은 수준의 라이브러리를 갖는다 + +579 +00:43:29,409 --> 00:43:33,940 + 여기에 같은 계산이 그래서 부족뿐만 아니라에 터치합니다 라자냐입니다 + +580 +00:43:33,940 --> 00:43:38,570 + 우리가 전에 그래프의 맥락에서 본 공예 우리는 실제로 통해 걸을 수 + +581 +00:43:38,570 --> 00:43:43,400 + 2010 년이의 구현은 그래서 당신은 여기에 우리가 가져 오는 것을 볼 수 있습니다 + +582 +00:43:43,400 --> 00:43:49,440 + fiato과 fiato의 tenser 객체와 지금 여기에 우리가 같이 XY 및 Z를 정의하고 + +583 +00:43:49,440 --> 00:43:53,099 + 이 실제로 말과 매우 유사 기호와 같은 기호 변수 + +584 +00:43:53,099 --> 00:43:55,530 + 그래프의 예를 우리는 불과 몇 슬라이드 전에 보았다 + +585 +00:43:55,530 --> 00:43:59,500 + 이 실제로되도록하는 것은 이러한 종류의 상징적 인 개체 인상 NumPy와하지 + +586 +00:43:59,500 --> 00:44:05,690 + 계산 잔디에서 다음 우리가 할 수 실제로 컴퓨터 이러한 출력에 + +587 +00:44:05,690 --> 00:44:11,679 + XY 및 Z는이 상징적 인 일이며 우리는 AB & C를 계산할 수 있습니다 상징적 있도록 + +588 +00:44:11,679 --> 00:44:15,769 + 바로 이러한 오버로드 된 연산자를 사용하고이를 구축 할 수 있습니다 + +589 +00:44:15,769 --> 00:44:19,929 + 우리가 구축 한 후, 일단 백그라운드에서 계산 그래프 우리 + +590 +00:44:19,929 --> 00:44:23,839 + 계산 공예 우리는 사실에 그것의 특정 부분을 실행할 수 있도록하려면 + +591 +00:44:23,840 --> 00:44:29,240 + 실제 데이터는 그래서 우리는 그래서이 약 말하고있는이 양극 이상한 함수 일 전화 + +592 +00:44:29,239 --> 00:44:33,269 + 우리는 입력 XY 및 Z를 취할 것입니다 우리의 기능을 할 그것을 생산합니다 + +593 +00:44:33,269 --> 00:44:38,329 + 출력이 우리가 평가할 수있는 실제 파이썬 함수를 반환합니다 참조 + +594 +00:44:38,329 --> 00:44:42,239 + 실제 데이터와 나는이 정말 지적하고 싶은 경우 모든 마법과 + +595 +00:44:42,239 --> 00:44:46,319 + Fionna 당신이 함수를 호출 할 때 미친 미친 일을 할 수 있다는 일어나고 + +596 +00:44:46,320 --> 00:44:49,580 + 물건은 그것이 더 확인하기 위해 계산 그래프를 단순화 할 수 있습니다 + +597 +00:44:49,579 --> 00:44:54,199 + 효율적인 실제로 상징적 내가 가식과 다른 것들과 산기 수 있습니다 + +598 +00:44:54,199 --> 00:44:58,319 + 당신이 그것을 연결하는 함수를 호출 할 때 실제로 그렇게 네이티브 코드를 생성 할 수 있습니다 + +599 +00:44:58,320 --> 00:45:02,450 + 실제로 때로는 항공편에서 코드를 컴파일 때문에 GPU에 비공식적 있습니다 + +600 +00:45:02,449 --> 00:45:06,389 + 모든 마법과 Fiano 정말이 작은 죄에서이오고있다 + +601 +00:45:06,389 --> 00:45:11,750 + 파이썬에서 문을 찾고 있지만 여기에 후드 일이 많이있다 및 + +602 +00:45:11,750 --> 00:45:14,710 + 지금 한 번 우리는이 미친 물건을 통해이 마법의 기능을 쪘 + +603 +00:45:14,710 --> 00:45:19,159 + 우리는 단지 우리가 인스턴스화 그래서 여기 인상보다 실제 수에서 실행할 수 있습니다 + +604 +00:45:19,159 --> 00:45:25,440 + xxyyxx 실제 수보다 높은 등급으로 실제 쉽고 그 다음 우리는 막을 수 + +605 +00:45:25,440 --> 00:45:30,639 + 이러한 실제 번호를 전달하는 우리의 기능은 이것을 값을 얻을 수 있습니다 + +606 +00:45:30,639 --> 00:45:35,359 + 파이썬에서 폭발적으로 이러한 계산을하고 같은 일을하고있다 + +607 +00:45:35,360 --> 00:45:39,289 + 것을 제외하고 최종 버전으로 인해 모든 마법에 훨씬 더 효율적이 될 수 + +608 +00:45:39,289 --> 00:45:42,840 + 후드와 피아노 버전에서 실제로 GPU 경우에 실행 될 수있다 + +609 +00:45:42,840 --> 00:45:47,289 + 
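A minimal sketch of the Theano pattern described above: symbolic variables whose overloaded operators build a computational graph in the background, and a compiled function that evaluates part of that graph on real numpy data. The exact combination of elementwise ops is an assumption, since the captions are garbled on that point.

```python
import numpy as np
import theano
import theano.tensor as T

# Symbolic variables: graph nodes, not actual data.
x = T.matrix('x')
y = T.matrix('y')
z = T.matrix('z')

# Overloaded operators build the computational graph behind the scenes.
a = x * y     # elementwise product
b = a + z     # elementwise sum
c = b.sum()   # scalar output (assumed combination)

# Compiling turns the graph into a callable Python function.
f = theano.function([x, y, z], c)

# Evaluate on real numpy arrays.
xx = np.random.randn(3, 4)
print(f(xx, np.random.randn(3, 4), np.random.randn(3, 4)))
```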
당신은 구성하지 않은 그러나 불행하게도 우리가 정말 걱정하지 않는다 + +610 +00:45:47,289 --> 00:45:51,659 + 우리가 여기 먹으 렴 알고 싶어이 같은 일을 계산하는 것은의 예 + +611 +00:45:51,659 --> 00:45:57,629 + (10)에서 간단한 도구 공기 풍선 그래서 아이디어는 우리가 가고있는이 동일하다 + +612 +00:45:57,630 --> 00:46:02,860 + 우리의 입력을 선언하지만 지금은 대신 그냥 XY 및 Z 우리가 우리의 입력 구문 + +613 +00:46:02,860 --> 00:46:06,490 + 더 나은 우리의 라벨 및 Y + +614 +00:46:06,489 --> 00:46:11,009 + 행렬 W & W 너무 그래서 우리는 그저 설정하는이 상징적 인 체중한다 + +615 +00:46:11,010 --> 00:46:17,540 + 변수는 우리의 계산 잔디의 요소가 될 것이다 지금 44 패스 우리를 + +616 +00:46:17,539 --> 00:46:21,179 + 좀 NumPy와 같습니다하지만 기호에없는 기괴한 작업이다 + +617 +00:46:21,179 --> 00:46:24,669 + 그래서 여기 백그라운드에서 그래프를 구축 계산되는 개체 + +618 +00:46:24,670 --> 00:46:28,909 + 이과 활성화. 방법 행렬 곱셈하지만 우리는 상징적 필요가 + +619 +00:46:28,909 --> 00:46:33,210 + 개체는 우리는이 라이브러리 함수를 사용하여 실제 문제를하고있는 우리는있어 + +620 +00:46:33,210 --> 00:46:37,769 + 또 다른 행렬의 곱셈을하고 우리는 실제로 손실을 계산할 수 있습니다 + +621 +00:46:37,769 --> 00:46:41,210 + 다시 이러한 몇 가지 다른 라이브러리 기능을 이용하여 확률과 로스 + +622 +00:46:41,210 --> 00:46:44,349 + 까지 구축하고 상징적 인 개체에 대한 모든 작업은 + +623 +00:46:44,349 --> 00:46:50,420 + 우리는 우리의 기능 때문에이 기능을 컴파일 할 수 있도록 전산 잔디 + +624 +00:46:50,420 --> 00:46:54,570 + 걸릴 것입니다 우리의 데이터는 레이블이며, 행렬을 가중하는 우리의 28 요소는 + +625 +00:46:54,570 --> 00:46:58,890 + 및 풋 출력으로 나는 손실과 스칼라 우리를 반환합니다 + +626 +00:46:58,889 --> 00:47:04,109 + 분류 점수 벡터에서 지금 우리가 실제 데이터에이 일을 실행할 수 있습니다 + +627 +00:47:04,110 --> 00:47:07,559 + 우리는 이전 슬라이드에서 본 것처럼 우리는 몇 가지 실제 수의 I를 인스턴스화 할 수 + +628 +00:47:07,559 --> 00:47:13,759 + 함수에 제기하고 전달 그래서 이것은 큰하지만 이것은 단지입니다 + +629 +00:47:13,760 --> 00:47:17,820 + 네 번째는 실제로 그렇게이 네트워크 및 컴퓨터 생기를 양성 할 수 있도록 + +630 +00:47:17,820 --> 00:47:23,000 + 여기에 우리가 너무이 동일하다고 할 코드의 몇 라인을 추가 할 필요가 + +631 +00:47:23,000 --> 00:47:27,170 + 우리는 우리가 정의하고있어 이전과 같이 우리의 입에 대한 상징적 인 변수는 + +632 +00:47:27,170 --> 00:47:29,510 + 우리의 무게 등 우리는 함께있어 + +633 +00:47:29,510 --> 00:47:33,980 + 전에 같은 4 패스를 실행하는 컴퓨터의 법률에 손실을 계산하기 + +634 +00:47:33,980 --> 00:47:37,920 + 상징적으로 차이가 우리가 실제로 할 수있는 알고 + +635 +00:47:37,920 --> 00:47:43,680 + 여기에 상징적 인 차별화 그래서이에 우리가 내가 말하는 거 야 디 W 하나 TW입니다 + +636 +00:47:43,679 --> 00:47:47,129 + 우리가 그는 손실의 성분의 기울기가되고 싶어요 것을 알고있다 + +637 +00:47:47,130 --> 00:47:52,280 + 그래서이 두 개의 W 하나의 최소 W 그 다른 상징적 인 변수에 대하여 + +638 +00:47:52,280 --> 00:47:52,930 + 정말 멋진 + +639 +00:47:52,929 --> 00:47:56,549 + fiato는 그래프의 어떤 부분에 임의의 그라데이션을 할 수 있습니다 + +640 +00:47:56,550 --> 00:48:00,289 + 그래프의 다른 부분에 대한 새로운으로 그 도입 도입하지 + +641 +00:48:00,289 --> 00:48:05,190 + 그래프의 기호 변수는 당신이 정말로 그와 함께 미친 갈 수 있도록하지만, + +642 +00:48:05,190 --> 00:48:09,470 + 여기이 경우 우리는 지금 출력으로 그 캐나다인을 반환거야 + +643 +00:48:09,469 --> 00:48:14,049 + 우리는 다시 새로운 기능을 컴파일거야 우리의 입력이 걸릴 것입니다 + +644 +00:48:14,050 --> 00:48:19,510 + 입력 입력 픽셀 자루와 우리의 레이블 왜 28 행렬과 함께 + +645 +00:48:19,510 --> 00:48:23,140 + 지금은 우리의 손실을 반환 할 것 이러한 분류 점수와 + +646 +00:48:23,139 --> 00:48:28,250 + 지금 우리는 실제로 매우 간단한 훈련이 설정을 사용할 수있는 두 가지 성분 있도록 + +647 +00:48:28,250 --> 00:48:32,809 + 신경 네트워크는 그래​​서 우리는 실제로 단지 그라데이션을 구현 그라데이션 하강을 사용할 수 있습니다 + +648 +00:48:32,809 --> 00:48:36,630 + 이 계산을 사용하여이이를 이용하여 단지 몇 줄의 하강 + +649 +00:48:36,630 --> 00:48:38,990 + 그래서 여기에 우리가있어 잔디 + +650 +00:48:38,989 --> 00:48:43,599 + 데이터 세트 및 요인에 대한 인상보다 실제 수를 인스턴스화 + +651 +00:48:43,599 --> 00:48:45,489 + 로 다시 어떤 임의의 행렬 + +652 +00:48:45,489 --> 00:48:49,839 + 실제 수는 더 높은 올릴 때 우리는 지금 우리가이 전화를 걸 때마다 물어 + +653 +00:48:49,840 --> 00:48:50,519 + 돌아 가야 + +654 +00:48:50,519 --> 00:48:54,710 + NumPy와 배열은 지금 우리가 그 손실 및 점수와 그라데이션을 포함한다 + +655 +00:48:54,710 --> 00:48:57,800 + 그라디언트를 우리는 단지 간단한 그라데이션 업데이트를 우리의 가중치에 만들 수 있습니다 + +656 +00:48:57,800 --> 00:49:01,970 + 및 조치는 우리의 네트워크를 훈련 골목 - OOP를 
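Putting the two-layer-network walkthrough above into one sketch (with assumed sizes): symbolic data, labels, and weight matrices, a forward pass built from library ops, symbolic gradients from T.grad, and a training loop that keeps the weights in numpy, which is exactly the copy-in, copy-out traffic the captions complain about next.

```python
import numpy as np
import theano
import theano.tensor as T

# Symbolic inputs: data, labels, and the two weight matrices.
X = T.matrix('X')
y = T.ivector('y')
w1 = T.matrix('w1')
w2 = T.matrix('w2')

# Forward pass written symbolically with library ops.
hidden = T.maximum(X.dot(w1), 0)  # ReLU
scores = hidden.dot(w2)
probs = T.nnet.softmax(scores)
loss = T.nnet.categorical_crossentropy(probs, y).mean()

# Theano differentiates the graph symbolically.
dw1, dw2 = T.grad(loss, [w1, w2])

f = theano.function([X, y, w1, w2], [loss, scores, dw1, dw2])

# Training loop: weights live in numpy, so every call ships weights in
# and gradients back out (the CPU/GPU overhead discussed below).
ww1 = 1e-2 * np.random.randn(1000, 100)
ww2 = 1e-2 * np.random.randn(100, 10)
xx = np.random.randn(64, 1000)
yy = np.random.randint(10, size=64).astype(np.int32)
lr = 1e-2
for t in range(20):
    loss_val, _, g1, g2 = f(xx, yy, ww1, ww2)
    ww1 -= lr * g1
    ww2 -= lr * g2
```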
약속하지만 실제로있다 + +657 +00:49:01,969 --> 00:49:06,039 + 이 문제의 큰 당신이 할 수있는 GPU를 누군가에 실행중인 특히 + +658 +00:49:06,039 --> 00:49:15,599 + 완전히 문제를 손실 할 사람이 실제로 많이 초래된다는 것이다 + +659 +00:49:15,599 --> 00:49:21,059 + CPU와 GPU 사이의 통신 오버 헤드를 통해 때문에 우리가 우리 때마다 + +660 +00:49:21,059 --> 00:49:24,799 + 기능이 전화 그리고 우리는 복사 먹으 렴 다시이 그라디언트를 얻을 수 + +661 +00:49:24,800 --> 00:49:29,720 + 다시 CPU에 GPU에서 그라디언트 나는 비용이 많이 드는 작업 할 수 있고, + +662 +00:49:29,719 --> 00:49:35,000 + 이제 우리는 실제로 우리의 그라데이션 중지 만들고있어이 너무 NumPy와의 CPU 계산이다 + +663 +00:49:35,000 --> 00:49:38,190 + 우리가 그 기울기 업데이트를 우리의 매개 변수를 할 수 있도록 할 수 있다면 정말 좋을 텐데 + +664 +00:49:38,190 --> 00:49:45,389 + 실제로 직접 GPU 길에 우리가 그 Fiano이 함께 있음 + +665 +00:49:45,389 --> 00:49:50,619 + 공유 변수라고이 학교 일에 그래서 변수가 다른 것입니다 공유 + +666 +00:49:50,619 --> 00:49:54,230 + 네트워크 실제로 일부는 내부 연산 사는 값 + +667 +00:49:54,230 --> 00:49:59,340 + 공예 실제로이 실제로이 그래서 여기에 전화 통화에서 지속 + +668 +00:49:59,340 --> 00:50:04,150 + 매우 유사 이제 정의 된 우리 같은 상징적 인 변수 X & Y 전 + +669 +00:50:04,150 --> 00:50:08,769 + 데이터와 라벨과 지금 우리가 새로운 펑키의 몇 가지를 정의하고 대한 + +670 +00:50:08,769 --> 00:50:13,809 + 행렬 및 초기화를 가중하는 우리를위한 것 펑키 공유 변수를 + +671 +00:50:13,809 --> 00:50:19,110 + NumPy와 이러한 가중치 행렬을 제기하고 지금이 이전과 동일 + +672 +00:50:19,110 --> 00:50:22,910 + 이들을 사용하여 포워드 패스를 산출 여기서 이전과 동일한 코드이다 + +673 +00:50:22,909 --> 00:50:24,980 + 라이브러리 함수는 상징적이다 + +674 +00:50:24,980 --> 00:50:30,940 + 그라디언트 그러나 우리는 지금이 그래서 우리의 함수를 정의하는 방법을 지금의 차이에 + +675 +00:50:30,940 --> 00:50:32,269 + 컴파일 기능 + +676 +00:50:32,269 --> 00:50:36,780 + 만 가중치를 수신하지 않는 수신하고 그 실제로 살고 둔다 + +677 +00:50:36,780 --> 00:50:41,320 + 연산 그래프 내부 대신 우리는 단지 데이터 및 상기 수신 데이터 + +678 +00:50:41,320 --> 00:50:45,210 + 그리고 라벨과 지금 우리는 출력보다 오히려 손실을 넣어 가고있다 + +679 +00:50:45,210 --> 00:50:49,639 + 성분은 명시 적으로 대신 우리가 실제로 이러한 업데이트 규칙을 제공하는 그들 + +680 +00:50:49,639 --> 00:50:53,819 + 이 업데이트 규칙을 알 수 있도록 함수가 호출 될 때마다 실행해야 우리 + +681 +00:50:53,820 --> 00:50:57,920 + 이 그냥 그래서 상징적 인 변수에서 작동 작은 기능 + +682 +00:50:57,920 --> 00:51:02,010 + 우리는 그가 산타를 만드는 것해야한다는 것은 하나의 최소 W W 업데이트 할 중지 + +683 +00:51:02,010 --> 00:51:09,290 + 이 우리가이 계산 그래프를 실행할 때마다 그래서 지금 매주 업데이트 및 기록 + +684 +00:51:09,289 --> 00:51:12,880 + 우리가해야 할 모든 반복마다이 함수를 호출 인이 네트워크를 훈련 + +685 +00:51:12,880 --> 00:51:16,869 + 우리가 함수를 호출 할 때 사람들은 그렇게의 방법에 그라데이션 중지 할 것 + +686 +00:51:16,869 --> 00:51:21,210 + 우리는 그냥 반복해서 그냥이 일을 호출하여이 네트워크를 시도 할 수 있습니다 + +687 +00:51:21,210 --> 00:51:23,769 + 당신이 이런 종류의 일을 할 때 할 때 연습하고 난 당신거야 알고 + +688 +00:51:23,769 --> 00:51:27,579 + 종종 다음 무게를 업데이트하고 우리의 교육 함수 호출을 정의 + +689 +00:51:27,579 --> 00:51:31,719 + 난 그냥 점수를 넣어 당신이 할 수있는 업데이트를하지거야 기능을 평가 + +690 +00:51:31,719 --> 00:51:34,609 + 실제로이 컴파일 된 함수의 여러에 대한 8 개의 다른 것이있다 + +691 +00:51:34,610 --> 00:51:47,220 + 같은 그래프의 부분은 그래 그래 문제는 우리가 기울기를 계산하는 방법이다 + +692 +00:51:47,219 --> 00:51:51,119 + 그것은 실제로는 그렇지 않아 잘은 상징적 종류의 사람이 밖으로에게 S 않습니다 + +693 +00:51:51,119 --> 00:51:55,219 + 때마다 당신이이 통화를 할 수 있기 때문에 실제로 성에서 사람이의 일종이다 + +694 +00:51:55,219 --> 00:51:58,769 + 그래픽 객체에이 계산을 구축하고 당신은 계산할 수 있습니다 + +695 +00:51:58,769 --> 00:52:06,090 + 단지 그래픽의 계산에 노드를 추가하여 그라디언트 그래서 개체 그래 + +696 +00:52:06,090 --> 00:52:09,360 + 그래, 그래서 그것이 무엇을 알고 이러한 기본 운영자의 각을 알 필요가 + +697 +00:52:09,360 --> 00:52:12,500 + 파생 상품과 파생 및이 여전히 정상 정상 + +698 +00:52:12,500 --> 00:52:17,309 + 당신이 그것을 볼 것을 다시 전파는 작동하지만 그 중 일부하지만 피치 + +699 +00:52:17,309 --> 00:52:21,299 + 내가 아는 그것이 작동하고 그는 매우 매우 낮은 수준의 기본 작업입니다 + +700 +00:52:21,300 --> 00:52:24,920 + 이러한 요소의 사물과 행렬 곱셈으로하고 바라고 때처럼 + +701 +00:52:24,920 --> 00:52:27,800 + 이 효율적인 코드를 컴파일 할 수있는 사람들을 통합하고 단순화 + +702 +00:52:27,800 --> 00:52:32,210 + 상징적으로 내가 어떻게 작동하는지 잘 모르겠어요하지만 적어도 무슨하다고 + +703 
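And the shared-variable version described above, sketched with the same assumed sizes: the weights persist inside the computational graph between calls, and an updates list makes the compiled function apply the gradient step itself, so nothing round-trips through numpy on each iteration.

```python
import numpy as np
import theano
import theano.tensor as T

X = T.matrix('X')
y = T.ivector('y')

# Shared variables live inside the graph and persist between calls,
# so on a GPU the weights can stay on the device.
w1 = theano.shared(1e-2 * np.random.randn(1000, 100), name='w1')
w2 = theano.shared(1e-2 * np.random.randn(100, 10), name='w2')

hidden = T.maximum(X.dot(w1), 0)
scores = hidden.dot(w2)
loss = T.nnet.categorical_crossentropy(T.nnet.softmax(scores), y).mean()
dw1, dw2 = T.grad(loss, [w1, w2])

lr = 1e-2
train_fn = theano.function(
    [X, y], loss,
    updates=[(w1, w1 - lr * dw1),   # update rules run on every call
             (w2, w2 - lr * dw2)])

xx = np.random.randn(64, 1000)
yy = np.random.randint(10, size=64).astype(np.int32)
for t in range(20):
    print(train_fn(xx, yy))  # each call is one gradient descent step
```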
+00:52:32,210 --> 00:52:37,110 + 그들은 그렇게 다른 고급 많은 것들을 많이 거기에 이렇게 주장하는 당신 + +704 +00:52:37,110 --> 00:52:40,309 + 무엇이든 할 수 나는 우리가 당신이 할 수있는 얘기 할 시간이없는 것을 알고있다 + +705 +00:52:40,309 --> 00:52:43,610 + 실제로 사용하여 경쟁 공예 내부에 직접 조건문은 다음과 같습니다 + +706 +00:52:43,610 --> 00:52:44,809 + 이 파일 + +707 +00:52:44,809 --> 00:52:49,029 + 그리고 스위치는 당신이 실제로 루프를 내부자 계산을 포함 할 수 있습니다 명령 + +708 +00:52:49,030 --> 00:52:52,370 + 그래프이 정말 이해하지 못하는이 재미 스캔 기능을 사용하여 + +709 +00:52:52,369 --> 00:52:57,409 + 하지만 힘든하지만 이론적으로는 매우 재발 네트워크를 구현할 수 있습니다 + +710 +00:52:57,409 --> 00:53:01,909 + 쉽게 당신이 잠시 상상할 수있는 다음 중 하나의 작업에서 발생하는 + +711 +00:53:01,909 --> 00:53:05,539 + 계산 공예가에 동일한 가중치 행렬을 통과하고있는 모든 + +712 +00:53:05,539 --> 00:53:10,110 + 여러 노드 및 정렬 할 그 루프 및이 수 실제로 검사 + +713 +00:53:10,110 --> 00:53:14,680 + 루프는 그래프의 명시 적 부분의 일부와 우리가 실제로 미친 갈 수 있습니다 + +714 +00:53:14,679 --> 00:53:17,909 + 파생 상품과 우리는 밖으로 어떤과 ​​관련하여 파생 상품을 계산할 수 있습니다 + +715 +00:53:17,909 --> 00:53:21,149 + 우리는 또한 자코비 계산할 수있는 다른 부분에 대한 선박의 일부 + +716 +00:53:21,150 --> 00:53:24,300 + 파생 상품 우리는 알렌을 사용할 수의 파생 상품을 계산하여 종료 우리 + +717 +00:53:24,300 --> 00:53:29,140 + 운영자는 공식적으로 배우로서 큰 주요 행렬 - 벡터 곱셈을 만든하려면 + +718 +00:53:29,139 --> 00:53:32,500 + 와 자코비 존스 당신은 정말 멋진 다른 파생 테이크를 많이 할 수 + +719 +00:53:32,500 --> 00:53:36,610 + 아마 상단과 다른 프레임 워크 그리고 그것은 또한 일부가 피아노에 재고 + +720 +00:53:36,610 --> 00:53:40,180 + 스파 스 매트릭스에 대한 지원은 즉석에서 코드를 최적화하기 위해 시도 + +721 +00:53:40,179 --> 00:53:45,669 + 내가 아는 다른 멋진 일을하는 것은이 거기에 멀티 GPU 지원을 가지고 + +722 +00:53:45,670 --> 00:53:50,599 + 내가 그 제외한 사용하지 않은 패키지는 데이터 병렬 처리를 얻을 수 있다고 주장 + +723 +00:53:50,599 --> 00:53:54,500 + 그래서 여러 GPU를 통해 분할에 대한 의미와 거기에 배포 + +724 +00:53:54,500 --> 00:53:57,260 + 이 계산에 모델 병렬 처리에 대한 실험 지원 + +725 +00:53:57,260 --> 00:54:01,320 + 그래프는 다른 장치 있지만 문서 사이에 분할됩니다 + +726 +00:54:01,320 --> 00:54:08,030 + 라고 그 실험 아마 정말 실험 그래서, 그래서 당신이 본 있도록 + +727 +00:54:08,030 --> 00:54:11,730 + 작업을 할 때 나는 API가 조금 낮은 수준 인 것을 알고 우리는 필요 + +728 +00:54:11,730 --> 00:54:15,769 + 일종의 해결할은 somos 아냐가 업데이트 규칙과 모든 것을 구현 + +729 +00:54:15,769 --> 00:54:19,900 + 는 I 주변이 높은 수준의 래퍼는 거리의 일부를 추상의 종류를 알고 + +730 +00:54:19,900 --> 00:54:24,660 + 당신을 위해 그 세부 그래서 다시 우리가 일종의 상징적 인 행렬을 정의하고 있고 + +731 +00:54:24,659 --> 00:54:28,659 + 라자냐는 자동으로 설정됩니다 이러한 계층 기능이있는 + +732 +00:54:28,659 --> 00:54:32,489 + 공유 변수와 그런 종류의 물건 우리는의 확률을 계산할 수 있습니다 + +733 +00:54:32,489 --> 00:54:38,469 + 손실 라이브러리에서이 편리한 것을 사용하고, 라자냐 실제로 수 + +734 +00:54:38,469 --> 00:54:41,969 + 우리가 구현하는 이러한 업데이트 규칙과 강한 추진력을 작성하고 + +735 +00:54:41,969 --> 00:54:47,109 + 다른 멋진 것들과 지금 우리가 컴파일 우리의 기능 우리가 실제로 단지 + +736 +00:54:47,110 --> 00:54:51,390 + 내 라자냐와 모두가 우리를 위해 기록 된이 업데이트 규칙에 전달 + +737 +00:54:51,389 --> 00:54:51,839 + 방법 + +738 +00:54:51,840 --> 00:54:56,309 + 객체뿐만 아니라 라자냐가 우리를 위해 찍은 치료의 치료를 찍은 + +739 +00:54:56,309 --> 00:54:59,579 + 다음 날의 끝에서 우리는 이러한 컴파일 피아노 하나 결국 + +740 +00:54:59,579 --> 00:55:04,599 + 기능과 우리는 또 다른 거기에 다른 거기에 이전과 같은 방법으로 사용 + +741 +00:55:04,599 --> 00:55:10,480 + 꽤 인기있는 문화의 래퍼 4390 우리는 조금 짝수되는 + +742 +00:55:10,480 --> 00:55:15,730 + 그래서 여기에 우리가 순차적으로 컨테이너를 만들고 추가하는 데있어 더 높은 수준의 + +743 +00:55:15,730 --> 00:55:20,559 + 그것에 층의 스택은 그래서 이것은 종류의 횃불처럼 이제 우리는이 데있어 + +744 +00:55:20,559 --> 00:55:25,789 + 가는이 상사 개체를 만들기에 실제로 우리를 갱신하고 지금 우리가 할 수있는 + +745 +00:55:25,789 --> 00:55:29,759 + 이 슈퍼 그래서 그냥 방법을 맞는 모델을 사용하여 우리의 네트워크를 훈련 + +746 +00:55:29,760 --> 00:55:36,570 + 높은 수준의 당신은 심지어 피아노를 사용하여 실제로 우리를 수행 할뿐만 아니라 말할 수 없다 + +747 +00:55:36,570 --> 00:55:40,289 + 뿐만 아니라 당신이하지 않아도 배경은 그것으로 영광을 사용하지만 거기하기 + +748 +00:55:40,289 --> 00:55:44,500 + 당신이 당신의 경우 경우 실제로 하나의 큰이 코드 조각 문제와 나도 몰라 + +749 +00:55:44,500 --> 00:55:49,219 + 경험이 
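As a sketch of the Lasagne pattern the captions describe, layers built on Theano symbolic variables, a library loss, and a prewritten momentum update rule, all compiled into a single training function; the architecture and sizes here are assumptions:

```python
import numpy as np
import theano
import theano.tensor as T
import lasagne

X = T.matrix('X')
y = T.ivector('y')

# Lasagne layers wrap the symbolic graph and set up shared weights.
l_in = lasagne.layers.InputLayer((None, 784), input_var=X)
l_hid = lasagne.layers.DenseLayer(
    l_in, num_units=100, nonlinearity=lasagne.nonlinearities.rectify)
l_out = lasagne.layers.DenseLayer(
    l_hid, num_units=10, nonlinearity=lasagne.nonlinearities.softmax)

probs = lasagne.layers.get_output(l_out)
loss = lasagne.objectives.categorical_crossentropy(probs, y).mean()

# Prewritten update rule (momentum) from the library.
params = lasagne.layers.get_all_params(l_out, trainable=True)
updates = lasagne.updates.momentum(loss, params,
                                   learning_rate=1e-2, momentum=0.9)

train_fn = theano.function([X, y], loss, updates=updates)

xx = np.random.randn(32, 784)
yy = np.random.randint(10, size=32).astype(np.int32)
print(train_fn(xx, yy))
```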
나중에 알고 있지만 실제로 충돌 수 있으며, 그것은에 충돌로 + +750 +00:55:49,219 --> 00:55:54,750 + 정말 나쁜 방법은 오류 메시지가 그렇습니다 우리는이 거대한 스택 트레이스 아무 것도 얻을 + +751 +00:55:54,750 --> 00:55:58,380 + 우리가 작성한 코드의 통해 우리는이 거대한 값 오류가 발생하는 + +752 +00:55:58,380 --> 00:56:03,440 + 내가 Fiano 정말 전문가 그래서이 아니에요 그래서 나에게 어떤 의미가 없습니다 + +753 +00:56:03,440 --> 00:56:07,039 + 정말 나에게 혼란되었다 그래서 우리는 간단한 찾고 코팅 처리의이 종류를 썼다 + +754 +00:56:07,039 --> 00:56:11,259 + 우리하지만 팩으로 fiato를 사용하고 그것을 밖으로 허튼과 우리에게 준 때문에 + +755 +00:56:11,260 --> 00:56:15,030 + 즉, 내가 일반적인 통증 포인트 중 하나라고 생각, 그래서 정말 오류 메시지가 혼란 + +756 +00:56:15,030 --> 00:56:18,730 + 디버깅 배경으로 사용 아무것도 실패 사례 수 + +757 +00:56:18,730 --> 00:56:24,949 + 좀 열심히 공기를 봤와 나는 것을 발견 좋은 개발자가 같은 수 + +758 +00:56:24,949 --> 00:56:28,659 + 내가 잘못 흰색 변수의 폭을 포함하고 나는 것을 발견 + +759 +00:56:28,659 --> 00:56:32,579 + 내 아내 변수로 변환이 기타 다른 기능을 사용하는 가정 및 + +760 +00:56:32,579 --> 00:56:35,690 + 문제가 멀리 갈 수 있도록하지만 오류 메시지에서 분명 아니었다 + +761 +00:56:35,690 --> 00:56:41,139 + 그 피아노 피아노를 사용할 때 신경이 쓰이는 것이 좋을 위해 뭔가 + +762 +00:56:41,139 --> 00:56:44,699 + 우리가 라자냐에 대해 이야기 때문에 실제로 초반 이었죠 모델이 + +763 +00:56:44,699 --> 00:56:48,539 + 실제로 꽤 좋은 모델 당신에게 다른 인기있는 모델을 많이 가지고 + +764 +00:56:48,539 --> 00:56:52,820 + 아키텍처는 당신이 그렇게 라자냐 당신이 알렉스와 구글을 사용 할 수도 있다는 것입니다 + +765 +00:56:52,820 --> 00:56:56,190 + 매트와 BG 나는 그들이 아직 주민을 생각하지 않습니다하지만 그들은 꽤 많이있다 + +766 +00:56:56,190 --> 00:57:00,320 + 거기에 유용한 것들과 내가 발견 몇 가지 다른 패키지가 + +767 +00:57:00,320 --> 00:57:04,550 + 분명 정말로 내 말을 제외하고이 명확하게 굉장했다 좋은 것 같다 그것 때문에 + +768 +00:57:04,550 --> 00:57:07,030 + 작년부터 CS2 (31) 및 프로젝트였다 + +769 +00:57:07,030 --> 00:57:10,330 + 당신거야 하나를 선택하면하지만 난 그것이 아마 라자냐 모델을 생각한다 + +770 +00:57:10,329 --> 00:57:16,139 + 내가 아는 가지고 노는 내 하루 경험 그래서 정말 좋은 + +771 +00:57:16,139 --> 00:57:20,029 + 장점과 단점에 대해 나는 곳의 파이프 라인이 심판의 볼 수있는 + +772 +00:57:20,030 --> 00:57:20,890 + 훌륭 해요 + +773 +00:57:20,889 --> 00:57:23,920 + 이 계산 쓰레기 특히 주위에 정말 강력한 아이디어처럼 보인다 + +774 +00:57:23,920 --> 00:57:28,760 + 상징적으로 그라데이션을 계산하고 이러한 모든 최적화 그것은 특히 R과 + +775 +00:57:28,760 --> 00:57:32,070 + 나는이 계산 그래프를 사용하여 구현하는 것이 훨씬 쉬울 것이라고 생각 종료 + +776 +00:57:32,070 --> 00:57:37,570 + Rottino의 종류 추한 및 총하지만 특히 라자냐는 꽤 좋아 보인다 + +777 +00:57:37,570 --> 00:57:41,470 + 저와 종류의 오류 메시지가 꽤 될 수있는 고통의 일부를 빼앗아 + +778 +00:57:41,469 --> 00:57:46,279 + 내가 무슨 소리를 들었어요에서 우리가 보았 듯이 고통과 큰 모델은 정말 오래 할 수 있습니다 + +779 +00:57:46,280 --> 00:57:51,190 + 컴파일 시간 그래서 그 우리에 대한 즉시 해당 함수를 컴파일 할 때 + +780 +00:57:51,190 --> 00:57:54,579 + 거의 순간적으로 실행되는 모든 간단한 예제하지만 우리는있어 + +781 +00:57:54,579 --> 00:57:58,159 + 내가 이야기를 들었습니다 신경 튜링 기계처럼 큰 복잡한 일을 그 + +782 +00:57:58,159 --> 00:58:01,969 + 즉, 실제로는 아니에요 그건 너무 컴파일 아마 반 시간이 걸릴 수 있습니다 + +783 +00:58:01,969 --> 00:58:06,239 + 좋은 그것은 당신의 모델과 다른 종류에 빠르게 반복에 대한 좋지 않다 + +784 +00:58:06,239 --> 00:58:10,509 + 통증의 포인트는 API가 모든 일을거야 토치보다 훨씬 더 나은 것입니다 + +785 +00:58:10,510 --> 00:58:13,470 + 그것은 이해하기 힘든 종류, 그래서 배경이 복잡한 물건과 + +786 +00:58:13,469 --> 00:58:17,969 + 디버그하지만, 실제로 코드를 발생하고 초반 이었죠 모델은 어쩌면이다 + +787 +00:58:17,969 --> 00:58:22,569 + 라자냐는 꽤 좋은처럼되지 확실히 카페 또는 토치 좋은 그러나 그것은 본다 + +788 +00:58:22,570 --> 00:58:30,320 + 확인은 그래서 우리는 첫 번째 경우 비록 1000을 얘기 십오분 지금있어 + +789 +00:58:30,320 --> 00:58:38,309 + 내가 시도 할 수 있습니다 알고는 확인이 그렇게 tenser 흐름이 아니다에 대한 질문이있다 + +790 +00:58:38,309 --> 00:58:42,809 + 센서는 정말 시원하고 반짝 새롭고의 Google에서 모든 사람의 흐름 + +791 +00:58:42,809 --> 00:58:47,829 + 그것에 대해 흥분하고 실제로 많은 방법에서 피오나와 매우 유사 그 + +792 +00:58:47,829 --> 00:58:51,170 + 그들은 정말 계산 그래프의이 아이디어와 건물을 복용하고 + +793 +00:58:51,170 --> 00:58:55,650 + 그는 모든 것을 tenser 흐름과 Fiano 실제로 아주 아주 밀접하게 연결되도록 + +794 +00:58:55,650 --> 00:58:59,090 + 내 마음에 그 중 하나를 사용하여 멀리 얻을 수있는 종류의 해리스 등이다 + +795 
+00:58:59,090 --> 00:59:04,760 + 하나는 백핸드이며, 또한 어떤 하나의 어쩌면 약 1000인지 확인 가리 + +796 +00:59:04,760 --> 00:59:07,200 + 그것은에서 설계된 이러한 프레임 워크 중 제 1의 일종 + +797 +00:59:07,199 --> 00:59:10,750 + 전문 엔지니어에 의해 접지 + +798 +00:59:10,750 --> 00:59:14,000 + 그래서 다른 프레임 워크의 많은 종류의 학술 연구 실험실에서 회전 및 + +799 +00:59:14,000 --> 00:59:17,320 + 그들은 정말 좋은있어, 그들은 당신이 정말 잘 일을 할 수 있도록하지만 그들은 일종의했다 + +800 +00:59:17,320 --> 00:59:23,120 + 토치 특히 의해 관리되고, 특히 있도록 대학원 학생들에 의해 유지 + +801 +00:59:23,119 --> 00:59:26,500 + 지금 트위터와 페이스 북에서 일부 엔지니어하지만 원래 학술했다 + +802 +00:59:26,500 --> 00:59:30,070 + 프로젝트 및 이들의 모든 나는 tenser 흐름이었다 첫 번째 생각 + +803 +00:59:30,070 --> 00:59:35,000 + 그래서 아마 이론적으로 산업 곳에서 목에서 처음부터 + +804 +00:59:35,000 --> 00:59:37,989 + 그없이 내가 그나마 나은 코드 품질이나 시험 범위 또는 무언가로 이어질 수 + +805 +00:59:37,989 --> 01:00:04,519 + 나는 확실하지 그래서 여기에 꽤 무서운 듯 우리 마음에 드는 사람이 누워있어 여기 그래서 해요 + +806 +01:00:04,519 --> 01:00:07,389 + 우리는거야 라빈 우리는 그것을했고, 다른 모든 프레임 워크는 의도의이하자 + +807 +01:00:07,389 --> 01:00:12,769 + 그래서 우리가 걸 볼 수 있도록이 실제로 알고있는 나는 정말 비슷합니다 흐름 + +808 +01:00:12,769 --> 01:00:17,320 + Fiano에 tenser 흐름을 가져 우리가이 행렬과 벡터를 기억 + +809 +01:00:17,320 --> 01:00:21,019 + 기호 변수 강렬한 워크로드 그들은 자리라고하고 있지만입니다 + +810 +01:00:21,019 --> 01:00:26,380 + 이 같은 생각은 우리가있어 우리의 계산 그래프에 입력 노드를 작성하는 + +811 +01:00:26,380 --> 01:00:30,650 + 또한 fiato에 무게 행렬을 정의하는 것 우리는 이러한 공유 일이 + +812 +01:00:30,650 --> 01:00:34,490 + 그라는 계산 그래프 같은 생각하고 유연한 텐서 안에 살았다 + +813 +01:00:34,489 --> 01:00:40,359 + 변수를 우리가 같은 단지 계산 Ciano처럼 사용하여 통과 기대된다 + +814 +01:00:40,360 --> 01:00:44,610 + 작동이 라이브러리 방법은이 일에 상징적에 작동 + +815 +01:00:44,610 --> 01:00:48,289 + 즉, 쉽게를 계산 할 수 있도록 전산 그래프를 구축 + +816 +01:00:48,289 --> 01:00:52,210 + 확률은 상징적으로이 같은 손실과 모든에 + +817 +01:00:52,210 --> 01:00:56,190 + 실제로 나는 나보다는 조금 더 우리에게 보이는 관심이 더 많은 것 같습니다에 생각 + +818 +01:00:56,190 --> 01:01:00,740 + 바위보다 회전 목마 라자냐처럼 알고 있지만, 우리는이 경사 하강을 사용하는 + +819 +01:01:00,739 --> 01:01:04,669 + 최적화는 그리고 우리는 그래서 여기에 우리가하지 않은 손실을 최소화하기 위해 그것을 말하는 거 + +820 +01:01:04,670 --> 01:01:08,970 + 명시 적으로하지만, 그라디언트를 토하고 우리는 명시 적으로 대해 서면으로하지 않는 + +821 +01:01:08,969 --> 01:01:13,489 + 무역 업데이트 규칙 대신이 사람들의 일을 사용하지만 그냥 일종의 추가되었다 + +822 +01:01:13,489 --> 01:01:19,250 + 그냥 지금 손실을 최소화하기 위해 그래프로 할 필요가 무엇이든 + +823 +01:01:19,250 --> 01:01:23,059 + 같은 Ciano 시장에서 우리가 실제로 더 높은 실제 번호를 사용하여 인스턴스화 할 수 + +824 +01:01:23,059 --> 01:01:23,779 + 증가 + +825 +01:01:23,780 --> 01:01:29,470 + 일부 몇 가지 작은 데이터 세트 후 우리는 루프 너무 강한 공기 흐름에서 실행할 수 있으며, + +826 +01:01:29,469 --> 01:01:33,750 + 당신은 실제로 당신이 그것을 포장 할 필요가 사용해야하는 코드를 실행하려면 + +827 +01:01:33,750 --> 01:01:39,199 + 세션 코드는 내가 무엇을하고 있는지 이해가 안하지만 그건 당신이 실제로 그것을 할 수 있었다 + +828 +01:01:39,199 --> 01:01:42,599 + 실제로 무엇을하고있어하면 그 모든 정거장 짧은하지만 할 갔다 + +829 +01:01:42,599 --> 01:01:45,869 + 당신의 계산 잔디를 설정하고 놓친 세션이 실제로하고있다 + +830 +01:01:45,869 --> 01:01:48,440 + 어떤 최적화 실제로 실행하고자 할 필요가 + +831 +01:01:48,440 --> 01:01:58,110 + 그래 그래서 당신이 있다면 그래서 질문은 당신이 기억 그래서 만약 하나의 뜨거운 무엇인가 + +832 +01:01:58,110 --> 01:02:01,840 + 과제는 부드러운 최대 손실 함수처럼했지만 왜 때 + +833 +01:02:01,840 --> 01:02:06,170 + 항상 정수는 당신이 원하는하지만 어떤 일을 말하는 이들 중 일부에 + +834 +01:02:06,170 --> 01:02:11,420 + 모든 곳에 대신 정수의 프레임 워크는 요인이 될한다 + +835 +01:02:11,420 --> 01:02:15,090 + 즉 실제로이었을 정도로 신용 클래스가 있던 일을 제외하고 제로 + +836 +01:02:15,090 --> 01:02:20,420 + 에 나를 넘어 버그는 차이가 뜨거운 일 사이에 있었다 다시 우리를 걱정 + +837 +01:02:20,420 --> 01:02:28,710 + 뜨거운 그것은 10 2011 핫 어떤 권리를 밝혀없는 것보다 우리 때 + +838 +01:02:28,710 --> 01:02:34,250 + 실제로 우리가 실제로 기억하고 우리가 fiato에 전화 한 후이 네트워크를 훈련 할 + +839 +01:02:34,250 --> 01:02:37,610 + 이 함수 객체를 컴파일 한 다음 또 다시 함수를 호출 + +840 +01:02:37,610 --> 01:02:41,940 + 해당 강한 공기 흐름은 우리가의 run 메소드를 호출하는 데 사용한다는 것입니다 + 
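A sketch of the TensorFlow flow described above, in the 1.x graph API of that era: placeholders for data and labels, Variables for weights that live inside the graph, an optimizer whose minimize call adds the gradient and update ops for us, and a Session whose run method is fed numpy arrays. Some names below (for example the variable initializer) use later 1.x spellings rather than the ones from the time of the lecture.

```python
import numpy as np
import tensorflow as tf  # TF 1.x-style graph API

# Placeholders: graph inputs fed with numpy arrays at run time.
X = tf.placeholder(tf.float32, shape=[None, 1000])
y = tf.placeholder(tf.int64, shape=[None])

# Variables: weights living inside the graph across calls.
w1 = tf.Variable(1e-2 * np.random.randn(1000, 100).astype(np.float32))
w2 = tf.Variable(1e-2 * np.random.randn(100, 10).astype(np.float32))

hidden = tf.nn.relu(tf.matmul(X, w1))
scores = tf.matmul(hidden, w2)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=scores))

# No explicit gradients or update rules: minimize adds them to the graph.
train_step = tf.train.GradientDescentOptimizer(1e-2).minimize(loss)

xx = np.random.randn(64, 1000).astype(np.float32)
yy = np.random.randint(10, size=64)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for t in range(20):
        loss_val, _ = sess.run([loss, train_step],
                               feed_dict={X: xx, y: yy})
        print(loss_val)
```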
+841 +01:02:41,940 --> 01:02:46,409 + 세션 객체 우리는 우리가 계산하는 원한 스위치 출력을 말해 + +842 +01:02:46,409 --> 01:02:50,349 + 여기에서 우리는 우리가 기차가 내가이야 무엇을 중단 계산 할 것인지를 말하는 거 + +843 +01:02:50,349 --> 01:02:54,769 + 라 셀리 퍼트 우리는 이러한 입력에 발생이 NumPy와에 거 피드있어 그렇게 + +844 +01:02:54,769 --> 01:02:57,699 + 우리가 run 메소드를 호출하는 것을 제외하고이 디아 노 같은 아이디어의 종류 + +845 +01:02:57,699 --> 01:03:02,210 + 오히려 명시 함수를 컴파일 컴파일보다 과정에서 + +846 +01:03:02,210 --> 01:03:06,179 + 이 열차 정지 객체 선거을 평가에 그라데이션 하강을 + +847 +01:03:06,179 --> 01:03:10,690 + 무게는 그래서 우리는 단지 루프에서이 일을 실행하고 로스가 다운 될 수 있습니다 + +848 +01:03:10,690 --> 01:03:16,450 + 모든 것이 tenser 흐름에 대한 정말 멋진 것들 중 하나 아름답습니다 그래서 + +849 +01:03:16,449 --> 01:03:20,519 + 쉽게 쉽게 무엇을 시각화 할 수 있습니다 tenser 보드라는이 일입니다 + +850 +01:03:20,519 --> 01:03:24,880 + 그래서 여기에 네트워크에서 진행하는 것은 우리가 가진 거의 동일한 코드입니다 + +851 +01:03:24,880 --> 01:03:29,150 + 우리는 희망이 세 개의 작은 줄을 추가 한 제외하기 전에 경우를 볼 수 있습니다 + +852 +01:03:29,150 --> 01:03:34,280 + 의 스칼라 요약을 계산하는 곳없는 당신은 그래서 여기에 저를 신뢰해야합니다 + +853 +01:03:34,280 --> 01:03:37,200 + 손실 그것은 우리에게 새로운 상징적 인 변수를주고 + +854 +01:03:37,199 --> 01:03:40,929 + 법의 개요 더욱 가중 행렬들의 히스토그램을 산출 요약 + +855 +01:03:40,929 --> 01:03:46,049 + 화가 하나 W2 W 2 w도 우리에게 점점 새로운 상징적 인 변수에 W + +856 +01:03:46,050 --> 01:03:51,390 + 지금 우리가라는 또 다른 상징적 인 변수를 얻고 야유하는 것으로 나타났다 + +857 +01:03:51,389 --> 01:03:54,349 + 함께 그렇게하지 ​​마법을 사용하는 모든 사람들 요약으로 부상 할 수 있습니다 + +858 +01:03:54,349 --> 01:03:58,929 + 이해하고 우리는 우리가 사용할 수있는이 요약 작가 개체를 받고있어 + +859 +01:03:58,929 --> 01:04:03,000 + 실제로 우리가있을 때 우리의 루프에서 지금 책상에 그 요약을 덤프하고 + +860 +01:04:03,000 --> 01:04:06,570 + 실제로 다음 네트워크를 실행 우리는을 평가하는 평가를 말해 + +861 +01:04:06,570 --> 01:04:10,460 + 교육 직원과 같은 손실 그녀의 모든 그래서이 병합 요약 개체 전에 + +862 +01:04:10,460 --> 01:04:14,190 + 그래서 과시 요약을 평가하는 프로세스에서 계산할 것이다 개체 + +863 +01:04:14,190 --> 01:04:17,690 + 이 가중치의 히스토그램을 계산합니다 그라데이션 책상에 그 요약을 덤프 + +864 +01:04:17,690 --> 01:04:22,019 + 그리고, 우리는 요약에 내가 그 어디 같아요 실제로 우리의 작가에게 + +865 +01:04:22,019 --> 01:04:26,610 + 그 다음에이 일을 실행하면 사용자가 쇼핑몰을 얻을 수 있도록이에 대한 권리가 발생합니다 + +866 +01:04:26,610 --> 01:04:28,890 + 이 일을 지속적으로 정렬을 실행하는 모든이를 스트리밍 + +867 +01:04:28,889 --> 01:04:33,069 + 책상에 네트워크에서 무슨 일이 일어나고 있는지에 대한 정보는 다음 당신은 + +868 +01:04:33,070 --> 01:04:37,480 + 그 텐서 유량 센서 보드와 함께 제공되는이이 웹 서버를 시작하고 우리가 얻을 + +869 +01:04:37,480 --> 01:04:41,420 + 그래서 당신의 네트워크에서 무슨 일이 일어나고 있는지에 대한이 아름다운 아름다운 시각화 + +870 +01:04:41,420 --> 01:04:42,539 + 여기 왼쪽에 + +871 +01:04:42,539 --> 01:04:46,230 + 회원 우리는 우리가이 때문에 손실 스칼라 요약을 얻고 있었다 말하고 있었다 + +872 +01:04:46,230 --> 01:04:49,360 + 실제로 손실이 나는 그것이 큰 작은이었다 의미 내려가는 것을 보여줍니다 + +873 +01:04:49,360 --> 01:04:52,760 + 네트워크 및 소규모 데이터 세트하지만 그 모든 의미가 작동이됩니다 + +874 +01:04:52,760 --> 01:04:56,860 + 여기 오른쪽에 표시 당신은 당신을 보여주는 시간이 지남에 따라 히스토그램 + +875 +01:04:56,860 --> 01:05:00,900 + 당신의 체중 행렬의 값의 분포이 물건은 그래서 + +876 +01:05:00,900 --> 01:05:04,579 + 정말 정말 시원하고 나는이 정말 정말 아름다운 디버깅 도구라고 생각합니다 + +877 +01:05:04,579 --> 01:05:09,289 + 내가 프로젝트와 토치 작업을했습니다 때 때 그래서 나는이 작성했습니다 + +878 +01:05:09,289 --> 01:05:11,250 + 종류의 손으로 자신을 물건 + +879 +01:05:11,250 --> 01:05:14,900 + 그냥 좀 고문에서 JSON의 모양을 덤핑하고 내 자신의 정의를 작성 + +880 +01:05:14,900 --> 01:05:18,369 + 시각화 시각화은이기 때문에 통계의 이러한 종류를 볼 수 있습니다 + +881 +01:05:18,369 --> 01:05:21,609 + 정말 유용하고 텐트 당신은 자신에 대해 어떤을 작성하지 않아도됩니다으로 + +882 +01:05:21,610 --> 01:05:25,019 + 그들은 말을하는지 훈련 스크립트 실행에 코드를 사용하면 단지 몇 라인과 + +883 +01:05:25,019 --> 01:05:27,489 + 당신은이 모든 아름다운 시각화하여 디버깅 도움을 얻을 수 있습니다 + +884 +01:05:27,489 --> 01:05:35,059 + tenser 유량 센서 보드는 또한 당신도 어떤 네트워크 시각화 할 수 있습니다 + +885 +01:05:35,059 --> 01:05:39,820 + 이 이름을 가진 변수는 구조가 그래서 여기에 우리가 주석했던 모양이다 + +886 +01:05:39,820 --> 01:05:43,510 + 지금 우리는 우리가 범위 실제로 
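Continuing the previous TensorFlow sketch (it reuses loss, train_step, X, y, xx, and yy from there), these are roughly the few extra TensorBoard lines the captions mention: a scalar summary of the loss, histogram summaries of the weight matrices, one merged summary op, and a writer that streams everything to disk for the web viewer. The tf.summary.* spellings are the later 1.x names, not the lecture-era tf.scalar_summary ones.

```python
# Summaries are themselves new symbolic nodes in the graph.
loss_summ = tf.summary.scalar('loss', loss)
w1_summ = tf.summary.histogram('w1', w1)
w2_summ = tf.summary.histogram('w2', w2)
merged = tf.summary.merge_all()

with tf.Session() as sess:
    # The writer dumps summaries (and the graph) to a log directory
    # that the tensorboard web server then reads.
    writer = tf.summary.FileWriter('/tmp/tf_logs', sess.graph)
    sess.run(tf.global_variables_initializer())
    for t in range(20):
        summ, _ = sess.run([merged, train_step],
                           feed_dict={X: xx, y: yy})
        writer.add_summary(summ, t)  # stream stats to disk each step
```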
몇몇을 할 수있는 전진 패스를하고있는 경우 + +887 +01:05:43,510 --> 01:05:47,450 + 네임 스페이스와 함께 슬라이스 그룹의 종류에 따라 합병증 + +888 +01:05:47,449 --> 01:05:48,949 + 계산이 + +889 +01:05:48,949 --> 01:05:52,519 + 그것은과 동일합니다보다 함께 의미 지금 다른 속해야 + +890 +01:05:52,519 --> 01:05:56,949 + 우리가 전에 지금보고 같은 것은 우리가이 네트워크를 실행하고 텐트를로드하는 경우 또는 + +891 +01:05:56,949 --> 01:06:00,909 + 더 우리가 실제로 어떻게 같은이 아름다운 시각화를 얻을 수 있습니다 + +892 +01:06:00,909 --> 01:06:04,789 + 우리의 네트워크는 실제로처럼 보이는 우리가 실제로 클릭하고 찾아 볼 수있는 + +893 +01:06:04,789 --> 01:06:07,820 + 정말 무슨 점수에 화면 내부에 무슨 일이 일어나고 있는지 디버그에 도움 + +894 +01:06:07,820 --> 01:06:12,170 + 이 네트워크 이집트 당신은 손실과 성적이 참조 + +895 +01:06:12,170 --> 01:06:15,030 + 이는 우리가 포워드 패스 중에 정의 된 의미있는 네임 + +896 +01:06:15,030 --> 01:06:18,940 + 우리가 예를 들어 점수를 클릭하면 그것을 열어 우리 모두 볼 수 있습니다 + +897 +01:06:18,940 --> 01:06:22,679 + 그래픽에 계산 내부 여기에를 가지고 작업하는 + +898 +01:06:22,679 --> 01:06:28,108 + 노드는 그래서 나는 정말 쉽게 디버그처럼 당신이 할 수 있다면 이것은 정말 멋진 줄 알았는데 + +899 +01:06:28,108 --> 01:06:31,039 + 이 중 하나를 작성하는만큼 실행중인 동안 무슨 일이 당신의 네트워크 내부에서 무슨 일 + +900 +01:06:31,039 --> 01:06:39,300 + 애플의 아시아 코드로 자신을 너무 부드러운 흐름은 멀티 GPU 지원을해야합니까 그래서 + +901 +01:06:39,300 --> 01:06:42,750 + 데이터 병렬 그렇게 예상처럼 나는 것을 지적하고 싶습니다있다 + +902 +01:06:42,750 --> 01:06:45,809 + 실제로는이 메일 부분은 아마도 중 하나입니다 배포 + +903 +01:06:45,809 --> 01:06:50,460 + 실제로 시도 할 수있는 다른 주요 판매 포인트는 때때로 흐름 + +904 +01:06:50,460 --> 01:06:53,338 + 다른 기기에서 다른 방법으로 분산 계산 쓰레기 + +905 +01:06:53,338 --> 01:06:57,828 + 실제로 통신을 최소화하기 위해 기민하게 그 쓰레기를 분산 배치 + +906 +01:06:57,829 --> 01:07:02,839 + 오버 헤드 등등 그래서 당신이 할 수있는 한 가지 데이터 병렬 어디에이다 + +907 +01:07:02,838 --> 01:07:05,559 + 단지 다른 기기에서 다시 돈을 넣어 각 하나를 실행 + +908 +01:07:05,559 --> 01:07:08,409 + 전후 다음 중 하나 그라디언트 중 일부는해야 할 일 + +909 +01:07:08,409 --> 01:07:12,068 + 단지에 동기 업데이트를하거나 동기 분산 교육 당신의 + +910 +01:07:12,068 --> 01:07:16,730 + 매개 변수와는 동기 훈련 뷰캐넌에게 그녀가 할 수있는 백서 주장을 + +911 +01:07:16,730 --> 01:07:21,300 + 이런 것들과 텐서 흐름의 양을하지만 나는 그것을 밖으로 당신이 할 수하지 않았다 않았다 + +912 +01:07:21,300 --> 01:07:25,000 + 또한 실제로뿐만 아니라 집중적 인 흐름 모델의 병렬 처리를 수행하지만 다음을 수행 할 수 있습니다 + +913 +01:07:25,000 --> 01:07:27,829 + 같은 모델을 분할에 같은 모델의 다른 부분을 계산 + +914 +01:07:27,829 --> 01:07:32,190 + 그래서 여기에 다른 장치는 그래서이 유용 할 수 있습니다 한 곳에는 예입니다 + +915 +01:07:32,190 --> 01:07:36,510 + 다층 재발 네트워크가 실제로 실행하는 것이 좋습니다 수 있습니다 + +916 +01:07:36,510 --> 01:07:39,900 + 다른 CPU에서 네트워크의 다른 레이어 그 일을 할 수 있기 때문에 + +917 +01:07:39,900 --> 01:07:42,838 + 그 물건의 종류가 그래서 실제로 많은 메모리를 가지고 당신은 실제로 수 + +918 +01:07:42,838 --> 01:07:47,599 + 당신은 너무 많은 고통없이 그 강한 공기 흐름을 할 수 않습니다 + +919 +01:07:47,599 --> 01:07:51,599 + tenser 흐름은 분산으로 실행할 수있는 프레임 워크의 유일한입니다 + +920 +01:07:51,599 --> 01:07:56,000 + 하나의 시스템 및 다중 GPU를 실제로에서 모드뿐만 아니라 스트립 + +921 +01:07:56,000 --> 01:07:58,309 + 그들에게 많은 기계에서 교육 모델을 배포 + +922 +01:07:58,309 --> 01:08:04,709 + 광고 여기서주의해야 할 점은 그 부분은 아직 오늘로 평가 오픈 소스 아니라고 그래서 + +923 +01:08:04,708 --> 01:08:08,328 + 텐서 흐름의 오픈 소스 버전은 단일 시스템 멀티 GPU 작업을 수행 할 수 있습니다 + +924 +01:08:08,329 --> 01:08:13,890 + 훈련하지만 난 생각하지만 희망이 곧 그 부분은 정말로 발표 될 예정이다 + +925 +01:08:13,889 --> 01:08:16,500 + 바로 그래서 여기에 멋진 + +926 +01:08:16,500 --> 01:08:22,069 + 아이디어는 그냥 응답이 모두 통신 비용을 알고 있었다 끝낼 수있다 + +927 +01:08:22,069 --> 01:08:26,489 + 네트워크에있는 다른 시스템 간의 또한 GPU와 CPU 만 사이에 그렇게 + +928 +01:08:26,488 --> 01:08:30,118 + 그것은 현명 다른 걸쳐 계산 공예를 배포하려고 할 수 있습니다 + +929 +01:08:30,118 --> 01:08:33,750 + 그 기계에서 다른 CPU에서 기계는 계산하기 + +930 +01:08:33,750 --> 01:08:37,649 + 모든 가능한 한 효율적으로 그 그래서 난 그게 정말 멋진 생각 + +931 +01:08:37,649 --> 01:08:41,629 + 즉 다른 프레임 워크가 바로 지금 일을 할 수없는 뭔가 + +932 +01:08:41,630 --> 01:08:46,409 + 수십 흐름을 가리킬 수 있습니다 것은 초반 이었죠 모델 나는 보았다 그래서 나는 철저한했다 + +933 +01:08:46,408 --> 01:08:51,448 + 
구글은 검색과 내가 함께 올 수있는 유일한 방법은 처음 모듈가 있었다 + +934 +01:08:51,448 --> 01:08:56,028 + 사전 시험 개시 모델 만이 탐구 안드로이드를 통해 만 접근 가능 + +935 +01:08:56,029 --> 01:08:59,569 + 이 그들 있도록 모든 평가 내가 될 것으로 예상했을 것이다 뭔가 + +936 +01:08:59,569 --> 01:09:04,219 + 명확한 문서가 더 있지만 그건 적어도 당신은 1 피치 모델이 + +937 +01:09:04,219 --> 01:09:09,109 + 그 중 하나가 아닌 다른 내가 다른 초반 이었죠 모델의 정말 알고 아니에요 아니에요 + +938 +01:09:09,109 --> 01:09:12,109 + 강한 공기 흐름하지만 어쩌면 어쩌면 어쩌면 그들은 거기있어, 내가 거기에 있습니다 + +939 +01:09:12,109 --> 01:09:13,230 + 그냥 모르는 + +940 +01:09:13,229 --> 01:09:19,729 + 확률값은 그래서 제대로 봤하지 말합니다 그 대답 흐름 장점과 단점 때문에 + +941 +01:09:19,729 --> 01:09:23,689 + 다시는 내 빠른 일일 실험 정말 좋은 파이프 라인의 때문에 + +942 +01:09:23,689 --> 01:09:27,928 + 나는 그것이 그래픽에 대한 계산이 아이디어를 가지고 알고 정말 멋진 심판 + +943 +01:09:27,929 --> 01:09:32,289 + 이는 내가 슈퍼 강력 생각하고 실제로 이런 생각을합니다 + +944 +01:09:32,289 --> 01:09:35,948 + 계산 그래프도 더 Fiano보다보다 정말와 같은 것들 + +945 +01:09:35,948 --> 01:09:40,000 + 검사 점 및 장치를 통해 배포 이러한 모든처럼 알고 결국 + +946 +01:09:40,000 --> 01:09:46,380 + 그것은 정말 멋진 그래픽 4000에 계산 내부의는 주장이다 + +947 +01:09:46,380 --> 01:09:49,520 + 더 빨리가 나는 공포를 들었습니다 알고 시간 뭔가를 컴파일 + +948 +01:09:49,520 --> 01:09:53,670 + 아마 컴파일 시간 반을 복용 신경 나무 기계에 대한 이야기 + +949 +01:09:53,670 --> 01:09:59,219 + 어쩌면 그게 더 빠를 계획 또는 내가 들었어요 tenser 보드 외모 흐름한다 + +950 +01:09:59,219 --> 01:10:03,369 + 놀라운 보이는 멋진 내가 사방에 그것을 사용하려면 + +951 +01:10:03,369 --> 01:10:07,340 + 내가 훨씬 더 진보 된 생각 정말 멋진 데이터와 모델 모델 병렬 처리가 + +952 +01:10:07,340 --> 01:10:11,079 + 다른 프레임 워크보다 배포 중지는 여전히 비밀 소스 있지만 + +953 +01:10:11,079 --> 01:10:15,689 + 구글 만 잘하면 나는 결국 우리의 나머지 부분에 나올거야하지만 난 생각하는 것이 + +954 +01:10:15,689 --> 01:10:19,989 + 밥이도 아마 실제로 기반 무서운 코드를 파고있어 말을 하였다으로 + +955 +01:10:19,989 --> 01:10:24,409 + 후드 암에 대해 너무 적어도 나의 두려움에서 일하고 이해 + +956 +01:10:24,409 --> 01:10:29,010 + 흐름은 당신 미친 이상한 필수적 코드와 어떤 종류의 작업을 수행하려는 경우 + +957 +01:10:29,010 --> 01:10:32,690 + 쉽게 계산 그래프 추상화로 작동하지 않을 수 + +958 +01:10:32,689 --> 01:10:38,159 + 당신은 문제를 많이 될 수있는 것 같다 같은 것은 우리에있어 고문에에의 할 수있다 + +959 +01:10:38,159 --> 01:10:40,659 + 당신은 당신이 앞으로 내부에 원하는 필수적 어떤 코드를 작성할 수 있으며, + +960 +01:10:40,659 --> 01:10:44,659 + 이전 버전과 그들의 고유 한 사용자 정의의 패스하지만 가장 큰 유사한 것 + +961 +01:10:44,659 --> 01:10:49,979 + 법의 톤 작업에 대한 나를 위해 점을 걱정하고 또 다른 연습 + +962 +01:10:49,979 --> 01:10:52,959 + 어색한 일의 종류는 그 종류이다, 그래서 이완이 모델을 주셔서 감사하다 + +963 +01:10:52,960 --> 01:11:12,239 + 의 총 + +964 +01:11:12,239 --> 01:11:22,019 + 심지어 2002 년에 설치하는 것은 그들이이 주장 조금 아팠다 + +965 +01:11:22,020 --> 01:11:25,680 + 파이썬은 우리 방금 다운로드 PEP에 설치할 수있는 모든하지만 파산 및 I + +966 +01:11:25,680 --> 01:11:29,150 + 그리고 그들은했다 설치 얻기 위해 매년 파일 이름을 변경했다 및 + +967 +01:11:29,149 --> 01:11:32,479 + 내가 수동으로 업데이트했고, 같은 일부 랜덤를 다운로드 깨진 의존성 + +968 +01:11:32,479 --> 01:11:36,759 + zip 파일 그리고 그것은 결국 포장을 풀고 주위에 어떤 임의의 파일을 복사 할 수 있지만 + +969 +01:11:36,760 --> 01:11:41,520 + 일을하지만 설치 내가 sudo는 2012이 심지어 내 자신의 컴퓨터에 어려웠다 + +970 +01:11:41,520 --> 01:11:47,400 + 그래서 그들은 내가 함께이 빨리 넣어 그 함께 자신의 행동을 얻어야한다 + +971 +01:11:47,399 --> 01:11:51,529 + 나는 사람들이 주에 대한 관심이라고 생각하면 종류의 커버 개요 표 + +972 +01:11:51,529 --> 01:11:55,529 + 좀 초반 이었죠 모델이 무엇인지 언어 급류 프레임 워크 사이의 점 + +973 +01:11:55,529 --> 01:11:56,210 + 유효한 + +974 +01:11:56,210 --> 01:12:05,029 + 문제 + +975 +01:12:05,029 --> 01:12:09,988 + 문제는 미안 이러한 지원 Windows의입니다하지만 난하지 않습니다 + +976 +01:12:09,988 --> 01:12:11,769 + 알고있다 + +977 +01:12:11,770 --> 01:12:16,830 + 난 당신이 자신에 있다고 생각 + +978 +01:12:16,829 --> 01:12:24,439 + 앗 확인을 윈도우에서 AWS를 사용할 수 있습니다 + +979 +01:12:24,439 --> 01:12:29,359 + 확인 그래서 나는이 빠른 간의 빠른 비교 차트 와서 함께 넣어 + +980 +01:12:29,359 --> 01:12:32,198 + 나는 사람들이 관심있는 주요 총알 포인트의 일부를 커버 생각 프레임 워크 + +981 +01:12:32,198 --> 01:12:37,460 + 이야기에 대해 어떤 언어가 자유 
무역 모델을 가지고 있는지 여부를 손쉽게 + +982 +01:12:37,460 --> 01:12:41,300 + 당신이 병렬 처리의 종류와 방법을 읽을 수있는 소스 코드 등의 여부 + +983 +01:12:41,300 --> 01:12:47,029 + 내가하자 참조에 사용 사례 몇 가지를 가지고, 그래서 그들은 우리가 거​​룩한있어 우리의 손을 얻을 + +984 +01:12:47,029 --> 01:12:52,939 + 쓰레기 우리는 250 슬라이드를 얻었고, 우리는 여전히 이분은의이의이하자하자하자 그렇게 떠났다 + +985 +01:12:52,939 --> 01:12:56,710 + 약간의 게임은 당신이하고 싶었던 모든 알렉산드르의 BGG 추출한다고 가정 재생 + +986 +01:12:56,710 --> 01:12:58,619 + 어떤 프레임 워크 기능은 당신이 선택할 것 + +987 +01:12:58,619 --> 01:13:06,969 + 그래 내가 너무 일부에에 그물 알렉스를 찾을의 우리가하고 싶었던 모든 말을 하였다하자 + +988 +01:13:06,969 --> 01:13:19,189 + 새로운 데이터는 그래의 우리가 미세 조정을 확인 I와 이미지 캡션을하고 싶은 말을하자 + +989 +01:13:19,189 --> 01:13:22,889 + 좋은 분포를 들어 본 것은 그래서 이것은 내가이가 말하는 게 아니에요 내 생각 과정입니다 + +990 +01:13:22,890 --> 01:13:26,289 + 나는 이것에 대해 생각하는 정답하지만 방법은 해당이 문제에 대한 우리 + +991 +01:13:26,289 --> 01:13:30,969 + 초반 이었죠 모델은 카페 나 고문 라자냐 우리보고 있었다 초반 이었죠 모델 필요 + +992 +01:13:30,969 --> 01:13:36,239 + 캐시는 거의 밖으로 사람들이 할 경우에도, 그래서 우리의 손이 필요합니다 + +993 +01:13:36,238 --> 01:13:39,869 + 아마 고문을 사용하는 것 때문에 그냥 가지 고통이 물건을 구현 + +994 +01:13:39,869 --> 01:13:44,869 + 어쩌면 우리는 모든 분류 할 의미 분할에 대해 라자냐 + +995 +01:13:44,869 --> 01:13:49,880 + 우리는 입력 영상을 읽을 수와 대신을주는 바로 그래서 여기 화소를 + +996 +01:13:49,880 --> 01:13:57,900 + 우리가 독립적으로 확인 모든 픽셀에 라벨을 할 전체 출력 이미지에 레이블을 + +997 +01:13:57,899 --> 01:14:01,969 + 그 좋은 그렇게 다시 내 생각 과정은 우리가 여기 초반 이었죠 모델을 필요로했다입니다 + +998 +01:14:01,969 --> 01:14:06,800 + 대부분 우리는 당신을 위해 이상한 사용 케이스의 종류에 대해 얘기 듣고 + +999 +01:14:06,800 --> 01:14:10,739 + 이 레이어가 발생, 그래서 만약 우리 자신의 프로젝트의 일부를 정의 할 필요가 있습니다 + +1000 +01:14:10,738 --> 01:14:14,738 + 그들은 레이더 자체의 다른 어떤 잘 맞는 것 카페에 존재 + +1001 +01:14:14,738 --> 01:14:23,109 + 해결할없이 각 객체 검출을 위해 최소 10 점을 보이​​는이 일을 작성 + +1002 +01:14:23,109 --> 01:14:24,329 + 생각 + +1003 +01:14:24,329 --> 01:14:30,750 + 예 확인 캐시는 생각이 나의 생각 과정을 다시 우리가 초반 이었죠보고있는 것입니다 + +1004 +01:14:30,750 --> 01:14:33,149 + 모델은 그래서 우리는 카페 필요 + +1005 +01:14:33,149 --> 01:14:38,069 + 토치 또는 라자냐 우리는 실제로 문자 메시지로 당신이 많이 필요 할 수 있습니다 + +1006 +01:14:38,069 --> 01:14:41,609 + 그것은 할 수 있습니다 펑키 필수적 코드는 계산에 넣어 + +1007 +01:14:41,609 --> 01:14:47,799 + 11 선택이 카페 + 파이썬이 때문에 항공기하지만 나에게 무서운 것 같다의 일부가 + +1008 +01:14:47,800 --> 01:14:52,529 + 우리가에 대한 이야기​​ 봄 실제로이 길을 갔고, 나는 실제로 일을했습니다 + +1009 +01:14:52,529 --> 01:14:56,939 + 이 같은 유사 프로젝트와 나는 횃불을 선택하고 나를 위해 좋은 밖으로 일을하지만, + +1010 +01:14:56,939 --> 01:14:59,809 + 당신은 언어 모델링하려는 경우 펑키이 강렬 할 싶어하고 당신처럼 + +1011 +01:14:59,810 --> 01:15:06,270 + 너희들의 일을 어떻게 재발 역할 토치와 함께 놀고 싶어 내가 네 + +1012 +01:15:06,270 --> 01:15:09,550 + 우리가 원한다면 사실이 전혀 그래서 여기에 고문을 사용하지 않을 것 + +1013 +01:15:09,550 --> 01:15:13,650 + 언어 모델링과는 우리가하지 않은 재발 관계의 펑키 종류의 일을 + +1014 +01:15:13,649 --> 01:15:17,109 + 이 모든 단지 비과세의 이미지에 대해 이야기하는 것은 그래서 우리는 어떤 사전 시험이 필요하지 않습니다 + +1015 +01:15:17,109 --> 01:15:22,309 + 모델과 우리가 정말 쉽게 재발 관계와 놀고 싶어 + +1016 +01:15:22,310 --> 01:15:25,430 + 내가 생각하는 거기 때문에 현재 네트워크에서 작동하는 모든 반환에 아마 수수료 + +1017 +01:15:25,430 --> 01:15:32,570 + 당신이 배치 표준을 구현하려는 경우입니다 흐름은 좋은 선택이 될 수 있습니다 + +1018 +01:15:32,569 --> 01:15:39,769 + 당신이에 의존 할 경우 싶다면 확인 확인 그래서 여기에 그 권리 미안 슬라이드 + +1019 +01:15:39,770 --> 01:15:42,230 + 당신은 그라데이션을 직접 운전하지 않는 당신은이에 의존 할 수 있다면 + +1020 +01:15:42,229 --> 01:15:46,899 + 흐름처럼하지만 방식 때문에의 계산 공예 것들 그 것들 + +1021 +01:15:46,899 --> 01:15:50,089 + 당신이 열정 숙제에서 본대로 작동하거나 실제로을 단순화 할 수 있습니다 + +1022 +01:15:50,090 --> 01:15:54,900 + 꽤 많은 그라데이션 나는 확실하지 않다 이러한 계산 공예 프레임 워크 경우 + +1023 +01:15:54,899 --> 01:15:57,589 + 이 효율적인 양식을 만드는까지 제대로 그라데이션을 단순화 것 + +1024 +01:15:57,590 --> 01:16:09,489 + 문제 + +1025 +01:16:09,488 --> 01:16:13,009 + 나는 질문이 구하기하는 방법 쉬운에 얼마나 쉽게입니다 생각 생각 + +1026 +01:16:13,010 --> 01:16:18,860 + 피아노 
모델 횃불 모델 같은과 나는 고통스러운 듯 보이지만에서 생각 + +1027 +01:16:18,859 --> 01:16:22,819 + fiato에서 적어도 당신은 라자냐 틱 틱 액세스 등 초반 이었죠 모델을 사용할 수 있습니다 + +1028 +01:16:22,819 --> 01:16:26,498 + 빌어 먹을 함께 라자냐 모델은 뭔가 다른 내가 이론적으로 생각하다 + +1029 +01:16:26,498 --> 01:16:31,748 + 당신이 원하는 경우에 당신이 진짜 진짜 좋아해 몇 가지가 있다면 아마 그래서 여기에 쉽게해야한다 + +1030 +01:16:31,748 --> 01:16:35,429 + 정확히 어떻게 당신이 뒤로 패스를 할 방법에 대한 좋은 지식은 계산한다 + +1031 +01:16:35,429 --> 01:16:38,179 + 당신은 당신이 아마 사용하지 않습니다보다 효율적으로 스스로를 구현하려는 + +1032 +01:16:38,179 --> 01:16:43,300 + 토치 당신은 너무에 추천을 자신에게 물어 해당 백업을 구현할 수 있습니다 + +1033 +01:16:43,300 --> 01:16:46,949 + 프레임 워크는 당신은 싶어 아마 기능의 특징 추출을 할 경우, 또는 + +1034 +01:16:46,948 --> 01:16:51,248 + 기존 모델의 미세 조정하거나 간단 바닐라의 전송 + +1035 +01:16:51,248 --> 01:16:54,929 + 작업은 다음 카페 아마 그렇지 사용하기 정말 쉽게 갈 수있는 올바른 방법이다 + +1036 +01:16:54,929 --> 01:16:58,649 + 당신이 초반 이었죠 모델 주위에 작업 할 경우 임의의 코드를 작성해야하지만, + +1037 +01:16:58,649 --> 01:17:02,738 + 어쩌면 프놈펜 당신에게 잘을 초반 이었죠 모델 이상한 물건을하고 있지 + +1038 +01:17:02,738 --> 01:17:07,209 + 라자냐 또는 토치의 더 나은 일을 할 수도 있습니다 것은 가지를에 쉽게있다 + +1039 +01:17:07,210 --> 01:17:11,328 + 초반 이었죠 모델의 구조와 혼란 당신은 당신이 만약 정말로를 원하는 경우 + +1040 +01:17:11,328 --> 01:17:14,788 + 정말 어떤 이유로 자신의 레이어를 작성하려는 당신은 당신을 생각하지 않는다 + +1041 +01:17:14,788 --> 01:17:18,788 + 쉽게 이러한 계산 공예에 들어갈 수있는 것은 다음 아마도 경우 토치를 사용한다 + +1042 +01:17:18,788 --> 01:17:22,948 + 당신은 정말 멋진 일 우리의 강렬하고 어쩌면 다른 유형을 사용하려면 그 + +1043 +01:17:22,948 --> 01:17:26,138 + 계산 그래프에 따라 다음 아마에 어쩌면 요금을 이야기 + +1044 +01:17:26,139 --> 01:17:30,090 + 당신이 거대한 모델이있는 경우 수익률도 낮은이고, 당신이 필요 + +1045 +01:17:30,090 --> 01:17:33,449 + 전체 클러스터에서 배포하고 구글의 내부에 액세스 할 수 있습니다 + +1046 +01:17:33,448 --> 01:17:36,169 + 코드베이스는 그녀 흐름을 사용해야합니다 + +1047 +01:17:36,170 --> 01:17:39,989 + 내가 말했듯이 그 부분은 우리의 나머지 부분에 대한 발표 될 예정이다 희망이 있다고하지만, + +1048 +01:17:39,988 --> 01:17:44,889 + 당신 싶어 사용 텐트가 지루해하는 경우 곧 있도록도 그리고 당신이 너무 느린있어 + +1049 +01:17:44,890 --> 01:17:48,810 + 즉, 그 모든의 꽤 많이 내 내 개요 내 빠른 회오리 바람 투어의의 + +1050 +01:17:48,810 --> 01:17:58,210 + 에 대한 프레임 워크 어떤 그래서 어떤 마지막 순간 질문 질문 질문 + +1051 +01:17:58,210 --> 01:18:02,630 + 그래서 약간의 속도를 비교하는 정말 좋은 페이지가 실제로 거기에 속도 + +1052 +01:18:02,630 --> 01:18:06,039 + 벤치 마크 모든 다른 프레임 워크의 속도와 지금 한 그 + +1053 +01:18:06,039 --> 01:18:10,488 + 승리 승리이 하나 하나도없는 것은이 일이 열반에서 저를 불려 + +1054 +01:18:10,488 --> 01:18:15,049 + 이 사람이 실제로이 녀석을 쓴 있도록 시스템은 미친 그들 + +1055 +01:18:15,050 --> 01:18:20,119 + 실제로 G4 및 비디오 하드웨어에 대한 자신의 정의 어셈블러를 쓴 사람들 + +1056 +01:18:20,119 --> 01:18:22,448 + 같은과 동영상에 만족하지 않았다 + +1057 +01:18:22,448 --> 01:18:26,500 + 툴체인 그들은 다음 유사한의 하드웨어와 회 전자를 리버스 엔지니어링 + +1058 +01:18:26,500 --> 01:18:30,948 + 어셈블리에서 구현 된 모든 커널 자체가 그래서이 사람들이 + +1059 +01:18:30,948 --> 01:18:35,859 + 미친 실제로이 존재하고, 그래서 자신의 물건은 정말 정말 빠릅니다 + +1060 +01:18:35,859 --> 01:18:39,309 + 물건은 지금 실제로 가장 빠른하지만 난 정말 내가 그들의했습니다 사용한 적이 + +1061 +01:18:39,310 --> 01:18:42,510 + 결코 정말 자신의 프레임 워크 나 자신을 사용하고 난 좀 덜 일반적인 생각 + +1062 +01:18:42,510 --> 01:18:47,010 + CUDA와 속도를 사용하는 사람에 대한하지만 이들은 대략이다 + +1063 +01:18:47,010 --> 01:18:52,030 + 같은 지금 나는 약간의 다른 사람보다 10 여분 상당히 느린 생각 + +1064 +01:18:52,029 --> 01:18:55,609 + 내가 생각하는 어리석은 이유는 후속 릴리스에서 만에 정리한다 + +1065 +01:18:55,609 --> 01:18:58,729 + 적어도 근본적으로 당신이해야해야해야해야 할 이유가 없습니다 + +1066 +01:18:58,729 --> 01:19:04,209 + 다른 사람보다 느린 + +1067 +01:19:04,210 --> 01:19:07,319 + 당신의 사람들은 소총을 집어 들고있다 + +1068 +01:19:07,319 --> 01:19:24,279 + 다 좋아 + +1069 +01:19:24,279 --> 01:19:27,198 + 즉 실제로 미친 아니에요 대부분의 팀이 마지막에 대한 꽤있다 + +1070 +01:19:27,198 --> 01:19:29,738 + 올해이 실제로 오프라 프로젝트와 같은 기호를 사용하고 그것을 잘했다 + +1071 +01:19:29,738 --> 01:19:34,658 + 그래 나는 또한 다른 프레임 워크가 있다는 것을 언급해야한다 + +1072 +01:19:34,658 --> 01:19:45,359 + 난 그냥이 가장 
일반적인 질문에 대한 평화를 생각한다 + +1073 +01:19:45,359 --> 01:19:52,299 + 그래서 질문은 잡아 토치 토치가 실제로있다, 그래서 내가 파이썬에 관한 것입니다 + +1074 +01:19:52,300 --> 01:19:56,770 + 대령 실제로 종류의 멋진 나의 성화 노트북에서 사용할 수 있으며, + +1075 +01:19:56,770 --> 01:20:00,150 + 실제로 당신은 실제로 밤 또는 두 개의 노트북 만에 잡는 몇 가지 간단한 작업을 수행 할 수 있습니다 + +1076 +01:20:00,149 --> 01:20:04,899 + 난 보통 내 데이터가 내 토치 모델은 데이터를 덤프 실행 덤프됩니다 않는 것을 실천 + +1077 +01:20:04,899 --> 01:20:09,899 + 심지어 JSON의 HDL 5 파이썬에서 시각화에 조금 조금이다 + +1078 +01:20:09,899 --> 01:20:19,359 + 고통스러운하지만 당신은 작업이 완료 얻을 수 있습니다 + +1079 +01:20:19,359 --> 01:20:23,309 + 문제는 지루 tenser 당신이 넣을 수 있습니다 당신은 원시 데이터를 덤프 할 수 있는지 여부입니다 + +1080 +01:20:23,310 --> 01:20:28,300 + 거기에 자신이 실제로 그들은 실제로 일부 로그에 모든 물건을 덤핑하고 + +1081 +01:20:28,300 --> 01:20:33,050 + 임시 디렉토리에있는 파일 나는 사람들은 드문 드문 얼마나 쉽게 잘 모르겠어요하지만 당신은 할 수 + +1082 +01:20:33,050 --> 01:20:45,900 + 쉽게 할 수있는 시도하거나 나는 문제가 있는지 있는지 질문하지 아니에요 + +1083 +01:20:45,899 --> 01:20:49,899 + 이 현대적인 네트워크를위한 텐서 보드의 다른 타사 도구입니다 + +1084 +01:20:49,899 --> 01:20:53,269 + 거기서 몇 가지있을 수 있습니다하지만 난 정말 그들을 사용한 적이 난 그냥 내 자신을 읽기 + +1085 +01:20:53,270 --> 01:20:58,159 + 과거 다른 질문 + +1086 +01:20:58,158 --> 01:21:00,319 + 확실히 나는 그게 생각 생각 + diff --git a/captions/Ko/Lecture13_ko.srt b/captions/Ko/Lecture13_ko.srt new file mode 100644 index 00000000..8022bfca --- /dev/null +++ b/captions/Ko/Lecture13_ko.srt @@ -0,0 +1,3672 @@ +1 +00:00:00,000 --> 00:00:06,878 + 그래서 오늘 우리의 관리 포인트는 할당 3 때문에 오늘 밤은 너무하다 + +2 +00:00:06,878 --> 00:00:14,399 + 그 희망 좋은 확인에 할당보다 쉽게​​ 될 것 이루어졌다 + +3 +00:00:14,400 --> 00:00:18,320 + 당신도 기억 귀하의 프로젝트를 수행하는 데 더 많은 시간을 제공하여 + +4 +00:00:18,320 --> 00:00:22,500 + 우리가에있어, 그래서 우리 못했습니다 된 이정표가 지난 주에 반환 + +5 +00:00:22,500 --> 00:00:25,028 + 사람들은 확인 있는지 확인하기 위해 이정표를 통해보고의 과정과 + +6 +00:00:25,028 --> 00:00:28,609 + 또한 우리가 만드는 임무에 최선을 다하고 그래서 우리는 그 짓을해야한다 + +7 +00:00:28,609 --> 00:00:32,289 + 언젠가 이번 주 또는 다음 주 초 + +8 +00:00:32,289 --> 00:00:36,329 + 마지막으로 우리는 모든 축구의 회오리 바람 투어 모든 일반적인 소프트웨어를했다 + +9 +00:00:36,329 --> 00:00:40,058 + 사람들이 깊은 학습을 위해 사용하고 우리가 코드를 많이 보았다 패키지 + +10 +00:00:40,058 --> 00:00:43,468 + 슬라이드와 당신이 그것을 발견 잘하면 코드를 단계별로 및 많은 + +11 +00:00:43,469 --> 00:00:48,730 + 프로젝트에 유용 오늘 우리는 다른 주제에 대해 이야기하는거야 + +12 +00:00:48,729 --> 00:00:53,308 + 우리가 분할에서 분할에 대한 거 얘기있어 두 가지가 있습니다 + +13 +00:00:53,308 --> 00:00:57,488 + 우리는 또한 말할거야 의미 인스턴트 분할을 하위 문제 + +14 +00:00:57,488 --> 00:01:01,509 + 에 대한 부드러운 관심과 부드러운주의에서 다시 그들은 일종의 두의 것 + +15 +00:01:01,509 --> 00:01:07,069 + 우리는 그러나로 일을 분할 한 양동이 우리는이 들어갈 전에 먼저 + +16 +00:01:07,069 --> 00:01:12,849 + 나는 이것이 그래서 간단히 불러하려면 뭔가 다른 세부 사항이 있었다 + +17 +00:01:12,849 --> 00:01:16,769 + 이미지 분류 오류가 나는 당신이 보았던 클래스의이 시점에서 생각 + +18 +00:01:16,769 --> 00:01:23,079 + 그림의 종류가 여러 번 바로 그래서 2012 알렉스 2013 ZF 그것을 분쇄하지 + +19 +00:01:23,079 --> 00:01:29,118 + 최근 구글 매트 이상 ResNet 도움 그러나 경이 분류의 일종이다 + +20 +00:01:29,118 --> 00:01:37,400 + 새로운 이미지 최종 결과가 너무가 오늘로 2015 년에 도전하지만 밝혀 + +21 +00:01:37,400 --> 00:01:41,140 + 이 논문은 지난 밤에 나왔다 + +22 +00:01:41,140 --> 00:01:48,609 + 그래서 구글은 실제로 이미지에 예술의 현재 상태를 갖는다 3.08 %, 상위 5 오차 + +23 +00:01:48,609 --> 00:01:55,560 + 어떤 미친하고이 작업을 수행하는 방법은에서 부르는이 일을 함께 + +24 +00:01:55,560 --> 00:01:59,900 + 이 괴물의 조금 전에 섹션 그래서 나는 너무로 가고 싶지 않아 + +25 +00:01:59,900 --> 00:02:05,280 + 많은 세부하지만 당신은이가이 정말 깊은 네트워크의 것을 볼 수 있습니다 + +26 +00:02:05,280 --> 00:02:11,150 + 반복 모듈 그래서 여기 줄기 줄기 거기에 여기에 이​​상이 사람이다 + +27 +00:02:11,150 --> 00:02:14,789 + 이 아키텍처에 대해 지적하는 몇 가지 흥미로운 일들이 수도 실제로 + +28 +00:02:14,789 --> 00:02:18,979 + 즉 수 있도록 그들이 패딩이 없음을 의미하는 일부 균형 회선을 사용 + +29 +00:02:18,979 --> 00:02:22,229 + 모든 모든 수학 더 복잡하지만 그들은 똑똑하고 나는 일을 알아 냈 + +30 +00:02:22,229 --> 00:02:27,299 + 그들은 또한 
여기에 흥미로운 기능은 실제로 병렬로 가지고있다가 + +31 +00:02:27,300 --> 00:02:31,459 + 귀에 거슬리는 회선도 최대 Pooley은 그래서 그들은 종류의이 두 가지 작업을 할 + +32 +00:02:31,459 --> 00:02:34,900 + 아래 샘플 이미지 평행 다음의 종류와 연결할 + +33 +00:02:34,900 --> 00:02:39,909 + 다른 것은 그들이 정말이 효율적으로 모든 외출하고있다 + +34 +00:02:39,909 --> 00:02:43,389 + 당신이 볼 수 있도록 우리가 전에 몇 강의에 대해 이야기 회선 점검 + +35 +00:02:43,389 --> 00:02:47,518 + 그들은 실제로 일곱 일곱에 의해 같은 이러한 비대칭 필터를했습니다 + +36 +00:02:47,519 --> 00:02:51,750 + 하나의 회선 그들은 또한이 하나씩 길쌈을 많이 사용합니다 + +37 +00:02:51,750 --> 00:02:56,449 + 이것은 단지 줄기 그래서 병목 계산 비용을 줄이기 위해 + +38 +00:02:56,449 --> 00:03:01,939 + 그들이 가지고 있도록 네트워크와 실제로 각이 부품은 종류의 다른 + +39 +00:03:01,939 --> 00:03:07,769 + 남용 개시 모듈이되어 있지만 아래로 다음의 어떤​​ 일곱 모듈을 샘플링 + +40 +00:03:07,769 --> 00:03:11,599 + 이 사람하고 다른 다운 샘플링 모듈과의 다음 3 개의 + +41 +00:03:11,599 --> 00:03:16,889 + 다음이 사람과 마침내 그들은 중퇴하고 완전히 은신처를 연결 + +42 +00:03:16,889 --> 00:03:20,919 + 지적하는 또 다른 것은 그래서 클래스 레이블 다시의 어떤 종류가 없습니다 + +43 +00:03:20,919 --> 00:03:24,859 + 완전히 연결 히틀러의 이곳은 단지 계산이 세계 평균이 + +44 +00:03:24,860 --> 00:03:29,320 + 마지막 특징 벡터 그들이 본 논문에서 한 또 다른 멋진 일이었다 + +45 +00:03:29,319 --> 00:03:34,900 + 처음 거주자들은 셉션이 잔류 버전을 제안 할 수 있도록 + +46 +00:03:34,900 --> 00:03:39,579 + 또한 줄기 꽤 크고 무서운 아키텍처는 이전과 동일 + +47 +00:03:39,579 --> 00:03:43,950 + 이제 이러한 잔기는 반복 개시 블록 반복 + +48 +00:03:43,949 --> 00:03:48,289 + 네트워크를 통해 그들은 실제로 이러한 잔류 연결 그래서이 + +49 +00:03:48,289 --> 00:03:51,409 + 즉, 그 종류의이 잔류 생각에 점프의 그 종류를 냉각 것입니다 + +50 +00:03:51,409 --> 00:03:55,609 + 지금하지 그래서 다시 그들은 많은 반복이있는 아트 이미지의 상태를 개선 + +51 +00:03:55,610 --> 00:04:00,880 + 모듈 및 모든 업이 일을 추가 할 때 내가했던 가정에 대한 7875 층의 + +52 +00:04:00,879 --> 00:04:07,939 + 수학은 바로 그렇게 그들은 또한 마지막 밤을 표시하는 새로운 자신의 종류의 사이 + +53 +00:04:07,939 --> 00:04:12,680 + 처음하지만 인 셉션 Google지도 및 잔류의 새로운 버전 4 + +54 +00:04:12,680 --> 00:04:17,079 + 실제로 모두 동일한 대해 수행 구글지도 버전이므로 + +55 +00:04:17,079 --> 00:04:22,909 + 지금 당신이 볼 수있는이 이미지에 신 (新) 시대의 함수로 진정한 상위 5 공기를하다 + +56 +00:04:22,910 --> 00:04:28,070 + 발단 네트워크와 실제로 빠른 비트를 수렴 읽을 수는 있지만 + +57 +00:04:28,069 --> 00:04:33,180 + 같은 값에 대한 일종의 대화 그들 모두에게 시간에 너무 + +58 +00:04:33,180 --> 00:04:38,340 + 그 가지 종류의 다른 일을 멋진 흥미로운의 종류 있다는 + +59 +00:04:38,339 --> 00:04:42,369 + 지적 흥미로운이 문서는 여기에서 x 축에 원료 수있다 + +60 +00:04:42,370 --> 00:04:46,030 + 이들은 지금 이러한 일들이 백을 위해 훈련중인 이미지에 수두 있습니다 + +61 +00:04:46,029 --> 00:04:52,089 + 그리고 그건 육십 파키스탄 이미지 그물 그래서 그 훈련에 많은 시간이다하지만 그건 + +62 +00:04:52,089 --> 00:04:55,469 + 즉, 현재 이벤트의 충분과의 정기적으로 예약 된 우리로 돌아 가자 + +63 +00:04:55,470 --> 00:05:02,710 + 프로그래밍은 그래서 오늘 오, 그래 질문에 나는 그것이 될 것 같아요 모른다 + +64 +00:05:02,709 --> 00:05:11,789 + 종이 그러나 나는주의 깊게 읽어하지 않았고 러시아의 다른 질문에 드롭 수 + +65 +00:05:11,790 --> 00:05:16,600 + 이 마지막 층에서만인지 잘 모르겠어요에서 다시 난을 읽어 보지 않았 + +66 +00:05:16,600 --> 00:05:21,620 + 종이에 조심스럽게 아직하지만 링크가 당신이 그것을 확인해야합니다 여기 있어요 + +67 +00:05:21,620 --> 00:05:29,600 + 확인 그래서 오늘 우리는 두 개의 다른 주제의 종류에 대해 얘기하는거야 + +68 +00:05:29,600 --> 00:05:33,970 + 그래서 그 생각 일반적인 것들과 연구 요즘 분할이다 + +69 +00:05:33,970 --> 00:05:37,490 + 이는 고전적인 컴퓨터 비전 주제의이 종류도 이런 생각입니다 + +70 +00:05:37,490 --> 00:05:41,550 + 내가 생각하는주의는 정말로 일하기 정말 인기있는 일이있다한다 + +71 +00:05:41,550 --> 00:05:46,060 + 우리가 거​​ 얘기있어 특히 첫 있도록 지난 한 해 동안 깊은 애도의 + +72 +00:05:46,060 --> 00:05:50,889 + 분할에 대해 당신은 몇에서이 슬라이드를 기억했을 수 있도록 + +73 +00:05:50,889 --> 00:05:53,649 + 강의 전에 우리에 대해 얘기했다 물체 검출에 대해 이야기 + +74 +00:05:53,649 --> 00:05:58,000 + 사람들이 컴퓨터 비전에서 일을하고 우리가 많이 소비 다른 작업 + +75 +00:05:58,000 --> 00:06:02,259 + 수업 시간 강의에서 다시 분류에 대해 이야기하고 우리 + +76 +00:06:02,259 --> 00:06:03,750 + 다른 모델에 대해 이야기 + +77 +00:06:03,750 --> 00:06:08,339 + 현지화 및 물체 감지하지만 오늘 우리는에 실제로거야 초점을거야 + +78 +00:06:08,339 --> 00:06:12,239 + 우리가 이전에 마지막으로 
시간이 지남에 따라 생략 분할이 생각 + +79 +00:06:12,240 --> 00:06:18,189 + 두 개의 서로 다른 일부 작업의 종류 거기에 분할 이내에 강의하는 우리 + +80 +00:06:18,189 --> 00:06:21,870 + 우리가 정의 할 필요가 사람들이 실제로 이런 일에 작동하는지 확인 필요 + +81 +00:06:21,870 --> 00:06:26,389 + 조금 별도로 첫 번째 작업은 의미 분할이라는 생각이다 + +82 +00:06:26,389 --> 00:06:32,370 + 그래서 여기에 우리는 우리가 입력 영상이 끝을 가지고 우리의 일부 사진 번호가 + +83 +00:06:32,370 --> 00:06:38,000 + 어떤 종류의 건물과 나무와 지상 및 암소와 같은 종류의 것 + +84 +00:06:38,000 --> 00:06:42,629 + 당신은 일반적으로 원하는 의미 라벨은 또한 클래스의 몇 가지 작은 고정 번호가 + +85 +00:06:42,629 --> 00:06:46,199 + 일반적으로는 적합하지 않다 최초의 것들에 대한 몇 가지 배경 클래스를해야합니다 + +86 +00:06:46,199 --> 00:06:51,360 + 이러한 클래스에 다음 작업은 우리가 입력 인치로 수행 할 것입니다 + +87 +00:06:51,360 --> 00:06:55,240 + 그리고, 우리는 이러한 의미 중 하나를 사용하여 해당 이미지의 모든 픽셀에 라벨을 할 + +88 +00:06:55,240 --> 00:06:59,850 + 우리가 현장에서 이러한 세 소의이, 입력 화상을 촬영 한 여기에서 매우 클래스 + +89 +00:06:59,850 --> 00:07:05,490 + 그리고 이상적인 출력 대신 인 RGB 값이 이미지는 우리가 실제로 + +90 +00:07:05,490 --> 00:07:11,228 + 우리는 이것과 다른 이미지와 아마 세그먼트를 할 수있는 픽셀 당 하나의 클래스 레이블이 + +91 +00:07:11,228 --> 00:07:16,789 + 나무와 하늘과 도로에서 잔디 때문에 작업이 유형의 예쁜 + +92 +00:07:16,790 --> 00:07:19,950 + 멋진 나는 당신에게 무엇의 이해의 높은 수준의 종류를 제공합니다 생각 + +93 +00:07:19,949 --> 00:07:23,029 + 단지 전체에 단일 레이블을 넣어 비교 이미지에서 진행 + +94 +00:07:23,029 --> 00:07:28,668 + 영상이 너무 이것 실제로 컴퓨터 비전에서 아주 오래 된 문제 + +95 +00:07:28,668 --> 00:07:32,649 + 이 그림은 실제로 온다, 그래서 그 깊은 학습 혁명의 종류를 선행 + +96 +00:07:32,649 --> 00:07:37,259 + 컴퓨터 비전은 2007 년 문고에서 그 어떤 깊은 학습을 사용하지 않은 + +97 +00:07:37,259 --> 00:07:43,728 + 모든 사람에서 다른에게 몇 년 전에 작업을이 다른 방법을 가지고 그 + +98 +00:07:43,728 --> 00:07:48,949 + 사람들은이 일을 여기서 지적하는 것은이 때문이다 바로 그래서입니다 작업 + +99 +00:07:48,949 --> 00:07:54,310 + 일이이 이미지는 실제로이 집 또는 네 그래서 여기에 인스턴스를 인식하지 못합니다 + +100 +00:07:54,310 --> 00:07:58,329 + 실제로 좀 누워 서 세 소와 한 소 소 거기 + +101 +00:07:58,329 --> 00:08:02,300 + 이 출력 여기에 낮잠 만 복용 잔디는 정말 분명하지 않다 얼마나 많은 + +102 +00:08:02,300 --> 00:08:07,560 + 소는 서로 다른 소가 픽셀 그래서 여기에 겹쳐 실제로있다 + +103 +00:08:07,560 --> 00:08:11,540 + 가 없다는 출력이 다른 소 있다는 것을 + +104 +00:08:11,540 --> 00:08:15,480 + 그것은 어쩌면 같은 정보를하지 그래서 우리가 모든 픽셀을 라벨링하고 출력을 그리워 + +105 +00:08:15,480 --> 00:08:20,009 + 당신이 좋아할 것하고 실제로 일부에 대한 몇 가지 문제가 발생할 수 있으므로 + +106 +00:08:20,009 --> 00:08:23,409 + 그것은 이것을 극복 그래서 다운 스트림 응용 프로그램 + +107 +00:08:23,410 --> 00:08:28,080 + 사람들은 개별적으로 인스턴트 분할이라는이나 문제에 작동 한 + +108 +00:08:28,079 --> 00:08:32,039 + 이것은 또한 때때로 동시 탐지 및 분할 호출되는 + +109 +00:08:32,039 --> 00:08:37,879 + 그래서 여기에 문제는 우리가 클래스의 일부 세트가 이전에 어딘가에서 다시가요 + +110 +00:08:37,879 --> 00:08:43,370 + 그 인식하려고 우리가 모든 출력을 원하는 입력 영상을 받았다 + +111 +00:08:43,370 --> 00:08:48,370 + 이러한 클래스의 각 인스턴스에 대한 인스턴스 우리는 픽셀 밖으로 세그먼트를 원하는 + +112 +00:08:48,370 --> 00:08:52,970 + 그가이 입력 이미지 때문에 여기에 해당 인스턴스에 속하는 + +113 +00:08:52,970 --> 00:08:57,509 + 지금의 두 부모와 아이 실제로 세 가지 다른 사람들이있다 + +114 +00:08:57,509 --> 00:09:00,860 + 우리가 실제로에서와 다른 사람들을 구분 출력 + +115 +00:09:00,860 --> 00:09:05,279 + 그 세 사람들이 지금 서로 다른 색으로 표시되는 입력 영상 + +116 +00:09:05,279 --> 00:09:09,360 + 다른 인스턴스를 표시하고 다시 해당 인스턴스 각각에 대해 우리는거야 + +117 +00:09:09,360 --> 00:09:14,009 + 해당 인스턴스에 속하는 입력 이미지의 모든 픽셀을 레이블을 너무 + +118 +00:09:14,009 --> 00:09:18,639 + 의미 분할 실제로 인스턴트 분할 사람들이 두 작업 + +119 +00:09:18,639 --> 00:09:22,409 + 조금 별도로에서 일한 우리가 거​​ 얘기 야 처음 그래서 + +120 +00:09:22,409 --> 00:09:27,269 + 의미 론적 분할에 대한 일부 모델에 대해 그래서 이것은 기억 + +121 +00:09:27,269 --> 00:09:30,399 + 당신을위한 작업은 이미지의 모든 픽셀에 레이블을 원하는 당신은 상관 없어 + +122 +00:09:30,399 --> 00:09:38,230 + 그래서 여기에 인스턴스에 대한 생각은 일부 입력 주어진 실제로 매우 간단하다 + +123 +00:09:38,230 --> 00:09:43,269 + 이미지 이것은 우리가거야 소와의 마지막은 일부 작은 패치를 가지고있다 + +124 +00:09:43,269 --> 00:09:48,720 + 입력 이미지와는 종류의 현지 정보를 제공하는이 패치를 추출 + +125 +00:09:48,720 --> 
00:09:53,340 + 이미지는 우리는거야이 패치를 가지고 우리는 몇 가지를 통해거야 공급을거야 + +126 +00:09:53,340 --> 00:09:57,230 + 콘벌 루션 신경망이 아키텍처 중 우리가된다는 것입니다 수 + +127 +00:09:57,230 --> 00:10:01,070 + 지금까지 클래스 이제이 이야기 + +128 +00:10:01,070 --> 00:10:04,890 + 콘벌 루션 신경망 실제로 중심 화소 A를 분류 할 + +129 +00:10:04,889 --> 00:10:10,080 + 패치 그래서이 신경 네트워크는 우리가 알고있는 뭔가 저스틴 분류입니다 + +130 +00:10:10,080 --> 00:10:14,379 + 그래서이 일을 그냥 말 것입니다 수행하는 방법이 파견의 중심 픽셀 + +131 +00:10:14,379 --> 00:10:19,769 + 실제로 우리가 작동이 네트워크를 복용 상상할 수있는 것보다 소입니다 + +132 +00:10:19,769 --> 00:10:20,710 + 패치 + +133 +00:10:20,710 --> 00:10:26,019 + 및 중앙 픽셀 레이블 그리고 우리는 단지 전체 이미지에 걸쳐 것을 실행 + +134 +00:10:26,019 --> 00:10:33,269 + 이 실제로 매우 그래서 우리에게 이미지의 각 픽셀에 대한 레이블을 제공합니다 + +135 +00:10:33,269 --> 00:10:36,699 + 이제 이미지의 많은 많은 패치를 거기에 바로 있기 때문에 비용이 많이 드는 작업 + +136 +00:10:36,700 --> 00:10:40,120 + 그리고 그것은 모두 독립적으로이 네트워크를 실행하는 슈퍼 슈퍼 비싼 것 + +137 +00:10:40,120 --> 00:10:44,139 + 그들의 연습 사람들이 우리가 물체를 보았다 같은 트릭을 사용 그렇게 + +138 +00:10:44,139 --> 00:10:48,639 + 완전히 길쌈 II를이 일을 실행하고거야 검출 모든 + +139 +00:10:48,639 --> 00:10:54,220 + 한 번에 전체 이미지 그러나 여기에서 문제에 대한 출력은 당신이 인 경우이다 + +140 +00:10:54,220 --> 00:10:58,879 + 컨볼 루션 네트워크 풀링에 또는 어느 샘플링 다운의 종류를 포함 + +141 +00:10:58,879 --> 00:11:02,899 + 다음 이제 출력 출력 이미지 것 스트라이커 회선을 통해 + +142 +00:11:02,899 --> 00:11:07,139 + 그 그건 그래서 실제로 작은 공간 크기와 입력 이미지가 + +143 +00:11:07,139 --> 00:11:09,929 + 그들은 이러한 유형의를 사용할 때 사람들이 주위에 일을해야 할 일 + +144 +00:11:09,929 --> 00:11:14,629 + 접근의 의미에 대한 기본 설정의이 종류에 해당하므로, 어떠한 질문 + +145 +00:11:14,629 --> 00:11:28,208 + 분할 그래 + +146 +00:11:28,208 --> 00:11:32,979 + 문제는 팻 팻 팻 옳은 일을 그냥 충분히 제공하지 않습니다 여부 + +147 +00:11:32,980 --> 00:11:37,800 + 어떤 경우에는 정보와 그 다음 이러한 위해 이렇게 때때로 진실 + +148 +00:11:37,799 --> 00:11:41,688 + 네트워크 사람들은 실제로 그들이 가지고 별도의 오프라인 정제 단계를 + +149 +00:11:41,688 --> 00:11:44,980 + 다음이 출력은 최대 청소 그래픽 모델의 일종으로 공급 + +150 +00:11:44,980 --> 00:11:48,028 + 밀어 도움이 될 수 있도록 때때로 출력 조금 침대를 정리하여 + +151 +00:11:48,028 --> 00:11:52,838 + 입력 - 출력 모델에 대한 좀 더 나은 성능을하지만 공의에 텐트를 설정 + +152 +00:11:52,839 --> 00:12:09,600 + 그냥 그래 난 당신이 아니에요 필요 해요 구현하기 쉬운 무언가로 꽤 잘 작동 + +153 +00:12:09,600 --> 00:12:13,019 + 확실히 나는 정확히 아마 꽤 큰 어쩌면 몇 백 200 모르겠어요 + +154 +00:12:13,019 --> 00:12:19,919 + 크기 때문에 하나의 확장의 순서로하는 사람들이 사용했다고하는 것이 픽셀 + +155 +00:12:19,919 --> 00:12:23,289 + 기본적인 접근 방식은 실제로 때때로 다중 스케일 시험이 좋습니다 + +156 +00:12:23,289 --> 00:12:28,230 + 단일 규모는 우리가 우리의 입력 이미지를 데리고가는 것하고 그래서 여기 충분하지 않습니다 + +157 +00:12:28,230 --> 00:12:33,009 + 이 공통 트릭의 일종이다, 그래서 실제로는 여러 다른 크기로 크기를 조정 + +158 +00:12:33,009 --> 00:12:36,688 + 사람들이 컴퓨터 비전에 사용하는 것이 많은 이미지 피라미드 방금 걸릴라고 + +159 +00:12:36,688 --> 00:12:41,458 + 같은 차원과 당신은 지금 많은 다른 스케일을 만든 및 크기 조정 + +160 +00:12:41,458 --> 00:12:44,528 + 이러한 비늘 각각은 컨볼 루션 신경망을 통해 실행 된 거 + +161 +00:12:44,528 --> 00:12:49,568 + 그 다음 사진을 보호하는 것입니다 서로 다른 이미지의 현명한 라벨은 + +162 +00:12:49,568 --> 00:12:52,969 + 이러한 서로 다른 해상도의 너무 다른 점은 여기에 따라 지적합니다 + +163 +00:12:52,970 --> 00:12:56,249 + 질문의 라인이 각각의 네트워크가 실제로 동일한 경우 해당 + +164 +00:12:56,249 --> 00:12:59,639 + 아키텍처는 이러한 출력은 각각 다른 영향을 미칠 것입니다 + +165 +00:12:59,639 --> 00:13:04,490 + 그래서 지금 우리가 입수 한 것을 입력 2222 이미지 피라미드 수용 필드 + +166 +00:13:04,490 --> 00:13:08,720 + 우리는 모두를 취할 수있는 것보다 의도에 대해이 다른 크기의 픽셀 라벨 + +167 +00:13:08,720 --> 00:13:13,660 + 그들과 리사와 그냥 그 응답을 샘플링하는 일부 오프라인 업 샘플링을 + +168 +00:13:13,659 --> 00:13:18,129 + 그래서 지금 우리는 우리의 3 개의 출력을 쪘 입력 이미지와 동일한 크기로 + +169 +00:13:18,129 --> 00:13:24,319 + 다른 샘플을 크기 및 그들과이 글이 실제로 종이를 쌓아 + +170 +00:13:24,318 --> 00:13:29,139 + 다시 2013 년 검둥이에서 그래서 그들은 실제로이 별도의이 + +171 +00:13:29,139 --> 00:13:33,709 + 오프라인 처리 STAP 그들은 상향식 (bottom-up) 분할이 생각하는 곳 + +172 +00:13:33,708 --> 
00:13:39,119 + 이러한 이상이 일종의 그래서 이러한 슈퍼 픽셀 방법을 사용하여 권한을 사용하여 + +173 +00:13:39,120 --> 00:13:41,370 + 고전 컴퓨터 비전 화상 처리 유형 + +174 +00:13:41,370 --> 00:13:45,470 + 실제로 인접 픽셀 사이의 차이를 보면 방법 + +175 +00:13:45,470 --> 00:13:48,589 + 다음 둘을 병합하려고 이미지는 당신이 일관성을 제공합니다 + +176 +00:13:48,589 --> 00:13:52,900 + 그럼 실제로이 방법 이미지에별로 변화가있는 지역 + +177 +00:13:52,899 --> 00:13:56,519 + 소요 종류의 이러한 다른 전통적인을 통해 이미지를 오프라인으로 실행 + +178 +00:13:56,519 --> 00:14:02,230 + 이미지 처리 기술은 슈퍼 픽셀 또는 나무의 세트를 얻을 수 + +179 +00:14:02,230 --> 00:14:06,629 + 화소 말은 이미지에 병합되어야한다 그리고 그들은이 있습니다 + +180 +00:14:06,629 --> 00:14:09,519 + 이러한 모든 다른 일을 병합하는 다소 복잡한 과정 + +181 +00:14:09,519 --> 00:14:13,028 + 이제 우리는 낮은 수준의 정보 말 이런 종류의 쪘 유발하는 + +182 +00:14:13,028 --> 00:14:14,110 + 이미지의 픽셀 + +183 +00:14:14,110 --> 00:14:18,909 + 실제로 색상과 좋은 정보의 종류에 따라 서로 유사하다 + +184 +00:14:18,909 --> 00:14:22,439 + 우리는 길쌈에서 서로 다른 해상도의 이러한 출력을 가지고있어 + +185 +00:14:22,440 --> 00:14:25,810 + 신경망 레이블이 다른 지점에있다 의미 무엇인지 우리에게 이야기 + +186 +00:14:25,809 --> 00:14:29,929 + 이미지에 그들은 실제로에 대한 몇 가지 아이디어를 탐구 할 수 사용 + +187 +00:14:29,929 --> 00:14:33,870 + 함께이 일을 합병하는 것은 당신에게 당신의 마지막을 어떻게이 사실은 알아주는 + +188 +00:14:33,870 --> 00:14:38,419 + 나는 갈등에 대한 이전 질문 중 하나가 해결 될 때 또한 응답 + +189 +00:14:38,419 --> 00:14:43,809 + 그래서 이러한 외부 읽기 슈퍼 픽셀의 방법 또는를 사용하여 자체적으로 충분히있는 + +190 +00:14:43,809 --> 00:14:47,729 + 분할 나무 종류의 당신에게 추가 정보를 제공하는 또 다른 일이 + +191 +00:14:47,730 --> 00:14:55,649 + 입력 이미지에 대한 어쩌면 더 큰 문맥이 약하므로, 어떠한 질문 + +192 +00:14:55,649 --> 00:15:03,879 + 확인 모델을 다른 사람들이 의미에 사용되는 것을 멋진 아이디어를 다른 종류의 이렇게 + +193 +00:15:03,879 --> 00:15:08,299 + 이에 분할은 반복 정제의이 아이디어는 우리가 실제로보고있다 + +194 +00:15:08,299 --> 00:15:12,809 + 우리가 얘기 할 때이 몇 강의 전에 언급 한 많은 추정을 제기하지만, + +195 +00:15:12,809 --> 00:15:17,149 + 아이디어는 우리가거야 그들이 밖으로 분리 여기에 입력 된 이미지를 가지고 있다는 것입니다 + +196 +00:15:17,149 --> 00:15:20,929 + 세 가지 채널 그리고 우리는 우리의 마음에 드는 종류에 대한 그 일 거 실행있어 + +197 +00:15:20,929 --> 00:15:24,929 + 콘벌 루션 신경망이 저해상도 패치를 예측할 + +198 +00:15:24,929 --> 00:15:30,309 + 오히려 이미지의이없는 해상도 분할을 예측하고 지금 우리는있어 + +199 +00:15:30,309 --> 00:15:34,899 + 거야의 다운 샘플링 된 버전과 함께 CNN에서 해당 출력을 + +200 +00:15:34,899 --> 00:15:38,829 + 우리가이 과정을 반복합니다 및 원본 이미지를 다시 때문에이 허용 + +201 +00:15:38,830 --> 00:15:43,990 + 네트워크는 출력의 유효 수용 필드를 증가 포도주를 정렬하려면 + +202 +00:15:43,990 --> 00:15:48,399 + 또한 수행하거나 처리를의 입력 이미지를 그리고, 우리는 할 수있는 + +203 +00:15:48,399 --> 00:15:54,009 + 이 좀 멋진 그래서 다시이 과정을 반복이 세 가지를 그렇다면 + +204 +00:15:54,009 --> 00:15:54,769 + 길쌈 + +205 +00:15:54,769 --> 00:15:58,249 + 네트워크는 실제로 다음이 재발 길쌈가된다 가중치를 공유 + +206 +00:15:58,249 --> 00:16:03,489 + 네트워크 어디 종류의 전복 시간에 동일한 입력으로 작동하지만, + +207 +00:16:03,489 --> 00:16:07,528 + 실제로 이러한 업데이트 단계는 각각의 전체 컨벌루션 네트워크는 + +208 +00:16:07,528 --> 00:16:10,139 + 실제로 매우 유사한 아이디어는 우리가 본 네트워크를 재발하기 + +209 +00:16:10,139 --> 00:16:18,789 + 이전에 2014 년에 있었던이 논문의 뒤에 아이디어는 것입니다 당신이 경우 + +210 +00:16:18,789 --> 00:16:22,558 + 실제로는 수 잘하면 것은 동일한 유형의 더 많은 반복을 수행 + +211 +00:16:22,558 --> 00:16:28,219 + 네트워크는 일종의 반복적 그래서 여기에 우리가있는 경우 그 출력을 수정하기 + +212 +00:16:28,220 --> 00:16:32,220 + 이 원시 입력 이미지는 한 세대 후에는 실제로 볼 수 있습니다 + +213 +00:16:32,220 --> 00:16:35,959 + 특히 객체의 경계에 있지만 같은 소음이 꽤가있다 + +214 +00:16:35,958 --> 00:16:39,359 + 우리는이 재발 길쌈을 통해 둘, 셋, 반복에 대해 실행 + +215 +00:16:39,360 --> 00:16:42,769 + 네트워크 실제로는 그런 종류의 많은 정리하기 위해 네트워크를 할 수 있습니다 + +216 +00:16:42,769 --> 00:16:46,989 + 낮은 수준의 불쾌 및 생산 훨씬 청소기 훨씬 깨끗하고 더 멋진 결과 + +217 +00:16:46,989 --> 00:16:51,119 + 그래서 나는 함께 이러한 병합의 종류 아주 아주 멋진 아이디어라고 생각했다 + +218 +00:16:51,119 --> 00:16:55,199 + 재발 네트워크의 아이디어와이 아이디어를 시간이 지남에 따라 가중치를 공유 + +219 +00:16:55,198 --> 00:17:03,479 + 컨볼 루션 네트워크는 아주 잘 매우 광범위하게 다른 그래서 이미지를 다른를 처리하는 + 
+220 +00:17:03,480 --> 00:17:07,470 + 의미 론적 분할에 매우 매우 잘 알려진 논문은 버클리에서이 하나입니다 + +221 +00:17:07,470 --> 00:17:12,419 + 그는 CBP에 게시 된 우리의 지난해 그래서 여기에 그것은 매우 유사한 모델이다 + +222 +00:17:12,419 --> 00:17:16,850 + 우리는 입력 영상을 가지고 어떤 수를 통해 실행하는거야 전에 + +223 +00:17:16,849 --> 00:17:22,259 + 회선은 결국 화소하지만 일부 일부 기능지도를 추출 + +224 +00:17:22,259 --> 00:17:26,638 + 모든 하드 코딩 된 업 샘플링 이런 종류의에 의존하는 기존 방법 대비 + +225 +00:17:26,638 --> 00:17:31,138 + 실제로 에너지하지만이에 최종 분할을 생산하는 + +226 +00:17:31,138 --> 00:17:34,668 + 종이 그들은 잘 우리는 우리가 우리가 원하는 깊은 학습 사람들이야이야 제안 + +227 +00:17:34,669 --> 00:17:39,149 + 우리가 거​​ 네트워크의 일환으로 업 샘플링을 배울 수있는, 그래서 모든 것을 배울 그래서 + +228 +00:17:39,148 --> 00:17:43,298 + 그들은없는 일이 마지막 층이 최대 학습 가능 샘플링에서이 포함되어있어 + +229 +00:17:43,298 --> 00:17:50,798 + 이 실제로 최대 샘플 때문에 학습 가능 방식으로 기능지도 예 + +230 +00:17:50,798 --> 00:17:55,179 + 그들은 마지막에 샘플링지도와 길을 자신의 모델 종류까지왔다 + +231 +00:17:55,179 --> 00:17:59,940 + 그들은 그것을하지 그래서 그들은이 알렉스 된 시간에이를 것으로되어 보이는 그들의 + +232 +00:17:59,940 --> 00:18:04,090 + 회선 및 당기 및 여러 단계의 실행 입력 영상 + +233 +00:18:04,089 --> 00:18:08,028 + 결국 그들은 그들이 가지고있는이 풀 (5) 출력에서​​ 생산 꽤 + +234 +00:18:08,028 --> 00:18:12,048 + 샘플 이미지 아래 특별한 크기 샘플링 상당히 아래로 입력 화상과 비교하여 + +235 +00:18:12,048 --> 00:18:16,999 + 다음까지 학습 가능 샘플링이 다시에 샘플을 그들을 reup있다 + +236 +00:18:16,999 --> 00:18:19,460 + 입력 화상의 원래 크기 + +237 +00:18:19,460 --> 00:18:25,909 + 이 논문의 또 다른 멋진 기능은 스킵 연결이 아이디어들은 그렇게 + +238 +00:18:25,909 --> 00:18:30,489 + 그들은 실제로을 사용하여 실제로 단지이 가난한 오 기능을 사용하지 않는 + +239 +00:18:30,489 --> 00:18:34,598 + 다른 레이어와 네트워크에서 길쌈 기능하는 일종의 + +240 +00:18:34,598 --> 00:18:39,200 + 당신이 상상할 수 있도록 다양한 규모에서 존재하는 당신의 수영장에있어 한 번 + +241 +00:18:39,200 --> 00:18:42,649 + 알렉스는 지금 실제로의 레이 업 후 더 큰 기능지도 풀 오 + +242 +00:18:42,648 --> 00:18:48,069 + 수영장 3 그래서 직관이 낮은 것입니다에 대한 풀보다 더 크다 + +243 +00:18:48,069 --> 00:18:52,148 + 길쌈 방송 해 실제로 당신이에 미세한 입자 구조를 캡처 도움이 될 수 있습니다 + +244 +00:18:52,148 --> 00:18:56,408 + 그들은 작은 수용 필드가 있기 때문에 입력 이미지가 그래서 실제로 우리에게 영향을 + +245 +00:18:56,409 --> 00:18:59,889 + 서로 다른 길쌈 기능 맵을 별도 적용 + +246 +00:18:59,888 --> 00:19:03,428 + 최대 배운 그들 모두를 결합 한 후이 기능 맵의 각을 샘플링하고 + +247 +00:19:03,429 --> 00:19:09,070 + 최종 출력을 생성하고 그 결과에 그들은 실제로 그가를 추가하는 것을 보여 + +248 +00:19:09,069 --> 00:19:15,408 + 스킵 연결을 통해 때문에 이러한 낮은 수준의 세부 사항에 많은 도움이 경향 + +249 +00:19:15,409 --> 00:19:19,979 + 여기 왼쪽에있는 이들 만이 가난한 오 출력을 사용하는 결과는 + +250 +00:19:19,979 --> 00:19:24,919 + 당신은이 종류의 자전거에 사람의 거친 생각이라도 것을 알 수 있습니다 + +251 +00:19:24,919 --> 00:19:29,330 + 하지만 좀 blobby 그리고 미세한 세부 사항을 많이 누락 가장자리에 있지만 + +252 +00:19:29,329 --> 00:19:31,819 + 당신은 이러한 낮은에서 다음 단계 연결에 추가 할 때 + +253 +00:19:31,819 --> 00:19:35,468 + 당신에 대한 더 많은 세부적인 정보를 제공 길쌈 오류 + +254 +00:19:35,469 --> 00:19:39,940 + 그 작업이 그렇게 사람들을 추가 있도록 이미지에서 사물의 공간 위치 + +255 +00:19:39,940 --> 00:19:43,919 + 하위 계층에서 연결을 이동하는 것은 정말 경계를 정리하는 데 도움이 + +256 +00:19:43,919 --> 00:19:51,159 + 이 이러한 출력 질문에 대한 몇 가지 경우에 문제는 방법이다 있도록 + +257 +00:19:51,159 --> 00:19:55,070 + 정확성을 분류 나는 사람들이 일반적으로이 사용되는 두 메트릭 생각 + +258 +00:19:55,069 --> 00:19:58,829 + 당신이 모든 픽셀 분류를 분류하고 같이 다만 분류입니다 + +259 +00:19:58,829 --> 00:20:03,968 + 내 트랙은 또한 때때로 사람들은 각각 그래서 노동 조합의 교차를 사용 + +260 +00:20:03,969 --> 00:20:09,058 + 클래스 당신은 내가 우리에게 그 클래스를 예측 이미지의 영역이 무엇인지 계산 + +261 +00:20:09,058 --> 00:20:12,368 + 이었다 무엇 클래스를 가지고 이미지의 지상군 영역에 도달 + +262 +00:20:12,368 --> 00:20:17,158 + 다음 두 사이에 노동 조합의 교차점을 계산 나는 확실하지 않다 + +263 +00:20:17,159 --> 00:20:20,510 + 이는 측정하는 특히 사용이 논문 + +264 +00:20:20,509 --> 00:20:26,609 + 이렇게까지 ​​학습 가능 샘플링이 아이디어는 실제로 정말 멋진 이후부터입니다 + +265 +00:20:26,609 --> 00:20:30,839 + 이 문서가 적용되었습니다 및 기타 연락처를 많이는 우리가 본 적이 우리가 알고있는 사촌 + +266 +00:20:30,839 --> 00:20:35,839 + 우리는 아래로 
우리의 기능지도를 다양한 방법으로뿐만되는 샘플 수 있음 + +267 +00:20:35,839 --> 00:20:39,689 + 최대 네트워크 내부를 샘플링 할 수 실제로 매우 유용 할 수있다 + +268 +00:20:39,690 --> 00:20:44,750 + 매우 가치있는 일이 그래서이 때로는 호출되는 디컨 볼 루션을 수행하는 + +269 +00:20:44,750 --> 00:20:48,980 + 즉 우리 모두가 몇 분 안에 그 이야기 때문에 매우 좋은 조건은 아니지만 + +270 +00:20:48,980 --> 00:20:54,130 + 당신이 정상적인 일을 할 때 그냥 일종의 요약하자면, 그래서 그것은 매우 일반적인 용어이다 + +271 +00:20:54,130 --> 00:20:59,870 + 보폭 보폭 1353 컨볼 루션 우리의 종류 우리는 우리가이 사진을 가지고이 있습니다 + +272 +00:20:59,869 --> 00:21:04,489 + 즉 우리의 4 × 4 입력 우리를 부여하는 것이 지금까지 꽤 잘 알고 있어야합니다 + +273 +00:21:04,490 --> 00:21:08,710 + 세 가지 필터에 의해 약 3를 가지고 우리는 이상이 셋으로 세 가지 필터를 플롯 + +274 +00:21:08,710 --> 00:21:10,059 + 입력의 일부 + +275 +00:21:10,059 --> 00:21:14,539 + 제품이 우리에게 지금 고통 때문에 하나의 출력 요소 및 제공 + +276 +00:21:14,539 --> 00:21:19,240 + 아스테로이드 하나는 필터를 이동 출력 우리의 다음 요소를 계산 + +277 +00:21:19,240 --> 00:21:22,599 + 하나의 입력을 다시 컴퓨터 내적의 슬롯을 통해 그 우리를 제공합니다 + +278 +00:21:22,599 --> 00:21:29,409 + 출력에서 하나의 원소이며, 현재 걸음에 해당 회선에 대한 그것의 A의 + +279 +00:21:29,410 --> 00:21:32,360 + 아이디어의 매우 유사한 유형 어디 + +280 +00:21:32,359 --> 00:21:36,099 + 출력은 두 개의 출력으로 버전이 샘플링 다운 될 것입니다 + +281 +00:21:36,099 --> 00:21:40,459 + 4 × 4 곳에서 다시는 우리가 우리의 필터를 같은 생각 가지고있어 우리는 풍덩 + +282 +00:21:40,460 --> 00:21:44,279 + 화상 컴퓨터 내적 아래로 우리에게 출력의 한 요소를 준다 + +283 +00:21:44,279 --> 00:21:48,450 + 유일한 차이점은 지금 우리가 두 개의 슬롯을 통해 컨볼 루션 필터를 밀어 것입니다 + +284 +00:21:48,450 --> 00:21:53,610 + 입력은 출력에 디컨 볼 루션 elaire 하나를 계산 + +285 +00:21:53,609 --> 00:21:57,439 + 실제로 우리가 저를 먹고 싶어 그래서 여기에 조금 다른 무언가를 + +286 +00:21:57,440 --> 00:22:02,490 + 해상도 입력하고 그래서 이것은 아마 것보다 높은 해상도 출력을 생성 + +287 +00:22:02,490 --> 00:22:08,309 + 바로 그래서 여기에 하나에서 APPT까지의 무료 디컨 볼 루션에 의해 몇 + +288 +00:22:08,309 --> 00:22:12,659 + 이것은 당신이 정상적인 회선 알고있는 이상한 조금 당신을 상상하다 + +289 +00:22:12,660 --> 00:22:16,750 + 당신의 세 가지로 세 개의 필터가 있고 여기에 도트 제품과 입력을하지만, + +290 +00:22:16,750 --> 00:22:21,000 + 당신은 당신의 세 가지로 세 가지 필터를 복용 상상 할 단지에 이상 복사 + +291 +00:22:21,000 --> 00:22:26,230 + 출력은 유일한 차이점은이 하나의 스칼라 값을 같은 무게 + +292 +00:22:26,230 --> 00:22:27,579 + 무게 및 입력 + +293 +00:22:27,579 --> 00:22:31,788 + 당신에 대한 대기를 제공합니다 당신은 당신이 설 때 필터를 관련 거라고 + +294 +00:22:31,788 --> 00:22:38,298 + 우리가 거​​의 1 단계 발짝을 따라 우리가이 일을 시작했을 때 출력에 지금 + +295 +00:22:38,298 --> 00:22:43,298 + 모든 출력을 통해 이상 입력과 두 단계 이제 우리는 같은 걸릴거야 + +296 +00:22:43,298 --> 00:22:47,798 + 우리는 동일한 학습 컨볼 루션 필터 아래로 풍덩거야 + +297 +00:22:47,798 --> 00:22:53,378 + 같은 컨볼 루션 필터를 주셔서 지금은 가슴으로 출력하지만, 지금 + +298 +00:22:53,378 --> 00:22:56,928 + 우리는 차이가 있다는 것을 출력에 두 번을 보여주고있어 + +299 +00:22:56,929 --> 00:23:02,139 + 레드 박스는 컨볼 루션 필터는이 스칼라 값에 의해 가중된다 + +300 +00:23:02,138 --> 00:23:06,148 + 입력과 파란색 상자에 대한 컨볼 루션 필터에 의해 가중된다 + +301 +00:23:06,148 --> 00:23:10,978 + 블루 스칼라 입력의 값과 위치를이 곳이 지역이 당신을 중복 + +302 +00:23:10,979 --> 00:23:16,590 + 바로이 종류의 당신이 배울 수 있으며 네트워크 내부 샘플링까지 이렇게 추가 + +303 +00:23:16,589 --> 00:23:23,118 + 그래서 당신은에 구현 회선에서에서 기억한다면 + +304 +00:23:23,118 --> 00:23:27,999 + 할당 일종의 특히 눈에 띄는과 추가의이 아이디어 + +305 +00:23:27,999 --> 00:23:31,348 + 보통의 뒤로 패스 당신을 생각 나게한다 겹치는 영역 + +306 +00:23:31,348 --> 00:23:36,729 + 회선 및 이들이 그 완전히 동일하다는 것을 밝혀 + +307 +00:23:36,729 --> 00:23:40,440 + 컨벌루션 포워드 패스 정확하게 정상 컨벌루션와 동일 + +308 +00:23:40,440 --> 00:23:44,840 + 역방향 패스 정상 및 컨벌루션 후방 패스는 동일 + +309 +00:23:44,839 --> 00:23:50,238 + 일반 회선 앞으로 때문에 실제로 용어는 그렇게 통과 + +310 +00:23:50,239 --> 00:23:54,989 + 디컨 볼 루션 어쩌면 그렇게 크지 않다 당신은 신호 처리가있는 경우 + +311 +00:23:54,989 --> 00:23:58,700 + 당신이 디컨 볼 루션 이미 본 적이있다 배경은 매우가 + +312 +00:23:58,700 --> 00:24:03,308 + 의미는 잘 정의하고 그 때문에 회선의 역입니다 + +313 +00:24:03,308 --> 00:24:07,470 + 디콘 볼 루션은 상당히 다르다 컨벌루션 연산을 취소해야 + +314 +00:24:07,470 --> 
00:24:11,909 + 어떤이 실제로 그렇게하는 대신 이에 대한 아마 더 좋은 이름을하고있다 + +315 +00:24:11,909 --> 00:24:17,609 + 가끔 볼 수 디컨 볼 루션은 컨볼 루션 전치 또는 우리가 될 것 + +316 +00:24:17,608 --> 00:24:22,148 + 뒤로 귀에 거슬리는 회선 또는 단편적으로 귀에 거슬리는 회선 또는 또는 + +317 +00:24:22,148 --> 00:24:27,148 + 나는 이러한 종류의 이상한 이름이라고 생각하므로 회선까지 나는대로 디컨 볼 루션을 생각한다 + +318 +00:24:27,148 --> 00:24:30,988 + 인기가 덜 기술 될 수있다하더라도 말을 쉬운 그냥 사촌 + +319 +00:24:30,989 --> 00:24:35,369 + 당신이 논문을 읽고 실제로 경우 그 일부를 볼 수 있지만 기술적으로 정확 + +320 +00:24:35,368 --> 00:24:38,699 + 사람들은 그래서 이것에 대해 화 + +321 +00:24:38,700 --> 00:24:43,539 + 이 회선 대신 디컨 볼 루션의 트랜스 말을 더 적절한이고 + +322 +00:24:43,539 --> 00:24:47,529 + 이 다른 논문은 정말 분별 보폭 회선을 착색하고 싶어 + +323 +00:24:47,529 --> 00:24:51,750 + 그래서 내가 사회는 여전히 올바른 용어를 결정할 생각 생각 + +324 +00:24:51,750 --> 00:24:55,240 + 여기 그러나 나는 이러한 종류의 디컨 볼 루션은 아마 매우 아니다 그들에 동의 + +325 +00:24:55,240 --> 00:25:00,309 + 기술적으로 정확하고 매우 느낌이 특히이 논문은이 알코올 + +326 +00:25:00,309 --> 00:25:04,139 + 강하게이 문제에 대해 그들은 용지에 한 페이지 인덱스 부록했다 + +327 +00:25:04,140 --> 00:25:09,230 + 당신이있어 그래서 만약 내가 적절한 용어를 트랜스 회선 왜 실제로 설명 + +328 +00:25:09,230 --> 00:25:11,849 + 관심은 정말 꽤 좋은 것을 확인하는 것이 좋습니다 것입니다 + +329 +00:25:11,849 --> 00:25:16,289 + 이에 대한 설명은 실제로 해당하므로, 어떠한 질문 + +330 +00:25:16,289 --> 00:25:26,299 + 그래, 정말 문제는 패치이 상대를 기반으로 얼마나 빨리 생각 + +331 +00:25:26,299 --> 00:25:29,930 + 일이 대답은 연습 아무도에도 덕분에이 일을 실행하는 것입니다 + +332 +00:25:29,930 --> 00:25:34,820 + 단지 방법이 너무 느린 그래서 실제로 모든 것 희망 패치 짐승 모드 + +333 +00:25:34,819 --> 00:25:36,000 + 내가 본 논문의 + +334 +00:25:36,000 --> 00:25:39,109 + 이럭저럭 폴리 길쌈 것은 어떤 종류의 작업을 수행 + +335 +00:25:39,109 --> 00:25:44,729 + 실제로 다른 트릭의 대신 샘플링까지 종류가있다 그 사람 + +336 +00:25:44,730 --> 00:25:49,309 + 때때로 사용하고 그 때문에 네트워크가 실제로 있다고 가정이다 + +337 +00:25:49,309 --> 00:25:52,599 + 4 배에 의해 거 아래 샘플은 당신이 할 수있는 한 가지 가지고 당신의 + +338 +00:25:52,599 --> 00:25:57,199 + 입력 이미지는 하나의 픽셀로 배송 지금은 다시 내가 네트워크를 통해 실행 + +339 +00:25:57,200 --> 00:26:00,710 + 다른 출력을 얻을 당신은 네 가지 일의 종류에 대해이 작업을 반복 + +340 +00:26:00,710 --> 00:26:04,870 + 픽셀 입력의 선박 지금은 출력지도를받은 적이 당신은 정렬 할 수 있습니다 + +341 +00:26:04,869 --> 00:26:08,339 + 그 그건 그래서의 원래의 입력 맵을 재구성하는 인터리브 + +342 +00:26:08,339 --> 00:26:12,279 + 사람들이 가끔 사용되는 또 다른 트릭은 그 문제를 해결 얻을 수 있지만, 나는 생각한다 + +343 +00:26:12,279 --> 00:26:19,740 + 오늘 아침 샘플링이 꽤 청소기입니다 + +344 +00:26:19,740 --> 00:26:28,440 + 그래서 내가 정말 좋은 나는 다시 한 번 I 시도라고 생각 혀를 롤 생각 + +345 +00:26:28,440 --> 00:26:33,799 + 단편적으로 귀에 거슬리는 회선 실제로 내가 생각하는 정말 멋진 권리라고 생각합니다 + +346 +00:26:33,799 --> 00:26:36,928 + 그것은 가장 긴 이름이다하지만 천체 정상 바로 정말 설명이다 + +347 +00:26:36,929 --> 00:26:40,910 + 일반적으로 우리와 그가 이동 결국 요소의 역할을 이동할 회선에 시도 + +348 +00:26:40,910 --> 00:26:45,808 + 같은 당신은 어떤 입력과 출력을하지 않으 여기에 당신은 무슨 일이 있었는지 이동하고 + +349 +00:26:45,808 --> 00:26:48,940 + 무슨 일이 있었는지 움직이는 해당하는 입력하면 입력 싶어 있습니다 + +350 +00:26:48,940 --> 00:26:55,140 + 출력은 그래서 내가 내가 전화 할게 무엇인지 확실하지 않다, 그래서 아주 능숙하게 아이디어를 캡처 + +351 +00:26:55,140 --> 00:27:02,790 + 나는 신문에서 그것을 사용할 때 우리는 그것에 대해 그러나 지금에도 불구하고 참조해야 할 때 + +352 +00:27:02,789 --> 00:27:06,440 + 사람들이 같은 디컨 볼 루션을 호출하는 사람들에 대한 우려에도 불구하고 단지 + +353 +00:27:06,440 --> 00:27:10,980 + ICC에서이 논문은 수 있었다 어쨌든 그래서이 아이디어를 소요 호출 + +354 +00:27:10,980 --> 00:27:16,319 + 이 길쌈의 금주의 / 단편적으로 생각을 마련하려고 + +355 +00:27:16,319 --> 00:27:21,428 + 그리고 일종의 그래서 여기에 극단적으로 푸시하고 ​​그들이 금액 무엇을했다 + +356 +00:27:21,429 --> 00:27:28,170 + 이 전체 BGG 네트워크 I 입력 싶어하기 전에이는 동일한 모델 그래서 + +357 +00:27:28,170 --> 00:27:33,720 + 의미 분할 작업하지만 여기 출력 픽셀 현명한 예측 + +358 +00:27:33,720 --> 00:27:40,220 + 우리는 BGG를 초기화 이상 여기 BGG 거꾸로이며, 그것은 육에 대한 훈련 + +359 +00:27:40,220 --> 00:27:44,509 + 세금에 일 때문에이 일을 꽤 느린 실제로 정말 정말 좋은있어 + +360 +00:27:44,509 --> 00:27:51,160 + 결과와 나는 그 꽤 있어요 그래서 그것도 
아주 아름다운 그림이라고 생각 + +361 +00:27:51,160 --> 00:27:54,308 + 많은 모든 내가 어떤 존재인지 의미 론적 분할에 대해 말할 필요가 있음 + +362 +00:27:54,308 --> 00:27:59,799 + 그에 대한 질문이 그래 + +363 +00:27:59,799 --> 00:28:04,909 + 문제는 내가에서 스크린 샷을했다이 메인 답변입니다 방법입니다 + +364 +00:28:04,910 --> 00:28:09,090 + 자신의 종이 그래서 난 몰라하지만 당신은 우리가 마지막에서 본 흐름 답변을 시도 할 수 있습니다 + +365 +00:28:09,089 --> 00:28:15,069 + 그래 당신이 그림을 만들하지만이만큼 좋은 아니에요 수 있습니다 강연 + +366 +00:28:15,069 --> 00:28:22,579 + 훈련 데이터와 같은 질문은 예이 것은 이런 종류의 데이터 세트를 존재 + +367 +00:28:22,579 --> 00:28:28,449 + 어디 파스칼 분할 데이터가 그렇게 설정되어 일반적인 일이 있다고 생각 + +368 +00:28:28,450 --> 00:28:31,380 + 당신이 이미지가 접지 진실은 당신은 이미지를 그들은 가지고있다 + +369 +00:28:31,380 --> 00:28:37,780 + 표시된 모든 픽셀 그래 그것은이 해당 데이터를 가져 가지 비싼이다 + +370 +00:28:37,779 --> 00:28:43,049 + 데이터 세트는 약간 작은 경향이 있지만, 실제로 유명한 인터페이스가있다 + +371 +00:28:43,049 --> 00:28:46,299 + 이미지를 업로드 할 수있는 곳 나 라벨을 불러 투어 다음 종류의 술 + +372 +00:28:46,299 --> 00:28:49,240 + 본 발명의 다른 지역의 주위에 당신이 본 발명의 주위에 + +373 +00:28:49,240 --> 00:28:54,140 + 이 세그먼트의 일종으로 그 윤곽을 변환 할 수 있습니다 묻는 그 방법을의 + +374 +00:28:54,140 --> 00:29:02,130 + 당신은 우리가 거​​ 생각 질문이있는 경우 방식으로이 일에 라벨을하는 경향이 + +375 +00:29:02,130 --> 00:29:07,290 + 다만 인스턴스 분할을 정리해 할 수 있도록이가 즉시 분할로 이동 + +376 +00:29:07,289 --> 00:29:11,089 + 일반화 또는 우리뿐만 아니라 이미지의 픽셀에 라벨을 원하는 위치에 있지만 + +377 +00:29:11,089 --> 00:29:15,089 + 또한 즉시 우리가가는거야 그래서 인스턴스를 구별 구별 할 + +378 +00:29:15,089 --> 00:29:18,419 + 우리 클래스의 다른 인스턴스를 감지하고 각각에 대해 우리가 원하는 + +379 +00:29:18,420 --> 00:29:25,320 + 그래서이이 실제로이 모델 최대 인스턴스의 픽셀을 라벨 + +380 +00:29:25,319 --> 00:29:28,419 + 우리가 전에 몇 강의에 대해 이야기 검출 모델처럼 많이 찾고 + +381 +00:29:28,420 --> 00:29:34,150 + 그래서이 실제로 나는 또한해야 것을 알고 최초의 논문 중 하나 + +382 +00:29:34,150 --> 00:29:38,040 + 이 내가 메신저 훨씬 더 최근이의이 아이디어 부탁 생각이라고 지적 + +383 +00:29:38,039 --> 00:29:42,319 + 의미 론적 분할 긴 장시간 컴퓨터 비전에 사용되었지만 + +384 +00:29:42,319 --> 00:29:45,409 + 나는 즉시 분할이 생각보다 많이받은 것 같아요 + +385 +00:29:45,410 --> 00:29:50,970 + 특히 2014 종류의에서이 논문 그래서 지난 몇 년에 인기 + +386 +00:29:50,970 --> 00:29:53,890 + 이 걸렸다 나는 그들이 그것을 동시 탐지 및 세분화를 호출 생각 + +387 +00:29:53,890 --> 00:29:59,600 + 또는 SDS는 좋은 이름의 종류 그리고이 사실은 우리의 CNN과 매우 유사 + +388 +00:29:59,599 --> 00:30:03,839 + 우리가 여기에 보호를 보았다 모델은 우리가 입력 치매를 거 가지고있어 + +389 +00:30:03,839 --> 00:30:09,399 + 당신이 우리의 CNN에서 기억한다면 우리는 이러한 외부 지역의 제안에 의존하는 + +390 +00:30:09,400 --> 00:30:12,269 + 오프라인 컴퓨터 비전의 이러한 종류입니다 수 있습니다 + +391 +00:30:12,269 --> 00:30:16,538 + 이 이미지의 개체를 생각하는 위치에 예측을 계산 글로벌 일 수도 + +392 +00:30:16,538 --> 00:30:17,658 + 위치 + +393 +00:30:17,659 --> 00:30:21,419 + 잘은 대신 세그먼트를 제안하기위한 다른 방법이 있다고 밝혀 + +394 +00:30:21,419 --> 00:30:25,419 + 상자의 우리는 이러한 기존 세그먼트 제안 방법 중 하나를 다운로드 + +395 +00:30:25,419 --> 00:30:30,879 + 이들 각각에 대해이 세그먼트 우리가 할 수있는 각 대신 지금 사용 + +396 +00:30:30,878 --> 00:30:35,398 + 제안 된 세그먼트 우리는 단지의 상자에 앉아하여 경계 상자를 추출 할 수 있습니다 + +397 +00:30:35,398 --> 00:30:40,298 + 다음 세그먼트는 입력 영상의 덩어리에서 작물을 실행하고 실행 + +398 +00:30:40,298 --> 00:30:47,108 + 상자를 통해 CNN이 실행됩니다 병렬보다 그 상자 기능을 추출 + +399 +00:30:47,108 --> 00:30:52,358 + 지역 CNN을 통해 그래서 그녀는 우리가 취할 수 입력에서 관련 청크 + +400 +00:30:52,358 --> 00:30:57,168 + 발명 작물이 밖으로 그러나 여기에서 우리는 실제로이 제안을 가지고 있기 때문에 + +401 +00:30:57,169 --> 00:31:01,320 + 다음 세그먼트에 대해 우리는 평균을 사용하여 배경 영역을 마스크거야 + +402 +00:31:01,319 --> 00:31:05,700 + 그래서 이것은 당신이이 종류를 취할 수있는 해킹의 일종이다 데이터의 색 + +403 +00:31:05,700 --> 00:31:09,838 + 이상한 모양의 입력과 CNN로 먹이를 그냥 배경을 마스크 + +404 +00:31:09,838 --> 00:31:14,479 + 검은 색으로 우리와 부분 그래서이 마스크 입력을하고 실행할 수 있습니다 + +405 +00:31:14,479 --> 00:31:18,769 + 별도의 영역을 통해 CNN은 지금 우리가 입수 한 두 개의 서로 다른 특징 벡터 하나 + +406 +00:31:18,769 --> 00:31:22,739 + 전체 상자를 통합의 종류 만 기업에서 하나의 + +407 +00:31:22,739 --> 00:31:26,328 + 제안 된 
전경 픽셀 우리는 이러한 것들과 연결하여 바로 + +408 +00:31:26,328 --> 00:31:30,638 + 우리의 CNN에 우리가 구분을 같은 결정하는 어떤 클래스 실제로해야 + +409 +00:31:30,638 --> 00:31:37,128 + 이 세그먼트 B는 다음 그들은 또한이 지역의 정제 단계가 어디 + +410 +00:31:37,128 --> 00:31:42,108 + 당신이 모르는, 그래서 만약 잘하는 방법을 제안 영역에게 조금 수정하려면 + +411 +00:31:42,108 --> 00:31:45,218 + 당신은 우리의 CNN 프레임 워크 기억하지만 실제로 우리의 CNN과 매우 유사 + +412 +00:31:45,219 --> 00:31:52,909 + 다만이 경우 동시 검출 및 분할 작업 때문에이를 적용 + +413 +00:31:52,909 --> 00:31:56,950 + 이 지역의 정제 단계에 대한 아이디어는 실제로 후속 종이있다 그 + +414 +00:31:56,950 --> 00:32:03,288 + 같은 사람의 논문에서 이렇게 여기에서 그것을 할 수있는 아주 좋은 방법입니다 제안 + +415 +00:32:03,288 --> 00:32:07,578 + 우리는이 입력을 할 버클리 다음 회의 있지만 여기 + +416 +00:32:07,578 --> 00:32:12,940 + 이 세그먼트에 제안 된 세그먼트를 제안하고 그것을 정리할되는 + +417 +00:32:12,940 --> 00:32:17,778 + 어떻게 든 우리는 실제로 매우 유사한 유형 매우 유사한 접근 방식을거야 + +418 +00:32:17,778 --> 00:32:20,230 + 우리가에서 본 다중 스케일 방법 + +419 +00:32:20,230 --> 00:32:24,839 + 그래서 여기에 얼마 전 의미 분할 모델에서 우리는 걸릴거야 + +420 +00:32:24,839 --> 00:32:30,139 + 우리의 우리의 이미지 자르기 아웃이 해당 세그먼트에 해당하는 상자를 지탱하고 + +421 +00:32:30,140 --> 00:32:34,350 + 다음과 알렉스 그물을 통해 그것을 통과하고 우리는 길쌈을 추출 할거야 + +422 +00:32:34,349 --> 00:32:37,849 + 그 각각에 대해 그 알렉스 NAT의 여러 레이어의 기능 + +423 +00:32:37,849 --> 00:32:42,139 + 기능 맵은 최대 난과 함께 결합 지금 것 샘플링합니다 + +424 +00:32:42,140 --> 00:32:48,370 + 이이 그림이 제안 그림 지상 분할을 생성하므로이이 + +425 +00:32:48,369 --> 00:32:52,308 + 가지 재미 출력 사실이지만 그것은 예측을 정말 쉽게있어 + +426 +00:32:52,308 --> 00:32:55,910 + 아이디어는 우리가 단지거야이 출력 이미지가 물류를 수행 투자입니다 + +427 +00:32:55,910 --> 00:33:00,990 + 각각의 독립적 인 픽셀 내부 분류 그래서 우리는 단지이 이러한 기능을 제공 + +428 +00:33:00,990 --> 00:33:04,410 + 독립 물류의 전체 무리를 어떻게 예측하고 머리카락을 분류 + +429 +00:33:04,410 --> 00:33:08,250 + 출력이 많은 화소가 전경 될 가능성이있다 + +430 +00:33:08,250 --> 00:33:13,390 + 배경과 그들이 보여이 다중 스케일 정제 단계의이 유형이 + +431 +00:33:13,390 --> 00:33:16,610 + 실제로 이전 시스템의 다른 부분을 정리 아주 준다 + +432 +00:33:16,609 --> 00:33:27,899 + 아주 좋은 결과 질문 + +433 +00:33:27,900 --> 00:33:34,390 + 단편적으로 보폭 및 회선 나는 그것이 어떤 종류의 대신 생각 + +434 +00:33:34,390 --> 00:33:37,870 + 그 또는 같은 선형 보간 또는 뭔가처럼 샘플링을 수정 + +435 +00:33:37,869 --> 00:33:41,449 + 어쩌면 가장 가까운 이웃 뭔가 고정 및 가변 그러나 나는 할 수 + +436 +00:33:41,450 --> 00:33:44,170 + 잘못하지만 당신은 확실히 교환 및 일부 학습 가능 상상할 수 + +437 +00:33:44,170 --> 00:33:46,250 + 너무 것 같아요 + +438 +00:33:46,250 --> 00:33:52,980 + 확인 그래서 실제로이이 우리의 CNN뿐만 검출에 매우 유사합니다 + +439 +00:33:52,980 --> 00:33:57,049 + 우리는 우리의 CNN이 이야기의 시작에 불과했다 모든이 있다고 보았다 강의 + +440 +00:33:57,049 --> 00:34:03,329 + 그것이 나오는 바로 있도록 빠른 버전이 빠르게에서 유사한 직관 우리 + +441 +00:34:03,329 --> 00:34:08,090 + CNN은 실제로도 있으므로이 경우 세그멘테이션 문제에 적용된 + +442 +00:34:08,090 --> 00:34:12,050 + 이 해당 작업이 모델은 실제로 코코아 우승 Microsoft에서 작품입니다 + +443 +00:34:12,050 --> 00:34:16,860 + 예를 세분화 도전 그들의 거대한했다 그들은 그래서 올해 + +444 +00:34:16,860 --> 00:34:20,000 + 공명 그들은 그 위에이 모델을 고집하고 그들은 호감 + +445 +00:34:20,000 --> 00:34:25,489 + 코코 인스턴스 분할 도전에 다른 사람 때문에이이 + +446 +00:34:25,489 --> 00:34:28,668 + 실제로 우리가 걸릴 거 야에 우리의 독립 과거와 매우 유사하다 우리 + +447 +00:34:28,668 --> 00:34:34,148 + 입력 영상 단지 빠른처럼 빠르게 우리의 CNN 우리의 입력 영상은하지 않습니다 + +448 +00:34:34,148 --> 00:34:37,730 + 꽤 높은 해상도를하고 우리는이 거대한 코미디 쇼가지도 기능을합니다거야 + +449 +00:34:37,730 --> 00:34:44,260 + 우리의 높은 해상도를 통해 다음이 고해상도에서 우리는 실제로있어 + +450 +00:34:44,260 --> 00:34:48,700 + 이전의 방법은 우리가 우리 자신의 영역 제안을 제안하는 것 + +451 +00:34:48,699 --> 00:34:52,319 + 이러한 외부 세그먼트의 제안에 의존하지만, 여기에 우리는거야 + +452 +00:34:52,320 --> 00:34:56,870 + 우리는 그냥 막대기 우리 자신의 영역 제안이 너무 여기에 빠른 우리의 CNN을 좋아하는 배우 + +453 +00:34:56,869 --> 00:35:00,859 + 몇 가지 추가 길쌈 상단까지에 개최 부부는 논란 기능지도입니다 + +454 +00:35:00,860 --> 00:35:04,740 + 그 중 각 하나에 대한 관심의 여러 지역을 예측하는 것입니다 + +455 +00:35:04,739 --> 00:35:11,109 + 이미지 우리가 검출 
작업에서 본 상자의이 아이디어를 사용하여 + +456 +00:35:11,110 --> 00:35:15,200 + 차이점은 우리는이 지역이 지금은 일단이 지역의 제안이 있었다이다 + +457 +00:35:15,199 --> 00:35:18,559 + 우리가 마지막에 본 매우 유사한 접근 방식을 사용하는 방법에 대한 거 세그먼트 + +458 +00:35:18,559 --> 00:35:24,380 + 이 제안 된 영역 각각에 대해 너무 미끄러이 투자 수익 (ROI)을 사용하려고 무엇을 + +459 +00:35:24,380 --> 00:35:28,579 + 그들은 뒤틀림이나 풀링 및 고정 된 사각형에 이르기까지 그들 모두를 뭉개 버려 ROI를 호출 + +460 +00:35:28,579 --> 00:35:33,000 + 크기하고 생성하는 컨볼 루션 신경망을 통해 각각 실행될 + +461 +00:35:33,000 --> 00:35:36,710 + 우리와 같은 이러한 과정 그림 지상 분할 마스크는 이전에보고 + +462 +00:35:36,710 --> 00:35:41,909 + 이 시점에서 이제 이전 슬라이드에서 우리는 우리가 가지고 우리의 이미지를 쪘 + +463 +00:35:41,909 --> 00:35:45,859 + 각 지역의 제안에 대한 지역 제안의 무리 지금 우리는 거친이 + +464 +00:35:45,860 --> 00:35:49,240 + 전경 어느 한 부분으로 그 상자의 어느 부분의 아이디어는 배경입니다 + +465 +00:35:49,239 --> 00:35:54,489 + 지금 우리는 우리가 예측하는 것이 지금 마스킹의 이런 생각을하는거야 + +466 +00:35:54,489 --> 00:35:57,709 + 우리가 밖으로 마스크거야이 세그먼트의 각 전경 배경 + +467 +00:35:57,710 --> 00:36:02,889 + 배경을 예측 만 예측 전경에서 픽셀을 유지하고 + +468 +00:36:02,889 --> 00:36:07,179 + 과거 다른 몇 층을 통과 실제로 분류에 대한 분류하기 + +469 +00:36:07,179 --> 00:36:13,629 + 우리의 서로 다른 범주로 그 세그먼트 그래서이이 모든 일을 할 수있는 사람이다 + +470 +00:36:13,630 --> 00:36:18,380 + 단지 공동으로 두 배울 수 및 아이디어 우리는이 세 가지 있는데 그 + +471 +00:36:18,380 --> 00:36:22,490 + 우리의 네트워크의 중간 계층에 의미 해석 출력 및 + +472 +00:36:22,489 --> 00:36:26,589 + 그들 각각 우리는 그냥 그렇게이 지역 지상 진실 데이터를 감독 할 수 있습니다 + +473 +00:36:26,590 --> 00:36:29,900 + 지상 진실 섹스 객체는 객체와 이미지에 어디에 관심을 우리는 알고있다 + +474 +00:36:29,900 --> 00:36:34,349 + 이러한 분할이 우리 요청에 대해 우리는 그 출력에 감독을 제공 할 수 있습니다 + +475 +00:36:34,349 --> 00:36:37,929 + 우리가 감독을 줄 수있는 진정한 전경과 배경 우리 알고 + +476 +00:36:37,929 --> 00:36:42,759 + 그리고 우리는 우리가 분명히 그렇게 그 다른 세그먼트의 클래스를 알고있다 + +477 +00:36:42,760 --> 00:36:46,760 + 우리는 이러한 네트워크의 여러 계층에서 감독을 제공하고 거래를하려고 + +478 +00:36:46,760 --> 00:36:50,420 + 모든 다른 손실 조건 해제와 희망을 수렴 할 수있는 일을 얻을 수 있지만, + +479 +00:36:50,420 --> 00:36:53,670 + 이 실제로 훈련, 둘, 그들은 재미 있고 그것을에 발견되었다 + +480 +00:36:53,670 --> 00:36:59,809 + 정말 정말 잘 그래서 여기에 작동하는 결과가 우리가 보여해야한다는 그림이다 + +481 +00:36:59,809 --> 00:37:04,519 + 그래서 이러한 결과는 정말 나에게 적어도 예를 들어, 그래서 정말 인상적이다 + +482 +00:37:04,519 --> 00:37:09,159 + 이 입력 영상이 방에와 앉아이 모든 다른 사람들이 + +483 +00:37:09,159 --> 00:37:12,539 + 예상 출력은 모든 다른를 분리하는 정말 좋은 일을 + +484 +00:37:12,539 --> 00:37:15,360 + 사람들은 중복에도 불구하고 많은있다 그리고 그들은 매우있어 + +485 +00:37:15,360 --> 00:37:16,500 + 닫기 + +486 +00:37:16,500 --> 00:37:20,699 + 이 차와 같은 조금 더 쉽게하지만, 특히이이 백성 만든 + +487 +00:37:20,699 --> 00:37:24,629 + 나는 꽤 감동하지만, 때 당신은 그래서이 화분 완벽 하진 볼 수 있습니다 + +488 +00:37:24,630 --> 00:37:28,840 + 그에서 식물은 정말보다 여기 차단 그것은에이 의자를 혼동했다 + +489 +00:37:28,840 --> 00:37:32,230 + 사람에 대한 권리와 나는이 사람을 놓친하지만 전체 결과 + +490 +00:37:32,230 --> 00:37:36,300 + 아주 아주 인상적과 같은 나는이 모델 하나는 코코 세분화했다 + +491 +00:37:36,300 --> 00:37:43,250 + 분할의 개요 우리가 이것들을 가지고 있다는 것입니다, 그래서 올해에 도전 + +492 +00:37:43,250 --> 00:37:47,519 + 이 두 가지 작업 론적 분할 및 분할 인스턴트 + +493 +00:37:47,519 --> 00:37:52,210 + 의미 분할을 위해이 이것을 사용하는 것은 매우 흔한 일 + +494 +00:37:52,210 --> 00:37:56,800 + 콘데 호송은 접근 한 다음 예를 세분화 당신이와 끝까지 + +495 +00:37:56,800 --> 00:38:02,180 + 어떤이의 경우, 그래서 더 유사 이러한 파이프 라인 검출 객체하기 + +496 +00:38:02,179 --> 00:38:08,338 + 분할에 대한 마지막 순간의 질문에 나는 슈퍼 지금 그 대답을 시도 할 수 있습니다 + +497 +00:38:08,338 --> 00:38:14,329 + 분명 나는 우리가 서로에 거 이동을하고있어 꽤 멋진 아닌 것 같아요 + +498 +00:38:14,329 --> 00:38:18,150 + 흥미로운 항목이 너무 먹으 렴주의 모델은 내가 생각하는 뭔가 + +499 +00:38:18,150 --> 00:38:24,550 + 그래서 경우의 일종으로 관심과 작년과 지역 사회를 많이 가지고있다 + +500 +00:38:24,550 --> 00:38:29,780 + 연구 우리는 여기에서 확인하지만 같은 다른 인용문에서 모델에 대한 거 얘기 야 + +501 +00:38:29,780 --> 00:38:32,349 + 사례 연구의 일종으로 같은 + +502 +00:38:32,349 --> 00:38:35,190 + 이미지에 적용되는 우리는 저를 
캡처 관심의 아이디어에 대해 이야기하는거야 + +503 +00:38:35,190 --> 00:38:39,530 + 그래서 나는이 모델이 재발 네트워크 강좌에 미리 있었다라고 생각하지만 + +504 +00:38:39,530 --> 00:38:43,740 + 나는 정리 해보 여기하지만 먼저 더 많은 세부 사항으로 단계하고자 원하는 단지 + +505 +00:38:43,739 --> 00:38:47,029 + 그래서 우리는 희망 당신은 내가하여 자막 작업을 놓친 방법을 알고 같은 페이지에있어 + +506 +00:38:47,030 --> 00:38:51,540 + 이제 숙제 때문에 몇 시간 예정이다 그러나 우리는 우리의 입력을거야 + +507 +00:38:51,539 --> 00:38:54,869 + 본 발명은하고 길쌈을 통해 그것을하지 실행하고 일부 기능을 얻을 + +508 +00:38:54,869 --> 00:38:58,869 + 이러한 기능의 첫 번째 숨겨진 상태를 초기화 아마 사용됩니다 우리 + +509 +00:38:58,869 --> 00:39:03,780 + 현재의 네트워크는 토큰 시작 멀리하거나 첫 번째 단어가 숨겨져 있음을 얻었다 + +510 +00:39:03,780 --> 00:39:06,609 + 상태는 우리가 단어 이상이 분포를 생성하는거야 우리 + +511 +00:39:06,608 --> 00:39:11,940 + 어휘 것입니다 단순한 형식으로 배포 단어를 생성하는 것보다 및 것 + +512 +00:39:11,940 --> 00:39:16,429 + 그저 자막를 생성하기 위해이 프로세스 초과 근무를 반복 + +513 +00:39:16,429 --> 00:39:20,199 + 여기서 문제는이 네트워크는 일종의 보는 하나의 기회를 얻을 수 있다는 것입니다 + +514 +00:39:20,199 --> 00:39:23,899 + 입력 이미지와 그것이 전체 입력 영상에 모두를 찾고 않을 때 + +515 +00:39:23,900 --> 00:39:29,970 + 실제로 한번보고하는 기능에있는 경우 한 번 그리고 냉각기 수 있습니다 + +516 +00:39:29,969 --> 00:39:33,809 + 그것이 다른 부분에 초점을 맞출 수 있다면, 또한 입력 화상 여러번 + +517 +00:39:33,809 --> 00:39:41,969 + 작년에 나온 입력 이미지가 달렸다 정도로 하나의 정말 멋진 종이이었다 + +518 +00:39:41,969 --> 00:39:46,409 + 이 하나라는 쇼 교환은 원래 하나의 쇼를 말해 우리는 추가 말 + +519 +00:39:46,409 --> 00:39:51,289 + ㄱ - 열 부분과 아이디어는 우리가 걸릴 거 야, 그래서 매우 간단합니다 + +520 +00:39:51,289 --> 00:39:54,750 + 우리의 입력 영상 그리고 우리는 여전히 컨볼 루션 네트워크를 통해 실행거야 + +521 +00:39:54,750 --> 00:39:58,440 + 대신 마지막 완전히 연결 이후의 특징을 추출 + +522 +00:39:58,440 --> 00:40:01,659 + 대신 우리는 이전 선상 중 하나에서 거 풀 기능이있어 + +523 +00:40:01,659 --> 00:40:05,549 + 길쌈 상속인과는 우리에게 기능이 그리드를 줄 것 + +524 +00:40:05,550 --> 00:40:09,160 + 오히려이 때문에, 그래서이에서 오는 하나의 특징 벡터보다 + +525 +00:40:09,159 --> 00:40:13,460 + 길쌈 공기는 당신이 당신을 그 아마 왼쪽 위를 상상할 수 + +526 +00:40:13,460 --> 00:40:17,320 + 기능의 조약 공간 격자로하고 각 격자 안에이 생각할 수 + +527 +00:40:17,320 --> 00:40:21,130 + 그리드의 각 점은 어떤 부분에 해당하는 기능을 제공합니다 + +528 +00:40:21,130 --> 00:40:26,890 + 입력 이미지는 이제 다시 초기화하는이 이러한 기능을 사용합니다 + +529 +00:40:26,889 --> 00:40:30,099 + 어떤 방법으로 우리의 네트워크의 상태를 숨겨진 물건을 얻을 경우 지금 여기 + +530 +00:40:30,099 --> 00:40:34,400 + 다른 이제 우리는하지 계산하기 위해 우리의 숨겨진 상태를 사용하는거야 + +531 +00:40:34,400 --> 00:40:38,220 + 단어를 통해 분배하는 대신 서로 다른 이상 배포 + +532 +00:40:38,219 --> 00:40:43,459 + 우리의 길쌈 기능지도에서의 위치 때문에 다시이 것입니다 + +533 +00:40:43,460 --> 00:40:47,050 + 아마도 몹시 아마와 잘 연결 수로 구현 될 수 + +534 +00:40:47,050 --> 00:40:51,260 + 층 또는 두 후 일부 소프트 맥스는 당신에게 메일을주고 있지만, 우리는 단지 종료 + +535 +00:40:51,260 --> 00:40:54,410 + 우리에게 확률 분포를주는이 알 차원 벡터 최대 + +536 +00:40:54,409 --> 00:41:01,019 + 서로 다른 위치와 우리의 입력을 통해 지금 우리는이 확률을 + +537 +00:41:01,019 --> 00:41:05,780 + 분포는 실제로 이들의 가중 합을 얻기 위해 기다리는 읽을 사용 + +538 +00:41:05,780 --> 00:41:10,810 + 우리 학년 우리는이 걸릴 그래서 일단 우리의 다른 점에 특징 벡터 + +539 +00:41:10,809 --> 00:41:15,849 + 우리의 그리드를 받아 그것을 아래로 요약 기능의 가중 조합 + +540 +00:41:15,849 --> 00:41:22,420 + 이 하나의 요인과 질병 벡터의이 이런 종류의 입력을 요약 + +541 +00:41:22,420 --> 00:41:26,909 + 다른 유형의 몇 가지 방법으로 인해 이미지가 이것에 어떻게 할 + +542 +00:41:26,909 --> 00:41:30,619 + 확률 분포는 네트워크를 집중하는 능력을 준다 + +543 +00:41:30,619 --> 00:41:35,299 + 이미지의 다른 부분은 지금의이이 가중치를 간다 + +544 +00:41:35,300 --> 00:41:39,730 + 입력 기능에서 생성 이제 첫 번째 단어와 함께 공급됩니다 + +545 +00:41:39,730 --> 00:41:43,960 + 우리가 재발 네트워크의 재발을 할 때 우리는 실제로 세와 부품이 + +546 +00:41:43,960 --> 00:41:49,139 + 우리는 우리의 이전의 숨겨진 상태를 가지고 우리는이 참석 특징 벡터가 우리 + +547 +00:41:49,139 --> 00:41:52,929 + 생산이 함께 사용되는 모든 지금이 첫 번째 단어가 우리의 + +548 +00:41:52,929 --> 00:41:56,929 + 새로운 숨겨진 상태와 지금이 숨겨진 상태에서 우리가 실제로 갈거야 + +549 +00:41:56,929 --> 00:42:01,419 + 우리는 다른 새로운 유통을 통해 생산하는거야 두 개의 출력을 생성 + 
+550 +00:42:01,420 --> 00:42:04,940 + 위치와 우리의 입력 이미지와 우리는 또한 우리의 표준을 감소시키는거야 + +551 +00:42:04,940 --> 00:42:08,599 + 이 때문에 단어 이상 분포는 아마 몇으로 구현 될 수있다 + +552 +00:42:08,599 --> 00:42:13,679 + 의 활성 숨겨진 상태의 상단에 레이어 이제이 과정은 그렇게 반복 + +553 +00:42:13,679 --> 00:42:17,739 + 우리는 입력 기능 그랜드​​으로 돌아가 새로운 아마도 분포를 부여 + +554 +00:42:17,739 --> 00:42:22,949 + 그 닥터을 본 발명에 대한 새로운 요약 벡터에 걸릴 온다. + +555 +00:42:22,949 --> 00:42:25,618 + 함께 뉴 헤이븐을 계산 문장의 다음 단어 + +556 +00:42:25,619 --> 00:42:34,930 + 국가의 생산은 확인 그래서 조금 나쁜 버릇이 있지만 벤에 의한 것 사실이야 + +557 +00:42:34,929 --> 00:42:50,109 + 캡션을 생성하는이 프로세스 초과 근무를 반복 그래 그래서 질문은 어떻게 + +558 +00:42:50,110 --> 00:42:54,190 + 여기서이 기능 좋은에서 오는가 당신이 때있을 때 대답은 + +559 +00:42:54,190 --> 00:42:57,510 + 당신은 예를 들어 당신이 CON- 싶어 나라에 와서 가지고 일을하고 알렉스있어 + +560 +00:42:57,510 --> 00:43:01,670 + 칸 푸르는로 와서 시간으로 다섯을 그 텐서의 형상이되어 온에 도착 + +561 +00:43:01,670 --> 00:43:05,960 + 그래서 오백열둘에 의해 일곱으로 칠 등 지금 뭔가 + +562 +00:43:05,960 --> 00:43:11,050 + 입력 및 각 격자를 통해 일곱 일곱하여 공간 격자에 해당 + +563 +00:43:11,050 --> 00:43:15,450 + 그래서 사람들은 그냥 뽑아되는 512 차원의 특징 벡터의 위치 + +564 +00:43:15,449 --> 00:43:27,858 + 길쌈 중 하나에서 네트워크 문제가있다 + +565 +00:43:27,858 --> 00:43:33,219 + 우리가 실제로있어 그래서 이렇게 질문이 아마 분포에 관한 것입니다 + +566 +00:43:33,219 --> 00:43:37,899 + 모든 시간에 두 개의 서로 다른 확률 분포를 생성하면 단계 + +567 +00:43:37,900 --> 00:43:42,400 + 이 D 벡터의 하나의 제와 푸른 그래서 그 유통을 통해 아마 + +568 +00:43:42,400 --> 00:43:46,920 + 어휘 단어 우리가 정상적인 이미지 캡션과도에서처럼 + +569 +00:43:46,920 --> 00:43:50,759 + 때마다 단계는이 이상의 두 번째 확률 분포를 생성합니다 + +570 +00:43:50,759 --> 00:43:55,170 + 우리가 원하는 위치를 입력 이미지의 끝에 위치는 우리에게 말하고되는 + +571 +00:43:55,170 --> 00:43:59,690 + 단계 동생이 아주 적합한 단지로 조정하고, 그래서 만약 실제로 다음에 봐 + +572 +00:43:59,690 --> 00:44:05,200 + 업 후 퀴즈 당신이 그들을 사용하고자하는 어떤 프레임 워크를 같이보고 싶었다로 + +573 +00:44:05,199 --> 00:44:09,679 + 개월 동안 우리는 약 어쩌면위한 좋은 선택이 될 것입니다 강렬한 r에 어떻게 이야기를 우리의 + +574 +00:44:09,679 --> 00:44:16,288 + 텐트는 흐름이고 나는 미친이 그렇게 될 때이 자격이 생각 나는 + +575 +00:44:16,289 --> 00:44:19,749 + 아마 조금 더 자세히 이야기하고 싶었 방법이 주목 벡터 + +576 +00:44:19,748 --> 00:44:24,308 + 이 요약 의사가 생성되는 방식이 문서가 실제로 회담 그래서 + +577 +00:44:24,309 --> 00:44:29,278 + 이러한 요인 때문에 아이디어로서 생성에 대해 두 가지 방법 + +578 +00:44:29,278 --> 00:44:33,559 + 우리가 마지막 슬라이드에서 본 우리가 우리의 입력 이미지를 가지고이 위대한를 얻을 수 있다는 것입니다 + +579 +00:44:33,559 --> 00:44:38,019 + 우리의 네트워크에서 길쌈 영역 중 하나에서 오는 교사와 + +580 +00:44:38,018 --> 00:44:41,899 + 각 시간이 확률 분포를 만들어 우리의 네트워크를 중지 + +581 +00:44:41,900 --> 00:44:45,789 + 위치 이상 그래서 이것은에 끝나가 소프트 토지의 전체 영향 것 + +582 +00:44:45,789 --> 00:44:50,329 + 그것을 정상화 지금 생각은 우리가이 위대한 기능을 수행 할 것입니다 + +583 +00:44:50,329 --> 00:44:54,249 + 이러한 확률 분포와 함께 벡터와 하나의 생산 + +584 +00:44:54,248 --> 00:44:59,798 + D-차원 요소 입력 영상 것을 요약하고 용지가있다 + +585 +00:44:59,798 --> 00:45:04,159 + 실제로 쉬운 방법은 그래서이 문제를 해결하는 두 가지 방법을 탐구 + +586 +00:45:04,159 --> 00:45:08,969 + 그녀는 Rd에는 차원 r에 그래서 그들은 부드러운 구금 부르는 것을 사용 + +587 +00:45:08,969 --> 00:45:13,518 + 벡터의 예는 그리드 여기서 모든 요소의 가중 합계 것 + +588 +00:45:13,518 --> 00:45:18,028 + 각 요소는 바로 아마 그 예측 확률에 의해 그것의에 의해 대기한다 + +589 +00:45:18,028 --> 00:45:23,318 + 이것은 또 다른 층과 같은 종류의 좋은 그것을 구현하기 위해 실제로 매우 간단합니다 + +590 +00:45:23,318 --> 00:45:28,599 + 신경망과이 문맥의 유도체 등이 그라디언트 + +591 +00:45:28,599 --> 00:45:32,588 + 대한 요인은 확률을 예측 P는 아주 좋은 쉽습니다 + +592 +00:45:32,588 --> 00:45:36,818 + 그냥 보통의 구배를 사용하여 우리가 실제로 훈련을 수있는이 일을 계산하기 + +593 +00:45:36,818 --> 00:45:40,019 + 하강 및 역 전파 + +594 +00:45:40,019 --> 00:45:44,559 + 그러나 실제로이 경쟁하는 다른 또 다른 옵션을 탐험 + +595 +00:45:44,559 --> 00:45:48,210 + 특징 벡터 그래서 대신 심장주의라는 그 뭔가 + +596 +00:45:48,210 --> 00:45:52,630 + 이 가중 합을 갖는 우리는 단지 하나의 요소를 선택 할 수 있습니다 + +597 +00:45:52,630 --> 00:45:57,940 + 그래서 당신이 할 수있는 매우 간단한 일을 
상상하기에 참석하기 위해 업그레이드 + +598 +00:45:57,940 --> 00:46:02,440 + 단지 확률이 가장 높은 단지로 그리드의 요소를 선택합니다 + +599 +00:46:02,440 --> 00:46:07,269 + 그 부분 세금 위치에 대응하는 특징 벡터 빌려 당겨 + +600 +00:46:07,269 --> 00:46:13,150 + 이 공원 옆 경우 경우에이 카드가 최대에 대해 생각하면 문제는 지금 + +601 +00:46:13,150 --> 00:46:16,829 + 당신에 대한이 파생 상품에 대한 미분을 생각하는 우리의 + +602 +00:46:16,829 --> 00:46:18,360 + 배포 P + +603 +00:46:18,360 --> 00:46:22,980 + 이것이 그래서 더 이상 역 전파에 대한 매우 친절 아니라고 밝혀 + +604 +00:46:22,980 --> 00:46:29,059 + 내가 실제로 가장 큰 그 PA한다고 가정 또는 경우에 우리의 다음 경우 상상 + +605 +00:46:29,059 --> 00:46:33,119 + 요소와 우리의 입력과 우리가 조금의 pH를 변경하는 경우 지금 무슨 일이 + +606 +00:46:33,119 --> 00:46:40,130 + 비트 레이트는 그래서 만약 그가 건축가이며, 우리는 확률을 가볍게 흔들다 + +607 +00:46:40,130 --> 00:46:44,869 + 유통 조금 NPA는 여전히 건축가가 될 것입니다 그래서 우리는 여전히거야 + +608 +00:46:44,869 --> 00:46:49,400 + 실제로 미분을 의미하는 입력에서 동일한 요소를 선택 + +609 +00:46:49,400 --> 00:46:53,990 + 이 요소의 대해 쉽게 예측할 확률은 0이 될 것입니다있다 + +610 +00:46:53,989 --> 00:46:58,689 + 지금 우리가 정말 사용할 수 없습니다 거의 모든 곳에서 그, 그래서 그것은 아주 나쁜 주입니다 + +611 +00:46:58,690 --> 00:47:02,970 + 그들이 제안하는 것을 알 수 있도록 역 전파 더 이상이 일을 훈련합니다 + +612 +00:47:02,969 --> 00:47:06,549 + 강화 학습을 기반으로 또 다른 방법은 실제의 모델을 학습합니다 + +613 +00:47:06,550 --> 00:47:12,710 + 당신이 원하는 이러한 상황은 단일 요소를 선택하지만 약간의 + +614 +00:47:12,710 --> 00:47:16,260 + 더 복잡한 우리는이 강의에서 그것에 대해 않을거야 말거야하지만 단지 수 있도록 + +615 +00:47:16,260 --> 00:47:18,900 + 그건 당신이 부드러운의 차이를 볼 수 있습니다 뭔가 알고 있음 + +616 +00:47:18,900 --> 00:47:26,010 + 실제로 지금 우리가 볼 수 하나를 선택 관심과 심장주의 + +617 +00:47:26,010 --> 00:47:30,450 + 우리가 실제로 발생하고 그렇게 때문에이 모델에서 일부 꽤 결과 + +618 +00:47:30,449 --> 00:47:34,480 + 그리드 위치 우리가 할 수있는 모든 시간이 정지 이상의 확률 분포 + +619 +00:47:34,480 --> 00:47:38,519 + 있습니다 우리는 예술의 각 단어를 생성로서 그 확률 분포를 시각화 + +620 +00:47:38,519 --> 00:47:44,039 + 새를 모두 다시 보여줍니다 생성 된 캡션 그럼이 입력 이미지들은 + +621 +00:47:44,039 --> 00:47:48,279 + 마음주의 모델 모두이 경우에 그녀의 부드러운주의 모델 모두 + +622 +00:47:48,280 --> 00:47:51,650 + 캡션에게 물주기의 몸에 비행 조류를 생산 + +623 +00:47:51,650 --> 00:47:57,090 + 이 두 모델들은 무엇을 그 확률 분포의 모양을 시각화 + +624 +00:47:57,090 --> 00:48:01,690 + 이 두 가지 모델처럼 상단은 부드러운주의를 할 수 있도록 보여줍니다 있도록 + +625 +00:48:01,690 --> 00:48:04,849 + 이 모든에서 확률을 평균이기 때문에 그것은 일종의 확산있어 볼 + +626 +00:48:04,849 --> 00:48:09,309 + 위치와 이미지 하단에 단지 하나의 요소를 보여주는 것 + +627 +00:48:09,309 --> 00:48:16,289 + 그것은 꺼내 실제로 아주 좋은 로맨틱 드라마의 의미를 당신에게 있다는 + +628 +00:48:16,289 --> 00:48:19,779 + 모델이 특히 부드러운 관심이 상단에있을 때 볼 수 있습니다 + +629 +00:48:19,780 --> 00:48:23,340 + 나는 새에 대해 얘기하고 얘기 할 때 매우 좋은 결과이다 생각 + +630 +00:48:23,340 --> 00:48:26,610 + 초점 종류의 비행에 대해 새에 적합한 다음이 얘기 할 때 + +631 +00:48:26,610 --> 00:48:30,820 + 물에 대해는 좀 다른 모든 것들 때문에 다른 일에 초점을 맞추고 + +632 +00:48:30,820 --> 00:48:34,269 + 지적 것은 대한 감독 및 교육 시간을받지 않은 것입니다 + +633 +00:48:34,269 --> 00:48:38,869 + 이미지의 일부 단지에 자신의 마음을 만들어에 참석해야한다 + +634 +00:48:38,869 --> 00:48:43,289 + 더 나은 일을 캡처 도움이 무엇 이건을 기반으로 그 부분에 참석 + +635 +00:48:43,289 --> 00:48:46,480 + 우리가 실제로 단지에서 이러한 해석 결과를 얻을 수 꽤 멋지다 + +636 +00:48:46,480 --> 00:48:51,920 + 이 자막 작업, 우리는 몇 몇 다른 결과의 원인을 볼 수 있습니다 + +637 +00:48:51,920 --> 00:48:56,340 + 우리가 볼 수 그들은 재미있어 그 우리가 던지는 한 여자를 던지고 개를 때 + +638 +00:48:56,340 --> 00:49:01,079 + 다양한에서 개에 대해 이야기 Presby 공원에서 프리즈는 인식 + +639 +00:49:01,079 --> 00:49:05,259 + 개, 특히 흥미로운 바로 때를 바닥에서이 사람이다 + +640 +00:49:05,260 --> 00:49:08,790 + 그것은 실제로 모든 것들에 초점을 맞추고 단어 나무를 생성 + +641 +00:49:08,789 --> 00:49:13,440 + 배경 다시뿐만 아니라 기린과는 전혀 나오고 있지이 + +642 +00:49:13,440 --> 00:49:22,179 + 감독은 모든 단지 캡션을 기반으로 네 질문을하거나 + +643 +00:49:22,179 --> 00:49:27,440 + 문제는 당신이주의 대 하드 선호하는 경우가 그래서 무엇이다 + +644 +00:49:27,440 --> 00:49:31,380 + 나는 사람들이 일반적으로 그녀의도에 원하는 줄 것을 일종의 두 동기의 생각 + +645 +00:49:31,380 --> 
00:49:33,530 + 처음에 전혀 관심을 + +646 +00:49:33,530 --> 00:49:37,580 + 그 중 하나는 좋은 끝없는 출력을주고 당신이 얻을 생각하는 것입니다 + +647 +00:49:37,579 --> 00:49:42,710 + 두 경우 모두에서 좋은 해석 출력 적어도 이론적으로는 아마도 그녀의 + +648 +00:49:42,710 --> 00:49:46,130 + 구금는 확실히 꽤 있지만, 다른 동기 부여하지 않았다 것 같아요 + +649 +00:49:46,130 --> 00:49:49,970 + 주의를 사용하여 때를 특히 계산 부담을 완화하는 것입니다 + +650 +00:49:49,969 --> 00:49:54,989 + 매우 매우 큰이 있고 실제로 계산 비용이 많이들 수 있습니다 넣어 + +651 +00:49:54,989 --> 00:49:58,619 + 각 시간 단계에서 그 전체의 입력을 처리하고 더 효율적일 수도 + +652 +00:49:58,619 --> 00:50:02,869 + 우리는 단지 각 시간 단계에서 입력 한 부분에 초점을 맞출 수 계산하는 경우 + +653 +00:50:02,869 --> 00:50:07,380 + 단지 작은 부분 집합 척 처리가 부드러운 관심 때문에 너무 + +654 +00:50:07,380 --> 00:50:10,730 + 우리는 우리가 어떤을하지 않는 모든 포지션에 걸쳐 평균 이런 종류의 일을하고 + +655 +00:50:10,730 --> 00:50:14,369 + 계산 저축은 여전히​​ 모든 시간에 전체 입력을 처리하는 + +656 +00:50:14,369 --> 00:50:17,799 + 단계하지만 마음의 관심과 우리는 실제로 계산 절감 효과를 얻을 수 있습니까 + +657 +00:50:17,800 --> 00:50:22,680 + 명시 적으로 I 있도록 입력의 일부 작은 부분 집합을 따기 (pic)의 한 사람 + +658 +00:50:22,679 --> 00:50:26,289 + 또한 그녀의 구금이 강화됩니다 즉 그 큰 혜택을의 생각 + +659 +00:50:26,289 --> 00:50:41,420 + 학습과 CRN 당신이 똑똑 그 종류의 그래 그래서 질문있어 보이게 확장 + +660 +00:50:41,420 --> 00:50:46,150 + 문제는 모든에서이 작업을 수행하는 방법을 내가 대답은 그것의 생각 + +661 +00:50:46,150 --> 00:50:49,789 + 정말 그것의 입력 오른쪽 상관 관계 구조의 종류를 학습 + +662 +00:50:49,789 --> 00:50:54,779 + 강아지와 이미지의 많은 예를 보지하고 강아지와 함께 많은 문장입니다 만 + +663 +00:50:54,780 --> 00:50:57,480 + 강아지와 함께 그 다른 이미지의 개는 다른에 표시하는 경향이 + +664 +00:50:57,480 --> 00:51:01,349 + 입력의 위치와 나는 그것이 최적화를 통해 밝혀 같아요 + +665 +00:51:01,349 --> 00:51:05,659 + 절차 실제로 장소에 더 무게를 두는 경우 개 + +666 +00:51:05,659 --> 00:51:10,399 + 실제로 실제로 존재 그렇게하지 ​​않도록 몇 가지 방법으로 자막 작업을하는 데 도움이 + +667 +00:51:10,400 --> 00:51:14,460 + 그냥 그냥도 난 작업 할 일이 아주 아주 좋은 답이 있다고 생각 + +668 +00:51:14,460 --> 00:51:18,500 + 확실하지 그래서 분명이 다음이에서 수치입니다 인물의 사진입니다 + +669 +00:51:18,500 --> 00:51:23,300 + 그것은 임의의 이미지를 어떻게 작동하는지 잘 잘 모르겠어요 그래서 논문은 임의의 결과가 마음에 들지 + +670 +00:51:23,300 --> 00:51:31,870 + 하지만 다른 점은 정말이 특히이 모델 소프트에 대한 지적합니다 + +671 +00:51:31,869 --> 00:51:35,739 + 구금은 제약 조건의 종류가에서이 고정 된 격자 점이다 + +672 +00:51:35,739 --> 00:51:41,199 + 우리 같은 이들보다이 좋은이 확산 점점 얻을 컨볼 루션 기능지도 + +673 +00:51:41,199 --> 00:51:44,449 + 일을 찾고 있지만, 사람들은 그저이이 밖으로 흐리게처럼 + +674 +00:51:44,449 --> 00:51:48,210 + 분포 모델 실제로보고 할 능력이없는 + +675 +00:51:48,210 --> 00:51:52,220 + 입력의 임의의 영역 만이 고정 그리드 볼 수있어 + +676 +00:51:52,219 --> 00:51:55,959 + 지역 + +677 +00:51:55,960 --> 00:51:59,690 + 또한 부드러운 관심이 아이디어는 정말 아니라고 지적한다 + +678 +00:51:59,690 --> 00:52:04,789 + 본 논문에서 소개 난 정말이 개념을 가지고 있었던 첫 번째 논문을 생각한다 + +679 +00:52:04,789 --> 00:52:09,159 + 부드러운 관심은이 유사 그래서 여기에 기계 번역에서 온 + +680 +00:52:09,159 --> 00:52:13,299 + 우리는 다음 스페인어, 여기에 몇 가지 입력 문장을하려는 의욕 + +681 +00:52:13,300 --> 00:52:17,960 + 영어로 출력 문장을 생성하고이 재발와 함께 할 것 + +682 +00:52:17,960 --> 00:52:22,179 + 우리는 먼저 판독 할 시퀀스 모델 신경망 시퀀스 우리 + +683 +00:52:22,179 --> 00:52:26,588 + 입력 재발 네트워크와 문장하고는 출력 시퀀스가​​ 생성 + +684 +00:52:26,588 --> 00:52:29,269 + 우리는 자막에서와 같은 매우 유사 + +685 +00:52:29,269 --> 00:52:33,119 + 그러나이 논문에서 그들은 실제로 입력을 통해 관심을 가지고 싶어 + +686 +00:52:33,119 --> 00:52:38,599 + 강렬한 그들은 조금으로 정확한 메커니즘 때문에 자신의 문장을 생성 된 + +687 +00:52:38,599 --> 00:52:43,080 + 다른 그러나 직관은 지금 우리가 처음이를 생성 할 때 같은있다 + +688 +00:52:43,079 --> 00:52:47,469 + 말씀은 우리를 통해 전력 분배를하지 계산하려는의 나 + +689 +00:52:47,469 --> 00:52:52,000 + 이미지의 대신 우린 있도록 입력 문장에서 단어를 통해 지역 + +690 +00:52:52,000 --> 00:52:55,289 + 거 희망을 스페인어로이 첫 번째 단어에 초점을 맞출 것이다 분포를 얻을 + +691 +00:52:55,289 --> 00:52:59,170 + 문장, 그리고, 우리는 각 단어의 일부 사진을 촬영 한 후 관련이 있습니다 + +692 +00:52:59,170 --> 00:53:03,780 + 이를 반복 것이이 프로세스의 다음 단계로 궤환 + +693 +00:53:03,780 --> 00:53:08,820 + 모든 
시간에 너무 부드러운 구금이 아이디어는 매우있는 네트워크를 단계 + +694 +00:53:08,820 --> 00:53:12,230 + 이미지 캡처에뿐만 아니라 기계뿐만 아니라 쉽게 적용 + +695 +00:53:12,230 --> 00:53:18,990 + 번역 질문은 질문은 가변 길이에 대해이 작업을 수행 할 방법입니다 + +696 +00:53:18,989 --> 00:53:23,409 + 문장 그리고 내가 조금 이상 호도 뭔가하지만 아이디어는 당신입니다 + +697 +00:53:23,409 --> 00:53:26,980 + 이미지 캡션에 대한 그래서 주소 내용이라고 기반으로 무엇을 사용 + +698 +00:53:26,980 --> 00:53:31,559 + 우리 모두는이 일곱 그리드에 의해 아마 일곱이 고정되어 있는지 미리 알 수 있도록 + +699 +00:53:31,559 --> 00:53:35,579 + 우리는 단지이 직접하는 대신 확률 분포를 생성 + +700 +00:53:35,579 --> 00:53:40,440 + 인코더와 같은 모델은 일부 벡터를 생성있어 입력 된 문장을 읽는 + +701 +00:53:40,440 --> 00:53:45,320 + 인코딩이 디코더에서 지금 입력 문장의 각 단어 대신 + +702 +00:53:45,320 --> 00:53:49,300 + 의 직접 확률 분포 확률 벡터를 생성 그것 + +703 +00:53:49,300 --> 00:53:52,900 + 방법은 각각과 내적을 얻을 것이다 벡터의 종류를 확산하기 + +704 +00:53:52,900 --> 00:53:57,000 + 그 코드 벡터와 입력 한 다음 그 위에 제품을 얻을 익숙해 + +705 +00:53:57,000 --> 00:54:02,159 + 재 정규화 및 분포로 변환 + +706 +00:54:02,159 --> 00:54:06,940 + 그래서 부드러운 구금이 아이디어를 구현하기 꽤 용이하고 + +707 +00:54:06,940 --> 00:54:10,970 + 꽤 훈련하기 쉬운 그래서 작년 정도에 매우 인기가있어와 + +708 +00:54:10,969 --> 00:54:14,489 + A와 부드러운 관심이 아이디어를 적용 논문의 전체 무리가있다 + +709 +00:54:14,489 --> 00:54:18,349 + 다른 문제의 전체 무리보고 몇 논문이 있었다 있도록 + +710 +00:54:18,349 --> 00:54:22,360 + 우리가 본대로 기계 번역 소프트 구금에서이되어왔다 + +711 +00:54:22,360 --> 00:54:24,230 + 실제로하고 싶은 몇 가지 서류 + +712 +00:54:24,230 --> 00:54:28,179 + 그들은 오디오 신​​호에 읽은 다음 나는 놓을 게요 음성 녹음 + +713 +00:54:28,179 --> 00:54:32,589 + 영어 단어는 너무 부드러운 관심을 사용하는 몇 가지 서류가되었습니다 + +714 +00:54:32,590 --> 00:54:37,130 + 해당 작업 주에 도움이 입력 오디오 시퀀스를 통해 거기에있었습니다 + +715 +00:54:37,130 --> 00:54:41,300 + 당신이 읽을 그래서 여기에 동영상 캡션 소프트 관심을 사용하는 방법에 적어도 하나의 종이 + +716 +00:54:41,300 --> 00:54:45,260 + 프레임의 일부 순서와 단어와 당신의 다음 출력을 어떤 순서 + +717 +00:54:45,260 --> 00:54:49,110 + 가있는 한, 입력 시퀀스의 프레임인지를 통해 장력을 갖도록 할 + +718 +00:54:49,110 --> 00:54:53,050 + 캡션을 생성하는 당신은 아마이 작은 비디오 것을 볼 수 있었다 + +719 +00:54:53,050 --> 00:54:57,240 + 시퀀스들은 출력 누군가가 냄비에 물고기 위해 노력하고 때 생성된다 + +720 +00:54:57,239 --> 00:55:01,169 + 단어 누군가가 실제로 비디오에서이 두 번째 프레임에 많은 참석 + +721 +00:55:01,170 --> 00:55:05,590 + 순서 그들은 단어 튀김이 마지막에 더 많은 참석 생성 할 때 + +722 +00:55:05,590 --> 00:55:11,480 + 비디오 시퀀스의 요소는이 작업에 대한 몇 가지 서류가 있었다 + +723 +00:55:11,480 --> 00:55:16,059 + 당신에게 그래서 여기에 설정을 응답 질문을하면 자연에서 읽을 수 있다는 것입니다 + +724 +00:55:16,059 --> 00:55:20,590 + 언어 질문 당신은 또한 이미지와 이미지를 읽고 모델에 필요 + +725 +00:55:20,590 --> 00:55:22,870 + 그 질문에 대한 답을 생산 + +726 +00:55:22,869 --> 00:55:28,139 + 그래서 자연 언어에서 그 질문에 대한 답을 생산하고 거기 + +727 +00:55:28,139 --> 00:55:31,869 + 이미지를 통해 공간주의의 아이디어를 탐구 커플 논문 + +728 +00:55:31,869 --> 00:55:35,420 + 또 다른 일을 응답 질문이 문제를 돕기 위해 + +729 +00:55:35,420 --> 00:55:38,860 + 지적하는 것은이 있었다, 그래서이 논문의 일부는 좋은 게임을 가지고있다 + +730 +00:55:38,860 --> 00:55:43,000 + 보여 앤 알려 쇼 교환이 있었다 나는거야 거기 십분 미만 + +731 +00:55:43,000 --> 00:55:45,039 + 주문 + +732 +00:55:45,039 --> 00:55:49,999 + II 정말 약 즐길 수 있도록이 하나 답변을 참석하도록 요청 + +733 +00:55:49,998 --> 00:55:56,808 + 나는이 작업 줄에 불과 해요 이름으로 창의력과 부드러운의이 아이디어 + +734 +00:55:56,809 --> 00:55:59,910 + 구금 그래서 많은 사람들이 두을 업로드 구현하기 매우 쉽다 + +735 +00:55:59,909 --> 00:56:05,899 + 하지만 작업의 톤은 우리가 구현 이러한 종류의와 함께이 문제를보고 기억 + +736 +00:56:05,900 --> 00:56:09,709 + 부드러운 관심의 그것은 우리가 지역을 중재하기 위해 참석 할 수 없습니다입니다 + +737 +00:56:09,708 --> 00:56:14,038 + 입력 대신 제약하고 만 주어진이 고정 된 그리드에 참석할 수 + +738 +00:56:14,039 --> 00:56:18,699 + 길쌈 기능지도에 의해, 그래서 문제는 우리가 이것을 극복 할 수 있는지 여부입니다 + +739 +00:56:18,699 --> 00:56:23,559 + 제한은 여전히​​ 참석하고 어떻게 든 임의의 입력 영역에 참석 + +740 +00:56:23,559 --> 00:56:28,089 + 내가 생각하는 다른 방법 + +741 +00:56:28,088 --> 00:56:32,900 + 작업 이러한 유형의 전구체는 알렉스에서이 논문은 그래서 2013 년에 다시 무덤이다 + +742 +00:56:32,900 --> 
00:56:38,249 + 여기에 그는 같은 입력 자연 언어 문장을 읽고 다음과 같이 생성 원 + +743 +00:56:38,248 --> 00:56:43,598 + 작성하는 것처럼 일반 필기 될 출력 실제로 이미지 + +744 +00:56:43,599 --> 00:56:48,528 + 필기에 그 문장과 그가 실제로 관심을 가지고있는 방법이 + +745 +00:56:48,528 --> 00:56:53,418 + 멋진 방법의 종류이 출력 이미지 위에 우리는 지금 그가 실제로 예측하는 것하고 + +746 +00:56:53,418 --> 00:56:57,608 + 그런 다음 출력 영상 이상 현금과 혼합 모델의 파라미터 + +747 +00:56:57,608 --> 00:57:02,739 + 실제적으로 출력 영상의 부분을 중재하는 것을 사용하고 참석 + +748 +00:57:02,739 --> 00:57:07,028 + 이 사실은 이들 중 일부는 오른쪽에 정말 잘 그래서 정말 작동 + +749 +00:57:07,028 --> 00:57:12,259 + 실제로 사람들에 의해 작성하고 나머지는 그의 그를 작성했습니다 + +750 +00:57:12,259 --> 00:57:16,269 + 네트워크는 그래​​서 당신은 현실에서 발생하는 차이를 알 수 있습니다 + +751 +00:57:16,268 --> 00:57:24,418 + 내가 할 수 없습니다 생성하는 그래서 상단 하나는 진짜 그가 있다고 밝혀 + +752 +00:57:24,418 --> 00:57:31,049 + 네트워크에 의해 생성 된 모든 바닥 + +753 +00:57:31,050 --> 00:57:35,580 + 그래 어쩌면 어쩌면 진짜 기적은 문자 나 사이에 많은 차이가 + +754 +00:57:35,579 --> 00:57:39,380 + 그런 일하지만, 이러한 결과는 정말 잘 작동 실제로 그는이 + +755 +00:57:39,380 --> 00:57:42,820 + 당신이 가서 그 방금 할 수있는 브라우저에서 실행하려고 할 수있는 온라인 데모 + +756 +00:57:42,820 --> 00:57:46,800 + 단어를 입력하고 재미 종류의 당신을위한 필기를 생성합니다 + +757 +00:57:46,800 --> 00:57:52,840 + 우리가 이미 본 다른 또 다른 종이 종류의 소요가 그 그릴입니다 + +758 +00:57:52,840 --> 00:57:56,500 + 다음을 통해 임의 구금이 아이디어는 몇 가지 더로 확장 + +759 +00:57:56,500 --> 00:58:01,050 + 실제 세계의 문제는 세대가 그래서 그들은 고려 하나의 작업은 필기하지 + +760 +00:58:01,050 --> 00:58:05,960 + 이미지 분류는 여기에 우리가하지만 그 과정에서이 숫자를 분류 할 + +761 +00:58:05,960 --> 00:58:09,920 + 분류의 우리는 실제로 입력 영역을 중재하기 위해 참석거야 + +762 +00:58:09,920 --> 00:58:14,639 + 이 분류 작업에 도움하기 위해 이미지 그래서 이것은이입니다 + +763 +00:58:14,639 --> 00:58:17,909 + 가지가 종류의 자체 학습 냉각하지만이 참석해야 + +764 +00:58:17,909 --> 00:58:22,710 + 순서대로 숫자는 영상 분류에 도움이 또한 철회합니다 + +765 +00:58:22,710 --> 00:58:27,849 + 유사한과 임의 출력 이미지를 생성하는 개념을 고려 + +766 +00:58:27,849 --> 00:58:31,589 + 우리가 가지고있는거야 필기 생성과 같은 동기 부여의 종류 + +767 +00:58:31,590 --> 00:58:35,740 + 출력 이미지를 통해 임의의 관심은 단지에이 출력을 생성 + +768 +00:58:35,739 --> 00:58:42,589 + 내 침대와 나는 우리가 전에이 동영상을 본 것 같아요하지만이 그래서 정말 멋지다 + +769 +00:58:42,590 --> 00:58:47,190 + 당신이 여기에서 우리는거야 볼 수 있도록 내 마음에서 무승부 네트워크는 우리가있어 그것을 할 + +770 +00:58:47,190 --> 00:58:51,200 + 분류 작업을하는 것은 일종의에서 영역을 중재에 참석하기 위해 배운다 + +771 +00:58:51,199 --> 00:58:55,439 + 우리는 우리가 지역을 중재하기 위해 참석거야 생성 입력 할 때 + +772 +00:58:55,440 --> 00:58:59,579 + 그것은 다수 생성 할 수 있도록 출력은 실제로 이러한 숫자를 생성 + +773 +00:58:59,579 --> 00:59:04,000 + 한 번에 숫자와 실제로이이 집 번호 다음를 생성 할 수 있습니다 + +774 +00:59:04,000 --> 00:59:10,639 + 집 번호는 그래서 이것은 정말 멋진 당신이 볼 수 있듯이 당신은이 지역을 좋아한다 + +775 +00:59:10,639 --> 00:59:13,920 + 실제로 종류의 성장과 초과 근무 축소되었다 참석했다 + +776 +00:59:13,920 --> 00:59:17,430 + 이미지 위에 연속적으로 이동 그것은 확실히에 구속되지 않았습니다 + +777 +00:59:17,429 --> 00:59:21,690 + 우리와 같은 고정 된 그리드이되도록하는 방법을 알려 쇼 교환 보았다 + +778 +00:59:21,690 --> 00:59:26,840 + 작동 종이 조금 조금 이상하고 깊은 일부 후속 작품입니다 + +779 +00:59:26,840 --> 00:59:34,260 + 내 모든 초점은 모든 괜찮 좀 더 분명 왜 하늘은 실제로 생각하는 마음 + +780 +00:59:34,260 --> 00:59:38,630 + 바로 그래서 매우 유사한을 사용 걸릴이 후속 논문이있다 + +781 +00:59:38,630 --> 00:59:43,500 + 기구의 관심은 특별한 전송 네트워크라고하지만 난 많은 생각 + +782 +00:59:43,500 --> 00:59:44,500 + 이해하기 쉽게 + +783 +00:59:44,500 --> 00:59:49,039 + 그리고 매우 깨끗한 방법으로 제시 한 아이디어는 우리가 갖고 싶어한다는 것입니다 수 있도록 + +784 +00:59:49,039 --> 00:59:53,369 + 입력 영상이 우리의 마음에 드는 새와 우리는 이런 종류의를 갖고 싶어 + +785 +00:59:53,369 --> 00:59:57,589 + 당신은 당신에게 있습니다에 참석하려는 우리에게 말하고 변수의 연속 세트 + +786 +00:59:57,590 --> 01:00:01,579 + 우리가 어떤 상자의 중심과 너비와 높이의 모서리를 상상 + +787 +01:00:01,579 --> 01:00:06,170 + 이 지역의 우리에 첨부 할 다음 우리는 몇 가지 기능을 갖고 싶어 그 + +788 +01:00:06,170 --> 01:00:10,240 + 우리의 입력 영상을 받아 이러한 지속적인 관심 좌표 + +789 +01:00:10,239 --> 01:00:14,919 + 다음 몇 가지 고정 된 크기의 출력을 
생성하고 우리는이 작업을 수행 할 수 없습니다 + +790 +01:00:14,920 --> 01:00:21,840 + 미분 방법은 그래서 이것은 이것은 당신에 그 상상할 수 좀 하드 바로 보인다 + +791 +01:00:21,840 --> 01:00:26,250 + 자르기의 생각과 이상은 다음이 입력은 정말 연속이 될 수 없습니다 + +792 +01:00:26,250 --> 01:00:30,590 + 그들은 두 개의 정수 그렇게 우리 나라 픽셀 값의 정렬을해야하고 그렇지 않아 + +793 +01:00:30,590 --> 01:00:34,550 + 우리는이 함수가 연속 또는 차등 할 수있는 방법을 정확하게 정말 선택 + +794 +01:00:34,550 --> 01:00:39,210 + 그리고 그들은 실제로 아주 좋은 해결책을 와서 생각은 우리가 걸이다 + +795 +01:00:39,210 --> 01:00:44,679 + 거 픽셀의 좌표에서지도하는 매개 변수화 기능을 적어 + +796 +01:00:44,679 --> 01:00:50,469 + 그래서 여기에 입력 픽셀의 좌표를 출력에 우리는거야 말거야 + +797 +01:00:50,469 --> 01:00:54,839 + 이이 왼쪽 오른쪽 상단 픽셀 것을 다른 가능성이있다 + +798 +01:00:54,840 --> 01:00:59,700 + 좌표는 출력에 TYT을 X와 우리는이를 계산 확인하는거야 + +799 +01:00:59,699 --> 01:01:04,480 + 이 민영화를 사용하여 입력 이미지에 액세스 및 백악관 좌표 + +800 +01:01:04,480 --> 01:01:08,900 + 즉 그, 그래서 좋은 함수는 우리가 할 수있는 좋은 미분 가능 함수의 + +801 +01:01:08,900 --> 01:01:13,349 + 다음 이들에 대해 벌금 전송에 따라을 우리가 할 수있는 차별화 + +802 +01:01:13,349 --> 01:01:17,059 + 에서 아마 상단 왼쪽 상단 픽셀 다시이 과정을 반복 + +803 +01:01:17,059 --> 01:01:21,219 + 출력 화상 우리의 좌표에 매핑이 행성 상승 기능을 사용하여 + +804 +01:01:21,219 --> 01:01:27,199 + 입력의 화소 이제 우리는 우리의 출력의 모든 픽셀에 대해이 작업을 반복 할 수있는 + +805 +01:01:27,199 --> 01:01:31,689 + 생각이 될 것입니다, 그래서 우리에게 샘플링 그리드라는 뭔가를 제공합니다 + +806 +01:01:31,690 --> 01:01:36,480 + 출력의 각 픽셀에 대해 다음 우리의 출력 이미지와 샘플링 그리드는 우리에게 알려줍니다 + +807 +01:01:36,480 --> 01:01:41,610 + 여기서 입력에 픽셀에서 온해야 얼마나 많은을 복용하는 사람 + +808 +01:01:41,610 --> 01:01:47,590 + 컴퓨터 그래픽 과정 많지 않은이 왼쪽에 보이는 질감처럼 좀 보인다 그래서 + +809 +01:01:47,590 --> 01:01:52,510 + 그들은 컴퓨터에 텍스처 매핑에서이 아이디어를 취할 수 있도록 매핑은하지 않습니다 + +810 +01:01:52,510 --> 01:01:56,300 + 단지 선형 보간하여 사용하고 그래픽 한 번 출력을 계산하기 + +811 +01:01:56,300 --> 01:01:57,720 + 우리는 샘플링 그리드가 + +812 +01:01:57,719 --> 01:02:02,669 + 그래서 우리가 가지고 이제 지금 무슨 일이 지금이 지금이 실제로 우리의 네트워크를 할 수 있습니다 + +813 +01:02:02,670 --> 01:02:07,450 + 입력과 좋은 미분 방법의 일부를 중재하기 위해 참석 우리 + +814 +01:02:07,449 --> 01:02:11,789 + 지금 바로이 변형을 예측합니다 네트워크는 PANA 것을 좌표 + +815 +01:02:11,789 --> 01:02:16,639 + 그래서 입력 영상의 영역을 중재에 참석하기 위해 전체를 수 + +816 +01:02:16,639 --> 01:02:20,199 + 그들은 좋은 작은 독립적 인 모듈에 모두이 일을 넣어 그 + +817 +01:02:20,199 --> 01:02:24,608 + 공간 변압기가 어떤 입력을 수신 그래서 그들은 특별한 변압기를 호출 + +818 +01:02:24,608 --> 01:02:29,679 + 우리의 원시 입력 이미지로하고 다음 실제로 실행 당신이 생각할 수있는 + +819 +01:02:29,679 --> 01:02:33,949 + 작은 완전히 연결 네트워크 또는 될 수있는 작은 현지화 네트워크 + +820 +01:02:33,949 --> 01:02:38,409 + 매우 얕은 길쌈 네트워크 및이이 현지화 네트워크의 뜻 + +821 +01:02:38,409 --> 01:02:44,500 + 실제로 지금이 출력이 좌표 데이터를 변환 계획으로 생산 + +822 +01:02:44,500 --> 01:02:48,829 + 아핀 변환 좌표는 이제 샘플링 그리드를 계산하는데 사용될 + +823 +01:02:48,829 --> 01:02:51,750 + 우리가보기 흉한에서 이러한 예측 한 것을 국산화로 변환 + +824 +01:02:51,750 --> 01:02:56,280 + 우리는 네트워크의 출력에서​​의 각 화소의 좌표를 각각의 화소를 매핑 + +825 +01:02:56,280 --> 01:03:02,280 + 출력을 다시 입력이 지금은 좋은 부드러운 미분 함수 + +826 +01:03:02,280 --> 01:03:06,230 + 우리가 샘플링 그리드를 일단 우리는 단지에 선형 보간에 의해 적용 할 수 있습니다 + +827 +01:03:06,230 --> 01:03:11,309 + 출력의 픽셀 값을 계산하고 경우에 당신이 생각하는 경우 + +828 +01:03:11,309 --> 01:03:15,588 + 어떻게이 일을하고있는 것은이 네트워크의 모든 단일 부품이 하나라는 것을 분명 + +829 +01:03:15,588 --> 01:03:21,159 + 이 일이 어떤없이 공동으로 관리 할 수​​ 있도록 지속적이고 두 개의 차동 + +830 +01:03:21,159 --> 01:03:26,579 + 11 종류의 비록 아주 좋은 미친 강화 학습 물건 + +831 +01:03:26,579 --> 01:03:31,789 + 당신이 선형으로 샘플링이 그것을 어떻게 작동하는지 알고있는 경우주의 사항은 바이 리니어 샘플링에 대해 알고 + +832 +01:03:31,789 --> 01:03:36,449 + 출력의 각 화소가 넷의 누적 평균가는 것을 의미 + +833 +01:03:36,449 --> 01:03:41,639 + 픽셀과 입력 그래서 그 기울기는 실제로 매우 로컬 그래서 이것은이다 + +834 +01:03:41,639 --> 01:03:45,549 + 연속과 미분 멋진하지만 난 당신의 전체 많이 얻을 생각하지 않습니다 + +835 +01:03:45,550 --> 01:03:50,300 + 세 번째 바이 리니어 샘플링을 통해하지만 당신이이 한 번 기울기 신호 + +836 
+01:03:50,300 --> 01:03:54,410 + 특별이 ​​좋은 특수 전송 모듈 우리는 단지에 삽입 할 수 있습니다 + +837 +01:03:54,409 --> 01:03:58,739 + 네트워크에 존재하는 일종의 그들이 있도록 두 가지 참석을 배울 수 있도록 + +838 +01:03:58,739 --> 01:04:03,739 + 드롭 종이와 매우 유사이 분류 작업을 고려 어디 + +839 +01:04:03,739 --> 01:04:08,118 + 실제로 그들이 그렇게 사면 제트기의 이러한 변형 된 버전을 분류 할 + +840 +01:04:08,119 --> 01:04:09,519 + 여러 가지 다른 생각 + +841 +01:04:09,519 --> 01:04:13,610 + 더 복잡한 변환하지 당신은 또한 할 수있는 단지 그가의 좋은 형질 전환 + +842 +01:04:13,610 --> 01:04:18,260 + 그의 화소에서 출력 픽셀 SPEKTR에서 매핑의 상상 + +843 +01:04:18,260 --> 01:04:21,470 + 우리는 아핀을 보였다 이전의 비행 변환 그러나 또한 고려 + +844 +01:04:21,469 --> 01:04:25,339 + 사영 변환도 얇은 판 스플라인하지만 아이디어는 당신에게 그냥 + +845 +01:04:25,340 --> 01:04:28,970 + 일부 민간 상승과 미분 가능 함수를 원하고 당신은 갈 수 + +846 +01:04:28,969 --> 01:04:34,829 + 그래서 여기에 왼쪽 네트워크의 일부 미친 그냥 분류하려고 + +847 +01:04:34,829 --> 01:04:38,380 + 왼쪽에 이렇게 일을하는이 자리는 우리의 서로 다른 버전이 + +848 +01:04:38,380 --> 01:04:43,340 + 이 가운데 콜린에 변형 된 자리는 서로 다른 얇은 판을 보이고있다 + +849 +01:04:43,340 --> 01:04:47,460 + 스플라인은 오른쪽에 다음 이미지의 일부에 참석하고 사용하고 있음 + +850 +01:04:47,460 --> 01:04:51,590 + 뿐만 아니라이 공간 변압기 모델의 출력을 보여줍니다 + +851 +01:04:51,590 --> 01:04:56,250 + 그 영역에 참석뿐만 아니라에 그 비행기에 대응에서 근무 + +852 +01:04:56,250 --> 01:05:01,730 + 오른쪽에 그들은은을 사용하고 오른쪽에 발산 찾기 앱을 사용하고 + +853 +01:05:01,730 --> 01:05:05,559 + 아핀이 실제로하고있는 것을 볼 수 있습니다 장소의 계획에 있지 변환 + +854 +01:05:05,559 --> 01:05:09,369 + 단지 입력에 참석하거나 실제로는 물론, 입력을 변화보다 + +855 +01:05:09,369 --> 01:05:14,849 + 그래서이 가운데 열에서 예를 들어이는 4이지만 실제로 회전있어 + +856 +01:05:14,849 --> 01:05:19,069 + 90도 같은으로 뭔가에 의해 때문에이 응용 프로그램을 사용하여 및 + +857 +01:05:19,070 --> 01:05:23,140 + 네트워크 변환 것이 아니라주의 전뿐만 아니라 회전 수로 + +858 +01:05:23,139 --> 01:05:27,839 + 적절한 직장에서 하류 분류에 대한 위치와이 전부입니다 + +859 +01:05:27,840 --> 01:05:31,930 + 아주 멋진 그리고 난 우리가 필요로하지 않는 부드러운 관심을 비슷한의 정렬 할 수 있습니다 + +860 +01:05:31,929 --> 01:05:35,949 + 이에 참석하고 싶어 그냥 스스로 결정할 수 있습니다 명시 적 감독 + +861 +01:05:35,949 --> 01:05:41,710 + 이 사람뿐만 아니라 매우있는 멋진 동영상을 가지고 있도록 문제를 해결하기 위해 + +862 +01:05:41,710 --> 01:05:53,860 + 인상적인 그래서 이것은 우리가 압축을 푼 여기 변압기 모듈입니다 + +863 +01:05:53,860 --> 01:05:58,930 + 우리는 실제로 지금이 실제로 분류 작업을 실행하는 바로 보여주는 것 + +864 +01:05:58,929 --> 01:06:03,389 + 그러나 우리가 입력을 변경하고 지속적으로 이러한 서로 다른 입력 것을 볼 수있다 + +865 +01:06:03,389 --> 01:06:08,429 + 네트워크 (22) 다음 실제로 경제 제휴에 참석 배운다 그 + +866 +01:06:08,429 --> 01:06:13,169 + 자리는 고정 알려진 포즈의 정렬하는 등 우리는 매우 입력이 주변으로 이동 + +867 +01:06:13,170 --> 01:06:18,500 + 이미지 네트워크는 여전히 자리에와에 잠금의 좋은 일을 + +868 +01:06:18,500 --> 01:06:23,059 + 바로 당신은 때때로 잘 그래서뿐만 아니라 회전을 해결할 수 있음을 알 수 + +869 +01:06:23,059 --> 01:06:26,809 + 여기 왼쪽에 실제로 그 자리 실제로 네트워크를 회전했다 + +870 +01:06:26,809 --> 01:06:31,619 + 다시 모두 부채와 경제 생활 투표와 회전 배운다 + +871 +01:06:31,619 --> 01:06:36,420 + 친구가 변형 또는 얇은 판 스플라인이도 미쳤 사용하여 + +872 +01:06:36,420 --> 01:06:40,389 + 예상 전송으로 휘게 그녀는 정말 좋은 일을 볼 수 있습니다 + +873 +01:06:40,389 --> 01:06:48,099 + 의 또한 자신의 작품에 참석하고 학습하고 다른 꽤 많이 할 + +874 +01:06:48,099 --> 01:06:52,829 + 대신 분류의 실험은이 일을 함께 추가 학습 + +875 +01:06:52,829 --> 01:06:58,369 + 가지 이상한 일이다하지만 그렇게 네트워크가 후퇴되어 작동 자리 + +876 +01:06:58,369 --> 01:07:05,389 + 두 개의 입력 입력 영상에 관해서는 나는 합계를 놓을 게하고 화상과 알지도 + +877 +01:07:05,389 --> 01:07:08,679 + 이 그것이 참석하고 작업에 필요가 있음을 알게 이상한 작업의 종류 + +878 +01:07:08,679 --> 01:07:15,659 + 이 때문에 그 이미지 최적화이 테스트입니다 기록 중입니다 + +879 +01:07:15,659 --> 01:07:20,009 + 공동 지역화라는 개념은 두 네트워크를 받으려고한다는 것이다 + +880 +01:07:20,010 --> 01:07:25,560 + 등의 입력 영상 아마 두 개의 서로 다른 이미지 네 다리를하고 작업이 말을하는 것입니다 + +881 +01:07:25,559 --> 01:07:31,179 + 여부 그 이미지는 다음 동일하거나 상이하고, 동일한 + +882 +01:07:31,179 --> 01:07:34,750 + 지역 공간 변압기를 사용하여 같은 것들을 지역화 학습 결국 + +883 +01:07:34,750 --> 01:07:38,139 + 잘 훈련의 과정을 통해 실제로 
배운다 것을 볼 수 있습니다 + +884 +01:07:38,139 --> 01:07:42,239 + 우리가 가까이있을 때 매우 매우 정밀도이 일을 현지화 + +885 +01:07:42,239 --> 01:07:50,479 + 이러한 네트워크는 여전히 아주 아주 정확하게의 그 지역화하는 법을 배워야보다 이미지 + +886 +01:07:50,480 --> 01:07:58,280 + 꽤 멋진 깊은 마음의 최근 논문 + +887 +01:07:58,280 --> 01:08:11,519 + 특수 변압기에 대해 너무 다른 마지막 분 질문 그래 그래서 + +888 +01:08:11,519 --> 01:08:13,989 + 간단한 때문에 문제는이 일의 작업이 무엇인지 무엇이고 있습니다 + +889 +01:08:13,989 --> 01:08:17,420 + 일을하고 바닐라 버전 적어도 그것 때문에 그냥 분류입니다 + +890 +01:08:17,420 --> 01:08:21,810 + 뒤틀린 될 수있는 입력 이런 종류의 수신 그녀의 어수선 또는 이것 저것하고 + +891 +01:08:21,810 --> 01:08:26,060 + 모두가 할 필요가 그 과정에서 일종의 분류 광고 예산입니다 + +892 +01:08:26,060 --> 01:08:29,839 + 또한 그것을 분류하는 학습 즉, 그래서 금이 부분에 참석하기 위해 계획 + +893 +01:08:29,838 --> 01:08:40,189 + 그건 내 개요이이 작품 오른쪽 정렬이의 정말 멋진 기능입니다 + +894 +01:08:40,189 --> 01:08:44,588 + 관심의 우리가 정말 쉬운 부드러운 관심을 가지고있다 + +895 +01:08:44,588 --> 01:08:49,119 + 고정 된 입력 위치의이 맥락에서 특히 구현하는 우리 단지 + +896 +01:08:49,119 --> 01:08:53,039 + 이상 분포를 생산하고 우리가 기다리는 사람을 넣어 우리는 그 사람들을 먹이 + +897 +01:08:53,039 --> 01:08:56,850 + 다시 어떻게 든 네트워크 요인이 많은에 구현하기 정말 쉽습니다 + +898 +01:08:56,850 --> 01:08:59,930 + 다른 문맥 및 다른 작업의 많은 구현 된 + +899 +01:08:59,930 --> 01:09:04,770 + 당신은 당신이 조금을 얻기 위해 필요한 것보다 지역을 중재하기 위해 참석하고자 할 때 + +900 +01:09:04,770 --> 01:09:09,130 + 비트 애호가와 나는 공간 변압기 우아한 아주 아주 좋은 생각 + +901 +01:09:09,130 --> 01:09:13,949 + 입력 이미지의 영역을 중재하기 위해 참석의 방법이 논문 많이 있습니다 + +902 +01:09:13,949 --> 01:09:17,889 + 실제로 그녀의 구금 작업이 아주 조금 더 도전 때문이다 + +903 +01:09:17,890 --> 01:09:21,579 + 열심히주의 용지 일반적으로 사용하는 그라디언트이 문제에 대한 + +904 +01:09:21,579 --> 01:09:26,199 + 강화 학습은 우리가 정말 그렇게 어떤 임의의 오늘에 대해 이야기하지 않았다 + +905 +01:09:26,199 --> 01:09:39,429 + 긴장 또는 있는지 확인에 대한 다른 질문 + +906 +01:09:39,429 --> 01:09:51,958 + 캡션 우리는 변압기를 얻었고, 그래 그 폐쇄하기 전에 + +907 +01:09:51,958 --> 01:09:56,649 + 캡션에 해당 네트워크에서이 스크립트를 기반으로 일 만에를 사용하여 생산됩니다 + +908 +01:09:56,649 --> 01:10:01,299 + 특히 나는 실제로 꽤 많은, 그래서 그것은 (14) (14) 그리드 생각 + +909 +01:10:01,300 --> 01:10:04,550 + 여전히 제한되어있어 위치하지만 그것의 그것을 훨씬에 착용 + +910 +01:10:04,550 --> 01:10:22,800 + 그래서 부드러운 관심과 그녀의 구금 사이에 보간에 대한 질문이 그래 + +911 +01:10:22,800 --> 01:10:26,279 + 당신이 상상할 수있는 11 점은 다음 부드러운 방식으로 네트워크를 훈련하고있다 + +912 +01:10:26,279 --> 01:10:29,929 + 당신이 종류의 분포가 선명하고 선명하게 처벌 훈련 동안과 + +913 +01:10:29,929 --> 01:10:32,949 + 선명하고 테스트 시간은 단지 전환과 그녀의 구금을 사용 + +914 +01:10:32,948 --> 01:10:37,938 + 대신에 나는 내가 그 짓하는 종이 기억할 수 있다고 생각하지만 난 꽤 해요 해요 + +915 +01:10:37,939 --> 01:10:43,130 + 확인은 어디 선가 그 생각을 본 적이 있지만, 실제로 나는 그녀와 함께 훈련 생각 + +916 +01:10:43,130 --> 01:10:46,099 + 구금은 선명 방식보다 더 잘 작동하는 경향이 있지만, 확실히이다 + +917 +01:10:46,099 --> 01:10:51,800 + 뭔가 확인을 시도 할 수 있다면 우리가 일을하고 있다고 생각하지 질문 + +918 +01:10:51,800 --> 01:10:54,179 + 몇 분 일찍 오늘은 그래서 당신의 숙제 완수 + diff --git a/captions/Ko/Lecture14_ko.srt b/captions/Ko/Lecture14_ko.srt new file mode 100644 index 00000000..a56f02d9 --- /dev/null +++ b/captions/Ko/Lecture14_ko.srt @@ -0,0 +1,4232 @@ +1 +00:00:00,000 --> 00:00:04,990 + 행정 난 당신이 내가 일을하지 않는 경우 모두가 지금 73 수행해야 말해 + +2 +00:00:04,990 --> 00:00:07,790 + 당신이 늦게 당신은 문제가 있다고 생각 + +3 +00:00:07,790 --> 00:00:11,280 + 이슬람 무덤은 우리가 아직도 그들을 통해 가서 거기에있어 매우 곧 될 것입니다 + +4 +00:00:11,279 --> 00:00:13,779 + 기본적으로있는 내가 다 생각하지만, 우리는 내가 보낼 것 몇 가지를 한 번 확인해야 + +5 +00:00:13,779 --> 00:00:14,199 + 그들을 밖으로 + +6 +00:00:14,199 --> 00:00:18,820 + 확인 그래서 우리는 클래스에 당신을 생각 나게 측면에서 어제 아주 보였다 + +7 +00:00:18,820 --> 00:00:22,629 + 간단히 분할에서 우리는 약간의 부드러운주의 모델 변전소 보았다 + +8 +00:00:22,629 --> 00:00:25,829 + 모델은 선택적으로 다른 부분에 주목 떨어져있는 + +9 +00:00:25,829 --> 00:00:28,028 + 당신의 처리와 같은 이미지는 재발 신경과 같은했다 + +10 +00:00:28,028 --> 00:00:32,020 + 네트워크 다행 당신이 선택적으로 장면의 어떤 부분에주의를 지불하고 + 
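The captions above walk through the spatial transformer: a small localization network predicts 2x3 affine transform parameters, those parameters generate a sampling grid over the input, and bilinear sampling through that grid keeps the whole crop smooth and differentiable. A minimal sketch of that pipeline, assuming PyTorch (the lecture itself predates this API; `affine_grid`/`grid_sample` supply the grid generation and bilinear sampling steps, and all sizes here are hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Minimal spatial transformer: a localization net predicts a 2x3
    affine matrix, the matrix defines a sampling grid over the input,
    and bilinear sampling through the grid is differentiable."""
    def __init__(self, in_ch=1, out_size=(28, 28)):
        super().__init__()
        self.out_size = out_size
        # Small localization network (could also be a shallow convnet).
        self.loc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_ch * 28 * 28, 32),
            nn.ReLU(),
            nn.Linear(32, 6),   # the 6 affine parameters
        )
        # Initialize to the identity transform so training starts stably.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)          # (N, 2, 3) affine params
        n, c = x.shape[:2]
        grid = F.affine_grid(theta, (n, c, *self.out_size),
                             align_corners=False)   # sampling grid
        return F.grid_sample(x, grid, align_corners=False)  # bilinear sampling

x = torch.randn(4, 1, 28, 28)       # e.g. a batch of distorted digits
crop = SpatialTransformer()(x)      # differentiable attended crop
print(crop.shape)                   # torch.Size([4, 1, 28, 28])
```

Because bilinear sampling averages only the four nearest input pixels, the gradient flowing back through `grid_sample` is very local, which matches the caveat raised in the lecture.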
+11 +00:00:32,020 --> 00:00:35,450 + 이러한 기능을 강화하고이되는 특수 변압기에 대해 시작됩니다 + +12 +00:00:35,450 --> 00:00:38,929 + 이미지의 일부를 자르기 다른 방법으로 기본적으로 아주 좋은 방법 + +13 +00:00:38,929 --> 00:00:43,769 + 또는 수소 또는 변형 된 모양의 모든 종류의 하나 일부 기능에없는 + +14 +00:00:43,770 --> 00:00:48,579 + 당신은 내부 네트워크를 슬롯 수 있습니다 PC의 때문에 매우 흥미 종류 등 + +15 +00:00:48,579 --> 00:00:52,049 + 아키텍처는 그래서 오늘 우리는 동영상에 대해 얘기하자 + +16 +00:00:52,049 --> 00:00:56,229 + 구체적으로 현재 이미지 분류에서 이제 여부에 의해 잘 알고 있어야합니다 + +17 +00:00:56,229 --> 00:00:59,390 + 기본적인 전투는 당신이 그것을 재 처리 (A)에 오는 이미지가 설정 + +18 +00:00:59,390 --> 00:01:03,239 + 동영상의 경우 분류 예를 들어 우리는 단지 하나의 이미지가되지 않습니다 + +19 +00:01:03,238 --> 00:01:07,728 + 이 실제로있을 것이다 (32)에 의해 32의 이미지가 그래서하지만 여러 프레임을해야합니다 + +20 +00:01:07,728 --> 00:01:13,829 + 전체 동영상은 32 강 이뿌다 언젠가 범위 그래서 확인에 의해 그래서 32 프레임 + +21 +00:01:13,829 --> 00:01:17,340 + 나는 우리가 I와 이러한 문제를 접근하는 방법에 뛰어 전에 대해 이야기하고 싶습니다 + +22 +00:01:17,340 --> 00:01:21,170 + 우리가 진정 사용에 대해 문의를 해결하기 위해 사용하는 방법에 대한 매우 간략하게 + +23 +00:01:21,170 --> 00:01:25,629 + 오른쪽에 오기 전에 그래서 가장 인기있는 기능 중 일부를 방법을 PCR 기반 + +24 +00:01:25,629 --> 00:01:30,019 + 이 조밀 한 궤적 특징 곳 일 오늘은 매우 인기가 + +25 +00:01:30,019 --> 00:01:34,140 + 모든 매달려 개발 한 난 그냥 당신에게 간단한 맛을주고 싶어 + +26 +00:01:34,140 --> 00:01:36,989 + 정확히 어떻게 가지 흥미로운 때문에 이러한 기능이 근무하고 + +27 +00:01:36,989 --> 00:01:39,609 + 그들은 온 방법의 측면에서 나중에 개발의 일부를 영감 + +28 +00:01:39,609 --> 00:01:43,429 + 이 궤도에 그래서 실제로 작동하는 비디오를 작동 쇼는 무엇이고 우리 + +29 +00:01:43,430 --> 00:01:47,140 + 이렇게 우리는이 비디오 재생이되고, 우리는 이러한 키를 검출 할거야 + +30 +00:01:47,140 --> 00:01:50,709 + 좋은 점은 비디오에서 추적하고 우리는 그들을 추적 할거야 + +31 +00:01:50,709 --> 00:01:54,679 + 당신은 모든 작은 트랙 결국 우리가 실제로 추적하는 것을하자 + +32 +00:01:54,680 --> 00:01:57,759 + 그 트랙하자 약 기능의 영상 다음 제비에서 + +33 +00:01:57,759 --> 00:02:01,868 + 에 대해에게 단지 범죄를 축적 주변 기능, 그래서 그냥주는 + +34 +00:02:01,868 --> 00:02:06,549 + 당신은 어떻게에 대한 생각은 세 단계는 우리가 기본적으로이 대략있는 일 + +35 +00:02:06,549 --> 00:02:10,868 + 이미지에 서로 다른 규모에서 특징점을 검출 나는 나에게 간단히 말씀 드리죠 + +36 +00:02:10,868 --> 00:02:11,960 + 그 어떻게하는지에 대한 + +37 +00:02:11,960 --> 00:02:16,810 + 다음 광 광류 방법을 사용하여 시간이 지남에 따라 그 기능을 트랙으로 이동 + +38 +00:02:16,810 --> 00:02:20,270 + 흐름 방법은 매우 간단하게 설명 해결 그들은 기본적으로 당신에게 운동 필드를 제공 + +39 +00:02:20,270 --> 00:02:23,800 + 한 가지에서 다른 그들은 장면이 하나의 프레임에서 이동하는 방법을 알려 + +40 +00:02:23,800 --> 00:02:28,070 + 어느 정도 익스트림 다음 우리가 기능의 전체 무리를 추출거야하지만, + +41 +00:02:28,069 --> 00:02:30,609 + 중요한 것은 우리는 단지 고정하는 기능 세트를 추출하지 않을거야 + +42 +00:02:30,610 --> 00:02:33,930 + 이미지에 위치하지만 우리가 실제로거야 나를 공격한다 + +43 +00:02:33,930 --> 00:02:37,700 + 말을하고 로컬 좌표 시스템은 모든 단일 트랙하자 등 + +44 +00:02:37,699 --> 00:02:41,869 + 욕심 이러한 히스토그램 돼 흐름을 주장하고 우리가 가고있는 자원이 될 + +45 +00:02:41,870 --> 00:02:45,610 + 열심히 여기 트랙 재치 떨어져 좌표계를 추출 할 수 + +46 +00:02:45,610 --> 00:02:49,200 + 우리는 히스토그램 구배 및 2 차원 화상은 기본적으로봤을 + +47 +00:02:49,199 --> 00:02:51,750 + 그 일반화 너무 + +48 +00:02:51,750 --> 00:02:54,780 + 비디오 및 그래서 사람들이를 인코딩하는 데 사용되는 물건의 종류입니다 + +49 +00:02:54,780 --> 00:03:01,009 + 키 포인트 검출부의 관점에서 공간 - 시간 폭탄 테러가있었습니다 + +50 +00:03:01,009 --> 00:03:04,239 + 좋은 기능과 추적하는 비디오를 감지하는 방법을 정확하게에 작업 꽤 많은 + +51 +00:03:04,240 --> 00:03:07,930 + 직관적으로 당신은 그가 때문에 너무 부드러운 있습니다 비디오를 추적하지 않도록 + +52 +00:03:07,930 --> 00:03:11,580 + 기본적를 취득하기위한 방법이 임의의 시각 기능에 로그인 할 수 없습니다 + +53 +00:03:11,580 --> 00:03:16,620 + 이에 대한 몇 가지 서류가되도록 추적하고 비디오하기 쉬운 점 세트 + +54 +00:03:16,620 --> 00:03:19,509 + 그래서 당신은이 같은 기능의 무리를 검출 + +55 +00:03:19,509 --> 00:03:23,039 + 이 동영상에 광학 플로우 알고리즘 + +56 +00:03:23,659 --> 00:03:28,060 + 프레임 및 제 2 프레임을 그리고 모션 필드 해결할 것 + +57 +00:03:28,060 --> 00:03:32,409 + 이 방법은 여행 곳에서 모든 단일 위치에서 변위 벡터 + +58 +00:03:32,409 --> 00:03:35,919 + 내가 광학 플로우 결과의 몇 가지 예를 들어으로 무료 
이동 + +59 +00:03:36,439 --> 00:03:42,270 + 기본적으로 여기에 모든 단일 픽셀은 방향에 의해 착색되는 것을 + +60 +00:03:42,270 --> 00:03:46,260 + 이는 예를 들어자가 갖도록 이미지의 부분은 현재 영상으로 이동 + +61 +00:03:46,259 --> 00:03:49,939 + 아마 당신은 수평 또는 뭔가 변환하는 모든 노란색 의미 + +62 +00:03:49,939 --> 00:03:53,680 + 추천 그것이 컴퓨팅 광학 흐름을 사용하기위한 두 가지 일반적인 방법 + +63 +00:03:53,680 --> 00:03:58,069 + 권투 말리크에서 블록으로 여기에 가장 일반적인 적어도 나 하나 + +64 +00:03:58,069 --> 00:04:00,949 + 그래서 당신이 경우에 사용하는 디폴트로 같은 종류의 인 하나 + +65 +00:04:00,949 --> 00:04:03,399 + 자신의 프로젝트에 광학 흐름을 계산하는 것은 내가 사용하는 것이 좋습니다 것 + +66 +00:04:03,400 --> 00:04:08,950 + 이 큰 변위 광학 플로우 방법은 그래서이 광 흐름이 우리를 사용하여 + +67 +00:04:08,949 --> 00:04:12,199 + 우리가 알고있는 광학 플로우를 사용하여 모든 주요 장소로는 우리로 이동을했습니다 + +68 +00:04:12,199 --> 00:04:15,859 + 한 번에 약 십오 프레임 일 수있다 이러한 릴 트럭 분량을 추적 끝 + +69 +00:04:15,860 --> 00:04:20,509 + 그래서 우리는이와 끝까지 0.5 초 정도 트랙이 비디오를 통해 할 수 있습니다 및 + +70 +00:04:20,509 --> 00:04:21,519 + 우리는 인코딩 + +71 +00:04:21,519 --> 00:04:26,129 + 모든 이들 기술자와의 어떤이 트랙 주변 지역에 갔다 + +72 +00:04:26,129 --> 00:04:29,710 + 함께 플레이하는 데 사용되는 모든 피터슨 시각이 히스토그램 사람들을 축적 + +73 +00:04:29,709 --> 00:04:34,668 + 정확히 특별히 때문에 비디오를자를 어떻게 같은 다른 종류의 + +74 +00:04:34,668 --> 00:04:37,359 + 우리는 히스토그램을 독립적 히스토그램과의 모든 일을 할 겁니다 + +75 +00:04:37,360 --> 00:04:40,389 + 이러한 비즈니스 다음 우리는 기본적으로 모든 히스토그램을 만들거야 + +76 +00:04:40,389 --> 00:04:45,220 + 이러한 모든 시각적 기능을 갖춘 도시와이 일을 모두가 SVM에 가서 + +77 +00:04:45,220 --> 00:04:48,050 + 사람들이 이러한 문제의 해결 방법의 측면에서 바위 레이아웃의 종류 + +78 +00:04:48,050 --> 00:04:55,720 + 트럭 단지로 생각 과거는 다섯 프레임이 될 것입니다 그리고 그것은이다 + +79 +00:04:55,720 --> 00:05:01,639 + 단지 XY 위치 그렇게 15 XY는 다음 교살 및 좌표 우리 + +80 +00:05:01,639 --> 00:05:07,168 + 우리가 실제로 접근 방법의 관점에서 현재 로컬 좌표계 추출 + +81 +00:05:07,168 --> 00:05:13,859 + 그와 함께 이러한 문제는 그녀가 첫 번째 층에 알렉스 그물을 호출되지 작동 + +82 +00:05:13,860 --> 00:05:17,560 + 세에 의해 예를 들어 227 (227)에 대한 이미지 thatís을 받게되며 + +83 +00:05:17,560 --> 00:05:22,310 + 11 11 96 필터를 재 처리하면 오른쪽에 대한 등의 적용 + +84 +00:05:22,310 --> 00:05:27,978 + 우리는 알렉스 그물이 아흔여섯 볼륨에 의해 5555 결과 보았다 + +85 +00:05:27,978 --> 00:05:30,468 + 우리는 실제로 각에서 모든 필터의 모든 응답을 갖는 + +86 +00:05:30,468 --> 00:05:34,788 + 하나의 공간적 위치 당신이 경우 합리적인 방법이 될 것입니다 무슨 지금 + +87 +00:05:34,788 --> 00:05:38,158 + 우리는 단지이없는 경우에 작동하는 모든 작업을 수행 일반화하고 싶어 + +88 +00:05:38,158 --> 00:05:42,579 + 220 누군가가 23집니다하지만 일이 될 수는 인코딩 좋아하는 프레임 + +89 +00:05:42,579 --> 00:05:47,278 + 그래서 당신은에오고 그 15 227 227 배터리의 전체 블록이 + +90 +00:05:47,278 --> 00:05:50,180 + 달성 당신이 공간을 모두 에코하려는 일을 모두하고 + +91 +00:05:50,180 --> 00:05:54,209 + 시간적 패턴과 볼륨이 작은 블록 내부 그래서처럼 될 것이다 + +92 +00:05:54,209 --> 00:05:57,379 + 변경하는 방법에 대한 아이디어는 모든 일을 성취 + +93 +00:05:57,379 --> 00:06:00,379 + 이 경우에 일반화 + +94 +00:06:03,899 --> 00:06:27,609 + 나는 것으로 기대 확인 그 흥미로운 두 블록 등 그들을 배치 + +95 +00:06:27,610 --> 00:06:33,870 + 그게 문제가 관심의 종류 그래서 아주 아주 잘 작동하지 않는다 + +96 +00:06:33,870 --> 00:06:36,850 + 기본적으로 모든 신경에 의해 다음 단 하나의 프레임에서 찾고있다 + +97 +00:06:36,850 --> 00:06:39,720 + 당신이 당신과 함께 결국 주석의 끝이 그에 큰보고하고, + +98 +00:06:39,720 --> 00:06:43,310 + 더 큰 영역과 도전 그래서 결국 모두 볼 이러한 뉴런 + +99 +00:06:43,310 --> 00:06:46,470 + 귀하의 의견하지만 그들은 아주 쉽게 연관 할 수 없을 것입니다 + +100 +00:06:47,589 --> 00:06:52,589 + 이 이미지에서 조금 특별한 제어 패치 같은 사실은 확실하지 않다 + +101 +00:06:52,589 --> 00:07:04,149 + 정말 좋은 아이디어는 내가 그래서 우리는 그 중 몇 가지를 얻을 수있을 거라 생각 그것으로 만들어 놓을 않았다 + +102 +00:07:04,149 --> 00:07:07,149 + 그 같은 일을 + +103 +00:07:09,930 --> 00:07:25,199 + 효과적으로 45 채널을 가지고 그, 그래서 당신은에 코멘트를 넣을 수 + +104 +00:07:25,199 --> 00:07:28,919 + 모든 I에 도착 뭔가 당신은 내가는 생각하지 않는 것을 할 수 있다고 생각 + +105 +00:07:28,918 --> 00:07:44,049 + 그래서 당신이 시간의 한 조각의 일이 당신을 것을 말을하는지 예 '로 최고의 아이디어 + +106 +00:07:44,050 --> 00:07:48,379 + 다음 다른를 한 번에 기능과 유사한 
종류의 압축을 + +107 +00:07:48,379 --> 00:07:48,990 + 시각 + +108 +00:07:48,990 --> 00:07:52,829 + 피터이기 때문에 특별히 공유 그 일의 동기 부여와 유사 + +109 +00:07:52,829 --> 00:07:55,909 + 여기에 당신이 재산 곳의 같은 종류 그래서뿐만 아니라 거기 유용 + +110 +00:07:55,910 --> 00:07:58,910 + 당신은 공간뿐만 아니라 무게와 시간을 공유하고 싶습니다 + +111 +00:07:59,689 --> 00:08:03,550 + 확인 그래서 사람들이 일반적으로 수행하는 것이 기본 일의 아이디어 위에 구축 + +112 +00:08:03,550 --> 00:08:06,400 + 그들은 이러한 확장으로 상용 네트워크와 비디오를 적용 할 때 + +113 +00:08:06,399 --> 00:08:10,138 + 필터는 공간 필터를하지뿐만 아니라 할 수 있지만 이러한이 + +114 +00:08:10,139 --> 00:08:14,840 + 필터 우리가 Bielema (11)가 전에, 그래서 시간에 그들에게 소량의 확장 + +115 +00:08:14,839 --> 00:08:15,750 + 필터 + +116 +00:08:15,750 --> 00:08:21,709 + 몇 가지 작은 시간 정도 그렇게 예를 들어 말 티아 차 필터에 의해 1111 우리 + +117 +00:08:21,709 --> 00:08:28,759 + 세 가지 필터에 의해 그는 2011 년 30이었다 특히이 경우 최대 15로 사용할 수 있습니다 및 + +118 +00:08:28,759 --> 00:08:33,979 + 다음 세 가지에 의해 우리는 RGB를 가지고 있기 때문에 기본적으로이 필터는 당신이있어 지금 + +119 +00:08:33,979 --> 00:08:36,969 + 뿐만 아니라 공간에서 필터를 슬라이딩 생각하고 전체를 조각 + +120 +00:08:36,969 --> 00:08:40,469 + 활성화지도하지만 실제로뿐만 아니라 공간에서 필터를 슬라이딩하고 있지만, + +121 +00:08:40,469 --> 00:08:44,450 + 또한 시간에 그들은 시간에 작은 유한 한 시간 정도가 있고 + +122 +00:08:44,450 --> 00:08:48,379 + 당신이 도입하고, 그래서 확인 전체 활성화 볼륨을 조각 끝 + +123 +00:08:48,379 --> 00:08:51,909 + 시간은 모든 커널에 모든 죽어가는 단계에 언급하기를 + +124 +00:08:51,909 --> 00:08:55,899 + 그래서 회선을 수행 된 따라 추가 시간이 언급 + +125 +00:08:55,899 --> 00:08:59,659 + 그 사람들이 기능을 추출하는 방법 일반적으로 그리고 당신은이 속성을 얻을 + +126 +00:08:59,659 --> 00:09:04,009 + 안전 그래서 여기에 등 세 곳에 우리는 공간적 시간적를 수행 할 때 + +127 +00:09:04,009 --> 00:09:07,230 + 경쟁 우리는이 매개 변수를 공유하는 방식은 시간가는 결국 + +128 +00:09:07,230 --> 00:09:11,639 + 뿐만 아니라 당신이 그렇게 기본적으로 언급 한 바와 같이 어느 정도 모든 필터 시간과 + +129 +00:09:11,639 --> 00:09:14,360 + 우리는 공간뿐만 아니라 시간뿐만 아니라 회선을 + +130 +00:09:14,360 --> 00:09:18,800 + 활성 볼륨 정품 인증과 바람은 그래서 이들 중 일부 매핑 + +131 +00:09:18,799 --> 00:09:22,818 + 접근 방식은 이전의 것들의 예를 하나 아주 초기에 제안했다 + +132 +00:09:22,818 --> 00:09:28,238 + 활동 인식이 2010 년부터 아마이기 때문에 아이디어는 여기가 있음을했다 + +133 +00:09:28,239 --> 00:09:31,798 + 일의 대신 (40)에 의해 예순의 단일 입력을 받고 단지 몇 + +134 +00:09:31,798 --> 00:09:36,108 + 사진은 또한 우리는 마흔에 의해 사실 예순 일곱 프레임을 받고 자신의 + +135 +00:09:36,109 --> 00:09:40,119 + 우리가 그래서 이러한 필터들을 참조로 결론은 세 디컨 볼 루션 있습니다 + +136 +00:09:40,119 --> 00:09:44,220 + 예를 들어뿐만 아니라 우리가 3 차원으로 끝낼 같이 세 가지로 이제 일곱 판매 될 수 있지만 + +137 +00:09:44,220 --> 00:09:49,499 + 진정과 세 가지 조건은 여기에 모든 단일 단계에서 적용됩니다 + +138 +00:09:50,649 --> 00:09:55,208 + 2011 년 비슷한 종이하지만 같은 생각 우리는 친구의 블록을 + +139 +00:09:55,208 --> 00:09:59,518 + 들어오는 당신은 3 차원 완료 입체 필터에서 그들을 약속 + +140 +00:09:59,519 --> 00:10:03,229 + 이 상용 네트워크에있는 모든 단일 지점 그래서이 2011 아니다 + +141 +00:10:04,948 --> 00:10:08,748 + 매우 유사한 아이디어도 그렇게이 다음이 전에 실제로 알렉스 출신 + +142 +00:10:08,749 --> 00:10:12,889 + 접근 방식은 일이 그렇게 모든 작업을 수행하는 것이 작은 알고 같은 종류의 수 있습니다 + +143 +00:10:12,889 --> 00:10:16,829 + 이러한 대규모 애플리케이션의 제 1 종이 출신 + +144 +00:10:16,828 --> 00:10:19,828 + 용량에 의해 2014 년 멋진 종이의 모든 + +145 +00:10:20,830 --> 00:10:27,540 + 이 처리 동영상을 여기에 바로 오른쪽에있는 모델이 주 그래서 + +146 +00:10:27,539 --> 00:10:31,159 + 우리는 내가이는 지금까지 그렇게되게 같은 생각 느린 융합이라고 + +147 +00:10:31,159 --> 00:10:35,750 + 세 가지 차원 모두 시간과 공간에서 일어나는 경쟁 때문에 그건 + +148 +00:10:35,750 --> 00:10:38,879 + 느린 융합 천천히이 시간을 사용하고 있기 때문에 우리는 그것을 참조로 + +149 +00:10:38,879 --> 00:10:43,649 + 단지 우리는 이전과 정보는 천천히 이제 공간 정보를 사용하고 + +150 +00:10:43,649 --> 00:10:47,100 + 당신은 또한 왜 코미디 쇼 네트워크 및 그냥있는 수있는 다른 방법이 있습니다 + +151 +00:10:47,100 --> 00:10:51,769 + 몇 가지 컨텍스트를 제공하는 것은 역사적으로이 구글의 연구이며, 알렉스하자 + +152 +00:10:51,769 --> 00:10:55,039 + 그냥 와서 그들이 매우 잘 작동하기 때문에 모두가 슈퍼 흥분했다 + +153 +00:10:55,039 --> 00:11:00,579 + 이미지와 나는 구글 비디오 분석 팀에 있었고, 우리는에 실행하고 
싶었다 + +154 +00:11:00,580 --> 00:11:04,060 + 유튜브 동영상하지만 그것은 일반화하는 방법을 정확하게 꽤 명확하지 않았다 + +155 +00:11:04,059 --> 00:11:07,809 + 우리는 여러 가지 탐구 그래서 당신은 동영상을 다음 상용 네트워크와 알고 + +156 +00:11:07,809 --> 00:11:11,389 + 당신이 실제로 그래서 수레이를 착용하지 수있는 방법 건축 재료의 종류 + +157 +00:11:11,389 --> 00:11:17,889 + 접근 조기 융합의 종류라는 차원으로 융합는이 아이디어 사람 + +158 +00:11:17,889 --> 00:11:21,230 + 필요할 친구의 덩어리를 가지고 그냥 일어 났을 경우 앞에서 설명한 + +159 +00:11:21,230 --> 00:11:25,430 + 긴 채널은 45 등에 의해 227 227으로 끝낼 수 있습니다 + +160 +00:11:25,429 --> 00:11:29,500 + 이 종류의, 그래서 모든 것이 사들이고 당신은 그 위에 하나의 열을 + +161 +00:11:29,500 --> 00:11:35,200 + 맨 처음 통화하여 필터처럼 나중에 큰 시간적 범위를 가지고 있지만 + +162 +00:11:35,200 --> 00:11:38,780 + 다음 다른 모든부터 사실 두 차원의 경쟁은 우리 + +163 +00:11:38,779 --> 00:11:42,139 + 그는 매우 초기에 시간 정보를 거부했기 때문에 일찍 전화 + +164 +00:11:42,139 --> 00:11:45,879 + 다음 모두에의 첫 번째 편지는 당신이 상상할 수있는 호출 + +165 +00:11:45,879 --> 00:11:49,490 + 아이디어 알렉스 그물에 걸릴 여기 있도록 아키텍처는 가능성 회선입니다 + +166 +00:11:49,490 --> 00:11:53,169 + 우리는 그들을 떨어져 10 가지 그들이 그렇게 모두 독립적에 계산 말할 배치 + +167 +00:11:53,169 --> 00:11:57,169 + 이 10 점을 따로 따로 그리고, 우리는 완전히 연결에 많은 이상이어야합니다 + +168 +00:11:57,169 --> 00:12:00,620 + 레이어, 그리고, 우리는 단지보고 단일 청구 기준을했다 + +169 +00:12:00,620 --> 00:12:03,830 + 비디오의 한 프레임은 그래서 당신은 정확히 흰색 선까지로 재생할 수 있습니다 + +170 +00:12:03,830 --> 00:12:08,440 + 이 모델은 그들이 했어 상상할 수있는 아시아 모델을 보면 세 + +171 +00:12:08,440 --> 00:12:13,130 + 차원 대령은 이제 첫 번째 층은 실제로 그들을 시각화 할 수 있으며, + +172 +00:12:13,129 --> 00:12:16,210 + 이 다음은 동영상에 당신이 학습 결국 기능의 종류입니다 + +173 +00:12:16,210 --> 00:12:18,990 + 그들은 지금 때문에 이동하는 것을 제외하고 잘 알고 있었다 기본적 기능 + +174 +00:12:18,990 --> 00:12:22,680 + 이 필터는이 작은을 가지고 소량 및 시간을 연장된다 + +175 +00:12:22,679 --> 00:12:26,049 + 블롭을 이동하고, 그들 중 일부는 정적이고, 그들 중 일부는 이동 그들이있어 + +176 +00:12:26,049 --> 00:12:30,729 + 기본적으로 첫 번째 층에 움직임을 감지하고 그래서 당신은 멋진을 종료 + +177 +00:12:30,730 --> 00:12:31,960 + 폭탄 테러 이동 + +178 +00:12:31,960 --> 00:12:48,090 + 문제는 우리가 그에게거야 얼마나 내가 대답은 예 아마 생각 + +179 +00:12:48,090 --> 00:12:53,269 + 단지 공간에서이 경우 더 작은 필터를 작동하고 당신은 더 깊이가 + +180 +00:12:53,269 --> 00:12:56,370 + 같은 적용에 나는 시간에 생각하고 우리 것을 수행하는 아키텍처를 볼 수 있습니다 + +181 +00:12:56,370 --> 00:13:07,220 + 의미하지만 기대 + +182 +00:13:08,190 --> 00:13:13,580 + 이렇게 분류 우리는 영상이 여전히 카테고리의 수를 분류 한 + +183 +00:13:13,580 --> 00:13:17,970 + 매 프레임에서 그러나 지금 당신은 단지 하나의 프레임 것이 아니라 작동하지 않는 + +184 +00:13:17,970 --> 00:13:23,740 + 프레임 소수 어쩌면하여 예측이 양쪽 alot을 + +185 +00:13:23,740 --> 00:13:28,539 + 실제로 안전의 기능은 반에게 재미와 끝까지하는 제 2 비디오 음료 + +186 +00:13:28,539 --> 00:13:32,909 + 본 논문도 발표 동영상을 동영상을 그들은 하나 이상했다 + +187 +00:13:32,909 --> 00:13:36,639 + 이 실제로 이유에 대한 백만 동영상과 500 클래스는 주어진 컨텍스트 + +188 +00:13:36,639 --> 00:13:41,759 + 이 동영상 작업을 가지 어려운 지금은 내가 있기 때문에 생각 + +189 +00:13:41,759 --> 00:13:45,480 + 문제는 지금 내가 생각이 너무 많은 매우 큰 규모가 아니다 것입니다 + +190 +00:13:45,480 --> 00:13:49,820 + 당신은 이미지 것을 볼 매우 다양한 이미지의 수백만 같은 데이터 세트가 + +191 +00:13:49,820 --> 00:13:53,230 + 비디오 영역에서 그 어떤 정말 좋은 동등하지 않으며 그래서 우리는 함께 노력 + +192 +00:13:53,230 --> 00:13:56,730 + 이것은 그러나 2013 년 상태 및 다시 내가 그것이 실제로 우리가 충분히 달성 생각하지 않습니다 + +193 +00:13:56,730 --> 00:14:00,519 + 그와 나는 우리가 여전히 정말로 암살자을 잃었 아주 좋은 표시되지 않는 생각 + +194 +00:14:00,519 --> 00:14:03,579 + 비디오 및 그 우리는 또한 약간에서 당신의 일부를 낙담하는 이유 부분적이다 + +195 +00:14:03,580 --> 00:14:08,050 + 프로젝트에이 작업은 이러한 매우 강력한을 재교육 할 수 없기 때문에 + +196 +00:14:08,049 --> 00:14:12,969 + 기능 데이터 세트는 단지 확실히 거기에 다른 종류이기 때문에 + +197 +00:14:12,970 --> 00:14:16,100 + 당신이보고 우리가 때때로 사람을주의 이유는 흥미로운 것들 + +198 +00:14:16,100 --> 00:14:21,490 + 그 때문에 매우 빠르게 매우 정교을 동영상에 작업 점점에서 + +199 +00:14:21,490 --> 00:14:24,490 + 때때로 사람들은 동영상이 그들이 수행하려는 경우 매우 흥분 생각 + +200 +00:14:24,490 --> 00:14:27,810 + 3d 컬러 앨리스 팀을 
표시하고는 모든 가능성에 대해 생각 + +201 +00:14:27,809 --> 00:14:31,469 + 그들을 위해 개방 실제로 단일 프레임 방법은 매우 것을 밝혀 + +202 +00:14:31,470 --> 00:14:34,820 + 강력한베이스와 나는 항상 첫 번째를하지 않는 실행하는 것이 좋습니다 것 + +203 +00:14:34,820 --> 00:14:37,710 + 동영상의 움직임에 대해 걱정하고 단지 첫 번째 작품 하나의 프레임을 시도 + +204 +00:14:37,710 --> 00:14:40,990 + 그래서이 논문의 예를 들어 우리는베이스 라인에서 하나에 대한 것을 발견 + +205 +00:14:40,990 --> 00:14:44,610 + 우리의 데이터 세트에서 59.3 %의 분류 정확도 + +206 +00:14:44,610 --> 00:14:48,600 + 다음 우리가 실제로 계정 작은 지역의 움직임을 고려하기 위해 최선을 시도했지만 + +207 +00:14:48,600 --> 00:14:54,440 + 우리는 11.6 %에 의해 아래로 당김이 모든 추가 작업 모든 여분의 컴퓨터 그래서 결국 + +208 +00:14:54,440 --> 00:14:57,529 + 그리고 당신은 내가 당신에게 시도거야 상대적으로 작은 이익에 결국 + +209 +00:14:57,528 --> 00:15:02,088 + 그가 될 이유 기본적으로 비디오는 항상 당신이하는만큼 유용하지 않다 + +210 +00:15:02,089 --> 00:15:07,230 + 직관적으로 생각하고, 그래서 여기에 예측 종류의 몇 가지 예입니다 그 우리 + +211 +00:15:07,230 --> 00:15:11,800 + 스포츠와 우리의 예측 다른 데이터 세트는 내가 이런 종류의 생각 + +212 +00:15:11,799 --> 00:15:15,528 + 강조 약간 이유에 비디오를 추가하는 것은 일부 설정에서와 같이 도움이되지 않을 수도 있습니다 + +213 +00:15:15,528 --> 00:15:19,740 + 여기에 특히 당신은 스포츠를 구분하고 그것에 대해 생각하려고하는 경우 + +214 +00:15:19,740 --> 00:15:23,930 + 이 회전처럼 수영이나 뭔가에서 테니스 말을 구별하려고 + +215 +00:15:23,929 --> 00:15:26,729 + 당신이 있다면 당신은 실제로 아주 좋은 지역의 움직임 정보를 필요로하지 않는 것을 + +216 +00:15:26,730 --> 00:15:29,610 + 파란색 물건을 많이 오른쪽 많은 수영에서 테니스를 구별하려고 + +217 +00:15:29,610 --> 00:15:33,350 + 빨간색 물건의 이미지가 실제로 정보의 엄청난 금액을 가지고과 같이 + +218 +00:15:33,350 --> 00:15:36,240 + 당신은 추가 매개 변수를 많이 넣고이 후 이동하려는 + +219 +00:15:36,240 --> 00:15:40,959 + 대부분의 클래스의 대부분은 실제로 지역 운동은하고 있지만, 지역 운동 + +220 +00:15:40,958 --> 00:15:44,289 + 매우 중요하지 그들은 당신이 매우 세분화 된 경우에만 중요한 것 + +221 +00:15:44,289 --> 00:15:47,919 + 작은 움직임이 실제로 정말 많은으로 많은 문제 카테고리 + +222 +00:15:47,919 --> 00:15:52,419 + 이 동영상이 경우 당신은 미친 시간적 공간적 사용하는 경향됩니다 + +223 +00:15:52,419 --> 00:15:56,860 + 비디오 네트워크 그러나 나는 그 운동이 매우 약 열심히 생각 + +224 +00:15:56,860 --> 00:15:59,980 + 중요하고 그렇지 않은 경우 결과를 얻을 수 있기 때문에 당신은 설정하는 + +225 +00:15:59,980 --> 00:16:04,070 + 그는 작업을 많이 넣어 곳이 같은 그것은 잘 작동의를 살펴 보자되지 않을 수 있습니다 + +226 +00:16:04,070 --> 00:16:10,180 + 작동 다른 비디오 분류 그래서 이것은 2015 4월 자사의 + +227 +00:16:10,179 --> 00:16:14,698 + 상대적으로 인기가 그것은 바다 3d 및 아이디어라고 여기에 기본적이었다 있어요 + +228 +00:16:14,698 --> 00:16:18,528 + 네트워크는 두 가지로이 아주 좋은 그 3 개월 불러 아키텍처와 두가 + +229 +00:16:18,528 --> 00:16:22,110 + 여기에 생각에 걸쳐 풀 멋진의 정확한 같은 일을 할 수 있다는 것입니다하지만, + +230 +00:16:22,110 --> 00:16:25,169 + 시간에 모든 확장하므로 지점으로 돌아가는 당신은 매우 작은합니다 + +231 +00:16:25,169 --> 00:16:29,069 + 이 모든 세 가지입니다 때문에 필터가 내 나무를 구입하는 구입 기억 수도 있습니다 + +232 +00:16:29,070 --> 00:16:33,100 + 아키텍처 전반에 걸쳐 풀은 그래서 차원에서 큰 미국의 매우 간단한 종류의 + +233 +00:16:33,100 --> 00:16:36,528 + 접근 방식의 종류 및 그 합리적으로 잘 작동하고 당신이 볼 수 + +234 +00:16:36,528 --> 00:16:38,429 + 참조 용 종이 + +235 +00:16:38,429 --> 00:16:42,389 + 접근 방법의 또 다른 형태는 실제로는 카렌 시몽에서로 아주 잘 작동합니다 + +236 +00:16:42,389 --> 00:16:43,778 + 2014 년 + +237 +00:16:43,778 --> 00:16:48,299 + 같은과 같은 방법으로 그는 BG하지 그가 해낸 사람의 SIMONIAN + +238 +00:16:48,299 --> 00:16:51,828 + 또한 비디오 분류에 아주 좋은 종이를 가지고 있으며 여기에 생각이 있다는 것입니다 + +239 +00:16:51,828 --> 00:16:54,299 + 이 종류의 때문에 그는 세 가지 차원의 경쟁을하고 싶지 않았다 + +240 +00:16:54,299 --> 00:16:55,219 + 그것을 가지고 고통 + +241 +00:16:55,220 --> 00:17:00,360 + 98 그것을 발견하고 너무 너무에 그는 단지 컴파일하지만 아이디어를 측정하는 데 사용 + +242 +00:17:00,360 --> 00:17:05,179 + 여기에 우리가 와서해야 할 이미지를 찾고, 다른 하나는 점이다 + +243 +00:17:05,179 --> 00:17:10,298 + 이 두 단지 이미지 만 너무 비디오의 광학 흐름에 있습니다보고 + +244 +00:17:10,298 --> 00:17:14,699 + 광학 흐름은 기본적으로 상황이 이미지의 이동 방법을 알려줍니다 + +245 +00:17:14,699 --> 00:17:19,120 + 그래서이 둘은 평균 그물 같은 또는 알렉스 싫어하는처럼 그냥 가지입니다 + +246 +00:17:19,119 --> 00:17:23,139 + 그 중 하나의 이미지에 이들의 또 다른 가까운 하나가 추출이 + 
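The captions above describe giving convolution filters a small temporal extent so they slide over space and time, and the later C3D-style design that instead stacks small 3x3x3 filters with pooling across time. A minimal shape-level sketch, assuming PyTorch; the clip sizes and channel counts are illustrative, not the papers' exact configurations:

```python
import torch
import torch.nn as nn

# A video clip is a 5-D tensor: (batch, channels, frames, height, width).
clip = torch.randn(1, 3, 15, 227, 227)   # roughly half a second of frames

# Spatio-temporal convolution: the 11x11 spatial filter gets a small
# temporal extent (3 frames here), so it slides over space AND time and
# produces an activation volume instead of a 2-D activation map.
conv1 = nn.Conv3d(in_channels=3, out_channels=96,
                  kernel_size=(3, 11, 11),   # (time, height, width)
                  stride=(1, 4, 4))
print(conv1(clip).shape)   # torch.Size([1, 96, 13, 55, 55])

# C3D-style alternative: stack small 3x3x3 filters (the VGG idea taken to
# three dimensions) and pool across time as well as space.
c3d_block = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool space only, keep early time
    nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3d(kernel_size=2),           # now pool over time too
)
small_clip = torch.randn(1, 3, 16, 112, 112)   # C3D-style 16-frame crop
print(c3d_block(small_clip).shape)             # torch.Size([1, 128, 8, 28, 28])
```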
+247 +00:17:23,140 --> 00:17:28,059 + 광학 흐름은 전 브롱스 방법을 말한다 다음은 University of Florida의 사용을 허용하는 + +248 +00:17:28,058 --> 00:17:31,720 + 아주 늦은 결국 이렇게 두 가지의 정보를 몇 가지 아이디어에 대해 생각해 + +249 +00:17:31,720 --> 00:17:34,850 + 다음 그들이 비디오의 클래스의 관점에서보고있다 및 거부 + +250 +00:17:34,849 --> 00:17:37,859 + 그들이 그들이 예를 찾을 수 있도록 그들을 이용하는 방법은 다양 + +251 +00:17:37,859 --> 00:17:42,979 + 당신은 그냥 특별한 코멘트는 이미지를 찾고 사용하는 경우 당신은 몇 가지를 얻을 + +252 +00:17:42,980 --> 00:17:47,120 + 방금 광 흐름에 와서 사용하는 경우 성능이 실제로도 수행 + +253 +00:17:47,119 --> 00:17:49,558 + 단지 원시 영상을보고보다 약간 더 + +254 +00:17:49,558 --> 00:17:54,178 + 이 경우 실제로 여기 광 흐름은 정보를 많이 포함 + +255 +00:17:54,179 --> 00:17:58,538 + 실제로 의해 여기 수 있도록 더 나은 지금 흥미로운 점을 끝낼 경우 + +256 +00:17:58,538 --> 00:18:01,879 + 방법은 당신 특히 여기 아키텍처의이 종류가있는 경우 + +257 +00:18:01,880 --> 00:18:05,700 + 세 가지 필터에 의해 많은 복잡한 역사는 실제로 것이라고 상상할 수 + +258 +00:18:05,700 --> 00:18:10,038 + 나는 그것이 실제로 당신이 좋겠 광학 흐름을 넣어하는 데 도움 않는 이유를 의미한다고 생각 + +259 +00:18:10,038 --> 00:18:13,158 + 중앙 및 프레임 워크에 우리가 이러한 의견 배울 것으로 기대하고 상상 + +260 +00:18:13,159 --> 00:18:16,049 + 특히 처음부터 모든 것을 그들이 뭔가를 배울 수 있어야합니다 + +261 +00:18:16,048 --> 00:18:20,599 + 즉, 광학 흐름을 계산하는 계산을 시뮬레이션하며 밝혀 + +262 +00:18:20,599 --> 00:18:24,230 + 때때로 비디오를 비교할 때 때문에 그 경우하지 않을 수 있음 + +263 +00:18:24,230 --> 00:18:29,440 + 만 병원에 네트워크 및 그것은 잘 작동 그래서 내가 생각 + +264 +00:18:29,440 --> 00:18:34,169 + 우리가 가지고 있지 않기 때문에 그 이유는 아마 실제로 데이터로 회복된다 + +265 +00:18:34,169 --> 00:18:37,900 + 충분한 데이터 우리가 당신이 실제로 아마이없는 생각 데이터의 소량 + +266 +00:18:37,900 --> 00:18:42,730 + 충분한 데이터가 실제로 기능 등 같은 아주 좋은 광학 흐름을 배울 수 + +267 +00:18:42,730 --> 00:18:45,599 + 실제로 하드에 갈 점점 왜 내 특정 대답을 것 + +268 +00:18:45,599 --> 00:18:48,819 + 너희들이에서 작업하는 경우 네트워크는 아마 대부분의 경우에서 돕는 당신의 + +269 +00:18:48,819 --> 00:18:51,839 + 내가 실제로 시도하는 것이 좋습니다 것입니다 비디오와 프로젝트는 이런 종류의 일하기 + +270 +00:18:51,839 --> 00:18:52,779 + 건축물 + +271 +00:18:52,779 --> 00:18:57,480 + 다음 광학 흐름과는 이미지의 척 당신은에 끝이 올 수 + +272 +00:18:57,480 --> 00:19:01,808 + 즉, 상대적으로 합리적인 접근 방식처럼 좋아 보인다 그래서 지금까지 우리는 얘기했습니다 + +273 +00:19:01,808 --> 00:19:06,339 + 시간의 작은 지역 정보에 대한 권리 그래서 우리는이 작은이 + +274 +00:19:06,339 --> 00:19:07,398 + 조각 + +275 +00:19:07,398 --> 00:19:10,069 + 블랙 0.5 초 적 좋을한다 활용하려 + +276 +00:19:10,069 --> 00:19:13,739 + 실제로 많은이 동영상이 경우 분류하지만 무슨 일이 + +277 +00:19:13,739 --> 00:19:14,489 + 더 길게 + +278 +00:19:14,489 --> 00:19:19,700 + 당신이 모델 같은 종속의 시간적 종류 그래서 그건뿐만 아니라 그 + +279 +00:19:19,700 --> 00:19:22,319 + 지역 운동은 중요하지만 실제로 어떤 이벤트가 걸쳐있다 + +280 +00:19:22,319 --> 00:19:25,548 + 비디오 네트워크와 실제로의 시간 규모에서 훨씬 더 큰 것을 + +281 +00:19:25,548 --> 00:19:29,618 + 문제 때문에 이벤트 이후에 발생하는 이벤트는 하나 몇 가지 클래스의 매우 나타낼 수 있습니다 + +282 +00:19:29,618 --> 00:19:33,999 + 당신이 실제로 그 모델이 그렇게 일하는 것이하려는 종류의은 + +283 +00:19:33,999 --> 00:19:39,659 + 실제로 당신은 얼마나 알고에 당신이 노력에 대해 생각하는 것이 접근 + +284 +00:19:39,659 --> 00:19:42,659 + 당신은 훨씬 더 긴 기간 이벤트 이러한 종류의 모델을 실제로 될까요 + +285 +00:19:44,618 --> 00:19:54,009 + 당신이있어 위에 어떤 긴장감을 가지고 같은 확인하므로주의 모델은 아마도 그래서 당신은 할 수있다 + +286 +00:19:54,009 --> 00:19:56,729 + 이 전체 비디오를 분류하려고하는 것은 어쩌면 통해 긴장을 갖고 싶어요 + +287 +00:19:56,729 --> 00:19:58,129 + 비디오의 다른 부분 + +288 +00:19:58,128 --> 00:20:12,689 + 그래 그게 내가보고 좋은 생각이 그래서 당신은 우리가 이러한 다중 스케일을 가지고 말을하는지이야 + +289 +00:20:12,690 --> 00:20:16,479 + 우리는 때때로 매우 낮은 상세 수준에 이미지를 처리​​하지만 어디 방법이다 + +290 +00:20:16,479 --> 00:20:20,298 + 우리는 이미지의 크기를 조정하고 아마 프레임으로 글로벌 수준에이를 처리 + +291 +00:20:20,298 --> 00:20:23,710 + 우리는 실제로 비디오의 속도를 내가 생각하지 않는에 코멘트를 넣어 원하는 수 있습니다 + +292 +00:20:23,710 --> 00:20:28,048 + 나는 그래서 네 생각은 매우 흔한 일이지만 상원 의원 재치있는 아이디어 + +293 +00:20:28,048 --> 00:20:33,618 + 문제는 대략 것을 기본적으로이 정도가 아마 열 번 너무 짧은 그것입니다 + +294 +00:20:33,618 --> 
00:20:37,019 + 그래서 우리의 초를 소비하지 않는 방법을 우리가 아키텍처를 어떻게해야합니까 + +295 +00:20:37,019 --> 00:20:40,179 + 기능 훨씬 더 긴 시간 규모 및 예측 + +296 +00:20:42,150 --> 00:20:48,300 + 예 여기에 하나의 아이디어는 우리는이 동영상을 가지고 있으며 우리는 다른 클래스가 그 + +297 +00:20:48,299 --> 00:20:50,599 + 시간에 모든 단일 시점에서 예측하기 좋아하지만 우리는 것을 원하는 것 + +298 +00:20:50,599 --> 00:20:54,849 + 예측 함수가 될 조금까지 숨 막혀 15초뿐만 아니라 실제로하기 + +299 +00:20:54,849 --> 00:20:59,149 + 당신이 실제로 사용으로 분별있는 생각 때문에 훨씬 더 긴 시간 비용 + +300 +00:20:59,150 --> 00:21:01,769 + 기록 작업에서 어딘가 현재 때문에 건축에있는 동안 + +301 +00:21:01,769 --> 00:21:04,990 + 네트워크는 당신이 모든 것을 통해 무한 상황과 주체를 가질 수 있도록 + +302 +00:21:04,990 --> 00:21:08,579 + 당신이 돌아갈 특히 최대 그때까지 당신을하기 전에 그 일이있다 + +303 +00:21:08,579 --> 00:21:12,119 + 이미 2011 년을 보여주는 한이 논문 그것은 그들이이 밝혀 + +304 +00:21:12,119 --> 00:21:16,289 + 전체 섹션 뺨이 걸릴 그들은 실제로 분석 팀이 곳 + +305 +00:21:16,289 --> 00:21:21,109 + 내가 그렇게 방법이야이 NLST라는 차원을 사용하여 2011에서 들여다가 있음을 정확히 수행 + +306 +00:21:21,109 --> 00:21:25,899 + 그들은 2011 년에 호출 그래서이 논문은 기본적으로 모두가 전에 + +307 +00:21:25,900 --> 00:21:29,920 + 3 차원 침착하고 대부분의 모델 글로벌 모션 모델 작은 지역 운동 + +308 +00:21:29,920 --> 00:21:34,860 + 엘라 자세 등으로 이들은 전체 연결 층 때문에 플레이에 스탬프를 넣어 + +309 +00:21:34,859 --> 00:21:37,849 + 그들은 다음이 재발와 완전히 연결 층을 함께 중독 + +310 +00:21:37,849 --> 00:21:40,939 + 당신은 모든 단일 프레임 클래스를 예측 할 때 당신은 무한 컨텍스트가 + +311 +00:21:40,940 --> 00:21:45,930 + 나는 꽤 시대를 앞서 생각하는이 논문이며, 그것은 기본적으로 모든 권한을 가지고 + +312 +00:21:45,930 --> 00:21:49,900 + 이 단지 65 시간에 설정되어 제외하고 나는 사람들이 더 많은 인기를 생각하지 않은 확실하지 않다 + +313 +00:21:49,900 --> 00:21:54,680 + 기본적으로이 이들 모두를 인식하는 방법 앞서 시간 종이입니다입니다 + +314 +00:21:54,680 --> 00:21:59,380 + 국가 대표팀 땀 나는 심지어 그 이후 그들에 대해 알고있다 전에 + +315 +00:21:59,380 --> 00:22:02,990 + 몇 가지 최근 %는 실제로 가지에서 매우 유사한 접근 방식을 + +316 +00:22:02,990 --> 00:22:07,190 + 제프 도나휴 2015은 모든 버클리에서 여기에 아이디어는 가지고있다 + +317 +00:22:07,190 --> 00:22:08,610 + 비디오 다시에 좋아 + +318 +00:22:08,609 --> 00:22:11,819 + 매 프레임을 분류하지만 그들은 보면 이러한 의견이 + +319 +00:22:11,819 --> 00:22:14,809 + 각각의 프레임은하지만 그들은 또한 앨리스는 해당 문자열 팀 한이 + +320 +00:22:14,809 --> 00:22:19,389 + 함께 일시적으로 나는이 구글이다 종이에서도 비슷한 생각 생각에서 + +321 +00:22:19,390 --> 00:22:24,160 + 그래서 여기에 아이디어는 광학 흐름을 가지고 이미지를 처리​​하는 것입니다 + +322 +00:22:24,160 --> 00:22:28,930 + 복잡하고 다시 당신은 시간이 지남에 그렇게 다시 병합 애널리스트 오전이 + +323 +00:22:28,930 --> 00:22:34,680 + 로컬 및 글로벌이이 조합은 그래서 지금까지 우리는 어떤 종류의 검토 한 + +324 +00:22:34,680 --> 00:22:37,789 + 당신의 분류를 달성 두 아키텍처 패턴이 + +325 +00:22:37,789 --> 00:22:43,170 + 실제로 계정 중요한 정보 모델링 운동에 소요되는 + +326 +00:22:43,170 --> 00:22:47,289 + 예를 들어 짐승 항목은 사용 광학 플로우를 요구 이상의 전역 움직임을 볼 수 있습니다 + +327 +00:22:47,289 --> 00:22:51,059 + 여기서 우리는 화학 함께 시퀀스 아침 시간 단계 또는 융합이 + +328 +00:22:51,059 --> 00:22:54,418 + 두 사람은 지금 실제로 나는이 있다는 점을 확인하는 등의 + +329 +00:22:54,419 --> 00:22:59,879 + 내가 최근 논문에서 본 다른 청소기 아주 좋은 흥미로운 아이디어와 + +330 +00:22:59,878 --> 00:23:03,689 + 그때는 훨씬 더 좋아하고 그래서 여기에 기본적으로의 바위 그림 무엇 + +331 +00:23:03,690 --> 00:23:08,330 + 지금 우리가 일부 비디오를 가지고 같은 것들을 우리는 차원이 말을 그 온 보일 + +332 +00:23:08,329 --> 00:23:13,038 + 그 사용 광학 플로우는 차원 열 또는 둘 모두를 사용하여 주문할 수 있습니다 + +333 +00:23:13,038 --> 00:23:17,898 + 프레임의 트렁크는 데이터를 크랭크 후 불행하게도 꼭대기에 자리 잡고있다 한 + +334 +00:23:17,898 --> 00:23:20,979 + 또는 장기 모델링을하고 그 그 때문에 종류의 같은 + +335 +00:23:20,980 --> 00:23:24,950 + 이 약의 종류 아주 좋은되지는 불안하다 그이 자신의 아들 + +336 +00:23:24,950 --> 00:23:29,499 + 이러한 구성 요소에 대한 추악한 비대칭이 당사자에게 3 차원 내부의 신경 세포가하는 + +337 +00:23:29,499 --> 00:23:33,079 + 당신은 비디오의 몇 가지 작은 지방 덩어리의 일부입니다 그 와서 + +338 +00:23:33,079 --> 00:23:35,849 + 맨 이러한 신경 세포가 그 비디오의 모든 우리의 기능 + +339 +00:23:35,849 --> 00:23:40,808 + 올 모든 일의 함수 자신의 기록 단위 때문에 + +340 +00:23:40,808 --> 00:23:45,288 + 그 전에 그래서 그것은 불안 비대칭 또는 뭔가처럼 같은 종류의 + 
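To model events longer than a half-second clip, the captions above describe running a convnet on every frame and feeding the per-frame features to a recurrent network (an LSTM), so the prediction at each timestep has, in principle, unbounded temporal context. A minimal sketch assuming PyTorch; the tiny CNN here is a stand-in for the AlexNet/VGG-style feature extractor the papers actually use:

```python
import torch
import torch.nn as nn

class ConvLSTMClassifier(nn.Module):
    """Per-frame CNN features + LSTM over time: local appearance from the
    convnet, long-term temporal context from the recurrence."""
    def __init__(self, num_classes=10, feat_dim=128):
        super().__init__()
        # Tiny stand-in for a real per-frame CNN.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, feat_dim), nn.ReLU(),
        )
        self.lstm = nn.LSTM(feat_dim, 256, batch_first=True)
        self.head = nn.Linear(256, num_classes)

    def forward(self, video):                  # video: (N, T, 3, H, W)
        n, t = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1))  # run the CNN on every frame
        feats = feats.view(n, t, -1)
        h, _ = self.lstm(feats)                # context accumulates over time
        return self.head(h)                    # a class score at every frame

logits = ConvLSTMClassifier()(torch.randn(2, 30, 3, 64, 64))
print(logits.shape)   # torch.Size([2, 30, 10])
```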
+341 +00:23:45,288 --> 00:23:48,720 + 그래서 몇 주 전에에서 매우 영리한 어떤 생각을 가지고 종이가있다 + +342 +00:23:48,720 --> 00:23:54,249 + 모든 것이 아주 좋은 곳이 훨씬 더 좋은 균일 한 라이프 스타일입니다 + +343 +00:23:54,249 --> 00:23:58,118 + 어떻게 우리가 할 수 있었던 사람이 생각할 수있는 경우 마진과 간단하고 그래서 난 몰라 + +344 +00:23:58,118 --> 00:24:06,819 + 하지만 우리는 모든 것을 훨씬 더 청소기를 만들기 위해 할 수 있습니다 내가 할 수 없었던 나는 때문에 + +345 +00:24:06,819 --> 00:24:09,019 + 이 아이디어를 제공하지만 난 그것을 읽고 무엇 멋진라고 생각하지 않습니다 + +346 +00:24:09,019 --> 00:24:22,399 + 주석이 실제로 어떤 것을 확실하지 않은 이미지 처리를 시작하기 전에 + +347 +00:24:22,398 --> 00:24:25,288 + 당신이 찢어진 것 참조 산산이 광 정보 및 의견을 줄 것이다 + +348 +00:24:25,288 --> 00:24:30,169 + 어떻게 든 당신이 확실히의 함수이다 신경을 것 위에 + +349 +00:24:30,169 --> 00:24:34,090 + 그것은하지만 모든 미국 팀이이 경우에 일을해야 될지 분명하지 않다 + +350 +00:24:34,089 --> 00:24:37,388 + 아마에서 처리 너무 낮은 수준의 픽셀을 흐리게 될 가능성이 + +351 +00:24:37,388 --> 00:24:51,678 + 그 시점은 다음 작품을 참을 같은 미디어를 많이있다 + +352 +00:24:51,679 --> 00:24:56,389 + 이 문제는 모든 비트를 찾고 있음을 다르게 시간적 해상도 + +353 +00:24:56,388 --> 00:25:04,038 + 모든 모든 여행 친구처럼 보이는 내가 그래서 당신의 말을 또 다른 시간 + +354 +00:25:04,038 --> 00:25:07,009 + 나는 당신이이 걸릴 경우 다른 사람이 지적한 것과 유사한 생각 아이디어 + +355 +00:25:07,009 --> 00:25:10,179 + 비디오 당신은 때 비디오를 빠르게 해당 동영상에 여러 저울에서 작동 + +356 +00:25:10,179 --> 00:25:14,778 + 당신은 비디오를 느리게 그리고 당신은 그 앞줄에있어 온 3D했습니다 + +357 +00:25:14,778 --> 00:25:23,989 + 그것은 현명한 생각이 같은 속도 또는 뭔가처럼 배경을 수행 할 수 있습니다 + +358 +00:25:23,989 --> 00:25:26,669 + 일을보기 위하여 흥미에 내가 그는 생각에 빼기 만 보면 + +359 +00:25:26,669 --> 00:25:30,639 + 내가 생각하는 합리적인 생각은 종류의 엔드 - 투 - 엔드를 갖는이 아이디어에 반하는 + +360 +00:25:30,638 --> 00:25:33,868 + 당신은 당신이 생각하는이 명시 적 계산과 같이 소개하고 있기 때문에 학습 + +361 +00:25:33,868 --> 00:25:37,759 + 그가 가지고로서 유용 + +362 +00:25:42,288 --> 00:25:48,658 + 3 차원 사이에 공유가 나오고 그들이 그 재미의 내가 아니에요 + +363 +00:25:48,659 --> 00:25:52,139 + 아르 논 때문에 확실히 백퍼센트는 상태 벡터와 행렬을 잤다된다 + +364 +00:25:52,138 --> 00:25:55,678 + 곱셈과 사물처럼하지만 진정 플레이어에서 우리는 공간을 싫어했다 + +365 +00:25:55,679 --> 00:26:05,369 + 구조 나 공유가 작동하는 방법을 실제로 모르겠지만 그래 좋아하므로 + +366 +00:26:05,368 --> 00:26:11,319 + 아이디어는 우리가 우리가있어 지금 없애하는거야 보게 될 것이다 + +367 +00:26:11,319 --> 00:26:14,408 + 기본적으로이에 걸릴 것 우리는 모든 단일 신경 세포를 만들거야 + +368 +00:26:14,409 --> 00:26:17,379 + 그 모든 같은 작은 재발 성 신경 네트워크로 나온다 + +369 +00:26:17,378 --> 00:26:21,648 + 하나의 신경 세포가 확인하는 방식 때문에이 작동합니다 진정에 재발된다 + +370 +00:26:21,648 --> 00:26:27,178 + 그리고 나는 그것이 아름다운 생각하지만, 자신의 사진이 그렇게 추한의 종류의 종류 + +371 +00:26:27,179 --> 00:26:29,730 + 많은이 말도 안돼 위해 이렇게 나를 약간이 설명하려고하자 + +372 +00:26:29,730 --> 00:26:36,278 + 우리가 대신 무엇을 할 거 야 다른 방법은 우리가 어딘가에 발신자를 가지고있다 + +373 +00:26:36,278 --> 00:26:40,278 + 신경 네트워크가 수술 이전에 침착 아래에서 입력을 받아 또는 + +374 +00:26:40,278 --> 00:26:43,398 + 우리는이를 통해 경쟁을하고있는 일이의 출력을 계산하기 + +375 +00:26:43,398 --> 00:26:47,528 + 여기에 아이디어는 우리가 매일 조금 오는 만들려고하고있다 우측 있도록 층 + +376 +00:26:47,528 --> 00:26:53,058 + 나중에 때문에 재발 플레이어의 종류 우리가 할 길을 우리가 그대로입니다 + +377 +00:26:53,058 --> 00:26:57,528 + 에 대한 우리는 우리 아래에서 입력을 받아 우리는 그 위에 오는 않지만 우리는 또한 우리를 취할 + +378 +00:26:57,528 --> 00:27:00,778 + 대신 이전 시간으로부터 이전 출력 + +379 +00:27:00,778 --> 00:27:05,638 + 그 외에도 이전 시간 단계에서이 발신자 그래서 거기 플레이어 + +380 +00:27:05,638 --> 00:27:09,408 + 이 때 물건과 우리가 모두이 이상 대회를 수행하는 것이 현재의 입력 + +381 +00:27:09,409 --> 00:27:13,830 + 하나 하나, 그리고, 우리는 종류의 우리가 우리가있을 때 호출하지 않습니다 알고있다 + +382 +00:27:13,829 --> 00:27:19,490 + 이전 복장에서 현재 입력하고 정품 인증에서 이러한 활성화 및 + +383 +00:27:19,490 --> 00:27:24,649 + 우리는 그들을 추가하거나 우리가 병합 같은 그 일처럼 재발을 수행하는 것이 같은 + +384 +00:27:24,648 --> 00:27:28,719 + 그 두 생산의 최대이며, 그래서 우리는 현재의 입력의 기능이야 + +385 +00:27:28,720 --> 00:27:34,730 + 뿐만 아니라 이전 활성화의 기능은 너무 감각을 만드는 경우 + +386 +00:27:34,730 --> 00:27:37,200 + 그것은이 두 차원을 사용하여 사실이었다 즉 대해 아주 좋다 + +387 
+00:27:37,200 --> 00:27:41,149 + 여기에 대회 이들 모두는 어디 때문에 더 차원 수는 없다 + +388 +00:27:41,148 --> 00:27:44,678 + 이전 야그의 리암의 깊이 권한에 의해 높이로 폭은 매우 함께 + +389 +00:27:44,679 --> 00:27:49,309 + 이전 계층의 깊이와 우리는 이전 시간에서 높은 깊이있는 + +390 +00:27:49,308 --> 00:27:52,408 + 이들 중 일부는 두 가지 차원 대회하지만 우리는 종류와 끝까지 + +391 +00:27:52,409 --> 00:27:57,710 + 재발 여기에 프로세스 등 하나의 방법처럼 재발과이를 볼 수 있습니다 + +392 +00:27:57,710 --> 00:28:00,659 + 우리가 바라 보았다 신경망은이 재발 위치를 가지고있다 + +393 +00:28:00,659 --> 00:28:03,980 + 당신은 상태에서 경쟁하기 위해 노력하고 있으며 이전 상태의 함수이다 + +394 +00:28:03,980 --> 00:28:07,878 + 현재 공격은 그래서 우리는 실제로 여러 가지 방법으로 보았다 + +395 +00:28:07,878 --> 00:28:14,058 + 연구 개의 포 엘 존중가 그래서 그 재발 또는 GRU GRU까지 배선 + +396 +00:28:14,058 --> 00:28:17,950 + LSD의 간단한 버전입니다 당신이 기억하지만 경우는 거의 항상 비슷한 있습니다 + +397 +00:28:17,950 --> 00:28:21,548 + 분석 팀에 성능이 약간 다른 업데이트 수식에 대한 GRU 그래서 + +398 +00:28:21,548 --> 00:28:24,499 + 실제로이 논문은에 무엇을 그 재발을 수행하고 참조 + +399 +00:28:24,499 --> 00:28:27,950 + 이 오스트리아의 간단한 버전이기 때문에 기본적으로 그들은 GRU을 그 + +400 +00:28:27,950 --> 00:28:31,899 + 단지뿐만 아니라 대신 모든 단일 매트릭스 작동하는 것은 일종의처럼 곱 + +401 +00:28:31,898 --> 00:28:36,758 + 진정으로 대체 당신은 당신이 상상할 수있는 수 있다면 그 모든 단일 행렬 + +402 +00:28:36,759 --> 00:28:41,819 + 여기에 곱하면 바로 전화가 그래서 우리는 우리의 입력을 통해 발전 할 수지고가 + +403 +00:28:41,819 --> 00:28:45,798 + 큰 출력을 포함하고는 이전의와 아래, 그리고, 우리는 결합 + +404 +00:28:45,798 --> 00:28:50,329 + 다만 미 GRU의 재발과 그들이 실제로 우리의 활성화를 가져올 수 및 + +405 +00:28:50,329 --> 00:28:57,158 + 이 같은 모습과 지금은 그냥 보이는 전에 그래서 우리는이 없습니다 + +406 +00:28:57,159 --> 00:29:01,179 + 일부 지역의 인터넷과 범위의 일부는 우리가 그냥이이 유한 한 우리의 + +407 +00:29:01,179 --> 00:29:05,679 + 소득은 모든 단일 층 전에하지만 컴퓨팅하지만 반환되는 경우 있음 + +408 +00:29:05,679 --> 00:29:06,410 + 또한 재미 + +409 +00:29:06,410 --> 00:29:11,610 + 이전 노력과 모두의 함수로 그에 따라서이 링크 + +410 +00:29:11,609 --> 00:29:14,990 + 균일 한의 매우 친절 그리고 좀 유전자처럼 그냥 233을 너무 많이 불리는 그 + +411 +00:29:14,990 --> 00:29:19,799 + 멕시코에서 인도 재발하고는 어쩌면 그건 내 간단한 그냥 대답의의 + +412 +00:29:19,799 --> 00:29:27,579 + 일이 이렇게 누군가 당신은 공간 시간 상용 네트워크를 사용하고 싶습니다 그래서 만약 + +413 +00:29:27,579 --> 00:29:30,819 + 당신의 프로젝트와 매우 흥분 때문에 동영상에 제일 먼저에 + +414 +00:29:30,819 --> 00:29:34,359 + 중지하면됩니다 그리고 당신은 당신이 정말로 필요 여부에 대해 생각해야 + +415 +00:29:34,359 --> 00:29:37,740 + 프로세스 운동 또는 전역 움직임이나 감정이 정말 중요합니다 당신의 + +416 +00:29:37,740 --> 00:29:41,839 + 분류 작업 당신이 정말로 운동이 그 다음 생각에 중요하다고 생각하는 경우 + +417 +00:29:41,839 --> 00:29:44,829 + 로컬 움직임이 그가 중요하다 모델링 할 필요가 있는지 여부에 대한 + +418 +00:29:44,829 --> 00:29:46,929 + 모든 전역 움직임을 위해 매우 중요하다 + +419 +00:29:46,930 --> 00:29:50,370 + 당신은 항상에이에 대해 당신이 시도해야 당신이의 힌트를 얻을에 기반 + +420 +00:29:50,369 --> 00:29:54,069 + 내가 말을 기준으로 한 해당 비교 한 다음 사용하여 시도해야 + +421 +00:29:54,069 --> 00:29:57,539 + 광학 플로우는 것 때문에 그 경우 데이터의 당신이 특히 적은 양의 그것 + +422 +00:29:57,539 --> 00:30:02,039 + 실제로는 아주 좋은 신호 세금 선취 특권 코드처럼 매우 중요하다 및 + +423 +00:30:02,039 --> 00:30:06,099 + 명시 적으로 광 흐름이 나와 보는 유용한 기능이라고 지정 + +424 +00:30:06,099 --> 00:30:09,609 + 당신이 지금 막 오후 일을보고있는이 박사를 시도하지만이를 생각할 수 + +425 +00:30:09,609 --> 00:30:12,599 + 실험도 최근 그래서 나는 실제로 내가 충분히 할 수있는 경우에 확실하지 않다 + +426 +00:30:12,599 --> 00:30:16,589 + 보증하거나 작동하는 경우가 아주 좋은 아이디어처럼 보인다하지만되지 않았습니다 + +427 +00:30:16,589 --> 00:30:21,849 + 아직 검증 그래서 그 행복 프로세스의 바위 레이아웃과 같은 종류의의의 + +428 +00:30:21,849 --> 00:30:25,339 + 현장에서 동영상 그래서 나는 저스틴 가고 있기 때문에 질문이 있는지 알고 + +429 +00:30:25,339 --> 00:30:28,339 + 다음에 올 + +430 +00:30:33,980 --> 00:30:43,289 + 이 일이 사용하지 않은보고있는 모든 P는 내가 안 좋은 질문 이잖아 + +431 +00:30:43,289 --> 00:30:46,879 + 내가 LLP 슈퍼 괜찮아요 전문가가 아니에요하지만이 생각하기 전에 보지 못했지만 그렇게 생각 + +432 +00:30:46,880 --> 00:30:52,980 + 그래서 나는 내가 너무 좋아 생각하지 않는다 그녀를 보지 못했다 추측 할 것 + +433 +00:31:18,880 --> 00:31:26,660 + 만 가진 측에 나는 확실히 뭔가 사람들이 할 
말을 + +434 +00:31:26,660 --> 00:31:31,810 + 당신은 단지 사람들 때문에 둘 다 할 너무 많은 논문을 볼 수 없습니다 싶은 + +435 +00:31:31,809 --> 00:31:35,639 + 그리고 사람의 수면 문제의 종류와 같은 것은 어쩌면 그들을 해결되지 공동으로하지만, + +436 +00:31:35,640 --> 00:31:38,620 + 확실히 회사는 실제 시스템에 뭔가 작업을 얻으려고 노력하는 당신 + +437 +00:31:38,619 --> 00:31:42,869 + 그런 일을 할 것입니다하지만 난 당신이 할 것입니다 거기에 아무것도 생각하지 않습니다 + +438 +00:31:42,869 --> 00:31:45,449 + 당신은 아마 당신은이 말 융합 접근 방식으로이 작업을 수행 + +439 +00:31:45,450 --> 00:31:49,039 + 오디오에서 가장 잘 작동하고 밖으로 나왔다 무엇이든 동영상에 가장 적합 + +440 +00:31:49,039 --> 00:31:55,029 + 어딘가 나중에 어떻게 든하지만 내가 할 수있는 유일한 뭔가하고와 함께 주장 + +441 +00:31:55,029 --> 00:31:57,639 + 신경망 권리 매우 간단 당신은 그냥 선수가 있기 때문에 + +442 +00:31:57,640 --> 00:32:00,410 + 어떤 점에서 둘의 출력을보고 다음은로 분류하고 + +443 +00:32:00,410 --> 00:32:09,860 + 모두의 기능은 그래서 우리는 그들을 놀라게 할거야 그리고 나는 우리가 얻어야 할 것 같아요 + +444 +00:32:09,859 --> 00:32:11,179 + 이리 + +445 +00:32:11,180 --> 00:32:14,180 + 희망 그것은 작동 + +446 +00:32:29,148 --> 00:32:34,108 + 확인 그래서 우리가 완전히 완전히 거 스위치 기어를하고있어 대한 이야기​​ 같아요 + +447 +00:32:34,108 --> 00:32:38,199 + 자율 학습은 그래서 여기에 대비 약간을하고 싶습니다 + +448 +00:32:38,200 --> 00:32:42,460 + 먼저 우리의 기본 정의 어떤 종류의에 대한 거 얘기 야 + +449 +00:32:42,460 --> 00:32:46,009 + 자율 학습은 우리는 방법에 대한 두 개의 서로 다른 종류의 이야기거야 + +450 +00:32:46,009 --> 00:32:50,858 + 그 자율 학습은 최근에 그래서 사람을 추방에 의해 공격되었다 + +451 +00:32:50,858 --> 00:32:53,408 + 특히 우리는 자동차 인코더와의이 아이디어에 대한 이야기​​를 거 + +452 +00:32:53,409 --> 00:32:58,679 + 적대적 네트워크와 내가 바로 그렇게 꽤 많이 내 리모콘이 필요 같아요 + +453 +00:32:58,679 --> 00:33:03,259 + 우리가 지금까지이 클래스에서 본 적이 모든 기본 그래서지도 학습이다 + +454 +00:33:03,259 --> 00:33:07,128 + 거의 모든지도 학습 문제 뒤에 설치는 우리가 가정이다 + +455 +00:33:07,128 --> 00:33:11,769 + 우리의 데이터 세트는 각 데이터 포인트의 종류는 두 가지 부품의 종류를 가지고있다 우리는이 + +456 +00:33:11,769 --> 00:33:15,858 + 우리의 데이터 액세스 한 다음 우리는 우리가 원하는 것을 왜 어떤 라벨 또는 출력이 + +457 +00:33:15,858 --> 00:33:20,028 + 해당 입력에서 해당로부터 생산 및 감독 학습에서 우리의 전체 목표는 + +458 +00:33:20,028 --> 00:33:24,888 + 우리의 매입 세액에 걸리는 일부 기능을 학습하고이 출력을 생성합니다 + +459 +00:33:24,888 --> 00:33:29,538 + 또는 당신이 정말로 그것을 거의 거의 모든 것에 대해 생각하는 이유와 경우 레이블 + +460 +00:33:29,538 --> 00:33:33,088 + 우리가이 클래스에서 보았던 것은이지도 학습의 일부 예입니다 + +461 +00:33:33,088 --> 00:33:37,358 + 다음 이미지로 이미지를 분류 행위 같은 뭔가를 설정하고 + +462 +00:33:37,358 --> 00:33:41,960 + 물체 검출과 같은의 라벨은 왜 이미지 및 액세스 이유 + +463 +00:33:41,960 --> 00:33:46,119 + 가 될 수 이유를 찾을 수 없습니다 이미지에서 개체의 집합 어쩌면이다 + +464 +00:33:46,118 --> 00:33:50,238 + 이 될 수 왜 우리가 캡처 이름을보고 캡션 후 이제 비디오하고 수 + +465 +00:33:50,239 --> 00:33:55,838 + 레이블 또는 캡션 또는 거의 아무것도 아무것도 중 하나는 그래서 난 그냥 원하는 + +466 +00:33:55,838 --> 00:33:59,450 + 학습 감독 점이 강력한이 매우 매우 매우 강력하게 + +467 +00:33:59,450 --> 00:34:03,819 + 그리고 포함 일반적인 프레임 워크는 우리가에서 수행 한 모든 것을 포함 + +468 +00:34:03,819 --> 00:34:08,960 + 지금까지 클래스와 다른 점은지도 학습은 실제로 시스템을 만드는 것입니다 + +469 +00:34:08,960 --> 00:34:12,639 + 즉, 실제로는 정말 잘 작동 시스템을 작동하고 매우 유용합니다 + +470 +00:34:12,639 --> 00:34:14,628 + 실제 응용 + +471 +00:34:14,628 --> 00:34:17,898 + 내가 생각 자율 학습은 개방 연구의 조금 더 + +472 +00:34:17,898 --> 00:34:22,338 + 정말 멋진, 그래서이 시점에서 질문 나는 정말 생각 + +473 +00:34:22,338 --> 00:34:26,199 + 일반적으로 사람을 해결하기위한 중요하지만이 시점에서 그것은 아마도 약간의 + +474 +00:34:26,199 --> 00:34:30,028 + 영역의 유형에 대한 연구의 초점 이상의 비트는 또한 약간의 작은 + +475 +00:34:30,028 --> 00:34:34,568 + 우리가 일반적으로 우리 가정 자율 학습, 그래서 잘 정의 + +476 +00:34:34,568 --> 00:34:37,579 + 우리는 PACS를 그냥 데이터 우리는 어떤 이유가없는 한 + +477 +00:34:38,349 --> 00:34:44,009 + 및 자율 학습의 목표는 데이터의 역할과 일을하는 것입니다 + +478 +00:34:44,009 --> 00:34:48,199 + 우리가 정말로하려는 일이 너무 일부 그래서 문제에 따라 달라집니다 + +479 +00:34:48,199 --> 00:34:51,939 + 일반적으로 우리는 우리가에 잠재 구조의 몇 가지 유형을 발견 할 수 있기를 바랍니다 + +480 +00:34:51,940 --> 00:34:56,710 + 데이터는 명시 적으로 어떤 레이블에 대해 아무것도 모른 채 역할 + +481 
+00:34:56,710 --> 00:34:59,650 + 당신이 이전의 기계 학습에서 볼 수도 고전적인 예 + +482 +00:34:59,650 --> 00:35:04,009 + 클래스는 그래서 수단과 같은 우리가 그냥있어 클러스터링 같은 것들이 될 것이다 + +483 +00:35:04,009 --> 00:35:07,728 + 점의 무리 우리는로를 구분하여 구조를 발견 + +484 +00:35:07,728 --> 00:35:13,268 + 클러스터는 자율 학습의 다른 고전적인 예는 것 + +485 +00:35:13,268 --> 00:35:18,248 + X이 시점에서 그냥 주성분 분석과 같은 + +486 +00:35:18,248 --> 00:35:22,098 + 데이터의 우리는 그 중 일부 저 차원 표현을 발견 할 + +487 +00:35:22,099 --> 00:35:27,170 + 입력 데이터 그래서 자율 학습이 정말 종류의 멋진 지역 만입니다 + +488 +00:35:27,170 --> 00:35:30,519 + 조금 더 문제가 구체적이고 약간은 덜 잘 정의 + +489 +00:35:30,518 --> 00:35:37,228 + 아키텍처로 특정되어 있으므로 두 가지를 학습 감독 + +490 +00:35:37,228 --> 00:35:42,358 + 깊은 학습 사람들은이 아이디어로 자율 학습에 대해 수행 한 + +491 +00:35:42,358 --> 00:35:46,048 + 오디오 인코더의이 아이디어는 전통적인 오스만의 종류에 대해 이야기합니다 + +492 +00:35:46,048 --> 00:35:49,318 + 또한 변분에 대해 이야기하는 아주 아주 오랜 역사를 가지고 분기 + +493 +00:35:49,318 --> 00:35:54,308 + 뉴스의이 종류이다 자동 인코더는 것입니다 그들에 아시아 트위스트를 냉각 + +494 +00:35:54,309 --> 00:35:57,729 + 실제로 일부 생식 적대적 네트워크에 대해이 정말 좋은 이야기 + +495 +00:35:57,728 --> 00:36:06,718 + 생각하지만 당신은 너무 자연스러운 이미지의 이미지와 모델 샘플을 생성 할 수 + +496 +00:36:06,719 --> 00:36:09,548 + 인 오디오 인코더와 아이디어는 매우 간단하다 + +497 +00:36:09,548 --> 00:36:14,088 + 우리는 일부 데이터이며, 우리는이 입력 거 패스 야하는 우리의 입력 자루가 + +498 +00:36:14,088 --> 00:36:19,710 + 인코딩 네트워크의 어떤 종류를 통해 데이터에서 일부 기능을 생산하는 일부 + +499 +00:36:19,710 --> 00:36:24,440 + 이 단계를 생각할 수이 있도록 잠재 기능을 사용하면 약간을 생각할 수 + +500 +00:36:24,440 --> 00:36:28,219 + 우리는 우리의 입력을거야 학습 가능 주요 구성 요소 분석과 같은 비트 + +501 +00:36:28,219 --> 00:36:33,298 + 다음 데이터 그래서 그 많은 다른 기능 표현으로 변환 + +502 +00:36:33,298 --> 00:36:38,940 + 이 10 이미지 때문에이 여기에 표시됩니다 같은 시간은 이러한 액세스는 이미지가 될 것입니다 + +503 +00:36:38,940 --> 00:36:42,989 + 이 인코더 네트워크는 같은 뭔가를 이렇게 아주 복잡한 일을 할 수 + +504 +00:36:42,989 --> 00:36:47,228 + PCA는 그냥 간단한 선형 변환이야 그러나 일반적으로이 완벽하게 될 수 있습니다 + +505 +00:36:47,228 --> 00:36:51,799 + 연결된 네트워크 원래 종류의 아마 다섯 10 년 전 + +506 +00:36:51,800 --> 00:36:56,130 + 종종 하나의 그들은 그것의 현재 시그 모이 단위로 네트워크에 완벽하게 연결되어 + +507 +00:36:56,130 --> 00:37:00,410 + 트레일러 단위 종종 깊은 깊은 네트워크와이 또한 뭔가 될 수 있습니다 + +508 +00:37:00,409 --> 00:37:09,230 + 길쌈 바로 그렇게 작동하지처럼 우리는이 생각이있는 Z + +509 +00:37:09,230 --> 00:37:13,820 + 그래서 역할을보다 우리가 배울 수있는 기능의 크기는 일반적으로 작은 + +510 +00:37:13,820 --> 00:37:18,789 + 데이터 그래서 우리는 우리의 역할에 대해 우리는 유용한 기능의 일종 할 필요가 없습니다 + +511 +00:37:18,789 --> 00:37:22,610 + 그냥 몇 가지로 인터넷 전송에게 데이터를 변환하기 위해 네트워크를 원하지 않는다 + +512 +00:37:22,610 --> 00:37:26,370 + 쓸모없는 표현은 우리가 실제로 데이터를 분쇄 강제로 원하는 + +513 +00:37:26,369 --> 00:37:29,900 + 통계 및 희망 도움이 될 수있는 몇 가지 유용한 방법을 요약 + +514 +00:37:29,900 --> 00:37:34,720 + 사람 다운 스트림 처리하지만 문제는 우리가 정말 어떤을하지 않아도됩니다 + +515 +00:37:34,719 --> 00:37:39,219 + 명시적인 레이블 그래서 대신에 우리가 필요로하는이 다운 스트림 처리를 위해 사용하는 + +516 +00:37:39,219 --> 00:37:43,159 + 대리의 어떤 종류를 발명 우리가 단지 데이터를 사용하여 사용할 수있는 요청 + +517 +00:37:43,159 --> 00:37:50,159 + 자체 회로는 우리가 자주 자동 인코더에 사용하는 요구 있도록이 좋습니다 + +518 +00:37:50,159 --> 00:37:55,719 + 재건의 우리는 매핑을 대신 배울 수있는 지혜가없는 사람, 그래서 + +519 +00:37:55,719 --> 00:38:00,119 + 우리는 이러한 기능의 Z에서 데이터의 행위를 재현 단지 거 시도하고 있고 + +520 +00:38:00,119 --> 00:38:05,119 + 이러한 기능은보다 크기가 작은 특히 희망 그것은 강제합니다 + +521 +00:38:05,119 --> 00:38:07,139 + 네트워크 요약하는 역할을합니다 + +522 +00:38:07,139 --> 00:38:11,420 + 입력 데이터의 유용한 통계 요약 희망 발견 할 + +523 +00:38:11,420 --> 00:38:16,289 + 재건하지만 더 유용 하나가 될 수있는 몇 가지 유용한 기능 + +524 +00:38:16,289 --> 00:38:19,920 + 일반적으로 이러한 기능은 다른 작업에 유용 할 수 있습니다 수 있습니다 우리의 경우 + +525 +00:38:19,920 --> 00:38:26,340 + 나중에 어떤 감독 데이터를 얻을 그래서 다시이 디코더 네트워크는 꽤 될 수있다 + +526 +00:38:26,340 --> 00:38:30,050 + 숙소에서 자동 그래서 처음에 대한 왔을 때 복잡 + +527 +00:38:30,050 --> 00:38:33,720 + 종종 이들은 단지 간단한 선형 네트워크 또는 작은 
하나 있었다 + +528 +00:38:33,719 --> 00:38:37,459 + 네트워크 신호하지만 지금은 깊이 네트워크와 종종이 될 수 있습니다 + +529 +00:38:37,460 --> 00:38:43,220 + 길쌈까지 될 것입니다 것은 너무 메이슨 작은 풍선 슬라이드, 그래서 좋은 시간입니다 + +530 +00:38:43,219 --> 00:38:46,869 + 자주이 디코더는 현재이 최대 길쌈 네트워크 중 하나가 될 것입니다 + +531 +00:38:46,869 --> 00:38:50,529 + 즉, 다시 수 있습니다 귀하의 입력 데이터보다 크기가 작은 당신의 기능을한다 + +532 +00:38:50,530 --> 00:38:56,880 + 및 종류의 원본 데이터를 재생 내가 좋겠하는 크기까지 다시 불면 + +533 +00:38:56,880 --> 00:39:00,579 + 이러한 일들이 실제로 그렇게 훈련을 아주 쉽게 있다는 점을 확인하려면 + +534 +00:39:00,579 --> 00:39:04,610 + 바로 여기가 그래서 난 그냥 토치에서 요리하는 간단한 예제입니다 + +535 +00:39:04,610 --> 00:39:05,050 + 래리 + +536 +00:39:05,050 --> 00:39:09,210 + 최대 그들의 디코더에 대한 모든 작업을 수행되는 코드 + +537 +00:39:09,210 --> 00:39:12,420 + 컨볼 루션 네트워크 당신은 실제로 재구성 배운다있어 것을 알 수 있습니다 + +538 +00:39:12,420 --> 00:39:19,159 + 가끔 볼 꽤 잘 다른 것은 데이터가 이러한 인코더이다 + +539 +00:39:19,159 --> 00:39:23,799 + 및 디코더 네트워크는 때로는 종류의 같은과 가중치를 공유합니다 + +540 +00:39:23,800 --> 00:39:27,740 + 정규화 전략과 이러한 반대 있음이 직감으로 + +541 +00:39:27,739 --> 00:39:32,329 + 작업은 그래서 어쩌면 난 둘 정도 같은 대기를 사용하려고하는 것은 의미가 있습니다 + +542 +00:39:32,329 --> 00:39:36,659 + 당신이 완전히 연결에 대해 생각하면 그냥 구체적인 예를 들어 당신은에 있다면 + +543 +00:39:36,659 --> 00:39:39,980 + 네트워크는 아마도 사용자의 입력 데이터의 일부 치수 D를 갖는 + +544 +00:39:39,980 --> 00:39:44,070 + 그리고 당신은 늦게와 데이터는 약간 작은 치수 H를 가지고있는 경우 것 + +545 +00:39:44,070 --> 00:39:47,769 + 이 인코더는 가중치 그냥 될 것 그냥 완전히 연결 네트워크했다 + +546 +00:39:47,769 --> 00:39:51,630 + 이 두바이 시대의 매트릭스와 지금 우리가 디코딩을하고하려고 할 때 + +547 +00:39:51,630 --> 00:39:54,470 + 보다 원래의 데이터를 재구성 + +548 +00:39:54,469 --> 00:39:59,129 + 다시 D에 각각의 뒷면에서 매핑 그래서 우리는 단지이 동일한 가중치를 재사용 할 수 있습니다 + +549 +00:39:59,130 --> 00:40:06,420 + 우리가이 일을 훈련 할 때 두 가지 영역 우리는 너무 행렬의 전치을 + +550 +00:40:06,420 --> 00:40:10,300 + 우리는 비교하는 데 사용할 수있는 손실 함수의 어떤 필요 + +551 +00:40:10,300 --> 00:40:15,400 + 재구성 된 우리의 원래 데이터와 데이터를 다음 번 자주 것 다 + +552 +00:40:15,400 --> 00:40:20,220 + 우리가하면, 그래서 유클리드 손실 지옥 같은 간단한에 L이 일을 훈련합니다 + +553 +00:40:20,219 --> 00:40:24,659 + 우리의 인터넷 작업을 선택하고 우리는 번째 분기 네트워크와 기능을 선택한 후 + +554 +00:40:24,659 --> 00:40:28,329 + 우리는 다른 보통의 신경망처럼이 일을 훈련 할 수있는 우리 + +555 +00:40:28,329 --> 00:40:32,420 + 디코딩을 통해 우리를 통해 전달할 일부 데이터를 인코딩에 도착 우리는 통과 + +556 +00:40:32,420 --> 00:40:37,900 + 컴퓨터 법률 sweetback의 전파 모든 것이 우리가이 훈련 그래서 일단 좋은 + +557 +00:40:37,900 --> 00:40:41,880 + 물건은 자주 우리가 너무 많은 지출이 디코더 네트워크를 취할 것 + +558 +00:40:41,880 --> 00:40:46,700 + 시간 학습과 난 그냥 좀 이상한 보인다 그것을 멀리 던질거야하지만, + +559 +00:40:46,699 --> 00:40:52,129 + 이유는 자체적으로 재구성하므로 대신 같은 유용한 작업 없다는 것이다 우리 + +560 +00:40:52,130 --> 00:40:56,349 + 입니다 실제로 유용한 작업의 일종으로 이러한 네트워크를 적용 할 + +561 +00:40:56,349 --> 00:41:01,099 + 아마 설정하기 때문에 여기에 감독 학습 과제는 우리가 배운 것입니다 + +562 +00:41:01,099 --> 00:41:05,179 + 희망이 모든 자율 데이터로부터이이 엔코더 네트워크 + +563 +00:41:05,179 --> 00:41:08,799 + 데이터를 압축하고 몇 가지 유용한 기능을 추출하기 위해 배운 등장 + +564 +00:41:08,800 --> 00:41:13,190 + 그리고, 우리는 더 큰 부분을 초기화하기 위해 인코더 네트워크를 사용하는거야 + +565 +00:41:13,190 --> 00:41:17,650 + 우리가 실제로 어쩌면 몇 가지 작은에 액세스 할 경우 지금 감독 업무와 + +566 +00:41:17,650 --> 00:41:18,280 + 데이터 세트 + +567 +00:41:18,280 --> 00:41:22,590 + 다음 희망이 작업 대부분이 여기에있는 수있는 몇 가지 레이블이 + +568 +00:41:22,590 --> 00:41:26,309 + 처음에이 자율 훈련을 수행 한 후 우리는 할 수있는 한 + +569 +00:41:26,309 --> 00:41:29,699 + 전체이 더 큰 네트워크를 한 후 미세 조정이를 초기화하는 것을 사용 + +570 +00:41:29,699 --> 00:41:35,509 + 관리 대상 데이터의 희망을 아주 소량 것은 그래서 이것은의 종류 + +571 +00:41:35,510 --> 00:41:39,380 + 자율 기능 학습의 꿈 중 하나의 꿈 당신을 + +572 +00:41:39,380 --> 00:41:43,410 + 그냥 구글에 갈 수없는 레이블이 정말 큰 데이터 세트를 + +573 +00:41:43,409 --> 00:41:46,409 + 영원히 이미지를 다운로드하고 이미지를 많이 얻을 정말 쉽습니다 + +574 +00:41:46,969 --> 00:41:51,399 + 문제는 레이블 그래서 당신은 어떤 시스템을 싶어 수집하는 비용이있다 + 
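The captions above describe the plain autoencoder setup: an encoder maps the input x (dimension D) to a smaller code z (dimension H), the decoder can tie its weights to the encoder by reusing the transpose W^T, training minimizes an L2 reconstruction loss, and afterwards the decoder is thrown away so the encoder can initialize a supervised network. A minimal sketch of the tied-weight case, assuming PyTorch and illustrative sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedAutoencoder(nn.Module):
    """One-layer autoencoder with tied weights: the encoder is an HxD
    matrix W and the decoder reuses W^T, as described above."""
    def __init__(self, d=784, h=64):
        super().__init__()
        self.W = nn.Parameter(torch.randn(h, d) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(h))
        self.b_dec = nn.Parameter(torch.zeros(d))

    def forward(self, x):
        z = torch.relu(F.linear(x, self.W, self.b_enc))  # encode: x -> z, H < D
        x_hat = F.linear(z, self.W.t(), self.b_dec)      # decode with W^T
        return x_hat, z

model = TiedAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)                 # stand-in batch (e.g. flattened images)
for step in range(100):
    x_hat, z = model(x)
    loss = F.mse_loss(x_hat, x)         # L2 reconstruction loss
    opt.zero_grad()
    loss.backward()
    opt.step()
# After training, keep only the encoder (self.W, self.b_enc) as an
# initialization for a downstream supervised model.
```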
+575 +00:41:51,400 --> 00:41:54,960 + 즉 자율 많은 데이터 엄청난 양 모두를 이용할 수도 + +576 +00:41:54,960 --> 00:41:59,570 + 또한 자동차 제조에서 감독 데이터의 단지 작은 양 그래서 + +577 +00:41:59,570 --> 00:42:03,940 + 이 밤 속성 만에이 제안되고있다 적어도 한 가지 + +578 +00:42:03,940 --> 00:42:07,670 + 내가 조금 인 너무 잘 작동하지 않는 경향이 생각하는 연습 + +579 +00:42:07,670 --> 00:42:12,010 + 이 아름다운 그런 생각이 다른 것 때문에 불행한 그 I + +580 +00:42:12,010 --> 00:42:15,890 + 거의 다시 가서 읽으면하는 보조 노트로 지적해야 + +581 +00:42:15,889 --> 00:42:21,179 + 지난 10 년 이상에서 수천 중반부터이 일에 문학 + +582 +00:42:21,179 --> 00:42:25,129 + 사람들은 자신의 아내가 미리 훈련 증가이라는 재미있는 것은이 그 + +583 +00:42:25,130 --> 00:42:30,010 + 그들은 자동 인코더 훈련에 사용하고 생각했다 공유 그 시간에 + +584 +00:42:30,010 --> 00:42:35,410 + 매우 깊은 네트워크가 있었다 2,006 훈련은 도전이고 당신이 경우 당신은 찾을 수 있습니다 + +585 +00:42:35,409 --> 00:42:39,429 + 당신이 가지고있는 경우에도 아마 45 숨겨진 말과 같이 인용 및 논문 + +586 +00:42:39,429 --> 00:42:44,359 + 층이 있도록 네트워크를 훈련 당시 학생 당 극단적으로 도전했다 + +587 +00:42:44,360 --> 00:42:48,760 + 대신 곳 패러다임을 가진과 그 문제를 해결 얻을들이 + +588 +00:42:48,760 --> 00:42:53,560 + 한 번에 하나의 편지를 양성하려고 그들은이이 일을 사용하지만 난 것 + +589 +00:42:53,559 --> 00:42:57,139 + 싶어이있는 제한된 볼츠만 기계 호출에 너무 많이 얻을 해달라고 + +590 +00:42:57,139 --> 00:43:01,279 + 인쇄상의 모델 그리고 그들은 이러한 제한 볼츠만 기계를 사용하는 것 + +591 +00:43:01,280 --> 00:43:05,880 + 한 번에 하나씩 거기에이 작은에 연수생의 종류 그래서 우리는 먼저해야합니다 우리의 + +592 +00:43:05,880 --> 00:43:12,070 + 입력 이미지 크기 W 하나의 최대 크기가 될 수 있으며, 이것은 아마 뭔가 될 것 + +593 +00:43:12,070 --> 00:43:16,630 + PCA 또는 사진의 다른 종류의 같은 변환 한 후 우리는 희망 것 + +594 +00:43:16,630 --> 00:43:19,990 + 제한된 볼츠만 기계에게 관계의 어떤 종류를 사용하여 배울 수 + +595 +00:43:19,989 --> 00:43:25,359 + 그 첫 번째 자신의 기능과 몇 가지 높은 수준의 기능 사이에 때 한 번 + +596 +00:43:25,360 --> 00:43:27,940 + 우리는 이유에서이 층을 알게되면 + +597 +00:43:27,940 --> 00:43:30,840 + 그 기능의 상단에 다른 제한 볼츠만 기계 학습 + +598 +00:43:30,840 --> 00:43:36,000 + 이러한 유형의 접근법을 사용하여 다음 레벨의 기능으로되도록 접속하면하자 + +599 +00:43:36,000 --> 00:43:40,050 + 그 욕심 방법 및 그하자 이런 종류의에서 한 번에 하나의 층을 훈련 + +600 +00:43:40,050 --> 00:43:43,980 + 그들에게 희망이 더 큰 네트워크에 대한 정말 좋은 초기화를 찾을 수 + +601 +00:43:43,980 --> 00:43:48,369 + 그래서이 욕심 사전 교육 단계 이후 그들은 전체를 스틱 것 + +602 +00:43:48,369 --> 00:43:52,099 + 함께이 거대한 오디오 인코더 다음 미세 조정 오디오 인코더로 + +603 +00:43:52,099 --> 00:44:00,469 + 공동 요즘, 그래서 우리가 정말 선 리우와 같은 것들로이 작업을 수행 할 필요가 없습니다 + +604 +00:44:00,469 --> 00:44:04,139 + 적절한 초기화 및 bash는 정상화 약간 애호가 + +605 +00:44:04,139 --> 00:44:08,730 + 일의이 유형은 그래서으로 더 이상 정말 필요하지 않습니다 애호가 최적화 + +606 +00:44:08,730 --> 00:44:12,659 + 이전 슬라이드의 예를 우리는 래리 길쌈이를 보았다 + +607 +00:44:12,659 --> 00:44:16,409 + 내가 휴전에 훈련이 그냥 디컨 볼 루션 오디오 인코더 + +608 +00:44:16,409 --> 00:44:17,429 + 일을하려고 + +609 +00:44:17,429 --> 00:44:20,149 + 모든 현대적인 신경망 기술을 사용하면 주위에 엉망이 없습니다 + +610 +00:44:20,150 --> 00:44:25,039 + 미국 항공 훈련 그래서 이것은 정말 더 이상 수행되는 것이 아닙니다 + +611 +00:44:25,039 --> 00:44:27,800 + 하지만 난 당신이 아마에 있기 때문에 우리는 적어도 언급해야한다고 생각 + +612 +00:44:27,800 --> 00:44:35,990 + 당신이 그래서 이런 것들에 대한 문헌에서 다시 읽으면이 아이디어를 발생 + +613 +00:44:35,989 --> 00:44:39,949 + 기본적인 아이디어 또는 분기 자동차는 나는이 아름답다 아주 간단 생각한다 + +614 +00:44:39,949 --> 00:44:44,009 + 우리가 희망을 배울 자율 많은 양의 데이터를 사용할 수있는 아이디어 + +615 +00:44:44,010 --> 00:44:49,710 + 몇 가지 좋은 기능은 불행하게도 그 작동하지 않습니다하지만 괜찮아요하지만 거기에 + +616 +00:44:49,710 --> 00:44:53,639 + 아마 작업의 다른 좋은 유형 우리는 자율 데이터로 할 것 + +617 +00:44:53,639 --> 00:44:56,639 + 질문 첫 번째 + +618 +00:44:59,068 --> 00:45:10,308 + 어제 질문은 여기에서 일어나고 것은 바로 그래서 이것은 이것이 무엇이다 + +619 +00:45:10,309 --> 00:45:14,880 + 이것은 어쩌면 당신이 우리의 입력 때문에 세 계층 신경 네트워크에 대해 생각할 수있다 + +620 +00:45:14,880 --> 00:45:18,410 + 거 것은 그래서 우리는 단지이 신경 것을 바라고 출력과 동일 + +621 +00:45:18,409 --> 00:45:22,788 + 네트워크 식별 기능을 배울하지만 정말하고에있어 것 + +622 +00:45:22,789 --> 00:45:26,099 + 우리 끝에 일부 손실 함수를 
갖는 항등 함수 학습하기 위해서 + +623 +00:45:26,099 --> 00:45:29,989 + 그 손실 성인과 같은 우리의 입력과 출력에 우리를 격려한다 + +624 +00:45:29,989 --> 00:45:35,429 + 같은 학습 식별 기능으로 아마 정말 쉬운 일입니다 + +625 +00:45:35,429 --> 00:45:39,379 + 수행하는 대신 우리는 쉽게 경로를하지하기 위해 네트워크를 강제하는거야 + +626 +00:45:39,380 --> 00:45:43,410 + 대신 희망이 아니라 단지 데이터를 토하는 및 학습보다 + +627 +00:45:43,409 --> 00:45:46,909 + 쉬운 방법으로 식별 기능을 대신 우린 병목 현상이야 + +628 +00:45:46,909 --> 00:45:51,268 + 중간에이 숨겨진 레이어를 통해 표현은 그래서 다음거야 배울 수 있어요 + +629 +00:45:51,268 --> 00:45:54,798 + 신원 기능하지만, 네트워크의 중간에 거이다가 집어 넣은해야 + +630 +00:45:54,798 --> 00:45:59,829 + 아래 데이터를 요약하고 압축하고 잘하면 그 그 압축 것 + +631 +00:45:59,829 --> 00:46:04,339 + 그 조금이 될 수 있으므로 다른 작업에 유용한 기능을 야기 할 + +632 +00:46:04,338 --> 00:46:14,719 + 좀 더 배려 확인은 주장 PCA이 단지 해답이었다 의문을 제기 + +633 +00:46:14,719 --> 00:46:19,259 + 문제는 그래서 만 허용하는 경우 PCA 특정 감각에 최적 인 것은 사실이다 + +634 +00:46:19,259 --> 00:46:25,278 + 경이는 소득 및 디코더가 단지 하나의 경우 어디 하나를 수행합니다 + +635 +00:46:25,278 --> 00:46:30,259 + 당신이 있다면 어떤 의미에서 최적의 참 다음 PCA 변환하지만, 선형 + +636 +00:46:30,259 --> 00:46:34,170 + 분기 및 디코더는 잠재적으로 더 큰 더 복잡한 함수이다 그 + +637 +00:46:34,170 --> 00:46:39,059 + 더 어쩌면 다층 신경망은 어쩌면 PCA가 더있다 없다 + +638 +00:46:39,059 --> 00:46:43,209 + 더 이상 다른 점은 수있는 권리 솔루션은 PCA는 단지 최적이다 + +639 +00:46:43,208 --> 00:46:44,308 + 특정 감각 + +640 +00:46:44,309 --> 00:46:48,670 + 특히 LG의 재건에 대해 이야기하지만 실제로 우리는하지 않습니다 + +641 +00:46:48,670 --> 00:46:51,798 + 실제로 우리가이 일을 배울 것으로 기대하고 재건에 관심 + +642 +00:46:51,798 --> 00:46:56,538 + 다른 작업에 유용한 기능 연습 때문에이 조금 이상을 볼 것이다 + +643 +00:46:56,539 --> 00:47:00,259 + 나는 아마되는 것이기 때문에 사람들은 항상 더 이상에게 사용하지 않는 것이 + +644 +00:47:00,259 --> 00:47:04,719 + 사실에 매우 적합한 손실 그래 특징 + +645 +00:47:04,719 --> 00:47:14,348 + 이것은이다 래리의 군대의 데이터의 생성 적 모델의이 종류이다 + +646 +00:47:14,349 --> 00:47:18,250 + 당신이 내기 당신의 종류의 두 시퀀스가​​ 상상 데이터 + +647 +00:47:18,250 --> 00:47:19,108 + 이 작업을 수행 할 수 + +648 +00:47:19,108 --> 00:47:23,579 + 두 가지의 생식 모델링 그래서 당신은 들어갈 필요 + +649 +00:47:23,579 --> 00:47:26,440 + 이 텍스트는 정확히 손실 기능을 파악하는 것이 이유 중 꽤 많은 + +650 +00:47:26,440 --> 00:47:31,260 + 하지만, 이들로 데이터를 어​​떤 우도 추천되는 것을 끝낸다 + +651 +00:47:31,260 --> 00:47:35,470 + 당신이 관찰되지 않고 그것이 우리가 의지하는 것이 실제로 멋진 아이디어 잠복 상태 + +652 +00:47:35,469 --> 00:47:40,868 + 일종의의 하나 하나 있도록 변분 오디오 인코더에 다시 방문 + +653 +00:47:40,869 --> 00:47:45,280 + 전통적인 오디오 인코더 문제는 배우를 바라고 있다는 것입니다 + +654 +00:47:45,280 --> 00:47:49,590 + 즉 그 멋진 일이의 기능을하지만 다른 일이 우리가 것입니다 + +655 +00:47:49,590 --> 00:47:54,670 + 에 같은 단지 기능을 습득뿐만 아니라 멋진 새로운 데이터를 생성 할 수 없습니다 + +656 +00:47:54,670 --> 00:47:59,320 + 우리는 잠재적으로 자율 데이터에서 배운 수있는 작업은 희망입니다 + +657 +00:47:59,320 --> 00:48:03,030 + 우리 후루룩 소리 내며 먹기 수있는 모델과 이미지의 무리 그것은 일종의 그 것을 수행 한 후 + +658 +00:48:03,030 --> 00:48:06,990 + 자연 이미지의 모습과이 메일 내용은 다음 후 무엇을 배운다 + +659 +00:48:06,989 --> 00:48:11,449 + 그것은 희망 원래의 모습 가짜 이미지의 종류를 뱉어 수 + +660 +00:48:11,449 --> 00:48:17,949 + 이미지하지만 가짜 이것은 어쩌면 바로 작업을 처리하지 않습니다 + +661 +00:48:17,949 --> 00:48:22,319 + 분류 같은 것들에 적용 할 수 있지만, 중요한 일처럼 보인다 + +662 +00:48:22,320 --> 00:48:26,588 + 인간이 데이터를 찾고 그것을 요​​약에서 꽤 좋은 사람과 + +663 +00:48:26,588 --> 00:48:31,199 + 그렇게 희망을 갖고 우리의 모델도 할 수 있다면 어떻게 생겼는지의 아이디어를 얻기 + +664 +00:48:31,199 --> 00:48:34,969 + 작업의 이런 종류는 잘하면 그들은 몇 가지 유용한 배운 것 + +665 +00:48:34,969 --> 00:48:41,299 + 요약이나 데이터의 일부 유용한 통계 변동 오디오 그래서 + +666 +00:48:41,300 --> 00:48:45,539 + 인코더는 우리가 할 수있는 원래 순서에 깔끔한 트위스트의이 종류이다 + +667 +00:48:45,539 --> 00:48:50,690 + 희망 실제로 그래서 여기에 우리 배운다 데이터에서 새로운 이미지를 생성 우리는 필요 + +668 +00:48:50,690 --> 00:48:54,849 + 인내의 약간에 뛰어 것을이 세금은 그래서 이것은 뭔가가 우리 + +669 +00:48:54,849 --> 00:48:58,320 + 정말이 시점에 더 이상하지만까지이 클래스에 대해 전혀 이야기하지 않은 + +670 +00:48:58,320 --> 00:49:02,420 + 하지만 근처하지 
않는 기계 학습의이 모든 다른 측면이있다 + +671 +00:49:02,420 --> 00:49:05,250 + 확률에 대한 정말 열심히 네트워크와 깊은 학습하지만 일 + +672 +00:49:05,250 --> 00:49:09,260 + 분포 및 부 합성 분포는 서로에 들어갈 수있는 방법 + +673 +00:49:09,260 --> 00:49:13,190 + 데이터 세트 다음 이유를 확률 적 데이터와에 대해 생성 + +674 +00:49:13,190 --> 00:49:16,670 + 그것은 당신에게 국가의 종류 수 있기 때문에 패러다임의이 유형은 정말 좋은 + +675 +00:49:16,670 --> 00:49:17,970 + 명시 적 확률 + +676 +00:49:17,969 --> 00:49:22,000 + 당신이 당신의 데이터를 어​​떻게 생각하는지에 대한 가정이 생성 한 후 그 주어졌다 + +677 +00:49:22,000 --> 00:49:25,858 + 확률 적 가정은 다음과 데이터에 모델을 파악하려고 + +678 +00:49:25,858 --> 00:49:30,199 + 당신의 가정은 그래서 변화 놀라운 분기 우리는이 가정을하는지 + +679 +00:49:30,199 --> 00:49:35,589 + 우리가 가정 있도록 방법이 특정 유형은하는 우리의 데이터가 생성 된 + +680 +00:49:35,590 --> 00:49:39,800 + 우리는 거의 세계가 몇 가지 사전 분포를 존재했습니다 + +681 +00:49:39,800 --> 00:49:44,440 + 이러한 잠재 미국 Z를 생성하고 우리는 우리는 몇 가지 조건 가정했습니다 + +682 +00:49:44,440 --> 00:49:49,789 + 일단 우리는 우리가 샘플을 최고의 상태를 생성 할 수있는 유통 + +683 +00:49:49,789 --> 00:49:54,389 + 다른 분포 변동 오디오 인코더 따라서 데이터를 생성하도록 + +684 +00:49:54,389 --> 00:49:58,170 + 정말 우리의 데이터가이 꽤 간단한 과정에 의해 생성 된 상상 + +685 +00:49:58,170 --> 00:50:03,639 + 먼저 우리는 몇 가지가 RAZ의 B를 얻기 위해 얻기 위해 몇 가지 사전 분포에서 샘플링 있음 + +686 +00:50:03,639 --> 00:50:10,940 + 직관이 그 역할을하므로이 조건에서 샘플은 우리의 행위를 얻을 수 + +687 +00:50:10,940 --> 00:50:15,240 + 이미지와 Z 같은 아마 그것에 대해 몇 가지 유용한 물건을 요약 + +688 +00:50:15,239 --> 00:50:19,649 + 이 훨씬 이미지를 볼 수 있었다, 그래서 만약 이미지 어쩌면 상태에 누워 그 그녀는 수 + +689 +00:50:19,650 --> 00:50:23,800 + 이 개구리 나 사슴 또는 고양이 여부 이미지의 클래스 같은 것을하고 + +690 +00:50:23,800 --> 00:50:27,690 + 또한 고양이가 지향하거나 어떤 색 방법에 대한 변수를 포함 할 수 있습니다 + +691 +00:50:27,690 --> 00:50:29,269 + 또는 그런 일 + +692 +00:50:29,269 --> 00:50:33,719 + 그래서 이것은 매우 간단 꽤 간단한 아이디어를 가진의 좋은 종류의 종류 + +693 +00:50:33,719 --> 00:50:37,279 + 하지만 당신이되고 이미지의 이미지를 상상하는 방법에 대한 많은 이해 + +694 +00:50:37,280 --> 00:50:43,670 + 문제 때문에 발생 지금 우리는 이러한 매개 변수를 충족 물어보고 싶은 것입니다 + +695 +00:50:43,670 --> 00:50:48,470 + 종래 실제로 않고 조건부 모두 데이터 + +696 +00:50:48,469 --> 00:50:52,598 + 그 도전의 이러한 최신 날짜에 대한 액세스를 참조하고는이의의 + +697 +00:50:52,599 --> 00:50:57,588 + 문제는 그래서 우리는거야 간단한 당신이에서 많이 볼 일을 만들려면 + +698 +00:50:57,588 --> 00:51:00,769 + 베이지안 통계 및 난 그냥 전과가있어 샴푸를 가지고 있다고 가정합니다 + +699 +00:51:00,769 --> 00:51:07,088 + 취급이 용이하고, 조건은 또한 표시됩니다 수 있지만 될거야 될 것 + +700 +00:51:07,088 --> 00:51:11,489 + 조금 애호가 그래서 우리는 대각선 평균과 가진 가우스 있다고 가정합니다 + +701 +00:51:11,489 --> 00:51:16,729 + 대신 죄송 대각 공분산 어떤 의미하지만,과 단위 우리는 단지거야 + +702 +00:51:16,730 --> 00:51:19,650 + 넣어하지만 우리는 사람들을 얻기 위하여려고하고있는 방법은 우리가 그들을 계산하는거야입니다 + +703 +00:51:19,650 --> 00:51:24,800 + 신경 네트워크 그래서 우리가 어떤 부분에 대한 최신 의지를 가지고 있다고 가정 + +704 +00:51:24,800 --> 00:51:27,579 + 데이터 우리는 그 말 대신한다고 가정 + +705 +00:51:27,579 --> 00:51:32,160 + 몇 가지 큰 복잡한 신경이 될 수있는 몇 가지 디코더 네트워크로 이동합니다 + +706 +00:51:32,159 --> 00:51:36,078 + 네트워크와 지금 신경 네트워크는 거이 거의 두 가지를 뱉어입니다 + +707 +00:51:36,079 --> 00:51:40,079 + 데이터의 의미를 뱉어는거야 데이터의 의미를 뱉어 것 + +708 +00:51:40,079 --> 00:51:45,068 + 행위 또한 데이터의 상기 분산은 그래서 당신이 생각해야 작용 + +709 +00:51:45,068 --> 00:51:48,958 + 이것은 우리가 보통 오디오 인코더의 위쪽 절반 같은​​ 아주 많이 보인다 + +710 +00:51:48,958 --> 00:51:52,699 + 우리가 어떤 그건 것으로 알려져있는이 링크 상태 최신 팔에서 작동하지만, + +711 +00:51:52,699 --> 00:51:57,588 + 지금 대신 직접 데이터를 침 대신에 그것을 밖으로 뱉어 것 + +712 +00:51:57,588 --> 00:52:01,690 + 데이터의 평균이 보이는 것보다 데이터의 분산되지만 다른 + +713 +00:52:01,690 --> 00:52:07,528 + 매우 일반적인 오디오 인코더의 디코더 등이이 디코더 그래서 + +714 +00:52:07,528 --> 00:52:11,518 + 일반 오디오 인코더 다시 생각의 네트워크 종류는 간단한 수 있습니다 + +715 +00:52:11,518 --> 00:52:14,578 + 완전히 연결된 것은 아니면이 매우 큰 강력한 디컨 볼 루션 수 있습니다 + +716 +00:52:14,579 --> 00:52:22,269 + 네트워크 이들 모두 문제가있다 의한 지금 매우 일반적인 + +717 +00:52:22,268 --> 00:52:26,679 + 사전을 부여하고 조건 바질가 
주어진다면 야구는 우리가 알 + +718 +00:52:26,679 --> 00:52:31,578 + 우리가 실제로이 모델을 사용하려면 우리가 할 필요가 있도록 주어진 것을 후부 + +719 +00:52:31,579 --> 00:52:35,209 + 상기 입력 데이터와 상기 방식에서 잠복 상태를 추정 할 수있는 것을 우리 + +720 +00:52:35,208 --> 00:52:38,659 + 입력 데이터에서 최고의 상태를 추정하면이를 작성하는 것입니다 + +721 +00:52:38,659 --> 00:52:42,899 + 쉽게 주어진 최신의 확률 사후 분포 + +722 +00:52:42,900 --> 00:52:47,519 + 관측 데이터 및 급여를 사용하여 우리는 쉽게 주위에이 플립과 그것을 쓸 수 있습니다 + +723 +00:52:47,518 --> 00:52:54,189 + 우리의 전 감독과 우리의 조건 지방의 관점에서 등 조건 우리 + +724 +00:52:54,190 --> 00:52:57,249 + 이스라엘 사용할 수 있습니다 실제로 주위에이 일을두고 측면에서 그것을 쓰기 + +725 +00:52:57,248 --> 00:53:02,409 + 우리가 이러한 역할을보고 난 후에 우리는 이것들을 분해 할 수 있도록 이러한 세 가지 + +726 +00:53:02,409 --> 00:53:06,818 + 세 가지 용어와 조건은 우리가 우리의 디코더를 사용하는 것이 우리는 볼 수 있습니다 + +727 +00:53:06,818 --> 00:53:11,558 + 네트워크와 우리는 쉽게 그에 대한 액세스 권한이 이전에 다시 우리가 접근 + +728 +00:53:11,559 --> 00:53:15,569 + 사전은 당신이 협상 것으로 가정 할에 그래서 다루기 쉽게하지만, + +729 +00:53:15,568 --> 00:53:19,458 + 당신은 당신이 운동하는 경우 경우이 분모의 역할이 확률​​은 밝혀 + +730 +00:53:19,458 --> 00:53:22,828 + 수학이 행에이 거대한 난치성 인 끝을 쓰는 + +731 +00:53:22,829 --> 00:53:26,579 + 그 완전히 다루기 힘든, 그래서 전체 선도적 인 상태 공간을 통해 더 없다 + +732 +00:53:26,579 --> 00:53:29,479 + 방법은 당신이 할 수 이제까지 포르노도 근사 여자와 그것이 될 것이라고 + +733 +00:53:29,478 --> 00:53:33,399 + 거대한 재난 그래서 대신에 우리는 심지어 여자에 그 평가하려고하지 않습니다 + +734 +00:53:33,400 --> 00:53:38,759 + 대신 우리는하려고 몇 가지 인코더 네트워크를 소개하는거야 + +735 +00:53:38,759 --> 00:53:40,179 + 직접 전 + +736 +00:53:40,179 --> 00:53:45,210 + 우리의 인쇄 재료에 때문에이 엔코더 네트워크는 데이터 포인트에 걸릴 것입니다 + +737 +00:53:45,210 --> 00:53:48,599 + 그리고 회의의 상태에 걸쳐 분포를 뱉어 것 + +738 +00:53:48,599 --> 00:53:53,210 + 공간은 그래서 다시는 매우 원래 오디오를 다시 찾고 보인다 + +739 +00:53:53,210 --> 00:53:57,449 + 몇 슬라이드에서 인코더 전이 매우 하단의 종류와 같은 모양 + +740 +00:53:57,449 --> 00:54:01,449 + 우리가 지금 데이터에 복용하는 기존의 오디오 인코더의 절반 + +741 +00:54:01,449 --> 00:54:04,789 + 대신 직접 최신 팔을 침으로 우리는거야 평균을 뱉어하고 + +742 +00:54:04,789 --> 00:54:09,519 + 및 주요 국가의 분산 다시 이번 분기 네트워크가 될 수 있습니다 + +743 +00:54:09,519 --> 00:54:13,639 + 뭔가 다소 논란의 네트워크 또는 어쩌면 약간의 깊은 수 있습니다 + +744 +00:54:13,639 --> 00:54:21,159 + 컨볼 루션 네트워크는 그래​​서 직관의 종류입니다이 만남 네트워크 + +745 +00:54:21,159 --> 00:54:25,259 + 별도의 완전히 다른 파괴하는 기능이있을 것입니다하지만 우리는 거 야 + +746 +00:54:25,260 --> 00:54:29,180 + 그것은 이러한 사후 분포에 근사하는 방식으로 훈련 시도 + +747 +00:54:29,179 --> 00:54:35,799 + 우리는 실제로 그렇게 할 때 우리는 아마 조각을 함께에 액세스하지 않는 것이 + +748 +00:54:35,800 --> 00:54:40,700 + 그 다음 우리는 이것에 상승을 줄이 모두 함께 스티치를 설정하고 얻을 수 있습니다 + +749 +00:54:40,699 --> 00:54:44,808 + 변화는 오디오 인코더 번, 그래서 우리는 우리가이 다음 함께 이러한 것들을 넣어 + +750 +00:54:44,809 --> 00:54:49,559 + 입력 데이터 포인트의 X 우리 것 우리의 인코더 네트워크와 통해거야 패스를 + +751 +00:54:49,559 --> 00:54:52,819 + 인코더 네트워크는 최고의 상태에 대한 분포를 뱉어 + +752 +00:54:52,818 --> 00:54:57,789 + 우리는 최신 날짜 이상이이 메일을 일단 당신이 상상할 수 + +753 +00:54:57,789 --> 00:55:01,650 + 그 분포에서 샘플링을 상상할 수있는 것은 얻기 위해 일부 일부 최고 + +754 +00:55:01,650 --> 00:55:07,700 + 우리가 한 번왔다하면보다 그 입력을 나에게 높은 확률의 상태를 보자 + +755 +00:55:07,699 --> 00:55:11,889 + 우리는 잠재 상태의 몇 가지 구체적인 예를 우리는 그것을 통과 할 수 + +756 +00:55:11,889 --> 00:55:16,409 + 이 디코더 네트워크 확률을 밖으로 확산되는있는 다음해야 + +757 +00:55:16,409 --> 00:55:20,469 + 우리가이 있으면 다시 다음 데이터의 확률을 가속화 + +758 +00:55:20,469 --> 00:55:24,439 + 우리는 그것에서 맛볼 수있는 데이터에 대한 분포는 실제로 뭔가를 얻을 수 + +759 +00:55:24,440 --> 00:55:29,950 + 그 희망이보고 끝이 때문에 원본 데이터 점처럼 보이는 + +760 +00:55:29,949 --> 00:55:34,269 + 우리는 우리가있어 우리의 입력 데이터를 복용하고 일반 오디오 인코더와 같은 매우 + +761 +00:55:34,269 --> 00:55:37,829 + 일부 잠재 상태를 얻기 위해이 엔코더를 통해 실행하거나에 전달 + +762 +00:55:37,829 --> 00:55:42,200 + 디코더는 완전히 원래의 데이터를 재구성하고이 훈련에 대해 갈 때 + +763 +00:55:42,199 --> 00:55:46,149 + 물건은 실제로 일반 오디오 인코더와 같은 매우 유사한 방법에서 훈련이야 + +764 +00:55:46,150 --> 
00:55:50,230 + 우리는 과거이 있고이 이전 버전과의 유일한 차이점은 손실에 전달 + +765 +00:55:50,230 --> 00:55:55,490 + 기능 상단에 우리는이 재건 손실이 아닌 것을 있도록 + +766 +00:55:55,489 --> 00:56:01,078 + (SL2)에 의해 표시 대신에 우리는이 분포가 실제에 근접 할 + +767 +00:56:01,079 --> 00:56:07,349 + 입력 데이터와 우리는이를 우리가 원하는 중간에 나오는 용어를 잃었다 + +768 +00:56:07,349 --> 00:56:11,230 + 레이튼 미국을 통해이 발생 분포는 희망 매우 유사 + +769 +00:56:11,230 --> 00:56:16,579 + 우리의 명시된 사전 분포에 우리는 매우 그래서 한 번 시작 부분에 아래로 썼다 + +770 +00:56:16,579 --> 00:56:19,200 + 당신은 그냥 보통처럼이 일을 시도 할 수 있습니다 함께 이러한 조각을 넣어 + +771 +00:56:19,199 --> 00:56:22,969 + 앞으로 전진 패스 정상 앞으로 오디오 인코더 및 후방 패스 + +772 +00:56:22,969 --> 00:56:29,058 + 만약 손실을 넣고 어떻게 그렇게 손실 해석 여기서 유일한 차이점은 + +773 +00:56:29,059 --> 00:56:32,500 + 우리의 종류를 통해 갈 때 어떤 설정에 대한 질문 그것은 종류입니다 + +774 +00:56:32,500 --> 00:56:39,608 + 그것의 사촌 엉덩이 네 질문은 왜 대각 공분산과 답변을 선택합니까된다 + +775 +00:56:39,608 --> 00:56:44,199 + 정말 쉬운 그들의 작업을 할 수 있지만, 실제로 사람들은 내가 생각하는 시도 + +776 +00:56:44,199 --> 00:56:50,210 + 약간 너무 일을 애호가하지만 당신이 너무 좋아 함께 놀러 수있는 일이다 + +777 +00:56:50,210 --> 00:56:53,530 + 우리가 실제로 한 번이 훈련을하고 나면 우리는 실제로 이러한 종류의 훈련을했습니다 + +778 +00:56:53,530 --> 00:56:56,920 + 변분 오디오 인코더 우리는 실제적으로 새로운 데이터를 생성하는 데 사용할 수 + +779 +00:56:56,920 --> 00:57:00,510 + 즉, 그래서 여기 종류의 원본 데이터 셋처럼 보이는 + +780 +00:57:00,510 --> 00:57:04,430 + 아이디어는 기억하고 있음을 우리는이 이전 될 수있는 당신이 협상 적어된다 + +781 +00:57:04,429 --> 00:57:07,960 + 아니면 뭔가 조금 애호가하지만 어떤 속도로이 사전은 뭔가 + +782 +00:57:07,960 --> 00:57:12,039 + 당신이 아주의 협상 그래서 우리는 쉽게에서 맛볼 수있는 유통 + +783 +00:57:12,039 --> 00:57:15,989 + 그 분포에서 무작위 샘플을 그리 쉽게 그래서 새로운 데이터를 생성하는 + +784 +00:57:15,989 --> 00:57:20,459 + 그저이 데이터이 데이터 생성 과정에 따라 시작됩니다 + +785 +00:57:20,460 --> 00:57:24,849 + 먼저 우리는 우리의 사전에서 우리의에서 샘플링됩니다, 그래서 우리는 데이터를 상상했던 것을 + +786 +00:57:24,849 --> 00:57:28,430 + 국가에있는 호수에 분포하고 우리는 우리의 디코더를 통해 전달됩니다 + +787 +00:57:28,429 --> 00:57:32,078 + 우리는 훈련 기간 동안 배운 네트워크와이 디코더 네트워크는 것 + +788 +00:57:32,079 --> 00:57:36,190 + 지금 모두의 측면에서에서 차례로 분배 무시하고 임명을 뱉어 + +789 +00:57:36,190 --> 00:57:40,460 + 내 말과 공분산 우리는 평균과 공분산이되면이 단지입니다 + +790 +00:57:40,460 --> 00:57:44,548 + 대각선 세상에는 우리가 쉽게 몇 가지 데이터를 생성하기 위해 다시이 일에서보실 수 있습니다 + +791 +00:57:44,548 --> 00:57:50,369 + 당신이 할 수있는 다른 일이 종류의 당신이이 일을 훈련 그래서 지금 11 포인트 + +792 +00:57:50,369 --> 00:57:54,440 + 잠재 공간에서 할 수있는 난에 오히려 잠재에서 샘플링보다 해요 + +793 +00:57:54,440 --> 00:57:58,490 + 최신에서 충성의 분포 대신에 단지 밀집 샘플 + +794 +00:57:58,489 --> 00:58:01,979 + 베이스 종류의 네트워크가 가진 구조 구조의 유형의 아이디어를 얻을 수 있습니다 + +795 +00:58:01,980 --> 00:58:09,280 + 우리는 우리가이 훈련 그래서 이것은 그래서 여기에이 데이터 집합에 정확히하고있다 배웠다 + +796 +00:58:09,280 --> 00:58:12,990 + 최신 여덟 곳으로 변화 오디오 인코더 단지입니다 + +797 +00:58:12,989 --> 00:58:17,959 + 일이 차원 이제 우리는 실제로 공간에서이 말을 스캔 할 수 있습니다 우리 + +798 +00:58:17,960 --> 00:58:22,490 + 늦은 공간 및 각 포인트에 대한 밀도가 이러한 2 차원 탐색 할 + +799 +00:58:22,489 --> 00:58:26,519 + 잠상 공간 디코더 통과 일부 화상을 생성하도록 사용하면 + +800 +00:58:26,519 --> 00:58:30,599 + 실제로는 그런 종류의 아름다운 구조를 발견 있다고 볼 수 있습니다 + +801 +00:58:30,599 --> 00:58:34,618 + I가있을 것이다, 그래서의 원활 다른 자리 클래스 사이에 보간 + +802 +00:58:34,619 --> 00:58:38,530 + 여기에서 당신을 아래로 가서 당신이 여섯 제로로 모프의 종류 참조 왼쪽 + +803 +00:58:38,530 --> 00:58:42,690 + 볼 여섯 것으로는 BB의 화려로 칠로 설정되어 남부의 에이즈은 + +804 +00:58:42,690 --> 00:58:46,159 + 이 잠재 그래서 여기에 아래로 사람 어딘가에 중간에 매달려 + +805 +00:58:46,159 --> 00:58:50,049 + 공간은 실제로 아주에있는 데이터의이 아름다운 풀림을 배웠다 + +806 +00:58:50,050 --> 00:58:55,910 + 좋은 자율 방식으로 우리는 또한 우리의 얼굴 데이터 세트에이 일을 설정할 수 있습니다 그것은이다 + +807 +00:58:55,909 --> 00:58:59,199 + 우리가이 2 차원의 변화를 훈련하고 이야기의 같은 종류의 + +808 +00:58:59,199 --> 00:59:02,679 + 오디오 인코더 다음 우리가 훈련을하면 우리는 밀도에서 늦은에서 샘플링 + +809 +00:59:02,679 --> 00:59:05,679 + 공간은 그가 서 배운 것을보고 시도 + +810 +00:59:13,018 --> 
00:59:19,458 + 그래 그래서 문제는 사람들이 지금 가장 구체적를 강제로 시도 여부 + +811 +00:59:19,458 --> 00:59:23,139 + 변수는 일부 일부 일부 정확한 의미를하고 그래 몇 가지가있다 + +812 +00:59:23,139 --> 00:59:27,058 + 후속 정확하게 수행 작업을 깊은 역이라는 종이가 있음 + +813 +00:59:27,059 --> 00:59:31,890 + 그들이 시도 정확하게이 설정을 수행하는 것이이 MIT에서 그래픽 네트워크 + +814 +00:59:31,889 --> 00:59:36,199 + 의 신경망으로 그들이 렌더러의 종류를 학습 할 위치를 강제로 그렇게 + +815 +00:59:36,199 --> 00:59:41,568 + 그들은 일부를 강제 할 것들의 3D 이미지를 렌더링 좋아하는 배우고 싶어요 + +816 +00:59:41,568 --> 00:59:44,619 + 잠재 공간에서 변수의 일부 잠재 공간 + +817 +00:59:44,619 --> 00:59:49,289 + 물체의 3 차원 각도와 아마 클래스와에 대응 + +818 +00:59:49,289 --> 00:59:53,009 + 물체와 그 나머지의 휴식은 무엇이든 배울지도 그 + +819 +00:59:53,009 --> 00:59:56,099 + 원하고 그녀가 가지고있는 멋진 실험은 지금은 정확히 할 수 있었다 + +820 +00:59:56,099 --> 01:00:00,809 + 당신이 말한으로 그 그 특정 값을 잠재 변수를 설정하여 + +821 +01:00:00,809 --> 01:00:03,869 + 그 렌더링 실제로 객체를 회전 것들이다 그 꽤 있습니다 할 수 있습니다 + +822 +01:00:03,869 --> 01:00:09,390 + 멋진하지만 그건 그건있어 그 다음에 공백 그러나이보다 애호가의 많은입니다 + +823 +01:00:09,389 --> 01:00:11,908 + 얼굴은 여전히​​ 당신이 종류의 사이에 보간 볼 수 있습니다 꽤 멋진 + +824 +01:00:11,909 --> 01:00:16,689 + 다른이 아주 좋은 방법 단계와 나는 실제로 아주가 있다고 생각 + +825 +01:00:16,688 --> 01:00:21,759 + 좋은 동기 부여 여기 이유 중 하나 우리는 대각선 긴장이 선택 + +826 +01:00:21,759 --> 01:00:26,079 + 그 독립을 갖는의 확률 적 해석을 가지고 있지만, + +827 +01:00:26,079 --> 01:00:29,179 + 우리의 생활 공간에서 매우 다른 변수 + +828 +01:00:29,179 --> 01:00:33,918 + 실제로 독립적 그래서 나는 그 이유가 설명하는 데 도움이 생각해야 + +829 +01:00:33,918 --> 01:00:37,219 + 당신이 끝날 때 실제로 accys 사이의 아주 좋은 분리입니다 + +830 +01:00:37,219 --> 01:00:40,858 + 공간에있는지도에서 샘플링이 확률 적 독립성 때문이다 + +831 +01:00:40,858 --> 01:00:45,630 + 에 포함 된 가정 이전 그래서이 아이디어 이전이 매우 강력와 + +832 +01:00:45,630 --> 01:00:51,139 + 당신이 종류의 큰 일 이러한 유형의 직접 모델 그래서 II에 있습니다 + +833 +01:00:51,139 --> 01:00:54,028 + 수학의 무리를 쓴 나는 우리가 정말 통과 시간이 생각하지 않습니다 + +834 +01:00:54,028 --> 01:00:57,849 + 그것은하지만 아이디어는 훈련있어 고전 때의 일종이다 + +835 +01:00:57,849 --> 01:01:01,130 + 생식 모델 당신이 원하는 최대 우도라는이 일있다 + +836 +01:01:01,130 --> 01:01:04,608 + 모델에 따라 데이터의 우도를 최대화하고 모델을 선택하는 + +837 +01:01:04,608 --> 01:01:09,018 + 즉, 데이터가 가장 가능성이 있습니다하지만 당신은 그냥하려고하면 밝혀 곳 + +838 +01:01:09,018 --> 01:01:13,068 + 최대 우도 생식이 공정을 이용하여 통상의 사용을 실행하는 것이 우리 + +839 +01:01:13,068 --> 01:01:17,708 + 당신은 당신이 결국이 거대한로 실행보다 나이가 바로 그 문제에 대해 상상했던 + +840 +01:01:17,708 --> 01:01:21,009 + 이 거인이되는이 공동 분배를 소외 필요 + +841 +01:01:21,009 --> 01:01:24,289 + 뭔가 아니다 전체 회의 상태 공간을 통해 소녀에서 다루기 힘든 + +842 +01:01:24,289 --> 01:01:25,890 + 우리가 할 수있는 + +843 +01:01:25,889 --> 01:01:29,659 + 그래서 대신에 다양한 오디오 인코더 인코더는이 일이라고 않는 + +844 +01:01:29,659 --> 01:01:34,259 + 변분은 정말 멋진 생각이다 추론하는 그리고 수학은 경우에 여기에있다 + +845 +01:01:34,260 --> 01:01:38,150 + 당신은 그것을 통해 가고 싶어하지만 아이디어는 대신에 따라 극대화하는 것입니다 + +846 +01:01:38,150 --> 01:01:42,619 + A의 데이터 일 가능성이 영리이 추가 컨텐츠를 삽입하고 + +847 +01:01:42,619 --> 01:01:47,429 + 우리는 바로이 정확한 것입니다있어, 그래서이 두 가지 다른 용어로 그것을 깰 + +848 +01:01:47,429 --> 01:01:50,419 + 당신은 아마 자신 만이 로그에이 작업 할 수 등가물 + +849 +01:01:50,420 --> 01:01:54,710 + 가능성 우리는 우리가 팔꿈치를 호출이 용어의 관점이에서 쓸 수 있습니다 + +850 +01:01:54,710 --> 01:01:58,869 + 이 분포하고 우리는 아는 사이 칼의 차이가 다른 용어 + +851 +01:01:58,869 --> 01:02:03,029 + 두 처녀를 죽인 그래서 우리는이 처녀를 죽인 것을 알고 항상 0이다 + +852 +01:02:03,030 --> 01:02:07,120 + 분배 사이에서 우리가이 용어는 비 - 제로이어야한다는 것을 알 비 제로 + +853 +01:02:07,119 --> 01:02:12,420 + 이는이이 팔꿈치 용어는 실제로 로그에 결합 된 낮다는 것을 의미한다 + +854 +01:02:12,420 --> 01:02:16,480 + 우리의 데이터의 가능성이 아래로 작성하는 과정에서 그 통지 + +855 +01:02:16,480 --> 01:02:20,889 + 팔꿈치 우리는 우리가 같이 해석 할 수있는이 추가 매개 변수 피드를 소개합니다 + +856 +01:02:20,889 --> 01:02:25,710 + 일종의 근사되어이이 엔코더 네트워크의 매개 변수 + +857 +01:02:25,710 --> 01:02:30,909 + 그래서 지금 대신 직접 극대화하는 노력이 하드 사후 분포 + +858 +01:02:30,909 --> 
01:02:34,319 + 우리의 데이터의 로그 가능성이 대신 바로이 문제를 극대화하기 위해 노력할 것입니다 + +859 +01:02:34,320 --> 01:02:39,539 + 데이터 바인딩 및 팔꿈치 같은 저급 때문에 로그 하한 + +860 +01:02:39,539 --> 01:02:43,769 + 다음 팔꿈치 최대화 가능성도까지 상승시키는 효과를 가질 것이다 + +861 +01:02:43,769 --> 01:02:49,059 + 로그 가능성과 물건과 실제의 팔꿈치이이 두 용어 + +862 +01:02:49,059 --> 01:02:53,360 + 전면의에서이 하나가 있음이 아름다운 해석 + +863 +01:02:53,360 --> 01:02:57,849 + 레이튼 상태에 대한 기대의 잠재 상태 공간 이상이어야 + +864 +01:02:57,849 --> 01:03:01,889 + 당신이이 있다고 생각하므로 X의 확률은 잠재 상태 공간을 제공 + +865 +01:03:01,889 --> 01:03:05,559 + 말하는 실제로 데이터 재구성 기간 우리가 이상 평균 경우 그 + +866 +01:03:05,559 --> 01:03:08,789 + 우리가 무언가와 끝까지해야 가능한 모든 십팔 상태 + +867 +01:03:08,789 --> 01:03:13,639 + 우리의 원래의 데이터와 유사한이이 다른 용어는 실제로는 이것이다 + +868 +01:03:13,639 --> 01:03:17,940 + 정규화 기간이 대략 사이의 칼 발산하다 + +869 +01:03:17,940 --> 01:03:22,059 + 후부 및 이전 사이에 그래서 이것은 강제로 시도의 정규화입니다 + +870 +01:03:22,059 --> 01:03:27,019 + 당신이 대략 수있는 그 두 가지가 함께 그래서이이 첫 번째 임기에 영향을 미칠 + +871 +01:03:27,019 --> 01:03:31,590 + 뭔가 논문에서이 트릭을 사용하여 샘​​플링하여 대략적인 호출 + +872 +01:03:31,590 --> 01:03:35,600 + 모든 것이 예정되어 있기 때문에 나는 다시이 다른 용어로 얻을하지 않습니다 + +873 +01:03:35,599 --> 01:03:38,489 + 여기에 당신이 단지 숙련 출현에 대해 가능한 명시 적으로 + +874 +01:03:38,489 --> 01:03:44,509 + 그래서 나는이 그 그 종류의의 때문에 대부분의 클래스에있는 모든 슬라이드를지도 생각 + +875 +01:03:44,510 --> 01:03:50,020 + 재미 있지만, 실제로는 그래서하지만 실제로 무서운하지만 그것은 단지 사실이다 + +876 +01:03:50,019 --> 01:03:54,150 + 정확히 바로이 사분의 아이디어 우리는 재건이 후, 당신은 + +877 +01:03:54,150 --> 01:03:59,050 + 당신에게 처벌이 페널티는 질문에 뒤로 이전 이동 + +878 +01:03:59,050 --> 01:04:08,840 + 다양한 분기 해당 일반적으로 오디오 인코더의 생각대로 + +879 +01:04:08,840 --> 01:04:12,180 + 우리는 희망을 갖고 우리의 데이터를 재구성하려고하는 네트워크를 강제하려는 + +880 +01:04:12,179 --> 01:04:16,089 + 이 트리샤에 대한 데이터의 종류의 유용한 표현을 배울 것 + +881 +01:04:16,090 --> 01:04:19,470 + 우리는 변화에 이동하면 인코더의 많은이 피터 학습에 사용되지만, + +882 +01:04:19,469 --> 01:04:23,569 + 우리가 실제로 샘플을 생성 할 수 있도록 분기에 우리는이 일 환자를 확인하는 + +883 +01:04:23,570 --> 01:04:29,440 + 그럼 내 데이터에서 샘플을 생성하는이 아이디어 우리의 데이터와 유사 + +884 +01:04:29,440 --> 01:04:32,690 + 정말 멋진이며, 모든 사람이 사진의 이러한 종류의 너무보고 사랑 + +885 +01:04:32,690 --> 01:04:37,119 + 어쩌면 우리 모두가없이 정말 멋진 예제를 생성 할 수있는 또 다른 생각이있다 + +886 +01:04:37,119 --> 01:04:41,100 + 이 무서운 베이지안 수학 그리고 그것은이라는 생각이 있다고 밝혀 + +887 +01:04:41,099 --> 01:04:45,219 + 다른 생각 다른 트위스트의 일종이다 생식 적대적인 네트워크 + +888 +01:04:45,219 --> 01:04:49,799 + 즉, 여전히 데이터 만의 일종처럼 보이는 샘플을 생성 할 수 있습니다 + +889 +01:04:49,800 --> 01:04:52,560 + 좀 더 명시 적으로 이견에 대해 걱정할 필요없이 + +890 +01:04:52,559 --> 01:04:54,340 + 심판과 물건을 이런 종류의 + +891 +01:04:54,340 --> 01:04:58,920 + 아이디어는 우리가거야 발전기가 처음 우리가 있다는 것을 잘 작동하지 않은 것입니다 + +892 +01:04:58,920 --> 01:05:02,780 + 당신이 협상에서거야 아마 받고있다 일부 랜덤 노이즈로 시작하거나 + +893 +01:05:02,780 --> 01:05:07,060 + 그 다음 우리는 발전기 네트워크를 가지고거야,이 같은 + +894 +01:05:07,059 --> 01:05:11,079 + 발전기 네트워크는 실제로 매우 많은 변분의 디코더처럼 보이는 + +895 +01:05:11,079 --> 01:05:15,849 + 오디오 인코더 또는 우리가 야한다는 점에서 일반 오디오 인코더 년 하반기와 같은 + +896 +01:05:15,849 --> 01:05:20,449 + 이 랜덤 노이즈를 복용하고 우리가 될 것입니다 이미지를 보낼거야 + +897 +01:05:20,449 --> 01:05:26,379 + 우리는 단지 다음이 열차 네트워크를 사용하여 발생하는 일부 가짜하지 실제 이미지 + +898 +01:05:26,380 --> 01:05:29,410 + 우리는 또한에가는 판별 네트워크를 연결하는거야 + +899 +01:05:29,409 --> 01:05:32,679 + 이 가짜 이미지를보고이 있는지 여부를 그 여부를 결정하려고 + +900 +01:05:32,679 --> 01:05:34,769 + 생성 된 이미지는 실제 또는 가짜 + +901 +01:05:34,769 --> 01:05:38,679 + 그래서 이것은이 때문에 제 2 네트워크는 단지이 이진 분류를하고있다 + +902 +01:05:38,679 --> 01:05:42,949 + 작업은 입력을 수신하는 경우 그리고 그것은 단지 그것의 여부를 말할 필요 + +903 +01:05:42,949 --> 01:05:46,739 + 그것은 사실이나 그것이 실제 이미지인지 아닌지의 여부 그건 그냥 일종의이야 + +904 +01:05:46,739 --> 01:05:49,739 + 당신이 다른 것처럼 연결할 수 있습니다 분류 작업 + +905 +01:05:50,730 --> 01:05:55,349 + 그래서 우리는 공동으로 완전히 모든 호출이 일을 훈련 
할 수있다 + +906 +01:05:55,960 --> 01:06:01,179 + R 발전기 네트워크 랜덤 노이즈의 여러 배치를받을 것이다 어디거야 + +907 +01:06:01,179 --> 01:06:06,629 + 뱉어과 이미지를 할게요 우리 판별 네트워크는 많은을 받게됩니다 + +908 +01:06:06,630 --> 01:06:12,640 + 데이터 집합에서 부분적으로 이러한 이미지의 배치를 부분적으로 실제 이미지와 + +909 +01:06:12,639 --> 01:06:16,039 + 그것은 대답이 분류 작업을 만들기 위해 노력할 것입니다해야 할 것이다 + +910 +01:06:16,039 --> 01:06:21,358 + 진짜와 가짜있는 등이 또 다른 방법은 지금 종류의 우리가 할 수있다 + +911 +01:06:21,358 --> 01:06:25,880 + 실제 데이터없이지도 학습 문제 틱의이 종류를 연결 우리 때문에 + +912 +01:06:25,880 --> 01:06:30,390 + 까지이 일을 희망 그리고 우리는 우리가 어떤 볼 수 공동으로 기대 훈련 + +913 +01:06:30,389 --> 01:06:34,730 + 그래서이 원래 일반적으로 적대적 네트워크 종이에서 예 + +914 +01:06:34,730 --> 01:06:38,840 + 당신이 볼 수있는 발표 네트워크에 의해 생성 된 가짜 이미지입니다 + +915 +01:06:38,840 --> 01:06:41,829 + 그것은 그들이 진짜처럼 실제로 발생 가짜 가슴의 아주 좋은 일을 있어요 + +916 +01:06:41,829 --> 01:06:46,549 + 숫자와 여기를 내가 여기이 가운데 열있어 실제로 보여주고있다 + +917 +01:06:46,550 --> 01:06:50,080 + 그 자리의 트레이닝 세트의 가장 가까운 이웃이 희망을 알려합니다 + +918 +01:06:50,079 --> 01:06:53,599 + 예를 들어이 너무있다, 그래서 그냥 훈련 집합을 기억하지 않습니다 + +919 +01:06:53,599 --> 01:06:57,389 + 그냥 기억하지 그래서 약간의 점과 다음이 사람은 도트가 없습니다 + +920 +01:06:57,389 --> 01:07:01,079 + 데이터를 교육하고 또한 인식 속도의 꽤 좋은 일을 + +921 +01:07:01,079 --> 01:07:05,849 + 생성 그렇게 얼굴 그러나 당신은 기계 학습에서 일한 사람으로 알고 + +922 +01:07:05,849 --> 01:07:10,440 + 알려진 이러한 숫자와 붙여 넣기 데이터 세트를 생성 아주 쉽게하는 경향이 + +923 +01:07:10,440 --> 01:07:16,869 + 에서 샘플 우리는 RJR 샘플을하지 않는 것보다 훨씬 볼이이 작업을 적용 할 때 + +924 +01:07:16,869 --> 01:07:21,840 + 아주 보면 친절하고 깨끗한 그래서 여기 명확하게 CPR에 대한 몇 가지 아이디어를 가지고 + +925 +01:07:21,840 --> 01:07:25,108 + 블루 stock와 녹색 물건을하지만 그들은 정말처럼 보이지 않는 가치 데이터 + +926 +01:07:25,108 --> 01:07:32,429 + 실제로 시도 후속 작업 그래서 그 해당 그래서 실제 객체는 문제입니다 + +927 +01:07:32,429 --> 01:07:35,599 + 생식 적대적 네트워크에 대한 몇 가지 후속 작업을 만들기 위해 노력하고 있습니다 + +928 +01:07:35,599 --> 01:07:38,529 + 더 크고 더 강력한 이러한 아키텍처는 그렇게 희망 할 수있을 + +929 +01:07:38,530 --> 01:07:44,080 + 하나의 생각이 그래서이 더 복잡한 데이터 세트에 더 좋은 샘플을 생성 + +930 +01:07:44,079 --> 01:07:48,949 + 아이디어는 다중 스케일 처리 때문에보다 한꺼번에 영상을 생성 + +931 +01:07:48,949 --> 01:07:53,919 + 우리는 실제로 그렇게 먼저이 방법으로 여러 규모에서 우리의 이미지를 생성거야 + +932 +01:07:53,920 --> 01:07:58,170 + 우리는 침대 후 발생 소음을 수신하고 피드 생성기 일어날거야 + +933 +01:07:58,170 --> 01:08:03,670 + 낮은 해상도와 우리는 위로 그 노라의 SKYY 샘플 및 제를 적용합니다 + +934 +01:08:03,670 --> 01:08:04,200 + 발전기 + +935 +01:08:04,199 --> 01:08:08,230 + 위에 일부 델타 랜덤 잡음의 새로운 배치를 수신하고, 계산 UR + +936 +01:08:08,230 --> 01:08:12,070 + 낮은 고해상도 이미지를 다시 무슨 샘플과 과정을 반복 + +937 +01:08:12,070 --> 01:08:16,810 + 우리가 실제로 마지막으로 생성 할 때까지 여러 번 생성되는 우리의 + +938 +01:08:16,810 --> 01:08:22,219 + 최종 결과는 그래서 이것은 다시로 이전 매우 비슷한 생각입니다 + +939 +01:08:22,219 --> 01:08:25,329 + 원래 성별 다양한 영역 네트워크하거나 여러 규모에서 발생 + +940 +01:08:25,329 --> 01:08:30,199 + 동시에 여기에 훈련은 실제로 당신이 조금 더 복잡하다 + +941 +01:08:30,199 --> 01:08:35,710 + 각 규모의 판별과 그 희망을 희망 그래서 뭔가있다 + +942 +01:08:35,710 --> 01:08:39,039 + 우리는 그래서 여기에 훨씬 더 실제로이 사람에서 기차 샘플을 볼 때 + +943 +01:08:39,039 --> 01:08:43,869 + 실제로 그래서 여기에 C (510)에 클래스마다 별도의 모델을 훈련 그들은했습니다 + +944 +01:08:43,869 --> 01:08:48,599 + CPR까지 한 단지 비행기에이 적대적 네트워크를 훈련하고 볼 수 있습니다 + +945 +01:08:48,600 --> 01:08:51,460 + 그들은 그 그게 점점 그래서 실제 비행기처럼 보이기 시작하고 있다는 + +946 +01:08:51,460 --> 01:08:52,210 + 어딘가에 + +947 +01:08:52,210 --> 01:08:56,689 + 이 거의 실제 분기처럼 보이는 이들은 진짜처럼 좀보고 할 수있다 + +948 +01:08:56,689 --> 01:09:04,278 + 조류에서의 있도록 다음 해 사람들은 실제로 멀리이 다중 스케일 아이디어를 던져 + +949 +01:09:04,279 --> 01:09:09,339 + 단지 간단한 그래서 여기에 아이디어가 더 나은 더 원칙 대륙이다 사용 + +950 +01:09:09,338 --> 01:09:14,318 + 이 멀티 숙련 된 직원에 대해 잊고 그냥 사용하지 않는 사용 배치 표준을 사용한다 + +951 +01:09:14,319 --> 01:09:17,739 + 우리가 가진 한 모든 건축 제약 완전히 연결 레이어 정렬 + +952 +01:09:17,738 --> 01:09:22,759 + 가 
연습 연습과 지난 몇 년은 사람들을 사용하고 밝혀 + +953 +01:09:22,759 --> 01:09:27,969 + 그 범위 작업에 대적 정말 잘 여기 그래서 그들은 발생이 인 것 + +954 +01:09:27,969 --> 01:09:33,088 + 아주 아주 간단합니다 아주 간단합니다 아주 작은 길쌈 네트워크와 + +955 +01:09:33,088 --> 01:09:38,539 + 판별 다시 국유화하고 모든 그냥 간단한 네트워크입니다 + +956 +01:09:38,539 --> 01:09:42,180 + 이러한 다른 종과 경적 당신이이 일을를 연결하면 그들은 몇 가지를 얻을 수 + +957 +01:09:42,180 --> 01:09:47,810 + 이러한 네트워크에서 침실을 생성하므로 본 논문에서 놀라운 샘플 + +958 +01:09:47,810 --> 01:09:53,450 + 그래서이 실제로 꽤 인상적 결과 이​​들은 실제 데이터처럼 거의 + +959 +01:09:53,449 --> 01:09:57,529 + 그래서 당신은 캡처 정말 좋은 일을 끝낼 것을 알 수 있습니다 + +960 +01:09:57,529 --> 01:10:00,920 + 나쁜 거기에 같은 침실 정말 상세한 구조는 윈도우있다 + +961 +01:10:00,920 --> 01:10:07,710 + 이러한이 정말 놀라운 샘플하지만되도록 전등 스위치가있다 그것은 + +962 +01:10:07,710 --> 01:10:12,579 + 오히려 방금 생성 된 샘플보다 우리가 같은 재생할 수 있습니다 밝혀 + +963 +01:10:12,579 --> 01:10:16,260 + 실제로 인코더의 매우 문제가 많은 등의 트릭과 재생하려고 악용을 시도 + +964 +01:10:16,260 --> 01:10:16,670 + 약 + +965 +01:10:16,670 --> 01:10:21,739 + 이 이러한 적대적 네트워크가이 받고있는 사촌 때문에 회의 공간 + +966 +01:10:21,738 --> 01:10:25,579 + 노이즈 입력하고 우리는 영리 소음 주위에 이동하려고하고 그것을 넣을 수 있습니다 + +967 +01:10:25,579 --> 01:10:29,920 + 이러한 네트워크 그렇게 일례를 생성하는 것들의 형태를 변경하려고 + +968 +01:10:29,920 --> 01:10:36,050 + 우리는 그래서 여기에 왼쪽 엉덩이에 그래서 여기에 침실 사이에 보간됩니다 시도 할 수 있음 + +969 +01:10:36,050 --> 01:10:40,119 + 아이디어는 왼쪽에 이러한 이미지의 왼쪽에 우리가 그린 한 것입니다 + +970 +01:10:40,119 --> 01:10:43,550 + 생성하기 위해 사용 후 우리 잡음 분포로부터 무작위 포인트 + +971 +01:10:43,550 --> 01:10:47,690 + 이미지는 이제 오른쪽 우리는이를 수행 한 결과 우리가 생성 + +972 +01:10:47,689 --> 01:10:51,259 + 우리 잡음 분포에서 다른 임의의 지점과는를 생성하는 데 사용할 + +973 +01:10:51,260 --> 01:10:57,710 + 양측이이 두 사람이 생성 해주기 때문에 이미지가 일종의 있습니다 + +974 +01:10:57,710 --> 01:11:01,760 + 우리는 공간에서 리드 사이에서 보간하고자하는 라인과 I의 두 점 + +975 +01:11:01,760 --> 01:11:08,210 + 두 리드 배우와 그 라인을 따라 우리가 거​​의 사용을 사용 생성하고 + +976 +01:11:08,210 --> 01:11:11,859 + 발전기는 이미지를 생성하고 희망이 보간됩니다 + +977 +01:11:11,859 --> 01:11:16,439 + 최신 두 사람의 날짜와이 꽤 미친 것을 알 수있다 + +978 +01:11:16,439 --> 01:11:22,169 + 이 객실은 아주 좋은 부드러운 연속 방법으로 더 많은 명성 종류의 것을 + +979 +01:11:22,170 --> 01:11:28,020 + 침실에서 다른 당신이 한 가지 지적 할 경우이 있다는 것입니다 + +980 +01:11:28,020 --> 01:11:32,300 + 당신이 상상할 경우 아침은 실제로 좋은 낭만적 인 방법의 종류에 무슨 일이 일어나고 있는지 + +981 +01:11:32,300 --> 01:11:35,460 + 그냥이 어떤 종류의 것보다이 픽셀 공간과 같을 것이다 + +982 +01:11:35,460 --> 01:11:39,100 + 효과를 페이딩과 전혀 매우 좋아 보이지 않을 것이다 그러나 여기 당신이 볼 수 있습니다 + +983 +01:11:39,100 --> 01:11:42,690 + 실제로 이러한 것들의 모양과 색상의 지속적 종류입니다 + +984 +01:11:42,689 --> 01:11:50,119 + 또 다른 실험 있도록 아주 재미있는 다른 한쪽에서 변형 + +985 +01:11:50,119 --> 01:11:53,939 + 그들은 실제로 주위에 재생 벡터 수학을 사용하고이 문서에있는 + +986 +01:11:53,939 --> 01:11:58,069 + 이러한 네트워크가 생성 가지의 유형은 그래서 여기에 아이디어는 그들이 + +987 +01:11:58,069 --> 01:12:02,189 + 다음 노이즈 분포에서 무작위 샘플의 전체 무리를 생성 + +988 +01:12:02,189 --> 01:12:05,789 + 샘플의 전체 무리를 생성하는 발전기를 통해 그들 모두를 밀어 + +989 +01:12:05,789 --> 01:12:09,698 + 그들은 그가 자신의 인간의 지능을 사용하여 그들이 몇 가지를 만들려고 + +990 +01:12:09,698 --> 01:12:14,500 + 그 랜덤 샘플 그룹 다음과 같이 무엇에 대한 의미 론적 판단 + +991 +01:12:14,500 --> 01:12:18,050 + 이것 때문에 여기에 의미 론적 범주의 몇 가지로 + +992 +01:12:18,050 --> 01:12:21,739 + 세 가지가 될 것이라고 네트워크에서 생성 된 세 가지 이미지 + +993 +01:12:21,738 --> 01:12:25,529 + 모든 종류의 웃는 여자처럼 그 제공 인간 + +994 +01:12:25,529 --> 01:12:26,819 + 라벨 + +995 +01:12:26,819 --> 01:12:30,309 + 여기 중간에 중립 여성의 네트워크에서 3 개의 시료는 그 + +996 +01:12:30,310 --> 01:12:35,010 + 그 미소 요금에 공유 아니에요되는 것은 사람의 300 무료 샘플입니다 + +997 +01:12:35,010 --> 01:12:40,289 + 그래서이 사람들의 각각의 미소되지는 일부 잠재 상태 벡터에서 생성 된 + +998 +01:12:40,289 --> 01:12:45,729 + 그래서 우리는 평균 이런 종류의 계산하는 상태 벡터에서와 평신도 평균 다만 것 + +999 +01:12:45,729 --> 01:12:51,269 + 웃는 여자 중립 여성과 중성 남자의 평균 평가 상태 지금 한 번 + +1000 +01:12:51,270 --> 
01:12:55,220 + 우리는 우리가 어떤 벡터 연산을 할 수있는이 잠복 상태 벡터 그래서 우리가 걸릴 수 있습니다 + +1001 +01:12:55,220 --> 01:13:01,050 + 웃는 여자 중립 여자를 빼고 중성 남자 그래서 무엇을 어떻게 것 + +1002 +01:13:01,050 --> 01:13:06,070 + 당신이 당신에게 웃는 사람을 줄 것이라고 희망 있도록 제공하고이게 무슨입니다 + +1003 +01:13:06,069 --> 01:13:12,649 + 그것은 일종의의 웃는 남자 생겼 그래서이 실제로 발생 + +1004 +01:13:12,649 --> 01:13:19,199 + 즉, 우리가 사람을 걸릴 수 있습니다 우리는 또 다른 실험을 할 수있는 아주 놀라운 + +1005 +01:13:19,199 --> 01:13:25,099 + 안경 및 안경없이 남자와 안경 사람과 사람을 빼기 + +1006 +01:13:25,100 --> 01:13:31,140 + 이이 혼란 안경없이 안경 안경 여자를 추가 + +1007 +01:13:31,140 --> 01:13:38,630 + 물건이었다 그래서 어떤이이 작은 방정식은 우리에게 줄 것이다 무엇 + +1008 +01:13:38,630 --> 01:13:47,369 + 즉 그하더라도 그것 때문에 데프 폭행 꽤 미친 그래서 그 봐 + +1009 +01:13:47,369 --> 01:13:51,279 + 우리는 일종의 강제하지 않는 잠자는 공간 공간에 명시 적으로 이전이 + +1010 +01:13:51,279 --> 01:13:54,869 + 적대적 네트워크는 어떻게 든 여전히 유용 정말 좋은 내용을 관리해야 + +1011 +01:13:54,869 --> 01:13:59,960 + 이 표현은 그래서도 매우 빠르게 난 정말 멋진 있다고 생각 + +1012 +01:13:59,960 --> 01:14:04,220 + 이러한 아이디어를 모두두고 그 두 주 전에 나온 그냥 종이 + +1013 +01:14:04,220 --> 01:14:07,820 + 우리는이 강의에서 다른 생각을 많이 덮여 함께 같은과 + +1014 +01:14:07,819 --> 01:14:11,239 + 그냥 그렇게 먼저 우리는거야 함께 모든 스틱에 변화를 보자 + +1015 +01:14:11,239 --> 01:14:15,659 + 이 분기는 통상의 정렬을 갖하고 시작점으로서의 + +1016 +01:14:15,659 --> 01:14:20,130 + 동맹국의 다양한 오디오 인코더 손실 그러나 우리는 이러한 적대적인 것을보고 + +1017 +01:14:20,130 --> 01:14:24,220 + 하지 우리는 적대적인 네트워크를 가지고 왜 네트워크는 그래​​서 정말 놀라운 샘플을 제공 + +1018 +01:14:24,220 --> 01:14:29,630 + 우리가 가진뿐만 아니라 지금 그렇게하도록 변화가 분기에 우리의 + +1019 +01:14:29,630 --> 01:14:33,710 + 변화 발판 분기 우리는 또한의이이 판별 네트워크가 + +1020 +01:14:33,710 --> 01:14:35,949 + 사이의 차이를 말하려고 + +1021 +01:14:35,949 --> 01:14:40,689 + 자료 없음과 샘플 사이의 변분 오디오 인코더하지만 그건 아니에요 + +1022 +01:14:40,689 --> 01:14:47,099 + 멋진 충분한 왜 우리는 또한 알렉스 NAT를 다운로드하지 않는 한 다음이 통과 + +1023 +01:14:47,100 --> 01:14:47,930 + 두 개의 이미지 + +1024 +01:14:47,930 --> 01:14:53,730 + 알렉스 그물 원래의 이미지와 4 개의 모두 알렉스 순 특징을 추출 + +1025 +01:14:53,729 --> 01:14:59,079 + 또한 비슷한 사진 손실과 머리를 가지고 지금 이미지를 생성 + +1026 +01:14:59,079 --> 01:15:02,340 + 우리는 또한이 샘플을 생성하기를 바라고있는 판별을 당겨 + +1027 +01:15:02,340 --> 01:15:06,900 + 너무 모든 측정 및 모든 스틱 한 번와 유사한 알렉스 순 기능 + +1028 +01:15:06,899 --> 01:15:10,859 + 일이 함께 희망 당신은 바로 그래서 정말 아름다운 샘플을 얻을 것이다 + +1029 +01:15:10,859 --> 01:15:17,069 + 여기 그래서이 단지 전체 훈련을 지불하고있는 용지의 예입니다 + +1030 +01:15:17,069 --> 01:15:21,109 + 그래서 우리는 난이 이러한 사실은 꽤 멋지다 생각해야한다는 이미지 일 + +1031 +01:15:21,109 --> 01:15:26,029 + 샘플 및 심폐 소생술에 다중 스케일 샘플이 대조 경우 그 우리 + +1032 +01:15:26,029 --> 01:15:29,609 + 그 샘플이 실제로 별도의 훈련 된 기억을 위해 이전에 본 + +1033 +01:15:29,609 --> 01:15:34,380 + 클래스 당 모델 볼 화재 및이 그 아름다운 침실 샘플 당신 + +1034 +01:15:34,380 --> 01:15:35,760 + 톱을 다시했다 + +1035 +01:15:35,760 --> 01:15:40,270 + 침실에 고유이다 그러나 여기에서 실제로 교육 훈련을 하나의 모델 + +1036 +01:15:40,270 --> 01:15:45,050 + 인터넷의 모든에 아직도 이런 하나의 모델은 실제 이미지하지만 그들은이야 + +1037 +01:15:45,050 --> 01:15:50,489 + 확실히 그 내가이 생각, 그래서 진짜 문제를 찾고 이미지를 향해 점점 + +1038 +01:15:50,489 --> 01:15:54,170 + 나는 또한 생각 꽤 멋진 그냥 모든 일을 재미의 종류 그리고 + +1039 +01:15:54,170 --> 01:16:00,020 + 함께 스틱 잘하면 내가 생각하는 그게 정말 좋은 샘플을 얻을 + +1040 +01:16:00,020 --> 01:16:02,460 + 그것은 거의 우리가 만약 그렇다면 자율 학습에 대해 말을 전부 + +1041 +01:16:02,460 --> 01:16:05,460 + 어떤 질문이있다 + +1042 +01:16:07,100 --> 01:16:17,110 + 여기에 무엇을 무슨 일이야 않습니다 + +1043 +01:16:18,680 --> 01:16:23,500 + 그래 그래서 질문은 당신이 침실 공간을 상승 글을 읽고 선형 될 수 있습니다 + +1044 +01:16:23,500 --> 01:16:28,079 + 그리고 우리는 우리가있어 기억 여기에 그것에 대해 생각하는 한 가지 방법은 어쩌면이다 + +1045 +01:16:28,079 --> 01:16:30,729 + 샘플링 프로그램을 바로 잡음에서 샘플링과를 통해 전달 + +1046 +01:16:30,729 --> 01:16:35,319 + 판별보다는 제너레이터 후 발전기 갖는다 + +1047 +01:16:35,319 --> 01:16:40,630 + 바로 그런 그 좋은 방법으로 서로 다른 소리 채널을 사용하기로 결정 + +1048 
+01:16:40,630 --> 01:16:44,510 + 당신이 소음 사이에 상호 작용하는 경우는 이미지 사이에 보간 결국 + +1049 +01:16:44,510 --> 01:16:49,110 + 수 그래서 잘하면 좋은 부드러운 방법의 종류에 당신은 그것의 알고 + +1050 +01:16:49,109 --> 01:16:51,799 + 그저 실제로 싶은 연수 예를 기억하지 + +1051 +01:16:51,800 --> 01:17:00,310 + 바로 그래서 그냥 우리가 이야기 모든 것을 정리해하는 좋은 방법으로 그 일반화 + +1052 +01:17:00,310 --> 01:17:04,430 + 약 오늘 우리는 작업을위한 당신에게 정말 유용한 실용적인 팁을 많이 준 + +1053 +01:17:04,430 --> 01:17:08,470 + 동영상은 내가 당신에게 발생하는 매우 비 실용적인 팁을 많이 제공 + +1054 +01:17:08,470 --> 01:17:16,119 + 아름다운 이미지가 그래서 나는이 물건은 정말 멋진 생각하지만 난 무엇 확실하지 않다 + +1055 +01:17:16,119 --> 01:17:19,840 + 생성 된 이미지 이외의 사용하지만 그 멋진 그것은 재미와 확실히 있도록 + +1056 +01:17:19,840 --> 01:17:24,640 + 우리는 JAP 대에서 게스트 강의를해야하기 때문에 다음에 주위에 붙어 그렇다면 + +1057 +01:17:24,640 --> 01:17:27,310 + 당신은 그것을 위해 클래스에 와서 인터넷에 아마 당신은 싶어 있습니다보고있어 + +1058 +01:17:27,310 --> 01:17:31,500 + 하나는 그래서 나는 그것이 오늘날 우리가 모든 것을 생각하고 나중에 너희들을 볼 + diff --git a/captions/Ko/Lecture15_ko.srt b/captions/Ko/Lecture15_ko.srt new file mode 100644 index 00000000..ffbb057d --- /dev/null +++ b/captions/Ko/Lecture15_ko.srt @@ -0,0 +1,3432 @@ +1 +00:00:00,000 --> 00:00:03,370 + 오늘 발표됩니다 동안 것은 부분적으로 내 일이라고 지적했습니다 + +2 +00:00:03,370 --> 00:00:06,919 + 사람들에 의해 수행 다른 사람과 때때로 내가 제시하고있어 업무와 공동으로 제 + +3 +00:00:06,919 --> 00:00:10,929 + 정말 많은 많은 사람들이 있지만 공동 작업에 참여하지 않은 그룹 + +4 +00:00:10,929 --> 00:00:14,740 + 그렇게 에누리이 걸릴 회담을 통해 이름을 많이 볼 수 있습니다 + +5 +00:00:14,740 --> 00:00:20,920 + 오늘날 어디에 구글이 있었는지의 종류에 대해 그래서 나는거야 무엇을 말해 + +6 +00:00:20,920 --> 00:00:26,310 + 다른 장소 프로젝트의 많이 벗어나지 사용의 관점에서 그 + +7 +00:00:26,309 --> 00:00:30,608 + 지출 일일 A를 입력 할 때 실제로 2011 년에 시작에 관여 + +8 +00:00:30,609 --> 00:00:36,340 + 구글의 주와 나는 마이크로 부엌에서 그를 우연히 일이하고 나는 말했다 + +9 +00:00:36,340 --> 00:00:39,420 + 나도 몰라처럼 오 무슨 일을 하였다는했지만, 난 아직하지만 파악하지 않은 + +10 +00:00:39,420 --> 00:00:44,170 + 바퀴를 무시하거나 흥미와 나는 전화를 받았는데 내가 안 밝혀 + +11 +00:00:44,170 --> 00:00:49,120 + 전 원하지 않는 나이와 같은 흰개미의 병렬 교육에 조각을 이해 + +12 +00:00:49,119 --> 00:00:50,250 + 당신에게 시간을 전합니다 + +13 +00:00:50,250 --> 00:00:56,350 + 다시 최초의 흥미 진진한 기간에서 휴식을 취 항상 종류의 정말 + +14 +00:00:56,350 --> 00:01:00,660 + 계산 모델처럼 그들은 제공하지만 그 시간에이 있었다 + +15 +00:01:00,659 --> 00:01:03,599 + 우리의 수를 충분히 큰 데이터 세트를 가지고 있지 않은 것처럼 너무 일찍 조금 + +16 +00:01:03,600 --> 00:01:08,879 + 계산은 정말 그 노래와 앤드류 슬픈 0의 종류 흥미로운 일이 될 것이다 만들려면 + +17 +00:01:08,879 --> 00:01:13,579 + 훈련하지만 지금은 우리 종류의 공동 전화의 확인과 같은거야하기 + +18 +00:01:13,579 --> 00:01:20,209 + 규범 교육의 크기와 규모를 밀어 뇌 프로젝트를 시작 + +19 +00:01:20,209 --> 00:01:24,059 + 특히 우리는 큰 데이터 세트를 사용하여 정말 관심 + +20 +00:01:24,060 --> 00:01:27,890 + 많은 양의 결혼 생활에 대한 인식의 문제를 해결하기위한 경쟁 + +21 +00:01:27,890 --> 00:01:34,400 + 문제와 나는 종종 코 세라를 발견하고 종류의 단지 거리에서 읽을 + +22 +00:01:34,400 --> 00:01:39,719 + 구글하지만 그 이후로 우리는 두 종류의 재미있는 작업을 많이 해왔습니다 + +23 +00:01:39,719 --> 00:01:43,408 + 다른 도메인의 많은 연구 분야는 좋은 것들 중 하나 알고 + +24 +00:01:43,409 --> 00:01:46,859 + 많은 많은 다른 종류의 약에 상관없이 자신의 믿을 수 없을만큼 적용 할 수 없습니다 + +25 +00:01:46,859 --> 00:01:52,478 + 나는 확신 같은 문제는이 클래스에서 본 우리는 또한 생산을 배치 한 + +26 +00:01:52,478 --> 00:01:56,530 + 다른 제품의 모든 종류의 매우 다양한 우리의 매트를 사용하는 시스템 + +27 +00:01:56,530 --> 00:02:00,049 + 의 당신에게 생산 측면의 일부를 연구 몇 가지의 샘플링을 제공 + +28 +00:02:00,049 --> 00:02:04,579 + 우리의 종류를 포함하여 커버 아래에 내장 한 시스템의 일부 + +29 +00:02:04,578 --> 00:02:08,030 + 우리가하려는 않는 구현 물건 중 일부는 이러한 종류를 확인하기 위해 수행하는 + +30 +00:02:08,030 --> 00:02:12,959 + 모델의 빠른 실행하고 나는 그녀의 입에 초점을 맞출 것입니다하지만 기술이 많이 있습니다 + +31 +00:02:12,959 --> 00:02:13,349 + 더 + +32 +00:02:13,349 --> 00:02:17,699 + 당신이 전에 몇 달은 다른 종류의 많은 훈련을 할 수있는 몇 + +33 +00:02:17,699 --> 00:02:22,159 + 강화 알고리즘 또는 기계 산업의 다른 종류의 다른 종류 + +34 +00:02:22,159 --> 00:02:29,099 + 그것은 시간을 제공하는 경우 
확인 케빈 실제로 뒷면의 일부를 내 말 듣고 그게 전부 내가 하나 + +35 +00:02:29,099 --> 00:02:32,560 + 정말 우리가 함께 넣어 한 팀에 대해 좋아하는 것들을 우리가 가지고있다 + +36 +00:02:32,560 --> 00:02:36,479 + 사람들이 정말 그래서 우리가 가지고있는 전문 지식을 다른 종류의 정말 다양한 믹스 + +37 +00:02:36,479 --> 00:02:40,709 + 기계 학습 연구에서 전문가들은 제프리 힌튼 다른에게 같은 사람을 알고 + +38 +00:02:40,710 --> 00:02:45,820 + 우리는 대규모는 시스템 빌더를 배포 한 사람들은 모두 내가 가지 + +39 +00:02:45,819 --> 00:02:50,169 + 그 금형이 더 자신을 생각하고 우리가 함께 할 수있는 사람이 + +40 +00:02:50,169 --> 00:02:54,989 + 우리가 집단적으로 당신 작업 프로젝트의 일부 종종 그 기술과의 혼합 + +41 +00:02:54,990 --> 00:03:00,870 + 집합이 다른 전문 지식의 종류 당신과 함께 사람을 넣어 + +42 +00:03:00,870 --> 00:03:03,580 + 자주 모두 필요하기 때문에 아무도 개별적으로 할 수 없었다 뭔가를 + +43 +00:03:03,580 --> 00:03:09,670 + 즉 항상 그래서 대규모 시스템 사고의 종류, 기계 아이디어를 학습 + +44 +00:03:09,669 --> 00:03:13,539 + 재미와 당신은 종종 종류의 픽업과 다른 사람들로부터 새로운 것을 배우고 + +45 +00:03:13,539 --> 00:03:22,280 + 당신이 할 수있는 종류의 알 수 있도록 스크립트 개요 사실이 다시 보류에서입니다 + +46 +00:03:22,280 --> 00:03:26,080 + 어떻게 구글의 많은 걸쳐 깊은 학습을 적용되었습니다의 진행 상황을 볼 + +47 +00:03:26,080 --> 00:03:28,540 + 우리는 프로젝트를 시작하고 때와 다른 지역 일종의이다 우리 + +48 +00:03:28,539 --> 00:03:32,209 + 음성 팀 비트와 함께 공동 작업을 시작하고 시작 일부 그 일을 + +49 +00:03:32,210 --> 00:03:37,830 + 문제의 초기 컴퓨터 비전 종류의와 같은 종류의 우리는 몇 가지에 성공했다 + +50 +00:03:37,830 --> 00:03:42,770 + 다른 팀의 구글은 헤이 나도 그들과 같은 문제가 말하는 것 + +51 +00:03:42,770 --> 00:03:46,550 + 우리에게 올 것 또는 우리가 도울 수 있다고 생각 헤이 우리는 그들에게 가서 말을 + +52 +00:03:46,550 --> 00:03:50,610 + 특정 문제와 시간에 우리했습니다 종류의 점진적 그리 + +53 +00:03:50,610 --> 00:03:54,670 + 점차적으로 우리가이 적용되어 한 분야에 팀의 세트를 확장 + +54 +00:03:54,669 --> 00:03:58,539 + 문제의 종류 당신은 폭을 참조 + +55 +00:03:58,539 --> 00:04:03,689 + 그것을 같지 지역의 다른 종류 그렇게 만 컴퓨터 비전 문제입니다 + +56 +00:04:03,689 --> 00:04:08,150 + 즉 그것이 우리가 선한 성장을 계속하고 좀 좋네요 그리고 + +57 +00:04:08,150 --> 00:04:12,920 + 사물의 광범위한 스펙트럼에 대한 이유의 일부는 당신이 할 수있는 것입니다 + +58 +00:04:12,919 --> 00:04:18,229 + 정말 당신이 넣을 수 있습니다이 좋은 정말 보편적 인 시스템으로 그 생각 + +59 +00:04:18,230 --> 00:04:21,359 + 만약에 많은 입력의 다른 종류의 많은 다른 종류를 많이 얻을 + +60 +00:04:21,358 --> 00:04:22,129 + 출력 + +61 +00:04:22,129 --> 00:04:27,300 + 그들 중에서 당신은 당신이 시도 모델의하지만 약간의 차이를 알고 함께 + +62 +00:04:27,300 --> 00:04:32,270 + 일반적으로는 같은 기본적인 기술은 모든 걸쳐 꽤 잘 작동 + +63 +00:04:32,269 --> 00:04:36,990 + 다른 도메인과 난 당신에 대해 들었어요 진정한 우리의 결과를 얻을 것 + +64 +00:04:36,990 --> 00:04:40,400 + 다른 지역의 제비에서이 클래스 이제 거의 모든 컴퓨터 비전 + +65 +00:04:40,399 --> 00:04:46,219 + 문제는 이러한 일이 시작하는 음성 문제는 많은 더 많은 경우 일 수 있습니다 + +66 +00:04:46,220 --> 00:04:51,880 + 약물과 같은 과학의 다른 영역의 종류의 언어 이해 영역의 많은 + +67 +00:04:51,879 --> 00:04:54,519 + 발견은 더 나은 흥미로운 역할 모델을 가지고 시작 + +68 +00:04:54,519 --> 00:05:05,930 + 다른 것보다 그래 나는 그들이 우리가 가지 내장 한 길을 따라 좋은있어 그들처럼 + +69 +00:05:05,930 --> 00:05:10,040 + 교육에 대한 우리의 기본 시스템 소프트웨어의 두 개의 서로 다른 세대 + +70 +00:05:10,040 --> 00:05:14,640 + 자신의 입술을 배포하는 것은 처음이라고했다 불신 대해 논문을 게시 + +71 +00:05:14,639 --> 00:05:20,479 + 하여 닙 2012 그들의 장점은 제 등 실제로 확장이었다했다 + +72 +00:05:20,480 --> 00:05:23,759 + 우리가에 넣어 최초의 용도 중 하나는 내가거야 일부 자율 훈련을하고 있었다 + +73 +00:05:23,759 --> 00:05:27,319 + 교육에 16,000 과정을 사용 분에 대해 말해 그들은하지 않습니다 + +74 +00:05:27,319 --> 00:05:31,209 + 이 매개 변수가 많이 생산 사용에 적합하지만 슈퍼 아니었다 + +75 +00:05:31,209 --> 00:05:35,819 + 이 종류의 이상한 이상의 표현하기 좀 어려운처럼 연구를위한 유연한 + +76 +00:05:35,819 --> 00:05:38,949 + 표현하는 모델 강화 학습 알고리즘의 비의 종류는 어렵다 + +77 +00:05:38,949 --> 00:05:43,349 + 그리고 많은 이런 종류 이상 상하로 구동되는 방식을 가지고 + +78 +00:05:43,350 --> 00:05:48,770 + 메시지와 그것이 무슨 짓을했는지 잘 작동하지만 우리는 종류의 다시 발을 내딛었 + +79 +00:05:48,769 --> 00:05:52,639 + 조금 전에 년에 대한 우리의 두 번째 세대를 구축 시작했다 + +80 +00:05:52,639 --> 00:05:57,339 + 시스템이 우리가 1 세대 배운 것을 기반으로하는 흐르는 경향과 + +81 +00:05:57,339 --> 00:06:02,289 + 우리가 작업하고있는 오픈 소스 패키지의 다른 종류에서 배운 무엇을 + 
+82 +00:06:02,290 --> 00:06:06,620 + 재고의 불신에 좋은 기능을 많이 유지뿐만 아니라 그것을 만든 + +83 +00:06:06,620 --> 00:06:13,329 + 연구의 다양한 꽤 유연 내가 가진 그것은 오픈 소스입니다 + +84 +00:06:13,329 --> 00:06:19,120 + 정말 좋은 속성 중 하나에 대해 들어 그래서 내가 잡고 것으로 알려져있다 + +85 +00:06:19,120 --> 00:06:23,459 + 이 모두 그래프의 측면을 확장했다 사촌 특정 논문에서이 + +86 +00:06:23,459 --> 00:06:27,819 + 트레이닝 데이터 및 방법의 정확성이 증가하고 또한 신경의 크기를 스케일링 + +87 +00:06:27,819 --> 00:06:30,279 + 그물 및 방법의 정확성 증가 + +88 +00:06:30,279 --> 00:06:33,109 + 정확한 세부 사항은 중요하지 않습니다 당신은 트렌드와 수백 이러한 종류의를 찾을 수 있습니다 + +89 +00:06:33,110 --> 00:06:37,509 + 당신이 더 많은 데이터를 가지고있는 경우 그러나 논문의 정말 좋은 호텔 중 한 곳입니다 + +90 +00:06:37,509 --> 00:06:42,180 + 당신은 당신의 모델이 일반적으로 이러한 것들을 모두 사망하고 더 큰 만들 수 있습니다 + +91 +00:06:42,180 --> 00:06:47,019 + 단지 그들 중 하나를 확장보다 더 나은 당신은 순서대로 정말 큰 모델이 필요 + +92 +00:06:47,019 --> 00:06:49,810 + 더 큰에서 나타나는 미묘한 동향 캡처 종류 + +93 +00:06:49,810 --> 00:06:54,180 + 데이터 세트가 어떤 종류의 명백한 트렌드를 포착 것이다 알려진 알거나 + +94 +00:06:54,180 --> 00:06:57,370 + 당신이 필요로하는 곳에 분명 좀 패턴하지만 더 미묘한 것들 것들 + +95 +00:06:57,370 --> 00:07:04,189 + 그녀는 그를 너무 짠 것을보고 그 여분의 경우 더 큰 모델을 캡처하기 + +96 +00:07:04,189 --> 00:07:09,579 + 우리가 계산을 확장에 많은 초점을 맞출 수 있도록 더 많은 경쟁이 필요합니다 + +97 +00:07:09,579 --> 00:07:17,689 + 우리가해야 할 첫 번째 중 하나에 큰 데이터 세트에 큰 모델을 훈련 할 수 + +98 +00:07:17,689 --> 00:07:22,699 + 우리는이 프로젝트에서했던 것을 우리가 아 내가 놀랄 배우고거야 수 말했다 + +99 +00:07:22,699 --> 00:07:28,879 + 정말 중요하고 우리는 초기에 신속하게에 큰 초점을했고 다른 사람 + +100 +00:07:28,879 --> 00:07:34,870 + 우리가 임의의 당신을 자율 학습을했다면 무슨 일이 일어날 지했다 + +101 +00:07:34,870 --> 00:07:38,519 + 아이디어가 레나가 천만 임의 유튜브 프레임 하나의 수행되도록 인쇄 + +102 +00:07:38,519 --> 00:07:42,990 + 임의의 동영상의 무리에서 프레임 우리는 본질적으로 데이터를 훈련하는거야 + +103 +00:07:42,990 --> 00:07:47,418 + 레코더 모두가 그 가족 다단계 같은데 무슨 색깔입니다 알고 + +104 +00:07:47,418 --> 00:07:51,788 + 자동차 당신이 알고 인코더와 우리가 지금 이미지를 재구성하려는이 하나 + +105 +00:07:51,788 --> 00:07:54,459 + 우리는 반복 여기서 표현을 재구성하려는에 + +106 +00:07:54,459 --> 00:08:01,629 + 이 등 우리가 만육천 자동차를 사용하여 우리는에서의 GPU가 없었다 + +107 +00:08:01,629 --> 00:08:07,459 + 데이터 센터는 시간 그래서 우리는 빛이 더 많은 CPU를 던지고로 보상 우리 + +108 +00:08:07,459 --> 00:08:11,870 + 실제로 최적화를위한 분에 대해 얘기하는 싱크 귀 염 둥이를 사용 + +109 +00:08:11,870 --> 00:08:17,189 + 이것은 우리를 와서 이전했다 길쌈되지 않은 그 사촌 매개 변수를 많이했다 + +110 +00:08:17,189 --> 00:08:20,199 + 모든 분노를해야합니다 그는 또한 우리가 로컬 수용 필드를 가지고 있지만거야 말했다 있도록 + +111 +00:08:20,199 --> 00:08:24,168 + 그들은 망상가되지 않으며 별도의 표현처럼 배울 것 + +112 +00:08:24,168 --> 00:08:28,269 + 의 종류 이미지의이 부분에있는 이미지의이 부분 + +113 +00:08:28,269 --> 00:08:31,038 + 나는 그것이 실제로 흥미로운 실험이 될 거라고 생각 흥미로운 트위스트 + +114 +00:08:31,038 --> 00:08:37,330 + 이 작업을 다시 실행하지만, 길쌈 오페라 공유와 나는 서늘함의 종류에있을거야 + +115 +00:08:37,330 --> 00:08:40,590 + 어떤 경우는 표현은 그와 같은 후 아홉 층 상단을 배웠다 + +116 +00:08:40,590 --> 00:08:45,580 + 이러한 비 길쌈 로컬 수용 필드의 최상위에 $ 60,000 + +117 +00:08:45,580 --> 00:08:50,750 + 우리가 일어날 거라고 생각 것들 중 하나는 가지 배울 것입니다 + +118 +00:08:50,750 --> 00:08:54,799 + 높은 수준의 기능 픽셀 때문에 특히 인쇄에 감지기하지만, + +119 +00:08:54,799 --> 00:08:58,929 + 높은 수준의 개념을 배울 수있는 우리는 반 얼굴이었다 데이터 집합을 가지고 있었고, + +120 +00:08:58,929 --> 00:09:04,349 + 하지면을 가지고 우리는 좋은 뉴런을 위해 주위를 둘러 보았다 발견 + +121 +00:09:04,350 --> 00:09:08,120 + 하는지 여부, 이미지의 추정치 선택기 얼굴을 포함 우리 + +122 +00:09:08,120 --> 00:09:13,850 + 몇 가지 예 뉴런을 수있는 최선의 일을 볼 수있는 샘플의 일부입니다 + +123 +00:09:13,850 --> 00:09:19,610 + 당신이 보면 신경이 후 가장 흥분을 얻을 수 있다는 인한 이미지 + +124 +00:09:19,610 --> 00:09:24,240 + 주위의 원인이됩니다 어떤 자극에 대한 신경은 가장 흥분 거기에 도착 + +125 +00:09:24,240 --> 00:09:32,669 + 소름 얼굴 남자와 재미의 종류는 우리가에는 라벨이 없었다처럼 + +126 +00:09:32,669 --> 00:09:38,399 + 모든 데이터 셋의 이미지를 우리가 훈련하고 있다는이의 신경 세포 + +127 +00:09:38,399 --> 00:09:43,029 + 모델은 얼굴이 일이 내가 흥분거야 있다는 사실을 포착했다 + +128 +00:09:43,029 --> 
00:09:48,399 + 나는 그 YouTube에서 머리에서 백인 얼굴의 종류를 볼 때 우리는 또한이 + +129 +00:09:48,399 --> 00:09:55,179 + 선장이 유지되지 않은 한과 데이터 집합에 지금 고양이는 평균 얼룩 무늬 I입니다 + +130 +00:09:55,179 --> 00:10:03,019 + 그들에게 전화 한 다음 해당 자율 모델을 수 및 시작 + +131 +00:10:03,019 --> 00:10:07,659 + 이 때 특히 감독 훈련 작업은 우리 온 내가 훈련을했다 + +132 +00:10:07,659 --> 00:10:11,669 + 가장 손상 하나없는 이미지 스물 다음 천 클래스 작업 + +133 +00:10:11,669 --> 00:10:14,939 + 결과는 그 천 클래스에보고하려고하고있는 + +134 +00:10:14,940 --> 00:10:21,490 + 훨씬 더 열심히 작업의 모든 20 20,000 클래스 중 하나에서 만든 구별 + +135 +00:10:21,490 --> 00:10:26,340 + 우리는 다른 원인이 이미지의 종류에 주위를 둘러 보았다 다음 훈련 + +136 +00:10:26,340 --> 00:10:29,300 + 인기있는 노선은 그들은 매우 높은 수준에 따기있어보고 흥분하는 + +137 +00:10:29,299 --> 00:10:33,819 + 개념은 당신이 노란 꽃 전용 또는 물새 알 + +138 +00:10:34,620 --> 00:10:41,080 + 내가 좋아하는이 재교육 실제로 하드 정확도로 상태를 증가 + +139 +00:10:41,080 --> 00:10:44,080 + 시 양 특정 태스크에 + +140 +00:10:45,129 --> 00:10:50,500 + 우리는 종류의 자율 학습 때문에 대한 우리의 흥분을 잃었다 + +141 +00:10:50,500 --> 00:10:54,860 + 그래서 이놈 잘 요리를 배우고 그래서 우리는 말과 ​​함께 작업을 시작 감독 + +142 +00:10:54,860 --> 00:11:00,100 + 시 아닌 계 탄성 매트 가석방이었다 팀 + +143 +00:11:00,100 --> 00:11:06,570 + 기본적으로 백처럼 오디오 데이터의 작은 세그먼트에서 이동하려고 + +144 +00:11:06,570 --> 00:11:09,420 + 오십 밀리 초 시간은 당신이 소리에 선포되고 무엇을 예측하려고 + +145 +00:11:09,419 --> 00:11:17,809 + 중간 10 밀리 초는 그리고 우리는 완전히 층을 바꾸기로했습니다 + +146 +00:11:17,809 --> 00:11:21,879 + 유목민 접속하고 상단 만사천 시도 된 전화 중 하나를 예측 + +147 +00:11:22,549 --> 00:11:27,939 + 나는 기본적으로 매우 신속하게 훈련 할 수있는 동안 작업 가족에있어 그리고 그것은했다 + +148 +00:11:27,940 --> 00:11:31,530 + 거대한 감소 같은 온건 한 음성 팀에있는 사람들 중 하나가 말했다입니다 + +149 +00:11:31,529 --> 00:11:34,339 + 가장 큰 하나의 개선과 같은 그들의 20 년에 본 적이 있는지 + +150 +00:11:34,340 --> 00:11:47,970 + 연구와 그 안드로이드 기반 검색 시스템 2012 정도의 일환으로 시작 + +151 +00:11:47,970 --> 00:11:51,990 + 우리가 흔히하는 일 중 하나는 우리가 어떤 데이터를 많이 가지고 찾을 수있다 + +152 +00:11:51,990 --> 00:11:57,149 + 그에 대한 작업 등의 작업을하지만, 아주 많지 않은 매우 많은 데이터를 우리는 자주 + +153 +00:11:57,149 --> 00:12:02,949 + 당신에게 슬픈 멀티 태스크를 확인하고 학습 전송 시스템을 구축 + +154 +00:12:02,950 --> 00:12:09,030 + 여러 가지 방법이 그래서는 우리가 분명히 연설에서 이것을 사용하는 예를 살펴 보자 + +155 +00:12:09,029 --> 00:12:13,110 + 영어로 우리는 많은 데이터를 가지고 있고 우리는 그것을 정말 좋은 느린 단어를 가지고 또는 + +156 +00:12:13,110 --> 00:12:17,350 + 반면에, 포르투갈어있다 저하는 시간에 대해 우리는하지 않았다 + +157 +00:12:17,350 --> 00:12:21,310 + 단어 오류율이를 때까지 그 정도 훈련 오늘 우리는 $ (100) 구매를했다 + +158 +00:12:21,309 --> 00:12:27,129 + 그래서 당신이 할 수있는 첫 번째이자 가장 간단한 것들 중 하나 나쁜 더 많은 + +159 +00:12:27,129 --> 00:12:30,620 + 이는 당신이 모델을 사전에 훈련 된 찍을 때 당신이하는 일의 종류 + +160 +00:12:30,620 --> 00:12:33,509 + 및 다른 문제에 적용 촬상 우리는 많은 데이터가없는 + +161 +00:12:33,509 --> 00:12:37,610 + 당신은 그들을 완전히 무작위 밤에 의해 그 무게 훈련을 시작 + +162 +00:12:37,610 --> 00:12:41,700 + 나는 실제로 않는 경우 포르투갈어 워드 에러율을 향상하고 있지 않다 + +163 +00:12:41,700 --> 00:12:45,210 + 당신이 연설에 대해 원하는 기능의 종류에 충분한 유사성이있다 + +164 +00:12:45,210 --> 00:12:50,570 + 일반적으로는 언어에 상관없이 당신이 할 수있는 더 복잡한 것은 사실이다 + +165 +00:12:50,570 --> 00:12:55,390 + 공동 모든 언어에서 또는에서 모델에게 기업의 점유율을 훈련 + +166 +00:12:55,389 --> 00:12:56,360 + 이 경우 모든 + +167 +00:12:56,360 --> 00:13:04,680 + 모든 유럽 언어 나는 우리가 사용한 무엇을 생각하고 그래서 그들은 우리가있어 볼 수 있습니다 + +168 +00:13:04,679 --> 00:13:07,939 + 공동으로이 데이터를 훈련하고 우리가 실제로 꽤 상당한있어 + +169 +00:13:07,940 --> 00:13:13,310 + 심지어 포르투갈 모델의 단지 복사하는 일을 통해 개선하지만, + +170 +00:13:13,309 --> 00:13:17,739 + 놀랍게도 우리는 실제로 총 때문에 작은 개선이 영어를 얻었다 + +171 +00:13:17,740 --> 00:13:20,889 + 다른 모든 언어를 통해 우리는 실제로는 거의 양을 두 배로 + +172 +00:13:20,889 --> 00:13:25,399 + 교육 자료는 우리는 당신이 단지 영어에 비해 모델을 그리워 사용할 수 있었다 + +173 +00:13:25,399 --> 00:13:30,379 + 많은 일없이 그렇게 기본적으로 언어와 같은 알람이 모두 많이 향상 + +174 +00:13:30,379 --> 00:13:35,850 + 데이터의 많은 언어를 조금이라도 개선하고 우리는 있었다 + +175 +00:13:35,850 --> 
00:13:39,350 + 알아낼 하구의 언어 별 최상층 조금 조금 + +176 +00:13:39,350 --> 00:13:44,620 + 그것은 언어 별 최고 선수 일부가 힘든 만들 않는 한 나는 믿을거야 + +177 +00:13:44,620 --> 00:13:47,620 + 이들은 당신이 만들고 인간의 가이드 선택의 종류가 + +178 +00:13:48,269 --> 00:13:53,149 + 즉, 생산 음성 모델은 그 정말 간단에서 많이 참여의 + +179 +00:13:53,149 --> 00:13:57,778 + 지금 사용하는 피드 포워드 모델은 내가 마지막으로 언급 할 시간을 처리했다 + +180 +00:13:57,778 --> 00:14:02,490 + 암시의 컴파일 그래서이 매우 다른 주파수로 그들을 만들려면 + +181 +00:14:02,490 --> 00:14:06,769 + 종이 여기에 출판되었다​​ 당신은 반드시 모든 이해 할 필요가 없습니다 알고 + +182 +00:14:06,769 --> 00:14:11,459 + 상세하지만이 모델의 종류에 더 많은 복잡성은이고 그것의 + +183 +00:14:11,458 --> 00:14:15,088 + 그것은 그녀의 현재 모델과 계산 모델 훨씬 더 정교한을 사용하고 + +184 +00:14:15,089 --> 00:14:22,100 + 최근의 추세는 앨리스가 완전히 그래서 오히려 사용할 수 있습니다 충족 된 + +185 +00:14:22,100 --> 00:14:26,730 + 이러한 종류의 소요 음향 모델 후 언어 모델을 갖는보다 + +186 +00:14:26,730 --> 00:14:30,550 + 소외의 음향 모델의 출력은 다소 별도로 갈 수 + +187 +00:14:30,549 --> 00:14:34,879 + 직접 오디오 파형에서에서 문자로 성적 증명서를 생산하는 + +188 +00:14:34,879 --> 00:14:38,120 + 시간과 그게 정말 큰 트렌드가 될 것 같아 + +189 +00:14:38,809 --> 00:14:44,169 + 모두 연설에서 더 일반적으로 난방 시스템을 많이에서 당신은 종종이 + +190 +00:14:44,169 --> 00:14:49,338 + 오늘은 많은 시스템이 가지 서브 시스템 각각의 무리로 구성되어 있습니다 + +191 +00:14:49,339 --> 00:14:54,350 + 아마도 일부 그녀는 조각과 손 코드 조각의 일부 종류를 배웠다 + +192 +00:14:54,350 --> 00:14:58,000 + 나는 보통 끈적 거리는 디코드의 큰 더미가 모두 함께 접착제 및 + +193 +00:14:58,509 --> 00:15:04,600 + 별도로 개발 한 조각 장애 최적화가 종종 있지만, + +194 +00:15:04,600 --> 00:15:08,800 + 당신은에 의해 대칭의 맥락에서 서브 시스템을 최적화 오른쪽처럼 + +195 +00:15:08,799 --> 00:15:12,699 + 통계는 어떤 관심있는 마지막 작업을 위해 옳은 일을하지 않을 수 있습니다 + +196 +00:15:12,700 --> 00:15:22,370 + 올바르게 복사 할 수 있으므로 같은 훨씬 더 큰 하나의 시스템을 가지고 + +197 +00:15:22,370 --> 00:15:25,649 + 하나의 신경 애플은 끝까지 오디오 파형에서 직접 모든 길을 간다 + +198 +00:15:25,649 --> 00:15:29,929 + 목표는 당신이 처방에 관심 당신은 엔드 - 투 - 엔드를 최적화 할 수 없음 + +199 +00:15:29,929 --> 00:15:34,579 + 통해가는 중간에 손으로 쓴 많은 코드가 아니다 + +200 +00:15:34,580 --> 00:15:37,440 + 난 당신이 내가 부족 것을 볼 수 있습니다 여기에 볼 수있을 거라 생각 큰 트렌드가 될 수 있습니다 + +201 +00:15:37,440 --> 00:15:46,250 + 번역 요구의 다른 종류의 많은 그래서 사람들은 모든 대회의 우리 + +202 +00:15:46,250 --> 00:15:48,919 + 우리는 다양한 종류를 사용했던 시력 문제의 톤을 가지고 + +203 +00:15:48,919 --> 00:15:54,849 + 당신을위한 계산 모델은 길쌈 주위에 큰 흥분을 알고 + +204 +00:15:54,850 --> 00:15:59,220 + 신경망 잘 먼저 젊은 시작과 경쟁을 읽고 확인하는 것이 + +205 +00:15:59,220 --> 00:16:05,110 + 가지 잠시 가라 앉았하고 좋아 다음 알렉스 Kozinski 요 요 한모금의 은혜와 + +206 +00:16:05,110 --> 00:16:10,200 + 에서 블루에게 다른 경쟁에 불을하는 2012 년 종이에 그를 확인 + +207 +00:16:10,200 --> 00:16:16,470 + 내가 넣어 생각 비를 사용하여 이미지 순이익 2,012 도전에 물 + +208 +00:16:16,470 --> 00:16:20,500 + 모든 사람의지도에 그 일이 다시 우리는 우리가 사용되어야한다 잘 말 + +209 +00:16:20,500 --> 00:16:24,399 + 그들은 정말 잘 작동 사촌 비전이 일 내년 + +210 +00:16:24,399 --> 00:16:28,100 + 항목의 스물 스물이나 뭐 같은 당신은 알지 + +211 +00:16:28,100 --> 00:16:34,550 + 스레드는 이전에 그냥 알렉스 우리는 구글에서 사람들의 무리 했어했다 + +212 +00:16:34,549 --> 00:16:38,529 + 더 나은 일을 위해 아키텍처의 다양한 종류의보고 + +213 +00:16:38,529 --> 00:16:41,829 + 검사 아키텍처에 대한 협의는 다음과 같이 가지고 더 나은 이미지 + +214 +00:16:41,830 --> 00:16:45,889 + 모든 종류의 수 있습니다 같은 다른 크기 경쟁의 복잡한 모델 + +215 +00:16:45,889 --> 00:16:50,419 + 함께 연결된 후, 당신은 그 모델에게 시간의 무리를 복제 할 수 없습니다 + +216 +00:16:50,419 --> 00:16:51,319 + 과 + +217 +00:16:51,320 --> 00:16:55,810 + 그게 꽤 좋은 밝혀졌다에서 당신은 매우 깊은 알려진 끝낼 + +218 +00:16:56,789 --> 00:17:01,870 + 조건은 어떤 것과 약간의 추가 및에 약간의 변화가있었습니다 + +219 +00:17:01,870 --> 00:17:07,740 + 그것은 훨씬 더 정확한 당신처럼 내가 당신이 그런 식으로 약간을 보았다 알고 있어야합니다 + +220 +00:17:07,740 --> 00:17:17,120 + 좋아 그래서 II가 게을 렀다는 수잔 만 내가 이야기 폴더 일에서 내 슬라이드를했다 + +221 +00:17:17,119 --> 00:17:19,549 + 그 라벨에 앉아 앙드레 대한 이야기 + +222 +00:17:19,549 --> 00:17:26,559 + 확인 안드레이는 자신이 이미지를 그 대회를 관리하는 데 도움이되었다 결정에 서명 그는 + +223 
+00:17:26,559 --> 00:17:31,269 + 앉아서 자신을 훈련 교육 훈련 200 시간을 가하지 것 + +224 +00:17:31,269 --> 00:17:38,099 + 나도 몰라 오스트레일리아 양치기 개에서이 같은 힘든 분할 및 및 + +225 +00:17:38,099 --> 00:17:41,449 + 예 나는 그것을 할 수있는 실험실 동료 중 하나를 설득 할 수 있지만 지능 없었다 + +226 +00:17:41,450 --> 00:17:45,309 + 이미지에 대한 교육의 백 이십시간에 대해 가열 + +227 +00:17:45,980 --> 00:17:52,380 + 그의 연구실은 그가 5.1 %의 오류가있어 있도록 12 시간 일 후 피곤받을 수 있습니다 + +228 +00:17:52,380 --> 00:17:55,380 + 제작 내가 12 %를 생각있어 + +229 +00:17:56,269 --> 00:18:12,918 + 인간의 오류하지만 비없이 심하게 모든 주말 + +230 +00:18:12,919 --> 00:18:19,690 + 뒤쪽에 백십이시간 나중에 어쨌든 여기에 좋은 무엇이든 + +231 +00:18:19,690 --> 00:18:23,220 + 그것에 대해 블로그 게시물 나는 그가 매개 변수를 많이 가지고 확인해 보시기 바랍니다 + +232 +00:18:23,220 --> 00:18:34,279 + 당신이 많은 201 일이 80000000000000 연결을 알고 같은 전형적인 인간은 + +233 +00:18:34,279 --> 00:18:37,918 + 파라미터 소수의 모델에 잘 맞는 이들 모델에 대한 지적 + +234 +00:18:37,919 --> 00:18:43,440 + 내 모바일 장치에서 난 휴대 전화 있지만 일반에 다소 맞지 않도록 + +235 +00:18:43,440 --> 00:18:47,029 + 앙드레 이외의 추세는 알렉스에 비해 매개 변수의 작은 숫자처럼 + +236 +00:18:47,029 --> 00:18:52,509 + 대부분 알렉스 그물 상단이이 두 거대한 완전히 연결 층 같이했다 + +237 +00:18:52,509 --> 00:18:57,000 + 그것이 함께 거대한하지만 매개 변수를 많이 이상은 가지 얻을 멀리했다 + +238 +00:18:57,000 --> 00:19:02,220 + 대부분의과가 그래서 그들은 당신이 매개 변수의 작은 번호를 알고 사용했지만 + +239 +00:19:02,220 --> 00:19:07,829 + 이다 사용 조성 매개 변수 표시 당 부동 소수점 연산 + +240 +00:19:07,829 --> 00:19:12,379 + 좋은 펀드에 이르렀 우리는 텐서 흐름의 일부로 출시 + +241 +00:19:12,380 --> 00:19:18,549 + 당신이 사설에 대한 거기에 사용할 수있는 재시도 채택 모델을 업데이트 + +242 +00:19:18,548 --> 00:19:24,089 + 우리가하지 않은 군사 유니폼을 생각하지만, 그것은 크리스 하퍼가 + +243 +00:19:24,089 --> 00:19:29,859 + 그들이있어 이러한 모델에 대한 좋은 것들을 몹시 부정확 한 + +244 +00:19:29,859 --> 00:19:32,589 + 아주 세밀한 상담을하고 정말 좋은 나는 것들 중 하나라고 생각 + +245 +00:19:32,589 --> 00:19:35,959 + 이 컴퓨터 모델은 훨씬 실제로 있다는 안드레스 블로그입니다 + +246 +00:19:35,960 --> 00:19:40,880 + 개 그러나 인간이다의 구별 정확한 품종에서 사람보다 더 나은 + +247 +00:19:40,880 --> 00:19:42,179 + 더 나은 + +248 +00:19:42,179 --> 00:19:49,150 + 자주 레이블이 탁구 공이며 경우가 있다면 당신이 알고있는 작은을 따기 + +249 +00:19:49,150 --> 00:19:52,190 + 탁구 인간을 재생하는 거대한 고위 사람들처럼 그에서 더 낫다 + +250 +00:19:52,829 --> 00:20:00,250 + 모델은 당신이있는 모델을 훈련하면 더 많은 픽셀 일에 집중하는 경향이 + +251 +00:20:00,250 --> 00:20:01,109 + 데이터의 오른쪽 종류 + +252 +00:20:01,109 --> 00:20:05,019 + 이러한 장면은 모두 아무것도 보지 동안 일반화 알고 있지만 실제로 당신 + +253 +00:20:05,019 --> 00:20:08,690 + 우리는 모두 당신의 훈련 데이터를 사랑 나를로 표시거야 알고 잘 표현된다 + +254 +00:20:08,690 --> 00:20:14,710 + 그들은 그것이 뱀이 아니다 좀 더 구십 허용 실수를하지만, + +255 +00:20:14,710 --> 00:20:19,230 + 난 그냥 그런 말을하고 왜 이해하고 난 개가 아니에요 알고 있지만 실제로 I + +256 +00:20:19,230 --> 00:20:25,190 + 전면 동물이이 당나귀가되고 난 경우 신중하게 생각했다 + +257 +00:20:25,190 --> 00:20:27,490 + 완전히 확실하지 아직도 + +258 +00:20:27,490 --> 00:20:37,900 + 모든 투표는 이렇게 생산 중 하나는 우리가 모델 kiryas의 이러한 종류를 넣었습니다 사용합니다 + +259 +00:20:37,900 --> 00:20:42,850 + 구글 사진 검색 그래서 우리는 Google 사진 제품을 출시하면 검색 할 수 있습니다 + +260 +00:20:42,849 --> 00:20:46,539 + 방금 바다 입력 한 모든 얘기하지 않고 업로드 한 사진 + +261 +00:20:46,539 --> 00:20:51,639 + 갑자기 올리버 바다 포토샵의 모든 그래서 예를 들어이 사용자에 대한 + +262 +00:20:51,640 --> 00:20:56,870 + 공개적으로 게시 헤이 나는 스크린 샷 헤이 나는이 걸릴하지 않았다 게시 + +263 +00:20:56,869 --> 00:21:04,879 + 부처님의 동상이 때문에 힘든 것을 알고 도시 운전을위한 나타났다 + +264 +00:21:04,880 --> 00:21:09,520 + 대부분의 Utahns에 비해 질감이 많이 그래서 우리는 꽤 기쁘게 생각있어 + +265 +00:21:09,519 --> 00:21:18,339 + 우리가보다 구체적인 다른 종류의 종류의 많은이 대 식세포 등을 검색 + +266 +00:21:18,339 --> 00:21:21,730 + 우리는 우리의 스트리트 뷰에서하고 싶은 것들을 본질적으로 같은 비주얼 작업 + +267 +00:21:21,730 --> 00:21:25,819 + 세계에서이 차 운전자의 이미지와 모든 도로의 사진을 촬영 + +268 +00:21:25,819 --> 00:21:29,609 + 거리 장면과 그 다음 우리는 우리가 찾을 수있는 모든 텍스트를 읽을 수 있어야합니다 + +269 +00:21:29,609 --> 00:21:34,909 + 그래서 먼저 당신이 원하는 우선의 텍스트와 잘 하나를 찾을 수있다 + +270 +00:21:34,910 
--> 00:21:39,720 + 이렇게하면 읽기처럼 싶은 것을 전 모든 주소와지도와 달을 찾을 수 있습니다 + +271 +00:21:39,720 --> 00:21:43,829 + 당신이 우리가을하는 모델이없는 것을 볼 수 있도록 다른 모든 텍스트 + +272 +00:21:43,829 --> 00:21:47,799 + 예측 꽤 좋은 직장 그 화소가 포함되는 픽셀 수준 + +273 +00:21:47,799 --> 00:21:53,819 + 텍스트 여부와 꽤 잘하지 + +274 +00:21:53,819 --> 00:21:58,289 + 잘 우선 훈련 데이터에 세금을 많이 발견은 여러 종류가 있었다 + +275 +00:21:58,289 --> 00:22:03,019 + 이렇게 표시되는 문자는 더 문제가 한자를 인식하지 않았다 + +276 +00:22:03,019 --> 00:22:08,569 + 영어 문자는 꽤 잘처럼 수행 로마 라틴 문자이다 + +277 +00:22:08,569 --> 00:22:12,889 + 세금의 두 개의 서로 다른 글꼴과 크기와 그들 중 몇 가지의 서로 다른 색상 + +278 +00:22:12,890 --> 00:22:17,200 + 이다 매우 멀리 떨어져 카메라에 매우 근접하고 난 그냥이었고,이 데이터입니다 + +279 +00:22:17,970 --> 00:22:24,809 + 텍스트의 조각 주위에 단지 인간의 표지 그린 다각형에서 그리고 그들은 + +280 +00:22:24,809 --> 00:22:27,809 + 다음을 전사하고 우리가 OCR 모델을 우리 또한 인쇄 + +281 +00:22:30,880 --> 00:22:34,500 + 우리는 가지 점차적으로 우리가 출시 제품의 다른 종류를 출시했습니다 + +282 +00:22:34,500 --> 00:22:39,799 + ATI의의 클라우드 비전은 당신이 의미 라벨 이미지 같은 것들을 많이 할 수 있습니다 + +283 +00:22:39,799 --> 00:22:44,859 + 반드시 싶어하지 않는 사람 또는 방법을 기계 학습 전문 I에 대한 + +284 +00:22:44,859 --> 00:22:48,349 + 단지 종류의 당신이 말을 알고에 가고 싶어 이미지와 함께 멋진 물건을하고 싶지 + +285 +00:22:48,349 --> 00:22:54,990 + 그들은 OCR을 수행하고 모든 이미지에서 과세 찾을 것 같았다 실행하는 경우에만 있음 + +286 +00:22:54,990 --> 00:22:58,650 + 당신은 기본적으로 자전거 토론토 CRM 라벨 비상 사태를 부여 업로드 + +287 +00:22:58,650 --> 00:23:03,820 + 이 사람에게가는 경우이 이미지의 생성과 꽤 행복했다 + +288 +00:23:03,819 --> 00:23:06,689 + 그 + +289 +00:23:06,690 --> 00:23:10,220 + 내부적으로 사람들이 사용하는 방법의 창조적 사용을 생각하고있다 + +290 +00:23:10,220 --> 00:23:13,600 + 컴퓨터 비전 정말 실제로 본질적으로 지금 컴퓨터 비전 정렬 + +291 +00:23:13,599 --> 00:23:19,819 + 오년에 비해 작품 전에이 일이 우리의 우리의 우리 지역 팀입니다 + +292 +00:23:19,819 --> 00:23:23,250 + 기본적으로 어떤 프로세스와 위성 이미지를 함께 넣어 출시 + +293 +00:23:23,250 --> 00:23:28,740 + 그 여러 위성 뷰에서 지붕의 기울기를 예측하는 방법 + +294 +00:23:28,740 --> 00:23:32,769 + 국가 당신은 당신이 여기에 몇 개월마다 새로운 위성 이미지를 알고있다 싶습니다 + +295 +00:23:32,769 --> 00:23:36,099 + 우리가 같은 위치의 여러 전망이 때까지 우리는 무엇을 예측할 수있다 + +296 +00:23:36,099 --> 00:23:40,109 + 지붕의 기울기가 같은 위치의 모든 다른 견해를 제공하고, + +297 +00:23:40,109 --> 00:23:43,589 + 나가 다음 당신이 인 경우에 당신이 알고 예측하는 방법 많은 태양 노출 + +298 +00:23:43,589 --> 00:23:48,490 + 당신이 점점 생성 할 수 얼마나 많은 에너지 어쨌든 태양 전지 패널을 설치 + +299 +00:23:48,490 --> 00:23:53,930 + 좀 당신이 당신이 할 수있는 작은 임의의 물건이 아닌 비전처럼 알고 냉각 + +300 +00:23:53,930 --> 00:24:03,160 + 작동 확인이 클래스는 내가 거 얘기 해요 그래서 대부분은 대부분의 비전에 대해되었습니다 + +301 +00:24:03,160 --> 00:24:08,029 + 이제 언어 이해 같은 문제의 다른 종류에 대한 가장 중 하나 + +302 +00:24:08,029 --> 00:24:16,779 + 중요한 문제는 분명히 검색 그래서 우리는 수술에 대한과에 많은 관심입니다 + +303 +00:24:16,779 --> 00:24:20,700 + 내가 판매에 대한 쿼리 자동차 부품을한다면 내가 확인하고 싶은 특정의 어떤 + +304 +00:24:20,700 --> 00:24:25,400 + 이 두 문서는 관련성 그리고 당신은 단지의 서비스 형태를 보면 + +305 +00:24:25,400 --> 00:24:28,019 + 첫 번째 문서는 무척 관련 보이는 것을 말씀 + +306 +00:24:28,019 --> 00:24:34,609 + 단어를 많이처럼 발생은 autorad 실제로 두 번째 문서는 많이 있습니다 + +307 +00:24:34,609 --> 00:24:41,189 + 관련성이 주어 우리는 어떻게 그래서 이해할 수 있도록하고 싶습니다 + +308 +00:24:41,190 --> 00:24:47,269 + 당신이에 대해 알 수 있도록 많은 당신이 멋진 모델을 포함 이야기가 + +309 +00:24:47,269 --> 00:24:47,879 + 의료진 + +310 +00:24:47,880 --> 00:24:54,680 + 피고인을 포함하는 것은 그래서 나는 빨리 갈 것입니다하지만 기본적으로 당신이 원하는합니다 + +311 +00:24:54,680 --> 00:24:58,200 + 드문 드문 높은 차원 일에 단어 나 사물을 표현 + +312 +00:24:58,200 --> 00:25:03,559 + 조밀 한 경우 몇 백 차원 11,000 차원 공간 때문에로 매핑 + +313 +00:25:03,559 --> 00:25:11,440 + 이제 서로 가까이와 유사한이 일을 가질 수 + +314 +00:25:11,440 --> 00:25:15,029 + 의미에 대한 너무 높은 차원 공간에서 서로 가까이 끝날 것 + +315 +00:25:15,029 --> 00:25:17,769 + 예를 들어 당신은 돌고래와 돌고래는 매우 가까이 서로 될 수 있습니다 + +316 +00:25:17,769 --> 00:25:20,099 + 그들은 매우 유사 단어이고 어떤을 가지고 있기 때문에 높은 차원 공간 + +317 +00:25:20,099 --> 00:25:23,099 + 회의 
그들은 목적 동시에 공유 + +318 +00:25:24,909 --> 00:25:27,420 + 그래 + +319 +00:25:27,420 --> 00:25:32,620 + 와 씨월드 당신은 가지 근처의 수와 카메론의 부모는 꽤 멀리로 + +320 +00:25:32,619 --> 00:25:39,069 + 당신이 하나를 현대화 포함 훈련 할 수는 최초 인 것입니다 + +321 +00:25:39,069 --> 00:25:42,519 + 당신이로 스팀에서, 심지어 간단한 GET을 공급 할 때 당신이 할 일 + +322 +00:25:42,519 --> 00:25:47,859 + 일은 나의 전 동료 너무 많은 니켈 오프가 될 해낸 기술이다 + +323 +00:25:47,859 --> 00:25:51,969 + 이 모델을 만들기 위해 단어라고 어디 본질적으로 대한 출판 종이 + +324 +00:25:51,970 --> 00:25:55,870 + 아마 스무 단어가 왜 당신이를 선택 않았다 기본적으로 당신은 단어의 창을 선택 + +325 +00:25:55,869 --> 00:26:00,119 + 워드 센터 및 다음 삽입을 사용하려고 할 경우 다른 랜덤을 선택 + +326 +00:26:00,119 --> 00:26:06,419 + 그 중심 단어의 표현은 그 희망을 훈련 할 수있는 사람을 예측하는 + +327 +00:26:06,420 --> 00:26:11,230 + 배경으로는 기본적으로 당신은 그 무게의 근육 플렉스 분류를 조정하고 + +328 +00:26:11,230 --> 00:26:17,190 + 결과적으로 당신은 역 전파을 통해 당신은 거의 조정 + +329 +00:26:17,190 --> 00:26:20,830 + 그 중심 단어의 표현을 포함하는 다음 번에 ​​할 수 있습니다 있도록 + +330 +00:26:20,829 --> 00:26:25,919 + 더 나은 자동차에서 단어 부분을 예측하고 실제로 오른쪽처럼 작동 + +331 +00:26:25,920 --> 00:26:29,930 + 을 선동에 대한 정말 좋은 것 중 하나는 충분한 교육을 제공됩니다 + +332 +00:26:29,930 --> 00:26:34,070 + 이들은 그래서 당신이 단어의 정말 놀라운 무기 비전을 가지고 않았다 + +333 +00:26:34,069 --> 00:26:39,759 + 어휘 이러한 세 가지 다른 단어 나 문구에 대한 가장 가까운 이웃 + +334 +00:26:39,759 --> 00:26:44,319 + 호랑이 상어에이 특정 항목 당신은 자신의 11 삽입 생각할 수 + +335 +00:26:44,319 --> 00:26:48,480 + 이들 벡터는 가장 가까운 이웃은 그 중심을 가지고 말 + +336 +00:26:48,480 --> 00:26:55,529 + 이 검색에 유용 이유를 볼 같은 선명도 자동차 흥미 권리가있다 + +337 +00:26:55,529 --> 00:27:01,000 + 당신이 일이 있기 때문에 사람들은 종종 코딩 정보 검색을 건네 것을 + +338 +00:27:01,000 --> 00:27:07,079 + 복수형 및 형태소 분석 등의 간단한 동의어 어떤 종류의 같은 시스템 만 + +339 +00:27:07,079 --> 00:27:10,750 + 여기에 그는 단지 오 내가 자동차 자동차 픽업 트럭 경주 용 자동차를 알고 같은 듯 + +340 +00:27:10,750 --> 00:27:15,470 + 여객 자동차 대리점의 종류 당신은 그냥이이이 참조 관련 + +341 +00:27:15,470 --> 00:27:19,200 + 바로 자동차의 칼 케네스 부드러운 표현의 개념이 아닌 + +342 +00:27:19,200 --> 00:27:26,509 + 명시 적으로 만 후자는 우리의 경기를 관람하며 밝혀 당신 경우 + +343 +00:27:26,509 --> 00:27:29,980 + 방향으로 판명 단어 AVEC 접근 방식을 사용하여 훈련 + +344 +00:27:29,980 --> 00:27:35,730 + 의미 있고 정신적 인 공간이 너무뿐만 아니라 근접 재미 있지만, 방향이다 + +345 +00:27:35,730 --> 00:27:38,730 + 흥미로운 그래서 당신이 보면 그것은 밝혀 + +346 +00:27:39,720 --> 00:27:43,860 + 자본과 국가 쌍 가고 + +347 +00:27:43,859 --> 00:27:47,288 + 거의 같은 방향과 거리가 대응하는 국가에서 얻을 수 + +348 +00:27:47,288 --> 00:27:56,029 + 파리, 당신은 또한 당신이 볼 수있는 국가 자본에 대한 자본 또는 그 반대의 경우도 마찬가지 + +349 +00:27:56,029 --> 00:27:59,298 + 다른 구조의 일부 외관이 묻어 두 아래지도입니다 + +350 +00:27:59,298 --> 00:28:05,889 + 그래서 주성분 분석을 치수 당신은 종류의 참조 + +351 +00:28:05,890 --> 00:28:12,788 + 동사의 주위에 흥미 구조는 회사에 관계없이 시제하는 + +352 +00:28:12,788 --> 00:28:18,210 + 당신은 여왕처럼 유추을 수행하여뿐만 아니라 아저씨 사람이 부패되어 해결할 수 있음을 의미 + +353 +00:28:18,210 --> 00:28:21,279 + 몇 가지 간단한 사실 연산은 말 그대로 그냥 삽입보고있는 말 + +354 +00:28:21,279 --> 00:28:26,029 + 차이를 추가 벡터 한 다음 그 지점에 거의 도착 + +355 +00:28:26,029 --> 00:28:35,269 + 그래서 우리는 우리가 가지 시작 검색 팀과 공동으로 봤는데 포인트 + +356 +00:28:35,269 --> 00:28:40,668 + 우리라는 지난 몇 년 동안에서 가장 큰 검색 순위 중 하나가 변경 + +357 +00:28:40,669 --> 00:28:44,640 + 그것은 단지 깊은 알고 본질적으로 데려 울렸다하지만 묻어 및 사용 + +358 +00:28:44,640 --> 00:28:50,059 + 선수의 무리가이 문서는이 것이 얼마나 중요한에 대한 점수를주고 + +359 +00:28:50,058 --> 00:28:51,730 + 특별한 + +360 +00:28:51,730 --> 00:28:58,308 + 그것은 수백 중 기차 여행 마일 세 번째 가장 중요한 + +361 +00:28:58,308 --> 00:29:07,259 + 그 소위 스마트 응답은 Gmail 팀과 함께 약간의 협력이 있었다이었다 + +362 +00:29:07,259 --> 00:29:11,259 + 기본적으로 휴대 전화에 메일을 회신 종류의 사촌 입력이 어렵다 짜증 + +363 +00:29:11,259 --> 00:29:16,429 + 그래서 우리는 당신이 될 것이라고 예측할 수 종종 시스템을 가지고 싶어 + +364 +00:29:16,429 --> 00:29:21,900 + 좋은 응답이 바로 메시지를 찾고 그래서 우리는 소규모 네트워크가이 예측이 + +365 +00:29:21,900 --> 00:29:26,970 + 수있는 가능성이 내가 볼 수있는 짧은 간결한 응답을 
가질 수 뭔가를 할 수 있음 + +366 +00:29:26,970 --> 00:29:30,380 + 당신이 그들을 묻는다면 나는 훨씬 더 큰 활성화 + +367 +00:29:30,380 --> 00:29:35,409 + 모델이 제 동료 중 하나가 프로젝트를 수신하는 메시지입니다에서 + +368 +00:29:35,409 --> 00:29:37,720 + 그의 동생은 그는 우리가 이른 추수 감사절 우리와 함께 당신을 초대하고자했다 + +369 +00:29:37,720 --> 00:29:43,220 + 가능성이 밥 우리는​​ 그래서 다음 모델 좋아하는 요리 RCP 다음 주에 봤는데 + +370 +00:29:43,220 --> 00:29:48,100 + 백작을 예측하고있을 것 또는 죄송는 그것을 만들 수 없습니다 + +371 +00:29:49,660 --> 00:29:54,810 + 이메일을 많이받을 경우 응답이 될 것입니다하지만 좋은 그것은 환상적이다 + +372 +00:29:54,809 --> 00:29:58,169 + 좋은 그들 중 다소 저주 + +373 +00:30:02,250 --> 00:30:07,329 + 당신이 실제로 모바일 앱처럼 우리가 흥미있는 일을 할 수있어 + +374 +00:30:07,329 --> 00:30:11,779 + 비행기 모드에서 실행 그래서 실제로 전화 모델을 실행중인 그것은이다 + +375 +00:30:11,779 --> 00:30:19,430 + 실제로 당신이 본질적으로있어, 그래서 완전히 실현 흥미로운 것들을 많이 가지고 + +376 +00:30:19,430 --> 00:30:25,670 + 단어가 무엇인지 발견하여 텍스트를 검출하는 카메라 영상을 이용하여 + +377 +00:30:25,670 --> 00:30:28,830 + 다음 여기에 OCR을 수행 할 수있는 번역 모델을 통해 그것을 실행 + +378 +00:30:28,829 --> 00:30:31,980 + 이것에 대해, 특히 그것을 그림은 서로 다른 언어를 순환한다 + +379 +00:30:31,980 --> 00:30:38,779 + 그러나 일반적으로 당신은 스페인어 돈에 설정 거라고하지만 스페인어하지만 일이 보여 + +380 +00:30:38,779 --> 00:30:43,460 + 점에서 흥미로운 재미 선택 문제 등이 실제로있다 실현 + +381 +00:30:43,460 --> 00:30:49,210 + 내가 그렇게 종류의 당신이 있다면 좋은 호출하면 출력을 보여주고 싶은 것을 선택 + +382 +00:30:49,210 --> 00:30:50,410 + 여행 + +383 +00:30:50,410 --> 00:30:55,590 + 흥미로운 장소는 사실은 그래서 내가 찾고 있어요 오전 한국 불굴 갈거야 + +384 +00:30:55,589 --> 00:31:04,549 + 그들은 우리가하는 일이 너무 일하지 않는 나의 번역기를 사용하여 전달 + +385 +00:31:04,549 --> 00:31:09,000 + 보험을 줄일 수 있습니다에 약간의 작업 비용 아무것도처럼보다 더있다 + +386 +00:31:09,000 --> 00:31:15,789 + 와우 내 모델이 큰 너무 굉장이 느낌은 그냥 슬픈 꿈이야 + +387 +00:31:15,789 --> 00:31:18,309 + 독일 내 휴대 전화의 배터리 + +388 +00:31:18,309 --> 00:31:22,769 + 또는 당신은 내가 당신은 내가 계속 알고 그것을 실행하는 유혹을 감당할 수있어 + +389 +00:31:22,769 --> 00:31:27,039 + 많이있다, 그래서 내 데이터 센터의 당신은 내가 기계를받은 경우에도 + +390 +00:31:27,039 --> 00:31:31,720 + 당신은 예를 들어 특히 간단한 싶어 뉴스에서 사용할 수있는 트릭 + +391 +00:31:31,720 --> 00:31:39,430 + 심지어 훨씬 낮은 정밀 계산 단의 일반적으로 훨씬 더 관대 + +392 +00:31:39,430 --> 00:31:44,120 + 훈련은 지금까지 프랑스에서 우리는 일반적으로 우리가 얻을 수있는 모든 방법을 양자화 할 수 있습니다 발견 + +393 +00:31:44,119 --> 00:31:48,319 + 더 적은 조금 너무 그녀 야 좋은 품질하지만 저렴한 통해 당신은 거래를하고 싶습니다 + +394 +00:31:48,319 --> 00:31:52,139 + 정말 당신은 prolly의 육을 할 수 있지만 그 많은 도움이되지 않습니다 + +395 +00:31:52,140 --> 00:31:57,930 + 그 매개 변수를 저장하기에 좋은 외환 메모리 감소처럼를 제공하고 + +396 +00:31:57,930 --> 00:32:01,850 + 또한 CPU 벡터를 사용할 수 있습니다 사촌 경쟁 효율을 제공 + +397 +00:32:01,849 --> 00:32:08,809 + (24) 대신 1:30 곱하지만 왜 설명은 갑자기 당신에게있어 + +398 +00:32:08,809 --> 00:32:13,879 + 모바일에서 더 많은 효율을 얻기의 귀엽 더 이국적인 방법의 종류에 대한 + +399 +00:32:13,880 --> 00:32:14,310 + 전화 + +400 +00:32:14,309 --> 00:32:19,169 + 제프리 힌튼의 세포 소기관 내가 근무하는 기술이라고 증류 + +401 +00:32:19,170 --> 00:32:24,910 + 그래서 생각에 당신은 정말 거대한 모델에게 난 그냥 설명하는 문제가 + +402 +00:32:24,910 --> 00:32:30,660 + 어쩌면 이들의 앙상블과 함께 정말 기쁘게이 환상적인 모델 당신에게 + +403 +00:32:30,660 --> 00:32:36,430 + 지금 당신은 그래서 여기에 작은 싼 모델에서 거의 같은 배우를 원한다 + +404 +00:32:36,430 --> 00:32:41,480 + 당신의 거대한 비싼 모델 동일한 의제는 환상적인 준다 공급 + +405 +00:32:41,480 --> 00:32:47,630 + 예측은 좋아한다. 
95 재규어 나는 확신하고 나는 그것이 아니다 확실히 확신 + +406 +00:32:47,630 --> 00:32:48,530 + 자동차 + +407 +00:32:48,529 --> 00:32:57,769 + 당신을 위해 10-4 차 창은 내가 잘 그래서 그건 사자가 될 수 침대로 향하고 있어요 + +408 +00:32:57,769 --> 00:33:02,900 + 나중에 불행하게도 우리를 메인 아이디어를 말해 정말 정확한 모델을 무엇을 + +409 +00:33:02,900 --> 00:33:07,380 + 2006 년 부자 카루 아나는 논문에서 비슷한 생각을 게시 발견했다 + +410 +00:33:07,380 --> 00:33:13,310 + 당신의 거대한 정확한 모델 구현을위한 앙상블 소위 모델 압축 + +411 +00:33:13,309 --> 00:33:18,669 + 입력 - 출력에서​​이 흥미로운 기능은 사실을 잊어 버린 경우 이렇게 + +412 +00:33:18,670 --> 00:33:22,720 + 거기에 몇 가지 구조 그리고 당신은 정보를 사용하려고하는 것이 + +413 +00:33:22,720 --> 00:33:27,500 + 그는 우리가있는 지식을 전달하는 방법을 그 함수에 포함 된 것 + +414 +00:33:27,500 --> 00:33:30,730 + 작은에 정말 정확한 기능 + +415 +00:33:30,730 --> 00:33:36,339 + 함수의 의도는 그래서 당신은 당신이 무엇 일반적으로 모델을 훈련 할 때 + +416 +00:33:36,339 --> 00:33:40,740 + 당신 위업이 같은 이미지입니다 다음은 주요하려고하는 대상으로 제공 + +417 +00:33:40,740 --> 00:33:47,109 + 당신은 내가거야 다른 대상에게 그것을 하나의 재규어 랜드 로버 모든 것을 제공 + +418 +00:33:47,109 --> 00:33:52,819 + 그래서 하드 타겟이 모델에 노력하고 이상적인의 종류의 것을 호출 + +419 +00:33:52,819 --> 00:33:56,298 + 달성하고 당신은 수천 또는 수백만의 수백을 알고 그것을 제공 + +420 +00:33:56,298 --> 00:34:00,918 + 드라이브의 이미지를 훈련하는 것은 모든 요인에 근접하는 + +421 +00:34:00,919 --> 00:34:05,160 + 실제 사실에 차이가 꽤하지 않습니다 그는 당신이 좋은를 제공 사촌 + +422 +00:34:05,160 --> 00:34:09,990 + 다른 클래스를 통해 다른 이미지를 통해 공개 확률 분포 + +423 +00:34:09,989 --> 00:34:17,579 + 같은 결혼은 그래서 우리의 거대한 비싼 모델과 중 하나를 수행 할 수 있도록하기위한 + +424 +00:34:17,579 --> 00:34:22,079 + 우리가 할 수있는 일이 우리가 실제로이의 분포를 부드럽게 할 수있다 + +425 +00:34:22,079 --> 00:34:30,940 + 제프리 힌튼 어두운 지식을 부르는하지만 당신은이 작업을 부드럽게 경우 + +426 +00:34:30,940 --> 00:34:34,500 + 당신이있을 수 있습니다에 기본적으로 온도에 의해 모든 물류 단위를 분할 + +427 +00:34:34,500 --> 00:34:38,820 + 5 ~ 10 개 무엇인가 당신은이의 부드러운 표현을 얻을 + +428 +00:34:38,820 --> 00:34:44,159 + 당신은 재규어에 괜찮 말할뿐만 아니라 좀 회피 확률 분포 + +429 +00:34:44,159 --> 00:34:48,950 + 에 대한 작은 여전히​​ 전화 소 어쩌면 작은 사자의 비트를 호출 + +430 +00:34:48,949 --> 00:34:56,878 + 그것은 확실히 자동차와 그 당신이 다음 년 수 뭔가이 + +431 +00:34:56,878 --> 00:35:00,139 + 가을 분포에 대한 이미지에 대해 더 많은 정보를 만들어 + +432 +00:35:00,139 --> 00:35:04,429 + 이 큰 앙상블 앙상블에 의해 구현되는 기능에 노력하고있다 + +433 +00:35:04,429 --> 00:35:08,169 + 침대 머리는 곧 당신에게 확률 확률을주는 정말 좋은 일을 + +434 +00:35:08,170 --> 00:35:15,559 + 해당 이미지를 통해 배포 그래서 당신은 보통의 작은 모델을 학습 할 수 있습니다 + +435 +00:35:15,559 --> 00:35:19,070 + 당신은 하드 목표를 교육 훈련 대신에 당신은에 훈련 할 때 + +436 +00:35:19,070 --> 00:35:25,640 + 하드 타겟 플러스 소프트 목표와 교육의 조합 + +437 +00:35:25,639 --> 00:35:32,089 + 목표는 거 매트 매트 시도해야하는 두 가지 중 어떤 기능 때문에 + +438 +00:35:32,090 --> 00:35:37,579 + 이것은 우리가 큰 연설에서했던 실험을 그래서 여기 놀라 울 정도로 잘 작동이다 + +439 +00:35:37,579 --> 00:35:42,039 + 모델은 그래서 우리는 모델에 따라 그의 친구의 분류 58.9 %로 시작 + +440 +00:35:42,039 --> 00:35:46,190 + 제대로 그것은 우리의 큰 정확한 모델 그리고 지금 우리는 그 끔찍한를 사용하는거야 + +441 +00:35:46,190 --> 00:35:50,829 + 작은 모델의 부드러운 목표를 제공하기 위해 그들은 또한 하드를 보게 + +442 +00:35:50,829 --> 00:35:57,690 + 대상 및 우린 기차있어 그 데이터의 3 % 그래서 함께 새로운 모델 + +443 +00:35:57,690 --> 00:36:04,599 + 부드러운 목표 정확도 57 %가 그냥 하드 목표입니다 거의 것을 유지 + +444 +00:36:05,210 --> 00:36:12,800 + 크게 이상 44.5 % 정확한 적합하고 매우 부드러운 목표는 정말 남쪽으로 이동 + +445 +00:36:12,800 --> 00:36:17,700 + 정말 좋은 정례화하고, 다른 것은입니다 주식 목표 때문에 + +446 +00:36:17,699 --> 00:36:21,739 + 너무 많은 정보들을 비교 한 단 하나의 하나의 상상 당신을 발생 + +447 +00:36:21,739 --> 00:36:27,889 + 훈련을 훨씬 더 빨리 당신은에 대한 짧은 일주일 등에 그 정확성에 도착 + +448 +00:36:27,889 --> 00:36:33,358 + 시간은 꽤 좋은 있다고 당신은 빛 건조에이 방법을 수행 할 수 있습니다 + +449 +00:36:33,358 --> 00:36:37,889 + 앙상블에 대한 하나의 크기 모델에 낮잠 앙상블 당신은 큰에서 수행 할 수 있습니다 + +450 +00:36:37,889 --> 00:36:45,269 + 작은 일에 병 다소 과소 평가 기술은 확인 보자 + +451 +00:36:45,269 --> 00:36:51,980 + 그래서 우리가했던 것들 중 하나는 우리는 밀가루의 톤을 구축에 대해 생각했다 때 + +452 
+00:36:51,980 --> 00:36:56,309 + 우리는 가지 더 알고 다시 단계를했고 우리가 정말 당신이 무엇을 말했다 + +453 +00:36:56,309 --> 00:36:59,259 + 당신이 다른 많은 것들을 할 수 있도록 시스템을 조회 할 그것을 가지의 + +454 +00:36:59,260 --> 00:37:04,740 + 당신이 정말로 관심있는 것들을 정말 한 일의 모든 I 그러나 하드 균형 + +455 +00:37:04,739 --> 00:37:08,489 + 몇 연구원이 중 하나에 대해 내가 싶어 식은를 취할 수 + +456 +00:37:08,489 --> 00:37:12,589 + 이전 연구 아이디어와 그것을 밖으로 시도 + +457 +00:37:15,119 --> 00:37:37,219 + 대신 천 폭 완전히 연결 계층 같은 상당히 작았 + +458 +00:37:37,219 --> 00:37:43,409 + 그것은 600 실제로 큰 차이가 500 y를하지만 확인 같았다 + +459 +00:37:43,409 --> 00:37:51,399 + 자세한 것은 그 종이 아마 misremembered 권리를 해요 그리고 당신은 원하는 + +460 +00:37:51,400 --> 00:37:55,490 + 많은 당신이 할 수 있도록하려면 빨리 실행을 연구 아이디어를 취할 수 + +461 +00:37:55,489 --> 00:38:00,689 + 두 데이터 센터와 좋은 아이폰에 아마 그것을 실행하는 것은 재현 할 수 있도록 + +462 +00:38:00,690 --> 00:38:04,269 + 일이 당신은 생산 시스템에 좋은 연구 아이디어에서 가고 싶어 + +463 +00:38:04,269 --> 00:38:10,730 + 다시 및 다른 시스템 필요없이 그 방법을 우리 종류 주의의 + +464 +00:38:10,730 --> 00:38:15,659 + 일이 우리는 당신이있어 등으로 역류 오​​픈 소스에게 그것을 웬들 링 고려했다 + +465 +00:38:15,659 --> 00:38:25,519 + 첫 감정가요 유의 부드러운 유동의 코어 비트 우리 그래서 + +466 +00:38:25,519 --> 00:38:30,769 + 그것은 많은 다른 실행 휴대용 다른 장치의 개념이 + +467 +00:38:30,769 --> 00:38:34,340 + 운영 체제는 우리가 상단에 다음이 핵심 그래픽 솔루션 엔진과이 + +468 +00:38:34,340 --> 00:38:37,700 + 그 중 우리는 다른 친구는 대회의 종류를 표현했다가 + +469 +00:38:37,699 --> 00:38:41,819 + 수행하려는 우리는 C ++ 친구가 대부분의 사람들은에있는 사용하지 않는 내 + +470 +00:38:41,820 --> 00:38:45,700 + 마음 우리는 당신의 확인 대부분은 더 그렇게 아마있는 밧줄 친구 야했다 그들은 + +471 +00:38:45,699 --> 00:38:49,339 + 대부분의 남자를 착용해야하지만 퍼팅에서 사람들을 방지 아무것도 없다하지 않습니다 + +472 +00:38:49,340 --> 00:38:55,750 + 몇 가지 작업이 때문에 다른 언어 나는 중립 상당히 언어가되고 싶어 + +473 +00:38:55,750 --> 00:38:58,269 + 거기에 전 친구를 넣어 진행 + +474 +00:38:58,269 --> 00:39:03,980 + 다른 언어의 종류 당신은 싶어 그 모델을 취할 수 및 실행해야 + +475 +00:39:03,980 --> 00:39:09,440 + 다른 플랫폼의 매우 다양한 기본 계산 모델 + +476 +00:39:09,440 --> 00:39:12,710 + 나는 당신의 개요이 얘기 얼마나 모르는 땅이다 + +477 +00:39:12,710 --> 00:39:17,179 + 열 좀 확인 가장자리 또는 입찰에 대한을 따라 흐름이 그래프 것들 때문에 + +478 +00:39:17,179 --> 00:39:25,469 + 달리으로 프록터와 같은 기본 유형으로 임의과 차원 배열 + +479 +00:39:25,469 --> 00:39:29,269 + 실제로이에 머물 것이 순수한 데이터 흐름 모델은 crassly 당신은 일이 + +480 +00:39:29,269 --> 00:39:33,219 + 교구와 같은 변수 후 다시 작업을 업데이트 한하는 + +481 +00:39:33,219 --> 00:39:37,019 + 전체 그래프는 몇 가지 계산을 통해 시스템 상태를 일 일이 갈 수 있습니다 + +482 +00:39:37,019 --> 00:39:45,329 + 구배 후 바이어스 조정은 기울기 그래프에 기초가 통과 + +483 +00:39:45,329 --> 00:39:50,809 + 단계의 시리즈는 한 가지 중요한 단계는의 전체 무리를 주어 결정하는 + +484 +00:39:50,809 --> 00:39:55,670 + 실행에 서로 다른 각각의 우리를있는 연산 장치와 맥그래스 + +485 +00:39:55,670 --> 00:40:01,369 + 예를 들어 계산의 그래프 측면에서 노드가 여기에 우리가 CPU가있을 수 있습니다 및 + +486 +00:40:01,369 --> 00:40:06,650 + 파란색과 I의 GPU 카드와 녹색과 우리는 같은에서 그래프를 실행할 수 있습니다 + +487 +00:40:06,650 --> 00:40:13,160 + 방법이 그 경쟁이 정도로 실제로 GPU에서 발생 비록 + +488 +00:40:13,159 --> 00:40:17,259 + 옆이 배치 결정은 종류 우리는 사용자가 그에게 제공 할 수의 까다로운 있습니다 + +489 +00:40:17,260 --> 00:40:22,760 + 가이드는이 비트와 반드시 하드없는 힌트가 주어 + +490 +00:40:22,760 --> 00:40:26,750 + 새로운 검은 장치에 있지만 제약은해야 같은 수 있습니다 + +491 +00:40:26,750 --> 00:40:33,300 + 정말 GPU에서이 작업을 실행하거나 작업 일곱에 배치하려고 난 상관 없어 무엇을 + +492 +00:40:33,300 --> 00:40:40,200 + 다음 장치 및 우리는 기본적 그래프 피사체 시간을 최소화 할 + +493 +00:40:40,199 --> 00:40:44,159 + 우리가 서로에 사용할 수있는 메모리 같은 다른 제약 조건의 모든 종류의 유지 + +494 +00:40:44,159 --> 00:40:51,199 + 당신은 CPU에서 카터 나는 실제로와 집에서 사용하는 재미있을 거라고 생각 + +495 +00:40:51,199 --> 00:40:54,639 + 당신이 실제로 여기에 목표를 측정 할 수 있기 때문에 학습 일부 강화 + +496 +00:40:54,639 --> 00:40:58,759 + 나는이 쪽지를 배치하면 당신을 알고이 이런 식으로 방법이 노트에 알려진 + +497 +00:40:58,760 --> 00:41:02,500 + 빨리 내 그래프이고, 나는 그것이 꽤 흥미 보강있을 거라고 생각 + +498 +00:41:02,500 --> 00:41:02,929 + 배우기 + +499 
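[Editor's sketch] The passage above walks through TensorFlow's dataflow graph and its placement step: every node is assigned to a device, and the user supplies hints or constraints rather than an exact mapping. A sketch of what that looks like in the TF 1.x-style graph API (via the compat module in current TensorFlow); `allow_soft_placement=True` is what turns a device request into a hint rather than a hard requirement:

```python
import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()

with tf1.device('/cpu:0'):                        # keep input handling on the CPU
    x = tf1.placeholder(tf.float32, [None, 784])
with tf1.device('/gpu:0'):                        # request the heavy math on GPU 0
    w = tf1.get_variable('w', [784, 10])
    y = tf1.matmul(x, w)

config = tf1.ConfigProto(allow_soft_placement=True,   # fall back if no GPU exists
                         log_device_placement=True)   # print where each op landed
with tf1.Session(config=config) as sess:
    sess.run(tf1.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [[0.0] * 784]}).shape)   # (1, 10)
```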
+00:41:02,929 --> 00:41:09,139 + 문제 13 우리는 전송을 삽입 한 후 물건을 배치하는 의사 결정을 통해 만든 + +500 +00:41:09,139 --> 00:41:12,500 + 기본적으로 모든 통신 시스템을 캡슐화 노드를받을 + +501 +00:41:12,500 --> 00:41:16,800 + 그래서 기본적으로는 노드 전송이 한 장소에서 다른 답변을 이동하려면 + +502 +00:41:16,800 --> 00:41:21,200 + 그들은 더 검사를받지 그들이 때까지 종류의 단지 텐서에 개최 + +503 +00:41:21,199 --> 00:41:26,669 + 정말에 대한 데이터를 사랑하고 당신은 십자가의 모든 가장자리에 대해이 작업을 수행 + +504 +00:41:26,670 --> 00:41:32,150 + 장치 경계 및 수신 전송 파리의 다른 의미를 가지고 + +505 +00:41:32,150 --> 00:41:36,220 + 장치에 따라서 예를 들어, GPU는 동일한 방법에 있으면 볼 + +506 +00:41:36,219 --> 00:41:39,779 + 기계 종종있을 하나의 GPU 메모리에서 직접 우리의 DNA를 수행 할 수 있습니다 + +507 +00:41:39,780 --> 00:41:44,410 + 그들은 시스템에서 다른 시스템에있어, 당신이 경우 RBC 네트워크는 수도 + +508 +00:41:44,409 --> 00:41:50,868 + 그냥 직접 도달 할 때 사용하는 네트워크 및 I 케이스에서 지원 RDMA + +509 +00:41:50,869 --> 00:41:56,920 + 남부 기계와 당신이 할 수있는 신용의 남부 GPU 메모리에 + +510 +00:41:56,920 --> 00:42:00,210 + 아주 쉽게 새로운 운영 및 대령을 정의 + +511 +00:42:00,210 --> 00:42:06,920 + 당신은 일반적으로 실행할 수있는 그래프를 실행하는 방법을 이러한 인터페이스는 그 본질적 + +512 +00:42:06,920 --> 00:42:10,940 + 한 번 그래프를 설정하고 우리가 가지 가질 수 있도록 당신은 많은 실행 + +513 +00:42:10,940 --> 00:42:17,068 + 시스템은 원하는 본질적 방법에 대해 최적화 많은 의사 결정을 할 + +514 +00:42:17,068 --> 00:42:22,199 + 그것을 만들 않는 등 몇 가지 실험을 할 아마도 더 후 경쟁을 배치 없습니다 + +515 +00:42:22,199 --> 00:42:26,068 + 더 감각으로부터 중복을 광고 할 수 있습니다 여기 여기이기 때문에 그것을 넣어 + +516 +00:42:26,068 --> 00:42:30,969 + 저자 브라이언은 단일 프로세스 구성 모든 실행 하나를 호출 + +517 +00:42:30,969 --> 00:42:35,509 + 과정을 그리고 그것은 단지 종류의 간단한 절차는 분산 환경에서 호출이야 + +518 +00:42:35,510 --> 00:42:38,440 + 노동자의 무리는이 클라이언트 프로세스는 마스터 프로세스이고 그 + +519 +00:42:38,440 --> 00:42:43,608 + 나는 서브 그래프를 실행하려면 같은 장치와 마스터 톤의 클라이언트가 + +520 +00:42:43,608 --> 00:42:47,568 + 마스터는 내가 처리 할 얘기가에 그들에게 싶었 의미 괜찮 말한다 + +521 +00:42:47,568 --> 00:42:54,808 + 당신이 실제로 데이터에 공급할 수 물건을하고 그것이 내가 종류의를 가질 수 있음을 의미합니다 + +522 +00:42:54,809 --> 00:42:59,619 + 더 복잡한 그래프 그러나 나는 단지 내가 만하면 원인 그것의 작은 비트를 실행해야 + +523 +00:42:59,619 --> 00:43:05,440 + 계산에 대한 부분을 실행하는 내내 출력 우리의 + +524 +00:43:05,940 --> 00:43:14,940 + 필요에 우리가 이것을 확장 할 수있는에 많은 초점 이야기를 기반으로 + +525 +00:43:14,940 --> 00:43:19,099 + 분산 환경 실제로 우리의 가장 큰 것 중 하나 때 우리가 처음 열려 + +526 +00:43:19,099 --> 00:43:23,210 + 일주일 동안 소스 센터는 꽤 오픈 소스 모바일 떨어져 조각하지 않았다 + +527 +00:43:23,210 --> 00:43:28,269 + 이 전화 번호 (23)를 가지고하는 방법 있도록 분산 구현은 좋았다 + +528 +00:43:28,269 --> 00:43:33,259 + 헤이 분산 버전의 우리의 릴리스의 일처럼 이내에 제출 + +529 +00:43:33,260 --> 00:43:39,839 + 그게거야 더 좋은, 그래서 우리는 처음 출시 된 지난 목요일했다 + +530 +00:43:39,838 --> 00:43:43,619 + 포장하지만 순간에 당신의 종류 및 여러 프로세스를 구성 할 수 있습니다 + +531 +00:43:43,619 --> 00:43:48,710 + 다른 프로세스의 이름 그는 우리가있어 관련 IP 주소의 중요성입니다 + +532 +00:43:48,710 --> 00:43:55,150 + 내가 주 더 앞으로 몇이야하지만 그건 좋은 및 거 패키지 + +533 +00:43:55,150 --> 00:43:59,250 + 그것을 가지고 온 이유는 훨씬 더 나은 처리 시간을 할 것입니다 + +534 +00:43:59,250 --> 00:44:05,889 + 실험은 모드 훈련 및 실험에 있다면 그렇게 + +535 +00:44:05,889 --> 00:44:09,769 + 반복 당신이 있다면 정말 정말 좋은 분 또는 몇 시간의 종류 + +536 +00:44:09,769 --> 00:44:15,159 + 한 달보다 종류의 희망처럼 더 같은 여러 주 모드 + +537 +00:44:15,159 --> 00:44:19,279 + 당신이 당신은 일반적으로 작업을 수행 할 또는 당신이 할 경우, 당신은 할 나의 여행 오 ​​같은거야 + +538 +00:44:19,280 --> 00:44:26,130 + 왜 우리가 정말 우리 그룹에서 많이 강조 다시 그래서 그냥되는 것이 했는가 + +539 +00:44:26,130 --> 00:44:31,269 + 합리적으로 빨리 실험을 할 수있는 사람을 만들 수 + +540 +00:44:33,920 --> 00:44:39,250 + 그래서 두 가지 일이 우리는 내가 대해 얘기하자 문제 속에서 우리의 모델 평행선을 + +541 +00:44:39,250 --> 00:44:46,588 + 모두 당신이 조금 또는 확인이 이야기 한 당신이 할 수있는 가장 좋은 방법은 이렇게 + +542 +00:44:46,588 --> 00:44:52,279 + (9) 교육 시간을 단축하는 것은 그래서 정말 좋은 중 하나를 시간을 중지 감소 + +543 +00:44:52,280 --> 00:44:56,329 + 속성 대부분의 노트북을 많이하고 고유의 병렬 권리 등 많이있다 + +544 +00:44:56,329 --> 00:44:59,329 + 당신이 계산 모델에 대해 생각하면 
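[Editor's sketch] The passage above covers the send/receive nodes inserted at every cross-device edge and the distributed setup: a client, a master, and a set of worker processes configured by job name and address. A sketch of how that configuration looked in the distributed release's API; the hostnames are placeholders, so this will not actually run without real machines behind those addresses:

```python
import tensorflow as tf

tf1 = tf.compat.v1

# Name the processes and their addresses (placeholders here).
cluster = tf1.train.ClusterSpec({
    'ps':     ['ps0.example.com:2222'],
    'worker': ['worker0.example.com:2222', 'worker1.example.com:2222'],
})

# Each process starts one server and is told which role it plays.
server = tf1.train.Server(cluster, job_name='worker', task_index=0)

# Pin variables to the ps job and ops to this worker; the framework then
# inserts the send/receive pairs at the cross-device edges for you.
with tf1.device(tf1.train.replica_device_setter(cluster=cluster)):
    w = tf1.get_variable('w', [784, 10])
```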
병렬 많이있다 + +545 +00:45:00,539 --> 00:45:04,119 + 각 층의 모든 공간 위치는 거의 무관하므로 + +546 +00:45:04,119 --> 00:45:06,280 + 당신은 단지 그들 주위에 실행할 수 있습니다 + +547 +00:45:06,280 --> 00:45:10,680 + 다른 장치에 병렬로 문제가 통신하는 방법을 알아낼 수있다 + +548 +00:45:10,679 --> 00:45:17,889 + 같은 방법으로 그 계산을 배포하는 것은 당신을 경우 죽이지 않는 방법 + +549 +00:45:17,889 --> 00:45:21,389 + 도움이 사람이 길쌈 신경 같은 지역의 전도성 당신 생각 + +550 +00:45:21,389 --> 00:45:25,299 + 매트는 일반적으로 오에 의해처럼 찾고이 좋은 특성을 가지고 + +551 +00:45:25,300 --> 00:45:31,070 + 오 그 아래의 데이터를 패치하고 다른 아무것도 신경 세포가 필요하지 않습니다 + +552 +00:45:31,070 --> 00:45:35,289 + 그것은 그것을 위해에 필요한 데이터와 중복 훨씬로서 그 옆에 + +553 +00:45:35,289 --> 00:45:41,099 + 타워 사이 거의 또는 전혀 연결을 통해 먼저 신경 UCAV 타워 그래서 + +554 +00:45:41,099 --> 00:45:46,179 + 마다 몇 층 당신은 약간의 의사 소통 수도 있지만 대부분은 당신이 동의하지 않습니다 + +555 +00:45:46,179 --> 00:45:50,399 + 종이 그래서 기본적으로 대부분 우연히 두 개의 별도의 시간을 한 것으로 한 + +556 +00:45:50,400 --> 00:45:55,880 + 다른 CPU에 GPU가와에 대한 처벌은 때때로 몇 가지 정보를 교환 + +557 +00:45:55,880 --> 00:45:59,220 + 당신은 몇 가지 예를 들어 모델 매력적인 여자의 전문 부품을 얻을 + +558 +00:45:59,219 --> 00:46:06,759 + 당신은 그냥 순진하게있을 때 그래서 병렬을 악용 할 수있는 방법이 많이있다 + +559 +00:46:06,760 --> 00:46:10,630 + 아마 이미 GCC 또는 무언가 그 많은 행렬 곱셈 코드를 컴파일 + +560 +00:46:10,630 --> 00:46:16,880 + 인텔의 CPU 점수에 명령 병렬 선물을 활용 + +561 +00:46:16,880 --> 00:46:23,420 + 당신은 통신 기기에서 스레드 영웅주의와 가지 방법을 사용할 수 있습니다 + +562 +00:46:23,420 --> 00:46:27,760 + 종종 꽤 당신에게 한정 학대자 사이에 30 ~ 40 배처럼이 + +563 +00:46:27,760 --> 00:46:31,950 + 로컬 팀 구성원에 더 밴드 여행 당신은 다른 좋아 할 수있는 + +564 +00:46:31,949 --> 00:46:36,750 + 동일한 시스템에서 GPU 카드 메모리와 시스템에서 일반적으로 아래도 + +565 +00:46:36,750 --> 00:46:41,519 + 더 나쁜 종류의 당신이 할 수있는 지역의 많은 데이터를 유지하기 때문에 매우 중요하고 + +566 +00:46:41,519 --> 00:46:48,159 + 당신이있어입니다 기본 개념에 너무 많이하지만 모델 평행선을 먹고 피 + +567 +00:46:48,159 --> 00:46:51,929 + 그냥 어떻게 든 아마 계산 모델을 분할하는 것 + +568 +00:46:51,929 --> 00:47:01,710 + 특히 층으로하고, 예를 들어이 경우이 어쩌면 층 등 + +569 +00:47:01,710 --> 00:47:05,730 + 내가해야 할 유일한 통신이 경계 당신이 몇몇을 알고있다 + +570 +00:47:05,730 --> 00:47:09,039 + 청원의 데이터가 해당 파티션의 입력에 필요한 가지고 일하지만, + +571 +00:47:09,039 --> 00:47:16,949 + 대부분 모든 당신이 속도에 사용할 수있는 다른 기술 지역입니다 + +572 +00:47:16,949 --> 00:47:21,419 + 컨버전스는 일부 데이터 병렬 처리가 다른 많은 사용하려고하는 경우입니다 + +573 +00:47:21,420 --> 00:47:24,608 + 동일 모델 구조의 복제본 그들은 모든 협력거야 + +574 +00:47:24,608 --> 00:47:30,949 + 매개 변수를 잡고 일부는 서버의 공유 세트에 있도록 업데이트 매개 변수 + +575 +00:47:30,949 --> 00:47:36,629 + 상태 속도 향상은 속도 모델의 종류에 많은 10-40 X이 될 수 의존 + +576 +00:47:36,630 --> 00:47:42,720 + 모든에 대해 같은 정말 큰 묻어 450 복제본 스파 스 모델을 + +577 +00:47:42,719 --> 00:47:44,769 + 인간에게 알려진 어휘 단어 + +578 +00:47:44,769 --> 00:47:48,469 + 대부분의 업데이트는 업데이트 사촌 일반적으로 더 많은 병렬 처리를보고 할 수 없습니다 + +579 +00:47:48,469 --> 00:47:53,129 + 매립 항목의 소수 문장은 10와 같은 독특한 단어를 가지고 있습니다 + +580 +00:47:53,130 --> 00:47:57,630 + 만 밖으로 당신은 수백만을 가질 수 있고, 수백만의 수천 + +581 +00:47:57,630 --> 00:48:03,088 + 기본 개념 및 데이터 병렬 처리는 당신이 그래서 일을 많이하고 복제본 + +582 +00:48:03,088 --> 00:48:07,019 + 서로 다른 모델 복제본 유지하는 중앙 집중식 시스템을 거 가지고있다 + +583 +00:48:07,019 --> 00:48:10,519 + 단지 하나의 기계와 아마 많이하지 않을 수 있습니다 매개 변수 추적 + +584 +00:48:10,519 --> 00:48:16,338 + 기계의 당신은 때로는 모든 유지하기 위해 네트워크 대역폭을 많이 필요로하기 때문에 + +585 +00:48:16,338 --> 00:48:19,900 + 그래서이 모델 복제 표준 매개 변수 당신은 우리의 큰 설정에서 알 수 있습니다 + +586 +00:48:19,900 --> 00:48:24,950 + 내 뒤에 (27) 기계 후 중지 된 것을 당신은 당신이있을 수 있습니다 알고 + +587 +00:48:24,949 --> 00:48:29,259 + 다섯 거기 모델의 복제본과 모든 모델 복제하기 전에 + +588 +00:48:29,260 --> 00:48:34,430 + 그것은 당신에게 백을 좋아 말한다, 그래서 그거야 매개 변수를 잡아 일치하지 않습니다 + +589 +00:48:34,429 --> 00:48:39,179 + 및 스물일곱 기계는 나에게 매개 변수를 제공 한 후는 않습니다 + +590 +00:48:39,179 --> 00:48:44,289 + 미니 배지 주변의 조합 동의 것이기 때문에 그것을해야한다 + +591 +00:48:44,289 --> 00:48:47,869 + 매개 
변수 서버 라우터로 다시 분해 시간의 속도에 적용되지 않습니다 + +592 +00:48:47,869 --> 00:48:52,829 + 서버는 그 후 이전에 현재의 매개 변수 값을 업데이트 + +593 +00:48:52,829 --> 00:48:58,039 + 다음 단계 우리는 같은 일이 정말로 집중에 따라 네트워크했다 당신의 + +594 +00:48:58,039 --> 00:49:01,690 + (가) 매우 많은 매개 변수가없는 여기에 도움이 모델로있는 모델 일 + +595 +00:49:01,690 --> 00:49:06,068 + 대회는 그 점 엘라는 그런 점에서 표준화 정말 멋지다 + +596 +00:49:06,068 --> 00:49:11,250 + 당신은 재사용보다 본질적이기 때문에 모든 매개 변수는 너무 시간을 잠글 + +597 +00:49:11,250 --> 00:49:16,929 + 당신은 이미 당신은 그러나 더 큰 배치 크기를 알고 사용하는 것은의 모델에 + +598 +00:49:16,929 --> 00:49:20,088 + 당신이 백을 사용할 수 있습니다 통해 자녀의 228는 거 압력을 가지고있어와 + +599 +00:49:20,088 --> 00:49:23,900 + 경기의 모든 열에 대한 스물여덟 시간 만은 길쌈을 + +600 +00:49:23,900 --> 00:49:28,970 + 모델은 지금 넌 아마 같은 $ 10 재사용의 추가 팩터를 얻을 수있어 + +601 +00:49:28,969 --> 00:49:30,019 + 다른 위치 + +602 +00:49:30,019 --> 00:49:34,769 + 층에 당신은 당신이 백을 풀다 경우를 분석 오후를 사용하는 거라고 + +603 +00:49:34,769 --> 00:49:41,460 + 시간은 그냥 그 종류를 줄이기 위해 그것을 백 번을 다시 사용할 수 있습니다 단계 + +604 +00:49:41,460 --> 00:49:47,220 + 모델이 물건의 정렬 계산을 많이 적은 매개 변수가 + +605 +00:49:47,219 --> 00:49:50,109 + 박사의 경쟁은 일반적으로 잘 작동과 평행 한 것 + +606 +00:49:50,110 --> 00:49:57,340 + 환경은 지금 당신이 그렇게 그 작업을 수행하는 방법에 따라 명백한 문제가있다 + +607 +00:49:57,340 --> 00:50:00,720 + 당신이 할 수있는 한 가지 방법은 완전하게 비동기 적으로 모든 모델 복제본입니다 + +608 +00:50:00,719 --> 00:50:05,459 + 미니 배지 난방을하는 루프에 앉아 매개 변수를 설정 + +609 +00:50:05,460 --> 00:50:09,210 + 복사가 그것을 보내고 당신은 비동기 적으로 다음 그라데이션을 그렇게 할 경우 + +610 +00:50:09,210 --> 00:50:13,710 + 계산하여이 매개 변수는 곳에 대해 완전히 부패 할 수있다 + +611 +00:50:13,710 --> 00:50:17,030 + 지금 현재이 매개 변수 값에 자신의 뒤쪽에 그것을 계산하지만, + +612 +00:50:17,030 --> 00:50:20,810 + 한편 10 다른 지원자가 만든은을 통해 사행하는 매개 변수를 호출 + +613 +00:50:20,809 --> 00:50:27,529 + 여기 그리고 지금 당신은 당신이이 만드는 여기에 대한라고 생각 그라데이션을 적용 + +614 +00:50:27,530 --> 00:50:31,080 + 그것의 사촌 이미 불편이 추가 매우 불편 + +615 +00:50:31,079 --> 00:50:38,619 + 완전히 비 윤리 문제 그러나 좋은 소식은 그것이 어떤까지 일 + +616 +00:50:38,619 --> 00:50:43,670 + 수준 당신이 알고있는 조건을 이해 정말 좋은 것 + +617 +00:50:43,670 --> 00:50:48,059 + 작품과 이론적 기초하지만, 실제로는 꽤 작동하도록 보인다 + +618 +00:50:48,059 --> 00:50:51,710 + 잘 당신이 할 수있는 다른 일이 완전히 동 기적으로 당신이 할 수있는 그래서 이렇게이다 + +619 +00:50:51,710 --> 00:50:55,800 + 확인 모든 사람들이 그들 모두가 매개 변수를 가서 소리 하나 운전 루프가 + +620 +00:50:55,800 --> 00:50:58,610 + 그들은 모두 계산 그라디언트 다음은 그라데이션을 기다리는는 표시와해야 할 일 + +621 +00:50:58,610 --> 00:51:03,820 + 그녀의 주위에 그들에게 큰 노력 뭔가하고 효과적으로 단지 + +622 +00:51:03,820 --> 00:51:09,269 + 거대한 일괄처럼 보인다는 당신이 우리의 회를 알고처럼 보이는 그 복제본 + +623 +00:51:09,269 --> 00:51:14,300 + 때때로 당신이 가지 얻을 작동 개개의 일괄 처리 크기 + +624 +00:51:14,300 --> 00:51:18,950 + 더 큰 배치 크기 만 더 훈련에서 수익을 감소 + +625 +00:51:18,949 --> 00:51:21,169 + 예 당신이 + +626 +00:51:21,170 --> 00:51:26,159 + 더 관대 한 당신은 더 큰 바이트는 일반적으로 크기 조 훈련을하다 + +627 +00:51:26,159 --> 00:51:30,420 + 당신이 천 확인의 크기에 대해 알고 예는 백만 훈련을 + +628 +00:51:30,420 --> 00:51:36,068 + 천 그리 좋은하지 못 외부의 예 + +629 +00:51:36,639 --> 00:51:41,289 + 내가 훨씬 더 복잡한 선택은 당신이 할 수 있습니다 거기에 루이스했다 생각 + +630 +00:51:41,289 --> 00:51:52,650 + 유럽​​의 권리를 설명 끝처럼 현재의 모델은 좋은했다 + +631 +00:51:52,650 --> 00:51:57,829 + 데이터 병렬 정말 정말 실제로 그래서 그들은 매개 변수를 많이 재사용 + +632 +00:51:57,829 --> 00:52:02,740 + 우리의 모델의 거의 모든 중요한 그것은 우리가 지점에 도착 방법 + +633 +00:52:02,739 --> 00:52:10,669 + 같은 반에서 교육 모델은 하루 일반적으로 하루 그래서 당신은 당신이 어떤 참조 알고 + +634 +00:52:10,670 --> 00:52:19,180 + 사용 설정의 거친 종류의 이곳은의 예 훈련 그래프이다 + +635 +00:52:19,179 --> 00:52:25,489 + 이미지 네트 모델 하나의 GPU 10기가바이트 52 뷰를 사용하고 최대 속도의 종류가있다 + +636 +00:52:25,489 --> 00:52:26,239 + 아직 + +637 +00:52:26,239 --> 00:52:29,759 + 같은 때때로 이러한 그래프는 10의 차이처럼 받고있다 + +638 +00:52:29,760 --> 00:52:34,220 + 50 년 라인과 같은 큰 각 다른 종류의 가까운 것을하지 않는 것 + +639 +00:52:34,219 --> 00:52:39,489 + 군인하지만 
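[Editor's sketch] The discussion above contrasts fully asynchronous data parallelism (each replica fetches parameters, computes a gradient against a possibly stale copy, and sends it back) with fully synchronous training, which behaves like one giant batch with diminishing returns. A toy single-machine illustration of the asynchronous case, using threads in place of replicas and a shared array in place of the parameter server; it is deliberately racy, which is the point:

```python
import threading
import numpy as np

params = np.random.randn(10)            # shared state: the "parameter server"

def replica(steps=200, lr=0.05):
    for _ in range(steps):
        snapshot = params.copy()        # fetch a possibly stale copy
        grad = 2.0 * snapshot           # toy gradient: minimize ||params||^2
        params[:] -= lr * grad          # apply it; others may have moved params

threads = [threading.Thread(target=replica) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Despite the stale gradients, the shared parameters are driven near zero.
print(np.linalg.norm(params))
```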
실제 사실 10과 50의 차이는 같다 + +640 +00:52:39,489 --> 00:52:43,798 + 그 요인 4.1처럼 보이지 않도록 네 점의 요인은 무엇인가를 원하는 + +641 +00:52:43,798 --> 00:52:51,920 + 차이를 수행하지만, 그래 당신이없이 원하는만큼 당신이 그것을 할 방법입니다 + +642 +00:52:51,920 --> 00:52:59,150 + 하나의 위기 지점 여섯 칠천 위기 지점 확인 + +643 +00:52:59,150 --> 00:53:04,490 + 그래서 내가 당신에게 당신이에 모델에 입찰 할 수 있도록 약간의 개조하면 되겠 어의 일부를 보여 드리겠습니다 + +644 +00:53:04,489 --> 00:53:08,149 + 병렬 처리의 서로 다른 종류의 악용 우리가 원하는 것들 중 하나 + +645 +00:53:08,150 --> 00:53:13,280 + 병렬 개념의 이러한 종류이었다 그렇게 표현 꽤 쉽게하기 + +646 +00:53:13,280 --> 00:53:17,500 + 것들 중 하나는 난의 종류에 꽤 잘 매핑 약 20 분을 좋아한다 + +647 +00:53:17,500 --> 00:53:22,949 + 이 얘기 아니에요 있도록 연구 논문에서 볼 수있는 일이 읽을 수있는 모든 것을하지만, + +648 +00:53:22,949 --> 00:53:30,189 + 당신은 당신이 좀 좋은되어서는 안 볼 것입니다 무엇보다 너무 다른 아니에요 + +649 +00:53:30,190 --> 00:53:37,940 + 간단한 줄기 세포처럼이 시퀀스 모델에 순서입니다 만 + +650 +00:53:37,940 --> 00:53:43,079 + 보조 기관이 신속하게 2014 년에 출판 된 모든 우리는 본질적으로있어 + +651 +00:53:43,079 --> 00:53:47,849 + 입력 시퀀스를 가지고 매핑하는 시도는 이것이하다 서열을 밝혀 + +652 +00:53:47,849 --> 00:53:51,679 + 연구의 정말 큰 영역은 모델 이러한 종류의에 적용 할 수 있습니다 밝혀 + +653 +00:53:51,679 --> 00:53:56,849 + 문제의 종류를 많이하고 많이 다른 그룹 많이하고있다 + +654 +00:53:56,849 --> 00:54:07,369 + 그래서 여기에이 지역에서 흥미로운 비활성 작업은 최근의 단지 몇 가지 예입니다 + +655 +00:54:07,369 --> 00:54:13,269 + 어떤 다른 실험실 주변에서이 지역의 마지막 년 반에서 일 + +656 +00:54:13,269 --> 00:54:17,630 + 당신이 이미 그것에 대해 얘기했습니다 세계 + +657 +00:54:17,630 --> 00:54:26,320 + 당신은 픽셀 단위로 넣을 수 있습니다 그냥 대신 시퀀스의 자막 호출은 당신입니다 + +658 +00:54:26,320 --> 00:54:31,890 + 당신은 당신의 초기 상태의 당신이 CNN을 통해 갔다 픽셀에 넣고 + +659 +00:54:31,889 --> 00:54:34,889 + 꽤 놀라운 캡션을 생성 할 수 있습니다 + +660 +00:54:36,030 --> 00:54:42,019 + 삼십오년 전 나는 잠시 동안 해리 R에 대해 그렇게하지를 생각하지 않는에 기여했다 + +661 +00:54:42,019 --> 00:54:46,730 + 당신은 실제로 할 수있는 다음 당신이 생성 할 수 있도록이 생식 모델 말할 + +662 +00:54:46,730 --> 00:54:51,320 + 분포를 탐구하여 다른 문장은 내가 우리 모두를 생각하는지 + +663 +00:54:51,320 --> 00:54:56,870 + 선장은 인간 하나의 매우 정교한가 안하지 않은 것은하지 않습니다 + +664 +00:54:56,869 --> 00:55:01,230 + 종종 사물의 하나입니다 참조 + +665 +00:55:01,230 --> 00:55:07,639 + 당신이 모델 조금 훈련하면 경우는 그녀의 트레이너 정말 중요합니다 + +666 +00:55:07,639 --> 00:55:13,210 + 그렇게 나쁘지 않아 빛이 있기 때문에 모델이 수렴하지만 당신은 것을 훈련하는 경우 + +667 +00:55:13,210 --> 00:55:17,070 + 모델 이상 같은 모델은 훨씬 더있어 + +668 +00:55:21,079 --> 00:55:25,139 + 트랙에 앉​​아있다 여기에 같은 일을 바로 훈련은 예 그건 사실이야 + +669 +00:55:25,139 --> 00:55:30,909 + 하지만 사람이 더 나은하지만 그녀는 여전히 볼 수있는 사람은 훨씬 더 세련가 + +670 +00:55:30,909 --> 00:55:35,480 + 그들은의 저장소 근처에 트랙을 교차하고 있음을 알 권리처럼 + +671 +00:55:35,480 --> 00:55:42,199 + 모델이 귀여운 다른 종류의에 데리러 것을 더 미묘한 물건의 종류 + +672 +00:55:42,199 --> 00:55:48,750 + 을 사용하여 실제로 매우 시원 그래프 모든 종류의 문제를 해결하는 데 사용할 수 있거나 + +673 +00:55:48,750 --> 00:55:56,440 + 도 마라 포르투나 및 FTP 당신의 톤으로 시작이 일을 yalls + +674 +00:55:56,440 --> 00:56:03,059 + 포인트는 그게 잘 작동에 대한 외판원을 예측하려고 + +675 +00:56:03,059 --> 00:56:11,559 + 잔디의 볼록 선체 또는 Delonte 삼각 측량을위한거야 전화 + +676 +00:56:11,559 --> 00:56:14,199 + 그것은 단지의 순서에 당신 위업에 대한 시퀀스 문제 비밀 알고 + +677 +00:56:14,199 --> 00:56:18,129 + 점하고 출력은 어떤 문제에 대한 포인트의 설정 오른쪽은 당신 + +678 +00:56:18,130 --> 00:56:21,130 + 에 대한 관심 + +679 +00:56:21,780 --> 00:56:28,519 + 내가 사기를거야, 그래서 당신이 한 번, 그래서 확인 응답이 내가 당신을 보여 앨리스 오후 cellco + +680 +00:56:28,519 --> 00:56:35,530 + 거기에 당신의 당신이 네 가지를 원 가정 해 봅시다 시간 스무 시간 단계에 등록 할 수 있습니다 + +681 +00:56:35,530 --> 00:56:37,680 + 시간 단계 당 층 대신 하나 + +682 +00:56:37,679 --> 00:56:42,389 + 잘 당신은 당신의 코드를 변경의 약간을 만들 것입니다 그리고 당신은 지금 그렇게 + +683 +00:56:42,389 --> 00:56:47,690 + 계산의 4 층 당신이 할 수있는 일의 2011 실행이 + +684 +00:56:47,690 --> 00:56:51,840 + 그래서 다른 GPU에 그 층의 각각의 변화가 톤을 만들 것입니다 + +685 +00:56:51,840 --> 00:56:56,869 + 의 작업을 수행 발생하고 당신이 그래서 이런 모델을 할 수 있습니다 + +686 +00:56:56,869 --> 
00:57:01,289 + 내 장식 조각이 나는 시간 단계 당이 난 층이야 다른 깊은 질투하다 + +687 +00:57:01,289 --> 00:57:08,190 + 첫 번째 조금 후 나는 점점 더 많은 GPU를을 가지지고 시작할 수 있습니다 + +688 +00:57:08,190 --> 00:57:10,349 + 과정에 관여 + +689 +00:57:10,349 --> 00:57:15,579 + 당신은 기본적으로 파이프 라인 전체 것은 상기 거대한 소프트 팩있다 + +690 +00:57:15,579 --> 00:57:19,710 + 당신의 상단은 아주 쉽게 모델에 그렇게 유지에 걸쳐 분할 할 수 있습니다 + +691 +00:57:19,710 --> 00:57:25,500 + 병렬 바로 우리가 지금이 사진을 우리가 실제로 분할을 사용하여 여섯 GPU가있어 + +692 +00:57:25,500 --> 00:57:30,909 + 그 부드러운 최대 국경을 남용하고 남자는 그렇게 모든 복제본은 GPU 것 + +693 +00:57:30,909 --> 00:57:36,109 + 동일한 시스템에서 카드를 따라 흥얼의 모든 종류 그리고 당신은 사용할 수 있습니다 + +694 +00:57:36,110 --> 00:57:37,849 + 그 외에도, 데이터 병렬성 + +695 +00:57:37,849 --> 00:57:45,989 + 빨리 훈련하는 AGP 카드 복제의 무리를 양성하는 우리는 QS의이 개념이 + +696 +00:57:45,989 --> 00:57:50,509 + 그는 종류의 그녀가 잔뜩 할 사진이 한 다음 고통을 수 있습니다 + +697 +00:57:50,510 --> 00:57:55,860 + 다음 EQ와 나중에는 D로 시작 사진과 시간의 또 다른 비트가 + +698 +00:57:55,860 --> 00:58:00,789 + 청문회 물건 후 십여 가지 하나 하나의 예는 그래서 + +699 +00:58:00,789 --> 00:58:04,650 + 변환하는 JPEG 디코딩을 할 이유를 다음 입력을 프리 페치와 할 수 있습니다 + +700 +00:58:04,650 --> 00:58:09,240 + 배열의 종류에 어쩌면 약간의 미백을하고 임의 자르기 + +701 +00:58:09,239 --> 00:58:16,149 + 당신 같은 사람의 물건을 선택하고 당신은 다른 GPU에 DQ 수 있습니다 + +702 +00:58:16,150 --> 00:58:22,769 + 카드 또는 뭔가 우리 또한 할 수있는 번역 작업의 우리에 대한 그룹 유사한 예 + +703 +00:58:22,769 --> 00:58:27,869 + 당신의 배치 예에 무리가되도록 실제로 문장의 길이에 의해 버킷 + +704 +00:58:27,869 --> 00:58:32,449 + 그 모두 거의 같은 문장 길이 모두 13 216 단어 문장 + +705 +00:58:32,449 --> 00:58:37,539 + 단지 우리가 심지어해야 만 정확하게 많은 펼쳐진 실행 의미 일 + +706 +00:58:37,539 --> 00:58:42,210 + 당신이 임의의 다음 문장 길이 잘 알고보다는 단계 + +707 +00:58:42,210 --> 00:58:46,099 + 임의 회원 큐는 단지 전체 무리입니다 셔플 도전 + +708 +00:58:46,099 --> 00:58:49,099 + 예를 들면 다음 밖으로 임의의 사람을 얻을 + +709 +00:58:55,130 --> 00:59:02,269 + 데이터 병렬 바로 그래서 다시 우리는이 많은 복제본을 가질 수 있도록하려면 + +710 +00:59:02,269 --> 00:59:09,309 + 것은 그래서 당신은 우리있어 꽤 행복하지 않은 변경의 적당한 양을 + +711 +00:59:09,309 --> 00:59:13,769 + 하지만 변화의 양이 감독자가 무엇 당신이 할의 종류 + +712 +00:59:13,769 --> 00:59:19,429 + 그것은 당신이 지금 압력 장치가 말할와 준비 사물의 무리가 + +713 +00:59:19,429 --> 00:59:25,509 + 세션 후 다음 라운드의 각 로컬 루프 당신은 유지하지 + +714 +00:59:25,510 --> 00:59:28,000 + 얼마나 많은 단계의 트랙 모두에 걸쳐 전 세계적으로 적용되었습니다 + +715 +00:59:28,000 --> 00:59:32,500 + 곧 다른 복제본과 모든 사람들의 누적 합계가 큰입니다 + +716 +00:59:32,500 --> 00:59:38,829 + 동기 훈련을 위해 충분히 그 세 가지 별도의 클라이언트처럼 ​​좀 보인다 + +717 +00:59:38,829 --> 00:59:43,929 + 그래서 모든 매개 변수와 함께 큰 중 하나를 세 가지 별도의 복제본을 구동 두려워 + +718 +00:59:43,929 --> 00:59:47,119 + 우리는 분리가없는 경우 불신에서 의미가 흐르는 경향하기 + +719 +00:59:47,119 --> 00:59:54,359 + 매개 변수 서버 개념 우리가 포함 된 답변 변수 변수를 + +720 +00:59:54,360 --> 00:59:59,590 + 답변 그들은 그래프의 단지 다른 부분이고 일반적으로 당신이 그들을지도 + +721 +00:59:59,590 --> 01:00:04,250 + 장치의 작은 세트에 그들은 당신에게 매개 변수를 거 보유하고 있지만 전부 + +722 +01:00:04,250 --> 01:00:07,269 + 나는 그 대답을 보낸다 여부 종류의 같은 프레임 워크에 통합 + +723 +01:00:07,269 --> 01:00:12,829 + 매개 변수 또는 정품 인증 또는이 문제가되지 않습니다 어떤이의 종류 + +724 +01:00:12,829 --> 01:00:16,750 + 동기는 하나의 클라이언트를 가지고 난 그냥 세에서 내 배치를 분할 할 + +725 +01:00:16,750 --> 01:00:22,989 + 복제 기울기를 가지고 있었고, 꽤 것으로 판명 수 있습니다 알고 적용하고 + +726 +01:00:22,989 --> 01:00:31,239 + 감소 정밀도의 허용 그렇게 FB (16)가 실제로 있고 난 트리폴리 변환 + +727 +01:00:31,239 --> 01:00:36,869 + 16 ~ 14 점은 현재 지점을두고 표준 내가 할 지금 대부분의 CPU 사용하지 꽤 + +728 +01:00:36,869 --> 01:00:42,719 + 아직 우리가 우리 자신의 여섯 비트 형식을 구현 지원 + +729 +01:00:42,719 --> 01:00:45,719 + 기본적으로 우리는 32 비트 부동는 나에게 구입 잘려 수있다 + +730 +01:00:47,429 --> 01:00:55,889 + 당신은 종류의 확률 공공 새로운하지만 우리가 종류의 확인은 안해야 + +731 +01:00:55,889 --> 01:01:01,389 + 어떤에서 작성하여 다른 측면에서 32 비트 변환 동의하면 바로 알 + +732 +01:01:01,389 --> 01:01:15,098 + 그것을 위해 여전히 모델링 및 데이터하면서 매우 졸린 지붕 친화적 인 종이입니다 + +733 
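[Editor's sketch] Captions ~695–708 above describe the input-queue machinery: prefetch filenames, decode JPEGs, whiten, take random crops, shuffle, and hand dense batches to the GPUs (plus bucketing by sentence length for the translation workload). A sketch of the same pipeline using the modern tf.data API rather than the original queue runners; the file pattern is a placeholder and images are assumed larger than the crop:

```python
import tensorflow as tf

files = tf.data.Dataset.list_files('train-*.jpg')    # placeholder pattern

def preprocess(path):
    img = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    img = tf.image.random_crop(img, [224, 224, 3])    # random crop
    img = tf.image.per_image_standardization(img)     # "whitening"
    return img

dataset = (files.shuffle(10_000)                      # shuffling queue
                .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
                .batch(128)
                .prefetch(2))                         # keep the GPUs fed
```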
+01:01:15,099 --> 01:01:19,500 + 함께 바인딩에서 병렬 정말 빠르게 모델을 훈련 좋아 + +734 +01:01:19,500 --> 01:01:24,639 + 즉,이 모든 정말로에 대한 시도 연구 아이디어를 가지고 할 수있는입니다 무엇 + +735 +01:01:24,639 --> 01:01:28,250 + 큰 데이터 세트에 그것을 밖으로는 상관 문제의 대표 + +736 +01:01:28,250 --> 01:01:29,000 + 약 + +737 +01:01:29,000 --> 01:01:34,199 + 꽤 쉽게로 실험의 다음 세트 밖으로 그 일의 숫자를 파악 + +738 +01:01:34,199 --> 01:01:38,039 + 어딘가에 집중 하중을위한 너무 행복하지 않은 데이터 프로파일을 표현하는 + +739 +01:01:38,039 --> 01:01:44,889 + 동기 병렬 처리는 일반적으로 우리가 오픈 소스가 너무 나쁜 아니지만 + +740 +01:01:44,889 --> 01:01:49,480 + 센터 흐름을 우리가보다 쉽게​​ 연구 기록을 공유 할 수있을 거라 생각하기 때문에 + +741 +01:01:49,480 --> 01:01:56,338 + 우리는 당신이 외부 시스템을 사용하는 많은 사람들을 가지는 알고 생각입니다 + +742 +01:01:56,338 --> 01:01:59,849 + 구글을 개선하고 우리가하지 않는 아이디어를 가져 오는 좋은 일이 있었다 + +743 +01:01:59,849 --> 01:02:05,200 + 반드시이에 기계 학습 시스템을 구축하는 것이 매우 쉽게하는 방법 + +744 +01:02:05,199 --> 01:02:09,298 + 실제 제품은 당신이 뭔가를 실행에 우리의 연구 아이디어에서 갈 수 있기 때문에 + +745 +01:02:09,298 --> 01:02:13,059 + 상대적으로 쉽게 전화 외부 수십 사용자의 커뮤니티 + +746 +01:02:13,059 --> 01:02:16,609 + 구글은 멋진 사물의 모든 종류의 일을하는 방법 좋은 인 성장 I + +747 +01:02:16,608 --> 01:02:21,130 + 고른 게시 얻을 사람들이 수행 한 일의 몇 가지 임의의 예 + +748 +01:02:21,130 --> 01:02:28,769 + 이 안드레처럼 하나의 방법 데일 스 포드가에서 실행에서이 불만을 가지고 + +749 +01:02:28,769 --> 01:02:32,920 + 브라우저를 사용하여 자바 스크립트와 그가 약간 게임의 한 것들 중 하나 + +750 +01:02:32,920 --> 01:02:38,798 + 노란색 점을 학습 보강 배운다 진짜 먹고 얻을 수 배운다 + +751 +01:02:38,798 --> 01:02:42,769 + 긴급 녹색 점은 누군가에 그것을 다시 구현하도록 빨간색 점을 피하기 위해 + +752 +01:02:42,769 --> 01:02:47,059 + 흐름의 관점 실제로 추가 오렌지 도트 정말 나쁜 + +753 +01:02:50,650 --> 01:02:54,550 + 누군가가에 틸 부르 흐 대학에서이 정말 좋은 종이를 구현 + +754 +01:02:54,550 --> 01:02:59,590 + 막스 플랑크 연구소는 당신이 사진 이미지를 촬영하는이 작품을 볼 수과 + +755 +01:02:59,590 --> 01:03:05,269 + 일반적으로 다음 그림과 해당 용지의 스타일에서 해당 사진을 렌더링 + +756 +01:03:05,269 --> 01:03:14,820 + 당신은 당신이 문자가 알고 나쁜처럼 멋진 물건으로 끝날 + +757 +01:03:14,820 --> 01:03:19,550 + 높은 수준의 라이브러리의 인기 정렬 외부 여기 모델을 만드는 + +758 +01:03:19,550 --> 01:03:25,640 + 쉽게 메일 매트를 표현하는 사람이 신경 자막 모델을 구현 + +759 +01:03:25,639 --> 01:03:31,099 + 중국어로 번역에 낮은 측면에서 우리의 노력이 진행되고있다 + +760 +01:03:31,099 --> 01:03:39,349 + 멋진 위대한 마지막 것은 우리가했습니다 뇌 레지던시 프로그램에 대해 이야기합니다 + +761 +01:03:39,349 --> 01:03:44,349 + 실험의 비트 올해이 프로그램을 시작하고 그래서 이것은 더 + +762 +01:03:44,349 --> 01:03:47,769 + 참고로 내년 원인 또는 응용 프로그램에 대한 폐쇄 사제관으로 + +763 +01:03:47,769 --> 01:03:53,420 + 사람들은 것 이번 주에 우리의 최종 후보를 선택한 다음 생각은 + +764 +01:03:53,420 --> 01:03:57,789 + 깊은 학습 연구를하고 우리 그룹의 올해 투자 및 희망이다 + +765 +01:03:57,789 --> 01:04:02,750 + 그들은 나올 것입니다 및 제출 아카이버 논문의 몇 가지를 발표했다 + +766 +01:04:02,750 --> 01:04:08,039 + 회사에하고 흥미로운 기계의 종류를하는 것에 대해 많은 것을 배울 + +767 +01:04:08,039 --> 01:04:16,170 + 연구를 배우고 지금 우리에 대해 분명히 내년 사람을 찾고있어 + +768 +01:04:16,170 --> 01:04:24,670 + 당신은 애플리케이션을 다시 할 수업을 아는 사람에 우리의 강한 + +769 +01:04:24,670 --> 01:04:25,990 + 가을 + +770 +01:04:25,989 --> 01:04:34,439 + 내년에 기회처럼 졸업 거기 당신은 무리가 더있어 이동 + +771 +01:04:34,440 --> 01:04:36,909 + 이 읽기 + +772 +01:04:36,909 --> 01:04:42,949 + 당신의 사촌을 시작 난의 전체 세트를 만들기 위해 흰 종이에 많은 일을했다 + +773 +01:04:42,949 --> 01:04:52,169 + 참조를 클릭 한 다음 확인을 그래서 250 다른 인물을 통해 귀하의 방법을 클릭합니다 I + +774 +01:04:52,170 --> 01:04:53,820 + 초기에 수행 된 + +775 +01:04:53,820 --> 01:04:56,820 + 백 육십 오 + +776 +01:05:02,730 --> 01:05:31,599 + 예 그래서 사물의 그 종류는 실제로 까다로운 그리고 우리는 실제로 꽤있다 + +777 +01:05:31,599 --> 01:05:37,329 + 당신이 당신에 대해 얘기를 알고있는 것 것들에 대한 광범위한 세부 과정 + +778 +01:05:37,329 --> 01:05:43,119 + 일 똑똑 응답 이러한 종류의 사용자의 개인 정보를 사용 + +779 +01:05:43,119 --> 01:05:47,559 + 이제까지 생성됩니다 기본적으로 모든 응답이 단어는 것들 + +780 +01:05:47,559 --> 01:05:52,710 + 수천 명의 사용자로 말했다되었습니다 있도록하기위한 모델에 입력 + +781 +01:05:52,710 --> 01:05:57,380 + 교육 방법에 대해 사람들하지만 단지에 
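[Editor's sketch] Returning to the reduced-precision point made just before this passage (captions ~726–731): parameter communication tolerates 16-bit values, converted back to 32 bits on the other side, and the lecture mentions a custom 16-bit wire format that simply truncates 32-bit floats. A small numpy sketch of both options; the truncation trick keeps the sign, exponent, and top mantissa bits:

```python
import numpy as np

w = np.random.randn(5).astype(np.float32)

# Option 1: IEEE half precision.
f16 = w.astype(np.float16)

# Option 2: "truncated float32" -- keep only the top 16 bits of each value
# (as the custom wire format described in the lecture does), then widen back.
truncated = (w.view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

print(np.abs(w - f16.astype(np.float32)).max())   # float16 rounding error
print(np.abs(w - truncated).max())                # truncation error
```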
대해 일반적으로하지 않은 이메일입니다 + +782 +01:05:57,380 --> 01:06:02,480 + 지금까지 제안합니다 일이 당신이 알고에 의해 응답으로 생성되는 것들 + +783 +01:06:02,480 --> 01:06:07,670 + 고유 한 사용자의 의심 번호를 넣어 사용자의 개인 정보를 보호하기 위해 + +784 +01:06:07,670 --> 01:06:10,710 + 면 같은 제품을 설계 할 때 약 당신이 생각하는 물건의 종류 + +785 +01:06:10,710 --> 01:06:16,400 + 실제로 카렌의 많은 당신에 갈 생각된다 우리가이가 될 것이라고 생각을 알고 + +786 +01:06:16,400 --> 01:06:22,119 + 훌륭한 기능 그러나 우리는 사람들의 프라이버시를 보장하는 방식으로이 작업을 수행 할 수있는 방법 + +787 +01:06:22,119 --> 01:06:25,119 + 보호 + +788 +01:06:52,670 --> 01:07:30,108 + 우리는 아마 그것을 보장해야하는만큼 그냥 가지 중 하나가되었습니다 + +789 +01:07:30,108 --> 01:07:32,548 + 우리가했던 모든 다른 것들에 비해 다시 버너에 일 + +790 +01:07:32,548 --> 01:07:37,679 + 나는 전문가의 개념은 내가 그것에 대해 얘기하지 않았다고 생각 할 작업 + +791 +01:07:37,679 --> 01:07:42,489 + 기본적으로 모든하지만 우리는 종류의 임의의 이미지를했다 모델을 가지고 그 + +792 +01:07:42,489 --> 01:07:46,868 + JFT 같은 분류 모델은 만칠천 손실 또는 같은 인 + +793 +01:07:46,869 --> 01:07:51,220 + 뭔가는 우리가 할 수있는 좋은 일반 모델을 내부 데이터 훈련있어 그 + +794 +01:07:51,219 --> 01:07:57,539 + 모든 클래스에 대처하고 우리는 흥미로운 혼동이 계산 가능한 발견 + +795 +01:07:57,539 --> 01:08:01,719 + 세계에서 버섯의 모든 종류의 같은 알고리즘이다 클래스 + +796 +01:08:01,719 --> 01:08:06,539 + 데이터가 풍부한이 유일한 골이 한 세트에 우리는 전문가를 훈련했다 + +797 +01:08:06,539 --> 01:08:11,909 + 버섯 주로 데이터와 가끔 임의의 이미지와 우리는 할 수 + +798 +01:08:11,909 --> 01:08:16,179 + 물건의 종류에 좋은 도달 쉰 같은 모델을 훈련받을 + +799 +01:08:16,179 --> 01:08:24,440 + 꽤 상당한 정확도는 우리가 그것을 증류 할 수 있었다 우리시에 증가 + +800 +01:08:24,439 --> 01:08:27,588 + 단일 모델로 꽤 잘 우리가 정말 너무 많은 것을 추구하지 않은 + +801 +01:08:27,588 --> 01:08:31,899 + 밝혀졌다 그냥 역학 쉰 별도의 모델을 훈련하고있다 + +802 +01:08:31,899 --> 01:08:34,899 + 조금 다루기로 증류 + +803 +01:08:38,170 --> 01:09:20,630 + 이 명확하게 보여줍니다 말한대로 14 탐사 및 추가 연구가있다 + +804 +01:09:20,630 --> 01:09:25,920 + 우리가 걸 내가 그것을 다른 목적이해야 할 모델을 이야기 한 뜻 + +805 +01:09:25,920 --> 01:09:31,048 + 바로 우리는이 어려운 라벨을 사용하거나이 어려운 라벨을 사용하고 그것을 말하는 거 + +806 +01:09:31,048 --> 01:09:36,189 + 여기처럼 말한다이 믿을 수 없을만큼 풍부한 그라데이션의 백 다른 신호를 얻을 수 + +807 +01:09:36,189 --> 01:09:41,379 + 정보 어떤 의미에서 불공정 한 비교 바로 당신이 그것을 많이 이야기하고 그래서 + +808 +01:09:41,380 --> 01:09:46,829 + 내 경우는 그래서 때로는 그렇지 않은 모든 예에 대한 더 많은 물건 너무 많은 + +809 +01:09:46,829 --> 01:09:49,119 + 작업은 어쩌면 우리도해야거야 느낌 + +810 +01:09:49,119 --> 01:09:53,960 + 단지 하나의 이진 레이블보다 설교자 신호를 공급하는 방법을 알아내는 + +811 +01:09:53,960 --> 01:09:59,569 + 우리의 모델 그게 우리가 생각 나는을 추구하는 아마 흥미있는 영역이라고 생각 + +812 +01:09:59,569 --> 01:10:05,349 + 모든 훈련 집합 모델의 큰 앙상블을 갖는 아이디어에 대한 + +813 +01:10:05,350 --> 01:10:08,449 + 그 예측의 형태로 정보를 교환의 일종이다 오히려 + +814 +01:10:08,448 --> 01:10:12,779 + 해당 매개 변수보다 나는 훨씬 저렴 이상의 네트워크 친화적 인 방법이 될 수 있습니다로 + +815 +01:10:12,779 --> 01:10:19,099 + 의의 공동 당신이 훈련의 1 % 않았다 정말 큰에서 훈련 + +816 +01:10:19,100 --> 01:10:22,100 + 하루라도 스왑 예측 + +817 +01:10:39,729 --> 01:10:49,779 + 그래 내가 라디오의 모든 종류의 캡션을 추구하는 가치가있다 생각 의미 + +818 +01:10:49,779 --> 01:10:55,039 + 흥미로운 근로자하지만 당신 경향이 많은 적은 라벨을 갖는 경향이 + +819 +01:10:55,039 --> 01:11:02,550 + 캡션 우리는 지터 재규어 같은 하드 라벨의 종류에 이미지를 가지고 + +820 +01:11:02,550 --> 01:11:06,810 + 깨끗한 방법으로 제조되는 적어도 나는 많은 거기에 내가 알고 있어요 실제로 생각 + +821 +01:11:06,810 --> 01:11:11,539 + 트릭에 대해 쓴 문장과 이미지로 식별되는 + +822 +01:11:11,539 --> 01:11:26,430 + 문장있는 이미지 문제는 당신이 필요하지 않습니다 알고있는 몇 가지 문제에 대한 + +823 +01:11:26,430 --> 01:11:29,510 + 정말 음성 인식 등의 광산에 훈련은 그렇지 않은 좋은 예입니다 + +824 +01:11:29,510 --> 01:11:35,670 + 인간의 성대가 종종 단어를 변경처럼 그렇게 조금 변경 말한다 + +825 +01:11:35,670 --> 01:11:38,670 + 우리 재배포은 매우 고정하지 경향이있다 + +826 +01:11:39,640 --> 01:11:45,460 + 단어 모두가 공동으로 말한다처럼 내일 것과 매우 유사하다 + +827 +01:11:45,460 --> 01:11:50,640 + 그들은 오늘하지만 롱 아일랜드 초콜릿 축제 같은 미묘한 차이가 수도 말 + +828 +01:11:50,640 --> 01:11:55,220 + 갑자기 더 눈에 띄는 다음 2 주 그 종류 이상이 될 + +829 +01:11:55,220 --> 01:11:58,930 
+ 일의 당신은 당신이 원하는 사실을 인식 할 필요가 알고 + +830 +01:11:58,930 --> 01:12:03,079 + 이 양성하는 것입니다 할 그 효과의 종류와 가지 방법 중 하나를 캡처하여 + +831 +01:12:03,079 --> 01:12:07,380 + 모델과 작은 언젠가 그는 이렇게 온라인으로 할 필요하지는 않지만 좋아 + +832 +01:12:07,380 --> 01:12:10,770 + 예를 받고 즉시 당신의 모델을 업데이트 할 수 있지만 당신은 알고 + +833 +01:12:10,770 --> 01:12:16,180 + 펜티엄 문제 5 분마다, 10 분 시간 또는 하루에 충분하다 + +834 +01:12:16,180 --> 01:12:23,940 + 대부분의 문제이지만 것은 아닌 고정을 위해 그렇게 매우 중요 + +835 +01:12:23,939 --> 01:12:28,949 + 그런 시간이 지남에 따라 변경 광고 나 검색 쿼리 나 물건 같은 문제 + +836 +01:12:28,949 --> 01:12:33,738 + 권리 + +837 +01:12:33,738 --> 01:12:42,428 + 내가 네 말을 할 수없는 세 번째 가장 중요한 + +838 +01:12:45,819 --> 01:12:57,170 + 그래 나는 훈련 데이터 세트에서 잡음이 실제로 모든 시간 유명한 선수를 어떻게 의미 + +839 +01:12:57,170 --> 01:13:01,340 + 예 때때로 당신이 건너거야 같은 당신은 이미지를 볼 경우에도 + +840 +01:13:01,340 --> 01:13:02,328 + 당신의 인생에서 하나 + +841 +01:13:02,328 --> 01:13:06,670 + 에서 일하고있는 사실 난 그냥 어떤 사람과의 만남에 앉아 있었다 + +842 +01:13:06,670 --> 01:13:10,929 + 시각화 기술과 지금까지 볼 수 있었다 시각화 된 것들 중 하나 + +843 +01:13:10,929 --> 01:13:14,779 + 입력 데이터들은 모두 C 네 코어 프레젠테이션 이런 있었다 + +844 +01:13:14,779 --> 01:13:18,920 + 예는 모두 4 × 4 화소 각각 월에 같은에 매핑 + +845 +01:13:18,920 --> 01:13:22,819 + 자신의 육만 이미지 화면과 마이크가 가지 일을 선택할 수 + +846 +01:13:22,819 --> 01:13:28,219 + 출력 및 방향을 선택하고 여기에 예측 모델을 좋아 하나 + +847 +01:13:28,219 --> 01:13:33,948 + 높은 신뢰성하지만 잘못했고, 그것은 말했다 모델로 그녀의 비행기가 + +848 +01:13:33,948 --> 01:13:40,518 + 비행기 당신은 이미지를보고는 비행기의 레이블은 아니다 + +849 +01:13:40,519 --> 01:13:49,690 + 주로 내가 실행 야지 왜 당신은 당신이 원하는 알고는 그래서 이해 좋아 + +850 +01:13:49,689 --> 01:13:53,288 + 있는지 확인 데이터 집합 교육 사촌 가능한 한 깨끗하고 잡음이 데이터가 + +851 +01:13:53,288 --> 01:13:56,488 + 로 일반적으로 좋지 않다 + +852 +01:13:56,488 --> 01:14:00,819 + 그것을 세정하지만 한편 것을 청소 너무 많은 노력을 늘리지 + +853 +01:14:00,819 --> 01:14:06,969 + 종종 더 많은 종류의 몇 가지 필터링 가지 작업을 수행하는 그 가치보다 더 많은 노력 + +854 +01:14:06,969 --> 01:14:12,788 + 일의 당신은 일반적으로 더 명백한 나쁜 물건을 던져하지 않습니다 + +855 +01:14:12,788 --> 01:14:15,788 + 시끄러운 데이터는 최대 그것은 덜 깨끗한보다 종종 더 낫다 + +856 +01:14:18,739 --> 01:14:28,649 + 문제에 따라 달라집니다하지만 당신이 있다면 약 한 것은 다음 시도하고 + +857 +01:14:28,649 --> 01:14:34,159 + 결과에 만족하지 왜 질문 조사 + +858 +01:14:34,159 --> 01:14:39,210 + 좋아 감사합니다 + diff --git a/captions/Ko/Lecture1_ko.srt b/captions/Ko/Lecture1_ko.srt new file mode 100644 index 00000000..a129996a --- /dev/null +++ b/captions/Ko/Lecture1_ko.srt @@ -0,0 +1,2860 @@ +1 +00:00:00,000 --> 00:00:03,899 + 측면에 더 많은 좌석이있다 + +2 +00:00:03,899 --> 00:00:19,868 + 사람들은 당신이 CS2 (31)과 깊이에있어 확인하기 위해 늦은 그래서에서 걷고있다 + +3 +00:00:19,868 --> 00:00:23,969 + 시각적 인식을위한 네트워크 클래스에있는 돈을 학습 + +4 +00:00:23,969 --> 00:00:33,549 + 겨울 방학의 잘못된 클래스 좋은 내가 너무 환영과 행복 한 새 해 행복 첫날 + +5 +00:00:33,549 --> 00:00:41,069 + 그래서 이것은이 클래스의 두 번째 제안은 231 classiest하고있을 때 우리 + +6 +00:00:41,070 --> 00:00:48,738 + 문자 그대로 4백80명에서 우리가 제공하는 마지막 시간을 우리의 등록을 두 배로 + +7 +00:00:48,738 --> 00:00:55,939 + 350 당신의에 우리 모두를 법적으로 확인하는 단어의 몇 가지를 가입 + +8 +00:00:55,939 --> 00:01:02,570 + 당신이 있다면 당신이 알 수 있도록 방법이 클래스를 기록 우리의 비디오 덮여 + +9 +00:01:02,570 --> 00:01:10,680 + 오늘이 불편 그냥 카메라 뒤에 이동하거나 이동 + +10 +00:01:10,680 --> 00:01:18,280 + 이 관점에서 작성을 위해 카메라 쓰레기 그러나 우리는 양식을 보내려고하고있다 + +11 +00:01:18,280 --> 00:01:25,228 + 즉, 그래서 비디오 녹화를 허용하는 그 때문에 가사의 한 비트의 + +12 +00:01:25,228 --> 00:01:32,200 + 확실히 그들은 그에게 컴퓨터 과학 학부 교수를 실패 할 때 너무 + +13 +00:01:32,200 --> 00:01:37,960 + 이 클래스와 함께 공동 교육 수석 대학원생을 통해 그 중 하나 + +14 +00:01:37,959 --> 00:01:45,839 + 이익 아래 영역은 크게 우리는 내가 앙드레 너무 많이 필요하다고 생각하지 않는 것을 가지고있다 + +15 +00:01:45,840 --> 00:01:48,659 + 소개 여러분 모두가 자신의 일을 알고있다 + +16 +00:01:48,659 --> 00:01:53,960 + 자신의 블로그를 자신의 트위터 팔로워를 따라 + +17 +00:01:53,959 --> 00:02:02,509 + 방법이 더 추종자가에서 나는 매우 인기있는 
저스틴 존슨은 여전히​​보다 + +18 +00:02:02,510 --> 00:02:08,200 + 해외 여행하지만 며칠 그래서 앙드레 땅 그냥 그렇게 다시 할 것이다 + +19 +00:02:08,199 --> 00:02:14,509 + 우리는 내가주는거야 강의 교육의 대부분 오늘을 따기됩니다 + +20 +00:02:14,509 --> 00:02:20,039 + 그녀의 구조는하지만 같은 당신은 아마 나는 신생아 비율을 기대하고있어 것을 알 수 있습니다 + +21 +00:02:20,039 --> 00:02:28,239 + 주 말하고 그래서 당신은 우리 것 강의 시간에 비 배수 저스틴의 자세한 내용을 볼 수 있습니다 + +22 +00:02:28,239 --> 00:02:34,189 + 또한 다시이 강의의 끝으로 TACE의 전체 팀 소개 + +23 +00:02:34,189 --> 00:02:38,959 + 좌석을 찾고있는 사람들은 당신이 전에 밖으로 가서 다시 거기 와서 + +24 +00:02:38,959 --> 00:02:47,039 + 우리가 가고있는이 강의에 대한 측면에 좌석의 모두 그렇게 때문에이 + +25 +00:02:47,039 --> 00:02:53,519 + 우리가하고 작동 문제가 어떤 종류의 클래스의 도입을 제공 + +26 +00:02:53,519 --> 00:03:03,530 + 도구는 그래서 다시 배우고 우리에게 (231)를 볼 수 환영이의 비전됩니다 + +27 +00:03:03,530 --> 00:03:09,140 + 수업은 여러분라는 매우 구체적인 모델링 아키텍처를 기반으로 + +28 +00:03:09,139 --> 00:03:16,000 + 네트워크와 네트워크 및 많은에 대한보다 구체적으로는 대부분 컨볼 루션 + +29 +00:03:16,000 --> 00:03:23,799 + 당신이 인기를 눌러 문서를 통해 어쩌면이 용어를 듣고의 우리 또는 또는 또는 + +30 +00:03:23,799 --> 00:03:34,239 + 범위는 우리의이 깊은 학습 네트워크 성장 분야를 호출 했어 + +31 +00:03:34,239 --> 00:03:40,920 + 사실 인공 지능이 학교는 추정하고 우리는있다 + +32 +00:03:40,919 --> 00:03:50,018 + 우리는 이미의 85 % 이상을 도달 한 2016 년이의에 대한 진행 + +33 +00:03:50,019 --> 00:03:56,230 + 인터넷 사이버 공간 데이터는 픽셀의 형태 인 + +34 +00:03:56,229 --> 00:04:05,329 + 또는 멀티미디어를 부르는 것을 우리는 기본적으로 비전의 시대를 입력 그래서 있도록 + +35 +00:04:05,330 --> 00:04:12,530 + 이미지와 영상 통화의 이유를 이유입니다 동안 부분적으로 큰 정도 그렇게 + +36 +00:04:12,530 --> 00:04:20,858 + 데이터 캐리어로서의 인터넷 모두 폭발 스피커 물론이다 + +37 +00:04:20,858 --> 00:04:25,930 + 답변 우리는 더 많은 센서가 그 목에 사람들이 일 수 + +38 +00:04:25,930 --> 00:04:32,000 + 당신의 모든 사람이 스마트 폰의 어떤 종류를 수행하고 디지털 카메라와 + +39 +00:04:32,000 --> 00:04:37,879 + 그리고 당신은 군인 그렇게 있도록 카메라와 거리에 주위에 차를 알고 + +40 +00:04:37,879 --> 00:04:46,500 + 정말 인터넷하지만 시각에 영상 데이터의 폭발을 사용하도록 설정 + +41 +00:04:46,500 --> 00:04:55,209 + 데이터 또는 픽셀 데이터는 들어 본 적이 경우 가장 어려운 데이터가 그렇게 활용하는 것도 내 + +42 +00:04:55,209 --> 00:05:07,810 + 인터넷의 암흑 물질에 의해 이전 회담과 다른 공원 + +43 +00:05:07,810 --> 00:05:13,879 + 이유는 우주와 같은 암흑 물질은 85 % 어두운에게 가장 가까운입니다 + +44 +00:05:13,879 --> 00:05:19,409 + 문제 암흑 에너지는 매우 어렵다 에너지가 주 관찰하는 것이이 문제입니다 + +45 +00:05:19,410 --> 00:05:25,919 + 우주 인터넷이 수학적 모델로 주말이있는 + +46 +00:05:25,918 --> 00:05:30,649 + 문제는 우리가 힘든 시간을 모르는 데이터를 다른 데이터를 화소를 + +47 +00:05:30,649 --> 00:05:36,239 + 여기에 대륙있어 파악하는 것은 당신이 고려해야 할 하나의 매우 매우 간단 용의자의 + +48 +00:05:36,240 --> 00:05:39,090 + 그래서 오늘 + +49 +00:05:39,089 --> 00:05:49,560 + 유튜브 서버 60 초마다 우리는 업로드 한 동영상의 이상 $ 150해야합니다 + +50 +00:05:49,560 --> 00:05:54,089 + 60 초마다를위한 YouTube 서버 상에 + +51 +00:05:54,089 --> 00:06:02,739 + 인간의 눈은 가려 낼 수있는 방법이 없습니다 데이터의 양에 대해 생각 + +52 +00:06:02,740 --> 00:06:07,829 + 데이터의이 방대한 양과는 많은 아시아 확인 + +53 +00:06:07,829 --> 00:06:14,009 + 그것과와와 라벨과에서 연락처 영혼 가수 설명 + +54 +00:06:14,009 --> 00:06:20,980 + YouTube 팀이나 또는 Google 회사의 관점 그들은 우리를 도와하려면 + +55 +00:06:20,980 --> 00:06:25,640 + 자신의 목적을 위해 인덱스를 관리하고 물론 검색하기 + +56 +00:06:25,639 --> 00:06:31,529 + 광고 나 또는 어떤 조작은 데이터의 내용은 손실이었다 + +57 +00:06:31,529 --> 00:06:38,919 + 아무도 우리가 할 수있는이 유일한 희망을 수 없기 때문이 사실의 비전입니다 + +58 +00:06:38,920 --> 00:06:44,640 + 기술은 객체 파이낸싱 비닐 프레임 레이블을 할 수 + +59 +00:06:44,639 --> 00:06:50,349 + 농구 비디오 그런 코비 브라이언트의 결정이었다 어디 싸다 가야 알고 + +60 +00:06:50,350 --> 00:06:57,320 + 멋진 샷과 사회이 우리가 오늘 직면하고있는 문제입니다 + +61 +00:06:57,319 --> 00:07:02,860 + 대용량 데이터의 양 문제 때문에 어둠의 도전 + +62 +00:07:02,860 --> 00:07:07,379 + 많은 다른 분야에 닿을 필드로 편안한 비전 + +63 +00:07:07,379 --> 00:07:12,740 + 연구 그래서 난시 히터는 여기에서 확인 앉아 오전 + +64 +00:07:12,740 --> 00:07:18,050 + 컴퓨터의 크기와 매니아 차량은하지만 많은 생물학 심리학에서 온 + +65 +00:07:18,050 --> 00:07:24,389 + 로봇을위한 자연 언어 처리 또는 그래픽을 전문으로 또는 + +66 
+00:07:24,389 --> 00:07:30,680 + 또는 당신은 의료 영상을 알고 그래서 내가 사랑하는 있도록 필드 컴퓨터 비전 정말입니다 + +67 +00:07:30,680 --> 00:07:37,329 + 문제는 우리가 우리가 사용하는 모델 일을 무엇을 진정으로 학제 분야 + +68 +00:07:37,329 --> 00:07:43,849 + 엔지니어링 물리학 생물학 심리학으로 수학의 크기를 비교 + +69 +00:07:43,850 --> 00:07:51,030 + 그래서 좀 더 개인적인 접촉의 조금 나는 부문의 감독이다 + +70 +00:07:51,029 --> 00:07:58,589 + 심지어 대학원생과 박사후 연구원과 함께 작동합니다 우리의 실험실에서 실험실 물건 + +71 +00:07:58,589 --> 00:08:04,669 + 사다리 아래 항목의 수에 학생들과 우리 자신의 연구에 가장 사랑 + +72 +00:08:04,670 --> 00:08:10,540 + 그들 중 일부는 당신이 좋은 알고있는 내 실험실에서 온 + +73 +00:08:10,540 --> 00:08:17,780 + 일부인 2 년 수는 우리가 기계 학습 작업을 내 실험실에서 온 + +74 +00:08:17,779 --> 00:08:26,109 + 깊은 학습의 상위 집합의 우리뿐만 아니라 과학과 신경 과학의 많은 작업 + +75 +00:08:26,110 --> 00:08:31,270 + 즉 그의 있도록 LPN 연설 사이의 교차점으로의 종류 + +76 +00:08:31,269 --> 00:08:40,399 + 내 연구실 물건을 넣어 그래서 또한 작동 컴퓨터 비전 연구의 풍경 + +77 +00:08:40,399 --> 00:08:45,600 + 우리가 제공하는 지금 무엇을 다른 컴퓨터 비전 클래스 조금 더 관점에서 + +78 +00:08:45,600 --> 00:08:51,050 + 분명히 여기에 물건이나 컴퓨터 과학 부서를 통해 당신은에있어 + +79 +00:08:51,049 --> 00:08:59,629 + 이 클래스 에스 (21) 그래서 당신에게 당신의 찍은 적이없는 컴퓨터 비전 + +80 +00:08:59,629 --> 00:09:06,220 + 아마 통근 들어 처음으로 아마 이미 가지고 있어야 + +81 +00:09:06,220 --> 00:09:14,730 + 전 분기의 멋진 클래스는 우리가 제공하는 것 (131) 다음과 대 다 + +82 +00:09:14,730 --> 00:09:19,779 + 정상적으로되는 다음 분기는로 이번 분기하지만 올해를 제공하다 + +83 +00:09:19,779 --> 00:09:25,069 + 작은 중요한 대학원 수준 컴퓨터 비젼 클래스가 시프트 + +84 +00:09:25,070 --> 00:09:31,840 + 그는시 로봇 작동 누가 고통을 것입니다, 그래서 교수가 제공하는 '원인 CS2 30180 + +85 +00:09:31,840 --> 00:09:47,230 + 차원의 비전과 당신의 많은 질문을하는이 classiest 231 대 + +86 +00:09:47,230 --> 00:09:56,639 + 의 S 두 삼십 (18) 및 넓은에 관심이 있다면, 다른 하나는 알고있다 + +87 +00:09:56,639 --> 00:10:03,220 + 도구 및 컴퓨터 비전 주제의 범위뿐만 아니라 일부 + +88 +00:10:03,220 --> 00:10:11,009 + 그 릴레이 (223) 부문 로봇 비전을 올 기본적인 기본 주제 + +89 +00:10:11,009 --> 00:10:17,269 + 시각 인식 당신은을입니다 23,188에 복용을 고려해야합니다 + +90 +00:10:17,269 --> 00:10:26,039 + 더 깊이 초점을 맞추고 오늘부터로 갈 것보다 일반적인 클래스 (231) 끝 + +91 +00:10:26,039 --> 00:10:33,329 + 모두 문제 및 모델 모델의 특정 안도에 네트워크와는 + +92 +00:10:33,330 --> 00:10:38,580 + 물러 시각적 인식 대부분이지만, 물론 그들은 조금이 + +93 +00:10:38,580 --> 00:10:47,990 + 중복하지만 그 다음 분기 옆에 큰 차이의 우리 또한 가능성이 있습니다 + +94 +00:10:47,990 --> 00:10:55,590 + 고급 세미나 레벨 클래스의 몇 몇 있지만은 아직이다 + +95 +00:10:55,590 --> 00:11:01,649 + 즉 커큐민의 종류 그래서 형성은 그냥 강의 계획서를 확인 할 수 있도록 + +96 +00:11:01,649 --> 00:11:11,409 + 분할 교육 과정은 우리가 질문에 스탠포드에서 올해 제공 지금까지 네 + +97 +00:11:11,409 --> 00:11:20,879 + (131)는이 클래스에 대한 엄격한 요구 사항이 아닙니다하지만 당신은 볼 당신은했습니다 경우 + +98 +00:11:20,879 --> 00:11:25,570 + 처음으로 컴퓨터 비전 들어 본 적이 난 당신이 방식을 찾을 제안 + +99 +00:11:25,570 --> 00:11:33,830 + 이 클래스의 이해의 기본 수준을 shrooms 때문에 잡기 + +100 +00:11:33,830 --> 00:11:42,560 + 컴퓨터 비전은 그렇게에서 메모를 검색 할 수 있습니다 + +101 +00:11:42,559 --> 00:11:49,619 + 오늘은 내가 컴퓨터 비전의 아주 짧은 폭 넓은 행정 역사를 줄 것입니다 + +102 +00:11:49,620 --> 00:11:55,519 + 그리고, 우리는 조직의 관점에서 (231)과 조금 얘기하자 + +103 +00:11:55,519 --> 00:12:01,409 + 클래스의 그들은 정말 당신에 대해 컴퓨터의이 짧은 역사를 관심 + +104 +00:12:01,409 --> 00:12:07,480 + 비전 당신 때문에 당신의 관심이 주로 여기에있을 수 있습니다 알고 있기 때문에 + +105 +00:12:07,480 --> 00:12:11,990 + 이 정말 흥미 도구 깊이 호출이이 목적이며 + +106 +00:12:11,990 --> 00:12:16,370 + 클래스는 당신에게 깊이있는 모양을 제공 할 것이다 + +107 +00:12:16,370 --> 00:12:22,470 + 과의를 통해 바로 여행이 깊이 모델은하지 않고 있지만, 무엇 + +108 +00:12:22,470 --> 00:12:28,050 + 어떤이 문제에 대해 깊이 생각하지 않고 문제 도메인을 이해 + +109 +00:12:28,049 --> 00:12:37,849 + 당신이 다음 모델의 발명가로 외출하는 것이 매우 어렵다입니다 + +110 +00:12:37,850 --> 00:12:43,320 + 정말 큰 문제의 비전을 해결하거나 당신이 알고있는 개발 개발을 할 + +111 +00:12:43,320 --> 00:12:52,379 + 일반적인 문제도 심장 문제 해결에 영향력있는 작품을 만들고 + +112 +00:12:52,379 --> 00:12:58,860 + 도메인 및 모델 모델링 도구 자체는 결코 완전히 결코 + +113 +00:12:58,860 --> 
00:13:00,129 + 분리 + +114 +00:13:00,129 --> 00:13:05,360 + 상호 통보하고 깊은 학습 조금에게의 역사를 통해 볼 수 있습니다 + +115 +00:13:05,360 --> 00:13:13,000 + 고기에서 온 네트워크 아키텍처의 연합이 해결하는 것을 조금 + +116 +00:13:13,000 --> 00:13:15,289 + 시력 문제 + +117 +00:13:15,289 --> 00:13:23,449 + 비전 문제는 발전 할 계획 알고리즘을하는 데 도움이 저 돌아 왔어요 및 + +118 +00:13:23,450 --> 00:13:29,350 + 앞으로 그래서 당신은 당신이이 과정 I을 완료 할 알고에 정말 중요하다 + +119 +00:13:29,350 --> 00:13:34,300 + 당신 때문에 깊은 학습 여전히 충분한 비전을 걸 자랑스럽게 느낄 당신 + +120 +00:13:34,299 --> 00:13:39,528 + 이 헛소리 모든 설정 및 사용 방법의 심도있는 이해를 + +121 +00:13:39,528 --> 00:13:46,750 + 도구는 간단한 역사 그래서 중요한 문제를 해결하기에에에 있지만, + +122 +00:13:46,750 --> 00:13:54,149 + 우리가 거​​ (200) (540)에 다시 모든 길을 갈 것 때문에 그렇게 짧은 역사를 의미 하는가 + +123 +00:13:54,149 --> 00:14:00,110 + 왜 그랬는지이 고른 이유 만 년 전 그래서 당신은 다른 규모에 알고 + +124 +00:14:00,110 --> 00:14:09,240 + 그래서하지 않는 동안 지구 역사의이 년의 상당히 구체적인 범위는 + +125 +00:14:09,240 --> 00:14:14,049 + 이 들었을 알고 있지만 이것은 매우 호기심 기간 + +126 +00:14:14,049 --> 00:14:23,539 + 지구 역사의 생물 학자들은이 503 전에 진화의 큰 가방 전화 + +127 +00:14:23,539 --> 00:14:27,679 + 5억4천만년 전을위한 + +128 +00:14:27,679 --> 00:14:37,989 + 물은 꽤 큰 냄비의 아주 평화로운 그래서 우리는 아주 간단한 생물이 + +129 +00:14:37,990 --> 00:14:46,049 + 이건 그냥 물에 떠 동물과 방법 동부와 같다 + +130 +00:14:46,049 --> 00:14:53,838 + 지금은 매일 매일 당신이 음식 근처에 의해 제공의 어떤 종류를 이동하는 흐름을 알고있다 + +131 +00:14:53,839 --> 00:15:01,160 + 그들은 단지 자신의 입을 열어 무엇이든 자신의 집 또는 그것을 잡고 우리는하지 않습니다 + +132 +00:15:01,159 --> 00:15:09,969 + 동물의 너무 많은 다른 유형을 가지고 있지만 정말 이상한 약 540 일 + +133 +00:15:09,970 --> 00:15:18,430 + 전적으로 우리가 공부 화석에서 백만 종의 거대한 폭발있다 + +134 +00:15:18,429 --> 00:15:27,729 + 어떤 이유로 어떤 동물을위한 갑자기 같은 생물학 차 분화 + +135 +00:15:27,730 --> 00:15:35,230 + 다양 화하기 시작하고 그들이있어 복잡한 시작 2022 당신이 시작할 + +136 +00:15:35,230 --> 00:15:41,039 + 포식자와 찬양 그리고 그들은 무엇 살아남을 수있는 도구의 모든 종류가 + +137 +00:15:41,039 --> 00:15:46,698 + 사람들이 받았기 때문에 사람들의 트리거 힘이 큰 문제였다 더 않았다 + +138 +00:15:46,698 --> 00:15:53,269 + 당신은 어떤 유성 지구 또는 다른 SAT 알고하거나 환경을 알고 + +139 +00:15:53,269 --> 00:16:00,198 + 그들이 가장 설득력있는 이론의 약 얘기를 변경하면이 사람의 전화입니다 + +140 +00:16:00,198 --> 00:16:03,159 + 앤드류 파커 + +141 +00:16:03,159 --> 00:16:09,490 + 호주에서 가장 큰 호주에서 그는 그가 재미를 많이 공부 + +142 +00:16:09,490 --> 00:16:19,278 + 화석 그 이론은 제 그렇게 한 얼음의 개시 인 것을 + +143 +00:16:19,278 --> 00:16:25,688 + 시험 물린 개발하고 정말 정말 간단 나는 그것은처럼 거의이다 + +144 +00:16:25,688 --> 00:16:30,779 + 단지 빛을 포착하고 몇 가지 예측을 핀홀 카메라 + +145 +00:16:30,779 --> 00:16:34,750 + 환경에서 일부 정보를 등록 + +146 +00:16:34,750 --> 00:16:41,080 + 당신은 일단 때문에 갑자기 인생은 더 이상 이렇게 메달 없다 내가 처음 그 + +147 +00:16:41,080 --> 00:16:44,889 + 음식이 어디 당신이 실제로 알고 당신이 할 수있는 것은 당신이 패치를 갈 수있다 + +148 +00:16:44,889 --> 00:16:51,809 + 뿐만 아니라 물에 떠있는 그들을 눈 멀게 좋아하고 당신이 고양이 먹이를 갈 수 있었다 + +149 +00:16:51,809 --> 00:16:57,399 + 음식이 가장 좋은 눈을 개발하고, 그렇지 않으면 멀리에서 실행하는 것 같아요 + +150 +00:16:57,399 --> 00:17:02,590 + 그들은 당신이 알고 사라질 것이다 당신의 눈을 가졌다 모든 사람의 첫 번째이었다 있도록 + +151 +00:17:02,590 --> 00:17:11,380 + 그냥 좋아 제한된 구글 모두에서 그렇게 같은 것은 당신이 생각하는 가장 좋은 시간을 가지고 + +152 +00:17:11,380 --> 00:17:18,170 + 모든 것이 그들이 할 수 있지만, 그 거짓말 우리 것의 모든 설정을하기 때문에 + +153 +00:17:18,170 --> 00:17:28,400 + 대학의 실현은 생물학적 군비 경쟁은 모든 단일 동물에 필요 시작이다 + +154 +00:17:28,400 --> 00:17:34,170 + 생존을 위해 일을 개발하기 위해 배울 필요가 또는 당신은 당신에게 당신이 알고에 + +155 +00:17:34,170 --> 00:17:40,190 + 즉, 그래서 갑자기 육식 동물이 모든과 분화와 칭찬했다 + +156 +00:17:40,190 --> 00:17:47,870 + 하나의 비전 540,000,000년 시작 및뿐만 아니라 종교는 시각적 시작 하나 + +157 +00:17:47,869 --> 00:17:53,189 + 종 분화의 또는 큰 팬의 주요 원동력의 + +158 +00:17:53,190 --> 00:17:58,980 + 진화 또는 정말 그래서 우리는 너무 많은 세부 사항에 대한 않을거야 가을의 진화있어 + +159 +00:17:58,980 --> 00:18:08,710 + 비전 엔지니어링 주위에 일어난 또 다른 큰 중요한 일 + +160 +00:18:08,710 --> 00:18:19,220 + 르네상스와는 물론 그 전에 
너무 놀라운 사람에 의한 것 + +161 +00:18:19,220 --> 00:18:23,740 + 다른 노래는 유럽에 아시아에서 인류 문명에 걸쳐 알고 + +162 +00:18:23,740 --> 00:18:30,400 + 아리스토텔레스는이 때문에 인도는 아랍어 세계에 우리는 카메라의 모델을 보았다 + +163 +00:18:30,400 --> 00:18:36,360 + 철학자 모세가 제안 잎을 중국어 통해 카메라를 제안 + +164 +00:18:36,359 --> 00:18:40,939 + 전체와 함께 상자를 통해 카메라 만 + +165 +00:18:40,940 --> 00:18:47,750 + 첫 번째 문서 정말 현대 찾고 카메라를 보면 그것이라고 + +166 +00:18:47,750 --> 00:18:49,180 + 카메라 옵스큐라 + +167 +00:18:49,180 --> 00:18:56,610 + 그리고 그 레오나르도 다빈치에 의해 설명되어 나는 세부 사항에 들어갈 거 아니에요 + +168 +00:18:56,609 --> 00:19:07,240 + 그러나 이것은 당신이 전체의 몇 가지 종류가 존재한다는 생각을 알고있다 + +169 +00:19:07,240 --> 00:19:12,240 + 캡처 빛이 현실 세계에서 반사 된 후 어떤 종류의가 + +170 +00:19:12,240 --> 00:19:20,319 + 보호되도록 실제 이미지의의 정보를 캡처 + +171 +00:19:20,319 --> 00:19:27,779 + 즉, 당신이 알고있는 현대 엔지니어링의 시작이다 + +172 +00:19:27,779 --> 00:19:36,170 + 비전이 세상을 복사 한 팀과 시작의 복사본을 만들고 싶었다 + +173 +00:19:36,170 --> 00:19:42,350 + 시각 세계는은을 설계하고자 어디서나 가까운 사라되지 않았습니다 + +174 +00:19:42,349 --> 00:19:46,879 + 시각 세계의 이해는 지금 우리는 단지 복제에 대해 얘기하고 + +175 +00:19:46,880 --> 00:19:53,760 + 그래서 시각적 세계는 하나의 중요한 기억해야 할 일과 물론 이후의 + +176 +00:19:53,759 --> 00:20:01,299 + 우리는 우리가 성공의 전체 시리즈를 참조하기 시작 우리하지만 카메라 옵스큐라 + +177 +00:20:01,299 --> 00:20:07,539 + 모두가 어떤 영화는 코닥처럼 알고 개발되는 것은 최초의 일이었다 + +178 +00:20:07,539 --> 00:20:12,329 + 회사는 상업 카메라를 개발하고 우리는 캠코더를 가지고 시작 + +179 +00:20:12,329 --> 00:20:21,889 + 과 및 모든 작업이 매우 중요 중요한 부분 당신을 원하는 + +180 +00:20:21,890 --> 00:20:28,050 + 비전 학생이 절대적으로 아무것도 엔지니어링 작품으로 알고 있어야하지만, + +181 +00:20:28,049 --> 00:20:32,710 + 당신은 질문을 시작하고 과학 연구의 과학 조각을 생각 + +182 +00:20:32,710 --> 00:20:38,130 + 우리의 생물학적 비주얼 작업 당신이 우리를 알고 가져 않는 방법입니다 + +183 +00:20:38,130 --> 00:20:45,760 + 우리는 지금 정말에 도착하는 진화의 5억4천만년했다 것을 알고있다 + +184 +00:20:45,759 --> 00:20:54,579 + 환상적인 비주얼 인간의 시스템 만 한 일이 진화는이 기간 동안 수행 + +185 +00:20:54,579 --> 00:21:01,759 + 오늘 간단 삼엽충에서 건축의 어떤 종류를 개발 않았다 + +186 +00:21:01,759 --> 00:21:07,950 + 작업의 매우 중요한 부분 하버드에서 일어난 당신과 나의 동안 거짓말 + +187 +00:21:07,950 --> 00:21:12,690 + 그 시간에 야심 찬 아주 젊은 너무 젊은는에서 탑승자를 가져옵니다 + +188 +00:21:12,690 --> 00:21:21,500 + 그들이 무슨 짓을했는지 차량은 다음 깨어 있지만 마취 고양이를 사용한다는 것입니다 + +189 +00:21:21,500 --> 00:21:28,529 + 를 밀어이 작은 바늘 전극을 구축하는 기술은 없었다 + +190 +00:21:28,529 --> 00:21:35,129 + 상기하여 두개골을 통해 전자가의 지참까지 열려 + +191 +00:21:35,130 --> 00:21:42,180 + 우리는 이미 일차 시각 피질의 차를 올 알고있는 지역으로 절단 + +192 +00:21:42,180 --> 00:21:49,490 + 시각 피질 영역은 영상 처리하지만, 전에 일을 많이 할 + +193 +00:21:49,490 --> 00:21:54,779 + 우리가 정말 모르는 비자는 시각 주요 내용 피질 겨울 눈이 될 것입니다 + +194 +00:21:54,779 --> 00:22:02,369 + 사용자 인터페이스의 초기 단계 중 하나는 물론이지만 시각에 대한 초기 단계 + +195 +00:22:02,369 --> 00:22:07,299 + 처리는 다음의 비전에 우리 작업 톤 뉴 올리언스의 톤이있다 + +196 +00:22:07,299 --> 00:22:12,419 + 그 비전의 시작이기 때문에 정말이 무엇인지 알고 우리를 변경 + +197 +00:22:12,420 --> 00:22:20,300 + (가) 그렇게 가져 시각 과정은 그들이 차에이 전극을 넣어 + +198 +00:22:20,299 --> 00:22:25,930 + 시각 피질이 흥미로운 내가 떨어 뜨리지 않는 또 다른 흥미로운 사실​​ 내 + +199 +00:22:25,930 --> 00:22:34,880 + (가) 첫째가되는 것을 온 있는지 아마 시각 피질에 대한 물건 + +200 +00:22:34,880 --> 00:22:40,910 + 당신의 대뇌 피질의 시각 처리 단계의 아주 아주 거친 검사 응급 처치입니다 + +201 +00:22:40,910 --> 00:22:47,180 + 당신이 근처에없는 가져 뒷면에 나는 그것이 있기 때문에 매우 흥미로운 알고 + +202 +00:22:47,180 --> 00:22:51,788 + 대뇌 피질의 처리에 자신의 공장이 맞다 + +203 +00:22:51,788 --> 00:22:58,519 + 그녀의 코 뒤에 청각 바로 매년 그러나 주 뒤에 + +204 +00:22:58,519 --> 00:23:05,798 + 시각 피질은 눈에서 가장 먼과에서 그 또 다른 매우 흥미로운 일이다 + +205 +00:23:05,798 --> 00:23:11,099 + 사실뿐만 아니라 기본 비전 거의 50 %의 작업 거대한 지역에있다 + +206 +00:23:11,099 --> 00:23:17,888 + 당신의 두뇌는이이 어려운 가장 중요한 사랑의 부서입니다 + +207 +00:23:17,888 --> 00:23:22,608 + 감각 지각인지 휴식 시간에 시스템 난 아무것도 말하고 있지 않다 + +208 +00:23:22,608 --> 00:23:29,839 + 다른 않습니다 분명히 도움이되지 
않습니다하지만이 개발이 긴의 특성을 + +209 +00:23:29,839 --> 00:23:37,579 + 이 감각 시스템은 그것을 할 병력에게 공간이 많은 현실 소요 + +210 +00:23:37,579 --> 00:23:43,148 + 너무 중요하기 때문에 왜 시스템에 사용 그렇게 빌어 먹을 하드 그건입니다 + +211 +00:23:43,148 --> 00:23:50,959 + 우리는 인간의 이성을 다시 얻을 필요가 왜 그들이 싶어 정말 야심했다 + +212 +00:23:50,960 --> 00:23:56,028 + 이의 시작이기 때문에 일차 시각 피질이 무엇을하고 있는지 알고 우리의 + +213 +00:23:56,028 --> 00:24:02,878 + 다음 소셜 깊은 학습 신경망 고양이에 대한 지식에 고양이를 넣어 + +214 +00:24:02,878 --> 00:24:07,709 + 나는 당신의 기록 말할 때이 방은 그들이 당신의 활동을 기록했다 + +215 +00:24:07,710 --> 00:24:11,659 + 내가 넣어 경우 활동 공정한 재판이 기본적으로 당신이 알고 보려고 + +216 +00:24:11,659 --> 00:24:18,059 + 여기에 새로운 사무실 등 그들이 볼 수있는 새 집 화재 전극 + +217 +00:24:18,058 --> 00:24:25,308 + 그래서 예를 들어 뭔가 내가 보여 주었다 경우 그들이 자신의 아이디어를 고양이를 보여 주면 그들이 보여 주면 + +218 +00:24:25,308 --> 00:24:30,519 + 당신이 그 때 분명히 알고 물고기의이 종류보다는 물고기를 먹고 온다 + +219 +00:24:30,519 --> 00:24:42,019 + 여기에 노란색 행복과 스파이크와 같은 더 나는 없다 고양이와 함께이 존재 + +220 +00:24:42,019 --> 00:24:48,128 + 과학적 발견의 이야기는 과학적 발견이 소요 행운 모두와 + +221 +00:24:48,128 --> 00:24:52,449 + 관심과 배려 그들은 나타내었다 + +222 +00:24:52,450 --> 00:24:58,740 + 어떤 마우스 꽃 단지 새로운 고양이를 작동 차에하지 않습니다 + +223 +00:24:58,740 --> 00:25:02,839 + 시각 피질은 급상승가 없었다 침묵 + +224 +00:25:02,839 --> 00:25:09,079 + 거기에 약간의 스파이크 정말 좌절했다 그러나 좋은 소식은 있다는 것입니다 + +225 +00:25:09,079 --> 00:25:14,509 + 그들이 우리를 보였다 때이해야 할 무엇 때문에 그 시간에는 컴퓨터가 없었다 + +226 +00:25:14,509 --> 00:25:21,740 + 고양이는 그들이 그의 발 슬라이드 넣을 수 있도록 약간의 보호를 사용해야 할 것이다 + +227 +00:25:21,740 --> 00:25:26,799 + 새로운 강요하며 자전거가 걸릴 경우 물고기의 다음 스파이크에 새로운까지 기다려 + +228 +00:25:26,799 --> 00:25:29,960 + 슬라이드 아웃 다른 약간을 넣어 + +229 +00:25:29,960 --> 00:25:38,630 + 이 영화는 내가 당신이 기억하지 알고처럼이 좋아하는 것주의 사항 + +230 +00:25:38,630 --> 00:25:46,890 + 당신이처럼 알고 이상한 그 무엇이든 도우루가 스파이크 래셔 필름을 사용 + +231 +00:25:46,890 --> 00:25:51,940 + 실제 마우스 공식 꽃 새가 새 역할을 자극 운전하지 않았지만 + +232 +00:25:51,940 --> 00:25:59,759 + 상기 아웃 슬라이드를 복용의 이동도 그는 흥분 않았다 슬라이딩 한 수 + +233 +00:25:59,759 --> 00:26:03,140 + 나는 촉매가 될 수있는 새로운 당신이 마침내 새를 변경하는 것 같아요 + +234 +00:26:03,140 --> 00:26:13,410 + 그것은이 생활에 의해이 생성되어 있도록 나를 위해 새로운 객체를 알고 + +235 +00:26:13,410 --> 00:26:18,240 + 그들은 그것이 정사각형, 직사각형 판이야 무엇이든 바로 슬라이드를 변경하고 + +236 +00:26:18,240 --> 00:26:28,120 + 것을 에지 또는 흥분 뉘앙스를 이동 그들은 정말 후 맛을하고 그래서 + +237 +00:26:28,119 --> 00:26:34,859 + 너무 좌절하거나 놓친 것 인 경우에 관찰 당신은 알고 + +238 +00:26:34,859 --> 00:26:41,359 + 하지만 그들은 그들이 정말로 그 후 맛과 새로운 노래를 실현하고 아니에요 + +239 +00:26:41,359 --> 00:26:48,279 + 차 시각 피질은 열에 새의 모든 컬럼에 대해 구성되어 있습니다 + +240 +00:26:48,279 --> 00:27:01,309 + 앨리스는이 바의 특정 방향을 오히려보고 싶습니다 + +241 +00:27:01,309 --> 00:27:02,980 + 피셔 마우스보다 + +242 +00:27:02,980 --> 00:27:07,519 + 당신은 여전히​​ 수많은 있기 때문에 나는 간단한 이야기​​ 좀 해요 알고 + +243 +00:27:07,519 --> 00:27:10,940 + 차 시각 피질 우리는 그들이 단순한 마음에 안 좋아하는지 모르겠어요 + +244 +00:27:10,940 --> 00:27:17,570 + 지향하지만 인간의 방문자 대형하여 처음의 발견 + +245 +00:27:17,569 --> 00:27:23,779 + 시각 처리는 전체적인 생선이나 악의 시각의 시작되지 않습니다 + +246 +00:27:23,779 --> 00:27:29,178 + 처리는 세계의 간단한 구조입니다 + +247 +00:27:29,179 --> 00:27:40,890 + 배향이 매우 깊은 깊이 영향뿐만 아니라 징후이며 + +248 +00:27:40,890 --> 00:27:47,870 + 우리는 우리의 딜러 네트워크 기능을 시각화 할 때 엔지니어링 모델링은 나중에 야 + +249 +00:27:47,869 --> 00:27:57,069 + 우리의 모델 우리에서 신흥 구조 등의 간단한 표시되고 + +250 +00:27:57,069 --> 00:28:03,298 + 발견했다하더라도 나중에 쉰과 60 년대 초반은 그들이 원 + +251 +00:28:03,298 --> 00:28:12,039 + 1981 년이 작품에 대한 노벨 의학 가격은 그래서 또 다른 매우 중요 + +252 +00:28:12,039 --> 00:28:25,928 + 즉 다른, 그래서 작품의 조각의 비전과 시각적 처리와 관련된 + +253 +00:28:25,929 --> 00:28:35,620 + 재미있는 이야기 현대 필드와 컴퓨터 비전의 전구체이었다 + +254 +00:28:35,619 --> 00:28:42,779 + 이 블록의 세계라고 1963 년 래리 로버츠에 의해이 특정 손실 + +255 +00:28:42,779 --> 00:28:49,889 + 그는 단지 휴 물린 비자로 우리의 시각 세계 있음을 발견하고 우리의 + 
+256 +00:28:49,890 --> 00:29:00,380 + 뇌는 빠르면 박사와 같은 구조 래리 로버츠처럼 간단한에 의해 구성되어있다 + +257 +00:29:00,380 --> 00:29:06,350 + 학생들은 다음과 같은 구조를 추출하고있었습니다 + +258 +00:29:06,349 --> 00:29:08,980 + 이미지 + +259 +00:29:08,980 --> 00:29:16,210 + 이 특정한 경우에 엔지니어링 작업의 조각 같은 그의 목표는 것입니다 + +260 +00:29:16,210 --> 00:29:22,210 + 당신은 모두 당신을 알고 인간은 아무리 그것이 얼마나 블록을 인식 할 수 없습니다하지로 + +261 +00:29:22,210 --> 00:29:28,009 + 우리가 성자 블록이 두 알고 같은 블록도 있습니다 오른쪽과 같이 설정 + +262 +00:29:28,009 --> 00:29:33,019 + 조명이 변화하고 있지만 방향이 변경 그는 국면이다 + +263 +00:29:33,019 --> 00:29:40,720 + 우리가 우리에게 생각하는 것처럼이 이것을 정의 가장자리는 것이다 + +264 +00:29:40,720 --> 00:29:46,419 + 구조 가장자리가 법률의 형태를 무시하고는 변경하지 않는 것이 + +265 +00:29:46,419 --> 00:29:53,290 + 래리 로버츠 도로 박사 학위 논문에 아주 관련된 모든 내부 물건 + +266 +00:29:53,289 --> 00:29:59,250 + 그냥 있는지 알이 가장자리를 추출 박사 과정 학생 컴퓨터로 작업 + +267 +00:29:59,250 --> 00:30:03,990 + 이 학부 컴퓨터 비전처럼 알처럼이 비전은 것 + +268 +00:30:03,990 --> 00:30:10,210 + 박사 학위 논문을했지만, 즉 제 1 전구체 컴퓨터 비전 박사이었다되었습니다 + +269 +00:30:10,210 --> 00:30:18,819 + 그는 이후 자신의 컴퓨터 비전을 포기 관심 때문에 로버트 같은 논문 + +270 +00:30:18,819 --> 00:30:27,189 + 과 DARPA는 내가 너무 심하게하여 우리가하지 않은 인터넷의 발명자 중 하나였다 + +271 +00:30:27,190 --> 00:30:34,490 + 컴퓨터 비전을 포기하지만 우리는 항상 좋아하는 말을 해당 컴퓨터의 탄생 + +272 +00:30:34,490 --> 00:30:43,960 + 현대 필드와 비전은 1966 년 여름에 1966 MIT의 여름 + +273 +00:30:43,960 --> 00:30:49,548 + 인공 지능 연구소는 하나 실제로 그 전에 설립되었다 + +274 +00:30:49,548 --> 00:30:55,819 + 역사의 조각이 학생에 대한 그들의 자부심을 느껴야한다이 두 가지가 있습니다 + +275 +00:30:55,819 --> 00:31:02,579 + 초기에 세계에 설립 선구적인 인공 지능 연구실 + +276 +00:31:02,579 --> 00:31:10,329 + 존 맥카시에 의해 MIT 하나에서 마빈 민스키에 의해 1960 년대 하나에 돌에 대한 고통 + +277 +00:31:10,329 --> 00:31:15,369 + 인공 지능 연구실은 컴퓨터 과학 전에 설립되었다 + +278 +00:31:15,369 --> 00:31:21,479 + 설립 내가 사랑하는 부서 및 교수 존 맥카시는 하나입니다 + +279 +00:31:21,480 --> 00:31:22,490 + 에 대한 책임 + +280 +00:31:22,490 --> 00:31:26,450 + 그 문제의 조금 그래서 용어 인공 지능 + +281 +00:31:26,450 --> 00:31:31,720 + stempler 역사 어쨌든 우리는 분야를 시작하는 우리에게 신용을 제공해야 + +282 +00:31:31,720 --> 00:31:41,380 + 컴퓨터 비전이 때문에 1966 년 여름에 MIT의 교수 나는 그것의 결정 + +283 +00:31:41,380 --> 00:31:46,630 + 시간은 그래서 우리가 이해하기 시작 설립되었다 알고 구제하기 + +284 +00:31:46,630 --> 00:31:55,010 + 나는 이것이 아마 그 시간에 어쨌든 발명 입증한다 생각 + +285 +00:31:55,009 --> 00:32:01,109 + 비전 당신이 당신의 눈을 너무 쉽게 열려있는이 사랑의 하나가 될 수있는 방법을 세상을보고 + +286 +00:32:01,109 --> 00:32:04,109 + 여름 그래서 + +287 +00:32:04,109 --> 00:32:18,729 + 그래서 여름 비전 프로젝트는 우리의 시각 시스템이 사용하기위한 시도이다 + +288 +00:32:18,730 --> 00:32:24,329 + 마지막 번호의 제안이었다 어쩌면 그들은 그들의 여름 작업을 사용하지 않은 + +289 +00:32:24,329 --> 00:32:30,490 + 효과적으로하지만 어떤 경우에 어떻게 개인이 그 실버 해결되지 않았다 + +290 +00:32:30,490 --> 00:32:35,740 + 그 이후로 그들은 컴퓨터 비전 및 I의 가장 빠르게 성장하는 분야가 될 + +291 +00:32:35,740 --> 00:32:43,679 + 오늘의 프리미엄 컴퓨터 비전 컨퍼런스에 가면 CPR 또는 우리 ICC 비용 + +292 +00:32:43,679 --> 00:32:52,160 + 2000 2500에 대한 연구가 전 세계적으로이 회의에 참석처럼 가지고 + +293 +00:32:52,160 --> 00:33:00,620 + 매우 실용적인 노트 (44) 학생 당신은 좋은 컴퓨터 비전 / 기계 경우 + +294 +00:33:00,619 --> 00:33:05,369 + 학생들이 학습 당신은 실리콘 밸리 나 또는 작업에 대해 걱정하지 않습니다 + +295 +00:33:05,369 --> 00:33:11,569 + 다른 곳에서는 실제로 가장 흥미로운 분야 중 하나입니다하지만 그래서이었다 + +296 +00:33:11,569 --> 00:33:19,210 + 올해 의미 컴퓨터 비전의 탄생의 50 주년입니다 + +297 +00:33:19,210 --> 00:33:25,829 + 우리가이 컴퓨터 비전 I에서 매우 흥미로운 년의 컴퓨터 비전 + +298 +00:33:25,829 --> 00:33:28,529 + 발신자 사람 오래 오래 방법 + +299 +00:33:28,529 --> 00:33:31,660 + 확인 그래서 컴퓨터 비전의 계속 + +300 +00:33:31,660 --> 00:33:38,169 + 이 그가 그가 그 시간에 MIT에서도했다 데이비드 마크를 기억하는 사람 + +301 +00:33:38,169 --> 00:33:50,240 + 사망 시몬 토미 토미 교황 질 마크 자신의 번호와 작업 + +302 +00:33:50,240 --> 00:33:58,808 + 초기 70 년대의 비전라는 매우 영향력있는 책은 매우 책입니다 + +303 +00:33:58,808 --> 00:34:08,148 + 비전에 대한 스마트 생각 그는 당신의 표지판 통찰력을 많이했다 
그가 어디 + +304 +00:34:08,148 --> 00:34:14,868 + 그는 우리에게 단순한 구조 영역의 개념을 제공 합리적인한다고 말했다 + +305 +00:34:14,869 --> 00:34:16,539 + 시작 + +306 +00:34:16,539 --> 00:34:23,259 + 오늘의 간단한 구조와 전체적인 물고기로 시작하거나 전체적인 마크 아니었다 + +307 +00:34:23,260 --> 00:34:28,679 + 함께 인 우리에게 다음으로 중요한 통찰력과 두 내부를 제공 + +308 +00:34:28,679 --> 00:34:35,740 + 그 비전은 깊은 학습 아키텍처입니다 시작하는 당신이 알고있는 계층이다 + +309 +00:34:35,739 --> 00:34:44,029 + 그래서 당신은 쉽게 말한 것입니다 확인 우리는 간단하게 시작하지만,이 세계 주요 매우입니다 + +310 +00:34:44,030 --> 00:34:49,540 + 사실 복잡한 내 에펠 오늘 사진을 일반 사진을 촬영 + +311 +00:34:49,539 --> 00:34:58,309 + 내 아이폰의 해상도의 그것의 같은 켜져를 배기를 가정하자 전혀 없다 + +312 +00:34:58,309 --> 00:35:05,059 + 그 픽셀 또는 그림의 잠재적 인 조합은 전체보다 크다 + +313 +00:35:05,059 --> 00:35:11,429 + 얼마나 복잡한 비전의 우주에있는 원자의 수는 그것의의입니다 + +314 +00:35:11,429 --> 00:35:18,539 + 정말 정말 복잡한 인간은 간단한 데이비드 마크는 이야기입니다 우리에게있다 + +315 +00:35:18,539 --> 00:35:25,130 + 우리는 물론 계층 적 모델을 구축이 마크에 구축 알려하지 않았다 + +316 +00:35:25,130 --> 00:35:29,400 + 분기의 나머지 다룰 것 네트워크의 연합 있지만, + +317 +00:35:29,400 --> 00:35:36,990 + 자신의 아이디어를 표현하기 위해 또는 우리가 그것에 대해 생각하는 이미지 그것에 대해 생각하는 것입니다 + +318 +00:35:36,989 --> 00:35:42,129 + 여러 레이어 그는 우리가 그 가장자리 이미지에 대해 생각해야한다고 생각 첫 번째 + +319 +00:35:42,130 --> 00:35:49,110 + 이는 분명히 영감이 주목 오 이들로부터 영감을 가져다가 + +320 +00:35:49,110 --> 00:35:52,579 + 그는 개인적으로이 시원 스케치라고 + +321 +00:35:52,579 --> 00:35:55,730 + 당신의 이름은 소피가 그것을 설명 알고 + +322 +00:35:55,730 --> 00:36:02,400 + 모든 이제 다음은이 작업이 약 1/2 생각 설​​명 + +323 +00:36:02,400 --> 00:36:08,829 + 당신이 인식 3D 세계로 2D 이미지를 조정 릭에 시작 + +324 +00:36:08,829 --> 00:36:15,679 + 레이어 바로 지금 당장 당신의 절반 만지지하고있다 생각하지 않는다 당신을보고 + +325 +00:36:15,679 --> 00:36:17,239 + 목 + +326 +00:36:17,239 --> 00:36:22,799 + 그게 내가이 표시되는 모든 비록 당신이 모든 행에 체결 거 알아 + +327 +00:36:22,800 --> 00:36:29,680 + 당신이 문제의 전면 해결하기 위해 문제를 게시 할 예정입니다 + +328 +00:36:29,679 --> 00:36:38,118 + 자연은 광범위한 차원 이미지의 2D 때문에 해결하기 위해 확률값 반대하는 것으로했다 + +329 +00:36:38,119 --> 00:36:45,210 + 자연은 내 첫 번째 하드 작업 트릭은 우리가 그들이 하나를 사용했던 아이스하는 것을보고 + +330 +00:36:45,210 --> 00:36:49,389 + I하지만 거 야를 돌출하는 괭이 소프트웨어 트릭의 전체 무리가있을 수있어 + +331 +00:36:49,389 --> 00:36:53,868 + 컴퓨터 비전과 같은 일 때문에 두 눈의 형성과 더스 우리 + +332 +00:36:53,869 --> 00:36:59,280 + 그것도를 해결하고 차에 문제가있는 그리고 그들은 결국 우리에게있다 + +333 +00:36:59,280 --> 00:37:03,180 + 우리가 실제로 좋은 3D 모델을 함께 있도록 모든 것을 넣어 + +334 +00:37:03,179 --> 00:37:08,629 + 세계는 왜 우리가 살아남을 가지고 우리가 세계의 3D 모델을해야합니까 + +335 +00:37:08,630 --> 00:37:15,309 + 나는 손을 흔들 때 내가 정말 알아야 할 세계를 조작 이동 + +336 +00:37:15,309 --> 00:37:16,509 + 당신은 알고 어떻게 + +337 +00:37:16,510 --> 00:37:22,320 + 내 손을 외부와가의 3 차원 모델링입니다 올바른 방법을 향하고 잡아 + +338 +00:37:22,320 --> 00:37:26,000 + 세계 그렇지 않으면 나는 때 올바른 방법으로 당신의 머리를 잡아 할 수 없습니다 + +339 +00:37:26,000 --> 00:37:34,219 + 그건 그래서, 그래서 그 데이비드 마르의의 찻잔에게 같은 일을 데리러 + +340 +00:37:34,219 --> 00:37:39,899 + 높은 수준의 추상적 인 아키텍처의 비전 아키텍처 그것을 + +341 +00:37:39,900 --> 00:37:45,490 + 정말 수학적 모델링 정확히 어떤 종류의 정보를 통보하지 않습니다 우리는해야 + +342 +00:37:45,489 --> 00:37:51,439 + 그것은 학습 과정의 정보를 통보하지 않으며, 그들은 정말 않습니다 + +343 +00:37:51,440 --> 00:37:55,599 + 우리는 깊은 학습을 통해에 도착합니다 추론 절차 + +344 +00:37:55,599 --> 00:38:02,759 + 그 단어 아키텍처하지만이 아닌 그의 중요의 높은 수준의보기이다 + +345 +00:38:02,760 --> 00:38:06,250 + 그것은 배울 수있는 중요한 개념이다 + +346 +00:38:06,250 --> 00:38:08,619 + 구상 우리는 이것을 호출 + +347 +00:38:08,619 --> 00:38:16,859 + 표현 정말 중요한 작업 및이에 약간의 물건 첫번째 여행이다 + +348 +00:38:16,860 --> 00:38:25,180 + 다만 즉시이에 대해 생각이 중요한 방법을지도로 보여 + +349 +00:38:25,179 --> 00:38:31,879 + 영상 인식 알고리즘의 첫 번째 물결은 3D 모델 이후 갔다 + +350 +00:38:31,880 --> 00:38:38,280 + 그 오른쪽에 상관없이 같은 목표이기 때문에 어떻게 단계을 나타냅니다 + +351 +00:38:38,280 --> 00:38:45,519 + 여기에 목표는 인식 개체를 복원하는 것입니다이 정말 합리적이다 + +352 +00:38:45,519 
--> 00:38:52,380 + 우리는 당신의 일에 그렇게 이들 모두를 세계로 이동 할 때이 있기 때문에 + +353 +00:38:52,380 --> 00:38:58,829 + 팔로 알토 (Palo Alto)에서 유래 합계 41까지로 투자 수익 (ROI) 상투 메에서 그 중 하나는 예전 + +354 +00:38:58,829 --> 00:39:00,440 + 스탠포드 교수 + +355 +00:39:00,440 --> 00:39:05,760 + 나는 그와 그의이 직접 브룩스가 처음으로 11을 제안 사랑 + +356 +00:39:05,760 --> 00:39:10,430 + 살루 모델까지 일반화 된 소위 아니에요거야 세부 사항에 들어가 있지만, + +357 +00:39:10,429 --> 00:39:17,129 + 아이디어는 세상이 같은 간단한 형태로 구성되어 있다는 것입니다 + +358 +00:39:17,130 --> 00:39:23,150 + 블록을 궁금해하고 실제 세계의 객체는이 단지 조합 + +359 +00:39:23,150 --> 00:39:28,340 + 간단한 형태는 특정 느낌을 주어 이동이 매우이었다 + +360 +00:39:28,340 --> 00:39:37,970 + 70 년대 영향력있는 시각적 인식 모델이되기 위해 계속 + +361 +00:39:37,969 --> 00:39:47,239 + MIT 연구소의 이사 그는 또한 아이 로봇 회사 룸바의 창립 멤버였다 + +362 +00:39:47,239 --> 00:39:51,379 + 이 모든 그래서 그래서 그는 매우 영향력을 계속했다 + +363 +00:39:51,380 --> 00:39:56,930 + 나는 일을하고 아무도 흥미로운 모델은 지역에서 오는 + +364 +00:39:56,929 --> 00:40:05,009 + 연구소는 나는 엘 카미노이 인에서 나는 길 건너 본 것 같아요 + +365 +00:40:05,010 --> 00:40:15,260 + 화보 구조 모델은 확률의 차원 맛이 덜하지만 더있다 + +366 +00:40:15,260 --> 00:40:21,570 + 맛은 개체가 여전히 간단한 부분 만들어진 것입니다 + +367 +00:40:21,570 --> 00:40:28,059 + 같은 사람의 머리는 눈, 코 또는 입 만들어진 부품은 CuMn되어 있었다 + +368 +00:40:28,059 --> 00:40:34,679 + 확인 우리의 감각을 받고 일부 변형을 허용 스프링에 의해 행동 + +369 +00:40:34,679 --> 00:40:40,069 + 세계를 인식하지 당신의 모든 하나는 정확히 같은 눈을 가지고 + +370 +00:40:40,070 --> 00:40:45,150 + 눈 사이의 거리 때문에이 드문 변화의 어떤 종류의 수 + +371 +00:40:45,150 --> 00:40:50,450 + 변화의 시작의 개념이 같은 모델에 도입하려면 및 + +372 +00:40:50,449 --> 00:40:56,309 + 이가 너무 나는 당신을 보여주고 싶은 이유를 알고이 같은 모델을 사용하여 + +373 +00:40:56,309 --> 00:41:02,710 + 표시 방법 애타게이 있었던 최악의 단순 가장 영향력있는 중 하나였다 + +374 +00:41:02,710 --> 00:41:09,670 + 실제 개체와 전체 용지를 인식하는 80 년대 모델 + +375 +00:41:09,670 --> 00:41:18,900 + 실세계의이 겉보기 사용자이지만 모서리를 사용하여 간단한 + +376 +00:41:18,900 --> 00:41:26,010 + 따뜻한 모양이지만 서로 다른 재료 또는하여이를 인식하는 에지 + +377 +00:41:26,010 --> 00:41:33,980 + 졸업 그건 그래서 그 컴퓨터 비전의 입사 세계의 종류의의 + +378 +00:41:33,980 --> 00:41:39,699 + 바람 것은 흑백 또는 합성 이미지가 시작 본적이되고 + +379 +00:41:39,699 --> 00:41:46,529 + 구십 우리는 마침내 컬러 현실 세계의 이미지를 좋아하는 이동하기 시작하고 + +380 +00:41:46,530 --> 00:41:55,210 + 다시 큰 변화 여기 매우 매우 영향력있는 작업이되지이었다 + +381 +00:41:55,210 --> 00:42:01,150 + 객체를 인식하는 것은 대해 특히에 대해 어떻게은을 개척 좋아합니까 + +382 +00:42:01,150 --> 00:42:08,990 + 당신이이 방을 입력하면 합리적인 부분으로 이미지를 바로 그래서 방법 당신이 없습니다 + +383 +00:42:08,989 --> 00:42:15,559 + 시각 시스템은 단지 그룹이되었다 내가 이렇게 많은 사진을 볼 세상에 당신을 말할 것입니다 + +384 +00:42:15,559 --> 00:42:22,259 + 일이 당신은 헤드 헤드 영토 의자 무대 플랫폼 조각이 참조 + +385 +00:42:22,260 --> 00:42:26,640 + 가구이 가장 오래된에 지각 그룹화 지각이라고 + +386 +00:42:26,639 --> 00:42:28,309 + 나 중 하나로 그룹화 + +387 +00:42:28,309 --> 00:42:34,779 + 우리가하지 않으면 가장 중요한 문제는 생물학적 또는 인공 구상 + +388 +00:42:34,780 --> 00:42:39,420 + 그들은이 지각 그룹핑 문제를 해결하는 방법을 알고 + +389 +00:42:39,420 --> 00:42:46,690 + 정말 하드 시간은 깊이 시각적 세계를 이해하고 단어 수 없습니다 + +390 +00:42:46,690 --> 00:42:53,450 + 정지하지으로 기본이 수업이 과정에 문제의 끝 + +391 +00:42:53,449 --> 00:42:57,859 + 우리는 많은 진전 이전을 한 경우에도 컴퓨터 비전에 해결 + +392 +00:42:57,860 --> 00:43:04,390 + 우리는 여전히 문제로 최종 솔루션을 파악하고 deplaning 이후에 출발하는 + +393 +00:43:04,389 --> 00:43:10,650 + 나는이 소개에서 당신을주고 싶어 왜이 같은 그래서 이것은 다시 I입니다 + +394 +00:43:10,650 --> 00:43:16,950 + 당신은 또한 깊은 문제를 회피하고 당시 알고 있어야하는 그들은 + +395 +00:43:16,949 --> 00:43:22,730 + 에 상기 도전 우리에도 모든 문제를 해결할 수없는 구상 + +396 +00:43:22,730 --> 00:43:29,079 + 올가미는 우리가 개발 터미네이터에서 멀리있는 것처럼 당신이 알고있는이야 어떤 사람 수 + +397 +00:43:29,079 --> 00:43:34,860 + 모든 것을 할 일이 조각이 정상화 컷이라고 있도록 중 하나입니다 무엇인가 + +398 +00:43:34,860 --> 00:43:42,390 + 에 실제 이미지와 시도 걸리는 최초의 컴퓨터 비전 작업 + +399 +00:43:42,389 --> 00:43:52,420 + 고위 컴퓨터 비전 연구원이 교수에 지금 문제를 해결 + +400 +00:43:52,420 --> 
00:43:56,000 + 버클리 또한 스탠포드 졸업 + +401 +00:43:56,000 --> 00:44:01,989 + 결과는 내가이 클래스의 모든 침전을 포함하지 않습니다 좋은하지 않습니다 + +402 +00:44:01,989 --> 00:44:08,459 + 당신이 보는 곳에서 우리는 진전이 있지만, 이것은의 시작입니다 + +403 +00:44:08,460 --> 00:44:15,510 + 불러 및 지불 내가하고 싶은 또 다른 매우 캐주얼 작업이 원하는 + +404 +00:44:15,510 --> 00:44:22,410 + 비록이 작품에 대한 찬사 우리는의 나머지 부분을 커버하지 않는 + +405 +00:44:22,409 --> 00:44:26,679 + 물론하지만 난 당신이 될 꽤 중요한 비전 학생이 생각 + +406 +00:44:26,679 --> 00:44:31,199 + 이 알고 있기 때문에뿐만 아니라 우리가 원하는 중요한 문제를 소개합니다 + +407 +00:44:31,199 --> 00:44:36,730 + 그것을 해결하는 것은 또한 당신에게 필드하자의 발전의 관점을 제공합니다 + +408 +00:44:36,730 --> 00:44:40,480 + 작업이 호출 빌라 존스 얼굴 검출기 + +409 +00:44:40,480 --> 00:44:46,030 + 이 때문에 대학원생 신선한​​ 대학원 학생으로 내 마음을 매우 사랑이다 + +410 +00:44:46,030 --> 00:44:51,650 + 칼 테크에서 그것은 내가 대학원생 때과 같이 첫 번째 논문의 하나 + +411 +00:44:51,650 --> 00:44:56,150 + 나는이 내 고문 그것에 대해 아무것도 모르는 내가 실험실까지 + +412 +00:44:56,150 --> 00:45:02,090 + 당신은 우리 모두가 그들을 이해하려는 알고 작품의 놀라운 조각 + +413 +00:45:02,090 --> 00:45:08,690 + 내가 셀틱 졸업 시간이 매우 작품은 처음에 전달된다 + +414 +00:45:08,690 --> 00:45:16,510 + 이 최초의 디지털 카메라와 같은 2006 년에 후지 필름에 의해 스마트 디지털 카메라 + +415 +00:45:16,510 --> 00:45:22,390 + 보기의 얼굴 검출기 지금까지 내 이송 펌프 기술 이전 시점이되었다 + +416 +00:45:22,389 --> 00:45:28,789 + 매우 빠르고 시각적 첫 번째 성공적인 높은 수준의 일이 있었다 + +417 +00:45:28,789 --> 00:45:35,849 + 소비자 제품에 사용되고 인식 알고리즘 그래서 그냥 작업 할 수 + +418 +00:45:35,849 --> 00:45:41,059 + 얼굴을 감지 배운다 더 이상 빨리 당신이 알고 함께 야생에서 직면하지 + +419 +00:45:41,059 --> 00:45:47,920 + 시뮬레이션 그들은 매우 이들은 비록 모든 사진 및 있습니다 고안되어 + +420 +00:45:47,920 --> 00:45:53,329 + 그는 깊은 학습 풍미를 많이 갖는 깊은 학습 네트워크를 사용하지 않은 + +421 +00:45:53,329 --> 00:46:01,179 + 기능은 기능을 간단한 기능을 찾을 수있는 알고리즘을 배운다을 알게되었다 + +422 +00:46:01,179 --> 00:46:06,919 + 당신이 우리에게 가장 좋은 줄 수있는이 흑백 필터 기능 등 + +423 +00:46:06,920 --> 00:46:14,639 + 얼굴의 현지화 그래​​서 이것은 하나의 작품의 매우 영향력있는 작품이다 + +424 +00:46:14,639 --> 00:46:24,679 + 컴퓨터를 배포하고 실제 로밍 할 수있는 첫 번째 컴퓨터 영상 작품의 + +425 +00:46:24,679 --> 00:46:31,019 + 그 비교 알고리즘 전에 시간은 용지가 실제로 매우 느린했다 + +426 +00:46:31,019 --> 00:46:36,699 + 이 부여 된 실시간 얼굴 인식이라고 나는 알고하지 않는 팁에 그를 보내 + +427 +00:46:36,699 --> 00:46:41,409 + 사람이 칩의 종류를 기억하지만 느린 채팅 아니었지만 그럼에도 불구하고 + +428 +00:46:41,409 --> 00:46:48,569 + 그것은 또 다른 매우 중요한 예술 작품도 한 번 더 있었다 실시간으로 실행 + +429 +00:46:48,570 --> 00:46:53,380 + 일이 유일한 일없는이시기에 지적하는 + +430 +00:46:53,380 --> 00:46:59,170 + 그러나 이것은 금주 모임 정말 좋은 표현 모랄레스 시간의 초점이다 + +431 +00:46:59,170 --> 00:47:06,250 + 컴퓨터 비전은 미스터을했습니다 기억 이동하고있다 + +432 +00:47:06,250 --> 00:47:14,699 + 작업 초기에 지금 우리가하고있는 세에게 물체의 형상을 모델링하기 위해 노력했다 + +433 +00:47:14,699 --> 00:47:23,439 + 우리가 정말 개체에 대한 약간이 무엇인지 인식을 할 수 있습니다 이동 + +434 +00:47:23,440 --> 00:47:27,400 + 이러한 단계를 재구성 여부를 컴퓨터 비전의 전체 분기있다 + +435 +00:47:27,400 --> 00:47:34,200 + 그래픽은 그 작업을 계속 단계하지만 컴퓨터 비전의 큰 부분은 아니다 + +436 +00:47:34,199 --> 00:47:38,730 + 세기의 전환기 주위에이 시간에 인식에 초점을 맞추고있다 + +437 +00:47:38,730 --> 00:47:47,539 + 즉, 컴퓨터 비전과 오늘에게 가장 중요한 부분을 가져이다 + +438 +00:47:47,539 --> 00:47:55,480 + 컴퓨터 비전 작업이 인식 등이인지 질문을 집중하고, + +439 +00:47:55,480 --> 00:47:57,369 + 내가 질문 + +440 +00:47:57,369 --> 00:48:06,150 + 작품의 또 다른 매우 중요한 부분 때문에 주위의 기능에 초점을 시작 + +441 +00:48:06,150 --> 00:48:12,950 + 사람들이 그것을 실현하기 시작 얼굴 인식의 시간은 정말 열심히 정말 + +442 +00:48:12,949 --> 00:48:19,829 + 난 그냥 말했듯이 모든 일을 설명하여 객체를 인식하는 당신은 내가 알고 + +443 +00:48:19,829 --> 00:48:25,960 + 너희가 많이 있었다 나는 당신의 몸통 I의 나머지 부분을 볼 수 없습니다 결론을 내렸다 참조 + +444 +00:48:25,960 --> 00:48:31,690 + 정말 첫 번째 행에에 다리 중 하나를 볼 수 있지만 나는 당신을 인식하지 않고 + +445 +00:48:31,690 --> 00:48:39,230 + 나는 그래서 어떤 사람들은 그녀가 이것이 재미 실현하기 위해 시작 개체로 전나무 당신을 애 + +446 +00:48:39,230 --> 00:48:44,240 + 정말 글로벌 형상은 지금 우리가 물체를 인식하기 위해 후 가야 + +447 +00:48:44,239 --> 
00:48:50,319 + 우리는 중요한 기능을 우리가 할 수있는 객체를 인식하는 경우 아마이 기능의 + +448 +00:48:50,320 --> 00:48:53,090 + 먼 길을 가서 많은 이해 + +449 +00:48:53,090 --> 00:48:57,930 + 당신이 밖으로 것을 인식 할 필요가 없습니다 당신을 사냥하는 경우 진화에 대해 생각 + +450 +00:48:57,929 --> 00:49:03,909 + 모양 호랑이 몸 전체는 몇이 알고 도망 할 필요가 결정하는 + +451 +00:49:03,909 --> 00:49:06,588 + 관통 호랑이의 첫 번째 패치 + +452 +00:49:06,588 --> 00:49:12,679 + 우리가 빨리 듣고 필요가 너무 있도록 충분히 아마 멋진 팔을 잎 + +453 +00:49:12,679 --> 00:49:16,429 + 의사 결정 야구의 버전은 정말 빠르다 + +454 +00:49:16,429 --> 00:49:22,308 + 이에 의해 이동 비용을 부담해야하므로이 많은 온라인 중요한 기능을 발생 + +455 +00:49:22,309 --> 00:49:28,539 + 데이비드 낮은 다시 다시 그 이름을보고 중요한 중요한 학습에 관한 것입니다 + +456 +00:49:28,539 --> 00:49:34,009 + 객체에 기능과 당신은 단지 몇 이러한 중요한 기능을 배우면 + +457 +00:49:34,009 --> 00:49:38,400 + 당신이 할 수있는 개체에 대한 그들 실제로 완전히에이 객체를 권장합니다 + +458 +00:49:38,400 --> 00:49:45,548 + 다른과 교훈을 유지하도록하도록 징수 복잡 장면에 이동 + +459 +00:49:45,548 --> 00:49:54,880 + 약 10 년 전 필드의 2010 년 또는 2012 년 연구 선거 + +460 +00:49:54,880 --> 00:50:00,229 + 컴퓨터 비전에 모델을 구축하기 위해 이러한 기능을 사용하는 방법에 초점을 맞추고 있었다 + +461 +00:50:00,228 --> 00:50:05,538 + 객체 및 장면을 인식하고 우리는 우리가 먼 길을 갔어요 훌륭한 일을 했어 + +462 +00:50:05,539 --> 00:50:12,609 + 깊은 그 단어를 배우는 이유 중 하나는 더 이상 설득력이되었다 + +463 +00:50:12,608 --> 00:50:17,690 + 많은 사람들이 우리가 볼 수있는 기능은 그 깊은 학습이 그 + +464 +00:50:17,690 --> 00:50:22,880 + 학습자는 화려한하여 이러한 설계 기능과 매우 유사 + +465 +00:50:22,880 --> 00:50:30,229 + 엔지니어는 필요한 경우 우리가 그들을 필요 알지 심지어 종류의 확인 있도록 + +466 +00:50:30,228 --> 00:50:34,929 + 아래 먼저이 일을 갖추고 있으며, 우리는 더 나은 개발을 시작하는 우리에게 얘기를 + +467 +00:50:34,929 --> 00:50:38,978 + 수학적 모델은 그 자체로 이러한 기능을 배울 수 있지만 확인 + +468 +00:50:38,978 --> 00:50:46,210 + 서로 너무 너무 역사적 당신은 안이 작품의 중요성을 알고 + +469 +00:50:46,210 --> 00:50:52,028 + 감소 된이 작품은 우리의 지적 기반의 하나입니다 + +470 +00:50:52,028 --> 00:50:57,858 + 우리의 지적 기반은 실현하기 위해 얼마나 중요한지 또는 얼마나 유용한 지 그 + +471 +00:50:57,858 --> 00:51:07,018 + 이러한 깊은 학습 기능은 우리가 그들을 배울 경우 그냥 간단히 때문에 말할 수 있습니다 + +472 +00:51:07,018 --> 00:51:12,379 + 이 기능의 저와 다른 많은 연구자들은 우리가 사용할 수없는 우리에게 + +473 +00:51:12,380 --> 00:51:18,239 + 그 장면 인식과 그 시간 기계 학습 주위를 배울합니다 + +474 +00:51:18,239 --> 00:51:24,719 + 도구는 우리가 주로 사용하거나 그래픽 모델 또는 지원 벡터 기계와 + +475 +00:51:24,719 --> 00:51:29,479 + 이 하나의 영향력 작업 지원 벡터 기계와 대령을 사용하여에 + +476 +00:51:29,478 --> 00:51:43,358 + 모델 2222은 일을 인식하지만 난 여기에 간단한과 마지막 깊은 학습 모델이 될 수 있습니다 + +477 +00:51:43,358 --> 00:51:50,578 + 이 기능 또는 기능 야구라는 변형 부분은 왈도입니다입니다 우리 + +478 +00:51:50,579 --> 00:51:57,420 + 사람의 일부처럼 개체의 일부를 배우고 우리는 그들이 그림을 오는 방법 + +479 +00:51:57,420 --> 00:52:08,519 + 공간에서 서로 소득 그림에 모델을 지원 벡터 머신의 종류를 사용 + +480 +00:52:08,518 --> 00:52:16,179 + 2009 년의이시기에 인간과 병 같은 객체를 인식 + +481 +00:52:16,179 --> 00:52:21,419 + 2010 년 컴퓨터 비전 분야는 우리가 최선을 다하고 충분히 성숙 + +482 +00:52:21,420 --> 00:52:25,659 + 중요한 심장이 아마 보행자 인식과 + +483 +00:52:25,659 --> 00:52:30,828 + 더 이상 인위적인 문제가 뭔가있어 차를 인식하지 것은 다른 사람이었다 + +484 +00:52:30,829 --> 00:52:37,219 + 때문에 우리가하지 않으면 지금 진행 필드로 부분적으로 자신의 벤치가 필요 + +485 +00:52:37,219 --> 00:52:44,039 + 좋은 벤치 마크는 다음 모두가 이미지 집합 느낌과 정말 열심히 정말 + +486 +00:52:44,039 --> 00:52:50,369 + 가장 중요한 기준 중 하나가 통과 목표라고 있도록 글로벌 표준을 설정 + +487 +00:52:50,369 --> 00:52:57,608 + V OC는 물체 인식 벤치 일부는 유럽의 노력 유럽 생물의 그 + +488 +00:52:57,608 --> 00:53:04,190 + 연구진은 20 종류의 이미지 수만에 의해 함께 넣어 + +489 +00:53:04,190 --> 00:53:13,019 + 광학 및이 고양이처럼 하나의 예 개체 당 기준 요금은 소 영화를 숭배하지 + +490 +00:53:13,018 --> 00:53:17,808 + 고양이는 소 비행기 병 개 + +491 +00:53:17,809 --> 00:53:20,048 + 말 훈련 + +492 +00:53:20,048 --> 00:53:27,268 + 다음 더스와 우리는 매년 우리의 컴퓨터 비전 연구자를 사용 + +493 +00:53:27,268 --> 00:53:34,948 + 그리고 바퀴는 최고의 여자 객체에 대한 모든 물체 인식 작업을 경쟁 올 + +494 +00:53:34,949 --> 00:53:41,188 + 당신이 년을 통해처럼 알고 과거를 통해 인식 문제와 + 
+495 +00:53:41,188 --> 00:53:47,949 + 성능은 계속 증가하고 우리가 느끼기 시작할 때 + +496 +00:53:47,949 --> 00:53:52,929 + 그 때의 필드의 진행 흥분 + +497 +00:53:52,929 --> 00:53:59,729 + 여기에 가까운 우리에게 더 가까이 이야기를 통해 조금이다 그건 그 내 사랑 내 + +498 +00:53:59,728 --> 00:54:05,718 + 학생들은 현실 세계가 진짜 약 20 개체를 알고 생각했다 + +499 +00:54:05,719 --> 00:54:12,489 + 세상은 그렇게 파스코 시각의 작품 다음 작은 20 개 이상의 광학입니다 + +500 +00:54:12,489 --> 00:54:18,239 + 물체 인식 문제는 우리가 함께이 방대한 대규모 프로젝트를 넣어 + +501 +00:54:18,239 --> 00:54:23,889 + 여러분 중 일부는이 클래스에서 당신이 될 것 이미지 들었을 수 있다는 이미지 + +502 +00:54:23,889 --> 00:54:30,098 + 이미지의 작은 부분을 사용하여 해당 과제 그 이미지의 일부에 + +503 +00:54:30,099 --> 00:54:36,759 + 그 모두가 내 손을 청소하고 5 천만 이미지의 데이터 세트입니다 + +504 +00:54:36,759 --> 00:54:47,000 + 그것을 청소 학생들에게 20,000 객체 클래스를 주석 + +505 +00:54:47,000 --> 00:54:54,469 + 내 삶의 다양한 영역의 습관의 크라우드 소싱 플랫폼을 제거 + +506 +00:54:54,469 --> 00:54:59,969 + 당신이 함께이 플랫폼이 퍼팅 알고에서 글래디스는 고통을 몰라 그 + +507 +00:54:59,969 --> 00:55:08,599 + 그러나 그것은 매우 흥미로운 일 우리가 함께 넣어하기 시작 시작되지 않습니다이다 + +508 +00:55:08,599 --> 00:55:15,900 + 대회는 매년 이미지라고 그 물체 인식을위한 경쟁 + +509 +00:55:15,900 --> 00:55:22,440 + 예를 들어 이모 겐에 의한 영상 분류의 표준 경쟁은은이다 + +510 +00:55:22,440 --> 00:55:28,710 + 거의 150 만 이미지와 알고리즘을 통해 천 개체 클래스에 경쟁 + +511 +00:55:28,710 --> 00:55:34,220 + 성능은 그래서 사실 난 그냥 소셜 미디어에 있던 사람이 들었어요 + +512 +00:55:34,219 --> 00:55:38,589 + 나는 매우했다 컴퓨터 비전의 올림픽 도전 이미지 참조 + +513 +00:55:38,590 --> 00:55:40,240 + 유망한 + +514 +00:55:40,239 --> 00:55:55,649 + 그 도전 2010 그래서 그렇게 사람들을 역사에 우리가 가까이 가져 + +515 +00:55:55,650 --> 00:56:00,369 + 그 시간 패스 주위에 실제로 것은 어디 동료를 알아가는 사람들 + +516 +00:56:00,369 --> 00:56:05,309 + 그들은 우리가 직면 그래서 20 개체의 자신의 문제를 단계적으로 폐지하는 거 처음이야 우리에게 말했다 + +517 +00:56:05,309 --> 00:56:12,039 + 천 개체의 이미지에 도전하는 이유는 에러율에 액세스 + +518 +00:56:12,039 --> 00:56:18,199 + 우리는 매우 중요한 오류와 함​​께 시작에 우리는 시작 물론 당신은 알고있다 + +519 +00:56:18,199 --> 00:56:28,029 + 매년 감소하지만 특히 세 정말 감소가 그 + +520 +00:56:28,030 --> 00:56:38,960 + 올해는 뜨거운 거의 IS 2012 2012입니다 절단 한 것 승리 아키텍처 + +521 +00:56:38,960 --> 00:56:45,769 + 이미지 그 문제는 내가 말할 것 네트워크의 회선이었다 + +522 +00:56:45,769 --> 00:56:53,250 + 그것은 어떻게 모든 새로운 스피커의 느낌에도 불구하고 2012 년에 발명되지 않았습니다 대해 + +523 +00:56:53,250 --> 00:56:58,190 + 이 블록 주위에 새로운 일이처럼 그것은 다시 발명되었다 아니에요 + +524 +00:56:58,190 --> 00:56:59,349 + 칠십 년대와 80 년대 + +525 +00:56:59,349 --> 00:57:05,279 + 그는 그러나 사물의 융합에 회선에 대해 이야기합니다 보내고 당신의 + +526 +00:57:05,280 --> 00:57:10,519 + 네트워크는 대용량으로 그 거대한 힘을 보여 주었다 훈련을 종료 + +527 +00:57:10,519 --> 00:57:18,219 + 큰 차이로 제치고 그였다 이미지 아키텍처와 왕 + +528 +00:57:18,219 --> 00:57:24,829 + 보기의 AA 수학적 관점에서 매우 역사적인 순간이 아니었다 그것은 + +529 +00:57:24,829 --> 00:57:30,079 + 볼이 내 엔지니어링 전에 새로운 및 해결 실제 포인트가 + +530 +00:57:30,079 --> 00:57:35,090 + 당신이 많은 알고에 의해 작품의 조각이 덮여있는 역사적 순간이었다 + +531 +00:57:35,090 --> 00:57:42,400 + 시간이 모든 문제는이 발병입니다 학습의 시작입니다 + +532 +00:57:42,400 --> 00:57:48,869 + 혁명 당신은 그것을 호출이 이것 때문에이 클래스의 전제 인 경우 + +533 +00:57:48,869 --> 00:57:54,609 + 우리가 컴퓨터의 간단한 역사를 통해 갔다, 그래서 나는 거 스위치 해요 가리 + +534 +00:57:54,610 --> 00:57:59,539 + 540,000,000년 비전 + +535 +00:57:59,539 --> 00:58:05,869 + 이 클래스의 개요 다른 질문이 있습니다 + +536 +00:58:05,869 --> 00:58:13,969 + 우리가 많이 얘기 종류의 압도적 이었지만 확실히 그래서 우리는 심지어 얘기 + +537 +00:58:13,969 --> 00:58:20,559 + 컴퓨터 비전에서 다른 작업을 찾는 것에 대해 31에 보인다 초점을 맞출 것입니다 + +538 +00:58:20,559 --> 00:58:27,849 + 시각적 인식 문제에도 특히 대부분의를 통해 확대 + +539 +00:58:27,849 --> 00:58:29,509 + 기초 강좌 + +540 +00:58:29,510 --> 00:58:35,750 + 우리가 얘기 분류하지만 지금은 당신이 알고있는 모든 것을 할 거입니다 + +541 +00:58:35,750 --> 00:58:41,480 + 우리는 우리가 다른 얻고 있었다됩니다 설정 분류 그 이미지를 기반으로 + +542 +00:58:41,480 --> 00:58:47,900 + 시인성 시나리오이지만, 화상 분류 문제이다 메인 + +543 +00:58:47,900 --> 00:58:52,780 + 명심하십시오 의미 
우리는 엠마의 클래스에 초점을 맞출 것이다 문제 + +544 +00:58:52,780 --> 00:58:56,600 + 시각적 인식은 바로 3 차원 거기에 그냥 이미지 분류되지 않습니다 + +545 +00:58:56,599 --> 00:59:01,339 + 모델링이 분할의 그룹이었다하고 있지만,이 모든입니다 + +546 +00:59:01,340 --> 00:59:06,250 + 그건 우리가에 초점을 맞출 것이다 나는 미스 당신이 그냥 전화도 할 필요가 없습니다 것 + +547 +00:59:06,250 --> 00:59:11,000 + 애플리케이션 현명한 이미지 분류는 매우 유용 문제 + +548 +00:59:11,000 --> 00:59:17,929 + 당신은 큰 큰 상업적인 인터넷 기업들에게 관점을 알고부터 + +549 +00:59:17,929 --> 00:59:22,449 + 시작 아이디어 당신은 당신이 인식 할 객체를 인식 할 알고 + +550 +00:59:22,449 --> 00:59:29,119 + 음식은 이동할 수 있도록 당신이 우리에게 고문 앨범을 원하는 온라인 상점 모바일 쇼핑을 + +551 +00:59:29,119 --> 00:59:35,710 + 분류 소식은 많은 많은에 대한 생계 작업이 될 수있다 + +552 +00:59:35,710 --> 00:59:44,650 + 중요한 문제 두 분류와 관련이있어 문제가있다 + +553 +00:59:44,650 --> 00:59:49,329 + 오늘은 당신이 차이를 이해하는 기대하지 않는다 그러나 나는 듣고 싶어 + +554 +00:59:49,329 --> 00:59:55,659 + 이 클래스가 있는지 확인 반면 것을 당신은의 미묘한 차이를 이해하는 법을 배워야 + +555 +00:59:55,659 --> 01:00:01,879 + 시각적 인식의 다른 맛의 세부 내용 이미지 + +556 +01:00:01,880 --> 01:00:07,700 + 분류는 영상 자막이 그리고 이것들이 가지고있는 물체 감지 무엇 + +557 +01:00:07,699 --> 01:00:14,529 + 그는이 분류를 만든 예를 들어 다른 맛을 알고 내 + +558 +01:00:14,530 --> 01:00:19,740 + 로 전체의 큰 이미지 객체 검출에 초점 곳 가지를 알려 + +559 +01:00:19,739 --> 01:00:23,579 + 정확히 차가 보행자입니다 같다 + +560 +01:00:23,579 --> 01:00:30,159 + 망치와 단어가 등등 객체와의 관계 + +561 +01:00:30,159 --> 01:00:35,529 + 이 클래스에 대해 학습한다 그들의 뉘앙스 및 세부 사항을 사회 + +562 +01:00:35,530 --> 01:00:43,840 + 나는 이미 CNN 말했다 또는 네트워크의 연합은 깊이의 한 종류입니다 + +563 +01:00:43,840 --> 01:00:50,910 + 아키텍처하지만 계획 아키텍처 압도적으로 성공이고 + +564 +01:00:50,909 --> 01:00:54,909 + 이것은 우리가 집중되고 단지로 다시 이동합니다 아키텍처 + +565 +01:00:54,909 --> 01:01:02,849 + 이미지 9 도전은 그래서 역사적 년이는 연도 2012 인 + +566 +01:01:02,849 --> 01:01:14,349 + 나는 그것이 칠 생각 제프 힌튼이이 길쌈을 제안 소풍 있습니다 + +567 +01:01:14,349 --> 01:01:20,500 + 이전 모델에 도전 이미지를 승리 네트워크 길쌈 층 + +568 +01:01:20,500 --> 01:01:22,318 + 올해 + +569 +01:01:22,318 --> 01:01:30,548 + 기능을 SIFT 플러스 벡터 머신 아키텍처 그것은 여전히​​ 계층 구조를 지원 + +570 +01:01:30,548 --> 01:01:38,449 + 하지만 두의 맛을 가지고 2015 년에 앞으로 빠른 학습하지 않습니다 + +571 +01:01:38,449 --> 01:01:43,798 + 승리 아키텍처는 아직도 당신이 그것의 걱정하지 않은 결론이다 + +572 +01:01:43,798 --> 01:01:56,599 + 사냥꾼 (51) 층은 마이크로 소프트 아시아 연구소 연구원을 구입 구입하고 그것은 분명 + +573 +01:01:56,599 --> 01:02:03,048 + 이유가 잔류 잔류 그래서 커버에 대한 확신 아니에요 + +574 +01:02:03,048 --> 01:02:09,369 + 그 확실히 실제로 무엇을 하나 하나 층 알고 기대하지 않습니다 + +575 +01:02:09,369 --> 01:02:17,269 + 그들은 2012 우승 구조 때문에 마음 만 매년 자체를 반복 + +576 +01:02:17,268 --> 01:02:23,548 + 이미지의 그 문제는 내가 같은 깊은 학습 기반의 아키텍처입니다 + +577 +01:02:23,548 --> 01:02:32,369 + 나는 또한 당신이 역사는 발명되지 존중하고 싶은 것을하는 것은 하룻밤 많이있다 + +578 +01:02:32,369 --> 01:02:37,979 + 오늘하지만 당신이 알고있는 영향력있는 선수의 구축 많은 사람들이있다 + +579 +01:02:37,978 --> 01:02:41,879 + 기초 사실 나는 슬라이드를 기억해야 할 한 가지 중요한 일이 없습니다 + +580 +01:02:41,880 --> 01:02:50,910 + 쿠니히코 후쿠시마 contigo 솔루션은 구축 일본의 과학자했다입니다 + +581 +01:02:50,909 --> 01:02:58,798 + 모델 corneil 홍콩 트럭 및 그 새로운 네트워크의 시작 + +582 +01:02:58,798 --> 01:03:04,318 + 건축과 노란색 색상도 매우 영향력있는 사람이며 그는 정말 + +583 +01:03:04,318 --> 01:03:10,248 + 젊은 쿠데타의 내 의견에 혁신적인 작업에 출판되었다 + +584 +01:03:10,248 --> 01:03:16,348 + 그래서 19 구십 한 수학자의 한 제프 힌튼 + +585 +01:03:16,349 --> 01:03:22,479 + 포함 된 모든 항목을 포함 고문은 다시 전파 학습을했다 + +586 +01:03:22,478 --> 01:03:28,088 + 아래에 아무것도 삭제이 있다면 전략은 몇 당신을 말할 것이다 + +587 +01:03:28,088 --> 01:03:34,528 + 주하지만,하지만, 수학적 만도는 80 년대 거칠게하고, + +588 +01:03:34,528 --> 01:03:34,920 + 그만큼 + +589 +01:03:34,920 --> 01:03:40,869 + 속옷이 있었다 그것이 AT & T 벨 연구소에서 근무하고 해당 지역의 + +590 +01:03:40,869 --> 01:03:47,160 + 그 때 놀랄만한 장소들이 있다고 더 이상 오늘날에는 보석 UPS가 없습니다 + +591 +01:03:47,159 --> 01:03:50,949 + 정말 야심 찬 프로젝트를 진행하고 그는 숫자를 인식하는 데 필요한 + +592 +01:03:50,949 --> 
01:03:57,019 + 심지어 미국의 게시물에 우리의 가방에 제공된 해당 제품을 떠나 있기 때문에 + +593 +01:03:57,019 --> 01:04:03,380 + 사무실은 어렵고 검사 및 변태 건설하는 연합을 인식하는 + +594 +01:04:03,380 --> 01:04:08,068 + 그는 그가 HUBEL과 위젤 그는 영감을 어디 네트워크에서이입니다 + +595 +01:04:08,068 --> 01:04:14,500 + 일부 수영장에서 찾고 의해 시작은 구조와 같은 가장자리와 이미지는이 마음에 들지있어 + +596 +01:04:14,500 --> 01:04:20,099 + 전체 편지 정말 가장자리에 필요가있어 팔과 계층 별 층 + +597 +01:04:20,099 --> 01:04:25,539 + 이 가장자리를 끌어 필터들 함께 풀을 필터링 한 다음 필드이 + +598 +01:04:25,539 --> 01:04:36,230 + 아키텍처 20121 알렉스 kruschev 스키와 제프 힌튼하면 거의 정확하게 일 + +599 +01:04:36,230 --> 01:04:40,900 + 아키텍처는 차에 참여 + +600 +01:04:40,900 --> 01:04:47,900 + 몇 가지 변경이 도전을 상상하지만 승리가 될 + +601 +01:04:47,900 --> 01:04:54,920 + 이 아키텍처는 그래서 우리는 lib 디렉토리의 세부 사항 변경에 대한 자세한 말씀 드리죠 + +602 +01:04:54,920 --> 01:05:02,380 + 거기에 무어의 법칙이 우리를 도왔 때문에 용량 모델은 조금 증가했다 + +603 +01:05:02,380 --> 01:05:08,220 + 대한 모양의 약간의 변화도 매우 매우 상세한 기능 + +604 +01:05:08,219 --> 01:05:14,828 + 대부분의 Signori (224)는 그 형태하지만 무엇에 파일뿐만 몇있다 + +605 +01:05:14,829 --> 01:05:19,130 + 큰 아무것도에 의해 정말 작은 변화 만 변경했다 + +606 +01:05:19,130 --> 01:05:26,490 + 수학적하지만 중요한 것은 변화 않았고 그 깊은 학습을 성장 + +607 +01:05:26,489 --> 01:05:35,379 + 그 르네상스 하나에 Architektur 검정 잉크 승무원은 음식물의 한 입처럼이며, + +608 +01:05:35,380 --> 01:05:41,180 + 이들은 매우 높은 높은 있기 때문에 하드웨어 하드웨어는 큰 차이를 만들어 + +609 +01:05:41,179 --> 01:05:44,669 + 용량 모델 일 델라 크루즈 + +610 +01:05:44,670 --> 01:05:50,720 + 때문에 계산의 병목이 고통스럽게 느린 그는 할 수 없었다 + +611 +01:05:50,719 --> 01:05:55,209 + 그래서 당신은 큰 수없는 완벽에 추가 할 수는 없지만이 모델에게 너무 큰 구축 + +612 +01:05:55,210 --> 01:06:00,670 + 15 이상 거기에 기계 학습의 관점에 대한 잠재력을 실현하고 + +613 +01:06:00,670 --> 01:06:07,780 + 이러한 모든 문제는 당신도하지만 지금 우리는 훨씬 더 빨리 더 큰 트랜지스터를 가질 수있다 + +614 +01:06:07,780 --> 01:06:16,410 + 엔비디아에서 트랜지스터 마이크로 칩 및 GPU는 깊은에 큰 차이를 만들어 + +615 +01:06:16,409 --> 01:06:22,358 + 우리가 지금 적당한 양의 모델을 연수생 수있는 학습의 역사 + +616 +01:06:22,358 --> 01:06:27,358 + 그들은 거대한이고 다른 사람들이 우리가 밖으로 데리고해야합니까 생각하는 경우에도 시간 + +617 +01:06:27,358 --> 01:06:37,159 + 작품 자체가 그냥있는 빅 데이터했던이었다 데이터의 데이터 가용성이다 + +618 +01:06:37,159 --> 01:06:41,078 + 그것은 아무것도 의미하지 않는다 알고는 있지만 그것을 사용하는 방법을 모르는 경우 + +619 +01:06:41,079 --> 01:06:45,869 + 깊은 학습 Architektur 데이터 고용량 구동력 될 + +620 +01:06:45,869 --> 01:06:52,390 + 모델은 뭐하는 교육을 활성화하는 진정한 진정한 진정한 도움 피하기 overfitting 때 + +621 +01:06:52,389 --> 01:06:57,608 + 당신은 픽셀 수를 보면 당신은 그래서 당신을 알 수 있도록 당신은 충분한 데이터를 가지고 그 + +622 +01:06:57,608 --> 01:07:05,639 + 기계 학습 사람들은 나선형 그것이 거대 1998를 가진 대 2012 년에 있었다 + +623 +01:07:05,639 --> 01:07:06,469 + 차이 + +624 +01:07:06,469 --> 01:07:14,469 + 크기의 주문은 그래서 그래서 그래서이 (231)의 초점이었다 + +625 +01:07:14,469 --> 01:07:21,098 + 뿐만 아니라 갈 것입니다 오, 내가이 생각을 침을 흘리고있어 중요한 마지막이기도 + +626 +01:07:21,099 --> 01:07:27,048 + 나는 어떤을 원하지 않는 시각적 지능 물체 인식을 넘어 않습니다 + +627 +01:07:27,048 --> 01:07:31,039 + 이 과정에서 나오는 것은 우리는 당신이 우리가했습니다 알고있는 모든 일을했습니다 딩키 + +628 +01:07:31,039 --> 01:07:38,889 + 시각적 인식의 전체 공간을 비행 할 도전은이 사실이 아니에요 + +629 +01:07:38,889 --> 01:07:44,460 + 여전히 멋진 많은 문제가 당신이 알고 예를 들어 해결하는 것은 라벨링을한다 + +630 +01:07:44,460 --> 01:07:51,650 + 모든 단일 픽셀이 속한 곳 지각 그룹과 전체 장면은 그래서 나는 알고있다 + +631 +01:07:51,650 --> 01:07:52,329 + 에 + +632 +01:07:52,329 --> 01:07:56,900 + 그 여전히 함께 지속적인 문제입니다 + +633 +01:07:56,900 --> 01:08:02,740 + 3 차원으로 인식이 정말 흥분을 많이가 거기 무슨 일이 일어나고입니다 + +634 +01:08:02,739 --> 01:08:09,349 + 비전과는이이 로봇의 교차로는 그 확실히 하나의 영역입니다 + +635 +01:08:09,349 --> 01:08:15,039 + 다음 아무것도 국경의 움직임과와와 함께 할 수있는이 또 다른입니다 + +636 +01:08:15,039 --> 01:08:33,289 + 당신은 그냥 거 이상으로 알고 연구 작업의 큰 개방 영역은 당신이 실제로 원하는 노래 + +637 +01:08:33,289 --> 01:08:35,689 + 깊이 승자를 이해 + +638 +01:08:35,689 --> 01:08:39,489 + 어떤 사람들이하고있는 것은 서양의 개체 사이의 관계 무엇인가 + +639 +01:08:39,489 
--> 01:08:45,029 + 객체 사이의 상기 관계에 RD와이 진행중인 프로젝트 + +640 +01:08:45,029 --> 01:08:49,759 + 학생들의 수는 단지 내 무릎에 시각적 게놈이라고 + +641 +01:08:49,760 --> 01:08:55,739 + 관련이 지금까지 우리가에 대한 이야기​​ 잡초의 이미지 분류 넘어 + +642 +01:08:55,739 --> 01:09:03,639 + 우리의 거룩한 Grails에의 한 것입니다 사회 이것의 성배의 일 동안 + +643 +01:09:03,640 --> 01:09:09,260 + 바로 그래서 인간으로 당신에 대해 생각하는 장면의 이야기를 할 수 있어야합니다 + +644 +01:09:09,260 --> 01:09:11,180 + 당신은 당신의 눈을 열어 + +645 +01:09:11,180 --> 01:09:17,840 + 당신은 당신이 할 수있어 눈을 뜨고 순간 당신이 실제로 무엇을보고 설명하기 + +646 +01:09:17,840 --> 01:09:24,940 + 심리학 실험은 우리는 당신이 사람들에게 단에 사진을 표시하는 경우에도 발견 + +647 +01:09:24,939 --> 01:09:30,659 + 말 그대로 두 번째 사람의 절반이다 오백 밀리 자 + +648 +01:09:30,659 --> 01:09:36,769 + 그들은하지 않았다, 그래서 그것에 대해 에세이를 작성 우리는 그들에게 $ 시간당 10을 지불 + +649 +01:09:36,770 --> 01:09:42,410 + 그것은 그 길지 않았다하지만 우리는 더 많은 돈을 이야기하면 당신은 내가 그림을 알고 그들이 + +650 +01:09:42,409 --> 01:09:47,970 + 아마 더 이상 윤리를 작성하지만 요점은 우리의 시각 시스템이 있다는 것입니다 수 + +651 +01:09:47,970 --> 01:09:54,390 + 매우 강력한 내 셀입니다 우리는 이야기를하고 나는이 꿈을 꿀 것 + +652 +01:09:54,390 --> 01:10:02,560 + 논문을 옷을 위해 우리는 당신이 당 컴퓨터 한 그림을 제공주고 있고 + +653 +01:10:02,560 --> 01:10:03,960 + 결과 + +654 +01:10:03,960 --> 01:10:09,159 + 이 같은 설명 당신은 당신이이 줄이 표시됩니다 내가 거기에 도착하고 있었다 알고 + +655 +01:10:09,159 --> 01:10:15,149 + 크메르어 올림픽 TUR 당신에게 한 문장을 제공하거나 하나의 선택이 켜져 수가 줄 + +656 +01:10:15,149 --> 01:10:20,319 + 하지만 짧은 문장으로 우리는 아직 여기 아니에요하지만 홀더 중 하나 + +657 +01:10:20,319 --> 01:10:26,250 + 블루와 다른 들고 성장 내가 생각이이 계속이 작업을 계속하고있다 + +658 +01:10:26,250 --> 01:10:33,659 + 오드리의 블로그가 정말 잘 요약하면 바로 다음과 같이 알고있다 + +659 +01:10:33,659 --> 01:10:42,300 + 당신이뿐만 아니라 즐길 얻을이 그림이 너무 많은 뉘앙스를 정제 + +660 +01:10:42,300 --> 01:10:47,890 + 전역은 매우 지루한 오래된 컴퓨터가 당신이 말할 수있을 것입니다 그것을 추구 인식 + +661 +01:10:47,890 --> 01:10:53,650 + 객실에 객실 규모 + +662 +01:10:53,649 --> 01:10:58,238 + 그것의 사물함에서 어떤 유형의 당신은 당신이 인식 여기 알고있는 그들은 + +663 +01:10:58,238 --> 01:11:00,569 + 트릭을 인식하고 있습니다 + +664 +01:11:00,569 --> 01:11:06,009 + 오바마 대통령은 당신이 유머를 인식 상호 작용의 종류를 인식 할 것입니다 + +665 +01:11:06,010 --> 01:11:11,250 + 너무 많이 알고,이 세상의 하나입니다 우리에 관한 것입니다 거기에 인식 + +666 +01:11:11,250 --> 01:11:18,719 + 때뿐만 아니라 탐색 생존하는 경향이 시각 간호사에게 우리의 능력을 사용 + +667 +01:11:18,719 --> 01:11:26,000 + 재생 그러나 우리가 세계를 이해하기 위해 즐겁게 교제하는 데 사용 + +668 +01:11:26,000 --> 01:11:32,929 + 모든 책의 비전은 비전의 목표를 읽을 곳은 그래서 나는 점이다 + +669 +01:11:32,929 --> 01:11:39,630 + 우리의 세계 a를 만들 것이다 당신에게 해당 컴퓨터 시각 기술을 설득 할 필요가 없습니다 + +670 +01:11:39,630 --> 01:11:46,550 + 당신이 집 심지어하지만 알고 거기 어떤 무서운 이야기에도 불구하고 더 나은 곳으로 + +671 +01:11:46,550 --> 01:11:51,029 + 업계에서 오늘뿐만 아니라 연구의 세계 우리가 컴퓨터를 사용하는 + +672 +01:11:51,029 --> 01:11:58,349 + 더 나은 로봇을 구축하는 비전은 이제 분석을 탐험 깊이 갈 생명을 저장합니다 + +673 +01:11:58,350 --> 01:12:02,860 + 확인 그래서 나는 35 분 왼쪽 무엇 이분처럼이 + +674 +01:12:02,859 --> 01:12:10,839 + 좋은 시간은 컬러 강사가 나를 팀과 정의를 소개하자 + +675 +01:12:10,840 --> 01:12:16,989 + 내가 그에게 인사 일어 서서하시기 바랍니다 그래야 될지도 + +676 +01:12:16,989 --> 01:12:22,639 + 당신은 빨리이 안전하게 이름을 좋아하고 당신은 그냥 포기하지 않는 것 같아 수 있습니다 + +677 +01:12:22,640 --> 01:12:49,180 + 연설 그러나 예 + +678 +01:12:49,180 --> 01:13:42,240 + 사람들의 집단 소송이 우리를 도와 때문에 처리하기 때문에 사람이 존중 + +679 +01:13:42,239 --> 01:14:04,739 + 기밀 개인적인 문제가 다시 나는 우리의 조건에 예정하고 떠날거야하지만, + +680 +01:14:04,739 --> 01:14:09,939 + 월 말부터 몇 주 사회 당신은 당신을 결정하십시오 + +681 +01:14:09,939 --> 01:14:15,379 + 당신 같은 사람이 그들이 취할 것입니다하지 않는 한 나에게 이메일을 보내려면 + +682 +01:14:15,380 --> 01:14:20,770 + 그것에 대해 내가 답장 가능성이있어 당신을 즉시 죄송합니다 + +683 +01:14:20,770 --> 01:14:25,420 + 우선 순위 + +684 +01:14:25,420 --> 01:14:34,739 + 우리의 철학에 대한 그리고 우리는 우리가 진정으로 원하는 세부에 도착하지 않는 + +685 +01:14:34,738 --> 01:14:39,448 + 이것은이 정말 내가 신용을 많이하는 줄 매우 실제적인 프로젝트를 할 수 + +686 +01:14:39,448 --> 
01:14:46,419 + 저스틴과 앙드레는 이러한 실습을 통해 걷기에 매우 좋은 + +687 +01:14:46,420 --> 01:14:51,840 + 이 클래스 나올 때와 세부 있도록뿐만 아니라 I에게 있습니다 + +688 +01:14:51,840 --> 01:14:57,719 + 이해를 사랑하지만 당신이 가지고 당신은 구축 할 수있는 정말 좋은 능력을 가지고 + +689 +01:14:57,719 --> 01:15:02,010 + 자신의 깊은 학습 코드 우리는 당신이 예술의 상태로 노출 할 + +690 +01:15:02,010 --> 01:15:08,730 + 당신이거야 일을 학습 할 수있는 재료 정말 2015로 신선한 그리고 그것은거야 + +691 +01:15:08,729 --> 01:15:11,859 + 이 같은 일을하는 재미를 얻을 수 + +692 +01:15:11,859 --> 01:15:18,960 + 아니 모든 시간이 있지만, 하나의 목표 나 또는이 이상으로 시간과 같은 사진 + +693 +01:15:18,960 --> 01:15:27,489 + 일이 모든 중요한 작업에 추가하여 재미 클래스를 알 수있을 것입니다 당신이 당신 + +694 +01:15:27,488 --> 01:15:33,589 + 당신은 우리가 이러한 다른 웹 사이트에 있습니다 등급을 매기는 정책을 수행 배우 + +695 +01:15:33,590 --> 01:15:44,929 + 다시 사람들을 식당 하나 당신이 좋아하는 성장 업을 성장 명확 + +696 +01:15:44,929 --> 01:15:51,989 + 우리가 과정의 끝에서 아무것도하지 않는 어른들 내 교수 원하는이다 + +697 +01:15:51,988 --> 01:15:56,359 + 날이 회의에 가서 내가 세처럼이 있어야 더 늦게 그들은 아무 말도 + +698 +01:15:56,359 --> 01:16:03,630 + 당신은 당신이 7 늦게 사용할 수있는 사용에 대한 책임 총 팔일 있습니다 + +699 +01:16:03,630 --> 01:16:11,079 + 어떤 방법으로 당신은 모든 10 처벌은 벌금을해야 모든 일 + +700 +01:16:11,079 --> 01:16:18,069 + 정말 정말 뛰어난 의료 가족 비상 같다 + +701 +01:16:18,069 --> 01:16:21,799 + 개별 기준으로하지만, 무엇에 우리 이야기 + +702 +01:16:21,800 --> 01:16:29,539 + 회의는 왜 다른 사람은 마침내 당신이 누락 고양이 또는 무엇처럼 알고 + +703 +01:16:29,539 --> 01:16:37,850 + 우리는 우리가 우리가 칠일에이 하나의 또 다른 자신의 명예 감기 그 예산을 책정 우리 + +704 +01:16:37,850 --> 01:16:43,190 + 내가 가진 것은 당신이 그런 권한이 있습니다 정말 진지한 얼굴로 대답 + +705 +01:16:43,189 --> 01:16:50,710 + 기관 당신은 당신이 당신이 명예에 대한 책임하려는 어른들되어 있습니다 + +706 +01:16:50,710 --> 01:16:55,239 + 코드이 수업을 하나 하나 Stampfer 학생은 다른 알아야 + +707 +01:16:55,239 --> 01:16:58,619 + 공동 당신이 변명이 없다하지 않으면 당신은 돌아 가야한다 + +708 +01:16:58,619 --> 01:17:04,840 + 매우 심각하게 나는 거의 통계적으로 그런 말을 싫어 협력을 기다립니다 + +709 +01:17:04,840 --> 01:17:10,380 + 주어진 계급 큰 단어 알라는 몇 가지 경우가 있지만 나는 또한 당신이되고 싶어요 + +710 +01:17:10,380 --> 01:17:16,210 + 심지어 크기와 뛰어난 클래스이 큰 우리는 무엇을보고 싶어하지 않는 + +711 +01:17:16,210 --> 01:17:22,399 + 대학 명예 코드를 침해하므로 협력 정책과 위험을 읽을 수는 있지만 + +712 +01:17:22,399 --> 01:17:31,960 + 이것은 정말 당신이 할 수있는 모든 습득 조건으로 생각하는 자신을 존중하는 것을 + +713 +01:17:31,960 --> 01:17:38,149 + 당신은 어떤 굽기가 내가 말하고 싶은 무슨 상관 읽을 수 + +714 +01:17:38,149 --> 01:17:47,569 + 당신이 예 물어 가치가 느끼는 질문 + +715 +01:17:47,569 --> 01:18:06,689 + 그래 + diff --git a/captions/Ko/Lecture2_ko.srt b/captions/Ko/Lecture2_ko.srt new file mode 100644 index 00000000..ba230f61 --- /dev/null +++ b/captions/Ko/Lecture2_ko.srt @@ -0,0 +1,2948 @@ +1 +00:00:00,000 --> 00:00:03,750 + 우리는 좋은처럼 기록 단지 당신을 다시 생각 나게하는 + +2 +00:00:03,750 --> 00:00:08,160 + 당신이 카메라에 불편 말하기를하는 경우, 그래서 안녕하세요 가장 가까운 기록 + +3 +00:00:08,160 --> 00:00:15,929 + 아니 그림에 있지만 음성은 당신이 수 확인 위대한 기록에있을 수 있습니다 + +4 +00:00:15,929 --> 00:00:19,589 + 참조 또한 화면은해야보다 넓은 내가 그것을 해결하는 방법을 잘 모르겠어요 + +5 +00:00:19,589 --> 00:00:21,300 + 함께 사는 열심히 + +6 +00:00:21,300 --> 00:00:25,269 + 가능성이 시각 피질 그래서 스트레칭에 아주 좋은 아주 불변이다 + +7 +00:00:25,268 --> 00:00:26,118 + 이것은 문제가되지 않는다 + +8 +00:00:26,118 --> 00:00:32,259 + 우리는 클래스로 다이빙을하기 전에 확인 그래서 일부 관리 것들로까지 무엇이야 + +9 +00:00:32,259 --> 00:00:36,100 + 첫 임무는 월을하다 오늘 밤 또는 이른 내일 나올 것입니다 + +10 +00:00:36,100 --> 00:00:41,289 + 20 정확히 2 주 당신이 분류 이전 분류를 작성하는 것이 + +11 +00:00:41,289 --> 00:00:44,159 + 작은 두 계층 신경망 당신의 전체를 작성 할 수 있습니다 + +12 +00:00:44,159 --> 00:00:47,979 + 22 층 신경 네트워크의 역 전파 알고리즘을 모두 충당 할 수 + +13 +00:00:47,979 --> 00:00:54,459 + 2 주 아침에 재료에 의해 일부는 마지막에서가 + +14 +00:00:54,460 --> 00:00:57,350 + 그들이하지 기쁘게 있도록 올해도 우리는 할당을 변경하고 + +15 +00:00:57,350 --> 00:01:02,890 + 과에 대한주의해야 할 뭔가 2,015 할당에 완료하여 + +16 +00:01:02,890 --> 00:01:07,109 + 경쟁은하지만 파이썬과 파이를 사용하는 것 또한 제공됩니다 + +17 
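[Editor's aside: the assignment description in the captions just above says the work will be done in Python/NumPy, and the next few captions stress writing vectorized NumPy expressions instead of Python loops. A minimal sketch of what that means, with illustrative array names that are not from the lecture:]

```python
import numpy as np

# Illustrative only: a slow Python loop versus the vectorized
# NumPy expression the assignment asks for.
x = np.random.randn(3072)   # one flattened 32x32x3 image
y = np.random.randn(3072)   # another flattened image

# Loop version: one subtraction per pixel (slow in pure Python).
dist_loop = 0.0
for i in range(x.shape[0]):
    dist_loop += abs(x[i] - y[i])

# Vectorized version: the same L1 distance in one expression.
dist_vec = np.sum(np.abs(x - y))

assert np.isclose(dist_loop, dist_vec)
```

[The vectorized form is what makes the later classifiers fast enough to run on a CPU, which is the point the lecturer is making here.]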
+00:01:07,109 --> 00:01:11,030 + 에서 기본적으로 가상 머신 인 것입니다 터미널 닷컴 + +18 +00:01:11,030 --> 00:01:13,939 + 당신이 그렇게에 아주 좋은 노트북을 가지고하지 않을 경우 사용할 수있는 클럽 + +19 +00:01:13,938 --> 00:01:17,250 + 그것에 대해 세부 사항으로 이동하지만 난 그냥 제에 대한 것을 지적하고자 + +20 +00:01:17,250 --> 00:01:21,090 + 할당 우리는 당신이거야 파이썬 비교적 잘 알고있을거야 가정 + +21 +00:01:21,090 --> 00:01:24,859 + 어디 조작에이 최적화 된 NumPy와 식을 작성한다 + +22 +00:01:24,859 --> 00:01:28,438 + 이 행렬과 벡터 예를 들어 당신이 있다면, 그래서 매우 효율적인 형태 + +23 +00:01:28,438 --> 00:01:31,908 + 이 코드를보고 그 다음 당신에게 아무것도 의미하지 않는 것은 봐주세요 + +24 +00:01:31,909 --> 00:01:35,880 + 이 저스틴에 의해 작성된 것뿐만 아니라 웹 사이트에 가입 우리의 파이썬 튜토리얼에서 + +25 +00:01:35,879 --> 00:01:39,489 + 매우 좋은이며, 그래서 통과하고 익숙해 + +26 +00:01:39,489 --> 00:01:42,328 + 표기법 당신과 같은 코드를 많이 작성을 볼 수있을 것이기 때문에 + +27 +00:01:42,328 --> 00:01:47,048 + 그들은 충분히 빨리 실행하는 것, 그래서 우리가하고있는 곳이이 작업을 최적화하는 모든 + +28 +00:01:47,049 --> 00:01:51,610 + CPU에서 지금은 전체의 관점에서이 있다는 것입니다 금액 기본적으로 무엇을 + +29 +00:01:51,609 --> 00:01:54,599 + 당신에게 할당에 대한 링크를 제공합니다 당신은 웹 페이지로 이동합니다 당신은 볼 수 있습니다 + +30 +00:01:54,599 --> 00:01:58,309 + 이 같은 일이 설정되어 클라우드에서 가상 머신이다 + +31 +00:01:58,310 --> 00:02:01,420 + 과제의 모든 종속성까지 그들은 모두 이미 설치되어있는 + +32 +00:02:01,420 --> 00:02:05,618 + 데이터가 이미 존재하고에 그래서 당신은 점심 시스템에서 클릭하고이는거야 + +33 +00:02:05,618 --> 00:02:09,580 + 기본적으로이 이런 일에 당신을 데려 동생을 실행하고 + +34 +00:02:09,580 --> 00:02:13,060 + 이것은 기본적 AWS 위에 얇은 층 + +35 +00:02:13,060 --> 00:02:17,209 + 기계 여기 그래서 UI 층에는 노트북과 조금 아이팟이 + +36 +00:02:17,209 --> 00:02:20,739 + 터미널 당신은 주위에 갈 수있는이 클라우드에 그냥 기계처럼이며, + +37 +00:02:20,739 --> 00:02:24,310 + 그래서 그들은 어떤 CPU 제품을 가지고 그들은 또한 당신이 할 수있는 일부 GPU 시스템을 + +38 +00:02:24,310 --> 00:02:25,539 + 그래서 사용 + +39 +00:02:25,539 --> 00:02:29,090 + 일반적으로 단말기 비용을 지불해야하지만, 그래서 당신에게 크레딧을 분배한다 + +40 +00:02:29,090 --> 00:02:33,709 + 당신은 단지 당신이 TA에 이메일을 보내 비트에 결정할 것이다 따 특정 손실과 + +41 +00:02:33,709 --> 00:02:36,950 + 돈을 요구하는 것은 당신에게 돈을 보내 우리는 우리가로 전송 얼마나 많은 돈을 추적합니다 + +42 +00:02:36,949 --> 00:02:40,799 + 모든 사람들은 그래서 당신은 그래서 이것이 자금과 책임을 가지고 + +43 +00:02:40,800 --> 00:02:55,689 + 또한 옵션은 모든 세부 사항 당신이 읽을 수 수 있습니다 확인처럼 사용하기 위해 + +44 +00:02:55,689 --> 00:02:57,680 + 당신은 당신의 의견이 필요하지있어 좋아하는 경우 + +45 +00:02:57,680 --> 00:03:03,879 + 하지만 당신은 아마 주위에서 얻을 수 그래 좋아 샘이 일어날 것을 말한다 + +46 +00:03:03,879 --> 00:03:07,870 + 강의는 이제 오늘 우리가 이야기 할 것입니다있는 분류 및 특수 것 + +47 +00:03:07,870 --> 00:03:13,219 + 우리는 분류의 기본을 이야기하도록 선형 분류에 시작 + +48 +00:03:13,219 --> 00:03:17,560 + 작업은 우리가 범주의 일부 수 있도록 개 고양이 트럭 평면 또는 말을해야한다는 것입니다 + +49 +00:03:17,560 --> 00:03:20,799 + 우리는 그 다음이 무엇인지를 결정하는 얻을에 이미지를 촬영하기 이전 요청 + +50 +00:03:20,799 --> 00:03:24,950 + 어떤 숫자의 거대한 품종이며, 우리는이 중 하나에 변환해야 + +51 +00:03:24,949 --> 00:03:29,169 + 라벨 우리는이 문제가 지출 범주 중 하나에 그것을 구축해야 + +52 +00:03:29,169 --> 00:03:32,548 + 우리의 대부분의 시간은 구체적으로이 일에 대해 이야기하지만 당신은 하나를 수행하려는 경우 + +53 +00:03:32,549 --> 00:03:36,349 + 이러한 검출 이미지 캡처 어떤 분할 등의 컴퓨터 비전에서 다른 작업 + +54 +00:03:36,349 --> 00:03:40,108 + 또는 어떤 다른 당신은 찾을 그가 분류 방법에 대해 알고 나면 그 + +55 +00:03:40,109 --> 00:03:43,569 + 다른이 이루어집니다 모든 그래서 당신이 알 수있을 것입니다 그것의 상단에 내장 단지 작은입니다 + +56 +00:03:43,568 --> 00:03:47,060 + 이 개념에 대한 정말 좋은, 그래서 좋은 위치는 다른 작업을 수행 할 수 + +57 +00:03:47,060 --> 00:03:50,840 + 이해하고 우리는 간단하게 구체적인 예로서 그 통해 작동합니다 + +58 +00:03:50,840 --> 00:03:54,819 + 처음에 일이 지금 왜이 문제는 하드 그냥 생각를 제공한다 + +59 +00:03:54,818 --> 00:03:58,518 + 문제는 우리가 거​​대한 여기에 의미 론적 차이로이 이미지를 참조 할 것입니다 + +60 +00:03:58,519 --> 00:04:01,739 + 숫자의 격자 이미지가 컴퓨터에 표시되는 방식이 있다는 + +61 +00:04:01,739 --> 00:04:06,299 + 세 오 기본적으로 가속화 세에 의해 약 300 백으로 말 + +62 +00:04:06,299 --> 00:04:09,620 + 적색, 녹색, 청색 세 가지 색상 채널에서 차원 배열과 열로 + +63 +00:04:09,620 --> 00:04:13,590 + 그래서 그 이미지의 일부를 확대 할 때 
기본적으로 거대한 중대하다 + +64 +00:04:13,590 --> 00:04:18,728 + 0에서 255 사이의 숫자 것은 그래서 우리는이 숫자와 함께 일해야 무엇 + +65 +00:04:18,728 --> 00:04:21,370 + 밝기의 양마다 모든 세 개의 컬러 채널을 나타낸다 + +66 +00:04:21,370 --> 00:04:25,569 + 단일 이미지의 위치 및 임의의 사양이되도록 이유 + +67 +00:04:25,569 --> 00:04:26,269 + 어려운 + +68 +00:04:26,269 --> 00:04:29,519 + 당신은 우리가 수백만처럼 괜찮은 작업해야 것에 대해 생각할 때 + +69 +00:04:29,519 --> 00:04:33,899 + 그 형태의 번호와 가지고는 신속하게 고양이 등을 분류하기 + +70 +00:04:33,899 --> 00:04:38,339 + 태스크의 복잡성은 명백 해졌다 그래서 예를 들면 카메라가 될 수있다 + +71 +00:04:38,339 --> 00:04:42,689 + 이 고양이를 중심으로 회전하며 확대 할 수 있고, 아무것도를 이동하지 않았다 + +72 +00:04:42,689 --> 00:04:46,769 + 초점 속성과 카메라가 다른 수행과에 대해 생각할 수있는 트랜스 액슬 + +73 +00:04:46,769 --> 00:04:49,769 + 무슨 일이 밝기 값으로 발생하고 실제로 모든 할만큼 큰 + +74 +00:04:49,769 --> 00:04:52,779 + 카메라와 함께 이러한 변환은 완전히 모든 패턴이 출시 될 예정이다 + +75 +00:04:52,779 --> 00:04:56,559 + 변경하고 우리는이 모두에 강력한 될 수 있습니다 많은 다른있다 + +76 +00:04:56,560 --> 00:05:00,709 + 예를 들어, 요금에 대한 문제는 여기에 조명까지 우리는 긴 고양이가 + +77 +00:05:00,709 --> 00:05:07,728 + 흰 고양이는 우리가 실제로 그 두 가지를 가지고 있지만 한 고양이가 넘어 당신은 볼 수 있습니다 + +78 +00:05:07,728 --> 00:05:11,098 + 명확하게 그것을 꽤 만든, 다른 하나는 아니지만 여전히 인식 할 수 + +79 +00:05:11,098 --> 00:05:14,750 + 두 고양이 등의 수준에 대해 다시 밝기 계곡을 생각한다 + +80 +00:05:14,750 --> 00:05:18,329 + 그는 모든 다른 모든 것들을 변화와 같이 그리드 무엇을 그들에게 발생 + +81 +00:05:18,329 --> 00:05:21,279 + 우리가 세상에서 가질 수있는 가능한 조명 방식은 견고하기 + +82 +00:05:21,279 --> 00:05:28,179 + 모두에게 많은 클래스를 형성 떨어져 이상한 많은 문제가 있음 + +83 +00:05:28,180 --> 00:05:33,668 + 이러한 개체의 배열은 매우 오는 캐스트 그렇게 인식하고 싶습니다 + +84 +00:05:33,668 --> 00:05:37,468 + 슬라이드와 다른 포즈 난 그들이 거기에 아주 건조이야을 만들 때 + +85 +00:05:37,468 --> 00:05:41,449 + 즉, 그래서 수학이 과학의 많은 내가 재미를 얻을 수있는 유일한 시간이다 내가 + +86 +00:05:41,449 --> 00:05:45,939 + 이러한 긍정의 모든 강력한으로 발생 그냥 어떻게 든 모든 + +87 +00:05:45,939 --> 00:05:50,189 + 당신은 여전히​​ 자신의 문제에도 불구하고 고양이이 모든 이미지를 인식 할 수 있습니다 + +88 +00:05:50,189 --> 00:05:54,240 + 그래서 가끔 우리는 원양가 표시되지 않을 수 있습니다하지만 당신은 여전히​​ 그건 인식 + +89 +00:05:54,240 --> 00:06:00,340 + 고양이 물 한 병 뒤에 고양이 택시는 소파 내부가 또한있다 + +90 +00:06:00,339 --> 00:06:06,068 + 도 기본적으로 거기에이 클래스의 10 개 조각을보고있는 것처럼 + +91 +00:06:06,069 --> 00:06:10,500 + 배경 혼란에 문제가 일들이 우리가 가지고있는 환경에 혼합 할 수 있습니다 + +92 +00:06:10,500 --> 00:06:15,300 + 그에게 상기시켰다 그래서 고양이 실제로도 내 수준의 변화 거기에있다 + +93 +00:06:15,300 --> 00:06:19,728 + 이 고양이 단지 종의 엄청난 양이다 그래서 그들은 다르게 보일 수 있습니다 + +94 +00:06:19,728 --> 00:06:23,240 + 나는 그래서 모두에게 당신의 상사와 방법은 감사하는 것처럼 + +95 +00:06:23,240 --> 00:06:26,718 + 우리는 이러한 독립적 중 하나 고려 작업의 복잡성은 어렵다 + +96 +00:06:26,718 --> 00:06:31,908 + 당신이 모든 다른 것들의 크로스 제품을 고려하고 그러나이 + +97 +00:06:31,908 --> 00:06:35,769 + 아무것도 전혀 작동하는지 실제로 아주 놀라운 것을 모두에서 작동합니다 + +98 +00:06:35,769 --> 00:06:39,539 + 사실 그것은 작동하지만 거의 여기 정말 잘 작동 않습니다뿐만 아니라, + +99 +00:06:39,540 --> 00:06:43,740 + 이 같은 카테고리의 정확성과 우리는 수십 밀리 초 단위로이 작업을 수행 할 수 있습니다 + +100 +00:06:43,740 --> 00:06:49,040 + 그래서 현재의 기술과 함께 그는이 클래스에 대한 자세한 내용은 무엇입니까 + +101 +00:06:49,040 --> 00:06:54,390 + 기본적으로 우리는 우리가하고 싶은 영역을 통해이 복용하고 같은 분류보기 + +102 +00:06:54,389 --> 00:06:57,539 + 클래스 레이블을 생성하고 내가 원하는 때 그는 더 있다는 것을 눈치없는거야 + +103 +00:06:57,540 --> 00:07:01,569 + 확실한 방법까지 실제로 인코딩하고 정액이 분류의이 권리 + +104 +00:07:01,569 --> 00:07:04,790 + 일찍 클래스에 모두 좋은 복용하는 말처럼 간단한 알고리즘은 없다 + +105 +00:07:04,790 --> 00:07:08,379 + 컴퓨터 과학 교육 과정 당신의 쓰기 거품 정렬 또는 당신이 뭔가를 작성하는 + +106 +00:07:08,379 --> 00:07:11,939 + 당신은 모든 가능한 단계에와 당신을 직관적으로 할 수 있습니다 특정 작업을 수행 할 다른 + +107 +00:07:11,939 --> 00:07:15,300 + 그것들을 열거하고이를 수 있습니다 여기에 함께 연주하고 그것을 분석 할 수 있지만, + +108 +00:07:15,300 --> 00:07:18,530 + 어떤 알고리즘이 모든 변화에서 고양이를 검출 없다 그것의있다 + +109 +00:07:18,529 --> 00:07:21,509 + 는 IS 당신이 실제로을 작성하는 방법에 대해 생각하기가 매우 어렵습니다 + +110 
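[Editor's aside: the captions above describe an image as a large 3-D grid of integers in [0, 255] over three color channels (the example given is roughly 300 by 100 pixels by 3 channels), which is why no hand-coded rule easily maps those numbers to a label. A small sketch of that representation; the random image stands in for a real photo:]

```python
import numpy as np

# A color image is just a 3-D array: height x width x 3 channels,
# each entry an integer brightness value in [0, 255].
img = np.random.randint(0, 256, size=(100, 300, 3), dtype=np.uint8)

print(img.shape)              # (100, 300, 3)
print(img[0, 0])              # one pixel: its [R, G, B] values
print(img.reshape(-1).shape)  # flattened: (90000,) raw numbers that
                              # a classifier must map to one label
```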
+00:07:21,509 --> 00:07:26,039 + 작업의 순서는의 고양이를 감지하는 임의의 이미지를 할 것 + +111 +00:07:26,040 --> 00:07:28,629 + 사람들이 시도하지 않은 말을하지 특히​​ 초기이 컴퓨터 만 + +112 +00:07:28,629 --> 00:07:32,719 + 나는 당신이 생각하는 경우 그들에게 전화하고 싶습니다 이러한 명시 적 접근이 있었다 + +113 +00:07:32,720 --> 00:07:37,240 + 내가 말할 수 없습니다에 대한 괜찮 그는 당신이 작은 귀를 찾아 만나고 싶을 것입니다 + +114 +00:07:37,240 --> 00:07:40,910 + 우리가 무엇을 할 거 야 그래서 개는 우리가 울트라 iso는 뜻을 가장자리 모든 가장자리를 감지 할 수 있습니다입니다 + +115 +00:07:40,910 --> 00:07:45,380 + 가장자리의 서로 다른 특성을 분류하고 그들의 접합 당신이 알고 만듭니다 + +116 +00:07:45,379 --> 00:07:48,350 + 우리가 보면 시즌 라이브러리는 자신의 준비를 찾으려고 할 것이다 + +117 +00:07:48,350 --> 00:07:52,150 + 우리는 몇 가지의 특정 질감을 볼 고양이를 검출 할 것 같은 것을 + +118 +00:07:52,149 --> 00:07:55,899 + 당신이 어떤 규칙을 가지고 올 수있는 특정 주파수는 고양이를 공격합니다 + +119 +00:07:55,899 --> 00:07:59,870 + 하지만 문제는 내가 좋아 말해 일단 내가 사실을 인식하고 싶습니다이다 + +120 +00:07:59,870 --> 00:08:03,569 + 보트 지금 또는 당신이 좋아처럼 다시 드로잉 보드에 아직 갈 사람 + +121 +00:08:03,569 --> 00:08:06,719 + 어떤 원본 페이지를 잘 완전히의 정확히 보트를 만든다 + +122 +00:08:06,720 --> 00:08:11,590 + 이 클래스를 삭제 압력으로 기소로 확장 접근과 + +123 +00:08:11,589 --> 00:08:16,699 + 우리가에 원하는 데이터 중심의 접근 방식으로 매우 잘 작동 방식 + +124 +00:08:16,699 --> 00:08:20,170 + 기계 학습의 프레임 워크는 단지 지적 것과 실제로 이러한 일 + +125 +00:08:20,170 --> 00:08:23,840 + 초기에 그들은이에 데이터 때문에 사용의 사치를하지 않았다 + +126 +00:08:23,839 --> 00:08:27,060 + 포인트는 시간에 당신은 매우 낮은 해상도의 당신의 그레이 스케일 이미지를 가지고있어 + +127 +00:08:27,060 --> 00:08:30,250 + 당신이 분명히 작동하지 않을거야 것을 인식하려고의 이미지 + +128 +00:08:30,250 --> 00:08:33,769 + 하지만 데이터의 인터넷 엄청난 양의 가용성과 나는를 검색 할 수 있습니다 + +129 +00:08:33,769 --> 00:08:38,460 + Google에서 고양이 예를 들어 나는 모든 곳에서 고양이를 많이 얻고 우리는 알고 + +130 +00:08:38,460 --> 00:08:42,840 + 거기에 많은 그래서 이러한 웹 페이지에서 주변 텍스트를 기반으로 고양이입니다 + +131 +00:08:42,840 --> 00:08:46,060 + 데이터의 방식 때문에 지금처럼 보이는이 우리가 훈련 얼굴을 가지고 있다는 것을 + +132 +00:08:46,059 --> 00:08:49,079 + 당신은 내게 캐스팅 훈련 샘플을 많이주는 곳 + +133 +00:08:49,080 --> 00:08:52,900 + 당신은 자신의 고양이에 대해 말해 당신은 내게 모든 유형의 예를 많이 제공 + +134 +00:08:52,899 --> 00:08:54,230 + 관심있는 다른 카테고리 + +135 +00:08:54,230 --> 00:08:59,920 + 나는 멀리 가야합니까 나는 모델은 클래스 모델 훈련 나는 그 것을 사용할 수 있습니다 + +136 +00:08:59,919 --> 00:09:04,250 + 모델은 실제로 내가 볼 수있는 새로운 이미지를 부여하고있어 그래서 데이터를 분류합니다 + +137 +00:09:04,250 --> 00:09:07,500 + 내 훈련 데이터와 난 그냥 패턴에 따라이 함께 뭔가를 할 수 + +138 +00:09:07,500 --> 00:09:13,759 + 매칭 및 통계 또는 간단한 예를 들어이 내에서 작동 할 수 있도록 사람 + +139 +00:09:13,759 --> 00:09:17,279 + 프레임 워크는 가장 가까운 이웃 분류 당신이 하나있어 방법을 고려 + +140 +00:09:17,279 --> 00:09:20,939 + 분류는 효과적으로 파괴 주어진 무역 센터는 것입니다 작동 + +141 +00:09:20,940 --> 00:09:23,970 + 뿐만 아니라 교육 시간을 그냥 모든 훈련 데이터 그래서 모두가 기억 + +142 +00:09:23,970 --> 00:09:27,820 + 훈련 데이터는 여기에 도착하고 당신이 나에게 테스트를 줄 때 나는 지금 그것을 기억 + +143 +00:09:27,820 --> 00:09:32,060 + 우리가 무엇을 할 거 야 이미지는 우리의 모든 하나 하나에 테스트 이미지를 비교하는 것입니다 + +144 +00:09:32,059 --> 00:09:36,729 + 이미지는 우리가 기차 데이터를보고 우리는 단지 내가 거를 통해 라벨을 전송합니다 + +145 +00:09:36,730 --> 00:09:41,149 + 그냥 통과로 특정 경우에 작동합니다 모든 이미지를 봐 + +146 +00:09:41,149 --> 00:09:43,740 + 나는 가능한 한 완전 좋아이 그래서 우리는 특정 도와 드리겠습니다 + +147 +00:09:43,740 --> 00:09:47,740 + 이 10가로 페르 인도라는 무언가의 경우는 오늘 장면을 설정 + +148 +00:09:47,740 --> 00:09:53,129 + 라벨은 다음에 액세스 할 수 50,000 훈련 이미지가 레이블 + +149 +00:09:53,129 --> 00:09:57,159 + 우리가 잘하는 방법을 평가하는거야 10 만 이미지의 테스트 세트가있다 + +150 +00:09:57,159 --> 00:10:00,669 + 분류기는 작업과 이러한 이미지는 그들이에 그냥 좀있어 아주 작은 있습니다 + +151 +00:10:00,669 --> 00:10:05,009 + 32 작은 썸네일 이미지로 (32)의 데이터 세트 대기 가까운 이웃 그래서 + +152 +00:10:05,009 --> 00:10:07,809 + 우리는 다른 사람들이 우리에게 주어진 한 모든 교육을로 분류가 작동합니다 + +153 +00:10:07,809 --> 00:10:12,589 + 오만은 그냥 나는 우리가이 10 개의 다른 사례가 있다고 가정 아니에요 + +154 +00:10:12,590 --> 00:10:15,920 + 여기에 우리가 볼 것입니다 무엇을 할 거 야 여기에서 첫 번째 호출을 따라 우리의 테스트 
이미지입니다 + +155 +00:10:15,919 --> 00:10:19,909 + 가장 유사한 사물의 트레이닝 세트의 가장 가까운 이웃까지 + +156 +00:10:19,909 --> 00:10:24,139 + 다만 독립적 그래서 거기에 그 모든 일이 당신은 이미지의 순위 목록을 보려면 + +157 +00:10:24,139 --> 00:10:30,220 + 모든 사람에게 그 (10)의 어느 하나에 트레이닝 데이터에 가장 유사하다고 + +158 +00:10:30,220 --> 00:10:32,700 + 저기 그 테스트 이미지의 첫 번째 행 있도록이 있다고 볼 수 있습니다 + +159 +00:10:32,700 --> 00:10:36,230 + 내가 생각하는 트럭은 테스트 이미지와 보면 대부분의 이미지가있다 + +160 +00:10:36,230 --> 00:10:40,490 + 여기서 비슷한 조그마한 비트를 찾을 당신이 할 수있는 방법을 정확하게 볼 수 그것과 유사 + +161 +00:10:40,490 --> 00:10:44,269 + 첫 번째 후퇴의 결과가 실제로 말을하지 트럭이고, 그건 볼 + +162 +00:10:44,269 --> 00:10:48,289 + 때문에 당신이 할 수 있도록 던져진 된 푸른 하늘의 단지 배치 + +163 +00:10:48,289 --> 00:10:52,480 + 이것은 아마 우리가이 정의 어떻게 잘 작동하지 않습니다 참조 + +164 +00:10:52,480 --> 00:10:55,470 + 우리가 실제로 비교를 어떻게 측정 여러 가지 방법 중 하나가 있습니다 + +165 +00:10:55,470 --> 00:10:59,940 + 및 이해 또는 맨하탄 있도록 간단한 방법은 맨해튼 거리를 수 있습니다 + +166 +00:10:59,940 --> 00:11:01,180 + 연구소의 거리 + +167 +00:11:01,179 --> 00:11:04,429 + 용어는 상호 교환 단순히 무엇을 당신이있어 테스트 이미지를 가지고있다 + +168 +00:11:04,429 --> 00:11:07,639 + 분류에 관심 고려 우리가 원하는 하나의 교육 이미지 + +169 +00:11:07,639 --> 00:11:11,919 + 우리가 무엇을 할 거 야 볼이 이미지를 비교하는 것은 우리가 요소 가격은 모든 비교 것입니다 + +170 +00:11:11,919 --> 00:11:15,959 + 찍어의 막대 사탕은 그러므로 절대 값의 차이를 형성하는 것이다 우리 + +171 +00:11:15,960 --> 00:11:20,040 + 그냥 우리가 모든 단일 위치 또는 감산을보고있는 모든 것을 추가 할 + +172 +00:11:20,039 --> 00:11:24,139 + 그것은 오프 차이가 추가 점점 더 특별한 위치에 어떤 참조 + +173 +00:11:24,139 --> 00:11:30,169 + 모든 걸 포기하고 우리의 유사성이다 그래서이 두 이미지는 그렇게 56 다른위한 + +174 +00:11:30,169 --> 00:11:33,809 + 우리가 코드를 보여주기 위해 여기에 동일한 이미지가 있다면 우리는 0을 얻을 + +175 +00:11:33,809 --> 00:11:36,959 + 구체적으로는이 같을 것이다 방법은 전체 구현은 + +176 +00:11:36,960 --> 00:11:42,930 + 가장 가까운 이웃 분류와 나는 두 사람의 실제 몸에 충전 곳 + +177 +00:11:42,929 --> 00:11:46,799 + 에 대해 이야기하고 우리가주는 것 같이 우리는 훈련 시간에 여기에 무엇을 + +178 +00:11:46,799 --> 00:11:52,709 + 레이블 그래서 용서 일반적으로 노트 모든 레이블 집합 X와 Y + +179 +00:11:52,710 --> 00:11:56,530 + 우리는 그냥 그냥 기억 클래스의 인스턴스 메소드에 할당 할 + +180 +00:11:56,529 --> 00:12:01,439 + 데이터 아무것도 우리가 여기서 무슨 일을하는지는 비록 내가 시간을 예측 수행되고 있지 + +181 +00:12:01,440 --> 00:12:06,080 + 우리는 이미지의 X의 영원 테스트 세트를 받고있어 나는 전체를 통과하지 않을거야 + +182 +00:12:06,080 --> 00:12:09,320 + 자세한하지만 당신은 모든 단일 테스트 이미지를 통해 for 루프 거기에 볼 수 있습니다 + +183 +00:12:09,320 --> 00:12:13,020 + 독립적으로 우리는 매일 훈련 이미지의 거리를 받고있어 + +184 +00:12:13,019 --> 00:12:18,360 + 그리고 그 하나의 벡터 라인 내가 그렇게 파이썬 코드를 사용하여 만의 통지 + +185 +00:12:18,360 --> 00:12:21,750 + 단 한 줄의 코드는 매 훈련이 테스트 이미지를 비교 하​​였다 + +186 +00:12:21,750 --> 00:12:26,370 + 내가 생각하고 이전 슬라이드에서이 거리를 계산 데이터베이스의 이미지 + +187 +00:12:26,370 --> 00:12:30,720 + 그 위기 코드는 그래서 우리가 그 4 루프를 소비하지 않았다 모두 그 + +188 +00:12:30,720 --> 00:12:35,860 + 처리 시스템에 참여하고 우리는 인스턴스를 계산하는 + +189 +00:12:35,860 --> 00:12:40,659 + 가장 가까운 그래서 우리는 색인에 가지고있는 교육의 인덱스를을 받고있어 + +190 +00:12:40,659 --> 00:12:45,719 + 가장 낮은 거리와 그 다음 우리는 단지이 이미지 레이블에 대한 예측됩니다 + +191 +00:12:45,720 --> 00:12:51,210 + 그래서 여기에 어떤 것은 가장 가까운 이웃 분류의 관점에서 당신을위한 질문 + +192 +00:12:51,210 --> 00:12:56,639 + 어떻게 그 속도는 무슨 일하는 것은 인 훈련 데이터의 크기에 따라 달라 않습니다 + +193 +00:12:56,639 --> 00:13:02,779 + 느린 훈련 장비를 확장 + +194 +00:13:02,779 --> 00:13:07,789 + 내가 만약 내가 그냥 가지고 있기 때문에 예는 실제로는 실제로 아주 천천히 맞아입니다 + +195 +00:13:07,789 --> 00:13:12,129 + 천천히 아래로 조금 그래서 독립적으로 하나 하나 훈련 샘플을 비교 + +196 +00:13:12,129 --> 00:13:16,370 + 우리는 클래스를 진행하면서이 거꾸로 실제로는 것을 실제로 이동 + +197 +00:13:16,370 --> 00:13:19,590 + 우리가 관심을 우리가 정말 가장 실용적인 응용 프로그램에 대한 관심이 무엇 때문에 + +198 +00:13:19,590 --> 00:13:23,330 + 이 분류의 시험 시간 성능에 대해 그것은 우리가 원하는 것을 의미합니다 + +199 +00:13:23,330 --> 00:13:27,240 + 클래스는이 시점에서 매우 효율적 그래서 정말 사이의 트레이드 오프있을 수 있습니다 + +200 +00:13:27,240 --> 00:13:30,419 + 우리는 
기차 방식에 넣고 우리는 좋은에서 얼마나 넣을까요 얼마나 많은 컴퓨터 + +201 +00:13:30,419 --> 00:13:35,240 + 가장 가까운 이웃 기차 인스턴트하지만 그것은 우리가 거​​ 같은 시험 비싼 및 + +202 +00:13:35,240 --> 00:13:38,570 + 곧 볼이 실제로 주변이 완전히 다른 방법으로 플립있어 올 + +203 +00:13:38,570 --> 00:13:41,510 + 우리가 컴퓨팅의 엄청난 양의 훈련을 할 것이다 기차 시간을 볼 것이다 + +204 +00:13:41,509 --> 00:13:45,409 + 상업용 네트워크의 시스템 성능은 것 실제로 매우 효율적일 것이다 + +205 +00:13:45,409 --> 00:13:49,589 + 상수와 하나 하나 테스트 이미지 컴퓨팅의 일정 금액을 수 + +206 +00:13:49,590 --> 00:13:53,149 + 당신 만 수십억 또는 수조가있는 경우에 상관없이 계산의 양 + +207 +00:13:53,149 --> 00:13:57,669 + 난 그냥 해요 훈련은 내가 조 조 조 조를 그냥 가지고 싶습니다 + +208 +00:13:57,669 --> 00:14:01,579 + 아무리 크거나 무역 적자가 전체 사용자의 컴퓨터 작업을 수행하는 방법 + +209 +00:14:01,580 --> 00:14:05,250 + 즉 실질적으로 말하기 아주 좋은, 그래서 단일 테스트 샘플을 분류 + +210 +00:14:05,250 --> 00:14:10,370 + 지금은 그냥 세이버 여기 가속화하는 방법이 있다는 것을 지적하고자합니다 + +211 +00:14:10,370 --> 00:14:13,669 + 이 대략 가까운 이웃 방법 거기 분류는 같은 계획 + +212 +00:14:13,669 --> 00:14:16,879 + 사람들이 그 연습을 위해 사용 예제 라이브러리는 속도를 할 수 있습니다 + +213 +00:14:16,879 --> 00:14:22,909 + 가까운 이웃이 과정은 일치하지만 확인 그냥 보조 노트의 + +214 +00:14:22,909 --> 00:14:27,490 + 그래서 우리는 우리가 정의한 것을보고 분류의 디자인으로 돌아 가자 + +215 +00:14:27,490 --> 00:14:32,200 + 내가 임의로 선택이 거리는 당신에게 맨해튼 거리를 표시 할 + +216 +00:14:32,200 --> 00:14:35,720 + 할 수있는 많은 방법이 사실상 존재 절대 값의 차이를 비교 + +217 +00:14:35,720 --> 00:14:38,879 + 측정 거리를 공식화 등의 다양한 선택이있다 + +218 +00:14:38,879 --> 00:14:42,700 + 우리는이 비교를 정확히 어떻게 다른 사람들에게 다른 선택의 여지가 심 + +219 +00:14:42,700 --> 00:14:46,000 + 실제로 사용하려면 우리가 유클리드 울트라 거리입니다 부르는하다 + +220 +00:14:46,000 --> 00:14:49,850 + 대신에 이러한 차이의 제곱합의 차이를 요약 + +221 +00:14:49,850 --> 00:14:55,690 + 이미지 등이 선택 사이 + +222 +00:14:55,690 --> 00:15:02,730 + 그 이상이 다시 사람 + +223 +00:15:02,730 --> 00:15:07,850 + 확인 그래서 어떤 방법을 정확하게 컴퓨터 거리의이 선택은 이산 선택이다 + +224 +00:15:07,850 --> 00:15:11,769 + 우리는 우리가 차 하이퍼라는 뭔가를 제어 할 수 있는지 정말 아니에요 + +225 +00:15:11,769 --> 00:15:14,990 + 분명 당신이 그것을 설정하는 방법은 우리가 나중에 결정해야 하이퍼 매개 변수의 + +226 +00:15:14,990 --> 00:15:19,120 + 정확히 그들이 얘기하자 하이브리드 차의이 어떻게 든 다른 종류를 설정하는 방법 + +227 +00:15:19,120 --> 00:15:22,828 + 우리가 가지고 가까운 이웃을 일반화하는 경우에 대한 분류의 맥락에서 + +228 +00:15:22,828 --> 00:15:26,159 + 우리는 가장 가까운 이웃 분류 AK 전화 무엇 케 horas 이웃에 너무 + +229 +00:15:26,159 --> 00:15:29,328 + 모든 테스트를 위해 검색하는 분류는 가장 가까운 하나의 일치 + +230 +00:15:29,328 --> 00:15:33,958 + 사실 몇 가지 예를 후퇴합니다 예를 양성하고 새로운있을 것이다 + +231 +00:15:33,958 --> 00:15:37,069 + 가장 가까운 이상의 다수결은 실제로 모든 테스트 인스턴스를 분류합니다 + +232 +00:15:37,070 --> 00:15:41,829 + 그래서 이웃이 우리가에있는 5 가장 유사한 이미지를 검색하는 것 말 + +233 +00:15:41,828 --> 00:15:45,528 + 훈련 데이터와 레이블의 과반수 투표를하고 여기에 간단 + +234 +00:15:45,528 --> 00:15:48,970 + 그래서 여기에 요점을 설명하기 위해 설정이 차원 데이터 우리는 세 클래스가 + +235 +00:15:48,970 --> 00:15:53,430 + 데이터 세트 및 2D와 여기에 우리가 결정 영역이 부르는 그리기입니다 + +236 +00:15:53,429 --> 00:15:57,429 + 여기에 가장 가까운 이웃 분류는 정말 훈련을받은되는 의미 + +237 +00:15:57,429 --> 00:16:02,838 + 우리는 전체를 색칠하고 거기 무슨 수업이 가까운에 의해 비행기에서 내리다하기 + +238 +00:16:02,839 --> 00:16:05,430 + 모든 단일 지점은 그렇지 가정 기호 이웃 분류 + +239 +00:16:05,429 --> 00:16:08,698 + 당신은 단지이 것이라고 말하는 것보다 더 여기에 몇 가지 테스트 예를 들어 있다고 가정 + +240 +00:16:08,698 --> 00:16:12,549 + 당신이 개인 얻는 가장 가까운 이웃을 기반으로 푸른 클래스로 분류 된 + +241 +00:16:12,549 --> 00:16:16,708 + 점에 유의 푸른 클러스터 내부의 녹색 지점 인 점이며, + +242 +00:16:16,708 --> 00:16:19,708 + 그것이 많은의 분류했을 클래스 자체의 작은 영역을 갖는다 + +243 +00:16:19,708 --> 00:16:23,750 + 무엇이든 자신보다 그에게 경우 때문에 테스트는 녹색으로 주위에 배치 + +244 +00:16:23,750 --> 00:16:27,879 + 가장 가까운 이웃의 녹색 점은 이제 애에 대한 높은 숫자로 이동할 때 + +245 +00:16:27,879 --> 00:16:30,809 + 당신이 무엇을 발견 같은 오년 이웃 분류기는 경계입니다 + +246 +00:16:30,809 --> 00:16:36,619 + 한 거기 심지어 어디 좋은 효과 그것의 종류 부드럽게 시작 + +247 +00:16:36,620 --> 00:16:37,339 + 포인트 + +248 
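[Editor's aside: the captions from roughly 00:10 to 00:16 describe the (k-)nearest-neighbor classifier shown on the slides: train() simply memorizes all training data, predict() computes the distance from each test image to every training image in one vectorized line, and for k > 1 the k closest labels take a majority vote. A sketch consistent with that description follows; the class name and exact signatures are my paraphrase, not the verbatim slide code:]

```python
import numpy as np

class KNearestNeighbor:
    def train(self, X, y):
        # X: (num_train, D) flattened images, y: integer class labels.
        # Training just memorizes the data: fast train, slow test,
        # the opposite trade-off of the convnets discussed later.
        self.Xtr, self.ytr = X, y

    def predict(self, X, k=1, metric='l1'):
        y_pred = np.zeros(X.shape[0], dtype=self.ytr.dtype)
        for i in range(X.shape[0]):
            if metric == 'l1':   # Manhattan: sum of absolute diffs
                d = np.sum(np.abs(self.Xtr - X[i]), axis=1)
            else:                # Euclidean: sqrt of sum of squares
                d = np.sqrt(np.sum((self.Xtr - X[i]) ** 2, axis=1))
            knn = self.ytr[np.argsort(d)[:k]]      # k closest labels
            y_pred[i] = np.bincount(knn).argmax()  # majority vote
        return y_pred
```

[Note that, as the transcript stresses, prediction cost grows linearly with the size of the training set, which is why this classifier is rarely used in practice.]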
+00:16:37,339 --> 00:16:41,550 + 가지 무작위 실제로 아니다 푸른 클러스터의 소음과 아웃 라이어로 + +249 +00:16:41,549 --> 00:16:44,539 + 우리는 항상 다섯을 치료하기 때문에 너무 많은 예측을 채용 + +250 +00:16:44,539 --> 00:16:49,679 + 가장 가까운 이웃은 그래서 그들은 실제로 그렇게 그린 포인트를 압도 얻을 + +251 +00:16:49,679 --> 00:16:53,088 + 당신이를 찾을 수 있습니다 일반적으로 여름 분류는 더 나은 제공 할 수 있습니다 + +252 +00:16:53,089 --> 00:16:58,180 + 이제 US시 성능 그러나 다시 K의 선택은 다시 경계 인 하이퍼 + +253 +00:16:58,179 --> 00:17:03,088 + 내가 조금이 다시 올 거 바로 그래서 당신이보기의 예를 보여 + +254 +00:17:03,089 --> 00:17:06,169 + 여기처럼 나는 그들에 의해 위를 기록하고 열 가장 유사한 예를 반환하고있어 자신의 + +255 +00:17:06,169 --> 00:17:08,939 + 거리와 실제로 이러한 훈련을 통해 과반수 투표를 할 것 + +256 +00:17:08,939 --> 00:17:13,089 + 여기 예제는 여기에 모든 테스트 예제를 분류합니다 + +257 +00:17:13,088 --> 00:17:20,649 + 확인 그래서 여기에 단지의 정확성이 무엇인지 고려의 질문의 비트를하자 + +258 +00:17:20,650 --> 00:17:24,259 + 우리는 유클리드을 사용하는 훈련 데이터에 대한 분류의 북쪽 + +259 +00:17:24,259 --> 00:17:29,700 + 나는 우리의 테스트 세트는 정확히 훈련 데이터입니다 가정, 우리가있어 너무 거리 + +260 +00:17:29,700 --> 00:17:32,580 + 우리는 얼마나 자주를 얻을 얼마나 많은 즉 정확성을 찾기 위해 노력 + +261 +00:17:32,579 --> 00:17:34,750 + 정답 + +262 +00:17:34,750 --> 00:17:44,808 + 많은 중 확인을 백 퍼센트의 좋은 예 맞습니다 그래서 우리는 항상 찾을 수있어 + +263 +00:17:44,808 --> 00:17:48,450 + 가 자신이 수행하고 정확하게 테스트의 상단에 기차 예 + +264 +00:17:48,450 --> 00:17:52,870 + 다음과 같은 우리가 맨해튼을 사용하는 경우 무엇을 통해 전송 될 것 + +265 +00:17:52,869 --> 00:18:00,949 + 거리가 + +266 +00:18:00,950 --> 00:18:04,680 + 맨해튼 거리가 약간의 절대 값은 당신에게 있습니다 제곱의 합을 필요로하지 않는다 + +267 +00:18:04,680 --> 00:18:12,110 + 차이에서 그것은 단지 문제는 좋은 같은 것 것 것 + +268 +00:18:12,109 --> 00:18:14,169 + 여름이나 유지 + +269 +00:18:14,170 --> 00:18:18,820 + 관심을 확인 이웃 분류가 훈련 왕의 정확성은 무엇인가 + +270 +00:18:18,819 --> 00:18:25,339 + 반드시 때문에하지 백 %가 그것을 경우 케이블 장소입니다 + +271 +00:18:25,339 --> 00:18:29,230 + 기본적으로 주변의 포인트는 당신이 당신의 최선을 압도 할 수 + +272 +00:18:29,230 --> 00:18:35,269 + 예를 들어 유리 떨어져 실제로 괜찮 그래서 우리는 서로 다른 두 가지 선택을 논의했습니다 + +273 +00:18:35,269 --> 00:18:39,740 + 전제 우리가이 경우에 릭에게 높은 압력을 만난 우리는 어떻게 확실하지 않다 + +274 +00:18:39,740 --> 00:18:45,160 + 그렇게 우리는 이러한 설정하는 방법을 정확하게 확실하지에 1 23 10 이렇게해야 설정 + +275 +00:18:45,160 --> 00:18:48,750 + 사실에 따라 자신의 문제는 당신이 지속적으로 찾을 수 없습니다 찾을 수 있습니다 + +276 +00:18:48,750 --> 00:18:52,250 + 일부 경우에 볼 수있는 몇 가지 응용 프로그램에서 이러한 높은 전제를위한 최선의 선택 + +277 +00:18:52,250 --> 00:18:56,930 + 우리가 이것을하도록 설정하는 방법을 정말 확실하지 않도록 다른 응용 프로그램보다 더 나은 + +278 +00:18:56,930 --> 00:19:00,799 + 여기에 우리가 기본적으로 난 그래서 다른 프라이머의 제비를 시도 할 생각이다 + +279 +00:19:00,799 --> 00:19:05,649 + 거 내가 나의 기차 데이터를 데리고 갈거야 다음 내가 많이 시도거야과 같이 + +280 +00:19:05,650 --> 00:19:11,550 + 다른 매개 변수의 그래서 난 그냥 죽을 수와 나는 케이블 123456 2800 I을 시도 + +281 +00:19:11,549 --> 00:19:14,529 + 즉 가장 적합한 무엇이든 모든 피고인 메트릭을 시도하고는 내가 정액의 + +282 +00:19:14,529 --> 00:19:26,670 + 그래서 괜찮 때문에 좋은 생각에 아주 잘 오른쪽 거짓말을 작동 걸릴 + +283 +00:19:26,670 --> 00:19:36,170 + 그래서 기본적으로 그래서 기본적으로 네 그래서 테스트 데이터에 대한 프록시 당신의 + +284 +00:19:36,170 --> 00:19:40,039 + 신뢰하지 않아야 주문 그들의 일반화해야 테스트 데이터에 + +285 +00:19:40,039 --> 00:19:43,509 + 당신은 당신이 이제까지 STATA에있는 잊지해야 사실은 그래서주는 일을했다 + +286 +00:19:43,509 --> 00:19:46,079 + 데이터 집합은 항상 당신이 그것을 필요가 없습니다 척 유언자를 따로 + +287 +00:19:46,079 --> 00:19:50,129 + 그게 당신을 말하고 어떻게 것입니다 보이지 않는 데이터 점에 일반화 당신의 장기와 + +288 +00:19:50,130 --> 00:19:52,730 + 당신이 당신의 알고리즘을 개발하기 위해 노력하고 있기 때문에 중요합니다 그리고 당신은있어 + +289 +00:19:52,730 --> 00:19:56,120 + 결국 지구와 몇 가지 설정을 할 희망 당신은 이해 좋아 + +290 +00:19:56,119 --> 00:20:01,159 + 정확히 할 것입니다 어떻게 연습 오른쪽 작업이 예상 그래서 당신은 볼 수 있습니다 + +291 +00:20:01,160 --> 00:20:03,830 + 예를 들어 때때로 당신은 매우 매우에 대해 잘 구성 수행 할 수있는 + +292 +00:20:03,829 --> 00:20:05,579 + 아주 잘 일반화 그것을 테스트하지 + +293 +00:20:05,579 --> 00:20:08,659 + 당신은 28 요구 사항 29에 의해 사람이 많이 지나친 것 + +294 +00:20:08,660 --> 
00:20:11,750 + 이 클래스는, 그래서 당신은 가장이 질병에 매우 잘 알고 있어야합니다 + +295 +00:20:11,750 --> 00:20:16,519 + 범위이 당신을 위해 좀 더 많은 개요 그러나 기본적으로이 테스트입니다 + +296 +00:20:16,519 --> 00:20:20,940 + 데이터는 우리가 할 대신 무엇을 잊지 매우 드물게 사용된다 + +297 +00:20:20,940 --> 00:20:25,930 + 우리가 안전하게를 사용으로 구분하기 때문에 우리가 주름을 부르는에 우리의 훈련 데이터를 분리 + +298 +00:20:25,930 --> 00:20:29,900 + 우리는 같은 트레이닝 데이터의 20 %를 사용할 수 있도록 5 배 검증 + +299 +00:20:29,900 --> 00:20:35,120 + 데이터 및 그것의 우리는 교육 부분을 상상하고 우리는 우리의 테스트 + +300 +00:20:35,119 --> 00:20:39,279 + 그냥 내가 갈거야 그렇게 설정이 검증에 주로 적용되는 두 가지 선택이 + +301 +00:20:39,279 --> 00:20:42,569 + 내 전화에 훈련과 다른 경우 일부의 모든 앞을 시도 + +302 +00:20:42,569 --> 00:20:45,329 + 성직자와 아직 대략 가까운 이웃를 사용하는 경우 다른 어떤 + +303 +00:20:45,329 --> 00:20:48,750 + 당신이 그것을 시도 많은 다른 선택은 밖으로 그 검증 데이터에 가장 적합한 참조 + +304 +00:20:48,750 --> 00:20:51,859 + 당신이 불편하게 느끼는 경우는 거의 훈련 데이터 포인트를 가지고 있기 때문에 + +305 +00:20:51,859 --> 00:20:54,939 + 당신이 실제로 얻을 곳 사람들은 때때로 교차 유효성 검사를 사용 + +306 +00:20:54,940 --> 00:20:58,640 + 테스트 검증의 선택이 이러한 선택을 통해 뽑아 평가하기 + +307 +00:20:58,640 --> 00:21:03,840 + 그래서 내가 먼저 내 훈련 (124)에 사용할 수 있습니다 나는 다섯에 시도하고 + +308 +00:21:03,839 --> 00:21:07,519 + 검증의 선택이 모든 다섯 선택과 I에서 뽑아 순환 + +309 +00:21:07,519 --> 00:21:11,789 + 내 테스트 배의 모든 가능한 선택을 통해 가장 적합한보고 + +310 +00:21:11,789 --> 00:21:14,839 + 그때는 모든 가능한 시나리오를 통해 가장 적합한 무엇이든 취할 + +311 +00:21:14,839 --> 00:21:19,039 + 그건 선두 주자 교차 검증 고정 나사 검증의 연습 + +312 +00:21:19,039 --> 00:21:21,769 + 이 그들과 같을 것이다 방법은 가장 가까운 이웃을 위해 K에 대한 크로스 건물이었다 + +313 +00:21:21,769 --> 00:21:26,049 + 분류 우리는 K의 다른 값을 시도하고 이것이 우리입니다 + +314 +00:21:26,049 --> 00:21:31,690 + 겹의 다섯 가지 선택에서 성능이 그래서 당신은 모든에 대한 것을 볼 수 있습니다 + +315 +00:21:31,690 --> 00:21:35,759 + 하나의 경우 우리가 5 개의 데이터 지점을 가지고 다음이 정밀도가 그렇다 + +316 +00:21:35,759 --> 00:21:40,240 + 높은 좋은 내가 평균 분석가 숀 Arce에 대한 통하여 선을 음모를 꾸미고 있어요 + +317 +00:21:40,240 --> 00:21:44,190 + 표준 편차는 그래서 우리는 여기에서 볼 성능이 위로가는 것입니다 + +318 +00:21:44,190 --> 00:21:49,240 + 이 여론 조사에서 당신은 가서 같이하지만, 어떤 점에서 스타는 말했다 이것을 그래서 나도 몰라 + +319 +00:21:49,240 --> 00:21:53,460 + 특정 데이터 세트는 그게 그래서 7과 동일 K 최고의 선택이 될 것 같다 무엇 + +320 +00:21:53,460 --> 00:21:58,440 + 나는 대칭과 내가 할에 모든 내 hyperemesis에 대해이 작업을 수행 할 수 있습니다 내 + +321 +00:21:58,440 --> 00:22:03,650 + 내가 약속 교차 검증은 내가 그들을 시험에서 하나의 시간을 평가하고 수정했다 + +322 +00:22:03,650 --> 00:22:07,800 + 내가 그에 도착 사이트와 어떤 수는 내가 여덟 정확도로보고 무엇을 + +323 +00:22:07,799 --> 00:22:11,490 + 왕이나의 종이로가는 무슨이 데이터 세트에 대한 몇 가지 분류 + +324 +00:22:11,490 --> 00:22:15,539 + 무엇의 최종 일반화 결과만큼 우리의 최종 보고서로 전환 + +325 +00:22:15,539 --> 00:22:16,519 + 무슨 짓을했는지 + +326 +00:22:16,519 --> 00:22:36,048 + 이에 대한 질문은 기본적으로는 분포의 통계에 관하여 + +327 +00:22:36,048 --> 00:22:42,378 + 라벨에 이러한 데이터 포인트 당신의 얼굴에 그래서 때때로의 그것은 하드의 + +328 +00:22:42,378 --> 00:22:47,769 + 이 그림 반면 얻을처럼 당신으로 발생하는 약을 확인 말 + +329 +00:22:47,769 --> 00:22:52,209 + 더 청결과 더 많은 경우를 얻을 수 있으며 얼마나 clunkier 데이터에 의존 + +330 +00:22:52,209 --> 00:22:55,129 + 이 내려 오는 것을 정말 서비스는 어떻게 + +331 +00:22:55,128 --> 00:23:01,569 + 로비를하거나 그것을 어떻게 특정 그게 매우 편리 대답은 알고 있지만 그건 + +332 +00:23:01,569 --> 00:23:04,769 + 그 때문에 다른 데이터 세트는 다른 것에 와서 무엇을 대략 무엇 + +333 +00:23:04,769 --> 00:23:27,230 + 지금 우리를 클릭 + +334 +00:23:27,230 --> 00:23:31,769 + 때문에 + +335 +00:23:31,769 --> 00:23:37,308 + 다른 다른 데이터 세트는 다른 선택을 필요로해야합니다 + +336 +00:23:37,308 --> 00:23:40,629 + 실제로 다른 알고리즘을 시도하는 경우에 작동하는 무슨이 가장 참조 + +337 +00:23:40,630 --> 00:23:43,580 + 당신은 당신의 데이터의 선택에서 가장 잘 작동하는 무슨 일이 일어나고 있는지 확실하지 않은 당신의 + +338 +00:23:43,579 --> 00:23:47,699 + 당신이 어떤 작품 그냥 확실하지 않도록하기 위해 하이퍼 망치 같은 종류의도 + +339 +00:23:47,700 --> 00:23:52,019 + 다른 접근 방법이 다를 수 있습니다 + +340 +00:23:52,019 --> 00:23:55,190 + 일반화 경계는 서로 다른 모양과 일부 데이터가 설정하는 + +341 
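[Editor's aside: the captions above describe 5-fold cross-validation for choosing the hyperparameter k: split the training data into folds, cycle each fold through the validation role, and pick the k with the best mean accuracy (k = 7 in the lecture's plot). A sketch under those assumptions, reusing the KNearestNeighbor class from the earlier sketch; the data and the list of k values are illustrative, not the lecture's:]

```python
import numpy as np

# Hypothetical stand-in data; in the lecture this is the CIFAR-10 training set.
Xtr = np.random.randn(500, 3072)
ytr = np.random.randint(0, 10, 500)

num_folds = 5
k_choices = [1, 3, 5, 7, 10, 20, 50, 100]
X_folds = np.array_split(Xtr, num_folds)
y_folds = np.array_split(ytr, num_folds)

for k in k_choices:
    accs = []
    for f in range(num_folds):
        # Fold f is held out for validation; the rest is used for training.
        X_val, y_val = X_folds[f], y_folds[f]
        X_fit = np.concatenate([X_folds[j] for j in range(num_folds) if j != f])
        y_fit = np.concatenate([y_folds[j] for j in range(num_folds) if j != f])
        knn = KNearestNeighbor()   # class from the sketch above
        knn.train(X_fit, y_fit)
        accs.append(np.mean(knn.predict(X_val, k=k) == y_val))
    print('k = %d, mean accuracy = %.3f' % (k, np.mean(accs)))
```

[The test set is touched exactly once, at the very end, with the chosen k; that held-out number is what gets reported, which is the point the lecturer makes about not tuning on test data.]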
+00:23:55,190 --> 00:23:58,330 + 다른 것보다 앞 구조 몇 가지 다른 사람보다 더 잘 작동 + +342 +00:23:58,329 --> 00:24:05,298 + 그냥 난 그냥 그 왕 또는 더 나쁜 뭔가를 가리 키도록 좋아 좋아 밖으로 시도 실행 + +343 +00:24:05,298 --> 00:24:09,389 + 아무도 기본적으로 그냥 사용하지 않는이 통과하는이 일요일를 사용하지입니다 + +344 +00:24:09,390 --> 00:24:12,480 + 이 훈련은 단지 분할 등 작동 정말 방법이 방법 + +345 +00:24:12,480 --> 00:24:13,450 + ...에 + +346 +00:24:13,450 --> 00:24:17,610 + 그 이유는 이것이 우선 매우 비효율적이기 때문으로 사​​용하지 않고, + +347 +00:24:17,609 --> 00:24:21,139 + 모든이의 두 번째 내 트랙 차원 높은 모든 이미지입니다 + +348 +00:24:21,140 --> 00:24:28,179 + 개체는 그들은 내가에서 촬영 한 적이 매우 부자연스럽고 직관적 인 방법을 행동 + +349 +00:24:28,179 --> 00:24:32,370 + 순서는 제한하고 나는 세 가지 방법으로 변경하지만이 모든 세 + +350 +00:24:32,369 --> 00:24:37,168 + 여기에 서로 다른 이미지는 L 실제로이 하나의 동일한 거리를 + +351 +00:24:37,169 --> 00:24:42,100 + 유클리드 감각에 난 그냥 여기 사람이 약간에 이동에 대해 생각하는 + +352 +00:24:42,099 --> 00:24:46,359 + 그것은 약간 떨어있어 그것은이 여기의 왼쪽 이유로 인해 완전히 다른 + +353 +00:24:46,359 --> 00:24:49,329 + 이러한 픽셀은 정확 하 게 일치하지 않습니다 그것은 모든 모든을 소개하는 것 + +354 +00:24:49,329 --> 00:24:53,109 + 당신은 작은을 얻을 수 있도록이 하나가 약간 어둡게하여 점점 거리의 오류 + +355 +00:24:53,109 --> 00:24:57,629 + 모든 특별 행사를 통해 델타이 하나의 손길이 닿지 않은 60 거리 ERES입니다 + +356 +00:24:57,630 --> 00:25:01,650 + 사방에서 저기 그 위치를 제외하고는 촬영 + +357 +00:25:01,650 --> 00:25:05,900 + 아웃 임계 이미지 조각과 가장 가까운 이웃 분류를하지 않는다 + +358 +00:25:05,900 --> 00:25:08,030 + 정말 이러한 설정의 차이를 말할 수 없습니다 + +359 +00:25:08,029 --> 00:25:11,230 + 이이 거리를 기반으로하기 때문에 그 정말이 아주 잘 작동하지 않는다 + +360 +00:25:11,230 --> 00:25:16,009 + 당신은 매우에 거리를 던져하려고 할 때 경우 이렇게 아주 직관적 일이 일어날 + +361 +00:25:16,009 --> 00:25:21,349 + 우리가 지금까지 요약에 그렇게 존재하지 않는 이유를 부분적으로의 높은 차원 객체 + +362 +00:25:21,349 --> 00:25:26,230 + 우리는 이러한 분류에서 서로 다른 두를 포함하는 특정한 경우를 찾고 + +363 +00:25:26,230 --> 00:25:29,679 + 나중에 엔지니어 이웃 분류의 클래스와 아이디어의 설정 + +364 +00:25:29,679 --> 00:25:33,110 + 최대 데이터를 다른 분할을 갖고 우리는 이러한 고압 호스를 그 + +365 +00:25:33,109 --> 00:25:37,240 + 선택해야하고 우리는이 일반적으로 대부분의 크로스 기반을 사용합니다 + +366 +00:25:37,240 --> 00:25:39,909 + 시간 사람들은 단지 하나가 실제로 전체 교차 유효성 검사를 수행 + +367 +00:25:39,909 --> 00:25:40,519 + 확인 + +368 +00:25:40,519 --> 00:25:43,778 + 그들은 높은 측면에서 가장 적합한 어떤 검증 세트에 시도 + +369 +00:25:43,778 --> 00:25:47,999 + 전제하고 가장이 예비 선거를 일단 당신은 하나에 리드를 + +370 +00:25:47,999 --> 00:25:54,569 + 세입자는 그냥 분류에 갈하지만 질문에있어 그렇게 말했다 + +371 +00:25:54,569 --> 00:26:04,229 + 나는 우리가 텔레 노어 분류 보는거야 좋은 볼이 시점이 인 + +372 +00:26:04,229 --> 00:26:07,649 + 우리는이를 알 수있을 것입니다 상업 네트워크를 향해 작업을 시작하는 지점 + +373 +00:26:07,648 --> 00:26:11,148 + 강의 시리즈는 최대 만들 것이다 분류를 혼란 것 + +374 +00:26:11,148 --> 00:26:15,888 + 전체 상용 네트워크 분석 이미지는 그냥 그 동기를 말씀 드리고 + +375 +00:26:15,888 --> 00:26:20,178 + 작업 별보기에서 클래스 어제이 클래스는 컴퓨터 비전 클래스입니다 + +376 +00:26:20,179 --> 00:26:25,489 + 기계 사이트 될 것이 클래스 동기를 부여하기 위해 다른 방법을 제공에 관심 + +377 +00:26:25,489 --> 00:26:29,409 + 어떤 의미에서보기의 모델 기반 관점에서 그 우리는 너희들을 제공하고 + +378 +00:26:29,409 --> 00:26:34,339 + 배관 및 전기에 대한 사람 보는이 멋진 알고리즘은 + +379 +00:26:34,338 --> 00:26:38,178 + 당신은 단지 일부, 특히 위에 다양한 요구에 적용 할 수있는 + +380 +00:26:38,179 --> 00:26:42,469 + 지난 몇 년 동안 우리는 신경 네트워크는 그 무엇입니까 볼 수없는 것을보고 + +381 +00:26:42,469 --> 00:26:46,479 + 이 클래스에 대해 많은 것을 배울 것이다 그러나 그는 또한 여기에 꽤있다 + +382 +00:26:46,479 --> 00:26:50,828 + 당신이 휴대 전화로 이야기 할 때 음성 인식은 지금 그들이 할 수있는 작동하지 않습니다 + +383 +00:26:50,828 --> 00:26:56,678 + 또한 그래서 여기에 기계 번역을 당신은 세트의 신경 네트워크를 먹이 + +384 +00:26:56,679 --> 00:27:00,700 + 영어 하나 신경망로 단어 하나는 번역을 생산 + +385 +00:27:00,700 --> 00:27:05,328 + 인쇄 또는 어떤 다른 대상 언어에 그렇게 제어를 수행 할 필요가 + +386 +00:27:05,328 --> 00:27:09,308 + 우리는 당신의 네트워크 응용 프로그램을 볼 수 및 로봇 조작에 조작 한 + +387 +00:27:09,308 --> 00:27:14,209 + 및 파티 이익의 직장에서 재생하면 확인하여 세 가지 게임을 바로 재생하는 방법 + +388 +00:27:14,209 --> 
00:27:18,089 + 로켓은 화면을 설정하고 우리는 매우 성공적인 것으로 보인다 + +389 +00:27:18,088 --> 00:27:23,878 + 도메인의 다양성과 여기에 조금보다 더 우리는 정확히 확실치 + +390 +00:27:23,878 --> 00:27:27,988 + 어디이 우리를 취할 것 그리고, 나는 또한 우리가 탐구하고 있다는 말을하고 싶습니다 + +391 +00:27:27,989 --> 00:27:31,749 + 이것은 매우 헨리 VIII이라고 생각합니까 가사에 대한 방법은 소망 적 사고입니다 + +392 +00:27:31,749 --> 00:27:35,700 + 하지만 어쩌면 그들은뿐만 아니라 그렇게 할 수있는 몇 가지 힌트가있다 + +393 +00:27:35,700 --> 00:27:39,479 + 그들이 연주하는 재미 모듈 일이기 때문에 신경 네트워크는 아주 좋은입니다 + +394 +00:27:39,479 --> 00:27:42,450 + 나는이 사진의 자신의 네트워크와 I 종류의 작업에 대해 생각할 때와 + +395 +00:27:42,450 --> 00:27:46,548 + 여기에 나를 위해 마음에 오는 우리는 신경 네트워크 개업이 그녀입니다 + +396 +00:27:46,548 --> 00:27:51,519 + 보이는 것을 만드는 것은이 시점에서 대략 10 층이 될 수 있습니다 + +397 +00:27:51,519 --> 00:27:55,269 + 정말 자신의 외모와 함께 연주에 대해 생각하는 가장 좋은 방법은 매우 재미 + +398 +00:27:55,269 --> 00:27:58,619 + 레고 블록처럼 우리가이 작은 기능 조각을 구축하는 것을 볼 수 있습니다 + +399 +00:27:58,619 --> 00:28:02,579 + 당신은 많은 그래서 우리는 다음 전체 아키텍처를 만들어 함께 붙어 수 있습니다 봐 + +400 +00:28:02,579 --> 00:28:06,309 + 아주 쉽게 서로 이야기하고 그래서 우리는 이러한 모듈을 만들 수 있습니다 + +401 +00:28:06,309 --> 00:28:11,519 + 스톡턴 함께 내가 생각이 아주 쉽게 승리 작업 플레이 + +402 +00:28:11,519 --> 00:28:16,039 + 예시이 내 숙제입니다 대략 년 전에 그렇게로부터의 자막에 + +403 +00:28:16,039 --> 00:28:20,289 + 여기에 작업의 이미지를 촬영했다 당신은에 일을 얻기 위해 노력하고 + +404 +00:28:20,289 --> 00:28:23,639 + 예를 들어 상단이 왼쪽 있도록 이미지의 문장 설명을 생산 + +405 +00:28:23,640 --> 00:28:27,810 + 예술가는 결과이 많은 검은 셔츠는 기타를 연주 한 것을 말할 것입니다 설정 + +406 +00:28:27,809 --> 00:28:32,480 + 또는 오렌지 시티 웨스트에서 건설 노동자 등등 그래서 도로에 노력하고있다 + +407 +00:28:32,480 --> 00:28:36,670 + 그들은 영상을보고 하나 하나 이미지의이 설명을 만들 수 있습니다 + +408 +00:28:36,670 --> 00:28:41,100 + 이 모델의 세부 사항에 길을 갈 때 이것은 우리가 복용하고있다 작품 + +409 +00:28:41,099 --> 00:28:45,079 + 우리가 알고있는 길쌈 신경망 그래서 여기에 두 개의 모듈에있다 + +410 +00:28:45,079 --> 00:28:49,480 + 우리가 달성 할 수있는 촬상 모델이 계통도하여 + +411 +00:28:49,480 --> 00:28:52,880 + 우리가 알고있는 네트워크는 우리가 재발 성 신경 네트워크를 복용하고 볼 수있는 + +412 +00:28:52,880 --> 00:28:56,150 + 우리는이 경우 시퀀스에 아주 좋은 및 모델링 시퀀스 알고 + +413 +00:28:56,150 --> 00:28:59,720 + 이미지를 설명하는 것 그리고 우리가 가지고 노는 것처럼 말 + +414 +00:28:59,720 --> 00:29:02,930 + 레고 우리는 그 두 가지를 가지고 우리는 함께 그에 대응을 스틱 + +415 +00:29:02,930 --> 00:29:06,560 + 이러한 네트워크에있는 두 개의 모듈 사이에서 여기 화살표에 배운 + +416 +00:29:06,559 --> 00:29:10,639 + 이러한 이미지를 설명하기 위해 서로와 노력의 과정에서 대화 + +417 +00:29:10,640 --> 00:29:13,110 + 그라디언트는 휴대 전화에서 작동 코미디 쇼를 통해 비행한다 + +418 +00:29:13,109 --> 00:29:16,689 + 시스템은 더하기 위해 이미지를보고 자신을 조정하는 것 + +419 +00:29:16,690 --> 00:29:20,200 + 말을 설명하고 그래서이 모든 시스템은 하나 같이 함께 작동합니다 + +420 +00:29:20,200 --> 00:29:24,920 + 그래서 우리는 실제로이 클래스에 올 것이다이 모델을 위해 노력 할 것이다 것 + +421 +00:29:24,920 --> 00:29:28,279 + 바로이 부​​분이 부분에 대해 모두 떨어져 전체 이해를 + +422 +00:29:28,279 --> 00:29:31,849 + 중간 과정을 통해 대략 당신은 어떻게 교육 모델을 볼 수 있습니다 + +423 +00:29:31,849 --> 00:29:34,909 + 그 정말 우리가 구축하고있는 것에 대해 그냥 의욕이다 제외한 작동 + +424 +00:29:34,910 --> 00:29:40,290 + 당신이 확인 작업을하지만 지금은 다시 410을보고 정말 좋은 모델처럼하고있어 + +425 +00:29:40,289 --> 00:29:43,159 + 모든 분류 + +426 +00:29:43,160 --> 00:29:47,930 + 당신이이 데이터 집합 2000 작업 저스틴 라벨된다 나게 우리는있어 + +427 +00:29:47,930 --> 00:29:50,960 + 당신의 분류에 접근하려고하는 것은 우리가 파라 메트릭 방식 부르는에서입니다 + +428 +00:29:50,960 --> 00:29:55,079 + 우리가 지금 논의 된 것을 기억하는 것은 무엇인가 우리가 부르는의 인스턴스 + +429 +00:29:55,079 --> 00:29:57,439 + 비모수 적 접근 방법은 우리가 될거야 매개 변수가 없습니다 + +430 +00:29:57,440 --> 00:30:02,430 + 이러한 구분을 통해 최적화하는 것은 명확하게 인간은 또한의 뜻 + +431 +00:30:02,430 --> 00:30:04,240 + 우리가하고있는 프로젝트에 명백한 가치가있다 + +432 +00:30:04,240 --> 00:30:09,089 + 이미지를 가져와을 생성하는 기능을 구성에 대해 생각 + +433 +00:30:09,089 --> 00:30:12,769 + 클래스에 대한 점수는 바로이 어떤 이미지를 수행해야 할 우리가해야 할 것입니다 + +434 +00:30:12,769 --> 00:30:17,109 + 우리는 열 중 어느 하나를 파악하고 싶습니다 플러스 
그렇게 우리가 쓰고 싶은된다 + +435 +00:30:17,109 --> 00:30:21,169 + 이미지를 소요하고 당신에게 그 두 가지를 제공하는 기능과 표현 아래로 + +436 +00:30:21,170 --> 00:30:24,529 + 숫자 만 표현은 매우뿐만 아니라 그 이미지의 기능 만입니다 + +437 +00:30:24,529 --> 00:30:28,339 + 때때로 병 W라고 이들 파라미터의 함수일 + +438 +00:30:28,339 --> 00:30:33,189 + 또한 무게라고 그래서 정말은 3072 번호로 이동하는 기능입니다 + +439 +00:30:33,190 --> 00:30:37,308 + 10 숫자에이 이미지를 우리는 우리가 정의하고 무슨 일을하는지있어 그 구성하는 + +440 +00:30:37,308 --> 00:30:42,049 + 기능 그리고 우리는이이 기능의 몇 가지 선택을 통해 이동합니다 + +441 +00:30:42,049 --> 00:30:45,589 + 첫 번째 경우는 나중에 기능을보고 한 후 작동을 제어하도록 확장됩니다 + +442 +00:30:45,589 --> 00:30:49,579 + 그리고, 우리는 상업 네트워크 그러나 직관적으로 무엇을 얻기 위해 그 연장합니다 + +443 +00:30:49,579 --> 00:30:53,379 + 우리를 구축하고 우리를 통해이 이미지를 둘 때 우리가 원하는 것은이다 + +444 +00:30:53,380 --> 00:30:57,690 + 우리의 기능 우리는 10의 점수에 해당하는 10 숫자를 싶습니다 + +445 +00:30:57,690 --> 00:31:01,150 + 가장 가까운 높은 것으로 고양이 클래스에 해당하는 번호를 싶습니다 + +446 +00:31:01,150 --> 00:31:06,330 + 다른 모든 숫자는 낮은하고있을 것이다 우리는 X를 통해 선택의 여지가 없어 + +447 +00:31:06,329 --> 00:31:11,428 + 즉, 사용자가 설정하는 무료입니다 (W) 이상 선택의 여지가 주어진 것 우리의 이미지의 역할 + +448 +00:31:11,429 --> 00:31:15,179 + 어떤을 제외하고 우리가 원하는 우리는이 기능을 허용하도록 설정하는 것이 좋습니다 싶어 + +449 +00:31:15,179 --> 00:31:19,050 + 의 우리의 훈련 데이터의 모든 하나의 이미지에 대한 우리에게 정답을 제공합니다 + +450 +00:31:19,049 --> 00:31:23,230 + 우리는 간단한 사용하는 것이 우리 가정하는 방향으로 구축하고 거의 접근 + +451 +00:31:23,230 --> 00:31:29,789 + X 그래서 여기에 간단한 단지 선형 분류는 우리의 이미지입니다 + +452 +00:31:29,789 --> 00:31:34,200 + 이 경우 잘못은 내가 고양이를 구성하는이 영상이 배열을 취하고로 + +453 +00:31:34,200 --> 00:31:38,750 + 나는 거대한 컬럼에 해당 이미지의 모든 픽셀 뻗어있어 + +454 +00:31:38,750 --> 00:31:46,920 + 3072 번호 등의 열 벡터가되도록 벡터 당신이 알고있는 경우 + +455 +00:31:46,920 --> 00:31:52,100 + 당신이이을위한 전제 조건입니다한다 행렬 벡터 연산 + +456 +00:31:52,099 --> 00:31:55,149 + 잘 알고 있어야합니다 단지 행렬 곱셈이 있다는 클래스 + +457 +00:31:55,150 --> 00:32:00,100 + 와 기본적으로 우리는 우리가있어 3072 근육의 열 벡터이다 X를 취하고있어 + +458 +00:32:00,099 --> 00:32:03,569 + (10) 번호를 얻으려고 노력하고 당신은 뒤로 더 이상 기능을 갈 수 있도록 + +459 +00:32:03,569 --> 00:32:08,399 + 이 w의 크기는 3072 그래서 거기에 기본적으로 10입니다 파악 + +460 +00:32:08,400 --> 00:32:14,370 + 즉 W로 전환하고 30,000 772 202 번호는 우리가 통제 할 수있는 무엇 + +461 +00:32:14,369 --> 00:32:16,658 + 그것은 우리가 조정할 및 작동 찾을 필요가 무엇 + +462 +00:32:16,659 --> 00:32:21,710 + 그래서 사람들은 내가 밖으로 떠날거야이 특정한 경우에 매개 변수가되고 있습니다 + +463 +00:32:21,710 --> 00:32:26,919 + 또한 끝에 추가 된 거기에 + 이러한 편견은 편견을 가지고, 그래서 때로는 수 + +464 +00:32:26,919 --> 00:32:31,999 + 10 개 이상의 매개 변수에 대해 우리는 또한 보통 사람들을 찾을 수있다 + +465 +00:32:31,999 --> 00:32:36,098 + 선형 분류 우리가 가장 잘 작동 정확히 찾아 가지고 WNB을 가지고이 + +466 +00:32:36,098 --> 00:32:39,950 + 아기는 온에 그냥 독립적 인 대기의 이미지의 함수가 아니다 + +467 +00:32:39,950 --> 00:32:44,989 + 가능성을 그 중 하나는 당신의 질문에 다시 갈 수 있습니다 당신이 경우 + +468 +00:32:44,989 --> 00:32:50,239 + 어쩌면 당신은 대부분의 고양이 있지만 일부 개를위한 매우 불균형 데이터 집합을 + +469 +00:32:50,239 --> 00:32:54,710 + 또는 그런 일이 당신은 기대할 수있는 고양이에 대한 편견이 + +470 +00:32:54,710 --> 00:32:58,200 + 한번 분류를 기본 때문에 촉매는 약간 높을 수 있습니다 + +471 +00:32:58,200 --> 00:33:04,009 + 뭔가에 다른 뭔가를 제공하지 않는 한 촉매를 예측하는 + +472 +00:33:04,009 --> 00:33:08,069 + 하나님의 형상이, 그렇지 않으면 나는 내가 단지에 같은보다 구체적인 생각 + +473 +00:33:08,069 --> 00:33:11,398 + 그것을 분해하지만 물론 나는 그것이 매우 명시 적으로 3072 폭 시각화 할 수 없습니다 + +474 +00:33:11,398 --> 00:33:17,459 + 숫자는 그래서 우리의 입력 이미지 1024 픽셀 및 그래서 더 많은 사진을 상상 상상 + +475 +00:33:17,460 --> 00:33:21,419 + 또한 열 X에 스트레스를 우리는 세 가지 클래스 정도가 상상 + +476 +00:33:21,419 --> 00:33:27,109 + 적색, 녹색, 청색의 비용이나이 경우 W에서 매우 고양이 입양 처리합니다 + +477 +00:33:27,108 --> 00:33:30,868 + 단지 매트릭스와 우리가 여기서 일을하는지에 대한 의해 세 가지가 우리가하려는하다 + +478 +00:33:30,868 --> 00:33:36,398 + 그래서이 주요 응용 프로그램은 여기 주요 행위의 점수를 계산하는 일이야 + +479 +00:33:36,398 --> 00:33:40,608 + 우리에게 우리가 세 가지 점수를 가지고이 과정은 경로의 출력을 
+480
+00:33:40,608 --> 00:34:04,720
+ This is just with some random setting of W, and here you can see we get
+ some scores out. This particular setting of W is not very good: the correct
+ class got a score much lower than the other classes, so this training image
+ is not classified correctly -- the classifier is not doing well.
+
+485
+00:34:04,720 --> 00:34:20,389
+ So we'd like to change W -- use a different W so that the correct score
+ comes out higher than the others -- but we have to do that consistently
+ across the examples of the entire training set.
+
+489
+00:34:20,389 --> 00:34:50,889
+ One thing to note here, by the way: W basically has all these classifiers
+ evaluated in parallel -- there really are ten independent classifiers, to
+ some degree. The cat classifier is just the first row of W with the first
+ bias, the dog classifier is the second row of W and its bias, and so on:
+ the W matrix has all the different classifiers stacked in its rows, and
+ they are all doing dot products with the image.
+
+496
+00:34:50,889 --> 00:35:28,640
+ So here's a question for you: what does a linear classifier do, in English?
+ We saw this functional form -- it's a fun exercise to somehow interpret it
+ and say in plain English what it's doing.
+
+500
+00:35:28,639 --> 00:36:17,960
+ One answer: x is a high-dimensional data point, and W is really putting
+ planes through that space -- we'll come back to that interpretation either
+ way. Another one: every single one of these rows of W effectively acts like
+ a template, and the dot product with the image is really a way of seeing
+ how well the image lines up with that template. What other ways?
+
+506
+00:36:17,960 --> 00:37:23,200
+ Another: since every weight indexes a position in space, if we have zero
+ weights at some positions, then the classifier doesn't care what's in that
+ part of the image -- it has no influence -- but for other parts of the
+ image a positive or negative weight will contribute to the score, so it's a
+ way of describing what, over the space of images, each label responds to.
+
+512
+00:37:23,199 --> 00:37:55,779
+ Question: our images have three-dimensional structure, so how do we stretch
+ them out? You just stretch out all the channels into the column -- say you
+ lay the red, green and blue parts side by side -- any way you like, as long
+ as it's consistent: you figure out a way to serialize every image in the
+ same way, so you're reading off the same layout every time.
+
+518
+00:37:55,780 --> 00:38:33,769
+ OK, so for this toy example, suppose these are the pixels of a grayscale
+ image -- I don't want people to be confused here, especially because
+ someone pointed out to me, after I made this figure, that red, green and
+ blue look like color channels; but here red, green and blue are three
+ different classes, not color channels -- so that was a complete screw-up on
+ my part, my apologies.
+
+524
+00:38:33,769 --> 00:38:59,789
+ Great -- question: how exactly can all the images become column vectors of
+ a single size? The answer is that we always, basically always, resize the
+ images to the same size.
+
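+ A concrete version of this toy computation; the numbers are borrowed from
+ the corresponding figure in the course notes:
+
+```python
+import numpy as np
+
+x = np.array([56., 231., 24., 2.])        # four pixel values stretched into a column
+W = np.array([[ 0.2, -0.5,  0.1,  2.0],   # each row is one class's classifier
+              [ 1.5,  1.3,  2.1,  0.0],
+              [ 0.0,  0.25, 0.2, -0.3]])
+b = np.array([1.1, 3.2, -1.2])            # one bias per class
+
+print(W.dot(x) + b)                       # [-96.8, 437.9, 60.75]: three class scores
+```
+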
+528
+00:38:59,789 --> 00:39:18,380
+ We can't easily handle different sizes -- there are ways, which we'll get
+ into later -- but the simplest thing is to resize every single image to
+ exactly the same size, because we want everything to be comparable: the
+ same kind of stuff ends up in the same components, so we can take these
+ columns and analyze consistent patterns across the space. That is in fact
+ how the state of the art actually works.
+
+532
+00:39:18,380 --> 00:39:45,490
+ If you have a very elongated image, these methods can actually do badly,
+ because many of them will squash it: if I had something very long, like a
+ panorama, and tried to put it up somewhere on some online service, chances
+ are they'd put it through a ConvNet, and these ConvNets always work on
+ squares -- squashing to a square is just what's done in practice. Anything
+ else?
+
+539
+00:39:45,489 --> 00:40:29,500
+ On interpreting the rows of W -- yes: that's another way to put it, one I
+ hadn't heard, but it's also a good way of looking at it: basically every
+ single score is a simple weighted sum of all the pixel values of the image.
+ These weights are what we get to choose in the end, but all it's doing is a
+ giant weighted sum, counting up colors at the different spatial positions
+ -- that's really all this is.
+
+547
+00:40:29,500 --> 00:40:55,650
+ So in terms of how we can interpret W concretely, here is what that looks
+ like -- it's a kind of template matching. I trained a classifier (I haven't
+ shown you yet how to do that, but we'll come back to it in a second), and
+ I'm taking every single row of my trained weight matrix -- every single
+ classifier we learned -- and reshaping it back into an image, un-distorting
+ the 3072 numbers back into the image layout so I can visualize it.
+
+554
+00:40:55,650 --> 00:41:19,338
+ For example, you see here that the plane template looks like a blue blob.
+ The reason you see a blue blob is that if you looked at the color channels
+ of this plane template, in the blue channel you'd see lots of positive
+ weights, and those positive weights, when they see blue values, interact
+ with them and contribute a little to the score.
+
+560
+00:41:19,338 --> 00:41:49,599
+ So this plane classifier is really just counting up the amount of blue
+ stuff in the image at all the spatial positions, and in the red and green
+ channels of the plane classifier you'd find zero or negative values. All
+ the other templates are similar -- say frog: you can almost see the
+ template of a frog in it, right -- it's looking for green stuff, positive
+ weights here, and then some brown things on the sides, so an image that
+ lands on top of that template in the dot product will get a high score.
+
+567
+00:41:49,599 --> 00:42:22,179
+ The thing to note here is that, for example, the car classifier is not a
+ very nice template of a car, and the horse template looks a bit odd: it
+ seems to be looking both ways. Basically, what's going on is that in the
+ data there are horses facing left and horses facing right, and this
+ classifier, which is not actually a very powerful classifier, is combining
+ the two modes into one template: it's trying to do two things at once,
+ covering horses facing both ways.
+
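+ A sketch of the visualization just described, assuming a learned W of shape
+ 10 x 3072 whose rows were stretched out in (32, 32, 3) order (W is assumed
+ to exist already; training comes later):
+
+```python
+import numpy as np
+import matplotlib.pyplot as plt
+
+def show_templates(W, class_names):
+    """Reshape each row of a learned 10 x 3072 W back into a 32x32x3 image."""
+    for i, name in enumerate(class_names):
+        tmpl = W[i].reshape(32, 32, 3)                          # un-stretch the row
+        tmpl = (tmpl - tmpl.min()) / (tmpl.max() - tmpl.min())  # rescale for display
+        plt.subplot(1, len(class_names), i + 1)
+        plt.imshow(tmpl)
+        plt.axis('off')
+        plt.title(name)
+    plt.show()
+```
+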
+574
+00:42:22,179 --> 00:42:40,588
+ When you actually look at it that way, the result is arguably even
+ stronger: it's a horse facing left and a horse facing right at the same
+ time. Same for cars, right: we can have a car facing left or right or
+ forward at 45 degrees, and this classifier is merging all of those modes --
+ it's doing the optimal mixing it can,
+
+578
+00:42:40,588 --> 00:42:57,808
+ because we're forcing that entire mode structure into a single template; it
+ has no way around that. With neural networks, that's exactly the
+ shortcoming we fix: they can in principle have a template for each of these
+ modes -- cars this way, cars that way -- giving them more power, and
+ combine across them to do the classification more properly. For now we're
+ stuck with this constraint.
+
+584
+00:42:57,809 --> 00:43:47,009
+ Question: [about jittered copies of the training data]. Yes -- at training
+ time we will generate shifted and stretched copies and alter everything;
+ that turns out to be a huge part of getting these things to work very well,
+ and we do huge amounts of that stuff, because every shift or rotation gives
+ you another training example of the ship, say. That works much better.
+
+591
+00:43:47,009 --> 00:44:13,918
+ Question: if you wanted to set the templates explicitly, would the template
+ that works be an average over all the images of that class -- does the
+ average become the template?
+
+594
+00:44:13,918 --> 00:44:34,239
+ Yeah -- these classifiers would do something similar to that, I'd guess.
+ Looking at what it's optimized for, I don't think it comes out to be
+ exactly the mean of the images, but intuitively it seems like it would find
+ weights that look decent and are related to that in some way.
+
+599
+00:44:34,239 --> 00:44:43,980
+ Yeah -- we can go down that path; I'll come back to a few of these points.
+
+601
+00:44:43,980 --> 00:45:40,368
+ Question: what if there are, say, red cars and cars of different colors --
+ a red car and a yellow car -- in the dataset? Then this may actually not
+ work well for you: this thing just doesn't have the capacity to do all of
+ that properly -- it's not powerful enough to capture all the different
+ modes -- and it's unclear where the numbers for the red car end up. If it
+ were grayscale, I'm not sure it would work any better.
+
+608
+00:45:40,369 --> 00:46:01,929
+ And as I mentioned about imbalance: it comes back to what you expect --
+ whatever dataset you expect. If you expect a lot of cats, exactly, the cat
+ bias would end up higher, because this classifier just gets used to that
+ class in large numbers. But exactly how it plays out is based on the loss,
+ and we haven't gotten to loss functions yet, so it's hard to say right now.
+
+613
+00:46:01,929 --> 00:46:37,750
+ There's another interpretation of the classifier that someone else pointed
+ out, which I want to highlight: you can think of these images as very
+ high-dimensional points -- with 3072 pixels, every image is a point in a
+ 3072-dimensional pixel space -- and these linear classifiers describe
+ gradients across that space: the score is positive along some main
+ direction in the space and negative in the other region. So, for example,
+ for the car classifier I'm going to take the first row of W
+
+621
+00:46:37,750 --> 00:46:51,730
+ for the car class and show its zero level set as a line here: along that
+ line the car classifier has a score of zero, and the arrows show the
+ direction along which the score becomes more positive as you move through
+ the space.
+
+624
+00:46:51,730 --> 00:47:08,970
+ Similarly, the three classifiers in this example each have a particular
+ level set where the score is zero, and they respond with these gradients
+ across the space: basically each one wants all of its class's points to
+ move to one side of its plane.
+
+628
+00:47:08,969 --> 00:47:33,289
+ We saw that when we initialize this randomly, these level sets start out
+ random, and when we actually do the optimization you'll see these planes
+ start to shift and turn: the car classifier during training will rotate and
+ snap onto the cars, trying to separate all the cars from everything else.
+ It's really fun to watch -- so that's another way to interpret it, OK.
+
+634
+00:47:33,289 --> 00:47:51,909
+ Here's a question for you: given all these interpretations, what would be
+ very hard -- what do you expect to really, really not work well with a
+ linear classifier?
+
+637
+00:47:51,909 --> 00:49:02,380
+ Concentric circles -- exactly how I see it: say one class is a blob in this
+ space and another class is a blob around it. In that space there's no plane
+ that separates them, right; I'm not exactly sure what that example would
+ look like as actual images, but looking at such a setup, a linear
+ classifier would clearly not do very well here.
+
+645
+00:49:02,380 --> 00:49:49,760
+ Another one: suppose you trained a classifier and then I negate an image --
+ give it the negative of the image. You still see the edges, and as a
+ classifier you could say that's OK, that's a plane -- the shapes are
+ obviously fine -- but every color would be exactly wrong, so the classifier
+ would hate that plane.
+
+649
+00:49:49,760 --> 00:50:37,059
+ Yes -- question about dogs: a dog on the left and a dog on the right would
+ both be dogs -- would that be a problem, right? A white background or
+ something would not be a problem, I think, and it wouldn't be a problem...
+ translation.
+
+654
+00:50:37,059 --> 00:51:15,769
+ If your dog can occur anywhere, what you're describing could be harder in
+ some ways, depending on the class. If it's actually always roughly
+ centered -- something sitting right in the middle -- it would actually not
+ be a problem: picking up on something in the middle in particular turns out
+ to be relatively easy, because you just need positive weights in the
+ middle.
+
+660
+00:51:15,769 --> 00:51:44,300
+ Yeah -- so what this is really doing, really, is counting up colors at
+ spatial positions; nothing else is coming into it. This actually goes back
+ to the earlier point: it would be really hard if you had grayscale data --
+ how would this work with a grayscale dataset?
+
+665
+00:51:44,300 --> 00:52:08,400
+ Not very well: if you take CIFAR-10 and make it grayscale and train the
+ same classifier on the grayscale images, it will probably work really
+ terribly, because you can't pick up on colors anymore -- now it would have
+ to pick up on the textures and the fine details, and it can't localize
+ them, because they don't come up at consistent positions --
+
+671
+00:52:08,400 --> 00:52:29,740
+ it would be a kind of disaster. Another example would be if your classes
+ differ by texture: say this texture is blue and this one is some other
+ kind -- two types that could be the same colors but with spatially
+ invariant textures -- that would be terrible, terrible for this classifier.
+ So, just to remind you where we are:
+
+675
+00:52:29,739 --> 00:52:47,470
+ we have this function, and for the particular case of W we're looking at,
+ we put some test images through it and we're just expecting to get some
+ scores out; we're getting some scores for all of these images with the W
+ we've set up right now.
+
+679
+00:52:47,469 --> 00:53:14,940
+ For example, with this setting of W, for this image the cat score is 2.9,
+ but there are some classes that got a higher score -- that's not very good,
+ right -- and some classes got negative scores, like dog, which is good for
+ this image, so this is kind of an intermediate result for these weights.
+ For this image here we see that the car class, which is correct, got the
+ highest score, so W is doing well on it. And here we see that the correct
+ class got a very low score, so we're doing terribly on that one.
+
+685
+00:53:14,940 --> 00:53:40,469
+ So now, where we're headed: we're going to define what we call a loss
+ function. The loss function quantifies this intuition of what we right now
+ just think of as good or bad -- we're eyeballing these numbers and saying
+ what's good and what's bad -- into an actual formula we can write down:
+
+691
+00:53:40,469 --> 00:54:03,469
+ something that says exactly how bad this setting of W on our data is --
+ like 12.5 bad, or 1.2 bad, or whatever -- because once we've defined that
+ concretely, we're going to minimize the loss and find the W that gets us a
+ very low value. When you have a very low loss, like near zero, you're
+ classifying all the images correctly; when you have a very high loss,
+ everything is messed up and this W is not good at all.
+
+697
+00:54:03,469 --> 00:54:30,940
+ So that's roughly what's coming: a well-defined loss function is a way of
+ quantifying how bad a W is, as a function of your entire training set --
+ we can't control the data, we control the weights -- and we'll see, in the
+ part on optimization, how to efficiently find the set of weights W that
+ works across all the images and gives us a very low loss.
+
+703
+00:54:30,940 --> 00:54:52,389
+ And eventually, what we'll do is go back to the expression for the
+ classifier that we saw and start messing with the function: so far we've
+ spent our effort on the simple linear expression, but we'll make it
+ slightly more complex -- and we get a neural network -- and then a bit more
+ complex again -- and we get a convolutional network. But the whole
+ framework will remain unchanged the entire time:
+
+709
+00:54:52,389 --> 00:55:23,920
+ the functional form may change over the course -- we'll make it more
+ elaborate -- but we always identify some loss function, and we look for the
+ setting of the weights that gives a very low loss. So, looking forward,
+ that's how the next classes will go. I think that was my last slide, so I
+ can take some final questions.
+
+715
+00:55:23,920 --> 00:55:36,068
+ Sorry -- sorry, sorry, I didn't hear that.
+
+716
+00:55:36,068 --> 00:55:57,509
+ [Question: can't we just solve for the best weights directly?] Setting it
+ up as a direct optimization can sometimes work, but these iterative
+ approaches are basically how this will work for us: you'll see that we
+ always start with a random W, which gives us some loss, and then we don't
+ have a process for finding the best set of weights right away; what we have
+ instead is a way of slightly improving the weights.
+
+721
+00:55:57,509 --> 00:56:31,038
+ We look at the gradient in the loss-function landscape, and knowing what we
+ know, we march slightly downhill: we know how to slightly improve a given
+ set of weights, but we don't know how to just jump to the best one
+ outright, especially because once these functions get very complex they're
+ like giant landscapes -- it's just a very intractable problem to do
+ anything but the local step. And your question was how...
+
+727
+00:56:31,039 --> 00:56:55,748
+ ...how will we deal with the color problem? So we saw that the linear
+ classifier for car was basically a template for a red car. With neural
+ networks, what you'll see when you have these stacked classifiers is that
+ what ends up happening is there are these little templates for red cars, or
+ cars going this way or that way,
+
+732
+00:56:55,748 --> 00:57:17,498
+ where every one of those different modes gets assigned its own template,
+ and then they get combined in the second layer: the next layer can
+ basically look at those and say "OK, if this one or that one turned on,
+ then we can detect a car in any of its modes," and roughly in any of their
+ positions. Does that make sense?
+
diff --git a/captions/Ko/Lecture3_ko.srt b/captions/Ko/Lecture3_ko.srt
new file mode 100644
index 00000000..4ad4589f
--- /dev/null
+++ b/captions/Ko/Lecture3_ko.srt
@@ -0,0 +1,3596 @@
+1
+00:00:00,000 --> 00:00:18,100
+ So before we get to some of the material today on loss functions and
+ optimization, I first want to go over some administrative things. Assignment
+ 1 is due next Wednesday -- about nine days left. Just as a warning, Monday
+ is a holiday, so there will be
+
+5
+00:00:18,100 --> 00:00:35,149
+ no class and no office hours; plan your time accordingly to make sure you
+ can complete the assignment in time. You also have some late days that you
+ can allocate among the assignments as you see fit. OK, so diving into the
+ material: first I'd like to remind you where we are.
+
+9
+00:00:35,149 --> 00:00:58,049
+ Last time we looked at this problem of visual recognition, at image
+ classification specifically, and we talked about the fact that it's
+ actually a very difficult problem: just consider the cross product of all
+ the possible variations we have to be robust to when we recognize any one
+ of these categories, like cat. It seems like an intractable problem --
+
+15
+00:00:58,049 --> 00:01:23,609
+ and not only do we know how to solve these problems now, we can solve them
+ for thousands of categories, and the state of the art works at almost human
+ accuracy, or slightly below -- it even breaks through it for some of those
+ classes. It also runs almost in real time on your phone, and basically all
+ of this happened in the last three years. At the end of this class you'll
+ be experts on this technology -- so it's really cool and exciting, OK.
+
+21
+00:01:23,609 --> 00:01:34,100
+ For image classification we talked specifically about the data-driven
+ approach: the fact that we can't explicitly hard-code these classifiers, so
+ we actually have to train them from data. So we looked at the idea of
+24
+00:01:34,099 --> 00:01:48,618
+ splitting our data: having training data, having separate validation data
+ where we fiddle with our hyperparameters, and a test set that we don't
+ touch too much. We looked specifically at the example of the nearest
+ neighbor classifier
+
+28
+00:01:48,618 --> 00:02:11,520
+ and the k-nearest-neighbor classifier, and I talked about CIFAR-10, the toy
+ dataset we play with in this class. Then I introduced the idea of what I
+ called the parametric approach, where we write a function from the image
+ directly to the ten class scores -- the simplest such function being
+ f = Wx --
+
+33
+00:02:11,520 --> 00:02:28,740
+ and we talked about interpretations of this linear classifier: the fact
+ that you can interpret it as template matching, or you can interpret the
+ rows as planes through a very high-dimensional space where your images
+ live, each classifier kind of coloring its class scores across this space.
+ So at the end of the class we arrived at this picture: suppose
+
+38
+00:02:28,740 --> 00:02:51,419
+ we have a training dataset of just three images, here along the columns,
+ and say ten classes, as in CIFAR-10. Basically this function assigns a
+ score to every one of these images for each class, and with some particular
+ setting of weights -- chosen randomly here -- we got some scores out: some
+ of the results are good and some of them are bad.
+
+44
+00:02:51,419 --> 00:03:12,980
+ For example, if you inspect the scores of the first image: the correct
+ class, cat, got a score of 2.9, which is kind of in the middle -- some
+ classes received higher scores, which is not very good, and some classes
+ received much lower scores, which is good for this particular image. The
+ car was classified well, because the car score is higher than all the
+ others, and the frog was not classified well at all.
+
+49
+00:03:12,979 --> 00:03:34,900
+ So we have this notion that some weights work better or worse on different
+ images, and of course we're trying to find the W that gives us scores
+ consistent with all the ground-truth labels in the data. So far all we've
+ done is eyeball it and say "I believe this is good, that's not so good" --
+ we've just been describing it. What we're going to do now is actually
+
+55
+00:03:34,900 --> 00:03:55,830
+ quantify this notion: say that this particular set of weights W is, like,
+ 12 bad or 1.5 bad or whatever. Once we have this loss function, we're going
+ to minimize it: we'll find the W that gets us the lowest loss. So today
+ we'll investigate specifically how we can define this loss function that
+ measures the unhappiness, and then
+
+60
+00:03:55,830 --> 00:04:18,030
+ we'll look at two different cases -- the SVM cost and the softmax cost --
+ and we'll look at optimization: the process of how you start with these
+ random weights and actually find very, very good weights efficiently. So
+ I'm going to shrink this example down so we have a nice working example:
+ suppose we have three classes, and you know the drill --
+
+66
+00:04:18,029 --> 00:04:42,379
+ tens of thousands of examples in general, but here we have these three
+ images, and these are our scores for some setting of W. We're now going to
+ write down exactly our unhappiness with these results.
+
+72
+00:04:42,379 --> 00:04:59,978
+ The first loss we'll look at is called the multiclass SVM loss. This is a
+ generalization of the binary support vector machine that you may have seen
+ in, say, CS229. The setup here is as follows:
+
+76
+00:04:59,978 --> 00:05:31,269
+ s = f(x_i, W) is our vector of class scores, and the loss has the form
+ L_i = sum over j != y_i of max(0, s_j - s_{y_i} + 1). Effectively, what the
+ SVM loss says is: for every single example, go over all of the wrong
+ classes, and compare the score received by each wrong class, s_j, against
+ the score received by the correct label, s_{y_i}, plus one.
+
+82
+00:05:31,269 --> 00:05:53,198
+ I'll interpret why this makes sense in a second with a concrete example,
+ but we keep comparing the differences of these scores. In particular, this
+ loss says: not only do I want the correct score to be higher than the
+ incorrect score, but we're actually putting in a safety margin, and the
+ safety margin we'll use is
+
+87
+00:05:53,199 --> 00:06:25,269
+ exactly one. We'll get to why one is used, as opposed to another
+ hyperparameter we'd have to choose. You can look into the notes for a much
+ more rigorous derivation of exactly why the one doesn't matter, but to not
+ stress about it too early: the scores are kind of scale-free, because I can
+ scale W up or down and make the scores larger or smaller -- how large the
+ scores are is tied to how large the weights are -- so this choice of one is
+ to some degree just an arbitrary choice.
+
+95
+00:06:25,269 --> 00:06:45,219
+ OK, so let's see concretely how this expression works with a concrete
+ example: I'm going to evaluate that loss for the first example here. We
+ plug into this expression, and you can see that we compare: the correct
+ class is cat; for the car class we take its score and subtract the cat
+ score -- 5.1 minus 3.2 --
+
+100
+00:06:45,220 --> 00:07:07,209
+ then add our safety margin of one, and take the max with zero: all the max
+ is doing is clamping values at zero, so if we get a negative result we
+ clamp it. That gives 2.9 for the car class. For the wrong class frog: -1.7
+ minus 3.2 plus the safety margin is negative, so we get zero. When you work
+ through it, you get a loss of 2.9.
+
+106
+00:07:07,209 --> 00:07:30,939
+ Intuitively, here's how this works out: the cat score is 3.2, and according
+ to the SVM loss, what we'd ideally want is for the correct class's score to
+ be the highest of all the classes by the margin. But the car class actually
+ got a much higher score, and the difference of what we would have liked --
+ that difference plus the margin -- works out to exactly how bad this is:
+ 2.9.
+
+111
+00:07:30,939 --> 00:07:54,439
+ In the other case, you can see that the frog score was significantly lower
+ than the cat score -- lower by more than the margin -- so the way the math
+ works out, you end up with a negative number when you compare the scores,
+ and the max with zero makes that particular comparison contribute zero to
+ the loss. So you end up with a loss of 2.9 for this first image, OK.
+
+116
+00:07:54,439 --> 00:08:07,329
+ For the second image we're going to do the same thing again: the correct
+ class is car, so I plug in the numbers, comparing the car score against the
+ cat and frog scores with the safety margin.
+
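+ The arithmetic for this first image, written out in numpy (cat is the
+ correct class, at index 0):
+
+```python
+import numpy as np
+
+scores = np.array([3.2, 5.1, -1.7])              # cat (correct), car, frog
+y = 0
+margins = np.maximum(0, scores - scores[y] + 1.0)
+margins[y] = 0.0                                 # don't count the correct class
+print(margins)                                   # [0.  2.9 0. ]
+print(margins.sum())                             # 2.9, the loss computed above
+```
+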
+119
+00:08:07,329 --> 00:08:27,490
+ When you plug everything in, you actually end up with a total loss of zero.
+ Intuitively, that's because the car score here is higher than all the other
+ scores for that image by at least the margin of one, right --
+
+123
+00:08:27,490 --> 00:08:39,349
+ the constraint is satisfied, so the loss is zero, and we're happy. In the
+ last case, the frog class received a very low score while the other classes
+ received very high ones, so this one ends up with a very bad loss of 10.9.
+
+126
+00:08:39,349 --> 00:08:56,369
+ If we actually want to combine all of this into a single loss function over
+ the whole training set, we do a relatively intuitive transformation where
+ we just average over all the losses across the training set, and when you
+ average these numbers, the loss at the end comes out to
+
+131
+00:08:56,370 --> 00:09:12,390
+ 4.6: this particular setting of W on this training data gave us some
+ scores, we plugged them into the loss function, and it reported this
+ unhappiness of 4.6 with the result, OK. So now let me ask a series of
+ questions that test your understanding a bit of how this works.
+
+135
+00:09:12,389 --> 00:09:46,539
+ First of all: what if the sum over there were over all classes, including
+ the correct one, j equal to y_i? What would happen? Yes, exactly -- if we
+ allowed j equal to y_i, the correct class's term is
+ max(0, s_{y_i} - s_{y_i} + 1), which is one, so
+
+141
+00:09:46,539 --> 00:10:03,940
+ really all you'd be doing is adding a constant of one to the whole loss. If
+ someone's loss was zero, it would just become a constant instead, so that's
+ the reason for the constraint. The second question: what if we used a mean
+ here instead of the sum --
+
+145
+00:10:03,940 --> 00:10:28,000
+ I mean, if I used an average over the wrong-class terms inside the
+ per-example loss instead of summing over them? You're right that
+
+148
+00:10:28,000 --> 00:11:01,000
+ with many classes the absolute value of the loss would be lower: it amounts
+ to putting a constant factor in front of the loss -- one over the number of
+ wrong classes, so for this particular example with three classes, a
+ constant of one half in front of each example's loss.
+
+154
+00:11:01,000 --> 00:11:33,329
+ And since in the end we always care about minimizing this over W, scaling
+ the loss, as you pointed out, makes the loss lower, but it doesn't actually
+ change the solutions: if you shift the loss or scale it by a constant, we
+ still end up at the same optimal W. So these choices are basically free
+ parameters that don't matter much; I'm adding the j not equal to y_i for
+ convenience.
+
+162
+00:11:33,330 --> 00:11:54,509
+ Then the next question: what if we instead used a formulation that looks
+ very similar, but with a square added at the end -- taking the difference,
+ plus one, max with zero, and then squaring it?
+
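+ A quick check of the averaging, and of the constant-one answer above, using
+ the three per-image losses from the slide:
+
+```python
+losses = [2.9, 0.0, 10.9]
+print(sum(losses) / len(losses))   # 4.6: the full loss is the mean of the per-example losses
+
+# Including j == y_i would add max(0, s_y - s_y + 1) = 1 to every example:
+shifted = [l + 1.0 for l in losses]
+print(sum(shifted) / len(shifted)) # 5.6: everything shifts by a constant 1
+```
+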
+166
+00:11:54,509 --> 00:12:35,580
+ Do we get the same or a different loss -- and if we optimized it, would
+ finding the best W give the same result or not? Yes: it is in fact a
+ different loss. One way to see it is that the square clearly does not just
+ scale the loss up or down by a constant, and it doesn't shift it up or down
+ by a constant either; we're actually changing
+
+172
+00:12:35,580 --> 00:12:53,320
+ the trade-offs nonlinearly, in terms of how the SVM goes and trades off the
+ margins between the scores on the different examples. It's not super
+ obvious, but basically any change to this loss beyond a constant shift or
+ rescaling matters.
+
+176
+00:12:53,320 --> 00:13:18,919
+ The thing on top here is in fact what we call the hinge loss, and instead
+ of it you can use the squared hinge loss. Both get used -- they're two
+ different kinds of hyperparameter choice you'll see; the first is used most
+ often, but sometimes you can see the squared hinge loss working better on
+ some dataset. So that really is a hyperparameter you can play with, but
+ most often the first one is used.
+
+182
+00:13:18,919 --> 00:13:45,230
+ Now let's think about the scale of this loss: what is the minimum and the
+ maximum possible loss you can achieve with the multiclass SVM on your whole
+ dataset? The minimum is zero -- good -- and what's the highest value?
+ Basically, if the correct example's scores come out terribly -- arbitrarily
+ small -- then you get a loss going to infinity.
+
+187
+00:13:45,230 --> 00:14:12,329
+ One more question, which is kind of important for when we start doing
+ optimization: when we actually optimize this loss function, we initialize
+ with a W of very small weights, and what ends up happening is that the
+ scores at the very beginning of the optimization are all roughly zero --
+ small numbers near zero. So what is the loss in that special case, when all
+ the scores are near zero? Right: the number of classes minus one,
+
+193
+00:14:12,330 --> 00:14:34,279
+ since every wrong class contributes max(0, 0 - 0 + 1) = 1 -- and here I'd
+ average over the examples -- so with those weights, that's the loss we'd
+ achieve, OK. This is important as a sanity check: when you actually start
+ the optimization, you start with a W of very small numbers, and you print
+ out your first loss,
+
+199
+00:14:34,279 --> 00:15:10,870
+ and as we talked about before, you want to make sure you understand the
+ functional form well enough that you can think about whether the number
+ makes sense. If I see that value in this case, I'm happy that the loss may
+ be implemented correctly -- not for sure, but at least nothing is obviously
+ wrong right away. So this is a fun one to think about. I'll go into this
+ loss a bit more, but for now, any questions on the slide? Question: [about
+ the j not equal to y_i constraint in the sum].
+
+205
+00:15:10,870 --> 00:15:45,799
+ Actually, having this constraint is not inefficient -- it doesn't make
+ things harder; it can actually make the implementation easier on the eye,
+ and in fact part of an implementation of this loss is on my next slide, so
+ let me just say it right here in code.
+
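+ The snippet being walked through, reconstructed here; it matches the
+ half-vectorized version in the course notes:
+
+```python
+import numpy as np
+
+def L_i_vectorized(x, y, W):
+    """Multiclass SVM loss for one example: x a column of pixels, y an integer label."""
+    scores = W.dot(x)
+    margins = np.maximum(0, scores - scores[y] + 1)
+    margins[y] = 0  # the correct class would otherwise contribute exactly 1
+    loss_i = np.sum(margins)
+    return loss_i
+```
+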
+213
+00:15:45,799 --> 00:16:03,360
+ In this loss function we're evaluating the loss for a single example: x is
+ a single column vector for the image, y is an integer specifying the label,
+ and W is our weight matrix. We do W times x to get the scores, and then we
+ compute these margins: the differences between the scores we obtained and
+ the correct score, plus one, clamped below at zero -- and then you see this
+ line setting the margin at y to zero.
+
+216
+00:16:03,360 --> 00:16:18,360
+ Yes, exactly -- because I'm doing this efficient vectorized computation
+ over all the scores at once, the margin at the correct class would come out
+ as exactly one -- the correct score minus itself plus one -- which isn't
+ what I want, so I just set that one to zero.
+
+220
+00:16:18,360 --> 00:16:40,859
+ Yeah, I suppose we could also subtract the one instead; we could optimize
+ this in a few ways if we wanted, but we're not going to think about it too
+ much -- though if you do, some of that was very welcome in the
+ assignments... we got sidetracked; any more questions? OK.
+
+224
+00:16:40,860 --> 00:17:00,190
+ By the way, about this formulation: if you want, you can write it down for
+ the two-class case and see that it reduces to the binary support vector
+ machine loss. So we'll see different loss functions soon, and we'll look at
+ comparisons between them as well,
+
+228
+00:17:00,190 --> 00:17:21,309
+ but for now, at this point, here's what we have: we compute these scores,
+ and we have this loss function, which we've now written out in its full
+ form -- these differences between the correct class's score and the others,
+ summed over the wrong classes, and averaged over the examples. That's our
+ loss function right now.
+
+233
+00:17:21,308 --> 00:17:43,620
+ So now I'd like to convince you that there's actually a bug in this loss
+ function -- in the sense that if I wanted to use this loss in practice, on
+ its own, as the only thing I optimize, I might get some properties that are
+ not very nice. It's not completely obvious what the problem is just by
+ looking at it, so let me give you guys a hint.
+
+238
+00:17:43,619 --> 00:18:04,210
+ In particular, the hint is: suppose we found a W that gets zero loss on
+ something. Now, the question is: is this W unique? Or, facing it the other
+ way: can you give me a W that is different but also definitely achieves
+ zero loss again?
+
+241
+00:18:04,210 --> 00:18:31,639
+ Right -- what you're saying is that we can scale it by some constant, in
+ particular any constant greater than one, based on the constraint that the
+ margins were already met. Basically, I can change my weights, and all I'm
+ doing by making them bigger and bigger is making the score differences
+ bigger and bigger as I scale W up, right.
+
+246
+00:18:31,640 --> 00:18:58,480
+ So basically this loss has a very undesirable property here: we have an
+ entire subspace of W's that are optimal, all of them completely equivalent
+ according to the loss function. Intuitively, that's not a property I want
+ to bake in and pass along, so just to see this -- to convince yourself of
+ it -- take that example:
+
+252
+00:18:58,480 --> 00:19:32,159
+ suppose my W achieved zero loss before, and I take 2W. Very simple math is
+ happening here, but basically doubling W doubles my scores, so all the
+ score differences become much larger: the score differences inside the max
+ that were already negative just get more and more negative, so you end up
+ with bigger and bigger negative values inside the max, and it stays zero
+ all the time -- because the scale factor is greater than one.
+
+258
+00:19:32,160 --> 00:19:58,309
+ Another question -- yes: for simplicity, the scores are really Wx plus b,
+ so don't forget the bias; scaling just some of W... either way it works out
+ the same.
+
+260
+00:19:58,309 --> 00:20:21,430
+ OK, so intuitively, how do we fix this problem? We have this whole subspace
+ of W's that all work the same according to the loss function, and we'd like
+ to express a preference for some of these W's over others. They're all
+ identical as far as the data is concerned, so what, independent of the
+ data, would be a desirable thing for W to look like?
+
+265
+00:20:21,430 --> 00:20:43,279
+ This introduces the concept of regularization: we're going to append a term
+ to our loss function, lambda times a regularization function of W.
+ Regularization measures, roughly, the niceness of your W, OK -- so we don't
+ only want to fit the data,
+
+270
+00:20:43,279 --> 00:21:04,560
+ we also want W to be nice. There are ways of framing exactly why that makes
+ sense: regularization theory frames it as a trade-off between your training
+ loss and your generalization behavior on a test set. So intuitively,
+ regularization is a set of techniques where we add an objective to the loss
+ that fights with this data-loss guy:
+
+275
+00:21:04,559 --> 00:21:26,089
+ this one wants W to fit your training data, and that one wants W to look
+ some particular way, so they're sometimes fighting each other in your
+ objective, and we want to achieve both at the same time. But it turns out
+ that adding these regularization techniques, even when it makes your
+ training error worse -- so we don't classify training examples quite as
+ correctly -- gives better test-set performance. We'll see
+
+281
+00:21:26,089 --> 00:21:44,230
+ examples later of why that can actually happen, but for now I just want to
+ point out the most common form: what we call L2 regularization, or weight
+ decay. W here is a two-dimensional matrix, with rows and columns indexed by
+ k and l, and what this really does is sum up W squared, element-wise, over
+ all of them.
+
+286
+00:21:44,230 --> 00:22:08,570
+ So we're adding them all up into this particular term, and this
+ regularization is happiest when W is all zeros: zero W gives zero
+ regularization loss. But of course with all zeros you can't classify
+ anything, so these guys will fight each other. There are other forms of
+ regularization with different trade-offs -- we'll go over some of them much
+ later in the class --
+
+291
+00:22:08,569 --> 00:22:25,779
+ but L2 regularization is the most common form, and it's what you'll use
+ most often in this class as well. Now I'd like to convince you -- if I can
+ -- that it's a reasonable thing to do, that wanting the weights to be small
+ makes sense. So consider this very simple cooked-up example:
+
+295
+00:22:25,779 --> 00:22:59,109
+ for intuition, suppose we're in four dimensions, we're doing this
+ classification, and just think of a single x and a single weight vector w
+ for now. Suppose x = [1, 1, 1, 1], and we have two candidate weight
+ vectors: one is [1, 0, 0, 0] and the other is [0.25, 0.25, 0.25, 0.25].
+ Since the score is the dot product of w with x, the score comes out the
+ same for both -- these two have exactly the same effect on the data loss --
+
+303
+00:22:59,109 --> 00:23:19,109
+ but the regularization will prefer one of these over the other: the L2
+ regularization would choose the second one. Even though their effect is the
+ same, one is better than the other from the regularization's point of view:
+ it will tell you that, even if they achieve the same
+
+308
+00:23:19,109 --> 00:23:57,600
+ effect on the data loss in terms of how we classify, it greatly prefers the
+ second one. So why is it a good idea for the second one to win? Right --
+ the interpretation I like best is that it takes into account the most
+ things in your x vector: this L2 regularization wants to spread your
+ weights out as much as possible, so that all the input features are taken
+ into account -- it wants to use as many of the different dimensions as it
+ can, for achieving the same effect.
+
+316
+00:23:57,599 --> 00:24:29,529
+ Intuitively speaking, that's just nicer than concentrating everything on
+ one dimension, and it's something that basically just works better in
+ practice -- it's a property of how these things tend to behave. Any
+ questions that should be addressed? Generally: regularization, good idea --
+ everyone sold? Basically our loss will always have this form from now on:
+ we have the data loss, and we also have the regularization -- this is very
+ common in practice, OK.
+
+323
+00:24:29,529 --> 00:25:12,009
+ Now I'm going to move on to the second classifier, and we'll see some
+ differences between the support vector machine and this softmax classifier
+ in practice. These are kind of like two choices -- you can have whichever
+ you like best, like with most preferences -- but generally the softmax is
+ what you'll see used more often as a linear classifier; I don't know
+ exactly why, since they usually work about the same. I'll just mention that
+ this is also sometimes called multinomial logistic regression: if you're
+ familiar with logistic regression, this is just its generalization to
+ multiple dimensions -- or, in this case, to multiple classes.
+
+332
+00:25:12,009 --> 00:26:21,470
+ There's a question over there: why would we want to prefer one of the two
+ W's, if we want to choose between them in some way? One intuitive way I can
+ try to pitch why this is a good idea: with spread-out weights -- the first
+ w completely ignores inputs two, three and four, but the second W uses all
+ of the inputs in a moderate way -- intuitively this just usually ends up
+ working better at test time, because more evidence is being accumulated for
+ the decision instead of just one piece of evidence, one single feature.
+ Right.
+
+341
+00:26:21,470 --> 00:27:05,668
+ Right, right -- so the idea here is that the two w's achieve the same
+ effect, so the data loss basically doesn't care between the two, but the
+ regularization expresses a preference; and because we have a single
+ objective and we end up optimizing over this total loss function, we'll
+ find W's that simultaneously do both: we end up classifying correctly, but
+ we also satisfy the additional preference we wanted -- to be spread out as
+ much as possible wherever the data is indifferent. L1 is another nice
+ one --
+
+349
+00:27:05,669 --> 00:27:45,760
+ it has properties I don't want to go into now; we may cover them later. L1
+ induces properties like sparsity: if you end up having L1 in your
+ objective, you can find that many of the W's end up being exactly zero,
+ which is why it's sometimes almost like feature selection. It's another
+ alternative that we might go into a little more later.
+
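+ A quick check of the spread-weights example, plus the shape of the full
+ objective (the lambda here is only a placeholder value):
+
+```python
+import numpy as np
+
+x  = np.array([1., 1., 1., 1.])
+w1 = np.array([1., 0., 0., 0.])
+w2 = np.array([0.25, 0.25, 0.25, 0.25])
+
+print(w1.dot(x), w2.dot(x))          # 1.0 1.0: identical effect on the data loss
+print(np.sum(w1**2), np.sum(w2**2))  # 1.0 0.25: L2 prefers the spread-out w2
+
+# Full objective: mean data loss plus the regularization term,
+#   L = (1/N) * sum_i L_i + lam * np.sum(W**2)
+lam = 0.1  # placeholder regularization strength
+```
+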
+355
+00:27:45,759 --> 00:28:59,740
+ Question: couldn't ignoring features sometimes be a good thing -- just not
+ using them? Yes, that's a fair point -- there are many technical reasons;
+ maybe I'll just say that what I gave was the basic intuition, and if I had
+ good returns from ignoring some features, that would be something else.
+
+359
+00:28:59,740 --> 00:29:44,139
+ For the actual story, look sometimes at learning theory -- you see some of
+ it in CS229 -- where there are some nice results on why regularization is
+ good practice in that area. I don't think I'll go into it; it's beyond the
+ scope of this class. As far as this class goes: changing our objective this
+ way will make the test error better, and everyone is satisfied, you know.
+
+364
+00:29:44,140 --> 00:30:05,769
+ On to the generalization of logistic regression, the softmax classifier:
+ the way this works is that the loss is just a different functional form
+ specified on top of the scores of the classifier. The SVM came with some
+ particular interpretation of these scores -- they're not arbitrary: we want
+ the margins to be met -- but here we have a maybe more specific
+ interpretation,
+
+369
+00:30:05,769 --> 00:30:34,490
+ from a probabilistic point of view, where we actually have a kind of
+ principle: we interpret these scores not just as margins but as the actual
+ unnormalized log probabilities assigned to the different classes, OK. We'll
+ unpack exactly what that means in a bit: the scores are the unnormalized
+ log probabilities of all these classes, given the image.
+
+375
+00:30:34,490 --> 00:31:02,880
+ That means that, unlike with the raw scores, the way to get probabilities
+ out is: we take these scores, exponentiate all of them to get unnormalized
+ probabilities, and then we normalize them -- divide by the sum over all the
+ exponentiated scores -- so they become normalized probabilities. That's how
+ we actually get an expression for the probability of a class given the
+ image, with this function:
+
+382
+00:31:02,880 --> 00:31:19,619
+ it's called the softmax function -- if you see it somewhere, someone feeds
+ it elements, and it's the exponential of every element divided by the sum
+ of the exponentials overall. So that's how this will basically work: in
+ this probabilistic framework we're deciding that these are the
+ probabilities of the different classes,
+
+387
+00:31:19,619 --> 00:31:46,599
+ and what really makes sense to do in this setting, if that's what you want,
+ is exactly to maximize the log likelihood of the correct class over the
+ others. For a loss function -- since we're writing a loss -- we want to
+ minimize the negative log likelihood of the true class, OK. So we end up
+ with this series of expressions, where the expression here is really the
+ loss function: we want the log likelihood
+
+393
+00:31:46,599 --> 00:32:07,859
+ of the correct class to be high, so we want its negative to be low. The log
+ likelihood expands into some expression over the scores, so now let's look
+ at a concrete example -- I want to make this more concrete, so let's see
+ how this expression, the loss being the negative log of that, works; I
+ think it'll give you a better intuition for exactly what it's doing.
+
+398
+00:32:07,859 --> 00:32:28,150
+ So suppose we have these computed scores here -- not normalized; these came
+ out of our neural network or, for now, our linear classifier. As I
+ mentioned, they're the unnormalized log probabilities, so in this
+ interpretation we first exponentiate them, which gives us the unnormalized
+
+403
+00:32:28,150 --> 00:32:47,029
+ probabilities, and then we divide by their sum so we actually get
+ probabilities out. So in this interpretation we've done this set of
+ transformations, and what comes out is: the probability assigned to this
+ image being a cat is 13%, car is 87%, and frog is very unlikely --
+ basically 0%.
+
+407
+00:32:47,029 --> 00:33:21,180
+ These being probabilities, what you generally want in this setting is to
+ maximize the log probability of the correct class -- it turns out that
+ maximizing the raw probability is not as nice mathematically, so, lonely as
+ it looks, you maximize the log probability, and for a loss you minimize the
+ negative log probability. Here the correct class, cat, has a probability of
+ 0.13, and the negative log of 0.13 gets us 0.89,
+
+414
+00:33:21,180 --> 00:33:32,869
+ and that is the loss we achieve for this example under this interpretation
+ of the classifier. Now let's go through some questions related to exactly
+ how this interpretation works, to try it on a few examples.
+
+416
+00:33:32,869 --> 00:33:49,809
+ First, as before: what is the minimum and the maximum possible loss with
+ this loss function? The smallest value we can achieve is zero -- and how
+ would that happen?
+
+420
+00:33:49,809 --> 00:34:24,679
+ Right: if we're getting a correct-class probability of one, we reply with
+ one in the log, and the negative log of one gives us zero -- so the minimum
+ is zero, just as before. And the highest loss: infinity -- if the cat
+ scored a very tiny probability near zero, then you take the negative log of
+ it and you end up going to infinity. So the same bounds, zero and infinity.
+
+426
+00:34:24,679 --> 00:34:47,000
+ And this question again: when we initialize W with roughly small weights,
+ so that all the scores are nearly zero, what would the loss be? This is the
+ check at the very beginning of optimization -- what do you expect to see as
+ your first loss?
+
+431
+00:34:47,000 --> 00:35:24,399
+ One over the number of classes: you get e to the zero everywhere here, so
+ after normalizing, one over the number of classes, and then the negative
+ log of one over the number of classes. So actually, something neat: every
+ time I run my optimizations, I note my number of classes and I evaluate the
+ negative log of one over the number of classes, and that's what I'm looking
+ for as my first loss when I start -- so I check
+
+439
+00:35:24,400 --> 00:35:40,630
+ that I'm getting roughly that; I know a few other things could make it
+ slightly off, but I expect to get something on that order, and as the
+ optimization proceeds I expect it to go down from there. If I ever see
+ something very strange -- negative numbers, say -- then I know something
+ about the functional form is very wrong: you'd never expect a negative
+ number from this softmax loss.
+
+444
+00:35:40,630 --> 00:36:10,569
+ One more slide, just to repeat the two side by side so nothing is hidden:
+ really, the difference is in how they look once we have the scores. We have
+ a score function that gives us our scores -- now we get our scores from a
+ function that we learn -- and the losses just interpret these scores
+ differently. The SVM doesn't really put an interpretation on the scores,
+ other than wanting the correct score to be some margin above the wrong
+ scores, while the softmax interprets them as log probabilities.
+
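+ The transformations just described, for the slide's scores (with the usual
+ max-shift for numeric stability, which doesn't change the result):
+
+```python
+import numpy as np
+
+scores = np.array([3.2, 5.1, -1.7])    # cat (correct), car, frog
+exp = np.exp(scores - np.max(scores))  # unnormalized probabilities, shifted for stability
+probs = exp / np.sum(exp)              # [0.13, 0.87, 0.00] -- the probabilities above
+print(-np.log10(probs[0]))             # ~0.89, the value quoted in the lecture
+print(-np.log(probs[0]))               # ~2.04 with the natural log, the usual convention
+```
+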
+451
+00:36:10,570 --> 00:36:31,150
+ In that framework we first get the probabilities, and we want to maximize
+ the probability of the correct class -- or the log of these, the log
+ likelihood -- and that ends up giving us the loss function. So they start
+ from the same scores, but they get to slightly different-looking results;
+ we'll go into exactly what the differences are in a bit. Any questions
+ here?
+
+455
+00:36:31,150 --> 00:37:15,260
+ Question: are they comparable in how expensive they are to evaluate?
+ Mostly the same: most of the work is done in computing the scores -- the
+ convolutions and so on in the classifier -- and the softmax loss
+ specifically involves some exp's and so on, so this operation is maybe
+ slightly more expensive, but it usually completely washes out compared to
+ everything else you worry about across the whole ConvNet.
+
+462
+00:37:15,260 --> 00:37:51,310
+ Question: is maximizing the probability the same problem as maximizing the
+ log probability? Perhaps... yes: maximizing the likelihood and maximizing
+ the log likelihood are the same problem in terms of what comes out of the
+ optimization -- they give you the same result, even though the losses look
+ different -- so it's exactly the same optimization problem. The log is just
+ much nicer to actually work with.
+
+467
+00:37:51,309 --> 00:38:15,980
+ OK -- so let's get an idea of exactly how these two interpretations, SVM
+ versus softmax, actually differ; one property in particular is quite
+ different between the two. Suppose we have a dataset of three examples and
+ three classes, and these are the score vectors for the three examples,
+ where
+
+473
+00:38:15,980 --> 00:38:58,159
+ the first entry is the correct class, so 10 is the correct class's score,
+ and the other scores are the second and third entries of each of these. Now
+ just think about how these losses talk about how desirable each of these
+ results is, in terms of the W behind them. One particular way to think
+ about it: suppose I take the third data point -- ten, minus one hundred,
+ minus one hundred -- and I jiggle my input, moving it around a little bit.
+ What happens to the loss?
+
+480
+00:38:58,159 --> 00:39:22,379
+ As I move the point around, the scores will increase and decrease a bit...
+ for the SVM, on this third example, the loss stays exactly the same -- and
+ the exact reason is that the margins were met by such a huge amount that
+ jiggling the data point a bit just adds robustness: the SVM is already very
+ happy, because the margin we wanted was
+
+485
+00:39:22,380 --> 00:39:43,890
+ one, and here we have a margin of more than a hundred -- a huge margin.
+ This is where the difference comes in: the SVM does not express preferences
+ over these examples beyond the margin -- it has no additional preference
+ for a wrong class being at minus two hundred rather than minus one
+ hundred --
+
+489
+00:39:43,889 --> 00:40:03,320
+ but the softmax always wants more: the softmax gets an improvement from
+ making those wrong scores even more negative -- it expresses a preference
+ for everything: minus two hundred or minus five hundred or minus a thousand
+ there would give a better loss, right. The SVM, on the other hand, doesn't
+ care at this point, unless the margin isn't met;
+
+494
+00:40:03,320 --> 00:40:28,568
+ it's decided on robustness once there's a clear dividing line: it wants
+ this margin to be met, but beyond that it doesn't micromanage where the
+ scores are, while the softmax will always want the probability on the
+ correct class to grow -- you'd never get to exactly zero. So that's one
+ clear qualitative difference between the two.
+
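+ The startup sanity check described a little earlier, as a one-liner (ten
+ classes, as in CIFAR-10):
+
+```python
+import numpy as np
+
+num_classes = 10
+print(-np.log(1.0 / num_classes))  # ~2.302: expected first softmax loss with tiny random W
+```
+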
+499
+00:40:28,568 --> 00:41:14,328
+ There was also a question -- yes, about the margin of one, which I
+ mentioned very briefly: it's not a hyperparameter you have to
+ cross-validate; you can fix it, and one reason is, kind of, that the
+ absolute values of the scores don't really matter: my W can make them
+ bigger or smaller, so I can achieve scores of different sizes. One turns
+ out to work fine, and in the notes I go into longer detail on exactly why
+ the choice of one is safe, but I don't want to spend time on it here. If
+ you wanted 20, that would work too -- any positive number can be used;
+ zero would be a problem,
+
+507
+00:41:14,329 --> 00:41:51,480
+ it would look different. Adding the constant actually gives you one
+ property in this case: if you go through the mathematical analysis of the
+ SVM -- CS229 has my favorite treatment -- you'll see that a max-margin
+ property plays in: the plus-one, combined with the L2 regularization
+ pulling toward small weights, not only meets a particular margin but gives
+ you a very nice max-margin property. I didn't really go into that property
+ in this lecture, but basically: make this a positive number, otherwise it
+ breaks.
+
+516
+00:41:51,480 --> 00:42:21,669
+ [Question about the scores themselves.] Right -- the numbers we get out of
+ this are kind of free, and we get to give them an interpretation, so that
+ your
+
+519
+00:42:21,670 --> 00:42:51,309
+ losses differ. In this particular case -- there are several versions of the
+ multiclass SVM floating around, and I showed you one of them -- that's one
+ interpretation we can put on these scores. For the softmax you might say
+ the scores should come normalized, but they don't just come normalized: we
+ have to do it explicitly, because there's no constraint that the output of
+ the function would be normalized; out of the box these are just real
+ numbers,
+
+526
+00:42:51,309 --> 00:43:57,139
+ they can be positive or negative, and what we do is interpret them as
+ unnormalized log probabilities and do what's needed to treat them that way
+ -- describing them like that is kind of awkward, I know, but I think the
+ question is getting at... [question about jiggling an example, and whether
+ the losses are kind of equivalent].
+
+530
+00:43:57,139 --> 00:44:27,339
+ Say I look at this one here and I jiggled it around: nothing would change
+ for the SVM -- I think the loss would definitely change for the softmax,
+ even if it doesn't change by a lot, whereas the SVM loss would stay at an
+ identical zero. The softmax is expressing preferences everywhere, different
+ preferences -- but basically it wouldn't actually make a very big
+ difference in practice.
+
+535
+00:44:27,340 --> 00:44:51,410
+ The picture this distinction gives you, if you're trying to see the
+ interaction: the SVM is a classifier that's interested in a very local part
+ of the space, near the boundary around the examples it's not yet sure
+ about, and beyond that it doesn't care, whereas the softmax kind of
+ interacts with the entire data cloud: it cares about all the points in your
+ data cloud -- you know, like, there's a small class here that you want to
+ separate from all the other things,
+
+542
+00:44:51,409 --> 00:45:16,809
+ and the softmax sort of gets a pull from that entire data cloud, while the
+ SVM just separates off the little pieces in the direct part of the space.
+ When you actually run them on data they give nearly identical results,
+ almost always, so I'm really not trying to pitch one or the other; when you
+ try one and then the other, the concept this gives you is that you're in
+ charge of the loss function.
+
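+ A sketch comparing the two losses on the three score vectors above (correct
+ class first in each; natural log used for the softmax):
+
+```python
+import numpy as np
+
+def svm_loss(s, y=0):
+    m = np.maximum(0, s - s[y] + 1.0)
+    m[y] = 0.0
+    return m.sum()
+
+def softmax_loss(s, y=0):
+    e = np.exp(s - s.max())
+    return -np.log(e[y] / e.sum())
+
+for s in [np.array([10., -2., 3.]),
+          np.array([10., 9., 9.]),
+          np.array([10., -100., -100.])]:
+    print(svm_loss(s), softmax_loss(s))
+# The SVM is satisfied (loss 0) on all three, since every margin is met,
+# while the softmax assigns a different nonzero loss to each: it still
+# wants the wrong scores pushed further down.
+```
+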
+00:45:16,809 --> 00:45:19,199
+ you can write down almost any formula
+
+549
+00:45:19,199 --> 00:45:23,279
+ as long as it is differentiable in how you want your scores to come out,
+
+550
+00:45:23,280 --> 00:45:26,619
+ and these two losses are really just two examples of how this is done;
+
+551
+00:45:26,619 --> 00:45:30,579
+ they are the ones that come up most in practice, but really we can put almost any loss
+
+552
+00:45:30,579 --> 00:45:34,619
+ on these scores that we want, and that is a very nice picture because we can optimize it
+
+553
+00:45:34,619 --> 00:45:46,700
+ overall. OK, at this point let me show you an interactive web demo
+
+554
+00:45:46,699 --> 00:45:54,289
+ so this is an interactive demo on the class page, so definitely check it out;
+
+555
+00:45:54,289 --> 00:45:58,409
+ you can find it at this URL. I wrote it last year, so I have to show it to all of you
+
+556
+00:45:58,409 --> 00:46:04,279
+ to justify the day I spent developing it, OK, but some of that lasts:
+
+557
+00:46:04,280 --> 00:46:12,440
+ it gets used again this year, so that day of my life was not wasted. What we have
+
+558
+00:46:12,440 --> 00:46:18,000
+ here is a two-dimensional problem with three classes, so I am showing three
+
+559
+00:46:18,000 --> 00:46:22,139
+ classes, each with three examples, over the two dimensions here, and I am showing
+
+560
+00:46:22,139 --> 00:46:24,969
+ the three classifiers here as level sets, so for example the red
+
+561
+00:46:24,969 --> 00:46:29,659
+ classifier has a score of zero along that line, and then I am showing the arrows
+
+562
+00:46:29,659 --> 00:46:35,509
+ along which the scores increase. So here is the W matrix; remember that
+
+563
+00:46:35,510 --> 00:46:38,609
+ the rows of the W matrix are the different classifiers, so we have the
+
+564
+00:46:38,608 --> 00:46:42,289
+ blue classifier, the red classifier and the green classifier, and for each we have
+
+565
+00:46:42,289 --> 00:46:47,349
+ the weights for both the x and y components, and also the biases. For the
+
+566
+00:46:47,349 --> 00:46:50,609
+ data set we have the x and y coordinates of every data point and
+
+567
+00:46:50,608 --> 00:46:55,779
+ its correct label, and therefore the scores, as well as the loss achieved by every data
+
+568
+00:46:55,780 --> 00:46:59,769
+ point under this setting of W, as you can see right now. So
+
+569
+00:46:59,769 --> 00:47:04,568
+ our data loss right now is 2.77, the regularization loss
+
+570
+00:47:04,568 --> 00:47:08,509
+ is 3.5, and so the total loss for this W is 6.27,
+
+571
+00:47:08,510 --> 00:47:14,810
+ and you can fiddle around with this, so basically I can change my W:
+
+572
+00:47:14,809 --> 00:47:19,328
+ here I am making one of the W entries bigger, and you can see what that
+
+573
+00:47:19,329 --> 00:47:25,940
+ does to the classifier, and from the biases you can see that the biases basically shift these hyperplanes
+
+574
+00:47:25,940 --> 00:47:32,639
+ around. OK, and then the thing we can do, and this is a
+
+575
+00:47:32,639 --> 00:47:35,848
+ preview of what is going to happen: we are getting the loss here, and what is going
+
+576
+00:47:35,849 --> 00:47:38,829
+ to happen is backpropagation, which gives us the gradient over how we want
+
+577
+00:47:38,829 --> 00:47:44,359
+ to adjust these W's in order to make this loss smaller, so we are going to
+
+578
+00:47:44,358 --> 00:47:48,838
+ repeat this: we started out with this W, but now I can improve
+
+579
+00:47:48,838 --> 00:47:54,460
+ this set of W's, so when I do a parameter update, what this actually does
+
+580
+00:47:54,460 --> 00:47:57,568
+ is use these gradients, which are shown right here,
+
+581
+00:47:57,568 --> 00:47:59,900
+ to make a small change to all of
+
+582
+00:47:59,900 --> 00:48:03,088
+ them according to the gradients, right, so as I do
+
+583
+00:48:03,088 --> 00:48:07,699
+ parameter updates, you can see that the total loss here keeps decreasing:
+
+584
+00:48:07,699 --> 00:48:11,338
+ the loss here just keeps getting better as I do parameter updates, so
+
+585
+00:48:11,338 --> 00:48:16,639
+ this is the process of optimization, which we are going to go into in a bit,
+
+586
+00:48:16,639 --> 00:48:20,989
+ and I can also start repeated updates, and basically we keep improving this W
+
+587
+00:48:20,989 --> 00:48:24,808
+ over and over, until our loss, which started out at roughly three or something,
+
+588
+00:48:24,809 --> 00:48:29,579
+ comes down to a mean loss over the data of something like one point something, and we are correctly
+
+589
+00:48:29,579 --> 00:48:39,068
+ classifying all the points here. I can also randomize W, so a random W
+
+590
+00:48:39,068 --> 00:48:41,980
+ kind of knocks it around, and it always converges back to these good solutions
+
+591
+00:48:41,980 --> 00:48:47,650
+ through the process of optimization. You can play here with the regularization as well,
+
+592
+00:48:47,650 --> 00:48:51,730
+ and there are also different forms of the loss, so let me show you now:
+
+593
+00:48:51,730 --> 00:48:55,990
+
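The demo's own update loop is not shown on the slides, so the following is a toy re-creation under assumed data: nine random 2-D points in three classes (as in the demo), a softmax data loss plus L2 regularization for brevity, and repeated parameter updates along the negative gradient. All names and constants here are made up for illustration:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(9, 2)                    # nine 2-D points, three per class
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])    # correct label for every point
W = np.random.randn(2, 3) * 0.01             # weights: 2 inputs -> 3 class scores
b = np.zeros(3)

step_size, reg = 1.0, 1e-3
for i in range(200):
    scores = X.dot(W) + b                              # class scores per point
    p = np.exp(scores - scores.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)                  # softmax probabilities
    loss = -np.log(p[range(9), y]).mean() + 0.5 * reg * np.sum(W * W)
    dscores = p.copy()
    dscores[range(9), y] -= 1
    dscores /= 9                                       # gradient on the scores
    dW = X.T.dot(dscores) + reg * W                    # backprop into W and b
    db = dscores.sum(axis=0)
    W += -step_size * dW                               # the parameter update
    b += -step_size * db
print(loss)                                            # keeps shrinking over the iterations
```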
합의 오후 제제는 몇 가지 더 SPM 제제가 그리고 거기에 있었다 + +594 +00:48:55,989 --> 00:49:01,098 + 또한 여기에 소프트 맥스는 내가 우리의 손실이 스위셔 소프트 최대 손실을 볼 때 것을 볼 수 있습니다 + +595 +00:49:01,099 --> 00:49:06,670 + 다른하고 있지만 솔루션은 I 스위치 그렇게 할 때 거의 같은되고있다 + +596 +00:49:06,670 --> 00:49:10,700 + 다시 그에게 당신은 작은 조각 주위에 이동 플레이어의 유형을 알고 있지만 정말이야 + +597 +00:49:10,699 --> 00:49:21,558 + 그것은 대부분 동일합니다이 얼마나 얼마나 큰 단계 그래서 그래서 이것은 단지 크기 + +598 +00:49:21,559 --> 00:49:25,650 + 우리는 너무 많은 약속 일을 개선하는 방법에 그라데이션을 얻을 때 우리는하고 있습니다 + +599 +00:49:25,650 --> 00:49:29,119 + 우리는 장면이 킥킥 웃고하려고하는 매우 큰 상승을 시작한다 + +600 +00:49:29,119 --> 00:49:32,309 + 이러한 데이터 포인트를 분리 한 후 시간이 지남에 따라 우리는에서 일을 할거야 + +601 +00:49:32,309 --> 00:49:36,430 + 우리가 우리의 업데이 트의 눈을 감소거야로 위치와이 일을하거나 + +602 +00:49:36,429 --> 00:49:43,298 + 천천히 우리는 결국 원하는 그래서 그래서 당신은 재생할 수있는 전제 수렴 + +603 +00:49:43,298 --> 00:49:47,170 + 우리와 함께 당신은 그가 점수는 손실이 나는 경우를 주위에 가서 어떤 방법을 볼 수 있습니다 + +604 +00:49:47,170 --> 00:49:53,358 + 당신이 이러한 점을 드래그 할 수 있습니다 반복 갱신을 중지하지만 맥 그것을 생각 + +605 +00:49:53,358 --> 00:49:58,598 + 내가 그렇게 좋은 사라이 점을 드래그하려고 그렇게 작동하지 않습니다 + +606 +00:49:58,599 --> 00:50:02,479 + 하지만 바탕 화면에서 작동 그래서 내가 가서 무슨 일이 있었 정확히 파악하지 + +607 +00:50:02,478 --> 00:50:14,480 + 그러나이 거기 재생할 수 있습니다 + +608 +00:50:14,480 --> 00:50:30,840 + 우리는 이것이 다른 한 도면이다 데이터 플러스 정규화 이상으로 평균 손실이 + +609 +00:50:30,840 --> 00:50:35,240 + 나는 그것이 아주 좋은도 생각하지 않습니다처럼이 어떻게 생겼는지 한 방법을 보여 + +610 +00:50:35,239 --> 00:50:38,858 + 내가 작년 기억할 수없는 그것에 대해 혼란 거기에 뭔가하지만, + +611 +00:50:38,858 --> 00:50:45,269 + 기본적으로이 데이터가 왜 이미지 레이블 및 W있다 + +612 +00:50:45,269 --> 00:50:49,719 + 이 과정을 유지하고 소송을 얻고 정규화 손실 + +613 +00:50:49,719 --> 00:50:54,939 + 우리가 지금 무엇을 원하는가 아닌 데이터와 아저씨의 무게의 기능 + +614 +00:50:54,940 --> 00:50:58,608 + 우리는 우리에게 주어진 것 바로 데이터 세트 우리가 제어 할 수없는된다 + +615 +00:50:58,608 --> 00:51:04,130 + 그 w를 제어하고 우리가 손실 W 변경할로 다를 수 있습니다 어떤을 위해 그렇게 + +616 +00:51:04,130 --> 00:51:08,340 + W 내가 손실을 계산할 수 나를 포기하고 그 손실은 우리가있어 얼마나 잘 연결되어 있습니다 + +617 +00:51:08,340 --> 00:51:12,730 + 우리의 모든 예제를 분류하는 것은 낮은 손실은 세계 최고 수준의 발견을 의미 한 가지도록 + +618 +00:51:12,730 --> 00:51:15,880 + 그들을 아주 잘의 훈련 데이터, 그리고, 우리는 우리의 손가락이 교차하고 + +619 +00:51:15,880 --> 00:51:20,809 + 또한 우리가 여기 보지 못했다 테스트 데이터에서 작동하는 하나의 전략이다 + +620 +00:51:20,809 --> 00:51:26,139 + 우리가 어떤에 대한 손실을 평가할 수 있기 때문에 있도록 최적화는 임의의 검색입니다 + +621 +00:51:26,139 --> 00:51:30,500 + 임의 W는 때 나는 어떻게 감당할 수와 내가 메신저를 통해 이동 해달라고 있는지 확실하지 않습니다 + +622 +00:51:30,500 --> 00:51:34,480 + 이 전체 상세히하지만 효과적으로 나는 무작위로 샘플링 나는 확인할 수 있습니다 자신의 + +623 +00:51:34,480 --> 00:51:37,460 + 손실 난 그냥 가장 적합한 W 추적 할 수 있습니다 + +624 +00:51:37,460 --> 00:51:43,090 + 좋아, 그래서 점점 점검과 최적화의 놀라운 과정이다 + +625 +00:51:43,090 --> 00:51:46,760 + 이 작업을 수행 할 경우, 나는이 작업을 수행 할 경우 내가 이천 번 시도 생각 밝혀 + +626 +00:51:46,760 --> 00:51:50,970 + 천 번과 최고의 W는 무작위로 발견 걸릴 당신은 당신의 좌석에서 실행 + +627 +00:51:50,969 --> 00:51:56,108 + 당신이 약 15.5 %의 정확도로 결국 그냥 만든 데이터를 바텐더와 + +628 +00:51:56,108 --> 00:52:01,150 + 그들이 행동하고 있기 때문에 클래스는 10 %의 확률로 평균 기준은 + +629 +00:52:01,150 --> 00:52:06,559 + 성능 때문에 15.5이 실제로 특히와 예술의 매우 상태 일부 신호 + +630 +00:52:06,559 --> 00:52:10,219 + 아흔다섯 공통입니다 그래서 우리는 몇 가지를 통해 너무 가까이 가지고 있다는 점이다 + +631 +00:52:10,219 --> 00:52:10,980 + 다음 + +632 +00:52:10,980 --> 00:52:17,670 + 이 슬라이드 하나에 그냥 있기 때문에 2 주 정도 그래서 이것은 그래서이를 사용하지 않는있다 + +633 +00:52:17,670 --> 00:52:21,659 + 이 프로세스 최적화처럼 정확히 어떤이의 해석은 보인다 + +634 +00:52:21,659 --> 00:52:25,399 + 우리가이 손실 풍경을 가지고 바로이 손실 풍경이 높은에 + +635 +00:52:25,400 --> 00:52:32,619 + 우리는 그 다음 차원에서 앉아서 당신의 손실 높이 여기도록 차원 W 공간 + +636 +00:52:32,619 --> 00:52:38,369 + 당신은 단지 2 W의이 경우가 있고 당신은 여기 그리고 당신 (W) 눈을 가리고있어 + +637 +00:52:38,369 --> 00:52:42,269 + 계곡이 있지만 당신은 당신이있는 한 낮은 손실을 찾기 위해 
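A hedged sketch of the random-search strategy described here: sample many random W's and keep whichever gives the lowest loss. The loss function below is a toy stand-in; on CIFAR-10 with a real SVM loss, a loop of exactly this shape is what produced the roughly 15.5% accuracy mentioned:

```python
import numpy as np

def toy_loss(W):                              # stand-in for the CIFAR-10 SVM loss
    return np.sum((W - 0.5) ** 2)

best_loss, best_W = float('inf'), None
for i in range(1000):
    W = np.random.randn(10, 3073) * 0.001     # a random CIFAR-10-shaped weight matrix
    loss = toy_loss(W)
    if loss < best_loss:                      # keep track of the best W found so far
        best_loss, best_W = loss, W
print(best_loss)
```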
노력하고 위치를 볼 수 있습니다 + +638 +00:52:42,269 --> 00:52:45,699 + 눈을 가린 당신은 고도 측정기를 가지고 있고 그래서 당신은 무엇을 말할 수 있습니다 + +639 +00:52:45,699 --> 00:52:49,029 + 단일 지점에서의 손실과 당신의 하단에 도착하기 위해 노력하고 + +640 +00:52:49,030 --> 00:52:55,430 + 계곡 오른쪽 그래서 정말 최적화하는 과정이고 우리가했습니다 + +641 +00:52:55,429 --> 00:52:59,399 + 당신은 것 도시 실제로 지금까지 당신이 순간 이동이 임의의 최적화로 + +642 +00:52:59,400 --> 00:53:03,309 + 주위에 당신은 단지 우리가있어 너무 좋아 너무 좋은 생각을 당신의 고도를하지 확인 + +643 +00:53:03,309 --> 00:53:06,940 + 우리가 그라데이션으로 내가 참조 무​​엇을 사용하려고하고있다 대신 할 예정이나 + +644 +00:53:06,940 --> 00:53:12,800 + 정말 우리가 너무 난 모든 단일 방향으로 가로 질러 기울기를 계산하고 + +645 +00:53:12,800 --> 00:53:17,990 + 기울기를 계산하기 위해 노력하고 우리가있어 그래서 내리막 확인 갈거야 + +646 +00:53:17,989 --> 00:53:21,289 + 나는이 있지만, 너무 많은 세부 사항으로 갈 않을거야 기울기를 다음 + +647 +00:53:21,289 --> 00:53:24,779 + 기본적으로 그렇게 정의 된 그라데이션 표현이있다 + +648 +00:53:24,780 --> 00:53:31,859 + 파생 포퓰리즘 (101) 정의와 여러 차원 경우가있다 + +649 +00:53:31,858 --> 00:53:35,409 + 당신은 그라데이션 권리라고있어 파생 상품의 이사가 + +650 +00:53:35,409 --> 00:53:39,589 + 우리의 승 여러 차원을 여러가 그렇게 때문에 우리는 그라데이션 벡터가 + +651 +00:53:39,590 --> 00:53:45,660 + 확인 그래서 이것은 표현이며, 실제로 우리는 수치를 평가할 수 있습니다 + +652 +00:53:45,659 --> 00:53:48,769 + 식 그게 보일 것 무엇을 표시하는 방법 논어에 가기 전에 + +653 +00:53:48,769 --> 00:53:54,190 + 일부 W의 그라데이션 우리는 약간의 현재 W를 가지고 우리가있어 가정 평가하려면 + +654 +00:53:54,190 --> 00:53:58,500 + 우리는 경사에 대한 아이디어를 얻을 싶지 않아하고 싶은 일부 손실이 좋아지고 + +655 +00:53:58,500 --> 00:54:03,239 + 그래서이 시점에서 우리는 기본적으로이 공식에서 볼거야 그리고 우리는있어 + +656 +00:54:03,239 --> 00:54:07,329 + 그냥 첫 번째 차원에 갈거야 그래서 평가에 가서 내가 갈거야 + +657 +00:54:07,329 --> 00:54:11,840 + 정말 어떻게이 수행 할 수 말하고있는 것은 폭발 고도를 평가입니다 + +658 +00:54:11,840 --> 00:54:15,590 + 크리스마스 H에 H에 의해 FFX 및 분할에서 제외 + +659 +00:54:15,590 --> 00:54:19,800 + 어떤 날의 작은 단계를 복용이 풍경 것으로 해당 응답 + +660 +00:54:19,800 --> 00:54:23,130 + 어떤 방향으로할지 여부를보고 내 발은 올라 갔다 아래로 + +661 +00:54:23,130 --> 00:54:27,340 + 바로 그 때문에 나는 작은 단계를 데리고 어떤 기울기가 말해가요 + +662 +00:54:27,340 --> 00:54:32,150 + 잃어버린 1.25 그때 유한 차이가 그 공식을 사용할 수있다 + +663 +00:54:32,150 --> 00:54:36,230 + 근사 우리는 작은 H이 실제로 파생 검토가 여기에 그라데이션 + +664 +00:54:36,230 --> 00:54:41,199 + 마이너스 2.5로 기울기는 아래로 그래서 나는 단계에게 손실을 가져다가 이렇게 감소 + +665 +00:54:41,199 --> 00:54:45,480 + 점에서 손실 함수의 관점 그래서 마이너스 2.5 하향 경 + +666 +00:54:45,480 --> 00:54:49,369 + 특히 치수는 그래서 독립적으로 모든 단일 차원에 대해이 작업을 수행 할 수 있습니다 + +667 +00:54:49,369 --> 00:54:53,210 + 나는에 단계 그래서 바로 그래서 내가 작은 금액을 추가 두 번째 차원으로 이동 + +668 +00:54:53,210 --> 00:54:56,869 + 나는 손실에 무슨 일이 있었는지를 보면 다른 방향 나는 공식 것을 사용 + +669 +00:54:56,869 --> 00:55:00,969 + 및 기울기가 2.6 인 그라데이션 내가 세 번째에 해당 할 수 있다고 말해됩니다 + +670 +00:55:00,969 --> 00:55:06,429 + 기본적으로 내가 여기서 말하는 겁니다 치수와 나는 확인 너무 슬퍼하세요 + +671 +00:55:06,429 --> 00:55:11,149 + 척추의 차를 사용하는 성분 수치를 평가 + +672 +00:55:11,150 --> 00:55:14,539 + 모든 단일 차원에 대해 독립적으로 내가 걸릴 수 있습니다 근사 + +673 +00:55:14,539 --> 00:55:18,500 + 작은 손실에 단계 그리고 나에게 느린 지시가 위쪽으로 갈거나 + +674 +00:55:18,500 --> 00:55:23,829 + 아래 이러한 매개 변수의 모든 하나 하나 등이 미국이다 + +675 +00:55:23,829 --> 00:55:28,500 + 그것은 추한 보이는 여기를 피하는이 그것과 같을 것이다 방법을 사기 펑크가된다 그라데이션 + +676 +00:55:28,500 --> 00:55:32,630 + 그것이 나오는 것에 있기 때문에 반복 약간 까다로운 모든 W의 만 + +677 +00:55:32,630 --> 00:55:36,780 + 기본적으로 우리는 단지 두 가지 효과를 비교로 나누어 나이에 찾고 + +678 +00:55:36,780 --> 00:55:41,200 + 우리가 계약을 받고있어 나이 사용할 경우 지금의 문제입니다 + +679 +00:55:41,199 --> 00:55:44,960 + 물론 수치 그라데이션 이벤트는 우리가 매일이 작업을 수행해야 + +680 +00:55:44,960 --> 00:55:47,949 + 차원은 어떤 위대한 스피 이러한 노력 단일 차원의 감각을 얻을 수 + +681 +00:55:47,949 --> 00:55:53,079 + 당신은 코멘트를 때 오른쪽 당신은 매개 변수의 수백만의 수백 + +682 +00:55:53,079 --> 00:55:58,139 + 바로 그래서 우리는 실제로 수억의 손실을 확인 할 여유가 없다 + +683 +00:55:58,139 --> 00:56:02,920 + 예비 선거의 우리는 한 단계 우리가하려고 것 때문에이 방법을하기 
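The finite-difference procedure being walked through, as a short Python sketch: wiggle one dimension at a time by a small h and read off the slope. `f` stands in for the real loss function:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Finite differences: nudge each dimension by h and see how f responds."""
    grad = np.zeros_like(x)
    fx = f(x)                                 # loss at the current point
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h                   # take a small step in this one dimension
        grad.flat[i] = (f(x) - fx) / h        # slope along that dimension
        x.flat[i] = old                       # undo the step
    return grad

W = np.array([1.0, -2.0, 0.5])
print(numerical_gradient(lambda w: np.sum(w ** 2), W))   # roughly [2, -4, 1]
```

This makes the two drawbacks visible: it is approximate (one evaluation per h), and it needs one full loss evaluation per parameter, which is hopeless with millions of parameters.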
전에 + +684 +00:56:02,920 --> 00:56:06,869 + 우리는 유한 사용하고 있기 때문에 평가 그라데이션 수치 대략적인 + +685 +00:56:06,869 --> 00:56:11,119 + 내가 할 필요가 있기 때문에 차분 근사 둘째도 매우 느립니다 + +686 +00:56:11,119 --> 00:56:15,460 + 아이콘의 손실 함수에 만 확인 내가 무엇을 알기도 전에 것이 + +687 +00:56:15,460 --> 00:56:20,519 + 그라데이션 나는 매우 느린 대략적인 회전 수 있도록 차 업데이트를 취할 수 없습니다 + +688 +00:56:20,519 --> 00:56:26,730 + 이 때문에 바보 권리는 모든 것을 밖으로 때문에 W의 함수로 손실 + +689 +00:56:26,730 --> 00:56:29,800 + 우리는 그것에 대해 서면으로 작성했습니다 정말 우리가 원하는 것을 우리는의 기울기를 원하는됩니다 + +690 +00:56:29,800 --> 00:56:33,220 + 각각 11 마지막 운 좋게 우리는 단지를 쓸 수 있습니다 + +691 +00:56:33,219 --> 00:56:42,598 + 이 녀석 덕분에 실제로 당신이 바로 그 사람이하고있는 사람을 알고 + +692 +00:56:42,599 --> 00:56:49,400 + 단지 모양이 매우 비슷하지만, 기본적으로 얻을 수있는 것입니다 알고 + +693 +00:56:49,400 --> 00:56:54,289 + 미적분학의 발명자에 이런 일이 실제로 논란이있다 + +694 +00:56:54,289 --> 00:56:59,429 + 이상 사람 정말 발명 한 미적분시키고이 사람이 서로를하지만, + +695 +00:56:59,429 --> 00:57:03,799 + 기본적으로 수학이 강력한 망치이며, 우리가 할 수있는 일이 아닌 것입니다 + +696 +00:57:03,800 --> 00:57:06,440 + 어리석은 일을의 우리는 우리가 할 수있는 수치 그라데이션을 평가하고 + +697 +00:57:06,440 --> 00:57:10,230 + 실제로 수학을 사용하고 우리는 어떤 기울기에 대한 표현을 분해 할 수 있습니다 + +698 +00:57:10,230 --> 00:57:14,880 + 공백의 손실 함수 떨어져 그래서 기본적으로 대신 멍청이 + +699 +00:57:14,880 --> 00:57:18,289 + 주변이 작업은 최대 것입니다 아니면 손실 I을 선택하여 추락 + +700 +00:57:18,289 --> 00:57:22,509 + 그냥이의 그라데이션을 표현을하고 난 간단하게 동기화 할 수 있습니다 + +701 +00:57:22,510 --> 00:57:26,500 + 전체 물질이 실제로있는 유일한 방법을 실행할 수 있다는 것입니다 무엇 평가 + +702 +00:57:26,500 --> 00:57:30,159 + 년 동안이 연습의 권리를 우리는 할 수 있습니다 만 표현 우리가 할 수있는 그라데이션 + +703 +00:57:30,159 --> 00:57:35,149 + 그래서 중지하려면 어떻게 요약 기본적으로 숫자 그라데이션 대략에게에 너무 + +704 +00:57:35,150 --> 00:57:39,800 + 느리지 만 그냥이 매우 간단한 일을하고 있기 때문에 쓰기 아주 쉽게 + +705 +00:57:39,800 --> 00:57:44,190 + 손해 나 손실에 기능에 대한 처리 난에 대한 그라데이션 벡터를 얻을 수 있습니다 + +706 +00:57:44,190 --> 00:57:47,659 + 당신이 실제로 할 것입니다 구배는 정확한에는 유한을 수학하지 + +707 +00:57:47,659 --> 00:57:52,210 + 포고문 매우 빨리하지만 오류가 발생하기 쉬운 당신이 실제로에 있기 때문이다 + +708 +00:57:52,210 --> 00:57:57,300 + 실제로 바로 그래서 수학을 당신이 무엇을보고 우리는 항상 그라데이션을 많이 사용 + +709 +00:57:57,300 --> 00:58:01,380 + 우리는 항상 우리가 그라데이션해야 알아낼 수학을하지만, + +710 +00:58:01,380 --> 00:58:04,789 + 그것의 언급으로 미국의 그라데이션 체크를 사용하여 구현을 확인 + +711 +00:58:04,789 --> 00:58:10,480 + 그래서 나는 내가 작성해야합니다 나는 손실 함수에 대한 관심을 모두 수행 할 수 있습니다 + +712 +00:58:10,480 --> 00:58:15,500 + 내 코드에서 평가 된 그라데이션 표현은 그래서 휴가를 얻을 + +713 +00:58:15,500 --> 00:58:18,769 + 인사는 다음 나는 또한 측면에와 있음을 수치 그라데이션의 리드가 + +714 +00:58:18,769 --> 00:58:22,280 + 잠시 필요하지만 당신은 당신에게 더 편리 리드가 성숙하면 확인 + +715 +00:58:22,280 --> 00:58:25,890 + 그 두 가지가 동일하고 우리는 당신이 녹색을 통과라고 확인 + +716 +00:58:25,889 --> 00:58:29,500 + 그래서 트럭 확인 당신이 개발을 시도 할 때마다 당신이 실제로 무엇을보고있어 + +717 +00:58:29,500 --> 00:58:32,519 + 내부 네트워크에 대한 새로운 모듈은 바로 내가 할 수있는 권리를 잃은 것하고 + +718 +00:58:32,519 --> 00:58:35,759 + 당신이 확인해야 다음 그라데이션 완전하고 대한 후방 패스 + +719 +00:58:35,760 --> 00:58:40,250 + 그라데이션은 당신의 수학이 정확한지 확인해 확인하고 I + +720 +00:58:40,250 --> 00:58:43,980 + 이미 우리는에 잘 보았다 최적화이 과정 함 + +721 +00:58:43,980 --> 00:58:45,838 + 우리가이 웹 데모 + +722 +00:58:45,838 --> 00:58:49,548 + 루프는 우리는 당신의 손실에 어디서 단순히 밸리 그라데이션을 최적화 할 때 + +723 +00:58:49,548 --> 00:58:53,759 + 때 그라데이션을 아는 기능, 그리고, 우리는 프라이머 업데이트를 수행 할 수 있습니다 + +724 +00:58:53,759 --> 00:58:58,509 + 우리가 부정적으로 업데이트 할 특히 WBI 작은 양을 변경할 + +725 +00:58:58,509 --> 00:59:04,509 + 스텝 사이즈 배 네거티브 기울기 때문에 존재 그래디언트 + +726 +00:59:04,509 --> 00:59:07,478 + 그것은 당신을 알려줍니다 가장 큰 증가의 방향을 알려줍니다 끝까지 + +727 +00:59:07,478 --> 00:59:10,848 + 손실이 증가하고 부정적인 그것은 어디 인을 최소화하려면 + +728 +00:59:10,849 --> 00:59:14,298 + 어디에서 오는 가서 여기에 음의 판독 방향 스텝 크기로 + +729 +00:59:14,298 --> 00:59:17,818 + 당신에게있는 두통의 스텝 크기의 거대한 양의가 발생할 하이퍼 차 + +730 
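A minimal version of the gradient check described here: derive the gradient with calculus, also estimate it with centered finite differences, and compare the two with a relative error. The cubic `f` is an assumed toy function, not anything from the lecture:

```python
import numpy as np

def relative_error(a, b):
    # The usual gradient-check metric: relative, not absolute, difference.
    return np.abs(a - b) / np.maximum(1e-8, np.abs(a) + np.abs(b))

f = lambda w: np.sum(w ** 3)
W = np.random.randn(5)
analytic = 3 * W ** 2                            # the gradient you derived by hand
h = 1e-5                                         # centered finite differences
numeric = np.array([(f(W + h * e) - f(W - h * e)) / (2 * h)
                    for e in np.eye(5)])
assert relative_error(analytic, numeric).max() < 1e-4   # "passing" the gradient check
```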
+00:59:17,818 --> 00:59:23,298
+ this learning rate is basically the most important parameter to worry about;
+
+731
+00:59:23,298 --> 00:59:27,778
+ really there are just two things you have to worry about most: the step size, or
+
+732
+00:59:27,778 --> 00:59:31,539
+ learning rate, and the regularization strength lambda that
+
+733
+00:59:31,539 --> 00:59:35,180
+ we already saw; those two hyperparameters are really the biggest headaches,
+
+734
+00:59:35,179 --> 00:59:45,219
+ and they are what we typically cross-validate over. Was there a question about the gradient?
+
+735
+00:59:45,219 --> 00:59:50,849
+ Right, the gradient is just a direction: it tells you, for every dimension of
+
+736
+00:59:50,849 --> 00:59:56,109
+ the weights, which way the loss changes, so we take a step opposite to the gradient in weight
+
+737
+00:59:56,108 --> 01:00:00,768
+ space: you are somewhere at a W, you get your gradient, and you step some amount
+
+738
+01:00:00,768 --> 01:00:05,228
+ in the direction of the negative gradient, but you do not know how big that step
+
+739
+01:00:05,228 --> 01:00:08,449
+ should be; that is the step size, and when I was increasing the step size in the demo
+
+740
+01:00:08,449 --> 01:00:11,248
+ things were jiggling around quite a lot; there was a lot of energy
+
+741
+01:00:11,248 --> 01:00:15,449
+ in the system because I was taking huge steps and jumping all over the basin. So here
+
+742
+01:00:15,449 --> 01:00:19,578
+ the loss function has its minimum in the blue part and is high in the red parts,
+
+743
+01:00:19,579 --> 01:00:23,920
+ so we want to get down to the bottom of the basin, and actually these loss
+
+744
+01:00:23,920 --> 01:00:28,579
+ functions, in the SVM case, are convex problems, so our problem is
+
+745
+01:00:28,579 --> 01:00:31,729
+ really just a bowl and we are trying to get to the bottom of it, but this bowl is
+
+746
+01:00:31,728 --> 01:00:35,009
+ 30,000-dimensional, and so that is why it takes a while.
+
+747
+01:00:35,010 --> 01:00:39,640
+ OK, so we take steps: we evaluate the gradient, and we iterate this over and
+
+748
+01:00:39,639 --> 01:00:44,980
+ over. There is an additional part I wanted to mention, which is that in practice we do not
+
+749
+01:00:44,980 --> 01:00:49,860
+ actually evaluate the loss over the full training set; all we ever do in practice
+
+750
+01:00:49,860 --> 01:00:53,370
+ is use what we call mini-batch gradient descent, where we
+
+751
+01:00:53,369 --> 01:00:58,670
+ do not use the full data set; instead we sample a batch from it, say
+
+752
+01:00:58,670 --> 01:01:02,300
+ 32 examples from my training data, I evaluate the loss and the gradient
+
+753
+01:01:02,300 --> 01:01:05,940
+ on this batch of 32, I do my parameter update, and I keep doing this
+
+754
+01:01:05,940 --> 01:01:09,619
+ over and over again. What ends up happening is that, since you only sample
+
+755
+01:01:09,619 --> 01:01:14,699
+ a few data points, the gradient you are using is a noisy estimate of the gradient
+
+756
+01:01:14,699 --> 01:01:18,109
+ over the full training set, because it is only
+
+757
+01:01:18,110 --> 01:01:21,970
+ based on a small subset of the data, but you get to take more steps, so
+
+758
+01:01:21,969 --> 01:01:25,689
+ you can either take many steps with an approximate gradient, or do a few
+
+759
+01:01:25,690 --> 01:01:30,179
+ steps with the exact gradient, and in practice what ends up working better is using
+
+760
+01:01:30,179 --> 01:01:35,049
+ mini-batches; it is of course also much more efficient, and it is actually impractical
+
+761
+01:01:35,050 --> 01:01:41,550
+ to do full-batch gradient descent. Common mini-batch sizes are 32, 64, 128 or 256;
+
+762
+01:01:41,550 --> 01:01:45,940
+ this is not a hyperparameter we usually worry about too much; it usually gets settled
+
+763
+01:01:45,940 --> 01:01:49,380
+ based on what fits on your GPU; we will talk about GPUs in a bit, but they
+
+764
+01:01:49,380 --> 01:01:53,030
+ have a finite amount of memory, say 6 gigabytes or so, and
+
+765
+01:01:53,030 --> 01:01:58,030
+ you typically pick a batch size such that your batches of examples fit in
+
+766
+01:01:58,030 --> 01:02:01,150
+ your memory, so that is the way it is typically determined, and it does not
+
+767
+01:02:01,150 --> 01:02:09,570
+ actually matter very much in the sense of the optimization. [Question about momentum]
+
+768
+01:02:09,570 --> 01:02:14,789
+ We will get to momentum in a bit, but yes, if you want to use momentum
+
+769
+01:02:14,789 --> 01:02:18,969
+ this is all fine; we always use momentum with mini-batches, it is very common. OK,
+
+770
+01:02:18,969 --> 01:02:23,799
+ so just to give you an idea of what this looks like if you do this in practice:
+
+771
+01:02:23,800 --> 01:02:28,510
+ I am running the optimization over time, and I am looking at the loss evaluated
+
+772
+01:02:28,510 --> 01:02:32,700
+ on a small mini-batch of data, and you can see that basically my loss goes down
+
+773
+01:02:32,699 --> 01:02:37,309
+ over time on these mini-batches from the training data as I optimize,
+
+774
+01:02:37,309 --> 01:02:42,119
+ and it is noisy; of course, if I was doing full gradient descent, this would
+
+775
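A sketch of the mini-batch loop just described, with tiny stand-in data (CIFAR-10 would be 50,000 x 3073) and a placeholder `loss_and_grad`; only the sample-then-update pattern is meant to be taken literally:

```python
import numpy as np

X_train = np.random.randn(1000, 30)           # stand-in data; CIFAR-10 is 50000 x 3073
y_train = np.random.randint(0, 10, 1000)
W = np.random.randn(10, 30) * 0.001

def loss_and_grad(W, X_batch, y_batch):       # placeholder returning (loss, dW)
    return np.sum(W ** 2), 2 * W

step_size = 1e-3
for it in range(100):
    idx = np.random.choice(len(X_train), 32, replace=False)
    X_batch, y_batch = X_train[idx], y_train[idx]   # a noisy 32-example estimate
    loss, dW = loss_and_grad(W, X_batch, y_batch)
    W += -step_size * dW                      # update on the mini-batch, then repeat
```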
+01:02:42,119 --> 01:02:44,839 + 그냥 날 다시 데이터의 샘플 당신은 많은 소음을 기대하지 않을 것이다되지 않았습니다 + +776 +01:02:44,840 --> 01:02:48,550 + 당신은 단지 우리가 저를 사용하기 때문에하지만 내려갑니다이가 정렬 될 것으로 예상 + +777 +01:02:48,550 --> 01:02:51,730 + 당신에 대해 뭔가 더 나은 있기 때문에 당신이이 노이즈를 얻을 다시하는 경우 + +778 +01:02:51,730 --> 01:03:01,980 + 다른 사람보다하지만 시간이지나면서 그들은 모두가 질문을 아래로 갈 수있다 + +779 +01:03:01,980 --> 01:03:07,539 + 예 선생님 당신은 당신이 사용하는이 손실 함수의 형태에 대해 궁금 + +780 +01:03:07,539 --> 01:03:11,420 + 어쩌면 더 빠른 개선을보고 바로 이러한 손실 함수에 와서 있습니다 + +781 +01:03:11,420 --> 01:03:17,079 + 정말 그것이 반드시 그렇지 않다 달려있다, 그래서 다른 모양은 크기 + +782 +01:03:17,079 --> 01:03:21,940 + 그 손실 함수는 처음에 있지만 때로는 매우 날카로운 봐야한다 그들이 + +783 +01:03:21,940 --> 01:03:25,929 + 그들은 그것은 또한 당신에 중요한 예를 들어 다른 모양을 가지고 수행 + +784 +01:03:25,929 --> 01:03:29,618 + 내 초기화 조심 해요 경우 초기화 내가 덜 기대 + +785 +01:03:29,619 --> 01:03:34,990 + 점프하지만 매우 잘못 초기화하면 당신은 그 기대 + +786 +01:03:34,989 --> 01:03:38,649 + 그것은 우리가받을거야 최적화에 매우 초기에 해결 될 것 + +787 +01:03:38,650 --> 01:03:43,309 + 그 부분의 일부 나중에 나는 또한 당신에게 많이 보여주고 싶은 많은 생각 + +788 +01:03:43,309 --> 01:03:49,710 + 하여 손실 함수와 계속 학습의 학습 속도의 영향 + +789 +01:03:49,710 --> 01:03:53,820 + 속도는 기본적으로는 스텝 사이즈의 학습 속도 또는 스텝 사이즈하면 매우 높다 + +790 +01:03:53,820 --> 01:03:59,240 + 당신의 W 공간에서 주위에 돌진 시작하고 그래서 난 수렴 해달라고 또는 당신은 당신이 경우 폭발 + +791 +01:03:59,239 --> 01:04:02,618 + 당신은 거의 또한 업데이트를하고있어 다음 매우 낮은 학습 속도가 + +792 +01:04:02,619 --> 01:04:07,869 + 실제로 수렴 시간이 오래 걸리고는 높은 교육이 있다면 + +793 +01:04:07,869 --> 01:04:11,150 + 요금은 때때로 당신은 기본적으로 나쁜 위치에 붙어의 종류를 얻을 수 있습니다 + +794 +01:04:11,150 --> 01:04:14,950 + 당신이 손실 함수의 종류 때문에 손실이 그렇다면 최소한으로 내려받을 필요 + +795 +01:04:14,949 --> 01:04:17,929 + 당신이 없을 때 당신은 너무 빨리 당신의 스타킹에 너무 많은 에너지를 가지고 + +796 +01:04:17,929 --> 01:04:21,679 + 당신은 당신의 문제가 가지 작은 로컬 최소값에 정착하는 것을 허용하지 않습니다 + +797 +01:04:21,679 --> 01:04:25,480 + 당신은 신경 네트워크 및 최적화에 대해 일반적으로 당신의 목표를 이야기 할 때 + +798 +01:04:25,480 --> 01:04:28,320 + 즉 우리가 통신 할 수있는 유일한 방법이기 때문에 당신은 손을 흔들며을 많이 볼 수 있습니다 + +799 +01:04:28,320 --> 01:04:32,350 + 이러한 손실과 거리가 그래서 그냥 큰 손실의 분지와 같은 상상 + +800 +01:04:32,349 --> 01:04:36,069 + 당신은 탈곡하는 경우 이러한 작은 손실 등의 작은 주머니 같은있다 + +801 +01:04:36,070 --> 01:04:39,480 + 주위에 당신은 작은 손실 부품 컨버터에 정착 할 수 + +802 +01:04:39,480 --> 01:04:43,730 + 그 때문에 그 이유는 학습 속도 너무 좋은 그래서 올바른을 찾을 필요 + +803 +01:04:43,730 --> 01:04:47,150 + 속도를 배우는 것은 많은 두통의 원인과 사람들은 대부분 무엇을 할 + +804 +01:04:47,150 --> 01:04:49,970 + 시간은 때때로 우리가 어떤 혜택을받을 높은 학습 속도로 시작된다 + +805 +01:04:49,969 --> 01:04:55,319 + 다음 높은 함께 시작하는 데 시간이 지남에 그것을 UDK 그리고, 우리는 학습 타락 + +806 +01:04:55,320 --> 01:05:00,780 + 우리는 좋은 해결책에 정착하고 시간이 지남에 읽고 나는 또한 원하는 + +807 +01:05:00,780 --> 01:05:03,550 + 훨씬 더 자세하게거야하지만 방법은 내가 일을 해요 누가 지적 + +808 +01:05:03,550 --> 01:05:07,890 + 실제로 W을 수정 구배를 사용하는 방법이다 여기서 업데이트 + +809 +01:05:07,889 --> 01:05:12,789 + 그 일을 여러 가지 형태가 업데이트 펌웨어 업데이트라고 + +810 +01:05:12,789 --> 01:05:14,869 + 그것은이 있었다 가장 간단한 방법입니다 + +811 +01:05:14,869 --> 01:05:20,299 + 다만 STD 간단한 사용자 지정 인사말 %가되지만, 많은 공식이있다 + +812 +01:05:20,300 --> 01:05:23,740 + 당신이있어 이미 모멘텀에 언급 된 모멘텀은 기본적으로 상상 + +813 +01:05:23,739 --> 01:05:27,949 + 이 최적화를 수행하면은 그래서이 블로그 도시의 트랙을 유지 상상 + +814 +01:05:27,949 --> 01:05:31,389 + 나는 긍정적를보고 계속 그래서 만약 또한 내 속도의 트랙을 유지하고 스테핑 해요 + +815 +01:05:31,389 --> 01:05:35,519 + 내가 그 방향으로 속도를 축적 몇 가지 방향을 읽고 그래서 난 몰라 + +816 +01:05:35,519 --> 01:05:39,550 + 러시아에서 빨리 갈 사람이 필요하고 그래서 로스에서 몇 가지가 있습니다 것 + +817 +01:05:39,550 --> 01:05:46,100 + 보고 일반적으로 곧 클래스하지만 토마스 소품 아담 또는 그래서 그냥에 사용 + +818 +01:05:46,099 --> 01:05:50,569 + 이러한 서로 다른 선택의 모양을 그들이 할 수있는 것을 보여 + +819 +01:05:50,570 --> 01:05:56,760 + 당신의 손실 함수이 우리가 손실을 가지고, 그래서 여기에 알렉에서 그림입니다 + +820 +01:05:56,760 --> 01:06:02,390 + 기능과 
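For reference, here is what the two update formulas mentioned so far look like in code; this is a sketch of the common form of the momentum update, not the lecture's exact notation, and RMSprop and Adam are variations on the same theme:

```python
import numpy as np

W = np.random.randn(10)
dW = np.random.randn(10)                      # pretend this gradient came from backprop
step_size, mu = 1e-2, 0.9

# Vanilla SGD: step straight down the gradient.
W_sgd = W - step_size * dW

# Momentum: keep a running velocity; gradients accumulate into it over time,
# and the parameters follow the velocity instead of the raw gradient.
v = np.zeros_like(W)
v = mu * v - step_size * dW
W_momentum = W + v
```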
이러한 낮은 수준의 점원 그리고 우리는 저기 반대를 시작 + +821 +01:06:02,389 --> 01:06:06,920 + 우리는 당신을 줄 것이다 유역과 다른 업데이트 공식에 도착하기 위해 노력하고 + +822 +01:06:06,920 --> 01:06:10,670 + 당신은 예를 들어이 서로 다른 문제에 좋든 나쁘 든 수렴을 볼 수 있도록 + +823 +01:06:10,670 --> 01:06:15,369 + 녹색 모멘텀이 내려 갔다으로는 모멘텀을 구축 한 다음은 오버 슈팅과 + +824 +01:06:15,369 --> 01:06:19,259 + 다음은 가지 돌아가 다시 돌아가 UD 등이 읽을 수 수렴 영원히 소요 + +825 +01:06:19,260 --> 01:06:23,370 + 그것은 그녀가 등장하고 있습니다 영원히 걸립니다 내가 지금까지 당신을 제시 무엇 + +826 +01:06:23,369 --> 01:06:27,489 + 실제로이 차 위로가 수행하는 다른 방법은 더 많거나 적은 + +827 +01:06:27,489 --> 01:06:35,259 + 현대화 효율적인 내가 또한에서 언급하고 싶었이 훨씬 더 볼 수 있습니다 + +828 +01:06:35,260 --> 01:06:39,950 + 확률이이 점은 예 내가 분명히 설명 해요로 약간 가고 싶어 + +829 +01:06:39,949 --> 01:06:43,049 + 당신의 분류처럼 우리는 우리가 알고있는 문제를 설정하는 방법을 알고 + +830 +01:06:43,050 --> 01:06:47,070 + 다른 손실이 날 우리가 가지에서 할 수 있도록 그들을 최적화하는 방법을 알고 기능 + +831 +01:06:47,070 --> 01:06:51,050 + I에서이 점은 내가 당신의 감각을 부여 할 것을 언급 원하는 것을 + +832 +01:06:51,050 --> 01:06:53,710 + 댓글에 대한 그래서 당신은을 가지고 오기 전에 컴퓨터 비전처럼 보였다 + +833 +01:06:53,710 --> 01:06:57,920 + 역사적 관점의 비트 우리는 선형 분류 모든 시간을 사용하기 때문에 + +834 +01:06:57,920 --> 01:07:01,019 + 하지만 물론 당신은 도로 원본 이미지에없는 보통 클래식 자동차를 수행 + +835 +01:07:01,019 --> 01:07:06,759 + 즉 당신이 믿는하려는 모든 때문에 우리는 당신 같은 그것으로 문제를 해결 + +836 +01:07:06,760 --> 01:07:10,250 + 나는 그들이에 사용 된 경찰이해야 할 생각에 모든 모드 등을 포함해야 + +837 +01:07:10,250 --> 01:07:14,380 + 모든 이미지의 서로 다른 기능 유형을 계산하고 당신은 볼 수 있습니다 + +838 +01:07:14,380 --> 01:07:17,160 + 다른 기능 유형에서 다른 설명하면 다음을 얻을 + +839 +01:07:17,159 --> 01:07:22,049 + 이미지가 주파수처럼 어떤 모습의 통계 요약 + +840 +01:07:22,050 --> 01:07:26,160 + 등등, 그리고, 우리는 capitated 모든 큰 벡터에 그와 우리는 넣을 수 있습니다 + +841 +01:07:26,159 --> 01:07:27,710 + 선형 분류에 그 + +842 +01:07:27,710 --> 01:07:32,050 + 까지 간 다음, 그들 모두 연결된 등 다양한 기능 유형 + +843 +01:07:32,050 --> 01:07:35,369 + 일반적으로 파이프 라인이었다 당신의 분류, 그래서 그냥 당신의 아이디어를 제공합니다 + +844 +01:07:35,369 --> 01:07:39,088 + 정말 무슨 회담 당신이 수도 한 아주 간단한 기능 유형 같았다 + +845 +01:07:39,088 --> 01:07:43,269 + 나는 모든 이미지의 픽셀을 통해 갈 수 있도록 단지 컬러 히스토그램 상상 + +846 +01:07:43,269 --> 01:07:47,449 + 나는 그들 거라고하고 따라 다른 색상이 얼마나 많은 밴드 대답 + +847 +01:07:47,449 --> 01:07:50,750 + 당신이 상상할 수있는 색상의 색조에 종류의 일처럼 + +848 +01:07:50,750 --> 01:07:54,250 + 이미지에 무엇의 통계 요약 색상 각 단지 숫자입니다 + +849 +01:07:54,250 --> 01:07:57,400 + 그래서 이것은 내가 결국 될 것입니다 선생님의 하나가 될 것입니다되었습니다 + +850 +01:07:57,400 --> 01:08:03,440 + 다양한 기능의 유형과 친밀의 다른 종류의 절단 + +851 +01:08:03,440 --> 01:08:06,530 + 분류는 당신이 그것에 대해 생각하면 선형 분류는 이러한 기능을 사용할 수 있습니다 + +852 +01:08:06,530 --> 01:08:09,690 + 선형 분류 좋아 할 수 있기 때문에 실제로 분류를 수행하는 + +853 +01:08:09,690 --> 01:08:14,320 + 또는 양 또는과 이미지에 다른 색상을 많이보고 싫어 + +854 +01:08:14,320 --> 01:08:17,930 + 부정적인 무엇 매우 일반적인 기능은 또한 우리가 부르는 등이 포함됩니다 + +855 +01:08:17,930 --> 01:08:22,440 + 610 매 기능은 기본적으로 이러한 당신은 현지 지역에 이동했다 + +856 +01:08:22,439 --> 01:08:26,539 + 발명과는 다른 방향의 제비가 있는지 여부를보고 + +857 +01:08:26,539 --> 01:08:30,588 + 그래서 수평 또는 수직 가장자리의 많은이 우리는 히스토그램을 구성 + +858 +01:08:30,588 --> 01:08:35,850 + 그 이상하고 그래서 당신은 가장자리의 종류 단지 요약 끝날 때 + +859 +01:08:35,850 --> 01:08:40,338 + 상기 이미지이고, 당신이 사람들은 모두 함께 있었다 계산할 수 있습니다 + +860 +01:08:40,338 --> 01:08:45,250 + 수년에 걸쳐까지 제안 된 우리의 다른 유형의 많은 단지 내가있을거야 + +861 +01:08:45,250 --> 01:08:50,359 + 측정하는 다른 방법을 많이에 과세 것들의 종류가 + +862 +01:08:50,359 --> 01:08:54,850 + 이미지와 그들의 통계, 그리고, 우리는이 파이프 라인은 다시 전화했다 + +863 +01:08:54,850 --> 01:08:59,660 + 내 장소를 통해 당신이 다른 점을보고 어디 + +864 +01:08:59,659 --> 01:09:04,250 + 당신은 같은 당신이 와서 뭔가 조금 로컬 패치를 설명 + +865 +01:09:04,250 --> 01:09:08,329 + 주파수보고는 색상을보고 또는 다음 무엇이든되고 우리 + +866 +01:09:08,329 --> 01:09:12,269 + 여기에 확인 이러한 사전 내놓았다 우리가 이미지를 볼 수있는 물건입니다 + +867 +01:09:12,270 --> 01:09:16,250 + 같은 
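A toy version of the color-histogram feature just described, assuming an H x W x 3 image with values in [0, 1]; real pipelines usually binned hue rather than raw RGB channels, but the idea of a small statistical summary is the same:

```python
import numpy as np

def color_histogram(image, bins=8):
    """Count how many pixels fall into each of `bins` bands, per RGB channel,
    and concatenate: a 3 * bins summary of which colors appear in the image."""
    hists = [np.histogram(image[..., c], bins=bins, range=(0.0, 1.0))[0]
             for c in range(3)]
    return np.concatenate(hists) / float(image[..., 0].size)

img = np.random.rand(32, 32, 3)               # stand-in image, values in [0, 1]
feature = color_histogram(img)                # this vector, not raw pixels, fed the classifier
```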
파란색과 낮은 주파수 물건에 대한 고주파 정지 많이있다 + +868 +01:09:16,250 --> 01:09:16,699 + ...에 + +869 +01:09:16,699 --> 01:09:21,338 + 물건의 종류를 볼 수 무엇의 K-수단을 사용하여 무게 중심으로 끝낼 수 있습니다 + +870 +01:09:21,338 --> 01:09:25,818 + A의 바로 다음 우리는 얼마 동안 통계 등의 모든 하나의 이미지를 표현 + +871 +01:09:25,819 --> 01:09:29,660 + 예를 들어,이 이미지는 많이있다, 그래서 각각의 일의 우리는 이미지 참조 + +872 +01:09:29,659 --> 01:09:33,949 + 고주파 녹색 물건 당신은 기본적으로 몇 가지 특징 벡터가 표시 될 수 있도록 + +873 +01:09:33,949 --> 01:09:38,568 + 더 높은 가치와 높은 주파수와 녹색이있을 것이다, 그리고, 우리는 않았다 + +874 +01:09:38,569 --> 01:09:40,760 + 우리는 기본적으로 이러한 특징 벡터했다 + +875 +01:09:40,760 --> 01:09:45,210 + 그들에게 필요한 무엇을 위해 이렇게 정말 그들에 컨텍스트를 선형 분류를 넣어 + +876 +01:09:45,210 --> 01:09:49,090 + 그 이전에 대부분 컴퓨터 비전 속에서 다음과 같이 우리는 일을하는지 + +877 +01:09:49,090 --> 01:09:52,840 + 약 2012 당신이 당신의 이미지를 촬영하게되며 기능의 단계를 + +878 +01:09:52,840 --> 01:09:57,409 + 추출 우리는 당신이에 대해 알아야 할 중요한 일이 무엇인지 결정 곳 + +879 +01:09:57,409 --> 01:10:01,859 + 이미지 서로 다른 주파수 다른 텐트와 우리는 어떤 결정 + +880 +01:10:01,859 --> 01:10:05,109 + 흥미로운 기능은 당신이 사람들이 10 개의 서로 다른 기능 유형처럼 걸릴 참조 + +881 +01:10:05,109 --> 01:10:09,369 + 모든 종이와 그냥 일어 났는데 그냥 당신이 하나의 거대한 두 배로 할 수 있습니다 히트 그것의 모든 필요 + +882 +01:10:09,369 --> 01:10:12,640 + 기능 이미지를 통해 벡터 및 당신은 그 위에 선형 분류를 넣어 + +883 +01:10:12,640 --> 01:10:15,920 + 우리는 지금 그것을보고처럼 그래서 당신은에 당신의 엉덩이 오후에 기차 판매를 재생 + +884 +01:10:15,920 --> 01:10:20,109 + 이러한 모든 기능 유형의 상단과 우리가 우리가 발견 한 이후로 교체하고 + +885 +01:10:20,109 --> 01:10:24,869 + 그건 당신이 원시 이미지로 시작하는만큼 더 잘 작동하고 전체의 생각 + +886 +01:10:24,869 --> 01:10:28,979 + 당신이 생각하는 어떤 것은 당신의 분리에의 일부를 설계하지 않는 + +887 +01:10:28,979 --> 01:10:33,479 + 많은 시뮬레이션 할 수 있습니다 우리가 건축을 마련 좋은 아이디어 나하지 + +888 +01:10:33,479 --> 01:10:38,189 + 모든 것이 하나의 기능 우리 때문에 다른 특징은 그렇게 말하고합니다 + +889 +01:10:38,189 --> 01:10:41,879 + 단지 우리가 실제로 훈련 할 수의 기능의 상단에 상단하려고하지 않는다 + +890 +01:10:41,880 --> 01:10:45,400 + 모든 방법 화소에 이르기까지 우리는 우리의 기능 추출기를 훈련 할 수있다 + +891 +01:10:45,399 --> 01:10:49,989 + 효과적으로 있도록 큰 혁신이었다 당신이 접근 방법이 문제는 우리는 + +892 +01:10:49,989 --> 01:10:53,300 + 반면 설계 요소를 많이 가지고하려고하는 제거 시도 + +893 +01:10:53,300 --> 01:10:56,779 + 우리가 완벽하게 일을 끌어 훈련을 할 수 있도록 하나의 주요 얼룩 + +894 +01:10:56,779 --> 01:11:01,550 + 역사적으로이 오는하고 무엇을 그 바위 텍사스에서 시작 + +895 +01:11:01,550 --> 01:11:06,760 + 우리는 무엇을하고있을 것입니다 그래서 다음 마지막이에 구체적으로보고됩니다 + +896 +01:11:06,760 --> 01:11:10,520 + 우리의 문제는 분석 구배를 계산해야하고 그래서 우리는에 갈거야 + +897 +01:11:10,520 --> 01:11:14,860 + 분석 기울기를 계산하는 효율적인 방법은 역 전파 및 + +898 +01:11:14,859 --> 01:11:18,839 + 그래서 그 배경 그리고 당신은 잘 될거야 그리고 우리는거야 + +899 +01:11:18,840 --> 01:11:20,039 + 약간 작동 이동 + diff --git a/captions/Ko/Lecture4_ko.srt b/captions/Ko/Lecture4_ko.srt new file mode 100644 index 00000000..57959a1c --- /dev/null +++ b/captions/Ko/Lecture4_ko.srt @@ -0,0 +1,3936 @@ +1 +00:00:02,740 --> 00:00:07,000 + 확인 그래서 내가 어떤 관리자에 뛰어 보자 + +2 +00:00:07,000 --> 00:00:14,669 + 나는 그 과제를 호출 할 수 있도록 먼저 가서 한 다음 주 수요일 때문이다 + +3 +00:00:14,669 --> 00:00:19,050 + 그래하지만 백 오십 시간 왼쪽보다 거기 때문에 우리를 사용하는 + +4 +00:00:19,050 --> 00:00:23,320 + 일반적인 운명의 감각과 그가있을거야 그 시간의 세 번째 기억 + +5 +00:00:23,320 --> 00:00:29,278 + 의식이 그래서 당신은 많은 시간이 정말로 실행중인 것을 가지고 있고하지 않습니다 + +6 +00:00:29,278 --> 00:00:31,768 + 당신은 당신이 그래서 늦은 하루 일을 생각할 수 있습니다 알고 있지만 이러한 이미지를 얻을 + +7 +00:00:31,768 --> 00:00:38,640 + 시간이 지남에 열심히 그래서 당신은 그들을보고 싶어하고 그래서 그렇게 가능성이 지금 시작 + +8 +00:00:38,640 --> 00:00:43,109 + 월요일에는 근무 시간 또는 그런 아무것도 없다 내가 사무실을 만들 개최합니다 + +9 +00:00:43,109 --> 00:00:45,839 + 수요일에 시간이 나는 너희들이에 대해 나에게 이야기 할 수 있도록하려면 때문에 + +10 +00:00:45,840 --> 00:00:49,260 + 그래서 월요일부터 내 근무 시간 이동됩니다에 특별 프로젝트 등 + +11 +00:00:49,259 --> 00:00:52,820 + 수요일은 일반적으로 내 사무실은 오후 6시 시작했다 대신 나는 5시에를해야합니다 + +12 +00:00:52,820 --> 
00:00:59,909 + 오후 일반적으로는 게이트 (260)를 생각하지만 지금은 39-1 그들 모두와 약혼 할 예 + +13 +00:00:59,909 --> 00:01:03,429 + 또한 당신이오고있어 중간에 대해 공부하려고 할 때주의해야 + +14 +00:01:03,429 --> 00:01:04,170 + 몇 주 + +15 +00:01:04,170 --> 00:01:07,109 + 당신이 정말로의 일부뿐만 아니라 강의 노트를 통해 이동해야합니다 + +16 +00:01:07,109 --> 00:01:09,819 + 이 클래스와 선택의 종류와 내가 가장 생각하는 것들 중 몇 가지를 선택 + +17 +00:01:09,819 --> 00:01:13,579 + 강의를 제공하는 귀중한하지만보다 재료의 꽤가있다 + +18 +00:01:13,579 --> 00:01:16,548 + 내가의 일부를오고 있어요에도 중기에 팝업을 생각하고 조심하는 + +19 +00:01:16,549 --> 00:01:19,610 + 그 강의를 통해 URI보다 일반적으로 더 큰 가장 중요한 물건 + +20 +00:01:19,609 --> 00:01:25,618 + 여배우에 자신의 무료 노트 등 재료의 재료가 될 수 + +21 +00:01:25,618 --> 00:01:32,269 + 강의 양쪽에서 그려진 그 확인하여 모든 우리가 갈거야, 상기 한 + +22 +00:01:32,269 --> 00:01:36,769 + 재료에 뛰어 그래서 우리는 단지 우리가 미리 알림으로 바로 지금 어디에 + +23 +00:01:36,769 --> 00:01:39,989 + 이 핵심 기능은 우리가 같은 SP의 손실로 여러 손실 함수를 보았다 + +24 +00:01:39,989 --> 00:01:44,359 + 기능 마지막으로 우리는 당신이 어떤을 위해 달성하는 것이 손실 전체를 보면 + +25 +00:01:44,359 --> 00:01:49,379 + 특정 훈련 데이터에에 가중치의 세트로 구성이 손실 + +26 +00:01:49,379 --> 00:01:53,509 + 두 가지 구성 요소 우리가 무엇을 원하는 정말이 데이터 손실 및 손실 맞아 및 + +27 +00:01:53,509 --> 00:01:57,200 + 우리가 지금에 대한 손실의 그라데이션 표현을하고 싶은 것입니다 + +28 +00:01:57,200 --> 00:02:01,118 + 무게는 그리고 우리는 우리가 실제로 최적화를 수행 할 수 있도록이 작업을 수행 할 수 + +29 +00:02:01,118 --> 00:02:07,069 + 우리가 선두에 반복 우리가 반대 의견에서하고있는 공정 최적화 과정 + +30 +00:02:07,069 --> 00:02:11,030 + ㄱ 주 업데이트 중에 무게에 그라데이션과 그냥이 반복 + +31 +00:02:11,030 --> 00:02:14,259 + 또 다시 이렇게 수렴 된 그 + +32 +00:02:14,259 --> 00:02:17,929 + 저 그 손실 함수의 점과 우리가 손실에 도착했을 때의 + +33 +00:02:17,930 --> 00:02:20,799 + 이 측면에서 우리의 훈련 데이터에 대한 좋은 예측을하는 것과 + +34 +00:02:20,799 --> 00:02:25,030 + 지금 나오는 과정은 우리는 또한 수 있습니다 너무 종류의 폐기물이 평가 보았다 + +35 +00:02:25,030 --> 00:02:29,019 + 그라데이션이 미국의 기울기 그리고이 작성하는 매우 간단하지만, 그것의 + +36 +00:02:29,019 --> 00:02:32,840 + 극단적으로 평가 느리게하고있는이있는 비가 구배가있다 + +37 +00:02:32,840 --> 00:02:36,658 + 수학을 사용하여 얻을이 강의에 그에 갈 것 꽤 + +38 +00:02:36,658 --> 00:02:41,318 + 좀 더하고 그래서 큰하지만 그렇지 당신이 잘못 얻을 수있어 어떤 빠르고 정확한이다 + +39 +00:02:41,318 --> 00:02:45,969 + 때로는 그래서 우리는 항상 이미 검사에서 다음 주에 우리는 모든 쓰기 위치 + +40 +00:02:45,969 --> 00:02:48,639 + 표현은 분석 그라디언트를 완료 한 후 우리는 다시 확인의 + +41 +00:02:48,639 --> 00:02:51,828 + 수치 그라데이션 정확성 그리고 나는 당신이 볼 거라면 확실하지 않다 + +42 +00:02:51,829 --> 00:02:59,250 + 당신은 당신이 할 수있는 지금이 확실히 확인 할당 볼 거라고 + +43 +00:02:59,250 --> 00:03:04,378 + 이 설정을 볼 때 유혹 우리는 단지의 기울기를 구동 할 + +44 +00:03:04,378 --> 00:03:08,459 + 다시 당신은 단지에 유혹 될 수있는 가중치를 바로 알고에 손실 함수 + +45 +00:03:08,459 --> 00:03:11,709 + 전체 손실 밖으로 당신이 당신의 미적분을 본 바로 그라데이션을 시작 + +46 +00:03:11,709 --> 00:03:16,120 + 내가하고 싶습니다 클래스하지만 당신이 측면에서이 훨씬 더 생각해야한다는 것입니다 + +47 +00:03:16,120 --> 00:03:22,480 + 대신 하나의 거대한 표현의 단지 복용 생각의 계산 잔디의 + +48 +00:03:22,479 --> 00:03:25,369 + 당신은 펜과 용지를위한 표현에 만족 구동 할 거라고 + +49 +00:03:25,370 --> 00:03:27,549 + 기울기 및 그 이유 + +50 +00:03:27,549 --> 00:03:31,689 + 그래서 여기에 우리는 흐르는 이러한 값에 대한 흐름을 생각하고 + +51 +00:03:31,689 --> 00:03:35,509 + 원을 따라 이러한 작업 주위에 경쟁과 그들에 전달 + +52 +00:03:35,509 --> 00:03:38,979 + 모든 방법으로 입력을 변환 기본적으로 기능 조각 + +53 +00:03:38,979 --> 00:03:43,018 + 마지막에 손실 함수는 그래서 우리는 우리의 데이터와 우리의 매개 변수로로 시작 + +54 +00:03:43,019 --> 00:03:46,079 + 입력 그들은 단지 모든 인이 경쟁 그래프를 통해 공급 + +55 +00:03:46,079 --> 00:03:49,790 + 길을 따라와 말의 기능이 시리즈는 우리는 하나의 번호를 + +56 +00:03:49,789 --> 00:03:53,590 + 손실 난이 방법은 그것에 대해 생각하고자하는 이유는있다 + +57 +00:03:53,590 --> 00:03:57,069 + 이러한 표현은 지금 매우 작은 보면 당신은 할 수있을 수 있음 + +58 +00:03:57,068 --> 00:04:00,339 + 이러한 고충을 유도하지만, 이러한 표현은 경쟁 잔디되어 있습니다 + +59 +00:04:00,340 --> 00:04:04,250 + 매우 큰 얻을에 대한 그래서 예를 들어 길쌈 신경 네트워크는 것 + +60 +00:04:04,250 --> 00:04:08,829 + 수백 어쩌면 우리 모두가 이러한 이미지를해야하므로 작업의 
수십입니다 + +61 +00:04:08,829 --> 00:04:12,939 + 우리의 손실을 얻기 위해 큰 계산 그래프처럼-을 통해 흐름 등이된다 + +62 +00:04:12,939 --> 00:04:16,858 + 바로 이러한 표현을 쓸 비현실적 및 상업 네트워크는 + +63 +00:04:16,858 --> 00:04:19,370 + 심지어 당신은 실제로 예를 들어 수행에 최악 시작되지 일단 + +64 +00:04:19,370 --> 00:04:23,509 + 마음 어디에서 용지 인 대체 광택이라는 것을 + +65 +00:04:23,509 --> 00:04:26,329 + 이것은 기본적으로 미분 튜링 기계 + +66 +00:04:26,329 --> 00:04:30,128 + 그래서 전부는 컴퓨터가하는 모든 절차 미분이다 + +67 +00:04:30,129 --> 00:04:33,590 + 테이프에 수행이 원활하게 이루어지고 미분 컴퓨터는 기본적으로 + +68 +00:04:33,589 --> 00:04:39,519 + 및 경쟁 그래픽이이이 명중되지 거대하고뿐만 아니라 + +69 +00:04:39,519 --> 00:04:42,478 + 당신이 일을 끝낼 우리가 재발 신경망과에가는거야 무엇 때문에 + +70 +00:04:42,478 --> 00:04:45,848 + 비트하지만 당신은 결국 일을 당신이이 그래프 그렇게 생각 제어 끝입니다 + +71 +00:04:45,848 --> 00:04:51,658 + 이 그래프는 시간 단계의 수백을 복사 그래서 당신은이 거대한 끝낼 + +72 +00:04:51,658 --> 00:04:56,379 + 수천 개의 노드와 약간의 계산 단위의 수백의 몬스터와 + +73 +00:04:56,379 --> 00:04:59,819 + 그래서 당신이 알고 쓰는 데 신경 튜링에 대한 손실 불가능이다 + +74 +00:04:59,819 --> 00:05:03,650 + 기계 그것은 페이지의 수십억 우리과 같이 걸릴 것 단지 불가능 + +75 +00:05:03,649 --> 00:05:07,068 + 더 구조의 관점 너무 작은 기능에 이것에 대해 생각해야 + +76 +00:05:07,069 --> 00:05:11,710 + 우리는거야 그래서 중간 변수를 변환하는 것은 바로 맨 끝에 분실하기 + +77 +00:05:11,709 --> 00:05:14,318 + 경쟁 그래프에서 구체적으로보고 할 우리는 어떻게 유도 할 수있다 + +78 +00:05:14,319 --> 00:05:20,560 + 맨 끝에 손실 함수에 대한 입력에 기울기 때문에 + +79 +00:05:20,560 --> 00:05:25,569 + 시작 - 무슨 간단하고 구체적인 아주 작은 경쟁 그래프를 우리는 세 가지가 + +80 +00:05:25,569 --> 00:05:29,778 + 이 그래프 XY 및 Z에 대한 입력으로 스칼라 그들은 약이 특정에 걸릴 + +81 +00:05:29,778 --> 00:05:35,069 + 94의 마이너스 25의 예에서 이러한 우리는이 매우 작은 그래픽이 + +82 +00:05:35,069 --> 00:05:38,669 + 또는 당신이 날이 상호 교환 안녕을 참조 듣게 회로에 대한 그래프가 + +83 +00:05:38,668 --> 00:05:43,038 + 회로는 그래서 우리는 마지막에 음을 우리에게 이것을 제공하는 것이이 그래프를 + +84 +00:05:43,038 --> 00:05:47,288 + 12 그래서 여기에 확인 내가 무슨 짓을했는지하는 동안 깊은 리필 모습은를 호출까지입니다입니다 + +85 +00:05:47,288 --> 00:05:51,120 + 내가 입력을 설정 한 다음 나는 복장을 계산이 그래프의 전진 패스 + +86 +00:05:51,120 --> 00:05:56,288 + 그리고 나는 우리가에 식의 기울기를 구동하고 싶은대로 할 싶습니다 + +87 +00:05:56,288 --> 00:06:01,250 + 입력과 우리가 그 무엇을 할 거 야이 중간 변수를 도입 + +88 +00:06:01,250 --> 00:06:07,050 + 내가 그들에게 참조로 플러스 게이트 및 시간 게이트가 그래서 플러스 게이트 큐와 + +89 +00:06:07,050 --> 00:06:10,800 + 따라서이 컴퓨팅이 옷 큐를 얻고 수키는이 중간이었다합니다 + +90 +00:06:10,800 --> 00:06:14,788 + 내가 작성한 것을 다음 X 플러스 Y와의 결과는 f를 qnz의 곱셈이다 + +91 +00:06:14,788 --> 00:06:19,360 + 여기에서 우리가 원하는 것은 그라디언트 파생 뻣뻣한 생각입니다 경우 I + +92 +00:06:19,360 --> 00:06:25,598 + 내 원하는받을 수 있나요 나는 중간 로그인하시기 바랍니다 그라데이션을 작성했습니다 + +93 +00:06:25,598 --> 00:06:30,120 + 우리가 수행 한 개별적으로 지금이 두 표현 모두를위한 + +94 +00:06:30,120 --> 00:06:33,490 + 에서가 클래스는 왼쪽에서 오른쪽으로 지금 무엇을 할 것 인 것은 역을 유도합니다 + +95 +00:06:33,490 --> 00:06:35,699 + 패스 뒤쪽에서 이동합니다 + +96 +00:06:35,699 --> 00:06:39,300 + 앞으로 우리의 회로까지 모든 중간체의 그라디언트 경쟁 + +97 +00:06:39,300 --> 00:06:43,509 + 맨 끝에 우리는 그라디언트 입력에 그리고 우리 그것을 구축하는거야 + +98 +00:06:43,509 --> 00:06:47,680 + 맨 오른쪽이 재귀의 기본 케이스의 일종으로 시작 + +99 +00:06:47,680 --> 00:06:52,670 + 절차 우리는 각각의 기울기를 고려하고 그래서 이것은 단지입니다 + +100 +00:06:52,670 --> 00:06:56,020 + 식별 기능은 그래서 그것의 파생 무엇인가 + +101 +00:06:56,019 --> 00:07:06,240 + 그것은 정체성 바로 그래서 하나의 아이디어를 매핑 ID는 하나의 기울기가 + +102 +00:07:06,240 --> 00:07:10,329 + 그래서 우리가 하나를 시작하고 지금 우리가 갈거야 우리의 기본 사건 + +103 +00:07:10,329 --> 00:07:18,519 + 존경 그 너무 거꾸로이 그래프를 통해 우리는 그라데이션 할 + +104 +00:07:18,519 --> 00:07:27,089 + 이 경쟁 그래프에서 확인이 그래서 우리는 권리를 작성하지 않은 있다는 것입니다 + +105 +00:07:27,089 --> 00:07:32,879 + 여기에 무엇이 특정 예제의 핵심은 세 가지 바로 그라데이션이되도록있어 + +106 +00:07:32,879 --> 00:07:36,279 + 그에이에 따라 내가 바로 재료가 될거야 불과 3이 될 것이다 + +107 +00:07:36,279 --> 00:07:42,309 + 빨간색 선과 값 아래의 라인에 대한 녹색에 + +108 +00:07:42,310 --> 00:07:48,420 + 전면의 그라데이션은 하나가 아닌 그라데이션 발병 텔링으로 33 + 
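Here is the whole worked example as straight-line Python, with the slide's values; each line of the backward pass is one application of the chain rule at a gate:

```python
# The tiny circuit from the slide: f(x, y, z) = (x + y) * z,
# with the example inputs x = -2, y = 5, z = -4.
x, y, z = -2.0, 5.0, -4.0

# forward pass
q = x + y                # q = 3, the intermediate value
f = q * z                # f = -12, the circuit output

# backward pass, in exactly the reverse order
df = 1.0                 # base case: gradient of f with respect to itself
dq = z * df              # multiply gate: df/dq = z = -4
dz = q * df              # multiply gate: df/dz = q = 3
dx = 1.0 * dq            # plus gate passes the gradient through: df/dx = -4
dy = 1.0 * dq            # df/dy = -4
```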
+109 +00:07:48,420 --> 00:07:52,009 + 당신은 정말 직관적으로 그라데이션의 해석은 무엇을 염두에 두어야 + +110 +00:07:52,009 --> 00:07:58,459 + 즉 말하는 최종 값에 죽은의 영향이 긍정적이라고하고 + +111 +00:07:58,459 --> 00:08:02,859 + 세 코스의 종류와 그래서 소량의 팔에 의해 Z를 증가하는 경우 + +112 +00:08:02,860 --> 00:08:07,759 + 그 회로의 출력은 a를 증가 때문에 반응 + +113 +00:08:07,759 --> 00:08:13,009 + 긍정적 세 긍정적 발생합니다 세 이렇게 작은 변화로 증가 + +114 +00:08:13,009 --> 00:08:21,560 + 궁극적 인 변화는 이제이 경우 큐에 따라 기울기가 너무 신격화한다 + +115 +00:08:21,560 --> 00:08:30,860 + IQ는 그 어떤 것을 우리는 그 부분에 대한 음의 기울기를 얻을 수 있도록하기 전에 + +116 +00:08:30,860 --> 00:08:34,599 + 그 말과 회로 그리고 그가 인 경우 출력을 증가시키는 것입니다 + +117 +00:08:34,599 --> 00:08:39,740 + 회로가 좋아 감소 당신은 H 증가하는 경우가 회로까지 일 + +118 +00:08:39,740 --> 00:08:44,789 + 기울기의 네 나이가 감소하는 것은 지금 우리가 가고있는 확인에 대한 부정적 + +119 +00:08:44,789 --> 00:08:48,480 + 이 플러스 게이트를 통해이 과정을 계속이 일을 얻을 수있는 곳입니다하기 + +120 +00:08:48,480 --> 00:08:49,039 + 약간 + +121 +00:08:49,039 --> 00:08:54,328 + 나는 우리가 Y에 대한 이유에에 계약을 계산하고 싶습니다 가정 + +122 +00:08:54,328 --> 00:09:10,208 + 등 그라데이션 왜이 특정 그래프이 될 것이다 것 + +123 +00:09:10,208 --> 00:09:23,979 + 어느 쪽이든 내가 이것에 대해 생각하고 싶습니다 그것에 유리 학습 가능한 확인을 적용하는 것입니다 + +124 +00:09:23,980 --> 00:09:27,709 + 그래서 체인 규칙은 모든 사람의 기울기를 직접 할 것인지 말한다 + +125 +00:09:27,708 --> 00:09:33,208 + 왜 그때 내가 바로 그래서 우리 연방 수사 국 (FBI)의 DQ 시간에 큐브 이상적인 동일입니다 + +126 +00:09:33,208 --> 00:09:36,438 + 우리가 알고 이유 일 수 있습니다 특정 IQ에서 그 표현을 모두 계산 + +127 +00:09:36,438 --> 00:09:42,519 + 음 정도 그 쿠폰의 영향의 효과가있어입니다 DFID의 Q입니다 + +128 +00:09:42,519 --> 00:09:46,619 + 부정적인 지금 우리가 지방을 알고는 로컬 영향을 알고 싶습니다 + +129 +00:09:46,619 --> 00:09:52,449 + 왜 Q에 빛의 로컬 영향 쿠바에 그의 것은 지역 주민을의 하나입니다 + +130 +00:09:52,448 --> 00:09:58,969 + 전립선에 대한 Y의 로컬 유도체 등 일반으로 참조 + +131 +00:09:58,970 --> 00:10:02,019 + 정확한 것은이 두 그라디언트에게 지역을 변경 할 것을 우리에게 알려줍니다 + +132 +00:10:02,019 --> 00:10:06,139 + 그라데이션 끔찍한 왜 당신과의 Q의 글로벌 그라데이션의 종류하지 + +133 +00:10:06,139 --> 00:10:10,948 + 우리가 네 번 만든거야 그래서 회로의 업데이트를 곱하는 것입니다 + +134 +00:10:10,948 --> 00:10:14,588 + 그래서 그녀를 다시 전파의 요점 이런 종류의이 매우이다 작동 + +135 +00:10:14,589 --> 00:10:18,209 + 중요한 우리는 우리가 계속 적어도 두 가지를 한 것으로 여기에 이​​해하기 + +136 +00:10:18,208 --> 00:10:24,289 + 우리가 일반적으로 수행하는 경우를 통해 곱 우리는 X 플러스 Y와를 계산 한 + +137 +00:10:24,289 --> 00:10:29,379 + 그 하나의 표현에 대한 미분 X & Y는 하나 하나 그렇게 계속하다 + +138 +00:10:29,379 --> 00:10:32,749 + 말하는 그라디언트의 마음의 해석에 X & Y가있을 것입니다 + +139 +00:10:32,749 --> 00:10:38,509 + (10)의 기울기 H X가 증가함에 따라 큐에 긍정적 인 영향 + +140 +00:10:38,509 --> 00:10:44,548 + H에 의해 큐 증가하고 결국 같은처럼 우리는 빛의 영향을하고 싶습니다 + +141 +00:10:44,548 --> 00:10:49,980 + 최종 밖으로하지만, 회로 등 길에이를 최대 작업은 걸릴 것입니다 + +142 +00:10:49,980 --> 00:10:53,480 + 의 영향을하고 우리는 최종 손실에 대한 Q의 영향을 알고 왜 + +143 +00:10:53,480 --> 00:10:57,058 + 인 우리가 반복적으로이 그래프를 통해 여기 컴퓨팅 무엇을하고 + +144 +00:10:57,058 --> 00:11:00,350 + 할 수있는 올바른 것은 우리가 (10)의 별명으로 끝낼 수 있도록를 곱하는 것입니다 + +145 +00:11:00,350 --> 00:11:05,189 + 15 음과 그래서 이것은 밖으로 작동 방식은 기본적으로 이것이 무엇인가 + +146 +00:11:05,188 --> 00:11:08,649 + 속담 최종 출력 회로에 대한 이유의 영향을 부정하는 것이거나 + +147 +00:11:08,649 --> 00:11:14,649 + 왜 부정적인 네 배 앨범 회로를 감소시켜야 증가 + +148 +00:11:14,649 --> 00:11:18,230 + 가 왜 법 당신이 만든 변화와 운동을 끝낼 방법입니다 + +149 +00:11:18,230 --> 00:11:21,810 + 약간 비스듬히 증가 이유를 증가 Cuse에 긍정적 인 영향 + +150 +00:11:21,809 --> 00:11:27,959 + 어떤 체인의 규칙이 종류의 우리에게주는되도록 가능성이 회로 감소 + +151 +00:11:27,960 --> 00:11:29,120 + 일치 + +152 +00:11:29,120 --> 00:11:45,259 + 우리가이 많은 많은 많은 연결을 볼 수 있습니다이 당신에받을거야 및 + +153 +00:11:45,259 --> 00:11:48,889 + 모든 클래스의 말에 당신이 점을 드릴 당신이 그것을 이해는하지 않습니다 + +154 +00:11:48,889 --> 00:11:51,870 + 우리가 실제로이 편지를 완료하면 어디서나 어떤 상징적 인 표현이 + +155 +00:11:51,870 --> 00:11:54,639 + 이 구현 당신은이 이후에 그것의 구현을 볼 수 있습니다 + +156 +00:11:54,639 --> 
00:11:57,009 + 항상있을 것입니다 이것은 단지 숫자 요인 + +157 +00:11:57,009 --> 00:12:02,230 + 로버트 숫자 확인 및 X보고 우리가 일을 어떻게하는 아주 똑똑한를하다 + +158 +00:12:02,230 --> 00:12:05,889 + 우리는 우리의 최종 목표이다 그 IDX 궁금 발생하지만 우리는 결합해야 + +159 +00:12:05,889 --> 00:12:09,799 + 그것은 우리가 접근이 무엇인지 무엇 예전 친구를 알고 당신을 듣고 당신에게 같은 장소를 물어 + +160 +00:12:09,799 --> 00:12:13,979 + 체인이 그렇게을 성장 될 수있을 테니까요 때문에 회로의 끝에서 + +161 +00:12:13,980 --> 00:12:19,240 + 음 네 번 당신이이 일반화 작동하는 방식 때문에 하나의 확인을주고 싶어 + +162 +00:12:19,240 --> 00:12:23,289 + 당신이 게이트입니다 다음과 같이이 예제와 방법에서 비트는이에 대해 생각하는 + +163 +00:12:23,289 --> 00:12:28,429 + 회로에 삽입이 매우 큰 계산 그래프 또는 회로이며 + +164 +00:12:28,429 --> 00:12:32,250 + 당신은 어떤 특정 번호 X & Y가 와서 몇 가지 템플릿을 수신하고, + +165 +00:12:32,250 --> 00:12:39,059 + 그들에 대한 몇 가지 작업을 수행하고 좋은 세트 Z를 계산하고 지금이 + +166 +00:12:39,059 --> 00:12:43,019 + 잡지는 경쟁 잔디로 전환 무언가가 일어나는하지만 당신은 그냥있어 + +167 +00:12:43,019 --> 00:12:46,169 + 너무 큰 회로에서 놀고 당신은이 아니라 무슨 확실하지 않다 + +168 +00:12:46,169 --> 00:12:50,939 + 우린 다음 회로의 끝은 손실을 계산하고 그 전진 패스 그리고 + +169 +00:12:50,940 --> 00:12:56,250 + 거꾸로 역순으로 반복적으로 진행하지만, 실제로 전 + +170 +00:12:56,250 --> 00:13:01,120 + 나는 X & Y 내가하고 싶은 일이 있다는 지적에 도착하면 바로 그 부분에 도착 + +171 +00:13:01,120 --> 00:13:05,279 + 전진 패스 동안이 게이트 있다면 당신은 당신의 값 X & Y 당신에 도착 + +172 +00:13:05,279 --> 00:13:08,500 + 컴퓨터 출력과 상기 다른 일이있다 할 수 있습니다 컴퓨터에 바로와 + +173 +00:13:08,500 --> 00:13:10,230 + 그 지역 그라디언트입니다 + +174 +00:13:10,230 --> 00:13:14,789 + X & Y 그래서 바로 그냥 게이트이기 때문에 사람들을 계산할 수 있습니다 내가 알고있는 + +175 +00:13:14,789 --> 00:13:18,009 + 내가 좋아하는 수행하고있어 추가 응용 프로그램은 내가 영향을 알고 말을 그 + +176 +00:13:18,009 --> 00:13:24,259 + X & Y 그래서 지금 당장하지만 그 사람들을 계산할 수 있습니다 내 밖으로 몸을 이겼다 + +177 +00:13:24,259 --> 00:13:25,389 + 무슨 일이야 + +178 +00:13:25,389 --> 00:13:29,769 + 끝 부분에서 계산 된 소송은 다른 결국 배울 뒤로 갈 수 있도록 + +179 +00:13:29,769 --> 00:13:32,499 + 내 영향에 무엇인가에 대한 + +180 +00:13:32,499 --> 00:13:37,839 + 회로의 최종 출력 DL은 이들에 의해 손쉽게 배울 수있는 손실 자신의 + +181 +00:13:37,839 --> 00:13:41,419 + 성분은 내게로 흘러 제가해야 할 일은 내가 그 변경해야 할 것입니다 것입니다 + +182 +00:13:41,418 --> 00:13:45,278 + 나는를 변경할 수 있는지 확인해야합니다 있도록이 재귀 경우를 통해 그라데이션 + +183 +00:13:45,278 --> 00:13:48,778 + 내 작업을 통해 그라데이션을 수행하고 정확한 것은 밝혀 + +184 +00:13:48,778 --> 00:13:52,068 + 여기에서이 말하는 정말 무슨 트라마돌을 구입하는 것은이다 할 올바른 일이 + +185 +00:13:52,068 --> 00:13:56,068 + 그라데이션없이 해당 지역의 그라데이션을 곱 것을 실제로 당신에게 제공하는 + +186 +00:13:56,068 --> 00:13:57,838 + DL IDX + +187 +00:13:57,839 --> 00:14:02,739 + 회로의 최종 출력에 X 오프 직원 그래서 정말 체인 규칙은 그냥 + +188 +00:14:02,739 --> 00:14:08,229 + 우리는이 글로벌 그라데이션이라고 무엇을 가지고이 추가 곱셈 + +189 +00:14:08,229 --> 00:14:12,669 + 의상에 게이트와 우리는 같은에서 로컬 그라데이션을 변경했습니다 + +190 +00:14:12,668 --> 00:14:18,509 + 그것은 그 사람 그라디언트의 단지 곱셈 그래서 것은 잠시 동안 간다 + +191 +00:14:18,509 --> 00:14:22,889 + 해당 지역의 그라데이션으로 당신은 게이트있어 다음 기억한다면 이들 X 년대와 Y의 + +192 +00:14:22,889 --> 00:14:27,229 + 당신이 저주로 끝날 바로 그래서 다른 상태에서이오고있다 + +193 +00:14:27,229 --> 00:14:31,899 + 추가 터키어 등이 게이트 전체 컵을 통해이 과정 + +194 +00:14:31,899 --> 00:14:36,808 + 다만 기본적 그들이 그렇게 마지막 손실에 서로 영향을 통신 + +195 +00:14:36,808 --> 00:14:39,688 + 이것은 당신이 긍정적있어 의미 긍정적 인 그라데이션 경우 서로 확인 말해 + +196 +00:14:39,688 --> 00:14:43,198 + 부정적인 부정적인 그라데이션 부정적인 영향의 손실에 영향을 미치는 + +197 +00:14:43,198 --> 00:14:46,788 + 손실에 영향을 미치는 그는 단지 거의 이들에 의해 회로를 통해 적용됩니다 + +198 +00:14:46,788 --> 00:14:51,019 + 지역 그라디언트와 함께 결국이 과정은 전파 다시 호출 + +199 +00:14:51,019 --> 00:14:54,489 + 그것은 연쇄 규칙 재귀 적용하여 계산하는 방법이다 + +200 +00:14:54,489 --> 00:14:58,399 + 경쟁을 통해 하나 하나 중간 값의 영향에 잡아 + +201 +00:14:58,399 --> 00:15:02,158 + 최종 손실 함수 등이 그래프는이 많은 예제를 볼 수 있습니다 + +202 +00:15:02,158 --> 00:15:06,918 + 트럭 내가 약간이 구체적인 예에​​ 갈거야 그녀처럼 + +203 +00:15:06,918 --> 00:15:11,298 + 더 큰 우리가 구체적으로 그것을 통해 
작동합니다하지만 난에 자신의 질문을 해달라고 + +204 +00:15:11,298 --> 00:15:20,389 + 내가 좋아하는 것,이 점은 내가 당신에게 돌아올거야 앞서 물어 + +205 +00:15:20,389 --> 00:15:25,538 + Z가 사용되는 경우 있도록 등급에게 그라디언트에게인지 아담를 추가 + +206 +00:15:25,538 --> 00:15:29,928 + 서커스의 여러 장소에서 다시 도로는 그 뜻을 추가합니다 폐쇄 + +207 +00:15:29,928 --> 00:15:31,539 + 그 시점에 돌아온다 + +208 +00:15:31,539 --> 00:16:03,139 + 같은 우리가 그 문제의 전부를받을거야 그리고 우리는거야 당신이있어 나중에 참조 + +209 +00:16:03,139 --> 00:16:05,769 + 거야 우리가 그라데이션 문제를 추방 호출 것을 얻을 + +210 +00:16:05,769 --> 00:16:10,669 + 우리의이보다 구체적인 그렇게하기 위해 또 다른 예를 통해 풀어 볼 수 있습니다 + +211 +00:16:10,669 --> 00:16:14,318 + 여기에 우리가 그런 일이 다른 회로는 작은 두 개의 차원을 계산해야 할 + +212 +00:16:14,318 --> 00:16:18,179 + 이란에서하지만 지금은 그냥이 생각하는 그 해석에 대해 걱정하지 마십시오 + +213 +00:16:18,179 --> 00:16:22,849 + 그 표현 때문에 하나를 통해 하나의 플러스 키의 어떤 숫자의로 + +214 +00:16:22,850 --> 00:16:29,000 + 여기에 입력 앤드류 기능에 의해 그리고 우리는 저기 내가 단일 출력이 + +215 +00:16:29,000 --> 00:16:32,490 + 초안 형태 때문에이 대회에 수식 것을 번역 + +216 +00:16:32,490 --> 00:16:35,769 + 그래서 사람이 할 식으로 우리가 안에서부터 밖으로 재귀에있는 경쟁 + +217 +00:16:35,769 --> 00:16:42,129 + 모든 작은 W 시간에 액세스하고 우리는 그들 모두를 추가 한 다음 우리는을 + +218 +00:16:42,129 --> 00:16:46,129 + 그것의 부정적이고 우리는 기하 급수적으로 그들은했다 하나, 그리고, 우리 + +219 +00:16:46,129 --> 00:16:49,769 + 마지막으로 나누어 우리는 식의 결과를 얻을 그래서 우리가 할거야 + +220 +00:16:49,769 --> 00:16:52,409 + 지금 우리가 가고있는이 식을 통해 전파를 백업하는거야입니다 + +221 +00:16:52,409 --> 00:16:56,500 + 매 입력 값의 영향의 출력에 손쉽게 계산 + +222 +00:16:56,500 --> 00:17:07,230 + 여기 저하되어이 식 + +223 +00:17:07,230 --> 00:17:22,039 + 그래서 지금 미국은 플러스 전체 + 게이트 이진 그리고 우리는 플러스가 + +224 +00:17:22,039 --> 00:17:26,519 + 하나의 게이트 나는 그 자리에서이 문을 만들고있어 우리는 어떤이는 것을 볼 수 있습니다 + +225 +00:17:26,519 --> 00:17:31,519 + 게이트 또는 게이트가 당신에게 달려 가지이다 아니다를위한 그래서이 시점에 돌아온다 + +226 +00:17:31,519 --> 00:17:35,639 + 지금은 단지 우리가 우리가 전반에 걸쳐 그래서 사용하는 몇 가지 더 게이트가 좋아 + +227 +00:17:35,640 --> 00:17:38,650 + 난 그냥 우리가 이들 중 몇 가지 예를 통해 이동으로 쓰는 좋아 + +228 +00:17:38,650 --> 00:17:42,720 + 파생 상품 지수 그리고 우리는 모든 작은 지역의 게이트에 대해 알고있는이 + +229 +00:17:42,720 --> 00:17:49,048 + 지역 그라디언트를 잘 그래서 우리가 할 수있는되는 미적분을 사용하여 추가 세금이 너무과 + +230 +00:17:49,048 --> 00:17:52,900 + 그래서 이러한 있도록 모든 작업과 덧셈과 곱셈이다 + +231 +00:17:52,900 --> 00:17:56,040 + 나는 당신이 어떤 위대한 측면에서 기억했다고 믿고있어하는 + +232 +00:17:56,039 --> 00:17:58,970 + 그들은 회로의 끝에서 시작하는거야 같은 것들을 모양과 나는했습니다 + +233 +00:17:58,970 --> 00:18:03,450 + 이미 뒷면에 원 포인트 제로 제로 채워 그건 어떻게 항상 있기 때문에 + +234 +00:18:03,450 --> 00:18:04,860 + 이 재귀를 시작 + +235 +00:18:04,859 --> 00:18:10,519 + 1110 오른쪽하지만 그 신원 기능에 그라데이션 지금 이후 우리는거야 + +236 +00:18:10,519 --> 00:18:17,849 + 하나의 상대적 그래서 확인 X 작업을 통해이 일을 통해 전파를 백업하기 + +237 +00:18:17,849 --> 00:18:22,048 + 로컬 그라데이션이 X를 통해 음의 하나가 렉스의 그래서 아무도 제곱되지 난파 + +238 +00:18:22,048 --> 00:18:27,119 + 게이트는 앞으로 통과하는 동안 입력 1.37을 받고 즉시 중 하나가 + +239 +00:18:27,119 --> 00:18:30,759 + 그녀의 전 케이트 계산 한 수 로컬 변형은 지역 그라디언트 것이었다 + +240 +00:18:30,759 --> 00:18:35,048 + X를 통해 음의 하나는 제곱과 전파를 다시 주문 및 트라마돌을 구입한다 + +241 +00:18:35,048 --> 00:18:40,750 + 의 마지막에의 경사가 그 로컬 기울기를 곱할 + +242 +00:18:40,750 --> 00:18:44,789 + 쉽게 회로는 그렇게 무엇 인 끝을 될 일이 있기 때문에 + +243 +00:18:44,789 --> 00:18:51,349 + 뒷면에 대한 표현은 내 전 케이트 중 하나를 여기에 읽기 전파 + +244 +00:18:51,349 --> 00:18:59,829 + 하지만 그녀는 항상 두 가지 지역 그라데이션 배에서 또는에서 기울기가 + +245 +00:18:59,829 --> 00:19:18,069 + 이는 그 로컬 구배가되도록 기울기 DFID X입니다 + +246 +00:19:18,069 --> 00:19:23,480 + 3.7 이상 하나의 제곱 한 다음 하나 포인트 0으로 곱한 제공 + +247 +00:19:23,480 --> 00:19:27,940 + 있는 분해하는 것은 정말 우리가 시작했기 때문에 하나 때문에 적용입니다 + +248 +00:19:27,940 --> 00:19:34,850 + 일반적으로는 바로 여기에 다른 하나는 구배에있어 그 01534 음 + +249 +00:19:34,849 --> 00:19:38,798 + 이 계곡은 확인 불고 된 와이어의 조각은 그래서 음이 + +250 +00:19:38,798 --> 00:19:43,889 + 당신에게 당신이 있다면 바로 때문에 
기대할 수있는 복장에 효과 + +251 +00:19:43,890 --> 00:19:47,850 + 이 값이 증가하고 그 후 X 위에 하나의 게이트를 통과 + +252 +00:19:47,849 --> 00:19:50,939 + 그 이유는 당신이 부정적인보고있는, 그래서 렉스의 증가 금액은 작아 + +253 +00:19:50,940 --> 00:19:55,620 + 그라데이션 속도는 우리는 다음 게이트 여기에 전파를 다시 계속거야 + +254 +00:19:55,619 --> 00:20:01,048 + 당신이 보면 회로에서 하나의 일정한 때문에 로컬 그라데이션을 추가하는 것 + +255 +00:20:01,048 --> 00:20:06,960 + 출구에 값으로 기울기를 일정을 추가하면 하나의 권리 + +256 +00:20:06,960 --> 00:20:13,169 + 우리에게 이야기하고 그래서 여기에 변화 구배 우리는 선을 따라 계속합니다 + +257 +00:20:13,169 --> 00:20:22,940 + 상기에서 그라데이션을 한 시간이 해당 지역의 그라데이션 될 것입니다 + +258 +00:20:22,940 --> 00:20:28,590 + 그냥 배운 게이트 부정적인 2013년 7월 23일가 함께 계속됩니다 + +259 +00:20:28,589 --> 00:20:34,709 + 방법 변경되지 않습니다 직관적 즉,이 값이 바로 때문에 의미가 있습니다 + +260 +00:20:34,710 --> 00:20:38,319 + 수레 그리고 마지막 회로에 어떤 영향을하고 있다면 당신이 있다면 + +261 +00:20:38,319 --> 00:20:42,798 + 그 영향력 후 최종쪽으로 기울기의 변화의 속도를 하나 추가 + +262 +00:20:42,798 --> 00:20:46,970 + 당신이 어떤 양만큼의 효과를이 증가하는 경우 값은 변경되지 않습니다 + +263 +00:20:46,970 --> 00:20:51,548 + 변화율이 1을 변경하지 않기 때문에 일단은 동일 할 것이다 + +264 +00:20:51,548 --> 00:20:56,859 + 게이는 일정한 장교의 기울기 때문에 여기에 혁신을 계속 + +265 +00:20:56,859 --> 00:21:01,599 + 우리가 수행하는거야 전파를 돌아올 수 있도록 도끼 도끼 + +266 +00:21:01,599 --> 00:21:05,000 + 음 하나의 게이트 입력 + +267 +00:21:05,000 --> 00:21:08,329 + 그것은 바로 로컬 그라데이션을 완료 할 수 지금은 것​​을 알고 + +268 +00:21:08,329 --> 00:21:12,259 + 위의 그라데이션이 세 가지 때문에 계속 역 전파에 의해 음의 포인트입니다 + +269 +00:21:12,259 --> 00:21:20,000 + 여기에 체인 규칙을 적용하는 것 난 수사학 질문을 받았다 + +270 +00:21:20,000 --> 00:21:25,119 + 확실하지만,하지만, 기본적으로 전이 전 인 부정적인 하나의 각을하지 + +271 +00:21:25,119 --> 00:21:30,569 + 권리 세에 의해 지점이 전문가에 입력 8 배 체인 규칙 + +272 +00:21:30,569 --> 00:21:35,269 + 그래서 우리는 자신을 곱 계속 이렇게 나에 미치는 영향은 무엇이고 나는 무슨이 + +273 +00:21:35,269 --> 00:21:39,069 + 그 회로의 최종 끝에 효과는 항상 우리가 곱되고있다 + +274 +00:21:39,069 --> 00:21:46,859 + 그래서 지금이 시점에서 마이너스 22를 얻을 우리는 부정적인 하나의 게이트에 시간이 그래서 뭐 + +275 +00:21:46,859 --> 00:21:50,279 + 그것이 나를집니다 당신이 할 때 그라데이션 일어나는 일이 끝납니다 + +276 +00:21:50,279 --> 00:21:57,139 + 우리는 기본적으로 일정 입력을 가지고 있기 때문에 바로 주위에 다 입술에 달성 + +277 +00:21:57,140 --> 00:22:02,038 + 그래서 음의 음 하나 하나 시간을 일정하게 일어난 어느 + +278 +00:22:02,038 --> 00:22:05,548 + 시간 그들은 전진 패스로 우리에게 부정적인 하나를 제공 해달라고 그래서 지금 우리에게있다 + +279 +00:22:05,548 --> 00:22:09,569 + 인 밥에서 인사말 로컬 그라데이션 시간을의하는 곱 + +280 +00:22:09,569 --> 00:22:14,879 + 미세 너무 그래서 우리는 지금 그냥 긍정적으로 끝낼 전파를 다시 계속 + +281 +00:22:14,880 --> 00:22:21,110 + 전파 +이 플러스 작업은 여러 여기에 입력에 녹색이 + +282 +00:22:21,109 --> 00:22:25,599 + 하나는 10 버스 게이트 현지 그라데이션은 무엇 일어나고 끝 + +283 +00:22:25,599 --> 00:22:42,359 + 상단 구매자 따라 광택 흐름 + +284 +00:22:42,359 --> 00:22:48,089 + 지불 잉여는 모든 로컬 그라데이션이 항상 하나 때문에이됩니다 + +285 +00:22:48,089 --> 00:22:53,769 + 당신은 단지 기능이있는 경우 해당 기능에 대한 이유를 다음 전문가를 알고 + +286 +00:22:53,769 --> 00:22:58,109 + X 또는 Y 중 하나에 그라데이션은 하나이며, 그래서 당신은 점점 끝날 것입니다 + +287 +00:22:58,109 --> 00:23:03,619 + 한 시간은 2 시간에 그렇게 더하기 게이트에 대한 사실 항상 같은 사실을보고 참조 + +288 +00:23:03,619 --> 00:23:07,469 + 모든 입력의 로컬 그라데이션 하나 때문에 어디를 무엇을 등급 + +289 +00:23:07,470 --> 00:23:11,289 + 그냥 항상 모두에게 동등하게 그라데이션을 배포 이상에서 가져옵니다 + +290 +00:23:11,289 --> 00:23:14,339 + 그 입력은 체인 규칙 곱하지 않고 승산 때문에 + +291 +00:23:14,339 --> 00:23:18,129 + 10 일이 변경되지 않은 잉여는 같은 성분의이 종류를 얻을 남아 + +292 +00:23:18,130 --> 00:23:22,170 + 뭔가 반면 유통은 모든 단지 모든 퍼져 상단에서 유입 + +293 +00:23:22,170 --> 00:23:26,560 + 위대한 팀은 동등하게 모든 자식과 우리는 이미받은 + +294 +00:23:26,559 --> 00:23:32,139 + 입력 그라데이션 포인트 중 하나는 회로의 최종 출력에 매우 듣고 + +295 +00:23:32,140 --> 00:23:35,970 + 그래서이 직원의 애플리케이션 일련 완료 + +296 +00:23:35,970 --> 00:23:42,450 + 트레이너 길을 따라가 다른이었다 플러스 그 이상 등이 생략 얻을 + +297 +00:23:42,450 --> 00:23:47,090 + 모두 20.2이 공물의 당신 종류를 가리 동일하게 우리가 이미 수행 한 + +298 
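The full sigmoid-circuit example written out gate by gate with the slide's inputs (w0 = 2, x0 = -1, w1 = -3, x1 = -2, w2 = -3); the backward-pass comments show the intermediate gradients being computed in the walkthrough:

```python
import math

# f(w, x) = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2))); every line is one gate.
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# forward pass
dot = w0 * x0 + w1 * x1 + w2   #  1.00
neg = -dot                     # -1.00  (*-1 gate)
e = math.exp(neg)              #  0.37  (exp gate)
den = 1.0 + e                  #  1.37  (+1 gate)
f = 1.0 / den                  #  0.73  (1/x gate)

# backward pass: local gradient times the gradient from above, gate by gate
dden = (-1.0 / den ** 2) * 1.0 # 1/x gate: d(1/x)/dx = -1/x^2  -> -0.53
de   = 1.0 * dden              # +1 gate: local gradient is 1  -> -0.53
dneg = math.exp(neg) * de      # exp gate: d(e^x)/dx = e^x     -> -0.20
ddot = -1.0 * dneg             # *-1 gate                      ->  0.20
dw0, dx0 = x0 * ddot, w0 * ddot   # multiply gate: -0.20 and 0.39
dw1, dx1 = x1 * ddot, w1 * ddot   # multiply gate: -0.39 and -0.59
dw2 = 1.0 * ddot                  # plus gate:      0.20
```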
+00:23:47,089 --> 00:23:51,750 + 봉쇄하고있다 곱셈 거기 그래서 지금 우리는 다시거야 + +299 +00:23:51,750 --> 00:23:55,940 + 그 곱셈 연산을 통해 전파 등 지역 학년 때문에 + +300 +00:23:55,940 --> 00:24:06,450 + 기본적으로 40 저하됩니다 00w에 대한 그래서 무슨 일이 그라데이션이됩니다 + +301 +00:24:06,450 --> 00:24:19,059 + 2000 당신은 한 번 할 때 음수가 될 것이다 0시 반 (W) 될 것 W 하나에 갈 것 + +302 +00:24:19,059 --> 00:24:24,389 + 너무 좋은있을 것입니다 X 제로에 그라데이션 버그가 슬라이드에 떨어져 물린입니다 + +303 +00:24:24,390 --> 00:24:27,840 + 난 사실 또한 클래스를 작성하기 전에 나는 단지 몇 분처럼 발견하는 것이 + +304 +00:24:27,839 --> 00:24:34,289 + 당신이 볼 수 있도록 클래스에 시작 증가한다. 39이 그것에 대한 포인트가 될한다 그 + +305 +00:24:34,289 --> 00:24:37,480 + 때문에 복음화의 버그 난 작은로를 절단하고 있습니다 때문에 + +306 +00:24:37,480 --> 00:24:41,190 + 숫자하지만 기본적으로 그 지적해야하거나 것을 얻는 방법 때문에 + +307 +00:24:41,190 --> 00:24:45,400 + 두 개의 시간은 내가 지금 거기 작성한처럼의 포인트를 얻을 수 지적 + +308 +00:24:45,400 --> 00:24:50,980 + 우리는을 전파했습니다 있도록 그가 어떤 기회를 괜찮아 + +309 +00:24:50,980 --> 00:24:55,190 + 여기에 회로 우리는이 표현을 통해 얻을 그래서 당신의 상상 + +310 +00:24:55,190 --> 00:24:59,289 + 실제 다운 스트림 데이터를해야합니다 응용 프로그램 및 모든 매개 변수 등이있다 + +311 +00:24:59,289 --> 00:25:03,450 + 끝 상단 입력 손실 함수는 앞으로있을 평가할 합격 + +312 +00:25:03,450 --> 00:25:06,440 + 손실 기능과 우리가 다시 것은 모든 조각을 통해 전파 + +313 +00:25:06,440 --> 00:25:10,450 + 경쟁은 우리가 길을 따라 한 적이과 웰벡이에 대한 모든 게이트를 통해 전파 + +314 +00:25:10,450 --> 00:25:14,150 + 우리의 수입을 얻고 백업 다시 단지 공급 체인 규칙 많은 많은 시간을 의미 + +315 +00:25:14,150 --> 00:25:21,720 + 우리는 그에서 구현하는 방법을 볼 수 있지만, 문제는 내가 메신저에가는 것 같아요 + +316 +00:25:21,720 --> 00:25:31,769 + 이 같은이기 때문에 다른 질문을 건너 뛸거야 것을 이동 + +317 +00:25:31,769 --> 00:25:45,869 + 그래서 전후 전파의 비용은 대략 거의 항상 끝 + +318 +00:25:45,869 --> 00:25:49,500 + 기본적으로 같다고까지 당신은 일반적으로 백업 약간 타이밍을 볼 때 + +319 +00:25:49,500 --> 00:25:58,710 + 느린 생각은 그래서 하나가 있다는 것입니다 내가 이전에 지적하고 싶은 한 가지를 보자 + +320 +00:25:58,710 --> 00:26:02,350 + 이 게이트 등이 게이트의 설정은 그래서 무엇을 할 수 내가 할 수있는 임의적 + +321 +00:26:02,349 --> 00:26:06,509 + 예를 들어 알고 당신 중 일부는 내가이 문을 축소 할 수 있습니다 이것을 알고있다 + +322 +00:26:06,509 --> 00:26:10,549 + 하나의 게이트에 뭔가, 예를 들어 시그 모이 드 함수를 호출하고 싶다면 + +323 +00:26:10,549 --> 00:26:14,069 + 이는 시그 모이 드 함수 특정 형태의 하나의 사실이있다 + +324 +00:26:14,069 --> 00:26:19,460 + 하나 플러스 또는 마이너스 세금을 통해 원 계산하고 그래서 난 것을 다시 한 수 + +325 +00:26:19,460 --> 00:26:22,650 + 표현은 내가 S 상을 만들어 그들 문을 모두 붕괴 캔트 + +326 +00:26:22,650 --> 00:26:27,769 + 단일 게이트에 게이트 등 시그 모이 나는이 할 수 있었다 여기에 도착하고 있어요 + +327 +00:26:27,769 --> 00:26:32,440 + 내가하고 싶어하는 경우해야 할 일을했을 것이다 때 하나의 종류의 갈 것을 + +328 +00:26:32,440 --> 00:26:37,980 + 그 게이트 나는이 그래서 무엇 방법에 대한 식을 계산하기 위해 필요로하는 + +329 +00:26:37,980 --> 00:26:41,670 + 기본적으로 얻을 S 상 로컬 그라데이션 그래서의 기울기 무엇인가 + +330 +00:26:41,670 --> 00:26:44,470 + 작은 입력에 게이트와 내가 않을거야 일부 수학을 통과했다 + +331 +00:26:44,470 --> 00:26:46,980 + 세부 사항으로 이동하지만 당신은 저기있는 식으로 끝날 + +332 +00:26:46,980 --> 00:26:51,750 + 이 지역의 기울기와 그 액세스의 1-6 다음 세그먼트 인 끝 + +333 +00:26:51,750 --> 00:26:55,450 + 나 경쟁 그래프로이 조각을 넣을 수 있습니다 내가 아는 한 번 때문에 + +334 +00:26:55,450 --> 00:26:58,819 + 다른 지역 그라데이션 모든 단지를 통해 정의되는 방법을 계산하는 방법 + +335 +00:26:58,819 --> 00:27:02,389 + 체인 규칙과 우리가 전파 백업 할 수 있도록 모든 것을 함께 곱 + +336 +00:27:02,390 --> 00:27:06,720 + S 상을 통해 내려와 같을 것이다 방법은에 입력되고, + +337 +00:27:06,720 --> 00:27:11,750 + 게이트 독감 게이트에 가서 무엇을 하나 포인트 제로이었고, 펑크 73은 밖으로 나갔습니다 + +338 +00:27:11,750 --> 00:27:18,759 + 그래서. 
7360 사실 좋아 그리고 우리는 우리가 본 것 같다 현지 그라데이션하려는 + +339 +00:27:18,759 --> 00:27:26,450 + 자신의 허리에 수학에서 당신은 1-23 곱 액세스 포인트 묘지를 얻을 수 있도록 + +340 +00:27:26,450 --> 00:27:31,170 + 즉, 로컬 그라데이션의 다음 번 우리가 마지막에 우연히 작동합니다 + +341 +00:27:31,170 --> 00:27:36,330 + 10도 작성 회로의 그렇게 시간은 그래서 우리는 12 물론 결국 우리 + +342 +00:27:36,329 --> 00:27:37,649 + 같은 답변을 얻을 + +343 +00:27:37,650 --> 00:27:42,220 + 수학이 있지만, 기본적으로 작동하기 때문에 우리가 12 전에받은 가리킨 우리 + +344 +00:27:42,220 --> 00:27:44,480 + 다운이 식을 부러 졌을 수 있으며, + +345 +00:27:44,480 --> 00:27:47,450 + 한 번에 조각 또는 우리는 단지 하나의 신호 게이트를 가질 수 그것은이다 + +346 +00:27:47,450 --> 00:27:51,569 + 종류의 어떤 수준까지 여기에 이​​러한 식을 깰 열쇠 우리에게 달려과과 + +347 +00:27:51,569 --> 00:27:52,339 + 그래서 당신은하고 싶습니다 + +348 +00:27:52,339 --> 00:27:55,829 + 그것은 매우 효율적인지 직관적으로 하나의 게이트에 이러한 식을 클러스터 + +349 +00:27:55,829 --> 00:28:06,819 + 그들은 당신의 조각 그렇게 될 수 있기 때문에 또는 쉽게 로컬 윤기를 연출하는 + +350 +00:28:06,819 --> 00:28:10,529 + 문제는 일반적으로 당신이 알고에 대해 나는 그들이 걱정 않도록해야합니까 라이브러리입니다 + +351 +00:28:10,529 --> 00:28:14,058 + 어떤 컴퓨터를 설득 쉽게 무엇을하고 대답은 '예 나는 것입니다 + +352 +00:28:14,058 --> 00:28:17,480 + 그래서 그래서 그는 당신을 통해 수행하려는 작업의 일부 조각이 있음을 지적 말 + +353 +00:28:17,480 --> 00:28:20,798 + 또 다시 그리고 그것은 매우 뭔가 아주 간단한 로컬 그라데이션이 + +354 +00:28:20,798 --> 00:28:24,900 + 실제로 단일 유닛을 만들 호소 우리는 그 중 일부를 볼 수 있습니다 + +355 +00:28:24,900 --> 00:28:30,230 + 예를 들면 실제로하지만 난 또한 지적하고 싶은 생각하면 한 번 + +356 +00:28:30,230 --> 00:28:32,490 + 나는이 조성 잔디에 대해 생각하는 좋아하는 이유는 정말 희망입니다 + +357 +00:28:32,490 --> 00:28:36,289 + 그렇지 않은 방법 욕심 느린 신경 네트워크에있는 당신의 직감에 대해 생각하는 + +358 +00:28:36,289 --> 00:28:39,369 + 당신이 당신이 이해 싶어 블랙 박스 싶지 않아 + +359 +00:28:39,369 --> 00:28:43,959 + 직관적으로 어떻게 이런 일이 발생하면의 잠시 후에 개발 시작 + +360 +00:28:43,960 --> 00:28:47,850 + 이 graybeards 흐름이 방법에 대한 자세한 그래프 직관보고 + +361 +00:28:47,849 --> 00:28:52,029 + 말은 성분 문제를 추방하기 위해 갈 것 같은 당신이 어떤 문제를 디버깅하는 데 도움이 될 수 + +362 +00:28:52,029 --> 00:28:55,950 + 그것은 무엇 최적화에 잘못된거야 정확히 이해하는 것이 훨씬 쉽게 + +363 +00:28:55,950 --> 00:28:59,250 + 당신이 도움이 될 것입니다 얼마나 욕심과 느린 네트워크를 이해한다면 이러한 디버깅 + +364 +00:28:59,250 --> 00:29:02,740 + 훨씬 더 효율적으로 네트워크와 우리는 이미 예를 들어, 그래서 몇 가지 정보 + +365 +00:29:02,740 --> 00:29:07,609 + 그것의 입력 그래서 모두에게 하나를 읽고 조금이 게이트에서 여덟 번째를 보았다 + +366 +00:29:07,609 --> 00:29:11,279 + 그것은 그것에 대해 생각하는 좋은 방법처럼 그냥 인사 대리점입니다 + +367 +00:29:11,279 --> 00:29:14,548 + 당신은 당신의 점수 기능 또는 어디 더하기 수술을 할 때마다 + +368 +00:29:14,548 --> 00:29:18,740 + 댓글을 다른 곳은 최대 케이트는 평가를 분산 있어요 + +369 +00:29:18,740 --> 00:29:23,009 + 당신이 표현 보면 대신이 작품 훌륭한 작가와 방법은 + +370 +00:29:23,009 --> 00:29:30,970 + 당신은 아주 간단한 바이너리가있는 경우 등 우리는 이러한 마커는 정말 대단 작동하지 않는 한 + +371 +00:29:30,970 --> 00:29:38,410 + 당신이 경우 맥심 XY의 표현은 그래서 이것은 온라인으로 다음 게이트 X의 기울기이다 + +372 +00:29:38,410 --> 00:29:42,570 + 더 큰 당신의 입력의 큰 일에 대해 녹색을 생각한다 + +373 +00:29:42,569 --> 00:29:46,389 + 그 사람에 그라데이션은 하나이며 모든 이것과 더 작은 하나의 인사말입니다 + +374 +00:29:46,390 --> 00:29:50,630 + 제로 직관적으로 이들 때문에 경우 하나는 더이 무엇보다 작은 것을 + +375 +00:29:50,630 --> 00:29:53,220 + 다른 사람의 큰 및 그건 무슨 일이 끝나는 때문에 출력에 영향을하지만, + +376 +00:29:53,220 --> 00:29:57,009 + 게이트를 통해 점점 당신은 하나의 구배로 끝날 수 있도록 + +377 +00:29:57,009 --> 00:30:03,140 + 입력 중 하나 크고 그래서 난 경우 그라데이션 작가로 왜 맥스 캐디의 + +378 +00:30:03,140 --> 00:30:06,420 + 실제로 내가받은 여러 입력 그들 중 하나의 가장 큰했다 + +379 +00:30:06,420 --> 00:30:09,550 + 그들 모두 그 내가 회로를 통해 전파되는 값이고 + +380 +00:30:09,549 --> 00:30:12,909 + 응용 프로그램 시간은 그냥 위에서 내 구배를받을거야 그리고 난 + +381 +00:30:12,910 --> 00:30:16,590 + 나의 가장 큰 충격이었다 누구에 기록하려고하면은 그라데이션 작가의 + +382 +00:30:16,589 --> 00:30:22,569 + 및 다중 게이트 그라데이션 스위처는 실제로 아주 좋은 생각하지 않습니다이다 + +383 +00:30:22,569 --> 00:30:26,960 + 방법은 그것을보고 할 수 있지만, 실제로는 아니에요 난 사실을 말하는 겁니다 + +384 +00:30:26,960 --> 00:30:39,150 + 
신경 끄시 고 질문 그래서 그 부분에 대해 두 가지 경우 발생하는 것입니다 + +385 +00:30:39,150 --> 00:30:53,470 + 당신은 내가 그것을 생각하지 않습니다 무슨 일 최대 카데을 통과 할 때 입력은 동일하다 + +386 +00:30:53,470 --> 00:30:57,559 + 그들 모두에게 분배에 올바른 난 당신이 하나를 선택해야한다고 생각 + +387 +00:30:57,559 --> 00:31:07,990 + 즉, 기본적으로 결코 실제로 여기에 실제 연습 때문에 최대 구배를 발생하지 않습니다 + +388 +00:31:07,990 --> 00:31:13,019 + 예를 들어 여기에 너무 만이 영향에가있다 (W)보다 큰 것을이다가 + +389 +00:31:13,019 --> 00:31:16,839 + 이 최대 카데의 출력 바로 그렇게 할 때 최대 게이트로 두 흐름 및 + +390 +00:31:16,839 --> 00:31:20,879 + 읽어와 회로에 효과가 있으므로 W가 0 구배를 얻는다 도착 + +391 +00:31:20,880 --> 00:31:25,360 + 아무것도 제로가없는 당신이 변경할 때 중요하지 않습니다를 변경할 때 때문에 + +392 +00:31:25,359 --> 00:31:29,689 + 그것은 그 경쟁 경내 I 통과하는 큰 발리 없기 때문에 + +393 +00:31:29,690 --> 00:31:33,100 + 전파하는 우리는 이미 백업과 관련된 또 다른 메모가 + +394 +00:31:33,099 --> 00:31:36,490 + 난 그냥 간단히 정말 그것으로 지적하고 싶은 질문을 통해 해결 + +395 +00:31:36,490 --> 00:31:40,440 + 불운과 이러한 회로가있는 경우 때때로 당신은이 있는지 그림 + +396 +00:31:40,440 --> 00:31:43,330 + 값 회로에 지점 밖으로 그와의 여러 부분에 사용된다 + +397 +00:31:43,329 --> 00:31:47,179 + 정확한 것은 변수 체인 규칙에 의해 수행하는 회로는 사실이다 + +398 +00:31:47,180 --> 00:31:55,110 + 그라디언트 배경을 추가 할 수 있도록 동작에 기여를 추가 + +399 +00:31:55,109 --> 00:32:00,009 + 회로를 통해 거꾸로 그들이 이제까지이 역류에 유입하는 경우 + +400 +00:32:00,009 --> 00:32:04,879 + 바로 우리는 매우 간단한 구현 단지 몇으로 갈거야 + +401 +00:32:04,880 --> 00:32:05,700 + 질문 + +402 +00:32:05,700 --> 00:32:11,620 + 질문은 질문 해 지금까지 이들의 루프처럼이됩니다 감사합니다 + +403 +00:32:11,619 --> 00:32:15,839 + 당신이 생각 수있는 루프가 결코 그래서 외모 없을 것 그래프 + +404 +00:32:15,839 --> 00:32:18,589 + 당신은 재발 성 신경 네트워크를 사용하는 경우가 있음을 거기에 루프하지만, + +405 +00:32:18,589 --> 00:32:21,658 + 우리가 할 거 야하는 것이 있기 때문에 실제로는 더 우리는 재발 성 신경이 걸릴 거 있어요 + +406 +00:32:21,659 --> 00:32:26,230 + 네트워크 및 시간 단계를 통해 전개되며,이 모두가 될 것입니다 + +407 +00:32:26,230 --> 00:32:31,259 + 사진에 루프가 있음을 붙여 복사 할 수 결코 작은 조각 또는 시간 + +408 +00:32:31,259 --> 00:32:39,538 + 우리가 실제로 그것으로 얻을 때 당신은 더 많은 것을 볼 수 있습니다하지만 그는 항상보고 있어요 + +409 +00:32:39,538 --> 00:32:42,220 + 이것의 구현보고의 사실 실제로 구현하자 + +410 +00:32:42,220 --> 00:32:46,860 + 나는 우리가 항상 이러한 그래서뿐만 아니라이보다 구체적를하는 데 도움이됩니다 생각 + +411 +00:32:46,859 --> 00:32:52,038 + 그래프는 이러한 신경 네트워크를 구성에 대해 생각하는 가장 좋은 방법입니다 그래프 + +412 +00:32:52,038 --> 00:32:56,929 + 그래서 우리가 결국 어떻게이 모든 게이트가 약간 보일 거라고하지만, + +413 +00:32:56,929 --> 00:33:00,059 + 연결 구조를 유지할 필요가 무언가 게이트 위에 + +414 +00:33:00,058 --> 00:33:03,490 + 같은 단락의 내용 게이트 그래서 일반적으로 서로 연결되어 + +415 +00:33:03,490 --> 00:33:09,710 + 그 그래프에 의해 처리 또는 순 객체가 필요가 있는지에 일반적으로 객체의 + +416 +00:33:09,710 --> 00:33:13,679 + 두 가지 주요 부분 전후 평화이었고, 이것은 당신은 단지 인 + +417 +00:33:13,679 --> 00:33:19,929 + 이 코트는 실행되지만 기본적으로 거의 생각은 앞으로 패스이다 + +418 +00:33:19,929 --> 00:33:23,759 + 전체 그들이 위상으로 정렬하는 회로의 게이트를 거래 + +419 +00:33:23,759 --> 00:33:27,980 + 그게 무슨 뜻인지 주문하면 모든 입력이되기 전에 모든 노트에 와서해야한다는 것입니다 + +420 +00:33:27,980 --> 00:33:32,099 + 기회는 바로 왼쪽에서 오른쪽으로 주문하고 우리는 그냥있어 소모 된 + +421 +00:33:32,099 --> 00:33:35,969 + 탑승 우리가 반복 그래서 길을 따라 모든 단일 게이트 앞으로 나중에 호출 + +422 +00:33:35,970 --> 00:33:39,600 + 이 그래프를 통해 단지 하나 하나 조각 전진이 오브젝트 것 + +423 +00:33:39,599 --> 00:33:43,189 + 단지 있는지 확인하는 적절한 연결 패턴과 이전 버전에서 발생 + +424 +00:33:43,190 --> 00:33:46,620 + 우리는 정확한 역순으로거야 우리가 역에 전화하는거야 통과 + +425 +00:33:46,619 --> 00:33:49,709 + 모든 단일 게이트 및이 게이트는 각각 그라디언트를 전달 끝날 것 + +426 +00:33:49,710 --> 00:33:53,429 + 다른 및 이전 가져 오기 체인지업과 분석 그라디언트 그것을 다시 계산 + +427 +00:33:53,429 --> 00:33:57,860 + 그래서 진짜 목적은 모든 게이트 주위에 또는 매우 얇은 래퍼는 우리 + +428 +00:33:57,859 --> 00:34:01,879 + 자신의 차가운 레이어 층 또는 게이트 내가 같은 의미로 사용하는거야 볼 수 있습니다 + +429 +00:34:01,880 --> 00:34:05,700 + 그들은이 단지 매우 얇은 래퍼 서라운드 연결 구조있어 + +430 +00:34:05,700 --> 00:34:09,369 + 게이트는 그들에 순방향 및 역방향 함수를 호출 한 다음의가 살펴 보자 + +431 
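A minimal sketch of the gate API being described: forward caches whatever backward will need, and backward multiplies the local gradient by the gradient flowing in from above. Wiring two gates together by hand reproduces the earlier (x + y) * z example; a real framework's Net object would just run this same iteration over a topologically sorted list of gates, forward left to right, backward in exact reverse order:

```python
class MultiplyGate(object):
    """Binary multiply gate: forward must cache x and y, because backward
    needs them as the local gradients (d(xy)/dx = y, d(xy)/dy = x)."""
    def forward(self, x, y):
        self.x, self.y = x, y            # remembered for the backward pass
        return x * y
    def backward(self, dz):
        return self.y * dz, self.x * dz  # chain rule: local grad times upstream grad

class AddGate(object):
    """Binary add gate: local gradient is 1 for both inputs, so it just
    distributes the upstream gradient to its children unchanged."""
    def forward(self, x, y):
        return x + y
    def backward(self, dz):
        return dz, dz

# wiring up f = (x + y) * z by hand, in topological order
add, mul = AddGate(), MultiplyGate()
q = add.forward(-2.0, 5.0)               # forward pass, left to right
f = mul.forward(q, -4.0)                 # f = -12
dq, dz = mul.backward(1.0)               # backward pass, exact reverse order
dx, dy = add.backward(dq)                # dx = dy = -4, dz = 3
```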
+430
+00:34:05,700 --> 00:34:18,730
+Then let's look at a concrete example of how one of these gates can
+be implemented. And this isn't made up: you could actually run an
+implementation that looks more or less exactly like this.
+
+434
+00:34:18,730 --> 00:34:52,639
+So how could a multiply gate be implemented? The multiply gate in
+this case is just a binary multiply: it receives two inputs, x and
+y, computes their multiplication, z = x times y, and returns it.
+Every gate must satisfy this forward / backward API: how it behaves
+in the forward pass, and how it behaves in the backward pass. In
+the backward pass, we eventually end up learning what our gradient
+on the final loss is, so dz is the gradient of the loss with
+respect to this gate's output;
+
+441
+00:34:52,639 --> 00:35:11,550
+right now everything here is scalars: x, y and z are all numbers.
+What this gate is in charge of in the backward pass is: it's told
+dz, the gradient flowing down from the top,
+
+445
+00:35:11,550 --> 00:35:35,650
+and it has to perform its little piece of the chain rule: compute
+the gradients on the inputs x and y, chain them with dz, and return
+them. The machinery around it then makes sure these get properly
+routed backward, and that the backward contributions get added
+together wherever branches meet.
+
+450
+00:35:35,650 --> 00:36:04,949
+So how would we implement, for example, dx: what is dx in this
+case? It's y times dz, the local gradient times the gradient from
+above; and dy, likewise, is x times dz. One point to make here: we
+have to remember the values of x and y, by assigning them to self
+in the forward pass, because we end up using them in the backward
+pass; I need access to x and y in my backward step of
+backpropagation.
+
+456
+00:36:04,949 --> 00:36:33,690
+So when we build these gates, every single gate must remember, in
+its forward pass, whatever intermediate computations it needs
+access to in the backward pass. Basically, when we run these
+networks, know that at runtime, as you're doing the forward pass, a
+huge amount of stuff gets cached in memory: it all has to stick
+around, because during backpropagation you need access to some of
+those variables. So your memory balloons up during the forward
+pass, and it all gets consumed holding the intermediates needed for
+a proper backward pass.
+
+464
+00:36:33,690 --> 00:36:57,280
+Question: do you have to keep all of it? You certainly don't have
+to cache these values in memory, you could recompute them, but I
+don't think most implementations actually worry about that; I think
+there's usually a lot of logic dealing with memory, and usually you
+do end up remembering things anyway.
+
+468
+00:36:57,280 --> 00:37:18,750
+If you're on an embedded device, for example, and memory is
+strained, this is something you could take advantage of: if you
+know the neural network will only ever be run at test time, forward
+only, you can go into the code and make sure nothing gets cached,
+since no backward pass is coming.
+
+472
+00:37:18,750 --> 00:37:58,509
+Question: do we remember the local gradients during the forward
+pass? We don't need to remember the other intermediates, I think;
+in some simple expressions like this one you can get away with very
+little. In general, what you remember is really up to you: you, as
+the gate, are responsible for knowing what you'll need to perform
+your backward pass, so you can remember whatever you feel is
+needed, and you can be clever about it.
+
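+The multiply gate from above, written out in Python close to the
+slide's pseudocode (dz is the gradient on the output handed down
+from the gate above):
+
+class MultiplyGate(object):
+    def forward(self, x, y):
+        z = x * y
+        self.x = x        # stash the inputs: the backward pass
+        self.y = y        # needs them for the chain rule
+        return z
+
+    def backward(self, dz):
+        dx = self.y * dz  # local gradient dz/dx = y, times dz
+        dy = self.x * dz  # local gradient dz/dy = x, times dz
+        return [dx, dy]
+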
+478
+00:37:58,510 --> 00:38:16,750
+And you'll see concrete examples of people being clever about
+exactly that. Let's now look at a concrete example in a deep
+learning framework: Torch. We'll go into Torch a bit more at the
+end of the class, and some of you may end up using it for your
+projects. If you go to the GitHub repo and look,
+
+482
+00:38:16,750 --> 00:38:44,829
+it's really just a giant collection of these layer objects; these
+gates and layers are the same thing. A deep learning framework is
+really just all these layers, plus a very thin computational graph
+thing that keeps track of the whole bunch of layers and all their
+connectivity. So really, the picture to keep in mind in all of
+this: the layers are your Lego blocks, and you're building graphs
+out of the blocks,
+
+488
+00:38:44,829 --> 00:38:58,840
+putting them together in various ways depending on what you want to
+achieve, and you end up building all kinds of stuff. So the way you
+work with any of these libraries: there's a whole set of layers you
+can compute, and every layer implements a little function piece
+that knows how to go forward and knows how to go backward.
+
+493
+00:38:58,840 --> 00:39:19,300
+Let's look at a concrete example: the MulConstant layer in Torch.
+This layer is just a scaling by a scalar: it takes some tensor X,
+so not a scalar but an array of numbers, basically, because in
+practice all these operations work on tensors, which are really
+just n-dimensional arrays, and it scales it by a constant.
+
+497
+00:39:19,300 --> 00:39:44,630
+You can see this is just a few lines of code. There's some
+initialization stuff; by the way, this is Lua, in case it looks
+foreign to you. In the initialization you pass in the constant a
+that you actually want to scale by, and then during the forward
+pass, which they call updateOutput, they just multiply X by a and
+return it.
+
+503
+00:39:44,630 --> 00:40:27,849
+In the backward pass, which they call updateGradInput, there are
+three lines here, but really the most important one is this: you
+can see that the gradInput variable is computed by taking the
+gradOutput, the gradient flowing down onto your output from the
+final loss, and multiplying it by the scalar a, which is just your
+local gradient. Gradient from above times local gradient: that's
+what those three lines do, and what gets assigned to gradInput is
+what the layer returns. That's one of the hundreds of layers in
+Torch.
+
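+In numpy form, the MulConstant layer's two functions amount to
+something like this (a sketch; the real Torch layer is Lua and
+works in place on tensors):
+
+import numpy as np
+
+a = 0.5                           # the constant to scale by
+
+def updateOutput(x):
+    return a * x                  # forward: just scale the tensor
+
+def updateGradInput(gradOutput):
+    return a * gradOutput         # backward: gradient from above
+                                  # times the local gradient, a
+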
+514
+00:40:27,849 --> 00:40:57,949
+You can also look at examples in Caffe. Caffe is also a deep
+learning framework, particularly for working with images. Again, if
+you go to the layers directory you'll see all the layers, and they
+all implement the forward / backward API. Just to give you one
+example, there's a Sigmoid layer: it takes a blob, so Caffe likes
+to call its tensors blobs, and a blob is just an n-dimensional
+array of numbers, and it passes it elementwise through the sigmoid
+function.
+
+521
+00:40:57,949 --> 00:41:26,150
+So in the forward pass you see them compute the sigmoid; there's a
+lot of boilerplate here, getting pointers to all the data, the
+bottom blob, and then we call the sigmoid function on the bottom
+data; that's where the sigmoid gets computed, and the rest is
+boilerplate. The really important part is the backward pass, where
+we need to compute the chain rule, and that happens in the one line
+where the magic happens:
+
+528
+00:41:26,150 --> 00:42:00,849
+they compute the bottom diff as the top diff times this piece,
+which is really the local gradient of the sigmoid: its output times
+one minus its output. So the chain rule is happening right there,
+through that multiplication. And that's it: a single layer is just
+a forward / backward API, and another object worries about the
+connectivity on top. Any questions about these implementations?
+
+534
+00:42:00,849 --> 00:42:36,318
+Question: why do we do the backward pass at all? Because once I
+have the gradient, I can do an update: I change my weights a tiny
+bit in the direction of the negative gradient. So you keep this
+loop going: forward computes the loss, backward computes the
+gradients, and the update uses the gradients to nudge the weights a
+bit. That's all that keeps happening when you train a neural
+network: forward, backward, update; forward, backward, update.
+
+540
+00:42:36,318 --> 00:43:03,679
+Question: I notice there are loops in this code? Yeah, they're
+loops; you'd like it to look better, and actually, OK, so this is
+C++; they just go over the elements.
+
+543
+00:43:03,679 --> 00:43:30,349
+And I should mention, by the way, that this is the CPU
+implementation. There's a second file that implements the same
+layer on the GPU, and that's CUDA code, a separate file; there'd be
+a sigmoid dot cu file or something like that, which I'm not showing
+you.
+
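+That sigmoid layer condensed into a numpy sketch (illustrative
+function names; the real Caffe code is C++ operating on blobs):
+
+import numpy as np
+
+def sigmoid_forward(x):
+    out = 1.0 / (1.0 + np.exp(-x))   # elementwise squash to (0, 1)
+    return out
+
+def sigmoid_backward(top_diff, out):
+    # chain rule: the local gradient of the sigmoid is
+    # out * (1 - out), times the gradient flowing in from above
+    bottom_diff = top_diff * out * (1.0 - out)
+    return bottom_diff
+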
+547
+00:43:30,349 --> 00:43:58,010
+OK, great. What I'd like to do now is work with vectors instead, so
+these things flowing along our graphs are no longer just scalars;
+they're going to be entire vectors. Nothing else changes: the only
+difference is that x, y and z are now vectors, and this local
+gradient, which before used to be just a scalar, is now in general
+a full Jacobian matrix: a two-dimensional matrix that tells you the
+influence of every single element of x on every single element of
+the output.
+
+555
+00:43:58,010 --> 00:44:16,079
+The gradient is the same expression as before: dL/dx is the
+Jacobian dz/dx times dL/dz, except this is now a full matrix-vector
+multiplication.
+
+558
+00:44:16,079 --> 00:45:02,119
+I'll come back to this point in a bit: you never actually end up
+forming the full Jacobian; most of the time you won't actually
+build this matrix and multiply by it. Also, one note: I think these
+two factors are actually in the wrong order on the slide; the
+Jacobian should be on the left, because the order matters when you
+multiply, so the slide is slightly wrong there. Let me show you why
+you never actually need to form the Jacobian, with a concrete and
+fairly common example.
+
+567
+00:45:02,119 --> 00:45:28,588
+Suppose we have the max(0, x) nonlinearity, and it receives a
+vector of 4096 numbers, which would be a typical size, of real
+values. It computes an elementwise threshold at zero: anything
+lower than zero gets clamped to zero, and the output is again 4096
+numbers, the same dimensionality. Here's the question I'd like to
+ask: what is the size of the Jacobian matrix for this layer?
+
+574
+00:45:28,588 --> 00:46:14,558
+It's 4096 by 4096: in principle, every single input number could
+have influenced every single output number. But that's not actually
+the case, right? The Jacobian would be this huge matrix, sixteen
+million numbers, but you would never form it, because it has a
+special structure. What is that special structure?
+
+581
+00:46:14,559 --> 00:46:52,250
+It's a huge 4096 by 4096 matrix in which the only nonzero elements
+are on the diagonal, because this is an elementwise operation; and
+moreover, the diagonal entries are not all ones: wherever an
+element was lower than zero in the forward pass and got clamped,
+the corresponding diagonal entry is zero. So the Jacobian is almost
+an identity matrix, with some entries zeroed out.
+
+587
+00:46:52,250 --> 00:47:25,910
+You would never actually form the whole Jacobian and do a
+matrix-vector multiply; that would be silly, because of this
+special structure we want to take advantage of. The backward pass
+for this operation is instead very, very easy: you look at your
+input, and for all the dimensions that were lower than zero, you
+want to kill the gradient: set the gradient to zero in those
+dimensions. So you take the gradient from above, and wherever the
+input number was lower than zero, you just set the gradient to
+zero; everything else passes through unchanged.
+
+595
+00:47:25,909 --> 00:48:17,380
+So it's a very simple operation in the end. You could, if you
+wanted, say that the Jacobian lives inside the gate and use it to
+perform the backward pass, but what actually happens is different:
+you only ever worry about the gradient vector. There was also a
+question about multiple outputs: we're not actually going to run
+into that case,
+
+599
+00:48:17,380 --> 00:48:35,769
+because we almost always end up with a single scalar output at the
+end, since we care about loss functions: there's just one number at
+the end whose gradient we're interested in. If we had multiple
+outputs, we'd have to keep track of the gradients for all of them
+in parallel when we backpropagate, but we just have a single loss
+function, so we don't worry about it.
+
+604
+00:48:35,769 --> 00:49:08,869
+I also want to point out that 4096 actually isn't crazy: normally
+we use mini-batches, so say a hundred elements going through at the
+same time, and then you end up with a hundred 4096-dimensional
+vectors all flowing through at once. Every example in the batch is
+processed independently of the others, so that Jacobian would
+really end up being four hundred million entries: way too large.
+
+610
+00:49:08,869 --> 00:49:17,538
+So you never form it; basically, you have to take care to exploit
+the sparsity structure of the Jacobian yourself, and you hand-code
+the operation inside the implementation of every gate.
+
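+The point in code: you never build the 4096 by 4096 Jacobian; the
+backward pass of max(0, x) is just an elementwise mask (numpy, with
+illustrative random values):
+
+import numpy as np
+
+x = np.random.randn(4096)       # input vector
+out = np.maximum(0, x)          # forward: threshold at zero
+
+dout = np.random.randn(4096)    # gradient from above
+dx = dout * (x > 0)             # kill gradients where input was < 0
+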
+612
+00:49:17,539 --> 00:49:59,679
+OK, so before we go on: for the assignment, where you'll be writing
+things like the SVM and Softmax classifiers and optimizing them, I
+want to give you a hint on the design of the approach. The way to
+think about it: stage your computation into units where you know
+the local gradient, and then backpropagate into them one stage at a
+time. So your gradient code on the assignment should look like
+this; we don't have a graph structure there, because you're doing
+everything inline,
+
+621
+00:49:59,679 --> 00:50:26,239
+so no crazy object stuff; that's for the second assignment, where
+you'll actually build a graph object and implement your layers. In
+the first assignment you do it inline, straight up: compute your
+scores from W and X, compute the margins, which are max(0, score
+differences), compute the loss, and then go backward. In
+particular, I really urge you to
+
+627
+00:50:26,239 --> 00:50:55,800
+use intermediate results: you might create a matrix of margins, and
+compute the gradient on the scores before you ever get to the
+gradient on your weights. It can be tempting to chain the whole
+chain rule by hand all the way to W, derive one giant expression
+for the gradient on W, and implement that in one shot, but that's
+an unhealthy way of approaching the problem: stage your computation
+and backpropagate through every one of the stages, and it will help
+you out.
+
+633
+00:50:55,800 --> 00:51:23,079
+So far, then: we have these computational structures, which can end
+up hopelessly large; the intermediate nodes all implement the
+forward / backward API; the graph structure is usually a very thin
+wrapper around all these layers, handling the communication between
+them; and that communication is always along vectors. When we write
+these implementations, what we pass around are n-dimensional
+tensors, and tensor really just means an n-dimensional array;
+
+640
+00:51:23,079 --> 00:51:49,860
+those arrays are what travels between the gates, and internally
+every gate knows what to do in its forward and backward pass. OK,
+at this point I'm done with backpropagation, and I'll go on to
+neural networks; any questions before we move on?
+
+644
+00:51:49,860 --> 00:52:38,609
+Question: for the assignment, do the operations pretty much all
+have to be done efficiently with numpy? Yes, it has to work well
+enough with numpy operations, and that's something the assignments
+will give you practice with. And for the follow-up: I don't think
+that would work; yeah, I'm actually not sure it would, but that's
+up to you to design as you work back through it.
+
+650
+00:52:38,610 --> 00:53:03,170
+So, on to neural networks; this is exactly what you'll be
+implementing. And this is what happens when you search for "neural
+network" on Google Images: this is one of the first results,
+something like that. Before we dive in, I'd actually like to do
+this first without all the brain stuff: forget neurons, forget that
+these have any relationship whatsoever to the brain. They do, but
+forget it for now.
+
+656
+00:53:03,170 --> 00:53:32,369
+Let's just look at the score function. Before, we had f equals W x;
+that's what we've worked with so far. Now, as I said, we're going
+to start making f more complex: if you want a neural network, you
+change that equation to this. A two-layer neural network is
+f = W2 max(0, W1 x), and that's what it looks like: just a more
+complex mathematical expression of x. So what happens is, you take
+your input x and you multiply it by a matrix, just like we did
+before,
+
+663
+00:53:32,369 --> 00:53:58,169
+and what comes next is a nonlinearity, or activation function; I'll
+go into several choices you could make for these. In this case
+we're really just using thresholding at zero as the activation
+function. So: we do a matrix multiply, we threshold everything
+below zero to zero, and we do one more matrix multiply, and that
+gives us our scores. If I drop in CIFAR-10 numbers: x is your 3072
+pixel values,
+
+669
+00:53:58,170 --> 00:54:16,849
+and before, we went through one single matrix multiply straight to
+the 10 class scores. But now we go through an intermediate
+representation, a hidden state, a hidden layer: say a hundred
+numbers; that hundred is a hyperparameter, whatever you want the
+size of the network to be.
+
+674
+00:54:16,849 --> 00:54:36,330
+Say it's a hundred: we go through this intermediate representation,
+a multiply gives us a hundred numbers, we threshold them at zero,
+and one more multiply gives the scores. Since we have more numbers
+in between, we have more wiggle room, and we can do more
+interesting things. One particular example of something interesting
+you can do here, one interpretation:
+
+679
+00:54:36,329 --> 00:55:02,250
+going back to the linear classifier on CIFAR-10, we saw that the
+car class template had to merge all the modes, all the different
+colors and different orientations of cars across the dataset; with
+a single layer, one single car template had to span all of those
+modes, and we couldn't deal with, for example, different colors; it
+wasn't very natural to do.
+
+684
+00:55:02,250 --> 00:55:16,280
+But now we have a hundred numbers in the intermediate
+representation, so you could imagine, for example, that one of
+those numbers only turns on when it finds a red car facing forward;
+another one could be a red car facing slightly to the left,
+
+689
+00:55:16,280 --> 00:55:41,869
+another a car facing slightly to the right. Those elements of the
+hidden layer become positive if they find the thing they're looking
+for in the image, and stay zero otherwise. Other hidden units could
+be looking for green cars, or yellow cars, or whatever, in
+different orientations; so now we can have templates for all these
+different modes, and these neurons turn on or off when they find
+the specific type of thing they're looking for.
+
+695
+00:55:41,869 --> 00:56:07,358
+And then the W2 matrix on top can do a weighted sum across all
+those little car templates; say there are twenty templates of what
+cars could look like. The score classifier is one additional matrix
+multiply, so there's a choice: a weighted sum over them, and if any
+of those templates turned on, then, through somewhat positive
+weights presumably, I'd be adding them up and getting a higher car
+score.
+
+700
+00:56:07,358 --> 00:56:14,720
+So now we have this composite classifier over all the modes, and
+that, loosely speaking, is why this additional hidden layer lets
+these networks do more interesting things.
+
+703
+00:56:14,719 --> 00:56:59,659
+The question was about the extra points for something fun or extra
+on the assignment: yes, you get to think of experiments you find
+interesting, and we'll give some bonus points for good candidates
+that you might want to investigate, whether they end up working or
+not. Question?
+
+708
+00:57:08,329 --> 00:58:10,579
+The question is whether the dataset needs to contain all the
+various modes for this to work. I don't have a clean answer,
+because we're going to train this entirely with backpropagation.
+The naive thinking is that you'd get exact templates, a red car
+facing left and so on; what you'll probably find instead is
+intermediates, weird mixes of these kinds of things: the network
+finds an optimal way to carve up the data using the boundaries at
+its disposal, so those clean templates can definitely come out
+differently; it's really hard to say, and it all gets swept up in
+the optimization.
+
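+The two-layer score function with the CIFAR-10 shapes just
+discussed plugged in (a sketch: the weight values are random, and
+biases are omitted, as in the slide's formula):
+
+import numpy as np
+
+x  = np.random.randn(3072)        # image stretched into a column
+W1 = np.random.randn(100, 3072)   # first layer: 3072 -> 100
+W2 = np.random.randn(10, 100)     # second layer: 100 -> 10
+
+h = np.maximum(0, W1.dot(x))      # hidden layer, thresholded at 0
+scores = W2.dot(h)                # 10 class scores
+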
+717
+00:58:10,579 --> 00:58:30,659
+Question: how did I choose the size of the hidden layer? I picked a
+hundred; generally, and we'll see a lot of this, you usually want
+it to be as large as your computer allows; bigger is better, and
+I'll get to that.
+
+721
+00:58:30,659 --> 00:58:48,390
+Question: do we always take the max? We don't get to that for
+another five slides or so, so I'll just go ahead and take questions
+at the end. If you wanted a three-layer neural network, by the way,
+the way we extend this is very simple:
+
+724
+00:58:48,389 --> 00:59:03,369
+we just keep the same pattern going, with intermediate hidden
+nodes, and we can keep making the network deeper and deeper; you
+can compute more interesting functions, because you're giving
+yourself more time to compute interesting things, in stages.
+
+729
+00:59:03,369 --> 00:59:21,980
+One other slide I nearly forgot: training a two-layer neural
+network. It's actually pretty simple when it comes down to it. This
+is borrowed from a blog post: the price is roughly eleven lines of
+Python to implement a two-layer neural network doing binary
+classification,
+
+733
+00:59:21,980 --> 00:59:41,150
+on a data matrix X with binary labels y; syn0 and syn1 are your
+weight matrices, called synapses there, but they're weight
+matrices. And then this optimization loop here:
+
+737
+00:59:41,150 --> 00:59:58,650
+what you're seeing is the first-layer activations being computed,
+but this uses the sigmoid nonlinearity, not the max(0, x) we've
+been going with; we'll get into a bit of what these nonlinearities
+might be and why you'd use one form or another. So, sigmoid for the
+first layer,
+
+741
+00:59:58,650 --> 01:00:13,390
+then the second layer right here, and then it computes the backward
+pass: the l2 delta, the gradient on the second layer, and the l1
+delta, the gradient on the first layer; those deltas are the
+gradients flowing backward.
+
+744
+01:00:13,389 --> 01:00:24,630
+And the weight update is right here: it's doing the update at the
+same time as the final part of the backward pass, adding the weight
+gradient straight into the weights. So: eleven lines to train a
+feedforward neural network.
+
+747
+01:00:24,630 --> 01:00:43,500
+Why the loss here may look slightly different from what you've
+seen: this is a logistic regression loss, and the softmax
+classifier we saw is its generalization to multiple dimensions. So
+it's basically a logistic loss being updated here; you can go
+through it in more detail on your own. The logistic regression loss
+looks slightly
+
+752
+01:00:43,500 --> 01:00:58,900
+different, and it's in there, but otherwise, yeah: it's not too
+crazy; very few lines of code suffice for the actual training.
+Everything else that goes into these networks,
+
+755
+01:00:58,900 --> 01:01:11,019
+how you build the full cross-validation pipeline and all the stuff
+that goes on top, is what actually makes these big codebases; but
+the kernel is very simple: compute the layers in the forward pass,
+the backward pass, do an update, and repeat.
+
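+For reference, the widely circulated eleven-line snippet that slide
+is based on looks like this (reproduced from memory, so treat it as
+a sketch rather than the exact slide):
+
+import numpy as np
+
+X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])
+y = np.array([[0,1,1,0]]).T
+syn0 = 2*np.random.random((3,4)) - 1             # first weight matrix
+syn1 = 2*np.random.random((4,1)) - 1             # second weight matrix
+for j in range(60000):
+    l1 = 1/(1+np.exp(-(np.dot(X,syn0))))         # first layer (sigmoid)
+    l2 = 1/(1+np.exp(-(np.dot(l1,syn1))))        # second layer (sigmoid)
+    l2_delta = (y - l2)*(l2*(1-l2))              # gradient on layer 2
+    l1_delta = l2_delta.dot(syn1.T)*(l1*(1-l1))  # backprop into layer 1
+    syn1 += l1.T.dot(l2_delta)                   # update folded into backprop
+    syn0 += X.T.dot(l1_delta)
+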
+758
+01:01:11,019 --> 01:01:29,150
+The one piece I didn't mention is this line: creating the initial
+random weights. You have to start somewhere, so you generate a
+random W. I'll also mention that you'll be training a two-layer
+neural network in this class,
+
+761
+01:01:29,150 --> 01:01:42,789
+so you'll be doing something very similar to this, except you won't
+be using the logistic regression loss and you may have different
+activation functions. But again, my advice when you implement it:
+stage your computation into intermediate results, and then do
+proper
+
+765
+01:01:42,789 --> 01:01:59,940
+backpropagation into every one of those intermediate results. Note
+that these weight matrices also come with biases; don't be misled:
+the eleven-line example doesn't have biases in it, but here you
+will have biases as well.
+
+768
+01:01:59,940 --> 01:02:13,739
+So: compute the first layer, compute the second layer, compute your
+loss, and then go backward through the same stages: backprop into
+the weights of the second layer, backprop into h1, the hidden
+layer, then through the nonlinearity,
+
+771
+01:02:13,739 --> 01:02:32,619
+and then run backprop through to the first weight matrix. Do proper
+staged backpropagation here; otherwise, if you try to write out one
+single expression for the gradient on W1, it will be way too large
+and a series of headaches. So: stages, and backpropagation.
+
+776
+01:02:32,619 --> 01:02:36,119
+That's just a hint.
+
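+That staging advice, sketched for a two-layer ReLU network with
+biases; all the shapes and the stand-in loss gradient here are
+hypothetical, and the point is just the order of the backward
+steps:
+
+import numpy as np
+
+N, D, H, C = 64, 3072, 100, 10     # batch, input, hidden, classes
+X = np.random.randn(N, D)
+W1, b1 = np.random.randn(D, H), np.zeros(H)
+W2, b2 = np.random.randn(H, C), np.zeros(C)
+
+# forward, in stages
+h1 = np.maximum(0, X.dot(W1) + b1) # first layer
+scores = h1.dot(W2) + b2           # second layer
+dscores = np.random.randn(N, C)    # stand-in for the loss gradient
+
+# backward, through the same stages in reverse
+dW2 = h1.T.dot(dscores)            # into second-layer weights
+db2 = dscores.sum(axis=0)
+dh1 = dscores.dot(W2.T)            # into the hidden layer
+dh1[h1 <= 0] = 0                   # through the ReLU
+dW1 = X.T.dot(dh1)                 # into first-layer weights
+db1 = dh1.sum(axis=0)
+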
+777
+01:02:36,119 --> 01:03:09,030
+OK. So that was the presentation of neural networks without any of
+the brain stuff, and it looks pretty simple. Now we're going to
+make it slightly more insane, mostly by folding in all kinds of
+historical motivation about how this came about and how it relates
+to the brain at all. We have neural networks, and we have neurons
+inside these networks; this is what comes up when you do an image
+search for a neuron; your actual biological neurons look less like
+this and more like that.
+
+785
+01:03:09,030 --> 01:03:35,839
+Very briefly, just to give you an idea of where this all comes
+from: you have the cell body, or soma as people like to call it,
+and it has all these dendrites, which are connected to other
+neurons; there's a cluster of other neurons and cell bodies over
+here, and the dendrites are really the appendages that listen to
+them: they're the inputs to this neuron. And it has a single axon
+coming out of the neuron, which carries the output of the
+computation the neuron performs.
+
+791
+01:03:35,840 --> 01:04:02,299
+So usually a neuron receives inputs, many of them, along its
+dendrites, and then the cell can choose to spike: it sends an
+activation potential down the axon, and the axon branches out and
+connects to the dendrites of other neurons downstream. Neurons are
+basically connected to each other through these synapses in
+between, where one neuron's axon meets another neuron's dendrite.
+
+798
+01:04:02,300 --> 01:04:29,840
+So you can come up with a very crude model of a neuron, and it
+would look as follows: here's our neuron, and imagine an axon
+coming over from a different neuron; that neuron connects to this
+one through a synapse, and each one of these synapses has a weight
+associated with it: how much this neuron likes that neuron,
+basically.
+
+804
+01:04:29,840 --> 01:04:56,940
+What happens is: the activity x0 interacts with the synapse
+multiplicatively, so w0 times x0 flows toward the soma, and that
+happens for many inputs; this guy has lots of them, so all the
+w_i x_i flow in, and the cell body, which also keeps an offset, a
+bias b, adds everything up,
+
+808
+01:04:56,940 --> 01:05:15,420
+and the sum is passed through an activation function, whose output
+travels out along the axon. In this biological model, historically,
+people liked to use the sigmoid as the activation function, really
+because you get a number between 0 and 1, and you can interpret
+that as the rate at which the neuron is firing for that particular
+input:
+
+813
+01:05:15,420 --> 01:05:33,139
+a rate between 0 and 1 coming out of the activation function. So if
+this neuron sees something it likes, from the neurons it's
+connected to, it starts spiking a lot, and the amount is described
+by the firing rate. That's the crude model of a neuron; if I wanted
+to implement it,
+
+817
+01:05:33,139 --> 01:06:16,650
+its forward-pass function would look something like this: it
+receives some inputs, a vector; we compute the cell body sum, the
+weighted sum plus the bias; we put that through a sigmoid to get
+the firing rate, and we return the firing rate, which can then plug
+into other neurons. You'll notice this looks very similar to a
+linear classifier: every single neuron in this model is really a
+small linear classifier with a nonlinearity on top; but these
+little classifiers plug into each other, and working together they
+can do quite interesting things.
+
+825
+01:06:16,650 --> 01:06:49,720
+Now, one note about neurons: be careful with this analogy, because
+biological neurons are super complex. If you go around saying that
+neural networks work like the brain, people who know the
+neuroscience will start frowning at you, and that's because neurons
+are complex dynamical systems: there are many different types of
+neurons and they work differently; the dendrites can perform lots
+of interesting computation, and there's a good review article on
+dendritic computation; the synapses are complex dynamical systems
+too, not just a single weight; and we're not really sure the brain
+uses rate codes.
+
+834
+01:06:49,719 --> 01:07:28,579
+So it's a very crude mathematical model; don't push the analogy too
+much, but it is good for media pieces and such, and that's why this
+keeps coming back up: we'd like to say that it works like your
+brain; but I won't go too deep into that. Going back to a question
+from earlier: there's a whole set of nonlinearities we can choose
+from. Historically, the sigmoid was used quite a bit, and we'll go
+into much more detail on what these nonlinearities are, what the
+tradeoffs are, and why you might use one or another; for now, just
+know there are many to choose from.
+
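+Roughly the slide's neuron sketch, tidied into runnable numpy (the
+class shape is illustrative):
+
+import numpy as np
+
+class Neuron(object):
+    def __init__(self, weights, bias):
+        self.weights = weights    # one weight per input synapse
+        self.bias = bias
+
+    def forward(self, inputs):
+        cell_body_sum = np.sum(inputs * self.weights) + self.bias
+        firing_rate = 1.0 / (1.0 + np.exp(-cell_body_sum))  # sigmoid
+        return firing_rate        # can feed into other neurons
+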
+844
+01:07:28,579 --> 01:07:45,679
+Historically, people used the tanh quite a bit; as of roughly 2012,
+the ReLU became really quite popular: it makes your networks train
+quite a bit faster, so right now, if you want a default choice of
+nonlinearity,
+
+847
+01:07:40,429 --> 01:08:01,789
+ReLU is the current default recommendation. Then there are a few
+other activation functions here: maxout was proposed a few years
+ago, which is interesting, and very recently the ELU. So you can
+come up with different activation functions and describe why they
+might work better or not; this is an active area of research:
+trying to find activation functions with better properties one way
+or
+
+852
+01:08:01,789 --> 01:08:15,980
+another. We'll go into much more detail on these soon in the class.
+So now that we have a choice of activation function, these silly
+things,
+
+855
+01:08:15,980 --> 01:08:35,449
+we arrange these neurons into neural networks: we just connect them
+together so they can talk to each other. So here's an example.
+When you want to count the number of layers in a neural network,
+you count the layers that have weights; the input layer doesn't
+count as a layer, because
+
+860
+01:08:35,449 --> 01:08:54,750
+its values don't actually do any computation. So here we have two
+layers with weights, a two-layer neural network, and we call these
+layers fully-connected layers. And what I showed you before, the
+single neuron with its weights and nonlinearity, is one little
+piece of such a network.
+
+864
+01:08:54,750 --> 01:09:10,139
+The reason we arrange neurons into layers is that it makes the
+computation much more efficient: instead of an amorphous blob of
+neurons, each of which has to be computed independently, having
+them in layers allows us to use vectorized operations, so we can
+compute an entire set of
+
+869
+01:09:10,140 --> 01:09:25,519
+neurons, a whole hidden layer, as just one matrix multiply. That's
+why we arrange them in layers: neurons within a layer can all be
+evaluated in parallel, all doing the same kind of thing; it's a
+computational trick. And this one is a three-layer neural network,
+
+873
+01:09:25,520 --> 01:09:40,520
+and this is how you'd compute it: just a bunch of matrix
+multiplies, each followed by an activation function, one after
+another. So now I'd like to show you a demo of how these neural
+networks work.
+
+876
+01:09:40,520 --> 01:09:58,109
+It'll take a moment to load, but basically this is an example of a
+two-layer neural network on a binary classification task: two
+classes, points in two dimensions, red and green, and I'm drawing
+the decision boundary learned by the neural network. What you see
+is:
+
+880
+01:09:58,109 --> 01:10:12,290
+when I train a neural network with more hidden neurons on this
+data, the more hidden neurons I have, the crazier the functions it
+can compute.
+
+882
+01:10:12,290 --> 01:10:27,050
+It also shows the regularization strength: when you insist, with
+strong regularization, that large W's are heavily penalized, you
+end up with very small weights and very smooth functions: not a lot
+of wiggle, so not a lot of variance. As you decrease
+
+886
+01:10:27,050 --> 01:10:47,079
+the regularization, these neural networks can do more and more
+complex things: they can get in there and capture every little
+squeezed-in cluster of your training data. So let me show you what
+this actually looks like during training.
+
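+The vectorization point in code: each hidden layer is one matrix
+multiply, so a full three-layer forward pass is just this (shapes
+and the sigmoid choice are illustrative, biases omitted):
+
+import numpy as np
+
+f = lambda s: 1.0 / (1.0 + np.exp(-s))   # activation function
+
+x = np.random.randn(3072, 1)             # input column vector
+W1 = np.random.randn(100, 3072)
+W2 = np.random.randn(100, 100)
+W3 = np.random.randn(10, 100)
+
+h1 = f(W1.dot(x))     # entire first hidden layer in one multiply
+h2 = f(W2.dot(h1))    # second hidden layer
+out = W3.dot(h2)      # class scores
+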
+891
+01:10:47,079 --> 01:11:04,060
+So, first, there's some stuff set up here that lets you play with
+this yourself, since it's all JavaScript. Here we have six neurons,
+and what we're doing is binary
+
+894
+01:11:04,060 --> 01:11:18,080
+classification on the circle dataset: a little cluster of green
+points surrounded by red points, and we train the neural network to
+separate the green dots from the red dots. If I restart the
+network, it starts from a random W
+
+897
+01:11:18,079 --> 01:11:33,909
+and actually converges to a decision boundary that classifies the
+data. The cool part is shown on the right: one interpretation of
+what's happening here, which I'll take away in a moment, showing
+how the space is being worked over by the neural network. You can
+interpret
+
+901
+01:11:33,909 --> 01:11:51,920
+what the network is doing as using its hidden layer to transform
+the input data in such a way that the first hidden layer's
+representation can be handled by a linear classifier; and the
+second layer is in fact exactly a linear classifier
+
+905
+01:11:51,920 --> 01:12:19,149
+sitting on top of the first layer, so it can put a plane through
+the transformed space. The network arranges the space so that this
+works out, so a plane can separate the points. Looking again at
+what happens during training: the network warps the space so the
+data can be carved apart with a line; people sometimes call this
+something like space folding: if you change the representation of
+the data, the two classes become linearly separable, OK?
+
+911
+01:12:19,149 --> 01:12:36,869
+Now, a question. Here we have six neurons in the intermediate
+layer, and they allow this separation; you can actually see those
+six neurons, roughly, as these lines: each of these lines is like
+the function of one of the neurons. So here's my question:
+
+915
+01:12:36,869 --> 01:12:51,889
+what is the minimum number of neurons for which this dataset is
+separable with a neural network, if I want it classified properly;
+at minimum?
+
+918
+01:12:51,890 --> 01:13:34,739
+So the answers coming in are three and four. The way this worked:
+one cut went this way, one went that way, one around here went that
+way; each neuron makes one cut of the plane, and then there's an
+additional layer doing a weighted sum. So in fact the lowest number
+that would work here is three: with three neurons,
+
+923
+01:13:34,739 --> 01:13:57,850
+one plane, a second plane, a third plane, three linear functions
+with a nonlinearity, you can basically carve out the space with
+three lines, and the second layer can just combine them. If the
+number were two, I'm fairly sure it would break, because two lines
+are not enough;
+
+927
+01:13:57,850 --> 01:14:14,599
+basically, something like this happens: the network finds the
+optimal way to use just those two lines, making this kind of
+tunnel, and that's the best it can do.
+
+930
+01:14:14,600 --> 01:14:52,130
+Question: what if you use many more neurons? With many more, I
+think you can do it, and I think you'd see sharper boundaries,
+because now you have
+
+932
+01:14:52,130 --> 01:15:12,649
+regions where more than one of the neurons is active; so you'd end
+up with, I think, still roughly three lines, but with some of the
+corners carved in where several neurons are active, and the weights
+would be kind of funky; you'd have to think it through. OK, so
+let's change this to twenty, say, twenty here,
+
+937
+01:15:12,649 --> 01:15:22,390
+and let's look at a different dataset, like the spiral, so there's
+a lot more going on in the space. You can see how this thing just
+goes there as it's doing these updates,
+
+939
+01:15:22,390 --> 01:15:39,880
+and it figures it out; even on very simple data without the circle
+it just runs, and you can see it kind of goes and covers the green
+and the red.
+
+941
+01:15:39,880 --> 01:16:05,270
+And yeah, let me break it now: with a smaller number of neurons,
+this starts working worse, you can see; you just don't have enough
+capacity, enough wiggle, to separate this data. So, play with this
+in your free time.
+
+945
+01:16:05,270 --> 01:16:19,149
+So, summary: we arrange these neurons into neural networks, in
+fully-connected layers; they're the computational graphs we saw,
+and we looked at how gradients flow through them; and bigger is
+better, as we'll soon see.
+
+948
+01:16:19,149 --> 01:16:36,899
+Before we finish, let me take questions; I think we have two more
+minutes. Yes, thank you.
+
+951
+01:16:36,899 --> 01:17:06,269
+The question: is it always better to have more neurons in a neural
+network? The answer is yes: more is basically always better in
+terms of capacity; it always works better, but you have to be
+careful to regularize it properly. The correct way to constrain
+your network so it doesn't overfit your data is not to make the
+network smaller; the correct way is to increase the regularization;
+
+958
+01:17:06,270 --> 01:17:19,980
+so you always want to use as large a network as you can, but make
+sure to regularize it properly; the only reason we don't is
+computational: we don't have time to train forever. Question: is
+the regularization the same across the network?
+
+960
+01:17:19,979 --> 01:17:33,809
+Usually you'll see it simplified that way: most often, networks
+trained in practice are regularized the same way throughout; but
+you don't necessarily have to.
+
+963
+01:17:33,810 --> 01:18:01,970
+Question: do you ever use second-order optimizers on these
+networks? Sometimes, when the dataset is small, you can use things
+like L-BFGS, a second-order method, which I'll go into later; but
+when the dataset is really large it doesn't work very well: you
+can't do L-BFGS over your millions of examples, and L-BFGS is not
+very good with mini-batches, so you basically always fall back to
+the first-order methods.
+
+969
+01:18:01,970 --> 01:18:25,219
+Question about how to size the network for the assignment:
+unfortunately I don't have a good answer for that; depth is good,
+but maybe beyond, say, ten layers on a simple dataset it may not
+add too much. I can still take one more minute, so:
+
+972
+01:18:25,220 --> 01:19:01,670
+the question was about the tradeoff in where you allocate your
+capacity: do I want the network to be deeper, or do I want it to be
+wider? There isn't a very good general answer; usually, and
+especially with images, we find that more layers are critical; but
+sometimes, for simple datasets, depth and things like it don't
+matter as much; it's somewhat data-dependent.
+
+978
+01:19:01,670 --> 01:19:22,389
+Question: different activation functions for different layers?
+Usually not: typically people just pick one and go with it
+throughout; no real benefit has been seen from switching them
+around, so people don't really play with that, though nothing
+prevents you in principle.
+
+982
+01:19:22,390 --> 01:19:31,238
+So, it's 4:20, so we're going to end here; but we'll see many more
+neural networks, so a lot of these questions will get addressed as
+we go through them.
+
diff --git a/captions/Ko/Lecture5_ko.srt b/captions/Ko/Lecture5_ko.srt
new file mode 100644
index 00000000..c824f870
--- /dev/null
+++ b/captions/Ko/Lecture5_ko.srt
@@ -0,0 +1,4280 @@
+1
+00:00:00,000 --> 00:00:11,109
+most of you will have it mostly done, with some parts unfinished,
+but that's OK; and I can hold make-up office hours, right.
+
+3
+00:00:11,109 --> 00:00:24,580
+After this class, the assignment is due to be released tomorrow or
+the day after: we haven't completely finished putting it together,
+we're still working on it; we're changing it from last year, so
+it's in the process of development, and we hope to have it out as
+soon as possible.
+
+7
+00:00:24,579 --> 00:00:36,039
+You'll want to start on it as soon as it's out; and once it's
+announced, the deadline or something may still get adjusted,
+because this one is
+
+8
+00:00:36,039 --> 00:00:48,929
+slightly larger, yeah; so we may shuffle some of these things
+around. Also, the way the materials are graded may change on the
+fly, because we're still figuring out the course; a relatively
+large part of it is new, and a lot of it is changing.
+
+12
+00:00:48,929 --> 00:01:05,890
+Before we start, just a heads up: the project proposals are due in
+about ten days, by the way, and I wanted to blow through a few
+points, because you'll be thinking about your projects and about
+what makes a good project, and some of you may have misconceptions
+about what makes a bad
+
+16
+00:01:05,890 --> 00:01:21,450
+project. The most common one is probably that people are hesitant
+to work with small datasets, because they think these networks
+require a huge amount of training data: there are hundreds of
+millions of parameters in there, and they need training; but in
+fact
+
+20
+00:01:21,450 --> 00:01:32,188
+for a project, that's not something you need to stress out about.
+The reason you can work with small datasets and it's fine is
+something we'll go into in much more detail later in this course,
+
+23
+00:01:32,188 --> 00:01:43,729
+a thing called fine-tuning. You almost never train these giant
+convolutional networks from scratch; you almost always do this
+pretraining and fine-tuning approach, and the way it works
+
+26
+00:01:43,728 --> 00:01:58,430
+is almost always like this: you take a convolutional network
+trained on some large dataset, say a large image dataset like
+ImageNet, and if you're interested in some other dataset over
+there, you can take the network trained here
+
+29
+00:01:58,430 --> 00:02:09,000
+on the big dataset and transfer it over; it works the same way as
+before. So here's a schematic of a convolutional network, the kind
+we've started talking about for images: you go through a series of
+layers down to a classifier;
+
+33
+00:02:09,000 --> 00:02:20,129
+we haven't talked about the specific layers yet, of course, but:
+we take the network, pretrained on ImageNet for a while, and then
+we chop off the top layer, chop off the classifier,
+
+36
+00:02:20,129 --> 00:02:30,739
+take the whole remaining network as a fixed feature extractor, and
+put that feature extractor on top of your new dataset; you just
+swap in a different layer on top that does the classification.
+
+39
+00:02:30,739 --> 00:02:47,229
+Then, depending on how much data you have, you either train just
+your own last layer, or you actually backpropagate into the
+network, which is fine-tuning; and the more data you get, the
+deeper you backpropagate into the network.
+
+43
+00:02:47,229 --> 00:02:58,939
+And there's a huge line of these available: people do this for
+you; someone trains a network for weeks on some dataset, and then
+they upload the network's weights online.
+
+46
+00:02:58,939 --> 00:03:09,310
+This is called the Model Zoo, for example: these are all networks
+that have already been pretrained on large datasets, with lots of
+learned parameters, that you can just grab and swap around.
+
+49
+00:03:09,310 --> 00:03:20,500
+So if you don't have a lot of data, that's OK: you basically find
+a pretrained network, and you fine-tune it on your data, and that
+will work; so don't be afraid to work with small datasets.
+
+52
+00:03:20,500 --> 00:03:41,780
+The second thing, where we saw some problems last year, is that
+people sometimes act as if they had infinite compute: being overly
+ambitious, proposing things that just aren't feasible in the time
+you have. You don't have that many GPUs to train on, and you're
+going to have to do hyperparameter optimization, so there are
+several things to worry about here. We had several
+
+57
+00:03:41,780 --> 00:03:59,949
+projects last year where people proposed training on very large
+datasets, and you don't have the time for that, so keep it in
+mind; yeah, you'll get a better sense of what the computational
+constraints are as we go through the class.
+
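+In spirit, the fixed-feature-extractor recipe comes down to
+something like this toy numpy sketch; the feature matrix stands in
+for activations you'd get out of a pretrained network (for example
+one pulled from the Model Zoo), and every shape here is made up:
+
+import numpy as np
+
+N, F, C = 500, 4096, 5                   # examples, feature dim, classes
+feats = np.random.randn(N, F)            # frozen pretrained features
+labels = np.random.randint(C, size=N)
+
+W = 0.01 * np.random.randn(F, C)         # the new, swapped-in top layer
+for step in range(200):
+    scores = feats.dot(W)
+    scores -= scores.max(axis=1, keepdims=True)
+    probs = np.exp(scores)
+    probs /= probs.sum(axis=1, keepdims=True)
+    dscores = probs
+    dscores[np.arange(N), labels] -= 1   # softmax loss gradient
+    W -= 1e-3 * feats.T.dot(dscores) / N # train only the classifier
+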
+60
+00:03:59,949 --> 00:04:07,068
+OK, so we'll dive into the lecture; are there any administrative
+things I may have left out that you'd like to ask about? OK, good.
+So today we're going to dive into quite a bit of material.
+
+63
+00:04:07,068 --> 00:04:16,750
+Just a reminder of where we are: we're talking about training
+neural networks. Training a neural
+
+65
+00:04:16,750 --> 00:04:25,079
+network is basically a four-step process, and it's simple: one,
+you sample a batch of data from your dataset; two, you forward it
+through the network to compute the loss;
+
+68
+00:04:25,079 --> 00:04:33,529
+three, you backpropagate to compute your gradients; and four, you
+do a parameter update, tweaking your weights slightly in the
+direction of the negative gradient. And as you keep
+
+70
+00:04:33,529 --> 00:04:42,990
+repeating this process, what it really comes down to is
+optimization: the problem is to converge, in weight space, to a
+region of the space where we have low loss, which means we're
+correctly classifying our training set.
+
+73
+00:04:42,990 --> 00:04:54,699
+We saw that these can be very large; I flashed images of those
+huge computational graphs, and we need to backpropagate through
+them. So we talked about some of the intuitions of
+backpropagation, which is really just a recursive application of
+
+77
+00:04:54,699 --> 00:05:05,110
+the chain rule, from the back of the circuit to the front,
+chaining gradients through all the local operations. We saw some
+implementations of this and how it can be done quickly: both at
+the level of the computational graph, the forward / backward API,
+and
+
+80
+00:05:05,110 --> 00:05:18,750
+at the level of the nodes, which implement the same API for
+forward propagation and backpropagation. We saw concrete examples
+of this in Torch and Caffe,
+
+82
+00:05:18,750 --> 00:05:30,280
+and I drew the analogy that this is like Lego blocks: these layers
+or gates are the little blocks you build networks out of, and the
+framework is the interconnection system. Then we talked about
+neural networks, first
+
+85
+00:05:30,279 --> 00:05:41,800
+without the brain stuff; basically, what it amounts to is that
+we're making the function from images to class scores more complex
+as the class goes on; and then we saw how this connects, in terms
+of the brain analogies,
+
+88
+00:05:41,800 --> 00:05:54,959
+to neurons and so on. So that's where we stopped, and that's
+roughly where we are; today we're going to talk about this class
+of networks and the process of training them effectively.
+
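+The four-step loop just recapped, as a tiny runnable stand-in (a
+least-squares toy instead of a real network, just to show the
+shape of the loop):
+
+import numpy as np
+
+X = np.random.randn(1000, 50)        # toy dataset
+y = np.random.randn(1000)
+W = np.zeros(50)
+
+for it in range(100):
+    idx = np.random.choice(1000, 64) # 1. sample a batch of data
+    xb, yb = X[idx], y[idx]
+    err = xb.dot(W) - yb
+    loss = 0.5 * np.mean(err ** 2)   # 2. forward: compute the loss
+    dW = xb.T.dot(err) / len(idx)    # 3. backward: the gradient
+    W -= 1e-2 * dW                   # 4. update: step downhill
+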
10이었다 + +104 +00:06:54,439 --> 00:06:57,459 + 나의 새로운 단계 기능은 당신이이 미분 아니라는 것을 알 수 있습니다 + +105 +00:06:57,459 --> 00:07:01,649 + 작업은 그래서 그들은 비용 사실이 통해 전파 백업 할 수 없었다 + +106 +00:07:01,649 --> 00:07:04,139 + 교육 신경 네트워크를위한 역 전파의 훨씬 나중에 올 필요 + +107 +00:07:04,139 --> 00:07:08,169 + 그래서 그들은 이러한 이진 단계적으로 기능 퍼셉트론 그들은 함께했다 + +108 +00:07:08,170 --> 00:07:12,449 + 이러한 학습 규칙에 와서 그래서 이것은 임시 지정의 종류 + +109 +00:07:12,449 --> 00:07:17,110 + 가중치를 쥐게 규칙을 학습하는 것은에서 원하는 결과를 만들려면 + +110 +00:07:17,110 --> 00:07:22,240 + 퍼셉트론 일치하는 진정한 욕망의 진정한는 균형을하지만, 거기에 아무 + +111 +00:07:22,240 --> 00:07:25,490 + 손실 함수의 개념은 역 전파 그의 DS DS 광고의 개념이 없었다 + +112 +00:07:25,490 --> 00:07:28,949 + 당신이 그들을 볼 때 그들이 종류의 거의 배경을 특별 규칙 만 + +113 +00:07:28,949 --> 00:07:32,779 + 이 때문에 미분하지 않고 스텝 기능의 종류의 재미 + +114 +00:07:32,779 --> 00:07:36,809 + 다음 사람들은 매들린의 출현으로 1960 년에 있으므로이를 중지 시작 + +115 +00:07:36,810 --> 00:07:42,110 + 우드로 매들린이 충분히들은 것 같은이 퍼셉트론을 시작했고, + +116 +00:07:42,110 --> 00:07:46,470 + 제 다층 퍼셉트론 망로 물건 이는 여전히 + +117 +00:07:46,470 --> 00:07:51,980 + 모든 전자와 LG에서 수행 실제로 포터로부터 구축 + +118 +00:07:51,980 --> 00:07:55,830 + 하지만 여전히이이 모든 규칙을했다이 경우에는 다시 전파가 없습니다 + +119 +00:07:55,829 --> 00:07:59,060 + 그들이 측면에서 가지고 올 것을의 그것을 뒤집기 시도에 대해 생각하기 추천하고 + +120 +00:07:59,060 --> 00:08:02,949 + 더 나은 여부를 작동하고 가지의 더보기가 없었다 경우보고 + +121 +00:08:02,949 --> 00:08:06,430 + 이 시점에서 역 전파 등 약 1960년 사람들은 매우있어 + +122 +00:08:06,430 --> 00:08:09,560 + 흥분과 회로를 구축하고 그들은 당신이 갈 수 알고 있다고 생각 + +123 +00:08:09,560 --> 00:08:12,930 + 정말 지금까지 우리는 내용이 회로는 것을 기억해야 할 수 있습니다 + +124 +00:08:12,930 --> 00:08:17,829 + 그때 프로그래밍의 개념은 매우 명시했다 당신은 일련의 쓰기 + +125 +00:08:17,829 --> 00:08:20,689 + 컴퓨터에 대한 지침이 사람들이 생각하는이 처음이다 + +126 +00:08:20,689 --> 00:08:24,379 + 이러한 종류의 데이터는 접근 방식을 기반에 대해 당신은 회로의 일종 곳 + +127 +00:08:24,379 --> 00:08:29,019 + 이 배울 수 있고, 그래서 이것은 시간 사람들이 큰 개념적 도약에 있었다 + +128 +00:08:29,019 --> 00:08:33,179 + 실제로 작업 끝나지 이러한 네트워크에 대한 매우 흥분 + +129 +00:08:33,179 --> 00:08:37,528 + 잘 바로 1964 예 측면에서 그들은 흥분을 통해 약간있어 + +130 +00:08:37,528 --> 00:08:41,088 + 이상은 약속​​과 약간 아래에 따라서 기간 동안 전달 + +131 +00:08:41,089 --> 00:08:45,660 + 열 아홉 칠십의 실제 현장에서 매우 조용하고 많은 아니었다 + +132 +00:08:45,659 --> 00:08:52,958 + 연구는 다음 부스트 사실에 대한 대략 1986 년에 와서 완료되었습니다 + +133 +00:08:52,958 --> 00:08:57,179 + 1천9백86명이 기본적으로 자신이 처음입니다이 영향력있는 논문이었다 + +134 +00:08:57,179 --> 00:09:03,069 + 당신이 잘되게 형식으로 규칙 등의 전파를 다시 볼 시간과 + +135 +00:09:03,070 --> 00:09:07,910 + 그래서 이것은 (10)와 윌슨에 정말 열심히 그리고 그들은 여러 층으로 연주했다 + +136 +00:09:07,909 --> 00:09:11,129 + 퍼셉트론 즈 (Perceptrons) 그리고 당신이 우리가 실제로 볼 수있는 종이에 갈 때이 처음이다 + +137 +00:09:11,129 --> 00:09:13,879 + 이 시점에서 다시 전파 등처럼 보이는 뭔가 그들이 + +138 +00:09:13,879 --> 00:09:17,830 + 이미 임시 규칙이 아이디어를 폐기 정말 자물쇠가 + +139 +00:09:17,830 --> 00:09:20,589 + 기능 등 전파 그라데이션 하강에 대해 다시 이야기하고 + +140 +00:09:20,589 --> 00:09:25,390 + 그들이 있다고 생각하기 때문에 그래서이 시간 사람들은 1986 년에 다시 흥분 + +141 +00:09:25,389 --> 00:09:30,610 + 그들은 지금 스키의 주요 좋은 신용 할당 종류를했다 + +142 +00:09:30,610 --> 00:09:35,000 + 역 전파 그들이 네트워크를 훈련 할 수있는 문제가 있었다 불행히도 + +143 +00:09:35,000 --> 00:09:37,690 + 그들은 이러한 네트워크를 확장하려고 할 때 그들을 깊이 이상 만드는 것을 + +144 +00:09:37,690 --> 00:09:41,089 + 그들이 할 수있는 다른 것들 중 일부에 비해 매우 잘 작동하지 않았다 + +145 +00:09:41,089 --> 00:09:44,620 + 당신의 기계 학습 도구 키트 그래서 그들은 그냥 아주 좋은을 포기하지 않았다 + +146 +00:09:44,620 --> 00:09:49,339 + 이 시간과 훈련의 결과는 박히과 경쟁했다 + +147 +00:09:49,339 --> 00:09:52,170 + 기본적으로 아주 잘 작동하지 특히​​ 그는 크게하고 싶어 + +148 +00:09:52,169 --> 00:09:56,199 + 네트워크는이 사실 거의 이십년 곳의 경우와 + +149 +00:09:56,200 --> 00:09:58,940 + 어떻게 든이 없었기 때문에 다시 자신의 작품에 적은 연구가 있었다 + +150 +00:09:58,940 --> 00:10:04,370 + 연구는이 때문에 2006 년 아주 잘 작동하고 당신은 훈련 할 수있다 + +151 +00:10:04,370 --> 
+153
+00:10:14,190 --> 00:10:29,230
+What they did here, basically: this was the first time we could
+actually train these deeper networks properly. Instead of training
+all the layers of, say, a ten-layer neural network with
+backpropagation in a single pass, they used an unsupervised
+pre-training scheme, with what's called a Restricted
+
+157
+00:10:29,230 --> 00:10:42,959
+Boltzmann Machine. What this amounts to is: you train your first
+layer using an unsupervised objective; then you train a second
+layer on top of it; then a third, then a fourth; and once you've
+trained all of them, you put them together, and then you start
+backpropagation:
+
+161
+00:10:42,958 --> 00:10:56,250
+the fine-tuning step comes second, so it was a two-step process:
+first step by step through the layers, and then put together and
+backpropagated. And that worked; so this was the first time
+backpropagation worked, but it basically needed this
+initialization from the unsupervised pre-training;
+
+165
+00:10:56,250 --> 00:11:03,680
+otherwise, from scratch, they were out of luck, and we'll see in
+this lecture why it's kind of tricky to get these networks to
+train
+
+167
+00:11:03,679 --> 00:11:14,199
+from scratch using just backprop; you really have to think about
+it. It actually turned out later that you don't need the
+pre-training process: you can go directly with backprop, but you
+have to be very careful with the initialization;
+
+170
+00:11:14,198 --> 00:11:29,250
+and they were using sigmoids, which, as we'll see, are just not a
+good option. So basically, backprop works, but you have to be
+careful how you use it. That was 2006; a bit more research
+followed, and the area
+
+173
+00:11:29,250 --> 00:11:39,610
+kind of came back, rebranded as deep learning; it's really still
+synonymous with neural networks, but it's a better word for the
+marketing.
+
+175
+00:11:39,610 --> 00:11:48,940
+Basically, at this point, I think things started working properly
+and people could actually train networks; but still not too many
+people were paying attention. When
+
+177
+00:11:48,940 --> 00:12:01,078
+people really started paying attention, I think, was around 2010
+and 2012. In 2010 in particular there were the first really big
+results, where neural networks worked really well compared to
+everything else you had in your machine learning toolkit,
+
+181
+00:12:01,078 --> 00:12:21,068
+kernel methods and so on; and this was specifically in speech
+recognition. They had this GMM-HMM framework, and they swapped out
+a part of it for a neural network, and in 2010 that gave a large
+improvement; this was work done at Microsoft, so people started
+paying attention,
+
+185
+00:12:21,068 --> 00:12:36,039
+because it was the first time a really big improvement came out of
+this line of work. And then we saw it again in 2012, where it
+played out even more dramatically, in the domain of visual
+recognition and computer vision:
+
+188
+00:12:36,039 --> 00:12:58,370
+the 2012 network from Hinton's group basically crushed all the
+competition, all the feature-based pipelines, with a really big
+improvement coming from these neural networks; and that's when
+people really started paying attention, and then came this kind of
+explosion of the field; there's a lot of activity in this area
+now.
+
+193
+00:12:58,370 --> 00:13:04,589
+As for why it only started to work around 2010: I'll go into a bit
+more detail later;
+
+195
+00:13:04,589 --> 00:13:19,710
+it's a combination of things, but I think we figured out better
+ways of initializing these things and getting them to work, better
+activation functions; and we had GPUs, and we had much more data.
+The stuff before just didn't work because, at that point,
+
+199
+00:13:19,710 --> 00:13:34,700
+it was mostly a matter of compute, data, and a few tweaks of the
+ideas. So that's the rough history: we basically went through
+cycles of over-promising and under-delivering, and now it seems
+like things actually work
+
+202
+00:13:34,700 --> 00:13:49,139
+really well; so that's where we are at this point. OK, so I'm
+going to dive into the details now, and we'll see exactly what
+works and how to train these networks properly. The overview of
+what we're going to cover over the course of the next few lectures
+is
+
+206
+00:13:49,139 --> 00:14:05,659
+a whole bunch of fairly independent things, so I may just be
+peppering you with all these little areas we need to understand,
+and what people have found works in each of them; we'll go through
+all of them, and all the tradeoffs, for how to actually properly
+train neural networks on real datasets.
+
+210
+00:14:05,659 --> 00:14:14,450
+First, we're going to talk about activation functions; I promised
+this a lecture or so ago.
+
+212
+00:14:14,450 --> 00:14:31,459
+On top of the usual linear function in a neuron, we saw that there
+are many different proposals for what these activation functions
+could look like; we'll go through what they look like, and how to
+think about the desirable properties of an activation function.
+
+216
+00:14:31,458 --> 00:14:51,120
+The one used the most, historically, is the sigmoid nonlinearity,
+which looks as follows: it takes a real-valued number and squashes
+it to be between 0 and 1. The first problem with the sigmoid, as
+was pointed out a few lectures ago, is a saturation problem:
+
+221
+00:14:51,120 --> 00:15:08,679
+neurons whose output is either very close to zero or very close to
+one are saturated, and saturated neurons kill gradients during
+backpropagation. I'd like to expand on what exactly this means;
+it's something that contributes to what we'll call the vanishing
+gradient problem. So let's look at a sigmoid gate in a circuit:
+
+225
+00:15:08,679 --> 00:15:27,569
+it receives some input x, and its output goes off into the rest of
+the circuit; we want to backpropagate through the sigmoid gate
+using the chain rule, so the gradient on x is the local gradient,
+d sigma by dx, times the gradient arriving from the end. Think
+about what happens when
+
+230
+00:15:27,568 --> 00:15:56,578
+the sigmoid gate receives an input of minus 10, or 0, or 10, from
+whatever computation came before it, and it's getting some
+gradient from the top: what is the effect on the gradient
+backpropagating through the circuit in each of these cases?
+
+234
+00:15:56,578 --> 00:16:14,370
+Right, so the problem is that the gradient becomes very low: when
+x is minus 10, or 10, look at this local gradient here, which gets
+multiplied with the gradient from above; the local gradient,
+d sigma by dx, at minus 10 is basically zero, because the sigmoid
+is flat there; the slope is zero, so the gradient
+
+238
+00:16:14,370 --> 00:16:31,258
+flowing onward will also be near zero. So gradient flows in from
+above, but if the neuron was saturated, the local gradient was
+basically zero, and the gradient gets killed: whatever large
+signal was coming through gets multiplied by a tiny number and
+stops.
+
+242
+00:16:31,259 --> 00:16:57,049
+You can imagine that if you have a large network of sigmoid
+neurons, and many of them are in a saturated regime, at either 0
+or 1, then gradients can't backpropagate through the network:
+they get stopped wherever a neuron is sitting in saturation.
+Gradients only flow while you're in the safe zone, in what we call
+the active region of the sigmoid. So that's one problem, and we'll
+see much more about it soon.
+
+249
+00:16:57,049 --> 00:17:14,658
+The second problem with the sigmoid is that its output is not
+zero-centered. We'll talk about preprocessing soon, and there you
+always want to make sure your data is zero-centered; but in this
+case, suppose you have a big network with several layers of
+sigmoid neurons: their outputs are
+
+253
+00:17:14,659 --> 00:17:31,169
+values between 0 and 1, so not centered at zero; and we're
+basically stacking these little linear classifiers on top of each
+other. About the problem with non-zero-centered outputs, I'll just
+try to give you a bit of intuition for what goes wrong.
+
+257
+00:17:31,170 --> 00:18:00,960
+Just look at a neuron: it computes this function, a weighted sum,
+w times x plus b. What can we say about the gradients on w during
+backpropagation if all the inputs x to this neuron are positive,
+all between 0 and 1, say because you're somewhere deep in a
+network of sigmoids? If all the x's are positive, what can you say
+about the gradients on the weights?
+
+262
+00:18:00,960 --> 00:18:28,440
+Right: the gradients on the weights are constrained; they're
+either all positive or all negative. The gradient flows in from
+the top, and if you think about the expression for the gradient on
+every w, it's basically x times the gradient on the neuron's
+output from above; so if all the x's are positive,
+
+266
+00:18:28,440 --> 00:18:45,099
+all the gradients on w get the sign of the incoming gradient: all
+positive, or all negative. So you end up in a situation where,
+say you have two weights: the updates on those two weights are
+either both positive or both negative;
+
+270
+00:18:45,099 --> 00:19:02,058
+the problem is that this constrains the kinds of updates you can
+make, and you can end up with undesirable, zigzagging paths: if
+the direction you actually want to move in requires one weight to
+go up and the other to go down, you can only get to some parts of
+the space by zigzagging.
+
+274
+00:19:02,058 --> 00:19:22,959
+This is a slightly hand-wavy reason, just to give you the
+intuition, but you can see it empirically: when you train with
+things that are not zero-centered, you observe slower convergence.
+And if you actually want to go much deeper into this, there are
+people who talk about natural gradients and the mathematical
+formulation, and it gets more complex than this; but I just want
+to give you the intuition:
+
+280
+00:19:22,960 --> 00:19:30,450
+you want zero-centered things, at the input and throughout, for
+things to work nicely. So that's the second disadvantage.
+
+282
+00:19:30,450 --> 00:19:39,099
+The third: the exp() inside the sigmoid's expression is kind of
+expensive to compute, compared to some of the alternatives.
+
+284
+00:19:39,099 --> 00:19:55,509
+That's just a small detail: when you actually train these big
+convolutional networks, most of the compute time is in the
+convolutions, the dot products, not in this exp; so it's a small
+contribution, but it's still a bit of a disadvantage compared to
+the other choices.
+
+289
+00:20:00,710 --> 00:20:18,700
+So the next one tries to fix these problems in particular: the
+tanh. Around the 1990s, in 1991, a very good paper was written on
+how to optimize networks; I link to it from the lecture notes, and
+it recommends using this kind of nonlinearity.
+
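+Going back to the saturation question above, checked numerically
+(the minus 10 / 0 / 10 inputs are the ones from the question):
+
+import numpy as np
+
+def sigmoid(x):
+    return 1.0 / (1.0 + np.exp(-x))
+
+for x in (-10.0, 0.0, 10.0):
+    s = sigmoid(x)
+    local_grad = s * (1 - s)    # d(sigma)/dx
+    # any gradient from above gets multiplied by local_grad, so at
+    # plus or minus 10 the backward flow is essentially killed
+    print(x, local_grad)        # ~4.5e-05, 0.25, ~4.5e-05
+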
+246
+00:16:48,230 --> 00:16:57,049
+Gradients only flow while you stay in what we might call the active region of the sigmoid. So that is the first problem; we will see more on it shortly.
+
+248
+00:16:57,049 --> 00:17:31,169
+A second problem with the sigmoid is that its outputs are not zero-centered. We will talk about preprocessing soon, and you always want zero-centered data; but suppose you have a big network with several layers of sigmoid neurons: their outputs, which lie between 0 and 1, feed the layers stacked on top. Let me give you a bit of intuition about what goes wrong.
+
+256
+00:17:31,170 --> 00:18:00,960
+Consider a neuron that computes w.x plus a bias. What can you say about the gradients on w during backpropagation if all of its inputs x are positive, say between 0 and 1, because you are somewhere deep inside a sigmoid network?
+
+262
+00:18:00,960 --> 00:18:57,808
+They are constrained to be all positive or all negative together, depending on the gradient flowing in from above: the gradient on each w is basically x times the upstream gradient, and the x's are all positive. So all the weights of the neuron increase together or decrease together. If the ideal update needs the first weight to go up and the second to go down, that direction is simply not available, and you end up taking an inefficient zig-zag path to reach parts of weight space outside those two quadrants.
+
+273
+00:18:57,808 --> 00:19:30,450
+This is a slightly hand-wavy intuition, but you can see it empirically: when you train with inputs that are not zero-centered you observe slower convergence. If you want to go much deeper there is work on natural gradients where the math gets more involved; the takeaway is just that you want things zero-centered, at the input and throughout, for training to behave nicely.
+
+282
+00:19:30,450 --> 00:20:00,710
+A third downside: the exp() inside the sigmoid expression is somewhat expensive to compute compared to the alternatives. It is a small detail: when you train big convolutional networks, most of the time goes into convolutions and dot products, not into the nonlinearity; still, it is a slight disadvantage compared to the other choices.
+
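A tiny sketch of the zero-centering argument above (my own illustration, with made-up numbers): when every input x_i is positive, every component of the weight gradient inherits the sign of the upstream gradient.

```python
import numpy as np

np.random.seed(0)
x = np.random.rand(5)        # all-positive inputs, e.g. sigmoid outputs in (0, 1)

# For a single neuron f = w . x, backprop gives dL/dw_i = x_i * dout,
# where dout = dL/df is the upstream gradient.
for dout in (+1.0, -1.0):
    dw = x * dout
    print("dout=%+.0f  dw signs:" % dout, np.sign(dw))
# Every component of dw shares the sign of dout, so the weights can only
# all increase together or all decrease together: the zig-zag constraint.
```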
+288
+00:20:00,710 --> 00:20:18,700
+So, with those problems in mind, which nonlinearities try to fix them? The tanh: around 1991 LeCun wrote a very nice paper on how to optimize networks, linked from the lecture, which among other things recommends the tanh.
+
+293
+00:20:18,700 --> 00:20:51,259
+The tanh is basically two sigmoids put together so that you end up between -1 and 1: the outputs are zero-centered, but otherwise the same issues remain; in particular you still get no gradient flow once you saturate. So it does not fix saturation, but since everything else is the same, the tanh is strictly preferred over the sigmoid.
+
+300
+00:20:51,259 --> 00:21:25,580
+Then, around 2012, the first large convolutional network paper (Krizhevsky et al., AlexNet) reported that using the ReLU nonlinearity, max(0, x), instead of a sigmoid or tanh made their networks converge much faster, roughly six times faster in their experiments. You read that and can see it works well in practice, but explaining exactly why is not always easy; it took a while.
+
+307
+00:21:25,579 --> 00:22:15,429
+One reason people think it works much better is that it does not saturate, at least in the positive region, so there is no vanishing-gradient problem there. A neuron is only active in a region bounded on one side, but neurons that are active backpropagate properly: over at least half of their regime they behave well. The ReLU is also much cheaper to compute, and experimentally convergence is much faster, as that paper pointed out. This is what you should use at this point.
+
+319
+00:22:15,429 --> 00:23:13,970
+There are again a few problems with the ReLU. First, it is not zero-centered, so that slight annoyance is back. Second, think about backpropagation when the input is negative: the neuron is inactive, and its local gradient is exactly zero, so the incoming gradient is killed completely, not just shrunk; nothing backpropagates, and none of the weights below get updated.
+
+332
+00:23:13,970 --> 00:23:30,250
+And when the input is positive, the local gradient is one, so the gate just passes the upstream gradient through unchanged. So a ReLU either passes the gradient through or kills it.
+
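A minimal sketch of the ReLU gate just described (my own illustration): the backward pass is exactly "pass through or kill".

```python
import numpy as np

def relu_forward(x):
    return np.maximum(0, x)

def relu_backward(dout, x):
    # Pass the upstream gradient through where x > 0, kill it where x <= 0.
    # At exactly x == 0 this picks the 0 subgradient, which is what is
    # usually done in practice (the choice essentially never matters).
    return dout * (x > 0)

x = np.array([-10.0, -0.5, 0.0, 0.5, 10.0])
dout = np.ones_like(x)
print(relu_forward(x))        # [ 0.   0.   0.   0.5 10. ]
print(relu_backward(dout, x)) # [ 0.   0.   0.   1.   1. ]
```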
+335
+00:23:30,250 --> 00:23:58,609
+By the way, what happens when the input is exactly 0? The derivative does not exist at that point; whenever I say "gradient" I really mean a generalization of the derivative, a subgradient, and at the kink there is a whole set of valid subgradients, anything between 0 and 1.
+
+341
+00:23:58,609 --> 00:24:24,710
+In practice we usually just use 0; the choice really does not matter much. Someone asked about the max(x, y) gate: when x and y are equal you also have a kink in the function. In practice these things really do not matter: you pick one side and things work fine, because you essentially never land exactly on the kink.
+
+348
+00:24:24,710 --> 00:25:06,650
+OK, so a problem with ReLUs that actually occurs in practice: these neurons can die. If a ReLU neuron never activates on anything in your data, it never gets any updates. One way this happens is an unlucky initialization: the weights land so that the neuron's active region lies outside the cloud of your input data; we call this a dead ReLU.
+
+357
+00:25:06,650 --> 00:25:43,310
+A neuron outside the data cloud never activates and therefore never updates. This can happen in two ways: you were really unlucky at initialization, which is rare, or, much more often, during training: if your learning rate is high, think of these neurons jittering around; they can get knocked off the data manifold, and once that happens they never activate again and never come back.
+
+366
+00:25:43,309 --> 00:26:16,299
+You see this in practice: you train a big net, it seems to work, you stop the training, pass the entire training set through the network, and look at the statistics of every single neuron; you can find that as much as 10 or 20% of the network is dead: neurons that never turn on for anything in the training data, usually because the learning rate was high.
+
+375
+00:26:16,299 --> 00:27:02,089
+People usually do not worry about this a lot, but it is something to be aware of, and it is partly an initialization issue specific to this nonlinearity. Instead of initializing the biases to zero, people sometimes use a slightly positive value like 0.01, which makes it more likely that the ReLUs start in the active region and receive updates. Whether this actually helps is contested: some people argue it does, others say it does not really help at all.
+
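A sketch of the dead-ReLU diagnostic described above (my own illustration; the data, scales, and the deliberately pathological bias are made up): pass the whole dataset through a layer and count neurons that never fire.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(1000, 100)       # stand-in "training set"
W = 0.01 * np.random.randn(100, 50)
b = np.full(50, -1.0)                # deliberately unlucky, very negative biases

h = np.maximum(0, X.dot(W) + b)      # ReLU activations over the whole set
ever_active = (h > 0).any(axis=0)    # did each neuron fire on *any* example?
print("dead neurons: %d / %d" % ((~ever_active).sum(), ever_active.size))
# With pre-activation std ~0.1 and bias -1, essentially every neuron is dead:
# it will receive no gradient and never recover.
```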
+385
+00:27:02,089 --> 00:27:51,730
+OK, let us look at what people have tried in order to fix these issues. One problem with the ReLU was the dead neurons, so one proposed fix is the leaky ReLU. The idea: we want the kink and the piecewise-linear pieces, but the flat zero region is where neurons die, so instead make the negative region slightly sloped, slightly leaky. Some people report this works a bit better, and you do not have the dying-neuron problem, but it is not completely established that it is always an improvement.
+
+395
+00:27:51,730 --> 00:28:40,950
+Some people push this further: the 0.01 slope does not have to be fixed; it can be an arbitrary parameter alpha, learned by backpropagation. That is the parametric rectifier, or PReLU: every neuron can choose the slope of its own negative region, so it can become a plain ReLU, stay leaky, or pick whatever is relevant. These are the kinds of knobs people play with when trying to design something that works well in a very general way.
+
+404
+00:28:40,950 --> 00:29:20,529
+[Student question] Yes, the alpha is per-neuron: every neuron has its own alpha, just as it has its own bias. [Another question, about a detail of backpropagating through the PReLU] If I remember correctly there is nothing special going on there, but I read the paper a while ago and I do not use this much, so do not hold me to the exact details.
+
+414
+00:29:20,529 --> 00:30:18,309
+There are still more variants; one came out only about two months ago, which gives you a sense of how new this field is. That recent paper proposes the exponential linear unit (ELU). It tries to keep the advantages of the ReLU while removing the downside of non-zero-centered outputs: in the negative region it neither snaps to zero like the ReLU nor keeps descending like the leaky ReLU; it has this curved shape, shown as the blue function here, and there are two pages of math in the paper partly justifying it, roughly arguing that you end up with zero-mean outputs and that this trains better. We are all basically still trying to figure this out.
+
+429
+00:30:18,308 --> 00:30:31,259
+This is an active area of research, and it is not entirely clear yet what the right thing is. The safe recommendation right now: use the ReLU, and be careful with your learning rates.
+
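For reference, here is a compact sketch of the three ReLU variants discussed above (my own illustration; the default constants are the conventional ones, not values from the lecture):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)     # small fixed negative slope

def prelu(x, alpha):
    return np.where(x > 0, x, alpha * x)     # same form, but alpha is learned

def elu(x, alpha=1.0):
    # Smoothly saturates to -alpha for very negative inputs.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.linspace(-3, 3, 7)
print(leaky_relu(x))
print(elu(x))
```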
+432
+00:30:31,259 --> 00:31:04,298
+One more I want to mention because it is relatively common: the maxout neuron. It is quite different from the others: it does not apply a nonlinearity to a single w.x. Instead it carries two sets of weights and computes max(w1.x + b1, w2.x + b2). So you can see there are many different ways to play with these activation functions.
+
+440
+00:31:04,298 --> 00:32:11,009
+Maxout avoids several downsides: it does not die, it does not saturate, and it is still piecewise linear and efficient. But every neuron now has two weight vectors, so you double the number of parameters per neuron, which is maybe not ideal. Some people use it, but it is not super common; I would say the ReLU is still the most common choice. [Question] Yes, the two weight sets can end up different, so you end up with different weights on each piece.
+
+447
+00:32:11,009 --> 00:32:59,649
+And yes, this definitely complicates things: a lot of this is not just about the loss function but about the dynamics of the backward gradient flow, which we will see more of next week. You really have to think about it dynamically, not just as a loss landscape. Stochastic gradient descent also plays more nicely with certain functional forms, so the choice of activation function and the choice of update rule are coupled and interact, and it is quite unclear how, when you actually optimize these complicated systems.
+
+457
+00:32:59,650 --> 00:33:32,670
+So you can try these out, but do not expect too much. Basically nobody uses the sigmoid anymore, because the ReLU family is just strictly better; the exception is that we use sigmoids inside long short-term memory (LSTM) units in recurrent neural networks, for concrete reasons we will see later in the class, and there they are used differently from the plain fully-connected sandwich we have covered so far.
+
+465
+00:33:32,670 --> 00:33:59,428
+OK, that is all I wanted to say about activation functions; this was the main feature we care about. The research here is not fully settled; much of it comes down to pros and cons and to thinking about how gradients flow through your network, including issues like dead ReLUs. When you debug your network, gradient flow is what you really need to understand.
+
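A one-neuron sketch of the maxout unit described above (my own illustration with random stand-in weights):

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(10)
# A maxout neuron carries two weight vectors (and biases) instead of one,
# doubling its parameter count.
w1, b1 = np.random.randn(10), 0.0
w2, b2 = np.random.randn(10), 0.0
out = max(w1.dot(x) + b1, w2.dot(x) + b2)  # piecewise linear, never saturates
print(out)
```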
+472
+00:33:59,429 --> 00:34:23,759
+Now let us look at data preprocessing, which is very simple. Suppose you have a cloud of original data, here two-dimensional. It is very common to zero-center your data, meaning you subtract the mean along every single dimension. In the machine learning literature you sometimes also see people normalize the data, for example dividing each dimension by its standard deviation.
+
+479
+00:34:23,760 --> 00:34:44,719
+Normalization can also mean squashing each dimension to lie within some min and max, and so on. With images this is not common, because you do not have separate features in different units: everything is pixels, bounded between 0 and 255. So normalizing is less common, but zero-centering the data is very common, and you can go even further.
+
+485
+00:34:44,719 --> 00:35:14,480
+In general machine learning you can look at the covariance structure of your training data and make the covariance matrix diagonal, for example by applying PCA, or go further still and whiten the data, meaning that after PCA you also squash each dimension so the covariance becomes the identity. That is another form of preprocessing you will see people talk about; the class notes go into more detail.
+
+492
+00:35:14,480 --> 00:35:56,000
+I will not go deeper, because with images we do not actually end up doing these machine-learning preprocessing steps. For images the common thing is just mean centering, and in practice two variants are convenient. With CIFAR-10, say, whose images are 32x32x3: you compute the mean over the training data for every single pixel, which gives you a mean image that is itself a 32x32x3 array, and you subtract it from every single image; your data is then centered, which gives better training dynamics.
+
+502
+00:35:56,000 --> 00:36:26,430
+The other, slightly more convenient form is per-channel: compute the mean of the red, green, and blue channels across all spatial positions, which leaves you with just three numbers, and subtract those. Some networks use this instead; people like it because you only have three numbers to carry around rather than a giant array of means you have to ship to everyone.
+
+509
+00:36:26,429 --> 00:36:56,389
+That is about all I want to say here: in computer vision applications preprocessing is basically just mean subtraction; we do not use anything much more complicated. Images are very high-dimensional, so full covariance matrices would be huge; people used to do things like local whitening, sliding a whitening filter over the image, a few years back, but it does not seem to matter and is no longer common.
+
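Here is a sketch of the two centering variants just described (my own illustration; the random integers stand in for a real training set). One practical note, which is standard but worth stating: the mean is computed on the training split only and reused as-is at test time.

```python
import numpy as np

np.random.seed(0)
# Stand-in for a training set of 1000 RGB images of size 32x32.
X = np.random.randint(0, 256, size=(1000, 32, 32, 3)).astype(np.float64)

mean_image = X.mean(axis=0)            # a full 32x32x3 mean image
X_centered = X - mean_image

channel_mean = X.mean(axis=(0, 1, 2))  # just 3 numbers: R, G, B means
X_centered2 = X - channel_mean

print(mean_image.shape, channel_mean)
```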
+517
+00:36:56,389 --> 00:37:37,320
+OK, weight initialization: a very, very important topic. One of the reasons early neural networks were thought not to work well is that people were not careful enough with it. First of all, how not to do it: you might be tempted to just start with all weights equal to zero. Say your network is a 10-layer neural network and you set every weight to zero: why is that not a good idea, why does that not go well?
+
+525
+00:37:37,320 --> 00:37:57,860
+Because every neuron then computes the same thing in the forward pass, and in the backward pass they all behave identically and compute the same gradients; there is nothing breaking the symmetry. So the cleanest simple fix people use is small random numbers.
+
+530
+00:37:57,860 --> 00:38:15,340
+A relatively common recipe is to sample from a Gaussian with zero mean and a standard deviation of 0.01: small random numbers; that is how W matrices used to be initialized.
+
+534
+00:38:15,340 --> 00:38:38,798
+This initialization works OK, but there is a catch: it is fine for small networks, but as you start going deeper you have to be much more careful. I want to show exactly how and where it breaks when you use this naive strategy with a deep network, so let us try it and look at what goes wrong.
+
+538
+00:38:38,798 --> 00:39:34,109
+What I have written here is a small notebook; let me step through it briefly. I sample a dataset of 1,000 points, each 500-dimensional, and build a stack of hidden layers with nonlinearities: say 10 layers of 500 units each, using tanh. I take unit-Gaussian data and forward-propagate it through the network, with the initialization strategy from the previous slide, Gaussian samples scaled by 0.01, and I want to look at what happens to the statistics of the hidden-unit activations through the network.
+
+551
+00:39:34,108 --> 00:39:59,588
+Specifically we will look at the mean and the standard deviation at every layer, plot them, and also make histograms of the values occurring inside, say, the fifth, sixth, or seventh layer. With this initialization, running the experiment, it ends up looking as follows.
+
+558
+00:39:59,588 --> 00:40:50,930
+We start with mean 0 and standard deviation 1, the statistics of the data. Forward-propagating with tanh, which is symmetric, the means stay around zero as you would expect; but look at the standard deviation: it starts at 1, drops to 0.2, then 0.04, and plunges toward zero. The histograms, one per layer, show a spread between -1 and 1 at the first layer that then collapses into a tight distribution at exactly zero: by the last layers, every tanh neuron in the network outputs tiny numbers near zero.
+
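A condensed reimplementation of the experiment just described (a sketch in the spirit of the lecture's notebook, not the original code): ten tanh layers, weights scaled by 0.01, printing the per-layer activation statistics.

```python
import numpy as np

np.random.seed(0)
H = np.random.randn(1000, 500)        # 1000 unit-Gaussian points, 500-D

for i in range(10):                   # ten tanh layers of 500 units
    fan_in = H.shape[1]
    W = 0.01 * np.random.randn(fan_in, 500)   # the naive "small numbers" init
    H = np.tanh(H.dot(W))
    print("layer %2d: mean %+.5f  std %.5f" % (i + 1, H.mean(), H.std()))
# With the 0.01 scale the std collapses toward zero layer by layer;
# replacing 0.01 with 1.0 instead saturates everything at +/-1.
```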
+570
+00:40:50,929 --> 00:41:12,548
+So all the activations essentially become zero. Why is this a problem? Think about the dynamics of the backward pass. When the activations are tiny, the x's feeding the last few layers are tiny numbers; what does that do to the gradients in those layers?
+
+575
+00:41:12,548 --> 00:42:00,789
+Take one of these deep layers: almost all of its inputs x are tiny. What do you expect the gradient on that layer's W to be? Very small. Why: the gradient on W equals x times the gradient from above, and x is nearly zero. So the weights of these layers get essentially no meaningful updates.
+
+583
+00:42:00,789 --> 00:43:11,519
+Now also look at what happens to the gradient on the data as it flows back through these matrices. In the forward pass we took unit-Gaussian data, multiplied repeatedly by W and squashed through the activation function, and everything decayed to zero. The backward pass mirrors this: besides the gradients that peel off into the W's, backpropagating through each layer multiplies the gradient by W again and again; at this weight scale that contracts everything toward zero. So a perfectly reasonable gradient coming out of the loss function shrinks layer by layer, and you end up with tiny gradients everywhere.
+
+600
+00:43:11,519 --> 00:43:27,160
+You end up with very, very low gradients throughout the network. This is what we refer to as vanishing gradients: with this particular initialization, the magnitudes of the gradients shrink as they flow back.
+
+604
+00:43:27,159 --> 00:44:34,180
+We can try the other extreme: instead of 0.01, scale the W matrices by 1.0 at initialization. Now another funny thing happens: we overshot in the other direction. The clearest view is the histograms: everything is completely saturated; the tanh units are all at -1 or +1 across the entire network, because the weights are so large that the sums flowing into each nonlinearity are huge. And what gradients flow through a fully saturated network? Essentially zeros everywhere; a complete disaster, with gradients dying off exponentially.
+
+618
+00:44:34,179 --> 00:44:54,629
+You can train for a very long time and the loss just never moves, because every neuron is saturated, nothing backpropagates, and nothing gets updated. So this scale is genuinely tricky to set: in this particular case it has to be somewhere between 0.01 and 1.0.
+
+623
+00:44:54,630 --> 00:45:51,018
+You can be more principled than trying different values by hand, and a few papers have addressed this. In 2010 came the proposal we now call Xavier initialization (Glorot and Bengio): they looked at an expression for the variance of a neuron and derived a specific strategy, so you do not have to try 0.01, then another value, and so on. The recommendation: sample Gaussians and divide by the square root of the number of inputs to each neuron. Intuitively, with many inputs you want smaller weights, because you are summing more terms into the weighted sum, so each should contribute less; with few inputs you want larger weights to keep the variance up. Let me back up a little:
+
+637
+00:45:51,018 --> 00:46:15,650
+the derivation looks at a single neuron and sets the activation function aside (everything here assumes a linear neuron): if the input data has unit variance and you want the neuron's output to also have unit variance, you should initialize the weights with exactly this variance. The notes derive this precisely; it is basically a reasonable initialization.
+
+644
+00:46:15,650 --> 00:47:11,299
+If you use it instead, the distributions through the layers look much more reasonable: the histograms for the tanh units are spread between -1 and 1, with sensible values inside the active region of every tanh, so you can expect training to go much better: things start in the active region rather than super-saturated or collapsed to nothing. It is not perfect: the standard deviation still decays with depth, because the derivation ignores the tanh nonlinearity, which deforms the variance statistics as you stack layers; but the decay is far less dramatic than with the 0.01 setting.
+
+659
+00:47:11,300 --> 00:47:48,440
+So this is a reasonable initialization that people sometimes use in practice, by comparison with just setting 0.01. But it was worked out for the tanh case: if you drop it into a network of rectified linear units it does not work as well, and the deviations decay much faster: in a ReLU net the first layer has some spread, and then the distributions become more and more peaked toward zero, with fewer and fewer neurons activating. Xavier initialization alone does not do a good job in a ReLU network.
+
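A one-layer sketch of the Xavier recipe just described (my own illustration): dividing by the square root of the fan-in keeps the pre-activation variance near one.

```python
import numpy as np

np.random.seed(0)
fan_in, fan_out = 500, 500
# Xavier/Glorot initialization: scale by 1/sqrt(fan_in).
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

x = np.random.randn(1000, fan_in)    # unit-Gaussian inputs
h = x.dot(W)                         # pre-nonlinearity activations
print(h.std())                       # stays near 1 instead of collapsing
```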
+667
+00:47:48,440 --> 00:48:35,530
+Think about that paper again: it does not account for the nonlinearity. A ReLU neuron computes the same weighted sum, but then it sets half of the distribution to zero, which intuitively kills half of your distribution and halves the variance at every layer. It turns out, as proposed in a paper last year (He et al.), that you therefore need a factor of 2: because each ReLU effectively halves the variance, you compensate with an extra factor of two in the weight variance.
+
+678
+00:48:35,530 --> 00:49:14,950
+When you account for it, you get distributions that behave properly, specifically for ReLU nets: with the extra factor of two everything comes out well, and your activations do not keep shrinking. This factor matters a lot in practice: in their paper they compare having the factor of two with not having it on really deep networks, I think dozens of layers, and with the factor the network converges, while without it nothing happens at all: everything just decays to zero.
+
+688
+00:49:14,949 --> 00:50:03,019
+So this is very important stuff that you have to get right, and you really need to think it through: bad things happen if the initialization is wrong, and specifically, if your network uses ReLU units, use the He et al. answer. This is partly why, for a long time, people did not fully appreciate how hard it was to get these networks to train. I also want to point out that proper initialization remains an active area of research: papers are still being published proposing different ways to initialize networks. The most recent ones are interesting because they do not give you a formula:
+
+698
+00:50:03,019 --> 00:50:39,139
+they do data-driven initialization: you take a batch of data, pass it through your randomly initialized network, and look at the variances at every single point inside it. Intuitively you do not want the variances to go to zero or to explode; these methods all say roughly the same thing, that you want roughly unit-scale activations throughout, so they iteratively rescale the weights until the activations are about unit-scale everywhere. So there is a line of data-driven techniques for initializing properly.
+
+706
+00:50:39,139 --> 00:51:11,170
+OK, I am about to go into a technique that alleviates many of these problems, but first I can take some questions. [Question about normalizing during the backward pass] The issue is that if you modify the gradient after the fact, it is no longer clear what objective you are optimizing: you are not necessarily computing the gradient of anything. So that could be a problem; I am not sure what would happen if you tried to normalize the gradient itself.
+
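A sketch of the factor-of-two fix discussed above (my own condensed version, not the paper's code): with sqrt(2/fan_in) scaling, the activation scale stays stable through a deep ReLU stack.

```python
import numpy as np

np.random.seed(0)
H = np.random.randn(1000, 500)
for i in range(10):
    fan_in = H.shape[1]
    # He et al. (2015): the extra factor of 2 compensates for the ReLU
    # zeroing out half of the distribution at every layer.
    W = np.random.randn(fan_in, 500) * np.sqrt(2.0 / fan_in)
    H = np.maximum(0, H.dot(W))
    print("layer %2d: std %.4f" % (i + 1, H.std()))
# Dropping the factor of 2 (i.e. sqrt(1/fan_in)) makes the std shrink by
# roughly sqrt(2) per layer, which compounds badly in deep networks.
```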
+713
+00:51:11,170 --> 00:51:37,119
+That said, it sounds related to something I will propose in a moment, done in a cleaner way. Let us go into something that actually fixes many of these problems: it is called batch normalization. It was only proposed last year, so it was not even covered in this class last year, but it actually helps a lot.
+
+719
+00:51:37,119 --> 00:52:10,420
+The core idea of the paper is roughly: you want unit-Gaussian activations in every single part of your network? Just make them so. You can do that, because normalizing to unit-Gaussian is a completely differentiable operation, so you can backpropagate through it. Concretely: take the batches you are forwarding through the network and insert batch normalization layers into it; a batch normalization layer takes its input X and makes sure that every single feature dimension, measured across the batch, has unit-Gaussian activations.
+
+727
+00:52:10,420 --> 00:52:32,550
+Say you have a batch of one hundred examples going through the network: at some layer you have an activation matrix with the batch along one axis and the D features, the activations of the neurons at that point, along the other.
+
+731
+00:52:32,550 --> 00:53:02,818
+The main operation: for every single feature independently, evaluate the empirical mean and variance over the batch, subtract the mean, and divide by the standard deviation, so that every column becomes unit-Gaussian. It is a perfectly differentiable function, applied to each feature or activation independently across the batch, and it turns out to be a very nice idea.
+
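A minimal sketch of that per-feature normalization (my own illustration of the training-time forward pass, including the gamma/beta scale-and-shift discussed next):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, D) activations for one mini-batch; normalize each feature (column).
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)  # unit Gaussian per feature
    return gamma * x_hat + beta            # learned scale/shift (can undo it)

np.random.seed(0)
x = 3.0 + 2.0 * np.random.randn(100, 4)    # batch of 100 examples, 4 features
out = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))   # ~0 mean, ~1 std per feature
```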
+737
+00:53:02,818 --> 00:53:30,190
+This is how it is used: networks are usually fully-connected or convolutional layers followed by nonlinearities, and we insert these batch normalization layers right after the fully-connected layers, or equivalently after the convolutional layers in a convolutional network, so everything is made roughly unit-Gaussian at every step of the network, simply because we force it to be.
+
+744
+00:53:30,190 --> 00:54:19,429
+One issue I would like you to think about: this looks like an unnecessarily hard constraint. Right after the normalization the outputs are certainly unit-Gaussian, but it is not obvious that the tanh that follows actually wants unit-Gaussian inputs: if you think about the shape of the tanh, you might want the network to choose how diffuse or how saturated its tanh inputs are, and a hard normalization takes that choice away.
+
+754
+00:54:19,429 --> 00:54:48,250
+So the second part of the batch normalization layer: after normalizing, the network is allowed to shift by beta and scale by gamma, separately for every single feature. These gammas and betas are parameters that we backpropagate into, and they give the network that freedom back.
+
+760
+00:54:48,250 --> 00:55:45,110
+We might initialize with something like gamma = 1 and beta = 0, and the network can then choose to adjust them: as we feed into the tanh, the backpropagated signal can make the inputs more or less diffuse, more or less saturated, in whatever way helps; but you no longer get the failure mode where things completely die or explode at the start of optimization, so things train right away. One more important point: with learnable gamma and beta, the network has the capacity to undo the batch normalization entirely, by setting them to the empirical standard deviation and mean; so the layer can learn to act as the identity function, which it could not before. If backpropagation decides the normalization is not useful, it can learn to take it out; if it is useful, it can learn to take advantage of it.
+
+774
+00:55:45,110 --> 00:56:23,469
+So, the nice properties: first, batch normalization improves the gradient flow through the network, so the network can learn with higher learning rates and trains faster. Second, and importantly, it reduces the strong dependence on initialization: if you sweep through choices of the initialization scale you normally see big differences, but with batch norm things work across a much bigger range of initial scales, so you do not have to worry as much; it really helps.
+
+782
+00:56:23,469 --> 00:57:33,920
+One more subtle point: batch normalization acts as a funny form of regularization, and it reduces the need for dropout (which we will get to a bit later in the class). Here is why: when you pass some input X through the network, its representation at some later layer is a function not only of X but of whichever other examples happen to share its batch; normally those examples would be processed completely independently, but batch norm actually ties them together. So where X lands in representation space at a deep layer jitters around depending on the batch that happened to be sampled, and that jitter has a nice regularizing effect.
+
+798
+00:57:33,920 --> 00:58:08,800
+One quick point: at test time the batch normalization layer functions a little differently, because you want it to be a deterministic function. You still normalize by a mu and sigma, but at test time you use values remembered from the training data: you can compute the mean and sigma at every point in the network once, over the entire training set, or, what people usually do, keep a running average of them while training; either way, make sure they are stored for test time.
+
+807
+00:58:08,800 --> 00:58:17,000
+At test time you do not want to estimate the empirical mean and variance across the batch; you just want to use the stored statistics directly. It is a small detail, but an important one.
+
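A sketch of the train/test asymmetry just described (my own illustration; the momentum constant is a common choice, not a value from the lecture): keep exponential running averages during training, and use them at test time.

```python
import numpy as np

momentum, eps = 0.9, 1e-5
running_mu, running_var = np.zeros(4), np.ones(4)

def bn_train_step(x):
    # Normalize with the batch statistics, and fold them into running averages.
    global running_mu, running_var
    mu, var = x.mean(axis=0), x.var(axis=0)
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return (x - mu) / np.sqrt(var + eps)

def bn_test(x):
    # Deterministic: uses the stored statistics, not the test batch's.
    return (x - running_mu) / np.sqrt(running_var + eps)

np.random.seed(0)
for _ in range(50):                       # simulate 50 training batches
    bn_train_step(1.0 + np.random.randn(100, 4))
print(running_mu, running_var)            # close to the true mean 1, var 1
```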
+809
+00:58:17,000 --> 00:58:35,559
+So that is batch normalization; questions about it? It is a good thing, and you will actually get to use it in the assignment.
+
+812
+00:58:35,559 --> 00:59:16,719
+[Question: does it slow training down?] Yes, unfortunately there is a runtime penalty; I do not know exactly how much; I have heard someone say something like 30%, but I have not verified that. There is a penalty because you typically apply it after every single convolutional layer, and in something like a 150-layer network all of that accumulates; so yes, there is a price we pay.
+
+819
+00:59:16,719 --> 00:59:36,489
+[Question] I think I will come back to that in a few slides, when we see how to detect that your network is not healthy. OK, so: babysitting the learning process. I have about twenty minutes; I think we can do this.
+
+823
+00:59:36,489 --> 01:00:03,019
+For the purposes of these experiments I will work with CIFAR-10, and I will use a small two-layer neural network. I want to give you an idea of what it looks like in practice to train neural networks: how you play with the hyperparameters, and how things unfold as you poke at the data and get things to work.
+
+831
+01:00:03,018 --> 01:00:35,949
+So I decided to use a small neural network with my data; the first kinds of things I would look at: preprocess the data, and make sure my predictions are calibrated. First of all, I initialize my two-layer network, the weights and biases; the initialization here is just naive small random sampling, which we can afford because the network is very small. Then this function basically trains the neural network; I am not showing you the implementation, obviously, but one thing it returns is your loss
+
+838
+01:00:35,949 --> 01:01:20,940
+and the gradients on the model parameters. As my first sanity check, I disable the regularization (I pass it in as zero) and check that my loss comes out right. As I mentioned in a previous lecture: I am using a softmax classifier and I know I have 10 classes, so I expect a loss of the negative log of 1/10, which is the expression for the loss, and that turns out to be about 2.3. I run it and indeed get 2.3: the network is spreading its scores uniformly over the classes, because it does not know anything yet; that checks out. Next, I crank up the regularization: of course I expect my loss to go up now, because there is this additional term in the objective; and that checks out too, which is nice.
+
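The expected-initial-loss check in two lines (my own sketch; the commented harness names are hypothetical stand-ins for your own training code):

```python
import numpy as np

# With regularization off and random weights, a 10-way softmax classifier
# should produce a diffuse distribution, so the expected loss is -log(1/10).
print(-np.log(1.0 / 10))   # ~2.3026; your measured initial loss should match

# reg_loss = compute_loss(model, X_train, y_train, reg=1e3)  # hypothetical
# assert reg_loss > 2.3026  # cranking regularization up must raise the loss
```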
+852
+01:01:20,940 --> 01:01:54,608
+The next thing I always do is a very good sanity check: when you are working with a new network, take a small piece of the data and make sure you can overfit it. Take a tiny sample, say 20 training examples and their labels, train on just that, and confirm you can drive the loss down to basically zero: you should be able to overfit it completely, because if you cannot overfit a tiny piece of your data, things are definitely broken.
+
+859
+01:01:54,608 --> 01:02:28,079
+So here I start training from some random parameters; I will not go into the full detail, but basically I make sure my cost can go down to near zero and that I get 100% accuracy on this small piece of data. That gives me confidence that the backward pass is probably correct, the updates are working, and the learning rate is set at least somewhat reasonably; only at that point do I think about scaling up to something bigger.
+
+867
+01:02:28,079 --> 01:03:02,380
+So you should be able to overfit: sometimes you can try with just one, two, or three examples, and you can afford an even smaller network for the check. It is a very good sanity check, because if you cannot overfit, something in your implementation is very funky, and you should not scale up to the full data before you pass it.
+
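A skeleton of the tiny-subset overfitting check above (heavily hypothetical: `train` and `model` stand in for whatever your own code provides; only the slicing lines are concrete):

```python
import numpy as np

np.random.seed(0)
X_train = np.random.randn(50000, 3072)        # stand-in CIFAR-10-shaped data
y_train = np.random.randint(0, 10, 50000)

X_tiny, y_tiny = X_train[:20], y_train[:20]   # ~20 examples, as in the lecture
print(X_tiny.shape, y_tiny.shape)
# loss_history = train(model, X_tiny, y_tiny, reg=0.0, num_epochs=200)
# assert loss_history[-1] < 1e-2   # expect ~zero loss, i.e. 100% accuracy
```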
+873
+01:03:02,380 --> 01:03:43,130
+With that passed, we scale up to the larger dataset, and the first thing I search for is a learning rate that works; you really just have to play with this and eyeball the results. Find the rough scale first: I try a small learning rate like 1e-6, and I see the loss barely, barely went down; so 1e-6 is probably too small, nothing is happening. There could of course be many other reasons the loss barely changes, but since we passed the sanity checks, my read is that the learning rate is too low and I need to crank it up.
+
+884
+01:03:43,130 --> 01:04:08,130
+A fun thing to notice here, a nice example of something funky going on: my loss barely went down, but my training accuracy actually shot up from the chance level of 10% to 20%. How does that make sense: the loss barely changed, yet the accuracy is so much better than chance?
+
+888
+01:04:08,130 --> 01:05:04,799
+[students suggest answers] Think about how the accuracy is computed.
+
+891
+01:05:04,800 --> 01:05:19,530
+What is happening: the scores are still quite diffuse, so the loss stays roughly the same, but probability mass is shifting slightly toward the correct answers; and since accuracy is computed from the arg-max class, those small shifts can tip many predictions to correct. These are the fun things you run into when you actually train.
+
+895
+01:05:19,530 --> 01:05:55,929
+OK, so a very low learning rate does almost nothing. Now I try the other extreme, a learning rate of, say, one million: what could possibly go wrong? In that case things explode: the cost blows up, you get NaNs and weird errors; that is obviously way too high. So at this point I narrow in, roughly binary-searching for the region that actually gives me a decrease in the cost, and I get an idea of roughly where I should be cross-validating.
+
+905
+01:05:55,929 --> 01:06:23,719
+Then I move to proper hyperparameter optimization to find the best settings for my network. The strategy we go with in practice: first get a rough idea by playing with the learning rates, do a coarse search, see what works, then repeat the process while narrowing in on the regions that work well. Do the coarse stage quickly, and in your code detect explosions and break out of training early; that is a useful trick.
+
+912
+01:06:23,719 --> 01:06:50,659
+Effectively I have a loop in which I sample my hyperparameters, in this case the regularization and the learning rate; I sample them, I train, and I record the resulting accuracies on the validation data. Some of the accuracies came out around 40 or 50%, so those settings work quite well; some do not work well at all; this gives me an idea of what ranges of learning rates and regularizations work relatively well.
+
+919
+01:06:50,659 --> 01:07:50,050
+When you run this kind of search, start small: run for just a few epochs; you do not have to run for a very long time, and within a few minutes you can already get a sense of what is working better. And one note: when you optimize over the regularization and the learning rate, it is best to sample in log space. You do not want to sample them from a uniform distribution, because these quantities act multiplicatively in the dynamics of backpropagation. That is why I sample the exponent, here uniformly between -3 and -6, and raise 10 to that power; if you sampled uniformly from, say, 0.001 to 100, most of your samples would land in a bad region, because the learning rate interacts multiplicatively.
+
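A sketch of the log-space sampling loop just described (my own illustration; the exponent ranges are examples, not prescriptions):

```python
import numpy as np

np.random.seed(0)
for _ in range(5):
    # Sample the exponent uniformly, not the value itself: learning rate and
    # regularization act multiplicatively, so search in log space.
    lr = 10 ** np.random.uniform(-6, -3)
    reg = 10 ** np.random.uniform(-5, 5)
    print("lr %.2e  reg %.2e" % (lr, reg))
    # (train a few epochs with these, keep the ranges that do best,
    #  then narrow the ranges and repeat: coarse-to-fine search)
```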
+932
+01:07:50,050 --> 01:08:21,880
+Knowing roughly what works, I do a second pass where I shift these ranges a bit and see what works now; I find settings reaching 53%: some of these work really well. One thing to be aware of: sometimes you get a result like this, and when I see it I actually get worried, because this cross-validation result is hinting at a problem. Can you spot it?
+
+939
+01:08:21,880 --> 01:09:14,780
+[students answer] Look at the learning rates of the best results: they are actually quite consistent, all with exponents around -3, and I end up with very good results sitting almost exactly at 1e-3, which is the boundary of the interval I am optimizing over. Some of what gets really good results is at the edge of my search range, and that is not what I want.
+
+946
+01:09:14,779 --> 01:09:38,529
+It is a problem because the range as I defined it may not contain the optimum: there could be better results just outside it, so I want to adjust my range, say move the -3 boundary toward -2.5, and re-run. The regularization range, by contrast, seems to work quite well: the good values sit comfortably inside it, so I worry less about that one.
+
+952
+01:09:38,529 --> 01:10:03,740
+One more thing to point out: you saw that I sample the hyperparameters at random within the ranges. What people sometimes do instead is called grid search: rather than sampling randomly, you step over fixed values of both the learning rate and the regularization, with a double loop over settings of the two, trying to be exhaustive.
+
+959
+01:10:03,739 --> 01:10:43,039
+This is actually a bad idea, and it does not work as well in practice, even though it feels simple and thorough: you always want to sample randomly instead. The reason is subtle: in these hyperparameter optimization problems it frequently happens that one of the parameters is much more important than the other.
+
+968
+01:10:43,039 --> 01:11:14,910
+Say the performance of your loss function is really a function of the parameter on the x-axis and almost independent of the y-axis one, with a particular good region along x. In that case, with a grid you only ever test a few distinct x values, so you get no information in between; with random sampling you effectively probe many more distinct values of the important parameter and are likely to land in a better spot. So always use random sampling in these cases.
+
+976
+01:11:14,909 --> 01:11:32,920
+Random sampling simply gives you more bang for the buck, as promised. The most common hyperparameters to play with are probably the learning rate, the update type (which we will go into in a bit), the regularization, and the dropout amount (we will get to that too).
+
+979
+01:11:32,920 --> 01:12:21,569
+The way this works in practice is actually fun. For example, we have a computer vision cluster, so I can distribute my training across many machines; I have written myself a dashboard showing the loss functions from all the workers on the cluster, what each is running and what I am searching over. I can send commands to my workers: OK, this setting is not working, resample; some of these are doing very well, some of you are not doing well at all. I watch exactly what is working well and dynamically adjust. This is the process you actually go through to get things working: there is just so much stuff to optimize over that you can only afford to spray and pray, and work with what comes together.
+
+992
+01:12:21,569 --> 01:12:34,510
+OK. So, while you optimize, you are staring at loss functions, and a loss function can take many different forms; you should be able to read into what they mean, and you will get pretty good at it.
+
+995
+01:12:34,510 --> 01:13:04,199
+For example, this one is fun: as I pointed out in maybe a previous lecture, it is not decaying like the rough exponential you expect; it looks a bit too linear, which tells me the learning rate might be slightly too low; "too low" here just means you could consider raising it, not that it is unbearable. You get all kinds of funny shapes: you can have a plateau where at some point the loss suddenly decides to drop.
+
+1003
+01:13:04,198 --> 01:13:27,420
+[Question: what would be the likely suspects for that kind of plateau?] Just off the top of my head, I think the prime suspect would be the initialization: gradients barely flowing at first, until at some point things add up and training just takes off.
+
+1007
+01:13:27,420 --> 01:13:52,569
+In fact these are so much fun that a while ago I started a whole Tumblr where people contribute their loss functions during training, especially of networks; you see all kinds of exotic shapes, and for some of them, at some point, I honestly do not know what they mean.
+
+1012
+01:13:52,569 --> 01:15:06,219
+Here, for example, several tasks are being trained at the same time. This one, by the way, I trained myself, so I know what happened: it is an agent trained with reinforcement learning. In reinforcement learning you do not have a stationary distribution over a fixed dataset: the agent interacts with an environment, and when the policy changes it might end up staring at a wall or looking at a different part of its space, so you end up with a different data distribution; suddenly the agent is seeing something very different from what it is used to, and the loss goes up; so you get all kinds of funny stuff going on. And this one is one of my favorites: I have no idea what happened; the loss oscillates but does roughly fine, and then just comes along and explodes; clearly something was not right in this case. And here, it just decides to converge at some point, and nobody had any idea what went wrong. So you get all kinds of funny things; if you end up with funny plots in your assignment, do send them in to the loss-functions Tumblr; and stay strong during training.
+
+1029
+01:15:06,219 --> 01:15:57,038
+The other thing to look at besides the loss function is your accuracy, on the training data and on the validation data. I sometimes prefer looking at accuracy, especially classification accuracy, because it is interpretable: I know what a given accuracy means in an absolute sense, whereas raw loss values are not as interpretable. In this case, for example, I am seeing that the accuracy on my training data keeps getting better while the validation accuracy stops improving; that gap can give you a hint about what is going on under the hood: there is a big gap here, so maybe I am overfitting; I am not 100% sure, but I suspect it strongly enough that I might try to regularize harder.
+
+1041
+01:15:57,038 --> 01:16:53,038
+Something else that can be worth tracking is the difference in scale between your parameters and the updates applied to those parameters. Say your weights are on the order of unit Gaussian; then intuitively, the increments that backpropagation applies to the weights should obviously not be much larger than the weights, and you also do not want them tiny, like on the order of 1e-7. So one thing to look at is, for example, the norm of the update compared to the norm of your parameters; a good rule of thumb is that the ratio should be roughly 1e-3: every update then modifies things on the order of the third significant digit of every parameter. You do not want huge updates and you do not want very tiny ones: if the ratio is much higher, maybe I want to decrease my learning rate; if it is way too low, maybe I want to increase it.
+
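The ratio check above, as a small worked sketch (my own illustration with stand-in numbers): here the ratio comes out near 1e-2, which by the rule of thumb would suggest lowering the learning rate.

```python
import numpy as np

np.random.seed(0)
W = 0.01 * np.random.randn(500, 500)   # current parameters
dW = np.random.randn(500, 500)         # stand-in gradient from backprop
learning_rate = 1e-4

update = -learning_rate * dW
ratio = np.linalg.norm(update) / np.linalg.norm(W)
print(ratio)   # rule of thumb from the lecture: aim for roughly 1e-3
```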
+01:15:57,038 --> 01:16:01,988 + 당신의 매개 변수의 규모와 그 매개 변수로 업데이트의 규모 때문에 + +1042 +01:16:01,988 --> 01:16:06,748 + 당신이 당신의 무게 단위 분출의 순서에 있다고 가정하고 그래서있어 말 + +1043 +01:16:06,748 --> 01:16:10,599 + 다음 직관적으로 당신에 의해 당신의 무게를 증가 업데이트 및 + +1044 +01:16:10,599 --> 01:16:14,349 + 역 전파 당신은보다 훨씬 큰 것으로 해당 업데이트를하지 않으 + +1045 +01:16:14,349 --> 01:16:16,679 + 분명히 가중치 또는 당신은 그들이 작은되고 싶어 + +1046 +01:16:16,679 --> 01:16:20,529 + 당신의 무게의 순서에있을 때 당신의 업데이트는 1987 년 정도가 될 수 있습니다 + +1047 +01:16:20,529 --> 01:16:25,359 + 음 너무 그래서 당신이 증가하는 약이야 업데이트를 보면 하나 + +1048 +01:16:25,359 --> 01:16:29,439 + 당신의 무게에 그냥 예를 들어,이 표준 보는 색상 사각형과 + +1049 +01:16:29,439 --> 01:16:34,129 + 일반적으로 귀하의 매개 변수의 규모와의 좋은 규칙 업데이트에 비해 + +1050 +01:16:34,130 --> 01:16:38,550 + 엄지 손가락이 대략 13 그래서 기본적으로 모든 업데이트 할 수 있어야 당신의 + +1051 +01:16:38,550 --> 01:16:41,360 + 의 하나 하나에 대한 세 번째 유효 숫자와 같은 순서에 수정 + +1052 +01:16:41,359 --> 01:16:44,118 + 매개 변수를 오른쪽 당신은 당신이 매우 작은 결정하지 않는 거대한 업데이트를 제작하지 않는 + +1053 +01:16:44,118 --> 01:16:49,708 + 그래서 업데이트는이 경우 일반적으로 확인 작동 대략 13를보고 한 가지 + +1054 +01:16:49,708 --> 01:16:53,038 + 너무 높은 어쩌면 말 내 학습 등의 방법이 너무 낮게 감소 할 + +1055 +01:16:53,038 --> 01:17:00,069 + 107 아마 내 학습 속도를 증가 할 그래서 요약 오늘날 우리 것 + +1056 +01:17:00,069 --> 01:17:05,308 + 교육 신경 네트워크 청록색 함께 할 수있는 일의 전체 무리 보았다 + +1057 +01:17:05,309 --> 01:17:09,729 + 그들 모두의 팔은 당신이 사용하는 트랙을 의미 잃게 기본적으로 있습니다 + +1058 +01:17:09,729 --> 01:17:11,869 + 초기화 + +1059 +01:17:11,869 --> 01:17:15,750 + 당신은 당신이 작은 네트워크를 생각하는 경우 또는 당신은 어쩌면 그냥 멀리 얻을 수 있습니다 + +1060 +01:17:15,750 --> 01:17:20,399 + 당신의 규모 2001 선택하거나 어쩌면 당신은 그와 조금 놀고 싶어하고있다 + +1061 +01:17:20,399 --> 01:17:26,719 + 여기에 강한 권고 난 그냥 생각하지 사용할 때 당신은 내가 아니에요하고있는 + +1062 +01:17:26,720 --> 01:17:34,110 + 내 결정 프로그램을 샘플링해야하고 많은에게 기부를 할 때 적절하고 + +1063 +01:17:34,109 --> 01:17:39,449 + 그주의해야 할 뭔가 이것은 우리가 아직도 충당하기 위해 무엇을하고 그 + +1064 +01:17:39,449 --> 01:17:44,269 + 이 경우 내가 질문을 할 수 있도록 우리가 두 분 이상을해야합니까 옆에있을 것입니다 + +1065 +01:17:44,270 --> 01:18:01,520 + 어떤 + +1066 +01:18:01,520 --> 01:18:11,120 + 사이의 상관 관계 + +1067 +01:18:11,119 --> 01:18:15,729 + 나는 어떤 분명히 당신이 얻을 필요가 추천 할 수 있다고 생각하지 않습니다 + +1068 +01:18:15,729 --> 01:18:18,769 + 그 검사는 그게 분명 나를 밖으로 점프 거기에 아무것도 생각하지 않습니다 + +1069 +01:18:18,770 --> 01:18:35,210 + 확인 위대한 질문에서 다른 커플 + +1070 +01:18:35,210 --> 01:18:35,949 + 에 대한 질문 + diff --git a/captions/Ko/Lecture6_ko.srt b/captions/Ko/Lecture6_ko.srt new file mode 100644 index 00000000..5153a78f --- /dev/null +++ b/captions/Ko/Lecture6_ko.srt @@ -0,0 +1,3652 @@ +1 +00:00:00,000 --> 00:00:07,009 + 확인 그래서 우리는 다시 신경망을 훈련에 대해 얘기하자 오늘은 이제 첫 무엇과 + +2 +00:00:07,009 --> 00:00:10,449 + 나는 우리가 다이빙을하기 전에 작동 쇼에 오는 당신에게 인터뷰의 비트를 줄 것이다 + +3 +00:00:10,449 --> 00:00:15,489 + 그 소재 단지 일부 관리 것을 먼저 첫 번째 I로 + +4 +00:00:15,490 --> 00:00:18,618 + 기회를하지 않았다 실제로 인터뷰 저스틴 마지막 강의 저스틴입니다하기 + +5 +00:00:18,618 --> 00:00:21,579 + 이 클래스 또한 강사 그는 처음 2 주 동안 실종됐다 + +6 +00:00:21,579 --> 00:00:28,409 + 그들은 그가 어쩌면 매우 지식의 나에게 아무것도에 대해 아무것도 요청할 수 있습니다 + +7 +00:00:28,410 --> 00:00:29,428 + 즉, 삼가의 + +8 +00:00:29,428 --> 00:00:37,960 + 확인하고 72 그렇게 꽤 오랫동안의 알림 내가 시작하는 것이 좋습니다으로 밖으로 + +9 +00:00:37,960 --> 00:00:43,850 + 여기에 구축하고는 기본적으로 다음 주 금요일은 그래서 가능한 한 빨리 그에 시작 할 수있어합니다 + +10 +00:00:43,850 --> 00:00:47,679 + 가능하면 앞으로의 적절한 API와 함께 작동 노하우를 구현 + +11 +00:00:47,679 --> 00:00:50,429 + 뒤로 클래스와 당신은 경쟁의 추상화 사로 잡고 볼 수 있습니다 + +12 +00:00:50,429 --> 00:00:54,820 + 다시 내 세션으로 이동 중퇴하고 실제로 구현합니다 + +13 +00:00:54,820 --> 00:00:57,770 + 상업 네트워크 실제로이이 과제의 말 때문에 + +14 +00:00:57,770 --> 00:01:00,770 + 강한에 오는 방법의 모든 낮은 수준의 세부 사항을 매우 잘 이해하고 + +15 +00:01:00,770 --> 00:01:06,530 + 네트워크 분류 난 그냥 확인 해요 그래서 우리는 단지 신호로이 클래스에 위치 + +16 
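
앞 강의 끝부분에서 말한 "업데이트 크기 / 파라미터 크기 비율이 대략 1e-3 정도면 적당하다"는 경험칙은, 예를 들어 아래와 같이 노름(norm)으로 확인해 볼 수 있다. 수치들은 예시로 가정한 값이다.

```python
import numpy as np

def update_to_param_ratio(W, dW, learning_rate):
    # 한 번의 업데이트 크기를 파라미터 크기와 비교한다.
    # 비율이 1e-3 보다 훨씬 크면 학습 속도를 낮추고,
    # 훨씬 작으면 학습 속도를 높여 보라는 것이 강의의 경험칙이다.
    update = -learning_rate * dW
    return np.linalg.norm(update) / np.linalg.norm(W)

W = 0.01 * np.random.randn(100, 100)    # 예시 가중치
dW = 0.001 * np.random.randn(100, 100)  # 예시 그래디언트
print(update_to_param_ratio(W, dW, learning_rate=1e-2))
```
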
+00:01:06,530 --> 00:01:10,140 + 다시 우리는 네트워크에서 훈련을한다 밖으로 신경망을 훈련하고 회전하는 + +17 +00:01:10,140 --> 00:01:15,590 + 정말 4 단계 프로세스는 전체 데이터 세트의 이미지와 라벨 우리가 + +18 +00:01:15,590 --> 00:01:18,920 + 우리가 네트워크를 통해 전파 생각했다 데이터 세트에서 작은 백을 샘플링 + +19 +00:01:18,920 --> 00:01:23,060 + 우리는 현재 분류​​하고 얼마나 잘 우리에게 말하고있는 손실에 도착합니다 + +20 +00:01:23,060 --> 00:01:26,390 + 데이터의 파견 그리고 우리는 모두의 기울기를 완료하기 위해 전파 + +21 +00:01:26,390 --> 00:01:29,969 + 무게와이 그라데이션이 우리에게 말하고 우리가 어떻게 매일 대기 확실하지해야 + +22 +00:01:29,969 --> 00:01:33,789 + 네트워크에 우리는 더 나은 다음 번에이 이미지를 분류하고 있도록 + +23 +00:01:33,790 --> 00:01:36,700 + 우리가 실제로 그렇게 할 경우 우리는 그라데이션 우리가 차 업데이트를 사용할 수있다 + +24 +00:01:36,700 --> 00:01:38,930 + 작은 홈 + +25 +00:01:38,930 --> 00:01:42,659 + 지난 시간 우리는 활성화 기능으로 보면서 나는 활성화 피곤 해요 + +26 +00:01:42,659 --> 00:01:45,368 + 기능과 어떤 장점과 이러한 내부자 신경 중 하나를 사용의 단점 + +27 +00:01:45,368 --> 00:01:49,060 + 물었을 때 좋은 질문이 너무 광장에서 들어오는 네트워크 이유도 당신 것 + +28 +00:01:49,060 --> 00:01:53,939 + 정품 인증 기능을 사용하는 이유는 그냥 건너 뛰고 질문을 제기했다하지 + +29 +00:01:53,938 --> 00:01:57,618 + 난 정말 기본적으로 마지막 강의에 매우 능숙하게이 문제를 해결하는 데있어 + +30 +00:01:57,618 --> 00:02:00,790 + 전체 신경망이 끝나는 경우보다 활성화 함수를 사용하지 않는다면 + +31 +00:02:00,790 --> 00:02:05,500 + 당신의 샌드위치 하나 하나가되는 등 용량 단지의 그것과 동일하다 + +32 +00:02:05,500 --> 00:02:10,080 + 그 활성화 기능이 정말 중요하다, 그래서 선형 분류 + +33 +00:02:10,080 --> 00:02:13,880 + 사이에 그들은 그들은 당신에게 당신이 사용할 수있는 모든 방법을 제공 것들입니다 + +34 +00:02:13,879 --> 00:02:17,490 + 실제로 데이터를 넣어 우리는 전처리에 대해 간단히 이야기 + +35 +00:02:17,490 --> 00:02:21,860 + 기술하지만, 아주 간단히 우리는 또한 활성화 기능을 보았고, + +36 +00:02:21,860 --> 00:02:24,830 + 신경망 여기 그래서 문제 전반에 걸쳐 자신의 분포 I + +37 +00:02:24,830 --> 00:02:31,370 + 우리는 이러한 초기 가중치를 선택해야하고 특히 전화가 참조 + +38 +00:02:31,370 --> 00:02:34,930 + 기다리는 사람들을 방법을 큰 규모는 처음에하고 우리는 보았다 + +39 +00:02:34,930 --> 00:02:38,260 + 이 경우 그 그 무게는 신경에​​ 활성화 너무 작은 경우 + +40 +00:02:38,259 --> 00:02:41,909 + 네트워크는 깊은 네트워크가 0으로 이동이 있고 당신이 설정 한 경우 그 기술은 그대로 + +41 +00:02:41,909 --> 00:02:45,129 + 그들 모두보다 높은에 가능성이 대신 폭발하고 그래서 당신은 끝낼 + +42 +00:02:45,129 --> 00:02:48,939 + 다른 네트워크 슈퍼 포화 또는 해당 단지에 대한 모든 네트워크와 끝까지 + +43 +00:02:48,939 --> 00:02:54,189 + 0과 1 그래서 그 규모는 우리가 들여다 설정하는 매우 매우 까다로운 일이다 + +44 +00:02:54,189 --> 00:02:59,579 + 당신이에 사용하는 것은 합리적인 종류를 제공 초기화 + +45 +00:02:59,580 --> 00:03:03,290 + 형성하고는 기본적으로 대략 좋은 활동 활성화 또는를 제공합니다 + +46 +00:03:03,289 --> 00:03:06,459 + 훈련의 시작 부분에서 네트워크를 통해 활성화의 분포 + +47 +00:03:06,459 --> 00:03:10,959 + 그리고, 우리는 많이 경감이 일에 가장 정상화에 들어갔다 + +48 +00:03:10,959 --> 00:03:14,120 + 실제로 제대로 그 기술을 설정하고 세바스찬 이러한 두통의 + +49 +00:03:14,120 --> 00:03:16,689 + 법안이에게 그들이 필요 없어 훨씬 더 강력한 선택한다 + +50 +00:03:16,689 --> 00:03:20,550 + 정확하게 맞 초기 규모를 얻고 우리는 현재의 모든 호출에 갔다 + +51 +00:03:20,550 --> 00:03:23,620 + 우리는 잠시 동안 그것에 대해 이야기하고 우리는 학습에 대해 이야기 + +52 +00:03:23,620 --> 00:03:26,920 + 당신이 실제로 할 방법에 대한 팁과 트릭의 종류를 표시하려고에 의해 처리 + +53 +00:03:26,919 --> 00:03:29,809 + 말했다 당신이 그들을 어떻게 또한 제대로 훈련받을 방법이 신경망 + +54 +00:03:29,810 --> 00:03:34,860 + 위반에 걸쳐 실행 방법 천천히 시간이 지남에 너무 렌더링 일어나 + +55 +00:03:34,860 --> 00:03:37,769 + 우리는 몇 가지로 갈거야 그래서이 시간에 대한 모든 것을 지난 시간에 이야기 + +56 +00:03:37,769 --> 00:03:41,060 + 위 특정 매개 변수에 훈련 신경 네트워크의 나머지 항목 + +57 +00:03:41,060 --> 00:03:44,989 + 계획은 나는 대부분의 부분을 생각하고 우리는 내 난 앙상블 드롭 아웃에 대해 조금 얘기하자 + +58 +00:03:44,989 --> 00:03:49,480 + 나는 그 어떤 행정 일에 난 내 길을 뛰어 등등 전에 있도록 + +59 +00:03:49,479 --> 00:03:53,509 + 잊고 반드시 그렇게 + +60 +00:03:53,509 --> 00:03:58,030 + 차 업데이트 신경망을 훈련에 프로세스가 거기에 있기 때문에 + +61 +00:03:58,030 --> 00:04:01,199 + 이것은 정말 당신이 위반에 대해는 그 모습에 의사입니다 + +62 +00:04:01,199 --> 00:04:04,419 + 법에 심각한 그라데이션 내가 얘기 공연 차 업데이트 + +63 +00:04:04,419 --> 00:04:08,030 + 매개 변수 업데이트는 특히 여기 어디에서이 마지막 줄보고 
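
위에서 복습한 4단계(미니배치 샘플링, 순전파로 손실 계산, 역전파로 그래디언트 계산, 파라미터 업데이트)를 간단한 선형 회귀를 예로 가정하고 한 루프로 옮기면 다음과 같다. 데이터와 모델은 설명을 위한 가상의 예시이다.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))   # 가상의 데이터셋
y = rng.standard_normal(1000)
W = np.zeros(10)
learning_rate = 1e-2

for step in range(100):
    idx = rng.integers(0, len(X), size=64)     # 1. 미니배치 샘플링
    Xb, yb = X[idx], y[idx]
    pred = Xb @ W                              # 2. 순전파
    loss = np.mean((pred - yb) ** 2)           #    손실 계산
    dW = 2 * Xb.T @ (pred - yb) / len(yb)      # 3. 역전파 (여기서는 해석적 그래디언트)
    W -= learning_rate * dW                    # 4. vanilla SGD 업데이트
```
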
있었다 + +64 +00:04:08,030 --> 00:04:12,129 + 우리가 만들려고하는보다 복잡한 그 어디 그래서 지금 우리가 무슨 일을하는지 + +65 +00:04:12,129 --> 00:04:17,129 + 학교는 단지 일을 읽고. 우리는 내 컴퓨터 및 우리에 그 휴식을 취할 곳 + +66 +00:04:17,129 --> 00:04:21,639 + 그냥 우리의 주요 요인의 학습 속도에 의해 확장 및 곱셈 우리는 할 수 있습니다 + +67 +00:04:21,639 --> 00:04:23,159 + 훨씬 더 정교한 방법으로 우리 + +68 +00:04:23,160 --> 00:04:27,960 + 해당 날짜 등등에 나는 지난 몇 강의 곳에서 간단히 이미지를 플래시 + +69 +00:04:27,959 --> 00:04:30,759 + 서로 다른 매개 변수를 업데이트 방식을 볼 수 있습니다 얼마나 빨리 그들은 실제로 + +70 +00:04:30,759 --> 00:04:35,129 + 여기에 간단한 손실 함수를 최적화 그래서 특히 것을 볼 수 있습니다 STD + +71 +00:04:35,129 --> 00:04:38,550 + 우리가 여기에 네 번째 줄에 현재 사용하고 그 발 빠르게 그리고 무엇 인 + +72 +00:04:38,550 --> 00:04:41,710 + 그 사실 때문에 그들 모두의 가장 느린 하나입니다 것을 볼 수 있습니다 당신에게 책을 읽어 + +73 +00:04:41,709 --> 00:04:45,139 + 당신은 거의 이제까지 단지 기본 양육권을 사용하지 연습하고 더 나은 방식이다 것을 우리 + +74 +00:04:45,139 --> 00:04:48,979 + 우리가 이제 무엇을 살펴 보자 구조에서 이들에 갈거야 사용할 수 있습니다 + +75 +00:04:48,980 --> 00:04:54,810 + 문제는 너무 느려 약간이 특정을 고려하는 이유 하사관 함께 + +76 +00:04:54,810 --> 00:04:58,589 + 우리는 손실 함수 액면 세트가 여기에 인위적 예 우리 + +77 +00:04:58,589 --> 00:05:02,099 + 손실은 다른 것보다 훨씬 더 높은 긴 한 방향에 반대 + +78 +00:05:02,100 --> 00:05:05,500 + 여기 방향 때문에 기본적으로이 손실 함수는 매우 얕은입니다 + +79 +00:05:05,500 --> 00:05:10,199 + 수평으로하지만, 매우 수직으로 가파른 물론이을 최소화하기 위해 우리가 할 + +80 +00:05:10,199 --> 00:05:13,469 + 우리가 렉스 볼티모어에있어 지금이 가리키는 최소려고 + +81 +00:05:13,470 --> 00:05:19,240 + 우리가 행복 만의 궤도 무엇에 대해 생각 어디 웃는 얼굴 + +82 +00:05:19,240 --> 00:05:22,980 + 이 모두 X 및 Y 방향이다 + +83 +00:05:22,980 --> 00:05:30,650 + 주디 우리가 같은 그 표정이 풍경을 최적화하려고하면 그래서 뭐 + +84 +00:05:30,649 --> 00:05:35,729 + 그것은 수평과 같이 수직으로 난 그렇게 누군가의 엉덩이를 볼 것입니다 무슨 + +85 +00:05:35,730 --> 00:05:43,540 + 당신은 거기 계획하는 이유는 그래서 최대 반송 가서 아래처럼 있어요된다 + +86 +00:05:43,540 --> 00:05:52,030 + 그 이유는 많은 진전를 잘 기본적으로이가되게하지 않습니다 + +87 +00:05:52,029 --> 00:05:56,969 + 우리가 그라데이션 볼 포럼 수평 우리는 복사가 있음을 볼 수 + +88 +00:05:56,970 --> 00:06:00,680 + 이 수평 얕은 기능을하지만 우리가이 있기 때문에 매우 작은 + +89 +00:06:00,680 --> 00:06:03,439 + 큰 평가는 무슨 일이 일어날에 관해서는 매우 가파른 기능이기 때문에 + +90 +00:06:03,439 --> 00:06:06,389 + 당신은 이들 종류의 경우에서 거리를 출시하고이 끝낼 때 + +91 +00:06:06,389 --> 00:06:10,250 + 당신이 수평 방향으로 너무 느린거야 패턴의 종류 만 + +92 +00:06:10,250 --> 00:06:13,300 + 이에 결국 때문에 당신은 너무 빠르고 수직 방향을거야 + +93 +00:06:13,300 --> 00:06:17,918 + 올해 하나 이런 상황 또는 치료의 방법을 우리는 기억으로 모멘텀 그래서 + +94 +00:06:17,918 --> 00:06:22,189 + 기세 업데이트에 대한 업데이트는 다음과 같은 방법으로 우리의 업데이 트를 변경됩니다 + +95 +00:06:22,189 --> 00:06:25,319 + 그래서 지금 우리는 단지 그라데이션을 구현하고 + +96 +00:06:25,319 --> 00:06:28,409 + 그라데이션을 복용하고 우리는에 의해 우리의 현재 위치를 통합하고 + +97 +00:06:28,410 --> 00:06:34,220 + 날짜에 등급 대신 우리는 우리가 계산 된 그라데이션을거야 및 + +98 +00:06:34,220 --> 00:06:36,449 + 대신 직접 위치를 통합 + +99 +00:06:36,449 --> 00:06:40,840 + 우리는 내가 너무 속도로 떠날 수있는이 변수 V를 증가거야 + +100 +00:06:40,839 --> 00:06:44,049 + 우리는 우리가 증가 그래서 약간의 이유를 보게 될 것입니다 + +101 +00:06:44,050 --> 00:06:48,020 + 속도의 변수가 될 대신 대신에 우리는 기본적으로 가입이 구축하고 + +102 +00:06:48,019 --> 00:06:53,278 + 과거에 일부 신빙성을 지수 및 그 위치를 통합하는거야 + +103 +00:06:53,278 --> 00:06:58,610 + 여기에이 새로운는 0과 1 사이의 숫자의 종류로 행복 프라이머 및 음소거입니다 + +104 +00:06:58,610 --> 00:07:03,629 + 그리고 이전 BE되었다 하 등 화면 구배에 첨가 하였다 + +105 +00:07:03,629 --> 00:07:07,180 + 당신은 매우 물리적으로 해석 할 수있는 업데이트 모멘텀에 대한 좋은 데요 + +106 +00:07:07,180 --> 00:07:14,310 + 조건 및 다음과 같은 방법으로 기본적으로 모멘텀 업데이트에 해당하는 사용 + +107 +00:07:14,310 --> 00:07:18,899 + 할인 목록을 해석하는 정말 대담한 구름이 라운드가 허용하는 + +108 +00:07:18,899 --> 00:07:22,459 + 프리이 경우 그래디언트가 숲이라는 입자 + +109 +00:07:22,459 --> 00:07:26,408 + 느낌 그래서이 문서는 대신 그라데이션 약간의 힘을 느끼고있다 + +110 +00:07:26,408 --> 00:07:31,158 + 힘이 상당하므로 직접 위치를 물리학이 힘을 통합 + +111 +00:07:31,158 --> 00:07:36,019 + 이 때문에 가속에 가속이 우리가 경쟁하고있는 
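
위에서 설명한 momentum 업데이트(속도 변수를 쌓아 올리고, 그 속도로 파라미터를 적분)를 numpy 로 옮긴 스케치이다. 값들은 예시로 가정했다.

```python
import numpy as np

x = np.random.randn(10)        # 파라미터 (예시)
dx = np.random.randn(10)       # 현재 위치에서의 그래디언트 (예시)
v = np.zeros_like(x)           # 속도(velocity), 0 으로 초기화
mu = 0.9                       # "마찰"에 해당하는 계수 (보통 0.5 ~ 0.99)
learning_rate = 1e-2

v = mu * v - learning_rate * dx  # 그래디언트(힘)를 속도에 적분
x += v                           # 속도로 위치(파라미터)를 적분
```
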
것입니다 + +112 +00:07:36,019 --> 00:07:39,938 + 그래서 속도는 여기에 다음 새 배의 가속도에 의해 통합됩니다 + +113 +00:07:39,939 --> 00:07:43,039 + 그 경우에, 마찰의 해석을 가지고 그 때문에 매 + +114 +00:07:43,038 --> 00:07:47,759 + 반복은 약간 둔화이 새로운 시간이 될 직관적 경우 아니었다했다 + +115 +00:07:47,759 --> 00:07:51,550 + 그냥 법 주위에 있었기 때문에 휴식을 오지로 다음 굵게가 않습니다 + +116 +00:07:51,550 --> 00:07:54,509 + 영원히 표면과가에 정착 할 에너지의 손실이 없을 것 + +117 +00:07:54,509 --> 00:07:58,158 + 손실 기능 등 최종 운동량 업데이트는이 중임 + +118 +00:07:58,158 --> 00:08:01,810 + 최적화의 물리적 해석 그러나 우리는 볼이 약 롤링이 + +119 +00:08:01,810 --> 00:08:08,249 + 그리고 시간이 지남에 따라 둔화 것 등이 작동하는 방식은 아주 좋은 무엇이다 + +120 +00:08:08,249 --> 00:08:11,669 + 이 업데이트에 대한 당신은 특히이 속도와를 구축 끝으로 + +121 +00:08:11,668 --> 00:08:14,959 + 얕은 방향을보고 매우 쉽게 당신이 얕은이있는 경우 만 + +122 +00:08:14,959 --> 00:08:18,449 + 일관된 방향은 다음 모멘텀 업데이트는 천천히 속도를 구축 할 것입니다 + +123 +00:08:18,449 --> 00:08:21,360 + 당신이 얕은에서 위로 가속화 결국 방향 벡터 + +124 +00:08:21,360 --> 00:08:24,999 + 방향하지만 무슨 일이 일어날 매우 가파른 방향으로 당신의 시작입니다 + +125 +00:08:24,999 --> 00:08:28,919 + 과정은 일반적으로 약하지만 당신은 항상 다른 사람을 뽑아되고있어 + +126 +00:08:28,918 --> 00:08:32,429 + 중심을 향해 및 감쇠 및 진동의 종류와 방향 + +127 +00:08:32,429 --> 00:08:36,338 + 그래서 그것은 종류의 가파른 방향이 진동을 찍힌 것 중간 및 + +128 +00:08:36,339 --> 00:08:41,139 + 종류의이 과정을 장려하고 일관성이있어 고무적 + +129 +00:08:41,139 --> 00:08:44,889 + 얕은 방향과는 컨버전스의 개선 끝나는 이유입니다 + +130 +00:08:44,889 --> 00:08:49,600 + 대부분의 경우는 그래서 여기 시각화, 예를 들어 우리는 SED 업데이트에서 참조 + +131 +00:08:49,600 --> 00:08:53,459 + 모멘텀 업데이트는 녹색 아니고, 그래서 당신은 녹색 하나를 어떻게 볼 수 있습니다 + +132 +00:08:53,458 --> 00:08:57,008 + 신발을 통해 공격하면이 모든 홍보를 구축하기 때문에 + +133 +00:08:57,009 --> 00:09:00,909 + 최소 오버 슈트하지만 그것은 결국 갤런 변환 끝과 + +134 +00:09:00,909 --> 00:09:04,169 + 물론 그것은 촬영 끝났어하지만이 나온다 일단 당신은 그것의 것을 볼 수있다 + +135 +00:09:04,169 --> 00:09:07,879 + 가 결국 업데이트처럼 그냥 기본보다 훨씬 더 빨리 수렴 + +136 +00:09:07,879 --> 00:09:11,230 + 문을 너무 많이 구축하면 결국 경우보다가 빨리 얻을 것보다 + +137 +00:09:11,230 --> 00:09:17,110 + 당신은 속도가 모멘텀 업데이트는 가고있다있어하지 않았다 + +138 +00:09:17,110 --> 00:09:20,430 + 운동량의 특정 변이 난 그냥 물어보고 싶은게 조금 등장 + +139 +00:09:20,429 --> 00:09:34,289 + 나는 프라이머와 같은 단일있어 언제 모멘텀에 대한 질문은 업데이트 + +140 +00:09:34,289 --> 00:09:40,078 + 보통 때때로 어떤 약 8.5 4.9의 값과 보통 사람들 소요 + +141 +00:09:40,078 --> 00:09:43,219 + 그것은 슈퍼 혜성은 아니지만 사람들이 때때로 리드 (25) 2.99에서 + +142 +00:09:43,220 --> 00:09:54,200 + 천천히 시간이 지남에 있지만, 그것은 단지 하나의 숫자입니다 + +143 +00:09:54,200 --> 00:09:57,180 + 네 그래서 당신은 작은 학습 속도하지만 문제가있는 사람을 방지 할 수 있습니다 + +144 +00:09:57,179 --> 00:10:03,000 + 당신이 있다면 느린 학습 속도는 모든 방향에 전 세계적으로 적용된다 + +145 +00:10:03,000 --> 00:10:06,070 + 그라데이션 등은 당신이에 진전을하지 않는다 기본적 것이다 + +146 +00:10:06,070 --> 00:10:09,390 + 수평 방향으로 오른쪽 당신은 많은 것을 얻을 수 없겠죠하지만 그것은 당신을 데려 갈 것이다 + +147 +00:10:09,389 --> 00:10:12,710 + 영원히 갈 수평으로 몇 가지 작은 학습은 떨어져 무역의이 종류는 말한다 + +148 +00:10:12,710 --> 00:10:25,350 + 자신의 질문에 수정을 설명하는 선택 방법을 초기화하는 방법입니다 + +149 +00:10:25,350 --> 00:10:29,050 + 일반적으로 10을 상실하고 결국 있기 때문에 문제가 너무 많이하지 않습니다 + +150 +00:10:29,049 --> 00:10:32,490 + 처음 몇 단계를 구축하고 당신은 당신이 경우 다음과 같이 끝 + +151 +00:10:32,490 --> 00:10:35,480 + 이 기하 급수적으로의 당신은 기본적으로 그 볼이 재발을 지출 + +152 +00:10:35,480 --> 00:10:39,330 + 이전 인사의 일부를 부패 그래서 당신은 당신이 당신에게 그것을 가지고 한 번 + +153 +00:10:39,330 --> 00:10:46,020 + 모멘텀의 특정 열 때문에 특히 변화라는 것을 가지고있다 + +154 +00:10:46,019 --> 00:10:53,449 + 모멘텀과 그라데이션 하강 여기에 생각에 아저씨는 우리가이입니다 + +155 +00:10:53,450 --> 00:10:57,550 + 보통 운동량 여기 방정식 그것에 대해 생각하는 방법이다 당신의 + +156 +00:10:57,549 --> 00:10:59,789 + 초과 정말 두 부분으로 추천 + +157 +00:10:59,789 --> 00:11:03,279 + 특정 방향으로 약간의 힘을 너무 구축하는 것이의 한 부분이있다 + +158 +00:11:03,279 --> 00:11:06,799 + 즉, 새로운 시대를 그린의 모멘텀 단계이고 그 곳이다 + +159 +00:11:06,799 --> 00:11:09,959 + 모멘텀은 현재를 수행하기 위해 노력하고 두 
번째가 + +160 +00:11:09,960 --> 00:11:12,610 + 그라디언트에서 기여 기울기는이 방법으로 당기는 + +161 +00:11:12,610 --> 00:11:17,450 + 손실 함수의 감소와 실제 단계는 벡터 합인 끝낸다 + +162 +00:11:17,450 --> 00:11:21,350 + 그래서 블루만큼 당신이 결국 두 사람은 그냥 녹색 더하기 빨간색의 + +163 +00:11:21,350 --> 00:11:24,840 + 생각하지만 필요한 모멘텀이 실제로 더 나은 작업 끝과 + +164 +00:11:24,840 --> 00:11:29,629 + 다음과 같이 우리는 관계없이 현재의 입력이 무엇의이 시점에서 알 + +165 +00:11:29,629 --> 00:11:33,439 + 우리에게 그래서 우리는 최대 아직 대해 경쟁하지 않은 그러나 우리는 우리가 어떤을 구축 한 것을 알고있다 + +166 +00:11:33,440 --> 00:11:37,240 + 모멘텀과 우리는 우리가 확실히 확인 그래서이 녹색 방향을거야 알고 + +167 +00:11:37,240 --> 00:11:41,220 + 우리는 확실히 여기이 그린 밸리 성분을거야 우리 + +168 +00:11:41,220 --> 00:11:45,310 + 현재의 자리 네 스테 로프 모멘텀을 수행 앞서 대신보고 싶어 + +169 +00:11:45,309 --> 00:11:49,379 + 화살표의 상단이 시점에서이 시점 기울기를 평가하므로 + +170 +00:11:49,379 --> 00:11:53,679 + 당신이와 끝까지 우리가 우리가가는거야 알고 여기에 다음과 같은 차이 + +171 +00:11:53,679 --> 00:11:57,089 + 왜 그냥 같은 것은 그 부분에 도착하기 앞서 살펴 어쨌든이 길을 갈 + +172 +00:11:57,090 --> 00:12:00,420 + 객관적이고 그 시점에서 녹색을 평가하고 그것은 물론 당신이있어하지 않습니다 + +173 +00:12:00,419 --> 00:12:02,309 + 다른에이기 때문에 독서는 다소 차이가있을 것입니다 + +174 +00:12:02,309 --> 00:12:05,669 + 로스 함수의 위치와이 한 단계 앞서 당신에게 약간 더 나은를 제공 + +175 +00:12:05,669 --> 00:12:06,259 + 방향 + +176 +00:12:06,259 --> 00:12:11,109 + 저기 수 있습니다 당신은 당신이 할 수있는 지금 그것을 약간 다른 업데이 트를 얻을 + +177 +00:12:11,109 --> 00:12:14,379 + 이론적으로이 사실에 더 나은 이론 보장을 즐기는 것을 보여 + +178 +00:12:14,379 --> 00:12:18,069 + 수렴 속도뿐만 아니라이 이론뿐만 아니라 실제의 사실과 + +179 +00:12:18,068 --> 00:12:23,068 + 거의 항상 차이가 너무 좋아 그냥 순간보다 더 잘 작동 약 + +180 +00:12:23,068 --> 00:12:28,358 + 다음 해에 그 코드를하지만 여전히 우리의 표기법처럼 같은 작성한된다 + +181 +00:12:28,359 --> 00:12:29,589 + 시간이 + +182 +00:12:29,589 --> 00:12:33,089 + 당신이 현재하고있는 이전의 속도 벡터 및 구배를 돌연변이 + +183 +00:12:33,089 --> 00:12:37,629 + 평가하고 우리는 여기에 업데이트를하고 있으므로 필요한 업데이트 만을 + +184 +00:12:37,629 --> 00:12:41,720 + 차이는이 새로운 더한 새로운 BTW 시간을 뺀 11의 뜻 여기 보류했다 + +185 +00:12:41,720 --> 00:12:44,949 + 우리는이에 약간 다른 위치에서 평가 한 그라데이션을 평가 + +186 +00:12:44,948 --> 00:12:48,278 + 위치를 미리보고 그래서 강한 모멘텀에 정말 그것은 거의 + +187 +00:12:48,278 --> 00:12:51,698 + 항상 지금 약간의 기술은 내가 안되는 여기 거기되는 작품 + +188 +00:12:51,698 --> 00:12:57,068 + 너무 많이 들어갈 것 같네요하지만 사실 그 불편할 약간 있어요 + +189 +00:12:57,068 --> 00:13:00,418 + 일반적으로 우리는 향후에 대해 생각하고 뒤로 우리는 결국 무엇 때문에 통과 + +190 +00:13:00,418 --> 00:13:04,288 + 으로는 최대 프라이 머리 승리 데이터와 그 때의 기울기를 갖지만 + +191 +00:13:04,288 --> 00:13:09,088 + 당신은 떨어져에서 사육 매개 변수 및 그라데이션을 가지고 우리를 원하는 경우는 없습니다 + +192 +00:13:09,089 --> 00:13:12,600 + 다른 점은 그래서 꽤 단지 사이의 간단한 API처럼에 맞지 않는 + +193 +00:13:12,600 --> 00:13:16,019 + 코드를 갖는 그래서 방법이 밝혀 내가 정말하고 싶지 않아 + +194 +00:13:16,019 --> 00:13:19,899 + 아마이에 너무 많은 시간을 소비하지만, 기본적으로 변수를 할 수있는 방법이 + +195 +00:13:19,899 --> 00:13:23,379 + 변압기는 통지 일부 재배치를 수행 살이 찐를 얻을 당신은 얻을 + +196 +00:13:23,379 --> 00:13:26,079 + 더욱 새로 업데이트의처럼 보이는 뭔가 그냥 수 + +197 +00:13:26,078 --> 00:13:29,538 + 당신이 결국 때문에 감동 에드 교환 아만다 마틴에서 스 와이프 + +198 +00:13:29,538 --> 00:13:34,119 + 만 그라디언트 위축을 필요로하고 당신을 무언가를 업데이트하고이 기​​능은 + +199 +00:13:34,119 --> 00:13:35,209 + 정말 앞서 보여요 + +200 +00:13:35,208 --> 00:13:38,159 + 매개 변수의 버전들은 그냥 원시 매개 변수 벡터에 있기 때문에 + +201 +00:13:38,159 --> 00:13:40,608 + 당신이 노트에 갈 수있는 단지 전문적이 체크 아웃하기 + +202 +00:13:40,609 --> 00:13:46,709 + 확인 그래서 여기에 네 스테 로프 가속 독서는 마젠타에 당신은 볼 수 있습니다 + +203 +00:13:46,708 --> 00:13:50,208 + 원래 가게를 통해 여기 모멘텀하지만 많은하지만 가속 때문에 아저씨 + +204 +00:13:50,208 --> 00:13:53,958 + 모멘텀은 당신이 주위에 더 많은 곱슬 있다고 볼 수 있습니다 앞서이 한 단계가 + +205 +00:13:53,958 --> 00:13:57,738 + 신속하고 그 때문에 모든이 작은 기여 약간 더 낫다 + +206 +00:13:57,739 --> 00:14:01,619 + 당신이하려고합니다 어디에서 그라데이션 합산 결국하고 거의 항상합니다 + +207 +00:14:01,619 --> 00:14:08,600 + UD 모멘텀이이었다 최근까지 수 있도록 빠른 그래서 필요의 수렴 
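
위의 변수 치환 이야기를 반영하면, Nesterov momentum 은 "한 걸음 앞"에서 그래디언트를 평가하는 대신 아래처럼 현재 파라미터에서 평가한 그래디언트만으로 쓸 수 있다. 강의 노트에 나오는 형태를 따른 스케치이며, 변수들은 예시로 가정했다.

```python
import numpy as np

x = np.random.randn(10)
dx = np.random.randn(10)       # x 에서 평가한 그래디언트라고 가정
v = np.zeros_like(x)
mu, learning_rate = 0.9, 1e-2

v_prev = v.copy()
v = mu * v - learning_rate * dx       # 보통의 momentum 과 같은 속도 갱신
x += -mu * v_prev + (1 + mu) * v      # lookahead 를 흡수한 위치 갱신
```
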
+ +208 +00:14:08,600 --> 00:14:11,329 + 훈련 상용 네트워크와 많은 사람들의 표준 기본 방법 + +209 +00:14:11,328 --> 00:14:14,658 + 아직이에서 볼 수있는 일반적인 일 업데이트하기 위해 잠시를 사용하여 훈련 + +210 +00:14:14,658 --> 00:14:17,610 + 연습과 필요한 경우 더 나은 + +211 +00:14:17,610 --> 00:14:20,990 + 그래서 잡지는 여기에 일주일을 의미합니다 + +212 +00:14:20,990 --> 00:14:44,350 + 당신이 그것에 대해 생각하는지 질문은 그래서 나는 그것이 약간 잘못된 생각 + +213 +00:14:44,350 --> 00:14:46,990 + 만 일반적으로 생각 신경 네트워크에 대한 옵션을 많이 생각 + +214 +00:14:46,990 --> 00:14:50,350 + 이 미친 계곡과 지역 최소값을 많이 사방 실제로는 아니다 + +215 +00:14:50,350 --> 00:14:53,670 + 그것은 그 보는 올바른 방법은 개념이 할 수있는 올바른 접근이다 + +216 +00:14:53,669 --> 00:14:56,278 + 당신의 마음에 당신은 아주 작은 신경 네트워크와 사람들이 생각하는 데 사용 때 + +217 +00:14:56,278 --> 00:14:59,769 + 지역 최소값 것을 문제 및 최적화 네트워크 그러나 실제로집니다 + +218 +00:14:59,769 --> 00:15:04,269 + 당신이 당신의 모델을 확장으로 최근의 이론적 작업의 많은 아웃 + +219 +00:15:04,269 --> 00:15:10,740 + 이 지역의 최소 갈수록 문제의 사진에 있도록되어 있습니다 + +220 +00:15:10,740 --> 00:15:14,389 + 생각하고있는 것은 지역의 최소값 많이있다하지만 그들은 같은에 대한 모든 것 + +221 +00:15:14,389 --> 00:15:18,958 + 실제로이 때문에 이러한 기능의 신경을보고 더 나은 방법 손실 + +222 +00:15:18,958 --> 00:15:22,078 + 실제로 연습 네트워크와 나는 그릇 등 같은 훨씬 더 찾고 있어요 + +223 +00:15:22,078 --> 00:15:25,599 + 대신 미친 계곡 풍경과 당신은 여전히​​ 당신으로 그것을 표시 할 수 있습니다 + +224 +00:15:25,600 --> 00:15:28,360 + 신경망 최선보다는 최악의 등의 차이 + +225 +00:15:28,360 --> 00:15:29,259 + 지역 최소값 + +226 +00:15:29,259 --> 00:15:32,448 + 실제로 좀 좋아도 일부 연구자와 시간이 지남에 따라 아래로 축소 + +227 +00:15:32,448 --> 00:15:36,120 + 기본적으로이 매우 소규모 네트워크에서 일어나는 나쁜 지역 최소값가 없습니다 + +228 +00:15:36,120 --> 00:15:41,409 + 당신이 다른과 초기화하면 그렇게 연습에서 실제로 당신이 찾는 것은 + +229 +00:15:41,409 --> 00:15:44,610 + 임의의 초기화는 거의 항상 같은처럼 같은 대답을 받고 결국 + +230 +00:15:44,610 --> 00:15:48,009 + 결국 손실은 그래서 당신은 같은 나쁜 지방의 최소값은 없습니다 결국하지 마십시오 + +231 +00:15:48,009 --> 00:15:57,429 + 때로는 특히 당신이 질문을 네트워크 질문을 시작했다 때 + +232 +00:15:57,429 --> 00:16:10,849 + 네 스테 로프 진동 기능을하는 부분으로 + +233 +00:16:10,850 --> 00:16:14,819 + 확인 당신이 여러 슬라이드로 이동하려고했다가에 의해 아마했다 점프 있다고 생각 + +234 +00:16:14,818 --> 00:16:19,849 + 약간의 두 번째 또는 두 가지 방법이 괜찮 날 정말 또 다른 업데이트에 뛰어 보자 + +235 +00:16:19,850 --> 00:16:23,069 + 이 접지라고하고 원래 개발 된 사례에서 볼 것이 일반적 + +236 +00:16:23,068 --> 00:16:25,969 + 다음 볼록 최적화 문학과는 가지에 포팅되었다 + +237 +00:16:25,970 --> 00:16:30,019 + 다른 큰 업데이트로 보이는 있도록 신경망 사람들은 가끔 사용 + +238 +00:16:30,019 --> 00:16:30,560 + 다음 + +239 +00:16:30,559 --> 00:16:35,619 + 우리가 일반적으로 몇 가지 기본적인 확률 그라데이션 하강을 참조로 우리는이 업데이트가 + +240 +00:16:35,620 --> 00:16:37,500 + 여기에 여기에 큰 시간을 학습 + +241 +00:16:37,500 --> 00:16:42,259 + 그라데이션하지만 지금 우리는이 그라데이션 있지만이 추가 변수를 확장하고 + +242 +00:16:42,259 --> 00:16:47,589 + 우리는 있었다이 현금 구축하고 있음을 여기에 메모를 축적 유지하는 것이 + +243 +00:16:47,589 --> 00:16:52,199 + 그라데이션 사각형의 합이 캐시는 양수 만 포함 + +244 +00:16:52,198 --> 00:16:55,599 + 여기 캐시 변수가 같은 크기의 합작 투자 참고하여 + +245 +00:16:55,600 --> 00:17:00,730 + 개인 차원에서 구축 요인 등이 현금과 최대이었다 + +246 +00:17:00,730 --> 00:17:03,839 + 그라디언트 또는 제곱의 합을 추적하는 데 우리는 때때로을에 좋아 + +247 +00:17:03,839 --> 00:17:07,679 + 이들의 두 번째 순간이라는 Oncenter은 잠시 시간을내어 그래서 우리는 계속 + +248 +00:17:07,679 --> 00:17:12,409 + 이 현금을 구축하고 우리가 요소를 분할하는 이유에 의해이 단계 기능입니다 + +249 +00:17:12,409 --> 00:17:21,709 + 그 이유는 그래서 광장 현금의 루트 그래서 무슨 일이 일어나고 끝이 + +250 +00:17:21,709 --> 00:17:26,189 + 사람들은 그것을 푸르르의 푸르르 매개 변수 적응 학습 율법 때문에 호출 + +251 +00:17:26,189 --> 00:17:31,090 + 모든 단일 제품 이제 매개 변수 공간의 모든 단일 차원 + +252 +00:17:31,089 --> 00:17:34,569 + 동적으로 내용에 따라 조정됩니다 같은 학습 속도의 자신의 종류가 + +253 +00:17:34,569 --> 00:17:39,079 + 재료의 종류이 너무 그 규모면에서 볼 수있다 + +254 +00:17:39,079 --> 00:17:42,859 + 우리의 경우 특히이 경우 사인으로 발생하는 해석 + +255 +00:17:42,859 --> 00:17:47,019 + 이 어떤 수평 및 수직 방향으로 발생하지만,이 종류의 작업을 수행 + +256 +00:17:47,019 --> 00:17:51,359 + 역학 
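
위에서 소개한 Adagrad 의 핵심(그래디언트 제곱의 누적 캐시로 파라미터별 스텝 크기를 조정)은 아래 두 줄로 요약된다. 예시 값을 가정한 스케치이다.

```python
import numpy as np

x = np.random.randn(10)
dx = np.random.randn(10)
cache = np.zeros_like(x)          # 그래디언트 제곱의 누적 합
learning_rate, eps = 1e-2, 1e-7   # eps 는 0 으로 나누는 것을 막는 작은 수

cache += dx ** 2
x += -learning_rate * dx / (np.sqrt(cache) + eps)  # 파라미터별 적응적 스텝
```
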
+ +257 +00:17:51,359 --> 00:18:03,789 + 우리가 수직으로 큰 것을 큰 경사를 가지고 당신은 무엇을 볼 수 있습니다 + +258 +00:18:03,789 --> 00:18:07,259 + 그라데이션은 현금까지 추가되고 우리는 더 크고로 나누어 결국 + +259 +00:18:07,259 --> 00:18:11,359 + 큰 숫자는 너무 너무 수직 단계에서 더 작은 업데이트를 얻을 것이다 + +260 +00:18:11,359 --> 00:18:14,798 + 우리는 매우 깨끗 큰 영역을 많이보고있는 때문에이 학습을 부패한다 + +261 +00:18:14,798 --> 00:18:18,859 + 속도가 수직 방향뿐만에서 더 작은 단계들을 만들 + +262 +00:18:18,859 --> 00:18:22,009 + 우리가 끝낼 수 있도록 수평 방향으로는 매우 얕은 방향의 + +263 +00:18:22,009 --> 00:18:25,750 + 분모 작은 숫자는 당신이 볼 수 있다는 Y에 대한 상대 + +264 +00:18:25,750 --> 00:18:29,058 + 치수는 우리가이 성능 조정이 있도록 빠른 진행을 끝낼거야 + +265 +00:18:29,058 --> 00:18:35,058 + 이 회계의 효과는 기울기와 알라 신의 뜻 방향을 당신에게 + +266 +00:18:35,058 --> 00:18:40,319 + 실제로 수직 대신 바로 그때 훨씬 더 큰 학습을 할 수 있습니다 + +267 +00:18:40,319 --> 00:18:48,048 + 방향 및하지만 그래서는 대학원이없는 한 문제이다는 생각이 무엇인지 + +268 +00:18:48,048 --> 00:18:53,009 + 우리가 원한다면 우리는이 위치를 업데이트하고, 상기 공정 크기로 발생 + +269 +00:18:53,009 --> 00:18:55,900 + 오랜 시간 동안 전체 깊은 신경망에게 지분을 훈련하고 우리는있어 + +270 +00:18:55,900 --> 00:19:01,970 + 그래서 물론 정도에 무슨 일이 일어날 이번 여름에 오랜 시간 훈련 + +271 +00:19:01,970 --> 00:19:05,169 + 현금은 이러한 모든 긍정적 인 번호를 추가 모든 시간을 구축 결국 + +272 +00:19:05,169 --> 00:19:09,100 + 분모에 들어가는 당신은 말 그대로 단지의 경우 20이고 당신은 중지 끝 + +273 +00:19:09,099 --> 00:19:14,579 + 완전히 같은 학습 및 그래서 그래서 아니에요 확인 소득세 문제입니다 + +274 +00:19:14,579 --> 00:19:17,970 + 아마도 우리는 그냥 가지 볼링을 최적의 아래로 붕괴 당신이있어 + +275 +00:19:17,970 --> 00:19:21,919 + 수행하지만 신경 네트워크에서 물건 그건 좀 다음 주위에 왕복 같다 + +276 +00:19:21,919 --> 00:19:24,549 + 그에 따라 그림을 시도하는 것은 그래서이 그것을 생각하고 더 좋은 방법처럼 + +277 +00:19:24,548 --> 00:19:28,329 + 것은 당신의 데이터를 얻을 에너지의 지속적인 종류를 필요로하고 그래서 당신은 싶지 않아 + +278 +00:19:28,329 --> 00:19:33,009 + 이었다 사인에 매우 간단한 변화가 그래서 그냥 중단 붕괴 + +279 +00:19:33,009 --> 00:19:37,829 + 최근 제프 힌튼에 의해 제안 여기 아이디어는 대신 유지하는 것입니다 + +280 +00:19:37,829 --> 00:19:42,289 + 완전히 그냥 제곱의 합과 나는 우리가 있는지 확인 주말을 언급 할 수 있었다 + +281 +00:19:42,289 --> 00:19:46,250 + 새는 카운터 카운터 그래서 대신에 우리는 하이킹이 붕괴 속도와 끝까지 + +282 +00:19:46,250 --> 00:19:52,500 + 주 우리는 0.99 % 사각형과 같은 설정 만 제곱의 합이다 + +283 +00:19:52,500 --> 00:19:57,750 + 천천히 누출하지만 괜찮 것은 그래서 우리는 여전히 좋은 동점을 유지하는 우리 + +284 +00:19:57,750 --> 00:20:01,569 + 가파른 또는 포격 방향으로 스텝 크기를 등화 효과 + +285 +00:20:01,569 --> 00:20:05,869 + 우리는 단지 무기를 판매 완전히 20 업데이트를 변환하지 않을거야 + +286 +00:20:05,869 --> 00:20:10,299 + 19 법안 무기 적절한 방법에 대한 역사적 접촉하는 방식이었다입니다 + +287 +00:20:10,299 --> 00:20:11,430 + 우리에게 소개 + +288 +00:20:11,430 --> 00:20:14,340 + 당신은이 방법을 제안 종이 될 것이라고 생각하지만 사실 그것은이었다 + +289 +00:20:14,339 --> 00:20:18,789 + 슬라이드 저스틴 스콧 사라 클래스 불과 몇 년 전 그래서 저스틴 단지 + +290 +00:20:18,789 --> 00:20:22,240 + 삶의 슬라이드이되어 번쩍이 해적 클래스를 제공 하였다 + +291 +00:20:22,240 --> 00:20:25,630 + 게시되지 않은 그러나 이것은 일반적으로 실제로 잘 작동하고 이렇게하고 있어요 + +292 +00:20:25,630 --> 00:20:29,920 + 기본적으로 우리의 수학 문제는 그래서 나는 그 다음 내가 더 잘 같은 본 구현 + +293 +00:20:29,920 --> 00:20:34,060 + 바로 내 최적화 결과와 나는 그 정말 재미라고 생각하고 + +294 +00:20:34,059 --> 00:20:37,769 + 논문뿐만 아니라 내 논문하지만 많은 사람들 다른 논문에서 사실 마이크에 너무 + +295 +00:20:37,769 --> 00:20:44,559 + 코 세라에서 슬라이드를 인용 한 바로 강의 6 슬라이드 그냥 밀어 + +296 +00:20:44,559 --> 00:20:48,389 + 이후 문제는 다음이 지금 실제로 실제 용지이며 더 많은 결과가있다 + +297 +00:20:48,390 --> 00:20:52,300 + 정확히 그가하고있어 및 등등하지만 잠시 동안이 정말 우스웠다에 + +298 +00:20:52,299 --> 00:20:57,609 + 그래서이까지 내 관점에서 우리는 여기 땅이 파란색과 아라미스입니다 볼 수 있습니다 + +299 +00:20:57,609 --> 00:20:58,579 + 소품이입니다 + +300 +00:20:58,579 --> 00:21:02,490 + 블랙 우리는 둘 다 아래로 여기 아주 빨리 덮여 있음을 알 수 + +301 +00:21:02,490 --> 00:21:07,519 + 보다 약간 빠른 변환이 대학원에서이 특정한 경우에 방법과 + +302 +00:21:07,519 --> 00:21:11,589 + 무기 문제 그러나 그것은 항상 당신이 볼 일반적으로 어떤 경우 뭔가 아니다 + +303 +00:21:11,589 --> 00:21:15,839 + 대학원 너무 일찍 중지하고 그대로 실천하면 펜 
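
위에서 설명한 RMSProp 은 Adagrad 의 누적 합을 지수 이동 평균("새는" 카운터)으로 바꾼 것뿐이다. decay_rate 0.99 는 강의에서 언급되는 전형적인 값이며, 나머지는 예시로 가정했다.

```python
import numpy as np

x = np.random.randn(10)
dx = np.random.randn(10)
cache = np.zeros_like(x)
learning_rate, eps = 1e-2, 1e-7
decay_rate = 0.99                 # 캐시가 천천히 "새도록" 하는 계수

cache = decay_rate * cache + (1 - decay_rate) * dx ** 2  # 지수 이동 평균
x += -learning_rate * dx / (np.sqrt(cache) + eps)
```
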
Jillette에 훈련 작품 + +304 +00:21:15,839 --> 00:21:21,329 + 비참 말까지 일반적으로 이러한 이러한 방법 및 질문에서 승리 + +305 +00:21:21,329 --> 00:21:24,509 + 우리의 가장 확률값은 진행에 대해 + +306 +00:21:24,509 --> 00:21:55,150 + 이 방법은에 문제가 매우 가파른 길 당신은 아마하지 않으려한다 + +307 +00:21:55,150 --> 00:21:58,800 + 자신 다운 그래서 어쩌면에서 그 방향으로 매우 빠르게 업데이트 할 말 + +308 +00:21:58,799 --> 00:22:02,220 + 당신이 좋아하는 것 특히이 경우 빠른 이동하지만 당신은 가지에 읽고 + +309 +00:22:02,220 --> 00:22:05,019 + 이 특정 예 그것은 일반적으로 이들의 진정한 종류 아니다 + +310 +00:22:05,019 --> 00:22:09,940 + 어떤 네트워크가 좋은 전략의 구성되지 않은 최적화 풍경 적용 + +311 +00:22:09,940 --> 00:22:22,930 + 처음에 이러한 경우에 + +312 +00:22:22,930 --> 00:22:25,730 + 오 그런데 나는 17이 탐사를 통해 건너하지만 너희들은 할 수 + +313 +00:22:25,730 --> 00:22:30,380 + 희망 (127)가 움직이는 0으로 나누기를 방지하기 위해 단지가 있음을 볼 수 + +314 +00:22:30,380 --> 00:22:34,550 + 다시 높은 소유주에 일반적으로 우리는이 1 ~ 5 또는 6 ~ 7 개에 앉아 + +315 +00:22:34,549 --> 00:22:39,139 + 시작하여 현금처럼 뭔가 그래서 다음에 올 수 0 + +316 +00:22:39,140 --> 00:22:46,540 + 당신이 무엇을 얻을 당신의 생활 학습 속도 (22)이 적응 행동하지만 스케일입니다 + +317 +00:22:46,539 --> 00:22:50,420 + 이 증류 그것의 절대 규모가 컨트롤에 아직도의 또는 + +318 +00:22:50,420 --> 00:22:57,370 + 컨트롤은 여전히​​이 이야기는 단지 물건의 종류를 방해 속도를 배우고 + +319 +00:22:57,369 --> 00:23:00,989 + 다른 프라이머 방법에 대해 상대적인 것 같은 더 볼 수 있습니다 + +320 +00:23:00,990 --> 00:23:12,190 + 당신은 단계 동점 골을하지만 절대 글로벌 단계는 최대 아직있다 + +321 +00:23:12,190 --> 00:23:18,710 + 아주 당신이 바로 설명하는 일을 매우 효율적으로부터의 + +322 +00:23:18,710 --> 00:23:23,038 + 이 전 아주 긴 시간에서 재료의 종류를 얻기 위해 끝 때문에 + +323 +00:23:23,038 --> 00:23:27,750 + 정말 시간 t에서의 발현은 지난 몇의 기능 만있어 + +324 +00:23:27,750 --> 00:23:36,480 + 재료는하지만, 지수 함수 적으로 감쇠 가중 합에 우리가 갈거야 + +325 +00:23:36,480 --> 00:23:43,819 + 다행 마지막 업데이트로 이동 + +326 +00:23:43,819 --> 00:24:03,039 + 기하 급수적으로 가중 방식과 유사하고 그래서 당신은이 할 것 + +327 +00:24:03,039 --> 00:24:09,789 + 나는 사람들을 생각하지 않습니다 또는이에 유한 창 정말 당신에게 나중에 할 수 있습니다 시도 + +328 +00:24:09,789 --> 00:24:19,889 + 당신이 10 최적화 네트워크있을 때를 위해 그 X를 볼 것이다 너무 많은 메모리를 필요 + +329 +00:24:19,890 --> 00:24:23,560 + 예되도록 240,000,000 매개 변수의 메모리가 꽤 많이 복용하고 그래서 + +330 +00:24:23,559 --> 00:24:29,659 + 당신은 우리가있어 다음도 좋아 (10) 이전의 불만을 추적하고 싶지 않아 + +331 +00:24:29,660 --> 00:24:37,540 + 거하면 성능이 저하 된 모멘텀을 결합하면 20 있는지에 가서 주셔서 감사합니다 + +332 +00:24:37,539 --> 00:24:45,269 + 질문이 너무 너무 대충 무슨 일이 일어나고 있는지 슬라이드의 아담이을이다 + +333 +00:24:45,269 --> 00:24:49,119 + 마지막 업데이트는 감옥 실제로 최근에 제안되었다 그리고있다 + +334 +00:24:49,119 --> 00:24:52,959 + 당신이 기세를 알 수 있습니다로 모두의 요소는 가지의 트랙을 유지하고있다 + +335 +00:24:52,960 --> 00:24:57,190 + 잘못된 그라디언트를 요약하여 독서의의 첫 번째 순서의 순간 + +336 +00:24:57,190 --> 00:25:02,350 + 이 지수 일부와 손자를 유지하는 두 번째의 트랙을 유지하고 있습니다 + +337 +00:25:02,349 --> 00:25:07,869 + 순간 기울기와 당신이 종료 아담 아담 업데이트 당신이와 끝까지이다 + +338 +00:25:07,869 --> 00:25:13,389 + 기본적으로의 단계와 그것의 같은 종류의 것이 네 같은 종류의 수행 + +339 +00:25:13,390 --> 00:25:16,980 + 조금 그래서 당신처럼 보이는이 일을 끝낼 가장 아마 모멘텀 + +340 +00:25:16,980 --> 00:25:21,650 + 그것은 기본적으로 부패 방법이 속도를 추적 그리고 그건 + +341 +00:25:21,650 --> 00:25:25,420 + 당신의 단계하지만 당신은이 기하 급수적까지 추가하여 아래로 확장 + +342 +00:25:25,420 --> 00:25:29,490 + 새는 당신의 광장 그라디언트의 카운터 등 동일한에서 모두 끝 + +343 +00:25:29,490 --> 00:25:36,009 + 공식과 사람들은 그래서 당신이 모두 힘을 다하고 않는 조합 그게 업데이트 및 + +344 +00:25:36,009 --> 00:25:41,759 + 당신은 또한이 적응 스케일링을하고있는 그래서 여기에있는 군대의 확률값하자 + +345 +00:25:41,759 --> 00:25:44,789 + 이를 비교했을 때 실제로 정말 심지어이 이전 버전을 번쩍해야 + +346 +00:25:44,789 --> 00:25:46,339 + 기본적으로 가장 확률값 + +347 +00:25:46,339 --> 00:25:52,079 + 빨간색은 여기에 우리가 대체 한 것을 제외하고는 동일한 것입니다 단지가 있었다 TX + +348 +00:25:52,079 --> 00:25:56,220 + 이전 단지 그라데이션 현재 지금 우리는이 그라데이션 TX를 교체하고 + +349 +00:25:56,220 --> 00:25:56,630 + 그것으로 + +350 +00:25:56,630 --> 00:26:01,170 + 예를 한 가지 방법에 대한 상상 그래서 만약 RDX이 실행 카운터 인 + +351 +00:26:01,170 
--> 00:26:04,090 + 또한 샘플링 많은 배치를 설정하여 불쾌한 kasich입니다 그것을 보면 + +352 +00:26:04,089 --> 00:26:07,359 + 야이 나쁜 패스 난수의 많은 수 그리고 당신은이 모든 잡음을 얻을 수있어 + +353 +00:26:07,359 --> 00:26:10,990 + 그라디언트 그래서 대신에 우리가있어 매번 단계를 어떤 큰 영향을 사용하여 + +354 +00:26:10,990 --> 00:26:14,309 + 실제로 이전 인사의 일부가되었고, 그것을 할 수있는 사용하는 것 + +355 +00:26:14,309 --> 00:26:19,139 + 그것의 그라디언트 방향을 안정시키고 그 기세의 기능입니다 + +356 +00:26:19,140 --> 00:26:23,720 + 여기와 여기에 스케일링이 있는지 확인하는 것입니다 스텝 크기의 운동에 대하여 + +357 +00:26:23,720 --> 00:26:29,940 + 서로 스티븐 L 방향이 감사에 당신은 당신이 것을 싶지 않아 + +358 +00:26:29,940 --> 00:26:31,269 + 하이퍼 매개 변수 + +359 +00:26:31,269 --> 00:26:36,119 + (801)는 일반적으로 보통 9802 포인트 995 가리 + +360 +00:26:36,119 --> 00:26:42,869 + 내 자신의 일에 선두에 걸쳐 높은 프리미엄을의 어딘가에있을 정도로 나는 발견 + +361 +00:26:42,869 --> 00:26:45,719 + 내가 실제로 일반적으로하지 않습니다에 걸쳐이 상대적으로 강력한 설정입니다 + +362 +00:26:45,720 --> 00:26:50,690 + 이러한 난 그냥 보통 스마일을 넣어으로 설정 떠나 결국하지만 당신은 재생할 수 있습니다 + +363 +00:26:50,690 --> 00:27:04,259 + 당신이 추진력을 얻을 수 있습니다 그것의 사람들과 때때로 우리는 보았다 + +364 +00:27:04,259 --> 00:27:08,789 + 그래 당신은 실제로 단지 용지를 읽을 수 않는 것이 레스토랑 작동 더 나은 청소 + +365 +00:27:08,789 --> 00:27:12,849 + 실제로 어제는 종이 아니었다 대해이 229에서 프로젝트 보​​고서이었다 + +366 +00:27:12,849 --> 00:27:17,149 + 나는 그것에 대해 용지가 있는지 모르겠어요하지만 당신이 할 수있는 것을 실제로 사람 + +367 +00:27:17,150 --> 00:27:20,250 + 즉 단순히 여기에 수행되지 않습니다 놀이 + +368 +00:27:20,250 --> 00:27:25,759 + 확인 나는 내가 여기에 아담이 약간 더 복잡하게 할 한 가지 더 + +369 +00:27:25,759 --> 00:27:30,849 + 그것은 불완전 당신이 볼 정도로 나를 그냥 아담의 완전한 몰입에 넣어 보자 + +370 +00:27:30,849 --> 00:27:33,949 + 당신이이 거기에 참조 할 때 혼동 될 수 있습니다 한가지 더있다 + +371 +00:27:33,950 --> 00:27:38,220 + 바이어스 보정이라는 것은 자신의 삽입 및 수정을하는 방식을 경멸하는 + +372 +00:27:38,220 --> 00:27:40,920 + I는 루프의 확대 야하는 이유는 바이어스 보정가에 달려 있다는 + +373 +00:27:40,920 --> 00:27:46,940 + 절대 시간 단계 00 T T 여기에서 사용되며, 그 이유는 이것이 무엇 + +374 +00:27:46,940 --> 00:27:49,730 + 의 작은 점 같은 종류의 일을하고 나는 이것에 대해 혼동하지 않으 + +375 +00:27:49,730 --> 00:27:54,049 + 너무하지만 기본적으로 그 MMV 사실을 보상하기위한 보상있어 + +376 +00:27:54,049 --> 00:27:58,659 + 오니 쉬 (500) 통계는 처음에 잘못 그래서 그가 무엇을하고 있는지입니다 + +377 +00:27:58,660 --> 00:28:01,269 + 정말 메가를 확장에서 + +378 +00:28:01,269 --> 00:28:04,250 + 당신이 편견의 매우 친절와 끝까지하지 않도록 처음 몇 반복 + +379 +00:28:04,250 --> 00:28:07,359 + 제 1 및 제 2 순간의 추정은 그래서 그것에 대해 걱정하지 마십시오 + +380 +00:28:07,359 --> 00:28:11,279 + 너무 많은 이것은 단지이 매우 먼저 귀하의 업데이트를 변화한다 + +381 +00:28:11,279 --> 00:28:15,190 + 항목 등으로의 몇 번 예열되고, 그래서는 적절한에서 이루어집니다 + +382 +00:28:15,190 --> 00:28:18,210 + 통계 메가 측면에서 방법 + +383 +00:28:18,210 --> 00:28:23,380 + 나는 우리가 여러 가지 업데이트에 대한 이야기​​ 그 확인으로 너무 많이 가지 않는다 + +384 +00:28:23,380 --> 00:28:26,710 + 우리는 이러한 모든 업데이트가 여전히이 배우는 좋은 프라이머를 보았다 + +385 +00:28:26,710 --> 00:28:31,279 + 그래서 난 그냥 여전히 필요하지만 것을 간략하게 사실에 대해 얘기하고 싶지 + +386 +00:28:31,279 --> 00:28:34,369 + 학습과 우리 모두를위한 전면 인종 차별주의 학습 속도로 일어나는 보았다 + +387 +00:28:34,369 --> 00:28:37,639 + 이러한 방법과 내가 제기하고자하는 질문을 다음의 어느 하나 + +388 +00:28:37,640 --> 00:28:47,290 + 속도를 학습 사용하는 것이 가장 좋습니다 + +389 +00:28:47,289 --> 00:28:55,509 + 당신이 신경 네트워크를 실행하는 경우 그래서 이것은 레이트 학습에 대한 슬라이드입니다 + +390 +00:28:55,509 --> 00:28:59,819 + 트릭 답을 구분하는 것은 그 중에 무엇을 사용하는 좋은 학습 레이스가 없다는 것입니다 + +391 +00:28:59,819 --> 00:29:04,259 + 이 최적화 때문에 당신은 당신이 먼저 높은 학습 속도를 사용해야한다해야 + +392 +00:29:04,259 --> 00:29:07,869 + 좋은 학습 속도보다 더 빨리 당신이 매우 빠른 진전을 볼 수 있지만, + +393 +00:29:07,869 --> 00:29:10,779 + 어떤 점에서 두 확률 될거야 당신은에 수렴 할 수 없습니다 + +394 +00:29:10,779 --> 00:29:13,829 + 주 내 아주 잘 당신이 시스템에 너무 많은 에너지를 가지고 있기 때문에 + +395 +00:29:13,829 --> 00:29:17,869 + 당신은 당신의 손실 함수의 검은 좋은 부품 등 무엇으로 정착 할 수 없습니다 + +396 +00:29:17,869 --> 00:29:21,399 + 당신은 당신이 속도 배우고 UDK는 다음 종류의이 탈 수 할 + +397 +00:29:21,400 --> 00:29:26,269 + 
감소 학습 속도의 드래곤과 그들 모두에 최선을 다할가 많다 + +398 +00:29:26,269 --> 00:29:28,670 + 사람들이 시작하는 다른 방법은 시간이 지남에 따라 요금을 배울 당신은해야 + +399 +00:29:28,670 --> 00:29:32,400 + 또한 같은 종류의 그들의 물건 붕괴의 과제가되었다 + +400 +00:29:32,400 --> 00:29:36,810 + 당신이했습니다에 간단한 하나는 아마도 훈련 데이터의 한 시대는 참조 후 + +401 +00:29:36,809 --> 00:29:41,619 + 파키스탄 새끼가 부패 무슨 말을 한 후 한 번에 너무 매 훈련 샘플을 볼 수 + +402 +00:29:41,619 --> 00:29:45,219 + 내 포인트 9 또는 당신은 또한 사용할 수있는 뭔가에 요금을 학습 + +403 +00:29:45,220 --> 00:29:49,600 + 지수 붕괴하거나 여러 거기 TDK 중 하나 여러가는거야 + +404 +00:29:49,599 --> 00:29:54,379 + 그것은 가능성이 향상 이론적 특성의 일부에 확대하고있어 알고 + +405 +00:29:54,380 --> 00:29:58,260 + 내가 생각하기 때문에 서로 다른 경우에 대한 그들의 불행하게도 많은하지 적용 + +406 +00:29:58,259 --> 00:30:01,150 + 그들은 볼록 최적화 문학에서 대부분이고 우리는 매우 상대하고 + +407 +00:30:01,150 --> 00:30:05,160 + 목표 다르지만 일반적으로 실제로 나는 뭔가에 사용되는 + +408 +00:30:05,160 --> 00:30:12,330 + 질문이었다 + +409 +00:30:12,329 --> 00:30:25,259 + 훈련 동안 이들 사이의 어느 하나의 커밋되지 + +410 +00:30:25,259 --> 00:30:28,470 + 그래, 난 그 모든 표준 생각하지 않습니다 + +411 +00:30:28,470 --> 00:30:32,990 + 흥미로운 점 나는 당신이 그래 사용할 줄 때 확실하지 않다 확실하지 않다 + +412 +00:30:32,990 --> 00:30:37,839 + 그것은 나에게 분명하지 않다 당신이 시도하고 I가 좋아 연습 뭔가를 시도 할 수 있습니다 + +413 +00:30:37,839 --> 00:30:42,079 + 적어도 영향은 바로 지금이다 당신은 거의 항상 내가 발견 지점을 + +414 +00:30:42,079 --> 00:30:46,189 + 일반적으로 좋은 기본값은 지금 모든 것을 위해 시간을 사용하므로 함께 갈 장미 + +415 +00:30:46,190 --> 00:30:49,840 + 아주 잘 우리의 대부분의 문제는 모멘텀보다 더 나은 또는 작동하는 것 같다 + +416 +00:30:49,839 --> 00:30:56,638 + 그들 때문에 우리가 그들에게 전화로 그런 아무것도 그래서 키가 큰 주문 방법이다 + +417 +00:30:56,638 --> 00:31:00,579 + 우리가 평가 한 있도록 만 손실 함수에 그라디언트 정보를 사용하여 + +418 +00:31:00,579 --> 00:31:03,720 + 그라데이션은 우리가 기본적으로 기울기와 모든 단일 방향을 알고 + +419 +00:31:03,720 --> 00:31:05,710 + 즉, 우리가 사용하는 유일한 것이다 + +420 +00:31:05,710 --> 00:31:09,600 + 이 최적화를위한 2 차 방법의 전체 세트입니다하지만 당신은해야 + +421 +00:31:09,599 --> 00:31:13,168 + 내가 너무 많은 세부 사항에 가고 싶지 않는 2 차 반대의 인식 + +422 +00:31:13,169 --> 00:31:17,919 + 그러나 결국 최대 그래서 당신의 손실 함수에 더 큰 근사치를 형성 + +423 +00:31:17,919 --> 00:31:20,820 + 그들 만이 기본적으로 초평면에 근사하지 않는 방법 I 등 + +424 +00:31:20,819 --> 00:31:26,069 + 희망하지만 당신도 토론에 의해 근사 한을 알리는 방법입니다 + +425 +00:31:26,069 --> 00:31:29,710 + 그래서 당신은 그가 또한 독일인 필요한 그라데이션이 필요하지 않습니다 억제 서비스 + +426 +00:31:29,710 --> 00:31:36,808 + 뿐만 아니라 그 계산해야하고 당신에게 내가 말할 것 오늘 밤에 볼지도 모른다 + +427 +00:31:36,808 --> 00:31:38,500 + 229 예 + +428 +00:31:38,500 --> 00:31:44,190 + 뉴턴의 방법은 기본적으로 당신이 그릇을 형성 업데이 트를주고 + +429 +00:31:44,190 --> 00:31:47,259 + 당신의 목적에 같은 패션 근사이 업데이트 사용할 수 있습니다 + +430 +00:31:47,259 --> 00:31:54,259 + 수는 그래서 그 근사 방식의 최소로 직접 이동합니다 + +431 +00:31:54,259 --> 00:31:58,490 + 어떤이 그들을 사용된다 사람을 왜 2 차 방법에 대한 좋은 데요 + +432 +00:31:58,490 --> 00:32:02,099 + 특히 뉴턴 방법은 이것에 대해 좋은 무엇을 여기에 제시 + +433 +00:32:02,099 --> 00:32:05,399 + 컨버전스에 대한 업데이트 + +434 +00:32:05,400 --> 00:32:13,410 + 당신은 학습 속도가 확인이 업데이트의 방법 차 알지 알 수 있습니다 그리고 그건 + +435 +00:32:13,410 --> 00:32:17,220 + 이 손실 기능이 손실 함수에 그라데이션을 보는 경우에 있기 때문에 + +436 +00:32:17,220 --> 00:32:20,480 + 당신은 또한 곡률과 그 장소를 알고 당신은 근사 그렇다면 + +437 +00:32:20,480 --> 00:32:23,920 + 정확히 알고있는이 황소는 어디에 때문에 최소 주문 근사치로 이동합니다 + +438 +00:32:23,920 --> 00:32:26,900 + 그의 최소로 직접 이동할 수 있습니다 당신 학습을위한 필요가 없습니다 + +439 +00:32:26,900 --> 00:32:30,610 + 그게 내가 그 생각 아주 좋은 기능 그래서 그릇에 근접하면 두 가지가 I + +440 +00:32:30,609 --> 00:32:32,969 + 당신은 두 번째 순서를 사용하고 있기 때문에 생각했던 당신은 빠른 수렴을 + +441 +00:32:32,970 --> 00:32:38,839 + 뿐만 아니라 정보가 왜이 단계 업데이트를 사용하도록 종류의 불가능하다 + +442 +00:32:38,839 --> 00:32:47,069 + 과정의 문제에 대한 작품을 모든되는 교육 열정은 백을 말한다 + +443 +00:32:47,069 --> 00:32:48,500 + 만 기본 네트워크 + +444 +00:32:48,500 --> 00:32:52,299 + 백 만 백 만 행렬 그리고 당신은 그것을 변환 할 + +445 +00:32:52,299 --> 00:32:59,259 + 그이 
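
위에서 나열한 학습 속도 감쇠 방식(step, 지수, 1/t)을 식 그대로 옮기면 다음과 같다. lr0 와 k 는 예시로 가정한 값이다.

```python
import numpy as np

lr0, k = 1e-2, 0.1
for epoch in range(10):
    lr_step = lr0 * (0.9 ** epoch)       # step decay: epoch 마다 일정 비율로
    lr_exp = lr0 * np.exp(-k * epoch)    # 지수 감쇠
    lr_1_t = lr0 / (1 + k * epoch)       # 1/t 감쇠
    print(epoch, lr_step, lr_exp, lr_1_t)
```
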
너무 행운 그래서 몇 가지가 발생하지 않을 + +446 +00:32:59,259 --> 00:33:02,480 + 알고리즘과 난 그냥 당신이 당신이 그들을 사용하지 않을 알고 싶습니다 + +447 +00:33:02,480 --> 00:33:05,650 + 기본적으로 뭔가 불리는 곳 DHS하는 아래 클래스 + +448 +00:33:05,650 --> 00:33:08,360 + 수 있습니다 당신은 패션을 변환하지 멀리 얻을 구축 + +449 +00:33:08,359 --> 00:33:11,819 + 모든 순위 연속 업데이트를 통해 헤센의 근사 + +450 +00:33:11,819 --> 00:33:15,000 + 하나는 그것의 종류의 세션을 구축하지만 당신은 여전히​​ 헤 시안을 저장해야 + +451 +00:33:15,000 --> 00:33:18,279 + 대규모 네트워크에 대한 다음 거기에 뭔가 더 좋은 때문에 여전히 메모리에 + +452 +00:33:18,279 --> 00:33:22,710 + 제한 제레미 BFGS의 약자라는 파운드 실제로 가을에 저장되지 않았습니다 + +453 +00:33:22,710 --> 00:33:26,980 + 패션 아니면 근사 회원 그리고 그 사람들이 실제로 사용하는 무엇을 + +454 +00:33:26,980 --> 00:33:33,549 + 때로는 지금 당신은 때때로 최적화 문헌에 언급 참조합니다 LBS + +455 +00:33:33,549 --> 00:33:37,769 + 그것은 우리를 위해 정말 정말 잘 작동 특히 당신은 작은 하나가있는 경우 + +456 +00:33:37,769 --> 00:33:42,450 + 이처럼 상자 같은 결정 기능에는 확률 적 노이즈가 없습니다 + +457 +00:33:42,450 --> 00:33:47,920 + 과 모든 것을 더 도시는 일반적으로 손실을 분쇄 할 수 있습니다 메모리 주소에 맞는 없다 + +458 +00:33:47,920 --> 00:33:53,200 + 기능을 아주 쉽게 그러나 아주 아주 기본적 파운드 GS2을 연장으로 까다로운 + +459 +00:33:53,200 --> 00:33:56,539 + 대규모 데이터 세트 및 이유는이 많은 의사를 서브 샘플링 하였다된다 + +460 +00:33:56,539 --> 00:33:59,730 + 우리는 많은 그래서 WASSUP에 간단한 메모리에 모든 훈련 데이터를 맞지 않을 수 있기 때문에 + +461 +00:33:59,730 --> 00:34:02,930 + 배치는 다음 나는이 많은 경기와에 작품의 위험이있을거야 그 + +462 +00:34:02,930 --> 00:34:06,810 + 근사는 서로 다른 여러 배치를 교환하고 같이있는 잘못에 + +463 +00:34:06,809 --> 00:34:10,449 + 또한 당신이 조심해야 할 능력을 가지고 당신은 확인해야합니다 + +464 +00:34:10,449 --> 00:34:12,539 + 당신이 드롭 아웃을 수정해야 + +465 +00:34:12,539 --> 00:34:17,690 + 당신이 있는지 확인해야하므로 내부적으로 불량배이기는하지만 함수 당신의 + +466 +00:34:17,690 --> 00:34:20,679 + 기능 많은 많은 다른 시간이 모든 근사하고 거짓말을하고있다 + +467 +00:34:20,679 --> 00:34:24,480 + 검색 물건이 매우 무거운 함수의 같은 것을 그래서 당신은 확인해야합니다 + +468 +00:34:24,480 --> 00:34:26,668 + 당신이 사용할 때 사용하지 않거나 출처 확인 + +469 +00:34:26,668 --> 00:34:29,889 + 랜덤 정말 연습 우리에 그래서 기본적으로 그것을 좋아하지 않을 때문에 + +470 +00:34:29,889 --> 00:34:33,779 + 큰 잘 못했습니다 정말 일을하지하지 않는 것 때문에 모든 BHS를 사용하지 않는 + +471 +00:34:33,780 --> 00:34:36,970 + 지금은 다른 방법에 비해 너무 많은 재료가 갖는 기본적 + +472 +00:34:36,969 --> 00:34:41,529 + 일이 당신이 더 나은 것은 바로이 우리의 물건을 잡음을하지만 이상을 수행합니다 + +473 +00:34:41,530 --> 00:34:47,880 + 당신이 할 수있는 경우 그 거래는 좋은 선택으로 사용 요약 그렇게 꺼져과 + +474 +00:34:47,880 --> 00:34:51,570 + 그렇지 않은으로 당신이 아마 하루에 은행 감당할 수있는 여유 + +475 +00:34:51,570 --> 00:34:55,419 + 2009 메모리와 앞으로 매우 큰 소득과 그들에 패스를 얻을 + +476 +00:34:55,418 --> 00:35:00,460 + 메모리 당신은 파운드로 볼 수 있지만에서 사용 관행에 표시되지 않습니다 + +477 +00:35:00,460 --> 00:35:05,220 + 현재 연구 방향 비록 지금 바로 대규모 설정 + +478 +00:35:05,219 --> 00:35:10,009 + 당신이이기 때문에 그래서 다른 개인 업데이트의 제 논의를 결론 + +479 +00:35:10,010 --> 00:35:14,830 + 학습 속도는 우리가 거​​기에이 클래스의 모든 베아트리체 조사하지 않을거야 + +480 +00:35:14,829 --> 00:35:24,739 + 바로 다시 질문 + +481 +00:35:24,739 --> 00:35:34,609 + 당신에 대한 요구하는지 너무 좋은 자동으로 당신이있어 구분 예를 들어 + +482 +00:35:34,610 --> 00:35:38,510 + 그래서 시간이 지남에 속도를 학습 당신은 또한 당신이 있다면 사건을 깰 학습 사용합니다 + +483 +00:35:38,510 --> 00:35:41,930 + 그래서 일반적으로 그랜드 이상을 사용하면 때를 매우 일반적인 영기를 배우는 참조 + +484 +00:35:41,929 --> 00:35:55,379 + 실제로 나는 당신이 대학원 또는 그러나 그것을 사용하는 경우 확실하지 않다 또는 아담 그래 그것은 아닙니다입니다 + +485 +00:35:55,380 --> 00:36:04,900 + 하지 아니 아주 좋은 대답 당신은 그것을 할 확실히 할 수 있지만 어쩌면 항목이 아니라고 + +486 +00:36:04,900 --> 00:36:08,910 + 아담처럼 그냥 방자 안드로이드 때문에에서 학습 (30)를하지 않습니다 + +487 +00:36:08,909 --> 00:36:12,339 + 이 새는 그라데이션입니다하지만 그는 학습 속도가 된 큰 우려했다 + +488 +00:36:12,340 --> 00:36:15,170 + 그것은 인도 자동으로 20을 부패 있기 때문에 아마 이해가되지 않습니다 + +489 +00:36:15,170 --> 00:36:22,710 + 괜찮아 괜찮아 우리는 매우 간단하게 같은 모델 앙상블 I에 갈거야 + +490 +00:36:22,710 --> 00:36:24,829 + 그것은 아주 간단하기 때문에 그것에 대해 얘기 + +491 +00:36:24,829 --> 00:36:28,750 + 당신이 당신의 훈련 데이터에 여러 
개의 독립적 인 모델을 훈련하면 밝혀 + +492 +00:36:28,750 --> 00:36:32,949 + 대신 다음 단 하나의 하나의 당신은 당신이했습니다이 시간에 결과를 평균 + +493 +00:36:32,949 --> 00:36:39,929 + 항상 22 % 추가 성능 확인 지금이 정말 이론적하지있어 + +494 +00:36:39,929 --> 00:36:43,289 + 그 결과 같은 종류의하지만 그냥 연습처럼 여기 결과 + +495 +00:36:43,289 --> 00:36:46,570 + 기본적으로이 거의 항상 더 잘 작동 할 좋은 것 같다 + +496 +00:36:46,570 --> 00:36:48,850 + 물론 단점은 모든 다른 독립이 필요하지 않습니다 + +497 +00:36:48,849 --> 00:36:52,259 + 모델과 앞으로해야 할 필요와 그들과 여러분의 뒤로 클래스 + +498 +00:36:52,260 --> 00:36:56,850 + 그 적합하지 그래서 그들 모두를 훈련 아마 당신은 아래로 느려했다 + +499 +00:36:56,849 --> 00:37:00,989 + 당신의 앙상블 모델의 수와 단지 시간 등 몇 가지 팁이있다 + +500 +00:37:00,989 --> 00:37:05,689 + 및 유용한 정보 비트를위한 그래서 하나의 접근 방식을 따기 어떤 종류의에 사용 + +501 +00:37:05,690 --> 00:37:08,619 + 예를 들어 당신이 가진 당신이 당신의 신경망을 훈련으로 모든 서로 다른를 + +502 +00:37:08,619 --> 00:37:11,680 + 체크 포인트는 일반적으로 체크 포인트를 저장 그들에게 하나 하나 하키를 저장하는 + +503 +00:37:11,679 --> 00:37:14,750 + 당신은 당신이 당신의 검증 성능 그래서 한 가지 당신이 무엇인지 알아낼 + +504 +00:37:14,750 --> 00:37:18,119 + 실제로 판명 예를 위해 할 수있는 것은 때로는 같은 얻을 당신입니다 + +505 +00:37:18,119 --> 00:37:23,420 + 당신의 모델에 대한 몇 가지 체크 포인트를 가지고 당신은 그했다 그 + +506 +00:37:23,420 --> 00:37:26,349 + 실제로 때때로에서 사물과 그렇지 그래서 방법을 개선하기 위해 밝혀 + +507 +00:37:26,349 --> 00:37:29,730 + 한 미국 훈련 칠 독립적 인 모델을 훈련해야하지만 당신은 어떤 앙상블 + +508 +00:37:29,730 --> 00:37:34,809 + 그와 관련된 다른 체크 포인트의 트릭있다 + +509 +00:37:34,809 --> 00:37:39,739 + 이것은 우리가 전에 본 적이 당신의 네 단계를 여기에 무슨 일이 일어나고 있는지에 항의 + +510 +00:37:39,739 --> 00:37:44,709 + 나는 실행으로 여기에 예비 선거 X 테스트의 또 다른 세트와이 텍스트를 유지하고있어 + +511 +00:37:44,710 --> 00:37:49,590 + 일부 기하 급수적으로 내 실제 매개 변수 벡터 X를 썩 때 내가 사용하는 + +512 +00:37:49,590 --> 00:37:52,750 + 텍스트 테스트 및 검증이나 테스트 데이터는 거의 항상이 밝혀 + +513 +00:37:52,750 --> 00:37:57,199 + 이 때문에 종류의 같이하고있는 단독 확인 X를 사용하는 것보다 약간 더 나은 수행 + +514 +00:37:57,199 --> 00:38:00,919 + 마지막으로 이전 몇 주 요인 작은 같은 가중 앙상블 그것은 종류의 + +515 +00:38:00,920 --> 00:38:05,309 + 어려운 종류의 한 가지 방법으로 실제로는하지만, 기본적으로 해석하는 + +516 +00:38:05,309 --> 00:38:08,329 + 그것을 나는이 실제로 할 수있는 좋은 일이 이유에 대해 처리 할 수​​있는 하나의 방법을 해석 + +517 +00:38:08,329 --> 00:38:12,900 + 당신의 공 기능을 최적화에 대해 생각하고, 당신은 너무 많은 스테핑있어 + +518 +00:38:12,900 --> 00:38:16,849 + 실제로 모든 단계의 평균을 복용 최소 당신을 얻을 수 주위에 + +519 +00:38:16,849 --> 00:38:20,980 + I가 할 수있는 최소한의 확인에 가까운이 실제로 약간 중요한 이유 + +520 +00:38:20,980 --> 00:38:25,639 + 더 나은 우리가 가고 있기 때문에 내가 가진 작은 앙상블은 내 인생을 논의하기 위해 수 있도록 + +521 +00:38:25,639 --> 00:38:29,759 + 드롭 아웃으로 보면 이것은 당신이 될 것입니다 매우 중요한 기술이다 + +522 +00:38:29,760 --> 00:38:34,590 + 드롭 아웃에 대한 생각은 매우 흥미로운 그래서 등등 구현 및 사용 + +523 +00:38:34,590 --> 00:38:38,620 + 당신의 전체 목적을하고있는 것처럼 당신이 강하와 함께 할 당신입니다 + +524 +00:38:38,619 --> 00:38:45,429 + 신경 네트워크는 무작위 그래서 그냥 통과 공원에서 일부 뉴런 (20)을 설정합니다 + +525 +00:38:45,429 --> 00:38:49,839 + 당신이 당신의 데이터 X의 전진 패스를하고있는 당신이 어떤 작업을 수행하는지 명확히하는 것은 당신의 + +526 +00:38:49,840 --> 00:38:52,670 + 이 기능에 발언권을 계산 + +527 +00:38:52,670 --> 00:38:57,010 + 첫 번째 숨겨진 층 W의 비선형 하나 배 XP SP1 그래서 + +528 +00:38:57,010 --> 00:39:02,830 + 그건 좀 이상이고 다음 여기 이진수의 마스크를 계산합니다 + +529 +00:39:02,829 --> 00:39:05,230 + 여부에 기초하여 0 또는 1 중 + +530 +00:39:05,230 --> 00:39:09,469 + 0과 1 사이 숫자는 우리가 심각한 펌프를 듣고있는 P보다 작은 + +531 +00:39:09,469 --> 00:39:13,469 + 당신이 원하는이 우리는 0과 1의 절반과 절반의 바이너리 마스크입니다 + +532 +00:39:13,469 --> 00:39:17,469 + 다중 정품 인증 적극적으로 우리가 그들의 절반을 포기 숨겨진되는 + +533 +00:39:17,469 --> 00:39:21,349 + 모든 정품 인증 각 하나의 숨겨진 레이어를 계산 한 다음 우리는 두 가지가 드롭 + +534 +00:39:21,349 --> 00:39:25,730 + 무작위로 유닛, 그리고, 우리는 두 번째, 그리고, 우리는 무작위로 그 중 절반을 드롭 할 + +535 +00:39:25,730 --> 00:39:30,699 + 확인 물론 이것은 단지 전방이 후방 패스이어야 합격입니다 + +536 +00:39:30,699 --> 00:39:35,719 + 적절하게이 방울도 다시 전파 할 수 있도록뿐만 아니라 조정 + +537 +00:39:35,719 --> 
00:39:39,309 + 그것에서뿐만 아니라, 그래서 그렇게 통해 구현할 때 중퇴 그렇게 기억 + +538 +00:39:39,309 --> 00:39:41,980 + 전진은 드롭을 통과하지만, 역 전파는 경우 후방 패스 + +539 +00:39:41,980 --> 00:39:45,829 + U2에 의해 곱하면 하나는 그래서 당신이 장소에서 기본적으로 생기를 죽인를 구입 + +540 +00:39:45,829 --> 00:39:46,559 + 당신은 떨어 곳 + +541 +00:39:46,559 --> 00:39:52,179 + 나는이 방법을 처음으로 당신이를 보였다 때 확인 그래서 당신은 생각 될 수 있습니다 + +542 +00:39:52,179 --> 00:39:56,799 + 이 전혀 이해가 않습니다이 좋은 생각은 왜에 어떻게 원하는 것이되었다 + +543 +00:39:56,800 --> 00:40:00,390 + 당신의 신경증을 계산하고 (20)이 어떠한 의미를에 다음 그들에게 경향을 설정 + +544 +00:40:00,389 --> 00:40:12,369 + 그래서 나도 몰라 그럼 이제 너희들은 앞서 생각에 과열을 방지 할 수 있도록하자 + +545 +00:40:12,369 --> 00:40:23,880 + 어떤 의미 + +546 +00:40:23,880 --> 00:40:27,170 + 당신은 정말 당신이 그것을 할 말을하는지 있도록 올바른 정보를 얻고 + +547 +00:40:27,170 --> 00:40:31,240 + 난 단지 그때 내 네트워크의 절반을 거​​의 사용하고있는 경우 때문에 overfitting을 방지 + +548 +00:40:31,239 --> 00:40:34,500 + 난 단지 내 네트워크를 한 번의 절반을 사용하고 작은 용량의 같은이 + +549 +00:40:34,500 --> 00:40:37,739 + 하나의 작은 네트워크 I 기본적으로 만 너무 거기에있어 단지처럼 거기 + +550 +00:40:37,739 --> 00:40:40,209 + 이 종류의 그래서 나는 다음 전체 네트워크 거기에 직장에서 무슨 일이 있었는지 수행 할 수 있습니다 + +551 +00:40:40,210 --> 00:40:44,798 + 당신이 대표 할 수 있는지의 관점에서 당신의 분산 제어 등 + +552 +00:40:44,798 --> 00:40:55,619 + 그래 나는 종종 내가하지 않은 다양한 무역에 의해 등의 조건을 충족하고 싶습니다 + +553 +00:40:55,619 --> 00:40:59,480 + 정말 우리는 너무 많이하지 않을거야하지만 힘들다는 더 작은 모델을 + +554 +00:40:59,480 --> 00:41:08,579 + 그 이상하지만 서로 다른 신경 네트워크의 여러 앙상블을 갖는 것은 가고 있었다 + +555 +00:41:08,579 --> 00:41:34,289 + 즉 사용 된 하나의 경우 때문에 조금에 그 시점으로 이동 + +556 +00:41:34,289 --> 00:41:38,119 + 위층 확인 내 다음 인생에서 가리 말씨의 더 나은 방법이 + +557 +00:41:38,119 --> 00:41:43,028 + 의 그 괜찮아 우리가하려고하는 것을 가정하는 특정 예를 살펴 보자 + +558 +00:41:43,028 --> 00:41:47,130 + 신경 네트워크의 고양이 점수를 계산하고 여기에 아이디어는 것입니다 + +559 +00:41:47,130 --> 00:41:51,380 + 이러한 모든 다른 단위를 가지고 강하하고있는 스포츠는 많은 노래 + +560 +00:41:51,380 --> 00:41:54,920 + 방법은 드롭 아웃을보고 있지만 그 중 하나는 당신의 코드 당신의 강요 것입니다 + +561 +00:41:54,920 --> 00:41:59,608 + 어떤 이미지의 표현은 당신이 필요로하기 때문에 중복하고 있었다 + +562 +00:41:59,608 --> 00:42:03,318 + 그 중복 당신은 당신이 절반을받을 제어 할 수있는 방법에 대해이기 때문에 + +563 +00:42:03,318 --> 00:42:06,710 + 네트워크의 내려 그래서 당신은 더 많은 당신의 고양이 점수를 확인해야합니다 + +564 +00:42:06,710 --> 00:42:09,900 + 기능은 제대로 요리 고양이 점수 때문에를 계산하기 위하여려고하는 경우 + +565 +00:42:09,900 --> 00:42:14,000 + 어떤 어떤이 삭제 될 수도 있기 때문에 당신이 그것에 의존 할 수 그들 중 하나 등등 + +566 +00:42:14,000 --> 00:42:17,068 + 이 경우 우리는 여전히 캐츠 킬을 분류 할 수 있도록 그게 보는 하나의 방법입니다 + +567 +00:42:17,068 --> 00:42:22,639 + 우리는 매우 중요 그래서 여부에 대한 액세스 권한이없는 경우에도 적절 + +568 +00:42:22,639 --> 00:42:24,768 + 즉, 드롭 아웃의 하나의 해석이다 + +569 +00:42:24,768 --> 00:42:29,088 + 드롭 아웃의 또 다른 해석은 다음과 같이 근육의 관점에서 언급되어, + +570 +00:42:29,088 --> 00:42:33,358 + 드롭 아웃 효과적으로 모델의 큰 앙상블 훈련으로 바라 보았다 될 수있다 + +571 +00:42:33,358 --> 00:42:36,420 + 기본적으로 서브되는 + +572 +00:42:36,420 --> 00:42:43,099 + 하나의 큰 네트워크는하지만, 그들은 당신이 그렇게 좋은 방식으로 예비 선거를 공유 할 수 없습니다 + +573 +00:42:43,099 --> 00:42:46,650 + 이것을 이해하면 우리는 우리와 우리를 위해 그것을 할 경우 다음 사항을주의해야 + +574 +00:42:46,650 --> 00:42:49,970 + 무작위로 무엇을 생각 뒤로 패스에 비해 단위의 일부를 내려 + +575 +00:42:49,969 --> 00:42:53,669 + 나는 우리가 임의의이 내려 가지고 가정 오른쪽도록 그라데이션 발생 + +576 +00:42:53,670 --> 00:42:57,409 + 후방 패스에서 이러한 단위는 우리가 다시 최대를 통해 전파하고 그 + +577 +00:42:57,409 --> 00:43:01,879 + 했다 특히 만 뉴런의 수 있도록 드롭 아웃에 의해 유도 된 + +578 +00:43:01,880 --> 00:43:05,349 + 전진 패스에 사용 실제로 업데이트 또는 불만이 흐르는이됩니다 + +579 +00:43:05,349 --> 00:43:09,599 + 차단 된 모든 신경 세포가 20 아니 그라디언트 흐름 없기 때문에 그들을 통해 + +580 +00:43:09,599 --> 00:43:13,650 + 그것과 이전 계층의 무게를 너무 업데이트되지 않습니다 + +581 +00:43:13,650 --> 00:43:18,550 + 적극적으로 더 이상 그에 이전 계층으로의 연결을 중퇴했다 + +582 +00:43:18,550 --> 00:43:22,750 + 업데이트 그냥했다 그렇게 정말 무엇을이없는 것처럼 그건되지 않습니다 + +583 
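
앞에서 설명한 훈련 시 dropout 순전파(은닉층 활성화를 계산한 뒤 이진 마스크로 절반가량을 끄는 것)를 3층 네트워크를 가정해 옮기면 다음과 같다. 가중치들은 예시로 임의 초기화했다.

```python
import numpy as np

p = 0.5                                  # 유닛을 "살려 둘" 확률
X = np.random.randn(20)                  # 입력 (예시)
W1, b1 = np.random.randn(50, 20) * 0.01, np.zeros(50)
W2, b2 = np.random.randn(50, 50) * 0.01, np.zeros(50)
W3, b3 = np.random.randn(10, 50) * 0.01, np.zeros(10)

H1 = np.maximum(0, W1 @ X + b1)          # 첫 번째 은닉층 (ReLU)
U1 = np.random.rand(*H1.shape) < p       # 이진 dropout 마스크
H1 *= U1                                 # 무작위로 절반가량의 유닛을 끔
H2 = np.maximum(0, W2 @ H1 + b2)
U2 = np.random.rand(*H2.shape) < p
H2 *= U2
out = W3 @ H2 + b3
```

역전파 때는 같은 마스크 U1, U2 를 해당 층의 그래디언트에도 곱해 주어야 한다.
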
+00:43:22,750 --> 00:43:27,230 + 당신의 신경 네트워크의 일부를 샘플링 마스크 하위 오프 삭제하고 만있어 + +584 +00:43:27,230 --> 00:43:30,789 + 교육 당신이 일이 발생할 것이 그 하나의 예에 신경 네트워크 + +585 +00:43:30,789 --> 00:43:44,980 + 시간의 점은 하나의 모델은 하나의 데이터 포인트에 비가 가져옵니다 있도록 + +586 +00:43:44,980 --> 00:43:51,250 + 확인을 나는 것을 반복 시도 할 수 있습니다 + +587 +00:43:51,250 --> 00:44:04,239 + 여기 어딘가에에서 온 너희들이 아닌지를 이해하려면 + +588 +00:44:04,239 --> 00:44:10,789 + 당신이 당신의 자신의 드롭 드롭을 삭제할 때 확인 그래서 내가이의 예 있었으면 좋겠다 + +589 +00:44:10,789 --> 00:44:14,429 + 내가 곱 값에 드롭하면 신경 세포의 오른쪽하지만 최대 09 그 효과를 구입 + +590 +00:44:14,429 --> 00:44:17,918 + 손실의 기능에 영향은 경사 (10)가 있기 때문에 바로 그래서이 없다 + +591 +00:44:17,918 --> 00:44:21,668 + 그는 손실을 계산에 사용하고 그래서 안되었다에 대한 가중치는을받지 않습니다 + +592 +00:44:21,668 --> 00:44:25,679 + 업데이트 우리는 네트워크의 일부를 표본했는데 것처럼 그래서 우리의 단지 열차 + +593 +00:44:25,679 --> 00:44:28,959 + 현재 만에 훈련과 네트워크 나니 하나의 데이터 포인트 + +594 +00:44:28,958 --> 00:44:32,348 + 모든 시간 우리의 가능성 표본이 다른 부분을 위해 그것을 할 당신의 + +595 +00:44:32,349 --> 00:44:35,899 + 신경망하지만 이상한 같은 종류의 그래서 그들은 모두 공유 매개 변수 + +596 +00:44:35,898 --> 00:44:39,778 + 다른 모델의 많은 모든 교육의 앙상블 월요일이 점하지만 그들은 모두 + +597 +00:44:39,778 --> 00:44:48,458 + 공유 매개 변수 즉 이해가되지 않습니다 여기 종류의 약 아이디어 그래서 + +598 +00:44:48,458 --> 00:45:07,108 + 일반적으로 50 %이이 그렇게 동일한 크기를 발생하는 매우 거친 방법 저장 + +599 +00:45:07,108 --> 00:45:09,798 + 세계의 힘은 우리가 실제로 컴퓨터 H 알 + +600 +00:45:09,798 --> 00:45:14,009 + 우리는 우리가했던 것처럼 컴퓨터의 모든 전에를 계산하는 것이 더의 절반 이상 + +601 +00:45:14,009 --> 00:45:17,119 + 값은 20 떨어졌다 얻을 것이다 + +602 +00:45:17,119 --> 00:45:29,250 + 아무것도 그들이 좋은거야 변경되지 않습니다 + +603 +00:45:29,250 --> 00:45:38,349 + 대신 문제에 대한 경쟁 역은 도로에서 경쟁 할 + +604 +00:45:38,349 --> 00:45:42,150 + 당신은 당신이 할 수 있도록 스포츠 업데이트를 수행 할 경우에 삭제되지 않습니다 + +605 +00:45:42,150 --> 00:45:44,950 + 하지만 이론적으로 나는 실제로 우리가 걱정하지 않는 이상한 생각하지 않습니다 + +606 +00:45:44,949 --> 00:46:12,369 + 너무 많이하고 그래서 항상 돼 작업 훈련 그래서 매일 반복 우리를 + +607 +00:46:12,369 --> 00:46:15,469 + 우리가 거​​ 드롭하는지에 대해 우리가 샘플 분 경기 또는 노이즈 패턴을 얻을 + +608 +00:46:15,469 --> 00:46:19,359 + 앞으로 가고, 뒤로 패스와 그라데이션 우리는이 이상을 선회 계속 + +609 +00:46:19,360 --> 00:46:31,360 + 또 다시 그래서 당신의 질문에 어떻게 든 영리 사실 바이너리 마스크처럼 + +610 +00:46:31,360 --> 00:46:35,829 + 최고의 정말 안되지 않은 모델 또는 뭔가를 최적화하는 방법 등 + +611 +00:46:35,829 --> 00:46:44,769 + 이루어집니다 또는 누군가가 내가 그래 내가 갈거야 너무 미안 들여다했다고 생각 + +612 +00:46:44,769 --> 00:46:47,389 + 하나의 슬라이드 다음 슬라이드에 해당 들어가 + +613 +00:46:47,389 --> 00:46:57,618 + 우리는이 시점에서 볼거야 나는 마지막 질문을 할게요 + +614 +00:46:57,619 --> 00:47:04,519 + 질문 하나 다른 레이어에 다른 양에게 드롭을 수행 할 수 있습니다 + +615 +00:47:04,518 --> 00:47:05,459 + 당신을 중지 아무것도 없다 + +616 +00:47:05,460 --> 00:47:09,338 + 그 직관적으로 당신은 당신이 더 필요하면 밖으로 강한 드롭을 적용 할 + +617 +00:47:09,338 --> 00:47:12,690 + 정규화 그렇게 볼 수 Primaris의 엄청난 금액을 갖는 층 거기 + +618 +00:47:12,690 --> 00:47:16,349 + 하나의 예에있어 소득 당신은 거기에 강한 하락에 의해 명중 할 + +619 +00:47:16,349 --> 00:47:20,269 + 반대로 우리가 어떤 네트워크의 초기에 볼 수 있습니다 몇 가지 레이어가있을 수 있습니다 + +620 +00:47:20,268 --> 00:47:24,248 + 코미디 쇼 층은 그가 정말 많은 드롭을 연주하지 않는 매우 작은 + +621 +00:47:24,248 --> 00:47:27,368 + 거기에 조금이가는 컬러 네트워킹은 예를 들어 아주 흔한 일 + +622 +00:47:27,369 --> 00:47:30,740 + 당신은 그 대답은 그래서 낮은 드롭 아웃 시간이 지남에 끝나는로 시작 + +623 +00:47:30,739 --> 00:47:38,848 + 예 내가 두 번째 질문은 당신이 대신 단위 그냥 드롭 아웃 할 수 잊었다 + +624 +00:47:38,849 --> 00:47:41,880 + 당신이 할 수 있고 그 뭔가라고 각각의 가중치는 우리가 원하는 연결 삭제 + +625 +00:47:41,880 --> 00:47:46,349 + 이 클래스에서 너무 많이 들어가 있지만,뿐만 아니라 내가 가진 것을 할 수있는 방법이있다합니다 + +626 +00:47:46,349 --> 00:47:52,829 + 지금은 내가 당신을 우리는이 모든 것을 도입했습니다되어 수행 할 작업을 이상적으로 신뢰하는 시간이야 + +627 +00:47:52,829 --> 00:47:56,940 + 바로 공원으로 노이즈가 통과하고 그래서 당신은 단지 시간과 지금 좋아하면 + +628 +00:47:56,940 --> 00:48:00,349 + 우리는 모든 소음을 통합하고 근사를 좋아 하죠하려는 싶습니다 + +629 
+00:48:00,349 --> 00:48:03,318 + 그 뭔가를 것에 당신은 당신이 분류하려면 테스트 이미지를 가지고있는 것처럼 + +630 +00:48:03,318 --> 00:48:06,909 + 당신이 할 수있는 많은 전진은 바이너리 마스크의 다양한 설정으로 전달 + +631 +00:48:06,909 --> 00:48:10,558 + 당신은 단지 서브 네트워크를 사용하고 모든 걸쳐 평균 수 + +632 +00:48:10,559 --> 00:48:14,329 + 그래서 그 아마 배포판은 중대하지만 불행히도 아닌 것 + +633 +00:48:14,329 --> 00:48:17,818 + 매우 효율적 그래서 당신이 실제로이 과정을 근사 할 수 있습니다 밝혀 + +634 +00:48:17,818 --> 00:48:22,338 + 첫 번째 드롭 아웃과 방법을 도입 할 때 어느 정도는 지적 주신 + +635 +00:48:22,338 --> 00:48:26,170 + 당신이 당신의 신경 세포 모두 당신을 활용하고자 직관적으로이 작업을 수행합니다 + +636 +00:48:26,170 --> 00:48:29,509 + 내 무작위 우리가 길을 복사하려고거야 떨어지고 싶지 않아 우리 + +637 +00:48:29,509 --> 00:48:33,548 + 몰라 A의 전진 패스에서 드롭 그래서 온 모든 신경을 남길 수 있습니다 + +638 +00:48:33,548 --> 00:48:39,920 + 테스트 이미지 그러나 우리는 실제로 우리가 이것을 어떻게 우리가 그렇게 할 수 조심해야 + +639 +00:48:39,920 --> 00:48:43,480 + 가난한 우리가 어떤 단위를 드롭하지 않을거야 테스트 이미지를 전달하지만 우리는이 + +640 +00:48:43,480 --> 00:48:48,028 + 얻을 것을 기본적으로 하나의 방법에주의하는 그 무엇 + +641 +00:48:48,028 --> 00:48:54,880 + 문제는 이것이이란과의있어 두 개의 입력이었다고 생각 나는 생각한다 + +642 +00:48:54,880 --> 00:48:59,079 + 이 시간에 존재하는 모든 입력을 그래서 우리는 그렇게 단위를 포기하지 않을 것을 + +643 +00:48:59,079 --> 00:49:02,630 + 이들 두 사람은 가까운 일부 활성화 및 다른 의사가 시간이야 + +644 +00:49:02,630 --> 00:49:06,400 + 컴퓨터는이 비교 아직 어떤 값 세금이 될 수 있습니다 + +645 +00:49:06,400 --> 00:49:12,608 + 훈련 시간 동안 것 뉴런 밖으로 무엇을하지만 X이 값 + +646 +00:49:12,608 --> 00:49:18,440 + 이 때문에 드롭 아웃 마스크 매우 무작위 등 교육 시간에 확인 기대 + +647 +00:49:18,440 --> 00:49:21,170 + 어떤 다른 일어날 수있는 여러 가지 경우가있다 + +648 +00:49:21,170 --> 00:49:27,068 + 이들 경우 다른 규모가 될이하자에 대해 걱정해야 할 것 + +649 +00:49:27,068 --> 00:49:32,259 + 날이 내가이 생각을 의미 정확히 무엇을 보여 + +650 +00:49:32,260 --> 00:49:35,539 + 더 비선형 만 남아있는이란에가보고되지 않았다이야 말할 계산해서 + +651 +00:49:35,539 --> 00:49:39,990 + 스트레스 테스트 중에이 활성화되는 (가) 여기에 10 대기 0 W된다 + +652 +00:49:39,989 --> 00:49:44,848 + 자루 + 한 번 W 이유를 확인 그것이 내가에 테스트를 계산하기 위해 원하고 무엇 때문에 + +653 +00:49:44,849 --> 00:49:48,420 + 내가 조심해야 이유는 교육 시간 예상 출력 중입니다 + +654 +00:49:48,420 --> 00:49:51,528 + 이 특정한 경우에 아주 달라졌을 것의 우리는 네가 + +655 +00:49:51,528 --> 00:49:55,619 + 우리는 그 4에 하나 또는 다른 또는 둘 모두 또는 없음을, 그래서 드롭 수있는 가능성 + +656 +00:49:55,619 --> 00:49:56,720 + 가능성 + +657 +00:49:56,719 --> 00:50:00,750 + 컴퓨터에 다른 계곡은 실제로 당신이 때를 볼 수 있습니다이 수학을 위기했다 + +658 +00:50:00,750 --> 00:50:01,659 + 당신은 그것을 감소 + +659 +00:50:01,659 --> 00:50:07,548 + 당신은 왜 그렇게 훈련에서 기대에 WRX + W 하나 끄기 시간을 절반으로 끝낼 + +660 +00:50:07,548 --> 00:50:15,630 + 시간이 신경 세포의 갱신은 실제로 단지 시간이었고, 그래서 당신은 할 때 + +661 +00:50:15,630 --> 00:50:19,640 + 이것과 이것 저것을 보상하기 위해 당신이 가진 모든 시간을 사용하는 + +662 +00:50:19,639 --> 00:50:22,730 + 우리는 아마와 단위를 삭제 한 사실에서 오는 멀리 일 + +663 +00:50:22,730 --> 00:50:29,219 + 이를 최대 절반 그래서 아마 포인트 인 이유 절반은 그래서이다 + +664 +00:50:29,219 --> 00:50:35,358 + 다섯 올림픽 싱가포르는 우리가 결국 다음 그래서 기본적으로 우리는이 작업을 수행하지 않은 경우 통과 + +665 +00:50:35,358 --> 00:50:39,019 + 우리는 동안 기대 한 것에 비해 충분히 크지 만에 갖는 + +666 +00:50:39,019 --> 00:50:42,960 + 분포가 기본적으로 변경됩니다에서 교육 시간과 당신이있어 + +667 +00:50:42,960 --> 00:50:45,639 + 그들은 이러한보고에 사용하지이기 때문에 휴식 것이 세계의 것 + +668 +00:50:45,639 --> 00:50:49,368 + 큰 고온 열 중성자 그리고 그녀는 그 보상해야하고 그럴 필요 + +669 +00:50:49,369 --> 00:50:53,798 + 그냥 일을 일의 대신 모든 물건을 사용하지 않도록 아래로 뭉개 버려 + +670 +00:50:53,798 --> 00:50:57,480 + 하지만 당신은 복구를 다시 얻기 위해 매일 활성화에 스크래치가 당신의 + +671 +00:50:57,480 --> 00:51:03,099 + 예상 출력 확인이 실제로 어려운 점이지만 내가 한 번 들었다 생각 + +672 +00:51:03,099 --> 00:51:06,559 + 제프 힌튼은 처음에 밖으로 드롭 함께 왔을 때 이야기하는 것이 그 + +673 +00:51:06,559 --> 00:51:10,710 + 어떤하지 않았다 밖으로 우리가 드롭을 시도 그래서 실제로 완벽하게이 부분을 마련하지 않았다 + +674 +00:51:10,710 --> 00:51:16,088 + 일을하고 실제로 이유는 그가이 까다로운 놓쳤다 그가대로 작동하지 않았다 + +675 +00:51:16,088 --> 00:51:19,340 + 실제로 인정 하듯이 지적 그래서 우리는 
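
위에서 전개하는 기대값 계산(입력 두 개짜리 뉴런에서 네 가지 마스크 경우를 평균하면 훈련 시 기대 출력이 절반이 된다)을 숫자 예로 확인해 보면 다음과 같다. 값들은 설명을 위해 가정한 것이다.

```python
import numpy as np

w0, w1, x, y = 1.0, 2.0, 3.0, 4.0
a_full = w0 * x + w1 * y             # 마스크 없이 계산한 출력

# p = 0.5 dropout 에서 가능한 네 가지 마스크 경우
cases = [w0 * x + w1 * y,            # 둘 다 살아남음
         w0 * x + 0.0,               # y 쪽 유닛이 꺼짐
         0.0 + w1 * y,               # x 쪽 유닛이 꺼짐
         0.0]                        # 둘 다 꺼짐
print(np.mean(cases), 0.5 * a_full)  # 훈련 시 기대 출력 = p * a_full

a_test = 0.5 * a_full                # 테스트 시: 모든 유닛을 켜고 p 로 스케일
```
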
당신의 활성화를 확장해야 + +676 +00:51:19,340 --> 00:51:24,070 + 아래 때문에이 효과의 시스템 다음 모든 것이 훨씬 더 그렇게 작동 I + +677 +00:51:24,070 --> 00:51:28,500 + 그냥 우리가 기본적으로 이러한 계산처럼이는 모습을 보여 그냥 해요 + +678 +00:51:28,500 --> 00:51:33,449 + 정상적으로 신경망은 그래서 우리는 첫 번째 또는 두 번째하지만 지금은 그냥 시간이 우리가 될 수 있습니다 + +679 +00:51:33,449 --> 00:51:38,869 + 평화의 예를 들어 하하 확률 규모를 삭제하도록 P를 곱해야 + +680 +00:51:38,869 --> 00:51:43,139 + 되도록 활성화 아래로 기대 밖으로 예상하지만 지금은이 + +681 +00:51:43,139 --> 00:51:46,969 + 이 때 실제로 당신의 교육 시간 등의 예상 출력과 동일 + +682 +00:51:46,969 --> 00:51:52,449 + 드롭 아웃에 대한 복구 및 예상 출력이 일치하고이 실제로 작동 + +683 +00:51:52,449 --> 00:52:18,069 + 정말 잘 나는이에서 떨어지는거야, 그래서 기차와 사이에 단지 차이입니다 + +684 +00:52:18,070 --> 00:52:20,780 + 모든 신경 세포를 사용하여 모든 같은 테스트를 불일치가있어 떨어집니다 + +685 +00:52:20,780 --> 00:52:24,580 + 그래서 어느 당신은이 시점에서이를 수정하거나 우리가 부르는 당신은 사용할 수 있습니다 + +686 +00:52:24,579 --> 00:52:29,469 + 내가 조금 당신을 보여주지 벌리 드롭 아웃은 그래서 우리는 비트에 그에게거야 + +687 +00:52:29,469 --> 00:52:34,319 + 드롭 아웃 요약 당신은 아마 해제와 함께 드롭 당신의 단위를 삭제하려면 + +688 +00:52:34,320 --> 00:52:38,210 + 오줌의 확률을 유지하고 그것은 단지 당신이 경우에 그렇게를 확장하는 것을 잊지 + +689 +00:52:38,210 --> 00:52:40,820 + 이 네트워크는 잘 작동 할 것 + +690 +00:52:40,820 --> 00:52:44,190 + 확인도 다시 아니에요 마스크를 전파하는 것을 잊지 마세요 + +691 +00:52:44,190 --> 00:52:49,710 + 할 수있는 방법으로 반전 드롭 아웃을 보여주는 것은이 알아서하는 것입니다 + +692 +00:52:49,710 --> 00:52:53,349 + 기차 및 시험 용액 약간 다른 방식 간의 불일치 + +693 +00:52:53,349 --> 00:52:57,710 + 당신 일이었다 전에, 그래서 특히 우리가 할 거 야하는 것은 우리는 올해를 변경하고 + +694 +00:52:57,710 --> 00:53:01,250 + 우리가하지 않을거야 바이오 매스 컵 냉동 것들은 우리가 할 거 야한다 + +695 +00:53:01,250 --> 00:53:04,980 + 우리가 정품 인증 a를 아래로 확장 할거야, 그래서 교육 시간에 여기에 확장 + +696 +00:53:04,980 --> 00:53:07,960 + 그는 다섯은 우리가있어 소비 때문에 경우 다른 스킬을 시간을 노력 + +697 +00:53:07,960 --> 00:53:12,079 + 비난에게 뜨거운하여 기차 시간을 강화하고 우리가 우리의 코드를 떠날 수있는 시간이야 + +698 +00:53:12,079 --> 00:53:16,029 + 바로 그래서 우리는 기차 시간 활성화의 증폭을하고있는 만진 + +699 +00:53:16,030 --> 00:53:20,880 + 우리는이 행위에 의해 인위적으로 더 큰 모든 것을 만들고있어 후 시간이야 + +700 +00:53:20,880 --> 00:53:24,450 + 우리는이 거 야하지만 지금 우리는 단지 청소를 복구하는거야 + +701 +00:53:24,449 --> 00:53:27,819 + 우리가 지금 스케일링하려고 시간을 수행 한 표현 때문에 당신은 수 있습니다 + +702 +00:53:27,820 --> 00:53:31,010 + 당신은 제대로 기차와 시험 사이의 기대를 보정 할 수 있습니다 + +703 +00:53:31,010 --> 00:53:39,290 + 가장 많이 찾는이의 모든에 년과 오른쪽 그래서 드롭 아웃을 사용하고 작업 + +704 +00:53:39,289 --> 00:53:42,779 + 그래서 정말 감염 실제로 사용하는 것은 다음 몇 줄과 아래로 온다 + +705 +00:53:42,780 --> 00:53:47,300 + 뒤로 패스 조금 변경하지만 네트워크는 거의 항상 함께 잘 작동 + +706 +00:53:47,300 --> 00:54:15,070 + 이 당신은 실제 정확한에 피팅에서 심각 아니라면 그이다 + +707 +00:54:15,070 --> 00:54:17,230 + 내가 여기에 언급 한 이유는 + +708 +00:54:17,230 --> 00:54:22,039 + 근사는 조립 근사하고 이유 중 하나 인 + +709 +00:54:22,039 --> 00:54:25,029 + 실제로 다음 사진에서 일어난 일단 때문에 근사값입니다 + +710 +00:54:25,030 --> 00:54:27,769 + 이러한 예상 출력은 모든 종류의 때문에 비선형의 망쳐된다 + +711 +00:54:27,769 --> 00:54:37,500 + 이러한 질문의 상단에 효과 내가 가서 것을 가리키는 주셔서 감사합니다 + +712 +00:54:37,500 --> 00:54:44,769 + 내가없는 당신은 그들이 드롭 인 (drop-in)과 드롭 아웃 반전 말을하는지 참조 + +713 +00:54:44,769 --> 00:54:49,039 + 동등한 그렇게 여부 때문에 상기의 문제가 아니다 그녀의 일을하고 + +714 +00:54:49,039 --> 00:54:59,309 + 내가 가진 것 구십 어쩌면 당신이 바로 당신이 될 수있는 아마 그것에 대해 생각합니다 + +715 +00:54:59,309 --> 00:55:37,949 + 여기 나는이 모든 단지 기대에 기대에 대한 생각 + +716 +00:55:37,949 --> 00:55:41,349 + 당신은 절반을 삭제하고 있고 그래서 거기에 불구하고도 사용할 올바른 일이 + +717 +00:55:41,349 --> 00:55:44,049 + 실제로 결국 정확히 양에 약간의 임의성이 삭제되고 + +718 +00:55:44,050 --> 00:55:47,370 + 큰 괜찮아 + +719 +00:55:47,369 --> 00:55:51,869 + 이 오, 그래의 그래서이 있었다 밖으로 떨어질 것이다 재미있는 이야기로 당신에게 좋아 + +720 +00:55:51,869 --> 00:55:55,509 + 2012 년 제프 힌튼에 깊은 학습 여름 학교는 처음으로 나 있었다 + +721 +00:55:55,510 --> 00:55:56,590 + 적어도 처음 봤어 + +722 +00:55:56,590 --> 00:56:00,930 + 드롭 아웃을 제시하고 그래서 
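
위에서 말하는 inverted dropout 은 이 보정을 테스트가 아니라 훈련 쪽으로 옮긴 것으로, 마스크를 만들 때 p 로 미리 나누어 두면 테스트 시 코드는 손대지 않아도 된다. 한 층만 예로 든 스케치이며, 값들은 가정한 것이다.

```python
import numpy as np

p = 0.5
X = np.random.randn(20)
W1, b1 = np.random.randn(50, 20) * 0.01, np.zeros(50)

# 훈련 시: 마스크에 1/p 스케일을 미리 포함시킨다
H1 = np.maximum(0, W1 @ X + b1)
U1 = (np.random.rand(*H1.shape) < p) / p   # "inverted" dropout 마스크
H1_train = H1 * U1

# 테스트 시: 아무 스케일링 없이 그대로 사용한다
H1_test = np.maximum(0, W1 @ X + b1)
```
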
그는 기본적으로 그냥 괜찮 말하는 것을 당신의 뉴런 (20)에서 + +723 +00:56:00,929 --> 00:56:04,589 + 랜덤 그냥 난 그냥 바쁜 활성화를 해요이 항상 잘 작동 + +724 +00:56:04,590 --> 00:56:07,750 + 더 나은 우리는 내 친구가 앉아으로 그 흥미로운 와우 같은거야 + +725 +00:56:07,750 --> 00:56:10,469 + 내 옆에 그는 단지 바로 그 역이 있음 자신의 노트북을 뽑아 + +726 +00:56:10,469 --> 00:56:13,959 + 대학 기계와 이야기하는 동안과가 바로 그것을 구현 + +727 +00:56:13,960 --> 00:56:17,340 + 시간 제프 힌튼 마무리는 그가 더 나은 결과를 얻고지고 있다고 얘기 + +728 +00:56:17,340 --> 00:56:18,950 + 실제로 미술 기자의 상태 등 + +729 +00:56:18,949 --> 00:56:25,189 + 그는 빠른 작업 한 자신의 데이터에 나는 누군가가 내려면 같이 가서 봤어요 + +730 +00:56:25,190 --> 00:56:30,490 + 일본은 너무 많이 나는 이야기를하려고하면서 추가로 5 %가 바로 다음이었다 + +731 +00:56:30,489 --> 00:56:33,589 + 즉, 무언가가 실제로 매우 몇 번 거기에 정말 재미라고 생각했다 + +732 +00:56:33,590 --> 00:56:36,590 + 이런처럼 그것이 그 중 하나이기 때문에 드롭 아웃은 훌륭한 일이다 + +733 +00:56:36,590 --> 00:56:42,390 + 소수의 투자자는 매우 간단하고 항상 그냥 잘 작동하고있다 + +734 +00:56:42,389 --> 00:56:45,579 + 내가 생각 우리가 주운 팁과 트릭의 이러한 종류의 거의 + +735 +00:56:45,579 --> 00:56:49,659 + 문제는 얼마나 더 많은 간단한 일 드롭 아웃 등을들 수있다 거기입니다 + +736 +00:56:49,659 --> 00:56:50,879 + 당신에게 2 %의 활력을 불어 + +737 +00:56:50,880 --> 00:56:54,140 + 항상 우리는 모른다 + +738 +00:56:54,139 --> 00:57:01,199 + 확인은 그래서 그라데이션 검사로이 시점에서 갈 거라고하지만 난을 생각한다 + +739 +00:57:01,199 --> 00:57:04,588 + 실제로 나는 모든 신경의 피곤 때문에이 작업을 건너 내가거야 결정 + +740 +00:57:04,588 --> 00:57:07,130 + 우리와 같은 네트워크는 모든 훈련의 세부 정보를 많이 얘기했습니다 그 + +741 +00:57:07,130 --> 00:57:10,180 + 작품과 내가 너희들뿐만 아니라 피곤하고 그래서 그라데이션을 건너 뛸 것 같네요 + +742 +00:57:10,179 --> 00:57:13,469 + 그것은 아주 잘 노트 여기에 설명되어 있기 때문에 체크 나는 보시기 바랍니다 + +743 +00:57:13,469 --> 00:57:19,028 + 그것은 까다로운 과정의 종류를 통해 이동하는 시간의 비트에 소요 + +744 +00:57:19,028 --> 00:57:23,190 + 프로세스의 모든 어려움을 감사하고 그래서 그냥 내가 읽어 + +745 +00:57:23,190 --> 00:57:27,250 + 에 더 흥미 수 있도록 내가 주위에 드라이브 수있는 일이 생각하지 않습니다 + +746 +00:57:27,250 --> 00:57:29,469 + 당신은 그래서 난 그냥 그것을 확인하는 것이 좋습니다 것 + +747 +00:57:29,469 --> 00:57:33,118 + 한편 우리는 오른손을 뛰어거야하고 ​​그 작동 올 것와 + +748 +00:57:33,119 --> 00:57:42,358 + 사진을보고 너무 1980년에서이 다섯에서 아일린입니다 같이 + +749 +00:57:42,358 --> 00:57:46,538 + 대략 우리는 어떻게 상용 네트워크 마크의 세부 사항에 갈거야 + +750 +00:57:46,539 --> 00:57:49,609 + 이 클래스에서 우리는 실제로 낮은 수준의 세부 사항을 수행하지 않을거야 + +751 +00:57:49,608 --> 00:57:52,768 + 내가 얼마나이 분야에 대한 수에 대한 당신의 직관을 제공하기 위해 노력하겠습니다 + +752 +00:57:52,768 --> 00:57:56,868 + 어떤 일 전체 상황과 그냥 그에서 오는 일반적으로 작동 그렇다면 + +753 +00:57:56,869 --> 00:57:59,559 + 당신은 당신이 돌아 가야 상업 네트워크의 역사에 대해 이야기하고 싶습니다 + +754 +00:57:59,559 --> 00:58:04,910 + 특히 이렇게 대략 아홉 육십 실험 승인 및 족제비에 + +755 +00:58:04,909 --> 00:58:10,449 + 그들은 차 시각 피질 고양이를 공부하고, 그들은은을 전송했다 + +756 +00:58:10,449 --> 00:58:14,710 + 초기 시각 영역과 고양이와 고양이의 뇌는에 패턴을 찾고 있었다 + +757 +00:58:14,710 --> 00:58:19,500 + 그들은 끝내었고, 화면은 실제로이 언젠가 노벨상을 수상 + +758 +00:58:19,500 --> 00:58:23,449 + 이후이 실험을 위해 우리는이 실험이 보는 무엇을 게재 할로 + +759 +00:58:23,449 --> 00:58:27,518 + 그냥 그렇게처럼 그들은 내가 여기 여든 비디오를 뽑아 그렇게 보면 정말 재미있어 + +760 +00:58:27,518 --> 00:58:32,258 + 여기에 무슨 고양이가 위치에 고정되고, 우리가 기록을하는지 참조 + +761 +00:58:32,259 --> 00:58:35,900 + 뒷면에 처리 영역의 어딘가의 피질에서 + +762 +00:58:35,900 --> 00:58:39,809 + 뇌가 하나가 될 수 있고, 지금 우리가 고양이에 다른 빛의 패턴을 표시하고 있고 + +763 +00:58:39,809 --> 00:58:43,519 + 우리는 이제 살펴 보자 기록과 다른 자극에 대한 신경 세포의 불을 공유하고 + +764 +00:58:43,518 --> 00:58:48,039 + 어떻게 경험과 같이 표시됩니다 + +765 +00:58:48,039 --> 00:59:14,050 + 이리 + +766 +00:59:14,050 --> 00:59:27,410 + 이러한 세포와​​ 같은 실험 그들은 모두 네 모서리를 설정하는 것 + +767 +00:59:27,409 --> 00:59:30,279 + 특정 방향 그리고 그들은 가장자리와 일에 대한 흥분 + +768 +00:59:30,280 --> 00:59:36,360 + 방향과 북쪽 방향은 다음과 같이 그래서 그들을 자극하지 않습니다 + +769 +00:59:36,360 --> 00:59:42,150 + 10 분 비디오와 같은 긴 과정을 통해 우리는이 작업을 수행하지 않을거야 + +770 +00:59:42,150 --> 00:59:45,450 + 오랜 시간 그들은 
mapped out the responses, and a model emerged of how the visual cortex
+
+771
+00:59:45,449 --> 00:59:52,349
+ processes information in the brain — and so they eventually ended up with several
+
+772
+00:59:52,349 --> 00:59:56,059
+ findings, leading, for example, to the Nobel Prize — they figured out that the cortex is
+
+773
+00:59:56,059 --> 00:59:56,759
+ laid out
+
+774
+00:59:56,760 --> 01:00:02,570
+ retinotopically — what that means is that in the visual cortex there is
+
+775
+01:00:02,570 --> 01:00:06,920
+ a cortical organization where nearby cells in the cortex
+
+776
+01:00:06,920 --> 01:00:11,389
+ actually process nearby regions of your visual field, so
+
+777
+01:00:11,389 --> 01:00:15,049
+ things that are nearby in what you perceive are processed nearby, and this
+
+778
+01:00:15,050 --> 01:00:20,510
+ locality is preserved in your processing — and they also figured out that there are
+
+779
+01:00:20,510 --> 01:00:23,790
+ what they called simple cells — a whole array of these cells —
+
+780
+01:00:23,789 --> 01:00:27,659
+ that respond to a particular orientation of an edge, and all of these were
+
+781
+01:00:27,659 --> 01:00:31,809
+ different cells — some cells, for example, had more complex responses:
+
+782
+01:00:31,809 --> 01:00:34,949
+ they would fire for a particular orientation but were slightly
+
+783
+01:00:34,949 --> 01:00:38,159
+ translation invariant, so they didn't care about the exact position of the edge
+
+784
+01:00:38,159 --> 01:00:41,839
+ they only cared about the orientation — and so, through
+
+785
+01:00:41,840 --> 01:00:44,120
+ all these kinds of experiments, they hypothesized that the visual cortex
+
+786
+01:00:44,119 --> 01:00:48,269
+ has a hierarchical organization, where you end up with different simple cells
+
+787
+01:00:48,269 --> 01:00:52,679
+ and complex cells and so on — these cells are built on top of each
+
+788
+01:00:52,679 --> 01:00:56,369
+ other — and the simple cells in particular have relatively local receptive
+
+789
+01:00:56,369 --> 01:01:00,019
+ fields, while the others build up more complex types of representations
+
+790
+01:01:00,019 --> 01:01:04,320
+ so the brain works through successive layers of representation, and
+
+791
+01:01:04,320 --> 01:01:09,240
+ a lot of people have since tried to reproduce this process
+
+792
+01:01:09,239 --> 01:01:14,649
+ in a computer — the first attempts tried to model the visual cortex with code, and
+
+793
+01:01:14,650 --> 01:01:19,389
+ the first example was the Neocognitron from Fukushima — he basically ended up
+
+794
+01:01:19,389 --> 01:01:20,429
+ setting up
+
+795
+01:01:20,429 --> 01:01:26,710
+ a cell structure with local receptive fields, basically looking at small
+
+796
+01:01:26,710 --> 01:01:31,760
+ regions of the input, and he stacked this layer upon layer — so he
+
+797
+01:01:31,760 --> 01:01:34,750
+ had these simple cells, and complex cells that just aggregated the simple ones —
+
+798
+01:01:34,750 --> 01:01:39,000
+ a sandwich of simple and complex cells built on top of each other
+
+799
+01:01:39,000 --> 01:01:41,849
+ but back then, in the eighties, backpropagation wasn't really around yet
+
+800
+01:01:41,849 --> 01:01:45,380
+ so he trained these networks with unsupervised learning procedures —
+
+801
+01:01:45,380 --> 01:01:49,599
+ clustering schemes and the like — so this is not backpropagation at that
+
+802
+01:01:49,599 --> 01:01:54,150
+ time, but it had this idea of successive layers of small cells building on top of
+
+803
+01:01:54,150 --> 01:02:00,039
+ each other; then, also building on top of these experiments and this kind of
+
+804
+01:02:00,039 --> 01:02:04,739
+ thing, Yann LeCun kept the architectural layout, but what he actually did was
+
+805
+01:02:04,739 --> 01:02:09,009
+ train the network with backpropagation — so, for example, he trained it to
+
+806
+01:02:09,010 --> 01:02:12,770
+ classify digits or characters, and trained all of it with
+
+807
+01:02:12,769 --> 01:02:16,769
+ backprop — and they actually ended up using these systems to read
+
+808
+01:02:16,769 --> 01:02:23,469
+ digits for the postal service, to read checks, and so on
+
+809
+01:02:23,469 --> 01:02:27,239
+ so that actually goes back quite a long time — to the nineteen-nineties —
+
+810
+01:02:27,239 --> 01:02:33,199
+ and people used them back then, but they were very small, OK — and when
+
+811
+01:02:33,199 --> 01:02:37,559
+ 2012 came around they started getting quite big, so in this paper
+
+812
+01:02:37,559 --> 01:02:43,549
+ they trained on ImageNet, which we keep coming back to — and it's not like
+
+813
+01:02:43,550 --> 01:02:48,200
+ a small dataset — a thousand classes, a million images, actually a dataset from our lab —
+
+814
+01:02:48,199 --> 01:02:51,339
+ so they trained this model on a huge amount of data — it has roughly sixty million
+
+815
+01:02:51,340 --> 01:02:56,380
+ parameters — this is AlexNet, named after Alex Krizhevsky — these
+
+816
+01:02:56,380 --> 01:02:59,260
+ networks get names, so there's this AlexNet — they were going to have names —
+
+817
+01:02:59,260 --> 01:03:05,560
+ Google has their own, and so on — so
+
+818
+01:03:05,559 --> 01:03:09,630
+ we gave them names, and this one was AlexNet, and it's the one
+
+819
+01:03:09,630 --> 01:03:13,090
+ that actually
outperformed the other algorithms by quite a bit, and what's
+
+820
+01:03:13,090 --> 01:03:17,530
+ historically interesting to note is that between AlexNet in 2012 and anything from
+
+821
+01:03:17,530 --> 01:03:21,850
+ the nineteen-nineties, there are basically only very, very slight differences
+
+822
+01:03:21,849 --> 01:03:25,940
+ when you look at these two different networks — this one used, I think, sigmoid
+
+823
+01:03:25,940 --> 01:03:31,789
+ or tanh maybe, while this one was bigger and deeper, and
+
+824
+01:03:31,789 --> 01:03:33,460
+ was trained on GPUs and had more data
+
+825
+01:03:33,460 --> 01:03:38,889
+ and that's basically it — that's roughly the whole difference — and
+
+826
+01:03:38,889 --> 01:03:41,098
+ so really what happened is that of course we figured out better ways
+
+827
+01:03:41,099 --> 01:03:45,000
+ to initialize, and ReLUs work better, and a lot of engineering work,
+
+828
+01:03:45,000 --> 01:03:49,480
+ but other than that, the difference was all in the data and the compute
+
+829
+01:03:49,480 --> 01:03:53,740
+ while most of the architecture was very similar — and we've added a few more
+
+830
+01:03:53,739 --> 01:03:56,719
+ tricks — for example, they used large filters, and we see that we now use lots of
+
+831
+01:03:56,719 --> 01:04:01,379
+ small filters — and we also now have dozens of layers —
+
+832
+01:04:01,380 --> 01:04:05,059
+ the hundred-and-fifty-layer ones came later — so we've really scaled up quite a bit in
+
+833
+01:04:05,059 --> 01:04:08,150
+ some ways, but otherwise the basic idea of how you process information
+
+834
+01:04:08,150 --> 01:04:09,789
+ is similar
+
+835
+01:04:09,789 --> 01:04:15,150
+ OK — so ConvNets are basically everywhere now, and they can do all kinds of things
+
+836
+01:04:15,150 --> 01:04:19,280
+ things like classification, of course — and they're very good at retrieval, so you
+
+837
+01:04:19,280 --> 01:04:24,119
+ can show them an image and they can retrieve other images like it — they can also do
+
+838
+01:04:24,119 --> 01:04:29,809
+ detection — so detecting the people and dogs and horses and so on here —
+
+839
+01:04:29,809 --> 01:04:33,230
+ which can be used, for example, in some German cars — and then there's
+
+840
+01:04:33,230 --> 01:04:36,588
+ segmentation — they can do these experiments where every single pixel
+
+841
+01:04:36,588 --> 01:04:41,409
+ is labeled — person, or road, or trees, or sky — reconstructing the scene, for
+
+842
+01:04:41,409 --> 01:04:47,529
+ example for use in cars — here is a small embedded NVIDIA Tegra
+
+843
+01:04:47,530 --> 01:04:51,480
+ GPU that we can run ConvNets on, for example — one reason this can be useful is
+
+844
+01:04:51,480 --> 01:04:55,480
+ a car that can identify everything — that can perceive all the
+
+845
+01:04:55,480 --> 01:04:57,219
+ stuff around you
+
+846
+01:04:57,219 --> 01:05:02,039
+ some of your friends, if tagged — ConvNets probably identify faces
+
+847
+01:05:02,039 --> 01:05:04,909
+ on Facebook automatically — at this point I'd guess that's almost certain —
+
+848
+01:05:04,909 --> 01:05:10,069
+ video classification at YouTube, identifying what's inside YouTube videos
+
+849
+01:05:10,070 --> 01:05:14,900
+ there's a very successful Google project where they used this:
+
+850
+01:05:14,900 --> 01:05:17,900
+ basically Google was really interested in taking Street View images and
+
+851
+01:05:17,900 --> 01:05:20,809
+ automatically reading out the house numbers
+
+852
+01:05:20,809 --> 01:05:25,019
+ OK, and it turns out this works — so they had a huge amount of human
+
+853
+01:05:25,019 --> 01:05:30,289
+ labor and an enormous amount of data, and then they put a giant ConvNet on it
+
+854
+01:05:30,289 --> 01:05:33,429
+ and it ended up working almost as well as a human — and that's a theme we're going
+
+855
+01:05:33,429 --> 01:05:37,710
+ to see throughout: this stuff works really, really well — estimating
+
+856
+01:05:37,710 --> 01:05:41,730
+ poses — they can play computer games —
+
+857
+01:05:41,730 --> 01:05:46,559
+ they can detect all kinds of cancers and the like, and
+
+858
+01:05:46,559 --> 01:05:53,519
+ they can read Chinese characters in images, recognize street signs — this one, I think, is
+
+859
+01:05:53,519 --> 01:05:57,690
+ segmentation of neural tissue — they can also do non-visual things:
+
+860
+01:05:57,690 --> 01:06:02,510
+ for example, they can recognize speech — they've been used for speech processing —
+
+861
+01:06:02,510 --> 01:06:07,780
+ and also text documents, so you can look at text with ConvNets as well — they've
+
+862
+01:06:07,780 --> 01:06:11,400
+ been used to recognize the types of galaxies —
+
+863
+01:06:11,400 --> 01:06:15,570
+ in a recent Kaggle competition, to recognize different whales — this one
+
+864
+01:06:15,570 --> 01:06:18,420
+ did particularly well there — there were something like a hundred whales, and
+
+865
+01:06:18,420 --> 01:06:24,409
+ this is one particular individual, so it has its own pattern of white spots
+
+866
+01:06:24,409 --> 01:06:28,179
+ on
its head — that's the particular way this one can be recognized, which is amazing —
+
+867
+01:06:28,179 --> 01:06:32,618
+ they work on satellite photos — quite a lot now, because
+
+868
+01:06:32,619 --> 01:06:35,280
+ multiple companies have a lot of satellite data, so all this analysis —
+
+869
+01:06:35,280 --> 01:06:39,530
+ in this case it's finding the winding roads with a big ConvNet — but you can also see
+
+870
+01:06:39,530 --> 01:06:43,850
+ agricultural applications — and they can also caption images — they might
+
+871
+01:06:43,849 --> 01:06:48,829
+ describe what they capture — this includes some of my own work, which we saw — we take an image
+
+872
+01:06:48,829 --> 01:06:53,369
+ and instead of a single category we produce a whole sentence — and they can
+
+873
+01:06:53,369 --> 01:06:56,150
+ be used for various artistic endeavors
+
+874
+01:06:56,150 --> 01:06:59,800
+ so this is something called Deep Dream, and we'll go into how it
+
+875
+01:06:59,800 --> 01:07:00,350
+ works
+
+876
+01:07:00,349 --> 01:07:04,440
+ you'll actually implement it in the third assignment — OK, maybe — you will
+
+877
+01:07:04,440 --> 01:07:08,099
+ implement it in the third assignment, so you can give it an image and use it
+
+878
+01:07:08,099 --> 01:07:11,349
+ to do this weird stuff
+
+879
+01:07:11,349 --> 01:07:17,380
+ in particular, it hallucinates a lot of dogs, and we'll get into why dogs —
+
+880
+01:07:17,380 --> 01:07:20,349
+ it turns out it's because these networks shown here are trained end-to-end
+
+881
+01:07:20,349 --> 01:07:25,579
+ on ImageNet, which has lots of dogs — so these networks
+
+882
+01:07:25,579 --> 01:07:28,259
+ kind of end up using some of those
+
+883
+01:07:28,260 --> 01:07:32,440
+ patterns — and you should be able to put in different images
+
+884
+01:07:32,440 --> 01:07:36,710
+ and run them in a loop, handing back hallucinated images — and we'll see how
+
+885
+01:07:36,710 --> 01:07:42,769
+ it works in a bit — I won't explain the slide now, but it looks cool, so you can
+
+886
+01:07:42,769 --> 01:07:47,559
+ imagine — I'll also point out, and will probably include somewhere,
+
+887
+01:07:47,559 --> 01:07:51,579
+ an interesting paper on how these networks rival the representation
+
+888
+01:07:51,579 --> 01:07:55,420
+ of, I think, the IT cortex of primates for recognition — what they did:
+
+889
+01:07:55,420 --> 01:08:00,250
+ I think this is a macaque monkey, and basically they're looking here,
+
+890
+01:08:00,250 --> 01:08:05,280
+ recording here from the IT cortex — recording the neural
+
+891
+01:08:05,280 --> 01:08:09,030
+ activations while the monkey looks at images — and they feed the same images
+
+892
+01:08:09,030 --> 01:08:12,660
+ through the network, and what they try to do is predict, from the population —
+
+893
+01:08:12,659 --> 01:08:16,960
+ from the ConvNet code, or from a sparse population of neurons —
+
+894
+01:08:16,960 --> 01:08:21,560
+ from that population of activations, they try to perform some category classification
+
+895
+01:08:21,560 --> 01:08:25,820
+ and what you see is that the coding of the IT cortex and classifying
+
+896
+01:08:25,819 --> 01:08:30,519
+ images with it is almost as good as using these 2013 neural networks, in terms of
+
+897
+01:08:30,520 --> 01:08:35,400
+ the information they carry about the image — you get almost identical performance
+
+898
+01:08:35,399 --> 01:08:40,279
+ in classification — and here is maybe a more striking result, where we compare:
+
+899
+01:08:40,279 --> 01:08:43,759
+ they fed a lot of images through the ConvNet — they have
+
+900
+01:08:43,760 --> 01:08:46,720
+ a lot of images — and then they look at how these images are
+
+901
+01:08:46,720 --> 01:08:48,789
+ represented in the brain, or in the ConvNet
+
+902
+01:08:48,789 --> 01:08:53,019
+ so this is the representation of images — how they're laid out in the two spaces —
+
+903
+01:08:53,020 --> 01:08:57,520
+ by the ConvNet — and you can compare the similarity matrices, and statistically
+
+904
+01:08:57,520 --> 01:09:00,450
+ between the IT cortex and the ConvNet you can see
+
+905
+01:09:00,449 --> 01:09:04,099
+ that there's a mapping between them — basically very, very similar representations
+
+906
+01:09:04,100 --> 01:09:08,440
+ it almost looks like the ConvNet is computing how similar things are laid out —
+
+907
+01:09:08,439 --> 01:09:12,399
+ which different concepts and contents are close together in visual space — very
+
+908
+01:09:12,399 --> 01:09:16,809
+ similarly to what you see in the brain — so some people
+
+909
+01:09:16,810 --> 01:09:20,780
+ think that's some evidence that this thing is doing something brain-like
+
+910
+01:09:20,779 --> 01:09:23,769
+ which is very interesting — so the only remaining question in that
+
+911
+01:09:23,770 --> 01:09:24,330
+ case
+
+912
+01:09:24,329 --> 01:09:27,210
+ is how this works
+
+913
+01:09:27,210 --> 01:09:28,609
+ we'll find out next class
+
diff --git a/captions/Ko/Lecture8_ko.srt b/captions/Ko/Lecture8_ko.srt
new file mode 100644
index 00000000..abdab665
--- /dev/null
+++ b/captions/Ko/Lecture8_ko.srt
@@ -0,0 +1,3528 @@
+1
+00:00:00,000 --> 00:00:07,519
+ it's about that time, so let's get started — today, just so you know, is a bit of a lighter lecture
+
+2
+00:00:07,519 --> 00:00:11,269
+ last time we talked about — we've kind of seen all the parts of
+
+3
+00:00:11,269 --> 00:00:14,439
+ ConvNets — and today we're putting everything together and looking at some
+
+4
+00:00:14,439 --> 00:00:16,250
+ ConvNet applications
+
+5
+00:00:16,250 --> 00:00:20,550
+ we'll actually dive inside images and talk about spatial localization and
+
+6
+00:00:20,550 --> 00:00:25,550
+ detection — we actually moved this lecture up a bit — it used to be later
+
+7
+00:00:25,550 --> 00:00:29,080
+ in the schedule — because we saw that a lot of you are interested in this type of project
+
+8
+00:00:29,079 --> 00:00:31,839
+ and we wanted to move it earlier to kind of give you an idea of what
+
+9
+00:00:31,839 --> 00:00:38,378
+ is possible — so first, a few administrative things: project proposals were
+
+10
+00:00:38,378 --> 00:00:41,988
+ due on Saturday, so my inbox kind of exploded over the weekend — I think most of you
+
+11
+00:00:41,988 --> 00:00:45,909
+ submitted, but if you haven't, you should probably get on that — we're going to
+
+12
+00:00:45,909 --> 00:00:49,328
+ go through them and make sure the project proposals are
+
+13
+00:00:49,329 --> 00:00:52,530
+ reasonable, and if one isn't, we'll hopefully let you know
+
+14
+00:00:52,530 --> 00:01:02,149
+ sometime this week or so — so, on the homework due Friday — who's done? who's
+
+15
+00:01:02,149 --> 00:01:04,519
+ stuck on batch norm?
+
+16
+00:01:04,519 --> 00:01:09,820
+ OK, good — fewer hands than last week, so we're making
+
+17
+00:01:09,819 --> 00:01:13,688
+ progress — also keep in mind that we're actually asking you to train
+
+18
+00:01:13,688 --> 00:01:17,798
+ a pretty big ConvNet in this homework, so if you start training
+
+19
+00:01:17,799 --> 00:01:22,570
+ on Thursday night it could be rough — so maybe also start early on the last part
+
+20
+00:01:22,569 --> 00:01:25,618
+ homework one is in the process of being graded — hopefully we'll have it back to you
+
+21
+00:01:25,618 --> 00:01:30,540
+ this week, so you get feedback before the next homework — also keep in mind
+
+22
+00:01:30,540 --> 00:01:35,450
+ that a week from then — actually next week, on Wednesday — we have the midterm in class
+
+23
+00:01:35,450 --> 00:01:41,159
+ so be ready for Wednesday's class — it should be a lot of fun
+
+24
+00:01:41,159 --> 00:01:46,359
+ OK, so last lecture we talked about convolutions — what we did was
+
+25
+00:01:46,358 --> 00:01:50,438
+ we spent a long time understanding the convolution
+
+26
+00:01:50,438 --> 00:01:53,699
+ operator — how it works, how we transform one feature map into
+
+27
+00:01:53,700 --> 00:01:58,329
+ another by sliding this window over the map and taking dot
+
+28
+00:01:58,328 --> 00:02:01,809
+ products everywhere — and how we actually transform the representation through
+
+29
+00:02:01,810 --> 00:02:05,759
+ many layers of processing — and if you remember, these lower
+
+30
+00:02:05,759 --> 00:02:09,299
+ convolutional layers tend to learn things like edges and colors, and the higher
+
+31
+00:02:09,299 --> 00:02:14,790
+ layers tend to learn more complex object parts — we talked about pooling, which
+
+32
+00:02:14,789 --> 00:02:18,509
+ shrinks our feature representation inside the network — it's used in some form in most
+
+33
+00:02:18,509 --> 00:02:24,209
+ networks; it's a common ingredient — we also did case studies of specific
+
+34
+00:02:24,209 --> 00:02:27,479
+ ConvNet architectures and saw how these things tend to be put together in
+
+35
+00:02:27,479 --> 00:02:31,568
+ practice — so we talked about LeNet, from 1998, and
+
+36
+00:02:31,568 --> 00:02:35,189
+ how it was used for digit recognition — we talked about AlexNet, which
+
+37
+00:02:35,189 --> 00:02:38,949
+ kind of kicked off the big deep-learning boom by winning ImageNet in 2012
+
+38
+00:02:38,949 --> 00:02:45,568
+ we talked about ZFNet, which won ImageNet classification in 2013 and is
+
+39
+00:02:45,568 --> 00:02:51,108
+ pretty similar to AlexNet — and now we see that depth is often better —
+
+40
+00:02:51,109 --> 00:02:55,709
+ for classification we saw GoogLeNet and VGG, which did really well in 2014
+
+41
+00:02:55,709 --> 00:03:00,609
+ and were much deeper than AlexNet — and in that competition we
+
+42
+00:03:00,609 --> 00:03:05,430
+ also saw this crazy cool new thing from Microsoft called ResNet
+
+43
+00:03:05,430 --> 00:03:10,909
+ an architecture with about a hundred and fifty layers, from December 2015 — so
+
+44
+00:03:10,909 --> 00:03:14,579
+ over the last few years the different architectures have been
+
+45
+00:03:14,579 --> 00:03:19,109
+ getting deeper and deeper — but that's just classification
+
+46
+00:03:19,109 --> 00:03:23,980
+ now, in this lecture, we're going to talk about localization and detection, which is
+
+47
+00:03:23,979 --> 00:03:28,500
+ another really big, important problem in computer vision — and this idea will actually
+
+48
+00:03:28,500 --> 00:03:32,699
+ come up a lot — we'll visit all kinds of deep networks that take a better crack
+
+49
+00:03:32,699 --> 00:03:37,798
+ at these newer tasks as well — so, so far in class we've really talked
+
+50
+00:03:37,799 --> 00:03:42,639
+ about classification: given an image, we want to classify it
+
+51
+00:03:42,639 --> 00:03:47,049
+ into some number of object categories — it's a nice basic problem, in that
+
+52
+00:03:47,049 --> 00:03:50,340
+ we've been using it to understand computer vision and ConvNets —
+
+53
+00:03:50,340 --> 00:03:53,800
+ but there are actually a lot of other tasks that people have been coming up with
+
+54
+00:03:53,800 --> 00:03:59,350
+ so one of these is classification plus localization — now, instead,
+
+55
+00:03:59,349 --> 00:04:03,699
+ in addition to classifying, we also want to produce some kind of label
+
+56
+00:04:03,699 --> 00:04:07,349
+ together with a box drawn on the image where that class occurs
+
+57
+00:04:07,349 --> 00:04:11,549
+ another problem people work on is the detection task — so here, again, there are some
+
+58
+00:04:11,550 --> 00:04:15,689
+ object categories in the picture, but we actually want to find all instances of
+
+59
+00:04:15,689 --> 00:04:20,238
+ those categories around the image and draw boxes around them — and another recent,
+
+60
+00:04:20,238 --> 00:04:24,189
+ crazy thing people have started working on a bit is called instance
+
+61
+00:04:24,189 --> 00:04:27,490
+ segmentation — again you have some number of
+
+62
+00:04:27,490 --> 00:04:30,829
+ categories, and you want to find all instances of those categories in the image,
+
+63
+00:04:30,829 --> 00:04:35,319
+ but instead of using boxes, you actually want to draw a little outline around them
+
+64
+00:04:35,319 --> 00:04:37,279
+ and identify all the pixels
+
+65
+00:04:37,279 --> 00:04:41,549
+ belonging to each instance — instance segmentation is kind of crazy, and we're not
+
+66
+00:04:41,550 --> 00:04:44,710
+ going to talk about it today, but I think you should be aware of it
+
+67
+00:04:44,709 --> 00:04:47,959
+ we're really going to focus today on these localization and detection tasks
+
+68
+00:04:47,959 --> 00:04:52,009
+ a big difference between these is the number of objects to be found
+
+69
+00:04:52,009 --> 00:04:56,250
+ so in localization there's a single object, or generally some fixed number of
+
+70
+00:04:56,250 --> 00:05:00,129
+ objects, whereas in detection there can be multiple objects, or a variable
+
+71
+00:05:00,129 --> 00:05:04,000
+ number of objects — and this seems like a small difference, but it will turn
+
+72
+00:05:04,000 --> 00:05:05,360
+ out to be a big one —
+
+73
+00:05:05,360 --> 00:05:10,480
+ it really matters for the architectures — so we'll first talk about
+
+74
+00:05:10,480 --> 00:05:15,610
+ classification and localization, because it's kind of simpler — so just to recap,
+
+75
+00:05:15,610 --> 00:05:16,389
+ as I just said,
+
+76
+00:05:16,389 --> 00:05:21,849
+ classification is image to category label, localization is image to a box, and
+
+77
+00:05:21,850 --> 00:05:26,730
+ classification plus localization means we're going to do both at the same time —
+
+78
+00:05:26,730 --> 00:05:30,669
+ to give you an idea of the kind of setup people use — we've talked
+
+79
+00:05:30,668 --> 00:05:33,849
+ about the ImageNet classification challenge — there's also
+
+80
+00:05:33,850 --> 00:05:37,810
+ a classification + localization challenge that runs on ImageNet — so here,
+
+81
+00:05:37,810 --> 00:05:42,269
+ similar to the classification task, there are a thousand classes, and each
+
+82
+00:05:42,269 --> 00:05:46,319
+ training instance of a class — there are actually multiple per class —
+
+83
+00:05:46,319 --> 00:05:51,069
+ comes with bounding boxes for the class inside the image — and at test time your algorithm
+
+84
+00:05:51,069 --> 00:05:55,709
+ outputs — instead of just guessing a class label, it
+
+85
+00:05:55,709 --> 00:05:59,370
+ returns a class label together with a bounding box — and to get it right, you need to
+
+86
+00:05:59,370 --> 00:06:03,288
+ get the class label right and the bounding box right — where getting the bounding box
+
+87
+00:06:03,288 --> 00:06:06,589
+ right means you're close under this thing called intersection
+
+88
+00:06:06,589 --> 00:06:11,310
+ over union — don't worry too much about that for now — and you get the
+
+89
+00:06:11,310 --> 00:06:15,259
+ image right if at least one of your five guesses is correct — and
+
+90
+00:06:15,259 --> 00:06:18,129
+ this is kind of the main dataset people work on for classification +
+
+91
+00:06:18,129 --> 00:06:25,159
+ localization — so one really fundamental paradigm that's really useful when
+
+92
+00:06:25,160 --> 00:06:28,700
+ thinking about localization is this idea of regression — I don't know,
+
+93
+00:06:28,699 --> 00:06:31,219
+ think back to your machine learning class, where you saw things like
+
+94
+00:06:31,220 --> 00:06:36,160
+ classification and regression — you can do regression or something fancier —
+
+95
+00:06:36,160 --> 00:06:39,689
+ when we talk about localization, it really means we can
+
+96
+00:06:39,689 --> 00:06:42,980
+ really
frame this — our image — as a regression problem:
+
+97
+00:06:42,980 --> 00:06:46,700
+ the image comes in, goes through some kind of processing, and
+
+98
+00:06:46,699 --> 00:06:49,990
+ eventually we produce real-valued numbers as the
+
+99
+00:06:49,990 --> 00:06:53,829
+ output — these boxes have different parameterizations people commonly use:
+
+100
+00:06:53,829 --> 00:06:57,759
+ the x, y coordinates of the upper-left corner plus the width and height of the
+
+101
+00:06:57,759 --> 00:07:01,000
+ box — you see other variants as well, but it's always four numbers for
+
+102
+00:07:01,000 --> 00:07:04,680
+ a bounding box — and again there's some ground-truth bounding box, which is
+
+103
+00:07:04,680 --> 00:07:08,810
+ just four numbers — and now we can compute a loss, maybe like a Euclidean
+
+104
+00:07:08,810 --> 00:07:12,699
+ loss — a pretty standard choice — between the numbers we produced and
+
+105
+00:07:12,699 --> 00:07:16,339
+ the correct numbers — and now we just set this thing up like we did our
+
+106
+00:07:16,339 --> 00:07:20,489
+ classification networks: we sample a minibatch of data with some ground-
+
+107
+00:07:20,490 --> 00:07:24,210
+ truth boxes, we forward-propagate, compute the loss between our predictions
+
+108
+00:07:24,209 --> 00:07:29,359
+ and the correct predictions, backpropagate, and update the network — so
+
+109
+00:07:29,360 --> 00:07:33,250
+ this paradigm actually makes this localization task really easy to set up
+
+110
+00:07:33,250 --> 00:07:37,269
+ so here's a really simple recipe, pretty easy to implement, for how
+
+111
+00:07:37,269 --> 00:07:41,289
+ you can implement classification + localization — first, just download
+
+112
+00:07:41,290 --> 00:07:44,370
+ some existing pre-trained model — or train your own if you're ambitious —
+
+113
+00:07:44,370 --> 00:07:48,139
+ something like AlexNet, VGG, GoogLeNet — all the things we talked about
+
+114
+00:07:48,139 --> 00:07:53,180
+ last lecture — now we take those fully connected layers
+
+115
+00:07:53,180 --> 00:07:57,100
+ that produce our class scores, set them aside for a moment, and we're
+
+116
+00:07:57,100 --> 00:08:00,410
+ going to attach some new fully connected layers at some point in the network
+
+117
+00:08:00,410 --> 00:08:04,840
+ this is called a regression head — these get called heads — but I basically mean
+
+118
+00:08:04,839 --> 00:08:08,119
+ the same kind of thing: a few fully connected layers, and it puts out some
+
+119
+00:08:08,120 --> 00:08:13,889
+ real-valued numbers — now we train this thing like we trained our
+
+120
+00:08:13,889 --> 00:08:17,209
+ classification network — the only difference is that now, instead of class
+
+121
+00:08:17,209 --> 00:08:18,359
+ scores,
+
+122
+00:08:18,360 --> 00:08:24,550
+ we use an L2 loss and the ground-truth boxes — and we train this
+
+123
+00:08:24,550 --> 00:08:28,918
+ network in exactly the same way — now, at test time, we use both heads: we do
+
+124
+00:08:28,918 --> 00:08:32,218
+ classification and localization — we take the image, we run it through
+
+125
+00:08:32,219 --> 00:08:36,700
+ the classification head and the localization head that we trained,
+
+126
+00:08:36,700 --> 00:08:40,620
+ and when we're done we get class scores and boxes — and that's it, that's really all
+
+127
+00:08:40,620 --> 00:08:44,259
+ you guys would need to do — so this is a really nice, simple kind of recipe
+
+128
+00:08:44,259 --> 00:08:50,208
+ that you could use for classification + localization in different projects
+
+129
+00:08:50,208 --> 00:08:54,750
+ one little detail about this method is that there are kind of two ways
+
+130
+00:08:54,750 --> 00:08:59,990
+ people do this regression task: you can imagine a class-agnostic regressor, or
+
+131
+00:08:59,990 --> 00:09:04,190
+ a class-specific one — you can imagine a regressor where, no matter the class, I
+
+132
+00:09:04,190 --> 00:09:07,760
+ use exactly the same fully connected architecture with the same weights in these
+
+133
+00:09:07,759 --> 00:09:11,600
+ layers, and the output producing my bounding box is going to be
+
+134
+00:09:11,600 --> 00:09:15,379
+ just a box — always four numbers, no matter the class — whereas
+
+135
+00:09:15,379 --> 00:09:19,139
+ the alternative you sometimes see is class-specific regression — now you're
+
+136
+00:09:19,139 --> 00:09:23,389
+ putting out something like C times four numbers — one bounding box per
+
+137
+00:09:23,389 --> 00:09:27,569
+ class — and people have sometimes found this works better in
+
+138
+00:09:27,570 --> 00:09:31,269
+ some cases and not in others — but I think intuitively it kind of makes sense:
+
+139
+00:09:31,269 --> 00:09:35,470
+ the way you think about localizing a cat might be a little
+
+140
+00:09:35,470 --> 00:09:38,129
+ bit different from how you localize something else — so this
+
+141
+00:09:38,129 --> 00:09:42,289
+ trains different parts of the network to be responsible for different things —
+
+142
+00:09:42,289 --> 00:09:45,569
+ and it's a pretty easy swap — you just change how the loss is computed a
+
+143
+00:09:45,570 --> 00:09:49,329
+ little, so that you compute the loss using only the ground-
+
+144
+00:09:49,328 --> 00:09:52,809
+ truth class's box — but it's still basically the same idea
+
+145
+00:09:52,809 --> 00:09:57,750
+ exactly where the regression head gets attached is another design choice —
+
+146
+00:09:57,750 --> 00:10:01,360
+ and again, if you look at how different people do it, it doesn't matter too
+
+147
+00:10:01,360 --> 00:10:05,120
+ much — it's done different ways — one common choice is to attach it right after
+
+148
+00:10:05,120 --> 00:10:09,948
+ the last convolutional layer — that's kind of what things like
+
+149
+00:10:09,948 --> 00:10:14,909
+ DeepPose do: they initialize new fully connected layers for
+
+150
+00:10:14,909 --> 00:10:18,909
+ the localization task — another common choice is to attach it right after
+
+151
+00:10:18,909 --> 00:10:22,939
+ the last fully connected layer from the
+
+152
+00:10:22,940 --> 00:10:27,310
+ classification task — you see that in things like Overfeat and
+
+153
+00:10:27,309 --> 00:10:31,099
+ R-CNN-style work — but either of these works fine —
+
+154
+00:10:31,100 --> 00:10:38,129
+ you can kind of attach it anywhere and make something work — so, as an aside,
+
+155
+00:10:38,129 --> 00:10:42,029
+ we can actually generalize this framework to localizing more than one
+
+156
+00:10:42,029 --> 00:10:46,610
+ object — so in this classification-localization task, generally we
+
+157
+00:10:46,610 --> 00:10:50,440
+ set up our network to produce exactly one
+
+158
+00:10:50,440 --> 00:10:54,620
+ bounding box per input image — but in cases where you know ahead of time
+
+159
+00:10:54,620 --> 00:10:59,279
+ that you always want to localize some fixed number of objects, here
+
+160
+00:10:59,279 --> 00:11:03,730
+ it's really easy to generalize: your regression head just outputs a box for each
+
+161
+00:11:03,730 --> 00:11:07,039
+ object you care about, and you train the network again in the same
+
+162
+00:11:07,039 --> 00:11:12,839
+ way as before — and this idea of localizing multiple objects is actually pretty
+
+163
+00:11:12,840 --> 00:11:16,790
+ powerful in general — for example, there's this kind of approach
+
+164
+00:11:16,789 --> 00:11:21,559
+ used for human pose estimation — so we want to input a crop of a
+
+165
+00:11:21,559 --> 00:11:25,299
+ person — a close-up view of a person — and figure out the pose of
+
+166
+00:11:25,299 --> 00:11:29,789
+ that person — well, people generally have a fixed number of joints —
+
+167
+00:11:29,789 --> 00:11:34,370
+ your chest and neck and elbows and that kind of stuff — so we know
+
+168
+00:11:34,370 --> 00:11:39,060
+ we need to find all the joints — so we can take our image, run it through a
+
+169
+00:11:39,059 --> 00:11:43,829
+ convolutional network, and regress the x, y coordinates for each joint position
+
+170
+00:11:43,830 --> 00:11:47,490
+ and that gives us the whole task — we can actually predict the full human pose
+
+171
+00:11:47,490 --> 00:11:52,409
+ using this kind of localization framework — there's a paper
+
+172
+00:11:52,409 --> 00:11:55,819
+ from Google a year or two ago that did this kind of thing, with
+
+173
+00:11:55,820 --> 00:11:59,740
+ various other bells and whistles, but the basic idea was just regressing to
+
+174
+00:11:59,740 --> 00:12:05,100
+ these joint positions with a CNN — so, overall, this idea of treating localization
+
+175
+00:12:05,100 --> 00:12:09,769
+ as regression to some number of coordinates is really, really simple — so, something to know
+
+176
+00:12:09,769 --> 00:12:12,659
+ for projects: if you guys were thinking that you actually want to run
+
+177
+00:12:12,659 --> 00:12:16,850
+ detection because you want to understand which parts of an image — or find
+
+178
+00:12:16,850 --> 00:12:21,290
+ parts inside an image — if you're thinking about a project along those lines,
+
+179
+00:12:21,289 --> 00:12:25,019
+ I'd really encourage you to think about this localization framework instead —
+
+180
+00:12:25,019 --> 00:12:27,750
+ if you actually have a fixed number of objects that you know you want to
+
+181
+00:12:27,750 --> 00:12:31,929
+ localize in every image, you should try framing it as a localization problem
+
+182
+00:12:31,929 --> 00:12:38,129
+ that tends to be a much simpler idea that's actually easy to set up — OK, so
+
+183
+00:12:38,129 --> 00:12:42,019
+ localization via regression is really simple and actually works — I
+
+184
+00:12:42,019 --> 00:12:44,120
+ would really recommend trying it for projects
+
+185
+00:12:44,120 --> 00:12:47,330
+ but if you want to win something like the ImageNet competition, you need to add a little
+
+186
+00:12:47,330 --> 00:12:52,330
+ extra fancy stuff — so another thing people do for localization is this
+
+187
+00:12:52,330 --> 00:12:56,410
+ idea of a sliding window — we'll step through it in more detail, but the idea is
+
+188
+00:12:56,409 --> 00:13:00,809
+ that you still have your two classification and localization
+
+189
+00:13:00,809 --> 00:13:04,929
+ heads — but you're going to run the network on the image not just once, but at multiple
+
+190
+00:13:04,929 --> 00:13:08,269
+ locations in the image, and then you'll aggregate over those different
+
+191
+00:13:08,269 --> 00:13:13,100
+ locations — and it kind of turns out that you can actually do this in an efficient way
+
+192
+00:13:13,100 --> 00:13:17,290
+ to see more concretely how this sliding-window localization works, we'll
+
+193
+00:13:17,289 --> 00:13:21,980
+ look at the Overfeat architecture — Overfeat was the winner of the
+
+194
+00:13:21,980 --> 00:13:25,399
+ ImageNet localization challenge in 2013
+
+195
+00:13:25,399 --> 00:13:29,730
+ and it has this kind of setup — this architecture basically looks like what we saw
+
+196
+00:13:29,730 --> 00:13:33,839
+ a couple of days ago — it's an AlexNet where we have,
+
+197
+00:13:33,839 --> 00:13:37,820
+ coming off it, a classification head that was putting out
+
+198
+00:13:37,820 --> 00:13:38,740
+ our class scores
+
+199
+00:13:38,740 --> 00:13:44,450
+ and a regression head that was putting out these boxes — and this ran on an AlexNet-
+
+200
+00:13:44,450 --> 00:13:51,120
+ type architecture that expects an input of 221 by 221 — but in practice we can run
+
+201
+00:13:51,120 --> 00:13:55,679
+ it on bigger images, and that can sometimes help — so suppose we have a bigger
+
+202
+00:13:55,679 --> 00:14:02,799
+ image — let's say 257 by 257 — now we can imagine taking our
+
+203
+00:14:02,799 --> 00:14:06,659
+ classification + localization network and just running it on the top corner of
+
+204
+00:14:06,659 --> 00:14:11,799
+ this image — that will give us some class scores and
+
+205
+00:14:11,799 --> 00:14:15,979
+ a regressed bounding box — and we're going to repeat this with the same classification
+
+206
+00:14:15,980 --> 00:14:21,820
+ + localization network, running it on the other corners — the four corners of this image
+
+207
+00:14:21,820 --> 00:14:26,230
+ doing this, we end up with one regressed bounding box at each of
+
+208
+00:14:26,230 --> 00:14:30,509
+ those four locations, together with a classification score for each location
+
+209
+00:14:30,509 --> 00:14:35,700
+ but we actually want just a single bounding box — so then they use some
+
+210
+00:14:35,700 --> 00:14:39,770
+ heuristic to merge these boxes and scores — it's a little ugly and I
+
+211
+00:14:39,769 --> 00:14:42,809
+ don't want to go into the details here — they're in the paper — but the idea is
+
+212
+00:14:42,809 --> 00:14:46,699
+ that it combines and aggregates these boxes across the multiple locations
+
+213
+00:14:46,700 --> 00:14:50,959
+ and that tends to work — it helps the model do quite
+
+214
+00:14:50,958 --> 00:14:55,058
+ well — and that year they won the challenge using this —
+
+215
+00:14:55,058 --> 00:14:58,149
+ though in practice they actually use many more than four locations
+
+216
+00:14:58,149 --> 00:15:08,989
+ [student question] oh — do the boxes have to fit inside the image?
+
+217
+00:15:08,989 --> 00:15:12,939
+ well — that's actually a good point — once you're regressing
+
+218
+00:15:12,938 --> 00:15:15,498
+ and predicting numbers, there's no constraint — your box doesn't
+
+219
+00:15:15,499 --> 00:15:20,149
+ have to be inside the image at all — I know, but it doesn't have to be —
+
+220
+00:15:20,149 --> 00:15:23,698
+ and they make a good point that when you're
+
+221
+00:15:23,698 --> 00:15:27,088
+ training this network in this sliding-window way, you actually have to
+
+222
+00:15:27,089 --> 00:15:30,429
+ shift the ground-truth boxes a bit — adjust the frame for those
+
+223
+00:15:30,428 --> 00:15:35,999
+ different crops — ugly details of that sort, but just so you're aware
+
+224
+00:15:35,999 --> 00:15:39,428
+ in practice they actually use many more than four image locations,
+
+225
+00:15:39,428 --> 00:15:43,629
+ as well as multiple scales — this figure is actually from the paper —
+
+226
+00:15:43,629 --> 00:15:47,129
+ on the left, I think, you see the different locations they evaluate
+
+227
+00:15:47,129 --> 00:15:52,058
+ in the middle you see this network being run on each of those boxes, one by one,
+
+228
+00:15:52,058 --> 00:15:55,678
+ and at the bottom you can see the score maps for each of those
+
+229
+00:15:55,678 --> 00:16:00,139
+ locations — I mean, they're very noisy, but they kind of
+
+230
+00:16:00,139 --> 00:16:03,899
+ converge on the bear — and they run this fancy aggregation method to
+
+231
+00:16:03,899 --> 00:16:07,839
+ get a final box for the bear, deciding which pairs belong
+
+232
+00:16:07,839 --> 00:16:12,869
+ together — they actually won the challenge with this — but one problem you might expect is
+
+233
+00:16:12,869 --> 00:16:15,759
+ that it can actually get pretty expensive to run each one of these networks on every
+
+234
+00:16:15,759 --> 00:16:20,259
+ crop — but there's actually something more efficient we can do —
+
+235
+00:16:20,259 --> 00:16:23,489
+ normally we think of these networks as having convolutional layers and then
+
+236
+00:16:23,489 --> 00:16:26,048
+ fully connected layers — but when you think about it,
+
+237
+00:16:26,048 --> 00:16:31,108
+ a fully connected layer is really just a vector of 4096 numbers — but
+
+238
+00:16:31,109 --> 00:16:34,679
+ instead of thinking of it as a vector, we can think of it as
+
+239
+00:16:34,678 --> 00:16:39,269
+ another convolutional feature map — it sounds a little crazy — we just add
+
+240
+00:16:39,269 --> 00:16:45,019
+ one-by-one spatial dimensions — so the idea is that we can now convert our fully
+
+241
+00:16:45,019 --> 00:16:49,499
+ connected layers into convolutional ones — so imagine:
+
+242
+00:16:49,499 --> 00:16:54,339
+ in our fully connected network we had this convolutional feature map, and we had
+
+243
+00:16:54,339 --> 00:16:57,749
+ fully connected layers producing — from each element of that feature map —
+
+244
+00:16:57,749 --> 00:17:02,048
+ each element of our 4096-dimensional vector — but instead, we think
+
+245
+00:17:02,048 --> 00:17:06,288
+ of it as reshaping — as having an equivalent layer that is a five-
+
+246
+00:17:06,288 --> 00:17:06,970
+ by-five
+
+247
+00:17:06,970 --> 00:17:10,120
+ convolution — it's a little strange, but if you think about it, it makes sense
+
+248
+00:17:10,119 --> 00:17:16,318
+ in the end — so we take this fully connected layer and turn it into
+
+249
+00:17:16,318 --> 00:17:21,899
+ a five-by-five convolution, and the previously fully
+
+250
+00:17:21,900 --> 00:17:26,409
+ connected layer from 4096 to 4096 actually becomes a one-by-one
+
+251
+00:17:26,409 --> 00:17:30,570
+ convolution — and that's a little weird, but if you think hard about it —
+
+252
+00:17:30,569 --> 00:17:35,369
+ go work out the math on paper in a quiet room — you'll figure it out — so
+
+253
+00:17:35,369 --> 00:17:38,769
+ we basically take each fully connected layer in our network and
+
+254
+00:17:38,769 --> 00:17:43,509
+ turn it into a convolutional layer — and this is really cool, because now our
+
+255
+00:17:43,509 --> 00:17:47,589
+ network is composed entirely of convolution, pooling, and element-wise
+
+256
+00:17:47,589 --> 00:17:51,819
+ operations — so now we can actually run the network on images of different sizes
+
+257
+00:17:51,819 --> 00:17:56,889
+ and this kind of thing gives us the equivalent, at a very low cost,
+
+258
+00:17:56,890 --> 00:18:01,840
+ of operating independently at different positions — so let's see
+
+259
+00:18:01,839 --> 00:18:02,609
+ how it works
+
+260
+00:18:02,609 --> 00:18:07,219
+ imagine at training time you work on 14-by-14 inputs — you run
+
+261
+00:18:07,220 --> 00:18:11,960
+ some convolutions here, and we have these fully connected layers that we're now
+
+262
+00:18:11,960 --> 00:18:17,140
+ re-imagining as convolutions — so this becomes a five-by-five conv block,
+
+263
+00:18:17,140 --> 00:18:22,600
+ and these become one-by-one specially-sized elements — so we've turned
+
+264
+00:18:22,599 --> 00:18:26,449
+ them into convolutions — I'm not showing the depth dimension here, but these
+
+265
+00:18:26,450 --> 00:18:30,900
+ would be one-by-one-by-4096, right — so we convert these layers
+
+266
+00:18:30,900 --> 00:18:35,259
+ into convolutions — and now that they're convolutions, we know we can
+
+267
+00:18:35,259 --> 00:18:39,700
+ actually run them on a larger-sized input — and you can see that now we've
+
+268
+00:18:39,700 --> 00:18:43,558
+ added a few extra pixels, and now we actually run all of this through the
+
+269
+00:18:43,558 --> 00:18:47,869
+ convolutions and get a two-by-two output — but what's really cool here is
+
+270
+00:18:47,869 --> 00:18:52,058
+ that we're sharing computation — so now we've made our output
+
+271
+00:18:52,058 --> 00:18:56,428
+ four times bigger, but we've done much less than four times the compute — because
+
+272
+00:18:56,429 --> 00:19:00,360
+ think about the difference in the computation we're doing here:
+
+273
+00:19:00,359 --> 00:19:04,449
+ the only extra computation happens in this yellow part — so we're actually very
+
+274
+00:19:04,450 --> 00:19:08,610
+ efficiently evaluating the network at many different positions
+
+275
+00:19:08,609 --> 00:19:11,918
+ without spending much extra computation — and this is how they can evaluate
+
+276
+00:19:11,919 --> 00:19:15,240
+ that network in a very, very dense multi-scale way — like you saw a few
+
+277
+00:19:15,240 --> 00:19:19,388
+ slides before — OK, questions about this?
+
+278
+00:19:19,388 --> 00:19:25,558
+ OK, right — let's actually look at our classification + localization results
+
+279
+00:19:25,558 --> 00:19:30,858
+ so over the past few years: in 2012, AlexNet — Alex Krizhevsky and Geoff
+
+280
+00:19:30,858 --> 00:19:36,358
+ Hinton's team — won not just classification but localization as well, but I couldn't
+
+281
+00:19:36,358 --> 00:19:40,978
+ find published details of exactly how they did it — in 2013,
+
+282
+00:19:40,979 --> 00:19:45,249
+ Overfeat — what we just saw — improved a bit on AlexNet's results
+
+283
+00:19:45,429 -->
00:19:50,429
+ the following year we talked about VGG — this really deep 19-layer
+
+284
+00:19:50,429 --> 00:19:54,009
+ network — they got second place in classification but actually first in
+
+285
+00:19:54,009 --> 00:19:59,139
+ localization — and VGG actually used basically exactly the same strategy as
+
+286
+00:19:59,138 --> 00:20:03,918
+ Overfeat — they just used a deeper network — and, interestingly, VGG
+
+287
+00:20:03,919 --> 00:20:08,288
+ used fewer scales, evaluated the network at fewer places, used fewer
+
+288
+00:20:08,288 --> 00:20:12,878
+ tricks — but still reduced the error quite substantially — so basically the only
+
+289
+00:20:12,878 --> 00:20:17,868
+ difference between Overfeat and VGG here is that VGG is a deeper network — so here
+
+290
+00:20:17,868 --> 00:20:20,858
+ we can see that these really powerful image features actually improve
+
+291
+00:20:20,858 --> 00:20:24,098
+ localization performance quite a bit, without changing the localization
+
+292
+00:20:24,098 --> 00:20:28,418
+ architecture at all — we only swapped out the CNN, and the results improved
+
+293
+00:20:28,419 --> 00:20:34,169
+ and that will be a theme — then in 2015 Microsoft swept everything
+
+294
+00:20:34,169 --> 00:20:39,239
+ with this ResNet from Microsoft, which we covered in this lecture series — and they
+
+295
+00:20:39,239 --> 00:20:43,629
+ crushed the localization error, taking it all the way down from 25
+
+296
+00:20:43,628 --> 00:20:48,738
+ to about nine — but I mean, this is a little — this is really a story of
+
+297
+00:20:48,739 --> 00:20:52,798
+ two separate things — so yes, they had deep features, but Microsoft
+
+298
+00:20:52,798 --> 00:20:56,398
+ also actually used a different localization method called RPN — region proposal
+
+299
+00:20:56,398 --> 00:21:00,699
+ networks — so it's really not clear which part of this matters —
+
+300
+00:21:00,700 --> 00:21:04,929
+ whether it's the better localization strategy or the better features — but at any rate, they
+
+301
+00:21:04,929 --> 00:21:10,139
+ did really well — and that's pretty much all I want to say about classification
+
+302
+00:21:10,138 --> 00:21:13,848
+ and localization — consider doing it for your project — any questions
+
+303
+00:21:13,848 --> 00:21:19,509
+ about this task before we move on to the next one?
+
+304
+00:21:19,509 --> 00:21:32,890
+ [question] right — so the L2 loss in particular has this issue where
+
+305
+00:21:32,890 --> 00:21:37,050
+ outliers are really bad for it — so sometimes people don't use an L2 loss
+
+306
+00:21:37,049 --> 00:21:40,609
+ and instead use an L1 loss, which can help a bit with outliers —
+
+307
+00:21:40,609 --> 00:21:45,279
+ people will sometimes also do a smooth L1 loss, which looks like L1 out
+
+308
+00:21:45,279 --> 00:21:49,339
+ in the tails but is quadratic near zero — so actually swapping
+
+309
+00:21:49,339 --> 00:21:53,319
+ those regression loss functions can sometimes help a bit with outliers, in cases
+
+310
+00:21:53,319 --> 00:21:56,399
+ where you have a bit of noise — hopefully you don't
+
+311
+00:21:56,400 --> 00:22:14,380
+ [question: do they backprop through the whole network, or just the new head?]
+
+312
+00:22:14,380 --> 00:22:18,560
+ actually, I don't remember exactly — I don't remember whether Overfeat
+
+313
+00:22:18,559 --> 00:22:23,409
+ did, but VGG actually backprops through the whole network —
+
+314
+00:22:23,410 --> 00:22:27,230
+ if you just train the head, it can actually work fine — it's a quick
+
+315
+00:22:27,230 --> 00:22:30,289
+ regression — but you tend to get slightly better results if you
+
+316
+00:22:30,289 --> 00:22:34,049
+ backprop down into the network — VGG did this experiment, and they probably
+
+317
+00:22:34,049 --> 00:22:37,659
+ gained a point or two from backpropping through the whole thing, but at
+
+318
+00:22:37,660 --> 00:22:41,320
+ the cost of more compute and training time — so I would
+
+319
+00:22:41,319 --> 00:22:44,769
+ say, as a first pass, just try it without backpropping into the whole
+
+320
+00:22:44,769 --> 00:22:50,440
+ network
+
+321
+00:22:50,440 --> 00:22:57,110
+ [question] generally no — at test time you're testing on the same classes you've seen
+
+322
+00:22:57,109 --> 00:23:00,839
+ at training time — obviously you'll see different instances — but I mean,
+
+323
+00:23:00,839 --> 00:23:04,759
+ if we never showed you a bear at training time, you'd have a hard time on bears —
+
+324
+00:23:04,759 --> 00:23:07,370
+ generalizing across classes would be pretty hard, I think
+
+325
+00:23:07,369 --> 00:23:20,638
+ [question] yeah, that's a good question — so sometimes people will train
+
+326
+00:23:20,638 --> 00:23:24,349
+ both at the same time, and sometimes people will end up with separate
+
+327
+00:23:24,349 --> 00:23:27,089
+ networks — one of them kind of responsible for the regression
+
+328
+00:23:27,089 --> 00:23:38,089
+ and one responsible only for classification — both of those work fine — good question
+
+329
+00:23:38,089 --> 00:23:40,558
+ OK — actually, the next thing we're going to talk about is
+
+330
+00:23:40,558 --> 00:23:50,740
+ this other task of object detection — so
+
+331
+00:23:50,740 --> 00:23:56,808
+ [question] oh yeah, I mean — well, it kind of depends on your training strategy —
+
+332
+00:23:56,808 --> 00:23:59,920
+ it also kind of goes back to this idea of class-agnostic
+
+333
+00:23:59,920 --> 00:24:03,610
+ versus class-specific regression — with class-agnostic regression it doesn't matter —
+
+334
+00:24:03,609 --> 00:24:06,889
+ while if you regress to class-specific boxes, you're kind of
+
+335
+00:24:06,890 --> 00:24:13,950
+ training a separate regressor for each class — OK, let's talk about object
+
+336
+00:24:13,950 --> 00:24:19,220
+ detection — so detection is much cooler and fancier, but
+
+337
+00:24:19,220 --> 00:24:22,890
+ harder — the idea is that again we have an input image and some set of
+
+338
+00:24:22,890 --> 00:24:26,660
+ classes, and we want to find all instances of those classes in the input
+
+339
+00:24:26,660 --> 00:24:31,670
+ image — so, I mean, we know that regression worked pretty well for localization — why
+
+340
+00:24:31,670 --> 00:24:37,470
+ don't we just try detection as regression? so we have this image with two dogs
+
+341
+00:24:37,470 --> 00:24:41,429
+ and a cat — we have four things — so it seems like this wants sixteen numbers
+
+342
+00:24:41,429 --> 00:24:46,250
+ regressed out of the image — that seems fine — but if we look at a different image, like
+
+343
+00:24:46,250 --> 00:24:50,609
+ this one — it only has two things coming out, so it just needs eight numbers — and
+
+344
+00:24:50,609 --> 00:24:54,589
+ then we look at this one with all of the cats in it, and we need a whole bunch of numbers — so
+
+345
+00:24:54,589 --> 00:24:57,519
+ I mean, it's kind of hard to treat detection as straight-up regression
+
+346
+00:24:57,519 --> 00:25:01,450
+ because we have this problem of variable-sized outputs — so we're going to —
+
+347
+00:25:01,450 --> 00:25:04,460
+ actually, there is something fancier that we'll talk about later that
+
+348
+00:25:04,460 --> 00:25:09,539
+ kind of does this anyway and treats it as regression, but we'll
+
+349
+00:25:09,539 --> 00:25:12,950
+ get to that — in general, though, you don't want to treat this as
+
+350
+00:25:12,950 --> 00:25:18,360
+ regression, because we have this very variable output — so there's a really easy
+
+351
+00:25:18,359 --> 00:25:22,779
+ way to solve this problem: think of detection not as regression but
+
+352
+00:25:22,779 --> 00:25:25,960
+ as classification — right — in machine learning you learn regression and classification,
+
+353
+00:25:25,960 --> 00:25:29,929
+ and classification is your hammer — you can use it to just eat every problem
+
+354
+00:25:29,929 --> 00:25:34,250
+ so instead — we know how regression works — we'll do classification:
+
+355
+00:25:34,250 --> 00:25:38,558
+ we'll classify image regions — we just take a CNN, right, and we take
+
+356
+00:25:38,558 --> 00:25:43,349
+ many regions of this input image and classify each one — and say things like:
+
+357
+00:25:43,349 --> 00:25:46,129
+ definitely nothing estimated in this region — no —
+
+358
+00:25:46,130 --> 00:25:50,770
+ move over a little — we know we found a dog — a bit further, a cat — but when we've
+
+359
+00:25:50,769 --> 00:25:54,460
+ moved a little more, there's nothing there — so we can actually just try
+
+360
+00:25:54,460 --> 00:25:58,558
+ a whole bunch of different image regions, run a classifier on each, and this will
+
+361
+00:25:58,558 --> 00:26:02,490
+ basically solve our variable-sized output problem
+
+362
+00:26:02,490 --> 00:26:11,160
+ [question] so the problem of deciding — there's no more problem — how do we decide
+
+363
+00:26:11,160 --> 00:26:14,558
+ which window size? the answer is we just try them all, right —
+
+364
+00:26:14,558 --> 00:26:18,879
+ just try them all — and that's actually a big problem, right, because we
+
+365
+00:26:18,880 --> 00:26:21,910
+ need to try windows of different sizes at multiple different positions and
+
+366
+00:26:21,910 --> 00:26:25,290
+ different scales — so testing this properly is going to be really expensive —
+
+367
+00:26:25,289 --> 00:26:39,089
+ there are a whole lot of places we need to look — [question]
+
+368
+00:26:39,089 --> 00:26:45,058
+ you can add an extra class — there are two options: one, you add a separate background
+
+369
+00:26:45,058 --> 00:26:49,569
+ class that says, oh, there's nothing here — another thing you can actually do
+
+370
+00:26:49,569 --> 00:26:54,159
+ is multi-label classification, right — you can put out multiple positives —
+
+371
+00:26:54,160 --> 00:26:56,950
+ and that's actually very easy to do: instead of a softmax loss you have
+
+372
+00:26:56,950 --> 00:27:01,390
+ independent logistic regressions per class — independent per-class losses — so
+
+373
+00:27:01,390 --> 00:27:05,100
+ you can actually say yes to multiple classes at one point, but that's
+
+374
+00:27:05,099 --> 00:27:10,189
+ just a tweak to the loss
function — so that's very easy to do
+
+375
+00:27:10,190 --> 00:27:13,220
+ so we have a problem with this approach, as we saw — there's a whole
+
+376
+00:27:13,220 --> 00:27:17,690
+ bunch of different positions we need to evaluate — and the solution, sort of,
+
+377
+00:27:17,690 --> 00:27:21,308
+ from a few years ago was: you just use a really fast
+
+378
+00:27:21,308 --> 00:27:26,299
+ classifier and try them all — so detection is actually a really old problem
+
+379
+00:27:26,299 --> 00:27:29,119
+ in computer vision, and you probably need a bit more historical
+
+380
+00:27:29,119 --> 00:27:34,109
+ perspective — so around 2005 there was this really successful
+
+381
+00:27:34,109 --> 00:27:38,490
+ approach that did detection really successfully using features
+
+382
+00:27:38,490 --> 00:27:42,039
+ called histograms of oriented gradients — the HOG representation — and if you
+
+383
+00:27:42,039 --> 00:27:46,609
+ think back to homework 1, you actually used this feature in the last part — you can
+
+384
+00:27:46,609 --> 00:27:50,979
+ actually use it for classification too — so this was kind of the biggest feature
+
+385
+00:27:50,980 --> 00:27:55,670
+ we had in computer vision in 2005 — and the idea is we'll just have a linear
+
+386
+00:27:55,670 --> 00:27:59,550
+ classifier on top of this feature — so our classifier will be a
+
+387
+00:27:59,549 --> 00:28:03,460
+ linear classifier, which is really fast to compute — so the way this works is: we compute
+
+388
+00:28:03,460 --> 00:28:08,250
+ oriented gradient features on the whole image at multiple scales, and we run this
+
+389
+00:28:08,250 --> 00:28:12,660
+ linear classifier at every position and every scale — it's just really fast, right —
+
+390
+00:28:12,660 --> 00:28:13,210
+ everywhere —
+
+391
+00:28:13,210 --> 00:28:15,329
+ evaluate the classifier and move past
+
+392
+00:28:15,329 --> 00:28:21,029
+ this worked really well back in 2005 — the people who had this kind of idea
+
+393
+00:28:21,029 --> 00:28:25,029
+ pushed it a bit further over the next few years into kind of the most
+
+394
+00:28:25,029 --> 00:28:29,879
+ important pre-deep-learning detection paradigm —
+
+395
+00:28:29,880 --> 00:28:34,470
+ the deformable parts model — I don't want to go into too much detail,
+
+396
+00:28:34,470 --> 00:28:39,309
+ but the basic idea is that we're still committed to these histogram-of-oriented-
+
+397
+00:28:39,309 --> 00:28:42,619
+ gradient features, but now our model, rather than being a linear
+
+398
+00:28:42,619 --> 00:28:46,659
+ classifier — we have this linear template for the
+
+399
+00:28:46,660 --> 00:28:51,370
+ object, and we also have these templates for the parts, which can
+
+400
+00:28:51,369 --> 00:28:57,119
+ deform a bit over spatial positions — and it has some slightly fancy
+
+401
+00:28:57,119 --> 00:29:01,939
+ machinery — the fancy part, if you want to see something really cool about these things, I'd say is
+
+402
+00:29:01,940 --> 00:29:07,190
+ the dynamic programming algorithm for evaluating them really fast at
+
+403
+00:29:07,190 --> 00:29:11,100
+ test time — if you enjoy that kind of thing, this part is actually kind of fun
+
+404
+00:29:11,099 --> 00:29:16,119
+ to think about — but the end result is that it's a much
+
+405
+00:29:16,119 --> 00:29:19,209
+ more powerful classifier, with a bit of deformability in the
+
+406
+00:29:19,210 --> 00:29:23,079
+ model — and you can still evaluate it really fast, so we still just go and
+
+407
+00:29:23,079 --> 00:29:26,490
+ evaluate everywhere: every scale, every aspect ratio, every position —
+
+408
+00:29:26,490 --> 00:29:33,039
+ everywhere, and move on — this actually worked really well around 2010
+
+409
+00:29:33,039 --> 00:29:37,619
+ and it was kind of the state of the art in detection for quite a while
+
+410
+00:29:37,619 --> 00:29:40,509
+ there was a really cool paper — I won't spend too much time on this —
+
+411
+00:29:40,509 --> 00:29:45,049
+ from last year arguing that these DPM models are actually just a certain type
+
+412
+00:29:45,049 --> 00:29:47,480
+ of ConvNet, right —
+
+413
+00:29:47,480 --> 00:29:51,329
+ these histograms are kind of like edge filters, and there's
+
+414
+00:29:51,329 --> 00:29:55,539
+ something like pooling in there, and that kind of stuff — so if
+
+415
+00:29:55,539 --> 00:30:00,349
+ you're interested, check out this paper — it's kind of fun to think about — right,
+
+416
+00:30:00,349 --> 00:30:02,250
+ but we really want to make this work
+
+417
+00:30:02,250 --> 00:30:06,259
+ with classifiers that aren't fast like a linear model — maybe with a
+
+418
+00:30:06,259 --> 00:30:11,809
+ CNN — so here the problem is still hard, right — we have a lot of different
+
+419
+00:30:11,809 --> 00:30:14,940
+ positions we want to try, and we probably can't afford to actually try
+
+420
+00:30:14,940 --> 00:30:19,220
+ them all — so the solution is that we don't try them all — instead we have
+
+421
+00:30:19,220 --> 00:30:23,380
+ something that gives us a kind of guess about where
+
+422
+00:30:23,380 --> 00:30:28,720
+ to look, and we only pay the cost of classification at
+
+423
+00:30:28,720 --> 00:30:35,419
+ that small number of positions — so these things are called region
+
+424
+00:30:35,419 --> 00:30:39,900
+ proposals — a region proposal method takes an image and then
+
+425
+00:30:39,900 --> 00:30:45,280
+ outputs a whole bunch of regions where objects might possibly be found
+
+426
+00:30:45,279 --> 00:30:48,428
+ one way to think about these region proposals is as a really fast,
+
+427
+00:30:48,429 --> 00:30:53,038
+ class-agnostic object detector, right — they don't care about the class,
+
+428
+00:30:53,038 --> 00:30:56,038
+ they're not very accurate, but they run pretty fast, and they give us a whole
+
+429
+00:30:56,038 --> 00:31:00,769
+ bunch of boxes — and the general intuition behind these region proposal
+
+430
+00:31:00,769 --> 00:31:04,639
+ methods is that they kind of look for blob-like structure in the image —
+
+431
+00:31:04,640 --> 00:31:09,740
+ because objects usually — I mean, a dog, if you squint, looks kind of like
+
+432
+00:31:09,740 --> 00:31:13,940
+ a blob, a cat looks like a blob, a flower looks like a white blob —
+
+433
+00:31:13,940 --> 00:31:17,929
+ eyes and noses can be kind of blobby — so with these region proposal methods,
+
+434
+00:31:17,929 --> 00:31:21,650
+ what you see is that a lot of the time they kind of put boxes around a lot of
+
+435
+00:31:21,650 --> 00:31:27,820
+ the blobby regions of the image — so probably the most famous region proposal method
+
+436
+00:31:27,819 --> 00:31:31,538
+ is called Selective Search — you don't really need to know in much
+
+437
+00:31:31,538 --> 00:31:36,980
+ detail exactly how it works, but the idea is just this: you start from your pixels, and you
+
+438
+00:31:36,980 --> 00:31:40,919
+ kind of merge adjacent pixels together if they have similar color and texture,
+
+439
+00:31:40,919 --> 00:31:45,770
+ so they form these connected, blobby regions — then you
+
+440
+00:31:45,769 --> 00:31:50,740
+ merge these blobby regions so that you get bigger and bigger blobby regions
+
+441
+00:31:50,740 --> 00:31:53,829
+ and then, at each of these different scales, you can actually convert each of
+
+442
+00:31:53,829 --> 00:31:58,710
+ these blobby regions into a box by just drawing a box around it — so by doing this
+
+443
+00:31:58,710 --> 00:32:02,548
+ over multiple scales, you end up with a whole bunch of boxes around
+
+444
+00:32:02,548 --> 00:32:06,359
+ a lot of the blobby stuff in the image — and this is reasonably fast to compute, and
+
+445
+00:32:06,359 --> 00:32:11,500
+ it actually cuts the search space down quite a bit — but Selective
+
+446
+00:32:11,500 --> 00:32:14,720
+ Search is not the only game in town, right — it may be the most famous, but there are a whole
+
+447
+00:32:14,720 --> 00:32:18,319
+ bunch of other region proposal methods that people have developed
+
+448
+00:32:18,319 --> 00:32:21,509
+ this paper from last year actually did a really cool, thorough, scientific
+
+449
+00:32:21,509 --> 00:32:25,890
+ evaluation — it went through all the different region proposal methods and gave
+
+450
+00:32:25,890 --> 00:32:29,950
+ the pros and cons of each and all kinds of stuff — but I mean, my
+
+451
+00:32:29,950 --> 00:32:33,620
+ takeaway from this paper was: use EdgeBoxes — so if you have to pick one,
+
+452
+00:32:33,619 --> 00:32:37,459
+ that's it — it's really fast — it runs in a third of a second
+
+453
+00:32:37,460 --> 00:32:40,950
+ per image, compared to about ten seconds for Selective Search —
+
+454
+00:32:40,950 --> 00:32:49,000
+ and more stars is better — it gets a lot of stars — so right, now
+
+455
+00:32:49,000 --> 00:32:51,970
+ we have this idea of region proposals, and we have this idea of CNN
+
+456
+00:32:51,970 --> 00:32:56,679
+ classification — so let's just put it all together — and this
+
+457
+00:32:56,679 --> 00:33:02,830
+ idea was kind of first put together in a really nice way in 2014 with this method
+
+458
+00:33:02,829 --> 00:33:08,740
+ called R-CNN — it's a region-based CNN — the idea is
+
+459
+00:33:08,740 --> 00:33:12,179
+ pretty simple — it's everything we've already seen — we have an input image,
+
+460
+00:33:12,179 --> 00:33:17,028
+ we run a region proposal method — with Selective Search you get maybe
+
+461
+00:33:17,028 --> 00:33:21,929
+ two thousand boxes at different scales and positions — two thousand is still a lot, but
+
+462
+00:33:21,929 --> 00:33:26,380
+ it's much fewer than all possible boxes in the image — now, for each of those boxes,
+
+463
+00:33:26,380 --> 00:33:31,510
+ we crop out that image region, warp it to some fixed size, and
+
+464
+00:33:31,509 --> 00:33:35,898
+ run it through a CNN for classification — and this CNN has
+
+465
+00:33:35,898 --> 00:33:41,199
+ a regression head and a classification head — the regression was here, and the classification
+
+466
+00:33:41,200 --> 00:33:46,259
+ used SVMs here — so the idea is that this regression can kind of correct
+
+467
+00:33:46,259 --> 00:33:50,369
+ region proposals that are a little bit off — and this actually works
+
+468
+00:33:50,369 --> 00:33:55,219
+ really well — yeah, it's really simple, really cool — but unfortunately —
+
+469
+00:33:55,220 --> 00:33:59,460
+ unfortunately, the training pipeline gets a bit complicated, a bit long — so
+
+470
+00:33:59,460 --> 00:34:03,788
+ to end up training an R-CNN model, you — like with a lot of
+
+471
+00:34:03,788 --> 00:34:06,970
+ models — first start by downloading a model from the internet — that works well —
+
+472
+00:34:06,970 --> 00:34:13,240
+ one originally trained for classification — and then, next,
+
+473
+00:34:13,239 --> 00:34:16,868
+ you actually want to fine-tune this model for detection, because this
+
+474
+00:34:16,869 --> 00:34:20,780
+ classification model was probably trained on ImageNet with its thousand classes, but
+
+475
+00:34:20,780 --> 00:34:24,019
+ your detection dataset has a different number of classes, and the images
+
+476
+00:34:24,019 --> 00:34:28,398
+ are a bit different too — so you train this network a bit more — it's still running
+
+477
+00:34:28,398 --> 00:34:29,679
+ classification —
+
+478
+00:34:29,679 --> 00:34:33,429
+ you need to add a few new layers at the end to deal with your classes and
+
+479
+00:34:33,429 --> 00:34:38,068
+ to help deal with the slightly different statistics of your image data — so here
+
+480
+00:34:38,068 --> 00:34:41,579
+ you're still doing classification, but you're not running on whole images —
+
+481
+00:34:41,579 --> 00:34:44,869
+ you're running on positive and negative regions from
+
+482
+00:34:44,869 --> 00:34:49,950
+ your detection dataset — right, so with the new layers swapped in, you
+
+483
+00:34:49,949 --> 00:34:53,599
+ train this thing again
+
+484
+00:34:53,599 --> 00:34:57,889
+ the next step is about features — so get yourself a disk, because for
+
+485
+00:34:57,889 --> 00:35:02,230
+ every image in the dataset you run Selective Search,
+
+486
+00:35:02,230 --> 00:35:07,079
+ extract those regions, run them through the CNN, and cache those features down
+
+487
+00:35:07,079 --> 00:35:12,319
+ to disk — something important for this step is to have a big hard drive —
+
+488
+00:35:12,320 --> 00:35:16,289
+ the dataset isn't too big — maybe on the order of a few tens of thousands of
+
+489
+00:35:16,289 --> 00:35:20,170
+ images — but extracting these features actually takes hundreds of gigabytes, so
+
+490
+00:35:20,170 --> 00:35:26,869
+ that's not so great — and then, next, we want to train these SVMs
+
+491
+00:35:26,869 --> 00:35:30,909
+ that can classify into the different classes based on these cached
+
+492
+00:35:30,909 --> 00:35:35,649
+ features — so here we want to train a bunch of
+
+493
+00:35:35,650 --> 00:35:40,760
+ different binary SVMs that classify image regions on whether or not they
+
+494
+00:35:40,760 --> 00:35:45,220
+ contain, or do not contain, that one object — and going back to a question from a
+
+495
+00:35:45,219 --> 00:35:49,029
+ bit ago — sometimes you actually want a single region to be able
+
+496
+00:35:49,030 --> 00:35:53,460
+ to output yes for multiple classes on the same image, and
+
+497
+00:35:53,460 --> 00:35:56,889
+ one way they do that is by just training these separate binary SVMs per
+
+498
+00:35:56,889 --> 00:36:01,579
+ class, right — so this is kind of an offline process — they just use
+
+499
+00:36:01,579 --> 00:36:08,230
+ the cached features — so maybe these are the positive samples — so you have these features
+
+500
+00:36:08,230 --> 00:36:11,820
+ for the cat samples — yeah, the picture isn't perfect, but you get the
+
+501
+00:36:11,820 --> 00:36:14,700
+ idea — you have different images, different image
+
+502
+00:36:14,699 --> 00:36:18,599
+ regions, you have the features from those regions stored on disk, and
+
+503
+00:36:18,599 --> 00:36:22,029
+ you divide them into positive and negative samples for each class,
+
+504
+00:36:22,030 --> 00:36:27,269
+ and you just train these binary SVMs — and you do the same thing —
+
+505
+00:36:27,269 --> 00:36:33,239
+ do this for dog, and you just do this for every class — OK, so now
+
+506
+00:36:33,239 --> 00:36:37,029
+ there's one more step left, right — there's this idea of box regression — so
+
+507
+00:36:37,030 --> 00:36:40,450
+ sometimes the region proposals are not perfect — so what we actually want
+
+508
+00:36:40,449 --> 00:36:45,549
+ is to regress, from those cached features, onto a correction to the
+
+509
+00:36:45,550 --> 00:36:50,269
+ region proposal — and for the correction there's this kind of funny
+
+510
+00:36:50,269 --> 00:36:54,320
+ normalized parameterization — see the paper for the exact details of the representation —
+
+511
+00:36:54,320 --> 00:36:58,300
+ but the intuition is: maybe this region proposal is
+
+512
+00:36:58,300 --> 00:37:02,030
+ pretty good — we don't really need any correction — but maybe this
+
+513
+00:37:02,030 --> 00:37:06,250
+ proposal is a bit too far to the left — it should be — like the object is in the middle and
+
+514
+00:37:06,250 --> 00:37:09,510
+ the ground truth is a bit to the right — so we regress to this
+
+515
+00:37:09,510 --> 00:37:12,530
+ correction factor that actually tells us we should shift a little to the
+
+516
+00:37:12,530 --> 00:37:15,780
+ right — or maybe this one is a little too wide —
+
+517
+00:37:15,780 --> 00:37:19,100
+ it includes too much stuff other than the cat — so we can regress
+
+518
+00:37:19,099 --> 00:37:21,880
+ to a correction factor telling us we should shrink
+
+519
+00:37:21,880 --> 00:37:26,539
+ the region proposal a bit — and again, this can just be done with linear
+
+520
+00:37:26,539 --> 00:37:30,340
+ regression — you have these features, you have
+
+521
+00:37:30,340 --> 00:37:35,490
+ these targets, and you just run linear regression like you know from CS229 — so before
+
+522
+00:37:35,489 --> 00:37:39,219
+ we look at results, we should talk a little bit about the different
+
+523
+00:37:39,219 --> 00:37:42,769
+ datasets used for detection — there are kind of three you'll see
+
+524
+00:37:42,769 --> 00:37:48,489
+ in practice — one is the PASCAL VOC dataset, which I think was very important
+
+525
+00:37:48,489 --> 00:37:53,399
+ previously, but is a bit small now — this one has about twenty
+
+526
+00:37:53,400 --> 00:37:57,820
+ classes and about twenty thousand images, and a ratio of about two objects
+
+527
+00:37:57,820 --> 00:38:01,550
+ per image — it's a relatively small-ish dataset, so you see a lot of
+
+528
+00:38:01,550 --> 00:38:05,860
+ detection papers working on it, just because it's easier to deal with — there's also
+
+529
+00:38:05,860 --> 00:38:09,970
+ the ImageNet detection dataset, which runs into a whole bunch of issues —
+
+530
+00:38:09,969 --> 00:38:13,109
+ you've seen ImageNet so far with classification and classification + localization —
+
+531
+00:38:13,110 --> 00:38:17,820
+ but there's also a detection challenge — it has
+
+532
+00:38:17,820 --> 00:38:21,600
+ two hundred classes, not the thousand from classification, but it's very
+
+533
+00:38:21,599 --> 00:38:25,619
+ big — nearly half a million images — so you don't see many papers working on this —
+
+534
+00:38:25,619 --> 00:38:29,819
+ it kind of sucks to deal with — and it has about one object per image — and
+
+535
+00:38:29,820 --> 00:38:32,760
+ then, more recently, there's this thing from Microsoft called COCO,
+
+536
+00:38:32,760 --> 00:38:36,660
+ which has a smaller number of classes and images, but actually has more objects
+
+537
+00:38:36,659 --> 00:38:42,649
+ per image — that ratio — which makes it even more interesting — right — when
+
+538
+00:38:42,650 --> 00:38:45,300
+ you're talking about detection, there's also this
+
+539
+00:38:45,300 --> 00:38:49,000
+ funny evaluation metric people use, called mean average precision — I don't want to
+
+540
+00:38:49,000 --> 00:38:52,000
+ get into the details too much — the things you really need to know are: it's
+
+541
+00:38:52,000 --> 00:38:56,570
+ a number between zero and a hundred, and a hundred is good
+
+542
+00:38:56,570 --> 00:38:59,940
+ and also — to give you kind of the intuition — it means you want
+
+543
+00:38:59,940 --> 00:39:04,079
+ your true positives to have high scores, right — and you also have
+
+544
+00:39:04,079 --> 00:39:08,230
+ to have the boxes you produce be within some
+
+545
+00:39:08,230 --> 00:39:12,090
+ threshold of the correct boxes — usually this threshold is
+
+546
+00:39:12,090 --> 00:39:15,420
+ 0.5 intersection over union — you'll see slightly different variants,
+
+547
+00:39:15,420 --> 00:39:19,740
+ but that's the gist of it — right, so now let's see how
+
+548
+00:39:19,739 --> 00:39:24,679
+ our R-CNN does on this dataset — so this was done on the last two
+
+549
+00:39:24,679 --> 00:39:27,779
+ versions of PASCAL VOC — as I said, it's small, so you see a lot of results
+
+550
+00:39:27,780 --> 00:39:32,730
+ on it — there are different versions you'll frequently see: 2007 and 2010 —
+
+551
+00:39:32,730 --> 00:39:35,990
+ people use 2007 because the test set is public, so it's easy
+
+552
+00:39:35,989 --> 00:39:37,169
+ to evaluate
+
+553
+00:39:37,170 --> 00:39:42,380
+ yeah — so this deformable parts model that we saw from 2011, from a couple
+
+554
+00:39:42,380 --> 00:39:48,579
+ of slides ago, is getting about twenty to thirty mean average precision —
+
+555
+00:39:48,579 --> 00:39:52,069
+ another method from 2013, called Regionlets, was kind of the state of
+
+556
+00:39:52,070 --> 00:39:55,280
+ the art right before deep learning, I'd say — but it's kind of
+
+557
+00:39:55,280 --> 00:39:58,130
+ a similar flavor: you have these features, with classifiers on top —
+
+558
+00:39:58,130 --> 00:40:02,840
+ and then R-CNN — this pretty simple thing we just saw — actually jumps
+
+559
+00:40:02,840 --> 00:40:06,789
+ the performance quite a lot — so the first thing we see is that
+
+560
+00:40:06,789 --> 00:40:10,509
+ we got a big improvement from switching to this very simple framework using CNNs
+
+561
+00:40:10,510 --> 00:40:15,160
+ and actually this result here is without bounding-box regression —
+
+562
+00:40:15,159 --> 00:40:19,029
+ it's only using the region proposals with the SVMs — and
+
+563
+00:40:19,030 --> 00:40:23,550
+ adding the box regression on top actually helps quite a bit more — another pretty fun
+
+564
+00:40:23,550 --> 00:40:26,820
+ thing to note is that if you take R-CNN and do everything the same
+
+565
+00:40:26,820 --> 00:40:31,080
+ except use VGG-16 instead of AlexNet, you get another pretty big boost in
+
+566
+00:40:31,079 --> 00:40:34,059
+ performance — so this is kind of similar to what we saw before, right:
+
+567
+00:40:34,059 --> 00:40:39,650
+ using stronger features tends to help a lot of different tasks — so
+
+568
+00:40:39,650 --> 00:40:42,840
+ this is really nice, right — we got a big improvement on
+
+569
+00:40:42,840 --> 00:40:47,829
+ detection compared to 2013 — amazing — but R-CNN is not perfect,
+
+570
+00:40:47,829 --> 00:40:53,150
+ right — it has some problems — it's pretty slow at test time — we saw that we
+
+571
+00:40:53,150 --> 00:40:57,110
+ have maybe two thousand regions, and evaluating our CNN for each of those regions
+
+572
+00:40:57,110 --> 00:41:02,910
+ is kind of slow — we have this slightly subtle problem where the SVMs and
+
+573
+00:41:02,909 --> 00:41:07,009
+ regressors were kind of trained offline, after the fact, using the cached features,
+
+574
+00:41:07,010 --> 00:41:10,930
+ and the weights of our CNN — those
+
+575
+00:41:10,929 --> 00:41:14,960
+ parts of the network — never really had a chance to update in response to
+
+576
+00:41:14,960 --> 00:41:19,039
+ those objectives that we wanted — and we also had this kind of complex,
+
+577
+00:41:19,039 --> 00:41:24,309
+ slightly messy training pipeline — so to fix these problems, a year
+
+578
+00:41:24,309 --> 00:41:29,690
+ later this thing called Fast R-CNN was presented — so Fast R-CNN is
+
+579
+00:41:29,690 --> 00:41:34,950
+ pretty recent — ICCV, right, last December — but the idea is really simple:
+
+580
+00:41:34,949 --> 00:41:39,819
+ we're just going to swap the order of extracting regions and running the CNN
+
+581
+00:41:39,820 --> 00:41:43,550
+ this is kind of related to the sliding-window idea we saw
+
+582
+00:41:43,550 --> 00:41:48,450
+ with Overfeat — so here, at test time, we have a kind of similar-looking pipeline:
+
+583
+00:41:48,449 --> 00:41:52,299
+ we take this input image — we're going to take the high-resolution input
+
+584
+00:41:52,300 --> 00:41:55,920
+ image and run it through the convolutional layers of our network, and
+
+585
+00:41:55,920 --> 00:42:00,150
+ now we get this high-resolution convolutional feature map — and for our
+
+586
+00:42:00,150 --> 00:42:03,940
+ region proposals, we're going to extract features for those region
+
+587
+00:42:03,940 --> 00:42:07,610
+ proposals directly from this convolutional feature map, using this thing called ROI
+
+588
+00:42:07,610 --> 00:42:10,530
+ pooling — region-of-interest pooling —
+
+589
+00:42:10,530 --> 00:42:14,269
+ and then the features for those regions — this convolutional feature for each region —
+
+590
+00:42:14,269 --> 00:42:17,829
+ get fed into fully connected layers that again do classification
+
+591
+00:42:17,829 --> 00:42:22,670
+ and regression, like we saw before — so this is really pretty
+
+592
+00:42:22,670 --> 00:42:26,930
+ cool — it solves many of the problems we saw with R-CNN —
+
+593
+00:42:26,929 --> 00:42:31,039
+ R-CNN is really slow at test time; we solve that by sharing this
+
+594
+00:42:31,039 --> 00:42:37,289
+ computation of convolutional features across the region proposals — R-
+
+595
+00:42:37,289 --> 00:42:40,519
+ CNN also had those problems at training time — we had this messy
+
+596
+00:42:40,519 --> 00:42:44,920
+ training pipeline, and we had this problem that we were training the different
+
+597
+00:42:44,920 --> 00:42:48,760
+ parts of the network separately — the solution is very simple: just,
+
+598
+00:42:48,760 --> 00:42:50,480
+ you know, train it all together at once —
+
+599
+00:42:50,480 --> 00:42:53,800
+ we don't have this complicated pipeline anymore — we can actually now
+
+600
+00:42:53,800 --> 00:42:58,140
+ treat it as this pretty nice function from inputs straight to outputs — so
+
+601
+00:42:58,139 --> 00:43:01,299
+ you can see that Fast R-CNN pretty much solves the
+
+602
+00:43:01,300 --> 00:43:06,340
+ problems that we saw with R-CNN — something really interesting about
+
+603
+00:43:06,340 --> 00:43:10,530
+ Fast R-CNN — the technical bit — was this region-of-interest
+
+604
+00:43:10,530 --> 00:43:15,519
+ pooling — so the idea is that we have this input image, maybe at high
+
+605
+00:43:15,519 --> 00:43:19,068
+ resolution, and we have these region proposals —
+
+606
+00:43:19,068 --> 00:43:23,969
+ from Selective Search or something — and we can put this
+
+607
+00:43:23,969 --> 00:43:27,199
+ high-resolution image through our convolutional and pooling layers just
+
+608
+00:43:27,199 --> 00:43:30,880
+ fine — because those layers are kind of scale-invariant — they
+
+609
+00:43:30,880 --> 00:43:34,318
+ can handle different input sizes — but now the problem is that the fully connected
+
+610
+00:43:34,318 --> 00:43:39,630
+ layers of our pre-trained network are expecting these very low-res conv
+
+611
+00:43:39,630 --> 00:43:46,068
+ features, whereas our features from the whole image are now high-resolution — so we solve
+
+612
+00:43:46,068 --> 00:43:50,038
+ this problem in a very simple way: given this region proposal, we
+
+613
+00:43:50,039 --> 00:43:53,930
+ project it onto that particular chunk of the conv feature
+
+614
+00:43:53,929 --> 00:43:59,368
+ volume — and now we're going to divide that chunk of the feature volume into a little grid, right —
+
+615
+00:43:59,369 --> 00:44:04,910
+ divide it into the h-by-w grid that the downstream layers are expecting — and
+
+616
+00:44:04,909 --> 00:44:09,798
+ now we max-pool within each cell of that grid — so now — so we've seen
+
+617
+00:44:09,798 --> 00:44:14,349
+ this very simple strategy: we've taken this region proposal, we've
+
+618
+00:44:14,349 --> 00:44:19,430
+ extracted the activations for that region from our shared
+
+619
+00:44:19,429 --> 00:44:23,629
+ convolutional features — this is basically just swapping the order of
+
+620
+00:44:23,630 --> 00:44:28,108
+ convolution and warping-and-cropping, if you think about it one way —
+
+621
+00:44:28,108 --> 00:44:31,538
+ and this thing is basically a pretty nice operation, because it's
+
+622
+00:44:31,539 --> 00:44:35,249
+ just max pooling — we know how to backprop through max pooling, so we can
+
+623
+00:44:35,248 --> 00:44:38,368
+ backprop through this region-of-interest pooling just
+
+624
+00:44:38,369 --> 00:44:42,269
+ fine — and that's what lets us train this whole thing in a joint
+
+625
+00:44:42,268 --> 00:44:46,758
+ way — so this is pretty, pretty cool — let's look at some results — right —
+
+626
+00:44:46,759 --> 00:44:50,858
+ at training time — so this is great — R-CNN had this complex pipeline
+
+627
+00:44:50,858 --> 00:44:54,098
+ where it saves all this stuff to disk and does all these steps independently, and
+
+628
+00:44:54,099 --> 00:44:57,789
+ even on the pretty small PASCAL dataset this took 84 hours to train,
+
+629
+00:44:57,789 --> 00:45:05,229
+ whereas with Fast R-CNN you can train much, much faster — and at test time,
+
+630
+00:45:05,228 --> 00:45:09,318
+ again, R-CNN is very slow, because we're running these independent forward
+
+631
+00:45:09,318 --> 00:45:14,469
+ passes of the CNN for each region proposal, whereas with Fast R-CNN we can
+
+632
+00:45:14,469 --> 00:45:17,979
+ kind of share the computation across the different region proposals, and we get this
+
+633
+00:45:17,978 --> 00:45:23,439
+ huge test-time speed-up — it's 146x — big, amazing — and
+
+634
+00:45:23,440 --> 00:45:26,690
+ performance-wise — I mean, it actually does do a little better —
+
+635
+00:45:26,690 --> 00:45:30,048
+ not a drastic difference in performance, but this can probably be attributed to
+
+636
+00:45:30,048 --> 00:45:32,130
+ this fine-tuning property —
+
+637
+00:45:32,130 --> 00:45:35,140
+ that with Fast R-CNN, actually all the parts of the convolutional
+
+638
+00:45:35,139 --> 00:45:38,969
+ network can jointly help with this final task — and that's probably why you see
+
+639
+00:45:38,969 --> 00:45:43,230
+ a bit of an increase — right, so this is great, right — what could possibly
+
+640
+00:45:43,230 --> 00:45:45,730
+ go wrong with Fast R-CNN? it looks amazing —
+
+641
+00:45:45,730 --> 00:45:51,699
+ well, the big problem is that these test-time speeds don't include region proposals,
+
+642
+00:45:51,699 --> 00:45:55,669
+ right — so now the bottleneck — Fast R-CNN is really good, but
+
+643
+00:45:55,670 --> 00:46:00,750
+ once you take into account the speed of computing the region proposals — which is pretty uncool —
+
+644
+00:46:00,750 --> 00:46:04,789
+ we actually compute these region proposals on the CPU — you can see that much of
+
+645
+00:46:04,789 --> 00:46:09,190
+ our speed advantage of Fast R-CNN is gone — right, 25x — we kind of lose
+
+646
+00:46:09,190 --> 00:46:15,030
+ that beautiful hundred-x speed-up — because it now takes about two seconds per image,
+
+647
+00:46:15,030 --> 00:46:18,560
+ which is actually pretty slow — so you can't really use this in real time — it's still kind of an
+
+648
+00:46:18,559 --> 00:46:23,750
+ offline processing thing, right — so the solution to this should be pretty
+
+649
+00:46:23,750 --> 00:46:27,340
+ obvious — we're already using a convolutional network
+
+650
+00:46:27,340 --> 00:46:32,620
+ for classification and for regression — why not use it for the proposals
+
+651
+00:46:32,619 --> 00:46:39,569
+ too? I mean, that may sound kind of crazy, but it works — so if someone wants to write
+
+652
+00:46:39,570 --> 00:46:46,570
+ that paper, what would the name be? — yes — I guess it's Faster R-CNN
+
+653
+00:46:46,570 --> 00:46:50,789
+ yes, they were really creative there — but the idea is very simple —
+
+654
+00:46:50,789 --> 00:46:55,460
+ right — so here, like in Fast R-CNN, you take our input image and
+634
+00:45:23,440 --> 00:45:26,690
+in terms of performance, I mean, it actually does a little bit better — it's not
+
+635
+00:45:26,690 --> 00:45:30,048
+a drastic difference in performance, but that can probably be attributed to
+
+636
+00:45:30,048 --> 00:45:32,130
+this fine-tuning property:
+
+637
+00:45:32,130 --> 00:45:35,140
+in Fast R-CNN you can actually fine-tune all parts of the convolutional
+
+638
+00:45:35,139 --> 00:45:38,969
+network jointly, which tends to help on a lot of tasks, and that's probably why you see
+
+639
+00:45:38,969 --> 00:45:43,230
+a bit of an increase right here; so this is great, right — what could possibly
+
+640
+00:45:43,230 --> 00:45:45,730
+go wrong — Fast R-CNN looks amazing;
+
+641
+00:45:45,730 --> 00:45:51,699
+the big problem is that these test-time speeds don't include the region proposals,
+
+642
+00:45:51,699 --> 00:45:55,669
+right, so Fast R-CNN is now really good, but actually the bottleneck is
+
+643
+00:45:55,670 --> 00:46:00,750
+computing the region proposals; once you take into account the speed of
+
+644
+00:46:00,750 --> 00:46:04,789
+computing these region proposals on the CPU, you can see that a lot of
+
+645
+00:46:04,789 --> 00:46:09,190
+our speed advantage goes away — it's just 25x now — we've sort of lost
+
+646
+00:46:09,190 --> 00:46:15,030
+that beautiful hundred-x speedup, and now it takes about two seconds per image,
+
+647
+00:46:15,030 --> 00:46:18,560
+which is actually still pretty slow, and you really can't use this in real time — it's still kind of
+
+648
+00:46:18,559 --> 00:46:23,750
+an offline processing thing; so the solution to this should be pretty
+
+649
+00:46:23,750 --> 00:46:27,340
+obvious to all of us: we're already using a convolutional network
+
+650
+00:46:27,340 --> 00:46:32,620
+for the classification and for the regression — why not use it for the proposals too?
+
+651
+00:46:32,619 --> 00:46:39,569
+that might sound kind of crazy, but it works; so if someone wanted to write that paper,
+
+652
+00:46:39,570 --> 00:46:46,570
+what would its name be — yes, you guessed it: Faster R-CNN
+
+653
+00:46:46,570 --> 00:46:50,789
+yeah, they were really creative there, but the idea is very simple:
+
+654
+00:46:50,789 --> 00:46:55,460
+just like in Fast R-CNN, you take our input image and
+
+655
+00:46:55,460 --> 00:46:59,630
+compute a big convolutional feature map over the entire input image,
+
+656
+00:46:59,630 --> 00:47:05,170
+and then instead of using some external method to compute region proposals, they
+
+657
+00:47:05,170 --> 00:47:09,010
+add this little thing called a region proposal network that looks directly
+
+658
+00:47:09,010 --> 00:47:13,060
+at these last convolutional features — it looks at that conv feature
+
+659
+00:47:13,059 --> 00:47:17,599
+map and produces the region proposals directly from those conv features, and
+
+660
+00:47:17,599 --> 00:47:21,190
+once you have these region proposals you just do the same thing as Fast R-CNN:
+
+661
+00:47:21,190 --> 00:47:25,880
+you use this ROI pooling, and the upstream stuff is the same as Fast R-CNN;
+
+662
+00:47:25,880 --> 00:47:31,130
+so the really new bit here is this region proposal network — it's
+
+663
+00:47:31,130 --> 00:47:34,180
+really cool, right, we're doing the whole thing with one giant conv net
+
+664
+00:47:34,179 --> 00:47:40,500
+at work; the way this region proposal network works is that we sort of
+
+665
+00:47:40,500 --> 00:47:43,880
+receive as input this feature map coming out of the conv
+
+666
+00:47:43,880 --> 00:47:47,820
+layers — our last convolutional feature layer — and we're going to add, like
+
+667
+00:47:47,820 --> 00:47:52,570
+most things these days, a little convolutional network on top of that, right,
+
+668
+00:47:52,570 --> 00:47:57,570
+so actually this is a typo — this should be three by three — so we have this
+
+669
+00:47:57,570 --> 00:48:01,809
+kind of sliding-window approach over our convolutional feature map, but
+
+670
+00:48:01,809 --> 00:48:06,820
+a sliding window is just convolution, so we just have a three-by-
+
+671
+00:48:06,820 --> 00:48:10,920
+three convolution on top of the feature map, and then we have this
+
+672
+00:48:10,920 --> 00:48:14,599
+familiar two-headed structure attached inside this region proposal
+
+673
+00:48:14,599 --> 00:48:19,670
+network: we have a classification head, where here we want to say whether
+
+674
+00:48:19,670 --> 00:48:25,430
+or not it is an object, and also a regression head that regresses from this sort of
+
+675
+00:48:25,429 --> 00:48:29,829
+position onto the actual region proposal; so the idea is that the
+
+676
+00:48:29,829 --> 00:48:33,909
+position of the sliding window relative to the feature map sort of tells us
+
+677
+00:48:33,909 --> 00:48:38,239
+where we are in the image, and then this regression sort of outputs
+
+678
+00:48:38,239 --> 00:48:43,619
+corrections on top of this position in the feature map; but actually
+
+679
+00:48:43,619 --> 00:48:46,940
+it's a little bit more complicated than that, so instead of regressing
+
+680
+00:48:46,940 --> 00:48:51,110
+directly off this position in the feature map with convolutions, they have
+
+681
+00:48:51,110 --> 00:48:55,280
+this notion of different anchor boxes, where you can imagine taking
+
+682
+00:48:55,280 --> 00:48:59,910
+anchor boxes of different sizes and shapes and sort of pasting them
+
+683
+00:48:59,909 --> 00:49:03,538
+onto the point in the image corresponding to this point in the
+
+684
+00:49:03,539 --> 00:49:08,020
+feature map — right, in Fast R-CNN we projected forward from the image into the
+
+685
+00:49:08,019 --> 00:49:11,519
+feature map; now we're doing the opposite, we're projecting
+
+686
+00:49:11,519 --> 00:49:17,288
+back from the feature map into the image for these boxes — and then for each of these
+
+687
+00:49:17,289 --> 00:49:21,640
+anchor boxes — they use these anchor boxes in a convolutional way, they use
+
+688
+00:49:21,639 --> 00:49:27,400
+the same anchor boxes at every position in the image — for each of these anchor boxes they
+
+689
+00:49:27,400 --> 00:49:32,119
+produce a score for whether or not that anchor box corresponds to an object,
+
+690
+00:49:32,119 --> 00:49:36,809
+and they also produce regression coordinates off the anchor
+
+691
+00:49:36,809 --> 00:49:41,880
+box, in a similar way to what we saw before; and now this region proposal network
+
+692
+00:49:41,880 --> 00:49:45,700
+can be trained to just predict sort of class-agnostic objectness at a high level
+
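The anchor-box idea just described — pasting the same small bank of boxes at every position of the feature map, mapped back into image coordinates — can be sketched as follows; the stride, scales, and aspect ratios below are assumed illustrative values, not necessarily the paper's exact settings.

~~~python
import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Centered (cx, cy, w, h) anchors for every feature-map position.

    stride maps a feature-map cell back to image pixels; with 3 scales and
    3 ratios this gives 9 anchors per position. Illustrative sketch only.
    """
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)  # same area, new shape
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)

print(make_anchors(2, 2).shape)  # (36, 4): 2 * 2 positions * 9 anchors
~~~

For each of these anchors the network then outputs one objectness score and four box-correction numbers, all computed convolutionally.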
+693
+00:49:45,699 --> 00:49:52,058
+so in the original paper, when they train this Faster R-CNN detector, they train this thing
+
+694
+00:49:52,059 --> 00:49:55,490
+in kind of a funny way: they first train the region proposal network, then they
+
+695
+00:49:55,489 --> 00:49:59,500
+train the Fast R-CNN part, and they do some magic to merge them together at the end,
+
+696
+00:49:59,500 --> 00:50:03,530
+so they end up with one network that produces everything; this is
+
+697
+00:50:03,530 --> 00:50:07,880
+a little messy, but that's how the original paper describes this thing; since then,
+
+698
+00:50:07,880 --> 00:50:10,470
+though, they've actually done some unpublished work that just changes the whole
+
+699
+00:50:10,469 --> 00:50:14,909
+thing to train jointly, so they have one big network that sort of takes an image
+
+700
+00:50:14,909 --> 00:50:19,679
+and has four losses coming out of it: inside the region proposal network you have
+
+701
+00:50:19,679 --> 00:50:23,538
+a classification loss, classifying whether or not each region proposal is
+
+702
+00:50:23,539 --> 00:50:27,670
+an object, and you have a regression of these bounding boxes inside the region proposal
+
+703
+00:50:27,670 --> 00:50:33,500
+network that operates on top of the anchors; then in the Fast R-CNN part we do
+
+704
+00:50:33,500 --> 00:50:37,190
+this ROI pooling, and at the end of this Fast R-CNN trunk of the
+
+705
+00:50:37,190 --> 00:50:41,200
+network we have this classification loss saying which class it is, and this
+
+706
+00:50:41,199 --> 00:50:47,659
+regression loss to regress corrections on top of these region proposals;
+
+707
+00:50:47,659 --> 00:50:53,170
+so this big thing is just one big network with four losses — yeah, question?
+
+708
+00:50:53,170 --> 00:51:04,019
+so the proposal scores and the regression coordinates are produced by a three-by-
+
+709
+00:51:04,019 --> 00:51:07,588
+three convolution followed by one-by-one convolutions, right off the feature
+
+710
+00:51:07,588 --> 00:51:12,358
+map, right; so the idea is that we're looking at different anchor boxes at
+
+711
+00:51:12,358 --> 00:51:16,400
+different positions and scales, but we're actually looking at the same
+
+712
+00:51:16,400 --> 00:51:20,139
+position of the feature map to classify those different anchor boxes — but
+
+713
+00:51:20,139 --> 00:51:26,179
+you have different learned weights for the different anchors; I
+
+714
+00:51:26,179 --> 00:51:29,969
+think it's mostly empirical, right — the idea is that with the three-by-three you
+
+715
+00:51:29,969 --> 00:51:33,429
+get a little bit of nonlinearity; you could imagine just doing the
+
+716
+00:51:33,429 --> 00:51:38,098
+one-by-one convolutions directly off the feature map, but — I think they
+
+717
+00:51:38,099 --> 00:51:40,990
+don't discuss this in the paper, I'm just guessing that the three-by-three
+
+718
+00:51:40,989 --> 00:51:44,669
+works a little bit better; but there's no really deep reason why you
+
+719
+00:51:44,670 --> 00:51:47,450
+couldn't have fewer or more, or a bigger conv — it could just be
+
+720
+00:51:47,449 --> 00:51:50,548
+this little conv net with two heads that you sort of run at each
+
+721
+00:51:50,548 --> 00:51:53,710
+point; and — question?
+
+722
+00:51:53,710 --> 00:52:18,380
+yeah, I see — because it
+
+723
+00:52:18,380 --> 00:52:22,140
+corresponds to the whole image —
+
+724
+00:52:22,139 --> 00:52:26,098
+the point is that we don't actually want to process the whole image;
+
+725
+00:52:26,099 --> 00:52:29,960
+we want to select some regions of the image to do processing on, but we have to select
+
+726
+00:52:29,960 --> 00:52:36,048
+those regions somehow
+
+727
+00:52:36,048 --> 00:52:42,188
+yes — that is, that's basically the idea behind using an external
+
+728
+00:52:42,188 --> 00:52:46,428
+region proposal method: when you do those external region proposals, you're
+
+729
+00:52:46,429 --> 00:52:50,929
+sort of picking first, before you do the convolutions, but it's just kind of
+
+730
+00:52:50,929 --> 00:52:54,858
+nicer if you can do everything all at once — so it's sort of the dream of
+
+731
+00:52:54,858 --> 00:52:58,748
+really generic processing; but in general you're processing the
+
+732
+00:52:58,748 --> 00:53:01,608
+image hoping that the convolutional features are good enough
+
+733
+00:53:01,608 --> 00:53:04,869
+for classification — that they capture the type of information you need —
+
+734
+00:53:04,869 --> 00:53:07,439
+and those conv features are probably good enough for classifying regions as well;
+
+735
+00:53:07,438 --> 00:53:11,958
+so it is in fact a computational saving at the end, because
+
+736
+00:53:11,958 --> 00:53:15,719
+you get to use the same convolutional feature map for both
+
+737
+00:53:15,719 --> 00:53:18,938
+the region proposals and the downstream classification
+
+738
+00:53:18,938 --> 00:53:23,389
+and downstream regression — that's in fact why you get the speedup here;
+
+739
+00:53:23,389 --> 00:53:29,788
+question — yeah, so we have this big network trained with four losses, and now we can do
+
+740
+00:53:29,789 --> 00:53:31,569
+sort of all of object detection in one go
+
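As a rough sketch, the four losses just listed combine into a single training objective; the numbers and equal weighting below are toy placeholders for illustration, not the paper's exact formulation.

~~~python
# Toy sketch of the joint objective: in the real system each term is
# computed from network outputs rather than being a fixed scalar.
losses = {
    "rpn_objectness": 0.40,  # is this anchor an object?
    "rpn_box_reg":    0.25,  # anchor -> proposal corrections
    "final_cls":      0.90,  # which class is the pooled region?
    "final_box_reg":  0.30,  # proposal -> final box corrections
}
total_loss = sum(losses.values())
print(total_loss)  # 1.85 — one scalar to back-propagate through everything
~~~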
+741
+00:53:31,568 --> 00:53:37,858
+which is really cool; so if we look at results comparing the different varieties of R-CNN
+
+742
+00:53:37,858 --> 00:53:43,630
+in terms of speed, the original R-CNN took about fifty seconds of test time per
+
+743
+00:53:43,630 --> 00:53:47,150
+image — that's counting computing the region proposals and running this
+
+744
+00:53:47,150 --> 00:53:52,439
+CNN forward pass separately for each region proposal — pretty slow; with Fast R-CNN we
+
+745
+00:53:52,438 --> 00:53:56,909
+saw that it became sort of bottlenecked by the region proposal time, but if we move
+
+746
+00:53:56,909 --> 00:54:01,768
+to Faster R-CNN, those region proposals are basically coming for free
+
+747
+00:54:01,768 --> 00:54:06,139
+because of the way we compute them — the region proposals are just this little three-by-
+
+748
+00:54:06,139 --> 00:54:09,199
+three convolution and a few one-by-one convolutions, so they're very
+
+749
+00:54:09,199 --> 00:54:13,229
+cheap to evaluate; so at test time Faster R-CNN runs in a fifth of
+
+750
+00:54:13,228 --> 00:54:23,849
+a second, actually on pretty high-resolution images — yeah?
+
+751
+00:54:23,849 --> 00:54:36,739
+well, I mean, one of the ideas behind zero padding is that you're not expecting
+
+752
+00:54:36,739 --> 00:54:40,699
+information from too far off the edges, so maybe you might think
+
+753
+00:54:40,699 --> 00:54:45,299
+it would be a problem — maybe more of a problem if you didn't do zero padding — but
+
+754
+00:54:45,300 --> 00:54:48,430
+as we sort of discussed before, it's true that adding that zero
+
+755
+00:54:48,429 --> 00:54:52,519
+padding can affect the statistics of these features, so maybe it could be a
+
+756
+00:54:52,519 --> 00:54:56,900
+bit of a problem, but in practice it actually seems to work fine;
+
+757
+00:54:56,900 --> 00:55:00,099
+yeah — as for an analysis of where the failure cases are, where
+
+758
+00:55:00,099 --> 00:55:02,949
+we get things wrong — that's a really important process when developing new
+
+759
+00:55:02,949 --> 00:55:08,419
+algorithms, and it can give you insight into what might work better
+
+760
+00:55:08,420 --> 00:55:26,940
+yeah — yeah
+
+761
+00:55:26,940 --> 00:55:35,858
+so maybe that could help, but it's actually kind of hard to do that, because
+
+762
+00:55:35,858 --> 00:55:40,108
+the experimental datasets are different, right — because when
+
+763
+00:55:40,108 --> 00:55:43,789
+you had a classification dataset like ImageNet that was one thing, but now
+
+764
+00:55:43,789 --> 00:55:47,259
+when you do the detection task it's a different dataset; and you
+
+765
+00:55:47,260 --> 00:55:51,000
+could imagine trying to classify the image based on which objects were detected
+
+766
+00:55:51,000 --> 00:55:54,500
+in it, but I haven't really seen a really good comparison trying to
+
+767
+00:55:54,500 --> 00:56:00,630
+study that clearly — but I mean, that would be an interesting experiment for a project
+
+768
+00:56:00,630 --> 00:56:18,088
+yeah, that's a very good question — so the question is about a problem with the way
+
+769
+00:56:18,088 --> 00:56:22,119
+ROI pooling works, right, because the pooling
+
+770
+00:56:22,119 --> 00:56:25,720
+divides that thing into a grid and you do max pooling once, so
+
+771
+00:56:25,719 --> 00:56:29,949
+rotation is actually kind of hard; there's a really cool paper
+
+772
+00:56:29,949 --> 00:56:33,159
+from DeepMind last summer called spatial transformer
+
+773
+00:56:33,159 --> 00:56:39,250
+networks that actually introduces a really cool way to attack this problem;
+
+774
+00:56:39,250 --> 00:56:42,239
+the idea is that instead of doing ROI pooling, we would do bilinear
+
+775
+00:56:42,239 --> 00:56:46,699
+interpolation, kind of like you use for textures in graphics, so once you
+
+776
+00:56:46,699 --> 00:56:50,009
+do it by bilinear interpolation you could actually maybe handle these crazy rotated
+
+777
+00:56:50,010 --> 00:56:53,609
+regions; so that's definitely something that people are thinking about, but
+
+778
+00:56:53,608 --> 00:56:56,848
+it hasn't been integrated into a full pipeline yet
+
+779
+00:56:56,849 --> 00:57:00,338
+yes
+
+780
+00:57:00,338 --> 00:57:11,728
+you could, but then you'd be back in this sort of slow R-CNN regime, right, and
+
+781
+00:57:11,728 --> 00:57:12,449
+look at
+
+782
+00:57:12,449 --> 00:57:16,828
+that — 250 times slower — you'd really pay a price; I mean, the other thing with
+
+783
+00:57:16,829 --> 00:57:20,690
+rotated objects — the practical concern is that we don't really have ground
+
+784
+00:57:20,690 --> 00:57:25,318
+truth data for that — most of these detection datasets only
+
+785
+00:57:25,318 --> 00:57:29,190
+have axis-aligned bounding boxes as the ground-truth information, so
+
+786
+00:57:29,190 --> 00:57:33,150
+it's hard — you don't have ground-truth positions for these kinds of rotated
+
+787
+00:57:33,150 --> 00:57:39,219
+regions of interest, so I think people haven't explored this too much in the end
+
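The bilinear interpolation mentioned above (as used in spatial transformer networks) can be sketched for a single sampling point; this is a minimal sketch with an assumed toy feature map, not an optimized or boundary-safe implementation.

~~~python
import numpy as np

def bilinear_sample(channel, x, y):
    """Sample one channel of a feature map at a real-valued (x, y) point."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    tl, tr = channel[y0, x0], channel[y0, x0 + 1]        # top-left / top-right
    bl, br = channel[y0 + 1, x0], channel[y0 + 1, x0 + 1]  # bottom row
    top = tl * (1 - dx) + tr * dx
    bot = bl * (1 - dx) + br * dx
    return top * (1 - dy) + bot * dy

fmap = np.arange(16.0).reshape(4, 4)
print(bilinear_sample(fmap, 1.5, 2.5))  # 11.5, the average of the 4 neighbors
~~~

Because the sampled value is a smooth function of the sampling coordinates, gradients can flow back into the coordinates themselves, which is what would make rotated or otherwise warped regions trainable.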
+788
+00:57:39,219 --> 00:57:43,009
+so we talked about R-CNN, Fast and Faster, which is super fast, right — and this
+
+789
+00:57:43,009 --> 00:57:49,798
+stuff is really fun and actually works at this point, you know;
+
+790
+00:57:49,798 --> 00:57:52,949
+you can actually understand the state of the art in object detection, because
+
+791
+00:57:52,949 --> 00:57:55,669
+this is one of the best object detectors in the world — it crushed
+
+792
+00:57:55,670 --> 00:58:00,479
+both the ImageNet challenge and the COCO challenge in December;
+
+793
+00:58:00,478 --> 00:58:06,710
+the other thing is that it's mostly these deep residual networks — the best object
+
+794
+00:58:06,710 --> 00:58:10,548
+detector in the world right now is a hundred-and-one-layer residual network plus Faster
+
+795
+00:58:10,548 --> 00:58:17,298
+R-CNN plus a few other goodies; so right here — we talked about
+
+796
+00:58:17,298 --> 00:58:23,670
+Faster R-CNN, we saw residual networks last lecture, and then — to get that extra
+
+797
+00:58:23,670 --> 00:58:26,389
+bit for the competitions, you have to add a few crazy things to get a little
+
+798
+00:58:26,389 --> 00:58:30,348
+improvement in performance; so here they actually do this box
+
+799
+00:58:30,349 --> 00:58:33,528
+refinement — multiple steps of refining the bounding boxes:
+
+800
+00:58:33,528 --> 00:58:38,818
+you saw that in the Fast R-CNN framework you regress these corrections, and
+
+801
+00:58:38,818 --> 00:58:41,929
+you can actually feed those back into the network as region proposals
+
+802
+00:58:41,929 --> 00:58:46,298
+and re-classify them, so you get another round of this box refinement —
+
+803
+00:58:46,298 --> 00:58:50,929
+that step gives you a little boost; they also add context, so
+
+804
+00:58:50,929 --> 00:58:55,710
+in addition to just classifying the region, they give the network a
+
+805
+00:58:55,710 --> 00:59:00,309
+whole feature for the entire image, which sort of gives you more context than
+
+806
+00:59:00,309 --> 00:59:03,999
+just the little crop — that also gives you a little bit more; they also
+
+807
+00:59:03,998 --> 00:59:08,179
+do multi-scale testing, like we saw in OverFeat, so they actually run the
+
+808
+00:59:08,179 --> 00:59:10,730
+thing at test time on images of different sizes
+
+809
+00:59:10,730 --> 00:59:13,949
+and aggregate across those different sizes; and when you put all those things
+
+810
+00:59:13,949 --> 00:59:21,129
+together, it actually won a lot of the competitions, so this one was on COCO too —
+
+811
+00:59:21,130 --> 00:59:24,960
+Microsoft COCO actually runs a detection challenge — and speaking of detection,
+
+812
+00:59:24,960 --> 00:59:29,199
+we can also see rapid progress on the ImageNet
+
+813
+00:59:29,199 --> 00:59:32,909
+detection challenge over the last few years; so you can see that 2013
+
+814
+00:59:32,909 --> 00:59:38,949
+was sort of the first time we had these deep learning detection models —
+
+815
+00:59:38,949 --> 00:59:43,789
+OverFeat, which we saw for localization — they actually submitted a version of their
+
+816
+00:59:43,789 --> 00:59:47,949
+system that works for detection as well, by sort of changing the logic for
+
+817
+00:59:47,949 --> 00:59:51,849
+merging these bounding boxes; they were pretty good, but they actually
+
+818
+00:59:51,849 --> 00:59:57,319
+were sort of outperformed by this other group called UvA vision,
+
+819
+00:59:57,320 --> 01:00:02,289
+which used no deep learning at all, but a lot of hand-engineered features; in 2014
+
+820
+01:00:02,289 --> 01:00:05,840
+we actually saw that the winners were deep learning methods, and Google
+
+821
+01:00:05,840 --> 01:00:09,740
+won using GoogLeNet plus other detection material on top;
+
+822
+01:00:09,739 --> 01:00:15,029
+then in 2015 things went crazy — these residual networks plus
+
+823
+01:00:15,030 --> 01:00:19,410
+Faster R-CNN crushed everything — so I think that work in particular made
+
+824
+01:00:19,409 --> 01:00:22,409
+the last couple of years really exciting, because we've seen
+
+825
+01:00:22,409 --> 01:00:25,429
+really fast progress on detection over the last few years,
+
+826
+01:00:25,429 --> 01:00:29,129
+among other things; another point that I think is kind of fun to make is
+
+827
+01:00:29,130 --> 01:00:33,800
+that — you know, Andrej told you that for winning all the competitions
+
+828
+01:00:33,800 --> 01:00:37,830
+you always win with an ensemble, and you get 2% from the ensemble — but
+
+829
+01:00:37,829 --> 01:00:42,829
+it's actually kind of funny: Microsoft also submitted the best single residual
+
+830
+01:00:42,829 --> 01:00:47,440
+model, which was not an ensemble, and that single residual model actually beat everything
+
+831
+01:00:47,440 --> 01:00:52,400
+else from all the other years, which is a really cool fact —
+
+832
+01:00:52,400 --> 01:00:58,130
+it's still the best performer besides their own ensemble, so that's kind of a fun thing;
+
+833
+01:00:58,130 --> 01:01:03,240
+so — this is really it — we talked about this idea of localization as
+
+834
+01:01:03,239 --> 01:01:08,439
+regression, and there's a fun thing called YOLO — You Only Look Once — that actually tries
+
+835
+01:01:08,440 --> 01:01:13,519
+this idea of attacking the detection problem directly as a regression problem; so
+
+836
+01:01:13,519 --> 01:01:18,389
+we actually take our input image, and we're going to divide it into
+
+837
+01:01:18,389 --> 01:01:22,190
+some spatial grid — they used seven by seven — and then within
+
+838
+01:01:22,190 --> 01:01:26,480
+each element of that spatial grid we're going to make some fixed number of bounding box
+
+839
+01:01:26,480 --> 01:01:31,039
+predictions — I think they used B equals two for most of the experiments — so then
+
+840
+01:01:31,039 --> 01:01:36,489
+within each grid cell you're going to predict bounding boxes — four
+
+841
+01:01:36,489 --> 01:01:41,229
+numbers giving us the box, plus one score for how much you believe in
+
+842
+01:01:41,230 --> 01:01:44,969
+that bounding box; you're also going to predict classification scores for each
+
+843
+01:01:44,969 --> 01:01:49,659
+class; so then you can sort of take this detection problem and
+
+844
+01:01:49,659 --> 01:01:53,969
+turn it into regression, where your input is the image and your output ends up being maybe this
+
+845
+01:01:53,969 --> 01:01:59,529
+seven by seven by five-B-plus-C tensor, and now it's just a regression problem;
+
+846
+01:01:59,530 --> 01:02:04,820
+looking at the results, they just tried it, which is pretty cool, and it's sort of a new approach
+
+847
+01:02:04,820 --> 01:02:07,900
+that's a bit different from these region proposal things we saw before;
+
+848
+01:02:07,900 --> 01:02:12,300
+of course this kind of thing will be upper-bounded by the number of
+
+849
+01:02:12,300 --> 01:02:15,930
+outputs the model can make, so it could be a problem if
+
+850
+01:02:15,929 --> 01:02:20,279
+the data you're testing on has many more ground-truth boxes than the training data
+
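The output tensor just described can be written down directly as shapes; B = 2 matches the value mentioned above, and C = 20 assumes the PASCAL VOC class count used in these comparisons.

~~~python
# The YOLO output as shapes: an S x S grid, B boxes per cell
# (4 coordinates + 1 confidence each), plus C class scores per cell.
S, B, C = 7, 2, 20
output_dim = S * S * (5 * B + C)
print(output_dim)  # 7 * 7 * 30 = 1470 numbers regressed per image
~~~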
+851
+01:02:20,280 --> 01:02:27,180
+so this YOLO detector is actually fast — really, really fast —
+
+852
+01:02:27,179 --> 01:02:32,460
+even faster than Faster R-CNN, which is pretty crazy, but unfortunately it tends to work
+
+853
+01:02:32,460 --> 01:02:36,769
+a little bit worse; there's another thing called Fast YOLO that I don't
+
+854
+01:02:36,769 --> 01:02:39,460
+want to talk about too much, but
+
+855
+01:02:39,460 --> 01:02:45,170
+it's even faster, right; these are our mean AP numbers on the
+
+856
+01:02:45,170 --> 01:02:49,619
+PASCAL dataset that we saw, and you can see YOLO actually gets 64, which is pretty
+
+857
+01:02:49,619 --> 01:02:53,329
+good, and this apparently runs at forty-five frames per second on a
+
+858
+01:02:53,329 --> 01:02:58,840
+powerful GPU — but still, that's nearly real time, which is amazing;
+
+859
+01:02:58,840 --> 01:03:03,960
+there are also other versions now that I don't want to talk about, but
+
+860
+01:03:03,960 --> 01:03:09,309
+with Fast and Faster R-CNN you can see that they actually beat it on almost everything
+
+861
+01:03:09,309 --> 01:03:14,119
+in terms of performance, though they're quite a bit slower — yeah, that's true;
+
+862
+01:03:14,119 --> 01:03:20,119
+it's a kind of neat twist on the detection problem; actually, all of
+
+863
+01:03:20,119 --> 01:03:22,779
+these different detection methods — all these different detection models that we
+
+864
+01:03:22,780 --> 01:03:26,780
+talked about today — pretty much all of them have code released, so you should
+
+865
+01:03:26,780 --> 01:03:30,800
+consider using them for projects; probably don't use R-CNN — maybe it's too slow;
+
+866
+01:03:30,800 --> 01:03:36,090
+Fast R-CNN looks pretty good, but Fast R-CNN requires MATLAB; Faster
+
+867
+01:03:36,090 --> 01:03:39,720
+R-CNN — there's actually a version that doesn't require MATLAB, it's a pipeline in
+
+868
+01:03:39,719 --> 01:03:44,379
+Caffe — I haven't used it personally, but it's something you could try
+
+869
+01:03:44,380 --> 01:03:48,070
+to use for projects — I'm not sure how hard it is to get running — and
+
+870
+01:03:48,070 --> 01:03:52,050
+YOLO — I actually think it's probably a good choice for some projects because
+
+871
+01:03:52,050 --> 01:03:55,810
+it's fast, so it could make things easier to work with if you don't have a really big
+
+872
+01:03:55,809 --> 01:03:59,860
+and powerful GPU — and its code is also released;
+
+873
+01:03:59,860 --> 01:04:03,480
+yeah, it's true that I got through this stuff a little faster than expected, so —
+
+874
+01:04:03,480 --> 01:04:10,559
+any questions on detection?
+
+875
+01:04:10,559 --> 01:04:15,880
+yes
+
+876
+01:04:15,880 --> 01:04:22,630
+yeah — in terms of the model, like the model size, it's about the same as a
+
+877
+01:04:22,630 --> 01:04:26,039
+classification model, because when you run on bigger images —
+
+878
+01:04:26,039 --> 01:04:29,109
+especially with Fast R-CNN, right — because of the convolutional
+
+879
+01:04:29,108 --> 01:04:32,558
+layers, you don't really introduce any more parameters;
+
+880
+01:04:32,559 --> 01:04:35,829
+you have a few extra parameters for the region proposals, but
+
+881
+01:04:35,829 --> 01:04:38,798
+the network has basically the same number of parameters as a classification
+
+882
+01:04:38,798 --> 01:04:45,619
+model; right — I guess we're done a little early today
+

From a600a5ccf6c5d09f87e72c9e73595ddc8b42c980 Mon Sep 17 00:00:00 2001
From: JK Im
Date: Thu, 7 Apr 2016 12:08:00 -0500
Subject: [PATCH 015/199] Update optimization-1.md

Translated the introduction part.
---
 optimization-1.md | 36 ++++++++++++++++++------------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/optimization-1.md b/optimization-1.md
index 6960c141..b3560dec 100644
--- a/optimization-1.md
+++ b/optimization-1.md
@@ -5,35 +5,35 @@ permalink: /optimization-1/

 Table of Contents:

-- [Introduction](#intro)
-- [Visualizing the loss function](#vis)
-- [Optimization](#optimization)
-  - [Strategy #1: Random Search](#opt1)
-  - [Strategy #2: Random Local Search](#opt2)
-  - [Strategy #3: Following the gradient](#opt3)
-- [Computing the gradient](#gradcompute)
-  - [Numerically with finite differences](#numerical)
-  - [Analytically with calculus](#analytic)
-- [Gradient descent](#gd)
-- [Summary](#summary)
+- [소개](#intro)
+- [손실함수(Loss Function)의 시각화(Visualization)](#vis)
+- [최적화(Optimization)](#optimization)
+  - [전략 #1: 무작위 탐색 (Random Search)](#opt1)
+  - [전략 #2: 무작위 국소 탐색 (Random Local Search)](#opt2)
+  - [전략 #3: 그라디언트(gradient) 따라가기](#opt3)
+- [그라디언트(Gradient) 계산](#gradcompute)
+  - [Finite Differences를 이용한 수치적인 방법](#numerical)
+  - [미분을 이용한 해석적인 방법](#analytic)
+- [그라디언트(Gradient) 하강(Descent)](#gd)
+- [요약](#summary)

-### Introduction
+### 소개

-In the previous section we introduced two key components in context of the image classification task:
+이전 섹션에서 이미지 분류(image classification)를 할 때에 있어 두 가지의 핵심요소를 소개했습니다.

-1. A (parameterized) **score function** mapping the raw image pixels to class scores (e.g. a linear function)
-2. A **loss function** that measured the quality of a particular set of parameters based on how well the induced scores agreed with the ground truth labels in the training data. We saw that there are many ways and versions of this (e.g. Softmax/SVM).
+1. 원 이미지의 픽셀들을 넣으면 분류 스코어(class score)를 계산해주는 모수화된(parameterized) **스코어 함수(score function)** (예를 들어, 선형 함수).
+2. 학습(training) 데이타에 어떤 특정 모수(parameter)들을 가지고 스코어 함수(score function)를 적용시켰을 때, 실제 class와 얼마나 잘 일치하는지에 따라 그 특정 모수(parameter)들의 질을 측정하는 **손실 함수(loss function)**. 여러 종류의 손실함수(예를 들어, Softmax/SVM)가 있다.

-Concretely, recall that the linear function had the form $ f(x_i, W) = W x_i $ and the SVM we developed was formulated as:
+구체적으로 말하자면, 다음과 같은 형식을 가진 선형함수 $f(x_i, W) = W x_i $를 스코어 함수(score function)로 쓸 때, 지난 번에 다룬 바와 같이 SVM은 다음과 같은 수식으로 표현할 수 있다:

$$
L = \frac{1}{N} \sum_i \sum_{j\neq y_i} \left[ \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + 1) \right] + \alpha R(W)
$$

-We saw that a setting of the parameters $W$ that produced predictions for examples $x_i$ consistent with their ground truth labels $y_i$ would also have a very low loss $L$. We are now going to introduce the third and last key component: **optimization**. Optimization is the process of finding the set of parameters $W$ that minimize the loss function.
+예시 $x_i$에 대한 예측값이 실제 값(레이블, labels) $y_i$과 같도록 설정된 모수(paramter) $W$는 손실(loss)값 $L$ 또한 매우 낮게 나온다는 것을 알아보았다. 이제 세번째이자 마지막 핵심요소인 **최적화(optimization)**에 대해서 알아보자. 최적화(optimization)는 손실함수(loss function)를 최소화시키는 모수(parameter, $W$)들을 찾는 과정을 뜻한다.

-**Foreshadowing:** Once we understand how these three core components interact, we will revisit the first component (the parameterized function mapping) and extend it to functions much more complicated than a linear mapping: First entire Neural Networks, and then Convolutional Neural Networks. The loss functions and the optimization process will remain relatively unchanged.
+**예고:** 이 세 가지 핵심요소가 어떻게 상호작용하는지 이해한 후에는, 첫번째 요소(모수화된 함수)로 다시 돌아가서 선형함수보다 더 복잡한 형태로 확장시켜볼 것이다. 처음엔 신경망(Neural Networks), 다음엔 컨볼루션 신경망(Convolutional Neural Networks). 
손실함수(loss function)와 최적화(optimization) 과정은 거의 변화가 없을 것이다.. ### Visualizing the loss function From 344b0d84ad69ee6528b5f2f0876f1918cb4f4b5e Mon Sep 17 00:00:00 2001 From: JK Im Date: Thu, 7 Apr 2016 12:10:35 -0500 Subject: [PATCH 016/199] Update optimization-1.md --- optimization-1.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/optimization-1.md b/optimization-1.md index b3560dec..6083854d 100644 --- a/optimization-1.md +++ b/optimization-1.md @@ -31,7 +31,7 @@ $$ L = \frac{1}{N} \sum_i \sum_{j\neq y_i} \left[ \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + 1) \right] + \alpha R(W) $$ -예시 $x_i$에 대한 예측값이 실제 값(레이블, labels) $y_i$과 같도록 설정된 모수(paramter) $W$는 손실(loss)값 $L$ 또한 매우 낮게 나온다는 것을 알아보았다. 이제 세번째이자 마지막 핵심요소인 **최적화(optimization)**에 대해서 알아보자. 최적화(optimization)는 손실함수(loss function)을 최소화시카는 모수(parameter, $W$)들을 찾는 과정을 뜻한다. +예시 $x_i$에 대한 예측값이 실제 값(레이블, labels) $y_i$과 같도록 설정된 모수(parameter) $W$는 손실(loss)값 $L$ 또한 매우 낮게 나온다는 것을 알아보았다. 이제 세번째이자 마지막 핵심요소인 **최적화(optimization)**에 대해서 알아보자. 최적화(optimization)는 손실함수(loss function)을 최소화시카는 모수(parameter, $W$)들을 찾는 과정을 뜻한다. **예고:** 이 세 가지 핵심요소가 어떻게 상호작용하는지 이해한 후에는, 첫번째 요소(모수화된 함수)로 다시 돌아가서 선형함수보다 더 복잡한 형태로 확장시켜볼 것이다. 처음엔 신경망(Neural Networks), 다음엔 컨볼루션 신경망(Convolutional Neural Networks). 손실함수(loss function)와 최적화(optimization) 과정은 거의 변화가 없을 것이다.. From 7b92d0a781183e5c1ff20eb2adce5f5c519a197f Mon Sep 17 00:00:00 2001 From: myungsub Date: Fri, 8 Apr 2016 08:15:11 +0900 Subject: [PATCH 017/199] fix content headers --- optimization-1.md | 24 +++++++++++++++++------- 1 file changed, 17 insertions(+), 7 deletions(-) diff --git a/optimization-1.md b/optimization-1.md index 6083854d..170b0b11 100644 --- a/optimization-1.md +++ b/optimization-1.md @@ -18,6 +18,7 @@ Table of Contents: - [요약](#summary) + ### 소개 이전 섹션에서 이미지 분류(image classification)을 할 때에 있어 두 가지의 핵심요쇼를 소개했습니다. @@ -36,6 +37,7 @@ $$ **예고:** 이 세 가지 핵심요소가 어떻게 상호작용하는지 이해한 후에는, 첫번째 요소(모수화된 함수)로 다시 돌아가서 선형함수보다 더 복잡한 형태로 확장시켜볼 것이다. 처음엔 신경망(Neural Networks), 다음엔 컨볼루션 신경망(Convolutional Neural Networks). 손실함수(loss function)와 최적화(optimization) 과정은 거의 변화가 없을 것이다.. + ### Visualizing the loss function The loss functions we'll look at in this class are usually defined over very high-dimensional spaces (e.g. in CIFAR-10 a linear classifier weight matrix is of size [10 x 3073] for a total of 30,730 parameters), making them difficult to visualize. However, we can still gain some intuitions about one by slicing through the high-dimensional space along rays (1 dimension), or along planes (2 dimensions). For example, we can generate a random weight matrix $W$ (which corresponds to a single point in the space), then march along a ray and record the loss function value along the way. That is, we can generate a random direction $W_1$ and compute the loss along this direction by evaluating $L(W + a W_1)$ for different values of $a$. This process generates a simple plot with the value of $a$ as the x-axis and the value of the loss function as the y-axis. We can also carry out the same procedure with two dimensions by evaluating the loss $ L(W + a W_1 + b W_2) $ as we vary $a, b$. In a plot, $a, b$ could then correspond to the x-axis and the y-axis, and the value of the loss function can be visualized with a color: @@ -80,11 +82,13 @@ As an aside, you may have guessed from its bowl-shaped appearance that the SVM c *Non-differentiable loss functions*. 
As a technical note, you can also see that the *kinks* in the loss function (due to the max operation) technically make the loss function non-differentiable because at these kinks the gradient is not defined. However, the [subgradient](http://en.wikipedia.org/wiki/Subderivative) still exists and is commonly used instead. In this class will use the terms *subgradient* and *gradient* interchangeably. + ### Optimization To reiterate, the loss function lets us quantify the quality of any particular set of weights **W**. The goal of optimization is to find **W** that minimizes the loss function. We will now motivate and slowly develop an approach to optimizing the loss function. For those of you coming to this class with previous experience, this section might seem odd since the working example we'll use (the SVM loss) is a convex problem, but keep in mind that our goal is to eventually optimize Neural Networks where we can't easily use any of the tools developed in the Convex Optimization literature. + #### Strategy #1: A first very bad idea solution: Random search Since it is so simple to check how good a given set of parameters **W** is, the first (very bad) idea that may come to mind is to simply try out many different random weights and keep track of what works best. This procedure might look as follows: @@ -135,6 +139,7 @@ With the best **W** this gives an accuracy of about **15.5%**. Given that guessi **Blindfolded hiker analogy.** One analogy that you may find helpful going forward is to think of yourself as hiking on a hilly terrain with a blindfold on, and trying to reach the bottom. In the example of CIFAR-10, the hills are 30,730-dimensional, since the dimensions of **W** are 3073 x 10. At every point on the hill we achieve a particular loss (the height of the terrain). + #### Strategy #2: Random Local Search The first strategy you may think of is to to try to extend one foot in a random direction and then take a step only if it leads downhill. Concretely, we will start out with a random $W$, generate random perturbations $ \delta W $ to it and if the loss at the perturbed $W + \delta W$ is lower, we will perform an update. The code for this procedure is as follows: @@ -155,6 +160,7 @@ for i in xrange(1000): Using the same number of loss function evaluations as before (1000), this approach achieves test set classification accuracy of **21.4%**. This is better, but still wasteful and computationally expensive. + #### Strategy #3: Following the Gradient In the previous section we tried to find a direction in the weight-space that would improve our weight vector (and give us a lower loss). It turns out that there is no need to randomly search for a good direction: we can compute the *best* direction along which we should change our weight vector that is mathematically guaranteed to be the direction of the steepest descend (at least in the limit as the step size goes towards zero). This direction will be related to the **gradient** of the loss function. In our hiking analogy, this approach roughly corresponds to feeling the slope of the hill below our feet and stepping down the direction that feels steepest. @@ -168,22 +174,24 @@ $$ When the functions of interest take a vector of numbers instead of a single number, we call the derivatives **partial derivatives**, and the gradient is simply the vector of partial derivatives in each dimension. 
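Before moving on, here is a quick numeric illustration of the limit definition above; the function $f(x) = x^2$ (whose true derivative at $x = 3$ is $6$) and the step size are toy choices assumed for this sketch only.

~~~python
# Numeric check of the derivative definition: [f(x+h) - f(x)] / h for small h.
def f(x):
    return x ** 2

x, h = 3.0, 1e-5
print((f(x + h) - f(x)) / h)             # ~6.00001, forward difference
print((f(x + h) - f(x - h)) / (2 * h))   # ~6.0, the centered variant is more accurate
~~~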
+ ### Computing the gradient There are two ways to compute the gradient: A slow, approximate but easy way (**numerical gradient**), and a fast, exact but more error-prone way that requires calculus (**analytic gradient**). We will now present both. + #### Computing the gradient numerically with finite differences The formula given above allows us to compute the gradient numerically. Here is a generic function that takes a function `f`, a vector `x` to evaluate the gradient on, and returns the gradient of `f` at `x`: ~~~python def eval_numerical_gradient(f, x): - """ - a naive implementation of numerical gradient of f at x + """ + a naive implementation of numerical gradient of f at x - f should be a function that takes a single argument - x is the point (numpy array) to evaluate the gradient at - """ + """ fx = f(x) # evaluate function value at original point grad = np.zeros(x.shape) @@ -207,7 +215,7 @@ def eval_numerical_gradient(f, x): return grad ~~~ -Following the gradient formula we gave above, the code above iterates over all dimensions one by one, makes a small change `h` along that dimension and calculates the partial derivative of the loss function along that dimension by seeing how much the function changed. The variable `grad` holds the full gradient in the end. +Following the gradient formula we gave above, the code above iterates over all dimensions one by one, makes a small change `h` along that dimension and calculates the partial derivative of the loss function along that dimension by seeing how much the function changed. The variable `grad` holds the full gradient in the end. **Practical considerations**. Note that in the mathematical formulation the gradient is defined in the limit as **h** goes towards zero, but in practice it is often sufficient to use a very small value (such as 1e-5 as seen in the example). Ideally, you want to use the smallest step size that does not lead to numerical issues. Additionally, in practice it often works better to compute the numeric gradient using the **centered difference formula**: $ [f(x+h) - f(x-h)] / 2 h $ . See [wiki](http://en.wikipedia.org/wiki/Numerical_differentiation) for details. @@ -266,6 +274,7 @@ for step_size_log in [-10, -9, -8, -7, -6, -5,-4,-3,-2,-1]: **A problem of efficiency**. You may have noticed that evaluating the numerical gradient has complexity linear in the number of parameters. In our example we had 30730 parameters in total and therefore had to perform 30,731 evaluations of the loss function to evaluate the gradient and to perform only a single parameter update. This problem only gets worse, since modern Neural Networks can easily have tens of millions of parameters. Clearly, this strategy is not scalable and we need something better. + #### Computing the gradient analytically with Calculus The numerical gradient is very simple to compute using the finite difference approximation, but the downside is that it is approximate (since we have to pick a small value of *h*, while the true gradient is defined as the limit as *h* goes to zero), and that it is very computationally expensive to compute. The second way to compute the gradient is analytically using Calculus, which allows us to derive a direct formula for the gradient (no approximations) that is also very fast to compute. 
However, unlike the numerical gradient it can be more error prone to implement, which is why in practice it is very common to compute the analytic gradient and compare it to the numerical gradient to check the correctnes of your implementation. This is called a **gradient check**. @@ -288,9 +297,10 @@ $$ \nabla_{w_j} L_i = \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) x_i $$ -Once you derive the expression for the gradient it is straight-forward to implement the expressions and use them to perform the gradient update. +Once you derive the expression for the gradient it is straight-forward to implement the expressions and use them to perform the gradient update. + ### Gradient Descent Now that we can compute the gradient of the loss function, the procedure of repeatedly evaluating the gradient and then performing a parameter update is called *Gradient Descent*. Its **vanilla** version looks as follows: @@ -321,6 +331,7 @@ The reason this works well is that the examples in the training data are correla The extreme case of this is a setting where the mini-batch contains only a single example. This process is called **Stochastic Gradient Descent (SGD)** (or also sometimes **on-line** gradient descent). This is relatively less common to see because in practice due to vectorized code optimizations it can be computationally much more efficient to evaluate the gradient for 100 examples, than the gradient for one example 100 times. Even though SGD technically refers to using a single example at a time to evaluate the gradient, you will hear people use the term SGD even when referring to mini-batch gradient descent (i.e. mentions of MGD for "Minibatch Gradient Descent", or BGD for "Batch gradient descent" are rare to see), where it is usually assumed that mini-batches are used. The size of the mini-batch is a hyperparameter but it is not very common to cross-validate it. It is usually based on memory constraints (if any), or set to some value, e.g. 32, 64 or 128. We use powers of 2 in practice because many vectorized operation implementations work faster when their inputs are sized in powers of 2. + ### Summary
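As a side note, the gradient check described above can itself be sketched in a few lines; the toy objective and its analytic gradient below are illustrative assumptions, not the SVM loss, but the comparison logic is the same.

~~~python
import numpy as np

def f(w):                  # toy objective: sum of squares
    return np.sum(w ** 2)

def analytic_grad(w):      # its known analytic gradient
    return 2 * w

w = np.random.randn(5)
num_grad = np.zeros_like(w)
h = 1e-5
for i in range(w.size):
    old = w[i]
    w[i] = old + h; fp = f(w)   # f(w + h e_i)
    w[i] = old - h; fm = f(w)   # f(w - h e_i)
    w[i] = old                  # restore
    num_grad[i] = (fp - fm) / (2 * h)   # centered difference

rel_error = np.abs(num_grad - analytic_grad(w)) / (
    np.abs(num_grad) + np.abs(analytic_grad(w)) + 1e-8)
print(rel_error.max())  # should be tiny (~1e-10 or smaller) when both agree
~~~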
@@ -335,11 +346,10 @@ In this section, - We developed the intuition of the loss function as a **high-dimensional optimization landscape** in which we are trying to reach the bottom. The working analogy we developed was that of a blindfolded hiker who wishes to reach the bottom. In particular, we saw that the SVM cost function is piece-wise linear and bowl-shaped. - We motivated the idea of optimizing the loss function with -**iterative refinement**, where we start with a random set of weights and refine them step by step until the loss is minimized. +**iterative refinement**, where we start with a random set of weights and refine them step by step until the loss is minimized. - We saw that the **gradient** of a function gives the steepest ascent direction and we discussed a simple but inefficient way of computing it numerically using the finite difference approximation (the finite difference being the value of *h* used in computing the numerical gradient). - We saw that the parameter update requires a tricky setting of the **step size** (or the **learning rate**) that must be set just right: if it is too low the progress is steady but slow. If it is too high the progress can be faster, but more risky. We will explore this tradeoff in much more detail in future sections. - We discussed the tradeoffs between computing the **numerical** and **analytic** gradient. The numerical gradient is simple but it is approximate and expensive to compute. The analytic gradient is exact, fast to compute but more error-prone since it requires the derivation of the gradient with math. Hence, in practice we always use the analytic gradient and then perform a **gradient check**, in which its implementation is compared to the numerical gradient. - We introduced the **Gradient Descent** algorithm which iteratively computes the gradient and performs a parameter update in loop. **Coming up:** The core takeaway from this section is that the ability to compute the gradient of a loss function with respect to its weights (and have some intuitive understanding of it) is the most important skill needed to design, train and understand neural networks. In the next section we will develop proficiency in computing the gradient analytically using the chain rule, otherwise also refered to as **backpropagation**. This will allow us to efficiently optimize relatively arbitrary loss functions that express all kinds of Neural Networks, including Convolutional Neural Networks. - From e7129b53bdd5f0723f6589fd692d351c91406512 Mon Sep 17 00:00:00 2001 From: YB Date: Thu, 7 Apr 2016 23:09:00 -0400 Subject: [PATCH 018/199] Lecture1 - part 1~50 (out of 715) en / ko --- captions/En/Lecture1_en.srt | 182 ++++++++++++++++++------------------ captions/Ko/Lecture1_ko.srt | 118 +++++++++++------------ 2 files changed, 150 insertions(+), 150 deletions(-) diff --git a/captions/En/Lecture1_en.srt b/captions/En/Lecture1_en.srt index 0802c5d6..861e1d32 100644 --- a/captions/En/Lecture1_en.srt +++ b/captions/En/Lecture1_en.srt @@ -1,199 +1,198 @@ -1 +1 00:00:00,000 --> 00:00:03,899 -there's more seats on the side +There's more seats on the side. 2 00:00:03,899 --> 00:00:19,868 -people are walking in late so just to -make sure you're in cs2 31 and deep +people are walking in late. +So, just to make sure you're in cs231n 3 00:00:19,868 --> 00:00:23,969 -learning on your network class for -visual recognition any money in the +Deep Learning Neural network class for +visual recognition. 
4 00:00:23,969 --> 00:00:33,549 -wrong class good I so welcome and happy -new year happy first day of winter break +Anybody in the wrong class? OK, good. +Alright. So, welcome and happy new year, happy first day of the winter break. 5 00:00:33,549 --> 00:00:41,069 -so the classiest 231 and this is the -second offering of this class when we +So, this class CS231n. +This is the second offering of this class 6 00:00:41,070 --> 00:00:48,738 -have literally doubled our enrollment -from 480 people last time we offered to +when we have literally doubled our enrollment +from 180 people last time we offered to 7 00:00:48,738 --> 00:00:55,939 -about 350 of you sign up a couple of -words to to to make us all legally +about 350 of you signed up. +Just a couple of words to make us all legally 8 00:00:55,939 --> 00:01:02,570 -covered way our video recording this -class so you know if you're +covered, we are video recording this class. +So, you know if you're 9 00:01:02,570 --> 00:01:10,680 uncomfortable about this for today just -go behind that camera or go to the +go behind that camera or go to the corner that 10 00:01:10,680 --> 00:01:18,280 -cameras litter but we are going to send +camera's not gonna turn, but we are going to send out forms for you to fill out in terms 11 00:01:18,280 --> 00:01:25,228 -of allowing video recording so that's -that's just one bit of housekeeping so +of allowing video recording. +So, that's just one bit of housekeeping. 12 00:01:25,228 --> 00:01:32,200 -alright when they fail him a professor -at the computer science department so +So, alright. My name is Fei-Fei Li, a professor +at the computer science department. 13 00:01:32,200 --> 00:01:37,960 -this class and co-teaching with through +So, this class, I'm co-teaching with two senior graduate students and one of them 14 -00:01:37,959 --> 00:01:45,839 -is areas under a profit greatly we have -what I don't think Andre needs too much +00:01:37,961 --> 00:01:45,839 +is here. He's Andre Karpathy. Andre can you just say hi to everybody? +We have.. I don't think Andre needs too much 15 00:01:45,840 --> 00:01:48,659 -introduction all of you know his work +introduction. A lot of you probably know his work, 16 00:01:48,659 --> 00:01:53,960 -follow his blog his Twitter follower +follow his blog, his Twitter follower. 17 -00:01:53,959 --> 00:02:02,509 -under has way more followers than I do -very popular Justin Johnson is still +00:01:53,961 --> 00:02:02,509 +Andre has way more followers than I do. +So, he's very popular. And also Justin Johnson who is still 18 00:02:02,510 --> 00:02:08,200 traveling internationally but will be -back in a few days so andre land just so +back in a few days. So, Andre and Justin 19 -00:02:08,199 --> 00:02:14,509 -we'll be picking up the bulk of the -lecture teaching today I'll be giving +00:02:08,201 --> 00:02:14,509 +will be picking up the bulk of the +lecture teaching, and today I'll be giving 20 00:02:14,509 --> 00:02:20,039 -her structure but as you probably can -see that I'm expecting the newborn ratio +a first lecture but as you probably can +see that I'm expecting a newborn very soon, 21 00:02:20,039 --> 00:02:28,239 -speaking of weeks so you'll see more of -undrained Justin in lecture time we will +speaking of weeks, so you'll see more of +Andre and Justin in lecture time. We will 22 00:02:28,239 --> 00:02:34,189 -also introduce a whole team of TACE -towards the end of this lecture again +also introduce a whole team of TAs +towards the end of this lecture. 
23 00:02:34,189 --> 00:02:38,959 -people who are looking for seats you go -out of that before and come back there's +Again, people who are looking for seats you go +out of that door and come back. There's 24 00:02:38,959 --> 00:02:47,039 -a whole bunch of seats on the side so so -this for this lecture we're going to +a whole bunch of seats on the side. +So, for this lecture, we're going to 25 00:02:47,039 --> 00:02:53,519 -give the introduction of the class what -kind of problems we work on and the +give the introduction of the class, +what kind of problems we work on and the 26 00:02:53,519 --> 00:03:03,530 -tools will be learning so again welcome -to see us 231 and this is a vision in +tools will be learning. So, again, welcome +to CS231n. This is a vision class. 27 00:03:03,530 --> 00:03:09,140 -class it's based on a very specific -modeling architecture called your +It's based on a very specific +modeling architecture called neural network 28 -00:03:09,139 --> 00:03:16,000 -network and the more specifically mostly -a convolution on your network and a lot +00:03:09,141 --> 00:03:16,000 +and the more specifically, mostly +on the Convolutional Neural Network 29 00:03:16,000 --> 00:03:23,799 -of you hear this term maybe through a -popular Press article we we or or or +and a lot of you hear this term, maybe through a +popular press article or 30 00:03:23,799 --> 00:03:34,239 -coverage we've had to call this the deep -learning network growing field of +coverage we tend to call this the deep +learning network. Vision is one of the fastest growing field of 31 00:03:34,239 --> 00:03:40,920 -artificial intelligence in fact the -school has estimated and and we are +artificial intelligence. +In fact, CISCO has estimated it and we are 32 -00:03:40,919 --> 00:03:50,018 -underway for of this by 2016 which we +00:03:40,921 --> 00:03:50,018 +day 4 of this by 2016 which we already have arrived more than 85% of 33 00:03:50,019 --> 00:03:56,230 -the internet cyberspace data is in the -form of pixels +the Internet cyberspace data is in the form of pixels 34 -00:03:56,229 --> 00:04:05,329 -or what they call multimedia so so we -basically have entered an age of vision +00:03:56,231 --> 00:04:05,329 +or what they call Multimedia. +So, we basically have entered at age of vision 35 00:04:05,330 --> 00:04:12,530 -of images and video calls why why is -this so while partly to a large extent +of images and videos. +Why is this so, well partially to a large extent 36 00:04:12,530 --> 00:04:20,858 -is speakers of the explosion of both the +is because of the explosion of both the Internet as a carrier of data as well as 37 00:04:20,858 --> 00:04:25,930 -answers we have more sensors that the -number of people on Thurs these days +sensors. We have more sensors than the +number of people on the Earth these days. 38 00:04:25,930 --> 00:04:32,000 -every one of you is carrying some kind -of smart phones digital cameras and and +Every one of you is carrying some kind +of smart phones, digital cameras and 39 00:04:32,000 --> 00:04:37,879 -and and you know cars around on the -street with cameras so so the soldiers +you know, cars running on the +street with cameras. 
So, the sensors 40 00:04:37,879 --> 00:04:46,500 have really enabled the explosion of -visual data on the internet but visual +visual data on the Internet but visual 41 00:04:46,500 --> 00:04:55,209 @@ -202,47 +201,46 @@ data to harness so if you have heard my 42 00:04:55,209 --> 00:05:07,810 -previous talks and and some other parks -by the dark matter of the Internet +previous talks and some other talks +by Computer Vision professors, we call this the dark matter of the Internet 43 00:05:07,810 --> 00:05:13,879 -why is this the dark matter just like -the universe is closest to 85% dark +why is this the dark matter? Just like +the universe consist of to 85% dark 44 00:05:13,879 --> 00:05:19,409 -matter dark energy is these matters that -energy that is very hard to observe week +matter. Dark energy is these matters that +energy that is very hard to observe. 45 00:05:19,410 --> 00:05:25,919 -and weekend by mathematical models in -the universe internet these are the +we can infer it by mathematical models in +the universe. On the Internet these are the 46 -00:05:25,918 --> 00:05:30,649 +00:05:25,920 --> 00:05:30,649 matters pixel data the other data that we don't know we have a hard time 47 00:05:30,649 --> 00:05:36,239 -grasping the continent's here's one very -very simple suspects for you to consider +grasping the contents here's one very +very simple aspects for you to consider 48 00:05:36,240 --> 00:05:39,090 so today 49 -00:05:39,089 --> 00:05:49,560 +00:05:39,091 --> 00:05:49,560 YouTube servers every 60 seconds we'll -have more than $150 of videos uploaded +have more than 150 hours of videos uploaded 50 00:05:49,560 --> 00:05:54,089 -onto YouTube servers for every 60 -seconds +onto YouTube servers for every 60 seconds 51 00:05:54,089 --> 00:06:02,739 diff --git a/captions/Ko/Lecture1_ko.srt b/captions/Ko/Lecture1_ko.srt index a129996a..22f0f94a 100644 --- a/captions/Ko/Lecture1_ko.srt +++ b/captions/Ko/Lecture1_ko.srt @@ -1,202 +1,204 @@ 1 00:00:00,000 --> 00:00:03,899 - 측면에 더 많은 좌석이있다 + 늦게 들어오시는 분들은 측면에 더 많은 좌석이 있습니다. 2 00:00:03,899 --> 00:00:19,868 - 사람들은 당신이 CS2 (31)과 깊이에있어 확인하기 위해 늦은 그래서에서 걷고있다 + 이 수업은 CS231n 3 00:00:19,868 --> 00:00:23,969 - 시각적 인식을위한 네트워크 클래스에있는 돈을 학습 + Deep Learning Neural Network Class for Visual Recognition 입니다. 4 00:00:23,969 --> 00:00:33,549 - 겨울 방학의 잘못된 클래스 좋은 내가 너무 환영과 행복 한 새 해 행복 첫날 + 수업 잘못 들어오신 분 있나요? 좋아요. 환영합니다, 행복한 새해, 그리고 행복한 겨울 방학의 첫날 입니다. 5 00:00:33,549 --> 00:00:41,069 - 그래서 이것은이 클래스의 두 번째 제안은 231 classiest하고있을 때 우리 + 이 수업은 CS231n 의 두번째 개강입니다. 6 00:00:41,070 --> 00:00:48,738 - 문자 그대로 4백80명에서 우리가 제공하는 마지막 시간을 우리의 등록을 두 배로 + 우리는 지난 번에 180명의 수강생에서 이번엔 거의 350명으로 7 00:00:48,738 --> 00:00:55,939 - 350 당신의에 우리 모두를 법적으로 확인하는 단어의 몇 가지를 가입 + 말 그대로 수강 인원을 두 배로 늘렸습니다. 8 00:00:55,939 --> 00:01:02,570 - 당신이 있다면 당신이 알 수 있도록 방법이 클래스를 기록 우리의 비디오 덮여 + 알아두셔야 할 것이 우리는 지금 동영상 촬영 중입니다. 9 00:01:02,570 --> 00:01:10,680 - 오늘이 불편 그냥 카메라 뒤에 이동하거나 이동 + 그러니 먄약 오늘 촬영이 불편하시면 카메라 뒤로 이동하거나 10 00:01:10,680 --> 00:01:18,280 - 이 관점에서 작성을 위해 카메라 쓰레기 그러나 우리는 양식을 보내려고하고있다 + 카메라가 비추지 않는 쪽으로 이동하세요. 11 00:01:18,280 --> 00:01:25,228 - 즉, 그래서 비디오 녹화를 허용하는 그 때문에 가사의 한 비트의 + 후에 동영상 촬영에 동의를 구하는 서류양식을 보내드릴겁니다. 12 00:01:25,228 --> 00:01:32,200 - 확실히 그들은 그에게 컴퓨터 과학 학부 교수를 실패 할 때 너무 + 좋아요. 저는 컴퓨터 과학부의 교수이며 이름은 Fei-Fei Lee 입니다. 13 00:01:32,200 --> 00:01:37,960 - 이 클래스와 함께 공동 교육 수석 대학원생을 통해 그 중 하나 + 이 수업동안 제가 두 명의 대학원생과 함께 가르치는데, 14 -00:01:37,959 --> 00:01:45,839 - 이익 아래 영역은 크게 우리는 내가 앙드레 너무 많이 필요하다고 생각하지 않는 것을 가지고있다 +00:01:37,961 --> 00:01:45,839 + 그 중에 한명은 지금 이 자리에 있습니다. 안드레, 모두에게 인사하세요. 
15 00:01:45,840 --> 00:01:48,659 - 소개 여러분 모두가 자신의 일을 알고있다 + 안드레에 대해서는 많은 소개가 필요 없을 듯 합니다. 많은 분들이 아마 그를 알고 있을거예요. 16 00:01:48,659 --> 00:01:53,960 - 자신의 블로그를 자신의 트위터 팔로워를 따라 + 그의 블로그나 트위터를 팔로우하면서 말이죠. 17 -00:01:53,959 --> 00:02:02,509 - 방법이 더 추종자가에서 나는 매우 인기있는 저스틴 존슨은 여전히​​보다 +00:01:53,961 --> 00:02:02,509 + 안드레가 저 보다 팔로워수가 훨씬 많아요. 엄청 유명하죠. 18 00:02:02,510 --> 00:02:08,200 - 해외 여행하지만 며칠 그래서 앙드레 땅 그냥 그렇게 다시 할 것이다 + 그리고 다른 한명 저스틴 존슨은 아직 여행중인데 며칠 내로 돌아올 겁니다. 19 -00:02:08,199 --> 00:02:14,509 - 우리는 내가주는거야 강의 교육의 대부분 오늘을 따기됩니다 +00:02:08,201 --> 00:02:14,509 + 안드레와 저스틴이 상당한 양의 강의를 진행하게 됩니다. 20 00:02:14,509 --> 00:02:20,039 - 그녀의 구조는하지만 같은 당신은 아마 나는 신생아 비율을 기대하고있어 것을 알 수 있습니다 + 오늘은 제가 첫 강의를 진행 하겠지만 보시다시피 제가 곧, 몇 주안에, 아기를 낳을 예정입니다. 21 00:02:20,039 --> 00:02:28,239 - 주 말하고 그래서 당신은 우리 것 강의 시간에 비 배수 저스틴의 자세한 내용을 볼 수 있습니다 + 그래서 아마 여러분들은 안드레와 저스틴을 강의 시간에 더 많이 보게 될 것입니다. 22 00:02:28,239 --> 00:02:34,189 - 또한 다시이 강의의 끝으로 TACE의 전체 팀 소개 + 또한 강의를 마치기 전에 전체 조교들을 소개해 드릴 것입니다. 23 00:02:34,189 --> 00:02:38,959 - 좌석을 찾고있는 사람들은 당신이 전에 밖으로 가서 다시 거기 와서 + 아직 좌석을 찾고있는 분은 밖으로 돌아서 들어오시면 24 00:02:38,959 --> 00:02:47,039 - 우리가 가고있는이 강의에 대한 측면에 좌석의 모두 그렇게 때문에이 + 저쪽에 자리가 많이 있습니다. + 이 수업에서 우리는 25 00:02:47,039 --> 00:02:53,519 - 우리가하고 작동 문제가 어떤 종류의 클래스의 도입을 제공 + 이 강좌에 대한 소개, 우리가 풀고있는 문제들과 도구들을 다룰 것입니다. 26 00:02:53,519 --> 00:03:03,530 - 도구는 그래서 다시 배우고 우리에게 (231)를 볼 수 환영이의 비전됩니다 + 다시 한번, Vision 수업 CS231n 강의에 오신 것을 환영 합니다. 27 00:03:03,530 --> 00:03:09,140 - 수업은 여러분라는 매우 구체적인 모델링 아키텍처를 기반으로 + 이 수업은 매우 구체적으로 Neural Network 이라는 모델링 아키텍처를 다루게 되며 28 -00:03:09,139 --> 00:03:16,000 - 네트워크와 네트워크 및 많은에 대한보다 구체적으로는 대부분 컨볼 루션 +00:03:09,141 --> 00:03:16,000 + 더 자세히는 Convolutional Neual Network 을 다룹니다. 29 00:03:16,000 --> 00:03:23,799 - 당신이 인기를 눌러 문서를 통해 어쩌면이 용어를 듣고의 우리 또는 또는 또는 + Deep Learning Network 이라고도 부르는 이 용어를 아마 언론매체 기사에서 접하게 될 텐데. 30 00:03:23,799 --> 00:03:34,239 - 범위는 우리의이 깊은 학습 네트워크 성장 분야를 호출 했어 + Vision은 인공지능 분야에서도 제일 빠르게 성장하고 있는 분야중 하나입니다. 31 00:03:34,239 --> 00:03:40,920 - 사실 인공 지능이 학교는 추정하고 우리는있다 + 실제로, CISCO사의 자료에 의하면, 32 -00:03:40,919 --> 00:03:50,018 - 우리는 이미의 85 % 이상을 도달 한 2016 년이의에 대한 진행 +00:03:40,921 --> 00:03:50,018 + 지금 이 시간, 2016의 4번째 날, + 인터넷 사이버공간 데이터의 85% 이상이 33 00:03:50,019 --> 00:03:56,230 - 인터넷 사이버 공간 데이터는 픽셀의 형태 인 + pixel 형태로 존재합니다. 34 -00:03:56,229 --> 00:04:05,329 - 또는 멀티미디어를 부르는 것을 우리는 기본적으로 비전의 시대를 입력 그래서 있도록 +00:03:56,231 --> 00:04:05,329 + 멀티미디어라고 하죠. 우리는 다시말해 이미지들과 영상 - Vision의 시대에 들어 온 것입니다. 35 00:04:05,330 --> 00:04:12,530 - 이미지와 영상 통화의 이유를 이유입니다 동안 부분적으로 큰 정도 그렇게 + 비젼이 이렇게 큰 부분을 차지하는 이유는 36 00:04:12,530 --> 00:04:20,858 - 데이터 캐리어로서의 인터넷 모두 폭발 스피커 물론이다 + 데이터 캐리어로써의 인터넷과 센서들 덕분입니다. 37 00:04:20,858 --> 00:04:25,930 - 답변 우리는 더 많은 센서가 그 목에 사람들이 일 수 + 우리는 이미 지구상의 인구수보다 많은 센서들을 가지고 있지요. 38 00:04:25,930 --> 00:04:32,000 - 당신의 모든 사람이 스마트 폰의 어떤 종류를 수행하고 디지털 카메라와 + 모든 사람이 스마트 폰, 디지털 카메라를 가지고 있고 39 00:04:32,000 --> 00:04:37,879 - 그리고 당신은 군인 그렇게 있도록 카메라와 거리에 주위에 차를 알고 + 길을 달리는 차들도 카메라가 있죠. 40 00:04:37,879 --> 00:04:46,500 - 정말 인터넷하지만 시각에 영상 데이터의 폭발을 사용하도록 설정 + 정말 센서들은 폭발적인 양의 시각 데이터를 인터넷으로 불러왔죠. 41 00:04:46,500 --> 00:04:55,209 - 데이터 또는 픽셀 데이터는 들어 본 적이 경우 가장 어려운 데이터가 그렇게 활용하는 것도 내 + 하지만 시각 데이터 또는 픽셀 데이터는 또한 가장 다루기 힘든 데이터입니다. 42 00:04:55,209 --> 00:05:07,810 - 인터넷의 암흑 물질에 의해 이전 회담과 다른 공원 + 저와 다른 Computer Vision 교수들은 시각 데이터를 인터넷의 암흑 물질이라고 부릅니다. 43 00:05:07,810 --> 00:05:13,879 - 이유는 우주와 같은 암흑 물질은 85 % 어두운에게 가장 가까운입니다 + 왜 암흑 물질일까요? 이유는 바로 우주의 85%가 암흑 물질로 이루어져 있으며, 44 00:05:13,879 --> 00:05:19,409 - 문제 암흑 에너지는 매우 어렵다 에너지가 주 관찰하는 것이이 문제입니다 + 이 암흑 에너지또한 아주 관측하기 어렵기 때문입니다. 
45 00:05:19,410 --> 00:05:25,919 - 우주 인터넷이 수학적 모델로 주말이있는 + 우리는 수학적 모델로써 암흑 에너지를 추론해 볼 수 있죠. 46 -00:05:25,918 --> 00:05:30,649 - 문제는 우리가 힘든 시간을 모르는 데이터를 다른 데이터를 화소를 +00:05:25,920 --> 00:05:30,649 + 인터넷에서는 픽셀 데이터가 바로 우리가 잘 모르는, 우리가 내용을 알아내기 힘든 암흑물질인 것이죠. 47 00:05:30,649 --> 00:05:36,239 - 여기에 대륙있어 파악하는 것은 당신이 고려해야 할 하나의 매우 매우 간단 용의자의 + 여기에 여러분이 고려해야할 아주 간단한 측면이 있습니다. 48 00:05:36,240 --> 00:05:39,090 - 그래서 오늘 + 오늘 49 -00:05:39,089 --> 00:05:49,560 - 유튜브 서버 60 초마다 우리는 업로드 한 동영상의 이상 $ 150해야합니다 +00:05:39,091 --> 00:05:49,560 + 유튜브 서버들은 60 초마다 우리는 150시간 이상의 동영상이 업로드됩니다. 50 00:05:49,560 --> 00:05:54,089 - 60 초마다를위한 YouTube 서버 상에 + 메 60초마다. 51 00:05:54,089 --> 00:06:02,739 From b2f60319e3adc9fd5ee22a929a9e348b09e79d22 Mon Sep 17 00:00:00 2001 From: osx_gnujoow Date: Fri, 8 Apr 2016 14:15:54 +0900 Subject: [PATCH 019/199] aws-tutorial : draft --- aws-tutorial.md | 119 ++++++++++++++++-------------------------------- 1 file changed, 38 insertions(+), 81 deletions(-) diff --git a/aws-tutorial.md b/aws-tutorial.md index e357872b..e2fa191e 100644 --- a/aws-tutorial.md +++ b/aws-tutorial.md @@ -3,95 +3,67 @@ layout: page title: AWS Tutorial permalink: /aws-tutorial/ --- -For GPU instances, we also have an Amazon Machine Image (AMI) that you can use -to launch GPU instances on Amazon EC2. This tutorial goes through how to set up -your own EC2 instance with the provided AMI. **We do not currently -distribute AWS credits to CS231N students but you are welcome to use this -snapshot on your own budget.** - -**TL;DR** for the AWS-savvy: Our image is -`cs231n_caffe_torch7_keras_lasagne_v2`, AMI ID: `ami-125b2c72` in the us-west-1 -region. Use a `g2.2xlarge` instance. Caffe, Torch7, Theano, Keras and Lasagne -are pre-installed. Python bindings of caffe are available. It has CUDA 7.5 and -CuDNN v3. - -First, if you don't have an AWS account already, create one by going to the [AWS -homepage](http://aws.amazon.com/), and clicking on the yellow "Sign In to the -Console" button. It will direct you to a signup page which looks like the -following. + +GPU 인스턴스를 사용할경우, 아마존 EC2에 GPU 인스턴스를 사용할 수 있는 아마존 머신 이미지 (AMI)가 있습니다. 이 튜토리얼은 제공된 AMI를 통해 자신의 EC2 인스턴스를 설정하는 방법에 대해서 설명합니다. **현재 CS231N 학생들에게 AWS크레딧을 제공하지 않습니다. AWS 스냅샷을 사용하기 위해 여러분의 예산을 사용하기 권장합니다.** + +**요약** AWS가 익숙한 분들: 사용할 이미지는 +`cs231n_caffe_torch7_keras_lasagne_v2` 입니다., AMI ID: `ami-125b2c72` region은 US WEST(N. California)입니다. 인스턴스는 `g2.2xlarge`를 사용합니다. 이 이미지에는 Caffe, Torch7, Theano, Keras 그리고 Lasagne가 설치되어 있습니다. 그리고 caffe의 Python binding을 사용할 수 있습니다. 생성한 인스턴스는 CUDA 7.5 와 CuDNN v3를 포함하고 있습니다. + +첫째로, AWS계정이 아직 없다면 [AWS홈페이지](http://aws.amazon.com/)에 접속하여 "가입"이라고 적혀있는 노란색 버튼을 눌러 계정을 생성합니다. 버튼을 누르면 가입페이지가 나오며 아래 그림과 같이 나타납니다.
-Select the "I am a new user" checkbox, click the "Sign in using our secure -server" button, and follow the subsequent pages to provide the required details. -They will ask for a credit card information, and also a phone verification, so -have your phone and credit card ready. +이메일 또는 휴대폰 번호를 입력하고 "새 사용자입니다."를 선택합니다, "보안서버를 사용하여 로그인"을 누르면 세부사항을 입력하는 페이지들이 나오게 됩니다. 이 과정에서 신용카드 정보입력과 핸드폰 인증절차를 진행하게 됩니다. 가입을 위해서 핸드폰과 신용카드를 준비해주세요. -Once you have signed up, go back to the [AWS homepage](http://aws.amazon.com), -click on "Sign In to the Console", and this time sign in using your username and -password. +가입을 완료했다면 [AWS 홈페이지](http://aws.amazon.com)로 돌아가 "콘솔에 로그인" 버튼을 클릭합니다. 그리고 이메일과 비밀번호를 입력해 로그인을 진행합니다.
-Once you have signed in, you will be greeted by a page like this: +로그인을 완료했다면 다음과 같은 페이지가 여러분을 맞아줍니다.
-Make sure that the region information on the top right is set to N. California. -If it is not, change it to N. California by selecting from the dropdown menu -there. +오른쪽 상단의 region이 N. California로 설정되어있는지 확인합니다. 만약 제대로 설정되어 있지 않다면 드롭다운 메뉴에서 N. California로 설정합니다. -(Note that the subsequent steps requires your account to be "Verified" by - Amazon. This may take up to 2 hrs, and you may not be able to launch instances - until your account verification is complete.) +(그 다음으로 진행하기 위해서는 여러분의 계정이 "인증"되어야 합니다. 인증에 소요되는 시간은 약 2시간이며 인증이 완료되기 전까지는 인스턴스를 실행할 수 없을 수도 있습니다.) -Next, click on the EC2 link (first link under the Compute category). You will go -to a dashboard page like this: +다음으로 EC2링크를 클릭합니다. (Compute 카테고리의 첫번째 링크) 그러면 다음과 같은 대시보드 페이지로 이동합니다.
-Click the blue "Launch Instance" button, and you will be redirected to a page -like the following: +"Launch Instace"라고 적혀있는 파란색 버튼을 클릭합니다. 그러면 다음과 같은 페이지로 이동하게 됩니다.
-Click on the "Community AMIs" link on the left sidebar, and search for "cs231n" -in the search box. You should be able to see the AMI -`cs231n_caffe_torch7_keras_lasagne_v2` (AMI ID: `ami-125b2c72`). Select that -AMI, and continue to the next step to choose your instance type. +왼쪽의 사이드바 메뉴에서 "Community AMIs"를 클릭합니다. 그리고 검색창에 "cs231n"를 입력합니다. 검색결과에 `cs231n_caffe_torch7_keras_lasagne_v2`(AMI ID: `ami-125b2c72`)가 나타납니다. 이 AMI를 선택하고 다음 단게에서 인트턴스 타입을 선택합니다.
-Choose the instance type `g2.2xlarge`, and click on "Review and Launch". +인스턴스 타입`g2.2xlarge` 를 선택하고 "Review and Launch"를 클릭합니다.
-In the next page, click on Launch. +다음 화면에서 Launch를 클릭합니다.
-You will be then prompted to create or use an existing key-pair. If you already -use AWS and have a key-pair, you can use that, or alternately you can create a -new one by choosing "Create a new key pair" from the drop-down menu and giving -it some name of your choice. You should then download the key pair, and keep it -somewhere that you won't accidentally delete. Remember that there is **NO WAY** -to get to your instance if you lose your key. +클릭하게 되면 기존에 사용하던 key-pair를 사용할 것인지 새로 key-pair를 만들것인지 묻는 창이 뜨게됩니다. 만약 AWS를 이미 사용하고 있다면 사용하던 key를 사용할 수 있습니다. 혹은 드롭다운 메뉴에서 "Create a new key pair"를 선택하여 새로 key를 생성할 수 있습니다. 그리고 key 를 다운로드해야합니다. 다운로드한 key를 실수로 삭제하지 않도록 각별한 주의를 기울여야합니다. 만약 key를 잃어버릴 경우 인스턴스에 **접속할 수 없습니다.**
@@ -101,70 +73,55 @@ to get to your instance if you lose your key.
-Once you download your key, you should change the permissions of the key to -user-only RW, In Linux/OSX you can do it by: +key 다운로드가 완료되면 key의 권한을 user-only RW로 바꿉니다. Linux/OSX 사용자는 다음 명령어로 권한을 수정할 수 있습니다. ~~~ $ chmod 600 PEM_FILENAME ~~~ -Here `PEM_FILENAME` is the full file name of the .pem file you just downloaded. -After this is done, click on "Launch Instances", and you should see a screen -showing that your instances are launching: +여기서 `PEM_FILENAME`은 방금전에 다운로드한 .pem 파일의 이름입니다. + +권한수정을 마쳤다면 "Launch Instace"를 클릭합니다. 그럼 생성한 인스턴스가 지금 작동중(Your instance are now launching)이라는 메시지가 나타납니다. +
-Click on "View Instances" to see your instance state. It should change to -"Running" and "2/2 status checks passed" as shown below within some time. You -are now ready to ssh into the instance. +"View Instance"를 클릭하여 인스턴스의 상태를 확인합니다. "2/2 status checks passed"상태가 지나면 "Running"으로 상태가 변하게 됩니다. "Running"상태가 되면 ssh를 통해 생성한 인스턴스에 접속 할 수 있습니다.
-First, note down the Public IP of the instance from the instance listing. Then,
-do:
+먼저, 인스턴스 리스트에서 인스턴스의 Public IP를 기억해 둡니다. 그리고 다음을 진행합니다.

~~~
ssh -i PEM_FILENAME ubuntu@PUBLIC_IP
~~~

-Now you should be logged in to the instance. You can check that Caffe is working
-by doing:
+이제 인스턴스에 로그인이 됩니다. 다음 명령어를 통해 Caffe가 작동중인지 확인할 수 있습니다.

~~~
$ cd caffe
$ ./build/tools/caffe time --gpu 0 --model examples/mnist/lenet.prototxt
~~~

-We have Caffe, Theano, Torch7, Keras and Lasagne pre-installed. Caffe python
-bindings are also available by default. We have CUDA 7.5 and CuDNN v3 installed.
+생성한 인스턴스에는 Caffe, Theano, Torch7, Keras 그리고 Lasagne이 설치되어 있습니다. 또한 Caffe Python bindings를 기본적으로 사용할 수 있게 설정되어 있습니다. 그리고 인스턴스에는 CUDA 7.5와 CuDNN v3가 설치되어 있습니다.

-If you encounter any error such as
+만약 아래와 같은 에러가 발생한다면

~~~
Check failed: error == cudaSuccess (77 vs. 0)  an illegal memory access was encountered
~~~

-you might want to terminate your instance and start over again. I have observed
-this rarely, and I am not sure what causes this.
-
-About how to use these instances:
-
-- The root directory is only 12GB, and only ~ 3GB of that is free.
-- There should be a 60GB `/mnt` directory that you can use to put your data,
-model checkpoints, models etc.
-- Remember that the `/mnt` directory won't be persistent across
-reboots/terminations.
-- Stop your instances when are done for the day to avoid incurring charges. GPU
-instances are costly. Use your funds wisely. Terminate them when you are sure
-you are done with your instance (disk storage also costs something, and can be
-significant if you have a large disk footprint).
-- Look into creating custom alarms to automatically stop your instances when
-they are not doing anything.
-- If you need access to a large dataset and don't want to download it every time
-you spin up an instance, the best way to go would be to create an AMI for that
-and attach that AMI to your machine when configuring your instance (before
-launching but after you have selected the AMI).
+생성한 인스턴스를 terminate하고 인스턴스 생성부터 다시 시작해야 합니다. 오류가 발생하는 정확한 이유는 알 수 없지만 이런 현상이 드물게 일어난다고 합니다.
+
+생성한 인스턴스를 사용하는 방법:
+
+- root directory는 총 12GB입니다. 그리고 ~ 3GB 정도의 여유공간이 있습니다.
+- model checkpoint, model 등을 저장할 수 있는 60GB의 공간이 `/mnt`에 있습니다.
+- 인스턴스를 reboot/terminate 하면 `/mnt` 디렉토리의 자료는 소멸됩니다.
+- 추가 비용이 발생하지 않도록 작업이 완료되면 인스턴스를 stop해야 합니다. GPU 인스턴스는 사용료가 높습니다. 예산을 현명하게 사용하는 것을 권장합니다. 여러분의 작업이 완전히 끝났다면 인스턴스를 Terminate합니다. (디스크 공간 또한 과금이 됩니다. 만약 큰 용량의 디스크를 사용한다면 과금이 많이 될 수 있습니다.)
+- 인스턴스가 아무 작업도 하지 않을 때 자동으로 stop되도록 custom alarm을 생성하는 방법도 알아보는 것이 좋습니다.
+- 만약 큰 데이터셋에 접근해야 하는데 인스턴스를 시작할 때마다 매번 다운로드하고 싶지 않다면, 가장 좋은 방법은 해당 데이터셋을 담은 AMI를 만들고 인스턴스를 설정할 때 그 AMI를 여러분의 기기에 연결하는 것입니다. (이 작업은 AMI를 선택한 후, 인스턴스를 실행(launching)하기 전에 설정해야 합니다.)
\ No newline at end of file

From 6366bb22737b5e38a6f4d0dd87a4265ee2906877 Mon Sep 17 00:00:00 2001
From: osx_gnujoow
Date: Fri, 8 Apr 2016 14:44:28 +0900
Subject: [PATCH 020/199] Image changes in AWS tutorial

---
 assets/aws-signin.png | Bin 469823 -> 200628 bytes
 assets/aws-signup.png | Bin 396912 -> 200627 bytes
 2 files changed, 0 insertions(+), 0 deletions(-)

diff --git a/assets/aws-signin.png b/assets/aws-signin.png
index 30413cbf9b96a648be2f288181e40f506b8c637b..023d223d4818448d957d2539a5c54f467e8ad098 100644
GIT binary patch
literal 200628
-We can explain the piecewise-linear structure of the loss function by examing the math. For a single example we have:
+손실함수(loss function)가 부분적으로 선형(piecewise linear)인 구조를 갖는다는 것은 수식을 통해 설명할 수 있다. 예시가 하나인 경우에 다음과 같이 쓸 수 있다.

$$
L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + 1) \right]
$$

From 53bfe490705073de6a3d11731484781f177bbde5 Mon Sep 17 00:00:00 2001
From: Jaemin Cho
Date: Sat, 9 Apr 2016 01:38:51 +0900
Subject: [PATCH 027/199] Update Lecture10_ko.srt
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

2016.04.09. ~09:58 까지 번역
---
 captions/Ko/Lecture10_ko.srt | 303 +++++++++++++++++------------------
 1 file changed, 151 insertions(+), 152 deletions(-)

diff --git a/captions/Ko/Lecture10_ko.srt b/captions/Ko/Lecture10_ko.srt
index 48dad07c..1d9058fa 100644
--- a/captions/Ko/Lecture10_ko.srt
+++ b/captions/Ko/Lecture10_ko.srt
@@ -1,594 +1,594 @@
-1
+1
00:00:00,000 --> 00:00:04,129
- 우리를 신뢰
+ 마이크 테스트

2
00:00:04,129 --> 00:00:12,109
- 확인 그것은 우리가 곧 그래서 오늘 우리가 얘기 할 수 있습니다 시작합니다 확인 좋은 작품
+ 오늘의 주제는 Recurrent neural networks 입니다.

3
00:00:12,109 --> 00:00:15,199
- 내가 가장 좋아하는 주제 내 좋아하는 일 중 하나입니다 재발 성 신경 네트워크
+ 개인적으로 가장 좋아하는 주제이고

4
00:00:15,199 --> 00:00:18,960
- 모델은 도처에 많이 신경 네트워크에 입력으로 재생
+ 또 여러 형태로 사용하고 있는 NN 모델이기도 하죠. 재밌어요.

5
00:00:18,960 --> 00:00:23,009
- 재미 관리 높은 임시 직원의 관점에서 놀 리콜
+ 강의 진행에 관해서 언급할 게 있는데,

6
00:00:23,009 --> 00:00:26,089
- 수요일에 여러분의 중간 고사는이 수요일 당신은 정말이야 말할 수 있습니다
+ 수요일에 중간 고사가 있어요.

7
00:00:26,089 --> 00:00:32,738
- 너희들은 나에게 매우 흥분 흥분하는 경우 내가 아는 흥분 무엇 묘지는 것
+ 다들 중간고사 기대하고 있는 거 다 알아요. 사실 별로 기대하는 것 같이 보이지는 않네요.

8
00:00:32,738 --> 00:00:37,979
- 그렇게 그는이 그것 때문에 수요일에 밖으로있을 것이다 것이 수요일 인해 외출
+ 수요일에 과제가 나갈 거에요.

9
00:00:37,979 --> 00:00:40,429
- 월요일에 지금부터 주 그러나 나는 우리가 그것을 이동하고 이후가 생각하는 생각
+ 제출 기한은 2주 뒤 월요일까지입니다.

10
00:00:40,429 --> 00:00:43,399
- 우리가 계획 수요일은 오늘 발표했다합니다 그러나 우리는거야에 출하 할 수있어
+ 그런데 저희가 원래 월요일에 이걸 발표하려 했는데 늦어져서

11
00:00:43,399 --> 00:00:47,129
- 대략 수요일 그래서 우리거야 아마 몇 일에 대한 첫 번째 마감일과
+ 아마 제출 기한이 수요일 즈음으로 미뤄질 것 같네요.

12
00:00:47,130 --> 00:00:51,179
- 그런 다음 삼십팔일를 사용하는 경우, 그래서 오해의 그에게 할당 금요일에 기인
+ 2번째 과제는 금요일까지고, 3-late day를 사용할 수 있어요. 그런데 너무 일찍 사용하지는 마세요.

13
00:00:51,179 --> 00:00:55,119
- 당신은 일을 당신의 너무 많은 희망을 갖고 오늘을 낳게 될 것입니다 우리의
+ 2번째 과제는 금요일까지고, 3-late day를 사용할 수 있어요. 그런데 너무 일찍 사용하지는 마세요.

14
00:00:55,119 --> 00:01:01,089
- 72 또는 여러 사람과 사람 아래로는 대부분의 위대한 찾고 좋아 완료
+ 몇 명이나 끝냈나요? 72명? 거의 다 끝냈네요, 좋아요.

15
00:01:01,090 --> 00:01:04,549
- 클래스에 너무 현재 잘하는 해변 신경오고에 대해 얘기했다
+ 자 우리는 Convolutional Neural Network (CNN)에 대해서 얘기하고 있었죠.

16
00:01:04,549 --> 00:01:07,820
- 네트워크 라스 카사스는 특히 우리는 시각화 이해를 보았다
+ 지난 수업에서는 CNN에 대한 시각화와 간단한 이해에 대해서 다루었고,

17
00:01:07,819 --> 00:01:11,618
- 길쌈 신경망 우리는 예쁜 그림의 모두 볼 수 있도록하고
+ 이런 그림과 비디오들을 살펴보면서 CNN이 어떻게 작동하는지 살펴보았죠.

18
00:01:11,618 --> 00:01:14,938
- 비디오는 그래서 우리는 많은 재미 그가 달성 정확히 어떤 해석을 시도했다
+ 이런 그림과 비디오들을 살펴보면서 CNN이 어떻게 작동하는지 살펴보았죠.

19
00:01:14,938 --> 00:01:17,828
- 모든 네트워크는 그들이 그렇게 작업하고있는 방법을 학습하는 일을하고 있습니다
+ 이런 그림과 비디오들을 살펴보면서 CNN이 어떻게 작동하는지 살펴보았죠.

20
00:01:17,828 --> 00:01:24,188
- 그래서 우리는 당신이에서 호출 될 수있는 몇 가지 방법을 통해이 문제를 디버깅
+ 그리고 맨 마지막 그림에서 본 것처럼 디버깅도 해 보았고요.

21
00:01:24,188 --> 00:01:27,408
- 구조는 실제로 주말에 나는 다른 시각화에 의해 발견
+ 지난 주말에 트위터에서 새로운 시각화 자료를 찾았는데요,

22
00:01:27,409 --> 00:01:32,569
- 내가 트위터에서 이러한 발견 새로운 그들은 정말 근사하고 잘 모르겠어요
+ 신기하죠?

23
00:01:32,569 --> 00:01:37,118
- 어떻게 너무 많은 설명이 아니기 때문에 사람들이 만든 방법
+ 사실 설명이 없어서 정확히 어떤 방법으로 이걸 만든 건지는 잘 모르겠네요.

24
00:01:37,118 --> 00:01:43,099
- 이 거북 독 거미이고, 다음이 연결 및 어떤 종류처럼 만 보인다
+ 그래도 멋있지 않아요? 
이건 거북이고, 저건 타란튤라 거미이고,

25
00:01:43,099 --> 00:01:47,468
- 이렇게 개 등 방식 나는 그것이 견과류 같은 생각
+ 이건 체인이고, 저건 개들인데,

26
00:01:47,468 --> 00:01:50,509
- 다시 이미지로 최적화 그러나 그들은에서 다른 정례화를 사용하는
+제가 보기에 이건 어떤 최적화 기법을 이미지에 적용한 것 같은데,

27
00:01:50,509 --> 00:01:53,679
- 이 경우 이미지가 나는 그들이이있는 양자 필터를 사용하고 생각
+ 뭔가 다른 regularization 방법을 적용한 것 같네요

28
00:01:53,679 --> 00:01:57,049
- 멋진 필터의 종류는 해당 이미지에 해당 정규화를 넣어 그래서 만약 내
+ 음, 여기에는 bilateral filter (쌍방향 필터) 를 적용한 것 같네요.

29
00:01:57,049 --> 00:01:59,420
- 느낌이 당신이 달성 시각화의 종류가 있다는 것입니다
+ 음, 여기에는 bilateral filter (쌍방향 필터) 를 적용한 것 같네요.

30
00:01:59,420 --> 00:02:03,659
- 대신에 꽤 멋진 보이지만 내가가는 정확히 무엇인지 확실하지 않다 그래서
+ 그래도 솔직히 정확히 어떤 기법을 적용한 것인지는 잘 모르겠어요.

31
00:02:03,659 --> 00:02:04,549
- 우리는 곧 알게 될 같아요
+ 오늘의 주제는 RNN입니다.

32
00:02:04,549 --> 00:02:10,360
- 확인 그래서 오늘 우리는 재발 성 신경 네트워크 무엇의에 대해 얘기 할거야
+ 오늘의 주제는 RNN입니다.

33
00:02:10,360 --> 00:02:13,520
- 재발 성 신경 네트워크에 대한 좋은들은 많은 유연성에 제공한다는 것입니다
+ RNN의 강점은 네트워크 아키텍쳐를 구성하는 데에 자유도가 높다는 것입니다.

34
00:02:13,520 --> 00:02:15,870
- 네트워크 아키텍처를 배선하는 방법
+ RNN의 강점은 네트워크 아키텍쳐를 구성하는 데에 자유도가 높다는 것입니다.

35
00:02:15,870 --> 00:02:18,650
- 더 작동하지 않을 때 보통의가 바로 여기 왼쪽에있는 경우를 들어 보자
+ 일반적으로 NN을 왼쪽 그림과 같이 구성할 때는 (역자주: Vanilla NN)

36
00:02:18,650 --> 00:02:22,849
- 당신이 빨간색으로 여기에 고정 된 크기의 사진을 제공하는 경우 당신은 그것을 처리
+ 여기 빨간색으로 표시된 것처럼 고정된 크기의 input vector를 사용하고,

37
00:02:22,848 --> 00:02:27,639
- 녹색 다음 몇 가지 숨겨진 레이어 내가 산에 더 나은을보고 수정을 생산
+ 초록색의 hidden layer들을 통해 작동시키며, 마찬가지로 고정된 크기의 파란색 output vector를 출력합니다.

38
00:02:27,639 --> 00:02:30,738
- 이미지가 제공하는 수정 I 문이며 우리는 고정을 생산하고
+ 마찬가지로 고정된 크기의 이미지를 입력으로 받고,

39
00:02:30,739 --> 00:02:34,469
- 가장 가까운 코스가 크기 사진 때 재발 신경망 우리
+ 고정된 크기의 이미지를 벡터 형태로 출력합니다.

40
00:02:34,469 --> 00:02:38,239
- 실제로 입출력 또는 둘 모두에서에서 시퀀스 순서를 통해 작동 할 수 있습니다
+ RNN에서는 이러한 작업을 계속 반복할 수 있습니다. input, output 모두에서 가능하죠.

41
00:02:38,239 --> 00:02:41,319
- 동시에 영상 자막의 경우 예를 들어, 우리가 일부를 볼 수 있도록
+ 오늘 다룰 image captioning(이미지에 상응하는 자막/주석 생성) 을 예로 들면,

42
00:02:41,318 --> 00:02:44,689
- 그것을 오늘 당신은 반복을 통해 다음 고정 된 크기의 이미지를 부여하고
+ 고정된 크기의 이미지를 RNN에 입력하게 됩니다.

43
00:02:44,689 --> 00:02:47,829
- 신경망 우리는를 설명하는 단어의 시퀀스를 생성하는 것
+ 그리고 그 RNN은 해당 이미지를 설명하는 단어/문장 들을 출력하게 되죠.

44
00:02:47,829 --> 00:02:52,560
- 그래서 그 이미지의 내용에 대한 캡션입니다 문장이 될 것
+ 그리고 그 RNN은 해당 이미지를 설명하는 단어/문장 들을 출력하게 되죠.

45
00:02:52,560 --> 00:02:55,969
- 그 예를 들면 로비 감정 분류의 경우,
+ Sentiment classification(감정 분류)를 예로 들면,

46
00:02:55,969 --> 00:02:59,759
- 단어와 장식 조각의 수를 소모하고, 그들은 클래스에 시도 할 것이다
+ (어떤 문장의) 단어들과 그 순서를 입력으로 받아서,

47
00:02:59,759 --> 00:03:03,828
- 드라이버 그 문장의 감정의 경우 긍정적 또는 부정적
+ 그 문장의 느낌이 긍정적인지 또는 부정적인지를 출력하게 됩니다.

48
00:03:03,829 --> 00:03:07,590
- 기계 번역 우리는 우리 소요 재발 성 신경 네트워크를 가질 수있다
+ 또 다른 예로 machine translation (역자주: 구글 번역과 같은 알고리즘 번역) 에서는,

49
00:03:07,590 --> 00:03:12,069
- 다음 말 영어 단어의 수는 단어의 개수를 생성하라는
+ 어떤 영어 문장을 입력으로 받고, 프랑스어로 출력해야 합니다.

50
00:03:12,068 --> 00:03:17,119
- 프랑스어 번역 그래서 우리는이 말 앤드류 재발 성 신경 네트워크를 공급했던 것과
+ 그래서 우리는 이 영어 문장을 RNN에 입력하고 (이것을 Sequence to Sequence 라 부름)

51
00:03:17,120 --> 00:03:20,280
- 우리는 설정의 순서 종류의 순서를 호출 등이 작업 여부는 작업
+ 그래서 우리는 이 영어 문장을 RNN에 입력하고 (이것을 Sequence to Sequence 라 부름)

52
00:03:20,280 --> 00:03:25,169
- 단지 프랑스어와 영어에 임의의 문장에 대한 번역을 수행
+ RNN은 이 영어 문장을 프랑스어 문장으로 번역합니다.

53
00:03:25,169 --> 00:03:28,000
- 당신이 할 수있는 경우, 예를 들어 지난 경우 우리는 비디오 분류를
+ 마지막 예 video classification(영상 분류) 에서는,

54
00:03:28,000 --> 00:03:31,699
- 클래스의 일부 번호와 비디오의 매 프레임을 분류하는 상상
+ 각 프레임 (순간 캡쳐 화면) 이 어떤 속성을 지니는지,

55
00:03:31,699 --> 00:03:35,429
- 하지만 결정적으로의 전용 함수로 예측 싶지 않아
+ 그리고 그 전의 모든 프레임과의 관계는 어떻게 되는지도 고려합니다. 
56
00:03:35,430 --> 00:03:38,739
- 현재 시간은 모든 것을 비디오의 현재 프레임 단계하지만
+ 그리고 그 전의 모든 프레임과의 관계는 어떻게 되는지도 고려합니다.

57
00:03:38,739 --> 00:03:41,909
- 재발 성 신경 네트워크가 당신을 허용하는 비디오에 전에왔다
+ 그러니까 RNN은 각각의 프레임이 어떤 속성을 지니는지 분류하고,

58
00:03:41,909 --> 00:03:44,680
- 건축 와이어 최대 위치를 예측하는 매 시간 단계
+ 이전까지의 모든 프레임을 입력으로 받는 함수가 되어,

59
00:03:44,680 --> 00:03:48,760
- 지금까지도 그 지점까지 들어오는 모든 프레임의 함수이다
+ 앞으로의 프레임을 예측하는 아키텍쳐를 제공합니다.

60
00:03:48,759 --> 00:03:52,388
- 만약 입력 또는 출력 여전히 재발을 사용할 수있는 서열을 가지고 있지 않다면
+ 만약 맨 왼쪽 그림과 같이 입력과 출력의 순서에 관한 정보를 가지고 있지 않아도 RNN을 사용할 수 있습니다.

61
00:03:52,389 --> 00:03:55,250
- 심지어 당신이 처리 할 수 있기 때문에 매우 왼쪽에있는 경우 신경망 당신의
+ 만약 맨 왼쪽 그림과 같이 입력과 출력의 순서에 관한 정보를 가지고 있지 않아도 RNN을 사용할 수 있습니다.

62
00:03:55,250 --> 00:04:01,560
- 예를 들어, 입력 또는 출력 순차적으로 내가 가장 좋아하는 예제 중 하나를 수정
+ 예를 들어, 제가 좋아하는 딥마인드의 한 논문에서는

63
00:04:01,560 --> 00:04:05,189
- 이 잠시 전에 우리가 노력하고 대한 깊은 광산에서 사람들된다
+ 번지로 된 집 주소 이미지를 문자로 변환했습니다.

64
00:04:05,189 --> 00:04:09,750
- 집 번호를 전사 단지에이 큰 이미지 발을 갖는 대신
+ 여기서는 단순히 CNN을 사용해서 이미지 자체가 몇 번지를 나타내는지를 분류하지 않고,

65
00:04:09,750 --> 00:04:13,530
- 의견 그들이 와서 집 번호가에 정확히 분류하는 시도
+ 여기서는 단순히 CNN을 사용해서 이미지 자체가 몇 번지를 나타내는지를 분류하지 않고,

66
00:04:13,530 --> 00:04:16,649
- 작은 거기에 재발 신경 네트워크 정책과 그 와서
+ RNN을 사용해서 작은 CNN이 이미지를 돌아다니면서 읽어들였습니다.

67
00:04:16,649 --> 00:04:19,779
- 특히 때문에 자신의 재발 성 신경 네트워크와 함께 이미지 주위 조향
+ RNN을 사용해서 작은 CNN이 이미지를 돌아다니면서 읽어들였습니다.

68
00:04:19,779 --> 00:04:23,969
- 왼쪽에서 오른쪽으로 현재 작업은 기본적으로 집 번호를 판독을 배운
+ 이렇게 RNN은 번지 주소 이미지를 왼쪽에서 오른쪽으로 순차적으로 읽는 방법을 학습했습니다.

69
00:04:23,970 --> 00:04:26,870
- 순차적 그래서 우리는 입력으로 사진을 가지고 있지만 우리는 그것을 처리하고
+ 이렇게 RNN은 번지 주소 이미지를 왼쪽에서 오른쪽으로 순차적으로 읽는 방법을 학습했습니다.

70
00:04:26,870 --> 00:04:32,019
- 순차적으로 반대로 우리가 이것에 대해 생각할 수도 잘 알려진 사람이다
+ 반대로 생각할 수도 있습니다. 이것은 DRAW라는 유명한 논문인데요,

71
00:04:32,019 --> 00:04:35,879
- 이것은 당신이 모델에서 샘플을 여기서 볼 수 있습니다하는지 일반 모델 그리기
+ 여기서는 이미지 샘플 하나하나가 무엇인지 개별적으로 판단하지 않고,

72
00:04:35,879 --> 00:04:39,490
- 이들 숫자 샘플과 함께오고 있지만 결정적으로 우리는 단지하지 않은 경우
+ 여기서는 이미지 샘플 하나하나가 무엇인지 개별적으로 판단하지 않고,

73
00:04:39,490 --> 00:04:42,860
- 한 번에이 숫자를 예측하지만 우리는 우리의 현재 네트워크와 우리가
+ RNN이 여러 이미지를 하나의 큰 캔버스의 형태로 한번에 출력합니다.

74
00:04:42,860 --> 00:04:47,540
- 캔버스로까지 생각하고 커널에가는 시간이 지남에 그린과
+ RNN이 여러 이미지를 하나의 큰 캔버스의 형태로 한번에 출력합니다.

75
00:04:47,540 --> 00:04:50,200
- 그래서 당신은 자신에게 실제로 전에 몇 가지 계산을 할 수있는 더 많은 기회를 제공하고 있습니다
+ 이 방법은 한 번지수 이미지에 대한 입력 결과를 곧바로 출력하지 않고, 보다 많은 계산을 거친다는 점에서 강력합니다.

76
00:04:50,199 --> 00:04:53,479
- 실제로 당신이 처리 형태의 더 강력한 종류의 것을있어 생산
+ 이 방법은 한 번지수 이미지에 대한 입력 결과를 곧바로 출력하지 않고, 보다 많은 계산을 거친다는 점에서 강력합니다. 질문 있나요?

77
00:04:53,480 --> 00:05:14,189
- 데이터는이 지금은 무엇을 의미하는지 정확히의 특성을 통해 질문이었다
+ (질문) 그림에서 화살표는 무엇인가요?

78
00:05:14,189 --> 00:05:19,310
- 일이 일이 너무 그래서 그냥 표시 에로스는 기능 의존도를 나타냅니다
+ 화살표는 functional dependence를 나타냅니다. 조금 있다가 좀 더 자세하게 살펴 볼 거에요.

79
00:05:19,310 --> 00:05:23,139
- 거친만큼 일하기 전에 우리가가는 것을 정확하게 그처럼 보인다
+ 화살표는 functional dependence를 나타냅니다. 조금 있다가 좀 더 자세하게 살펴 볼 거에요.

80
00:05:23,139 --> 00:05:37,168
- 네트워크가 많이 보았다 그래서이 너무 좋아 조금 집 번호를 생성
+ (질문) 그림에서 나타나는 숫자들은 무엇인가요?

81
00:05:37,168 --> 00:05:41,219
- 집 숫자와 이러한 그림의 방법으로 와서 그래서 이들에없는
+ 이것들은 실제 사진이 아니라 RNN이 학습 후 출력한 결과물입니다.

82
00:05:41,220 --> 00:05:44,830
- 이들의 훈련 일이이의 모델 없음에서 숫자를 만들어
+ 이것들은 실제 사진이 아니라 RNN이 학습 후 출력한 결과물입니다.

83
00:05:44,829 --> 00:05:48,219
- 실제로 교육이 만들어집니다 세트
+ (질문) 그러니까 실제 사진이 아니라 만들어진 거라는 거죠?

84
00:05:48,220 --> 00:05:51,689
- 그래, 그들은 아주 진짜 보이지만 그들은 실제로 현지에서 만든 것
+ 네, 꽤 실제 사진처럼 보이기는 하지만, 이것들은 만들어진 이미지입니다.

85
00:05:51,689 --> 00:05:55,809
- 그래서 재발 성 신경 네트워크 그가 발언이 것은 기본적이며,
+ RNN은 이런 초록색 박스처럼 생겼습니다.

86
00:05:55,809 --> 00:06:00,979
- 녹색 및 그 상태를 가지며, 그것은 기본적으로 시간을 통해 수신하고 그것을
+ RNN은 계속해서 input vector를 입력받습니다. 
87
00:06:00,978 --> 00:06:04,859
- 우리가에 입력 벡터에 공급 할 수있는 여배우 그래서 매번 수신
+ RNN은 계속해서 input vector를 입력받습니다.

88
00:06:04,860 --> 00:06:08,538
- 무장 한 남자와 내부적으로 어떤 상태를 가지고 있으며, 다음은 그 수정할 수 있습니다
+ RNN은 내부적으로 state를 가지고 있고, 매 시간 단계에 입력받는 값의 함수로 그 state를 수정할 수 있습니다.

89
00:06:08,538 --> 00:06:12,988
- 그것은 매 시간 단계를 받고 그래서 것의 함수로서 상태
+ RNN은 내부적으로 state를 가지고 있고, 매 시간 단계에 입력받는 값의 함수로 그 state를 수정할 수 있습니다.

90
00:06:12,988 --> 00:06:17,258
- 그들은 우리가 그 폐기물를 켤 때 물론 모든 무게와 CNN 등 수있어
+ RNN에는 또한 weight(가중치)를 설정할 수 있고, 이를 조정함으로써 RNN의 작동을 조절할 수 있습니다.

91
00:06:17,259 --> 00:06:20,829
- 명시된 목표는 수신 한 방법의 측면에서 아놀드 다른 동작
+ RNN에는 또한 weight(가중치)를 설정할 수 있고, 이를 조정함으로써 RNN의 작동을 조절할 수 있습니다.

92
00:06:20,829 --> 00:06:25,769
- 나는 보통 우리는 또한 생산에 관심이있을 수 있습니다 면제 및 모든하지만,
+ 우리는 물론 RNN의 출력 결과물에도 관심을 갖고 있지만,

93
00:06:25,769 --> 00:06:30,429
- 우리가 지금 만 시간의 상단에 이러한 문제를 생성 할 수 있도록 R & S 상태에 따라
+ 우리는 물론 RNN의 출력 결과물에도 관심을 갖고 있지만,

94
00:06:30,428 --> 00:06:33,988
- 그래서 당신은 이런 쇼 사진을 볼 수 있지만, 난 그냥 아르 논 것을 알고 싶다
+ RNN은 이 중간에 있는, 시간에 따라 이미지를 입력받고 출력하는 단계인 이 초록색 박스라는 것을 알아두셨으면 합니다.

95
00:06:33,988 --> 00:06:36,688
- 정말 블록은 중간에
+ RNN은 이 중간에 있는, 시간에 따라 이미지를 입력받고 출력하는 단계인 이 초록색 박스라는 것을 알아두셨으면 합니다.

96
00:06:36,689 --> 00:06:39,489
- 상태로 근무하고 시간이 지남에 사진을받을 수 있으며, 우리는 할 수 있습니다
+ RNN은 이 중간에 있는, 시간에 따라 이미지를 입력받고 출력하는 단계인 이 초록색 박스라는 것을 알아두셨으면 합니다.

97
00:06:39,488 --> 00:06:44,838
- 그래서 완전히 일부 응용 프로그램에서 상태의 상단에 몇 가지 예측 방법
+ RNN은 이 중간에 있는, 시간에 따라 이미지를 입력받고 출력하는 단계인 이 초록색 박스라는 것을 알아두셨으면 합니다.

98
00:06:44,838 --> 00:06:50,610
- 육군은 지적 이하 상태의 일종을 가지고있는 것처럼이 보일 것이다
+ RNN의 각 state는 vector들의 집합으로 나타낼 수 있고, 여기서는 h로 표기하겠습니다.

99
00:06:50,610 --> 00:06:55,399
- 빅터 H하고이 또한 될 수있는 의사의 집합은 두 개있다
+ RNN의 각 state는 vector들의 집합으로 나타낼 수 있고, 여기서는 h로 표기하겠습니다.

100
00:06:55,399 --> 00:07:00,939
- 일반 상태와 우리는 이전의 함수로의 기반하지 않은거야
+ 각각의 state(h_t) 는 바로 전 단계의 state(h_t-1)과 input vector(x_t)들의 함수로 나타낼 수 있습니다.

101
00:07:00,939 --> 00:07:05,769
- 하나 뺀 상태 관리 시간 IT 및 현재의 입력 벡터 (60)이
+ 각각의 state(h_t) 는 바로 전 단계의 state(h_t-1)과 input vector(x_t)들의 함수로 나타낼 수 있습니다.

102
00:07:05,769 --> 00:07:08,338
- 내가 재발 함수를 호출합니다 함수를 통해 수행 할 것입니다
+ 여기서의 함수는 Recurrence function 이라고 하고 파라미터 W(가중치)를 갖습니다.

103
00:07:08,338 --> 00:07:13,728
- 그 기능은 W 매개 변수가되고 우리는 우리 (W) 그 변경으로 우리는있어
+ 우리는 W 값을 변경함에 따라 RNN이 다른 결과를 보이는 걸 확인할 수 있습니다.

104
00:07:13,728 --> 00:07:16,228
- 물론 우리가 원하는 다음 아놀드 다른 행동을보고 가서
+ 우리는 W 값을 변경함에 따라 RNN이 다른 결과를 보이는 걸 확인할 수 있습니다.

105
00:07:16,228 --> 00:07:19,338
- 일부 특정 동작은 아르 논은 우리가 그 무게를 훈련 할 겁니다된다
+ 따라서 우리는 우리가 원하는 결과를 만들어낼 수 있는 적절한 W를 찾기 위해 training을 거칠 것이죠.

106
00:07:19,338 --> 00:07:23,639
- 씰 지금은 그 노래의 예를 참조에서 나는 같은 점에 유의하고 싶습니다
+ 따라서 우리는 우리가 원하는 결과를 만들어낼 수 있는 적절한 W를 찾기 위해 training을 거칠 것이죠.

107
00:07:23,639 --> 00:07:28,209
- 함수는 무게의 고정 기능 w와 함께 매 시간 단계에서 사용된다
+ 여기서 기억해야 할 것은 매 단계마다 같은 함수와 같은 W를 사용한다는 것입니다.

108
00:07:28,209 --> 00:07:31,778
- 우리는 매번 물건에 그 하나의 기능을 담당하고 허용
+ 여기서 기억해야 할 것은 매 단계마다 같은 함수와 같은 W를 사용한다는 것입니다.

109
00:07:31,778 --> 00:07:35,928
- 우리는 커밋하지 않고 일련의 외부 네트워크를 사용하는
+ 그래서 입력이나 출력 vector의 길이를 고려할 필요가 없습니다.

110
00:07:35,928 --> 00:07:38,778
- 시퀀스의 크기는 우리의 모든 동일한 기능을 적용하기 때문에
+ 그래서 입력이나 출력 vector의 길이를 고려할 필요가 없습니다.

111
00:07:38,778 --> 00:07:43,528
- 한 시간 간격에 상관없이 시간을 입력 또는 출력 순서가 그렇게되어
+ 그래서 입력이나 출력 vector의 길이를 고려할 필요가 없습니다.

112
00:07:43,528 --> 00:07:46,769
- 재발 신경망의 재발 성 신경 네트워크의 특정한 경우
+ RNN을 구현하는 가장 간단한 방법은 Vanilla RNN 입니다.

113
00:07:46,769 --> 00:07:50,309
- 당신이 사용할 수있는 간단한 재발이를 설정할 수 있습니다 간단한 방법은 무엇입니까
+ RNN을 구현하는 가장 간단한 방법은 Vanilla RNN 입니다.

114
00:07:50,309 --> 00:07:54,569
- 재발 성 신경 네트워크의 상태가이 경우에 약간의 경고로 (42) 기타
+ 여기서 RNN을 구성하는 것은 단 하나의 hidden state h 입니다. 
115
00:07:54,569 --> 00:08:00,569
- 단지 하나의 상태 시간 후 우리는 기본적으로 알려주는 크로스 공식이
+ 여기서 RNN을 구성하는 것은 단 하나의 hidden state h 입니다.

116
00:08:00,569 --> 00:08:04,039
- 당신은 이전 머리의 함수로 숨겨진 상태 나이를 업데이트하는 방법
+ 그리고 여기 Recurrence(재귀) 식은 각 hidden state를 이전 hidden state와 현재 input (x_t)로 어떻게 업데이트할 수 있는지 알려줍니다.

117
00:08:04,038 --> 00:08:04,688
- 국가의
+ 그리고 여기 Recurrence(재귀) 식은 각 hidden state를 이전 hidden state와 현재 input (x_t)로 어떻게 업데이트할 수 있는지 알려줍니다.

118
00:08:04,689 --> 00:08:08,369
- 현재 입력 엑스타인 특히 우리가있어 간단한 경우와
+ 그리고 여기 Recurrence 식은 각 hidden state를 이전 hidden state와 현재 input (x_t)로 어떻게 업데이트할 수 있는지 알려줍니다.

119
00:08:08,369 --> 00:08:10,349
- 이러한 가중치 행렬의 whaaa을해야 할 것
+ 가중치 행렬 W_hh와 W_xh에 직전 단계의 hidden state h 와 input vector x가 각각 곱해지고,

120
00:08:10,348 --> 00:08:15,238
- WX 연령 그들은 기본적으로 숨겨진 상태에서 모두 프로젝트거야
+ 가중치 행렬 W_hh와 W_xh에 직전 단계의 hidden state h_t-1 와 input vector x가 각각 곱해지고,

121
00:08:15,238 --> 00:08:18,238
- 다음 현재의 입력과 이들 추가하려고하는 이전의 시간
+ 이것이 tanh 함수에 의해 새로운 hidden state h_t로 결정되는 방식으로 업데이트 됩니다.

122
00:08:18,238 --> 00:08:21,978
- 그리고, 우리는 모든 연령에서 그들을 뭉개 버려 그것은 우리가 상태를 업데이트하는 방법
+ 이것이 tanh 함수에 의해 새로운 hidden state h_t로 결정되는 방식으로 업데이트 됩니다.

123
00:08:21,978 --> 00:08:26,199
- 이 재발 그래서 시간 t는의 함수로 어떻게 전체의 변화를 말하고 그
+ 이러한 재귀 식은 h가 시간과 현재 입력에 따라 업데이트되는 함수라는 것을 보여줍니다.

124
00:08:26,199 --> 00:08:29,769
- 역사와 또한이 시간에 현재 입력 한 다음 우리는 할 수 있습니다
+ 이러한 재귀 식은 h가 시간과 현재 입력에 따라 업데이트되는 함수라는 것을 보여줍니다.

125
00:08:29,769 --> 00:08:34,129
- 예측은 우리는 또 다른를 사용하여 예를 들어 H의 상단에 예측을 기반으로 할 수 있습니다
+ h 바로 다음에 결과물이 행렬의 형태로 출력되는 형태가 가장 간단한 형태의 RNN입니다.

126
00:08:34,129 --> 00:08:37,528
- 언덕 국가의 상단에 매트릭스 투영 그래서 이것은 단순한 완료
+ h 바로 다음에 결과물이 행렬의 형태로 출력되는 형태가 가장 간단한 형태의 RNN입니다.

127
00:08:37,528 --> 00:08:42,288
- 당신의 인생 작업에 연결할 수있는 경우는, 그래서 그냥 당신의 예를 제공합니다
+ 이게 어떻게 작동되는지 간단히 설명드리기 위해 예를 들자면,

128
00:08:42,288 --> 00:08:46,639
- 이 지금 작동하는 방법 난 그냥 섹스의 나이와 왜 추상을 이야기 해요
+ 이게 어떻게 작동되는지 간단히 설명드리기 위해 예를 들자면,

129
00:08:46,639 --> 00:08:49,299
- 우리가 실제로 이러한 요인 끝낼 수 배우면에서 형태
+ 이런 추상적인 x, h, y 등에 의미를 부여할 수 있습니다.

130
00:08:49,299 --> 00:08:53,059
- 의미와 방법 그래서 하나는 우리는 재발 성 신경을 사용할 수 있습니다
-
+ 이런 추상적인 x, h, y 등에 의미를 부여할 수 있습니다.
+ 

131
00:08:53,059 --> 00:08:56,149
- 문자 레벨 언어 모델의 경우에서와 같이, 네트워크 및이 중 하나이다
+ 예를 들어 이러한 문자 수준 언어 모델에 RNN을 적용하는 것 말이죠.

132
00:08:56,149 --> 00:08:59,899
- 직관적 인과 보는 재미 때문에 우리의 다음 설명의 나의 마음에 드는 방법
+ 저는 이 예시를 참 좋아합니다. 직관적이고 재밌거든요.

133
00:08:59,899 --> 00:09:04,698
- 그래서이 경우에 우리는 우리의 춤과를 사용하여 문자 수준의 언어 모델을
+ 그래서 RNN 기반 문자 수준 언어 모델에서는, RNN에 문자열의 순서를 주고,

134
00:09:04,698 --> 00:09:07,859
- 우리는에 문자의 시퀀스를 공급하므로 방법이 작동
+ 그래서 RNN 기반 문자 수준 언어 모델에서는, RNN에 문자열의 순서를 주고,

135
00:09:07,860 --> 00:09:10,899
- 직장과 매 시간 단계에서 역할을 반복하면 재발을 요청합니다
+ 그래서 RNN 기반 문자 수준 언어 모델에서는, RNN에 문자열의 순서를 주고,

136
00:09:10,899 --> 00:09:14,299
- 신경망은 시퀀스의 다음 문자를 예측한다 예측할
+ 지금까지의 관찰 결과를 바탕으로 각각의 단계에서 다음에 올 문자는 무엇인지 예측하게 합니다.

137
00:09:14,299 --> 00:09:16,909
- 이 시퀀스에서 다음에 와야 어떻게 생각하는지에 대한 전체 분포
+ 지금까지의 관찰 결과를 바탕으로 각각의 단계에서 다음에 올 문자는 무엇인지 예측하게 합니다.

138
00:09:16,909 --> 00:09:21,120
- 즉, 지금까지 나는이 아주 간단한 예에서 우리는이 있다고 가정 있도록 보았다
+ 간단한 예를 한번 보죠.

139
00:09:21,120 --> 00:09:25,610
- 트레이닝 시퀀스 안녕하세요 그리고 우리는에있는 문자 어휘가
+ 여기서 training 문자열 'hello'를 주면,

140
00:09:25,610 --> 00:09:29,870
- ATL의와 우리가 배울 수있는 재발 성 신경 네트워크를 얻으려고하는거야
+ 우리의 현재 어휘 목록에는 'h, e, l, o' 이렇게 4글자가 있겠죠

141
00:09:29,870 --> 00:09:33,289
- 이 훈련 데이터에 시퀀스 내의 다음 문자를 예측하는 방법이 너무
+ 그러니까 RNN은 우리의 training 문자열 데이터를 바탕으로 다음에 올 글자가 무엇인지 예측하게 됩니다.

142
00:09:33,289 --> 00:09:37,000
- (A)에서 이러한 문자의 모든 하나 하나에 공급됩니다 설정합니다으로 작동합니다
+ 구체적으로, h, e, l, o를 각각 순서대로 하나씩 RNN에 입력해 줍니다.

143
00:09:37,000 --> 00:09:40,509
- 재발 성 신경 네트워크에 시간은 처음 채팅에서 볼 수 있습니다
+ 여기서 가로축은 시간입니다. 
(역자주: 오른쪽으로 갈수록 뒤)

144
00:09:40,509 --> 00:09:47,110
- 단계 및 x 축 듣고 그래서 우리는 H II L & L 및하겠습니다 시간 시간입니다
+ h는 첫번째, e는 두번째, 그다음 l, 그다음 l

145
00:09:47,110 --> 00:09:50,629
- 히로미 코팅 문자는 우리가 하나의 뜨거운 표현 여기서 우리가 부르는 사용
+ 여기서는 'one-hot' 표기법을 사용하고 있습니다. (역자주: 0과 1로만 나타내는 것)

146
00:09:50,629 --> 00:09:53,889
- 그냥 쓴 그 문자에 대응 주문 있고 어휘 켜져
+ 여기서는 'one-hot' 표기법을 사용하고 있습니다. (역자주: 0과 1로만 나타내는 것)

147
00:09:53,889 --> 00:09:58,129
- 지금 우리는 내가 보여 재발 수식 당신이 그것을 착용을 사용하는거야
+ 그리고 아까 본 재귀 식을 사용합니다.

148
00:09:58,129 --> 00:10:01,860
- 매 시간 단계는 우리가 80로 시작하고 우리는이 적용된 가정
+ 

149
00:10:01,860 --> 00:10:04,720
@@ -3856,5 +3856,4 @@
 965
 01:09:46,640 --> 01:09:47,569
- 여기에 질문을 드리겠습니다
-
+ 여기에 질문을 드리겠습니다

From d38cf8396ef326ae36f67631c143509c3dd16ab9 Mon Sep 17 00:00:00 2001
From: YB
Date: Fri, 8 Apr 2016 23:53:11 -0400
Subject: [PATCH 028/199] Lecture1 - part 51~60 (out of 715) en / ko

---
 captions/En/Lecture1_en.srt | 51 ++++++++++++++++++-------------------
 captions/Ko/Lecture1_ko.srt | 31 +++++++++++-----------
 2 files changed, 41 insertions(+), 41 deletions(-)

diff --git a/captions/En/Lecture1_en.srt b/captions/En/Lecture1_en.srt
index 861e1d32..09ec6da5 100644
--- a/captions/En/Lecture1_en.srt
+++ b/captions/En/Lecture1_en.srt
@@ -221,8 +221,8 @@ the universe. On the Internet these are the

46
00:05:25,920 --> 00:05:30,649
-matters pixel data the other data that
-we don't know we have a hard time
+matters. Pixel data are the data that we don't know.
+We have a hard time

47
00:05:30,649 --> 00:05:36,239
@@ -244,57 +244,56 @@ onto YouTube servers for every 60
seconds

51
00:05:54,089 --> 00:06:02,739
-think about the amount of data there is
-no way that human eyes can sift through
+Think about the amount of data.
+There is no way that human eyes can sift through

52
00:06:02,740 --> 00:06:07,829
-this massive amount of data and make it
-a lot Asians
+this massive amount of data and make annotations,

53
00:06:07,829 --> 00:06:14,009
-labeling it and and and and described
-the contacts soul singer from the
+labeling it, and describe the contents.
+So, think from the

54
00:06:14,009 --> 00:06:20,980
-perspective of the YouTube team or or
-Google company if they want to help us
+perspective of the YouTube team or Google company.
+If they want to help us

55
00:06:20,980 --> 00:06:25,640
-to search index managed and of course
-for their purpose
+to search, index, manage,
+and of course for their purpose,

56
-00:06:25,639 --> 00:06:31,529
-advertisement or or whatever manipulate
-the content of the data were at a loss
+00:06:25,641 --> 00:06:31,529
+put advertisement or whatever, manipulate
+the content of the data, we're at a loss,

57
00:06:31,529 --> 00:06:38,919
-because nobody can take this the only
-hope we can do this is true vision
+because nobody can hand-annotate this.
+The only hope we can do this is through vision

58
00:06:38,920 --> 00:06:44,640
-technology to be able to label the
-objects financings vinyl frames
+technology. To be able to label the
+objects, find the scenes, find the frames,

59
-00:06:44,639 --> 00:06:50,349
-you know lo que where basketball video
-were Kobe Bryant's making like that
+00:06:44,641 --> 00:06:50,349
+you know, locate that basketball video
+where Kobe Bryant is making like that

60
00:06:50,350 --> 00:06:57,320
-awesome shot and social these are the
-problems we are facing today that the
+awesome shot. 
So, these are the +problems that we are facing today that the 61 -00:06:57,319 --> 00:07:02,860 -massive amount of data and the the +00:06:57,321 --> 00:07:02,860 +massive amount of data and the challenges of the dark matter so 62 diff --git a/captions/Ko/Lecture1_ko.srt b/captions/Ko/Lecture1_ko.srt index 22f0f94a..9f11a39e 100644 --- a/captions/Ko/Lecture1_ko.srt +++ b/captions/Ko/Lecture1_ko.srt @@ -56,7 +56,7 @@ 15 00:01:45,840 --> 00:01:48,659 - 안드레에 대해서는 많은 소개가 필요 없을 듯 합니다. 많은 분들이 아마 그를 알고 있을거예요. + 안드레에 대해서는 많은 소개가 필요 없을 듯 합니다. 많은 분들이 아마 그를 알고 있을 겁니다. 16 00:01:48,659 --> 00:01:53,960 @@ -194,7 +194,7 @@ 49 00:05:39,091 --> 00:05:49,560 - 유튜브 서버들은 60 초마다 우리는 150시간 이상의 동영상이 업로드됩니다. + 매 60초마다 유튜브 서버들로 150시간 이상되는 분량의 동영상이 업로드됩니다. 50 00:05:49,560 --> 00:05:54,089 @@ -202,46 +202,47 @@ 51 00:05:54,089 --> 00:06:02,739 - 인간의 눈은 가려 낼 수있는 방법이 없습니다 데이터의 양에 대해 생각 + 데이터의 양에 대해 생각해보면 인간의 눈으로는 이 방대한 데이터를 52 00:06:02,740 --> 00:06:07,829 - 데이터의이 방대한 양과는 많은 아시아 확인 + 가려내고 53 00:06:07,829 --> 00:06:14,009 - 그것과와와 라벨과에서 연락처 영혼 가수 설명 + 분류하여 내용을 묘사할 방법이 없습니다. 54 00:06:14,009 --> 00:06:20,980 - YouTube 팀이나 또는 Google 회사의 관점 그들은 우리를 도와하려면 + YouTube 팀이나 또는 Google 회사의 관점에서 생각해보면 55 00:06:20,980 --> 00:06:25,640 - 자신의 목적을 위해 인덱스를 관리하고 물론 검색하기 + 그들이 이 데이터들을 검색하고 분류하고 또는 그들을 위한 광고를 넣고 56 -00:06:25,639 --> 00:06:31,529 - 광고 나 또는 어떤 조작은 데이터의 내용은 손실이었다 +00:06:25,641 --> 00:06:31,529 + 무엇을 하려고 하던지간에 이건 답이 없어요. 57 00:06:31,529 --> 00:06:38,919 - 아무도 우리가 할 수있는이 유일한 희망을 수 없기 때문이 사실의 비전입니다 + 아무도 직접 손으로 분류를 할 수가 없기 때문이죠. + 우리가 가진 유일한 희망은 Vision 기술입니다. 58 00:06:38,920 --> 00:06:44,640 - 기술은 객체 파이낸싱 비닐 프레임 레이블을 할 수 + 사물이나 풍경들을 알아내고 59 -00:06:44,639 --> 00:06:50,349 - 농구 비디오 그런 코비 브라이언트의 결정이었다 어디 싸다 가야 알고 +00:06:44,641 --> 00:06:50,349 + 코비 브라이언트가 끝내주는 슛을 날리는 농구 비디오를 찾아내는 거죠. 60 00:06:50,350 --> 00:06:57,320 - 멋진 샷과 사회이 우리가 오늘 직면하고있는 문제입니다 + 이런 것들이 지금 우리가 직면하고있는 문제들입니다. 61 -00:06:57,319 --> 00:07:02,860 +00:06:57,321 --> 00:07:02,860 대용량 데이터의 양 문제 때문에 어둠의 도전 62 From d778f095739668d54237659bc4d04e01a3955cd5 Mon Sep 17 00:00:00 2001 From: JK Im Date: Sat, 9 Apr 2016 00:21:00 -0500 Subject: [PATCH 029/199] Update optimization-1.md --- optimization-1.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/optimization-1.md b/optimization-1.md index 1ed36997..c91ecec6 100644 --- a/optimization-1.md +++ b/optimization-1.md @@ -57,7 +57,7 @@ $$ L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + 1) \right] $$ -It is clear from the equation that the data loss for each example is a sum of (zero-thresholded due to the $\max(0,-)$ function) linear functions of $W$. Moreover, each row of $W$ (i.e. $w_j$) sometimes has a positive sign in front of it (when it corresponds to a wrong class for an example), and sometimes a negative sign (when it corresponds to the correct class for that example). To make this more explicit, consider a simple dataset that contains three 1-dimensional points and three classes. The full SVM loss (without regularization) becomes: +수식에서 명백히 볼 수 있듯이, 각 예시의 손실(loss)값은 ($\max(0,-)$ 함수로 인해 0에서 막혀있는) $W$의 선형함수들의 합으로 표현된다. $W$의 각 행(즉, $w_j$) 앞에는 때때로 (잘못된 분류일 때, 즉, $j\neq y_i$인 경우) 플러스가 붙고, 때때로 (옳은 분류일 때) 마이너스가 붙는다. 더 명확히 표현하자면, 3개의 1차원 점들과 3개의 클래스가 있다고 해보자. Regularization 없는 총 SVM 손실(loss)은 다음과 같다. $$ \begin{align} @@ -68,28 +68,28 @@ L = & (L_0 + L_1 + L_2)/3 \end{align} $$ -Since these examples are 1-dimensional, the data $x_i$ and weights $w_j$ are numbers. Looking at, for instance, $w_0$, some terms above are linear functions of $w_0$ and each is clamped at zero. 
We can visualize this as follows: +이 예시들이 1차원이기 때문에, 데이타 $x_i$와 모수(parameter/weight) $w_j$는 숫자(역자 주: 즉, 스칼라. 따라서 위 수식에서 전치행렬을 뜻하는 $T$ 표시는 필요없음)이다. 예를 들어 $w_0$ 를 보면, 몇몇 항들은 $w_0$의 선형함수이고 각각은 0에서 꺾인다. 이를 다음과 같이 시각화할 수 있다.
- 1-dimensional illustration of the data loss. The x-axis is a single weight and the y-axis is the loss. The data loss is a sum of multiple terms, each of which is either independent of a particular weight, or a linear function of it that is thresholded at zero. The full SVM data loss is a 30,730-dimensional version of this shape. + 손실(loss)를 1차원으로 표현한 그림. x축은 모수(parameter/weight) 하나이고, y축은 손실(loss)이다. 손실(loss)는 여러 항들의 합인데, 그 각각은 특정 모수(parameter/weight)값과 무관하거나, 0에 막혀있는 그 모수(parameter/weight)의 선형함수이다. 전체 SVM 손실은 이 모양의 30,730차원 버전이다.
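
(역자 주: 위의 1차원 예시가 왜 0에서 꺾이는 선형 항들의 합이 되는지는 간단한 코드로도 직접 확인해 볼 수 있다. 아래는 역자가 덧붙이는 최소한의 스케치로, `xs`, `ys`와 고정해 둔 나머지 모수 값들은 설명을 위해 임의로 가정한 것이다.)

~~~python
import numpy as np

# 가상의 1차원 데이터: 점 3개, 정답 클래스는 각각 0, 1, 2 (임의로 가정한 값)
xs = np.array([1.0, -2.0, 0.5])
ys = [0, 1, 2]

def full_svm_loss(w, xs, ys, delta=1.0):
  # w: 클래스마다 모수가 하나씩인 벡터 (w_0, w_1, w_2)
  loss = 0.0
  for x, y in zip(xs, ys):
    scores = w * x  # 1차원이므로 내적이 단순한 곱이 된다
    for j in range(len(w)):
      if j == y:
        continue
      loss += max(0.0, scores[j] - scores[y] + delta)
  return loss / len(xs)  # 위 수식처럼 예시 개수(3)로 나눈 평균

# w_1, w_2를 고정하고 w_0만 움직여 보면 꺾인 선형(piecewise linear) 모양이 드러난다
for w0 in np.linspace(-3, 3, 7):
  print w0, full_svm_loss(np.array([w0, 0.3, -0.5]), xs, ys)
~~~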
-As an aside, you may have guessed from its bowl-shaped appearance that the SVM cost function is an example of a [convex function](http://en.wikipedia.org/wiki/Convex_function) There is a large amount of literature devoted to efficiently minimizing these types of functions, and you can also take a Stanford class on the topic ( [convex optimization](http://stanford.edu/~boyd/cvxbook/) ). Once we extend our score functions $f$ to Neural Networks our objective functions will become non-convex, and the visualizations above will not feature bowls but complex, bumpy terrains. +옆길로 새면, 아마도 밥공기 모양을 보고 SVM 손실함수(loss function)이 일종의 [볼록함수](http://en.wikipedia.org/wiki/Convex_function)라고 생각했을 것이다. 이런 형태의 함수를 효율적으로 최소화하는 문제에 대한 엄청난 양의 연구 성과들이 있다. 스탠포드 강좌 중에서도 이 주제를 다룬 것도 있다. ( [볼록함수 최적화](http://stanford.edu/~boyd/cvxbook/) ). 이 점수함수(score function) $f$를 신경망(neural networks)로 확장시키면, 목적함수(역자 주: 손실함수(loss function))은 더이상 볼록함수가 아니게 되고, 위와 같은 시각화를 해봐도 밥공기 모양 대신 울퉁불퉁하고 복잡한 모양이 보일 것이다. -*Non-differentiable loss functions*. As a technical note, you can also see that the *kinks* in the loss function (due to the max operation) technically make the loss function non-differentiable because at these kinks the gradient is not defined. However, the [subgradient](http://en.wikipedia.org/wiki/Subderivative) still exists and is commonly used instead. In this class will use the terms *subgradient* and *gradient* interchangeably. +*미분이 불가능한 손실함수(loss functions)*. 기술적인 설명을 덧붙이자면, $\max(0,-)$ 함수 때문에 손실함수(loss functionn)에 *꺾임*이 생기는데, 이 때문에 손실함수(loss functions)는 미분이 불가능해진다. 왜냐하면, 그 꺾이는 부분에서 미분 혹은 그라디언트가 존재하지 않기 때문이다. 하지만, [서브그라디언트(subgradient)](http://en.wikipedia.org/wiki/Subderivative)가 존재하고, 대체로 이를 그라디언트(gradient) 대신 이용한다. 앞으로 이 강의에서는 *그라디언트(gradient)*와 *서브그라디언트(subgradient)*를 구분하지 않고 쓸 것이다. - + -### Optimization +### 최적화 -To reiterate, the loss function lets us quantify the quality of any particular set of weights **W**. The goal of optimization is to find **W** that minimizes the loss function. We will now motivate and slowly develop an approach to optimizing the loss function. For those of you coming to this class with previous experience, this section might seem odd since the working example we'll use (the SVM loss) is a convex problem, but keep in mind that our goal is to eventually optimize Neural Networks where we can't easily use any of the tools developed in the Convex Optimization literature. +정리하면, 손실함수(loss function)는 모수(parameter/weight) **W** 행렬의 질을 측정한다. 최적화의 목적은 이 손실함수(loss function)을 최소화시키는 **W**을 찾아내는 것이다. 다음 단락부터 손실함수(loss function)을 최적화하는 방법에 대해서 찬찬히 살펴볼 것이다. 이전에 경험이 있는 사람들이 보면 이 섹션은 좀 이상하다고 생각할지 모르겠다. 왜냐하면, 여기서 쓰인 예제 (즉, SVM 손실(loss))가 볼록함수이기 때문이다. 하지만, 우리의 궁극적인 목적은 신경망(neural networks)를 최적화시키는 것이고, 여기에는 볼록함수 최적화를 위해 고안된 방법들이 쉽사리 통히지 않는다. -#### Strategy #1: A first very bad idea solution: Random search +#### 전략 #1: 첫번째 매우 나쁜 방법: 무작위 탐색 (Random search) Since it is so simple to check how good a given set of parameters **W** is, the first (very bad) idea that may come to mind is to simply try out many different random weights and keep track of what works best. 
This procedure might look as follows: From bedbc591c3272c3e45647a7ec3903a6a0f322161 Mon Sep 17 00:00:00 2001 From: JK Im Date: Sat, 9 Apr 2016 00:41:23 -0500 Subject: [PATCH 030/199] Update optimization-1.md --- optimization-1.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/optimization-1.md b/optimization-1.md index c91ecec6..554df1d9 100644 --- a/optimization-1.md +++ b/optimization-1.md @@ -91,7 +91,7 @@ $$ #### 전략 #1: 첫번째 매우 나쁜 방법: 무작위 탐색 (Random search) -Since it is so simple to check how good a given set of parameters **W** is, the first (very bad) idea that may come to mind is to simply try out many different random weights and keep track of what works best. This procedure might look as follows: +Since it is so simple to check how good a given set of parameters 주어진 모수(parameter/weight) **W**이 얼마나 좋은지를 측정하는 것은 매우 간단하기 때문에, 처음 떠오르는 (매우 나쁜) 생각은, 단순히 무작위로 모수(parameter/weight)을 골라서 넣어보고 넣어 본 값들 중 제일 좋은 값을 기록하는 것이다. 그 과정은 다음과 같다. ~~~python # assume X_train is the data where each column is an example (e.g. 3073 x 50,000) @@ -118,7 +118,7 @@ for num in xrange(1000): # ... (trunctated: continues for 1000 lines) ~~~ -In the code above, we see that we tried out several random weight vectors **W**, and some of them work better than others. We can take the best weights **W** found by this search and try it out on the test set: +위의 코드에서, 여러 개의 무작위 모수(parameter/weight) **W**를 넣어봤고, 그 중 몇몇은 다른 것들보다 좋았다. 그래서 그 중 제일 좋은 모수(parameter/weight) **W**을 테스트 데이터에 넣어보면 된다. ~~~python # Assume X_test is [3073 x 10000], Y_test [10000 x 1] @@ -130,13 +130,13 @@ np.mean(Yte_predict == Yte) # returns 0.1555 ~~~ -With the best **W** this gives an accuracy of about **15.5%**. Given that guessing classes completely at random achieves only 10%, that's not a very bad outcome for a such a brain-dead random search solution! +이 방법으로 얻은 최선의 **W**는 정확도 **15.5%**이다. 완전 무작위 찍기가 단 10%의 정확도를 보이므로, 무식한 방법 치고는 그리 나쁜 것은 아니다. -**Core idea: iterative refinement**. Of course, it turns out that we can do much better. The core idea is that finding the best set of weights **W** is a very difficult or even impossible problem (especially once **W** contains weights for entire complex neural networks), but the problem of refining a specific set of weights **W** to be slightly better is significantly less difficult. In other words, our approach will be to start with a random **W** and then iteratively refine it, making it slightly better each time. +**핵심 아이디어: 반복적 향상**. 물론 이보다 더 좋은 방법들이 있다. 여기서 핵심 아이디어는, 최선의 모수(parameter/weight) **W**을 찾는 것은 매우 어렵거나 때로는 불가능한 문제(특히 복잡한 신경망(neural network) 전체를 구현할 경우)이지만, 어떤 주어진 모수(parameter/weight) **W**을 조금 개선시키는 일은 훨씬 덜 힘들다는 점이다. 다시 말해, 우리의 접근법은 무작위로 뽑은 **W**에서 출발해서 매번 조금씩 개선시키는 것을 반복하는 것이다. -> Our strategy will be to start with random weights and iteratively refine them over time to get lower loss +> 우리의 전략은 무작위로 뽑은 모수(parameter/weight)으로부터 시작해서 반복적으로 조금씩 개선시켜 손실(loss)을 낮추는 것이다. -**Blindfolded hiker analogy.** One analogy that you may find helpful going forward is to think of yourself as hiking on a hilly terrain with a blindfold on, and trying to reach the bottom. In the example of CIFAR-10, the hills are 30,730-dimensional, since the dimensions of **W** are 3073 x 10. At every point on the hill we achieve a particular loss (the height of the terrain). +**눈가리고 등산하는 것에 비유.** 앞으로 도움이 될만한 비유는, 경사진 지형에서 눈가리개를 하고 점점 아래로 내려오는 자기 자신을 생각해보는 것이다. CIFAR-10의 예시에서, 그 언덕들은 (**W**가 3073 x 10 차원이므로) 30,730차원이다. 언덕의 각 지점에는 특정 손실값(loss), 즉, 지형의 고도가 주어진다. 
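
(역자 주: 위의 '반복적 향상' 아이디어를 코드로 감을 잡아 보자면 대략 다음과 같은 모양이 된다. 무작위 탐색 코드와 동일한 가정(`L`, `X_train`, `Y_train`)을 사용하는 스케치이며, `step_size` 값은 임의로 정한 것이다.)

~~~python
# 무작위 W에서 시작해서, 그 근처를 무작위로 찔러 보고 손실(loss)이 줄어들 때만 이동한다.
W = np.random.randn(10, 3073) * 0.001
bestloss = L(X_train, Y_train, W)
for i in xrange(1000):
  step_size = 0.0001
  Wtry = W + np.random.randn(10, 3073) * step_size # 현재 W 주변의 한 점을 시도
  loss = L(X_train, Y_train, Wtry)
  if loss < bestloss: # 더 나아진 경우에만 그 점으로 이동
    W = Wtry
    bestloss = loss
~~~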

From c355d5487a61949d31544f619f222be6552f3c8a Mon Sep 17 00:00:00 2001
From: JK Im
Date: Sat, 9 Apr 2016 00:42:52 -0500
Subject: [PATCH 031/199] Update optimization-1.md

---
 optimization-1.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/optimization-1.md b/optimization-1.md
index 554df1d9..8bf56c28 100644
--- a/optimization-1.md
+++ b/optimization-1.md
@@ -91,7 +91,7 @@ $$

 #### 전략 #1: 첫번째 매우 나쁜 방법: 무작위 탐색 (Random search)

-Since it is so simple to check how good a given set of parameters 주어진 모수(parameter/weight) **W**이 얼마나 좋은지를 측정하는 것은 매우 간단하기 때문에, 처음 떠오르는 (매우 나쁜) 생각은, 단순히 무작위로 모수(parameter/weight)을 골라서 넣어보고 넣어 본 값들 중 제일 좋은 값을 기록하는 것이다. 그 과정은 다음과 같다.
+주어진 모수(parameter/weight) **W**이 얼마나 좋은지를 측정하는 것은 매우 간단하기 때문에, 처음 떠오르는 (매우 나쁜) 생각은, 단순히 무작위로 모수(parameter/weight)을 골라서 넣어보고 넣어 본 값들 중 제일 좋은 값을 기록하는 것이다. 그 과정은 다음과 같다.

 ~~~python
 # assume X_train is the data where each column is an example (e.g. 3073 x 50,000)

From 33c2b7b61d6b05fb8edc3946bf75074182f3c94a Mon Sep 17 00:00:00 2001
From: JK Im
Date: Sat, 9 Apr 2016 00:44:26 -0500
Subject: [PATCH 032/199] Update optimization-1.md

---
 optimization-1.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/optimization-1.md b/optimization-1.md
index 8bf56c28..1cc4013b 100644
--- a/optimization-1.md
+++ b/optimization-1.md
@@ -136,7 +136,7 @@ np.mean(Yte_predict == Yte)

 > 우리의 전략은 무작위로 뽑은 모수(parameter/weight)으로부터 시작해서 반복적으로 조금씩 개선시켜 손실(loss)을 낮추는 것이다.

-**눈가리고 등산하는 것에 비유.** 앞으로 도움이 될만한 비유는, 경사진 지형에서 눈가리개를 하고 점점 아래로 내려오는 자기 자신을 생각해보는 것이다. CIFAR-10의 예시에서, 그 언덕들은 (**W**가 3073 x 10 차원이므로) 30,730차원이다. 언덕의 각 지점에는 특정 손실값(loss), 즉, 지형의 고도가 주어진다.
+**눈가리고 하산하는 것에 비유.** 앞으로 도움이 될만한 비유는, 경사진 지형에서 눈가리개를 하고 점점 아래로 내려오는 자기 자신을 생각해보는 것이다. CIFAR-10의 예시에서, 그 언덕들은 (**W**가 3073 x 10 차원이므로) 30,730차원이다. 언덕의 각 지점에는 특정 손실값(loss), 즉, 지형의 고도가 주어진다.

From 8e0b91af25a4ef32414aa2b20119d93d3f9f945f Mon Sep 17 00:00:00 2001
From: Taeksoo Kim
Date: Sat, 9 Apr 2016 15:17:24 +0900
Subject: [PATCH 033/199] Update convolutional-networks-korean.md

---
 convolutional-networks-korean.md | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/convolutional-networks-korean.md b/convolutional-networks-korean.md
index f7d2bfa1..62ede091 100644
--- a/convolutional-networks-korean.md
+++ b/convolutional-networks-korean.md
@@ -57,30 +57,30 @@ CNN은 입력이 이미지로 이뤄져 있다는 특징을 살려 좀 더 합

 - POOL 레이어는 (가로,세로) 차원에 대해 다운샘플링 (downsampling)을 수행해 [16x16x12]와 같이 줄어든 볼륨을 출력한다.
 - FC (fully-connected) 레이어는 클래스 점수들을 계산해 [1x1x10]의 크기를 갖는 볼륨을 출력한다. 10개 숫자들은 10개 카테고리에 대한 클래스 점수에 해당한다. 레이어의 이름에서 유추 가능하듯, 이 레이어는 이전 볼륨의 모든 요소와 연결되어 있다.

-In this way, ConvNets transform the original image layer by layer from the original pixel values to the final class scores. Note that some layers contain parameters and other don't. In particular, the CONV/FC layers perform transformations that are a function of not only the activations in the input volume, but also of the parameters (the weights and biases of the neurons). On the other hand, the RELU/POOL layers will implement a fixed function. The parameters in the CONV/FC layers will be trained with gradient descent so that the class scores that the ConvNet computes are consistent with the labels in the training set for each image.
+이와 같이, CNN은 픽셀 값으로 이뤄진 원본 이미지를 각 레이어를 거치며 클래스 점수로 변환 (transform) 시킨다. 한 가지 기억할 것은, 어떤 레이어는 모수 (parameter)를 갖지만 어떤 레이어는 모수를 갖지 않는다는 것이다. 특히 CONV/FC 레이어들은 입력 볼륨의 액티베이션(activation)뿐만 아니라 모수(뉴런의 가중치(weight)와 바이어스(bias))에도 의존하는 변환(transformation)을 수행한다. 반면 RELU/POOL 레이어들은 고정된 함수이다. 
CONV/FC 레이어의 모수 (parameter)들은 각 이미지에 대한 클래스 점수가 해당 이미지의 레이블과 같아지도록 그라디언트 디센트 (gradient descent)로 학습된다.

-In summary:
+요약해보면:

-- A ConvNet architecture is a list of Layers that transform the image volume into an output volume (e.g. holding the class scores)
-- There are a few distinct types of Layers (e.g. CONV/FC/RELU/POOL are by far the most popular)
-- Each Layer accepts an input 3D volume and transforms it to an output 3D volume through a differentiable function
-- Each Layer may or may not have parameters (e.g. CONV/FC do, RELU/POOL don't)
-- Each Layer may or may not have additional hyperparameters (e.g. CONV/FC/POOL do, RELU doesn't)
+- CNN 아키텍쳐는 여러 레이어를 통해 입력 이미지 볼륨을 출력 볼륨 (예: 클래스 점수)으로 변환시켜 준다.
+- CNN은 몇 가지 종류의 레이어로 구성되어 있다. CONV/FC/RELU/POOL 레이어가 현재 가장 많이 쓰인다.
+- 각 레이어는 3차원의 입력 볼륨을 미분 가능한 함수를 통해 3차원 출력 볼륨으로 변환시킨다.
+- 모수(parameter)가 있는 레이어도 있고 그렇지 않은 레이어도 있다 (FC/CONV는 모수를 갖고 있고, RELU/POOL 등은 모수가 없음).
+- 초모수 (hyperparameter)가 있는 레이어도 있고 그렇지 않은 레이어도 있다 (CONV/FC/POOL 레이어는 초모수를 가지며 RELU는 가지지 않음).
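(역자 주: 위 요약 중 '모수가 없는 레이어는 고정된 함수'라는 점은 코드로 보면 더 분명하다. 아래는 원문에 없는 numpy 스케치로, 본문 예시 크기의 볼륨을 가정해 RELU와 2x2 max 풀링이 학습되는 모수 없이 볼륨을 변환함을 보여준다.)

~~~python
import numpy as np

x = np.random.randn(32, 32, 12)      # CONV를 거쳤다고 가정한 볼륨 [32x32x12]
relu_out = np.maximum(0, x)          # RELU: 모수 없이 크기 유지 [32x32x12]
pool_out = relu_out.reshape(16, 2, 16, 2, 12).max(axis=(1, 3))  # 2x2 max 풀링
print(pool_out.shape)                # (16, 16, 12) - 가로/세로만 다운샘플링
~~~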
- The activations of an example ConvNet architecture. The initial volume stores the raw image pixels and the last volume stores the class scores. Each volume of activations along the processing path is shown as a column. Since it's difficult to visualize 3D volumes, we lay out each volume's slices in rows. The last layer volume holds the scores for each class, but here we only visualize the sorted top 5 scores, and print the labels of each one. The full web-based demo is shown in the header of our website. The architecture shown here is a tiny VGG Net, which we will discuss later.
+ CNN 아키텍쳐의 액티베이션 (activation) 예제. 첫 볼륨은 로우 이미지(raw image)의 픽셀들을 담고 있으며, 마지막 볼륨은 클래스 점수들을 출력한다. 입/출력 사이의 액티베이션 볼륨들은 그림의 각 열에 나타나 있다. 3차원 볼륨을 시각적으로 나타내기가 어렵기 때문에 각 볼륨의 슬라이스들을 행으로 펼쳐서 나타냈다. 마지막 레이어는 모든 클래스에 대한 점수를 나타내지만 여기에서는 상위 5개 클래스에 대한 점수와 레이블만 표시했다. 전체 웹 데모는 우리의 웹사이트 상단에 있다. 여기에서 사용된 아키텍쳐는 나중에 다룰 작은 VGG Net이다.
-We now describe the individual layers and the details of their hyperparameters and their connectivities. +이제 각각의 레이어에 대해 초모수(hyperparameter)나 연결성 (connectivity) 등의 세부 사항들을 알아보도록 하자. -#### Convolutional Layer +#### 컨볼루셔널 레이어 (이하 CONV) -The Conv layer is the core building block of a Convolutional Network, and its output volume can be interpreted as holding neurons arranged in a 3D volume. We now discuss the details of the neuron connectivities, their arrangement in space, and their parameter sharing scheme. +CONV 레이어는 CNN을 이루는 핵심 요소이다. CONV 레이어의 출력은 3차원으로 정렬된 뉴런들로 해석될 수 있다. 이제부터는 뉴런들의 연결성 (connectivity), 그들의 공간상의 배치, 그리고 모수 공유(parameter sharing) 에 대해 알아보자. **Overview and Intuition.** The CONV layer's parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume. During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume, producing a 2-dimensional activation map of that filter. As we slide the filter, across the input, we are computing the dot product between the entries of the filter and the input. Intuitively, the network will learn filters that activate when they see some specific type of feature at some spatial position in the input. Stacking these activation maps for all filters along the depth dimension forms the full output volume. Every entry in the output volume can thus also be interpreted as an output of a neuron that looks at only a small region in the input and shares parameters with neurons in the same activation map (since these numbers all result from applying the same filter). We now dive into the details of this process. From 48a0cd3d128f55dbdeb37dc456c02cee72496cc0 Mon Sep 17 00:00:00 2001 From: MaybeS Date: Sat, 9 Apr 2016 15:37:55 +0900 Subject: [PATCH 034/199] Translate assignment1 to assignment1-kore + trans.md --- assignments2016/assignment1-kor.md | 81 ++++++++++++++++++++++++++++ assignments2016/assignment1-trans.md | 41 ++++++++++++++ 2 files changed, 122 insertions(+) create mode 100644 assignments2016/assignment1-kor.md create mode 100644 assignments2016/assignment1-trans.md diff --git a/assignments2016/assignment1-kor.md b/assignments2016/assignment1-kor.md new file mode 100644 index 00000000..6b6054c7 --- /dev/null +++ b/assignments2016/assignment1-kor.md @@ -0,0 +1,81 @@ +--- +layout: page +mathjax: true +permalink: /assignments2016/assignment1/ +--- +이번 숙제에서 여러분은 간단한 이미지 분류 파이프라인을 k-Nearest neighbor 또는 SVM/Softmax 분류기에 기반하여 넣는 방법을 연습할 수 있습니다. 이번 숙제의 목표는 다음과 같습니다. + +- **이미지 분류 파이프라인**의 기초와 데이터 중심의 접근방식에 대해 이해합니다. +- 학습/확인/테스트의 분할과 **hyperparameter tuning**를 위해 검증 데이터를 사용하는 것에 관해 이해합니다. +- 효율적으로 작성된 **벡터화**된 numpy 코드로 proficiency을 나타나게 합니다. +- k-Nearest Neighbor (**kNN**) 분류기를 수행하고 적용해봅니다. +- Multiclass Support Vector Machine (**SVM**) 분류기를 수행하고 적용해봅니다. +- **Softmax** 분류기를 수행하고 적용해봅니다. +- **Two layer neural network** 분류기를 수행하고 적용해봅니다. +- 위 분류기들의 장단점과 차이에 대해 이해합니다. +- 성능향상을 위해 raw pixels보다 **higher-level representations**을 사용하는 이유에 관하여 이해합니다. (color histograms, Histogram of Gradient (HOG) features) + +## 설치 +여러분은 다음 두가지 방법으로 숙제를 수행할 수 있습니다: Terminal.com을 이용한 가상 환경 또는 로컬 환경. + +### Termianl에서의 가상 환경. +Terminal에는 우리의 수업을 위한 서브도메인이 만들어져 있습니다. [www.stanfordterminalcloud.com](https://www.stanfordterminalcloud.com) 계정을 등록하세요. 이번 숙제에 대한 스냅샷은 [여기](https://www.stanfordterminalcloud.com/snapshot/49f5a1ea15dc424aec19155b3398784d57c55045435315ce4f8b96b62819ef65)에서 찾아볼 수 있습니다. 
만약 수업에 등록되었다면, TA(see Piazza for more information)에게 이 수업을 위한 Terminal 예산을 요구할 수 있습니다. 처음 스냅샷을 실행시키면, 수업을 위한 모든 것이 설치되어 있어서 바로 숙제를 수행할 수 있습니다. [여기](/terminal-tutorial)에 Terminal을 위한 간단한 튜토리얼을 작성해 뒀습니다. + +### 로컬 환경 +[여기](http://vision.stanford.edu/teaching/cs231n/winter1516_assignment1.zip)에서 압축파일을 다운받고 다음을 따르세요. + +**[선택 1] Use Anaconda:** +과학, 수학, 공학, 데이터 분석을 위한 다양하고 유명한 패키지들을 담고있는 [Anaconda](https://www.continuum.io/downloads)를 사용하여 설치하는 것이 즐겨 쓰이는 방법입니다. 설치가 다 되면 모든 요구사항을 넘기고 바로 숙제를 수행해도 좋습니다. + +**[선택 2] Manual install, virtual environment:** +만약 Anaconda 대신 좀 더 일반적이고 위험한 방법을 택하고 싶다면 프로젝트를 위한 [virtual environment](http://docs.python-guide.org/en/latest/dev/virtualenvs/)를 만들 수 있습니다. 만약 virtual environment를 사용하지 않는다면 모든 코드가 컴퓨터에 전역적으로 종속되게 설치됩니다. virtual environment의 설정은 아래를 참조하세요. + +~~~bash +cd assignment1 +sudo pip install virtualenv # 아마 먼저 설치되어 있을 겁니다. +virtualenv .env # virtual environment를 만듭니다. +source .env/bin/activate # virtual environment를 활성화 합니다. +pip install -r requirements.txt # dependencies 설치합니다. +# Work on the assignment for a while ... +deactivate # virtual environment를 종료합니다. +~~~ + +**Download data:** +먼저 숙제를 수행하기전에 CIFAR-10 dataset를 다운로드해야 합니다. 아래를 `assignment1` 폴더에서 실행하세요: + +~~~bash +cd cs231n/datasets +./get_datasets.sh +~~~ + +**Start IPython:** +CIFAR-10 data를 받았다면, `assignment1` 폴더의 IPython notebook server를 시작할 수 있습니다. IPython에 친숙하지 않다면 작성해둔 [IPython tutorial](/ipython-tutorial)를 읽어보는 것을 권장합니다. + +**NOTE:** OSX에서 virtual environment를 실행하면, matplotlib 에러가 날 수 있습니다([이 문제에 관한 이슈](http://matplotlib.org/faq/virtualenv_faq.html)). IPython server를 `assignment1`폴더의 `start_ipython_osx.sh`라고 실행하면 이 문제를 피해갈 수 있습니다; 이 스크립트는 virtual environment가 `.env`라고 되어있다고 가정하고 작성되었습니다. + +### 과제 제출: +로컬 환경이나 Terminal에서 숙제를 마쳤다면 `collectSubmission.sh`스크립트를 실행합니다; 이 스크립트는 `assignment1.zip`파일을 만듭니다. 이 파일을 [the coursework](https://coursework.stanford.edu/portal/site/W16-CS-231N-01/)에 업로드하세요. + + +### Q1: k-Nearest Neighbor 분류기 (20 points) + +IPython Notebook **knn.ipynb**이 kNN 분류기를 수행하는 것을 안내합니다. + +### Q2: Support Vector Machine 훈련 (25 points) + +IPython Notebook **svm.ipynb**이 SVM 분류기를 수행하는 것을 안내합니다. + +### Q3: Softmax 분류기 실행하기 (20 points) + +IPython Notebook **softmax.ipynb**이 Softmax 분류기를 수행하는 것을 안내합니다. + +### Q4: Two-Layer Neural Network (25 points) + +IPython Notebook **two_layer_net.ipynb**이 two-layer neural network 분류기를 수행하는 것을 안내합니다. + +### Q5: Higher Level Representations: 이미지 특징 (10 points) + +IPython Notebook **features.ipynb**을 사용하여 higher-level representations이 raw pixel보다 개선이 이루어졌는지 검사합니다. + +### Q6: 추가 과제: 뭔가 더 해보세요! (+10 points) +이번 과제와 관련된 다른 것들을 작성한 코드로 분석하고 연구해보세요. 예를 들어, 질문하고 싶은 흥미로운 질문이 있나요? 통찰력 있는 시각화를 작성할 수 있나요? 아니면 다른 재미있는 살펴볼 거리가 있나요? Or maybe you can experiment with a spin on the loss function 만약 다른 멋있는 것을 시도해본다면 추가로 10 points를 얻을 수 있고 강의에 수행한 결과가 실릴 수 있습니다. \ No newline at end of file diff --git a/assignments2016/assignment1-trans.md b/assignments2016/assignment1-trans.md new file mode 100644 index 00000000..e653ecbf --- /dev/null +++ b/assignments2016/assignment1-trans.md @@ -0,0 +1,41 @@ +# 이 문서는 assignment1을 assignment1-kor로 번역하면서 만들어진 문서입니다. +이 문서에는 번역하는 도중 겪었던 곤란한 부분이나 개선이 필요한 부분에 대해서 열거합니다. + +## 이 단어들을 어떻게 번역하면 좋을까요? +- SVM +- Softmax +- Implement +- Image Classification pipeline + - pipeline + - 이미지 분류 파이프라인 +- data-driven approach + -데이터 중심의 접근방식 +- tuning +- hyperparameter +- validation data +- proficiency +- raw pixels +- color histograms, +- Histogram of Gradient (HOG) features +- matplotlib +- insightful + +## 이 부분은 이렇게 고쳐도 될것 같습니다. 
+`현재표기 -> 더 나은(이라고 생각되는) 표기` + 또는 +`원문 <- 현재표기` + +- 숙제를 수행하다. -> 숙제를 하다. +- 수업에 등록되었다면. -> ? +- 일반적인(manual) -> ? + +- Higher Level Representations <- 고수준 표현 +- Credit <- 예산 +- install dependencies <- dependencies를 설치합니다. + +## 이 부분은 이렇게 고쳤습니다. +- Setup에서 설명엔 local - virtual 순이지만 소 제목에서는 ###virtual - local 순 이기에 일치시켰습니다. + + +## 다른 문제될 사항 +- Q6의 Or maybe you can experiment with a spin on the loss function 가 번역이 매끄럽게 되지 않아서 자문을 구합니다. From 58734f93d6ed0b6268fbbbe834f3c0cab7d40ccf Mon Sep 17 00:00:00 2001 From: Taeksoo Kim Date: Sat, 9 Apr 2016 15:39:29 +0900 Subject: [PATCH 035/199] Update convolutional-networks-korean.md --- convolutional-networks-korean.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/convolutional-networks-korean.md b/convolutional-networks-korean.md index 62ede091..9cc01fad 100644 --- a/convolutional-networks-korean.md +++ b/convolutional-networks-korean.md @@ -82,7 +82,7 @@ CNN은 입력이 이미지로 이뤄져 있다는 특징을 살려 좀 더 합 CONV 레이어는 CNN을 이루는 핵심 요소이다. CONV 레이어의 출력은 3차원으로 정렬된 뉴런들로 해석될 수 있다. 이제부터는 뉴런들의 연결성 (connectivity), 그들의 공간상의 배치, 그리고 모수 공유(parameter sharing) 에 대해 알아보자. -**Overview and Intuition.** The CONV layer's parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume. During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume, producing a 2-dimensional activation map of that filter. As we slide the filter, across the input, we are computing the dot product between the entries of the filter and the input. Intuitively, the network will learn filters that activate when they see some specific type of feature at some spatial position in the input. Stacking these activation maps for all filters along the depth dimension forms the full output volume. Every entry in the output volume can thus also be interpreted as an output of a neuron that looks at only a small region in the input and shares parameters with neurons in the same activation map (since these numbers all result from applying the same filter). We now dive into the details of this process. +**개요 및 직관적인 설명.** CONV 레이어의 모수(parameter)들은 일련의 학습가능한 필터들로 이뤄져 있다. 각 필터는 가로/세로 차원으로는 작지만 깊이 (depth) 차원으로는 전체 깊이를 아우른다. 포워드 패스 (forward pass) 때에는 각 필터를 입력 볼륨의 가로/세로 차원으로 슬라이딩 시키며 (정확히는 convolve 시키며) 2차원의 액티베이션 맵 (activation map)을 생성한다. 필터를 입력 위로 슬라이딩 시킬 때, 필터와 입력의 요소들 사이의 내적 연산 (dot product)이 이뤄진다. 직관적으로 설명하면, 이 신경망은 입력의 특정 위치의 특정 패턴에 대해 반응하는 (activate) 필터를 학습한다. 이런 액티베이션 맵 (activation map)을 깊이 (depth) 차원을 따라 쌓은 것이 곧 출력 볼륨이 된다. 그러므로 출력 볼륨의 각 요소들은 입력의 작은 영역만을 취급하고, 같은 액티베이션 맵 내의 뉴런들은 같은 모수들을 공유한다 (같은 필터를 적용한 결과이므로). 이제 이 과정에 대해 좀 더 깊이 파헤쳐보자. **Local Connectivity.** When dealing with high-dimensional inputs such as images, as we saw above it is impractical to connect neurons to all neurons in the previous volume. Instead, we will connect each neuron to only a local region of the input volume. The spatial extent of this connectivity is a hyperparameter called the **receptive field** of the neuron. The extent of the connectivity along the depth axis is always equal to the depth of the input volume. It is important to note this asymmetry in how we treat the spatial dimensions (width and height) and the depth dimension: The connections are local in space (along width and height), but always full along the entire depth of the input volume. 
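(역자 주: 위 패치에서 번역된 '개요 및 직관적인 설명'의 포워드 패스를 원문에 없는 코드로 스케치하면 아래와 같다. 입력 [32x32x3], 필터 하나 [5x5x3], stride 1, 제로 패딩 없음을 가정한 가장 단순한 형태이다.)

~~~python
import numpy as np

x = np.random.randn(32, 32, 3)   # 입력 볼륨
w = np.random.randn(5, 5, 3)     # 학습 가능한 필터 하나 (여기서는 무작위 값으로 대신)
b = 0.0                          # 바이어스

out = 32 - 5 + 1                 # stride 1, 패딩 없음이면 액티베이션 맵은 28x28
activation_map = np.zeros((out, out))
for i in range(out):
  for j in range(out):
    # 필터를 (i, j) 위치로 슬라이딩시키며 입력 조각과의 내적(dot product)을 계산
    activation_map[i, j] = np.sum(x[i:i+5, j:j+5, :] * w) + b

print(activation_map.shape)      # (28, 28) - 필터 하나당 액티베이션 맵 하나
~~~

필터가 여러 개라면 이렇게 얻은 맵들을 깊이 차원으로 쌓은 것이 곧 출력 볼륨이 된다.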
From 01a7c76f5141db79d3eeb765c11f9b4540fc3675 Mon Sep 17 00:00:00 2001
From: Taeksoo Kim
Date: Sat, 9 Apr 2016 16:10:25 +0900
Subject: [PATCH 036/199] Update convolutional-networks-korean.md

---
 convolutional-networks-korean.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/convolutional-networks-korean.md b/convolutional-networks-korean.md
index 9cc01fad..a27376dc 100644
--- a/convolutional-networks-korean.md
+++ b/convolutional-networks-korean.md
@@ -84,11 +84,11 @@ CONV 레이어는 CNN을 이루는 핵심 요소이다. CONV 레이어의 출력

**개요 및 직관적인 설명.** CONV 레이어의 모수(parameter)들은 일련의 학습가능한 필터들로 이뤄져 있다. 각 필터는 가로/세로 차원으로는 작지만 깊이 (depth) 차원으로는 전체 깊이를 아우른다. 포워드 패스 (forward pass) 때에는 각 필터를 입력 볼륨의 가로/세로 차원으로 슬라이딩 시키며 (정확히는 convolve 시키며) 2차원의 액티베이션 맵 (activation map)을 생성한다. 필터를 입력 위로 슬라이딩 시킬 때, 필터와 입력의 요소들 사이의 내적 연산 (dot product)이 이뤄진다. 직관적으로 설명하면, 이 신경망은 입력의 특정 위치의 특정 패턴에 대해 반응하는 (activate) 필터를 학습한다. 이런 액티베이션 맵 (activation map)을 깊이 (depth) 차원을 따라 쌓은 것이 곧 출력 볼륨이 된다. 그러므로 출력 볼륨의 각 요소들은 입력의 작은 영역만을 취급하고, 같은 액티베이션 맵 내의 뉴런들은 같은 모수들을 공유한다 (같은 필터를 적용한 결과이므로). 이제 이 과정에 대해 좀 더 깊이 파헤쳐보자.

-**Local Connectivity.** When dealing with high-dimensional inputs such as images, as we saw above it is impractical to connect neurons to all neurons in the previous volume. Instead, we will connect each neuron to only a local region of the input volume. The spatial extent of this connectivity is a hyperparameter called the **receptive field** of the neuron. The extent of the connectivity along the depth axis is always equal to the depth of the input volume. It is important to note this asymmetry in how we treat the spatial dimensions (width and height) and the depth dimension: The connections are local in space (along width and height), but always full along the entire depth of the input volume.
+**로컬 연결성 (Local connectivity).** 이미지와 같은 고차원 입력을 다룰 때에는, 현재 레이어의 한 뉴런을 이전 볼륨의 모든 뉴런들과 연결하는 것이 비실용적이다. 대신에 우리는 레이어의 각 뉴런을 입력 볼륨의 로컬한 영역(local region)에만 연결할 것이다. 이 영역은 리셉티브 필드 (receptive field)라고 불리는 초모수 (hyperparameter)이다. 깊이 차원 측면에서는 항상 입력 볼륨의 총 깊이를 다룬다 (가로/세로는 작은 영역을 보지만 깊이는 전체를 본다는 뜻). 공간적 차원 (가로/세로)와 깊이 차원을 다루는 방식이 다르다는 걸 기억하자.

-*Example 1*. For example, suppose that the input volume has size [32x32x3], (e.g. an RGB CIFAR-10 image). If the receptive field is of size 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5\*5\*3 = 75 weights. Notice that the extent of the connectivity along the depth axis must be 3, since this is the depth of the input volume.
+*예제 1*. 예를 들어 입력 볼륨의 크기가 (CIFAR-10의 RGB 이미지와 같이) [32x32x3]이라고 하자. 만약 리셉티브 필드의 크기가 5x5라면, CONV 레이어의 각 뉴런은 입력 볼륨의 [5x5x3] 크기의 영역에 가중치 (weight)를 가하게 된다 (총 5x5x3=75 개 가중치). 입력 볼륨 (RGB 이미지)의 깊이가 3이므로 마지막 숫자가 3이 된다는 것을 기억하자.

-*Example 2*. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3\*3\*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in space (e.g. 3x3), but full along the input depth (20).
+*예제 2*. 입력 볼륨의 크기가 [16x16x20]이라고 하자. 3x3 크기의 리셉티브 필드를 사용하면 CONV 레이어의 각 뉴런은 입력 볼륨과 3x3x20=180 개의 연결을 갖게 된다. 이번에도 입력 볼륨의 깊이가 20이므로 마지막 숫자가 20이 된다는 것을 기억하자.
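(역자 주: 예제 1, 2의 가중치/연결 개수는 아래처럼 직접 계산해 확인할 수 있다. 원문에 없는 계산 예시로, 리셉티브 필드가 가로/세로로만 작고 깊이로는 입력 전체를 본다는 점이 핵심이다.)

~~~python
# 예제 1: 입력 [32x32x3], 리셉티브 필드 5x5 -> 뉴런 하나당 가중치 개수
print(5 * 5 * 3)    # 75

# 예제 2: 입력 [16x16x20], 리셉티브 필드 3x3 -> 뉴런 하나당 연결 개수
print(3 * 3 * 20)   # 180
~~~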
From 54d785403eb32ca65ac52db3c32ed31cd94f183a Mon Sep 17 00:00:00 2001
From: myungsub
Date: Sat, 9 Apr 2016 17:32:05 +0900
Subject: [PATCH 037/199] fix inline equations

---
 optimization-1.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/optimization-1.md b/optimization-1.md
index 1cc4013b..ce018a2c 100644
--- a/optimization-1.md
+++ b/optimization-1.md
@@ -26,13 +26,13 @@ Table of Contents:

1. 원 이미지의 픽셀들을 넣으면 분류 스코어(class score)를 계산해주는 모수화된(parameterized) **스코어 함수(score function)** (예를 들어, 선형 함수).
2. 학습(training) 데이타에 어떤 특정 모수(parameter/weight)들을 가지고 스코어 함수(score function)를 적용시켰을 때, 실제 class와 얼마나 잘 일치하는지에 따라 그 특정 모수(parameter/weight)들의 질을 측정하는 **손실 함수(loss function)**. 여러 종류의 손실함수(예를 들어, Softmax/SVM)가 있다.

-구체적으로 말하자면, 다음과 같은 형식을 가진 선형함수 $f(x_i, W) = W x_i $를 스코어 함수(score function)로 쓸 때, 지난 번에 다룬 바와 같이 SVM은 다음과 같은 수식으로 표현할 수 있다.:
+구체적으로 말하자면, 다음과 같은 형식을 가진 선형함수 $$ f(x_i, W) = W x_i $$를 스코어 함수(score function)로 쓸 때, 지난 번에 다룬 바와 같이 SVM은 다음과 같은 수식으로 표현할 수 있다:

$$
L = \frac{1}{N} \sum_i \sum_{j\neq y_i} \left[ \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + 1) \right] + \alpha R(W)
$$

-예시 $x_i$에 대한 예측값이 실제 값(레이블, labels) $y_i$과 같도록 설정된 모수(parameter/weight) $W$는 손실(loss)값 $L$ 또한 매우 낮게 나온다는 것을 알아보았다. 이제 세번째이자 마지막 핵심요소인 **최적화(optimization)**에 대해서 알아보자. 최적화(optimization)는 손실함수(loss function)을 최소화시카는 모수(parameter/weight, $W$)들을 찾는 과정을 뜻한다.
+예시 $$x_i$$에 대한 예측값이 실제 값(레이블, labels) $$y_i$$과 같도록 설정된 모수(parameter/weight) $$W$$는 손실(loss)값 $$L$$ 또한 매우 낮게 나온다는 것을 알아보았다. 이제 세번째이자 마지막 핵심요소인 **최적화(optimization)**에 대해서 알아보자. 최적화(optimization)는 손실함수(loss function)을 최소화시키는 모수(parameter/weight, $$W$$)들을 찾는 과정을 뜻한다.

**예고:** 이 세 가지 핵심요소가 어떻게 상호작용하는지 이해한 후에는, 첫번째 요소(모수화된 함수)로 다시 돌아가서 선형함수보다 더 복잡한 형태로 확장시켜볼 것이다. 처음엔 신경망(Neural Networks), 다음엔 컨볼루션 신경망(Convolutional Neural Networks). 손실함수(loss function)와 최적화(optimization) 과정은 거의 변화가 없을 것이다.

@@ -40,7 +40,7 @@ $$

### 손실함수(loss function)의 시각화

-이 강의에서 우리가 다루는 손실함수(loss function)들은 대체로 고차원 공간에서 정의된다. 예를 들어, CIFAR-10의 선형분류기(linear classifier)의 경우 모수(parameter/weight) 행렬은 크기가 [10 x 3073]이고 총 30,730개의 모수(parameter/weight)가 있다. 따라서, 시각화하기가 어려운 면이 있다. 하지만, 고차원 공간을 1차원 직선이나 2차원 평면으로 잘라서 보면 약간의 직관을 얻을 수 있다. 예를 들어, 무작위로 모수(parameter/weight) 행렬 $W$을 하나 뽑는다고 가정해보자. (이는 사실 고차원 공간의 한 점인 셈이다.) 이제 이 점을 직선 하나를 따라 이동시키면서 손실함수(loss function)를 기록해보자. 즉, 무작위로 뽑은 방향 $W_1$을 잡고, 이 방향을 따라 가면서 손실함수(loss function)를 계산하는데, 구체적으로 말하면 $L(W + a W_1)$에 여러 개의 $a$ 값(역자 주: 1차원 스칼라)을 넣어 계산해보는 것이다. 이 과정을 통해 우리는 $a$ 값을 x축, 손실함수(loss function) 값을 y축에 놓고 간단한 그래프를 그릴 수 있다. 또한 이 비슷한 것을 2차원으로도 할 수 있다. 여러 $a, b$값에 따라 $ L(W + a W_1 + b W_2) $을 계산하고(역자 주: $W_2$ 역시 $W_1$과 같은 식으로 뽑은 무작위 방향), $a, b$는 각각 x축과 y축에, 손실함수(loss function) 값 색을 이용해 그리면 된다.
+이 강의에서 우리가 다루는 손실함수(loss function)들은 대체로 고차원 공간에서 정의된다. 예를 들어, CIFAR-10의 선형분류기(linear classifier)의 경우 모수(parameter/weight) 행렬은 크기가 [10 x 3073]이고 총 30,730개의 모수(parameter/weight)가 있다. 따라서, 시각화하기가 어려운 면이 있다. 하지만, 고차원 공간을 1차원 직선이나 2차원 평면으로 잘라서 보면 약간의 직관을 얻을 수 있다. 예를 들어, 무작위로 모수(parameter/weight) 행렬 $$W$$을 하나 뽑는다고 가정해보자. (이는 사실 고차원 공간의 한 점인 셈이다.) 이제 이 점을 직선 하나를 따라 이동시키면서 손실함수(loss function)를 기록해보자. 즉, 무작위로 뽑은 방향 $$W_1$$을 잡고, 이 방향을 따라 가면서 손실함수(loss function)를 계산하는데, 구체적으로 말하면 $$L(W + a W_1)$$에 여러 개의 $$a$$ 값(역자 주: 1차원 스칼라)을 넣어 계산해보는 것이다. 이 과정을 통해 우리는 $$a$$ 값을 x축, 손실함수(loss function) 값을 y축에 놓고 간단한 그래프를 그릴 수 있다. 또한 이 비슷한 것을 2차원으로도 할 수 있다. 여러 $$a, b$$값에 따라 $$ L(W + a W_1 + b W_2) $$을 계산하고(역자 주: $$W_2$$ 역시 $$W_1$$과 같은 식으로 뽑은 무작위 방향), $$a, b$$는 각각 x축과 y축에, 손실함수(loss function) 값 색을 이용해 그리면 된다.

@@ -57,7 +57,7 @@

$$
L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + 1) \right]
$$

-수식에서 명백히 볼 수 있듯이, 각 예시의 손실(loss)값은 ($\max(0,-)$ 함수로 인해 0에서 막혀있는) $W$의 선형함수들의 합으로 표현된다. $W$의 각 행(즉, $w_j$) 앞에는 때때로 (잘못된 분류일 때, 즉, $j\neq y_i$인 경우) 플러스가 붙고, 때때로 (옳은 분류일 때) 마이너스가 붙는다. 더 명확히 표현하자면, 3개의 1차원 점들과 3개의 클래스가 있다고 해보자. Regularization 없는 총 SVM 손실(loss)은 다음과 같다.
+수식에서 명백히 볼 수 있듯이, 각 예시의 손실(loss)값은 ($$\max(0,-)$$ 함수로 인해 0에서 막혀있는) $$W$$의 선형함수들의 합으로 표현된다. $$W$$의 각 행(즉, $$w_j$$) 앞에는 때때로 (잘못된 분류일 때, 즉, $$j\neq y_i$$인 경우) 플러스가 붙고, 때때로 (옳은 분류일 때) 마이너스가 붙는다. 더 명확히 표현하자면, 3개의 1차원 점들과 3개의 클래스가 있다고 해보자. Regularization 없는 총 SVM 손실(loss)은 다음과 같다.

$$
\begin{align}
L_0 = & \max(0, w_1^Tx_0 - w_0^Tx_0 + 1) + \max(0, w_2^Tx_0 - w_0^Tx_0 + 1) \\\\
L_1 = & \max(0, w_0^Tx_1 - w_1^Tx_1 + 1) + \max(0, w_2^Tx_1 - w_1^Tx_1 + 1) \\\\
L_2 = & \max(0, w_0^Tx_2 - w_2^Tx_2 + 1) + \max(0, w_1^Tx_2 - w_2^Tx_2 + 1) \\\\
L = & (L_0 + L_1 + L_2)/3
\end{align}
$$

-이 예시들이 1차원이기 때문에, 데이타 $x_i$와 모수(parameter/weight) $w_j$는 숫자(역자 주: 즉, 스칼라. 따라서 위 수식에서 전치행렬을 뜻하는 $T$ 표시는 필요없음)이다. 예를 들어 $w_0$ 를 보면, 몇몇 항들은 $w_0$의 선형함수이고 각각은 0에서 꺾인다. 이를 다음과 같이 시각화할 수 있다.
+이 예시들이 1차원이기 때문에, 데이타 $$x_i$$와 모수(parameter/weight) $$w_j$$는 숫자(역자 주: 즉, 스칼라. 따라서 위 수식에서 전치행렬을 뜻하는 $$T$$ 표시는 필요없음)이다. 예를 들어 $$w_0$$를 보면, 몇몇 항들은 $$w_0$$의 선형함수이고 각각은 0에서 꺾인다. 이를 다음과 같이 시각화할 수 있다.
@@ -77,9 +77,9 @@ $$
-옆길로 새면, 아마도 밥공기 모양을 보고 SVM 손실함수(loss function)이 일종의 [볼록함수](http://en.wikipedia.org/wiki/Convex_function)라고 생각했을 것이다. 이런 형태의 함수를 효율적으로 최소화하는 문제에 대한 엄청난 양의 연구 성과들이 있다. 스탠포드 강좌 중에서도 이 주제를 다룬 것도 있다. ( [볼록함수 최적화](http://stanford.edu/~boyd/cvxbook/) ). 이 점수함수(score function) $f$를 신경망(neural networks)로 확장시키면, 목적함수(역자 주: 손실함수(loss function))은 더이상 볼록함수가 아니게 되고, 위와 같은 시각화를 해봐도 밥공기 모양 대신 울퉁불퉁하고 복잡한 모양이 보일 것이다.
+옆길로 새면, 아마도 밥공기 모양을 보고 SVM 손실함수(loss function)이 일종의 [볼록함수](http://en.wikipedia.org/wiki/Convex_function)라고 생각했을 것이다. 이런 형태의 함수를 효율적으로 최소화하는 문제에 대한 엄청난 양의 연구 성과들이 있다. 스탠포드 강좌 중에서도 이 주제를 다룬 것도 있다. ( [볼록함수 최적화](http://stanford.edu/~boyd/cvxbook/) ). 이 점수함수(score function) $$f$$를 신경망(neural networks)으로 확장시키면, 목적함수(역자 주: 손실함수(loss function))은 더이상 볼록함수가 아니게 되고, 위와 같은 시각화를 해봐도 밥공기 모양 대신 울퉁불퉁하고 복잡한 모양이 보일 것이다.

-*미분이 불가능한 손실함수(loss functions)*. 기술적인 설명을 덧붙이자면, $\max(0,-)$ 함수 때문에 손실함수(loss functionn)에 *꺾임*이 생기는데, 이 때문에 손실함수(loss functions)는 미분이 불가능해진다. 왜냐하면, 그 꺾이는 부분에서 미분 혹은 그라디언트가 존재하지 않기 때문이다. 하지만, [서브그라디언트(subgradient)](http://en.wikipedia.org/wiki/Subderivative)가 존재하고, 대체로 이를 그라디언트(gradient) 대신 이용한다. 앞으로 이 강의에서는 *그라디언트(gradient)*와 *서브그라디언트(subgradient)*를 구분하지 않고 쓸 것이다.
+*미분이 불가능한 손실함수(loss functions)*. 기술적인 설명을 덧붙이자면, $$\max(0,-)$$ 함수 때문에 손실함수(loss function)에 *꺾임*이 생기는데, 이 때문에 손실함수(loss functions)는 미분이 불가능해진다. 왜냐하면, 그 꺾이는 부분에서 미분 혹은 그라디언트가 존재하지 않기 때문이다. 하지만, [서브그라디언트(subgradient)](http://en.wikipedia.org/wiki/Subderivative)가 존재하고, 대체로 이를 그라디언트(gradient) 대신 이용한다. 앞으로 이 강의에서는 *그라디언트(gradient)*와 *서브그라디언트(subgradient)*를 구분하지 않고 쓸 것이다.

@@ -142,7 +142,7 @@

#### Strategy #2: Random Local Search

-The first strategy you may think of is to to try to extend one foot in a random direction and then take a step only if it leads downhill. Concretely, we will start out with a random $W$, generate random perturbations $ \delta W $ to it and if the loss at the perturbed $W + \delta W$ is lower, we will perform an update. The code for this procedure is as follows:
+The first strategy you may think of is to try to extend one foot in a random direction and then take a step only if it leads downhill. Concretely, we will start out with a random $$W$$, generate random perturbations $$ \delta W $$ to it and if the loss at the perturbed $$W + \delta W$$ is lower, we will perform an update. The code for this procedure is as follows:

~~~python
W = np.random.randn(10, 3073) * 0.001 # generate random starting W

From 11f2f28b7f49a04a9a11e570dc5cd2f16545af26 Mon Sep 17 00:00:00 2001
From: MaybeS
Date: Sun, 10 Apr 2016 00:00:28 +0900
Subject: [PATCH 038/199] assignment1

---
 assignments2016/assignment1.md | 93 +++++++++++++------------
 1 file changed, 42 insertions(+), 51 deletions(-)

diff --git a/assignments2016/assignment1.md b/assignments2016/assignment1.md
index e99ca6da..85d44cbe 100644
--- a/assignments2016/assignment1.md
+++ b/assignments2016/assignment1.md
@@ -3,48 +3,45 @@ layout: page
 mathjax: true
 permalink: /assignments2016/assignment1/
 ---
+이번 숙제에서 여러분은 간단한 이미지 분류 파이프라인을 k-Nearest neighbor 또는 SVM/Softmax 분류기에 기반하여 넣는 방법을 연습할 수 있습니다. 이번 숙제의 목표는 다음과 같습니다.

-In this assignment you will practice putting together a simple image classification pipeline, based on the k-Nearest Neighbor or the SVM/Softmax classifier. The goals of this assignment are as follows:
+- **이미지 분류 파이프라인**의 기초와 데이터 중심의 접근방식에 대해 이해합니다.
+- 학습/확인/테스트의 분할과 **hyperparameter tuning**를 위해 검증 데이터를 사용하는 것에 관해 이해합니다.
+- 효율적으로 작성된 **벡터화**된 numpy 코드로 proficiency을 나타나게 합니다.
+- k-Nearest Neighbor (**kNN**) 분류기를 수행하고 적용해봅니다.
+- Multiclass Support Vector Machine (**SVM**) 분류기를 수행하고 적용해봅니다.
+- **Softmax** 분류기를 수행하고 적용해봅니다.
+- **Two layer neural network** 분류기를 수행하고 적용해봅니다.
+- 위 분류기들의 장단점과 차이에 대해 이해합니다.
+- 성능향상을 위해 raw pixels보다 **higher-level representations**을 사용하는 이유에 관하여 이해합니다. (색상 히스토그램, 그라데이션의 히스토그램 특징)

-- understand the basic **Image Classification pipeline** and the data-driven approach (train/predict stages)
-- understand the train/val/test **splits** and the use of validation data for **hyperparameter tuning**.
-- develop proficiency in writing efficient **vectorized** code with numpy
-- implement and apply a k-Nearest Neighbor (**kNN**) classifier
-- implement and apply a Multiclass Support Vector Machine (**SVM**) classifier
-- implement and apply a **Softmax** classifier
-- implement and apply a **Two layer neural network** classifier
-- understand the differences and tradeoffs between these classifiers
-- get a basic understanding of performance improvements from using **higher-level representations** than raw pixels (e.g. color histograms, Histogram of Gradient (HOG) features)

+## 설치
+여러분은 다음 두가지 방법으로 숙제를 수행할 수 있습니다: Terminal.com을 이용한 가상 환경 또는 로컬 환경.

-## Setup
-You can work on the assignment in one of two ways: locally on your own machine, or on a virtual machine through Terminal.com.

+### Terminal에서의 가상 환경.
+Terminal에는 우리의 수업을 위한 서브도메인이 만들어져 있습니다. [www.stanfordterminalcloud.com](https://www.stanfordterminalcloud.com) 계정을 등록하세요. 이번 숙제에 대한 스냅샷은 [여기](https://www.stanfordterminalcloud.com/snapshot/49f5a1ea15dc424aec19155b3398784d57c55045435315ce4f8b96b62819ef65)에서 찾아볼 수 있습니다. 만약 수업에 등록되었다면, TA(see Piazza for more information)에게 이 수업을 위한 Terminal 예산을 요구할 수 있습니다. 처음 스냅샷을 실행시키면, 수업을 위한 모든 것이 설치되어 있어서 바로 숙제를 수행할 수 있습니다. [여기](/terminal-tutorial)에 Terminal을 위한 간단한 튜토리얼을 작성해 뒀습니다.

-### Working in the cloud on Terminal

-Terminal has created a separate subdomain to serve our class, [www.stanfordterminalcloud.com](https://www.stanfordterminalcloud.com). Register your account there. The Assignment 1 snapshot can then be found [here](https://www.stanfordterminalcloud.com/snapshot/49f5a1ea15dc424aec19155b3398784d57c55045435315ce4f8b96b62819ef65). If you're registered in the class you can contact the TA (see Piazza for more information) to request Terminal credits for use on the assignment. Once you boot up the snapshot everything will be installed for you, and you'll be ready to start on your assignment right away. We've written a small tutorial on Terminal [here](/terminal-tutorial).

+### 로컬 환경
+[여기](http://vision.stanford.edu/teaching/cs231n/winter1516_assignment1.zip)에서 압축파일을 다운받고 다음을 따르세요.

-### Working locally
-Get the code as a zip file [here](http://vision.stanford.edu/teaching/cs231n/winter1516_assignment1.zip). As for the dependencies:

+**[선택 1] Use Anaconda:**
+과학, 수학, 공학, 데이터 분석을 위한 다양하고 유명한 패키지들을 담고있는 [Anaconda](https://www.continuum.io/downloads)를 사용하여 설치하는 것이 즐겨 쓰이는 방법입니다. 설치가 다 되면 모든 요구사항을 넘기고 바로 숙제를 수행해도 좋습니다.

-**[Option 1] Use Anaconda:**
-The preferred approach for installing all the assignment dependencies is to use [Anaconda](https://www.continuum.io/downloads), which is a Python distribution that includes many of the most popular Python packages for science, math, engineering and data analysis. Once you install it you can skip all mentions of requirements and you're ready to go directly to working on the assignment.
- -**[Option 2] Manual install, virtual environment:** -If you'd like to (instead of Anaconda) go with a more manual and risky installation route you will likely want to create a [virtual environment](http://docs.python-guide.org/en/latest/dev/virtualenvs/) for the project. If you choose not to use a virtual environment, it is up to you to make sure that all dependencies for the code are installed globally on your machine. To set up a virtual environment, run the following: +**[선택 2] Manual install, virtual environment:** +만약 Anaconda 대신 좀 더 일반적이고 위험한 방법을 택하고 싶다면 프로젝트를 위한 [virtual environment](http://docs.python-guide.org/en/latest/dev/virtualenvs/)를 만들 수 있습니다. 만약 virtual environment를 사용하지 않는다면 모든 코드가 컴퓨터에 전역적으로 종속되게 설치됩니다. virtual environment의 설정은 아래를 참조하세요. ~~~bash cd assignment1 -sudo pip install virtualenv # This may already be installed -virtualenv .env # Create a virtual environment -source .env/bin/activate # Activate the virtual environment -pip install -r requirements.txt # Install dependencies +sudo pip install virtualenv # 아마 먼저 설치되어 있을 겁니다. +virtualenv .env # virtual environment를 만듭니다. +source .env/bin/activate # virtual environment를 활성화 합니다. +pip install -r requirements.txt # dependencies 설치합니다. # Work on the assignment for a while ... -deactivate # Exit the virtual environment +deactivate # virtual environment를 종료합니다. ~~~ **Download data:** -Once you have the starter code, you will need to download the CIFAR-10 dataset. -Run the following from the `assignment1` directory: +먼저 숙제를 수행하기전에 CIFAR-10 dataset를 다운로드해야 합니다. 아래를 `assignment1` 폴더에서 실행하세요: ~~~bash cd cs231n/datasets @@ -52,39 +49,33 @@ cd cs231n/datasets ~~~ **Start IPython:** -After you have the CIFAR-10 data, you should start the IPython notebook server from the -`assignment1` directory. If you are unfamiliar with IPython, you should read our -[IPython tutorial](/ipython-tutorial). +CIFAR-10 data를 받았다면, `assignment1` 폴더의 IPython notebook server를 시작할 수 있습니다. IPython에 친숙하지 않다면 작성해둔 [IPython tutorial](/ipython-tutorial)를 읽어보는 것을 권장합니다. + +**NOTE:** OSX에서 virtual environment를 실행하면, matplotlib 에러가 날 수 있습니다([이 문제에 관한 이슈](http://matplotlib.org/faq/virtualenv_faq.html)). IPython server를 `assignment1`폴더의 `start_ipython_osx.sh`라고 실행하면 이 문제를 피해갈 수 있습니다; 이 스크립트는 virtual environment가 `.env`라고 되어있다고 가정하고 작성되었습니다. -**NOTE:** If you are working in a virtual environment on OSX, you may encounter -errors with matplotlib due to the [issues described here](http://matplotlib.org/faq/virtualenv_faq.html). You can work around this issue by starting the IPython server using the `start_ipython_osx.sh` script from the `assignment1` directory; the script assumes that your virtual environment is named `.env`. +### 과제 제출: +로컬 환경이나 Terminal에서 숙제를 마쳤다면 `collectSubmission.sh`스크립트를 실행합니다; 이 스크립트는 `assignment1.zip`파일을 만듭니다. 이 파일을 [the coursework](https://coursework.stanford.edu/portal/site/W16-CS-231N-01/)에 업로드하세요. -### Submitting your work: -Whether you work on the assignment locally or using Terminal, once you are done -working run the `collectSubmission.sh` script; this will produce a file called -`assignment1.zip`. Upload this file to your dropbox on -[the coursework](https://coursework.stanford.edu/portal/site/W16-CS-231N-01/) -page for the course. -### Q1: k-Nearest Neighbor classifier (20 points) +### Q1: k-Nearest Neighbor 분류기 (20 points) -The IPython Notebook **knn.ipynb** will walk you through implementing the kNN classifier. +IPython Notebook **knn.ipynb**이 kNN 분류기를 수행하는 것을 안내합니다. 
-### Q2: Training a Support Vector Machine (25 points) +### Q2: Support Vector Machine 훈련 (25 points) -The IPython Notebook **svm.ipynb** will walk you through implementing the SVM classifier. +IPython Notebook **svm.ipynb**이 SVM 분류기를 수행하는 것을 안내합니다. -### Q3: Implement a Softmax classifier (20 points) +### Q3: Softmax 분류기 실행하기 (20 points) -The IPython Notebook **softmax.ipynb** will walk you through implementing the Softmax classifier. +IPython Notebook **softmax.ipynb**이 Softmax 분류기를 수행하는 것을 안내합니다. ### Q4: Two-Layer Neural Network (25 points) -The IPython Notebook **two_layer_net.ipynb** will walk you through the implementation of a two-layer neural network classifier. -### Q5: Higher Level Representations: Image Features (10 points) +IPython Notebook **two_layer_net.ipynb**이 two-layer neural network 분류기를 수행하는 것을 안내합니다. -The IPython Notebook **features.ipynb** will walk you through this exercise, in which you will examine the improvements gained by using higher-level representations as opposed to using raw pixel values. +### Q5: Higher Level Representations: 이미지 특징 (10 points) -### Q6: Cool Bonus: Do something extra! (+10 points) +IPython Notebook **features.ipynb**을 사용하여 higher-level representations이 raw pixel보다 개선이 이루어졌는지 검사합니다. -Implement, investigate or analyze something extra surrounding the topics in this assignment, and using the code you developed. For example, is there some other interesting question we could have asked? Is there any insightful visualization you can plot? Or anything fun to look at? Or maybe you can experiment with a spin on the loss function? If you try out something cool we'll give you up to 10 extra points and may feature your results in the lecture. +### Q6: 추가 과제: 뭔가 더 해보세요! (+10 points) +이번 과제와 관련된 다른 것들을 작성한 코드로 분석하고 연구해보세요. 예를 들어, 질문하고 싶은 흥미로운 질문이 있나요? 통찰력 있는 시각화를 작성할 수 있나요? 아니면 다른 재미있는 살펴볼 거리가 있나요? Or maybe you can experiment with a spin on the loss function 만약 다른 멋있는 것을 시도해본다면 추가로 10 points를 얻을 수 있고 강의에 수행한 결과가 실릴 수 있습니다. \ No newline at end of file From acaf1b9865584268c914ac4aac3c05b5473047aa Mon Sep 17 00:00:00 2001 From: MaybeS Date: Sun, 10 Apr 2016 00:03:16 +0900 Subject: [PATCH 039/199] Update assignment1-trans.md --- assignments2016/assignment1-trans.md | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/assignments2016/assignment1-trans.md b/assignments2016/assignment1-trans.md index e653ecbf..8d7db206 100644 --- a/assignments2016/assignment1-trans.md +++ b/assignments2016/assignment1-trans.md @@ -1,8 +1,7 @@ # 이 문서는 assignment1을 assignment1-kor로 번역하면서 만들어진 문서입니다. 이 문서에는 번역하는 도중 겪었던 곤란한 부분이나 개선이 필요한 부분에 대해서 열거합니다. -## 이 단어들을 어떻게 번역하면 좋을까요? -- SVM +## 이 단어들을 어떻게 번역하면 좋을까요? [glossary](http://aikorea.org/cs231n/glossary/)에 추가. - Softmax - Implement - Image Classification pipeline @@ -12,12 +11,9 @@ -데이터 중심의 접근방식 - tuning - hyperparameter -- validation data - proficiency - raw pixels -- color histograms, - Histogram of Gradient (HOG) features -- matplotlib - insightful ## 이 부분은 이렇게 고쳐도 될것 같습니다. @@ -36,6 +32,5 @@ ## 이 부분은 이렇게 고쳤습니다. - Setup에서 설명엔 local - virtual 순이지만 소 제목에서는 ###virtual - local 순 이기에 일치시켰습니다. - ## 다른 문제될 사항 - Q6의 Or maybe you can experiment with a spin on the loss function 가 번역이 매끄럽게 되지 않아서 자문을 구합니다. 
From 92b2b27baa51b9234f864f3a55c7b33ab1a565f7 Mon Sep 17 00:00:00 2001 From: MaybeS Date: Sun, 10 Apr 2016 00:06:47 +0900 Subject: [PATCH 040/199] Delete trans.md --- assignments2016/assignment1-kor.md | 81 ---------------------------- assignments2016/assignment1-trans.md | 36 ------------- 2 files changed, 117 deletions(-) delete mode 100644 assignments2016/assignment1-kor.md delete mode 100644 assignments2016/assignment1-trans.md diff --git a/assignments2016/assignment1-kor.md b/assignments2016/assignment1-kor.md deleted file mode 100644 index 6b6054c7..00000000 --- a/assignments2016/assignment1-kor.md +++ /dev/null @@ -1,81 +0,0 @@ ---- -layout: page -mathjax: true -permalink: /assignments2016/assignment1/ ---- -이번 숙제에서 여러분은 간단한 이미지 분류 파이프라인을 k-Nearest neighbor 또는 SVM/Softmax 분류기에 기반하여 넣는 방법을 연습할 수 있습니다. 이번 숙제의 목표는 다음과 같습니다. - -- **이미지 분류 파이프라인**의 기초와 데이터 중심의 접근방식에 대해 이해합니다. -- 학습/확인/테스트의 분할과 **hyperparameter tuning**를 위해 검증 데이터를 사용하는 것에 관해 이해합니다. -- 효율적으로 작성된 **벡터화**된 numpy 코드로 proficiency을 나타나게 합니다. -- k-Nearest Neighbor (**kNN**) 분류기를 수행하고 적용해봅니다. -- Multiclass Support Vector Machine (**SVM**) 분류기를 수행하고 적용해봅니다. -- **Softmax** 분류기를 수행하고 적용해봅니다. -- **Two layer neural network** 분류기를 수행하고 적용해봅니다. -- 위 분류기들의 장단점과 차이에 대해 이해합니다. -- 성능향상을 위해 raw pixels보다 **higher-level representations**을 사용하는 이유에 관하여 이해합니다. (color histograms, Histogram of Gradient (HOG) features) - -## 설치 -여러분은 다음 두가지 방법으로 숙제를 수행할 수 있습니다: Terminal.com을 이용한 가상 환경 또는 로컬 환경. - -### Termianl에서의 가상 환경. -Terminal에는 우리의 수업을 위한 서브도메인이 만들어져 있습니다. [www.stanfordterminalcloud.com](https://www.stanfordterminalcloud.com) 계정을 등록하세요. 이번 숙제에 대한 스냅샷은 [여기](https://www.stanfordterminalcloud.com/snapshot/49f5a1ea15dc424aec19155b3398784d57c55045435315ce4f8b96b62819ef65)에서 찾아볼 수 있습니다. 만약 수업에 등록되었다면, TA(see Piazza for more information)에게 이 수업을 위한 Terminal 예산을 요구할 수 있습니다. 처음 스냅샷을 실행시키면, 수업을 위한 모든 것이 설치되어 있어서 바로 숙제를 수행할 수 있습니다. [여기](/terminal-tutorial)에 Terminal을 위한 간단한 튜토리얼을 작성해 뒀습니다. - -### 로컬 환경 -[여기](http://vision.stanford.edu/teaching/cs231n/winter1516_assignment1.zip)에서 압축파일을 다운받고 다음을 따르세요. - -**[선택 1] Use Anaconda:** -과학, 수학, 공학, 데이터 분석을 위한 다양하고 유명한 패키지들을 담고있는 [Anaconda](https://www.continuum.io/downloads)를 사용하여 설치하는 것이 즐겨 쓰이는 방법입니다. 설치가 다 되면 모든 요구사항을 넘기고 바로 숙제를 수행해도 좋습니다. - -**[선택 2] Manual install, virtual environment:** -만약 Anaconda 대신 좀 더 일반적이고 위험한 방법을 택하고 싶다면 프로젝트를 위한 [virtual environment](http://docs.python-guide.org/en/latest/dev/virtualenvs/)를 만들 수 있습니다. 만약 virtual environment를 사용하지 않는다면 모든 코드가 컴퓨터에 전역적으로 종속되게 설치됩니다. virtual environment의 설정은 아래를 참조하세요. - -~~~bash -cd assignment1 -sudo pip install virtualenv # 아마 먼저 설치되어 있을 겁니다. -virtualenv .env # virtual environment를 만듭니다. -source .env/bin/activate # virtual environment를 활성화 합니다. -pip install -r requirements.txt # dependencies 설치합니다. -# Work on the assignment for a while ... -deactivate # virtual environment를 종료합니다. -~~~ - -**Download data:** -먼저 숙제를 수행하기전에 CIFAR-10 dataset를 다운로드해야 합니다. 아래를 `assignment1` 폴더에서 실행하세요: - -~~~bash -cd cs231n/datasets -./get_datasets.sh -~~~ - -**Start IPython:** -CIFAR-10 data를 받았다면, `assignment1` 폴더의 IPython notebook server를 시작할 수 있습니다. IPython에 친숙하지 않다면 작성해둔 [IPython tutorial](/ipython-tutorial)를 읽어보는 것을 권장합니다. - -**NOTE:** OSX에서 virtual environment를 실행하면, matplotlib 에러가 날 수 있습니다([이 문제에 관한 이슈](http://matplotlib.org/faq/virtualenv_faq.html)). IPython server를 `assignment1`폴더의 `start_ipython_osx.sh`라고 실행하면 이 문제를 피해갈 수 있습니다; 이 스크립트는 virtual environment가 `.env`라고 되어있다고 가정하고 작성되었습니다. 
- -### 과제 제출: -로컬 환경이나 Terminal에서 숙제를 마쳤다면 `collectSubmission.sh`스크립트를 실행합니다; 이 스크립트는 `assignment1.zip`파일을 만듭니다. 이 파일을 [the coursework](https://coursework.stanford.edu/portal/site/W16-CS-231N-01/)에 업로드하세요. - - -### Q1: k-Nearest Neighbor 분류기 (20 points) - -IPython Notebook **knn.ipynb**이 kNN 분류기를 수행하는 것을 안내합니다. - -### Q2: Support Vector Machine 훈련 (25 points) - -IPython Notebook **svm.ipynb**이 SVM 분류기를 수행하는 것을 안내합니다. - -### Q3: Softmax 분류기 실행하기 (20 points) - -IPython Notebook **softmax.ipynb**이 Softmax 분류기를 수행하는 것을 안내합니다. - -### Q4: Two-Layer Neural Network (25 points) - -IPython Notebook **two_layer_net.ipynb**이 two-layer neural network 분류기를 수행하는 것을 안내합니다. - -### Q5: Higher Level Representations: 이미지 특징 (10 points) - -IPython Notebook **features.ipynb**을 사용하여 higher-level representations이 raw pixel보다 개선이 이루어졌는지 검사합니다. - -### Q6: 추가 과제: 뭔가 더 해보세요! (+10 points) -이번 과제와 관련된 다른 것들을 작성한 코드로 분석하고 연구해보세요. 예를 들어, 질문하고 싶은 흥미로운 질문이 있나요? 통찰력 있는 시각화를 작성할 수 있나요? 아니면 다른 재미있는 살펴볼 거리가 있나요? Or maybe you can experiment with a spin on the loss function 만약 다른 멋있는 것을 시도해본다면 추가로 10 points를 얻을 수 있고 강의에 수행한 결과가 실릴 수 있습니다. \ No newline at end of file diff --git a/assignments2016/assignment1-trans.md b/assignments2016/assignment1-trans.md deleted file mode 100644 index 8d7db206..00000000 --- a/assignments2016/assignment1-trans.md +++ /dev/null @@ -1,36 +0,0 @@ -# 이 문서는 assignment1을 assignment1-kor로 번역하면서 만들어진 문서입니다. -이 문서에는 번역하는 도중 겪었던 곤란한 부분이나 개선이 필요한 부분에 대해서 열거합니다. - -## 이 단어들을 어떻게 번역하면 좋을까요? [glossary](http://aikorea.org/cs231n/glossary/)에 추가. -- Softmax -- Implement -- Image Classification pipeline - - pipeline - - 이미지 분류 파이프라인 -- data-driven approach - -데이터 중심의 접근방식 -- tuning -- hyperparameter -- proficiency -- raw pixels -- Histogram of Gradient (HOG) features -- insightful - -## 이 부분은 이렇게 고쳐도 될것 같습니다. -`현재표기 -> 더 나은(이라고 생각되는) 표기` - 또는 -`원문 <- 현재표기` - -- 숙제를 수행하다. -> 숙제를 하다. -- 수업에 등록되었다면. -> ? -- 일반적인(manual) -> ? - -- Higher Level Representations <- 고수준 표현 -- Credit <- 예산 -- install dependencies <- dependencies를 설치합니다. - -## 이 부분은 이렇게 고쳤습니다. -- Setup에서 설명엔 local - virtual 순이지만 소 제목에서는 ###virtual - local 순 이기에 일치시켰습니다. - -## 다른 문제될 사항 -- Q6의 Or maybe you can experiment with a spin on the loss function 가 번역이 매끄럽게 되지 않아서 자문을 구합니다. From 49fc7318c4e66d3f718527243a1d7c2103ffb11a Mon Sep 17 00:00:00 2001 From: MaybeS Date: Sun, 10 Apr 2016 13:36:44 +0900 Subject: [PATCH 041/199] Update assignment1 --- assignments2016/assignment1.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/assignments2016/assignment1.md b/assignments2016/assignment1.md index 85d44cbe..55b911e5 100644 --- a/assignments2016/assignment1.md +++ b/assignments2016/assignment1.md @@ -78,4 +78,4 @@ IPython Notebook **two_layer_net.ipynb**이 two-layer neural network 분류기 IPython Notebook **features.ipynb**을 사용하여 higher-level representations이 raw pixel보다 개선이 이루어졌는지 검사합니다. ### Q6: 추가 과제: 뭔가 더 해보세요! (+10 points) -이번 과제와 관련된 다른 것들을 작성한 코드로 분석하고 연구해보세요. 예를 들어, 질문하고 싶은 흥미로운 질문이 있나요? 통찰력 있는 시각화를 작성할 수 있나요? 아니면 다른 재미있는 살펴볼 거리가 있나요? Or maybe you can experiment with a spin on the loss function 만약 다른 멋있는 것을 시도해본다면 추가로 10 points를 얻을 수 있고 강의에 수행한 결과가 실릴 수 있습니다. \ No newline at end of file +이번 과제와 관련된 다른 것들을 작성한 코드로 분석하고 연구해보세요. 예를 들어, 질문하고 싶은 흥미로운 질문이 있나요? 통찰력 있는 시각화를 작성할 수 있나요? 아니면 다른 재미있는 살펴볼 거리가 있나요? 또는 손실 함수(loss function)을 조금씩 변형해가며 실험해볼 수도 있을 것입니다. 만약 다른 멋있는 것을 시도해본다면 추가로 10 points를 얻을 수 있고 강의에 수행한 결과가 실릴 수 있습니다. 
\ No newline at end of file

From 8a4ebafa5557a46f97c2d4c3714e5a09ba6d83ad Mon Sep 17 00:00:00 2001
From: MaybeS
Date: Sun, 10 Apr 2016 13:41:50 +0900
Subject: [PATCH 042/199] Update words

---
 assignments2016/assignment1.md | 34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/assignments2016/assignment1.md b/assignments2016/assignment1.md
index 55b911e5..8195e6e0 100644
--- a/assignments2016/assignment1.md
+++ b/assignments2016/assignment1.md
@@ -5,27 +5,27 @@ permalink: /assignments2016/assignment1/
 ---
 이번 숙제에서 여러분은 간단한 이미지 분류 파이프라인을 k-Nearest neighbor 또는 SVM/Softmax 분류기에 기반하여 넣는 방법을 연습할 수 있습니다. 이번 숙제의 목표는 다음과 같습니다.

-- **이미지 분류 파이프라인**의 기초와 데이터 중심의 접근방식에 대해 이해합니다.
-- 학습/확인/테스트의 분할과 **hyperparameter tuning**를 위해 검증 데이터를 사용하는 것에 관해 이해합니다.
+- **이미지 분류 파이프라인**의 기초와 데이터기반 접근법에 대해 이해합니다.
+- 학습/확인/테스트의 분할과 **초모수 튜닝**를 위해 검증 데이터를 사용하는 것에 관해 이해합니다.
 - 효율적으로 작성된 **벡터화**된 numpy 코드로 proficiency을 나타나게 합니다.
-- k-Nearest Neighbor (**kNN**) 분류기를 수행하고 적용해봅니다.
-- Multiclass Support Vector Machine (**SVM**) 분류기를 수행하고 적용해봅니다.
-- **Softmax** 분류기를 수행하고 적용해봅니다.
-- **Two layer neural network** 분류기를 수행하고 적용해봅니다.
+- k-Nearest Neighbor (**kNN**) 분류기를 구현하고 적용해봅니다.
+- Multiclass Support Vector Machine (**SVM**) 분류기를 구현하고 적용해봅니다.
+- **Softmax** 분류기를 구현하고 적용해봅니다.
+- **Two layer neural network** 분류기를 구현하고 적용해봅니다.
 - 위 분류기들의 장단점과 차이에 대해 이해합니다.
-- 성능향상을 위해 raw pixels보다 **higher-level representations**을 사용하는 이유에 관하여 이해합니다. (색상 히스토그램, 그라데이션의 히스토그램 특징)
+- 성능향상을 위해 단순히 이미지 픽셀(화소)보다 더 고차원의 표현(**higher-level representations**)을 사용하는 이유에 관하여 이해합니다. (색상 히스토그램, 그라디언트 히스토그램(HOG) 특징)

 ## 설치
-여러분은 다음 두가지 방법으로 숙제를 수행할 수 있습니다: Terminal.com을 이용한 가상 환경 또는 로컬 환경.
+여러분은 다음 두가지 방법으로 숙제를 시작할 수 있습니다: Terminal.com을 이용한 가상 환경 또는 로컬 환경.

 ### Terminal에서의 가상 환경.
-Terminal에는 우리의 수업을 위한 서브도메인이 만들어져 있습니다. [www.stanfordterminalcloud.com](https://www.stanfordterminalcloud.com) 계정을 등록하세요. 이번 숙제에 대한 스냅샷은 [여기](https://www.stanfordterminalcloud.com/snapshot/49f5a1ea15dc424aec19155b3398784d57c55045435315ce4f8b96b62819ef65)에서 찾아볼 수 있습니다. 만약 수업에 등록되었다면, TA(see Piazza for more information)에게 이 수업을 위한 Terminal 예산을 요구할 수 있습니다. 처음 스냅샷을 실행시키면, 수업을 위한 모든 것이 설치되어 있어서 바로 숙제를 수행할 수 있습니다. [여기](/terminal-tutorial)에 Terminal을 위한 간단한 튜토리얼을 작성해 뒀습니다.
+Terminal에는 우리의 수업을 위한 서브도메인이 만들어져 있습니다. [www.stanfordterminalcloud.com](https://www.stanfordterminalcloud.com) 계정을 등록하세요. 이번 숙제에 대한 스냅샷은 [여기](https://www.stanfordterminalcloud.com/snapshot/49f5a1ea15dc424aec19155b3398784d57c55045435315ce4f8b96b62819ef65)에서 찾아볼 수 있습니다. 만약 수업에 등록되었다면, TA(see Piazza for more information)에게 이 수업을 위한 Terminal 예산을 요구할 수 있습니다. 처음 스냅샷을 실행시키면, 수업을 위한 모든 것이 설치되어 있어서 바로 숙제를 시작할 수 있습니다. [여기](/terminal-tutorial)에 Terminal을 위한 간단한 튜토리얼을 작성해 뒀습니다.

 ### 로컬 환경
 [여기](http://vision.stanford.edu/teaching/cs231n/winter1516_assignment1.zip)에서 압축파일을 다운받고 다음을 따르세요.

 **[선택 1] Use Anaconda:**
-과학, 수학, 공학, 데이터 분석을 위한 다양하고 유명한 패키지들을 담고있는 [Anaconda](https://www.continuum.io/downloads)를 사용하여 설치하는 것이 즐겨 쓰이는 방법입니다. 설치가 다 되면 모든 요구사항을 넘기고 바로 숙제를 수행해도 좋습니다.
+과학, 수학, 공학, 데이터 분석을 위한 다양하고 유명한 패키지들을 담고있는 [Anaconda](https://www.continuum.io/downloads)를 사용하여 설치하는 것이 즐겨 쓰이는 방법입니다. 설치가 다 되면 모든 요구사항을 넘기고 바로 숙제를 시작해도 좋습니다.

 **[선택 2] Manual install, virtual environment:**
 만약 Anaconda 대신 좀 더 일반적이고 위험한 방법을 택하고 싶다면 프로젝트를 위한 [virtual environment](http://docs.python-guide.org/en/latest/dev/virtualenvs/)를 만들 수 있습니다. 만약 virtual environment를 사용하지 않는다면 모든 코드가 컴퓨터에 전역적으로 종속되게 설치됩니다. virtual environment의 설정은 아래를 참조하세요.

 ~~~bash
 cd assignment1
 sudo pip install virtualenv # 아마 먼저 설치되어 있을 겁니다.
 virtualenv .env # virtual environment를 만듭니다.
 source .env/bin/activate # virtual environment를 활성화 합니다.
 pip install -r requirements.txt # dependencies 설치합니다.
 # Work on the assignment for a while ...
 deactivate # virtual environment를 종료합니다.
~~~ **Download data:** -먼저 숙제를 수행하기전에 CIFAR-10 dataset를 다운로드해야 합니다. 아래를 `assignment1` 폴더에서 실행하세요: +먼저 숙제를 시작하기전에 CIFAR-10 dataset를 다운로드해야 합니다. 아래를 `assignment1` 폴더에서 실행하세요: ~~~bash cd cs231n/datasets @@ -59,23 +59,23 @@ CIFAR-10 data를 받았다면, `assignment1` 폴더의 IPython notebook server ### Q1: k-Nearest Neighbor 분류기 (20 points) -IPython Notebook **knn.ipynb**이 kNN 분류기를 수행하는 것을 안내합니다. +IPython Notebook **knn.ipynb**이 kNN 분류기를 구현하는 것을 안내합니다. ### Q2: Support Vector Machine 훈련 (25 points) -IPython Notebook **svm.ipynb**이 SVM 분류기를 수행하는 것을 안내합니다. +IPython Notebook **svm.ipynb**이 SVM 분류기를 구현하는 것을 안내합니다. ### Q3: Softmax 분류기 실행하기 (20 points) -IPython Notebook **softmax.ipynb**이 Softmax 분류기를 수행하는 것을 안내합니다. +IPython Notebook **softmax.ipynb**이 Softmax 분류기를 구현하는 것을 안내합니다. ### Q4: Two-Layer Neural Network (25 points) -IPython Notebook **two_layer_net.ipynb**이 two-layer neural network 분류기를 수행하는 것을 안내합니다. +IPython Notebook **two_layer_net.ipynb**이 two-layer neural network 분류기를 구현하는 것을 안내합니다. -### Q5: Higher Level Representations: 이미지 특징 (10 points) +### Q5: 이미지 특징을 고차원으로 표현하기 (10 points) -IPython Notebook **features.ipynb**을 사용하여 higher-level representations이 raw pixel보다 개선이 이루어졌는지 검사합니다. +IPython Notebook **features.ipynb**을 사용하여 단순한 이미지 픽셀(화소)보다 고차원의 표현이 효과적인지 검사합니다. ### Q6: 추가 과제: 뭔가 더 해보세요! (+10 points) 이번 과제와 관련된 다른 것들을 작성한 코드로 분석하고 연구해보세요. 예를 들어, 질문하고 싶은 흥미로운 질문이 있나요? 통찰력 있는 시각화를 작성할 수 있나요? 아니면 다른 재미있는 살펴볼 거리가 있나요? 또는 손실 함수(loss function)을 조금씩 변형해가며 실험해볼 수도 있을 것입니다. 만약 다른 멋있는 것을 시도해본다면 추가로 10 points를 얻을 수 있고 강의에 수행한 결과가 실릴 수 있습니다. \ No newline at end of file From 8dc64a386a09b741da5da6f616dfb2e5baa57512 Mon Sep 17 00:00:00 2001 From: Taeksoo Kim Date: Sun, 10 Apr 2016 22:34:49 +0900 Subject: [PATCH 043/199] Update convolutional-networks-korean.md --- convolutional-networks-korean.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/convolutional-networks-korean.md b/convolutional-networks-korean.md index a27376dc..f60a753e 100644 --- a/convolutional-networks-korean.md +++ b/convolutional-networks-korean.md @@ -94,15 +94,15 @@ CONV 레이어는 CNN을 이루는 핵심 요소이다. CONV 레이어의 출력
- Left: An example input volume in red (e.g. a 32x32x3 CIFAR-10 image), and an example volume of neurons in the first Convolutional layer. Each neuron in the convolutional layer is connected only to a local region in the input volume spatially, but to the full depth (i.e. all color channels). Note, there are multiple neurons (5 in this example) along the depth, all looking at the same region in the input - see discussion of depth columns in text below. Right: The neurons from the Neural Network chapter remain unchanged: They still compute a dot product of their weights with the input followed by a non-linearity, but their connectivity is now restricted to be local spatially. + 좌: 입력 볼륨(붉은색, 32x32x3 크기의 CIFAR-10 이미지)과 첫번째 컨볼루션 레이어 볼륨. 컨볼루션 레이어의 각 뉴런은 입력 볼륨의 일부 영역에만 연결된다 (가로/세로 공간 차원으로는 일부 연결, 깊이(컬러 채널) 차원은 모두 연결). 컨볼루션 레이어의 깊이 차원의 여러 뉴런 (그림에서 5개)들이 모두 입력의 같은 영역을 처리한다는 것을 기억하자 (깊이 차원과 관련해서는 아래에서 더 자세히 알아볼 것임). 우: 입력의 일부 영역에만 연결된다는 점을 제외하고는, 이전 신경망 챕터에서 다뤄지던 뉴런들과 똑같이 내적 연산과 비선형 함수로 이뤄진다.
-**Spatial arrangement**. We have explained the connectivity of each neuron in the Conv Layer to the input volume, but we haven't yet discussed how many neurons there are in the output volume or how they are arranged. Three hyperparameters control the size of the output volume: the **depth, stride** and **zero-padding**. We discuss these next: +**공간적 배치**. 지금까지는 컨볼루션 레이어의 한 뉴런과 입력 볼륨의 연결에 대해 알아보았다. 그러나 아직 출력 볼륨에 얼마나 많은 뉴런들이 있는지, 그리고 그 뉴런들이 어떤식으로 배치되는지는 다루지 않았다. 3개의 hyperparameter들이 출력 볼륨의 크기를 결정하게 된다. 그 3개 요소는 바로 **깊이, stride, 그리고 제로 패딩 (zero-padding)** 이다. 이들에 대해 알아보자: -1. First, the **depth** of the output volume is a hyperparameter that we can pick; It controls the number of neurons in the Conv layer that connect to the same region of the input volume. This is analogous to a regular Neural Network, where we had multiple neurons in a hidden layer all looking at the exact same input. As we will see, all of these neurons will learn to activate for different features in the input. For example, if the first Convolutional Layer takes as input the raw image, then different neurons along the depth dimension may activate in presence of various oriented edged, or blobs of color. We will refer to a set of neurons that are all looking at the same region of the input as a **depth column**. -2. Second, we must specify the **stride** with which we allocate depth columns around the spatial dimensions (width and height). When the stride is 1, then we will allocate a new depth column of neurons to spatial positions only 1 spatial unit apart. This will lead to heavily overlapping receptive fields between the columns, and also to large output volumes. Conversely, if we use higher strides then the receptive fields will overlap less and the resulting output volume will have smaller dimensions spatially. -3. As we will soon see, sometimes it will be convenient to pad the input with zeros spatially on the border of the input volume. The size of this **zero-padding** is a hyperparameter. The nice feature of zero padding is that it will allow us to control the spatial size of the output volumes. In particular, we will sometimes want to exactly preserve the spatial size of the input volume. +1. 먼저, 출력 볼륨의 **깊이** 는 우리가 결정할 수 있는 요소이다. 컨볼루션 레이어의 뉴런들 중 입력 볼륨 내 동일한 영역과 연결된 뉴런의 개수를 의미한다. 마치 일반 신경망에서 히든 레이어 내의 모든 뉴런들이 같은 입력값과 연결된 것과 비슷하다. 앞으로 살펴보겠지만, 이 뉴런들은 입력에 대해 서로 다른 특징 (feature)에 활성화된다 (activate). 예를 들어, 이미지를 입력으로 받는 첫 번째 컨볼루션 레이어의 경우, 깊이 축에 따른 각 뉴런들은 이미지의 서로 다른 엣지, 색깔, 블롭(blob) 등에 활성화된다. 앞으로는 인풋의 서로 같은 영역을 바라보는 뉴런들을 **깊이 컬럼 (depth column)**이라고 부르겠다. +2. 두 번째로 어떤 간격 (가로/세로의 공간적 간격) 으로 깊이 컬럼을 할당할 지를 의미하는 **stride**를 결정해야 한다. 만약 stride가 1이라면, 깊이 컬럼을 1칸마다 할당하게 된다 (한 칸 간격으로 깊이 컬럼 할당). 이럴 경우 각 깊이 컬럼들은 receptive field 상 넓은 영역이 겹치게 되고, 출력 볼륨의 크기도 매우 커지게 된다. 반대로, 큰 stride를 사용한다면 receptive field끼리 좁은 영역만 겹치게 되고 출력 볼륨도 작아지게 된다 (깊이는 작아지지 않고 가로/세로만 작아지게 됨). +3. 조만간 살펴보겠지만, 입력 볼륨의 가장자리를 0으로 패딩하는 것이 좋을 때가 있다. 이 **zero-padding**은 hyperparamter이다. zero-padding을 사용할 때의 장점은, 출력 볼륨의 공간적 크기(가로/세로)를 조절할 수 있다는 것이다. 특히 입력 볼륨의 공간적 크기를 유지하고 싶은 경우 (입력의 가로/세로 = 출력의 가로/세로) 사용하게 된다. We can compute the spatial size of the output volume as a function of the input volume size ($$W$$), the receptive field size of the Conv Layer neurons ($$F$$), the stride with which they are applied ($$S$$), and the amount of zero padding used ($$P$$) on the border. You can convince yourself that the correct formula for calculating how many neurons "fit" is given by $$(W - F + 2P)/S + 1$$. 
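(역자 주: 위 공식 $$(W - F + 2P)/S + 1$$은 원문에 없는 아래의 짧은 함수로 직접 계산해볼 수 있다.)

~~~python
def conv_output_size(W_in, F, S, P):
  # 출력 볼륨의 가로/세로 크기 = (W - F + 2P)/S + 1
  return (W_in - F + 2 * P) // S + 1

print(conv_output_size(32, 5, 1, 2))  # 32 : 제로 패딩 2로 입력의 공간 크기가 유지됨
print(conv_output_size(7, 3, 2, 0))   # 3
~~~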
If this number is not an integer, then the strides are set incorrectly and the neurons cannot be tiled so that they "fit" across the input volume neatly, in a symmetric way. An example might help to get intuitions for this formula: From 83db98fee28baf0e3cda4ba57d61aa5b89ce9765 Mon Sep 17 00:00:00 2001 From: myungsub Date: Sun, 10 Apr 2016 22:47:08 +0900 Subject: [PATCH 044/199] update glossary --- assignments2016/assignment1.md | 8 ++++---- glossary.md | 32 ++++++++++++++++---------------- 2 files changed, 20 insertions(+), 20 deletions(-) diff --git a/assignments2016/assignment1.md b/assignments2016/assignment1.md index 8195e6e0..be44ec01 100644 --- a/assignments2016/assignment1.md +++ b/assignments2016/assignment1.md @@ -5,8 +5,8 @@ permalink: /assignments2016/assignment1/ --- 이번 숙제에서 여러분은 간단한 이미지 분류 파이프라인을 k-Nearest neighbor 또는 SVM/Softmax 분류기에 기반하여 넣는 방법을 연습할 수 있습니다. 이번 숙제의 목표는 다음과 같습니다. -- **이미지 분류 파이프라인**의 기초와 데이터기반 접근법에 대해 이해합니다. -- 학습/확인/테스트의 분할과 **초모수 튜닝**를 위해 검증 데이터를 사용하는 것에 관해 이해합니다. +- **이미지 분류 파이프라인**의 기초와 데이터기반 접근법에 대해 이해합니다. +- 학습/확인/테스트의 분할과 **hyperparameter 튜닝**를 위해 검증 데이터를 사용하는 것에 관해 이해합니다. - 효율적으로 작성된 **벡터화**된 numpy 코드로 proficiency을 나타나게 합니다. - k-Nearest Neighbor (**kNN**) 분류기를 구현하고 적용해봅니다. - Multiclass Support Vector Machine (**SVM**) 분류기를 구현하고 적용해봅니다. @@ -59,7 +59,7 @@ CIFAR-10 data를 받았다면, `assignment1` 폴더의 IPython notebook server ### Q1: k-Nearest Neighbor 분류기 (20 points) -IPython Notebook **knn.ipynb**이 kNN 분류기를 구현하는 것을 안내합니다. +IPython Notebook **knn.ipynb**이 kNN 분류기를 구현하는 것을 안내합니다. ### Q2: Support Vector Machine 훈련 (25 points) @@ -78,4 +78,4 @@ IPython Notebook **two_layer_net.ipynb**이 two-layer neural network 분류기 IPython Notebook **features.ipynb**을 사용하여 단순한 이미지 픽셀(화소)보다 고차원의 표현이 효과적인지 검사합니다. ### Q6: 추가 과제: 뭔가 더 해보세요! (+10 points) -이번 과제와 관련된 다른 것들을 작성한 코드로 분석하고 연구해보세요. 예를 들어, 질문하고 싶은 흥미로운 질문이 있나요? 통찰력 있는 시각화를 작성할 수 있나요? 아니면 다른 재미있는 살펴볼 거리가 있나요? 또는 손실 함수(loss function)을 조금씩 변형해가며 실험해볼 수도 있을 것입니다. 만약 다른 멋있는 것을 시도해본다면 추가로 10 points를 얻을 수 있고 강의에 수행한 결과가 실릴 수 있습니다. \ No newline at end of file +이번 과제와 관련된 다른 것들을 작성한 코드로 분석하고 연구해보세요. 예를 들어, 질문하고 싶은 흥미로운 질문이 있나요? 통찰력 있는 시각화를 작성할 수 있나요? 아니면 다른 재미있는 살펴볼 거리가 있나요? 또는 손실 함수(loss function)을 조금씩 변형해가며 실험해볼 수도 있을 것입니다. 만약 다른 멋있는 것을 시도해본다면 추가로 10 points를 얻을 수 있고 강의에 수행한 결과가 실릴 수 있습니다. diff --git a/glossary.md b/glossary.md index 3c82a1c4..068e590d 100644 --- a/glossary.md +++ b/glossary.md @@ -6,8 +6,6 @@ permalink: /glossary/ 영어 --> 한글 번역시 용어의 통일성을 위한 단어장입니다. 새로운 용어에 대한 추가는 GitHub에 이슈를 파서 서로 논의해 보고 정하도록 하면 좋을 것 같습니다. -Markdown 형식의 table이 제대로 렌더링이 안되네요.. 그래서 우선 그냥 html로 표를 그려놓았습니다. 더 깔끔한 방안이 떠오르시는 분들께서는 역시 이슈/PR 부탁드립니다. - @@ -22,8 +20,8 @@ Markdown 형식의 table이 제대로 렌더링이 안되네요.. 그래서 우 - - + + @@ -32,7 +30,7 @@ Markdown 형식의 table이 제대로 렌더링이 안되네요.. 그래서 우 - + @@ -41,13 +39,14 @@ Markdown 형식의 table이 제대로 렌더링이 안되네요.. 그래서 우 - - + + - + + @@ -68,24 +67,25 @@ Markdown 형식의 table이 제대로 렌더링이 안되네요.. 그래서 우 + - + - - + + - + - - - + + + - +
 <tr><td>Backpropagation</td> <td>(영어 그대로)</td></tr>
 <tr><td>Batch</td> <td>배치</td></tr>
 <tr><td>Batch normalization</td> <td>배치 정규화</td></tr>
-<tr><td>Bias</td> <td></td></tr>
-<tr><td>Binary</td> <td>이진</td></tr>
+<tr><td>Bias</td> <td>(영어 그대로)</td></tr>
+<tr><td>Binary</td> <td>(?)</td></tr>
 <tr><td>Chain rule</td> <td>연쇄 법칙</td></tr>
 <tr><td>Class</td> <td>클래스</td></tr>
 <tr><td>Classification</td> <td>분류</td></tr>
 <tr><td>Convolution</td> <td>컨볼루션</td></tr>
 <tr><td>Convolutional neural network</td> <td>컨볼루션 신경망</td></tr>
 <tr><td>Covariance</td> <td>공분산</td></tr>
-<tr><td>Cross entropy</td> <td></td></tr>
+<tr><td>Cross entropy</td> <td>(영어 그대로)</td></tr>
 <tr><td>Cross validation</td> <td>교차 검증</td></tr>
 <tr><td>Depth</td> <td>깊이</td></tr>
 <tr><td>Derivative</td> <td>미분값, 도함수</td></tr>
 <tr><td>Evaluate</td> <td>평가하다</td></tr>
 <tr><td>Feature</td> <td>특징, 표현(?)</td></tr>
 <tr><td>Filter</td> <td>필터</td></tr>
-<tr><td>Forward propagation</td> <td></td></tr>
-<tr><td>Fully-connected</td> <td></td></tr>
+<tr><td>Forward propagation</td> <td>(영어 그대로)</td></tr>
+<tr><td>Fully-connected</td> <td>(영어 그대로)</td></tr>
 <tr><td>Gate</td> <td>게이트</td></tr>
 <tr><td>Gradient</td> <td>그라디언트</td></tr>
 <tr><td>GRU</td> <td>(영어 그대로)</td></tr>
-<tr><td>Hyperparameter</td> <td></td></tr>
+<tr><td>Hyperparameter</td> <td>(영어 그대로)</td></tr>
 <tr><td>Image</td> <td>이미지</td></tr>
 <tr><td>Implement</td> <td>구현하다</td></tr>
 <tr><td>Initialization</td> <td>초기화</td></tr>
 <tr><td>Iteration</td> <td>반복</td></tr>
 <tr><td>Label</td> <td>라벨</td></tr>
 <tr><td>Padding</td> <td>패딩</td></tr>
 <tr><td>Parameter</td> <td>파라미터</td></tr>
 <tr><td>Performance</td> <td>성능</td></tr>
 <tr><td>Pixel</td> <td>픽셀, 화소</td></tr>
 <tr><td>Pooling</td> <td>풀링</td></tr>
 <tr><td>Preprocessing</td> <td>전처리</td></tr>
-<tr><td>Receptive Field</td> <td></td></tr>
+<tr><td>Receptive Field</td> <td>(영어 그대로)</td></tr>
 <tr><td>Regression</td> <td>회귀</td></tr>
-<tr><td>Regularization</td> <td>?</td></tr>
-<tr><td>ReLU</td> <td></td></tr>
+<tr><td>Regularization</td> <td>(영어 그대로)</td></tr>
+<tr><td>ReLU</td> <td>(영어 그대로)</td></tr>
 <tr><td>Representation</td> <td>표현</td></tr>
-<tr><td>Recurrent neural network (RNN)</td> <td>회귀신경망(?)</td></tr>
+<tr><td>Recurrent neural network (RNN)</td> <td>회귀신경망, RNN</td></tr>
 <tr><td>Row vector</td> <td>행 벡터</td></tr>
 <tr><td>Score</td> <td>스코어, 점수</td></tr>
-<tr><td>Sigmoid</td> <td></td></tr>
-<tr><td>Softmax</td> <td></td></tr>
-<tr><td>Spatial</td> <td></td></tr>
+<tr><td>Sigmoid</td> <td>(영어 그대로)</td></tr>
+<tr><td>Softmax</td> <td>(영어 그대로)</td></tr>
 <tr><td>Training</td> <td>학습, 트레이닝</td></tr>
 <tr><td>Tuning</td> <td>튜닝</td></tr>
 <tr><td>Validation</td> <td>검증</td></tr>
 <tr><td>Variable</td> <td>변수</td></tr>
 <tr><td>Visualization</td> <td>시각화</td></tr>
-<tr><td>Weights</td> <td>파라미터 값</td></tr>
+<tr><td>Weights</td> <td>파라미터 값, 가중치 (문맥상 사용되는 의미에 따라)</td></tr>
From be3972801f5a2b115ba1919a59c18faab19ef944 Mon Sep 17 00:00:00 2001 From: myungsub Date: Sun, 10 Apr 2016 23:55:50 +0900 Subject: [PATCH 045/199] Add assignment files & assignment3 translation --- assignments2016/assignment1.md | 26 +- assignments2016/assignment1/.gitignore | 3 + assignments2016/assignment1/README.md | 1 + .../assignment1/collectSubmission.sh | 2 + .../assignment1/cs231n/__init__.py | 0 .../cs231n/classifiers/__init__.py | 2 + .../cs231n/classifiers/k_nearest_neighbor.py | 170 ++++ .../cs231n/classifiers/linear_classifier.py | 130 +++ .../cs231n/classifiers/linear_svm.py | 92 ++ .../cs231n/classifiers/neural_net.py | 218 ++++ .../assignment1/cs231n/classifiers/softmax.py | 62 ++ .../assignment1/cs231n/data_utils.py | 158 +++ .../assignment1/cs231n/datasets/.gitignore | 4 + .../cs231n/datasets/get_datasets.sh | 4 + .../assignment1/cs231n/features.py | 148 +++ .../assignment1/cs231n/gradient_check.py | 124 +++ .../assignment1/cs231n/vis_utils.py | 73 ++ assignments2016/assignment1/features.ipynb | 340 +++++++ assignments2016/assignment1/frameworkpython | 13 + assignments2016/assignment1/knn.ipynb | 459 +++++++++ assignments2016/assignment1/requirements.txt | 46 + assignments2016/assignment1/softmax.ipynb | 308 ++++++ .../assignment1/start_ipython_osx.sh | 4 + assignments2016/assignment1/svm.ipynb | 568 +++++++++++ .../assignment1/two_layer_net.ipynb | 456 +++++++++ assignments2016/assignment2/.gitignore | 3 + .../assignment2/BatchNormalization.ipynb | 516 ++++++++++ .../assignment2/ConvolutionalNetworks.ipynb | 869 ++++++++++++++++ assignments2016/assignment2/Dropout.ipynb | 275 +++++ .../assignment2/FullyConnectedNets.ipynb | 941 ++++++++++++++++++ assignments2016/assignment2/README.md | 128 +++ .../assignment2/collectSubmission.sh | 2 + assignments2016/assignment2/cs231n/.gitignore | 3 + .../assignment2/cs231n/__init__.py | 0 .../cs231n/classifiers/__init__.py | 0 .../assignment2/cs231n/classifiers/cnn.py | 105 ++ .../assignment2/cs231n/classifiers/fc_net.py | 250 +++++ .../assignment2/cs231n/data_utils.py | 199 ++++ .../assignment2/cs231n/datasets/.gitignore | 4 + .../cs231n/datasets/get_datasets.sh | 4 + .../assignment2/cs231n/fast_layers.py | 270 +++++ .../assignment2/cs231n/gradient_check.py | 124 +++ assignments2016/assignment2/cs231n/im2col.py | 55 + .../assignment2/cs231n/im2col_cython.pyx | 121 +++ .../assignment2/cs231n/layer_utils.py | 93 ++ assignments2016/assignment2/cs231n/layers.py | 554 +++++++++++ assignments2016/assignment2/cs231n/optim.py | 149 +++ assignments2016/assignment2/cs231n/setup.py | 14 + assignments2016/assignment2/cs231n/solver.py | 266 +++++ .../assignment2/cs231n/vis_utils.py | 73 ++ assignments2016/assignment2/frameworkpython | 13 + assignments2016/assignment2/kitten.jpg | Bin 0 -> 21355 bytes assignments2016/assignment2/puppy.jpg | Bin 0 -> 38392 bytes assignments2016/assignment2/requirements.txt | 46 + .../assignment2/start_ipython_osx.sh | 4 + assignments2016/assignment3.md | 119 +-- assignments2016/assignment3/.gitignore | 3 + .../assignment3/ImageGeneration.ipynb | 511 ++++++++++ .../assignment3/ImageGradients.ipynb | 383 +++++++ .../assignment3/LSTM_Captioning.ipynb | 483 +++++++++ .../assignment3/RNN_Captioning.ipynb | 659 ++++++++++++ .../assignment3/collectSubmission.sh | 2 + assignments2016/assignment3/cs231n/.gitignore | 3 + .../assignment3/cs231n/__init__.py | 0 .../assignment3/cs231n/captioning_solver.py | 233 +++++ .../cs231n/classifiers/__init__.py | 0 .../cs231n/classifiers/pretrained_cnn.py | 252 +++++ 
.../assignment3/cs231n/classifiers/rnn.py | 204 ++++ .../assignment3/cs231n/coco_utils.py | 84 ++ .../assignment3/cs231n/data_utils.py | 219 ++++ .../cs231n/datasets/get_coco_captioning.sh | 3 + .../cs231n/datasets/get_pretrained_model.sh | 1 + .../cs231n/datasets/get_tiny_imagenet_a.sh | 3 + .../assignment3/cs231n/fast_layers.py | 270 +++++ .../assignment3/cs231n/gradient_check.py | 124 +++ assignments2016/assignment3/cs231n/im2col.py | 55 + .../assignment3/cs231n/im2col_cython.pyx | 121 +++ .../assignment3/cs231n/image_utils.py | 98 ++ .../assignment3/cs231n/layer_utils.py | 141 +++ assignments2016/assignment3/cs231n/layers.py | 302 ++++++ assignments2016/assignment3/cs231n/optim.py | 85 ++ .../assignment3/cs231n/rnn_layers.py | 420 ++++++++ assignments2016/assignment3/cs231n/setup.py | 14 + assignments2016/assignment3/frameworkpython | 13 + assignments2016/assignment3/kitten.jpg | Bin 0 -> 21355 bytes assignments2016/assignment3/requirements.txt | 46 + assignments2016/assignment3/sky.jpg | Bin 0 -> 148465 bytes .../assignment3/start_ipython_osx.sh | 4 + 88 files changed, 13253 insertions(+), 94 deletions(-) create mode 100644 assignments2016/assignment1/.gitignore create mode 100644 assignments2016/assignment1/README.md create mode 100644 assignments2016/assignment1/collectSubmission.sh create mode 100644 assignments2016/assignment1/cs231n/__init__.py create mode 100644 assignments2016/assignment1/cs231n/classifiers/__init__.py create mode 100644 assignments2016/assignment1/cs231n/classifiers/k_nearest_neighbor.py create mode 100644 assignments2016/assignment1/cs231n/classifiers/linear_classifier.py create mode 100644 assignments2016/assignment1/cs231n/classifiers/linear_svm.py create mode 100644 assignments2016/assignment1/cs231n/classifiers/neural_net.py create mode 100644 assignments2016/assignment1/cs231n/classifiers/softmax.py create mode 100644 assignments2016/assignment1/cs231n/data_utils.py create mode 100644 assignments2016/assignment1/cs231n/datasets/.gitignore create mode 100755 assignments2016/assignment1/cs231n/datasets/get_datasets.sh create mode 100644 assignments2016/assignment1/cs231n/features.py create mode 100644 assignments2016/assignment1/cs231n/gradient_check.py create mode 100644 assignments2016/assignment1/cs231n/vis_utils.py create mode 100644 assignments2016/assignment1/features.ipynb create mode 100755 assignments2016/assignment1/frameworkpython create mode 100644 assignments2016/assignment1/knn.ipynb create mode 100644 assignments2016/assignment1/requirements.txt create mode 100644 assignments2016/assignment1/softmax.ipynb create mode 100755 assignments2016/assignment1/start_ipython_osx.sh create mode 100644 assignments2016/assignment1/svm.ipynb create mode 100644 assignments2016/assignment1/two_layer_net.ipynb create mode 100644 assignments2016/assignment2/.gitignore create mode 100644 assignments2016/assignment2/BatchNormalization.ipynb create mode 100644 assignments2016/assignment2/ConvolutionalNetworks.ipynb create mode 100644 assignments2016/assignment2/Dropout.ipynb create mode 100644 assignments2016/assignment2/FullyConnectedNets.ipynb create mode 100644 assignments2016/assignment2/README.md create mode 100755 assignments2016/assignment2/collectSubmission.sh create mode 100644 assignments2016/assignment2/cs231n/.gitignore create mode 100644 assignments2016/assignment2/cs231n/__init__.py create mode 100644 assignments2016/assignment2/cs231n/classifiers/__init__.py create mode 100644 assignments2016/assignment2/cs231n/classifiers/cnn.py create mode 
100644 assignments2016/assignment2/cs231n/classifiers/fc_net.py create mode 100644 assignments2016/assignment2/cs231n/data_utils.py create mode 100644 assignments2016/assignment2/cs231n/datasets/.gitignore create mode 100755 assignments2016/assignment2/cs231n/datasets/get_datasets.sh create mode 100644 assignments2016/assignment2/cs231n/fast_layers.py create mode 100644 assignments2016/assignment2/cs231n/gradient_check.py create mode 100644 assignments2016/assignment2/cs231n/im2col.py create mode 100644 assignments2016/assignment2/cs231n/im2col_cython.pyx create mode 100644 assignments2016/assignment2/cs231n/layer_utils.py create mode 100644 assignments2016/assignment2/cs231n/layers.py create mode 100644 assignments2016/assignment2/cs231n/optim.py create mode 100644 assignments2016/assignment2/cs231n/setup.py create mode 100644 assignments2016/assignment2/cs231n/solver.py create mode 100644 assignments2016/assignment2/cs231n/vis_utils.py create mode 100755 assignments2016/assignment2/frameworkpython create mode 100644 assignments2016/assignment2/kitten.jpg create mode 100644 assignments2016/assignment2/puppy.jpg create mode 100644 assignments2016/assignment2/requirements.txt create mode 100755 assignments2016/assignment2/start_ipython_osx.sh create mode 100644 assignments2016/assignment3/.gitignore create mode 100644 assignments2016/assignment3/ImageGeneration.ipynb create mode 100644 assignments2016/assignment3/ImageGradients.ipynb create mode 100644 assignments2016/assignment3/LSTM_Captioning.ipynb create mode 100644 assignments2016/assignment3/RNN_Captioning.ipynb create mode 100755 assignments2016/assignment3/collectSubmission.sh create mode 100644 assignments2016/assignment3/cs231n/.gitignore create mode 100644 assignments2016/assignment3/cs231n/__init__.py create mode 100644 assignments2016/assignment3/cs231n/captioning_solver.py create mode 100644 assignments2016/assignment3/cs231n/classifiers/__init__.py create mode 100644 assignments2016/assignment3/cs231n/classifiers/pretrained_cnn.py create mode 100644 assignments2016/assignment3/cs231n/classifiers/rnn.py create mode 100644 assignments2016/assignment3/cs231n/coco_utils.py create mode 100644 assignments2016/assignment3/cs231n/data_utils.py create mode 100755 assignments2016/assignment3/cs231n/datasets/get_coco_captioning.sh create mode 100755 assignments2016/assignment3/cs231n/datasets/get_pretrained_model.sh create mode 100755 assignments2016/assignment3/cs231n/datasets/get_tiny_imagenet_a.sh create mode 100644 assignments2016/assignment3/cs231n/fast_layers.py create mode 100644 assignments2016/assignment3/cs231n/gradient_check.py create mode 100644 assignments2016/assignment3/cs231n/im2col.py create mode 100644 assignments2016/assignment3/cs231n/im2col_cython.pyx create mode 100644 assignments2016/assignment3/cs231n/image_utils.py create mode 100644 assignments2016/assignment3/cs231n/layer_utils.py create mode 100644 assignments2016/assignment3/cs231n/layers.py create mode 100644 assignments2016/assignment3/cs231n/optim.py create mode 100644 assignments2016/assignment3/cs231n/rnn_layers.py create mode 100644 assignments2016/assignment3/cs231n/setup.py create mode 100755 assignments2016/assignment3/frameworkpython create mode 100644 assignments2016/assignment3/kitten.jpg create mode 100644 assignments2016/assignment3/requirements.txt create mode 100644 assignments2016/assignment3/sky.jpg create mode 100755 assignments2016/assignment3/start_ipython_osx.sh diff --git a/assignments2016/assignment1.md 
b/assignments2016/assignment1.md index be44ec01..10a22c7d 100644 --- a/assignments2016/assignment1.md +++ b/assignments2016/assignment1.md @@ -25,10 +25,10 @@ Terminal에는 우리의 수업을 위한 서브도메인이 만들어져 있습니다. [여기](http://vision.stanford.edu/teaching/cs231n/winter1516_assignment1.zip)에서 압축파일을 다운받고 다음을 따르세요. **[선택 1] Use Anaconda:** -과학, 수학, 공학, 데이터 분석을 위한 다양하고 유명한 패키지들을 담고있는 [Anaconda](https://www.continuum.io/downloads)를 사용하여 설치하는 것이 즐겨 쓰이는 방법입니다. 설치가 다 되면 모든 요구사항을 넘기고 바로 숙제를 시작해도 좋습니다. +과학, 수학, 공학, 데이터 분석을 위한 대부분의 주요 패키지들을 담고있는 [Anaconda](https://www.continuum.io/downloads)를 사용하여 설치하는 것이 흔히 사용하는 방법입니다. 설치가 다 되면 모든 요구사항(dependency)을 넘기고 바로 숙제를 시작해도 좋습니다. -**[선택 2] Manual install, virtual environment:** -만약 Anaconda 대신 좀 더 일반적이고 위험한 방법을 택하고 싶다면 프로젝트를 위한 [virtual environment](http://docs.python-guide.org/en/latest/dev/virtualenvs/)를 만들 수 있습니다. 만약 virtual environment를 사용하지 않는다면 모든 코드가 컴퓨터에 전역적으로 종속되게 설치됩니다. virtual environment의 설정은 아래를 참조하세요. +**[선택 2] 수동 설치, virtual environment:** +만약 Anaconda 대신 좀 더 일반적이면서 까다로운 방법을 택하고 싶다면 이번 과제를 위한 [virtual environment](http://docs.python-guide.org/en/latest/dev/virtualenvs/)를 만들 수 있습니다. 만약 virtual environment를 사용하지 않는다면 모든 코드가 컴퓨터에 전역적으로 종속되게 설치됩니다. Virtual environment의 설정은 아래를 참조하세요. ~~~bash cd assignment1 @@ -40,42 +40,42 @@ pip install -r requirements.txt # dependencies 설치합니다. deactivate # virtual environment를 종료합니다. ~~~ -**Download data:** -먼저 숙제를 시작하기전에 CIFAR-10 dataset를 다운로드해야 합니다. 아래를 `assignment1` 폴더에서 실행하세요: +**데이터셋 다운로드:** +먼저 숙제를 시작하기 전에 CIFAR-10 dataset를 다운로드해야 합니다. 아래 코드를 `assignment1` 폴더에서 실행하세요: ~~~bash cd cs231n/datasets ./get_datasets.sh ~~~ -**Start IPython:** +**IPython 시작:** CIFAR-10 data를 받았다면, `assignment1` 폴더의 IPython notebook server를 시작할 수 있습니다. IPython에 친숙하지 않다면 작성해둔 [IPython tutorial](/ipython-tutorial)를 읽어보는 것을 권장합니다. -**NOTE:** OSX에서 virtual environment를 실행하면, matplotlib 에러가 날 수 있습니다([이 문제에 관한 이슈](http://matplotlib.org/faq/virtualenv_faq.html)). IPython server를 `assignment1`폴더의 `start_ipython_osx.sh`라고 실행하면 이 문제를 피해갈 수 있습니다; 이 스크립트는 virtual environment가 `.env`라고 되어있다고 가정하고 작성되었습니다. +**NOTE:** OSX에서 virtual environment를 실행하면, matplotlib 에러가 날 수 있습니다([이 문제에 관한 이슈](http://matplotlib.org/faq/virtualenv_faq.html)). IPython 서버를 `assignment1`폴더의 `start_ipython_osx.sh`로 실행하면 이 문제를 피해갈 수 있습니다; 이 스크립트는 virtual environment가 `.env`라고 되어있다고 가정하고 작성되었습니다. ### 과제 제출: -로컬 환경이나 Terminal에서 숙제를 마쳤다면 `collectSubmission.sh`스크립트를 실행합니다; 이 스크립트는 `assignment1.zip`파일을 만듭니다. 이 파일을 [the coursework](https://coursework.stanford.edu/portal/site/W16-CS-231N-01/)에 업로드하세요. +로컬 환경이나 Terminal에 상관없이, 이번 숙제를 마쳤다면 `collectSubmission.sh`스크립트를 실행하세요. 이 스크립트는 `assignment1.zip`파일을 만듭니다. 이 파일을 [the coursework](https://coursework.stanford.edu/portal/site/W16-CS-231N-01/)에 업로드하세요. ### Q1: k-Nearest Neighbor 분류기 (20 points) -IPython Notebook **knn.ipynb**이 kNN 분류기를 구현하는 것을 안내합니다. +IPython Notebook **knn.ipynb**이 kNN 분류기를 구현하는 방법을 안내합니다. ### Q2: Support Vector Machine 훈련 (25 points) -IPython Notebook **svm.ipynb**이 SVM 분류기를 구현하는 것을 안내합니다. +IPython Notebook **svm.ipynb**이 SVM 분류기를 구현하는 방법을 안내합니다. ### Q3: Softmax 분류기 구현하기 (20 points) -IPython Notebook **softmax.ipynb**이 Softmax 분류기를 구현하는 것을 안내합니다. +IPython Notebook **softmax.ipynb**이 Softmax 분류기를 구현하는 방법을 안내합니다. ### Q4: Two-Layer Neural Network (25 points) -IPython Notebook **two_layer_net.ipynb**이 two-layer neural network 분류기를 구현하는 것을 안내합니다.
+IPython Notebook **two_layer_net.ipynb**이 two-layer neural network 분류기를 구현하는 방법을 안내합니다. ### Q5: 이미지 특징을 고차원으로 표현하기 (10 points) -IPython Notebook **features.ipynb**을 사용하여 단순한 이미지 픽셀(화소)보다 고차원의 표현이 효과적인지 검사합니다. +IPython Notebook **features.ipynb**을 사용하여 단순한 이미지 픽셀(화소)보다 고차원의 표현이 효과적인지 검사해 볼 것입니다. ### Q6: 추가 과제: 뭔가 더 해보세요! (+10 points) 이번 과제와 관련된 다른 것들을 작성한 코드로 분석하고 연구해보세요. 예를 들어, 던져보고 싶은 흥미로운 질문이 있나요? 통찰력 있는 시각화를 작성할 수 있나요? 아니면 다른 재미있는 살펴볼 거리가 있나요? 또는 손실 함수(loss function)를 조금씩 변형해가며 실험해볼 수도 있을 것입니다. 만약 다른 멋있는 것을 시도해본다면 추가로 10 points를 얻을 수 있고, 수행한 결과가 강의에 실릴 수도 있습니다. diff --git a/assignments2016/assignment1/.gitignore b/assignments2016/assignment1/.gitignore new file mode 100644 index 00000000..b0611d38 --- /dev/null +++ b/assignments2016/assignment1/.gitignore @@ -0,0 +1,3 @@ +*.swp +*.pyc +.env/* diff --git a/assignments2016/assignment1/README.md b/assignments2016/assignment1/README.md new file mode 100644 index 00000000..6aaea415 --- /dev/null +++ b/assignments2016/assignment1/README.md @@ -0,0 +1 @@ +Details about this assignment can be found [on the course webpage](http://cs231n.github.io/), under Assignment #1 of Winter 2016. diff --git a/assignments2016/assignment1/collectSubmission.sh b/assignments2016/assignment1/collectSubmission.sh new file mode 100644 index 00000000..13219057 --- /dev/null +++ b/assignments2016/assignment1/collectSubmission.sh @@ -0,0 +1,2 @@ +rm -f assignment1.zip +zip -r assignment1.zip . -x "*.git*" "*cs231n/datasets*" "*.ipynb_checkpoints*" "*README.md" "*collectSubmission.sh" "*requirements.txt" diff --git a/assignments2016/assignment1/cs231n/__init__.py b/assignments2016/assignment1/cs231n/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/assignments2016/assignment1/cs231n/classifiers/__init__.py b/assignments2016/assignment1/cs231n/classifiers/__init__.py new file mode 100644 index 00000000..cef2b580 --- /dev/null +++ b/assignments2016/assignment1/cs231n/classifiers/__init__.py @@ -0,0 +1,2 @@ +from cs231n.classifiers.k_nearest_neighbor import * +from cs231n.classifiers.linear_classifier import * diff --git a/assignments2016/assignment1/cs231n/classifiers/k_nearest_neighbor.py b/assignments2016/assignment1/cs231n/classifiers/k_nearest_neighbor.py new file mode 100644 index 00000000..7b592485 --- /dev/null +++ b/assignments2016/assignment1/cs231n/classifiers/k_nearest_neighbor.py @@ -0,0 +1,170 @@ +import numpy as np + +class KNearestNeighbor(object): + """ a kNN classifier with L2 distance """ + + def __init__(self): + pass + + def train(self, X, y): + """ + Train the classifier. For k-nearest neighbors this is just + memorizing the training data. + + Inputs: + - X: A numpy array of shape (num_train, D) containing the training data + consisting of num_train samples each of dimension D. + - y: A numpy array of shape (num_train,) containing the training labels, where + y[i] is the label for X[i]. + """ + self.X_train = X + self.y_train = y + + def predict(self, X, k=1, num_loops=0): + """ + Predict labels for test data using this classifier. + + Inputs: + - X: A numpy array of shape (num_test, D) containing test data consisting + of num_test samples each of dimension D. + - k: The number of nearest neighbors that vote for the predicted labels. + - num_loops: Determines which implementation to use to compute distances + between training points and testing points. + + Returns: + - y: A numpy array of shape (num_test,) containing predicted labels for the + test data, where y[i] is the predicted label for the test point X[i].
+ """ + if num_loops == 0: + dists = self.compute_distances_no_loops(X) + elif num_loops == 1: + dists = self.compute_distances_one_loop(X) + elif num_loops == 2: + dists = self.compute_distances_two_loops(X) + else: + raise ValueError('Invalid value %d for num_loops' % num_loops) + + return self.predict_labels(dists, k=k) + + def compute_distances_two_loops(self, X): + """ + Compute the distance between each test point in X and each training point + in self.X_train using a nested loop over both the training data and the + test data. + + Inputs: + - X: A numpy array of shape (num_test, D) containing test data. + + Returns: + - dists: A numpy array of shape (num_test, num_train) where dists[i, j] + is the Euclidean distance between the ith test point and the jth training + point. + """ + num_test = X.shape[0] + num_train = self.X_train.shape[0] + dists = np.zeros((num_test, num_train)) + for i in xrange(num_test): + for j in xrange(num_train): + ##################################################################### + # TODO: # + # Compute the l2 distance between the ith test point and the jth # + # training point, and store the result in dists[i, j]. You should # + # not use a loop over dimension. # + ##################################################################### + pass + ##################################################################### + # END OF YOUR CODE # + ##################################################################### + return dists + + def compute_distances_one_loop(self, X): + """ + Compute the distance between each test point in X and each training point + in self.X_train using a single loop over the test data. + + Input / Output: Same as compute_distances_two_loops + """ + num_test = X.shape[0] + num_train = self.X_train.shape[0] + dists = np.zeros((num_test, num_train)) + for i in xrange(num_test): + ####################################################################### + # TODO: # + # Compute the l2 distance between the ith test point and all training # + # points, and store the result in dists[i, :]. # + ####################################################################### + pass + ####################################################################### + # END OF YOUR CODE # + ####################################################################### + return dists + + def compute_distances_no_loops(self, X): + """ + Compute the distance between each test point in X and each training point + in self.X_train using no explicit loops. + + Input / Output: Same as compute_distances_two_loops + """ + num_test = X.shape[0] + num_train = self.X_train.shape[0] + dists = np.zeros((num_test, num_train)) + ######################################################################### + # TODO: # + # Compute the l2 distance between all test points and all training # + # points without using any explicit loops, and store the result in # + # dists. # + # # + # You should implement this function using only basic array operations; # + # in particular you should not use functions from scipy. # + # # + # HINT: Try to formulate the l2 distance using matrix multiplication # + # and two broadcast sums. 
# + ######################################################################### + pass + ######################################################################### + # END OF YOUR CODE # + ######################################################################### + return dists + + def predict_labels(self, dists, k=1): + """ + Given a matrix of distances between test points and training points, + predict a label for each test point. + + Inputs: + - dists: A numpy array of shape (num_test, num_train) where dists[i, j] + gives the distance between the ith test point and the jth training point. + + Returns: + - y: A numpy array of shape (num_test,) containing predicted labels for the + test data, where y[i] is the predicted label for the test point X[i]. + """ + num_test = dists.shape[0] + y_pred = np.zeros(num_test) + for i in xrange(num_test): + # A list of length k storing the labels of the k nearest neighbors to + # the ith test point. + closest_y = [] + ######################################################################### + # TODO: # + # Use the distance matrix to find the k nearest neighbors of the ith # + # testing point, and use self.y_train to find the labels of these # + # neighbors. Store these labels in closest_y. # + # Hint: Look up the function numpy.argsort. # + ######################################################################### + pass + ######################################################################### + # TODO: # + # Now that you have found the labels of the k nearest neighbors, you # + # need to find the most common label in the list closest_y of labels. # + # Store this label in y_pred[i]. Break ties by choosing the smaller # + # label. # + ######################################################################### + pass + ######################################################################### + # END OF YOUR CODE # + ######################################################################### + + return y_pred + diff --git a/assignments2016/assignment1/cs231n/classifiers/linear_classifier.py b/assignments2016/assignment1/cs231n/classifiers/linear_classifier.py new file mode 100644 index 00000000..8e820903 --- /dev/null +++ b/assignments2016/assignment1/cs231n/classifiers/linear_classifier.py @@ -0,0 +1,130 @@ +import numpy as np +from cs231n.classifiers.linear_svm import * +from cs231n.classifiers.softmax import * + +class LinearClassifier(object): + + def __init__(self): + self.W = None + + def train(self, X, y, learning_rate=1e-3, reg=1e-5, num_iters=100, + batch_size=200, verbose=False): + """ + Train this linear classifier using stochastic gradient descent. + + Inputs: + - X: A numpy array of shape (N, D) containing training data; there are N + training samples each of dimension D. + - y: A numpy array of shape (N,) containing training labels; y[i] = c + means that X[i] has label 0 <= c < C for C classes. + - learning_rate: (float) learning rate for optimization. + - reg: (float) regularization strength. + - num_iters: (integer) number of steps to take when optimizing + - batch_size: (integer) number of training examples to use at each step. + - verbose: (boolean) If true, print progress during optimization. + + Outputs: + A list containing the value of the loss function at each training iteration.
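The two predict_labels TODOs a few lines above (collect the k nearest labels, then majority-vote with ties broken toward the smaller label) reduce to an argsort plus a bincount. A sketch under those assumptions, with a hypothetical helper name:

~~~python
import numpy as np

def vote_label(dists_row, y_train, k):
    """Majority vote among the k nearest training labels for one test point."""
    closest_y = y_train[np.argsort(dists_row)[:k]]  # labels of the k nearest neighbors
    counts = np.bincount(closest_y)                 # votes per (non-negative integer) label
    return np.argmax(counts)                        # argmax picks the smallest label on ties
~~~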
+ """ + num_train, dim = X.shape + num_classes = np.max(y) + 1 # assume y takes values 0...K-1 where K is number of classes + if self.W is None: + # lazily initialize W + self.W = 0.001 * np.random.randn(dim, num_classes) + + # Run stochastic gradient descent to optimize W + loss_history = [] + for it in xrange(num_iters): + X_batch = None + y_batch = None + + ######################################################################### + # TODO: # + # Sample batch_size elements from the training data and their # + # corresponding labels to use in this round of gradient descent. # + # Store the data in X_batch and their corresponding labels in # + # y_batch; after sampling X_batch should have shape (dim, batch_size) # + # and y_batch should have shape (batch_size,) # + # # + # Hint: Use np.random.choice to generate indices. Sampling with # + # replacement is faster than sampling without replacement. # + ######################################################################### + pass + ######################################################################### + # END OF YOUR CODE # + ######################################################################### + + # evaluate loss and gradient + loss, grad = self.loss(X_batch, y_batch, reg) + loss_history.append(loss) + + # perform parameter update + ######################################################################### + # TODO: # + # Update the weights using the gradient and the learning rate. # + ######################################################################### + pass + ######################################################################### + # END OF YOUR CODE # + ######################################################################### + + if verbose and it % 100 == 0: + print 'iteration %d / %d: loss %f' % (it, num_iters, loss) + + return loss_history + + def predict(self, X): + """ + Use the trained weights of this linear classifier to predict labels for + data points. + + Inputs: + - X: D x N array of training data. Each column is a D-dimensional point. + + Returns: + - y_pred: Predicted labels for the data in X. y_pred is a 1-dimensional + array of length N, and each element is an integer giving the predicted + class. + """ + y_pred = np.zeros(X.shape[1]) + ########################################################################### + # TODO: # + # Implement this method. Store the predicted labels in y_pred. # + ########################################################################### + pass + ########################################################################### + # END OF YOUR CODE # + ########################################################################### + return y_pred + + def loss(self, X_batch, y_batch, reg): + """ + Compute the loss function and its derivative. + Subclasses will override this. + + Inputs: + - X_batch: A numpy array of shape (N, D) containing a minibatch of N + data points; each point has dimension D. + - y_batch: A numpy array of shape (N,) containing labels for the minibatch. + - reg: (float) regularization strength. 
+ + Returns: A tuple containing: + - loss as a single float + - gradient with respect to self.W; an array of the same shape as W + """ + pass + + +class LinearSVM(LinearClassifier): + """ A subclass that uses the Multiclass SVM loss function """ + + def loss(self, X_batch, y_batch, reg): + return svm_loss_vectorized(self.W, X_batch, y_batch, reg) + + +class Softmax(LinearClassifier): + """ A subclass that uses the Softmax + Cross-entropy loss function """ + + def loss(self, X_batch, y_batch, reg): + return softmax_loss_vectorized(self.W, X_batch, y_batch, reg) + diff --git a/assignments2016/assignment1/cs231n/classifiers/linear_svm.py b/assignments2016/assignment1/cs231n/classifiers/linear_svm.py new file mode 100644 index 00000000..19ab753f --- /dev/null +++ b/assignments2016/assignment1/cs231n/classifiers/linear_svm.py @@ -0,0 +1,92 @@ +import numpy as np +from random import shuffle + +def svm_loss_naive(W, X, y, reg): + """ + Structured SVM loss function, naive implementation (with loops). + + Inputs have dimension D, there are C classes, and we operate on minibatches + of N examples. + + Inputs: + - W: A numpy array of shape (D, C) containing weights. + - X: A numpy array of shape (N, D) containing a minibatch of data. + - y: A numpy array of shape (N,) containing training labels; y[i] = c means + that X[i] has label c, where 0 <= c < C. + - reg: (float) regularization strength + + Returns a tuple of: + - loss as single float + - gradient with respect to weights W; an array of same shape as W + """ + dW = np.zeros(W.shape) # initialize the gradient as zero + + # compute the loss and the gradient + num_classes = W.shape[1] + num_train = X.shape[0] + loss = 0.0 + for i in xrange(num_train): + scores = X[i].dot(W) + correct_class_score = scores[y[i]] + for j in xrange(num_classes): + if j == y[i]: + continue + margin = scores[j] - correct_class_score + 1 # note delta = 1 + if margin > 0: + loss += margin + + # Right now the loss is a sum over all training examples, but we want it + # to be an average instead so we divide by num_train. + loss /= num_train + + # Add regularization to the loss. + loss += 0.5 * reg * np.sum(W * W) + + ############################################################################# + # TODO: # + # Compute the gradient of the loss function and store it in dW. # + # Rather than first computing the loss and then computing the derivative, # + # it may be simpler to compute the derivative at the same time that the # + # loss is being computed. As a result you may need to modify some of the # + # code above to compute the gradient. # + ############################################################################# + + + return loss, dW + + +def svm_loss_vectorized(W, X, y, reg): + """ + Structured SVM loss function, vectorized implementation. + + Inputs and outputs are the same as svm_loss_naive. + """ + loss = 0.0 + dW = np.zeros(W.shape) # initialize the gradient as zero + + ############################################################################# + # TODO: # + # Implement a vectorized version of the structured SVM loss, storing the # + # result in loss.
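For the vectorized hinge loss requested above, one standard formulation computes all margins at once and reuses the positive-margin mask for the gradient. A sketch under the same interface as svm_loss_naive; the `_sketch` suffix marks it as illustrative, not the graded solution.

~~~python
import numpy as np

def svm_loss_vectorized_sketch(W, X, y, reg):
    """Multiclass SVM loss and gradient with no explicit loops."""
    num_train = X.shape[0]
    scores = X.dot(W)                                         # (N, C)
    correct = scores[np.arange(num_train), y][:, np.newaxis]  # (N, 1) true-class scores
    margins = np.maximum(0, scores - correct + 1.0)           # note delta = 1
    margins[np.arange(num_train), y] = 0                      # the true class adds no margin
    loss = margins.sum() / num_train + 0.5 * reg * np.sum(W * W)
    # Each positive margin adds X[i] to column j and subtracts X[i] from column y[i].
    mask = (margins > 0).astype(float)
    mask[np.arange(num_train), y] = -mask.sum(axis=1)
    dW = X.T.dot(mask) / num_train + reg * W
    return loss, dW
~~~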
# + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + + + ############################################################################# + # TODO: # + # Implement a vectorized version of the gradient for the structured SVM # + # loss, storing the result in dW. # + # # + # Hint: Instead of computing the gradient from scratch, it may be easier # + # to reuse some of the intermediate values that you used to compute the # + # loss. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + + return loss, dW diff --git a/assignments2016/assignment1/cs231n/classifiers/neural_net.py b/assignments2016/assignment1/cs231n/classifiers/neural_net.py new file mode 100644 index 00000000..94bbcd05 --- /dev/null +++ b/assignments2016/assignment1/cs231n/classifiers/neural_net.py @@ -0,0 +1,218 @@ +import numpy as np +import matplotlib.pyplot as plt + + +class TwoLayerNet(object): + """ + A two-layer fully-connected neural network. The net has an input dimension of + N, a hidden layer dimension of H, and performs classification over C classes. + We train the network with a softmax loss function and L2 regularization on the + weight matrices. The network uses a ReLU nonlinearity after the first fully + connected layer. + + In other words, the network has the following architecture: + + input - fully connected layer - ReLU - fully connected layer - softmax + + The outputs of the second fully-connected layer are the scores for each class. + """ + + def __init__(self, input_size, hidden_size, output_size, std=1e-4): + """ + Initialize the model. Weights are initialized to small random values and + biases are initialized to zero. Weights and biases are stored in the + variable self.params, which is a dictionary with the following keys: + + W1: First layer weights; has shape (D, H) + b1: First layer biases; has shape (H,) + W2: Second layer weights; has shape (H, C) + b2: Second layer biases; has shape (C,) + + Inputs: + - input_size: The dimension D of the input data. + - hidden_size: The number of neurons H in the hidden layer. + - output_size: The number of classes C. + """ + self.params = {} + self.params['W1'] = std * np.random.randn(input_size, hidden_size) + self.params['b1'] = np.zeros(hidden_size) + self.params['W2'] = std * np.random.randn(hidden_size, output_size) + self.params['b2'] = np.zeros(output_size) + + def loss(self, X, y=None, reg=0.0): + """ + Compute the loss and gradients for a two layer fully connected neural + network. + + Inputs: + - X: Input data of shape (N, D). Each X[i] is a training sample. + - y: Vector of training labels. y[i] is the label for X[i], and each y[i] is + an integer in the range 0 <= y[i] < C. This parameter is optional; if it + is not passed then we only return scores, and if it is passed then we + instead return the loss and gradients. + - reg: Regularization strength. + + Returns: + If y is None, return a matrix scores of shape (N, C) where scores[i, c] is + the score for class c on input X[i]. + + If y is not None, instead return a tuple of: + - loss: Loss (data loss and regularization loss) for this batch of training + samples. 
+ - grads: Dictionary mapping parameter names to gradients of those parameters + with respect to the loss function; has the same keys as self.params. + """ + # Unpack variables from the params dictionary + W1, b1 = self.params['W1'], self.params['b1'] + W2, b2 = self.params['W2'], self.params['b2'] + N, D = X.shape + + # Compute the forward pass + scores = None + ############################################################################# + # TODO: Perform the forward pass, computing the class scores for the input. # + # Store the result in the scores variable, which should be an array of # + # shape (N, C). # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + + # If the targets are not given then jump out, we're done + if y is None: + return scores + + # Compute the loss + loss = None + ############################################################################# + # TODO: Finish the forward pass, and compute the loss. This should include # + # both the data loss and L2 regularization for W1 and W2. Store the result # + # in the variable loss, which should be a scalar. Use the Softmax # + # classifier loss. So that your results match ours, multiply the # + # regularization loss by 0.5 # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + + # Backward pass: compute gradients + grads = {} + ############################################################################# + # TODO: Compute the backward pass, computing the derivatives of the weights # + # and biases. Store the results in the grads dictionary. For example, # + # grads['W1'] should store the gradient on W1, and be a matrix of same size # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + + return loss, grads + + def train(self, X, y, X_val, y_val, + learning_rate=1e-3, learning_rate_decay=0.95, + reg=1e-5, num_iters=100, + batch_size=200, verbose=False): + """ + Train this neural network using stochastic gradient descent. + + Inputs: + - X: A numpy array of shape (N, D) giving training data. + - y: A numpy array of shape (N,) giving training labels; y[i] = c means that + X[i] has label c, where 0 <= c < C. + - X_val: A numpy array of shape (N_val, D) giving validation data. + - y_val: A numpy array of shape (N_val,) giving validation labels. + - learning_rate: Scalar giving learning rate for optimization. + - learning_rate_decay: Scalar giving factor used to decay the learning rate + after each epoch. + - reg: Scalar giving regularization strength. + - num_iters: Number of steps to take when optimizing. + - batch_size: Number of training examples to use per step. + - verbose: boolean; if true print progress during optimization.
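The forward pass asked for in the first TODO of TwoLayerNet.loss above is just the affine - ReLU - affine chain described in the class docstring. A minimal sketch, with a hypothetical helper name:

~~~python
import numpy as np

def two_layer_scores(X, W1, b1, W2, b2):
    """Class scores for the architecture: input - affine - ReLU - affine."""
    hidden = np.maximum(0, X.dot(W1) + b1)  # (N, H), ReLU after the first layer
    return hidden.dot(W2) + b2              # (N, C) unnormalized class scores
~~~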
+ """ + num_train = X.shape[0] + iterations_per_epoch = max(num_train / batch_size, 1) + + # Use SGD to optimize the parameters in self.model + loss_history = [] + train_acc_history = [] + val_acc_history = [] + + for it in xrange(num_iters): + X_batch = None + y_batch = None + + ######################################################################### + # TODO: Create a random minibatch of training data and labels, storing # + # them in X_batch and y_batch respectively. # + ######################################################################### + pass + ######################################################################### + # END OF YOUR CODE # + ######################################################################### + + # Compute loss and gradients using the current minibatch + loss, grads = self.loss(X_batch, y=y_batch, reg=reg) + loss_history.append(loss) + + ######################################################################### + # TODO: Use the gradients in the grads dictionary to update the # + # parameters of the network (stored in the dictionary self.params) # + # using stochastic gradient descent. You'll need to use the gradients # + # stored in the grads dictionary defined above. # + ######################################################################### + pass + ######################################################################### + # END OF YOUR CODE # + ######################################################################### + + if verbose and it % 100 == 0: + print 'iteration %d / %d: loss %f' % (it, num_iters, loss) + + # Every epoch, check train and val accuracy and decay learning rate. + if it % iterations_per_epoch == 0: + # Check accuracy + train_acc = (self.predict(X_batch) == y_batch).mean() + val_acc = (self.predict(X_val) == y_val).mean() + train_acc_history.append(train_acc) + val_acc_history.append(val_acc) + + # Decay learning rate + learning_rate *= learning_rate_decay + + return { + 'loss_history': loss_history, + 'train_acc_history': train_acc_history, + 'val_acc_history': val_acc_history, + } + + def predict(self, X): + """ + Use the trained weights of this two-layer network to predict labels for + data points. For each data point we predict scores for each of the C + classes, and assign each data point to the class with the highest score. + + Inputs: + - X: A numpy array of shape (N, D) giving N D-dimensional data points to + classify. + + Returns: + - y_pred: A numpy array of shape (N,) giving predicted labels for each of + the elements of X. For all i, y_pred[i] = c means that X[i] is predicted + to have class c, where 0 <= c < C. + """ + y_pred = None + + ########################################################################### + # TODO: Implement this function; it should be VERY simple! 
# + ########################################################################### + pass + ########################################################################### + # END OF YOUR CODE # + ########################################################################### + + return y_pred + + diff --git a/assignments2016/assignment1/cs231n/classifiers/softmax.py b/assignments2016/assignment1/cs231n/classifiers/softmax.py new file mode 100644 index 00000000..edddcfac --- /dev/null +++ b/assignments2016/assignment1/cs231n/classifiers/softmax.py @@ -0,0 +1,62 @@ +import numpy as np +from random import shuffle + +def softmax_loss_naive(W, X, y, reg): + """ + Softmax loss function, naive implementation (with loops) + + Inputs have dimension D, there are C classes, and we operate on minibatches + of N examples. + + Inputs: + - W: A numpy array of shape (D, C) containing weights. + - X: A numpy array of shape (N, D) containing a minibatch of data. + - y: A numpy array of shape (N,) containing training labels; y[i] = c means + that X[i] has label c, where 0 <= c < C. + - reg: (float) regularization strength + + Returns a tuple of: + - loss as single float + - gradient with respect to weights W; an array of same shape as W + """ + # Initialize the loss and gradient to zero. + loss = 0.0 + dW = np.zeros_like(W) + + ############################################################################# + # TODO: Compute the softmax loss and its gradient using explicit loops. # + # Store the loss in loss and the gradient in dW. If you are not careful # + # here, it is easy to run into numeric instability. Don't forget the # + # regularization! # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + + return loss, dW + + +def softmax_loss_vectorized(W, X, y, reg): + """ + Softmax loss function, vectorized version. + + Inputs and outputs are the same as softmax_loss_naive. + """ + # Initialize the loss and gradient to zero. + loss = 0.0 + dW = np.zeros_like(W) + + ############################################################################# + # TODO: Compute the softmax loss and its gradient using no explicit loops. # + # Store the loss in loss and the gradient in dW. If you are not careful # + # here, it is easy to run into numeric instability. Don't forget the # + # regularization! 
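The instability warning above refers to overflow in the exponentials; the usual remedy is to subtract each row's maximum score before exponentiating, which leaves the softmax probabilities unchanged. A vectorized sketch in that spirit (again a named illustration, not the graded answer):

~~~python
import numpy as np

def softmax_loss_vectorized_sketch(W, X, y, reg):
    """Softmax cross-entropy loss and gradient with no explicit loops."""
    num_train = X.shape[0]
    scores = X.dot(W)
    scores -= scores.max(axis=1, keepdims=True)  # shift rows for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)    # (N, C) class probabilities
    loss = -np.log(probs[np.arange(num_train), y]).mean() + 0.5 * reg * np.sum(W * W)
    dscores = probs.copy()
    dscores[np.arange(num_train), y] -= 1.0      # d(loss)/d(scores): p - 1 at the true class
    dW = X.T.dot(dscores) / num_train + reg * W
    return loss, dW
~~~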
# + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + + return loss, dW + diff --git a/assignments2016/assignment1/cs231n/data_utils.py b/assignments2016/assignment1/cs231n/data_utils.py new file mode 100644 index 00000000..9158da4d --- /dev/null +++ b/assignments2016/assignment1/cs231n/data_utils.py @@ -0,0 +1,158 @@ +import cPickle as pickle +import numpy as np +import os +from scipy.misc import imread + +def load_CIFAR_batch(filename): + """ load single batch of cifar """ + with open(filename, 'rb') as f: + datadict = pickle.load(f) + X = datadict['data'] + Y = datadict['labels'] + X = X.reshape(10000, 3, 32, 32).transpose(0,2,3,1).astype("float") + Y = np.array(Y) + return X, Y + +def load_CIFAR10(ROOT): + """ load all of cifar """ + xs = [] + ys = [] + for b in range(1,6): + f = os.path.join(ROOT, 'data_batch_%d' % (b, )) + X, Y = load_CIFAR_batch(f) + xs.append(X) + ys.append(Y) + Xtr = np.concatenate(xs) + Ytr = np.concatenate(ys) + del X, Y + Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch')) + return Xtr, Ytr, Xte, Yte + +def load_tiny_imagenet(path, dtype=np.float32): + """ + Load TinyImageNet. Each of TinyImageNet-100-A, TinyImageNet-100-B, and + TinyImageNet-200 have the same directory structure, so this can be used + to load any of them. + + Inputs: + - path: String giving path to the directory to load. + - dtype: numpy datatype used to load the data. + + Returns: A tuple of + - class_names: A list where class_names[i] is a list of strings giving the + WordNet names for class i in the loaded dataset. + - X_train: (N_tr, 3, 64, 64) array of training images + - y_train: (N_tr,) array of training labels + - X_val: (N_val, 3, 64, 64) array of validation images + - y_val: (N_val,) array of validation labels + - X_test: (N_test, 3, 64, 64) array of testing images. + - y_test: (N_test,) array of test labels; if test labels are not available + (such as in student code) then y_test will be None. + """ + # First load wnids + with open(os.path.join(path, 'wnids.txt'), 'r') as f: + wnids = [x.strip() for x in f] + + # Map wnids to integer labels + wnid_to_label = {wnid: i for i, wnid in enumerate(wnids)} + + # Use words.txt to get names for each class + with open(os.path.join(path, 'words.txt'), 'r') as f: + wnid_to_words = dict(line.split('\t') for line in f) + for wnid, words in wnid_to_words.iteritems(): + wnid_to_words[wnid] = [w.strip() for w in words.split(',')] + class_names = [wnid_to_words[wnid] for wnid in wnids] + + # Next load training data. 
+ X_train = [] + y_train = [] + for i, wnid in enumerate(wnids): + if (i + 1) % 20 == 0: + print 'loading training data for synset %d / %d' % (i + 1, len(wnids)) + # To figure out the filenames we need to open the boxes file + boxes_file = os.path.join(path, 'train', wnid, '%s_boxes.txt' % wnid) + with open(boxes_file, 'r') as f: + filenames = [x.split('\t')[0] for x in f] + num_images = len(filenames) + + X_train_block = np.zeros((num_images, 3, 64, 64), dtype=dtype) + y_train_block = wnid_to_label[wnid] * np.ones(num_images, dtype=np.int64) + for j, img_file in enumerate(filenames): + img_file = os.path.join(path, 'train', wnid, 'images', img_file) + img = imread(img_file) + if img.ndim == 2: + ## grayscale file + img.shape = (64, 64, 1) + X_train_block[j] = img.transpose(2, 0, 1) + X_train.append(X_train_block) + y_train.append(y_train_block) + + # We need to concatenate all training data + X_train = np.concatenate(X_train, axis=0) + y_train = np.concatenate(y_train, axis=0) + + # Next load validation data + with open(os.path.join(path, 'val', 'val_annotations.txt'), 'r') as f: + img_files = [] + val_wnids = [] + for line in f: + img_file, wnid = line.split('\t')[:2] + img_files.append(img_file) + val_wnids.append(wnid) + num_val = len(img_files) + y_val = np.array([wnid_to_label[wnid] for wnid in val_wnids]) + X_val = np.zeros((num_val, 3, 64, 64), dtype=dtype) + for i, img_file in enumerate(img_files): + img_file = os.path.join(path, 'val', 'images', img_file) + img = imread(img_file) + if img.ndim == 2: + img.shape = (64, 64, 1) + X_val[i] = img.transpose(2, 0, 1) + + # Next load test images + # Students won't have test labels, so we need to iterate over files in the + # images directory. + img_files = os.listdir(os.path.join(path, 'test', 'images')) + X_test = np.zeros((len(img_files), 3, 64, 64), dtype=dtype) + for i, img_file in enumerate(img_files): + img_file = os.path.join(path, 'test', 'images', img_file) + img = imread(img_file) + if img.ndim == 2: + img.shape = (64, 64, 1) + X_test[i] = img.transpose(2, 0, 1) + + y_test = None + y_test_file = os.path.join(path, 'test', 'test_annotations.txt') + if os.path.isfile(y_test_file): + with open(y_test_file, 'r') as f: + img_file_to_wnid = {} + for line in f: + line = line.split('\t') + img_file_to_wnid[line[0]] = line[1] + y_test = [wnid_to_label[img_file_to_wnid[img_file]] for img_file in img_files] + y_test = np.array(y_test) + + return class_names, X_train, y_train, X_val, y_val, X_test, y_test + + +def load_models(models_dir): + """ + Load saved models from disk. This will attempt to unpickle all files in a + directory; any files that give errors on unpickling (such as README.txt) will + be skipped. + + Inputs: + - models_dir: String giving the path to a directory containing model files. + Each model file is a pickled dictionary with a 'model' field. + + Returns: + A dictionary mapping model file names to models. 
+ """ + models = {} + for model_file in os.listdir(models_dir): + with open(os.path.join(models_dir, model_file), 'rb') as f: + try: + models[model_file] = pickle.load(f)['model'] + except pickle.UnpicklingError: + continue + return models diff --git a/assignments2016/assignment1/cs231n/datasets/.gitignore b/assignments2016/assignment1/cs231n/datasets/.gitignore new file mode 100644 index 00000000..0232c3ab --- /dev/null +++ b/assignments2016/assignment1/cs231n/datasets/.gitignore @@ -0,0 +1,4 @@ +cifar-10-batches-py/* +tiny-imagenet-100-A* +tiny-imagenet-100-B* +tiny-100-A-pretrained/* diff --git a/assignments2016/assignment1/cs231n/datasets/get_datasets.sh b/assignments2016/assignment1/cs231n/datasets/get_datasets.sh new file mode 100755 index 00000000..0dd93621 --- /dev/null +++ b/assignments2016/assignment1/cs231n/datasets/get_datasets.sh @@ -0,0 +1,4 @@ +# Get CIFAR10 +wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz +tar -xzvf cifar-10-python.tar.gz +rm cifar-10-python.tar.gz diff --git a/assignments2016/assignment1/cs231n/features.py b/assignments2016/assignment1/cs231n/features.py new file mode 100644 index 00000000..fdf40372 --- /dev/null +++ b/assignments2016/assignment1/cs231n/features.py @@ -0,0 +1,148 @@ +import matplotlib +import numpy as np +from scipy.ndimage import uniform_filter + + +def extract_features(imgs, feature_fns, verbose=False): + """ + Given pixel data for images and several feature functions that can operate on + single images, apply all feature functions to all images, concatenating the + feature vectors for each image and storing the features for all images in + a single matrix. + + Inputs: + - imgs: N x H X W X C array of pixel data for N images. + - feature_fns: List of k feature functions. The ith feature function should + take as input an H x W x D array and return a (one-dimensional) array of + length F_i. + - verbose: Boolean; if true, print progress. + + Returns: + An array of shape (N, F_1 + ... + F_k) where each column is the concatenation + of all features for a single image. + """ + num_images = imgs.shape[0] + if num_images == 0: + return np.array([]) + + # Use the first image to determine feature dimensions + feature_dims = [] + first_image_features = [] + for feature_fn in feature_fns: + feats = feature_fn(imgs[0].squeeze()) + assert len(feats.shape) == 1, 'Feature functions must be one-dimensional' + feature_dims.append(feats.size) + first_image_features.append(feats) + + # Now that we know the dimensions of the features, we can allocate a single + # big array to store all features as columns. + total_feature_dim = sum(feature_dims) + imgs_features = np.zeros((num_images, total_feature_dim)) + imgs_features[0] = np.hstack(first_image_features).T + + # Extract features for the rest of the images. 
+ for i in xrange(1, num_images): + idx = 0 + for feature_fn, feature_dim in zip(feature_fns, feature_dims): + next_idx = idx + feature_dim + imgs_features[i, idx:next_idx] = feature_fn(imgs[i].squeeze()) + idx = next_idx + if verbose and i % 1000 == 0: + print 'Done extracting features for %d / %d images' % (i, num_images) + + return imgs_features + + +def rgb2gray(rgb): + """Convert RGB image to grayscale + + Parameters: + rgb : RGB image + + Returns: + gray : grayscale image + + """ + return np.dot(rgb[...,:3], [0.299, 0.587, 0.114]) + + +def hog_feature(im): + """Compute Histogram of Gradient (HOG) feature for an image + + Modified from skimage.feature.hog + http://pydoc.net/Python/scikits-image/0.4.2/skimage.feature.hog + + Reference: + Histograms of Oriented Gradients for Human Detection + Navneet Dalal and Bill Triggs, CVPR 2005 + + Parameters: + im : an input grayscale or rgb image + + Returns: + feat: Histogram of Gradient (HOG) feature + + """ + + # convert rgb to grayscale if needed + if im.ndim == 3: + image = rgb2gray(im) + else: + image = np.atleast_2d(im) + + sx, sy = image.shape # image size + orientations = 9 # number of gradient bins + cx, cy = (8, 8) # pixels per cell + + gx = np.zeros(image.shape) + gy = np.zeros(image.shape) + gx[:, :-1] = np.diff(image, n=1, axis=1) # compute gradient on x-direction + gy[:-1, :] = np.diff(image, n=1, axis=0) # compute gradient on y-direction + grad_mag = np.sqrt(gx ** 2 + gy ** 2) # gradient magnitude + grad_ori = np.arctan2(gy, (gx + 1e-15)) * (180 / np.pi) + 90 # gradient orientation + + n_cellsx = int(np.floor(sx / cx)) # number of cells in x + n_cellsy = int(np.floor(sy / cy)) # number of cells in y + # compute orientations integral images + orientation_histogram = np.zeros((n_cellsx, n_cellsy, orientations)) + for i in range(orientations): + # create new integral image for this orientation + # isolate orientations in this range + temp_ori = np.where(grad_ori < 180 / orientations * (i + 1), + grad_ori, 0) + temp_ori = np.where(grad_ori >= 180 / orientations * i, + temp_ori, 0) + # select magnitudes for those orientations + cond2 = temp_ori > 0 + temp_mag = np.where(cond2, grad_mag, 0) + orientation_histogram[:,:,i] = uniform_filter(temp_mag, size=(cx, cy))[cx/2::cx, cy/2::cy].T + + return orientation_histogram.ravel() + + +def color_histogram_hsv(im, nbin=10, xmin=0, xmax=255, normalized=True): + """ + Compute color histogram for an image using hue. + + Inputs: + - im: H x W x C array of pixel data for an RGB image. + - nbin: Number of histogram bins. (default: 10) + - xmin: Minimum pixel value (default: 0) + - xmax: Maximum pixel value (default: 255) + - normalized: Whether to normalize the histogram (default: True) + + Returns: + 1D vector of length nbin giving the color histogram over the hue of the + input image.
+ """ + ndim = im.ndim + bins = np.linspace(xmin, xmax, nbin+1) + hsv = matplotlib.colors.rgb_to_hsv(im/xmax) * xmax + imhist, bin_edges = np.histogram(hsv[:,:,0], bins=bins, density=normalized) + imhist = imhist * np.diff(bin_edges) + + # return histogram + return imhist + + +pass diff --git a/assignments2016/assignment1/cs231n/gradient_check.py b/assignments2016/assignment1/cs231n/gradient_check.py new file mode 100644 index 00000000..2d6b1f62 --- /dev/null +++ b/assignments2016/assignment1/cs231n/gradient_check.py @@ -0,0 +1,124 @@ +import numpy as np +from random import randrange + +def eval_numerical_gradient(f, x, verbose=True, h=0.00001): + """ + a naive implementation of numerical gradient of f at x + - f should be a function that takes a single argument + - x is the point (numpy array) to evaluate the gradient at + """ + + fx = f(x) # evaluate function value at original point + grad = np.zeros_like(x) + # iterate over all indexes in x + it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite']) + while not it.finished: + + # evaluate function at x+h + ix = it.multi_index + oldval = x[ix] + x[ix] = oldval + h # increment by h + fxph = f(x) # evalute f(x + h) + x[ix] = oldval - h + fxmh = f(x) # evaluate f(x - h) + x[ix] = oldval # restore + + # compute the partial derivative with centered formula + grad[ix] = (fxph - fxmh) / (2 * h) # the slope + if verbose: + print ix, grad[ix] + it.iternext() # step to next dimension + + return grad + + +def eval_numerical_gradient_array(f, x, df, h=1e-5): + """ + Evaluate a numeric gradient for a function that accepts a numpy + array and returns a numpy array. + """ + grad = np.zeros_like(x) + it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite']) + while not it.finished: + ix = it.multi_index + + oldval = x[ix] + x[ix] = oldval + h + pos = f(x).copy() + x[ix] = oldval - h + neg = f(x).copy() + x[ix] = oldval + + grad[ix] = np.sum((pos - neg) * df) / (2 * h) + it.iternext() + return grad + + +def eval_numerical_gradient_blobs(f, inputs, output, h=1e-5): + """ + Compute numeric gradients for a function that operates on input + and output blobs. + + We assume that f accepts several input blobs as arguments, followed by a blob + into which outputs will be written. For example, f might be called like this: + + f(x, w, out) + + where x and w are input Blobs, and the result of f will be written to out. + + Inputs: + - f: function + - inputs: tuple of input blobs + - output: output blob + - h: step size + """ + numeric_diffs = [] + for input_blob in inputs: + diff = np.zeros_like(input_blob.diffs) + it = np.nditer(input_blob.vals, flags=['multi_index'], + op_flags=['readwrite']) + while not it.finished: + idx = it.multi_index + orig = input_blob.vals[idx] + + input_blob.vals[idx] = orig + h + f(*(inputs + (output,))) + pos = np.copy(output.vals) + input_blob.vals[idx] = orig - h + f(*(inputs + (output,))) + neg = np.copy(output.vals) + input_blob.vals[idx] = orig + + diff[idx] = np.sum((pos - neg) * output.diffs) / (2.0 * h) + + it.iternext() + numeric_diffs.append(diff) + return numeric_diffs + + +def eval_numerical_gradient_net(net, inputs, output, h=1e-5): + return eval_numerical_gradient_blobs(lambda *args: net.forward(), + inputs, output, h=h) + + +def grad_check_sparse(f, x, analytic_grad, num_checks=10, h=1e-5): + """ + sample a few random elements and only return numerical + in this dimensions. 
+ """ + + for i in xrange(num_checks): + ix = tuple([randrange(m) for m in x.shape]) + + oldval = x[ix] + x[ix] = oldval + h # increment by h + fxph = f(x) # evaluate f(x + h) + x[ix] = oldval - h # increment by h + fxmh = f(x) # evaluate f(x - h) + x[ix] = oldval # reset + + grad_numerical = (fxph - fxmh) / (2 * h) + grad_analytic = analytic_grad[ix] + rel_error = abs(grad_numerical - grad_analytic) / (abs(grad_numerical) + abs(grad_analytic)) + print 'numerical: %f analytic: %f, relative error: %e' % (grad_numerical, grad_analytic, rel_error) + diff --git a/assignments2016/assignment1/cs231n/vis_utils.py b/assignments2016/assignment1/cs231n/vis_utils.py new file mode 100644 index 00000000..8d04473f --- /dev/null +++ b/assignments2016/assignment1/cs231n/vis_utils.py @@ -0,0 +1,73 @@ +from math import sqrt, ceil +import numpy as np + +def visualize_grid(Xs, ubound=255.0, padding=1): + """ + Reshape a 4D tensor of image data to a grid for easy visualization. + + Inputs: + - Xs: Data of shape (N, H, W, C) + - ubound: Output grid will have values scaled to the range [0, ubound] + - padding: The number of blank pixels between elements of the grid + """ + (N, H, W, C) = Xs.shape + grid_size = int(ceil(sqrt(N))) + grid_height = H * grid_size + padding * (grid_size - 1) + grid_width = W * grid_size + padding * (grid_size - 1) + grid = np.zeros((grid_height, grid_width, C)) + next_idx = 0 + y0, y1 = 0, H + for y in xrange(grid_size): + x0, x1 = 0, W + for x in xrange(grid_size): + if next_idx < N: + img = Xs[next_idx] + low, high = np.min(img), np.max(img) + grid[y0:y1, x0:x1] = ubound * (img - low) / (high - low) + # grid[y0:y1, x0:x1] = Xs[next_idx] + next_idx += 1 + x0 += W + padding + x1 += W + padding + y0 += H + padding + y1 += H + padding + # grid_max = np.max(grid) + # grid_min = np.min(grid) + # grid = ubound * (grid - grid_min) / (grid_max - grid_min) + return grid + +def vis_grid(Xs): + """ visualize a grid of images """ + (N, H, W, C) = Xs.shape + A = int(ceil(sqrt(N))) + G = np.ones((A*H+A, A*W+A, C), Xs.dtype) + G *= np.min(Xs) + n = 0 + for y in range(A): + for x in range(A): + if n < N: + G[y*H+y:(y+1)*H+y, x*W+x:(x+1)*W+x, :] = Xs[n,:,:,:] + n += 1 + # normalize to [0,1] + maxg = G.max() + ming = G.min() + G = (G - ming)/(maxg-ming) + return G + +def vis_nn(rows): + """ visualize array of arrays of images """ + N = len(rows) + D = len(rows[0]) + H,W,C = rows[0][0].shape + Xs = rows[0][0] + G = np.ones((N*H+N, D*W+D, C), Xs.dtype) + for y in range(N): + for x in range(D): + G[y*H+y:(y+1)*H+y, x*W+x:(x+1)*W+x, :] = rows[y][x] + # normalize to [0,1] + maxg = G.max() + ming = G.min() + G = (G - ming)/(maxg-ming) + return G + + + diff --git a/assignments2016/assignment1/features.ipynb b/assignments2016/assignment1/features.ipynb new file mode 100644 index 00000000..9d22d9f6 --- /dev/null +++ b/assignments2016/assignment1/features.ipynb @@ -0,0 +1,340 @@ +{ + "nbformat_minor": 0, + "nbformat": 4, + "cells": [ + { + "source": [ + "# Image features exercise\n", + "*Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. For more details see the [assignments page](http://vision.stanford.edu/teaching/cs231n/assignments.html) on the course website.*\n", + "\n", + "We have seen that we can achieve reasonable performance on an image classification task by training a linear classifier on the pixels of the input image. 
In this exercise we will show that we can improve our classification performance by training linear classifiers not on raw pixels but on features that are computed from the raw pixels.\n",
+ "\n",
+ "All of your work for this exercise will be done in this notebook."
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "import random\n",
+ "import numpy as np\n",
+ "from cs231n.data_utils import load_CIFAR10\n",
+ "import matplotlib.pyplot as plt\n",
+ "%matplotlib inline\n",
+ "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n",
+ "plt.rcParams['image.interpolation'] = 'nearest'\n",
+ "plt.rcParams['image.cmap'] = 'gray'\n",
+ "\n",
+ "# for auto-reloading external modules\n",
+ "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n",
+ "%load_ext autoreload\n",
+ "%autoreload 2"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "## Load data\n",
+ "Similar to previous exercises, we will load CIFAR-10 data from disk."
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "from cs231n.features import color_histogram_hsv, hog_feature\n",
+ "\n",
+ "def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000):\n",
+ " # Load the raw CIFAR-10 data\n",
+ " cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'\n",
+ " X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)\n",
+ " \n",
+ " # Subsample the data\n",
+ " mask = range(num_training, num_training + num_validation)\n",
+ " X_val = X_train[mask]\n",
+ " y_val = y_train[mask]\n",
+ " mask = range(num_training)\n",
+ " X_train = X_train[mask]\n",
+ " y_train = y_train[mask]\n",
+ " mask = range(num_test)\n",
+ " X_test = X_test[mask]\n",
+ " y_test = y_test[mask]\n",
+ "\n",
+ " return X_train, y_train, X_val, y_val, X_test, y_test\n",
+ "\n",
+ "X_train, y_train, X_val, y_val, X_test, y_test = get_CIFAR10_data()"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "## Extract Features\n",
+ "For each image we will compute a Histogram of Oriented\n",
+ "Gradients (HOG) as well as a color histogram using the hue channel in HSV\n",
+ "color space. We form our final feature vector for each image by concatenating\n",
+ "the HOG and color histogram feature vectors.\n",
+ "\n",
+ "Roughly speaking, HOG should capture the texture of the image while ignoring\n",
+ "color information, and the color histogram represents the color of the input\n",
+ "image while ignoring texture. As a result, we expect that using both together\n",
+ "ought to work better than using either alone. Verifying this assumption would\n",
+ "be a good thing to try for the bonus section.\n",
+ "\n",
+ "The `hog_feature` and `color_histogram_hsv` functions both operate on a single\n",
+ "image and return a feature vector for that image. The extract_features\n",
+ "function takes a set of images and a list of feature functions and evaluates\n",
+ "each feature function on each image, storing the results in a matrix where\n",
+ "each column is the concatenation of all feature vectors for a single image."
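
The extraction loop itself is simple to picture. Below is a minimal, self-contained sketch of the same pattern, assuming only that each feature function maps one (H, W, C) image to a 1-D vector; `toy_brightness` and `toy_channel_means` are hypothetical stand-ins for `hog_feature` and `color_histogram_hsv`, and the sketch stacks one row of concatenated features per image for simplicity:

```python
import numpy as np

def toy_brightness(img):
    # hypothetical feature: overall mean intensity (1 number)
    return np.array([img.mean()])

def toy_channel_means(img):
    # hypothetical feature: mean of each color channel (3 numbers)
    return img.mean(axis=0).mean(axis=0)

def extract_features_sketch(imgs, feature_fns):
    # evaluate every feature function on every image and concatenate the results
    feats = [np.concatenate([fn(img) for fn in feature_fns]) for img in imgs]
    return np.vstack(feats)

imgs = np.random.rand(5, 32, 32, 3)
print extract_features_sketch(imgs, [toy_brightness, toy_channel_means]).shape  # (5, 4)
```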
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "from cs231n.features import *\n",
+ "\n",
+ "num_color_bins = 10 # Number of bins in the color histogram\n",
+ "feature_fns = [hog_feature, lambda img: color_histogram_hsv(img, nbin=num_color_bins)]\n",
+ "X_train_feats = extract_features(X_train, feature_fns, verbose=True)\n",
+ "X_val_feats = extract_features(X_val, feature_fns)\n",
+ "X_test_feats = extract_features(X_test, feature_fns)\n",
+ "\n",
+ "# Preprocessing: Subtract the mean feature\n",
+ "mean_feat = np.mean(X_train_feats, axis=0, keepdims=True)\n",
+ "X_train_feats -= mean_feat\n",
+ "X_val_feats -= mean_feat\n",
+ "X_test_feats -= mean_feat\n",
+ "\n",
+ "# Preprocessing: Divide by standard deviation. This ensures that each feature\n",
+ "# has roughly the same scale.\n",
+ "std_feat = np.std(X_train_feats, axis=0, keepdims=True)\n",
+ "X_train_feats /= std_feat\n",
+ "X_val_feats /= std_feat\n",
+ "X_test_feats /= std_feat\n",
+ "\n",
+ "# Preprocessing: Add a bias dimension\n",
+ "X_train_feats = np.hstack([X_train_feats, np.ones((X_train_feats.shape[0], 1))])\n",
+ "X_val_feats = np.hstack([X_val_feats, np.ones((X_val_feats.shape[0], 1))])\n",
+ "X_test_feats = np.hstack([X_test_feats, np.ones((X_test_feats.shape[0], 1))])"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "## Train SVM on features\n",
+ "Using the multiclass SVM code developed earlier in the assignment, train SVMs on top of the features extracted above; this should achieve better results than training SVMs directly on top of raw pixels."
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "# Use the validation set to tune the learning rate and regularization strength\n",
+ "\n",
+ "from cs231n.classifiers.linear_classifier import LinearSVM\n",
+ "\n",
+ "learning_rates = [1e-9, 1e-8, 1e-7]\n",
+ "regularization_strengths = [1e5, 1e6, 1e7]\n",
+ "\n",
+ "results = {}\n",
+ "best_val = -1\n",
+ "best_svm = None\n",
+ "\n",
+ "pass\n",
+ "################################################################################\n",
+ "# TODO: #\n",
+ "# Use the validation set to set the learning rate and regularization strength. #\n",
+ "# This should be identical to the validation that you did for the SVM; save #\n",
+ "# the best trained classifier in best_svm. You might also want to play #\n",
+ "# with different numbers of bins in the color histogram. If you are careful #\n",
+ "# you should be able to get accuracy of near 0.44 on the validation set.
#\n", + "################################################################################\n", + "pass\n", + "################################################################################\n", + "# END OF YOUR CODE #\n", + "################################################################################\n", + "\n", + "# Print out results.\n", + "for lr, reg in sorted(results):\n", + " train_accuracy, val_accuracy = results[(lr, reg)]\n", + " print 'lr %e reg %e train accuracy: %f val accuracy: %f' % (\n", + " lr, reg, train_accuracy, val_accuracy)\n", + " \n", + "print 'best validation accuracy achieved during cross-validation: %f' % best_val" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Evaluate your trained SVM on the test set\n", + "y_test_pred = best_svm.predict(X_test_feats)\n", + "test_accuracy = np.mean(y_test == y_test_pred)\n", + "print test_accuracy" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# An important way to gain intuition about how an algorithm works is to\n", + "# visualize the mistakes that it makes. In this visualization, we show examples\n", + "# of images that are misclassified by our current system. The first column\n", + "# shows images that our system labeled as \"plane\" but whose true label is\n", + "# something other than \"plane\".\n", + "\n", + "examples_per_class = 8\n", + "classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n", + "for cls, cls_name in enumerate(classes):\n", + " idxs = np.where((y_test != cls) & (y_test_pred == cls))[0]\n", + " idxs = np.random.choice(idxs, examples_per_class, replace=False)\n", + " for i, idx in enumerate(idxs):\n", + " plt.subplot(examples_per_class, len(classes), i * len(classes) + cls + 1)\n", + " plt.imshow(X_test[idx].astype('uint8'))\n", + " plt.axis('off')\n", + " if i == 0:\n", + " plt.title(cls_name)\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "### Inline question 1:\n", + "Describe the misclassification results that you see. Do they make sense?" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "## Neural Network on image features\n", + "Earlier in this assigment we saw that training a two-layer neural network on raw pixels achieved better classification performance than linear classifiers on raw pixels. In this notebook we have seen that linear classifiers on image features outperform linear classifiers on raw pixels. \n", + "\n", + "For completeness, we should also try training a neural network on image features. This approach should outperform all previous approaches: you should easily be able to achieve over 55% classification accuracy on the test set; our best model achieves about 60% classification accuracy." 
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "print X_train_feats.shape" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.classifiers.neural_net import TwoLayerNet\n", + "\n", + "input_dim = X_train_feats.shape[1]\n", + "hidden_dim = 500\n", + "num_classes = 10\n", + "\n", + "net = TwoLayerNet(input_dim, hidden_dim, num_classes)\n", + "best_net = None\n", + "\n", + "################################################################################\n", + "# TODO: Train a two-layer neural network on image features. You may want to #\n", + "# cross-validate various parameters as in previous sections. Store your best #\n", + "# model in the best_net variable. #\n", + "################################################################################\n", + "pass\n", + "################################################################################\n", + "# END OF YOUR CODE #\n", + "################################################################################" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Run your neural net classifier on the test set. You should be able to\n", + "# get more than 55% accuracy.\n", + "\n", + "test_acc = (net.predict(X_test_feats) == y_test).mean()\n", + "print test_acc" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Bonus: Design your own features!\n", + "\n", + "You have seen that simple image features can improve classification performance. So far we have tried HOG and color histograms, but other types of features may be able to achieve even better classification performance.\n", + "\n", + "For bonus points, design and implement a new type of feature and use it for image classification on CIFAR-10. Explain how your feature works and why you expect it to be useful for image classification. Implement it in this notebook, cross-validate any hyperparameters, and compare its performance to the HOG + Color histogram baseline." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "# Bonus: Do something extra!\n", + "Use the material and code we have presented in this assignment to do something interesting. Was there another question we should have asked? Did any cool ideas pop into your head as you were working on the assignment? This is your chance to show off!" 
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 2",
+ "name": "python2",
+ "language": "python"
+ },
+ "language_info": {
+ "mimetype": "text/x-python",
+ "nbconvert_exporter": "python",
+ "name": "python",
+ "file_extension": ".py",
+ "version": "2.7.9",
+ "pygments_lexer": "ipython2",
+ "codemirror_mode": {
+ "version": 2,
+ "name": "ipython"
+ }
+ }
+ }
+}
\ No newline at end of file
diff --git a/assignments2016/assignment1/frameworkpython b/assignments2016/assignment1/frameworkpython
new file mode 100755
index 00000000..a0fa5517
--- /dev/null
+++ b/assignments2016/assignment1/frameworkpython
@@ -0,0 +1,13 @@
+#!/bin/bash
+
+# what real Python executable to use
+PYVER=2.7
+PATHTOPYTHON=/usr/local/bin/
+PYTHON=${PATHTOPYTHON}python${PYVER}
+
+# find the root of the virtualenv, it should be the parent of the dir this script is in
+ENV=`$PYTHON -c "import os; print os.path.abspath(os.path.join(os.path.dirname(\"$0\"), '..'))"`
+
+# now run Python with the virtualenv set as Python's HOME
+export PYTHONHOME=$ENV
+exec $PYTHON "$@"
diff --git a/assignments2016/assignment1/knn.ipynb b/assignments2016/assignment1/knn.ipynb
new file mode 100644
index 00000000..2c550ba5
--- /dev/null
+++ b/assignments2016/assignment1/knn.ipynb
@@ -0,0 +1,459 @@
+{
+ "nbformat_minor": 0,
+ "nbformat": 4,
+ "cells": [
+ {
+ "source": [
+ "# k-Nearest Neighbor (kNN) exercise\n",
+ "\n",
+ "*Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. For more details see the [assignments page](http://vision.stanford.edu/teaching/cs231n/assignments.html) on the course website.*\n",
+ "\n",
+ "The kNN classifier consists of two stages:\n",
+ "\n",
+ "- During training, the classifier takes the training data and simply remembers it\n",
+ "- During testing, kNN classifies every test image by comparing to all training images and transferring the labels of the k most similar training examples\n",
+ "- The value of k is cross-validated\n",
+ "\n",
+ "In this exercise you will implement these steps, understand the basic Image Classification pipeline and cross-validation, and gain proficiency in writing efficient, vectorized code."
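
The two stages translate almost directly into code. The class below is a tiny, self-contained illustration of the idea (L2 distances plus a majority vote); it is only a sketch, not the assignment's `KNearestNeighbor` implementation:

```python
import numpy as np

class TinyKNN(object):
    def train(self, X, y):
        # "training" a kNN classifier just memorizes the data
        self.X, self.y = X, y

    def predict(self, X, k=1):
        preds = np.zeros(X.shape[0], dtype=self.y.dtype)
        for i in xrange(X.shape[0]):
            dists = np.sqrt(((self.X - X[i]) ** 2).sum(axis=1))  # L2 distance to every training point
            nearest = np.argsort(dists)[:k]                      # indices of the k closest points
            preds[i] = np.bincount(self.y[nearest]).argmax()     # majority vote over their labels
        return preds

Xtr = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
ytr = np.array([0, 0, 1, 1])
knn = TinyKNN()
knn.train(Xtr, ytr)
print knn.predict(np.array([[0., 0.5], [5., 5.5]]), k=3)  # -> [0 1]
```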
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Run some setup code for this notebook.\n", + "\n", + "import random\n", + "import numpy as np\n", + "from cs231n.data_utils import load_CIFAR10\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# This is a bit of magic to make matplotlib figures appear inline in the notebook\n", + "# rather than in a new window.\n", + "%matplotlib inline\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['image.interpolation'] = 'nearest'\n", + "plt.rcParams['image.cmap'] = 'gray'\n", + "\n", + "# Some more magic so that the notebook will reload external python modules;\n", + "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "%load_ext autoreload\n", + "%autoreload 2" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Load the raw CIFAR-10 data.\n", + "cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'\n", + "X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)\n", + "\n", + "# As a sanity check, we print out the size of the training and test data.\n", + "print 'Training data shape: ', X_train.shape\n", + "print 'Training labels shape: ', y_train.shape\n", + "print 'Test data shape: ', X_test.shape\n", + "print 'Test labels shape: ', y_test.shape" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Visualize some examples from the dataset.\n", + "# We show a few examples of training images from each class.\n", + "classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n", + "num_classes = len(classes)\n", + "samples_per_class = 7\n", + "for y, cls in enumerate(classes):\n", + " idxs = np.flatnonzero(y_train == y)\n", + " idxs = np.random.choice(idxs, samples_per_class, replace=False)\n", + " for i, idx in enumerate(idxs):\n", + " plt_idx = i * num_classes + y + 1\n", + " plt.subplot(samples_per_class, num_classes, plt_idx)\n", + " plt.imshow(X_train[idx].astype('uint8'))\n", + " plt.axis('off')\n", + " if i == 0:\n", + " plt.title(cls)\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Subsample the data for more efficient code execution in this exercise\n", + "num_training = 5000\n", + "mask = range(num_training)\n", + "X_train = X_train[mask]\n", + "y_train = y_train[mask]\n", + "\n", + "num_test = 500\n", + "mask = range(num_test)\n", + "X_test = X_test[mask]\n", + "y_test = y_test[mask]" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Reshape the image data into rows\n", + "X_train = np.reshape(X_train, (X_train.shape[0], -1))\n", + "X_test = np.reshape(X_test, (X_test.shape[0], -1))\n", + "print X_train.shape, X_test.shape" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.classifiers import KNearestNeighbor\n", + "\n", + "# Create a kNN classifier instance. 
\n", + "# Remember that training a kNN classifier is a noop: \n", + "# the Classifier simply remembers the data and does no further processing \n", + "classifier = KNearestNeighbor()\n", + "classifier.train(X_train, y_train)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "We would now like to classify the test data with the kNN classifier. Recall that we can break down this process into two steps: \n", + "\n", + "1. First we must compute the distances between all test examples and all train examples. \n", + "2. Given these distances, for each test example we find the k nearest examples and have them vote for the label\n", + "\n", + "Lets begin with computing the distance matrix between all training and test examples. For example, if there are **Ntr** training examples and **Nte** test examples, this stage should result in a **Nte x Ntr** matrix where each element (i,j) is the distance between the i-th test and j-th train example.\n", + "\n", + "First, open `cs231n/classifiers/k_nearest_neighbor.py` and implement the function `compute_distances_two_loops` that uses a (very inefficient) double loop over all pairs of (test, train) examples and computes the distance matrix one element at a time." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Open cs231n/classifiers/k_nearest_neighbor.py and implement\n", + "# compute_distances_two_loops.\n", + "\n", + "# Test your implementation:\n", + "dists = classifier.compute_distances_two_loops(X_test)\n", + "print dists.shape" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# We can visualize the distance matrix: each row is a single test example and\n", + "# its distances to training examples\n", + "plt.imshow(dists, interpolation='none')\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "**Inline Question #1:** Notice the structured patterns in the distance matrix, where some rows or columns are visible brighter. (Note that with the default color scheme black indicates low distances while white indicates high distances.)\n", + "\n", + "- What in the data is the cause behind the distinctly bright rows?\n", + "- What causes the columns?" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "**Your Answer**: *fill this in.*\n", + "\n" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Now implement the function predict_labels and run the code below:\n", + "# We use k = 1 (which is Nearest Neighbor).\n", + "y_test_pred = classifier.predict_labels(dists, k=1)\n", + "\n", + "# Compute and print the fraction of correctly predicted examples\n", + "num_correct = np.sum(y_test_pred == y_test)\n", + "accuracy = float(num_correct) / num_test\n", + "print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "You should expect to see approximately `27%` accuracy. 
Now let's try out a larger `k`, say `k = 5`:"
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "y_test_pred = classifier.predict_labels(dists, k=5)\n",
+ "num_correct = np.sum(y_test_pred == y_test)\n",
+ "accuracy = float(num_correct) / num_test\n",
+ "print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": true
+ }
+ },
+ {
+ "source": [
+ "You should expect to see a slightly better performance than with `k = 1`."
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "# Now let's speed up distance matrix computation by using partial vectorization\n",
+ "# with one loop. Implement the function compute_distances_one_loop and run the\n",
+ "# code below:\n",
+ "dists_one = classifier.compute_distances_one_loop(X_test)\n",
+ "\n",
+ "# To ensure that our vectorized implementation is correct, we make sure that it\n",
+ "# agrees with the naive implementation. There are many ways to decide whether\n",
+ "# two matrices are similar; one of the simplest is the Frobenius norm. In case\n",
+ "# you haven't seen it before, the Frobenius norm of the difference of two matrices\n",
+ "# is the square root of the sum of squared differences of all elements; in other\n",
+ "# words, reshape the matrices into vectors and compute the Euclidean distance between them.\n",
+ "difference = np.linalg.norm(dists - dists_one, ord='fro')\n",
+ "print 'Difference was: %f' % (difference, )\n",
+ "if difference < 0.001:\n",
+ " print 'Good! The distance matrices are the same'\n",
+ "else:\n",
+ " print 'Uh-oh! The distance matrices are different'"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "# Now implement the fully vectorized version inside compute_distances_no_loops\n",
+ "# and run the code\n",
+ "dists_two = classifier.compute_distances_no_loops(X_test)\n",
+ "\n",
+ "# check that the distance matrix agrees with the one we computed before:\n",
+ "difference = np.linalg.norm(dists - dists_two, ord='fro')\n",
+ "print 'Difference was: %f' % (difference, )\n",
+ "if difference < 0.001:\n",
+ " print 'Good! The distance matrices are the same'\n",
+ "else:\n",
+ " print 'Uh-oh!
The distance matrices are different'" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Let's compare how fast the implementations are\n", + "def time_function(f, *args):\n", + " \"\"\"\n", + " Call a function f with args and return the time (in seconds) that it took to execute.\n", + " \"\"\"\n", + " import time\n", + " tic = time.time()\n", + " f(*args)\n", + " toc = time.time()\n", + " return toc - tic\n", + "\n", + "two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)\n", + "print 'Two loop version took %f seconds' % two_loop_time\n", + "\n", + "one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)\n", + "print 'One loop version took %f seconds' % one_loop_time\n", + "\n", + "no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)\n", + "print 'No loop version took %f seconds' % no_loop_time\n", + "\n", + "# you should see significantly faster performance with the fully vectorized implementation" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "### Cross-validation\n", + "\n", + "We have implemented the k-Nearest Neighbor classifier but we set the value k = 5 arbitrarily. We will now determine the best value of this hyperparameter with cross-validation." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "num_folds = 5\n", + "k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]\n", + "\n", + "X_train_folds = []\n", + "y_train_folds = []\n", + "################################################################################\n", + "# TODO: #\n", + "# Split up the training data into folds. After splitting, X_train_folds and #\n", + "# y_train_folds should each be lists of length num_folds, where #\n", + "# y_train_folds[i] is the label vector for the points in X_train_folds[i]. #\n", + "# Hint: Look up the numpy array_split function. #\n", + "################################################################################\n", + "pass\n", + "################################################################################\n", + "# END OF YOUR CODE #\n", + "################################################################################\n", + "\n", + "# A dictionary holding the accuracies for different values of k that we find\n", + "# when running cross-validation. After running cross-validation,\n", + "# k_to_accuracies[k] should be a list of length num_folds giving the different\n", + "# accuracy values that we found when using that value of k.\n", + "k_to_accuracies = {}\n", + "\n", + "\n", + "################################################################################\n", + "# TODO: #\n", + "# Perform k-fold cross validation to find the best value of k. For each #\n", + "# possible value of k, run the k-nearest-neighbor algorithm num_folds times, #\n", + "# where in each case you use all but one of the folds as training data and the #\n", + "# last fold as a validation set. Store the accuracies for all fold and all #\n", + "# values of k in the k_to_accuracies dictionary. 
#\n", + "################################################################################\n", + "pass\n", + "################################################################################\n", + "# END OF YOUR CODE #\n", + "################################################################################\n", + "\n", + "# Print out the computed accuracies\n", + "for k in sorted(k_to_accuracies):\n", + " for accuracy in k_to_accuracies[k]:\n", + " print 'k = %d, accuracy = %f' % (k, accuracy)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# plot the raw observations\n", + "for k in k_choices:\n", + " accuracies = k_to_accuracies[k]\n", + " plt.scatter([k] * len(accuracies), accuracies)\n", + "\n", + "# plot the trend line with error bars that correspond to standard deviation\n", + "accuracies_mean = np.array([np.mean(v) for k,v in sorted(k_to_accuracies.items())])\n", + "accuracies_std = np.array([np.std(v) for k,v in sorted(k_to_accuracies.items())])\n", + "plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)\n", + "plt.title('Cross-validation on k')\n", + "plt.xlabel('k')\n", + "plt.ylabel('Cross-validation accuracy')\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Based on the cross-validation results above, choose the best value for k, \n", + "# retrain the classifier using all the training data, and test it on the test\n", + "# data. You should be able to get above 28% accuracy on the test data.\n", + "best_k = 1\n", + "\n", + "classifier = KNearestNeighbor()\n", + "classifier.train(X_train, y_train)\n", + "y_test_pred = classifier.predict(X_test, k=best_k)\n", + "\n", + "# Compute and display the accuracy\n", + "num_correct = np.sum(y_test_pred == y_test)\n", + "accuracy = float(num_correct) / num_test\n", + "print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "name": "python2", + "language": "python" + }, + "language_info": { + "mimetype": "text/x-python", + "nbconvert_exporter": "python", + "name": "python", + "file_extension": ".py", + "version": "2.7.3", + "pygments_lexer": "ipython2", + "codemirror_mode": { + "version": 2, + "name": "ipython" + } + } + } +} \ No newline at end of file diff --git a/assignments2016/assignment1/requirements.txt b/assignments2016/assignment1/requirements.txt new file mode 100644 index 00000000..13111380 --- /dev/null +++ b/assignments2016/assignment1/requirements.txt @@ -0,0 +1,46 @@ +Jinja2==2.8 +MarkupSafe==0.23 +Pillow==3.0.0 +Pygments==2.0.2 +appnope==0.1.0 +backports-abc==0.4 +backports.ssl-match-hostname==3.5.0.1 +certifi==2015.11.20.1 +cycler==0.9.0 +decorator==4.0.6 +functools32==3.2.3-2 +gnureadline==6.3.3 +ipykernel==4.2.2 +ipython==4.0.1 +ipython-genutils==0.1.0 +ipywidgets==4.1.1 +jsonschema==2.5.1 +jupyter==1.0.0 +jupyter-client==4.1.1 +jupyter-console==4.0.3 +jupyter-core==4.0.6 +matplotlib==1.5.0 +mistune==0.7.1 +nbconvert==4.1.0 +nbformat==4.0.1 +notebook==4.0.6 +numpy==1.10.4 +path.py==8.1.2 +pexpect==4.0.1 +pickleshare==0.5 +ptyprocess==0.5 +pyparsing==2.0.7 +python-dateutil==2.4.2 +pytz==2015.7 +pyzmq==15.1.0 +qtconsole==4.1.1 +scipy==0.16.1 +simplegeneric==0.8.1 +singledispatch==3.4.0.3 +six==1.10.0 +terminado==0.5 +tornado==4.3 +traitlets==4.0.0 
+wsgiref==0.1.2
diff --git a/assignments2016/assignment1/softmax.ipynb b/assignments2016/assignment1/softmax.ipynb
new file mode 100644
index 00000000..90914f36
--- /dev/null
+++ b/assignments2016/assignment1/softmax.ipynb
@@ -0,0 +1,308 @@
+{
+ "nbformat_minor": 0,
+ "nbformat": 4,
+ "cells": [
+ {
+ "source": [
+ "# Softmax exercise\n",
+ "\n",
+ "*Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. For more details see the [assignments page](http://vision.stanford.edu/teaching/cs231n/assignments.html) on the course website.*\n",
+ "\n",
+ "This exercise is analogous to the SVM exercise. You will:\n",
+ "\n",
+ "- implement a fully-vectorized **loss function** for the Softmax classifier\n",
+ "- implement the fully-vectorized expression for its **analytic gradient**\n",
+ "- **check your implementation** with numerical gradient\n",
+ "- use a validation set to **tune the learning rate and regularization** strength\n",
+ "- **optimize** the loss function with **SGD**\n",
+ "- **visualize** the final learned weights\n"
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "import random\n",
+ "import numpy as np\n",
+ "from cs231n.data_utils import load_CIFAR10\n",
+ "import matplotlib.pyplot as plt\n",
+ "%matplotlib inline\n",
+ "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n",
+ "plt.rcParams['image.interpolation'] = 'nearest'\n",
+ "plt.rcParams['image.cmap'] = 'gray'\n",
+ "\n",
+ "# for auto-reloading external modules\n",
+ "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n",
+ "%load_ext autoreload\n",
+ "%autoreload 2"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000, num_dev=500):\n",
+ " \"\"\"\n",
+ " Load the CIFAR-10 dataset from disk and perform preprocessing to prepare\n",
+ " it for the linear classifier. These are the same steps as we used for the\n",
+ " SVM, but condensed to a single function.
\n", + " \"\"\"\n", + " # Load the raw CIFAR-10 data\n", + " cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'\n", + " X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)\n", + " \n", + " # subsample the data\n", + " mask = range(num_training, num_training + num_validation)\n", + " X_val = X_train[mask]\n", + " y_val = y_train[mask]\n", + " mask = range(num_training)\n", + " X_train = X_train[mask]\n", + " y_train = y_train[mask]\n", + " mask = range(num_test)\n", + " X_test = X_test[mask]\n", + " y_test = y_test[mask]\n", + " mask = np.random.choice(num_training, num_dev, replace=False)\n", + " X_dev = X_train[mask]\n", + " y_dev = y_train[mask]\n", + " \n", + " # Preprocessing: reshape the image data into rows\n", + " X_train = np.reshape(X_train, (X_train.shape[0], -1))\n", + " X_val = np.reshape(X_val, (X_val.shape[0], -1))\n", + " X_test = np.reshape(X_test, (X_test.shape[0], -1))\n", + " X_dev = np.reshape(X_dev, (X_dev.shape[0], -1))\n", + " \n", + " # Normalize the data: subtract the mean image\n", + " mean_image = np.mean(X_train, axis = 0)\n", + " X_train -= mean_image\n", + " X_val -= mean_image\n", + " X_test -= mean_image\n", + " X_dev -= mean_image\n", + " \n", + " # add bias dimension and transform into columns\n", + " X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])\n", + " X_val = np.hstack([X_val, np.ones((X_val.shape[0], 1))])\n", + " X_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))])\n", + " X_dev = np.hstack([X_dev, np.ones((X_dev.shape[0], 1))])\n", + " \n", + " return X_train, y_train, X_val, y_val, X_test, y_test, X_dev, y_dev\n", + "\n", + "\n", + "# Invoke the above function to get our data.\n", + "X_train, y_train, X_val, y_val, X_test, y_test, X_dev, y_dev = get_CIFAR10_data()\n", + "print 'Train data shape: ', X_train.shape\n", + "print 'Train labels shape: ', y_train.shape\n", + "print 'Validation data shape: ', X_val.shape\n", + "print 'Validation labels shape: ', y_val.shape\n", + "print 'Test data shape: ', X_test.shape\n", + "print 'Test labels shape: ', y_test.shape\n", + "print 'dev data shape: ', X_dev.shape\n", + "print 'dev labels shape: ', y_dev.shape" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "## Softmax Classifier\n", + "\n", + "Your code for this section will all be written inside **cs231n/classifiers/softmax.py**. \n" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# First implement the naive softmax loss function with nested loops.\n", + "# Open the file cs231n/classifiers/softmax.py and implement the\n", + "# softmax_loss_naive function.\n", + "\n", + "from cs231n.classifiers.softmax import softmax_loss_naive\n", + "import time\n", + "\n", + "# Generate a random softmax weight matrix and use it to compute the loss.\n", + "W = np.random.randn(3073, 10) * 0.0001\n", + "loss, grad = softmax_loss_naive(W, X_dev, y_dev, 0.0)\n", + "\n", + "# As a rough sanity check, our loss should be something close to -log(0.1).\n", + "print 'loss: %f' % loss\n", + "print 'sanity check: %f' % (-np.log(0.1))" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "## Inline Question 1:\n", + "Why do we expect our loss to be close to -log(0.1)? 
Explain briefly.\n",
+ "\n",
+ "**Your answer:** *Fill this in*\n"
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "# Complete the implementation of softmax_loss_naive and implement a (naive)\n",
+ "# version of the gradient that uses nested loops.\n",
+ "loss, grad = softmax_loss_naive(W, X_dev, y_dev, 0.0)\n",
+ "\n",
+ "# As we did for the SVM, use numeric gradient checking as a debugging tool.\n",
+ "# The numeric gradient should be close to the analytic gradient.\n",
+ "from cs231n.gradient_check import grad_check_sparse\n",
+ "f = lambda w: softmax_loss_naive(w, X_dev, y_dev, 0.0)[0]\n",
+ "grad_numerical = grad_check_sparse(f, W, grad, 10)\n",
+ "\n",
+ "# similar to SVM case, do another gradient check with regularization\n",
+ "loss, grad = softmax_loss_naive(W, X_dev, y_dev, 1e2)\n",
+ "f = lambda w: softmax_loss_naive(w, X_dev, y_dev, 1e2)[0]\n",
+ "grad_numerical = grad_check_sparse(f, W, grad, 10)"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "# Now that we have a naive implementation of the softmax loss function and its gradient,\n",
+ "# implement a vectorized version in softmax_loss_vectorized.\n",
+ "# The two versions should compute the same results, but the vectorized version should be\n",
+ "# much faster.\n",
+ "tic = time.time()\n",
+ "loss_naive, grad_naive = softmax_loss_naive(W, X_dev, y_dev, 0.00001)\n",
+ "toc = time.time()\n",
+ "print 'naive loss: %e computed in %fs' % (loss_naive, toc - tic)\n",
+ "\n",
+ "from cs231n.classifiers.softmax import softmax_loss_vectorized\n",
+ "tic = time.time()\n",
+ "loss_vectorized, grad_vectorized = softmax_loss_vectorized(W, X_dev, y_dev, 0.00001)\n",
+ "toc = time.time()\n",
+ "print 'vectorized loss: %e computed in %fs' % (loss_vectorized, toc - tic)\n",
+ "\n",
+ "# As we did for the SVM, we use the Frobenius norm to compare the two versions\n",
+ "# of the gradient.\n",
+ "grad_difference = np.linalg.norm(grad_naive - grad_vectorized, ord='fro')\n",
+ "print 'Loss difference: %f' % np.abs(loss_naive - loss_vectorized)\n",
+ "print 'Gradient difference: %f' % grad_difference"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "# Use the validation set to tune hyperparameters (regularization strength and\n",
+ "# learning rate). You should experiment with different ranges for the learning\n",
+ "# rates and regularization strengths; if you are careful you should be able to\n",
+ "# get a classification accuracy of over 0.35 on the validation set.\n",
+ "from cs231n.classifiers import Softmax\n",
+ "results = {}\n",
+ "best_val = -1\n",
+ "best_softmax = None\n",
+ "learning_rates = [1e-7, 5e-7]\n",
+ "regularization_strengths = [5e4, 1e8]\n",
+ "\n",
+ "################################################################################\n",
+ "# TODO: #\n",
+ "# Use the validation set to set the learning rate and regularization strength. #\n",
+ "# This should be identical to the validation that you did for the SVM; save #\n",
+ "# the best trained softmax classifier in best_softmax.
#\n", + "################################################################################\n", + "pass\n", + "################################################################################\n", + "# END OF YOUR CODE #\n", + "################################################################################\n", + " \n", + "# Print out results.\n", + "for lr, reg in sorted(results):\n", + " train_accuracy, val_accuracy = results[(lr, reg)]\n", + " print 'lr %e reg %e train accuracy: %f val accuracy: %f' % (\n", + " lr, reg, train_accuracy, val_accuracy)\n", + " \n", + "print 'best validation accuracy achieved during cross-validation: %f' % best_val" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# evaluate on test set\n", + "# Evaluate the best softmax on test set\n", + "y_test_pred = best_softmax.predict(X_test)\n", + "test_accuracy = np.mean(y_test == y_test_pred)\n", + "print 'softmax on raw pixels final test set accuracy: %f' % (test_accuracy, )" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Visualize the learned weights for each class\n", + "w = best_softmax.W[:-1,:] # strip out the bias\n", + "w = w.reshape(32, 32, 3, 10)\n", + "\n", + "w_min, w_max = np.min(w), np.max(w)\n", + "\n", + "classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n", + "for i in xrange(10):\n", + " plt.subplot(2, 5, i + 1)\n", + " \n", + " # Rescale the weights to be between 0 and 255\n", + " wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)\n", + " plt.imshow(wimg.astype('uint8'))\n", + " plt.axis('off')\n", + " plt.title(classes[i])" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "name": "python2", + "language": "python" + }, + "language_info": { + "mimetype": "text/x-python", + "nbconvert_exporter": "python", + "name": "python", + "file_extension": ".py", + "version": "2.7.9", + "pygments_lexer": "ipython2", + "codemirror_mode": { + "version": 2, + "name": "ipython" + } + } + } +} \ No newline at end of file diff --git a/assignments2016/assignment1/start_ipython_osx.sh b/assignments2016/assignment1/start_ipython_osx.sh new file mode 100755 index 00000000..4815b001 --- /dev/null +++ b/assignments2016/assignment1/start_ipython_osx.sh @@ -0,0 +1,4 @@ +# Assume the virtualenv is called .env + +cp frameworkpython .env/bin +.env/bin/frameworkpython -m IPython notebook diff --git a/assignments2016/assignment1/svm.ipynb b/assignments2016/assignment1/svm.ipynb new file mode 100644 index 00000000..ef6331f7 --- /dev/null +++ b/assignments2016/assignment1/svm.ipynb @@ -0,0 +1,568 @@ +{ + "nbformat_minor": 0, + "nbformat": 4, + "cells": [ + { + "source": [ + "# Multiclass Support Vector Machine exercise\n", + "\n", + "*Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. 
For more details see the [assignments page](http://vision.stanford.edu/teaching/cs231n/assignments.html) on the course website.*\n", + "\n", + "In this exercise you will:\n", + " \n", + "- implement a fully-vectorized **loss function** for the SVM\n", + "- implement the fully-vectorized expression for its **analytic gradient**\n", + "- **check your implementation** using numerical gradient\n", + "- use a validation set to **tune the learning rate and regularization** strength\n", + "- **optimize** the loss function with **SGD**\n", + "- **visualize** the final learned weights\n" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Run some setup code for this notebook.\n", + "\n", + "import random\n", + "import numpy as np\n", + "from cs231n.data_utils import load_CIFAR10\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# This is a bit of magic to make matplotlib figures appear inline in the\n", + "# notebook rather than in a new window.\n", + "%matplotlib inline\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['image.interpolation'] = 'nearest'\n", + "plt.rcParams['image.cmap'] = 'gray'\n", + "\n", + "# Some more magic so that the notebook will reload external python modules;\n", + "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "%load_ext autoreload\n", + "%autoreload 2" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "## CIFAR-10 Data Loading and Preprocessing" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Load the raw CIFAR-10 data.\n", + "cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'\n", + "X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)\n", + "\n", + "# As a sanity check, we print out the size of the training and test data.\n", + "print 'Training data shape: ', X_train.shape\n", + "print 'Training labels shape: ', y_train.shape\n", + "print 'Test data shape: ', X_test.shape\n", + "print 'Test labels shape: ', y_test.shape" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Visualize some examples from the dataset.\n", + "# We show a few examples of training images from each class.\n", + "classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n", + "num_classes = len(classes)\n", + "samples_per_class = 7\n", + "for y, cls in enumerate(classes):\n", + " idxs = np.flatnonzero(y_train == y)\n", + " idxs = np.random.choice(idxs, samples_per_class, replace=False)\n", + " for i, idx in enumerate(idxs):\n", + " plt_idx = i * num_classes + y + 1\n", + " plt.subplot(samples_per_class, num_classes, plt_idx)\n", + " plt.imshow(X_train[idx].astype('uint8'))\n", + " plt.axis('off')\n", + " if i == 0:\n", + " plt.title(cls)\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Split the data into train, val, and test sets. 
In addition we will\n", + "# create a small development set as a subset of the training data;\n", + "# we can use this for development so our code runs faster.\n", + "num_training = 49000\n", + "num_validation = 1000\n", + "num_test = 1000\n", + "num_dev = 500\n", + "\n", + "# Our validation set will be num_validation points from the original\n", + "# training set.\n", + "mask = range(num_training, num_training + num_validation)\n", + "X_val = X_train[mask]\n", + "y_val = y_train[mask]\n", + "\n", + "# Our training set will be the first num_train points from the original\n", + "# training set.\n", + "mask = range(num_training)\n", + "X_train = X_train[mask]\n", + "y_train = y_train[mask]\n", + "\n", + "# We will also make a development set, which is a small subset of\n", + "# the training set.\n", + "mask = np.random.choice(num_training, num_dev, replace=False)\n", + "X_dev = X_train[mask]\n", + "y_dev = y_train[mask]\n", + "\n", + "# We use the first num_test points of the original test set as our\n", + "# test set.\n", + "mask = range(num_test)\n", + "X_test = X_test[mask]\n", + "y_test = y_test[mask]\n", + "\n", + "print 'Train data shape: ', X_train.shape\n", + "print 'Train labels shape: ', y_train.shape\n", + "print 'Validation data shape: ', X_val.shape\n", + "print 'Validation labels shape: ', y_val.shape\n", + "print 'Test data shape: ', X_test.shape\n", + "print 'Test labels shape: ', y_test.shape" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Preprocessing: reshape the image data into rows\n", + "X_train = np.reshape(X_train, (X_train.shape[0], -1))\n", + "X_val = np.reshape(X_val, (X_val.shape[0], -1))\n", + "X_test = np.reshape(X_test, (X_test.shape[0], -1))\n", + "X_dev = np.reshape(X_dev, (X_dev.shape[0], -1))\n", + "\n", + "# As a sanity check, print out the shapes of the data\n", + "print 'Training data shape: ', X_train.shape\n", + "print 'Validation data shape: ', X_val.shape\n", + "print 'Test data shape: ', X_test.shape\n", + "print 'dev data shape: ', X_dev.shape" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Preprocessing: subtract the mean image\n", + "# first: compute the image mean based on the training data\n", + "mean_image = np.mean(X_train, axis=0)\n", + "print mean_image[:10] # print a few of the elements\n", + "plt.figure(figsize=(4,4))\n", + "plt.imshow(mean_image.reshape((32,32,3)).astype('uint8')) # visualize the mean image\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# second: subtract the mean image from train and test data\n", + "X_train -= mean_image\n", + "X_val -= mean_image\n", + "X_test -= mean_image\n", + "X_dev -= mean_image" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# third: append the bias dimension of ones (i.e. 
bias trick) so that our SVM\n",
+ "# only has to worry about optimizing a single weight matrix W.\n",
+ "X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])\n",
+ "X_val = np.hstack([X_val, np.ones((X_val.shape[0], 1))])\n",
+ "X_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))])\n",
+ "X_dev = np.hstack([X_dev, np.ones((X_dev.shape[0], 1))])\n",
+ "\n",
+ "print X_train.shape, X_val.shape, X_test.shape, X_dev.shape"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "## SVM Classifier\n",
+ "\n",
+ "Your code for this section will all be written inside **cs231n/classifiers/linear_svm.py**. \n",
+ "\n",
+ "As you can see, we have prefilled the function `svm_loss_naive` which uses for loops to evaluate the multiclass SVM loss function. "
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "# Evaluate the naive implementation of the loss we provided for you:\n",
+ "from cs231n.classifiers.linear_svm import svm_loss_naive\n",
+ "import time\n",
+ "\n",
+ "# generate a random SVM weight matrix of small numbers\n",
+ "W = np.random.randn(3073, 10) * 0.0001 \n",
+ "\n",
+ "loss, grad = svm_loss_naive(W, X_dev, y_dev, 0.00001)\n",
+ "print 'loss: %f' % (loss, )"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "The `grad` returned from the function above is right now all zero. Derive the gradient for the SVM cost function and implement it inline inside the function `svm_loss_naive`. You will find it helpful to interleave your new code inside the existing function.\n",
+ "\n",
+ "To check that you have implemented the gradient correctly, you can numerically estimate the gradient of the loss function and compare the numeric estimate to the gradient that you computed. We have provided code that does this for you:"
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "# Once you've implemented the gradient, recompute it with the code below\n",
+ "# and gradient check it with the function we provided for you\n",
+ "\n",
+ "# Compute the loss and its gradient at W.\n",
+ "loss, grad = svm_loss_naive(W, X_dev, y_dev, 0.0)\n",
+ "\n",
+ "# Numerically compute the gradient along several randomly chosen dimensions, and\n",
+ "# compare them with your analytically computed gradient. The numbers should match\n",
+ "# almost exactly along all dimensions.\n",
+ "from cs231n.gradient_check import grad_check_sparse\n",
+ "f = lambda w: svm_loss_naive(w, X_dev, y_dev, 0.0)[0]\n",
+ "grad_numerical = grad_check_sparse(f, W, grad)\n",
+ "\n",
+ "# do the gradient check once again with regularization turned on\n",
+ "# you didn't forget the regularization gradient did you?\n",
+ "loss, grad = svm_loss_naive(W, X_dev, y_dev, 1e2)\n",
+ "f = lambda w: svm_loss_naive(w, X_dev, y_dev, 1e2)[0]\n",
+ "grad_numerical = grad_check_sparse(f, W, grad)"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "### Inline Question 1:\n",
+ "It is possible that once in a while a dimension in the gradcheck will not match exactly. What could such a discrepancy be caused by? Is it a reason for concern? What is a simple example in one dimension where a gradient check could fail?
*Hint: the SVM loss function is not strictly speaking differentiable*\n", + "\n", + "**Your Answer:** *fill this in.*" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Next implement the function svm_loss_vectorized; for now only compute the loss;\n", + "# we will implement the gradient in a moment.\n", + "tic = time.time()\n", + "loss_naive, grad_naive = svm_loss_naive(W, X_dev, y_dev, 0.00001)\n", + "toc = time.time()\n", + "print 'Naive loss: %e computed in %fs' % (loss_naive, toc - tic)\n", + "\n", + "from cs231n.classifiers.linear_svm import svm_loss_vectorized\n", + "tic = time.time()\n", + "loss_vectorized, _ = svm_loss_vectorized(W, X_dev, y_dev, 0.00001)\n", + "toc = time.time()\n", + "print 'Vectorized loss: %e computed in %fs' % (loss_vectorized, toc - tic)\n", + "\n", + "# The losses should match but your vectorized implementation should be much faster.\n", + "print 'difference: %f' % (loss_naive - loss_vectorized)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Complete the implementation of svm_loss_vectorized, and compute the gradient\n", + "# of the loss function in a vectorized way.\n", + "\n", + "# The naive implementation and the vectorized implementation should match, but\n", + "# the vectorized version should still be much faster.\n", + "tic = time.time()\n", + "_, grad_naive = svm_loss_naive(W, X_dev, y_dev, 0.00001)\n", + "toc = time.time()\n", + "print 'Naive loss and gradient: computed in %fs' % (toc - tic)\n", + "\n", + "tic = time.time()\n", + "_, grad_vectorized = svm_loss_vectorized(W, X_dev, y_dev, 0.00001)\n", + "toc = time.time()\n", + "print 'Vectorized loss and gradient: computed in %fs' % (toc - tic)\n", + "\n", + "# The loss is a single number, so it is easy to compare the values computed\n", + "# by the two implementations. The gradient on the other hand is a matrix, so\n", + "# we use the Frobenius norm to compare them.\n", + "difference = np.linalg.norm(grad_naive - grad_vectorized, ord='fro')\n", + "print 'difference: %f' % difference" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "### Stochastic Gradient Descent\n", + "\n", + "We now have vectorized and efficient expressions for the loss, the gradient and our gradient matches the numerical gradient. We are therefore ready to do SGD to minimize the loss." 
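
The update rule itself is only a few lines. Here is a self-contained sketch of a vanilla minibatch SGD loop on a toy least-squares problem; the toy data, learning rate, and batch size are illustrative assumptions, not the SVM settings:

```python
import numpy as np
np.random.seed(0)

# toy regression problem: recover true_w from noisy observations
X = np.random.randn(200, 3)
true_w = np.array([1., -2., 0.5])
y = X.dot(true_w) + 0.01 * np.random.randn(200)

W = np.zeros(3)
lr, batch_size = 1e-1, 32
for it in xrange(500):
    idx = np.random.choice(200, batch_size)            # sample a minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T.dot(Xb.dot(W) - yb) / batch_size   # gradient of the mean squared error
    W -= lr * grad                                     # vanilla SGD step
print W  # should be close to [ 1.  -2.   0.5]
```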
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# In the file linear_classifier.py, implement SGD in the function\n", + "# LinearClassifier.train() and then run it with the code below.\n", + "from cs231n.classifiers import LinearSVM\n", + "svm = LinearSVM()\n", + "tic = time.time()\n", + "loss_hist = svm.train(X_train, y_train, learning_rate=1e-7, reg=5e4,\n", + " num_iters=1500, verbose=True)\n", + "toc = time.time()\n", + "print 'That took %fs' % (toc - tic)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# A useful debugging strategy is to plot the loss as a function of\n", + "# iteration number:\n", + "plt.plot(loss_hist)\n", + "plt.xlabel('Iteration number')\n", + "plt.ylabel('Loss value')\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Write the LinearSVM.predict function and evaluate the performance on both the\n", + "# training and validation set\n", + "y_train_pred = svm.predict(X_train)\n", + "print 'training accuracy: %f' % (np.mean(y_train == y_train_pred), )\n", + "y_val_pred = svm.predict(X_val)\n", + "print 'validation accuracy: %f' % (np.mean(y_val == y_val_pred), )" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Use the validation set to tune hyperparameters (regularization strength and\n", + "# learning rate). You should experiment with different ranges for the learning\n", + "# rates and regularization strengths; if you are careful you should be able to\n", + "# get a classification accuracy of about 0.4 on the validation set.\n", + "learning_rates = [1e-7, 5e-5]\n", + "regularization_strengths = [5e4, 1e5]\n", + "\n", + "# results is dictionary mapping tuples of the form\n", + "# (learning_rate, regularization_strength) to tuples of the form\n", + "# (training_accuracy, validation_accuracy). The accuracy is simply the fraction\n", + "# of data points that are correctly classified.\n", + "results = {}\n", + "best_val = -1 # The highest validation accuracy that we have seen so far.\n", + "best_svm = None # The LinearSVM object that achieved the highest validation rate.\n", + "\n", + "################################################################################\n", + "# TODO: #\n", + "# Write code that chooses the best hyperparameters by tuning on the validation #\n", + "# set. For each combination of hyperparameters, train a linear SVM on the #\n", + "# training set, compute its accuracy on the training and validation sets, and #\n", + "# store these numbers in the results dictionary. In addition, store the best #\n", + "# validation accuracy in best_val and the LinearSVM object that achieves this #\n", + "# accuracy in best_svm. #\n", + "# #\n", + "# Hint: You should use a small value for num_iters as you develop your #\n", + "# validation code so that the SVMs don't take much time to train; once you are #\n", + "# confident that your validation code works, you should rerun the validation #\n", + "# code with a larger value for num_iters. 
#\n", + "################################################################################\n", + "pass\n", + "################################################################################\n", + "# END OF YOUR CODE #\n", + "################################################################################\n", + " \n", + "# Print out results.\n", + "for lr, reg in sorted(results):\n", + " train_accuracy, val_accuracy = results[(lr, reg)]\n", + " print 'lr %e reg %e train accuracy: %f val accuracy: %f' % (\n", + " lr, reg, train_accuracy, val_accuracy)\n", + " \n", + "print 'best validation accuracy achieved during cross-validation: %f' % best_val" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Visualize the cross-validation results\n", + "import math\n", + "x_scatter = [math.log10(x[0]) for x in results]\n", + "y_scatter = [math.log10(x[1]) for x in results]\n", + "\n", + "# plot training accuracy\n", + "marker_size = 100\n", + "colors = [results[x][0] for x in results]\n", + "plt.subplot(2, 1, 1)\n", + "plt.scatter(x_scatter, y_scatter, marker_size, c=colors)\n", + "plt.colorbar()\n", + "plt.xlabel('log learning rate')\n", + "plt.ylabel('log regularization strength')\n", + "plt.title('CIFAR-10 training accuracy')\n", + "\n", + "# plot validation accuracy\n", + "colors = [results[x][1] for x in results] # default size of markers is 20\n", + "plt.subplot(2, 1, 2)\n", + "plt.scatter(x_scatter, y_scatter, marker_size, c=colors)\n", + "plt.colorbar()\n", + "plt.xlabel('log learning rate')\n", + "plt.ylabel('log regularization strength')\n", + "plt.title('CIFAR-10 validation accuracy')\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Evaluate the best svm on test set\n", + "y_test_pred = best_svm.predict(X_test)\n", + "test_accuracy = np.mean(y_test == y_test_pred)\n", + "print 'linear SVM on raw pixels final test set accuracy: %f' % test_accuracy" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Visualize the learned weights for each class.\n", + "# Depending on your choice of learning rate and regularization strength, these may\n", + "# or may not be nice to look at.\n", + "w = best_svm.W[:-1,:] # strip out the bias\n", + "w = w.reshape(32, 32, 3, 10)\n", + "w_min, w_max = np.min(w), np.max(w)\n", + "classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n", + "for i in xrange(10):\n", + " plt.subplot(2, 5, i + 1)\n", + " \n", + " # Rescale the weights to be between 0 and 255\n", + " wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)\n", + " plt.imshow(wimg.astype('uint8'))\n", + " plt.axis('off')\n", + " plt.title(classes[i])" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "### Inline question 2:\n", + "Describe what your visualized SVM weights look like, and offer a brief explanation for why they look they way that they do.\n", + "\n", + "**Your answer:** *fill this in*" + ], + "cell_type": "markdown", + "metadata": {} + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "name": "python2", + "language": "python" + }, + "language_info": { + "mimetype": "text/x-python", + "nbconvert_exporter": "python", + "name": "python", + "file_extension": ".py", + "version": "2.7.9", + "pygments_lexer": 
"ipython2", + "codemirror_mode": { + "version": 2, + "name": "ipython" + } + } + } +} \ No newline at end of file diff --git a/assignments2016/assignment1/two_layer_net.ipynb b/assignments2016/assignment1/two_layer_net.ipynb new file mode 100644 index 00000000..917853bf --- /dev/null +++ b/assignments2016/assignment1/two_layer_net.ipynb @@ -0,0 +1,456 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Implementing a Neural Network\n", + "In this exercise we will develop a neural network with fully-connected layers to perform classification, and test it out on the CIFAR-10 dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# A bit of setup\n", + "\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "\n", + "from cs231n.classifiers.neural_net import TwoLayerNet\n", + "\n", + "%matplotlib inline\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['image.interpolation'] = 'nearest'\n", + "plt.rcParams['image.cmap'] = 'gray'\n", + "\n", + "# for auto-reloading external modules\n", + "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "%load_ext autoreload\n", + "%autoreload 2\n", + "\n", + "def rel_error(x, y):\n", + " \"\"\" returns relative error \"\"\"\n", + " return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will use the class `TwoLayerNet` in the file `cs231n/classifiers/neural_net.py` to represent instances of our network. The network parameters are stored in the instance variable `self.params` where keys are string parameter names and values are numpy arrays. Below, we initialize toy data and a toy model that we will use to develop your implementation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# Create a small net and some toy data to check your implementations.\n", + "# Note that we set the random seed for repeatable experiments.\n", + "\n", + "input_size = 4\n", + "hidden_size = 10\n", + "num_classes = 3\n", + "num_inputs = 5\n", + "\n", + "def init_toy_model():\n", + " np.random.seed(0)\n", + " return TwoLayerNet(input_size, hidden_size, num_classes, std=1e-1)\n", + "\n", + "def init_toy_data():\n", + " np.random.seed(1)\n", + " X = 10 * np.random.randn(num_inputs, input_size)\n", + " y = np.array([0, 1, 2, 2, 1])\n", + " return X, y\n", + "\n", + "net = init_toy_model()\n", + "X, y = init_toy_data()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Forward pass: compute scores\n", + "Open the file `cs231n/classifiers/neural_net.py` and look at the method `TwoLayerNet.loss`. This function is very similar to the loss functions you have written for the SVM and Softmax exercises: It takes the data and weights and computes the class scores, the loss, and the gradients on the parameters. \n", + "\n", + "Implement the first part of the forward pass which uses the weights and biases to compute the scores for all inputs." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "scores = net.loss(X)\n", + "print 'Your scores:'\n", + "print scores\n", + "print\n", + "print 'correct scores:'\n", + "correct_scores = np.asarray([\n", + " [-0.81233741, -1.27654624, -0.70335995],\n", + " [-0.17129677, -1.18803311, -0.47310444],\n", + " [-0.51590475, -1.01354314, -0.8504215 ],\n", + " [-0.15419291, -0.48629638, -0.52901952],\n", + " [-0.00618733, -0.12435261, -0.15226949]])\n", + "print correct_scores\n", + "print\n", + "\n", + "# The difference should be very small. We get < 1e-7\n", + "print 'Difference between your scores and correct scores:'\n", + "print np.sum(np.abs(scores - correct_scores))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Forward pass: compute loss\n", + "In the same function, implement the second part that computes the data and regularization loss." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "loss, _ = net.loss(X, y, reg=0.1)\n", + "correct_loss = 1.30378789133\n", + "\n", + "# should be very small, we get < 1e-12\n", + "print 'Difference between your loss and correct loss:'\n", + "print np.sum(np.abs(loss - correct_loss))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Backward pass\n", + "Implement the rest of the function. This will compute the gradient of the loss with respect to the variables `W1`, `b1`, `W2`, and `b2`. Now that you (hopefully!) have a correctly implemented forward pass, you can debug your backward pass using a numeric gradient check:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from cs231n.gradient_check import eval_numerical_gradient\n", + "\n", + "# Use numeric gradient checking to check your implementation of the backward pass.\n", + "# If your implementation is correct, the difference between the numeric and\n", + "# analytic gradients should be less than 1e-8 for each of W1, W2, b1, and b2.\n", + "\n", + "loss, grads = net.loss(X, y, reg=0.1)\n", + "\n", + "# these should all be less than 1e-8 or so\n", + "for param_name in grads:\n", + " f = lambda W: net.loss(X, y, reg=0.1)[0]\n", + " param_grad_num = eval_numerical_gradient(f, net.params[param_name], verbose=False)\n", + " print '%s max relative error: %e' % (param_name, rel_error(param_grad_num, grads[param_name]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Train the network\n", + "To train the network we will use stochastic gradient descent (SGD), similar to the SVM and Softmax classifiers. Look at the function `TwoLayerNet.train` and fill in the missing sections to implement the training procedure. This should be very similar to the training procedure you used for the SVM and Softmax classifiers. You will also have to implement `TwoLayerNet.predict`, as the training process periodically performs prediction to keep track of accuracy over time while the network trains.\n", + "\n", + "Once you have implemented the method, run the code below to train a two-layer network on toy data. You should achieve a training loss less than 0.2."
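The training procedure being asked for is plain minibatch SGD. A sketch of one iteration, assuming the in-scope notebook names `net`, `X`, `y` plus hypothetical `num_train`, `batch_size`, `learning_rate`, `reg`, and assuming `loss` returns `(loss, grads)` with `grads` keyed like `net.params`:

```python
import numpy as np

# Sample a random minibatch of training data.
batch_indices = np.random.choice(num_train, batch_size)
X_batch, y_batch = X[batch_indices], y[batch_indices]

# Compute loss and gradients, then take a step downhill on every parameter.
loss, grads = net.loss(X_batch, y_batch, reg=reg)
for param_name in net.params:
    net.params[param_name] -= learning_rate * grads[param_name]
```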
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "net = init_toy_model()\n", + "stats = net.train(X, y, X, y,\n", + " learning_rate=1e-1, reg=1e-5,\n", + " num_iters=100, verbose=False)\n", + "\n", + "print 'Final training loss: ', stats['loss_history'][-1]\n", + "\n", + "# plot the loss history\n", + "plt.plot(stats['loss_history'])\n", + "plt.xlabel('iteration')\n", + "plt.ylabel('training loss')\n", + "plt.title('Training Loss history')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Load the data\n", + "Now that you have implemented a two-layer network that passes gradient checks and works on toy data, it's time to load up our favorite CIFAR-10 data so we can use it to train a classifier on a real dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from cs231n.data_utils import load_CIFAR10\n", + "\n", + "def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000):\n", + " \"\"\"\n", + " Load the CIFAR-10 dataset from disk and perform preprocessing to prepare\n", + " it for the two-layer neural net classifier. These are the same steps as\n", + " we used for the SVM, but condensed to a single function. \n", + " \"\"\"\n", + " # Load the raw CIFAR-10 data\n", + " cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'\n", + " X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)\n", + " \n", + " # Subsample the data\n", + " mask = range(num_training, num_training + num_validation)\n", + " X_val = X_train[mask]\n", + " y_val = y_train[mask]\n", + " mask = range(num_training)\n", + " X_train = X_train[mask]\n", + " y_train = y_train[mask]\n", + " mask = range(num_test)\n", + " X_test = X_test[mask]\n", + " y_test = y_test[mask]\n", + "\n", + " # Normalize the data: subtract the mean image\n", + " mean_image = np.mean(X_train, axis=0)\n", + " X_train -= mean_image\n", + " X_val -= mean_image\n", + " X_test -= mean_image\n", + "\n", + " # Reshape data to rows\n", + " X_train = X_train.reshape(num_training, -1)\n", + " X_val = X_val.reshape(num_validation, -1)\n", + " X_test = X_test.reshape(num_test, -1)\n", + "\n", + " return X_train, y_train, X_val, y_val, X_test, y_test\n", + "\n", + "\n", + "# Invoke the above function to get our data.\n", + "X_train, y_train, X_val, y_val, X_test, y_test = get_CIFAR10_data()\n", + "print 'Train data shape: ', X_train.shape\n", + "print 'Train labels shape: ', y_train.shape\n", + "print 'Validation data shape: ', X_val.shape\n", + "print 'Validation labels shape: ', y_val.shape\n", + "print 'Test data shape: ', X_test.shape\n", + "print 'Test labels shape: ', y_test.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Train a network\n", + "To train our network we will use SGD with momentum. In addition, we will adjust the learning rate with an exponential learning rate schedule as optimization proceeds; after each epoch, we will reduce the learning rate by multiplying it by a decay rate." 
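Concretely, the momentum update and the per-epoch decay mentioned above might be sketched as follows (all names besides `net` are hypothetical: `mu` is the momentum coefficient, and `X_batch`/`y_batch` stand for the current minibatch):

```python
import numpy as np

# Keep one velocity vector per parameter, initialized to zero.
velocity = {p: np.zeros_like(w) for p, w in net.params.items()}
for it in range(num_iters):
    loss, grads = net.loss(X_batch, y_batch, reg=reg)
    for p in net.params:
        # Momentum update: accumulate a decaying moving average of gradients.
        velocity[p] = mu * velocity[p] - learning_rate * grads[p]
        net.params[p] += velocity[p]
    # Exponential schedule: shrink the learning rate after each epoch.
    if (it + 1) % iterations_per_epoch == 0:
        learning_rate *= learning_rate_decay
```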
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "input_size = 32 * 32 * 3\n", + "hidden_size = 50\n", + "num_classes = 10\n", + "net = TwoLayerNet(input_size, hidden_size, num_classes)\n", + "\n", + "# Train the network\n", + "stats = net.train(X_train, y_train, X_val, y_val,\n", + " num_iters=1000, batch_size=200,\n", + " learning_rate=1e-4, learning_rate_decay=0.95,\n", + " reg=0.5, verbose=True)\n", + "\n", + "# Predict on the validation set\n", + "val_acc = (net.predict(X_val) == y_val).mean()\n", + "print 'Validation accuracy: ', val_acc\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Debug the training\n", + "With the default parameters we provided above, you should get a validation accuracy of about 0.29. This isn't very good.\n", + "\n", + "One strategy for getting insight into what's wrong is to plot the loss function and the accuracies on the training and validation sets during optimization.\n", + "\n", + "Another strategy is to visualize the weights that were learned in the first layer of the network. In most neural networks trained on visual data, the first layer weights typically show some visible structure when visualized." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# Plot the loss function and train / validation accuracies\n", + "plt.subplot(2, 1, 1)\n", + "plt.plot(stats['loss_history'])\n", + "plt.title('Loss history')\n", + "plt.xlabel('Iteration')\n", + "plt.ylabel('Loss')\n", + "\n", + "plt.subplot(2, 1, 2)\n", + "plt.plot(stats['train_acc_history'], label='train')\n", + "plt.plot(stats['val_acc_history'], label='val')\n", + "plt.title('Classification accuracy history')\n", + "plt.xlabel('Epoch')\n", + "plt.ylabel('Classification accuracy')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from cs231n.vis_utils import visualize_grid\n", + "\n", + "# Visualize the weights of the network\n", + "\n", + "def show_net_weights(net):\n", + " W1 = net.params['W1']\n", + " W1 = W1.reshape(32, 32, 3, -1).transpose(3, 0, 1, 2)\n", + " plt.imshow(visualize_grid(W1, padding=3).astype('uint8'))\n", + " plt.gca().axis('off')\n", + " plt.show()\n", + "\n", + "show_net_weights(net)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Tune your hyperparameters\n", + "\n", + "**What's wrong?** Looking at the visualizations above, we see that the loss is decreasing more or less linearly, which seems to suggest that the learning rate may be too low. Moreover, there is no gap between the training and validation accuracy, suggesting that the model we used has low capacity, and that we should increase its size. On the other hand, with a very large model we would expect to see more overfitting, which would manifest itself as a very large gap between the training and validation accuracy.\n", + "\n", + "**Tuning**. Tuning the hyperparameters and developing intuition for how they affect the final performance is a large part of using Neural Networks, so we want you to get a lot of practice. Below, you should experiment with different values of the various hyperparameters, including hidden layer size, learning rate, number of training epochs, and regularization strength; a minimal sweep sketch follows below.
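A minimal sweep along the lines suggested above, reusing the `TwoLayerNet` interface from the earlier cells (the value grids are hypothetical starting points, not recommended settings):

```python
best_val, best_net = -1.0, None
for hidden_size in [50, 100, 150]:
    for lr in [1e-4, 5e-4, 1e-3]:
        for reg in [0.1, 0.5, 1.0]:
            net = TwoLayerNet(input_size, hidden_size, num_classes)
            net.train(X_train, y_train, X_val, y_val,
                      num_iters=1000, batch_size=200, learning_rate=lr,
                      learning_rate_decay=0.95, reg=reg)
            # Keep whichever model does best on the validation set.
            val_acc = (net.predict(X_val) == y_val).mean()
            if val_acc > best_val:
                best_val, best_net = val_acc, net
```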
You might also consider tuning the learning rate decay, but you should be able to get good performance using the default value.\n", + "\n", + "**Approximate results**. You should aim to achieve a classification accuracy of greater than 48% on the validation set. Our best network gets over 52% on the validation set.\n", + "\n", + "**Experiment**: Your goal in this exercise is to get as good a result on CIFAR-10 as you can, with a fully-connected Neural Network. For every 1% above 52% on the test set we will award you one extra bonus point. Feel free to implement your own techniques (e.g. PCA to reduce dimensionality, or adding dropout, or adding features to the solver, etc.)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "best_net = None # store the best model into this \n", + "\n", + "#################################################################################\n", + "# TODO: Tune hyperparameters using the validation set. Store your best trained #\n", + "# model in best_net. #\n", + "# #\n", + "# To help debug your network, it may help to use visualizations similar to the #\n", + "# ones we used above; these visualizations will have significant qualitative #\n", + "# differences from the ones we saw above for the poorly tuned network. #\n", + "# #\n", + "# Tweaking hyperparameters by hand can be fun, but you might find it useful to #\n", + "# write code to sweep through possible combinations of hyperparameters #\n", + "# automatically like we did on the previous exercises. #\n", + "#################################################################################\n", + "pass\n", + "#################################################################################\n", + "# END OF YOUR CODE #\n", + "#################################################################################" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# visualize the weights of the best network\n", + "show_net_weights(best_net)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Run on the test set\n", + "When you are done experimenting, you should evaluate your final trained network on the test set; you should get above 48%.\n", + "\n", + "**We will give you an extra bonus point for every 1% of accuracy above 52%.**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "test_acc = (best_net.predict(X_test) == y_test).mean()\n", + "print 'Test accuracy: ', test_acc" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "language": "python", + "name": "python2" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.11" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/assignments2016/assignment2/.gitignore b/assignments2016/assignment2/.gitignore new file mode 100644 index 00000000..b0611d38 --- /dev/null +++ b/assignments2016/assignment2/.gitignore @@ -0,0 +1,3 @@ +*.swp +*.pyc +.env/* diff --git a/assignments2016/assignment2/BatchNormalization.ipynb b/assignments2016/assignment2/BatchNormalization.ipynb new file mode 100644 index 00000000..c0ca1d51 --- /dev/null +++
b/assignments2016/assignment2/BatchNormalization.ipynb @@ -0,0 +1,516 @@ +{ + "nbformat_minor": 0, + "nbformat": 4, + "cells": [ + { + "source": [ + "# Batch Normalization\n", + "One way to make deep networks easier to train is to use more sophisticated optimization procedures such as SGD+momentum, RMSProp, or Adam. Another strategy is to change the architecture of the network to make it easier to train. One idea along these lines is batch normalization which was recently proposed by [3].\n", + "\n", + "The idea is relatively straightforward. Machine learning methods tend to work better when their input data consists of uncorrelated features with zero mean and unit variance. When training a neural network, we can preprocess the data before feeding it to the network to explicitly decorrelate its features; this will ensure that the first layer of the network sees data that follows a nice distribution. However even if we preprocess the input data, the activations at deeper layers of the network will likely no longer be decorrelated and will no longer have zero mean or unit variance since they are output from earlier layers in the network. Even worse, during the training process the distribution of features at each layer of the network will shift as the weights of each layer are updated.\n", + "\n", + "The authors of [3] hypothesize that the shifting distribution of features inside deep neural networks may make training deep networks more difficult. To overcome this problem, [3] proposes to insert batch normalization layers into the network. At training time, a batch normalization layer uses a minibatch of data to estimate the mean and standard deviation of each feature. These estimated means and standard deviations are then used to center and normalize the features of the minibatch. A running average of these means and standard deviations is kept during training, and at test time these running averages are used to center and normalize features.\n", + "\n", + "It is possible that this normalization strategy could reduce the representational power of the network, since it may sometimes be optimal for certain layers to have features that are not zero-mean or unit variance. To this end, the batch normalization layer includes learnable shift and scale parameters for each feature dimension.\n", + "\n", + "[3] Sergey Ioffe and Christian Szegedy, \"Batch Normalization: Accelerating Deep Network Training by Reducing\n", + "Internal Covariate Shift\", ICML 2015." 
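The training-time computation described above fits in a few lines of numpy. A sketch of the core idea (`eps` and `momentum` are assumed hyperparameters; the real `batchnorm_forward` you will implement must also return a cache for the backward pass and handle a test mode):

```python
import numpy as np

def batchnorm_train_sketch(x, gamma, beta, running_mean, running_var,
                           eps=1e-5, momentum=0.9):
    # Per-feature statistics over the minibatch; x has shape (N, D).
    mu, var = x.mean(axis=0), x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    out = gamma * x_hat + beta             # learnable scale and shift
    # Running averages are what test-time normalization uses instead.
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return out, running_mean, running_var
```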
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# As usual, a bit of setup\n", + "\n", + "import time\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from cs231n.classifiers.fc_net import *\n", + "from cs231n.data_utils import get_CIFAR10_data\n", + "from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array\n", + "from cs231n.solver import Solver\n", + "\n", + "%matplotlib inline\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['image.interpolation'] = 'nearest'\n", + "plt.rcParams['image.cmap'] = 'gray'\n", + "\n", + "# for auto-reloading external modules\n", + "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "%load_ext autoreload\n", + "%autoreload 2\n", + "\n", + "def rel_error(x, y):\n", + " \"\"\" returns relative error \"\"\"\n", + " return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Load the (preprocessed) CIFAR10 data.\n", + "\n", + "data = get_CIFAR10_data()\n", + "for k, v in data.iteritems():\n", + " print '%s: ' % k, v.shape" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "## Batch normalization: Forward\n", + "In the file `cs231n/layers.py`, implement the batch normalization forward pass in the function `batchnorm_forward`. Once you have done so, run the following to test your implementation." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Check the training-time forward pass by checking means and variances\n", + "# of features both before and after batch normalization\n", + "\n", + "# Simulate the forward pass for a two-layer network\n", + "N, D1, D2, D3 = 200, 50, 60, 3\n", + "X = np.random.randn(N, D1)\n", + "W1 = np.random.randn(D1, D2)\n", + "W2 = np.random.randn(D2, D3)\n", + "a = np.maximum(0, X.dot(W1)).dot(W2)\n", + "\n", + "print 'Before batch normalization:'\n", + "print ' means: ', a.mean(axis=0)\n", + "print ' stds: ', a.std(axis=0)\n", + "\n", + "# Means should be close to zero and stds close to one\n", + "print 'After batch normalization (gamma=1, beta=0)'\n", + "a_norm, _ = batchnorm_forward(a, np.ones(D3), np.zeros(D3), {'mode': 'train'})\n", + "print ' mean: ', a_norm.mean(axis=0)\n", + "print ' std: ', a_norm.std(axis=0)\n", + "\n", + "# Now means should be close to beta and stds close to gamma\n", + "gamma = np.asarray([1.0, 2.0, 3.0])\n", + "beta = np.asarray([11.0, 12.0, 13.0])\n", + "a_norm, _ = batchnorm_forward(a, gamma, beta, {'mode': 'train'})\n", + "print 'After batch normalization (nontrivial gamma, beta)'\n", + "print ' means: ', a_norm.mean(axis=0)\n", + "print ' stds: ', a_norm.std(axis=0)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Check the test-time forward pass by running the training-time\n", + "# forward pass many times to warm up the running averages, and then\n", + "# checking the means and variances of activations after a test-time\n", + "# forward pass.\n", + "\n", + "N, D1, D2, D3 = 200, 50, 60, 3\n", + "W1 = np.random.randn(D1, D2)\n", + "W2 = np.random.randn(D2, D3)\n", + "\n", + "bn_param = {'mode': 'train'}\n", + "gamma = 
np.ones(D3)\n", + "beta = np.zeros(D3)\n", + "for t in xrange(50):\n", + " X = np.random.randn(N, D1)\n", + " a = np.maximum(0, X.dot(W1)).dot(W2)\n", + " batchnorm_forward(a, gamma, beta, bn_param)\n", + "bn_param['mode'] = 'test'\n", + "X = np.random.randn(N, D1)\n", + "a = np.maximum(0, X.dot(W1)).dot(W2)\n", + "a_norm, _ = batchnorm_forward(a, gamma, beta, bn_param)\n", + "\n", + "# Means should be close to zero and stds close to one, but will be\n", + "# noisier than training-time forward passes.\n", + "print 'After batch normalization (test-time):'\n", + "print ' means: ', a_norm.mean(axis=0)\n", + "print ' stds: ', a_norm.std(axis=0)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "## Batch Normalization: backward\n", + "Now implement the backward pass for batch normalization in the function `batchnorm_backward`.\n", + "\n", + "To derive the backward pass you should write out the computation graph for batch normalization and backprop through each of the intermediate nodes. Some intermediates may have multiple outgoing branches; make sure to sum gradients across these branches in the backward pass.\n", + "\n", + "Once you have finished, run the following to numerically check your backward pass." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Gradient check batchnorm backward pass\n", + "\n", + "N, D = 4, 5\n", + "x = 5 * np.random.randn(N, D) + 12\n", + "gamma = np.random.randn(D)\n", + "beta = np.random.randn(D)\n", + "dout = np.random.randn(N, D)\n", + "\n", + "bn_param = {'mode': 'train'}\n", + "fx = lambda x: batchnorm_forward(x, gamma, beta, bn_param)[0]\n", + "fg = lambda a: batchnorm_forward(x, gamma, beta, bn_param)[0]\n", + "fb = lambda b: batchnorm_forward(x, gamma, beta, bn_param)[0]\n", + "\n", + "dx_num = eval_numerical_gradient_array(fx, x, dout)\n", + "da_num = eval_numerical_gradient_array(fg, gamma, dout)\n", + "db_num = eval_numerical_gradient_array(fb, beta, dout)\n", + "\n", + "_, cache = batchnorm_forward(x, gamma, beta, bn_param)\n", + "dx, dgamma, dbeta = batchnorm_backward(dout, cache)\n", + "print 'dx error: ', rel_error(dx_num, dx)\n", + "print 'dgamma error: ', rel_error(da_num, dgamma)\n", + "print 'dbeta error: ', rel_error(db_num, dbeta)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "## Batch Normalization: alternative backward\n", + "In class we talked about two different implementations for the sigmoid backward pass. One strategy is to write out a computation graph composed of simple operations and backprop through all intermediate values. Another strategy is to work out the derivatives on paper. For the sigmoid function, it turns out that you can derive a very simple formula for the backward pass by simplifying gradients on paper.\n", + "\n", + "Surprisingly, it turns out that you can also derive a simple expression for the batch normalization backward pass if you work out derivatives on paper and simplify. After doing so, implement the simplified batch normalization backward pass in the function `batchnorm_backward_alt` and compare the two implementations by running the following. Your two implementations should compute nearly identical results, but the alternative implementation should be a bit faster.\n", + "\n", + "NOTE: You can still complete the rest of the assignment if you don't figure this part out, so don't worry too much if you can't get it." 
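For reference, one commonly seen simplified form of the input gradient is sketched below; treat it as a check on your own derivation rather than the answer, and note that the cache layout `(x_hat, gamma, var, eps)` is just an assumption about what your forward pass saved:

```python
import numpy as np

def batchnorm_backward_alt_sketch(dout, cache):
    x_hat, gamma, var, eps = cache
    N = dout.shape[0]
    dbeta = dout.sum(axis=0)
    dgamma = (dout * x_hat).sum(axis=0)
    # Collapsing the computation graph yields one closed-form expression for dx.
    dx_hat = dout * gamma
    dx = (N * dx_hat - dx_hat.sum(axis=0)
          - x_hat * (dx_hat * x_hat).sum(axis=0)) / (N * np.sqrt(var + eps))
    return dx, dgamma, dbeta
```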
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "N, D = 100, 500\n", + "x = 5 * np.random.randn(N, D) + 12\n", + "gamma = np.random.randn(D)\n", + "beta = np.random.randn(D)\n", + "dout = np.random.randn(N, D)\n", + "\n", + "bn_param = {'mode': 'train'}\n", + "out, cache = batchnorm_forward(x, gamma, beta, bn_param)\n", + "\n", + "t1 = time.time()\n", + "dx1, dgamma1, dbeta1 = batchnorm_backward(dout, cache)\n", + "t2 = time.time()\n", + "dx2, dgamma2, dbeta2 = batchnorm_backward_alt(dout, cache)\n", + "t3 = time.time()\n", + "\n", + "print 'dx difference: ', rel_error(dx1, dx2)\n", + "print 'dgamma difference: ', rel_error(dgamma1, dgamma2)\n", + "print 'dbeta difference: ', rel_error(dbeta1, dbeta2)\n", + "print 'speedup: %.2fx' % ((t2 - t1) / (t3 - t2))" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "## Fully Connected Nets with Batch Normalization\n", + "Now that you have a working implementation for batch normalization, go back to your `FullyConnectedNet` in the file `cs231n/classifiers/fc_net.py`. Modify your implementation to add batch normalization.\n", + "\n", + "Concretely, when the flag `use_batchnorm` is `True` in the constructor, you should insert a batch normalization layer before each ReLU nonlinearity. The outputs from the last layer of the network should not be normalized. Once you are done, run the following to gradient-check your implementation.\n", + "\n", + "HINT: You might find it useful to define an additional helper layer similar to those in the file `cs231n/layer_utils.py`. If you decide to do so, do it in the file `cs231n/classifiers/fc_net.py`." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "N, D, H1, H2, C = 2, 15, 20, 30, 10\n", + "X = np.random.randn(N, D)\n", + "y = np.random.randint(C, size=(N,))\n", + "\n", + "for reg in [0, 3.14]:\n", + " print 'Running check with reg = ', reg\n", + " model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,\n", + " reg=reg, weight_scale=5e-2, dtype=np.float64,\n", + " use_batchnorm=True)\n", + "\n", + " loss, grads = model.loss(X, y)\n", + " print 'Initial loss: ', loss\n", + "\n", + " for name in sorted(grads):\n", + " f = lambda _: model.loss(X, y)[0]\n", + " grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)\n", + " print '%s relative error: %.2e' % (name, rel_error(grad_num, grads[name]))\n", + " if reg == 0: print" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Batchnorm for deep networks\n", + "Run the following to train a six-layer network on a subset of 1000 training examples both with and without batch normalization."
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Try training a very deep net with batchnorm\n", + "hidden_dims = [100, 100, 100, 100, 100]\n", + "\n", + "num_train = 1000\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "weight_scale = 2e-2\n", + "bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, use_batchnorm=True)\n", + "model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, use_batchnorm=False)\n", + "\n", + "bn_solver = Solver(bn_model, small_data,\n", + " num_epochs=10, batch_size=50,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 1e-3,\n", + " },\n", + " verbose=True, print_every=200)\n", + "bn_solver.train()\n", + "\n", + "solver = Solver(model, small_data,\n", + " num_epochs=10, batch_size=50,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 1e-3,\n", + " },\n", + " verbose=True, print_every=200)\n", + "solver.train()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "Run the following to visualize the results from the two networks trained above. You should find that using batch normalization helps the network to converge much faster." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "plt.subplot(3, 1, 1)\n", + "plt.title('Training loss')\n", + "plt.xlabel('Iteration')\n", + "\n", + "plt.subplot(3, 1, 2)\n", + "plt.title('Training accuracy')\n", + "plt.xlabel('Epoch')\n", + "\n", + "plt.subplot(3, 1, 3)\n", + "plt.title('Validation accuracy')\n", + "plt.xlabel('Epoch')\n", + "\n", + "plt.subplot(3, 1, 1)\n", + "plt.plot(solver.loss_history, 'o', label='baseline')\n", + "plt.plot(bn_solver.loss_history, 'o', label='batchnorm')\n", + "\n", + "plt.subplot(3, 1, 2)\n", + "plt.plot(solver.train_acc_history, '-o', label='baseline')\n", + "plt.plot(bn_solver.train_acc_history, '-o', label='batchnorm')\n", + "\n", + "plt.subplot(3, 1, 3)\n", + "plt.plot(solver.val_acc_history, '-o', label='baseline')\n", + "plt.plot(bn_solver.val_acc_history, '-o', label='batchnorm')\n", + " \n", + "for i in [1, 2, 3]:\n", + " plt.subplot(3, 1, i)\n", + " plt.legend(loc='upper center', ncol=4)\n", + "plt.gcf().set_size_inches(15, 15)\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Batch normalization and initialization\n", + "We will now run a small experiment to study the interaction of batch normalization and weight initialization.\n", + "\n", + "The first cell will train 8-layer networks both with and without batch normalization using different scales for weight initialization. The second cell will plot training accuracy, validation set accuracy, and training loss as a function of the weight initialization scale."
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Try training a very deep net with batchnorm\n", + "hidden_dims = [50, 50, 50, 50, 50, 50, 50]\n", + "\n", + "num_train = 1000\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "bn_solvers = {}\n", + "solvers = {}\n", + "weight_scales = np.logspace(-4, 0, num=20)\n", + "for i, weight_scale in enumerate(weight_scales):\n", + " print 'Running weight scale %d / %d' % (i + 1, len(weight_scales))\n", + " bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, use_batchnorm=True)\n", + " model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, use_batchnorm=False)\n", + "\n", + " bn_solver = Solver(bn_model, small_data,\n", + " num_epochs=10, batch_size=50,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 1e-3,\n", + " },\n", + " verbose=False, print_every=200)\n", + " bn_solver.train()\n", + " bn_solvers[weight_scale] = bn_solver\n", + "\n", + " solver = Solver(model, small_data,\n", + " num_epochs=10, batch_size=50,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 1e-3,\n", + " },\n", + " verbose=False, print_every=200)\n", + " solver.train()\n", + " solvers[weight_scale] = solver" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Plot results of weight scale experiment\n", + "best_train_accs, bn_best_train_accs = [], []\n", + "best_val_accs, bn_best_val_accs = [], []\n", + "final_train_loss, bn_final_train_loss = [], []\n", + "\n", + "for ws in weight_scales:\n", + " best_train_accs.append(max(solvers[ws].train_acc_history))\n", + " bn_best_train_accs.append(max(bn_solvers[ws].train_acc_history))\n", + " \n", + " best_val_accs.append(max(solvers[ws].val_acc_history))\n", + " bn_best_val_accs.append(max(bn_solvers[ws].val_acc_history))\n", + " \n", + " final_train_loss.append(np.mean(solvers[ws].loss_history[-100:]))\n", + " bn_final_train_loss.append(np.mean(bn_solvers[ws].loss_history[-100:]))\n", + " \n", + "plt.subplot(3, 1, 1)\n", + "plt.title('Best val accuracy vs weight initialization scale')\n", + "plt.xlabel('Weight initialization scale')\n", + "plt.ylabel('Best val accuracy')\n", + "plt.semilogx(weight_scales, best_val_accs, '-o', label='baseline')\n", + "plt.semilogx(weight_scales, bn_best_val_accs, '-o', label='batchnorm')\n", + "plt.legend(ncol=2, loc='lower right')\n", + "\n", + "plt.subplot(3, 1, 2)\n", + "plt.title('Best train accuracy vs weight initialization scale')\n", + "plt.xlabel('Weight initialization scale')\n", + "plt.ylabel('Best training accuracy')\n", + "plt.semilogx(weight_scales, best_train_accs, '-o', label='baseline')\n", + "plt.semilogx(weight_scales, bn_best_train_accs, '-o', label='batchnorm')\n", + "plt.legend()\n", + "\n", + "plt.subplot(3, 1, 3)\n", + "plt.title('Final training loss vs weight initialization scale')\n", + "plt.xlabel('Weight initialization scale')\n", + "plt.ylabel('Final training loss')\n", + "plt.semilogx(weight_scales, final_train_loss, '-o', label='baseline')\n", + "plt.semilogx(weight_scales, bn_final_train_loss, '-o', label='batchnorm')\n", + "plt.legend()\n", + "\n", + "plt.gcf().set_size_inches(10, 15)\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + 
"source": [ + "# Question:\n", + "Describe the results of this experiment, and try to give a reason why the experiment gave the results that it did." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "# Answer:\n" + ], + "cell_type": "markdown", + "metadata": {} + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "name": "python2", + "language": "python" + }, + "language_info": { + "mimetype": "text/x-python", + "nbconvert_exporter": "python", + "name": "python", + "file_extension": ".py", + "version": "2.7.6", + "pygments_lexer": "ipython2", + "codemirror_mode": { + "version": 2, + "name": "ipython" + } + } + } +} \ No newline at end of file diff --git a/assignments2016/assignment2/ConvolutionalNetworks.ipynb b/assignments2016/assignment2/ConvolutionalNetworks.ipynb new file mode 100644 index 00000000..d57dd2ec --- /dev/null +++ b/assignments2016/assignment2/ConvolutionalNetworks.ipynb @@ -0,0 +1,869 @@ +{ + "nbformat_minor": 0, + "nbformat": 4, + "cells": [ + { + "source": [ + "# Convolutional Networks\n", + "So far we have worked with deep fully-connected networks, using them to explore different optimization strategies and network architectures. Fully-connected networks are a good testbed for experimentation because they are very computationally efficient, but in practice all state-of-the-art results use convolutional networks instead.\n", + "\n", + "First you will implement several layer types that are used in convolutional networks. You will then use these layers to train a convolutional network on the CIFAR-10 dataset." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# As usual, a bit of setup\n", + "\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from cs231n.classifiers.cnn import *\n", + "from cs231n.data_utils import get_CIFAR10_data\n", + "from cs231n.gradient_check import eval_numerical_gradient_array, eval_numerical_gradient\n", + "from cs231n.layers import *\n", + "from cs231n.fast_layers import *\n", + "from cs231n.solver import Solver\n", + "\n", + "%matplotlib inline\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['image.interpolation'] = 'nearest'\n", + "plt.rcParams['image.cmap'] = 'gray'\n", + "\n", + "# for auto-reloading external modules\n", + "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "%load_ext autoreload\n", + "%autoreload 2\n", + "\n", + "def rel_error(x, y):\n", + " \"\"\" returns relative error \"\"\"\n", + " return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Load the (preprocessed) CIFAR10 data.\n", + "\n", + "data = get_CIFAR10_data()\n", + "for k, v in data.iteritems():\n", + " print '%s: ' % k, v.shape" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Convolution: Naive forward pass\n", + "The core of a convolutional network is the convolution operation. In the file `cs231n/layers.py`, implement the forward pass for the convolution layer in the function `conv_forward_naive`. 
\n", + "\n", + "You don't have to worry too much about efficiency at this point; just write the code in whatever way you find most clear.\n", + "\n", + "You can test your implementation by running the following:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "x_shape = (2, 3, 4, 4)\n", + "w_shape = (3, 3, 4, 4)\n", + "x = np.linspace(-0.1, 0.5, num=np.prod(x_shape)).reshape(x_shape)\n", + "w = np.linspace(-0.2, 0.3, num=np.prod(w_shape)).reshape(w_shape)\n", + "b = np.linspace(-0.1, 0.2, num=3)\n", + "\n", + "conv_param = {'stride': 2, 'pad': 1}\n", + "out, _ = conv_forward_naive(x, w, b, conv_param)\n", + "correct_out = np.array([[[[[-0.08759809, -0.10987781],\n", + " [-0.18387192, -0.2109216 ]],\n", + " [[ 0.21027089, 0.21661097],\n", + " [ 0.22847626, 0.23004637]],\n", + " [[ 0.50813986, 0.54309974],\n", + " [ 0.64082444, 0.67101435]]],\n", + " [[[-0.98053589, -1.03143541],\n", + " [-1.19128892, -1.24695841]],\n", + " [[ 0.69108355, 0.66880383],\n", + " [ 0.59480972, 0.56776003]],\n", + " [[ 2.36270298, 2.36904306],\n", + " [ 2.38090835, 2.38247847]]]]])\n", + "\n", + "# Compare your output to ours; difference should be around 1e-8\n", + "print 'Testing conv_forward_naive'\n", + "print 'difference: ', rel_error(out, correct_out)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Aside: Image processing via convolutions\n", + "\n", + "As fun way to both check your implementation and gain a better understanding of the type of operation that convolutional layers can perform, we will set up an input containing two images and manually set up filters that perform common image processing operations (grayscale conversion and edge detection). The convolution forward pass will apply these operations to each of the input images. We can then visualize the results as a sanity check." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from scipy.misc import imread, imresize\n", + "\n", + "kitten, puppy = imread('kitten.jpg'), imread('puppy.jpg')\n", + "# kitten is wide, and puppy is already square\n", + "d = kitten.shape[1] - kitten.shape[0]\n", + "kitten_cropped = kitten[:, d/2:-d/2, :]\n", + "\n", + "img_size = 200 # Make this smaller if it runs too slow\n", + "x = np.zeros((2, 3, img_size, img_size))\n", + "x[0, :, :, :] = imresize(puppy, (img_size, img_size)).transpose((2, 0, 1))\n", + "x[1, :, :, :] = imresize(kitten_cropped, (img_size, img_size)).transpose((2, 0, 1))\n", + "\n", + "# Set up a convolutional weights holding 2 filters, each 3x3\n", + "w = np.zeros((2, 3, 3, 3))\n", + "\n", + "# The first filter converts the image to grayscale.\n", + "# Set up the red, green, and blue channels of the filter.\n", + "w[0, 0, :, :] = [[0, 0, 0], [0, 0.3, 0], [0, 0, 0]]\n", + "w[0, 1, :, :] = [[0, 0, 0], [0, 0.6, 0], [0, 0, 0]]\n", + "w[0, 2, :, :] = [[0, 0, 0], [0, 0.1, 0], [0, 0, 0]]\n", + "\n", + "# Second filter detects horizontal edges in the blue channel.\n", + "w[1, 2, :, :] = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]\n", + "\n", + "# Vector of biases. 
We don't need any bias for the grayscale\n", + "# filter, but for the edge detection filter we want to add 128\n", + "# to each output so that nothing is negative.\n", + "b = np.array([0, 128])\n", + "\n", + "# Compute the result of convolving each input in x with each filter in w,\n", + "# offsetting by b, and storing the results in out.\n", + "out, _ = conv_forward_naive(x, w, b, {'stride': 1, 'pad': 1})\n", + "\n", + "def imshow_noax(img, normalize=True):\n", + " \"\"\" Tiny helper to show images as uint8 and remove axis labels \"\"\"\n", + " if normalize:\n", + " img_max, img_min = np.max(img), np.min(img)\n", + " img = 255.0 * (img - img_min) / (img_max - img_min)\n", + " plt.imshow(img.astype('uint8'))\n", + " plt.gca().axis('off')\n", + "\n", + "# Show the original images and the results of the conv operation\n", + "plt.subplot(2, 3, 1)\n", + "imshow_noax(puppy, normalize=False)\n", + "plt.title('Original image')\n", + "plt.subplot(2, 3, 2)\n", + "imshow_noax(out[0, 0])\n", + "plt.title('Grayscale')\n", + "plt.subplot(2, 3, 3)\n", + "imshow_noax(out[0, 1])\n", + "plt.title('Edges')\n", + "plt.subplot(2, 3, 4)\n", + "imshow_noax(kitten_cropped, normalize=False)\n", + "plt.subplot(2, 3, 5)\n", + "imshow_noax(out[1, 0])\n", + "plt.subplot(2, 3, 6)\n", + "imshow_noax(out[1, 1])\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Convolution: Naive backward pass\n", + "Implement the backward pass for the convolution operation in the function `conv_backward_naive` in the file `cs231n/layers.py`. Again, you don't need to worry too much about computational efficiency.\n", + "\n", + "When you are done, run the following to check your backward pass with a numeric gradient check." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "x = np.random.randn(4, 3, 5, 5)\n", + "w = np.random.randn(2, 3, 3, 3)\n", + "b = np.random.randn(2,)\n", + "dout = np.random.randn(4, 2, 5, 5)\n", + "conv_param = {'stride': 1, 'pad': 1}\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: conv_forward_naive(x, w, b, conv_param)[0], x, dout)\n", + "dw_num = eval_numerical_gradient_array(lambda w: conv_forward_naive(x, w, b, conv_param)[0], w, dout)\n", + "db_num = eval_numerical_gradient_array(lambda b: conv_forward_naive(x, w, b, conv_param)[0], b, dout)\n", + "\n", + "out, cache = conv_forward_naive(x, w, b, conv_param)\n", + "dx, dw, db = conv_backward_naive(dout, cache)\n", + "\n", + "# Your errors should be around 1e-9\n", + "print 'Testing conv_backward_naive function'\n", + "print 'dx error: ', rel_error(dx, dx_num)\n", + "print 'dw error: ', rel_error(dw, dw_num)\n", + "print 'db error: ', rel_error(db, db_num)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Max pooling: Naive forward\n", + "Implement the forward pass for the max-pooling operation in the function `max_pool_forward_naive` in the file `cs231n/layers.py`.
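A loop-based sketch of the pooling forward pass (again, the real `max_pool_forward_naive` must also return a cache for the backward pass):

```python
import numpy as np

def max_pool_forward_sketch(x, pool_param):
    # x: (N, C, H, W); take the max over each pooling window.
    N, C, H, W = x.shape
    ph, pw = pool_param['pool_height'], pool_param['pool_width']
    stride = pool_param['stride']
    H_out = 1 + (H - ph) // stride
    W_out = 1 + (W - pw) // stride
    out = np.zeros((N, C, H_out, W_out))
    for i in range(H_out):
        for j in range(W_out):
            window = x[:, :, i*stride:i*stride+ph, j*stride:j*stride+pw]
            out[:, :, i, j] = window.max(axis=(2, 3))
    return out
```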
Again, don't worry too much about computational efficiency.\n", + "\n", + "Check your implementation by running the following:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "x_shape = (2, 3, 4, 4)\n", + "x = np.linspace(-0.3, 0.4, num=np.prod(x_shape)).reshape(x_shape)\n", + "pool_param = {'pool_width': 2, 'pool_height': 2, 'stride': 2}\n", + "\n", + "out, _ = max_pool_forward_naive(x, pool_param)\n", + "\n", + "correct_out = np.array([[[[-0.26315789, -0.24842105],\n", + " [-0.20421053, -0.18947368]],\n", + " [[-0.14526316, -0.13052632],\n", + " [-0.08631579, -0.07157895]],\n", + " [[-0.02736842, -0.01263158],\n", + " [ 0.03157895, 0.04631579]]],\n", + " [[[ 0.09052632, 0.10526316],\n", + " [ 0.14947368, 0.16421053]],\n", + " [[ 0.20842105, 0.22315789],\n", + " [ 0.26736842, 0.28210526]],\n", + " [[ 0.32631579, 0.34105263],\n", + " [ 0.38526316, 0.4 ]]]])\n", + "\n", + "# Compare your output with ours. Difference should be around 1e-8.\n", + "print 'Testing max_pool_forward_naive function:'\n", + "print 'difference: ', rel_error(out, correct_out)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Max pooling: Naive backward\n", + "Implement the backward pass for the max-pooling operation in the function `max_pool_backward_naive` in the file `cs231n/layers.py`. You don't need to worry about computational efficiency.\n", + "\n", + "Check your implementation with numeric gradient checking by running the following:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "x = np.random.randn(3, 2, 8, 8)\n", + "dout = np.random.randn(3, 2, 4, 4)\n", + "pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: max_pool_forward_naive(x, pool_param)[0], x, dout)\n", + "\n", + "out, cache = max_pool_forward_naive(x, pool_param)\n", + "dx = max_pool_backward_naive(dout, cache)\n", + "\n", + "# Your error should be around 1e-12\n", + "print 'Testing max_pool_backward_naive function:'\n", + "print 'dx error: ', rel_error(dx, dx_num)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Fast layers\n", + "Making convolution and pooling layers fast can be challenging. To spare you the pain, we've provided fast implementations of the forward and backward passes for convolution and pooling layers in the file `cs231n/fast_layers.py`.\n", + "\n", + "The fast convolution implementation depends on a Cython extension; to compile it you need to run the following from the `cs231n` directory:\n", + "\n", + "```bash\n", + "python setup.py build_ext --inplace\n", + "```\n", + "\n", + "The API for the fast versions of the convolution and pooling layers is exactly the same as the naive versions that you implemented above: the forward pass receives data, weights, and parameters and produces outputs and a cache object; the backward pass receives upstream derivatives and the cache object and produces gradients with respect to the data and weights.\n", + "\n", + "**NOTE:** The fast implementation for pooling will only perform optimally if the pooling regions are non-overlapping and tile the input.
If these conditions are not met then the fast pooling implementation will not be much faster than the naive implementation.\n", + "\n", + "You can compare the performance of the naive and fast versions of these layers by running the following:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.fast_layers import conv_forward_fast, conv_backward_fast\n", + "from time import time\n", + "\n", + "x = np.random.randn(100, 3, 31, 31)\n", + "w = np.random.randn(25, 3, 3, 3)\n", + "b = np.random.randn(25,)\n", + "dout = np.random.randn(100, 25, 16, 16)\n", + "conv_param = {'stride': 2, 'pad': 1}\n", + "\n", + "t0 = time()\n", + "out_naive, cache_naive = conv_forward_naive(x, w, b, conv_param)\n", + "t1 = time()\n", + "out_fast, cache_fast = conv_forward_fast(x, w, b, conv_param)\n", + "t2 = time()\n", + "\n", + "print 'Testing conv_forward_fast:'\n", + "print 'Naive: %fs' % (t1 - t0)\n", + "print 'Fast: %fs' % (t2 - t1)\n", + "print 'Speedup: %fx' % ((t1 - t0) / (t2 - t1))\n", + "print 'Difference: ', rel_error(out_naive, out_fast)\n", + "\n", + "t0 = time()\n", + "dx_naive, dw_naive, db_naive = conv_backward_naive(dout, cache_naive)\n", + "t1 = time()\n", + "dx_fast, dw_fast, db_fast = conv_backward_fast(dout, cache_fast)\n", + "t2 = time()\n", + "\n", + "print '\\nTesting conv_backward_fast:'\n", + "print 'Naive: %fs' % (t1 - t0)\n", + "print 'Fast: %fs' % (t2 - t1)\n", + "print 'Speedup: %fx' % ((t1 - t0) / (t2 - t1))\n", + "print 'dx difference: ', rel_error(dx_naive, dx_fast)\n", + "print 'dw difference: ', rel_error(dw_naive, dw_fast)\n", + "print 'db difference: ', rel_error(db_naive, db_fast)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.fast_layers import max_pool_forward_fast, max_pool_backward_fast\n", + "\n", + "x = np.random.randn(100, 3, 32, 32)\n", + "dout = np.random.randn(100, 3, 16, 16)\n", + "pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}\n", + "\n", + "t0 = time()\n", + "out_naive, cache_naive = max_pool_forward_naive(x, pool_param)\n", + "t1 = time()\n", + "out_fast, cache_fast = max_pool_forward_fast(x, pool_param)\n", + "t2 = time()\n", + "\n", + "print 'Testing pool_forward_fast:'\n", + "print 'Naive: %fs' % (t1 - t0)\n", + "print 'fast: %fs' % (t2 - t1)\n", + "print 'speedup: %fx' % ((t1 - t0) / (t2 - t1))\n", + "print 'difference: ', rel_error(out_naive, out_fast)\n", + "\n", + "t0 = time()\n", + "dx_naive = max_pool_backward_naive(dout, cache_naive)\n", + "t1 = time()\n", + "dx_fast = max_pool_backward_fast(dout, cache_fast)\n", + "t2 = time()\n", + "\n", + "print '\\nTesting pool_backward_fast:'\n", + "print 'Naive: %fs' % (t1 - t0)\n", + "print 'speedup: %fx' % ((t1 - t0) / (t2 - t1))\n", + "print 'dx difference: ', rel_error(dx_naive, dx_fast)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Convolutional \"sandwich\" layers\n", + "Previously we introduced the concept of \"sandwich\" layers that combine multiple operations into commonly used patterns. In the file `cs231n/layer_utils.py` you will find sandwich layers that implement a few commonly used patterns for convolutional networks." 
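The pattern is plain function composition with the caches threaded through. A sketch of how a conv-relu sandwich might be assembled (mirroring the `conv_relu_forward`/`conv_relu_backward` pair exercised below, and assuming the `relu_forward`/`relu_backward` helpers from `cs231n/layers.py`):

```python
def conv_relu_forward_sketch(x, w, b, conv_param):
    # Compose the two forward passes and keep both caches for backprop.
    a, conv_cache = conv_forward_fast(x, w, b, conv_param)
    out, relu_cache = relu_forward(a)
    return out, (conv_cache, relu_cache)

def conv_relu_backward_sketch(dout, cache):
    # Unpack the caches and backprop through the layers in reverse order.
    conv_cache, relu_cache = cache
    da = relu_backward(dout, relu_cache)
    return conv_backward_fast(da, conv_cache)  # (dx, dw, db)
```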
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.layer_utils import conv_relu_pool_forward, conv_relu_pool_backward\n", + "\n", + "x = np.random.randn(2, 3, 16, 16)\n", + "w = np.random.randn(3, 3, 3, 3)\n", + "b = np.random.randn(3,)\n", + "dout = np.random.randn(2, 3, 8, 8)\n", + "conv_param = {'stride': 1, 'pad': 1}\n", + "pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}\n", + "\n", + "out, cache = conv_relu_pool_forward(x, w, b, conv_param, pool_param)\n", + "dx, dw, db = conv_relu_pool_backward(dout, cache)\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: conv_relu_pool_forward(x, w, b, conv_param, pool_param)[0], x, dout)\n", + "dw_num = eval_numerical_gradient_array(lambda w: conv_relu_pool_forward(x, w, b, conv_param, pool_param)[0], w, dout)\n", + "db_num = eval_numerical_gradient_array(lambda b: conv_relu_pool_forward(x, w, b, conv_param, pool_param)[0], b, dout)\n", + "\n", + "print 'Testing conv_relu_pool'\n", + "print 'dx error: ', rel_error(dx_num, dx)\n", + "print 'dw error: ', rel_error(dw_num, dw)\n", + "print 'db error: ', rel_error(db_num, db)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.layer_utils import conv_relu_forward, conv_relu_backward\n", + "\n", + "x = np.random.randn(2, 3, 8, 8)\n", + "w = np.random.randn(3, 3, 3, 3)\n", + "b = np.random.randn(3,)\n", + "dout = np.random.randn(2, 3, 8, 8)\n", + "conv_param = {'stride': 1, 'pad': 1}\n", + "\n", + "out, cache = conv_relu_forward(x, w, b, conv_param)\n", + "dx, dw, db = conv_relu_backward(dout, cache)\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: conv_relu_forward(x, w, b, conv_param)[0], x, dout)\n", + "dw_num = eval_numerical_gradient_array(lambda w: conv_relu_forward(x, w, b, conv_param)[0], w, dout)\n", + "db_num = eval_numerical_gradient_array(lambda b: conv_relu_forward(x, w, b, conv_param)[0], b, dout)\n", + "\n", + "print 'Testing conv_relu:'\n", + "print 'dx error: ', rel_error(dx_num, dx)\n", + "print 'dw error: ', rel_error(dw_num, dw)\n", + "print 'db error: ', rel_error(db_num, db)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Three-layer ConvNet\n", + "Now that you have implemented all the necessary layers, we can put them together into a simple convolutional network.\n", + "\n", + "Open the file `cs231n/cnn.py` and complete the implementation of the `ThreeLayerConvNet` class. Run the following cells to help you debug:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "## Sanity check loss\n", + "After you build a new network, one of the first things you should do is sanity check the loss. When we use the softmax loss, we expect the loss for random weights (and no regularization) to be about `log(C)` for `C` classes. When we add regularization this should go up." 
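With `C = 10` CIFAR-10 classes and random weights, the softmax loss should come out near `-log(1/C)`; a one-liner gives the target number:

```python
import numpy as np
print 'Expected initial softmax loss: ', np.log(10.0)  # about 2.3026
```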
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "model = ThreeLayerConvNet()\n", + "\n", + "N = 50\n", + "X = np.random.randn(N, 3, 32, 32)\n", + "y = np.random.randint(10, size=N)\n", + "\n", + "loss, grads = model.loss(X, y)\n", + "print 'Initial loss (no regularization): ', loss\n", + "\n", + "model.reg = 0.5\n", + "loss, grads = model.loss(X, y)\n", + "print 'Initial loss (with regularization): ', loss" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "## Gradient check\n", + "After the loss looks reasonable, use numeric gradient checking to make sure that your backward pass is correct. When you use numeric gradient checking you should use a small amount of artificial data and a small number of neurons at each layer." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "num_inputs = 2\n", + "input_dim = (3, 16, 16)\n", + "reg = 0.0\n", + "num_classes = 10\n", + "X = np.random.randn(num_inputs, *input_dim)\n", + "y = np.random.randint(num_classes, size=num_inputs)\n", + "\n", + "model = ThreeLayerConvNet(num_filters=3, filter_size=3,\n", + " input_dim=input_dim, hidden_dim=7,\n", + " dtype=np.float64)\n", + "loss, grads = model.loss(X, y)\n", + "for param_name in sorted(grads):\n", + " f = lambda _: model.loss(X, y)[0]\n", + " param_grad_num = eval_numerical_gradient(f, model.params[param_name], verbose=False, h=1e-6)\n", + " e = rel_error(param_grad_num, grads[param_name])\n", + " print '%s max relative error: %e' % (param_name, e)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "## Overfit small data\n", + "A nice trick is to train your model with just a few training samples. You should be able to overfit small datasets, which will result in very high training accuracy and comparatively low validation accuracy."
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "num_train = 100\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "model = ThreeLayerConvNet(weight_scale=1e-2)\n", + "\n", + "solver = Solver(model, small_data,\n", + " num_epochs=10, batch_size=50,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 1e-3,\n", + " },\n", + " verbose=True, print_every=1)\n", + "solver.train()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "Plotting the loss, training accuracy, and validation accuracy should show clear overfitting:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "plt.subplot(2, 1, 1)\n", + "plt.plot(solver.loss_history, 'o')\n", + "plt.xlabel('iteration')\n", + "plt.ylabel('loss')\n", + "\n", + "plt.subplot(2, 1, 2)\n", + "plt.plot(solver.train_acc_history, '-o')\n", + "plt.plot(solver.val_acc_history, '-o')\n", + "plt.legend(['train', 'val'], loc='upper left')\n", + "plt.xlabel('epoch')\n", + "plt.ylabel('accuracy')\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "## Train the net\n", + "By training the three-layer convolutional network for one epoch, you should achieve greater than 40% accuracy on the training set:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "model = ThreeLayerConvNet(weight_scale=0.001, hidden_dim=500, reg=0.001)\n", + "\n", + "solver = Solver(model, data,\n", + " num_epochs=1, batch_size=50,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 1e-3,\n", + " },\n", + " verbose=True, print_every=20)\n", + "solver.train()" + ], + "outputs": [], + "metadata": { + "scrolled": false, + "collapsed": false + } + }, + { + "source": [ + "## Visualize Filters\n", + "You can visualize the first-layer convolutional filters from the trained network by running the following:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.vis_utils import visualize_grid\n", + "\n", + "grid = visualize_grid(model.params['W1'].transpose(0, 2, 3, 1))\n", + "plt.imshow(grid.astype('uint8'))\n", + "plt.axis('off')\n", + "plt.gcf().set_size_inches(5, 5)\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Spatial Batch Normalization\n", + "We already saw that batch normalization is a very useful technique for training deep fully-connected networks. Batch normalization can also be used for convolutional networks, but we need to tweak it a bit; the modification will be called \"spatial batch normalization.\"\n", + "\n", + "Normally batch-normalization accepts inputs of shape `(N, D)` and produces outputs of shape `(N, D)`, where we normalize across the minibatch dimension `N`. 
For data coming from convolutional layers, batch normalization needs to accept inputs of shape `(N, C, H, W)` and produce outputs of shape `(N, C, H, W)` where the `N` dimension gives the minibatch size and the `(H, W)` dimensions give the spatial size of the feature map.\n",
+    "\n",
+    "If the feature map was produced using convolutions, then we expect the statistics of each feature channel to be relatively consistent both between different images and between different locations within the same image. Therefore spatial batch normalization computes a mean and variance for each of the `C` feature channels by computing statistics over both the minibatch dimension `N` and the spatial dimensions `H` and `W`."
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "source": [
+    "## Spatial batch normalization: forward\n",
+    "\n",
+    "In the file `cs231n/layers.py`, implement the forward pass for spatial batch normalization in the function `spatial_batchnorm_forward`. Check your implementation by running the following:"
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "execution_count": null,
+   "cell_type": "code",
+   "source": [
+    "# Check the training-time forward pass by checking means and variances\n",
+    "# of features both before and after spatial batch normalization\n",
+    "\n",
+    "N, C, H, W = 2, 3, 4, 5\n",
+    "x = 4 * np.random.randn(N, C, H, W) + 10\n",
+    "\n",
+    "print 'Before spatial batch normalization:'\n",
+    "print '  Shape: ', x.shape\n",
+    "print '  Means: ', x.mean(axis=(0, 2, 3))\n",
+    "print '  Stds: ', x.std(axis=(0, 2, 3))\n",
+    "\n",
+    "# Means should be close to zero and stds close to one\n",
+    "gamma, beta = np.ones(C), np.zeros(C)\n",
+    "bn_param = {'mode': 'train'}\n",
+    "out, _ = spatial_batchnorm_forward(x, gamma, beta, bn_param)\n",
+    "print 'After spatial batch normalization:'\n",
+    "print '  Shape: ', out.shape\n",
+    "print '  Means: ', out.mean(axis=(0, 2, 3))\n",
+    "print '  Stds: ', out.std(axis=(0, 2, 3))\n",
+    "\n",
+    "# Means should be close to beta and stds close to gamma\n",
+    "gamma, beta = np.asarray([3, 4, 5]), np.asarray([6, 7, 8])\n",
+    "out, _ = spatial_batchnorm_forward(x, gamma, beta, bn_param)\n",
+    "print 'After spatial batch normalization (nontrivial gamma, beta):'\n",
+    "print '  Shape: ', out.shape\n",
+    "print '  Means: ', out.mean(axis=(0, 2, 3))\n",
+    "print '  Stds: ', out.std(axis=(0, 2, 3))"
+   ],
+   "outputs": [],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "execution_count": null,
+   "cell_type": "code",
+   "source": [
+    "# Check the test-time forward pass by running the training-time\n",
+    "# forward pass many times to warm up the running averages, and then\n",
+    "# checking the means and variances of activations after a test-time\n",
+    "# forward pass.\n",
+    "\n",
+    "N, C, H, W = 10, 4, 11, 12\n",
+    "\n",
+    "bn_param = {'mode': 'train'}\n",
+    "gamma = np.ones(C)\n",
+    "beta = np.zeros(C)\n",
+    "for t in xrange(50):\n",
+    "  x = 2.3 * np.random.randn(N, C, H, W) + 13\n",
+    "  spatial_batchnorm_forward(x, gamma, beta, bn_param)\n",
+    "bn_param['mode'] = 'test'\n",
+    "x = 2.3 * np.random.randn(N, C, H, W) + 13\n",
+    "a_norm, _ = spatial_batchnorm_forward(x, gamma, beta, bn_param)\n",
+    "\n",
+    "# Means should be close to zero and stds close to one, but will be\n",
+    "# noisier than training-time forward passes.\n",
+    "print 'After spatial batch normalization (test-time):'\n",
+    "print '  means: ', a_norm.mean(axis=(0, 2, 3))\n",
+    "print '  stds: ', a_norm.std(axis=(0, 2, 3))"
+   ],
+   "outputs": [],
+   "metadata": {
+    "collapsed": false
+   }
+  },
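+  {
+   "source": [
+    "As a reference, one common way to implement the spatial batchnorm passes is to reuse the vanilla `batchnorm_forward` and `batchnorm_backward` you wrote earlier by reshaping, so that every (image, row, column) position becomes one row of an `(N*H*W, C)` matrix. This is only a sketch under that assumption, not a required implementation:\n",
+    "\n",
+    "```python\n",
+    "def spatial_batchnorm_forward_sketch(x, gamma, beta, bn_param):\n",
+    "  N, C, H, W = x.shape\n",
+    "  # Channels last, then flatten so each channel is one feature column\n",
+    "  x_flat = x.transpose(0, 2, 3, 1).reshape(-1, C)\n",
+    "  out_flat, cache = batchnorm_forward(x_flat, gamma, beta, bn_param)\n",
+    "  out = out_flat.reshape(N, H, W, C).transpose(0, 3, 1, 2)\n",
+    "  return out, cache\n",
+    "\n",
+    "def spatial_batchnorm_backward_sketch(dout, cache):\n",
+    "  N, C, H, W = dout.shape\n",
+    "  dout_flat = dout.transpose(0, 2, 3, 1).reshape(-1, C)\n",
+    "  dx_flat, dgamma, dbeta = batchnorm_backward(dout_flat, cache)\n",
+    "  dx = dx_flat.reshape(N, H, W, C).transpose(0, 3, 1, 2)\n",
+    "  return dx, dgamma, dbeta\n",
+    "```"
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },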
+  {
+   "source": [
+    "## Spatial batch normalization: backward\n",
+    "In the file `cs231n/layers.py`, implement the backward pass for spatial batch normalization in the function `spatial_batchnorm_backward`. Run the following to check your implementation using a numeric gradient check:"
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "execution_count": null,
+   "cell_type": "code",
+   "source": [
+    "N, C, H, W = 2, 3, 4, 5\n",
+    "x = 5 * np.random.randn(N, C, H, W) + 12\n",
+    "gamma = np.random.randn(C)\n",
+    "beta = np.random.randn(C)\n",
+    "dout = np.random.randn(N, C, H, W)\n",
+    "\n",
+    "bn_param = {'mode': 'train'}\n",
+    "fx = lambda x: spatial_batchnorm_forward(x, gamma, beta, bn_param)[0]\n",
+    "fg = lambda a: spatial_batchnorm_forward(x, gamma, beta, bn_param)[0]\n",
+    "fb = lambda b: spatial_batchnorm_forward(x, gamma, beta, bn_param)[0]\n",
+    "\n",
+    "dx_num = eval_numerical_gradient_array(fx, x, dout)\n",
+    "da_num = eval_numerical_gradient_array(fg, gamma, dout)\n",
+    "db_num = eval_numerical_gradient_array(fb, beta, dout)\n",
+    "\n",
+    "_, cache = spatial_batchnorm_forward(x, gamma, beta, bn_param)\n",
+    "dx, dgamma, dbeta = spatial_batchnorm_backward(dout, cache)\n",
+    "print 'dx error: ', rel_error(dx_num, dx)\n",
+    "print 'dgamma error: ', rel_error(da_num, dgamma)\n",
+    "print 'dbeta error: ', rel_error(db_num, dbeta)"
+   ],
+   "outputs": [],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "source": [
+    "# Experiment!\n",
+    "Experiment and try to get the best performance that you can on CIFAR-10 using a ConvNet. Here are some ideas to get you started:\n",
+    "\n",
+    "### Things you should try:\n",
+    "- Filter size: Above we used 7x7; this makes pretty pictures but smaller filters may be more efficient\n",
+    "- Number of filters: Above we used 32 filters. Do more or fewer filters do better?\n",
+    "- Batch normalization: Try adding spatial batch normalization after convolution layers and vanilla batch normalization after affine layers. Do your networks train faster?\n",
+    "- Network architecture: The network above has two layers of trainable parameters. Can you do better with a deeper network? You can implement alternative architectures in the file `cs231n/classifiers/convnet.py`. Some good architectures to try include:\n",
+    "  - [conv-relu-pool]xN - conv - relu - [affine]xM - [softmax or SVM]\n",
+    "  - [conv-relu-pool]xN - [affine]xM - [softmax or SVM]\n",
+    "  - [conv-relu-conv-relu-pool]xN - [affine]xM - [softmax or SVM]\n",
+    "\n",
+    "### Tips for training\n",
+    "For each network architecture that you try, you should tune the learning rate and regularization strength. When doing this there are a couple of important things to keep in mind:\n",
+    "\n",
+    "- If the parameters are working well, you should see improvement within a few hundred iterations\n",
+    "- Remember the coarse-to-fine approach for hyperparameter tuning: start by testing a large range of hyperparameters for just a few training iterations to find the combinations of parameters that are working at all.\n",
+    "- Once you have found some sets of parameters that seem to work, search more finely around these parameters. You may need to train for more epochs.\n",
+    "\n",
+    "### Going above and beyond\n",
+    "If you are feeling adventurous there are many other features you can implement to try to improve your performance. 
You are **not required** to implement any of these; however they would be good things to try for extra credit.\n", + "\n", + "- Alternative update steps: For the assignment we implemented SGD+momentum, RMSprop, and Adam; you could try alternatives like AdaGrad or AdaDelta.\n", + "- Alternative activation functions such as leaky ReLU, parametric ReLU, or MaxOut.\n", + "- Model ensembles\n", + "- Data augmentation\n", + "\n", + "If you do decide to implement something extra, clearly describe it in the \"Extra Credit Description\" cell below.\n", + "\n", + "### What we expect\n", + "At the very least, you should be able to train a ConvNet that gets at least 65% accuracy on the validation set. This is just a lower bound - if you are careful it should be possible to get accuracies much higher than that! Extra credit points will be awarded for particularly high-scoring models or unique approaches.\n", + "\n", + "You should use the space below to experiment and train your network. The final cell in this notebook should contain the training, validation, and test set accuracies for your final trained network. In this notebook you should also write an explanation of what you did, any additional features that you implemented, and any visualizations or graphs that you make in the process of training and evaluating your network.\n", + "\n", + "Have fun and happy training!" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Train a really good model on CIFAR-10" + ], + "outputs": [], + "metadata": { + "collapsed": true + } + }, + { + "source": [ + "# Extra Credit Description\n", + "If you implement any additional features for extra credit, clearly describe them here with pointers to any code in this or other files if applicable." + ], + "cell_type": "markdown", + "metadata": {} + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "name": "python2", + "language": "python" + }, + "language_info": { + "mimetype": "text/x-python", + "nbconvert_exporter": "python", + "name": "python", + "file_extension": ".py", + "version": "2.7.6", + "pygments_lexer": "ipython2", + "codemirror_mode": { + "version": 2, + "name": "ipython" + } + } + } +} \ No newline at end of file diff --git a/assignments2016/assignment2/Dropout.ipynb b/assignments2016/assignment2/Dropout.ipynb new file mode 100644 index 00000000..98050908 --- /dev/null +++ b/assignments2016/assignment2/Dropout.ipynb @@ -0,0 +1,275 @@ +{ + "nbformat_minor": 0, + "nbformat": 4, + "cells": [ + { + "source": [ + "# Dropout\n", + "Dropout [1] is a technique for regularizing neural networks by randomly setting some features to zero during the forward pass. In this exercise you will implement a dropout layer and modify your fully-connected network to optionally use dropout.\n", + "\n", + "[1] Geoffrey E. 
Hinton et al, \"Improving neural networks by preventing co-adaptation of feature detectors\", arXiv 2012" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# As usual, a bit of setup\n", + "\n", + "import time\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from cs231n.classifiers.fc_net import *\n", + "from cs231n.data_utils import get_CIFAR10_data\n", + "from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array\n", + "from cs231n.solver import Solver\n", + "\n", + "%matplotlib inline\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['image.interpolation'] = 'nearest'\n", + "plt.rcParams['image.cmap'] = 'gray'\n", + "\n", + "# for auto-reloading external modules\n", + "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "%load_ext autoreload\n", + "%autoreload 2\n", + "\n", + "def rel_error(x, y):\n", + " \"\"\" returns relative error \"\"\"\n", + " return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Load the (preprocessed) CIFAR10 data.\n", + "\n", + "data = get_CIFAR10_data()\n", + "for k, v in data.iteritems():\n", + " print '%s: ' % k, v.shape" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Dropout forward pass\n", + "In the file `cs231n/layers.py`, implement the forward pass for dropout. Since dropout behaves differently during training and testing, make sure to implement the operation for both modes.\n", + "\n", + "Once you have done so, run the cell below to test your implementation." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "x = np.random.randn(500, 500) + 10\n", + "\n", + "for p in [0.3, 0.6, 0.75]:\n", + " out, _ = dropout_forward(x, {'mode': 'train', 'p': p})\n", + " out_test, _ = dropout_forward(x, {'mode': 'test', 'p': p})\n", + "\n", + " print 'Running tests with p = ', p\n", + " print 'Mean of input: ', x.mean()\n", + " print 'Mean of train-time output: ', out.mean()\n", + " print 'Mean of test-time output: ', out_test.mean()\n", + " print 'Fraction of train-time output set to zero: ', (out == 0).mean()\n", + " print 'Fraction of test-time output set to zero: ', (out_test == 0).mean()\n", + " print" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Dropout backward pass\n", + "In the file `cs231n/layers.py`, implement the backward pass for dropout. After doing so, run the following cell to numerically gradient-check your implementation." 
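+    ,
+    "\n",
+    "\n",
+    "For intuition, the *inverted dropout* convention from the course notes makes the backward pass a single multiply by the same mask used in the forward pass. A sketch, assuming `p` is the probability of *keeping* a unit (check this against the convention in your `layers.py`):\n",
+    "\n",
+    "```python\n",
+    "mask = (np.random.rand(*x.shape) < p) / p  # train time: drop and rescale\n",
+    "out = x * mask                             # train-time forward\n",
+    "# test-time forward is simply out = x\n",
+    "dx = dout * mask                           # backward through the mask\n",
+    "```"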
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "execution_count": null,
+   "cell_type": "code",
+   "source": [
+    "x = np.random.randn(10, 10) + 10\n",
+    "dout = np.random.randn(*x.shape)\n",
+    "\n",
+    "dropout_param = {'mode': 'train', 'p': 0.8, 'seed': 123}\n",
+    "out, cache = dropout_forward(x, dropout_param)\n",
+    "dx = dropout_backward(dout, cache)\n",
+    "dx_num = eval_numerical_gradient_array(lambda xx: dropout_forward(xx, dropout_param)[0], x, dout)\n",
+    "\n",
+    "print 'dx relative error: ', rel_error(dx, dx_num)"
+   ],
+   "outputs": [],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "source": [
+    "# Fully-connected nets with Dropout\n",
+    "In the file `cs231n/classifiers/fc_net.py`, modify your implementation to use dropout. Specifically, if the constructor of the net receives a nonzero value for the `dropout` parameter, then the net should add dropout immediately after every ReLU nonlinearity. After doing so, run the following to numerically gradient-check your implementation."
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "execution_count": null,
+   "cell_type": "code",
+   "source": [
+    "N, D, H1, H2, C = 2, 15, 20, 30, 10\n",
+    "X = np.random.randn(N, D)\n",
+    "y = np.random.randint(C, size=(N,))\n",
+    "\n",
+    "for dropout in [0, 0.25, 0.5]:\n",
+    "  print 'Running check with dropout = ', dropout\n",
+    "  model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,\n",
+    "                            weight_scale=5e-2, dtype=np.float64,\n",
+    "                            dropout=dropout, seed=123)\n",
+    "\n",
+    "  loss, grads = model.loss(X, y)\n",
+    "  print 'Initial loss: ', loss\n",
+    "\n",
+    "  for name in sorted(grads):\n",
+    "    f = lambda _: model.loss(X, y)[0]\n",
+    "    grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)\n",
+    "    print '%s relative error: %.2e' % (name, rel_error(grad_num, grads[name]))\n",
+    "  print"
+   ],
+   "outputs": [],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "source": [
+    "# Regularization experiment\n",
+    "As an experiment, we will train a pair of two-layer networks on 500 training examples: one will use no dropout, and one will use a dropout probability of 0.75. We will then visualize the training and validation accuracies of the two networks over time."
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Train two identical nets, one with dropout and one without\n", + "\n", + "num_train = 500\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "solvers = {}\n", + "dropout_choices = [0, 0.75]\n", + "for dropout in dropout_choices:\n", + " model = FullyConnectedNet([500], dropout=dropout)\n", + " print dropout\n", + "\n", + " solver = Solver(model, small_data,\n", + " num_epochs=25, batch_size=100,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 5e-4,\n", + " },\n", + " verbose=True, print_every=100)\n", + " solver.train()\n", + " solvers[dropout] = solver" + ], + "outputs": [], + "metadata": { + "scrolled": false, + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Plot train and validation accuracies of the two models\n", + "\n", + "train_accs = []\n", + "val_accs = []\n", + "for dropout in dropout_choices:\n", + " solver = solvers[dropout]\n", + " train_accs.append(solver.train_acc_history[-1])\n", + " val_accs.append(solver.val_acc_history[-1])\n", + "\n", + "plt.subplot(3, 1, 1)\n", + "for dropout in dropout_choices:\n", + " plt.plot(solvers[dropout].train_acc_history, 'o', label='%.2f dropout' % dropout)\n", + "plt.title('Train accuracy')\n", + "plt.xlabel('Epoch')\n", + "plt.ylabel('Accuracy')\n", + "plt.legend(ncol=2, loc='lower right')\n", + " \n", + "plt.subplot(3, 1, 2)\n", + "for dropout in dropout_choices:\n", + " plt.plot(solvers[dropout].val_acc_history, 'o', label='%.2f dropout' % dropout)\n", + "plt.title('Val accuracy')\n", + "plt.xlabel('Epoch')\n", + "plt.ylabel('Accuracy')\n", + "plt.legend(ncol=2, loc='lower right')\n", + "\n", + "plt.gcf().set_size_inches(15, 15)\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Question\n", + "Explain what you see in this experiment. What does it suggest about dropout?" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "# Answer\n" + ], + "cell_type": "markdown", + "metadata": {} + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "name": "python2", + "language": "python" + }, + "language_info": { + "mimetype": "text/x-python", + "nbconvert_exporter": "python", + "name": "python", + "file_extension": ".py", + "version": "2.7.6", + "pygments_lexer": "ipython2", + "codemirror_mode": { + "version": 2, + "name": "ipython" + } + } + } +} \ No newline at end of file diff --git a/assignments2016/assignment2/FullyConnectedNets.ipynb b/assignments2016/assignment2/FullyConnectedNets.ipynb new file mode 100644 index 00000000..bf7cefdd --- /dev/null +++ b/assignments2016/assignment2/FullyConnectedNets.ipynb @@ -0,0 +1,941 @@ +{ + "nbformat_minor": 0, + "nbformat": 4, + "cells": [ + { + "source": [ + "# Fully-Connected Neural Nets\n", + "In the previous homework you implemented a fully-connected two-layer neural network on CIFAR-10. The implementation was simple but not very modular since the loss and gradient were computed in a single monolithic function. This is manageable for a simple two-layer network, but would become impractical as we move to bigger models. 
Ideally we want to build networks using a more modular design so that we can implement different layer types in isolation and then snap them together into models with different architectures.\n",
+    "\n",
+    "In this exercise we will implement fully-connected networks using a more modular approach. For each layer we will implement a `forward` and a `backward` function. The `forward` function will receive inputs, weights, and other parameters and will return both an output and a `cache` object storing data needed for the backward pass, like this:\n",
+    "\n",
+    "```python\n",
+    "def layer_forward(x, w):\n",
+    "  \"\"\" Receive inputs x and weights w \"\"\"\n",
+    "  # Do some computations ...\n",
+    "  z = # ... some intermediate value\n",
+    "  # Do some more computations ...\n",
+    "  out = # the output\n",
+    "  \n",
+    "  cache = (x, w, z, out) # Values we need to compute gradients\n",
+    "  \n",
+    "  return out, cache\n",
+    "```\n",
+    "\n",
+    "The backward pass will receive upstream derivatives and the `cache` object, and will return gradients with respect to the inputs and weights, like this:\n",
+    "\n",
+    "```python\n",
+    "def layer_backward(dout, cache):\n",
+    "  \"\"\"\n",
+    "  Receive derivative of loss with respect to outputs and cache,\n",
+    "  and compute derivative with respect to inputs.\n",
+    "  \"\"\"\n",
+    "  # Unpack cache values\n",
+    "  x, w, z, out = cache\n",
+    "  \n",
+    "  # Use values in cache to compute derivatives\n",
+    "  dx = # Derivative of loss with respect to x\n",
+    "  dw = # Derivative of loss with respect to w\n",
+    "  \n",
+    "  return dx, dw\n",
+    "```\n",
+    "\n",
+    "After implementing a bunch of layers this way, we will be able to easily combine them to build classifiers with different architectures.\n",
+    "\n",
+    "In addition to implementing fully-connected networks of arbitrary depth, we will also explore different update rules for optimization, and introduce Dropout as a regularizer and Batch Normalization as a tool to more efficiently optimize deep networks.\n",
+    "  "
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "execution_count": null,
+   "cell_type": "code",
+   "source": [
+    "# As usual, a bit of setup\n",
+    "\n",
+    "import time\n",
+    "import numpy as np\n",
+    "import matplotlib.pyplot as plt\n",
+    "from cs231n.classifiers.fc_net import *\n",
+    "from cs231n.data_utils import get_CIFAR10_data\n",
+    "from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array\n",
+    "from cs231n.solver import Solver\n",
+    "\n",
+    "%matplotlib inline\n",
+    "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n",
+    "plt.rcParams['image.interpolation'] = 'nearest'\n",
+    "plt.rcParams['image.cmap'] = 'gray'\n",
+    "\n",
+    "# for auto-reloading external modules\n",
+    "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n",
+    "%load_ext autoreload\n",
+    "%autoreload 2\n",
+    "\n",
+    "def rel_error(x, y):\n",
+    "  \"\"\" returns relative error \"\"\"\n",
+    "  return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))"
+   ],
+   "outputs": [],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "execution_count": null,
+   "cell_type": "code",
+   "source": [
+    "# Load the (preprocessed) CIFAR10 data.\n",
+    "\n",
+    "data = get_CIFAR10_data()\n",
+    "for k, v in data.iteritems():\n",
+    "  print '%s: ' % k, v.shape"
+   ],
+   "outputs": [],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "source": [
+    "# Affine layer: forward\n",
+    "Open the file `cs231n/layers.py` and implement the `affine_forward` function.\n",
+    "\n",
+    "Once you are done you can test your implementation by running the following:"
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "execution_count": null,
+   "cell_type": "code",
+   "source": [
+    "# Test the affine_forward function\n",
+    "\n",
+    "num_inputs = 2\n",
+    "input_shape = (4, 5, 6)\n",
+    "output_dim = 3\n",
+    "\n",
+    "input_size = num_inputs * np.prod(input_shape)\n",
+    "weight_size = output_dim * np.prod(input_shape)\n",
+    "\n",
+    "x = np.linspace(-0.1, 0.5, num=input_size).reshape(num_inputs, *input_shape)\n",
+    "w = np.linspace(-0.2, 0.3, num=weight_size).reshape(np.prod(input_shape), output_dim)\n",
+    "b = np.linspace(-0.3, 0.1, num=output_dim)\n",
+    "\n",
+    "out, _ = affine_forward(x, w, b)\n",
+    "correct_out = np.array([[ 1.49834967,  1.70660132,  1.91485297],\n",
+    "                        [ 3.25553199,  3.5141327,   3.77273342]])\n",
+    "\n",
+    "# Compare your output with ours. The error should be around 1e-9.\n",
+    "print 'Testing affine_forward function:'\n",
+    "print 'difference: ', rel_error(out, correct_out)"
+   ],
+   "outputs": [],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "source": [
+    "# Affine layer: backward\n",
+    "Now implement the `affine_backward` function and test your implementation using numeric gradient checking."
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "execution_count": null,
+   "cell_type": "code",
+   "source": [
+    "# Test the affine_backward function\n",
+    "\n",
+    "x = np.random.randn(10, 2, 3)\n",
+    "w = np.random.randn(6, 5)\n",
+    "b = np.random.randn(5)\n",
+    "dout = np.random.randn(10, 5)\n",
+    "\n",
+    "dx_num = eval_numerical_gradient_array(lambda x: affine_forward(x, w, b)[0], x, dout)\n",
+    "dw_num = eval_numerical_gradient_array(lambda w: affine_forward(x, w, b)[0], w, dout)\n",
+    "db_num = eval_numerical_gradient_array(lambda b: affine_forward(x, w, b)[0], b, dout)\n",
+    "\n",
+    "_, cache = affine_forward(x, w, b)\n",
+    "dx, dw, db = affine_backward(dout, cache)\n",
+    "\n",
+    "# The error should be around 1e-10\n",
+    "print 'Testing affine_backward function:'\n",
+    "print 'dx error: ', rel_error(dx_num, dx)\n",
+    "print 'dw error: ', rel_error(dw_num, dw)\n",
+    "print 'db error: ', rel_error(db_num, db)"
+   ],
+   "outputs": [],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "source": [
+    "# ReLU layer: forward\n",
+    "Implement the forward pass for the ReLU activation function in the `relu_forward` function and test your implementation using the following:"
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "execution_count": null,
+   "cell_type": "code",
+   "source": [
+    "# Test the relu_forward function\n",
+    "\n",
+    "x = np.linspace(-0.5, 0.5, num=12).reshape(3, 4)\n",
+    "\n",
+    "out, _ = relu_forward(x)\n",
+    "correct_out = np.array([[ 0.,          0.,          0.,          0.,        ],\n",
+    "                        [ 0.,          0.,          0.04545455,  0.13636364,],\n",
+    "                        [ 0.22727273,  0.31818182,  0.40909091,  0.5,       ]])\n",
+    "\n",
+    "# Compare your output with ours. 
The error should be around 1e-8\n", + "print 'Testing relu_forward function:'\n", + "print 'difference: ', rel_error(out, correct_out)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# ReLU layer: backward\n", + "Now implement the backward pass for the ReLU activation function in the `relu_backward` function and test your implementation using numeric gradient checking:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "x = np.random.randn(10, 10)\n", + "dout = np.random.randn(*x.shape)\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: relu_forward(x)[0], x, dout)\n", + "\n", + "_, cache = relu_forward(x)\n", + "dx = relu_backward(dout, cache)\n", + "\n", + "# The error should be around 1e-12\n", + "print 'Testing relu_backward function:'\n", + "print 'dx error: ', rel_error(dx_num, dx)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# \"Sandwich\" layers\n", + "There are some common patterns of layers that are frequently used in neural nets. For example, affine layers are frequently followed by a ReLU nonlinearity. To make these common patterns easy, we define several convenience layers in the file `cs231n/layer_utils.py`.\n", + "\n", + "For now take a look at the `affine_relu_forward` and `affine_relu_backward` functions, and run the following to numerically gradient check the backward pass:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.layer_utils import affine_relu_forward, affine_relu_backward\n", + "\n", + "x = np.random.randn(2, 3, 4)\n", + "w = np.random.randn(12, 10)\n", + "b = np.random.randn(10)\n", + "dout = np.random.randn(2, 10)\n", + "\n", + "out, cache = affine_relu_forward(x, w, b)\n", + "dx, dw, db = affine_relu_backward(dout, cache)\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: affine_relu_forward(x, w, b)[0], x, dout)\n", + "dw_num = eval_numerical_gradient_array(lambda w: affine_relu_forward(x, w, b)[0], w, dout)\n", + "db_num = eval_numerical_gradient_array(lambda b: affine_relu_forward(x, w, b)[0], b, dout)\n", + "\n", + "print 'Testing affine_relu_forward:'\n", + "print 'dx error: ', rel_error(dx_num, dx)\n", + "print 'dw error: ', rel_error(dw_num, dw)\n", + "print 'db error: ', rel_error(db_num, db)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Loss layers: Softmax and SVM\n", + "You implemented these loss functions in the last assignment, so we'll give them to you for free here. You should still make sure you understand how they work by looking at the implementations in `cs231n/layers.py`.\n", + "\n", + "You can make sure that the implementations are correct by running the following:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "num_classes, num_inputs = 10, 50\n", + "x = 0.001 * np.random.randn(num_inputs, num_classes)\n", + "y = np.random.randint(num_classes, size=num_inputs)\n", + "\n", + "dx_num = eval_numerical_gradient(lambda x: svm_loss(x, y)[0], x, verbose=False)\n", + "loss, dx = svm_loss(x, y)\n", + "\n", + "# Test svm_loss function. 
Loss should be around 9 and dx error should be 1e-9\n", + "print 'Testing svm_loss:'\n", + "print 'loss: ', loss\n", + "print 'dx error: ', rel_error(dx_num, dx)\n", + "\n", + "dx_num = eval_numerical_gradient(lambda x: softmax_loss(x, y)[0], x, verbose=False)\n", + "loss, dx = softmax_loss(x, y)\n", + "\n", + "# Test softmax_loss function. Loss should be 2.3 and dx error should be 1e-8\n", + "print '\\nTesting softmax_loss:'\n", + "print 'loss: ', loss\n", + "print 'dx error: ', rel_error(dx_num, dx)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Two-layer network\n", + "In the previous assignment you implemented a two-layer neural network in a single monolithic class. Now that you have implemented modular versions of the necessary layers, you will reimplement the two layer network using these modular implementations.\n", + "\n", + "Open the file `cs231n/classifiers/fc_net.py` and complete the implementation of the `TwoLayerNet` class. This class will serve as a model for the other networks you will implement in this assignment, so read through it to make sure you understand the API. You can run the cell below to test your implementation." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "N, D, H, C = 3, 5, 50, 7\n", + "X = np.random.randn(N, D)\n", + "y = np.random.randint(C, size=N)\n", + "\n", + "std = 1e-2\n", + "model = TwoLayerNet(input_dim=D, hidden_dim=H, num_classes=C, weight_scale=std)\n", + "\n", + "print 'Testing initialization ... '\n", + "W1_std = abs(model.params['W1'].std() - std)\n", + "b1 = model.params['b1']\n", + "W2_std = abs(model.params['W2'].std() - std)\n", + "b2 = model.params['b2']\n", + "assert W1_std < std / 10, 'First layer weights do not seem right'\n", + "assert np.all(b1 == 0), 'First layer biases do not seem right'\n", + "assert W2_std < std / 10, 'Second layer weights do not seem right'\n", + "assert np.all(b2 == 0), 'Second layer biases do not seem right'\n", + "\n", + "print 'Testing test-time forward pass ... 
'\n", + "model.params['W1'] = np.linspace(-0.7, 0.3, num=D*H).reshape(D, H)\n", + "model.params['b1'] = np.linspace(-0.1, 0.9, num=H)\n", + "model.params['W2'] = np.linspace(-0.3, 0.4, num=H*C).reshape(H, C)\n", + "model.params['b2'] = np.linspace(-0.9, 0.1, num=C)\n", + "X = np.linspace(-5.5, 4.5, num=N*D).reshape(D, N).T\n", + "scores = model.loss(X)\n", + "correct_scores = np.asarray(\n", + " [[11.53165108, 12.2917344, 13.05181771, 13.81190102, 14.57198434, 15.33206765, 16.09215096],\n", + " [12.05769098, 12.74614105, 13.43459113, 14.1230412, 14.81149128, 15.49994135, 16.18839143],\n", + " [12.58373087, 13.20054771, 13.81736455, 14.43418138, 15.05099822, 15.66781506, 16.2846319 ]])\n", + "scores_diff = np.abs(scores - correct_scores).sum()\n", + "assert scores_diff < 1e-6, 'Problem with test-time forward pass'\n", + "\n", + "print 'Testing training loss (no regularization)'\n", + "y = np.asarray([0, 5, 1])\n", + "loss, grads = model.loss(X, y)\n", + "correct_loss = 3.4702243556\n", + "assert abs(loss - correct_loss) < 1e-10, 'Problem with training-time loss'\n", + "\n", + "model.reg = 1.0\n", + "loss, grads = model.loss(X, y)\n", + "correct_loss = 26.5948426952\n", + "assert abs(loss - correct_loss) < 1e-10, 'Problem with regularization loss'\n", + "\n", + "for reg in [0.0, 0.7]:\n", + " print 'Running numeric gradient check with reg = ', reg\n", + " model.reg = reg\n", + " loss, grads = model.loss(X, y)\n", + "\n", + " for name in sorted(grads):\n", + " f = lambda _: model.loss(X, y)[0]\n", + " grad_num = eval_numerical_gradient(f, model.params[name], verbose=False)\n", + " print '%s relative error: %.2e' % (name, rel_error(grad_num, grads[name]))" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Solver\n", + "In the previous assignment, the logic for training models was coupled to the models themselves. Following a more modular design, for this assignment we have split the logic for training models into a separate class.\n", + "\n", + "Open the file `cs231n/solver.py` and read through it to familiarize yourself with the API. After doing so, use a `Solver` instance to train a `TwoLayerNet` that achieves at least `50%` accuracy on the validation set." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "model = TwoLayerNet()\n", + "solver = None\n", + "\n", + "##############################################################################\n", + "# TODO: Use a Solver instance to train a TwoLayerNet that achieves at least #\n", + "# 50% accuracy on the validation set. 
#\n", + "##############################################################################\n", + "pass\n", + "##############################################################################\n", + "# END OF YOUR CODE #\n", + "##############################################################################" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Run this cell to visualize training loss and train / val accuracy\n", + "\n", + "plt.subplot(2, 1, 1)\n", + "plt.title('Training loss')\n", + "plt.plot(solver.loss_history, 'o')\n", + "plt.xlabel('Iteration')\n", + "\n", + "plt.subplot(2, 1, 2)\n", + "plt.title('Accuracy')\n", + "plt.plot(solver.train_acc_history, '-o', label='train')\n", + "plt.plot(solver.val_acc_history, '-o', label='val')\n", + "plt.plot([0.5] * len(solver.val_acc_history), 'k--')\n", + "plt.xlabel('Epoch')\n", + "plt.legend(loc='lower right')\n", + "plt.gcf().set_size_inches(15, 12)\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Multilayer network\n", + "Next you will implement a fully-connected network with an arbitrary number of hidden layers.\n", + "\n", + "Read through the `FullyConnectedNet` class in the file `cs231n/classifiers/fc_net.py`.\n", + "\n", + "Implement the initialization, the forward pass, and the backward pass. For the moment don't worry about implementing dropout or batch normalization; we will add those features soon." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "## Initial loss and gradient check" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "As a sanity check, run the following to check the initial loss and to gradient check the network both with and without regularization. Do the initial losses seem reasonable?\n", + "\n", + "For gradient checking, you should expect to see errors around 1e-6 or less." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "N, D, H1, H2, C = 2, 15, 20, 30, 10\n", + "X = np.random.randn(N, D)\n", + "y = np.random.randint(C, size=(N,))\n", + "\n", + "for reg in [0, 3.14]:\n", + " print 'Running check with reg = ', reg\n", + " model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,\n", + " reg=reg, weight_scale=5e-2, dtype=np.float64)\n", + "\n", + " loss, grads = model.loss(X, y)\n", + " print 'Initial loss: ', loss\n", + "\n", + " for name in sorted(grads):\n", + " f = lambda _: model.loss(X, y)[0]\n", + " grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)\n", + " print '%s relative error: %.2e' % (name, rel_error(grad_num, grads[name]))" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "As another sanity check, make sure you can overfit a small dataset of 50 images. First we will try a three-layer network with 100 units in each hidden layer. You will need to tweak the learning rate and initialization scale, but you should be able to overfit and achieve 100% training accuracy within 20 epochs." 
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "execution_count": null,
+   "cell_type": "code",
+   "source": [
+    "# TODO: Use a three-layer Net to overfit 50 training examples.\n",
+    "\n",
+    "num_train = 50\n",
+    "small_data = {\n",
+    "  'X_train': data['X_train'][:num_train],\n",
+    "  'y_train': data['y_train'][:num_train],\n",
+    "  'X_val': data['X_val'],\n",
+    "  'y_val': data['y_val'],\n",
+    "}\n",
+    "\n",
+    "weight_scale = 1e-2\n",
+    "learning_rate = 1e-4\n",
+    "model = FullyConnectedNet([100, 100],\n",
+    "                          weight_scale=weight_scale, dtype=np.float64)\n",
+    "solver = Solver(model, small_data,\n",
+    "                print_every=10, num_epochs=20, batch_size=25,\n",
+    "                update_rule='sgd',\n",
+    "                optim_config={\n",
+    "                  'learning_rate': learning_rate,\n",
+    "                }\n",
+    "               )\n",
+    "solver.train()\n",
+    "\n",
+    "plt.plot(solver.loss_history, 'o')\n",
+    "plt.title('Training loss history')\n",
+    "plt.xlabel('Iteration')\n",
+    "plt.ylabel('Training loss')\n",
+    "plt.show()"
+   ],
+   "outputs": [],
+   "metadata": {
+    "scrolled": false,
+    "collapsed": false
+   }
+  },
+  {
+   "source": [
+    "Now try to use a five-layer network with 100 units on each layer to overfit 50 training examples. Again you will have to adjust the learning rate and weight initialization, but you should be able to achieve 100% training accuracy within 20 epochs."
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "execution_count": null,
+   "cell_type": "code",
+   "source": [
+    "# TODO: Use a five-layer Net to overfit 50 training examples.\n",
+    "\n",
+    "num_train = 50\n",
+    "small_data = {\n",
+    "  'X_train': data['X_train'][:num_train],\n",
+    "  'y_train': data['y_train'][:num_train],\n",
+    "  'X_val': data['X_val'],\n",
+    "  'y_val': data['y_val'],\n",
+    "}\n",
+    "\n",
+    "learning_rate = 1e-3\n",
+    "weight_scale = 1e-5\n",
+    "model = FullyConnectedNet([100, 100, 100, 100],\n",
+    "                          weight_scale=weight_scale, dtype=np.float64)\n",
+    "solver = Solver(model, small_data,\n",
+    "                print_every=10, num_epochs=20, batch_size=25,\n",
+    "                update_rule='sgd',\n",
+    "                optim_config={\n",
+    "                  'learning_rate': learning_rate,\n",
+    "                }\n",
+    "               )\n",
+    "solver.train()\n",
+    "\n",
+    "plt.plot(solver.loss_history, 'o')\n",
+    "plt.title('Training loss history')\n",
+    "plt.xlabel('Iteration')\n",
+    "plt.ylabel('Training loss')\n",
+    "plt.show()"
+   ],
+   "outputs": [],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "source": [
+    "# Inline question: \n",
+    "Did you notice anything about the comparative difficulty of training the three-layer net vs training the five-layer net?\n",
+    "\n",
+    "# Answer:\n",
+    "[FILL THIS IN]\n"
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "source": [
+    "# Update rules\n",
+    "So far we have used vanilla stochastic gradient descent (SGD) as our update rule. More sophisticated update rules can make it easier to train deep networks. We will implement a few of the most commonly used update rules and compare them to vanilla SGD."
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "source": [
+    "# SGD+Momentum\n",
+    "Stochastic gradient descent with momentum is a widely used update rule that tends to make deep networks converge faster than vanilla stochastic gradient descent.\n",
+    "\n",
+    "Open the file `cs231n/optim.py` and read the documentation at the top of the file to make sure you understand the API. Implement the SGD+momentum update rule in the function `sgd_momentum` and run the following to check your implementation. You should see errors less than 1e-8."
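+    ,
+    "\n",
+    "\n",
+    "For reference, the formulation in the course notes keeps a velocity vector `v` that accumulates a running mean of gradients; the sketch below assumes `config` carries `momentum` and `learning_rate` keys as in `cs231n/optim.py`:\n",
+    "\n",
+    "```python\n",
+    "v = config['momentum'] * v - config['learning_rate'] * dw  # integrate velocity\n",
+    "next_w = w + v                                             # integrate position\n",
+    "```"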
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.optim import sgd_momentum\n", + "\n", + "N, D = 4, 5\n", + "w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)\n", + "dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)\n", + "v = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)\n", + "\n", + "config = {'learning_rate': 1e-3, 'velocity': v}\n", + "next_w, _ = sgd_momentum(w, dw, config=config)\n", + "\n", + "expected_next_w = np.asarray([\n", + " [ 0.1406, 0.20738947, 0.27417895, 0.34096842, 0.40775789],\n", + " [ 0.47454737, 0.54133684, 0.60812632, 0.67491579, 0.74170526],\n", + " [ 0.80849474, 0.87528421, 0.94207368, 1.00886316, 1.07565263],\n", + " [ 1.14244211, 1.20923158, 1.27602105, 1.34281053, 1.4096 ]])\n", + "expected_velocity = np.asarray([\n", + " [ 0.5406, 0.55475789, 0.56891579, 0.58307368, 0.59723158],\n", + " [ 0.61138947, 0.62554737, 0.63970526, 0.65386316, 0.66802105],\n", + " [ 0.68217895, 0.69633684, 0.71049474, 0.72465263, 0.73881053],\n", + " [ 0.75296842, 0.76712632, 0.78128421, 0.79544211, 0.8096 ]])\n", + "\n", + "print 'next_w error: ', rel_error(next_w, expected_next_w)\n", + "print 'velocity error: ', rel_error(expected_velocity, config['velocity'])" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "Once you have done so, run the following to train a six-layer network with both SGD and SGD+momentum. You should see the SGD+momentum update rule converge faster." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "num_train = 4000\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "solvers = {}\n", + "\n", + "for update_rule in ['sgd', 'sgd_momentum']:\n", + " print 'running with ', update_rule\n", + " model = FullyConnectedNet([100, 100, 100, 100, 100], weight_scale=5e-2)\n", + "\n", + " solver = Solver(model, small_data,\n", + " num_epochs=5, batch_size=100,\n", + " update_rule=update_rule,\n", + " optim_config={\n", + " 'learning_rate': 1e-2,\n", + " },\n", + " verbose=True)\n", + " solvers[update_rule] = solver\n", + " solver.train()\n", + " print\n", + "\n", + "plt.subplot(3, 1, 1)\n", + "plt.title('Training loss')\n", + "plt.xlabel('Iteration')\n", + "\n", + "plt.subplot(3, 1, 2)\n", + "plt.title('Training accuracy')\n", + "plt.xlabel('Epoch')\n", + "\n", + "plt.subplot(3, 1, 3)\n", + "plt.title('Validation accuracy')\n", + "plt.xlabel('Epoch')\n", + "\n", + "for update_rule, solver in solvers.iteritems():\n", + " plt.subplot(3, 1, 1)\n", + " plt.plot(solver.loss_history, 'o', label=update_rule)\n", + " \n", + " plt.subplot(3, 1, 2)\n", + " plt.plot(solver.train_acc_history, '-o', label=update_rule)\n", + "\n", + " plt.subplot(3, 1, 3)\n", + " plt.plot(solver.val_acc_history, '-o', label=update_rule)\n", + " \n", + "for i in [1, 2, 3]:\n", + " plt.subplot(3, 1, i)\n", + " plt.legend(loc='upper center', ncol=4)\n", + "plt.gcf().set_size_inches(15, 15)\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "scrolled": false, + "collapsed": false + } + }, + { + "source": [ + "# RMSProp and Adam\n", + "RMSProp [1] and Adam [2] are update rules that set per-parameter learning rates by using a running average of the second moments of gradients.\n", + "\n", + "In the file `cs231n/optim.py`, implement the RMSProp update 
rule in the `rmsprop` function and implement the Adam update rule in the `adam` function, and check your implementations using the tests below.\n", + "\n", + "[1] Tijmen Tieleman and Geoffrey Hinton. \"Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude.\" COURSERA: Neural Networks for Machine Learning 4 (2012).\n", + "\n", + "[2] Diederik Kingma and Jimmy Ba, \"Adam: A Method for Stochastic Optimization\", ICLR 2015." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Test RMSProp implementation; you should see errors less than 1e-7\n", + "from cs231n.optim import rmsprop\n", + "\n", + "N, D = 4, 5\n", + "w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)\n", + "dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)\n", + "cache = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)\n", + "\n", + "config = {'learning_rate': 1e-2, 'cache': cache}\n", + "next_w, _ = rmsprop(w, dw, config=config)\n", + "\n", + "expected_next_w = np.asarray([\n", + " [-0.39223849, -0.34037513, -0.28849239, -0.23659121, -0.18467247],\n", + " [-0.132737, -0.08078555, -0.02881884, 0.02316247, 0.07515774],\n", + " [ 0.12716641, 0.17918792, 0.23122175, 0.28326742, 0.33532447],\n", + " [ 0.38739248, 0.43947102, 0.49155973, 0.54365823, 0.59576619]])\n", + "expected_cache = np.asarray([\n", + " [ 0.5976, 0.6126277, 0.6277108, 0.64284931, 0.65804321],\n", + " [ 0.67329252, 0.68859723, 0.70395734, 0.71937285, 0.73484377],\n", + " [ 0.75037008, 0.7659518, 0.78158892, 0.79728144, 0.81302936],\n", + " [ 0.82883269, 0.84469141, 0.86060554, 0.87657507, 0.8926 ]])\n", + "\n", + "print 'next_w error: ', rel_error(expected_next_w, next_w)\n", + "print 'cache error: ', rel_error(expected_cache, config['cache'])" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Test Adam implementation; you should see errors around 1e-7 or less\n", + "from cs231n.optim import adam\n", + "\n", + "N, D = 4, 5\n", + "w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)\n", + "dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)\n", + "m = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)\n", + "v = np.linspace(0.7, 0.5, num=N*D).reshape(N, D)\n", + "\n", + "config = {'learning_rate': 1e-2, 'm': m, 'v': v, 't': 5}\n", + "next_w, _ = adam(w, dw, config=config)\n", + "\n", + "expected_next_w = np.asarray([\n", + " [-0.40094747, -0.34836187, -0.29577703, -0.24319299, -0.19060977],\n", + " [-0.1380274, -0.08544591, -0.03286534, 0.01971428, 0.0722929],\n", + " [ 0.1248705, 0.17744702, 0.23002243, 0.28259667, 0.33516969],\n", + " [ 0.38774145, 0.44031188, 0.49288093, 0.54544852, 0.59801459]])\n", + "expected_v = np.asarray([\n", + " [ 0.69966, 0.68908382, 0.67851319, 0.66794809, 0.65738853,],\n", + " [ 0.64683452, 0.63628604, 0.6257431, 0.61520571, 0.60467385,],\n", + " [ 0.59414753, 0.58362676, 0.57311152, 0.56260183, 0.55209767,],\n", + " [ 0.54159906, 0.53110598, 0.52061845, 0.51013645, 0.49966, ]])\n", + "expected_m = np.asarray([\n", + " [ 0.48, 0.49947368, 0.51894737, 0.53842105, 0.55789474],\n", + " [ 0.57736842, 0.59684211, 0.61631579, 0.63578947, 0.65526316],\n", + " [ 0.67473684, 0.69421053, 0.71368421, 0.73315789, 0.75263158],\n", + " [ 0.77210526, 0.79157895, 0.81105263, 0.83052632, 0.85 ]])\n", + "\n", + "print 'next_w error: ', rel_error(expected_next_w, next_w)\n", + "print 'v error: ', rel_error(expected_v, config['v'])\n", + "print 'm error: ', 
rel_error(expected_m, config['m'])"
+   ],
+   "outputs": [],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "source": [
+    "Once you have debugged your RMSProp and Adam implementations, run the following to train a pair of deep networks using these new update rules:"
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "execution_count": null,
+   "cell_type": "code",
+   "source": [
+    "learning_rates = {'rmsprop': 1e-4, 'adam': 1e-3}\n",
+    "for update_rule in ['adam', 'rmsprop']:\n",
+    "  print 'running with ', update_rule\n",
+    "  model = FullyConnectedNet([100, 100, 100, 100, 100], weight_scale=5e-2)\n",
+    "\n",
+    "  solver = Solver(model, small_data,\n",
+    "                  num_epochs=5, batch_size=100,\n",
+    "                  update_rule=update_rule,\n",
+    "                  optim_config={\n",
+    "                    'learning_rate': learning_rates[update_rule]\n",
+    "                  },\n",
+    "                  verbose=True)\n",
+    "  solvers[update_rule] = solver\n",
+    "  solver.train()\n",
+    "  print\n",
+    "\n",
+    "plt.subplot(3, 1, 1)\n",
+    "plt.title('Training loss')\n",
+    "plt.xlabel('Iteration')\n",
+    "\n",
+    "plt.subplot(3, 1, 2)\n",
+    "plt.title('Training accuracy')\n",
+    "plt.xlabel('Epoch')\n",
+    "\n",
+    "plt.subplot(3, 1, 3)\n",
+    "plt.title('Validation accuracy')\n",
+    "plt.xlabel('Epoch')\n",
+    "\n",
+    "for update_rule, solver in solvers.iteritems():\n",
+    "  plt.subplot(3, 1, 1)\n",
+    "  plt.plot(solver.loss_history, 'o', label=update_rule)\n",
+    "  \n",
+    "  plt.subplot(3, 1, 2)\n",
+    "  plt.plot(solver.train_acc_history, '-o', label=update_rule)\n",
+    "\n",
+    "  plt.subplot(3, 1, 3)\n",
+    "  plt.plot(solver.val_acc_history, '-o', label=update_rule)\n",
+    "  \n",
+    "for i in [1, 2, 3]:\n",
+    "  plt.subplot(3, 1, i)\n",
+    "  plt.legend(loc='upper center', ncol=4)\n",
+    "plt.gcf().set_size_inches(15, 15)\n",
+    "plt.show()"
+   ],
+   "outputs": [],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "source": [
+    "# Train a good model!\n",
+    "Train the best fully-connected model that you can on CIFAR-10, storing your best model in the `best_model` variable. We require you to get at least 50% accuracy on the validation set using a fully-connected net.\n",
+    "\n",
+    "If you are careful it should be possible to get accuracies above 55%, but we don't require it for this part and won't assign extra credit for doing so. Later in the assignment we will ask you to train the best convolutional network that you can on CIFAR-10, and we would prefer that you spend your effort working on convolutional nets rather than fully-connected nets.\n",
+    "\n",
+    "You might find it useful to complete the `BatchNormalization.ipynb` and `Dropout.ipynb` notebooks before completing this part, since those techniques can help you train powerful models."
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "execution_count": null,
+   "cell_type": "code",
+   "source": [
+    "best_model = None\n",
+    "################################################################################\n",
+    "# TODO: Train the best FullyConnectedNet that you can on CIFAR-10. You might   #\n",
+    "# find batch normalization and dropout useful. Store your best model in the    #\n",
+    "# best_model variable.                                                         #\n",
+    "################################################################################\n",
+    "pass\n",
+    "################################################################################\n",
+    "#                              END OF YOUR CODE                                #\n",
+    "################################################################################"
+   ],
+   "outputs": [],
+   "metadata": {
+    "scrolled": false,
+    "collapsed": false
+   }
+  },
+  {
+   "source": [
+    "# Test your model\n",
+    "Run your best model on the validation and test sets. You should achieve above 50% accuracy on the validation set."
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "execution_count": null,
+   "cell_type": "code",
+   "source": [
+    "y_test_pred = np.argmax(best_model.loss(X_test), axis=1)\n",
+    "y_val_pred = np.argmax(best_model.loss(X_val), axis=1)\n",
+    "print 'Validation set accuracy: ', (y_val_pred == y_val).mean()\n",
+    "print 'Test set accuracy: ', (y_test_pred == y_test).mean()"
+   ],
+   "outputs": [],
+   "metadata": {
+    "collapsed": false
+   }
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 2",
+   "name": "python2",
+   "language": "python"
+  },
+  "language_info": {
+   "mimetype": "text/x-python",
+   "nbconvert_exporter": "python",
+   "name": "python",
+   "file_extension": ".py",
+   "version": "2.7.6",
+   "pygments_lexer": "ipython2",
+   "codemirror_mode": {
+    "version": 2,
+    "name": "ipython"
+   }
+  }
+ }
+}
\ No newline at end of file
diff --git a/assignments2016/assignment2/README.md b/assignments2016/assignment2/README.md
new file mode 100644
index 00000000..2392c9f2
--- /dev/null
+++ b/assignments2016/assignment2/README.md
@@ -0,0 +1,128 @@
+In this assignment you will practice writing backpropagation code, and training
+Neural Networks and Convolutional Neural Networks. The goals of this assignment
+are as follows:
+
+- understand **Neural Networks** and how they are arranged in layered
+  architectures
+- understand and be able to implement (vectorized) **backpropagation**
+- implement various **update rules** used to optimize Neural Networks
+- implement **batch normalization** for training deep networks
+- implement **dropout** to regularize networks
+- effectively **cross-validate** and find the best hyperparameters for Neural
+  Network architecture
+- understand the architecture of **Convolutional Neural Networks** and gain
+  experience with training these models on data
+
+## Setup
+You can work on the assignment in one of two ways: locally on your own machine,
+or on a virtual machine through Terminal.com.
+
+### Working in the cloud on Terminal
+
+Terminal has created a separate subdomain to serve our class,
+[www.stanfordterminalcloud.com](https://www.stanfordterminalcloud.com). Register
+your account there. The Assignment 2 snapshot can then be found HERE. If you are
+registered in the class you can contact the TA (see Piazza for more information)
+to request Terminal credits for use on the assignment. Once you boot up the
+snapshot everything will be installed for you, and you will be ready to start on
+your assignment right away. We have written a small tutorial on Terminal
+[here](http://cs231n.github.io/terminal-tutorial/).
+
+### Working locally
+Get the code as a zip file
+[here](http://vision.stanford.edu/teaching/cs231n/winter1516_assignment2.zip).
+As for the dependencies:
+
+**[Option 1] Use Anaconda:**
+The preferred approach for installing all the assignment dependencies is to use
+[Anaconda](https://www.continuum.io/downloads), which is a Python distribution
+that includes many of the most popular Python packages for science, math,
+engineering and data analysis. Once you install it you can skip all mentions of
+requirements and you are ready to go directly to working on the assignment.
+
+**[Option 2] Manual install, virtual environment:**
+If you do not want to use Anaconda and want to go with a more manual and risky
+installation route you will likely want to create a
+[virtual environment](http://docs.python-guide.org/en/latest/dev/virtualenvs/)
+for the project. If you choose not to use a virtual environment, it is up to you
+to make sure that all dependencies for the code are installed globally on your
+machine. To set up a virtual environment, run the following:
+
+```bash
+cd assignment2
+sudo pip install virtualenv      # This may already be installed
+virtualenv .env                  # Create a virtual environment
+source .env/bin/activate        # Activate the virtual environment
+pip install -r requirements.txt  # Install dependencies
+# Work on the assignment for a while ...
+deactivate                       # Exit the virtual environment
+```
+
+**Download data:**
+Once you have the starter code, you will need to download the CIFAR-10 dataset.
+Run the following from the `assignment2` directory:
+
+```bash
+cd cs231n/datasets
+./get_datasets.sh
+```
+
+**Compile the Cython extension:** Convolutional Neural Networks require a very
+efficient implementation. We have implemented much of the functionality using
+[Cython](http://cython.org/); you will need to compile the Cython extension
+before you can run the code. From the `cs231n` directory, run the following
+command:
+
+```bash
+python setup.py build_ext --inplace
+```
+
+**Start IPython:**
+After you have the CIFAR-10 data, you should start the IPython notebook server
+from the `assignment2` directory. If you are unfamiliar with IPython, you should
+read our [IPython tutorial](http://cs231n.github.io/ipython-tutorial/).
+
+**NOTE:** If you are working in a virtual environment on OSX, you may encounter
+errors with matplotlib due to the
+[issues described here](http://matplotlib.org/faq/virtualenv_faq.html).
+You can work around this issue by starting the IPython server using the
+`start_ipython_osx.sh` script from the `assignment2` directory; the script
+assumes that your virtual environment is named `.env`.
+
+
+### Submitting your work:
+Whether you work on the assignment locally or using Terminal, once you are done
+working run the `collectSubmission.sh` script; this will produce a file called
+`assignment2.zip`. Upload this file to your dropbox on
+[the coursework](https://coursework.stanford.edu/portal/site/W15-CS-231N-01/)
+page for the course.
+
+
+### Q1: Fully-connected Neural Network (30 points)
+The IPython notebook `FullyConnectedNets.ipynb` will introduce you to our
+modular layer design, and then use those layers to implement fully-connected
+networks of arbitrary depth. To optimize these models you will implement several
+popular update rules.
+
+### Q2: Batch Normalization (30 points)
+In the IPython notebook `BatchNormalization.ipynb` you will implement batch
+normalization, and use it to train deep fully-connected networks.
+
+### Q3: Dropout (10 points)
+The IPython notebook `Dropout.ipynb` will help you implement Dropout and explore
+its effects on model generalization.
+ +### Q4: ConvNet on CIFAR-10 (30 points) +In the IPython Notebook `ConvolutionalNetworks.ipynb` you will implement several +new layers that are commonly used in convolutional networks. You will train a +(shallow) convolutional network on CIFAR-10, and it will then be up to you to +train the best network that you can. + +### Q5: Do something extra! (up to +10 points) +In the process of training your network, you should feel free to implement +anything that you want to get better performance. You can modify the solver, +implement additional layers, use different types of regularization, use an +ensemble of models, or anything else that comes to mind. If you implement these +or other ideas not covered in the assignment then you will be awarded some bonus +points. + diff --git a/assignments2016/assignment2/collectSubmission.sh b/assignments2016/assignment2/collectSubmission.sh new file mode 100755 index 00000000..f189c6bc --- /dev/null +++ b/assignments2016/assignment2/collectSubmission.sh @@ -0,0 +1,2 @@ +rm -f assignment2.zip +zip -r assignment2.zip . -x "*.git*" "*cs231n/datasets*" "*.ipynb_checkpoints*" "*README.md" "*collectSubmission.sh" "*requirements.txt" ".env/*" "*.pyc" "*cs231n/build/*" diff --git a/assignments2016/assignment2/cs231n/.gitignore b/assignments2016/assignment2/cs231n/.gitignore new file mode 100644 index 00000000..fbb42c24 --- /dev/null +++ b/assignments2016/assignment2/cs231n/.gitignore @@ -0,0 +1,3 @@ +build/* +im2col_cython.c +im2col_cython.so diff --git a/assignments2016/assignment2/cs231n/__init__.py b/assignments2016/assignment2/cs231n/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/assignments2016/assignment2/cs231n/classifiers/__init__.py b/assignments2016/assignment2/cs231n/classifiers/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/assignments2016/assignment2/cs231n/classifiers/cnn.py b/assignments2016/assignment2/cs231n/classifiers/cnn.py new file mode 100644 index 00000000..9646c31f --- /dev/null +++ b/assignments2016/assignment2/cs231n/classifiers/cnn.py @@ -0,0 +1,105 @@ +import numpy as np + +from cs231n.layers import * +from cs231n.fast_layers import * +from cs231n.layer_utils import * + + +class ThreeLayerConvNet(object): + """ + A three-layer convolutional network with the following architecture: + + conv - relu - 2x2 max pool - affine - relu - affine - softmax + + The network operates on minibatches of data that have shape (N, C, H, W) + consisting of N images, each with height H and width W and with C input + channels. + """ + + def __init__(self, input_dim=(3, 32, 32), num_filters=32, filter_size=7, + hidden_dim=100, num_classes=10, weight_scale=1e-3, reg=0.0, + dtype=np.float32): + """ + Initialize a new network. + + Inputs: + - input_dim: Tuple (C, H, W) giving size of input data + - num_filters: Number of filters to use in the convolutional layer + - filter_size: Size of filters to use in the convolutional layer + - hidden_dim: Number of units to use in the fully-connected hidden layer + - num_classes: Number of scores to produce from the final affine layer. + - weight_scale: Scalar giving standard deviation for random initialization + of weights. + - reg: Scalar giving L2 regularization strength + - dtype: numpy datatype to use for computation. + """ + self.params = {} + self.reg = reg + self.dtype = dtype + + ############################################################################ + # TODO: Initialize weights and biases for the three-layer convolutional # + # network. 
Weights should be initialized from a Gaussian with standard # + # deviation equal to weight_scale; biases should be initialized to zero. # + # All weights and biases should be stored in the dictionary self.params. # + # Store weights and biases for the convolutional layer using the keys 'W1' # + # and 'b1'; use keys 'W2' and 'b2' for the weights and biases of the # + # hidden affine layer, and keys 'W3' and 'b3' for the weights and biases # + # of the output affine layer. # + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + for k, v in self.params.iteritems(): + self.params[k] = v.astype(dtype) + + + def loss(self, X, y=None): + """ + Evaluate loss and gradient for the three-layer convolutional network. + + Input / output: Same API as TwoLayerNet in fc_net.py. + """ + W1, b1 = self.params['W1'], self.params['b1'] + W2, b2 = self.params['W2'], self.params['b2'] + W3, b3 = self.params['W3'], self.params['b3'] + + # pass conv_param to the forward pass for the convolutional layer + filter_size = W1.shape[2] + conv_param = {'stride': 1, 'pad': (filter_size - 1) / 2} + + # pass pool_param to the forward pass for the max-pooling layer + pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2} + + scores = None + ############################################################################ + # TODO: Implement the forward pass for the three-layer convolutional net, # + # computing the class scores for X and storing them in the scores # + # variable. # + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + if y is None: + return scores + + loss, grads = 0, {} + ############################################################################ + # TODO: Implement the backward pass for the three-layer convolutional net, # + # storing the loss and gradients in the loss and grads variables. Compute # + # data loss using softmax, and make sure that grads[k] holds the gradients # + # for self.params[k]. Don't forget to add L2 regularization! # + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + return loss, grads + + +pass diff --git a/assignments2016/assignment2/cs231n/classifiers/fc_net.py b/assignments2016/assignment2/cs231n/classifiers/fc_net.py new file mode 100644 index 00000000..8f933636 --- /dev/null +++ b/assignments2016/assignment2/cs231n/classifiers/fc_net.py @@ -0,0 +1,250 @@ +import numpy as np + +from cs231n.layers import * +from cs231n.layer_utils import * + + +class TwoLayerNet(object): + """ + A two-layer fully-connected neural network with ReLU nonlinearity and + softmax loss that uses a modular layer design. We assume an input dimension + of D, a hidden dimension of H, and perform classification over C classes. + + The architecure should be affine - relu - affine - softmax. + + Note that this class does not implement gradient descent; instead, it + will interact with a separate Solver object that is responsible for running + optimization. 
+ + The learnable parameters of the model are stored in the dictionary + self.params that maps parameter names to numpy arrays. + """ + + def __init__(self, input_dim=3*32*32, hidden_dim=100, num_classes=10, + weight_scale=1e-3, reg=0.0): + """ + Initialize a new network. + + Inputs: + - input_dim: An integer giving the size of the input + - hidden_dim: An integer giving the size of the hidden layer + - num_classes: An integer giving the number of classes to classify + - dropout: Scalar between 0 and 1 giving dropout strength. + - weight_scale: Scalar giving the standard deviation for random + initialization of the weights. + - reg: Scalar giving L2 regularization strength. + """ + self.params = {} + self.reg = reg + + ############################################################################ + # TODO: Initialize the weights and biases of the two-layer net. Weights # + # should be initialized from a Gaussian with standard deviation equal to # + # weight_scale, and biases should be initialized to zero. All weights and # + # biases should be stored in the dictionary self.params, with first layer # + # weights and biases using the keys 'W1' and 'b1' and second layer weights # + # and biases using the keys 'W2' and 'b2'. # + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + + def loss(self, X, y=None): + """ + Compute loss and gradient for a minibatch of data. + + Inputs: + - X: Array of input data of shape (N, d_1, ..., d_k) + - y: Array of labels, of shape (N,). y[i] gives the label for X[i]. + + Returns: + If y is None, then run a test-time forward pass of the model and return: + - scores: Array of shape (N, C) giving classification scores, where + scores[i, c] is the classification score for X[i] and class c. + + If y is not None, then run a training-time forward and backward pass and + return a tuple of: + - loss: Scalar value giving the loss + - grads: Dictionary with the same keys as self.params, mapping parameter + names to gradients of the loss with respect to those parameters. + """ + scores = None + ############################################################################ + # TODO: Implement the forward pass for the two-layer net, computing the # + # class scores for X and storing them in the scores variable. # + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + # If y is None then we are in test mode so just return scores + if y is None: + return scores + + loss, grads = 0, {} + ############################################################################ + # TODO: Implement the backward pass for the two-layer net. Store the loss # + # in the loss variable and gradients in the grads dictionary. Compute data # + # loss using softmax, and make sure that grads[k] holds the gradients for # + # self.params[k]. Don't forget to add L2 regularization! # + # # + # NOTE: To ensure that your implementation matches ours and you pass the # + # automated tests, make sure that your L2 regularization includes a factor # + # of 0.5 to simplify the expression for the gradient. 
# + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + return loss, grads + + +class FullyConnectedNet(object): + """ + A fully-connected neural network with an arbitrary number of hidden layers, + ReLU nonlinearities, and a softmax loss function. This will also implement + dropout and batch normalization as options. For a network with L layers, + the architecture will be + + {affine - [batch norm] - relu - [dropout]} x (L - 1) - affine - softmax + + where batch normalization and dropout are optional, and the {...} block is + repeated L - 1 times. + + Similar to the TwoLayerNet above, learnable parameters are stored in the + self.params dictionary and will be learned using the Solver class. + """ + + def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10, + dropout=0, use_batchnorm=False, reg=0.0, + weight_scale=1e-2, dtype=np.float32, seed=None): + """ + Initialize a new FullyConnectedNet. + + Inputs: + - hidden_dims: A list of integers giving the size of each hidden layer. + - input_dim: An integer giving the size of the input. + - num_classes: An integer giving the number of classes to classify. + - dropout: Scalar between 0 and 1 giving dropout strength. If dropout=0 then + the network should not use dropout at all. + - use_batchnorm: Whether or not the network should use batch normalization. + - reg: Scalar giving L2 regularization strength. + - weight_scale: Scalar giving the standard deviation for random + initialization of the weights. + - dtype: A numpy datatype object; all computations will be performed using + this datatype. float32 is faster but less accurate, so you should use + float64 for numeric gradient checking. + - seed: If not None, then pass this random seed to the dropout layers. This + will make the dropout layers deteriminstic so we can gradient check the + model. + """ + self.use_batchnorm = use_batchnorm + self.use_dropout = dropout > 0 + self.reg = reg + self.num_layers = 1 + len(hidden_dims) + self.dtype = dtype + self.params = {} + + ############################################################################ + # TODO: Initialize the parameters of the network, storing all values in # + # the self.params dictionary. Store weights and biases for the first layer # + # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be # + # initialized from a normal distribution with standard deviation equal to # + # weight_scale and biases should be initialized to zero. # + # # + # When using batch normalization, store scale and shift parameters for the # + # first layer in gamma1 and beta1; for the second layer use gamma2 and # + # beta2, etc. Scale parameters should be initialized to one and shift # + # parameters should be initialized to zero. # + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + # When using dropout we need to pass a dropout_param dictionary to each + # dropout layer so that the layer knows the dropout probability and the mode + # (train / test). You can pass the same dropout_param to each dropout layer. 
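+    # (Because the same dict object is shared by reference, flipping
+    # self.dropout_param['mode'] once in loss() below switches every dropout
+    # layer between train and test behavior at the same time.)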
+    self.dropout_param = {}
+    if self.use_dropout:
+      self.dropout_param = {'mode': 'train', 'p': dropout}
+      if seed is not None:
+        self.dropout_param['seed'] = seed
+
+    # With batch normalization we need to keep track of running means and
+    # variances, so we need to pass a special bn_param object to each batch
+    # normalization layer. You should pass self.bn_params[0] to the forward pass
+    # of the first batch normalization layer, self.bn_params[1] to the forward
+    # pass of the second batch normalization layer, etc.
+    self.bn_params = []
+    if self.use_batchnorm:
+      self.bn_params = [{'mode': 'train'} for i in xrange(self.num_layers - 1)]
+
+    # Cast all parameters to the correct datatype
+    for k, v in self.params.iteritems():
+      self.params[k] = v.astype(dtype)
+
+
+  def loss(self, X, y=None):
+    """
+    Compute loss and gradient for the fully-connected net.
+
+    Input / output: Same as TwoLayerNet above.
+    """
+    X = X.astype(self.dtype)
+    mode = 'test' if y is None else 'train'
+
+    # Set train/test mode for batchnorm params and dropout param since they
+    # behave differently during training and testing.
+    if self.dropout_param is not None:
+      self.dropout_param['mode'] = mode
+    if self.use_batchnorm:
+      for bn_param in self.bn_params:
+        bn_param['mode'] = mode
+
+    scores = None
+    ############################################################################
+    # TODO: Implement the forward pass for the fully-connected net, computing  #
+    # the class scores for X and storing them in the scores variable.          #
+    #                                                                          #
+    # When using dropout, you'll need to pass self.dropout_param to each       #
+    # dropout forward pass.                                                    #
+    #                                                                          #
+    # When using batch normalization, you'll need to pass self.bn_params[0] to #
+    # the forward pass for the first batch normalization layer, pass           #
+    # self.bn_params[1] to the forward pass for the second batch normalization #
+    # layer, etc.                                                              #
+    ############################################################################
+    pass
+    ############################################################################
+    #                             END OF YOUR CODE                             #
+    ############################################################################
+
+    # If test mode return early
+    if mode == 'test':
+      return scores
+
+    loss, grads = 0.0, {}
+    ############################################################################
+    # TODO: Implement the backward pass for the fully-connected net. Store the #
+    # loss in the loss variable and gradients in the grads dictionary. Compute #
+    # data loss using softmax, and make sure that grads[k] holds the gradients #
+    # for self.params[k]. Don't forget to add L2 regularization!               #
+    #                                                                          #
+    # When using batch normalization, you don't need to regularize the scale   #
+    # and shift parameters.                                                    #
+    #                                                                          #
+    # NOTE: To ensure that your implementation matches ours and you pass the   #
+    # automated tests, make sure that your L2 regularization includes a factor #
+    # of 0.5 to simplify the expression for the gradient.
# + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + return loss, grads diff --git a/assignments2016/assignment2/cs231n/data_utils.py b/assignments2016/assignment2/cs231n/data_utils.py new file mode 100644 index 00000000..a4740ea9 --- /dev/null +++ b/assignments2016/assignment2/cs231n/data_utils.py @@ -0,0 +1,199 @@ +import cPickle as pickle +import numpy as np +import os +from scipy.misc import imread + +def load_CIFAR_batch(filename): + """ load single batch of cifar """ + with open(filename, 'rb') as f: + datadict = pickle.load(f) + X = datadict['data'] + Y = datadict['labels'] + X = X.reshape(10000, 3, 32, 32).transpose(0,2,3,1).astype("float") + Y = np.array(Y) + return X, Y + +def load_CIFAR10(ROOT): + """ load all of cifar """ + xs = [] + ys = [] + for b in range(1,6): + f = os.path.join(ROOT, 'data_batch_%d' % (b, )) + X, Y = load_CIFAR_batch(f) + xs.append(X) + ys.append(Y) + Xtr = np.concatenate(xs) + Ytr = np.concatenate(ys) + del X, Y + Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch')) + return Xtr, Ytr, Xte, Yte + + +def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000): + """ + Load the CIFAR-10 dataset from disk and perform preprocessing to prepare + it for classifiers. These are the same steps as we used for the SVM, but + condensed to a single function. + """ + # Load the raw CIFAR-10 data + cifar10_dir = 'cs231n/datasets/cifar-10-batches-py' + X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir) + + # Subsample the data + mask = range(num_training, num_training + num_validation) + X_val = X_train[mask] + y_val = y_train[mask] + mask = range(num_training) + X_train = X_train[mask] + y_train = y_train[mask] + mask = range(num_test) + X_test = X_test[mask] + y_test = y_test[mask] + + # Normalize the data: subtract the mean image + mean_image = np.mean(X_train, axis=0) + X_train -= mean_image + X_val -= mean_image + X_test -= mean_image + + # Transpose so that channels come first + X_train = X_train.transpose(0, 3, 1, 2).copy() + X_val = X_val.transpose(0, 3, 1, 2).copy() + X_test = X_test.transpose(0, 3, 1, 2).copy() + + # Package data into a dictionary + return { + 'X_train': X_train, 'y_train': y_train, + 'X_val': X_val, 'y_val': y_val, + 'X_test': X_test, 'y_test': y_test, + } + + +def load_tiny_imagenet(path, dtype=np.float32): + """ + Load TinyImageNet. Each of TinyImageNet-100-A, TinyImageNet-100-B, and + TinyImageNet-200 have the same directory structure, so this can be used + to load any of them. + + Inputs: + - path: String giving path to the directory to load. + - dtype: numpy datatype used to load the data. + + Returns: A tuple of + - class_names: A list where class_names[i] is a list of strings giving the + WordNet names for class i in the loaded dataset. + - X_train: (N_tr, 3, 64, 64) array of training images + - y_train: (N_tr,) array of training labels + - X_val: (N_val, 3, 64, 64) array of validation images + - y_val: (N_val,) array of validation labels + - X_test: (N_test, 3, 64, 64) array of testing images. + - y_test: (N_test,) array of test labels; if test labels are not available + (such as in student code) then y_test will be None. 
+ """ + # First load wnids + with open(os.path.join(path, 'wnids.txt'), 'r') as f: + wnids = [x.strip() for x in f] + + # Map wnids to integer labels + wnid_to_label = {wnid: i for i, wnid in enumerate(wnids)} + + # Use words.txt to get names for each class + with open(os.path.join(path, 'words.txt'), 'r') as f: + wnid_to_words = dict(line.split('\t') for line in f) + for wnid, words in wnid_to_words.iteritems(): + wnid_to_words[wnid] = [w.strip() for w in words.split(',')] + class_names = [wnid_to_words[wnid] for wnid in wnids] + + # Next load training data. + X_train = [] + y_train = [] + for i, wnid in enumerate(wnids): + if (i + 1) % 20 == 0: + print 'loading training data for synset %d / %d' % (i + 1, len(wnids)) + # To figure out the filenames we need to open the boxes file + boxes_file = os.path.join(path, 'train', wnid, '%s_boxes.txt' % wnid) + with open(boxes_file, 'r') as f: + filenames = [x.split('\t')[0] for x in f] + num_images = len(filenames) + + X_train_block = np.zeros((num_images, 3, 64, 64), dtype=dtype) + y_train_block = wnid_to_label[wnid] * np.ones(num_images, dtype=np.int64) + for j, img_file in enumerate(filenames): + img_file = os.path.join(path, 'train', wnid, 'images', img_file) + img = imread(img_file) + if img.ndim == 2: + ## grayscale file + img.shape = (64, 64, 1) + X_train_block[j] = img.transpose(2, 0, 1) + X_train.append(X_train_block) + y_train.append(y_train_block) + + # We need to concatenate all training data + X_train = np.concatenate(X_train, axis=0) + y_train = np.concatenate(y_train, axis=0) + + # Next load validation data + with open(os.path.join(path, 'val', 'val_annotations.txt'), 'r') as f: + img_files = [] + val_wnids = [] + for line in f: + img_file, wnid = line.split('\t')[:2] + img_files.append(img_file) + val_wnids.append(wnid) + num_val = len(img_files) + y_val = np.array([wnid_to_label[wnid] for wnid in val_wnids]) + X_val = np.zeros((num_val, 3, 64, 64), dtype=dtype) + for i, img_file in enumerate(img_files): + img_file = os.path.join(path, 'val', 'images', img_file) + img = imread(img_file) + if img.ndim == 2: + img.shape = (64, 64, 1) + X_val[i] = img.transpose(2, 0, 1) + + # Next load test images + # Students won't have test labels, so we need to iterate over files in the + # images directory. + img_files = os.listdir(os.path.join(path, 'test', 'images')) + X_test = np.zeros((len(img_files), 3, 64, 64), dtype=dtype) + for i, img_file in enumerate(img_files): + img_file = os.path.join(path, 'test', 'images', img_file) + img = imread(img_file) + if img.ndim == 2: + img.shape = (64, 64, 1) + X_test[i] = img.transpose(2, 0, 1) + + y_test = None + y_test_file = os.path.join(path, 'test', 'test_annotations.txt') + if os.path.isfile(y_test_file): + with open(y_test_file, 'r') as f: + img_file_to_wnid = {} + for line in f: + line = line.split('\t') + img_file_to_wnid[line[0]] = line[1] + y_test = [wnid_to_label[img_file_to_wnid[img_file]] for img_file in img_files] + y_test = np.array(y_test) + + return class_names, X_train, y_train, X_val, y_val, X_test, y_test + + +def load_models(models_dir): + """ + Load saved models from disk. This will attempt to unpickle all files in a + directory; any files that give errors on unpickling (such as README.txt) will + be skipped. + + Inputs: + - models_dir: String giving the path to a directory containing model files. + Each model file is a pickled dictionary with a 'model' field. + + Returns: + A dictionary mapping model file names to models. 
+ """ + models = {} + for model_file in os.listdir(models_dir): + with open(os.path.join(models_dir, model_file), 'rb') as f: + try: + models[model_file] = pickle.load(f)['model'] + except pickle.UnpicklingError: + continue + return models diff --git a/assignments2016/assignment2/cs231n/datasets/.gitignore b/assignments2016/assignment2/cs231n/datasets/.gitignore new file mode 100644 index 00000000..0232c3ab --- /dev/null +++ b/assignments2016/assignment2/cs231n/datasets/.gitignore @@ -0,0 +1,4 @@ +cifar-10-batches-py/* +tiny-imagenet-100-A* +tiny-imagenet-100-B* +tiny-100-A-pretrained/* diff --git a/assignments2016/assignment2/cs231n/datasets/get_datasets.sh b/assignments2016/assignment2/cs231n/datasets/get_datasets.sh new file mode 100755 index 00000000..0dd93621 --- /dev/null +++ b/assignments2016/assignment2/cs231n/datasets/get_datasets.sh @@ -0,0 +1,4 @@ +# Get CIFAR10 +wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz +tar -xzvf cifar-10-python.tar.gz +rm cifar-10-python.tar.gz diff --git a/assignments2016/assignment2/cs231n/fast_layers.py b/assignments2016/assignment2/cs231n/fast_layers.py new file mode 100644 index 00000000..2ac8dfb0 --- /dev/null +++ b/assignments2016/assignment2/cs231n/fast_layers.py @@ -0,0 +1,270 @@ +import numpy as np +try: + from cs231n.im2col_cython import col2im_cython, im2col_cython + from cs231n.im2col_cython import col2im_6d_cython +except ImportError: + print 'run the following from the cs231n directory and try again:' + print 'python setup.py build_ext --inplace' + print 'You may also need to restart your iPython kernel' + +from cs231n.im2col import * + + +def conv_forward_im2col(x, w, b, conv_param): + """ + A fast implementation of the forward pass for a convolutional layer + based on im2col and col2im. 
+ """ + N, C, H, W = x.shape + num_filters, _, filter_height, filter_width = w.shape + stride, pad = conv_param['stride'], conv_param['pad'] + + # Check dimensions + assert (W + 2 * pad - filter_width) % stride == 0, 'width does not work' + assert (H + 2 * pad - filter_height) % stride == 0, 'height does not work' + + # Create output + out_height = (H + 2 * pad - filter_height) / stride + 1 + out_width = (W + 2 * pad - filter_width) / stride + 1 + out = np.zeros((N, num_filters, out_height, out_width), dtype=x.dtype) + + # x_cols = im2col_indices(x, w.shape[2], w.shape[3], pad, stride) + x_cols = im2col_cython(x, w.shape[2], w.shape[3], pad, stride) + res = w.reshape((w.shape[0], -1)).dot(x_cols) + b.reshape(-1, 1) + + out = res.reshape(w.shape[0], out.shape[2], out.shape[3], x.shape[0]) + out = out.transpose(3, 0, 1, 2) + + cache = (x, w, b, conv_param, x_cols) + return out, cache + + +def conv_forward_strides(x, w, b, conv_param): + N, C, H, W = x.shape + F, _, HH, WW = w.shape + stride, pad = conv_param['stride'], conv_param['pad'] + + # Check dimensions + assert (W + 2 * pad - WW) % stride == 0, 'width does not work' + assert (H + 2 * pad - HH) % stride == 0, 'height does not work' + + # Pad the input + p = pad + x_padded = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)), mode='constant') + + # Figure out output dimensions + H += 2 * pad + W += 2 * pad + out_h = (H - HH) / stride + 1 + out_w = (W - WW) / stride + 1 + + # Perform an im2col operation by picking clever strides + shape = (C, HH, WW, N, out_h, out_w) + strides = (H * W, W, 1, C * H * W, stride * W, stride) + strides = x.itemsize * np.array(strides) + x_stride = np.lib.stride_tricks.as_strided(x_padded, + shape=shape, strides=strides) + x_cols = np.ascontiguousarray(x_stride) + x_cols.shape = (C * HH * WW, N * out_h * out_w) + + # Now all our convolutions are a big matrix multiply + res = w.reshape(F, -1).dot(x_cols) + b.reshape(-1, 1) + + # Reshape the output + res.shape = (F, N, out_h, out_w) + out = res.transpose(1, 0, 2, 3) + + # Be nice and return a contiguous array + # The old version of conv_forward_fast doesn't do this, so for a fair + # comparison we won't either + out = np.ascontiguousarray(out) + + cache = (x, w, b, conv_param, x_cols) + return out, cache + + +def conv_backward_strides(dout, cache): + x, w, b, conv_param, x_cols = cache + stride, pad = conv_param['stride'], conv_param['pad'] + + N, C, H, W = x.shape + F, _, HH, WW = w.shape + _, _, out_h, out_w = dout.shape + + db = np.sum(dout, axis=(0, 2, 3)) + + dout_reshaped = dout.transpose(1, 0, 2, 3).reshape(F, -1) + dw = dout_reshaped.dot(x_cols.T).reshape(w.shape) + + dx_cols = w.reshape(F, -1).T.dot(dout_reshaped) + dx_cols.shape = (C, HH, WW, N, out_h, out_w) + dx = col2im_6d_cython(dx_cols, N, C, H, W, HH, WW, pad, stride) + + return dx, dw, db + + +def conv_backward_im2col(dout, cache): + """ + A fast implementation of the backward pass for a convolutional layer + based on im2col and col2im. 
+ """ + x, w, b, conv_param, x_cols = cache + stride, pad = conv_param['stride'], conv_param['pad'] + + db = np.sum(dout, axis=(0, 2, 3)) + + num_filters, _, filter_height, filter_width = w.shape + dout_reshaped = dout.transpose(1, 2, 3, 0).reshape(num_filters, -1) + dw = dout_reshaped.dot(x_cols.T).reshape(w.shape) + + dx_cols = w.reshape(num_filters, -1).T.dot(dout_reshaped) + # dx = col2im_indices(dx_cols, x.shape, filter_height, filter_width, pad, stride) + dx = col2im_cython(dx_cols, x.shape[0], x.shape[1], x.shape[2], x.shape[3], + filter_height, filter_width, pad, stride) + + return dx, dw, db + + +conv_forward_fast = conv_forward_strides +conv_backward_fast = conv_backward_strides + + +def max_pool_forward_fast(x, pool_param): + """ + A fast implementation of the forward pass for a max pooling layer. + + This chooses between the reshape method and the im2col method. If the pooling + regions are square and tile the input image, then we can use the reshape + method which is very fast. Otherwise we fall back on the im2col method, which + is not much faster than the naive method. + """ + N, C, H, W = x.shape + pool_height, pool_width = pool_param['pool_height'], pool_param['pool_width'] + stride = pool_param['stride'] + + same_size = pool_height == pool_width == stride + tiles = H % pool_height == 0 and W % pool_width == 0 + if same_size and tiles: + out, reshape_cache = max_pool_forward_reshape(x, pool_param) + cache = ('reshape', reshape_cache) + else: + out, im2col_cache = max_pool_forward_im2col(x, pool_param) + cache = ('im2col', im2col_cache) + return out, cache + + +def max_pool_backward_fast(dout, cache): + """ + A fast implementation of the backward pass for a max pooling layer. + + This switches between the reshape method an the im2col method depending on + which method was used to generate the cache. + """ + method, real_cache = cache + if method == 'reshape': + return max_pool_backward_reshape(dout, real_cache) + elif method == 'im2col': + return max_pool_backward_im2col(dout, real_cache) + else: + raise ValueError('Unrecognized method "%s"' % method) + + +def max_pool_forward_reshape(x, pool_param): + """ + A fast implementation of the forward pass for the max pooling layer that uses + some clever reshaping. + + This can only be used for square pooling regions that tile the input. + """ + N, C, H, W = x.shape + pool_height, pool_width = pool_param['pool_height'], pool_param['pool_width'] + stride = pool_param['stride'] + assert pool_height == pool_width == stride, 'Invalid pool params' + assert H % pool_height == 0 + assert W % pool_height == 0 + x_reshaped = x.reshape(N, C, H / pool_height, pool_height, + W / pool_width, pool_width) + out = x_reshaped.max(axis=3).max(axis=4) + + cache = (x, x_reshaped, out) + return out, cache + + +def max_pool_backward_reshape(dout, cache): + """ + A fast implementation of the backward pass for the max pooling layer that + uses some clever broadcasting and reshaping. + + This can only be used if the forward pass was computed using + max_pool_forward_reshape. + + NOTE: If there are multiple argmaxes, this method will assign gradient to + ALL argmax elements of the input rather than picking one. In this case the + gradient will actually be incorrect. However this is unlikely to occur in + practice, so it shouldn't matter much. One possible solution is to split the + upstream gradient equally among all argmax elements; this should result in a + valid subgradient. 
You can make this happen by uncommenting the line below; + however this results in a significant performance penalty (about 40% slower) + and is unlikely to matter in practice so we don't do it. + """ + x, x_reshaped, out = cache + + dx_reshaped = np.zeros_like(x_reshaped) + out_newaxis = out[:, :, :, np.newaxis, :, np.newaxis] + mask = (x_reshaped == out_newaxis) + dout_newaxis = dout[:, :, :, np.newaxis, :, np.newaxis] + dout_broadcast, _ = np.broadcast_arrays(dout_newaxis, dx_reshaped) + dx_reshaped[mask] = dout_broadcast[mask] + dx_reshaped /= np.sum(mask, axis=(3, 5), keepdims=True) + dx = dx_reshaped.reshape(x.shape) + + return dx + + +def max_pool_forward_im2col(x, pool_param): + """ + An implementation of the forward pass for max pooling based on im2col. + + This isn't much faster than the naive version, so it should be avoided if + possible. + """ + N, C, H, W = x.shape + pool_height, pool_width = pool_param['pool_height'], pool_param['pool_width'] + stride = pool_param['stride'] + + assert (H - pool_height) % stride == 0, 'Invalid height' + assert (W - pool_width) % stride == 0, 'Invalid width' + + out_height = (H - pool_height) / stride + 1 + out_width = (W - pool_width) / stride + 1 + + x_split = x.reshape(N * C, 1, H, W) + x_cols = im2col(x_split, pool_height, pool_width, padding=0, stride=stride) + x_cols_argmax = np.argmax(x_cols, axis=0) + x_cols_max = x_cols[x_cols_argmax, np.arange(x_cols.shape[1])] + out = x_cols_max.reshape(out_height, out_width, N, C).transpose(2, 3, 0, 1) + + cache = (x, x_cols, x_cols_argmax, pool_param) + return out, cache + + +def max_pool_backward_im2col(dout, cache): + """ + An implementation of the backward pass for max pooling based on im2col. + + This isn't much faster than the naive version, so it should be avoided if + possible. 
+ """ + x, x_cols, x_cols_argmax, pool_param = cache + N, C, H, W = x.shape + pool_height, pool_width = pool_param['pool_height'], pool_param['pool_width'] + stride = pool_param['stride'] + + dout_reshaped = dout.transpose(2, 3, 0, 1).flatten() + dx_cols = np.zeros_like(x_cols) + dx_cols[x_cols_argmax, np.arange(dx_cols.shape[1])] = dout_reshaped + dx = col2im_indices(dx_cols, (N * C, 1, H, W), pool_height, pool_width, + padding=0, stride=stride) + dx = dx.reshape(x.shape) + + return dx diff --git a/assignments2016/assignment2/cs231n/gradient_check.py b/assignments2016/assignment2/cs231n/gradient_check.py new file mode 100644 index 00000000..2d6b1f62 --- /dev/null +++ b/assignments2016/assignment2/cs231n/gradient_check.py @@ -0,0 +1,124 @@ +import numpy as np +from random import randrange + +def eval_numerical_gradient(f, x, verbose=True, h=0.00001): + """ + a naive implementation of numerical gradient of f at x + - f should be a function that takes a single argument + - x is the point (numpy array) to evaluate the gradient at + """ + + fx = f(x) # evaluate function value at original point + grad = np.zeros_like(x) + # iterate over all indexes in x + it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite']) + while not it.finished: + + # evaluate function at x+h + ix = it.multi_index + oldval = x[ix] + x[ix] = oldval + h # increment by h + fxph = f(x) # evalute f(x + h) + x[ix] = oldval - h + fxmh = f(x) # evaluate f(x - h) + x[ix] = oldval # restore + + # compute the partial derivative with centered formula + grad[ix] = (fxph - fxmh) / (2 * h) # the slope + if verbose: + print ix, grad[ix] + it.iternext() # step to next dimension + + return grad + + +def eval_numerical_gradient_array(f, x, df, h=1e-5): + """ + Evaluate a numeric gradient for a function that accepts a numpy + array and returns a numpy array. + """ + grad = np.zeros_like(x) + it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite']) + while not it.finished: + ix = it.multi_index + + oldval = x[ix] + x[ix] = oldval + h + pos = f(x).copy() + x[ix] = oldval - h + neg = f(x).copy() + x[ix] = oldval + + grad[ix] = np.sum((pos - neg) * df) / (2 * h) + it.iternext() + return grad + + +def eval_numerical_gradient_blobs(f, inputs, output, h=1e-5): + """ + Compute numeric gradients for a function that operates on input + and output blobs. + + We assume that f accepts several input blobs as arguments, followed by a blob + into which outputs will be written. For example, f might be called like this: + + f(x, w, out) + + where x and w are input Blobs, and the result of f will be written to out. 
+ + Inputs: + - f: function + - inputs: tuple of input blobs + - output: output blob + - h: step size + """ + numeric_diffs = [] + for input_blob in inputs: + diff = np.zeros_like(input_blob.diffs) + it = np.nditer(input_blob.vals, flags=['multi_index'], + op_flags=['readwrite']) + while not it.finished: + idx = it.multi_index + orig = input_blob.vals[idx] + + input_blob.vals[idx] = orig + h + f(*(inputs + (output,))) + pos = np.copy(output.vals) + input_blob.vals[idx] = orig - h + f(*(inputs + (output,))) + neg = np.copy(output.vals) + input_blob.vals[idx] = orig + + diff[idx] = np.sum((pos - neg) * output.diffs) / (2.0 * h) + + it.iternext() + numeric_diffs.append(diff) + return numeric_diffs + + +def eval_numerical_gradient_net(net, inputs, output, h=1e-5): + return eval_numerical_gradient_blobs(lambda *args: net.forward(), + inputs, output, h=h) + + +def grad_check_sparse(f, x, analytic_grad, num_checks=10, h=1e-5): + """ + sample a few random elements and only return numerical + in this dimensions. + """ + + for i in xrange(num_checks): + ix = tuple([randrange(m) for m in x.shape]) + + oldval = x[ix] + x[ix] = oldval + h # increment by h + fxph = f(x) # evaluate f(x + h) + x[ix] = oldval - h # increment by h + fxmh = f(x) # evaluate f(x - h) + x[ix] = oldval # reset + + grad_numerical = (fxph - fxmh) / (2 * h) + grad_analytic = analytic_grad[ix] + rel_error = abs(grad_numerical - grad_analytic) / (abs(grad_numerical) + abs(grad_analytic)) + print 'numerical: %f analytic: %f, relative error: %e' % (grad_numerical, grad_analytic, rel_error) + diff --git a/assignments2016/assignment2/cs231n/im2col.py b/assignments2016/assignment2/cs231n/im2col.py new file mode 100644 index 00000000..1942eab6 --- /dev/null +++ b/assignments2016/assignment2/cs231n/im2col.py @@ -0,0 +1,55 @@ +import numpy as np + + +def get_im2col_indices(x_shape, field_height, field_width, padding=1, stride=1): + # First figure out what the size of the output should be + N, C, H, W = x_shape + assert (H + 2 * padding - field_height) % stride == 0 + assert (W + 2 * padding - field_height) % stride == 0 + out_height = (H + 2 * padding - field_height) / stride + 1 + out_width = (W + 2 * padding - field_width) / stride + 1 + + i0 = np.repeat(np.arange(field_height), field_width) + i0 = np.tile(i0, C) + i1 = stride * np.repeat(np.arange(out_height), out_width) + j0 = np.tile(np.arange(field_width), field_height * C) + j1 = stride * np.tile(np.arange(out_width), out_height) + i = i0.reshape(-1, 1) + i1.reshape(1, -1) + j = j0.reshape(-1, 1) + j1.reshape(1, -1) + + k = np.repeat(np.arange(C), field_height * field_width).reshape(-1, 1) + + return (k, i, j) + + +def im2col_indices(x, field_height, field_width, padding=1, stride=1): + """ An implementation of im2col based on some fancy indexing """ + # Zero-pad the input + p = padding + x_padded = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)), mode='constant') + + k, i, j = get_im2col_indices(x.shape, field_height, field_width, padding, + stride) + + cols = x_padded[:, k, i, j] + C = x.shape[1] + cols = cols.transpose(1, 2, 0).reshape(field_height * field_width * C, -1) + return cols + + +def col2im_indices(cols, x_shape, field_height=3, field_width=3, padding=1, + stride=1): + """ An implementation of col2im based on fancy indexing and np.add.at """ + N, C, H, W = x_shape + H_padded, W_padded = H + 2 * padding, W + 2 * padding + x_padded = np.zeros((N, C, H_padded, W_padded), dtype=cols.dtype) + k, i, j = get_im2col_indices(x_shape, field_height, field_width, padding, + stride) + 
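+  # Scatter the columns back into the padded image. Because receptive fields
+  # overlap, several columns can map to the same pixel, so np.add.at is used
+  # below to accumulate contributions instead of overwriting them.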
cols_reshaped = cols.reshape(C * field_height * field_width, -1, N) + cols_reshaped = cols_reshaped.transpose(2, 0, 1) + np.add.at(x_padded, (slice(None), k, i, j), cols_reshaped) + if padding == 0: + return x_padded + return x_padded[:, :, padding:-padding, padding:-padding] + +pass diff --git a/assignments2016/assignment2/cs231n/im2col_cython.pyx b/assignments2016/assignment2/cs231n/im2col_cython.pyx new file mode 100644 index 00000000..d6e33c6f --- /dev/null +++ b/assignments2016/assignment2/cs231n/im2col_cython.pyx @@ -0,0 +1,121 @@ +import numpy as np +cimport numpy as np +cimport cython + +# DTYPE = np.float64 +# ctypedef np.float64_t DTYPE_t + +ctypedef fused DTYPE_t: + np.float32_t + np.float64_t + +def im2col_cython(np.ndarray[DTYPE_t, ndim=4] x, int field_height, + int field_width, int padding, int stride): + cdef int N = x.shape[0] + cdef int C = x.shape[1] + cdef int H = x.shape[2] + cdef int W = x.shape[3] + + cdef int HH = (H + 2 * padding - field_height) / stride + 1 + cdef int WW = (W + 2 * padding - field_width) / stride + 1 + + cdef int p = padding + cdef np.ndarray[DTYPE_t, ndim=4] x_padded = np.pad(x, + ((0, 0), (0, 0), (p, p), (p, p)), mode='constant') + + cdef np.ndarray[DTYPE_t, ndim=2] cols = np.zeros( + (C * field_height * field_width, N * HH * WW), + dtype=x.dtype) + + # Moving the inner loop to a C function with no bounds checking works, but does + # not seem to help performance in any measurable way. + + im2col_cython_inner(cols, x_padded, N, C, H, W, HH, WW, + field_height, field_width, padding, stride) + return cols + + +@cython.boundscheck(False) +cdef int im2col_cython_inner(np.ndarray[DTYPE_t, ndim=2] cols, + np.ndarray[DTYPE_t, ndim=4] x_padded, + int N, int C, int H, int W, int HH, int WW, + int field_height, int field_width, int padding, int stride) except? -1: + cdef int c, ii, jj, row, yy, xx, i, col + + for c in range(C): + for yy in range(HH): + for xx in range(WW): + for ii in range(field_height): + for jj in range(field_width): + row = c * field_width * field_height + ii * field_height + jj + for i in range(N): + col = yy * WW * N + xx * N + i + cols[row, col] = x_padded[i, c, stride * yy + ii, stride * xx + jj] + + + +def col2im_cython(np.ndarray[DTYPE_t, ndim=2] cols, int N, int C, int H, int W, + int field_height, int field_width, int padding, int stride): + cdef np.ndarray x = np.empty((N, C, H, W), dtype=cols.dtype) + cdef int HH = (H + 2 * padding - field_height) / stride + 1 + cdef int WW = (W + 2 * padding - field_width) / stride + 1 + cdef np.ndarray[DTYPE_t, ndim=4] x_padded = np.zeros((N, C, H + 2 * padding, W + 2 * padding), + dtype=cols.dtype) + + # Moving the inner loop to a C-function with no bounds checking improves + # performance quite a bit for col2im. + col2im_cython_inner(cols, x_padded, N, C, H, W, HH, WW, + field_height, field_width, padding, stride) + if padding > 0: + return x_padded[:, :, padding:-padding, padding:-padding] + return x_padded + + +@cython.boundscheck(False) +cdef int col2im_cython_inner(np.ndarray[DTYPE_t, ndim=2] cols, + np.ndarray[DTYPE_t, ndim=4] x_padded, + int N, int C, int H, int W, int HH, int WW, + int field_height, int field_width, int padding, int stride) except? 
-1: + cdef int c, ii, jj, row, yy, xx, i, col + + for c in range(C): + for ii in range(field_height): + for jj in range(field_width): + row = c * field_width * field_height + ii * field_height + jj + for yy in range(HH): + for xx in range(WW): + for i in range(N): + col = yy * WW * N + xx * N + i + x_padded[i, c, stride * yy + ii, stride * xx + jj] += cols[row, col] + + +@cython.boundscheck(False) +@cython.wraparound(False) +cdef col2im_6d_cython_inner(np.ndarray[DTYPE_t, ndim=6] cols, + np.ndarray[DTYPE_t, ndim=4] x_padded, + int N, int C, int H, int W, int HH, int WW, + int out_h, int out_w, int pad, int stride): + + cdef int c, hh, ww, n, h, w + for n in range(N): + for c in range(C): + for hh in range(HH): + for ww in range(WW): + for h in range(out_h): + for w in range(out_w): + x_padded[n, c, stride * h + hh, stride * w + ww] += cols[c, hh, ww, n, h, w] + + +def col2im_6d_cython(np.ndarray[DTYPE_t, ndim=6] cols, int N, int C, int H, int W, + int HH, int WW, int pad, int stride): + cdef np.ndarray x = np.empty((N, C, H, W), dtype=cols.dtype) + cdef int out_h = (H + 2 * pad - HH) / stride + 1 + cdef int out_w = (W + 2 * pad - WW) / stride + 1 + cdef np.ndarray[DTYPE_t, ndim=4] x_padded = np.zeros((N, C, H + 2 * pad, W + 2 * pad), + dtype=cols.dtype) + + col2im_6d_cython_inner(cols, x_padded, N, C, H, W, HH, WW, out_h, out_w, pad, stride) + + if pad > 0: + return x_padded[:, :, pad:-pad, pad:-pad] + return x_padded diff --git a/assignments2016/assignment2/cs231n/layer_utils.py b/assignments2016/assignment2/cs231n/layer_utils.py new file mode 100644 index 00000000..c4989618 --- /dev/null +++ b/assignments2016/assignment2/cs231n/layer_utils.py @@ -0,0 +1,93 @@ +from cs231n.layers import * +from cs231n.fast_layers import * + + +def affine_relu_forward(x, w, b): + """ + Convenience layer that perorms an affine transform followed by a ReLU + + Inputs: + - x: Input to the affine layer + - w, b: Weights for the affine layer + + Returns a tuple of: + - out: Output from the ReLU + - cache: Object to give to the backward pass + """ + a, fc_cache = affine_forward(x, w, b) + out, relu_cache = relu_forward(a) + cache = (fc_cache, relu_cache) + return out, cache + + +def affine_relu_backward(dout, cache): + """ + Backward pass for the affine-relu convenience layer + """ + fc_cache, relu_cache = cache + da = relu_backward(dout, relu_cache) + dx, dw, db = affine_backward(da, fc_cache) + return dx, dw, db + + +pass + + +def conv_relu_forward(x, w, b, conv_param): + """ + A convenience layer that performs a convolution followed by a ReLU. + + Inputs: + - x: Input to the convolutional layer + - w, b, conv_param: Weights and parameters for the convolutional layer + + Returns a tuple of: + - out: Output from the ReLU + - cache: Object to give to the backward pass + """ + a, conv_cache = conv_forward_fast(x, w, b, conv_param) + out, relu_cache = relu_forward(a) + cache = (conv_cache, relu_cache) + return out, cache + + +def conv_relu_backward(dout, cache): + """ + Backward pass for the conv-relu convenience layer. + """ + conv_cache, relu_cache = cache + da = relu_backward(dout, relu_cache) + dx, dw, db = conv_backward_fast(da, conv_cache) + return dx, dw, db + + +def conv_relu_pool_forward(x, w, b, conv_param, pool_param): + """ + Convenience layer that performs a convolution, a ReLU, and a pool. 
+ + Inputs: + - x: Input to the convolutional layer + - w, b, conv_param: Weights and parameters for the convolutional layer + - pool_param: Parameters for the pooling layer + + Returns a tuple of: + - out: Output from the pooling layer + - cache: Object to give to the backward pass + """ + a, conv_cache = conv_forward_fast(x, w, b, conv_param) + s, relu_cache = relu_forward(a) + out, pool_cache = max_pool_forward_fast(s, pool_param) + cache = (conv_cache, relu_cache, pool_cache) + return out, cache + + +def conv_relu_pool_backward(dout, cache): + """ + Backward pass for the conv-relu-pool convenience layer + """ + conv_cache, relu_cache, pool_cache = cache + ds = max_pool_backward_fast(dout, pool_cache) + da = relu_backward(ds, relu_cache) + dx, dw, db = conv_backward_fast(da, conv_cache) + return dx, dw, db + diff --git a/assignments2016/assignment2/cs231n/layers.py b/assignments2016/assignment2/cs231n/layers.py new file mode 100644 index 00000000..3a716cf2 --- /dev/null +++ b/assignments2016/assignment2/cs231n/layers.py @@ -0,0 +1,554 @@ +import numpy as np + + +def affine_forward(x, w, b): + """ + Computes the forward pass for an affine (fully-connected) layer. + + The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N + examples, where each example x[i] has shape (d_1, ..., d_k). We will + reshape each input into a vector of dimension D = d_1 * ... * d_k, and + then transform it to an output vector of dimension M. + + Inputs: + - x: A numpy array containing input data, of shape (N, d_1, ..., d_k) + - w: A numpy array of weights, of shape (D, M) + - b: A numpy array of biases, of shape (M,) + + Returns a tuple of: + - out: output, of shape (N, M) + - cache: (x, w, b) + """ + out = None + ############################################################################# + # TODO: Implement the affine forward pass. Store the result in out. You # + # will need to reshape the input into rows. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + cache = (x, w, b) + return out, cache + + +def affine_backward(dout, cache): + """ + Computes the backward pass for an affine layer. + + Inputs: + - dout: Upstream derivative, of shape (N, M) + - cache: Tuple of: + - x: Input data, of shape (N, d_1, ... d_k) + - w: Weights, of shape (D, M) + + Returns a tuple of: + - dx: Gradient with respect to x, of shape (N, d1, ..., d_k) + - dw: Gradient with respect to w, of shape (D, M) + - db: Gradient with respect to b, of shape (M,) + """ + x, w, b = cache + dx, dw, db = None, None, None + ############################################################################# + # TODO: Implement the affine backward pass. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + return dx, dw, db + + +def relu_forward(x): + """ + Computes the forward pass for a layer of rectified linear units (ReLUs). + + Input: + - x: Inputs, of any shape + + Returns a tuple of: + - out: Output, of the same shape as x + - cache: x + """ + out = None + ############################################################################# + # TODO: Implement the ReLU forward pass. 
# + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + cache = x + return out, cache + + +def relu_backward(dout, cache): + """ + Computes the backward pass for a layer of rectified linear units (ReLUs). + + Input: + - dout: Upstream derivatives, of any shape + - cache: Input x, of same shape as dout + + Returns: + - dx: Gradient with respect to x + """ + dx, x = None, cache + ############################################################################# + # TODO: Implement the ReLU backward pass. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + return dx + + +def batchnorm_forward(x, gamma, beta, bn_param): + """ + Forward pass for batch normalization. + + During training the sample mean and (uncorrected) sample variance are + computed from minibatch statistics and used to normalize the incoming data. + During training we also keep an exponentially decaying running mean of the mean + and variance of each feature, and these averages are used to normalize data + at test-time. + + At each timestep we update the running averages for mean and variance using + an exponential decay based on the momentum parameter: + + running_mean = momentum * running_mean + (1 - momentum) * sample_mean + running_var = momentum * running_var + (1 - momentum) * sample_var + + Note that the batch normalization paper suggests a different test-time + behavior: they compute sample mean and variance for each feature using a + large number of training images rather than using a running average. For + this implementation we have chosen to use running averages instead since + they do not require an additional estimation step; the torch7 implementation + of batch normalization also uses running averages. + + Input: + - x: Data of shape (N, D) + - gamma: Scale parameter of shape (D,) + - beta: Shift paremeter of shape (D,) + - bn_param: Dictionary with the following keys: + - mode: 'train' or 'test'; required + - eps: Constant for numeric stability + - momentum: Constant for running mean / variance. + - running_mean: Array of shape (D,) giving running mean of features + - running_var Array of shape (D,) giving running variance of features + + Returns a tuple of: + - out: of shape (N, D) + - cache: A tuple of values needed in the backward pass + """ + mode = bn_param['mode'] + eps = bn_param.get('eps', 1e-5) + momentum = bn_param.get('momentum', 0.9) + + N, D = x.shape + running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype)) + running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype)) + + out, cache = None, None + if mode == 'train': + ############################################################################# + # TODO: Implement the training-time forward pass for batch normalization. # + # Use minibatch statistics to compute the mean and variance, use these # + # statistics to normalize the incoming data, and scale and shift the # + # normalized data using gamma and beta. # + # # + # You should store the output in the variable out. Any intermediates that # + # you need for the backward pass should be stored in the cache variable. 
# + # # + # You should also use your computed sample mean and variance together with # + # the momentum variable to update the running mean and running variance, # + # storing your result in the running_mean and running_var variables. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + elif mode == 'test': + ############################################################################# + # TODO: Implement the test-time forward pass for batch normalization. Use # + # the running mean and variance to normalize the incoming data, then scale # + # and shift the normalized data using gamma and beta. Store the result in # + # the out variable. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + else: + raise ValueError('Invalid forward batchnorm mode "%s"' % mode) + + # Store the updated running means back into bn_param + bn_param['running_mean'] = running_mean + bn_param['running_var'] = running_var + + return out, cache + + +def batchnorm_backward(dout, cache): + """ + Backward pass for batch normalization. + + For this implementation, you should write out a computation graph for + batch normalization on paper and propagate gradients backward through + intermediate nodes. + + Inputs: + - dout: Upstream derivatives, of shape (N, D) + - cache: Variable of intermediates from batchnorm_forward. + + Returns a tuple of: + - dx: Gradient with respect to inputs x, of shape (N, D) + - dgamma: Gradient with respect to scale parameter gamma, of shape (D,) + - dbeta: Gradient with respect to shift parameter beta, of shape (D,) + """ + dx, dgamma, dbeta = None, None, None + ############################################################################# + # TODO: Implement the backward pass for batch normalization. Store the # + # results in the dx, dgamma, and dbeta variables. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + + return dx, dgamma, dbeta + + +def batchnorm_backward_alt(dout, cache): + """ + Alternative backward pass for batch normalization. + + For this implementation you should work out the derivatives for the batch + normalizaton backward pass on paper and simplify as much as possible. You + should be able to derive a simple expression for the backward pass. + + Note: This implementation should expect to receive the same cache variable + as batchnorm_backward, but might not use all of the values in the cache. + + Inputs / outputs: Same as batchnorm_backward + """ + dx, dgamma, dbeta = None, None, None + ############################################################################# + # TODO: Implement the backward pass for batch normalization. Store the # + # results in the dx, dgamma, and dbeta variables. # + # # + # After computing the gradient with respect to the centered inputs, you # + # should be able to compute gradients with respect to the inputs in a # + # single statement; our implementation fits on a single 80-character line. 
#
+  #############################################################################
+  pass
+  #############################################################################
+  #                             END OF YOUR CODE                              #
+  #############################################################################
+
+  return dx, dgamma, dbeta
+
+
+def dropout_forward(x, dropout_param):
+  """
+  Performs the forward pass for (inverted) dropout.
+
+  Inputs:
+  - x: Input data, of any shape
+  - dropout_param: A dictionary with the following keys:
+    - p: Dropout parameter. We drop each neuron output with probability p.
+    - mode: 'test' or 'train'. If the mode is train, then perform dropout;
+      if the mode is test, then just return the input.
+    - seed: Seed for the random number generator. Passing seed makes this
+      function deterministic, which is needed for gradient checking but not in
+      real networks.
+
+  Outputs:
+  - out: Array of the same shape as x.
+  - cache: A tuple (dropout_param, mask). In training mode, mask is the dropout
+    mask that was used to multiply the input; in test mode, mask is None.
+  """
+  p, mode = dropout_param['p'], dropout_param['mode']
+  if 'seed' in dropout_param:
+    np.random.seed(dropout_param['seed'])
+
+  mask = None
+  out = None
+
+  if mode == 'train':
+    ###########################################################################
+    # TODO: Implement the training phase forward pass for inverted dropout.   #
+    # Store the dropout mask in the mask variable.                            #
+    ###########################################################################
+    pass
+    ###########################################################################
+    #                            END OF YOUR CODE                             #
+    ###########################################################################
+  elif mode == 'test':
+    ###########################################################################
+    # TODO: Implement the test phase forward pass for inverted dropout.       #
+    ###########################################################################
+    pass
+    ###########################################################################
+    #                            END OF YOUR CODE                             #
+    ###########################################################################
+
+  cache = (dropout_param, mask)
+  out = out.astype(x.dtype, copy=False)
+
+  return out, cache
+
+
+def dropout_backward(dout, cache):
+  """
+  Perform the backward pass for (inverted) dropout.
+
+  Inputs:
+  - dout: Upstream derivatives, of any shape
+  - cache: (dropout_param, mask) from dropout_forward.
+  """
+  dropout_param, mask = cache
+  mode = dropout_param['mode']
+
+  dx = None
+  if mode == 'train':
+    ###########################################################################
+    # TODO: Implement the training phase backward pass for inverted dropout.  #
+    ###########################################################################
+    pass
+    ###########################################################################
+    #                            END OF YOUR CODE                             #
+    ###########################################################################
+  elif mode == 'test':
+    dx = dout
+  return dx
+
+
+def conv_forward_naive(x, w, b, conv_param):
+  """
+  A naive implementation of the forward pass for a convolutional layer.
+
+  The input consists of N data points, each with C channels, height H and width
+  W. We convolve each input with F different filters, where each filter spans
+  all C channels and has height HH and width WW.
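For reference, the simplified batchnorm backward pass that the TODO above
alludes to, together with the inverted-dropout TODOs, might be filled in along
these lines (a hedged sketch, assuming the (x_hat, gamma, std) cache layout
from the forward sketch; not the official solution):

    # batchnorm_backward_alt:
    x_hat, gamma, std = cache
    dbeta = dout.sum(axis=0)
    dgamma = (dout * x_hat).sum(axis=0)
    dx_hat = dout * gamma
    dx = (dx_hat - dx_hat.mean(axis=0)
          - x_hat * (dx_hat * x_hat).mean(axis=0)) / std

    # inverted dropout, train mode (in test mode out = x and dx = dout):
    mask = (np.random.rand(*x.shape) >= p) / (1 - p)  # scale at train time
    out = x * mask
    dx = dout * mask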
+ + Input: + - x: Input data of shape (N, C, H, W) + - w: Filter weights of shape (F, C, HH, WW) + - b: Biases, of shape (F,) + - conv_param: A dictionary with the following keys: + - 'stride': The number of pixels between adjacent receptive fields in the + horizontal and vertical directions. + - 'pad': The number of pixels that will be used to zero-pad the input. + + Returns a tuple of: + - out: Output data, of shape (N, F, H', W') where H' and W' are given by + H' = 1 + (H + 2 * pad - HH) / stride + W' = 1 + (W + 2 * pad - WW) / stride + - cache: (x, w, b, conv_param) + """ + out = None + ############################################################################# + # TODO: Implement the convolutional forward pass. # + # Hint: you can use the function np.pad for padding. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + cache = (x, w, b, conv_param) + return out, cache + + +def conv_backward_naive(dout, cache): + """ + A naive implementation of the backward pass for a convolutional layer. + + Inputs: + - dout: Upstream derivatives. + - cache: A tuple of (x, w, b, conv_param) as in conv_forward_naive + + Returns a tuple of: + - dx: Gradient with respect to x + - dw: Gradient with respect to w + - db: Gradient with respect to b + """ + dx, dw, db = None, None, None + ############################################################################# + # TODO: Implement the convolutional backward pass. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + return dx, dw, db + + +def max_pool_forward_naive(x, pool_param): + """ + A naive implementation of the forward pass for a max pooling layer. + + Inputs: + - x: Input data, of shape (N, C, H, W) + - pool_param: dictionary with the following keys: + - 'pool_height': The height of each pooling region + - 'pool_width': The width of each pooling region + - 'stride': The distance between adjacent pooling regions + + Returns a tuple of: + - out: Output data + - cache: (x, pool_param) + """ + out = None + ############################################################################# + # TODO: Implement the max pooling forward pass # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + cache = (x, pool_param) + return out, cache + + +def max_pool_backward_naive(dout, cache): + """ + A naive implementation of the backward pass for a max pooling layer. + + Inputs: + - dout: Upstream derivatives + - cache: A tuple of (x, pool_param) as in the forward pass. 
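A naive quadruple loop that matches the output shape formulas above (a sketch;
the variable names are assumptions, and integer division is relied on since
this file targets Python 2):

    stride, pad = conv_param['stride'], conv_param['pad']
    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    H_out = 1 + (H + 2 * pad - HH) / stride
    W_out = 1 + (W + 2 * pad - WW) / stride
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode='constant')
    out = np.zeros((N, F, H_out, W_out))
    for n in xrange(N):
      for f in xrange(F):
        for i in xrange(H_out):
          for j in xrange(W_out):
            window = xp[n, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
            out[n, f, i, j] = np.sum(window * w[f]) + b[f]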
+
+  Returns:
+  - dx: Gradient with respect to x
+  """
+  dx = None
+  #############################################################################
+  # TODO: Implement the max pooling backward pass                             #
+  #############################################################################
+  pass
+  #############################################################################
+  #                             END OF YOUR CODE                              #
+  #############################################################################
+  return dx
+
+
+def spatial_batchnorm_forward(x, gamma, beta, bn_param):
+  """
+  Computes the forward pass for spatial batch normalization.
+
+  Inputs:
+  - x: Input data of shape (N, C, H, W)
+  - gamma: Scale parameter, of shape (C,)
+  - beta: Shift parameter, of shape (C,)
+  - bn_param: Dictionary with the following keys:
+    - mode: 'train' or 'test'; required
+    - eps: Constant for numeric stability
+    - momentum: Constant for running mean / variance. momentum=0 means that
+      old information is discarded completely at every time step, while
+      momentum=1 means that new information is never incorporated. The
+      default of momentum=0.9 should work well in most situations.
+    - running_mean: Array of shape (C,) giving running mean of features
+    - running_var: Array of shape (C,) giving running variance of features
+
+  Returns a tuple of:
+  - out: Output data, of shape (N, C, H, W)
+  - cache: Values needed for the backward pass
+  """
+  out, cache = None, None
+
+  #############################################################################
+  # TODO: Implement the forward pass for spatial batch normalization.         #
+  #                                                                           #
+  # HINT: You can implement spatial batch normalization using the vanilla     #
+  # version of batch normalization defined above. Your implementation should  #
+  # be very short; ours is less than five lines.                              #
+  #############################################################################
+  pass
+  #############################################################################
+  #                             END OF YOUR CODE                              #
+  #############################################################################
+
+  return out, cache
+
+
+def spatial_batchnorm_backward(dout, cache):
+  """
+  Computes the backward pass for spatial batch normalization.
+
+  Inputs:
+  - dout: Upstream derivatives, of shape (N, C, H, W)
+  - cache: Values from the forward pass
+
+  Returns a tuple of:
+  - dx: Gradient with respect to inputs, of shape (N, C, H, W)
+  - dgamma: Gradient with respect to scale parameter, of shape (C,)
+  - dbeta: Gradient with respect to shift parameter, of shape (C,)
+  """
+  dx, dgamma, dbeta = None, None, None
+
+  #############################################################################
+  # TODO: Implement the backward pass for spatial batch normalization.        #
+  #                                                                           #
+  # HINT: You can implement spatial batch normalization using the vanilla     #
+  # version of batch normalization defined above. Your implementation should  #
+  # be very short; ours is less than five lines.                              #
+  #############################################################################
+  pass
+  #############################################################################
+  #                             END OF YOUR CODE                              #
+  #############################################################################
+
+  return dx, dgamma, dbeta
+
+
+def svm_loss(x, y):
+  """
+  Computes the loss and gradient for multiclass SVM classification.
+
+  Inputs:
+  - x: Input data, of shape (N, C) where x[i, j] is the score for the jth class
+    for the ith input.
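The max pooling TODOs and the "less than five lines" spatial batchnorm hint
above can be filled in along these lines (a hedged sketch under the docstring
shapes; not the official solution):

    # max pool forward:
    ph, pw = pool_param['pool_height'], pool_param['pool_width']
    stride = pool_param['stride']
    N, C, H, W = x.shape
    H_out, W_out = 1 + (H - ph) / stride, 1 + (W - pw) / stride
    out = np.zeros((N, C, H_out, W_out))
    for i in xrange(H_out):
      for j in xrange(W_out):
        window = x[:, :, i*stride:i*stride+ph, j*stride:j*stride+pw]
        out[:, :, i, j] = window.max(axis=(2, 3))

    # spatial batchnorm forward via the vanilla version (the reshape trick):
    N, C, H, W = x.shape
    x_flat = x.transpose(0, 2, 3, 1).reshape(-1, C)          # (N*H*W, C)
    out_flat, cache = batchnorm_forward(x_flat, gamma, beta, bn_param)
    out = out_flat.reshape(N, H, W, C).transpose(0, 3, 1, 2)

The backward pass mirrors the same reshape around batchnorm_backward, which is
why the hint promises a very short implementation.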
+ - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and + 0 <= y[i] < C + + Returns a tuple of: + - loss: Scalar giving the loss + - dx: Gradient of the loss with respect to x + """ + N = x.shape[0] + correct_class_scores = x[np.arange(N), y] + margins = np.maximum(0, x - correct_class_scores[:, np.newaxis] + 1.0) + margins[np.arange(N), y] = 0 + loss = np.sum(margins) / N + num_pos = np.sum(margins > 0, axis=1) + dx = np.zeros_like(x) + dx[margins > 0] = 1 + dx[np.arange(N), y] -= num_pos + dx /= N + return loss, dx + + +def softmax_loss(x, y): + """ + Computes the loss and gradient for softmax classification. + + Inputs: + - x: Input data, of shape (N, C) where x[i, j] is the score for the jth class + for the ith input. + - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and + 0 <= y[i] < C + + Returns a tuple of: + - loss: Scalar giving the loss + - dx: Gradient of the loss with respect to x + """ + probs = np.exp(x - np.max(x, axis=1, keepdims=True)) + probs /= np.sum(probs, axis=1, keepdims=True) + N = x.shape[0] + loss = -np.sum(np.log(probs[np.arange(N), y])) / N + dx = probs.copy() + dx[np.arange(N), y] -= 1 + dx /= N + return loss, dx diff --git a/assignments2016/assignment2/cs231n/optim.py b/assignments2016/assignment2/cs231n/optim.py new file mode 100644 index 00000000..ee84a73b --- /dev/null +++ b/assignments2016/assignment2/cs231n/optim.py @@ -0,0 +1,149 @@ +import numpy as np + +""" +This file implements various first-order update rules that are commonly used for +training neural networks. Each update rule accepts current weights and the +gradient of the loss with respect to those weights and produces the next set of +weights. Each update rule has the same interface: + +def update(w, dw, config=None): + +Inputs: + - w: A numpy array giving the current weights. + - dw: A numpy array of the same shape as w giving the gradient of the + loss with respect to w. + - config: A dictionary containing hyperparameter values such as learning rate, + momentum, etc. If the update rule requires caching values over many + iterations, then config will also hold these cached values. + +Returns: + - next_w: The next point after the update. + - config: The config dictionary to be passed to the next iteration of the + update rule. + +NOTE: For most update rules, the default learning rate will probably not perform +well; however the default values of the other hyperparameters should work well +for a variety of different problems. + +For efficiency, update rules may perform in-place updates, mutating w and +setting next_w equal to w. +""" + + +def sgd(w, dw, config=None): + """ + Performs vanilla stochastic gradient descent. + + config format: + - learning_rate: Scalar learning rate. + """ + if config is None: config = {} + config.setdefault('learning_rate', 1e-2) + + w -= config['learning_rate'] * dw + return w, config + + +def sgd_momentum(w, dw, config=None): + """ + Performs stochastic gradient descent with momentum. + + config format: + - learning_rate: Scalar learning rate. + - momentum: Scalar between 0 and 1 giving the momentum value. + Setting momentum = 0 reduces to sgd. + - velocity: A numpy array of the same shape as w and dw used to store a moving + average of the gradients. 
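Quick usage sketch for the two loss functions above (the shapes and values are
illustrative assumptions): with near-zero scores, every class has probability
about 1/C, so the softmax loss should land near log(C).

    x = 0.001 * np.random.randn(5, 4)   # 5 samples, 4 classes, tiny scores
    y = np.random.randint(4, size=5)
    loss, dx = softmax_loss(x, y)       # loss ~ -log(1/4) ~ 1.386
    assert dx.shape == x.shape          # gradient matches the score matrix

svm_loss is called the same way and returns the margin loss with its gradient.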
+ """ + if config is None: config = {} + config.setdefault('learning_rate', 1e-2) + config.setdefault('momentum', 0.9) + v = config.get('velocity', np.zeros_like(w)) + + next_w = None + ############################################################################# + # TODO: Implement the momentum update formula. Store the updated value in # + # the next_w variable. You should also use and update the velocity v. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + config['velocity'] = v + + return next_w, config + + + +def rmsprop(x, dx, config=None): + """ + Uses the RMSProp update rule, which uses a moving average of squared gradient + values to set adaptive per-parameter learning rates. + + config format: + - learning_rate: Scalar learning rate. + - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared + gradient cache. + - epsilon: Small scalar used for smoothing to avoid dividing by zero. + - cache: Moving average of second moments of gradients. + """ + if config is None: config = {} + config.setdefault('learning_rate', 1e-2) + config.setdefault('decay_rate', 0.99) + config.setdefault('epsilon', 1e-8) + config.setdefault('cache', np.zeros_like(x)) + + next_x = None + ############################################################################# + # TODO: Implement the RMSprop update formula, storing the next value of x # + # in the next_x variable. Don't forget to update cache value stored in # + # config['cache']. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + + return next_x, config + + +def adam(x, dx, config=None): + """ + Uses the Adam update rule, which incorporates moving averages of both the + gradient and its square and a bias correction term. + + config format: + - learning_rate: Scalar learning rate. + - beta1: Decay rate for moving average of first moment of gradient. + - beta2: Decay rate for moving average of second moment of gradient. + - epsilon: Small scalar used for smoothing to avoid dividing by zero. + - m: Moving average of gradient. + - v: Moving average of squared gradient. + - t: Iteration number. + """ + if config is None: config = {} + config.setdefault('learning_rate', 1e-3) + config.setdefault('beta1', 0.9) + config.setdefault('beta2', 0.999) + config.setdefault('epsilon', 1e-8) + config.setdefault('m', np.zeros_like(x)) + config.setdefault('v', np.zeros_like(x)) + config.setdefault('t', 0) + + next_x = None + ############################################################################# + # TODO: Implement the Adam update formula, storing the next value of x in # + # the next_x variable. Don't forget to update the m, v, and t variables # + # stored in config. 
#
+  #############################################################################
+  pass
+  #############################################################################
+  #                             END OF YOUR CODE                              #
+  #############################################################################
+
+  return next_x, config
+
+
+
+
diff --git a/assignments2016/assignment2/cs231n/setup.py b/assignments2016/assignment2/cs231n/setup.py
new file mode 100644
index 00000000..9a2e6ca0
--- /dev/null
+++ b/assignments2016/assignment2/cs231n/setup.py
@@ -0,0 +1,14 @@
+from distutils.core import setup
+from distutils.extension import Extension
+from Cython.Build import cythonize
+import numpy
+
+extensions = [
+  Extension('im2col_cython', ['im2col_cython.pyx'],
+            include_dirs = [numpy.get_include()]
+  ),
+]
+
+setup(
+    ext_modules = cythonize(extensions),
+)
diff --git a/assignments2016/assignment2/cs231n/solver.py b/assignments2016/assignment2/cs231n/solver.py
new file mode 100644
index 00000000..02f2726c
--- /dev/null
+++ b/assignments2016/assignment2/cs231n/solver.py
@@ -0,0 +1,266 @@
+import numpy as np
+
+from cs231n import optim
+
+
+class Solver(object):
+  """
+  A Solver encapsulates all the logic necessary for training classification
+  models. The Solver performs stochastic gradient descent using different
+  update rules defined in optim.py.
+
+  The solver accepts both training and validation data and labels so it can
+  periodically check classification accuracy on both training and validation
+  data to watch out for overfitting.
+
+  To train a model, you will first construct a Solver instance, passing the
+  model, dataset, and various options (learning rate, batch size, etc) to the
+  constructor. You will then call the train() method to run the optimization
+  procedure and train the model.
+
+  After the train() method returns, model.params will contain the parameters
+  that performed best on the validation set over the course of training.
+  In addition, the instance variable solver.loss_history will contain a list
+  of all losses encountered during training and the instance variables
+  solver.train_acc_history and solver.val_acc_history will be lists containing
+  the accuracies of the model on the training and validation set at each epoch.
+
+  Example usage might look something like this:
+
+  data = {
+    'X_train': # training data
+    'y_train': # training labels
+    'X_val': # validation data
+    'y_val': # validation labels
+  }
+  model = MyAwesomeModel(hidden_size=100, reg=10)
+  solver = Solver(model, data,
+                  update_rule='sgd',
+                  optim_config={
+                    'learning_rate': 1e-3,
+                  },
+                  lr_decay=0.95,
+                  num_epochs=10, batch_size=100,
+                  print_every=100)
+  solver.train()
+
+
+  A Solver works on a model object that must conform to the following API:
+
+  - model.params must be a dictionary mapping string parameter names to numpy
+    arrays containing parameter values.
+
+  - model.loss(X, y) must be a function that computes training-time loss and
+    gradients, and test-time classification scores, with the following inputs
+    and outputs:
+
+    Inputs:
+    - X: Array giving a minibatch of input data of shape (N, d_1, ..., d_k)
+    - y: Array of labels, of shape (N,) giving labels for X where y[i] is the
+      label for X[i].
+
+    Returns:
+    If y is None, run a test-time forward pass and return:
+    - scores: Array of shape (N, C) giving classification scores for X where
+      scores[i, c] gives the score of class c for X[i].
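A minimal model stub satisfying this API, shown only for illustration (the
class name is hypothetical, and it assumes softmax_loss from cs231n/layers.py
is in scope):

    class LinearModel(object):
      def __init__(self, D, C):
        self.params = {'W': 0.001 * np.random.randn(D, C)}

      def loss(self, X, y=None):
        Xf = X.reshape(X.shape[0], -1)           # flatten each input to (D,)
        scores = Xf.dot(self.params['W'])
        if y is None:
          return scores                          # test-time: scores only
        loss, dscores = softmax_loss(scores, y)  # training: loss and gradients
        return loss, {'W': Xf.T.dot(dscores)}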
+
+    If y is not None, run a training time forward and backward pass and return
+    a tuple of:
+    - loss: Scalar giving the loss
+    - grads: Dictionary with the same keys as self.params mapping parameter
+      names to gradients of the loss with respect to those parameters.
+  """
+
+  def __init__(self, model, data, **kwargs):
+    """
+    Construct a new Solver instance.
+
+    Required arguments:
+    - model: A model object conforming to the API described above
+    - data: A dictionary of training and validation data with the following:
+      'X_train': Array of shape (N_train, d_1, ..., d_k) giving training images
+      'X_val': Array of shape (N_val, d_1, ..., d_k) giving validation images
+      'y_train': Array of shape (N_train,) giving labels for training images
+      'y_val': Array of shape (N_val,) giving labels for validation images
+
+    Optional arguments:
+    - update_rule: A string giving the name of an update rule in optim.py.
+      Default is 'sgd'.
+    - optim_config: A dictionary containing hyperparameters that will be
+      passed to the chosen update rule. Each update rule requires different
+      hyperparameters (see optim.py) but all update rules require a
+      'learning_rate' parameter so that should always be present.
+    - lr_decay: A scalar for learning rate decay; after each epoch the learning
+      rate is multiplied by this value.
+    - batch_size: Size of minibatches used to compute loss and gradient during
+      training.
+    - num_epochs: The number of epochs to run for during training.
+    - print_every: Integer; training losses will be printed every print_every
+      iterations.
+    - verbose: Boolean; if set to false then no output will be printed during
+      training.
+    """
+    self.model = model
+    self.X_train = data['X_train']
+    self.y_train = data['y_train']
+    self.X_val = data['X_val']
+    self.y_val = data['y_val']
+
+    # Unpack keyword arguments
+    self.update_rule = kwargs.pop('update_rule', 'sgd')
+    self.optim_config = kwargs.pop('optim_config', {})
+    self.lr_decay = kwargs.pop('lr_decay', 1.0)
+    self.batch_size = kwargs.pop('batch_size', 100)
+    self.num_epochs = kwargs.pop('num_epochs', 10)
+
+    self.print_every = kwargs.pop('print_every', 10)
+    self.verbose = kwargs.pop('verbose', True)
+
+    # Throw an error if there are extra keyword arguments
+    if len(kwargs) > 0:
+      extra = ', '.join('"%s"' % k for k in kwargs.keys())
+      raise ValueError('Unrecognized arguments %s' % extra)
+
+    # Make sure the update rule exists, then replace the string
+    # name with the actual function
+    if not hasattr(optim, self.update_rule):
+      raise ValueError('Invalid update_rule "%s"' % self.update_rule)
+    self.update_rule = getattr(optim, self.update_rule)
+
+    self._reset()
+
+
+  def _reset(self):
+    """
+    Set up some book-keeping variables for optimization. Don't call this
+    manually.
+    """
+    # Set up some variables for book-keeping
+    self.epoch = 0
+    self.best_val_acc = 0
+    self.best_params = {}
+    self.loss_history = []
+    self.train_acc_history = []
+    self.val_acc_history = []
+
+    # Make a fresh copy of the optim_config for each parameter (note that this
+    # is a shallow copy: each parameter gets its own dict, but the values in
+    # it are shared until an update rule overwrites them)
+    self.optim_configs = {}
+    for p in self.model.params:
+      d = {k: v for k, v in self.optim_config.iteritems()}
+      self.optim_configs[p] = d
+
+
+  def _step(self):
+    """
+    Make a single gradient update. This is called by train() and should not
+    be called manually.
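Usage sketch tying the constructor above to optim.py (the values are
hypothetical): because __init__ resolves update_rule with getattr(optim, ...),
any rule defined in optim.py can be selected by its string name.

    solver = Solver(model, data,
                    update_rule='adam',
                    optim_config={'learning_rate': 1e-3},
                    num_epochs=5, batch_size=100)
    solver.train()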
+ """ + # Make a minibatch of training data + num_train = self.X_train.shape[0] + batch_mask = np.random.choice(num_train, self.batch_size) + X_batch = self.X_train[batch_mask] + y_batch = self.y_train[batch_mask] + + # Compute loss and gradient + loss, grads = self.model.loss(X_batch, y_batch) + self.loss_history.append(loss) + + # Perform a parameter update + for p, w in self.model.params.iteritems(): + dw = grads[p] + config = self.optim_configs[p] + next_w, next_config = self.update_rule(w, dw, config) + self.model.params[p] = next_w + self.optim_configs[p] = next_config + + + def check_accuracy(self, X, y, num_samples=None, batch_size=100): + """ + Check accuracy of the model on the provided data. + + Inputs: + - X: Array of data, of shape (N, d_1, ..., d_k) + - y: Array of labels, of shape (N,) + - num_samples: If not None, subsample the data and only test the model + on num_samples datapoints. + - batch_size: Split X and y into batches of this size to avoid using too + much memory. + + Returns: + - acc: Scalar giving the fraction of instances that were correctly + classified by the model. + """ + + # Maybe subsample the data + N = X.shape[0] + if num_samples is not None and N > num_samples: + mask = np.random.choice(N, num_samples) + N = num_samples + X = X[mask] + y = y[mask] + + # Compute predictions in batches + num_batches = N / batch_size + if N % batch_size != 0: + num_batches += 1 + y_pred = [] + for i in xrange(num_batches): + start = i * batch_size + end = (i + 1) * batch_size + scores = self.model.loss(X[start:end]) + y_pred.append(np.argmax(scores, axis=1)) + y_pred = np.hstack(y_pred) + acc = np.mean(y_pred == y) + + return acc + + + def train(self): + """ + Run optimization to train the model. + """ + num_train = self.X_train.shape[0] + iterations_per_epoch = max(num_train / self.batch_size, 1) + num_iterations = self.num_epochs * iterations_per_epoch + + for t in xrange(num_iterations): + self._step() + + # Maybe print training loss + if self.verbose and t % self.print_every == 0: + print '(Iteration %d / %d) loss: %f' % ( + t + 1, num_iterations, self.loss_history[-1]) + + # At the end of every epoch, increment the epoch counter and decay the + # learning rate. + epoch_end = (t + 1) % iterations_per_epoch == 0 + if epoch_end: + self.epoch += 1 + for k in self.optim_configs: + self.optim_configs[k]['learning_rate'] *= self.lr_decay + + # Check train and val accuracy on the first iteration, the last + # iteration, and at the end of each epoch. 
+      first_it = (t == 0)
+      last_it = (t == num_iterations - 1)
+      if first_it or last_it or epoch_end:
+        train_acc = self.check_accuracy(self.X_train, self.y_train,
+                                        num_samples=1000)
+        val_acc = self.check_accuracy(self.X_val, self.y_val)
+        self.train_acc_history.append(train_acc)
+        self.val_acc_history.append(val_acc)
+
+        if self.verbose:
+          print '(Epoch %d / %d) train acc: %f; val_acc: %f' % (
+                 self.epoch, self.num_epochs, train_acc, val_acc)
+
+        # Keep track of the best model
+        if val_acc > self.best_val_acc:
+          self.best_val_acc = val_acc
+          self.best_params = {}
+          for k, v in self.model.params.iteritems():
+            self.best_params[k] = v.copy()
+
+    # At the end of training swap the best params into the model
+    self.model.params = self.best_params
+
diff --git a/assignments2016/assignment2/cs231n/vis_utils.py b/assignments2016/assignment2/cs231n/vis_utils.py
new file mode 100644
index 00000000..8d04473f
--- /dev/null
+++ b/assignments2016/assignment2/cs231n/vis_utils.py
@@ -0,0 +1,73 @@
+from math import sqrt, ceil
+import numpy as np
+
+def visualize_grid(Xs, ubound=255.0, padding=1):
+  """
+  Reshape a 4D tensor of image data to a grid for easy visualization.
+
+  Inputs:
+  - Xs: Data of shape (N, H, W, C)
+  - ubound: Output grid will have values scaled to the range [0, ubound]
+  - padding: The number of blank pixels between elements of the grid
+  """
+  (N, H, W, C) = Xs.shape
+  grid_size = int(ceil(sqrt(N)))
+  grid_height = H * grid_size + padding * (grid_size - 1)
+  grid_width = W * grid_size + padding * (grid_size - 1)
+  grid = np.zeros((grid_height, grid_width, C))
+  next_idx = 0
+  y0, y1 = 0, H
+  for y in xrange(grid_size):
+    x0, x1 = 0, W
+    for x in xrange(grid_size):
+      if next_idx < N:
+        img = Xs[next_idx]
+        low, high = np.min(img), np.max(img)
+        grid[y0:y1, x0:x1] = ubound * (img - low) / (high - low)
+        # grid[y0:y1, x0:x1] = Xs[next_idx]
+        next_idx += 1
+      x0 += W + padding
+      x1 += W + padding
+    y0 += H + padding
+    y1 += H + padding
+  # grid_max = np.max(grid)
+  # grid_min = np.min(grid)
+  # grid = ubound * (grid - grid_min) / (grid_max - grid_min)
+  return grid
+
+def vis_grid(Xs):
+  """ visualize a grid of images """
+  (N, H, W, C) = Xs.shape
+  A = int(ceil(sqrt(N)))
+  G = np.ones((A*H+A, A*W+A, C), Xs.dtype)
+  G *= np.min(Xs)
+  n = 0
+  for y in range(A):
+    for x in range(A):
+      if n < N:
+        G[y*H+y:(y+1)*H+y, x*W+x:(x+1)*W+x, :] = Xs[n,:,:,:]
+        n += 1
+  # normalize to [0,1]
+  maxg = G.max()
+  ming = G.min()
+  G = (G - ming)/(maxg-ming)
+  return G
+
+def vis_nn(rows):
+  """ visualize array of arrays of images """
+  N = len(rows)
+  D = len(rows[0])
+  H,W,C = rows[0][0].shape
+  Xs = rows[0][0]
+  G = np.ones((N*H+N, D*W+D, C), Xs.dtype)
+  for y in range(N):
+    for x in range(D):
+      G[y*H+y:(y+1)*H+y, x*W+x:(x+1)*W+x, :] = rows[y][x]
+  # normalize to [0,1]
+  maxg = G.max()
+  ming = G.min()
+  G = (G - ming)/(maxg-ming)
+  return G
+
+
+
diff --git a/assignments2016/assignment2/frameworkpython b/assignments2016/assignment2/frameworkpython
new file mode 100755
index 00000000..a0fa5517
--- /dev/null
+++ b/assignments2016/assignment2/frameworkpython
@@ -0,0 +1,13 @@
+#!/bin/bash
+
+# what real Python executable to use
+PYVER=2.7
+PATHTOPYTHON=/usr/local/bin/
+PYTHON=${PATHTOPYTHON}python${PYVER}
+
+# find the root of the virtualenv, it should be the parent of the dir this script is in
+ENV=`$PYTHON -c "import os; print os.path.abspath(os.path.join(os.path.dirname(\"$0\"), '..'))"`
+
+# now run Python with the virtualenv set as Python's HOME
+export PYTHONHOME=$ENV
+exec $PYTHON "$@"
diff --git a/assignments2016/assignment2/kitten.jpg b/assignments2016/assignment2/kitten.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..e421ec1d98edef310d54970be25942ed4622fe60
GIT binary patch
literal 21355
[binary image data for kitten.jpg omitted]

literal 0
HcmV?d00001

diff --git a/assignments2016/assignment2/puppy.jpg b/assignments2016/assignment2/puppy.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..3cc1234743a65998dffc7c7d799abcd57e938d08
GIT binary patch
literal 38392
[binary image data for puppy.jpg omitted]
z3Iv;-0fgQ)A`#DJO(v;LHfBMuj}&PNZnt#(w8igBL}|9lo$S+r1+jJ5@-sckV3 zQ*He5n|luXU?>^2v=x)I@+y!O2x5;-Nk9q zp$`|X&rsonJ4Dk)SwxJp1KwU$8xrX8$@GlL4{w`zxPPz>&sn-Xym6;I9SU0fz$1mv z8QJe#eKc&uoc=kxTXlR&064gm!t2|vw=8K`zni4&n*noaeRqip@Xf8#p6P9Snb1Pk z&(hF4CeqORM+P@fRqxq54z5${7%=6!M&^zo&VmU~ktsdnkilA)2c1$rH))vRZOGg2 zvpUb@-jk7-pPfBXfaOUsy()JQcI*Z?X+T$&;&Uz>s9)aFq_~ng;BiM@RoQu_YXG?!B^bX6FTBZi08tl=T@td}9&L=U`!>EEWpmg+zYZMl9FGd;nck3vIq< zRF~BED@LA(Dn|zZB77 zWN=PSEKG6~_sjomlmBOy!a!X9XO_ae_)cLgu>L(^I8UEW52^k`Z17rY(IOU$n*L}S zW{I=ZRH>z9_EEbyvO`cRW10?LC;SDU3%=|sFK$gF02^9J>O zOat<3N);Hu(2_JpkyI5>vtZ@}j*rvErxe3=E>iJ60qWjJVGlKH4L5Bp5pE$iQOVQ3R&b{z4K=g&N`sl)h#a^DODX@R7KC-e`3SG5t_W|=@0jMillt1{WP^%lRq);U|KeQ zkWm%$V4kQDTKbZe4t}hga4p_l(kwAeh})s&ps(@LGIGUt_Y9Tl6xYMj0=0hXBaFd> z1sbKWja&^gcVeWw=v-4`{y`%N@Wd_M)9W;Hfyqo=8FTj~2vdmQFx9EWsJ&Ipc(2%K z;Y~W4m4seB z0Y23)uoOv?IA(lozWaRj?tSLOq${+)HCJv^2k;tLn@)K2w2FTH-G=JS5^Gqm*YyPp zcXA2|<+5oPkN>l*dT9E~x5DV#&1#_#=I-z`!9h-3H0s@7*X_rZGH4OzybezbWS|MI z3r?v4a{t4-6(Qw{K`9o`{j|3mbJdd^>`5|3U2R{h2QJtr@`{lPLr6ouir;urG)Q5V zj6?Gh`xWNh7Hy+3FQ97%GamSymwZpSTbf$HM8)uEOvk)Bg{Ia@(1pk_4ikZB(Akz6 z%Z`^0D3;P&O=)TPKr)mxE05t(5-}ZWXbAzjD+OSfP&o?cg^)gtx*TWO+`Jaw3z@W> zIcEe?4llKvZcg{V=n5Y(F0#=7*lr_a&z~-0WdHy6-S%1JlCE4-nR7^2ckReumX=%> zWkyX4#UMEa<$f5owHo8@G?m3UT%ydvS>WS&BGYVL?;Q2N!!peR(^jHnIS-QJ`%BgK zFNTGMKvkkLT*IHr%iz}9s?1@X;!jge#{0CPi71uhr#_5VRl4L(3R8jhfa_9Tk`1vz z^EW_3@l8c-OqTW&IUlOyJ z^}g~A)c8GD>njy4{fZM_BPjE>ZrRGt?q-E_+21e#Y|BKR6@vd5*qr=P?5_IQOkAy7 zi%i2Ib)os!e&&6_?RBl^Pi=d3=G3;ULH}iuv-kFvSf|SuHMEg*2{UWME#4RWQoHL* zmk_Zg@m|3X8H=ev)pv`2QE6N7W6$Qr^LOY0?`wRN<7D|5ivV+OdBL#o_r69q1Gr(- zubff09SZ|KS|^9+I`-=%OWiP9=E!+PP?z-Wu060c4M5#sWhNaXLt#}JQ7*J`K- zM}nFO89=VzpK|ues#@NG?mVLMYCVbv6kJ0h+ka@X)#PH=@?={4vRcDkduXK?JUX(q zgi5dA%Nmd)Bj(4%jkLM1x#sm$1Xcq5wHt|h(8neYO|)UafD_2!d)mtj0i$9JR>1f@ zWd506o9AOn%nF;%?DT<7Tl6)pAi{Qm%slpu1~1=n8;QQ8an4ux8FqyBga^Bn;R~tS zeVeP-3ox2;W)`(inU6=p9rdCSXP45D6Th#B$o%ZjGCvE;%)@-*Ba zKWa-Psah_40#mJYL*Ctaq`qb3y88#nYtii-7-Fa?2g_`HQE&V@oJ*5()b$VWN%|;v zoI8bkOw1h{x>}#zuoeUFJTY%bQ>{<}v0gOqN-bv|fqic`7B5mio0;A@hynfrOiH+# z^_n|b$IipK%eki@`*mwo(lr%KTAvd$dYtixNZ ztthFd0+v9o4k^sh5V3&)!QisAIdYIc8C-Di6@as(U{>xiW&*k?*l>ZS|DBrhAF#XC4+Dw`wFux>zcF!Rcjlf*bsbj8=%fFBmIH;Q^ z5Ud6E8#Ky!M$Gk*R7)(^kZl1mf9_?R^Gpmwk#%e*$#YH6Co#XdDJgYWi|oK}y+zji z;=t(R;J_*sMZ$TJkK$rgYDA6iZL3{^NE{4;3vclZbTVE^(S#T1`~+yp7;~lG6sZAT z!u0u7#MaK?HMRi@jh~=BLcQD-A@*7>tU}|Vd|DO!siTGVw4eQ71{q@SCCA`G4-Hr= zb6mu0=rlB2vEbzC%Ipo0*=YUojNyGVynVwk3UB>kuYoIB?S<5RE~r|+=u5ocEbvMR zLN8g~wA}zWpG}g45%|sG=OfAFH97-mqOM-D%F&7d=5%pdeatuFt7SUAR8r^%%tpXm zl*Av}r(j4jVr;eXzR+L&tFsuEn@n)`dBq=hPgOVqy zcz1=~L#E!w#VsGlSjJRF?k0Oswrg(s&MD?}Mh`Q0(4ltH1pIc?>nDGTY>3_T*2Cs3 z(C#zQLs!?AFEO^Nn|vsX-p^B%|(O>#Apy zZR;moP~t>v`2x<2uk;9t)-|em#LSxlL6DbOmwQjTt5NsZQk|YK1KO6{myZ;o^R(%g z^56NKG;<~C5u~25Gi8+i*d+cApzS@s+`3@mLGban*_F^4q?`9mw;z4Q8jf^VfycOA z;C8q$MSUYkj!sm8$ZdVdVwA8vAk9O+TN-TrS`d%#@hWs0yR&G@<<%JCbyLa5@F{sm zMoxMnc_`MmCcwuL`Y#f*cY>(-4Eun-gSZ{;SqjaN3u@06EhOu_idGfB+(i3Hz@hQP zD>Gj5Dumh*eqFcb(qFB+3P$b^?SZ>>!Y*JF+;i`F`l>gDbruEjzxLy#6J^p>Kocfp z3LpA%>dJ`Bw|yu)RHny5-~R)|{4q3O5oE3Esn_Ly*fvwko^Lj@u}t9K&Q4NH!zv{LvrzmTW?L!Ob!PyB)u}2wtC-&EAK(s$WVY+ThQ8yV@)xo`g1Yv zqreShU~nWpGGw-u?x5I&?GxIzQiAcjH7akkN6a{-N}aP(2z;qz>xs%C5V1nt!nr-Rux&tR@nSm*cH9T7WyrzO4?!)W5whsF*qRR{?f{ ztI+%@$R6WWWQSQ_J{l0p5-Pcy^DE%RxD+eP!lXZJC1TPw|7r87TQUFZUHg9fEJ+kY zpnT~MFxQM&4LgkErz*^zoc{Z%I+#QU#{IH_4T#0MFnnxOrnNY2Z5-E483=Ki+DdIU z3W+yiVCtyus6yV1ee~&+KJo))5OtdpG*^@tR$)!s4O+B%=c#ZGu9VesCpmf%nYig8 zQq7%kh>47|L#LXhrBF1U0Cj3$f{|j1&G(rq>jJl7VN`=50kHc$+g%3Q;v2dm|?e&ExD@^y?{bi5Sh=&hbDcIf-3$M&fm@j8P*z+a=qd#N+T 
zU&=+Yr_pw0V;Bll>wTlcy^Z3!-xQ=|DUqwH#K|F+9uE4^^N&9xNeMZ-l&7;*Rmy4%k^B2T(S6-Dwvr%`2yyn-UEQb9 zS6%M%2Jr(7pBGC+y53Mt;rM!qGo+0&+dYjaHSyQHqU@enRBaM?y~QO&MxsfWfOV?r zD5Fjnawb8dx9(hf^enj~E+Pki1f$Vu<$1KPAqqFUE|Aez5{F~Itj!6}u`OR?MQ^O> zcx#G`5_DW4_UbC6y$ys$KM739Cx!(lUe+8xsFLgtZEN`8dFTJCLBFE(v&cpSmYgIz zTGogs_ZnJ@a)k=&S*}+rc8Hq2sP*@)seB_hUM(Q%Bl!@ltQS-?p^#K^>wC{NSRIQ_ zPsA>QW>?&P2wBVa$&mV8)u3*4-9Qioqp2?f-K6rCfALn7Lj(d2((GraIW2L}2!lX} zXsXJcumt|()tXp5acf|UeFsN(T^9AAgY{bG%CaHPh)vp=nM95SwA~FR0A3lZA&;MO z{RK0CB9Trvw}y1i-4ub7SY`9^>@dd_aLGM}uMT(*=9 zBy0}YH=DN*Hv3ak9up&x)W@YRDN6aTGXLL&ofyG*PVj%Hd364OZ~wMx-tV#;{ReP4 z3da7_ZjhryNYBpFTyZ`b6GoY@=a)1JG==A`U?I!7+qVdI2DDyPE-Mi4k?|iCu9C2o z+M3Q2IMfO6Xcz~P&f@735*WGDRhIPKLiK^E_d-HOqdRSA+z?@zfpo#nUsZjw?lkq) zL;Qc>Wc32hYHa&rT*o_O$j}*Gv^{W6i=$^Pzpt#n-AiVBWRS&UNN!S? zK8Cv}2Xp*n3o{hU$lRFfe>x~R-aE=7eHa7ru_zUD43pbiB8wW=?!#AllSbkLF~U|p zKEGlwRb;NvBu#iq+7$L&}3kYIzoWfq-cph|EZA9956ew)_^>&)bE;3k<7& zf|W{^k6!NQsotr$8AMdY@r=%y%%9=e?HzJ5WrO9BQHbFY=x!M0d-^6eYXmD1a7*&$ zVzsH3M|(YLm;N|lp_6F<{dwwDZrA<>%|){>k64sVkmOLpdlQw3I^|++mF&IpQLzX==Unz#E+_&_u1#<+r^xvg9Nl=VY0QyQ| z<=0{*do%bA??^CWobi7hZ2q?xr}s-11AF@N|Dm5OFa9KCzT>#)P*kR_lc&SdP;tEn z$3KQ@vF_yPP;7GYf#P}x<#aEMS?yn-Og7psteu%^%S&T4W9!7|kl^=M!nOjHQxLX~ z-doOH8zhPr03%B>&bw$qxP3$rROft3#^H}~#!Pl%e7I-2Bb$oOp)TjsS^3K1wEXms z6i6Do*5y)_u^Mr)>PG?6M_q7yF`>t`$IUt9dYFGc5wB!WkX-c3K?4X(=;N@H$NkE> zMj|uTgC7~DfGxKiP~1-GI+hs(oyE&8oUQ$dQ(7Z z#rI~C**|8vlK6;cB;hJPXlkY)EX~x|QDH=~Yj}x6Y92mR>wU^v+@x^GAzKL{nT*`D zotgPQx%##}FosZt(|rc`BSh}5#fHHFy@X-|pbKhVX=JU>8iGx`{2I5X$Hd?Ks&E4S z%Hsc)2z$R*Fy%EYb|nW|+j*4{Kox;N>BIN& zTaioESq$$0L6;?QqxtkvHciGXneF6}q_}A5%0rS>)Y_X(0AE!Fp*qziy_^0N)44#- zyP}r&WXxI#t)INAw!F=+&m0J$S}h9oVWBZBSJe$gMQCxnjucRz36Di?fzkJ6AVJOp z1i^}h@HQz-H>g1EOqesUNVc126IQ1PaG0fsQPMRPB`{KI6|xRLx8+B`6EvdbKvh5e z)cR6fFDMeT^HJLSCng<>#Y(mhLCcI)T7+{cc!Cxk&7>eIG(-Btx|DB+TjfAAIkAyn=O2==!09!xU#|LaZR{{a8foVH~#d~0MV={#lzg|5OicT$DF zUy7*)&E4(7ode6A!?CW*$iqiV*_V(i5Vn&K5uHqaQ*sA|gJ%3>0hJC-BYVzW=B+u2 z_9=RCAO*`a{nBcG%7TnLJTJ~(k`R6a8o%|~iPHVY(^B)M&!zc*$5lVJoC|$1zC&Oq zxld=h-Y{EcBj27;dh6&Y6$Lt$qlY4gSD0D29q_JSXgM`t!~p)zPSmzS=1MjrJD{S8 zV?3FDw5O-)C(bSbW(OB--32e_@C@dtw=6KS!v{RS5`O+zoL5a6IIrB9!omMOL{7?E zttgdi05L^S3H`ZR+bK|??=w8#!YUUv<(;p+uxEAg{6xaohl;SA9i_}>x+`>;`J(T# zA}5>;+Czz{hParIDO#Ap6RUh|NOjt?uLvJ~IQfdK)+#DQ?+D`cR4t*e7_s&G*qGV! 
diff --git a/assignments2016/assignment2/requirements.txt b/assignments2016/assignment2/requirements.txt
new file mode 100644
index 00000000..3e6c302d
--- /dev/null
+++ b/assignments2016/assignment2/requirements.txt
@@ -0,0 +1,46 @@
+Cython==0.23.4
+Jinja2==2.8
+MarkupSafe==0.23
+Pillow==3.0.0
+Pygments==2.0.2
+appnope==0.1.0
+argparse==1.2.1
+backports-abc==0.4
+backports.ssl-match-hostname==3.5.0.1
+certifi==2015.11.20.1
+cycler==0.9.0
+decorator==4.0.6
+functools32==3.2.3-2
+gnureadline==6.3.3
+ipykernel==4.2.2
+ipython==4.0.1
+ipython-genutils==0.1.0
+ipywidgets==4.1.1
+jsonschema==2.5.1
+jupyter==1.0.0
+jupyter-client==4.1.1
+jupyter-console==4.0.3
+jupyter-core==4.0.6
+matplotlib==1.5.0
+mistune==0.7.1
+nbconvert==4.1.0
+nbformat==4.0.1
+notebook==4.0.6
+numpy==1.10.4
+path.py==8.1.2
+pexpect==4.0.1
+pickleshare==0.5
+ptyprocess==0.5
+pyparsing==2.0.7
+python-dateutil==2.4.2
+pytz==2015.7
+pyzmq==15.1.0
+qtconsole==4.1.1
+scipy==0.16.1
+simplegeneric==0.8.1
+singledispatch==3.4.0.3
+six==1.10.0
+terminado==0.5
+tornado==4.3
+traitlets==4.0.0
+wsgiref==0.1.2
diff --git a/assignments2016/assignment2/start_ipython_osx.sh b/assignments2016/assignment2/start_ipython_osx.sh
new file mode 100755
index 00000000..4815b001
--- /dev/null
+++ b/assignments2016/assignment2/start_ipython_osx.sh
@@ -0,0 +1,4 @@
+# Assume the virtualenv is called .env
+
+cp frameworkpython .env/bin
+.env/bin/frameworkpython -m IPython notebook
diff --git a/assignments2016/assignment3.md b/assignments2016/assignment3.md
index b35b748c..2ce5ddd7 100644
--- a/assignments2016/assignment3.md
+++ b/assignments2016/assignment3.md
@@ -4,51 +4,34 @@ mathjax: true
 permalink: assignments2016/assignment3/
 ---
-In this assignment you will implement recurrent networks, and apply them to image captioning on Microsoft COCO. We will also introduce the TinyImageNet dataset, and use a pretrained model on this dataset to explore different applications of image gradients.
+In this assignment you will implement recurrent neural networks (RNNs) and apply them to image captioning on the Microsoft COCO dataset. We will also introduce the TinyImageNet dataset and use a model pretrained on this dataset to explore various applications of image gradients.
-The goals of this assignment are as follows:
+The goals of this assignment are as follows:
-- Understand the architecture of *recurrent neural networks (RNNs)* and how they operate on sequences by sharing weights over time
-- Understand the difference between vanilla RNNs and Long-Short Term Memory (LSTM) RNNs
-- Understand how to sample from an RNN at test-time
-- Understand how to combine convolutional neural nets and recurrent nets to implement an image captioning system
-- Understand how a trained convolutional network can be used to compute gradients with respect to the input image
-- Implement and different applications of image gradients, including saliency maps, fooling images, class visualizations, feature inversion, and DeepDream.
+- Understand the architecture of *recurrent neural networks (RNNs)* and how they operate on sequence data by sharing parameters over time
+- Understand the difference between the vanilla RNN architecture and the Long-Short Term Memory (LSTM) RNN architecture
+- Understand how to sample from an RNN at test time
+- Understand how to combine convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to build an image captioning system
+- Understand how a trained CNN can be used to compute gradients with respect to the input image
+- Implement several applications of image gradients, including saliency maps, fooling images, class visualization, feature inversion, and DeepDream
-## Setup
-You can work on the assignment in one of two ways: locally on your own machine,
-or on a virtual machine through Terminal.com.
+## Setup
+You can start the assignment in one of two ways: in a virtual environment on Terminal.com, or locally on your own machine.
-### Working in the cloud on Terminal
+### Working in a virtual environment on Terminal
+Terminal has created a separate subdomain to serve our class; register an account at [www.stanfordterminalcloud.com](https://www.stanfordterminalcloud.com). The snapshot for this assignment can be found [here](https://www.stanfordterminalcloud.com/snapshot/49f5a1ea15dc424aec19155b3398784d57c55045435315ce4f8b96b62819ef65). If you are registered in the class, you can ask a TA (see Piazza for more information) for Terminal credits to use on this class. The first time you boot up the snapshot, everything needed for the class is already installed, so you can start on the assignment right away. We have written a short tutorial on Terminal [here](/terminal-tutorial).
-Terminal has created a separate subdomain to serve our class,
-[www.stanfordterminalcloud.com](https://www.stanfordterminalcloud.com). Register
-your account there. The Assignment 3 snapshot can then be found [HERE](https://www.stanfordterminalcloud.com/snapshot/29054ca27bc2e8bda888709ba3d9dd07a172cbbf0824152aac49b14a018ffbe5).
-If you are registered in the class you can contact the TA (see Piazza for more
-information) to request Terminal credits for use on the assignment. Once you
-boot up the snapshot everything will be installed for you, and you will be ready to start on your assignment right away. We have written a small tutorial on Terminal [here](/terminal-tutorial).
-
-### Working locally
-Get the code as a zip file
-[here](http://cs231n.stanford.edu/winter1516_assignment3.zip).
-As for the dependencies:
+### Working locally
+Download the starter code as a zip file [here](http://cs231n.stanford.edu/winter1516_assignment3.zip).
+Regarding the dependencies:
 **[Option 1] Use Anaconda:**
-The preferred approach for installing all the assignment dependencies is to use
-[Anaconda](https://www.continuum.io/downloads), which is a Python distribution
-that includes many of the most popular Python packages for science, math,
-engineering and data analysis. Once you install it you can skip all mentions of
-requirements and you are ready to go directly to working on the assignment.
-
-**[Option 2] Manual install, virtual environment:**
-If you do not want to use Anaconda and want to go with a more manual and risky
-installation route you will likely want to create a
-[virtual environment](http://docs.python-guide.org/en/latest/dev/virtualenvs/)
-for the project. If you choose not to use a virtual environment, it is up to you
-to make sure that all dependencies for the code are installed globally on your
-machine. To set up a virtual environment, run the following:
+A common way to install all the dependencies is [Anaconda](https://www.continuum.io/downloads), a Python distribution that bundles most of the major packages for science, math, engineering, and data analysis. Once it is installed you can skip all of the dependency steps below and start working on the assignment right away.
-~~~bash
+**[Option 2] Manual install, virtual environment:**
+If you would rather take the more manual and finicky route instead of Anaconda, you can create a [virtual environment](http://docs.python-guide.org/en/latest/dev/virtualenvs/) for this assignment. If you choose not to use a virtual environment, all of the code's dependencies will be installed globally on your machine. See below for how to set up a virtual environment.
+
+~~~bash
 cd assignment3
 sudo pip install virtualenv      # This may already be installed
 virtualenv .env                  # Create a virtual environment
@@ -58,8 +41,8 @@ pip install -r requirements.txt  # Install dependencies
 deactivate                       # Exit the virtual environment
 ~~~
 
-**Download data:**
-Once you have the starter code, you will need to download the processed MS-COCO dataset, the TinyImageNet dataset, and the pretrained TinyImageNet model. Run the following from the `assignment3` directory:
+**Download the data:**
+Once you have the starter code, you will need to download the preprocessed MS-COCO dataset, the TinyImageNet dataset, and the pretrained TinyImageNet model. Run the following commands from the `assignment3` directory:
 
 ~~~bash
 cd cs231n/datasets
@@ -68,59 +51,33 @@ cd cs231n/datasets
 ./get_pretrained_model.sh
 ~~~
 
-**Compile the Cython extension:** Convolutional Neural Networks require a very
-efficient implementation. We have implemented of the functionality using
-[Cython](http://cython.org/); you will need to compile the Cython extension
-before you can run the code. From the `cs231n` directory, run the following
-command:
+**Compile the Cython extension:** Convolutional neural networks require a very efficient implementation. The necessary functionality has been implemented using [Cython](http://cython.org/), so you need to compile the Cython extension before running the code. Run the following command from the `cs231n` directory:
 
 ~~~bash
 python setup.py build_ext --inplace
 ~~~
 
-**Start IPython:**
-After you have the data, you should start the IPython notebook server
-from the `assignment3` directory. If you are unfamiliar with IPython, you should
-read our [IPython tutorial](/ipython-tutorial).
-
-**NOTE:** If you are working in a virtual environment on OSX, you may encounter
-errors with matplotlib due to the
-[issues described here](http://matplotlib.org/faq/virtualenv_faq.html).
-You can work around this issue by starting the IPython server using the
-`start_ipython_osx.sh` script from the `assignment3` directory; the script
-assumes that your virtual environment is named `.env`.
+**Start IPython:**
+Once you have downloaded the data, you should start the IPython notebook server from the `assignment3` directory. If you are not familiar with IPython, we recommend reading our [IPython tutorial](/ipython-tutorial) first.
+**NOTE:** If you run a virtual environment on OSX, you may get matplotlib errors ([see the issues described here](http://matplotlib.org/faq/virtualenv_faq.html)). You can work around this by starting the IPython server with the `start_ipython_osx.sh` script in the `assignment3` directory; the script assumes that your virtual environment is named `.env`.
 
-### Submitting your work:
-Whether you work on the assignment locally or using Terminal, once you are done
-working run the `collectSubmission.sh` script; this will produce a file called
-`assignment3.zip`. Upload this file under the Assignments tab on
-[the coursework](https://coursework.stanford.edu/portal/site/W15-CS-231N-01/)
-page for the course.
+### Submitting your work:
+Whether you worked locally or on Terminal, run the `collectSubmission.sh` script once you are done; it produces a file called `assignment3.zip`. Upload this file under the Assignments tab on [the coursework](https://coursework.stanford.edu/portal/site/W15-CS-231N-01/) page for the course.
 
-### Q1: Image Captioning with Vanilla RNNs (40 points)
-The IPython notebook `RNN_Captioning.ipynb` will walk you through the
-implementation of an image captioning system on MS-COCO using vanilla recurrent
-networks.
-### Q2: Image Captioning with LSTMs (35 points)
-The IPython notebook `LSTM_Captioning.ipynb` will walk you through the
-implementation of Long-Short Term Memory (LSTM) RNNs, and apply them to image
-captioning on MS-COCO.
+### Q1: Image Captioning with Vanilla RNNs (40 points)
+The IPython notebook `RNN_Captioning.ipynb` explains how to build an image captioning system on the MS COCO dataset using a vanilla RNN.
-### Q3: Image Gradients: Saliency maps and Fooling Images (10 points)
-The IPython notebook `ImageGradients.ipynb` will introduce the TinyImageNet
-dataset. You will use a pretrained model on this dataset to compute gradients
-with respect to the image, and use them to produce saliency maps and fooling
-images.
+### Q2: Image Captioning with LSTMs (35 points)
+The IPython notebook `LSTM_Captioning.ipynb` walks through an implementation of Long-Short Term Memory (LSTM) RNNs and applies them to image captioning on the MS COCO dataset.
-### Q4: Image Generation: Classes, Inversion, DeepDream (15 points)
-In the IPython notebook `ImageGeneration.ipynb` you will use the pretrained
-TinyImageNet model to generate images. In particular you will generate
-class visualizations and implement feature inversion and DeepDream.
+### Q3: Image Gradients: Saliency Maps and Fooling Images (10 points)
+The IPython notebook `ImageGradients.ipynb` introduces the TinyImageNet dataset. You will use a model pretrained on this dataset to compute gradients with respect to an image, and use them to generate saliency maps and fooling images.
-### Q5: Do something extra! (up to +10 points)
-Given the components of the assignment, try to do something cool. Maybe there is
-some way to generate images that we did not implement in the assignment?
+### Q4: Image Generation: Classes, Inversion, DeepDream (15 points)
+In the IPython notebook `ImageGeneration.ipynb` you will use the pretrained TinyImageNet model to generate images. In particular, you will visualize classes and implement feature inversion and DeepDream.
+### Q5: Do something extra! (+10 points)
+Using the pieces provided in this assignment, try to do something cool. Maybe there is some
other way to generate images that we did not implement in the assignment!
diff --git a/assignments2016/assignment3/.gitignore b/assignments2016/assignment3/.gitignore
new file mode 100644
index 00000000..b0611d38
--- /dev/null
+++ b/assignments2016/assignment3/.gitignore
@@ -0,0 +1,3 @@
+*.swp
+*.pyc
+.env/*
diff --git a/assignments2016/assignment3/ImageGeneration.ipynb b/assignments2016/assignment3/ImageGeneration.ipynb
new file mode 100644
index 00000000..24747ae5
--- /dev/null
+++ b/assignments2016/assignment3/ImageGeneration.ipynb
@@ -0,0 +1,511 @@
+{
+  "nbformat_minor": 0,
+  "nbformat": 4,
+  "cells": [
+    {
+      "source": [
+        "# Image Generation\n",
+        "In this notebook we will continue our exploration of image gradients using the deep model that was pretrained on TinyImageNet. We will explore various ways of using these image gradients to generate images. We will implement class visualizations, feature inversion, and DeepDream."
+      ],
+      "cell_type": "markdown",
+      "metadata": {}
+    },
+    {
+      "execution_count": null,
+      "cell_type": "code",
+      "source": [
+        "# As usual, a bit of setup\n",
+        "\n",
+        "import time, os, json\n",
+        "import numpy as np\n",
+        "from scipy.misc import imread, imresize\n",
+        "import matplotlib.pyplot as plt\n",
+        "\n",
+        "from cs231n.classifiers.pretrained_cnn import PretrainedCNN\n",
+        "from cs231n.data_utils import load_tiny_imagenet\n",
+        "from cs231n.image_utils import blur_image, deprocess_image, preprocess_image\n",
+        "\n",
+        "%matplotlib inline\n",
+        "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n",
+        "plt.rcParams['image.interpolation'] = 'nearest'\n",
+        "plt.rcParams['image.cmap'] = 'gray'\n",
+        "\n",
+        "# for auto-reloading external modules\n",
+        "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n",
+        "%load_ext autoreload\n",
+        "%autoreload 2"
+      ],
+      "outputs": [],
+      "metadata": {
+        "collapsed": false
+      }
+    },
+    {
+      "source": [
+        "# TinyImageNet and pretrained model\n",
+        "As in the previous notebook, load the TinyImageNet dataset and the pretrained model."
+      ],
+      "cell_type": "markdown",
+      "metadata": {}
+    },
+    {
+      "execution_count": null,
+      "cell_type": "code",
+      "source": [
+        "data = load_tiny_imagenet('cs231n/datasets/tiny-imagenet-100-A', subtract_mean=True)\n",
+        "model = PretrainedCNN(h5_file='cs231n/datasets/pretrained_model.h5')"
+      ],
+      "outputs": [],
+      "metadata": {
+        "collapsed": false
+      }
+    },
+    {
+      "source": [
+        "# Class visualization\n",
+        "By starting with a random noise image and performing gradient ascent on a target class, we can generate an image that the network will recognize as the target class. This idea was first presented in [1]; [2] extended this idea by suggesting several regularization techniques that can improve the quality of the generated image.\n",
+        "\n",
+        "Concretely, let $I$ be an image and let $y$ be a target class. Let $s_y(I)$ be the score that a convolutional network assigns to the image $I$ for class $y$; note that these are raw unnormalized scores, not class probabilities. We wish to generate an image $I^*$ that achieves a high score for the class $y$ by solving the problem\n",
+        "\n",
+        "$$\n",
+        "I^* = \\arg\\max_I s_y(I) - R(I)\n",
+        "$$\n",
+        "\n",
+        "where $R$ is a (possibly implicit) regularizer. We can solve this optimization problem using gradient ascent, computing gradients with respect to the generated image. We will use (explicit) L2 regularization of the form\n",
+        "\n",
+        "$$\n",
+        "R(I) = \\lambda \\|I\\|_2^2\n",
+        "$$\n",
+        "\n",
+        "and implicit regularization, as suggested by [2], by periodically blurring the generated image.\n",
+        "\n",
+        "In the cell below, complete the implementation of the `create_class_visualization` function.\n",
+        "\n",
+        "[1] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. \"Deep Inside Convolutional Networks: Visualising\n",
+        "Image Classification Models and Saliency Maps\", ICLR Workshop 2014.\n",
+        "\n",
+        "[2] Yosinski et al, \"Understanding Neural Networks Through Deep Visualization\", ICML 2015 Deep Learning Workshop"
+      ],
+      "cell_type": "markdown",
+      "metadata": {}
+    },
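For reference, the TODO in `create_class_visualization` below reduces to one forward/backward pass plus the L2 penalty term. Here is a minimal sketch of a single ascent step, assuming `model.forward(X, mode='test')` returns `(scores, cache)` and `model.backward(dscores, cache)` returns `(dX, grads)` as described in `cs231n/classifiers/pretrained_cnn.py`; it is an illustration, not the official solution:

~~~python
# A sketch of one gradient-ascent step on s_y(I) - lambda * ||I||_2^2.
# X, target_y, l2_reg and learning_rate are the variables of the function below.
scores, cache = model.forward(X, mode='test')  # raw class scores, shape (1, 100)
dscores = np.zeros_like(scores)
dscores[0, target_y] = 1.0                     # upstream gradient selects the target score
dX, _ = model.backward(dscores, cache)         # gradient of s_y with respect to the image
dX -= 2 * l2_reg * X                           # minus the gradient of the L2 penalty
X += learning_rate * dX                        # take one ascent step
~~~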
#\n", + " ############################################################################\n", + " pass\n", + " ############################################################################\n", + " # END OF YOUR CODE #\n", + " ############################################################################\n", + " \n", + " # Undo the jitter\n", + " X = np.roll(np.roll(X, -ox, -1), -oy, -2)\n", + " \n", + " # As a regularizer, clip the image\n", + " X = np.clip(X, -data['mean_image'], 255.0 - data['mean_image'])\n", + " \n", + " # As a regularizer, periodically blur the image\n", + " if t % blur_every == 0:\n", + " X = blur_image(X)\n", + " \n", + " # Periodically show the image\n", + " if t % show_every == 0:\n", + " plt.imshow(deprocess_image(X, data['mean_image']))\n", + " plt.gcf().set_size_inches(3, 3)\n", + " plt.axis('off')\n", + " plt.show()\n", + " return X" + ], + "outputs": [], + "metadata": { + "collapsed": true + } + }, + { + "source": [ + "You can use the code above to generate some cool images! An example is shown below. Try to generate a cool-looking image. If you want you can try to implement the other regularization schemes from Yosinski et al, but it isn't required." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "target_y = 43 # Tarantula\n", + "print data['class_names'][target_y]\n", + "X = create_class_visualization(target_y, model, show_every=25)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Feature Inversion\n", + "In an attempt to understand the types of features that convolutional networks learn to recognize, a recent paper [1] attempts to reconstruct an image from its feature representation. We can easily implement this idea using image gradients from the pretrained network.\n", + "\n", + "Concretely, given a image $I$, let $\\phi_\\ell(I)$ be the activations at layer $\\ell$ of the convolutional network $\\phi$. We wish to find an image $I^*$ with a similar feature representation as $I$ at layer $\\ell$ of the network $\\phi$ by solving the optimization problem\n", + "\n", + "$$\n", + "I^* = \\arg\\min_{I'} \\|\\phi_\\ell(I) - \\phi_\\ell(I')\\|_2^2 + R(I')\n", + "$$\n", + "\n", + "where $\\|\\cdot\\|_2^2$ is the squared Euclidean norm. As above, $R$ is a (possibly implicit) regularizer. We can solve this optimization problem using gradient descent, computing gradients with respect to the generated image. 
+        "\n",
+        "$$\n",
+        "R(I') = \\lambda \\|I'\\|_2^2\n",
+        "$$\n",
+        "\n",
+        "together with implicit regularization by periodically blurring the image, as recommended by [2].\n",
+        "\n",
+        "Implement this method in the function below.\n",
+        "\n",
+        "[1] Aravindh Mahendran, Andrea Vedaldi, \"Understanding Deep Image Representations by Inverting them\", CVPR 2015\n",
+        "\n",
+        "[2] Yosinski et al, \"Understanding Neural Networks Through Deep Visualization\", ICML 2015 Deep Learning Workshop"
+      ],
+      "cell_type": "markdown",
+      "metadata": {}
+    },
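The reconstruction loss differentiates cleanly: the upstream gradient at layer $\ell$ is $2(\phi_\ell(I') - \phi_\ell(I))$. A sketch of one descent step for the TODO in `invert_features` below, under the same assumed `forward`/`backward` interface as above (not the official solution):

~~~python
# A sketch of one descent step on ||phi(I) - phi(I')||_2^2 + lambda * ||I'||_2^2.
# X, target_feats, layer, l2_reg and learning_rate come from the function below.
feats, cache = model.forward(X, end=layer, mode='test')
dfeats = 2.0 * (feats - target_feats)  # gradient of the squared Euclidean loss
dX, _ = model.backward(dfeats, cache)
dX += 2 * l2_reg * X                   # plus the gradient of the L2 penalty
X -= learning_rate * dX                # descend, since we are minimizing
~~~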
#\n", + " ############################################################################\n", + " pass\n", + " ############################################################################\n", + " # END OF YOUR CODE #\n", + " ############################################################################\n", + " \n", + " # As a regularizer, clip the image\n", + " X = np.clip(X, -data['mean_image'], 255.0 - data['mean_image'])\n", + " \n", + " # As a regularizer, periodically blur the image\n", + " if (blur_every > 0) and t % blur_every == 0:\n", + " X = blur_image(X)\n", + "\n", + " if (show_every > 0) and (t % show_every == 0 or t + 1 == num_iterations):\n", + " plt.imshow(deprocess_image(X, data['mean_image']))\n", + " plt.gcf().set_size_inches(3, 3)\n", + " plt.axis('off')\n", + " plt.title('t = %d' % t)\n", + " plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "### Shallow feature reconstruction\n", + "After implementing the feature inversion above, run the following cell to try and reconstruct features from the fourth convolutional layer of the pretrained model. You should be able to reconstruct the features using the provided optimization parameters." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "filename = 'kitten.jpg'\n", + "layer = 3 # layers start from 0 so these are features after 4 convolutions\n", + "img = imresize(imread(filename), (64, 64))\n", + "\n", + "plt.imshow(img)\n", + "plt.gcf().set_size_inches(3, 3)\n", + "plt.title('Original image')\n", + "plt.axis('off')\n", + "plt.show()\n", + "\n", + "# Preprocess the image before passing it to the network:\n", + "# subtract the mean, add a dimension, etc\n", + "img_pre = preprocess_image(img, data['mean_image'])\n", + "\n", + "# Extract features from the image\n", + "feats, _ = model.forward(img_pre, end=layer)\n", + "\n", + "# Invert the features\n", + "kwargs = {\n", + " 'num_iterations': 400,\n", + " 'learning_rate': 5000,\n", + " 'l2_reg': 1e-8,\n", + " 'show_every': 100,\n", + " 'blur_every': 10,\n", + "}\n", + "X = invert_features(feats, layer, model, **kwargs)" + ], + "outputs": [], + "metadata": { + "scrolled": false, + "collapsed": false + } + }, + { + "source": [ + "### Deep feature reconstruction\n", + "Reconstructing images using features from deeper layers of the network tends to give interesting results. In the cell below, try to reconstruct the best image you can by inverting the features after 7 layers of convolutions. You will need to play with the hyperparameters to try and get a good result.\n", + "\n", + "HINT: If you read the paper by Mahendran and Vedaldi, you'll see that reconstructions from deep features tend not to look much like the original image, so you shouldn't expect the results to look like the reconstruction above. You should be able to get an image that shows some discernable structure within 1000 iterations." 
+      ],
+      "cell_type": "markdown",
+      "metadata": {}
+    },
+    {
+      "execution_count": null,
+      "cell_type": "code",
+      "source": [
+        "filename = 'kitten.jpg'\n",
+        "layer = 6 # layers start from 0 so these are features after 7 convolutions\n",
+        "img = imresize(imread(filename), (64, 64))\n",
+        "\n",
+        "plt.imshow(img)\n",
+        "plt.gcf().set_size_inches(3, 3)\n",
+        "plt.title('Original image')\n",
+        "plt.axis('off')\n",
+        "plt.show()\n",
+        "\n",
+        "# Preprocess the image before passing it to the network:\n",
+        "# subtract the mean, add a dimension, etc\n",
+        "img_pre = preprocess_image(img, data['mean_image'])\n",
+        "\n",
+        "# Extract features from the image\n",
+        "feats, _ = model.forward(img_pre, end=layer)\n",
+        "\n",
+        "# Invert the features\n",
+        "# You will need to play with these parameters.\n",
+        "kwargs = {\n",
+        "  'num_iterations': 1000,\n",
+        "  'learning_rate': 0,\n",
+        "  'l2_reg': 0,\n",
+        "  'show_every': 100,\n",
+        "  'blur_every': 0,\n",
+        "}\n",
+        "X = invert_features(feats, layer, model, **kwargs)"
+      ],
+      "outputs": [],
+      "metadata": {
+        "collapsed": false
+      }
+    },
+    {
+      "source": [
+        "# DeepDream\n",
+        "In the summer of 2015, Google released a [blog post](http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html) describing a new method of generating images from neural networks, and they later [released code](https://github.com/google/deepdream) to generate these images.\n",
+        "\n",
+        "The idea is very simple. We pick some layer from the network, pass the starting image through the network to extract features at the chosen layer, set the gradient at that layer equal to the activations themselves, and then backpropagate to the image. This has the effect of modifying the image to amplify the activations at the chosen layer of the network.\n",
+        "\n",
+        "For DeepDream we usually extract features from one of the convolutional layers, allowing us to generate images of any resolution.\n",
+        "\n",
+        "We can implement this idea using our pretrained network. The results probably won't look as good as Google's since their network is much bigger, but we should still be able to generate some interesting images."
+      ],
+      "cell_type": "markdown",
+      "metadata": {}
+    },
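The TODO in the `deepdream` function below is a single forward/backward pass per iteration, with the layer's own activations used as the upstream gradient. A sketch under the same assumed `model.forward`/`model.backward` interface as the earlier sketches (an illustration, not the official solution):

~~~python
# A sketch of one DeepDream iteration: amplify whatever the chosen layer
# already responds to. X, layer and learning_rate come from the function below.
feats, cache = model.forward(X, end=layer, mode='test')
dX, _ = model.backward(feats, cache)  # gradient at the layer set to its own activations
X += learning_rate * dX               # ascend to amplify those activations
~~~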
+    {
+      "execution_count": null,
+      "cell_type": "code",
+      "source": [
+        "def deepdream(X, layer, model, **kwargs):\n",
+        "  \"\"\"\n",
+        "  Generate a DeepDream image.\n",
+        "  \n",
+        "  Inputs:\n",
+        "  - X: Starting image, of shape (1, 3, H, W)\n",
+        "  - layer: Index of layer at which to dream\n",
+        "  - model: A PretrainedCNN object\n",
+        "  \n",
+        "  Keyword arguments:\n",
+        "  - learning_rate: How much to update the image at each iteration\n",
+        "  - max_jitter: Maximum number of pixels for jitter regularization\n",
+        "  - num_iterations: How many iterations to run for\n",
+        "  - show_every: How often to show the generated image\n",
+        "  \"\"\"\n",
+        "  \n",
+        "  X = X.copy()\n",
+        "  \n",
+        "  learning_rate = kwargs.pop('learning_rate', 5.0)\n",
+        "  max_jitter = kwargs.pop('max_jitter', 16)\n",
+        "  num_iterations = kwargs.pop('num_iterations', 100)\n",
+        "  show_every = kwargs.pop('show_every', 25)\n",
+        "  \n",
+        "  for t in xrange(num_iterations):\n",
+        "    # As a regularizer, add random jitter to the image\n",
+        "    ox, oy = np.random.randint(-max_jitter, max_jitter+1, 2)\n",
+        "    X = np.roll(np.roll(X, ox, -1), oy, -2)\n",
+        "\n",
+        "    dX = None\n",
+        "    ############################################################################\n",
+        "    # TODO: Compute the image gradient dX using the DeepDream method. You'll   #\n",
+        "    # need to use the forward and backward methods of the model object to      #\n",
+        "    # extract activations and set gradients for the chosen layer. After        #\n",
+        "    # computing the image gradient dX, you should use the learning rate to     #\n",
+        "    # update the image X.                                                      #\n",
+        "    ############################################################################\n",
+        "    pass\n",
+        "    ############################################################################\n",
+        "    #                             END OF YOUR CODE                             #\n",
+        "    ############################################################################\n",
+        "    \n",
+        "    # Undo the jitter\n",
+        "    X = np.roll(np.roll(X, -ox, -1), -oy, -2)\n",
+        "    \n",
+        "    # As a regularizer, clip the image\n",
+        "    mean_pixel = data['mean_image'].mean(axis=(1, 2), keepdims=True)\n",
+        "    X = np.clip(X, -mean_pixel, 255.0 - mean_pixel)\n",
+        "    \n",
+        "    # Periodically show the image\n",
+        "    if t == 0 or (t + 1) % show_every == 0:\n",
+        "      img = deprocess_image(X, data['mean_image'], mean='pixel')\n",
+        "      plt.imshow(img)\n",
+        "      plt.title('t = %d' % (t + 1))\n",
+        "      plt.gcf().set_size_inches(8, 8)\n",
+        "      plt.axis('off')\n",
+        "      plt.show()\n",
+        "  return X"
+      ],
+      "outputs": [],
+      "metadata": {
+        "collapsed": false
+      }
+    },
+    {
+      "source": [
+        "# Generate some images!\n",
+        "Try to generate a cool-looking DeepDream image using the pretrained network. You can try using different layers, or starting from different images. You can reduce the image size if it runs too slowly on your machine, or increase the image size if you are feeling ambitious."
+      ],
+      "cell_type": "markdown",
+      "metadata": {}
+    },
+    {
+      "execution_count": null,
+      "cell_type": "code",
+      "source": [
+        "def read_image(filename, max_size):\n",
+        "  \"\"\"\n",
+        "  Read an image from disk and resize it so its larger side is max_size\n",
+        "  \"\"\"\n",
+        "  img = imread(filename)\n",
+        "  H, W, _ = img.shape\n",
+        "  if H >= W:\n",
+        "    img = imresize(img, (max_size, int(W * float(max_size) / H)))\n",
+        "  elif H < W:\n",
+        "    img = imresize(img, (int(H * float(max_size) / W), max_size))\n",
+        "  return img\n",
+        "\n",
+        "filename = 'kitten.jpg'\n",
+        "max_size = 256\n",
+        "img = read_image(filename, max_size)\n",
+        "plt.imshow(img)\n",
+        "plt.axis('off')\n",
+        "\n",
+        "# Preprocess the image by converting to float, transposing,\n",
+        "# and performing mean subtraction.\n",
+        "img_pre = preprocess_image(img, data['mean_image'], mean='pixel')\n",
+        "\n",
+        "out = deepdream(img_pre, 7, model, learning_rate=2000)"
+      ],
+      "outputs": [],
+      "metadata": {
+        "scrolled": false,
+        "collapsed": false
+      }
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": "Python 2",
+      "name": "python2",
+      "language": "python"
+    },
+    "language_info": {
+      "mimetype": "text/x-python",
+      "nbconvert_exporter": "python",
+      "name": "python",
+      "file_extension": ".py",
+      "version": "2.7.6",
+      "pygments_lexer": "ipython2",
+      "codemirror_mode": {
+        "version": 2,
+        "name": "ipython"
+      }
+    }
+  }
+}
\ No newline at end of file
diff --git a/assignments2016/assignment3/ImageGradients.ipynb b/assignments2016/assignment3/ImageGradients.ipynb
new file mode 100644
index 00000000..669cef26
--- /dev/null
+++ b/assignments2016/assignment3/ImageGradients.ipynb
@@ -0,0 +1,383 @@
+{
+  "nbformat_minor": 0,
+  "nbformat": 4,
+  "cells": [
+    {
+      "source": [
+        "# Image Gradients\n",
+        "In this notebook we'll introduce the TinyImageNet dataset and a deep CNN that has been pretrained on this dataset. You will use this pretrained model to compute gradients with respect to images, and use these image gradients to produce class saliency maps and fooling images."
+      ],
+      "cell_type": "markdown",
+      "metadata": {}
+    },
+    {
+      "execution_count": null,
+      "cell_type": "code",
+      "source": [
+        "# As usual, a bit of setup\n",
+        "\n",
+        "import time, os, json\n",
+        "import numpy as np\n",
+        "import skimage.io\n",
+        "import matplotlib.pyplot as plt\n",
+        "\n",
+        "from cs231n.classifiers.pretrained_cnn import PretrainedCNN\n",
+        "from cs231n.data_utils import load_tiny_imagenet\n",
+        "from cs231n.image_utils import blur_image, deprocess_image\n",
+        "\n",
+        "%matplotlib inline\n",
+        "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n",
+        "plt.rcParams['image.interpolation'] = 'nearest'\n",
+        "plt.rcParams['image.cmap'] = 'gray'\n",
+        "\n",
+        "# for auto-reloading external modules\n",
+        "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n",
+        "%load_ext autoreload\n",
+        "%autoreload 2"
+      ],
+      "outputs": [],
+      "metadata": {
+        "collapsed": false
+      }
+    },
+    {
+      "source": [
+        "# Introducing TinyImageNet\n",
+        "\n",
+        "The TinyImageNet dataset is a subset of the ILSVRC-2012 classification dataset. It consists of 200 object classes, and for each object class it provides 500 training images, 50 validation images, and 50 test images. All images have been downsampled to 64x64 pixels. We have provided the labels for all training and validation images, but have withheld the labels for the test images.\n",
+        "\n",
+        "We have further split the full TinyImageNet dataset into two equal pieces, each with 100 object classes. We refer to these datasets as TinyImageNet-100-A and TinyImageNet-100-B; for this exercise you will work with TinyImageNet-100-A.\n",
+        "\n",
+        "To download the data, go into the `cs231n/datasets` directory and run the script `get_tiny_imagenet_a.sh`. Then run the following code to load the TinyImageNet-100-A dataset into memory.\n",
+        "\n",
+        "NOTE: The full TinyImageNet-100-A dataset will take up about 250MB of disk space, and loading the full TinyImageNet-100-A dataset into memory will use about 2.8GB of memory."
+      ],
+      "cell_type": "markdown",
+      "metadata": {}
+    },
+    {
+      "execution_count": null,
+      "cell_type": "code",
+      "source": [
+        "data = load_tiny_imagenet('cs231n/datasets/tiny-imagenet-100-A', subtract_mean=True)"
+      ],
+      "outputs": [],
+      "metadata": {
+        "collapsed": false
+      }
+    },
+    {
+      "source": [
+        "# TinyImageNet-100-A classes\n",
+        "Since ImageNet is based on the WordNet ontology, each class in ImageNet (and TinyImageNet) actually has several different names. For example \"pop bottle\" and \"soda bottle\" are both valid names for the same class. Run the following to see a list of all classes in TinyImageNet-100-A:"
+      ],
+      "cell_type": "markdown",
+      "metadata": {}
+    },
+    {
+      "execution_count": null,
+      "cell_type": "code",
+      "source": [
+        "for i, names in enumerate(data['class_names']):\n",
+        "  print i, ' '.join('\"%s\"' % name for name in names)"
+      ],
+      "outputs": [],
+      "metadata": {
+        "scrolled": false,
+        "collapsed": false
+      }
+    },
+    {
+      "source": [
+        "# Visualize Examples\n",
+        "Run the following to visualize some example images from random classes in TinyImageNet-100-A. It selects classes and images randomly, so you can run it several times to see different images."
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Visualize some examples of the training data\n", + "classes_to_show = 7\n", + "examples_per_class = 5\n", + "\n", + "class_idxs = np.random.choice(len(data['class_names']), size=classes_to_show, replace=False)\n", + "for i, class_idx in enumerate(class_idxs):\n", + " train_idxs, = np.nonzero(data['y_train'] == class_idx)\n", + " train_idxs = np.random.choice(train_idxs, size=examples_per_class, replace=False)\n", + " for j, train_idx in enumerate(train_idxs):\n", + " img = deprocess_image(data['X_train'][train_idx], data['mean_image'])\n", + " plt.subplot(examples_per_class, classes_to_show, 1 + i + classes_to_show * j)\n", + " if j == 0:\n", + " plt.title(data['class_names'][class_idx][0])\n", + " plt.imshow(img)\n", + " plt.gca().axis('off')\n", + "\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Pretrained model\n", + "We have trained a deep CNN for you on the TinyImageNet-100-A dataset that we will use for image visualization. The model has 9 convolutional layers (with spatial batch normalization) and 1 fully-connected hidden layer (with batch normalization).\n", + "\n", + "To get the model, run the script `get_pretrained_model.sh` from the `cs231n/datasets` directory. After doing so, run the following to load the model from disk." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "model = PretrainedCNN(h5_file='cs231n/datasets/pretrained_model.h5')" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "## Pretrained model performance\n", + "Run the following to test the performance of the pretrained model on some random training and validation set images. You should see training accuracy around 90% and validation accuracy around 60%; this indicates a bit of overfitting, but it should work for our visualization experiments." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "batch_size = 100\n", + "\n", + "# Test the model on training data\n", + "mask = np.random.randint(data['X_train'].shape[0], size=batch_size)\n", + "X, y = data['X_train'][mask], data['y_train'][mask]\n", + "y_pred = model.loss(X).argmax(axis=1)\n", + "print 'Training accuracy: ', (y_pred == y).mean()\n", + "\n", + "# Test the model on validation data\n", + "mask = np.random.randint(data['X_val'].shape[0], size=batch_size)\n", + "X, y = data['X_val'][mask], data['y_val'][mask]\n", + "y_pred = model.loss(X).argmax(axis=1)\n", + "print 'Validation accuracy: ', (y_pred == y).mean()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Saliency Maps\n", + "Using this pretrained model, we will compute class saliency maps as described in Section 3.1 of [1].\n", + "\n", + "As mentioned in Section 2 of the paper, you should compute the gradient of the image with respect to the unnormalized class score, not with respect to the normalized class probability.\n", + "\n", + "You will need to use the `forward` and `backward` methods of the `PretrainedCNN` class to compute gradients with respect to the image. Open the file `cs231n/classifiers/pretrained_cnn.py` and read the documentation for these methods to make sure you know how they work. For example usage, you can see the `loss` method. 
Make sure to run the model in `test` mode when computing saliency maps.\n",
+        "\n",
+        "[1] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. \"Deep Inside Convolutional Networks: Visualising\n",
+        "Image Classification Models and Saliency Maps\", ICLR Workshop 2014."
+      ],
+      "cell_type": "markdown",
+      "metadata": {}
+    },
+    {
+      "execution_count": null,
+      "cell_type": "code",
+      "source": [
+        "def compute_saliency_maps(X, y, model):\n",
+        "  \"\"\"\n",
+        "  Compute a class saliency map using the model for images X and labels y.\n",
+        "  \n",
+        "  Input:\n",
+        "  - X: Input images, of shape (N, 3, H, W)\n",
+        "  - y: Labels for X, of shape (N,)\n",
+        "  - model: A PretrainedCNN that will be used to compute the saliency map.\n",
+        "  \n",
+        "  Returns:\n",
+        "  - saliency: An array of shape (N, H, W) giving the saliency maps for the input\n",
+        "    images.\n",
+        "  \"\"\"\n",
+        "  saliency = None\n",
+        "  ##############################################################################\n",
+        "  # TODO: Implement this function. You should use the forward and backward    #\n",
+        "  # methods of the PretrainedCNN class, and compute gradients with respect to #\n",
+        "  # the unnormalized class score of the ground-truth classes in y.            #\n",
+        "  ##############################################################################\n",
+        "  pass\n",
+        "  ##############################################################################\n",
+        "  #                             END OF YOUR CODE                              #\n",
+        "  ##############################################################################\n",
+        "  return saliency"
+      ],
+      "outputs": [],
+      "metadata": {
+        "collapsed": true
+      }
+    },
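For orientation, the saliency computation amounts to one forward/backward pass with a one-hot upstream gradient on the ground-truth scores. A sketch, again assuming `model.forward(X, mode='test')` returns `(scores, cache)` and `model.backward(dscores, cache)` returns `(dX, grads)` (not the official solution):

~~~python
# A sketch of a possible compute_saliency_maps body.
scores, cache = model.forward(X, mode='test')  # unnormalized scores, shape (N, 100)
dscores = np.zeros_like(scores)
dscores[np.arange(X.shape[0]), y] = 1.0        # select each ground-truth class score
dX, _ = model.backward(dscores, cache)         # gradient image, shape (N, 3, H, W)
saliency = np.abs(dX).max(axis=1)              # max magnitude over the color channels
~~~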
+    {
+      "source": [
+        "Once you have completed the implementation in the cell above, run the following to visualize some class saliency maps on the validation set of TinyImageNet-100-A."
+      ],
+      "cell_type": "markdown",
+      "metadata": {}
+    },
+    {
+      "execution_count": null,
+      "cell_type": "code",
+      "source": [
+        "def show_saliency_maps(mask):\n",
+        "  mask = np.asarray(mask)\n",
+        "  X = data['X_val'][mask]\n",
+        "  y = data['y_val'][mask]\n",
+        "\n",
+        "  saliency = compute_saliency_maps(X, y, model)\n",
+        "\n",
+        "  for i in xrange(mask.size):\n",
+        "    plt.subplot(2, mask.size, i + 1)\n",
+        "    plt.imshow(deprocess_image(X[i], data['mean_image']))\n",
+        "    plt.axis('off')\n",
+        "    plt.title(data['class_names'][y[i]][0])\n",
+        "    plt.subplot(2, mask.size, mask.size + i + 1)\n",
+        "    plt.title(mask[i])\n",
+        "    plt.imshow(saliency[i])\n",
+        "    plt.axis('off')\n",
+        "  plt.gcf().set_size_inches(10, 4)\n",
+        "  plt.show()\n",
+        "\n",
+        "# Show some random images\n",
+        "mask = np.random.randint(data['X_val'].shape[0], size=5)\n",
+        "show_saliency_maps(mask)\n",
+        "  \n",
+        "# These are some cherry-picked images that should give good results\n",
+        "show_saliency_maps([128, 3225, 2417, 1640, 4619])"
+      ],
+      "outputs": [],
+      "metadata": {
+        "collapsed": false
+      }
+    },
+    {
+      "source": [
+        "# Fooling Images\n",
+        "We can also use image gradients to generate \"fooling images\" as discussed in [2]. Given an image and a target class, we can perform gradient ascent over the image to maximize the target class, stopping when the network classifies the image as the target class. Implement the following function to generate fooling images.\n",
+        "\n",
+        "[2] Szegedy et al, \"Intriguing properties of neural networks\", ICLR 2014"
+      ],
+      "cell_type": "markdown",
+      "metadata": {}
+    },
+    {
+      "execution_count": null,
+      "cell_type": "code",
+      "source": [
+        "def make_fooling_image(X, target_y, model):\n",
+        "  \"\"\"\n",
+        "  Generate a fooling image that is close to X, but that the model classifies\n",
+        "  as target_y.\n",
+        "  \n",
+        "  Inputs:\n",
+        "  - X: Input image, of shape (1, 3, 64, 64)\n",
+        "  - target_y: An integer in the range [0, 100)\n",
+        "  - model: A PretrainedCNN\n",
+        "  \n",
+        "  Returns:\n",
+        "  - X_fooling: An image that is close to X, but that is classified as target_y\n",
+        "    by the model.\n",
+        "  \"\"\"\n",
+        "  X_fooling = X.copy()\n",
+        "  ##############################################################################\n",
+        "  # TODO: Generate a fooling image X_fooling that the model will classify as  #\n",
+        "  # the class target_y. Use gradient ascent on the target class score, using  #\n",
+        "  # the model.forward method to compute scores and the model.backward method  #\n",
+        "  # to compute image gradients.                                               #\n",
+        "  #                                                                           #\n",
+        "  # HINT: For most examples, you should be able to generate a fooling image   #\n",
+        "  # in fewer than 100 iterations of gradient ascent.                          #\n",
+        "  ##############################################################################\n",
+        "  pass\n",
+        "  ##############################################################################\n",
+        "  #                             END OF YOUR CODE                              #\n",
+        "  ##############################################################################\n",
+        "  return X_fooling"
+      ],
+      "outputs": [],
+      "metadata": {
+        "collapsed": true
+      }
+    },
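One way to structure the loop, hedged the same way as the sketches above; the step size of 1000 is an arbitrary illustrative choice, not a value specified by the assignment:

~~~python
# A sketch of a possible make_fooling_image loop (not the official solution).
for _ in xrange(100):
  scores, cache = model.forward(X_fooling, mode='test')
  if scores[0].argmax() == target_y:
    break                                # the network is fooled; stop early
  dscores = np.zeros_like(scores)
  dscores[0, target_y] = 1.0             # ascend on the target class score
  dX, _ = model.backward(dscores, cache)
  X_fooling += 1000 * dX                 # ad hoc step size; tune as needed
~~~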
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "# Find a correctly classified validation image\n",
+ "while True:\n",
+ "  i = np.random.randint(data['X_val'].shape[0])\n",
+ "  X = data['X_val'][i:i+1]\n",
+ "  y = data['y_val'][i:i+1]\n",
+ "  y_pred = model.loss(X)[0].argmax()\n",
+ "  if y_pred == y: break\n",
+ "\n",
+ "target_y = 67\n",
+ "X_fooling = make_fooling_image(X, target_y, model)\n",
+ "\n",
+ "# Make sure that X_fooling is classified as target_y\n",
+ "scores = model.loss(X_fooling)\n",
+ "assert scores[0].argmax() == target_y, 'The network is not fooled!'\n",
+ "\n",
+ "# Show original image, fooling image, and difference\n",
+ "plt.subplot(1, 3, 1)\n",
+ "plt.imshow(deprocess_image(X, data['mean_image']))\n",
+ "plt.axis('off')\n",
+ "plt.title(data['class_names'][y][0])\n",
+ "plt.subplot(1, 3, 2)\n",
+ "plt.imshow(deprocess_image(X_fooling, data['mean_image'], renorm=True))\n",
+ "plt.title(data['class_names'][target_y][0])\n",
+ "plt.axis('off')\n",
+ "plt.subplot(1, 3, 3)\n",
+ "plt.title('Difference')\n",
+ "plt.imshow(deprocess_image(X - X_fooling, data['mean_image']))\n",
+ "plt.axis('off')\n",
+ "plt.show()"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 2",
+ "name": "python2",
+ "language": "python"
+ },
+ "language_info": {
+ "mimetype": "text/x-python",
+ "nbconvert_exporter": "python",
+ "name": "python",
+ "file_extension": ".py",
+ "version": "2.7.6",
+ "pygments_lexer": "ipython2",
+ "codemirror_mode": {
+ "version": 2,
+ "name": "ipython"
+ }
+ }
+ }
+}
\ No newline at end of file
diff --git a/assignments2016/assignment3/LSTM_Captioning.ipynb b/assignments2016/assignment3/LSTM_Captioning.ipynb
new file mode 100644
index 00000000..74c0bf44
--- /dev/null
+++ b/assignments2016/assignment3/LSTM_Captioning.ipynb
@@ -0,0 +1,483 @@
+{
+ "nbformat_minor": 0,
+ "nbformat": 4,
+ "cells": [
+ {
+ "source": [
+ "# Image Captioning with LSTMs\n",
+ "In the previous exercise you implemented a vanilla RNN and applied it to image captioning. In this notebook you will implement the LSTM update rule and use it for image captioning."
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "# As usual, a bit of setup\n",
+ "\n",
+ "import time, os, json\n",
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array\n",
+ "from cs231n.rnn_layers import *\n",
+ "from cs231n.captioning_solver import CaptioningSolver\n",
+ "from cs231n.classifiers.rnn import CaptioningRNN\n",
+ "from cs231n.coco_utils import load_coco_data, sample_coco_minibatch, decode_captions\n",
+ "from cs231n.image_utils import image_from_url\n",
+ "\n",
+ "%matplotlib inline\n",
+ "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n",
+ "plt.rcParams['image.interpolation'] = 'nearest'\n",
+ "plt.rcParams['image.cmap'] = 'gray'\n",
+ "\n",
+ "# for auto-reloading external modules\n",
+ "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n",
+ "%load_ext autoreload\n",
+ "%autoreload 2\n",
+ "\n",
+ "def rel_error(x, y):\n",
+ "  \"\"\" returns relative error \"\"\"\n",
+ "  return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "# Load MS-COCO data\n",
+ "As in the previous notebook, we will use the Microsoft COCO dataset for captioning."
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "# Load COCO data from disk; this returns a dictionary\n",
+ "# We'll work with dimensionality-reduced features for this notebook, but feel\n",
+ "# free to experiment with the original features by changing the flag below.\n",
+ "data = load_coco_data(pca_features=True)\n",
+ "\n",
+ "# Print out all the keys and values from the data dictionary\n",
+ "for k, v in data.iteritems():\n",
+ "  if type(v) == np.ndarray:\n",
+ "    print k, type(v), v.shape, v.dtype\n",
+ "  else:\n",
+ "    print k, type(v), len(v)"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "# LSTM\n",
+ "If you read recent papers, you'll see that many people use a variant on the vanilla RNN called Long Short-Term Memory (LSTM) RNNs. Vanilla RNNs can be tough to train on long sequences due to vanishing and exploding gradients caused by repeated matrix multiplication. LSTMs solve this problem by replacing the simple update rule of the vanilla RNN with a gating mechanism as follows.\n",
+ "\n",
+ "Similar to the vanilla RNN, at each timestep we receive an input $x_t\\in\\mathbb{R}^D$ and the previous hidden state $h_{t-1}\\in\\mathbb{R}^H$; the LSTM also maintains an $H$-dimensional *cell state*, so we also receive the previous cell state $c_{t-1}\\in\\mathbb{R}^H$. The learnable parameters of the LSTM are an *input-to-hidden* matrix $W_x\\in\\mathbb{R}^{4H\\times D}$, a *hidden-to-hidden* matrix $W_h\\in\\mathbb{R}^{4H\\times H}$ and a *bias vector* $b\\in\\mathbb{R}^{4H}$.\n",
+ "\n",
+ "At each timestep we first compute an *activation vector* $a\\in\\mathbb{R}^{4H}$ as $a=W_xx_t + W_hh_{t-1}+b$. We then divide this into four vectors $a_i,a_f,a_o,a_g\\in\\mathbb{R}^H$ where $a_i$ consists of the first $H$ elements of $a$, $a_f$ is the next $H$ elements of $a$, etc.\n",
+ "\n",
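+ "In NumPy terms, if `a` has shape `(N, 4H)` this split is plain slicing, shown here only for illustration:\n",
+ "\n",
+ "```python\n",
+ "a_i, a_f, a_o, a_g = a[:, :H], a[:, H:2*H], a[:, 2*H:3*H], a[:, 3*H:]\n",
+ "```\n",
+ "\n",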
+ "We then compute the *input gate* $i\\in\\mathbb{R}^H$, *forget gate* $f\\in\\mathbb{R}^H$, *output gate* $o\\in\\mathbb{R}^H$ and *block input* $g\\in\\mathbb{R}^H$ as\n",
+ "\n",
+ "$$\n",
+ "\\begin{align*}\n",
+ "i = \\sigma(a_i) \\hspace{2pc}\n",
+ "f = \\sigma(a_f) \\hspace{2pc}\n",
+ "o = \\sigma(a_o) \\hspace{2pc}\n",
+ "g = \\tanh(a_g)\n",
+ "\\end{align*}\n",
+ "$$\n",
+ "\n",
+ "where $\\sigma$ is the sigmoid function and $\\tanh$ is the hyperbolic tangent, both applied elementwise.\n",
+ "\n",
+ "Finally we compute the next cell state $c_t$ and next hidden state $h_t$ as\n",
+ "\n",
+ "$$\n",
+ "c_{t} = f\\odot c_{t-1} + i\\odot g \\hspace{4pc}\n",
+ "h_t = o\\odot\\tanh(c_t)\n",
+ "$$\n",
+ "\n",
+ "where $\\odot$ is the elementwise product of vectors.\n",
+ "\n",
+ "In the rest of the notebook we will implement the LSTM update rule and apply it to the image captioning task."
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "source": [
+ "# LSTM: step forward\n",
+ "Implement the forward pass for a single timestep of an LSTM in the `lstm_step_forward` function in the file `cs231n/rnn_layers.py`. This should be similar to the `rnn_step_forward` function that you implemented in the previous notebook, but using the LSTM update rule instead.\n",
+ "\n",
+ "Once you are done, run the following to perform a simple test of your implementation. You should see errors around `1e-8` or less."
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "N, D, H = 3, 4, 5\n",
+ "x = np.linspace(-0.4, 1.2, num=N*D).reshape(N, D)\n",
+ "prev_h = np.linspace(-0.3, 0.7, num=N*H).reshape(N, H)\n",
+ "prev_c = np.linspace(-0.4, 0.9, num=N*H).reshape(N, H)\n",
+ "Wx = np.linspace(-2.1, 1.3, num=4*D*H).reshape(D, 4 * H)\n",
+ "Wh = np.linspace(-0.7, 2.2, num=4*H*H).reshape(H, 4 * H)\n",
+ "b = np.linspace(0.3, 0.7, num=4*H)\n",
+ "\n",
+ "next_h, next_c, cache = lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)\n",
+ "\n",
+ "expected_next_h = np.asarray([\n",
+ "  [ 0.24635157, 0.28610883, 0.32240467, 0.35525807, 0.38474904],\n",
+ "  [ 0.49223563, 0.55611431, 0.61507696, 0.66844003, 0.7159181 ],\n",
+ "  [ 0.56735664, 0.66310127, 0.74419266, 0.80889665, 0.858299  ]])\n",
+ "expected_next_c = np.asarray([\n",
+ "  [ 0.32986176, 0.39145139, 0.451556, 0.51014116, 0.56717407],\n",
+ "  [ 0.66382255, 0.76674007, 0.87195994, 0.97902709, 1.08751345],\n",
+ "  [ 0.74192008, 0.90592151, 1.07717006, 1.25120233, 1.42395676]])\n",
+ "\n",
+ "print 'next_h error: ', rel_error(expected_next_h, next_h)\n",
+ "print 'next_c error: ', rel_error(expected_next_c, next_c)"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "# LSTM: step backward\n",
+ "Implement the backward pass for a single LSTM timestep in the function `lstm_step_backward` in the file `cs231n/rnn_layers.py`. Once you are done, run the following to perform numeric gradient checking on your implementation. You should see errors around `1e-8` or less."
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "N, D, H = 4, 5, 6\n",
+ "x = np.random.randn(N, D)\n",
+ "prev_h = np.random.randn(N, H)\n",
+ "prev_c = np.random.randn(N, H)\n",
+ "Wx = np.random.randn(D, 4 * H)\n",
+ "Wh = np.random.randn(H, 4 * H)\n",
+ "b = np.random.randn(4 * H)\n",
+ "\n",
+ "next_h, next_c, cache = lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)\n",
+ "\n",
+ "dnext_h = np.random.randn(*next_h.shape)\n",
+ "dnext_c = np.random.randn(*next_c.shape)\n",
+ "\n",
+ "fx_h = lambda x: lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)[0]\n",
+ "fh_h = lambda h: lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)[0]\n",
+ "fc_h = lambda c: lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)[0]\n",
+ "fWx_h = lambda Wx: lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)[0]\n",
+ "fWh_h = lambda Wh: lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)[0]\n",
+ "fb_h = lambda b: lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)[0]\n",
+ "\n",
+ "fx_c = lambda x: lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)[1]\n",
+ "fh_c = lambda h: lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)[1]\n",
+ "fc_c = lambda c: lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)[1]\n",
+ "fWx_c = lambda Wx: lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)[1]\n",
+ "fWh_c = lambda Wh: lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)[1]\n",
+ "fb_c = lambda b: lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)[1]\n",
+ "\n",
+ "num_grad = eval_numerical_gradient_array\n",
+ "\n",
+ "dx_num = num_grad(fx_h, x, dnext_h) + num_grad(fx_c, x, dnext_c)\n",
+ "dh_num = num_grad(fh_h, prev_h, dnext_h) + num_grad(fh_c, prev_h, dnext_c)\n",
+ "dc_num = num_grad(fc_h, prev_c, dnext_h) + num_grad(fc_c, prev_c, dnext_c)\n",
+ "dWx_num = num_grad(fWx_h, Wx, dnext_h) + num_grad(fWx_c, Wx, dnext_c)\n",
+ "dWh_num = num_grad(fWh_h, Wh, dnext_h) + num_grad(fWh_c, Wh, dnext_c)\n",
+ "db_num = num_grad(fb_h, b, dnext_h) + num_grad(fb_c, b, dnext_c)\n",
+ "\n",
+ "dx, dh, dc, dWx, dWh, db = lstm_step_backward(dnext_h, dnext_c, cache)\n",
+ "\n",
+ "print 'dx error: ', rel_error(dx_num, dx)\n",
+ "print 'dh error: ', rel_error(dh_num, dh)\n",
+ "print 'dc error: ', rel_error(dc_num, dc)\n",
+ "print 'dWx error: ', rel_error(dWx_num, dWx)\n",
+ "print 'dWh error: ', rel_error(dWh_num, dWh)\n",
+ "print 'db error: ', rel_error(db_num, db)"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "# LSTM: forward\n",
+ "Implement the `lstm_forward` function in the file `cs231n/rnn_layers.py` to run an LSTM forward on an entire timeseries of data.\n",
+ "\n",
+ "When you are done, run the following to check your implementation. You should see an error around `1e-7`.",
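+ "\n",
+ "A minimal sketch of the time loop (assuming your `lstm_step_forward` works; the initial cell state is taken to be zeros here, and the per-step caches that a real implementation must collect for the backward pass are omitted):\n",
+ "\n",
+ "```python\n",
+ "h = np.zeros((N, T, H))\n",
+ "prev_h, prev_c = h0, np.zeros_like(h0)\n",
+ "for t in xrange(T):\n",
+ "    prev_h, prev_c, _ = lstm_step_forward(x[:, t, :], prev_h, prev_c, Wx, Wh, b)\n",
+ "    h[:, t, :] = prev_h\n",
+ "```"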
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "N, D, H, T = 2, 5, 4, 3\n",
+ "x = np.linspace(-0.4, 0.6, num=N*T*D).reshape(N, T, D)\n",
+ "h0 = np.linspace(-0.4, 0.8, num=N*H).reshape(N, H)\n",
+ "Wx = np.linspace(-0.2, 0.9, num=4*D*H).reshape(D, 4 * H)\n",
+ "Wh = np.linspace(-0.3, 0.6, num=4*H*H).reshape(H, 4 * H)\n",
+ "b = np.linspace(0.2, 0.7, num=4*H)\n",
+ "\n",
+ "h, cache = lstm_forward(x, h0, Wx, Wh, b)\n",
+ "\n",
+ "expected_h = np.asarray([\n",
+ " [[ 0.01764008, 0.01823233, 0.01882671, 0.0194232 ],\n",
+ "  [ 0.11287491, 0.12146228, 0.13018446, 0.13902939],\n",
+ "  [ 0.31358768, 0.33338627, 0.35304453, 0.37250975]],\n",
+ " [[ 0.45767879, 0.4761092, 0.4936887, 0.51041945],\n",
+ "  [ 0.6704845, 0.69350089, 0.71486014, 0.7346449 ],\n",
+ "  [ 0.81733511, 0.83677871, 0.85403753, 0.86935314]]])\n",
+ "\n",
+ "print 'h error: ', rel_error(expected_h, h)"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "# LSTM: backward\n",
+ "Implement the backward pass for an LSTM over an entire timeseries of data in the function `lstm_backward` in the file `cs231n/rnn_layers.py`. When you are done, run the following to perform numeric gradient checking on your implementation. You should see errors around `1e-8` or less."
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "from cs231n.rnn_layers import lstm_forward, lstm_backward\n",
+ "\n",
+ "N, D, T, H = 2, 3, 10, 6\n",
+ "\n",
+ "x = np.random.randn(N, T, D)\n",
+ "h0 = np.random.randn(N, H)\n",
+ "Wx = np.random.randn(D, 4 * H)\n",
+ "Wh = np.random.randn(H, 4 * H)\n",
+ "b = np.random.randn(4 * H)\n",
+ "\n",
+ "out, cache = lstm_forward(x, h0, Wx, Wh, b)\n",
+ "\n",
+ "dout = np.random.randn(*out.shape)\n",
+ "\n",
+ "dx, dh0, dWx, dWh, db = lstm_backward(dout, cache)\n",
+ "\n",
+ "fx = lambda x: lstm_forward(x, h0, Wx, Wh, b)[0]\n",
+ "fh0 = lambda h0: lstm_forward(x, h0, Wx, Wh, b)[0]\n",
+ "fWx = lambda Wx: lstm_forward(x, h0, Wx, Wh, b)[0]\n",
+ "fWh = lambda Wh: lstm_forward(x, h0, Wx, Wh, b)[0]\n",
+ "fb = lambda b: lstm_forward(x, h0, Wx, Wh, b)[0]\n",
+ "\n",
+ "dx_num = eval_numerical_gradient_array(fx, x, dout)\n",
+ "dh0_num = eval_numerical_gradient_array(fh0, h0, dout)\n",
+ "dWx_num = eval_numerical_gradient_array(fWx, Wx, dout)\n",
+ "dWh_num = eval_numerical_gradient_array(fWh, Wh, dout)\n",
+ "db_num = eval_numerical_gradient_array(fb, b, dout)\n",
+ "\n",
+ "print 'dx error: ', rel_error(dx_num, dx)\n",
+ "print 'dh0 error: ', rel_error(dh0_num, dh0)\n",
+ "print 'dWx error: ', rel_error(dWx_num, dWx)\n",
+ "print 'dWh error: ', rel_error(dWh_num, dWh)\n",
+ "print 'db error: ', rel_error(db_num, db)"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "# LSTM captioning model\n",
+ "Now that you have implemented an LSTM, update the implementation of the `loss` method of the `CaptioningRNN` class in the file `cs231n/classifiers/rnn.py` to handle the case where `self.cell_type` is `lstm`. This should require adding fewer than 10 lines of code.\n",
+ "\n",
+ "Once you have done so, run the following to check your implementation. You should see a difference of less than `1e-10`."
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "N, D, W, H = 10, 20, 30, 40\n",
+ "word_to_idx = {'<NULL>': 0, 'cat': 2, 'dog': 3}\n",
+ "V = len(word_to_idx)\n",
+ "T = 13\n",
+ "\n",
+ "model = CaptioningRNN(word_to_idx,\n",
+ "          input_dim=D,\n",
+ "          wordvec_dim=W,\n",
+ "          hidden_dim=H,\n",
+ "          cell_type='lstm',\n",
+ "          dtype=np.float64)\n",
+ "\n",
+ "# Set all model parameters to fixed values\n",
+ "for k, v in model.params.iteritems():\n",
+ "  model.params[k] = np.linspace(-1.4, 1.3, num=v.size).reshape(*v.shape)\n",
+ "\n",
+ "features = np.linspace(-0.5, 1.7, num=N*D).reshape(N, D)\n",
+ "captions = (np.arange(N * T) % V).reshape(N, T)\n",
+ "\n",
+ "loss, grads = model.loss(features, captions)\n",
+ "expected_loss = 9.82445935443\n",
+ "\n",
+ "print 'loss: ', loss\n",
+ "print 'expected loss: ', expected_loss\n",
+ "print 'difference: ', abs(loss - expected_loss)"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "# Overfit LSTM captioning model\n",
+ "Run the following to overfit an LSTM captioning model on the same small dataset as we used for the RNN above."
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "small_data = load_coco_data(max_train=50)\n",
+ "\n",
+ "small_lstm_model = CaptioningRNN(\n",
+ "      cell_type='lstm',\n",
+ "      word_to_idx=data['word_to_idx'],\n",
+ "      input_dim=data['train_features'].shape[1],\n",
+ "      hidden_dim=512,\n",
+ "      wordvec_dim=256,\n",
+ "      dtype=np.float32,\n",
+ "    )\n",
+ "\n",
+ "small_lstm_solver = CaptioningSolver(small_lstm_model, small_data,\n",
+ "      update_rule='adam',\n",
+ "      num_epochs=50,\n",
+ "      batch_size=25,\n",
+ "      optim_config={\n",
+ "        'learning_rate': 5e-3,\n",
+ "      },\n",
+ "      lr_decay=0.995,\n",
+ "      verbose=True, print_every=10,\n",
+ "    )\n",
+ "\n",
+ "small_lstm_solver.train()\n",
+ "\n",
+ "# Plot the training losses\n",
+ "plt.plot(small_lstm_solver.loss_history)\n",
+ "plt.xlabel('Iteration')\n",
+ "plt.ylabel('Loss')\n",
+ "plt.title('Training loss history')\n",
+ "plt.show()"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "# LSTM test-time sampling\n",
+ "Modify the `sample` method of the `CaptioningRNN` class to handle the case where `self.cell_type` is `lstm`. This should take fewer than 10 lines of code.\n",
+ "\n",
+ "When you are done, run the following to sample from your overfit LSTM model on some training and validation set samples.",
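+ "\n",
+ "Conceptually, each sampling step embeds the previous word, runs one LSTM step, scores the vocabulary, and greedily keeps the best word. A sketch (the names `W_embed`, `W_vocab`, `b_vocab` are illustrative; `h` starts from the image features and `c` starts at zero):\n",
+ "\n",
+ "```python\n",
+ "word_vec = W_embed[prev_word]                           # (N, W)\n",
+ "h, c, _ = lstm_step_forward(word_vec, h, c, Wx, Wh, b)  # one timestep\n",
+ "scores = h.dot(W_vocab) + b_vocab                       # (N, V)\n",
+ "prev_word = scores.argmax(axis=1)                       # greedy choice\n",
+ "```"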
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "for split in ['train', 'val']:\n",
+ "  minibatch = sample_coco_minibatch(small_data, split=split, batch_size=2)\n",
+ "  gt_captions, features, urls = minibatch\n",
+ "  gt_captions = decode_captions(gt_captions, data['idx_to_word'])\n",
+ "\n",
+ "  sample_captions = small_lstm_model.sample(features)\n",
+ "  sample_captions = decode_captions(sample_captions, data['idx_to_word'])\n",
+ "\n",
+ "  for gt_caption, sample_caption, url in zip(gt_captions, sample_captions, urls):\n",
+ "    plt.imshow(image_from_url(url))\n",
+ "    plt.title('%s\\n%s\\nGT:%s' % (split, sample_caption, gt_caption))\n",
+ "    plt.axis('off')\n",
+ "    plt.show()"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "# Train a good captioning model!\n",
+ "Using the pieces you have implemented in this and the previous notebook, try to train a captioning model that gives decent qualitative results (better than the random garbage you saw with the overfit models) when sampling on the validation set. You can subsample the training set if you want; we just want to see samples on the validation set that are better than random.\n",
+ "\n",
+ "Don't spend too much time on this part; we don't have any explicit accuracy thresholds you need to meet."
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "pass\n"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "pass\n"
+ ],
+ "outputs": [],
+ "metadata": {
+ "scrolled": false,
+ "collapsed": false
+ }
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 2",
+ "name": "python2",
+ "language": "python"
+ },
+ "language_info": {
+ "mimetype": "text/x-python",
+ "nbconvert_exporter": "python",
+ "name": "python",
+ "file_extension": ".py",
+ "version": "2.7.6",
+ "pygments_lexer": "ipython2",
+ "codemirror_mode": {
+ "version": 2,
+ "name": "ipython"
+ }
+ }
+ }
+}
\ No newline at end of file
diff --git a/assignments2016/assignment3/RNN_Captioning.ipynb b/assignments2016/assignment3/RNN_Captioning.ipynb
new file mode 100644
index 00000000..61bc2a46
--- /dev/null
+++ b/assignments2016/assignment3/RNN_Captioning.ipynb
@@ -0,0 +1,659 @@
+{
+ "nbformat_minor": 0,
+ "nbformat": 4,
+ "cells": [
+ {
+ "source": [
+ "# Image Captioning with RNNs\n",
+ "In this exercise you will implement a vanilla recurrent neural network and use it to train a model that can generate novel captions for images."
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "# As usual, a bit of setup\n",
+ "\n",
+ "import time, os, json\n",
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array\n",
+ "from cs231n.rnn_layers import *\n",
+ "from cs231n.captioning_solver import CaptioningSolver\n",
+ "from cs231n.classifiers.rnn import CaptioningRNN\n",
+ "from cs231n.coco_utils import load_coco_data, sample_coco_minibatch, decode_captions\n",
+ "from cs231n.image_utils import image_from_url\n",
+ "\n",
+ "%matplotlib inline\n",
+ "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n",
+ "plt.rcParams['image.interpolation'] = 'nearest'\n",
+ "plt.rcParams['image.cmap'] = 'gray'\n",
+ "\n",
+ "# for auto-reloading external modules\n",
+ "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n",
+ "%load_ext autoreload\n",
+ "%autoreload 2\n",
+ "\n",
+ "def rel_error(x, y):\n",
+ "  \"\"\" returns relative error \"\"\"\n",
+ "  return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "# Microsoft COCO\n",
+ "For this exercise we will use the 2014 release of the [Microsoft COCO dataset](http://mscoco.org/) which has become the standard testbed for image captioning. The dataset consists of 80,000 training images and 40,000 validation images, each annotated with 5 captions written by workers on Amazon Mechanical Turk.\n",
+ "\n",
+ "To download the data, change to the `cs231n/datasets` directory and run the script `get_coco_captioning.sh`.\n",
+ "\n",
+ "We have preprocessed the data and extracted features for you already. For all images we have extracted features from the fc7 layer of the VGG-16 network pretrained on ImageNet; these features are stored in the files `train2014_vgg16_fc7.h5` and `val2014_vgg16_fc7.h5` respectively. To cut down on processing time and memory requirements, we have reduced the dimensionality of the features from 4096 to 512; these features can be found in the files `train2014_vgg16_fc7_pca.h5` and `val2014_vgg16_fc7_pca.h5`.\n",
+ "\n",
+ "The raw images take up a lot of space (nearly 20GB) so we have not included them in the download. However all images are taken from Flickr, and URLs of the training and validation images are stored in the files `train2014_urls.txt` and `val2014_urls.txt` respectively. This allows you to download images on the fly for visualization. Since images are downloaded on-the-fly, **you must be connected to the internet to view images**.\n",
+ "\n",
+ "Dealing with strings is inefficient, so we will work with an encoded version of the captions. Each word is assigned an integer ID, allowing us to represent a caption by a sequence of integers. The mapping between integer IDs and words is in the file `coco2014_vocab.json`, and you can use the function `decode_captions` from the file `cs231n/coco_utils.py` to convert numpy arrays of integer IDs back into strings.\n",
+ "\n",
+ "There are a couple special tokens that we add to the vocabulary. We prepend a special `<START>` token and append an `<END>` token to the beginning and end of each caption respectively. Rare words are replaced with a special `<UNK>` token (for \"unknown\").\n",
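+ "\n",
+ "For example, a caption might be tokenized and encoded roughly like this (the integer IDs here are made up purely for illustration):\n",
+ "\n",
+ "```\n",
+ "<START> a man riding a horse <END>  -->  [1, 4, 312, 2021, 4, 598, 2]\n",
+ "```\n",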
+ "\n",
+ "In addition, since we want to train with minibatches containing captions of different lengths, we pad short captions with a special `<NULL>` token after the `<END>` token and don't compute loss or gradient for `<NULL>` tokens. Since they are a bit of a pain, we have taken care of all implementation details around special tokens for you.\n",
+ "\n",
+ "You can load all of the MS-COCO data (captions, features, URLs, and vocabulary) using the `load_coco_data` function from the file `cs231n/coco_utils.py`. Run the following cell to do so:"
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "# Load COCO data from disk; this returns a dictionary\n",
+ "# We'll work with dimensionality-reduced features for this notebook, but feel\n",
+ "# free to experiment with the original features by changing the flag below.\n",
+ "data = load_coco_data(pca_features=True)\n",
+ "\n",
+ "# Print out all the keys and values from the data dictionary\n",
+ "for k, v in data.iteritems():\n",
+ "  if type(v) == np.ndarray:\n",
+ "    print k, type(v), v.shape, v.dtype\n",
+ "  else:\n",
+ "    print k, type(v), len(v)"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "## Look at the data\n",
+ "It is always a good idea to look at examples from the dataset before working with it.\n",
+ "\n",
+ "You can use the `sample_coco_minibatch` function from the file `cs231n/coco_utils.py` to sample minibatches of data from the data structure returned from `load_coco_data`. Run the following to sample a small minibatch of training data and show the images and their captions. Running it multiple times and looking at the results helps you to get a sense of the dataset.\n",
+ "\n",
+ "Note that we decode the captions using the `decode_captions` function and that we download the images on-the-fly using their Flickr URL, so **you must be connected to the internet to view images**."
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "# Sample a minibatch and show the images and captions\n",
+ "batch_size = 3\n",
+ "\n",
+ "captions, features, urls = sample_coco_minibatch(data, batch_size=batch_size)\n",
+ "for i, (caption, url) in enumerate(zip(captions, urls)):\n",
+ "  plt.imshow(image_from_url(url))\n",
+ "  plt.axis('off')\n",
+ "  caption_str = decode_captions(caption, data['idx_to_word'])\n",
+ "  plt.title(caption_str)\n",
+ "  plt.show()"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "# Recurrent Neural Networks\n",
+ "As discussed in lecture, we will use recurrent neural network (RNN) language models for image captioning. The file `cs231n/rnn_layers.py` contains implementations of different layer types that are needed for recurrent neural networks, and the file `cs231n/classifiers/rnn.py` uses these layers to implement an image captioning model.\n",
+ "\n",
+ "We will first implement different types of RNN layers in `cs231n/rnn_layers.py`."
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "source": [
+ "# Vanilla RNN: step forward\n",
+ "Open the file `cs231n/rnn_layers.py`. This file implements the forward and backward passes for different types of layers that are commonly used in recurrent neural networks.\n",
+ "\n",
+ "First implement the function `rnn_step_forward` which implements the forward pass for a single timestep of a vanilla recurrent neural network.\n",
+ "\n",
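+ "As a reminder, the vanilla RNN update is a single tanh of an affine function of the current input and the previous hidden state, so the core of the step is essentially one line of NumPy (a sketch, not the full function):\n",
+ "\n",
+ "```python\n",
+ "next_h = np.tanh(x.dot(Wx) + prev_h.dot(Wh) + b)\n",
+ "```\n",
+ "\n",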
+ "After doing so, run the following to check your implementation."
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "N, D, H = 3, 10, 4\n",
+ "\n",
+ "x = np.linspace(-0.4, 0.7, num=N*D).reshape(N, D)\n",
+ "prev_h = np.linspace(-0.2, 0.5, num=N*H).reshape(N, H)\n",
+ "Wx = np.linspace(-0.1, 0.9, num=D*H).reshape(D, H)\n",
+ "Wh = np.linspace(-0.3, 0.7, num=H*H).reshape(H, H)\n",
+ "b = np.linspace(-0.2, 0.4, num=H)\n",
+ "\n",
+ "next_h, _ = rnn_step_forward(x, prev_h, Wx, Wh, b)\n",
+ "expected_next_h = np.asarray([\n",
+ "  [-0.58172089, -0.50182032, -0.41232771, -0.31410098],\n",
+ "  [ 0.66854692, 0.79562378, 0.87755553, 0.92795967],\n",
+ "  [ 0.97934501, 0.99144213, 0.99646691, 0.99854353]])\n",
+ "\n",
+ "print 'next_h error: ', rel_error(expected_next_h, next_h)"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "# Vanilla RNN: step backward\n",
+ "In the file `cs231n/rnn_layers.py` implement the `rnn_step_backward` function. After doing so, run the following to numerically gradient check your implementation. You should see errors less than `1e-8`."
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "from cs231n.rnn_layers import rnn_step_forward, rnn_step_backward\n",
+ "\n",
+ "N, D, H = 4, 5, 6\n",
+ "x = np.random.randn(N, D)\n",
+ "h = np.random.randn(N, H)\n",
+ "Wx = np.random.randn(D, H)\n",
+ "Wh = np.random.randn(H, H)\n",
+ "b = np.random.randn(H)\n",
+ "\n",
+ "out, cache = rnn_step_forward(x, h, Wx, Wh, b)\n",
+ "\n",
+ "dnext_h = np.random.randn(*out.shape)\n",
+ "\n",
+ "fx = lambda x: rnn_step_forward(x, h, Wx, Wh, b)[0]\n",
+ "fh = lambda prev_h: rnn_step_forward(x, h, Wx, Wh, b)[0]\n",
+ "fWx = lambda Wx: rnn_step_forward(x, h, Wx, Wh, b)[0]\n",
+ "fWh = lambda Wh: rnn_step_forward(x, h, Wx, Wh, b)[0]\n",
+ "fb = lambda b: rnn_step_forward(x, h, Wx, Wh, b)[0]\n",
+ "\n",
+ "dx_num = eval_numerical_gradient_array(fx, x, dnext_h)\n",
+ "dprev_h_num = eval_numerical_gradient_array(fh, h, dnext_h)\n",
+ "dWx_num = eval_numerical_gradient_array(fWx, Wx, dnext_h)\n",
+ "dWh_num = eval_numerical_gradient_array(fWh, Wh, dnext_h)\n",
+ "db_num = eval_numerical_gradient_array(fb, b, dnext_h)\n",
+ "\n",
+ "dx, dprev_h, dWx, dWh, db = rnn_step_backward(dnext_h, cache)\n",
+ "\n",
+ "print 'dx error: ', rel_error(dx_num, dx)\n",
+ "print 'dprev_h error: ', rel_error(dprev_h_num, dprev_h)\n",
+ "print 'dWx error: ', rel_error(dWx_num, dWx)\n",
+ "print 'dWh error: ', rel_error(dWh_num, dWh)\n",
+ "print 'db error: ', rel_error(db_num, db)"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "# Vanilla RNN: forward\n",
+ "Now that you have implemented the forward and backward passes for a single timestep of a vanilla RNN, you will combine these pieces to implement an RNN that processes an entire sequence of data.\n",
+ "\n",
+ "In the file `cs231n/rnn_layers.py`, implement the function `rnn_forward`. This should be implemented using the `rnn_step_forward` function that you defined above. After doing so, run the following to check your implementation. You should see errors less than `1e-7`.",
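+ "\n",
+ "One way to structure the loop (a sketch; a real implementation must also collect the per-step caches for the backward pass):\n",
+ "\n",
+ "```python\n",
+ "h = np.zeros((N, T, H))\n",
+ "prev_h = h0\n",
+ "for t in xrange(T):\n",
+ "    prev_h, _ = rnn_step_forward(x[:, t, :], prev_h, Wx, Wh, b)\n",
+ "    h[:, t, :] = prev_h\n",
+ "```"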
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "N, T, D, H = 2, 3, 4, 5\n",
+ "\n",
+ "x = np.linspace(-0.1, 0.3, num=N*T*D).reshape(N, T, D)\n",
+ "h0 = np.linspace(-0.3, 0.1, num=N*H).reshape(N, H)\n",
+ "Wx = np.linspace(-0.2, 0.4, num=D*H).reshape(D, H)\n",
+ "Wh = np.linspace(-0.4, 0.1, num=H*H).reshape(H, H)\n",
+ "b = np.linspace(-0.7, 0.1, num=H)\n",
+ "\n",
+ "h, _ = rnn_forward(x, h0, Wx, Wh, b)\n",
+ "expected_h = np.asarray([\n",
+ "  [\n",
+ "    [-0.42070749, -0.27279261, -0.11074945, 0.05740409, 0.22236251],\n",
+ "    [-0.39525808, -0.22554661, -0.0409454, 0.14649412, 0.32397316],\n",
+ "    [-0.42305111, -0.24223728, -0.04287027, 0.15997045, 0.35014525],\n",
+ "  ],\n",
+ "  [\n",
+ "    [-0.55857474, -0.39065825, -0.19198182, 0.02378408, 0.23735671],\n",
+ "    [-0.27150199, -0.07088804, 0.13562939, 0.33099728, 0.50158768],\n",
+ "    [-0.51014825, -0.30524429, -0.06755202, 0.17806392, 0.40333043]]])\n",
+ "print 'h error: ', rel_error(expected_h, h)"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "# Vanilla RNN: backward\n",
+ "In the file `cs231n/rnn_layers.py`, implement the backward pass for a vanilla RNN in the function `rnn_backward`. This should run back-propagation over the entire sequence, calling into the `rnn_step_backward` function that you defined above."
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "N, D, T, H = 2, 3, 10, 5\n",
+ "\n",
+ "x = np.random.randn(N, T, D)\n",
+ "h0 = np.random.randn(N, H)\n",
+ "Wx = np.random.randn(D, H)\n",
+ "Wh = np.random.randn(H, H)\n",
+ "b = np.random.randn(H)\n",
+ "\n",
+ "out, cache = rnn_forward(x, h0, Wx, Wh, b)\n",
+ "\n",
+ "dout = np.random.randn(*out.shape)\n",
+ "\n",
+ "dx, dh0, dWx, dWh, db = rnn_backward(dout, cache)\n",
+ "\n",
+ "fx = lambda x: rnn_forward(x, h0, Wx, Wh, b)[0]\n",
+ "fh0 = lambda h0: rnn_forward(x, h0, Wx, Wh, b)[0]\n",
+ "fWx = lambda Wx: rnn_forward(x, h0, Wx, Wh, b)[0]\n",
+ "fWh = lambda Wh: rnn_forward(x, h0, Wx, Wh, b)[0]\n",
+ "fb = lambda b: rnn_forward(x, h0, Wx, Wh, b)[0]\n",
+ "\n",
+ "dx_num = eval_numerical_gradient_array(fx, x, dout)\n",
+ "dh0_num = eval_numerical_gradient_array(fh0, h0, dout)\n",
+ "dWx_num = eval_numerical_gradient_array(fWx, Wx, dout)\n",
+ "dWh_num = eval_numerical_gradient_array(fWh, Wh, dout)\n",
+ "db_num = eval_numerical_gradient_array(fb, b, dout)\n",
+ "\n",
+ "print 'dx error: ', rel_error(dx_num, dx)\n",
+ "print 'dh0 error: ', rel_error(dh0_num, dh0)\n",
+ "print 'dWx error: ', rel_error(dWx_num, dWx)\n",
+ "print 'dWh error: ', rel_error(dWh_num, dWh)\n",
+ "print 'db error: ', rel_error(db_num, db)"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "# Word embedding: forward\n",
+ "In deep learning systems, we commonly represent words using vectors. Each word of the vocabulary will be associated with a vector, and these vectors will be learned jointly with the rest of the system.\n",
+ "\n",
+ "In the file `cs231n/rnn_layers.py`, implement the function `word_embedding_forward` to convert words (represented by integers) into vectors. Run the following to check your implementation. You should see an error around `1e-8`.",
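+ "\n",
+ "The forward pass is essentially an integer-array lookup into the embedding matrix (a sketch of the idea):\n",
+ "\n",
+ "```python\n",
+ "out = W[x]  # x: (N, T) ints in [0, V); W: (V, D); out: (N, T, D)\n",
+ "```"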
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "N, T, V, D = 2, 4, 5, 3\n", + "\n", + "x = np.asarray([[0, 3, 1, 2], [2, 1, 0, 3]])\n", + "W = np.linspace(0, 1, num=V*D).reshape(V, D)\n", + "\n", + "out, _ = word_embedding_forward(x, W)\n", + "expected_out = np.asarray([\n", + " [[ 0., 0.07142857, 0.14285714],\n", + " [ 0.64285714, 0.71428571, 0.78571429],\n", + " [ 0.21428571, 0.28571429, 0.35714286],\n", + " [ 0.42857143, 0.5, 0.57142857]],\n", + " [[ 0.42857143, 0.5, 0.57142857],\n", + " [ 0.21428571, 0.28571429, 0.35714286],\n", + " [ 0., 0.07142857, 0.14285714],\n", + " [ 0.64285714, 0.71428571, 0.78571429]]])\n", + "\n", + "print 'out error: ', rel_error(expected_out, out)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Word embedding: backward\n", + "Implement the backward pass for the word embedding function in the function `word_embedding_backward`. After doing so run the following to numerically gradient check your implementation. You should see errors less than `1e-11`." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "N, T, V, D = 50, 3, 5, 6\n", + "\n", + "x = np.random.randint(V, size=(N, T))\n", + "W = np.random.randn(V, D)\n", + "\n", + "out, cache = word_embedding_forward(x, W)\n", + "dout = np.random.randn(*out.shape)\n", + "dW = word_embedding_backward(dout, cache)\n", + "\n", + "f = lambda W: word_embedding_forward(x, W)[0]\n", + "dW_num = eval_numerical_gradient_array(f, W, dout)\n", + "\n", + "print 'dW error: ', rel_error(dW, dW_num)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Temporal Affine layer\n", + "At every timestep we use an affine function to transform the RNN hidden vector at that timestep into scores for each word in the vocabulary. Because this is very similar to the affine layer that you implemented in assignment 2, we have provided this function for you in the `temporal_affine_forward` and `temporal_affine_backward` functions in the file `cs231n/rnn_layers.py`. Run the following to perform numeric gradient checking on the implementation." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Gradient check for temporal affine layer\n", + "N, T, D, M = 2, 3, 4, 5\n", + "\n", + "x = np.random.randn(N, T, D)\n", + "w = np.random.randn(D, M)\n", + "b = np.random.randn(M)\n", + "\n", + "out, cache = temporal_affine_forward(x, w, b)\n", + "\n", + "dout = np.random.randn(*out.shape)\n", + "\n", + "fx = lambda x: temporal_affine_forward(x, w, b)[0]\n", + "fw = lambda w: temporal_affine_forward(x, w, b)[0]\n", + "fb = lambda b: temporal_affine_forward(x, w, b)[0]\n", + "\n", + "dx_num = eval_numerical_gradient_array(fx, x, dout)\n", + "dw_num = eval_numerical_gradient_array(fw, w, dout)\n", + "db_num = eval_numerical_gradient_array(fb, b, dout)\n", + "\n", + "dx, dw, db = temporal_affine_backward(dout, cache)\n", + "\n", + "print 'dx error: ', rel_error(dx_num, dx)\n", + "print 'dw error: ', rel_error(dw_num, dw)\n", + "print 'db error: ', rel_error(db_num, db)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Temporal Softmax loss\n", + "In an RNN language model, at every timestep we produce a score for each word in the vocabulary. 
\n",
+ "We know the ground-truth word at each timestep, so we use a softmax loss function to compute loss and gradient at each timestep. We sum the losses over time and average them over the minibatch.\n",
+ "\n",
+ "However there is one wrinkle: since we operate over minibatches and different captions may have different lengths, we append `<NULL>` tokens to the end of each caption so they all have the same length. We don't want these `<NULL>` tokens to count toward the loss or gradient, so in addition to scores and ground-truth labels our loss function also accepts a `mask` array that tells it which elements of the scores count towards the loss.\n",
+ "\n",
+ "Since this is very similar to the softmax loss function you implemented in assignment 1, we have implemented this loss function for you; look at the `temporal_softmax_loss` function in the file `cs231n/rnn_layers.py`.\n",
+ "\n",
+ "Run the following cell to sanity check the loss and perform numeric gradient checking on the function."
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "# Sanity check for temporal softmax loss\n",
+ "from cs231n.rnn_layers import temporal_softmax_loss\n",
+ "\n",
+ "N, T, V = 100, 1, 10\n",
+ "\n",
+ "def check_loss(N, T, V, p):\n",
+ "  x = 0.001 * np.random.randn(N, T, V)\n",
+ "  y = np.random.randint(V, size=(N, T))\n",
+ "  mask = np.random.rand(N, T) <= p\n",
+ "  print temporal_softmax_loss(x, y, mask)[0]\n",
+ "  \n",
+ "check_loss(100, 1, 10, 1.0)   # Should be about 2.3\n",
+ "check_loss(100, 10, 10, 1.0)  # Should be about 23\n",
+ "check_loss(5000, 10, 10, 0.1) # Should be about 2.3\n",
+ "\n",
+ "# Gradient check for temporal softmax loss\n",
+ "N, T, V = 7, 8, 9\n",
+ "\n",
+ "x = np.random.randn(N, T, V)\n",
+ "y = np.random.randint(V, size=(N, T))\n",
+ "mask = (np.random.rand(N, T) > 0.5)\n",
+ "\n",
+ "loss, dx = temporal_softmax_loss(x, y, mask, verbose=False)\n",
+ "\n",
+ "dx_num = eval_numerical_gradient(lambda x: temporal_softmax_loss(x, y, mask)[0], x, verbose=False)\n",
+ "\n",
+ "print 'dx error: ', rel_error(dx, dx_num)"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "# RNN for image captioning\n",
+ "Now that you have implemented the necessary layers, you can combine them to build an image captioning model. Open the file `cs231n/classifiers/rnn.py` and look at the `CaptioningRNN` class.\n",
+ "\n",
+ "Implement the forward and backward pass of the model in the `loss` function. For now you only need to implement the case where `cell_type='rnn'` for vanilla RNNs; you will implement the LSTM case later. After doing so, run the following to check your forward pass using a small test case; you should see an error less than `1e-10`.",
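+ "\n",
+ "At a high level, the forward pass strings together the layers from this notebook. A hedged outline only (the parameter names `W_proj`, `b_proj`, `W_embed`, `W_vocab`, `b_vocab` and the shifted `captions_in` / `captions_out` split are illustrative, not the exact required code):\n",
+ "\n",
+ "```python\n",
+ "h0 = features.dot(W_proj) + b_proj                   # image features -> initial hidden state\n",
+ "x, _ = word_embedding_forward(captions_in, W_embed)  # words -> vectors\n",
+ "h, _ = rnn_forward(x, h0, Wx, Wh, b)                 # run the RNN over time\n",
+ "scores, _ = temporal_affine_forward(h, W_vocab, b_vocab)\n",
+ "loss, dscores = temporal_softmax_loss(scores, captions_out, mask)\n",
+ "```"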
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "N, D, W, H = 10, 20, 30, 40\n",
+ "word_to_idx = {'<NULL>': 0, 'cat': 2, 'dog': 3}\n",
+ "V = len(word_to_idx)\n",
+ "T = 13\n",
+ "\n",
+ "model = CaptioningRNN(word_to_idx,\n",
+ "          input_dim=D,\n",
+ "          wordvec_dim=W,\n",
+ "          hidden_dim=H,\n",
+ "          cell_type='rnn',\n",
+ "          dtype=np.float64)\n",
+ "\n",
+ "# Set all model parameters to fixed values\n",
+ "for k, v in model.params.iteritems():\n",
+ "  model.params[k] = np.linspace(-1.4, 1.3, num=v.size).reshape(*v.shape)\n",
+ "\n",
+ "features = np.linspace(-1.5, 0.3, num=(N * D)).reshape(N, D)\n",
+ "captions = (np.arange(N * T) % V).reshape(N, T)\n",
+ "\n",
+ "loss, grads = model.loss(features, captions)\n",
+ "expected_loss = 9.83235591003\n",
+ "\n",
+ "print 'loss: ', loss\n",
+ "print 'expected loss: ', expected_loss\n",
+ "print 'difference: ', abs(loss - expected_loss)"
+ ],
+ "outputs": [],
+ "metadata": {
+ "scrolled": false,
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "Run the following cell to perform numeric gradient checking on the `CaptioningRNN` class; you should see errors around `1e-7` or less."
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "batch_size = 2\n",
+ "timesteps = 3\n",
+ "input_dim = 4\n",
+ "wordvec_dim = 5\n",
+ "hidden_dim = 6\n",
+ "word_to_idx = {'<NULL>': 0, 'cat': 2, 'dog': 3}\n",
+ "vocab_size = len(word_to_idx)\n",
+ "\n",
+ "captions = np.random.randint(vocab_size, size=(batch_size, timesteps))\n",
+ "features = np.random.randn(batch_size, input_dim)\n",
+ "\n",
+ "model = CaptioningRNN(word_to_idx,\n",
+ "          input_dim=input_dim,\n",
+ "          wordvec_dim=wordvec_dim,\n",
+ "          hidden_dim=hidden_dim,\n",
+ "          cell_type='rnn',\n",
+ "          dtype=np.float64,\n",
+ "        )\n",
+ "\n",
+ "loss, grads = model.loss(features, captions)\n",
+ "\n",
+ "for param_name in sorted(grads):\n",
+ "  f = lambda _: model.loss(features, captions)[0]\n",
+ "  param_grad_num = eval_numerical_gradient(f, model.params[param_name], verbose=False, h=1e-6)\n",
+ "  e = rel_error(param_grad_num, grads[param_name])\n",
+ "  print '%s relative error: %e' % (param_name, e)"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "# Overfit small data\n",
+ "Similar to the `Solver` class that we used to train image classification models in the previous assignment, in this assignment we use a `CaptioningSolver` class to train image captioning models. Open the file `cs231n/captioning_solver.py` and read through the `CaptioningSolver` class; it should look very familiar.\n",
+ "\n",
+ "Once you have familiarized yourself with the API, run the following to make sure your model can overfit a small sample of 50 training examples. You should see losses around 1."
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "small_data = load_coco_data(max_train=50)\n", + "\n", + "small_rnn_model = CaptioningRNN(\n", + " cell_type='rnn',\n", + " word_to_idx=data['word_to_idx'],\n", + " input_dim=data['train_features'].shape[1],\n", + " hidden_dim=512,\n", + " wordvec_dim=256,\n", + " )\n", + "\n", + "small_rnn_solver = CaptioningSolver(small_rnn_model, small_data,\n", + " update_rule='adam',\n", + " num_epochs=50,\n", + " batch_size=25,\n", + " optim_config={\n", + " 'learning_rate': 5e-3,\n", + " },\n", + " lr_decay=0.95,\n", + " verbose=True, print_every=10,\n", + " )\n", + "\n", + "small_rnn_solver.train()\n", + "\n", + "# Plot the training losses\n", + "plt.plot(small_rnn_solver.loss_history)\n", + "plt.xlabel('Iteration')\n", + "plt.ylabel('Loss')\n", + "plt.title('Training loss history')\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Test-time sampling\n", + "Unlike classification models, image captioning models behave very differently at training time and at test time. At training time, we have access to the ground-truth caption so we feed ground-truth words as input to the RNN at each timestep. At test time, we sample from the distribution over the vocabulary at each timestep, and feed the sample as input to the RNN at the next timestep.\n", + "\n", + "In the file `cs231n/classifiers/rnn.py`, implement the `sample` method for test-time sampling. After doing so, run the following to sample from your overfit model on both training and validation data. The samples on training data should be very good; the samples on validation data probably won't make sense." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "for split in ['train', 'val']:\n", + " minibatch = sample_coco_minibatch(small_data, split=split, batch_size=2)\n", + " gt_captions, features, urls = minibatch\n", + " gt_captions = decode_captions(gt_captions, data['idx_to_word'])\n", + "\n", + " sample_captions = small_rnn_model.sample(features)\n", + " sample_captions = decode_captions(sample_captions, data['idx_to_word'])\n", + "\n", + " for gt_caption, sample_caption, url in zip(gt_captions, sample_captions, urls):\n", + " plt.imshow(image_from_url(url))\n", + " plt.title('%s\\n%s\\nGT:%s' % (split, sample_caption, gt_caption))\n", + " plt.axis('off')\n", + " plt.show()" + ], + "outputs": [], + "metadata": { + "scrolled": false, + "collapsed": false + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "name": "python2", + "language": "python" + }, + "language_info": { + "mimetype": "text/x-python", + "nbconvert_exporter": "python", + "name": "python", + "file_extension": ".py", + "version": "2.7.6", + "pygments_lexer": "ipython2", + "codemirror_mode": { + "version": 2, + "name": "ipython" + } + } + } +} \ No newline at end of file diff --git a/assignments2016/assignment3/collectSubmission.sh b/assignments2016/assignment3/collectSubmission.sh new file mode 100755 index 00000000..4e6b0b4c --- /dev/null +++ b/assignments2016/assignment3/collectSubmission.sh @@ -0,0 +1,2 @@ +rm -f assignment3.zip +zip -r assignment3.zip . 
-x "*.git" "*cs231n/datasets*" "*.ipynb_checkpoints*" "*README.md" "*collectSubmission.sh" "*requirements.txt" ".env/*" "*.pyc" "*cs231n/build/*"
diff --git a/assignments2016/assignment3/cs231n/.gitignore b/assignments2016/assignment3/cs231n/.gitignore
new file mode 100644
index 00000000..fbb42c24
--- /dev/null
+++ b/assignments2016/assignment3/cs231n/.gitignore
@@ -0,0 +1,3 @@
+build/*
+im2col_cython.c
+im2col_cython.so
diff --git a/assignments2016/assignment3/cs231n/__init__.py b/assignments2016/assignment3/cs231n/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/assignments2016/assignment3/cs231n/captioning_solver.py b/assignments2016/assignment3/cs231n/captioning_solver.py
new file mode 100644
index 00000000..6e3dddb1
--- /dev/null
+++ b/assignments2016/assignment3/cs231n/captioning_solver.py
@@ -0,0 +1,233 @@
+import numpy as np
+
+from cs231n import optim
+from cs231n.coco_utils import sample_coco_minibatch
+
+
+class CaptioningSolver(object):
+  """
+  A CaptioningSolver encapsulates all the logic necessary for training
+  image captioning models. The CaptioningSolver performs stochastic gradient
+  descent using different update rules defined in optim.py.
+
+  The solver accepts both training and validation data and labels so it can
+  periodically check classification accuracy on both training and validation
+  data to watch out for overfitting.
+
+  To train a model, you will first construct a CaptioningSolver instance,
+  passing the model, dataset, and various options (learning rate, batch size,
+  etc) to the constructor. You will then call the train() method to run the
+  optimization procedure and train the model.
+
+  After the train() method returns, model.params will contain the parameters
+  that performed best on the validation set over the course of training.
+  In addition, the instance variable solver.loss_history will contain a list
+  of all losses encountered during training and the instance variables
+  solver.train_acc_history and solver.val_acc_history will be lists containing
+  the accuracies of the model on the training and validation set at each epoch.
+
+  Example usage might look something like this:
+
+  data = load_coco_data()
+  model = MyAwesomeModel(hidden_dim=100)
+  solver = CaptioningSolver(model, data,
+                  update_rule='sgd',
+                  optim_config={
+                    'learning_rate': 1e-3,
+                  },
+                  lr_decay=0.95,
+                  num_epochs=10, batch_size=100,
+                  print_every=100)
+  solver.train()
+
+
+  A CaptioningSolver works on a model object that must conform to the following
+  API:
+
+  - model.params must be a dictionary mapping string parameter names to numpy
+    arrays containing parameter values.
+
+  - model.loss(features, captions) must be a function that computes
+    training-time loss and gradients, with the following inputs and outputs:
+
+    Inputs:
+    - features: Array giving a minibatch of features for images, of shape (N, D)
+    - captions: Array of captions for those images, of shape (N, T) where
+      each element is in the range (0, V].
+
+    Returns:
+    - loss: Scalar giving the loss
+    - grads: Dictionary with the same keys as self.params mapping parameter
+      names to gradients of the loss with respect to those parameters.
+  """
+
+  def __init__(self, model, data, **kwargs):
+    """
+    Construct a new CaptioningSolver instance.
+
+    Required arguments:
+    - model: A model object conforming to the API described above
+    - data: A dictionary of training and validation data from load_coco_data
+
+    Optional arguments:
+    - update_rule: A string giving the name of an update rule in optim.py.
+ Default is 'sgd'. + - optim_config: A dictionary containing hyperparameters that will be + passed to the chosen update rule. Each update rule requires different + hyperparameters (see optim.py) but all update rules require a + 'learning_rate' parameter so that should always be present. + - lr_decay: A scalar for learning rate decay; after each epoch the learning + rate is multiplied by this value. + - batch_size: Size of minibatches used to compute loss and gradient during + training. + - num_epochs: The number of epochs to run for during training. + - print_every: Integer; training losses will be printed every print_every + iterations. + - verbose: Boolean; if set to false then no output will be printed during + training. + """ + self.model = model + self.data = data + + # Unpack keyword arguments + self.update_rule = kwargs.pop('update_rule', 'sgd') + self.optim_config = kwargs.pop('optim_config', {}) + self.lr_decay = kwargs.pop('lr_decay', 1.0) + self.batch_size = kwargs.pop('batch_size', 100) + self.num_epochs = kwargs.pop('num_epochs', 10) + + self.print_every = kwargs.pop('print_every', 10) + self.verbose = kwargs.pop('verbose', True) + + # Throw an error if there are extra keyword arguments + if len(kwargs) > 0: + extra = ', '.join('"%s"' % k for k in kwargs.keys()) + raise ValueError('Unrecognized arguments %s' % extra) + + # Make sure the update rule exists, then replace the string + # name with the actual function + if not hasattr(optim, self.update_rule): + raise ValueError('Invalid update_rule "%s"' % self.update_rule) + self.update_rule = getattr(optim, self.update_rule) + + self._reset() + + + def _reset(self): + """ + Set up some book-keeping variables for optimization. Don't call this + manually. + """ + # Set up some variables for book-keeping + self.epoch = 0 + self.best_val_acc = 0 + self.best_params = {} + self.loss_history = [] + self.train_acc_history = [] + self.val_acc_history = [] + + # Make a deep copy of the optim_config for each parameter + self.optim_configs = {} + for p in self.model.params: + d = {k: v for k, v in self.optim_config.iteritems()} + self.optim_configs[p] = d + + + def _step(self): + """ + Make a single gradient update. This is called by train() and should not + be called manually. + """ + # Make a minibatch of training data + minibatch = sample_coco_minibatch(self.data, + batch_size=self.batch_size, + split='train') + captions, features, urls = minibatch + + # Compute loss and gradient + loss, grads = self.model.loss(features, captions) + self.loss_history.append(loss) + + # Perform a parameter update + for p, w in self.model.params.iteritems(): + dw = grads[p] + config = self.optim_configs[p] + next_w, next_config = self.update_rule(w, dw, config) + self.model.params[p] = next_w + self.optim_configs[p] = next_config + + + # TODO: This does nothing right now; maybe implement BLEU? + def check_accuracy(self, X, y, num_samples=None, batch_size=100): + """ + Check accuracy of the model on the provided data. + + Inputs: + - X: Array of data, of shape (N, d_1, ..., d_k) + - y: Array of labels, of shape (N,) + - num_samples: If not None, subsample the data and only test the model + on num_samples datapoints. + - batch_size: Split X and y into batches of this size to avoid using too + much memory. + + Returns: + - acc: Scalar giving the fraction of instances that were correctly + classified by the model. 
+ """ + return 0.0 + + # Maybe subsample the data + N = X.shape[0] + if num_samples is not None and N > num_samples: + mask = np.random.choice(N, num_samples) + N = num_samples + X = X[mask] + y = y[mask] + + # Compute predictions in batches + num_batches = N / batch_size + if N % batch_size != 0: + num_batches += 1 + y_pred = [] + for i in xrange(num_batches): + start = i * batch_size + end = (i + 1) * batch_size + scores = self.model.loss(X[start:end]) + y_pred.append(np.argmax(scores, axis=1)) + y_pred = np.hstack(y_pred) + acc = np.mean(y_pred == y) + + return acc + + + def train(self): + """ + Run optimization to train the model. + """ + num_train = self.data['train_captions'].shape[0] + iterations_per_epoch = max(num_train / self.batch_size, 1) + num_iterations = self.num_epochs * iterations_per_epoch + + for t in xrange(num_iterations): + self._step() + + # Maybe print training loss + if self.verbose and t % self.print_every == 0: + print '(Iteration %d / %d) loss: %f' % ( + t + 1, num_iterations, self.loss_history[-1]) + + # At the end of every epoch, increment the epoch counter and decay the + # learning rate. + epoch_end = (t + 1) % iterations_per_epoch == 0 + if epoch_end: + self.epoch += 1 + for k in self.optim_configs: + self.optim_configs[k]['learning_rate'] *= self.lr_decay + + # Check train and val accuracy on the first iteration, the last + # iteration, and at the end of each epoch. + # TODO: Implement some logic to check Bleu on validation set periodically + + # At the end of training swap the best params into the model + # self.model.params = self.best_params + diff --git a/assignments2016/assignment3/cs231n/classifiers/__init__.py b/assignments2016/assignment3/cs231n/classifiers/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/assignments2016/assignment3/cs231n/classifiers/pretrained_cnn.py b/assignments2016/assignment3/cs231n/classifiers/pretrained_cnn.py new file mode 100644 index 00000000..a12dce98 --- /dev/null +++ b/assignments2016/assignment3/cs231n/classifiers/pretrained_cnn.py @@ -0,0 +1,252 @@ +import numpy as np +import h5py + +from cs231n.layers import * +from cs231n.fast_layers import * +from cs231n.layer_utils import * + + +class PretrainedCNN(object): + def __init__(self, dtype=np.float32, num_classes=100, input_size=64, h5_file=None): + self.dtype = dtype + self.conv_params = [] + self.input_size = input_size + self.num_classes = num_classes + + # TODO: In the future it would be nice if the architecture could be loaded from + # the HDF5 file rather than being hardcoded. For now this will have to do. 
+    self.conv_params.append({'stride': 2, 'pad': 2})
+    self.conv_params.append({'stride': 1, 'pad': 1})
+    self.conv_params.append({'stride': 2, 'pad': 1})
+    self.conv_params.append({'stride': 1, 'pad': 1})
+    self.conv_params.append({'stride': 2, 'pad': 1})
+    self.conv_params.append({'stride': 1, 'pad': 1})
+    self.conv_params.append({'stride': 2, 'pad': 1})
+    self.conv_params.append({'stride': 1, 'pad': 1})
+    self.conv_params.append({'stride': 2, 'pad': 1})
+
+    self.filter_sizes = [5, 3, 3, 3, 3, 3, 3, 3, 3]
+    self.num_filters = [64, 64, 128, 128, 256, 256, 512, 512, 1024]
+    hidden_dim = 512
+
+    self.bn_params = []
+
+    cur_size = input_size
+    prev_dim = 3
+    self.params = {}
+    for i, (f, next_dim) in enumerate(zip(self.filter_sizes, self.num_filters)):
+      fan_in = f * f * prev_dim
+      self.params['W%d' % (i + 1)] = np.sqrt(2.0 / fan_in) * np.random.randn(next_dim, prev_dim, f, f)
+      self.params['b%d' % (i + 1)] = np.zeros(next_dim)
+      self.params['gamma%d' % (i + 1)] = np.ones(next_dim)
+      self.params['beta%d' % (i + 1)] = np.zeros(next_dim)
+      self.bn_params.append({'mode': 'train'})
+      prev_dim = next_dim
+      if self.conv_params[i]['stride'] == 2: cur_size /= 2
+
+    # Add the fully-connected layers
+    fan_in = cur_size * cur_size * self.num_filters[-1]
+    self.params['W%d' % (i + 2)] = np.sqrt(2.0 / fan_in) * np.random.randn(fan_in, hidden_dim)
+    self.params['b%d' % (i + 2)] = np.zeros(hidden_dim)
+    self.params['gamma%d' % (i + 2)] = np.ones(hidden_dim)
+    self.params['beta%d' % (i + 2)] = np.zeros(hidden_dim)
+    self.bn_params.append({'mode': 'train'})
+    self.params['W%d' % (i + 3)] = np.sqrt(2.0 / hidden_dim) * np.random.randn(hidden_dim, num_classes)
+    self.params['b%d' % (i + 3)] = np.zeros(num_classes)
+
+    for k, v in self.params.iteritems():
+      self.params[k] = v.astype(dtype)
+
+    if h5_file is not None:
+      self.load_weights(h5_file)
+
+
+  def load_weights(self, h5_file, verbose=False):
+    """
+    Load pretrained weights from an HDF5 file.
+
+    Inputs:
+    - h5_file: Path to the HDF5 file where pretrained weights are stored.
+    - verbose: Whether to print debugging info
+    """
+
+    # Before loading weights we need to make a dummy forward pass to initialize
+    # the running averages in the bn_params
+    x = np.random.randn(1, 3, self.input_size, self.input_size)
+    y = np.random.randint(self.num_classes, size=1)
+    loss, grads = self.loss(x, y)
+
+    with h5py.File(h5_file, 'r') as f:
+      for k, v in f.iteritems():
+        v = np.asarray(v)
+        if k in self.params:
+          if verbose: print k, v.shape, self.params[k].shape
+          if v.shape == self.params[k].shape:
+            self.params[k] = v.copy()
+          elif v.T.shape == self.params[k].shape:
+            self.params[k] = v.T.copy()
+          else:
+            raise ValueError('shapes for %s do not match' % k)
+        if k.startswith('running_mean'):
+          i = int(k[12:]) - 1
+          assert self.bn_params[i]['running_mean'].shape == v.shape
+          self.bn_params[i]['running_mean'] = v.copy()
+          if verbose: print k, v.shape
+        if k.startswith('running_var'):
+          i = int(k[11:]) - 1
+          assert v.shape == self.bn_params[i]['running_var'].shape
+          self.bn_params[i]['running_var'] = v.copy()
+          if verbose: print k, v.shape
+
+    for k, v in self.params.iteritems():
+      self.params[k] = v.astype(self.dtype)
+
+
+  def forward(self, X, start=None, end=None, mode='test'):
+    """
+    Run part of the model forward, starting and ending at an arbitrary layer,
+    in either training mode or testing mode.
+ + You can pass arbitrary input to the starting layer, and you will receive + output from the ending layer and a cache object that can be used to run + the model backward over the same set of layers. + + For the purposes of this function, a "layer" is one of the following blocks: + + [conv - spatial batchnorm - relu] (There are 9 of these) + [affine - batchnorm - relu] (There is one of these) + [affine] (There is one of these) + + Inputs: + - X: The input to the starting layer. If start=0, then this should be an + array of shape (N, C, 64, 64). + - start: The index of the layer to start from. start=0 starts from the first + convolutional layer. Default is 0. + - end: The index of the layer to end at. start=11 ends at the last + fully-connected layer, returning class scores. Default is 11. + - mode: The mode to use, either 'test' or 'train'. We need this because + batch normalization behaves differently at training time and test time. + + Returns: + - out: Output from the end layer. + - cache: A cache object that can be passed to the backward method to run the + network backward over the same range of layers. + """ + X = X.astype(self.dtype) + if start is None: start = 0 + if end is None: end = len(self.conv_params) + 1 + layer_caches = [] + + prev_a = X + for i in xrange(start, end + 1): + i1 = i + 1 + if 0 <= i < len(self.conv_params): + # This is a conv layer + w, b = self.params['W%d' % i1], self.params['b%d' % i1] + gamma, beta = self.params['gamma%d' % i1], self.params['beta%d' % i1] + conv_param = self.conv_params[i] + bn_param = self.bn_params[i] + bn_param['mode'] = mode + + next_a, cache = conv_bn_relu_forward(prev_a, w, b, gamma, beta, conv_param, bn_param) + elif i == len(self.conv_params): + # This is the fully-connected hidden layer + w, b = self.params['W%d' % i1], self.params['b%d' % i1] + gamma, beta = self.params['gamma%d' % i1], self.params['beta%d' % i1] + bn_param = self.bn_params[i] + bn_param['mode'] = mode + next_a, cache = affine_bn_relu_forward(prev_a, w, b, gamma, beta, bn_param) + elif i == len(self.conv_params) + 1: + # This is the last fully-connected layer that produces scores + w, b = self.params['W%d' % i1], self.params['b%d' % i1] + next_a, cache = affine_forward(prev_a, w, b) + else: + raise ValueError('Invalid layer index %d' % i) + + layer_caches.append(cache) + prev_a = next_a + + out = prev_a + cache = (start, end, layer_caches) + return out, cache + + + def backward(self, dout, cache): + """ + Run the model backward over a sequence of layers that were previously run + forward using the self.forward method. + + Inputs: + - dout: Gradient with respect to the ending layer; this should have the same + shape as the out variable returned from the corresponding call to forward. + - cache: A cache object returned from self.forward. + + Returns: + - dX: Gradient with respect to the start layer. This will have the same + shape as the input X passed to self.forward. + - grads: Gradient of all parameters in the layers. For example if you run + forward through two convolutional layers, then on the corresponding call + to backward grads will contain the gradients with respect to the weights, + biases, and spatial batchnorm parameters of those two convolutional + layers. The grads dictionary will therefore contain a subset of the keys + of self.params, and grads[k] and self.params[k] will have the same shape. 
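+
+    Example (an illustrative sketch; `model`, `X`, and `dout` are assumed to
+    be a PretrainedCNN instance and appropriately-shaped arrays):
+
+      out, cache = model.forward(X, start=0, end=2)
+      dX, grads = model.backward(dout, cache)
+      # grads now has keys W1, b1, gamma1, beta1, ..., W3, b3, gamma3, beta3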
+ """ + start, end, layer_caches = cache + dnext_a = dout + grads = {} + for i in reversed(range(start, end + 1)): + i1 = i + 1 + if i == len(self.conv_params) + 1: + # This is the last fully-connected layer + dprev_a, dw, db = affine_backward(dnext_a, layer_caches.pop()) + grads['W%d' % i1] = dw + grads['b%d' % i1] = db + elif i == len(self.conv_params): + # This is the fully-connected hidden layer + temp = affine_bn_relu_backward(dnext_a, layer_caches.pop()) + dprev_a, dw, db, dgamma, dbeta = temp + grads['W%d' % i1] = dw + grads['b%d' % i1] = db + grads['gamma%d' % i1] = dgamma + grads['beta%d' % i1] = dbeta + elif 0 <= i < len(self.conv_params): + # This is a conv layer + temp = conv_bn_relu_backward(dnext_a, layer_caches.pop()) + dprev_a, dw, db, dgamma, dbeta = temp + grads['W%d' % i1] = dw + grads['b%d' % i1] = db + grads['gamma%d' % i1] = dgamma + grads['beta%d' % i1] = dbeta + else: + raise ValueError('Invalid layer index %d' % i) + dnext_a = dprev_a + + dX = dnext_a + return dX, grads + + + def loss(self, X, y=None): + """ + Classification loss used to train the network. + + Inputs: + - X: Array of data, of shape (N, 3, 64, 64) + - y: Array of labels, of shape (N,) + + If y is None, then run a test-time forward pass and return: + - scores: Array of shape (N, 100) giving class scores. + + If y is not None, then run a training-time forward and backward pass and + return a tuple of: + - loss: Scalar giving loss + - grads: Dictionary of gradients, with the same keys as self.params. + """ + # Note that we implement this by just caling self.forward and self.backward + mode = 'test' if y is None else 'train' + scores, cache = self.forward(X, mode=mode) + if mode == 'test': + return scores + loss, dscores = softmax_loss(scores, y) + dX, grads = self.backward(dscores, cache) + return loss, grads + diff --git a/assignments2016/assignment3/cs231n/classifiers/rnn.py b/assignments2016/assignment3/cs231n/classifiers/rnn.py new file mode 100644 index 00000000..d43bba80 --- /dev/null +++ b/assignments2016/assignment3/cs231n/classifiers/rnn.py @@ -0,0 +1,204 @@ +import numpy as np + +from cs231n.layers import * +from cs231n.rnn_layers import * + + +class CaptioningRNN(object): + """ + A CaptioningRNN produces captions from image features using a recurrent + neural network. + + The RNN receives input vectors of size D, has a vocab size of V, works on + sequences of length T, has an RNN hidden dimension of H, uses word vectors + of dimension W, and operates on minibatches of size N. + + Note that we don't use any regularization for the CaptioningRNN. + """ + + def __init__(self, word_to_idx, input_dim=512, wordvec_dim=128, + hidden_dim=128, cell_type='rnn', dtype=np.float32): + """ + Construct a new CaptioningRNN instance. + + Inputs: + - word_to_idx: A dictionary giving the vocabulary. It contains V entries, + and maps each string to a unique integer in the range [0, V). + - input_dim: Dimension D of input image feature vectors. + - wordvec_dim: Dimension W of word vectors. + - hidden_dim: Dimension H for the hidden state of the RNN. + - cell_type: What type of RNN to use; either 'rnn' or 'lstm'. + - dtype: numpy datatype to use; use float32 for training and float64 for + numeric gradient checking. 
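+
+    Example (an illustrative sketch; the tiny vocabulary is hypothetical):
+
+      word_to_idx = {'<NULL>': 0, '<START>': 1, '<END>': 2, 'cat': 3}
+      model = CaptioningRNN(word_to_idx, input_dim=512, cell_type='lstm')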
+    """
+    if cell_type not in {'rnn', 'lstm'}:
+      raise ValueError('Invalid cell_type "%s"' % cell_type)
+
+    self.cell_type = cell_type
+    self.dtype = dtype
+    self.word_to_idx = word_to_idx
+    self.idx_to_word = {i: w for w, i in word_to_idx.iteritems()}
+    self.params = {}
+
+    vocab_size = len(word_to_idx)
+
+    self._null = word_to_idx['<NULL>']
+    self._start = word_to_idx.get('<START>', None)
+    self._end = word_to_idx.get('<END>', None)
+
+    # Initialize word vectors
+    self.params['W_embed'] = np.random.randn(vocab_size, wordvec_dim)
+    self.params['W_embed'] /= 100
+
+    # Initialize CNN -> hidden state projection parameters
+    self.params['W_proj'] = np.random.randn(input_dim, hidden_dim)
+    self.params['W_proj'] /= np.sqrt(input_dim)
+    self.params['b_proj'] = np.zeros(hidden_dim)
+
+    # Initialize parameters for the RNN
+    dim_mul = {'lstm': 4, 'rnn': 1}[cell_type]
+    self.params['Wx'] = np.random.randn(wordvec_dim, dim_mul * hidden_dim)
+    self.params['Wx'] /= np.sqrt(wordvec_dim)
+    self.params['Wh'] = np.random.randn(hidden_dim, dim_mul * hidden_dim)
+    self.params['Wh'] /= np.sqrt(hidden_dim)
+    self.params['b'] = np.zeros(dim_mul * hidden_dim)
+
+    # Initialize output to vocab weights
+    self.params['W_vocab'] = np.random.randn(hidden_dim, vocab_size)
+    self.params['W_vocab'] /= np.sqrt(hidden_dim)
+    self.params['b_vocab'] = np.zeros(vocab_size)
+
+    # Cast parameters to correct dtype
+    for k, v in self.params.iteritems():
+      self.params[k] = v.astype(self.dtype)
+
+
+  def loss(self, features, captions):
+    """
+    Compute training-time loss for the RNN. We input image features and
+    ground-truth captions for those images, and use an RNN (or LSTM) to compute
+    loss and gradients on all parameters.
+
+    Inputs:
+    - features: Input image features, of shape (N, D)
+    - captions: Ground-truth captions; an integer array of shape (N, T) where
+      each element is in the range 0 <= captions[i, t] < V
+
+    Returns a tuple of:
+    - loss: Scalar loss
+    - grads: Dictionary of gradients parallel to self.params
+    """
+    # Cut captions into two pieces: captions_in has everything but the last word
+    # and will be input to the RNN; captions_out has everything but the first
+    # word and this is what we will expect the RNN to generate. These are offset
+    # by one relative to each other because the RNN should produce word (t+1)
+    # after receiving word t. The first element of captions_in will be the START
+    # token, and the first element of captions_out will be the first word.
+    captions_in = captions[:, :-1]
+    captions_out = captions[:, 1:]
+
+    # You'll need this
+    mask = (captions_out != self._null)
+
+    # Weight and bias for the affine transform from image features to initial
+    # hidden state
+    W_proj, b_proj = self.params['W_proj'], self.params['b_proj']
+
+    # Word embedding matrix
+    W_embed = self.params['W_embed']
+
+    # Input-to-hidden, hidden-to-hidden, and biases for the RNN
+    Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']
+
+    # Weight and bias for the hidden-to-vocab transformation.
+    W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']
+
+    loss, grads = 0.0, {}
+    ############################################################################
+    # TODO: Implement the forward and backward passes for the CaptioningRNN.  #
+    # In the forward pass you will need to do the following:                  #
+    # (1) Use an affine transformation to compute the initial hidden state    #
+    #     from the image features. This should produce an array of shape      #
+    #     (N, H).                                                             #
+    # (2) Use a word embedding layer to transform the words in captions_in    #
+    #     from indices to vectors, giving an array of shape (N, T, W).        #
+    # (3) Use either a vanilla RNN or LSTM (depending on self.cell_type) to   #
+    #     process the sequence of input word vectors and produce hidden state #
+    #     vectors for all timesteps, producing an array of shape (N, T, H).   #
+    # (4) Use a (temporal) affine transformation to compute scores over the   #
+    #     vocabulary at every timestep using the hidden states, giving an     #
+    #     array of shape (N, T, V).                                           #
+    # (5) Use (temporal) softmax to compute loss using captions_out, ignoring #
+    #     the points where the output word is <NULL> using the mask above.    #
+    #                                                                         #
+    # In the backward pass you will need to compute the gradient of the loss  #
+    # with respect to all model parameters. Use the loss and grads variables  #
+    # defined above to store loss and gradients; grads[k] should give the     #
+    # gradients for self.params[k].                                           #
+    ############################################################################
+    pass
+    ############################################################################
+    #                             END OF YOUR CODE                            #
+    ############################################################################
+
+    return loss, grads
+
+
+  def sample(self, features, max_length=30):
+    """
+    Run a test-time forward pass for the model, sampling captions for input
+    feature vectors.
+
+    At each timestep, we embed the current word, pass it and the previous hidden
+    state to the RNN to get the next hidden state, use the hidden state to get
+    scores for all vocab words, and choose the word with the highest score as
+    the next word. The initial hidden state is computed by applying an affine
+    transform to the input image features, and the initial word is the <START>
+    token.
+
+    For LSTMs you will also have to keep track of the cell state; in that case
+    the initial cell state should be zero.
+
+    Inputs:
+    - features: Array of input image features of shape (N, D).
+    - max_length: Maximum length T of generated captions.
+
+    Returns:
+    - captions: Array of shape (N, max_length) giving sampled captions,
+      where each element is an integer in the range [0, V). The first element
+      of captions should be the first sampled word, not the <START> token.
+    """
+    N = features.shape[0]
+    captions = self._null * np.ones((N, max_length), dtype=np.int32)
+
+    # Unpack parameters
+    W_proj, b_proj = self.params['W_proj'], self.params['b_proj']
+    W_embed = self.params['W_embed']
+    Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']
+    W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']
+
+    ###########################################################################
+    # TODO: Implement test-time sampling for the model. You will need to     #
+    # initialize the hidden state of the RNN by applying the learned affine  #
+    # transform to the input image features. The first word that you feed to #
+    # the RNN should be the <START> token; its value is stored in the        #
+    # variable self._start. At each timestep you will need to:               #
+    # (1) Embed the previous word using the learned word embeddings          #
+    # (2) Make an RNN step using the previous hidden state and the embedded  #
+    #     current word to get the next hidden state.                         #
+    # (3) Apply the learned affine transformation to the next hidden state   #
+    #     to get scores for all words in the vocabulary                      #
+    # (4) Select the word with the highest score as the next word, writing   #
+    #     it to the appropriate slot in the captions variable                #
+    #                                                                        #
+    # For simplicity, you do not need to stop generating after an <END>      #
+    # token is sampled, but you can if you want to.                          #
+    #                                                                        #
+    # HINT: You will not be able to use the rnn_forward or lstm_forward      #
+    # functions; you'll need to call rnn_step_forward or lstm_step_forward   #
+    # in a loop.                                                             #
+    ###########################################################################
+    pass
+    ###########################################################################
+    #                            END OF YOUR CODE                            #
+    ###########################################################################
+    return captions
diff --git a/assignments2016/assignment3/cs231n/coco_utils.py b/assignments2016/assignment3/cs231n/coco_utils.py
new file mode 100644
index 00000000..bc5f5793
--- /dev/null
+++ b/assignments2016/assignment3/cs231n/coco_utils.py
@@ -0,0 +1,84 @@
+import os, json
+import numpy as np
+import h5py
+
+
+def load_coco_data(base_dir='cs231n/datasets/coco_captioning',
+                   max_train=None,
+                   pca_features=True):
+  data = {}
+  caption_file = os.path.join(base_dir, 'coco2014_captions.h5')
+  with h5py.File(caption_file, 'r') as f:
+    for k, v in f.iteritems():
+      data[k] = np.asarray(v)
+
+  if pca_features:
+    train_feat_file = os.path.join(base_dir, 'train2014_vgg16_fc7_pca.h5')
+  else:
+    train_feat_file = os.path.join(base_dir, 'train2014_vgg16_fc7.h5')
+  with h5py.File(train_feat_file, 'r') as f:
+    data['train_features'] = np.asarray(f['features'])
+
+  if pca_features:
+    val_feat_file = os.path.join(base_dir, 'val2014_vgg16_fc7_pca.h5')
+  else:
+    val_feat_file = os.path.join(base_dir, 'val2014_vgg16_fc7.h5')
+  with h5py.File(val_feat_file, 'r') as f:
+    data['val_features'] = np.asarray(f['features'])
+
+  dict_file = os.path.join(base_dir, 'coco2014_vocab.json')
+  with open(dict_file, 'r') as f:
+    dict_data = json.load(f)
+    for k, v in dict_data.iteritems():
+      data[k] = v
+
+  train_url_file = os.path.join(base_dir, 'train2014_urls.txt')
+  with open(train_url_file, 'r') as f:
+    train_urls = np.asarray([line.strip() for line in f])
+  data['train_urls'] = train_urls
+
+  val_url_file = os.path.join(base_dir, 'val2014_urls.txt')
+  with open(val_url_file, 'r') as f:
+    val_urls = np.asarray([line.strip() for line in f])
+  data['val_urls'] = val_urls
+
+  # Maybe subsample the training data
+  if max_train is not None:
+    num_train = data['train_captions'].shape[0]
+    mask = np.random.randint(num_train, size=max_train)
+    data['train_captions'] = data['train_captions'][mask]
+    data['train_image_idxs'] = data['train_image_idxs'][mask]
+
+  return data
+
+
+def decode_captions(captions, idx_to_word):
+  singleton = False
+  if captions.ndim == 1:
+    singleton = True
+    captions = captions[None]
+  decoded = []
+  N, T = captions.shape
+  for i in xrange(N):
+    words = []
+    for t in xrange(T):
+      word = idx_to_word[captions[i, t]]
+      if word != '<NULL>':
+        words.append(word)
+      if word == '<END>':
+        break
+    decoded.append(' '.join(words))
+  if singleton:
+    decoded = decoded[0]
+  return decoded
+
+
+def sample_coco_minibatch(data, batch_size=100, split='train'):
+  split_size = data['%s_captions' % split].shape[0]
+  mask = np.random.choice(split_size, batch_size)
+  captions = data['%s_captions' % split][mask]
+  image_idxs = data['%s_image_idxs' % split][mask]
+  image_features = data['%s_features' % split][image_idxs]
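+  # Also look up the URL of the image behind each sampled caption so the
+  # minibatch can be visualized.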
urls = data['%s_urls' % split][image_idxs] + return captions, image_features, urls + diff --git a/assignments2016/assignment3/cs231n/data_utils.py b/assignments2016/assignment3/cs231n/data_utils.py new file mode 100644 index 00000000..0fca6f59 --- /dev/null +++ b/assignments2016/assignment3/cs231n/data_utils.py @@ -0,0 +1,219 @@ +import cPickle as pickle +import numpy as np +import os +from scipy.misc import imread + +def load_CIFAR_batch(filename): + """ load single batch of cifar """ + with open(filename, 'rb') as f: + datadict = pickle.load(f) + X = datadict['data'] + Y = datadict['labels'] + X = X.reshape(10000, 3, 32, 32).transpose(0,2,3,1).astype("float") + Y = np.array(Y) + return X, Y + +def load_CIFAR10(ROOT): + """ load all of cifar """ + xs = [] + ys = [] + for b in range(1,6): + f = os.path.join(ROOT, 'data_batch_%d' % (b, )) + X, Y = load_CIFAR_batch(f) + xs.append(X) + ys.append(Y) + Xtr = np.concatenate(xs) + Ytr = np.concatenate(ys) + del X, Y + Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch')) + return Xtr, Ytr, Xte, Yte + + +def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000, + subtract_mean=True): + """ + Load the CIFAR-10 dataset from disk and perform preprocessing to prepare + it for classifiers. These are the same steps as we used for the SVM, but + condensed to a single function. + """ + # Load the raw CIFAR-10 data + cifar10_dir = 'cs231n/datasets/cifar-10-batches-py' + X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir) + + # Subsample the data + mask = range(num_training, num_training + num_validation) + X_val = X_train[mask] + y_val = y_train[mask] + mask = range(num_training) + X_train = X_train[mask] + y_train = y_train[mask] + mask = range(num_test) + X_test = X_test[mask] + y_test = y_test[mask] + + # Normalize the data: subtract the mean image + if subtract_mean: + mean_image = np.mean(X_train, axis=0) + X_train -= mean_image + X_val -= mean_image + X_test -= mean_image + + # Transpose so that channels come first + X_train = X_train.transpose(0, 3, 1, 2).copy() + X_val = X_val.transpose(0, 3, 1, 2).copy() + X_test = X_test.transpose(0, 3, 1, 2).copy() + + # Package data into a dictionary + return { + 'X_train': X_train, 'y_train': y_train, + 'X_val': X_val, 'y_val': y_val, + 'X_test': X_test, 'y_test': y_test, + } + + +def load_tiny_imagenet(path, dtype=np.float32, subtract_mean=True): + """ + Load TinyImageNet. Each of TinyImageNet-100-A, TinyImageNet-100-B, and + TinyImageNet-200 have the same directory structure, so this can be used + to load any of them. + + Inputs: + - path: String giving path to the directory to load. + - dtype: numpy datatype used to load the data. + - subtract_mean: Whether to subtract the mean training image. + + Returns: A dictionary with the following entries: + - class_names: A list where class_names[i] is a list of strings giving the + WordNet names for class i in the loaded dataset. + - X_train: (N_tr, 3, 64, 64) array of training images + - y_train: (N_tr,) array of training labels + - X_val: (N_val, 3, 64, 64) array of validation images + - y_val: (N_val,) array of validation labels + - X_test: (N_test, 3, 64, 64) array of testing images. + - y_test: (N_test,) array of test labels; if test labels are not available + (such as in student code) then y_test will be None. 
+ - mean_image: (3, 64, 64) array giving mean training image + """ + # First load wnids + with open(os.path.join(path, 'wnids.txt'), 'r') as f: + wnids = [x.strip() for x in f] + + # Map wnids to integer labels + wnid_to_label = {wnid: i for i, wnid in enumerate(wnids)} + + # Use words.txt to get names for each class + with open(os.path.join(path, 'words.txt'), 'r') as f: + wnid_to_words = dict(line.split('\t') for line in f) + for wnid, words in wnid_to_words.iteritems(): + wnid_to_words[wnid] = [w.strip() for w in words.split(',')] + class_names = [wnid_to_words[wnid] for wnid in wnids] + + # Next load training data. + X_train = [] + y_train = [] + for i, wnid in enumerate(wnids): + if (i + 1) % 20 == 0: + print 'loading training data for synset %d / %d' % (i + 1, len(wnids)) + # To figure out the filenames we need to open the boxes file + boxes_file = os.path.join(path, 'train', wnid, '%s_boxes.txt' % wnid) + with open(boxes_file, 'r') as f: + filenames = [x.split('\t')[0] for x in f] + num_images = len(filenames) + + X_train_block = np.zeros((num_images, 3, 64, 64), dtype=dtype) + y_train_block = wnid_to_label[wnid] * np.ones(num_images, dtype=np.int64) + for j, img_file in enumerate(filenames): + img_file = os.path.join(path, 'train', wnid, 'images', img_file) + img = imread(img_file) + if img.ndim == 2: + ## grayscale file + img.shape = (64, 64, 1) + X_train_block[j] = img.transpose(2, 0, 1) + X_train.append(X_train_block) + y_train.append(y_train_block) + + # We need to concatenate all training data + X_train = np.concatenate(X_train, axis=0) + y_train = np.concatenate(y_train, axis=0) + + # Next load validation data + with open(os.path.join(path, 'val', 'val_annotations.txt'), 'r') as f: + img_files = [] + val_wnids = [] + for line in f: + img_file, wnid = line.split('\t')[:2] + img_files.append(img_file) + val_wnids.append(wnid) + num_val = len(img_files) + y_val = np.array([wnid_to_label[wnid] for wnid in val_wnids]) + X_val = np.zeros((num_val, 3, 64, 64), dtype=dtype) + for i, img_file in enumerate(img_files): + img_file = os.path.join(path, 'val', 'images', img_file) + img = imread(img_file) + if img.ndim == 2: + img.shape = (64, 64, 1) + X_val[i] = img.transpose(2, 0, 1) + + # Next load test images + # Students won't have test labels, so we need to iterate over files in the + # images directory. + img_files = os.listdir(os.path.join(path, 'test', 'images')) + X_test = np.zeros((len(img_files), 3, 64, 64), dtype=dtype) + for i, img_file in enumerate(img_files): + img_file = os.path.join(path, 'test', 'images', img_file) + img = imread(img_file) + if img.ndim == 2: + img.shape = (64, 64, 1) + X_test[i] = img.transpose(2, 0, 1) + + y_test = None + y_test_file = os.path.join(path, 'test', 'test_annotations.txt') + if os.path.isfile(y_test_file): + with open(y_test_file, 'r') as f: + img_file_to_wnid = {} + for line in f: + line = line.split('\t') + img_file_to_wnid[line[0]] = line[1] + y_test = [wnid_to_label[img_file_to_wnid[img_file]] for img_file in img_files] + y_test = np.array(y_test) + + mean_image = X_train.mean(axis=0) + if subtract_mean: + X_train -= mean_image[None] + X_val -= mean_image[None] + X_test -= mean_image[None] + + return { + 'class_names': class_names, + 'X_train': X_train, + 'y_train': y_train, + 'X_val': X_val, + 'y_val': y_val, + 'X_test': X_test, + 'y_test': y_test, + 'class_names': class_names, + 'mean_image': mean_image, + } + + +def load_models(models_dir): + """ + Load saved models from disk. 
This will attempt to unpickle all files in a + directory; any files that give errors on unpickling (such as README.txt) will + be skipped. + + Inputs: + - models_dir: String giving the path to a directory containing model files. + Each model file is a pickled dictionary with a 'model' field. + + Returns: + A dictionary mapping model file names to models. + """ + models = {} + for model_file in os.listdir(models_dir): + with open(os.path.join(models_dir, model_file), 'rb') as f: + try: + models[model_file] = pickle.load(f)['model'] + except pickle.UnpicklingError: + continue + return models diff --git a/assignments2016/assignment3/cs231n/datasets/get_coco_captioning.sh b/assignments2016/assignment3/cs231n/datasets/get_coco_captioning.sh new file mode 100755 index 00000000..683e34e4 --- /dev/null +++ b/assignments2016/assignment3/cs231n/datasets/get_coco_captioning.sh @@ -0,0 +1,3 @@ +wget "http://cs231n.stanford.edu/coco_captioning.zip" +unzip coco_captioning.zip +rm coco_captioning.zip diff --git a/assignments2016/assignment3/cs231n/datasets/get_pretrained_model.sh b/assignments2016/assignment3/cs231n/datasets/get_pretrained_model.sh new file mode 100755 index 00000000..d4a6ceb2 --- /dev/null +++ b/assignments2016/assignment3/cs231n/datasets/get_pretrained_model.sh @@ -0,0 +1 @@ +wget http://cs231n.stanford.edu/pretrained_model.h5 diff --git a/assignments2016/assignment3/cs231n/datasets/get_tiny_imagenet_a.sh b/assignments2016/assignment3/cs231n/datasets/get_tiny_imagenet_a.sh new file mode 100755 index 00000000..6d975605 --- /dev/null +++ b/assignments2016/assignment3/cs231n/datasets/get_tiny_imagenet_a.sh @@ -0,0 +1,3 @@ +wget http://cs231n.stanford.edu/tiny-imagenet-100-A.zip +unzip tiny-imagenet-100-A.zip +rm tiny-imagenet-100-A.zip diff --git a/assignments2016/assignment3/cs231n/fast_layers.py b/assignments2016/assignment3/cs231n/fast_layers.py new file mode 100644 index 00000000..ea0ce0bc --- /dev/null +++ b/assignments2016/assignment3/cs231n/fast_layers.py @@ -0,0 +1,270 @@ +import numpy as np +try: + from cs231n.im2col_cython import col2im_cython, im2col_cython + from cs231n.im2col_cython import col2im_6d_cython +except ImportError: + print 'run the following from the cs231n directory and try again:' + print 'python setup.py build_ext --inplace' + print 'You may also need to restart your iPython kernel' + +from cs231n.im2col import * + + +def conv_forward_im2col(x, w, b, conv_param): + """ + A fast implementation of the forward pass for a convolutional layer + based on im2col and col2im. 
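+
+  The idea: each receptive field of the (padded) input is unrolled into one
+  column of a matrix x_cols, so the whole convolution becomes a single matrix
+  multiply. As an illustrative example, an input of shape (N, C, H, W) =
+  (2, 3, 32, 32) with 5x5 filters, pad 2 and stride 1 gives x_cols of shape
+  (3 * 5 * 5, 2 * 32 * 32) = (75, 2048).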
+ """ + N, C, H, W = x.shape + num_filters, _, filter_height, filter_width = w.shape + stride, pad = conv_param['stride'], conv_param['pad'] + + # Check dimensions + assert (W + 2 * pad - filter_width) % stride == 0, 'width does not work' + assert (H + 2 * pad - filter_height) % stride == 0, 'height does not work' + + # Create output + out_height = (H + 2 * pad - filter_height) / stride + 1 + out_width = (W + 2 * pad - filter_width) / stride + 1 + out = np.zeros((N, num_filters, out_height, out_width), dtype=x.dtype) + + # x_cols = im2col_indices(x, w.shape[2], w.shape[3], pad, stride) + x_cols = im2col_cython(x, w.shape[2], w.shape[3], pad, stride) + res = w.reshape((w.shape[0], -1)).dot(x_cols) + b.reshape(-1, 1) + + out = res.reshape(w.shape[0], out.shape[2], out.shape[3], x.shape[0]) + out = out.transpose(3, 0, 1, 2) + + cache = (x, w, b, conv_param, x_cols) + return out, cache + + +def conv_forward_strides(x, w, b, conv_param): + N, C, H, W = x.shape + F, _, HH, WW = w.shape + stride, pad = conv_param['stride'], conv_param['pad'] + + # Check dimensions + #assert (W + 2 * pad - WW) % stride == 0, 'width does not work' + #assert (H + 2 * pad - HH) % stride == 0, 'height does not work' + + # Pad the input + p = pad + x_padded = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)), mode='constant') + + # Figure out output dimensions + H += 2 * pad + W += 2 * pad + out_h = (H - HH) / stride + 1 + out_w = (W - WW) / stride + 1 + + # Perform an im2col operation by picking clever strides + shape = (C, HH, WW, N, out_h, out_w) + strides = (H * W, W, 1, C * H * W, stride * W, stride) + strides = x.itemsize * np.array(strides) + x_stride = np.lib.stride_tricks.as_strided(x_padded, + shape=shape, strides=strides) + x_cols = np.ascontiguousarray(x_stride) + x_cols.shape = (C * HH * WW, N * out_h * out_w) + + # Now all our convolutions are a big matrix multiply + res = w.reshape(F, -1).dot(x_cols) + b.reshape(-1, 1) + + # Reshape the output + res.shape = (F, N, out_h, out_w) + out = res.transpose(1, 0, 2, 3) + + # Be nice and return a contiguous array + # The old version of conv_forward_fast doesn't do this, so for a fair + # comparison we won't either + out = np.ascontiguousarray(out) + + cache = (x, w, b, conv_param, x_cols) + return out, cache + + +def conv_backward_strides(dout, cache): + x, w, b, conv_param, x_cols = cache + stride, pad = conv_param['stride'], conv_param['pad'] + + N, C, H, W = x.shape + F, _, HH, WW = w.shape + _, _, out_h, out_w = dout.shape + + db = np.sum(dout, axis=(0, 2, 3)) + + dout_reshaped = dout.transpose(1, 0, 2, 3).reshape(F, -1) + dw = dout_reshaped.dot(x_cols.T).reshape(w.shape) + + dx_cols = w.reshape(F, -1).T.dot(dout_reshaped) + dx_cols.shape = (C, HH, WW, N, out_h, out_w) + dx = col2im_6d_cython(dx_cols, N, C, H, W, HH, WW, pad, stride) + + return dx, dw, db + + +def conv_backward_im2col(dout, cache): + """ + A fast implementation of the backward pass for a convolutional layer + based on im2col and col2im. 
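+
+  The cached x_cols matrix from the forward pass is reused here: dw is a
+  single matrix multiply of the reshaped upstream gradient with x_cols.T, and
+  dx is recovered by scattering the columns of dx_cols back into image form
+  with col2im.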
+ """ + x, w, b, conv_param, x_cols = cache + stride, pad = conv_param['stride'], conv_param['pad'] + + db = np.sum(dout, axis=(0, 2, 3)) + + num_filters, _, filter_height, filter_width = w.shape + dout_reshaped = dout.transpose(1, 2, 3, 0).reshape(num_filters, -1) + dw = dout_reshaped.dot(x_cols.T).reshape(w.shape) + + dx_cols = w.reshape(num_filters, -1).T.dot(dout_reshaped) + # dx = col2im_indices(dx_cols, x.shape, filter_height, filter_width, pad, stride) + dx = col2im_cython(dx_cols, x.shape[0], x.shape[1], x.shape[2], x.shape[3], + filter_height, filter_width, pad, stride) + + return dx, dw, db + + +conv_forward_fast = conv_forward_strides +conv_backward_fast = conv_backward_strides + + +def max_pool_forward_fast(x, pool_param): + """ + A fast implementation of the forward pass for a max pooling layer. + + This chooses between the reshape method and the im2col method. If the pooling + regions are square and tile the input image, then we can use the reshape + method which is very fast. Otherwise we fall back on the im2col method, which + is not much faster than the naive method. + """ + N, C, H, W = x.shape + pool_height, pool_width = pool_param['pool_height'], pool_param['pool_width'] + stride = pool_param['stride'] + + same_size = pool_height == pool_width == stride + tiles = H % pool_height == 0 and W % pool_width == 0 + if same_size and tiles: + out, reshape_cache = max_pool_forward_reshape(x, pool_param) + cache = ('reshape', reshape_cache) + else: + out, im2col_cache = max_pool_forward_im2col(x, pool_param) + cache = ('im2col', im2col_cache) + return out, cache + + +def max_pool_backward_fast(dout, cache): + """ + A fast implementation of the backward pass for a max pooling layer. + + This switches between the reshape method an the im2col method depending on + which method was used to generate the cache. + """ + method, real_cache = cache + if method == 'reshape': + return max_pool_backward_reshape(dout, real_cache) + elif method == 'im2col': + return max_pool_backward_im2col(dout, real_cache) + else: + raise ValueError('Unrecognized method "%s"' % method) + + +def max_pool_forward_reshape(x, pool_param): + """ + A fast implementation of the forward pass for the max pooling layer that uses + some clever reshaping. + + This can only be used for square pooling regions that tile the input. + """ + N, C, H, W = x.shape + pool_height, pool_width = pool_param['pool_height'], pool_param['pool_width'] + stride = pool_param['stride'] + assert pool_height == pool_width == stride, 'Invalid pool params' + assert H % pool_height == 0 + assert W % pool_height == 0 + x_reshaped = x.reshape(N, C, H / pool_height, pool_height, + W / pool_width, pool_width) + out = x_reshaped.max(axis=3).max(axis=4) + + cache = (x, x_reshaped, out) + return out, cache + + +def max_pool_backward_reshape(dout, cache): + """ + A fast implementation of the backward pass for the max pooling layer that + uses some clever broadcasting and reshaping. + + This can only be used if the forward pass was computed using + max_pool_forward_reshape. + + NOTE: If there are multiple argmaxes, this method will assign gradient to + ALL argmax elements of the input rather than picking one. In this case the + gradient will actually be incorrect. However this is unlikely to occur in + practice, so it shouldn't matter much. One possible solution is to split the + upstream gradient equally among all argmax elements; this should result in a + valid subgradient. 
You can make this happen by uncommenting the line below; + however this results in a significant performance penalty (about 40% slower) + and is unlikely to matter in practice so we don't do it. + """ + x, x_reshaped, out = cache + + dx_reshaped = np.zeros_like(x_reshaped) + out_newaxis = out[:, :, :, np.newaxis, :, np.newaxis] + mask = (x_reshaped == out_newaxis) + dout_newaxis = dout[:, :, :, np.newaxis, :, np.newaxis] + dout_broadcast, _ = np.broadcast_arrays(dout_newaxis, dx_reshaped) + dx_reshaped[mask] = dout_broadcast[mask] + dx_reshaped /= np.sum(mask, axis=(3, 5), keepdims=True) + dx = dx_reshaped.reshape(x.shape) + + return dx + + +def max_pool_forward_im2col(x, pool_param): + """ + An implementation of the forward pass for max pooling based on im2col. + + This isn't much faster than the naive version, so it should be avoided if + possible. + """ + N, C, H, W = x.shape + pool_height, pool_width = pool_param['pool_height'], pool_param['pool_width'] + stride = pool_param['stride'] + + assert (H - pool_height) % stride == 0, 'Invalid height' + assert (W - pool_width) % stride == 0, 'Invalid width' + + out_height = (H - pool_height) / stride + 1 + out_width = (W - pool_width) / stride + 1 + + x_split = x.reshape(N * C, 1, H, W) + x_cols = im2col(x_split, pool_height, pool_width, padding=0, stride=stride) + x_cols_argmax = np.argmax(x_cols, axis=0) + x_cols_max = x_cols[x_cols_argmax, np.arange(x_cols.shape[1])] + out = x_cols_max.reshape(out_height, out_width, N, C).transpose(2, 3, 0, 1) + + cache = (x, x_cols, x_cols_argmax, pool_param) + return out, cache + + +def max_pool_backward_im2col(dout, cache): + """ + An implementation of the backward pass for max pooling based on im2col. + + This isn't much faster than the naive version, so it should be avoided if + possible. 
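+
+  (Note: this path only runs when the corresponding forward pass used im2col,
+  i.e. when the pooling regions were not square tiles of the input.)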
+ """ + x, x_cols, x_cols_argmax, pool_param = cache + N, C, H, W = x.shape + pool_height, pool_width = pool_param['pool_height'], pool_param['pool_width'] + stride = pool_param['stride'] + + dout_reshaped = dout.transpose(2, 3, 0, 1).flatten() + dx_cols = np.zeros_like(x_cols) + dx_cols[x_cols_argmax, np.arange(dx_cols.shape[1])] = dout_reshaped + dx = col2im_indices(dx_cols, (N * C, 1, H, W), pool_height, pool_width, + padding=0, stride=stride) + dx = dx.reshape(x.shape) + + return dx diff --git a/assignments2016/assignment3/cs231n/gradient_check.py b/assignments2016/assignment3/cs231n/gradient_check.py new file mode 100644 index 00000000..2d6b1f62 --- /dev/null +++ b/assignments2016/assignment3/cs231n/gradient_check.py @@ -0,0 +1,124 @@ +import numpy as np +from random import randrange + +def eval_numerical_gradient(f, x, verbose=True, h=0.00001): + """ + a naive implementation of numerical gradient of f at x + - f should be a function that takes a single argument + - x is the point (numpy array) to evaluate the gradient at + """ + + fx = f(x) # evaluate function value at original point + grad = np.zeros_like(x) + # iterate over all indexes in x + it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite']) + while not it.finished: + + # evaluate function at x+h + ix = it.multi_index + oldval = x[ix] + x[ix] = oldval + h # increment by h + fxph = f(x) # evalute f(x + h) + x[ix] = oldval - h + fxmh = f(x) # evaluate f(x - h) + x[ix] = oldval # restore + + # compute the partial derivative with centered formula + grad[ix] = (fxph - fxmh) / (2 * h) # the slope + if verbose: + print ix, grad[ix] + it.iternext() # step to next dimension + + return grad + + +def eval_numerical_gradient_array(f, x, df, h=1e-5): + """ + Evaluate a numeric gradient for a function that accepts a numpy + array and returns a numpy array. + """ + grad = np.zeros_like(x) + it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite']) + while not it.finished: + ix = it.multi_index + + oldval = x[ix] + x[ix] = oldval + h + pos = f(x).copy() + x[ix] = oldval - h + neg = f(x).copy() + x[ix] = oldval + + grad[ix] = np.sum((pos - neg) * df) / (2 * h) + it.iternext() + return grad + + +def eval_numerical_gradient_blobs(f, inputs, output, h=1e-5): + """ + Compute numeric gradients for a function that operates on input + and output blobs. + + We assume that f accepts several input blobs as arguments, followed by a blob + into which outputs will be written. For example, f might be called like this: + + f(x, w, out) + + where x and w are input Blobs, and the result of f will be written to out. 
+ + Inputs: + - f: function + - inputs: tuple of input blobs + - output: output blob + - h: step size + """ + numeric_diffs = [] + for input_blob in inputs: + diff = np.zeros_like(input_blob.diffs) + it = np.nditer(input_blob.vals, flags=['multi_index'], + op_flags=['readwrite']) + while not it.finished: + idx = it.multi_index + orig = input_blob.vals[idx] + + input_blob.vals[idx] = orig + h + f(*(inputs + (output,))) + pos = np.copy(output.vals) + input_blob.vals[idx] = orig - h + f(*(inputs + (output,))) + neg = np.copy(output.vals) + input_blob.vals[idx] = orig + + diff[idx] = np.sum((pos - neg) * output.diffs) / (2.0 * h) + + it.iternext() + numeric_diffs.append(diff) + return numeric_diffs + + +def eval_numerical_gradient_net(net, inputs, output, h=1e-5): + return eval_numerical_gradient_blobs(lambda *args: net.forward(), + inputs, output, h=h) + + +def grad_check_sparse(f, x, analytic_grad, num_checks=10, h=1e-5): + """ + sample a few random elements and only return numerical + in this dimensions. + """ + + for i in xrange(num_checks): + ix = tuple([randrange(m) for m in x.shape]) + + oldval = x[ix] + x[ix] = oldval + h # increment by h + fxph = f(x) # evaluate f(x + h) + x[ix] = oldval - h # increment by h + fxmh = f(x) # evaluate f(x - h) + x[ix] = oldval # reset + + grad_numerical = (fxph - fxmh) / (2 * h) + grad_analytic = analytic_grad[ix] + rel_error = abs(grad_numerical - grad_analytic) / (abs(grad_numerical) + abs(grad_analytic)) + print 'numerical: %f analytic: %f, relative error: %e' % (grad_numerical, grad_analytic, rel_error) + diff --git a/assignments2016/assignment3/cs231n/im2col.py b/assignments2016/assignment3/cs231n/im2col.py new file mode 100644 index 00000000..1942eab6 --- /dev/null +++ b/assignments2016/assignment3/cs231n/im2col.py @@ -0,0 +1,55 @@ +import numpy as np + + +def get_im2col_indices(x_shape, field_height, field_width, padding=1, stride=1): + # First figure out what the size of the output should be + N, C, H, W = x_shape + assert (H + 2 * padding - field_height) % stride == 0 + assert (W + 2 * padding - field_height) % stride == 0 + out_height = (H + 2 * padding - field_height) / stride + 1 + out_width = (W + 2 * padding - field_width) / stride + 1 + + i0 = np.repeat(np.arange(field_height), field_width) + i0 = np.tile(i0, C) + i1 = stride * np.repeat(np.arange(out_height), out_width) + j0 = np.tile(np.arange(field_width), field_height * C) + j1 = stride * np.tile(np.arange(out_width), out_height) + i = i0.reshape(-1, 1) + i1.reshape(1, -1) + j = j0.reshape(-1, 1) + j1.reshape(1, -1) + + k = np.repeat(np.arange(C), field_height * field_width).reshape(-1, 1) + + return (k, i, j) + + +def im2col_indices(x, field_height, field_width, padding=1, stride=1): + """ An implementation of im2col based on some fancy indexing """ + # Zero-pad the input + p = padding + x_padded = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)), mode='constant') + + k, i, j = get_im2col_indices(x.shape, field_height, field_width, padding, + stride) + + cols = x_padded[:, k, i, j] + C = x.shape[1] + cols = cols.transpose(1, 2, 0).reshape(field_height * field_width * C, -1) + return cols + + +def col2im_indices(cols, x_shape, field_height=3, field_width=3, padding=1, + stride=1): + """ An implementation of col2im based on fancy indexing and np.add.at """ + N, C, H, W = x_shape + H_padded, W_padded = H + 2 * padding, W + 2 * padding + x_padded = np.zeros((N, C, H_padded, W_padded), dtype=cols.dtype) + k, i, j = get_im2col_indices(x_shape, field_height, field_width, padding, + stride) + 
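+  # k, i, j index the channel, row, and column of every element of every
+  # receptive field; np.add.at below accumulates the (possibly overlapping)
+  # patches back into the padded image.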
cols_reshaped = cols.reshape(C * field_height * field_width, -1, N) + cols_reshaped = cols_reshaped.transpose(2, 0, 1) + np.add.at(x_padded, (slice(None), k, i, j), cols_reshaped) + if padding == 0: + return x_padded + return x_padded[:, :, padding:-padding, padding:-padding] + +pass diff --git a/assignments2016/assignment3/cs231n/im2col_cython.pyx b/assignments2016/assignment3/cs231n/im2col_cython.pyx new file mode 100644 index 00000000..d6e33c6f --- /dev/null +++ b/assignments2016/assignment3/cs231n/im2col_cython.pyx @@ -0,0 +1,121 @@ +import numpy as np +cimport numpy as np +cimport cython + +# DTYPE = np.float64 +# ctypedef np.float64_t DTYPE_t + +ctypedef fused DTYPE_t: + np.float32_t + np.float64_t + +def im2col_cython(np.ndarray[DTYPE_t, ndim=4] x, int field_height, + int field_width, int padding, int stride): + cdef int N = x.shape[0] + cdef int C = x.shape[1] + cdef int H = x.shape[2] + cdef int W = x.shape[3] + + cdef int HH = (H + 2 * padding - field_height) / stride + 1 + cdef int WW = (W + 2 * padding - field_width) / stride + 1 + + cdef int p = padding + cdef np.ndarray[DTYPE_t, ndim=4] x_padded = np.pad(x, + ((0, 0), (0, 0), (p, p), (p, p)), mode='constant') + + cdef np.ndarray[DTYPE_t, ndim=2] cols = np.zeros( + (C * field_height * field_width, N * HH * WW), + dtype=x.dtype) + + # Moving the inner loop to a C function with no bounds checking works, but does + # not seem to help performance in any measurable way. + + im2col_cython_inner(cols, x_padded, N, C, H, W, HH, WW, + field_height, field_width, padding, stride) + return cols + + +@cython.boundscheck(False) +cdef int im2col_cython_inner(np.ndarray[DTYPE_t, ndim=2] cols, + np.ndarray[DTYPE_t, ndim=4] x_padded, + int N, int C, int H, int W, int HH, int WW, + int field_height, int field_width, int padding, int stride) except? -1: + cdef int c, ii, jj, row, yy, xx, i, col + + for c in range(C): + for yy in range(HH): + for xx in range(WW): + for ii in range(field_height): + for jj in range(field_width): + row = c * field_width * field_height + ii * field_height + jj + for i in range(N): + col = yy * WW * N + xx * N + i + cols[row, col] = x_padded[i, c, stride * yy + ii, stride * xx + jj] + + + +def col2im_cython(np.ndarray[DTYPE_t, ndim=2] cols, int N, int C, int H, int W, + int field_height, int field_width, int padding, int stride): + cdef np.ndarray x = np.empty((N, C, H, W), dtype=cols.dtype) + cdef int HH = (H + 2 * padding - field_height) / stride + 1 + cdef int WW = (W + 2 * padding - field_width) / stride + 1 + cdef np.ndarray[DTYPE_t, ndim=4] x_padded = np.zeros((N, C, H + 2 * padding, W + 2 * padding), + dtype=cols.dtype) + + # Moving the inner loop to a C-function with no bounds checking improves + # performance quite a bit for col2im. + col2im_cython_inner(cols, x_padded, N, C, H, W, HH, WW, + field_height, field_width, padding, stride) + if padding > 0: + return x_padded[:, :, padding:-padding, padding:-padding] + return x_padded + + +@cython.boundscheck(False) +cdef int col2im_cython_inner(np.ndarray[DTYPE_t, ndim=2] cols, + np.ndarray[DTYPE_t, ndim=4] x_padded, + int N, int C, int H, int W, int HH, int WW, + int field_height, int field_width, int padding, int stride) except? 
-1: + cdef int c, ii, jj, row, yy, xx, i, col + + for c in range(C): + for ii in range(field_height): + for jj in range(field_width): + row = c * field_width * field_height + ii * field_height + jj + for yy in range(HH): + for xx in range(WW): + for i in range(N): + col = yy * WW * N + xx * N + i + x_padded[i, c, stride * yy + ii, stride * xx + jj] += cols[row, col] + + +@cython.boundscheck(False) +@cython.wraparound(False) +cdef col2im_6d_cython_inner(np.ndarray[DTYPE_t, ndim=6] cols, + np.ndarray[DTYPE_t, ndim=4] x_padded, + int N, int C, int H, int W, int HH, int WW, + int out_h, int out_w, int pad, int stride): + + cdef int c, hh, ww, n, h, w + for n in range(N): + for c in range(C): + for hh in range(HH): + for ww in range(WW): + for h in range(out_h): + for w in range(out_w): + x_padded[n, c, stride * h + hh, stride * w + ww] += cols[c, hh, ww, n, h, w] + + +def col2im_6d_cython(np.ndarray[DTYPE_t, ndim=6] cols, int N, int C, int H, int W, + int HH, int WW, int pad, int stride): + cdef np.ndarray x = np.empty((N, C, H, W), dtype=cols.dtype) + cdef int out_h = (H + 2 * pad - HH) / stride + 1 + cdef int out_w = (W + 2 * pad - WW) / stride + 1 + cdef np.ndarray[DTYPE_t, ndim=4] x_padded = np.zeros((N, C, H + 2 * pad, W + 2 * pad), + dtype=cols.dtype) + + col2im_6d_cython_inner(cols, x_padded, N, C, H, W, HH, WW, out_h, out_w, pad, stride) + + if pad > 0: + return x_padded[:, :, pad:-pad, pad:-pad] + return x_padded diff --git a/assignments2016/assignment3/cs231n/image_utils.py b/assignments2016/assignment3/cs231n/image_utils.py new file mode 100644 index 00000000..300ffb66 --- /dev/null +++ b/assignments2016/assignment3/cs231n/image_utils.py @@ -0,0 +1,98 @@ +import urllib2, os, tempfile + +import numpy as np +from scipy.misc import imread + +from cs231n.fast_layers import conv_forward_fast + + +""" +Utility functions used for viewing and processing images. +""" + + +def blur_image(X): + """ + A very gentle image blurring operation, to be used as a regularizer for image + generation. 
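+
+  The blur is a 3x3 convolution applied independently to each color channel;
+  the kernel is strongly peaked at the center, so each output pixel is mostly
+  its input value plus a small contribution from its neighbors.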
+
+  Inputs:
+  - X: Image data of shape (N, 3, H, W)
+
+  Returns:
+  - X_blur: Blurred version of X, of shape (N, 3, H, W)
+  """
+  w_blur = np.zeros((3, 3, 3, 3))
+  b_blur = np.zeros(3)
+  blur_param = {'stride': 1, 'pad': 1}
+  for i in xrange(3):
+    w_blur[i, i] = np.asarray([[1, 2, 1], [2, 188, 2], [1, 2, 1]], dtype=np.float32)
+  w_blur /= 200.0
+  return conv_forward_fast(X, w_blur, b_blur, blur_param)[0]
+
+
+def preprocess_image(img, mean_img, mean='image'):
+  """
+  Convert to float, transpose, and subtract mean pixel
+
+  Input:
+  - img: (H, W, 3)
+
+  Returns:
+  - (1, 3, H, W)
+  """
+  if mean == 'image':
+    mean = mean_img
+  elif mean == 'pixel':
+    mean = mean_img.mean(axis=(1, 2), keepdims=True)
+  elif mean == 'none':
+    mean = 0
+  else:
+    raise ValueError('mean must be image or pixel or none')
+  return img.astype(np.float32).transpose(2, 0, 1)[None] - mean
+
+
+def deprocess_image(img, mean_img, mean='image', renorm=False):
+  """
+  Add mean pixel, transpose, and convert to uint8
+
+  Input:
+  - (1, 3, H, W) or (3, H, W)
+
+  Returns:
+  - (H, W, 3)
+  """
+  if mean == 'image':
+    mean = mean_img
+  elif mean == 'pixel':
+    mean = mean_img.mean(axis=(1, 2), keepdims=True)
+  elif mean == 'none':
+    mean = 0
+  else:
+    raise ValueError('mean must be image or pixel or none')
+  if img.ndim == 3:
+    img = img[None]
+  img = (img + mean)[0].transpose(1, 2, 0)
+  if renorm:
+    low, high = img.min(), img.max()
+    img = 255.0 * (img - low) / (high - low)
+  return img.astype(np.uint8)
+
+
+def image_from_url(url):
+  """
+  Read an image from a URL. Returns a numpy array with the pixel data.
+  We write the image to a temporary file then read it back. Kinda gross.
+  """
+  try:
+    f = urllib2.urlopen(url)
+    _, fname = tempfile.mkstemp()
+    with open(fname, 'wb') as ff:
+      ff.write(f.read())
+    img = imread(fname)
+    os.remove(fname)
+    return img
+  # HTTPError is a subclass of URLError, so catch it first.
+  except urllib2.HTTPError as e:
+    print 'HTTP Error: ', e.code, url
+  except urllib2.URLError as e:
+    print 'URL Error: ', e.reason, url
diff --git a/assignments2016/assignment3/cs231n/layer_utils.py b/assignments2016/assignment3/cs231n/layer_utils.py
new file mode 100644
index 00000000..0a04d333
--- /dev/null
+++ b/assignments2016/assignment3/cs231n/layer_utils.py
@@ -0,0 +1,141 @@
+from cs231n.layers import *
+from cs231n.fast_layers import *
+
+
+def affine_relu_forward(x, w, b):
+  """
+  Convenience layer that performs an affine transform followed by a ReLU
+
+  Inputs:
+  - x: Input to the affine layer
+  - w, b: Weights for the affine layer
+
+  Returns a tuple of:
+  - out: Output from the ReLU
+  - cache: Object to give to the backward pass
+  """
+  a, fc_cache = affine_forward(x, w, b)
+  out, relu_cache = relu_forward(a)
+  cache = (fc_cache, relu_cache)
+  return out, cache
+
+
+def affine_relu_backward(dout, cache):
+  """
+  Backward pass for the affine-relu convenience layer
+  """
+  fc_cache, relu_cache = cache
+  da = relu_backward(dout, relu_cache)
+  dx, dw, db = affine_backward(da, fc_cache)
+  return dx, dw, db
+
+
+def affine_bn_relu_forward(x, w, b, gamma, beta, bn_param):
+  """
+  Convenience layer that performs an affine transform, batch normalization,
+  and ReLU.
+
+  Inputs:
+  - x: Array of shape (N, D1); input to the affine layer
+  - w, b: Arrays of shape (D1, D2) and (D2,) giving the weight and bias for
+    the affine transform.
+  - gamma, beta: Arrays of shape (D2,) and (D2,) giving scale and shift
+    parameters for batch normalization.
+  - bn_param: Dictionary of parameters for batch normalization.
+ + Returns: + - out: Output from ReLU, of shape (N, D2) + - cache: Object to give to the backward pass. + """ + a, fc_cache = affine_forward(x, w, b) + a_bn, bn_cache = batchnorm_forward(a, gamma, beta, bn_param) + out, relu_cache = relu_forward(a_bn) + cache = (fc_cache, bn_cache, relu_cache) + return out, cache + + +def affine_bn_relu_backward(dout, cache): + """ + Backward pass for the affine-batchnorm-relu convenience layer. + """ + fc_cache, bn_cache, relu_cache = cache + da_bn = relu_backward(dout, relu_cache) + da, dgamma, dbeta = batchnorm_backward(da_bn, bn_cache) + dx, dw, db = affine_backward(da, fc_cache) + return dx, dw, db, dgamma, dbeta + + +def conv_relu_forward(x, w, b, conv_param): + """ + A convenience layer that performs a convolution followed by a ReLU. + + Inputs: + - x: Input to the convolutional layer + - w, b, conv_param: Weights and parameters for the convolutional layer + + Returns a tuple of: + - out: Output from the ReLU + - cache: Object to give to the backward pass + """ + a, conv_cache = conv_forward_fast(x, w, b, conv_param) + out, relu_cache = relu_forward(a) + cache = (conv_cache, relu_cache) + return out, cache + + +def conv_relu_backward(dout, cache): + """ + Backward pass for the conv-relu convenience layer. + """ + conv_cache, relu_cache = cache + da = relu_backward(dout, relu_cache) + dx, dw, db = conv_backward_fast(da, conv_cache) + return dx, dw, db + + +def conv_bn_relu_forward(x, w, b, gamma, beta, conv_param, bn_param): + a, conv_cache = conv_forward_fast(x, w, b, conv_param) + an, bn_cache = spatial_batchnorm_forward(a, gamma, beta, bn_param) + out, relu_cache = relu_forward(an) + cache = (conv_cache, bn_cache, relu_cache) + return out, cache + + +def conv_bn_relu_backward(dout, cache): + conv_cache, bn_cache, relu_cache = cache + dan = relu_backward(dout, relu_cache) + da, dgamma, dbeta = spatial_batchnorm_backward(dan, bn_cache) + dx, dw, db = conv_backward_fast(da, conv_cache) + return dx, dw, db, dgamma, dbeta + + +def conv_relu_pool_forward(x, w, b, conv_param, pool_param): + """ + Convenience layer that performs a convolution, a ReLU, and a pool. + + Inputs: + - x: Input to the convolutional layer + - w, b, conv_param: Weights and parameters for the convolutional layer + - pool_param: Parameters for the pooling layer + + Returns a tuple of: + - out: Output from the pooling layer + - cache: Object to give to the backward pass + """ + a, conv_cache = conv_forward_fast(x, w, b, conv_param) + s, relu_cache = relu_forward(a) + out, pool_cache = max_pool_forward_fast(s, pool_param) + cache = (conv_cache, relu_cache, pool_cache) + return out, cache + + +def conv_relu_pool_backward(dout, cache): + """ + Backward pass for the conv-relu-pool convenience layer + """ + conv_cache, relu_cache, pool_cache = cache + ds = max_pool_backward_fast(dout, pool_cache) + da = relu_backward(ds, relu_cache) + dx, dw, db = conv_backward_fast(da, conv_cache) + return dx, dw, db + diff --git a/assignments2016/assignment3/cs231n/layers.py b/assignments2016/assignment3/cs231n/layers.py new file mode 100644 index 00000000..9fc9b80f --- /dev/null +++ b/assignments2016/assignment3/cs231n/layers.py @@ -0,0 +1,302 @@ +import numpy as np + + +def affine_forward(x, w, b): + """ + Computes the forward pass for an affine (fully-connected) layer. + + The input x has shape (N, d_1, ..., d_k) where x[i] is the ith input. 
+ We multiply this against a weight matrix of shape (D, M) where + D = \prod_i d_i + + Inputs: + x - Input data, of shape (N, d_1, ..., d_k) + w - Weights, of shape (D, M) + b - Biases, of shape (M,) + + Returns a tuple of: + - out: output, of shape (N, M) + - cache: (x, w, b) + """ + out = x.reshape(x.shape[0], -1).dot(w) + b + cache = (x, w, b) + return out, cache + + +def affine_backward(dout, cache): + """ + Computes the backward pass for an affine layer. + + Inputs: + - dout: Upstream derivative, of shape (N, M) + - cache: Tuple of: + - x: Input data, of shape (N, d_1, ... d_k) + - w: Weights, of shape (D, M) + + Returns a tuple of: + - dx: Gradient with respect to x, of shape (N, d1, ..., d_k) + - dw: Gradient with respect to w, of shape (D, M) + - db: Gradient with respect to b, of shape (M,) + """ + x, w, b = cache + dx = dout.dot(w.T).reshape(x.shape) + dw = x.reshape(x.shape[0], -1).T.dot(dout) + db = np.sum(dout, axis=0) + return dx, dw, db + + +def relu_forward(x): + """ + Computes the forward pass for a layer of rectified linear units (ReLUs). + + Input: + - x: Inputs, of any shape + + Returns a tuple of: + - out: Output, of the same shape as x + - cache: x + """ + out = np.maximum(0, x) + cache = x + return out, cache + + +def relu_backward(dout, cache): + """ + Computes the backward pass for a layer of rectified linear units (ReLUs). + + Input: + - dout: Upstream derivatives, of any shape + - cache: Input x, of same shape as dout + + Returns: + - dx: Gradient with respect to x + """ + x = cache + dx = np.where(x > 0, dout, 0) + return dx + + +def batchnorm_forward(x, gamma, beta, bn_param): + """ + Forward pass for batch normalization. + + During training the sample mean and (uncorrected) sample variance are + computed from minibatch statistics and used to normalize the incoming data. + During training we also keep an exponentially decaying running mean of the mean + and variance of each feature, and these averages are used to normalize data + at test-time. + + At each timestep we update the running averages for mean and variance using + an exponential decay based on the momentum parameter: + + running_mean = momentum * running_mean + (1 - momentum) * sample_mean + running_var = momentum * running_var + (1 - momentum) * sample_var + + Note that the batch normalization paper suggests a different test-time + behavior: they compute sample mean and variance for each feature using a + large number of training images rather than using a running average. For + this implementation we have chosen to use running averages instead since + they do not require an additional estimation step; the torch7 implementation + of batch normalization also uses running averages. + + Input: + - x: Data of shape (N, D) + - gamma: Scale parameter of shape (D,) + - beta: Shift paremeter of shape (D,) + - bn_param: Dictionary with the following keys: + - mode: 'train' or 'test'; required + - eps: Constant for numeric stability + - momentum: Constant for running mean / variance. 
+ - running_mean: Array of shape (D,) giving running mean of features + - running_var Array of shape (D,) giving running variance of features + + Returns a tuple of: + - out: of shape (N, D) + - cache: A tuple of values needed in the backward pass + """ + mode = bn_param['mode'] + eps = bn_param.get('eps', 1e-5) + momentum = bn_param.get('momentum', 0.9) + + N, D = x.shape + running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype)) + running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype)) + + out, cache = None, None + if mode == 'train': + # Compute output + mu = x.mean(axis=0) + xc = x - mu + var = np.mean(xc ** 2, axis=0) + std = np.sqrt(var + eps) + xn = xc / std + out = gamma * xn + beta + + cache = (mode, x, gamma, xc, std, xn, out) + + # Update running average of mean + running_mean *= momentum + running_mean += (1 - momentum) * mu + + # Update running average of variance + running_var *= momentum + running_var += (1 - momentum) * var + elif mode == 'test': + # Using running mean and variance to normalize + std = np.sqrt(running_var + eps) + xn = (x - running_mean) / std + out = gamma * xn + beta + cache = (mode, x, xn, gamma, beta, std) + else: + raise ValueError('Invalid forward batchnorm mode "%s"' % mode) + + # Store the updated running means back into bn_param + bn_param['running_mean'] = running_mean + bn_param['running_var'] = running_var + + return out, cache + + +def batchnorm_backward(dout, cache): + """ + Backward pass for batch normalization. + + For this implementation, you should write out a computation graph for + batch normalization on paper and propagate gradients backward through + intermediate nodes. + + Inputs: + - dout: Upstream derivatives, of shape (N, D) + - cache: Variable of intermediates from batchnorm_forward. + + Returns a tuple of: + - dx: Gradient with respect to inputs x, of shape (N, D) + - dgamma: Gradient with respect to scale parameter gamma, of shape (D,) + - dbeta: Gradient with respect to shift parameter beta, of shape (D,) + """ + mode = cache[0] + if mode == 'train': + mode, x, gamma, xc, std, xn, out = cache + + N = x.shape[0] + dbeta = dout.sum(axis=0) + dgamma = np.sum(xn * dout, axis=0) + dxn = gamma * dout + dxc = dxn / std + dstd = -np.sum((dxn * xc) / (std * std), axis=0) + dvar = 0.5 * dstd / std + dxc += (2.0 / N) * xc * dvar + dmu = np.sum(dxc, axis=0) + dx = dxc - dmu / N + elif mode == 'test': + mode, x, xn, gamma, beta, std = cache + dbeta = dout.sum(axis=0) + dgamma = np.sum(xn * dout, axis=0) + dxn = gamma * dout + dx = dxn / std + else: + raise ValueError(mode) + + return dx, dgamma, dbeta + + +def spatial_batchnorm_forward(x, gamma, beta, bn_param): + """ + Computes the forward pass for spatial batch normalization. + + Inputs: + - x: Input data of shape (N, C, H, W) + - gamma: Scale parameter, of shape (C,) + - beta: Shift parameter, of shape (C,) + - bn_param: Dictionary with the following keys: + - mode: 'train' or 'test'; required + - eps: Constant for numeric stability + - momentum: Constant for running mean / variance. momentum=0 means that + old information is discarded completely at every time step, while + momentum=1 means that new information is never incorporated. The + default of momentum=0.9 should work well in most situations. 
+    - running_mean: Array of shape (C,) giving running mean of features
+    - running_var: Array of shape (C,) giving running variance of features
+
+  Returns a tuple of:
+  - out: Output data, of shape (N, C, H, W)
+  - cache: Values needed for the backward pass
+  """
+  N, C, H, W = x.shape
+  x_flat = x.transpose(0, 2, 3, 1).reshape(-1, C)
+  out_flat, cache = batchnorm_forward(x_flat, gamma, beta, bn_param)
+  out = out_flat.reshape(N, H, W, C).transpose(0, 3, 1, 2)
+  return out, cache
+
+
+def spatial_batchnorm_backward(dout, cache):
+  """
+  Computes the backward pass for spatial batch normalization.
+
+  Inputs:
+  - dout: Upstream derivatives, of shape (N, C, H, W)
+  - cache: Values from the forward pass
+
+  Returns a tuple of:
+  - dx: Gradient with respect to inputs, of shape (N, C, H, W)
+  - dgamma: Gradient with respect to scale parameter, of shape (C,)
+  - dbeta: Gradient with respect to shift parameter, of shape (C,)
+  """
+  N, C, H, W = dout.shape
+  dout_flat = dout.transpose(0, 2, 3, 1).reshape(-1, C)
+  dx_flat, dgamma, dbeta = batchnorm_backward(dout_flat, cache)
+  dx = dx_flat.reshape(N, H, W, C).transpose(0, 3, 1, 2)
+  return dx, dgamma, dbeta
+
+
+def svm_loss(x, y):
+  """
+  Computes the loss and gradient for multiclass SVM classification.
+
+  Inputs:
+  - x: Input data, of shape (N, C) where x[i, j] is the score for the jth class
+    for the ith input.
+  - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
+    0 <= y[i] < C
+
+  Returns a tuple of:
+  - loss: Scalar giving the loss
+  - dx: Gradient of the loss with respect to x
+  """
+  N = x.shape[0]
+  correct_class_scores = x[np.arange(N), y]
+  margins = np.maximum(0, x - correct_class_scores[:, np.newaxis] + 1.0)
+  margins[np.arange(N), y] = 0
+  loss = np.sum(margins) / N
+  num_pos = np.sum(margins > 0, axis=1)
+  dx = np.zeros_like(x)
+  dx[margins > 0] = 1
+  dx[np.arange(N), y] -= num_pos
+  dx /= N
+  return loss, dx
+
+
+def softmax_loss(x, y):
+  """
+  Computes the loss and gradient for softmax classification.
+
+  Inputs:
+  - x: Input data, of shape (N, C) where x[i, j] is the score for the jth class
+    for the ith input.
+  - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
+    0 <= y[i] < C
+
+  Returns a tuple of:
+  - loss: Scalar giving the loss
+  - dx: Gradient of the loss with respect to x
+  """
+  probs = np.exp(x - np.max(x, axis=1, keepdims=True))
+  probs /= np.sum(probs, axis=1, keepdims=True)
+  N = x.shape[0]
+  loss = -np.sum(np.log(probs[np.arange(N), y])) / N
+  dx = probs.copy()
+  dx[np.arange(N), y] -= 1
+  dx /= N
+  return loss, dx
+
diff --git a/assignments2016/assignment3/cs231n/optim.py b/assignments2016/assignment3/cs231n/optim.py
new file mode 100644
index 00000000..210e716a
--- /dev/null
+++ b/assignments2016/assignment3/cs231n/optim.py
@@ -0,0 +1,85 @@
+import numpy as np
+
+"""
+This file implements various first-order update rules that are commonly used for
+training neural networks. Each update rule accepts current weights and the
+gradient of the loss with respect to those weights and produces the next set of
+weights. Each update rule has the same interface:
+
+def update(w, dw, config=None):
+
+Inputs:
+  - w: A numpy array giving the current weights.
+  - dw: A numpy array of the same shape as w giving the gradient of the
+    loss with respect to w.
+  - config: A dictionary containing hyperparameter values such as learning rate,
+    momentum, etc. If the update rule requires caching values over many
+    iterations, then config will also hold these cached values.
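+
+For example, a single update step looks like this (an illustrative sketch;
+w and dw stand for any parameter array and its gradient):
+
+  w, config = sgd(w, dw, config)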
+ +Returns: + - next_w: The next point after the update. + - config: The config dictionary to be passed to the next iteration of the + update rule. + +NOTE: For most update rules, the default learning rate will probably not perform +well; however the default values of the other hyperparameters should work well +for a variety of different problems. + +For efficiency, update rules may perform in-place updates, mutating w and +setting next_w equal to w. +""" + + +def sgd(w, dw, config=None): + """ + Performs vanilla stochastic gradient descent. + + config format: + - learning_rate: Scalar learning rate. + """ + if config is None: config = {} + config.setdefault('learning_rate', 1e-2) + + w -= config['learning_rate'] * dw + return w, config + + +def adam(x, dx, config=None): + """ + Uses the Adam update rule, which incorporates moving averages of both the + gradient and its square and a bias correction term. + + config format: + - learning_rate: Scalar learning rate. + - beta1: Decay rate for moving average of first moment of gradient. + - beta2: Decay rate for moving average of second moment of gradient. + - epsilon: Small scalar used for smoothing to avoid dividing by zero. + - m: Moving average of gradient. + - v: Moving average of squared gradient. + - t: Iteration number. + """ + if config is None: config = {} + config.setdefault('learning_rate', 1e-3) + config.setdefault('beta1', 0.9) + config.setdefault('beta2', 0.999) + config.setdefault('epsilon', 1e-8) + config.setdefault('m', np.zeros_like(x)) + config.setdefault('v', np.zeros_like(x)) + config.setdefault('t', 0) + + next_x = None + beta1, beta2, eps = config['beta1'], config['beta2'], config['epsilon'] + t, m, v = config['t'], config['m'], config['v'] + m = beta1 * m + (1 - beta1) * dx + v = beta2 * v + (1 - beta2) * (dx * dx) + t += 1 + alpha = config['learning_rate'] * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t) + x -= alpha * (m / (np.sqrt(v) + eps)) + config['t'] = t + config['m'] = m + config['v'] = v + next_x = x + + return next_x, config + + diff --git a/assignments2016/assignment3/cs231n/rnn_layers.py b/assignments2016/assignment3/cs231n/rnn_layers.py new file mode 100644 index 00000000..d2ce0fe0 --- /dev/null +++ b/assignments2016/assignment3/cs231n/rnn_layers.py @@ -0,0 +1,420 @@ +import numpy as np + + +""" +This file defines layer types that are commonly used for recurrent neural +networks. +""" + + +def rnn_step_forward(x, prev_h, Wx, Wh, b): + """ + Run the forward pass for a single timestep of a vanilla RNN that uses a tanh + activation function. + + The input data has dimension D, the hidden state has dimension H, and we use + a minibatch size of N. + + Inputs: + - x: Input data for this timestep, of shape (N, D). + - prev_h: Hidden state from previous timestep, of shape (N, H) + - Wx: Weight matrix for input-to-hidden connections, of shape (D, H) + - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H) + - b: Biases of shape (H,) + + Returns a tuple of: + - next_h: Next hidden state, of shape (N, H) + - cache: Tuple of values needed for the backward pass. + """ + next_h, cache = None, None + ############################################################################## + # TODO: Implement a single forward step for the vanilla RNN. Store the next # + # hidden state and any values you need for the backward pass in the next_h # + # and cache variables respectively. 
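#
+  # As a hint, one possible implementation (an assumption, not the only      #
+  # valid solution) is:                                                      #
+  #     next_h = np.tanh(x.dot(Wx) + prev_h.dot(Wh) + b)                     #
+  #     cache = (x, prev_h, Wx, Wh, next_h)                                  #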
+  ##############################################################################
+  pass
+  ##############################################################################
+  #                               END OF YOUR CODE                             #
+  ##############################################################################
+  return next_h, cache
+
+
+def rnn_step_backward(dnext_h, cache):
+  """
+  Backward pass for a single timestep of a vanilla RNN.
+
+  Inputs:
+  - dnext_h: Gradient of loss with respect to next hidden state
+  - cache: Cache object from the forward pass
+
+  Returns a tuple of:
+  - dx: Gradients of input data, of shape (N, D)
+  - dprev_h: Gradients of previous hidden state, of shape (N, H)
+  - dWx: Gradients of input-to-hidden weights, of shape (D, H)
+  - dWh: Gradients of hidden-to-hidden weights, of shape (H, H)
+  - db: Gradients of bias vector, of shape (H,)
+  """
+  dx, dprev_h, dWx, dWh, db = None, None, None, None, None
+  ##############################################################################
+  # TODO: Implement the backward pass for a single step of a vanilla RNN.      #
+  #                                                                            #
+  # HINT: For the tanh function, you can compute the local derivative in terms #
+  # of the output value from tanh.                                             #
+  ##############################################################################
+  pass
+  ##############################################################################
+  #                               END OF YOUR CODE                             #
+  ##############################################################################
+  return dx, dprev_h, dWx, dWh, db
+
+
+def rnn_forward(x, h0, Wx, Wh, b):
+  """
+  Run a vanilla RNN forward on an entire sequence of data. We assume an input
+  sequence composed of T vectors, each of dimension D. The RNN uses a hidden
+  size of H, and we work over a minibatch containing N sequences. After running
+  the RNN forward, we return the hidden states for all timesteps.
+
+  Inputs:
+  - x: Input data for the entire timeseries, of shape (N, T, D).
+  - h0: Initial hidden state, of shape (N, H)
+  - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
+  - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
+  - b: Biases of shape (H,)
+
+  Returns a tuple of:
+  - h: Hidden states for the entire timeseries, of shape (N, T, H).
+  - cache: Values needed in the backward pass
+  """
+  h, cache = None, None
+  ##############################################################################
+  # TODO: Implement forward pass for a vanilla RNN running on a sequence of    #
+  # input data. You should use the rnn_step_forward function that you defined  #
+  # above.                                                                     #
+  ##############################################################################
+  pass
+  ##############################################################################
+  #                               END OF YOUR CODE                             #
+  ##############################################################################
+  return h, cache
+
+
+def rnn_backward(dh, cache):
+  """
+  Compute the backward pass for a vanilla RNN over an entire sequence of data.
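+
+  A typical strategy (an assumed sketch, not the required solution) is to
+  iterate backward over timesteps t = T-1, ..., 0, calling rnn_step_backward
+  at each step and summing the per-step dWx, dWh, and db contributions, since
+  the same weights are shared across all timesteps.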
+
+  Inputs:
+  - dh: Upstream gradients of all hidden states, of shape (N, T, H)
+  - cache: Values from the forward pass
+
+  Returns a tuple of:
+  - dx: Gradient of inputs, of shape (N, T, D)
+  - dh0: Gradient of initial hidden state, of shape (N, H)
+  - dWx: Gradient of input-to-hidden weights, of shape (D, H)
+  - dWh: Gradient of hidden-to-hidden weights, of shape (H, H)
+  - db: Gradient of biases, of shape (H,)
+  """
+  dx, dh0, dWx, dWh, db = None, None, None, None, None
+  ##############################################################################
+  # TODO: Implement the backward pass for a vanilla RNN running an entire      #
+  # sequence of data. You should use the rnn_step_backward function that you   #
+  # defined above.                                                             #
+  ##############################################################################
+  pass
+  ##############################################################################
+  #                               END OF YOUR CODE                             #
+  ##############################################################################
+  return dx, dh0, dWx, dWh, db
+
+
+def word_embedding_forward(x, W):
+  """
+  Forward pass for word embeddings. We operate on minibatches of size N where
+  each sequence has length T. We assume a vocabulary of V words, assigning each
+  to a vector of dimension D.
+
+  Inputs:
+  - x: Integer array of shape (N, T) giving indices of words. Each element idx
+    of x must be in the range 0 <= idx < V.
+  - W: Weight matrix of shape (V, D) giving word vectors for all words.
+
+  Returns a tuple of:
+  - out: Array of shape (N, T, D) giving word vectors for all input words.
+  - cache: Values needed for the backward pass
+  """
+  out, cache = None, None
+  ##############################################################################
+  # TODO: Implement the forward pass for word embeddings.                      #
+  #                                                                            #
+  # HINT: This should be very simple.                                          #
+  ##############################################################################
+  pass
+  ##############################################################################
+  #                               END OF YOUR CODE                             #
+  ##############################################################################
+  return out, cache
+
+
+def word_embedding_backward(dout, cache):
+  """
+  Backward pass for word embeddings. We cannot back-propagate into the words
+  since they are integers, so we only return gradient for the word embedding
+  matrix.
+
+  HINT: Look up the function np.add.at
+
+  Inputs:
+  - dout: Upstream gradients of shape (N, T, D)
+  - cache: Values from the forward pass
+
+  Returns:
+  - dW: Gradient of word embedding matrix, of shape (V, D).
+  """
+  dW = None
+  ##############################################################################
+  # TODO: Implement the backward pass for word embeddings.                     #
+  #                                                                            #
+  # HINT: Look up the function np.add.at                                       #
+  ##############################################################################
+  pass
+  ##############################################################################
+  #                               END OF YOUR CODE                             #
+  ##############################################################################
+  return dW
+
+
+def sigmoid(x):
+  """
+  A numerically stable version of the logistic sigmoid function.
+  """
+  pos_mask = (x >= 0)
+  neg_mask = (x < 0)
+  z = np.zeros_like(x)
+  z[pos_mask] = np.exp(-x[pos_mask])
+  z[neg_mask] = np.exp(x[neg_mask])
+  top = np.ones_like(x)
+  top[neg_mask] = z[neg_mask]
+  return top / (1 + z)
+
+
+def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
+  """
+  Forward pass for a single timestep of an LSTM.
+
+  The input data has dimension D, the hidden state has dimension H, and we use
+  a minibatch size of N.
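+
+  As a hint, the standard LSTM recurrence is shown below; the gate order in
+  the activation vector is a common convention, assumed here rather than
+  fixed by this interface:
+
+    a = x.dot(Wx) + prev_h.dot(Wh) + b               # shape (N, 4H)
+    i = sigmoid(a[:, :H])                            # input gate
+    f = sigmoid(a[:, H:2*H])                         # forget gate
+    o = sigmoid(a[:, 2*H:3*H])                       # output gate
+    g = np.tanh(a[:, 3*H:])                          # candidate cell values
+    next_c = f * prev_c + i * g
+    next_h = o * np.tanh(next_c)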
+
+  Inputs:
+  - x: Input data, of shape (N, D)
+  - prev_h: Previous hidden state, of shape (N, H)
+  - prev_c: Previous cell state, of shape (N, H)
+  - Wx: Input-to-hidden weights, of shape (D, 4H)
+  - Wh: Hidden-to-hidden weights, of shape (H, 4H)
+  - b: Biases, of shape (4H,)
+
+  Returns a tuple of:
+  - next_h: Next hidden state, of shape (N, H)
+  - next_c: Next cell state, of shape (N, H)
+  - cache: Tuple of values needed for backward pass.
+  """
+  next_h, next_c, cache = None, None, None
+  #############################################################################
+  # TODO: Implement the forward pass for a single timestep of an LSTM.        #
+  # You may want to use the numerically stable sigmoid implementation above.  #
+  #############################################################################
+  pass
+  ##############################################################################
+  #                               END OF YOUR CODE                             #
+  ##############################################################################
+
+  return next_h, next_c, cache
+
+
+def lstm_step_backward(dnext_h, dnext_c, cache):
+  """
+  Backward pass for a single timestep of an LSTM.
+
+  Inputs:
+  - dnext_h: Gradients of next hidden state, of shape (N, H)
+  - dnext_c: Gradients of next cell state, of shape (N, H)
+  - cache: Values from the forward pass
+
+  Returns a tuple of:
+  - dx: Gradient of input data, of shape (N, D)
+  - dprev_h: Gradient of previous hidden state, of shape (N, H)
+  - dprev_c: Gradient of previous cell state, of shape (N, H)
+  - dWx: Gradient of input-to-hidden weights, of shape (D, 4H)
+  - dWh: Gradient of hidden-to-hidden weights, of shape (H, 4H)
+  - db: Gradient of biases, of shape (4H,)
+  """
+  dx, dprev_h, dprev_c, dWx, dWh, db = None, None, None, None, None, None
+  #############################################################################
+  # TODO: Implement the backward pass for a single timestep of an LSTM.       #
+  #                                                                           #
+  # HINT: For sigmoid and tanh you can compute local derivatives in terms of  #
+  # the output value from the nonlinearity.                                   #
+  #############################################################################
+  pass
+  ##############################################################################
+  #                               END OF YOUR CODE                             #
+  ##############################################################################
+
+  return dx, dprev_h, dprev_c, dWx, dWh, db
+
+
+def lstm_forward(x, h0, Wx, Wh, b):
+  """
+  Forward pass for an LSTM over an entire sequence of data. We assume an input
+  sequence composed of T vectors, each of dimension D. The LSTM uses a hidden
+  size of H, and we work over a minibatch containing N sequences. After running
+  the LSTM forward, we return the hidden states for all timesteps.
+
+  Note that the initial hidden state is passed as input, but the initial cell
+  state is set to zero. Also note that the cell state is not returned; it is
+  an internal variable to the LSTM and is not accessed from outside.
+
+  Inputs:
+  - x: Input data of shape (N, T, D)
+  - h0: Initial hidden state of shape (N, H)
+  - Wx: Weights for input-to-hidden connections, of shape (D, 4H)
+  - Wh: Weights for hidden-to-hidden connections, of shape (H, 4H)
+  - b: Biases of shape (4H,)
+
+  Returns a tuple of:
+  - h: Hidden states for all timesteps of all sequences, of shape (N, T, H)
+  - cache: Values needed for the backward pass.
+  """
+  h, cache = None, None
+  #############################################################################
+  # TODO: Implement the forward pass for an LSTM over an entire timeseries.   #
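+  # (One assumed sketch: set prev_h = h0 and prev_c = np.zeros_like(h0),     #
+  # then loop over t, call lstm_step_forward, and stack each next_h into h.) #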
+  # You should use the lstm_step_forward function that you just defined.     #
+  #############################################################################
+  pass
+  ##############################################################################
+  #                               END OF YOUR CODE                             #
+  ##############################################################################
+
+  return h, cache
+
+
+def lstm_backward(dh, cache):
+  """
+  Backward pass for an LSTM over an entire sequence of data.
+
+  Inputs:
+  - dh: Upstream gradients of hidden states, of shape (N, T, H)
+  - cache: Values from the forward pass
+
+  Returns a tuple of:
+  - dx: Gradient of input data of shape (N, T, D)
+  - dh0: Gradient of initial hidden state of shape (N, H)
+  - dWx: Gradient of input-to-hidden weight matrix of shape (D, 4H)
+  - dWh: Gradient of hidden-to-hidden weight matrix of shape (H, 4H)
+  - db: Gradient of biases, of shape (4H,)
+  """
+  dx, dh0, dWx, dWh, db = None, None, None, None, None
+  #############################################################################
+  # TODO: Implement the backward pass for an LSTM over an entire timeseries.  #
+  # You should use the lstm_step_backward function that you just defined.     #
+  #############################################################################
+  pass
+  ##############################################################################
+  #                               END OF YOUR CODE                             #
+  ##############################################################################
+
+  return dx, dh0, dWx, dWh, db
+
+
+def temporal_affine_forward(x, w, b):
+  """
+  Forward pass for a temporal affine layer. The input is a set of D-dimensional
+  vectors arranged into a minibatch of N timeseries, each of length T. We use
+  an affine function to transform each of those vectors into a new vector of
+  dimension M.
+
+  Inputs:
+  - x: Input data of shape (N, T, D)
+  - w: Weights of shape (D, M)
+  - b: Biases of shape (M,)
+
+  Returns a tuple of:
+  - out: Output data of shape (N, T, M)
+  - cache: Values needed for the backward pass
+  """
+  N, T, D = x.shape
+  M = b.shape[0]
+  out = x.reshape(N * T, D).dot(w).reshape(N, T, M) + b
+  cache = x, w, b, out
+  return out, cache
+
+
+def temporal_affine_backward(dout, cache):
+  """
+  Backward pass for temporal affine layer.
+
+  Input:
+  - dout: Upstream gradients of shape (N, T, M)
+  - cache: Values from forward pass
+
+  Returns a tuple of:
+  - dx: Gradient of input, of shape (N, T, D)
+  - dw: Gradient of weights, of shape (D, M)
+  - db: Gradient of biases, of shape (M,)
+  """
+  x, w, b, out = cache
+  N, T, D = x.shape
+  M = b.shape[0]
+
+  dx = dout.reshape(N * T, M).dot(w.T).reshape(N, T, D)
+  dw = dout.reshape(N * T, M).T.dot(x.reshape(N * T, D)).T
+  db = dout.sum(axis=(0, 1))
+
+  return dx, dw, db
+
+
+def temporal_softmax_loss(x, y, mask, verbose=False):
+  """
+  A temporal version of softmax loss for use in RNNs. We assume that we are
+  making predictions over a vocabulary of size V for each timestep of a
+  timeseries of length T, over a minibatch of size N. The input x gives scores
+  for all vocabulary elements at all timesteps, and y gives the indices of the
+  ground-truth element at each timestep. We use a cross-entropy loss at each
+  timestep, summing the loss over all timesteps and averaging across the
+  minibatch.
+
+  As an additional complication, we may want to ignore the model output at some
+  timesteps, since sequences of different length may have been combined into a
+  minibatch and padded with NULL tokens.
+  The optional mask argument tells us which elements should contribute to the
+  loss.
+
+  Inputs:
+  - x: Input scores, of shape (N, T, V)
+  - y: Ground-truth indices, of shape (N, T) where each element is in the range
+    0 <= y[i, t] < V
+  - mask: Boolean array of shape (N, T) where mask[i, t] tells whether or not
+    the scores at x[i, t] should contribute to the loss.
+
+  Returns a tuple of:
+  - loss: Scalar giving loss
+  - dx: Gradient of loss with respect to scores x.
+  """
+
+  N, T, V = x.shape
+
+  x_flat = x.reshape(N * T, V)
+  y_flat = y.reshape(N * T)
+  mask_flat = mask.reshape(N * T)
+
+  probs = np.exp(x_flat - np.max(x_flat, axis=1, keepdims=True))
+  probs /= np.sum(probs, axis=1, keepdims=True)
+  loss = -np.sum(mask_flat * np.log(probs[np.arange(N * T), y_flat])) / N
+  dx_flat = probs.copy()
+  dx_flat[np.arange(N * T), y_flat] -= 1
+  dx_flat /= N
+  dx_flat *= mask_flat[:, None]
+
+  if verbose: print 'dx_flat: ', dx_flat.shape
+
+  dx = dx_flat.reshape(N, T, V)
+
+  return loss, dx
+
diff --git a/assignments2016/assignment3/cs231n/setup.py b/assignments2016/assignment3/cs231n/setup.py
new file mode 100644
index 00000000..9a2e6ca0
--- /dev/null
+++ b/assignments2016/assignment3/cs231n/setup.py
@@ -0,0 +1,14 @@
+from distutils.core import setup
+from distutils.extension import Extension
+from Cython.Build import cythonize
+import numpy
+
+extensions = [
+  Extension('im2col_cython', ['im2col_cython.pyx'],
+            include_dirs = [numpy.get_include()]
+  ),
+]
+
+setup(
+  ext_modules = cythonize(extensions),
+)
diff --git a/assignments2016/assignment3/frameworkpython b/assignments2016/assignment3/frameworkpython
new file mode 100755
index 00000000..a0fa5517
--- /dev/null
+++ b/assignments2016/assignment3/frameworkpython
@@ -0,0 +1,13 @@
+#!/bin/bash
+
+# what real Python executable to use
+PYVER=2.7
+PATHTOPYTHON=/usr/local/bin/
+PYTHON=${PATHTOPYTHON}python${PYVER}
+
+# find the root of the virtualenv, it should be the parent of the dir this script is in
+ENV=`$PYTHON -c "import os; print os.path.abspath(os.path.join(os.path.dirname(\"$0\"), '..'))"`
+
+# now run Python with the virtualenv set as Python's HOME
+export PYTHONHOME=$ENV
+exec $PYTHON "$@"
diff --git a/assignments2016/assignment3/kitten.jpg b/assignments2016/assignment3/kitten.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..e421ec1d98edef310d54970be25942ed4622fe60
Binary files /dev/null and b/assignments2016/assignment3/kitten.jpg differ
diff --git a/assignments2016/assignment3/requirements.txt b/assignments2016/assignment3/requirements.txt
new file mode 100644
index 00000000..3e6c302d
--- /dev/null
+++ b/assignments2016/assignment3/requirements.txt
@@ -0,0 +1,46 @@
+Cython==0.23.4
+Jinja2==2.8
+MarkupSafe==0.23
+Pillow==3.0.0
+Pygments==2.0.2
+appnope==0.1.0
+argparse==1.2.1
+backports-abc==0.4
+backports.ssl-match-hostname==3.5.0.1
+certifi==2015.11.20.1
+cycler==0.9.0
+decorator==4.0.6
+functools32==3.2.3-2
+gnureadline==6.3.3
+ipykernel==4.2.2
+ipython==4.0.1
+ipython-genutils==0.1.0
+ipywidgets==4.1.1
+jsonschema==2.5.1
+jupyter==1.0.0
+jupyter-client==4.1.1
+jupyter-console==4.0.3
+jupyter-core==4.0.6
+matplotlib==1.5.0
+mistune==0.7.1
+nbconvert==4.1.0
+nbformat==4.0.1
+notebook==4.0.6
+numpy==1.10.4
+path.py==8.1.2
+pexpect==4.0.1
+pickleshare==0.5
+ptyprocess==0.5
+pyparsing==2.0.7
+python-dateutil==2.4.2
+pytz==2015.7
+pyzmq==15.1.0
+qtconsole==4.1.1
+scipy==0.16.1
+simplegeneric==0.8.1
+singledispatch==3.4.0.3
+six==1.10.0
+terminado==0.5
+tornado==4.3
+traitlets==4.0.0
+wsgiref==0.1.2
diff --git a/assignments2016/assignment3/sky.jpg b/assignments2016/assignment3/sky.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..81fe60ab50cfd282a852df6e72f6a503111b521c
Binary files /dev/null and b/assignments2016/assignment3/sky.jpg differ
z|Nr?vE%3GuhycLB!lI(0BLBzz&-7m&R8$mXWHdBXWHcmX6l54Ucz8qvL{vmXRCIK- z|Gwz&^d3S-`(M#p9{>v(o(9Pj4u%>4iv9Y>xXqH3EqKO#+phEr(`@i+18!#iNLL#Wp;Hi)Sl9(az zpSHhd!uI{fpaP7cP!0D1<`5`zkcI)o7CV*^$$)!8Ju*rd3~osPW`n?BDDLQyjKz#l zLXC&kPei==;b2Yd(|wZe*`nH|E%&>%)PJ+YzxXT!FL^p4UytaHC|04LI&=- zy&>0pkB5ez^tJPwg+V|kt}i<$&3Nq<<)$>K=Uz1}drtcLtqvAugc7#KeKwAmH%0WSclkx2fD^>K6 zF{qTXGN`OuwYDYfvy07u7887)i>#9m9%df2(I{2eGJc>`NTH+7P_PW<@Q~$+et@xH zRDhQ!sGf!83s*vHEw|N(MJJ}Kww0t)!-@Ig9YLu<8i2-$ERQS+%a8qK&@;Mkjx213 zJ4Q+=6M3plHqb+(7Hy&u=xBwyi0V%2QdxQYctkkX1vcFN_u2#c&o||MzXPWF(0U(Uy#i#FS6n&W~QHrYhYV^aj z2>XMdZ2kv#LpE%+Shg>zl2i)x%EbwLV`f_ZF&TNo)V>Rp=~3x@=~2nK#2%D2YD}0~ z#GfTJWc(sBh~TpAFeS7myLjv;r`nmAaX2}!vV_W;52ybS51GWIkS$Ca9_H{`n?t#I z!~P~%KI5_5TXBD1tmDkHSXi&&PQEYB9T zZ9@K(BtX~KBVAKmS^)%)K-4sDuKVev<%kGOq~8D%ee5EFml)~@BO#`#yt5>@klx^S z?$5)yQ1l0cZcK@ryA@xl!_>j5^Zi~feszp9+j6Av1p*3*eVAueh{WG9xPcuW-Qc>N z?^Ldk`tS2ftY8%XFf3 zVe+>sm7?`IfbWWSTXA$`Lbj{@B13^9r!Io_g&XxTjJO zjij)(8;e-U){4sO^hGSWdkn=DY1^NsF*Bd;k-VDP5Kbbzm5lC=Jy$e98acqV8ub~0 zqP1m|qo}PwuFZCEql3fr&Q@lVrvmx$2kH+zYcnJ9rb8A9^kp{jvwQMA* zESMtN8QHoJLIig-O3t4f*QWlc%Da_J+v_$}%paW$cDhtG1E*9XZ^h@O5)Z`*tyKe;JtVic z8Mc0HL`Pn0OykViI|xv{0h(P#|IyTqo7@Lm=iSH_x6p0d_JX;+%UgR;KsVBE%`V&p zUCM^!M9@}}NEWs)2S@gbNTPxMi60Ln++6Po{Speb^Wr#Lr1#|v%l@A9yiA3S7q499 zlBpXRz@Kep&W7CXnxArY6_%LM1u$ZGZ z)a?LCL@s2zmG^F3fuSbt$|Q!N`Sg*-(tB75piczm{EiMgexuGazDNW|l}3!_OUVQp zFCyl#i)Oa;wATB~MFB&4n3*Trhmf(Y#F42O+Q%BE1qZ~In1j(LDQE2altZW|fP(A@ zh@cH4f%(iI5sbXWL`g)2K{W##l1&3E%~_nUWA%qP3wgA|vM8OUR3l$fF*~~z4=6q4 za)<0Tu> zJf%NIjW6b671W#7mA{5=y61C1FUJps%Cu2WioQziiQ!Lrd0AYnWd$+{tsL`@s4a#W z)C!Bb)YVmIaQmn5WqWpbn?vIx9;z|uk^f$MU9|#V6^l4#^XBnoM`lP*QTfnkX=##z zE}$P{tozbi0*3~j%ENpeQJ(H5)w90!*-1Bhf3USJT3+#$gM1e>H8V{W2`BPD#;p#P zViqg*oUuzh153p^(AoO;Gbam^(aloVoWWPOX6|}n4^mqG{XAY`9GtaNhyv|BqaS( z9{CH{PpT>pZ!m&6N22AHG=Skf5W<#Bh1JXJegnXoR;p0>#FClJa7uAnTE^*+u}3Al z-S(^Fer4yQqQ$j-bmW#!Ekrm;eJuK87c#_%9E{{z9@k)aGV0?BiFOu7qRa0<{g9lE zXDRQtyCWpE-E=pq`G|2W51UH#_+jKgn8_QCo@X0M5TQp1lOm5m2PQ^7{`TY9UH)B7a?a}*2zaCWubEvyTf({+1dV5Fi_>Db~|)#V@xFP z=1+mU9enO;!j#6D_BGq=BF#Ua?L-UXLpG5fvN&y4Zn{JY@J4D$xb0-mkFBeON4#q2 zG`c$or7EP{YcfCe%xge$Cs= z%BSnyzI%4d6C2So>3_7rFv1cWZ6Qt(FVgGbp#L*cc#uAC z24hYAMsiI2p^>8FH5k&Qj2M@BWAgBCv}v0VkrIdOQv{5N*yoMV%GJawazQ+mOHf$r-{=F+3e#= zGfnd;^^%Ktpgn^MqoF2gLoJ2o-H5T?G47iWzzAWkGmf?tLmkK3lgr0kzNN|z%5Do% z0y?!Y%oWBGdfTuVdm33POGFu3DaN#g6!`CBwji!DYyEm|v1apFEB^=!1a97qaIBN^+Ne zc(P^xv|6&3tuVHjeNjfy36{#e53Bp_5v)W;xHM%aB**DB7}d0Ti8sW)E2^-Me3KPE zoo1Z1z45nUS6u{pBk{aGM#67|F?tE6D+{f} zL9bL_Ux?T_KHsY%-9*3aC#PHb{Cc#k6BB}u8W}ebcKM76&zNfI%%%7>b}0*TexDH) zfiY)-ReJeEHtE@iy`Un+!B|6B0+RC`Nv&(m)WEzlf1TVwL2oK;^xXT?2co%(vy{JO z4mR+`^D~(GcCTe?xVKl2zRCTY`OC)gX^m;j#u>N)6E_ZD>#yRHipih8PGzA-MRtsA z=BU5&fAukK!G}f^$8689lWX}!{xiIL!YtcxStI*F&c##Grs+_LVe(SEEbAg62-}y% z>i;6|b^@nGmOSQ*HS~>@(fjEQKsjY4Csj?8`Ps6r{7VJW3oMSjC@SX_*$)dU`Sh8B zF*CLSb>olf^y#aDKBS5bFNig+%;@f{FMgo6g;56*>buD4JtAVFJ2Ej-0SQqK)V1g#js zsvD)0#+26K=3b&G5Q|Ix5Kbgew1+uo?M~6zUfNm`$7) z!5w0-qitPi8<5B^nlPCBc`StuOVABBZT(W|knRj}EvJ#|k_b`4&LLALrWm-va0|G9 z37yV-ZqSBEr`$ik0n|d(?#^7(1>OL3b95CJj4+fNpZd?!)uP7JTBb;3t(vvI1*0w|9-hBou&&ak%tv84 zj+RMyjzA^+?e2s+Cy;BK+r#}8MHBT@y^|#MYUGYXB4{>~+WyC-NfLlX_9YdKl39f| zy(*GN2!PHxkMJoLCVF%%DKun@P@g^of7qR>UMmq8kbq*vnT@u?pa#g8L{WwN@v?U52`AY0o+WTs^M4 zkSp_jX`{?v4}P38O+vUh@5qCy-n;!Ib&lJ;_{MN2C0-am9 zyFfo6?TENzc$x#8E2S^b-@3DZ^v6b6q_;JdrS$grAs;-LGQ96NTP36A)DwjP^7uDy z61A>j_s`kG0F>9^x&_?oA{u{Uk-XvW96scp4?>?Q`B^-Fmo|QIIG5?(?MDXr)s!Jn z+SYo@!u$phGXM1kSp4aytc)OE(S+BM{A_rxv@+}P1}L5;>w0;F@pn&KxyW0aJ-$dv zD_lHocs3>qC!QYe@O+ud(#VRp$Xax`|FpDszP9>1wNr@1UNtH3+^(vwf~CU#r1mmE 
z`L~$h4Yg}vV-d?i;#Q^s|4-mY;YYlxwOY`nXO5OG4U?i4a2s!(I%@#z!g0+y@;Ukq zQ09Nlqs{8U)=%W&FjmGsWubj56qe>M%BcW&gd{8+^ zI6sESi*r^dqu-jf^nlTOkjo;&oQP^r-~gTqZ5Wm)F`B;55`M*k=L=rt#>F<+jEiXw zol5OEjWRKsFiHuZ1RgPtA5nW!Iqrg+eJ0B6PZRlR0eVfT<)6M;YzrMrUV2p%I@Ec_ zls_8GjZ_{)qs5L)c+9Hq@{9dw_XVqqMt%dVghxoun^YdV=R(Ked23-ZWSX`qe z{fF;e=1f@HELth{HL}{f<1*A}CpvEc&pnfUGL$)-LGGU+otQht8iA0w-OH=Gr5lNcIW5RKV zgY6B#AY8X$W|6%B&6t~pi(7DRnz`XTzPjv1#HqYvUI0xtdkrj%R~g104(b2;V_V?m zF2N+)^C~6zbTkTk_;QmAKmP+8-M|TqdX28qsABA( zm=f~c8cVeR`KAVW)H$9b-6t%A<&CLtFtgRMAf%P}R7axR1yJ8fn3OE@PG{a({+?3Mi39ZVs8is`HyR%T86bwG~1mCD?S3Cp56y@Sh z>26yha2i4UOk!#DeXRL8I)zK}HDt$5-c< xzNhZim@=k}6bS2;YHDmDf|M#GJex znzlPBVyi?*&nJ0tvp%Z)+6gdDmPh)B5@)19LO{i^`gFBHM@`!v8#9a&dM`U3K~}yaX@Q^? z$PA-PV9LEz^m0Q5md}IizqO_kyu*vpSp9KF516cXM%Foi#gQg%=@iN7u)eysmU ztxF-n9MsLlfh)%Ulc(^uFK&$0?j#|xsFm7pmcYf*gw%61T`B?T{NSrTj?3B_3I6(SUBnb5V*f70iVfo}klOEjx%de45(N7(YHH^65k@)>82 zIz_Bjq#rut>J=~U4Ck3P4q(*su}%WK9XuF???GssyR-_=NLIwaokRz5uz9_+sAwa8 z$LkxwL(BTWN{xPShC+w8JN_q576lbrxv*oy3KNiBg6c0!_;G9>eyk43| z-kDZA>vba_Ng&6sWzoA4F1i%zZNycK+=_{!jf1s-hWHNrBqF**0iks#F;2Z-j|F@I z&p-Xz8p@i|IvO2IOy-dceeFwnM)!CX4n^<9H5pZZ2{90_iJQPvQQ!yq{Lc!ucO4uu z;@43wCQP_;@fbu#a44-@A0unX2Y{BDf{bBy2_c*o>j4}{o9KpRN4Pply54{`Sf zf27c%7O51OxZ&Cpb`S~G@=aS9=#VPtYI*n9>L<+h+Z!Yb zrPo!D+j)%OdKx5VmbHAEm9cXcX%Q~onnFUdRxIl`+i8|lc4_Ugej(!&J=e=8Wk(Rn zVF{JZ*G4#SC(bO*YqPl*a?)S}wdW0W$44|QX&&`{mta^;f1x}+v%d>QZ5Z7!zznxgzlC6@nnp0cy@n(EvX0EBzsEyPb z?0tR?aiMmJa@e>2tGq#?vl`*9RYfy%;5v=tAya&xcLOv|ol=P#-gZi4B0E^lkhJHA z$Bq9~Y9WK;6BdiB&TI~|g8f~en^&y;tHH*Uf;t}@NpLRgW*SWeoKn zqft54kk2VVLd{5AqEP_*)dvs-1d0`{+SUH!Io>&TbZ|^4cfu)&42&a~N6dGMYt^Oa zF-|Gxk9YkvgzET{MkRd$+#&ebyw<0vL06QI<{V3g#=x0q^}~K()B6i*&9&Zn!r0beuF1(>j}dNXxv&<$w(NpK+bVi3g_RV ztobDIBId61l9c+|Qz;K0NU#Z#?t;fl^X!by;zLE_se1bo-{G~%3tFdw3wn!2Pej4+ z>|pE@g$0ygbqbC6y3zFA$n^J%gRcCAoRNls94+Yfy#ZDx>n-nJfcupd=@J?#jqj(& zwu+{Fhk+<```ymgEMS~A=sq8%q(B-Q79{DCETthqbso$ zSDk|QEs+Sdo;EN5YYU%ZB7!(E( zJrWsZw-S!n#%ng|8T;OFIy+q?DxTU4*JCMaW{!CZakkF><|&vwE`0s{YXiE$-ST{vuFtK2sHS;u1#V5KAw1qhdWqUXF~cq}0zuTL~K=7Kk*+mMmln z-x9>41jD}oN)k7xXyfF8LROH*1HCO7Hpu5~wn&a5CQXJksTY_yGMT6!A0_q)sprYb+Z^lPKO5-xVf;dZ@z z0rzt0Y91zAacugN)P<%$xIQNoc73Gq`5;O`;h$md9Fr%jPM4g@(dRvCSA5fKB<2z| zr2T@U9!@W5aIhBXHBUCd?({SDhH+hE8iVwSEGY78A-RcMKL3#ZI{a>pka*a^mhj0S zn}hDas$*j2fa)7SiLFI2KUgx8f|OjtUbDp?Yj^>}Me9C$`$3RR$}nwn)j_+}LF^5n zNzylMh#}zZOSCUsa(n!4!20+2jtlr-8L%ARp#oggn$o{1ncLpvsi0-D%;dr;WOz98W8wQOYV~Cw%3CF zxb+>OmAb4RqA2JMHCWhTZzNR z>2&pcjEORZ5>d1zQJZKY`aXbj`g7e|Y3sehBL7TtmOc|T0&iM`Q>4)Rq3s(}5jnn3 zWT*%GF4l2|SnB2LBkG15wfI}9`Z8WC*W(%2e}_;E>hgpMTsX*AxHxX`8&mx+NDZmA zfI>mI;MR{tk(vteX@U3Gh(oP;+sjEMJWQ2!9OTyNOOw6F8Ve+e)a{;v)oT%H>umcB znl&3X=N`ON|Ni`I`Rz7;Jg&PrJAXse(pINR-%Y0fudLUCJh35*wyP_SkGRKP#3pIS znb`zsy1y5h$?Zmu!>$MSJh^hEinYA6&ax%bPEqXA`fueD)9QlZfO9|!X$DMta_jgZtTF(eC z0M!aAE=0XHz3Wuctgsyqcsju2{mBSNTQj9Dww!{g= zw#-$o)Z*D0wNO*FsK1eMn!mW34NnrDb4XPOy&qI#_e`;jeB7HlMaA{nMdG&=4ts(E zF(!rm1+MJ$CslQ}ntDA*)N=J5f%6bia^0+r@#V7ufAUGn+RHLgVLZE9k?p$4-5cUV zzB!-3$H?Gnr*P~ts6(*SYy{)gezM;;Wt&+0Wi5xoqIjJX=cYZ$UEoww?f$MYStPof zP4Gf4PL2PZdwF>fW#3H?;cbe;ONl;Fpinmv*keg%|p;cj;?9@rQ z-D2nZ!zV`kGPQQ;@0PSx-`aWTv(t3yUV0^eGx-J6xm_z{{-!wZZUhaLb16%y?Y!># zubrfV@aKxt8+y%ak9Kl)Z<&;dD_18NjNzd$TcfAR(2YfHqLY+i8u|Uk-Lp%1Vl&P7 z1D)u>%=xTa;~I~?4~nXr-HI}4kfR+l`r5yoR>LZaO8GpZ4_Klzp-Q!&RW&Yt7#_asv!R;*wfpee z?>VfLeE4WniA}3cvJ*+|ExqD%c!TQQ)vr!oDC5KoBau+Nd}{CzR1h+?{Jq}7w9M3G zn+fqSsN`3cg)(YZ=p`S@V&{PjT{wlOrqSnfx0k3IT#oGh{aZvAZ)0)VgscLQf%7=5 zqr1BY)K52o!f@CoqZ@MPPrFAm3`%$y30Nv&U-WV9!d9ugN!^|kO4`s~9%r8iae}gGPE1ZAD z7*ke!H2E7*Kq0Q4hT{~=e2^5)iY>7Z!3;o%_cH{}npLwv&Xqq{irC>uBC;E>u_bF& 
zoK@jg`DLXtZ-}69@?=?wMKoNI$$jT62NFp4{<1NQ{#ebn-p>`O1s(`0iOcy_dv# zpZQOmwP1SWH@&YZ4VsX+aRe0wVc^&QdF~iP8xPiP9X! z$fi?tMB;7e8|#jmM*0XEd%+mc= zu{B8VfxTZj$;8Jhs@jTxI=*W}etPuG>YmbRb-~wG>*v{^$0O&c;0NColB|q0c3#en zIjq^-f@?>Gw3XjoRXyBbhkbEybQL`WYA|1<@}*bf=vQ}9NzKW~!Ph?~VlFcy8K={v z1T-=qKKY7}SvjdYUqvS(%J|=PKIJ*xHnFxmP=hLlj6IUT9eq!Ne;quhrj@#!+PcpA zBWpI=n&*2rcb#6@k{Dj6qt884+JJhsy|)yo-ZR@gCp6@K>+XLZur(LN*I!!4_{vJ? z_S(OHYm&d1?h#w9Ru9BpH_qwEpPQP_6vnOmQ`w8}^KDDBLfeS}+EHqAlv8R&O^cvs^2{c!swBx*cx;Vu8>}%tC~e#4ag?RIbXOLms7?W}E->)qJy z)iZf;)lu#Pc@szGvvQaFV*&3)b755Jv-rcyv>XDz+{e+SSp$Q;qad-@;gf<|aNdOo zGy)Po_L7)5{*MO?UQa_cCgsdtSv9rFmx&}jbt6XiEN$xj7(9ERLtuA);HE0)ER<4a zrEGa%`!iMW$#=8c0spTGqBn>{Pc@V|5w@juWG1x-g>|f>MLD6e4!(DX?fo3Z^{Iq- zTYk)dnAyqcTIg}10iLZmEA}`GQ^W25oj+#B}4Sc8$dMCM(w`E z-6*Iwvy5s@u4-X%i5$aJvwLh(?}%&o9KL06w7?yA_sFJT28s-mbNfJ z51r;vb;XB_0lzE7fCbovel9I6<3xDUF>lZ7%u{r#t=<6n^RJJv}c4!u{k$6Zp2+FPSW^ zyN0C5Y2-?owrdEw^cj6vhL12p7IrumV>`_XBj&>25~C7WM7(jbii_iP@xMP|OHPQy zi+|FYhooT0cIl?pdX7BfQ(T}C7Hl?&jow!2om|M@z zjk*mR6S)VFD#(-gyf%Fr`Q>_N(gv1~5kdo4M{(VrcZF&lb5O(=p>#H` z$>V{FuO`5vL!uqN9_X{zL<2pqVeTzTekG!_CJ$Lo*RO?v{oAeoh;>?CEz)y%)EkBP z0u}Dof~8#y)mH@dNGIrjzt#v;-}ZFcxHmz)2_7{9&L_tX^1e+yV|?FH_V$>@ku}CE zxF6nW{Lz(7%d~w;FmZ`Es_;{@M$S3tDPuvuN+9H<(IA7F1ejr=IsvJo2w8~AW$2|I z5|1t&8snhqf8?}N--X1^**;>pP1n8nZW0G$2yFiS*?RtT*Ehm$ip4h){|0DBZAs0# z)h);=#IaIkiQ{(B+;baNI@-*V6RlaNy}iEXGUVfBq@h;7Z4Zhr!bl{|v7kF!roCna zHQh3IxDdS#BwRe>Z%winOnmbsAZIN-Y&|fH4ECXI^`$=ed0at^P~&Ihu9y_`@NtuR ze2Z}cPNlNv%T>imyv#a{HsXcV`BS(Cxn$hk6K=#CK(5MeIuxzq#*jc;1NEpvj9U46 zgz-$q^F^_R?_oqzn1mzf(%Eock=`T2mWE)bN#Re>pf=R(Yhu|gALC-%G#qU%UwW!m z2#71+_vLl)eWpY8&t;HR)|V_ZwXM-`oT-s_S}1Z&jQRO>R!18)k~K{sIo}8`C`B}7 z`*2bt`LqF5{TW=s~0v?=>W9|9!oCBk14l!cthIa%$Kc^nwO`?RieB z9*Dw{o3MJoNfUQ{-*)9u6vh0n;JyD30yq~Y%G{mB$mnY_@oAO(#=cpdgJp#QlW!)y zjRWSeAGv?s(i^~fe)UH9MWu54< zR~zv`q|oOErg7>?J>69dXT8?1Hzhof!iI#KUy_&GcQg8C#tpfkc|#|AbiJAmn!Sq7 zBY4%iQ-1kA+m1VV2S=XWzO1)whR<@Z>*pd%`Rl6Z@`k#p!{l5J97!(HtT#<-OXWFw zu)MCm1%($BksLUIT@z~eIr{Ioe=!^Wx!J`#hNj;9{e|hS_LNY?E!_4~bY;__0e5`9 zKjC9m*{UKw;X2D7=#_up6{c9PYdcvSI}D{};}==y(Xk7=9&TKH7;C0)Sx-5W@RT;I=E=i+Tp7rKB2A+tW{G;pJtDYu&Ap({CZmb|9`Xx{1ox;SQ4pI5eQ)wbE zntQ4~%f1kNlYnh>gwmQwnj~brNSHr!kgIS!;Ln&(}W z)eFEFpV!^<EPSuX`237H7@|5h&GQga%F*)U|-2laRJ8F1h=b~G?PC*AmV~KBc zFJ)bKVREF_*~3L2%F|p~mNAk05A_EBBc$gDn-g+v-SkET&N94tyCh$FjSbdh8JS>GbMx;GovFYBArt96GgF6g^U) z+)cmO($+Mex>(?7*iG|e{`7rX63&PHcQUih>iIW-YZhXMortw(X?yLazhLJeOy&g4 zoB;o}jkPISx0}jjdu`G*a-6#KPU2Una0qr$Jtm^ z&aRmlk{egcRPh5Jh|MEAtp$xKU%hpMY{aY7HP%x^{qOz__}ETd&v+PO7eZpK z30x5Hu=Pe=BzFgyo5!yzakpm@zkt6rlxae^P}9ho&nRu6v=Qrw>L-y4C>-mp6|TGU ziGJH~K7m+u&)iU3lW@cx;C3Q>s+Px`Y8k}G^RH-4zWqij@`$^y3+e4Q9~XYWGoa(Q(YJ61M~;x( zG30NU_i9oZm0fc?->~Mr0Sp}vJ+^=j`5s~rI5C5o9V?Z?N6hD8doZCQi}lfh39|hk z5hCaP0|vt<;z*pS=M*FAT*Y?h5*B$fPC@eTqaTqK$cFQf_NBg%zC=X`*PBA_h&hji zlPqMF^{%diLkpKhWD_7reW^T;9{Y=F%T@>q0scAcSB}0GE@Oqr;$Tr`)2fKA+{Vpk zYTLP~jrfLnz5{tXyn#X^Vy)fLU+xYi)tstL*k!Q93lz%`4|`BAS@)MBMX-X@KWnb? zOre|>UsOL97ZI$Wbw!8KY-L}I1hOQee|S*H8$c$vTYM%g6L?mYTCrNA+OMOH+RFP3 zRA?dj2DPaPT7M*FK9^3ja=Ex{nglUBxxE4f`de;SzDd!!m#~wM>ts3`#d}`cOhbCP z{^qj=VuY>PZ9~p&M1Wk|$dP4-Sax3fqYG~U1a~6X(O^n=4d@o@*jell5xM5y?? 
z!WE-AEp|rg2oB+5CDR*CI)Vp`>;Bu%xy*5b74Cl0Jj4?+_vbE-ShGcmf5VN${e$*g zT%_WH6tZ|*{dcON&MZ}8L#p4sWC#_zbFjiFt5l~O7IsEY$^Ef2gxJJ+B)1b`G*nv) z%@)La5Z1H^RAdwk7=Rh6uSn-!TY9zK#kKPzoHV@UxQBNu2nhEi!v0%yXVl-cq*Kk^ zX-hjE9+Q73!M3dtdpaCyEqz-LUKS8#_@_8JWFhVn^iber;WQ7PJ#01*@Q}+)d;=so z*pvxn+wE?)OQ{d93JNQ0_s}da2)|6zZDq;THu?&I6bF)^1w;lz*E%e`e8;3*UJUlZFs3STl+ zNp>6fwJloi9k*^sR^@`8NJj=uF;#80f}$0E?auxdfiif2pkicA>F;Y^1rl(b4q6HRo410fxeq#%8*(eEY+TPxZX4F0Cm#Leu1O=EF~NEIC4Sie$?) z?2mPX*Uesu!Vf^^vD0&&ng%D&7#d#)A~32t^XV1s!=p>;I#?1oZz$HS?Bj00y*Dq~ z`T0IhjI6EHLH}sgM`YyRsu}-QKL}lQpht7%ve8V(&+T`qAA*y*m;QmJ5i~xiD`mpq zNQ{5v)b!hhzvF7^&=7Fvr{lUZZ!gYAjm`mw6gGz;U`zj?ea^&$H8_b$eLB?>7!R@_ zMK-GvW8i%QcouRA9<3aB2hx9FNOON|QE@+U)_((ZD2r_F*Lx8ZM6LWY;LJ;|hhA{m z1V$GGw5M(u?SA9#`Y3|2BDXri)LCjDyfT*c^F^v2*Tc(xQ>z){AXqO`K}hk%!CAOm zvHV#P_rABPf5V>0g_h?`?+vgGjeE*jh2k}*|8*B_9b{oayIMKoARH4dDC+n_^adEY z<-HUouRTj&&f3^+9(@PF(Wu`?%7}05pEX~vL+AxAc&;7NSPg#IPa{v)REpK+G(&Ai zK)|)JLrMPC0YN|3jSZWRf^2qvdx9r3zw%X#9zV)&uJ&tl(=mGkyEQc?*69P7T_zq_ zrZ$12REZg=>)yYKo4@h?Tw8bB@(AKODceIyXTvS{{RX%)2nZA{G5A}Y>VGasQSh6E zO{$_sup(YgGP|Q0%>z=W5-+!Gg8y(*xbg2FwS!_s2C!ku&Rm&zfQ@^hYAL;~B5^&p z#hY!&h9&1LJmubHi)nPa~IyF+(tw+r?K?eyEtNXFxB`v;+_TH2t>eVf4& zlcQL=6ILR;n)=E)s?nx0Xf?NqO)y7{j_}t@gXUDH{qU8U=KCa+t1BP0@9k>;QU;H_ zOk*GnqkGOlj5!Y24t2*PZA+PFSAD-b+9y5oUdE$nJE?t>B!jW5WOx&{-80xMNcI`8 z)`%g|Pdv3ZJIge3;Yregtvko;H;#29EdDhUX+BpNLLDQ)MOpsFPLEUr*gavI%hoL! zHO@^VRYwqb)0VV1z`j7BFNK);0{{A{pWMJ30DnjhJCk@f&+3k0$D+B^?&5NO^;$a$ z1fM$QOdJ;Q5v}EZ{pClRUf}Crn~DHIuMG01i?OX~oW#{5S$ktLggrcUQFyV4mEsF?>EEZ)9rV@*Gp!itW**{;E=$MB1^y~hxNH} zes9EwgJD9H;Zr20s^S*wmRz(}uN};4jPr6xR*;^T@vFLI4YT*0z$!-1L-*5?2u%?{ zE4BTj#%(VL5Ta_6W2-kD+eeZz*c@AAar})}tiMe`wA#9kMCN=uU{e&JD>eA{qj-18 zrt-!1`jcpa%68M}c#fao5mfy6R^FZOmAuud&PH5Td4lz%hqyk0ESpR=F+gE;PEL6> zP%?ZMYOCI^>^|lfm6~%U(#pGYO+(^GJ6E@x0i3#1?)1?YjA!Gdu^+DEOiT5A6sB=# z^4bV%CMDF(`oCB@>#(N!|Nr~Z+d=_FI#s%rPPvuN$>>%Fj2bYyxfMZai4CbC449+4 zVIna`jfPPJMmM9=AK&Zx{eS*D=Une|p0CIA@$$E$6Fqf{^-pKZf*tE}n=Tc**GeGK zeiun3Wwp5Dk_C095*JH9b@}}|Y()$d5V?Cl%QCFHG$5MT8oDyMCkx?ZT8{MrBd48< z&Lv+I7Wgj<`|1M{Mz=#*$7xBKdpi&XJ%K(FOuPl=vw(Y7`m>%Dw%YYji3%t#5-Pm6 z>TR3za?Ui%{&J|pWsj8L$Wj*1@HP7UhcpK*s&Au{iS5?h=2}bbIjM*MD05zTqkE)9dNQ``>Cfcsy+Z z@{j%qxR*V5;P<7RWVC+!ua?rk4o#qg-TE;_GmgBd7ce!k?*d#@a=d5xJ%)BedS9f~ z^+A&H-BFEu-lSSDqs0-Ia++VR8&7pxT7rJST7P3E{)0#61Ky~`(ZvS`(mEyw)HLW9! zwyo*7ioCE|G})gkUqVuaPUie+ZgZ+rFCoGLk2W3j)ZSpr_li%} z&tx|Um3iiVf}G<{^Z0DKy4$n)E4wo19{0XvSBjn!8D1>!b$P)??yG;)_tt2!E)9Ho zj8kOhQ!SE7&!#9{5?nYUj_z`5AdT^%ru+`>-3m~3UVti=>#fN<>mh<+^-e{&)W&E& zyc=rhVUn%|=#8SYpu<-TZ}_Tl_ut1-yR919cQ2>7v}L)2BX9oF@pKhP7Qf4gYr#Xw z<*k9VPqN{f%s$&8^Pgc&Yvzd}oL#l_6l3{)ffa4$)`NV5yOUnO!jUgw8Xw?OIl7|v zjGcR0yFxl-zd{Ymn9peSfysev?=7CmK+hki5~BqEd;PM+@yP(w$tA)HyX3tL1VNk+Bb_G+c>p z17F1|Z>H(hDq1~S{Y?TE0iMG8W#QabkWTV-FYQiP6})&;55~q_GC=QX(Aq*eeh!>& z;hU%g!EU;d-Uwn^aeWdF)PbG!j zUUwK6M}0lpx+;9~CDb*G^IkZOZR;i%CfAxO@wt&3;Upqt+S|IEDNBDb(3iY3UZst-52~3-Iy^0L7L% z_-?AQh(QuDYI4p3JX&I)7B8jPt!s=ICWU0y1jgm5c+*yQ5bwPo^KN!;{3E5yDa}C8 zd)DSVP^!Z`!G2Xan?C^EU8CYc3nB}AJE+x(-=BK^p&)k7zIl_y^cli3B+$5LZy)f4 zK=cWNTE7oApN>idR%-5U?w0cve1jleE>BY9QuWxc9`bLTE}pD;4vks*Uu&5#H*O+0 zO#>XW8yn>w;~B~A)2W45ebFZkx@oW-y-XW6RRTU)JP~E0k4-Z-uM%4} zjWp8hjN?q~KaATjuY(pk%gPca4%v+P=dC_xNadqHBs$A`zhZ;VLKQSD*Tx!J#8J@C zHCjYlS?#pKuJ6TZy`pm(_7-m8cJX4Wg=!(k0H&nizqdkyU?6>JS<25tzUE&l5&-WS z%a+uYLJ5xxZ^t+OLjbLoTLD@3h-*sy%7u?O+OlUvsUBGT-v2JqrO_OjdH(qO+;i?@ z&4)gV+7DqFT;bdmTNpk2bf2cQ>diB6((fe6>m?b2Q?<9TNsPCm!1OUL%EbZ$2oMs? 
z7SAFP3Vj0;{?wMcNB1gHFuzOjXrX321sk^8Ku#20RL01udKL&JT#U|9xuzzgUt=(N z;;{#rPc%JUQ{j#4k*VW`lY-xYSebX{;=Gr-n7N!aPo0Be7L#gcmabq6-1@Qe=&1-_ z(Jvc++mZEJPkW`Kp~B#I6Oy!uX9u#B$|ha?=P(gbGSQcaQsyv0XI1*xcBF#Yb@E2W z>)MbY?{t~9e+eVYu^?ZcUS^FyvMgJp7#GHQTay;3g+gxsHhHyURkX~j{o6arwbA)xhb=^0^v;M^Y zSZbjc;qF(UBLz=L9|;nT_(qC)GvAqvPbjrW~Yv=v@-OYF1*)7Ik%0%+5SC!B0aOJdVbejJV ztcyEgX&V3ImZ$>f(-}ruV)c0{yz#%&M0 z7wNLKA?A6Wm~RAFFobf)_wThWr<$)u0`O6U1P;y&9=&@56>P-81<5}5Ru30vr6;Ry z?pq(`_7qw5U8)AUw^p5IxcWCCsvr=33EiMGtsfyPDLN5LKcNoJcRmf;Q+)QW1&B=~ zO@$~Ax6&heiOS(pitO=B@`|_PZp%jlUQAgFy|C^+2u4Ju{o2ryS0uz(uZqUBp^eAI zgUGw#()tatp0(52WXE4!x~=`XN=5^OdHZ11{X!Iz{X*W^@vHSsE#X#FdRPy0Vd~#o zmj1=Za?lT}0Bub`qV&_iY^yU#MqMljNhYn6m-GZL$D42!Fr5pOtHR3md@Dw*g z$aXWB?aCqtY0c?K)O1Owg$Tzp>TAHblv>Qv~|>yMss=lf;9^e z2kapxOb(#;E#LvnLe#afb9RkjRi9#Vz$E&k4<<%_?__3qVk9~B zV&gm#6wmLqJ2)@0xxY@_A6@60Lkk6gr3*X3Ae~p5L$dbWNUr)sZMJqGXF6EF=TV*l z{E3GtWv|_9-?uQo)cTfkb#Zksp;YsA zpZim!7)hCuFG=mku!?gROyK3-eOfuy3$2H%sXu2vGj2gd@Ew=^K5t9v+Nk1;-szrm zI_JAzDFkME!O!KPs(hKp`Jyv8G<&9ex`_{&csn)~(^hO!#b2#9iBfoc{c|)Vj^-8S zdAHry=tDM@!W>XpuggQ}-CA z)PWabEby*zQK2mo(|m)$*-zh8$kuDH8_c!^O~U3ti4Q&WKBsUe01tHtJgw+$(3UCk zpPoa$rKXz`>vf=X3}|7K!i=(QxvOF3@CpYH8paDG|6@ZmoFCcDFTJ()+k4)!c#MZXcn@r+4FhV=87Q zkEjN5#KOK8nYr4;tf`wj%%mkDX?*+80sw`HrJ<%`@a5KOYAer05PfXFR#fc7YFro{vm)JnE@Y-eH7n~e0`?3Q(GP^$$Ru`7h5Ot=cM zIYPpX#WZUc%vmoxwJl5k-a3g5?u*_3TBJu4zyX^ZmOy*4S?AVCHh`u}KX^voQ?MRuh{2q)S&9Vxz>3-TvxbV1%>ze<6T=(GC z=ba$d(jtj*s>&82WaelqyL))(6-9!Bm(IZzu2@ zeDq5$Ud$tLfN1P8h=LR@WZ55oPABcIc`}^SrXZ*$Wm;I1^h5^Y6FS@<+6@o;T)Jmw za`X4dGdRu!dcQPHmWb5k1s9ra?uUk6z>BOnd$!#W{#lfY+Wc-Fy?5E5ef<>Y11y#Z zo!&UeYJ8VN-qP|kT5VHN_hTIEaL;9yqSCjEEf=S$++cQqV%>o81)Eg1>xiCwL4D99 zcNC{-e@j*fOW8Y^)yFPNYw^!JF+!_?e!bnx*IZ#+`?x(>pzhERe|_G-FrxPC4R=t# z0q?Er=Pq`8qIAq6k1|$VI{kRFtk{pO+p=1A1pecZlBzibOa*4)Cw@tMN z`&sF>05x}NUoGzU#HUH}5xT9{_N|_sbhM2YV&y%y+j-xXv^yL${JDwf$7R_jlT(ww z3YO?2)%1g<976MTNn{J;TamCyX|c)i^&hxQsvPBzhH$&`{>|yTyL<2RAF~a?I#!?s z5L3%-jDc@gwSBFkK2BZ!_?*TmJt2a$J~FtH!m>6eRyxqNtJ@R8)b29Q&Uz?s0nzSc zwaY1T!eOIQxE7BM7t>b6(Ai?5dl^BVYVZ>9*O3Nb)|_}RIj{m=9gq~?chM_VhtG*K zS@SlTXwZb5y%J%!knAERKMbWS)ZTi_Pt#*6*aHR9f!gtH2$E({B zJYPRVEX>GYyW<5tmzk~Jt`${Q=T<(lv7M@f9I~&p<=3ZtUqi|UQZMePzTl@b7r(wK zYkpkSPsGJ0@)bl4ukjkjblgE7{S$7VZo&=o2CL-=+3|C=$!$$lx42L%5hfiuShYlC zgLn{sZzT^n^;K!1@yRslglbQj77o)j4t9g|SkGjak^8FVf1bra7tgYV9XXyz3{SgZ z2pdXNmrgPZ(sjYvve^qgREt5Up@t5H>Yy;d?;A3~0DhW#F8l@^GJt%K0e%jWI#~?3 zE(*Z%VOcGFFPWDn)D_AU1ZB+xwT`so84~Yq8D=adb<=r=Q2-Pt5;yajz%X9B9)kOj z*XDO8E0Ut(9-mR`^6^X);Lk9T9Rv;4*_r#cTauZblX%TStQekZ^Vtp;`Fjg76A-6W zEvv4^t$I`Ntr%nA)%|#U(kQ+o5}&uVQp9iuNIp+cyDG!IGx7^jFGwq+bVV4hv#GX_ zot~Kz8Z>2Eso!L<_RC5PjLc`r890@sDqh$vXBs|pT<_X33`}in{Bl#0rm>_}s~@Xa zAciDx*0jK6Z>aW$nKeH>h;%G;H+j{IRM&kjWH82_fh77PiR_v8(51h4Y{y22qp|eO z4UAK9s|l_9fZ#{EaCpmJ+Md3AO z$Fh7bSH;T&d?P}dQ%Jyo>POZ_r~dvVIlN1vY>x9q>$HTtt?f=Pbu zCMuWC_kETTLmzD1EX=OzV-p&zhbIX}G$}M0wGhGBRL;rM(<^qRZp1*er>Vtq-RG7g zw+8DQ>WgODOw+x|=3Ult)8gv#WmXZ3TfBOraqrRY#6^&bloud)H=JhW$kRGR*ZnJm zwE@&O{gu`Z<(4?UMsL<-que~g5y83=fm&dZ-&5!-C9Jord*!}uA>M|7OtjrP%vCC= zPGv=-++UIB7*2j46fUd(T-F!1a)Fu!$i&&$Kx$IsL;odOHTdiuO)eg};ACrh>sW;l za3KM_w1~0GtcxZO@yORqIgd_{_JasWUu#5N?Q3je zT$o?x{&`ZPKYj_{6w=Rm*8!N3o3cTvRUrR(qV;ECNPLED@@(Yh4sEH8XP`0Dmb!6g zS=i5NLjG$Y#^nc5S|Mb~aS}5rAEY}>$6evp&FV1-)gGNf?ai44gNbX8y=8?UNdZ`b zn@~@t@4WFDBunS88EVRBRPnpv`QlKxk*+n7sRzHYx+zI zB>fL*SR6S_rm&5Dm-HJKqShSqaUi($DthB%{U@&4`G+E3cDuDy(?#(Olzf5d624?{ zm@t*%8e(8t%!+a|(;r$#$Pn#IfF2Pu80a0MD>J>|@e-Kxhd%ZuzLY~It7Rel zQz2W5m6mJcG6TLPtq=0Gx3fjm^=kekZ&CT|ZeBIwlKEo7K{P)9Sgw)cgyaj4Ghm*K0$%hX*gZ 
zXFC+B!rghT+(C6Hn$`qIxJ*D6^b_9JITa&R>tYGPnqseEHj6%6GC`T&e06XrU%w=EB28?H z)_ZH{Y#Ow~ulgZ}Ilm`qvaL|Ex({bGd-Z*o2PW}PVoh3$*uy|qmHDe&z!UFv$v(fO z>=*h~;zt@DQvAYAX@1C?juZGohUGN5IVaL_bo)h0Q*RF;-hoXj{ zi^JPSL(Nu>{UfiI-Fkh$w}teg<{zc6zufA2O<>))P0*6otQ0uO`k&g@UJD82`{)u# zAO3)2GHwP;9BdT=ymV22>jb?)@sU(L!7*Ybzj2*nDb>0@-my3ns>RYkI)0;ek=H}X z$=v8$*CwrP{H02^%qI!oheKZ4RK3|z2uyMA41^2JK=C$9MFQCX+<5MQlSYxy?sxR7 zsr&bi!9iR(pNG?ch8JI%R{$R&&s3@Dm6`m0&q-fM?n0qIt#KO~pMCD7bdzPuUfM+$ z?L00_T#jPq0{n-34u0MIa-Te{1enPwkSkP*sk->zCpzo@u5dRY1I zM+&u}>PeUS?$>xo%hKVSH`+l8m_&9S9cP|;&Nn*5{T#+s5%r>iiPYWmUYe31H!xpwtWJ4n+R`Eevshi*Fn%r z*2>N{!eFJBAvEytQa1R!6`@MWV3UAqtM(vNa8t!;2Rw%0VL<=#4Wj}YMNEH2sat_dnEZgwACD3(m6Ek6_C_766Q za>s*u;PkVI)TzR_IDo-`{}L|NtEnptIM%sj?*52Jam@8VdvyFMlBa7_ltZ}uGPc`g zCPUa+YBJsn?v;#X4-Pq!%b~N7MG44h1;#qmEdd}dv~^t^(%~t{fcR_e|Ba4Xo3K9j zC;qp(HPad$xf#Ck(bCF&!_jF1YO08 znJ{)qZC<5Mg&2vPD37O@C~6>77i-Wh`@KA`tGeI80((iGh@ zcQiEl*cGV>a=hWxkKl-l@5`!ub-6Wy;O_MNoC*PcPN|i;!~s+Yi%dR2b?|voo5aQw z|5LwQ#!h{H-5Ss&P63=v>$%+|mnAIDyf~K26IWurt`OQ1 zf_#ds6uFVKsp!46d+={S$ft=qZ`Q*N<9+n9;g(w5n~=F=@Z>T-nUYT!vfKR^bimPF?01wP-RAzU|*s8&rF_pe4zS zT46^D=X0GnQm>28+p3>7;pCfwr)Z62*J`qEi*@AM= zq3Qi{iE{<=%51};g%IfsH?{Wu{a!vM7=E<-6$tNr7}~&zaz+=-8S5`M9BAyu`Zw8~ zAG!Mbt>Ztj2_(i}bIr+SI-vF+xgh9G*uP3zEb`1rW2s-}L{x&)rF^HdI{@CCO`hi=GD{WJ0+r53G6;SX@?=Rme?!5g=8^vR5+W( z%+ec%(~4e)lNF4Yx)G$gm~_-ey3$od)J>)jtq1BA=EW&7EoW1?OyXm%JdSa*W}Es+ zgdYQx8slP1afQ;V^~SH9#uhEA8aEt<9EZ>^VbHLYJGS@0<0{8VbX&ugjsJ5oqvdqY zxLvJ=dX?9quT>(>bDUOZ*gfj-fXW6K)w3Znx2pMT7Ah;Ob`pjjxm)${&_jv-?P{s7 z45hM8fFi>pXgD6$b`TXvv;0k7KCY-ay+**rrD}p@$#*cu@Y}f6_#|!D<{%=+IuFB#i{Y421kw$ znX`l^Hr{IApf4(?E#HKEbnr|>)ZhUyS&>B@Opm;SZ{_=yw~q+2zrYPS*8%lI)iXB^ zxuq9-brpoX1!Zf#iTRh-(?T_sZH_wlLZBh_?(nc5a6@q2@(B$V-s(Y)LaZtyMvLT|Ark^q4Sq z?pf(f*Es*VM1|OV|3QAo=hx)(TLj9$%+P$%M?~U=jp3VYPv)o)O8O|Lvlnq}7tDr2 z{$f_5>Rx71)xuf1oKB$F%q4g1bCLW}p(|9nm@g@0nCco?G@P;gR#i%exM?*>|?a1&h z!c_gOIYdE(4a1ua_x^heHxVsRYj5QQ)ecpVmW9|7*{!cv-GjKsFT>t|*-A_-NQ19% zb>%o(SY~UEaP4bBp*PlQC9{e|w!TJeS%P|+oB;0gc$)~wAEO*3_HUPkC!6{xBi4O> zw0wl6{gSDHyOwX-+hG*7dh!fecP9^ys6P;xnFa-i)Y5d@OsEOjTAEGY`9r8{w+(C( znZ2z8jdPW^8<$~t0}A%~+!Cig5>@A}T`9Zd3iGx#iq{Z$zv7J{iA1opNGh~4-A$z3 zZ|!5Qo;du6Kb>PiLX%RS2OgUuWq@>Qv_Icu6nfV$0^pMTl{ULJ{tGdFH=A3#O7JHD*p;&`ZbFDU5039}tmKyw$}t-kg4!JQgG<`I{^O5UIhbI9QUVF zJho5JA)o<#ky)vfRcjVG7PeZV!G1C$`RWkfDg*7@+J!IP^dpw?jK_8BH_#PZ7C zu|BQNo7`{p&f+ZX$dOGyREMjD>F7n=H#>{8?vEi9Z_)0-w< z4K{4XonwhVH{bbeoHr0Y36HV1tFf=3Qj0R~ORvCEUTK_#W@+x+Lfz38x98S?8&3Zr z_N6a6`3j6Rh-?`)2?_ili|^XdJc=nuO!h;+vFxW+Wa@d>5UdmQjr%+Tc=WwzRY9+Y zKHm3GzoDoiTka+~Gu}vd@X>bl3RT-j!kwrx_-9SH(a_J;n@dBYb`Acjfv*Bqn$tjk z?i8|uzgVWhxD3!7WubT75hJ3OOt4gxo06BtxwsF(7q+Rg*Br2Up5~D{SwnFh_SWN_ z!CaiCwjc%*+;ik~aB}rOJ>&u9uxoT_KcpP8BbRImK;3>Y!j*AflOWGG)1JM1#@DcS zAR%H~{@&x9)Bn(6#MTS{UF=(r(j+j=rAlh@)*p*FpfY{qBn>t=`mrJD z8#9bf!~V!myaE^jMghfO~8x49KPJpbuqG}`H!8#8i0{ZcbWQ#x*9EzoGEvQ|4d?I?$d z1(XpkTS+od|6*+_nZ4Ft=1OEe=bUbRG32z+lB^E5G1PZQ09>ftw2_^?2{8n%;rOS{ zWS90jRz@Sq<;-Um&Vs=Qa;!EETAFK-w@kZx$h;pvvY{28+hM$dQ+*P4gq7g|EfJ;cnr z#Hsrl$Bp~D#*e1;^G`|^Uu<`$4-misBAqTx@YswvZT1#Q8VAc1y|4O>T?BmUBrXtc zZuUyR8DDhQ+20|CJ=Q)0oNqQ)P=UW^q>Vz%sf(*G-b46mGwwhLGs|EF9d;3UuokX2 z_vok{svUl%=g=}jym?RF0|)S~cFc0Dg&r1fX0WvI$$ZEf&acTX=JNVFV%B1TFg#$p z-?!?F_sV@8v$leWd+0nVWU+~lJUf|e`}LSb;(RDoD$e;lIW$1X=^SY})h@e{6r#CZ z#?gA$!%(K5K%ClSve@2maH$U>WoJ($#sKu&pw3a91rx1U-igjBxz*ueZ1TU6!1#8@qa}&rB+nF`MvISARYf1 zcppkJ!$dm3CRw1ln*=Q%LLW)T@?l|#{U=>AiObg=sm}z^fydI(Lm8Jm=je0 z{*j98fIs{&(lyyFVqawL@=NZmia7G~RkFrE--1mSI&e+GjxA=(rOV9gs52g;I^}@{ 
zY0PclN2jz2I&ti$+u3P3W~Gl(H;jwiMHDDu`jmk$lKV=j^5Kvlhhk97jU$vm@0I#dCqmFKKJHKw%R->LLwR%`~yLAS#P3ve!pHjYZ??8ECA+DjQ{tuD;Ad%?Oh)@luqpV82d}S5&R7 zM_0HGtg9BnicbmR9)&Dc&T-@25vZ^5xRN7pmwjK^3yKrbbMMH--x_n zvRpJ-_R6>(-Y^@YW99XJQuFchwLt40&$&Iq#ek$ydou5xP46<~glg^Ew=%~>T7(^8 z5;reh1zf-|vRM70_XOstYVARL!+?wQ%qh5GKb9P?42j!_l=EAcpVNNw2xs zAe&Yf^<&CJIoW!JpW#SR6P@}4dG7k78UC}Sn2O0@{=Exyn#TfWbm8W@ae|rIks^M6 zO}&lPTjq_*smjjjX7coNb1!YyA*!hS-W92F`70^83l_U2$l%Cgg9t1B#Y$daFWn*7 zdJ;ud#&XU|ca>|l+zT4sS*As-GTCSuWV2-q+>Nv%n1P1cZm2NP#}#l)crV+rz0e-< z{*VpraD<%aS7{FM6?~LCFhl#{PK#%d4V9Mhul;LF|BLG-?5=4J4hGL6apDxH)2sdBBd7Eyo zz0U_Ya7&y^kgQU8MZS#;yNXMu7 ztgeI-)qJd-eb)iC=+~EmE!yjY`%9Tkihu(;q}eFYzsIpC0F~d)_TKr>7F zq(%8C(|odqd0m~dI5XrJk$1q5$#6uGc7ux=2y85OLawq-VXOv8i6wa^dy=V{S&IEj z9@+Ot(94H446t6WoSZ`c3b@X`d`0_9t+)`t`@gr0lZnW%@Gng**c5)hZT&aqA+3#k@Kv~1|KhS=Hwf=Q?iif65_`-plffb{0^5k%@cEf0) zCRg(0>fyJrIxY;u&~Y*HbyslHCUA%!oniCC(1Z3IFvCz@GL~f(roY# zjJ5gYEQ`XEp(2EJ0j$KD_)#+@5$IcB)Is%mq514seCL;)elnS-nf_2fKFix0D_NlZ zXmlb!*voFGe}A2#MkzSjE$AG+>T|+Bsx2D!+$P^#q9m{O@{=(tcMBLtYz6h4dB0F$ z8T8qy*qzhq*&8sONDZKh=+dt=1Usz_z-JUXwGN1_9dhx0OGf4HJeNjhaU&8%mUj5o zT3rtj!<-a=3(DSS>tS9o#5!I}4zzp7Yi`&=b=<3^9nIZXt_i+#qn}PL01QGYlCesm zKP^TBRxak(Hw;YICXdfugU@JIKFjs9E#?-Ps6_W;7(RZ^VJa=IycWkr9{jZ*8CM(Bq7dBG6H6|Cv+zs5F5<{AE`39_3 zD9Q)ZZdYx?uigGfe8+86uC2GJ*px;tLilr?t$TH6>RU!R_f89pw_TK38)wn2dh6)I(`Bjd2!h$~PttDEYJ!aAQ(^~6-N=yS?iWW4c^4n z?ZzXmQp-#+$2}$0%yhR#dd1CbS~?=GHBvlO_hSn!302p-y?d`KYj4xPM{f3xiiJI! zkuZ~|#sZGa4^$NLe@5ta%BJRyYsDHBtjE;d?xz&JyD6`y{!gVWTOxMritQ!Z@H;vP z09pc+aQbzZSzp|vJe<)=Jui5g8}1<1oj5u)X7XNeV{)}(RsS@qfGmmcbom20)abnY z;w#-)+1oD8?T2|5pHs5DDwJM&Q{gp=Kq|HP`bl}=Z*R9E2)JPRvega0z>tmUv@eT` zzVEcO2K)?B^G>_5hvJy+0}*hJrCFD@cRX1C6iivup7=V6tdcvRyeLzfkB!8}IUhi( z$rUGx6V^8;t9519X2qh?(`kj;W5`w@hDHm|vaDL*nQCb6w10*1vH<)-Z1pr1Pb6{Y z{30E}p4rqJ77Uw2yM3~1K_ITcdYqT3jQWjw#HEYz`-}l&C9e_5M}2PD;@?ik z_*LshqQ>@-dW6;B?85~4q7wi5g(L!z?~L`hVAQ^o%n(JYrT%HU%`TVaINN+0b)&0b zN@cD!cr8GXZkJh7Uv%NCU$kQE_R ztAC_=oJ1U#xB1EK^3-1pfbk6Myi!h%+o$_z7I#(sq-i(dDlZKr?Ku7HlfbU%DP8+F zOXgPhv+ytaYpRCzE#oM`|HheO7`E3u8YeCT?$VFTyW75Q2934)$k%8o$e#9|JV<ZddLkz0-7oqkf&0K4r^O zl*!{B-4;$O8Y=}(#W7)%lB{oNLX?~Z^@|~dCb#y765%=jkTk`9p#Y!DXUztb# z+bAAmg~(#DH??E^yr0?XoWQTFKF{}RFS1H(z5;_i&JOVk{|m_PbiJO@gJm&Zxy8%> zdh^WuLXXHV_63qHZZ|@Ri4hczVhy<~iLJb&7}IoDxyRP)w&atuCs8j%*SM}Ou5Wz< zK4WGVqo4^JdEt(Ex=39Tjbqa}zq`n}cmC+lzr7M}9&S!!&t&tiw0%uv&XZ-PhM+FE zL($*qm2o!$+`6xyFGG6Fn@0=n!)k4&Pd1?0RL+>S7P1_QD}@(fsZbt#trP4(*l8Dl zs@v=><0BO7CJ!m8n*QM5+)r`DJX!sWPn=lGrrlqTe{fI*!rb?jEgc}~*LOWzwB3}; z_zaLym)%$Dij&xHcg?&;?sqt2+qou^O=WW;&eTTmY+HxAqKg0;{I~eJPD((r-hIu2 zQJaW;BK1#TLngDk_nAnR*K|YC?=lskfrJkuKBl7yciaDyEN&FWi`Sg?whO6x_y=QH zFGBR1bWaS%o}ChVSfGQnAsMh$kK@sv)RBTf3MEXN7iYo8($SvS7y&icpUIH z(*FC|VdX^)z?FqrFqP^2 zZ2HwUC;Sve;^*&`ncbdX8!#qW0G?ldRDKq0XxIVhYt zHorBWdd1DdY(mj zG-o`cV{rw1eG}8zZlqsNi?5h39fnY3(^ZZIUF{?6=t{3GbcMD&8Y8POE&W~9{mK?) 
zJc$Xky86M3a5Ne+5H@Ke9otso2)1>b(Co}Nz8 zJ=(54@1wWv?CL)cJ|ldo>9O#RNscqY7iz218M=lr8y2WWWAP~I|5KUJSR z!_+Mrz-$vK z8=v+#5F@D{iKwkkv#0BRzA>mp8}a6HyJ~XBKWLu+&${(zzCEA57EJ(m4*q+Qdis)) zk@<4@@T0QfE55TNwHV8PtC|mmwUh|3Q;8Tu#pXCIVNXKZ_X0B$c0x*wg~<5w#D_w2 zuxGwNOUNTpeNA6+k&z)SgEqGTQ$Edt^Jl>`qQD66elYZlgGrMs(hpq|fxFV;I97U# zQ(xVPQzB>&-&wbg6Tt3*rB&PCu!GjHiwi-hjDKlJgs1K=Vul z@(emi7sT4jFahx(RwUlsJ<(9bhng*^0{m90aPPEIr+@^(h_7d_myH^{v@2R<9XpSI zZNP+9@0yQVF9!)?s9E8^He|(|a?f}`#t-3Gz?T+E!;)u~W?D7a zhyxYiDEHb;nkl{8jc*y50vs*^3zYQ^jY^6@F@j-}c764cyU~ZjWe^r+;Azv$7UrE+Eppldw0RlDWv~`VUvjg0$7XaCMcG8pM+kc~CGN?u*4N`ZX?7lv z^lpbTn%N+wl`dg?V=+|xNLkHmLktAgJ zIs(g*4rKxuNAnW~2Xo$k<+#4jaYeDHD(yG&pUIgRNe=+hD0A?H$*K07!ygB2_8tHB zI}Raa>$7z63%fh`@tB_8u--d2|+Qi zzKdupnvuN*UR|TL!8kev(#^#@9iZ@oX@V_tj9_dn;eqT%Dw~O2*^>_al0(^oY`yp> zAs0WvT6s9hrA;YzEbc^vX^{bcAf7uBR&FDMjY#JY64t6P&pSSP2|Ry_6H5K9Y*?&D z{5bPZDy!Xy_2Vu6=bZ1AQi+4_^CaJKAs;7QmY1xFEmj;{eCW~Tv0+&&%{>cRqmu(n zr1?rUh5BKS1%pfwLC2ig|B=jv2$SH!O$kblHzD2Cy6lz{w0@2B-xF6HdW{p*>Dk~C z{Yv+iDgH?*er1!R$*$a;V5S~*mqo%d?gq8TH$T&2ece|9R##GHcsYgKz5S^5aVAs8 zUP3PTP6<(jX&%}LZ-_0<;=TN|x0Sqa0HBuLt|NMN?*@B%YD&i^1fQ{^uw8l8{sa^k z?uZghiKGcEM)+T8jY0C;P>JYi%`o0NMpa)A3-H?ltJ=68O-*RmDudpBsVP@Uu|y-2 zHb;s%uH&$udE=4Ws!P7qi6!-;;UlMJcH~}H%{96XqdaiNXahl@usvz(<25q$E}aq| z*UdjWF-WqM)GGL*5#fPVKTad%Yig;vPp8%KRH)ow6SO}A>uM{t_?XXFJ8%BJ!4~&gNwQGtN!|;ZoGOzExoW zO#&sJI_PP;$*2Q>G}<)kJZ=MPa`EvA8e=KOVyy)*-H&fVzoyMv%A66$Nwn2vlm}GcjI+=Enq_vk_A+GeMcoke?=4iF z$XDbHf#=u^Z`E@mC*YDeS+%y}@8Y1A_*1pirtR)(JJkJaX}60@-9u6^_#W(L1_tZq z=J&Jbrf-zpR@T;b&$&HZY2Ny4q?cF`rdN#c8|aFvE{4$#K%Ly-lP?18?B+}^g085J zGhd#ASsBDBPPEM+pkrFktu))bfa{#Si!W8U#2zfG{zllmn#Zgtjha2G-Dwr%L*{1+ z!>Yf|b<#4Oey}4E$Gg0{(6gbB&XrLpC_f9JM_+!(ZKmn+XC;9754}72GDeSJya`o8 zVe$E~AjrTw6SLG=HmXFk{D!{8VX4>#U++h{rg+xvC3~B-Nun>t;9X?F@PO}{H99n( z-I5B-uQHb`v%5FfPZo4Osn9$NeUv^|42RAfK*ID>ss^{P4C2W`Df()y)r&;xvMsv} z=5YgJ!cD664uKLPB}RCrZGk!xc+6~c)`X7eYS%{1BmKQ(3aN!X<%QWJdtK+7ziZoe z^32lDB@Op3%yfb=DAyj*B35LCo}MPh;-L zm2S|)0wko*X?);J@qrTWKbJ4#0wLZiW+F!D?3cys``Bk_jUN4fl}3oqUp%>14sA-r zRXdT?Ag)d-QRR6z`#)JuE-7iB8_#@B+C62NAUmcBb5|=M6Gx$kThw1<5-%SS9OBy8 zw6eO;IuxNp>DJoI6>2$X+bZm#xNY5%^VI*N=sf({-rqLfUw0{O)u??=?cLfVIMt#k zsZna5TCw*E!YM_qq*hY9Mnr64#cZh=#3uHREq27{ljmtX<2Vz{jO%;KLHP^joFyM@u?-P!czR_$T|JL5W^ zx#PD5>~9QnzP-%W)=Dhl$Nyz`c>MeTqW0n0G_}NC3%9xNiCI-wh-FwA0{$@H(F@J_ zHWPZa_vq^#iI7fzrM%w`)>T?9r;>-3bvf35zqHJPN4uX|U?>NQjdj=DG|O<698`_B z>cerMvWbs>UD14ZVycQ!(D#BAJ(yyPUWldwSF=uAZo8Ap4nggBrb4HMMF0$Kp_4MuR1nXSJ^RlaU}FQ zwop7I3$a@<%Qn~RdAI{@XBx|Go$*@tY4k@dJ0bUbCY!2~+C$hBh}IX1e>-Axr(dl# zt7HWSeG7U03N7jxtb$9l>QZ^QcbJWXv=X%n2#qCv=Px|s>Rb_gnsnfU+XxC@$B8x8 zRy|kYkZ?fQ@nZ=)pd6612@rbxtKm;~sx5As=HB@cB_xL#+`pzxd79Kgj7XmN>y!|n zEU^OdmtKfY20;JZFze5GhY0u1?yo+8CSs9h8UQogo{Q0VJ{>mO<S2vdWrdgG;?%Qao#f!mV@c?bnEvSyr zmRI-o>^xbKV!=%fJG_Y}on9sASTd>ad`-#kN?>I7_nqdy4A3YoHNdIs3h^RB`L-`g z#bErvAZ3|}!Jxx#G19sZr3^msiIaN#A!MLu4rIS9yzBC+bv%+Fot-S&ZBxQ^NrNw- zZ6=LhTNG$rtZFxc5LyJqRT+$M*br*-LR42wL(-L*;qcATaC4t90xm>Ej%8dW zI`q$t=|^Io1TL1|o^!*Elgf60oBzMg+)uPL|J*QzC_M8d7aJrEzKnRZ%&ILFlH^LX zEy4i;TL!#Be8)#5v%zAVv$b~*PjP1_lI2@eaG>rgw6HKmE^G%*R?xX}tH;dn4|TM# z>bdlyN}YAu5h_JQdkZM?Uml*;7y6pKYrWAdlxS`9Z|&;e&sm1~N>>xkCFR~H-H-YZ z%;hr+kH2p5)?7Np`5T?_)qD%Rqt}Ux+Md^u8p#aHk-UoZe!cPvx9P0?zkCjOSD_yVtZl6b>?`LRZaJ<-RBAR zGY2`j2IZZ7o!X^aZ`=DuUu!P3{TTn3vxYY4@6aa`F-&H`;!!CiOX&Azm7MwC*kKEL zpH`K7|)^O=nrG$GWO{b9BuQ*N=Wj82Dg%Cv`i@nam0x3LVSch}CnenV)U=o)!X z_%3qc1D4Z#7c&%vf{(?o3U%{C<1+K2Cq;KOWL4%^%bv9klnKM)T()BcRpHX9^#J4VKpTWpe zr8#}b6@_(dC)q-t^F`lpbRIO{3qbFo*u9{^)(N>V|Y*7l|RFf5LHlZlS+MXhmMLTjsGW z=elaH(OB{-H6fVFpQa~NmdSA}imL97anoez7QK%CVL4{`!Y4pBP9kyUA>13sMgm!M 
zpu~Ak5Al5UTb8vyu-B1Agx?%PDW)Wf9k65^enEY=7`uooOYdjS(c_hzNI`4ejXPQ@+D76 zd`>w&?PvPBhkb{~20jaNa3&MWZy;jjL|8!0!rRAHmayrHy4lu`P%SubKSKN`*Zzup zA0DJbz~fQ>9W9;Lv1`ymgMN<_j`*6{3asI!sK1gq&ZxPuqAJ>Z^w?hWuWZFbpP($h z1gNS(G?!+c}=Fle4?$&LgWYIA0DyO87Uq zrN<*t)abz_{q@DE07?@#8si|Hv1l)v9j9g0A2Tf(#0(h>A60V${qMCpLhm-Dr=n~w zFDt8Yh9W&z@la+}hx6~#-g@LFC*obj&Pp~=)=%S6zBn`?E8$9HDK)hojeh-BQA~VT zJev1tL~$k3s(bXWM?)&oayvq6uFTMo5%p1YsiHPIjuHjq_ zfWqoTO+jJ|NI(5F6`Q6dxdNK)xUU%>ycQfQbg<~D*mfD(?2Wz$W8wR`D`P>jHWZ&; zr;bdksD?GcrkKoJ)>E)?K1nrPw?5lJuOx4$63dlaN>yh`T za+p|aEvZy`I4-DqQaT9O$>zRIiyvAkKju(jJ=8?Oo`VpAh9XXgP^Jz4u!kSKi6`I<|*l_5e&g!Xe*nJDKSI%$+c0oTqX$xFd zKz$S+00NZeSUzZRDCDmmzf^wgLSawwC**&v@LPnlS%H;B-jC;tCF_WMCWbxTXb9y! z+R*GyZcJ1#`JTZhqsrbb5K%XJe2=p+|1(){zkMXwcd$G3`mc_a4UGWaEddSLVnPKn zL@6`L*_xhg3{KhzW1zyx+yQzplTblaZ$2(_HG6RFe#6GWfO`jBO*IDlb&p)&-^O6( zWsx09WYu^e8}t#`%VoFn4(U11tTAdIb?_65?v+MJmf9q(WnVUB2~FqAnkP&HV>p90 zkA4lzE-YJxC^5OqOqg{#c<}lAej8`rY}M4w*M-11vZ%5(hHUuJ#o;iW`JI^vz120>n z44^z;4q#HrQ79R4x`K<;W9&|S228VBOh=^B=)?$nDzB?#kdn0abE-Oga9`5%G{pQN@9fj%+Ay%HMivM)5>JGCIV4eHS4-Cu%& z&%#m3Ij^3H`ghk(=EQMQ((CILOjm}M1T6#2J>r&5q@6;vQo{ve$ZDpw?*=AaDzJEJ zQFAidmj|#)lqe!wb^9|HH)kfFz80wW1b7s#D7Ac@>i`+V#3hLugjtY8X7zWvP}X4v z+r?G0dXt|kj_qs10m&B4-#W-LmZp!)Y**bg80soNe+djetIXT$)Bj=pV&VZxL!^8S ztkQa=8N^oxGib}XDc|oLAyCQtIoYttJaS^87jeX`zxLe1gK9PXfRBUyC5+3Lx(4|R zytfbxiA!wwyXemio3;T;v~Kj{9SqGU1KeF1tzBW{EPQH&P>&jNDzqFcpJ09%xYf-3 zYu;JE`_NmDd1p)S+o9>=rv+uI&+2F#aT_$RBf&rPj+m(D@!!bz9hr5t#O6 zG2xS(rq-tecARvhE1BW^EXKcB@|$7efnFU zHG7fj4#xeI??z#nnnKM_0hhgKPIAsUq3rci;BF?st542N%*dGeg}1MYc0Y`dGidCyCC;MD)sFA=o~8LEe^Z_iVA0n7p&>7S9=D1j zbUC<%>G$~EGcPT)xKa^Rqwu-Qvu@)at^&;L?poWSmvP&0ZLWDB8)+ZA;XlEubz<>I zWn@n(-qpX30*g{Q0PO7PUS~&fN&ERTzDNj?0$g`ndBk<2%c!U9g;P&$2Bq9Y-T;4- z43*z3jh|DIR&DebsWNf!`rf`JOfLSCfN=q1`H|MP->Sx}Lg%k^&6~0s4#lWabT;(S z3fn;aFsA>6txUG^N{_hQ@|xY7lBt3=RH9Z6C5UC~f!#9K^L_Bqwl@!h&-X0b&`m_L zlrz^36VLBtx6gO)XwhgmyD@50p26j1xEi}n_`zk0_eEItBmYYXUR|rJ!@;y?MOUwN3^4bckprU#l z{5JD!$f+W8B3>+&Z+m-xgV9vvpGDWlDZQL(G>*K#hdz4RZRKjUg-)dc8`5uY451_k zgfCd>vAp71OL=2y$ZPA?fBxJ^O-galTH^mU8)qd^g)Z!c#NY3)r7iV*u3y#VM<@#p z7@c3|9LJA2%@NC$7>oOPp7iPNWO;O+2mF<#@;_dUEkD^evNC32lWH~PkTWLj1-08a z?F-4x=re~@q4UJsT7Hz_0U5Kl$B;SjDi_+Uj6U_FBUlsZUbBJOOKV^T*z3 z!(N=};97=uaQXK|+0C03d_nLG*Z!uAfZ^56IaM(V8|@NxuzQd43j$J_$I z0MdSB^L4bUpH2vSm(#J$@(RK1A71O!j6g#5RqL1YilTGAsT7X*2k(zGqH=6p#hzwp zB-F+(BG4(-#ltS_@6ex+a`6Jhi>y!>>DmiJVOOQO6hFuY7B8GDL8LVGO&9SJay=v% zkY1!jo*be8eQB+Cn@NQU7H5Y@Uk);>0a6&Hdx(kD1Q}me#ZAy1R0zlDpzL~F@-Td} z)OC;|i{mbSAtT5KR~uaIJ`v0xkkRGt5GM%D&V-#bBsFbJ7oA3-pXGNc~azamtYGh}%&~R5*v2}r<2}os!ybFUs9wxlj(^YPb{$$XpGS`?ucaN`| z4=m-0=?~koBl8akhKQ-sdmU>cr!N)raVk-}*)ZoFx3wL!D7Lz{6Y4Rjw8C#4C^CP@lXhiOS z@w~^Q!JZ68F}qVTP9R&^+O*omU&($Z5)Q)!b`Fym73Q>Ft1=E1WZDNxo{S;?Y5$)a z_fkxzV&m(NMEl^zGvG85l-v5}NY%iJMwuyNs;lH%ppB5>Y>J3z7t}vlqHX6>^Ez6`-YMj-qZA6N%ZR)bY#{!Lc zIYSxqeHeb^uN?ir44E}hHXj?DECb`d>rFqOZl!URxf1ce55L?;S|Jb8Tf3R#0vLS= zplK14DCYU=^-LkkF@0H5$h&j>k{Fy`a_ChxO{3HBhkDDeD%QhnN7O%;&(^OsU#oLY z;{;IT#`}av6;^_i80bs5P!sha&mZx%C%x@el&yrVne`WF>IMkDP)xC zz;4tk>nN=IApERz2{10G7T(`Qq=4!it?+|iMmJ#3$dV#PIHgYsJa8X78HZPj_B|sO z6u46-=Ys;MMrPxSIc1!Y5Oc9v09(8XN9tgYKI-)lRlKVG&)z3IA-J%AWYs+BZF;d4 zCCv|smC-Ixd1P@S8(YSTb&b)SzSAAOs>+ShFT4u!m(_5J)%ltNQK}kasiW?V&M9?o z8rek#zby##H+cXh(vZM+M>b&j)Mo-kX`VIRjAr?dunt?pyOpQM+-=w|NC_m9o__N!2M+*eBpxhnhK} znR**PS5c%%4tuJhvIfBgq;&{jbY)QkZ_;RZs7BZ)NTs6a8kj}V=_fUGkuJzZ1n;o6 z*bAW{(a{(iX_eU*j84yF9ltiKsws?#hg7#X-!!P&|B1%=7|egvQ1R5Ft%T5AO6|`y zm4cJrCz|bL2ZTJanaFeEdT=b5mY2F4>mm{p=VQ3xs+I(5^0svm#ygfZmq<$(hV{7Y-lh71Fuq`=%u*EqZcIGM1%P$e^$L<+!&3>K 
zvdMUct^`}eOolX9X;mBA1vtJ8XLxRy>T6`=@doz}_mKkMFAD*@oQ(=XMFLpQ%m(!b zuG5;Fo=hgsp&A!K3(iT4qesu{8Wy}yjt!|b${7c5yJpOU^#^ShZu=Tf=i&EkOna~K)(938wO&<#ve zt;JS{pC%i$c5!Enf0rc+4R5g-x20ezO^j>-azj+0bB9MX(2DrMsVuCxZ-&FuHuj)AAWW6%Jzqey#Y)l_i(fMR9?5-QWbN|_LGB z5ZqGzA|oBas5Yj3+i~(a<02TG-xaihp?gI|*!|^w0I1F=iT;gs(yga(>H++r!F6|Z z+&fKx+}1n$z6((n?)&olvz;KVQ3pJ2iGt%$h}Tu3_9LYoe>RzLQuE1QlPX;L zrrSNpG)snq+8S>U3XjJ$UvcZtSDdOSz<*wQypgO?d2+r+|A7DZ6aEPC5=r+vqmg@~ z)!}}LPTdRPQpXDlo$xCH8w29U(T=QC^UT!L#jTa2lUJZy za$65zWEG#^Nm9QU5gYu60+j~UA19f#2fk@t%v>;>{YKoxN}9Y-0UZlLASXPe)1QJd zQwRJWHjJvcdvy;I{>tETWDRFyezR4r(6%F}>0w#FD%%p_n0hn)rUY)G^_KkEF@sV2 zu4AB%_f>6va=TDFHv#*@X(Tf#)*~7%86+KB%{DEX-y7iI5&iH4;_)U-oR)hxdKqH6 zHfa%}BQv45(DGN{gw(63u}Jt1Uo1b%okaWXBa?SfTSERaw29h?D< zODzFIub_49U!G^?jk>vm{j%m2GT@YL9hhxDE{3dlu%?44j;cJ0?l=<%8h}57&&F2W zDp({PjXS1@fzZ^H&A-t)5&@Gj`5xu64J}x|mu1EFzB7sWE)MZXrcuS`3*iC^0HQqp z<~V!Btx%2KmZYR^cMU5P>2ktK`JJ0Go1r=P+lhHkGI()pe8)Rw1@G^}Y{a$<9qABH zely@XEzUfFe?f+ClGf9#bukC_tx-`e|7N|b$48t94Q{vBEx_qF8e`?n^z;TeRT_20 zqhDF=cXq(uc(FmO8SC(MLG+O*(Xxipy-$fI6p)7OT7e4U-}#GDZy8Qkfaq={7G*3f z;sBr1bJ%`QYF2^hap{5Vy^I3PWClASW(R=ss5{V#`kfCAO}M%5Bm#<1EaU33=WO*s z*A3JC6oK^FMYQcOOHCO-dkykaCw3XDGc@T+re#ulwraQgWv}#PuXlUbi1#WEmbXD! zStY{}kk?!uxp9K3*im`Z)q_n|Eo3$k%@N}dE@$3Z&yv;HC7%RrgYBI5ZyXNYD-X9@AC2_k9WsX30 z&8JQ%Es&B*WbkH}IW;l3slt~xc^#Pl-3(AbT?{8B7I@@w@ELKhl>+t(^E8I4v zpSm5QvSaLCM;iR4p{!}k_nFsD4Kb8H=h5~`X2Xf$vJFMdwW=Ywml+hS$4_or2|oK_ z&vCc#HNUhvgK(r-KRivkOddLNSN~OeI<><#$$ZE@=F#CWBRXj!@aUGx_luoZMyA!d zc>JD@Mmg6EvAomy1Gm9=8JAskqC+0>C-mu}j^D$jztK}tDs=-Azj9GHU&6SUx+`2+ zFEkQLusR2k%rGdpx1AOS#Z?nGY2=NGKmX2st(Xz)rAVn^X4uiB6?Jc2_v%@85)>f3 z)*)Mouy&kvRQc}%W6f<(xXO&Nqz#M8aDNI zpE%5W{oE)qcRx?0Qr-Lr!Dm$koJvwoEgaglc4O!jilV|Mnz39=FQ=}*K|juaVc`yY zMwm5^gK_r76Y8(dSdf^RM~Hl;9{OM@6!FgqolrRkeb8QpWxbpZyuW~)6-hhR%LB;V z%g|r46?pArGu`_o|6N1v^Aj2{&GL-=(g~xG?TkR|md{~XMxa!=E2#lO<4cW7C>Hzb zo3-TOfoc=LPl6bJcqSRXyBWM_g%V}7ufcjw+v2-8{L>rNU64sp{G?UldAX-38CBRe0z@}R(8seA%@ z7z*dV84#hp1bE8|5(1%SQRJg5`tEm)1dgK*!=+)vOef`2Q>(QZvWVABEk8cZO{2Z> zo}IFBJ;}qJGdk=b9}S=lr`l+T0RC@{U{1GlO=*bqYhxm9$TLKGj%TgY_tEH(tQX-W zFEh9EtS-5pl-Sa$-*f65hglm@Tlu&h6ZISh$Uj*!vvaB)kFFAn z|7)CL^T;*!%kW4H@v!S1mfm|it~tdPp%GdLk3ToBHXRGJ3*>ww@Z5>1@PkKmSja1v z@&}-YhZ`D2AUO3~x^&>0PX6kDz&v~MRQ1QrrQWA?Mo1mhlQqXbHzG9Vd-v-AaXBaF zQa**gMcx{a$g1Hf1DjE4!@^3k-;S4kUvt*u?j@eTHx+nub3urT)!m{^`Q?_2EvFeL z7dJQnn^_^lQ}szcmW^HklHjJ&?KV77rtHgpX@ua(T5Dk|bz)`qzf@rNulW4XQ7q0A z6~U@asMg48k{FeAGFb-Fj;7_?ZDZ68{`U&xUE8v#hh^UwO|naU@yZx-BiWLwB1LHm zhPl=hMd2PV3)f`qykh&ZAhpMz&K643w!+Zm@~_wF5Vca($SU1Z6cQ%_>&#^FrBA~EwI3^g@wWUCTr(@1veJm|YX zY&vd=qOhBNE;WOKs8LCYoi$wOEz|LofR0~i#WJsD+A>@$v5^hlC=#8N;6@D#eW9eK zigSxygmw<-#fE$>-Gv)?cwoJd7?$3~>a|piic4TxjSebE(}Oj*aZ`C+$N?_r&}GYLrR^+2Xv%HU$5Fp; z-*Moexm79E*|_7Ri)e}Z=t&6P@=3p{V_!Va5E3b{7nCkbGHmP{+^>4L-Kyguejn}O zwX{&d;i*W=mbiEx(U8jdodw*r(dQFZwfSNsUtFtn#T$+$ZRyIFeZlY*Dtz()vC1q% zt?;?VS#E<+;+xH1S5FebHj9&*9qtYXNe=X4Y}57}9!cZ7)ZSAT%yRp<{mHsd!)(G+ zr#n6+yHzl~?ce8=(7?xYYmiva#&!tgSnj<}!6zZvQ~$;fjulq`abTJ<#I3Yj!J^XB znOdT#QJxyi)E$#qdG4?+@x`pM!Kt-?@o8dJfUtqZq(!hDbL9eUv{eDxfF0Uf>NeAT zn=mx9C7v|sy&xXU%3SMt!EBr+L|W?@>v-Neu$*qamg&)XHt6jk5r+rn_9dbkR(?#k6XcADV++d%`dLkOvq-RqeOTpH zblViTN-LacHL_6gYqRNN`D@ir{?_J<2v74P8rsV5qoiKud#2OZTUbWTG5gy_u}|X2ANlgv}*Q7RR|nnBkDhZFE#s+e0b% zkcfulpBwhqWuJ~#MzNdLbeky?94yg@L$8puF~>bJwUF8b1XfG;8dOkGp>VxCkck2* zS2cBugBnb?9)IzqaJe0P`0cQ|!h(NBjNbR@BojEiom8o-W&YvwOrS8B!s2M{suSRN z7U<-;2tJ}6v!08Qlb6XH8c6kWph#=ClfjwfOqvWt;-yEO{v=G`cCtL`4*#@* zbu?dN$a!|OTM^kf$WwmcCe;*Q{_@gCQCUN%Tf{X^Mhg-5b^?he*Y~Ct13M?T>`X0c 
zakk#=Fqr+>i=|FNo4A?a%vQOQ=N06Bxy-$8$Y9jj5~S^#hksk!#dUYAV7b=|pDqOyccB%OTZjTSVG@|xrmfJ>LnhgOs)Bcmhc zq|0;!A_2rf{+=rCllpV}OTjqDh_-~fP6^{t|Kf==dn>?SnQeMh2?VR%`sCb3zwT|- z`sW6wf?gY>@^<0vQ6pSF6aqPTjt1ePU;k(uUAtQsid2f#M6;T%_1#l7Te$Ih&*=j) zMc=o9?NG|mo~wzu2+w)(;A8a~JewS0&&!#&Hu;%}dIKssmUi2)O#th$cw}|)Ge!R*0vd<#&)VMpSO|EEnc{>PQUbek*kw4#o8u0H<161W@1X+R7RZ@ z5WxgpJub_*{x9pUCE0F>zH$W2q{uhr#IVvM{n^wig%*sb|8H>6-1X4SDU1&Hvd>Ai zDy0BUC=lO@|G2T=lQ>;lN6a50zbvR51h0?`+sJMOl<$zOtwBWOcn!F=GXjrb5Sm#@ z!xnv%ug)PS)Z&j7pJLLwNfp`(5G6KgzE)a)MX8nlQVY4f9ahu465Kv3I{oF;&o&OJ zs@bWZ?1>$Qq@R`uNl+mqz&qwnwpC*01#S6oy<_N?qKyrscS*=X6CROQ*?}2Qc(*8m z*%#%KpV?X+1`K9Pm#K)pt)1cBVR0tAslWPdEfJ~q-o!rE!$)KOBQc4_%%!_IS}%%C zK0~o+vTmWu{r{2lfsC7@p%ECiL-#ay7Q3pnG&QsX6Z?*ikwLIbuKG)(I`!b-S-0Pn zC856!Z9Rrl!8$#-@`Sie)>Ko+f$x53Usv!>3oBYLoN7RoGN*8|EXE#wngvp?R(fk@ z-1H#-Lq|RKXtsR>x}LxBh4aPp`r#4kVt&8Wc0(Tt+_Eg_ci69UxwFKhaJ@)DZZC)oT(7;;r#g*k5gf*EmmuEQiMHLSahP?D8WZOAZ_r1VcO)^aCb~H z9H*V(iYzv7jVh~EKNWFdq9$2Zwxmz#(+M5chAZp#A7JkeWcLf}E&6G`IbBC1IrH&3 zOMK7v+|B-J&#RZ+afU?x-E9ePw)1<@i+;maXvy~YvL8&e4bU93HsoUb`~e_#{UFKU z_3EfnTSh82BY-Q7^4&-V?75u46p)+~5i^2>qy-#TzfV<8*c`?EVzQKSySg|__7l!$ zMiYK-Jzlq&oef^~GwW=7nt`oUI;iog*^6H|RJG|W`*12WtG}wL7N-l&Fu`O`%Wcly zW5e2;-LGX8RVmcvbqej>{4k4M>dkR^Eq|nF9&91Xsykfc@1S5i8r`;k*Q(5~KCQOM z70WSqakiiP@vJgC4R84Ye65YK_73g?7S5BubtU)CHUiEq!%H zTQ1h)QnYM_CZ#cH9Nz9$sihC|;$X6C3qJ{(S7%Fw(B-?fYYRrw9RdE__}{?p>@&j) zw&FujXv`*%cMDOY>D&~QKK8G@oJM5~9jC$y@7B+uvM*SVu|c$dzab`+29Y;F`>+Kw zJB}>4mD}A3!;b3LSS2?KIkGxPJIkc4eZ)n`-7+5WCo zDPpZ$mQN~NkbL`OX}6`_q_*4+(S(hi_d0nMZY1n+{Jzs9Jvlb=B;R%U7FUXlr|cGD7NYp3r>oR2?=G*RF4UT@2pcZcEzH z|G*6lQwtp9bo7q{|9oGpk;DP*R4?i`71pQ{7!qAjm!+Zbw#fo)wz~Fvp9(8Td122xAnDHySFypbY6}z&Vx#73 z^Y`B7p2SPmv7@ZRs{SUF{*0%HX1V;Kgic-#rf`TFMeH3jp2&&L*#Va+rP57~rg^Wd z+|EtN_$W_|POkA4o&9j{fFHZ$ws$m?UvT*@6A`z0^*yM(Y8kw;{;#p2x1I6<33*>@XznneOFyszF(pOB75KRHwFjU=!T?y8Ec>`1+cWbl{p zpP61TxmZC=xFyZ}W~oiNvuGdsu!xT(e8)OY#We>_?{-?b$GPUDel82G*oM8y=2Ecr zK(qc;H2JV48&zn{7fpf76bru}95@0TxJW{0gE*lHC1}k@t_R0}%ox#%Z%Z`H?+p<*jpjh<(#LBvirmTL=!!*1)Dp4?BQQ>oPY zsZp*V0$xliLLxeq_=;5Yp8 z@2R7ixPW#RV+^ELt~c9DIkfF(`}1!rY9=b{yM68%_`B&fDg?VEkS3n4pIv{gg$S&R zA+7@!(?*<)OK9zLa`{sW>l(bw%N?4po{?HOxN?wnr$G0xU)EODE~wQLVK%wwq*-F< zUqcq7?Q{MYBn}@GkHV)3m{bn1olG40qczsomRLS{EtD|2d! 
z@n_t*+sMMVs;rk{?RuoJ1?S>uUIUHoiM&S%u0;Gw)tGqM2j8apwb37mW;Q8+6w%{e zEzgx-Kb#$tfe|z44OP0n3~dosJbe!2I%;HSnb`Vr6dT_PWYO4O@ivA&RUbu)qE0rP ztwNq|O>aU~DA^Gng(LAvI$^p~iZ#t*o`(w7z0_~D)$NZA<^U5jvWn+w{RKzuc%A$X zKChPj{K^;%>>a)xH_3mbM=YiIH%mx(c2#%$#xmj+8G2e_mlTj=5_^Qu>D6 z|LaO6-yzZVY>MgcLy2caZ#^Cpw`14@IYFERq5%}a5S(1w?I<; zsvG&j7f^gNiWDPZ(6I58o0=(nUC7(?FSr|>LCvk)vW)iHOw@;7tTr7QE+x%47L|x- za44S>d*Jg6Vs34pmNe)4w^})CfFZn;11jeruPj}I`PXnO6F3e|Ft^_ef=7~y!-0T$ z-ORGAd2-;O>>8!I6Bvu)J_&imBTYBdg)2`!7f_prRuJ-Qr{q%zUQ^I_~ zi;_?q5Z+h*iZ(p!uhQa?l?l*@Z0T)sQ{jAQE^#k(z0_GC-t*{^BaL$l+U)OZ%+|E% zXjEyS_tEr)?|2*MQPNV$Cdx9!l{}=#O~I|v1DTFtno$GZMU#V zH>itYQ*Y< zwdE>AS8^ETuV2@;BokI1_$bUk^nj{O{KyiS1Wd0e-HKz#WJE<|teZ?DF#aZZshBy1{SLKKDpVZR-VZQYG#}nDoyaM z9ztAR>ks0d0f^SrW~QZ2*}DgK5!9;|dbREG$C`Qnqm3X{ZV7(y0uQGzIC2gw2W4M% zX4b7+ty{I4bY~1ySCwLGt(@Lmh1~Y9fzkV>1!eL)Y^h(H8bTGL<-2LQsN5}0A(1XZ zQ{hb8iIeXfjn{b9N=A@IIa2MVMc|=iuITRX$#LIF%yy@*Fu^H!?Ynw$-GBg#FEnn+ zl1YJ|N^P~BtmzTuMTn#Bkp`f+d%5iq`h9bY2iHdDQd8vNf`@L0e{LiNH{H(1Mml*( za9ygt4lr(pMf%9kWqB3*drfCUGa&ssIZ-b%6t{!FF1gol|Au|2hedrkKY&%6~_%1@wa z9~HRyY`j<|LT}pKBRs_kP>QfDk<0G4$^56n-?|!=Y5Upmk_rFWIdV{T)n1t7!U92v zGi+q6g4&~Yg3bWF$5;I$PK=w?S(fY)`}?-W_@%%B4~))GmD%u$Tcv4DqYIq;+?3Bu z`hCLg6T&4E)9CTb+O|lu#U}6-w0hzIV>6NSA^B0G*Np5(n~6jsJnUe#ZjdEs%U9O= z;G%2h`U<)?scX=smp%jhNZ_?qO_Ozd6vr)ih_n80!gwaD97g(L^6W}4Y2mHb8iGGb zo|tVgJPNxR*Z5M{u{r62qpLpG^*lSa98hi8RAX=?`DJrwB5+jm%lzf@Rce#*+E{ql zJ+kWgarfex>E=tFaJ=uoTV|4p^R1>m*NWzv88c2g5nA@aQ=Q|XEWr)Ur%D%Pw{jM8 z_3FM)hK)@qxHx@pb}I&@j#n6 zA!CgXgkC^?Yh}x*rEzu)Th}hwq&`$LEIz5x_1Rqh>`Wj>rtn2hE<*b1|WbkhQmF6h)x860b=RK`SSlT4^y%gW%4zj_93 zw^$CNdDn%MA<0*DHL1f$CoJ?bKIp_V?c0f@v64s@C*Q!eH>j(1{mnkBGJC(bSbK~Q zM%yVx+*`)d&0h#;0L1D8`~L2!Kg>8ivq^SI`e$T}WeATQM;MHF{KQJ|_nU3QEN<6I z*U8Tec$KpXiv`QC%(fvhs#w#Bi7`15qn4|8ftH}0wvev^zK3RA;tsWz>St`3M`I-{ zDQ2~(&+^6mef2L$i|?`Prj6!#HTg4lPFbpVTdtoZ7dyi~%e-xNwX-yI+l#(mD5O=O zZeTB5!v966z@$uz^8xwg&LFgZqN>5FP&de9frwUSn~(R7a|l5sBkR6jbDTN&rTS=3 z=MD}JGpFBlI(;rSW=dglc@bCh!fEW6yWH%k$3>bU+!|ro<1Du@G|+`>6rG<9%$6@@ zba`+2mh+w0MXIwEi}530I{ozs5-#1Ho8X$&gXHKDV?X$Khdx6=s|CeKw7lt`pG=WXfC?{z;mw!T=v z{~H#Rfuej+W-D+#pSRW2RY9gBZG5yzA23@TkYD|zd;ljLcp@Y8j8Q^AmmM>him8>+ z7<{{Cgij^+ZG35g>-YtPx0tUVbyS(H1>5-!7spC^`d9p~S<1D< z_>}=d&B=`668d@N->)g&@6dN>%snk|z8iL;!*6H|w)e*ji_8)bTz1d_9uZ@vjX}R> zzQul`O$;nf>mpWn7pNw_qGKwW_CAfisM>9nShoAA_)jUXQXUAaqHIf?#gwyW5jIHF zZx`xzOTVp;XW1v+NvMb;1-memWjpumYl8mVSOb5wc;F_J49hp<@E13%$fip?{ym{U zzZB?|a@K8l&k*?M1|pMk_2SnE8gSqiaCVVZmyo9s0w+qhT`ObI>iw(7!9rTLzSphM zn)U^}69LDRtm_04KsGG5)^HJtdq_n-O_-+0awr>RshoAGr>tkiH3uM2!p>K3BKCRKkP2wwKqpM=23Jn#v5j=bE)F$H=RKt~=V?fIOG>S=a zE8U4YX3R0=((`s>$0xkW9b)>$1Vu0316IYwC%0dJuACRdPdZbJJKqiJ&PNyREM1FM$J?E1 z!b3DUC*<7&xEZJo@{$XoAlf1moN`41a}&V&+vqTUBCj!(?;SSac}jkc-a_y0KZi59 znw})Md~D>)rMHEr(#}r)F{dF2DXZExN^>hm?+yEx{$WOnQCfq;Ls zvils5GT01ucv~u)t^TcH<{sh|%(XMr=|OH7MY-CwwS_fomWSzQMzlP5>vR>;x+pgX zW7+RF_$QU-61H+zt*3ou)wzp-Gh%>|_O)$~KCH+Y#a-El%2j%8nE>r~jbEp-9l|Q* z)S7q;dNb4=lX+k3rWdC3p$#j2g^BL>E41v>fk1*!pOiDaE0z>4=S!u;P@bLCOu&6` z(Tn-X3a_;M4U%KV!p`sdFJh9+rcp5-X9`+#|N8kQsrE^>~O!!A2hxQ`j!v!V+8Y&s2BXpf7MS8IQ zE%o8&U1K~PHE#~ZDnDJ?WrpIO9GvWTawQ=3p)ATUZaI8X?bC^FlHtyI)3P=BWYmKu z)XnXZ12SF`%@nb1={_wP->k()4Yn_baa{hj-mFg@T;A)}-RGI<22%I~`8|P@#vE@m z(R$#!F7IrUHnqS}%SZOTfq=f#kCi$V`RfBk`WCHGG7oI;Y@27huJ&XQYD}FO`3rof zmDjm0LcijwvqNHpH5U%OfLBggo<^lRdorb-Atucn1J^$=o%^_jo3*8U1H?S3z0F?P z>EoAC(#`2Wu{`qpF;0Fu(I~j+NX_{7}O(bH_xDZ2u|P4edK)6-V4!f!P!=3iJP|W{P8>R{T&~(T2oD^R~B@n zD<{t5r^y1H(80_26srO9G-j|&NcJdve9Ot}N^c#}UKQ^dpJ4rlaH7%XoGiI`nIBOW z<0hfH1_m=0o3BLQye|27;&MoOsqWI#50wwscT0PjY+-Gy?&Q`1FboM|7IDfD_;R{z 
zTGd9Xpfz0}2}lS%P%9tw&MH`~Tg>M2pE=Z&i2UntZEACW>(7mnuIG)u3Z_=_Ei+EX z!cxiGuLQdX2g5!NPnbkMc!Q#KN7+&pZ=Vx|VB&TEPIn?E$xFwvx3L6zMN`~FM1&8_ z_Kg1@MQ7pHWc#;apGPr3QR(JUx|MDeWJt>>*$9czqdR}1fV40eGGIuI(MXP-NOz3x z7%;jO}ilJ%TeR|5I=N6DW@n*so=SWw%(? zYA|9#*M-N>?sqS4CF`zLSZf=Po%3_SA{ycSM|+O$Vm~o+$>0?G3Z23b2|78F^?FZ! z@6Xq?0B;ZDlm>`{In_x0Ta_L1)A#u$t5Y?;Znh+9u$+yo$Sk2keKpG}zBs#(2QxK} zX`FH}>b_8yBUek=ws^AKBCo}duoDT*-fn=WXBExv^v8d^-bm;{eQiU z7=gaQy_1u2w7VNwDv&4O*R|S)HvOe09mgUXsX05wP5HmBV+^#uIc`mV1bEd1Ik#qV zMecKR8P|`q_;~g@PZb9uI~e9%`{PsIloUcz4W~0_#rYgR3Ih(Mh6n0v39K2S{Rmyi zk*}J*<%j97dPAEhiut3dSe1mA=-f($XOoF!pzSQ8M2B*U#+TT47T%<{gBhXOE<=Im z)hO;)#3?+g0hno!&f*>PFro22iCrVTrc~+_Xrd9SBFa3FBo-O`vO$2D&T_3sAcDW< z?leOP&Fy}T=;am z+=J_l|9m+x&?c>NRBYHt{915^1a_L`H$caCfF`*m0wq8y3tilh@%VXCgH?*td*OMW zy5peE5iSZCUZiX7v0FYhEcg` zg8kv7xOcN>yDyKT{amByRSX6-E=Nsz^w+G17CnDqtF)CuJ;`h8M;ND*mn&m`o|!hj ztQkm91bi)6DhHwQjUf-+C;GRewcX}v_^UvrI)I=$U-78eoHPv&{;>!dxMQ0~v*B*$ z^H(Wl(c=O{;y=25b*{MKat`}E&q_6JmQa^Jx!PA-CSDa?A(@}6BejWuo1Bwo`r3i# z(scF)uVF+d^8ql!xi;8ok+3oT|JQTNxk(?Y%EXG)yrCtk!m+%I8(&OGDN)q1Q#~qg zUZT-{11@BOon6``7zMg^CxJ-)1e_%rf7gdElWjsCcTvuhJPRr zkK?EuTYpz@6Y3G*&gpt3)*zWf`)v63zgIx(pVxP&lsSi$={GIVRP*GPLD&$Ie)LEa zB2YQclu_@WYPk^l;X%`PmVJq_0qJJE$FPeAf)21JLkLL^xq19q+lZNU;pPA|_i zTOnjWDmsmGm{b_Vy&Y?kHjzKpL|(``IX-DBV|nFYZyAit=QA4F265LU+}5h)Irxqw z>(6>_%<}3e>TdaBDV&73VC_SZy$Z1=>{=&M>aPJ2j~j)or6u?#de$@Zq>R*m%!F6A zj;OQjw~KQnoZNdC4?IFF27T(Q!1i_N)=iFcAJezD>b8lD5iOb3()n-_HRX}3+_RlN zOYT|EgM>58U&FmpPP8Ykqf=aPfd{g$>p9XntvT9B>Qc^$?O@D`Ub$XV=2=$&+XCl= zbjH*03*)QUsVnU$8DKFVt9PjqwVCU7pM2dFlA zN&p;*+wW1dZAmN4J2-Vp?BOkcV9K>aWCD3nU8nwBNlhCdjVvGyi`Y|8bv6Qm$_Q-hERi}(T`%Wg(27HFTq<@&h-rvwVBM zF3CTXrVPQu&}9;p!jof7>T0&rsX7ZX2OGitS5fEYLn5IC?piXhZ~4zHHJ&VcPaQ#> z-neToAFl^Kt|_N`c@D+kb4v-mdb2E)B{Gl)RWM;%3Ht-h7*~u{sq~(ALsuK{P@QWG zzbBLd*0oLBUqv9~wG!RO9~qwg9?n)*7(wo!VprwON|VxaT*f?l{%|h%fiqZ{5KQF z^%lRII-2xj1wtcOKI#QB@QLh~TAH;J-v7On|D177x!Se3_b+3Ayq5DruSh7tD?-cr zgh}_8d3(pQ+3n8)NuI?|sZ;6MQzQAUY`|l>x6n8xYi+N8ax^>{BZjLVziYh9_u%1` zs^b6DPu}$=WIvG!K;_lrmNoabK<&~xX~(@zfgJ+nuyfxAJlW65Y<$@Jd-hf z&!KHSB>1^pDJ1WE(xnG2Kc$gjAw7Zoi={?|)inJmPd7VXlI-L3wLJNz#c^ua)?CmC zt3q5UTaJgRXUfTI(=8%CSfpjXV`4_W=&SCC??4iBj0FB z@=niu3{_T?G0H2_xb0O*!+dv=bh3+I*}5^;8l^w ze_q!*3^uWbq#wQnsyoNssF)tUIb!lPXYsIYuQ6VpF8#iCcm`2ZL_n0j@$*S$nP!lr zIKczVGuIZ15za*${X1c~(Js?Z9IU%r#|r!qH%~5RBr8q?+H~9I73x^^yexUoA&yM4 z;F#;+)6;tV597~GgaTU$%N2q&18AXdGy=>w+9dMEW3kK#Fr$eIofq;OqSjPWYle`g8vg=#{s#=c0nb zMKVO&&u5;v?zAv--stu~`#kgsrhckWc<@toutKl2Q=)lG_;S*q4pt{W^TZUDL53fwnFpQF`|BRS?Eb=Yh`j5JD<9Ej#ZVqv=8PVvRGC9XL88 zivPXS7J|IWr0U&iwVuzI&ua6ZUd2VZgg)-@ziG+uAUCIPTg1NF%0J(+Io51NPsGa} zb^V#VXI24QyJR{;i1`V97&*Y0gW;shk4)G}71K(3;PE>4jnK{+0JT;;geIst(bT0K zjRfhWs@6M*1XISD$+IY+PT7X6Ly)B@?JF6+b_SeDJ2RPP(T0xS!H$e<>uUjNMud?@ z9^J+7&!cS&P$?H|85%!t>9|z?Gw9yi2x}H-Z*NBukn1qN-{K|HKFPsXcv=&3qT#c* z0o1=sIj5~sK3N)RY)lm!J%U59R&m}5uYU2F&T`hONfeus?&52L4SGPf7E$|nFO`!t z&4UeWEA&`sxc1|`OGBK$PtHsmN6Jc&Os38vfF{ajbV)42?Oxykm+_;Jhpr^XzwU?v zvr7}b6>6YmF`MTmeqL(%SBrg0S^H*A3Rtjo1bsl9@%y*gu*&qI;0eonw8uLkIax#_ zeq(>P?&?^i)I*^$d3)>m>oDJu(JjD}!E~%0HXG%D?hs~)c|~fHoW7_l4?j^;<0e|= zM9+MsGE>k|12$Qdi$Yz2!CK@gp2!U&dDmL^L5&{lfPnUeO*~4ZJPX@Op6_+|%DINN z_R((cbtsoOjichSIhpNBUGEw}6Ur7CtW_K^A7>^i168O!yJpl>Q}!{#6jWBapliFt zJbq0)0XsD(O>H2KT;oQB0QzN73+y(eYQBR%OInniaYppkB=>3!??mqJ8i4>&`;Bo} zPX$CJl2S9Fxo7iDy>b#>X0Ydjs{gDf|D4H2KVj1Zy=B~x2cLhOJo&|rI=->OpXV3y zd{#kQ_H^C#{2Z<~;uG)BZ@UW)S%>bbok&w#}@G6%ckNlnSl60jqWmjKMUM2`S_bcE+iMY zxFyRRa5yXKjPiEJE|nF_6HnM&SP3hgnOL3sRj@`>%QmeTN%Z;s)7#0iovdP!{t)-F!S8&OXhcj z4*v=8^pmJXkplBaJ-p_E8(U5_Bd|gmeo$r2dmqX9t>8rSYWlMEXsv+`FRR|Zg@+L6 
z7b`>RA;TUTV`^1yqBU=xrkzSwQ#49wfLXkP@B|bU^0;K&+Ei;zvr$kVeRJ2X=^KZI zSIjIe=f_O#gBUoxGzmD8#mIGx$;dK1DAI#EUvX)^T6nM1axXC2nc-v*EqHKy; zasoLp$gHIqazMWLSZ$tav$Dc7ei(=!dkl6Ti`rmvD_@Cipgt8G#UoMmFki?Q>AvZ5 zb}b6oG~x5mAlJ4t{+W?B(hPqI=tY!0wVfNj{6{Cjq7V&&zo-2V$hZHKgN0n=vH$Nd zl=*xFccS}GvF$kq11_Lg#rR%p4mkf!qDaD>snwh@&MRlR0)1~X@>mwvPhe*~8@?-o zY$_U`NXOUjYnlzLH}YO|Eh3}fYjz&Zyw_c1LgYY$S$fty67r^39Cpy`*<2JkH4ecK zmqiI*H#1$bynhpwBJEbX%+cRLrkSK~!6-OG=5T9B_r5k;rmoIVVn~m4ll9nowSomC z<04Z+E|Tl3A57c+BsC(wPX|8jJdOr4%&kHpoEVXjrl287H-(29S8@Tgy%^t$(Cy?; zh7*VMc>jpP!}esPD8{K!`k>b(5ves2pGIgv#v=xO@%~+1;g$@({}mrzA6HFv zk9(J=#jbh|cSPs~%J1M}Q!ixfw{Xe7s_Gk|=<(EWyCJT)yURnTLp%*#JKiA=Tcu)a z#`b#>)zn!IIw$a-94#n%+MaCiP*}W)<5Qwns+e)iqWAMVIGe+t%=^$PefYC60=HwX zN;!`vp|^1j{8NIT1mfN*G+-)#v8SuQPjbDrj?K(+Q5e*HjOg~Pnw(3eQA9S9(VS$^ z(yr->LTEeJaB(OzsXaajdQz&SG4FF0Ci3G~sytJGL@!tj3AA&58ADGr62p}%iQw^N zvxk03=B6YQ&9E>Z54#HGhE}|kx;66o&?ByC{Wzj!sdh_ed_qaner9ZOPu^koheq`s z=hePEE7Rs?#6Bjk=P}#F4iCCEp8&dg&lfXC0T_8iW3`Ypo-C4Gg?Uzb1qNci1 zV~S5RknWL4G+NUHsLPz}Le=R{ibSUR6OXswA072Ix?-ON*yJ|s<_79xNpnws4_SkmF_RwNzjw|!(kDXbNm6PO ziOkl2w|DMDGo~LG1b@mmwii0{aTzvxzNcM1q$CfVrj>T-orIbES-zy)c5kIQyBX3E zu%k%LVdUBUH%-V#{&~nl0UPs|0mO$~k7*FB&gQRZ?smiO-Ai`h*%)&@1cfw(q2%c~ z3IrKzCjrq(;?>&@HMNQb@V&3OSUlP~IPNNyqik(I)=PTEcVrC34hfivaYC5uE)I8K zSF3IY!&X0hP34-}*xOt~SZiF$tk2IUA5jEXTyS2P&=9JMCU5&|qtDN}c_Wr?%v&FJt1Cs#aBv)_a&QRwOKUaLcjE`n05xr$2RMiN== z(u`aV!B%xMU^dy(O9x4EwS{H!AP<($*9vwl~>Zu028>!gPH$hZ)>XPwra z7E=X|?=FrWDY9o%7jfw?rZLkKw_4aIU_m}=ltA8c& z5y#b&N^5HOBvL0A!tcI}Ro+bwI9Sp*(2%U2NYw|`n5K9&8sd1=RMdtz275Tu7Vi=M zq{Je(z0V}X65VxwP_^l%vH@=3>1457Tkd^|1zV9_^PdlHzawG(ItOk7H`#4 zaA99!IcRg^A&FjlvFAjFkK#zHpFINzHD$Cd2d=tOa7AsX?ne#E6n2>e`R%F8w zCfj49lZ99*ygqK9s_Pyo;H6n6D}~`QSR~uLHZ`MWtDdf40sYlp{?odL{idFp%oCBm z1r1{(2SpaST*v08d;K)UHbE>P*4uO9vsiqy>8l9w91@n;Q*G3hB6%aY^>lJN?ZOeug)Q6QpC z%=BQWrgtg2Av~})yCRf*cX%?)6Yy$|LAbdp@~%aowA4n|BZe_Aa-U z#?X{7FYfrH8W9SgDE$ns)vC}V0Sa0A423f44}^_I(mAuyHJCYVc+NzE+&Qw1am$*} z;JmO2oo7RVm*+?nsGFvq!c9kc$=U3N{#>Rbn}eQ4`Lg5X3k~oFh9ngD2Y0R~bchwh zjTaGG!kvory8gM@iH!7dt8Fl_&Is=I7J!~kZLPj_t!y3K8eOzXPPeH6a0(vX?9VkT zw+btTQEX~R2Z>?66U(V*6>Sp^27orhrf#)7=^ytD>sn&{uLw4MAd{PWFopdls^^H6;pM=jkWm_e~ zwbBEceT0V#i86Tz!j{%d{;!Ba@2Z!%L3MT>&4JZOX-2M{QKF`hiU8YnSY?vAZRmx7 z%e^eQ1;8lxAUu3fliWlm#%U)Ma>UU zv$YUPt8kt4kj^;N(D4+{bAUQw#uuR7vCh_NY9z*DT+7=*&lkX5=reJ3CSKy|VPiJz z%HYS1;-S5w>$8~Z zApJ+YlL;dAv(!o2=l2cMAw^mP9-MJ);`S*Vx}EYG&x;A#jQF91oPiiup(w)GytP$a zjdm2_;^r}U+B6P?)4fMLu=@aCt2bC&yh~TIAlRsWv4lfWlW~|&aK((n=yeU1nl zT7o-y%7T29qoJ5aT3IXt-e@Dhw_KKo@vePQT=qJCO>Qe~NqlSLzyR1b z*Q3@guA+~fLL0ioktau2N@~H?b;^CLwnRm`2kQc_<|eqIfxLef|9N|H4^Q@90uKc| z#4^Wns(_X&=}lU|Of;L$+Inawz{~WZB^gyu5jj!jh7eW%JdH`n`&Q36?q9sWqZ{W& zqfk>gE8wLUd+S)SyTw07a%?5k_VvmwQ&SrofpGhoNtPTn&S{U@&A2-&NByf)j&x=lBwPjKtA8@(wy^!9 zt907&s;%}spW3M2L=j~NM0^rN1ms>!48%!}O@+t1sX?K!OkV%_!(hRzoA%L?(n5g6 z{Py9<&_;~qTwcmCO!nCCik1&9EPgA#1X8M#nXi&r9Etv}PXovY zod*x(l)$G)-p3nWxlR^_)WQ!IwfixwLm@(9ouM0AyzME|n&@YQcl8?3)e6y&h?Gi{ z?2}L1;tg`{qT5DbX9owb2~XJL_>t*&K2|`2+v*}o)F@Hphd7pgWYl%wXixiVVt>m& z?M08trgf&QLeT;OF4lJ+6==9jlx`L?-iPH?kBe0-a_q94l%`@3PE4vE)w6bsn1u9+Y+W_uo8c zLssz~9EGS=AI^-dmyDeTtWIV>^ym085uyki6g{{+l5SR8d4$20dvc?0)w>R*Jf`8g zKh{RZ7oBd!J&g?){3OxWSV8mh2??=D4g@$dznY38I>N`S(cPN{mh0g%ty+y z>?#J4r_yVS1ZMobV_1d4=LsppR@;ZN-BZv1s~ZRRf*F$<95TOmy7xM@PW0x#x=a;1 zDKnAbN0b?dil*cj;O41MJ_gU3Y)J+jClq3IP6pgNW!Nlqjr`U0Q0DlRzHlF(HYtt~ zLH^ZvrcN!3Zz8#^!d$98rzS7Q_?CUwZZb$yk9UR4{l zy!Mq-;3Y-AF%Ou#rmU8Bz@x+`IEHQtHfPYtVphcQwTHVU0`K)>RldI=`MA^}9c>Y#8_J{BnBBO}YIh)8V zP3w@bX_t}ichoBn?q*e)FoO>Vr024DlMINarqZsci{q60J||5XiMPdOXIQ++yGjVt 
zhxm_>WE?^lVV!Sfv%XWMKDCch!wtiq-+s}wMc&&?(r|JO4S6#lV*W|@f5@IuCSBbA z8Tp;_2VKL~0FtlI6r7VSb-Zzwfio=ew7TuNpjNxd-Cnjad(2pF81$ghn}$@Dr9*c~LK8 zK9>Ve(o*BIOSo%vdz`(Y>#7#SR%<(1G6tVj45m2QJ!BW-*l0TCY-#Z0`yM+z*%p{~ zb_U(J*TULbP6J;*j0TIk0)*^6w+xHjmz`KhxL5OLPLjl~Uvt~pI%tWYbM@L4M28L< z9nu|cH5@}_tX9;^d2|;SOttsQTilP3B+xvTGiinLAC{4{)dN~rxDj=6B z*{IhhDOBp&^%+@u`cZt>#jPU#L6k__%$HM8OlJ=rVJ!F+8Dkk&0S96dmK}0e=u{Zo zc8CAJ`ZOjJkN1&(jD1Nb((g2Mv>zGIB-~6>TEhzsSGP>9Z8Qw{wgAE7=Kcc8`F@{$ zjT3RSYGUbR<{n}avVoAm6;ND)vK)dzle;I)ECaPR#0?4Rx3l|_S68j8lvGk(vpBpJ zqt3D?xwG!5-gDfgEKBs6wf5hf3MQv_b#}H#1T$1$*(#VQ;o}TL!wl(;i7$%P5xK1K zYYmi>k1pKL_N^tG+f2K$Jt4Lk;)-e}Oq*v{pU7)!XA`D=*VcDl2>&^h=<>2!avB`A zIIWD3O|^F-|G1Hv77i*3!U>TW*B@8k0d(_RCN60OjN7Iw^mWUg>whRoB^&N|hv5KZ znqJ-XjTOr~d&KBT5@W{Bn{RfgzG(F6S9wIJqmrcWZDmw~C3IVX$I`Y!j^`V{C!5MF z7t|_R#O?S31ZN!y6+$2pTlX`GAQ5(tn!qN{LG9b*e#Ops#MuYUg)W5x)AKs({%nCc zge1a8d(7ZDOGd0?0^|0;b3NU(Mt4-6tr?KiRazf$_)&MoKOQfSRc@vze~8P%Exx4J zkHRp6UsYLL$CjaFF$`Qd&ZMq8WhJnn$KyO&lTdmgU-dnKzB1mVPKJ;C79a6nQ{cqv zPm}3g|fb*1Nabi4>m^__#KvrfXxze5dvH$FL46k$AL*ea$WK=xnT4`xc>kB|I8byV~8% zfa)9T#uPbT6DMRD-&g-HZty0m)I|)VEEI{>e2l1;vn}C=E~f51mU!&WYf#b(?0X>~ zt}n@JM%BA{yaf5Pac6wYn3B_d+9)QKUJ0AcRkU3{)cN;L2zPU)S<4sq@n>k~$qSMT z*HTo6ofK;14Sd5tL8yN=JvKptB;1$^PsOYar27b}Onf|vrh2~`9hEA9E@Ant9^3>1ljly)fXB_nI;_Q%SAx_ z$2VdkJgWL0{^?1)Lb#`jeE;P{e#w1@M;g&ZGMBX>0JC=UV9K&)aciThzL1N|XDPcs z>b7k)b1n|>QVI9F&LNtl8H+gnn^TXD#7DCDHLG!2HHs|o#gml)JOK&i|WOu>OQ{nE1a}_bDpla;)S^XHpLEJ*nuZ}fLb3nr+LOsso9Z*pZ_&-p+o?7C+$o;t3voP;I-`n zmxHvmjhELOYNir&aMpAgEw=}dK_Oa(YQk0s)0q6n)$Z-vAo%%Ge4s{M^uakWkfR7q zCSI|Xz#}ld*_KnP6n!DjgAB}=>};5`slYs@ZFz+{+VV0DRY^F9edzaDa)=>^ihHd0 zZ>9A)>bC_y21mE32JWh_9rDUv42co4$=@HK6wTD)H*`r ziE-v?R4!F7DN+0flGfF;m*oyD zXo@kasT#U`(agl;^Fe)oZEO+c$>Pz|uV27F$?mctmcW71jh5P0m>nESOmmx6)da6! z?2-g&`g=TVTpoCj6DOD3rLmIQuDQm-4{bq@&q$tJ>Z+9dundoi%V&c~DbBphbh=CB zq!p=bRCj-MBmRamMyJ^+@~(j1$n_&@k8pu5YCbNBoUsfuv0-S)f9iW#E#X_2l}WSc zyZ^(Cc@D9mrp#N49|->Pm!osFbG7uE11;C*0&>8ZGx^x~Xg`4}nC%Axb&j~=id?Cfe5{@%&A6fpIi!n(Rl$ZSL&fw72qI#tBjxE(`Y zPxY+)p_&;sQ|^oiR^U!4z?Q%Ed0>7;#C9@Gry6!F>=>b~7R&kzB*VFxT=pW2?;RX$N16n0<$rqc zHRxU;_UAz0?~XU>mv^&*{v3J?u0TF~c4G{`WBO>jA|5Yhd1oTji7q*7n(9{a~se-&Fa0YjM1o~ z)T*D=nE&d~x3kYo7FwV_FN!^}g-L6W+G5*S>+bq{iCH??3Jm)#+l$H&NO@b)$YvL- z&x!*e;j!BW!Fa?$gmU3Fzediq>=vDxWH6*G03shvEOgbsCtkf?f&W;j`KNNwU#yI; z!vqD^_nWQ>(_qR&^{>tESpWk9^4Xqe{c{t$<1ptvmFw-Zs7OZ6lyCH~$oxiYj7e%O zS6Ot{S{4PE8vJ{Qg2<~098w;&g(9d*l2)wErG#0O?(J?wxy(dp(+YhjG3yeuT-8%w z!q$+vbDB@P?Oo`tqgQiOZR?%idOE{>iJXOoM6q&5DFXS_Jg4QBcSTAoi*0VmasQ_G zY=%cjP<+52T>GrQf*>E4^HJrjSL~*l`h~mTlp$=jyDHEnHv?KH&UbkepJ|xO&`~Ps zut`@B2dxJPf3{oxW5U(KPJXs4KV2P z#183j59AGiv^JoQ2)(9PNzR+$dgaIOi^uu&8eh~mimg`86*!Kc363;Obr$L)L%z~V zXaxeANZbGP6DJxG5QM?;D?R<4Tr)VnpfJ!-^?R?w@xNuB^&oX03z&_^g`hyJCwGb| zs=qyX!Q))aHbp2=E2$au2tQbtGtskS-;1j^U-|xmVbWka6Ev@+7}2rjs=^KI`_tK_ z<(%d`7@*>?e70Jlyppc#8Pask+f!GZaYIUk$W0G%!sEdo&~Q5gm}+B>%k}MI_B}N6 z_KI~oH^MBjcRQ8HWl>LS^vz*hB$^r?2Qu~pWR_l@$e4`JRXDjfEOrCC>}fVLGbCtM zF7xyoi<29ZS3MIH{g)^jbs_ERdhZDyBV_*KA4k2}DW0gv`!k-D`pzy;%wC$W zVKYFiUMg-RIZ4RyXD$C^(tg;e`gh|<5sQ#4&vy{v=qDSC|1v^rui!koZrM-j5R2K2 zoBQ!L-YcgBh133r#zR!2BQWA)7xv*sz~SFJL0V^8aGLg&uXzKr6g#Qb`U_gx!dXx4 z36@%a4APpIKbiJ@M((-w(MvoAH++c=x|hi$)^f5r6;{(33R6t6!khaigj08u5R>;i zvkaSVLql6LsR4hAPPFs`hR01jYH-Wgf1728>Y;OX6@r{KPwcb^Cd&#=oFo7ZqIrl_&&=0cNquXOJaY6e{`C1t7)xoHqryx~lDD2*%F`P)I&ZHLxIk$K|QO{dQ_4L~XA&EW~Ne?~d1AKM~ z$DZwq*i>PURk|=r)FPQ{3 z#J@UvTxOhw|I>tsnsEAGCQd*DA2rsux#iGq!+OoeQ>0PYA>0u z?mv7x+$BXKotUvC3AyB#=l*2O$n_vVwM8;%b5jqZRppoEq@Z|uFJ=@4@7=hSZv){Rm^A~5ztmE_gFHo$zdi!|bYg<7M6(E`S< 
z=e&biwT^()G@~uv5mAffjE(kecYTi&4G8(pjJPEiNJ%RfT^`hbbt6*IlE(TzizzQS z(8zyxsI6krI%$%aCUSj12-riv{q(z3+Rbn9qGvuvOlz$tO5ER2h|M)fDLWhTvrpmG zncP(%Z%eo={0Cfm&qw~0(6(9hx=dWKE5f6J1!WPofqdFF%{G)b%Z)J)zihBl-@&}H z;A%}3iA4@k*W#L}{;ahx{6bPYg>rbbd}e1!-)OyMbk#XGZX;0;iOmxfaGUW;>u{vq zAs~xCs*lmZli<(R3{*e&ti%Tk6kH3R?O%Gz|Gl%4bza}&mwh{huPdM1IiFsoCMj0A zX4Jw=InN;EzZP8{fDet^Mp0Q*2W2K|Qb%r85u|XNj-O%}HP9&GzGh6WqGl1>Gl`Uh zD6bxguC^p~MedA5a;nDH%VZ-Sk|oG?BZQCiYB#}|cwU3O%QwJm9rwfV`1(_Srl&8k zibQr>+r=SnD1ElG{^?B17#GbP&^}xFm<_fd(V$Cb*z&O#@klV*mi#g=Ncm#rab-k-3cjojv`}%uY zNLMyO8K@I>3L(&z5V!e+(97#Zacd1}30=f=x;B${StUjk&35}HG&MHG(af&UJyvCPnc`bv=;mqZU_~AGsBFtg;lml(j!rdgq;OUULo>G8m@o4G^`L8Y+ zPWL(?Fawg(6^=eV`#tsd4t5pq(MPQ_GRXoj%$O*A%X1m0@s!oAR?5MIKdPn+uJqk; zGhP50)R@WdSeQ}vL{4V5&U-~2J+Zu-3u$ibB@PHtR-pcgUJ!^qjFV{_vLzk+C#q6PpVLTdp|M*1GYsN+3!MnmP zH9@aa$N7hL5vPMjy#&L*0N4@%3k-Y|WD>zfAS|BT<# zuPi5!gt=*3Dxv|28=B1VBZ;i4yD-}>eTR?;fziBQGp@x16Oc2UDJV=sq*kaj-5oph zFS%??%wq73FaONA6C96`+$Sw)$Z)Hw88e>+V065G`I#dVHYQ@0p?YiQf@V(`G*49l ze(+tc-Y#A*i&x~eQN3$4*$~5?k1BAwN6JQmJ{2FI5p=#9L|zJ%bWWj4{ond!qCwR`)W+H!q8KLNQRT?I1J(f?YQuM<8_vnw^7oY5vW3g{j6a{H1|K4#!{f^A%GJm7R z*Y#tXRmpG1_<(q%5U##qBpOj-Y{V8d0xM&46u*S zVG=U~0MEWCl9-Fxx&LMV$Hh)+IegCR%g%ZK{*nI@Lj^_jj-agf>l>=kf;a2(q<-Y; zC!&L1{aN@f{Dpj#?3W?nZi)L?uST5yNI3vq9rGQ+3=SX0h4U`^%5Xzzrfc#XO=Ve8 z$vBhcigt$v7pbApr3FlRtu&j9J$i&-RacrRB-g*}&e4MHZM*t$Ixe-YR=A}ujQx@t zLG+eKOe>QTLI<@1PNXe^t*OH~(%ixtI1%=-b#(Oos`hlVr6)#I%``sTZ`q}4xU9I0 zVn*hgu(Ng>DAZ%;9FX(@r(Uu<$K8$KzQ^qXCLb@<-2gz0;iV$kopjNmh|Z5e3Q ze~lf8@8c-T+R+dLo z({%QFUzi;Ytf^`19+Rjjtg1xGmy7Y}r!$cb2XZ!5?rNM|Z%NDm$jrbRb8`O{l&9X7 zzHUjTk>KvtHU>R?&2`r?KeuYh6Jvb!mD;9n<{$IDNvw)hN@sT~P!~yRfx4R!7p)xt z?U{6ri;Cc6r!9y;F+AEgNHpT7ij8lo|MI>Sy>RGvzKJ0!foV(V@Pt6?my|f~;EE{s zprc`5diGvX*KO+{v7%?WNKZbC@y_f1%YY~a+@THWxW;= z8=1XtTW0u(jI$}ZFJc?Z8t89X+*LKUO2M+8}oR1 z)q5`670lH@EtZ3u2JBtk5Ac{@&b*8B?3Mxfg=VduvGYdYaw$`AU5a^zWmhez;$F>! 
zr^5MW0ma_&>lNr~HKM`3$2or3`}EZ+yF@(-pjmbulSH(mGj8yIfmRz9C}>Jh^=ALy zjtJZySi8|i)+_eT4VdGKm-78FWFhA@`3-v_aPvq8p=TacAkyIcdn|T{Cr{!F` zxmxFNr-%|Y{FzXto?~o`#s-tSBUmR0lw9-$b2bRW{!2s8-O#&(R`zAzed+HT+M0bh zIp)ZG-K8}!%Qs@$HmjatCQO<-zYcsLlp$Xwt`;|}uAlq^(WUZmXeI8s35SDxCar;#)|o0&pJW8=x)jE#iY8J0Lg13dz=hb?8X8S%Dizxi07$Cy5zROB0#*; zgM;`U$KU-8`R18c;cJ$yxO_40>YE&tMpK{nG%n~O|asw(34Mq=+0is|>DT8G~0 zKXVg3?Lm63ToLtE2nd$!6VyQwi<%6Zm@(1qF$9eu~hVO~1 zw~3JoGHWenG6=fFE)Km>v;VmRgdutfp*D+8NwaMM8Nox@!Z*=g*U>v7G9=Sr6M)Wx((4Udk+Wz3@WZeNjlKJwKNG8AtS)+Ev7u5QOybE1$HepGxd_U9st z4%NdVl==rLB6w(8f6#>=pCq0<^wA9IvLx*Qt>G1?yx01lGVJJTorp_5eR696sFQQ%$+USEP=f6h?+`f#?t z<|T`Bd+7)GXnGK%!fwQ?^Zbw1KSu{8`N!=AS_`2L$qgkBymd6c3>AVFmM?@m8pd%6 zA!^Lk0}WzZb1iEEh>qT|1yP$xGh=G|9MdR1UAv*71B1Ta6|JIwe=aad7Kt@yl9go& za#s+Bx3)~$TwTD=rp&~fqAYz#&QS{vxb{4&q(yEKbZsC+V-f^|P{d`iM(yr`7evNk z9X=A0?Sy=INS+kPmSe@#m}=Z8nw)a#2ybI%^VDg19nqL+3kyJ)^K>9XKFU=!((G%ccd#}qp|Sw%`nYgZBzm8&9;rmgERjg970o7m&jhQ(xgRP$%5bGirL9EpUSMk*w+6Tpq7KA+Pa48QS-b9 ze^$JZyaw^-R}e-S_W;6HEKTZWo6$;b+q%(&y%cTi?oi>F#kDy0gweNaskatpq78+; z@9~MXqpKdgk5U&xv>F-S?0AJf-@K}=Q+y!C8){b=RVb|X7zbjLkpG{d+~udxXSsSl z5`XWEMA7^?RQ&hw-!;79M=v4;I9vvbp$s{YUzvQvBi!@AYGzn_|LU%h)wibqG^So7 z&aUF5FK)h09lssEdXq=_?*SQaT%`R!eEv7jbh28kaW=!8YRiFVi*9Zv$<4?Av~ge| zx3t%6r2MKnw4Ymyy3+8!Pi@_NF&tKGhIhQx-N@cksG@$FP}}q~ls3kuH%F<>awdo5 zQ4Ojc2EQ9N;Z}*2enJKLmpQxiS78VI=LPL@g=5B*>lh5UR;pCrtQ;XPBGZ)}4%LrD zADREXBe5=!I3Pao->59usRk zm0Oh`!&RsL6c7SOo;*agv}5sJf6ltm1#`k$gW=N}QkCFRW^x6v|DN-72Z^g-mAN&2 zi$WLs5~&mZ2P7GQ;?`wPIY9}6H&+=u-oil|mE%&cE(asU;|F7dOqJo%g8szYeg(x- zFcWdDBDrLOa`+(8F`O+Zw#yAh@?xF*A4O*!*JS(lVG9vdP*VC)K%}LcJ%WI=bZmgq z>FAChDj+QkHff2Gjv5<1q&6Bx!{{2_UC+Ds|9$R#7WZ}C*Lj}b<4FB#SKye=K~(Y3 z97^!b7O;`;XAT6gH(z76s)DqTtuSt17gaVv2_ehxM&iD6`iosVC1$vUX9l&PO6A^l zFtfA;$=4BI53Wk&%OH~HI|o{4nzy!%j|9^MtqN&Pux<}{xa95#p_dziDgL{&V_Vgc z@1|1k);FDyFufE5faZ!;Q#aAJK)F?L*Hpq|sgCq5k=WE#}07Jcvbv?@%ZI+UQ zLm5Kk$m(_0i)DKn>$;M!MRBBtvu6ACcC*E|g8!ZIjXN%qwDTA=MoP{U-Fu$a%xV)Q&=Vqi~xCOxa!hSdm@8f>9;}PW0{6AKRJ=M0kbZA z*40){m45RGZtMEF+Dp*;iFh#~li!XrA)1DzA#zLy2`fVxEV-T%f?t~RjpIR|GjAlj zH$d_W-PryVn40!eUDnn{%;2tVTkY>>WsM>W29OZQcz{-a3C*g^!S2(3g=!ZzN1uEJ zH4=R-I=uvZF22JPci~RqI3V{4G}N7C`9cUZ%VfkSf68i-L7}IBvOc>oYfULl)=$%> zDcFsbxH%R-PR~}gp^R{#VRd6b$l33h$HMlX=&4FQ7%H{vL@=Y>%-wLF%w^$bGeL`& zG$SRM`Lcel&Y|S-pdgjNiTXqKgJqIFCU07h>$KC&5v8Vg#BrS*!3XXp*WOipo(n-!DLT;$-A)9#-M%n6oCk!|3dg zROJ!3aK)ABa{KGytdt#I=vJs2ls)j~wTw%5EYhQMkVH=oXKJK6o{1-_iem6-e0`CB zdPi#iDfuwG6J5d0C=kQiNT&Cc)*+vAt`{KgIQj~!Zis%B%txS!@NcI;QZbrT(nSneis=02bHp)%eouII*=8hf35qHkgr=K~gMQ$CLJ z(h1~@eqjHhRCKWSLwbg4@K5~>rjbu_i9yT~OF7y-$<>FrZz(c&(+Q~1B7 zG~6YkXulOx$4K`Xn%{JYPA&`{utqNbjjfh(WaIUU?nk5HW-j7I3CyVhz{DD+B{ox>*K=1+Y%bA9_ZPc1O< zu|ZO3kHnFoF;t}FRMFnhK&9K!P5d;omy5&9T4vf9nIjyiti1@JHtKbk5dOXbH9iN` zB^$tn*@07RY>4UkLp$EOW5+72S{BG~|Bt53)yQ29_v3{Eiq2*tA~N zno{lCvbZ{xL}b6Mic9?>D3V|iKxqVAtTo^bD~~mfP5RS92n+e)29|Q+4{eIOj6P{K zGMMkorj8-K=eFc$@WV6U{$#GnzX4#_OW^D-VY&BA!u5o1Vw0zbQ(Y4Yts4OW1@Hq5 zJ7MHzEZh>+B$~+NBE0rj2>LO-!ZU*NQ(W8Or4KJorvwKRnhTq1aUxZ{6(dM`qYa^r zeKAk<^%g`kGTZpDt#|Ne*zpnYPq$*`)1GOEhY*~2KO2BgUJ(oGxB0OmxU&X_aa7zd z8acwXC)7)LEP-fR1u08`Hj1kH-7X)drO&IBeLDK;>f;PHHH$6kR)t3M7tR5L zqvL@5T;~vH*+JOKVBLiKCy=vDumXj%0%R+`k^O5&UTU^$wdb@%li;6yEtHMZ*uiPs z*5k1#hEouuu2PW2J3Ei1*EuTKMmcSse-k5!iGm4$CFyhSMNRp`gx#J)K6y`DkSW&f zyje%O2M*3J;}Q&m#G{*IdlqhgR4FZ%zqm{z@p}DjXK44MifVss5A5=mNvCX#vDu}L z1OXP(NlJypDEWQeyfIMkbV}Z^3%lAAlpx!IF**Q?RFg67?lD0-?Gy48sP#q&2R2c{ z+kHs9R4;ZmAUi>z^vzLG!c=F^Jkm14Ii-2wgcADP%t=U1t{BKGffm8Dw6DCrwAXL9 zU=1u_6`b>D5C6RHZJV*2dI=M_eGDw|T#WAtF zxqNw%B>+OgQt3D?rf8{s!N$`T!t>7aXBDQ%(6EUwn24t6^@JzH(X0F7&Q?yhjANd+ 
zN~jT1_vZ7FruEXLsHbSu&~ybnoJBlboBTez`(3dzBtDmGO8%BM7AP+t_Pz;k0Xsp) zeO65iJkoSFy@I}3-?P5pL-LS&+YHG2J$#rAFFX&9dHY*@B)STHm& zb>#|HkdSVM&WvNl`d`S}SK_)^$v&KhMczk;P(g{hy^a)4G;vB$&Cu zerDdjb1pyI#-JYD*Baq@()`hX<8gMsMs2@D)TJoO&ETQyrX5=T+32%+vHaR`Nv_(u z-a+oOw+OwGP61cjr6y)xNnZx@I7^XQ{m!+bULHoUSDZj~APgLh6QCO?Uc-+%orUs( z$G!v#eIs?oGZwcp`*{ePK~za9{(9|)MtpJ@`ILCjeP-K_Az~-uZ}hIRZSd0(8t;9I z;(Jqy%R3mMiBYdy$73Xf=7_WhY_7El#ap{2Aym2+$e`tKXI1!Etu{A&hEpjV4=5@r zaVgvD?PhJ-;5xuvN(fk2rTFwv>}1C143yTqn&9%BhFj`B&b3CJ${=%^nmmJAhB7@I zhrKK*Smy=jKM_q{EN9#K>pmE1FkP-kPJGLORUdzmw0w_Fr48wM2Axs6_%$ytP25-G zuDWoDu(~9U%yP!|9-DvMBhmi@p?0z zQybyARc0h{v`spBL;|Wh9!u|oou}2*XbG6jL^#@b1_>Il`bnkG>x*7lY@cw4+?G*8 zSJh3H2JCSl0@K-E@r3N0WOrLtV}8l1_N!fZv8QN}FCPkGrWKRq>g1vmck_+>(Fi#0ko)ZvBrc9EQ+ zS75eQ34Aa**T*mOB86{;r{j1TPwdZQMs<`D=MQDm+p!A%;hJgQ_63tilTgS%WzUMc zD@f3uJ}w2>jH6dYREq!I`OQ^(84_VhDKy%2*D{6gAES>S$D_UM;c|0!F||{x#J&0? zisax$X;;*%O~lzNpC6-Lr5}tcSApznRTeW(y(A4)OW*XjkQK{AT2nXrgU&8LLHRLD z9yYPU=cnwMm;1bWLAM15C2>qB)gMDq1gJmEsld8tfiE!O$qbN8Fk?9VR+EnJ*R#=J zZI3I`lm4cw19!fixK(PSZ8EZLzEG*A0{PG+!$)Kqk8Tv%a*$UDNNzJ-{u`8$@LIgw z(m!MD_%p9@KTCJ*-(3Db9=*pr5yUdy@i1f12S{@~>1g<7klJnHwc%ikE7OFy9uIVT z+&tkCdvPDiG|=bU)(4p>l*S%^_=EZ9CCxvaLADxi(A^7vQ{N%GDh#&HDLGT`npauY zbc(YI9$F?z*}K`^gPl4Vj%d<}>*kiG=!-5)B~pRDWt?TQ2&=i&YoM|L)-tb6Cb$WF z?}S%S8m0UVrVCG4Hp1%)zr$6!C+JC@r6#7p?+Oc%Cg9K5M&jQ9<4FUHs1(W9&CIE@ z3bj17HNEdelS12c6QBLeLwm~_0jQ{A26jRc?8nPB_lpBY;!l0BhHi~t6E9?f@4(n zl@7XHaQ)~nUVO%XoDs)@Z)xut@EbPHL~FN{{IOQ!0>xdC(@)BZpEJI}A6-+f^Y!2( z!>svg#3sig=;$^c|2$tOT12H$qMY_+KMe7M49>1n`&g+_!M4YnsX%s7MXk$k=J84$ z&S{3F(Dkl6_jD$4`byjQuamO#AQB_bb0%D{6 zHN)ae_N`jV006=ly=q0vLxJk#=~&xC7(xS1*80PKoKTRiGY?*YzgrbdEbqv)PCKjz zjO72Sh?w|XWFU96b%F|A>`e!c|<-o&nAInW(ChW$%?Gbo$OiMO3GK`A5;G{Jl z0$D}7ug6zJmMK7hpIyC8WLX^s$+@ccy?t(6xZK#mP8Nj;6>5$CAPhHIEg2HnI=s~^j5j7pjcOJNAON)sQ>aqtdfn zt+%sXUvQMI&{W4p4^3X(w2@D7dL)H6?%C;7+@d~z^&JQk^ZM@&HXY~AojV{bk8EYV zUS8N-e07`VAaIediV&|XnA0>5BFM|_U%jw;HY?qPIKq|mtq}4{Q;ae`YL>U5qvS7H z$Fk01go7o^#@rl?C5_Sd9+W8A}tsVx_*nACTgVf=z@4tA7rRnm~Y9 z61D3TA8Hp!DLm>_H8x$GAa#16-G&nr2qq@6UZM`Z z&d?h*02ynftRNfbD@W6-;V;f*AG67Dv-++*EX$$MTaL2{4;*%RVL>S+Jz#u(HnHu6 zXwDMk;~WvsBBbk*UhMK>AIp&xR}n0Sd5@l_u_KO5(?RT1I(@ypz0t^#lGxHp z2WQz33d}GVUb@Xl-=dtf;^CkilRWjeec@MU81a15X4C568gPszHS9Scle5|SR76$N zmB%aHu2Ba|$H9#U43LG;sc==OXprA^;yx?90n2bhEzmMFV?Yq}R`|m-x;(YK_IFt* z!PC#b`zZ+XX?D?PaO8nU{h`8({22XN=F9ifT5=(p2{*oc1l?%!CV9oeV59BEhlhn$ z?v7UCM!QEROKLKtwc_=OMl{lSDJoGoYWI_I;Ye$~U8&8e%k(A?!lLv2W-<`y;?S`z zuxT&MCe7+rBXx9{MV89LDwZRzy_lyN1U_{;vWMR^dMs4Nks^#4F!r@nUORs)3{E}` z|Dse{!AfUo;y%;xCEN9%bkD*0@F)}I1F6~;Legxa5ZawI?$gH~qT_iF&O zd7AozbBBb_7e~#U)LKNXx?_S!K(0g?LC!9xrW0}3UH6M)1+WeY3cZvJvYQB|X*my6 z6lf|B_H6OuITBHj1ch>uK<>#mWU`rZ?6n5B9DP<1%ZZpG>OZ7=@8oyE@Bvravx^sN z8zO^GmRdGFjY@g`CK3;M5CsGB%^G8lfG|8kqh@%PaE_1@qs^&V|YyHd_qFcZA<2t>rx2NO&w(w!@uX~rG@Me-hJSs_Z7)f%|8zFN`2U0QF3xhS*OG4C zu&qsAV)(bx8Gq<=`oO`^twTGVD<%tV8%h}>C;L)V#9l|O>K050?ot{s@2R1#Dwmbw ziJ)xGVT65+J8r(XNar^&U0&1XKl$GZUWfOzr-k~645n+~jA{!tUHUwMh&z*r zJhhQkY4C}5nF%XN($wFoeYI*%VhLQ)arYJ)8(s6i&Q2>PhWU$p*>vk2EddvQzVuuG zCUU7w=AHUV>?P0iL($WY1M!#1V#_8>-OoF>OKgvwEG+zHNQw+|5wK5f&J5iAq_(A{ ze(Z_m0O?zY|B7yIbi}0FnIHQJa`Ac@Wp=AkteuA8gqZmpnsKMza$dccxbb%+J8RLR z+4K8+c)=9JCnO8A!f*sWc0^kH$xqYvrKY?_`8Z;jBDixYg>zO2x z_sQn77$yX0%yo~PP(kLy+4$0S)x`yW)13(c&B_H3sxO4F{-O{*@hd@{Ss`w9X{Ltfh0?Wkd4OKmk-~FQloW6?`1=hWggFUI0uvQtPHL9mdARpJy|n} zumphr$)I{&mMi)0Blj}{wt)AN)6-ZaG}$00uKQ`wgbs z@BH>V%ok6kVFO6yEe8a*n!D=0*ioVZ-C>7&?4L@m4wCT!C{#50*PZ^m^HQ<(+P6WP zIo&4c6GDN$xiYZjh8a{BQ_&tRu#j`{0kDk3f@>r=6d&4dKi@KMX=`+dUGwy_dg=+% 
znRL)s6R&Jhz6FyF0c`yT%L7EX!ME6#dobxQpq;j6o<7)CFpcGBNVG=>ywU$XlD@a6aB ztVc6&wQ&2K$$ZU^+GIKx4pqHcV|qdNN0LFInGfUebSoOijvL!m8boXL&A8uNL1)+W zkU_1~R zR#ZC3RP!YT_bgU|4dY!b{v{bPcD;nk8P^T6s}}8Uj2(Is#^#%O$8geRDZqK&QMn zoQm6`denI(VTsJn>dq(zG&%{mXl@abp#nG|q*?`0 zhZT20^*)ciM3~BmrgmDtNF(bsP^{^88GSQes?WuYA^|d`na3#SM2F&Io!26o20|lj zz`7-vWv0-#Z%;@*dU`E~NTQk3gI+&62686p#`iju8fgnH&q-UC9xMiAzL>3$D1t#< zc|Bj5s^iis=^6E?lBEz7`PGi|63CCWHhqG5H8;2?PEI&$JFP+p5P zn&w%Y-@6&iXoD)N(nB`oI=}TG>7e{Av?T@tUCYJkwE;xQW!_fXF*CekUy#piN=QRJ8o5@ZAf4@*+ zH9}erDRSF2a6Z@#>piCg>~dVQ^A`d;#)^!D{BNH07{0x9+qj`No?Q~lJE+Cjk5Q(+2%p zs3x)k)X#G<=@l9;mz_)Jio4?>xKlk?ZF!%T7ZIv=YeN%2JX`wgjedu>n4&W@D&C4+A zfSZ4d-EUE1$*=ylL*>1WqEcg9ZcLJ*Lij@-_(?lM%s{*$AMa?w1myRw|%xBp(|=Jqi|ZC ztslV?StphuFk_ie*C~PQcEvEp*`N9A`drbUxIM3Tt^ZXMS?S+A^556rRKK*z=W;Am^h*utwy5jIdFc5IcR<^R}qlB zV&E>pf|%BNE)d8QD%o;rF{y(+U(z!D>e$y$HWtjGWaLT`M`rr4fkn;aD49#4j1S)b zUM`i08;!dhU4rO3OfX5$TJ99#q~#DM+5-nOi#|86HABlbMqbxBVFp~g;kh5jxD$gZ z5GNDF*h{Z9hP_bohrzDL?klH$CV_d;5`yNUW^3;0`*>mvpmS+_-0h(2CHgmD{OXgN-PTtYPtXr|77o+KVZ37;AXEH^=u$dFU zg_V*e&l_(uhZMu2iRcpmjmAf{0SEoK==DvObc3?goT`YYP5sOgP`O)eB3+ze#y{na zE8&ClzkN1;zJVq^z2u5roX_z{)&01DXLY)H*!(p$-4}J<`;#LzKZERz=$X zcL!2Guu#xHnyE9PGkSzevu{@QVi5B;J(T5VQh*#wdn(5wOPPdK2#;0^`U9LOnrmO? z7A>x3NQ#R@=3s89cy$_N7n<{bpZjShFa#%HmUYwxp^8N>3+BYn7JNp*uqaktDKXzb z>CGRX)z!jTEQJ{jgbhY>gkNq;=+m*8xshUGF5Rzo0`-z-pXrs|*4t5RGWB$4Xjfl< zWR7YxEagdQRLrUG-E28PNT#>xLgXIHCHygQ@$z5E473-;G5IEDC<>lCtg8QUs2@}< z63#tgsM^1gjb7zJUw!fCzyh8GHze;@oDATa4=@_tw{}xY*NJfpJFH7~Kh*v)oQDjq zYrAovsWuc?ox2?;NyZ>sO6dAfehB(eu5PWTjNDYPftZWl-_aeQxD%nIH^-~)+)w4J znafRFQL3_CJ1wNZL)`ne?BG<^js2_X@#Mtlwy{5#%3+hVj9pqc3yy>2EU0Vi**L#^ zQ7bi|%j7Q+)~yZx+dC>`jh6ZCt^VJg^`Q|TYRct?bvN@Z-b_jJ`3g^SW=LXKm^D1+ z^s7v!cc8-KZK%^sd5b|*)woyu&Eub>nlP5i%sPZkQ#8LM6SC&jFza2P*Q^~7bT&!! zu`m#hHmzzD>rNH>GqMY@&?8$J0Betn^ws-?Hl#Nh_Fa^}L`9^wrhWm>-H#YNww|Ba zo2aUJ3AU$AWj1J;vU%;(za>`W+v*7pA7gcTaa5zpzgD-r-kT!f7Kcj+ZGHq?SENYl z#^rx8Z?g2XwwwpoQZM8h=XZ&umHjCh z^Qx<*XD6|n@rMpzQuM5AR*z-dO8m{Jt;|!+drlP9V@(;M?@>Ik(?Kc1Pzh6YkUt5W z4@YSwAhg1`gmUe7=b;~#-F702wMm@cq+KP)uxMPG{vSOL#{u6)Cc{?IKPEOB|1R4R z1Z0;-%Z*c-{|yTZ4u205e#DnVrsemt(qutB;9V=R<=t|!1VHHnuFvSVvQ6ms*b}<< z^OpDAjwRRapQuz$QE^M<>3o)XzT&8rUJ1F|^7a})Sn`>t43>sB!7fv7d57t2ujOT! zLaM+q;{HTjQuYS?UJZW)_idBAQOt@W*LwZ!Ykxb40Sfog6`s&KT5sFPuL3uEAF)ZA zl82~Z$BvlQgTp<%R^M~4C<^Opd(~bPLbG>fxrKxfw>5%oYmVXyLyGb8+eW|$Y z@e^;pnPHrlxj4b%(QFByE?S%E6g?$8G-5pG@GftoY2Q(Qh!7-j)|yg{NLOQV$=ho? z4VFDT+~_+VjPy=L8*>=kUktg=YD_HLU|<2`y7m=~!H-TciW0#0^ffziSFwz}`@Vpe z^n3))a9ahQgaKJ|YN-yNTn{2)e%BkuLRV@eVc_;cx}Z;Z-vhUG-qQKqg>(K+XvN)&O0+9Ovr8oM0B~q1Lqd=L0w+)RI#uyvh?A#N0VKbWYi*0 z)bHctO#d+T%JW0tBdshh>l-LqnT|}|M^J-1DCR!UI*`veQTa6{oiJa5)Z;R&0F4Wl z$F0(@8koM_wDI9a(|i673Sdvucp_%a%7VGXaZ+ECi?(D<|DySC z)+ap{Z%5ezqeX}$y!ZBdg`}WLo&mk-CmH7#D$EUuqk~P2XAmP6?uPZz$g=uCXrTwk zsXqx0HU=|Lw?^h{%WgQIi|wM~uKaHYd`bVu9Gz9NGJ4iGMUd7@==yVwQGDtv0L8uo zfvFnJd0`)~VdtE5Cr@1SaU4DI6=iBihvR6Y=+o`VVX?(|9$sA^%E0__=h4Ddj{^K}R%pfUELgZ`yuI`cP)EeCTC-Lcq}{ckZYQAr$>zF59SCh#YuFT%*7H-{!mHrc<&c77QncfpX4dtYz4K33J@@81r( zj=^Ud-vo$0*7y~Y!~Wo`axwl!RN1h1&kh({IGHQVIMnRsbX%W{Yw_nnRcqcr{dRuV z{Pb{}a#5`zgbb7p#HdR|e4|Xw2j-C>_%F5Cd9}l=K;?P`rdZid; z;^LoaaH9+0;hCC^SNmOJimD&kFrgaMfeLsYD`3vv#j!p^sHEEUxh01-dh%y+-7L%A({gR-J2m*!S4|IqGc4L^2j42^J4{Cpp2P5H~^o4Zj+~Dq)ci zSOKq_p-U1R6B=6Y7rUU*c0KWpE+?Cs;8O73iOedX@P*-S)!#$}L$&e1mey!hixA7H zB=I1{CL(jyb148aZpq=9c=5D(0DL}!9{6#czJ46E7gsQ6KswoU3^tr2Ff9Aq9{91! 
zDpG%bro?=w8rJzC%^_rJUHoLb{6aRKV@VuFmo6Ve>X)%<^K0U=0gkUbP-9^|>r9u$noxo3F!C7ml5yB+J!cY}y69t>GW#ie1eKzL zNzt_GN2gAAP!Rj~Rh;YUM@)hwoCKm*GsUoJ$hd@lAeL{N59+*pk_0K>#U*~$uk9gK zK5Is%mnK$X2GE+vYcy_76kKW^;A9BWtz{yqLVUmbxG{`DD>EG^vH0bG!>;B>vLOt3 zVra*uPl~Z@MG#B3ia>c#h%P|LtCn41J6TeOxNccnrDqv>-8WIZD3tk>Uk*`ykU)Hp z;Sg9J?PSz$F;J!jPr(o+%;%+pztx>*zikVWcdH-dnpZ7>Bh?HuJtvgeECdi2nQJ8n z8sL~U>xuJ>kQRere>>pYoURxJMP}?|cbbRy$l%J7YbcKkpor#LLYG6{+j3G~ka|zw zT(&X91z%B8i82~Jl0Dl|Z>pF-7Xoh0TIB+a;vFn?Rn1GWIGI0A&R{`-Fja4fi0+*P z?)2$AEa4o3_4pQB158iqxK`U=q3ttfbW7^CY^P?eXG0_P_6FzEx=aen(b)D?wzut7~TS zncdaXK)9C6h|&A?Wz!DkCQ?dx>f0VW;KBeQaoVbwD}em*>-sTynGb)P>fN$l9PCad zs4YxHmb8a^O8r<__W0YZu_t}g507Rl^kz7l47opwlIo3`A^!iJJ46rmno7H|)QCG# zSuMse9j_TSH_qp>N!=t_sM)B{X|HF`oljOacK7Q64{&KkU{lW@sz50=QqWq`^{bKJ zKaF(@b>K`=W6Oq&R$XqJQT%?R8TNepoqpl`#7faotyGyPQg8NE^Xl-+;!{2y%c~d> zouBhDe9(?Z#_rF*x`quaGPq?=Wg!?4&})Q%$+d^iv}-oX_Qp(~^J;U#Lq=k4P4m&pSty zmrnxQlON91GP_O|@|=Kc%iy@BRM(Q%+5gKAJc9lG2Eqi{Y?GvjxLl=jmX)Snb&Dh( z$9#=5)_;v??aE?}+KD<|2{diMqsO(_R18lm!)x={wuW5x_J;J7 zsrshe{K=VwoW=u^`9J<95Q6s=3>N0Yy*Oy6FqWrp2eWK|k~nf1eTiEpRKHZUy6=)= z$!~0d@6=)U)U;#Ivo_pOS8G$G!qrmnWv^$=px;(8TrfKVTyJ zxDJxilzpM=Q#I8t)QL9Ans~qzI{SHT(4%up21yxss&~Q<8?e3HNZw6Y^5aIydT${; zM7yFYLe59%HE5b3sfWx$A;E*W2ymgDw>=yUg=l*&UMIIjUHOK+ldm08W3fm(7!jHE z79_>BGe)=3DqH+?2~KOWm>S8XTrTf%W@po>=NsdF1QdL4Ub(L~i6C3m)-U+F{e^?( zJ6G$Cu4%ZiL4VgxcD97dk3UH>f2F%3%S3^a1YW~J2g`i@0G0>#JUR+q0v>XRT>|A*?q1Lo33eT)?KxllczqTR*wngS1{i zlGe^H7{J7=jy-W1-fWtqQyQ4gfyv?Qt4yD>UU${?(wjJ;9Y^|AziD$^$%9n+d_dfWr_9aWiftoji&DHkd ztxq+xSWv%1Vc9FGS3k*k@030t$ff=Hw^^sY#Ll!WLg@tcIecp4P--Xl?}~SC zd+Eq-5*ep?(&uNx>94z@S(P7iRzE@Y<21nM>R)sb~~8wqkQT$JE}r7#eIsj&x!t zFZu6IN3D0N`e%Qq+9lBNDp%!URh;~hV$^c*m^55Un?o*Vu}%uX9$X{ucS2zAFcV4*vhhr9 z>Ya;jtw$T*3^a!h(2_R;8g0v>gE|g*W;0SR%IgNGw9sh3&(5mY*!}M;+VVb@wqdlv zEDIgUo@yTZ9PJk!wV^_ms<-t8izKQB1e+ZyRxoI?4idP)?pf^DGGxozoQVl2z9e}Y z6PiSQ0EbU&d2O zeVDPsoRtPOZ}Zj9TTzHCV~^t*f_gcJ4r=#3M{LwPUiaIeP$LK1)V1j_aXm2V@BE<{ zmSx@jIwwL?E%yLp{v~biGJe4Q@`ZWC;oIZ+gpor~e@94OJ*k zKhrL5C&wUII)|2b1(blZ#CT4Dzd@QR1A3QyCG~Ve!O@7PFEtb7d-NBya=Q)(FUQZD zK400DKHeERW*Lm&jA|E7Yqc~#XPGDIx~jX1@NrdjFhLji1nKZHmWOZCIbdcmZvaWV zcflF(9_i)|E?ydntQZiLMpORMiki;>gUvT6;*XSoz)eIVZDqGK9_3fC>f6g_Fmlk& zBnX_-e1xw_w=W%@qToT&>|zBaxJe~;$@)B1!MfF;-9U@olsR|IHc>*50W@xf=M45L znu-MZEgWjDV?z8>G8Cvrz7nZkD3!lvT2nq&P4lU?Yb_y7sHm9d3F~b0@UE4S0Vou! 
z%f5}*!N)S*4(7IeCXHJ4lVxEFBVk4ko<#if5y^EOv&{xEDbwHcg3Wm+SVDPl%uaKH z|Hf2fQK1)%8JMs*JxEPr$wPHQedGsfW|I2G#H?;xg;t;q={+l6KIN7(ktVF#d@J_OxzsQFupajrOPqU>d$7!8=Gim-+j&#)=RW0bp=PuS*SqUBwdpdt%nL%b=B=WZyW{q!+`H9`0?KT1bz~F0s5Xz} zp6Sx-{LE`o=>mR& zy9LI+9c@I^Kgb(e8iP8xp*5<}PBb{ZZf8G{{R1)Mn1TVFH47{2>f|GFi4Q`+RZ{0r zYdc=tOR;Cd@hg4EV+n_@@_gdu?w5v_G4%7J8N^SB!_&dZ^z^Hn@Vw-FCAh0lgc?xs zYi4y-|Fh1M)uCSnPhq(UwpJ^9RpbMMRPQ~YR_}%~j-8cO(rZS{jvf0`8K9op8Wo(9 zuyFKK^`_Gmy^pU2~q?vh7xr*F!^ew+4!pYm`H|cGo z%Xcb`wK9%Pj(uMB%cKM|UGA*V)5&n1KF?zQnf@3XHw6j|QLaVIB*>DO$~&h5$_Ci3 z9uxnECb4NzEZcZZb5eG?H>K`2ud%*E%Q-zNm~PjH?sQu}^4Ot|KOUHTxTO5_!(i(B z2c>L$%=tX28Kf!d6TVps?Q~`h(1bA=eNZ zf<+RReM-P;G+uG>1K1YfZDueT z0QIA)Pj3Ll_+mNhvTfA1#ycYZf6F=VR-eJ~l5`sh4fp1h;(pktuCiIR$;x_a*91wz z+!ob^San&>%`|y&rOSP%JM7tbZ~}GC1!-szc|h~K{DgzDcL7r`-EXr*eMV93rDT0zPcEbkH|yjF8`puhjaT zFGwZo#+r4KS$ps$21=JaSmSd!Gi*T;DfDbqvfiizgq3kSj1RQMZRK$nkvumJ#!E!;sludWGosWyKC0x@*sKutTp(yM0HYrOrd{d1x ze;~!zFuza%E$Mv&c;F7DNZl*4;g!4;W;UjGEcl3lX*WRad~Ti_&EDhvg#>x^F^kO4 z*H~udxU)S8>PGzf*^r&A(ftY4R=L~o9D>(@6= z)iB}`47aT5veq;$U3tq@+H>^lD2+`0-d*^|_|bd%cdP$qOK6Ux;u4vhrz`J{Z=Bp* zwKh6=dW;+OrPFh+` zj)d=f8w5!Zq`RQb(5~)-J;#H16BYM%5Qnm4K{K`Ggg4pY+@#B?eD;xRVVy$Lpc<8> zLk^6Q{9pY40pdU%zpsX~!{6lZ{u0#I+5XQo@$fWs%|*ecx|-yNAjLim`itSei##eg z-;>*s(_QLntX>`;L+4+Iel2_tN%()_HMd@e&FA>MU7tmVKQ%^Y#XkkSxjpfn&llCH zN;7!fZ-jk8b$uDEJZ`u0Jd@);f>c|Scs>UI02enIRE{5&hs;hFPok#3*>E7%bobDj z+LMlQ_C>0d{8-;7t$Weta7|M=O7^ZiO}~e)Tv}YRE8AAzT=gjb0FN#a&P`iVR77qr z(djnY--m3X)6H~-3}RHSEIdWX>9ZxOSYC*LZ+>)ykN$s ztuI!+_A)hVx8mowb04dFH2YT)dex^Vvu|T#OwwvRZQm>}hw7{+g`Nj-rb;a?p_Y6V z7mgRZGc2Pu{&v& z@$))uPvUIlJ+g6H)#}4vf57m?*`8@C(aY*Oj9}kBW^A?8L#gTNP4D#BuWRt@c){($ zG2^y)?s86%Vx{wIj~<*+XGpcn>Sa{7@YKglZPeSexM}dh_rP?bv`l~Ob$Wf5iy7d) zzJ{c2{67Bxg96)B#y@jk{MkMj#~dDf)UQXS8aA!zpkra2Q~wKvA1V6jPIYZ zgVxRT!~X!KN;{>WBaYSdnD4e;`yFlGrvCta{{RLR{*|2;{ZxF*Mp(^5%9Lt)#J& znYzDt!%XkPDRz%PEqO;C=wpi9y=_0h*LIDz_roB9Yw~)R$s7wz-l2cCH}f$yD0yy4 zZ**O%sJe9`M^6;YZnSEv#hxUTrI~HRD}j?77i_L@ zX5y&QGTzSSG@O$1TVw8J-B!Ue)^QItzip0w=Ic4-&&=VB8+$rr=$_*9CUt77XYCQy ztL+g&hI*Zmg(I3xk=<{(6lN7j~~O7sn3sZ*bfL% zhC;lRe!+~q_>Vf@Y!W+vuP_~}Uy*5*R5N)aenhdQ$gwL8xr5f(JXuEqr&*K6(e6c3 zINW@Ys=FRlvyW_zPxcqTPsl1DknFWixq617&w-GjI+yZi#W5)}Zms5c4QE;Ba~;(! 
[GIT binary patch: base85-encoded binary payload omitted]
literal 0
HcmV?d00001

diff --git a/assignments2016/assignment3/start_ipython_osx.sh b/assignments2016/assignment3/start_ipython_osx.sh
new file mode 100755
index 00000000..4815b001
--- /dev/null
+++ b/assignments2016/assignment3/start_ipython_osx.sh
@@ -0,0 +1,4 @@
+# Assume the virtualenv is called .env
+
+cp frameworkpython .env/bin
+.env/bin/frameworkpython -m IPython notebook

From 1c013e2ecaa4f5f0448eca886761f55e0c063f90 Mon Sep 17 00:00:00 2001
From: YB
Date: Sun, 10 Apr 2016 22:26:35 -0400
Subject: [PATCH 046/199] Lecture1 - part 61~100 (out of 715) en / ko

---
 captions/En/Lecture1_en.srt | 137 ++++++++++++++++++------------------
 captions/Ko/Lecture1_ko.srt |  90 ++++++++++++-----------
 2 files changed, 117 insertions(+), 110 deletions(-)

diff --git a/captions/En/Lecture1_en.srt b/captions/En/Lecture1_en.srt
index 09ec6da5..f645769c 100644
--- a/captions/En/Lecture1_en.srt
+++ b/captions/En/Lecture1_en.srt
@@ -294,112 +294,112 @@ problems that we are facing today that the

61
00:06:57,321 --> 00:07:02,860
massive amount of data and the
-challenges of the dark matter so
+challenges of the dark matter.

62
00:07:02,860 --> 00:07:07,379
-comfortable vision as a field that
+So, computer vision is a field that
touches upon many other fields of

63
00:07:07,379 --> 00:07:12,740
-studies so I am sure that even City
-heater sitting here
+studies. So, I am sure that even sitting here,

64
00:07:12,740 --> 00:07:18,050
-mania vehicle from computer size but
-many of you come from biology psychology
+many of you come from computer science, but
+many of you come from biology, psychology,

65
00:07:18,050 --> 00:07:24,389
are specializing in natural language
-processing or graphics for robotics or
+processing or graphics or robotics or

66
00:07:24,389 --> 00:07:30,680
-or you know medical imaging and so on so
-I love field computer vision is really a
+or you know medical imaging and so on.
+So, as a field, computer vision is really a

67
00:07:30,680 --> 00:07:37,329
-truly interdisciplinary field what the
-problems we work on the models we use
+truly interdisciplinary field.
+The problems we work on and the models we use

68
00:07:37,329 --> 00:07:43,849
-such as engineering physics biology
-psychology compare size of mathematics
+touch on engineering, physics, biology,
+psychology, computer science and mathematics.

69
00:07:43,850 --> 00:07:51,030
-so just a little bit of a more personal
-touch I am the director of the division
+So just a little bit of a more personal touch,
+I am the director of the computer vision lab

70
-00:07:51,029 --> 00:07:58,589
-lab stuff in our lab will I work with
-graduate students and postdocs and even
+00:07:51,031 --> 00:07:58,589
+at Stanford. In our lab,
+I work with graduate students and post-docs and even

71
00:07:58,589 --> 00:08:04,669
-under ladders students on the number of
+undergraduate students on a number of
topics and most dear to our own research

72
00:08:04,670 --> 00:08:10,540
-who some of them you know the great just
-come from my lab
+who some of them, you know,
+Andre, Justin come from my lab.

73
00:08:10,540 --> 00:08:17,780
-number of two years come from my lab we
-work on machine learning which is part
+A number of TAs come from my lab.
+We work on machine learning which is part

74
-00:08:17,779 --> 00:08:26,109
-of a superset of deep learning we work a
-lot of science and neuroscience as well
+00:08:17,781 --> 00:08:26,109
+of a superset of deep learning.
+We work a lot on cognitive science and neuroscience as well

75
00:08:26,110 --> 00:08:31,270
-as the intersection between an LPN
-speech so that's that's the kind of
+as the intersection between NLP and
+speech. So that's the kind of

76
00:08:31,269 --> 00:08:40,399
-landscape of computer vision research
-that my lab works so also to put things
+landscape of computer vision research that my lab works on.
+So, also to put things

77
00:08:40,399 --> 00:08:45,600
-in a little more perspective what other
-computer vision classes now we offer
+in a little more perspective, what other
+computer vision classes that we offer

78
00:08:45,600 --> 00:08:51,050
-here at stuff or through the computer
-science department are clearly you're in
+here at Stanford through the computer science department.
+Clearly, you're in

79
00:08:51,049 --> 00:08:59,629
-this class es 21 so you some of you who
+this class CS231n.
+So, some of you who
have never taken computer vision

80
00:08:59,629 --> 00:09:06,220
-probably heard of commuters and for the
-first time probably should have already
+probably heard of computer vision for the first time.
+You probably should have already

81
00:09:06,220 --> 00:09:14,730
-done vs 131 that's a cool class of
-previous quarter we offer and then and
+done CS131. That's an intro class
+we offered in the previous quarter.

82
00:09:14,730 --> 00:09:19,779
-then next quarter which normally is
-offer this quarter but this year as a
+and then next quarter which normally is
+offered this quarter but this year as a

83
00:09:19,779 --> 00:09:25,069
@@ -408,18 +408,19 @@
graduate-level computer vision class

84
00:09:25,070 --> 00:09:31,840
-'cause cs2 30180 offered by professors
-so he'll suffer si who works in robotic
+called CS231a offered by Professor
+Silvio Savarese who works in robotic

85
00:09:31,840 --> 00:09:47,230
-3d vision and a lot of you ask the
-question that this classiest 231 versus
+3D vision and a lot of you ask the
+question: do these replace each other?
+CS231n versus

86
00:09:47,230 --> 00:09:56,639
-the S two thirty 18 and the other is
-know if you're interested in a broader
+CS231a, and the answer is no.
+If you're interested in a broader

87
00:09:56,639 --> 00:10:03,220

vision as well as some of the

88
00:10:03,220 --> 00:10:11,009
-fundamental fundamental topics that come
-that relay 223 division robotic vision
+fundamental topics that come,
+that relate to 3D vision, robotic vision

89
00:10:11,009 --> 00:10:17,269
and visual recognition you should
-consider taking in 23188 that is the
+consider taking 231a. That is the

90
00:10:17,269 --> 00:10:26,039
-more general class 231 end which will go
+more general class. 231n which will go
into starting today more deeply focuses

91
00:10:26,039 --> 00:10:33,329
-on a specific Ando of both problem and
-model model is your network and the
+on a specific ando of both problem and
+model. The model is a neural network and the

92
00:10:33,330 --> 00:10:38,580
-undertow is visual recognition mostly
+ando is visual recognition mostly,
but of course they have a little bit of

93
-00:10:38,580 --> 00:10:47,990
-overlap but that's the major difference
-next next quarter we also have possibly
+00:10:38,580 --> 00:10:47,990
+overlap but that's the major difference.
+Next quarter, we also have possibly

94
00:10:47,990 --> 00:10:55,590
-a couple of a couple of advanced seminar
+a couple of advanced seminar
level class but that's still in the

95
00:10:55,590 --> 00:11:01,649
-formations so you just have to check the
-syllabus so that's the kind of curcumin
+formations so you just have to check the syllabus.
+So, that's the kind of computer

96
00:11:01,649 --> 00:11:11,409
-division curricula we offer this year at
-Stanford in question so far yes
+vision curriculum we offer this year at
+Stanford. Any questions so far? Yes

97
00:11:11,409 --> 00:11:20,879
-131 is not a strict requirement for this
-class but you also see that if you've
+131 is not a strict requirement for this class,
+but you should see that if you've

98
00:11:20,879 --> 00:11:25,570

first time I suggest you find a way to

99
00:11:25,570 --> 00:11:33,830
-catch up because this class has shrooms
-a basic level of understanding of of
+catch up because this class assumes
+a basic level of understanding of

100
00:11:33,830 --> 00:11:42,560
-computer vision you can browse the notes
+computer vision. You can browse the notes
and so on

101
-00:11:42,559 --> 00:11:49,619
+00:11:42,561 --> 00:11:49,619
today is that I will give a very brief
broad stroke history of computer vision

diff --git a/captions/Ko/Lecture1_ko.srt b/captions/Ko/Lecture1_ko.srt
index 9f11a39e..e1a71f38 100644
--- a/captions/Ko/Lecture1_ko.srt
+++ b/captions/Ko/Lecture1_ko.srt
@@ -243,163 +243,169 @@

61
00:06:57,321 --> 00:07:02,860
- 대용량 데이터의 양 문제 때문에 어둠의 도전
+ 엄청난 양의 데이터, 즉 인터넷의 암흑 물질에 대한 도전이죠.

62
00:07:02,860 --> 00:07:07,379
- 많은 다른 분야에 닿을 필드로 편안한 비전
+ 컴퓨터 비젼 분야는 다른 많은 분야와 맞닿아 있습니다.

63
00:07:07,379 --> 00:07:12,740
- 연구 그래서 난시 히터는 여기에서 확인 앉아 오전
+ 마치 여기 앉아계신 여러분들 중에서도

64
00:07:12,740 --> 00:07:18,050
- 컴퓨터의 크기와 매니아 차량은하지만 많은 생물학 심리학에서 온
+ 어떤 분들은 컴퓨터 과학에서, 어떤 분들은 생물학, 심리학,

65
00:07:18,050 --> 00:07:24,389
- 로봇을위한 자연 언어 처리 또는 그래픽을 전문으로 또는
+ 자연어 처리, 그래픽스, 로보틱스,

66
00:07:24,389 --> 00:07:30,680
- 또는 당신은 의료 영상을 알고 그래서 내가 사랑하는 있도록 필드 컴퓨터 비전 정말입니다
+ 또는 의료 영상 등 여러 분야에서 온 것처럼요.
+ 컴퓨터 비전은 실제로

67
00:07:30,680 --> 00:07:37,329
- 문제는 우리가 우리가 사용하는 모델 일을 무엇을 진정으로 학제 분야
+ 여러 학문이 관련된 분야입니다.
+ 우리가 풀고 있는 문제들, 사용하는 모델들은

68
00:07:37,329 --> 00:07:43,849
- 엔지니어링 물리학 생물학 심리학으로 수학의 크기를 비교
+ 공학, 물리학, 생물학, 심리학, 컴퓨터 과학과 수학까지 관련되어 있죠.

69
00:07:43,850 --> 00:07:51,030
- 그래서 좀 더 개인적인 접촉의 조금 나는 부문의 감독이다
+ 좀 더 개인적인 이야기를 하자면, 저는 스탠포드에서 컴퓨터 비젼 연구실을 맡고 있는데요.

70
-00:07:51,029 --> 00:07:58,589
- 심지어 대학원생과 박사후 연구원과 함께 작동합니다 우리의 실험실에서 실험실 물건
+00:07:51,031 --> 00:07:58,589
+ 대학원생, 박사후 연구원과

71
00:07:58,589 --> 00:08:04,669
- 사다리 아래 항목의 수에 학생들과 우리 자신의 연구에 가장 사랑
+ 학부생들까지 함께 우리의 연구를 위해 아주 많은 주제를 다룹니다.

72
00:08:04,670 --> 00:08:10,540
- 그들 중 일부는 당신이 좋은 알고있는 내 실험실에서 온
+ 그들 중 일부는 음.. 안드레와 저스틴도 우리 연구실에 있고요,

73
00:08:10,540 --> 00:08:17,780
- 일부인 2 년 수는 우리가 기계 학습 작업을 내 실험실에서 온
+ 많은 조교들도 우리 연구실 출신입니다.
+ 우리는 머신러닝,

74
-00:08:17,779 --> 00:08:26,109
- 깊은 학습의 상위 집합의 우리뿐만 아니라 과학과 신경 과학의 많은 작업
+00:08:17,781 --> 00:08:26,109
+ 즉 딥러닝을 포함하는 큰 분야를 연구하며,
+ 자연어 처리와 음성의 교차점으로서

75
00:08:26,110 --> 00:08:31,270
- 즉 그의 있도록 LPN 연설 사이의 교차점으로의 종류
+ 인지과학과 신경과학에 대해서도 많은 연구를 합니다.

76
00:08:31,269 --> 00:08:40,399
- 내 연구실 물건을 넣어 그래서 또한 작동 컴퓨터 비전 연구의 풍경
+ 제 연구실에 대한 간략한 소개였습니다.

77
00:08:40,399 --> 00:08:45,600
- 우리가 제공하는 지금 무엇을 다른 컴퓨터 비전 클래스 조금 더 관점에서
+ 자, 그럼 조금 다른 관점에서 생각해볼 수 있도록 어떠한 Vision 수업들이

78
00:08:45,600 --> 00:08:51,050
- 분명히 여기에 물건이나 컴퓨터 과학 부서를 통해 당신은에있어
+ 스탠포드 컴퓨터 과학부에서 열리는지 알아보도록 하죠.
79
00:08:51,049 --> 00:08:59,629
- 이 클래스 에스 (21) 그래서 당신에게 당신의 찍은 적이없는 컴퓨터 비전
+ 여러분들은 지금 CS231n 수업을 듣고 계시고요.
+ 이 중에서 컴퓨터 비전 수업을 한번도 들어 본 적이 없고

80
00:08:59,629 --> 00:09:06,220
- 아마 통근 들어 처음으로 아마 이미 가지고 있어야
+ 컴퓨터 비전이란 말을 처음 듣는 분들이 있다면

81
00:09:06,220 --> 00:09:14,730
- 전 분기의 멋진 클래스는 우리가 제공하는 것 (131) 다음과 대 다
+ 전 분기에 열었던 CS131 수업을 들으셨어야 해요.

82
00:09:14,730 --> 00:09:19,779
- 정상적으로되는 다음 분기는로 이번 분기하지만 올해를 제공하다
+ 그리고 원래는 이번 분기에서 올해만 다음 분기로 미뤄진

83
00:09:19,779 --> 00:09:25,069
- 작은 중요한 대학원 수준 컴퓨터 비젼 클래스가 시프트
+ 매우 중요한 대학원 수준의 컴퓨터 비젼 클래스

84
00:09:25,070 --> 00:09:31,840
- 그는시 로봇 작동 누가 고통을 것입니다, 그래서 교수가 제공하는 '원인 CS2 30180
+ CS231a를 로보틱스 3D 비젼의 Silvio Savarese 교수가 가르칩니다.

85
00:09:31,840 --> 00:09:47,230
- 차원의 비전과 당신의 많은 질문을하는이 classiest 231 대
+ 많은 분들이 CS231n 과 CS231a 수업이 서로 같은지 물어보시는데

86
00:09:47,230 --> 00:09:56,639
- 의 S 두 삼십 (18) 및 넓은에 관심이 있다면, 다른 하나는 알고있다
+ 같지 않습니다. 만약 좀 더 넓은 범위의

87
00:09:56,639 --> 00:10:03,220
- 도구 및 컴퓨터 비전 주제의 범위뿐만 아니라 일부
+ 컴퓨터 비젼 분야의 주제와 도구 사용 및

88
00:10:03,220 --> 00:10:11,009
- 그 릴레이 (223) 부문 로봇 비전을 올 기본적인 기본 주제
+ 3D 비젼, 로보틱스 비젼과 시각인지에 관한 기본적인 주제들에 관심이 있다면

89
00:10:11,009 --> 00:10:17,269
- 시각 인식 당신은을입니다 23,188에 복용을 고려해야합니다
+ 더 포괄적인 231a 수업 수강을 고려해 보세요.

90
00:10:17,269 --> 00:10:26,039
- 더 깊이 초점을 맞추고 오늘부터로 갈 것보다 일반적인 클래스 (231) 끝
+ 오늘부터 시작하는 231n은 좀 더 세부적인

91
00:10:26,039 --> 00:10:33,329
- 모두 문제 및 모델 모델의 특정 안도에 네트워크와는
+ 문제와 모델을 다룹니다. 대부분 신경망 모델을 이용한

92
00:10:33,330 --> 00:10:38,580
- 물러 시각적 인식 대부분이지만, 물론 그들은 조금이
+ 시각적 인식이죠. 당연히 두 수업에서

93
00:10:38,580 --> 00:10:47,990
- 중복하지만 그 다음 분기 옆에 큰 차이의 우리 또한 가능성이 있습니다
+ 중복되는 부분도 있겠죠.
+ 그리고 아마 다음 분기에

94
00:10:47,990 --> 00:10:55,590
- 고급 세미나 레벨 클래스의 몇 몇 있지만은 아직이다
+ 심화된 수준의 세미나 수업이 몇몇 열릴 것 같아요. 하지만

95
00:10:55,590 --> 00:11:01,649
- 즉 커큐민의 종류 그래서 형성은 그냥 강의 계획서를 확인 할 수 있도록
+ 아직 미정이니 후에 강의목록을 확인하셔야 할겁니다.

96
00:11:01,649 --> 00:11:11,409
- 분할 교육 과정은 우리가 질문에 스탠포드에서 올해 제공 지금까지 네
+ 스탠포드 컴퓨터 비전 수업들을 대략적으로 소개했습니다. 질문 있나요? 네.

97
00:11:11,409 --> 00:11:20,879
- (131)는이 클래스에 대한 엄격한 요구 사항이 아닙니다하지만 당신은 볼 당신은했습니다 경우
+ 131이 이 수업의 선수과목은 아닙니다만,

98
00:11:20,879 --> 00:11:25,570
- 처음으로 컴퓨터 비전 들어 본 적이 난 당신이 방식을 찾을 제안
+ 컴퓨터 비전 수업을 처음 듣는다면

99
00:11:25,570 --> 00:11:33,830
- 이 클래스의 이해의 기본 수준을 shrooms 때문에 잡기
+ 이 수업은 컴퓨터 비전에 대한 기본적인 이해를 요구하기 때문에

100
00:11:33,830 --> 00:11:42,560
- 컴퓨터 비전은 그렇게에서 메모를 검색 할 수 있습니다
+ 강의노트 등을 통한 예습을 하길 권해드립니다.

101
00:11:42,559 --> 00:11:49,619

From feda64c6e986a4fe69e92043a945321c78402ba3 Mon Sep 17 00:00:00 2001
From: Taeksoo Kim
Date: Tue, 12 Apr 2016 19:43:23 +0900
Subject: [PATCH 047/199] Update convolutional-networks-korean.md

---
 convolutional-networks-korean.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/convolutional-networks-korean.md b/convolutional-networks-korean.md
index f60a753e..40c793c1 100644
--- a/convolutional-networks-korean.md
+++ b/convolutional-networks-korean.md
@@ -19,9 +19,9 @@ Table of Contents:
 - [Computational Considerations](#comp)
 - [Additional References](#add)

-## 컨볼루셔널 신경망 (CNN/ConvNets)
+## 컨볼루션 신경망 (CNN/ConvNets)

-컨볼루셔널 신경망 (Convolutional Neural Network, 이하 CNN)은 앞 장에서 다룬 일반 신경망과 매우 유사하다. CNN은 학습 가능한 가중치 (weight)와 바이어스(bias)로 구성되어 있다. 각 뉴런은 입력을 받아 내적 연산( dot product )을 한 뒤 선택에 따라 비선형 (non-linear) 연산을 한다. 전체 네트워크는 일반 신경망과 마찬가지로 미분 가능한 하나의 스코어 함수 (score function)을 갖게 된다 (맨 앞쪽에서 로우 이미지 (raw image)를 읽고 맨 뒤쪽에서 각 클래스에 대한 점수를 구하게 됨). 또한 CNN은 마지막 레이어에 (SVM/Softmax와 같은) 손실 함수 (loss function)을 가지며, 우리가 일반 신경망을 학습시킬 때 사용하던 각종 기법들을 동일하게 적용할 수 있다.
+컨볼루션 신경망 (Convolutional Neural Network, 이하 CNN)은 앞 장에서 다룬 일반 신경망과 매우 유사하다. CNN은 학습 가능한 가중치 (weight)와 바이어스(bias)로 구성되어 있다.
각 뉴런은 입력을 받아 내적 연산( dot product )을 한 뒤 선택에 따라 비선형 (non-linear) 연산을 한다. 전체 네트워크는 일반 신경망과 마찬가지로 미분 가능한 하나의 스코어 함수 (score function)을 갖게 된다 (맨 앞쪽에서 로우 이미지 (raw image)를 읽고 맨 뒤쪽에서 각 클래스에 대한 점수를 구하게 됨). 또한 CNN은 마지막 레이어에 (SVM/Softmax와 같은) 손실 함수 (loss function)을 가지며, 우리가 일반 신경망을 학습시킬 때 사용하던 각종 기법들을 동일하게 적용할 수 있다. CNN과 일반 신경망의 차이점은 무엇일까? CNN 아키텍쳐는 입력 데이터가 이미지라는 가정 덕분에 이미지 데이터가 갖는 특성들을 인코딩 할 수 있다. 이러한 아키텍쳐는 포워드 함수 (forward function)을 더욱 효과적으로 구현할 수 있고 네트워크를 학습시키는데 필요한 모수 (parameter)의 수를 크게 줄일 수 있게 해준다. From 6df773215ed9098b428bccdda38d74d5f84635ef Mon Sep 17 00:00:00 2001 From: Taeksoo Kim Date: Tue, 12 Apr 2016 20:32:28 +0900 Subject: [PATCH 048/199] Update convolutional-networks-korean.md --- convolutional-networks-korean.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/convolutional-networks-korean.md b/convolutional-networks-korean.md index 40c793c1..206360be 100644 --- a/convolutional-networks-korean.md +++ b/convolutional-networks-korean.md @@ -104,21 +104,21 @@ CONV 레이어는 CNN을 이루는 핵심 요소이다. CONV 레이어의 출력 2. 두 번째로 어떤 간격 (가로/세로의 공간적 간격) 으로 깊이 컬럼을 할당할 지를 의미하는 **stride**를 결정해야 한다. 만약 stride가 1이라면, 깊이 컬럼을 1칸마다 할당하게 된다 (한 칸 간격으로 깊이 컬럼 할당). 이럴 경우 각 깊이 컬럼들은 receptive field 상 넓은 영역이 겹치게 되고, 출력 볼륨의 크기도 매우 커지게 된다. 반대로, 큰 stride를 사용한다면 receptive field끼리 좁은 영역만 겹치게 되고 출력 볼륨도 작아지게 된다 (깊이는 작아지지 않고 가로/세로만 작아지게 됨). 3. 조만간 살펴보겠지만, 입력 볼륨의 가장자리를 0으로 패딩하는 것이 좋을 때가 있다. 이 **zero-padding**은 hyperparamter이다. zero-padding을 사용할 때의 장점은, 출력 볼륨의 공간적 크기(가로/세로)를 조절할 수 있다는 것이다. 특히 입력 볼륨의 공간적 크기를 유지하고 싶은 경우 (입력의 가로/세로 = 출력의 가로/세로) 사용하게 된다. -We can compute the spatial size of the output volume as a function of the input volume size ($$W$$), the receptive field size of the Conv Layer neurons ($$F$$), the stride with which they are applied ($$S$$), and the amount of zero padding used ($$P$$) on the border. You can convince yourself that the correct formula for calculating how many neurons "fit" is given by $$(W - F + 2P)/S + 1$$. If this number is not an integer, then the strides are set incorrectly and the neurons cannot be tiled so that they "fit" across the input volume neatly, in a symmetric way. An example might help to get intuitions for this formula: +출력 볼륨의 공간적 크기 (가로/세로)는 입력 볼륨 크기 ($$W$$), CONV 레이어의 리셉티브 필드 크기($$F$$)와 stride ($$S$$), 그리고 제로 패딩 (zero-padding) 사이즈 ($$P$$) 의 함수로 계산할 수 있다. $$(W - F + 2P)/S + 1$$. I을 통해 알맞은 크기를 계산하면 된다. 만약 이 값이 정수가 아니라면 stride가 잘못 정해진 것이다. 이 경우 뉴런들이 대칭을 이루며 깔끔하게 배치되는 것이 불가능하다. 다음 예제를 보면 이 수식을 좀 더 직관적으로 이해할 수 있을 것이다:
- Illustration of spatial arrangement. In this example there is only one spatial dimension (x-axis), one neuron with a receptive field size of F = 3, the input size is W = 5, and there is zero padding of P = 1. Left: The neuron strided across the input in stride of S = 1, giving output of size (5 - 3 + 2)/1+1 = 5. Right: The neuron uses stride of S = 2, giving output of size (5 - 3 + 2)/2+1 = 3. Notice that stride S = 3 could not be used since it wouldn't fit neatly across the volume. In terms of the equation, this can be determined since (5 - 3 + 2) = 4 is not divisible by 3. -
The neuron weights are in this example [1,0,-1] (shown on very right), and its bias is zero. These weights are shared across all yellow neurons (see parameter sharing below). + 공간적 배치에 관한 그림. 이 예제에서는 가로/세로 공간적 차원 중 하나만 고려한다 (x축). 리셉티브 필드 F=3, 입력 사이즈 W=5, 제로 패딩 P=1. : 뉴런들이 stride S=1을 갖고 배치된 경우, 출력 사이즈는 (5-3+2)/1 +1 = 5이다. : stride S=2인 경우 (5-3+2)/2 + 1 = 3의 출력 사이즈를 가진다. Stride S=3은 사용할 수 없다. (5-3+2) = 4가 3으로 나눠지지 않기 때문에 출력 볼륨의 뉴런들이 깔끔히 배치되지 않는다. + 이 예에서 뉴런들의 가중치는 [1,0,-1] (가장 오른쪽) 이며 bias는 0이다. 이 가중치는 노란 뉴런들 모두에게 공유된다 (아래에서 parameter sharing에 대해 살펴보라).
-*Use of zero-padding*. In the example above on left, note that the input dimension was 5 and the output dimension was equal: also 5. This worked out so because our receptive fields were 3 and we used zero padding of 1. If there was no zero-padding used, then the output volume would have had spatial dimension of only 3, because that it is how many neurons would have "fit" across the original input. In general, setting zero padding to be $$P = (F - 1)/2$$ when the stride is $$S = 1$$ ensures that the input volume and output volume will have the same size spatially. It is very common to use zero-padding in this way and we will discuss the full reasons when we talk more about ConvNet architectures. +*제로 패딩 사용*. 위 예제의 왼쪽 그림에서, 입력과 출력의 차원이 모두 5라는 것을 기억하자. 리셉티브 필드가 3이고 제로 패딩이 1이기 때문에 이런 결과가 나오는 것이다. 만약 제로 패딩이 사용되지 않았다면 출력 볼륨의 크기는 3이 될 것이다. 일반적으로, 제로 패딩을 $$P = (F - 1)/2$$ , stride $$S = 1$$로 세팅하면 입/출력의 크기가 같아지게 된다. 이런 방식으로 사용하는 것이 일반적이며, 앞으로 컨볼루션 신경망에 대해 다루면서 그 이유에 대해 더 알아볼 것이다. -*Constraints on strides*. Note that the spatial arrangement hyperparameters have mutual constraints. For example, when the input has size $$W = 10$$, no zero-padding is used $$P = 0$$, and the filter size is $$F = 3$$, then it would be impossible to use stride $$S = 2$$, since $$(W - F + 2P)/S + 1 = (10 - 3 + 0) / 2 + 1 = 4.5$$, i.e. not an integer, indicating that the neurons don't "fit" neatly and symmetrically across the input. Therefore, this setting of the hyperparameters is considered to be invalid, and a ConvNet library would likely throw an exception. As we will see in the ConvNet architectures section, sizing the ConvNets appropriately so that all the dimensions "work out" can be a real headache, which the use of zero-padding and some design guidelines will significantly alleviate. +*Stride에 대한 constraints*. 공간적 배치와 관련된 hyperparameter들은 상호 constraint들이 존재한다는 것을 기억하자. 예를 들어, 입력 사이즈 $$W=10$$이고 제로 패딩이 사용되지 않았고 $$P=0$$, 필터 사이즈가 $$F=3$$이라면, stride $$S=2$$를 사용하는 것이 불가능하다. $$(W - F + 2P)/S + 1 = (10 - 3 + 0) / 2 + 1 = 4.5$$이 정수가 아니기 때문이다. 그러므로 hyperparameter를 이런 식으로 설정하면 컨볼루션 신경망 관련 라이브러리들은 exception을 낸다. 컨볼루션 신경망의 구조 관련 섹션에서 확인하겠지만, 전체 신경망이 잘 돌아가도록 이런 숫자들을 설정하는 과정은 매우 골치 아프다. 제로 패딩이나 다른 신경망 디자인 비법들을 사용하면 훨씬 수월하게 진행할 수 있다. -*Real-world example*. The [Krizhevsky et al.](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks) architecture that won the ImageNet challenge in 2012 accepted images of size [227x227x3]. On the first Convolutional Layer, it used neurons with receptive field size $$F = 11$$, stride $$S = 4$$ and no zero padding $$P = 0$$. Since (227 - 11)/4 + 1 = 55, and since the Conv layer had a depth of $$K = 96$$, the Conv layer output volume had size [55x55x96]. Each of the 55\*55\*96 neurons in this volume was connected to a region of size [11x11x3] in the input volume. Moreover, all 96 neurons in each depth column are connected to the same [11x11x3] region of the input, but of course with different weights. +*실제 예제*. 이미지넷 대회에서 우승한 [Krizhevsky et al.](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks) 의 모델의 경우 [227x227x3] 크기의 이미지를 입력으로 받는다. 첫 번째 컨볼루션 레이어에서는 리셉티브 필드 $$F=11$$, stride $$S=4$$를 사용했고 제로 패딩은 사용하지 않았다 $$P=0$$. (227 - 11)/4 +1=55 이고 컨볼루션 레이어의 깊이는 $$K=96$$이므로 이 컨볼루션 레이어의 크기는 [11x11x3]이 된다. 각각의 55*55*96개 뉴런들은 입력 볼륨의 [11x11x3]개 뉴런들과 연결되어 있다. 그리고 각 깊이의 모든 96개 뉴런들은 입력 볼륨의 같은 [11x11x3] 영역에 서로 다른 가중치를 가지고 연결된다. 
**Parameter Sharing.** Parameter sharing scheme is used in Convolutional Layers to control the number of parameters. Using the real-world example above, we see that there are 55\*55\*96 = 290,400 neurons in the first Conv Layer, and each has 11\*11\*3 = 363 weights and 1 bias. Together, this adds up to 290400 * 364 = 105,705,600 parameters on the first layer of the ConvNet alone. Clearly, this number is very high. From c76f8a01cd0885fd3dbd8b14a65ba6295290e46a Mon Sep 17 00:00:00 2001 From: Taeksoo Kim Date: Tue, 12 Apr 2016 20:42:27 +0900 Subject: [PATCH 049/199] Update convolutional-networks-korean.md --- convolutional-networks-korean.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/convolutional-networks-korean.md b/convolutional-networks-korean.md index 206360be..8835d22c 100644 --- a/convolutional-networks-korean.md +++ b/convolutional-networks-korean.md @@ -120,7 +120,7 @@ CONV 레이어는 CNN을 이루는 핵심 요소이다. CONV 레이어의 출력 *실제 예제*. 이미지넷 대회에서 우승한 [Krizhevsky et al.](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks) 의 모델의 경우 [227x227x3] 크기의 이미지를 입력으로 받는다. 첫 번째 컨볼루션 레이어에서는 리셉티브 필드 $$F=11$$, stride $$S=4$$를 사용했고 제로 패딩은 사용하지 않았다 $$P=0$$. (227 - 11)/4 +1=55 이고 컨볼루션 레이어의 깊이는 $$K=96$$이므로 이 컨볼루션 레이어의 크기는 [11x11x3]이 된다. 각각의 55*55*96개 뉴런들은 입력 볼륨의 [11x11x3]개 뉴런들과 연결되어 있다. 그리고 각 깊이의 모든 96개 뉴런들은 입력 볼륨의 같은 [11x11x3] 영역에 서로 다른 가중치를 가지고 연결된다. -**Parameter Sharing.** Parameter sharing scheme is used in Convolutional Layers to control the number of parameters. Using the real-world example above, we see that there are 55\*55\*96 = 290,400 neurons in the first Conv Layer, and each has 11\*11\*3 = 363 weights and 1 bias. Together, this adds up to 290400 * 364 = 105,705,600 parameters on the first layer of the ConvNet alone. Clearly, this number is very high. +**파라미터 공유**. 파라미터 공유 기법은 컨볼루션 레이어의 파라미터 개수를 조절하기 위해 사용된다. 위의 실제 예제에서 보았듯, 첫 번째 컨볼루션 레이어에는 55\*55\*96 = 290,400 개의 뉴런이 있고 각각의 뉴런은 11\*11\*3 = 363개의 가중치와 1개의 바이어스를 가진다. 첫 번째 컨볼루션 레이어만 따져도 총 파라미터 개수는 290400*364=105,705,600개가 된다. 분명히 이 숫자는 너무 크다. It turns out that we can dramatically reduce the number of parameters by making one reasonable assumption: That if one patch feature is useful to compute at some spatial position (x,y), then it should also be useful to compute at a different position (x2,y2). In other words, denoting a single 2-dimensional slice of depth as a **depth slice** (e.g. a volume of size [55x55x96] has 96 depth slices, each of size [55x55]), we are going to constrain the neurons in each depth slice to use the same weights and bias. With this parameter sharing scheme, the first Conv Layer in our example would now have only 96 unique set of weights (one for each depth slice), for a total of 96\*11\*11\*3 = 34,848 unique weights, or 34,944 parameters (+96 biases). Alternatively, all 55*55 neurons in each depth slice will now be using the same parameters. In practice during backpropagation, every neuron in the volume will compute the gradient for its weights, but these gradients will be added up across each depth slice and only update a single set of weights per slice. 
From c597c553d3df31c0d3287fce850654361168624f Mon Sep 17 00:00:00 2001 From: Taeksoo Kim Date: Tue, 12 Apr 2016 21:19:16 +0900 Subject: [PATCH 050/199] Update convolutional-networks-korean.md --- convolutional-networks-korean.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/convolutional-networks-korean.md b/convolutional-networks-korean.md index 8835d22c..9d615e80 100644 --- a/convolutional-networks-korean.md +++ b/convolutional-networks-korean.md @@ -122,7 +122,7 @@ CONV 레이어는 CNN을 이루는 핵심 요소이다. CONV 레이어의 출력 **파라미터 공유**. 파라미터 공유 기법은 컨볼루션 레이어의 파라미터 개수를 조절하기 위해 사용된다. 위의 실제 예제에서 보았듯, 첫 번째 컨볼루션 레이어에는 55\*55\*96 = 290,400 개의 뉴런이 있고 각각의 뉴런은 11\*11\*3 = 363개의 가중치와 1개의 바이어스를 가진다. 첫 번째 컨볼루션 레이어만 따져도 총 파라미터 개수는 290400*364=105,705,600개가 된다. 분명히 이 숫자는 너무 크다. -It turns out that we can dramatically reduce the number of parameters by making one reasonable assumption: That if one patch feature is useful to compute at some spatial position (x,y), then it should also be useful to compute at a different position (x2,y2). In other words, denoting a single 2-dimensional slice of depth as a **depth slice** (e.g. a volume of size [55x55x96] has 96 depth slices, each of size [55x55]), we are going to constrain the neurons in each depth slice to use the same weights and bias. With this parameter sharing scheme, the first Conv Layer in our example would now have only 96 unique set of weights (one for each depth slice), for a total of 96\*11\*11\*3 = 34,848 unique weights, or 34,944 parameters (+96 biases). Alternatively, all 55*55 neurons in each depth slice will now be using the same parameters. In practice during backpropagation, every neuron in the volume will compute the gradient for its weights, but these gradients will be added up across each depth slice and only update a single set of weights per slice. +사실 적절한 가정을 통해 파라미터 개수를 크게 줄이는 것이 가능하다: (x,y)에서 어떤 patch feature가 유용하게 사용되었다면, 이 feature는 다른 위치 (x2,y2)에서도 유용하게 사용될 수 있다. 3차원 볼륨의 한 슬라이스 (가로/세로 차원, 깊이 차원을 자름) 를 **depth slice**라고 하자 ([55x55x96] 사이즈의 볼륨은 각각 [55x55]의 크기를 가진 96개의 depth slice임). 앞으로는 각 depth slice 내의 뉴런들이 같은 가중치와 바이어스를 가지도록 제한할 것이다. 이런 파라미터 공유 기법을 사용하면, 예제의 첫 번째 컨볼루션 레이어는 (depth slice 당) 96개의 고유한 가중치를 가져서 총 96\*11\*11\*3 = 34,848개의 고유한 가중치, 또는 바이어스를 합쳐서 34,944개의 파라미터를 갖게 된다. 또는 각 depth slice에 존재하는 55*55개의 뉴런들은 모두 같은 파라미터를 사용하게 된다. 실제로는 backpropagation 과정에서 각 depth slice 내의 모든 뉴런들이 가중치에 대한 gradient를 계산하겠지만, 가중치 업데이트 할 때에는 이 gradient들을 합해 사용한다. Notice that if all neurons in a single depth slice are using the same weight vector, then the forward pass of the CONV layer can in each depth slice be computed as a **convolution** of the neuron's weights with the input volume (Hence the name: Convolutional Layer). Therefore, it is common to refer to the sets of weights as a **filter** (or a **kernel**), which is convolved with the input. The result of this convolution is an *activation map* (e.g. of size [55x55]), and the set of activation maps for each different filter are stacked together along the depth dimension to produce the output volume (e.g. [55x55x96]). 
From 0746608a6e2b26c88b7b5ccaf32d0d4facc71c54 Mon Sep 17 00:00:00 2001 From: Taeksoo Kim Date: Tue, 12 Apr 2016 21:20:33 +0900 Subject: [PATCH 051/199] Update convolutional-networks-korean.md --- convolutional-networks-korean.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/convolutional-networks-korean.md b/convolutional-networks-korean.md index 9d615e80..6089aa65 100644 --- a/convolutional-networks-korean.md +++ b/convolutional-networks-korean.md @@ -118,7 +118,7 @@ CONV 레이어는 CNN을 이루는 핵심 요소이다. CONV 레이어의 출력 *Stride에 대한 constraints*. 공간적 배치와 관련된 hyperparameter들은 상호 constraint들이 존재한다는 것을 기억하자. 예를 들어, 입력 사이즈 $$W=10$$이고 제로 패딩이 사용되지 않았고 $$P=0$$, 필터 사이즈가 $$F=3$$이라면, stride $$S=2$$를 사용하는 것이 불가능하다. $$(W - F + 2P)/S + 1 = (10 - 3 + 0) / 2 + 1 = 4.5$$이 정수가 아니기 때문이다. 그러므로 hyperparameter를 이런 식으로 설정하면 컨볼루션 신경망 관련 라이브러리들은 exception을 낸다. 컨볼루션 신경망의 구조 관련 섹션에서 확인하겠지만, 전체 신경망이 잘 돌아가도록 이런 숫자들을 설정하는 과정은 매우 골치 아프다. 제로 패딩이나 다른 신경망 디자인 비법들을 사용하면 훨씬 수월하게 진행할 수 있다. -*실제 예제*. 이미지넷 대회에서 우승한 [Krizhevsky et al.](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks) 의 모델의 경우 [227x227x3] 크기의 이미지를 입력으로 받는다. 첫 번째 컨볼루션 레이어에서는 리셉티브 필드 $$F=11$$, stride $$S=4$$를 사용했고 제로 패딩은 사용하지 않았다 $$P=0$$. (227 - 11)/4 +1=55 이고 컨볼루션 레이어의 깊이는 $$K=96$$이므로 이 컨볼루션 레이어의 크기는 [11x11x3]이 된다. 각각의 55*55*96개 뉴런들은 입력 볼륨의 [11x11x3]개 뉴런들과 연결되어 있다. 그리고 각 깊이의 모든 96개 뉴런들은 입력 볼륨의 같은 [11x11x3] 영역에 서로 다른 가중치를 가지고 연결된다. +*실제 예제*. 이미지넷 대회에서 우승한 [Krizhevsky et al.](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks) 의 모델의 경우 [227x227x3] 크기의 이미지를 입력으로 받는다. 첫 번째 컨볼루션 레이어에서는 리셉티브 필드 $$F=11$$, stride $$S=4$$를 사용했고 제로 패딩은 사용하지 않았다 $$P=0$$. (227 - 11)/4 +1=55 이고 컨볼루션 레이어의 깊이는 $$K=96$$이므로 이 컨볼루션 레이어의 크기는 [11x11x3]이 된다. 각각의 55\*55\*96개 뉴런들은 입력 볼륨의 [11x11x3]개 뉴런들과 연결되어 있다. 그리고 각 깊이의 모든 96개 뉴런들은 입력 볼륨의 같은 [11x11x3] 영역에 서로 다른 가중치를 가지고 연결된다. **파라미터 공유**. 파라미터 공유 기법은 컨볼루션 레이어의 파라미터 개수를 조절하기 위해 사용된다. 위의 실제 예제에서 보았듯, 첫 번째 컨볼루션 레이어에는 55\*55\*96 = 290,400 개의 뉴런이 있고 각각의 뉴런은 11\*11\*3 = 363개의 가중치와 1개의 바이어스를 가진다. 첫 번째 컨볼루션 레이어만 따져도 총 파라미터 개수는 290400*364=105,705,600개가 된다. 분명히 이 숫자는 너무 크다. From 1bc952f0d3cec61b7e72d2d2da02db3edc263f7e Mon Sep 17 00:00:00 2001 From: Taeksoo Kim Date: Tue, 12 Apr 2016 21:21:58 +0900 Subject: [PATCH 052/199] Update convolutional-networks-korean.md --- convolutional-networks-korean.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/convolutional-networks-korean.md b/convolutional-networks-korean.md index 6089aa65..e849c032 100644 --- a/convolutional-networks-korean.md +++ b/convolutional-networks-korean.md @@ -122,7 +122,7 @@ CONV 레이어는 CNN을 이루는 핵심 요소이다. CONV 레이어의 출력 **파라미터 공유**. 파라미터 공유 기법은 컨볼루션 레이어의 파라미터 개수를 조절하기 위해 사용된다. 위의 실제 예제에서 보았듯, 첫 번째 컨볼루션 레이어에는 55\*55\*96 = 290,400 개의 뉴런이 있고 각각의 뉴런은 11\*11\*3 = 363개의 가중치와 1개의 바이어스를 가진다. 첫 번째 컨볼루션 레이어만 따져도 총 파라미터 개수는 290400*364=105,705,600개가 된다. 분명히 이 숫자는 너무 크다. -사실 적절한 가정을 통해 파라미터 개수를 크게 줄이는 것이 가능하다: (x,y)에서 어떤 patch feature가 유용하게 사용되었다면, 이 feature는 다른 위치 (x2,y2)에서도 유용하게 사용될 수 있다. 3차원 볼륨의 한 슬라이스 (가로/세로 차원, 깊이 차원을 자름) 를 **depth slice**라고 하자 ([55x55x96] 사이즈의 볼륨은 각각 [55x55]의 크기를 가진 96개의 depth slice임). 앞으로는 각 depth slice 내의 뉴런들이 같은 가중치와 바이어스를 가지도록 제한할 것이다. 이런 파라미터 공유 기법을 사용하면, 예제의 첫 번째 컨볼루션 레이어는 (depth slice 당) 96개의 고유한 가중치를 가져서 총 96\*11\*11\*3 = 34,848개의 고유한 가중치, 또는 바이어스를 합쳐서 34,944개의 파라미터를 갖게 된다. 또는 각 depth slice에 존재하는 55*55개의 뉴런들은 모두 같은 파라미터를 사용하게 된다. 실제로는 backpropagation 과정에서 각 depth slice 내의 모든 뉴런들이 가중치에 대한 gradient를 계산하겠지만, 가중치 업데이트 할 때에는 이 gradient들을 합해 사용한다. 
+사실 적절한 가정을 통해 파라미터 개수를 크게 줄이는 것이 가능하다: (x,y)에서 어떤 patch feature가 유용하게 사용되었다면, 이 feature는 다른 위치 (x2,y2)에서도 유용하게 사용될 수 있다. 3차원 볼륨의 한 슬라이스 (깊이 차원으로 자른 2차원 슬라이스) 를 **depth slice**라고 하자 ([55x55x96] 사이즈의 볼륨은 각각 [55x55]의 크기를 가진 96개의 depth slice임). 앞으로는 각 depth slice 내의 뉴런들이 같은 가중치와 바이어스를 가지도록 제한할 것이다. 이런 파라미터 공유 기법을 사용하면, 예제의 첫 번째 컨볼루션 레이어는 (depth slice 당) 96개의 고유한 가중치를 가져서 총 96\*11\*11\*3 = 34,848개의 고유한 가중치, 또는 바이어스를 합쳐서 34,944개의 파라미터를 갖게 된다. 또는 각 depth slice에 존재하는 55*55개의 뉴런들은 모두 같은 파라미터를 사용하게 된다. 실제로는 backpropagation 과정에서 각 depth slice 내의 모든 뉴런들이 가중치에 대한 gradient를 계산하겠지만, 가중치 업데이트 할 때에는 이 gradient들을 합해 사용한다. Notice that if all neurons in a single depth slice are using the same weight vector, then the forward pass of the CONV layer can in each depth slice be computed as a **convolution** of the neuron's weights with the input volume (Hence the name: Convolutional Layer). Therefore, it is common to refer to the sets of weights as a **filter** (or a **kernel**), which is convolved with the input. The result of this convolution is an *activation map* (e.g. of size [55x55]), and the set of activation maps for each different filter are stacked together along the depth dimension to produce the output volume (e.g. [55x55x96]). From 05b206fd300088b8eee28ac3c07b515ddfad6e7c Mon Sep 17 00:00:00 2001 From: Taeksoo Kim Date: Tue, 12 Apr 2016 21:25:33 +0900 Subject: [PATCH 053/199] Update convolutional-networks-korean.md --- convolutional-networks-korean.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/convolutional-networks-korean.md b/convolutional-networks-korean.md index e849c032..7e6a7f99 100644 --- a/convolutional-networks-korean.md +++ b/convolutional-networks-korean.md @@ -118,7 +118,7 @@ CONV 레이어는 CNN을 이루는 핵심 요소이다. CONV 레이어의 출력 *Stride에 대한 constraints*. 공간적 배치와 관련된 hyperparameter들은 상호 constraint들이 존재한다는 것을 기억하자. 예를 들어, 입력 사이즈 $$W=10$$이고 제로 패딩이 사용되지 않았고 $$P=0$$, 필터 사이즈가 $$F=3$$이라면, stride $$S=2$$를 사용하는 것이 불가능하다. $$(W - F + 2P)/S + 1 = (10 - 3 + 0) / 2 + 1 = 4.5$$이 정수가 아니기 때문이다. 그러므로 hyperparameter를 이런 식으로 설정하면 컨볼루션 신경망 관련 라이브러리들은 exception을 낸다. 컨볼루션 신경망의 구조 관련 섹션에서 확인하겠지만, 전체 신경망이 잘 돌아가도록 이런 숫자들을 설정하는 과정은 매우 골치 아프다. 제로 패딩이나 다른 신경망 디자인 비법들을 사용하면 훨씬 수월하게 진행할 수 있다. -*실제 예제*. 이미지넷 대회에서 우승한 [Krizhevsky et al.](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks) 의 모델의 경우 [227x227x3] 크기의 이미지를 입력으로 받는다. 첫 번째 컨볼루션 레이어에서는 리셉티브 필드 $$F=11$$, stride $$S=4$$를 사용했고 제로 패딩은 사용하지 않았다 $$P=0$$. (227 - 11)/4 +1=55 이고 컨볼루션 레이어의 깊이는 $$K=96$$이므로 이 컨볼루션 레이어의 크기는 [11x11x3]이 된다. 각각의 55\*55\*96개 뉴런들은 입력 볼륨의 [11x11x3]개 뉴런들과 연결되어 있다. 그리고 각 깊이의 모든 96개 뉴런들은 입력 볼륨의 같은 [11x11x3] 영역에 서로 다른 가중치를 가지고 연결된다. +*실제 예제*. 이미지넷 대회에서 우승한 [Krizhevsky et al.](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks) 의 모델의 경우 [227x227x3] 크기의 이미지를 입력으로 받는다. 첫 번째 컨볼루션 레이어에서는 리셉티브 필드 $$F=11$$, stride $$S=4$$를 사용했고 제로 패딩은 사용하지 않았다 $$P=0$$. (227 - 11)/4 +1=55 이고 컨볼루션 레이어의 깊이는 $$K=96$$이므로 이 컨볼루션 레이어의 크기는 [55x55x96]이 된다. 각각의 55\*55\*96개 뉴런들은 입력 볼륨의 [11x11x3]개 뉴런들과 연결되어 있다. 그리고 각 깊이의 모든 96개 뉴런들은 입력 볼륨의 같은 [11x11x3] 영역에 서로 다른 가중치를 가지고 연결된다. **파라미터 공유**. 파라미터 공유 기법은 컨볼루션 레이어의 파라미터 개수를 조절하기 위해 사용된다. 위의 실제 예제에서 보았듯, 첫 번째 컨볼루션 레이어에는 55\*55\*96 = 290,400 개의 뉴런이 있고 각각의 뉴런은 11\*11\*3 = 363개의 가중치와 1개의 바이어스를 가진다. 첫 번째 컨볼루션 레이어만 따져도 총 파라미터 개수는 290400*364=105,705,600개가 된다. 분명히 이 숫자는 너무 크다. 
From 00780f81a41226bd7485d010c50da3a14f04ca82 Mon Sep 17 00:00:00 2001 From: JK Im Date: Tue, 12 Apr 2016 17:38:10 -0500 Subject: [PATCH 054/199] Update optimization-1.md --- optimization-1.md | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/optimization-1.md b/optimization-1.md index ce018a2c..b9a35b64 100644 --- a/optimization-1.md +++ b/optimization-1.md @@ -23,10 +23,10 @@ Table of Contents: 이전 섹션에서 이미지 분류(image classification)을 할 때에 있어 두 가지의 핵심요쇼를 소개했습니다. -1. 원 이미지의 픽셀들을 넣으면 분류 스코어(class score)를 계산해주는 모수화된(parameterized) **스코어 함수(score function)** (예를 들어, 선형 함수). -2. 학습(training) 데이타에 어떤 특정 모수(parameter/weight)들을 가지고 스코어 함수(score function)를 적용시켰을 때, 실제 class와 얼마나 잘 일치하는지에 따라 그 특정 모수(parameter/weight)들의 질을 측정하는 **손실 함수(loss function)**. 여러 종류의 손실함수(예를 들어, Softmax/SVM)가 있다. +1. 원 이미지의 픽셀들을 넣으면 분류 스코어(class score)를 계산해주는 모수화된(parameterized) **스코어함수(score function)** (예를 들어, 선형함수). +2. 학습(training) 데이타에 어떤 특정 모수(parameter/weight)들을 가지고 스코어함수(score function)를 적용시켰을 때, 실제 class와 얼마나 잘 일치하는지에 따라 그 특정 모수(parameter/weight)들의 질을 측정하는 **손실함수(loss function)**. 여러 종류의 손실함수(예를 들어, Softmax/SVM)가 있다. -구체적으로 말하자면, 다음과 같은 형식을 가진 선형함수 $$ f(x_i, W) = W x_i $$를 스코어 함수(score function)로 쓸 때, 지난 번에 다룬 바와 같이 SVM은 다음과 같은 수식으로 표현할 수 있다.: +구체적으로 말하자면, 다음과 같은 형식을 가진 선형함수 $$ f(x_i, W) = W x_i $$를 스코어함수(score function)로 쓸 때, 앞에서 다룬 바와 같이 SVM은 다음과 같은 수식으로 표현할 수 있다.: $$ L = \frac{1}{N} \sum_i \sum_{j\neq y_i} \left[ \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + 1) \right] + \alpha R(W) @@ -77,7 +77,7 @@ $$
-옆길로 새면, 아마도 밥공기 모양을 보고 SVM 손실함수(loss function)이 일종의 [볼록함수](http://en.wikipedia.org/wiki/Convex_function)라고 생각했을 것이다. 이런 형태의 함수를 효율적으로 최소화하는 문제에 대한 엄청난 양의 연구 성과들이 있다. 스탠포드 강좌 중에서도 이 주제를 다룬 것도 있다. ( [볼록함수 최적화](http://stanford.edu/~boyd/cvxbook/) ). 이 점수함수(score function) $$f$$를 신경망(neural networks)로 확장시키면, 목적함수(역자 주: 손실함수(loss function))은 더이상 볼록함수가 아니게 되고, 위와 같은 시각화를 해봐도 밥공기 모양 대신 울퉁불퉁하고 복잡한 모양이 보일 것이다. +옆길로 새면, 아마도 밥공기 모양을 보고 SVM 손실함수(loss function)이 일종의 [볼록함수](http://en.wikipedia.org/wiki/Convex_function)라고 생각했을 것이다. 이런 형태의 함수를 효율적으로 최소화하는 문제에 대한 엄청난 양의 연구 성과들이 있다. 스탠포드 강좌 중에서도 이 주제를 다룬 것도 있다. ( [볼록함수 최적화](http://stanford.edu/~boyd/cvxbook/) ). 이 스코어함수(score function) $$f$$를 신경망(neural networks)로 확장시키면, 목적함수(역자 주: 손실함수(loss function))은 더이상 볼록함수가 아니게 되고, 위와 같은 시각화를 해봐도 밥공기 모양 대신 울퉁불퉁하고 복잡한 모양이 보일 것이다. *미분이 불가능한 손실함수(loss functions)*. 기술적인 설명을 덧붙이자면, $$\max(0,-)$$ 함수 때문에 손실함수(loss functionn)에 *꺾임*이 생기는데, 이 때문에 손실함수(loss functions)는 미분이 불가능해진다. 왜냐하면, 그 꺾이는 부분에서 미분 혹은 그라디언트가 존재하지 않기 때문이다. 하지만, [서브그라디언트(subgradient)](http://en.wikipedia.org/wiki/Subderivative)가 존재하고, 대체로 이를 그라디언트(gradient) 대신 이용한다. 앞으로 이 강의에서는 *그라디언트(gradient)*와 *서브그라디언트(subgradient)*를 구분하지 않고 쓸 것이다. @@ -136,13 +136,13 @@ np.mean(Yte_predict == Yte) > 우리의 전략은 무작위로 뽑은 모수(parameter/weight)으로부터 시작해서 반복적으로 조금씩 개선시켜 손실(loss)을 낮추는 것이다. -**눈가리고 하산하는 것에 비유.** 앞으로 도움이 될만한 비유는, 경사진 지형에서 눈가리개를 하고 점점 아래로 내려오는 자기 자신을 생각해보는 것이다. CIFAR-10의 예시에서, 그 언덕들은 (**W**가 3073 x 10 차원이므로) 30,730차원이다. 언덕의 각 지점에는 특정 손실값(loss), 즉, 지형의 고도가 주어진다. +**눈 가리고 하산하는 것에 비유.** 앞으로 도움이 될만한 비유는, 경사진 지형에서 눈가리개를 하고 점점 아래로 내려오는 자기 자신을 생각해보는 것이다. CIFAR-10의 예시에서, 그 언덕들은 (**W**가 3073 x 10 차원이므로) 30,730차원이다. 언덕의 각 지점에서의 고도가 손실함수(loss function)의 손실값(loss)의 역할을 한다. -#### Strategy #2: Random Local Search +#### 전략 #2: 무작위 국소 탐색 (Random Local Search) -The first strategy you may think of is to to try to extend one foot in a random direction and then take a step only if it leads downhill. Concretely, we will start out with a random $$W$$, generate random perturbations $$ \delta W $$ to it and if the loss at the perturbed $$W + \delta W$$ is lower, we will perform an update. The code for this procedure is as follows: +처음 떠오르는 전략은, 시작점에서 무작위로 방향을 정해서 발을 살짝 뻗어서 더듬어보고 그게 내리막길길을 때만 한발짝 내딛는 것이다. 구체적으로 말하면, 임의의 $$W$$에서 시작하고, 또다른 임의의 방향 $$ \delta W $$으로 살짝 움직여본다. 만약에 움직여간 자리($$W + \delta W$$)에서의 손실잢(loss)가 더 낮으면, 거기로 움직이고 다시 탐색을 시작한다. 이 과정을 코드로 짜면 다음과 같다. ~~~python W = np.random.randn(10, 3073) * 0.001 # generate random starting W @@ -157,25 +157,25 @@ for i in xrange(1000): print 'iter %d loss is %f' % (i, bestloss) ~~~ -Using the same number of loss function evaluations as before (1000), this approach achieves test set classification accuracy of **21.4%**. This is better, but still wasteful and computationally expensive. +이전과 같은 횟수(즉, 1000번)만큼 손실함수(loss function)을 계산하고도, 이 방법을 테스트 데이터에 적용해보니, 분류정확도가 **21.4%**로 나왔다. 발전하긴 했지만, 여전히 좀 비효울적인 것 같다. -#### Strategy #3: Following the Gradient +#### 전략 #3: 그라디언트(gradient) 따라가기 -In the previous section we tried to find a direction in the weight-space that would improve our weight vector (and give us a lower loss). It turns out that there is no need to randomly search for a good direction: we can compute the *best* direction along which we should change our weight vector that is mathematically guaranteed to be the direction of the steepest descend (at least in the limit as the step size goes towards zero). This direction will be related to the **gradient** of the loss function. 
In our hiking analogy, this approach roughly corresponds to feeling the slope of the hill below our feet and stepping down the direction that feels steepest. +이전 섹션에서, 모수(parameter/weight) 공간에서 모수(parameter/weight) 벡터를 향상시키는 (즉, 손실값을 더 낮추는) 뱡향을 찾는 시도를 해봤다. 그런데 사실 좋은 방향을 찾기 위해 방향을 무작위로 탐색할 필요가 없다고 한다. (적어도 반지름이 0으로 수렴하는 아주 좁은 근방에서는) 가장 가파르게 감소한다고 수학적으로 검증된 *최선의* 방향을 구할 수 있고, 이 방향을 따라 모수(parameter/weight) 벡터를 움직이면 된다는 것이다. 이 방향이 손실함수(loss function)의 **그라디언트(gradient)**와 관계있다. 눈 가리고 하산하는 것에 비유할 때, 발 밑 지형을 잘 더듬어보고 가장 가파르다는 느낌을 주는 방향으로 내려가는 것에 비견할 수 있다. -In one-dimensional functions, the slope is the instantaneous rate of change of the function at any point you might be interested in. The gradient is a generalization of slope for functions that don't take a single number but a vector of numbers. Additionally, the gradient is just a vector of slopes (more commonly referred to as **derivatives**) for each dimension in the input space. The mathematical expression for the derivative of a 1-D function with respect its input is: +1차원 함수의 경우, 어떤 점에서 움직일 때 기울기는 함수값의 순간 증가율을 나타낸다. 그라디언트(gradient)는 이 기울기란 것을, 변수가 하나가 아니라 여러 개인 경우로 일반화시킨 것이다. 덧붙여 설명하면, 그라디언트(gradient)는 입력데이터공간(역자 주: x들의 공간)의 각 차원에 해당하는 기울기(**미분**이라고 더 많이 불린다)들의 백터이다. 1차원 함수의 미분을 수식으로 쓰면 다음과 같다. $$ \frac{df(x)}{dx} = \lim_{h\ \to 0} \frac{f(x + h) - f(x)}{h} $$ -When the functions of interest take a vector of numbers instead of a single number, we call the derivatives **partial derivatives**, and the gradient is simply the vector of partial derivatives in each dimension. +함수가 숫자 하나가 아닌 벡터를 입력으로 받는 경우 (역자 주: x가 벡터인 경우), 우리는 미분을 **편미분**이라고 부른고, 그라디언트(gradient)는 단순히 각 차원으로의 편미분들을 모아놓은 벡터이다. -### Computing the gradient +### 그라디언트(gradient) 계산 There are two ways to compute the gradient: A slow, approximate but easy way (**numerical gradient**), and a fast, exact but more error-prone way that requires calculus (**analytic gradient**). We will now present both. From 37466659ee2f0d47f86e7cdcf7ea82e55b0c63a6 Mon Sep 17 00:00:00 2001 From: Taeksoo Kim Date: Wed, 13 Apr 2016 12:09:23 +0900 Subject: [PATCH 055/199] Update convolutional-networks-korean.md --- convolutional-networks-korean.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/convolutional-networks-korean.md b/convolutional-networks-korean.md index 7e6a7f99..7c072a05 100644 --- a/convolutional-networks-korean.md +++ b/convolutional-networks-korean.md @@ -124,16 +124,16 @@ CONV 레이어는 CNN을 이루는 핵심 요소이다. CONV 레이어의 출력 사실 적절한 가정을 통해 파라미터 개수를 크게 줄이는 것이 가능하다: (x,y)에서 어떤 patch feature가 유용하게 사용되었다면, 이 feature는 다른 위치 (x2,y2)에서도 유용하게 사용될 수 있다. 3차원 볼륨의 한 슬라이스 (깊이 차원으로 자른 2차원 슬라이스) 를 **depth slice**라고 하자 ([55x55x96] 사이즈의 볼륨은 각각 [55x55]의 크기를 가진 96개의 depth slice임). 앞으로는 각 depth slice 내의 뉴런들이 같은 가중치와 바이어스를 가지도록 제한할 것이다. 이런 파라미터 공유 기법을 사용하면, 예제의 첫 번째 컨볼루션 레이어는 (depth slice 당) 96개의 고유한 가중치를 가져서 총 96\*11\*11\*3 = 34,848개의 고유한 가중치, 또는 바이어스를 합쳐서 34,944개의 파라미터를 갖게 된다. 또는 각 depth slice에 존재하는 55*55개의 뉴런들은 모두 같은 파라미터를 사용하게 된다. 실제로는 backpropagation 과정에서 각 depth slice 내의 모든 뉴런들이 가중치에 대한 gradient를 계산하겠지만, 가중치 업데이트 할 때에는 이 gradient들을 합해 사용한다. -Notice that if all neurons in a single depth slice are using the same weight vector, then the forward pass of the CONV layer can in each depth slice be computed as a **convolution** of the neuron's weights with the input volume (Hence the name: Convolutional Layer). Therefore, it is common to refer to the sets of weights as a **filter** (or a **kernel**), which is convolved with the input. The result of this convolution is an *activation map* (e.g. 
of size [55x55]), and the set of activation maps for each different filter are stacked together along the depth dimension to produce the output volume (e.g. [55x55x96]). +한 depth slice내의 모든 뉴런들이 같은 가중치 벡터를 갖기 때문에 컨볼루션 레이어의 forward pass는 입력 볼륨과 가중치 간의 **컨볼루션**으로 계산될 수 있다 (컨볼루션 레이어라는 이름이 붙은 이유). 그러므로 컨볼루션 레이어의 가중치는 **필터(filter)** 또는 **커널(kernel)**이라고 부른다. 컨볼루션의 결과물은 **액티베이션 맵(activation map, [55x55] 사이즈)** 이 되며 각 깊이에 해당하는 필터의 액티베이션 맵들을 쌓으면 최종 출력 볼륨 ([55x55x96] 사이즈) 가 된다.
- Example filters learned by Krizhevsky et al. Each of the 96 filters shown here is of size [11x11x3], and each one is shared by the 55*55 neurons in one depth slice. Notice that the parameter sharing assumption is relatively reasonable: If detecting a horizontal edge is important at some location in the image, it should intuitively be useful at some other location as well due to the translationally-invariant structure of images. There is therefore no need to relearn to detect a horizontal edge at every one of the 55*55 distinct locations in the Conv layer output volume. + Krizhevsky et al. 에서 학습된 필터의 예. 96개의 필터 각각은 [11x11x3] 사이즈이며, 하나의 depth slice 내 55*55개 뉴런들이 이 필터들을 공유한다. 만약 이미지의 특정 위치에서 가로 엣지 (edge)를 검출하는 것이 중요했다면, 이미지의 다른 위치에서도 같은 특성이 중요할 수 있다 (이미지의 translationally-invariant한 특성 때문). 그러므로 55*55개 뉴런 각각에 대해 가로 엣지 검출 필터를 재학습 할 필요가 없다.
-Note that sometimes the parameter sharing assumption may not make sense. This is especially the case when the input images to a ConvNet have some specific centered structure, where we should expect, for example, that completely different features should be learned on one side of the image than another. One practical example is when the input are faces that have been centered in the image. You might expect that different eye-specific or hair-specific features could (and should) be learned in different spatial locations. In that case it is common to relax the parameter sharing scheme, and instead simply call the layer a **Locally-Connected Layer**. +가끔은 파라미터 sharing에 대한 가정이 부적절할 수도 있다. 특히 입력 이미지가 중심을 기준으로 찍힌 경우 (예를 들면 이미지 중앙에 얼굴이 있는 이미지), 이미지의 각 영역에 대해 완전히 다른 feature들이 학습되어야 할 수 있다. 눈과 관련된 feature나 머리카락과 관련된 feature 등은 서로 다른 영역에서 학습될 것이다. 이런 경우에는 파라미터 sharing 기법을 접어두고 대신 **Locally-Connected Layer**라는 레이어를 사용하는 것이 좋다. **Numpy examples.** To make the discussion above more concrete, lets express the same ideas but in code and with a specific example. Suppose that the input volume is a numpy array `X`. Then: From 38a4925270ad55589b15a9a1ad2b40f63c590040 Mon Sep 17 00:00:00 2001 From: Taeksoo Kim Date: Wed, 13 Apr 2016 15:23:16 +0900 Subject: [PATCH 056/199] Update convolutional-networks-korean.md --- convolutional-networks-korean.md | 39 ++++++++++++++++---------------- 1 file changed, 20 insertions(+), 19 deletions(-) diff --git a/convolutional-networks-korean.md b/convolutional-networks-korean.md index 7c072a05..71dd02a4 100644 --- a/convolutional-networks-korean.md +++ b/convolutional-networks-korean.md @@ -135,19 +135,20 @@ CONV 레이어는 CNN을 이루는 핵심 요소이다. CONV 레이어의 출력 가끔은 파라미터 sharing에 대한 가정이 부적절할 수도 있다. 특히 입력 이미지가 중심을 기준으로 찍힌 경우 (예를 들면 이미지 중앙에 얼굴이 있는 이미지), 이미지의 각 영역에 대해 완전히 다른 feature들이 학습되어야 할 수 있다. 눈과 관련된 feature나 머리카락과 관련된 feature 등은 서로 다른 영역에서 학습될 것이다. 이런 경우에는 파라미터 sharing 기법을 접어두고 대신 **Locally-Connected Layer**라는 레이어를 사용하는 것이 좋다. -**Numpy examples.** To make the discussion above more concrete, lets express the same ideas but in code and with a specific example. Suppose that the input volume is a numpy array `X`. Then: - +**Numpy 예제.** 위에서 다룬 것들을 더 확실히 알아보기 위해 코드를 작성해보자. 입력 볼륨을 numpy 배열 `X`라고 하면: - A *depth column* at position `(x,y)` would be the activations `X[x,y,:]`. +- `(x,y)`위치에서의 *depth column*은 액티베이션 `X[x,y,:]`이 된다. - A *depth slice*, or equivalently an *activation map* at depth `d` would be the activations `X[:,:,d]`. - -*Conv Layer Example*. Suppose that the input volume `X` has shape `X.shape: (11,11,4)`. Suppose further that we use no zero padding ($$P = 0$$), that the filter size is $$F = 5$$, and that the stride is $$S = 2$$. The output volume would therefore have spatial size (11-5)/2+1 = 4, giving a volume with width and height of 4. The activation map in the output volume (call it `V`), would then look as follows (only some of the elements are computed in this example): +- depth `d`에서의 *depth slice*, 또는 *액티베이션 맵 (activation map)*은 `X[:,:,d]`가 된다. +- +*컨볼루션 레이어 예제*. 입력 볼륨 `X`의 모양이 `X.shape: (11,11,4)`이고 제로 패딩은 사용하지 않으며($$P = 0$$) 필터 크기는 $$F = 5$$, stride $$S = 2$$라고 하자. 출력 볼륨의 spatial 크기 (가로/세로)는 (11-5)/2 + 1 = 4가 된다. 출력 볼륨의 액티베이션 맵 (`V`라고 하자) 는 아래와 같은 것이다 (아래에는 일부 요소만 나타냄). - `V[0,0,0] = np.sum(X[:5,:5,:] * W0) + b0` - `V[1,0,0] = np.sum(X[2:7,:5,:] * W0) + b0` - `V[2,0,0] = np.sum(X[4:9,:5,:] * W0) + b0` - `V[3,0,0] = np.sum(X[6:11,:5,:] * W0) + b0` -Remember that in numpy, the operation `*` above denotes elementwise multiplication between the arrays. 
Notice also that the weight vector `W0` is the weight vector of that neuron and `b0` is the bias. Here, `W0` is assumed to be of shape `W0.shape: (5,5,4)`, since the filter size is 5 and the depth of the input volume is 4. Notice that at each point, we are computing the dot product as seen before in ordinary neural networks. Also, we see that we are using the same weight and bias (due to parameter sharing), and where the dimensions along the width are increasing in steps of 2 (i.e. the stride). To construct a second activation map in the output volume, we would have: +Numpy에서 `*`연산은 두 배열 간의 elementwise 곱셈이라는 것을 기억하자. 또한 `W0`는 가중치 벡터이고 `b0`은 바이어스라는 것도 기억하자. 여기에서 `W0`의 모양은 `W0.shape: (5,5,4)`라고 가정하자 (필터 사이즈는 5, depth는 4). 각 위치에서 일반 신경망에서와 같이 내적 연산을 수행하게 된다. 또한 파라미터 sharing 기법으로 같은 가중치, 바이어스가 사용되고 가로 차원에 대해 2 (stride)칸씩 옮겨가며 연산이 이뤄진다는 것을 볼 수 있다. 출력 볼륨의 두 번째 액티베이션 맵을 구성하는 방법은: - `V[0,0,1] = np.sum(X[:5,:5,:] * W1) + b1` - `V[1,0,1] = np.sum(X[2:7,:5,:] * W1) + b1` @@ -156,26 +157,26 @@ Remember that in numpy, the operation `*` above denotes elementwise multiplicati - `V[0,1,1] = np.sum(X[:5,2:7,:] * W1) + b1` (example of going along y) - `V[2,3,1] = np.sum(X[4:9,6:11,:] * W1) + b1` (or along both) -where we see that we are indexing into the second depth dimension in `V` (at index 1) because we are computing the second activation map, and that a different set of parameters (`W1`) is now used. In the example above, we are for brevity leaving out some of the other operatations the Conv Layer would perform to fill the other parts of the output array `V`. Additionally, recall that these activation maps are often followed elementwise through an activation function such as ReLU, but this is not shown here. +위 예제는 `V`의 두 번째 depth 차원 (인덱스 1)을 인덱싱하고 있다. 두 번째 액티베이션 맵을 계산하므로, 여기에서 사용된 가중치는 이전 예제와 달리 `W1`이다. 보통 액티베이션 맵이 구해진 뒤 ReLU와 같은 elementwise 연산이 가해지는 경우가 많은데, 위 예제에서는 다루지 않았다. -**Summary**. To summarize, the Conv Layer: +**요약**. To summarize, the Conv Layer: -- Accepts a volume of size $$W_1 \times H_1 \times D_1$$ -- Requires four hyperparameters: - - Number of filters $$K$$, - - their spatial extent $$F$$, - - the stride $$S$$, - - the amount of zero padding $$P$$. -- Produces a volume of size $$W_2 \times H_2 \times D_2$$ where: +- $$W_1 \times H_1 \times D_1$$ 크기의 볼륨을 입력받는다. +- 4개의 hyperparameter가 필요하다: + - 필터 개수 $$K$$, + - 필터의 가로/세로 Spatial 크기 $$F$$, + - Stride $$S$$, + - 제로 패딩 $$P$$. +- $$W_2 \times H_2 \times D_2$$ 크기의 출력 볼륨을 생성한다: - $$W_2 = (W_1 - F + 2P)/S + 1$$ - - $$H_2 = (H_1 - F + 2P)/S + 1$$ (i.e. width and height are computed equally by symmetry) + - $$H_2 = (H_1 - F + 2P)/S + 1$$ (i.e. 가로/세로는 같은 방식으로 계산됨) - $$D_2 = K$$ -- With parameter sharing, it introduces $$F \cdot F \cdot D_1$$ weights per filter, for a total of $$(F \cdot F \cdot D_1) \cdot K$$ weights and $$K$$ biases. -- In the output volume, the $$d$$-th depth slice (of size $$W_2 \times H_2$$) is the result of performing a valid convolution of the $$d$$-th filter over the input volume with a stride of $$S$$, and then offset by $$d$$-th bias. +- 파라미터 sharing로 인해 필터 당 $$F \cdot F \cdot D_1$$개의 가중치를 가져서 총 $$(F \cdot F \cdot D_1) \cdot K$$개의 가중치와 $$K$$개의 바이어스를 갖게 된다. +- 출력 볼륨에서 $$d$$번째 depth slice ($$W_2 \times H_2$$ 크기)는 입력 볼륨에 $$d$$번째 필터를 stride $$S$$만큼 옮겨가며 컨볼루션 한 뒤 $$d$$번째 바이어스를 더한 결과이다. -A common setting of the hyperparameters is $$F = 3, S = 1, P = 1$$. However, there are common conventions and rules of thumb that motivate these hyperparameters. See the [ConvNet architectures](#architectures) section below. 
+흔한 Hyperparameter기본 세팅은 $$F = 3, S = 1, P = 1$$이다. 뒤에서 다룰 [ConvNet architectures](#architectures)에서 hyperparameter 세팅과 관련된 법칙이나 방식 등을 확인할 수 있다. -**Convolution Demo**. Below is a running demo of a CONV layer. Since 3D volumes are hard to visualize, all the volumes (the input volume (in blue), the weight volumes (in red), the output volume (in green)) are visualized with each depth slice stacked in rows. The input volume is of size $$W_1 = 5, H_1 = 5, D_1 = 3$$, and the CONV layer parameters are $$K = 2, F = 3, S = 2, P = 1$$. That is, we have two filters of size $$3 \times 3$$, and they are applied with a stride of 2. Therefore, the output volume size has spatial size (5 - 3 + 2)/2 + 1 = 3. Moreover, notice that a padding of $$P = 1$$ is applied to the input volume, making the outer border of the input volume zero. The visualization below iterates over the output activations (green), and shows that each element is computed by elementwise multiplying the highlighted input (blue) with the filter (red), summing it up, and then offsetting the result by the bias. +** 컨볼루션 데모**. 아래는 컨볼루션 레이어 데모이다. 3차원 볼륨은 시각화하기 힘드므로 각 행마다 depth slice를 하나씩 배치했다. 각 볼륨은 입력 볼륨(파란색), 가중치 볼륨(빨간색), 출력 볼륨(녹색)으로 이뤄진다. 입력 볼륨의 크기는 $$W_1 = 5, H_1 = 5, D_1 = 3$$이고 컨볼루션 레이어의 파라미터들은 $$K = 2, F = 3, S = 2, P = 1$$이다. 즉, 2개의 $$3 \times 3$$크기의 필터가 각각 stride 2마다 적용된다. 그러므로 출력 볼륨의 spatial 크기 (가로/세로)는 (5 - 3 + 2)/2 + 1 = 3이다. 제로 패딩 $$P = 1$$ 이 적용되어 입력 볼륨의 가장자리가 모두 0으로 되어있다는 것을 확인할 수 있다. 아래의 영상에서 하이라이트 표시된 입력(파란색)과 필터(빨간색)이 elementwise로 곱해진 뒤 하나로 더해지고 bias가 더해지는걸 볼 수 있다.
From a56506d31138cbe096159fde19bf43132d235aa0 Mon Sep 17 00:00:00 2001 From: Taeksoo Kim Date: Wed, 13 Apr 2016 15:24:29 +0900 Subject: [PATCH 057/199] Update convolutional-networks-korean.md --- convolutional-networks-korean.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/convolutional-networks-korean.md b/convolutional-networks-korean.md index 71dd02a4..4db77c35 100644 --- a/convolutional-networks-korean.md +++ b/convolutional-networks-korean.md @@ -140,7 +140,7 @@ CONV 레이어는 CNN을 이루는 핵심 요소이다. CONV 레이어의 출력 - `(x,y)`위치에서의 *depth column*은 액티베이션 `X[x,y,:]`이 된다. - A *depth slice*, or equivalently an *activation map* at depth `d` would be the activations `X[:,:,d]`. - depth `d`에서의 *depth slice*, 또는 *액티베이션 맵 (activation map)*은 `X[:,:,d]`가 된다. -- + *컨볼루션 레이어 예제*. 입력 볼륨 `X`의 모양이 `X.shape: (11,11,4)`이고 제로 패딩은 사용하지 않으며($$P = 0$$) 필터 크기는 $$F = 5$$, stride $$S = 2$$라고 하자. 출력 볼륨의 spatial 크기 (가로/세로)는 (11-5)/2 + 1 = 4가 된다. 출력 볼륨의 액티베이션 맵 (`V`라고 하자) 는 아래와 같은 것이다 (아래에는 일부 요소만 나타냄). - `V[0,0,0] = np.sum(X[:5,:5,:] * W0) + b0` From 6756e24d5a2b924cd10d5bf0a297519a265fc8ba Mon Sep 17 00:00:00 2001 From: Taeksoo Kim Date: Wed, 13 Apr 2016 15:24:58 +0900 Subject: [PATCH 058/199] Update convolutional-networks-korean.md --- convolutional-networks-korean.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/convolutional-networks-korean.md b/convolutional-networks-korean.md index 4db77c35..3c81b675 100644 --- a/convolutional-networks-korean.md +++ b/convolutional-networks-korean.md @@ -176,7 +176,7 @@ Numpy에서 `*`연산은 두 배열 간의 elementwise 곱셈이라는 것을 흔한 Hyperparameter기본 세팅은 $$F = 3, S = 1, P = 1$$이다. 뒤에서 다룰 [ConvNet architectures](#architectures)에서 hyperparameter 세팅과 관련된 법칙이나 방식 등을 확인할 수 있다. -** 컨볼루션 데모**. 아래는 컨볼루션 레이어 데모이다. 3차원 볼륨은 시각화하기 힘드므로 각 행마다 depth slice를 하나씩 배치했다. 각 볼륨은 입력 볼륨(파란색), 가중치 볼륨(빨간색), 출력 볼륨(녹색)으로 이뤄진다. 입력 볼륨의 크기는 $$W_1 = 5, H_1 = 5, D_1 = 3$$이고 컨볼루션 레이어의 파라미터들은 $$K = 2, F = 3, S = 2, P = 1$$이다. 즉, 2개의 $$3 \times 3$$크기의 필터가 각각 stride 2마다 적용된다. 그러므로 출력 볼륨의 spatial 크기 (가로/세로)는 (5 - 3 + 2)/2 + 1 = 3이다. 제로 패딩 $$P = 1$$ 이 적용되어 입력 볼륨의 가장자리가 모두 0으로 되어있다는 것을 확인할 수 있다. 아래의 영상에서 하이라이트 표시된 입력(파란색)과 필터(빨간색)이 elementwise로 곱해진 뒤 하나로 더해지고 bias가 더해지는걸 볼 수 있다. +**컨볼루션 데모**. 아래는 컨볼루션 레이어 데모이다. 3차원 볼륨은 시각화하기 힘드므로 각 행마다 depth slice를 하나씩 배치했다. 각 볼륨은 입력 볼륨(파란색), 가중치 볼륨(빨간색), 출력 볼륨(녹색)으로 이뤄진다. 입력 볼륨의 크기는 $$W_1 = 5, H_1 = 5, D_1 = 3$$이고 컨볼루션 레이어의 파라미터들은 $$K = 2, F = 3, S = 2, P = 1$$이다. 즉, 2개의 $$3 \times 3$$크기의 필터가 각각 stride 2마다 적용된다. 그러므로 출력 볼륨의 spatial 크기 (가로/세로)는 (5 - 3 + 2)/2 + 1 = 3이다. 제로 패딩 $$P = 1$$ 이 적용되어 입력 볼륨의 가장자리가 모두 0으로 되어있다는 것을 확인할 수 있다. 아래의 영상에서 하이라이트 표시된 입력(파란색)과 필터(빨간색)이 elementwise로 곱해진 뒤 하나로 더해지고 bias가 더해지는걸 볼 수 있다.
From 1af8f62872cde83aeb67c734d160c4a78ffa5439 Mon Sep 17 00:00:00 2001 From: Taeksoo Kim Date: Fri, 15 Apr 2016 20:12:48 +0900 Subject: [PATCH 059/199] Update convolutional-networks-korean.md --- convolutional-networks-korean.md | 1 + 1 file changed, 1 insertion(+) diff --git a/convolutional-networks-korean.md b/convolutional-networks-korean.md index 3c81b675..5ebec9c3 100644 --- a/convolutional-networks-korean.md +++ b/convolutional-networks-korean.md @@ -183,6 +183,7 @@ Numpy에서 `*`연산은 두 배열 간의 elementwise 곱셈이라는 것을
+**매트릭스 곱으로 구현**. 컨볼루션 연산은 필터와 이미지의 로컬한 영역간의 내적 연산을 한 것과 같다. 컨볼루션 레이어의 일반적인 구현 패턴은 이 점을 이용해 컨볼루션 레이어의 forward pass를 다음과 같이 하나의 큰 매트릭스 곱으로 **Implementation as Matrix Multiplication**. Note that the convolution operation essentially performs dot products between the filters and local regions of the input. A common implementation pattern of the CONV layer is to take advantage of this fact and formulate the forward pass of a convolutional layer as one big matrix multiply as follows: 1. The local regions in the input image are stretched out into columns in an operation commonly called **im2col**. For example, if the input is [227x227x3] and it is to be convolved with 11x11x3 filters at stride 4, then we would take [11x11x3] blocks of pixels in the input and stretch each block into a column vector of size 11\*11\*3 = 363. Iterating this process in the input at stride of 4 gives (227-11)/4+1 = 55 locations along both width and height, leading to an output matrix `X_col` of *im2col* of size [363 x 3025], where every column is a stretched out receptive field and there are 55*55 = 3025 of them in total. Note that since the receptive fields overlap, every number in the input volume may be duplicated in multiple distinct columns. From 130e292cf5aab268460156eaef8354d10dab94ce Mon Sep 17 00:00:00 2001 From: Seo Jonghan Date: Fri, 15 Apr 2016 23:54:26 +0900 Subject: [PATCH 060/199] Normalization Draft --- neural-networks-2.kr.md | 306 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 306 insertions(+) create mode 100644 neural-networks-2.kr.md diff --git a/neural-networks-2.kr.md b/neural-networks-2.kr.md new file mode 100644 index 00000000..e16d988f --- /dev/null +++ b/neural-networks-2.kr.md @@ -0,0 +1,306 @@ +--- +layout: page +permalink: /neural-networks-2-kr/ +--- + +목차: + +- [데이터 및 모델 준비](#intro) + - [데이터 전처리(Data Preprocessing)](#datapre) + - [가중치 초기화(Weight Initialization)](#init) + - [배치 정규화(Batch Normalization)](#batchnorm) + - [Regularization](#reg) (L2/L1/Maxnorm/Dropout) +- [손실 함수(Loss functions)](#losses) +- [요약 (Summary)](#summary) + + + + +## 데이터 및 모델 준비 + +앞 장에서 내적(dot product) 및 비선형성(non-linearity)을 연산을 순차적으로 수행하는 뉴런(Neuron) 모델과 이러한 뉴런들의 다층구조(layers)로 구성된 신경망(Neural Networks)에 대해서 소개하였다. 신경망(Neural Networks) 모델은 선형변환(linear mapping) 결과를 비선형성 변환에 적용하는 과정이 연속적으로 발생하게 되고 따라서 선형분류(Linear Classification) 부분에서 소개한 선형변환(linear mapping)을 확장한 새로운 형태의 **score function** 정의를 필요로 한다. 이번 장에서는 데이터 전처리(data preprocessing), 파라미터 초기화(weight initialization), 손실 함수(loss function)을 소개한다. + + + +### 데이터 전처리(Data Preprocessing) + +데이터 행렬 `X`에 대해서 일반적으로 아래의 3가지 전처리 방법을 사용한다. (여기서 데이터 `X`는 `D` 차원의 데이터 벡터 `N`개로 이루어진 `[N X D]` 행렬로 가정한다) + +**평균 차감(Mean Subtraction)** +가장 흔하게 사용되는 데이터 전처리 기법이다. 데이터의 모든 *피쳐(feature)*에 각각에 대해서 평균값 만큼 차감하는 방법으로 기하학 관점에서 보자면 데이터 군집을 모든 차원에 대해서 원점으로 이동시키는 것으로 해석할 수 있다. numpy에서는 다음과 같이 구현 가능하다: `X -= np.mean(X, axis = 0)`. 특히 이미지 처리에 있어서 계산의 간결성을 위해서 모든 픽셀에서 동일한 값을 차감하는 방식으로 구현한다.(예를들어 numpy에서 `X -= np.mean(X)`) + +**정규화(Normalization)** +정규화(Normalization)는 각 차원의 데이터가 동일한 범위내의 값을 갖도록 하는 전처리 기법을 의미한다. 일반적으로 다음의 2가지 중 하나를 선택하여 구현한다. (1) 각 데이터값을 평균 만큼 차감 하고 표준편차 값으로 나눈다: (`X /= np.std(X, axis = 0)`), 이때 각 차원에 대해서 개별적으로 연산을 수행한다. (2) 또 다른 기법은 각 차원에서 최소/최대 값이 각각 -1/1의 값을 갖도록 정규화 하는 것이다. 하지만 이 기법은 스케일(scale)(혹은 단위(units))이 다른 features가 (거의) 동일한 비중으로 학습 결과에 영향을 줄 것이라는 가정하에 사용하는 것이 일반적이다. +이미지 처리에서는 각 픽셀 값이 이미 동일한 스케일(0~255)을 갖고 있는 경우가 대부분 이기 때문에 정규화 전처리 기법을 반드시 사용해야 하는 것은 아니다. + +
+ +
Common data preprocessing pipeline. Left: Original toy, 2-dimensional input data. Middle: The data is zero-centered by subtracting the mean in each dimension. The data cloud is now centered around the origin. Right: Each dimension is additionally scaled by its standard deviation. The red lines indicate the extent of the data - they are of unequal length in the middle, but of equal length on the right.
+
+ +**PCA와 Whitening** +먼저 평균차감(Mean Subtraction) 기법을 이용하여 데이터를 정규화 시킨다. 그리고 데이터 간의 상관관계를 나타내는 공분산(Covariance)을 계산한다: + +~~~python +# Assume input data matrix X of size [N x D] +X -= np.mean(X, axis = 0) # zero-center the data (important) +cov = np.dot(X.T, X) / X.shape[0] # get the data covariance matrix +~~~ + +공분산(Convairance) 행렬에서 (i, j) 값은 `X` 행렬에서 i번째, j번째 데이터 간의 **상관정도(covariance)**를 나타내는 값이라고 해석할 수 있다. 특히, 공분산(Covariance) 행렬에서 대각선 상(the diagonal)의 값들은 `X` 행렬의 각 데이터(주, row 벡터)의 분산(variance)값과 같다. 또한 공분산(Covariance) 행렬은 simmetric, [positive semi-definite](http://en.wikipedia.org/wiki/Positive-definite_matrix#Negative-definite.2C_semidefinite_and_indefinite_matrices) 성질을 갖는다. 공분산(Covariance) 행렬의 SVD factorication은 다음과 같이 구할 수 있는데, + +~~~python +U,S,V = np.linalg.svd(cov) +~~~ + +여기서 `U` 행렬의 컬럼(column) 벡터는 아이겐벡터(eigenvector), S는 특이값(singular value)의 1차원 배열이다 (공분산(Covariance)은 symmetric, positive semi-definite의 성질이 있으므로 S 벡터의 각 성분은 아이겐밸류(engienvalue) 제곱의 값을 갖는다) 데이터 `X`를 고유기저(eigenbasis)에 사상시킴으로써 데이터 간의 상관관계를 없앨 수 있다: + +~~~python +Xrot = np.dot(X, U) # decorrelate the data +~~~ + +`U` 행렬의 컬럼 벡터는 norm 값은 1이고 서로 직교하는 정규직교(orthonormal)의 성질을 갖고 있기때문에, 기저벡터(basis vector)가 됨을 알 수 있다. 따라서 고유기저(eigenbasis)로 사상(projection)하는 것은 아이겐벡터(eigenvector)를 새로운 축으로하여 `X` 데이터를 회전하는 것으로 해석할 수 있다. (위의 python 코드에서) `Xrot` 행렬의 공분산(Covariance)을 구하면 대각행렬(diagonal matrix)인 것을 알 수 있디. `np.linalg.svd`의 이점 중 하나는 `U` 행렬의 컬럼 벡터는 각 벡터에 상응하는 아이겐밸류(eigenvalue)의 내림차순으로 정렬된 다는 것이다. 따라서 처음 몇 개의 벡터만 사용하여 데이터 차원을 축소하는데 사용할 수 있다.(and discarding the dimensions along which the data has no variance) 이러한 기법을 [Principal Component Analysis (PCA)](http://en.wikipedia.org/wiki/Principal_component_analysis) 차원 축소 기법이라 부르기도 한다. + +~~~python +Xrot_reduced = np.dot(X, U[:,:100]) # Xrot_reduced becomes [N x 100] +~~~ + +위의 연산을 통하여, [N x D] 크기의 `X` 데이터를 [N x 100] 크기의 데이터로 압축 할 수 있는데 데이터의 variance가 가능한 큰 값을 갖도록 하는 100개의 차원이 선택된다. PCA-축소 기법으로 전처리 된 데이터를 선형 분류기 혹은 신경망에 학습시킴으로써 좋은 성능을 기대할 수 있을 뿐만 아니라 트레이닝 시간과 사용 메모리 용량에서도 이득을 볼 수 있다. + +마지막으로 살펴볼 기법은 **화이트닝(whitening)**으로 이는 기저벡터(eigenbasis) 데이터를 아이겐밸류(eigenvalue) 값으로 나누어 정규화는 기법이다. 화이트닝 변환의 기하학적 해석은 만약 입력 데이터가 multivariable gaussian 분포를라면 화이트닝된 데이터는 평균은 0이고 공분산(covariance)는 단위행렬을 갖는 정규분포를 갖게된다. 와이트닝은 다음과 같이 구할 수 있다: + +~~~python +# whiten the data: +# divide by the eigenvalues (which are square roots of the singular values) +Xwhite = Xrot / np.sqrt(S + 1e-5) +~~~ + +*주의: 노이즈 과장(Exaggeratin noice)* 위의 식에서 분모가 0이 되는 것을 방지하기 위해서 1e-5(또는 임의의 작은 상수도 무방)를 더한 것에 주목하자. 화이트닝 기법의 단점 중 하나는 모든 차원의 데이터를 동일하게 늘리게 되는데 특히 분산값이 매우 작아 노이즈로 해석할 수 있는 차원의 데이터까지 포함되어 데이터 내의 노이즈 과장되는 효과가 나타난다는 것이다. 이런 경우 보통 (1e-5와 같은 작은 수가 아닌) 큰 수를 분모에 더하는 방식으로 스무딩(smoothing) 효과를 추가하여 이러한 노이즈 과장 현상을 완화 할 수 있다. + +
+ +
PCA / Whitening. Left: Original toy, 2-dimensional input data. Middle: After performing PCA. The data is centered at zero and then rotated into the eigenbasis of the data covariance matrix. This decorrelates the data (the covariance matrix becomes diagonal). Right: Each dimension is additionally scaled by the eigenvalues, transforming the data covariance matrix into the identity matrix. Geometrically, this corresponds to stretching and squeezing the data into an isotropic gaussian blob.
+
+ +CIFAR-10 이미지에 위에서 소개된 변환들을 적용하여 각 변환 효과를 시각화 할 수 있다. CIFAR-10 학습 데이터는 50,000 x 3072 크기이며 각 이미지 데이터는 3072 차원을 갖는 row 벡터로 표현되어 있다. [3072 x 3072] 크기를 갖는 공분산(covariance) 행렬을 구하고 SVD 분해 (연산 시간이 비교적 오래걸린다)를 한다. 연산을 통하여 구해진 eigenvector는 어떤 특성을 보이는가? 다음의 이미지를 통하여 그 결과를 확인해 볼 수 있다: + +
+ +
Left:An example set of 49 images. 2nd from Left: The top 144 out of 3072 eigenvectors. The top eigenvectors account for most of the variance in the data, and we can see that they correspond to lower frequencies in the images. 2nd from Right: The 49 images reduced with PCA, using the 144 eigenvectors shown here. That is, instead of expressing every image as a 3072-dimensional vector where each element is the brightness of a particular pixel at some location and channel, every image above is only represented with a 144-dimensional vector, where each element measures how much of each eigenvector adds up to make up the image. In order to visualize what image information has been retained in the 144 numbers, we must rotate back into the "pixel" basis of 3072 numbers. Since U is a rotation, this can be achieved by multiplying by U.transpose()[:144,:], and then visualizing the resulting 3072 numbers as the image. You can see that the images are slightly more blurry, reflecting the fact that the top eigenvectors capture lower frequencies. However, most of the information is still preserved. Right: Visualization of the "white" representation, where the variance along every one of the 144 dimensions is squashed to equal length. Here, the whitened 144 numbers are rotated back to image pixel basis by multiplying by U.transpose()[:144,:]. The lower frequencies (which accounted for most variance) are now negligible, while the higher frequencies (which account for relatively little variance originally) become exaggerated.
+
+ +**실전 응용** 모든 변환 기법을 소개하기 위해 PCA/화이트닝(Whitening)도 함께 살펴보았지만 콘볼루션 신경망(Convolutional Networks)에서는 이 변환을 사용하는 경우는 거의 없다. 하지만 (평균차감(Mean Subtraction) 기법을 통하여) zero-centered 데이터로 변환하거나 각 픽셀 값을 정규화 하는 기법은 일반적으로 흔하게 쓰는 전처리 기법 중에 하나이다. + +**흔히 하는 실수**. 전처리 기법을 적용함에 있어서 명심해야 하는 중요한 사항은 전처리를 위한 여러 통계치들은 학습 데이터만 대상으로 추출하고 검증, 테스트 데이터에 적용해야 한다. 예를들어 평균차감(mena subtraction) 기법을 적용 할 때 흔히 하는 실수 중에 하나는 전체 데이터를 대상으로 평균차감 처리를 하고 이 데이터를 학습, 검증, 테스트 데이터로 나누어 사용하는 것이다. 올바른 방법은 학습, 검증, 테스트를 위한 데이터를 먼저 나눈 후에 학습 데이터를 대상으로 평균값을 구한 후에 평균차감 전처리를 모든 데이터군(학습, 검증, 테스트)에 적용하는 것이다. + + + +### 가중치 초기화 + +우리는 지금까지 신경망(Neural Network) 구조 및 데이터 전처리 기법에 대해 알아 보았다. 실제 데이터를 신경망 내에서 학습 시키기 전에 해야하는 작업이 있는데 바로 파라미터(paramters) 초기화 이다. + +**실수: 0으로 초기화하기**. 실은 우리가 하지 말아야 하는 방식을 먼저 적용해보자. 학습된 신경망에서 가중치들이 최종적으로 어떤 값으로 수렴해야 하는지 알 수 없지만 데이터 정규화 기법을 적절하게 적용하여 가중치의 절반은 양수 값 나머지 절반은 음수 값을 갖는다는 가정을 할 수 있을 것이다. 더나아가 모든 가중치를 0으로 초기화 함으로써 최상의 학습 결과를 얻을 것이라는 아이디어 또한 합리적인 추론으로 보일 수 있다. 하지만 이러한 방법은 명백히 잘못된 방법이라는 것이 밝혀졌다. 왜냐하면 가중치가 0으로 초기화된 신경망 내의 뉴런들은 모두 동일한 연산 결과를 낼 것이고 따라서 backpropagaton 과정에서 동일한 그라디언트(gradient) 값을 얻게 될 것이고 결과적으로 모든 파라미터(paramter)는 동일한 값으로 업데이트 될 것이기 때문이다. 다시말해, 모든 가중치 값이 동일한 값으로 초기화 된다면 뉴런들의 비대칭성(asymmetry)를 야기할 요소가 사라지게 된다. + +**0에 가까운 작은 난수**. 위에서 언급한 이야기를 종합하자면, 가중치 값은 가능한 0에 가까운 값이어야 또한 모든 동일하게 0이되어서는 안된다는 것이다. 소위 *symmetry breaking*을 사용하는데 이는 0에 가까운 (하지만 0이 아닌) 값으로 가중치를 초기화시키는 방법이다. 즉, 모든 가중치들을 난수를 이용하여 고유한 값으로 초기화 함으로써 각 파라미터 값이 서로 다른 값으로 업데이트 되고 결과적으로 전체 신경망 내에서 서로 다른 특성을 보이는 다양한 부분으로 분화될 수 있다. 가중치 배열은 다음과 같이 구현할 수 있는데 `W = 0.01* np.random.randn(D,H)` 여기서 `randn`은 평균 0, 표준편자 1인 정규 분포로 부터 얻는 값이다. 앞의 공식에 의한면, 모든 가중치 벡터는 다차원 정규 분포로 부터 추출된 벡터로 초기화 되기 때문에 공간 상에서 각 벡터들은 (특정한 패턴 혹은 방향성 없이) 무작위의 방향성을 갖게 된다. 정규 분포가 아닌 균일 분포(uniform distribution)로 부터 추출된 값으로 가중치를 초기화 해도 무방하지만 이 방법은 학습된 최종 성능에 미치는 영향은 미미한 것으로 알려져 있다. + +*주의*: 가중치를 0에 가까운 작은 값으로 초기화 하는 것은 항상 좋은 성능을 답보하는 것은 아니다. 예를들어 아주 작은 값으로 구성된 가중치 값으로 된 신경망의 경우 backpropagation 연상 과정에서 그라디언트(gradient) 또한 작은 값을 갖게 된다(그라디언트(gradient)는 가중치 값에 례하기 때문). 이는 네트워크의 역방향으로 흐르며 전달되는 "그라디언트 시그널(gradient signal)"을 감소시키게 되고 이는 신경망 학습에 있어서 중요한 문제를 야기하게 된다. + +**분산 보정, 1/sqrt(n)**. 위에서 제안한 방법의 문제점 중 하나는 랜덤값으로 초기화된 뉴런으로 학습되어 나온 결과의 분포가 입력 데이터 수에 비례하여 커지는 분산을 갖는다는 것이다. 가중치 벡터를 *팬인(fan-in)*(입력 데이터 수)의 제곱근 값으로 나누는 연산을 통하여 뉴런 출력의 분산이 1로 정규화 할 수 있다. 권장되는 휴리스틱(heuristic) 기법은 뉴런의 가중치 벡터를 다음과 같이 초기화 하는 것이다. `w = np.random.randn(n) / sqrt(n)` (n: 입력 수). 이 방법은 근사적으로 동일한 출력 분포를 갖게 할 뿐만 아니라 신경망의 수렴률 또한 향상시키는 것으로 알려져 있다. + +이는 다음의 유도 과정을 통해서 확인할 수 있다.: 가중치 값을 나타내는 $$ w $$와 입력 데이터를 나타내는 $$ x $$의 내적 연산 $$ s = \sum_i^n w_i x_i $$가 있다고 하자. 이는 비선형 연산 이전 단계에 일어나는 뉴런 연산이 되고 $$ s $$의 연산은 다음과 같이 구할 수 있다. + +$$ +\begin{align} +\text{Var}(s) &= \text{Var}(\sum_i^n w_ix_i) \\\\ +&= \sum_i^n \text{Var}(w_ix_i) \\\\ +&= \sum_i^n [E(w_i)]^2\text{Var}(x_i) + E[(x_i)]^2\text{Var}(w_i) + \text{Var}(x_i)\text{Var}(w_i) \\\\ +&= \sum_i^n \text{Var}(x_i)\text{Var}(w_i) \\\\ +&= \left( n \text{Var}(w) \right) \text{Var}(x) +\end{align} +$$ + +처음 2단계는 [분산의 성질](http://en.wikipedia.org/wiki/Variance)을 이용하여 전개하였다. 가중치와 입력 데이터 모두 평균이 0이라고 가정하고 있기때문에 $$ E[x_i] = E[w_i] = 0 $$이 되고 따라서 3번째 단계에서 4번째 단계로 전개가 가능하다. 하지만 평균이 0이라고 가정하는 것은 일반적으로 모든 상황에서 가정할 수 있는 것은 아니라는 것을 명심해야 한다. 일례로 ReLU 유닛은 0보다 큰 평균값을 갖는다. 마지막 단계는 $$ w_i, x_i $$ 모두 동일한 확률 분포(identically distribution)를 갖는다고 가정하여 전개할 수 있다. +위의 유도 과정을 통하여 $$ s $$가 입력 데이터 $$ x $$와 동일한 분산을 갖기 위해서는 초기화 단계에서 모든 가중치 벡터 $$ w $$의 분산이 $$ 1/n $$로 만들어야 한다는 것을 알 수 있다. 
또한 확률 변수 $$ X $$, 스칼라(scalar) 값 $$ a $$에 대해서 $$ \text{Var}(aX) = a^2\text{Var}(X) $$이 성립하므로 분산이 $$ 1/n $$이 되기 위해서는 표준정규분포에서 값을 뽑아서 $$ a = \sqrt{1/n} $$ 곱해주어야 한다는 것을 알 수 있다. `w = np.random.randn(n) / sqrt(n)`로 가중치를 초기화하면 된다. + +이와 유사한 내용의 연구를 [Understanding the difficulty of training deep feedforward neural networks](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) by Glorot et al. 논문에서 확인할 수 있다. 논문의 저자는 $$ \text{Var}(w) = 2/(n_{in} + n _{out}) $$ ($$ n_{in}, n_{out} $$ 각각 이전 레이어, 다음 레이어의 입력 유닛수)로 초기화 할 것을 권고하며 끝맺고 있다. **This is motivated by based on a compromise and an equivalent analysis of the backpropagated gradients.** 동일한 주제에 대한 더 최근의 연구는 [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification](http://arxiv-web3.library.cornell.edu/abs/1502.01852) by He et al.에서 확인 할 수 있는데, 특히 ReLU 뉴런에 대한 초기화 방법에 대해서 다루고 있다. 이 논문에서는 신경망에서 뉴런의 분산은 $$ 2.0/n $$가 되야 한다고 결론내리고 있다. 즉, `w = np.random.randn(n) * sqrt(2.0/n)`을 이용하여 가중치를 초기화 하는 것을 의미하며 이는 특히 ReLU 뉴런이 사용되는 신경망에서 최근에 권장되고 있는 방식이다. + +**희소 초기화(Sparse initialization)**. 보정되지 않은 분산을 위한 또 다른 방법은 모든 가중치 행렬을 0으로 초기화 하는 것이다. 이때 대칭성을 깨기 위해서 모든 뉴런을 고정된 숫자의 아래 단계 뉴런들과 무작위로 연결한다.**(with weights sampled from a small gaussian as above)** 연결하는 뉴런의 수는 대략 10개 정도이다. + +** bias 초기화 **. 가중치에 랜덤한 값을 설정하므로써 대칭성 문제는 해결되기 때문에 주로bias는 0으로 초기화한다. ReLU 연산의 비선형성에 의해서 몇몇 경우에는 0.01과 같은 작은 상수값을 사용하기도 하는데 이는 ReLU 연산이 초기부터 fire되고 따라서 그라디언트(gradient) 값이 유미의한 값을 갖고 신경망을 통해서 전달되는 것을 보장할 수 있기 때문이다. 하지만 상수값을 사용하는 방식이 성능 향상을 언제나 보장하는 것인가에 대해서는 이견이 존재한다(실제 몇몇 사례에서 더 나쁜 결과가 볼 수 있다). 따라서 bias 값은 0으로 초기화 하는 것이 더 일반적이라 할 수 있다. + +** 실전응용 **, ReLU 유닛을 사용하고 `w = np.random.randn(n) * sqrt(2.0/n)` 초기화하는 것이 요즘의 추세이다[He et al.](http://arxiv-web3.library.cornell.edu/abs/1502.01852). + + + +**Batch Normalization**. A recently developed technique by Ioffe and Szegedy called [Batch Normalization](http://arxiv.org/abs/1502.03167) alleviates a lot of headaches with properly initializing neural networks by explicitly forcing the activations throughout a network to take on a unit gaussian distribution at the beginning of the training. The core observation is that this is possible because normalization is a simple differentiable operation. In the implementation, applying this technique usually amounts to insert the BatchNorm layer immediately after fully connected layers (or convolutional layers, as we'll soon see), and before non-linearities. We do not expand on this technique here because it is well described in the linked paper, but note that it has become a very common practice to use Batch Normalization in neural networks. In practice networks that use Batch Normalization are significantly more robust to bad initialization. Additionally, batch normalization can be interpreted as doing preprocessing at every layer of the network, but integrated into the network itself in a differentiably manner. Neat! + + + +### Regularization + +There are several ways of controlling the capacity of Neural Networks to prevent overfitting: + +**L2 regularization** is perhaps the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. That is, for every weight $w$ in the network, we add the term $\frac{1}{2} \lambda w^2$ to the objective, where $\lambda$ is the regularization strength. It is common to see the factor of $\frac{1}{2}$ in front because then the gradient of this term with respect to the parameter $w$ is simply $\lambda w$ instead of $2 \lambda w$. 
+
+**Batch Normalization**. A recently developed technique by Ioffe and Szegedy called [Batch Normalization](http://arxiv.org/abs/1502.03167) alleviates a lot of headaches with properly initializing neural networks by explicitly forcing the activations throughout a network to take on a unit gaussian distribution at the beginning of the training. The core observation is that this is possible because normalization is a simple differentiable operation. In the implementation, applying this technique usually amounts to inserting the BatchNorm layer immediately after fully connected layers (or convolutional layers, as we'll soon see), and before non-linearities. We do not expand on this technique here because it is well described in the linked paper, but note that it has become a very common practice to use Batch Normalization in neural networks. In practice networks that use Batch Normalization are significantly more robust to bad initialization. Additionally, batch normalization can be interpreted as doing preprocessing at every layer of the network, but integrated into the network itself in a differentiable manner. Neat!
+
+
+
+### Regularization
+
+There are several ways of controlling the capacity of Neural Networks to prevent overfitting:
+
+**L2 regularization** is perhaps the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. That is, for every weight $w$ in the network, we add the term $\frac{1}{2} \lambda w^2$ to the objective, where $\lambda$ is the regularization strength. It is common to see the factor of $\frac{1}{2}$ in front because then the gradient of this term with respect to the parameter $w$ is simply $\lambda w$ instead of $2 \lambda w$. The L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors. As we discussed in the Linear Classification section, due to multiplicative interactions between weights and inputs this has the appealing property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot. Lastly, notice that during gradient descent parameter update, using the L2 regularization ultimately means that every weight is decayed linearly: `W += -lambda * W` towards zero.
+
+**L1 regularization** is another relatively common form of regularization, where for each weight $w$ we add the term $\lambda \mid w \mid$ to the objective. It is possible to combine the L1 regularization with the L2 regularization: $\lambda_1 \mid w \mid + \lambda_2 w^2$ (this is called [Elastic net regularization](http://web.stanford.edu/~hastie/Papers/B67.2%20%282005%29%20301-320%20Zou%20&%20Hastie.pdf)). The L1 regularization has the intriguing property that it leads the weight vectors to become sparse during optimization (i.e. very close to exactly zero). In other words, neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the "noisy" inputs. In comparison, final weight vectors from L2 regularization are usually diffuse, small numbers. In practice, if you are not concerned with explicit feature selection, L2 regularization can be expected to give superior performance over L1.
+
+**Max norm constraints**. Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint. In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector $\vec{w}$ of every neuron to satisfy $\Vert \vec{w} \Vert_2 < c$. Typical values of $c$ are on orders of 3 or 4. Some people report improvements when using this form of regularization. One of its appealing properties is that the network cannot "explode" even when the learning rates are set too high because the updates are always bounded.
+
+**Dropout** is an extremely effective, simple and recently introduced regularization technique by Srivastava et al. in [Dropout: A Simple Way to Prevent Neural Networks from Overfitting](http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf) (pdf) that complements the other methods (L1, L2, maxnorm). While training, dropout is implemented by only keeping a neuron active with some probability $p$ (a hyperparameter), or setting it to zero otherwise.
+ +
Figure taken from the Dropout paper that illustrates the idea. During training, Dropout can be interpreted as sampling a Neural Network within the full Neural Network, and only updating the parameters of the sampled network based on the input data. (However, the exponential number of possible sampled networks are not independent because they share the parameters.) During testing there is no dropout applied, with the interpretation of evaluating an averaged prediction across the exponentially-sized ensemble of all sub-networks (more about ensembles in the next section).
+
+ +Vanilla dropout in an example 3-layer Neural Network would be implemented as follows: + +~~~python +""" Vanilla Dropout: Not recommended implementation (see notes below) """ + +p = 0.5 # probability of keeping a unit active. higher = less dropout + +def train_step(X): + """ X contains the data """ + + # forward pass for example 3-layer neural network + H1 = np.maximum(0, np.dot(W1, X) + b1) + U1 = np.random.rand(*H1.shape) < p # first dropout mask + H1 *= U1 # drop! + H2 = np.maximum(0, np.dot(W2, H1) + b2) + U2 = np.random.rand(*H2.shape) < p # second dropout mask + H2 *= U2 # drop! + out = np.dot(W3, H2) + b3 + + # backward pass: compute gradients... (not shown) + # perform parameter update... (not shown) + +def predict(X): + # ensembled forward pass + H1 = np.maximum(0, np.dot(W1, X) + b1) * p # NOTE: scale the activations + H2 = np.maximum(0, np.dot(W2, H1) + b2) * p # NOTE: scale the activations + out = np.dot(W3, H2) + b3 +~~~ + +In the code above, inside the `train_step` function we have performed dropout twice: on the first hidden layer and on the second hidden layer. It is also possible to perform dropout right on the input layer, in which case we would also create a binary mask for the input `X`. The backward pass remains unchanged, but of course has to take into account the generated masks `U1,U2`. + +Crucially, note that in the `predict` function we are not dropping anymore, but we are performing a scaling of both hidden layer outputs by $p$. This is important because at test time all neurons see all their inputs, so we want the outputs of neurons at test time to be identical to their expected outputs at training time. For example, in case of $p = 0.5$, the neurons must halve their outputs at test time to have the same output as they had during training time (in expectation). To see this, consider an output of a neuron $x$ (before dropout). With dropout, the expected output from this neuron will become $px + (1-p)0$, because the neuron's output will be set to zero with probability $1-p$. At test time, when we keep the neuron always active, we must adjust $x \rightarrow px$ to keep the same expected output. It can also be shown that performing this attenuation at test time can be related to the process of iterating over all the possible binary masks (and therefore all the exponentially many sub-networks) and computing their ensemble prediction. + +The undesirable property of the scheme presented above is that we must scale the activations by $p$ at test time. Since test-time performance is so critical, it is always preferable to use **inverted dropout**, which performs the scaling at train time, leaving the forward pass at test time untouched. Additionally, this has the appealing property that the prediction code can remain untouched when you decide to tweak where you apply dropout, or if at all. Inverted dropout looks as follows: + +~~~python +""" +Inverted Dropout: Recommended implementation example. +We drop and scale at train time and don't do anything at test time. +""" + +p = 0.5 # probability of keeping a unit active. higher = less dropout + +def train_step(X): + # forward pass for example 3-layer neural network + H1 = np.maximum(0, np.dot(W1, X) + b1) + U1 = (np.random.rand(*H1.shape) < p) / p # first dropout mask. Notice /p! + H1 *= U1 # drop! + H2 = np.maximum(0, np.dot(W2, H1) + b2) + U2 = (np.random.rand(*H2.shape) < p) / p # second dropout mask. Notice /p! + H2 *= U2 # drop! + out = np.dot(W3, H2) + b3 + + # backward pass: compute gradients... 
(not shown)
  # perform parameter update... (not shown)

def predict(X):
  # ensembled forward pass
  H1 = np.maximum(0, np.dot(W1, X) + b1) # no scaling necessary
  H2 = np.maximum(0, np.dot(W2, H1) + b2)
  out = np.dot(W3, H2) + b3
~~~

There has been a large amount of research after the first introduction of dropout that tries to understand the source of its power in practice, and its relation to the other regularization techniques. Recommended further reading for an interested reader includes:

- [Dropout paper](http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf) by Srivastava et al. 2014.
- [Dropout Training as Adaptive Regularization](http://papers.nips.cc/paper/4882-dropout-training-as-adaptive-regularization.pdf): "we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix".

**Theme of noise in forward pass**. Dropout falls into a more general category of methods that introduce stochastic behavior in the forward pass of the network. During testing, the noise is marginalized over *analytically* (as is the case with dropout when multiplying by $p$), or *numerically* (e.g. via sampling, by performing several forward passes with different random decisions and then averaging over them; see the sketch below). An example of other research in this direction includes [DropConnect](http://cs.nyu.edu/~wanli/dropc/), where a random set of weights is instead set to zero during forward pass. As foreshadowing, Convolutional Neural Networks also take advantage of this theme with methods such as stochastic pooling, fractional pooling, and data augmentation. We will go into details of these methods later.

**Bias regularization**. As we already mentioned in the Linear Classification section, it is not common to regularize the bias parameters because they do not interact with the data through multiplicative interactions, and therefore do not have the interpretation of controlling the influence of a data dimension on the final objective. However, in practical applications (and with proper data preprocessing) regularizing the bias rarely leads to significantly worse performance. This is likely because there are very few bias terms compared to all the weights, so the classifier can "afford to" use the biases if it needs them to obtain a better data loss.

**Per-layer regularization**. It is not very common to regularize different layers to different amounts (except perhaps the output layer). Relatively few results regarding this idea have been published in the literature.

**In practice**: It is most common to use a single, global L2 regularization strength that is cross-validated. It is also common to combine this with dropout applied after all layers. The value of $p = 0.5$ is a reasonable default, but this can be tuned on validation data.
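As a minimal sketch of that *numerical* marginalization (reusing the hypothetical `W1,b1,W2,b2,W3,b3` parameters and keep-probability `p` from the vanilla dropout example above), one could average several stochastic forward passes instead of scaling by $p$:

~~~python
def predict_averaged(X, num_samples=10):
  # numerically marginalize the dropout noise: average the predictions of
  # several randomly sampled sub-networks (no scaling by p is needed here)
  out_sum = 0.0
  for _ in xrange(num_samples):
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    H1 *= np.random.rand(*H1.shape) < p  # resample a dropout mask
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    H2 *= np.random.rand(*H2.shape) < p
    out_sum += np.dot(W3, H2) + b3
  return out_sum / num_samples
~~~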


### Loss functions

We have discussed the regularization loss part of the objective, which can be seen as penalizing some measure of complexity of the model. The second part of an objective is the *data loss*, which in a supervised learning problem measures the compatibility between a prediction (e.g. the class scores in classification) and the ground truth label. The data loss takes the form of an average over the data losses for every individual example. That is, $L = \frac{1}{N} \sum_i L_i$ where $N$ is the number of training data. Let's abbreviate $f = f(x_i; W)$ to be the activations of the output layer in a Neural Network. There are several types of problems you might want to solve in practice:

**Classification** is the case that we have so far discussed at length. Here, we assume a dataset of examples and a single correct label (out of a fixed set) for each example. The first of the two most commonly seen cost functions in this setting is the SVM (e.g. the Weston Watkins formulation):

$$
L_i = \sum_{j\neq y_i} \max(0, f_j - f_{y_i} + 1)
$$

As we briefly alluded to, some people report better performance with the squared hinge loss (i.e. instead using $\max(0, f_j - f_{y_i} + 1)^2$). The second common choice is the Softmax classifier that uses the cross-entropy loss:

$$
L_i = -\log\left(\frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }\right)
$$

**Problem: Large number of classes**. When the set of labels is very large (e.g. words in English dictionary, or ImageNet which contains 22,000 categories), it may be helpful to use *Hierarchical Softmax* (see one explanation [here](http://arxiv.org/pdf/1310.4546.pdf) (pdf)). The hierarchical softmax decomposes labels into a tree. Each label is then represented as a path along the tree, and a Softmax classifier is trained at every node of the tree to disambiguate between the left and right branch. The structure of the tree strongly impacts the performance and is generally problem-dependent.

**Attribute classification**. Both losses above assume that there is a single correct answer $y_i$. But what if $y_i$ is a binary vector where every example may or may not have a certain attribute, and where the attributes are not exclusive? For example, images on Instagram can be thought of as labeled with a certain subset of hashtags from a large set of all hashtags, and an image may contain multiple. A sensible approach in this case is to build a binary classifier for every single attribute independently. For example, a binary classifier for each category independently would take the form:

$$
L_i = \sum_j \max(0, 1 - y_{ij} f_j)
$$

where the sum is over all categories $j$, and $y_{ij}$ is either +1 or -1 depending on whether the i-th example is labeled with the j-th attribute, and the score vector $f_j$ will be positive when the class is predicted to be present and negative otherwise. Notice that the loss is accumulated if a positive example has score less than +1, or when a negative example has score greater than -1.

An alternative to this loss would be to train a logistic regression classifier for every attribute independently. A binary logistic regression classifier has only two classes (0,1), and calculates the probability of class 1 as:

$$
P(y = 1 \mid x; w, b) = \frac{1}{1 + e^{-(w^Tx +b)}} = \sigma (w^Tx + b)
$$

Since the probabilities of class 1 and 0 sum to one, the probability for class 0 is $P(y = 0 \mid x; w, b) = 1 - P(y = 1 \mid x; w,b)$. Hence, an example is classified as a positive example (y = 1) if $\sigma (w^Tx + b) > 0.5$, or equivalently if the score $w^Tx +b > 0$. The loss function then maximizes the log likelihood of this probability. You can convince yourself that this simplifies to:

$$
L_i = \sum_j y_{ij} \log(\sigma(f_j)) + (1 - y_{ij}) \log(1 - \sigma(f_j))
$$

where the labels $y_{ij}$ are assumed to be either 1 (positive) or 0 (negative), and $\sigma(\cdot)$ is the sigmoid function. The expression above can look scary but the gradient on $f$ is in fact extremely simple and intuitive: $\partial{L_i} / \partial{f_j} = y_{ij} - \sigma(f_j)$ (as you can double check yourself by taking the derivatives; a small numerical check is also sketched below).
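Here is a minimal sketch of that check, with made-up scores and labels, comparing the analytic expression against a centered-difference numerical gradient of the log likelihood above:

~~~python
import numpy as np

def sigmoid(x):
  return 1.0 / (1.0 + np.exp(-x))

f = np.array([2.0, -1.0, 0.5])  # hypothetical attribute scores
y = np.array([1.0, 0.0, 1.0])   # hypothetical binary labels

def L(f):
  s = sigmoid(f)
  return np.sum(y * np.log(s) + (1 - y) * np.log(1 - s))

grad_analytic = y - sigmoid(f)  # the expression derived above

h = 1e-5
grad_numerical = np.zeros_like(f)
for j in xrange(f.size):
  fp, fm = f.copy(), f.copy()
  fp[j] += h
  fm[j] -= h
  grad_numerical[j] = (L(fp) - L(fm)) / (2 * h)

print grad_analytic
print grad_numerical  # should agree to several decimal places
~~~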
+ +**Regression** is the task of predicting real-valued quantities, such as the price of houses or the length of something in an image. For this task, it is common to compute the loss between the predicted quantity and the true answer and then measure the L2 squared norm, or L1 norm of the difference. The L2 norm squared would compute the loss for a single example of the form: + +$$ +L_i = \Vert f - y_i \Vert_2^2 +$$ + +The reason the L2 norm is squared in the objective is that the gradient becomes much simpler, without changing the optimal parameters since squaring is a monotonic operation. The L1 norm would be formulated by summing the absolute value along each dimension: + +$$ +L_i = \Vert f - y_i \Vert_1 = \sum_j \mid f_j - (y_i)_j \mid +$$ + +where the sum $\sum_j$ is a sum over all dimensions of the desired prediction, if there is more than one quantity being predicted. Looking at only the j-th dimension of the i-th example and denoting the difference between the true and the predicted value by $\delta_{ij}$, the gradient for this dimension (i.e. $\partial{L_i} / \partial{f_j}$) is easily derived to be either $\delta_{ij}$ with the L2 norm, or $sign(\delta_{ij})$. That is, the gradient on the score will either be directly proportional to the difference in the error, or it will be fixed and only inherit the sign of the difference. + +*Word of caution*: It is important to note that the L2 loss is much harder to optimize than a more stable loss such as Softmax. Intuitively, it requires a very fragile and specific property from the network to output exactly one correct value for each input (and its augmentations). Notice that this is not the case with Softmax, where the precise value of each score is less important: It only matters that their magnitudes are appropriate. Additionally, the L2 loss is less robust because outliers can introduce huge gradients. When faced with a regression problem, first consider if it is absolutely inadequate to quantize the output into bins. For example, if you are predicting star rating for a product, it might work much better to use 5 independent classifiers for ratings of 1-5 stars instead of a regression loss. Classification has the additional benefit that it can give you a distribution over the regression outputs, not just a single output with no indication of its confidence. If you're certain that classification is not appropriate, use the L2 but be careful: For example, the L2 is more fragile and applying dropout in the network (especially in the layer right before the L2 loss) is not a great idea. + +> When faced with a regression task, first consider if it is absolutely necessary. Instead, have a strong preference to discretizing your outputs to bins and perform classification over them whenever possible. + +**Structured prediction**. The structured loss refers to a case where the labels can be arbitrary structures such as graphs, trees, or other complex objects. Usually it is also assumed that the space of structures is very large and not easily enumerable. The basic idea behind the structured SVM loss is to demand a margin between the correct structure $y_i$ and the highest-scoring incorrect structure. It is not common to solve this problem as a simple unconstrained optimization problem with gradient descent. Instead, special solvers are usually devised so that the specific simplifying assumptions of the structure space can be taken advantage of. We mention the problem briefly but consider the specifics to be outside of the scope of the class. 
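To make the L2/L1 regression gradients discussed above concrete, here is a tiny sketch with made-up predictions and targets:

~~~python
import numpy as np

f = np.array([2.5, 0.3])  # hypothetical predicted values
y = np.array([2.0, 1.0])  # hypothetical true targets
delta = f - y

L2_loss = np.sum(delta ** 2)
dL2_df = 2 * delta       # proportional to the error (the constant factor of 2
                         # is often folded into the learning rate)

L1_loss = np.sum(np.abs(delta))
dL1_df = np.sign(delta)  # fixed magnitude, only inherits the sign of the error
~~~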
+
+
+## Summary
+
+In summary:
+
+- The recommended preprocessing is to center the data to have mean of zero, and normalize its scale to [-1, 1] along each feature
+- Initialize the weights by drawing them from a gaussian distribution with standard deviation of $\sqrt{2/n}$, where $n$ is the number of inputs to the neuron. E.g. in numpy: `w = np.random.randn(n) * sqrt(2.0/n)`.
+- Use L2 regularization and dropout (the inverted version)
+- Use batch normalization
+- We discussed different tasks you might want to perform in practice, and the most common loss functions for each task
+
+We've now preprocessed the data and set up and initialized the model. In the next section we will look at the learning process and its dynamics.

From 2693f6477e25bbfab5b15a367909761c261113cd Mon Sep 17 00:00:00 2001
From: OkminLee
Date: Sat, 16 Apr 2016 10:20:47 +0900
Subject: [PATCH 061/199] Update classification.md

---
 classification.md | 31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/classification.md b/classification.md
index d929a9e3..2cf97c87 100644
--- a/classification.md
+++ b/classification.md
@@ -4,7 +4,7 @@ mathjax: true
 permalink: /classification/
 ---
 
-본 강의노트는 컴퓨터비전 외의 분야를 공부하던 사람들에게 영상 분류 (Image Classification) 문제와, 데이터 기반 방법론(data-driven approach)을 소개하고자 함이다. 목차는 다음과 같다.
+본 강의노트는 컴퓨터비전 외의 분야를 공부하던 사람들에게 Image Classification(이미지 분류) 문제와 data-driven approach(데이터 기반 방법론)를 소개한다. 목차는 다음과 같다.
 
 - [영상 분류, 데이터 기반 방법론, 파이프라인](#intro)
 - [Nearest Neighbor 분류기](#nn)
@@ -16,36 +16,37 @@ permalink: /classification/
 - [읽을 자료](#reading)
 
 
-## 영상 분류
 
+## Image Classification(이미지 분류)
 
-**동기**. 이 섹션에서는 영상 분류 문제에 대해 다룰 것이다. 영상 분류 문제란, 입력 이미지를 미리 정해진 카테고리 중 하나로 분류하는 문제로, 문제 정의는 매우 간단하지만 다양한 활용 가능성이 있는 컴퓨터 비전 분야의 핵심적인 문제 중의 하나이다. 강의의 나중 파트에서도 살펴보겠지만, 영상 분류와 멀어보이는 다른 컴퓨터 비전 분야의 여러 문제들 (물체 검출, 영상 분할 등)이 영상 분류 문제를 푸는 것으로 인해 해결될 수 있다.
+**동기**. 이 섹션에서는 이미지 분류 문제에 대해 다룰 것이다. 이미지 분류 문제란, 입력 이미지를 미리 정해진 카테고리 중 하나의 라벨로 분류하는 문제다. 문제 정의는 매우 간단하지만 다양한 활용 가능성이 있는 컴퓨터 비전 분야의 핵심적인 문제 중 하나이다. 강의의 나중 파트에서도 살펴보겠지만, 이미지 분류와 멀어 보이는 다른 컴퓨터 비전 분야의 여러 문제들 (물체 검출, 영상 분할 등)이 이미지 분류 문제를 푸는 것으로 귀결되어 해결될 수 있다.
+
+**예시**. 예를 들어, 아래 그림의 이미지 분류 모델은 하나의 이미지를 입력받아 4개의 분류 가능한 라벨 *{cat, dog, hat, mug}* 각각에 확률을 부여한다. 그림에서 보다시피, 컴퓨터에서 이미지는 3차원 배열로 표현된다. 이 예시에서 고양이 이미지는 가로 248픽셀(모니터의 화면을 구성하는 최소 단위, 역자 주), 세로 400픽셀로 구성되어 있고 3개의 색상 채널이 있는데 각각 Red, Green, Blue(RGB)로 불린다. 따라서 이 이미지는 248 x 400 x 3개(총 297,600개)의 숫자로 구성되어 있다. 각 숫자는 0~255 범위의 정수값이다. 이미지 분류 문제는 이 수많은 값들을 *"cat"* 이라는 하나의 라벨로 변경하는 것이다.
The task in Image Classification is to predict a single label (or a distribution over labels as shown here to indicate our confidence) for a given image. Images are 3-dimensional arrays of integers from 0 to 255, of size Width x Height x 3. The 3 represents the three color channels Red, Green, Blue.
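위 설명을 numpy 배열로 간단히 흉내 내 보면 다음과 같다. 실제 이미지를 읽어오는 대신 난수 배열로 모양만 가정한 예시이다:

~~~python
import numpy as np

# 가상의 고양이 이미지: 세로 400, 가로 248, 3개의 색상 채널(RGB)을 가정
img = np.random.randint(0, 256, (400, 248, 3)).astype(np.uint8)

print img.shape  # (400, 248, 3)
print img.size   # 297600 -- 분류기가 하나의 라벨로 바꿔야 하는 숫자의 개수
print img[0, 0]  # 좌상단 픽셀의 [R, G, B] 값 (각각 0~255)
~~~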
-**Challenges**. Since this task of recognizing a visual concept (e.g. cat) is relatively trivial for a human to perform, it is worth considering the challenges involved from the perspective of a Computer Vision algorithm. As we present (an inexhaustive) list of challenges below, keep in mind the raw representation of images as a 3-D array of brightness values:
+**문제**. 이미지를 분류하는 일(예를 들어 *"cat"*)이 사람에게는 대수롭지 않겠지만, 컴퓨터 비전 알고리즘의 관점에서 생각해 보면 해결해야 하는 문제들이 있다. 아래에 (전부는 아니지만) 해결해야 하는 문제들을 살펴보는 동안, 이미지가 3차원 배열의 값으로 표현된다는 것을 염두에 두어야 한다.

- **Viewpoint variation(시점 변화)**. 객체의 단일 인스턴스도 카메라에 따라 다양한 시점에서 보일 수 있다.
- **Scale variation(크기 변화)**. 같은 비주얼 클래스에 속하는 물체라도 크기가 제각각인 경우가 많다(이미지 상에서의 크기뿐만 아니라 실제 세계에서의 크기까지 포함함).
- **Deformation(변형)**. 많은 객체들은 고정된 형태가 없고, 극단적인 형태로 변형될 수 있다.
- **Occlusion(폐색)**. 객체들은 전체가 보이지 않을 수 있다. 때로는 물체의 매우 적은 부분(매우 적은 픽셀)만 보인다.
- **Background clutter(배경 혼잡)**. 객체가 주변 환경에 섞여(*blend*) 알아보기 힘들게 된다.
- **Illumination conditions(조명 상태)**. 조명의 영향으로 픽셀 값이 크게 변한다.
- **Intra-class variation(클래스 내 다양성)**. 분류해야 할 클래스는 범위가 큰 것들이 많다. 예를 들어 *의자* 의 경우, 매우 다양한 형태의 객체가 있다.

-A good image classification model must be invariant to the cross product of all these variations, while simultaneously retaining sensitivity to the inter-class variations.
-**Data-driven approach**. How might we go about writing an algorithm that can classify images into distinct categories? Unlike writing an algorithm for, for example, sorting a list of numbers, it is not obvious how one might write an algorithm for identifying cats in images. Therefore, instead of trying to specify what every one of the categories of interest look like directly in code, the approach that we will take is not unlike one you would take with a child: we're going to provide the computer with many examples of each class and then develop learning algorithms that look at these examples and learn about the visual appearance of each class. This approach is referred to as a *data-driven approach*, since it relies on first accumulating a *training dataset* of labeled images. Here is an example of what such a dataset might look like: +**Data-driven approach(데이터 기반 방법론)**. 어떻게 하면 이미지를 각각의 카테고리로 분류하는 알고리즘을 작성할 수 있을까? 숫자를 정렬하는 알고리즘 작성과는 달리 고양이를 분별하는 알고리즘을 작성하는 것은 어렵다. +How might we go about writing an algorithm that can classify images into distinct categories? Unlike writing an algorithm for, for example, sorting a list of numbers, it is not obvious how one might write an algorithm for identifying cats in images. Therefore, instead of trying to specify what every one of the categories of interest look like directly in code, the approach that we will take is not unlike one you would take with a child: we're going to provide the computer with many examples of each class and then develop learning algorithms that look at these examples and learn about the visual appearance of each class. This approach is referred to as a *data-driven approach*, since it relies on first accumulating a *training dataset* of labeled images. Here is an example of what such a dataset might look like:
From 1fdd9da854b4b44eaa0056b496b901d1cd52b2c5 Mon Sep 17 00:00:00 2001 From: Taeksoo Kim Date: Sat, 16 Apr 2016 13:52:54 +0900 Subject: [PATCH 062/199] Update convolutional-networks-korean.md --- convolutional-networks-korean.md | 47 ++++++++++++++++---------------- 1 file changed, 24 insertions(+), 23 deletions(-) diff --git a/convolutional-networks-korean.md b/convolutional-networks-korean.md index 5ebec9c3..dbb95fa5 100644 --- a/convolutional-networks-korean.md +++ b/convolutional-networks-korean.md @@ -183,51 +183,52 @@ Numpy에서 `*`연산은 두 배열 간의 elementwise 곱셈이라는 것을
-**매트릭스 곱으로 구현**. 컨볼루션 연산은 필터와 이미지의 로컬한 영역간의 내적 연산을 한 것과 같다. 컨볼루션 레이어의 일반적인 구현 패턴은 이 점을 이용해 컨볼루션 레이어의 forward pass를 다음과 같이 하나의 큰 매트릭스 곱으로
-**Implementation as Matrix Multiplication**. Note that the convolution operation essentially performs dot products between the filters and local regions of the input. A common implementation pattern of the CONV layer is to take advantage of this fact and formulate the forward pass of a convolutional layer as one big matrix multiply as follows:
+**매트릭스 곱으로 구현**. 컨볼루션 연산은 필터와 이미지의 로컬한 영역 간의 내적 연산을 한 것과 같다. 컨볼루션 레이어의 일반적인 구현 패턴은 이 점을 이용해 컨볼루션 레이어의 forward pass를 다음과 같이 하나의 큰 매트릭스 곱으로 계산하는 것이다:

1. 이미지의 각 로컬 영역을 열 벡터로 stretch 한다 (이런 연산을 보통 **im2col** 이라고 부름). 예를 들어, 만약 [227x227x3] 사이즈의 입력이 11x11x3 사이즈, stride 4의 필터와 컨볼루션 한다면, 이미지에서 [11x11x3] 크기의 픽셀 블록을 가져와 11\*11\*3=363 크기의 열 벡터로 바꾸게 된다. 이 과정을 stride 4마다 하므로 가로, 세로에 대해 각각 (227-11)/4+1=55, 총 55\*55=3025 개 영역에 대해 반복하게 되고, 출력물인 `X_col`은 [363x3025]의 사이즈를 갖게 된다. 각각의 열 벡터는 리셉티브 필드를 1차원으로 stretch 한 것이고, 이 리셉티브 필드는 주위 리셉티브 필드들과 겹치므로 입력 볼륨의 여러 값들이 여러 출력 열 벡터에 중복되어 나타날 수 있다.
2. 컨볼루션 레이어의 가중치는 비슷한 방식으로 행 벡터 형태로 stretch된다. 예를 들어 [11x11x3] 사이즈의 총 96개 필터가 있다면, [96x363] 사이즈의 `W_row`가 만들어진다.
3. 이제 컨볼루션 연산은 하나의 큰 매트릭스 곱 `np.dot(W_row, X_col)`를 계산하는 것과 같다. 이 연산은 모든 필터와 모든 리셉티브 필드 위치들 사이의 내적 연산을 하는 것과 같다. 우리의 예에서는 각 위치에 각 필터를 적용한 [96x3025] 사이즈의 출력물이 얻어진다.
4. 결과물은 [55x55x96] 차원으로 reshape 한다.

이 방식은 입력 볼륨의 여러 값들이 `X_col`에 여러 번 복사되기 때문에 메모리가 많이 사용된다는 단점이 있다. 그러나 매트릭스 연산과 관련된 많은 효율적 구현방식들을 사용할 수 있다는 장점도 있다 ([BLAS](http://www.netlib.org/blas/) API 가 하나의 예임). 뿐만 아니라 같은 *im2col* 아이디어는 풀링 연산에서 재활용 할 수도 있다 (뒤에서 다루게 된다).
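아래는 위 네 단계를 그대로 따라가 보는 간단한 numpy 스케치이다. 제로 패딩은 생략했고, 함수 이름과 배열 크기는 설명을 위해 가정한 것이다:

~~~python
import numpy as np

def conv_forward_im2col(X, W, stride):
  # X: (C, H, W_in) 입력 볼륨, W: (F, C, HH, WW) 필터들 (제로 패딩은 생략 가정)
  C, H, W_in = X.shape
  F, _, HH, WW = W.shape
  out_h = (H - HH) / stride + 1
  out_w = (W_in - WW) / stride + 1

  # 1. im2col: 각 리셉티브 필드를 열 벡터로 stretch
  cols = []
  for i in xrange(out_h):
    for j in xrange(out_w):
      patch = X[:, i*stride:i*stride+HH, j*stride:j*stride+WW]
      cols.append(patch.ravel())
  X_col = np.array(cols).T      # [C*HH*WW x out_h*out_w]

  # 2. 필터들을 행 벡터로 stretch
  W_row = W.reshape(F, -1)      # [F x C*HH*WW]

  # 3. 하나의 큰 매트릭스 곱
  out = np.dot(W_row, X_col)    # [F x out_h*out_w]

  # 4. 적절한 출력 차원으로 reshape
  return out.reshape(F, out_h, out_w)
~~~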
**Backpropagation.** 컨볼루션 연산의 backward pass 역시 컨볼루션 연산이다 (가로/세로가 뒤집어진 필터를 사용한다는 차이점이 있음). 간단한 1차원 예제를 가지고 쉽게 확인해볼 수 있다.


#### 풀링 레이어 (Pooling Layer)

CNN 구조 내에 컨볼루션 레이어들 중간중간에 주기적으로 풀링 레이어를 넣는 것이 일반적이다. 풀링 레이어가 하는 일은 네트워크의 파라미터의 개수나 연산량을 줄이기 위해 representation의 spatial한 사이즈를 줄이는 것이다. 이는 오버피팅을 조절하는 효과도 가지고 있다. 풀링 레이어는 MAX 연산을 각 depth slice에 대해 독립적으로 적용하여 spatial한 크기를 줄인다. 사이즈 2x2와 stride 2가 가장 많이 사용되는 풀링 레이어이다. 각 depth slice를 가로/세로축을 따라 1/2로 downsampling해 75%의 액티베이션을 버리게 된다. 이 경우 MAX 연산은 4개 숫자 중 최대값을 선택하게 된다 (같은 depth slice 내의 2x2 영역). Depth 차원은 변하지 않는다. 풀링 레이어의 특징들은 일반적으로 아래와 같다:

- Accepts a volume of size $$W_1 \times H_1 \times D_1$$
- Requires three hyperparameters:
  - their spatial extent $$F$$,
  - the stride $$S$$,
- Produces a volume of size $$W_2 \times H_2 \times D_2$$ where:
+- $$W_1 \times H_1 \times D_1$$ 사이즈의 입력을 받는다
+- 2가지 hyperparameter를 필요로 한다.
+  - Spatial extent $$F$$
+  - Stride $$S$$
+- $$W_2 \times H_2 \times D_2$$ 사이즈의 볼륨을 만든다
  - $$W_2 = (W_1 - F)/S + 1$$
  - $$H_2 = (H_1 - F)/S + 1$$
  - $$D_2 = D_1$$
-- Introduces zero parameters since it computes a fixed function of the input
-- Note that it is not common to use zero-padding for Pooling layers
+- 입력에 대해 항상 같은 연산을 하므로 파라미터는 따로 존재하지 않는다
+- 풀링 레이어에는 보통 제로 패딩을 하지 않는다

일반적으로 실전에서는 두 종류의 max 풀링 레이어만 널리 쓰인다. 하나는 overlapping 풀링이라고도 불리는 $$F = 3, S = 2$$ 이고 하나는 더 자주 쓰이는 $$F = 2, S = 2$$ 이다. 큰 리셉티브 필드에 대해서 풀링을 하면 보통 너무 많은 정보를 버리게 된다.

**일반적인 풀링**. Max 풀링 뿐 아니라 *average 풀링*, *L2-norm 풀링* 등 다른 연산으로 풀링할 수도 있다. Average 풀링은 과거에 많이 쓰였으나 최근에는 Max 풀링이 더 좋은 성능을 보이며 점차 쓰이지 않고 있다.
- Pooling layer downsamples the volume spatially, independently in each depth slice of the input volume. Left: In this example, the input volume of size [224x224x64] is pooled with filter size 2, stride 2 into output volume of size [112x112x64]. Notice that the volume depth is preserved. Right: The most common downsampling operation is max, giving rise to max pooling, here shown with a stride of 2. That is, each max is taken over 4 numbers (little 2x2 square). + 풀링 레이어는 입력 볼륨의 각 depth slice를 spatial하게 downsampling한다. 좌: 이 예제에서는 입력 볼륨이 [224x224x64]이며 필터 크기 2, stride 2로 풀링해 [112x112x64] 크기의 출력 볼륨을 만든다. 볼륨의 depth는 그대로 유지된다는 것을 기억하자. Right: 가장 널리 쓰이는 max 풀링. 2x2의 4개 숫자에 대해 max를 취하게된다.
-**Backpropagation**. Recall from the backpropagation chapter that the backward pass for a max(x, y) operation has a simple interpretation as only routing the gradient to the input that had the highest value in the forward pass. Hence, during the forward pass of a pooling layer it is common to keep track of the index of the max activation (sometimes also called *the switches*) so that gradient routing is efficient during backpropagation. +**Backpropagation**. Backpropagation 챕터에서 max(x,y)의 backward pass는 그냥 forward pass에서 가장 큰 값을 가졌던 입력의 gradient를 보내는 것과 같다고 배운 것을 기억하자. 그러므로 forward pass 과정에서 보통 max 액티베이션의 위치를 저장해두었다가 backpropagation 때 사용한다. -**Recent developments**. +**최근의 발전된 내용들**. -- [Fractional Max-Pooling](http://arxiv.org/abs/1412.6071) suggests a method for performing the pooling operation with filters smaller than 2x2. This is done by randomly generating pooling regions with a combination of 1x1, 1x2, 2x1 or 2x2 filters to tile the input activation map. The grids are generated randomly on each forward pass, and at test time the predictions can be averaged across several grids. +- [Fractional Max-Pooling](http://arxiv.org/abs/1412.6071) 2x2보다 더 작은 필터들로 풀링하는 방식. 1x1, 1x2, 2x1, 2x2 크기의 +- +- suggests a method for performing the pooling operation with filters smaller than 2x2. This is done by randomly generating pooling regions with a combination of 1x1, 1x2, 2x1 or 2x2 filters to tile the input activation map. The grids are generated randomly on each forward pass, and at test time the predictions can be averaged across several grids. - [Striving for Simplicity: The All Convolutional Net](http://arxiv.org/abs/1412.6806) proposes to discard the pooling layer in favor of architecture that only consists of repeated CONV layers. To reduce the size of the representation they suggest using larger stride in CONV layer once in a while. Due to the aggressive reduction in the size of the representation (which is helpful only for smaller datasets to control overfitting), the trend in the literature is towards discarding the pooling layer in modern ConvNets. From cc76fdaa0de5d494797d7f35893ca4ca475f14f5 Mon Sep 17 00:00:00 2001 From: Taeksoo Kim Date: Sat, 16 Apr 2016 14:27:22 +0900 Subject: [PATCH 063/199] Update convolutional-networks-korean.md --- convolutional-networks-korean.md | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/convolutional-networks-korean.md b/convolutional-networks-korean.md index dbb95fa5..ef181614 100644 --- a/convolutional-networks-korean.md +++ b/convolutional-networks-korean.md @@ -226,24 +226,22 @@ CNN 구조 내에 컨볼루션 레이어들 중간중간에 주기적으로 풀 **최근의 발전된 내용들**. -- [Fractional Max-Pooling](http://arxiv.org/abs/1412.6071) 2x2보다 더 작은 필터들로 풀링하는 방식. 1x1, 1x2, 2x1, 2x2 크기의 -- -- suggests a method for performing the pooling operation with filters smaller than 2x2. This is done by randomly generating pooling regions with a combination of 1x1, 1x2, 2x1 or 2x2 filters to tile the input activation map. The grids are generated randomly on each forward pass, and at test time the predictions can be averaged across several grids. -- [Striving for Simplicity: The All Convolutional Net](http://arxiv.org/abs/1412.6806) proposes to discard the pooling layer in favor of architecture that only consists of repeated CONV layers. To reduce the size of the representation they suggest using larger stride in CONV layer once in a while. +- [Fractional Max-Pooling](http://arxiv.org/abs/1412.6071) 2x2보다 더 작은 필터들로 풀링하는 방식. 1x1, 1x2, 2x1, 2x2 크기의 필터들을 임의로 조합해 풀링한다. 
매 forward pass마다 grid들이 랜덤하게 생성되고, 테스트 때에는 여러 grid들의 예측 점수들의 평균치를 사용하게 된다.
+- [Striving for Simplicity: The All Convolutional Net](http://arxiv.org/abs/1412.6806) 라는 논문은 컨볼루션 레이어만 반복하며 풀링 레이어를 사용하지 않는 방식을 제안한다. Representation의 크기를 줄이기 위해 가끔씩 큰 stride를 가진 컨볼루션 레이어를 사용한다.

-Due to the aggressive reduction in the size of the representation (which is helpful only for smaller datasets to control overfitting), the trend in the literature is towards discarding the pooling layer in modern ConvNets.
+풀링 레이어가 보통 representation의 크기를 심하게 줄이기 때문에 (이런 효과는 작은 데이터셋에서만 오버피팅 방지 효과 등으로 인해 도움이 됨), 최근 추세는 점점 풀링 레이어를 사용하지 않는 쪽으로 발전하고 있다.


-#### Normalization Layer
+#### Normalization 레이어

-Many types of normalization layers have been proposed for use in ConvNet architectures, sometimes with the intentions of implementing inhibition schemes observed in the biological brain. However, these layers have recently fallen out of favor because in practice their contribution has been shown to be minimal, if any. For various types of normalizations, see the discussion in Alex Krizhevsky's [cuda-convnet library API](http://code.google.com/p/cuda-convnet/wiki/LayerParams#Local_response_normalization_layer_(same_map)).
+실제 두뇌의 억제 메커니즘 구현 등을 위해 많은 종류의 normalization 레이어들이 제안되었다. 그러나 이런 레이어들이 실제로 주는 효과가 별로 없다는 것이 알려지면서 최근에는 거의 사용되지 않고 있다. Normalization에 대해 알고 싶다면 Alex Krizhevsky의 글을 읽어보기 바란다 [cuda-convnet library API](http://code.google.com/p/cuda-convnet/wiki/LayerParams#Local_response_normalization_layer_(same_map)).


-#### Fully-connected layer
+#### Fully-connected 레이어

Neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks. Their activations can hence be computed with a matrix multiplication followed by a bias offset. See the *Neural Network* section of the notes for more information.

<a name='convert'></a>
#### Converting FC layers to CONV layers

It is worth noting that the only difference between FC and CONV layers is that the neurons in the CONV layer are connected only to a local region in the input, and that many of the neurons in a CONV volume share parameters. However, the neurons in both layers still compute dot products, so their functional form is identical.
Therefore, it turns out that it's possible to convert between FC and CONV layers:

From 8e5503ee2dd440efee6a67c6dbf960ed9ebac73b Mon Sep 17 00:00:00 2001
From: ygchoistat
Date: Sat, 16 Apr 2016 19:54:04 +0900
Subject: [PATCH 064/199] Update neural-networks-3.md

---
 neural-networks-3.md | 48 +++++++++++++++++++++++---------------------
 1 file changed, 25 insertions(+), 23 deletions(-)

diff --git a/neural-networks-3.md b/neural-networks-3.md
index 9ad74d6f..7c4bb524 100644
--- a/neural-networks-3.md
+++ b/neural-networks-3.md
@@ -5,47 +5,49 @@ permalink: /neural-networks-3/
 
 Table of Contents:
 
-- [Gradient checks](#gradcheck)
+- [그라디언트 점검 (Gradient checks)](#gradcheck)
 - [Sanity checks](#sanitycheck)
-- [Babysitting the learning process](#baby)
-  - [Loss function](#loss)
-  - [Train/val accuracy](#accuracy)
-  - [Weights:Updates ratio](#ratio)
-  - [Activation/Gradient distributions per layer](#distr)
-  - [Visualization](#vis)
-- [Parameter updates](#update)
-  - [First-order (SGD), momentum, Nesterov momentum](#sgd)
-  - [Annealing the learning rate](#anneal)
-  - [Second-order methods](#second)
-  - [Per-parameter adaptive learning rates (Adagrad, RMSProp)](#ada)
-- [Hyperparameter Optimization](#hyper)
-- [Evaluation](#eval)
-  - [Model Ensembles](#ensemble)
-- [Summary](#summary)
-- [Additional References](#add)
+- [학습 과정 돌보기 (Babysitting the learning process)](#baby)
+  - [손실 함수 (Loss function)](#loss)
+  - [훈련/검증 성능 (Train/val accuracy)](#accuracy)
+  - [웨이트의 현재값과 변화량의 비율 (Weights:Updates ratio)](#ratio)
+  - [레이어별 활성값 및 그라디언트값의 분포 (Activation/Gradient distributions per layer)](#distr)
+  - [시각화 (Visualization)](#vis)
+- [파라미터 업데이트 (Parameter updates)](#update)
+  - [일차 근사 방법 (SGD) (First-order (SGD)), 모멘텀 (momentum), Nesterov 모멘텀 (Nesterov momentum)](#sgd)
+  - [학습 속도를 담금질하기 (Annealing the learning rate)](#anneal)
+  - [이차 근사 방법 (Second-order methods)](#second)
+  - [파라미터별로 학습 속도를 데이터가 판단하게 하기 (Per-parameter adaptive learning rates (Adagrad, RMSProp))](#ada)
+- [초-파라미터 최적화 (Hyperparameter Optimization)](#hyper)
+- [평가 (Evaluation)](#eval)
+  - [모형 앙상블 (Model Ensembles)](#ensemble)
+- [요약](#summary)
+- [추가적인 참고 문헌](#add)
 
 
 ## Learning
 
-In the previous sections we've discussed the static parts of a Neural Networks: how we can set up the network connectivity, the data, and the loss function. This section is devoted to the dynamics, or in other words, the process of learning the parameters and finding good hyperparameters.
+이전 섹션들에서는 레이어를 몇 층 쌓고 레이어별로 몇 개의 유닛을 준비할지(network connectivity), 데이터를 어떻게 준비하고 어떤 손실 함수(loss function)를 선택할지 논하였다. 말하자면 이전 섹션들은 주로 뉴럴 네트워크(Neural Network)의 정적인 부분을 다루었는데, 본 섹션에서는 동적인 부분들을 소개한다. 파라미터(parameter)를 학습하고 좋은 초-파라미터(hyperparameter)를 찾는 과정 등을 다룰 예정이다.
 
 
-### Gradient Checks
+### 그라디언트 체크 (Gradient Checks)
 
-In theory, performing a gradient check is as simple as comparing the analytic gradient to the numerical gradient. In practice, the process is much more involved and error prone. Here are some tips, tricks, and issues to watch out for:
+이론적인 그라디언트 체크라 하면, 수치적으로 계산한(numerical) 그라디언트와 수식으로 계산한(analytic) 그라디언트를 비교하는 정도라 매우 간단하다고 생각할 수도 있겠다. 그렇지만 이 작업을 직접 해 보면 훨씬 복잡하고 뜬금없이 오차가 발생하기도 쉽다는 것을 깨달을 것이다. 이제 팁, 트릭, 조심해야 할 이슈들 몇 개를 소개하고자 한다.
 
 
-**Use the centered formula**. The formula you may have seen for the finite difference approximation when evaluating the numerical gradient looks as follows:
+**같은 근사라 하여도 이론적으로 더 정확도가 높은 공식이 있다 (Use the centered formula)**.
그라디언트($\frac{df(x)}{dx}$)를 수치적으로 근사한다 하면 보통 다음 유한 차분 근사(finite difference approximation)를 떠올릴 것이다: $$ \frac{df(x)}{dx} = \frac{f(x + h) - f(x)}{h} \hspace{0.1in} \text{(bad, do not use)} $$ -where $h$ is a very small number, in practice approximately 1e-5 or so. In practice, it turns out that it is much better to use the *centered* difference formula of the form: +여기서 $h$는 아주 작은 수이고 보통 1e-5 정도의 수를 사용한다. 위 식보다는 아래의 *중심화된(centered)* 차분 공식이 경험적으로는 훨씬 낫다: $$ \frac{df(x)}{dx} = \frac{f(x + h) - f(x - h)}{2h} \hspace{0.1in} \text{(use instead)} $$ -This requires you to evaluate the loss function twice to check every single dimension of the gradient (so it is about 2 times as expensive), but the gradient approximation turns out to be much more precise. To see this, you can use Taylor expansion of $f(x+h)$ and $f(x-h)$ and verify that the first formula has an error on order of $O(h)$, while the second formula only has error terms on order of $O(h^2)$ (i.e. it is a second order approximation). +물론 이 공식은 $f(x+h)$ 말고도 $f(x-h)$도 계산하여야 하므로 최초 식보다 계산량이 두 배 많지만 훨씬 정확한 근사를 제공한다. $f(x+h)$ 및 $f(x-h)$의 ($x$ 근방에서의) 테일러 전개를 고려하면 이유를 금방 알 수 있다. 첫 식은 +To see this, you can use Taylor expansion of $f(x+h)$ and $f(x-h)$ and verify that the first formula has an error on order of $O(h)$, while the second formula only has error terms on order of $O(h^2)$ (i.e. it is a second order approximation). **Use relative error for the comparison**. What are the details of comparing the numerical gradient $f'_n$ and analytic gradient $f'_a$? That is, how do we know if the two are not compatible? You might be temped to keep track of the difference $\mid f'_a - f'_n \mid $ or its square and define the gradient check as failed if that difference is above a threshold. However, this is problematic. For example, consider the case where their difference is 1e-4. This seems like a very appropriate difference if the two gradients are about 1.0, so we'd consider the two gradients to match. But if the gradients were both on order of 1e-5 or lower, then we'd consider 1e-4 to be a huge difference and likely a failure. Hence, it is always more appropriate to consider the *relative error*: From c534736ab194f57f4621be1fd0d8cd5b9aa1c7d9 Mon Sep 17 00:00:00 2001 From: ygchoistat Date: Sat, 16 Apr 2016 20:59:58 +0900 Subject: [PATCH 065/199] Update neural-networks-3.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit PR 연습 겸 업데이트 해봅니다. 총 400라인 정도의 문서인데 오늘 50라인 정도 마쳤습니다. 틈틈이 또 업데이트 하겠습니다. --- neural-networks-3.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/neural-networks-3.md b/neural-networks-3.md index 7c4bb524..819b511e 100644 --- a/neural-networks-3.md +++ b/neural-networks-3.md @@ -46,10 +46,10 @@ $$ \frac{df(x)}{dx} = \frac{f(x + h) - f(x - h)}{2h} \hspace{0.1in} \text{(use instead)} $$ -물론 이 공식은 $f(x+h)$ 말고도 $f(x-h)$도 계산하여야 하므로 최초 식보다 계산량이 두 배 많지만 훨씬 정확한 근사를 제공한다. $f(x+h)$ 및 $f(x-h)$의 ($x$ 근방에서의) 테일러 전개를 고려하면 이유를 금방 알 수 있다. 첫 식은 -To see this, you can use Taylor expansion of $f(x+h)$ and $f(x-h)$ and verify that the first formula has an error on order of $O(h)$, while the second formula only has error terms on order of $O(h^2)$ (i.e. it is a second order approximation). +물론 이 공식은 $f(x+h)$ 말고도 $f(x-h)$도 계산하여야 하므로 최초 식보다 계산량이 두 배 많지만 훨씬 정확한 근사를 제공한다. $f(x+h)$ 및 $f(x-h)$의 ($x$ 근방에서의) 테일러 전개를 고려하면 이유를 금방 알 수 있다. 첫 식은 $O(h)$의 오차가 있는 데 반해 -- 역자 주 : $f(x + h) = f(x) + hf'(x) + O(h)$로부터 $f'(x) - \frac{(f(x+h)-f(x)}{h} = O(h)$ -- 두번째 식은 오차가 $O(h^2)$이다 (즉, 이차 근사이다). 
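두 공식의 오차 차이는 간단한 일변수 함수로 바로 확인해 볼 수 있다. 아래는 $f(x) = x^3$을 예로 든 스케치이다:

~~~python
f = lambda x: x**3   # 간단한 예시 함수, f'(x) = 3x^2
x, h = 2.0, 1e-5
grad_true = 3 * x**2

grad_forward = (f(x + h) - f(x)) / h             # O(h) 오차
grad_centered = (f(x + h) - f(x - h)) / (2 * h)  # O(h^2) 오차

print abs(grad_forward - grad_true)    # 약 6e-5
print abs(grad_centered - grad_true)   # 약 1e-10
~~~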
-**Use relative error for the comparison**. What are the details of comparing the numerical gradient $f'_n$ and analytic gradient $f'_a$? That is, how do we know if the two are not compatible? You might be temped to keep track of the difference $\mid f'_a - f'_n \mid $ or its square and define the gradient check as failed if that difference is above a threshold. However, this is problematic. For example, consider the case where their difference is 1e-4. This seems like a very appropriate difference if the two gradients are about 1.0, so we'd consider the two gradients to match. But if the gradients were both on order of 1e-5 or lower, then we'd consider 1e-4 to be a huge difference and likely a failure. Hence, it is always more appropriate to consider the *relative error*:
+
+**상대 오차를 사용하라 (Use relative error for the comparison)**. 그라디언트의 (수식으로 계산한, analytic) 참값 $f'_a$와 수치적(numerical) 근사값 $f'_n$을 비교하려면 어떤 디테일을 점검하여야 할까? 이 둘이 비슷하지 않음(not compatible)을 어떻게 알아낼 수 있을까? 가장 쉽게는 둘의 절대 오차 $\mid f'_a - f'_n \mid $ 혹은 그 제곱을 쭉 추적하여 이 값(들)이 언젠가 어느 한계점(threshold)을 넘으면 그라디언트 오류라 할 수도 있겠다. 그렇지만 절대 오차에는 문제가 있는 것이, 가령 절대 오차가 1e-4라 가정하여 보자. 만약 $f'_a$와 $f'_n$ 모두 1.0 언저리라면 1e-4의 오차 정도는 매우 훌륭한 근사이고 $f'_a \approx f'_n$이라 할 수 있다. 그런데 만약 두 그라디언트가 1e-5거나 더 작은 값이라면? 그렇다면 1e-4는 매우 큰 차이가 되고 근사가 실패했다고 보는 게 맞다. 따라서 절대 오차가 아니라, 두 그라디언트 값의 크기에 대한 비율을 고려하는 *상대 오차*를 보는 것이 언제나 더 적절하다:

$$
\frac{\mid f'_a - f'_n \mid}{\max(\mid f'_a \mid, \mid f'_n \mid)}
$$

From b74b34d34ae6c27e50a64c918ca69505e0b63c2a Mon Sep 17 00:00:00 2001
From: MaybeS
Date: Mon, 18 Apr 2016 17:47:25 +0900
Subject: [PATCH 066/199] assignment#1/features.ipynb

---
 .../features-Copy1-checkpoint.ipynb           | 338 ++++++++
 .../features-checkpoint.ipynb                 | 338 ++++++++
 .../.ipynb_checkpoints/knn-checkpoint.ipynb   | 459 +++++++++++
 .../assignment1/features-Copy1.ipynb          | 338 ++++++++
 assignments2016/assignment1/features.ipynb    | 555 +++++++------
 assignments2016/assignment1/features_trans.md |   9 +
 assignments2016/assignment1/knn.ipynb         | 744 +++++++++---------
 7 files changed, 2127 insertions(+), 654 deletions(-)
 create mode 100644 assignments2016/assignment1/.ipynb_checkpoints/features-Copy1-checkpoint.ipynb
 create mode 100644 assignments2016/assignment1/.ipynb_checkpoints/features-checkpoint.ipynb
 create mode 100644 assignments2016/assignment1/.ipynb_checkpoints/knn-checkpoint.ipynb
 create mode 100644 assignments2016/assignment1/features-Copy1.ipynb
 create mode 100644 assignments2016/assignment1/features_trans.md

diff --git a/assignments2016/assignment1/.ipynb_checkpoints/features-Copy1-checkpoint.ipynb b/assignments2016/assignment1/.ipynb_checkpoints/features-Copy1-checkpoint.ipynb
new file mode 100644
index 00000000..99f95458
--- /dev/null
+++ b/assignments2016/assignment1/.ipynb_checkpoints/features-Copy1-checkpoint.ipynb
@@ -0,0 +1,338 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 이미지 특징 연습\n",
+    "*이 워크시트를 완성하고 제출하세요. (출력물과 워크시트에 포함되지 않은 코드들을 포함해서) 더 자세한 정보는 코스 웹사이트인 [숙제 페이지](http://vision.stanford.edu/teaching/cs231n/assignments.html)에서 볼 수 있습니다.*\n",
+    "\n",
+    "우리는 입력된 이미지의 픽셀에 선형 분류기를 학습시켜 이미지 분류 작업에 적절한 성능을 얻을 수 있음을 알고 있습니다.\n",
+    "이번 연습에서 우리는 단순 픽셀이 아니라 픽셀로부터 계산된 특징을 이용해 선형 분류기를 훈련시킴으로써 분류 성능을 향상시킬 수 있음을 보일 것입니다.\n",
+    "\n",
+    "이번 연습을 위해 해야 할 모든 작업들은 이 notebook에서 수행됩니다."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "import random\n",
+    "import numpy as np\n",
+    "from cs231n.data_utils import load_CIFAR10\n",
+    "import matplotlib.pyplot as plt\n",
+    "%matplotlib inline\n",
+    "plt.rcParams['figure.figsize'] = (10.0, 8.0) # 기본 그래프 크기 설정\n",
+    "plt.rcParams['image.interpolation'] = 'nearest'\n",
+    "plt.rcParams['image.cmap'] = 'gray'\n",
+    "\n",
+    "# auto-reloading을 위한 외부 모듈\n",
+    "# http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython를 보세요.\n",
+    "%load_ext autoreload\n",
+    "%autoreload 2"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 데이터 불러오기\n",
+    "이전 연습에서처럼, 우리는 CIFAR-10 데이터를 불러올 것입니다."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "from cs231n.features import color_histogram_hsv, hog_feature\n",
+    "\n",
+    "def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000):\n",
+    "  # CIFAR-10 데이터를 불러옵니다.\n",
+    "  cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'\n",
+    "  X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)\n",
+    "  \n",
+    "  # 데이터 표본\n",
+    "  mask = range(num_training, num_training + num_validation)\n",
+    "  X_val = X_train[mask]\n",
+    "  y_val = y_train[mask]\n",
+    "  mask = range(num_training)\n",
+    "  X_train = X_train[mask]\n",
+    "  y_train = y_train[mask]\n",
+    "  mask = range(num_test)\n",
+    "  X_test = X_test[mask]\n",
+    "  y_test = y_test[mask]\n",
+    "\n",
+    "  return X_train, y_train, X_val, y_val, X_test, y_test\n",
+    "\n",
+    "X_train, y_train, X_val, y_val, X_test, y_test = get_CIFAR10_data()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 특징 추출하기\n",
+    "우리는 각 이미지마다 그라디언트 방향 히스토그램(HOG)과 HSV 색 공간의 색상(hue) 채널을 사용한 색상 히스토그램을 함께 계산할 것입니다. 각 이미지의 HOG와 색상 히스토그램 특징 벡터를 이어붙여 최종 특징 벡터를 만듭니다.\n",
+    "\n",
+    "Roughly speaking, HOG should capture the texture of the image while ignoring\n",
+    "color information, and the color histogram represents the color of the input\n",
+    "image while ignoring texture. As a result, we expect that using both together\n",
+    "ought to work better than using either alone. Verifying this assumption would\n",
+    "be a good thing to try for the bonus section.\n",
+    "\n",
+    "The `hog_feature` and `color_histogram_hsv` functions both operate on a single\n",
+    "image and return a feature vector for that image. The extract_features\n",
+    "function takes a set of images and a list of feature functions and evaluates\n",
+    "each feature function on each image, storing the results in a matrix where\n",
+    "each column is the concatenation of all feature vectors for a single image."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from cs231n.features import *\n", + "\n", + "num_color_bins = 10 # Number of bins in the color histogram\n", + "feature_fns = [hog_feature, lambda img: color_histogram_hsv(img, nbin=num_color_bins)]\n", + "X_train_feats = extract_features(X_train, feature_fns, verbose=True)\n", + "X_val_feats = extract_features(X_val, feature_fns)\n", + "X_test_feats = extract_features(X_test, feature_fns)\n", + "\n", + "# Preprocessing: Subtract the mean feature\n", + "mean_feat = np.mean(X_train_feats, axis=0, keepdims=True)\n", + "X_train_feats -= mean_feat\n", + "X_val_feats -= mean_feat\n", + "X_test_feats -= mean_feat\n", + "\n", + "# Preprocessing: Divide by standard deviation. This ensures that each feature\n", + "# has roughly the same scale.\n", + "std_feat = np.std(X_train_feats, axis=0, keepdims=True)\n", + "X_train_feats /= std_feat\n", + "X_val_feats /= std_feat\n", + "X_test_feats /= std_feat\n", + "\n", + "# Preprocessing: Add a bias dimension\n", + "X_train_feats = np.hstack([X_train_feats, np.ones((X_train_feats.shape[0], 1))])\n", + "X_val_feats = np.hstack([X_val_feats, np.ones((X_val_feats.shape[0], 1))])\n", + "X_test_feats = np.hstack([X_test_feats, np.ones((X_test_feats.shape[0], 1))])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Train SVM on features\n", + "Using the multiclass SVM code developed earlier in the assignment, train SVMs on top of the features extracted above; this should achieve better results than training SVMs directly on top of raw pixels." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# Use the validation set to tune the learning rate and regularization strength\n", + "\n", + "from cs231n.classifiers.linear_classifier import LinearSVM\n", + "\n", + "learning_rates = [1e-9, 1e-8, 1e-7]\n", + "regularization_strengths = [1e5, 1e6, 1e7]\n", + "\n", + "results = {}\n", + "best_val = -1\n", + "best_svm = None\n", + "\n", + "pass\n", + "################################################################################\n", + "# TODO: #\n", + "# Use the validation set to set the learning rate and regularization strength. #\n", + "# This should be identical to the validation that you did for the SVM; save #\n", + "# the best trained classifer in best_svm. You might also want to play #\n", + "# with different numbers of bins in the color histogram. If you are careful #\n", + "# you should be able to get accuracy of near 0.44 on the validation set. 
#\n", + "################################################################################\n", + "pass\n", + "################################################################################\n", + "# END OF YOUR CODE #\n", + "################################################################################\n", + "\n", + "# Print out results.\n", + "for lr, reg in sorted(results):\n", + " train_accuracy, val_accuracy = results[(lr, reg)]\n", + " print 'lr %e reg %e train accuracy: %f val accuracy: %f' % (\n", + " lr, reg, train_accuracy, val_accuracy)\n", + " \n", + "print 'best validation accuracy achieved during cross-validation: %f' % best_val" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# Evaluate your trained SVM on the test set\n", + "y_test_pred = best_svm.predict(X_test_feats)\n", + "test_accuracy = np.mean(y_test == y_test_pred)\n", + "print test_accuracy" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# An important way to gain intuition about how an algorithm works is to\n", + "# visualize the mistakes that it makes. In this visualization, we show examples\n", + "# of images that are misclassified by our current system. The first column\n", + "# shows images that our system labeled as \"plane\" but whose true label is\n", + "# something other than \"plane\".\n", + "\n", + "examples_per_class = 8\n", + "classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n", + "for cls, cls_name in enumerate(classes):\n", + " idxs = np.where((y_test != cls) & (y_test_pred == cls))[0]\n", + " idxs = np.random.choice(idxs, examples_per_class, replace=False)\n", + " for i, idx in enumerate(idxs):\n", + " plt.subplot(examples_per_class, len(classes), i * len(classes) + cls + 1)\n", + " plt.imshow(X_test[idx].astype('uint8'))\n", + " plt.axis('off')\n", + " if i == 0:\n", + " plt.title(cls_name)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Inline question 1:\n", + "Describe the misclassification results that you see. Do they make sense?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Neural Network on image features\n", + "Earlier in this assigment we saw that training a two-layer neural network on raw pixels achieved better classification performance than linear classifiers on raw pixels. In this notebook we have seen that linear classifiers on image features outperform linear classifiers on raw pixels. \n", + "\n", + "For completeness, we should also try training a neural network on image features. This approach should outperform all previous approaches: you should easily be able to achieve over 55% classification accuracy on the test set; our best model achieves about 60% classification accuracy." 
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "print X_train_feats.shape"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "from cs231n.classifiers.neural_net import TwoLayerNet\n",
+    "\n",
+    "input_dim = X_train_feats.shape[1]\n",
+    "hidden_dim = 500\n",
+    "num_classes = 10\n",
+    "\n",
+    "net = TwoLayerNet(input_dim, hidden_dim, num_classes)\n",
+    "best_net = None\n",
+    "\n",
+    "################################################################################\n",
+    "# TODO: Train a two-layer neural network on image features. You may want to    #\n",
+    "# cross-validate various parameters as in previous sections. Store your best   #\n",
+    "# model in the best_net variable.                                              #\n",
+    "################################################################################\n",
+    "pass\n",
+    "################################################################################\n",
+    "#                              END OF YOUR CODE                                #\n",
+    "################################################################################"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# Run your neural net classifier on the test set. You should be able to\n",
+    "# get more than 55% accuracy.\n",
+    "\n",
+    "test_acc = (net.predict(X_test_feats) == y_test).mean()\n",
+    "print test_acc"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Bonus: Design your own features!\n",
+    "\n",
+    "You have seen that simple image features can improve classification performance. So far we have tried HOG and color histograms, but other types of features may be able to achieve even better classification performance.\n",
+    "\n",
+    "For bonus points, design and implement a new type of feature and use it for image classification on CIFAR-10. Explain how your feature works and why you expect it to be useful for image classification. Implement it in this notebook, cross-validate any hyperparameters, and compare its performance to the HOG + Color histogram baseline."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Bonus: Do something extra!\n",
+    "Use the material and code we have presented in this assignment to do something interesting. Was there another question we should have asked? Did any cool ideas pop into your head as you were working on the assignment? This is your chance to show off!"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.5.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
diff --git a/assignments2016/assignment1/.ipynb_checkpoints/knn-checkpoint.ipynb b/assignments2016/assignment1/.ipynb_checkpoints/knn-checkpoint.ipynb
new file mode 100644
index 00000000..7ed1b7b4
--- /dev/null
+++ b/assignments2016/assignment1/.ipynb_checkpoints/knn-checkpoint.ipynb
@@ -0,0 +1,459 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# k-Nearest Neighbor (kNN) exercise\n",
+    "\n",
+    "*Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. For more details see the [assignments page](http://vision.stanford.edu/teaching/cs231n/assignments.html) on the course website.*\n",
+    "\n",
+    "The kNN classifier consists of two stages:\n",
+    "\n",
+    "- During training, the classifier takes the training data and simply remembers it\n",
+    "- During testing, kNN classifies every test image by comparing it to all training images and transferring the labels of the k most similar training examples\n",
+    "- The value of k is cross-validated\n",
+    "\n",
+    "In this exercise you will implement these steps, understand the basic Image Classification pipeline and cross-validation, and gain proficiency in writing efficient, vectorized code."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# Run some setup code for this notebook.\n",
+    "\n",
+    "import random\n",
+    "import numpy as np\n",
+    "from cs231n.data_utils import load_CIFAR10\n",
+    "import matplotlib.pyplot as plt\n",
+    "\n",
+    "# This is a bit of magic to make matplotlib figures appear inline in the notebook\n",
+    "# rather than in a new window.\n",
+    "%matplotlib inline\n",
+    "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n",
+    "plt.rcParams['image.interpolation'] = 'nearest'\n",
+    "plt.rcParams['image.cmap'] = 'gray'\n",
+    "\n",
+    "# Some more magic so that the notebook will reload external python modules;\n",
+    "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n",
+    "%load_ext autoreload\n",
+    "%autoreload 2"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# Load the raw CIFAR-10 data.\n",
+    "cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'\n",
+    "X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)\n",
+    "\n",
+    "# As a sanity check, we print out the size of the training and test data.\n",
+    "print 'Training data shape: ', X_train.shape\n",
+    "print 'Training labels shape: ', y_train.shape\n",
+    "print 'Test data shape: ', X_test.shape\n",
+    "print 'Test labels shape: ', y_test.shape"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# Visualize some examples from the dataset.\n",
+    "# We show a few examples of training images from each class.\n",
+    "classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n",
+    "num_classes = len(classes)\n",
+    "samples_per_class = 7\n",
+    "for y, cls in enumerate(classes):\n",
+    "  idxs = np.flatnonzero(y_train == y)\n",
+    "  idxs = np.random.choice(idxs, samples_per_class, replace=False)\n",
+    "  for i, idx in enumerate(idxs):\n",
+    "    plt_idx = i * num_classes + y + 1\n",
+    "    plt.subplot(samples_per_class, num_classes, plt_idx)\n",
+    "    plt.imshow(X_train[idx].astype('uint8'))\n",
+    "    plt.axis('off')\n",
+    "    if i == 0:\n",
+    "      plt.title(cls)\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# Subsample the data for more efficient code execution in this exercise\n",
+    "num_training = 5000\n",
+    "mask = range(num_training)\n",
+    "X_train = X_train[mask]\n",
+    "y_train = y_train[mask]\n",
+    "\n",
+    "num_test = 500\n",
+    "mask = range(num_test)\n",
+    "X_test = X_test[mask]\n",
+    "y_test = y_test[mask]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# Reshape the image data into rows\n",
+    "X_train = np.reshape(X_train, (X_train.shape[0], -1))\n",
+    "X_test = np.reshape(X_test, (X_test.shape[0], -1))\n",
+    "print X_train.shape, X_test.shape"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "from cs231n.classifiers import KNearestNeighbor\n",
+    "\n",
+    "# Create a kNN classifier instance.\n",
+    "# Remember that training a kNN classifier is a no-op:\n",
+    "# the classifier simply remembers the data and does no further processing\n",
+    "classifier = KNearestNeighbor()\n",
+    "classifier.train(X_train, y_train)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We would now like to classify the test data with the kNN classifier. Recall that we can break down this process into two steps: \n",
+    "\n",
+    "1. First we must compute the distances between all test examples and all train examples. \n",
+    "2. Given these distances, for each test example we find the k nearest examples and have them vote for the label.\n",
+    "\n",
+    "Let's begin with computing the distance matrix between all training and test examples. For example, if there are **Ntr** training examples and **Nte** test examples, this stage should result in a **Nte x Ntr** matrix where each element (i,j) is the distance between the i-th test example and the j-th training example.\n",
+    "\n",
+    "First, open `cs231n/classifiers/k_nearest_neighbor.py` and implement the function `compute_distances_two_loops` that uses a (very inefficient) double loop over all pairs of (test, train) examples and computes the distance matrix one element at a time."
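As a minimal sketch of the double-loop computation described above — assuming only NumPy and the `(num_test, D)` / `(num_train, D)` row layout used in this notebook, not the official solution file:

```python
import numpy as np

def compute_distances_two_loops_sketch(X_test, X_train):
    """Naive Euclidean distance matrix: dists[i, j] is the L2 distance
    between the i-th test row and the j-th training row."""
    num_test, num_train = X_test.shape[0], X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            # One entry at a time: O(Nte * Ntr) distance evaluations
            dists[i, j] = np.sqrt(np.sum((X_test[i] - X_train[j]) ** 2))
    return dists
```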
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# Open cs231n/classifiers/k_nearest_neighbor.py and implement\n",
+    "# compute_distances_two_loops.\n",
+    "\n",
+    "# Test your implementation:\n",
+    "dists = classifier.compute_distances_two_loops(X_test)\n",
+    "print dists.shape"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# We can visualize the distance matrix: each row is a single test example and\n",
+    "# its distances to training examples\n",
+    "plt.imshow(dists, interpolation='none')\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Inline Question #1:** Notice the structured patterns in the distance matrix, where some rows or columns are visibly brighter. (Note that with the default color scheme black indicates low distances while white indicates high distances.)\n",
+    "\n",
+    "- What in the data is the cause behind the distinctly bright rows?\n",
+    "- What causes the columns?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Your Answer**: *fill this in.*\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# Now implement the function predict_labels and run the code below:\n",
+    "# We use k = 1 (which is Nearest Neighbor).\n",
+    "y_test_pred = classifier.predict_labels(dists, k=1)\n",
+    "\n",
+    "# Compute and print the fraction of correctly predicted examples\n",
+    "num_correct = np.sum(y_test_pred == y_test)\n",
+    "accuracy = float(num_correct) / num_test\n",
+    "print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You should expect to see approximately `27%` accuracy. Now let's try out a larger `k`, say `k = 5`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "y_test_pred = classifier.predict_labels(dists, k=5)\n",
+    "num_correct = np.sum(y_test_pred == y_test)\n",
+    "accuracy = float(num_correct) / num_test\n",
+    "print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You should expect to see slightly better performance than with `k = 1`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# Now let's speed up distance matrix computation by using partial vectorization\n",
+    "# with one loop. Implement the function compute_distances_one_loop and run the\n",
+    "# code below:\n",
+    "dists_one = classifier.compute_distances_one_loop(X_test)\n",
+    "\n",
+    "# To ensure that our vectorized implementation is correct, we make sure that it\n",
+    "# agrees with the naive implementation. There are many ways to decide whether\n",
+    "# two matrices are similar; one of the simplest is the Frobenius norm. In case\n",
+    "# you haven't seen it before, the Frobenius norm of the difference of two\n",
+    "# matrices is the square root of the sum of squared differences of all their\n",
+    "# elements; in other words, reshape the matrices into vectors and compute the\n",
+    "# Euclidean distance between them.\n",
+    "difference = np.linalg.norm(dists - dists_one, ord='fro')\n",
+    "print 'Difference was: %f' % (difference, )\n",
+    "if difference < 0.001:\n",
+    "  print 'Good! The distance matrices are the same'\n",
+    "else:\n",
+    "  print 'Uh-oh! The distance matrices are different'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# Now implement the fully vectorized version inside compute_distances_no_loops\n",
+    "# and run the code\n",
+    "dists_two = classifier.compute_distances_no_loops(X_test)\n",
+    "\n",
+    "# check that the distance matrix agrees with the one we computed before:\n",
+    "difference = np.linalg.norm(dists - dists_two, ord='fro')\n",
+    "print 'Difference was: %f' % (difference, )\n",
+    "if difference < 0.001:\n",
+    "  print 'Good! The distance matrices are the same'\n",
+    "else:\n",
+    "  print 'Uh-oh! The distance matrices are different'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# Let's compare how fast the implementations are\n",
+    "def time_function(f, *args):\n",
+    "  \"\"\"\n",
+    "  Call a function f with args and return the time (in seconds) that it took to execute.\n",
+    "  \"\"\"\n",
+    "  import time\n",
+    "  tic = time.time()\n",
+    "  f(*args)\n",
+    "  toc = time.time()\n",
+    "  return toc - tic\n",
+    "\n",
+    "two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)\n",
+    "print 'Two loop version took %f seconds' % two_loop_time\n",
+    "\n",
+    "one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)\n",
+    "print 'One loop version took %f seconds' % one_loop_time\n",
+    "\n",
+    "no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)\n",
+    "print 'No loop version took %f seconds' % no_loop_time\n",
+    "\n",
+    "# you should see significantly faster performance with the fully vectorized implementation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Cross-validation\n",
+    "\n",
+    "We have implemented the k-Nearest Neighbor classifier, but we set the value k = 5 arbitrarily. We will now determine the best value of this hyperparameter with cross-validation."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "num_folds = 5\n",
+    "k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]\n",
+    "\n",
+    "X_train_folds = []\n",
+    "y_train_folds = []\n",
+    "################################################################################\n",
+    "# TODO:                                                                        #\n",
+    "# Split up the training data into folds. After splitting, X_train_folds and   #\n",
+    "# y_train_folds should each be lists of length num_folds, where               #\n",
+    "# y_train_folds[i] is the label vector for the points in X_train_folds[i].    #\n",
+    "# Hint: Look up the numpy array_split function.                               #\n",
+    "################################################################################\n",
+    "pass\n",
+    "################################################################################\n",
+    "#                              END OF YOUR CODE                                #\n",
+    "################################################################################\n",
+    "\n",
+    "# A dictionary holding the accuracies for different values of k that we find\n",
+    "# when running cross-validation. After running cross-validation,\n",
+    "# k_to_accuracies[k] should be a list of length num_folds giving the different\n",
+    "# accuracy values that we found when using that value of k.\n",
+    "k_to_accuracies = {}\n",
+    "\n",
+    "\n",
+    "################################################################################\n",
+    "# TODO:                                                                        #\n",
+    "# Perform k-fold cross validation to find the best value of k. For each       #\n",
+    "# possible value of k, run the k-nearest-neighbor algorithm num_folds times,  #\n",
+    "# where in each case you use all but one of the folds as training data and the #\n",
+    "# last fold as a validation set. Store the accuracies for all folds and all   #\n",
+    "# values of k in the k_to_accuracies dictionary.                              #\n",
+    "################################################################################\n",
+    "pass\n",
+    "################################################################################\n",
+    "#                              END OF YOUR CODE                                #\n",
+    "################################################################################\n",
+    "\n",
+    "# Print out the computed accuracies\n",
+    "for k in sorted(k_to_accuracies):\n",
+    "  for accuracy in k_to_accuracies[k]:\n",
+    "    print 'k = %d, accuracy = %f' % (k, accuracy)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# plot the raw observations\n",
+    "for k in k_choices:\n",
+    "  accuracies = k_to_accuracies[k]\n",
+    "  plt.scatter([k] * len(accuracies), accuracies)\n",
+    "\n",
+    "# plot the trend line with error bars that correspond to standard deviation\n",
+    "accuracies_mean = np.array([np.mean(v) for k,v in sorted(k_to_accuracies.items())])\n",
+    "accuracies_std = np.array([np.std(v) for k,v in sorted(k_to_accuracies.items())])\n",
+    "plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)\n",
+    "plt.title('Cross-validation on k')\n",
+    "plt.xlabel('k')\n",
+    "plt.ylabel('Cross-validation accuracy')\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# Based on the cross-validation results above, choose the best value for k,\n",
+    "# retrain the classifier using all the training data, and test it on the test\n",
+    "# data. You should be able to get above 28% accuracy on the test data.\n",
+    "best_k = 1\n",
+    "\n",
+    "classifier = KNearestNeighbor()\n",
+    "classifier.train(X_train, y_train)\n",
+    "y_test_pred = classifier.predict(X_test, k=best_k)\n",
+    "\n",
+    "# Compute and display the accuracy\n",
+    "num_correct = np.sum(y_test_pred == y_test)\n",
+    "accuracy = float(num_correct) / num_test\n",
+    "print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.5.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
diff --git a/assignments2016/assignment1/features.ipynb b/assignments2016/assignment1/features.ipynb
index 9d22d9f6..af49194a 100644
--- a/assignments2016/assignment1/features.ipynb
+++ b/assignments2016/assignment1/features.ipynb
@@ -1,340 +1,331 @@
 {
- "nbformat_minor": 0,
- "nbformat": 4,
  "cells": [
   {
+   "cell_type": "markdown",
+   "metadata": {},
    "source": [
     "# Image features exercise\n",
     "*Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. For more details see the [assignments page](http://vision.stanford.edu/teaching/cs231n/assignments.html) on the course website.*\n",
     "\n",
     "We have seen that we can achieve reasonable performance on an image classification task by training a linear classifier on the pixels of the input image. In this exercise we will show that we can improve our classification performance by training linear classifiers not on raw pixels but on features that are computed from the raw pixels.\n",
     "\n",
     "All of your work for this exercise will be done in this notebook."
-   ],
-   "cell_type": "markdown",
-   "metadata": {}
-  },
+   ]
+  },
  {
-   "execution_count": null,
-   "cell_type": "code",
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
    "source": [
     "import random\n",
     "import numpy as np\n",
     "from cs231n.data_utils import load_CIFAR10\n",
     "import matplotlib.pyplot as plt\n",
     "%matplotlib inline\n",
     "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n",
     "plt.rcParams['image.interpolation'] = 'nearest'\n",
     "plt.rcParams['image.cmap'] = 'gray'\n",
     "\n",
-    "# for auto-reloading extenrnal modules\n",
+    "# for auto-reloading external modules\n",
     "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n",
     "%load_ext autoreload\n",
     "%autoreload 2"
-   ],
-   "outputs": [],
-   "metadata": {
-    "collapsed": false
-   }
-  },
+   ]
+  },
  {
+   "cell_type": "markdown",
+   "metadata": {},
    "source": [
     "## Load data\n",
-    "Similar to previous exercises, we will load CIFAR-10 data from disk."
-   ],
-   "cell_type": "markdown",
-   "metadata": {}
-  },
+    "As in previous exercises, we will load the CIFAR-10 data from disk."
+   ]
+  },
  {
-   "execution_count": null,
-   "cell_type": "code",
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
    "source": [
     "from cs231n.features import color_histogram_hsv, hog_feature\n",
     "\n",
     "def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000):\n",
     "  # Load the raw CIFAR-10 data\n",
     "  cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'\n",
     "  X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)\n",
     "  \n",
     "  # Subsample the data\n",
     "  mask = range(num_training, num_training + num_validation)\n",
     "  X_val = X_train[mask]\n",
     "  y_val = y_train[mask]\n",
     "  mask = range(num_training)\n",
     "  X_train = X_train[mask]\n",
     "  y_train = y_train[mask]\n",
     "  mask = range(num_test)\n",
     "  X_test = X_test[mask]\n",
     "  y_test = y_test[mask]\n",
     "\n",
     "  return X_train, y_train, X_val, y_val, X_test, y_test\n",
     "\n",
     "X_train, y_train, X_val, y_val, X_test, y_test = get_CIFAR10_data()"
-   ],
-   "outputs": [],
-   "metadata": {
-    "collapsed": false
-   }
-  },
+   ]
+  },
  {
+   "cell_type": "markdown",
+   "metadata": {},
    "source": [
-    "## Extract Features\n",
-    "For each image we will compute a Histogram of Oriented\n",
-    "Gradients (HOG) as well as a color histogram using the hue channel in HSV\n",
-    "color space. We form our final feature vector for each image by concatenating\n",
-    "the HOG and color histogram feature vectors.\n",
+    "# Extract Features\n",
+    "For each image we will compute a Histogram of Oriented Gradients (HOG) as well as a color histogram using the hue channel in HSV color space. We form our final feature vector for each image by concatenating the HOG and color histogram feature vectors.\n",
     "\n",
     "Roughly speaking, HOG should capture the texture of the image while ignoring\n",
     "color information, and the color histogram represents the color of the input\n",
     "image while ignoring texture. As a result, we expect that using both together\n",
     "ought to work better than using either alone. Verifying this assumption would\n",
     "be a good thing to try for the bonus section.\n",
     "\n",
     "The `hog_feature` and `color_histogram_hsv` functions both operate on a single\n",
     "image and return a feature vector for that image. The extract_features\n",
     "function takes a set of images and a list of feature functions and evaluates\n",
     "each feature function on each image, storing the results in a matrix where\n",
-    "each column is the concatenation of all feature vectors for a single image."
-   ],
-   "cell_type": "markdown",
-   "metadata": {}
-  },
+    "each row is the concatenation of all feature vectors for a single image."
+   ]
+  },
  {
-   "execution_count": null,
-   "cell_type": "code",
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
    "source": [
     "from cs231n.features import *\n",
     "\n",
     "num_color_bins = 10 # Number of bins in the color histogram\n",
     "feature_fns = [hog_feature, lambda img: color_histogram_hsv(img, nbin=num_color_bins)]\n",
     "X_train_feats = extract_features(X_train, feature_fns, verbose=True)\n",
     "X_val_feats = extract_features(X_val, feature_fns)\n",
     "X_test_feats = extract_features(X_test, feature_fns)\n",
     "\n",
     "# Preprocessing: Subtract the mean feature\n",
     "mean_feat = np.mean(X_train_feats, axis=0, keepdims=True)\n",
     "X_train_feats -= mean_feat\n",
     "X_val_feats -= mean_feat\n",
     "X_test_feats -= mean_feat\n",
     "\n",
     "# Preprocessing: Divide by standard deviation. This ensures that each feature\n",
     "# has roughly the same scale.\n",
     "std_feat = np.std(X_train_feats, axis=0, keepdims=True)\n",
     "X_train_feats /= std_feat\n",
     "X_val_feats /= std_feat\n",
     "X_test_feats /= std_feat\n",
     "\n",
     "# Preprocessing: Add a bias dimension\n",
     "X_train_feats = np.hstack([X_train_feats, np.ones((X_train_feats.shape[0], 1))])\n",
     "X_val_feats = np.hstack([X_val_feats, np.ones((X_val_feats.shape[0], 1))])\n",
     "X_test_feats = np.hstack([X_test_feats, np.ones((X_test_feats.shape[0], 1))])"
-   ],
-   "outputs": [],
-   "metadata": {
-    "collapsed": false
-   }
-  },
+   ]
+  },
  {
+   "cell_type": "markdown",
+   "metadata": {},
    "source": [
     "## Train SVM on features\n",
     "Using the multiclass SVM code developed earlier in the assignment, train SVMs on top of the features extracted above; this should achieve better results than training SVMs directly on top of raw pixels."
-   ],
-   "cell_type": "markdown",
-   "metadata": {}
-  },
+   ]
+  },
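A hedged sketch of the validation sweep the TODO in the next cell asks for — it assumes the `LinearSVM.train` / `LinearSVM.predict` interface from earlier in assignment1, and the keyword names below are assumptions rather than the official solution:

```python
# Illustrative sweep over the (learning rate, regularization) grid; slots in
# after results, best_val, and best_svm are initialized in the cell below.
# LinearSVM.train keyword names (learning_rate, reg, num_iters) are assumed.
for lr in learning_rates:
    for reg in regularization_strengths:
        svm = LinearSVM()
        svm.train(X_train_feats, y_train, learning_rate=lr, reg=reg,
                  num_iters=1500, verbose=False)
        train_acc = np.mean(svm.predict(X_train_feats) == y_train)
        val_acc = np.mean(svm.predict(X_val_feats) == y_val)
        results[(lr, reg)] = (train_acc, val_acc)
        if val_acc > best_val:
            best_val, best_svm = val_acc, svm
```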
  {
-   "execution_count": null,
-   "cell_type": "code",
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
    "source": [
     "# Use the validation set to tune the learning rate and regularization strength\n",
     "\n",
     "from cs231n.classifiers.linear_classifier import LinearSVM\n",
     "\n",
     "learning_rates = [1e-9, 1e-8, 1e-7]\n",
     "regularization_strengths = [1e5, 1e6, 1e7]\n",
     "\n",
     "results = {}\n",
     "best_val = -1\n",
     "best_svm = None\n",
     "\n",
     "pass\n",
     "################################################################################\n",
     "# TODO:                                                                        #\n",
     "# Use the validation set to set the learning rate and regularization strength. #\n",
     "# This should be identical to the validation that you did for the SVM; save    #\n",
-    "# the best trained classifer in best_svm. You might also want to play         #\n",
+    "# the best trained classifier in best_svm. You might also want to play        #\n",
     "# with different numbers of bins in the color histogram. If you are careful   #\n",
     "# you should be able to get accuracy of near 0.44 on the validation set.      #\n",
     "################################################################################\n",
     "pass\n",
     "################################################################################\n",
     "#                              END OF YOUR CODE                                #\n",
     "################################################################################\n",
     "\n",
     "# Print out results.\n",
     "for lr, reg in sorted(results):\n",
     "  train_accuracy, val_accuracy = results[(lr, reg)]\n",
     "  print 'lr %e reg %e train accuracy: %f val accuracy: %f' % (\n",
     "        lr, reg, train_accuracy, val_accuracy)\n",
     "  \n",
     "print 'best validation accuracy achieved during cross-validation: %f' % best_val"
-   ],
-   "outputs": [],
-   "metadata": {
-    "collapsed": false
-   }
-  },
+   ]
+  },
  {
-   "execution_count": null,
-   "cell_type": "code",
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
    "source": [
     "# Evaluate your trained SVM on the test set\n",
     "y_test_pred = best_svm.predict(X_test_feats)\n",
     "test_accuracy = np.mean(y_test == y_test_pred)\n",
     "print test_accuracy"
-   ],
-   "outputs": [],
-   "metadata": {
-    "collapsed": false
-   }
-  },
+   ]
+  },
  {
-   "execution_count": null,
-   "cell_type": "code",
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
    "source": [
     "# An important way to gain intuition about how an algorithm works is to\n",
     "# visualize the mistakes that it makes. In this visualization, we show examples\n",
     "# of images that are misclassified by our current system. The first column\n",
     "# shows images that our system labeled as \"plane\" but whose true label is\n",
     "# something other than \"plane\".\n",
     "\n",
     "examples_per_class = 8\n",
     "classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n",
     "for cls, cls_name in enumerate(classes):\n",
     "  idxs = np.where((y_test != cls) & (y_test_pred == cls))[0]\n",
     "  idxs = np.random.choice(idxs, examples_per_class, replace=False)\n",
     "  for i, idx in enumerate(idxs):\n",
     "    plt.subplot(examples_per_class, len(classes), i * len(classes) + cls + 1)\n",
     "    plt.imshow(X_test[idx].astype('uint8'))\n",
     "    plt.axis('off')\n",
     "    if i == 0:\n",
     "      plt.title(cls_name)\n",
     "plt.show()"
-   ],
-   "outputs": [],
-   "metadata": {
-    "collapsed": false
-   }
-  },
+   ]
+  },
  {
+   "cell_type": "markdown",
+   "metadata": {},
    "source": [
     "### Inline question 1:\n",
     "Describe the misclassification results that you see. Do they make sense?"
+ ] + }, { + "cell_type": "markdown", + "metadata": {}, "source": [ - "## Neural Network on image features\n", - "Earlier in this assigment we saw that training a two-layer neural network on raw pixels achieved better classification performance than linear classifiers on raw pixels. In this notebook we have seen that linear classifiers on image features outperform linear classifiers on raw pixels. \n", - "\n", + "## Neural Network on image features\n", + "Earlier in this assigment we saw that training a two-layer neural network on raw pixels achieved better classification performance than linear classifiers on raw pixels. In this notebook we have seen that linear classifiers on image features outperform linear classifiers on raw pixels. \n", + "\n", "For completeness, we should also try training a neural network on image features. This approach should outperform all previous approaches: you should easily be able to achieve over 55% classification accuracy on the test set; our best model achieves about 60% classification accuracy." - ], - "cell_type": "markdown", - "metadata": {} - }, + ] + }, { - "execution_count": null, - "cell_type": "code", + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], "source": [ "print X_train_feats.shape" - ], - "outputs": [], + ] + }, + { + "cell_type": "code", + "execution_count": null, "metadata": { "collapsed": false - } - }, - { - "execution_count": null, - "cell_type": "code", + }, + "outputs": [], "source": [ - "from cs231n.classifiers.neural_net import TwoLayerNet\n", - "\n", - "input_dim = X_train_feats.shape[1]\n", - "hidden_dim = 500\n", - "num_classes = 10\n", - "\n", - "net = TwoLayerNet(input_dim, hidden_dim, num_classes)\n", - "best_net = None\n", - "\n", - "################################################################################\n", - "# TODO: Train a two-layer neural network on image features. You may want to #\n", - "# cross-validate various parameters as in previous sections. Store your best #\n", - "# model in the best_net variable. #\n", - "################################################################################\n", - "pass\n", - "################################################################################\n", - "# END OF YOUR CODE #\n", + "from cs231n.classifiers.neural_net import TwoLayerNet\n", + "\n", + "input_dim = X_train_feats.shape[1]\n", + "hidden_dim = 500\n", + "num_classes = 10\n", + "\n", + "net = TwoLayerNet(input_dim, hidden_dim, num_classes)\n", + "best_net = None\n", + "\n", + "################################################################################\n", + "# TODO: Train a two-layer neural network on image features. You may want to #\n", + "# cross-validate various parameters as in previous sections. Store your best #\n", + "# model in the best_net variable. #\n", + "################################################################################\n", + "pass\n", + "################################################################################\n", + "# END OF YOUR CODE #\n", "################################################################################" - ], - "outputs": [], + ] + }, + { + "cell_type": "code", + "execution_count": null, "metadata": { "collapsed": false - } - }, - { - "execution_count": null, - "cell_type": "code", + }, + "outputs": [], "source": [ - "# Run your neural net classifier on the test set. 
You should be able to\n", - "# get more than 55% accuracy.\n", - "\n", - "test_acc = (net.predict(X_test_feats) == y_test).mean()\n", + "# Run your neural net classifier on the test set. You should be able to\n", + "# get more than 55% accuracy.\n", + "\n", + "test_acc = (net.predict(X_test_feats) == y_test).mean()\n", "print test_acc" - ], - "outputs": [], - "metadata": { - "collapsed": false - } - }, + ] + }, { + "cell_type": "markdown", + "metadata": {}, "source": [ - "# Bonus: Design your own features!\n", - "\n", - "You have seen that simple image features can improve classification performance. So far we have tried HOG and color histograms, but other types of features may be able to achieve even better classification performance.\n", - "\n", + "# Bonus: Design your own features!\n", + "\n", + "You have seen that simple image features can improve classification performance. So far we have tried HOG and color histograms, but other types of features may be able to achieve even better classification performance.\n", + "\n", "For bonus points, design and implement a new type of feature and use it for image classification on CIFAR-10. Explain how your feature works and why you expect it to be useful for image classification. Implement it in this notebook, cross-validate any hyperparameters, and compare its performance to the HOG + Color histogram baseline." - ], - "cell_type": "markdown", - "metadata": {} - }, + ] + }, { + "cell_type": "markdown", + "metadata": {}, "source": [ - "# Bonus: Do something extra!\n", + "# Bonus: Do something extra!\n", "Use the material and code we have presented in this assignment to do something interesting. Was there another question we should have asked? Did any cool ideas pop into your head as you were working on the assignment? This is your chance to show off!" - ], - "cell_type": "markdown", - "metadata": {} + ] } - ], + ], "metadata": { "kernelspec": { - "display_name": "Python 2", - "name": "python2", - "language": "python" - }, + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, "language_info": { - "mimetype": "text/x-python", - "nbconvert_exporter": "python", - "name": "python", - "file_extension": ".py", - "version": "2.7.9", - "pygments_lexer": "ipython2", "codemirror_mode": { - "version": 2, - "name": "ipython" - } + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.5.1" } - } -} \ No newline at end of file + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/assignments2016/assignment1/features_trans.md b/assignments2016/assignment1/features_trans.md new file mode 100644 index 00000000..cca77aeb --- /dev/null +++ b/assignments2016/assignment1/features_trans.md @@ -0,0 +1,9 @@ +# Words +Image features: 이미지 특징 +Plots: 그래프 +deviation: +completeness: +material: + +You might also want to play with different numbers of bins in the color histogram. +아마 다른 개수의 색상 히스토그램안의 bin을 사용하여 해보고 싶을 수 있습니다. 
diff --git a/assignments2016/assignment1/knn.ipynb b/assignments2016/assignment1/knn.ipynb index 2c550ba5..7ed1b7b4 100644 --- a/assignments2016/assignment1/knn.ipynb +++ b/assignments2016/assignment1/knn.ipynb @@ -1,459 +1,459 @@ { - "nbformat_minor": 0, - "nbformat": 4, "cells": [ { + "cell_type": "markdown", + "metadata": {}, "source": [ - "# k-Nearest Neighbor (kNN) exercise\n", - "\n", - "*Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. For more details see the [assignments page](http://vision.stanford.edu/teaching/cs231n/assignments.html) on the course website.*\n", - "\n", - "The kNN classifier consists of two stages:\n", - "\n", - "- During training, the classifier takes the training data and simply remembers it\n", - "- During testing, kNN classifies every test image by comparing to all training images and transfering the labels of the k most similar training examples\n", - "- The value of k is cross-validated\n", - "\n", + "# k-Nearest Neighbor (kNN) exercise\n", + "\n", + "*Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. For more details see the [assignments page](http://vision.stanford.edu/teaching/cs231n/assignments.html) on the course website.*\n", + "\n", + "The kNN classifier consists of two stages:\n", + "\n", + "- During training, the classifier takes the training data and simply remembers it\n", + "- During testing, kNN classifies every test image by comparing to all training images and transfering the labels of the k most similar training examples\n", + "- The value of k is cross-validated\n", + "\n", "In this exercise you will implement these steps and understand the basic Image Classification pipeline, cross-validation, and gain proficiency in writing efficient, vectorized code." 
- ], - "cell_type": "markdown", - "metadata": {} - }, + ] + }, { - "execution_count": null, - "cell_type": "code", + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], "source": [ - "# Run some setup code for this notebook.\n", - "\n", - "import random\n", - "import numpy as np\n", - "from cs231n.data_utils import load_CIFAR10\n", - "import matplotlib.pyplot as plt\n", - "\n", - "# This is a bit of magic to make matplotlib figures appear inline in the notebook\n", - "# rather than in a new window.\n", - "%matplotlib inline\n", - "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", - "plt.rcParams['image.interpolation'] = 'nearest'\n", - "plt.rcParams['image.cmap'] = 'gray'\n", - "\n", - "# Some more magic so that the notebook will reload external python modules;\n", - "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", - "%load_ext autoreload\n", + "# Run some setup code for this notebook.\n", + "\n", + "import random\n", + "import numpy as np\n", + "from cs231n.data_utils import load_CIFAR10\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# This is a bit of magic to make matplotlib figures appear inline in the notebook\n", + "# rather than in a new window.\n", + "%matplotlib inline\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['image.interpolation'] = 'nearest'\n", + "plt.rcParams['image.cmap'] = 'gray'\n", + "\n", + "# Some more magic so that the notebook will reload external python modules;\n", + "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "%load_ext autoreload\n", "%autoreload 2" - ], - "outputs": [], + ] + }, + { + "cell_type": "code", + "execution_count": null, "metadata": { "collapsed": false - } - }, - { - "execution_count": null, - "cell_type": "code", + }, + "outputs": [], "source": [ - "# Load the raw CIFAR-10 data.\n", - "cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'\n", - "X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)\n", - "\n", - "# As a sanity check, we print out the size of the training and test data.\n", - "print 'Training data shape: ', X_train.shape\n", - "print 'Training labels shape: ', y_train.shape\n", - "print 'Test data shape: ', X_test.shape\n", + "# Load the raw CIFAR-10 data.\n", + "cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'\n", + "X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)\n", + "\n", + "# As a sanity check, we print out the size of the training and test data.\n", + "print 'Training data shape: ', X_train.shape\n", + "print 'Training labels shape: ', y_train.shape\n", + "print 'Test data shape: ', X_test.shape\n", "print 'Test labels shape: ', y_test.shape" - ], - "outputs": [], + ] + }, + { + "cell_type": "code", + "execution_count": null, "metadata": { "collapsed": false - } - }, - { - "execution_count": null, - "cell_type": "code", + }, + "outputs": [], "source": [ - "# Visualize some examples from the dataset.\n", - "# We show a few examples of training images from each class.\n", - "classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n", - "num_classes = len(classes)\n", - "samples_per_class = 7\n", - "for y, cls in enumerate(classes):\n", - " idxs = np.flatnonzero(y_train == y)\n", - " idxs = np.random.choice(idxs, samples_per_class, replace=False)\n", - " for i, idx in enumerate(idxs):\n", - " plt_idx = i * num_classes + y + 1\n", - " 
plt.subplot(samples_per_class, num_classes, plt_idx)\n", - " plt.imshow(X_train[idx].astype('uint8'))\n", - " plt.axis('off')\n", - " if i == 0:\n", - " plt.title(cls)\n", + "# Visualize some examples from the dataset.\n", + "# We show a few examples of training images from each class.\n", + "classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n", + "num_classes = len(classes)\n", + "samples_per_class = 7\n", + "for y, cls in enumerate(classes):\n", + " idxs = np.flatnonzero(y_train == y)\n", + " idxs = np.random.choice(idxs, samples_per_class, replace=False)\n", + " for i, idx in enumerate(idxs):\n", + " plt_idx = i * num_classes + y + 1\n", + " plt.subplot(samples_per_class, num_classes, plt_idx)\n", + " plt.imshow(X_train[idx].astype('uint8'))\n", + " plt.axis('off')\n", + " if i == 0:\n", + " plt.title(cls)\n", "plt.show()" - ], - "outputs": [], + ] + }, + { + "cell_type": "code", + "execution_count": null, "metadata": { "collapsed": false - } - }, - { - "execution_count": null, - "cell_type": "code", + }, + "outputs": [], "source": [ - "# Subsample the data for more efficient code execution in this exercise\n", - "num_training = 5000\n", - "mask = range(num_training)\n", - "X_train = X_train[mask]\n", - "y_train = y_train[mask]\n", - "\n", - "num_test = 500\n", - "mask = range(num_test)\n", - "X_test = X_test[mask]\n", + "# Subsample the data for more efficient code execution in this exercise\n", + "num_training = 5000\n", + "mask = range(num_training)\n", + "X_train = X_train[mask]\n", + "y_train = y_train[mask]\n", + "\n", + "num_test = 500\n", + "mask = range(num_test)\n", + "X_test = X_test[mask]\n", "y_test = y_test[mask]" - ], - "outputs": [], + ] + }, + { + "cell_type": "code", + "execution_count": null, "metadata": { "collapsed": false - } - }, - { - "execution_count": null, - "cell_type": "code", + }, + "outputs": [], "source": [ - "# Reshape the image data into rows\n", - "X_train = np.reshape(X_train, (X_train.shape[0], -1))\n", - "X_test = np.reshape(X_test, (X_test.shape[0], -1))\n", + "# Reshape the image data into rows\n", + "X_train = np.reshape(X_train, (X_train.shape[0], -1))\n", + "X_test = np.reshape(X_test, (X_test.shape[0], -1))\n", "print X_train.shape, X_test.shape" - ], - "outputs": [], + ] + }, + { + "cell_type": "code", + "execution_count": null, "metadata": { "collapsed": false - } - }, - { - "execution_count": null, - "cell_type": "code", + }, + "outputs": [], "source": [ - "from cs231n.classifiers import KNearestNeighbor\n", - "\n", - "# Create a kNN classifier instance. \n", - "# Remember that training a kNN classifier is a noop: \n", - "# the Classifier simply remembers the data and does no further processing \n", - "classifier = KNearestNeighbor()\n", + "from cs231n.classifiers import KNearestNeighbor\n", + "\n", + "# Create a kNN classifier instance. \n", + "# Remember that training a kNN classifier is a noop: \n", + "# the Classifier simply remembers the data and does no further processing \n", + "classifier = KNearestNeighbor()\n", "classifier.train(X_train, y_train)" - ], - "outputs": [], - "metadata": { - "collapsed": false - } - }, + ] + }, { + "cell_type": "markdown", + "metadata": {}, "source": [ - "We would now like to classify the test data with the kNN classifier. Recall that we can break down this process into two steps: \n", - "\n", - "1. First we must compute the distances between all test examples and all train examples. \n", - "2. 
Given these distances, for each test example we find the k nearest examples and have them vote for the label\n", - "\n", - "Lets begin with computing the distance matrix between all training and test examples. For example, if there are **Ntr** training examples and **Nte** test examples, this stage should result in a **Nte x Ntr** matrix where each element (i,j) is the distance between the i-th test and j-th train example.\n", - "\n", + "We would now like to classify the test data with the kNN classifier. Recall that we can break down this process into two steps: \n", + "\n", + "1. First we must compute the distances between all test examples and all train examples. \n", + "2. Given these distances, for each test example we find the k nearest examples and have them vote for the label\n", + "\n", + "Lets begin with computing the distance matrix between all training and test examples. For example, if there are **Ntr** training examples and **Nte** test examples, this stage should result in a **Nte x Ntr** matrix where each element (i,j) is the distance between the i-th test and j-th train example.\n", + "\n", "First, open `cs231n/classifiers/k_nearest_neighbor.py` and implement the function `compute_distances_two_loops` that uses a (very inefficient) double loop over all pairs of (test, train) examples and computes the distance matrix one element at a time." - ], - "cell_type": "markdown", - "metadata": {} - }, + ] + }, { - "execution_count": null, - "cell_type": "code", + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], "source": [ - "# Open cs231n/classifiers/k_nearest_neighbor.py and implement\n", - "# compute_distances_two_loops.\n", - "\n", - "# Test your implementation:\n", - "dists = classifier.compute_distances_two_loops(X_test)\n", + "# Open cs231n/classifiers/k_nearest_neighbor.py and implement\n", + "# compute_distances_two_loops.\n", + "\n", + "# Test your implementation:\n", + "dists = classifier.compute_distances_two_loops(X_test)\n", "print dists.shape" - ], - "outputs": [], + ] + }, + { + "cell_type": "code", + "execution_count": null, "metadata": { "collapsed": false - } - }, - { - "execution_count": null, - "cell_type": "code", + }, + "outputs": [], "source": [ - "# We can visualize the distance matrix: each row is a single test example and\n", - "# its distances to training examples\n", - "plt.imshow(dists, interpolation='none')\n", + "# We can visualize the distance matrix: each row is a single test example and\n", + "# its distances to training examples\n", + "plt.imshow(dists, interpolation='none')\n", "plt.show()" - ], - "outputs": [], - "metadata": { - "collapsed": false - } - }, + ] + }, { + "cell_type": "markdown", + "metadata": {}, "source": [ - "**Inline Question #1:** Notice the structured patterns in the distance matrix, where some rows or columns are visible brighter. (Note that with the default color scheme black indicates low distances while white indicates high distances.)\n", - "\n", - "- What in the data is the cause behind the distinctly bright rows?\n", + "**Inline Question #1:** Notice the structured patterns in the distance matrix, where some rows or columns are visible brighter. (Note that with the default color scheme black indicates low distances while white indicates high distances.)\n", + "\n", + "- What in the data is the cause behind the distinctly bright rows?\n", "- What causes the columns?" 
- ], - "cell_type": "markdown", - "metadata": {} - }, + ] + }, { + "cell_type": "markdown", + "metadata": {}, "source": [ - "**Your Answer**: *fill this in.*\n", + "**Your Answer**: *fill this in.*\n", "\n" - ], - "cell_type": "markdown", - "metadata": {} - }, + ] + }, { - "execution_count": null, - "cell_type": "code", - "source": [ - "# Now implement the function predict_labels and run the code below:\n", - "# We use k = 1 (which is Nearest Neighbor).\n", - "y_test_pred = classifier.predict_labels(dists, k=1)\n", - "\n", - "# Compute and print the fraction of correctly predicted examples\n", - "num_correct = np.sum(y_test_pred == y_test)\n", - "accuracy = float(num_correct) / num_test\n", - "print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)" - ], - "outputs": [], + "cell_type": "code", + "execution_count": null, "metadata": { "collapsed": false - } - }, + }, + "outputs": [], + "source": [ + "# Now implement the function predict_labels and run the code below:\n", + "# We use k = 1 (which is Nearest Neighbor).\n", + "y_test_pred = classifier.predict_labels(dists, k=1)\n", + "\n", + "# Compute and print the fraction of correctly predicted examples\n", + "num_correct = np.sum(y_test_pred == y_test)\n", + "accuracy = float(num_correct) / num_test\n", + "print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)" + ] + }, { + "cell_type": "markdown", + "metadata": {}, "source": [ "You should expect to see approximately `27%` accuracy. Now lets try out a larger `k`, say `k = 5`:" - ], - "cell_type": "markdown", - "metadata": {} - }, + ] + }, { - "execution_count": null, - "cell_type": "code", - "source": [ - "y_test_pred = classifier.predict_labels(dists, k=5)\n", - "num_correct = np.sum(y_test_pred == y_test)\n", - "accuracy = float(num_correct) / num_test\n", - "print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)" - ], - "outputs": [], + "cell_type": "code", + "execution_count": null, "metadata": { "collapsed": true - } - }, + }, + "outputs": [], + "source": [ + "y_test_pred = classifier.predict_labels(dists, k=5)\n", + "num_correct = np.sum(y_test_pred == y_test)\n", + "accuracy = float(num_correct) / num_test\n", + "print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)" + ] + }, { + "cell_type": "markdown", + "metadata": {}, "source": [ "You should expect to see a slightly better performance than with `k = 1`." - ], - "cell_type": "markdown", - "metadata": {} - }, + ] + }, { - "execution_count": null, - "cell_type": "code", + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], "source": [ - "# Now lets speed up distance matrix computation by using partial vectorization\n", - "# with one loop. Implement the function compute_distances_one_loop and run the\n", - "# code below:\n", - "dists_one = classifier.compute_distances_one_loop(X_test)\n", - "\n", - "# To ensure that our vectorized implementation is correct, we make sure that it\n", - "# agrees with the naive implementation. There are many ways to decide whether\n", - "# two matrices are similar; one of the simplest is the Frobenius norm. 
In case\n", - "# you haven't seen it before, the Frobenius norm of two matrices is the square\n", - "# root of the squared sum of differences of all elements; in other words, reshape\n", - "# the matrices into vectors and compute the Euclidean distance between them.\n", - "difference = np.linalg.norm(dists - dists_one, ord='fro')\n", - "print 'Difference was: %f' % (difference, )\n", - "if difference < 0.001:\n", - " print 'Good! The distance matrices are the same'\n", - "else:\n", + "# Now lets speed up distance matrix computation by using partial vectorization\n", + "# with one loop. Implement the function compute_distances_one_loop and run the\n", + "# code below:\n", + "dists_one = classifier.compute_distances_one_loop(X_test)\n", + "\n", + "# To ensure that our vectorized implementation is correct, we make sure that it\n", + "# agrees with the naive implementation. There are many ways to decide whether\n", + "# two matrices are similar; one of the simplest is the Frobenius norm. In case\n", + "# you haven't seen it before, the Frobenius norm of two matrices is the square\n", + "# root of the squared sum of differences of all elements; in other words, reshape\n", + "# the matrices into vectors and compute the Euclidean distance between them.\n", + "difference = np.linalg.norm(dists - dists_one, ord='fro')\n", + "print 'Difference was: %f' % (difference, )\n", + "if difference < 0.001:\n", + " print 'Good! The distance matrices are the same'\n", + "else:\n", " print 'Uh-oh! The distance matrices are different'" - ], - "outputs": [], + ] + }, + { + "cell_type": "code", + "execution_count": null, "metadata": { "collapsed": false - } - }, - { - "execution_count": null, - "cell_type": "code", + }, + "outputs": [], "source": [ - "# Now implement the fully vectorized version inside compute_distances_no_loops\n", - "# and run the code\n", - "dists_two = classifier.compute_distances_no_loops(X_test)\n", - "\n", - "# check that the distance matrix agrees with the one we computed before:\n", - "difference = np.linalg.norm(dists - dists_two, ord='fro')\n", - "print 'Difference was: %f' % (difference, )\n", - "if difference < 0.001:\n", - " print 'Good! The distance matrices are the same'\n", - "else:\n", + "# Now implement the fully vectorized version inside compute_distances_no_loops\n", + "# and run the code\n", + "dists_two = classifier.compute_distances_no_loops(X_test)\n", + "\n", + "# check that the distance matrix agrees with the one we computed before:\n", + "difference = np.linalg.norm(dists - dists_two, ord='fro')\n", + "print 'Difference was: %f' % (difference, )\n", + "if difference < 0.001:\n", + " print 'Good! The distance matrices are the same'\n", + "else:\n", " print 'Uh-oh! 
The distance matrices are different'" - ], - "outputs": [], + ] + }, + { + "cell_type": "code", + "execution_count": null, "metadata": { "collapsed": false - } - }, - { - "execution_count": null, - "cell_type": "code", + }, + "outputs": [], "source": [ - "# Let's compare how fast the implementations are\n", - "def time_function(f, *args):\n", - " \"\"\"\n", - " Call a function f with args and return the time (in seconds) that it took to execute.\n", - " \"\"\"\n", - " import time\n", - " tic = time.time()\n", - " f(*args)\n", - " toc = time.time()\n", - " return toc - tic\n", - "\n", - "two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)\n", - "print 'Two loop version took %f seconds' % two_loop_time\n", - "\n", - "one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)\n", - "print 'One loop version took %f seconds' % one_loop_time\n", - "\n", - "no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)\n", - "print 'No loop version took %f seconds' % no_loop_time\n", - "\n", + "# Let's compare how fast the implementations are\n", + "def time_function(f, *args):\n", + " \"\"\"\n", + " Call a function f with args and return the time (in seconds) that it took to execute.\n", + " \"\"\"\n", + " import time\n", + " tic = time.time()\n", + " f(*args)\n", + " toc = time.time()\n", + " return toc - tic\n", + "\n", + "two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)\n", + "print 'Two loop version took %f seconds' % two_loop_time\n", + "\n", + "one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)\n", + "print 'One loop version took %f seconds' % one_loop_time\n", + "\n", + "no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)\n", + "print 'No loop version took %f seconds' % no_loop_time\n", + "\n", "# you should see significantly faster performance with the fully vectorized implementation" - ], - "outputs": [], - "metadata": { - "collapsed": false - } - }, + ] + }, { + "cell_type": "markdown", + "metadata": {}, "source": [ - "### Cross-validation\n", - "\n", + "### Cross-validation\n", + "\n", "We have implemented the k-Nearest Neighbor classifier but we set the value k = 5 arbitrarily. We will now determine the best value of this hyperparameter with cross-validation." - ], - "cell_type": "markdown", - "metadata": {} - }, + ] + }, { - "execution_count": null, - "cell_type": "code", + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], "source": [ - "num_folds = 5\n", - "k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]\n", - "\n", - "X_train_folds = []\n", - "y_train_folds = []\n", - "################################################################################\n", - "# TODO: #\n", - "# Split up the training data into folds. After splitting, X_train_folds and #\n", - "# y_train_folds should each be lists of length num_folds, where #\n", - "# y_train_folds[i] is the label vector for the points in X_train_folds[i]. #\n", - "# Hint: Look up the numpy array_split function. #\n", - "################################################################################\n", - "pass\n", - "################################################################################\n", - "# END OF YOUR CODE #\n", - "################################################################################\n", - "\n", - "# A dictionary holding the accuracies for different values of k that we find\n", - "# when running cross-validation. 
After running cross-validation,\n", - "# k_to_accuracies[k] should be a list of length num_folds giving the different\n", - "# accuracy values that we found when using that value of k.\n", - "k_to_accuracies = {}\n", - "\n", - "\n", - "################################################################################\n", - "# TODO: #\n", - "# Perform k-fold cross validation to find the best value of k. For each #\n", - "# possible value of k, run the k-nearest-neighbor algorithm num_folds times, #\n", - "# where in each case you use all but one of the folds as training data and the #\n", - "# last fold as a validation set. Store the accuracies for all fold and all #\n", - "# values of k in the k_to_accuracies dictionary. #\n", - "################################################################################\n", - "pass\n", - "################################################################################\n", - "# END OF YOUR CODE #\n", - "################################################################################\n", - "\n", - "# Print out the computed accuracies\n", - "for k in sorted(k_to_accuracies):\n", - " for accuracy in k_to_accuracies[k]:\n", + "num_folds = 5\n", + "k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]\n", + "\n", + "X_train_folds = []\n", + "y_train_folds = []\n", + "################################################################################\n", + "# TODO: #\n", + "# Split up the training data into folds. After splitting, X_train_folds and #\n", + "# y_train_folds should each be lists of length num_folds, where #\n", + "# y_train_folds[i] is the label vector for the points in X_train_folds[i]. #\n", + "# Hint: Look up the numpy array_split function. #\n", + "################################################################################\n", + "pass\n", + "################################################################################\n", + "# END OF YOUR CODE #\n", + "################################################################################\n", + "\n", + "# A dictionary holding the accuracies for different values of k that we find\n", + "# when running cross-validation. After running cross-validation,\n", + "# k_to_accuracies[k] should be a list of length num_folds giving the different\n", + "# accuracy values that we found when using that value of k.\n", + "k_to_accuracies = {}\n", + "\n", + "\n", + "################################################################################\n", + "# TODO: #\n", + "# Perform k-fold cross validation to find the best value of k. For each #\n", + "# possible value of k, run the k-nearest-neighbor algorithm num_folds times, #\n", + "# where in each case you use all but one of the folds as training data and the #\n", + "# last fold as a validation set. Store the accuracies for all fold and all #\n", + "# values of k in the k_to_accuracies dictionary. 
#\n", + "################################################################################\n", + "pass\n", + "################################################################################\n", + "# END OF YOUR CODE #\n", + "################################################################################\n", + "\n", + "# Print out the computed accuracies\n", + "for k in sorted(k_to_accuracies):\n", + " for accuracy in k_to_accuracies[k]:\n", " print 'k = %d, accuracy = %f' % (k, accuracy)" - ], - "outputs": [], + ] + }, + { + "cell_type": "code", + "execution_count": null, "metadata": { "collapsed": false - } - }, - { - "execution_count": null, - "cell_type": "code", + }, + "outputs": [], "source": [ - "# plot the raw observations\n", - "for k in k_choices:\n", - " accuracies = k_to_accuracies[k]\n", - " plt.scatter([k] * len(accuracies), accuracies)\n", - "\n", - "# plot the trend line with error bars that correspond to standard deviation\n", - "accuracies_mean = np.array([np.mean(v) for k,v in sorted(k_to_accuracies.items())])\n", - "accuracies_std = np.array([np.std(v) for k,v in sorted(k_to_accuracies.items())])\n", - "plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)\n", - "plt.title('Cross-validation on k')\n", - "plt.xlabel('k')\n", - "plt.ylabel('Cross-validation accuracy')\n", + "# plot the raw observations\n", + "for k in k_choices:\n", + " accuracies = k_to_accuracies[k]\n", + " plt.scatter([k] * len(accuracies), accuracies)\n", + "\n", + "# plot the trend line with error bars that correspond to standard deviation\n", + "accuracies_mean = np.array([np.mean(v) for k,v in sorted(k_to_accuracies.items())])\n", + "accuracies_std = np.array([np.std(v) for k,v in sorted(k_to_accuracies.items())])\n", + "plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)\n", + "plt.title('Cross-validation on k')\n", + "plt.xlabel('k')\n", + "plt.ylabel('Cross-validation accuracy')\n", "plt.show()" - ], - "outputs": [], + ] + }, + { + "cell_type": "code", + "execution_count": null, "metadata": { "collapsed": false - } - }, - { - "execution_count": null, - "cell_type": "code", + }, + "outputs": [], "source": [ - "# Based on the cross-validation results above, choose the best value for k, \n", - "# retrain the classifier using all the training data, and test it on the test\n", - "# data. You should be able to get above 28% accuracy on the test data.\n", - "best_k = 1\n", - "\n", - "classifier = KNearestNeighbor()\n", - "classifier.train(X_train, y_train)\n", - "y_test_pred = classifier.predict(X_test, k=best_k)\n", - "\n", - "# Compute and display the accuracy\n", - "num_correct = np.sum(y_test_pred == y_test)\n", - "accuracy = float(num_correct) / num_test\n", + "# Based on the cross-validation results above, choose the best value for k, \n", + "# retrain the classifier using all the training data, and test it on the test\n", + "# data. 
You should be able to get above 28% accuracy on the test data.\n", + "best_k = 1\n", + "\n", + "classifier = KNearestNeighbor()\n", + "classifier.train(X_train, y_train)\n", + "y_test_pred = classifier.predict(X_test, k=best_k)\n", + "\n", + "# Compute and display the accuracy\n", + "num_correct = np.sum(y_test_pred == y_test)\n", + "accuracy = float(num_correct) / num_test\n", "print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)" - ], - "outputs": [], - "metadata": { - "collapsed": false - } + ] } - ], + ], "metadata": { "kernelspec": { - "display_name": "Python 2", - "name": "python2", - "language": "python" - }, + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, "language_info": { - "mimetype": "text/x-python", - "nbconvert_exporter": "python", - "name": "python", - "file_extension": ".py", - "version": "2.7.3", - "pygments_lexer": "ipython2", "codemirror_mode": { - "version": 2, - "name": "ipython" - } + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.5.1" } - } -} \ No newline at end of file + }, + "nbformat": 4, + "nbformat_minor": 0 +} From 67df575246ae1bb2608385589a8fab76975bab63 Mon Sep 17 00:00:00 2001 From: MaybeS Date: Mon, 18 Apr 2016 17:48:30 +0900 Subject: [PATCH 067/199] remove etc --- .../assignment1/features-Copy1.ipynb | 338 ------------------ assignments2016/assignment1/features_trans.md | 9 - 2 files changed, 347 deletions(-) delete mode 100644 assignments2016/assignment1/features-Copy1.ipynb delete mode 100644 assignments2016/assignment1/features_trans.md diff --git a/assignments2016/assignment1/features-Copy1.ipynb b/assignments2016/assignment1/features-Copy1.ipynb deleted file mode 100644 index 99f95458..00000000 --- a/assignments2016/assignment1/features-Copy1.ipynb +++ /dev/null @@ -1,338 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 이미지 특징 연습\n", - "*이 워크시트를 완성하고 제출하세요. (출력물과 워크시트에 포함되지 않은 코드들을 포함해서) 더 자세한 정보는 코스 웹사이트인 [숙제 페이지](http://vision.stanford.edu/teaching/cs231n/assignments.html)에서 볼 수 있습니다.*\n", - "\n", - "우리는 입력된 이미지의 픽셀에 선형 분류기를 학습시켜 이미지 분류 작업에 적절한 성능을 얻을 수 있음을 알고있습니다.\n", - "이번 연습에서 우리는 단순 픽셀을 계산하기 위해 단순 픽셀(화소)이 아닌 특징을 통해 선형 분류기를 훈련시켜 우리의 분류 성능을 향상시킬 수 있음을 보일 것입니다.\n", - "\n", - "이번 연습을 위한 모든 해야할 작업들은 이 notebook에서 수행됩니다." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "import random\n", - "import numpy as np\n", - "from cs231n.data_utils import load_CIFAR10\n", - "import matplotlib.pyplot as plt\n", - "%matplotlib inline\n", - "plt.rcParams['figure.figsize'] = (10.0, 8.0) # 기본 그래프 크기 설정\n", - "plt.rcParams['image.interpolation'] = 'nearest'\n", - "plt.rcParams['image.cmap'] = 'gray'\n", - "\n", - "# auto-reloading을 위한 외부 모듈\n", - "# http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython를 보세요.\n", - "%load_ext autoreload\n", - "%autoreload 2" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 데이터 불러오기\n", - "이전 연습에서 처럼, 우리는 CIFAR-10 데이터를 불러올 것입니다." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "from cs231n.features import color_histogram_hsv, hog_feature\n", - "\n", - "def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000):\n", - " # CIFAR-10 데이터를 불러옵니다.\n", - " cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'\n", - " X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)\n", - " \n", - " # 데이터 표본\n", - " mask = range(num_training, num_training + num_validation)\n", - " X_val = X_train[mask]\n", - " y_val = y_train[mask]\n", - " mask = range(num_training)\n", - " X_train = X_train[mask]\n", - " y_train = y_train[mask]\n", - " mask = range(num_test)\n", - " X_test = X_test[mask]\n", - " y_test = y_test[mask]\n", - "\n", - " return X_train, y_train, X_val, y_val, X_test, y_test\n", - "\n", - "X_train, y_train, X_val, y_val, X_test, y_test = get_CIFAR10_data()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 특징 추출하기\n", - "우리는 각 이미지 마다 그라데이션의 히스토그램(HOG)를 HSV색 공간에서의 색상 채널을 사용한 색상 히스토그램만큼 잘 계산할 것입니다. 우리는 우리의 마지막 특징 벡터를 각 이미지마다 HOG와 색상 히스토그램 특징 벡터를 이용하여 형성합니다.\n", - "\n", - "Roughly speaking, HOG should capture the texture of the image while ignoring\n", - "color information, and the color histogram represents the color of the input\n", - "image while ignoring texture. As a result, we expect that using both together\n", - "ought to work better than using either alone. Verifying this assumption would\n", - "be a good thing to try for the bonus section.\n", - "\n", - "The `hog_feature` and `color_histogram_hsv` functions both operate on a single\n", - "image and return a feature vector for that image. The extract_features\n", - "function takes a set of images and a list of feature functions and evaluates\n", - "each feature function on each image, storing the results in a matrix where\n", - "each column is the concatenation of all feature vectors for a single image." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "from cs231n.features import *\n", - "\n", - "num_color_bins = 10 # Number of bins in the color histogram\n", - "feature_fns = [hog_feature, lambda img: color_histogram_hsv(img, nbin=num_color_bins)]\n", - "X_train_feats = extract_features(X_train, feature_fns, verbose=True)\n", - "X_val_feats = extract_features(X_val, feature_fns)\n", - "X_test_feats = extract_features(X_test, feature_fns)\n", - "\n", - "# Preprocessing: Subtract the mean feature\n", - "mean_feat = np.mean(X_train_feats, axis=0, keepdims=True)\n", - "X_train_feats -= mean_feat\n", - "X_val_feats -= mean_feat\n", - "X_test_feats -= mean_feat\n", - "\n", - "# Preprocessing: Divide by standard deviation. 
This ensures that each feature\n", - "# has roughly the same scale.\n", - "std_feat = np.std(X_train_feats, axis=0, keepdims=True)\n", - "X_train_feats /= std_feat\n", - "X_val_feats /= std_feat\n", - "X_test_feats /= std_feat\n", - "\n", - "# Preprocessing: Add a bias dimension\n", - "X_train_feats = np.hstack([X_train_feats, np.ones((X_train_feats.shape[0], 1))])\n", - "X_val_feats = np.hstack([X_val_feats, np.ones((X_val_feats.shape[0], 1))])\n", - "X_test_feats = np.hstack([X_test_feats, np.ones((X_test_feats.shape[0], 1))])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Train SVM on features\n", - "Using the multiclass SVM code developed earlier in the assignment, train SVMs on top of the features extracted above; this should achieve better results than training SVMs directly on top of raw pixels." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# Use the validation set to tune the learning rate and regularization strength\n", - "\n", - "from cs231n.classifiers.linear_classifier import LinearSVM\n", - "\n", - "learning_rates = [1e-9, 1e-8, 1e-7]\n", - "regularization_strengths = [1e5, 1e6, 1e7]\n", - "\n", - "results = {}\n", - "best_val = -1\n", - "best_svm = None\n", - "\n", - "pass\n", - "################################################################################\n", - "# TODO: #\n", - "# Use the validation set to set the learning rate and regularization strength. #\n", - "# This should be identical to the validation that you did for the SVM; save #\n", - "# the best trained classifer in best_svm. You might also want to play #\n", - "# with different numbers of bins in the color histogram. If you are careful #\n", - "# you should be able to get accuracy of near 0.44 on the validation set. #\n", - "################################################################################\n", - "pass\n", - "################################################################################\n", - "# END OF YOUR CODE #\n", - "################################################################################\n", - "\n", - "# Print out results.\n", - "for lr, reg in sorted(results):\n", - " train_accuracy, val_accuracy = results[(lr, reg)]\n", - " print 'lr %e reg %e train accuracy: %f val accuracy: %f' % (\n", - " lr, reg, train_accuracy, val_accuracy)\n", - " \n", - "print 'best validation accuracy achieved during cross-validation: %f' % best_val" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# Evaluate your trained SVM on the test set\n", - "y_test_pred = best_svm.predict(X_test_feats)\n", - "test_accuracy = np.mean(y_test == y_test_pred)\n", - "print test_accuracy" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# An important way to gain intuition about how an algorithm works is to\n", - "# visualize the mistakes that it makes. In this visualization, we show examples\n", - "# of images that are misclassified by our current system. 
The first column\n", - "# shows images that our system labeled as \"plane\" but whose true label is\n", - "# something other than \"plane\".\n", - "\n", - "examples_per_class = 8\n", - "classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n", - "for cls, cls_name in enumerate(classes):\n", - " idxs = np.where((y_test != cls) & (y_test_pred == cls))[0]\n", - " idxs = np.random.choice(idxs, examples_per_class, replace=False)\n", - " for i, idx in enumerate(idxs):\n", - " plt.subplot(examples_per_class, len(classes), i * len(classes) + cls + 1)\n", - " plt.imshow(X_test[idx].astype('uint8'))\n", - " plt.axis('off')\n", - " if i == 0:\n", - " plt.title(cls_name)\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Inline question 1:\n", - "Describe the misclassification results that you see. Do they make sense?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Neural Network on image features\n", - "Earlier in this assigment we saw that training a two-layer neural network on raw pixels achieved better classification performance than linear classifiers on raw pixels. In this notebook we have seen that linear classifiers on image features outperform linear classifiers on raw pixels. \n", - "\n", - "For completeness, we should also try training a neural network on image features. This approach should outperform all previous approaches: you should easily be able to achieve over 55% classification accuracy on the test set; our best model achieves about 60% classification accuracy." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "print X_train_feats.shape" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "from cs231n.classifiers.neural_net import TwoLayerNet\n", - "\n", - "input_dim = X_train_feats.shape[1]\n", - "hidden_dim = 500\n", - "num_classes = 10\n", - "\n", - "net = TwoLayerNet(input_dim, hidden_dim, num_classes)\n", - "best_net = None\n", - "\n", - "################################################################################\n", - "# TODO: Train a two-layer neural network on image features. You may want to #\n", - "# cross-validate various parameters as in previous sections. Store your best #\n", - "# model in the best_net variable. #\n", - "################################################################################\n", - "pass\n", - "################################################################################\n", - "# END OF YOUR CODE #\n", - "################################################################################" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# Run your neural net classifier on the test set. You should be able to\n", - "# get more than 55% accuracy.\n", - "\n", - "test_acc = (net.predict(X_test_feats) == y_test).mean()\n", - "print test_acc" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Bonus: Design your own features!\n", - "\n", - "You have seen that simple image features can improve classification performance. 
So far we have tried HOG and color histograms, but other types of features may be able to achieve even better classification performance.\n", - "\n", - "For bonus points, design and implement a new type of feature and use it for image classification on CIFAR-10. Explain how your feature works and why you expect it to be useful for image classification. Implement it in this notebook, cross-validate any hyperparameters, and compare its performance to the HOG + Color histogram baseline." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Bonus: Do something extra!\n", - "Use the material and code we have presented in this assignment to do something interesting. Was there another question we should have asked? Did any cool ideas pop into your head as you were working on the assignment? This is your chance to show off!" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.5.1" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} diff --git a/assignments2016/assignment1/features_trans.md b/assignments2016/assignment1/features_trans.md deleted file mode 100644 index cca77aeb..00000000 --- a/assignments2016/assignment1/features_trans.md +++ /dev/null @@ -1,9 +0,0 @@ -# Words -Image features: 이미지 특징 -Plots: 그래프 -deviation: -completeness: -material: - -You might also want to play with different numbers of bins in the color histogram. -아마 다른 개수의 색상 히스토그램안의 bin을 사용하여 해보고 싶을 수 있습니다. From a8cbdae418da2b6dce74422f7bb63754b91225ee Mon Sep 17 00:00:00 2001 From: MaybeS Date: Mon, 18 Apr 2016 22:47:03 +0900 Subject: [PATCH 068/199] Update assignment1/features.ipynb --- .../features-Copy1-checkpoint.ipynb | 338 ------------- .../features-checkpoint.ipynb | 338 ------------- .../.ipynb_checkpoints/knn-checkpoint.ipynb | 459 ------------------ assignments2016/assignment1/features.ipynb | 29 +- 4 files changed, 15 insertions(+), 1149 deletions(-) delete mode 100644 assignments2016/assignment1/.ipynb_checkpoints/features-Copy1-checkpoint.ipynb delete mode 100644 assignments2016/assignment1/.ipynb_checkpoints/features-checkpoint.ipynb delete mode 100644 assignments2016/assignment1/.ipynb_checkpoints/knn-checkpoint.ipynb diff --git a/assignments2016/assignment1/.ipynb_checkpoints/features-Copy1-checkpoint.ipynb b/assignments2016/assignment1/.ipynb_checkpoints/features-Copy1-checkpoint.ipynb deleted file mode 100644 index 99f95458..00000000 --- a/assignments2016/assignment1/.ipynb_checkpoints/features-Copy1-checkpoint.ipynb +++ /dev/null @@ -1,338 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 이미지 특징 연습\n", - "*이 워크시트를 완성하고 제출하세요. (출력물과 워크시트에 포함되지 않은 코드들을 포함해서) 더 자세한 정보는 코스 웹사이트인 [숙제 페이지](http://vision.stanford.edu/teaching/cs231n/assignments.html)에서 볼 수 있습니다.*\n", - "\n", - "우리는 입력된 이미지의 픽셀에 선형 분류기를 학습시켜 이미지 분류 작업에 적절한 성능을 얻을 수 있음을 알고있습니다.\n", - "이번 연습에서 우리는 단순 픽셀을 계산하기 위해 단순 픽셀(화소)이 아닌 특징을 통해 선형 분류기를 훈련시켜 우리의 분류 성능을 향상시킬 수 있음을 보일 것입니다.\n", - "\n", - "이번 연습을 위한 모든 해야할 작업들은 이 notebook에서 수행됩니다." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "import random\n", - "import numpy as np\n", - "from cs231n.data_utils import load_CIFAR10\n", - "import matplotlib.pyplot as plt\n", - "%matplotlib inline\n", - "plt.rcParams['figure.figsize'] = (10.0, 8.0) # 기본 그래프 크기 설정\n", - "plt.rcParams['image.interpolation'] = 'nearest'\n", - "plt.rcParams['image.cmap'] = 'gray'\n", - "\n", - "# auto-reloading을 위한 외부 모듈\n", - "# http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython를 보세요.\n", - "%load_ext autoreload\n", - "%autoreload 2" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 데이터 불러오기\n", - "이전 연습에서 처럼, 우리는 CIFAR-10 데이터를 불러올 것입니다." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "from cs231n.features import color_histogram_hsv, hog_feature\n", - "\n", - "def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000):\n", - " # CIFAR-10 데이터를 불러옵니다.\n", - " cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'\n", - " X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)\n", - " \n", - " # 데이터 표본\n", - " mask = range(num_training, num_training + num_validation)\n", - " X_val = X_train[mask]\n", - " y_val = y_train[mask]\n", - " mask = range(num_training)\n", - " X_train = X_train[mask]\n", - " y_train = y_train[mask]\n", - " mask = range(num_test)\n", - " X_test = X_test[mask]\n", - " y_test = y_test[mask]\n", - "\n", - " return X_train, y_train, X_val, y_val, X_test, y_test\n", - "\n", - "X_train, y_train, X_val, y_val, X_test, y_test = get_CIFAR10_data()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 특징 추출하기\n", - "우리는 각 이미지 마다 그라데이션의 히스토그램(HOG)를 HSV색 공간에서의 색상 채널을 사용한 색상 히스토그램만큼 잘 계산할 것입니다. 우리는 우리의 마지막 특징 벡터를 각 이미지마다 HOG와 색상 히스토그램 특징 벡터를 이용하여 형성합니다.\n", - "\n", - "Roughly speaking, HOG should capture the texture of the image while ignoring\n", - "color information, and the color histogram represents the color of the input\n", - "image while ignoring texture. As a result, we expect that using both together\n", - "ought to work better than using either alone. Verifying this assumption would\n", - "be a good thing to try for the bonus section.\n", - "\n", - "The `hog_feature` and `color_histogram_hsv` functions both operate on a single\n", - "image and return a feature vector for that image. The extract_features\n", - "function takes a set of images and a list of feature functions and evaluates\n", - "each feature function on each image, storing the results in a matrix where\n", - "each column is the concatenation of all feature vectors for a single image." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "from cs231n.features import *\n", - "\n", - "num_color_bins = 10 # Number of bins in the color histogram\n", - "feature_fns = [hog_feature, lambda img: color_histogram_hsv(img, nbin=num_color_bins)]\n", - "X_train_feats = extract_features(X_train, feature_fns, verbose=True)\n", - "X_val_feats = extract_features(X_val, feature_fns)\n", - "X_test_feats = extract_features(X_test, feature_fns)\n", - "\n", - "# Preprocessing: Subtract the mean feature\n", - "mean_feat = np.mean(X_train_feats, axis=0, keepdims=True)\n", - "X_train_feats -= mean_feat\n", - "X_val_feats -= mean_feat\n", - "X_test_feats -= mean_feat\n", - "\n", - "# Preprocessing: Divide by standard deviation. This ensures that each feature\n", - "# has roughly the same scale.\n", - "std_feat = np.std(X_train_feats, axis=0, keepdims=True)\n", - "X_train_feats /= std_feat\n", - "X_val_feats /= std_feat\n", - "X_test_feats /= std_feat\n", - "\n", - "# Preprocessing: Add a bias dimension\n", - "X_train_feats = np.hstack([X_train_feats, np.ones((X_train_feats.shape[0], 1))])\n", - "X_val_feats = np.hstack([X_val_feats, np.ones((X_val_feats.shape[0], 1))])\n", - "X_test_feats = np.hstack([X_test_feats, np.ones((X_test_feats.shape[0], 1))])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Train SVM on features\n", - "Using the multiclass SVM code developed earlier in the assignment, train SVMs on top of the features extracted above; this should achieve better results than training SVMs directly on top of raw pixels." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# Use the validation set to tune the learning rate and regularization strength\n", - "\n", - "from cs231n.classifiers.linear_classifier import LinearSVM\n", - "\n", - "learning_rates = [1e-9, 1e-8, 1e-7]\n", - "regularization_strengths = [1e5, 1e6, 1e7]\n", - "\n", - "results = {}\n", - "best_val = -1\n", - "best_svm = None\n", - "\n", - "pass\n", - "################################################################################\n", - "# TODO: #\n", - "# Use the validation set to set the learning rate and regularization strength. #\n", - "# This should be identical to the validation that you did for the SVM; save #\n", - "# the best trained classifer in best_svm. You might also want to play #\n", - "# with different numbers of bins in the color histogram. If you are careful #\n", - "# you should be able to get accuracy of near 0.44 on the validation set. 
#\n", - "################################################################################\n", - "pass\n", - "################################################################################\n", - "# END OF YOUR CODE #\n", - "################################################################################\n", - "\n", - "# Print out results.\n", - "for lr, reg in sorted(results):\n", - " train_accuracy, val_accuracy = results[(lr, reg)]\n", - " print 'lr %e reg %e train accuracy: %f val accuracy: %f' % (\n", - " lr, reg, train_accuracy, val_accuracy)\n", - " \n", - "print 'best validation accuracy achieved during cross-validation: %f' % best_val" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# Evaluate your trained SVM on the test set\n", - "y_test_pred = best_svm.predict(X_test_feats)\n", - "test_accuracy = np.mean(y_test == y_test_pred)\n", - "print test_accuracy" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# An important way to gain intuition about how an algorithm works is to\n", - "# visualize the mistakes that it makes. In this visualization, we show examples\n", - "# of images that are misclassified by our current system. The first column\n", - "# shows images that our system labeled as \"plane\" but whose true label is\n", - "# something other than \"plane\".\n", - "\n", - "examples_per_class = 8\n", - "classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n", - "for cls, cls_name in enumerate(classes):\n", - " idxs = np.where((y_test != cls) & (y_test_pred == cls))[0]\n", - " idxs = np.random.choice(idxs, examples_per_class, replace=False)\n", - " for i, idx in enumerate(idxs):\n", - " plt.subplot(examples_per_class, len(classes), i * len(classes) + cls + 1)\n", - " plt.imshow(X_test[idx].astype('uint8'))\n", - " plt.axis('off')\n", - " if i == 0:\n", - " plt.title(cls_name)\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Inline question 1:\n", - "Describe the misclassification results that you see. Do they make sense?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Neural Network on image features\n", - "Earlier in this assigment we saw that training a two-layer neural network on raw pixels achieved better classification performance than linear classifiers on raw pixels. In this notebook we have seen that linear classifiers on image features outperform linear classifiers on raw pixels. \n", - "\n", - "For completeness, we should also try training a neural network on image features. This approach should outperform all previous approaches: you should easily be able to achieve over 55% classification accuracy on the test set; our best model achieves about 60% classification accuracy." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "print X_train_feats.shape" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "from cs231n.classifiers.neural_net import TwoLayerNet\n", - "\n", - "input_dim = X_train_feats.shape[1]\n", - "hidden_dim = 500\n", - "num_classes = 10\n", - "\n", - "net = TwoLayerNet(input_dim, hidden_dim, num_classes)\n", - "best_net = None\n", - "\n", - "################################################################################\n", - "# TODO: Train a two-layer neural network on image features. You may want to #\n", - "# cross-validate various parameters as in previous sections. Store your best #\n", - "# model in the best_net variable. #\n", - "################################################################################\n", - "pass\n", - "################################################################################\n", - "# END OF YOUR CODE #\n", - "################################################################################" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# Run your neural net classifier on the test set. You should be able to\n", - "# get more than 55% accuracy.\n", - "\n", - "test_acc = (net.predict(X_test_feats) == y_test).mean()\n", - "print test_acc" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Bonus: Design your own features!\n", - "\n", - "You have seen that simple image features can improve classification performance. So far we have tried HOG and color histograms, but other types of features may be able to achieve even better classification performance.\n", - "\n", - "For bonus points, design and implement a new type of feature and use it for image classification on CIFAR-10. Explain how your feature works and why you expect it to be useful for image classification. Implement it in this notebook, cross-validate any hyperparameters, and compare its performance to the HOG + Color histogram baseline." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Bonus: Do something extra!\n", - "Use the material and code we have presented in this assignment to do something interesting. Was there another question we should have asked? Did any cool ideas pop into your head as you were working on the assignment? This is your chance to show off!" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.5.1" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} diff --git a/assignments2016/assignment1/.ipynb_checkpoints/features-checkpoint.ipynb b/assignments2016/assignment1/.ipynb_checkpoints/features-checkpoint.ipynb deleted file mode 100644 index 99f95458..00000000 --- a/assignments2016/assignment1/.ipynb_checkpoints/features-checkpoint.ipynb +++ /dev/null @@ -1,338 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 이미지 특징 연습\n", - "*이 워크시트를 완성하고 제출하세요. 
(출력물과 워크시트에 포함되지 않은 코드들을 포함해서) 더 자세한 정보는 코스 웹사이트인 [숙제 페이지](http://vision.stanford.edu/teaching/cs231n/assignments.html)에서 볼 수 있습니다.*\n", - "\n", - "우리는 입력된 이미지의 픽셀에 선형 분류기를 학습시켜 이미지 분류 작업에 적절한 성능을 얻을 수 있음을 알고있습니다.\n", - "이번 연습에서 우리는 단순 픽셀을 계산하기 위해 단순 픽셀(화소)이 아닌 특징을 통해 선형 분류기를 훈련시켜 우리의 분류 성능을 향상시킬 수 있음을 보일 것입니다.\n", - "\n", - "이번 연습을 위한 모든 해야할 작업들은 이 notebook에서 수행됩니다." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "import random\n", - "import numpy as np\n", - "from cs231n.data_utils import load_CIFAR10\n", - "import matplotlib.pyplot as plt\n", - "%matplotlib inline\n", - "plt.rcParams['figure.figsize'] = (10.0, 8.0) # 기본 그래프 크기 설정\n", - "plt.rcParams['image.interpolation'] = 'nearest'\n", - "plt.rcParams['image.cmap'] = 'gray'\n", - "\n", - "# auto-reloading을 위한 외부 모듈\n", - "# http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython를 보세요.\n", - "%load_ext autoreload\n", - "%autoreload 2" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 데이터 불러오기\n", - "이전 연습에서 처럼, 우리는 CIFAR-10 데이터를 불러올 것입니다." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "from cs231n.features import color_histogram_hsv, hog_feature\n", - "\n", - "def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000):\n", - " # CIFAR-10 데이터를 불러옵니다.\n", - " cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'\n", - " X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)\n", - " \n", - " # 데이터 표본\n", - " mask = range(num_training, num_training + num_validation)\n", - " X_val = X_train[mask]\n", - " y_val = y_train[mask]\n", - " mask = range(num_training)\n", - " X_train = X_train[mask]\n", - " y_train = y_train[mask]\n", - " mask = range(num_test)\n", - " X_test = X_test[mask]\n", - " y_test = y_test[mask]\n", - "\n", - " return X_train, y_train, X_val, y_val, X_test, y_test\n", - "\n", - "X_train, y_train, X_val, y_val, X_test, y_test = get_CIFAR10_data()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 특징 추출하기\n", - "우리는 각 이미지 마다 그라데이션의 히스토그램(HOG)를 HSV색 공간에서의 색상 채널을 사용한 색상 히스토그램만큼 잘 계산할 것입니다. 우리는 우리의 마지막 특징 벡터를 각 이미지마다 HOG와 색상 히스토그램 특징 벡터를 이용하여 형성합니다.\n", - "\n", - "Roughly speaking, HOG should capture the texture of the image while ignoring\n", - "color information, and the color histogram represents the color of the input\n", - "image while ignoring texture. As a result, we expect that using both together\n", - "ought to work better than using either alone. Verifying this assumption would\n", - "be a good thing to try for the bonus section.\n", - "\n", - "The `hog_feature` and `color_histogram_hsv` functions both operate on a single\n", - "image and return a feature vector for that image. The extract_features\n", - "function takes a set of images and a list of feature functions and evaluates\n", - "each feature function on each image, storing the results in a matrix where\n", - "each column is the concatenation of all feature vectors for a single image." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "from cs231n.features import *\n", - "\n", - "num_color_bins = 10 # Number of bins in the color histogram\n", - "feature_fns = [hog_feature, lambda img: color_histogram_hsv(img, nbin=num_color_bins)]\n", - "X_train_feats = extract_features(X_train, feature_fns, verbose=True)\n", - "X_val_feats = extract_features(X_val, feature_fns)\n", - "X_test_feats = extract_features(X_test, feature_fns)\n", - "\n", - "# Preprocessing: Subtract the mean feature\n", - "mean_feat = np.mean(X_train_feats, axis=0, keepdims=True)\n", - "X_train_feats -= mean_feat\n", - "X_val_feats -= mean_feat\n", - "X_test_feats -= mean_feat\n", - "\n", - "# Preprocessing: Divide by standard deviation. This ensures that each feature\n", - "# has roughly the same scale.\n", - "std_feat = np.std(X_train_feats, axis=0, keepdims=True)\n", - "X_train_feats /= std_feat\n", - "X_val_feats /= std_feat\n", - "X_test_feats /= std_feat\n", - "\n", - "# Preprocessing: Add a bias dimension\n", - "X_train_feats = np.hstack([X_train_feats, np.ones((X_train_feats.shape[0], 1))])\n", - "X_val_feats = np.hstack([X_val_feats, np.ones((X_val_feats.shape[0], 1))])\n", - "X_test_feats = np.hstack([X_test_feats, np.ones((X_test_feats.shape[0], 1))])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Train SVM on features\n", - "Using the multiclass SVM code developed earlier in the assignment, train SVMs on top of the features extracted above; this should achieve better results than training SVMs directly on top of raw pixels." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# Use the validation set to tune the learning rate and regularization strength\n", - "\n", - "from cs231n.classifiers.linear_classifier import LinearSVM\n", - "\n", - "learning_rates = [1e-9, 1e-8, 1e-7]\n", - "regularization_strengths = [1e5, 1e6, 1e7]\n", - "\n", - "results = {}\n", - "best_val = -1\n", - "best_svm = None\n", - "\n", - "pass\n", - "################################################################################\n", - "# TODO: #\n", - "# Use the validation set to set the learning rate and regularization strength. #\n", - "# This should be identical to the validation that you did for the SVM; save #\n", - "# the best trained classifer in best_svm. You might also want to play #\n", - "# with different numbers of bins in the color histogram. If you are careful #\n", - "# you should be able to get accuracy of near 0.44 on the validation set. 
#\n", - "################################################################################\n", - "pass\n", - "################################################################################\n", - "# END OF YOUR CODE #\n", - "################################################################################\n", - "\n", - "# Print out results.\n", - "for lr, reg in sorted(results):\n", - " train_accuracy, val_accuracy = results[(lr, reg)]\n", - " print 'lr %e reg %e train accuracy: %f val accuracy: %f' % (\n", - " lr, reg, train_accuracy, val_accuracy)\n", - " \n", - "print 'best validation accuracy achieved during cross-validation: %f' % best_val" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# Evaluate your trained SVM on the test set\n", - "y_test_pred = best_svm.predict(X_test_feats)\n", - "test_accuracy = np.mean(y_test == y_test_pred)\n", - "print test_accuracy" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# An important way to gain intuition about how an algorithm works is to\n", - "# visualize the mistakes that it makes. In this visualization, we show examples\n", - "# of images that are misclassified by our current system. The first column\n", - "# shows images that our system labeled as \"plane\" but whose true label is\n", - "# something other than \"plane\".\n", - "\n", - "examples_per_class = 8\n", - "classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n", - "for cls, cls_name in enumerate(classes):\n", - " idxs = np.where((y_test != cls) & (y_test_pred == cls))[0]\n", - " idxs = np.random.choice(idxs, examples_per_class, replace=False)\n", - " for i, idx in enumerate(idxs):\n", - " plt.subplot(examples_per_class, len(classes), i * len(classes) + cls + 1)\n", - " plt.imshow(X_test[idx].astype('uint8'))\n", - " plt.axis('off')\n", - " if i == 0:\n", - " plt.title(cls_name)\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Inline question 1:\n", - "Describe the misclassification results that you see. Do they make sense?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Neural Network on image features\n", - "Earlier in this assigment we saw that training a two-layer neural network on raw pixels achieved better classification performance than linear classifiers on raw pixels. In this notebook we have seen that linear classifiers on image features outperform linear classifiers on raw pixels. \n", - "\n", - "For completeness, we should also try training a neural network on image features. This approach should outperform all previous approaches: you should easily be able to achieve over 55% classification accuracy on the test set; our best model achieves about 60% classification accuracy." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "print X_train_feats.shape" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "from cs231n.classifiers.neural_net import TwoLayerNet\n", - "\n", - "input_dim = X_train_feats.shape[1]\n", - "hidden_dim = 500\n", - "num_classes = 10\n", - "\n", - "net = TwoLayerNet(input_dim, hidden_dim, num_classes)\n", - "best_net = None\n", - "\n", - "################################################################################\n", - "# TODO: Train a two-layer neural network on image features. You may want to #\n", - "# cross-validate various parameters as in previous sections. Store your best #\n", - "# model in the best_net variable. #\n", - "################################################################################\n", - "pass\n", - "################################################################################\n", - "# END OF YOUR CODE #\n", - "################################################################################" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# Run your neural net classifier on the test set. You should be able to\n", - "# get more than 55% accuracy.\n", - "\n", - "test_acc = (net.predict(X_test_feats) == y_test).mean()\n", - "print test_acc" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Bonus: Design your own features!\n", - "\n", - "You have seen that simple image features can improve classification performance. So far we have tried HOG and color histograms, but other types of features may be able to achieve even better classification performance.\n", - "\n", - "For bonus points, design and implement a new type of feature and use it for image classification on CIFAR-10. Explain how your feature works and why you expect it to be useful for image classification. Implement it in this notebook, cross-validate any hyperparameters, and compare its performance to the HOG + Color histogram baseline." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Bonus: Do something extra!\n", - "Use the material and code we have presented in this assignment to do something interesting. Was there another question we should have asked? Did any cool ideas pop into your head as you were working on the assignment? This is your chance to show off!" 
- ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.5.1" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} diff --git a/assignments2016/assignment1/.ipynb_checkpoints/knn-checkpoint.ipynb b/assignments2016/assignment1/.ipynb_checkpoints/knn-checkpoint.ipynb deleted file mode 100644 index 7ed1b7b4..00000000 --- a/assignments2016/assignment1/.ipynb_checkpoints/knn-checkpoint.ipynb +++ /dev/null @@ -1,459 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# k-Nearest Neighbor (kNN) exercise\n", - "\n", - "*Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. For more details see the [assignments page](http://vision.stanford.edu/teaching/cs231n/assignments.html) on the course website.*\n", - "\n", - "The kNN classifier consists of two stages:\n", - "\n", - "- During training, the classifier takes the training data and simply remembers it\n", - "- During testing, kNN classifies every test image by comparing to all training images and transfering the labels of the k most similar training examples\n", - "- The value of k is cross-validated\n", - "\n", - "In this exercise you will implement these steps and understand the basic Image Classification pipeline, cross-validation, and gain proficiency in writing efficient, vectorized code." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# Run some setup code for this notebook.\n", - "\n", - "import random\n", - "import numpy as np\n", - "from cs231n.data_utils import load_CIFAR10\n", - "import matplotlib.pyplot as plt\n", - "\n", - "# This is a bit of magic to make matplotlib figures appear inline in the notebook\n", - "# rather than in a new window.\n", - "%matplotlib inline\n", - "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", - "plt.rcParams['image.interpolation'] = 'nearest'\n", - "plt.rcParams['image.cmap'] = 'gray'\n", - "\n", - "# Some more magic so that the notebook will reload external python modules;\n", - "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", - "%load_ext autoreload\n", - "%autoreload 2" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# Load the raw CIFAR-10 data.\n", - "cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'\n", - "X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)\n", - "\n", - "# As a sanity check, we print out the size of the training and test data.\n", - "print 'Training data shape: ', X_train.shape\n", - "print 'Training labels shape: ', y_train.shape\n", - "print 'Test data shape: ', X_test.shape\n", - "print 'Test labels shape: ', y_test.shape" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# Visualize some examples from the dataset.\n", - "# We show a few examples of training images from each class.\n", - "classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n", - "num_classes 
= len(classes)\n", - "samples_per_class = 7\n", - "for y, cls in enumerate(classes):\n", - " idxs = np.flatnonzero(y_train == y)\n", - " idxs = np.random.choice(idxs, samples_per_class, replace=False)\n", - " for i, idx in enumerate(idxs):\n", - " plt_idx = i * num_classes + y + 1\n", - " plt.subplot(samples_per_class, num_classes, plt_idx)\n", - " plt.imshow(X_train[idx].astype('uint8'))\n", - " plt.axis('off')\n", - " if i == 0:\n", - " plt.title(cls)\n", - "plt.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# Subsample the data for more efficient code execution in this exercise\n", - "num_training = 5000\n", - "mask = range(num_training)\n", - "X_train = X_train[mask]\n", - "y_train = y_train[mask]\n", - "\n", - "num_test = 500\n", - "mask = range(num_test)\n", - "X_test = X_test[mask]\n", - "y_test = y_test[mask]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# Reshape the image data into rows\n", - "X_train = np.reshape(X_train, (X_train.shape[0], -1))\n", - "X_test = np.reshape(X_test, (X_test.shape[0], -1))\n", - "print X_train.shape, X_test.shape" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "from cs231n.classifiers import KNearestNeighbor\n", - "\n", - "# Create a kNN classifier instance. \n", - "# Remember that training a kNN classifier is a noop: \n", - "# the Classifier simply remembers the data and does no further processing \n", - "classifier = KNearestNeighbor()\n", - "classifier.train(X_train, y_train)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We would now like to classify the test data with the kNN classifier. Recall that we can break down this process into two steps: \n", - "\n", - "1. First we must compute the distances between all test examples and all train examples. \n", - "2. Given these distances, for each test example we find the k nearest examples and have them vote for the label\n", - "\n", - "Lets begin with computing the distance matrix between all training and test examples. For example, if there are **Ntr** training examples and **Nte** test examples, this stage should result in a **Nte x Ntr** matrix where each element (i,j) is the distance between the i-th test and j-th train example.\n", - "\n", - "First, open `cs231n/classifiers/k_nearest_neighbor.py` and implement the function `compute_distances_two_loops` that uses a (very inefficient) double loop over all pairs of (test, train) examples and computes the distance matrix one element at a time." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# Open cs231n/classifiers/k_nearest_neighbor.py and implement\n", - "# compute_distances_two_loops.\n", - "\n", - "# Test your implementation:\n", - "dists = classifier.compute_distances_two_loops(X_test)\n", - "print dists.shape" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# We can visualize the distance matrix: each row is a single test example and\n", - "# its distances to training examples\n", - "plt.imshow(dists, interpolation='none')\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Inline Question #1:** Notice the structured patterns in the distance matrix, where some rows or columns are visible brighter. (Note that with the default color scheme black indicates low distances while white indicates high distances.)\n", - "\n", - "- What in the data is the cause behind the distinctly bright rows?\n", - "- What causes the columns?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Your Answer**: *fill this in.*\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# Now implement the function predict_labels and run the code below:\n", - "# We use k = 1 (which is Nearest Neighbor).\n", - "y_test_pred = classifier.predict_labels(dists, k=1)\n", - "\n", - "# Compute and print the fraction of correctly predicted examples\n", - "num_correct = np.sum(y_test_pred == y_test)\n", - "accuracy = float(num_correct) / num_test\n", - "print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You should expect to see approximately `27%` accuracy. Now lets try out a larger `k`, say `k = 5`:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": true - }, - "outputs": [], - "source": [ - "y_test_pred = classifier.predict_labels(dists, k=5)\n", - "num_correct = np.sum(y_test_pred == y_test)\n", - "accuracy = float(num_correct) / num_test\n", - "print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You should expect to see a slightly better performance than with `k = 1`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# Now lets speed up distance matrix computation by using partial vectorization\n", - "# with one loop. Implement the function compute_distances_one_loop and run the\n", - "# code below:\n", - "dists_one = classifier.compute_distances_one_loop(X_test)\n", - "\n", - "# To ensure that our vectorized implementation is correct, we make sure that it\n", - "# agrees with the naive implementation. There are many ways to decide whether\n", - "# two matrices are similar; one of the simplest is the Frobenius norm. 
In case\n", - "# you haven't seen it before, the Frobenius norm of two matrices is the square\n", - "# root of the squared sum of differences of all elements; in other words, reshape\n", - "# the matrices into vectors and compute the Euclidean distance between them.\n", - "difference = np.linalg.norm(dists - dists_one, ord='fro')\n", - "print 'Difference was: %f' % (difference, )\n", - "if difference < 0.001:\n", - " print 'Good! The distance matrices are the same'\n", - "else:\n", - " print 'Uh-oh! The distance matrices are different'" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# Now implement the fully vectorized version inside compute_distances_no_loops\n", - "# and run the code\n", - "dists_two = classifier.compute_distances_no_loops(X_test)\n", - "\n", - "# check that the distance matrix agrees with the one we computed before:\n", - "difference = np.linalg.norm(dists - dists_two, ord='fro')\n", - "print 'Difference was: %f' % (difference, )\n", - "if difference < 0.001:\n", - " print 'Good! The distance matrices are the same'\n", - "else:\n", - " print 'Uh-oh! The distance matrices are different'" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# Let's compare how fast the implementations are\n", - "def time_function(f, *args):\n", - " \"\"\"\n", - " Call a function f with args and return the time (in seconds) that it took to execute.\n", - " \"\"\"\n", - " import time\n", - " tic = time.time()\n", - " f(*args)\n", - " toc = time.time()\n", - " return toc - tic\n", - "\n", - "two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)\n", - "print 'Two loop version took %f seconds' % two_loop_time\n", - "\n", - "one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)\n", - "print 'One loop version took %f seconds' % one_loop_time\n", - "\n", - "no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)\n", - "print 'No loop version took %f seconds' % no_loop_time\n", - "\n", - "# you should see significantly faster performance with the fully vectorized implementation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Cross-validation\n", - "\n", - "We have implemented the k-Nearest Neighbor classifier but we set the value k = 5 arbitrarily. We will now determine the best value of this hyperparameter with cross-validation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "num_folds = 5\n", - "k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]\n", - "\n", - "X_train_folds = []\n", - "y_train_folds = []\n", - "################################################################################\n", - "# TODO: #\n", - "# Split up the training data into folds. After splitting, X_train_folds and #\n", - "# y_train_folds should each be lists of length num_folds, where #\n", - "# y_train_folds[i] is the label vector for the points in X_train_folds[i]. #\n", - "# Hint: Look up the numpy array_split function. 
#\n", - "################################################################################\n", - "pass\n", - "################################################################################\n", - "# END OF YOUR CODE #\n", - "################################################################################\n", - "\n", - "# A dictionary holding the accuracies for different values of k that we find\n", - "# when running cross-validation. After running cross-validation,\n", - "# k_to_accuracies[k] should be a list of length num_folds giving the different\n", - "# accuracy values that we found when using that value of k.\n", - "k_to_accuracies = {}\n", - "\n", - "\n", - "################################################################################\n", - "# TODO: #\n", - "# Perform k-fold cross validation to find the best value of k. For each #\n", - "# possible value of k, run the k-nearest-neighbor algorithm num_folds times, #\n", - "# where in each case you use all but one of the folds as training data and the #\n", - "# last fold as a validation set. Store the accuracies for all fold and all #\n", - "# values of k in the k_to_accuracies dictionary. #\n", - "################################################################################\n", - "pass\n", - "################################################################################\n", - "# END OF YOUR CODE #\n", - "################################################################################\n", - "\n", - "# Print out the computed accuracies\n", - "for k in sorted(k_to_accuracies):\n", - " for accuracy in k_to_accuracies[k]:\n", - " print 'k = %d, accuracy = %f' % (k, accuracy)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# plot the raw observations\n", - "for k in k_choices:\n", - " accuracies = k_to_accuracies[k]\n", - " plt.scatter([k] * len(accuracies), accuracies)\n", - "\n", - "# plot the trend line with error bars that correspond to standard deviation\n", - "accuracies_mean = np.array([np.mean(v) for k,v in sorted(k_to_accuracies.items())])\n", - "accuracies_std = np.array([np.std(v) for k,v in sorted(k_to_accuracies.items())])\n", - "plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)\n", - "plt.title('Cross-validation on k')\n", - "plt.xlabel('k')\n", - "plt.ylabel('Cross-validation accuracy')\n", - "plt.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# Based on the cross-validation results above, choose the best value for k, \n", - "# retrain the classifier using all the training data, and test it on the test\n", - "# data. 
You should be able to get above 28% accuracy on the test data.\n", - "best_k = 1\n", - "\n", - "classifier = KNearestNeighbor()\n", - "classifier.train(X_train, y_train)\n", - "y_test_pred = classifier.predict(X_test, k=best_k)\n", - "\n", - "# Compute and display the accuracy\n", - "num_correct = np.sum(y_test_pred == y_test)\n", - "accuracy = float(num_correct) / num_test\n", - "print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.5.1" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} diff --git a/assignments2016/assignment1/features.ipynb b/assignments2016/assignment1/features.ipynb index af49194a..ff361f1a 100644 --- a/assignments2016/assignment1/features.ipynb +++ b/assignments2016/assignment1/features.ipynb @@ -227,10 +227,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Neural Network on image features\n", - "Earlier in this assigment we saw that training a two-layer neural network on raw pixels achieved better classification performance than linear classifiers on raw pixels. In this notebook we have seen that linear classifiers on image features outperform linear classifiers on raw pixels. \n", + "## 이미지 특징의 신경망\n", + "이번 과제에서 우리는 단순 픽셀의 2-계층 신경망을 학습시키면 선형 분류기보다 성능이 더 향상됨을 배웠습니다. 이 notebook에서 우리는 이미지 특징의 선형 분류기가 단순 픽셀의 선형 분류기보다 뛰어나다는 것을 알 수 있었습니다.\n", "\n", - "For completeness, we should also try training a neural network on image features. This approach should outperform all previous approaches: you should easily be able to achieve over 55% classification accuracy on the test set; our best model achieves about 60% classification accuracy." + "완성도를 위해, 우리는 이미지 특징의 신경망 또한 학습시켜보아야 합니다. 이 접근법은 이전의 모든 방법보다 더 뛰어날 것입니다: 테스트 세트에 대해 55%이상의 분류 정확도를 쉽게 달성할 수 있어야합니다; 우리의 최고의 모델은 60%의 분류 정확도를 달성했습니다." ] }, { @@ -262,13 +262,14 @@ "best_net = None\n", "\n", "################################################################################\n", - "# TODO: Train a two-layer neural network on image features. You may want to #\n", - "# cross-validate various parameters as in previous sections. Store your best #\n", - "# model in the best_net variable. #\n", + "# TODO: #\n", + "# 이미지 특징으로 2-계층 신경망 학습시키기. #\n", + "# 이전 섹션처럼 다양한 변수들을 교차검증하기. #\n", + "# 최고의 모델을 best_net 변수에 저장하기. #\n", "################################################################################\n", "pass\n", "################################################################################\n", - "# END OF YOUR CODE #\n", + "# 코드의 끝 #\n", "################################################################################" ] }, @@ -280,8 +281,8 @@ }, "outputs": [], "source": [ - "# Run your neural net classifier on the test set. You should be able to\n", - "# get more than 55% accuracy.\n", + "# 당신의 신경망 분류기를 테스트 세트로 실행시켜 보세요.\n", + "# 55% 이상의 정확도를 얻을 수 있어야 합니다.\n", "\n", "test_acc = (net.predict(X_test_feats) == y_test).mean()\n", "print test_acc" @@ -291,19 +292,19 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Bonus: Design your own features!\n", + "# 보너스: 당신만의 특징을 디자인해보세요!\n", "\n", - "You have seen that simple image features can improve classification performance. 
So far we have tried HOG and color histograms, but other types of features may be able to achieve even better classification performance.\n",
+    "간단한 이미지 특징이 분류기의 성능을 향상시킬 수 있음을 배웠습니다. 지금까지 우리는 HOG와 색상 히스토그램을 통해 시도해봤지만 다른 종류의 특징들은 분류 성능을 더 향상시킬 수 있습니다.\n",
     "\n",
-    "For bonus points, design and implement a new type of feature and use it for image classification on CIFAR-10. Explain how your feature works and why you expect it to be useful for image classification. Implement it in this notebook, cross-validate any hyperparameters, and compare its performance to the HOG + Color histogram baseline."
+    "보너스 포인트를 위해, 새로운 종류의 특징을 디자인하고 구현하여 CIFAR-10의 이미지 분류에 사용해 보세요. 당신의 특징이 어떻게 작동하고 왜 그러한 특징이 이미지 분류에 효과적으로 작동할 것이라 생각했는지 설명해보세요. 이 notebook에서 구현해보고, hyperparameter들을 교차 검증하고 HOG + 색상 히스토그램 기준과 성능을 비교해보세요."
   ]
  },
 {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "# Bonus: Do something extra!\n",
-    "Use the material and code we have presented in this assignment to do something interesting. Was there another question we should have asked? Did any cool ideas pop into your head as you were working on the assignment? This is your chance to show off!"
+    "# 보너스: 뭔가 더 해보세요!\n",
+    "이번 과제에서 제공된 자료와 코드를 사용하여 흥미로운 도전을 해보세요. 과제를 하면서 다른 의문점이 생겼나요? 과제를 하면서 머리에서 참신한 생각이 떠올랐나요? 당신을 보여줄 수 있는 기회입니다!"
   ]
  }
 ],
From e3c0a19fc475a3352e7fa7132c9aefad3b59ef2b Mon Sep 17 00:00:00 2001
From: Seo Jonghan 
Date: Tue, 19 Apr 2016 01:59:38 +0900
Subject: [PATCH 069/199] =?UTF-8?q?Dropout=EA=B9=8C=EC=A7=80=20=EC=99=84?=
 =?UTF-8?q?=EB=A3=8C?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 neural-networks-2.kr.md | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/neural-networks-2.kr.md b/neural-networks-2.kr.md
index e16d988f..d85f7793 100644
--- a/neural-networks-2.kr.md
+++ b/neural-networks-2.kr.md
@@ -126,27 +126,29 @@

**희소 초기화(Sparse initialization)**. 보정되지 않은 분산을 위한 또 다른 방법은 모든 가중치 행렬을 0으로 초기화 하는 것이다. 이때 대칭성을 깨기 위해서 모든 뉴런을 고정된 숫자의 아래 단계 뉴런들과 무작위로 연결한다. **(with weights sampled from a small gaussian as above)** 연결하는 뉴런의 수는 대략 10개 정도이다.

**bias 초기화**. 가중치에 랜덤한 값을 설정하므로써 대칭성 문제는 해결되기 때문에 주로 bias는 0으로 초기화한다. ReLU 연산의 비선형성에 의해서 몇몇 경우에는 0.01과 같은 작은 상수값을 사용하기도 하는데 이는 ReLU 연산이 초기부터 fire되고 따라서 그라디언트(gradient) 값이 유의미한 값을 갖고 신경망을 통해서 전달되는 것을 보장할 수 있기 때문이다. 하지만 상수값을 사용하는 방식이 성능 향상을 언제나 보장하는 것인가에 대해서는 이견이 존재한다(실제 몇몇 사례에서 더 나쁜 결과를 볼 수 있다). 따라서 bias 값은 0으로 초기화 하는 것이 더 일반적이라 할 수 있다.

**실전응용**, ReLU 유닛을 사용하고 `w = np.random.randn(n) * sqrt(2.0/n)` 초기화하는 것이 요즘의 추세이다 [He et al.](http://arxiv-web3.library.cornell.edu/abs/1502.01852).

**Batch Normalization**. A recently developed technique by Ioffe and Szegedy called [Batch Normalization](http://arxiv.org/abs/1502.03167) alleviates a lot of headaches with properly initializing neural networks by explicitly forcing the activations throughout a network to take on a unit gaussian distribution at the beginning of the training. The core observation is that this is possible because normalization is a simple differentiable operation.
In the implementation, applying this technique usually amounts to inserting the BatchNorm layer immediately after fully connected layers (or convolutional layers, as we'll soon see), and before non-linearities. We do not expand on this technique here because it is well described in the linked paper, but note that it has become a very common practice to use Batch Normalization in neural networks. In practice networks that use Batch Normalization are significantly more robust to bad initialization. Additionally, batch normalization can be interpreted as doing preprocessing at every layer of the network, but integrated into the network itself in a differentiable manner. Neat!

**배치 정규화(Batch Normalization)** 최근 Ioffe and Szegedy에 의해서 제안된 [배치 정규화(Batch Normalization)](http://arxiv.org/abs/1502.03167) 기법은 신경망 학습단계에서 activation 값이 표준정규분포를 갖도록 강제하는 기법으로 신경망을 적절한 값으로 초기화하여 그동안 많은 연구자들을 괴롭혀왔던 초기화 문제의 상당부분을 해소해 주었다. 여기서 사용한 정규화 기법이 단순 미분 가능한 연산이었기에 적용 가능하다. 실제 구현에서는 배치 정규화 레이어를 fully-connected 레이어 (혹은 곧 설명하게될 컨볼루션 레이어) 다음, 비선형 연산 이전에 위치 시키는 방식으로 이 기법을 신경망에 적용할 수 있다. 앞에서 링크된 논문에서 배치 정규화(Batch Normalization) 기법에 대해서 자세하게 설명하고 있기 때문에 여기에서는 관련 기법을 자세하게 다루지는 않겠지만, 이 기법은 이미 신경망 학습에서 일반적으로 사용되는 기법중 하나라는 것을 밝혀두는 바이다. 실제 적용 사례를 보면 배치 정규화(Batch Normalization)을 사용하여 학습한 신경망은 특히 나쁜 초기화의 영향에 강하다는 것이 밝혀졌다. 배치 정규화(Batch Normalization)는 신경망 내의 모든 레이어에서 전처리 과정을 수행하는 것이지만, 미분 가능하다는 성질에 의해서 신경망 내의 학습 단계로 통합되었다고 볼 수 있다.


### Regularization

There are several ways of controlling the capacity of Neural Networks to prevent overfitting:

이번 파트에서는 신경망 학습에서 overfitting을 막을 수 있는 몇가지 방법을 소개하고자 한다.

**L2 regularization** is perhaps the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. That is, for every weight $w$ in the network, we add the term $\frac{1}{2} \lambda w^2$ to the objective, where $\lambda$ is the regularization strength. It is common to see the factor of $\frac{1}{2}$ in front because then the gradient of this term with respect to the parameter $w$ is simply $\lambda w$ instead of $2 \lambda w$. The L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors. As we discussed in the Linear Classification section, due to multiplicative interactions between weights and inputs this has the appealing property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot. Lastly, notice that during gradient descent parameter update, using the L2 regularization ultimately means that every weight is decayed linearly: `W += -lambda * W` towards zero.

**L2 regularization**은 가장 일반적으로 사용되는 regularization 기법이다. 모든 파라미터 제곱 만큼의 크기를 목적 함수에 제약을 거는 방식으로 구현된다. 다시말해, 가중치 벡터 $$w$$가 있을때, 목적 함수에 $$\frac{1}{2} \lambda w^2$$를 더한다 (여기서 $$\lambda$$는 regularization의 강도를 의미). $$\frac{1}{2}$$ 부분이 항상 존재하는데 이는 앞서 본 regularization 값을 $$w$$로 미분했을 때 $$2 \lambda w$$가 아닌 $$ \lambda w$$의 값을 갖도록 하기 위함이다. L2 regularization은 큰 값이 많이 존재하는 가중치에 제약을 주고, 가중치 값을 가능한 널리 퍼지도록 하는 효과를 주는 것으로 볼 수 있다. 선형 분류(Linear Classification) 장에서도 이야기 했던 가중치와 입력 데이터가 곱해지는 연산이므로 특정 몇개의 입력 데이터에 강하게 적용되기 보다는 모든 입력데이터에 약하게 적용되도록 하는 것이 일반적이다. gradient descent 업데이트 과정에서 L2 regularization을 적용하는 것은 모든 가중치 값이 선형적으로 감소하게 된다: `W += -lambda * W`이 0으로 감소하게 된다.

**L1 regularization** is another relatively common form of regularization, where for each weight $w$ we add the term $\lambda \mid w \mid$ to the objective.
It is possible to combine the L1 regularization with the L2 regularization: $\lambda_1 \mid w \mid + \lambda_2 w^2$ (this is called [Elastic net regularization](http://web.stanford.edu/~hastie/Papers/B67.2%20%282005%29%20301-320%20Zou%20&%20Hastie.pdf)). The L1 regularization has the intriguing property that it leads the weight vectors to become sparse during optimization (i.e. very close to exactly zero). In other words, neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the "noisy" inputs. In comparison, final weight vectors from L2 regularization are usually diffuse, small numbers. In practice, if you are not concerned with explicit feature selection, L2 regularization can be expected to give superior performance over L1.

**Max norm constraints**. Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint. In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector $\vec{w}$ of every neuron to satisfy $\Vert \vec{w} \Vert_2 < c$. Typical values of $c$ are on orders of 3 or 4. Some people report improvements when using this form of regularization. One of its appealing properties is that network cannot "explode" even when the learning rates are set too high because the updates are always bounded.

**L1 regularization** 또한 상대적으로 많이 사용되는 regularization 기법으로 가중치 벡터 $$w$$가 있을때, 목적 함수에 $$\lambda \mid w \mid$$를 더한다. 다음과 같이 L1 regularization과 L2 regularization을 동시에 사용할 수도 있다: $$\lambda_1 \mid w \mid + \lambda_2 w^2$$([Elastic net regularization](http://web.stanford.edu/~hastie/Papers/B67.2%20%282005%29%20301-320%20Zou%20&%20Hastie.pdf)라고도 불린다). L1 regularization은 최적화 과정 동안 가중치 벡터들을 sparse하게(거의 0에 가깝게) 만드는 흥미로운 특성이 있다. 다시 말해, L1 regularization이 적용된 뉴런들은 결국 입력 데이터의 sparse한 부분만을 사용하고, "noisy" 입력 데이터에 거의 영향을 받지 않는다. 이에 반해, L2 regularization을 적용하면 최종 가중치 벡터들은 작은 값들이 퍼져있는 형태로 나타나게 된다. 실제 신경망 학습에 적용할 때, 만약 특정한 feature selection 후 학습하는 것이 아니라면 많은 경우에 L2 regularization을 사용하면 훨씬 좋은 성능을 기대할 수 있다.

**Max norm constraints**. regularization 기법 중 하나로 가중치 벡터의 길이가 미리 정해 놓은 상한 값을 넘지 못하도록 제한하면서 gradient descent 연산도 제한 된 조건 안에서만 계산하도록 하는 projected gradient descent를 사용한다. 신경망 학습에 실제 적용하는 방법은, 먼저 일반적인 방법으로 파라미터를 업데이트 하고, 모든 뉴런의 가중치 벡터 $$\vec{w}$$에 대해서 $$\Vert \vec{w} \Vert_2 < c$$를 만족하도록 제한을 가한다. 일반적으로 c값은 3 혹은 4로 설정한다. 이 regularization 기법을 적용한 몇몇 연구를 통하여 성능 향상이 있음이 알려졌다. 이 기법의 흥미로운 사실 중 하나는 학습률(learning rate)을 큰 값으로 설정하고 학습 시키더라도 신경망이 "explode"하지 않는다는 것인데 이는 업데이트 될 때마다 제한된 범위 내의 값을 갖기 때문이다.

**Dropout** [Dropout: A Simple Way to Prevent Neural Networks from Overfitting](http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf)에서 Srivastava et al.에 의해 최근 제안된 기법으로 간단하지만 아주 효과적인 regularization 방법으로 위에서 소개한 다른 regularization 기법들과 (L1, L2, maxnorm) 상호 보완적인 방법으로 알려져 있다. 각 뉴런들을 $$p$$의 확률로 활성화 시켜 학습에 적용하는 방식으로 구현할 수 있다.

**Dropout** is an extremely effective, simple and recently introduced regularization technique by Srivastava et al. in [Dropout: A Simple Way to Prevent Neural Networks from Overfitting](http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf) (pdf) that complements the other methods (L1, L2, maxnorm). While training, dropout is implemented by only keeping a neuron active with some probability $p$ (a hyperparameter), or setting it to zero otherwise.
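To make the dropout mechanics concrete, here is a minimal NumPy sketch of the training-time and test-time forward passes for a single ReLU hidden layer. It is only an illustration: the names (`X`, `W1`, `b1`) and the keep probability `p` are assumptions chosen for this sketch, not taken from this repository's code.

```python
import numpy as np

p = 0.5  # probability of keeping a unit active; higher p = less dropout

def train_forward(X, W1, b1):
    """Training-time forward pass with vanilla dropout on one hidden layer."""
    H = np.maximum(0, W1.dot(X) + b1)  # ReLU activations
    U = np.random.rand(*H.shape) < p   # binary mask: keep each unit with prob. p
    return H * U                       # dropped units output exactly zero

def predict_forward(X, W1, b1):
    """Test-time forward pass: nothing is dropped, activations are scaled by p."""
    H = np.maximum(0, W1.dot(X) + b1)
    return H * p                       # match the expected training-time activation
```

In practice the equivalent "inverted dropout" formulation is usually preferred: divide the mask by `p` at training time (`U = (np.random.rand(*H.shape) < p) / p`) so that the test-time forward pass needs no scaling at all.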
From 23bf2019b1b3ec2665a873ee97bdab68324224ad Mon Sep 17 00:00:00 2001 From: YB Date: Mon, 18 Apr 2016 22:59:48 -0400 Subject: [PATCH 070/199] Lecture1 - part 101~120 (out of 715) en / ko --- captions/En/Lecture1_en.srt | 49 +++++++++++++++++++------------------ captions/Ko/Lecture1_ko.srt | 44 ++++++++++++++++----------------- 2 files changed, 47 insertions(+), 46 deletions(-) diff --git a/captions/En/Lecture1_en.srt b/captions/En/Lecture1_en.srt index f645769c..733f41b4 100644 --- a/captions/En/Lecture1_en.srt +++ b/captions/En/Lecture1_en.srt @@ -489,22 +489,22 @@ a basic level of understanding of 100 00:11:33,830 --> 00:11:42,560 -computer vision. You can browse the notes -and so on +computer vision. + You can browse the notes and so on. 101 00:11:42,561 --> 00:11:49,619 -today is that I will give a very brief +Okay, so the rest of today is that I will give a very brief broad stroke history of computer vision 102 00:11:49,620 --> 00:11:55,519 -and then we'll talk about 231 and a +and then we'll talk about 231n little bit in terms of the organization 103 00:11:55,519 --> 00:12:01,409 -of the class they really care about in +of the class. I actually really care about sharing with you this brief history of computer 104 @@ -515,37 +515,38 @@ here primarily because of your interest 105 00:12:07,480 --> 00:12:11,990 in this really interesting tool called -deeply and this is the purpose of this +deep learning and this is the purpose of this 106 00:12:11,990 --> 00:12:16,370 -class will offer you an in-depth look +class. +We're offering you an in-depth look and then 107 00:12:16,370 --> 00:12:22,470 and just journey through the of the what -this deeply model is but without +this deep learning model is but without 108 00:12:22,470 --> 00:12:28,050 -understanding the problem domain without -thinking deeply about what this problem +understanding the problem domain, without +thinking deeply about what this problem is, 109 -00:12:28,049 --> 00:12:37,849 -is it's very hard for you to to go out +00:12:28,051 --> 00:12:37,849 +it's very hard for you to to go out to be an inventor of the next model that 110 00:12:37,850 --> 00:12:43,320 -really solve the big problems vision or +really solves the big problem in vision or to be you know developing developing 111 00:12:43,320 --> 00:12:52,379 -making impactful work in solving a heart -problem and also in general problem +making impactful work in solving a hard +problem. and also in general problem 112 00:12:52,379 --> 00:12:58,860 @@ -554,17 +555,17 @@ themselves are never never fully 113 00:12:58,860 --> 00:13:00,129 -decoupled +decoupled. 
114 00:13:00,129 --> 00:13:05,360 -inform each other and you'll see through +They inform each other and you'll see through the history of deep learning a little 115 00:13:05,360 --> 00:13:13,000 bit that the coalition on your network -architecture come from the meat to solve +architecture come from the need to solve 116 00:13:13,000 --> 00:13:15,289 @@ -572,7 +573,7 @@ a vision problem 117 00:13:15,289 --> 00:13:23,449 -vision problem helps the the planning +vision problem helps the the deep learning algorithm to evolve and I'm back and 118 @@ -582,17 +583,17 @@ know I want you to finish this course I 119 00:13:29,350 --> 00:13:34,300 -feel proud that you're still enough -vision and of deep learning so you you +feel proud that you're student of +computer vision and of deep learning so you you 120 -00:13:34,299 --> 00:13:39,528 -have this bullshit all set and the +00:13:34,301 --> 00:13:39,528 +have this both tool-set and the in-depth understanding of how to use the 121 00:13:39,528 --> 00:13:46,750 -tools to to to to tackle important +tool-set to to to to tackle important problems so it's a brief history but 122 diff --git a/captions/Ko/Lecture1_ko.srt b/captions/Ko/Lecture1_ko.srt index e1a71f38..dd608bdd 100644 --- a/captions/Ko/Lecture1_ko.srt +++ b/captions/Ko/Lecture1_ko.srt @@ -409,83 +409,83 @@ 101 00:11:42,559 --> 00:11:49,619 - 오늘은 내가 컴퓨터 비전의 아주 짧은 폭 넓은 행정 역사를 줄 것입니다 + 좋아요. 오늘은 컴퓨터 비전의 역사를 간단히 다루고 102 00:11:49,620 --> 00:11:55,519 - 그리고, 우리는 조직의 관점에서 (231)과 조금 얘기하자 + 231n 수업이 어떻게 구성되어 있는지 이야기 해볼 것입니다. 103 00:11:55,519 --> 00:12:01,409 - 클래스의 그들은 정말 당신에 대해 컴퓨터의이 짧은 역사를 관심 + 저는 사실 이 컴퓨터 비전의 역사를 다루는 것을 중요하게 생각하는데 104 00:12:01,409 --> 00:12:07,480 - 비전 당신 때문에 당신의 관심이 주로 여기에있을 수 있습니다 알고 있기 때문에 + 여러분 대무분은 이 딥러닝이라는 매우 흥미있는 도구에 관심이 있어 이 자리에 있을텐데요, 105 00:12:07,480 --> 00:12:11,990 - 이 정말 흥미 도구 깊이 호출이이 목적이며 + 그것이 이 수업의 목적이기도 하고요. 106 00:12:11,990 --> 00:12:16,370 - 클래스는 당신에게 깊이있는 모양을 제공 할 것이다 + 이 수업은 딥러닝 모델이 무엇인지 107 00:12:16,370 --> 00:12:22,470 - 과의를 통해 바로 여행이 깊이 모델은하지 않고 있지만, 무엇 + 심도있게 다룰 것입니다. 108 00:12:22,470 --> 00:12:28,050 - 어떤이 문제에 대해 깊이 생각하지 않고 문제 도메인을 이해 + 하지만 문제가 속해있는 영역에 대한 이해, 문제에 대한 깊은 고찰없이는 109 -00:12:28,049 --> 00:12:37,849 - 당신이 다음 모델의 발명가로 외출하는 것이 매우 어렵다입니다 +00:12:28,051 --> 00:12:37,849 + 비젼분야의 새로운 문제를 해결하는 110 00:12:37,850 --> 00:12:43,320 - 정말 큰 문제의 비전을 해결하거나 당신이 알고있는 개발 개발을 할 + 새로운 모델을 만들거나 111 00:12:43,320 --> 00:12:52,379 - 일반적인 문제도 심장 문제 해결에 영향력있는 작품을 만들고 + 어려운 문제들을 푸는 일에 주요한 일을 하기 매우 어려울 것입니다. 112 00:12:52,379 --> 00:12:58,860 - 도메인 및 모델 모델링 도구 자체는 결코 완전히 결코 + 또한 일반적인 문제 영역과 모델링 도구 자체는 113 00:12:58,860 --> 00:13:00,129 - 분리 + 결코 완전히 서로 분리될 수 없어요. 114 00:13:00,129 --> 00:13:05,360 - 상호 통보하고 깊은 학습 조금에게의 역사를 통해 볼 수 있습니다 + 그들은 서로에게 정보를 제공합니다. 딥 러닝의 역사를 통해 알게 되겠지만 115 00:13:05,360 --> 00:13:13,000 - 고기에서 온 네트워크 아키텍처의 연합이 해결하는 것을 조금 + 네트워크 아키텍처의 연합은 비젼 문제들을 해결하고자 하는 116 00:13:13,000 --> 00:13:15,289 - 시력 문제 + 필요에 의해 생겨납니다. 117 00:13:15,289 --> 00:13:23,449 - 비전 문제는 발전 할 계획 알고리즘을하는 데 도움이 저 돌아 왔어요 및 + 그리고 다시 비전 문제는 딥 러닝 알고리즘을 발전시키는데 도움을 주게되죠. 118 00:13:23,450 --> 00:13:29,350 - 앞으로 그래서 당신은 당신이이 과정 I을 완료 할 알고에 정말 중요하다 + 그래서 매우 중요합니다. 여러분들이 이 수업을 마치고 119 00:13:29,350 --> 00:13:34,300 - 당신 때문에 깊은 학습 여전히 충분한 비전을 걸 자랑스럽게 느낄 당신 + 컴퓨터 비젼과 딥러닝 수업의 학생임을 자랑스러워 하길 바래요. 120 -00:13:34,299 --> 00:13:39,528 - 이 헛소리 모든 설정 및 사용 방법의 심도있는 이해를 +00:13:34,301 --> 00:13:39,528 + 여러분들은 문제해결을 위한 도구들과 그 도구들을 사용하는 방법에 대한 깊은 이해를 갖게 될 거에요. 
121
00:13:39,528 --> 00:13:46,750

From 2b95feaf9a766770e6e349b8dc7cb02bb80ae1 Mon Sep 17 00:00:00 2001
From: Young-Geun 
Date: Tue, 19 Apr 2016 17:51:10 +0900
Subject: [PATCH 071/199] =?UTF-8?q?github=20desktop=20=EC=97=B0=EC=8A=B5?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 neural-networks-3-kr.md | 386 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 386 insertions(+)
 create mode 100644 neural-networks-3-kr.md

diff --git a/neural-networks-3-kr.md b/neural-networks-3-kr.md
new file mode 100644
index 00000000..9324a357
--- /dev/null
+++ b/neural-networks-3-kr.md
@@ -0,0 +1,386 @@
---
layout: page
permalink: /neural-networks-3/
---

Table of Contents:

- [그라디언트 점검 (Gradient checks)](#gradcheck)
- [Sanity checks](#sanitycheck)
- [학습 과정 돌보기 (Babysitting the learning process)](#baby)
  - [손실 함수 (Loss function)](#loss)
  - [훈련/검증 성능 (Train/val accuracy)](#accuracy)
  - [웨이트의 현재값과 변화량의 비율 (Weights:Updates ratio)](#ratio)
  - [레이어별 활성값 및 그라디언트값의 분포 (Activation/Gradient distributions per layer)](#distr)
  - [시각화 (Visualization)](#vis)
- [파라미터 업데이트 (Parameter updates)](#update)
  - [일차 근사 방법 (SGD) (First-order (SGD)), 모멘텀 (momentum), Nesterov 모멘텀 (Nesterov momentum)](#sgd)
  - [학습 속도를 담금질하기 (Annealing the learning rate)](#anneal)
  - [이차 근사 방법 (Second-order methods)](#second)
  - [파라미터별로 학습 속도를 데이터가 판단하게 하기 (Per-parameter adaptive learning rates (Adagrad, RMSProp))](#ada)
- [초-파라미터 최적화 (Hyperparameter Optimization)](#hyper)
- [평가 (Evaluation)](#eval)
  - [모형 앙상블 (Model Ensembles)](#ensemble)
- [요약](#summary)
- [추가적인 참고 문헌](#add)

## Learning

이전 섹션들에서는 레이어를 몇 층 쌓고 레이어별로 몇 개의 유닛을 준비할지(network connectivity), 데이터를 어떻게 준비하고 어떤 손실 함수(loss function)를 선택할지 논하였다. 말하자면 이전 섹션들은 주로 뉴럴 네트워크(Neural Network)의 정적인 부분인데, 본 섹션에서는 동적인 부분들을 소개한다. 파라미터(parameter)를 학습하고 좋은 초-파라미터(hyperparameter)를 찾는 과정 등을 다룰 예정이다.


### 그라디언트 체크 (Gradient Checks)

이론적인 그라디언트 체크라 하면, 수치적으로 계산한(numerical) 그라디언트와 수식으로 계산한(analytic) 그라디언트를 비교하는 정도라 매우 간단하다고 생각할 수도 있겠다. 그렇지만 이 작업을 직접 실현해 보면 훨씬 복잡하고 뜬금없이 오차가 발생하기도 쉽다는 것을 깨달을 것이다. 이제 팁, 트릭, 조심할 이슈들 몇 개를 소개하고자 한다.


**같은 근사라 하여도 이론적으로 더 정확도가 높은 근사 공식이 있다 (Use the centered formula)**. 그라디언트($\frac{df(x)}{dx}$)를 수치적으로 근사한다 하면 보통 다음 유한 차분 근사(finite difference approximation)를 떠올릴 것이다:

$$
\frac{df(x)}{dx} = \frac{f(x + h) - f(x)}{h} \hspace{0.1in} \text{(bad, do not use)}
$$

여기서 $h$는 아주 작은 수이고 보통 1e-5 정도의 수를 사용한다. 위 식보다는 아래의 *중심화된(centered)* 차분 공식이 경험적으로는 훨씬 낫다:

$$
\frac{df(x)}{dx} = \frac{f(x + h) - f(x - h)}{2h} \hspace{0.1in} \text{(use instead)}
$$

물론 이 공식은 $f(x+h)$ 말고도 $f(x-h)$도 계산하여야 하므로 최초 식보다 계산량이 두 배 많지만 훨씬 정확한 근사를 제공한다. $f(x+h)$ 및 $f(x-h)$의 ($x$ 근방에서의) 테일러 전개를 고려하면 이유를 금방 알 수 있다. 첫 식은 $O(h)$의 오차가 있는 데 반해 두번째 식은 오차가 $O(h^2)$이다 (즉, 이차 근사이다). -- 역자 주 : (1) 테일러 전개에서 $f(x + h) = f(x) + hf'(x) + O(h^2)$로부터 $f'(x) - \frac{f(x+h)-f(x)}{h} = O(h)$. (2) $h$가 보통 벡터이므로 $O(h)$보다는 $O(\|h\|)$가 더 정확한 표현이나 편의상 $\|\cdot\|$을 생략한 듯 보입니다.


**상대 오차를 사용하라 (Use relative error for the comparison)**. 그라디언트의 (수식으로 계산한, analytic) 참값 $f'_a$와 수치적(numerical) 근사값 $f'_n$을 비교하려면 어떤 디테일을 점검하여야 할까? 이 둘이 비슷하지 않음(not compatible)을 어떻게 알아낼 수 있을까? 가장 쉽게는 둘의 절대 오차 $\mid f'_a - f'_n \mid $ 혹은 그 제곱을 쭉 추적하여 이 값(들)이 언젠가 어느 한계점(threshold)를 넘으면 그라디언트 오류라 할 수도 있겠다. 그렇지만 절대 오차에는 문제가 있는 것이, 가령 절대 오차가 1e-4라 가정하여 보자. 만약 $f'_a$와 $f'_n$ 모두 1.0 언저리라면 1e-4의 오차 정도는 매우 훌륭한 근사이고 $f'_a \approx f'_n$이라 할 수 있다. 그런데 만약 두 그라디언트가 1e-5거나 더 작은 값이라면? 그렇다면 1e-4는 매우 큰 차이가 되고 근사가 실패했다고 보아야 한다.
따라서 절대 오차와 두 그라디언트 값의 비율을 고려하는 *상대 오차*가 더 적절하다. 언제나!:


$$
\frac{\mid f'_a - f'_n \mid}{\max(\mid f'_a \mid, \mid f'_n \mid)}
$$

보통의 상대 오차 공식은 분모에 $f'_a$ 혹은 $f'_n$ 둘 중 하나만 있지만, 나는 둘의 최대값을 분모로 선호하는 편이다. 그래야 공식에 대칭성이 생기고 둘 중 하나가 exactly 0이 되어 분모가 0이 되는 사태를 방지할 수 있다 (ReLU를 사용하면 자주 일어나는 문제이다). $f'_a$와 $f'_n$가 모두 exact 0이 된다면? 이 때는 상대 오차를 점검할 필요 없이 그라디언트 체크를 통과하여야 한다. 당신의 코드가 이 상황을 감안하여 조직된 코드인지 점검하여 보라.

실제 상황에서의 유용한 가이드:

- (상대 오차) > 1e-2 면 그라디언트 계산이 아마 잘못되었을 수도 있다.
- 1e-2 > (상대 오차) > 1e-4 면 불편함을 느끼기 바란다.
- 1e-4 > (상대 오차) 는, 꺾임이 있는 목적함수 (objectives with kinks)에서는 괜찮다. 그렇지만 tanh 혹은 softmax를 쓰는 목적함수처럼 꺾임이 없다면 1e-4는 너무 크다.
- 1e-7 혹은 그보다 작은 상대 오차라면, 행복함을 만끽하라.

하나 더 유념해야 할 것은, 망의 레이어 개수가 많아지면(deeper network) 상대 오차가 커진다는 것이다. 이를테면 레이어(layer) 10개짜리 망(network)에서 인풋 데이터의 그라디언트를 체크한다면, 에러가 층을 올라가며 축적되므로 1e-2 정도의 상대 오차는 괜찮을 수도 있다. 거꾸로 말하자면, 미분가능한 함수 하나만 갖고 노는데 1e-2의 상대 오차가 발생한다면 이것은 부정확한 그라디언트일 가능성이 매우 높다.


**double precision형 변수를 사용하라 (Use double precision)**. A common pitfall is using single precision floating point to compute gradient check. It is often the case that you might get high relative errors (as high as 1e-2) even with a correct gradient implementation. In my experience I've sometimes seen my relative errors plummet from 1e-2 to 1e-8 by switching to double precision.

**Stick around active range of floating point**. It's a good idea to read through ["What Every Computer Scientist Should Know About Floating-Point Arithmetic"](http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html), as it may demystify your errors and enable you to write more careful code. For example, in neural nets it can be common to normalize the loss function over the batch. However, if your gradients per datapoint are very small, then *additionally* dividing them by the number of data points is starting to give very small numbers, which in turn will lead to more numerical issues. This is why I like to always print the raw numerical/analytic gradient, and make sure that the numbers you are comparing are not extremely small (e.g. roughly 1e-10 and smaller in absolute value is worrying). If they are you may want to temporarily scale your loss function up by a constant to bring them to a "nicer" range where floats are more dense - ideally on the order of 1.0, where your float exponent is 0.

**Kinks in the objective**. One source of inaccuracy to be aware of during gradient checking is the problem of *kinks*. Kinks refer to non-differentiable parts of an objective function, introduced by functions such as ReLU ($max(0,x)$), or the SVM loss, Maxout neurons, etc. Consider gradient checking the ReLU function at $x = -1e6$. Since $x < 0$, the analytic gradient at this point is exactly zero. However, the numerical gradient would suddenly compute a non-zero gradient because $f(x+h)$ might cross over the kink (e.g. if $h > 1e-6$) and introduce a non-zero contribution. You might think that this is a pathological case, but in fact this case can be very common. For example, an SVM for CIFAR-10 contains up to 450,000 $max(0,x)$ terms because there are 50,000 examples and each example yields 9 terms to the objective. Moreover, a Neural Network with an SVM classifier will contain many more kinks due to ReLUs.

Note that it is possible to know if a kink was crossed in the evaluation of the loss. This can be done by keeping track of the identities of all "winners" in a function of form $max(x,y)$; That is, was x or y higher during the forward pass.
If the identity of at least one winner changes when evaluating $f(x+h)$ and then $f(x-h)$, then a kink was crossed and the numerical gradient will not be exact. + +**Use only few datapoints**. One fix to the above problem of kinks is to use fewer datapoints, since loss functions that contain kinks (e.g. due to use of ReLUs or margin losses etc.) will have fewer kinks with fewer datapoints, so it is less likely for you to cross one when you perform the finite different approximation. Moreover, if your gradcheck for only ~2 or 3 datapoints then you would almost certainly gradcheck for an entire batch. Using very few datapoints also makes your gradient check faster and more efficient. + +**Be careful with the step size h**. It is not necessarily the case that smaller is better, because when $h$ is much smaller, you may start running into numerical precision problems. Sometimes when the gradient doesn't check, it is possible that you change $h$ to be 1e-4 or 1e-6 and suddenly the gradient will be correct. This [wikipedia article](http://en.wikipedia.org/wiki/Numerical_differentiation) contains a chart that plots the value of **h** on the x-axis and the numerical gradient error on the y-axis. + +**Gradcheck during a "characteristic" mode of operation**. It is important to realize that a gradient check is performed at a particular (and usually random), single point in the space of parameters. Even if the gradient check succeeds at that point, it is not immediately certain that the gradient is correctly implemented globally. Additionally, a random initialization might not be the most "characteristic" point in the space of parameters and may in fact introduce pathological situations where the gradient seems to be correctly implemented but isn't. For instance, an SVM with very small weight initialization will assign almost exactly zero scores to all datapoints and the gradients will exhibit a particular pattern across all datapoints. An incorrect implementation of the gradient could still produce this pattern and not generalize to a more characteristic mode of operation where some scores are larger than others. Therefore, to be safe it is best to use a short **burn-in** time during which the network is allowed to learn and perform the gradient check after the loss starts to go down. The danger of performing it at the first iteration is that this could introduce pathological edge cases and mask an incorrect implementation of the gradient. + +**Don't let the regularization overwhelm the data**. It is often the case that a loss function is a sum of the data loss and the regularization loss (e.g. L2 penalty on weights). One danger to be aware of is that the regularization loss may overwhelm the data loss, in which case the gradients will be primarily coming from the regularization term (which usually has a much simpler gradient expression). This can mask an incorrect implementation of the data loss gradient. Therefore, it is recommended to turn off regularization and check the data loss alone first, and then the regularization term second and independently. One way to perform the latter is to hack the code to remove the data loss contribution. Another way is to increase the regularization strength so as to ensure that its effect is non-negligible in the gradient check, and that an incorrect implementation would be spotted. + +**Remember to turn off dropout/augmentations**. 
When performing gradient check, remember to turn off any non-deterministic effects in the network, such as dropout, random data augmentations, etc. Otherwise these can clearly introduce huge errors when estimating the numerical gradient. The downside of turning off these effects is that you wouldn't be gradient checking them (e.g. it might be that dropout isn't backpropagated correctly). Therefore, a better solution might be to force a particular random seed before evaluating both $f(x+h)$ and $f(x-h)$, and when evaluating the analytic gradient. + +**Check only few dimensions**. In practice the gradients can have sizes of million parameters. In these cases it is only practical to check some of the dimensions of the gradient and assume that the others are correct. **Be careful**: One issue to be careful with is to make sure to gradient check a few dimensions for every separate parameter. In some applications, people combine the parameters into a single large parameter vector for convenience. In these cases, for example, the biases could only take up a tiny number of parameters from the whole vector, so it is important to not sample at random but to take this into account and check that all parameters receive the correct gradients. + + +### Before learning: sanity checks Tips/Tricks + +Here are a few sanity checks you might consider running before you plunge into expensive optimization: + +- **Look for correct loss at chance performance.** Make sure you're getting the loss you expect when you initialize with small parameters. It's best to first check the data loss alone (so set regularization strength to zero). For example, for CIFAR-10 with a Softmax classifier we would expect the initial loss to be 2.302, because we expect a diffuse probability of 0.1 for each class (since there are 10 classes), and Softmax loss is the negative log probability of the correct class so: -ln(0.1) = 2.302. For The Weston Watkins SVM, we expect all desired margins to be violated (since all scores are approximately zero), and hence expect a loss of 9 (since margin is 1 for each wrong class). If you're not seeing these losses there might be issue with initialization. +- As a second sanity check, increasing the regularization strength should increase the loss +- **Overfit a tiny subset of data**. Lastly and most importantly, before training on the full dataset try to train on a tiny portion (e.g. 20 examples) of your data and make sure you can achieve zero cost. For this experiment it's also best to set regularization to zero, otherwise this can prevent you from getting zero cost. Unless you pass this sanity check with a small dataset it is not worth proceeding to the full dataset. Note that it may happen that you can overfit very small dataset but still have an incorrect implementation. For instance, if your datapoints' features are random due to some bug, then it will be possible to overfit your small training set but you will never notice any generalization when you fold it your full dataset. + + +### Babysitting the learning process + +There are multiple useful quantities you should monitor during training of a neural network. These plots are the window into the training process and should be utilized to get intuitions about different hyperparameter settings and how they should be changed for more efficient learning. + +The x-axis of the plots below are always in units of epochs, which measure how many times every example has been seen during training in expectation (e.g. 
one epoch means that every example has been seen once). It is preferable to track epochs rather than iterations since the number of iterations depends on the arbitrary setting of batch size. + + +#### Loss function + +The first quantity that is useful to track during training is the loss, as it is evaluated on the individual batches during the forward pass. Below is a cartoon diagram showing the loss over time, and especially what the shape might tell you about the learning rate: + +
+ + +
+ Left: A cartoon depicting the effects of different learning rates. With low learning rates the improvements will be linear. With high learning rates they will start to look more exponential. Higher learning rates will decay the loss faster, but they get stuck at worse values of loss (green line). This is because there is too much "energy" in the optimization and the parameters are bouncing around chaotically, unable to settle in a nice spot in the optimization landscape. Right: An example of a typical loss function over time, while training a small network on CIFAR-10 dataset. This loss function looks reasonable (it might indicate a slightly too small learning rate based on its speed of decay, but it's hard to say), and also indicates that the batch size might be a little too low (since the cost is a little too noisy). +
+
+ +The amount of "wiggle" in the loss is related to the batch size. When the batch size is 1, the wiggle will be relatively high. When the batch size is the full dataset, the wiggle will be minimal because every gradient update should be improving the loss function monotonically (unless the learning rate is set too high). + +Some people prefer to plot their loss functions in the log domain. Since learning progress generally takes an exponential form shape, the plot appears more as a slightly more interpretable straight line, rather than a hockey stick. Additionally, if multiple cross-validated models are plotted on the same loss graph, the differences between them become more apparent. + +Sometimes loss functions can look funny [lossfunctions.tumblr.com](http://lossfunctions.tumblr.com/). + + +#### Train/Val accuracy + +The second important quantity to track while training a classifier is the validation/training accuracy. This plot can give you valuable insights into the amount of overfitting in your model: + +
+ +
+ The gap between the training and validation accuracy indicates the amount of overfitting. Two possible cases are shown in the diagram on the left. The blue validation error curve shows very small validation accuracy compared to the training accuracy, indicating strong overfitting (note, it's possible for the validation accuracy to even start to go down after some point). When you see this in practice you probably want to increase regularization (stronger L2 weight penalty, more dropout, etc.) or collect more data. The other possible case is when the validation accuracy tracks the training accuracy fairly well. This case indicates that your model capacity is not high enough: make the model larger by increasing the number of parameters. +
+
+

+
+
+#### Ratio of weights:updates
+
+The last quantity you might want to track is the ratio of the update magnitudes to the value magnitudes. Note: *updates*, not the raw gradients (e.g. in vanilla sgd this would be the gradient multiplied by the learning rate). You might want to evaluate and track this ratio for every set of parameters independently. A rough heuristic is that this ratio should be somewhere around 1e-3. If it is lower than this then the learning rate might be too low. If it is higher then the learning rate is likely too high. Here is a specific example:
+
+~~~python
+import numpy as np
+
+# assume parameter vector W and its gradient vector dW
+param_scale = np.linalg.norm(W.ravel())
+update = -learning_rate*dW # simple SGD update
+update_scale = np.linalg.norm(update.ravel())
+W += update # the actual update
+print(update_scale / param_scale) # want ~1e-3
+~~~
+
+Instead of tracking the min or the max, some people prefer to compute and track the norm of the gradients and their updates instead. These metrics are usually correlated and often give approximately the same results.
+
+
+#### Activation / Gradient distributions per layer
+
+An incorrect initialization can slow down or even completely stall the learning process. Luckily, this issue can be diagnosed relatively easily. One way to do so is to plot activation/gradient histograms for all layers of the network. Intuitively, it is not a good sign to see any strange distributions - e.g. with tanh neurons we would like to see a distribution of neuron activations between the full range of [-1,1], instead of seeing all neurons outputting zero, or all neurons being completely saturated at either -1 or 1.
+
+
+
+#### First-layer Visualizations
+
+Lastly, when one is working with image pixels it can be helpful and satisfying to plot the first-layer features visually:
+ + +
+ Examples of visualized weights for the first layer of a neural network. Left: Noisy features indicate could be a symptom: Unconverged network, improperly set learning rate, very low weight regularization penalty. Right: Nice, smooth, clean and diverse features are a good indication that the training is proceeding well. +
+

+
+
+### Parameter updates
+
+Once the analytic gradient is computed with backpropagation, the gradients are used to perform a parameter update. There are several approaches for performing the update, which we discuss next.
+
+We note that optimization for deep networks is currently a very active area of research. In this section we highlight some established and common techniques you may see in practice, briefly describe their intuition, but leave a detailed analysis outside of the scope of the class. We provide some further pointers for an interested reader.
+
+
+#### SGD and bells and whistles
+
+**Vanilla update**. The simplest form of update is to change the parameters along the negative gradient direction (since the gradient indicates the direction of increase, but we usually wish to minimize a loss function). Assuming a vector of parameters `x` and the gradient `dx`, the simplest update has the form:
+
+~~~python
+# Vanilla update
+x += - learning_rate * dx
+~~~
+
+where `learning_rate` is a hyperparameter - a fixed constant. When evaluated on the full dataset, and when the learning rate is low enough, this is guaranteed to make non-negative progress on the loss function.
+
+**Momentum update** is another approach that almost always enjoys better convergence rates on deep networks. This update can be motivated from a physical perspective of the optimization problem. In particular, the loss can be interpreted as the height of a hilly terrain (and therefore also related to the potential energy since $U = mgh$ and therefore $ U \propto h $ ). Initializing the parameters with random numbers is equivalent to setting a particle with zero initial velocity at some location. The optimization process can then be seen as equivalent to the process of simulating the parameter vector (i.e. a particle) as rolling on the landscape.
+
+Since the force on the particle is related to the gradient of potential energy (i.e. $F = - \nabla U $ ), the **force** felt by the particle is precisely the (negative) **gradient** of the loss function. Moreover, $F = ma $ so the (negative) gradient is in this view proportional to the acceleration of the particle. Note that this is different from the SGD update shown above, where the gradient directly integrates the position. Instead, the physics view suggests an update in which the gradient only directly influences the velocity, which in turn has an effect on the position:
+
+~~~python
+# Momentum update
+v = mu * v - learning_rate * dx # integrate velocity
+x += v # integrate position
+~~~
+
+Here we see an introduction of a `v` variable that is initialized at zero, and an additional hyperparameter (`mu`). As an unfortunate misnomer, this variable is in optimization referred to as *momentum* (its typical value is about 0.9), but its physical meaning is more consistent with the coefficient of friction. Effectively, this variable damps the velocity and reduces the kinetic energy of the system, or otherwise the particle would never come to a stop at the bottom of a hill. When cross-validated, this parameter is usually set to values such as [0.5, 0.9, 0.95, 0.99]. Similar to annealing schedules for learning rates (discussed later, below), optimization can sometimes benefit a little from momentum schedules, where the momentum is increased in later stages of learning. A typical setting is to start with momentum of about 0.5 and anneal it to 0.99 or so over multiple epochs.
+
+> With Momentum update, the parameter vector will build up velocity in any direction that has consistent gradient.
+
+**Nesterov Momentum** is a slightly different version of the momentum update that has recently been gaining popularity. It enjoys stronger theoretical convergence guarantees for convex functions and in practice it also consistently works slightly better than standard momentum.
+
+The core idea behind Nesterov momentum is that when the current parameter vector is at some position `x`, then looking at the momentum update above, we know that the momentum term alone (i.e. ignoring the second term with the gradient) is about to nudge the parameter vector by `mu * v`. Therefore, if we are about to compute the gradient, we can treat the future approximate position `x + mu * v` as a "lookahead" - this is a point in the vicinity of where we are soon going to end up. Hence, it makes sense to compute the gradient at `x + mu * v` instead of at the "old/stale" position `x`.
+
+ +
+ Nesterov momentum. Instead of evaluating gradient at the current position (red circle), we know that our momentum is about to carry us to the tip of the green arrow. With Nesterov momentum we therefore instead evaluate the gradient at this "looked-ahead" position. +
+

+
+That is, in a slightly awkward notation, we would like to do the following:
+
+~~~python
+x_ahead = x + mu * v
+# evaluate dx_ahead (the gradient at x_ahead instead of at x)
+v = mu * v - learning_rate * dx_ahead
+x += v
+~~~
+
+However, in practice people prefer to express the update to look as similar to vanilla SGD or to the previous momentum update as possible. This is possible to achieve by manipulating the update above with a variable transform `x_ahead = x + mu * v`, and then expressing the update in terms of `x_ahead` instead of `x`. That is, the parameter vector we are actually storing is always the ahead version. The equations in terms of `x_ahead` (but renaming it back to `x`) then become:
+
+~~~python
+v_prev = v # back this up
+v = mu * v - learning_rate * dx # velocity update stays the same
+x += -mu * v_prev + (1 + mu) * v # position update changes form
+~~~
+
+We recommend this further reading to understand the source of these equations and the mathematical formulation of Nesterov's Accelerated Momentum (NAG):
+
+- [Advances in optimizing Recurrent Networks](http://arxiv.org/pdf/1212.0901v2.pdf) by Yoshua Bengio, Section 3.5.
+- [Ilya Sutskever's thesis](http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf) (pdf) contains a longer exposition of the topic in section 7.2.
+
+
+
+#### Annealing the learning rate
+
+In training deep networks, it is usually helpful to anneal the learning rate over time. Good intuition to have in mind is that with a high learning rate, the system contains too much kinetic energy and the parameter vector bounces around chaotically, unable to settle down into deeper, but narrower parts of the loss function. Knowing when to decay the learning rate can be tricky: Decay it slowly and you'll be wasting computation bouncing around chaotically with little improvement for a long time. But decay it too aggressively and the system will cool too quickly, unable to reach the best position it can. There are three common types of implementing the learning rate decay:
+
+- **Step decay**: Reduce the learning rate by some factor every few epochs. Typical values might be reducing the learning rate by a half every 5 epochs, or by 0.1 every 20 epochs. These numbers depend heavily on the type of problem and the model. One heuristic you may see in practice is to watch the validation error while training with a fixed learning rate, and reduce the learning rate by a constant (e.g. 0.5) whenever the validation error stops improving.
+- **Exponential decay.** has the mathematical form $\alpha = \alpha_0 e^{-k t}$, where $\alpha_0, k$ are hyperparameters and $t$ is the iteration number (but you can also use units of epochs).
+- **1/t decay** has the mathematical form $\alpha = \alpha_0 / (1 + k t )$ where $\alpha_0, k$ are hyperparameters and $t$ is the iteration number.
+
+In practice, we find that the step decay is slightly preferable because the hyperparameters it involves (the fraction of decay and the step timings in units of epochs) are more interpretable than the hyperparameter $k$. Lastly, if you can afford the computational budget, err on the side of slower decay and train for a longer time.
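
+
+To make the three schedules concrete, here is a minimal sketch (our own illustration, not code from the original notes; the constants `lr0`, `drop`, `epochs_per_drop` and `k` are hypothetical values that you would normally cross-validate):
+
+~~~python
+import numpy as np
+
+def step_decay(epoch, lr0=1e-2, drop=0.5, epochs_per_drop=5):
+  # e.g. halve the learning rate every 5 epochs
+  return lr0 * (drop ** np.floor(epoch / float(epochs_per_drop)))
+
+def exponential_decay(t, lr0=1e-2, k=0.01):
+  # alpha = alpha_0 * exp(-k t)
+  return lr0 * np.exp(-k * t)
+
+def one_over_t_decay(t, lr0=1e-2, k=0.01):
+  # alpha = alpha_0 / (1 + k t)
+  return lr0 / (1.0 + k * t)
+~~~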

+
+
+#### Second order methods
+
+A second, popular group of methods for optimization in context of deep learning is based on [Newton's method](http://en.wikipedia.org/wiki/Newton%27s_method_in_optimization), which iterates the following update:
+
+$$
+x \leftarrow x - [H f(x)]^{-1} \nabla f(x)
+$$
+
+Here, $H f(x)$ is the [Hessian matrix](http://en.wikipedia.org/wiki/Hessian_matrix), which is a square matrix of second-order partial derivatives of the function. The term $\nabla f(x)$ is the gradient vector, as seen in Gradient Descent. Intuitively, the Hessian describes the local curvature of the loss function, which allows us to perform a more efficient update. In particular, multiplying by the inverse Hessian leads the optimization to take more aggressive steps in directions of shallow curvature and shorter steps in directions of steep curvature. Note, crucially, the absence of any learning rate hyperparameters in the update formula, which proponents of these methods cite as a large advantage over first-order methods.
+
+However, the update above is impractical for most deep learning applications because computing (and inverting) the Hessian in its explicit form is a very costly process in both space and time. For instance, a Neural Network with one million parameters would have a Hessian matrix of size [1,000,000 x 1,000,000], occupying approximately 3725 gigabytes of RAM. Hence, a large variety of *quasi-Newton* methods have been developed that seek to approximate the inverse Hessian. Among these, the most popular is [L-BFGS](http://en.wikipedia.org/wiki/Limited-memory_BFGS), which uses the information in the gradients over time to form the approximation implicitly (i.e. the full matrix is never computed).
+
+However, even after we eliminate the memory concerns, a large downside of a naive application of L-BFGS is that it must be computed over the entire training set, which could contain millions of examples. Unlike mini-batch SGD, getting L-BFGS to work on mini-batches is more tricky and an active area of research.
+
+**In practice**, it is currently not common to see L-BFGS or similar second-order methods applied to large-scale Deep Learning and Convolutional Neural Networks. Instead, SGD variants based on (Nesterov's) momentum are more standard because they are simpler and scale more easily.
+
+Additional references:
+
+- [Large Scale Distributed Deep Networks](http://research.google.com/archive/large_deep_networks_nips2012.html) is a paper from the Google Brain team, comparing L-BFGS and SGD variants in large-scale distributed optimization.
+- [SFO](http://arxiv.org/abs/1311.2115) algorithm strives to combine the advantages of SGD with advantages of L-BFGS.
+
+
+#### Per-parameter adaptive learning rate methods
+
+All approaches we've discussed so far have manipulated the learning rate globally and equally for all parameters. Tuning the learning rates is an expensive process, so much work has gone into devising methods that can adaptively tune the learning rates, and even do so per parameter. Many of these methods may still require other hyperparameter settings, but the argument is that they are well-behaved for a broader range of hyperparameter values than the raw learning rate. In this section we highlight some common adaptive methods you may encounter in practice:
+
+**Adagrad** is an adaptive learning rate method originally proposed by [Duchi et al.](http://jmlr.org/papers/v12/duchi11a.html).
+
+~~~python
+# Assume the gradient dx and parameter vector x
+cache += dx**2
+x += - learning_rate * dx / (np.sqrt(cache) + eps)
+~~~
+
+Notice that the variable `cache` has size equal to the size of the gradient, and keeps track of per-parameter sum of squared gradients. This is then used to normalize the parameter update step, element-wise. Notice that the weights that receive high gradients will have their effective learning rate reduced, while weights that receive small or infrequent updates will have their effective learning rate increased. Amusingly, the square root operation turns out to be very important and without it the algorithm performs much worse. The smoothing term `eps` (usually set somewhere in range from 1e-4 to 1e-8) avoids division by zero. A downside of Adagrad is that in case of Deep Learning, the monotonic learning rate usually proves too aggressive and stops learning too early.
+
+**RMSprop.** RMSprop is a very effective, but currently unpublished adaptive learning rate method. Amusingly, everyone who uses this method in their work currently cites [slide 29 of Lecture 6](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) of Geoff Hinton's Coursera class. The RMSProp update adjusts the Adagrad method in a very simple way in an attempt to reduce its aggressive, monotonically decreasing learning rate. In particular, it uses a moving average of squared gradients instead, giving:
+
+~~~python
+cache = decay_rate * cache + (1 - decay_rate) * dx**2
+x += - learning_rate * dx / (np.sqrt(cache) + eps)
+~~~
+
+Here, `decay_rate` is a hyperparameter and typical values are [0.9, 0.99, 0.999]. Notice that the `x+=` update is identical to Adagrad, but the `cache` variable is "leaky". Hence, RMSProp still modulates the learning rate of each weight based on the magnitudes of its gradients, which has a beneficial equalizing effect, but unlike Adagrad the updates do not get monotonically smaller.
+
+**Adam.** [Adam](http://arxiv.org/abs/1412.6980) is a recently proposed update that looks a bit like RMSProp with momentum. The (simplified) update looks as follows:
+
+~~~python
+m = beta1*m + (1-beta1)*dx
+v = beta2*v + (1-beta2)*(dx**2)
+x += - learning_rate * m / (np.sqrt(v) + eps)
+~~~
+
+Notice that the update looks exactly like the RMSProp update, except the "smooth" version of the gradient `m` is used instead of the raw (and perhaps noisy) gradient vector `dx`. Recommended values in the paper are `eps = 1e-8`, `beta1 = 0.9`, `beta2 = 0.999`. In practice Adam is currently recommended as the default algorithm to use, and often works slightly better than RMSProp. However, it is often also worth trying SGD+Nesterov Momentum as an alternative. The full Adam update also includes a *bias correction* mechanism, which compensates for the fact that in the first few time steps the vectors `m,v` are both initialized and therefore biased at zero, before they fully "warm up". We refer the reader to the paper for the details, or the course slides where this is expanded on. A short sketch of the bias-corrected update is included after the references below.
+
+Additional References:
+
+- [Unit Tests for Stochastic Optimization](http://arxiv.org/abs/1312.6055) proposes a series of tests as a standardized benchmark for stochastic optimization.
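
+
+For completeness, here is a minimal sketch of the bias-corrected Adam update, following the description in [the paper](http://arxiv.org/abs/1412.6980). This is our own illustration rather than code from the notes; `t` is the iteration counter and starts at 1:
+
+~~~python
+import numpy as np
+
+def adam_update(x, dx, m, v, t, learning_rate=1e-3,
+                beta1=0.9, beta2=0.999, eps=1e-8):
+  # update the (biased) first and second moment estimates
+  m = beta1 * m + (1 - beta1) * dx
+  v = beta2 * v + (1 - beta2) * (dx**2)
+  # bias correction: m and v start at zero, so scale them up early on
+  mt = m / (1 - beta1**t)
+  vt = v / (1 - beta2**t)
+  x += - learning_rate * mt / (np.sqrt(vt) + eps)
+  return x, m, v
+~~~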
+ + +
+ Animations that may help your intuitions about the learning process dynamics. Left: Contours of a loss surface and time evolution of different optimization algorithms. Notice the "overshooting" behavior of momentum-based methods, which make the optimization look like a ball rolling down the hill. Right: A visualization of a saddle point in the optimization landscape, where the curvature along different dimension has different signs (one dimension curves up and another down). Notice that SGD has a very hard time breaking symmetry and gets stuck on the top. Conversely, algorithms such as RMSprop will see very low gradients in the saddle direction. Due to the denominator term in the RMSprop update, this will increase the effective learning rate along this direction, helping RMSProp proceed. Images credit: Alec Radford. +
+

+
+
+### Hyperparameter optimization
+
+As we've seen, training Neural Networks can involve many hyperparameter settings. The most common hyperparameters in context of Neural Networks include:
+
+- the initial learning rate
+- learning rate decay schedule (such as the decay constant)
+- regularization strength (L2 penalty, dropout strength)
+
+But as we saw, there are many more relatively less sensitive hyperparameters, for example in per-parameter adaptive learning methods, the setting of momentum and its schedule, etc. In this section we describe some additional tips and tricks for performing the hyperparameter search:
+
+**Implementation**. Larger Neural Networks typically require a long time to train, so performing hyperparameter search can take many days/weeks. It is important to keep this in mind since it influences the design of your code base. One particular design is to have a **worker** that continuously samples random hyperparameters and performs the optimization. During the training, the worker will keep track of the validation performance after every epoch, and writes a model checkpoint (together with miscellaneous training statistics such as the loss over time) to a file, preferably on a shared file system. It is useful to include the validation performance directly in the filename, so that it is simple to inspect and sort the progress. Then there is a second program which we will call a **master**, which launches or kills workers across a computing cluster, and may additionally inspect the checkpoints written by workers and plot their training statistics, etc.
+
+**Prefer one validation fold to cross-validation**. In most cases a single validation set of respectable size substantially simplifies the code base, without the need for cross-validation with multiple folds. You'll hear people say they "cross-validated" a parameter, but many times it is assumed that they still only used a single validation set.
+
+**Hyperparameter ranges**. Search for hyperparameters on log scale. For example, a typical sampling of the learning rate would look as follows: `learning_rate = 10 ** uniform(-6, 1)`. That is, we are generating a random number from a uniform distribution, but then using it as the exponent of 10. The same strategy should be used for the regularization strength. Intuitively, this is because learning rate and regularization strength have multiplicative effects on the training dynamics. For example, a fixed change of adding 0.01 to a learning rate has huge effects on the dynamics if the learning rate is 0.001, but nearly no effect when the learning rate is 10. This is because the learning rate multiplies the computed gradient in the update. Therefore, it is much more natural to consider a range of learning rate multiplied or divided by some value, than a range of learning rate added to or subtracted from by some value. Some parameters (e.g. dropout) are instead usually searched in the original scale (e.g. `dropout = uniform(0,1)`).
+
+**Prefer random search to grid search**. As argued by Bergstra and Bengio in [Random Search for Hyper-Parameter Optimization](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf), "randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid". As it turns out, this is also usually easier to implement. A minimal sketch of such a random search loop is given below.
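
+
+To make the two tips above concrete, here is a minimal sketch of a random search worker that samples on a log scale (our own illustration; `train_and_evaluate` is a hypothetical stand-in for a full training run returning validation accuracy, not a function from these notes):
+
+~~~python
+import numpy as np
+
+def train_and_evaluate(learning_rate, reg):
+  # placeholder for a real training run; returns validation accuracy
+  return np.random.rand()
+
+best_params, best_val = None, -1.0
+for trial in range(100):
+  learning_rate = 10 ** np.random.uniform(-6, 1)  # sample on log scale
+  reg = 10 ** np.random.uniform(-5, 5)            # sample on log scale
+  val_acc = train_and_evaluate(learning_rate, reg)
+  if val_acc > best_val:
+    best_params, best_val = (learning_rate, reg), val_acc
+~~~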
+ +
+ Core illustration from Random Search for Hyper-Parameter Optimization by Bergstra and Bengio. It is very often the case that some of the hyperparameters matter much more than others (e.g. top hyperparam vs. left one in this figure). Performing random search rather than grid search allows you to much more precisely discover good values for the important ones. +
+

+
+**Careful with best values on border**. Sometimes it can happen that you're searching for a hyperparameter (e.g. learning rate) in a bad range. For example, suppose we use `learning_rate = 10 ** uniform(-6, 1)`. Once we receive the results, it is important to double check that the final learning rate is not at the edge of this interval, or otherwise you may be missing more optimal hyperparameter settings beyond the interval.
+
+**Stage your search from coarse to fine**. In practice, it can be helpful to first search in coarse ranges (e.g. 10 ** [-6, 1]), and then depending on where the best results are turning up, narrow the range. Also, it can be helpful to perform the initial coarse search while only training for 1 epoch or even less, because many hyperparameter settings can lead the model to not learn at all, or immediately explode with infinite cost. The second stage could then perform a narrower search with 5 epochs, and the last stage could perform a detailed search in the final range for many more epochs (for example).
+
+**Bayesian Hyperparameter Optimization** is a whole area of research devoted to coming up with algorithms that try to more efficiently navigate the space of hyperparameters. The core idea is to appropriately balance the exploration - exploitation trade-off when querying the performance at different hyperparameters. Multiple libraries have been developed based on these models as well; some of the better known ones are [Spearmint](https://github.com/JasperSnoek/spearmint), [SMAC](http://www.cs.ubc.ca/labs/beta/Projects/SMAC/), and [Hyperopt](http://jaberg.github.io/hyperopt/). However, in practical settings with ConvNets it is still relatively difficult to beat random search in carefully-chosen intervals. See some additional from-the-trenches discussion [here](http://nlpers.blogspot.com/2014/10/hyperparameter-search-bayesian.html).
+
+
+## Evaluation
+
+
+### Model Ensembles
+
+In practice, one reliable approach to improving the performance of Neural Networks by a few percent is to train multiple independent models, and at test time average their predictions. As the number of models in the ensemble increases, the performance typically monotonically improves (though with diminishing returns). Moreover, the improvements are more dramatic with higher model variety in the ensemble. There are a few approaches to forming an ensemble:
+
+- **Same model, different initializations**. Use cross-validation to determine the best hyperparameters, then train multiple models with the best set of hyperparameters but with different random initialization. The danger with this approach is that the variety is only due to initialization.
+- **Top models discovered during cross-validation**. Use cross-validation to determine the best hyperparameters, then pick the top few (e.g. 10) models to form the ensemble. This improves the variety of the ensemble but has the danger of including suboptimal models. In practice, this can be easier to perform since it doesn't require additional retraining of models after cross-validation.
+- **Different checkpoints of a single model**. If training is very expensive, some people have had limited success in taking different checkpoints of a single network over time (for example after every epoch) and using those to form an ensemble. Clearly, this suffers from some lack of variety, but can still work reasonably well in practice. The advantage of this approach is that it is very cheap.
+- **Running average of parameters during training**. Related to the last point, a cheap way of almost always getting an extra percent or two of performance is to maintain a second copy of the network's weights in memory that maintains an exponentially decaying sum of previous weights during training. This way you're averaging the state of the network over last several iterations. You will find that this "smoothed" version of the weights over last few steps almost always achieves better validation error. The rough intuition to have in mind is that the objective is bowl-shaped and your network is jumping around the mode, so the average has a higher chance of being somewhere nearer the mode.
+
+One disadvantage of model ensembles is that they take longer to evaluate on test examples. An interested reader may find the recent work from Geoff Hinton on ["Dark Knowledge"](https://www.youtube.com/watch?v=EK61htlw8hY) inspiring, where the idea is to "distill" a good ensemble back to a single model by incorporating the ensemble log likelihoods into a modified objective.
+
+
+## Summary
+
+To train a Neural Network:
+
+- Gradient check your implementation with a small batch of data and be aware of the pitfalls.
+- As a sanity check, make sure your initial loss is reasonable, and that you can achieve 100% training accuracy on a very small portion of the data
+- During training, monitor the loss, the training/validation accuracy, and if you're feeling fancier, the magnitude of updates in relation to parameter values (it should be ~1e-3), and when dealing with ConvNets, the first-layer weights.
+- The two recommended updates to use are either SGD+Nesterov Momentum or Adam.
+- Decay your learning rate over the period of the training. For example, halve the learning rate after a fixed number of epochs, or whenever the validation accuracy tops off.
+- Search for good hyperparameters with random search (not grid search). Stage your search from coarse (wide hyperparameter ranges, training only for 1-5 epochs), to fine (narrower ranges, training for many more epochs)
+- Form model ensembles for extra performance
+
+
+## Additional References
+
+- [SGD](http://research.microsoft.com/pubs/192769/tricks-2012.pdf) tips and tricks from Leon Bottou
+- [Efficient BackProp](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf) (pdf) from Yann LeCun
+- [Practical Recommendations for Gradient-Based Training of Deep
+Architectures](http://arxiv.org/pdf/1206.5533v2.pdf) from Yoshua Bengio
\ No newline at end of file
From 4cc88f6a8c3b69d7672509c65ff567b222cee7f5 Mon Sep 17 00:00:00 2001
From: MaybeS
Date: Tue, 19 Apr 2016 20:24:43 +0900
Subject: [PATCH 072/199] Update .gitignore (.ipynb_checkpoints)
 Update Translate

---
 .gitignore                                 | 1 +
 assignments2016/assignment1/features.ipynb | 4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/.gitignore b/.gitignore
index 4e4b44ae..c3f21a98 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,4 @@
 _site
 .DS_Store
 *.swp
+.ipynb_checkpoints

diff --git a/assignments2016/assignment1/features.ipynb b/assignments2016/assignment1/features.ipynb
index ff361f1a..7e0177e0 100644
--- a/assignments2016/assignment1/features.ipynb
+++ b/assignments2016/assignment1/features.ipynb
@@ -219,7 +219,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### Inline question 1:\n",
+    "### 연습문제 1:\n",
     " 잘못 분류된 결과에 대해 설명해보세요. 의미를 알 수 있나요?"
   ]
  },
 {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 이미지 특징의 신경망\n",
-    "이번 과제에서 우리는 단순 픽셀의 2-계층 신경망을 학습시키면 선형 분류기보다 성능이 더 향상됨을 배웠습니다. 
이 notebook에서 우리는 이미지 특징의 선형 분류기가 단순 픽셀의 선형 분류기보다 뛰어나다는 것을 알 수 있었습니다.\n", + "이번 과제에서 우리는 단순 픽셀의 2-계층 신경망을 학습시키면 선형 분류기보다 성능이 더 향상됨을 배웠습니다. 이번 notebook에서 우리는 선형 분류기를 이미지 픽셀에 바로 적용하는 것보다 이미지에서 추출한 특징(feature)에 적용하는 것이 더 좋은 성능을 얻는 것을 알 수 있었습니다.\n", "\n", "완성도를 위해, 우리는 이미지 특징의 신경망 또한 학습시켜보아야 합니다. 이 접근법은 이전의 모든 방법보다 더 뛰어날 것입니다: 테스트 세트에 대해 55%이상의 분류 정확도를 쉽게 달성할 수 있어야합니다; 우리의 최고의 모델은 60%의 분류 정확도를 달성했습니다." ] From 8c6f155431854e7a83fd243aed3e519171d04b5b Mon Sep 17 00:00:00 2001 From: JK Im Date: Tue, 19 Apr 2016 17:41:42 -0500 Subject: [PATCH 073/199] Update optimization-1.md --- optimization-1.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/optimization-1.md b/optimization-1.md index b9a35b64..2b528e6e 100644 --- a/optimization-1.md +++ b/optimization-1.md @@ -177,13 +177,13 @@ $$ ### 그라디언트(gradient) 계산 -There are two ways to compute the gradient: A slow, approximate but easy way (**numerical gradient**), and a fast, exact but more error-prone way that requires calculus (**analytic gradient**). We will now present both. +그라디언트(gradient) 계산법은 크게 2가지가 있다: 느리고 근사값이지만 쉬운 방법 (**수치 그라디언트**), and 빠르고 정확하지만 미분이 필요하고 실수하기 쉬운 방법 (**해석적 그라디언트**). 여기서 둘 다 다룰 것이다. -#### Computing the gradient numerically with finite differences +#### 유한한 차이(Finite Difference)를 이용하여 수치적으로 그라디언트(gradient) 계산하기 -The formula given above allows us to compute the gradient numerically. Here is a generic function that takes a function `f`, a vector `x` to evaluate the gradient on, and returns the gradient of `f` at `x`: +위에 주어진 수식을 이용하여 그라디언트(gradient)를 수치적으로 계산할 수 있다. 여기 임의의 함수 `f`, 이 함수에 입력값으로 넣을 벡터 `x` 가 주어졌을 때, `x`에서 `f`의 그라디언트(gradient)를 계산해주는 범용 함수가 있다: ~~~python def eval_numerical_gradient(f, x): @@ -215,11 +215,11 @@ def eval_numerical_gradient(f, x): return grad ~~~ -Following the gradient formula we gave above, the code above iterates over all dimensions one by one, makes a small change `h` along that dimension and calculates the partial derivative of the loss function along that dimension by seeing how much the function changed. The variable `grad` holds the full gradient in the end. +이 코드는, 위에 주어진 그라디언트(gradient) 식을 이용해서 모든 차원을 하나씩 돌아가면서 그 방향으로 작은 변화 `h`를 줬을 때, 손실함수(loss function)의 값이 얼마나 변하는지를 구해서, 그 방향의 편미분 값을 계산한다. 변수 `grad`에 전체 그라디언트(gradient) 값이 최종적으로 저장된다. -**Practical considerations**. Note that in the mathematical formulation the gradient is defined in the limit as **h** goes towards zero, but in practice it is often sufficient to use a very small value (such as 1e-5 as seen in the example). Ideally, you want to use the smallest step size that does not lead to numerical issues. Additionally, in practice it often works better to compute the numeric gradient using the **centered difference formula**: $ [f(x+h) - f(x-h)] / 2 h $ . See [wiki](http://en.wikipedia.org/wiki/Numerical_differentiation) for details. +**실제 고려할 사항**. **h**가 0으로 수렴할 때의 극한값이 그라디언트(gradient)의 수학적으로 정의인데, (이 예제에서 나온 것처럼 1e-5 같이) 작은 값이면 충분하다. 이상적으로, 수치적인 문제를 일으키지 않는 수준에서 가장 작은 값을 쓰면 된다. 덧붙여서, 실제 활용할 때, x를 **양 방향으로 변화를 주어서 구한 수식**이 더 좋은 경우가 많다: $ [f(x+h) - f(x-h)] / 2 h $ . 다음 [위키](http://en.wikipedia.org/wiki/Numerical_differentiation)를 보면 자세한 것을 알 수 있다. -We can use the function given above to compute the gradient at any point and for any function. Lets compute the gradient for the CIFAR-10 loss function at some random point in the weight space: +위에서 계산한 함수를 이용하면, 아무 함수의 아무 값에서나 그라디언트(gradient)를 계산할 수 있다. 
무작위로 뽑은 모수(parameter/weight)값에서 CIFAR-10의 손실함수(loss function)의 그라디언트를 구해본다.: ~~~python @@ -232,7 +232,7 @@ W = np.random.rand(10, 3073) * 0.001 # random weight vector df = eval_numerical_gradient(CIFAR10_loss_fun, W) # get the gradient ~~~ -The gradient tells us the slope of the loss function along every dimension, which we can use to make an update: +그라디언트(gradient)는 각 차원에서 CIFAR-10의 손실함수(loss function)의 기울기를 알려주는데, 그걸 이용해서 모수(parameter/weight)를 업데이트한다. ~~~python loss_original = CIFAR10_loss_fun(W) # the original loss @@ -259,9 +259,9 @@ for step_size_log in [-10, -9, -8, -7, -6, -5,-4,-3,-2,-1]: # for step size 1.000000e-01 new loss: 25392.214036 ~~~ -**Update in negative gradient direction**. In the code above, notice that to compute `W_new` we are making an update in the negative direction of the gradient `df` since we wish our loss function to decrease, not increase. +**Update in negative gradient direction**. 위 코드에서, 새로운 모수 `W_new`로 업데이트할 때, 그라디언트(gradient) `df`의 반대방향으로 움직인 것을 주목하자. 왜냐하면 우리가 원하는 것은 손실함수(loss function)의 증가가 아니라 감소하는 것이기 때문이다. -**Effect of step size**. The gradient tells us the direction in which the function has the steepest rate of increase, but it does not tell us how far along this direction we should step. As we will see later in the course, choosing the step size (also called the *learning rate*) will become one of the most important (and most headache-inducing) hyperparameter settings in training a neural network. In our blindfolded hill-descent analogy, we feel the hill below our feet sloping in some direction, but the step length we should take is uncertain. If we shuffle our feet carefully we can expect to make consistent but very small progress (this corresponds to having a small step size). Conversely, we can choose to make a large, confident step in an attempt to descend faster, but this may not pay off. As you can see in the code example above, at some point taking a bigger step gives a higher loss as we "overstep". +**작은 변화값 `h` (step)의 크기가 미치는 영향**. 그라디언트(gradient)에서 알 수 있는 것은 함수값이 가장 빠르게 증가하는 방향이고, 그 방향으로 대체 얼만큼을 가야하는지는 알려주지 않는다. 강의 뒤에서 다루게 되겠지만, 얼만큼 가야하는지(*학습 속도*라고도 함), 즉, `h`값의 크기는 신경망(neural network)를 학습시킬 때 있어 가장 중요한 (그래서 결정하기 까다로운) 하이퍼파라미터(hyperparameter)가 될 것이다. 눈 가리고 하산하는 비유에서, 우리는 우리 발 밑으로 어느 방향이 가장 가파른지 느끼지만, 얼마나 발을 뻗어야할 지는 불확실하다. 발을 살살 휘져으면, 꾸준하지만 매우 조금씩밖에 못 내려갈 것이다. (이는 아주 작은 `h`값에 비견된다.) 반대로, 욕심껏 빨리 내려가려고 크고 과감하게 발을 내딛을 수도 있는데, 항상 뜻대로 되지는 않을지 모른다. 위의 제시된 코드에서와 같이, 어느 수준 이상의 큰 `h`값은 오히려 손실값을 증가시킨다.
@@ -271,7 +271,7 @@ for step_size_log in [-10, -9, -8, -7, -6, -5,-4,-3,-2,-1]:
-**A problem of efficiency**. You may have noticed that evaluating the numerical gradient has complexity linear in the number of parameters. In our example we had 30730 parameters in total and therefore had to perform 30,731 evaluations of the loss function to evaluate the gradient and to perform only a single parameter update. This problem only gets worse, since modern Neural Networks can easily have tens of millions of parameters. Clearly, this strategy is not scalable and we need something better. +**효율성의 문제**. 알다시피, 그라디언트(gradient)를 수치적으로 계산하는 데 드는 비용은 모수(parameter/weight)의 수에 따라 선형적으로 늘어난다. 위 예시에서, 총 30,730의 모수(parameter/weight)가 있으므로 30,731번 손실함수값을 계산해서 그라디언트(gradient)를 구해 봐야 딱 한 번 업데이트할 수 있다. 요즘 쓰이는 신경망(neural networks)들은 수천만개의 모수(parameter/weight)도 우스운데, 그런 경우 이 문제는 매우 심각해진다. 당연하게도, 이 전략은 별로고, 더 좋은게 있다. From b552f2e0ac611e2d6fee76ac85604a3953d25434 Mon Sep 17 00:00:00 2001 From: YB Date: Sun, 24 Apr 2016 21:19:27 -0400 Subject: [PATCH 074/199] Lecture1 - part 121~140 (out of 715) en / ko --- captions/En/Lecture1_en.srt | 68 ++++++++++++++++++------------------- captions/Ko/Lecture1_ko.srt | 50 +++++++++++++++------------ 2 files changed, 62 insertions(+), 56 deletions(-) diff --git a/captions/En/Lecture1_en.srt b/captions/En/Lecture1_en.srt index 733f41b4..b66f4f95 100644 --- a/captions/En/Lecture1_en.srt +++ b/captions/En/Lecture1_en.srt @@ -598,18 +598,18 @@ problems so it's a brief history but 122 00:13:46,750 --> 00:13:54,149 -does it mean so short history so we're -gonna go all the way back to 200 540 +doesn't mean it's a short history so we're +gonna go all the way back to 200 sorry, 540 123 00:13:54,149 --> 00:14:00,110 million years ago so why why did I -picked this you know on the other scale +picked this you know on the scale 124 00:14:00,110 --> 00:14:09,240 -of Earth history this is a fairly -specific range of years while so I don't +of Earth history this is a very +specific range of years. Well, so I don't 125 00:14:09,240 --> 00:14:14,049 @@ -618,76 +618,76 @@ is a very very curious period of the 126 00:14:14,049 --> 00:14:23,539 -Earth's history biologists call this the -big bag of evolution before 503 +Earth's history. Biologists call this the +big bag of evolution. Before 503, 4 127 00:14:23,539 --> 00:14:27,679 -for 540 million years ago +540 million years ago, 128 00:14:27,679 --> 00:14:37,989 -a very peaceful of its pretty big pot of -water so we have very simple organisms +The Earth was a very peaceful pot of water. +It's pretty big pot of water. So, we have very simple organisms. 129 00:14:37,990 --> 00:14:46,049 -these are like animals that just floats -in the water and the way the eastern and +These are like animals that just floats +in the water and the way they eat and hang out 130 00:14:46,049 --> 00:14:53,838 -now a daily basis is you know the flow -to move some kind of food comes by near +on a daily basis is you know they just float +and some kind of food comes by near 131 00:14:53,839 --> 00:15:01,160 -their house or whatever they just open +their mouth or whatever, they just open their mouths grabbed it and we don't 132 -00:15:01,159 --> 00:15:09,969 -have too many different types of animals -but really strange happened around 540 +00:15:01,160 --> 00:15:09,969 +have too many different types of animals, +but something really strange happened around 540 133 00:15:09,970 --> 00:15:18,430 -million solely from the fossils we study -there's a huge explosive of species +million suddenly from the fossils we study +there's a huge explosive of species. 
134 -00:15:18,429 --> 00:15:27,729 -biologist car speciation like suddenly -for some reason something that animal +00:15:18,430 --> 00:15:27,729 +Biologists call speciation. It's like suddenly, +for some reason, something hit the Earth that animal 135 00:15:27,730 --> 00:15:35,230 -start to diversify and they got a -complex the start 2022 you start a +start to diversify and they got really +complex the start to have 136 00:15:35,230 --> 00:15:41,039 -predators and praising and they have all -kind of tools to to to survive what was +predators and preys and they have all +kind of tools to survive. What was 137 00:15:41,039 --> 00:15:46,698 -the triggering force of those was a huge -question because people received no did +the triggering force of this was a huge +question, because people was saying 138 00:15:46,698 --> 00:15:53,269 -you know another SAT whatever meteoroid -earth or or you know the environment +you know another said whatever meteoroid hit +the Earth or or you know the environment 139 00:15:53,269 --> 00:16:00,198 -change they talk about one of the most -convincing theory is by this guy call +change? It turned out one of the most +convincing theory is by this guy called 140 00:16:00,198 --> 00:16:03,159 -Andrew Parker +Andrew Parker. He is a 141 00:16:03,159 --> 00:16:09,490 diff --git a/captions/Ko/Lecture1_ko.srt b/captions/Ko/Lecture1_ko.srt index dd608bdd..ed23c4c0 100644 --- a/captions/Ko/Lecture1_ko.srt +++ b/captions/Ko/Lecture1_ko.srt @@ -489,83 +489,89 @@ 121 00:13:39,528 --> 00:13:46,750 - 도구는 간단한 역사 그래서 중요한 문제를 해결하기에에에 있지만, + 비전의 역사는 간략하지만, 122 00:13:46,750 --> 00:13:54,149 - 우리가 거​​ (200) (540)에 다시 모든 길을 갈 것 때문에 그렇게 짧은 역사를 의미 하는가 + 짧은 역사는 아닙니다. 이야기는 5억 4천만년 전으로 거슬러 올라갑니다. 123 00:13:54,149 --> 00:14:00,110 - 왜 그랬는지이 고른 이유 만 년 전 그래서 당신은 다른 규모에 알고 + 왜 5억 4천만년 전 일까요? 124 00:14:00,110 --> 00:14:09,240 - 그래서하지 않는 동안 지구 역사의이 년의 상당히 구체적인 범위는 + 제가 매우 구체적인 시간대를 이야기했는데요. 125 00:14:09,240 --> 00:14:14,049 - 이 들었을 알고 있지만 이것은 매우 호기심 기간 + 여러분이 이미 알고 있는지는 모르지만 + 이시간대는 지구의 역사 중에서도 매우 흥미로운 시간대 입니다. 126 00:14:14,049 --> 00:14:23,539 - 지구 역사의 생물 학자들은이 503 전에 진화의 큰 가방 전화 + 생물 학자들은 이 시기를 진화의 큰 가방이라고 부릅니다. + 매우 간단한 생명체들이 있었죠. 127 00:14:23,539 --> 00:14:27,679 - 5억4천만년 전을위한 + 5억4천만년 전 그 이전의 지구는 128 00:14:27,679 --> 00:14:37,989 - 물은 꽤 큰 냄비의 아주 평화로운 그래서 우리는 아주 간단한 생물이 + 아주 평화로운 물이 담긴 큰 냄비였어요. + 매우 간단한 유기체들이 있었는데, 129 00:14:37,990 --> 00:14:46,049 - 이건 그냥 물에 떠 동물과 방법 동부와 같다 + 이들은 마치 물에 떠 다니는 동물들처럼 + 일상적인 먹고 살아가는 방법은 130 00:14:46,049 --> 00:14:53,838 - 지금은 매일 매일 당신이 음식 근처에 의해 제공의 어떤 종류를 이동하는 흐름을 알고있다 + 단지 둥둥 떠다니면서 먹을 것이 주변에 다가오면 131 00:14:53,839 --> 00:15:01,160 - 그들은 단지 자신의 입을 열어 무엇이든 자신의 집 또는 그것을 잡고 우리는하지 않습니다 + 그들은 입에 넣는 거죠. 132 -00:15:01,159 --> 00:15:09,969 - 동물의 너무 많은 다른 유형을 가지고 있지만 정말 이상한 약 540 일 +00:15:01,160 --> 00:15:09,969 + 이 당시에는 그렇게 많은 종류에 동물들이 있지 않았어요. + 그런데 아주 이상한 일이 일어났어요. 133 00:15:09,970 --> 00:15:18,430 - 전적으로 우리가 공부 화석에서 백만 종의 거대한 폭발있다 + 5억 4천만년 전 이후의 화석에서는 엄청난 양의 종들이 발견된 것이죠. 134 -00:15:18,429 --> 00:15:27,729 - 어떤 이유로 어떤 동물을위한 갑자기 같은 생물학 차 분화 +00:15:18,430 --> 00:15:27,729 + 생물학자들은 이 것을 종의 분화라고 하죠. + 아주 갑자기 무슨 이유인지 지구에서 135 00:15:27,730 --> 00:15:35,230 - 다양 화하기 시작하고 그들이있어 복잡한 시작 2022 당신이 시작할 + 생물들이 다양화하기 시작하고 매우 복잡한 136 00:15:35,230 --> 00:15:41,039 - 포식자와 찬양 그리고 그들은 무엇 살아남을 수있는 도구의 모든 종류가 + 포식자와 먹이감의 관계 그리고 살아남기 위한 수단들을 가지기 시작했어요. 137 00:15:41,039 --> 00:15:46,698 - 사람들이 받았기 때문에 사람들의 트리거 힘이 큰 문제였다 더 않았다 + 그럼 과연 무엇이 이 변화들의 시발점이었는가 하는 질문이 남아있었죠. 138 00:15:46,698 --> 00:15:53,269 - 당신은 어떤 유성 지구 또는 다른 SAT 알고하거나 환경을 알고 + 사람들은 유성이 지구 떨어졌다던가 환경이 바뀌었다던가하는 주장들을 했어요. 
139 00:15:53,269 --> 00:16:00,198 - 그들이 가장 설득력있는 이론의 약 얘기를 변경하면이 사람의 전화입니다 + 그중에서도 제일 설득력있는 이론은 Andrew Parker의 이론입니다. 140 00:16:00,198 --> 00:16:03,159 - 앤드류 파커 + 그는 141 00:16:03,159 --> 00:16:09,490 From c3b99adc7762bbde6e6be5d1daa2f91a918212b2 Mon Sep 17 00:00:00 2001 From: osx_gnujoow Date: Wed, 27 Apr 2016 17:39:28 +0900 Subject: [PATCH 075/199] =?UTF-8?q?=EC=B4=88=EC=95=88=EC=9E=91=EC=84=B1=20?= =?UTF-8?q?=EC=99=84=EB=A3=8C?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- terminal-tutorial.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/terminal-tutorial.md b/terminal-tutorial.md index 5c7fa6ef..4d996329 100644 --- a/terminal-tutorial.md +++ b/terminal-tutorial.md @@ -3,36 +3,36 @@ layout: page title: Terminal.com Tutorial permalink: /terminal-tutorial/ --- -For the assignments, we offer an option to use [Terminal](https://www.stanfordterminalcloud.com) for developing and testing your implementations. Notice that we're not using the main Terminal.com site but a subdomain which has been assigned specifically for this class. [Terminal](https://www.stanfordterminalcloud.com) is an online computing platform that allows us to access pre-configured command line environments. Note that, it's not required to use [Terminal](https://www.stanfordterminalcloud.com) for your assignments; however, it might make life easier with all the required dependencies and development toolkits configured for you. +과제를 진행하기 위해서, [Terminal](https://www.stanfordterminalcloud.com)을 사용하는 옵션을 제공합니다. Terminal에서 여러분들의 결과물을 개발하고 테스트 할 수 있습니다. 한가지 유의해야할 것은 Terminal.com의 메인페이지를 사용하지 않고 cs231n 수업을 위해 특별히 할당된 서브도메인에 등록된 사이트를 사용합니다. [Terminal](https://www.stanfordterminalcloud.com)은 미리 설정된 커맨드 라인 환경(command line environment)에 접근할수 있는 온라인 컴퓨팅 플랫폼입니다. 여러분들의 과제를 진행하기 위해서 반드시 [Terminal](https://www.stanfordterminalcloud.com) 을 사용할 필요는 없습니다. 그러나 개발을 위한 필요사항들과 개발도구들이 미리 설정되어 있기 때문에 수고를 덜 수 있습니다. -This tutorial lists the necessary steps of working on the assignments using Terminal. First of all, [sign up your own account](https://www.stanfordterminalcloud.com/signup). Log in [Terminal](https://www.stanfordterminalcloud.com) with the account that you have just created. +이 튜토리얼은 Terminal을 사용하여 과제를 진행하기 위한 필수적인 과정들을 설명합니다. 가장 먼저, [여러분의 계정을 만듭니다.](https://www.stanfordterminalcloud.com/signup). 방금전에 만든 계정으로 [Terminal](https://www.stanfordterminalcloud.com)에 로그인 합니다. -For each assignment, we will provide you a link to a shared terminal snapshot. These snapshots are pre-configured command line environments with the starter code, where you can write your implementations and execute the code. +각각의 과제마다 Terminal 스냅샷 링크를 제공합니다. 이 스냅샷들은 여러분들의 결과물을 작성하고 실행할 시작코드와 미리 설정된 커맨드 라인 환경이 포함되어 있습니다. -Here's an example of what a snapshot page looked like for an assignment in 2015: +여기 2015년 과제처럼 보이는 스냅샷을 통해 예를 들어보겠습니다.
-Yours will look similar. Click the "Start" button on the lower right corner. This will clone the shared snapshot to your own account. Now you should be able to find the terminal under the [My Terminals](https://www.stanfordterminalcloud.com/terminals) tab. +여러분의 스냅샷도 이와 비슷할 것입니다. 오른쪽 아래의 "Start" 버튼을 클릭합니다. 그럼 여러분의 계정에 공유된 스냅샷이 복사됩니다. 이제 [My Terminals](https://www.stanfordterminalcloud.com/terminals) 탭에서 복사된 터미널을 찾을 수 있습니다.
-Yours will look similar. You are all set! To work on the assignments, click the link to your terminal (shown in the red box in the above image). This link will open up the user interface layer over an AWS machine. It will look something similar to this: +여러분의 화면도 이와 비슷할 것입니다. 이제 과제를 진행하기 위한 준비가 되었습니다! 링크를 클릭하여 terminal을 열어봅시다. (위 이미지의 빨간색 상자) 이 링크는 AWS 머신상의 유저인터페이스를 계층을 엽니다. 다음과 비슷한 화면이 나타납니다.
-We have set up the Jupyter Notebook and other dependencies in the terminal. Launch a new console window with the small + sign (if you don't already have one), navigate around and look for the assignment folder and code. Launch a Jupyer notebook and work on the assignment. If your're a student enrolled in the class you will submit your assignment through Coursework: +terminal에 Jupyter Notebook과 다른 필요요소들이 설치되어 있습니다. 조그마한 + 버튼을 눌러 콘솔을 실행합니다.(콘솔이 없을 경우), 그리고 과제폴더와 코드를 찾습니다. 그리고 Jupyper Notebook을 실행하고 과제를 진행합니다. 만약 당신이 cs231n에 등록한 학생이면 코스워크를 통해 과제를 제출해야합니다.
-For more information about [Terminal](https://www.stanfordterminalcloud.com), check out the [FAQ](https://www.stanfordterminalcloud.com/faq) page. +[Terminal](https://www.stanfordterminalcloud.com)에 대한 더 많은 정보를 원하시면 [FAQ](https://www.stanfordterminalcloud.com/faq)페이지를 방문해주세요 **Important Note:** the usage of Terminal is charged on an hourly rate based on the instance type. A medium type instance costs $0.124 per hour. If you are enrolled in the class email Serena Yeung (syyeung@cs.stanford.edu) to request Terminal credits. We will send you $3 the first time around, and you can request more funds on a rolling basis when you run out. Please be responsible with the funds we allocate you. \ No newline at end of file From 466d2ffbc7fa892bfdc54107f1f4a5a190e26deb Mon Sep 17 00:00:00 2001 From: Donghun Lee Date: Thu, 28 Apr 2016 05:46:00 +0900 Subject: [PATCH 076/199] Translate understanding-cnn.md file test (#33) Test commit for translating understanding-cnn.md file --- understanding-cnn.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/understanding-cnn.md b/understanding-cnn.md index a8e3fa23..c931cff5 100644 --- a/understanding-cnn.md +++ b/understanding-cnn.md @@ -8,8 +8,10 @@ permalink: /understanding-cnn/ (this page is currently in draft form) ## Visualizing what ConvNets learn +## ConvNets이 무엇을 학습하는지의 시각화 Several approaches for understanding and visualizing Convolutional Networks have been developed in the literature, partly as a response the common criticism that the learned features in a Neural Network are not interpretable. In this section we briefly survey some of these approaches and related work. +학계에서 콘볼루션 네트워크 (Convolution Network)을 이해하고 시각화 하려고 하는 여러가지 시도들이 있었는데, 이는 신경망 네트워크 (Neural Network)를 통해 학습된 특징 (feature) 가 설명력이 없다는 일반적인 비판에 대해 답변을 하기 위함이다. 이번 섹션에는 이와 관련된 접근법과 관련 연구들을 간단하게 알아보고자 한다. ### Visualizing the activations and first-layer weights From 4dd08b194d8eea24fa6672ae3bb7480ad5121211 Mon Sep 17 00:00:00 2001 From: osx_gnujoow Date: Thu, 28 Apr 2016 09:56:07 +0900 Subject: [PATCH 077/199] Terminal.md :: Importance Note --- terminal-tutorial.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/terminal-tutorial.md b/terminal-tutorial.md index 4d996329..1fd80657 100644 --- a/terminal-tutorial.md +++ b/terminal-tutorial.md @@ -35,4 +35,4 @@ terminal에 Jupyter Notebook과 다른 필요요소들이 설치되어 있습니 [Terminal](https://www.stanfordterminalcloud.com)에 대한 더 많은 정보를 원하시면 [FAQ](https://www.stanfordterminalcloud.com/faq)페이지를 방문해주세요 -**Important Note:** the usage of Terminal is charged on an hourly rate based on the instance type. A medium type instance costs $0.124 per hour. If you are enrolled in the class email Serena Yeung (syyeung@cs.stanford.edu) to request Terminal credits. We will send you $3 the first time around, and you can request more funds on a rolling basis when you run out. Please be responsible with the funds we allocate you. \ No newline at end of file +**중요 공지사항:** 터미널 사용시 사용하는 인스턴스 타입에 따라 시간당 사용요금이 부과됩니다. 미디엄 타입의 인스턴스 요금은 시간당 $0.124 입니다. 
\ No newline at end of file From a712881fc092f8e147d3b9aeb3c499fb05a808da Mon Sep 17 00:00:00 2001 From: osx_gnujoow Date: Thu, 28 Apr 2016 10:16:41 +0900 Subject: [PATCH 078/199] no message --- terminal-tutorial.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/terminal-tutorial.md b/terminal-tutorial.md index 1fd80657..4c01c5b8 100644 --- a/terminal-tutorial.md +++ b/terminal-tutorial.md @@ -35,4 +35,4 @@ terminal에 Jupyter Notebook과 다른 필요요소들이 설치되어 있습니 [Terminal](https://www.stanfordterminalcloud.com)에 대한 더 많은 정보를 원하시면 [FAQ](https://www.stanfordterminalcloud.com/faq)페이지를 방문해주세요 -**중요 공지사항:** 터미널 사용시 사용하는 인스턴스 타입에 따라 시간당 사용요금이 부과됩니다. 미디엄 타입의 인스턴스 요금은 시간당 $0.124 입니다. \ No newline at end of file +**중요** 터미널 사용시 사용하는 인스턴스 타입에 따라 시간당 사용요금이 부과됩니다. 미디엄 타입의 인스턴스 요금은 시간당 $0.124 입니다. \ No newline at end of file From ea796126efc59ef39a8a221c98bda31e50ba1114 Mon Sep 17 00:00:00 2001 From: osx_gnujoow Date: Thu, 28 Apr 2016 12:35:40 +0900 Subject: [PATCH 079/199] python-tutorial.md : draft --- ipython-tutorial.md | 39 +++++++++++---------------------------- 1 file changed, 11 insertions(+), 28 deletions(-) diff --git a/ipython-tutorial.md b/ipython-tutorial.md index 75692508..c48204ee 100644 --- a/ipython-tutorial.md +++ b/ipython-tutorial.md @@ -3,72 +3,55 @@ layout: page title: IPython Tutorial permalink: /ipython-tutorial/ --- +cs231s 수업에서는 프로그래밍 과제 진행을 위해 [IPython notebooks](http://ipython.org/)을 사용합니다. IPython notebook을 사용하면 여러분의 브라우저에서 Python코드를 작성하고 실행할 수 있습니다. Python notebook를 사용하면 여러조각의 코드를 아주 쉽게 수정하고 실행할 수 있습니다. 이런 장점 때문에 IPython notebook은 계산과학분야에서 널리 사용되고 있습니다. -In this class, we will use [IPython notebooks](http://ipython.org/) for the -programming assignments. An IPython notebook lets you write and execute Python -code in your web browser. IPython notebooks make it very easy to tinker with -code and execute it in bits and pieces; for this reason IPython notebooks are -widely used in scientific computing. - -Installing and running IPython is easy. From the command line, the following -will install IPython: +IPython의 설치와 실행은 간단합니다. command line에서 다음 명령어를 입력하여 IPython을 설치합니다. ~~~ pip install "ipython[notebook]" ~~~ -Once you have IPython installed, start it with this command: +IPython의 설치가 완료되면 다음 명령어를 통해 IPython을 실행합니다. ~~~ ipython notebook ~~~ -Once IPython is running, point your web browser at http://localhost:8888 to -start using IPython notebooks. If everything worked correctly, you should -see a screen like this, showing all available IPython notebooks in the current -directory: +IPython이 실행되면, IPyhton을 사용하기 위해 웹 브라우저를 실행하여 http://localhost:8888 에 접속합니다. 모든것이 잘 작동한다면 웹 브라우저에는 아래와 같은 화면이 나타납니다. 화면에는 현재 폴더에 사용가능한 Python notebook들이 나타납니다.
-If you click through to a notebook file, you will see a screen like this: +notebook 파일을 클릭하면 다음과 같은 화면이 나타납니다.
-An IPython notebook is made up of a number of **cells**. Each cell can contain -Python code. You can execute a cell by clicking on it and pressing `Shift-Enter`. -When you do so, the code in the cell will run, and the output of the cell -will be displayed beneath the cell. For example, after running the first cell -the notebook looks like this: +IPython notebook은 여러개의 **cell**들로 이루어져있습니다. 각각의 cell들은 Python코드를 포함하고 있습니다. `Shift-Enter`를 누르거나 셀을 클릭하여 셀을 실행할 수 있습니다. 셀의 코드를 실행하면 셀의 코드의 실행결과는 셀의 바로 아래에 나타납니다. 예를 들어 첫번째 cell의 코드를 실행하면 아래와 같은 화면이 나타납니다.
-Global variables are shared between cells. Executing the second cell thus gives -the following result: +전역변수들은 다른 셀들에게도 공유됩니다. 두번째 셀을 실행하면 다음과 같은 결과가 나옵니다.
-By convention, IPython notebooks are expected to be run from top to bottom. -Failing to execute some cells or executing cells out of order can result in -errors: +일반적으로, IPython notebook의 코드를 실행할 때 맨위에서 맨 아래 순서로 실행합니다. +몇몇 셀을 실행하는데 실패하거나 셀들을 순서대로 실행하지 않으면 오류가 발생할 수 있습니다.
-After you have modified an IPython notebook for one of the assignments by -modifying or executing some of its cells, remember to **save your changes!** +과제를 진행하면서 notebook의 cell을 수정하거나 실행하여 IPython notebook이 변경되었다면 **저장하는것을 잊지마세요.**
-This has only been a brief introduction to IPython notebooks, but it should -be enough to get you up and running on the assignments for this course. +지금 까지 IPyhton의 사용법에 대해서 알아보았습니다. 간략한 내용이지만 위 내용들을 잘 숙지하면 숙제를 진행하는데 어려움이 없을 겁니다. \ No newline at end of file From 79f7d28d5e1645c132b0d25b66dd8e28d1131e4f Mon Sep 17 00:00:00 2001 From: osx_gnujoow Date: Thu, 28 Apr 2016 12:36:34 +0900 Subject: [PATCH 080/199] no message --- ipython-tutorial.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ipython-tutorial.md b/ipython-tutorial.md index c48204ee..c1964bd2 100644 --- a/ipython-tutorial.md +++ b/ipython-tutorial.md @@ -54,4 +54,4 @@ IPython notebook은 여러개의 **cell**들로 이루어져있습니다. 각각
-지금 까지 IPyhton의 사용법에 대해서 알아보았습니다. 간략한 내용이지만 위 내용들을 잘 숙지하면 숙제를 진행하는데 어려움이 없을 겁니다. \ No newline at end of file +지금 까지 IPyhton의 사용법에 대해서 알아보았습니다. 간략한 내용이지만 위 내용들을 잘 숙지하면 무리없이과제를 진행할 수 있습니다. \ No newline at end of file From 0efdbd9df5bf712cf1700b55dc96c615798b3758 Mon Sep 17 00:00:00 2001 From: "KIM, WOOJUNG" Date: Thu, 28 Apr 2016 12:43:49 +0900 Subject: [PATCH 081/199] Update ipython-tutorial.md --- ipython-tutorial.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/ipython-tutorial.md b/ipython-tutorial.md index c1964bd2..f448b95b 100644 --- a/ipython-tutorial.md +++ b/ipython-tutorial.md @@ -48,10 +48,10 @@ IPython notebook은 여러개의 **cell**들로 이루어져있습니다. 각각
-과제를 진행하면서 notebook의 cell을 수정하거나 실행하여 IPython notebook이 변경되었다면 **저장하는것을 잊지마세요.** +과제를 진행하면서 notebook의 cell을 수정하거나 실행하여 IPython notebook이 변경되었다면 **저장하는 것을 잊지마세요.**
-지금 까지 IPyhton의 사용법에 대해서 알아보았습니다. 간략한 내용이지만 위 내용들을 잘 숙지하면 무리없이과제를 진행할 수 있습니다. \ No newline at end of file +지금 까지 IPyhton의 사용법에 대해서 알아보았습니다. 간략한 내용이지만 위 내용들을 잘 숙지하면 무리없이과제를 진행할 수 있습니다. From b81a1bab0864f2ef60e93a469c3626926e8f6174 Mon Sep 17 00:00:00 2001 From: JK Im Date: Fri, 29 Apr 2016 16:12:40 -0500 Subject: [PATCH 082/199] Update optimization-1.md --- optimization-1.md | 53 +++++++++++++++++++++++------------------------ 1 file changed, 26 insertions(+), 27 deletions(-) diff --git a/optimization-1.md b/optimization-1.md index 2b528e6e..ce91cf44 100644 --- a/optimization-1.md +++ b/optimization-1.md @@ -261,12 +261,12 @@ for step_size_log in [-10, -9, -8, -7, -6, -5,-4,-3,-2,-1]: **Update in negative gradient direction**. 위 코드에서, 새로운 모수 `W_new`로 업데이트할 때, 그라디언트(gradient) `df`의 반대방향으로 움직인 것을 주목하자. 왜냐하면 우리가 원하는 것은 손실함수(loss function)의 증가가 아니라 감소하는 것이기 때문이다. -**작은 변화값 `h` (step)의 크기가 미치는 영향**. 그라디언트(gradient)에서 알 수 있는 것은 함수값이 가장 빠르게 증가하는 방향이고, 그 방향으로 대체 얼만큼을 가야하는지는 알려주지 않는다. 강의 뒤에서 다루게 되겠지만, 얼만큼 가야하는지(*학습 속도*라고도 함), 즉, `h`값의 크기는 신경망(neural network)를 학습시킬 때 있어 가장 중요한 (그래서 결정하기 까다로운) 하이퍼파라미터(hyperparameter)가 될 것이다. 눈 가리고 하산하는 비유에서, 우리는 우리 발 밑으로 어느 방향이 가장 가파른지 느끼지만, 얼마나 발을 뻗어야할 지는 불확실하다. 발을 살살 휘져으면, 꾸준하지만 매우 조금씩밖에 못 내려갈 것이다. (이는 아주 작은 `h`값에 비견된다.) 반대로, 욕심껏 빨리 내려가려고 크고 과감하게 발을 내딛을 수도 있는데, 항상 뜻대로 되지는 않을지 모른다. 위의 제시된 코드에서와 같이, 어느 수준 이상의 큰 `h`값은 오히려 손실값을 증가시킨다. +**스텝 크기가 미치는 영향**. 그라디언트(gradient)에서 알 수 있는 것은 함수값이 가장 빠르게 증가하는 방향이고, 그 방향으로 대체 얼만큼을 가야하는지는 알려주지 않는다. 강의 뒤에서 다루게 되겠지만, 얼만큼 가야하는지를 의미하는 스텝 크기(혹은 *학습 속도*라고도 함)는 신경망(neural network)를 학습시킬 때 있어 가장 중요한 (그래서 결정하기 까다로운) 하이퍼파라미터(hyperparameter)가 될 것이다. 눈 가리고 하산하는 비유에서, 우리는 우리 발 밑으로 어느 방향이 가장 가파른지 느끼지만, 얼마나 발을 뻗어야할 지는 불확실하다. 발을 살살 휘져으면, 꾸준하지만 매우 조금씩밖에 못 내려갈 것이다. (이는 아주 작은 스텝 크기에 비견된다.) 반대로, 욕심껏 빨리 내려가려고 크고 과감하게 발을 내딛을 수도 있는데, 항상 뜻대로 되지는 않을지 모른다. 위의 제시된 코드에서와 같이, 어느 수준 이상의 큰 스켑 크기는 오히려 손실값을 증가시킨다.
- Visualizing the effect of step size. We start at some particular spot W and evaluate the gradient (or rather its negative - the white arrow) which tells us the direction of the steepest decrease in the loss function. Small steps are likely to lead to consistent but slow progress. Large steps can lead to better progress but are more risky. Note that eventually, for a large step size we will overshoot and make the loss worse. The step size (or as we will later call it - the learning rate) will become one of the most important hyperparameters that we will have to carefully tune. + 작은 변화값(step)이 주는 영향을 시각적으로 보여주는 그림. 특정 지검 W에서 시작해서 그라디언트(혹은 거기에 -1을 곱한 값)를 계산한다. 이 그라디언트에 -1을 곱한 방향, 즉 하얀 화살표 방향이 손실함수(loss function)이 가장 빠르게 감소하는 방향이다. 그 방향으로 조금 가는 것은 일관되지만 느리게 최적화를 진행시킨다. 반면에, 그 방향으로 너무 많이 가면, 더 많이 감소시키지만 위험성도 크다. 스텝 크기가 점점 커지면, 결국에는 최소값을 지나쳐서 손실값이 더 커지는 지점까지 가게될 것이다. 스텝 크기(나중에 학습속도라고 부를 것임) 가장 중요한 하이퍼파라미터(hyperparameter)이라서 매우 조심스럽게 결정해야 할 것이다.
@@ -275,50 +275,50 @@ for step_size_log in [-10, -9, -8, -7, -6, -5,-4,-3,-2,-1]: -#### Computing the gradient analytically with Calculus +#### 미적분을 이용하여 해석적으로 그라디언트(gradient)를 계산하기 -The numerical gradient is very simple to compute using the finite difference approximation, but the downside is that it is approximate (since we have to pick a small value of *h*, while the true gradient is defined as the limit as *h* goes to zero), and that it is very computationally expensive to compute. The second way to compute the gradient is analytically using Calculus, which allows us to derive a direct formula for the gradient (no approximations) that is also very fast to compute. However, unlike the numerical gradient it can be more error prone to implement, which is why in practice it is very common to compute the analytic gradient and compare it to the numerical gradient to check the correctnes of your implementation. This is called a **gradient check**. +수치적으로 계산하는 그라디언트(gradient)는 유한차이(finite difference)를 이용해서 매우 단순하지만, 단점은 근사값이라는 점과 (그라디언트의 진짜 정의는 "h"가 0으로 수렴할 때의 극한값인데, 여기서는 그냥 작은 "h"값을 쓰기 때문에), 계산이 비효율적이라는 것이다. 두번째 방법은 미적분을 이용해서 해석적으로 그라디언트(gradient)를 구하는 것인데, 이는 (근사치가 아닌) 정확한 수식을 이용하기 때문에 계산하기 매우 빠르다. 하지만, 수치적으로 구한 그라디언트(gradient)와는 다르게, 구현하는데 실수하기 쉽다. 그래서, 실제 응용할 때 해석적으로 구한 다음에 수치적으로 구한 것과 비교해보고, 틀린 경우 고치는 게 흔한 일이다. 이 과정을 **그라디언트체크(gradient check)**라고 한다.. -Lets use the example of the SVM loss function for a single datapoint: +SVM 손실함수(loss function)의 예를 들어서 설명해보자. $$ L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + \Delta) \right] $$ -We can differentiate the function with respect to the weights. For example, taking the gradient with respect to $w_{y_i}$ we obtain: +모수(parameter/weight)로 이 함수를 미분할 수 있다. 예를 들어, $w_{y_i}$로 미분하면 이렇게 된다: $$ \nabla_{w_{y_i}} L_i = - \left( \sum_{j\neq y_i} \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) \right) x_i $$ -where $\mathbb{1}$ is the indicator function that is one if the condition inside is true or zero otherwise. While the expression may look scary when it is written out, when you're implementing this in code you'd simply count the number of classes that didn't meet the desired margin (and hence contributed to the loss function) and then the data vector $x_i$ scaled by this number is the gradient. Notice that this is the gradient only with respect to the row of $W$ that corresponds to the correct class. For the other rows where $j \neq y_i $ the gradient is: +여기서 $\mathbb{1}$은 정의함수라고 하는데, 쉽게 말해 괄호 안의 조건이 충족되면 1, 아니면 0인 값을 갖는다. 이렇게 써놓으면 무시무시해보이지만, 실제로 코딩으로 구현할 때는 원하는 차이(마진, margin)을 못 만족시키는, 따라서 손실함수(loss function)의 증가에 일조하는 클래스의 개수를 세고, 이 숫자를 데이터벡터 $x_i$에 곱하면 이게 바로 그라디언트(gradient)이다. 단, 이는 참인 클래스에 해당하는 $W$의 행으로 미분했을 때의 그라디언트이다. $j \neq y_i $인 다른 행에 대한 그라디언트(gradient)는 다음과 같다. $$ \nabla_{w_j} L_i = \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) x_i $$ -Once you derive the expression for the gradient it is straight-forward to implement the expressions and use them to perform the gradient update. +일단 그라디언트(gradient) 수식을 구하고 그라디언트(gradient)를 업데이트시키는 것은 간단하다. -### Gradient Descent +### 그라디언트 하강 (gradient descent) -Now that we can compute the gradient of the loss function, the procedure of repeatedly evaluating the gradient and then performing a parameter update is called *Gradient Descent*. 
Its **vanilla** version looks as follows: +이제 손실함수(loss function)의 그라디언트(gradient)를 계산할 줄 알게 됐는데, 그라디언트(gradient)를 계속해서 계산하고 모수(weight/parameter)를 Now that we can compute the gradient of the loss function, the procedure of repeatedly evaluating the gradient and then performing a parameter update is called *Gradient Descent*. Its **vanilla** version looks as follows: ~~~python -# Vanilla Gradient Descent +# 단순한 경사하강(gradient descent) while True: weights_grad = evaluate_gradient(loss_fun, data, weights) weights += - step_size * weights_grad # perform parameter update ~~~ -This simple loop is at the core of all Neural Network libraries. There are other ways of performing the optimization (e.g. LBFGS), but Gradient Descent is currently by far the most common and established way of optimizing Neural Network loss functions. Throughout the class we will put some bells and whistles on the details of this loop (e.g. the exact details of the update equation), but the core idea of following the gradient until we're happy with the results will remain the same. +이 단순한 루프는 모든 신경망(neural network)의 중심에 있는 것이다. 다른 방법으로 (예컨데. LBFGS) 최적화를 할 수 있는 방법이 있긴 하지만, 현재로는 그라디언트 하강 (gradient descent)이 신경망(neural network)의 손실함수(loss function)을 최적화하는 것으로는 가장 많이 쓰인다. 이 강의에서, 이 루프에 이것저것 세세하게 덧붙이기(예를 들어, 업데이트 수식이 정확히 어떻게 되는지 등)는 할 것이다. 하지만, 결과에 만족할 때까지 그라디언트(gradient)를 따라서 움직인다는 기본적인 개념은 안 바뀔 것이다. -**Mini-batch gradient descent.** In large-scale applications (such as the ILSVRC challenge), the training data can have on order of millions of examples. Hence, it seems wasteful to compute the full loss function over the entire training set in order to perform only a single parameter update. A very common approach to addressing this challenge is to compute the gradient over **batches** of the training data. For example, in current state of the art ConvNets, a typical batch contains 256 examples from the entire training set of 1.2 million. This batch is then used to perform a parameter update: +**Mini-batch gradient descent (MGD).** (ILSVRC challenge처럼) 대규모의 응용사례에서, 학습데이터(training data)가 수백만개 주어질 수 있다. 따라서, 모수를 한 번 업데이트하려고 학습데이터(training data) 전체를 계산에 사용하는 것은 낭비가 될 수 있다. 이를 극복하기 위해서 흔하게 쓰이는 방법으로는, 학습데이터(training data)의 **소집합(batches)**만 이용해서 그라디언트(gradient)를 구하는 것이다. 예를 들어 ConvNets을 쓸 때, 한 번에 120만개 중에 256개만을 이용해서 그라디언트(gradient)를 구하고 모수(parameter/weight) 업데이트를 한다. 다음 코드를 보자. ~~~python -# Vanilla Minibatch Gradient Descent +# 단순한 소집합(minibatch) 그라디언트(gradient) 업데이트 while True: data_batch = sample_training_data(data, 256) # sample 256 examples @@ -326,30 +326,29 @@ while True: weights += - step_size * weights_grad # perform parameter update ~~~ -The reason this works well is that the examples in the training data are correlated. To see this, consider the extreme case where all 1.2 million images in ILSVRC are in fact made up of exact duplicates of only 1000 unique images (one for each class, or in other words 1200 identical copies of each image). Then it is clear that the gradients we would compute for all 1200 identical copies would all be the same, and when we average the data loss over all 1.2 million images we would get the exact same loss as if we only evaluated on a small subset of 1000. In practice of course, the dataset would not contain duplicate images, the gradient from a mini-batch is a good approximation of the gradient of the full objective. Therefore, much faster convergence can be achieved in practice by evaluating the mini-batch gradients to perform more frequent parameter updates. 
+이 방법이 먹히는 이유는 학습데이터들의 예시들이 서로 상관관계가 있기 때문이다. 이것에 대해 알아보기위해, ILSVRC의 120만개 이미지들이 사실은 1천개의 서로 다른 이미지들의 복사본이라는 극단적인 경우를 생각해보자. (즉, 한 클래스 당 하나이고, 이 하나가 1천2백번 복사된 것) 그러면 명백한 것은, 이 1천2백개의 이미지에서의 그라디언트(gradient)값 (역자 주: 이 1천2백개에 해당하는 $i$에 대한 $L_i$값)은 다 똑같다는 점이다. 그렇다면 이 1천2백개씩 똑같은 값들 120만개를 평균내서 손실값(loss)를 구하는 것이나, 서로 다른 1천개의 이미지당 하나씩 1000개의 값을 평균내서 손실값(loss)을 구하는 것이나 똑같다. 실제로는 당연히 중복된 데이터를 주지는 않겠지만, 소집합에서 계산하는 그라디언트(gradient)가 모든 데이터를 써서 구하는 것의 근사값으로 괜찮게 쓰일 수 있을 것이다. 따라서, 소집합에서 그라디언트(gradient)를 구해서 더 자주자주 모수(parameter/weight)을 업데이트하면 실제로 더 빠른 수렴하게 된다. -The extreme case of this is a setting where the mini-batch contains only a single example. This process is called **Stochastic Gradient Descent (SGD)** (or also sometimes **on-line** gradient descent). This is relatively less common to see because in practice due to vectorized code optimizations it can be computationally much more efficient to evaluate the gradient for 100 examples, than the gradient for one example 100 times. Even though SGD technically refers to using a single example at a time to evaluate the gradient, you will hear people use the term SGD even when referring to mini-batch gradient descent (i.e. mentions of MGD for "Minibatch Gradient Descent", or BGD for "Batch gradient descent" are rare to see), where it is usually assumed that mini-batches are used. The size of the mini-batch is a hyperparameter but it is not very common to cross-validate it. It is usually based on memory constraints (if any), or set to some value, e.g. 32, 64 or 128. We use powers of 2 in practice because many vectorized operation implementations work faster when their inputs are sized in powers of 2. +이 방법의 극단적인 형태는 소집합이 데이터 달랑 한개로 이루어졌을 때이다. 이는 **확률그라디언트하강(Stochastic Gradient Descent (SGD))** (혹은 **온라인** 그라디언트 하강)이라고 불린다. 이건 상대적으로 덜 보편적인데, 그 이유는 우리가 프로그램을 짤 때 계산을 벡터/행렬로 만들어서 하기 때문에, 한 예제에서 100번 계산하는 것보다 100개의 예제에서 1번 계산하는게 더 빠르기 때문이다. SGD가 엄밀한 의미에서는 예제 하나짜리 소집합에서 그라디언트(gradient)를 계산하는 것이지만, 많은 사람들이 그냥 MGD를 의미하면서 SGD라고 부르기도 한다. 혹은 드물게나마 집합 그라디언트 하강 (Batch gradient descent, BGD)이라고도 부른다. 소집합(minibatch)의 크기도 하이퍼파라미터(hyperparameter)이지만, 이것을 교차검증하는 일은 흔치 않다. 이건 대체로 컴퓨터 메모리 크기의 한계에 따라 결정되거나, 몇몇 특정값 (예를 들어, 32, 64 or 128 같은 것)을 이용한다. 2의 제곱수를 이용하는 이유는 많은 벡터 계산이 2의 제곱수가 입력될 때 더 빠르기 때문이다. -### Summary +### 요약
- Summary of the information flow. The dataset of pairs of (x,y) is given and fixed. The weights start out as random numbers and can change. During the forward pass the score function computes class scores, stored in vector f. The loss function contains two components: The data loss computes the compatibility between the scores f and the labels y. The regularization loss is only a function of the weights. During Gradient Descent, we compute the gradient on the weights (and optionally on data if we wish) and use them to perform a parameter update during Gradient Descent. + 정보 흐름 요약. (x,y)라는 고정된 데이터 쌍이 주어져 있다. 처음에는 무작위로 뽑은 모수(parameter/weight)값으로 시작해서 바꿔나간다. 왼쪽에서 오른쪽으로 가면서, 점수함수(score function)가 각 클래스의 점수를 계산하고 그 값이 f 벡터에 저장된다. 손실함수(loss function)는 두 부분으로 나뉘어 있다. 첫째, 데이터 손실(data loss)은 모수(parameter/weight)만으로 계산하는 함수이다. 그라디언트 하강(Gradient Descent) 과정에서, 모수(parameter/weight)로 미분한 (혹은 원한다면 데이터 값으로 추가로 미분한... ??? 역자 주: 이 괄호안의 내용은 무슨 소린지 모르겠음.) 그라디언트(gradient)를 계산하고, 이것을 이용해서 모수(parameter/weight)값을 업데이트한다.
-In this section, +이 섹션에서 다음을 다루었다. -- We developed the intuition of the loss function as a **high-dimensional optimization landscape** in which we are trying to reach the bottom. The working analogy we developed was that of a blindfolded hiker who wishes to reach the bottom. In particular, we saw that the SVM cost function is piece-wise linear and bowl-shaped. -- We motivated the idea of optimizing the loss function with -**iterative refinement**, where we start with a random set of weights and refine them step by step until the loss is minimized. -- We saw that the **gradient** of a function gives the steepest ascent direction and we discussed a simple but inefficient way of computing it numerically using the finite difference approximation (the finite difference being the value of *h* used in computing the numerical gradient). -- We saw that the parameter update requires a tricky setting of the **step size** (or the **learning rate**) that must be set just right: if it is too low the progress is steady but slow. If it is too high the progress can be faster, but more risky. We will explore this tradeoff in much more detail in future sections. -- We discussed the tradeoffs between computing the **numerical** and **analytic** gradient. The numerical gradient is simple but it is approximate and expensive to compute. The analytic gradient is exact, fast to compute but more error-prone since it requires the derivation of the gradient with math. Hence, in practice we always use the analytic gradient and then perform a **gradient check**, in which its implementation is compared to the numerical gradient. -- We introduced the **Gradient Descent** algorithm which iteratively computes the gradient and performs a parameter update in loop. +- 손실함수(loss function)가 **고차원의 울퉁불퉁한 지형**이고, 이 지형에서 아래쪽으로 내려가는 것으로 직관적인 설명을 발전시켰다. 이에 대한 비유는 눈가린 등산객이 하산하는 것이었다. 특히, SVM의 손실함수(loss function)가 부분적으로 선형(linear)인 밥공기 모양이라는 것을 확인했따. +- 손실함수(loss function)을 최적화시킨다는 개념을, 아무 데서나 시작해서 더 나아지는 쪽으로 한걸음 한걸음 나은 쪽으로 가서 최적화시킨다는 **반복적으로 개선**의 측면으로 운을 띄워봤고 +- 함수의 **그라디언트(gradient)**는 그 함수값이 감소하는 가장 빠른 방향이라는 점을 알아봤고, 이것을 유한차이(finite difference, 즉 미분할 때 *h*의 값이 유한하다는 의미)를 이용하여 단순무식하게 수치적으로 어림잡아 계산하는 방법도 알아보았다. +- 모수(parameter/weight)를 업데이트할 때, 한 번에 얼마나 움직여야하는지(혹은 **학습속도**)를 딱 맞게 설정하는 것이 까다로운 문제라는 것도 알아보았다. 이 값이 너무 낮으면 너무 느려지고, 너무 높으면 빨라지지만 위험한 점이 있다. 이 장단점에 대해 다음 섹션에서 자세하게 알아볼 것이다. +- 그라디언트(gradient)를 계산할 때 **수치적**인 방법과 **해석적**인 방법의 장단점을 알아보았다. 수치적인 그라디언트(gradient)는 단순하지만, 근사값이고 비효율적이다. 해석적인 그라디언트(gradient)는 정확하고 빠르지만 손으로 계산해야 되서 실수를 할 수 있다. 따라서 실제 응용에서는 해석적인 그라디언트(gradient)을 쓰고, **그라디언트 체크(gradient check)**라는 수치적인 그라디언트(gradient)와 비교하는 과정을 거친다. +- 반복적으로 뺑뺑이 돌려서 그라디언트(gradient)를 계산하고 모수(parameter/weight)를 업데이트하는 **그라디언트 하강 (Gradient Descent)** 알고리즘을 소개했다. -**Coming up:** The core takeaway from this section is that the ability to compute the gradient of a loss function with respect to its weights (and have some intuitive understanding of it) is the most important skill needed to design, train and understand neural networks. In the next section we will develop proficiency in computing the gradient analytically using the chain rule, otherwise also refered to as **backpropagation**. This will allow us to efficiently optimize relatively arbitrary loss functions that express all kinds of Neural Networks, including Convolutional Neural Networks. +**예고:** 이 섹션에서 핵심은, 손실함수(loss function)를 모수(parameter/weight)로 미분하여 그라디언트(gradient)를 계산하는 법(과 그에 대한 직관적인 이해)가 신경망(neural network)를 디자인하고 학습시키고 이해하는데 있어 가장 중요한 기술이라는 점이다. 
다음 섹션에서는, 그라디언(gradient)를 해석적으로 구할 때 연쇄법칙을 이용한, **backpropagation**이라고도 불리는 효율적인 방법에 대해 알아보겠다. 이 방법을 쓰면 컨볼루션 신경망 (Convolutional Neural Networks)을 포함한 모든 종류의 신경망(Neural Networks)에서 쓰이는 상대적으로 일반적인 손실함수(loss function)를 효율적으로 최적화시킬 수 있다. From f74046015f37cd4b32b4a80ae8ce0343e2161023 Mon Sep 17 00:00:00 2001 From: JK Im Date: Fri, 29 Apr 2016 16:15:34 -0500 Subject: [PATCH 083/199] Update optimization-1.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit optimization-1.md 끝까지 번역해봤습니다. 지난 번 PR한거(아직 승인 안 하신 듯)에서 큰 오역(졸았나...)을 발견해서 그것도 수정했습니다. --- optimization-1.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/optimization-1.md b/optimization-1.md index ce91cf44..709e3293 100644 --- a/optimization-1.md +++ b/optimization-1.md @@ -348,7 +348,7 @@ while True: - 손실함수(loss function)을 최적화시킨다는 개념을, 아무 데서나 시작해서 더 나아지는 쪽으로 한걸음 한걸음 나은 쪽으로 가서 최적화시킨다는 **반복적으로 개선**의 측면으로 운을 띄워봤고 - 함수의 **그라디언트(gradient)**는 그 함수값이 감소하는 가장 빠른 방향이라는 점을 알아봤고, 이것을 유한차이(finite difference, 즉 미분할 때 *h*의 값이 유한하다는 의미)를 이용하여 단순무식하게 수치적으로 어림잡아 계산하는 방법도 알아보았다. - 모수(parameter/weight)를 업데이트할 때, 한 번에 얼마나 움직여야하는지(혹은 **학습속도**)를 딱 맞게 설정하는 것이 까다로운 문제라는 것도 알아보았다. 이 값이 너무 낮으면 너무 느려지고, 너무 높으면 빨라지지만 위험한 점이 있다. 이 장단점에 대해 다음 섹션에서 자세하게 알아볼 것이다. -- 그라디언트(gradient)를 계산할 때 **수치적**인 방법과 **해석적**인 방법의 장단점을 알아보았다. 수치적인 그라디언트(gradient)는 단순하지만, 근사값이고 비효율적이다. 해석적인 그라디언트(gradient)는 정확하고 빠르지만 손으로 계산해야 되서 실수를 할 수 있다. 따라서 실제 응용에서는 해석적인 그라디언트(gradient)을 쓰고, **그라디언트 체크(gradient check)**라는 수치적인 그라디언트(gradient)와 비교하는 과정을 거친다. +- 그라디언트(gradient)를 계산할 때 **수치적**인 방법과 **해석적**인 방법의 장단점을 알아보았다. 수치적인 그라디언트(gradient)는 단순하지만, 근사값이고 비효율적이다. 해석적인 그라디언트(gradient)는 정확하고 빠르지만 손으로 계산해야 되서 실수를 할 수 있다. 따라서 실제 응용에서는 해석적인 그라디언트(gradient)을 쓰고, **그라디언트 체크(gradient check)**라는 수치적인 그라디언트(gradient)와 비교/검증하는 과정을 거친다. - 반복적으로 뺑뺑이 돌려서 그라디언트(gradient)를 계산하고 모수(parameter/weight)를 업데이트하는 **그라디언트 하강 (Gradient Descent)** 알고리즘을 소개했다. **예고:** 이 섹션에서 핵심은, 손실함수(loss function)를 모수(parameter/weight)로 미분하여 그라디언트(gradient)를 계산하는 법(과 그에 대한 직관적인 이해)가 신경망(neural network)를 디자인하고 학습시키고 이해하는데 있어 가장 중요한 기술이라는 점이다. 다음 섹션에서는, 그라디언(gradient)를 해석적으로 구할 때 연쇄법칙을 이용한, **backpropagation**이라고도 불리는 효율적인 방법에 대해 알아보겠다. 이 방법을 쓰면 컨볼루션 신경망 (Convolutional Neural Networks)을 포함한 모든 종류의 신경망(Neural Networks)에서 쓰이는 상대적으로 일반적인 손실함수(loss function)를 효율적으로 최적화시킬 수 있다. From 0e527e43bea5a92e66817047aa66aeb2f2c014c3 Mon Sep 17 00:00:00 2001 From: JK Im Date: Fri, 29 Apr 2016 16:35:51 -0500 Subject: [PATCH 084/199] Update optimization-1.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Glossary에서 Batch가 배치로 되어 있는 것을 뒤늦게 발견해서 업데이트했습니다. --- optimization-1.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/optimization-1.md b/optimization-1.md index 709e3293..5b158932 100644 --- a/optimization-1.md +++ b/optimization-1.md @@ -314,11 +314,11 @@ while True: ~~~ 이 단순한 루프는 모든 신경망(neural network)의 중심에 있는 것이다. 다른 방법으로 (예컨데. LBFGS) 최적화를 할 수 있는 방법이 있긴 하지만, 현재로는 그라디언트 하강 (gradient descent)이 신경망(neural network)의 손실함수(loss function)을 최적화하는 것으로는 가장 많이 쓰인다. 이 강의에서, 이 루프에 이것저것 세세하게 덧붙이기(예를 들어, 업데이트 수식이 정확히 어떻게 되는지 등)는 할 것이다. 하지만, 결과에 만족할 때까지 그라디언트(gradient)를 따라서 움직인다는 기본적인 개념은 안 바뀔 것이다. - -**Mini-batch gradient descent (MGD).** (ILSVRC challenge처럼) 대규모의 응용사례에서, 학습데이터(training data)가 수백만개 주어질 수 있다. 따라서, 모수를 한 번 업데이트하려고 학습데이터(training data) 전체를 계산에 사용하는 것은 낭비가 될 수 있다. 이를 극복하기 위해서 흔하게 쓰이는 방법으로는, 학습데이터(training data)의 **소집합(batches)**만 이용해서 그라디언트(gradient)를 구하는 것이다. 
예를 들어 ConvNets을 쓸 때, 한 번에 120만개 중에 256개만을 이용해서 그라디언트(gradient)를 구하고 모수(parameter/weight) 업데이트를 한다. 다음 코드를 보자. +bat +**미니배치 그라디언트 하강 (Mini-batch gradient descent (MGD)).** (ILSVRC challenge처럼) 대규모의 응용사례에서, 학습데이터(training data)가 수백만개 주어질 수 있다. 따라서, 모수를 한 번 업데이트하려고 학습데이터(training data) 전체를 계산에 사용하는 것은 낭비가 될 수 있다. 이를 극복하기 위해서 흔하게 쓰이는 방법으로는, 학습데이터(training data)의 **배치(batches)**만 이용해서 그라디언트(gradient)를 구하는 것이다. 예를 들어 ConvNets을 쓸 때, 한 번에 120만개 중에 256개짜리 배치만을 이용해서 그라디언트(gradient)를 구하고 모수(parameter/weight) 업데이트를 한다. 다음 코드를 보자. ~~~python -# 단순한 소집합(minibatch) 그라디언트(gradient) 업데이트 +# 단순한 미니배치 (minibatch) 그라디언트(gradient) 업데이트 while True: data_batch = sample_training_data(data, 256) # sample 256 examples @@ -326,9 +326,9 @@ while True: weights += - step_size * weights_grad # perform parameter update ~~~ -이 방법이 먹히는 이유는 학습데이터들의 예시들이 서로 상관관계가 있기 때문이다. 이것에 대해 알아보기위해, ILSVRC의 120만개 이미지들이 사실은 1천개의 서로 다른 이미지들의 복사본이라는 극단적인 경우를 생각해보자. (즉, 한 클래스 당 하나이고, 이 하나가 1천2백번 복사된 것) 그러면 명백한 것은, 이 1천2백개의 이미지에서의 그라디언트(gradient)값 (역자 주: 이 1천2백개에 해당하는 $i$에 대한 $L_i$값)은 다 똑같다는 점이다. 그렇다면 이 1천2백개씩 똑같은 값들 120만개를 평균내서 손실값(loss)를 구하는 것이나, 서로 다른 1천개의 이미지당 하나씩 1000개의 값을 평균내서 손실값(loss)을 구하는 것이나 똑같다. 실제로는 당연히 중복된 데이터를 주지는 않겠지만, 소집합에서 계산하는 그라디언트(gradient)가 모든 데이터를 써서 구하는 것의 근사값으로 괜찮게 쓰일 수 있을 것이다. 따라서, 소집합에서 그라디언트(gradient)를 구해서 더 자주자주 모수(parameter/weight)을 업데이트하면 실제로 더 빠른 수렴하게 된다. +이 방법이 먹히는 이유는 학습데이터들의 예시들이 서로 상관관계가 있기 때문이다. 이것에 대해 알아보기위해, ILSVRC의 120만개 이미지들이 사실은 1천개의 서로 다른 이미지들의 복사본이라는 극단적인 경우를 생각해보자. (즉, 한 클래스 당 하나이고, 이 하나가 1천2백번 복사된 것) 그러면 명백한 것은, 이 1천2백개의 이미지에서의 그라디언트(gradient)값 (역자 주: 이 1천2백개에 해당하는 $i$에 대한 $L_i$값)은 다 똑같다는 점이다. 그렇다면 이 1천2백개씩 똑같은 값들 120만개를 평균내서 손실값(loss)를 구하는 것이나, 서로 다른 1천개의 이미지당 하나씩 1000개의 값을 평균내서 손실값(loss)을 구하는 것이나 똑같다. 실제로는 당연히 중복된 데이터를 주지는 않겠지만, 미니배치(mini-batch)에서만 계산하는 그라디언트(gradient)는 모든 데이터를 써서 구하는 것의 근사값으로 괜찮게 쓰일 수 있을 것이다. 따라서, 미니배치(mini-batch)에서 그라디언트(gradient)를 구해서 더 자주자주 모수(parameter/weight)을 업데이트하면 실제로 더 빠른 수렴하게 된다. -이 방법의 극단적인 형태는 소집합이 데이터 달랑 한개로 이루어졌을 때이다. 이는 **확률그라디언트하강(Stochastic Gradient Descent (SGD))** (혹은 **온라인** 그라디언트 하강)이라고 불린다. 이건 상대적으로 덜 보편적인데, 그 이유는 우리가 프로그램을 짤 때 계산을 벡터/행렬로 만들어서 하기 때문에, 한 예제에서 100번 계산하는 것보다 100개의 예제에서 1번 계산하는게 더 빠르기 때문이다. SGD가 엄밀한 의미에서는 예제 하나짜리 소집합에서 그라디언트(gradient)를 계산하는 것이지만, 많은 사람들이 그냥 MGD를 의미하면서 SGD라고 부르기도 한다. 혹은 드물게나마 집합 그라디언트 하강 (Batch gradient descent, BGD)이라고도 부른다. 소집합(minibatch)의 크기도 하이퍼파라미터(hyperparameter)이지만, 이것을 교차검증하는 일은 흔치 않다. 이건 대체로 컴퓨터 메모리 크기의 한계에 따라 결정되거나, 몇몇 특정값 (예를 들어, 32, 64 or 128 같은 것)을 이용한다. 2의 제곱수를 이용하는 이유는 많은 벡터 계산이 2의 제곱수가 입력될 때 더 빠르기 때문이다. +이 방법의 극단적인 형태는 미니배치(mini-batch)가 데이터 달랑 한개로 이루어졌을 때이다. 이는 **확률그라디언트하강(Stochastic Gradient Descent (SGD))** (혹은 **온라인** 그라디언트 하강)이라고 불린다. 이건 상대적으로 덜 보편적인데, 그 이유는 우리가 프로그램을 짤 때 계산을 벡터/행렬로 만들어서 하기 때문에, 한 예제에서 100번 계산하는 것보다 100개의 예제에서 1번 계산하는게 더 빠르기 때문이다. SGD가 엄밀한 의미에서는 예제 하나짜리 미니배치(mini-batch)에서 그라디언트(gradient)를 계산하는 것이지만, 많은 사람들이 그냥 MGD를 의미하면서 SGD라고 부르기도 한다. 혹은 드물게나마 배치 그라디언트 하강 (Batch gradient descent, BGD)이라고도 부른다. 미니배치(mini-batch)의 크기도 하이퍼파라미터(hyperparameter)이지만, 이것을 교차검증하는 일은 흔치 않다. 이건 대체로 컴퓨터 메모리 크기의 한계에 따라 결정되거나, 몇몇 특정값 (예를 들어, 32, 64 or 128 같은 것)을 이용한다. 2의 제곱수를 이용하는 이유는 많은 벡터 계산이 2의 제곱수가 입력될 때 더 빠르기 때문이다. 
From fadb593cba48fcd1928c8f8ab593c05d4e313ab4 Mon Sep 17 00:00:00 2001 From: MaybeS Date: Sat, 30 Apr 2016 21:46:16 +0900 Subject: [PATCH 085/199] Update assignment1/knn.ipynb --- assignments2016/assignment1/knn.ipynb | 163 +++++++++++++------------- 1 file changed, 81 insertions(+), 82 deletions(-) diff --git a/assignments2016/assignment1/knn.ipynb b/assignments2016/assignment1/knn.ipynb index 7ed1b7b4..eb90188b 100644 --- a/assignments2016/assignment1/knn.ipynb +++ b/assignments2016/assignment1/knn.ipynb @@ -4,17 +4,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# k-Nearest Neighbor (kNN) exercise\n", + "# K-Nearest Neighbor (kNN) 연습\n", "\n", - "*Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. For more details see the [assignments page](http://vision.stanford.edu/teaching/cs231n/assignments.html) on the course website.*\n", + "*이 워크시트를 완성하고 제출하세요. (출력물과 워크시트에 포함되지 않은 코드들을 포함해서) 더 자세한 정보는 코스 웹사이트인 [숙제 페이지](http://vision.stanford.edu/teaching/cs231n/assignments.html)에서 볼 수 있습니다.*\n", "\n", - "The kNN classifier consists of two stages:\n", + "kNN 분류기는 다음 두 단계로 구성됩니다.\n", "\n", - "- During training, the classifier takes the training data and simply remembers it\n", - "- During testing, kNN classifies every test image by comparing to all training images and transfering the labels of the k most similar training examples\n", - "- The value of k is cross-validated\n", + "- 학습중에, 분류기는 데이터를 학습하고 그것을 기억합니다.\n", + "- 테스트중에, KNN은 모든 이미지를 훈련된 이미지와 k 번째 레이블을 전송하는 가장 유사한 훈련 예와 비교합니다.\n", + "- k의 값은 교차 검증되었습니다.\n", "\n", - "In this exercise you will implement these steps and understand the basic Image Classification pipeline, cross-validation, and gain proficiency in writing efficient, vectorized code." + "이번 연습에서 우리는 이러한 단계들을 수행하고 \n", + "간단한 이미지 분류기 pipeline, 교차검증을 이해하고, 효율적인 벡터화된 코드를 작성하는 방법을 알아봅니다." ] }, { @@ -25,22 +26,21 @@ }, "outputs": [], "source": [ - "# Run some setup code for this notebook.\n", + "# 이 notebook을 위해 몇가지 설치 코드를 실행하세요.\n", "\n", "import random\n", "import numpy as np\n", "from cs231n.data_utils import load_CIFAR10\n", "import matplotlib.pyplot as plt\n", "\n", - "# This is a bit of magic to make matplotlib figures appear inline in the notebook\n", - "# rather than in a new window.\n", + "# matplotlib figure들을 새 창에서 뛰우지 않고 이 notebook에서 하기 위한 약간의 마법입니다.\n", "%matplotlib inline\n", "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", "plt.rcParams['image.interpolation'] = 'nearest'\n", "plt.rcParams['image.cmap'] = 'gray'\n", "\n", - "# Some more magic so that the notebook will reload external python modules;\n", - "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "# 이 notebook이 외부 파이썬 모듈을 재호출하기위한 코드입니다.\n", + "# 다음 링크를 확인해 보세요. 
http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", "%load_ext autoreload\n", "%autoreload 2" ] @@ -53,11 +53,11 @@ }, "outputs": [], "source": [ - "# Load the raw CIFAR-10 data.\n", + "# CIFAR-10 데이터를 불러옵니다.\n", "cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'\n", "X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)\n", "\n", - "# As a sanity check, we print out the size of the training and test data.\n", + "# sanity 체크로서 학습 데이터와 테스트 데이터의 크기를 출력합니다.\n", "print 'Training data shape: ', X_train.shape\n", "print 'Training labels shape: ', y_train.shape\n", "print 'Test data shape: ', X_test.shape\n", @@ -72,8 +72,8 @@ }, "outputs": [], "source": [ - "# Visualize some examples from the dataset.\n", - "# We show a few examples of training images from each class.\n", + "# 데이터셋에서 몇 가지 예제를 시각화 합니다.\n", + "# 각 class마다 약간의 학습 이미지를 보여줍니다.\n", "classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n", "num_classes = len(classes)\n", "samples_per_class = 7\n", @@ -98,7 +98,7 @@ }, "outputs": [], "source": [ - "# Subsample the data for more efficient code execution in this exercise\n", + "# 이 연습에 더 효율적인 코드 실행을 위한 데이터를 표본\n", "num_training = 5000\n", "mask = range(num_training)\n", "X_train = X_train[mask]\n", @@ -118,7 +118,7 @@ }, "outputs": [], "source": [ - "# Reshape the image data into rows\n", + "# 이미지 데이터를 행으로 변형시킵니다.\n", "X_train = np.reshape(X_train, (X_train.shape[0], -1))\n", "X_test = np.reshape(X_test, (X_test.shape[0], -1))\n", "print X_train.shape, X_test.shape" @@ -134,9 +134,9 @@ "source": [ "from cs231n.classifiers import KNearestNeighbor\n", "\n", - "# Create a kNN classifier instance. \n", - "# Remember that training a kNN classifier is a noop: \n", - "# the Classifier simply remembers the data and does no further processing \n", + "# kNN 분류기를 생성합니다.\n", + "# kNN분류기를 학습시킬때 분류기는 단순히 데이터를 기억하고\n", + "# 더 이상의 처리를 하지 않는다는것을 기억하세요.\n", "classifier = KNearestNeighbor()\n", "classifier.train(X_train, y_train)" ] @@ -145,14 +145,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We would now like to classify the test data with the kNN classifier. Recall that we can break down this process into two steps: \n", + "이제 테스트데이터를 kNN 분류기로 분류해볼껍니다.\n", + "이 과정을 두 단계로 분류할 수 있습니다:\n", "\n", - "1. First we must compute the distances between all test examples and all train examples. \n", - "2. Given these distances, for each test example we find the k nearest examples and have them vote for the label\n", + "1. 먼저 모든 테스트 예제와 모든 훈련 예제 사이의 거리를 계산해야 합니다.\n", + "2. Given these distances, for each test example \n", + "we find the k nearest examples and have them vote \n", + "for the label\n", "\n", - "Lets begin with computing the distance matrix between all training and test examples. For example, if there are **Ntr** training examples and **Nte** test examples, this stage should result in a **Nte x Ntr** matrix where each element (i,j) is the distance between the i-th test and j-th train example.\n", + "모든 테스트 예제와 학습 예제 사이의 거리 행렬을 계산하는 것 부터 시작해 봅시다. **Ntr** 학습 예제와 **Nte** 테스트 예제가 있을 때, 각 (i, j) 요소가 i번째 테스트와 j번째 훈련 예제의 거리를 나타내는 **Nte x Ntr** 행렬을 결과로 얻을 수 있습니다.\n", "\n", - "First, open `cs231n/classifiers/k_nearest_neighbor.py` and implement the function `compute_distances_two_loops` that uses a (very inefficient) double loop over all pairs of (test, train) examples and computes the distance matrix one element at a time." 
+ "\n", + "먼저 `cs231n/classifiers/k_nearest_neighbor.py`를 열고 각 (테스트, 학습) 예제를 계산하는데 (매우 비효율적인) 이중 반복문을 사용한 `compute_distances_two_loops`를 구현해 보세요." ] }, { @@ -163,10 +167,10 @@ }, "outputs": [], "source": [ - "# Open cs231n/classifiers/k_nearest_neighbor.py and implement\n", - "# compute_distances_two_loops.\n", + "# cs231n/classifiers/k_nearest_neighbor.py를 열고\n", + "# compute_distances_two_loops를 구현해 보세요.\n", "\n", - "# Test your implementation:\n", + "# 구현을 테스트해보세요.\n", "dists = classifier.compute_distances_two_loops(X_test)\n", "print dists.shape" ] @@ -179,8 +183,7 @@ }, "outputs": [], "source": [ - "# We can visualize the distance matrix: each row is a single test example and\n", - "# its distances to training examples\n", + "# 거리 행렬을 시각화 할 수 있습니다: 각 행은 하나의 시험 예제와 훈련 예제의 거리\n", "plt.imshow(dists, interpolation='none')\n", "plt.show()" ] @@ -189,17 +192,17 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "**Inline Question #1:** Notice the structured patterns in the distance matrix, where some rows or columns are visible brighter. (Note that with the default color scheme black indicates low distances while white indicates high distances.)\n", + "**연습문제 #1** 일부 행, 열이 더 밝게 가시화 된 거리 행렬의 구조화된 패턴에 주목하세요. (기본 색상에서 검은 색은 낮은 간격을 나타내는 반면, 흰색은 높은 간격을 나타내는 것에 주목하세요.)\n", "\n", - "- What in the data is the cause behind the distinctly bright rows?\n", - "- What causes the columns?" + "- 뚜렷하게 밝은 행의 데이터가 그렇게 표시된 원인은 무엇일까요?\n", + "- 열은 어떤 원인 때문에 저럴까요?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "**Your Answer**: *fill this in.*\n", + "**당신의 답**: *여기에 쓰세요*\n", "\n" ] }, @@ -211,11 +214,11 @@ }, "outputs": [], "source": [ - "# Now implement the function predict_labels and run the code below:\n", - "# We use k = 1 (which is Nearest Neighbor).\n", + "# 이제 predict_labels를 구현해보고 아래의 코드를 실행해 보세요.\n", + "# k = 1 을 사용합니다.(가장 가까운 이웃으로)\n", "y_test_pred = classifier.predict_labels(dists, k=1)\n", "\n", - "# Compute and print the fraction of correctly predicted examples\n", + "# 예측 예제의 정확도를 계산하고 출력하세요.\n", "num_correct = np.sum(y_test_pred == y_test)\n", "accuracy = float(num_correct) / num_test\n", "print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)" @@ -225,7 +228,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "You should expect to see approximately `27%` accuracy. Now lets try out a larger `k`, say `k = 5`:" + "우리는 대략 `27%`정도의 정확도를 예상합니다. 이제 `k = 5`같은 좀더 큰 `k`에 대해서도 실행해 보세요." ] }, { @@ -246,7 +249,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "You should expect to see a slightly better performance than with `k = 1`." + "`k = 1`보다 약간 더 좋은 성능을 기대할 수 있습니다." ] }, { @@ -257,17 +260,16 @@ }, "outputs": [], "source": [ - "# Now lets speed up distance matrix computation by using partial vectorization\n", - "# with one loop. Implement the function compute_distances_one_loop and run the\n", - "# code below:\n", + "# 이제 부분 벡터화와 단일 반복문을 사용하여 거리 행렬의 계산 속도를 높일 수 있습니다.\n", + "# compute_distance_one_loop를 구현해보고 아래의 코드를 실행해 보세요.\n", "dists_one = classifier.compute_distances_one_loop(X_test)\n", "\n", - "# To ensure that our vectorized implementation is correct, we make sure that it\n", - "# agrees with the naive implementation. There are many ways to decide whether\n", - "# two matrices are similar; one of the simplest is the Frobenius norm. 
In case\n", - "# you haven't seen it before, the Frobenius norm of two matrices is the square\n", - "# root of the squared sum of differences of all elements; in other words, reshape\n", - "# the matrices into vectors and compute the Euclidean distance between them.\n", + "# 우리의 벡터화 구현이 맞다는것을 보장하기 위해, \n", + "# 우리는 navie한 구현을 확인해야 합니다.\n", + "# 두 행렬의 유사 여부를 결정하는 방법은 여러가지가 있습니다.\n", + "# 단순한 방법은 Frobenius norm입니다.\n", + "# 이 Frobenius norm의 두 행렬은 모든 원소의 차이의 제곱합의 제곱근 입니다.\n", + "# 다른 말로 하면, 행렬을 벡터로 변형하고 유클리드 거리를 계산합니다.\n", "difference = np.linalg.norm(dists - dists_one, ord='fro')\n", "print 'Difference was: %f' % (difference, )\n", "if difference < 0.001:\n", @@ -284,11 +286,10 @@ }, "outputs": [], "source": [ - "# Now implement the fully vectorized version inside compute_distances_no_loops\n", - "# and run the code\n", + "# 이제 compute_distances_no_loops 안의 완전히 벡터화된 버전을 구현하고 실행합니다.\n", "dists_two = classifier.compute_distances_no_loops(X_test)\n", "\n", - "# check that the distance matrix agrees with the one we computed before:\n", + "# 거리 행렬이 우리가 전에 계산한 것과 일치하는지 확인합니다.\n", "difference = np.linalg.norm(dists - dists_two, ord='fro')\n", "print 'Difference was: %f' % (difference, )\n", "if difference < 0.001:\n", @@ -305,7 +306,7 @@ }, "outputs": [], "source": [ - "# Let's compare how fast the implementations are\n", + "# 구현한 것들이 얼마나 빠른지 비교합시다.\n", "def time_function(f, *args):\n", " \"\"\"\n", " Call a function f with args and return the time (in seconds) that it took to execute.\n", @@ -325,16 +326,16 @@ "no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)\n", "print 'No loop version took %f seconds' % no_loop_time\n", "\n", - "# you should see significantly faster performance with the fully vectorized implementation" + "# 완전 벡터화 구현이 훨씬 더 빠른 성능을 낸다는것을 볼 수 있습니다." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Cross-validation\n", + "### 교차검증\n", "\n", - "We have implemented the k-Nearest Neighbor classifier but we set the value k = 5 arbitrarily. We will now determine the best value of this hyperparameter with cross-validation." + "우리는 k-Nearest Neighbor 분류기를 구현했지만 임의로 k = 5라는 값을 정했습니다. 이제 hyperparameter의 교차검증으로 최선의 값을 결정할 것입니다." ] }, { @@ -350,39 +351,37 @@ "\n", "X_train_folds = []\n", "y_train_folds = []\n", - "################################################################################\n", - "# TODO: #\n", - "# Split up the training data into folds. After splitting, X_train_folds and #\n", - "# y_train_folds should each be lists of length num_folds, where #\n", - "# y_train_folds[i] is the label vector for the points in X_train_folds[i]. #\n", - "# Hint: Look up the numpy array_split function. #\n", - "################################################################################\n", + "####################################################################################\n", + "# TODO: #\n", + "# 폴더에 훈련 데이터를 분류합니다. #\n", + "# 분류 후에, X_train_folds 와 y_train_folds는 y_train_folds[i]가 #\n", + "# X_train_folds[i]의 점에 대한 레이블 벡터인 num_folds의 길이의 목록이어야 합니다. #\n", + "# 힌트: numpy의 array_split 함수를 살펴보세요. #\n", + "####################################################################################\n", "pass\n", "################################################################################\n", - "# END OF YOUR CODE #\n", + "# 코드의 끝 #\n", "################################################################################\n", "\n", - "# A dictionary holding the accuracies for different values of k that we find\n", - "# when running cross-validation. 
After running cross-validation,\n", + "# 사전은 서로 다른 교차 검증을 실행할 때 찾은 k의 값에 대한 정확도를 가지고 있습니다.\n", "# k_to_accuracies[k] should be a list of length num_folds giving the different\n", "# accuracy values that we found when using that value of k.\n", "k_to_accuracies = {}\n", "\n", "\n", - "################################################################################\n", - "# TODO: #\n", - "# Perform k-fold cross validation to find the best value of k. For each #\n", - "# possible value of k, run the k-nearest-neighbor algorithm num_folds times, #\n", - "# where in each case you use all but one of the folds as training data and the #\n", - "# last fold as a validation set. Store the accuracies for all fold and all #\n", - "# values of k in the k_to_accuracies dictionary. #\n", - "################################################################################\n", + "####################################################################################\n", + "# TODO: #\n", + "# 최고의 k 값을 찾기 위해 k-fold 교차 검증을 수행합니다. #\n", + "# 가능한 각 k에 대해서, k-nearest-neighbor 알고리즘을 numpy의um_folds회 실행합니다.#\n", + "# 각각의 경우에 모두 사용하되 그 중 하나는 훈련 데이터로, #\n", + "# 마지막 하나는 검증 데이터로 사용합니다. #\n", + "####################################################################################\n", "pass\n", "################################################################################\n", - "# END OF YOUR CODE #\n", + "# 코드의 끝 #\n", "################################################################################\n", "\n", - "# Print out the computed accuracies\n", + "# 계산된 정확도를 출력합니다.\n", "for k in sorted(k_to_accuracies):\n", " for accuracy in k_to_accuracies[k]:\n", " print 'k = %d, accuracy = %f' % (k, accuracy)" @@ -396,12 +395,12 @@ }, "outputs": [], "source": [ - "# plot the raw observations\n", + "# 원시 관측 플롯\n", "for k in k_choices:\n", " accuracies = k_to_accuracies[k]\n", " plt.scatter([k] * len(accuracies), accuracies)\n", "\n", - "# plot the trend line with error bars that correspond to standard deviation\n", + "# 표준편차에 해당하는 오차 막대와 추세선을 그립니다.\n", "accuracies_mean = np.array([np.mean(v) for k,v in sorted(k_to_accuracies.items())])\n", "accuracies_std = np.array([np.std(v) for k,v in sorted(k_to_accuracies.items())])\n", "plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)\n", @@ -419,16 +418,16 @@ }, "outputs": [], "source": [ - "# Based on the cross-validation results above, choose the best value for k, \n", - "# retrain the classifier using all the training data, and test it on the test\n", - "# data. 
You should be able to get above 28% accuracy on the test data.\n", + "# 위의 교차검증 결과에 기반해서 최적의 k를 선택하고 모든 학습 데이터를 \n", + "# 이용하여 분류기를 재학습 시키고 테스트 데이터를 이용해 테스트 해봅니다.\n", + "# 테스트데이터에 대해서 28%이상의 정확도를 얻을 수 있어야 합니다.\n", "best_k = 1\n", "\n", "classifier = KNearestNeighbor()\n", "classifier.train(X_train, y_train)\n", "y_test_pred = classifier.predict(X_test, k=best_k)\n", "\n", - "# Compute and display the accuracy\n", + "# 정확도를 계산하고 출력합니다.\n", "num_correct = np.sum(y_test_pred == y_test)\n", "accuracy = float(num_correct) / num_test\n", "print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)" From 31f71cd7ac30ef8c07668617035df216a0ef71ba Mon Sep 17 00:00:00 2001 From: YB Date: Sun, 1 May 2016 21:52:59 -0400 Subject: [PATCH 086/199] Lecture1 - part 141~160 (out of 715) en / ko --- captions/En/Lecture1_en.srt | 58 ++++++++++++++++++------------------- captions/Ko/Lecture1_ko.srt | 46 +++++++++++++++-------------- 2 files changed, 54 insertions(+), 50 deletions(-) diff --git a/captions/En/Lecture1_en.srt b/captions/En/Lecture1_en.srt index b66f4f95..59667319 100644 --- a/captions/En/Lecture1_en.srt +++ b/captions/En/Lecture1_en.srt @@ -691,18 +691,18 @@ Andrew Parker. He is a 141 00:16:03,159 --> 00:16:09,490 -largest in Australia from Australia he -he studied a lot of fun +modern geologist in Australia from Australia. +He he studied a lot of 142 00:16:09,490 --> 00:16:19,278 fossils and he's theory is that it was -the onset of the ice so one of the first +the onset of the ice. So, one of the first 143 00:16:19,278 --> 00:16:25,688 -trial bites developed and I really -really simple I it's almost like a +trilobite developed an eye, a really +really simple eye. It's almost like a 144 00:16:25,688 --> 00:16:30,779 @@ -712,47 +712,47 @@ and make some projections in 145 00:16:30,779 --> 00:16:34,750 register some information from the -environment +environment. 146 00:16:34,750 --> 00:16:41,080 -suddenly life is no longer so medal -because once you have that I the first +Suddenly ,life is no longer so medal +because once you have that eye, the first 147 00:16:41,080 --> 00:16:44,889 thing you can do is you could go patch -for you actually know where food it's +food. You actually know where food is. 148 00:16:44,889 --> 00:16:51,809 -not just like blind them floating in the -water and was you can go cat food +Not just like blind them floating in the +water and once you can go catch food. 149 00:16:51,809 --> 00:16:57,399 -guess what the food best better develop +Guess what? The food had better developed eyes and to run away from you otherwise 150 00:16:57,399 --> 00:17:02,590 -they'll be gone you know your your so +They'll be gone. You know your your so the first of all who had had eyes were 151 00:17:02,590 --> 00:17:11,380 -like in a limited both Google and so -just like has the best time you think +like in unlimited buffet like working in Google and so +just like it has the best time eating 152 00:17:11,380 --> 00:17:18,170 -everything they can but because those -are all set of lies what we what the +everything they can. But because of this +onset of ice, what we what the 153 00:17:18,170 --> 00:17:28,400 -college's realize is the biological arms -race begin every single animal needs to +realized is the biological arms +race begin. 
Every single animal needs to 154 00:17:28,400 --> 00:17:34,170 @@ -761,33 +761,33 @@ survive or to you know you you you 155 00:17:34,170 --> 00:17:40,190 -suddenly have praised and predators and -all this and the speciation so that's +suddenly have preys and predators and +all this and the speciation began. so that's 156 00:17:40,190 --> 00:17:47,870 -one vision begun 540 million years and -not only religion begun visual was one +when vision begun 540 million years and +not only vision begun. vision was one 157 -00:17:47,869 --> 00:17:53,189 +00:17:47,870 --> 00:17:53,189 of the major driving force of the -speciation or that the big fan of +speciation or that the big bang of 158 00:17:53,190 --> 00:17:58,980 -evolution or I so so we're not gonna -fall evolution for with too much detail +evolution. Alright, so so we're not gonna +fall evolution for with too much detail. 159 00:17:58,980 --> 00:18:08,710 -another big important work that the +Another big important work that the engineering of vision happened around 160 00:18:08,710 --> 00:18:19,220 the Renaissance and of course it's -attributed to this amazing guy so before +attributed to this amazing guy Leonardo Da Vinci. so before 161 00:18:19,220 --> 00:18:23,740 diff --git a/captions/Ko/Lecture1_ko.srt b/captions/Ko/Lecture1_ko.srt index ed23c4c0..8f2832a4 100644 --- a/captions/Ko/Lecture1_ko.srt +++ b/captions/Ko/Lecture1_ko.srt @@ -575,83 +575,87 @@ 141 00:16:03,159 --> 00:16:09,490 - 호주에서 가장 큰 호주에서 그는 그가 재미를 많이 공부 + 호주의 현대 지질학자로서 아주 많은 화석을 연구했죠. 142 00:16:09,490 --> 00:16:19,278 - 화석 그 이론은 제 그렇게 한 얼음의 개시 인 것을 + 빙하기의 도래가 그의 이론이었죠. 143 00:16:19,278 --> 00:16:25,688 - 시험 물린 개발하고 정말 정말 간단 나는 그것은처럼 거의이다 + 최초로 삼엽충이 "눈"을 갖게되었어요. 매우 간단한 눈인데 144 00:16:25,688 --> 00:16:30,779 - 단지 빛을 포착하고 몇 가지 예측을 핀홀 카메라 + 핀홀 카메라처럼 단지 빛을 포착하고 투영해서 145 00:16:30,779 --> 00:16:34,750 - 환경에서 일부 정보를 등록 + 주변환경의 정보를 받아들이죠. 146 00:16:34,750 --> 00:16:41,080 - 당신은 일단 때문에 갑자기 인생은 더 이상 이렇게 메달 없다 내가 처음 그 + 이제부터는 삶이 달라집니다. 147 00:16:41,080 --> 00:16:44,889 - 음식이 어디 당신이 실제로 알고 당신이 할 수있는 것은 당신이 패치를 갈 수있다 + 제일 먼저 먹이를 찾아갑니다, 먹이가 어디있는지 보이거든요. 148 00:16:44,889 --> 00:16:51,809 - 뿐만 아니라 물에 떠있는 그들을 눈 멀게 좋아하고 당신이 고양이 먹이를 갈 수 있었다 + 더이상 장님처럼 둥둥 떠 다니기만 하지 않아도 되죠. + 자 이제 당신이 먹이들을 찾아갈 수 있게 되었어요. 149 00:16:51,809 --> 00:16:57,399 - 음식이 가장 좋은 눈을 개발하고, 그렇지 않으면 멀리에서 실행하는 것 같아요 + 이제 먹이들도 당신에게 도망가기 위해 눈이 필요해졌죠. 150 00:16:57,399 --> 00:17:02,590 - 그들은 당신이 알고 사라질 것이다 당신의 눈을 가졌다 모든 사람의 첫 번째이었다 있도록 + 그렇지 않으면.. 먹이들은 인생 끝난거죠. + 처음으로 눈을 가진 녀석은 그야말로 151 00:17:02,590 --> 00:17:11,380 - 그냥 좋아 제한된 구글 모두에서 그렇게 같은 것은 당신이 생각하는 가장 좋은 시간을 가지고 + 구글에서 제공하는 무제한 뷔페 식당에 앉아 끝내주는 시간을 보내는거죠. 152 00:17:11,380 --> 00:17:18,170 - 모든 것이 그들이 할 수 있지만, 그 거짓말 우리 것의 모든 설정을하기 때문에 + 찾을 수 있는 모든 먹이를 먹고다닙니다. + 우리는 이 빙하기가 도래하면서 153 00:17:18,170 --> 00:17:28,400 - 대학의 실현은 생물학적 군비 경쟁은 모든 단일 동물에 필요 시작이다 + 생물학적으로 군비 경쟁이 시작된 것을 알 수 있죠. 모든 동물들은 154 00:17:28,400 --> 00:17:34,170 - 생존을 위해 일을 개발하기 위해 배울 필요가 또는 당신은 당신에게 당신이 알고에 + 생존을 위해 무엇인가를 개발하는 법을 배워야 했어요. 155 00:17:34,170 --> 00:17:40,190 - 즉, 그래서 갑자기 육식 동물이 모든과 분화와 칭찬했다 + 갑자기 천적과 먹이의 관계, 그리고 종의 분화가 시작됬어요. 156 00:17:40,190 --> 00:17:47,870 - 하나의 비전 540,000,000년 시작 및뿐만 아니라 종교는 시각적 시작 하나 + 5억 4천만년 전, 그때가 바로 비전이 시작된 시기입니다. + 그 뿐아니라 비전은 157 -00:17:47,869 --> 00:17:53,189 - 종 분화의 또는 큰 팬의 주요 원동력의 +00:17:47,870 --> 00:17:53,189 + 종의 분화와 진화를 불러온 주요 원동력중의 하나 입니다. 158 00:17:53,190 --> 00:17:58,980 - 진화 또는 정말 그래서 우리는 너무 많은 세부 사항에 대한 않을거야 가을의 진화있어 + 좋아요. 우리는 진화에 대해 자세히 다루진 않을거예요. 
159 00:17:58,980 --> 00:18:08,710 - 비전 엔지니어링 주위에 일어난 또 다른 큰 중요한 일 + 비전 엔지니어링에 대한 또 다른 중요한 일이 160 00:18:08,710 --> 00:18:19,220 - 르네상스와는 물론 그 전에 너무 놀라운 사람에 의한 것 + 르네상스시기에 일어났어요. 바로 레오나르도 다 빈치에의한 것이었죠. 161 00:18:19,220 --> 00:18:23,740 From ac95a49f22f5caf42d23fbb58fb98d23d239c7f8 Mon Sep 17 00:00:00 2001 From: OkminLee Date: Tue, 3 May 2016 20:56:52 +0900 Subject: [PATCH 087/199] Update classification.md --- classification.md | 30 +++++++++++++++++------------- 1 file changed, 17 insertions(+), 13 deletions(-) diff --git a/classification.md b/classification.md index 2cf97c87..0d1d9b71 100644 --- a/classification.md +++ b/classification.md @@ -46,39 +46,43 @@ permalink: /classification/
**Data-driven approach(데이터 기반 방법론)**. 어떻게 하면 이미지를 각각의 카테고리로 분류하는 알고리즘을 작성할 수 있을까? 숫자를 정렬하는 알고리즘 작성과는 달리 고양이를 분별하는 알고리즘을 작성하는 것은 어렵다. -How might we go about writing an algorithm that can classify images into distinct categories? Unlike writing an algorithm for, for example, sorting a list of numbers, it is not obvious how one might write an algorithm for identifying cats in images. Therefore, instead of trying to specify what every one of the categories of interest look like directly in code, the approach that we will take is not unlike one you would take with a child: we're going to provide the computer with many examples of each class and then develop learning algorithms that look at these examples and learn about the visual appearance of each class. This approach is referred to as a *data-driven approach*, since it relies on first accumulating a *training dataset* of labeled images. Here is an example of what such a dataset might look like: + +그러므로, 코드를 통해 직접적으로 모든 것을 카테고리로 분류하기 보다는 좀 더 쉬운 방법을 사용할 것이다. 먼저 컴퓨터에게 각 클래스에 대해 많은 예제를 주고 나서 이 예제들을 보고 시각적으로 학습할 수 있는 학습 알고리즘을 개발한다. + 이런 방법을 *data-driven approach(데이터 기반 아법론)* 이라고 한다. 이 방법은 라벨화가 된 이미지들 *training dataset(트레이닝 데이터 셋)* 이 처음 학습을 위해 필요하다. 아래 그림은 이런 데이터셋의 예이다:
-
An example training set for four visual categories. In practice we may have thousands of categories and hundreds of thousands of images for each category.
+
4개의 카테고리에 대한 트레이닝 셋에 대한 예. 학습과정에서 천여개의 카테고리와 각 카테고리당 수십만개의 이미지가 있을 수 있다.
-**The image classification pipeline**. We've seen that the task in Image Classification is to take an array of pixels that represents a single image and assign a label to it. Our complete pipeline can be formalized as follows: +**The image classification pipeline(이미지 분류 파이프라인)**. 이제까지 이미지 분류는 픽셀값을 같고 있는 배열은 하나의 이미지로 표현하고 라벨을 할당하는 것이다라는 것을 살펴봤다. 우리의 완전한 파이프라인은 아래와 같이 공식화할 수 있다: -- **Input:** Our input consists of a set of *N* images, each labeled with one of *K* different classes. We refer to this data as the *training set*. -- **Learning:** Our task is to use the training set to learn what every one of the classes looks like. We refer to this step as *training a classifier*, or *learning a model*. -- **Evaluation:** In the end, we evaluate the quality of the classifier by asking it to predict labels for a new set of images that it has never seen before. We will then compare the true labels of these images to the ones predicted by the classifier. Intuitively, we're hoping that a lot of the predictions match up with the true answers (which we call the *ground truth*). +- **Input(입력):** 입력은 *N* 개의 이미지로 구성되어 있고, *K* 개의 별개의 클래스로 라벨화 되어 있다. 이 데이터를 *training set* 으로 사용한다. +- **Learning(학습):** 학습에서 할 일은 트레이닝 셋을 이용해 각각의 클래스를 학습하는 것이다. 이 과정을 *training a classifier* 혹은 *learning a model* 이란 용어를 사용해 표현할 수 있다. +- **Evaluation(평가):** 마지막으로 새로운 이미지에 대해 어떤 라벨값으로 분류되는지 예측해봄으로써 분류기의 성능을 평가한다. 새로운 이미지의 라벨값과 분류기를 통해 예측된 라벨값을 비교할 수 있다. 직감적으로, 많은 예상치들이 실제 답과 일치하기를 기대한다. 이 것을 *ground truth(실측 자료)* 라고 한다. -### Nearest Neighbor Classifier -As our first approach, we will develop what we call a **Nearest Neighbor Classifier**. This classifier has nothing to do with Convolutional Neural Networks and it is very rarely used in practice, but it will allow us to get an idea about the basic approach to an image classification problem. -**Example image classification dataset: CIFAR-10.** One popular toy image classification dataset is the CIFAR-10 dataset. This dataset consists of 60,000 tiny images that are 32 pixels high and wide. Each image is labeled with one of 10 classes (for example *"airplane, automobile, bird, etc"*). These 60,000 images are partitioned into a training set of 50,000 images and a test set of 10,000 images. In the image below you can see 10 random example images from each one of the 10 classes: +## Nearest Neighbor Classifier(최근접 이웃 분류기) + +첫번째 방법으로써, **Nearest Neighbor Classifier** 라 불리는 분류기를 개발할 것이다. 이 분류기는 컨볼루션 신경망 방법이 사용되지 않고 연습과정애서 매우 드물게 사용된다. 하지만 이 분류기는 이미지 분류 문제에 대한 기본적인 접근방법을 알 수 있다. + +**이미지 분류 데이터셋의 예: CIFAR-10.** 하나의 유명한 이미지 분류 데이터셋은 CIFAR-10 dataset 이다. 이 데이터셋은 60,000개의 작은 이미지로 구성되어 있고, 각 이미지는 32x32픽셀 크기있다. 각 이미지는 10개의 클래스중 하나로 라벨화되어 있다(예를 들어, *"airplane, automobile, bird, etc"*). 이 60,000개의 이미지 중에 50,000개는 트레이싱 셋, 10,000개는 트레이닝 셋으로 분류된다. 아래의 그림에서 각 10개의 클래스에 대해 임의로 선정한 10개의 이미지들의 예를 볼 수 있다:
-
Left: Example images from the CIFAR-10 dataset. Right: first column shows a few test images and next to each we show the top 10 nearest neighbors in the training set according to pixel-wise difference.
+
좌: CIFAR-10 dataset 의 각 클래스 예. 우: 첫번째 열은 테스트 셋이고 나머지 열은 이 테스트셋에 대해서 트레이닝 셋에 있는 이미지 중 픽셀값 차에 따른 상위 10개의 최근접 이웃 이미지이다.
-Suppose now that we are given the CIFAR-10 training set of 50,000 images (5,000 images for every one of the labels), and we wish to label the remaining 10,000. The nearest neighbor classifier will take a test image, compare it to every single one of the training images, and predict the label of the closest training image. In the image above and on the right you can see an example result of such a procedure for 10 example test images. Notice that in only about 3 out of 10 examples an image of the same class is retrieved, while in the other 7 examples this is not the case. For example, in the 8th row the nearest training image to the horse head is a red car, presumably due to the strong black background. As a result, this image of a horse would in this case be mislabeled as a car. +50,000개의 CIFAR-10 트레이닝 셋(하나의 라벨 당 5,000개의 이미지)이 주어진 상태에서 나머지 10,000개의 이미지에 대해 라벨화 하는 것을 가정해보자. 최근접 이웃 분류기는 테스트 이미지를 취해 모든 트레이닝 이미지와 비교를 하고 라벨 값을 예상할 것이다. 상단 이미지의 우측과 같이 10개의 테스트 이미지에 대한 결과를 확인할 수 있다. 10개의 이미지 중 3개만이 같은 클래스로 검색된 반면에, 7개의 이미지는 같은 클래스로 분류되지 않았다. 예를 들어, 8번째 행의 말 학습 이미지에 대한 첫번째 최근접 이웃 이미지는 붉은색의 차이다. 짐작컨데 이 경우는 검은색 배경의 영향이 큰 듯 하다. 결과적으로, 이 말 이미지는 차로 잘 못 분류될 것이다. -You may have noticed that we left unspecified the details of exactly how we compare two images, which in this case are just two blocks of 32 x 32 x 3. One of the simplest possibilities is to compare the images pixel by pixel and add up all the differences. In other words, given two images and representing them as vectors $$ I_1, I_2 $$ , a reasonable choice for comparing them might be the **L1 distance**: +두개의 이미지를 비교하는 정확한 방법을 아직 명시하지 않았는데, 이 경우에는 32 x 32 x 3 크기의 두 블록이다. 가장 간단한 방법 중 하나는 이미지를 각각의 픽셀값으로 비교하고, 그 차이를 더해 모두 더하는 것이다. 다시 말해서 두 개의 이미지가 주어지고 그 것들을 $$ I_1, I_2 $$ 벡터로 나타냈을 때, 벡터 간의 **L1 distance** 를 계산하는 것이 적절한 방법이다: $$ d_1 (I_1, I_2) = \sum_{p} \left| I^p_1 - I^p_2 \right| $$ -Where the sum is taken over all pixels. Here is the procedure visualized: +결과는 모든 픽셀값 차이의 합이다. 아래에 시각적인 절차가 있다:
From cf1dad06f239e403e8a7729ccd8b8d0c044aeabd Mon Sep 17 00:00:00 2001 From: OkminLee Date: Tue, 3 May 2016 21:01:49 +0900 Subject: [PATCH 088/199] Update classification.md --- classification.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/classification.md b/classification.md index 0d1d9b71..f61e7f0c 100644 --- a/classification.md +++ b/classification.md @@ -41,7 +41,7 @@ permalink: /classification/ 좋은 이미지 분류기는 각 클래스간의 감도를 유지하면서 동시에 이런 다양한 문제들에 대해 변함 없이 분류할 수 있는 성능을 유지해야 한다.
- +
From 20efb02e3549217448025927305d31bf49ad64f6 Mon Sep 17 00:00:00 2001
From: OkminLee
Date: Tue, 3 May 2016 21:06:39 +0900
Subject: [PATCH 089/199] Update classification.md

---
 classification.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/classification.md b/classification.md
index f61e7f0c..b9ce3e29 100644
--- a/classification.md
+++ b/classification.md
@@ -24,7 +24,7 @@ permalink: /classification/

**예시**. 예를 들어, 아래 그림의 이미지 분류 모델은 하나의 이미지와 4개의 분류가능한 라벨 *{cat, dog, hat, mug}* 이 있다. 그림에서 보다시피, 컴퓨터에서 이미지는 3차원 배열로 표현된다. 이 예시에서 고양이 이미지는 가로 248픽셀(모니터의 화면을 구성하는 최소 단위, 역자 주), 세로 400픽셀로 구성되어 있고 3개의 색상 채널이 있는데 각각 Red, Green, Blue(RGB)로 불린다. 따라서 이 이미지는 248 x 400 x 3개(총 297,600개)의 숫자로 구성되어 있다. 각 숫자는 0~255 범위의 정수값이다. 이미지 분류 문제는 이 수많은 값들을 *"cat"* 이라는 하나의 라벨로 변경하는 것이다.
- +
The task in Image Classification is to predict a single label (or a distribution over labels as shown here to indicate our confidence) for a given image. Images are 3-dimensional arrays of integers from 0 to 255, of size Width x Height x 3. The 3 represents the three color channels Red, Green, Blue.
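위 문단에서 설명한 이미지 표현(248 x 400 x 3 = 총 297,600개의 숫자)은 NumPy로 다음과 같이 확인해볼 수 있다 (역자 주: 원문에 없는 간단한 예시이다).

~~~python
import numpy as np

# 세로 400, 가로 248, 3개의 색상 채널(RGB)을 갖는 이미지 배열
img = np.zeros((400, 248, 3), dtype=np.uint8)  # 각 값은 0(검정)~255(흰색) 범위의 정수
print img.size  # 400 * 248 * 3 = 297600
~~~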
From b7fb0b73b1b5397049df7c3461805a0f3c519946 Mon Sep 17 00:00:00 2001
From: Taeksoo Kim
Date: Tue, 3 May 2016 23:54:26 +0900
Subject: [PATCH 090/199] Update convolutional-networks-korean.md

---
 convolutional-networks-korean.md | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/convolutional-networks-korean.md b/convolutional-networks-korean.md
index ef181614..4b218c7e 100644
--- a/convolutional-networks-korean.md
+++ b/convolutional-networks-korean.md
@@ -239,23 +239,26 @@ CNN 구조 내에 컨볼루션 레이어들 중간중간에 주기적으로 풀

 #### Fully-connected 레이어

-Neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks. Their activations can hence be computed with a matrix multiplication followed by a bias offset. See the *Neural Network* section of the notes for more information.
+Fully connected 레이어 내의 뉴런들은 일반 신경망 챕터에서 보았듯이 이전 레이어의 모든 액티베이션들과 연결되어 있다. 그러므로 Fully connected 레이어의 액티베이션은 매트릭스 곱을 한 뒤 바이어스를 더해 구할 수 있다. 더 많은 정보를 위해 강의 노트의 "신경망" 섹션을 보기 바란다.

-<a name='convert'></a>
-#### Converting FC layers to CONV layers
+<a name='convert'></a>
+#### FC 레이어를 CONV 레이어로 변환하기

-It is worth noting that the only difference between FC and CONV layers is that the neurons in the CONV layer are connected only to a local region in the input, and that many of the neurons in a CONV volume share parameters. However, the neurons in both layers still compute dot products, so their functional form is identical. Therefore, it turns out that it's possible to convert between FC and CONV layers:
+FC 레이어와 CONV 레이어의 차이점은, CONV 레이어는 입력의 일부 영역에만 연결되어 있고, CONV 볼륨의 많은 뉴런들이 파라미터를 공유한다는 것뿐이라는 점을 알아 둘 필요가 있다. 두 레이어 모두 내적 연산을 수행하므로 실제 함수 형태는 동일하다. 그러므로 FC 레이어와 CONV 레이어는 서로 변환하는 것이 가능하다:

-- For any CONV layer there is an FC layer that implements the same forward function. The weight matrix would be a large matrix that is mostly zero except for at certian blocks (due to local connectivity) where the weights in many of the blocks are equal (due to parameter sharing).
-- Conversely, any FC layer can be converted to a CONV layer. For example, an FC layer with $$K = 4096$$ that is looking at some input volume of size $$7 \times 7 \times 512$$ can be equivalently expressed as a CONV layer with $$F = 7, P = 0, S = 1, K = 4096$$. In other words, we are setting the filter size to be exactly the size of the input volume, and hence the output will simply be $$1 \times 1 \times 4096$$ since only a single depth column "fits" across the input volume, giving identical result as the initial FC layer.
+- 모든 CONV 레이어는 동일한 forward 함수를 수행하는 FC 레이어 짝이 있다. 이 경우의 가중치 매트릭스는 몇몇 블록을 제외하고 모두 0으로 이뤄지며 (local connectivity: 입력의 일부 영역에만 연결된 특성), 이 블록들 중 여러 개는 같은 값을 지니게 된다 (파라미터 공유).

-**FC->CONV conversion**. Of these two conversions, the ability to convert an FC layer to a CONV layer is particularly useful in practice. Consider a ConvNet architecture that takes a 224x224x3 image, and then uses a series of CONV layers and POOL layers to reduce the image to an activations volume of size 7x7x512 (in an *AlexNet* architecture that we'll see later, this is done by use of 5 pooling layers that downsample the input spatially by a factor of two each time, making the final spatial size 224/2/2/2/2/2 = 7). From there, an AlexNet uses two FC layers of size 4096 and finally the last FC layers with 1000 neurons that compute the class scores. We can convert each of these three FC layers to CONV layers as described above:
+- 반대로, 모든 FC 레이어는 CONV 레이어로 변환될 수 있다.
예를 들어, $$7 \times 7 \times 512$$ 크기의 입력을 받고 $$K = 4096$$ 인 FC 레이어는 $$F = 7, P = 0, S = 1, K = 4096$$인 CONV 레이어로 표현 가능하다. 바꿔 말하면, 필터의 크기를 입력 볼륨의 크기와 동일하게 만들고 $$1 \times 1 \times 4096$$ 크기의 아웃풋을 출력할 수 있다. 각 depth에 대해 하나의 값만 구해지므로 (필터의 가로/세로가 입력 볼륨의 가로/세로와 같으므로) FC 레이어와 같은 결과를 얻게 된다.

-- Replace the first FC layer that looks at [7x7x512] volume with a CONV layer that uses filter size $$F = 7$$, giving output volume [1x1x4096].
-- Replace the second FC layer with a CONV layer that uses filter size $$F = 1$$, giving output volume [1x1x4096]
-- Replace the last FC layer similarly, with $$F=1$$, giving final output [1x1x1000]
+**FC->CONV 변환**. 이 두 변환 중, FC 레이어를 CONV 레이어로 바꾸는 변환은 실전에서 매우 유용하다. 224x224x3의 이미지를 입력으로 받고 일련의 CONV 레이어와 POOL 레이어를 이용해 7x7x512의 액티베이션을 만드는 컨볼루션넷 아키텍쳐를 생각해 보자 (뒤에서 살펴 볼 *AlexNet* 아키텍쳐에서는 입력의 spatial(가로/세로) 크기를 반으로 줄이는 풀링 레이어 5개를 사용해 7x7x512의 액티베이션을 만든다. 224/2/2/2/2/2 = 7이기 때문이다). AlexNet은 여기에 4096의 크기를 갖는 FC 레이어 2개와 클래스 스코어를 계산하는 1000개 뉴런으로 이뤄진 마지막 FC 레이어를 사용한다. 이 마지막 3개의 FC 레이어를 CONV 레이어로 변환하는 방법을 아래에서 배우게 된다:

-Each of these conversions could in practice involve manipulating (e.g. reshaping) the weight matrix $$W$$ in each FC layer into CONV layer filters. It turns out that this conversion allows us to "slide" the original ConvNet very efficiently across many spatial positions in a larger image, in a single forward pass.
+- [7x7x512]의 입력 볼륨을 받는 첫 번째 FC 레이어를 $$F = 7$$의 필터 크기를 갖는 CONV 레이어로 바꾼다. 이 때 출력 볼륨의 크기는 [1x1x4096] 이 된다.
+- 두 번째 FC 레이어를 $$F = 1$$ 필터 사이즈의 CONV 레이어로 바꾼다. 이 때 출력 볼륨의 크기는 [1x1x4096]이 된다.
+- 같은 방식으로 마지막 FC 레이어를 $$F = 1$$의 CONV 레이어로 바꾼다. 출력 볼륨의 크기는 [1x1x1000]이 된다.

+각각의 변환은 일반적으로 FC 레이어의 가중치 $$W$$를 CONV 레이어의 필터로 변환(예: reshape)하는 과정을 수반한다. 이런 변환을 하고 나면, 큰 이미지 (가로/세로가 224보다 큰 이미지)를 단 한번의 forward pass만으로 마치 이미지를 "슬라이딩"하면서 여러 영역을 읽은 것과 같은 효과를 준다.

+예를 들어, 224x224 크기의 이미지가 [7x7x512]의 볼륨을 출력한다면 (즉, 공간 크기가 224/7 = 32배 줄어든다면), 이렇게 변환된 아키텍쳐에 384x384 크기의 이미지를 넣었을 때 384/32 = 12 이므로 [12x12x512] 크기의 볼륨을 출력하게 된다. 방금 FC 레이어에서 변환한 3개의 CONV 레이어를 이어서 통과시키면 (12 - 7)/1 + 1 = 6 이므로 최종적으로 [6x6x1000] 크기의 볼륨을 얻는다. [1x1x1000] 크기의 클래스 스코어 벡터 하나 대신, 384x384 이미지 전체에 걸친 6x6 개의 클래스 스코어 배열을 얻게 된다는 점에 주목하자.

From 52eca781506bb0b99cd5f6702aa50373c5021ef3 Mon Sep 17 00:00:00 2001
From: Jaemin Cho
Date: Wed, 4 May 2016 00:36:15 +0900
Subject: [PATCH 091/199] =?UTF-8?q?Lecture=2010=20=EB=B2=88=EC=97=AD=20?=
 =?UTF-8?q?=EC=A4=91=EA=B0=84=20=EC=97=85=EB=8D=B0=EC=9D=B4=ED=8A=B8?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

~33:33 까지 번역했습니다. 한동안 집중을 못했었는데 최대한 빨리 끝내도록 할게요!
---
 captions/Ko/Lecture10_ko.srt | 666 +++++++++++++++++------------------
 1 file changed, 333 insertions(+), 333 deletions(-)

diff --git a/captions/Ko/Lecture10_ko.srt b/captions/Ko/Lecture10_ko.srt
index 1d9058fa..fdf66e82 100644
--- a/captions/Ko/Lecture10_ko.srt
+++ b/captions/Ko/Lecture10_ko.srt
@@ -588,1307 +588,1307 @@

148
00:09:58,129 --> 00:10:01,860
-
+ 처음에 h에는 0만 들어가 있습니다.

149
00:10:01,860 --> 00:10:04,720
- 사용하여 숨겨진 상태 유권자 매 시간 단계를 계산 요청
+ 그래서 매 시간 단계마다 이 재귀 식을 이용해서 hidden state 벡터를 계산합니다.

150
00:10:04,720 --> 00:10:08,790
- 이 수정 재발의 공식은 그래서 우리는 상태에서 단지 3 %가 여기에 가정
+ hidden state에 3개의 (안들림) 가 있습니다.
151 00:10:08,789 --> 00:10:11,099 - 우리는 세 가지 차원 표현으로 끝낼거야 그 + 각 시점에서 이전까지 입력받은 모든 문자들을 요약해서 표현합니다. 152 00:10:11,100 --> 00:10:13,040 - 기본적으로 어느 시점에서 + 각 시점에서 이전까지 입력받은 모든 문자들을 요약해서 표현합니다. 153 00:10:13,039 --> 00:10:15,759 - 온까지 모든 문자 요약 + 각 시점에서 이전까지 입력받은 모든 문자들을 요약해서 표현합니다. 154 00:10:15,759 --> 00:10:20,159 - 그래서 우리는 이것이 필요 적용 할 필요가 매 시간 스텝 지금 + 이런 방법으로 매 시간 단계마다 바로 다음 순서 에 올 문자를 예측할 것입니다. 155 00:10:20,159 --> 00:10:23,139 - 우리는 매번 다음해야 어떤 단계 것으로 예측거야 + 이런 방법으로 매 시간 단계마다 바로 다음 순서에 올 문자를 예측할 것입니다. + 156 00:10:23,139 --> 00:10:27,569 - 우리는이 네 개의 문자를 가지고 있기 때문에 그래서 예를 들어 시퀀스에서 문자 + 우리는 이 4 개의 문자(역자주: h, e, l, o)를 가지고 있고, 매 시간 단계마다 이 4개의 문자 중 어떤 문자가 오는지 예측할 것입니다. 157 00:10:27,570 --> 00:10:32,100 - 이것은 우리에 대한 그래서 매 시간에 전화 번호를 보호하기 위해거야 + 우리는 이 4 개의 문자(역자주: h, e, l, o)를 가지고 있고, 매 시간 단계마다 이 4개의 문자 중 어떤 문자가 오는지 예측할 것입니다. 158 00:10:32,100 --> 00:10:37,139 - 우리는 편지 H와 RNN과에서 말한 아주 처음에 예 + 제일 처음에는 H를 입력할 것입니다. 159 00:10:37,139 --> 00:10:40,799 - 무게 컴퓨터의 현재 설정이 그가의 정규화 된 잠금 문제입니다 + RNN은 현재의 weight를 바탕으로 다음에 어떤 문자가 올 지 예측합니다. 160 00:10:40,799 --> 00:10:42,959 - 여기 옆에 와서해야한다고 생각하는 것에 대해 + RNN은 현재의 weight를 바탕으로 다음에 어떤 문자가 올 지 예측합니다. 161 00:10:42,960 --> 00:10:47,950 - H 가능성이 아니라 2.2으로 먹고 다음 일을 올 110 가능성이 있으므로 일 + 현재 normalized 되지 않은 수치로는, (역자주: 맨 위 왼쪽 사각형 안의 숫자) h는 1.0, e는 2.2, 162 00:10:47,950 --> 00:10:52,640 - 하지 않는 한 로트의 측면에서 가능성이 지금 가능성 및 OS 4.1 세 부정적 + l은 -3.0 , o는 4.1라는 숫자의 정도로 나타날 것입니다. 163 00:10:52,639 --> 00:10:56,409 - 물론 확률은 우리는이 훈련 순서로 우리가 알고있는 것을 알고 우리가 + 물론 우리는 이 training sequence에서 h 다음에 e가 온다는 것을 알고 있습니다. 164 00:10:56,409 --> 00:11:00,669 - 녹색으로 표시되어이 2.2가 정확한 사실 때문에 각을 따라야한다 + 그러니까 여기 초록색으로 적혀 있는 e의 2.2라는 숫자가 정답이 되는 것이죠. 165 00:11:00,669 --> 00:11:04,559 - 이 경우에 답하고 그래서 우리는 그 높은 것으로 원하는 우리는이 모든 작업을 수행합니다 + 그래서 이 숫자는 커야 하고, 다른 숫자들은 작아져야 합니다. 166 00:11:04,559 --> 00:11:07,799 - 다른 숫자는 우리가 기본적으로 가지고 매 시간에 낮아야합니다 + 이처럼 매 시간 단계마다 우리는 다음에 올 타겟 문자를 갖고 있습니다. 167 00:11:07,799 --> 00:11:12,209 - 다음 문자가 순서대로 제공해야하는지에 대한 목표 그래서 우리는 단지 원하는 + 타겟에 해당하는 숫자는 커야 하고, 나머지 숫자는 작아야 합니다. 168 00:11:12,210 --> 00:11:15,470 - 모든 숫자는 높은하고 다른 모든 숫자는 낮은 것으로 그래서 그의의 + 타겟에 해당하는 숫자는 커야 하고, 나머지 숫자는 작아야 합니다. 169 00:11:15,470 --> 00:11:19,950 - 상기 녹색 신호 손실 함수에 포함하고 있다는 점에서 포함 과정 + 그래서 이러한 정보는 loss function(손실 함수)의 gradient signal에 포함됩니다. 170 00:11:19,950 --> 00:11:23,220 - 다시 이러한 연결 그렇게 생각하는 또 다른 방법을 통해 전파됩니다 + 그리고 그러한 loss 들은 이 연결들은 통해 back-propagation 됩니다. 171 00:11:23,220 --> 00:11:26,600 - 그것은 매 시간 단계는 우리가 기본적으로 소프트 맥스 분류를 가지고있다에 대해 + 매 시간 단계에 softmax classifier을 갖고 있다고 합시다. 172 00:11:26,600 --> 00:11:31,300 - 그래서이 모든 일 다음 문자를 통해 소프트 맥스 분류 및 + 그래서 매 시간 단계마다 softmax classifier가 다음에 어떤 문자가 와야 할 지를 예측하고, 173 00:11:31,299 --> 00:11:34,269 - 모든 단일 지점에서 우리는 다음 문자가 있어야한다 그래서 우리가 알고 + 그리고 모든 loss들은 맨 위(역자주: output layer)부터 거꾸로 그래프를 내려오면서 계산되어서 174 00:11:34,269 --> 00:11:37,879 - 모든 손실은 상단에서 아래로 둔화 얻을 그들은 모두를 통과한다 + 그리고 모든 loss들은 맨 위(역자주: output layer)부터 거꾸로 그래프를 내려오면서 계산되어서 175 00:11:37,879 --> 00:11:41,179 - 모든 화살을 거꾸로이 그래프에서 기울기를 얻기 위하여려고 된 모든 + 그리고 모든 loss들은 맨 위(역자주: output layer)부터 거꾸로 그래프를 내려오면서 계산되어서 176 00:11:41,179 --> 00:11:44,479 - 체중 행렬 그리고, 우리는 매트릭스를 이동하는 방법을 알게되도록 + weight 행렬에 gradient를 주어 적절한 값으로 변화시켜 RNN이 문자를 보다 정확하게 예측하게 합니다. 177 00:11:44,480 --> 00:11:50,039 - 정확한 문제는 우리가 그 무게를 형성 할 것 아르 논에서 나오는된다 + weight 행렬에 gradient를 주어 적절한 값으로 교정시켜 RNN이 문자를 보다 정확하게 예측하게 합니다. 178 00:11:50,039 --> 00:11:53,599 - 그 올바른 행동 때문에 군대는 먹이 올바른 동작을 + 그러니까 여러분이 RNN에 문자를 입력하면 RNN은 보다 정확한 행동(역자주: 여기서는 문자 예측)을 하는 것이죠. 179 00:11:53,600 --> 00:11:57,750 - 당신과 같은 문자는 우리에 대한 다른 질문이 뒤집 수있는 방법을 상상할 수있다 + 이제 어떻게 데이터를 학습시키는지에 대해 상상이 좀 갈 거에요. 
180 00:11:57,750 --> 00:12:02,879 - 그림 + 여기 그림에 대해 질문이 있나요? 181 00:12:02,879 --> 00:12:08,750 - 그래 나는 장면을 되풀이을 언급 한 바와 같이 그렇게 필사적으로 거짓말을 주셔서 감사합니다 + (질문): W_xh와 W_hy는 항상 일정한 값을 가지나요? 182 00:12:08,750 --> 00:12:13,320 - 항상 동일한 기능 그래서 우리는 하나의 WX 환자마다 단계 우리가 + (답변): W(weight) 들은 매 recurrence 단계 마다 항상 일정한 값을 가집니다. 183 00:12:13,320 --> 00:12:17,010 - 같은 whah의 모든 시간 단계에서 하나의 WHYY마다 단계에 적용했습니다 + (답변): W(weight) 들은 매 recurrence 단계 마다 항상 일정한 값을 가집니다. 184 00:12:17,009 --> 00:12:23,830 - 여기에서 우리는 WX를 사용했습니다 AWH 이유도 및 다시 awhh 네 번 + 여기서 우리는 W_xh, W_hh, W_yh를 각각 4번씩 사용했습니다. 185 00:12:23,830 --> 00:12:27,720 - 우리 모두를해야하기 때문에 전파 우리는 YouTube에 계정을 통해 얻을 때 + 여러분이 backpropagation을 할 때, 동일한 weight 행렬에 이러한 gradient 들을 계속 더한다는 것을 명심해야 합니다. 186 00:12:27,720 --> 00:12:30,750 - 동일한 가중치 행렬을 더하여 이들 구배는 사용 되었기 때문에 + 여러분이 backpropagation을 할 때, 동일한 weight 행렬에 이러한 gradient 들을 계속 더한다는 것을 명심해야 합니다. 187 00:12:30,750 --> 00:12:35,879 - 여러 시간 단계와에 이것은 당신이 가변적으로 알고 처리 할 수​​있게 해준다 것입니다 + 그리고 이것은 우리가 길이가 다양한 입력값들을 사용할 수 있게 해 줍니다. 188 00:12:35,879 --> 00:12:38,960 - 크기 입력 우리는 같은 일을 그렇게하지 ​​않는 일을하는지마다 때문에 + 그리고 이것은 우리가 길이가 다양한 입력값들을 사용할 수 있게 해 줍니다. 189 00:12:38,960 --> 00:12:48,540 - 사물의 절대 금액의 기능과 공통 무엇인지 질문 + 그러니까 정해진 길이의 입력값들을 사용하지 않아도 된다는 것이죠. 190 00:12:48,539 --> 00:12:52,579 - 제 80 초기화하는 일이 내가 생각하는 미국 상원 (20)이 아주 아주 + (질문): 처음 h_0를 어떻게 초기화하나요? 191 00:12:52,580 --> 00:13:00,650 - 처음에 공통되지만 순서에 따라 데이터를 수신처 않는다 - + (답변): 0으로 놓는 것이 가장 일반적입니다. + 192 00:13:00,649 --> 00:13:01,289 - 그 문제 + (질문): 입력값의 순서는 영향을 미치나요? 193 00:13:01,289 --> 00:13:11,299 - 예 있기 때문에 그래서 당신은 만약 그렇다면 다른 순서로 이러한 문자를 요구하고 - + (질문): 입력값의 순서는 영향을 미치나요? 194 00:13:11,299 --> 00:13:14,359 - 이이 경우에는이 경우의 순서 긴 시퀀스 않았는지 + (답변): 여기서는 중요하지 않습니다. hidden state는 지금까지 들어온 모든 값을 반영하거든요. 195 00:13:14,360 --> 00:13:17,870 - 당신에 대해 생각하면 항상 시간에하기 때문에 하나 하나 점을 중요하지 않습니다 + (답변): 여기서는 중요하지 않습니다. hidden state는 지금까지 들어온 모든 값을 반영하거든요. 196 00:13:17,870 --> 00:13:21,299 - 그것은 기능적 같이 이것은 팩터의 함수로서 시간이 단계에서의 + (답변): 여기서는 중요하지 않습니다. hidden state는 지금까지 들어온 모든 값을 반영하거든요. 197 00:13:21,299 --> 00:13:26,859 - 그것은 바로 그래서 장애는 그냥 오랫동안 중요한 전에 온 모든 + (답변): 여기서는 중요하지 않습니다. hidden state는 지금까지 들어온 모든 값을 반영하거든요. 198 00:13:26,860 --> 00:13:31,590 - 당신이 그것을 읽고있는 것처럼 우리는 몇 가지 구체적인 통해 체를 통해 갈거야 + 보다 구체적인 예들로 확실히 설명드리겠습니다. 199 00:13:31,590 --> 00:13:36,149 - 나는 이러한 점들을 명확히 생각 예는 특정보고하기 + 문자 단위의 언어 모델 코드는 매우 간단합니다. 200 00:13:36,149 --> 00:13:38,980 - 당신은 그것의 언어 모델의 특성을 시도 할 경우 실제로 예 + 여러분들이 나중에 찾아볼 수 있게 GitHub에 올려 놓았어요. 201 00:13:38,980 --> 00:13:43,350 - 이 곳에 아주 짧은 그래서 난 그냥 당신이 좋은 가정을 찾을 수를 썼다 + 이것은 NumPy 기반의 100줄 길이의 문자 단위 RNN 코드입니다. 202 00:13:43,350 --> 00:13:47,220 - 정확도 수준 NumPy와 백에 줄 응용 프로그램입니다 그리고 당신은 갈 수 있습니다 + 이것은 NumPy 기반의 100줄 길이의 문자 단위 RNN 코드입니다. 203 00:13:47,220 --> 00:13:49,840 - 당신이를 통해 실제 활성 단계 그래서 당신은 구체적으로 볼 수 있습니다 + 실제로 RNN이 어떻게 학습하는지를 알기 위해서 이 코드를 단계별로 살펴볼게요. 204 00:13:49,840 --> 00:13:53,220 - 우리는이 재발 신경 네트워크에 미치는 영향을 훈련 할 수 그래서 내가 갈거야 방법 + 실제로 RNN이 어떻게 학습하는지를 알기 위해서 이 코드를 단계별로 살펴볼게요. 205 00:13:53,220 --> 00:13:58,250 - 이 때문에 우리는 처음에 모든 블록을 통해 갈거야 단계별로 + 코드를 블록들로 나누어 하나하나 살펴보겠습니다. 206 00:13:58,250 --> 00:14:02,389 - 당신은 여기에만 의존은 우리가 일부 텍스트 데이터를로드하는 것을 볼 수 있습니다로 + 처음에는 보다시피 NumPy만 사용합니다. 207 00:14:02,389 --> 00:14:05,569 - 여기에 우리의 입력의 큰 순서 그냥 큰 모음입니다 + 여기에 우리가 입력받을 것은 문자들의 대용량 순서 .txt 데이터입니다. 208 00:14:05,570 --> 00:14:10,090 - 이 경우 문자 파일을 TXT하고 우리 모두가 얻을 텍스트 입력 + 여기에 우리가 입력받을 것은 문자들의 대용량 순서 .txt 데이터입니다. 209 00:14:10,090 --> 00:14:14,810 - 해당 파일의 문자 우리는 해당 파일의 모든 고유 문자를 찾을 수 + 이 파일의 모든 문자를 읽어들이고, mapping dictionary를 생성합니다. 
210 00:14:14,809 --> 00:14:18,179 - 계절의 특성에 매핑이 매핑 사전을 만들 + mapping dictionary는 문자에 index를 대응시키고, 또 반대로 index에 문자를 대응시킵니다. 211 00:14:18,179 --> 00:14:23,120 - 인덱스에서 두 문자 우리는 기본적으로 우리의 문자가 너무 빵을 보이는 주문 + 그러니까 문자를 순서대로 배열하는 것입니다. 212 00:14:23,120 --> 00:14:27,350 - 파일로의 전체 무리와 데이터의 전체 무리 우리 백 한 + 여기 보면 아주 긴 문자열이 들어 있는 큰 데이터를 읽어들이네요. 213 00:14:27,350 --> 00:14:30,860 - 문자 나처럼 뭔가 그래서 우리는 순서에 그들을 주문 + 우리는 이 데이터를 배열해서 각 문자에 index를 지정할 것입니다. 214 00:14:30,860 --> 00:14:36,300 - 여기에 모든 문자 남자에 연관 지수는 우리는 라이센스를 감소거야 + 그리고 여기에 보다시피 initialization(초깃값 설정)을 하게 됩니다. 215 00:14:36,299 --> 00:14:39,899 - 당신이 재발 성 신경로 볼 수 있습니다로 첫 번째 하이퍼 기본 크기 숨겨져 있습니다 + hidden size(hidden state의 크기)는 hyperparameter(바뀌지 않는 값) 입니다. 여기서는 100으로 설정했습니다. 216 00:14:39,899 --> 00:14:43,100 - 그렇게하면 네트워크 우리는 학습율이 여기 백으로 사용하지 않을 + hidden size(hidden state의 크기)는 hyperparameter(바뀌지 않는 값) 입니다. 여기서는 100으로 설정했습니다. 217 00:14:43,100 --> 00:14:46,720 - 스물 다섯이이 매개 변수가 최대 여기 시퀀스 길이입니다 + 여기 있는 건 learning rate 이고요. 218 00:14:46,720 --> 00:14:51,019 - 우리의 입력 데이터가 길 경우 당신은 문제가 무엇인지 알게 될 것입니다 수 있습니다 - + 25가 지정되어 있는 seq_length는 여러분이 RNN을 공부하다 보면 나오는 parameter 입니다. + 219 00:14:51,019 --> 00:14:53,899 - 시간의 수백만 같은 너무 큰 말은 UPS는 당신이 넣을 수있는 방법은 없습니다 + 많은 경우 우리의 입력 데이터는 너무 커서 RNN에 한꺼번에 넣을 수가 없습니다. 220 00:14:53,899 --> 00:14:56,870 - 다란과의 모든 위에 우리가 물건을 모두 유지할 필요가 있기 때문에 + 이것은 우리가 backpropagation을 하는 동안 메모리에 데이터를 저장해 두어야 하는데 여기에 한계가 있기 때문이죠 221 00:14:56,870 --> 00:15:00,070 - 당신이 실제로 전파를 다시 할 수 있도록 메모리 우리는 할 수 없습니다 + 이것은 우리가 backpropagation을 하는 동안 메모리에 데이터를 저장해 두어야 하는데 여기에 한계가 있기 때문이죠 222 00:15:00,070 --> 00:15:03,540 - 우리가 갈거야 그것 모두를 통해 그것의 모든 메모리와 다시 문질러 두 남자를 유지 + 이것은 우리가 backpropagation을 하는 동안 메모리에 데이터를 저장해 두어야 하는데 여기에 한계가 있기 때문이죠 223 00:15:03,539 --> 00:15:07,139 - 이 경우 우리의 입력 데이터를 통해 덩어리로 우리는 25의 덩어리에서 통해거야 + 그래서 우리는 입력 데이터를 몇 개의 데이터로 쪼개고, 여기서는 길이가 25인 데이터들로 쪼갰습니다. 224 00:15:07,139 --> 00:15:09,230 - 당신은 약간의 시간을 볼 수 있도록 + 그래서 우리는 입력 데이터를 몇 개의 데이터로 쪼개고, 여기서는 길이가 25인 데이터들로 쪼갰습니다. 225 00:15:09,230 --> 00:15:14,769 - 우리는이 전체 데이터 집합을 가지고 있지만 25 문자의 덩어리로 갈 것 + 그러니까 한 번에 처리할 문자의 개수가 25개인 것입니다. 226 00:15:14,769 --> 00:15:19,509 - 시간과 우리가 백업거야 때마다 시간에 25 자 통과 + 다시 설명하면, 한 번에 backpropagation 하는 문자의 개수가 25인 것이고, 227 00:15:19,509 --> 00:15:22,149 - 우리는 우리가 가지고 있기 때문에 이상에 대한 전파을 다시 할 여유가 없기 때문에 + 한 번에 모든 데이터를 기억해서 backpropagation 할 수 없기 때문에, 하나의 크기가 25개인 덩어리 데이터들로 나누어서 처리합니다. 228 00:15:22,149 --> 00:15:26,899 - 모든 물건을 기억하고 우리는 여기에 25 그리고 우리 덩어리거야 + 한 번에 모든 데이터를 기억해서 backpropagation 할 수 없기 때문에, 하나의 크기가 25개인 덩어리 데이터들로 나누어서 처리합니다. 229 00:15:26,899 --> 00:15:30,789 - 모두 여기에 내가 무작위로 분석하고있어 이러한 W 행렬과 일부 상자 WX 그래서이 + 여기 보이는 행렬들은 random 함수를 이용해서 초기값이 무작위적으로 입력됩니다. 230 00:15:30,789 --> 00:15:34,709 - HHH와 HY 그 우리의 매개 변수의 모든 과대 광고의 모든 우리는 거 야된다 + Wxh, Whh, Wxy은 모두 우리가 backpropagation을 통해 학습시킬 대상들입니다. 231 00:15:34,710 --> 00:15:36,790 - backrub를 양성하는 + Wxh, Whh, Wxy은 모두 우리가 backpropagation을 통해 학습시킬 대상들입니다. 232 00:15:36,789 --> 00:15:40,699 - 나는 여기에 손실 함수를 통해 건너 갈거야 내가 바닥에 도착하는거야 + loss function은 넘어가고 맨 밑 부분을 살펴보겠습니다. 233 00:15:40,700 --> 00:15:44,020 - 여기에 스크립트의 우리는 메인 루프를 가지고 있고이 중 일부를 통해 갈거야 + 이 부분은 Main loop입니다. 이 중에서 몇 부분을 한번 살펴보죠. 234 00:15:44,019 --> 00:15:48,399 - 여기에 다양한 몇 가지 초기화가 그래서 20 지금 보일 수 있습니다 + 이 부분에서 어떤 변수들에 0을 대입하는 초기화가 진행됩니다. 235 00:15:48,399 --> 00:15:50,829 - 다음 처음에 우리는 영원히 찾고 + 그리고 계속해서 loop을 돌리게 되죠. 236 00:15:50,830 --> 00:15:54,960 - 우리가 여기서하고있는 것은 그래서 여기에 데이터의 배치가 어디 실제로 샘플링 + 우리가 지금 보고 있는 것은 전체 데이터의 한 batch 입니다. 237 00:15:54,960 --> 00:15:58,970 - 그 목록에, 그래서이 데이터 세트에서 25 문자의 배치를 취할 + 전체 데이터 세트에서 크기 25의 문자 batch를 가지를 list input으로 넣어줍니다. 
238 00:15:58,970 --> 00:16:03,019 - 입력하고 목록 및 둔다는 기본적으로 단지가 25 정수는 대응 + 그리고 그 list input은 각 문자에 대응되는 25개의 숫자를 갖고 있습니다. 239 00:16:03,019 --> 00:16:06,919 - 당신이 볼로 문자 대상은 모두 같은 문자하지만 오프셋 + 타겟들은 여기 index에 1을 더한 값이 되는데요, 240 00:16:06,919 --> 00:16:09,909 - 하나 그 때문에 우리는 모든을 예측하려는 인덱스는 + 이것은 타겟들이 현재 순서가 아니라 바로 다음 순서에 나올 문자들이기 때문에 그렇습니다. 241 00:16:09,909 --> 00:16:15,269 - 한 시간에 물건을 너무 중요한 목표는 25 자에 불과 목록입니다 + 그러니까 list input에는 25개의 문자에 대응되는 25개의 숫자가 있고, 타겟 문자는 그 숫자들에서 1을 더한 index에 대응되는 문자들입니다. 242 00:16:15,269 --> 00:16:20,689 - 즉 우리가 기본적으로 샘플링 무엇 때문에 대상은 같은 미래에 의해 오프셋 (offset) + 그러니까 list input에는 25개의 문자에 대응되는 25개의 숫자가 있고, 타겟 문자는 그 숫자들에서 1을 더한 index에 대응되는 문자들입니다. 243 00:16:20,690 --> 00:16:26,480 - 여기에 데이터에서 우리이는 그래서 몇 가지 예제 코드는 시간의 모든 단일 지점 + 이것은 sampling 코드입니다. 244 00:16:26,480 --> 00:16:30,659 - 이번 주 훈련 물론 내가하려고하는 것은 그것이 무엇의 일부 샘플을 생성하는 + 매 시간 단계에서 RNN을 학습시키면서, 현재 RNN이 어떻게 사고하고 있는지에 알아보기 위한 sample을 출력합니다. 245 00:16:30,659 --> 00:16:35,370 - 현재 감사 문자는 이러한 순서는 다음과 같이 보일 것입니다 실제로 무엇을 + 매 시간 단계에서 RNN을 학습시키면서, 현재 RNN이 어떻게 사고하고 있는지에 알아보기 위한 sample을 출력합니다. 246 00:16:35,370 --> 00:16:40,320 - 우리는 문자 낮은 수준의 예술가와 테스트 시간을 사용하는 방법은 우리가 걸이다 + 우리가 문자 단위의 RNN을 사용할 때에는 247 00:16:40,320 --> 00:16:43,570 - 다음 몇 가지 문자와 함께이 항상 아니라는 것을 보게 될 것은 우리에게주는 + RNN이 매 시간 단계마다 바로 다음에 올 문자들의 순서를 출력합니다. 248 00:16:43,570 --> 00:16:46,379 - 당신이 샘플링을 상상할 수 있도록 시퀀스에서 다음 문자의 분포 + 그러니까 sampling 후 그것을 다시 입력값으로 주고, 다음 sample을 또다시 입력값으로 주는 방식으로 모든 sample을 입력한 다음, 249 00:16:46,379 --> 00:16:49,259 - 그것에서 다음 다음 문자의 위업은에서 샘플을 받고 + 그러니까 sampling 후 그것을 다시 입력값으로 주고, 다음 sample을 또다시 입력값으로 주는 방식으로 모든 sample을 입력한 다음, 250 00:16:49,259 --> 00:16:52,769 - 분포와에 모든 샘플을 공급 유지에 그 일을 계속 + 그러니까 sampling 후 그것을 다시 입력값으로 주고, 다음 sample을 또다시 입력값으로 주는 방식으로 모든 sample을 입력한 다음, 251 00:16:52,769 --> 00:16:56,549 - 철, 당신은이 코드가 무엇을 할 것입니다 그 임의의 텍스트 데이터를 생성 할 수 있습니다 + RNN에게 추상적인 문자열을 만들라고 지시할 수 있게 됩니다. 252 00:16:56,549 --> 00:17:00,549 - 그리고 우리가 여기에 약간의 그 거 야 그래서 샘플 기능을 발생 + 이게 이 코드의 기능이고, 이것은 조금 있다 살펴볼 sample function을 사용합니다. 253 00:17:00,549 --> 00:17:04,250 - I는 손실 함수는 입력 대상을받는 손실 함수를 호출있어 + 여기서는 loss function을 불러옵니다. 254 00:17:04,250 --> 00:17:09,160 - 그리고이 H 준비 H 압력이 자신의 상태 벡터에 대한 짧은 또한 수신 + loss function은 입력값, 타겟 문자, hprev 을 입력받습니다. 255 00:17:09,160 --> 00:17:13,900 - 이전 트렁크에서 우리는 (25)의 일괄거야 그리고 우리는 유지됩니다 + hprev는 h from previous chunk 을 뜻합니다. 256 00:17:13,900 --> 00:17:18,179 - 당신의 25 편지의 끝 부분에있는 최신 사진이 무엇인지를 추적하는 우리 + 우리가 크기가 25인 batch들을 사용하는데, 257 00:17:18,179 --> 00:17:22,400 - 우리가 다음에 다시 만날 때 우리는 그의 초기 시간으로 그에서 볼 수 있습니다 + hidden state에서는 바로 전 batch의 마지막 문자가 무엇인지에 대한 정보가 필요하고, 이 마지막 문자를 다음 batch의 첫 h 에 입력하게 됩니다. 258 00:17:22,400 --> 00:17:26,140 - 우리가 숨겨진 상태가 제대로 기본적으로있는 것을 확인하고, 그래서 시간 + 그러니까 h가 batch 에서 그 다음 batch 로 제대로 넘어가기 위해서 h prev을 사용하는 것입니다. 259 00:17:26,140 --> 00:17:30,700 - 그 통해 일괄 배치에서 전파 그러나 우리는 다시 전파하는 + 그리고 그 h prev는 backpropagation 할 때만 사용됩니다. 260 00:17:30,700 --> 00:17:35,558 - 그 25 시간 단계 그래서 우리는 손실 및 그라디언트의 기능에 적합하고 + 그 h prev을 loss fuction에 입력하면, loss. gradient, weight 행렬, 그리고 bias를 출력합니다. 261 00:17:35,558 --> 00:17:39,319 - 모든 무게 행렬과 모든 상자와 당신은 손실을 인쇄하고 + 그 h prev을 loss fuction에 입력하면, loss. gradient, weight 행렬, 그리고 bias를 출력합니다. 262 00:17:39,319 --> 00:17:44,149 - 그리고 여기에 우리가 여기에 우리가 우리에게 나이 인사를 듣는다 프라이머 업데이트 그리고 + 여기에서 loss를 print 하고, 여기에선 parameter들을 loss function이 하라는 대로 업데이트합니다. 263 00:17:44,150 --> 00:17:47,429 - 실제로 당신이 대학원에 업데이트로 인식해야 업데이트를 수행 + 실제로 업데이트가 되는 것은 여기 adagrad update 라고 적혀 있는 부분이네요. 264 00:17:47,429 --> 00:17:53,100 - 그래서 나는이 현금으로 모든 생각하는 모든 현금이 + 여기 gradient 계산을 위한 변수들을 제곱한 값들을 계속 더해 줍니다. 
265 00:17:53,099 --> 00:17:56,819 - 그라데이션에 대한 변수는 내가 축적 한 다음를 수행하고있어 어느 제곱 + 그리고 이것들로 adagrad를 업데이트 하죠. 266 00:17:56,819 --> 00:18:00,639 - 독재 날짜 누군가가 손실 함수로 이동하고 어떻게 그처럼 보인다 + 이제 loss funcion을 살펴보겠습니다. 267 00:18:00,640 --> 00:18:05,790 - 이제 손실 함수는 정말 앞으로 구성 코드 블록이며, + 이 블록이 loss fuction이고, foward와 backward 방법들로 이루어져 있습니다. 268 00:18:05,789 --> 00:18:08,990 - 우리가 앞으로 패스 다음의 뒷면을 비교하는, 그래서 뒤로 방법 + 처음에는 forward pass, 나중에는 초록색으로 적혀 있는 backward pass를 수행합니다. 269 00:18:08,990 --> 00:18:13,130 - 녹색 그래서이 두 단계를 통해 갈거야 통과하면 앞으로해야 당신에게 전달 + 처음에는 forward pass, 나중에는 초록색으로 적혀 있는 backward pass를 수행합니다. 270 00:18:13,130 --> 00:18:18,919 - 기본적으로 우리는 우리가 기다리고있어 그 적자 대상이 25를받을 얻을 인식 + forward pass에서는 input을 target을 향하게 만듭니다. 271 00:18:18,919 --> 00:18:23,360 - 인덱스 우리는 25 일에서 그들을 통해 거래하지 않는 우리는이 텍스트를 만들 + 여기서 25개의 index를 받지만, 반복문을 25번 실행하는 것이 아니라, 272 00:18:23,359 --> 00:18:27,500 - 그럼 그냥 제로이며, 입력 벡터 우리는 그래서 하나의 뜨거운 인코딩을 설정 + 여기 있는 성분이 모두 0인 input vector에 one-hot 인코딩을 하게 됩니다. 273 00:18:27,500 --> 00:18:32,169 - 어떤 인덱스 및 자극 우리는 하나 우리가에 공급하고 그것을 설정 + 그러니까 input에 대응되는 bit를 1로 지정하는 것이죠. 274 00:18:32,169 --> 00:18:34,110 - 그 하나의 뜨거운 인코딩 문자 + one hot encoding을 이용해서 input을 주고, 275 00:18:34,109 --> 00:18:39,229 - 여기에이 식 HSI T 그래서를 사용하여 재발 수식을 계산에 + 밑에 있는 recurrence 공식을 이용해서 계산합니다. 276 00:18:39,230 --> 00:18:42,210 - 자신의 연령이 모든 것을 다하고 하나 하나 추적하기 + hs[t]는 매 시간 단계의 모든 값들을 기록합니다. 277 00:18:42,210 --> 00:18:46,910 - 시간 물건 그래서 우리는 상태 벡터와를 사용하여 출력을 계산 + recurrence 공식과 이 두 줄의 코드를 통해 hidden state vector과 output vector 을 계산합니다. 278 00:18:46,910 --> 00:18:50,779 - 재발 수식이 두 줄 다음 저기 난을 계산 해요 + 여기서는 softmax function(역자주: cross entropy loss)을 이용해서 normalization을 구현합니다. 279 00:18:50,779 --> 00:18:54,440 - 그래서 용의자 그래서이 정상화 작동하는지 우리는 확률을 얻을 경우 + softmax function에서의 loss는 정답(역자주: 타겟 문자)이 나올 확률의 log를 취하고 거기에 -1을 곱한 값입니다. 280 00:18:54,440 --> 00:18:58,190 - 그 그냥 그래서 당신의 손실은 정답의 부정적인 잠금 확률 + softmax function에서의 loss는 정답(역자주: 타겟 문자)이 나올 확률의 log를 취하고 거기에 -1을 곱한 값입니다. 281 00:18:58,190 --> 00:19:02,779 - 부드러움의 분류는 그 목적 그래서 거기 잃고 우리는 거 야 + 지금까지 forward pass 를 살펴보았고, 이제 그래프를 통해 backpropagation을 살펴보겠습니다. 282 00:19:02,779 --> 00:19:06,899 - 우리가 뒤로 이동 뒤로 패스 그래서 다시 그래프를 통해 전파 + backward pass에서는, 25번째 문자에서 첫번째 문자까지 거슬러 올라갑니다. 283 00:19:06,900 --> 00:19:08,530 - (25)로부터의 순서를 통해 + backward pass에서는, 25번째 문자에서 첫번째 문자까지 거슬러 올라갑니다. 284 00:19:08,529 --> 00:19:12,899 - 당신은 내가 인식합니다 다시 하나 어쩌면 모든 방법은 얼마나 많은 세부 사항을 모르는 I + backward pass에서는, 25번째 문자에서 첫번째 문자까지 거슬러 올라갑니다. 285 00:19:12,900 --> 00:19:16,509 - 여기에 가고 싶어하지만 당신은 소프트 맥스를 통해 전파를 다시 인식합니다 + 여기서는 softmax, activation 등을 통한 backpropagation이 수행됩니다. 286 00:19:16,509 --> 00:19:19,089 - 내가 통해 전파하고 있지 않다 활성화 기능을 통해 전파 + 그리고 모든 gradient와 parameter들을 더해주죠. 287 00:19:19,089 --> 00:19:23,379 - 그것의 모든 난 그냥 모든 인사 및 모든 총리을 추가 해요 + 한 가지 짚고 넘어갈 점은, Whh를 비롯한 행렬에서의 gradient 계산에서 '+='을 사용하고 있다는 것입니다. 288 00:19:23,380 --> 00:19:27,210 - 특히 여기에서주의해야 할 한 가지는 이러한 재료와 무게를 만드는 것입니다 + 한 가지 짚고 넘어갈 점은, Whh를 비롯한 행렬에서의 gradient 계산에서 '+='을 사용하고 있다는 것입니다. 289 00:19:27,210 --> 00:19:31,210 - 내가 플러스를 사용하고 woahh 같은 행렬에 해당 그것은 매 시간 스텝 때문에 + 한 가지 짚고 넘어갈 점은, Whh를 비롯한 행렬에서의 gradient 계산에서 '+='을 사용하고 있다는 것입니다. 290 00:19:31,210 --> 00:19:34,590 - 이 무게의 모든 그라데이션을 받고 행렬 우리는 축적해야 + 우리는 매 시간 단계마다 weight 행렬들이 gradient를 받고, 이 값들을 모두 더해 주어야 하기 때문에, 이 행렬을 계속 쓰게 됩니다. 291 00:19:34,589 --> 00:19:37,449 - 우리는 이러한 모든 체중 행렬을 계속 사용하기 때문에 모든 체중 행렬에 적합 + 우리는 매 시간 단계마다 weight 행렬들이 gradient를 받고, 이 값들을 모두 더해 주어야 하기 때문에, 이 행렬을 계속 쓰게 됩니다. 292 00:19:37,450 --> 00:19:43,980 - 시간이 지남에 그들로 때마다 단계에서 동일한 그래서 우리 그냥 배경에서와 + 그리고 계속해서 backpropagation을 하게 되죠. 
293 00:19:43,980 --> 00:19:48,130 - 그것은 우리에게 생기를 제공하고 우리는에서 그 손실 기능을 사용할 수 있습니다 + 여기에서 나온 gradient는 loss function에 사용되고, 결국 parameter를 업데이트하게 됩니다. 294 00:19:48,130 --> 00:19:52,580 - 기본 및 여기에 우리는 마침내 그래서 여기에 샘플링 기능은 어디 한 우리 + 마지막으로 sampling function입니다. 295 00:19:52,579 --> 00:19:55,960 - 실제로 그 내용에 기초하여 새로운 텍스트 데이터를 생성하는 아티스트 가려고 + 여기서 RNN을 지금까지 학습한 training 데이터를 바탕으로 실제로 새로운 문자열 데이터를 출력하게 됩니다. 296 00:19:55,960 --> 00:19:59,058 - 캐릭터와 방법의 통계에 변호사를 보았고를 기반으로하고있다 + 여기서 RNN을 지금까지 학습한 training 데이터를 바탕으로 실제로 새로운 문자열 데이터를 출력하게 됩니다. 297 00:19:59,058 --> 00:20:02,048 - 우리는 약간의 비와 함께 초기화 그래서 그들은 훈련 데이터에서 서로를 따라 + 여기서 문자열을 초기화해주었고, 298 00:20:02,048 --> 00:20:06,759 - 문자, 그리고, 우리는 우리가 피곤 때까지 가서 우리가 재발을 계산 + 피곤해질 때까지 (역자주: 미리 설정한 recurrence가 끝날 때까지) 다음 작업들을 반복합니다. 299 00:20:06,759 --> 00:20:09,289 - 식 문제로부터 배포 샘플이 + recurrence 공식 실행, 각 문자에 대한 확률분포 계산, 샘플링, one-hot 인코딩, 그리고 그 결과물을 다음 시간 단계로 재입력 300 00:20:09,289 --> 00:20:10,450 - 분포 + recurrence 공식 실행, 각 문자에 대한 확률분포 계산, 샘플링, one-hot 인코딩, 그리고 그 결과물을 다음 시간 단계로 재입력 301 00:20:10,450 --> 00:20:15,640 - 핫 케이트 (11) 핫 표현으로 인코딩 한 후 우리는을 받​​았는데 + recurrence 공식 실행, 각 문자에 대한 확률분포 계산, 샘플링, one-hot 인코딩, 그리고 그 결과물을 다음 시간 단계로 재입력 + 302 00:20:15,640 --> 00:20:22,460 - 우리가 실제로 200 텍스트를 얻을 때까지 우리가이 일을 계속 그래서 다음에 시간이 너무 거기 어떤 + 이 작업들을 충분히 많은 문자열을 출력할 때까지 계속 수행합니다. 303 00:20:22,460 --> 00:20:27,190 - 그냥이 작동하는 방법의 거친 레이아웃 등을 통해 질문 + (질문: 안들림 => 답변) 우리는 매 batch 마다 25개의 softmax classifier를 갖고 있습니다. 304 00:20:27,190 --> 00:21:04,680 - 다시 $ (25) 남부 최대의 모든 배치에서 분류 우리 같은에서 사람들의 모든 + (답변) 그 classifier 들은 한번에 backpropagation을 진행하고, 반대방향으로 모든 결과물들을 더해주죠. 305 00:21:04,680 --> 00:21:14,910 - 시간과 모든 우리가 사용하는 왜 거꾸로 그건가는 연결에 추가 + 그게 우리가 이걸 쓰는 이유죠. 다음 질문? 306 00:21:14,910 --> 00:21:19,259 - 여기 정규화 당신은 내가 그것을 생략 추측 아마하지 않는 것을 확인할 수 있습니다 + (질문) 여기서 regularization을 쓰나요? 307 00:21:19,259 --> 00:21:23,720 - 때때로 나는 정규화를 시도 여기에 있지만 일반적으로 내가 생각할 수있는 난 몰라 + (답변) 여기서는 빠져 있습니다. 일반적으로 RNN에서는 다른 알고리즘만큼 regularization이 흔하게 적용되지는 않습니다. 308 00:21:23,720 --> 00:21:27,269 - 때로는 그것을 외부로 반복 너트를 사용하는 것이 일반적이다 생각 + (답변) 여기서는 빠져 있습니다. 일반적으로 RNN에서는 다른 알고리즘만큼 regularization이 흔하게 적용되지는 않습니다. 309 00:21:27,269 --> 00:21:38,379 - 최악의 결과처럼 내게 준 것은 그래서 가끔 그것을 싸움 발기인의 그것의 종류를 건너 + (답변) 가끔 아주 좋지 않은 결과를 낳기도 해서, 저는 그냥 사용하지 않을 때도 있습니다. 일종의 hyperparameter이죠. 다음 질문? (질문 안들림) 310 00:21:38,380 --> 00:21:48,260 - 그래 그건 그래 그건 우리가 바로 여기에 25 샷의 순서 그래서 바로 + (답변) 여기서의 문자들은 아주 기초적인 수준입니다. 그래서 실제로 이런 문자가 존재하는지 별로 신경쓰지는 않아요. 311 00:21:48,259 --> 00:21:51,839 - 매우 낮은 캐릭터 레벨에 대한 수준과 우리가 실제로 단어에 대해 걱정하지 않는다 우리는하지 않습니다 + (답변) 여기서의 문자들은 아주 기초적인 수준입니다. 그래서 실제로 이런 문자가 존재하는지 별로 신경쓰지는 않아요. 312 00:21:51,839 --> 00:21:56,289 - 그 단어는 사실은 그렇지 않습니다에서처럼 문자 인덱스가 너무 arnelle를 그리워 존재 알고 + 문자들의 index와 그것들의 순서 정도만을 고려할 뿐이죠. 313 00:21:56,289 --> 00:21:58,569 - 같은 언어 또는 아무것도 그냥 그렇게 문자에 대해 아는 + 다음 질문? 314 00:21:58,569 --> 00:22:08,009 - 시리즈와 시퀀스 부록에 그 우리가 사용하는 조각을 모델링하고 무엇 + (질문) space 대신 일정한 segment size(25)를 이용하는 이유가 있나요? 315 00:22:08,009 --> 00:22:13,460 - 대신 그 같은 문자로 공간 또는 뭔가를 사용할 수 있습니다 + (질문) space 대신 일정한 segment size(25)를 이용하는 이유가 있나요? 316 00:22:13,460 --> 00:22:18,630 - 25 일정 배치는 그가 아마 할 수 생각하지만 다음 종류의 단지, 당신은 + (답변) 크기가 25인 batch 말고 space로 구분하는 것 역시 가능할 것 같습니다. 하지만 거기에는 언어에 대한특별한 가정이 필요해서 권장되지 않아요. 317 00:22:18,630 --> 00:22:22,530 - 당신이 그렇게 할 이유 언어에 대한 가정이 곧 볼 수 있도록합니다 + 자세한 이유는 좀 있다가 살펴보도록 하겠습니다. 318 00:22:22,529 --> 00:22:25,359 - 이에 아무것도 연결할 수 있습니다 그리고 우리는 우리가 많이 가질 수 있음을 볼 수 있기 때문에 + 이 코드에는 어떤 문자열도 입력할 수 있어요. 이걸 갖고 여러 가지를 해 볼게요. 
319
00:22:25,359 --> 00:22:31,539
- 그 확인과 재미 이제 우리는 우리가 텍스트의 전체 무리를 우리가하지 걸릴 수 있습니다 할 수있는
+ 여기 우리가 출처를 모르는 어떤 문자열이 있습니다.

320
00:22:31,539 --> 00:22:34,889
- 이 문자의 순서 어디에서 왔는지 신경 그리고 우리는 아르 논에 공급
+ 그리고 이 문자열을 RNN에 학습시키고, RNN이 문자열을 만들어내게 할 거에요.

321
00:22:34,890 --> 00:22:40,670
- 우리는 철을 훈련 할 수 있으며 같은 텍스트를 작성하고 그래서 예를 들어, 당신은 할 수 있습니다
+ 예를 들어, 셰익스피어의 모든 작품을 입력할 수 있습니다.

322
00:22:40,670 --> 00:22:44,789
- 당신이 그것을 모두 잡을 수 있습니다 윌리엄 셰익스피어의 작품을 모두 가지고 그냥 거대한입니다
+ 크기가 좀 크긴 하지만, 이건 단지 문자열일 뿐이에요.

323
00:22:44,789 --> 00:22:48,289
- 문자의 순서 당신은 재발 성 신경 네트워크에 넣고 시도
+ 크기가 좀 크긴 하지만, 이건 단지 문자열일 뿐이에요.

324
00:22:48,289 --> 00:22:51,909
- 윌리엄 셰익스피어의 지지자에 대한 시퀀스에서 다음 문자를 예측하고
+ RNN에 셰익스피어의 작품을 학습시키고, 셰익스피어의 시에서의 다음 문자를 예측하게끔 할 수 있습니다.

325
00:22:51,910 --> 00:22:54,650
- 그래서 당신은 처음에 재발 신경망을 물론 그 작업을 수행 할 때
+ 처음에는 학습이 되어 있지 않기 때문에, 결과물들은 매우 무작위적인 문자열입니다.

326
00:22:54,650 --> 00:22:59,030
- 그래서 그냥 바로 종료 그래서에서 왜곡을 생산 무작위 임의의 매개 변수가
+ 처음에는 학습이 되어 있지 않기 때문에, 결과물들은 매우 무작위적인 문자열입니다.

327
00:22:59,029 --> 00:23:03,200
- 그냥 임의의 문자를이다 그러나 당신이 훈련 할 때 다음 아르 논가에 시작됩니다
+ 하지만 학습을 통해 RNN은 이 문자열 안에는 단어들이 있고, 단어들 사이에 space가 있고, 따옴표의 사용법을 이해하게 되죠.

328
00:23:03,200 --> 00:23:06,930
- 그 확인을 이해 거기의 말을 시작 공백이 같은 일이 실제로있다
+ 하지만 학습을 통해 RNN은 이 문자열 안에는 단어들이 있고, 단어들 사이에 space가 있고, 따옴표의 사용법을 이해하게 되죠.

329
00:23:06,930 --> 00:23:11,490
- 따옴표와 함께 실험은 그것은 기본적으로 매우 짧은 일부 내용
+ 하지만 학습을 통해 RNN은 이 문자열 안에는 단어들이 있고, 단어들 사이에 space가 있고, 쌍따옴표(")의 사용법을 이해하게 되죠.

330
00:23:11,490 --> 00:23:16,420
- 당신이 더 많은 질병 훈련으로 여기거나 등등 다음과 같은 단어
+ 그리고 'here', 'on', 'and so on' 과 같은 기본적인 표현들을 알게 됩니다.

331
00:23:16,420 --> 00:23:18,820
- 되고 점점 더 세련되고 재발 신경 네트워크를 배운다
+ 그리고 RNN을 계속 학습시킬수록, 이러한 표현들이 점점 정제되는 것을 확인할 수 있습니다.

332
00:23:18,819 --> 00:23:22,609
- 당신이 견적을 열 때 나중에 닫거나해야하는 그 문장
+ 예를 들어 "를 한번 사용하면 "를 한번 더 사용해서 인용구를 닫아 주는 것들을 익히는 거죠.

333
00:23:22,609 --> 00:23:26,379
- 점과 한잔과 함께 그냥 바위에서 통계적으로 모든 물건을 배운다
+ 또 문장이 마침표로 끝나는 것 역시 따로 가르치지 않고도 패턴만으로 익히게 됩니다.

334
00:23:26,380 --> 00:23:29,630
- 실제로 코치 아무것도 머리를하지 않고 당신이 할 수있는 말의 패턴
+ 또 문장이 마침표로 끝나는 것 역시 따로 가르치지 않고도 통계적 패턴만으로 익히게 됩니다.

335
00:23:29,630 --> 00:23:30,580
- 샘플 전체
+ 그리고 마침내 '셰익스피어 문학' 자체를 생성할 수 있게 되죠.

336
00:23:30,579 --> 00:23:34,349
- 캐릭터 레벨이 기반으로 셰익스피어 그래서 그냥에 대한 아이디어를 줄 것
+ 여기 RNN이 만들어낸 작품을 읽어볼게요.

337
00:23:34,349 --> 00:23:38,740
- 물건의 종류 나는 그가 접근 될 것이다 생각을 많이하고 갱을 제공
+ (읽는 중) "Alas, I think he shall come approached and the day..."

338
00:23:38,740 --> 00:23:42,900
- 존재에 달성 될 것이다 변형됩니다 공급하지 않고 자신 만 체인 결코
+ (읽는 중) "Alas, I think he shall come approached and the day..."

339
00:23:42,900 --> 00:23:45,460
- 내가 잠을 안 그의 죽음의 주제
+ (읽는 중) "Alas, I think he shall come approached and the day..."

340
00:23:45,460 --> 00:23:56,909
- 즉, 당신이있어 이와 관련하여 네트워크의 나갈 것입니다 물건의 종류
+ (질문) 하지만 이것들은 25개가 넘는 문자로 이루어진 문장은 기억할 수가 없기 때문에 제대로 생성할 수 없죠?

341
00:23:56,909 --> 00:24:02,679
- 내가 좋아 그래서 비트에 다시 연락하고 싶은 아주 미묘한 점을 의미
+ (답변) 네 맞습니다. 그거 사실 되게 알아차리기 힘든 부분이라 제가 나중에 말하려고 했었어요.

342
00:24:02,679 --> 00:24:05,980
- 우리는 셰익스피어에서이 작업을 실행할 수 있지만 우리는 그래서 기본적으로 아무것도 태양을 실행할 수 있습니다
+ 우리는 셰익스피어 작품이 아니라 다른 것들에도 이것을 활용할 수 있습니다.

343
00:24:05,980 --> 00:24:08,960
- 우리는 내가 대략 년 전 등처럼 생각 저스틴과 함께이 함께 연주
+ 이것들은 제가 Justin과 작년에 만들어본 것들입니다.

344
00:24:08,960 --> 00:24:12,990
- 저스틴 턱 그는 대수 기하학에서이 책을 발견하고이 단지입니다
+ Justin은 한 대수기하학 책의 LaTeX 소스를 RNN에 학습시켰습니다.

345
00:24:12,990 --> 00:24:18,069
- 대형 라텍스 소스 파일 우리는이 형상에 대해 그 라텍스 소스 파일을 가져다
+ Justin은 한 대수기하학 책의 LaTeX 소스를 RNN에 학습시켰습니다.

346
00:24:18,069 --> 00:24:23,398
- 예술을 재정 작가는 기본적으로 수학을 그렇게 생성을 배울 수
+ 그리고 RNN은 수학책을 집필했죠.
347
00:24:23,398 --> 00:24:27,199
- 이 아침에 제출 된 샘플 그냥 다음 늦은 체크 아웃 뱉어이며 우리
+ 물론 RNN이 출력한 LaTeX 소스가 바로 컴파일되지는 않아서 저희가 약간 손봐주긴 했지만,

348
00:24:27,200 --> 00:24:30,009
- 파일럿 물론 바로 작동하지 않습니다으로 우리는 조정 그것은 작은 비트를 가지고 있습니다
+ 물론 RNN이 출력한 LaTeX 소스가 바로 컴파일되지는 않아서 저희가 약간 손봐주긴 했지만,

349
00:24:30,009 --> 00:24:33,890
- 하지만 기본적으로 아르 논은 우리가 당신을 만든 실수의 일부를 불통 후
+ 어쨌든 한두 번 손보고 나니 보시는 바와 같이 수학책이 되었어요.

350
00:24:33,890 --> 00:24:37,200
- 컴파일 할 수 있습니다 당신은 당신이 그것을 볼로 수학을 생성 얻을 수 있습니다
+ 어쨌든 한두 번 손보고 나니 보시는 바와 같이 수학책이 되었어요.

351
00:24:37,200 --> 00:24:42,460
- 그것은 기본적으로는 그녀의 바보 같은 작은 사각형을두고 이러한 모든 증거를 생성
+ 살펴보면, RNN은 proof(증명)를 쓰는 방법을 배웠네요. 수학적 증명의 끝에는 저렇게 사각형을 쓰죠.

352
00:24:42,460 --> 00:24:47,090
- 군대의 끝은 그렇게에 우리를 보자 생성
+ lemma(소정리)를 비롯한 다른 것들도 만들어 냈고요.

353
00:24:47,089 --> 00:24:52,428
- 때때로 우리는 성공의 다양한 양으로 다이어그램을 만들 예정
+ 그림을 그리는 방법도 배웠네요.

354
00:24:52,429 --> 00:24:56,720
- 그리고 이것에 대해 최선 나의 마음에 드는 부분은 상단에 여기 증거가 남아 있다는 것입니다
+ 제가 가장 좋아하는 부분은 여기 왼쪽 상단에 있는 "Proof. Omitted" 부분입니다.

355
00:24:56,720 --> 00:24:59,650
- 방출된다
+ RNN도 귀찮았나 봐요 (웃음)

356
00:24:59,650 --> 00:25:05,780
- Sarno는 게으른하지만 그렇지 않으면이 물건은 확실히 구별 I입니다
+ RNN도 귀찮았나 봐요 (웃음)

357
00:25:05,779 --> 00:25:12,480
- 실제 형상에서에서 말을 그래서 X의 X 10 방식을하자 확인 나는 확실하지 않다
+ 전반적으로 보면 RNN은 꽤 대수기하학책 같이 보이는 걸 만들어 냈어요.

358
00:25:12,480 --> 00:25:16,160
- 그 부분에 대한하지만 그렇지 않으면이의 게슈탈트 매우 좋아 보인다
+ 뭐 세부적인 부분은 제가 대수기하를 잘 몰라서 말하기 그렇지만, 전반적으로 괜찮아요.

359
00:25:16,160 --> 00:25:19,529
- 그것이 내가 가장 어려운 임의의 일을 찾기 위해 노력 임의의 물건이 I
+ 저는 이어서 문자 단위 RNN으로 표현할 수 있는 가장 어렵고 추상적인 것들이 무엇이 있을까 생각했고,

360
00:25:19,529 --> 00:25:22,769
- 캐릭터 레벨을 던질 수 있었다 나는 소스 코드를 실제로 결정
+ 소스 코드에 생각이 미쳤습니다.

361
00:25:22,769 --> 00:25:27,879
- 매우 어려운 그래서 C 코드와 같은 단지 이전 인 리눅스 소스의 모든했다
+ 그래서 리누스 토발즈의 GitHub에 들어가 리눅스의 모든 C 코드를 가져왔습니다.

362
00:25:27,880 --> 00:25:30,850
- 당신은 그것을 복사 할 수 있습니다 당신은 내가 몇 백 메가 바이트 생각으로 끝낼 단지
+ 이 C 코드는 자그마치 700MB나 됩니다.

363
00:25:30,849 --> 00:25:35,079
- 코드와 헤더 파일을 참조하고 단지 아르 논에 던져 그리고, 그것은 할 수
+ 이 코드를 RNN에게 학습시켰고, RNN은 코드를 생성해 냈습니다.

364
00:25:35,079 --> 00:25:39,849
- 아르 논 당신의 코드 등이 생성 된 코드를 생성하는 법을 배워야
+ 이게 바로 RNN이 생성해낸 코드입니다.

365
00:25:39,849 --> 00:25:42,949
- 그것을 볼 수 있습니다 기본적으로는 입력에 대해 알고 함수 선언을 만듭니다
+ 살펴보면 함수를 생성했고, 변수를 지정하고, 문법적 오류가 거의 없습니다.

366
00:25:42,950 --> 00:25:47,460
- 구문 그것은 일종의 변수에 대해 알고 거의 실수를하는 방법을
+ 변수를 어떻게 사용하는지도 아는 것 같고,

367
00:25:47,460 --> 00:25:53,230
- 그들이 때때로 그것을 코딩 할 계획 사용은 자신의 가짜 코멘트를 작성
+ indentation (들여쓰기)도 적절히 했고, 주석도 달았습니다.

368
00:25:53,230 --> 00:25:58,089
- 구문은 브라켓을 열고 닫습니다하지 않을 것을 발견하는 것은 매우 드문 일이다
+ 괄호를 열고 닫지 않는 등의 실수를 찾아보기가 매우 힘들었습니다.

369
00:25:58,089 --> 00:26:01,808
- dornin 그래서 몇 가지를 배우고하는 등이에 실제로 상대적으로 쉽다
+ 이런 것들은 RNN이 배우기 가장 쉬운 것들 중 하나거든요.

370
00:26:01,808 --> 00:26:04,058
- 실제로 만드는 실수는 그 예를 들어 그
+ RNN의 실수들 중에는 쓰이지 않을 변수를 선언하거나, 선언하지도 않은 변수를 불러오기를 시도하는 것들이 있었습니다.

371
00:26:04,058 --> 00:26:07,240
- 그것을 사용하여 결코 끝나지 않아 몇 가지 변수를 선언하거나 동일한 변수를 할
+ RNN의 실수들 중에는 쓰이지 않을 변수를 선언하거나, 선언하지도 않은 변수를 불러오기를 시도하는 것들이 있었습니다.

372
00:26:07,240 --> 00:26:09,929
- 이 선언되지 않습니다 그래서 이러한 높은 수준의 물건 중 일부는 아직 행방 불명된다
+ 그러니까 아직 매우 높은 단계의 코딩 수준에는 도달하지 못한 거죠.

373
00:26:09,929 --> 00:26:12,509
- 하지만 그렇지 않으면 잘 할 수있는
+ 하지만 그런 것들을 제외하고 보면 꽤 코딩을 잘 했습니다.

374
00:26:12,509 --> 00:26:17,460
- 그것은 또한 더 적대적는 지프 새로운 GOP에게 문자로 허가 된 문자를 암송하지
+ 새로운 GPL 라이센스에 관한 주석을 다는 방법도 배웠네요.

375
00:26:17,460 --> 00:26:22,009
- 그는 데이터에서 배운하고는 GPL 라이센스 후이 알고있다
+ 새로운 GPL 라이센스에 관한 주석을 다는 방법도 배웠네요.

376
00:26:22,009 --> 00:26:25,779
- 일부는이 파일의 일부 매크로를 포함하고는, 그래서 다음 몇 가지 코드가있다
+ GPL 라이센스 다음에는 #include, 매크로 코드 등이 오는 것도 배웠고요.
377 00:26:25,779 --> 00:26:33,879 - 기본적으로 그냥 쇼에 교대로 매우 작은 것이 무엇인지를 배웠다 + (질문) 이건 (아까 보여준) min char-rnn 으로 만들어낸 건가요? 378 00:26:33,880 --> 00:26:37,169 - 그냥 장난감 일이 일어나고 다음 문자 거기하고 있는지를 보여 + (답변) min char-rnn은 그냥 작동 원리를 알려주기 위해 만들어낸 장난감 같은 거고, 379 00:26:37,169 --> 00:26:41,230 - 그냥 충전 된 구현 및 토치의 많은 종류는이다 + (답변) 실제로는 min char-rnn의 확장판인 torch 기반 char-rnn을 으로 구현했고, GPU를 이용해서 처리했습니다. 380 00:26:41,230 --> 00:26:45,009 - 과 및 실행과 GPU를 확장 그래서 당신은 자신을 재생할 수 등 + (답변) 실제로는 min char-rnn의 확장판인 torch 기반 char-rnn을 으로 구현했고, GPU를 이용해서 처리했습니다. 381 00:26:45,009 --> 00:26:49,269 - 이 특히 그것이 세 계층 앨리스의 다음 후자에 의해이 가고 있었다 + 이 부분은 수업 마지막 부분에 다룰 것인데, 3-layer LSTM 이라는 것입니다. 382 00:26:49,269 --> 00:26:52,289 - 팀 그리고 우리는 그게 전화의 더 복잡한 종류의 의미를 볼 수 있습니다 + 이건 RNN의 복잡한 버전이라고 생각하면 됩니다. 383 00:26:52,289 --> 00:26:58,839 - 난 그냥이 어떻게 작동하는지에 대한 아이디어를 제공 네트워크는 그래​​서 종이가 있음을 우리 + 좀 더 이해가 쉽도록 예를 들어 볼게요. 384 00:26:58,839 --> 00:27:02,089 - 많은 연주 그러나 이것은 단지 작년 우리는 기본적으로 노력하고 + 이건 작년에 저희가 이런 것들을 가지고 만들어본 것들입니다. 385 00:27:02,089 --> 00:27:08,949 - 우리는 신경 과학자있어 척 그리고 우리는 몇 가지 테스트 텍스트에 미용실을 던졌다 + 저희는 문자 단위 RNN에 신경과학적으로 접근을 해 보았습니다. 386 00:27:08,950 --> 00:27:13,110 - 그래서 아덴의 코드 스 니펫에서이 텍스트를 읽고 우리가보고있는 + hidden state 내부 특정 cell의 excitement(흥분) 여부에 따라 색을 칠해 봤습니다. 387 00:27:13,109 --> 00:27:17,119 - 특정 셀의 여부에 기초하여 상기 텍스트 착색 당해 그의 상태 + hidden state 내부 특정 cell의 excitement(흥분) 여부에 따라 색을 칠해 봤습니다. 388 00:27:17,119 --> 00:27:18,699 - 하지 그 흥분 판매 여부 + hidden state 내부 특정 cell의 excitement(흥분) 여부에 따라 색을 칠해 봤습니다. 389 00:27:18,700 --> 00:27:23,470 - 확인 그래서 당신은 국가의 많은 볼 수 있습니다 + 보시다시피, hidden state의 뉴런들의 상태를 해석하는 일이 쉽지가 않습니다. 390 00:27:23,470 --> 00:27:27,110 - 뉴런은 이상한 방법으로 아무것도의 종류에 화재의 종류에 해석되지 않습니다 + 보시다시피, hidden state의 뉴런들의 상태를 해석하는 일이 쉽지가 않습니다. 391 00:27:27,109 --> 00:27:29,829 - 그들이해야하기 때문에 그들 중 일부는 매우 낮은 수준의 문자를해야 + 왜냐하면 어떤 뉴런들은 매우 낮은 단계에서의 작업을 맡거든요. 392 00:27:29,829 --> 00:27:33,859 - 그녀는하자 모두 같은 나이와 물건 후에 오는가 얼마나 자주 같은 수준의 물건 + 예를 들면, 'h 다음에 e가 얼마나 자주 오는가' 가 있네요. 393 00:27:33,859 --> 00:27:37,928 - 우리는 빠른처럼 자신을 찾을 예를 들면 있도록 세포는 아주 해석입니다 + 하지만 어떤 cell 들은 해석하기가 꽤 용이했습니다. 394 00:27:37,929 --> 00:27:41,830 - 검출 있도록이 셀은 그냥 인용 한 때 온 후는 유지 + 여기 보시는 것은 인용구 검출 cell 입니다. 395 00:27:41,829 --> 00:27:46,460 - 인용 옷장까지에 등이 매우 안정적이 추적을 유지하고 + 이 cell은 처음 따옴표가 나오면 켜지고, 따옴표가 다시 나타나면 꺼집니다. 396 00:27:46,460 --> 00:27:50,610 - 그냥 역 전파에서이 크기의 섬을 나오는 그 + 이건 그냥 backpropagation의 결과로 나온 것입니다. 397 00:27:50,609 --> 00:27:54,329 - 캐릭터 레벨 통계 물론 내외 다르며이다 + RNN은 문자열의 길이가 따옴표들의 사이에 있을때와 따옴표 바깥에 있을 때에 다르다는 것을 파악했습니다. 398 00:27:54,329 --> 00:27:57,639 - 유용한 기능은 학습하고 그래서 그것의 머리 상태의 일부를 바칩니다 + 그래서 hidden state의 특정 부분들을 현재 문자들이 인용구 안에 있는지 파악하게 했습니다. 399 00:27:57,640 --> 00:28:00,650 - 당신이 따옴표 안에있어 여부를 추적하고이로 돌아갑니다 + 그래서 hidden state의 특정 부분들을 현재 문자들이 인용구 안에 있는지 파악하게 했습니다. 400 00:28:00,650 --> 00:28:05,159 - 나는이 RNN가 I에 훈련 것을 여기에서 지적하고 싶은 질문 + 이것이 아까 (질문했던 사람)의 질문에 답을 해줄 것 같은데요, 401 00:28:05,159 --> 00:28:06,500 - 시퀀스 길이를 생각한다 + 이 RNN의 seq_length는 100 이었습니다.(역자주: batch 크기가 100) 402 00:28:06,500 --> 00:28:10,269 - 백하지만 당신은이 인용문의 길이가 실제로보다 훨씬 더 측정 할 경우 + 하지만 실제로 이 인용구들의 크기를 재어 보면 100보다 훨씬 길다는 것을 알 수 있습니다. 403 00:28:10,269 --> 00:28:16,220 - 내가 생각 백 (250)처럼 우리는 다시 최대 전파에 그래서 우리는 일 + 제가 보기에 대략 250정도 인 것 같네요. 404 00:28:16,220 --> 00:28:20,190 - 백은 그래서는 셀 수 실제로 로렌 같은 유일한 장소 + 그러니까 우리는 한 번에 크기가 100인 backpropagation만을 진행했고, RNN에게는 그때만이 유일한 학습 기회입니다. 405 00:28:20,190 --> 00:28:23,460 - 자체는 더 이상이 부록을 발견 할 수 없습니다 때문에 - + 그러니까 문자열 크기가 100이 넘어가면 그 앞뒤의 dependencies(종속성, 관계) 에 대해서는 직접적으로 학습하지를 않습니다. 
406 00:28:23,460 --> 00:28:27,809 - 그러나보다 기본적으로 내가이이 훈련을 수 있다는 것을 보여 것 같아요 + 그러니까 문자열 크기가 100이 넘어가면 그 앞뒤의 dependencies(종속성, 관계) 에 대해서는 직접적으로 학습하지를 않습니다. 407 00:28:27,809 --> 00:28:31,159 - 캐릭터 레벨 검출 백보다 작은 시퀀스에 유용한로 판매 + 하지만 이 결과는 실제 문자열의 길이보다 작은 크기의 batch 들로 학습한다고 해도, batch 크기보다 긴 문자열에 대해서도 잘 작동할 수 있다는 것을 보여주네요. 408 00:28:31,160 --> 00:28:36,580 - 다음은이 때문에이 셀 수 있도록 긴 시퀀스에 제대로 일반화 + 하지만 이 결과는 실제 문자열의 길이보다 작은 크기의 batch 들로 학습한다고 해도, batch 크기보다 긴 문자열에 대해서도 잘 작동할 수 있다는 것을 보여주네요. 409 00:28:36,579 --> 00:28:39,859 - 그것은 단지에도 교육을받은 경우 더 이상 백 단계를 작동하는 것 같다 + 그러니까 batch 크기는 100이었지만, 410 00:28:39,859 --> 00:28:44,759 - 그것보다 수백이의 종속성을 발견 할 만 할 수 있다면 + 크기가 수백이 넘는 문자열의 dependecies 도 잘 잡아낸 것이죠. 411 00:28:44,759 --> 00:28:48,890 - 이 여기에 다른 데이터 세트 내가 레오 톨스토이의 전쟁과 평화는이에 생각이다 + 이것은 톨스토이의 <전쟁과 평화> 데이터 입니다. 412 00:28:48,890 --> 00:28:52,460 - 이 데이터 세트는 대략 80 매에서 새 줄 문자있다 + 이 데이터 세트는 대략 80문자마다 한 번 줄이 바뀝니다. 413 00:28:52,460 --> 00:28:57,819 - 80 자 문자 대략 새로운 라인이있다 그리고 거기에있다 + 이 데이터 세트는 대략 80문자마다 한 번 줄이 바뀝니다. 414 00:28:57,819 --> 00:29:02,470 - 그 다음 하나 같은에서 시작은 우리가 찾을 수 있도록 라인 링크 추적 + 그리고 우리는 줄 길이 tracking cell을 찾아냈습니다. 415 00:29:02,470 --> 00:29:06,539 - 천천히 시간이 지남에 따라 구분하고이 같은 세포가 있음을 상상 + 이 cell은 줄이 처음 시작하면 1로 시작해서, 문자열이 진행될수록 천천히 그 값이 감소합니다. 416 00:29:06,539 --> 00:29:09,019 - 당신이 말 때문에에 캐릭터를 좋아하는 예측 실제로 매우 유용 + RNN은 현재 자신이 어느 시간 단계에 있는지 알아야 하기 때문에 이 기능은 매우 유용합니다. 417 00:29:09,019 --> 00:29:13,059 - 이 때 새로운 라인을 알 수 있도록이 애니의 티 타임 단계를 계산하는 + RNN은 현재 자신이 어느 시간 단계에 있는지 알아야 하기 때문에 이 기능은 매우 유용합니다. 418 00:29:13,059 --> 00:29:15,149 - 문자는 다음에 올 가능성이 높습니다 + 이를 통해서 언제 줄을 바꾸어야 하는지 알 수 있기 때문이죠. 419 00:29:15,150 --> 00:29:19,280 - 확인 그래서 추적처럼 거기에 우리가 실제로 단지 응답 세포를 발견 알려 + 이것 말고도 if 문을 감지하는 cell도 찾아냈고, 420 00:29:19,279 --> 00:29:23,970 - 갑자기 문을 우리는 응답자가 시세 및 문자열을 인용 세포 발견 + 인용구과 주석을 감지하는 cell 도 찾아냈고, 421 00:29:23,970 --> 00:29:28,710 - 우리는 내가 깊은 당신이 네 스틴 표현 등을 더 흥분 세포 발견 + 상대적으로 deep한 코드를 감지하는 cell 도 찾아냈습니다. 422 00:29:28,710 --> 00:29:33,150 - 실제로이 내부에서 찾을 수 있습니다 흥미로운 세포의 모든 종류는 아니다 + 다른 역할을 수행하는 cell 들도 찾을 수 있을 것이고, 중요한 것은 이것들이 전부 backpropagation 에서 나왔다는 겁니다. 423 00:29:33,150 --> 00:29:36,710 - 완전하게 다시 전파에서 나와서 그래서 아주 마법의 + 되게 마법같은 일이죠. 424 00:29:36,710 --> 00:29:42,130 - 나는 생각하지만, + (질문) 어떻게 cell 하나하나가 흥분했는지 알 수 있었죠? 425 00:29:42,130 --> 00:29:49,110 - 당신은 그냥 통과거야 그래서 내가 생각하는이 앨리스 팀은 약 2,100 세포있어 + (답변) 이 LSTM 에서는 대략 2100개의 cell 들이 있었습니다. 저는 그냥 하나하나 다 살펴봤어요. 426 00:29:49,109 --> 00:29:54,589 - 그들과 그들 중 일부는 다음과 같이하지만 그 중 약 5 %를 말할 것입니다 + (답변) 대부분은 규칙을 찾기가 어려웠지만, 약 5%에 해당하는 cell들에 대해서 살펴본 것들과 같은 규칙을 찾을 수 있었습니다. 427 00:29:54,589 --> 00:30:00,429 - 그냥 수동으로 통과, 그래서 당신은 뭔가 흥미로운 것을 발견 + (질문) 그러니까 어떤 cell들은 켜고, 어떤 cell들은 끄는 방식으로 찾은 건가요? 428 00:30:00,430 --> 00:30:05,310 - 미안 우리가 완전히 전체를 실행하는 온전한에 있지만 우리는있어 + (답변) 오 제가 질문을 잘못 이해했었네요. 저희는 RNN 전체를 실행시켰고, 특정 hidden state의 흥분 상태를 관찰했습니다. 429 00:30:05,309 --> 00:30:09,679 - US 하나의 셀의 소성시 하나의 숨겨진 상태 화재를보고 + (답변) 오 제가 질문을 잘못 이해했었네요. 저희는 RNN 전체를 실행시켰고, 특정 hidden state의 흥분 상태를 관찰했습니다. 430 00:30:09,680 --> 00:30:14,470 - 다란 그래서 일반적으로하지만 우리는 하나에서 기록의 단지 친절 실행 + (답변) 그러니까 그냥 실행은 그대로 하되, 특정 hidden state의 상태를 기록하고 살펴본 것입니다. 431 00:30:14,470 --> 00:30:20,900 - 전지 등이 단지 전체가 판매하는 의미가 숨겨진 상태 + 이해가 되셨나요? 432 00:30:20,900 --> 00:30:23,940 - 숨겨진 상태의 그 상승 한 부분 사이에 기본적으로 많은있다 + 그러니까 저는 여기서 hidden state 단 한 부분만을 여기 슬라이드에 나타냈습니다. 433 00:30:23,940 --> 00:30:27,740 - 다른 하나는 여전히 다른 방법에 관련된 세포를 잤지 숨겨진 그들이있어 + 물론 hidden state 에는 이 부분 말고도 다른 일들을 하는 cell들이 많이 있죠. 434 00:30:27,740 --> 00:30:30,349 - 모든 다른 시간에 믿고 그들은 모두 다른 일을하고있는 + 이것들은 모두 동시에, 다른 기능을 수행합니다. 
435
00:30:30,349 --> 00:30:41,899
- 아르 논 숨겨진 상태 내부
+ (질문) 여기서의 hidden state의 layer은 1개인가요?

436
00:30:41,900 --> 00:30:50,150
- 하지만 당신은 하나의 층으로 비슷한 결과를 얻을 수 있습니다
+ (답변) Multi-layer RNN을 말씀하시는 건가요? 그것에 대해서는 좀 있다가 설명드리겠습니다. 여기서는 Multi-layer을 썼지만, Single-layer을 썼어도 결과는 비슷했을 거에요.

437
00:30:50,150 --> 00:31:00,490
- 이 세포는 (110) 각각에 부정적 일 사이에 항상 있었고,이은이다
+ (질문: 안들림) (답변): 이 hidden state 들은 -1 ~ 1의 값을 가집니다. tanh 함수의 결과물이거든요.

438
00:31:00,490 --> 00:31:04,120
- 우리가 아직 덮여 있지만, 살사의 소성 사이하지 않은 분석 팀
+ (답변) 이건 우리가 아직 다루지 않은 LSTM에 대한 것들입니다. 한 cell에 배정된 값은 -1~1 이라는 것 정도만 알아두세요.

439
00:31:04,119 --> 00:31:11,869
- 하나 하나는 그래서 꽤 있습니다 정도로되어 우리에게이 사진의 규모이다
+ (답변) 이건 우리가 아직 다루지 않은 LSTM에 대한 것들입니다. 한 cell에 배정된 값은 -1~1 이라는 것 정도만 알아두세요.

440
00:31:11,869 --> 00:31:15,609
- 시원하고 트렌디 한 시퀀스 모델 시간이 지남에 실제로 수에 대한 대략
+ RNN은 매우 잘 작동하고, 이러한 시퀀스 모델을 잘 학습할 수 있습니다.

441
00:31:15,609 --> 00:31:19,039
- 일년 전 여러 사람이 실제로 사용할 수 있음을 깨닫게했다
+ 대략 1년 전에 어떤 사람들이 이걸 컴퓨터 비전-image captioning 분야에 적용해 보았습니다.

442
00:31:19,039 --> 00:31:22,039
- 수행 할 수있는 컴퓨터 비전의 맥락에서 같은 매우 깔끔한 응용 프로그램
+ 대략 1년 전에 어떤 사람들이 이걸 컴퓨터 비전-image captioning 분야에 적용해 보았습니다.

443
00:31:22,039 --> 00:31:25,210
- 하나의 복용이 상황에서 캡처 이미지는 우리가하고 싶은 상상
+ 여기서는 어떤 하나의 사진을 가지고 단어의 배열을 생성해 보았는데요,

444
00:31:25,210 --> 00:31:27,840
- 보증의 순서로 설명하고이 있습니다 수녀은 아주 좋다
+ RNN은 여기서 매우 잘 작동했습니다.

445
00:31:27,839 --> 00:31:32,490
- 이 특정 모델 그들 있도록 시퀀스는 시간이 지남에 따라 개발 방법을 이해
+ RNN은 여기서 매우 잘 작동했습니다.

446
00:31:32,490 --> 00:31:36,240
- 이 실제로 대략에서 일을 설명하려고 전년 될 일이 내
+ 여기 한 부분을 보시면,

447
00:31:36,240 --> 00:31:43,039
- 나는 내 용지에서 사진이있는 종이 그래서 나는 그 정도 사용하려고 해요 우리
+ 사실 이건 제 논문이기 때문에 저 사진들은 제가 마음대로 쓸 수 있죠.

448
00:31:43,039 --> 00:31:46,629
- 다음 네트워크에서 수행하고 수수료 및 누락을 먹이
+ CNN에 이미지를 입력했는데요,

449
00:31:46,630 --> 00:31:48,990
- 당신이 휴대 전화 모델은 실제로 단지 두 개의 모듈로 구성된 것을 확인할 수 있습니다
+ 잘 살펴보시면 사실 이것은 CNN과 RNN의 두 부분으로 구성되어 있다는 것을 발견할 수 있습니다.

450
00:31:48,990 --> 00:31:51,750
- 이미지와 자신의 처리를하고있다 코멘트가있다
+ 잘 살펴보시면 사실 이것은 CNN과 RNN의 두 부분으로 구성되어 있다는 것을 발견할 수 있습니다.

451
00:31:51,750 --> 00:31:55,460
- 그래서 같은 모델링 시퀀스와 아주 좋은 것입니다 매우 될 것입니다 현재 부채
+ CNN은 이미지 처리를, RNN은 단어들의 순서 결정을 맡았습니다.

452
00:31:55,460 --> 00:31:58,470
- 이 어디 있는지 과정의 처음부터 나의 비유를 기억한다면
+ 제가 강의 처음에 했던 레고 블록 비유를 기억한다면,

453
00:31:58,470 --> 00:32:01,039
- 좀 레고 블록과 재생처럼 우리는 그 두 개의 모듈을거야
+ CNN과 RNN을 그림에 보이는 화살표와 같이 연결시킨 것을 이해할 수 있을 것입니다.

454
00:32:01,039 --> 00:32:04,509
- 사이 그래서 무엇을의 화살표에 해당하는 함께 스틱
+ CNN과 RNN을 그림에 보이는 화살표와 같이 연결시킨 것을 이해할 수 있을 것입니다.

455
00:32:04,509 --> 00:32:07,829
- 우리가 효과적으로 여기서하고있는 것은 어디 조절이 RNN 생식 모델
+ 저희가 여기서 잘한 점은 여기서 RNN 단어 생성 모델의 입력값을 적절히 조절했다는 것입니다.

456
00:32:07,829 --> 00:32:11,349
- 아니면 그냥 무작위로 그 샘플 텍스트를 이야기하지만 우리는 에어컨 아니에요 그
+ 그러니까 아무 텍스트나 RNN에 입력한 것이 아니라,

457
00:32:11,349 --> 00:32:14,939
- 네트워크 해변 와서 상단으로 프로세스를 생성하고 난 당신을 정확하게 보여주지
+ CNN의 결과물을 RNN의 입력값으로 받아온 것이죠.

458
00:32:14,940 --> 00:32:21,220
- 그 모습 어떻게 그래서 앞으로가 통과 무엇을 보여 드리겠습니다 가정
+ 좀 더 자세히 설명드리겠습니다. forward pass 부분부터요.

459
00:32:21,220 --> 00:32:24,110
- 자신의 우리가 테스트 이미지를 가지고 우리가 설명하려는 생각된다
+ 여기 test image가 있습니다.

460
00:32:24,109 --> 00:32:27,679
- 프로세스 방법 이렇게 단어의 시퀀스가 모델 화상을 US
+ 우리는 이 이미지에서 단어들의 시퀀스를 만들어보고 싶어요.

461
00:32:27,680 --> 00:32:31,240
- 어떤 플러그인이에서 왼쪽 작업을 수행하는 것을 가지고 정책
+ 그래서 다음과 같이 이미지를 먼저 처리했습니다.

462
00:32:31,240 --> 00:32:35,250
- 우리가에 만화의 모두 통과 수영장, 그래서 경우는 VG 정가입니다
+ 먼저 이미지를 CNN에 입력했습니다. 여기서 쓰인 CNN은 VGG net 이었습니다.

463
00:32:35,250 --> 00:32:37,349
- 우리는 단부에 도달 할 때까지
+ 그리고 여기 conv들과 maxpool 들을 통과시켰죠.

464
00:32:37,349 --> 00:32:40,149
- 일반적으로 마지막에 우리는 당신을주고있다이 자동 분류가
+ 일반적으로 마지막에는 softmax classifier가 위치합니다.
465
00:32:40,150 --> 00:32:44,440
- 이익 분배를 통해 우리가있어이 경우 이미지 1000 카테고리 말
+ softmax는 확률분포를 출력하죠. 예를 들어 1000개의 카테고리가 있다면 각 카테고리에 대한 확률분포를요.

466
00:32:44,440 --> 00:32:47,420
- 실제로 분류 없애가는 대신에 우리가 갈거야
+ 근데 여기서 우리는 softmax를 사용하지 않았습니다.

467
00:32:47,420 --> 00:32:50,750
- 재발에 연합 부재의 상단에있는 표현을 리디렉션
+ 대신 이 끝부분을 RNN의 시작 부분과 연결시켰죠.

468
00:32:50,750 --> 00:32:54,880
- 신경 네트워크는 그래서 우리는 특정 함께 아르 논의 생성에 시작
+ RNN 입력에 처음에는 특별한 벡터들을 사용했습니다.

469
00:32:54,880 --> 00:33:00,410
- 자극이 내가 생각 그럼에도 불구하고 그래서 예술 벡터 (300), 정서적,
+ RNN 에 입력되는 벡터들의 차원은 300이었고요,

470
00:33:00,410 --> 00:33:02,700
- 이것은 우리가 항상 플러그 특별한 삼백 감정적 인 승리입니다
+ RNN의 첫 iteration에는 무조건 이 벡터를 사용했습니다.

471
00:33:02,700 --> 00:33:05,750
- 첫 번째 반복이 나에게 이야기에이 시퀀스의 시작입니다
+ 그럼으로써 RNN이 이것이 시퀀스의 시작임을 파악할 수 있게 했습니다.

472
00:33:05,750 --> 00:33:09,039
- 그리고, 우리는 당신을 나타낸 재발 수식을 수행 할거야
+ 그리고 아까 살펴본 recurrence 공식 (Vanilla NN)을 사용했습니다.

473
00:33:09,039 --> 00:33:13,769
- 재발 성 신경 네트워크의 전에 일반적으로 우리는이 재발 계산
+ 그리고 아까 살펴본 recurrence 공식 (Vanilla NN)을 사용했습니다.

474
00:33:13,769 --> 00:33:18,779
@@ -1896,15 +1896,15 @@

475
00:33:18,779 --> 00:33:23,500
- 또한 현재에뿐만 아니라 재발 성 신경 네트워크로 조절합니다
+ 아까는 (Wxh*x + Whh*h)과 0으로 초기화되는 h_0을 사용했다면,

476
00:33:23,500 --> 00:33:28,089
- 우리가 20을 좋아합니다 상태에서 입력 전류는 그래서 그 용어는 멀리 간다
+ 아까는 (Wxh*x + Whh*h)과 0으로 초기화되는 h_0을 사용했다면,

477
00:33:28,089 --> 00:33:33,649
- 하지만 우리는 처음에 단지 사랑의 시간이 될 추가하여 컨디셔닝 처음으로
+ 이번에는 v를 추가해서 (Wxh*x + Whh*h + Wih*v) 를 사용했습니다.

478
00:33:33,650 --> 00:33:38,040

From a4baaabc028432ea9871fc8e790dd543a7f27f63 Mon Sep 17 00:00:00 2001
From: JK Im
Date: Tue, 3 May 2016 10:39:19 -0500
Subject: [PATCH 092/199] Update optimization-1.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

모수=>파라미터
점수함수=>스코어함수 (오류 하나)
파이썬 코드 코멘트 번역 안 했던거 추가 번역.
---
 optimization-1.md | 114 +++++++++++++++++++++++-----------------------
 1 file changed, 57 insertions(+), 57 deletions(-)

diff --git a/optimization-1.md b/optimization-1.md
index 5b158932..e24a0def 100644
--- a/optimization-1.md
+++ b/optimization-1.md
@@ -3,7 +3,7 @@ layout: page
 permalink: /optimization-1/
 ---

-Table of Contents:
+목차:

- [소개](#intro)
- [손실함수(Loss Function)의 시각화(Visualization)](#vis)
- [최적화(Optimization)](#optimization)
  - [전략 #1: 무작위 탐색 (Random Search)](#opt1)
  - [전략 #2: 무작위 국소 탐색 (Random Local Search)](#opt2)
  - [전략 #3: 그라디언트(gradient) 따라가기](#opt3)
- [그라디언트(Gradient) 계산](#gradcompute)
  - [Finite Differences를 이용하여 수치적으로 계산하기](#numerical)
  - [미분을 이용하여 해석적으로 계산하기](#analytic)
- [그라디언트 하강(Gradient Descent)](#gd)
- [요약](#summary)

## 소개

이전 섹션에서 이미지 분류(image classification)을 할 때에 있어 두 가지의 핵심요소를 소개했습니다.

-1. 원 이미지의 픽셀들을 넣으면 분류 스코어(class score)를 계산해주는 모수화된(parameterized) **스코어함수(score function)** (예를 들어, 선형함수).
-2. 학습(training) 데이타에 어떤 특정 모수(parameter/weight)들을 가지고 스코어함수(score function)를 적용시켰을 때, 실제 class와 얼마나 잘 일치하는지에 따라 그 특정 모수(parameter/weight)들의 질을 측정하는 **손실함수(loss function)**. 여러 종류의 손실함수(예를 들어, Softmax/SVM)가 있다.
+1. 원 이미지의 픽셀들을 넣으면 분류 스코어(class score)를 계산해주는 파라미터화된(parameterized) **스코어함수(score function)** (예를 들어, 선형함수).
+2. 학습(training) 데이타에 어떤 특정 파라미터(parameter/weight)들을 가지고 스코어함수(score function)를 적용시켰을 때, 실제 class와 얼마나 잘 일치하는지에 따라 그 특정 파라미터(parameter/weight)들의 질을 측정하는 **손실함수(loss function)**. 여러 종류의 손실함수(예를 들어, Softmax/SVM)가 있다.

구체적으로 말하자면, 다음과 같은 형식을 가진 선형함수 $$ f(x_i, W) = W x_i $$를 스코어함수(score function)로 쓸 때, 앞에서 다룬 바와 같이 SVM은 다음과 같은 수식으로 표현할 수 있다.:

$$
L = \frac{1}{N} \sum_i \sum_{j\neq y_i} \left[ \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + 1) \right] + \alpha R(W)
$$

-예시 $x_i$에 대한 예측값이 실제 값(레이블, labels) $$y_i$$과 같도록 설정된 모수(parameter/weight) $$W$$는 손실(loss)값 $$L$$ 또한 매우 낮게 나온다는 것을 알아보았다. 이제 세번째이자 마지막 핵심요소인 **최적화(optimization)**에 대해서 알아보자. 최적화(optimization)는 손실함수(loss function)을 최소화시카는 모수(parameter/weight, $$W$$)들을 찾는 과정을 뜻한다.
+예시 $x_i$에 대한 예측값이 실제 값(레이블, labels) $$y_i$$과 같도록 설정된 파라미터(parameter/weight) $$W$$는 손실(loss)값 $$L$$ 또한 매우 낮게 나온다는 것을 알아보았다. 이제 세번째이자 마지막 핵심요소인 **최적화(optimization)**에 대해서 알아보자. 최적화(optimization)는 손실함수(loss function)을 최소화시키는 파라미터(parameter/weight, $$W$$)들을 찾는 과정을 뜻한다.

**예고:** 이 세 가지 핵심요소가 어떻게 상호작용하는지 이해한 후에는, 첫번째 요소(파라미터화된 함수)로 다시 돌아가서 선형함수보다 더 복잡한 형태로 확장시켜볼 것이다. 처음엔 신경망(Neural Networks), 다음엔 컨볼루션 신경망(Convolutional Neural Networks). 손실함수(loss function)와 최적화(optimization) 과정은 거의 변화가 없을 것이다.

### 손실함수(loss function)의 시각화

이 강의에서 우리가 다루는 손실함수(loss function)들은 대체로 고차원 공간에서 정의된다. 예를 들어, CIFAR-10의 선형분류기(linear classifier)의 경우 파라미터(parameter/weight) 행렬은 크기가 [10 x 3073]이고 총 30,730개의 파라미터(parameter/weight)가 있다. 따라서, 시각화하기가 어려운 면이 있다. 하지만, 고차원 공간을 1차원 직선이나 2차원 평면으로 잘라서 보면 약간의 직관을 얻을 수 있다. 예를 들어, 무작위로 파라미터(parameter/weight) 행렬 $W$을 하나 뽑는다고 가정해보자. (이는 사실 고차원 공간의 한 점인 셈이다.) 이제 이 점을 직선 하나를 따라 이동시키면서 손실함수(loss function)를 기록해보자. 즉, 무작위로 뽑은 방향 $$W_1$$을 잡고, 이 방향을 따라 가면서 손실함수(loss function)를 계산하는데, 구체적으로 말하면 $$L(W + a W_1)$$에 여러 개의 $$a$$ 값(역자 주: 1차원 스칼라)을 넣어 계산해보는 것이다. 이 과정을 통해 우리는 $$a$$ 값을 x축, 손실함수(loss function) 값을 y축에 놓고 간단한 그래프를 그릴 수 있다. 또한 이 비슷한 것을 2차원으로도 할 수 있다. 여러 $$a, b$$값에 따라 $$ L(W + a W_1 + b W_2) $$을 계산하고(역자 주: $$W_2$$ 역시 $$W_1$$과 같은 식으로 뽑은 무작위 방향), $$a, b$$는 각각 x축과 y축에, 손실함수(loss function) 값은 색을 이용해 그리면 된다.
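아래 그림과 같은 1차원 단면은 예를 들어 다음과 같이 계산할 수 있다 (역자 주: 원문에 없는 스케치로, 전체 학습 데이터에 대한 손실값을 돌려주는 함수 `L`과 파라미터 행렬 `W`가 이미 정의되어 있다고 가정한다).

~~~python
import numpy as np

# 가정: L(W)는 손실값을 돌려주는 함수, W는 크기 [10 x 3073]의 파라미터 행렬이다.
W1 = np.random.randn(*W.shape)                 # 무작위로 뽑은 방향
a_values = [0.1 * i for i in xrange(-20, 21)]  # -2.0 ~ 2.0 사이의 a 값들
losses = [L(W + a * W1) for a in a_values]
# (a_values, losses)를 각각 x축, y축으로 놓고 그리면 1차원 단면 그래프가 된다.
~~~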
@@ -68,12 +68,12 @@ L = & (L_0 + L_1 + L_2)/3 \end{align} $$ -이 예시들이 1차원이기 때문에, 데이타 $$x_i$$와 모수(parameter/weight) $$w_j$$는 숫자(역자 주: 즉, 스칼라. 따라서 위 수식에서 전치행렬을 뜻하는 $$T$$ 표시는 필요없음)이다. 예를 들어 $$w_0$$ 를 보면, 몇몇 항들은 $$w_0$$의 선형함수이고 각각은 0에서 꺾인다. 이를 다음과 같이 시각화할 수 있다. +이 예시들이 1차원이기 때문에, 데이타 $$x_i$$와 파라미터(parameter/weight) $$w_j$$는 숫자(역자 주: 즉, 스칼라. 따라서 위 수식에서 전치행렬을 뜻하는 $$T$$ 표시는 필요없음)이다. 예를 들어 $$w_0$$ 를 보면, 몇몇 항들은 $$w_0$$의 선형함수이고 각각은 0에서 꺾인다. 이를 다음과 같이 시각화할 수 있다.
- 손실(loss)를 1차원으로 표현한 그림. x축은 모수(parameter/weight) 하나이고, y축은 손실(loss)이다. 손실(loss)는 여러 항들의 합인데, 그 각각은 특정 모수(parameter/weight)값과 무관하거나, 0에 막혀있는 그 모수(parameter/weight)의 선형함수이다. 전체 SVM 손실은 이 모양의 30,730차원 버전이다. + 손실(loss)를 1차원으로 표현한 그림. x축은 파라미터(parameter/weight) 하나이고, y축은 손실(loss)이다. 손실(loss)는 여러 항들의 합인데, 그 각각은 특정 파라미터(parameter/weight)값과 무관하거나, 0에 막혀있는 그 파라미터(parameter/weight)의 선형함수이다. 전체 SVM 손실은 이 모양의 30,730차원 버전이다.
@@ -85,18 +85,18 @@ $$ ### 최적화 -정리하면, 손실함수(loss function)는 모수(parameter/weight) **W** 행렬의 질을 측정한다. 최적화의 목적은 이 손실함수(loss function)을 최소화시키는 **W**을 찾아내는 것이다. 다음 단락부터 손실함수(loss function)을 최적화하는 방법에 대해서 찬찬히 살펴볼 것이다. 이전에 경험이 있는 사람들이 보면 이 섹션은 좀 이상하다고 생각할지 모르겠다. 왜냐하면, 여기서 쓰인 예제 (즉, SVM 손실(loss))가 볼록함수이기 때문이다. 하지만, 우리의 궁극적인 목적은 신경망(neural networks)를 최적화시키는 것이고, 여기에는 볼록함수 최적화를 위해 고안된 방법들이 쉽사리 통히지 않는다. +정리하면, 손실함수(loss function)는 파라미터(parameter/weight) **W** 행렬의 질을 측정한다. 최적화의 목적은 이 손실함수(loss function)을 최소화시키는 **W**을 찾아내는 것이다. 다음 단락부터 손실함수(loss function)을 최적화하는 방법에 대해서 찬찬히 살펴볼 것이다. 이전에 경험이 있는 사람들이 보면 이 섹션은 좀 이상하다고 생각할지 모르겠다. 왜냐하면, 여기서 쓰인 예 (즉, SVM 손실(loss))가 볼록함수이기 때문이다. 하지만, 우리의 궁극적인 목적은 신경망(neural networks)를 최적화시키는 것이고, 여기에는 볼록함수 최적화를 위해 고안된 방법들이 쉽사리 통히지 않는다. #### 전략 #1: 첫번째 매우 나쁜 방법: 무작위 탐색 (Random search) -주어진 모수(parameter/weight) **W**이 얼마나 좋은지를 측정하는 것은 매우 간단하기 때문에, 처음 떠오르는 (매우 나쁜) 생각은, 단순히 무작위로 모수(parameter/weight)을 골라서 넣어보고 넣어 본 값들 중 제일 좋은 값을 기록하는 것이다. 그 과정은 다음과 같다. +주어진 파라미터(parameter/weight) **W**이 얼마나 좋은지를 측정하는 것은 매우 간단하기 때문에, 처음 떠오르는 (매우 나쁜) 생각은, 단순히 무작위로 파라미터(parameter/weight)을 골라서 넣어보고 넣어 본 값들 중 제일 좋은 값을 기록하는 것이다. 그 과정은 다음과 같다. ~~~python -# assume X_train is the data where each column is an example (e.g. 3073 x 50,000) -# assume Y_train are the labels (e.g. 1D array of 50,000) -# assume the function L evaluates the loss function +# X_train의 각 열(column)이 예제 하나에 해당하는 행렬이라고 생각하자. (예를 들어, 3073 x 50,000짜리) +# Y_train 은 레이블값이 저장된 어레이(array)이라고 하자. (즉, 길이 50,000짜리 1차원 어레이) +# 그리고 함수 L이 손실함수라고 하자. bestloss = float("inf") # Python assigns the highest possible float value for num in xrange(1000): @@ -118,23 +118,23 @@ for num in xrange(1000): # ... (trunctated: continues for 1000 lines) ~~~ -위의 코드에서, 여러 개의 무작위 모수(parameter/weight) **W**를 넣어봤고, 그 중 몇몇은 다른 것들보다 좋았다. 그래서 그 중 제일 좋은 모수(parameter/weight) **W**을 테스트 데이터에 넣어보면 된다. +위의 코드에서, 여러 개의 무작위 파라미터(parameter/weight) **W**를 넣어봤고, 그 중 몇몇은 다른 것들보다 좋았다. 그래서 그 중 제일 좋은 파라미터(parameter/weight) **W**을 테스트 데이터에 넣어보면 된다. ~~~python -# Assume X_test is [3073 x 10000], Y_test [10000 x 1] -scores = Wbest.dot(Xte_cols) # 10 x 10000, the class scores for all test examples -# find the index with max score in each column (the predicted class) +# X_test은 크기가 [3073 x 10000]인 행렬, Y_test는 크기가 [10000 x 1]인 어레이라고 하자. +scores = Wbest.dot(Xte_cols) # 모든 테스트데이터 예제(1만개)에 대한 각 클라스(10개)별 점수를 모아놓은 크기 10 x 10000짜리인 행렬 +# 각 열(column)에서 가장 높은 점수에 해당하는 클래스를 찾자. (즉, 예측 클래스) Yte_predict = np.argmax(scores, axis = 0) -# and calculate accuracy (fraction of predictions that are correct) +# 그리고 정확도를 계산하자. (예측 성공률) np.mean(Yte_predict == Yte) -# returns 0.1555 +# 정확도 값이 0.1555라고 한다. ~~~ 이 방법으로 얻은 최선의 **W**는 정확도 **15.5%**이다. 완전 무작위 찍기가 단 10%의 정확도를 보이므로, 무식한 방법 치고는 그리 나쁜 것은 아니다. -**핵심 아이디어: 반복적 향상**. 물론 이보다 더 좋은 방법들이 있다. 여기서 핵심 아이디어는, 최선의 모수(parameter/weight) **W**을 찾는 것은 매우 어렵거나 때로는 불가능한 문제(특히 복잡한 신경망(neural network) 전체를 구현할 경우)이지만, 어떤 주어진 모수(parameter/weight) **W**을 조금 개선시키는 일은 훨씬 덜 힘들다는 점이다. 다시 말해, 우리의 접근법은 무작위로 뽑은 **W**에서 출발해서 매번 조금씩 개선시키는 것을 반복하는 것이다. +**핵심 아이디어: 반복적 향상**. 물론 이보다 더 좋은 방법들이 있다. 여기서 핵심 아이디어는, 최선의 파라미터(parameter/weight) **W**을 찾는 것은 매우 어렵거나 때로는 불가능한 문제(특히 복잡한 신경망(neural network) 전체를 구현할 경우)이지만, 어떤 주어진 파라미터(parameter/weight) **W**을 조금 개선시키는 일은 훨씬 덜 힘들다는 점이다. 다시 말해, 우리의 접근법은 무작위로 뽑은 **W**에서 출발해서 매번 조금씩 개선시키는 것을 반복하는 것이다. -> 우리의 전략은 무작위로 뽑은 모수(parameter/weight)으로부터 시작해서 반복적으로 조금씩 개선시켜 손실(loss)을 낮추는 것이다. +> 우리의 전략은 무작위로 뽑은 파라미터(parameter/weight)으로부터 시작해서 반복적으로 조금씩 개선시켜 손실(loss)을 낮추는 것이다. **눈 가리고 하산하는 것에 비유.** 앞으로 도움이 될만한 비유는, 경사진 지형에서 눈가리개를 하고 점점 아래로 내려오는 자기 자신을 생각해보는 것이다. 
CIFAR-10의 예시에서, 그 언덕들은 (**W**가 3073 x 10 차원이므로) 30,730차원이다. 언덕의 각 지점에서의 고도가 손실함수(loss function)의 손실값(loss)의 역할을 한다.

@@ -145,7 +145,7 @@ np.mean(Yte_predict == Yte)

처음 떠오르는 전략은, 시작점에서 무작위로 방향을 정해서 발을 살짝 뻗어서 더듬어보고 그게 내리막길일 때만 한발짝 내딛는 것이다. 구체적으로 말하면, 임의의 $$W$$에서 시작하고, 또다른 임의의 방향 $$ \delta W $$로 살짝 움직여본다. 만약에 움직여간 자리($$W + \delta W$$)에서의 손실값(loss)이 더 낮으면, 거기로 움직이고 다시 탐색을 시작한다. 이 과정을 코드로 짜면 다음과 같다.

~~~python
-W = np.random.randn(10, 3073) * 0.001 # generate random starting W
+W = np.random.randn(10, 3073) * 0.001 # 임의의 시작 파라미터를 랜덤하게 고른다.
 bestloss = float("inf")
 for i in xrange(1000):
   step_size = 0.0001
@@ -163,7 +163,7 @@ for i in xrange(1000):

#### 전략 #3: 그라디언트(gradient) 따라가기

-이전 섹션에서, 모수(parameter/weight) 공간에서 모수(parameter/weight) 벡터를 향상시키는 (즉, 손실값을 더 낮추는) 뱡향을 찾는 시도를 해봤다. 그런데 사실 좋은 방향을 찾기 위해 방향을 무작위로 탐색할 필요가 없다고 한다. (적어도 반지름이 0으로 수렴하는 아주 좁은 근방에서는) 가장 가파르게 감소한다고 수학적으로 검증된 *최선의* 방향을 구할 수 있고, 이 방향을 따라 모수(parameter/weight) 벡터를 움직이면 된다는 것이다. 이 방향이 손실함수(loss function)의 **그라디언트(gradient)**와 관계있다. 눈 가리고 하산하는 것에 비유할 때, 발 밑 지형을 잘 더듬어보고 가장 가파르다는 느낌을 주는 방향으로 내려가는 것에 비견할 수 있다.
+이전 섹션에서, 파라미터(parameter/weight) 공간에서 파라미터(parameter/weight) 벡터를 향상시키는 (즉, 손실값을 더 낮추는) 방향을 찾는 시도를 해봤다. 그런데 사실 좋은 방향을 찾기 위해 방향을 무작위로 탐색할 필요가 없다고 한다. (적어도 반지름이 0으로 수렴하는 아주 좁은 근방에서는) 가장 가파르게 감소한다고 수학적으로 검증된 *최선의* 방향을 구할 수 있고, 이 방향을 따라 파라미터(parameter/weight) 벡터를 움직이면 된다는 것이다. 이 방향이 손실함수(loss function)의 **그라디언트(gradient)**와 관계있다. 눈 가리고 하산하는 것에 비유할 때, 발 밑 지형을 잘 더듬어보고 가장 가파르다는 느낌을 주는 방향으로 내려가는 것에 비견할 수 있다.

1차원 함수의 경우, 어떤 점에서 움직일 때 기울기는 함수값의 순간 증가율을 나타낸다. 그라디언트(gradient)는 이 기울기란 것을, 변수가 하나가 아니라 여러 개인 경우로 일반화시킨 것이다. 덧붙여 설명하면, 그라디언트(gradient)는 입력데이터공간(역자 주: x들의 공간)의 각 차원에 해당하는 기울기(**미분**이라고 더 많이 불린다)들의 벡터이다. 1차원 함수의 미분을 수식으로 쓰면 다음과 같다.

@@ -188,60 +188,60 @@ $$

~~~python
def eval_numerical_gradient(f, x):
  """
-  a naive implementation of numerical gradient of f at x
-  - f should be a function that takes a single argument
-  - x is the point (numpy array) to evaluate the gradient at
+  함수 f의 x에서의 그라디언트를 매우 단순하게 구현하기.
+  - f는 입력값 1개를 받는 함수여야 한다.
+  - x는 그라디언트를 계산할 지점의 numpy 어레이(array) (역자 주: 그라디언트는 당연하게도 어디서 계산하느냐에 따라 달라지므로, 함수 f 뿐 아니라 x도 정해줘야 함).
  """

-  fx = f(x) # evaluate function value at original point
+  fx = f(x) # 원래 지점 x에서 함수값 구하기.
  grad = np.zeros(x.shape)
  h = 0.00001

-  # iterate over all indexes in x
+  # x의 모든 인덱스를 다 돌면서 계산하기.
  it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
  while not it.finished:

-    # evaluate function at x+h
+    # 함수 값을 x+h에서 계산하기.
    ix = it.multi_index
    old_value = x[ix]
-    x[ix] = old_value + h # increment by h
+    x[ix] = old_value + h # 변화량 h만큼 증가
    fxh = f(x) # evaluate f(x + h)
-    x[ix] = old_value # restore to previous value (very important!)
+    x[ix] = old_value # 이전 값을 다시 가져온다. (매우 중요!)

-    # compute the partial derivative
-    grad[ix] = (fxh - fx) / h # the slope
-    it.iternext() # step to next dimension
+    # 편미분 계산
+    grad[ix] = (fxh - fx) / h # 기울기
+    it.iternext() # 다음 단계로 가서 반복.

  return grad
~~~

이 코드는, 위에 주어진 그라디언트(gradient) 식을 이용해서 모든 차원을 하나씩 돌아가면서 그 방향으로 작은 변화 `h`를 줬을 때, 손실함수(loss function)의 값이 얼마나 변하는지를 구해서, 그 방향의 편미분 값을 계산한다. 변수 `grad`에 전체 그라디언트(gradient) 값이 최종적으로 저장된다.

-**실제 고려할 사항**. **h**가 0으로 수렴할 때의 극한값이 그라디언트(gradient)의 수학적으로 정의인데, (이 예제에서 나온 것처럼 1e-5 같이) 작은 값이면 충분하다. 이상적으로, 수치적인 문제를 일으키지 않는 수준에서 가장 작은 값을 쓰면 된다. 덧붙여서, 실제 활용할 때, x를 **양 방향으로 변화를 주어서 구한 수식**이 더 좋은 경우가 많다: $ [f(x+h) - f(x-h)] / 2 h $ . 다음 [위키](http://en.wikipedia.org/wiki/Numerical_differentiation)를 보면 자세한 것을 알 수 있다.
+**실제 고려할 사항**. **h**가 0으로 수렴할 때의 극한값이 그라디언트(gradient)의 수학적인 정의인데, (이 예시에서 나온 것처럼 1e-5 같이) 작은 값이면 충분하다.
이상적으로, 수치적인 문제를 일으키지 않는 수준에서 가장 작은 값을 쓰면 된다. 덧붙여서, 실제 활용할 때, x를 **양 방향으로 변화를 주어서 구한 수식**이 더 좋은 경우가 많다: $ [f(x+h) - f(x-h)] / 2 h $ . 다음 [위키](http://en.wikipedia.org/wiki/Numerical_differentiation)를 보면 자세한 것을 알 수 있다.

-위에서 계산한 함수를 이용하면, 아무 함수의 아무 값에서나 그라디언트(gradient)를 계산할 수 있다. 무작위로 뽑은 모수(parameter/weight)값에서 CIFAR-10의 손실함수(loss function)의 그라디언트를 구해본다.:
+위에서 계산한 함수를 이용하면, 아무 함수의 아무 값에서나 그라디언트(gradient)를 계산할 수 있다. 무작위로 뽑은 파라미터(parameter/weight)값에서 CIFAR-10의 손실함수(loss function)의 그라디언트를 구해보자:

~~~python
-# to use the generic code above we want a function that takes a single argument
-# (the weights in our case) so we close over X_train and Y_train
+# 위의 범용 코드를 쓰려면 함수가 입력값 하나(이 경우 파라미터)를 받아야 함.
+# 따라서 X_train과 Y_train은 입력값으로 치지 않고 W 하나만 입력값으로 받도록 함수를 다시 정의.
def CIFAR10_loss_fun(W):
  return L(X_train, Y_train, W)

-W = np.random.rand(10, 3073) * 0.001 # random weight vector
-df = eval_numerical_gradient(CIFAR10_loss_fun, W) # get the gradient
+W = np.random.rand(10, 3073) * 0.001 # 랜덤 파라미터 벡터.
+df = eval_numerical_gradient(CIFAR10_loss_fun, W) # 그라디언트를 구했다.
~~~

-그라디언트(gradient)는 각 차원에서 CIFAR-10의 손실함수(loss function)의 기울기를 알려주는데, 그걸 이용해서 모수(parameter/weight)를 업데이트한다.
+그라디언트(gradient)는 각 차원에서 CIFAR-10의 손실함수(loss function)의 기울기를 알려주는데, 그걸 이용해서 파라미터(parameter/weight)를 업데이트한다.

~~~python
-loss_original = CIFAR10_loss_fun(W) # the original loss
+loss_original = CIFAR10_loss_fun(W) # 기존 손실값
 print 'original loss: %f' % (loss_original, )

-# lets see the effect of multiple step sizes
+# 스텝 크기가 주는 영향에 대해 알아보자.
for step_size_log in [-10, -9, -8, -7, -6, -5,-4,-3,-2,-1]:
  step_size = 10 ** step_size_log
-  W_new = W - step_size * df # new position in the weight space
+  W_new = W - step_size * df # 파라미터(parameter/weight) 공간 상의 새 파라미터 값
  loss_new = CIFAR10_loss_fun(W_new)
  print 'for step size %f new loss: %f' % (step_size, loss_new)

@@ -259,7 +259,7 @@ for step_size_log in [-10, -9, -8, -7, -6, -5,-4,-3,-2,-1]:
# for step size 1.000000e-01 new loss: 25392.214036
~~~

-**Update in negative gradient direction**. 위 코드에서, 새로운 모수 `W_new`로 업데이트할 때, 그라디언트(gradient) `df`의 반대방향으로 움직인 것을 주목하자. 왜냐하면 우리가 원하는 것은 손실함수(loss function)의 증가가 아니라 감소하는 것이기 때문이다.
+**그라디언트(gradient)의 반대 방향으로 업데이트**. 위 코드에서, 새로운 파라미터 `W_new`로 업데이트할 때, 그라디언트(gradient) `df`의 반대 방향으로 움직인 것을 주목하자. 왜냐하면 우리가 원하는 것은 손실함수(loss function)의 증가가 아니라 감소하는 것이기 때문이다.

**스텝 크기가 미치는 영향**. 그라디언트(gradient)에서 알 수 있는 것은 함수값이 가장 빠르게 증가하는 방향이고, 그 방향으로 대체 얼만큼을 가야하는지는 알려주지 않는다. 강의 뒤에서 다루게 되겠지만, 얼만큼 가야하는지를 의미하는 스텝 크기(혹은 *학습 속도*라고도 함)는 신경망(neural network)을 학습시킬 때 있어 가장 중요한 (그래서 결정하기 까다로운) 하이퍼파라미터(hyperparameter)가 될 것이다. 눈 가리고 하산하는 비유에서, 우리는 우리 발 밑으로 어느 방향이 가장 가파른지 느끼지만, 얼마나 발을 뻗어야할 지는 불확실하다. 발을 살살 휘저으면, 꾸준하지만 매우 조금씩밖에 못 내려갈 것이다. (이는 아주 작은 스텝 크기에 비견된다.) 반대로, 욕심껏 빨리 내려가려고 크고 과감하게 발을 내딛을 수도 있는데, 항상 뜻대로 되지는 않을지 모른다. 위의 제시된 코드에서와 같이, 어느 수준 이상의 큰 스텝 크기는 오히려 손실값을 증가시킨다.

@@ -271,7 +271,7 @@
-**효율성의 문제**. 알다시피, 그라디언트(gradient)를 수치적으로 계산하는 데 드는 비용은 모수(parameter/weight)의 수에 따라 선형적으로 늘어난다. 위 예시에서, 총 30,730의 모수(parameter/weight)가 있으므로 30,731번 손실함수값을 계산해서 그라디언트(gradient)를 구해 봐야 딱 한 번 업데이트할 수 있다. 요즘 쓰이는 신경망(neural networks)들은 수천만개의 모수(parameter/weight)도 우스운데, 그런 경우 이 문제는 매우 심각해진다. 당연하게도, 이 전략은 별로고, 더 좋은게 있다.
+**효율성의 문제**. 알다시피, 그라디언트(gradient)를 수치적으로 계산하는 데 드는 비용은 파라미터(parameter/weight)의 수에 따라 선형적으로 늘어난다. 위 예시에서, 총 30,730의 파라미터(parameter/weight)가 있으므로 30,731번 손실함수값을 계산해서 그라디언트(gradient)를 구해 봐야 딱 한 번 업데이트할 수 있다. 요즘 쓰이는 신경망(neural networks)들은 수천만개의 파라미터(parameter/weight)도 우스운데, 그런 경우 이 문제는 매우 심각해진다. 당연하게도, 이 전략은 별로고, 더 좋은게 있다.

@@ -285,7 +285,7 @@

$$
L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + \Delta) \right]
$$

-모수(parameter/weight)로 이 함수를 미분할 수 있다. 예를 들어, $w_{y_i}$로 미분하면 이렇게 된다:
+파라미터(parameter/weight)로 이 함수를 미분할 수 있다. 예를 들어, $w_{y_i}$로 미분하면 이렇게 된다:

$$
\nabla_{w_{y_i}} L_i = - \left( \sum_{j\neq y_i} \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) \right) x_i
$$

@@ -303,30 +303,30 @@ $$

### 그라디언트 하강 (gradient descent)

-이제 손실함수(loss function)의 그라디언트(gradient)를 계산할 줄 알게 됐는데, 그라디언트(gradient)를 계속해서 계산하고 모수(weight/parameter)를 Now that we can compute the gradient of the loss function, the procedure of repeatedly evaluating the gradient and then performing a parameter update is called *Gradient Descent*. Its **vanilla** version looks as follows:
+이제 손실함수(loss function)의 그라디언트(gradient)를 계산할 줄 알게 됐다. 이렇게 그라디언트(gradient)를 반복적으로 계산하고 파라미터(parameter/weight)를 업데이트하는 과정을 *그라디언트 하강(Gradient Descent)*이라고 부른다. 가장 **단순한(vanilla)** 형태는 다음과 같다:

~~~python
# 단순한 경사하강(gradient descent)

while True:
  weights_grad = evaluate_gradient(loss_fun, data, weights)
-  weights += - step_size * weights_grad # perform parameter update
+  weights += - step_size * weights_grad # 파라미터 업데이트(parameter update)
~~~

이 단순한 루프는 모든 신경망(neural network)의 중심에 있는 것이다. 다른 방법으로 (예컨대, LBFGS) 최적화를 할 수 있는 방법이 있긴 하지만, 현재로는 그라디언트 하강 (gradient descent)이 신경망(neural network)의 손실함수(loss function)를 최적화하는 것으로는 가장 많이 쓰인다. 이 강의에서, 이 루프에 이것저것 세세하게 덧붙이기(예를 들어, 업데이트 수식이 정확히 어떻게 되는지 등)는 할 것이다. 하지만, 결과에 만족할 때까지 그라디언트(gradient)를 따라서 움직인다는 기본적인 개념은 안 바뀔 것이다.

-**미니배치 그라디언트 하강 (Mini-batch gradient descent (MGD)).** (ILSVRC challenge처럼) 대규모의 응용사례에서, 학습데이터(training data)가 수백만개 주어질 수 있다. 따라서, 모수를 한 번 업데이트하려고 학습데이터(training data) 전체를 계산에 사용하는 것은 낭비가 될 수 있다. 이를 극복하기 위해서 흔하게 쓰이는 방법으로는, 학습데이터(training data)의 **배치(batches)**만 이용해서 그라디언트(gradient)를 구하는 것이다. 예를 들어 ConvNets을 쓸 때, 한 번에 120만개 중에 256개짜리 배치만을 이용해서 그라디언트(gradient)를 구하고 모수(parameter/weight) 업데이트를 한다. 다음 코드를 보자.
+**미니배치 그라디언트 하강 (Mini-batch gradient descent (MGD)).** (ILSVRC challenge처럼) 대규모의 응용사례에서, 학습데이터(training data)가 수백만개 주어질 수 있다. 따라서, 파라미터를 한 번 업데이트하려고 학습데이터(training data) 전체를 계산에 사용하는 것은 낭비가 될 수 있다. 이를 극복하기 위해서 흔하게 쓰이는 방법으로는, 학습데이터(training data)의 **배치(batches)**만 이용해서 그라디언트(gradient)를 구하는 것이다. 예를 들어 ConvNets을 쓸 때, 한 번에 120만개 중에 256개짜리 배치만을 이용해서 그라디언트(gradient)를 구하고 파라미터(parameter/weight) 업데이트를 한다. 다음 코드를 보자.
~~~python
# 단순한 미니배치 (minibatch) 그라디언트(gradient) 업데이트

while True:
-  data_batch = sample_training_data(data, 256) # sample 256 examples
+  data_batch = sample_training_data(data, 256) # 예제 256개짜리 미니배치(mini-batch)
  weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
-  weights += - step_size * weights_grad # perform parameter update
+  weights += - step_size * weights_grad # 파라미터 업데이트(parameter update)
~~~

-이 방법이 먹히는 이유는 학습데이터들의 예시들이 서로 상관관계가 있기 때문이다. 이것에 대해 알아보기위해, ILSVRC의 120만개 이미지들이 사실은 1천개의 서로 다른 이미지들의 복사본이라는 극단적인 경우를 생각해보자. (즉, 한 클래스 당 하나이고, 이 하나가 1천2백번 복사된 것) 그러면 명백한 것은, 이 1천2백개의 이미지에서의 그라디언트(gradient)값 (역자 주: 이 1천2백개에 해당하는 $i$에 대한 $L_i$값)은 다 똑같다는 점이다. 그렇다면 이 1천2백개씩 똑같은 값들 120만개를 평균내서 손실값(loss)를 구하는 것이나, 서로 다른 1천개의 이미지당 하나씩 1000개의 값을 평균내서 손실값(loss)을 구하는 것이나 똑같다. 실제로는 당연히 중복된 데이터를 주지는 않겠지만, 미니배치(mini-batch)에서만 계산하는 그라디언트(gradient)는 모든 데이터를 써서 구하는 것의 근사값으로 괜찮게 쓰일 수 있을 것이다. 따라서, 미니배치(mini-batch)에서 그라디언트(gradient)를 구해서 더 자주자주 모수(parameter/weight)을 업데이트하면 실제로 더 빠른 수렴하게 된다.
+이 방법이 먹히는 이유는 학습데이터들의 예시들이 서로 상관관계가 있기 때문이다. 이것에 대해 알아보기 위해, ILSVRC의 120만개 이미지들이 사실은 1천개의 서로 다른 이미지들의 복사본이라는 극단적인 경우를 생각해보자. (즉, 한 클래스 당 하나이고, 이 하나가 1천2백번 복사된 것) 그러면 명백한 것은, 이 1천2백개의 이미지에서의 그라디언트(gradient)값 (역자 주: 이 1천2백개에 해당하는 $i$에 대한 $L_i$값)은 다 똑같다는 점이다. 그렇다면 이 1천2백개씩 똑같은 값들 120만개를 평균내서 손실값(loss)을 구하는 것이나, 서로 다른 1천개의 이미지당 하나씩 1000개의 값을 평균내서 손실값(loss)을 구하는 것이나 똑같다. 실제로는 당연히 중복된 데이터를 주지는 않겠지만, 미니배치(mini-batch)에서만 계산하는 그라디언트(gradient)는 모든 데이터를 써서 구하는 것의 근사값으로 괜찮게 쓰일 수 있을 것이다. 따라서, 미니배치(mini-batch)에서 그라디언트(gradient)를 구해서 더 자주자주 파라미터(parameter/weight)를 업데이트하면 실제로 더 빠르게 수렴하게 된다.

이 방법의 극단적인 형태는 미니배치(mini-batch)가 데이터 달랑 한개로 이루어졌을 때이다. 이는 **확률그라디언트하강(Stochastic Gradient Descent (SGD))** (혹은 **온라인** 그라디언트 하강)이라고 불린다. 이건 상대적으로 덜 보편적인데, 그 이유는 우리가 프로그램을 짤 때 계산을 벡터/행렬로 만들어서 하기 때문에, 한 예제에서 100번 계산하는 것보다 100개의 예제에서 1번 계산하는게 더 빠르기 때문이다. SGD가 엄밀한 의미에서는 예제 하나짜리 미니배치(mini-batch)에서 그라디언트(gradient)를 계산하는 것이지만, 많은 사람들이 그냥 MGD를 의미하면서 SGD라고 부르기도 한다. 혹은 드물게나마 배치 그라디언트 하강 (Batch gradient descent, BGD)이라고도 부른다. 미니배치(mini-batch)의 크기도 하이퍼파라미터(hyperparameter)이지만, 이것을 교차검증하는 일은 흔치 않다. 이건 대체로 컴퓨터 메모리 크기의 한계에 따라 결정되거나, 몇몇 특정값 (예를 들어, 32, 64, 128 같은 것)을 이용한다. 2의 제곱수를 이용하는 이유는 많은 벡터 계산이 2의 제곱수가 입력될 때 더 빠르기 때문이다.

@@ -337,7 +337,7 @@ while True:
- 정보 흐름 요약. (x,y)라는 고정된 데이터 쌍이 주어져 있다. 처음에는 무작위로 뽑은 모수(parameter/weight)값으로 시작해서 바꿔나간다. 왼쪽에서 오른쪽으로 가면서, 점수함수(score function)가 각 클래스의 점수를 계산하고 그 값이 f 벡터에 저장된다. 손실함수(loss function)는 두 부분으로 나뉘어 있다. 첫째, 데이터 손실(data loss)은 모수(parameter/weight)만으로 계산하는 함수이다. 그라디언트 하강(Gradient Descent) 과정에서, 모수(parameter/weight)로 미분한 (혹은 원한다면 데이터 값으로 추가로 미분한... ??? 역자 주: 이 괄호안의 내용은 무슨 소린지 모르겠음.) 그라디언트(gradient)를 계산하고, 이것을 이용해서 모수(parameter/weight)값을 업데이트한다. + 정보 흐름 요약. (x,y)라는 고정된 데이터 쌍이 주어져 있다. 처음에는 무작위로 뽑은 파라미터(parameter/weight)값으로 시작해서 바꿔나간다. 왼쪽에서 오른쪽으로 가면서, 스코어함수(score function)가 각 클래스의 점수를 계산하고 그 값이 f 벡터에 저장된다. 손실함수(loss function)는 두 부분으로 나뉘어 있다. 첫째, 데이터 손실(data loss)은 파라미터(parameter/weight)만으로 계산하는 함수이다. 그라디언트 하강(Gradient Descent) 과정에서, 파라미터(parameter/weight)로 미분한 (혹은 원한다면 데이터 값으로 추가로 미분한... ??? 역자 주: 이 괄호안의 내용은 무슨 소린지 모르겠음.) 그라디언트(gradient)를 계산하고, 이것을 이용해서 파라미터(parameter/weight)값을 업데이트한다.
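(역자 주: 앞에서 소개한 SVM 손실의 해석적 그라디언트 수식을 예제 하나에 대해 코드로 옮겨 보면 대략 다음과 같다. 아래의 함수 이름과 데이터 값은 본문에 없는 것으로, 설명을 위해 역자가 임의로 가정한 스케치이다.)

~~~python
import numpy as np

def svm_loss_and_grad(W, x, y, delta=1.0):
    """예제 하나 (x, y)에 대한 SVM 손실값과 해석적 그라디언트 dW를 계산."""
    scores = W.dot(x)
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0
    loss = np.sum(margins)

    # 본문의 1(...) 지시함수: 마진 조건을 넘긴 클래스만 1
    indicator = (margins > 0).astype(float)
    dW = np.zeros_like(W)
    for j in range(W.shape[0]):
        if j == y:
            dW[j] = -np.sum(indicator) * x  # 정답 클래스 행에 대한 수식
        else:
            dW[j] = indicator[j] * x        # 나머지 행
    return loss, dW

np.random.seed(1)
x = np.random.randn(3073)              # bias 트릭이 적용된 입력 (가정)
y = 2
W = np.random.randn(10, 3073) * 0.001
loss, dW = svm_loss_and_grad(W, x, y)
W_new = W - 1e-5 * dW                  # 그라디언트 반대 방향으로 작은 스텝
print(loss, svm_loss_and_grad(W_new, x, y)[0])  # 손실이 줄어드는 것을 기대할 수 있다
~~~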
@@ -347,8 +347,8 @@ while True:

- 손실함수(loss function)가 **고차원의 울퉁불퉁한 지형**이고, 이 지형에서 아래쪽으로 내려가는 것으로 직관적인 설명을 발전시켰다. 이에 대한 비유는 눈가린 등산객이 하산하는 것이었다. 특히, SVM의 손실함수(loss function)가 부분적으로 선형(linear)인 밥공기 모양이라는 것을 확인했다.
- 손실함수(loss function)를 최적화시킨다는 개념을, 아무 데서나 시작해서 한 걸음 한 걸음 더 나은 쪽으로 나아가며 최적화시킨다는 **반복적 개선**의 측면으로 운을 띄워봤고
- 함수의 **그라디언트(gradient)**는 그 함수값이 감소하는 가장 빠른 방향이라는 점을 알아봤고, 이것을 유한차이(finite difference, 즉 미분할 때 *h*의 값이 유한하다는 의미)를 이용하여 단순무식하게 수치적으로 어림잡아 계산하는 방법도 알아보았다.
-- 모수(parameter/weight)를 업데이트할 때, 한 번에 얼마나 움직여야하는지(혹은 **학습속도**)를 딱 맞게 설정하는 것이 까다로운 문제라는 것도 알아보았다. 이 값이 너무 낮으면 너무 느려지고, 너무 높으면 빨라지지만 위험한 점이 있다. 이 장단점에 대해 다음 섹션에서 자세하게 알아볼 것이다.
+- 파라미터(parameter/weight)를 업데이트할 때, 한 번에 얼마나 움직여야하는지(혹은 **학습속도**)를 딱 맞게 설정하는 것이 까다로운 문제라는 것도 알아보았다. 이 값이 너무 낮으면 너무 느려지고, 너무 높으면 빨라지지만 위험한 점이 있다. 이 장단점에 대해 다음 섹션에서 자세하게 알아볼 것이다.
- 그라디언트(gradient)를 계산할 때 **수치적**인 방법과 **해석적**인 방법의 장단점을 알아보았다. 수치적인 그라디언트(gradient)는 단순하지만, 근사값이고 비효율적이다. 해석적인 그라디언트(gradient)는 정확하고 빠르지만 손으로 계산해야 돼서 실수를 할 수 있다. 따라서 실제 응용에서는 해석적인 그라디언트(gradient)를 쓰고, **그라디언트 체크(gradient check)**라는 수치적인 그라디언트(gradient)와 비교/검증하는 과정을 거친다.
-- 반복적으로 뺑뺑이 돌려서 그라디언트(gradient)를 계산하고 모수(parameter/weight)를 업데이트하는 **그라디언트 하강 (Gradient Descent)** 알고리즘을 소개했다.
+- 반복적으로 뺑뺑이 돌려서 그라디언트(gradient)를 계산하고 파라미터(parameter/weight)를 업데이트하는 **그라디언트 하강 (Gradient Descent)** 알고리즘을 소개했다.

-**예고:** 이 섹션에서 핵심은, 손실함수(loss function)를 모수(parameter/weight)로 미분하여 그라디언트(gradient)를 계산하는 법(과 그에 대한 직관적인 이해)가 신경망(neural network)를 디자인하고 학습시키고 이해하는데 있어 가장 중요한 기술이라는 점이다. 다음 섹션에서는, 그라디언(gradient)를 해석적으로 구할 때 연쇄법칙을 이용한, **backpropagation**이라고도 불리는 효율적인 방법에 대해 알아보겠다. 이 방법을 쓰면 컨볼루션 신경망 (Convolutional Neural Networks)을 포함한 모든 종류의 신경망(Neural Networks)에서 쓰이는 상대적으로 일반적인 손실함수(loss function)를 효율적으로 최적화시킬 수 있다.
+**예고:** 이 섹션에서 핵심은, 손실함수(loss function)를 파라미터(parameter/weight)로 미분하여 그라디언트(gradient)를 계산하는 법(과 그에 대한 직관적인 이해)이 신경망(neural network)을 디자인하고 학습시키고 이해하는데 있어 가장 중요한 기술이라는 점이다. 다음 섹션에서는, 그라디언트(gradient)를 해석적으로 구할 때 연쇄법칙을 이용한, **backpropagation**이라고도 불리는 효율적인 방법에 대해 알아보겠다. 이 방법을 쓰면 컨볼루션 신경망 (Convolutional Neural Networks)을 포함한 모든 종류의 신경망(Neural Networks)에서 쓰이는 상대적으로 일반적인 손실함수(loss function)를 효율적으로 최적화시킬 수 있다.

From 28081bb5abf29e899e067d9f6f510fa913020bc9 Mon Sep 17 00:00:00 2001
From: JK Im
Date: Tue, 3 May 2016 10:40:18 -0500
Subject: [PATCH 093/199] Update optimization-1.md

---
 optimization-1.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/optimization-1.md b/optimization-1.md
index e24a0def..e8ab89c4 100644
--- a/optimization-1.md
+++ b/optimization-1.md
@@ -349,6 +349,6 @@ while True:
 - 함수의 **그라디언트(gradient)**는 그 함수값이 감소하는 가장 빠른 방향이라는 점을 알아봤고, 이것을 유한차이(finite difference, 즉 미분할 때 *h*의 값이 유한하다는 의미)를 이용하여 단순무식하게 수치적으로 어림잡아 계산하는 방법도 알아보았다.
 - 파라미터(parameter/weight)를 업데이트할 때, 한 번에 얼마나 움직여야하는지(혹은 **학습속도**)를 딱 맞게 설정하는 것이 까다로운 문제라는 것도 알아보았다. 이 값이 너무 낮으면 너무 느려지고, 너무 높으면 빨라지지만 위험한 점이 있다. 이 장단점에 대해 다음 섹션에서 자세하게 알아볼 것이다.
 - 그라디언트(gradient)를 계산할 때 **수치적**인 방법과 **해석적**인 방법의 장단점을 알아보았다. 수치적인 그라디언트(gradient)는 단순하지만, 근사값이고 비효율적이다. 해석적인 그라디언트(gradient)는 정확하고 빠르지만 손으로 계산해야 돼서 실수를 할 수 있다. 따라서 실제 응용에서는 해석적인 그라디언트(gradient)를 쓰고, **그라디언트 체크(gradient check)**라는 수치적인 그라디언트(gradient)와 비교/검증하는 과정을 거친다.
-- 반복적으로 뺑뺑이 돌려서 그라디언트(gradient)를 계산하고 파라미터(parameter/weight)를 업데이트하는 **그라디언트 하강 (Gradient Descent)** 알고리즘을 소개했다.
+- 반복적으로 루프(loop)를 돌려서 그라디언트(gradient)를 계산하고 파라미터(parameter/weight)를 업데이트하는 **그라디언트 하강 (Gradient Descent)** 알고리즘을 소개했다.

**예고:** 이 섹션에서 핵심은, 손실함수(loss function)를 파라미터(parameter/weight)로 미분하여 그라디언트(gradient)를 계산하는 법(과 그에 대한 직관적인 이해)이 신경망(neural network)을 디자인하고 학습시키고 이해하는데 있어 가장 중요한 기술이라는 점이다.
다음 섹션에서는, 그라디언트(gradient)를 해석적으로 구할 때 연쇄법칙을 이용한, **backpropagation**이라고도 불리는 효율적인 방법에 대해 알아보겠다. 이 방법을 쓰면 컨볼루션 신경망 (Convolutional Neural Networks)을 포함한 모든 종류의 신경망(Neural Networks)에서 쓰이는 상대적으로 일반적인 손실함수(loss function)를 효율적으로 최적화시킬 수 있다.

From c5da2de82c75303245d3aa6226734d6b8c3e1574 Mon Sep 17 00:00:00 2001
From: JK Im
Date: Tue, 3 May 2016 10:45:32 -0500
Subject: [PATCH 094/199] Update optimization-1.md

---
 optimization-1.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/optimization-1.md b/optimization-1.md
index e8ab89c4..f6b94015 100644
--- a/optimization-1.md
+++ b/optimization-1.md
@@ -337,7 +337,7 @@ while True:
- 정보 흐름 요약. (x,y)라는 고정된 데이터 쌍이 주어져 있다. 처음에는 무작위로 뽑은 파라미터(parameter/weight)값으로 시작해서 바꿔나간다. 왼쪽에서 오른쪽으로 가면서, 스코어함수(score function)가 각 클래스의 점수를 계산하고 그 값이 f 벡터에 저장된다. 손실함수(loss function)는 두 부분으로 나뉘어 있다. 첫째, 데이터 손실(data loss)은 파라미터(parameter/weight)만으로 계산하는 함수이다. 그라디언트 하강(Gradient Descent) 과정에서, 파라미터(parameter/weight)로 미분한 (혹은 원한다면 데이터 값으로 추가로 미분한... ??? 역자 주: 이 괄호안의 내용은 무슨 소린지 모르겠음.) 그라디언트(gradient)를 계산하고, 이것을 이용해서 파라미터(parameter/weight)값을 업데이트한다.
+ 정보 흐름 요약. (x,y)라는 고정된 데이터 쌍이 주어져 있다. 처음에는 무작위로 뽑은 파라미터(parameter/weight)값으로 시작해서 바꿔나간다. 왼쪽에서 오른쪽으로 가면서, 스코어함수(score function)가 각 클래스의 점수를 계산하고 그 값이 f 벡터에 저장된다. 손실함수(loss function)는 두 부분으로 나뉘어 있다. 데이터 손실(data loss)은 점수 f가 라벨 y와 얼마나 잘 맞는지를 계산하고, 정규화 손실(regularization loss)은 파라미터(parameter/weight)만으로 계산되는 함수이다. 그라디언트 하강(Gradient Descent) 과정에서, 파라미터(parameter/weight)로 미분한 (혹은 원한다면 데이터 값으로 추가로 미분한. 역자 주: 필요에 따라 데이터 값으로도 미분하는 경우가 있다고 함. 문맥상 몰라도 되는 듯.) 그라디언트(gradient)를 계산하고, 이것을 이용해서 파라미터(parameter/weight)값을 업데이트한다.
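(역자 주: 본문의 '실제 고려할 사항'에서 언급된 중심 차분(centered difference) 공식 $ [f(x+h) - f(x-h)] / 2h $ 를 이용한 수치적 그라디언트는, 위의 `eval_numerical_gradient`와 같은 구조로 다음과 같이 스케치해볼 수 있다. 함수 이름은 역자가 임의로 붙인 것이다.)

~~~python
import numpy as np

def eval_numerical_gradient_centered(f, x, h=1e-5):
    # 본문의 eval_numerical_gradient와 같은 구조이지만,
    # (f(x+h) - f(x-h)) / 2h 의 중심 차분을 쓴다.
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h
        fxph = f(x)            # f(x + h)
        x[ix] = old - h
        fxmh = f(x)            # f(x - h)
        x[ix] = old            # 원래 값 복원 (매우 중요!)
        grad[ix] = (fxph - fxmh) / (2 * h)
        it.iternext()
    return grad

# 사용 예: f(x) = sum(x^2) 이면 그라디언트는 2x 이다.
x = np.array([1.0, -2.0, 3.0])
print(eval_numerical_gradient_centered(lambda v: np.sum(v**2), x))  # 약 [2, -4, 6]
~~~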
From de537e69ba0fb2a935bb5cdd00945cb6cd4a0f2c Mon Sep 17 00:00:00 2001
From: OkminLee
Date: Wed, 4 May 2016 19:35:14 +0900
Subject: [PATCH 095/199] Update classification.md

---
 classification.md | 37 +++++++++++++++++++------------------
 1 file changed, 19 insertions(+), 18 deletions(-)

diff --git a/classification.md b/classification.md
index b9ce3e29..71bcf54b 100644
--- a/classification.md
+++ b/classification.md
@@ -6,7 +6,7 @@ permalink: /classification/

 본 강의노트는 컴퓨터비전 외의 분야를 공부하던 사람들에게 Image Classification(이미지 분류) 문제와, data-driven approach(데이터 기반 방법론)을 소개한다. 목차는 다음과 같다.

-- [영상 분류, 데이터 기반 방법론, 파이프라인](#intro)
+- [Image Classification(이미지 분류), data-driven approach(데이터 기반 방법론), pipeline(파이프라인)](#intro)
 - [Nearest Neighbor 분류기](#nn)
   - [k-Nearest Neighbor 알고리즘](#knn)
 - [Validation sets, Cross-validation, hyperparameter 튜닝](#val)
 - [Nearest Neighbor의 장단점](#procon)
 - [요약](#summary)
 - [요약: 실제 문제에 kNN 적용하기](#summaryapply)
 - [읽을 자료](#reading)
@@ -24,8 +24,8 @@ permalink: /classification/

**예시**. 예를 들어, 아래 그림의 이미지 분류 모델은 하나의 이미지를 입력받아 4개의 분류 가능한 라벨 *{cat, dog, hat, mug}* 각각에 확률을 매긴다. 그림에서 보다시피, 컴퓨터에서 이미지는 3차원 배열로 표현된다. 이 예시에서 고양이 이미지는 가로 248픽셀(모니터의 화면을 구성하는 최소 단위, 역자 주), 세로 400픽셀로 구성되어 있고 3개의 색상 채널이 있는데 각각 Red, Green, Blue(RGB)로 불린다. 따라서 이 이미지는 248 x 400 x 3개(총 297,600개)의 값으로 구성되어 있다. 각 값은 0~255 범위의 정수값이다. 이미지 분류 문제는 이 수많은 값들을 *"cat"* 이라는 하나의 라벨로 변경하는 것이다.
- -
The task in Image Classification is to predict a single label (or a distribution over labels as shown here to indicate our confidence) for a given image. Images are 3-dimensional arrays of integers from 0 to 255, of size Width x Height x 3. The 3 represents the three color channels Red, Green, Blue.
+ +
이미지 분류는 이미지가 주어졌을 때 그에 대한 라벨(또는 그림처럼 각 라벨에 대한 신뢰도를 나타내는 분포)을 예측하는 일이다. 이미지는 0~255 정수 범위의 값을 가지는 Width(너비) x Height(높이) x 3의 크기의 3차원 배열이다. 3은 Red, Green, Blue로 구성된 3개의 채널을 의미한다.
**문제**. 이미지를 분류하는 일(예를 들어 *"cat"*)이 사람에게는 대수롭지 않겠지만, 컴퓨터 비전의 관점에서 생각해보면 해결해야 하는 문제들이 있다. 아래에 서술된 문제들을 살펴볼 때, 이미지는 3차원 배열의 값으로 표현된다는 것을 염두에 두어야 한다.

좋은 이미지 분류기는 각 클래스 간의 감도를 유지하면서 동시에 이런 다양한 문제들에 대해 변함 없이 분류할 수 있는 성능을 유지해야 한다.
- +
**Data-driven approach(데이터 기반 방법론)**. 어떻게 하면 이미지를 각각의 카테고리로 분류하는 알고리즘을 작성할 수 있을까? 숫자를 정렬하는 알고리즘 작성과는 달리 고양이를 분별하는 알고리즘을 작성하는 것은 어렵다.
-그러므로, 코드를 통해 직접적으로 모든 것을 카테고리로 분류하기 보다는 좀 더 쉬운 방법을 사용할 것이다. 먼저 컴퓨터에게 각 클래스에 대해 많은 예제를 주고 나서 이 예제들을 보고 시각적으로 학습할 수 있는 학습 알고리즘을 개발한다.
- 이런 방법을 *data-driven approach(데이터 기반 아법론)* 이라고 한다. 이 방법은 라벨화가 된 이미지들 *training dataset(트레이닝 데이터 셋)* 이 처음 학습을 위해 필요하다. 아래 그림은 이런 데이터셋의 예이다:
+ 그러므로, 코드를 통해 직접적으로 모든 것을 카테고리로 분류하기 보다는 좀 더 쉬운 방법을 사용할 것이다. 먼저 컴퓨터에게 각 클래스에 대해 많은 예제를 주고 나서 이 예제들을 보고 시각적으로 학습할 수 있는 학습 알고리즘을 개발한다.
+ 이런 방법을 *data-driven approach(데이터 기반 방법론)* 이라고 한다. 이 방법을 위해서는 처음 학습 단계에서 라벨이 달린 이미지들, 즉 *training dataset(트레이닝 데이터 셋)* 이 필요하다. 아래 그림은 이런 데이터셋의 예이다.
@@ -76,7 +76,7 @@ permalink: /classification/

50,000개의 CIFAR-10 트레이닝 셋(하나의 라벨 당 5,000개의 이미지)이 주어진 상태에서 나머지 10,000개의 이미지에 대해 라벨화 하는 것을 가정해보자. 최근접 이웃 분류기는 테스트 이미지를 취해 모든 트레이닝 이미지와 비교를 하고 라벨 값을 예측할 것이다. 상단 이미지의 우측과 같이 10개의 테스트 이미지에 대한 결과를 확인할 수 있다. 10개의 이미지 중 3개만이 같은 클래스로 검색된 반면에, 7개의 이미지는 같은 클래스로 분류되지 않았다. 예를 들어, 8번째 행의 말 이미지에 대한 첫번째 최근접 학습 이미지는 붉은색의 차이다. 짐작건대 이 경우는 검은색 배경의 영향이 큰 듯 하다. 결과적으로, 이 말 이미지는 차로 잘못 분류될 것이다.

-두개의 이미지를 비교하는 정확한 방법을 아직 명시하지 않았는데, 이 경우에는 32 x 32 x 3 크기의 두 블록이다. 가장 간단한 방법 중 하나는 이미지를 각각의 픽셀값으로 비교하고, 그 차이를 더해 모두 더하는 것이다. 다시 말해서 두 개의 이미지가 주어지고 그 것들을 $$ I_1, I_2 $$ 벡터로 나타냈을 때, 벡터 간의 **L1 distance** 를 계산하는 것이 적절한 방법이다:
+두 개의 이미지를 비교하는 정확한 방법을 아직 명시하지 않았는데, 이 경우에는 32 x 32 x 3 크기의 두 블록이다. 가장 간단한 방법 중 하나는 이미지를 각각의 픽셀값으로 비교하고, 그 차이를 모두 더하는 것이다. 다시 말해서 두 개의 이미지가 주어지고 그것들을 $$ I_1, I_2 $$ 벡터로 나타냈을 때, 벡터 간의 **L1 distance(L1 거리)** 를 계산하는 것이 적절한 방법이다:

$$
d_1 (I_1, I_2) = \sum_{p} \left| I^p_1 - I^p_2 \right|
$$
An example of using pixel-wise differences to compare two images with L1 distance (for one color channel in this example). Two images are subtracted elementwise and then all differences are added up to a single number. If two images are identical the result will be zero. But if the images are very different the result will be large.
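(역자 주: 위 그림의 픽셀 단위 L1 거리 계산은 numpy로 옮기면 한 줄이면 충분하다. 아래의 작은 배열 값들은 설명을 위해 역자가 임의로 정한 것이다.)

~~~python
import numpy as np

# 임의로 만든 작은 '이미지' 두 장 (한 색상 채널만 가정)
I1 = np.array([[56, 32, 10], [90, 23, 128]])
I2 = np.array([[10, 20, 24], [ 8, 10,  89]])

d1 = np.sum(np.abs(I1 - I2))  # 픽셀별 차이의 절대값을 전부 더한 L1 거리
print(d1)  # 46+12+14 + 82+13+39 = 206
~~~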
-Let's also look at how we might implement the classifier in code. First, let's load the CIFAR-10 data into memory as 4 arrays: the training data/labels and the test data/labels. In the code below, `Xtr` (of size 50,000 x 32 x 32 x 3) holds all the images in the training set, and a corresponding 1-dimensional array `Ytr` (of length 50,000) holds the training labels (from 0 to 9):
+분류기를 코드 상에서 어떻게 구현하는지 살펴보자. 첫번째로 CIFAR-10 데이터를 4개의 배열을 통해 메모리로 불러온다. 각각은 트레이닝 데이터와 라벨, 테스트 데이터와 라벨이다. 아래 코드에서 `Xtr`(크기 50,000 x 32 x 32 x 3)은 트레이닝 셋의 모든 이미지를 저장하고 1차원 배열인 `Ytr`(길이 50,000)은 트레이닝 데이터의 라벨을 저장한다.

~~~python
-Xtr, Ytr, Xte, Yte = load_CIFAR10('data/cifar10/') # a magic function we provide
-# flatten out all images to be one-dimensional
-Xtr_rows = Xtr.reshape(Xtr.shape[0], 32 * 32 * 3) # Xtr_rows becomes 50000 x 3072
-Xte_rows = Xte.reshape(Xte.shape[0], 32 * 32 * 3) # Xte_rows becomes 10000 x 3072
+Xtr, Ytr, Xte, Yte = load_CIFAR10('data/cifar10/') # 제공되는 함수
+# 모든 이미지가 1차원 배열로 저장된다.
+Xtr_rows = Xtr.reshape(Xtr.shape[0], 32 * 32 * 3) # Xtr_rows는 50000 x 3072 크기의 배열.
+Xte_rows = Xte.reshape(Xte.shape[0], 32 * 32 * 3) # Xte_rows는 10000 x 3072 크기의 배열.
~~~

-Now that we have all images stretched out as rows, here is how we could train and evaluate a classifier:
+이제 모든 이미지를 배열의 각 행들로 얻었다. 아래는 분류기를 어떻게 학습시키고 평가하는지에 대한 코드이다:

~~~python
nn = NearestNeighbor() # create a Nearest Neighbor classifier class
nn.train(Xtr_rows, Ytr) # train the classifier on the training images and labels
Yte_predict = nn.predict(Xte_rows) # predict labels on the test images
print 'accuracy: %f' % ( np.mean(Yte_predict == Yte) )
~~~

-Notice that as an evaluation criterion, it is common to use the **accuracy**, which measures the fraction of predictions that were correct. Notice that all classifiers we will build satisfy this one common API: they have a `train(X,y)` function that takes the data and the labels to learn from. Internally, the class should build some kind of model of the labels and how they can be predicted from the data. And then there is a `predict(X)` function, which takes new data and predicts the labels. Of course, we've left out the meat of things - the actual classifier itself. Here is an implementation of a simple Nearest Neighbor classifier with the L1 distance that satisfies this template:
+일반적으로 평가 기준으로서 **accuracy(정확도)** 를 사용한다. 정확도는 예측값이 실제와 얼마나 일치하는지 그 비율을 측정한다. 앞으로 만들어 볼 모든 분류기는 공통적인 API를 갖는다: 데이터(X)와 데이터가 실제로 속하는 라벨(y)을 입력으로 받는 `train(X,y)` 형태의 함수가 있다. 내부적으로, 클래스는 라벨들에 대한 어떤 모델을 만들고 그 라벨들이 데이터로부터 어떻게 예측될 수 있는지를 학습해야 한다. 그 이후에 새로운 데이터로부터 라벨을 예측하는 `predict(X)` 형태의 함수가 있다. 물론, 아직은 실제로 분류기가 작동하는 부분은 빠져있다. 이제 L1 거리를 이용한 간단한 최근접 이웃 분류기에 대한 구현 방법을 소개한다:

~~~python
import numpy as np

class NearestNeighbor(object):
  def __init__(self):
    pass

  def train(self, X, y):
    """ X is N x D where each row is an example. Y is 1-dimension of size N """
    # the nearest neighbor classifier simply remembers all the training data
    self.Xtr = X
    self.ytr = y

  def predict(self, X):
    """ X is N x D where each row is an example we wish to predict label for """
    num_test = X.shape[0]
    # lets make sure that the output type matches the input type
    Ypred = np.zeros(num_test, dtype = self.ytr.dtype)

    # loop over all test rows
    for i in xrange(num_test):
      # find the nearest training image to the i'th test image
      # using the L1 distance (sum of absolute value differences)
      distances = np.sum(np.abs(self.Xtr - X[i,:]), axis = 1)
      min_index = np.argmin(distances) # get the index with smallest distance
      Ypred[i] = self.ytr[min_index] # predict the label of the nearest example

    return Ypred
~~~

-If you ran this code, you would see that this classifier only achieves **38.6%** on CIFAR-10. That's more impressive than guessing at random (which would give 10% accuracy since there are 10 classes), but nowhere near human performance (which is [estimated at about 94%](http://karpathy.github.io/2011/04/27/manually-classifying-cifar10/)) or near state-of-the-art Convolutional Neural Networks that achieve about 95%, matching human accuracy (see the [leaderboard](http://www.kaggle.com/c/cifar-10/leaderboard) of a recent Kaggle competition on CIFAR-10).
+이 코드를 실행해보면 이 분류기는 CIFAR-10에 대해 정확도가 **38.6%** 밖에 되지 않는다는 것을 확인할 수 있다.
임의로 답을 결정하는 것(10개의 클래스가 있을 때 10%의 정확도)보다는 낫지만, 사람의 성능([약 94%](http://karpathy.github.io/2011/04/27/manually-classifying-cifar10/))이나 최신 컨볼루션 신경망의 성능(약 95%)에는 훨씬 미치지 못한다(최근 Kaggle 대회 [순위표](http://www.kaggle.com/c/cifar-10/leaderboard) 참고).

-**The choice of distance.**
-There are many other ways of computing distances between vectors. Another common choice could be to instead use the **L2 distance**, which has the geometric interpretation of computing the euclidean distance between two vectors. The distance takes the form:
+**거리 선택**
+벡터 간의 거리를 계산하는 방법은 많다. 다른 일반적인 선택으로는 두 벡터 간의 유클리디안 거리를 계산하는 기하학적인 방법인 **L2 distance(L2 거리)** 의 사용을 고려해볼 수 있다. 이 거리는 아래의 식으로 얻는다:

$$
d_2 (I_1, I_2) = \sqrt{\sum_{p} \left( I^p_1 - I^p_2 \right)^2}
$$

Note that I included the `np.sqrt` call above, but in a practical nearest neighbor application we could leave out the square root operation because square root is a monotonic function. That is, it scales the absolute sizes of the distances but it preserves the ordering, so the nearest neighbors with or without it are identical.

**L1 vs. L2.** It is interesting to consider differences between the two metrics. In particular, the L2 distance is much more unforgiving than the L1 distance when it comes to differences between two vectors. That is, the L2 distance prefers many medium disagreements to one big one. L1 and L2 distances (or equivalently the L1/L2 norms of the differences between a pair of images) are the most commonly used special cases of a [p-norm](http://planetmath.org/vectorpnorm).

-### k - Nearest Neighbor Classifier
+
+## k - Nearest Neighbor Classifier

You may have noticed that it is strange to only use the label of the nearest image when we wish to make a prediction. Indeed, it is almost always the case that one can do better by using what's called a **k-Nearest Neighbor Classifier**. The idea is very simple: instead of finding the single closest image in the training set, we will find the top **k** closest images, and have them vote on the label of the test image. In particular, when *k = 1*, we recover the Nearest Neighbor classifier. Intuitively, higher values of **k** have a smoothing effect that makes the classifier more resistant to outliers:

From e64afda81c4df903350de73942857dd4e1941cb0 Mon Sep 17 00:00:00 2001
From: MaybeS
Date: Sun, 8 May 2016 16:59:22 +0900
Subject: [PATCH 096/199] Update cross-validate

---
 assignments2016/assignment1/knn.ipynb | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/assignments2016/assignment1/knn.ipynb b/assignments2016/assignment1/knn.ipynb
index eb90188b..5d35ff86 100644
--- a/assignments2016/assignment1/knn.ipynb
+++ b/assignments2016/assignment1/knn.ipynb
@@ -366,13 +366,15 @@
 "# 사전은 서로 다른 교차 검증을 실행할 때 찾은 k의 값에 대한 정확도를 가지고 있습니다.\n",
 "# k_to_accuracies[k] should be a list of length num_folds giving the different\n",
 "# accuracy values that we found when using that value of k.\n",
+"# k_to_accuracies[k]는 'num_folds' 길이의 리스트로 \n",
+"# 각기 다른 k 값을 사용할 때의 정확도를 담고 있습니다.\n",
 "k_to_accuracies = {}\n",
 "\n",
 "\n",
 "####################################################################################\n",
 "# TODO: #\n",
 "# 최고의 k 값을 찾기 위해 k-fold 교차 검증을 수행합니다. #\n",
-"# 가능한 각 k에 대해서, k-nearest-neighbor 알고리즘을 numpy의um_folds회 실행합니다.#\n",
+"# 가능한 각 k에 대해서, k-nearest-neighbor 알고리즘을 num_folds회 실행합니다. #\n",
 "# 각각의 경우에 모두 사용하되 그 중 하나는 훈련 데이터로, #\n",
 "# 마지막 하나는 검증 데이터로 사용합니다.
#\n", "####################################################################################\n", From 59685da95f1cad29b0cf946963de28b44253174d Mon Sep 17 00:00:00 2001 From: MaybeS Date: Sun, 8 May 2016 17:00:47 +0900 Subject: [PATCH 097/199] Remove Original lines --- assignments2016/assignment1/knn.ipynb | 2 -- 1 file changed, 2 deletions(-) diff --git a/assignments2016/assignment1/knn.ipynb b/assignments2016/assignment1/knn.ipynb index 5d35ff86..00b0ec4b 100644 --- a/assignments2016/assignment1/knn.ipynb +++ b/assignments2016/assignment1/knn.ipynb @@ -364,8 +364,6 @@ "################################################################################\n", "\n", "# 사전은 서로 다른 교차 검증을 실행할 때 찾은 k의 값에 대한 정확도를 가지고 있습니다.\n", - "# k_to_accuracies[k] should be a list of length num_folds giving the different\n", - "# accuracy values that we found when using that value of k.\n", "# k_to_accuracies[k]는 'num_folds' 길이의 리스트로 \n", "# 각기 다른 k 값을 사용할 때의 정확도를 담고있습니다.\n", "k_to_accuracies = {}\n", From 8d8613ee2149a76b939b8188580325fa8b1ae2f9 Mon Sep 17 00:00:00 2001 From: myungsub Date: Mon, 9 May 2016 12:04:36 +0900 Subject: [PATCH 098/199] translate main page --- index.html | 40 ++++++++++++++++++++-------------------- 1 file changed, 20 insertions(+), 20 deletions(-) diff --git a/index.html b/index.html index 7c935e25..9c57fb79 100644 --- a/index.html +++ b/index.html @@ -16,25 +16,25 @@ Glossary
-
Winter 2016 Assignments
+
Winter 2016 과제
@@ -60,7 +60,7 @@
--> -
Module 0: Preparation
+
Module 0: 준비
@@ -91,10 +91,10 @@
- 영상 분류: 데이터 기반 방법론, k-Nearest Neighbor, train/val/test 구분 + 이미지 분류: 데이터 기반 방법론, k-Nearest Neighbor, train/val/test 구분
- L1/L2 거리, hyperparameter 탐색, cross-validation + L1/L2 거리, hyperparameter 탐색, 교차검증(cross-validation)
@@ -103,7 +103,7 @@ 선형 분류: Support Vector Machine, Softmax
- parameteric approach, bias trick, hinge loss, cross-entropy loss, L2 regularization, web demo
+ parametric 접근법, bias 트릭, hinge loss, cross-entropy loss, L2 regularization, 웹 데모
@@ -112,7 +112,7 @@ 최적화: Stochastic Gradient Descent
- optimization landscapes, local search, learning rate, analytic/numerical gradient + 최적화 공간, 국소 탐색(local search), learning rate, analytic/numerical 그라디언트
@@ -121,16 +121,16 @@ Backpropagation, Intuition
- 연쇄 법칙 (chain rule) 해석, real-valued circuits, patterns in gradient flow + 연쇄 법칙 (chain rule) 해석, real-valued circuits, 그라디언트 흐름의 패턴
- 신경망 파트 1: Setting up the Architecture + 신경망 파트 1: 네트워크 구조 정하기
- 생물학적 뉴런 모델, activation functions, 신경망 구조, representational power + 생물학적 뉴런 모델, 활성 함수(activation functions), 신경망 구조, 표현력(representational power)
@@ -139,7 +139,7 @@ 신경망 파트 2: 데이터 준비 및 Loss
- 전처리, weight 초기값 설정, batch normalization, regularization (L2/dropout), loss 함수 + 전처리, weight 초기값 설정, 배치 정규화(batch normalization), regularization (L2/dropout), 손실함수
@@ -148,7 +148,7 @@ 신경망 파트 3: 학습 및 평가
- gradient checks, sanity checks, babysitting the learning process, momentum (+nesterov), second-order methods, Adagrad/RMSprop, hyperparameter optimization, model ensembles + 그라디언트 체크, 버그 점검, 학습 과정 모니터링, momentum (+nesterov), 2차(2nd-order) 방법, Adagrad/RMSprop, hyperparameter 최적화, 모델 ensemble
@@ -167,19 +167,19 @@
- Convolutional Neural Networks: Architectures, Convolution / Pooling Layers + 컨볼루션 신경망: 구조, Convolution / Pooling 레이어
- layers, spatial arrangement, layer patterns, layer sizing patterns, AlexNet/ZFNet/VGGNet case studies, computational considerations
+ 레이어(층)들, 공간적 배치, 레이어 패턴, 레이어 사이즈, AlexNet/ZFNet/VGGNet 사례 분석, 계산량에 관한 고려 사항들
- Understanding and Visualizing Convolutional Neural Networks + 컨볼루션 신경망 분석 및 시각화
- tSNE embeddings, deconvnets, data gradients, fooling ConvNets, human comparisons + tSNE embeddings, deconvnets, 데이터에 대한 그라디언트, ConvNet 속이기, 사람과의 비교
From fca15f217d0aacc1633a83577c9e6a0d2442f1a7 Mon Sep 17 00:00:00 2001 From: YB Date: Sun, 8 May 2016 23:31:40 -0400 Subject: [PATCH 099/199] Lecture1 - part 161~170 (out of 715) en / ko --- captions/En/Lecture1_en.srt | 20 ++++++++++---------- captions/Ko/Lecture1_ko.srt | 21 +++++++++++---------- 2 files changed, 21 insertions(+), 20 deletions(-) diff --git a/captions/En/Lecture1_en.srt b/captions/En/Lecture1_en.srt index 59667319..3d0bb97d 100644 --- a/captions/En/Lecture1_en.srt +++ b/captions/En/Lecture1_en.srt @@ -791,18 +791,18 @@ attributed to this amazing guy Leonardo Da Vinci. so before 161 00:18:19,220 --> 00:18:23,740 -other songs you know throughout human +Renaissance you know throughout human civilization from Asia to Europe to 162 00:18:23,740 --> 00:18:30,400 -India to Arabic world we have seen +India to Arabic world, we have seen models of cameras so Aristotle has 163 00:18:30,400 --> 00:18:36,360 -proposed the camera through the Leafs -Chinese philosopher moses have proposed +proposed the camera through the Leaves. +Chinese philosopher Mozi have proposed 164 00:18:36,359 --> 00:18:40,939 @@ -812,7 +812,7 @@ but 165 00:18:40,940 --> 00:18:47,750 if you look at the first documentation -really modern looking camera it's called +of really modern looking camera it's called 166 00:18:47,750 --> 00:18:49,180 @@ -820,22 +820,22 @@ camera obscura 167 00:18:49,180 --> 00:18:56,610 -and that is documented by Leonardo da da -Vinci I'm not gonna get into the details +and that is documented by Leonardo da +Vinci. I'm not gonna get into the details 168 00:18:56,609 --> 00:19:07,240 but this is you know you get the idea -that there is some kind of a whole to +that there is some kind of lens or at least a whole to 169 00:19:07,240 --> 00:19:12,240 -capture light reflected from the real +capture lights reflected from the real world and then there is some kind of 170 00:19:12,240 --> 00:19:20,319 -protection to capture the information of +projection to capture the information of the of the of the real-world image so 171 diff --git a/captions/Ko/Lecture1_ko.srt b/captions/Ko/Lecture1_ko.srt index 8f2832a4..3cff0561 100644 --- a/captions/Ko/Lecture1_ko.srt +++ b/captions/Ko/Lecture1_ko.srt @@ -659,43 +659,44 @@ 161 00:18:19,220 --> 00:18:23,740 - 다른 노래는 유럽에 아시아에서 인류 문명에 걸쳐 알고 + 르네상스 이전에도 인간 문명화의 과정에서 아시아, 유럽, 162 00:18:23,740 --> 00:18:30,400 - 아리스토텔레스는이 때문에 인도는 아랍어 세계에 우리는 카메라의 모델을 보았다 + 인도에서 아랍세계에 걸쳐 카메라의 모델들이 있어왔어요. + 아리스토텔레스는 163 00:18:30,400 --> 00:18:36,360 - 철학자 모세가 제안 잎을 중국어 통해 카메라를 제안 + 나뭇잎 잎사귀들을 통해서 보는 카메라를 제안했구요. 중국의 철학자 Mozi는 164 00:18:36,359 --> 00:18:40,939 - 전체와 함께 상자를 통해 카메라 만 + 구멍이 있는 상자를 통해 바라보는 카메라를 제안했어요. 165 00:18:40,940 --> 00:18:47,750 - 첫 번째 문서 정말 현대 찾고 카메라를 보면 그것이라고 + 하지만 처음으로 현대의 카메라와 닮은 카메라의 기록을 보면 166 00:18:47,750 --> 00:18:49,180 - 카메라 옵스큐라 + Camera Obscura 라는 카메라가 있어요. 167 00:18:49,180 --> 00:18:56,610 - 그리고 그 레오나르도 다빈치에 의해 설명되어 나는 세부 사항에 들어갈 거 아니에요 + 레오나르도 다빈치에 의해 기록되었는데 자세히 다루진 않겠어요. 168 00:18:56,609 --> 00:19:07,240 - 그러나 이것은 당신이 전체의 몇 가지 종류가 존재한다는 생각을 알고있다 + 그러나 이것은 카메라에 현실세계에서 반사된 빛을 포착하는 169 00:19:07,240 --> 00:19:12,240 - 캡처 빛이 현실 세계에서 반사 된 후 어떤 종류의가 + 렌즈 혹은 적어도 구멍이 있다는 것을 보여줍니다. 또한, 170 00:19:12,240 --> 00:19:20,319 - 보호되도록 실제 이미지의의 정보를 캡처 + 현실세계의 상에서 정보를 얻어 투영하는 과정이 이루어짐을 알 수 있습니다. 
171 00:19:20,319 --> 00:19:27,779 From fd70b794ffe399cd1b153b2a9fd32c0a4a31bf81 Mon Sep 17 00:00:00 2001 From: myungsub Date: Mon, 9 May 2016 13:52:06 +0900 Subject: [PATCH 100/199] Add Acknowledgement --- acknowledgement.md | 9 +++++++++ assignments2016/assignment1.md | 7 ++++++- assignments2016/assignment3.md | 7 ++++++- aws-tutorial.md | 7 ++++++- classification.md | 5 +++++ convolutional-networks-korean.md | 27 ++++++++++++++++----------- index.html | 8 ++++++-- ipython-tutorial.md | 5 +++++ neural-networks-2.kr.md | 21 +++++++++++++-------- optimization-1.md | 7 ++++++- terminal-tutorial.md | 7 ++++++- 11 files changed, 84 insertions(+), 26 deletions(-) create mode 100644 acknowledgement.md diff --git a/acknowledgement.md b/acknowledgement.md new file mode 100644 index 00000000..56053111 --- /dev/null +++ b/acknowledgement.md @@ -0,0 +1,9 @@ +--- +layout: page +mathjax: true +permalink: /acknowledgement/ +--- + +*(프로젝트 완료 시까지 임시 파일입니다)* + +다들 바쁘신 와중에 틈틈이 시간내어 번역 프로젝트에 참여해 주신 myungsub, sandrokim, ygchoi, alexseong, ckyun777, dolai, donghun, gnujoow, j-min, jaywhang, jazzsaxmafia, jihoonl, jslee, junghojin, juyong, kjw0612, maybe, okmin, rollis0825, salopge, sanghun, sora, stats2ml, sungjunhong 님께 이 자리를 빌려 감사 말씀을 드립니다. diff --git a/assignments2016/assignment1.md b/assignments2016/assignment1.md index 10a22c7d..fa7a4502 100644 --- a/assignments2016/assignment1.md +++ b/assignments2016/assignment1.md @@ -18,7 +18,7 @@ permalink: /assignments2016/assignment1/ ## 설치 여러분은 다음 두가지 방법으로 숙제를 시작할 수 있습니다: Terminal.com을 이용한 가상 환경 또는 로컬 환경. -### Termianl에서의 가상 환경. +### Terminal에서의 가상 환경. Terminal에는 우리의 수업을 위한 서브도메인이 만들어져 있습니다. [www.stanfordterminalcloud.com](https://www.stanfordterminalcloud.com) 계정을 등록하세요. 이번 숙제에 대한 스냅샷은 [여기](https://www.stanfordterminalcloud.com/snapshot/49f5a1ea15dc424aec19155b3398784d57c55045435315ce4f8b96b62819ef65)에서 찾아볼 수 있습니다. 만약 수업에 등록되었다면, TA(see Piazza for more information)에게 이 수업을 위한 Terminal 예산을 요구할 수 있습니다. 처음 스냅샷을 실행시키면, 수업을 위한 모든 것이 설치되어 있어서 바로 숙제를 시작할 수 있습니다. [여기](/terminal-tutorial)에 Terminal을 위한 간단한 튜토리얼을 작성해 뒀습니다. ### 로컬 환경 @@ -79,3 +79,8 @@ IPython Notebook **features.ipynb**을 사용하여 단순한 이미지 픽셀( ### Q6: 추가 과제: 뭔가 더 해보세요! (+10 points) 이번 과제와 관련된 다른 것들을 작성한 코드로 분석하고 연구해보세요. 예를 들어, 질문하고 싶은 흥미로운 질문이 있나요? 통찰력 있는 시각화를 작성할 수 있나요? 아니면 다른 재미있는 살펴볼 거리가 있나요? 또는 손실 함수(loss function)을 조금씩 변형해가며 실험해볼 수도 있을 것입니다. 만약 다른 멋있는 것을 시도해본다면 추가로 10 points를 얻을 수 있고 강의에 수행한 결과가 실릴 수 있습니다. + +--- +

+번역: 배지운 (MaybeS) +

diff --git a/assignments2016/assignment3.md b/assignments2016/assignment3.md index 2ce5ddd7..452f4767 100644 --- a/assignments2016/assignment3.md +++ b/assignments2016/assignment3.md @@ -18,7 +18,7 @@ permalink: assignments2016/assignment3/ ## 설치 다음 두가지 방법으로 숙제를 시작할 수 있습니다: Terminal.com을 이용한 가상 환경 또는 로컬 환경. -### Termianl에서의 가상 환경. +### Terminal에서의 가상 환경. Terminal에는 우리의 수업을 위한 서브도메인이 만들어져 있습니다. [www.stanfordterminalcloud.com](https://www.stanfordterminalcloud.com) 계정을 등록하세요. 이번 숙제에 대한 스냅샷은 [여기](https://www.stanfordterminalcloud.com/snapshot/49f5a1ea15dc424aec19155b3398784d57c55045435315ce4f8b96b62819ef65)에서 찾아볼 수 있습니다. 만약 수업에 등록되었다면, TA(see Piazza for more information)에게 이 수업을 위한 Terminal 예산을 요구할 수 있습니다. 처음 스냅샷을 실행시키면, 수업을 위한 모든 것이 설치되어 있어서 바로 숙제를 시작할 수 있습니다. [여기](/terminal-tutorial)에 Terminal을 위한 간단한 튜토리얼을 작성해 뒀습니다. ### 로컬 환경 @@ -81,3 +81,8 @@ IPython notebook `ImageGeneration.ipynb`에서는 미리 학습된 TinyImageNet ### Q5: 추가 과제: 뭔가 더 해보세요! (+10 points) 이번 과제에서 제공된 것들을 활용해서 무언가 멋있는 것들을 시도해볼 수 있을 것입니다. 과제에서 구현하지 않은 다른 방식으로 이미지들을 생성하는 방법이 있을 수도 있어요! + +--- +

+번역: 최명섭 (myungsub) +

diff --git a/aws-tutorial.md b/aws-tutorial.md index e2fa191e..ea753c8b 100644 --- a/aws-tutorial.md +++ b/aws-tutorial.md @@ -124,4 +124,9 @@ Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was enc - 인스턴스를 reboot/terminate 하면 `/mnt` 디렉토리의 자료는 소멸됩니다. - 추가 비용이 발생하지 않도록 작업이 완료되면 인스턴스를 stop해야합니다. GPU 인스턴스는 사용료가 높습니다. 예산을 현명하게 사용하는것을 권장합니다. 여러분의 작업이 완전히 끝났다면 인스턴스를 Terminate합니다. (디스크 공간 또한 과금이 됩니다. 만약 큰 용량의 디스크를 사용한다면 과금이 많이 될 수 있습니다.) - 'creating custom alarms'에서 인스턴스가 아무 작업을 하지 않을때 인스턴스를 stop하도록 설정할 수 있습니다. -- 만약 인스턴스의 큰 데이터베이스에 접근할 필요가 없거나 데이터베이스를 다운로드 하기위해서 인스턴스 작동을 원하지 않는다면 가장 좋은 방법은 AMI를 생성하고 인스턴스를 설정할 때 당신의 기기에 AMI를 연결하는 것 일것입니다. (이 작업은 AMI를 선택한 후에 인스턴스를 실행(launching) 하기 전에 설정해야합니다.) \ No newline at end of file +- 만약 인스턴스의 큰 데이터베이스에 접근할 필요가 없거나 데이터베이스를 다운로드 하기위해서 인스턴스 작동을 원하지 않는다면 가장 좋은 방법은 AMI를 생성하고 인스턴스를 설정할 때 당신의 기기에 AMI를 연결하는 것 일것입니다. (이 작업은 AMI를 선택한 후에 인스턴스를 실행(launching) 하기 전에 설정해야합니다.) + +--- +

+번역: 김우정 (gnujoow) +

diff --git a/classification.md b/classification.md index 71bcf54b..927b8dd8 100644 --- a/classification.md +++ b/classification.md @@ -289,3 +289,8 @@ Here are some (optional) links you may find interesting for further reading: - [A Few Useful Things to Know about Machine Learning](http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf), where especially section 6 is related but the whole paper is a warmly recommended reading. - [Recognizing and Learning Object Categories](http://people.csail.mit.edu/torralba/shortCourseRLOC/index.html), a short course of object categorization at ICCV 2005. + +--- +

+번역: 이옥민 (OkminLee) +

diff --git a/convolutional-networks-korean.md b/convolutional-networks-korean.md
index 4b218c7e..9e1ce5e4 100644
--- a/convolutional-networks-korean.md
+++ b/convolutional-networks-korean.md
@@ -1,6 +1,6 @@
 ---
 layout: page
-permalink: /convolutional-networks/
+permalink: /convolutional-networks-kr/
 ---

 Table of Contents:
@@ -31,7 +31,7 @@ CNN과 일반 신경망의 차이점은 무엇일까? CNN 아키텍쳐는 입력

 앞 장에서 보았듯이 신경망은 입력받은 벡터를 일련의 히든 레이어 (hidden layer) 를 통해 변형 (transform) 시킨다. 각 히든 레이어는 뉴런들로 이뤄져 있으며, 각 뉴런은 앞쪽 레이어 (previous layer)의 모든 뉴런과 연결되어 있다 (fully connected). 같은 레이어 내에 있는 뉴런들 끼리는 연결이 존재하지 않고 서로 독립적이다. 마지막 Fully-connected 레이어는 출력 레이어라고 불리며, 분류 문제에서 클래스 점수 (class score)를 나타낸다.

-일반 신경망은 이미지를 다루기에 적절하지 않다. CIFAR-10 데이터의 경우 각 이미지가 32x32x3 (가로,세로 32, 3개 컬러 채널)로 이뤄져 있어서 첫 번째 히든 레이어 내의 하나의 뉴런의 경우 32x32x3=3072개의 가중치가 필요하지만, 더 큰 이미지를 사용할 경우에는 같은 구조를 이용하는 것이 불가능하다. 예를 들어 200x200x3의 크기를 가진 이미지는 같은 뉴런에 대해 200x200x3=120,000개의 가중치를 필요로 하기 때문이다. 더욱이, 이런 뉴런이 레이어 내에 여러개 존재하므로 모수의 개수가 크게 증가하게 된다. 이와 같이 Fully-connectivity는 심한 낭비이며 많은 수의 모수는 곧 오버피팅(overfitting)으로 귀결된다.
+일반 신경망은 이미지를 다루기에 적절하지 않다. CIFAR-10 데이터의 경우 각 이미지가 32x32x3 (가로,세로 32, 3개 컬러 채널)로 이뤄져 있어서 첫 번째 히든 레이어 내의 하나의 뉴런의 경우 32x32x3=3072개의 가중치가 필요하지만, 더 큰 이미지를 사용할 경우에는 같은 구조를 이용하는 것이 불가능하다. 예를 들어 200x200x3의 크기를 가진 이미지는 같은 뉴런에 대해 200x200x3=120,000개의 가중치를 필요로 하기 때문이다. 더욱이, 이런 뉴런이 레이어 내에 여러 개 존재하므로 모수의 개수가 크게 증가하게 된다. 이와 같이 Fully-connectivity는 심한 낭비이며 많은 수의 모수는 곧 오버피팅(overfitting)으로 귀결된다.

 CNN은 입력이 이미지로 이뤄져 있다는 특징을 살려 좀 더 합리적인 방향으로 아키텍쳐를 구성할 수 있다. 특히 일반 신경망과 달리, CNN의 레이어들은 가로,세로,깊이의 3개 차원을 갖게 된다 ( 여기에서 말하는 깊이란 전체 신경망의 깊이가 아니라 액티베이션 볼륨 ( activation volume ) 에서의 3번째 차원을 이야기 함 ). 예를 들어 CIFAR-10 이미지는 32x32x3 (가로,세로,깊이) 의 차원을 갖는 입력 액티베이션 볼륨 (activation volume)이라고 볼 수 있다. 조만간 보겠지만, 하나의 레이어에 위치한 뉴런들은 일반 신경망과는 달리 앞 레이어의 전체 뉴런이 아닌 일부에만 연결이 되어 있다. CNN 아키텍쳐는 전체 이미지를 클래스 점수들로 이뤄진 하나의 벡터로 만들어주기 때문에 마지막 출력 레이어는 1x1x10(10은 CIFAR-10 데이터의 클래스 개수)의 차원을 가지게 된다. 이에 대한 그림은 아래와 같다:

@@ -70,7 +70,7 @@ CNN은 입력이 이미지로 이뤄져 있다는 특징을 살려 좀 더 합
- CNN 아키텍쳐의 액티베이션 (activation) 예제. 첫 볼륨은 로우 이미지(raw image)를 다루며, 마지막 볼륨은 클래스 점수들을 출력한다. 입/출력 사이의 액티베이션들은 그림의 각 열에 나타나 있다. 3차원 볼륨을 시각적으로 나타내기가 어렵기 때문에 각 행마다 볼륨들의 일부만 나타냈다. 마지막 레이어는 모든 클래스에 대한 점수를 나타내지만 여기에서는 상위 5개 클래스에 대한 점수와 레이블만 표시했다. 전체 웹 데모는 우리의 웹사이트 상단에 있다. 여기에서 사용된 아키텍쳐는 작은 VGG Net이다. + CNN 아키텍쳐의 액티베이션 (activation) 예제. 첫 볼륨은 로우 이미지(raw image)를 다루며, 마지막 볼륨은 클래스 점수들을 출력한다. 입/출력 사이의 액티베이션들은 그림의 각 열에 나타나 있다. 3차원 볼륨을 시각적으로 나타내기가 어렵기 때문에 각 행마다 볼륨들의 일부만 나타냈다. 마지막 레이어는 모든 클래스에 대한 점수를 나타내지만 여기에서는 상위 5개 클래스에 대한 점수와 레이블만 표시했다. 전체 웹 데모는 우리의 웹사이트 상단에 있다. 여기에서 사용된 아키텍쳐는 작은 VGG Net이다.
@@ -84,9 +84,9 @@ CONV 레이어는 CNN을 이루는 핵심 요소이다. CONV 레이어의 출력 **개요 및 직관적인 설명.** CONV 레이어의 모수(parameter)들은 일련의 학습가능한 필터들로 이뤄져 있다. 각 필터는 가로/세로 차원으로는 작지만 깊이 (depth) 차원으로는 전체 깊이를 아우른다. 포워드 패스 (forward pass) 때에는 각 필터를 입력 볼륨의 가로/세로 차원으로 슬라이딩 시키며 (정확히는 convolve 시키며) 2차원의 액티베이션 맵 (activation map)을 생성한다. 필터를 입력 위로 슬라이딩 시킬 때, 필터와 입력의 요소들 사이의 내적 연산 (dot product)이 이뤄진다. 직관적으로 설명하면, 이 신경망은 입력의 특정 위치의 특정 패턴에 대해 반응하는 (activate) 필터를 학습한다. 이런 액티베이션 맵 (activation map)을 깊이 (depth) 차원을 따라 쌓은 것이 곧 출력 볼륨이 된다. 그러므로 출력 볼륨의 각 요소들은 입력의 작은 영역만을 취급하고, 같은 액티베이션 맵 내의 뉴런들은 같은 모수들을 공유한다 (같은 필터를 적용한 결과이므로). 이제 이 과정에 대해 좀 더 깊이 파헤쳐보자. -**로컬 연결성 (Local connectivity).** 이미지와 같은 고차원 입력을 다룰 때에는, 현재 레이어의 한 뉴런을 이전 볼륨의 모든 뉴런들과 연결하는 것이 비 실용적이다. 대신에 우리는 레이어의 각 뉴런을 입력 볼륨의 로컬한 영역(local region)에만 연결할 것이다. 이 영역은 리셉티브 필드 (receptive field)라고 불리는 초모수 (hyperparameter) 이다. 깊이 차원 측면에서는 항상 입력 볼륨의 총 깊이를 다룬다 (가로/세로는 작은 영역을 보지만 깊이는 전체를 본다는 뜻). 공간적 차원 (가로/세로)와 깊이 차원을 다루는 방식이 다르다는 걸 기억하자. +**로컬 연결성 (Local connectivity).** 이미지와 같은 고차원 입력을 다룰 때에는, 현재 레이어의 한 뉴런을 이전 볼륨의 모든 뉴런들과 연결하는 것이 비 실용적이다. 대신에 우리는 레이어의 각 뉴런을 입력 볼륨의 로컬한 영역(local region)에만 연결할 것이다. 이 영역은 리셉티브 필드 (receptive field)라고 불리는 초모수 (hyperparameter) 이다. 깊이 차원 측면에서는 항상 입력 볼륨의 총 깊이를 다룬다 (가로/세로는 작은 영역을 보지만 깊이는 전체를 본다는 뜻). 공간적 차원 (가로/세로)와 깊이 차원을 다루는 방식이 다르다는 걸 기억하자. -*예제 1*. 예를 들어 입력 볼륨의 크기가 (CIFAR-10의 RGB 이미지와 같이) [32x32x3]이라고 하자. 만약 리셉티브 필드의 크기가 5x5라면, CONV 레이어의 각 뉴런은 입력 볼륨의 [5x5x3] 크기의 영역에 가중치 (weight)를 가하게 된다 (총 5x5x3=75 개 가중치). 입력 볼륨 (RGB 이미지)의 깊이가 3이므로 마지막 숫자가 3이 된다는 것을 기억하자. +*예제 1*. 예를 들어 입력 볼륨의 크기가 (CIFAR-10의 RGB 이미지와 같이) [32x32x3]이라고 하자. 만약 리셉티브 필드의 크기가 5x5라면, CONV 레이어의 각 뉴런은 입력 볼륨의 [5x5x3] 크기의 영역에 가중치 (weight)를 가하게 된다 (총 5x5x3=75 개 가중치). 입력 볼륨 (RGB 이미지)의 깊이가 3이므로 마지막 숫자가 3이 된다는 것을 기억하자. *예제 2*. 입력 볼륨의 크기가 [16x16x20]이라고 하자. 3x3 크기의 리셉티브 필드를 사용하면 CONV 레이어의 각 뉴런은 입력 볼륨과 3x3x20=180 개의 연결을 갖게 된다. 이번에도 입력 볼륨의 깊이가 20이므로 마지막 숫자가 20이 된다는 것을 기억하자. @@ -109,7 +109,7 @@ CONV 레이어는 CNN을 이루는 핵심 요소이다. CONV 레이어의 출력
- 공간적 배치에 관한 그림. 이 예제에서는 가로/세로 공간적 차원 중 하나만 고려한다 (x축). 리셉티브 필드 F=3, 입력 사이즈 W=5, 제로 패딩 P=1. : 뉴런들이 stride S=1을 갖고 배치된 경우, 출력 사이즈는 (5-3+2)/1 +1 = 5이다. : stride S=2인 경우 (5-3+2)/2 + 1 = 3의 출력 사이즈를 가진다. Stride S=3은 사용할 수 없다. (5-3+2) = 4가 3으로 나눠지지 않기 때문에 출력 볼륨의 뉴런들이 깔끔히 배치되지 않는다. + 공간적 배치에 관한 그림. 이 예제에서는 가로/세로 공간적 차원 중 하나만 고려한다 (x축). 리셉티브 필드 F=3, 입력 사이즈 W=5, 제로 패딩 P=1. : 뉴런들이 stride S=1을 갖고 배치된 경우, 출력 사이즈는 (5-3+2)/1 +1 = 5이다. : stride S=2인 경우 (5-3+2)/2 + 1 = 3의 출력 사이즈를 가진다. Stride S=3은 사용할 수 없다. (5-3+2) = 4가 3으로 나눠지지 않기 때문에 출력 볼륨의 뉴런들이 깔끔히 배치되지 않는다. 이 예에서 뉴런들의 가중치는 [1,0,-1] (가장 오른쪽) 이며 bias는 0이다. 이 가중치는 노란 뉴런들 모두에게 공유된다 (아래에서 parameter sharing에 대해 살펴보라).
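(역자 주: 위 caption에서 쓰인 출력 사이즈 공식 (W - F + 2P)/S + 1 은 간단한 함수로 직접 확인해볼 수 있다. 아래 코드는 역자가 덧붙인 스케치이다.)

~~~python
def conv_output_size(W, F, S, P):
    # 출력 사이즈 공식 (W - F + 2P)/S + 1.
    # 나누어 떨어지지 않으면 뉴런들이 깔끔하게 배치될 수 없다.
    n = W - F + 2 * P
    assert n % S == 0, 'stride %d 는 이 설정과 맞지 않는다' % S
    return n // S + 1

print(conv_output_size(W=5, F=3, S=1, P=1))  # 5
print(conv_output_size(W=5, F=3, S=2, P=1))  # 3
# conv_output_size(W=5, F=3, S=3, P=1)       # AssertionError: 4는 3으로 나눠지지 않음
~~~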
@@ -122,7 +122,7 @@ CONV 레이어는 CNN을 이루는 핵심 요소이다. CONV 레이어의 출력 **파라미터 공유**. 파라미터 공유 기법은 컨볼루션 레이어의 파라미터 개수를 조절하기 위해 사용된다. 위의 실제 예제에서 보았듯, 첫 번째 컨볼루션 레이어에는 55\*55\*96 = 290,400 개의 뉴런이 있고 각각의 뉴런은 11\*11\*3 = 363개의 가중치와 1개의 바이어스를 가진다. 첫 번째 컨볼루션 레이어만 따져도 총 파라미터 개수는 290400*364=105,705,600개가 된다. 분명히 이 숫자는 너무 크다. -사실 적절한 가정을 통해 파라미터 개수를 크게 줄이는 것이 가능하다: (x,y)에서 어떤 patch feature가 유용하게 사용되었다면, 이 feature는 다른 위치 (x2,y2)에서도 유용하게 사용될 수 있다. 3차원 볼륨의 한 슬라이스 (깊이 차원으로 자른 2차원 슬라이스) 를 **depth slice**라고 하자 ([55x55x96] 사이즈의 볼륨은 각각 [55x55]의 크기를 가진 96개의 depth slice임). 앞으로는 각 depth slice 내의 뉴런들이 같은 가중치와 바이어스를 가지도록 제한할 것이다. 이런 파라미터 공유 기법을 사용하면, 예제의 첫 번째 컨볼루션 레이어는 (depth slice 당) 96개의 고유한 가중치를 가져서 총 96\*11\*11\*3 = 34,848개의 고유한 가중치, 또는 바이어스를 합쳐서 34,944개의 파라미터를 갖게 된다. 또는 각 depth slice에 존재하는 55*55개의 뉴런들은 모두 같은 파라미터를 사용하게 된다. 실제로는 backpropagation 과정에서 각 depth slice 내의 모든 뉴런들이 가중치에 대한 gradient를 계산하겠지만, 가중치 업데이트 할 때에는 이 gradient들을 합해 사용한다. +사실 적절한 가정을 통해 파라미터 개수를 크게 줄이는 것이 가능하다: (x,y)에서 어떤 patch feature가 유용하게 사용되었다면, 이 feature는 다른 위치 (x2,y2)에서도 유용하게 사용될 수 있다. 3차원 볼륨의 한 슬라이스 (깊이 차원으로 자른 2차원 슬라이스) 를 **depth slice**라고 하자 ([55x55x96] 사이즈의 볼륨은 각각 [55x55]의 크기를 가진 96개의 depth slice임). 앞으로는 각 depth slice 내의 뉴런들이 같은 가중치와 바이어스를 가지도록 제한할 것이다. 이런 파라미터 공유 기법을 사용하면, 예제의 첫 번째 컨볼루션 레이어는 (depth slice 당) 96개의 고유한 가중치를 가져서 총 96\*11\*11\*3 = 34,848개의 고유한 가중치, 또는 바이어스를 합쳐서 34,944개의 파라미터를 갖게 된다. 또는 각 depth slice에 존재하는 55*55개의 뉴런들은 모두 같은 파라미터를 사용하게 된다. 실제로는 backpropagation 과정에서 각 depth slice 내의 모든 뉴런들이 가중치에 대한 gradient를 계산하겠지만, 가중치 업데이트 할 때에는 이 gradient들을 합해 사용한다. 한 depth slice내의 모든 뉴런들이 같은 가중치 벡터를 갖기 때문에 컨볼루션 레이어의 forward pass는 입력 볼륨과 가중치 간의 **컨볼루션**으로 계산될 수 있다 (컨볼루션 레이어라는 이름이 붙은 이유). 그러므로 컨볼루션 레이어의 가중치는 **필터(filter)** 또는 **커널(kernel)**이라고 부른다. 컨볼루션의 결과물은 **액티베이션 맵(activation map, [55x55] 사이즈)** 이 되며 각 깊이에 해당하는 필터의 액티베이션 맵들을 쌓으면 최종 출력 볼륨 ([55x55x96] 사이즈) 가 된다. @@ -135,7 +135,7 @@ CONV 레이어는 CNN을 이루는 핵심 요소이다. CONV 레이어의 출력 가끔은 파라미터 sharing에 대한 가정이 부적절할 수도 있다. 특히 입력 이미지가 중심을 기준으로 찍힌 경우 (예를 들면 이미지 중앙에 얼굴이 있는 이미지), 이미지의 각 영역에 대해 완전히 다른 feature들이 학습되어야 할 수 있다. 눈과 관련된 feature나 머리카락과 관련된 feature 등은 서로 다른 영역에서 학습될 것이다. 이런 경우에는 파라미터 sharing 기법을 접어두고 대신 **Locally-Connected Layer**라는 레이어를 사용하는 것이 좋다. -**Numpy 예제.** 위에서 다룬 것들을 더 확실히 알아보기 위해 코드를 작성해보자. 입력 볼륨을 numpy 배열 `X`라고 하면: +**Numpy 예제.** 위에서 다룬 것들을 더 확실히 알아보기 위해 코드를 작성해보자. 입력 볼륨을 numpy 배열 `X`라고 하면: - A *depth column* at position `(x,y)` would be the activations `X[x,y,:]`. - `(x,y)`위치에서의 *depth column*은 액티베이션 `X[x,y,:]`이 된다. - A *depth slice*, or equivalently an *activation map* at depth `d` would be the activations `X[:,:,d]`. @@ -176,7 +176,7 @@ Numpy에서 `*`연산은 두 배열 간의 elementwise 곱셈이라는 것을 흔한 Hyperparameter기본 세팅은 $$F = 3, S = 1, P = 1$$이다. 뒤에서 다룰 [ConvNet architectures](#architectures)에서 hyperparameter 세팅과 관련된 법칙이나 방식 등을 확인할 수 있다. -**컨볼루션 데모**. 아래는 컨볼루션 레이어 데모이다. 3차원 볼륨은 시각화하기 힘드므로 각 행마다 depth slice를 하나씩 배치했다. 각 볼륨은 입력 볼륨(파란색), 가중치 볼륨(빨간색), 출력 볼륨(녹색)으로 이뤄진다. 입력 볼륨의 크기는 $$W_1 = 5, H_1 = 5, D_1 = 3$$이고 컨볼루션 레이어의 파라미터들은 $$K = 2, F = 3, S = 2, P = 1$$이다. 즉, 2개의 $$3 \times 3$$크기의 필터가 각각 stride 2마다 적용된다. 그러므로 출력 볼륨의 spatial 크기 (가로/세로)는 (5 - 3 + 2)/2 + 1 = 3이다. 제로 패딩 $$P = 1$$ 이 적용되어 입력 볼륨의 가장자리가 모두 0으로 되어있다는 것을 확인할 수 있다. 아래의 영상에서 하이라이트 표시된 입력(파란색)과 필터(빨간색)이 elementwise로 곱해진 뒤 하나로 더해지고 bias가 더해지는걸 볼 수 있다. +**컨볼루션 데모**. 아래는 컨볼루션 레이어 데모이다. 3차원 볼륨은 시각화하기 힘드므로 각 행마다 depth slice를 하나씩 배치했다. 각 볼륨은 입력 볼륨(파란색), 가중치 볼륨(빨간색), 출력 볼륨(녹색)으로 이뤄진다. 입력 볼륨의 크기는 $$W_1 = 5, H_1 = 5, D_1 = 3$$이고 컨볼루션 레이어의 파라미터들은 $$K = 2, F = 3, S = 2, P = 1$$이다. 즉, 2개의 $$3 \times 3$$크기의 필터가 각각 stride 2마다 적용된다. 
그러므로 출력 볼륨의 spatial 크기 (가로/세로)는 (5 - 3 + 2)/2 + 1 = 3이다. 제로 패딩 $$P = 1$$ 이 적용되어 입력 볼륨의 가장자리가 모두 0으로 되어있다는 것을 확인할 수 있다. 아래의 영상에서 하이라이트 표시된 입력(파란색)과 필터(빨간색)가 elementwise로 곱해진 뒤 하나로 더해지고 bias가 더해지는 것을 볼 수 있다.
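(역자 주: 데모에서 일어나는 계산 한 스텝을 numpy로 직접 확인해 보면 다음과 같다. 제로 패딩이 적용된 입력과 필터의 값은 설명을 위해 임의로 생성한 것이다.)

~~~python
import numpy as np

np.random.seed(0)
X = np.random.randn(7, 7, 3)   # P=1 제로 패딩이 이미 적용된 5x5x3 입력이라고 가정
W0 = np.random.randn(3, 3, 3)  # F=3 필터 하나
b0 = 1.0

# stride S=2 이므로 출력 액티베이션 맵은 3x3. 그 중 두 원소를 직접 계산:
v00 = np.sum(X[0:3, 0:3, :] * W0) + b0  # 출력의 (0,0)
v01 = np.sum(X[0:3, 2:5, :] * W0) + b0  # 출력의 (0,1): 가로로 stride 2만큼 이동
print(v00, v01)
~~~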
@@ -218,7 +218,7 @@ CNN 구조 내에 컨볼루션 레이어들 중간중간에 주기적으로 풀
- 풀링 레이어는 입력 볼륨의 각 depth slice를 spatial하게 downsampling한다. 좌: 이 예제에서는 입력 볼륨이 [224x224x64]이며 필터 크기 2, stride 2로 풀링해 [112x112x64] 크기의 출력 볼륨을 만든다. 볼륨의 depth는 그대로 유지된다는 것을 기억하자. Right: 가장 널리 쓰이는 max 풀링. 2x2의 4개 숫자에 대해 max를 취하게된다.
+ 풀링 레이어는 입력 볼륨의 각 depth slice를 spatial하게 downsampling한다. 좌: 이 예제에서는 입력 볼륨이 [224x224x64]이며 필터 크기 2, stride 2로 풀링해 [112x112x64] 크기의 출력 볼륨을 만든다. 볼륨의 depth는 그대로 유지된다는 것을 기억하자. 우: 가장 널리 쓰이는 max 풀링. 2x2의 4개 숫자에 대해 max를 취하게 된다.
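(역자 주: 위 caption의 2x2, stride 2 max 풀링을 numpy로 옮기면 다음과 같다. 입력 배열의 값은 설명을 위해 역자가 임의로 정한 것이다.)

~~~python
import numpy as np

X = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)

# 2x2 블록마다 최대값을 취하는 max 풀링 (필터 크기 2, stride 2)
H, W = X.shape
out = X.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))
print(out)  # [[6. 8.] [3. 4.]]
~~~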
@@ -226,7 +226,7 @@ CNN 구조 내에 컨볼루션 레이어들 중간중간에 주기적으로 풀 **최근의 발전된 내용들**. -- [Fractional Max-Pooling](http://arxiv.org/abs/1412.6071) 2x2보다 더 작은 필터들로 풀링하는 방식. 1x1, 1x2, 2x1, 2x2 크기의 필터들을 임의로 조합해 풀링한다. 매 forward pass마다 grid들이 랜덤하게 생성되고, 테스트 때에는 여러 grid들의 예측 점수들의 평균치를 사용하게 된다. +- [Fractional Max-Pooling](http://arxiv.org/abs/1412.6071) 2x2보다 더 작은 필터들로 풀링하는 방식. 1x1, 1x2, 2x1, 2x2 크기의 필터들을 임의로 조합해 풀링한다. 매 forward pass마다 grid들이 랜덤하게 생성되고, 테스트 때에는 여러 grid들의 예측 점수들의 평균치를 사용하게 된다. - [Striving for Simplicity: The All Convolutional Net](http://arxiv.org/abs/1412.6806) 라는 논문은 컨볼루션 레이어만 반복하며 풀링 레이어를 사용하지 않는 방식을 제안한다. Representation의 크기를 줄이기 위해 가끔씩 큰 stride를 가진 컨볼루션 레이어를 사용한다. 풀링 레이어가 보통 representation의 크기를 심하게 줄이기 때문에 (이런 효과는 작은 데이터셋에서만 오버피팅 방지 효과 등으로 인해 도움이 됨), 최근 추세는 점점 풀링 레이어를 사용하지 않는 쪽으로 발전하고 있다. @@ -383,3 +383,8 @@ Additional resources related to implementation: - [Caffe](http://caffe.berkeleyvision.org/), one of the most popular ConvNet libraries. - [Example Torch 7 ConvNet](https://github.com/nagadomi/kaggle-cifar10-torch7) that achieves 7% error on CIFAR-10 with a single model - [Ben Graham's Sparse ConvNet](https://www.kaggle.com/c/cifar-10/forums/t/10493/train-you-very-own-deep-convolutional-network/56310) package, which Ben Graham used to great success to achieve less than 4% error on CIFAR-10. + +--- +

+번역: 김택수 (jazzsaxmafia) +

diff --git a/index.html b/index.html index 9c57fb79..b659eedf 100644 --- a/index.html +++ b/index.html @@ -135,7 +135,7 @@
- + 신경망 파트 2: 데이터 준비 및 Loss
@@ -166,7 +166,7 @@
Module 2: Convolutional Neural Networks
diff --git a/ipython-tutorial.md b/ipython-tutorial.md index f448b95b..33e1a7da 100644 --- a/ipython-tutorial.md +++ b/ipython-tutorial.md @@ -55,3 +55,8 @@ IPython notebook은 여러개의 **cell**들로 이루어져있습니다. 각각
지금 까지 IPyhton의 사용법에 대해서 알아보았습니다. 간략한 내용이지만 위 내용들을 잘 숙지하면 무리없이과제를 진행할 수 있습니다. + +--- +

+번역: 김우정 (gnujoow) +

diff --git a/neural-networks-2.kr.md b/neural-networks-2.kr.md index d85f7793..f2e7ebe5 100644 --- a/neural-networks-2.kr.md +++ b/neural-networks-2.kr.md @@ -164,7 +164,7 @@ p = 0.5 # probability of keeping a unit active. higher = less dropout def train_step(X): """ X contains the data """ - + # forward pass for example 3-layer neural network H1 = np.maximum(0, np.dot(W1, X) + b1) U1 = np.random.rand(*H1.shape) < p # first dropout mask @@ -173,10 +173,10 @@ def train_step(X): U2 = np.random.rand(*H2.shape) < p # second dropout mask H2 *= U2 # drop! out = np.dot(W3, H2) + b3 - + # backward pass: compute gradients... (not shown) # perform parameter update... (not shown) - + def predict(X): # ensembled forward pass H1 = np.maximum(0, np.dot(W1, X) + b1) * p # NOTE: scale the activations @@ -184,14 +184,14 @@ def predict(X): out = np.dot(W3, H2) + b3 ~~~ -In the code above, inside the `train_step` function we have performed dropout twice: on the first hidden layer and on the second hidden layer. It is also possible to perform dropout right on the input layer, in which case we would also create a binary mask for the input `X`. The backward pass remains unchanged, but of course has to take into account the generated masks `U1,U2`. +In the code above, inside the `train_step` function we have performed dropout twice: on the first hidden layer and on the second hidden layer. It is also possible to perform dropout right on the input layer, in which case we would also create a binary mask for the input `X`. The backward pass remains unchanged, but of course has to take into account the generated masks `U1,U2`. Crucially, note that in the `predict` function we are not dropping anymore, but we are performing a scaling of both hidden layer outputs by $p$. This is important because at test time all neurons see all their inputs, so we want the outputs of neurons at test time to be identical to their expected outputs at training time. For example, in case of $p = 0.5$, the neurons must halve their outputs at test time to have the same output as they had during training time (in expectation). To see this, consider an output of a neuron $x$ (before dropout). With dropout, the expected output from this neuron will become $px + (1-p)0$, because the neuron's output will be set to zero with probability $1-p$. At test time, when we keep the neuron always active, we must adjust $x \rightarrow px$ to keep the same expected output. It can also be shown that performing this attenuation at test time can be related to the process of iterating over all the possible binary masks (and therefore all the exponentially many sub-networks) and computing their ensemble prediction. The undesirable property of the scheme presented above is that we must scale the activations by $p$ at test time. Since test-time performance is so critical, it is always preferable to use **inverted dropout**, which performs the scaling at train time, leaving the forward pass at test time untouched. Additionally, this has the appealing property that the prediction code can remain untouched when you decide to tweak where you apply dropout, or if at all. Inverted dropout looks as follows: ~~~python -""" +""" Inverted Dropout: Recommended implementation example. We drop and scale at train time and don't do anything at test time. """ @@ -207,10 +207,10 @@ def train_step(X): U2 = (np.random.rand(*H2.shape) < p) / p # second dropout mask. Notice /p! H2 *= U2 # drop! out = np.dot(W3, H2) + b3 - + # backward pass: compute gradients... 
(not shown) # perform parameter update... (not shown) - + def predict(X): # ensembled forward pass H1 = np.maximum(0, np.dot(W1, X) + b1) # no scaling necessary @@ -257,7 +257,7 @@ $$ L_i = \sum_j \max(0, 1 - y_{ij} f_j) $$ -where the sum is over all categories $j$, and $y_{ij}$ is either +1 or -1 depending on whether the i-th example is labeled with the j-th attribute, and the score vector $f_j$ will be positive when the class is predicted to be present and negative otherwise. Notice that loss is accumulated if a positive example has score less than +1, or when a negative example has score greater than -1. +where the sum is over all categories $j$, and $y_{ij}$ is either +1 or -1 depending on whether the i-th example is labeled with the j-th attribute, and the score vector $f_j$ will be positive when the class is predicted to be present and negative otherwise. Notice that loss is accumulated if a positive example has score less than +1, or when a negative example has score greater than -1. An alternative to this loss would be to train a logistic regression classifier for every attribute independently. A binary logistic regression classifier has only two classes (0,1), and calculates the probability of class 1 as: @@ -306,3 +306,8 @@ In summary: - We discussed different tasks you might want to perform in practice, and the most common loss functions for each task We've now preprocessed the data and set up and initialized the model. In the next section we will look at the learning process and its dynamics. + +--- +

+번역: 서종한 (salopge) +

diff --git a/optimization-1.md b/optimization-1.md index f6b94015..1348f40e 100644 --- a/optimization-1.md +++ b/optimization-1.md @@ -323,7 +323,7 @@ bat while True: data_batch = sample_training_data(data, 256) # 예제 256개짜리 미니배치(mini-batch) weights_grad = evaluate_gradient(loss_fun, data_batch, weights) - weights += - step_size * weights_grad # 파라미터 업데이트(parameter update) + weights += - step_size * weights_grad # 파라미터 업데이트(parameter update) ~~~ 이 방법이 먹히는 이유는 학습데이터들의 예시들이 서로 상관관계가 있기 때문이다. 이것에 대해 알아보기위해, ILSVRC의 120만개 이미지들이 사실은 1천개의 서로 다른 이미지들의 복사본이라는 극단적인 경우를 생각해보자. (즉, 한 클래스 당 하나이고, 이 하나가 1천2백번 복사된 것) 그러면 명백한 것은, 이 1천2백개의 이미지에서의 그라디언트(gradient)값 (역자 주: 이 1천2백개에 해당하는 $i$에 대한 $L_i$값)은 다 똑같다는 점이다. 그렇다면 이 1천2백개씩 똑같은 값들 120만개를 평균내서 손실값(loss)를 구하는 것이나, 서로 다른 1천개의 이미지당 하나씩 1000개의 값을 평균내서 손실값(loss)을 구하는 것이나 똑같다. 실제로는 당연히 중복된 데이터를 주지는 않겠지만, 미니배치(mini-batch)에서만 계산하는 그라디언트(gradient)는 모든 데이터를 써서 구하는 것의 근사값으로 괜찮게 쓰일 수 있을 것이다. 따라서, 미니배치(mini-batch)에서 그라디언트(gradient)를 구해서 더 자주자주 파라미터(parameter/weight)을 업데이트하면 실제로 더 빠른 수렴하게 된다. @@ -352,3 +352,8 @@ while True: - 반복적으로 루프(loop)를 돌려서 그라디언트(gradient)를 계산하고 파라미터(parameter/weight)를 업데이트하는 **그라디언트 하강 (Gradient Descent)** 알고리즘을 소개했다. **예고:** 이 섹션에서 핵심은, 손실함수(loss function)를 파라미터(parameter/weight)로 미분하여 그라디언트(gradient)를 계산하는 법(과 그에 대한 직관적인 이해)가 신경망(neural network)를 디자인하고 학습시키고 이해하는데 있어 가장 중요한 기술이라는 점이다. 다음 섹션에서는, 그라디언(gradient)를 해석적으로 구할 때 연쇄법칙을 이용한, **backpropagation**이라고도 불리는 효율적인 방법에 대해 알아보겠다. 이 방법을 쓰면 컨볼루션 신경망 (Convolutional Neural Networks)을 포함한 모든 종류의 신경망(Neural Networks)에서 쓰이는 상대적으로 일반적인 손실함수(loss function)를 효율적으로 최적화시킬 수 있다. + +--- +

+번역: 임준구 (stats2ml) +

diff --git a/terminal-tutorial.md b/terminal-tutorial.md index 4c01c5b8..7f8261b4 100644 --- a/terminal-tutorial.md +++ b/terminal-tutorial.md @@ -35,4 +35,9 @@ terminal에 Jupyter Notebook과 다른 필요요소들이 설치되어 있습니 [Terminal](https://www.stanfordterminalcloud.com)에 대한 더 많은 정보를 원하시면 [FAQ](https://www.stanfordterminalcloud.com/faq)페이지를 방문해주세요 -**중요** 터미널 사용시 사용하는 인스턴스 타입에 따라 시간당 사용요금이 부과됩니다. 미디엄 타입의 인스턴스 요금은 시간당 $0.124 입니다. \ No newline at end of file +**중요** 터미널 사용시 사용하는 인스턴스 타입에 따라 시간당 사용요금이 부과됩니다. 미디엄 타입의 인스턴스 요금은 시간당 $0.124 입니다. + +--- +

+번역: 김우정 (gnujoow) +

From 92a2bb04ae6c41dd55a516fdce5e7d4af6de26a3 Mon Sep 17 00:00:00 2001 From: myungsub Date: Mon, 9 May 2016 14:09:48 +0900 Subject: [PATCH 101/199] Add video info & acknowledgement --- index.html | 4 ++++ video-lectures.md | 10 ++++++++++ 2 files changed, 14 insertions(+) create mode 100644 video-lectures.md diff --git a/index.html b/index.html index b659eedf..73335ec6 100644 --- a/index.html +++ b/index.html @@ -16,6 +16,10 @@ Glossary
+ +
Winter 2016 과제
diff --git a/video-lectures.md b/video-lectures.md new file mode 100644 index 00000000..3a7ae924 --- /dev/null +++ b/video-lectures.md @@ -0,0 +1,10 @@ +--- +layout: page +title: Video Lectures +permalink: /video-lectures/ +--- + +동영상 강의는 원래 강사인 Andrej Karpathy가 직접 유튜브에 올렸었지만, 몇 가지 문제로 인해 현재는 내려간 상태입니다. +그러나 [[이곳](https://archive.org/details/cs231n-CNNs)]에서 웹으로 강의를 듣거나 [토렌트 링크](https://archive.org/download/cs231n-CNNs/cs231n-CNNs_archive.torrent)를 통해 받을 수 있고, [새로 유튜브에 재생목록을 만들어주신 분](https://www.youtube.com/playlist?list=PLLvH2FwAQhnpj1WEB-jHmPuUeQ8mX-XXG)도 있습니다. 자막 파일은 [[여기](https://github.com/aikorea/cs231n/tree/master/captions)]에서 다운받으실 수 있습니다. (아직 진행 중이라 미완입니다.) + +유튜브에서 자동생성되어 매우 안 좋은 상태였던 영어 자막을 수정하고 한글로 번역까지 해 주는 작업은 **김영범 (rollis0825), 황재하 (jaywhang), 이지훈 (jihoonl), 김석우 (sandrokim), 이준수 (jslee), 조재민 (j-min)** 님께서 수고해 주시고 계십니다! From 7aabbf9ab8665272e9949ba61209a5f2385267fe Mon Sep 17 00:00:00 2001 From: Taeksoo Kim Date: Mon, 9 May 2016 14:23:45 +0900 Subject: [PATCH 102/199] Update convolutional-networks.md --- convolutional-networks.md | 217 ++++++++++++++++++++------------------ 1 file changed, 113 insertions(+), 104 deletions(-) diff --git a/convolutional-networks.md b/convolutional-networks.md index d08fbb36..9e1ce5e4 100644 --- a/convolutional-networks.md +++ b/convolutional-networks.md @@ -1,6 +1,6 @@ --- layout: page -permalink: /convolutional-networks/ +permalink: /convolutional-networks-kr/ --- Table of Contents: @@ -19,135 +19,136 @@ Table of Contents: - [Computational Considerations](#comp) - [Additional References](#add) -## Convolutional Neural Networks (CNNs / ConvNets) +## 컨볼루션 신경망 (CNN/ConvNets) -Convolutional Neural Networks are very similar to ordinary Neural Networks from the previous chapter: They are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity. The whole network still express a single differentiable score function: From the raw image pixels on one end to class scores at the other. And they still have a loss function (e.g. SVM/Softmax) on the last (fully-connected) layer and all the tips/tricks we developed for learning regular Neural Networks still apply. +컨볼루션 신경망 (Convolutional Neural Network, 이하 CNN)은 앞 장에서 다룬 일반 신경망과 매우 유사하다. CNN은 학습 가능한 가중치 (weight)와 바이어스(bias)로 구성되어 있다. 각 뉴런은 입력을 받아 내적 연산( dot product )을 한 뒤 선택에 따라 비선형 (non-linear) 연산을 한다. 전체 네트워크는 일반 신경망과 마찬가지로 미분 가능한 하나의 스코어 함수 (score function)을 갖게 된다 (맨 앞쪽에서 로우 이미지 (raw image)를 읽고 맨 뒤쪽에서 각 클래스에 대한 점수를 구하게 됨). 또한 CNN은 마지막 레이어에 (SVM/Softmax와 같은) 손실 함수 (loss function)을 가지며, 우리가 일반 신경망을 학습시킬 때 사용하던 각종 기법들을 동일하게 적용할 수 있다. -So what does change? ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduces the amount of parameters in the network. +CNN과 일반 신경망의 차이점은 무엇일까? CNN 아키텍쳐는 입력 데이터가 이미지라는 가정 덕분에 이미지 데이터가 갖는 특성들을 인코딩 할 수 있다. 이러한 아키텍쳐는 포워드 함수 (forward function)을 더욱 효과적으로 구현할 수 있고 네트워크를 학습시키는데 필요한 모수 (parameter)의 수를 크게 줄일 수 있게 해준다. -### Architecture Overview +### 아키텍쳐 개요 -*Recall: Regular Neural Nets.* As we saw in the previous chapter, Neural Networks receive an input (a single vector), and transform it through a series of *hidden layers*. 
Each hidden layer is made up of a set of neurons, where each neuron is fully connected to all neurons in the previous layer, and where neurons in a single layer function completely independently and do not share any connections. The last fully-connected layer is called the "output layer" and in classification settings it represents the class scores.
+앞 장에서 보았듯이 신경망은 입력받은 벡터를 일련의 히든 레이어 (hidden layer)를 통해 변형 (transform)시킨다. 각 히든 레이어는 뉴런들로 이뤄져 있으며, 각 뉴런은 앞쪽 레이어 (previous layer)의 모든 뉴런과 연결되어 있다 (fully connected). 같은 레이어 내에 있는 뉴런들끼리는 연결이 존재하지 않고 서로 독립적이다. 마지막 Fully-connected 레이어는 출력 레이어라고 불리며, 분류 문제에서 클래스 점수 (class score)를 나타낸다.

-*Regular Neural Nets don't scale well to full images*. In CIFAR-10, images are only of size 32x32x3 (32 wide, 32 high, 3 color channels), so a single fully-connected neuron in a first hidden layer of a regular Neural Network would have 32\*32\*3 = 3072 weights. This amount still seems manageable, but clearly this fully-connected structure does not scale to larger images. For example, an image of more respectible size, e.g. 200x200x3, would lead to neurons that have 200\*200\*3 = 120,000 weights. Moreover, we would almost certainly want to have several such neurons, so the parameters would add up quickly! Clearly, this full connectivity is wasteful and the huge number of parameters would quickly lead to overfitting.
+일반 신경망은 이미지를 다루기에 적절하지 않다. CIFAR-10 데이터의 경우 각 이미지가 32x32x3 (가로,세로 32, 3개 컬러 채널)로 이뤄져 있어서 첫 번째 히든 레이어 내의 하나의 뉴런의 경우 32x32x3=3072개의 가중치가 필요하지만, 더 큰 이미지를 사용할 경우에는 같은 구조를 이용하는 것이 불가능하다. 예를 들어 200x200x3의 크기를 가진 이미지는 같은 뉴런에 대해 200x200x3=120,000개의 가중치를 필요로 하기 때문이다. 더욱이, 이런 뉴런이 레이어 내에 여러 개 존재하므로 모수의 개수가 크게 증가하게 된다. 이와 같이 Fully-connectivity는 심한 낭비이며 많은 수의 모수는 곧 오버피팅(overfitting)으로 귀결된다.

-*3D volumes of neurons*. Convolutional Neural Networks take advantage of the fact that the input consists of images and they constrain the architecture in a more sensible way. In particular, unlike a regular Neural Network, the layers of a ConvNet have neurons arranged in 3 dimensions: **width, height, depth**. (Note that the word *depth* here refers to the third dimension of an activation volume, not to the depth of a full Neural Network, which can refer to the total number of layers in a network.) For example, the input images in CIFAR-10 are an input volume of activations, and the volume has dimensions 32x32x3 (width, height, depth respectively). As we will soon see, the neurons in a layer will only be connected to a small region of the layer before it, instead of all of the neurons in a fully-connected manner. Moreover, the final output layer would for CIFAR-10 have dimensions 1x1x10, because by the end of the ConvNet architecture we will reduce the full image into a single vector of class scores, arranged along the depth dimension. Here is a visualization:
+CNN은 입력이 이미지로 이뤄져 있다는 특징을 살려 좀 더 합리적인 방향으로 아키텍쳐를 구성할 수 있다. 특히 일반 신경망과 달리, CNN의 레이어들은 가로,세로,깊이의 3개 차원을 갖게 된다 (여기에서 말하는 깊이란 전체 신경망의 깊이가 아니라 액티베이션 볼륨 (activation volume)에서의 3번째 차원을 이야기함). 예를 들어 CIFAR-10 이미지는 32x32x3 (가로,세로,깊이)의 차원을 갖는 입력 액티베이션 볼륨 (activation volume)이라고 볼 수 있다. 조만간 보겠지만, 하나의 레이어에 위치한 뉴런들은 일반 신경망과는 달리 앞 레이어의 전체 뉴런이 아닌 일부에만 연결이 되어 있다. CNN 아키텍쳐는 전체 이미지를 클래스 점수들로 이뤄진 하나의 벡터로 만들어주기 때문에 마지막 출력 레이어는 1x1x10(10은 CIFAR-10 데이터의 클래스 개수)의 차원을 가지게 된다. 이에 대한 그림은 아래와 같다:
-
Left: A regular 3-layer Neural Network. Right: A ConvNet arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers. Every layer of a ConvNet transforms the 3D input volume to a 3D output volume of neuron activations. In this example, the red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be 3 (Red, Green, Blue channels).
+
좌: 일반 3-레이어 신경망. 우: 그림과 같이 CNN은 뉴런들을 3차원으로 배치한다. CNN의 모든 레이어는 3차원 입력 볼륨을 3차원 출력 볼륨으로 변환 (transform) 시킨다. 이 예제에서 붉은 색으로 나타난 입력 레이어는 이미지를 입력으로 받으므로, 이 레이어의 가로/세로/채널은 각각 이미지의 가로/세로/3(Red,Green,Blue) 이다.
-> A ConvNet is made up of Layers. Every Layer has a simple API: It transforms an input 3D volume to an output 3D volume with some differentiable function that may or may not have parameters. +> CNN은 여러 레이어로 이루어져 있다. 각각의 레이어는 3차원의 볼륨을 입력으로 받고 미분 가능한 함수를 거쳐 3차원의 볼륨을 출력하는 간단한 기능을 한다. -### Layers used to build ConvNets +### CNN을 이루는 레이어들 -As we described above, every layer of a ConvNet transforms one volume of activations to another through a differentiable function. We use three main types of layers to build ConvNet architectures: **Convolutional Layer**, **Pooling Layer**, and **Fully-Connected Layer** (exactly as seen in regular Neural Networks). We will stack these layers to form a full ConvNet **architecture**. +위에서 다룬 것과 같이, CNN의 각 레이어는 미분 가능한 변환 함수를 통해 하나의 액티베이션 볼륨을 또다른 액티베이션 볼륨으로 변환 (transform) 시킨다. CNN 아키텍쳐에서는 크게 컨볼루셔널 레이어, 풀링 레이어, Fully-connected 레이어라는 3개 종류의 레이어가 사용된다. 전체 CNN 아키텍쳐는 이 3 종류의 레이어들을 쌓아 만들어진다. -*Example Architecture: Overview*. We will go into more details below, but a simple ConvNet for CIFAR-10 classification could have the architecture [INPUT - CONV - RELU - POOL - FC]. In more detail: +*예제: 아래에서 더 자세하게 배우겠지만, CIFAR-10 데이터를 다루기 위한 간단한 CNN은 [INPUT-CONV-RELU-POOL-FC]로 구축할 수 있다. -- INPUT [32x32x3] will hold the raw pixel values of the image, in this case an image of width 32, height 32, and with three color channels R,G,B. -- CONV layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and the region they are connected to in the input volume. This may result in volume such as [32x32x12]. -- RELU layer will apply an elementwise activation function, such as the $$max(0,x)$$ thresholding at zero. This leaves the size of the volume unchanged ([32x32x12]). -- POOL layer will perform a downsampling operation along the spatial dimensions (width, height), resulting in volume such as [16x16x12]. -- FC (i.e. fully-connected) layer will compute the class scores, resulting in volume of size [1x1x10], where each of the 10 numbers correspond to a class score, such as among the 10 categories of CIFAR-10. As with ordinary Neural Networks and as the name implies, each neuron in this layer will be connected to all the numbers in the previous volume. +- INPUT 입력 이미지가 가로32, 세로32, 그리고 RGB 채널을 가지는 경우 입력의 크기는 [32x32x3]. +- CONV 레이어는 입력 이미지의 일부 영역과 연결되어 있으며, 이 연결된 영역과 자신의 가중치의 내적 연산 (dot product) 을 계산하게 된다. 결과 볼륨은 [32x32x12]와 같은 크기를 갖게 된다. +- RELU 레이어는 max(0,x)와 같이 각 요소에 적용되는 액티베이션 함수 (activation function)이다. 이 레이어는 볼륨의 크기를 변화시키지 않는다 ([32x32x12]) +- POOL 레이어는 (가로,세로) 차원에 대해 다운샘플링 (downsampling)을 수행해 [16x16x12]와 같이 줄어든 볼륨을 출력한다. +- FC (fully-connected) 레이어는 클래스 점수들을 계산해 [1x1x10]의 크기를 갖는 볼륨을 출력한다. 10개 숫자들은 10개 카테고리에 대한 클래스 점수에 해당한다. 레이어의 이름에서 유추 가능하듯, 이 레이어는 이전 볼륨의 모든 요소와 연결되어 있다. -In this way, ConvNets transform the original image layer by layer from the original pixel values to the final class scores. Note that some layers contain parameters and other don't. In particular, the CONV/FC layers perform transformations that are a function of not only the activations in the input volume, but also of the parameters (the weights and biases of the neurons). On the other hand, the RELU/POOL layers will implement a fixed function. The parameters in the CONV/FC layers will be trained with gradient descent so that the class scores that the ConvNet computes are consistent with the labels in the training set for each image. +이와 같이, CNN은 픽셀 값으로 이뤄진 원본 이미지를 각 레이어를 거치며 클래스 점수로 변환 (transform) 시킨다. 
한 가지 기억할 것은, 어떤 레이어는 모수 (parameter)를 갖지만 어떤 레이어는 모수를 갖지 않는다는 것이다. 특히 CONV/FC 레이어들은 단순히 입력 볼륨만이 아니라 가중치(weight)와 바이어스(bias) 또한 포함하는 액티베이션(activation) 함수이다. 반면 RELU/POOL 레이어들은 고정된 함수이다. CONV/FC 레이어의 모수 (parameter)들은 각 이미지에 대한 클래스 점수가 해당 이미지의 레이블과 같아지도록 그라디언트 디센트 (gradient descent)로 학습된다. -In summary: +요약해보면: -- A ConvNet architecture is a list of Layers that transform the image volume into an output volume (e.g. holding the class scores) -- There are a few distinct types of Layers (e.g. CONV/FC/RELU/POOL are by far the most popular) -- Each Layer accepts an input 3D volume and transforms it to an output 3D volume through a differentiable function -- Each Layer may or may not have parameters (e.g. CONV/FC do, RELU/POOL don't) -- Each Layer may or may not have additional hyperparameters (e.g. CONV/FC/POOL do, RELU doesn't) +- CNN 아키텍쳐는 여러 레이어를 통해 입력 이미지 볼륨을 출력 볼륨 ( 클래스 점수 )으로 변환시켜 준다. +- CNN은 몇 가지 종류의 레이어로 구성되어 있다. CONV/FC/RELU/POOL 레이어가 현재 가장 많이 쓰인다. +- 각 레이어는 3차원의 입력 볼륨을 미분 가능한 함수를 통해 3차원 출력 볼륨으로 변환시킨다. +- 모수(parameter)가 있는 레이어도 있고 그렇지 않은 레이어도 있다 (FC/CONV는 모수를 갖고 있고, RELU/POOL 등은 모수가 없음). +- 초모수 (hyperparameter)가 있는 레이어도 있고 그렇지 않은 레이어도 있다 (CONV/FC/POOL 레이어는 초모수를 가지며 RELU는 가지지 않음).
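위의 [INPUT-CONV-RELU-POOL-FC] 구성에서 볼륨 크기가 어떻게 변하는지는 아래의 간단한 numpy 스케치로 따라가 볼 수 있다 (원문에 없는 예시이며, CONV 연산 자체는 생략하고 출력 볼륨의 크기만 가정해 임의의 난수로 채웠다):

~~~python
import numpy as np

np.random.seed(0)
x = np.random.randn(32, 32, 3)                  # INPUT: [32x32x3]
conv = np.random.randn(32, 32, 12)              # CONV 출력이라고 가정한 볼륨: [32x32x12]
relu = np.maximum(0, conv)                      # RELU: 크기 변화 없음 -> [32x32x12]
pool = relu.reshape(16, 2, 16, 2, 12).max(axis=(1, 3))  # POOL (2x2 max) -> [16x16x12]
W_fc = np.random.randn(10, pool.size)           # FC: 이전 볼륨 전체와 연결
scores = W_fc.dot(pool.ravel())                 # 10개 클래스 점수 -> [1x1x10]
print(relu.shape, pool.shape, scores.shape)     # (32, 32, 12) (16, 16, 12) (10,)
~~~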
- The activations of an example ConvNet architecture. The initial volume stores the raw image pixels and the last volume stores the class scores. Each volume of activations along the processing path is shown as a column. Since it's difficult to visualize 3D volumes, we lay out each volume's slices in rows. The last layer volume holds the scores for each class, but here we only visualize the sorted top 5 scores, and print the labels of each one. The full web-based demo is shown in the header of our website. The architecture shown here is a tiny VGG Net, which we will discuss later. + CNN 아키텍쳐의 액티베이션 (activation) 예제. 첫 볼륨은 로우 이미지(raw image)를 다루며, 마지막 볼륨은 클래스 점수들을 출력한다. 입/출력 사이의 액티베이션들은 그림의 각 열에 나타나 있다. 3차원 볼륨을 시각적으로 나타내기가 어렵기 때문에 각 행마다 볼륨들의 일부만 나타냈다. 마지막 레이어는 모든 클래스에 대한 점수를 나타내지만 여기에서는 상위 5개 클래스에 대한 점수와 레이블만 표시했다. 전체 웹 데모는 우리의 웹사이트 상단에 있다. 여기에서 사용된 아키텍쳐는 작은 VGG Net이다.
-We now describe the individual layers and the details of their hyperparameters and their connectivities. +이제 각각의 레이어에 대해 초모수(hyperparameter)나 연결성 (connectivity) 등의 세부 사항들을 알아보도록 하자. -#### Convolutional Layer +#### 컨볼루셔널 레이어 (이하 CONV) -The Conv layer is the core building block of a Convolutional Network, and its output volume can be interpreted as holding neurons arranged in a 3D volume. We now discuss the details of the neuron connectivities, their arrangement in space, and their parameter sharing scheme. +CONV 레이어는 CNN을 이루는 핵심 요소이다. CONV 레이어의 출력은 3차원으로 정렬된 뉴런들로 해석될 수 있다. 이제부터는 뉴런들의 연결성 (connectivity), 그들의 공간상의 배치, 그리고 모수 공유(parameter sharing) 에 대해 알아보자. -**Overview and Intuition.** The CONV layer's parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume. During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume, producing a 2-dimensional activation map of that filter. As we slide the filter, across the input, we are computing the dot product between the entries of the filter and the input. Intuitively, the network will learn filters that activate when they see some specific type of feature at some spatial position in the input. Stacking these activation maps for all filters along the depth dimension forms the full output volume. Every entry in the output volume can thus also be interpreted as an output of a neuron that looks at only a small region in the input and shares parameters with neurons in the same activation map (since these numbers all result from applying the same filter). We now dive into the details of this process. +**개요 및 직관적인 설명.** CONV 레이어의 모수(parameter)들은 일련의 학습가능한 필터들로 이뤄져 있다. 각 필터는 가로/세로 차원으로는 작지만 깊이 (depth) 차원으로는 전체 깊이를 아우른다. 포워드 패스 (forward pass) 때에는 각 필터를 입력 볼륨의 가로/세로 차원으로 슬라이딩 시키며 (정확히는 convolve 시키며) 2차원의 액티베이션 맵 (activation map)을 생성한다. 필터를 입력 위로 슬라이딩 시킬 때, 필터와 입력의 요소들 사이의 내적 연산 (dot product)이 이뤄진다. 직관적으로 설명하면, 이 신경망은 입력의 특정 위치의 특정 패턴에 대해 반응하는 (activate) 필터를 학습한다. 이런 액티베이션 맵 (activation map)을 깊이 (depth) 차원을 따라 쌓은 것이 곧 출력 볼륨이 된다. 그러므로 출력 볼륨의 각 요소들은 입력의 작은 영역만을 취급하고, 같은 액티베이션 맵 내의 뉴런들은 같은 모수들을 공유한다 (같은 필터를 적용한 결과이므로). 이제 이 과정에 대해 좀 더 깊이 파헤쳐보자. -**Local Connectivity.** When dealing with high-dimensional inputs such as images, as we saw above it is impractical to connect neurons to all neurons in the previous volume. Instead, we will connect each neuron to only a local region of the input volume. The spatial extent of this connectivity is a hyperparameter called the **receptive field** of the neuron. The extent of the connectivity along the depth axis is always equal to the depth of the input volume. It is important to note this asymmetry in how we treat the spatial dimensions (width and height) and the depth dimension: The connections are local in space (along width and height), but always full along the entire depth of the input volume. +**로컬 연결성 (Local connectivity).** 이미지와 같은 고차원 입력을 다룰 때에는, 현재 레이어의 한 뉴런을 이전 볼륨의 모든 뉴런들과 연결하는 것이 비 실용적이다. 대신에 우리는 레이어의 각 뉴런을 입력 볼륨의 로컬한 영역(local region)에만 연결할 것이다. 이 영역은 리셉티브 필드 (receptive field)라고 불리는 초모수 (hyperparameter) 이다. 깊이 차원 측면에서는 항상 입력 볼륨의 총 깊이를 다룬다 (가로/세로는 작은 영역을 보지만 깊이는 전체를 본다는 뜻). 공간적 차원 (가로/세로)와 깊이 차원을 다루는 방식이 다르다는 걸 기억하자. -*Example 1*. For example, suppose that the input volume has size [32x32x3], (e.g. an RGB CIFAR-10 image). 
If the receptive field is of size 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5\*5\*3 = 75 weights. Notice that the extent of the connectivity along the depth axis must be 3, since this is the depth of the input volume. +*예제 1*. 예를 들어 입력 볼륨의 크기가 (CIFAR-10의 RGB 이미지와 같이) [32x32x3]이라고 하자. 만약 리셉티브 필드의 크기가 5x5라면, CONV 레이어의 각 뉴런은 입력 볼륨의 [5x5x3] 크기의 영역에 가중치 (weight)를 가하게 된다 (총 5x5x3=75 개 가중치). 입력 볼륨 (RGB 이미지)의 깊이가 3이므로 마지막 숫자가 3이 된다는 것을 기억하자. -*Example 2*. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3\*3\*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in space (e.g. 3x3), but full along the input depth (20). +*예제 2*. 입력 볼륨의 크기가 [16x16x20]이라고 하자. 3x3 크기의 리셉티브 필드를 사용하면 CONV 레이어의 각 뉴런은 입력 볼륨과 3x3x20=180 개의 연결을 갖게 된다. 이번에도 입력 볼륨의 깊이가 20이므로 마지막 숫자가 20이 된다는 것을 기억하자.
- Left: An example input volume in red (e.g. a 32x32x3 CIFAR-10 image), and an example volume of neurons in the first Convolutional layer. Each neuron in the convolutional layer is connected only to a local region in the input volume spatially, but to the full depth (i.e. all color channels). Note, there are multiple neurons (5 in this example) along the depth, all looking at the same region in the input - see discussion of depth columns in text below. Right: The neurons from the Neural Network chapter remain unchanged: They still compute a dot product of their weights with the input followed by a non-linearity, but their connectivity is now restricted to be local spatially. + 좌: 입력 볼륨(붉은색, 32x32x3 크기의 CIFAR-10 이미지)과 첫번째 컨볼루션 레이어 볼륨. 컨볼루션 레이어의 각 뉴런은 입력 볼륨의 일부 영역에만 연결된다 (가로/세로 공간 차원으로는 일부 연결, 깊이(컬러 채널) 차원은 모두 연결). 컨볼루션 레이어의 깊이 차원의 여러 뉴런 (그림에서 5개)들이 모두 입력의 같은 영역을 처리한다는 것을 기억하자 (깊이 차원과 관련해서는 아래에서 더 자세히 알아볼 것임). 우: 입력의 일부 영역에만 연결된다는 점을 제외하고는, 이전 신경망 챕터에서 다뤄지던 뉴런들과 똑같이 내적 연산과 비선형 함수로 이뤄진다.
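위의 예제 1을 코드로 옮기면 다음과 같다 (원문에 없는 참고용 스케치로, 입력과 가중치는 임의의 난수이다). 한 뉴런은 가로/세로로는 5x5 로컬 영역만 보지만, 깊이로는 3개 채널 전체에 대해 내적 연산을 한다:

~~~python
import numpy as np

np.random.seed(2)
X = np.random.randn(32, 32, 3)   # CIFAR-10 크기의 가상 입력 볼륨
w = np.random.randn(5, 5, 3)     # 한 뉴런의 가중치: 5*5*3 = 75개
b = 0.0

# 임의로 고른 위치에서의 이 뉴런의 출력: 5x5 로컬 영역과 전체 깊이에 대한 내적
out = np.sum(X[4:9, 10:15, :] * w) + b
print(w.size, out)               # 가중치 75개, 출력은 스칼라 하나
~~~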
-**Spatial arrangement**. We have explained the connectivity of each neuron in the Conv Layer to the input volume, but we haven't yet discussed how many neurons there are in the output volume or how they are arranged. Three hyperparameters control the size of the output volume: the **depth, stride** and **zero-padding**. We discuss these next: +**공간적 배치**. 지금까지는 컨볼루션 레이어의 한 뉴런과 입력 볼륨의 연결에 대해 알아보았다. 그러나 아직 출력 볼륨에 얼마나 많은 뉴런들이 있는지, 그리고 그 뉴런들이 어떤식으로 배치되는지는 다루지 않았다. 3개의 hyperparameter들이 출력 볼륨의 크기를 결정하게 된다. 그 3개 요소는 바로 **깊이, stride, 그리고 제로 패딩 (zero-padding)** 이다. 이들에 대해 알아보자: -1. First, the **depth** of the output volume is a hyperparameter that we can pick; It controls the number of neurons in the Conv layer that connect to the same region of the input volume. This is analogous to a regular Neural Network, where we had multiple neurons in a hidden layer all looking at the exact same input. As we will see, all of these neurons will learn to activate for different features in the input. For example, if the first Convolutional Layer takes as input the raw image, then different neurons along the depth dimension may activate in presence of various oriented edged, or blobs of color. We will refer to a set of neurons that are all looking at the same region of the input as a **depth column**. -2. Second, we must specify the **stride** with which we allocate depth columns around the spatial dimensions (width and height). When the stride is 1, then we will allocate a new depth column of neurons to spatial positions only 1 spatial unit apart. This will lead to heavily overlapping receptive fields between the columns, and also to large output volumes. Conversely, if we use higher strides then the receptive fields will overlap less and the resulting output volume will have smaller dimensions spatially. -3. As we will soon see, sometimes it will be convenient to pad the input with zeros spatially on the border of the input volume. The size of this **zero-padding** is a hyperparameter. The nice feature of zero padding is that it will allow us to control the spatial size of the output volumes. In particular, we will sometimes want to exactly preserve the spatial size of the input volume. +1. 먼저, 출력 볼륨의 **깊이** 는 우리가 결정할 수 있는 요소이다. 컨볼루션 레이어의 뉴런들 중 입력 볼륨 내 동일한 영역과 연결된 뉴런의 개수를 의미한다. 마치 일반 신경망에서 히든 레이어 내의 모든 뉴런들이 같은 입력값과 연결된 것과 비슷하다. 앞으로 살펴보겠지만, 이 뉴런들은 입력에 대해 서로 다른 특징 (feature)에 활성화된다 (activate). 예를 들어, 이미지를 입력으로 받는 첫 번째 컨볼루션 레이어의 경우, 깊이 축에 따른 각 뉴런들은 이미지의 서로 다른 엣지, 색깔, 블롭(blob) 등에 활성화된다. 앞으로는 인풋의 서로 같은 영역을 바라보는 뉴런들을 **깊이 컬럼 (depth column)**이라고 부르겠다. +2. 두 번째로 어떤 간격 (가로/세로의 공간적 간격) 으로 깊이 컬럼을 할당할 지를 의미하는 **stride**를 결정해야 한다. 만약 stride가 1이라면, 깊이 컬럼을 1칸마다 할당하게 된다 (한 칸 간격으로 깊이 컬럼 할당). 이럴 경우 각 깊이 컬럼들은 receptive field 상 넓은 영역이 겹치게 되고, 출력 볼륨의 크기도 매우 커지게 된다. 반대로, 큰 stride를 사용한다면 receptive field끼리 좁은 영역만 겹치게 되고 출력 볼륨도 작아지게 된다 (깊이는 작아지지 않고 가로/세로만 작아지게 됨). +3. 조만간 살펴보겠지만, 입력 볼륨의 가장자리를 0으로 패딩하는 것이 좋을 때가 있다. 이 **zero-padding**은 hyperparamter이다. zero-padding을 사용할 때의 장점은, 출력 볼륨의 공간적 크기(가로/세로)를 조절할 수 있다는 것이다. 특히 입력 볼륨의 공간적 크기를 유지하고 싶은 경우 (입력의 가로/세로 = 출력의 가로/세로) 사용하게 된다. -We can compute the spatial size of the output volume as a function of the input volume size ($$W$$), the receptive field size of the Conv Layer neurons ($$F$$), the stride with which they are applied ($$S$$), and the amount of zero padding used ($$P$$) on the border. You can convince yourself that the correct formula for calculating how many neurons "fit" is given by $$(W - F + 2P)/S + 1$$. 
If this number is not an integer, then the strides are set incorrectly and the neurons cannot be tiled so that they "fit" across the input volume neatly, in a symmetric way. An example might help to get intuitions for this formula:
+출력 볼륨의 공간적 크기 (가로/세로)는 입력 볼륨 크기 ($$W$$), CONV 레이어의 리셉티브 필드 크기 ($$F$$)와 stride ($$S$$), 그리고 제로 패딩 (zero-padding) 사이즈 ($$P$$)의 함수로 계산할 수 있다. 즉, $$(W - F + 2P)/S + 1$$ 이 식을 통해 알맞은 크기를 계산하면 된다. 만약 이 값이 정수가 아니라면 stride가 잘못 정해진 것이다. 이 경우 뉴런들이 대칭을 이루며 깔끔하게 배치되는 것이 불가능하다. 다음 예제를 보면 이 수식을 좀 더 직관적으로 이해할 수 있을 것이다:
- Illustration of spatial arrangement. In this example there is only one spatial dimension (x-axis), one neuron with a receptive field size of F = 3, the input size is W = 5, and there is zero padding of P = 1. Left: The neuron strided across the input in stride of S = 1, giving output of size (5 - 3 + 2)/1+1 = 5. Right: The neuron uses stride of S = 2, giving output of size (5 - 3 + 2)/2+1 = 3. Notice that stride S = 3 could not be used since it wouldn't fit neatly across the volume. In terms of the equation, this can be determined since (5 - 3 + 2) = 4 is not divisible by 3. -
The neuron weights are in this example [1,0,-1] (shown on very right), and its bias is zero. These weights are shared across all yellow neurons (see parameter sharing below).
+ 공간적 배치에 관한 그림. 이 예제에서는 가로/세로 공간적 차원 중 하나만 고려한다 (x축). 리셉티브 필드 F=3, 입력 사이즈 W=5, 제로 패딩 P=1. 좌: 뉴런들이 stride S=1을 갖고 배치된 경우, 출력 사이즈는 (5-3+2)/1 + 1 = 5이다. 우: stride S=2인 경우 (5-3+2)/2 + 1 = 3의 출력 사이즈를 가진다. Stride S=3은 사용할 수 없다. (5-3+2) = 4가 3으로 나눠지지 않기 때문에 출력 볼륨의 뉴런들이 깔끔히 배치되지 않는다.
+ 이 예에서 뉴런들의 가중치는 [1,0,-1] (가장 오른쪽)이며 bias는 0이다. 이 가중치는 노란 뉴런들 모두에게 공유된다 (아래에서 parameter sharing에 대해 살펴보라).
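위 그림의 1차원 배치를 코드로 흉내내면 다음과 같다 (원문에 없는 스케치이며, 그림 속 실제 입력 값은 여기에 옮기지 않았으므로 제로 패딩 P=1이 적용된 가상의 입력을 사용했다):

~~~python
import numpy as np

w, b = np.array([1, 0, -1]), 0         # 그림처럼 모든 뉴런이 공유하는 가중치와 bias
x = np.array([0, 2, -1, 3, 1, 2, 0])   # W=5 입력에 P=1 제로 패딩을 한 가상의 값

for S in (1, 2):                        # stride 1 -> 출력 5개, stride 2 -> 출력 3개
    out = [int(np.dot(x[i:i + 3], w)) + b for i in range(0, len(x) - 2, S)]
    print(S, out)
~~~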
-*Use of zero-padding*. In the example above on left, note that the input dimension was 5 and the output dimension was equal: also 5. This worked out so because our receptive fields were 3 and we used zero padding of 1. If there was no zero-padding used, then the output volume would have had spatial dimension of only 3, because that it is how many neurons would have "fit" across the original input. In general, setting zero padding to be $$P = (F - 1)/2$$ when the stride is $$S = 1$$ ensures that the input volume and output volume will have the same size spatially. It is very common to use zero-padding in this way and we will discuss the full reasons when we talk more about ConvNet architectures. +*제로 패딩 사용*. 위 예제의 왼쪽 그림에서, 입력과 출력의 차원이 모두 5라는 것을 기억하자. 리셉티브 필드가 3이고 제로 패딩이 1이기 때문에 이런 결과가 나오는 것이다. 만약 제로 패딩이 사용되지 않았다면 출력 볼륨의 크기는 3이 될 것이다. 일반적으로, 제로 패딩을 $$P = (F - 1)/2$$ , stride $$S = 1$$로 세팅하면 입/출력의 크기가 같아지게 된다. 이런 방식으로 사용하는 것이 일반적이며, 앞으로 컨볼루션 신경망에 대해 다루면서 그 이유에 대해 더 알아볼 것이다. -*Constraints on strides*. Note that the spatial arrangement hyperparameters have mutual constraints. For example, when the input has size $$W = 10$$, no zero-padding is used $$P = 0$$, and the filter size is $$F = 3$$, then it would be impossible to use stride $$S = 2$$, since $$(W - F + 2P)/S + 1 = (10 - 3 + 0) / 2 + 1 = 4.5$$, i.e. not an integer, indicating that the neurons don't "fit" neatly and symmetrically across the input. Therefore, this setting of the hyperparameters is considered to be invalid, and a ConvNet library would likely throw an exception. As we will see in the ConvNet architectures section, sizing the ConvNets appropriately so that all the dimensions "work out" can be a real headache, which the use of zero-padding and some design guidelines will significantly alleviate. +*Stride에 대한 constraints*. 공간적 배치와 관련된 hyperparameter들은 상호 constraint들이 존재한다는 것을 기억하자. 예를 들어, 입력 사이즈 $$W=10$$이고 제로 패딩이 사용되지 않았고 $$P=0$$, 필터 사이즈가 $$F=3$$이라면, stride $$S=2$$를 사용하는 것이 불가능하다. $$(W - F + 2P)/S + 1 = (10 - 3 + 0) / 2 + 1 = 4.5$$이 정수가 아니기 때문이다. 그러므로 hyperparameter를 이런 식으로 설정하면 컨볼루션 신경망 관련 라이브러리들은 exception을 낸다. 컨볼루션 신경망의 구조 관련 섹션에서 확인하겠지만, 전체 신경망이 잘 돌아가도록 이런 숫자들을 설정하는 과정은 매우 골치 아프다. 제로 패딩이나 다른 신경망 디자인 비법들을 사용하면 훨씬 수월하게 진행할 수 있다. -*Real-world example*. The [Krizhevsky et al.](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks) architecture that won the ImageNet challenge in 2012 accepted images of size [227x227x3]. On the first Convolutional Layer, it used neurons with receptive field size $$F = 11$$, stride $$S = 4$$ and no zero padding $$P = 0$$. Since (227 - 11)/4 + 1 = 55, and since the Conv layer had a depth of $$K = 96$$, the Conv layer output volume had size [55x55x96]. Each of the 55\*55\*96 neurons in this volume was connected to a region of size [11x11x3] in the input volume. Moreover, all 96 neurons in each depth column are connected to the same [11x11x3] region of the input, but of course with different weights. +*실제 예제*. 이미지넷 대회에서 우승한 [Krizhevsky et al.](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks) 의 모델의 경우 [227x227x3] 크기의 이미지를 입력으로 받는다. 첫 번째 컨볼루션 레이어에서는 리셉티브 필드 $$F=11$$, stride $$S=4$$를 사용했고 제로 패딩은 사용하지 않았다 $$P=0$$. (227 - 11)/4 +1=55 이고 컨볼루션 레이어의 깊이는 $$K=96$$이므로 이 컨볼루션 레이어의 크기는 [55x55x96]이 된다. 각각의 55\*55\*96개 뉴런들은 입력 볼륨의 [11x11x3]개 뉴런들과 연결되어 있다. 그리고 각 깊이의 모든 96개 뉴런들은 입력 볼륨의 같은 [11x11x3] 영역에 서로 다른 가중치를 가지고 연결된다. 
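출력 크기 공식과 stride constraint를 함수 하나로 정리하면 다음과 같다 (원문에 없는 참고용 스케치):

~~~python
def conv_output_size(W, F, S, P):
    """(W - F + 2P)/S + 1 이 정수일 때만 유효한 세팅이다."""
    n = W - F + 2 * P
    if n % S != 0:
        return None  # 뉴런들이 대칭적으로 깔끔하게 배치될 수 없는 경우
    return n // S + 1

print(conv_output_size(227, 11, 4, 0))  # 55 (위의 Krizhevsky et al. 예제)
print(conv_output_size(10, 3, 2, 0))    # None: (10 - 3 + 0)/2 + 1 = 4.5
~~~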
-**Parameter Sharing.** Parameter sharing scheme is used in Convolutional Layers to control the number of parameters. Using the real-world example above, we see that there are 55\*55\*96 = 290,400 neurons in the first Conv Layer, and each has 11\*11\*3 = 363 weights and 1 bias. Together, this adds up to 290400 * 364 = 105,705,600 parameters on the first layer of the ConvNet alone. Clearly, this number is very high. +**파라미터 공유**. 파라미터 공유 기법은 컨볼루션 레이어의 파라미터 개수를 조절하기 위해 사용된다. 위의 실제 예제에서 보았듯, 첫 번째 컨볼루션 레이어에는 55\*55\*96 = 290,400 개의 뉴런이 있고 각각의 뉴런은 11\*11\*3 = 363개의 가중치와 1개의 바이어스를 가진다. 첫 번째 컨볼루션 레이어만 따져도 총 파라미터 개수는 290400*364=105,705,600개가 된다. 분명히 이 숫자는 너무 크다. -It turns out that we can dramatically reduce the number of parameters by making one reasonable assumption: That if one patch feature is useful to compute at some spatial position (x,y), then it should also be useful to compute at a different position (x2,y2). In other words, denoting a single 2-dimensional slice of depth as a **depth slice** (e.g. a volume of size [55x55x96] has 96 depth slices, each of size [55x55]), we are going to constrain the neurons in each depth slice to use the same weights and bias. With this parameter sharing scheme, the first Conv Layer in our example would now have only 96 unique set of weights (one for each depth slice), for a total of 96\*11\*11\*3 = 34,848 unique weights, or 34,944 parameters (+96 biases). Alternatively, all 55*55 neurons in each depth slice will now be using the same parameters. In practice during backpropagation, every neuron in the volume will compute the gradient for its weights, but these gradients will be added up across each depth slice and only update a single set of weights per slice. +사실 적절한 가정을 통해 파라미터 개수를 크게 줄이는 것이 가능하다: (x,y)에서 어떤 patch feature가 유용하게 사용되었다면, 이 feature는 다른 위치 (x2,y2)에서도 유용하게 사용될 수 있다. 3차원 볼륨의 한 슬라이스 (깊이 차원으로 자른 2차원 슬라이스) 를 **depth slice**라고 하자 ([55x55x96] 사이즈의 볼륨은 각각 [55x55]의 크기를 가진 96개의 depth slice임). 앞으로는 각 depth slice 내의 뉴런들이 같은 가중치와 바이어스를 가지도록 제한할 것이다. 이런 파라미터 공유 기법을 사용하면, 예제의 첫 번째 컨볼루션 레이어는 (depth slice 당) 96개의 고유한 가중치를 가져서 총 96\*11\*11\*3 = 34,848개의 고유한 가중치, 또는 바이어스를 합쳐서 34,944개의 파라미터를 갖게 된다. 또는 각 depth slice에 존재하는 55*55개의 뉴런들은 모두 같은 파라미터를 사용하게 된다. 실제로는 backpropagation 과정에서 각 depth slice 내의 모든 뉴런들이 가중치에 대한 gradient를 계산하겠지만, 가중치 업데이트 할 때에는 이 gradient들을 합해 사용한다. -Notice that if all neurons in a single depth slice are using the same weight vector, then the forward pass of the CONV layer can in each depth slice be computed as a **convolution** of the neuron's weights with the input volume (Hence the name: Convolutional Layer). Therefore, it is common to refer to the sets of weights as a **filter** (or a **kernel**), which is convolved with the input. The result of this convolution is an *activation map* (e.g. of size [55x55]), and the set of activation maps for each different filter are stacked together along the depth dimension to produce the output volume (e.g. [55x55x96]). +한 depth slice내의 모든 뉴런들이 같은 가중치 벡터를 갖기 때문에 컨볼루션 레이어의 forward pass는 입력 볼륨과 가중치 간의 **컨볼루션**으로 계산될 수 있다 (컨볼루션 레이어라는 이름이 붙은 이유). 그러므로 컨볼루션 레이어의 가중치는 **필터(filter)** 또는 **커널(kernel)**이라고 부른다. 컨볼루션의 결과물은 **액티베이션 맵(activation map, [55x55] 사이즈)** 이 되며 각 깊이에 해당하는 필터의 액티베이션 맵들을 쌓으면 최종 출력 볼륨 ([55x55x96] 사이즈) 가 된다.
- Example filters learned by Krizhevsky et al. Each of the 96 filters shown here is of size [11x11x3], and each one is shared by the 55*55 neurons in one depth slice. Notice that the parameter sharing assumption is relatively reasonable: If detecting a horizontal edge is important at some location in the image, it should intuitively be useful at some other location as well due to the translationally-invariant structure of images. There is therefore no need to relearn to detect a horizontal edge at every one of the 55*55 distinct locations in the Conv layer output volume. + Krizhevsky et al. 에서 학습된 필터의 예. 96개의 필터 각각은 [11x11x3] 사이즈이며, 하나의 depth slice 내 55*55개 뉴런들이 이 필터들을 공유한다. 만약 이미지의 특정 위치에서 가로 엣지 (edge)를 검출하는 것이 중요했다면, 이미지의 다른 위치에서도 같은 특성이 중요할 수 있다 (이미지의 translationally-invariant한 특성 때문). 그러므로 55*55개 뉴런 각각에 대해 가로 엣지 검출 필터를 재학습 할 필요가 없다.
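위에서 계산한 파라미터 개수는 다음과 같이 검산해볼 수 있다 (원문에 없는 참고용 계산):

~~~python
# 첫 번째 CONV 레이어 (Krizhevsky et al. 예제): K=96, F=11, D1=3
K, F, D1 = 96, 11, 3
neurons = 55 * 55 * 96                     # 출력 볼륨의 뉴런 개수

no_sharing = neurons * (F * F * D1 + 1)    # 뉴런마다 고유한 가중치 + bias
with_sharing = K * (F * F * D1) + K        # depth slice 당 필터 하나 + bias 하나
print(no_sharing, with_sharing)            # 105705600 34944
~~~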
-Note that sometimes the parameter sharing assumption may not make sense. This is especially the case when the input images to a ConvNet have some specific centered structure, where we should expect, for example, that completely different features should be learned on one side of the image than another. One practical example is when the input are faces that have been centered in the image. You might expect that different eye-specific or hair-specific features could (and should) be learned in different spatial locations. In that case it is common to relax the parameter sharing scheme, and instead simply call the layer a **Locally-Connected Layer**. - -**Numpy examples.** To make the discussion above more concrete, lets express the same ideas but in code and with a specific example. Suppose that the input volume is a numpy array `X`. Then: +가끔은 파라미터 sharing에 대한 가정이 부적절할 수도 있다. 특히 입력 이미지가 중심을 기준으로 찍힌 경우 (예를 들면 이미지 중앙에 얼굴이 있는 이미지), 이미지의 각 영역에 대해 완전히 다른 feature들이 학습되어야 할 수 있다. 눈과 관련된 feature나 머리카락과 관련된 feature 등은 서로 다른 영역에서 학습될 것이다. 이런 경우에는 파라미터 sharing 기법을 접어두고 대신 **Locally-Connected Layer**라는 레이어를 사용하는 것이 좋다. +**Numpy 예제.** 위에서 다룬 것들을 더 확실히 알아보기 위해 코드를 작성해보자. 입력 볼륨을 numpy 배열 `X`라고 하면: - A *depth column* at position `(x,y)` would be the activations `X[x,y,:]`. +- `(x,y)`위치에서의 *depth column*은 액티베이션 `X[x,y,:]`이 된다. - A *depth slice*, or equivalently an *activation map* at depth `d` would be the activations `X[:,:,d]`. +- depth `d`에서의 *depth slice*, 또는 *액티베이션 맵 (activation map)*은 `X[:,:,d]`가 된다. -*Conv Layer Example*. Suppose that the input volume `X` has shape `X.shape: (11,11,4)`. Suppose further that we use no zero padding ($$P = 0$$), that the filter size is $$F = 5$$, and that the stride is $$S = 2$$. The output volume would therefore have spatial size (11-5)/2+1 = 4, giving a volume with width and height of 4. The activation map in the output volume (call it `V`), would then look as follows (only some of the elements are computed in this example): +*컨볼루션 레이어 예제*. 입력 볼륨 `X`의 모양이 `X.shape: (11,11,4)`이고 제로 패딩은 사용하지 않으며($$P = 0$$) 필터 크기는 $$F = 5$$, stride $$S = 2$$라고 하자. 출력 볼륨의 spatial 크기 (가로/세로)는 (11-5)/2 + 1 = 4가 된다. 출력 볼륨의 액티베이션 맵 (`V`라고 하자) 는 아래와 같은 것이다 (아래에는 일부 요소만 나타냄). - `V[0,0,0] = np.sum(X[:5,:5,:] * W0) + b0` - `V[1,0,0] = np.sum(X[2:7,:5,:] * W0) + b0` - `V[2,0,0] = np.sum(X[4:9,:5,:] * W0) + b0` - `V[3,0,0] = np.sum(X[6:11,:5,:] * W0) + b0` -Remember that in numpy, the operation `*` above denotes elementwise multiplication between the arrays. Notice also that the weight vector `W0` is the weight vector of that neuron and `b0` is the bias. Here, `W0` is assumed to be of shape `W0.shape: (5,5,4)`, since the filter size is 5 and the depth of the input volume is 4. Notice that at each point, we are computing the dot product as seen before in ordinary neural networks. Also, we see that we are using the same weight and bias (due to parameter sharing), and where the dimensions along the width are increasing in steps of 2 (i.e. the stride). To construct a second activation map in the output volume, we would have: +Numpy에서 `*`연산은 두 배열 간의 elementwise 곱셈이라는 것을 기억하자. 또한 `W0`는 가중치 벡터이고 `b0`은 바이어스라는 것도 기억하자. 여기에서 `W0`의 모양은 `W0.shape: (5,5,4)`라고 가정하자 (필터 사이즈는 5, depth는 4). 각 위치에서 일반 신경망에서와 같이 내적 연산을 수행하게 된다. 또한 파라미터 sharing 기법으로 같은 가중치, 바이어스가 사용되고 가로 차원에 대해 2 (stride)칸씩 옮겨가며 연산이 이뤄진다는 것을 볼 수 있다. 
출력 볼륨의 두 번째 액티베이션 맵을 구성하는 방법은:

- `V[0,0,1] = np.sum(X[:5,:5,:] * W1) + b1`
- `V[1,0,1] = np.sum(X[2:7,:5,:] * W1) + b1`
@@ -156,105 +157,108 @@ Remember that in numpy, the operation `*` above denotes elementwise multiplicati
- `V[0,1,1] = np.sum(X[:5,2:7,:] * W1) + b1` (example of going along y)
- `V[2,3,1] = np.sum(X[4:9,6:11,:] * W1) + b1` (or along both)

-where we see that we are indexing into the second depth dimension in `V` (at index 1) because we are computing the second activation map, and that a different set of parameters (`W1`) is now used. In the example above, we are for brevity leaving out some of the other operatations the Conv Layer would perform to fill the other parts of the output array `V`. Additionally, recall that these activation maps are often followed elementwise through an activation function such as ReLU, but this is not shown here.
+위 예제는 `V`의 두 번째 depth 차원 (인덱스 1)을 인덱싱하고 있다. 두 번째 액티베이션 맵을 계산하므로, 여기에서 사용된 가중치는 이전 예제와 달리 `W1`이다. 보통 액티베이션 맵이 구해진 뒤 ReLU와 같은 elementwise 연산이 가해지는 경우가 많은데, 위 예제에서는 다루지 않았다.

-**Summary**. To summarize, the Conv Layer:
+**요약**. 컨볼루션 레이어의 특징을 요약하면 다음과 같다:

-- Accepts a volume of size $$W_1 \times H_1 \times D_1$$
-- Requires four hyperparameters:
-  - Number of filters $$K$$,
-  - their spatial extent $$F$$,
-  - the stride $$S$$,
-  - the amount of zero padding $$P$$.
-- Produces a volume of size $$W_2 \times H_2 \times D_2$$ where:
+- $$W_1 \times H_1 \times D_1$$ 크기의 볼륨을 입력받는다.
+- 4개의 hyperparameter가 필요하다:
+  - 필터 개수 $$K$$,
+  - 필터의 가로/세로 Spatial 크기 $$F$$,
+  - Stride $$S$$,
+  - 제로 패딩 $$P$$.
+- $$W_2 \times H_2 \times D_2$$ 크기의 출력 볼륨을 생성한다:
   - $$W_2 = (W_1 - F + 2P)/S + 1$$
-  - $$H_2 = (H_1 - F + 2P)/S + 1$$ (i.e. width and height are computed equally by symmetry)
+  - $$H_2 = (H_1 - F + 2P)/S + 1$$ (i.e. 가로/세로는 같은 방식으로 계산됨)
   - $$D_2 = K$$
-- With parameter sharing, it introduces $$F \cdot F \cdot D_1$$ weights per filter, for a total of $$(F \cdot F \cdot D_1) \cdot K$$ weights and $$K$$ biases.
-- In the output volume, the $$d$$-th depth slice (of size $$W_2 \times H_2$$) is the result of performing a valid convolution of the $$d$$-th filter over the input volume with a stride of $$S$$, and then offset by $$d$$-th bias.
+- 파라미터 sharing으로 인해 필터 당 $$F \cdot F \cdot D_1$$개의 가중치를 가져서 총 $$(F \cdot F \cdot D_1) \cdot K$$개의 가중치와 $$K$$개의 바이어스를 갖게 된다.
+- 출력 볼륨에서 $$d$$번째 depth slice ($$W_2 \times H_2$$ 크기)는 입력 볼륨에 $$d$$번째 필터를 stride $$S$$만큼 옮겨가며 컨볼루션 한 뒤 $$d$$번째 바이어스를 더한 결과이다.

-A common setting of the hyperparameters is $$F = 3, S = 1, P = 1$$. However, there are common conventions and rules of thumb that motivate these hyperparameters. See the [ConvNet architectures](#architectures) section below.
+흔한 hyperparameter 기본 세팅은 $$F = 3, S = 1, P = 1$$이다. 뒤에서 다룰 [ConvNet architectures](#architectures)에서 hyperparameter 세팅과 관련된 법칙이나 방식 등을 확인할 수 있다.

-**Convolution Demo**. Below is a running demo of a CONV layer. Since 3D volumes are hard to visualize, all the volumes (the input volume (in blue), the weight volumes (in red), the output volume (in green)) are visualized with each depth slice stacked in rows. The input volume is of size $$W_1 = 5, H_1 = 5, D_1 = 3$$, and the CONV layer parameters are $$K = 2, F = 3, S = 2, P = 1$$. That is, we have two filters of size $$3 \times 3$$, and they are applied with a stride of 2. Therefore, the output volume size has spatial size (5 - 3 + 2)/2 + 1 = 3. Moreover, notice that a padding of $$P = 1$$ is applied to the input volume, making the outer border of the input volume zero.
The visualization below iterates over the output activations (green), and shows that each element is computed by elementwise multiplying the highlighted input (blue) with the filter (red), summing it up, and then offsetting the result by the bias. +**컨볼루션 데모**. 아래는 컨볼루션 레이어 데모이다. 3차원 볼륨은 시각화하기 힘드므로 각 행마다 depth slice를 하나씩 배치했다. 각 볼륨은 입력 볼륨(파란색), 가중치 볼륨(빨간색), 출력 볼륨(녹색)으로 이뤄진다. 입력 볼륨의 크기는 $$W_1 = 5, H_1 = 5, D_1 = 3$$이고 컨볼루션 레이어의 파라미터들은 $$K = 2, F = 3, S = 2, P = 1$$이다. 즉, 2개의 $$3 \times 3$$크기의 필터가 각각 stride 2마다 적용된다. 그러므로 출력 볼륨의 spatial 크기 (가로/세로)는 (5 - 3 + 2)/2 + 1 = 3이다. 제로 패딩 $$P = 1$$ 이 적용되어 입력 볼륨의 가장자리가 모두 0으로 되어있다는 것을 확인할 수 있다. 아래의 영상에서 하이라이트 표시된 입력(파란색)과 필터(빨간색)이 elementwise로 곱해진 뒤 하나로 더해지고 bias가 더해지는걸 볼 수 있다.
-**Implementation as Matrix Multiplication**. Note that the convolution operation essentially performs dot products between the filters and local regions of the input. A common implementation pattern of the CONV layer is to take advantage of this fact and formulate the forward pass of a convolutional layer as one big matrix multiply as follows:
+**매트릭스 곱으로 구현**. 컨볼루션 연산은 필터와 이미지의 로컬한 영역간의 내적 연산을 한 것과 같다. 컨볼루션 레이어의 일반적인 구현 패턴은 이 점을 이용해, 컨볼루션 레이어의 forward pass를 다음과 같이 하나의 큰 매트릭스 곱으로 계산한다:

-1. The local regions in the input image are stretched out into columns in an operation commonly called **im2col**. For example, if the input is [227x227x3] and it is to be convolved with 11x11x3 filters at stride 4, then we would take [11x11x3] blocks of pixels in the input and stretch each block into a column vector of size 11\*11\*3 = 363. Iterating this process in the input at stride of 4 gives (227-11)/4+1 = 55 locations along both width and height, leading to an output matrix `X_col` of *im2col* of size [363 x 3025], where every column is a stretched out receptive field and there are 55*55 = 3025 of them in total. Note that since the receptive fields overlap, every number in the input volume may be duplicated in multiple distinct columns.
-2. The weights of the CONV layer are similarly stretched out into rows. For example, if there are 96 filters of size [11x11x3] this would give a matrix `W_row` of size [96 x 363].
-3. The result of a convolution is now equivalent to performing one large matrix multiply `np.dot(W_row, X_col)`, which evaluates the dot product between every filter and every receptive field location. In our example, the output of this operation would be [96 x 3025], giving the output of the dot product of each filter at each location.
-4. The result must finally be reshaped back to its proper output dimension [55x55x96].
+1. 이미지의 각 로컬 영역을 열 벡터로 stretch 한다 (이런 연산을 보통 **im2col** 이라고 부름). 예를 들어, 만약 [227x227x3] 사이즈의 입력이 11x11x3 사이즈의 필터와 stride 4로 컨볼루션 한다면, 이미지에서 [11x11x3] 크기의 픽셀 블록을 가져와 11\*11\*3=363 크기의 열 벡터로 바꾸게 된다. 이 과정을 stride 4마다 하므로 가로, 세로에 대해 각각 (227-11)/4+1=55, 총 55\*55=3025 개 영역에 대해 반복하게 되고, 출력물인 `X_col`은 [363x3025]의 사이즈를 갖게 된다. 각각의 열 벡터는 리셉티브 필드를 1차원으로 stretch 한 것이고, 이 리셉티브 필드는 주위 리셉티브 필드들과 겹치므로 입력 볼륨의 여러 값들이 여러 출력 열 벡터에 중복되어 나타날 수 있다.
+2. 컨볼루션 레이어의 가중치는 비슷한 방식으로 행 벡터 형태로 stretch된다. 예를 들어 [11x11x3] 사이즈의 총 96개 필터가 있다면, [96x363] 사이즈의 `W_row`가 만들어진다.
+3. 이제 컨볼루션 연산은 하나의 큰 매트릭스 연산 `np.dot(W_row, X_col)`를 계산하는 것과 같다. 이 연산은 모든 필터와 모든 리셉티브 필드 영역들 사이의 내적 연산을 하는 것과 같다. 우리의 예에서는 각각의 필터를 각각의 영역에 적용한 [96x3025] 사이즈의 출력물이 얻어진다.
+4. 결과물은 [55x55x96] 차원으로 reshape 한다.

-This approach has the downside that it can use a lot of memory, since some values in the input volume are replicated multiple times in `X_col`. However, the benefit is that there are many very efficient implementations of Matrix Multiplication that we can take advantage of (for example, in the commonly used [BLAS](http://www.netlib.org/blas/) API). Morever, the same *im2col* idea can be reused to perform the pooling operation, which we discuss next.
+이 방식은 입력 볼륨의 여러 값들이 `X_col`에 여러 번 복사되기 때문에 메모리가 많이 사용된다는 단점이 있다. 그러나 매트릭스 연산과 관련된 많은 효율적 구현방식들을 사용할 수 있다는 장점도 있다 ([BLAS](http://www.netlib.org/blas/) API가 하나의 예임). 뿐만 아니라 같은 *im2col* 아이디어는 풀링 연산에서 재활용할 수도 있다 (뒤에서 다루게 된다).

-**Backpropagation.** The backward pass for a convolution operation (for both the data and the weights) is also a convolution (but with spatially-flipped filters). This is easy to derive in the 1-dimensional case with a toy example (not expanded on for now).
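위의 im2col 설명을 그대로 옮긴 순진한 (최적화되지 않은) numpy 구현 스케치는 다음과 같다 (원문에 없는 예시이며, 입력과 필터는 임의의 난수이다):

~~~python
import numpy as np

def im2col(X, F, S):
    """입력 볼륨 X (H x W x D)의 각 FxF 리셉티브 필드를 열 벡터로 stretch 한다."""
    H, W, D = X.shape
    out_h, out_w = (H - F) // S + 1, (W - F) // S + 1
    cols = np.empty((F * F * D, out_h * out_w))
    idx = 0
    for i in range(0, H - F + 1, S):
        for j in range(0, W - F + 1, S):
            cols[:, idx] = X[i:i + F, j:j + F, :].ravel()
            idx += 1
    return cols

X = np.random.randn(227, 227, 3)
X_col = im2col(X, F=11, S=4)                       # [363 x 3025]
W_row = np.random.randn(96, 363)                   # 96개의 11x11x3 필터를 행 벡터로
out = W_row.dot(X_col)                             # [96 x 3025]
out = out.reshape(96, 55, 55).transpose(1, 2, 0)   # [55x55x96] 으로 reshape
print(X_col.shape, out.shape)
~~~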
+**Backpropagation.** 컨볼루션 연산의 backward pass 역시 컨볼루션 연산이다 (가로/세로가 뒤집어진 필터를 사용한다는 차이점이 있음). 간단한 1차원 예제를 가지고 쉽게 확인해볼 수 있다. -#### Pooling Layer +#### 풀링 레이어 (Pooling Layer) -It is common to periodically insert a Pooling layer in-between successive Conv layers in a ConvNet architecture. Its function is to progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the network, and hence to also control overfitting. The Pooling Layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation. The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2 downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations. Every MAX operation would in this case be taking a max over 4 numbers (little 2x2 region in some depth slice). The depth dimension remains unchanged. More generally, the pooling layer: +CNN 구조 내에 컨볼루션 레이어들 중간중간에 주기적으로 풀링 레이어를 넣는 것이 일반적이다. 풀링 레이어가 하는 일은 네트워크의 파라미터의 개수나 연산량을 줄이기 위해 representation의 spatial한 사이즈를 줄이는 것이다. 이는 오버피팅을 조절하는 효과도 가지고 있다. 풀링 레이어는 MAX 연산을 각 depth slice에 대해 독립적으로 적용하여 spatial한 크기를 줄인다. 사이즈 2x2와 stride 2가 가장 많이 사용되는 풀링 레이어이다. 각 depth slice를 가로/세로축을 따라 1/2로 downsampling해 75%의 액티베이션은 버리게 된다. 이 경우 MAX 연산은 4개 숫자 중 최대값을 선택하게 된다 (같은 depth slice 내의 2x2 영역). Depth 차원은 변하지 않는다. 풀링 레이어의 특징들은 일반적으로 아래와 같다: -- Accepts a volume of size $$W_1 \times H_1 \times D_1$$ -- Requires three hyperparameters: - - their spatial extent $$F$$, - - the stride $$S$$, -- Produces a volume of size $$W_2 \times H_2 \times D_2$$ where: +- $$W_1 \times H_1 \times D_1$$ 사이즈의 입력을 받는다 +- 3가지 hyperparameter를 필요로 한다. + - Spatial extent $$F$$ + - Stride $$S$$ +- $$W_2 \times H_2 \times D_2$$ 사이즈의 볼륨을 만든다 - $$W_2 = (W_1 - F)/S + 1$$ - $$H_2 = (H_1 - F)/S + 1$$ - $$D_2 = D_1$$ -- Introduces zero parameters since it computes a fixed function of the input -- Note that it is not common to use zero-padding for Pooling layers +- 입력에 대해 항상 같은 연산을 하므로 파라미터는 따로 존재하지 않는다 +- 풀링 레이어에는 보통 제로 패딩을 하지 않는다 -It is worth noting that there are only two commonly seen variations of the max pooling layer found in practice: A pooling layer with $$F = 3, S = 2$$ (also called overlapping pooling), and more commonly $$F = 2, S = 2$$. Pooling sizes with larger receptive fields are too destructive. +일반적으로 실전에서는 두 종류의 max 풀링 레이어만 널리 쓰인다. 하나는 overlapping 풀링이라고도 불리는 $$F = 3, S = 2$$ 이고 하나는 더 자주 쓰이는 $$F = 2, S = 2$$ 이다. 큰 리셉티브 필드에 대해서 풀링을 하면 보통 너무 많은 정보를 버리게 된다. -**General pooling**. In addition to max pooling, the pooling units can also perform other functions, such as *average pooling* or even *L2-norm pooling*. Average pooling was often used historically but has recently fallen out of favor compared to the max pooling operation, which has been shown to work better in practice. +**일반적인 풀링**. Max 풀링 뿐 아니라 *average 풀링*, *L2-norm 풀링* 등 다른 연산으로 풀링할 수도 있다. Average 풀링은 과거에 많이 쓰였으나 최근에는 Max 풀링이 더 좋은 성능을 보이며 점차 쓰이지 않고 있다.
- Pooling layer downsamples the volume spatially, independently in each depth slice of the input volume. Left: In this example, the input volume of size [224x224x64] is pooled with filter size 2, stride 2 into output volume of size [112x112x64]. Notice that the volume depth is preserved. Right: The most common downsampling operation is max, giving rise to max pooling, here shown with a stride of 2. That is, each max is taken over 4 numbers (little 2x2 square). + 풀링 레이어는 입력 볼륨의 각 depth slice를 spatial하게 downsampling한다. 좌: 이 예제에서는 입력 볼륨이 [224x224x64]이며 필터 크기 2, stride 2로 풀링해 [112x112x64] 크기의 출력 볼륨을 만든다. 볼륨의 depth는 그대로 유지된다는 것을 기억하자. Right: 가장 널리 쓰이는 max 풀링. 2x2의 4개 숫자에 대해 max를 취하게된다.
-**Backpropagation**. Recall from the backpropagation chapter that the backward pass for a max(x, y) operation has a simple interpretation as only routing the gradient to the input that had the highest value in the forward pass. Hence, during the forward pass of a pooling layer it is common to keep track of the index of the max activation (sometimes also called *the switches*) so that gradient routing is efficient during backpropagation. +**Backpropagation**. Backpropagation 챕터에서 max(x,y)의 backward pass는 그냥 forward pass에서 가장 큰 값을 가졌던 입력의 gradient를 보내는 것과 같다고 배운 것을 기억하자. 그러므로 forward pass 과정에서 보통 max 액티베이션의 위치를 저장해두었다가 backpropagation 때 사용한다. -**Recent developments**. +**최근의 발전된 내용들**. -- [Fractional Max-Pooling](http://arxiv.org/abs/1412.6071) suggests a method for performing the pooling operation with filters smaller than 2x2. This is done by randomly generating pooling regions with a combination of 1x1, 1x2, 2x1 or 2x2 filters to tile the input activation map. The grids are generated randomly on each forward pass, and at test time the predictions can be averaged across several grids. -- [Striving for Simplicity: The All Convolutional Net](http://arxiv.org/abs/1412.6806) proposes to discard the pooling layer in favor of architecture that only consists of repeated CONV layers. To reduce the size of the representation they suggest using larger stride in CONV layer once in a while. +- [Fractional Max-Pooling](http://arxiv.org/abs/1412.6071) 2x2보다 더 작은 필터들로 풀링하는 방식. 1x1, 1x2, 2x1, 2x2 크기의 필터들을 임의로 조합해 풀링한다. 매 forward pass마다 grid들이 랜덤하게 생성되고, 테스트 때에는 여러 grid들의 예측 점수들의 평균치를 사용하게 된다. +- [Striving for Simplicity: The All Convolutional Net](http://arxiv.org/abs/1412.6806) 라는 논문은 컨볼루션 레이어만 반복하며 풀링 레이어를 사용하지 않는 방식을 제안한다. Representation의 크기를 줄이기 위해 가끔씩 큰 stride를 가진 컨볼루션 레이어를 사용한다. -Due to the aggressive reduction in the size of the representation (which is helpful only for smaller datasets to control overfitting), the trend in the literature is towards discarding the pooling layer in modern ConvNets. +풀링 레이어가 보통 representation의 크기를 심하게 줄이기 때문에 (이런 효과는 작은 데이터셋에서만 오버피팅 방지 효과 등으로 인해 도움이 됨), 최근 추세는 점점 풀링 레이어를 사용하지 않는 쪽으로 발전하고 있다. -#### Normalization Layer +#### Normalization 레이어 -Many types of normalization layers have been proposed for use in ConvNet architectures, sometimes with the intentions of implementing inhibition schemes observed in the biological brain. However, these layers have recently fallen out of favor because in practice their contribution has been shown to be minimal, if any. For various types of normalizations, see the discussion in Alex Krizhevsky's [cuda-convnet library API](http://code.google.com/p/cuda-convnet/wiki/LayerParams#Local_response_normalization_layer_(same_map)). +실제 두뇌의 억제 메커니즘 구현 등을 위해 많은 종류의 normalization 레이어들이 제안되었다. 그러나 이런 레이어들이 실제로 주는 효과가 별로 없다는 것이 알려지면서 최근에는 거의 사용되지 않고 있다. Normalization에 대해 알고 싶다면 Alex Krizhevsky의 글을 읽어보기 바란다 [cuda-convnet library API](http://code.google.com/p/cuda-convnet/wiki/LayerParams#Local_response_normalization_layer_(same_map)). -#### Fully-connected layer +#### Fully-connected 레이어 -Neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks. Their activations can hence be computed with a matrix multiplication followed by a bias offset. See the *Neural Network* section of the notes for more information. +Fully connected 레이어 내의 뉴런들은 일반 신경망 챕터에서 보았듯이이전 레이어의 모든 액티베이션들과 연결되어 있다. 그러므로 Fully connected레이어의 액티베이션은 매트릭스 곱을 한 뒤 바이어스를 더해 구할 수 있다. 
더 많은 정보를 위해 강의 노트의 "신경망" 섹션을 보기 바란다.

#### FC 레이어를 CONV 레이어로 변환하기

FC 레이어와 CONV 레이어의 차이점은, CONV 레이어는 입력의 일부 영역에만 연결되어 있고, CONV 볼륨의 많은 뉴런들이 파라미터를 공유한다는 것 뿐이라는 것을 알아 둘 필요가 있다. 두 레이어 모두 내적 연산을 수행하므로 실제 함수 형태는 동일하다. 그러므로 FC 레이어를 CONV 레이어로 변환하는 것이 가능하다:

- 모든 CONV 레이어는 동일한 forward 함수를 수행하는 FC 레이어 짝이 있다. 이 경우의 가중치 매트릭스는 몇몇 블록을 제외하고 모두 0으로 이뤄지며 (local connectivity: 입력의 일부 영역에만 연결된 특성), 이 블록들 중 여러 개는 같은 값을 지니게 된다 (파라미터 공유).

-It is worth noting that the only difference between FC and CONV layers is that the neurons in the CONV layer are connected only to a local region in the input, and that many of the neurons in a CONV volume share parameters. However, the neurons in both layers still compute dot products, so their functional form is identical. Therefore, it turns out that it's possible to convert between FC and CONV layers:
+- 반대로, 모든 FC 레이어는 CONV 레이어로 변환될 수 있다. 예를 들어, $$7 \times 7 \times 512$$ 크기의 입력을 받고 $$K = 4096$$인 FC 레이어는 $$F = 7, P = 0, S = 1, K = 4096$$인 CONV 레이어로 표현 가능하다. 바꿔 말하면, 필터의 크기를 입력 볼륨의 크기와 동일하게 만들고 $$1 \times 1 \times 4096$$ 크기의 아웃풋을 출력할 수 있다. 각 depth에 대해 하나의 값만 구해지므로 (필터의 가로/세로가 입력 볼륨의 가로/세로와 같으므로) FC 레이어와 같은 결과를 얻게 된다.

-- For any CONV layer there is an FC layer that implements the same forward function. The weight matrix would be a large matrix that is mostly zero except for at certian blocks (due to local connectivity) where the weights in many of the blocks are equal (due to parameter sharing).
-- Conversely, any FC layer can be converted to a CONV layer. For example, an FC layer with $$K = 4096$$ that is looking at some input volume of size $$7 \times 7 \times 512$$ can be equivalently expressed as a CONV layer with $$F = 7, P = 0, S = 1, K = 4096$$. In other words, we are setting the filter size to be exactly the size of the input volume, and hence the output will simply be $$1 \times 1 \times 4096$$ since only a single depth column "fits" across the input volume, giving identical result as the initial FC layer.
+**FC->CONV 변환**. 이 두 변환 중, FC 레이어를 CONV 레이어로의 변환은 실전에서 매우 유용하다. 224x224x3의 이미지를 입력으로 받고 일련의 CONV레이어와 POOL 레이어를 이용해 7x7x512의 액티베이션을 만드는 컨볼루션넷 아키텍쳐를 생각해 보자 (뒤에서 살펴 볼 *AlexNet* 아키텍쳐에서는 입력의 spatial(가로/세로) 크기를 반으로 줄이는 풀링 레이어 5개를 사용해 7x7x512의 액티베이션을 만든다. 224/2/2/2/2/2 = 7이기 때문이다). AlexNet은 여기에 4096의 크기를 갖는 FC 레이어 2개와 클래스 스코어를 계산하는 1000개 뉴런으로 이뤄진 마지막 FC 레이어를 사용한다. 이 마지막 3개의 FC 레이어를 CONV 레이어로 변환하는 방법을 아래에서 배우게 된다:

-**FC->CONV conversion**. Of these two conversions, the ability to convert an FC layer to a CONV layer is particularly useful in practice. Consider a ConvNet architecture that takes a 224x224x3 image, and then uses a series of CONV layers and POOL layers to reduce the image to an activations volume of size 7x7x512 (in an *AlexNet* architecture that we'll see later, this is done by use of 5 pooling layers that downsample the input spatially by a factor of two each time, making the final spatial size 224/2/2/2/2/2 = 7). From there, an AlexNet uses two FC layers of size 4096 and finally the last FC layers with 1000 neurons that compute the class scores. We can convert each of these three FC layers to CONV layers as described above:
+- [7x7x512]의 입력 볼륨을 받는 첫 번째 FC 레이어를 $$F = 7$$의 필터 크기를 갖는 CONV 레이어로 바꾼다. 이 때 출력 볼륨의 크기는 [1x1x4096]이 된다.
+- 두 번째 FC 레이어를 $$F = 1$$ 필터 사이즈의 CONV 레이어로 바꾼다. 이 때 출력 볼륨의 크기는 [1x1x4096]이 된다.
+- 같은 방식으로 마지막 FC 레이어를 $$F = 1$$의 CONV 레이어로 바꾼다. 출력 볼륨의 크기는 [1x1x1000]이 된다.
-- Replace the first FC layer that looks at [7x7x512] volume with a CONV layer that uses filter size $$F = 7$$, giving output volume [1x1x4096].
-- Replace the second FC layer with a CONV layer that uses filter size $$F = 1$$, giving output volume [1x1x4096]
-- Replace the last FC layer similarly, with $$F=1$$, giving final output [1x1x1000]
+각각의 변환은 일반적으로 FC 레이어의 가중치 $$W$$를 CONV 레이어의 필터로 변환하는 과정을 수반한다. 이런 변환을 하고 나면, 큰 이미지 (가로/세로가 224보다 큰 이미지)를 단 한번의 forward pass만으로 마치 이미지를 "슬라이딩"하면서 여러 영역을 읽은 것과 같은 효과를 준다.

-Each of these conversions could in practice involve manipulating (e.g. reshaping) the weight matrix $$W$$ in each FC layer into CONV layer filters. It turns out that this conversion allows us to "slide" the original ConvNet very efficiently across many spatial positions in a larger image, in a single forward pass.
+예를 들어, 224x224 크기의 이미지를 입력으로 받아 [7x7x512]의 볼륨을 출력하는 (즉, 224/7 = 32배 줄어드는) 이 아키텍쳐에 384x384 크기의 이미지를 넣으면 [12x12x512] 크기의 볼륨을 출력하게 된다 (384/32 = 12 이므로). 이후 방금 FC에서 변환한 3개의 CONV 레이어를 거치면 (12 - 7)/1 + 1 = 6 이므로 최종적으로 [6x6x1000] 크기의 볼륨을 얻게 된다. [1x1x1000] 크기의 클래스 점수 벡터 하나가 아니라, 384x384 이미지 전체에 대한 6x6개의 클래스 점수 배열을 얻게 된다는 점에 주목하자.

For example, if 224x224 image gives a volume of size [7x7x512] - i.e. a reduction by 32, then forwarding an image of size 384x384 through the converted architecture would give the equivalent volume in size [12x12x512], since 384/32 = 12. Following through with the next 3 CONV layers that we just converted from FC layers would now give the final volume of size [6x6x1000], since (12 - 7)/1 + 1 = 6. Note that instead of a single vector of class scores of size [1x1x1000], we're now getting an entire 6x6 array of class scores across the 384x384 image.

@@ -379,3 +383,8 @@ Additional resources related to implementation:

- [Caffe](http://caffe.berkeleyvision.org/), one of the most popular ConvNet libraries.
- [Example Torch 7 ConvNet](https://github.com/nagadomi/kaggle-cifar10-torch7) that achieves 7% error on CIFAR-10 with a single model
- [Ben Graham's Sparse ConvNet](https://www.kaggle.com/c/cifar-10/forums/t/10493/train-you-very-own-deep-convolutional-network/56310) package, which Ben Graham used to great success to achieve less than 4% error on CIFAR-10.
+
+---
+
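FC 레이어와, 필터 크기를 입력 크기와 같게 잡은 CONV 레이어가 같은 함수라는 점은 다음과 같이 확인해볼 수 있다 (원문에 없는 참고용 스케치로, 메모리를 아끼기 위해 본문의 7x7x512, K=4096 대신 줄인 토이 크기를 사용했고 가중치는 임의의 난수이다):

~~~python
import numpy as np

np.random.seed(0)
X = np.random.randn(7, 7, 32)             # 본문의 7x7x512 을 줄인 토이 입력 볼륨
K = 64                                    # 본문의 K=4096 을 줄인 토이 출력 크기
W_fc = np.random.randn(K, X.size)         # FC 레이어의 가중치
b = np.random.randn(K)

fc_out = W_fc.dot(X.ravel()) + b          # FC 레이어의 출력: [K]

# 같은 가중치를 F=7 필터 K개로 reshape 한 CONV 레이어: 출력은 [1x1xK]
conv_out = np.tensordot(W_fc.reshape(K, 7, 7, 32), X, axes=3) + b
print(np.allclose(fc_out, conv_out))      # True
~~~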

+번역: 김택수 (jazzsaxmafia) +

From 114c00700e71a43f407d852e3a1423a6df49ea1c Mon Sep 17 00:00:00 2001
From: "KIM, WOOJUNG"
Date: Mon, 9 May 2016 15:58:01 +0900
Subject: [PATCH 103/199] =?UTF-8?q?ipython-tutorial=EC=9D=98=20=EB=9D=84?=
 =?UTF-8?q?=EC=96=B4=EC=93=B0=EA=B8=B0=20=EC=98=A4=EB=A5=98=20=EC=88=98?=
 =?UTF-8?q?=EC=A0=95?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
---
 ipython-tutorial.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/ipython-tutorial.md b/ipython-tutorial.md
index 33e1a7da..b674e5ab 100644
--- a/ipython-tutorial.md
+++ b/ipython-tutorial.md
@@ -3,7 +3,7 @@ layout: page
 title: IPython Tutorial
 permalink: /ipython-tutorial/
 ---
-cs231s 수업에서는 프로그래밍 과제 진행을 위해 [IPython notebooks](http://ipython.org/)을 사용합니다. IPython notebook을 사용하면 여러분의 브라우저에서 Python코드를 작성하고 실행할 수 있습니다. Python notebook를 사용하면 여러조각의 코드를 아주 쉽게 수정하고 실행할 수 있습니다. 이런 장점 때문에 IPython notebook은 계산과학분야에서 널리 사용되고 있습니다.
+cs231n 수업에서는 프로그래밍 과제 진행을 위해 [IPython notebooks](http://ipython.org/)을 사용합니다. IPython notebook을 사용하면 여러분의 브라우저에서 Python 코드를 작성하고 실행할 수 있습니다. Python notebook을 사용하면 여러 조각의 코드를 아주 쉽게 수정하고 실행할 수 있습니다. 이런 장점 때문에 IPython notebook은 계산과학분야에서 널리 사용되고 있습니다.

 IPython의 설치와 실행은 간단합니다. command line에서 다음 명령어를 입력하여 IPython을 설치합니다.

@@ -17,7 +17,7 @@ IPython의 설치가 완료되면 다음 명령어를 통해 IPython을 실행
 ipython notebook
 ~~~

-IPython이 실행되면, IPyhton을 사용하기 위해 웹 브라우저를 실행하여 http://localhost:8888 에 접속합니다. 모든것이 잘 작동한다면 웹 브라우저에는 아래와 같은 화면이 나타납니다. 화면에는 현재 폴더에 사용가능한 Python notebook들이 나타납니다.
+IPython이 실행되면, IPython을 사용하기 위해 웹 브라우저를 실행하여 http://localhost:8888 에 접속합니다. 모든 것이 잘 작동한다면 웹 브라우저에는 아래와 같은 화면이 나타납니다. 화면에는 현재 폴더에 사용가능한 Python notebook들이 나타납니다.
@@ -29,32 +29,32 @@ notebook 파일을 클릭하면 다음과 같은 화면이 나타납니다.
-IPython notebook은 여러개의 **cell**들로 이루어져있습니다. 각각의 cell들은 Python코드를 포함하고 있습니다. `Shift-Enter`를 누르거나 셀을 클릭하여 셀을 실행할 수 있습니다. 셀의 코드를 실행하면 셀의 코드의 실행결과는 셀의 바로 아래에 나타납니다. 예를 들어 첫번째 cell의 코드를 실행하면 아래와 같은 화면이 나타납니다.
+IPython notebook은 여러 개의 **cell**들로 이루어져 있습니다. 각각의 cell은 Python 코드를 포함하고 있습니다. `Shift-Enter`를 누르거나 셀을 클릭하여 셀을 실행할 수 있습니다. 셀의 코드를 실행하면 그 실행 결과는 셀 바로 아래에 나타납니다. 예를 들어 첫 번째 cell의 코드를 실행하면 아래와 같은 화면이 나타납니다.
-전역변수들은 다른 셀들에게도 공유됩니다. 두번째 셀을 실행하면 다음과 같은 결과가 나옵니다. +전역변수들은 다른 셀들에도 공유됩니다. 두 번째 셀을 실행하면 다음과 같은 결과가 나옵니다.
-일반적으로, IPython notebook의 코드를 실행할 때 맨위에서 맨 아래 순서로 실행합니다. -몇몇 셀을 실행하는데 실패하거나 셀들을 순서대로 실행하지 않으면 오류가 발생할 수 있습니다. +일반적으로, IPython notebook의 코드를 실행할 때 맨 위에서 맨 아래 순서로 실행합니다. +몇몇 셀을 실행하는 데 실패하거나 셀들을 순서대로 실행하지 않으면 오류가 발생할 수 있습니다.
-과제를 진행하면서 notebook의 cell을 수정하거나 실행하여 IPython notebook이 변경되었다면 **저장하는 것을 잊지마세요.** +과제를 진행하면서 notebook의 cell을 수정하거나 실행하여 IPython notebook이 변경되었다면 **저장하는 것을 잊지 마세요.**
-지금 까지 IPyhton의 사용법에 대해서 알아보았습니다. 간략한 내용이지만 위 내용들을 잘 숙지하면 무리없이과제를 진행할 수 있습니다.
+지금까지 IPython의 사용법에 대해서 알아보았습니다. 간략한 내용이지만 위 내용을 잘 숙지하면 무리 없이 과제를 진행할 수 있습니다.

---
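(역자 보충) 위 내용을 보여 주는 간단한 가상의 예시 하나를 덧붙입니다. 아래 두 cell을 순서대로 실행하면 첫 번째 cell의 전역변수를 두 번째 cell에서 그대로 사용할 수 있지만, 첫 번째 cell을 건너뛰고 두 번째 cell만 실행하면 `NameError`가 발생합니다:

~~~python
# 첫 번째 cell: 전역변수를 정의합니다
xs = [1, 2, 3]

# 두 번째 cell: 앞의 cell을 먼저 실행했다면 xs를 그대로 사용할 수 있습니다
# (이 cell만 단독으로 실행하면 xs가 정의되지 않아 NameError가 발생합니다)
print [x ** 2 for x in xs]   # [1, 4, 9]
~~~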

From 553adb1d2676171947045aaee0f285f814025714 Mon Sep 17 00:00:00 2001
From: "KIM, WOOJUNG" <kiwo0611@gmail.com>
Date: Mon, 9 May 2016 16:09:04 +0900
Subject: [PATCH 104/199] =?UTF-8?q?terminal-tutorial.md=20=EB=9D=84?=
 =?UTF-8?q?=EC=96=B4=EC=93=B0=EA=B8=B0=20=EC=98=A4=EB=A5=98=20=EC=A0=95?=
 =?UTF-8?q?=EC=A0=95?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 terminal-tutorial.md | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/terminal-tutorial.md b/terminal-tutorial.md
index 7f8261b4..c1c8e29c 100644
--- a/terminal-tutorial.md
+++ b/terminal-tutorial.md
@@ -3,39 +3,39 @@ layout: page
title: Terminal.com Tutorial
permalink: /terminal-tutorial/
---
-과제를 진행하기 위해서, [Terminal](https://www.stanfordterminalcloud.com)을 사용하는 옵션을 제공합니다. Terminal에서 여러분들의 결과물을 개발하고 테스트 할 수 있습니다. 한가지 유의해야할 것은 Terminal.com의 메인페이지를 사용하지 않고 cs231n 수업을 위해 특별히 할당된 서브도메인에 등록된 사이트를 사용합니다. [Terminal](https://www.stanfordterminalcloud.com)은 미리 설정된 커맨드 라인 환경(command line environment)에 접근할수 있는 온라인 컴퓨팅 플랫폼입니다. 여러분들의 과제를 진행하기 위해서 반드시 [Terminal](https://www.stanfordterminalcloud.com) 을 사용할 필요는 없습니다. 그러나 개발을 위한 필요사항들과 개발도구들이 미리 설정되어 있기 때문에 수고를 덜 수 있습니다.
+과제를 진행하기 위해서, [Terminal](https://www.stanfordterminalcloud.com)을 사용하는 옵션을 제공합니다. Terminal에서 여러분의 결과물을 개발하고 테스트할 수 있습니다. 한 가지 유의해야 할 것은, Terminal.com의 메인페이지가 아니라 cs231n 수업을 위해 특별히 할당된 서브도메인에 등록된 사이트를 사용한다는 점입니다. [Terminal](https://www.stanfordterminalcloud.com)은 미리 설정된 커맨드 라인 환경(command line environment)에 접근할 수 있는 온라인 컴퓨팅 플랫폼입니다. 여러분이 과제를 진행하기 위해서 반드시 [Terminal](https://www.stanfordterminalcloud.com)을 사용할 필요는 없습니다. 그러나 개발을 위한 필요사항들과 개발도구들이 미리 설정되어 있기 때문에 수고를 덜 수 있습니다.

-이 튜토리얼은 Terminal을 사용하여 과제를 진행하기 위한 필수적인 과정들을 설명합니다. 가장 먼저, [여러분의 계정을 만듭니다.](https://www.stanfordterminalcloud.com/signup). 방금전에 만든 계정으로 [Terminal](https://www.stanfordterminalcloud.com)에 로그인 합니다.
+이 튜토리얼은 Terminal을 사용하여 과제를 진행하기 위한 필수적인 과정들을 설명합니다. 가장 먼저 [여러분의 계정을 만듭니다](https://www.stanfordterminalcloud.com/signup). 바로 전에 만든 계정으로 [Terminal](https://www.stanfordterminalcloud.com)에 로그인합니다.

-각각의 과제마다 Terminal 스냅샷 링크를 제공합니다. 이 스냅샷들은 여러분들의 결과물을 작성하고 실행할 시작코드와 미리 설정된 커맨드 라인 환경이 포함되어 있습니다.
+각각의 과제마다 Terminal 스냅 샷 링크를 제공합니다. 이 스냅 샷들에는 여러분의 결과물을 작성하고 실행할 시작 코드와 미리 설정된 command line 환경이 포함되어 있습니다.

-여기 2015년 과제처럼 보이는 스냅샷을 통해 예를 들어보겠습니다.
+여기 2015년 과제처럼 보이는 스냅 샷을 통해 예를 들어보겠습니다.

-여러분의 스냅샷도 이와 비슷할 것입니다. 오른쪽 아래의 "Start" 버튼을 클릭합니다. 그럼 여러분의 계정에 공유된 스냅샷이 복사됩니다. 이제 [My Terminals](https://www.stanfordterminalcloud.com/terminals) 탭에서 복사된 터미널을 찾을 수 있습니다. +여러분의 스냅 샷도 이와 비슷할 것입니다. 오른쪽 아래의 "Start" 버튼을 클릭합니다. 그럼 여러분의 계정에 공유된 스냅 샷이 복사됩니다. 이제 [My Terminals](https://www.stanfordterminalcloud.com/terminals) 탭에서 복사된 터미널을 찾을 수 있습니다.
-여러분의 화면도 이와 비슷할 것입니다. 이제 과제를 진행하기 위한 준비가 되었습니다! 링크를 클릭하여 terminal을 열어봅시다. (위 이미지의 빨간색 상자) 이 링크는 AWS 머신상의 유저인터페이스를 계층을 엽니다. 다음과 비슷한 화면이 나타납니다.
+여러분의 화면도 이와 비슷할 것입니다. 이제 과제를 진행하기 위한 준비가 되었습니다! 링크를 클릭하여 terminal을 열어봅시다 (위 이미지의 빨간색 상자). 이 링크는 AWS 머신상의 사용자 인터페이스 계층을 엽니다. 다음과 비슷한 화면이 나타납니다.
-
-terminal에 Jupyter Notebook과 다른 필요요소들이 설치되어 있습니다. 조그마한 + 버튼을 눌러 콘솔을 실행합니다.(콘솔이 없을 경우), 그리고 과제폴더와 코드를 찾습니다. 그리고 Jupyper Notebook을 실행하고 과제를 진행합니다. 만약 당신이 cs231n에 등록한 학생이면 코스워크를 통해 과제를 제출해야합니다.
+ㅇㅇ
+terminal에 Jupyter Notebook과 다른 필요 요소들이 설치되어 있습니다. 콘솔이 없으면 조그마한 + 버튼을 눌러 콘솔을 실행합니다. 그리고 과제 폴더와 코드를 찾아 Jupyter Notebook을 실행하고 과제를 진행합니다. 만약 당신이 cs231n에 등록한 학생이면 코스워크를 통해 과제를 제출해야 합니다.
-[Terminal](https://www.stanfordterminalcloud.com)에 대한 더 많은 정보를 원하시면 [FAQ](https://www.stanfordterminalcloud.com/faq)페이지를 방문해주세요
+[Terminal](https://www.stanfordterminalcloud.com)에 대한 더 많은 정보를 원하시면 [FAQ](https://www.stanfordterminalcloud.com/faq) 페이지를 방문해 주세요.

-**중요** 터미널 사용시 사용하는 인스턴스 타입에 따라 시간당 사용요금이 부과됩니다. 미디엄 타입의 인스턴스 요금은 시간당 $0.124 입니다.
+**중요** 터미널 사용 시 사용하는 인스턴스 타입에 따라 시간당 사용 요금이 부과됩니다. 미디엄 타입의 인스턴스 요금은 시간당 $0.124입니다.

---

From f8778253a67375a563f94302c51ab5c3b9d7af13 Mon Sep 17 00:00:00 2001 From: "KIM, WOOJUNG" Date: Mon, 9 May 2016 16:19:05 +0900 Subject: [PATCH 105/199] Update terminal-tutorial.md --- terminal-tutorial.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/terminal-tutorial.md b/terminal-tutorial.md index c1c8e29c..3da07bc3 100644 --- a/terminal-tutorial.md +++ b/terminal-tutorial.md @@ -26,7 +26,7 @@ permalink: /terminal-tutorial/

-ㅇㅇ
+
 terminal에 Jupyter Notebook과 다른 필요 요소들이 설치되어 있습니다. 콘솔이 없으면 조그마한 + 버튼을 눌러 콘솔을 실행합니다. 그리고 과제 폴더와 코드를 찾아 Jupyter Notebook을 실행하고 과제를 진행합니다. 만약 당신이 cs231n에 등록한 학생이면 코스워크를 통해 과제를 제출해야 합니다.
From 46d63c4b4115616c84df12375aad5582dac2777f Mon Sep 17 00:00:00 2001
From: ygchoistat <ygchoistat@gmail.com>
Date: Mon, 9 May 2016 21:38:20 +0900
Subject: [PATCH 106/199] Add files via upload
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

생각보다 작업이 더뎌서 죄송합니다. 전체 7000단어 중 1200단어 가량 (1/6) 마쳤습니다.
---
 neural-networks-3.md | 38 +++++++++++++++++++++++---------------
 1 file changed, 23 insertions(+), 15 deletions(-)

diff --git a/neural-networks-3.md b/neural-networks-3.md
index 819b511e..4b579e98 100644
--- a/neural-networks-3.md
+++ b/neural-networks-3.md
@@ -34,7 +34,7 @@ Table of Contents:

이론적인 그라디언트 체크라 하면, 수치적으로 계산한(numerical) 그라디언트와 수식으로 계산한(analytic) 그라디언트를 비교하는 정도라 매우 간단하다고 생각할 수도 있겠다. 그렇지만 이 작업을 직접 실현해 보면 훨씬 복잡하고 뜬금없이 오차가 발생하기도 쉽다는 것을 깨달을 것이다. 이제 팁, 트릭, 조심할 이슈들 몇 개를 소개하고자 한다.

-**같은 근사라 하여도 이론적으로 더 정확도가 높은 공식이 있다 (Use the centered formula)**. 그라디언트($\frac{df(x)}{dx}$)를 수치적으로 근사한다 하면 보통 다음 유한 차분 근사(finite difference approximation)를 떠올릴 것이다:
+**같은 근사라 하여도 이론적으로 더 정확도가 높은 근사 공식이 있다 (Use the centered formula)**. 그라디언트($\frac{df(x)}{dx}$)를 수치적으로 근사한다 하면 보통 다음 유한 차분 근사(finite difference approximation)를 떠올릴 것이다:

$$
\frac{df(x)}{dx} = \frac{f(x + h) - f(x)}{h} \hspace{0.1in} \text{(bad, do not use)}
$$

@@ -46,33 +46,41 @@ $$
\frac{df(x)}{dx} = \frac{f(x + h) - f(x - h)}{2h} \hspace{0.1in} \text{(use instead)}
$$

-물론 이 공식은 $f(x+h)$ 말고도 $f(x-h)$도 계산하여야 하므로 최초 식보다 계산량이 두 배 많지만 훨씬 정확한 근사를 제공한다. $f(x+h)$ 및 $f(x-h)$의 ($x$ 근방에서의) 테일러 전개를 고려하면 이유를 금방 알 수 있다. 첫 식은 $O(h)$의 오차가 있는 데 반해 -- 역자 주 : $f(x + h) = f(x) + hf'(x) + O(h)$로부터 $f'(x) - \frac{(f(x+h)-f(x)}{h} = O(h)$ -- 두번째 식은 오차가 $O(h^2)$이다 (즉, 이차 근사이다).
+물론 이 공식은 $f(x+h)$ 말고도 $f(x-h)$도 계산하여야 하므로 최초 식보다 계산량이 두 배 많지만 훨씬 정확한 근사를 제공한다. $f(x+h)$ 및 $f(x-h)$의 ($x$ 근방에서의) 테일러 전개를 고려하면 이유를 금방 알 수 있다. 첫 식은 $O(h)$의 오차가 있는 데 반해 두번째 식은 오차가 $O(h^2)$이다 (즉, 이차 근사이다). -- 역자 주 : (1) 테일러 전개에서 $f(x + h) = f(x) + hf'(x) + O(h^2)$로부터 $f'(x) - \frac{f(x+h)-f(x)}{h} = O(h)$. (2) $h$가 보통 벡터이므로 $O(h)$보다는 $O(\|h\|)$가 더 정확한 표현이나 편의상 $\|\cdot\|$을 생략한 듯 보입니다.

-**상대 오차를 사용하라 (Use relative error for the comparison)**. 그라디언트의 (수식으로 계산한, analytic) 참값 $f'_a$와 수치적(numerical) 근사값 $f'_n$을 비교하려면 어떤 디테일을 점검하여야 할까? 이 둘이 비슷하지 않음(not compatible)을 어떻게 알아낼 수 있을까? 가장 쉽게는 둘의 절대 오차 $\mid f'_a - f'_n \mid $ 혹은 그 제곱을 쭉 추적하여 이 값(들)이 언젠가 어느 한계점(threshold)를 넘으면 그라디언트 오류라 할 수도 있겠다. 그렇지만 절대 오차에는 문제가 있는 것이, 가령 절대 오차가 1e-4라 가정하여 보자. 만약 $f'_a$와 $f'_n$ 모두 1.0 언저리라면 1e-4의 오차 정도는 매우 훌륭한 근사이고 $f'_a \approx f'_n$이라 할 수 있다. 그런데 만약 두 그라디언트가 1e-5거나 더 작은 값이라면? 그렇다면 1e-4는 매우 큰 차이가 되고 근사가 실패했다고 보는 게 맞다. 따라서 절대 오차와 두 그라디언트 값의 비율을 고려하는 *상대 오차*가 더 적절하다. 언제나!:
+**상대 오차를 사용하라 (Use relative error for the comparison)**. 그라디언트의 (수식으로 계산한, analytic) 참값 $f'_a$와 수치적(numerical) 근사값 $f'_n$을 비교하려면 어떤 디테일을 점검하여야 할까? 이 둘이 비슷하지 않음(not compatible)을 어떻게 알아낼 수 있을까? 가장 쉽게는 둘의 절대 오차 $\mid f'_a - f'_n \mid $ 혹은 그 제곱을 쭉 추적하여 이 값(들)이 언젠가 어느 한계점(threshold)를 넘으면 그라디언트 오류라 할 수도 있겠다. 그렇지만 절대 오차에는 문제가 있는 것이, 가령 절대 오차가 1e-4라 가정하여 보자. 만약 $f'_a$와 $f'_n$ 모두 1.0 언저리라면 1e-4의 오차 정도는 매우 훌륭한 근사이고 $f'_a \approx f'_n$이라 할 수 있다. 그런데 만약 두 그라디언트가 1e-5거나 더 작은 값이라면? 그렇다면 1e-4는 매우 큰 차이가 되고 근사가 실패했다고 보아야 한다. 따라서 절대 오차와 두 그라디언트 값의 비율을 고려하는 *상대 오차*가 더 적절하다. 언제나!:
+

$$
\frac{\mid f'_a - f'_n \mid}{\max(\mid f'_a \mid, \mid f'_n \mid)}
$$

-which considers their ratio of the differences to the ratio of the absolute values of both gradients.
Notice that normally the relative error formula only includes one of the two terms (either one), but I prefer to max (or add) both to make it symmetric and to prevent dividing by zero in the case where one of the two is zero (which can often happen, especially with ReLUs). However, one must explicitly keep track of the case where both are zero and pass the gradient check in that edge case. In practice:
+보통의 상대 오차 공식은 분모에 $f'_a$ 혹은 $f'_n$ 둘 중 하나만 있지만, 나는 둘의 최대값을 분모로 선호하는 편이다. 그래야 공식에 대칭성이 생기고 둘 중 하나가 exactly 0이 되어 분모가 0이 되는 사태를 방지할 수 있다 (ReLU를 사용하면 자주 일어나는 문제이다). $f'_a$와 $f'_n$가 모두 exact 0이 된다면? 이 때는 상대 오차를 점검할 필요 없이 그라디언트 체크를 통과하여야 한다. 당신의 코드가 이 상황을 감안하여 조직된 코드인지 점검하여 보라.
+
+실제 상황에서의 유용한 가이드:
+
+- (상대 오차) > 1e-2 면 그라디언트 계산이 아마 잘못되었을 수도 있다.
+- 1e-2 > (상대 오차) > 1e-4 면 불편함을 느끼기 바란다.
+- 1e-4 > (상대 오차) 는, 꺾임이 있는 목적함수 (objectives with kinks)에서는 괜찮다. 그렇지만 tanh 혹은 softmax를 쓰는 목적함수처럼 꺾임이 없다면 1e-4는 너무 크다.
+- 1e-7 혹은 그보다 작은 상대 오차라면, 행복을 느껴야 한다.
+
+하나 더 유념해야 할 것은, 망의 레이어 개수가 많아지면(deeper network) 상대 오차가 커진다는 점이다. 이를테면 레이어(layer) 10개짜리 망(network)에서 인풋 데이터의 그라디언트를 체크한다면, 에러가 층을 올라가며 축적되므로 1e-2 정도의 상대 오차는 괜찮을 수도 있다. 거꾸로 말하자면, 미분가능한 함수 하나만 갖고 노는데 1e-2의 상대 오차가 발생한다면 이것은 부정확한 그라디언트일 가능성이 매우 높다.
+
+**이중정확성 변수를 사용하라 (Use double precision)**. 흔히들 실수하는 것이, 그라디언트 체크를 계산하는 데 단일정확성 부동소숫점(single precision floating point) 변수를 사용하는 경우가 있다. 단일정확성 변수를 쓰면 그라디언트 계산이 맞다 하더라도 상대 오차가 (1e-2 정도로) 커지는 경우가 종종 있다. 내 경험상으로는 이중정확성 변수를 쓰면 상대 오차가 1e-2에서 1e-8까지 개선되는 경우도 봤다.
+

-- relative error > 1e-2 usually means the gradient is probably wrong
-- 1e-2 > relative error > 1e-4 should make you feel uncomfortable
-- 1e-4 > relative error is usually okay for objectives with kinks. But if there are no kinks (e.g. use of tanh nonlinearities and softmax), then 1e-4 is too high.
-- 1e-7 and less you should be happy.
+**부동소숫점 연산이 활성화되는 범위에서 계산하라 (Stick around active range of floating point)**. 좀 더 세심한 코드를 작성하고 실수를 줄이고 싶다면 ["모든 컴퓨터 사이언티스트들이 부동소숫점 연산에 대해 알아야 하는 것들(What Every Computer Scientist Should Know About Floating-Point Arithmetic)"](http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html) 를 읽는 게 좋다. 예를 들어, 신경망에서는 손실함수(loss function)를 배치 단위로(over batch) normalize하는 것이 보통이다 (역자 주 : 그라디언트 합을 배치 사이즈로 나누는 장면을 지칭하는 듯). 그렇지만 한 자료당(per datapoint) 그라디언트가 매우 작다면, 거기에 또 데이터 갯수를 *부가적으로* 나눌 경우 매우 작은 수가 되고 더욱더 많은 수치적인 문제가 생길 수 있다. 그래서 필자는 $f'_a$ 혹은 $f'_n$의 계산값을 계속 찍어보고 두 값이 너무 작지 않은가 확인하는 편이다. (대충 1e-10 혹은 그보다 작은 크기의 값이면 걱정하여라) 만약 두 값이 너무 작다면, 적당히 상수를 곱하여 부동소숫점 표현이 조금 더 "괜찮도록" (부동소숫점 표현에서 지수 부분이 0이 되도록) 만들 수도 있다.

-Also keep in mind that the deeper the network, the higher the relative errors will be. So if you are gradient checking the input data for a 10-layer network, a relative error of 1e-2 might be okay because the errors build up on the way. Conversely, an error of 1e-2 for a single differentiable function likely indicates incorrect gradient.

-**Use double precision**. A common pitfall is using single precision floating point to compute gradient check. It is often that case that you might get high relative errors (as high as 1e-2) even with a correct gradient implementation. In my experience I've sometimes seen my relative errors plummet from 1e-2 to 1e-8 by switching to double precision.
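(역자 보충) 위의 중심화된 차분 공식, 상대 오차 기준, 그리고 "계산값을 직접 찍어보라"는 조언을 한데 모으면 대략 아래와 같은 모양이 된다. 본문 노트의 공식 구현을 옮긴 것이 아니라 역자가 가정을 단순화해 적어 본 스케치로, `f`는 스칼라 손실을 반환하는 함수이고 `x`와 `analytic_grad`는 같은 크기의 numpy 배열이라고 가정한다:

~~~python
import numpy as np

def grad_check_sketch(f, x, analytic_grad, h=1e-5, num_checks=10):
  # 전체가 아니라 일부 차원만 무작위로 골라 점검한다 (Check only few dimensions)
  for i in np.random.choice(x.size, num_checks):
    oldval = x.flat[i]
    x.flat[i] = oldval + h
    fxph = f(x)                                # f(x + h)
    x.flat[i] = oldval - h
    fxmh = f(x)                                # f(x - h)
    x.flat[i] = oldval                         # 반드시 원래 값으로 복원
    grad_numerical = (fxph - fxmh) / (2 * h)   # 중심화된 차분, 오차는 O(h^2)
    grad_analytic = analytic_grad.flat[i]
    # 두 값의 크기 자체도 직접 확인한다 (1e-10 수준이면 수치적으로 위험)
    print 'numerical: %e, analytic: %e' % (grad_numerical, grad_analytic)
    denom = max(abs(grad_numerical), abs(grad_analytic))
    if denom == 0:
      continue                                 # 둘 다 정확히 0이면 통과로 간주
    print 'relative error: %e' % (abs(grad_numerical - grad_analytic) / denom)
~~~

$h$를 무작정 줄이는 것이 능사는 아니라는 점(아래의 **Be careful with the step size h** 참조)도 같이 기억해 두자.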
+**목적함수에서의 꺾인 점 (Kinks in the objective)**. *꺾인 점(kink)*들에서 부정확한 계산이 발생할 수 있는데 이를 그라디언트 체크 과정에서도 염두에 두고 있어야 한다. 꺾인 점(kink)은 목적함수의 미분 불가능한 부분을 지칭하는 용어이다. ReLU 함수 ($max(0,x)$), 서포트 벡터 머신(SVM) 목적함수나 맥스아웃 뉴런(maxout neuron) 등을 사용하면 발생할 수 있다. 꺾인 점이 야기시킬 수 있는 문제는 대략 이렇다. ReLU 함수의 그라디언트를 $x = -1e6$에서 체크한다고 생각하여 보자. $x < 0$이므로 $f'_a$는 정확히 $0$이다. 그렇지만, 수치적으로 계산된 그라디언트는 $f(x+h)$가 꺾인 점을 넘을 수도 있으므로 (이를테면 $h > 1e-6$인 경우) 갑자기 $0$이 아닌 값을 내놓게 될 수도 있다. 이런 병적인(?) 경우까지 신경 써야 하냐고 물을 수도 있겠는데, 사실 매우 흔하다. 예를 들어 CIFAR-10을 위해 서포트 벡터 머신(SVM)을 쓴다고 하면, 데이터가 50,000개이고(50,000 examples) 한 데이터당 $max(0,x)$ 항이 9개씩 있으니 결국 45만 개의 ReLU 항과 맞닥뜨리게 된다. 게다가 서포트 벡터 머신 분류기(SVM classifier)와 신경망(neural network)을 붙이면 ReLU들 때문에 꺾인 점이 더 늘어날 수도 있다.

다행히도, 손실함수를 계산할 때 꺾인 점을 넘어서 계산했는지 (a kink was crossed) 여부를 알 수 있다. $max(x,y)$ 꼴 함수에서 $x$, $y$ 중 누가 "이겼는지"를 계속 기록해둔다고 생각해 보자. $f(x+h)$와 $f(x-h)$를 계산할 때 적어도 하나의 "승자"가 바뀐다면, 꺾인 점을 넘는 현상이 발생한 것이고 그렇다면 수치적인 그라디언트가 정확한 값이 아닐 수도 있다.

**적은 수의 데이터만 써라 (Use only few datapoints)**. 꺾인 점과 관련된 하나의 해결책은 더 적은 데이터를 쓰는 것이다. 손실함수가 꺾인 점을 포함하고 있으면 (ReLU나 margin loss 등을 썼을 경우처럼) 데이터가 적을수록 더 적은 꺾인 점을 포함할 것이고, 따라서 유한 차분 근사(finite difference approximation) 과정에서 꺾인 점을 가로지르는 경우가 더 적을 것이다. 게다가, 대략 2~3개의 데이터에 대해서만 그라디언트 체크를 해도 사실상 배치(batch) 전체에 대해 그라디언트 체크하는 것과 다름없으니 훨씬 빠르고 효율적이다. (역자 주 : 그렇지만 배치 사이즈가 작아지면 다른 쪽에서 문제가 생길 수도 있을 것 같은데..)

-**Use only few datapoints**. One fix to the above problem of kinks is to use fewer datapoints, since loss functions that contain kinks (e.g. due to use of ReLUs or margin losses etc.)
will have fewer kinks with fewer datapoints, so it is less likely for you to cross one when you perform the finite different approximation. Moreover, if your gradcheck for only ~2 or 3 datapoints then you would almost certainly gradcheck for an entire batch. Using very few datapoints also makes your gradient check faster and more efficient. **Be careful with the step size h**. It is not necessarily the case that smaller is better, because when $h$ is much smaller, you may start running into numerical precision problems. Sometimes when the gradient doesn't check, it is possible that you change $h$ to be 1e-4 or 1e-6 and suddenly the gradient will be correct. This [wikipedia article](http://en.wikipedia.org/wiki/Numerical_differentiation) contains a chart that plots the value of **h** on the x-axis and the numerical gradient error on the y-axis. @@ -379,4 +387,4 @@ To train a Neural Network: - [SGD](http://research.microsoft.com/pubs/192769/tricks-2012.pdf) tips and tricks from Leon Bottou - [Efficient BackProp](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf) (pdf) from Yann LeCun - [Practical Recommendations for Gradient-Based Training of Deep -Architectures](http://arxiv.org/pdf/1206.5533v2.pdf) from Yoshua Bengio +Architectures](http://arxiv.org/pdf/1206.5533v2.pdf) from Yoshua Bengio \ No newline at end of file From 2456bf1434f46e4a9720a0aec9c22091de4ba576 Mon Sep 17 00:00:00 2001 From: ygchoistat Date: Mon, 9 May 2016 21:38:57 +0900 Subject: [PATCH 107/199] Delete neural-networks-3-kr.md --- neural-networks-3-kr.md | 386 ---------------------------------------- 1 file changed, 386 deletions(-) delete mode 100644 neural-networks-3-kr.md diff --git a/neural-networks-3-kr.md b/neural-networks-3-kr.md deleted file mode 100644 index 9324a357..00000000 --- a/neural-networks-3-kr.md +++ /dev/null @@ -1,386 +0,0 @@ ---- -layout: page -permalink: /neural-networks-3/ ---- - -Table of Contents: - -- [그라디언트 점검 (Gradient checks)](#gradcheck) -- [Sanity checks](#sanitycheck) -- [학습 과정 돌보기 (Babysitting the learning process)](#baby) - - [손실 함수 (Loss function)](#loss) - - [훈련/검증 성능 (Train/val accuracy)](#accuracy) - - [웨이트의 현재값과 변화량의 비율 (Weights:Updates ratio)](#ratio) - - [레이어별 활성값 및 그라디언트값의 분포 (Activation/Gradient distributions per layer)](#distr) - - [시각화 (Visualization)](#vis) -- [파라미터 업데이트 (Parameter updates)](#update) - - [일차 근사 방법 (SGD) (First-order (SGD)), 모멘텀 (momentum), Nesterov 모멘텀 (Nesterov momentum)](#sgd) - - [학습 속도를 담금질하기 (Annealing the learning rate)](#anneal) - - [이차 근사 방법 (Second-order methods)](#second) - - [파라미터별로 학습 속도를 데이터가 판단하게 하기 (Adagrad, RMSProp) )Per-parameter adaptive learning rates (Adagrad, RMSProp))](#ada) -- [초-파라미터 최적화 (Hyperparameter Optimization)](#hyper) -- [평가 (Evaluation)](#eval) - - [모형 앙상블 (Model Ensembles)](#ensemble) -- [요약](#summary) -- [추가적인 참고 문헌](#add) - -## Learning - -이전 섹션들에서는 레이어를 몇 층 쌓고 레이어별로 몇 개의 유닛을 준비할지(newwork connectivity), 데이터를 어떻게 준비하고 어떤 손실 함수(loss function)를 선택할지 논하였다. 말하자면 이전 섹션들은 주로 뉴럴 네트워크(Neural Network)의 정적인 부분인데, 본 섹션에서는 동적인 부분들을 소개한다. 파라미터(parameter)를 학습하고 좋은 초-파라미터(hyperparamter)를 찾는 과정 등을 다룰 예정이다. - - -### 그라디언트 체크 (Gradient Checks) - -이론적인 그라디언트 체크라 하면, 수치적으로 계산한(numerical) 그라디언트와 수식으로 계산한(analytic) 그라디언트를 비교하는 정도라 매우 간단하다고 생각할 수도 있겠다. 그렇지만 이 작업을 직접 실현해 보면 훨씬 복잡하고 뜬금없이 오차가 발생하기도 쉽다는 것을 깨달을 것이다. 이제 팁, 트릭, 조심할 이슈들 몇 개를 소개하고자 한다. - - -**같은 근사라 하여도 이론적으로 더 정확도가 높은 근사 공식이 있다 (Use the centered formula)**. 
그라디언트($\frac{df(x)}{dx}$)를 수치적으로 근사한다 하면 보통 다음 유한 차분 근사(finite difference approximation)를 떠올릴 것이다: - -$$ -\frac{df(x)}{dx} = \frac{f(x + h) - f(x)}{h} \hspace{0.1in} \text{(bad, do not use)} -$$ - -여기서 $h$는 아주 작은 수이고 보통 1e-5 정도의 수를 사용한다. 위 식보다는 아래의 *중심화된(centered)* 차분 공식이 경험적으로는 훨씬 낫다: - -$$ -\frac{df(x)}{dx} = \frac{f(x + h) - f(x - h)}{2h} \hspace{0.1in} \text{(use instead)} -$$ - -물론 이 공식은 $f(x+h)$ 말고도 $f(x-h)$도 계산하여야 하므로 최초 식보다 계산량이 두 배 많지만 훨씬 정확한 근사를 제공한다. $f(x+h)$ 및 $f(x-h)$의 ($x$ 근방에서의) 테일러 전개를 고려하면 이유를 금방 알 수 있다. 첫 식은 $O(h)$의 오차가 있는 데 반해 두번째 식은 오차가 $O(h^2)$이다 (즉, 이차 근사이다). -- 역자 주 : (1) 테일러 전개에서 $f(x + h) = f(x) + hf'(x) + O(h)$로부터 $f'(x) - \frac{(f(x+h)-f(x)}{h} = O(h)$. (2) $h$가 보통 벡터이므로 $O(h)$보다는 $O(\|h\|)$가 더 정확한 표현이나 편의상 $\|\cdot\|$을 생략한 듯 보입니다. - - -**상대 오차를 사용하라 (Use relative error for the comparison)**. 그라디언트의 (수식으로 계산한, analytic) 참값 $f'_a$와 수치적(numerical) 근사값 $f'_n$을 비교하려면 어떤 디테일을 점검하여야 할까? 이 둘이 비슷하지 않음(not compatible)을 어떻게 알아낼 수 있을까? 가장 쉽게는 둘의 절대 오차 $\mid f'_a - f'_n \mid $ 혹은 그 제곱을 쭉 추적하여 이 값(들)이 언젠가 어느 한계점(threshold)를 넘으면 그라디언트 오류라 할 수도 있겠다. 그렇지만 절대 오차에는 문제가 있는 것이, 가령 절대 오차가 1e-4라 가정하여 보자. 만약 $f'_a$와 $f'_n$ 모두 1.0 언저리라면 1e-4의 오차 정도는 매우 훌륭한 근사이고 $f'_a \approx f'_n$이라 할 수 있다. 그런데 만약 두 그라디언트가 1e-5거나 더 작은 값이라면? 그렇다면 1e-4는 매우 큰 차이가 되고 근사가 실패했다고 보아야 한다. 따라서 절대 오차와 두 그라디언트 값의 비율을 고려하는 *상대 오차*가 더 적절하다. 언제나!: - - -$$ -\frac{\mid f'_a - f'_n \mid}{\max(\mid f'_a \mid, \mid f'_n \mid)} -$$ - -보통의 상대 오차 공식은 분모에 $f'_a$ 혹은 $f'_n$ 둘 중 하나만 있지만, 나는 둘의 최대값을 분모로 선호하는 편이다. 그래야 공식에 대칭성이 생기고 둘 중 하나가 exactly 0이 되어 분모가 0이 되는 사태를 방지할 수 있다 (ReLU를 사용하면 자주 일어나는 문제이다). $f'_a$와 $f'_n$가 모두 exact 0이 된다면? 이 때는 상대 오차를 점검할 필요 없이 그라디언트 체크를 통과하여야 한다. 당신의 코드가 이 상황을 감안하여 조직된 코드인지 점검하여 보라. - -실제 상황에서의 유용한 가이드: - -- (상대 오차) > 1e-2 면 그라디언트 계산이 아마 잘못되었을 수도 있다. -- 1e-2 > (상대 오차) > 1e-4 면 불편함을 느끼기 바란다. -- 1e-4 > (상대 오차) 는, 꺾임이 있는 목적함수 (objectives with kinks)에서는 괜찮다. 그렇지만 tanh 혹은 softmax를 쓰는 목적함수처럼 꺾임이 없다면 1e-4는 너무 크다. -- 1e-7 혹은 그보다 작은 상대 오차라면, 행복함을 만끽하라. - -하나 더 유념해야 할 것은, 망의 레이어 개수가 많아지면(deeper network) 상대 오차가 커진다. 이를테면 레이어(layer) 10개짜리 망(network)에서 인풋 데이터의 그라디언트를 체크한다면, 에러가 층을 올라가며 축적되므로 1e-2 정도의 상대 오차는 괜찮을 수도 있다. 거꾸로 말하자면, 미분가능한 함수 하나만 갖고 노는데 1e-2의 상대 오차가 발생한다면 이것은 부정확한 그라디언트일 가능성이 매우 높다. - - -**double precision형 변수를 사용하라 (Use double precision)**. A common pitfall is using single precision floating point to compute gradient check. It is often that case that you might get high relative errors (as high as 1e-2) even with a correct gradient implementation. In my experience I've sometimes seen my relative errors plummet from 1e-2 to 1e-8 by switching to double precision. - -**Stick around active range of floating point**. It's a good idea to read through ["What Every Computer Scientist Should Know About Floating-Point Arithmetic"](http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html), as it may demystify your errors and enable you to write more careful code. For example, in neural nets it can be common to normalize the loss function over the batch. However, if your gradients per datapoint are very small, then *additionally* dividing them by the number of data points is starting to give very small numbers, which in turn will lead to more numerical issues. This is why I like to always print the raw numerical/analytic gradient, and make sure that the numbers you are comparing are not extremely small (e.g. roughly 1e-10 and smaller in absolute value is worrying). 
If they are you may want to temporarily scale your loss function up by a constant to bring them to a "nicer" range where floats are more dense - ideally on the order of 1.0, where your float exponent is 0. - -**Kinks in the objective**. One source of inaccuracy to be aware of during gradient checking is the problem of *kinks*. Kinks refer to non-differentiable parts of an objective function, introduced by functions such as ReLU ($max(0,x)$), or the SVM loss, Maxout neurons, etc. Consider gradient checking the ReLU function at $x = -1e6$. Since $x < 0$, the analytic gradient at this point is exactly zero. However, the numerical gradient would suddenly compute a non-zero gradient because $f(x+h)$ might cross over the kink (e.g. if $h > 1e-6$) and introduce a non-zero contribution. You might think that this is a pathological case, but in fact this case can be very common. For example, an SVM for CIFAR-10 contains up to 450,000 $max(0,x)$ terms because there are 50,000 examples and each example yields 9 terms to the objective. Moreover, a Neural Network with an SVM classifier will contain many more kinks due to ReLUs. - -Note that it is possible to know if a kink was crossed in the evaluation of the loss. This can be done by keeping track of the identities of all "winners" in a function of form $max(x,y)$; That is, was x or y higher during the forward pass. If the identity of at least one winner changes when evaluating $f(x+h)$ and then $f(x-h)$, then a kink was crossed and the numerical gradient will not be exact. - -**Use only few datapoints**. One fix to the above problem of kinks is to use fewer datapoints, since loss functions that contain kinks (e.g. due to use of ReLUs or margin losses etc.) will have fewer kinks with fewer datapoints, so it is less likely for you to cross one when you perform the finite different approximation. Moreover, if your gradcheck for only ~2 or 3 datapoints then you would almost certainly gradcheck for an entire batch. Using very few datapoints also makes your gradient check faster and more efficient. - -**Be careful with the step size h**. It is not necessarily the case that smaller is better, because when $h$ is much smaller, you may start running into numerical precision problems. Sometimes when the gradient doesn't check, it is possible that you change $h$ to be 1e-4 or 1e-6 and suddenly the gradient will be correct. This [wikipedia article](http://en.wikipedia.org/wiki/Numerical_differentiation) contains a chart that plots the value of **h** on the x-axis and the numerical gradient error on the y-axis. - -**Gradcheck during a "characteristic" mode of operation**. It is important to realize that a gradient check is performed at a particular (and usually random), single point in the space of parameters. Even if the gradient check succeeds at that point, it is not immediately certain that the gradient is correctly implemented globally. Additionally, a random initialization might not be the most "characteristic" point in the space of parameters and may in fact introduce pathological situations where the gradient seems to be correctly implemented but isn't. For instance, an SVM with very small weight initialization will assign almost exactly zero scores to all datapoints and the gradients will exhibit a particular pattern across all datapoints. An incorrect implementation of the gradient could still produce this pattern and not generalize to a more characteristic mode of operation where some scores are larger than others. 
Therefore, to be safe it is best to use a short **burn-in** time during which the network is allowed to learn and perform the gradient check after the loss starts to go down. The danger of performing it at the first iteration is that this could introduce pathological edge cases and mask an incorrect implementation of the gradient. - -**Don't let the regularization overwhelm the data**. It is often the case that a loss function is a sum of the data loss and the regularization loss (e.g. L2 penalty on weights). One danger to be aware of is that the regularization loss may overwhelm the data loss, in which case the gradients will be primarily coming from the regularization term (which usually has a much simpler gradient expression). This can mask an incorrect implementation of the data loss gradient. Therefore, it is recommended to turn off regularization and check the data loss alone first, and then the regularization term second and independently. One way to perform the latter is to hack the code to remove the data loss contribution. Another way is to increase the regularization strength so as to ensure that its effect is non-negligible in the gradient check, and that an incorrect implementation would be spotted. - -**Remember to turn off dropout/augmentations**. When performing gradient check, remember to turn off any non-deterministic effects in the network, such as dropout, random data augmentations, etc. Otherwise these can clearly introduce huge errors when estimating the numerical gradient. The downside of turning off these effects is that you wouldn't be gradient checking them (e.g. it might be that dropout isn't backpropagated correctly). Therefore, a better solution might be to force a particular random seed before evaluating both $f(x+h)$ and $f(x-h)$, and when evaluating the analytic gradient. - -**Check only few dimensions**. In practice the gradients can have sizes of million parameters. In these cases it is only practical to check some of the dimensions of the gradient and assume that the others are correct. **Be careful**: One issue to be careful with is to make sure to gradient check a few dimensions for every separate parameter. In some applications, people combine the parameters into a single large parameter vector for convenience. In these cases, for example, the biases could only take up a tiny number of parameters from the whole vector, so it is important to not sample at random but to take this into account and check that all parameters receive the correct gradients. - - -### Before learning: sanity checks Tips/Tricks - -Here are a few sanity checks you might consider running before you plunge into expensive optimization: - -- **Look for correct loss at chance performance.** Make sure you're getting the loss you expect when you initialize with small parameters. It's best to first check the data loss alone (so set regularization strength to zero). For example, for CIFAR-10 with a Softmax classifier we would expect the initial loss to be 2.302, because we expect a diffuse probability of 0.1 for each class (since there are 10 classes), and Softmax loss is the negative log probability of the correct class so: -ln(0.1) = 2.302. For The Weston Watkins SVM, we expect all desired margins to be violated (since all scores are approximately zero), and hence expect a loss of 9 (since margin is 1 for each wrong class). If you're not seeing these losses there might be issue with initialization. 
-- As a second sanity check, increasing the regularization strength should increase the loss -- **Overfit a tiny subset of data**. Lastly and most importantly, before training on the full dataset try to train on a tiny portion (e.g. 20 examples) of your data and make sure you can achieve zero cost. For this experiment it's also best to set regularization to zero, otherwise this can prevent you from getting zero cost. Unless you pass this sanity check with a small dataset it is not worth proceeding to the full dataset. Note that it may happen that you can overfit very small dataset but still have an incorrect implementation. For instance, if your datapoints' features are random due to some bug, then it will be possible to overfit your small training set but you will never notice any generalization when you fold it your full dataset. - - -### Babysitting the learning process - -There are multiple useful quantities you should monitor during training of a neural network. These plots are the window into the training process and should be utilized to get intuitions about different hyperparameter settings and how they should be changed for more efficient learning. - -The x-axis of the plots below are always in units of epochs, which measure how many times every example has been seen during training in expectation (e.g. one epoch means that every example has been seen once). It is preferable to track epochs rather than iterations since the number of iterations depends on the arbitrary setting of batch size. - - -#### Loss function - -The first quantity that is useful to track during training is the loss, as it is evaluated on the individual batches during the forward pass. Below is a cartoon diagram showing the loss over time, and especially what the shape might tell you about the learning rate: - -
- - -
- Left: A cartoon depicting the effects of different learning rates. With low learning rates the improvements will be linear. With high learning rates they will start to look more exponential. Higher learning rates will decay the loss faster, but they get stuck at worse values of loss (green line). This is because there is too much "energy" in the optimization and the parameters are bouncing around chaotically, unable to settle in a nice spot in the optimization landscape. Right: An example of a typical loss function over time, while training a small network on CIFAR-10 dataset. This loss function looks reasonable (it might indicate a slightly too small learning rate based on its speed of decay, but it's hard to say), and also indicates that the batch size might be a little too low (since the cost is a little too noisy). -
-
- -The amount of "wiggle" in the loss is related to the batch size. When the batch size is 1, the wiggle will be relatively high. When the batch size is the full dataset, the wiggle will be minimal because every gradient update should be improving the loss function monotonically (unless the learning rate is set too high). - -Some people prefer to plot their loss functions in the log domain. Since learning progress generally takes an exponential form shape, the plot appears more as a slightly more interpretable straight line, rather than a hockey stick. Additionally, if multiple cross-validated models are plotted on the same loss graph, the differences between them become more apparent. - -Sometimes loss functions can look funny [lossfunctions.tumblr.com](http://lossfunctions.tumblr.com/). - - -#### Train/Val accuracy - -The second important quantity to track while training a classifier is the validation/training accuracy. This plot can give you valuable insights into the amount of overfitting in your model: - -
- -
- The gap between the training and validation accuracy indicates the amount of overfitting. Two possible cases are shown in the diagram on the left. The blue validation error curve shows very small validation accuracy compared to the training accuracy, indicating strong overfitting (note, it's possible for the validation accuracy to even start to go down after some point). When you see this in practice you probably want to increase regularization (stronger L2 weight penalty, more dropout, etc.) or collect more data. The other possible case is when the validation accuracy tracks the training accuracy fairly well. This case indicates that your model capacity is not high enough: make the model larger by increasing the number of parameters. -
-
-
- - -#### Ratio of weights:updates - -The last quantity you might want to track is the ratio of the update magnitudes to to the value magnitudes. Note: *updates*, not the raw gradients (e.g. in vanilla sgd this would be the gradient multiplied by the learning rate). You might want to evaluate and track this ratio for every set of parameters independently. A rough heuristic is that this ratio should be somewhere around 1e-3. If it is lower than this then the learning rate might be too low. If it is higher then the learning rate is likely too high. Here is a specific example: - -~~~python -# assume parameter vector W and its gradient vector dW -param_scale = np.linalg.norm(W.ravel()) -update = -learning_rate*dW # simple SGD update -update_scale = np.linalg.norm(update.ravel()) -W += update # the actual update -print update_scale / param_scale # want ~1e-3 -~~~ - -Instead of tracking the min or the max, some people prefer to compute and track the norm of the gradients and their updates instead. These metrics are usually correlated and often give approximately the same results. - - -#### Activation / Gradient distributions per layer - -An incorrect initialization can slow down or even completely stall the learning process. Luckily, this issue can be diagnosed relatively easily. One way to do so is to plot activation/gradient histograms for all layers of the network. Intuitively, it is not a good sign to see any strange distributions - e.g. with tanh neurons we would like to see a distribution of neuron activations between the full range of [-1,1], instead of seeing all neurons outputting zero, or all neurons being completely saturated at either -1 or 1. - - - -#### First-layer Visualizations - -Lastly, when one is working with image pixels it can be helpful and satisfying to plot the first-layer features visually: - -
- - -
- Examples of visualized weights for the first layer of a neural network. Left: Noisy features indicate could be a symptom: Unconverged network, improperly set learning rate, very low weight regularization penalty. Right: Nice, smooth, clean and diverse features are a good indication that the training is proceeding well. -
-
- - -### Parameter updates - -Once the analytic gradient is computed with backpropagation, the gradients are used to perform a parameter update. There are several approaches for performing the update, which we discuss next. - -We note that optimization for deep networks is currently a very active area of research. In this section we highlight some established and common techniques you may see in practice, briefly describe their intuition, but leave a detailed analysis outside of the scope of the class. We provide some further pointers for an interested reader. - - -#### SGD and bells and whistles - -**Vanilla update**. The simplest form of update is to change the parameters along the negative gradient direction (since the gradient indicates the direction of increase, but we usually wish to minimize a loss function). Assuming a vector of parameters `x` and the gradient `dx`, the simplest update has the form: - -~~~python -# Vanilla update -x += - learning_rate * dx -~~~ - -where `learning_rate` is a hyperparameter - a fixed constant. When evaluated on the full dataset, and when the learning rate is low enough, this is guaranteed to make non-negative progress on the loss function. - -**Momentum update** is another approach that almost always enjoys better converge rates on deep networks. This update can be motivated from a physical perspective of the optimization problem. In particular, the loss can be interpreted as a the height of a hilly terrain (and therefore also to the potential energy since $U = mgh$ and therefore $ U \propto h $ ). Initializing the parameters with random numbers is equivalent to setting a particle with zero initial velocity at some location. The optimization process can then be seen as equivalent to the process of simulating the parameter vector (i.e. a particle) as rolling on the landscape. - -Since the force on the particle is related to the gradient of potential energy (i.e. $F = - \nabla U $ ), the **force** felt by the particle is precisely the (negative) **gradient** of the loss function. Moreover, $F = ma $ so the (negative) gradient is in this view proportional to the acceleration of the particle. Note that this is different from the SGD update shown above, where the gradient directly integrates the position. Instead, the physics view suggests an update in which the gradient only directly influences the velocity, which in turn has an effect on the position: - -~~~python -# Momentum update -v = mu * v - learning_rate * dx # integrate velocity -x += v # integrate position -~~~ - -Here we see an introduction of a `v` variable that is initialized at zero, and an additional hyperparameter (`mu`). As an unfortunate misnomer, this variable is in optimization referred to as *momentum* (its typical value is about 0.9), but its physical meaning is more consistent with the coefficient of friction. Effectively, this variable damps the velocity and reduces the kinetic energy of the system, or otherwise the particle would never come to a stop at the bottom of a hill. When cross-validated, this parameter is usually set to values such as [0.5, 0.9, 0.95, 0.99]. Similar to annealing schedules for learning rates (discussed later, below), optimization can sometimes benefit a little from momentum schedules, where the momentum is increased in later stages of learning. A typical setting is to start with momentum of about 0.5 and anneal it to 0.99 or so over multiple epochs. 
- -> With Momentum update, the parameter vector will build up velocity in any direction that has consistent gradient. - -**Nesterov Momentum** is a slightly different version of the momentum update has recently been gaining popularity. It enjoys stronger theoretical converge guarantees for convex functions and in practice it also consistenly works slightly better than standard momentum. - -The core idea behind Nesterov momentum is that when the current parameter vector is at some position `x`, then looking at the momentum update above, we know that the momentum term alone (i.e. ignoring the second term with the gradient) is about to nudge the parameter vector by `mu * v`. Therefore, if we are about to compute the gradient, we can treat the future approximate position `x + mu * v` as a "lookahead" - this is a point in the vicinity of where we are soon going to end up. Hence, it makes sense to compute the gradient at `x + mu * v` instead of at the "old/stale" position `x`. - -
- -
- Nesterov momentum. Instead of evaluating gradient at the current position (red circle), we know that our momentum is about to carry us to the tip of the green arrow. With Nesterov momentum we therefore instead evaluate the gradient at this "looked-ahead" position. -
-
- -That is, in a slightly awkward notation, we would like to do the following: - -~~~python -x_ahead = x + mu * v -# evaluate dx_ahead (the gradient at x_ahead instead of at x) -v = mu * v - learning_rate * dx_ahead -x += v -~~~ - -However, in practice people prefer to express the update to look as similar to vanilla SGD or to the previous momentum update as possible. This is possible to achieve by manipulating the update above with a variable transform `x_ahead = x + mu * v`, and then expressing the update in terms of `x_ahead` instead of `x`. That is, the parameter vector we are actually storing is always the ahead version. The equations in terms of `x_ahead` (but renaming it back to `x`) then become: - -~~~python -v_prev = v # back this up -v = mu * v - learning_rate * dx # velocity update stays the same -x += -mu * v_prev + (1 + mu) * v # position update changes form -~~~ - -We recommend this further reading to understand the source of these equations and the mathematical formulation of Nesterov's Accelerated Momentum (NAG): - -- [Advances in optimizing Recurrent Networks](http://arxiv.org/pdf/1212.0901v2.pdf) by Yoshua Bengio, Section 3.5. -- [Ilya Sutskever's thesis](http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf) (pdf) contains a longer exposition of the topic in section 7.2 - - - -#### Annealing the learning rate - -In training deep networks, it is usually helpful to anneal the learning rate over time. Good intuition to have in mind is that with a high learning rate, the system contains too much kinetic energy and the parameter vector bounces around chaotically, unable to settle down into deeper, but narrower parts of the loss function. Knowing when to decay the learning rate can be tricky: Decay it slowly and you'll be wasting computation bouncing around chaotically with little improvement for a long time. But decay it too aggressively and the system will cool too quickly, unable to reach the best position it can. There are three common types of implementing the learning rate decay: - -- **Step decay**: Reduce the learning rate by some factor every few epochs. Typical values might be reducing the learning rate by a half every 5 epochs, or by 0.1 every 20 epochs. These numbers depend heavily on the type of problem and the model. One heuristic you may see in practice is to watch the validation error while training with a fixed learning rate, and reduce the learning rate by a constant (e.g. 0.5) whenever the validation error stops improving. -- **Exponential decay.** has the mathematical form $\alpha = \alpha_0 e^{-k t}$, where $\alpha_0, k$ are hyperparameters and $t$ is the iteration number (but you can also use units of epochs). -- **1/t decay** has the mathematical form $\alpha = \alpha_0 / (1 + k t )$ where $a_0, k$ are hyperparameters and $t$ is the iteration number. - -In practice, we find that the step decay dropout is slightly preferable because the hyperparameters it involves (the fraction of decay and the step timings in units of epochs) are more interpretable than the hyperparameter $k$. Lastly, if you can afford the computational budget, err on the side of slower decay and train for a longer time. 
- - -#### Second order methods - -A second, popular group of methods for optimization in context of deep learning is based on [Newton's method](http://en.wikipedia.org/wiki/Newton%27s_method_in_optimization), which iterates the following update: - -$$ -x \leftarrow x - [H f(x)]^{-1} \nabla f(x) -$$ - -Here, $H f(x)$ is the [Hessian matrix](http://en.wikipedia.org/wiki/Hessian_matrix), which is a square matrix of second-order partial derivatives of the function. The term $\nabla f(x)$ is the gradient vector, as seen in Gradient Descent. Intuitively, the Hessian describes the local curvature of the loss function, which allows us to perform a more efficient update. In particular, multiplying by the inverse Hessian leads the optimization to take more aggressive steps in directions of shallow curvature and shorter steps in directions of steep curvature. Note, crucially, the absence of any learning rate hyperparameters in the update formula, which the proponents of these methods cite this as a large advantage over first-order methods. - -However, the update above is impractical for most deep learning applications because computing (and inverting) the Hessian in its explicit form is a very costly process in both space and time. For instance, a Neural Network with one million parameters would have a Hessian matrix of size [1,000,000 x 1,000,000], occupying approximately 3725 gigabytes of RAM. Hence, a large variety of *quasi-Newton* methods have been developed that seek to approximate the inverse Hessian. Among these, the most popular is [L-BFGS](http://en.wikipedia.org/wiki/Limited-memory_BFGS), which uses the information in the gradients over time to form the approximation implicitly (i.e. the full matrix is never computed). - -However, even after we eliminate the memory concerns, a large downside of a naive application of L-BFGS is that it must be computed over the entire training set, which could contain millions of examples. Unlike mini-batch SGD, getting L-BFGS to work on mini-batches is more tricky and an active area of research. - -**In practice**, it is currently not common to see L-BFGS or similar second-order methods applied to large-scale Deep Learning and Convolutional Neural Networks. Instead, SGD variants based on (Nesterov's) momentum are more standard because they are simpler and scale more easily. - -Additional references: - -- [Large Scale Distributed Deep Networks](http://research.google.com/archive/large_deep_networks_nips2012.html) is a paper from the Google Brain team, comparing L-BFGS and SGD variants in large-scale distributed optimization. -- [SFO](http://arxiv.org/abs/1311.2115) algorithm strives to combine the advantages of SGD with advantages of L-BFGS. - - -#### Per-parameter adaptive learning rate methods - -All previous approaches we've discussed so far manipulated the learning rate globally and equally for all parameters. Tuning the learning rates is an expensive process, so much work has gone into devising methods that can adaptively tune the learning rates, and even do so per parameter. Many of these methods may still require other hyperparameter settings, but the argument is that they are well-behaved for a broader range of hyperparameter values than the raw learning rate. In this section we highlight some common adaptive methods you may encounter in practice: - -**Adagrad** is an adaptive learning rate method originally proposed by [Duchi et al.](http://jmlr.org/papers/v12/duchi11a.html). 
- -~~~python -# Assume the gradient dx and parameter vector x -cache += dx**2 -x += - learning_rate * dx / (np.sqrt(cache) + eps) -~~~ - -Notice that the variable `cache` has size equal to the size of the gradient, and keeps track of per-parameter sum of squared gradients. This is then used to normalize the parameter update step, element-wise. Notice that the weights that receive high gradients will have their effective learning rate reduced, while weights that receive small or infrequent updates will have their effective learning rate increased. Amusingly, the square root operation turns out to be very important and without it the algorithm performs much worse. The smoothing term `eps` (usually set somewhere in range from 1e-4 to 1e-8) avoids division by zero. A downside of Adagrad is that in case of Deep Learning, the monotonic learning rate usually proves too aggressive and stops learning too early. - -**RMSprop.** RMSprop is a very effective, but currently unpublished adaptive learning rate method. Amusingly, everyone who uses this method in their work currently cites [slide 29 of Lecture 6](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) of Geoff Hinton's Coursera class. The RMSProp update adjusts the Adagrad method in a very simple way in an attempt to reduce its aggressive, monotonically decreasing learning rate. In particular, it uses a moving average of squared gradients instead, giving: - -~~~python -cache = decay_rate * cache + (1 - decay_rate) * dx**2 -x += - learning_rate * dx / (np.sqrt(cache) + eps) -~~~ - -Here, `decay_rate` is a hyperparameter and typical values are [0.9, 0.99, 0.999]. Notice that the `x+=` update is identical to Adagrad, but the `cache` variable is a "leaky". Hence, RMSProp still modulates the learning rate of each weight based on the magnitudes of its gradients, which has a beneficial equalizing effect, but unlike Adagrad the updates do not get monotonically smaller. - -**Adam.** [Adam](http://arxiv.org/abs/1412.6980) is a recently proposed update that looks a bit like RMSProp with momentum. The (simplified) update looks as follows: - -~~~python -m = beta1*m + (1-beta1)*dx -v = beta2*v + (1-beta2)*(dx**2) -x += - learning_rate * m / (np.sqrt(v) + eps) -~~~ - -Notice that the update looks exactly as RMSProp update, except the "smooth" version of the gradient `m` is used instead of the raw (and perhaps noisy) gradient vector `dx`. Recommended values in the paper are `eps = 1e-8`, `beta1 = 0.9`, `beta2 = 0.999`. In practice Adam is currently recommended as the default algorithm to use, and often works slightly better than RMSProp. However, it is often also worth trying SGD+Nesterov Momentum as an alternative. The full Adam update also includes a *bias correction* mechanism, which compensates for the fact that in the first few time steps the vectors `m,v` are both initialized and therefore biased at zero, before they fully "warm up". We refer the reader to the paper for the details, or the course slides where this is expanded on. - -Additional References: - -- [Unit Tests for Stochastic Optimization](http://arxiv.org/abs/1312.6055) proposes a series of tests as a standardized benchmark for stochastic optimization. - -
- - -
- Animations that may help your intuitions about the learning process dynamics. Left: Contours of a loss surface and time evolution of different optimization algorithms. Notice the "overshooting" behavior of momentum-based methods, which make the optimization look like a ball rolling down the hill. Right: A visualization of a saddle point in the optimization landscape, where the curvature along different dimension has different signs (one dimension curves up and another down). Notice that SGD has a very hard time breaking symmetry and gets stuck on the top. Conversely, algorithms such as RMSprop will see very low gradients in the saddle direction. Due to the denominator term in the RMSprop update, this will increase the effective learning rate along this direction, helping RMSProp proceed. Images credit: Alec Radford. -
-
- - -### Hyperparameter optimization - -As we've seen, training Neural Networks can involve many hyperparameter settings. The most common hyperparameters in context of Neural Networks include: - -- the initial learning rate -- learning rate decay schedule (such as the decay constant) -- regularization strength (L2 penalty, dropout strength) - -But as saw, there are many more relatively less sensitive hyperparameters, for example in per-parameter adaptive learning methods, the setting of momentum and its schedule, etc. In this section we describe some additional tips and tricks for performing the hyperparameter search: - -**Implementation**. Larger Neural Networks typically require a long time to train, so performing hyperparameter search can take many days/weeks. It is important to keep this in mind since it influences the design of your code base. One particular design is to have a **worker** that continuously samples random hyperparameters and performs the optimization. During the training, the worker will keep track of the validation performance after every epoch, and writes a model checkpoint (together with miscellaneous training statistics such as the loss over time) to a file, preferably on a shared file system. It is useful to include the validation performance directly in the filename, so that it is simple to inspect and sort the progress. Then there is a second program which we will call a **master**, which launches or kills workers across a computing cluster, and may additionally inspect the checkpoints written by workers and plot their training statistics, etc. - -**Prefer one validation fold to cross-validation**. In most cases a single validation set of respectable size substantially simplifies the code base, without the need for cross-validation with multiple folds. You'll hear people say they "cross-validated" a parameter, but many times it is assumed that they still only used a single validation set. - -**Hyperparameter ranges**. Search for hyperparameters on log scale. For example, a typical sampling of the learning rate would look as follows: `learning_rate = 10 ** uniform(-6, 1)`. That is, we are generating a random random with a uniform distribution, but then raising it to the power of 10. The same strategy should be used for the regularization strength. Intuitively, this is because learning rate and regularization strength have multiplicative effects on the training dynamics. For example, a fixed change of adding 0.01 to a learning rate has huge effects on the dynamics if the learning rate is 0.001, but nearly no effect if the learning rate when it is 10. This is because the learning rate multiplies the computed gradient in the update. Therefore, it is much more natural to consider a range of learning rate multiplied or divided by some value, than a range of learning rate added or subtracted to by some value. Some parameters (e.g. dropout) are instead usually searched in the original scale (e.g. `dropout = uniform(0,1)`). - -**Prefer random search to grid search**. As argued by Bergstra and Bengio in [Random Search for Hyper-Parameter Optimization](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf), "randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid". As it turns out, this is also usually easier to implement. - -
- -
- Core illustration from Random Search for Hyper-Parameter Optimization by Bergstra and Bengio. It is very often the case that some of the hyperparameters matter much more than others (e.g. top hyperparam vs. left one in this figure). Performing random search rather than grid search allows you to much more precisely discover good values for the important ones. -
-
- -**Careful with best values on border**. Sometimes it can happen that you're searching for a hyperparameter (e.g. learning rate) in a bad range. For example, suppose we use `learning_rate = 10 ** uniform(-6, 1)`. Once we receive the results, it is important to double check that the final learning rate is not at the edge of this interval, or otherwise you may be missing more optimal hyperparameter setting beyond the interval. - -**Stage your search from coarse to fine**. In practice, it can be helpful to first search in coarse ranges (e.g. 10 ** [-6, 1]), and then depending on where the best results are turning up, narrow the range. Also, it can be helpful to perform the initial coarse search while only training for 1 epoch or even less, because many hyperparameter settings can lead the model to not learn at all, or immediately explode with infinite cost. The second stage could then perform a narrower search with 5 epochs, and the last stage could perform a detailed search in the final range for many more epochs (for example). - -**Bayesian Hyperparameter Optimization** is a whole area of research devoted to coming up with algorithms that try to more efficiently navigate the space of hyperparameters. The core idea is to appropriately balance the exploration - exploitation trade-off when querying the performance at different hyperparameters. Multiple libraries have been developed based on these models as well, among some of the better known ones are [Spearmint](https://github.com/JasperSnoek/spearmint), [SMAC](http://www.cs.ubc.ca/labs/beta/Projects/SMAC/), and [Hyperopt](http://jaberg.github.io/hyperopt/). However, in practical settings with ConvNets it is still relatively difficult to beat random search in a carefully-chosen intervals. See some additional from-the-trenches discussion [here](http://nlpers.blogspot.com/2014/10/hyperparameter-search-bayesian.html). - - -## Evaluation - - -### Model Ensembles - -In practice, one reliable approach to improving the performance of Neural Networks by a few percent is to train multiple independent models, and at test time average their predictions. As the number of models in the ensemble increases, the performance typically monotonically improves (though with diminishing returns). Moreover, the improvements are more dramatic with higher model variety in the ensemble. There are a few approaches to forming an ensemble: - -- **Same model, different initializations**. Use cross-validation to determine the best hyperparameters, then train multiple models with the best set of hyperparameters but with different random initialization. The danger with this approach is that the variety is only due to initialization. -- **Top models discovered during cross-validation**. Use cross-validation to determine the best hyperparameters, then pick the top few (e.g. 10) models to form the ensemble. This improves the variety of the ensemble but has the danger of including suboptimal models. In practice, this can be easier to perform since it doesn't require additional retraining of models after cross-validation -- **Different checkpoints of a single model**. If training is very expensive, some people have had limited success in taking different checkpoints of a single network over time (for example after every epoch) and using those to form an ensemble. Clearly, this suffers from some lack of variety, but can still work reasonably well in practice. The advantage of this approach is that is very cheap. -- **Running average of parameters during training**. 
Related to the last point, a cheap way of almost always getting an extra percent or two of performance is to maintain a second copy of the network's weights in memory that maintains an exponentially decaying sum of previous weights during training. This way you're averaging the state of the network over the last several iterations. You will find that this "smoothed" version of the weights over the last few steps almost always achieves better validation error (a minimal sketch follows below). The rough intuition to have in mind is that the objective is bowl-shaped and your network is jumping around the mode, so the average has a higher chance of being somewhere nearer the mode.

One disadvantage of model ensembles is that they take longer to evaluate on test examples. An interested reader may find the recent work from Geoff Hinton on ["Dark Knowledge"](https://www.youtube.com/watch?v=EK61htlw8hY) inspiring, where the idea is to "distill" a good ensemble back to a single model by incorporating the ensemble log likelihoods into a modified objective.

## Summary

To train a Neural Network:

- Gradient check your implementation with a small batch of data and be aware of the pitfalls.
- As a sanity check, make sure your initial loss is reasonable, and that you can achieve 100% training accuracy on a very small portion of the data.
- During training, monitor the loss, the training/validation accuracy, and if you're feeling fancier, the magnitude of updates in relation to parameter values (it should be ~1e-3), and when dealing with ConvNets, the first-layer weights.
- The two recommended updates to use are either SGD+Nesterov Momentum or Adam.
- Decay your learning rate over the period of the training. For example, halve the learning rate after a fixed number of epochs, or whenever the validation accuracy tops off.
- Search for good hyperparameters with random search (not grid search). Stage your search from coarse (wide hyperparameter ranges, training only for 1-5 epochs), to fine (narrower ranges, training for many more epochs).
- Form model ensembles for extra performance.

## Additional References

- [SGD](http://research.microsoft.com/pubs/192769/tricks-2012.pdf) tips and tricks from Leon Bottou
- [Efficient BackProp](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf) (pdf) from Yann LeCun
- [Practical Recommendations for Gradient-Based Training of Deep
Architectures](http://arxiv.org/pdf/1206.5533v2.pdf) from Yoshua Bengio
\ No newline at end of file

From 8734d3cc89f7d59e5084de877fdf5ae3a1b0258c Mon Sep 17 00:00:00 2001
From: Myungsub Choi
Date: Tue, 10 May 2016 02:48:07 +0900
Subject: [PATCH 108/199] Update optimization-1.md

---
 optimization-1.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/optimization-1.md b/optimization-1.md
index 1348f40e..fafaffeb 100644
--- a/optimization-1.md
+++ b/optimization-1.md
@@ -355,5 +355,5 @@ while True:

 ---

-번역: 임준구 (stats2ml) +번역: stats2ml
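The staged random search described in the neural-networks-3.md material above can be made concrete with a minimal sketch. Everything below is illustrative rather than part of any patch: `train_and_eval` is a hypothetical stand-in for a short training run, and the ranges are the example ones from the notes.

~~~python
import random

def train_and_eval(lr, reg, epochs):
    # hypothetical stand-in: train briefly with these hyperparameters
    # and return the resulting validation accuracy
    return random.random()

# coarse stage: sample hyperparameters log-uniformly, train ~1 epoch each
results = []
for _ in range(100):
    lr = 10 ** random.uniform(-6, 1)   # learning_rate = 10 ** uniform(-6, 1)
    reg = 10 ** random.uniform(-5, 5)  # regularization strength, also on a log scale
    results.append((train_and_eval(lr, reg, epochs=1), lr, reg))

best_acc, best_lr, best_reg = max(results)
# careful with best values on border: if the winner sits at an edge of the
# sampled interval, the true optimum may lie outside it
if best_lr <= 10 ** -5 or best_lr >= 10 ** 0:
    print('best learning rate near the interval edge; widen the range')
# a finer stage would then narrow the ranges around best_lr / best_reg
# and train for more epochs
~~~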
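Likewise, the running average of parameters mentioned above is only a few lines in practice. A minimal numpy sketch, assuming `W` is a weight array; the toy gradient and the decay constant are placeholders for illustration.

~~~python
import numpy as np

np.random.seed(0)
W = 0.01 * np.random.randn(10, 10)  # weights being trained
W_smooth = W.copy()                 # second copy holding the running average
decay = 0.999                       # how much history the average keeps

for step in range(1000):
    dW = np.random.randn(*W.shape)  # placeholder for a real gradient
    W += -1e-3 * dW                 # ordinary parameter update
    # exponentially decaying running average of the weights
    W_smooth = decay * W_smooth + (1 - decay) * W

# validation/test-time forward passes would use W_smooth instead of W
~~~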

From dc142dc437866163459dfb5197133b5c10b975ea Mon Sep 17 00:00:00 2001 From: JK Im Date: Mon, 9 May 2016 13:35:12 -0500 Subject: [PATCH 109/199] Update index.html MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit optimization-1 부분 소제목/키워드 업데이트. --- index.html | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/index.html b/index.html index 73335ec6..2142555b 100644 --- a/index.html +++ b/index.html @@ -113,10 +113,10 @@
- 최적화: Stochastic Gradient Descent + 최적화: 확률 그라디언트 하강(Stochastic Gradient Descent)
- 최적화 공간, 국소 탐색(local search), learning rate, analytic/numerical 그라디언트 + '지형'으로서의 최적화 목적 함수 (optimization landscapes), 국소 탐색(local search), 학습 속도(learning rate), 해석적(analytic)/수치적(numerical) 그라디언트
From c6971c3376158de0b552e76359b10b3734bd47c8 Mon Sep 17 00:00:00 2001 From: Myungsub Choi Date: Wed, 11 May 2016 10:37:54 +0900 Subject: [PATCH 110/199] Add acknowledgement --- neural-networks-3.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/neural-networks-3.md b/neural-networks-3.md index 4b579e98..0d69991a 100644 --- a/neural-networks-3.md +++ b/neural-networks-3.md @@ -387,4 +387,9 @@ To train a Neural Network: - [SGD](http://research.microsoft.com/pubs/192769/tricks-2012.pdf) tips and tricks from Leon Bottou - [Efficient BackProp](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf) (pdf) from Yann LeCun - [Practical Recommendations for Gradient-Based Training of Deep -Architectures](http://arxiv.org/pdf/1206.5533v2.pdf) from Yoshua Bengio \ No newline at end of file +Architectures](http://arxiv.org/pdf/1206.5533v2.pdf) from Yoshua Bengio + +--- +

+번역: 최영근 ygchoistat +

From 0222d65bd4efcad9b4c85a93e42d284c1b088cac Mon Sep 17 00:00:00 2001 From: Myungsub Choi Date: Wed, 11 May 2016 13:17:11 +0900 Subject: [PATCH 111/199] fix typo --- index.html | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/index.html b/index.html index 2142555b..17b8b232 100644 --- a/index.html +++ b/index.html @@ -171,10 +171,10 @@
- 컨볼루션 신경망: 구조, Convolution / Pooling 레이어 + 컨볼루션 신경망: 구조, Convolution / Pooling 레이어들
-들 레이어(층), 공간적 배치, 레이어 패턴, 레이어 사이즈, AlexNet/ZFNet/VGGNet 사례 분석, 계산량에 관한 고려 사항들 + 레이어(층), 공간적 배치, 레이어 패턴, 레이어 사이즈, AlexNet/ZFNet/VGGNet 사례 분석, 계산량에 관한 고려 사항들
From cf3e6bd95faa0c843ed7dbe3d208d839cbea5b9b Mon Sep 17 00:00:00 2001 From: Seo Jonghan Date: Wed, 11 May 2016 22:33:35 +0900 Subject: [PATCH 112/199] Vanila Dropout --- neural-networks-2.kr.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/neural-networks-2.kr.md b/neural-networks-2.kr.md index f2e7ebe5..f4e21a98 100644 --- a/neural-networks-2.kr.md +++ b/neural-networks-2.kr.md @@ -155,7 +155,7 @@ $$
Figure taken from the Dropout paper that illustrates the idea. During training, Dropout can be interpreted as sampling a Neural Network within the full Neural Network, and only updating the parameters of the sampled network based on the input data. (However, the exponential number of possible sampled networks are not independent because they share the parameters.) During testing there is no dropout applied, with the interpretation of evaluating an averaged prediction across the exponentially-sized ensemble of all sub-networks (more about ensembles in the next section).
-Vanilla dropout in an example 3-layer Neural Network would be implemented as follows: +3-레이어 신경망 회로에 적용된 Vanilla dropout 예제를 아래 구현하였다. ~~~python """ Vanilla Dropout: Not recommended implementation (see notes below) """ @@ -184,11 +184,11 @@ def predict(X): out = np.dot(W3, H2) + b3 ~~~ -In the code above, inside the `train_step` function we have performed dropout twice: on the first hidden layer and on the second hidden layer. It is also possible to perform dropout right on the input layer, in which case we would also create a binary mask for the input `X`. The backward pass remains unchanged, but of course has to take into account the generated masks `U1,U2`. +train_step 함수를 보면 첫번째 히든 레이어와 두번째 히든레이어 총 2 부분에서 dropout이 적용된 것을 볼 수 있다. 물론 입력 데이터 `X`를 위한 p=0.5 마스크를 만들어 입력 단에도 dropout을 적용할 수 있다. 역전파(backward pass) 과정에서는 forward에서 사용된 `U1, U2`를 사용하여 수행한다. -Crucially, note that in the `predict` function we are not dropping anymore, but we are performing a scaling of both hidden layer outputs by $p$. This is important because at test time all neurons see all their inputs, so we want the outputs of neurons at test time to be identical to their expected outputs at training time. For example, in case of $p = 0.5$, the neurons must halve their outputs at test time to have the same output as they had during training time (in expectation). To see this, consider an output of a neuron $x$ (before dropout). With dropout, the expected output from this neuron will become $px + (1-p)0$, because the neuron's output will be set to zero with probability $1-p$. At test time, when we keep the neuron always active, we must adjust $x \rightarrow px$ to keep the same expected output. It can also be shown that performing this attenuation at test time can be related to the process of iterating over all the possible binary masks (and therefore all the exponentially many sub-networks) and computing their ensemble prediction. +`predict` 함수을 보면 dropout을 적용하지 않았지만 히든 레이어 출력 데이터에 $$p$$ 만큼 스케일링 한 것을 주목할 필요가 있다. 테스트 과정에서 모든 뉴런은 모든 입력 데이터를 받기 때문에 학습 과정에서 얻을 수 있는 출력값과 동일한 조건으로 맞추어 보정해야한다. dropout 확률 $$p = 0.5$$ 인 경우를 가정해 보자. 테스트 과정 동안 뉴런의 출력 값은 모두 1/2만큼 줄어들어야 하는데 이는 학습 과정 동안 뉴런 출력 데이터의 기대값과 동일하게 맞추기 위함이다. 뉴런 $$x$$가 있을때 dropout 적용하지 않은 출력 데이터가 있다고 가정하자. dropout을 적용하면 이 뉴런에서의 기대값은 $$px + (1-p)0$$가 되는데 이는 $$1-p$$의 확률로 뉴런의 출력 데이터 값이 0이 되기 때문이다. 테스트 과정에서는 모든 뉴런을 사용하기 때문에 동일한 기대값을 갖기 위해서는 $$x \rightarrow px$$로 보정해 주어야 한다. 또 다른 관점에서 보면 $$p$$만큼 값을 줄이는 과정은 모든 가능한 dropout 마스크를 적용한 후 그 결과를 이용하여 ensemble prediction을 수행하는 것으로 해석 할 수 있다. -The undesirable property of the scheme presented above is that we must scale the activations by $p$ at test time. Since test-time performance is so critical, it is always preferable to use **inverted dropout**, which performs the scaling at train time, leaving the forward pass at test time untouched. Additionally, this has the appealing property that the prediction code can remain untouched when you decide to tweak where you apply dropout, or if at all. Inverted dropout looks as follows: +위에서 소개한 방법은 테스트 과정에서 뉴런 출력에 $$p$$를 곱하는 연산이 수행해야 하는데 이는 원하지 않는 방식인 경우가 많다. 테스트 과정에서의 성능은 매우 중요한 이슈이기 때문에 많은 경우에 **inverted droptou** 방식이 더 선호된다. 이는 스케일링 연산을 학습 과정에서 적용하고 테스트 과정에서는 추가적인 스케일링 연산없이 바로 사용하는 방식이다. 이 기법의 또 다른 장점은 만약 dropout을 수정하기로 했을때 prediction 코드에는 여전히 변화가 없다는 것이다. Inverted dropout은 다음과 같이 구현할 수 있다. 
~~~python
"""

From 431ba46ea6b7d30f938b976138eceeed5bc0b895 Mon Sep 17 00:00:00 2001
From: kirumang
Date: Wed, 11 May 2016 22:44:53 +0900
Subject: [PATCH 113/199] first_commit_during_translate

---
 optimization-2.md | 49 +++++++++++++++++++++++++----------------------
 1 file changed, 26 insertions(+), 23 deletions(-)

diff --git a/optimization-2.md b/optimization-2.md
index 4201517f..4dd25070 100644
--- a/optimization-2.md
+++ b/optimization-2.md
@@ -5,64 +5,67 @@ permalink: /optimization-2/

Table of Contents:

-- [Introduction](#intro)
-- [Simple expressions, interpreting the gradient](#grad)
-- [Compound expressions, chain rule, backpropagation](#backprop)
-- [Intuitive understanding of backpropagation](#intuitive)
-- [Modularity: Sigmoid example](#sigmoid)
-- [Backprop in practice: Staged computation](#staged)
-- [Patterns in backward flow](#patters)
-- [Gradients for vectorized operations](#mat)
-- [Summary](#summary)
+- [소개(Introduction)](#intro)
+- [그라디언트(Gradient)에 대한 간단한 표현과 이해](#grad)
+- [복합적인 표현(Compound Expression), 체인룰(Chain Rule), 역전파(Backpropagation)](#backprop)
+- [역전파(Backpropagation)에 대한 직관적인 이해](#intuitive)
+- [모듈성: 시그모이드(Sigmoid) 예제](#sigmoid)
+- [역전파(Backprop) 실제: 단계별 계산](#staged)
+- [역방향 흐름의 패턴](#patters)
+- [벡터 기반의 그라디언트(Gradient) 계산](#mat)
+- [요약](#summary)
+

### Introduction

-**Motivation**. In this section we will develop expertise with an intuitive understanding of **backpropagation**, which is a way of computing gradients of expressions through recursive application of **chain rule**. Understanding of this process and its subtleties is critical for you to understand, and effectively develop, design and debug Neural Networks.
+**Motivation**. 이번 섹션에서 우리는 **역전파(Backpropagation)**에 대한 직관적인 이해를 바탕으로 전문지식을 더 키우고자 한다. Backpropagation은 Network 전체에 대해 반복적인 **체인룰(Chain rule)**을 적용하여 그라디언트(Gradient)를 계산하는 방법 중 하나이다. Backpropagation 과정과 세부 요소들에 대한 이해는 여러분에게 있어서 Neural Networks를 효과적으로 개발하고, 디자인하고 디버그하는 데 중요하다고 볼 수 있다.

+**Problem statement**. 이번 섹션에서 공부할 핵심 문제는 다음과 같다: 주어진 함수 $$f(x)$$ 가 있고, $$x$$ 는 입력 값으로 이루어진 벡터이고, 주어진 입력 $$x$$에 대해서 함수 $$f$$의 그라디언트를 계산하고자 한다 (i.e. $$\nabla f(x)$$ ).

-**Problem statement**. The core problem studied in this section is as follows: We are given some function $f(x)$ where $x$ is a vector of inputs and we are interested in computing the gradient of $f$ at $x$ (i.e. $\nabla f(x)$ ).

+**Motivation**. 우리가 이 문제에 관심을 기울이는 이유에 대해 Neural Network 관점에서 좀 더 구체적으로 살펴보자. $$f$$는 Loss 함수 ( $$L$$ ) 에 해당하고 입력 값 $$x$$ 는 학습 데이터(Training data)와 Neural Network의 Weight라고 볼 수 있다. 예를 들면, Loss는 SVM Loss 함수가 될 수 있고, 입력 값은 학습 데이터 $$(x_i,y_i), i=1 \ldots N$$ 와 Weight, Bias $$W,b$$ 으로 볼 수 있다. 여기서 학습 데이터는 미리 주어져서 고정되어 있는 값으로 볼 수 있고 (보통의 기계 학습에서 그러하듯), Weight는 Neural Network의 학습을 위해 실제로 컨트롤하는 값이다. 따라서 입력 값 $$x_i$$ 에 대한 그라디언트 계산이 쉬울지라도, 실제로는 파라미터(Parameter, Neural Network의 Weight) 값에 대한 Gradient를 일반적으로 계산하고, 그 Gradient 값을 활용하여 Parameter를 업데이트한다. 하지만 Neural Network이 어떻게 작동하는지 해석하고 시각화하는 부분에서는 입력 값 $$x_i$$에 대한 Gradient도 유용하게 활용될 수 있는데, 이 부분은 본 강의의 뒷부분에서 다룰 예정이다.

-**Motivation**. Recall that the primary reason we are interested in this problem is that in the specific case of Neural Networks, $f$ will correspond to the loss function ( $L$ ) and the inputs $x$ will consist of the training data and the neural network weights. For example, the loss could be the SVM loss function and the inputs are both the training data $(x_i,y_i), i=1 \ldots N$ and the weights and biases $W,b$.
Note that (as is usually the case in Machine Learning) we think of the training data as given and fixed, and of the weights as variables we have control over. Hence, even though we can easily use backpropagation to compute the gradient on the input examples $x_i$, in practice we usually only compute the gradient for the parameters (e.g. $W,b$) so that we can use it to perform a parameter update. However, as we will see later in the class the gradient on $x_i$ can still be useful sometimes, for example for purposes of visualization and interpreting what the Neural Network might be doing. -If you are coming to this class and you're comfortable with deriving gradients with chain rule, we would still like to encourage you to at least skim this section, since it presents a rarely developed view of backpropagation as backward flow in real-valued circuits and any insights you'll gain may help you throughout the class. +여러분이 이미 Chain Rule을 통해 Gradient를 도출하는데 익숙하더라도 이 섹션을 간략히 훑어보기를 권장한다. 왜냐하면 이 섹션에서는 다른데서는 보기 힘든 Backpropagation에 대한 실제 숫자를 활용한 역방향 흐름(Backward Flow)에 대해 설명을 할 것이고, 이를 통해 여러분이 얻게 될 통찰력은 이번 강의 전체에 있어 도움이 될 것이라 생각하기 때문이다. -### Simple expressions and interpretation of the gradient -Lets start simple so that we can develop the notation and conventions for more complex expressions. Consider a simple multiplication function of two numbers $f(x,y) = x y$. It is a matter of simple calculus to derive the partial derivative for either input: +### 그라디언트(Gradient)에 대한 간단한 표현과 이해 + +복잡한 모델에 대한 수식등을 만들기에 앞서 간단하게 시작을 해보자. x와 y 두 숫자의 곱을 계산하는 간단한 함수 f를 정의하자. $$f(x,y) = x y$$. 각각의 입력 변수에 대한 편미분은 간단한 수학으로 아래와 같이 구해 진다. : $$ f(x,y) = x y \hspace{0.5in} \rightarrow \hspace{0.5in} \frac{\partial f}{\partial x} = y \hspace{0.5in} \frac{\partial f}{\partial y} = x $$ -**Interpretation**. Keep in mind what the derivatives tell you: They indicate the rate of change of a function with respect to that variable surrounding an infinitesimally small region near a particular point: +**Interpretation**. 미분이 여러분에게 시사하는 바를 명심하자 : 미분은 입력 변수 부근의 아주 작은(0에 매우 가까운) 변화에 대한 해당 함수 값의 변화량이다. : $$ \frac{df(x)}{dx} = \lim_{h\ \to 0} \frac{f(x + h) - f(x)}{h} $$ -A technical note is that the division sign on the left-hand sign is, unlike the division sign on the right-hand sign, not a division. Instead, this notation indicates that the operator $ \frac{d}{dx} $ is being applied to the function $f$, and returns a different function (the derivative). A nice way to think about the expression above is that when $h$ is very small, then the function is well-approximated by a straight line, and the derivative is its slope. In other words, the derivative on each variable tells you the sensitivity of the whole expression on its value. For example, if $x = 4, y = -3$ then $f(x,y) = -12$ and the derivative on $x$ $\frac{\partial f}{\partial x} = -3$. This tells us that if we were to increase the value of this variable by a tiny amount, the effect on the whole expression would be to decrease it (due to the negative sign), and by three times that amount. This can be seen by rearranging the above equation ( $ f(x + h) = f(x) + h \frac{df(x)}{dx} $ ). Analogously, since $\frac{\partial f}{\partial y} = 4$, we expect that increasing the value of $y$ by some very small amount $h$ would also increase the output of the function (due to the positive sign), and by $4h$. +위에 수식을 기술적인 관점에서 보면, 왼쪽에 있는 분수 기호(가로바)는 오른쪽 분수 기호와 달리 나누기를 뜻하지는 않는다. 대신 연산자 $$ \frac{d}{dx} $$ 가 함수 $$f$$에 적용 되어 미분 된 함수를 의미 하는 것이다. 
위의 수식을 이해하는 가장 좋은 방법은 $$h$$가 매우 작으면 함수 $$f$$는 직선으로 근사(Approximated) 될 수 있고, 미분 값은 그 직선의 기울기를 뜻한다. 다시말해, 만약 $$x = 4, y = -3$$ 이면 $$f(x,y) = -12$$ 가 되고, $$x$$에 대한 편미분 값은 $$x$$ $$\frac{\partial f}{\partial x} = -3$$ 으로 얻어진다. 이말은 즉슨, 우리가 x를 아주 조금 증가 시키면 전체 함수 값은 3배로 작아진다는 의미이다. (미분 값이 음수이므로). 이 것은 위의 수식을 재구성하면 이와 같이 간단히 보여 줄 수 있다 ( $$ f(x + h) = f(x) + h \frac{df(x)}{dx} $$ ).비슷하게, $$\frac{\partial f}{\partial y} = 4$$, 이므로, $$y$$ 값을 아주 작은 $$h$$ 만큼 증가 시킨다면 $$4h$$ 만큼 전체 함수 값은 증가하게 될 것이다. (이번에는 미분 값이 양수) -> The derivative on each variable tells you the sensitivity of the whole expression on its value. +> 미분은 각 변수가 해당 값에서 전체 함수(Expression)의 결과 값에 영향을 미치는 민감도와 같은 개념이다. -As mentioned, the gradient $\nabla f$ is the vector of partial derivatives, so we have that $\nabla f = [\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}] = [y, x]$. Even though the gradient is technically a vector, we will often use terms such as *"the gradient on x"* instead of the technically correct phrase *"the partial derivative on x"* for simplicity. +앞서 말했듯이, 그라디언트 $$\nabla f$$는 편미분 값들의 벡터이다. 따라서 수식으로 표현하면 다음과 같다: $$\nabla f = [\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}] = [y, x]$$, 그라디언트가 기술적으로 벡터일지라도 심플한 표현을 위해 *"X에 대한 편미분"* 이라는 정확한 표현 대신 *"X에 대한 그라디언트"* 와 같은 표현을 종종 쓰게 될 예정이다. -We can also derive the derivatives for the addition operation: +다음과 같은 수식에 대해서도 미분값(그라디언트)을 한번 구해보자: $$ f(x,y) = x + y \hspace{0.5in} \rightarrow \hspace{0.5in} \frac{\partial f}{\partial x} = 1 \hspace{0.5in} \frac{\partial f}{\partial y} = 1 $$ -that is, the derivative on both $x,y$ is one regardless of what the values of $x,y$ are. This makes sense, since increasing either $x,y$ would increase the output of $f$, and the rate of that increase would be independent of what the actual values of $x,y$ are (unlike the case of multiplication above). The last function we'll use quite a bit in the class is the *max* operation: +위의 수식에서 볼 수 있듯이, $$x,y$$에 대한 미분은 $$x,y$$ 값에 관계 없이 1이다. 당연히, $$x,y$$ 값이 증가하면 $$f$$가 증가하기 때문이다. 그리고 $$f$$ 값의 증가율 또한 $$x,y$$ 값에 관계 없이 일정하다 (앞서 살펴본 곱셈의 경우와 다름). 마지막으로 살펴볼 함수는 우리가 수업에서 자주 다루는 *Max* 함수 이다 : $$ f(x,y) = \max(x, y) \hspace{0.5in} \rightarrow \hspace{0.5in} \frac{\partial f}{\partial x} = \mathbb{1}(x >= y) \hspace{0.5in} \frac{\partial f}{\partial y} = \mathbb{1}(y >= x) $$ -That is, the (sub)gradient is 1 on the input that was larger and 0 on the other input. Intuitively, if the inputs are $x = 4,y = 2$, then the max is 4, and the function is not sensitive to the setting of $y$. That is, if we were to increase it by a tiny amount $h$, the function would keep outputting 4, and therefore the gradient is zero: there is no effect. Of course, if we were to change $y$ by a large amount (e.g. larger than 2), then the value of $f$ would change, but the derivatives tell us nothing about the effect of such large changes on the inputs of a function; They are only informative for tiny, infinitesimally small changes on the inputs, as indicated by the $\lim_{h \rightarrow 0}$ in its definition. - +입력 값이 더 큰 값에 대한 (서브)그라디언트는 1이고, 다른 입력 값의 그라디언트는 0이 된다. 직관적으로 보면, $$x = 4,y = 2$$ 인 경우 max 값은 4 이고, 이 함수는 현재의 $$y$$ 값에 영향을 받지 않는다. 바꾸어말하면, $$y$$값을 아주 작은 값인 $$h$$ 만큼 증가시키더라도 이 함수의 결과 값은 4로 유지된다. 따라서 그라디언트는 0이다 (y값의 영향이 없다). 물론 $$y$$값을 매우 크게 증가 시킨다면 (예를 들면 2이상) 함수 $$f$$ 값은 바뀌겠지만, 미분은 이런 큰 변화 값과는 관련이 없다. 미분이라는 것이 본래 그 정의에도 있듯($$\lim_{h \rightarrow 0}$$) 아주 작은 입력 값 변화에 대해서 의미를 갖는 값이기 때문이다. 
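The derivative claims above are easy to sanity-check numerically with the finite-difference definition. A minimal sketch (an illustrative aside, not part of the original notes), reusing the values $$x = 4, y = -3$$ for the multiply gate and $$x = 4, y = 2$$ for the max gate:

~~~python
def f_mul(x, y):
    return x * y

def f_max(x, y):
    return max(x, y)

h = 1e-6

# multiply gate at x = 4, y = -3: df/dx should be y = -3, df/dy should be x = 4
x, y = 4.0, -3.0
print((f_mul(x + h, y) - f_mul(x, y)) / h)  # ~ -3.0
print((f_mul(x, y + h) - f_mul(x, y)) / h)  # ~  4.0

# max gate at x = 4, y = 2: gradient is 1 on the larger input, 0 on the other
x, y = 4.0, 2.0
print((f_max(x + h, y) - f_max(x, y)) / h)  # ~ 1.0
print((f_max(x, y + h) - f_max(x, y)) / h)  # ~ 0.0
~~~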

### Compound expressions with chain rule

Let's now start to consider more complicated expressions that involve multiple composed functions, such as $f(x,y,z) = (x + y) z$. This expression is still simple enough to differentiate directly, but we'll take a particular approach to it that will be helpful with understanding the intuition behind backpropagation. In particular, note that this expression can be broken down into two expressions: $q = x + y$ and $f = q z$. Moreover, we know how to compute the derivatives of both expressions separately, as seen in the previous section. $f$ is just multiplication of $q$ and $z$, so $\frac{\partial f}{\partial q} = z, \frac{\partial f}{\partial z} = q$, and $q$ is addition of $x$ and $y$ so $ \frac{\partial q}{\partial x} = 1, \frac{\partial q}{\partial y} = 1 $. However, we don't necessarily care about the gradient on the intermediate value $q$ - the value of $\frac{\partial f}{\partial q}$ is not useful. Instead, we are ultimately interested in the gradient of $f$ with respect to its inputs $x,y,z$. The **chain rule** tells us that the correct way to "chain" these gradient expressions together is through multiplication. For example, $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x} $. In practice this is simply a multiplication of the two numbers that hold the two gradients. Let's see this with an example:

From 8182992268b1c86d97f31d8f584e1c0e0fd64 Mon Sep 17 00:00:00 2001
From: Seo Jonghan
Date: Wed, 11 May 2016 23:02:20 +0900
Subject: [PATCH 114/199] fix typo

---
 neural-networks-2.kr.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/neural-networks-2.kr.md b/neural-networks-2.kr.md
index f4e21a98..bce547f3 100644
--- a/neural-networks-2.kr.md
+++ b/neural-networks-2.kr.md
@@ -143,7 +143,7 @@ $$

**L2 regularization**은 가장 일반적으로 사용되는 regularization 기법이다. 모든 파라미터 제곱 만큼의 크기를 목적 함수에 제약을 거는 방식으로 구현된다. 다시 말해, 가중치 벡터 $$w$$가 있을때, 목적 함수에 $$\frac{1}{2} \lambda w^2$$를 더한다 (여기서 $$\lambda$$는 regularization의 강도를 의미). $$\frac{1}{2}$$ 부분이 항상 존재하는데 이는 앞서 본 regularization 값을 $$w$$로 미분했을 때 $$2 \lambda w$$가 아닌 $$ \lambda w$$의 값을 갖도록 하기 위함이다. L2 regularization은 큰 값이 많이 존재하는 가중치에 제약을 주고, 가중치 값을 가능한 널리 퍼지도록 하는 효과를 주는 것으로 볼 수 있다. 선형 분류(Linear Classification) 장에서도 이야기 했던 가중치와 입력 데이터가 곱해지는 연산이므로 특정 몇개의 입력 데이터에 강하게 적용되기 보다는 모든 입력데이터에 약하게 적용되도록 하는 것이 일반적이다. gradient descent 업데이트 과정에서 L2 regularization을 적용하는 것은 모든 가중치 값이 선형적으로 감소하게 된다: `W += -lambda * W`이 0으로 감소하게 된다.

-**L1 regularization** 또한 상대적으로 많이 사용되는 regularization 기법으로 가중치 벡터$$w$$가 있을때, 목적 함수에 $$\lamda \mid w \mid$$를 더한다. 다음과 같이 L1 regularization과 L2 regularization을 동시에 사용할 수도 있다: $$\lambad_1 \mid w \mid + \lamda_2 w^2$$([Elastic net regularization](http://web.stanford.edu/~hastie/Papers/B67.2%20%282005%29%20301-320%20Zou%20&%20Hastie.pdf)라고도 불린다).
+**L1 regularization** 또한 상대적으로 많이 사용되는 regularization 기법으로 가중치 벡터 $$w$$가 있을때, 목적 함수에 $$\lambda \mid w \mid$$를 더한다. 다음과 같이 L1 regularization과 L2 regularization을 동시에 사용할 수도 있다: $$\lambda_1 \mid w \mid + \lambda_2 w^2$$([Elastic net regularization](http://web.stanford.edu/~hastie/Papers/B67.2%20%282005%29%20301-320%20Zou%20&%20Hastie.pdf)라고도 불린다).
L1 regularization은 최적화 과정 동안 가중치 벡터들을 sparse하게(거의 0에 가깝게) 만드는 흥미로운 특성이 있다. 다시 말해, L1 regularization이 적용된 뉴런들은 결국 입력 데이터의 sparse한 부분만을 사용하고, "noisy" 입력 데이터에 거의 영향을 받지 않는다. 이에 반해, L2 regularization을 적용하면 최종 가중치 벡터들은 작은 값들이 퍼져 있는 형태로 나타나게 된다. 실제 신경망 학습에 적용할 때, 만약 특정한 feature selection 후 학습하는 것이 아니라면 많은 경우에 L2 regularization을 사용하면 훨씬 좋은 성능을 기대할 수 있다.

**Max norm constraints**. regularization 기법 중 하나로, 가중치 벡터의 길이가 미리 정해 놓은 상한 값을 넘지 못하도록 제한하면서 gradient descent 연산도 제한된 조건 안에서만 계산하도록 하는 projected gradient descent를 사용한다. 신경망 학습에 실제 적용하는 방법은, 먼저 일반적인 방법으로 파라미터를 업데이트하고, 모든 뉴런의 가중치 벡터 $$\vec{w}$$에 대해서 $$\Vert \vec{w} \Vert_2 < c$$를 만족하도록 제한을 가한다. 일반적으로 c 값은 3 혹은 4로 설정한다. 이 regularization 기법을 적용한 몇몇 연구를 통하여 성능 향상이 있음이 알려졌다. 이 기법의 흥미로운 사실 중 하나는 학습률(learning rate)을 큰 값으로 설정하고 학습시키더라도 신경망이 "explode"하지 않는다는 것인데, 이는 업데이트될 때마다 제한된 범위 내의 값을 갖기 때문이다.

From 7854f151f263c2ed9e1938cf0addc4ae5f446ee Mon Sep 17 00:00:00 2001
From: Seo Jonghan
Date: Thu, 12 May 2016 00:42:37 +0900
Subject: [PATCH 115/199] =?UTF-8?q?=EC=A0=95=EA=B7=9C=ED=99=94=EA=B9=8C?=
 =?UTF-8?q?=EC=A7=80=20=EC=99=84=EB=A3=8C?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 neural-networks-2.kr.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/neural-networks-2.kr.md b/neural-networks-2.kr.md
index bce547f3..55302f8e 100644
--- a/neural-networks-2.kr.md
+++ b/neural-networks-2.kr.md
@@ -218,18 +218,18 @@ def predict(X):
   out = np.dot(W3, H2) + b3
 ~~~

-There has a been a large amount of research after the first introduction of dropout that tries to understand the source of its power in practice, and its relation to the other regularization techniques. Recommended further reading for an interested reader includes:
+dropout이 처음 소개된 이후로 실제 적용 사례에서 나타난 성능 향상의 근본 원인과 기존의 다른 regularization 기법과의 관계 등에 대한 수많은 연구가 진행되었다. 관련하여 다음의 자료들을 읽어보는 것이 도움이 될 것이라 생각된다:

-- [Dropout paper](http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf) by Srivastava et al. 2014.
+- [Dropout 논문](http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf) by Srivastava et al. 2014.
 - [Dropout Training as Adaptive Regularization](http://papers.nips.cc/paper/4882-dropout-training-as-adaptive-regularization.pdf): "we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix".

-**Theme of noise in forward pass**. Dropout falls into a more general category of methods that introduce stochastic behavior in the forward pass of the network. During testing, the noise is marginalized over *analytically* (as is the case with dropout when multiplying by $p$), or *numerically* (e.g. via sampling, by performing several forward passes with different random decisions and then averaging over them). An example of other research in this direction includes [DropConnect](http://cs.nyu.edu/~wanli/dropc/), where a random set of weights is instead set to zero during forward pass. As foreshadowing, Convolutional Neural Networks also take advantage of this theme with methods such as stochastic pooling, fractional pooling, and data augmentation. We will go into details of these methods later.
+**forward pass에서의 노이즈 관련하여**. 넓은 의미에서 보자면 dropout은 신경망의 forward pass에서 stochastic(확률적) 접근을 도입하는 것으로 볼 수 있다. testing 과정에서는 노이즈가 감소하게 되는데, 이는 *분석적 해석*으로는 `확률 $p$ 만큼 곱해진 결과`라고 볼 수 있고, *수치적 해석*으로는 `랜덤하게 선택된 forward pass를 여러 차례 수행한 결과의 평균`이라고 볼 수 있다.
동일한 관점에서의 연구들 중 하나인 [DropConnect](http://cs.nyu.edu/~wanli/dropc/)를 보면 forward pass 동안 가중치 값을 0으로 설정하는 것으로 볼 수 있다. Convolutional 신경망에서 dropout과 함께 stochastic(확률적) 풀링(pooling), 부분 풀링(fractional pooling), 데이터 augmentation 등의 기법을 같이 사용하여 추가적인 성능 향상을 기대할 수 있다. 이에 대해서는 뒤에서 더 자세히 살펴볼 것이다.

-**Bias regularization**. As we already mentioned in the Linear Classification section, it is not common to regularize the bias parameters because they do not interact with the data through multiplicative interactions, and therefore do not have the interpretation of controlling the influence of a data dimension on the final objective. However, in practical applications (and with proper data preprocessing) regularizing the bias rarely leads to significantly worse performance. This is likely because there are very few bias terms compared to all the weights, so the classifier can "afford to" use the biases if it needs them to obtain a better data loss.
+**Bias regularization**. Linear Classification 파트에서 설명했듯이, bias 텀은 regularization을 적용하지 않는 것이 일반적인데, 이는 학습된 가중치와 곱셈 연산을 하지 않기 때문에 목적 함수에서 데이터 dimension을 결정하는 요소로 작용하지 않는다. 그러나 실제 적용 사례들을 보면 bias 텀에 regularization을 적용하였을 때 심각한 성능 저하가 나타나는 경우는 극히 드문 것으로 알려져 있다. 이는 모든 가중치 텀의 개수와 비교했을 때 bias 텀의 개수는 무시할 만한 수준이어서, 더 낮은 data loss를 얻는 데 bias가 필요하다면 분류기가 이를 충분히 "감당해서" 사용할 수 있기 때문으로 보인다.

-**Per-layer regularization**. It is not very common to regularize different layers to different amounts (except perhaps the output layer). Relatively few results regarding this idea have been published in the literature.
+**레이어별 정규화**. 마지막 출력 레이어를 제외하고 레이어를 각각 따로 정규화하는 것은 일반적인 방법이 아니다. 레이어별 정규화를 적용한 논문 수도 상대적으로 매우 적은 편이다.

-**In practice**: It is most common to use a single, global L2 regularization strength that is cross-validated. It is also common to combine this with dropout applied after all layers. The value of $p = 0.5$ is a reasonable default, but this can be tuned on validation data.
+**실전 응용**: 하나의 공통된 L2 정규화를 사용하는 것이 일반적이다. 또한 모든 레이어 이후에 dropout을 적용하는 것 또한 일반적으로 많이 사용된다. dropout rate로는 $$p = 0.5$$가 주로 사용되지만 validation 과정에서 값을 조정하기도 한다.

From f1bf45b16a05d35ae484a60a7245e8f20e4ebd99 Mon Sep 17 00:00:00 2001
From: jung_hojin
Date: Thu, 12 May 2016 03:40:18 +0900
Subject: [PATCH 116/199] Fix autogenerated English captions for lecture 4
 lines 001 ~ 200

Fixes autogenerated captions from lines 1 to 200. Some inaudible lines are
marked with '[??]' for now.
---
 captions/En/Lecture4_en.srt | 2200 +++++++++++++++++------------------
 1 file changed, 1099 insertions(+), 1101 deletions(-)

diff --git a/captions/En/Lecture4_en.srt b/captions/En/Lecture4_en.srt
index 46ec2630..b2eb9a16 100644
--- a/captions/En/Lecture4_en.srt
+++ b/captions/En/Lecture4_en.srt
@@ -1,66 +1,66 @@
1
00:00:02,740 --> 00:00:07,000
Okay, so let me dive into some
administrative

2
00:00:07,000 --> 00:00:14,669
points first. So recall that
assignment 1 is due next Wednesday.

3
00:00:14,669 --> 00:00:19,050
You have about 150 hours left,
and I use hours because there's a more

4
00:00:19,050 --> 00:00:23,320
imminent sense of doom and remember that
a third of those hours you'll be

5
00:00:23,320 --> 00:00:29,278
unconscious, so you don't have that much
time.
It's really running out. And you 6 00:00:29,278 --> 00:00:31,768 -know you might think that you have a -late day Sun so on but these images get +know you might think that you have +late days and so on but these assignments just get 7 00:00:31,768 --> 00:00:38,640 -harder over time so you want to see -those and so so start now likely so +harder over time so you want to save +those and so on, so start now. Let's see. So 8 00:00:38,640 --> 00:00:43,109 -there's no office hours or anything like -that on Monday I'll hold make up office +there's no office hours or anything like +that on Monday. I'll hold make up office 9 00:00:43,109 --> 00:00:45,839 -hours on Wednesday because I want you +hours on Wednesday because I want you guys to be able to talk to me about the 10 00:00:45,840 --> 00:00:49,260 -special projects and so on so I'll be +especially projects and so on, so I'll be moving my office hours from Monday to 11 -00:00:49,259 --> 00:00:52,820 -wednesday usually I had my office starts -at 6 p.m. instead I'll have them at 5 +00:00:49,260 --> 00:00:52,820 +Wednesday. Usually I had my office hours +at 6PM. Instead I'll have them at 5PM 12 00:00:52,820 --> 00:00:59,909 -p.m. and usually think gates 260 but now -be engaged to 39-1 them both and yeah +and usually it's in Gates 260 but now +it'll be in Gates 259, so minus 1 on both and yeah 13 00:00:59,909 --> 00:01:03,429 -and also to note when you're going to be +and also to note, when you're going to be studying for midterm that's coming up in 14 @@ -69,172 +69,172 @@ a few weeks 15 00:01:04,170 --> 00:01:07,109 -make sure you go through the lecture +make sure you go through the lecture notes as well which are really part of 16 00:01:07,109 --> 00:01:09,819 -this class and a kind of pick and choose +this class and I kind of pick and choose some of the things that I think are most 17 00:01:09,819 --> 00:01:13,579 -valuable to present a lecture but -there's quite a bit of a more material +valuable to present in a lecture but +there's quite a bit of, you know, more material 18 00:01:13,579 --> 00:01:16,548 -to beware of that might pop up in the -mid term even though I'm comin some of +to beware of that might pop up in the +midterm, even though I'm covering some of 19 00:01:16,549 --> 00:01:19,610 -the most important stuff usually no -larger than URI through those lecture +the most important stuff usually in the lecture, +so do read through those lecture -20 +20 00:01:19,609 --> 00:01:25,618 -notes their complimentary to the actress -and so the material for the material be +notes, they're complimentary to the lectures. +And so the material for the midterm will be 21 00:01:25,618 --> 00:01:32,269 -drawn from both the lectures and its ok -so having said all that we're going to +drawn from both the lectures and the notes. Okay. +So having said all that, we're going to 22 00:01:32,269 --> 00:01:36,769 -dive into the material so where we are -right now just as a reminder we have +dive into the material. 
So where we are +right now, just as a reminder, we have 23 00:01:36,769 --> 00:01:39,989 -this core function we looked at several -loss functions such as the SP loss +the score function, we looked at several +loss functions such as the SVM loss 24 00:01:39,989 --> 00:01:44,359 -function last time and we look at the -full lost that you achieve for any +function last time, and we look at the +full loss that you achieve for any 25 00:01:44,359 --> 00:01:49,379 -particular set of weights on over your -training data and this loss made up of +particular set of weights on, over your +training data, and this loss is made up of 26 00:01:49,379 --> 00:01:53,509 -two components there's a data loss and -loss right and really what we want to do +two components, there's a data loss and +a regularization loss, right. And really what we want to do 27 00:01:53,509 --> 00:01:57,200 -is we want to do right now the gradient -expression of the loss of respect to the +is we want to derive out the gradient +expression of the loss function with respect to the 28 00:01:57,200 --> 00:02:01,118 -weights and we want to do this so that +weights and we want to do this so that we can actually perform the optimization 29 00:02:01,118 --> 00:02:07,069 -process optimization process we're doing -in dissent where we iterate in a leading +process. And in the optimization process we're doing +gradient descent, where we iterate evaluating 30 00:02:07,069 --> 00:02:11,030 -the gradient on your weights during a -primary update and just repeating this +the gradient on your weights doing a +parameter update and just repeating this 31 00:02:11,030 --> 00:02:14,259 -over and over again so that were +over and over again, so that we're converging to 32 00:02:14,259 --> 00:02:17,929 -the low points on that loss function and -when we arrived at a loss that's +the low points of that loss function and +when we arrive at a low loss, that's 33 00:02:17,930 --> 00:02:20,799 -equivalent to making good predictions -over our training data in terms of this +equivalent to making good predictions +over our training data in terms of the 34 00:02:20,799 --> 00:02:25,030 -course that come out now we also saw -that are too kind of waste evaluate the +scores that come out. Now we also saw +that are two kinds of ways to evaluate the 35 00:02:25,030 --> 00:02:29,019 -gradient there's an American gradient +gradient. 
There's a numerical gradient and this is very easy to write but it's 36 00:02:29,019 --> 00:02:32,840 -extremely slow to evaluate and there's -an elegy gradient which is which you +extremely slow to evaluate, and there's +analytic gradient, which is, which you 37 00:02:32,840 --> 00:02:36,658 -obtained by using calculus and will be +obtain by using calculus and we'll be going into that in this lecture quite a 38 00:02:36,658 --> 00:02:41,318 -bit more and so it's fast exact which is -great but it's not you can get it wrong +bit more and so it's fast, exact, which is +great, but it's not, you can get it wrong 39 00:02:41,318 --> 00:02:45,969 -sometimes and so we always the following -week already in check where we write all +sometimes, and so we always what we call +gradient check, where we write all 40 00:02:45,969 --> 00:02:48,639 -the expressions to complete the analytic -gradients and then we double check its +the expressions to compute the analytic +gradients, and then we double check its 41 00:02:48,639 --> 00:02:51,828 -correctness with numerical gradient and +correctness with numerical gradient and so I'm not sure if you're going to see 42 00:02:51,829 --> 00:02:59,250 -that you're going to see that definitely -assignments ok so now you might be +that, you're going to see that definitely the +assignments. Okay, so, now you might be 43 00:02:59,250 --> 00:03:04,378 -tempted to when you see the setup we -just want to drive the gradient of the +tempted to, when you see this setup, we +just want to derive the gradient of the 44 00:03:04,378 --> 00:03:08,459 -loss function back to the weights you -might be tempted to just you know right +loss function with respect to the weights. You +might be tempted to just, you know, right 45 00:03:08,459 --> 00:03:11,709 -out the full loss and just start to take -the gradients as you seen your calculus +out the full loss and just start to take +the gradients as you see your calculus 46 00:03:11,709 --> 00:03:16,120 -class but I'd like to make is that you +class, but the point I'd like to make is that you should think much more of this in terms 47 00:03:16,120 --> 00:03:22,480 -of computational grass instead of just -taking thinking of one giant expression +of computational graphs, instead of just +taking, thinking of one giant expression 48 00:03:22,479 --> 00:03:25,369 -that you're going to drive content with +that you're going to derive with pen and paper the expression for the 49 @@ -243,634 +243,632 @@ gradient and the reason for that 50 00:03:27,549 --> 00:03:31,689 -so here we are thinking about these -values flow flowing through a +so here we are thinking about these +values flow, flowing through a 51 00:03:31,689 --> 00:03:35,509 -competition around these operations -along circles and they transferred to +computational graph where you have these operations +along circles and they're 52 00:03:35,509 --> 00:03:38,979 -basically a function pieces that +basically little function pieces that transform your inputs all the way to the 53 00:03:38,979 --> 00:03:43,018 -loss function at the end so we start off +loss function at the end, so we start off with our data and our parameters as 54 00:03:43,019 --> 00:03:46,079 -inputs they feed through this -competition graph which is just all +inputs. 
They feed through this +computational graph, which is just all 55 00:03:46,079 --> 00:03:49,790 -these series of functions along the way -and at the end we get a single number +these series of functions along the way, +and at the end, we get a single number 56 00:03:49,789 --> 00:03:53,590 -which is the loss and the reason that -I'd like to think about it this way is +which is the loss. And the reason that +I'd like you to think about it this way is 57 00:03:53,590 --> 00:03:57,069 -that these expressions right now look +that, these expressions right now look very small and you might be able to 58 00:03:57,068 --> 00:04:00,339 -derive these grievance but these -expressions are in competition grass are +derive these gradients, but these +expressions are, in computational graphs, are 59 00:04:00,340 --> 00:04:04,250 -about to get very big and so for example +about to get very big and, so for example, convolutional neural networks will have 60 00:04:04,250 --> 00:04:08,829 -hundreds maybe are dozens of operations +hundreds maybe or dozens of operations, so we'll have all these images 61 00:04:08,829 --> 00:04:12,939 -flow-through like big computational +flowing through like pretty big computational graph to get our loss and so it becomes 62 00:04:12,939 --> 00:04:16,858 -impractical to just write out these -expressions and commercial networks are +impractical to just write out these +expressions, and convolutional networks are 63 00:04:16,858 --> 00:04:19,370 -not even the worst of it once you -actually start to for example do +not even the worst of it. Once you +actually start to, for example, do 64 00:04:19,370 --> 00:04:23,509 -something called an alternate sheen -which is a paper from the mind where +something called a Neural Turing Machine, +which is a paper from DeepMind, where 65 00:04:23,509 --> 00:04:26,329 -this is basically differentiable Turing +this is basically differentiable Turing machine 66 00:04:26,329 --> 00:04:30,128 -so the whole thing is differentiable the +so the whole thing is differentiable, the whole procedure that the computer is 67 00:04:30,129 --> 00:04:33,590 -performing on the tape is made smooth +performing on a tape is made smooth and is differentiable computer basically 68 00:04:33,589 --> 00:04:39,519 -and the competition graphic this is huge -and not only is this this is not hit +and the computational graph of this is huge, +and not only is this, this is not it 69 00:04:39,519 --> 00:04:42,478 -because what you end up doing and we're -going to recurrent neural networks and a +because what you end up doing and we're +going to recurrent neural networks in a 70 00:04:42,478 --> 00:04:45,848 -bit but what you end up doing is you end -up controlling this graph so think about +bit, but what you end up doing is you end +up [??] this graph, so think about 71 00:04:45,848 --> 00:04:51,658 -this graph copied many hundreds of time +this graph copied many hundreds of time steps and so you end up with this giant 72 00:04:51,658 --> 00:04:56,379 -monster of hundreds of thousands of +monster of hundreds of thousands of nodes and little computational units and 73 00:04:56,379 --> 00:04:59,819 -so it's impossible to write out you know -here's the loss for the neural Turing +so it's impossible to write out, you know, +here's the loss for the Neural Turing 74 00:04:59,819 --> 00:05:03,650 -machine it's just impossible it would -take like billions of pages and so we +Machine. 
It's just impossible, it would +take like billions of pages, and so we 75 00:05:03,649 --> 00:05:07,068 -have to think about this more in terms -of the structure so little functions +have to think about this more in terms +of data structures of little functions 76 00:05:07,069 --> 00:05:11,710 -transforming intermediate variables to -just lost at the very end so we're going +transforming intermediate variables to +guess the loss at the very end. Okay. So we're going 77 00:05:11,709 --> 00:05:14,318 -to be looking specifically at +to be looking specifically at competition graphs and how we can derive 78 00:05:14,319 --> 00:05:20,560 -the gradient on the inputs with respect -to the loss function at the very end so +the gradient on the inputs with respect +to the loss function at the very end. Okay. 79 00:05:20,560 --> 00:05:25,569 -what start-up simple and concrete a very -small competition graph we have three +So let's start off simple and concrete. So let's consider a very +small computational graph we have three 80 00:05:25,569 --> 00:05:29,778 -scalars as an input to this graph XY and -Z and they take on these specific about +scalars as inputs to this graph, x, y and +z, and they take on these specific values 81 00:05:29,778 --> 00:05:35,069 -these in this example of negative 25 of -94 and we have this very small graphic +in this example of -2, 5 and +-4, and we have this very small graph 82 00:05:35,069 --> 00:05:38,669 -or circuit you'll hear me refer to these -interchangeably hi there is a graph for +or circuit, you'll hear me refer to these +interchangeably either as a graph or 83 00:05:38,668 --> 00:05:43,038 -a circuit so we have this graph that at -the end gives us this out the negative +a circuit, so we have this graph that at +the end gives us this output 84 00:05:43,038 --> 00:05:47,288 -12 ok so here's what I've done is up -over deep refilled look will call the +-12. Okay. So here what I've done is I've +already pre-filled what we'll call the 85 00:05:47,288 --> 00:05:51,120 -forward pass of this graph where I set -the input and then I compute the outfits +forward pass of this graph, where I set +the inputs and then I compute the outfits 86 00:05:51,120 --> 00:05:56,288 -and I would like to do as we'd like to -drive the gradients of the expression on +And now what we'd like to do is, we'd like to +derive the gradients of the expression on 87 00:05:56,288 --> 00:06:01,250 -the inputs and what we'll do that is -introduced this intermediate variable +the inputs, and, so what we'll do now, is, +I'll introduced this intermediate variable 88 00:06:01,250 --> 00:06:07,050 -cue the plus gate so there's a plus gate -and times gate as I refer to them and +q after the plus gate, so there's a plus gate +and times gate, as I'll refer to them, and 89 00:06:07,050 --> 00:06:10,800 -thus must get this computing this outfit -cue and Sookie was this intermediate as +this plus gate is computing this output +q, and so q is this intermediate as 90 00:06:10,800 --> 00:06:14,788 -a result of X plus Y and then f is a -multiplication of qnz what I've written +a result of x plus y, and then f is a +multiplication of q and z. 
And what I've written

91
00:06:14,788 --> 00:06:19,360
out here is, basically, what we want is the
gradients, the derivatives, df/dx, df/dy, df/dz

92
00:06:19,360 --> 00:06:25,598
And I've written out
the intermediate, these little gradients

93
00:06:25,598 --> 00:06:30,120
for every one of these two expressions
separately, so now we've performed forward

94
00:06:30,120 --> 00:06:33,490
pass going from left to right, and what
we'll do now is we'll derive the backward

95
00:06:33,490 --> 00:06:35,699
pass, we'll go from the back

96
00:06:35,699 --> 00:06:39,300
to the front, computing gradients of all
the intermediates in our circuit until

97
00:06:39,300 --> 00:06:43,509
at the very end, we're going to build up to [??]
it the gradients on the inputs, and so we

98
00:06:43,509 --> 00:06:47,680
start off at the very right and, as a
base case sort of this recursive

99
00:06:47,680 --> 00:06:52,670
procedure, we're considering the gradient
of f with respect to f, so this is just the

100
00:06:52,670 --> 00:06:56,020
identity function, so what is the
derivative of it,

101
00:06:56,019 --> 00:07:06,240
identity mapping? What is the gradient of df by df? It's one, right?
So the identity has a gradient of one.

102
00:07:06,240 --> 00:07:10,329
So that's our base case. We start off
with a one, and now we're going to go

103
00:07:10,329 --> 00:07:18,519
backwards through this graph. So, we want
the gradient of f with respect to z.

104
00:07:18,519 --> 00:07:27,089
So what is that in this computational graph? (x+y)
Okay, it's q, so we have that written out right

105
00:07:27,089 --> 00:07:32,879
here and what is q in this particular
example? It's 3, right? So the gradient

106
00:07:32,879 --> 00:07:36,279
on z, according to this, will become
just 3. So I'm going to be writing the gradients

107
00:07:36,279 --> 00:07:42,309
under the lines in red and the values
are in green above the lines.
So with the 108 00:07:42,310 --> 00:07:48,420 -gradient on the in the front is one and -not the gradient onset is 33 as telling +gradient on the, in the front is 1, and +now the gradient on z is 3, and what red 3 is telling 109 00:07:48,420 --> 00:07:52,009 -you really intuitively keep in mind the -interpretation of a gradient is what +you really intuitively, keep in mind the +interpretation of a gradient, is what 110 00:07:52,009 --> 00:07:58,459 -that's saying is that the influence of -dead on the final value is positive and +that's saying is that the influence of +z on the final value is positive and 111 00:07:58,459 --> 00:08:02,859 -with sort of course of three so if I -increments Z by a small amount eight +with, sort of a force of 3. So if I +increment z by a small amount h 112 00:08:02,860 --> 00:08:07,759 -then the output of the circuit will -react by increasing because it's a +then the output of the circuit will +react by increasing, because it's a 113 00:08:07,759 --> 00:08:13,009 -positive three will increase by three so +positive 3, will increase by 3h, so small change will result in a positive 114 00:08:13,009 --> 00:08:21,560 -change in the ultimate now the gradient -upon cue in this case will be so deified +change in the output. Now the gradient +on q in this case will be, so df/dq 115 00:08:21,560 --> 00:08:30,860 -IQ is that what is that before so we get -a gradient of negative for on that part +is z. What is z? -4. Okay? So we get +a gradient of -4 on that path 116 00:08:30,860 --> 00:08:34,599 -of the circuit and with that saying is -that if he were to increase the output +of the circuit, and what that's saying is +that if q were to increase, then the output 117 00:08:34,599 --> 00:08:39,740 -of the circuit will decrease ok by if -you increase by H be up to the circuit +of the circuit will decrease, okay, by, if +you increase by h, the output of the circuit 118 00:08:39,740 --> 00:08:44,789 -will decrease by four age that's the -slope is negative for ok now we're going +will decrease by 4h. That's the +slope, is -4. Okay, now we're going 119 00:08:44,789 --> 00:08:48,480 -to continue this process through this +to continue this recursive process through this plus gate and this is where things get 120 00:08:48,480 --> 00:08:49,039 -slightly +slightly interesting 121 00:08:49,039 --> 00:08:54,328 -I suppose so we'd like to compute the -agreement on on why with respect to Y +I suppose. So we'd like to compute the +gradient on f on y with respect to y 122 00:08:54,328 --> 00:09:10,208 -and so the gradient on why would this in -this particular graph will become +and so the gradient on y with res, in +this particular graph, will become 123 00:09:10,208 --> 00:09:23,979 -glass at it either way I'd like to think -about this is by applying trainable ok +Let's just guess and then we'll see how this gets derived properly. So I hear some murmurs of the right answer. It will be -4. +So let's see how, so there are many ways to derive it at this point, because the expression is very small and you can kind of, glance at it, but the way I'd like to think about it is by applying chain rule, okay. 124 00:09:23,980 --> 00:09:27,709 -so the chain rule says that if you would -like to direct the gradient of everyone +So the chain rule says that if you would +like to derive the gradient of f on y 125 00:09:27,708 --> 00:09:33,208 -why then it's equal to the FBI dq times -the cube ideal i right and so we +then it's equal to df/dq times +dq/dy, right? 
And so we've 126 00:09:33,208 --> 00:09:36,438 -computed both of those expressions in -particular IQ might be why we know is +computed both of those expressions, in +particular dq/dy, we know, is 127 00:09:36,438 --> 00:09:42,519 -negative or so that's the effect of the -influence of coupon is DFID Q which is +-4, so that's the effect of the +influence of q on f, is df/dq, which is 128 00:09:42,519 --> 00:09:46,619 -negative for and now we know the local -would like to know the local influence +-4, and now we know the local, +we'd like to know the local influence 129 00:09:46,619 --> 00:09:52,449 -of why on cuba and that local influence -of light on Q is one that's the locals +of y on q, and that local influence +of y on q is 1, because that's the local 130 00:09:52,448 --> 00:09:58,969 -refer to as the local derivative of Y -for the prostate and so the general +as I'll refer to as the local derivative of y +for the plus gate, and so the chain rule 131 00:09:58,970 --> 00:10:02,019 -tells us that the correct thing to do to -change these two gradients the local +tells us that the correct thing to do to +chain these two gradients, the local 132 00:10:02,019 --> 00:10:06,139 -gradient awful why don't you and the -kind of global gradient of Q on the +gradient of y on q, and the, +kind of global gradient of q on the 133 00:10:06,139 --> 00:10:10,948 -update of the circuit is to multiply -them so we'll get made it four times and +output of the circuit, is to multiply +them. So we'll get -4 times 1 134 00:10:10,948 --> 00:10:14,588 -so this kind of the the crux of her back -propagation works is this is a very +And so, this is kind of the, the crux of how back +propagation works. This is a very 135 00:10:14,589 --> 00:10:18,209 -important to understand here that we had -at least two pieces that we keep +important to understand here that, we have +these two pieces that we keep 136 00:10:18,208 --> 00:10:24,289 -multiplying through when we performed as -general we have computed X plus Y and +multiplying through when we perform the chain rule. +We have q computed x + y, and 137 00:10:24,289 --> 00:10:29,379 -the derivative X&Y with respect to that -single expression is one and one so keep +the derivative x and y, with respect to that +single expression is one and one. So keep 138 00:10:29,379 --> 00:10:32,749 -in mind interpretation of the gradient -that's saying is that X&Y have a +in mind the interpretation of the gradient. +What that's saying is that x and y have a 139 00:10:32,749 --> 00:10:38,509 -positive influence on cue with a slope -of 10 increasing X by H +positive influence on q, with a slope +of 1. 
So increasing x by h 140 00:10:38,509 --> 00:10:44,548 -will increase cue by H and eventually -like as we'd like to influence of light +will increase q by h, and what we'd eventually +like is, we'd like the influence of y 141 00:10:44,548 --> 00:10:49,980 -on the final out but the circuit and so -the way this end up working is you take +on the final output of the circuit, And so +the way this end up working is, you take 142 00:10:49,980 --> 00:10:53,480 -the influence of why are you and we know -the influence of Q on the final loss +the influence of y on q, and we know +the influence of q on the final loss 143 00:10:53,480 --> 00:10:57,058 -which is what we are recursively -computing here through this graph and +which is what we are recursively +computing here through this graph, and 144 00:10:57,058 --> 00:11:00,350 -the correct thing to do is to multiply -them so we end up with a nickname for 10 +the correct thing to do is to multiply +them, so we end up with -4 times 1 145 00:11:00,350 --> 00:11:05,189 -to 15 negative for and so the way this -works out is basically what this is +gets you -4. And so the way this +works out is, basically what this is 146 00:11:05,188 --> 00:11:08,649 -saying is that the influence of why on -the final output circuit is negative or +saying is that the influence of y on +the final output of the circuit is -4 147 00:11:08,649 --> 00:11:14,649 -so increasing why should decrease the -album circuit by negative four times the +so increasing y should decrease the +output of the circuit by -4 times the 148 00:11:14,649 --> 00:11:18,230 -law change that you've made and the way -that end up working out is why has a +little change that you've made. And the way +that end up working out is y has a 149 00:11:18,230 --> 00:11:21,810 -positive influence in Cuse increasing -why slightly increase askew +positive influence in q, so increasing +y, slightly increase q 150 00:11:21,809 --> 00:11:27,959 -which likely decreases in the circuit so -chain rule is kind of giving us this +which slightly decreases the output of the circuit, okay? +So chain rule is kind of giving us this 151 00:11:27,960 --> 00:11:29,120 -correspondence +correspondence. Go ahead. 152 00:11:29,120 --> 00:11:45,259 -we're going to get into this you'll see -many many many associations of this and +Yeap, thank you. we're going to get into this. You'll see many, basically this entire class is about this, so you'll see many many instantiations of this and 153 00:11:45,259 --> 00:11:48,889 -all drill this into you by the end of -class and you understand it you will not +I'll drill this into you by the end of this +class and you'll understand it. You will not 154 00:11:48,889 --> 00:11:51,870 -have any symbolic expressions anywhere -once we complete this letter actually +have any symbolic expressions anywhere +once we compute this, once we're actually 155 00:11:51,870 --> 00:11:54,639 -implementing this and you'll see -implementations of it later in this in +implementing this and you'll see +implementations of it later in this. 156 00:11:54,639 --> 00:11:57,009 -this it will always be just factors in -numbers +It will always be just be vectors and numbers. 157 00:11:57,009 --> 00:12:02,230 -robert is numbers ok and looking at X we -have a very smart that happen thing that +Raw vectors, numbers. Okay, and looking at x, we +have a very smiliar thing that happens. 158 00:12:02,230 --> 00:12:05,889 -happens we wonder if IDX that's our -final objective but we have to combine +We want df/dx. 
158
00:12:02,230 --> 00:12:05,889
-happens we wonder if IDX that's our
-final objective but we have to combine
+We want df/dx. That's our
+final objective, and we have to combine it.
159
00:12:05,889 --> 00:12:09,799
-it we know what the exes what is access
-and listen to you and ask you same place
+We know what the x is, what is x's influence on q,
+and what is q's influence
160
00:12:09,799 --> 00:12:13,979
-on the end of the circuit and so that
-ends up being the chain grow so take a
+on the end of the circuit, and so that
+ends up being the chain rule, so you take
161
00:12:13,980 --> 00:12:19,240
-negative four times want to give you one
-ok so the way this works to generalize a
+-4 times 1 and that gives you -4, okay?
+So the way this works, to generalize a
162
00:12:19,240 --> 00:12:23,289
-bit from this example and way to think
-about it is as follows you are a gate
+bit from this example and the way to think
+about it is as follows. You are a gate
163
00:12:23,289 --> 00:12:28,429
-embedded in a circuit and this is a very
+embedded in a circuit and this is a very
large computational graph or circuit and
164
00:12:28,429 --> 00:12:32,250
-you receive some templates some
-particular numbers X&Y come in and
+you receive some inputs, some
+particular numbers x and y come in
165
00:12:32,250 --> 00:12:39,059
-perform some operation on them and
-compute some good set Z and now this
+and you perform some operation f on them and
+compute some output z. And now this
166
00:12:39,059 --> 00:12:43,019
-magazine goes into competition grass and
+value of z goes into the computational graph and
something happens to it but you're just
167
00:12:43,019 --> 00:12:46,169
-too great hanging out in a circuit and
-you're not sure what happens but by the
+a gate hanging out in a circuit and
+you're not sure what happens, but by the
168
00:12:46,169 --> 00:12:50,939
-end of the circuit the loss computed and
+end of the circuit the loss is computed, okay? And
that's the forward pass and then we're
169
00:12:50,940 --> 00:12:56,250
-proceeding recursively in the reverse
-order backwards but before that actually
+proceeding recursively in the reverse
+order backwards, but before that actually
170
00:12:56,250 --> 00:13:01,120
-get to that part right away when I get
-X&Y the thing I'd like to point out that
+Before I get to that part, right away when I get
+x and y, the thing I'd like to point out is that
171
00:13:01,120 --> 00:13:05,279
-during the forward pass if you're this
-gate and you get to your values X&Y you
+during the forward pass, if you're this
+gate and you get your values x and y,
172
00:13:05,279 --> 00:13:08,500
-computer output said and there's another
-thing you can computer right away and
+you compute your output z, and there's another
+thing you can compute right away and
173
00:13:08,500 --> 00:13:10,230
-that is the local gradients
+that is the local gradients on x and y.
174
00:13:10,230 --> 00:13:14,789
-X&Y so I can compute those right away
+So I can compute those right away
because I'm just a gate and I know what
175
00:13:14,789 --> 00:13:18,009
-I'm performing like say additional
-application so I know the influence that
+I'm performing, like say addition or
+multiplication, so I know the influence that
176
00:13:18,009 --> 00:13:24,259
-X&Y have won my out the body so I can
-compute those guys right away but then
+x and y have on my output value, so I can
+compute those guys right away, okay? But then
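(Editor's note: to make the chained multiplication discussed above concrete, here is a minimal Python sketch of the little circuit in question, q = x + y followed by f = q * z, with the lecture's example values x = -2, y = 5, z = -4. The variable names are illustrative, not from the slides.)

```python
# forward pass through the circuit f = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y                 # q = 3
f = q * z                 # f = -12

# backward pass: chain the local gradient with the "global" gradient
df_dq = z                 # gradient of f with respect to q: -4
dq_dy = 1.0               # local derivative of y for the plus gate
df_dy = df_dq * dq_dy     # chain rule: -4 * 1 = -4
df_dx = df_dq * 1.0       # same reasoning for x: -4
df_dz = q                 # 3

print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```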
But then
177
00:13:24,259 --> 00:13:25,389
@@ -878,8 +876,8 @@
what happens
178
00:13:25,389 --> 00:13:29,769
-near the end so the lawsuits computed
-another going backwards eventually learn
+near the end so the loss gets computed
+and now we're going backwards, I'll eventually learn
179
00:13:29,769 --> 00:13:32,499
@@ -887,136 +885,136 @@
about what is my influence on
180
00:13:32,499 --> 00:13:37,839
-the final output of the circuit the loss
-to learn what is DL by these in their
+the final output of the circuit, the loss.
+So I'll learn what is dL/dz, and that
181
00:13:37,839 --> 00:13:41,419
-ingredient will flow into me and what I
-have to do is I have to change that
+gradient will flow into me, and what I
+have to do is I have to chain that
182
00:13:41,418 --> 00:13:45,278
-gradient through this recursive case so
-I have to make sure to change the
+gradient through this recursive case, so
+I have to make sure to chain the
183
00:13:45,278 --> 00:13:48,778
-gradient through my operation performed
+gradient through my operation that I performed
and it turns out that the correct thing
184
00:13:48,778 --> 00:13:52,068
-to do here buy tramadol really what it's
-saying is the correct thing to do is to
+to do here by chain rule, really what it's
+saying, is that the correct thing to do is to
185
00:13:52,068 --> 00:13:56,068
-multiply your local gradient without
+multiply your local gradient with that
gradient and that actually gives you the
186
00:13:56,068 --> 00:13:57,838
-DL IDX
+dL/dx that gives you the
187
00:13:57,839 --> 00:14:02,739
-employees off X on the final output of
-the circuit so really chain rule is just
+influence of x on the final output of
+the circuit. So really, chain rule is just
188
00:14:02,739 --> 00:14:08,229
-this added multiplication where we take
-what are called Global gradient of this
+this added multiplication, where we take our,
+what I'll call, global gradient of this
189
00:14:08,229 --> 00:14:12,669
-gate on the outfit and we've changed
-through the local gradient in the same
+gate on the output, and we chain it
+through the local gradient, and the same
190
00:14:12,668 --> 00:14:18,509
-thing goes for a while so it's just a
-multiplication of that guy the gradient
+thing goes for y. So it's just a
+multiplication of that guy, that gradient,
191
00:14:18,509 --> 00:14:22,889
-by your local gradient if you're a gate
-and then remember that these X's and Y's
+by your local gradient if you're a gate.
+And then remember that these x's and y's,
192
00:14:22,889 --> 00:14:27,229
-there are coming from different states
-right so you end up with the cursing
+they are coming from different gates, right?
+So you end up with recursing
193
00:14:27,229 --> 00:14:31,899
-this process through the entire cup
-additional Turkish and so these gates
+this process through the entire computational
+circuit, and so these gates
194
00:14:31,899 --> 00:14:36,808
-just basically communicate to each other
-the influence on the final loss so they
+just basically communicate to each other
+the influence on the final loss, so they
195
00:14:36,808 --> 00:14:39,688
-tell each other ok if this is a positive
+tell each other, okay if this is a positive
gradient that means you're positively
196
00:14:39,688 --> 00:14:43,198
-influencing the loss of its negative
-gradient negative influence negatively
+influencing the loss, if it's a negative
+gradient you're negatively
197
00:14:43,198 --> 00:14:46,788
-influencing loss and he just gets almost
-applied through the circuit by these
+influencing the loss, and these just get all
+multiplied through the circuit by these
198
00:14:46,788 --> 00:14:51,019
-local gradients and you end up with and
-this process is called back propagation
+local gradients and you end up with, and
+this process is called back propagation.
199
00:14:51,019 --> 00:14:54,489
-it's a way of computing through a
+It's a way of computing, through a
recursive application of chain rule
200
00:14:54,489 --> 00:14:58,399
-through competition grab the influence
+through the computational graph, the influence
of every single intermediate value in
201
00:14:58,399 --> 00:15:02,158
-that graph on the final loss function
-and so will see many examples of this
+that graph on the final loss function,
+and so we'll see many examples of this
202
00:15:02,158 --> 00:15:06,918
-truck is like her I'll go into a
-specific example there is a slightly
+throughout this class. Here I'll go into a
+specific example that is slightly
203
00:15:06,918 --> 00:15:11,298
-larger and we'll work through it in
-detail but i dont their own questions at
+larger and we'll work through it in
+detail. But are there any questions at
204
00:15:11,298 --> 00:15:20,389
-this point that I would like to ask
-ahead I'm going to come back to that you
+this point that I can answer? Go
+ahead. I'm going to come back to that. You
205
00:15:20,389 --> 00:15:25,538
-add the gradients the grading the
-cognitive Adam so if Z is being employed
+add the gradients. The gradients, they get
+added up. So if z is being employed
206
00:15:25,538 --> 00:15:29,928
-in multiple places in the circus the
-back roads closed will add that will
+in multiple places in the circuit, the
+backward flows will add. We will
207
@@ -1025,72 +1023,72 @@
come back to that point
208
00:15:31,539 --> 00:16:03,139
-like we're going to get the all of those
-issues and we're gonna see ya you're
+We're going to get to all of those
+issues, and we're gonna see, yeah, you're
209
00:16:03,139 --> 00:16:05,769
-gonna get what we call banishing
-gradient problems and so on
+gonna get what we call vanishing
+gradient problems and so on.
210
00:16:05,769 --> 00:16:10,669
-we'll see let's go through another
-example to make this more concrete so
+We'll see. Let's go through another
+example to make this more concrete, so
211
00:16:10,669 --> 00:16:14,318
-here we have another circuit it happens
-to be computing a little two-dimensional
+here we have another circuit. It happens
+to be computing a little two-dimensional
212
00:16:14,318 --> 00:16:18,179
-in Iran but for now don't worry about
-that interpretation just think of this
+neuron, but for now don't worry about
+that interpretation, just think of this
213
00:16:18,179 --> 00:16:22,849
-as that's an expression so one over
-one-plus key to the whatever number of
+as just an expression, so one over
+one plus e to the minus all of this. The number of
214
00:16:22,850 --> 00:16:29,000
-inputs here is by Andrew function and we
-have a single output over there and I
+inputs here is five, to this function, and we
+have a single output over there, and I
215
00:16:29,000 --> 00:16:32,490
-translated that mathematical expression
-into this competition in draft form so
+translated that mathematical expression
+into this computational graph form. So
216
00:16:32,490 --> 00:16:35,769
-we have to recursively from inside out
-compete with expression so a person do
+we have to recursively, from inside out,
+compute this expression. So first we do
217
00:16:35,769 --> 00:16:42,129
-all the little W times access and then
-we add them all up and then we take a
+all the little w times x's, and then
+we add them all up, and then we take a
218
00:16:42,129 --> 00:16:46,129
-negative of it and then we exponentially
-that and they had one and then we
+negative of it, and then we exponentiate
+that, and then we add one, and then we
219
00:16:46,129 --> 00:16:49,769
-finally divide and we get the result of
-the expression and so we're going to do
+finally divide, and we get the result of
+the expression. And so what we're going to do
220
00:16:49,769 --> 00:16:52,409
-now is we're going to back propagate
-through this expression we're going to
+now is we're going to back propagate
+through this expression. We're going to
221
00:16:52,409 --> 00:16:56,500
-compute what the influence of every
-single input value is on the output of
+compute what the influence of every
+single input value is on the output of
222
@@ -1099,57 +1097,57 @@
-this expression that is degrading here
+this expression, that is, the gradients here.
223
00:17:07,230 --> 00:17:22,039
-so for now the US is just a binary plus
-its entirety + gate and we have a plus
+So for now, this is just a binary plus,
+it's a binary + gate, and we have a plus
224
00:17:22,039 --> 00:17:26,519
-one gate I'm making up these gates on
-the spot and we'll see that what is a
+one gate. I'm making up these gates on
+the spot, and we'll see that what is a
225
00:17:26,519 --> 00:17:31,519
-gate or is not a gate is kind of up to
-you come back to this point of it so for
+gate or is not a gate is kind of up to
+you. I'll come back to this point in a bit. So for
226
00:17:31,519 --> 00:17:35,639
-now I just like we have several more
-gates that we're using throughout and so
+now, I'd just like to note we have several more
+gates that we're using throughout, and so
227
00:17:35,640 --> 00:17:38,650
-I just like to write out as we go
-through this example several of these
+I'd just like to write out, as we go
+through this example, several of these
228
00:17:38,650 --> 00:17:42,720
-derivatives exponentiation and we know
-for every little local gate what these
+derivatives, like exponentiation, and we know
+for every little local gate what these
229
00:17:42,720 --> 00:17:49,048
-local gradients are right so we can do
-that using calculus so the extra tax and
+local gradients are, right? So we can derive
+that using calculus, so d(e^x)/dx is e^x, and
230
00:17:49,048 --> 00:17:52,900
-so on so these are all the operations
-and also addition and multiplication
+so on. So these are all the operations,
+and also addition and multiplication,
231
00:17:52,900 --> 00:17:56,040
-which I'm assuming that you have
-memorized in terms of what the great
+which I'm assuming that you have
+memorized in terms of what the gradients
232
00:17:56,039 --> 00:17:58,970
-things look like they're going to start
-off at the end of the circuit and I've
+look like. We're going to start
+off at the end of the circuit, and I've
233
00:17:58,970 --> 00:18:03,450
-already filled in a one point zero zero
-in the back because that's how we always
+already filled in a one point zero zero
+at the back, because that's how we always
234
@@ -1158,157 +1156,157 @@
start this recursion
235
00:18:04,859 --> 00:18:10,519
-1110 right but since that's the gradient
-on the identity function now we're going
+with a 1.0, right? Since that's the gradient
+on the identity function. Now we're going
236
00:18:10,519 --> 00:18:17,849
-to back propagate through this one over
-x operation ok so the relative of one of
+to back propagate through this one over
+x operation, okay? So the derivative of one over
237
00:18:17,849 --> 00:18:22,048
-wrecks the local gradient is a negative
-one over x squared so that none of Rex
+x, the local gradient, is negative
+one over x squared. So that one over x
238
00:18:22,048 --> 00:18:27,119
-gate during the forward pass received
-input 1.37 and right away that one of
+gate during the forward pass received
+input 1.37, and right away that one over
239
00:18:27,119 --> 00:18:30,759
-her ex Kate could have computed what the
-local gradients the local variant was
+x gate could have computed what the
+local gradient was. The local gradient was
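(Editor's note: before the backward pass continues, here is a sketch of the forward pass just described, using the example values from the lecture; the variable names are mine, not the slides'.)

```python
import math

# f(w, x) = 1 / (1 + e^-(w0*x0 + w1*x1 + w2)), with the lecture's example values
w0, x0 = 2.0, -1.0
w1, x1 = -3.0, -2.0
w2 = -3.0

dot = w0*x0 + w1*x1 + w2   #  1.0: the little w times x's, added up
neg = -dot                 # -1.0: take the negative
e   = math.exp(neg)        #  0.37: exponentiate
inc = 1.0 + e              #  1.37: add one
f   = 1.0 / inc            #  0.73: finally divide
```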
240
00:18:30,759 --> 00:18:35,048
-negative one over x squared and ordering
-back propagation and has to buy tramadol
+negative one over x squared, and during
+back propagation it has to, by chain rule,
241
00:18:35,048 --> 00:18:40,750
-multiply that local gradient by the
-gradient of it on the final of the
+multiply that local gradient by the
+gradient of it on the final output of the
242
00:18:40,750 --> 00:18:44,789
-circuit which is easy because it happens
-to be so what ends up being the
+circuit, which is easy because it happens
+to be one. So what ends up being the
243
00:18:44,789 --> 00:18:51,349
-expression for the back propagated
-reading here from one of my ex Kate
+expression for the back propagated
+gradient here from the one over x gate?
244
00:18:51,349 --> 00:18:59,829
-but she always has two pieces local
-gradient times the gradient from or from
+It always has two pieces: local
+gradient times the gradient from above, or from
245
00:18:59,829 --> 00:19:18,069
-which is the gradient DFID X so that
-that is the local gradient
+which is the gradient df/dx. So that,
+that is the local gradient,
246
00:19:18,069 --> 00:19:23,480
-giving one over 3.7 squared and then
-multiplied by one point zero which is
+negative one over 1.37 squared, and then
+multiplied by one point zero, which is
247
00:19:23,480 --> 00:19:27,940
-degrading from which is really just one
-because we just started and so applying
+the gradient from above, which is really just one
+because we just started. And so applying
248
00:19:27,940 --> 00:19:34,850
-general right away here and the other is
-negative 01534 that's the gradient on
+chain rule right away here, the answer is
+negative 0.53. That's the gradient on
249
00:19:34,849 --> 00:19:38,798
-that piece of the wire where this valley
-was blowing ok so it has a negative
+that piece of the wire where this value
+was flowing, okay? So it has a negative
250
00:19:38,798 --> 00:19:43,889
-effect on the outfit you might expect
-that right because if you were to
+effect on the output, and you might expect
+that, right? Because if you were to
251
00:19:43,890 --> 00:19:47,850
-increase this value and then it goes
-through a gate of one over x then
+increase this value, and then it goes
+through a gate of one over x, then
252
00:19:47,849 --> 00:19:50,939
-increased amount of Rex get smaller so
-that's why you're seeing negative
+the output of one over x gets smaller, so
+that's why you're seeing a negative
253
00:19:50,940 --> 00:19:55,620
-gradient rate we're going to continue
-back propagation here in the next gate
+gradient, right? We're going to continue
+back propagation here. The next gate
254
00:19:55,619 --> 00:20:01,048
-in the circuit it's adding a constant of
-one so the local gradient if you look at
+in the circuit, it's adding a constant of
+one. So the local gradient, if you look at
255
00:20:01,048 --> 00:20:06,960
-adding a constant to a value the
-gradient off on exit is just one right
+adding a constant to a value, the
+gradient of it on x is just one, right,
256
00:20:06,960 --> 00:20:13,169
-to talk to us and so the change gradient
-here that we continue along the wire
+as that tells us. And so the chained gradient
+here that we continue along the wire
257
00:20:13,169 --> 00:20:22,940
-will be your local gradient which has
-one time the gradient from above the
+will be your local gradient, which is
+one, times the gradient from above the
258
00:20:22,940 --> 00:20:28,590
-gate which it has just learned is
-negative Jul 23 2013 continues along the
+gate, which it has just learned is
+negative 0.53. So negative 0.53 continues along the
259
00:20:28,589 --> 00:20:34,709
-way are unchanged and intuitively that
-makes sense right because this is value
+wire unchanged, and intuitively that
+makes sense, right? Because this value
260
00:20:34,710 --> 00:20:38,319
-floats and it has some influence on the
-final circuit and if you're if you're
+flows and it has some influence on the
+final circuit, and if you're, if you're
261
00:20:38,319 --> 00:20:42,798
-adding one then its influence its rate
-of change of slope toward the final
+adding one, then its influence, its rate
+of change, its slope toward the final
262
00:20:42,798 --> 00:20:46,970
-value doesn't change if you increase
-this by some amount the effect at the
+value, doesn't change. If you increase
+this by some amount, the effect at the
263
00:20:46,970 --> 00:20:51,548
-end will be the same because the rate of
-change doesn't change through the +1
+end will be the same, because the rate of
+change doesn't change through the +1
264
00:20:51,548 --> 00:20:56,859
-gays just a constant officer continued
-innovation here so the gradient of the
+gate, it's just a constant offset. So we continue
+back propagation here, so the gradient of
265
00:20:56,859 --> 00:21:01,599
-axe the axe so you can come back
-propagation we're going to perform
+e to the x is e to the x, so to continue back
+propagation we're going to perform
266
@@ -1317,82 +1315,82 @@
gates input of negative one
267
00:21:05,000 --> 00:21:08,329
-it right away could have completed its
-local gradient and now it knows that the
+It right away could have computed its
+local gradient, and now it knows that the
268
00:21:08,329 --> 00:21:12,259
-gradient from above is negative point by
-three so the continued backpropagation
+gradient from above is negative point five
+three. So to continue backpropagation
269
00:21:12,259 --> 00:21:20,000
-here in applying chain rule would
-received the rhetorical questions I'm
+here, applying chain rule, we'd...
+was that a rhetorical question? I'm
270
00:21:20,000 --> 00:21:25,119
-not sure but but basically each of the
-negative one which is the ex the ex
+not sure. But, but basically e to the
+negative one, which is the e to the x, the
271
00:21:25,119 --> 00:21:30,569
-input to this expert eight times the
-chain rule right to the point by three
+input to this exp gate, times, by the
+chain rule, the negative point five three,
272
00:21:30,569 --> 00:21:35,269
-so we keep multiplying their own so what
-is the effect on me and what I have an
+so we keep multiplying these along: what
+is the effect on me, and what effect I have
273
00:21:35,269 --> 00:21:39,069
-effect on the final end of the circuit
-those are being always multiplied so we
+on the final end of the circuit,
+those are always being multiplied. So we
274
00:21:39,069 --> 00:21:46,859
-get negative 22 at this point so now we
-have a time to negative one gate so what
+get negative 0.2 at this point. So now we
+have a times negative one gate. So what
275
00:21:46,859 --> 00:21:50,279
-ends up happening what happens to the
-gradient when you do it turns me on an
+ends up happening, what happens to the
+gradient when you do a times negative one
276
00:21:50,279 --> 00:21:57,139
-accomplished on da lips around right
-because we have basically constant input
+in the computational graph? It flips around, right?
+Because we have basically a constant input,
277
00:21:57,140 --> 00:22:02,038
-which happened to be a constant of
-negative one so negative one time one
+which happened to be a constant of
+negative one. So negative one times one
278
00:22:02,038 --> 00:22:05,548
-time they dont give us negative one in
-the forward pass and so now we have to
+gave us negative one in
+the forward pass, and so now we have to
279
00:22:05,548 --> 00:22:09,569
-multiply by a that's the local gradient
-times the greeting from Bob which is
+multiply by a, that's the local gradient,
+times the gradient from above, which is
280
00:22:09,569 --> 00:22:14,879
-fine too so we end up with just positive
-so now continue back propagation
+minus point two. So we end up with just positive
+point two. So now we continue back propagation,
281
00:22:14,880 --> 00:22:21,110
-propagating + and this plus operation
-has multiple input here the green in the
+propagating a +, and this plus operation
+has multiple inputs here. The gradient, the
282
00:22:21,109 --> 00:22:25,599
-local gradient for the bus gate as one
-and 10 what ends up happening to the
+local gradient for the plus gate is one
+and one. So what ends up happening to the
283
@@ -1401,292 +1399,292 @@
-brilliance flow along the upper buyers
+gradients flowing along the upper wires?
284
00:22:42,359 --> 00:22:48,089
-surplus paid has a local gradient on all
-of its always will be just one because
+So a plus gate has a local gradient on all
+of its inputs, always, of just one, because
285
00:22:48,089 --> 00:22:53,769
-if you just have a functioning you know
-experts why then for that function the
+if you just have a function, you know,
+x plus y, then for that function the
286
00:22:53,769 --> 00:22:58,109
-gradient on either X or Y is just one
-and so what you end up getting is just
+gradient on either x or y is just one,
+and so what you end up getting is just
287
00:22:58,109 --> 00:23:03,619
-one time spent two and so in fact for a
-plus gate always see see the same fact
+one times point two. And so in fact for a
+plus gate you always see the same effect,
288
00:23:03,619 --> 00:23:07,469
-where the local gradient all of its
-inputs is one and so whatever grading it
+where the local gradient on all of its
+inputs is one, and so whatever gradient it
289
00:23:07,470 --> 00:23:11,289
-gets from above it just always
-distributes gradient equally to all of
+gets from above, it just always
+distributes gradient equally to all of
290
00:23:11,289 --> 00:23:14,339
-its inputs because in the chain rule
-don't have multiplied and multiplied by
+its inputs, because in the chain rule
+we'll have a multiply, and multiplied by
291
00:23:14,339 --> 00:23:18,129
-10 something remains unchanged surplus
-get this kind of like ingredient
+one, something remains unchanged. So a plus
+gate is kind of like a gradient
292
00:23:18,130 --> 00:23:22,170
-distributor whereas something flows in
-from the top it all just spread out all
+distributor, where if something flows in
+from the top, it will just spread out all
293
00:23:22,170 --> 00:23:26,560
-the great teams equally to all of its
-children and so we've already received
+the gradients equally to all of its
+children. And so we've already received
294
00:23:26,559 --> 00:23:32,139
-one of the inputs gradient point to hear
-on the very final output of the circuit
+one of the inputs' gradients, point two here,
+on the very final output of the circuit,
295
00:23:32,140 --> 00:23:35,970
-and so this employees has been completed
-through a series of applications of
+and so this influence has been computed
+through a series of applications of
296
00:23:35,970 --> 00:23:42,450
-trainer along the way there was another
-plus get that skipped over and so this
+chain rule along the way. There was another
+plus gate that I skipped over, and so this
297
00:23:42,450 --> 00:23:47,090
-point you kind of this tribute to both
-20.2 equally so we've already done a
+point two kind of distributes to both,
+point two, point two equally. So we've already done a
298
00:23:47,089 --> 00:23:51,750
-blockade and there's a multiply get
-there and so now we're going to back
+plus gate, and there's a multiply gate
+there, and so now we're going to back
299
00:23:51,750 --> 00:23:55,940
-propagate through that multiply
-operation and so the local grade so the
+propagate through that multiply
+operation. And so the local grad, so the
300
00:23:55,940 --> 00:24:06,450
-so what will be the gradient for w 00
-will be degrading 40 basically
+so what will be the gradient for w0?
+What will the gradient be, basically?
301
00:24:06,450 --> 00:24:19,059
-2000 you will be going in W one will be
-W 0:30 will be negative one times when
+The gradient on w0 will be
+x0, so it will be negative one times point
302
00:24:19,059 --> 00:24:24,389
-too good and the gradient on x zero will
-be there is a bug bite away in the slide
+two. Good. And the gradient on x0 will
+be... there is a bug, by the way, in the slide
303
00:24:24,390 --> 00:24:27,840
-that I just noticed like few minutes
-before I actually create the class also
+that I just noticed like a few minutes
+before I actually came to the class, right
304
00:24:27,839 --> 00:24:34,289
-increase starting to class so you see .
-39 there it should be point for its
+as I was starting the class. So you see point
+39 there? It should be point four. It's
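(Editor's note: here is the whole backward pass for this circuit as a self-contained sketch, reproducing the numbers walked through above, including the corrected 0.4 on x0. The one-gate-per-line staging mirrors the lecture; the variable names are mine.)

```python
import math

# forward pass, one gate at a time (lecture's example values)
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0
dot = w0*x0 + w1*x1 + w2          #  1.0
neg = -dot                        # -1.0
e   = math.exp(neg)               #  0.37
inc = 1.0 + e                     #  1.37
f   = 1.0 / inc                   #  0.73

# backward pass: each line is a local gradient times the gradient from above
df   = 1.0                        # we always start the recursion with 1.0
dinc = (-1.0 / inc**2) * df       # 1/x gate            -> -0.53
de   = 1.0 * dinc                 # +1 gate             -> -0.53
dneg = math.exp(neg) * de         # exp gate            -> -0.20
ddot = -1.0 * dneg                # *-1 gate            -> +0.20
dw2  = ddot                       # + gates distribute  ->  0.20
dw0, dx0 = x0 * ddot, w0 * ddot   # multiply gate       -> -0.20, 0.40
dw1, dx1 = x1 * ddot, w1 * ddot   # multiply gate       -> -0.40, -0.60
```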
305
00:24:34,289 --> 00:24:37,480
-because of a bug in evangelization
-because I'm truncating a to the small
+because of a bug in the visualization,
+because I'm truncating to two decimal
306
00:24:37,480 --> 00:24:41,190
-digits but basically that should be
-pointed or because the way you get that
+digits. But basically that should be
+point four, because the way you get that
307
00:24:41,190 --> 00:24:45,400
-is two times pointed to get the point
-for just like I've written out there so
+is two times point two, to get the point
+four, just like I've written out there. So
308
00:24:45,400 --> 00:24:50,980
-that's what the opportunity there okay
-so that we've been propagated the
+that's what it should be there, okay?
+So we've back propagated this
309
00:24:50,980 --> 00:24:55,190
-circuit here and we get through this
-expression and so you might imagine in
+circuit here, and we got through this
+expression. And so you might imagine, in
310
00:24:55,190 --> 00:24:59,289
-there are actual downstream applications
-will have data and all the parameters as
+our actual downstream applications, we
+will have data and all the parameters as
311
00:24:59,289 --> 00:25:03,450
-inputs loss functions at the top at the
-end it will be forward pass to evaluate
+inputs, loss functions at the top, at the
+end. We will do a forward pass to evaluate
312
00:25:03,450 --> 00:25:06,440
-the loss function and then we'll back
-propagate through every piece of
+the loss function, and then we'll back
+propagate through every piece of
313
00:25:06,440 --> 00:25:10,450
-competition we've done along the way and
-Welbeck propagate through every gate to
+computation we've done along the way, and
+we'll back propagate through every gate to
314
00:25:10,450 --> 00:25:14,150
-get our imports and back up again just
-means supply chain rule many many times
+get our gradients on the inputs, and back propagation just
+means applying chain rule many, many times,
315
00:25:14,150 --> 00:25:21,720
-and we'll see how that is implemented in
-but the question i guess im going to
+and we'll see how that is implemented in
+a bit. The question, I guess, I'm going to
316
00:25:21,720 --> 00:25:31,769
-skip that because it's the same I'm
-going to skip the other questions
+skip that because it's the same. I'm
+going to skip the other questions.
317
00:25:31,769 --> 00:25:45,869
-so the cost of forward and backward
-propagation is roughly almost always end
+So the cost of forward and backward
+propagation, they roughly almost always end
318
00:25:45,869 --> 00:25:49,500
-up being basically equal when you look
-at timings usually the backup a slightly
+up being basically equal when you look
+at timings. Usually the backward pass is slightly
319
00:25:49,500 --> 00:25:58,710
-slower idea so let's see one thing I
-want to point out before in one is that
+slower, yeah. So let's see, one thing I
+want to point out before we move on is that
320
00:25:58,710 --> 00:26:02,350
-the setting of these gates like these
-gates are arbitrary so what can I could
+the setting of these gates, like, these
+gates are arbitrary. So one thing I could
321
00:26:02,349 --> 00:26:06,509
-have known for example is some of you
-may know this I can collapse these gates
+have done, for example, is, some of you
+may know this, I can collapse these gates
322
00:26:06,509 --> 00:26:10,549
-into one gate if I wanted to for example
-in something called the sigmoid function
+into one gate if I wanted to. For example,
+there's something called the sigmoid function,
323
00:26:10,549 --> 00:26:14,069
-which has that particular form a single
-facts which the sigmoid function
+which has that particular form, sigma
+of x, where the sigmoid function
324
00:26:14,069 --> 00:26:19,460
-computes won over one plus or minus tax
-and so I could have rewritten that
+computes one over one plus e to the minus x,
+and so I could have rewritten that
325
00:26:19,460 --> 00:26:22,650
-expression and i cant collapsed all of
-those gates that made up the sigmoid
+expression, and I could have collapsed all of
+those gates that made up the sigmoid
326
00:26:22,650 --> 00:26:27,769
-gate into a single gate and so there's a
-sigmoid get here and I could have done
+gate into a single gate, and so there's a
+sigmoid gate here, and I could have done
327
00:26:27,769 --> 00:26:32,440
-that in a single go sort of and when I
-would have had to do if I wanted to have
+that in a single go, sort of. And what I
+would have had to do, if I wanted to have
328
00:26:32,440 --> 00:26:37,980
-that gate as I need to compute an
-expression for how this so what is the
+that gate, is I need to compute an
+expression for how this, so what is the
329
00:26:37,980 --> 00:26:41,670
-local gradient for the sigmoid get
-basically so what is the gradient of the
+local gradient for the sigmoid gate,
+basically. So what is the gradient of the
330
00:26:41,670 --> 00:26:44,470
-small gate on its input and I had to go
-through some math which I'm not going to
+sigmoid gate on its input? And I had to go
+through some math, which I'm not going to
331
00:26:44,470 --> 00:26:46,980
-go into detail but you end up with that
-expression over there
+go into in detail, but you end up with that
+expression over there.
332
00:26:46,980 --> 00:26:51,750
-it ends up being 1-6 next time segment
-of access to local gradient and that
+It ends up being one minus sigmoid of x, times
+sigmoid of x, as the local gradient, and that
333
00:26:51,750 --> 00:26:55,450
-allows me to put this piece into a
-competition graph because once I know
+allows me to put this piece into a
+computational graph, because once I know
334
00:26:55,450 --> 00:26:58,819
-how to compute the local gradient
-everything else is defined just through
+how to compute the local gradient,
+everything else is defined just through
335
00:26:58,819 --> 00:27:02,389
-chain rule and multiply everything
-together so we can back propagate
+chain rule and multiplying everything
+together. So we can back propagate
336
00:27:02,390 --> 00:27:06,720
-through the sigmoid get down and the way
-that would look like is input to the
+through the sigmoid gate now, and the way
+that would look like is: the input to the
337
00:27:06,720 --> 00:27:11,750
-gate was one point zero that's what flu
-went into the gate and punk 73 went out
+gate was one point zero, that's what flowed
+into the gate, and point 73 went out.
338
00:27:11,750 --> 00:27:18,759
-so . 7360 facts okay and we want to
-local gradient which is as we've seen
+So point 73 is sigma of x, okay? And we want the
+local gradient, which is, as we've seen
339
00:27:18,759 --> 00:27:26,450
-from the math on their backs so you get
-access point cemetery multiplying 1-23
+from the math up there, one minus sigma of x, times sigma of x. So you get
+sigma of x, point seven three, multiplying one minus point seven three,
340
00:27:26,450 --> 00:27:31,170
-that's the local gradient and then times
-will work we happened to be at the end
+that's the local gradient, and then times,
+well, we happen to be at the end
341
00:27:31,170 --> 00:27:36,330
-of the circuit so times 10 even writing
-so we end up with 12 and of course we
+of the circuit, so times one point zero,
+so we end up with point two. And of course we
342
@@ -1695,22 +1693,22 @@
get the same answer
343
00:27:37,650 --> 00:27:42,220
-point to as we received before 12
-because calculus works but basically we
+point two, as we received before,
+because calculus works. But basically we
344
00:27:42,220 --> 00:27:44,480
-could have broken up this expression
-down and
+could have broken up this expression
+down and done it
345
00:27:44,480 --> 00:27:47,450
-one piece at a time or we could just
-have a single signaled gate and it's
+one piece at a time, or we could just
+have a single sigmoid gate, and it's
346
00:27:47,450 --> 00:27:51,569
-kind of up to us and what level up here
-are key to break these expressions and
+kind of up to us at what level of hierarchy
+we break up these expressions, and
347
@@ -1719,267 +1717,267 @@
so you'd like to
348
00:27:52,339 --> 00:27:55,829
-intuitively clustered these expressions
-into single gates if it's very efficient
+intuitively cluster these expressions
+into single gates if it's very efficient
349
00:27:55,829 --> 00:28:06,819
-or easy to direct the local radiance
-because then they become your pieces so
+or easy to derive the local gradients,
+because then they become your pieces. So
350
00:28:06,819 --> 00:28:10,529
-the question is do libraries typically
-do that I do they worry about you know
+the question is, do libraries typically
+do that? Do they worry about, you know,
351
00:28:10,529 --> 00:28:14,058
-what's what's easy to convince the
-computer and the answer is yes I would
+what's, what's easy to compute? And
+the answer is yes, I would
352
00:28:14,058 --> 00:28:17,480
-say so so he noted that there are some
-piece of operation you'd like to do over
+say so. So if you note that there is some
+piece of operation you'd like to do over
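(Editor's note: the collapsed sigmoid gate above can be verified in a few lines; the analytic local gradient (1 - sigma(x)) * sigma(x) agrees with a numerical gradient check. A small sketch, with names of my choosing:)

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 1.0                        # the input that flowed into the sigmoid gate
s = sigmoid(x)                 # 0.73 flowed out
grad = (1.0 - s) * s * 1.0     # local gradient times the gradient from above: ~0.2

# numerical check agrees with the analytic expression
h = 1e-5
num = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(grad, num)               # both ~0.1966
```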
353
00:28:17,480 --> 00:28:20,798
-and over again and it has a very simple
-local gradient that's something very
+and over again, and it has a very simple
+local gradient, that's something very
354
00:28:20,798 --> 00:28:24,900
-appealing to actually create a single
-unit of and we'll see some of those
+appealing to actually create a single
+unit out of, and we'll see some of those
355
00:28:24,900 --> 00:28:30,230
-examples actually but I think I'd like
-to also point out that once you the
+examples actually in a bit. But I think I'd like
+to also point out that, the
356
00:28:30,230 --> 00:28:32,490
-reason I like to think about these
-compositional grass is it really hope
+reason I like to think about these
+computational graphs is it really helps
357
00:28:32,490 --> 00:28:36,289
-your intuition to think about how greedy
-and slow in a neural network it's not
+your intuition to think about how gradients
+flow in a neural network. It's not
358
00:28:36,289 --> 00:28:39,369
-just you don't want this to be a black
-box do you want to understand
+just, you don't want this to be a black
+box, you want to understand
359
00:28:39,369 --> 00:28:43,959
-intuitively how this happens and you
-start to develop after a while of
+intuitively how this happens, and you
+start to develop, after a while of
360
00:28:43,960 --> 00:28:47,850
-looking at additional graphs intuitions
-about how these graybeards flow and this
+looking at computational graphs, intuitions
+about how these gradients flow, and this
361
00:28:47,849 --> 00:28:52,029
-might help you debug some issues like
-say will go to banish ingredient problem
+might help you debug some issues, like,
+say, we'll get to the vanishing gradient problem.
362
00:28:52,029 --> 00:28:55,950
-it's much easier to understand exactly
-what's going wrong in your optimization
+It's much easier to understand exactly
+what's going wrong in your optimization
363
00:28:55,950 --> 00:28:59,250
-if you understand how greedy and slow
-and networks will help you debug these
+if you understand how gradients flow
+in networks; it will help you debug these
364
00:28:59,250 --> 00:29:02,740
-networks much more efficiently and so
-some information for example we already
+networks much more efficiently. And so
+some intuitions, for example: we already
365
00:29:02,740 --> 00:29:07,609
-saw the eighth at Gate it has a little
-reading the one to all of its inputs so
+saw the add gate. It has a local
+gradient of one to all of its inputs, so
366
00:29:07,609 --> 00:29:11,279
-it's just a greeting distributor that's
-like a nice way to think about it
+it's just a gradient distributor. That's
+like a nice way to think about it:
367
00:29:11,279 --> 00:29:14,548
-whenever you have a plus operation
-anywhere in your score function or your
+whenever you have a plus operation
+anywhere in your score function or your
368
00:29:14,548 --> 00:29:18,740
-comment or anywhere else it's
-distributed ratings the max kate is
+ConvNet or anywhere else, it just
+distributes gradients. The max gate is
369
00:29:18,740 --> 00:29:23,009
-instead a great writer and way this
-works is if you look at the expression
+instead a gradient router, and the way this
+works is, if you look at the expression,
370
00:29:23,009 --> 00:29:30,970
-like we have great these markers don't
-work so if you have a very simple binary
+like we have... great, these markers don't
+work. So if you have a very simple binary
371
00:29:30,970 --> 00:29:38,410
-expression of Maxim XY so this is a gate
-then the gradient on x online if you
+expression of max of x and y, so this is a gate,
+then the gradient on x and y, if you
372
00:29:38,410 --> 00:29:42,570
-think about it the green on the larger
-one of your inputs which is larger the
+think about it, the gradient on the larger
+one of your inputs, whichever is larger, the
373
00:29:42,569 --> 00:29:46,389
-gradient on that guy is one and all this
-and the smaller one is a greeting of
+gradient on that guy is one, and on
+the smaller one it's a gradient of
374
00:29:46,390 --> 00:29:50,630
-zero and intuitively that because if one
-of these was smaller than what it has no
+zero. And intuitively that's because if one
+of these was smaller, then it has no
375
00:29:50,630 --> 00:29:53,220
-effect on the out but because the other
-guy's larger and that's what ends up
+effect on the output, because the other
+guy is larger, and that's what ends up
376
00:29:53,220 --> 00:29:57,009
-getting through the gate so you end up
-with a gradient of one on the
+getting through the gate. So you end up
+with a gradient of one on the
377
00:29:57,009 --> 00:30:03,140
-larger one of the inputs and so that's
-why max cady as a gradient writer if I'm
+larger one of the inputs. And so that's
+why the max gate is a gradient router: if I'm
378
00:30:03,140 --> 00:30:06,420
-actually and I have received several
-inputs one of them was the largest of
+a max gate and I have received several
+inputs, one of them was the largest of
379
00:30:06,420 --> 00:30:09,550
-all of them and that's the value that I
-propagated through the circuit and
+all of them, and that's the value that I
+propagated through the circuit, and at
380
00:30:09,549 --> 00:30:12,909
-application time I'm just going to
-receive my gradient from above and I'm
+back propagation time I'm just going to
+receive my gradient from above and I'm
381
00:30:12,910 --> 00:30:16,590
-going to write it to whoever was my
-largest impact it's a gradient writer
+going to route it to whoever was my
+largest input. It's a gradient router.
382
00:30:16,589 --> 00:30:22,569
-and multiply gate is a gradient switcher
-actually don't think that's a very good
+And the multiply gate is a gradient switcher.
+Actually, I don't think that's a very good
383
00:30:22,569 --> 00:30:26,960
-way to look at it but I'm referring to
-the fact that it's not actually
+way to look at it, but I'm referring to
+the fact that... it's not, actually,
384
00:30:26,960 --> 00:30:39,150
-nevermind about that part so the
-question is what happens if the two
+never mind about that part. So the
+question is, what happens if the two
385
00:30:39,150 --> 00:30:53,470
-inputs are equal when you go through max
-Kade what happens I don't think it's
+inputs are equal when you go through the max
+gate, what happens? I don't think it's
386
00:30:53,470 --> 00:30:57,559
-correct to distributed to all of them I
-think you have to you have to pick one
+correct to distribute it to all of them, I
+think you have to, you have to pick one.
387
00:30:57,559 --> 00:31:07,990
-that basically never happens in actual
-practice so max gradient here actually
+That basically never happens in actual
+practice. So the max gradient here, actually
388
00:31:07,990 --> 00:31:13,019
-have an example is that here was larger
-than W so only is it has an influence on
+we have an example: z here was larger
+than w, so only z has an influence on
389
00:31:13,019 --> 00:31:16,839
-the output of this max Kade right so
-when two flows into the max gate and
+the output of this max gate, right? So
+when two flows into the max gate, z
390
00:31:16,839 --> 00:31:20,879
-gets read it and W gets a zero gradient
-because its effect on the circuit is
+gets routed the two, and w gets a zero gradient,
+because its effect on the circuit is
391
00:31:20,880 --> 00:31:25,360
-nothing there is zero because when you
-change it doesn't matter when you change
+nothing. It is zero, because when you
+change it, it doesn't matter when you change
392
00:31:25,359 --> 00:31:29,689
-it because that is not a larger bally
-going through the competition grounds I
+it, because it is not the larger value
+going through the computational graph. I
393
00:31:29,690 --> 00:31:33,100
-have another note that is related to
-back propagation which we already
+have another note that is related to
+back propagation, which we already
394
00:31:33,099 --> 00:31:36,490
-addressed through question I just want
-to briefly point out with it terribly
+addressed through a question. I just want
+to briefly point out, with a terribly
395
00:31:36,490 --> 00:31:40,440
-bad luck and figure that if you have
-these circuits and sometimes you have a
+bad-looking figure, that if you have
+these circuits, and sometimes you have a
396
00:31:40,440 --> 00:31:43,330
-value that branches out into a circuit
-and is used in multiple parts of the
+value that branches out into a circuit
+and is used in multiple parts of the
397
00:31:43,329 --> 00:31:47,179
-circuit the correct thing to do by
-multivariate chain rule is to actually
+circuit, the correct thing to do by the
+multivariate chain rule is to actually
398
00:31:47,180 --> 00:31:55,110
-add up the contributions at the
-operation so gradients add a background
+add up the contributions at the
+operation. So gradients add when flowing
399
00:31:55,109 --> 00:32:00,009
-in backwards through the circuit if they
-ever flow in in these backward flow
+backwards through the circuit, if they
+ever converge in these backward flows.
400
00:32:00,009 --> 00:32:04,879
-right we're going to go into
-implementation very simple just a couple
+Right, we're going to go into
+implementation. Very simple. Just a couple
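(Editor's note: the three gate intuitions above, add distributes, max routes, multiply switches, fit in a few lines of Python. A sketch under assumed example inputs:)

```python
dz = 2.0           # some gradient arriving from above
x, y = 4.0, -1.0   # the gate's inputs from the forward pass

# add gate: local gradient is 1 on every input, so it distributes dz
add_dx, add_dy = dz, dz

# max gate: dz is routed to whichever input was larger in the forward pass
max_dx, max_dy = (dz, 0.0) if x > y else (0.0, dz)

# multiply gate: each input's gradient is the *other* input times dz
mul_dx, mul_dy = y * dz, x * dz
```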
401
@@ -1988,657 +1986,657 @@
of questions
402
00:32:05,700 --> 00:32:11,620
-thank you for the question the question
-is is there ever like a loop in these
+Thank you for the question. The question
+is, is there ever like a loop in these
403
00:32:11,619 --> 00:32:15,839
-graphs that will never be looks so there
-are never any loops you might think that
+graphs? There will never be loops, so there
+are never any loops. You might think that
404
00:32:15,839 --> 00:32:18,589
-if you use a recurrent neural network
-that there are loops in there but
+if you use a recurrent neural network
+that there are loops in there, but
405
00:32:18,589 --> 00:32:21,658
-there's actually no because what we'll
-do is we'll take a recurrent neural
+there are actually none, because what we'll
+do is we'll take a recurrent neural
406
00:32:21,659 --> 00:32:26,230
-network and will unfold it through time
-steps and this will all become there
+network and we'll unfold it through time
+steps, and this will all become, there
407
00:32:26,230 --> 00:32:31,259
-will never be a loop in the photograph
-copy pasted that small piece or time
+will never be a loop in the unfolded graph, we've
+copy pasted that small piece over time.
408
00:32:31,259 --> 00:32:39,538
-you'll see that more when we actually
-get into it but he's always looked so
+You'll see that more when we actually
+get into it, but these are always loop-free. So
409
00:32:39,538 --> 00:32:42,220
-let's look at the implementation of this
-is actually implemented in practice and
+let's look at the implementation, how this
+is actually implemented in practice, and
410
00:32:42,220 --> 00:32:46,860
-I think will help make this more
-concrete as well so we always have these
+I think it will help make this more
+concrete as well. So we always have these
411
00:32:46,859 --> 00:32:52,038
-graphs graphs these are the best way to
-think about structuring neural networks
+computational graphs; these are the best way to
+think about structuring neural networks,
412
00:32:52,038 --> 00:32:56,929
-and so what we end up with is all these
-gates there were going to seem a bit but
+and so what we end up with is all these
+gates, that we're going to see in a bit, but
413
00:32:56,929 --> 00:33:00,059
-on top of the gates there something that
-needs to maintain connectivity structure
+on top of the gates there's something that
+needs to maintain the connectivity structure
414
00:33:00,058 --> 00:33:03,490
-of the same paragraph what gates are
-connected to each other and so usually
+of this entire graph, what gates are
+connected to each other. And so usually
415
00:33:03,490 --> 00:33:09,710
-that's handled by a graph or object
-usually in that the net object has needs
+that's handled by a Graph or Net object,
+usually, and the Net object has
416
00:33:09,710 --> 00:33:13,679
-two main pieces which was the forward
-and backward peace and this is just you
+two main pieces, which are the forward
+and backward pieces. And this is just
417
00:33:13,679 --> 00:33:19,929
-two coats run but basically roughly the
-idea is that in the forward pass
+pseudocode that won't run, but basically roughly the
+idea is that in the forward pass
418
00:33:19,929 --> 00:33:23,759
-trading overall the gates in the circuit
-that and they're sorted in topological
+we're iterating over all the gates in the circuit,
+and they're sorted in topological
419
00:33:23,759 --> 00:33:27,980
-order what that means is that all the
-inputs must come to every note before
+order. What that means is that all the
+inputs must come to every node before
420
00:33:27,980 --> 00:33:32,099
-the opportunity consumed just ordered
-from left to right and we're just
+its output can be consumed, so they're just ordered
+from left to right, and we're just
421
00:33:32,099 --> 00:33:35,969
-boarding will call ya forward on every
-single gate along the way so we iterate
+calling forward on every
+single gate along the way. So we iterate
422
00:33:35,970 --> 00:33:39,600
-over that graph and just go forward to
-every single piece and this object will
+over that graph and just call forward on
+every single piece, and this object will
423
00:33:39,599 --> 00:33:43,189
-just make sure that happens in the
-proper connectivity pattern and backward
+just make sure that happens in the
+proper connectivity pattern. And in the backward
424
00:33:43,190 --> 00:33:46,620
-pass we're going in the exact reverse
-order and we're calling backward on
+pass we're going in the exact reverse
+order, and we're calling backward on
425
00:33:46,619 --> 00:33:49,709
-every single gate and these gates will
-end up communicating gradients to each
+every single gate, and these gates will
+end up communicating gradients to each
426
00:33:49,710 --> 00:33:53,429
-other and the old get changeup and
-computing the analytic gradients it back
+other, and they all get chained up,
+computing the analytic gradient at the back.
427
00:33:53,429 --> 00:33:57,860
-so really an object is a very thin
-wrapper around all these gates or as we
+So really a Net object is a very thin
+wrapper around all these gates, or, as we
428
00:33:57,859 --> 00:34:01,879
-will see their cold layers layers or
-gates I'm going to use interchangeably
+will see, they're called layers. Layers or
+gates, I'm going to use those interchangeably,
429
00:34:01,880 --> 00:34:05,700
-and they're just very thin wrapper
-surround connectivity structure of these
+and it's just a very thin wrapper
+around the connectivity structure of these
430
00:34:05,700 --> 00:34:09,369
-gates and calling a forward and backward
-function on them and then let's look at
+gates, calling a forward and backward
+function on them. And then let's look at
431
00:34:09,369 --> 00:34:12,950
-a specific example of one of the gates
-and how this might be implemented and
+a specific example of one of the gates
+and how this might be implemented. And
432
00:34:12,949 --> 00:34:16,759
-this is not just a year ago this is
-actually more like correct
+this is not just pseudocode, this is
+actually more like a correct
433
00:34:16,760 --> 00:34:18,730
-implementation something like this might
-run
+implementation, something like this might
+run
434
00:34:18,730 --> 00:34:23,769
-at the end so let us enter and multiply
-gate and how it could be implemented and
+in the end. So let's look at a multiply
+gate and how it could be implemented. A
435
00:34:23,769 --> 00:34:27,690
-multiply gate in this case is just a
-binary multiplies receives two inputs
+multiply gate in this case is just a
+binary multiply. It receives two inputs,
436
00:34:27,690 --> 00:34:33,780
-X&Y it computes their multiplication
-that his ex times why and returns and
+x and y, it computes their multiplication,
+that is x times y, and returns it. And
437
00:34:33,780 --> 00:34:38,950
-all these games must be satisfied the
-API of a forward and backward cool how
+all these gates must satisfy the
+API of a forward and backward call: how
438
00:34:38,949 --> 00:34:42,529
-do you behave in a forward pass and how
-they behave in a backward pass and
+do you behave in a forward pass, and how
+do you behave in a backward pass. And
439
00:34:42,530 --> 00:34:46,019
-repass just computer whatever in a
-backward pass we eventually end up
+in the forward pass we just compute whatever; in the
+backward pass we eventually end up
440
00:34:46,019 --> 00:34:52,639
-learning about what is our gradient on
-the final loss to the old ideas as what
+learning about what is our gradient on
+the final loss, so dL/dz is what
441
00:34:52,639 --> 00:34:55,628
-we learn that's represented in this
-variable these head and right now
+we learn. That's represented in this
+variable dz here, and right now
442
00:34:55,628 --> 00:35:00,639
-everything is scalars so X Y is that our
-numbers here he said is also a number
+everything is scalars, so x, y, z are
+numbers here, and dz is also a number,
443
00:35:00,639 --> 00:35:07,799
-telling the employers and what this gate
-is charged in this backward pass is
+telling the influence on the loss. And what this gate
+is in charge of in this backward pass is
444
00:35:07,800 --> 00:35:11,550
-performing the little piece of general
-so what we have to compute is how do you
+performing the little piece of chain rule.
+So what we have to compute is how do you
445
00:35:11,550 --> 00:35:16,550
-change this gradient these into your
-inputs X&Y compute the ex NDY and we
+chain this gradient dz into your
+inputs x and y, computing the dx and dy, and we
446
00:35:16,550 --> 00:35:19,820
-turned us into backward pass and then
-the competition on draft will make sure
+return those in the backward pass, and then
+the computational graph will make sure
447
00:35:19,820 --> 00:35:23,720
-that these get routed properly to all
-the other bags and if there are any
+that these get routed properly to all
+the other gates, and if there are any
448
00:35:23,719 --> 00:35:27,919
-badges that add up the competition grab
-my dad might add all the ingredients
+edges that add up, the computational graph
+might add all the gradients
449
00:35:27,920 --> 00:35:35,650
-together ok so how would we implement
-the DAX and devices for example what is
+together. Okay, so how would we implement
+the dx and dy? For example, what is
450
00:35:35,650 --> 00:35:42,300
-the X in this case it would be equal to
-the implementation
+the dx in this case? It would be equal to,
+in the implementation,
451
00:35:42,300 --> 00:35:49,460
-why times easy break and a white and
-easy additional point to make here by
+y times dz, and dy is x times
+dz. An additional point to make here, by
452
00:35:49,460 --> 00:35:53,659
-the way that I added some lies in the
-past we have to remember these values of
+the way, is that I added some lines in the forward
+pass: we have to remember these values of
453
00:35:53,659 --> 00:35:57,509
-X&Y because we end up using them in a
-backward pass from assigning them to a
+x and y, because we end up using them in the
+backward pass, so I'm assigning them to a
454
00:35:57,510 --> 00:36:01,000
-sell stop because I need to remember
-what X Y are because I need access to
+self dot, because I need to remember
+what x, y are, because I need access to
455
00:36:01,000 --> 00:36:04,949
-them in my back yard pass in general and
-back-propagation when we build these
+them in my backward pass. In general, in
+back-propagation, when we build these,
456
00:36:04,949 --> 00:36:09,359
-when you actually the forward pass every
-single gate must remember the impetus in
+when you actually do the forward pass, every
+single gate must remember its inputs and
457
00:36:09,360 --> 00:36:13,430
-any kind of intermediate calculations
-performed that it needs to do that needs
+any kind of intermediate calculations
+it has performed that it needs
458
00:36:13,429 --> 00:36:17,069
-access to a backward pass so basically
-we end up running these networks at
+access to in the backward pass. So basically,
+when we end up running these networks at
459
00:36:17,070 --> 00:36:20,050
-runtime just always keep in mind that as
-you're doing this forward pass a huge
+runtime, just always keep in mind that as
+you're doing this forward pass, a huge
460
00:36:20,050 --> 00:36:22,890
-amount of stuff gets cashed in your
-memory and that all has to stick around
+amount of stuff gets cached in your
+memory, and that all has to stick around,
461
00:36:22,889 --> 00:36:25,909
-because during the propagation and I
-need access to some of those variables
+because during back propagation I
+need access to some of those variables.
462
00:36:25,909 --> 00:36:30,779
-and so your memory and the ballooning up
-during a forward pass backward pass it
+And so your memory ends up ballooning up
+during the forward pass; in the backward pass it
463
00:36:30,780 --> 00:36:33,690
-gets all consumed and we need all those
-intermediaries to actually compete the
+gets all consumed, and we need all those
+intermediates to actually compute the
464
00:36:33,690 --> 00:36:45,289
-proper backward class so that you can
-get rid of many of these things and you
+proper backward pass. Yes, so you can
+get rid of many of these things, and you
465
00:36:45,289 --> 00:36:49,710
-don't have to compete in going to cash
-them so you can save on memory for sure
+don't have to compute them or cache
+them, so you can save on memory, for sure,
466
00:36:49,710 --> 00:36:54,110
-but I don't think most implementations
-actually worried about that I don't
+but I don't think most implementations
+actually worry about that. I don't
467
00:36:54,110 --> 00:36:57,280
-think there's a lot of logic that deals
-with that usually end up remembering it
+think there's a lot of logic that deals
+with that; they usually end up remembering it
468
00:36:57,280 --> 00:37:09,370
-anyway I yes I think if you're in an
-embedded device for example and you were
+anyway. Yes, I think if you're on an
+embedded device, for example, and you are
469
00:37:09,369 --> 00:37:11,949
-eerily by the American strains this is
-something that you might take advantage
+really worried about memory constraints, this is
+something that you might take advantage
470
00:37:11,949 --> 00:37:15,539
-of it we know that a neural network only
-has to run and test time then you might
+of. If we know that a neural network only
+has to run at test time, then you might
471
00:37:15,539 --> 00:37:18,750
-want to make sure going to the code to
-make sure nothing gets cashed in case
+want to go into the code to
+make sure nothing gets cached in case
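(Editor's note: a hedged Python sketch of the multiply gate with the forward/backward API being described; the class name and structure are illustrative, not the slide's exact code.)

```python
class MultiplyGate:
    def forward(self, x, y):
        # remember the inputs: we need them in the backward pass,
        # which is exactly the caching/memory cost discussed above
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # dz is the gradient of the final loss on our output z;
        # chain it through the local gradients of multiplication
        dx = self.y * dz
        dy = self.x * dz
        return dx, dy
```

A Net object would then just iterate its gates in topological order calling forward, and in the exact reverse order calling backward, routing (and, where branches merge, adding) these returned gradients.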
472
00:37:18,750 --> 00:37:33,130
-you wanna do a backward pass questions
-yes we remember the local gradients in
+you want to do a backward pass. Questions?
+Yes, if we remembered the local gradients in
473
00:37:33,130 --> 00:37:39,750
-the forward pass then we don't have to
-remember the other intermediates I think
+the forward pass, then we don't have to
+remember the other intermediates? I think
474
00:37:39,750 --> 00:37:45,269
-that might only be the case in such in
-some simple expressions like this 1 I'm
+that might only be the case in, such as in
+some simple expressions like this one. I'm
475
00:37:45,269 --> 00:37:49,170
-not actually sure that's true in general
-but I mean you're in charge of remember
+not actually sure that's true in general,
+but I mean, you're in charge of remembering
476
00:37:49,170 --> 00:37:54,950
-whatever you need to perform the
-backward pass gate by game basis you
+whatever you need to perform the
+backward pass, on a gate by gate basis. You
477
00:37:54,949 --> 00:37:58,509
-don't know if you can remember whatever
-you feel like it has a footprint on
+can remember whatever
+you feel like; it has a footprint on
478
00:37:58,510 --> 00:38:04,420
-someone and you can be clever with that
-guy's example of what it looks like in
+memory, and you can be clever with that.
+Okay, as an example of what it looks like in
479
00:38:04,420 --> 00:38:08,250
-practice we're going to look at specific
-examples and torture tortures a deep
+practice, we're going to look at specific
+examples in Torch. Torch is a deep
480
00:38:08,250 --> 00:38:11,480
-learning framework which we might be
-going to a bit near the end of the class
+learning framework, which we might be
+going into a bit near the end of the class,
481
00:38:11,480 --> 00:38:16,750
-that some of you might end up using for
-your projects going to the github repo
+that some of you might end up using for
+your projects. If you go to the GitHub repo
482
00:38:16,750 --> 00:38:20,320
-for porridge and you look at the
-musically it's just a giant collection
+for Torch and you look at it,
+basically it's just a giant collection
483
00:38:20,320 --> 00:38:24,580
-of these later objects and these are the
-gates gates the same thing so there's
+of these layer objects, and these are the
+gates. Layers, gates, the same thing. So there's
484
00:38:24,579 --> 00:38:27,429
-all these layers that's really what a
-deep learning framework is this just a
+all these layers; that's really what a
+deep learning framework is, it's just a
485
00:38:27,429 --> 00:38:31,559
-whole bunch of layers and a very thin
-competition graph thing that keeps track
+whole bunch of layers and a very thin
+computational graph thing that keeps track
486
00:38:31,559 --> 00:38:36,420
-of all the connectivity and so really
-the image to have in mind at all these
+of all the connectivity. And so really
+the image to have in mind is all these
487
00:38:36,420 --> 00:38:42,639
-things are your leg blocks and then
-we're building up these graphs out of
+things are your Lego blocks, and then
+we're building up these graphs out of
488
00:38:42,639 --> 00:38:44,829
-your league in blocks out of the layers
-you're putting them together in various
+your Lego blocks, out of the layers.
+You're putting them together in various
489
00:38:44,829 --> 00:38:47,549
-ways depending on what you want to
-achieve and the end up building all
+ways depending on what you want to
+achieve, and you end up building all
490
00:38:47,550 --> 00:38:51,519
-kinds of stuff so that's how you work
-with their own networks so every library
+kinds of stuff. So that's how you work
+with neural networks. So every library is
491
00:38:51,519 --> 00:38:54,809
-just a whole set of layers that you
-might want to compute and every layer is
+just a whole set of layers that you
+might want to compute, and every layer is
492
00:38:54,809 --> 00:38:58,840
-implementing a smoky function peace and
-that function keys knows how to move
+implementing a small function piece, and
+that function piece knows how to do a
493
00:38:58,840 --> 00:39:02,670
-forward and knows how to do a backward
-so just above a specific example let's
+forward and knows how to do a backward.
+So just as a specific example, let's
494
00:39:02,670 --> 00:39:10,150
-look at the mall constant layer and
-torch the mall constant layer or chrome
+look at the MulConstant layer in
+Torch. The MulConstant layer performs
495
00:39:10,150 --> 00:39:16,039
-just a scaling by scalar so it takes
-some tenser X so this is not a scalar
+just a scaling by a scalar, so it takes
+some tensor X. So this is not a scalar,
496
00:39:16,039 --> 00:39:19,300
-but it's actually like an array of
-numbers basically because when we
+but it's actually like an array of
+numbers, basically, because when we
497
00:39:19,300 --> 00:39:22,410
-actually work with these we do a lot of
-extras operation so we receive a tensor
+actually work with these we do a lot of
+vectorized operations. So we receive a tensor,
498
00:39:22,409 --> 00:39:28,289
-which is really just and dimensional
-array and was killed by constant and you
+which is really just an n-dimensional
+array, and we scale it by a constant. And you
499
00:39:28,289 --> 00:39:31,980
-can see that this actually just a sporty
-lines there some initialization stuff
+can see that this is actually just about forty
+lines. There's some initialization stuff,
500
00:39:31,980 --> 00:39:35,940
-this is lula by the way if this is
-looking some foreign to you but there's
+this is Lua, by the way, if this is
+looking somewhat foreign to you, but there's
501
00:39:35,940 --> 00:39:40,510
-initialisation where you actually
-passing that a that you want to use as
+initialization, where you actually
+pass in that a that you want to use as
502
00:39:40,510 --> 00:39:44,630
-you are scaling and then during the
-forward pass which they call update out
+your scaling, and then during the
+forward pass, which they call updateOutput,
503
00:39:44,630 --> 00:39:49,170
-but in a forward pass all they do is
-they just multiply X and returned it and
+in the forward pass all they do is
+they just multiply the x and return it. And
504
00:39:49,170 --> 00:39:53,760
-into backward pass which they call
-update grad input there's any statement
+in the backward pass, which they call
+updateGradInput, there's an if statement
505
00:39:53,760 --> 00:39:56,510
-here but really when you look at these
-three live their most important you can
+here, but really when you look at these
+three lines here, the most important ones, you can
506
00:39:56,510 --> 00:39:59,690
-see that all is doing its copying into a
-variable grad
+see that all it's doing is copying into a
+variable gradInput,
507
00:39:59,690 --> 00:40:03,539
-would need to compute that's your grade
-in that you're passing up the great
+which we need to compute. That's your gradient
+that you're passing up. The gradOutput
508
00:40:03,539 --> 00:40:08,309
-impetus you're copping out but ran up to
-this your your gradient on final loss
+you're copying out, gradOutput is
+your, your gradient on the final loss.
509
00:40:08,309 --> 00:40:11,989
-you're copping that over into grad input
-and you're multiplying by the by the
+You're copying that over into gradInput
+and you're multiplying by the, by the
510
00:40:11,989 --> 00:40:15,629
-scalar which is what you should be doing
-because you are your local ratings just
+scalar, which is what you should be doing,
+because your, your local gradient is just
511
00:40:15,630 --> 00:40:19,980
-a and C you take the out but you have to
-take the gradient from above and just
+a, and so you take, you have to
+take the gradient from above and just
512
00:40:19,980 --> 00:40:23,150
-killed by AP which is what these three
-lines are doing and that's your grad
+scale it by a, which is what these three
+lines are doing, and that's your gradInput.
513
00:40:23,150 --> 00:40:27,849
-important that's what you return so
-that's one of the hundreds of layers
+That's what you return. So
+that's one of the hundreds of layers
514
00:40:27,849 --> 00:40:32,110
-that are and torture you can also look
-at examples in cafe get there is also a
+that are in Torch. You can also look
+at examples in Caffe. Caffe is also a
515
00:40:32,110 --> 00:40:36,140
-deep learning framework specifically for
-images might be working with again if
+deep learning framework, specifically for
+images, you might be working with. Again, if
516
00:40:36,139 --> 00:40:39,690
-you go into the layers director just see
-all these layers all of them implement
+you go into the layers directory you just see
+all these layers, and all of them implement
517
00:40:39,690 --> 00:40:43,490
-the forward backward API so just to give
-you an example there's a single layer
+the forward backward API. So just to give
+you an example, there's a sigmoid layer.
518
00:40:43,489 --> 00:40:51,269
-layer takes a blob so comfy likes to
-call these tensors blogs so it takes a
+The sigmoid layer takes a blob, so Caffe likes to
+call these tensors blobs, so it takes a
519
00:40:51,269 --> 00:40:54,219
-blob is just an international array of
-numbers and it passes
+blob, which is just an n-dimensional array of
+numbers, and it passes it
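(Editor's note: an illustrative Python analogue of what Torch's MulConstant layer is doing, under the assumption described above; the real layer is Lua, and the method names here only mirror Torch's updateOutput/updateGradInput.)

```python
import numpy as np

class MulConstant:
    def __init__(self, a):
        self.a = a                  # the scaling constant passed at initialization

    def update_output(self, x):
        # forward: just scale the tensor
        return self.a * x

    def update_grad_input(self, grad_output):
        # backward: the local gradient of a*x is a, so chain rule is
        # just the gradient from above scaled by a
        return self.a * grad_output
```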
520
00:40:54,219 --> 00:40:57,949
-element wise to a single function and so
-its computing in a forward pass a
+element-wise through a sigmoid function. And so
+it's computing, in the forward pass, a
521
00:40:57,949 --> 00:41:04,379
-sigmoid which you can see their use my
-printer so they're calling it a lot of
+sigmoid, which you can see there being
+used. So they're calling it, and a lot of
522
00:41:04,380 --> 00:41:07,840
-this stuff is just boilerplate getting
-pointers to all the data and then we
+this stuff is just boilerplate, getting
+pointers to all the data, and then we
523
00:41:07,840 --> 00:41:11,730
-have a bottom blob and we're calling a
-sigmoid function on the bottom and
+have the bottom blob, and we're calling a
+sigmoid function on the bottom, and
524
00:41:11,730 --> 00:41:14,829
-that's just a sigmoid function right
-there that's why we compute in a
+that's just the sigmoid function right
+there. That's what we compute in the
525
00:41:14,829 --> 00:41:18,719
-backward pass some boilerplate stuff but
-really what's important is we need to
+forward pass. In the backward pass, some boilerplate stuff, but
+really what's important is we need to
526
00:41:18,719 --> 00:41:23,369
-compute the gradient times the chain
-rule here so that's what you see in this
+compute the gradient times, the chain
+rule here. So that's what you see in this
527
00:41:23,369 --> 00:41:26,150
-line that's where the magic happens when
-we take the
+line, that's where the magic happens, where
+we take the
528
00:41:26,150 --> 00:41:32,048
-so they call the greetings dips and you
-compute the bottom diff is the top if
+gradients... so they call the gradients diffs, and you
+compute the bottom diff as the top diff
529
00:41:32,048 --> 00:41:36,869
-times this piece which is really the
-that's the local gradient so this is
+times this piece, which is really the,
+that's the local gradient. So this is
530
00:41:36,869 --> 00:41:41,960
-chain rule happening right here through
-that multiplication so and so every
+chain rule happening right here through
+that multiplication. And so every
531
00:41:41,960 --> 00:41:45,179
-single layer just a forward backward API
-and then you have a competition growth
+single layer is just a forward backward API,
+and then you have a computational graph
532
00:41:45,179 --> 00:41:52,288
-on top or another object that troubled
-connectivity and questions about some of
+on top, or a Net object, that tracks the
+connectivity. Any questions about some of
533
@@ -2647,177 +2645,177 @@
these implementations and so on
534
00:42:00,849 --> 00:42:15,559
-because when you want to do right away
-to a backward and I have a gradient and
+Because once I do a backward
+pass I have a gradient, and
535
00:42:15,559 --> 00:42:19,369
-I can do an update right up my alley
-gradient and I change my way it's a tiny
+I can do an update right away: I take my
+gradient and I change my weights a tiny
536
00:42:19,369 --> 00:42:24,960
-bit and the direction the negative
-direction of your writing so overcome
+bit in the direction, the negative
+direction, of your gradient. So we compute
537
00:42:24,960 --> 00:42:28,858
-the loss backward computer gradient and
-then the update uses the gradient to
+the loss, backward computes the gradient, and
+then the update uses the gradient to
538
00:42:28,858 --> 00:42:33,278
-increment you are a bit so that's what
-keeps happening Lupin III neural network
+increment your weights a bit. So that's what
+keeps happening in a loop in a neural network:
539
00:42:33,278 --> 00:42:36,318
-that's all that's happening forward
-backward update forward backward state
+that's all that's happening. Forward,
+backward, update, forward, backward, update.
540
00:42:36,318 --> 00:42:51,808
-will see that you're asking about the
-for loop therefore Lapeer I do notice ok
+We'll see that. You're asking about the
+for loop, this for loop here? Ah, I do notice it, okay.
541
00:42:51,809 --> 00:42:57,160
-yeah they have a for loop yes you'd like
-us to be better eyes and that actually
+Yeah, they have a for loop. Yes, you'd like
+this to be vectorized. And actually, I'm not
542
00:42:57,159 --> 00:43:03,679
-sure this is C++ so I think they just go
-for it
+sure, this is C++, so I think they just go
+for it.
543
00:43:03,679 --> 00:43:10,899
-yeah so this is a CPU implementation by
-the way I should mention that this is a
+Yeah, so this is a CPU implementation, by
+the way, I should mention, this is a
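(Editor's note: the Caffe backward line being described, bottom_diff = top_diff * s * (1 - s), can be sketched in numpy rather than C++; the function names here are mine, not Caffe's.)

```python
import numpy as np

def sigmoid_forward(bottom):
    # forward: element-wise sigmoid over the blob (an n-dimensional array)
    return 1.0 / (1.0 + np.exp(-bottom))

def sigmoid_backward(top_diff, top):
    # backward: bottom_diff = top_diff * s * (1 - s), the chain rule
    # through the sigmoid's local gradient, applied element-wise;
    # 'top' is the sigmoid output cached during the forward pass
    return top_diff * top * (1.0 - top)
```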
Yeah, so this is a CPU implementation. By the way, I should mention that this is a CPU implementation of the sigmoid layer; there's a second file that implements the sigmoid layer on the GPU, and that's CUDA code, so that's a separate file. It would be something like sigmoid_layer.cu; I'm not showing you that. Any other questions? OK, great.

So one point I'd like to make is: we will of course be working with vectors, so these things flowing along our graphs are not just scalars, they're going to be entire vectors. Nothing changes; the only thing that is different now, since x, y and z are vectors, is that the local gradient, which before used to be just a scalar, is now in general, for general expressions, a full Jacobian matrix. So it's a two-dimensional matrix that basically tells you what the influence of every single element of x is on every single element of z; that's what the Jacobian matrix stores. The gradient is the same expression as before, but now dL/dx is a vector and dz/dx is an entire Jacobian matrix, so you end up with an entire matrix-vector multiply to actually chain the gradient.

[Question from the audience.] I'll come back to this point in a bit: you never actually end up forming the Jacobian, and you'll never actually do this matrix multiply, most of the time; this is just a general way of looking at an arbitrary function, and I need to keep track of this. And I think those two terms are actually out of order, because dz/dx, the Jacobian, should be on the left side; so that's a mistake in the slide: it should be a matrix-vector multiply with the Jacobian on the left. So let me show you why you don't actually need to form those Jacobians. Let's work with a specific example that is relatively common in neural networks.
Suppose we have this nonlinearity, max(0, x). So really what this operation is doing is receiving a vector, say of 4096 numbers, which is a typical size you might work with, and computing an elementwise thresholding at 0: anything that is lower than 0 gets clamped to 0, and that's the function you're computing, so the output vector has the same dimension. The question I'd like to ask is: what is the size of the Jacobian matrix for this layer? It's 4096 by 4096; in principle, every single number in the input could have influenced every single number in the output. But that's not the case, necessarily, right? So the second question: this is a huge matrix, sixteen million numbers, but why would you never form it? What does the Jacobian actually look like?

[Answer from the audience: it would still be a matrix, because every one of these 4096 inputs could have influenced every output.] So the Jacobian is formally still a giant 4096-by-4096 matrix, but it has special structure, right? And what is that special structure? [Answer from the audience.] It's a huge 4096-by-4096 matrix, but there are only elements on the diagonal, because this is an elementwise operation. Moreover, they're not all ones: whichever element was less than zero got clamped to 0, so some of those ones are actually zeros, namely wherever the element had a lower-than-zero value during the forward pass. So the Jacobian would be almost an identity matrix, except some of the diagonal entries are actually zero. So you would never want to form the full Jacobian, because that's silly, and you never actually want to carry out this operation as a matrix-vector multiply, because there's special structure that we want to take advantage of. In particular, the backward pass for this operation is very, very easy: you just look at all the dimensions where your input was less than zero, and you kill the gradient there. You take the gradient vector at the output, and whichever inputs were less than zero, you set those entries of the gradient to zero, and pass that on.
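As a concrete sketch of that backward pass (the 4096 comes from the example above; the incoming gradient is random data purely for illustration):

```python
import numpy as np

x = np.random.randn(4096)     # input to the max(0, x) gate
y = np.maximum(0, x)          # forward pass: elementwise threshold

dy = np.random.randn(4096)    # gradient arriving from above (illustrative)
dx = dy.copy()
dx[x < 0] = 0                 # kill the gradient where the input was negative
# this is equivalent to multiplying by the (diagonal) Jacobian,
# without ever forming it
```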
[Question from the audience.] So, very simple operations in the end, in terms of implementation. If you want to, you can do that, but that's internal to the gate, and you can use it to do backprop; what flows back to the other gates is all they care about: the gradient vector.

[Question from the audience.] We'll never actually run into that case, because we almost always have a single output, a scalar value, at the end, because we're interested in loss functions. We just have a single number at the end that we're interested in the gradient with respect to. If we had multiple outputs, then we would have to keep track of all of those as well, in parallel, when we do the backpropagation; but we just have a scalar-valued loss function, so we don't have to worry about that.

I also want to make the point that 4096 actually isn't crazy. Usually we use minibatches, say a minibatch of 100 elements going through at the same time, and then you end up with a hundred 4096-dimensional vectors all coming in parallel; all the examples in the minibatch are processed independently of each other, in parallel, and so the Jacobian could really end up being 400 million entries: huge. So you never form it, basically; you take care to actually exploit the sparsity structure of the Jacobian, and you hand-code the operations. You don't write the fully generalized chain rule inside any gate implementation.

OK, so I'd like to point out that in your assignment you'll be writing SVM and softmax gradients and so on, and I just wanted to give you a hint on the design, on how you should actually approach this problem.
What you should do is just think about it as backpropagation, even if you're doing it for a classification optimization. Roughly, your structure should look something like this: stage your computation into units whose local gradients you know, and then do backprop when you actually compute these gradients in your assignment. At the top, your code will look something like this; we don't have any graph structure, because you're doing everything inline, so there's nothing crazy that you have to do. You will do that in the second assignment: you'll actually build a graph object and implement your layers. But in the first assignment you're just doing it inline, just straight up. So: compute your scores based on W and X; compute these margins, which are max(0, ...) of the score differences; compute the loss; and then do backprop. In particular, I would really advise you to have this intermediate scores variable, a matrix you create, and then compute the gradient on scores before you compute the gradient on your weights: chain rule here. You might be tempted to just derive the gradient on W in one go, "dW equals ...", and then implement that single expression, and that's an unhealthy way of approaching the problem. So stage your computation and do backprop through this scores variable, and that will help you out.
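A minimal numpy sketch of this staged approach for a multiclass SVM loss; the shapes, names and the margin of 1.0 are illustrative, not the assignment's actual code:

```python
import numpy as np

N, D, C = 5, 4, 3                 # examples, features, classes (illustrative)
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)  # integer labels
W = 0.01 * np.random.randn(D, C)

# staged forward pass
scores = X.dot(W)                                # stage 1: class scores [N x C]
correct = scores[np.arange(N), y][:, None]       # score of the true class
margins = np.maximum(0, scores - correct + 1.0)  # stage 2: hinge margins
margins[np.arange(N), y] = 0
loss = margins.sum() / N                         # stage 3: the data loss

# staged backward pass: gradient on scores first, then on the weights
dmargins = (margins > 0).astype(float) / N
dscores = dmargins.copy()
dscores[np.arange(N), y] -= dmargins.sum(axis=1)  # each positive margin pulls -1 on the true class
dW = X.T.dot(dscores)                             # chain rule back into W
```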
[Question from the audience.] So: these Jacobians are hopelessly large, so we end up with these computational graph structures and intermediate nodes, with a forward/backward API both for the nodes and also for the graph structure; and the graph structure is usually a very thin wrapper around all these layers, handling the communication between them. That communication is always vectors being passed around. In practice, when we write these implementations, what we're passing around are n-dimensional tensors; really, what that means is just an n-dimensional array. Those are what go between the gates, and then internally, every single gate knows what to do in the forward and the backward pass.

OK, so at this point I'm going to end with backpropagation and go into neural networks. Any questions before we move on from backprop?

[Question from the audience.] The challenging part of the assignment, almost, is how you make sure that you do all of this sufficiently nicely with vectorized operations in numpy; that's going to be something you'll have to practice. [Follow-up question.] And what you want them to be... I don't think you'd want to do that. Yeah, I'm not sure; maybe that works, but it's up to you to design it and to backprop through it.

So that's where we're going: neural networks. This is exactly what they look like; this is what happens when you search for "neural networks" on Google Images; I think this is the first result, or something like that. So let's look at neural networks. Before we dive in, I'd actually like to do it first without all the brain stuff: forget that they're neural, forget that they have any relation whatsoever to the brain. They don't; forget it, if you thought that they did. Let's just look at score functions. Before, f = Wx is what we've been working with so far; but now, as I said, we're going to start to make that f more complex. If you want to use a neural network, then you're going to change that equation to s = W2 max(0, W1 x). This is a two-layer neural network, and that's what it looks like: just a more complex mathematical expression in x.
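As a sketch, with CIFAR-10-like sizes (the 0.01 initialization scale is an arbitrary choice here):

```python
import numpy as np

x = np.random.randn(3072)               # one image, stretched into a column of pixels
W1 = 0.01 * np.random.randn(100, 3072)  # first layer; hidden size 100 is a hyperparameter
W2 = 0.01 * np.random.randn(10, 100)    # second layer, mapping to 10 class scores

h = np.maximum(0, W1.dot(x))            # 100 intermediate numbers, thresholded at 0
s = W2.dot(h)                           # the 10 class scores
```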
So what's happening here: you receive your input x, and you multiply it by a matrix, just like we did before. What comes next is a nonlinearity, or activation function; I'm going to go into several choices that you might make for these, and in this case I'm using thresholding at 0 as the activation function. So basically we do a matrix multiply, we threshold everything below 0 to 0, and then we do one more matrix multiply, and that gives us our scores. If I were to draw this, say in the case of CIFAR-10: we have 3072 numbers going in, the pixel values, and before, we went through one single matrix multiply straight to the scores; we went right away to 10 numbers. But now we get to go through this intermediate representation, hidden states; we'll call them hidden layers. Say a hidden layer of 100 numbers, or whatever you want the size of your network to be; that 100 is a hyperparameter. So we go through this intermediate representation: a matrix multiply gives us 100 numbers, we threshold at zero, and then one more matrix multiply gives us the scores. And since we have more numbers, we have more wiggle room to do more interesting things.

One particular example of something interesting you might think a network could be doing: going back to the example of interpreting linear classifiers on CIFAR-10, we saw that the car class has this red-car template that tries to merge all the modes, different cars facing in different directions. In that case, one single linear classifier had to go across all of those modes, and we couldn't deal with, for example, cars of different colors; that wasn't very natural to do. But now we have 100 numbers in this intermediate layer, so you might imagine, for example, that one of those numbers could be picking up just on a red car facing forward; all it has to find is whether there is a red car facing forward. Another one could be a red car facing slightly to the left, another a red car facing slightly to the right, and those elements of h would only become positive if they find that thing in the image, and otherwise they stay at zero.
Another h might look for green cars, or yellow cars, or whatever else, in different orientations. So now we can have a template for each of these different modes, and these neurons turn on or off if they find the thing they're looking for, some specific type of car. Then this W2 matrix can sum across all those little car templates; say we have some twenty car templates of what cars could look like. To complete the car classifier's score there's an additional matrix multiply, so we have a chance to do a weighted sum over them: if any one of them turned on, then through my weighted sum, with presumably positive weights, I would be adding it up and getting a higher score. And so now I can have this multimodal car classifier, through this additional hidden layer in between. That's a hand-wavy reason for why these networks could do something more interesting.

[Question from the audience.] That was a question for extra points in the assignment: do something fun or extra, whatever you think is an interesting experiment, and we'll give you some bonus points. That's a good candidate for something you might want to investigate, whether it works or not.

[Question from the audience about how the templates get] allocated over the different modes of the dataset. I don't have a good answer for that. Since we're going to train this fully with back-propagation, I think it's naive to expect that there will be an exact template for, say, a red car facing left or a red car facing forward. You probably won't find that; you'll find these kinds of mixes and weird intermediates and so on. The network will find whatever way of carving up your data with its boundaries is optimal, and the weights will just adjust to make the loss come out right. So it's really hard to say; it all becomes tangled up, I think, is the right way to put it.
[Question from the audience.] So that's the size of the hidden layer, and it's a hyperparameter; we get to choose it, and I chose 100. Usually, and we'll go into this a lot, you want them to be as big as possible, as big as fits on your computer, and so on; so more is better. I'll get to that.

[Question:] you're asking, do we always take max(0, x)? We don't; I get to this, like, five slides from now, as we go into neural networks. I guess maybe I should just go ahead and take questions near the end.

If you wanted this to be a three-layer neural network, by the way, there's a very simple way in which we just extend this: we keep continuing the same pattern, s = W3 max(0, W2 max(0, W1 x)), with all these intermediate hidden nodes, and we can keep making our network deeper and deeper; you can compute more interesting functions, because you're giving yourself more time to compute something interesting, in a way.

One other slide I want to flash up: training a two-layer neural network is actually quite simple when it comes down to it. This is borrowed from a blog post, and it's roughly eleven lines of Python to implement a two-layer neural network doing binary classification. You have a data matrix X, here with three-dimensional inputs, and you have binary labels y; then syn0 and syn1 are your weight matrices, weight one and weight two; I think they're called synapses there, but I'm not sure. And then this is the optimization loop here.
What you're seeing here, I should use my pointer more, is that we're computing the first-layer activations, but using a sigmoid nonlinearity, not max(0, x); we'll go into what these nonlinearities might be in a bit. So it computes the first layer and then the second layer, and then it computes the backward pass right away: this l2 delta and l1 delta, the gradient on l2 and the gradient on l1; and this is a weight update here, so he's doing an update at the same time as the final piece of backprop, where he formulates the gradient on the W and right away adds it into the weights. So really, eleven lines suffice to train a neural network for classification.

The reason this loss may look slightly different from what you've seen so far is that it's a logistic regression loss. You saw a generalization of it, the softmax classifier, to multiple classes; this is basically a logistic loss being optimized here, and you can go through it in more detail by yourself. The logistic regression loss looks slightly different, and that's what's inside there. But otherwise, yes: this is not too crazy of a computation, and very few lines of code suffice to actually train these networks. Everything else is plumbing: how do you make it efficient, the cross-validation pipeline that you need to have, all this stuff that goes on top; that's what actually gives you these large code bases, but the kernel of it is quite simple. We compute these layers: forward pass, backward pass, update, and repeat. [Question about the random part:] the "random" is creating your initial random weights; you need to start somewhere, so you generate a random W.

Now, I want to mention that you'll also be training a two-layer neural network in this class, so you'll be doing something very similar to this, but you're not using logistic regression, and you might have different activation functions.
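The snippet itself is not legible in the captions, so here is a reconstruction in the same spirit, assuming the syn0/syn1 naming mentioned above; the toy data and the iteration count are my choices:

```python
import numpy as np

np.random.seed(1)
X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])  # four 3-dimensional inputs
y = np.array([[0],[1],[1],[0]])                  # binary labels
syn0 = 2 * np.random.random((3, 4)) - 1          # first-layer weights
syn1 = 2 * np.random.random((4, 1)) - 1          # second-layer weights
for _ in range(10000):
    l1 = 1 / (1 + np.exp(-X.dot(syn0)))          # first layer (sigmoid)
    l2 = 1 / (1 + np.exp(-l1.dot(syn1)))         # second layer (sigmoid)
    l2_delta = (y - l2) * (l2 * (1 - l2))        # backprop: gradient at l2
    l1_delta = l2_delta.dot(syn1.T) * (l1 * (1 - l1))  # chain rule into l1
    syn1 += l1.T.dot(l2_delta)                   # update folded into the backward pass
    syn0 += X.T.dot(l1_delta)
```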
But again, just my advice to you when you implement this: stage your computation into these intermediate results, and then do proper backpropagation into every intermediate result. So you receive these weight matrices and also the biases; I don't believe you had biases in your softmax, but here you'll have biases. Take your weight matrices and the biases, compute the first layer, compute the scores, compute your loss, and then do the backward pass: backprop into the scores, then backprop into the weights of the second layer and into this h1 vector, and then through h1, backprop into the first weight matrix and its biases. Do proper backpropagation here; otherwise, if you try to write down right away what dW1 is, the gradient on W1, as a single expression, it will be way too large, and headaches. So do it through a series of staged steps; that's just a hint.
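A sketch of that staging for a two-layer ReLU network with biases; the shapes and the toy squared-error loss are stand-ins, chosen only to make the chain of intermediate gradients concrete:

```python
import numpy as np

N, D, H, C = 5, 10, 8, 3                      # illustrative sizes
X = np.random.randn(N, D)
W1, b1 = 0.01 * np.random.randn(D, H), np.zeros(H)
W2, b2 = 0.01 * np.random.randn(H, C), np.zeros(C)
targets = np.random.randn(N, C)               # toy targets for a toy loss

# forward pass, staged into intermediates
h1 = np.maximum(0, X.dot(W1) + b1)            # first layer
scores = h1.dot(W2) + b2                      # second layer
loss = 0.5 * np.sum((scores - targets) ** 2)  # toy squared-error loss

# backward pass: backprop into every intermediate, in reverse order
dscores = scores - targets                    # gradient of the toy loss on scores
dW2 = h1.T.dot(dscores)                       # into the second-layer weights
db2 = dscores.sum(axis=0)                     # and biases
dh1 = dscores.dot(W2.T)                       # into the h1 vector
dh1[h1 <= 0] = 0                              # through the ReLU
dW1 = X.T.dot(dh1)                            # into the first-layer weights
db1 = dh1.sum(axis=0)                         # and biases
```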
OK, now, that was the presentation of neural networks without all the brain stuff, and it looks fairly simple. So now we're going to make it slightly more insane by folding in all kinds of motivations, mostly historical, about how this came about and how it's related to the brain at all. So we have neural networks, and we have neurons inside these neural networks. This is what a neuron looks like, just what happens when you do an image search for "neuron"; there you go. Now, your actual biological neurons don't look like this; they're more like that.

Just very briefly, to give you an idea of where this is all coming from: you have a cell body, or soma as people like to call it, and it's got all these dendrites that are connected to other neurons; there's a cluster of other neurons whose cell bodies sit over here. Dendrites are really the appendages that listen to them: these are the inputs to a neuron. And then there's a single axon that comes out of the neuron, which carries the output of the computation that the neuron performs. So usually the neuron receives inputs, and if many of them come in, this cell, this neuron, can choose to spike: it sends an activation potential down the axon, and the axon branches out to connect to the dendrites of other neurons that are downstream. So there are other neurons here, and their dendrites are connected to the axons of these guys: basically, neurons connected through these synapses in between, with dendrites feeding into a particular neuron and the axon carrying that neuron's output.

So you can come up with a very crude model of a neuron, and it will look something like this. We have the cell body here, and imagine an axon coming from a different neuron somewhere in the network; this neuron is connected to that one through a synapse, and every one of these synapses has a weight associated with it, of how much this neuron likes that neuron, basically. The axon carries this x; it interacts with the synapse, and in this crude model they multiply, so you get w0 times x0 flowing to the soma. That happens for many inputs, so you have lots of w times x flowing in, and the cell body just sums them, offset by a bias; then it passes the result through an activation function to actually compute the output on the axon. Now, in biological models, historically people liked to use the sigmoid nonlinearity, and the reason for that is that you get a number between 0 and 1, and you can interpret that as the rate at which this neuron is firing for that particular input.
So it's a rate between zero and one that comes out of the activation function: if this neuron sees something it likes in the neurons it's connected to, it will start to spike a lot, and the rate is described by f of the input. OK, so that's the crude model of a neuron. If I wanted to implement it, it would look something like this: a neuron class with a forward-pass function that receives some inputs (this is a vector), performs the cell-body sum (just a weighted sum plus a bias), puts that sum through a sigmoid to get the firing rate, and returns the firing rate; and then that output can plug into other neurons downstream.
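Roughly, the implementation being described; a sketch with made-up names:

```python
import numpy as np

class Neuron(object):
    def __init__(self, weights, bias):
        self.weights = weights    # one weight per input synapse
        self.bias = bias
    def forward(self, inputs):
        cell_body_sum = np.sum(inputs * self.weights) + self.bias
        firing_rate = 1.0 / (1.0 + np.exp(-cell_body_sum))  # sigmoid
        return firing_rate

n = Neuron(weights=np.array([0.5, -0.3]), bias=0.1)
print(n.forward(np.array([1.0, 2.0])))   # a rate between 0 and 1
```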
You can actually see that this looks very similar to a linear classifier: we're forming a weighted linear sum here and passing it through a nonlinearity. So every single neuron in this model is really like a small linear classifier, but these classifiers plug into each other, and they can work together to do interesting things.

Now, one point to make about these neurons: they are not like biological neurons. Biological neurons are super complex, so if you go around saying that neural networks work like the brain, people will start frowning at you, and that's for good reason: real neurons are complex dynamical systems; there are many different types of neurons, and they function differently; the dendrites can perform lots of interesting computation (a good review article is "Dendritic Computation", which I really enjoyed); the synapses are complex dynamical systems too, not just a single weight; and we're not really sure the brain uses a rate code to communicate. So this is a very crude mathematical model; don't push the analogy too much. But it's good for media articles, I suppose, and that's why this keeps coming up again and again, as we explain that this works like a brain; I'm not going to go too deep into it.

To go back to a question that was asked before: there's an entire set of nonlinearities that we can choose from. Historically, the sigmoid has been used quite a bit, and we're going to go into much more detail about what these nonlinearities are, what their tradeoffs are, and why you might want to use one or the other; for now I just want to flash up that there are many to choose from. Historically people also used tanh; as of 2012, ReLU really became quite popular. It makes your networks train quite a bit faster, so right now, if you want a default choice for a nonlinearity, ReLU is the current default recommendation. Then there are a few more activation functions proposed over the last few years: Maxout is interesting, and, very recently, ELU. You can come up with different activation functions and argue why they might work better or not, and this is an active area of research: trying to come up with activation functions that perform better, that have better properties in one way or another. We'll go into this in much more detail soon in class.
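For concreteness, the nonlinearities mentioned so far, written as elementwise numpy functions (a sketch; the tradeoffs between them are the subject of the later discussion):

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))  # squashes to (0, 1); the historical choice
def tanh(x):    return np.tanh(x)                # squashes to (-1, 1)
def relu(x):    return np.maximum(0, x)          # threshold at zero; the current default
```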
For now: we have these neurons, we have a choice of activation function, and then we arrange these neurons into neural networks, right? We just connect them together so they can talk to each other. So here is an example of a two-layer neural network. When you want to count the number of layers in a neural net, you count the number of layers that have weights; the input layer does not count as a layer, because the neurons there are just single values, they don't actually do any computation. So we have two layers here that have weights to learn, and we call these layers fully-connected layers.

I've shown you that a single neuron computes this little weighted sum and passes it through a nonlinearity. The reason we arrange neurons into layers in a neural network is that arranging them into layers allows us to perform the computation much more efficiently. Instead of having an amorphous blob of neurons, every one of which has to be computed independently, having them in layers allows us to use vectorized operations, so we can compute an entire set of neurons, a single hidden layer, as just a single matrix multiply. That's why we arrange them in these layers, where neurons in a single layer can be evaluated completely in parallel, and they all compute the same kind of thing; it's a computational trick. This is a three-layer neural net, and this is how you would compute it: just a bunch of matrix multiplies interleaved with activation functions.
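A sketch of that computation for a three-layer network; the layer sizes here are made up, and sigmoid stands in for whatever activation function you choose:

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))           # activation function (sigmoid here)
x = np.random.randn(3, 1)                        # a random input vector [3x1]
W1, b1 = np.random.randn(4, 3), np.zeros((4, 1))
W2, b2 = np.random.randn(4, 4), np.zeros((4, 1))
W3, b3 = np.random.randn(1, 4), np.zeros((1, 1))

h1 = f(W1.dot(x) + b1)   # first hidden layer: one matrix multiply evaluates all 4 neurons
h2 = f(W2.dot(h1) + b2)  # second hidden layer
out = W3.dot(h2) + b3    # output layer (no activation on the final scores)
```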
Now I'd like to show you a demo of how these neural networks work. I'll point you to it in a bit, but basically this is an example of a two-layer neural network on a binary classification task: two classes, red and green. We have these points in two dimensions, and I'm drawing the decision boundaries found by the neural network. What you can see is that when I train a neural network on this data, the more hidden neurons I have in my hidden layer, the more wiggle the network has, right? The more it can compute crazy functions.

Let me also show you regularization strength. This is the regularization: how much you penalize large W. You can see that when you insist that your W be very small, you end up with very smooth functions, so they don't have as much variance: these neural networks don't have as much wiggle they can give you. And as you decrease the regularization, the networks can do more and more complex tasks; they can get even these squeezed-off little points covered in the training data. So let me show you what this looks like during training.

There's some stuff to explain here, but first: you can play with this yourself, because it's all in JavaScript. All right, so what we're doing here is: we have six neurons, and this is a binary classification dataset, the circle data, with a little cluster of green dots separated by red dots, and we're training a neural network to classify it. If I restart the neural network, it starts off with a random W, and then it converges the decision boundary to actually classify the data. What's shown on the right, which is the cool part, is one interpretation of the neural network: I'm taking this grid here and showing how the space gets warped by the neural network. So you can interpret what the network is doing as using its hidden layer to transform your input data in such a way that the second layer can come in with a linear classifier and classify your data. Here you see that the network arranges your space, warps it, such that the second layer, which is really a linear classifier on top of the first layer, can put a plane through it. It's warping the space so that you can put a plane through it and separate out the points. Let's look at this again: you can really see how the space gets warped so that you can linearly classify the data. This is something people sometimes also refer to as a kernel trick: changing your data representation to a space where it is linearly separable.

OK, now here's a question. Right now we have six neurons
here in the intermediate layer, and that allows us to separate out these things. You can actually see those six neurons, roughly: you can see these lines here; they're kind of like the functions of individual neurons. So here's a question for you: what is the minimum number of neurons for which this dataset is separable with a neural network, if I want the network to classify it correctly?

[Answers from the audience: three? four?] So, intuitively, the way this works: with four, there is one neuron here that went from this side to that side, another from this side to that side, and so on; these neurons are cutting up the plane, and then there's an additional layer on top that does a weighted sum. In fact, the lowest number that would work here is three. With three neurons: one plane, a second plane, a third plane; three linear functions with a nonlinearity, and with three lines you can carve out the space so that the second layer can just combine them. When I change the number to two, this will certainly break, because two lines are not enough... well, I suppose it found something pretty decent here; with two, basically, it finds the optimal way of using just those two lines: they kind of create this tunnel, and that's the best it can do.

[Question from the audience.] I think if I were using ReLU, these would be much sharper; I think you'd see sharp boundaries. Yeah, let's do it, because in
some of these parts more than one of those ReLUs is active, so there really are three lines, I think, like one, two, three, but in some of the corners two ReLUs are active, and so the weights combine in kind of a funky way; you have to think about it. OK, so let's look at, say, twenty neurons here; I change it to 20, so we have lots of space there, and let's look at a different dataset, like a spiral. You can see how this thing, as I'm doing the updates, will just go in there and figure it out. Very simple data; that one is not like the circle. And then random data, so it kind of goes in there and covers up the green ones and the red ones. And with fewer neurons, say, I'm going to break this now, I'm going to go with five: yes, this starts working worse and worse, because you don't have enough capacity to separate out this data. You can play with this in your free time.

So, as a summary: we arrange these neurons in neural networks into fully-connected layers; we looked at backprop and how these computations get chained in computational graphs; they're not really neural; and, as you'll see soon, the bigger the better, and we'll go into that a lot. I want to take some questions before the end.

[Question from the audience.] So: is it always better to have more neurons in a neural network? The answer to that is yes, more is always better; it's usually a computational constraint. More always works better, but then you have to be careful to regularize it properly. The correct way to constrain your network from overfitting your data is not by making the network smaller; the correct way is to increase the regularization. So you always want to use as large a network as you can, but then you have to make sure to regularize it properly; most of the time, though, for computational reasons, because you don't have time to wait forever to train your networks, you use smaller ones for practical reasons.

[Question: do you regularize each layer equally?] Usually you do, as a simplification; most often, when you see networks trained in practice, they will be regularized the same way throughout, but you don't have to, necessarily.

[Question:] is anybody
using secondary option in optimizing networks there is value 964 01:17:40,500 --> 01:17:44,859 -sometimes when your data sets are small +sometimes when your data sets are small you can use things like lbs which I 965 01:17:44,859 --> 01:17:47,729 -don't go into too much and it's the +don't go into too much and it's the second order method but usually the data 966 01:17:47,729 --> 01:17:50,500 -sets are really large and that's when +sets are really large and that's when I'll get you it doesn't work very well 967 01:17:50,500 --> 01:17:57,039 -so you when you millions of the up with +so you when you millions of the up with you can't do lbs for ya and LBJ is not 968 01:17:57,039 --> 01:18:01,970 -very good with many batch you always +very good with many batch you always have to fall back by default 969 01:18:01,970 --> 01:18:16,650 -like how do you allocate not a good +like how do you allocate not a good answer for that unfortunately so you 970 01:18:16,649 --> 01:18:20,899 -want a depth is good but maybe after +want a depth is good but maybe after like ten layers may be a simple data 971 01:18:20,899 --> 01:18:25,219 -said it's not really adding too much in +said it's not really adding too much in one minute so I can still take some 972 01:18:25,220 --> 01:18:35,990 -questions you have a question for the +questions you have a question for the tradeoff between where do I allocate my 973 01:18:35,989 --> 01:18:40,019 -capacity to I want us to be deeper or do +capacity to I want us to be deeper or do I want it to be wider not a very good 974 01:18:40,020 --> 01:18:47,860 -answer to that yes usually especially +answer to that yes usually especially with images we find that more layers are 975 01:18:47,859 --> 01:18:51,199 -critical but sometimes when you have +critical but sometimes when you have simple tastes like to do you are some 976 01:18:51,199 --> 01:18:55,359 -other things like depth is not as +other things like depth is not as critical and so it's kind of slightly 977 @@ -4842,32 +4840,32 @@ data dependent 978 01:19:01,670 --> 01:19:10,050 -different for different layers that +different for different layers that health usually it's not done usually 979 01:19:10,050 --> 01:19:15,960 -just gonna pick one and go with it +just gonna pick one and go with it that's for example will also see the 980 01:19:15,960 --> 01:19:19,279 -most of them are changes with others and +most of them are changes with others and so you just use that throughout and 981 01:19:19,279 --> 01:19:22,389 -there's no real benefit to to switch +there's no real benefit to to switch them around people don't play with that 982 01:19:22,390 --> 01:19:26,660 -too much on principle you there's +too much on principle you there's nothing preventing you are so it is 420 983 01:19:26,659 --> 01:19:29,789 -so we're going to end here but we'll see +so we're going to end here but we'll see a lot more neural networks so a lot of 984 From 63c1c48639e676495e1bb71624d0bc143ff8a524 Mon Sep 17 00:00:00 2001 From: JK Im Date: Wed, 11 May 2016 16:11:03 -0500 Subject: [PATCH 117/199] Update neural-networks-1.md --- neural-networks-1.md | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/neural-networks-1.md b/neural-networks-1.md index cb10cc86..cfda6639 100644 --- a/neural-networks-1.md +++ b/neural-networks-1.md @@ -3,13 +3,13 @@ layout: page permalink: /neural-networks-1/ --- -Table of Contents: +목차: -- [Quick intro without brain analogies](#quick) -- [Modeling one neuron](#intro) - - [Biological 
motivation and connections](#bio)
-  - [Single neuron as a linear classifier](#classifier)
-  - [Commonly used activation functions](#actfun)
+- [간단한 소개: 뇌에 비유하지 않고](#quick)
+- [뉴런 하나 모델링하기](#intro)
+  - [생물학적 동기와 연결](#bio)
+  - [선형분류기로서의 뉴런 1개](#classifier)
+  - [흔하게 사용되는 활성함수](#actfun)
 - [Neural Network architectures](#nn)
   - [Layer-wise organization](#layers)
   - [Example feed-forward computation](#feedforward)
@@ -20,16 +20,16 @@



-## Quick intro
+## 간단한 소개

-It is possible to introduce neural networks without appealing to brain analogies. In the section on linear classification we computed scores for different visual categories given the image using the formula $ s = W x $, where $W$ was a matrix and $x$ was an input column vector containing all pixel data of the image. In the case of CIFAR-10, $x$ is a [3072x1] column vector, and $W$ is a [10x3072] matrix, so that the output scores is a vector of 10 class scores.
+뇌에 비유하지 않고도 신경망(neural networks)을 소개할 수 있다. 앞서 선형 분류에 관한 섹션에서, $W$가 행렬이고 $x$가 입력 열벡터(column vector)로서 이미지의 모든 픽셀 정보값을 가질 때, $ s = W x $ 형태의 공식을 이용하여 주어진 이미지를 가지고 각 카테고리에 해당하는 스코어를 계산했었다. CIFAR-10의 경우, $x$는 크기가 [3072x1]인 열벡터이고, $W$는 크기가 [10x3072]인 행렬이었다. 따라서, 출력 스코어는 크기가 [10x1]인 벡터가 된다. (역자 주: 숫자 1개가 클래스 1개랑 관련있음.)

-An example neural network would instead compute $ s = W_2 \max(0, W_1 x) $. Here, $W_1$ could be, for example, a [100x3072] matrix transforming the image into a 100-dimensional intermediate vector. The function $max(0,-) $ is a non-linearity that is applied elementwise. There are several choices we could make for the non-linearity (which we'll study below), but this one is a common choice and simply thresholds all activations that are below zero to zero. Finally, the matrix $W_2$ would then be of size [10x100], so that we again get 10 numbers out that we interpret as the class scores. Notice that the non-linearity is critical computationally - if we left it out, the two matrices could be collapsed to a single matrix, and therefore the predicted class scores would again be a linear function of the input. The non-linearity is where we get the *wiggle*. The parameters $W_2, W_1$ are learned with stochastic gradient descent, and their gradients are derived with chain rule (and computed with backpropagation).
+신경망(neural network)은 그 대신, 예컨대 이런 류의 것을 계산한다: $ s = W_2 \max(0, W_1 x) $. 여기서 $W_1$는, 역시 예를 들자면, 크기가 [100x3072]인 행렬로서 이미지를 100차원짜리 중간단계 벡터로 전환하는 것일 수도 있겠다. $max(0,-) $ 함수는 비선형함수로서 $W_1 x $의 각 원소에 적용된다. (밑에서 다루겠지만) 이러한 비선형성을 구현하기 위한 방법은 여러 개 있지만, 이 함수는 흔히 쓰이는 것이고 단순히 모든 0 이하값을 0으로 막아버린다. 끝으로, 행렬 $W_2$는 크기가 [10x100]이어서, 결국에는 클래스 스코어(class score)로 쓰일 숫자 10개를 내놓게 된다. 비선형성이 계산에 있어 결정적이라는 점을 주목하자. 만약에 비선형성이 없다면, 이 행렬들은 서로 곱해져서 결국에는 하나의 행렬이 되고, 예측 스코어(score)도 역시나 입력값의 선형 함수(linear function)가 되고 만다. 이 비선형성에서 우리는 *wiggle*을 얻는다. 파라미터 $W_2, W_1$는 확률적 그라디언트 디센트(stochastic gradient descent)로 학습시키고, 그 그라디언트들은 연쇄법칙(chain rule)으로 유도하여 backpropagation으로 계산한다.

-A three-layer neural network could analogously look like $ s = W_3 \max(0, W_2 \max(0, W_1 x)) $, where all of $W_3, W_2, W_1$ are parameters to be learned. The sizes of the intermediate hidden vectors are hyperparameters of the network and we'll see how we can set them later. Lets now look into how we can interpret these computations from the neuron/network perspective.
+3-레이어 신경망(neural network)은 마찬가지 방식으로 $ s = W_3 \max(0, W_2 \max(0, W_1 x)) $ 형태의 식을 계산한다. 이 때, $W_3, W_2, W_1$들은 모두 파라미터(parameter)들이고 추후에 학습시킨다. 중간 단계 벡터의 크기들은 하이퍼파라미터(hyperparameter)로서 나중에 어떻게 정하는지 알아보겠다. 이제 뉴런(neuron) 혹은 네트워크의 입장에서 이 계산들을 어떻게 해석해야 하는지 알아보자.
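(역자 주: 위의 수식이 실제로 어떻게 계산되는지 볼 수 있도록 원문에 없는 간단한 numpy 스케치를 덧붙인다. 차원은 위에서 언급한 CIFAR-10의 값들을 가정한 것이다.)

~~~python
import numpy as np

# 위 본문의 차원을 가정: 입력 3072차원, 중간 벡터 100차원, 클래스 10개
W1 = 0.01 * np.random.randn(100, 3072)  # [100x3072] 행렬
W2 = 0.01 * np.random.randn(10, 100)    # [10x100] 행렬
x = np.random.randn(3072, 1)            # 이미지 픽셀로 이뤄진 입력 열벡터 [3072x1]

h = np.maximum(0, W1.dot(x))  # 비선형함수 max(0,-)를 각 원소에 적용, [100x1]
s = W2.dot(h)                 # 클래스 스코어 [10x1]
~~~

본문에서 지적한 대로, `np.maximum`을 빼면 두 행렬은 `W2.dot(W1)`이라는 하나의 행렬로 합쳐지고 스코어는 다시 입력의 선형 함수가 되어버린다는 점도 이 코드에서 그대로 확인할 수 있다.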
-## Modeling one neuron
+## 뉴런 하나 모델링하기

 The area of Neural Networks has originally been primarily inspired by the goal of modeling biological neural systems, but has since diverged and become a matter of engineering and achieving good results in Machine Learning tasks. Nonetheless, we begin our discussion with a very brief and high-level description of the biological system that a large portion of this area has been inspired by.

From bb773757f91ecda921b50c21f3158b04f9fb8464 Mon Sep 17 00:00:00 2001
From: Myungsub Choi
Date: Thu, 12 May 2016 14:34:17 +0900
Subject: [PATCH 118/199] Update glossary.md

---
 glossary.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/glossary.md b/glossary.md
index 068e590d..873f0861 100644
--- a/glossary.md
+++ b/glossary.md
@@ -21,7 +21,6 @@ permalink: /glossary/
 Batch배치
 Batch normalization배치 정규화
 Bias(영어 그대로)
- Binary(?)
 Chain rule연쇄 법칙
 Class클래스
 Classification분류
@@ -37,7 +36,7 @@ permalink: /glossary/
 Dropout(영어 그대로)
 Error에러, 오차
 Evaluate평가하다
- Feature특징, 표현(?)
+ Feature특징, 표현, 피쳐
 Filter필터
 Forward propagation(영어 그대로)
 Fully-connected(영어 그대로)

From 5df724b6e4cb5fade5baa2eba14de48665ec5fcb Mon Sep 17 00:00:00 2001
From: Taeksoo Kim
Date: Thu, 12 May 2016 14:55:47 +0900
Subject: [PATCH 119/199] Update convolutional-networks.md

---
 convolutional-networks.md | 29 ++++++++++++++---------------
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/convolutional-networks.md b/convolutional-networks.md
index 9e1ce5e4..cfdec179 100644
--- a/convolutional-networks.md
+++ b/convolutional-networks.md
@@ -258,37 +258,36 @@ FC 레이어와 CONV 레이어의 차이점은, CONV 레이어는 입력의 일

 각각의 변환은 일반적으로 FC 레이어의 가중치 $$W$$를 CONV 레이어의 필터로 변환하는 과정을 수반한다. 이런 변환을 하고 나면, 큰 이미지 (가로/세로가 224보다 큰 이미지)를 단 한번의 forward pass만으로 마치 이미지를 "슬라이딩"하면서 여러 영역을 읽은 것과 같은 효과를 준다.

-예를 들어,224x224 크기의 이미지를 입력으로 받으면 [7x7x512]의 볼륨을 출력하는 이 아키텍쳐에, ( 224/7 = 32배 줄어듦 ) 된 아키텍쳐에 384x384 크기의 이미지를 넣으면 [12x12x512] 크기의 볼륨을 출력하게 된다 (384/32 = 12 이므로). 이후 3개 CONV 레이어
+예를 들어, 224x224 크기의 이미지를 입력으로 받아 [7x7x512]의 볼륨을 출력하는 (즉, 전체적으로 224/7 = 32배 줄어드는) 이 아키텍쳐에 384x384 크기의 이미지를 넣으면 [12x12x512] 크기의 볼륨을 출력하게 된다 (384/32 = 12 이므로). 이후 FC에서 CONV로 변환한 3개의 CONV 레이어를 거치면 [6x6x1000] 크기의 최종 볼륨을 얻게 된다 ((12 - 7)/1 + 1 = 6 이므로). [1x1x1000] 크기를 지닌 하나의 클래스 점수 벡터 대신 384x384 이미지로부터 6x6개의 클래스 점수 배열을 구했다는 것이 중요하다.
 For example, if 224x224 image gives a volume of size [7x7x512] - i.e. a reduction by 32, then forwarding an image of size 384x384 through the converted architecture would give the equivalent volume in size [12x12x512], since 384/32 = 12. Following through with the next 3 CONV layers that we just converted from FC layers would now give the final volume of size [6x6x1000], since (12 - 7)/1 + 1 = 6. Note that instead of a single vector of class scores of size [1x1x1000], we're now getting and entire 6x6 array of class scores across the 384x384 image.

-> Evaluating the original ConvNet (with FC layers) independently across 224x224 crops of the 384x384 image in strides of 32 pixels gives an identical result to forwarding the converted ConvNet one time.
+> 위의 내용은 384x384 크기의 이미지를 32의 stride 간격으로 224x224 크기로 잘라 각각을 원본 ConvNet (뒷쪽 3개 레이어가 FC인)에 적용한 것과 같은 결과를 보여준다.

-Naturally, forwarding the converted ConvNet a single time is much more efficient than iterating the original ConvNet over all those 36 locations, since the 36 evaluations share computation.
This trick is often used in practice to get better performance, where for example, it is common to resize an image to make it bigger, use a converted ConvNet to evaluate the class scores at many spatial positions and then average the class scores. +당연히 (CONV레이어만으로) 변환된 ConvNet을 이용해 한 번에 이미지를 처리하는 것이 원본 ConvNet으로 36개 위치에 대해 반복적으로 처리하는 것 보다 훨씬 효율적이다. 36번의 처리 과정에서 같은 계산이 중복되기 때문이다. 이런 기법은 실전에서 성능 향상을 위해 종종 사용된다. 예를 들어 이미지를 크게 리사이즈 한 뒤 변환된 ConvNet을 이용해 여러 위치에 대한 클래스 점수를 구한 다음 그 점수들의 평균을 취하는 기법 등이 있다. -Lastly, what if we wanted to efficiently apply the original ConvNet over the image but at a stride smaller than 32 pixels? We could achieve this with multiple forward passes. For example, note that if we wanted to use a stride of 16 pixels we could do so by combining the volumes received by forwarding the converted ConvNet twice: First over the original image and second over the image but with the image shifted spatially by 16 pixels along both width and height. +마지막으로 32 픽셀보다 적은 stride 간격으로 ConvNet을 적용하고 싶다면 어떡해야 할까? 포워드 패스 (forward pass)를 여러 번 적용하면 가능하다. 예를 들어 16의 stride 간격으로 처리를 하고 싶다면 변환된 ConvNet에 이미지를 2번 적용한 뒤 합치는 방식을 사용하면 된다: 먼저 원본 이미지를 처리한 뒤 원본 이미지를 가로/세로 16 픽셀만큼 쉬프트 시킨 뒤 한번 더 처리하면 된다. -- An IPython Notebook on [Net Surgery](https://github.com/BVLC/caffe/blob/master/examples/net_surgery.ipynb) shows how to perform the conversion in practice, in code (using Caffe) +- Caffe를 이용해 ConvNet 변환을 수행하는 실제 IPython Notebook 예제 [Net Surgery](https://github.com/BVLC/caffe/blob/master/examples/net_surgery.ipynb) -### ConvNet Architectures +### ConvNet 구조 -We have seen that Convolutional Networks are commonly made up of only three layer types: CONV, POOL (we assume Max pool unless stated otherwise) and FC (short for fully-connected). We will also explicitly write the RELU activation function as a layer, which applies elementwise non-linearity. In this section we discuss how these are commonly stacked together to form entire ConvNets. +위에서 컨볼루셔널 신경망은 일반적으로 CONV, POOL (별다른 언급이 없다면 Max Pool이라고 가정), FC 레이어로 이뤄져 있다는 것을 배웠다. 각 원소에 비선형 특징을 가해주는 RELU 액티베이션 함수도 명시적으로 레이어로 취급하겠다. 이 섹션에서는 어떤 방식으로 이 레이어들이 쌓아져 전체 ConvNet이 이뤄지는지 알아보겠다. -#### Layer Patterns -The most common form of a ConvNet architecture stacks a few CONV-RELU layers, follows them with POOL layers, and repeats this pattern until the image has been merged spatially to a small size. At some point, it is common to transition to fully-connected layers. The last fully-connected layer holds the output, such as the class scores. In other words, the most common ConvNet architecture follows the pattern: - +#### 레이어 패턴 +가장 흔한 ConvNet 구조는 몇 개의 CONV-RELU 레이어를 쌓은 뒤 POOL 레이어를 추가한 형태가 여러 번 반복되며 이미지 볼륨의 spatial (가로/세로) 크기를 줄이는 것이다. 이런 방식으로 적절히 쌓은 뒤 FC 레이어들을 쌓아준다. 마지막 FC 레이어는 클래스 점수와 같은 출력을 만들어낸다. 다시 말해서, 일반적인 ConvNet 구조는 다음 패턴을 따른다: `INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC` -where the `*` indicates repetition, and the `POOL?` indicates an optional pooling layer. Moreover, `N >= 0` (and usually `N <= 3`), `M >= 0`, `K >= 0` (and usually `K < 3`). For example, here are some common ConvNet architectures you may see that follow this pattern: +`*`는 반복을 의미하며 `POOL?` 은 선택적으로 POOL 레이어를 사용한다는 의미이다. 또한 `N >= 0` (보통 `N <= 3`), `M >= 0`, `K >= 0` (보통 `K < 3`)이다. 예를 들어, 보통의 ConvNet 구조에서 아래와 같은 패턴들을 흔히 발견할 수 있다: -- `INPUT -> FC`, implements a linear classifier. Here `N = M = K = 0`. +- `INPUT -> FC`, 선형 분류기이다. 이 때 `N = M = K = 0`. - `INPUT -> CONV -> RELU -> FC` -- `INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC`. 
Here we see that there is a single CONV layer between every POOL layer.
-- `INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL]*3 -> [FC -> RELU]*2 -> FC` Here we see two CONV layers stacked before every POOL layer. This is generally a good idea for larger and deeper networks, because multiple stacked CONV layers can develop more complex features of the input volume before the destructive pooling operation.
+- `INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC`. 이 경우는 POOL 레이어 하나 당 하나의 CONV 레이어가 존재한다.
+- `INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL]*3 -> [FC -> RELU]*2 -> FC` 이 경우는 각각의 POOL 레이어를 거치기 전에 여러 개의 CONV 레이어를 거치게 된다. 크고 깊은 신경망에서는 이런 구조가 적합하다. 여러 층으로 쌓인 CONV 레이어는 pooling 연산으로 인해 많은 정보가 파괴되기 전에 복잡한 feature들을 추출할 수 있게 해주기 때문이다.

-*Prefer a stack of small filter CONV to one large receptive field CONV layer*. Suppose that you stack three 3x3 CONV layers on top of each other (with non-linearities in between, of course). In this arrangement, each neuron on the first CONV layer has a 3x3 view of the input volume. A neuron on the second CONV layer has a 3x3 view of the first CONV layer, and hence by extension a 5x5 view of the input volume. Similarly, a neuron on the third CONV layer has a 3x3 view of the 2nd CONV layer, and hence a 7x7 view of the input volume. Suppose that instead of these three layers of 3x3 CONV, we only wanted to use a single CONV layer with 7x7 receptive fields. These neurons would have a receptive field size of the input volume that is identical in spatial extent (7x7), but with several disadvantages. First, the neurons would be computing a linear function over the input, while the three stacks of CONV layers contain non-linearities that make their features more expressive. Second, if we suppose that all the volumes have $$C$$ channels, then it can be seen that the single 7x7 CONV layer would contain $$C \times (7 \times 7 \times C) = 49 C^2$$ parameters, while the three 3x3 CONV layers would only contain $$3 \times (C \times (3 \times 3 \times C)) = 27 C^2$$ parameters. Intuitively, stacking CONV layers with tiny filters as opposed to having one CONV layer with big filters allows us to express more powerful features of the input, and with fewer parameters. As a practical disadvantage, we might need more memory to hold all the intermediate CONV layer results if we plan to do backpropagation.
+*큰 리셉티브 필드를 가지는 CONV 레이어 하나 대신 여러개의 작은 필터를 가진 CONV 레이어를 쌓는 것이 좋다*. 3x3 크기의 CONV 레이어 3개를 쌓는다고 생각해보자 (물론 각 레이어 사이에는 비선형 함수를 넣어준다). 이 경우 첫 번째 CONV 레이어의 각 뉴런은 입력 볼륨의 3x3 영역을 보게 된다. 두 번째 CONV 레이어의 각 뉴런은 첫 번째 CONV 레이어의 3x3 영역을 보게 되어 결론적으로 입력 볼륨의 5x5 영역을 보게 되는 효과가 있다. 비슷하게, 세 번째 CONV 레이어의 각 뉴런은 두 번째 CONV 레이어의 3x3 영역을 보게 되어 입력 볼륨의 7x7 영역을 보는 것과 같아진다. 이런 방식으로 3개의 3x3 CONV 레이어를 사용하는 대신 7x7의 리셉티브 필드를 가지는 CONV 레이어 하나를 사용한다고 생각해 보자. 이 경우에도 각 뉴런은 입력 볼륨의 7x7 영역을 리셉티브 필드로 갖게 되지만 몇 가지 단점이 존재한다. 먼저, CONV 레이어 3개를 쌓은 경우에는 중간 중간 비선형 함수의 영향으로 표현력 높은 feature를 만드는 반면, 하나의 (7x7) CONV 레이어만 갖는 경우 각 뉴런은 입력에 대해 선형 함수를 적용하게 된다. 두 번째로, 모든 볼륨이 $$C$$ 개의 채널(또는 깊이)을 갖는다고 가정한다면, 7x7 CONV 레이어의 경우 $$C \times (7 \times 7 \times C)=49 C^2$$개의 파라미터를 갖게 된다. 반면 3개의 3x3 CONV 레이어의 경우는 $$3 \times (C \times (3 \times 3 \times C)) = 27 C^2$$개의 파라미터만 갖게 된다. 직관적으로, 하나의 큰 필터를 갖는 CONV 레이어보다, 작은 필터를 갖는 여러 개의 CONV 레이어를 쌓는 것이 더 적은 파라미터만 사용하면서도 입력으로부터 더 좋은 feature를 추출하게 해준다. 단점이 있다면, backpropagation을 할 때 CONV 레이어의 중간 결과들을 저장하기 위해 더 많은 메모리 공간을 잡고 있어야 한다는 것이다.
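(역자 주: 위 문단의 파라미터 개수 비교 $$49 C^2$$ 대 $$27 C^2$$는 아래와 같이 간단히 계산해 볼 수 있다. 원문에 없는 예시이며, $$C = 64$$는 가정한 값이다.)

~~~python
C = 64  # 채널 수 (가정한 값)

params_7x7 = C * (7 * 7 * C)              # 49 * C^2 = 200704
params_3x3_stack = 3 * (C * (3 * 3 * C))  # 27 * C^2 = 110592

print(params_7x7, params_3x3_stack)  # 200704 110592
~~~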
#### Layer Sizing Patterns

From e94b1c3f60798f07873bb49203d239e15c0596bf Mon Sep 17 00:00:00 2001
From: Taeksoo Kim
Date: Thu, 12 May 2016 15:18:52 +0900
Subject: [PATCH 120/199] Update convolutional-networks.md

---
 convolutional-networks.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/convolutional-networks.md b/convolutional-networks.md
index cfdec179..6f371841 100644
--- a/convolutional-networks.md
+++ b/convolutional-networks.md
@@ -290,17 +290,17 @@ For example, if 224x224 image gives a volume of size [7x7x512] - i.e. a reductio

 *큰 리셉티브 필드를 가지는 CONV 레이어 하나 대신 여러개의 작은 필터를 가진 CONV 레이어를 쌓는 것이 좋다*. 3x3 크기의 CONV 레이어 3개를 쌓는다고 생각해보자 (물론 각 레이어 사이에는 비선형 함수를 넣어준다). 이 경우 첫 번째 CONV 레이어의 각 뉴런은 입력 볼륨의 3x3 영역을 보게 된다. 두 번째 CONV 레이어의 각 뉴런은 첫 번째 CONV 레이어의 3x3 영역을 보게 되어 결론적으로 입력 볼륨의 5x5 영역을 보게 되는 효과가 있다. 비슷하게, 세 번째 CONV 레이어의 각 뉴런은 두 번째 CONV 레이어의 3x3 영역을 보게 되어 입력 볼륨의 7x7 영역을 보는 것과 같아진다. 이런 방식으로 3개의 3x3 CONV 레이어를 사용하는 대신 7x7의 리셉티브 필드를 가지는 CONV 레이어 하나를 사용한다고 생각해 보자. 이 경우에도 각 뉴런은 입력 볼륨의 7x7 영역을 리셉티브 필드로 갖게 되지만 몇 가지 단점이 존재한다. 먼저, CONV 레이어 3개를 쌓은 경우에는 중간 중간 비선형 함수의 영향으로 표현력 높은 feature를 만드는 반면, 하나의 (7x7) CONV 레이어만 갖는 경우 각 뉴런은 입력에 대해 선형 함수를 적용하게 된다. 두 번째로, 모든 볼륨이 $$C$$ 개의 채널(또는 깊이)을 갖는다고 가정한다면, 7x7 CONV 레이어의 경우 $$C \times (7 \times 7 \times C)=49 C^2$$개의 파라미터를 갖게 된다. 반면 3개의 3x3 CONV 레이어의 경우는 $$3 \times (C \times (3 \times 3 \times C)) = 27 C^2$$개의 파라미터만 갖게 된다. 직관적으로, 하나의 큰 필터를 갖는 CONV 레이어보다, 작은 필터를 갖는 여러 개의 CONV 레이어를 쌓는 것이 더 적은 파라미터만 사용하면서도 입력으로부터 더 좋은 feature를 추출하게 해준다. 단점이 있다면, backpropagation을 할 때 CONV 레이어의 중간 결과들을 저장하기 위해 더 많은 메모리 공간을 잡고 있어야 한다는 것이다.

-#### Layer Sizing Patterns
+#### 레이어 크기 결정 패턴

-Until now we've omitted mentions of common hyperparameters used in each of the layers in a ConvNet. We will first state the common rules of thumb for sizing the architectures and then follow the rules with a discussion of the notation:
+지금까지는 ConvNet의 각 레이어에서 흔히 쓰이는 하이퍼파라미터에 대한 언급을 하지 않았다. 여기에서는 처음으로 ConvNet 구조의 크기를 결정하는 법칙 (수학적으로 증명된 법칙은 아니고 실험적으로 좋은 법칙)들을 살펴보고, 그 뒤에 각종 표기법에 대해 알아보겠다:

-The **input layer** (that contains the image) should be divisible by 2 many times. Common numbers include 32 (e.g. CIFAR-10), 64, 96 (e.g. STL-10), or 224 (e.g. common ImageNet ConvNets), 384, and 512.
+**입력 레이어** (이미지 포함)는 여러번 2로 나눌 수 있어야 한다. 흔히 사용되는 숫자들은 32 (CIFAR-10 데이터), 64, 96 (STL-10), 224 (많이 쓰이는 ImageNet ConvNet), 384, 512 등이 있다.

-The **conv layers** should be using small filters (e.g. 3x3 or at most 5x5), using a stride of $$S = 1$$, and crucially, padding the input volume with zeros in such way that the conv layer does not alter the spatial dimensions of the input. That is, when $$F = 3$$, then using $$P = 1$$ will retain the original size of the input. When $$F = 5$$, $$P = 2$$. For a general $$F$$, it can be seen that $$P = (F - 1) / 2$$ preserves the input size. If you must use bigger filter sizes (such as 7x7 or so), it is only common to see this on the very first conv layer that is looking at the input image.
+**CONV 레이어**는 (3x3 또는 최대 5x5의)작은 필터들과 $$S = 1$$의 stride를 사용하며, 결정적으로 입력과 출력의 spatial 크기 (가로/세로)가 달라지지 않도록 입력 볼륨에 제로 패딩을 해 줘야 한다. 즉, $$F = 3$$이라면, $$P = 1$$로 제로 패딩을 해 주면 입력의 spatial 사이즈를 그대로 유지하게 된다. 만약 $$F = 5$$라면 $$P = 2$$를 사용하게 된다. 일반적으로 $$F$$에 대해서 $$P = (F - 1)/2$$를 사용하면 입력의 크기가 그대로 유지된다. 만약 7x7과 같이 큰 필터를 사용하는 경우에는 보통 이미지와 바로 연결된 첫 번째 CONV 레이어에만 사용한다.

-The **pool layers** are in charge of downsampling the spatial dimensions of the input. The most common setting is to use max-pooling with 2x2 receptive fields (i.e. $$F = 2$$), and with a stride of 2 (i.e. $$S = 2$$).
Note that this discards exactly 75% of the activations in an input volume (due to downsampling by 2 in both width and height). Another sligthly less common setting is to use 3x3 receptive fields with a stride of 2, but this makes. It is very uncommon to see receptive field sizes for max pooling that are larger than 3 because the pooling is then too lossy and agressive. This usually leads to worse performance.
+**POOL 레이어**는 spatial 차원에 대한 다운샘플링을 위해 사용된다. 가장 일반적인 세팅은 2x2의 리셉티브 필드($$F = 2$$)를 가진 max 풀링이다. 이 경우 입력의 75%의 액티베이션 값이 버려진다는 것을 기억하자 (가로/세로에 대해 각각 절반으로 다운샘플링 하므로). 또 다른 약간 덜 사용되는 세팅은 3x3 리셉티브 필드에 stride를 2로 놓는 것이다. Max 풀링에 3보다 큰 리셉티브 필드를 가지는 경우는 너무 많은 정보를 버리게 되므로 거의 사용되지 않는다. 많은 정보 손실은 곧 성능 하락으로 이어진다.

-*Reducing sizing headaches.* The scheme presented above is pleasing because all the CONV layers preserve the spatial size of their input, while the POOL layers alone are in charge of down-sampling the volumes spatially. In an alternative scheme where we use strides greater than 1 or don't zero-pad the input in CONV layers, we would have to very carefully keep track of the input volumes throughout the CNN architecture and make sure that all strides and filters "work out", and that the ConvNet architecture is nicely and symmetrically wired.
+*크기 축소와 관련된 고민들.* 위에서 다룬 전략은 모든 CONV 레이어가 입력의 spatial 크기를 그대로 유지시키고, POOL 레이어만 spatial 차원의 다운샘플링을 책임지게 된다는 점에서 만족스럽다. 또다른 대안은 CONV 레이어에서 1보다 큰 stride를 사용하거나 제로 패딩을 주지 않는 것이다. 이 경우에는 전체 ConvNet이 잘 동작하도록 각 레이어의 입력 볼륨들을 잘 살펴봐야 한다.

-*Why use stride of 1 in CONV?* Smaller strides work better in practice. Additionally, as already mentioned stride 1 allows us to leave all spatial down-sampling to the POOL layers, with the CONV layers only transforming the input volume depth-wise.

From 64254e98e5f6200967659195261d2ad67487690e Mon Sep 17 00:00:00 2001
From: Taeksoo Kim
Date: Thu, 12 May 2016 15:34:14 +0900
Subject: [PATCH 121/199] Update convolutional-networks.md

---
 convolutional-networks.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/convolutional-networks.md b/convolutional-networks.md
index 6f371841..a4f3a314 100644
--- a/convolutional-networks.md
+++ b/convolutional-networks.md
@@ -302,11 +302,11 @@ For example, if 224x224 image gives a volume of size [7x7x512] - i.e. a reductio

 *크기 축소와 관련된 고민들.* 위에서 다룬 전략은 모든 CONV 레이어가 입력의 spatial 크기를 그대로 유지시키고, POOL 레이어만 spatial 차원의 다운샘플링을 책임지게 된다는 점에서 만족스럽다. 또다른 대안은 CONV 레이어에서 1보다 큰 stride를 사용하거나 제로 패딩을 주지 않는 것이다. 이 경우에는 전체 ConvNet이 잘 동작하도록 각 레이어의 입력 볼륨들을 잘 살펴봐야 한다.

-*Why use stride of 1 in CONV?* Smaller strides work better in practice. Additionally, as already mentioned stride 1 allows us to leave all spatial down-sampling to the POOL layers, with the CONV layers only transforming the input volume depth-wise.
+*왜 CONV 레이어에 stride 1을 사용할까?* 보통 작은 stride가 더 잘 동작한다. 뿐만 아니라, 위에서 언급한 것과 같이 stride를 1로 놓으면 모든 spatial 다운샘플링을 POOL 레이어에 맡기게 되고 CONV 레이어는 입력 볼륨의 깊이만 변화시키게 된다.

-*Why use padding?* In addition to the aforementioned benefit of keeping the spatial sizes constant after CONV, doing this actually improves performance. If the CONV layers were to not zero-pad the inputs and only perform valid convolutions, then the size of the volumes would reduce by a small amount after each CONV, and the information at the borders would be "washed away" too quickly.
+*왜 (제로)패딩을 사용할까?* 앞에서 본 것과 같이 CONV 레이어를 통과하면서 spatial 크기를 그대로 유지하게 해준다는 점 외에도, 패딩을 쓰면 성능도 향상된다. 만약 제로 패딩을 하지 않고 valid convolution (패딩을 하지 않은 convolution)을 한다면 볼륨의 크기는 CONV 레이어를 거칠 때마다 줄어들게 되고, 가장자리의 정보들이 빠르게 사라진다.
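(역자 주: 제로 패딩이 spatial 크기를 유지해 준다는 것은 출력 크기 공식 $$(W - F + 2P)/S + 1$$에 값을 넣어보면 바로 확인할 수 있다. 아래는 원문에 없는 간단한 예시 코드이다.)

~~~python
def conv_output_size(W, F, S, P):
    # 출력의 가로/세로 크기: (W - F + 2P)/S + 1
    assert (W - F + 2 * P) % S == 0, "이 stride로는 뉴런이 깔끔하게 배치되지 않음"
    return (W - F + 2 * P) // S + 1

print(conv_output_size(224, 3, 1, 1))  # 224: F=3, P=1 이면 크기가 유지됨
print(conv_output_size(224, 5, 1, 2))  # 224: F=5, P=2 이면 크기가 유지됨
print(conv_output_size(224, 3, 1, 0))  # 222: 패딩이 없으면 레이어를 거칠 때마다 줄어듦
~~~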
-*Compromising based on memory constraints.* In some cases (especially early in the ConvNet architectures), the amount of memory can build up very quickly with the rules of thumb presented above. For example, filtering a 224x224x3 image with three 3x3 CONV layers with 64 filters each and padding 1 would create three activation volumes of size [224x224x64]. This amounts to a total of about 10 million activations, or 72MB of memory (per image, for both activations and gradients). Since GPUs are often bottlenecked by memory, it may be necessary to compromise. In practice, people prefer to make the compromise at only the first CONV layer of the network. For example, one compromise might be to use a first CONV layer with filter sizes of 7x7 and stride of 2 (as seen in a ZF net). As another example, an AlexNet uses filer sizes of 11x11 and stride of 4. +*메모리 제한에 따른 타협.* 어떤 경우에는 (특히 예전에 나온 ConvNet 구조에서), 위에서 다룬 기법들을 사용할 경우 메모리 사용량이 매우 빠른 속도로 늘게 된다. 예를 들어 224x224x3의 이미지를 64개의 필터와 stride 1을 사용하는 3x3 CONV 레이어 3개로 필터링하면 [224x224x64]의 크기를 가진 액티베이션 볼륨을 총 3개 만들게 된다. 이 숫자는 거의 1,000만 개의 액티베이션 값이고, (이미지 1장 당)72MB 정도의 메모리를 사용하게 된다 (액티베이션과 그라디언트 각각에). GPU를 사용하면 보통 메모리에서 병목 현상이 생기므로, 이 부분에서는 어느 정도 현실과 타협을 할 필요가 있다. 실전에서는 보통 첫 번째 CONV 레이어에서 타협점을 찾는다. 예를 들면 첫 번째 CONV 레이어에서 7x7 필터와 stride 2 (ZF net)을 사용하는 케이스가 있다. AlexNet의 경우 11x11 필터와 stride 4를 사용한다. #### Case studies From 21ca96394e70bd3c9f1d3cd18884beb8734b8045 Mon Sep 17 00:00:00 2001 From: Taeksoo Kim Date: Thu, 12 May 2016 16:10:52 +0900 Subject: [PATCH 122/199] Update convolutional-networks.md --- convolutional-networks.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/convolutional-networks.md b/convolutional-networks.md index a4f3a314..9b4663dd 100644 --- a/convolutional-networks.md +++ b/convolutional-networks.md @@ -309,14 +309,14 @@ For example, if 224x224 image gives a volume of size [7x7x512] - i.e. a reductio *메모리 제한에 따른 타협.* 어떤 경우에는 (특히 예전에 나온 ConvNet 구조에서), 위에서 다룬 기법들을 사용할 경우 메모리 사용량이 매우 빠른 속도로 늘게 된다. 예를 들어 224x224x3의 이미지를 64개의 필터와 stride 1을 사용하는 3x3 CONV 레이어 3개로 필터링하면 [224x224x64]의 크기를 가진 액티베이션 볼륨을 총 3개 만들게 된다. 이 숫자는 거의 1,000만 개의 액티베이션 값이고, (이미지 1장 당)72MB 정도의 메모리를 사용하게 된다 (액티베이션과 그라디언트 각각에). GPU를 사용하면 보통 메모리에서 병목 현상이 생기므로, 이 부분에서는 어느 정도 현실과 타협을 할 필요가 있다. 실전에서는 보통 첫 번째 CONV 레이어에서 타협점을 찾는다. 예를 들면 첫 번째 CONV 레이어에서 7x7 필터와 stride 2 (ZF net)을 사용하는 케이스가 있다. AlexNet의 경우 11x11 필터와 stride 4를 사용한다. -#### Case studies +#### 케이스 스터디 -There are several architectures in the field of Convolutional Networks that have a name. The most common are: +필드에서 사용되는 몇몇 ConvNet들은 별명을 갖고 있다. 그 중 가장 많이 쓰이는 구조들은: -- **LeNet**. The first successful applications of Convolutional Networks were developed by Yann LeCun in 1990's. Of these, the best known is the [LeNet](http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf) architecture that was used to read zip codes, digits, etc. -- **AlexNet**. The first work that popularized Convolutional Networks in Computer Vision was the [AlexNet](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks), developed by Alex Krizhevsky, Ilya Sutskever and Geoff Hinton. The AlexNet was submitted to the [ImageNet ILSVRC challenge](http://www.image-net.org/challenges/LSVRC/2014/) in 2012 and significantly outperformed the second runner-up (top 5 error of 16% compared to runner-up with 26% error). 
The Network had a similar architecture basic as LeNet, but was deeper, bigger, and featured Convolutional Layers stacked on top of each other (previously it was common to only have a single CONV layer immediately followed by a POOL layer). -- **ZF Net**. The ILSVRC 2013 winner was a Convolutional Network from Matthew Zeiler and Rob Fergus. It became known as the [ZFNet](http://arxiv.org/abs/1311.2901) (short for Zeiler & Fergus Net). It was an improvement on AlexNet by tweaking the architecture hyperparameters, in particular by expanding the size of the middle convolutional layers. -- **GoogLeNet**. The ILSVRC 2014 winner was a Convolutional Network from [Szegedy et al.](http://arxiv.org/abs/1409.4842) from Google. Its main contribution was the development of an *Inception Module* that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M). Additionally, this paper uses Average Pooling instead of Fully Connected layers at the top of the ConvNet, eliminating a large amount of parameters that do not seem to matter much. +- **LeNet**. 최초의 성공적인 ConvNet 애플리케이션들은 1990년대에 Yann LeCun이 만들었다. 그 중에서도 zip 코드나 숫자를 읽는 [LeNet](http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf) 아키텍쳐가 가장 유명하다. +- **AlexNet**. 컴퓨터 비전 분야에서 ConvNet을 유명하게 만든 것은 Alex Krizhevsky, Ilya Sutskever, Geoff Hinton이 만든 [AlexNet](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks)이다. AlexNet은 [ImageNet ILSVRC challenge](http://www.image-net.org/challenges/LSVRC/2014/) 2012에 출전해 2등을 큰 차이로 제치고 1등을 했다 (top 5 에러율 16%, 2등은 26%). 아키텍쳐는 LeNet과 기본적으로 유사지만, 더 깊고 크다. 또한 (과거에는 하나의 CONV 레이어 이후에 바로 POOL 레이어를 쌓은 것과 달리) 여러 개의 CONV 레이어들이 쌓여 있다. +- **ZF Net**. ILSVRC 2013년의 승자는 Matthew Zeiler와 Rob Fergus가 만들었다. 저자들의 이름을 따 [ZFNet](http://arxiv.org/abs/1311.2901)이라고 불린다. AlexNet에서 중간 CONV 레이어 크기를 조정하는 등 하이퍼파라미터들을 수정해 만들었다. +- **GoogLeNet**. ILSVRC 2014의 승자는 [Szegedy et al.](http://arxiv.org/abs/1409.4842) 이 구글에서 만들었다. 이 모델의 가장 큰 기여는 파라미터의 개수를 엄청나게 줄여주는 Inception module을 제안한 것이다 (4M, AlexNet의 경우 60M). 뿐만 아니라, ConvNet 마지막에 FC 레이어 대신 Average 풀링을 사용해 별로 중요하지 않아 보이는 파라미터들을 많이 줄이게 된다. - **VGGNet**. The runner-up in ILSVRC 2014 was the network from Karen Simonyan and Andrew Zisserman that became known as the [VGGNet](http://www.robots.ox.ac.uk/~vgg/research/very_deep/). Its main contribution was in showing that the depth of the network is a critical component for good performance. Their final best network contains 16 CONV/FC layers and, appealingly, features an extremely homogeneous architecture that only performs 3x3 convolutions and 2x2 pooling from the beginning to the end. It was later found that despite its slightly weaker classification performance, the VGG ConvNet features outperform those of GoogLeNet in multiple transfer learning tasks. Hence, the VGG network is currently the most preferred choice in the community when extracting CNN features from images. In particular, their [pretrained model](http://www.robots.ox.ac.uk/~vgg/research/very_deep/) is available for plug and play use in Caffe. A downside of the VGGNet is that it is more expensive to evaluate and uses a lot more memory and parameters (140M). - **ResNet**. [Residual Network](http://arxiv.org/abs/1512.03385) developed by Kaiming He et al. was the winner of ILSVRC 2015. It features an interesting architecture with special *skip connections* and features heavy use of batch normalization. The architecture is also missing fully connected layers at the end of the network. 
The reader is also referred to Kaiming's presentation ([video](https://www.youtube.com/watch?v=1PGLj-uKT1w), [slides](http://research.microsoft.com/en-us/um/people/kahe/ilsvrc15/ilsvrc2015_deep_residual_learning_kaiminghe.pdf)), and some [recent experiments](https://github.com/gcr/torch-residual-networks) that reproduce these networks in Torch. From 18a859103154d2d675db0553d90d7b3e52a2c441 Mon Sep 17 00:00:00 2001 From: Taeksoo Kim Date: Thu, 12 May 2016 18:14:12 +0900 Subject: [PATCH 123/199] Update convolutional-networks.md --- convolutional-networks.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/convolutional-networks.md b/convolutional-networks.md index 9b4663dd..51726024 100644 --- a/convolutional-networks.md +++ b/convolutional-networks.md @@ -317,9 +317,9 @@ For example, if 224x224 image gives a volume of size [7x7x512] - i.e. a reductio - **AlexNet**. 컴퓨터 비전 분야에서 ConvNet을 유명하게 만든 것은 Alex Krizhevsky, Ilya Sutskever, Geoff Hinton이 만든 [AlexNet](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks)이다. AlexNet은 [ImageNet ILSVRC challenge](http://www.image-net.org/challenges/LSVRC/2014/) 2012에 출전해 2등을 큰 차이로 제치고 1등을 했다 (top 5 에러율 16%, 2등은 26%). 아키텍쳐는 LeNet과 기본적으로 유사지만, 더 깊고 크다. 또한 (과거에는 하나의 CONV 레이어 이후에 바로 POOL 레이어를 쌓은 것과 달리) 여러 개의 CONV 레이어들이 쌓여 있다. - **ZF Net**. ILSVRC 2013년의 승자는 Matthew Zeiler와 Rob Fergus가 만들었다. 저자들의 이름을 따 [ZFNet](http://arxiv.org/abs/1311.2901)이라고 불린다. AlexNet에서 중간 CONV 레이어 크기를 조정하는 등 하이퍼파라미터들을 수정해 만들었다. - **GoogLeNet**. ILSVRC 2014의 승자는 [Szegedy et al.](http://arxiv.org/abs/1409.4842) 이 구글에서 만들었다. 이 모델의 가장 큰 기여는 파라미터의 개수를 엄청나게 줄여주는 Inception module을 제안한 것이다 (4M, AlexNet의 경우 60M). 뿐만 아니라, ConvNet 마지막에 FC 레이어 대신 Average 풀링을 사용해 별로 중요하지 않아 보이는 파라미터들을 많이 줄이게 된다. -- **VGGNet**. The runner-up in ILSVRC 2014 was the network from Karen Simonyan and Andrew Zisserman that became known as the [VGGNet](http://www.robots.ox.ac.uk/~vgg/research/very_deep/). Its main contribution was in showing that the depth of the network is a critical component for good performance. Their final best network contains 16 CONV/FC layers and, appealingly, features an extremely homogeneous architecture that only performs 3x3 convolutions and 2x2 pooling from the beginning to the end. It was later found that despite its slightly weaker classification performance, the VGG ConvNet features outperform those of GoogLeNet in multiple transfer learning tasks. Hence, the VGG network is currently the most preferred choice in the community when extracting CNN features from images. In particular, their [pretrained model](http://www.robots.ox.ac.uk/~vgg/research/very_deep/) is available for plug and play use in Caffe. A downside of the VGGNet is that it is more expensive to evaluate and uses a lot more memory and parameters (140M). -- **ResNet**. [Residual Network](http://arxiv.org/abs/1512.03385) developed by Kaiming He et al. was the winner of ILSVRC 2015. It features an interesting architecture with special *skip connections* and features heavy use of batch normalization. The architecture is also missing fully connected layers at the end of the network. The reader is also referred to Kaiming's presentation ([video](https://www.youtube.com/watch?v=1PGLj-uKT1w), [slides](http://research.microsoft.com/en-us/um/people/kahe/ilsvrc15/ilsvrc2015_deep_residual_learning_kaiminghe.pdf)), and some [recent experiments](https://github.com/gcr/torch-residual-networks) that reproduce these networks in Torch. - +- **VGGNet**. 
ILSVRC 2014에서 2등을 한 네트워크는 Karen Simonyan과 Andrew Zisserman이 만든 [VGGNet](http://www.robots.ox.ac.uk/~vgg/research/very_deep/)이라고 불리우는 모델이다. 이 모델의 가장 큰 기여는 네트워크의 깊이가 좋은 성능에 있어 매우 중요한 요소라는 것을 보여준 것이다. 이들이 제안한 여러 개 모델 중 가장 좋은 것은 16개의 CONV/FC 레이어로 이뤄지며, 모든 컨볼루션은 3x3, 모든 풀링은 2x2만으로 이뤄져 있다. 비록 GoogLeNet보다 이미지 분류 성능은 약간 낮지만, 여러 Transfer Learning 과제에서 더 좋은 성능을 보인다는 것이 나중에 밝혀졌다. 그래서 VGGNet은 최근에 이미지 feature 추출을 위해 가장 많이 사용되고 있다. Caffe를 사용하면 Pretrained model을 받아 바로 사용하는 것도 가능하다. VGGNet의 단점은, 평가(evaluation) 시 연산량이 많고 메모리와 파라미터(140M개)를 매우 많이 사용한다는 것이다.
+ - **ResNet**. Kaiming He et al.이 만든 [Residual Network](http://arxiv.org/abs/1512.03385)가 ILSVRC 2015에서 우승을 차지했다. Skip connection이라는 특이한 구조를 사용하며 batch normalization을 많이 사용했다는 특징이 있다. 이 아키텍쳐는 또한 마지막 부분에서 FC 레이어를 사용하지 않는다. Kaiming의 발표자료 ([video](https://www.youtube.com/watch?v=1PGLj-uKT1w), [slides](http://research.microsoft.com/en-us/um/people/kahe/ilsvrc15/ilsvrc2015_deep_residual_learning_kaiminghe.pdf))나 Torch로 구현된 [최근 실험들](https://github.com/gcr/torch-residual-networks) 들도 확인할 수 있다.
+
**VGGNet in detail**.
Lets break down the [VGGNet](http://www.robots.ox.ac.uk/~vgg/research/very_deep/) in more detail. The whole VGGNet is composed of CONV layers that perform 3x3 convolutions with stride 1 and pad 1, and of POOL layers that perform 2x2 max pooling with stride 2 (and no padding).
We can write out the size of the representation at each step of the processing and keep track of both the representation size and the total number of weights: +**VGGNet의 세부 사항들**. +[VGGNet](http://www.robots.ox.ac.uk/~vgg/research/very_deep/)에 대해 좀 더 자세히 파헤쳐 보자. 전체 VGGNet은 필터 크기 3x3, stride 1, 제로패딩 1로 이뤄진 CONV 레이어들과 2x2 필터 크기 (패딩은 없음)의 POOL 레이어들로 구성된다. 아래에서 각 단계의 처리 과정을 살펴보고, 각 단계의 결과 크기와 가중치 개수를 알아본다. ~~~ INPUT: [224x224x3] memory: 224*224*3=150K weights: 0 @@ -351,13 +351,13 @@ TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd) TOTAL params: 138M parameters ~~~ -As is common with Convolutional Networks, notice that most of the memory is used in the early CONV layers, and that most of the parameters are in the last FC layers. In this particular case, the first FC layer contains 100M weights, out of a total of 140M. - +ConvNet에서 자주 볼 수 있는 특징으로써, 대부분의 메모리가 앞쪽에서 소비된다는 점과, 마지막 FC 레이어들이 가장 많은 파라미터들을 갖고 있다는 점을 기억하자. 이 예제에서는, 첫 번째 FC 레이어가 총 140M개 중 100M개의 가중치를 갖는다. -#### Computational Considerations +#### 계산 관련 고려사항들 Computational Considerations +ConvNet을 만들 때 일어나는 가장 큰 병목 현상은 메모리 병목이다. 최신 GPU들은 3/4/6GB의 메모리를 내장하고 있다. 가장 좋은 GPU들의 경우 12GB를 갖고 있다. 메모리와 관련해 주의깊게 살펴 볼 것은 크게 3가지이다. The largest bottleneck to be aware of when constructing ConvNet architectures is the memory bottleneck. Many modern GPUs have a limit of 3/4/6GB memory, with the best GPUs having about 12GB of memory. There are three major sources of memory to keep track of: - From the intermediate volume sizes: These are the raw number of **activations** at every layer of the ConvNet, and also their gradients (of equal size). Usually, most of the activations are on the earlier layers of a ConvNet (i.e. first Conv Layers). These are kept around because they are needed for backpropagation, but a clever implementation that runs a ConvNet only at test time could in principle reduce this by a huge amount, by only storing the current activations at any layer and discarding the previous activations on layers below. From 311c69fdfd6f92d8aec8203d6e7af2691bae4415 Mon Sep 17 00:00:00 2001 From: Taeksoo Kim Date: Thu, 12 May 2016 21:16:47 +0900 Subject: [PATCH 125/199] Update convolutional-networks.md --- convolutional-networks.md | 109 +++++++++++++++++++------------------- 1 file changed, 55 insertions(+), 54 deletions(-) diff --git a/convolutional-networks.md b/convolutional-networks.md index 3acc346a..443d378e 100644 --- a/convolutional-networks.md +++ b/convolutional-networks.md @@ -5,25 +5,25 @@ permalink: /convolutional-networks-kr/ Table of Contents: -- [Architecture Overview](#overview) -- [ConvNet Layers](#layers) - - [Convolutional Layer](#conv) - - [Pooling Layer](#pool) - - [Normalization Layer](#norm) - - [Fully-Connected Layer](#fc) - - [Converting Fully-Connected Layers to Convolutional Layers](#convert) -- [ConvNet Architectures](#architectures) - - [Layer Patterns](#layerpat) - - [Layer Sizing Patterns](#layersizepat) - - [Case Studies](#case) (LeNet / AlexNet / ZFNet / GoogLeNet / VGGNet) - - [Computational Considerations](#comp) -- [Additional References](#add) - -## 컨볼루션 신경망 (CNN/ConvNets) - -컨볼루션 신경망 (Convolutional Neural Network, 이하 CNN)은 앞 장에서 다룬 일반 신경망과 매우 유사하다. CNN은 학습 가능한 가중치 (weight)와 바이어스(bias)로 구성되어 있다. 각 뉴런은 입력을 받아 내적 연산( dot product )을 한 뒤 선택에 따라 비선형 (non-linear) 연산을 한다. 전체 네트워크는 일반 신경망과 마찬가지로 미분 가능한 하나의 스코어 함수 (score function)을 갖게 된다 (맨 앞쪽에서 로우 이미지 (raw image)를 읽고 맨 뒤쪽에서 각 클래스에 대한 점수를 구하게 됨). 
또한 CNN은 마지막 레이어에 (SVM/Softmax와 같은) 손실 함수 (loss function)을 가지며, 우리가 일반 신경망을 학습시킬 때 사용하던 각종 기법들을 동일하게 적용할 수 있다. - -CNN과 일반 신경망의 차이점은 무엇일까? CNN 아키텍쳐는 입력 데이터가 이미지라는 가정 덕분에 이미지 데이터가 갖는 특성들을 인코딩 할 수 있다. 이러한 아키텍쳐는 포워드 함수 (forward function)을 더욱 효과적으로 구현할 수 있고 네트워크를 학습시키는데 필요한 모수 (parameter)의 수를 크게 줄일 수 있게 해준다. +- [아키텍쳐 개요](#overview) +- [ConvNet을 이루는 레이어들](#layers) + - [컨볼루셔널 레이어](#conv) + - [풀링 레이어](#pool) + - [Normalization 레이어](#norm) + - [Fully-Connected 레이어](#fc) + - [FC 레이어를 CONV 레이어로 변환하기](#convert) +- [ConvNet 구조](#architectures) + - [레이어 패턴](#layerpat) + - [레이어 크기 결정 패턴](#layersizepat) + - [케이스 스터디](#case) (LeNet / AlexNet / ZFNet / GoogLeNet / VGGNet) + - [계산 관련 고려사항들](#comp) +- [추가 레퍼런스](#add) + +## 컨볼루션 신경망 (ConvNet) + +컨볼루션 신경망 (Convolutional Neural Network, 이하 ConvNet)은 앞 장에서 다룬 일반 신경망과 매우 유사하다. ConvNet은 학습 가능한 가중치 (weight)와 바이어스(bias)로 구성되어 있다. 각 뉴런은 입력을 받아 내적 연산( dot product )을 한 뒤 선택에 따라 비선형 (non-linear) 연산을 한다. 전체 네트워크는 일반 신경망과 마찬가지로 미분 가능한 하나의 스코어 함수 (score function)을 갖게 된다 (맨 앞쪽에서 로우 이미지 (raw image)를 읽고 맨 뒤쪽에서 각 클래스에 대한 점수를 구하게 됨). 또한 ConvNet은 마지막 레이어에 (SVM/Softmax와 같은) 손실 함수 (loss function)을 가지며, 우리가 일반 신경망을 학습시킬 때 사용하던 각종 기법들을 동일하게 적용할 수 있다. + +ConvNet과 일반 신경망의 차이점은 무엇일까? ConvNet 아키텍쳐는 입력 데이터가 이미지라는 가정 덕분에 이미지 데이터가 갖는 특성들을 인코딩 할 수 있다. 이러한 아키텍쳐는 포워드 함수 (forward function)을 더욱 효과적으로 구현할 수 있고 네트워크를 학습시키는데 필요한 모수 (parameter)의 수를 크게 줄일 수 있게 해준다. @@ -33,23 +33,23 @@ CNN과 일반 신경망의 차이점은 무엇일까? CNN 아키텍쳐는 입력 일반 신경망은 이미지를 다루기에 적절하지 않다. CIFAR-10 데이터의 경우 각 이미지가 32x32x3 (가로,세로 32, 3개 컬러 채널)로 이뤄져 있어서 첫 번째 히든 레이어 내의 하나의 뉴런의 경우 32x32x3=3072개의 가중치가 필요하지만, 더 큰 이미지를 사용할 경우에는 같은 구조를 이용하는 것이 불가능하다. 예를 들어 200x200x3의 크기를 가진 이미지는 같은 뉴런에 대해 200x200x3=120,000개의 가중치를 필요로 하기 때문이다. 더욱이, 이런 뉴런이 레이어 내에 여러개 존재하므로 모수의 개수가 크게 증가하게 된다. 이와 같이 Fully-connectivity는 심한 낭비이며 많은 수의 모수는 곧 오버피팅(overfitting)으로 귀결된다. -CNN은 입력이 이미지로 이뤄져 있다는 특징을 살려 좀 더 합리적인 방향으로 아키텍쳐를 구성할 수 있다. 특히 일반 신경망과 달리, CNN의 레이어들은 가로,세로,깊이의 3개 차원을 갖게 된다 ( 여기에서 말하는 깊이란 전체 신경망의 깊이가 아니라 액티베이션 볼륨 ( activation volume ) 에서의 3번 째 차원을 이야기 함 ). 예를 들어 CIFAR-10 이미지는 32x32x3 (가로,세로,깊이) 의 차원을 갖는 입력 액티베이션 볼륨 (activation volume)이라고 볼 수 있다. 조만간 보겠지만, 하나의 레이어에 위치한 뉴런들은 일반 신경망과는 달리 앞 레이어의 전체 뉴런이 아닌 일부에만 연결이 되어 있다. CNN 아키텍쳐는 전체 이미지를 클래스 점수들로 이뤄진 하나의 벡터로 만들어주기 때문에 마지막 출력 레이어는 1x1x10(10은 CIFAR-10 데이터의 클래스 개수)의 차원을 가지게 된다. 이에 대한 그럼은 아래와 같다: +ConvNet은 입력이 이미지로 이뤄져 있다는 특징을 살려 좀 더 합리적인 방향으로 아키텍쳐를 구성할 수 있다. 특히 일반 신경망과 달리, ConvNet의 레이어들은 가로,세로,깊이의 3개 차원을 갖게 된다 ( 여기에서 말하는 깊이란 전체 신경망의 깊이가 아니라 액티베이션 볼륨 ( activation volume ) 에서의 3번 째 차원을 이야기 함 ). 예를 들어 CIFAR-10 이미지는 32x32x3 (가로,세로,깊이) 의 차원을 갖는 입력 액티베이션 볼륨 (activation volume)이라고 볼 수 있다. 조만간 보겠지만, 하나의 레이어에 위치한 뉴런들은 일반 신경망과는 달리 앞 레이어의 전체 뉴런이 아닌 일부에만 연결이 되어 있다. ConvNet 아키텍쳐는 전체 이미지를 클래스 점수들로 이뤄진 하나의 벡터로 만들어주기 때문에 마지막 출력 레이어는 1x1x10(10은 CIFAR-10 데이터의 클래스 개수)의 차원을 가지게 된다. 이에 대한 그럼은 아래와 같다:
- -
좌: 일반 3-레이어 신경망. 우: 그림과 같이 CNN은 뉴런들을 3차원으로 배치한다. CNN의 모든 레이어는 3차원 입력 볼륨을 3차원 출력 볼륨으로 변환 (transform) 시킨다. 이 예제에서 붉은 색으로 나타난 입력 레이어는 이미지를 입력으로 받으므로, 이 레이어의 가로/세로/채널은 각각 이미지의 가로/세로/3(Red,Green,Blue) 이다.
+ +
좌: 일반 3-레이어 신경망. 우: 그림과 같이 ConvNet은 뉴런들을 3차원으로 배치한다. ConvNet의 모든 레이어는 3차원 입력 볼륨을 3차원 출력 볼륨으로 변환 (transform) 시킨다. 이 예제에서 붉은 색으로 나타난 입력 레이어는 이미지를 입력으로 받으므로, 이 레이어의 가로/세로/채널은 각각 이미지의 가로/세로/3(Red,Green,Blue) 이다.
-> CNN은 여러 레이어로 이루어져 있다. 각각의 레이어는 3차원의 볼륨을 입력으로 받고 미분 가능한 함수를 거쳐 3차원의 볼륨을 출력하는 간단한 기능을 한다. +> ConvNet은 여러 레이어로 이루어져 있다. 각각의 레이어는 3차원의 볼륨을 입력으로 받고 미분 가능한 함수를 거쳐 3차원의 볼륨을 출력하는 간단한 기능을 한다. -### CNN을 이루는 레이어들 +### ConvNet을 이루는 레이어들 -위에서 다룬 것과 같이, CNN의 각 레이어는 미분 가능한 변환 함수를 통해 하나의 액티베이션 볼륨을 또다른 액티베이션 볼륨으로 변환 (transform) 시킨다. CNN 아키텍쳐에서는 크게 컨볼루셔널 레이어, 풀링 레이어, Fully-connected 레이어라는 3개 종류의 레이어가 사용된다. 전체 CNN 아키텍쳐는 이 3 종류의 레이어들을 쌓아 만들어진다. +위에서 다룬 것과 같이, ConvNet의 각 레이어는 미분 가능한 변환 함수를 통해 하나의 액티베이션 볼륨을 또다른 액티베이션 볼륨으로 변환 (transform) 시킨다. ConvNet 아키텍쳐에서는 크게 컨볼루셔널 레이어, 풀링 레이어, Fully-connected 레이어라는 3개 종류의 레이어가 사용된다. 전체 ConvNet 아키텍쳐는 이 3 종류의 레이어들을 쌓아 만들어진다. -*예제: 아래에서 더 자세하게 배우겠지만, CIFAR-10 데이터를 다루기 위한 간단한 CNN은 [INPUT-CONV-RELU-POOL-FC]로 구축할 수 있다. +*예제: 아래에서 더 자세하게 배우겠지만, CIFAR-10 데이터를 다루기 위한 간단한 ConvNet은 [INPUT-CONV-RELU-POOL-FC]로 구축할 수 있다. - INPUT 입력 이미지가 가로32, 세로32, 그리고 RGB 채널을 가지는 경우 입력의 크기는 [32x32x3]. - CONV 레이어는 입력 이미지의 일부 영역과 연결되어 있으며, 이 연결된 영역과 자신의 가중치의 내적 연산 (dot product) 을 계산하게 된다. 결과 볼륨은 [32x32x12]와 같은 크기를 갖게 된다. @@ -57,20 +57,20 @@ CNN은 입력이 이미지로 이뤄져 있다는 특징을 살려 좀 더 합 - POOL 레이어는 (가로,세로) 차원에 대해 다운샘플링 (downsampling)을 수행해 [16x16x12]와 같이 줄어든 볼륨을 출력한다. - FC (fully-connected) 레이어는 클래스 점수들을 계산해 [1x1x10]의 크기를 갖는 볼륨을 출력한다. 10개 숫자들은 10개 카테고리에 대한 클래스 점수에 해당한다. 레이어의 이름에서 유추 가능하듯, 이 레이어는 이전 볼륨의 모든 요소와 연결되어 있다. -이와 같이, CNN은 픽셀 값으로 이뤄진 원본 이미지를 각 레이어를 거치며 클래스 점수로 변환 (transform) 시킨다. 한 가지 기억할 것은, 어떤 레이어는 모수 (parameter)를 갖지만 어떤 레이어는 모수를 갖지 않는다는 것이다. 특히 CONV/FC 레이어들은 단순히 입력 볼륨만이 아니라 가중치(weight)와 바이어스(bias) 또한 포함하는 액티베이션(activation) 함수이다. 반면 RELU/POOL 레이어들은 고정된 함수이다. CONV/FC 레이어의 모수 (parameter)들은 각 이미지에 대한 클래스 점수가 해당 이미지의 레이블과 같아지도록 그라디언트 디센트 (gradient descent)로 학습된다. +이와 같이, ConvNet은 픽셀 값으로 이뤄진 원본 이미지를 각 레이어를 거치며 클래스 점수로 변환 (transform) 시킨다. 한 가지 기억할 것은, 어떤 레이어는 모수 (parameter)를 갖지만 어떤 레이어는 모수를 갖지 않는다는 것이다. 특히 CONV/FC 레이어들은 단순히 입력 볼륨만이 아니라 가중치(weight)와 바이어스(bias) 또한 포함하는 액티베이션(activation) 함수이다. 반면 RELU/POOL 레이어들은 고정된 함수이다. CONV/FC 레이어의 모수 (parameter)들은 각 이미지에 대한 클래스 점수가 해당 이미지의 레이블과 같아지도록 그라디언트 디센트 (gradient descent)로 학습된다. 요약해보면: -- CNN 아키텍쳐는 여러 레이어를 통해 입력 이미지 볼륨을 출력 볼륨 ( 클래스 점수 )으로 변환시켜 준다. -- CNN은 몇 가지 종류의 레이어로 구성되어 있다. CONV/FC/RELU/POOL 레이어가 현재 가장 많이 쓰인다. +- ConvNet 아키텍쳐는 여러 레이어를 통해 입력 이미지 볼륨을 출력 볼륨 ( 클래스 점수 )으로 변환시켜 준다. +- ConvNet은 몇 가지 종류의 레이어로 구성되어 있다. CONV/FC/RELU/POOL 레이어가 현재 가장 많이 쓰인다. - 각 레이어는 3차원의 입력 볼륨을 미분 가능한 함수를 통해 3차원 출력 볼륨으로 변환시킨다. - 모수(parameter)가 있는 레이어도 있고 그렇지 않은 레이어도 있다 (FC/CONV는 모수를 갖고 있고, RELU/POOL 등은 모수가 없음). - 초모수 (hyperparameter)가 있는 레이어도 있고 그렇지 않은 레이어도 있다 (CONV/FC/POOL 레이어는 초모수를 가지며 RELU는 가지지 않음).
- +
- CNN 아키텍쳐의 액티베이션 (activation) 예제. 첫 볼륨은 로우 이미지(raw image)를 다루며, 마지막 볼륨은 클래스 점수들을 출력한다. 입/출력 사이의 액티베이션들은 그림의 각 열에 나타나 있다. 3차원 볼륨을 시각적으로 나타내기가 어렵기 때문에 각 행마다 볼륨들의 일부만 나타냈다. 마지막 레이어는 모든 클래스에 대한 점수를 나타내지만 여기에서는 상위 5개 클래스에 대한 점수와 레이블만 표시했다. 전체 웹 데모는 우리의 웹사이트 상단에 있다. 여기에서 사용된 아키텍쳐는 작은 VGG Net이다. + ConvNet 아키텍쳐의 액티베이션 (activation) 예제. 첫 볼륨은 로우 이미지(raw image)를 다루며, 마지막 볼륨은 클래스 점수들을 출력한다. 입/출력 사이의 액티베이션들은 그림의 각 열에 나타나 있다. 3차원 볼륨을 시각적으로 나타내기가 어렵기 때문에 각 행마다 볼륨들의 일부만 나타냈다. 마지막 레이어는 모든 클래스에 대한 점수를 나타내지만 여기에서는 상위 5개 클래스에 대한 점수와 레이블만 표시했다. 전체 웹 데모는 우리의 웹사이트 상단에 있다. 여기에서 사용된 아키텍쳐는 작은 VGG Net이다.
@@ -80,7 +80,7 @@ CNN은 입력이 이미지로 이뤄져 있다는 특징을 살려 좀 더 합 #### 컨볼루셔널 레이어 (이하 CONV) -CONV 레이어는 CNN을 이루는 핵심 요소이다. CONV 레이어의 출력은 3차원으로 정렬된 뉴런들로 해석될 수 있다. 이제부터는 뉴런들의 연결성 (connectivity), 그들의 공간상의 배치, 그리고 모수 공유(parameter sharing) 에 대해 알아보자. +CONV 레이어는 ConvNet을 이루는 핵심 요소이다. CONV 레이어의 출력은 3차원으로 정렬된 뉴런들로 해석될 수 있다. 이제부터는 뉴런들의 연결성 (connectivity), 그들의 공간상의 배치, 그리고 모수 공유(parameter sharing) 에 대해 알아보자. **개요 및 직관적인 설명.** CONV 레이어의 모수(parameter)들은 일련의 학습가능한 필터들로 이뤄져 있다. 각 필터는 가로/세로 차원으로는 작지만 깊이 (depth) 차원으로는 전체 깊이를 아우른다. 포워드 패스 (forward pass) 때에는 각 필터를 입력 볼륨의 가로/세로 차원으로 슬라이딩 시키며 (정확히는 convolve 시키며) 2차원의 액티베이션 맵 (activation map)을 생성한다. 필터를 입력 위로 슬라이딩 시킬 때, 필터와 입력의 요소들 사이의 내적 연산 (dot product)이 이뤄진다. 직관적으로 설명하면, 이 신경망은 입력의 특정 위치의 특정 패턴에 대해 반응하는 (activate) 필터를 학습한다. 이런 액티베이션 맵 (activation map)을 깊이 (depth) 차원을 따라 쌓은 것이 곧 출력 볼륨이 된다. 그러므로 출력 볼륨의 각 요소들은 입력의 작은 영역만을 취급하고, 같은 액티베이션 맵 내의 뉴런들은 같은 모수들을 공유한다 (같은 필터를 적용한 결과이므로). 이제 이 과정에 대해 좀 더 깊이 파헤쳐보자. @@ -91,7 +91,7 @@ CONV 레이어는 CNN을 이루는 핵심 요소이다. CONV 레이어의 출력 *예제 2*. 입력 볼륨의 크기가 [16x16x20]이라고 하자. 3x3 크기의 리셉티브 필드를 사용하면 CONV 레이어의 각 뉴런은 입력 볼륨과 3x3x20=180 개의 연결을 갖게 된다. 이번에도 입력 볼륨의 깊이가 20이므로 마지막 숫자가 20이 된다는 것을 기억하자.
- +
좌: 입력 볼륨(붉은색, 32x32x3 크기의 CIFAR-10 이미지)과 첫번째 컨볼루션 레이어 볼륨. 컨볼루션 레이어의 각 뉴런은 입력 볼륨의 일부 영역에만 연결된다 (가로/세로 공간 차원으로는 일부 연결, 깊이(컬러 채널) 차원은 모두 연결). 컨볼루션 레이어의 깊이 차원의 여러 뉴런 (그림에서 5개)들이 모두 입력의 같은 영역을 처리한다는 것을 기억하자 (깊이 차원과 관련해서는 아래에서 더 자세히 알아볼 것임). 우: 입력의 일부 영역에만 연결된다는 점을 제외하고는, 이전 신경망 챕터에서 다뤄지던 뉴런들과 똑같이 내적 연산과 비선형 함수로 이뤄진다.


출력 볼륨의 공간적 크기 (가로/세로)는 입력 볼륨 크기 ($$W$$), CONV 레이어의 리셉티브 필드 크기($$F$$)와 stride ($$S$$), 그리고 제로 패딩 (zero-padding) 사이즈 ($$P$$) 의 함수로 계산할 수 있다. 즉, $$(W - F + 2P)/S + 1$$을 통해 알맞은 크기를 계산하면 된다. 만약 이 값이 정수가 아니라면 stride가 잘못 정해진 것이다. 이 경우 뉴런들이 대칭을 이루며 깔끔하게 배치되는 것이 불가능하다. 다음 예제를 보면 이 수식을 좀 더 직관적으로 이해할 수 있을 것이다:
- +
공간적 배치에 관한 그림. 이 예제에서는 가로/세로 공간적 차원 중 하나만 고려한다 (x축). 리셉티브 필드 F=3, 입력 사이즈 W=5, 제로 패딩 P=1. : 뉴런들이 stride S=1을 갖고 배치된 경우, 출력 사이즈는 (5-3+2)/1 +1 = 5이다. : stride S=2인 경우 (5-3+2)/2 + 1 = 3의 출력 사이즈를 가진다. Stride S=3은 사용할 수 없다. (5-3+2) = 4가 3으로 나눠지지 않기 때문에 출력 볼륨의 뉴런들이 깔끔히 배치되지 않는다. 이 예에서 뉴런들의 가중치는 [1,0,-1] (가장 오른쪽) 이며 bias는 0이다. 이 가중치는 노란 뉴런들 모두에게 공유된다 (아래에서 parameter sharing에 대해 살펴보라). @@ -127,7 +127,7 @@ CONV 레이어는 CNN을 이루는 핵심 요소이다. CONV 레이어의 출력 한 depth slice내의 모든 뉴런들이 같은 가중치 벡터를 갖기 때문에 컨볼루션 레이어의 forward pass는 입력 볼륨과 가중치 간의 **컨볼루션**으로 계산될 수 있다 (컨볼루션 레이어라는 이름이 붙은 이유). 그러므로 컨볼루션 레이어의 가중치는 **필터(filter)** 또는 **커널(kernel)**이라고 부른다. 컨볼루션의 결과물은 **액티베이션 맵(activation map, [55x55] 사이즈)** 이 되며 각 깊이에 해당하는 필터의 액티베이션 맵들을 쌓으면 최종 출력 볼륨 ([55x55x96] 사이즈) 가 된다.
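(역자 주: 파라미터 공유가 얼마나 큰 차이를 만드는지는 위의 [55x55x96] 출력 볼륨과 [11x11x3] 필터 예제로 간단히 계산해 볼 수 있다. 원문에 없는 계산 예시이다.)

~~~python
# 출력 볼륨 [55x55x96], 필터 크기 11x11, 입력 깊이 3인 경우
neurons = 55 * 55 * 96                # 뉴런 290,400개
weights_per_neuron = 11 * 11 * 3 + 1  # 뉴런 당 가중치 363개 + 바이어스 1개

no_sharing = neurons * weights_per_neuron  # 공유하지 않으면 105,705,600개의 파라미터
with_sharing = 96 * (11 * 11 * 3) + 96     # 공유하면 34,944개 (고유 필터 96개 + 바이어스 96개)

print(no_sharing, with_sharing)  # 105705600 34944
~~~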
- +
Krizhevsky et al. 에서 학습된 필터의 예. 96개의 필터 각각은 [11x11x3] 사이즈이며, 하나의 depth slice 내 55*55개 뉴런들이 이 필터들을 공유한다. 만약 이미지의 특정 위치에서 가로 엣지 (edge)를 검출하는 것이 중요했다면, 이미지의 다른 위치에서도 같은 특성이 중요할 수 있다 (이미지의 translationally-invariant한 특성 때문). 그러므로 55*55개 뉴런 각각에 대해 가로 엣지 검출 필터를 재학습 할 필요가 없다.
@@ -197,7 +197,7 @@ Numpy에서 `*`연산은 두 배열 간의 elementwise 곱셈이라는 것을 #### 풀링 레이어 (Pooling Layer) -CNN 구조 내에 컨볼루션 레이어들 중간중간에 주기적으로 풀링 레이어를 넣는 것이 일반적이다. 풀링 레이어가 하는 일은 네트워크의 파라미터의 개수나 연산량을 줄이기 위해 representation의 spatial한 사이즈를 줄이는 것이다. 이는 오버피팅을 조절하는 효과도 가지고 있다. 풀링 레이어는 MAX 연산을 각 depth slice에 대해 독립적으로 적용하여 spatial한 크기를 줄인다. 사이즈 2x2와 stride 2가 가장 많이 사용되는 풀링 레이어이다. 각 depth slice를 가로/세로축을 따라 1/2로 downsampling해 75%의 액티베이션은 버리게 된다. 이 경우 MAX 연산은 4개 숫자 중 최대값을 선택하게 된다 (같은 depth slice 내의 2x2 영역). Depth 차원은 변하지 않는다. 풀링 레이어의 특징들은 일반적으로 아래와 같다: +ConvNet 구조 내에 컨볼루션 레이어들 중간중간에 주기적으로 풀링 레이어를 넣는 것이 일반적이다. 풀링 레이어가 하는 일은 네트워크의 파라미터의 개수나 연산량을 줄이기 위해 representation의 spatial한 사이즈를 줄이는 것이다. 이는 오버피팅을 조절하는 효과도 가지고 있다. 풀링 레이어는 MAX 연산을 각 depth slice에 대해 독립적으로 적용하여 spatial한 크기를 줄인다. 사이즈 2x2와 stride 2가 가장 많이 사용되는 풀링 레이어이다. 각 depth slice를 가로/세로축을 따라 1/2로 downsampling해 75%의 액티베이션은 버리게 된다. 이 경우 MAX 연산은 4개 숫자 중 최대값을 선택하게 된다 (같은 depth slice 내의 2x2 영역). Depth 차원은 변하지 않는다. 풀링 레이어의 특징들은 일반적으로 아래와 같다: - $$W_1 \times H_1 \times D_1$$ 사이즈의 입력을 받는다 - 3가지 hyperparameter를 필요로 한다. @@ -215,8 +215,8 @@ CNN 구조 내에 컨볼루션 레이어들 중간중간에 주기적으로 풀 **일반적인 풀링**. Max 풀링 뿐 아니라 *average 풀링*, *L2-norm 풀링* 등 다른 연산으로 풀링할 수도 있다. Average 풀링은 과거에 많이 쓰였으나 최근에는 Max 풀링이 더 좋은 성능을 보이며 점차 쓰이지 않고 있다.
- - + +
풀링 레이어는 입력 볼륨의 각 depth slice를 spatial하게 downsampling한다. 좌: 이 예제에서는 입력 볼륨이 [224x224x64]이며 필터 크기 2, stride 2로 풀링해 [112x112x64] 크기의 출력 볼륨을 만든다. 볼륨의 depth는 그대로 유지된다는 것을 기억하자. 우: 가장 널리 쓰이는 max 풀링. 2x2의 4개 숫자에 대해 max를 취하게 된다.
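(역자 주: 2x2 필터와 stride 2를 사용하는 max 풀링 연산은 예를 들어 아래와 같이 표현해 볼 수 있다. 원문에 없는 numpy 스케치이며, 입력의 가로/세로가 짝수라고 가정한다.)

~~~python
import numpy as np

def max_pool_2x2(X):
    # X: [H x W x D] 입력 볼륨 (H, W는 짝수라고 가정)
    H, W, D = X.shape
    # 각 depth slice의 2x2 영역마다 최대값을 취해 가로/세로를 절반으로 줄임
    return X.reshape(H // 2, 2, W // 2, 2, D).max(axis=(1, 3))

X = np.random.randn(224, 224, 64)
print(max_pool_2x2(X).shape)  # (112, 112, 64) -- depth는 그대로 유지됨
~~~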
@@ -355,35 +355,36 @@ ConvNet에서 자주 볼 수 있는 특징으로써, 대부분의 메모리가 -#### 계산 관련 고려사항들 Computational Considerations +#### 계산 관련 고려사항들 ConvNet을 만들 때 일어나는 가장 큰 병목 현상은 메모리 병목이다. 최신 GPU들은 3/4/6GB의 메모리를 내장하고 있다. 가장 좋은 GPU들의 경우 12GB를 갖고 있다. 메모리와 관련해 주의깊게 살펴 볼 것은 크게 3가지이다. -The largest bottleneck to be aware of when constructing ConvNet architectures is the memory bottleneck. Many modern GPUs have a limit of 3/4/6GB memory, with the best GPUs having about 12GB of memory. There are three major sources of memory to keep track of: -- From the intermediate volume sizes: These are the raw number of **activations** at every layer of the ConvNet, and also their gradients (of equal size). Usually, most of the activations are on the earlier layers of a ConvNet (i.e. first Conv Layers). These are kept around because they are needed for backpropagation, but a clever implementation that runs a ConvNet only at test time could in principle reduce this by a huge amount, by only storing the current activations at any layer and discarding the previous activations on layers below. -- From the parameter sizes: These are the numbers that hold the network **parameters**, their gradients during backpropagation, and commonly also a step cache if the optimization is using momentum, Adagrad, or RMSProp. Therefore, the memory to store the parameter vector alone must usually be multiplied by a factor of at least 3 or so. -- Every ConvNet implementation has to maintain **miscellaneous** memory, such as the image data batches, perhaps their augmented versions, etc. +- 중간 단계의 볼륨 크기: 매 레이어에서 발생하는 액티베이션들과 그에 상응하는 그라디언트 (액티베이션과 같은 크기)의 개수이다. 보통 대부분의 액티베이션들은 ConvNet의 앞쪽 레이어들에서 발생된다 (예: 첫 번째 CONV 레이어). 이 값들은 backpropagation에 필요하기 때문에 계속 메모리에 두고 있어야 한다. 학습이 아닌 테스트에만 ConvNet을 사용할 때는 현재 처리 중인 레이어의 액티베이션 값을 제외한 앞쪽 액티베이션들은 버리는 방식으로 구현할 수 있다. +- 파라미터 크기: 신경망이 갖고 있는 파라미터의 개수이며, backpropagation을 위한 각 파라미터의 그라디언트, 그리고 최적화에 momentum, Adagrad, RMSProp 등을 사용한다면 이와 관련된 파라미터들도 캐싱해 놓아야 한다. 그러므로 파라미터 저장 공간은 기본적으로 (파라미터 개수의)3배 정도 더 필요하다. +- 모든 ConvNet 구현체는 이미지 데이터 배치 등을 위한 기타 용도의 메모리를 유지해야 한다. -Once you have a rough estimate of the total number of values (for activations, gradients, and misc), the number should be converted to size in GB. Take the number of values, multiply by 4 to get the raw number of bytes (since every floating point is 4 bytes, or maybe by 8 for double precision), and then divide by 1024 multiple times to get the amount of memory in KB, MB, and finally GB. If your network doesn't fit, a common heuristic to "make it fit" is to decrease the batch size, since most of the memory is usually consumed by the activations. +일단 액티베이션, 그라디언트, 기타용도에 필요한 값들의 개수를 예상했다면, GB 스케일로 바꿔야 한다. 예측한 개수에 4를 곱해 바이트 수를 구하고 (floating point가 4바이트, double precision의 경우 8바이트 이므로), 1024로 여러 번 나눠 KB, MB, GB로 바꾼다. 만약 신경망의 크기가 너무 크다면, 배치 크기를 줄이는 등의 휴리스틱을 이용해 (대부분의 메모리가 액티베이션에 사용되므로) 가용 메모리에 맞게 만들어야 한다. -### Visualizing and Understanding Convolutional Networks +### ConvNet의 시각화 및 이해 -In the [next section](../understanding-cnn/) of these notes we look at visualizing and understanding Convolutional Neural Networks. +[다음 섹션](../understanding-ConvNet/)에서는 ConvNet을 시각화하고, ConvNet이 어떤 정보들을 인코딩 하는지 알아본다. 
-### Additional Resources +### 추가 레퍼런스 -Additional resources related to implementation: +구현과 관련된 리소스들: -- [DeepLearning.net tutorial](http://deeplearning.net/tutorial/lenet.html) walks through an implementation of a ConvNet in Theano -- [cuda-convnet2](https://code.google.com/p/cuda-convnet2/) by Alex Krizhevsky is a ConvNet implementation that supports multiple GPUs -- [ConvNetJS CIFAR-10 demo](http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html) allows you to play with ConvNet architectures and see the results and computations in real time, in the browser. -- [Caffe](http://caffe.berkeleyvision.org/), one of the most popular ConvNet libraries. -- [Example Torch 7 ConvNet](https://github.com/nagadomi/kaggle-cifar10-torch7) that achieves 7% error on CIFAR-10 with a single model -- [Ben Graham's Sparse ConvNet](https://www.kaggle.com/c/cifar-10/forums/t/10493/train-you-very-own-deep-convolutional-network/56310) package, which Ben Graham used to great success to achieve less than 4% error on CIFAR-10. +- [DeepLearning.net tutorial](http://deeplearning.net/tutorial/lenet.html) Theano로 ConvNet을 구현하는 과정을 보여줌 +- [cuda-convnet2](https://code.google.com/p/cuda-convnet2/) Alex Krizhevsky가 여러 GPU를 사용해 ConvNet을 구현하는 방법을 알려주는 자료 +- [ConvNetJS CIFAR-10 demo](http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html) 브라우저에서 ConvNet의 구조를 바꿔보고 결과를 실시간으로 볼 수 있는 자료 +- [Caffe](http://caffe.berkeleyvision.org/), 가장 널리 쓰이는 ConvNet 라이브러리 중 하나 +- [Example Torch 7 ConvNet](https://github.com/nagadomi/kaggle-cifar10-torch7) 하나의 모델로 CIFAR-10 데이터에 대해 7% 에러율을 기록한 코드 +- [Ben Graham's Sparse ConvNet](https://www.kaggle.com/c/cifar-10/forums/t/10493/train-you-very-own-deep-convolutional-network/56310) CIFAR-10에서 4% 이하의 에러율을 보인 패키지 ---

번역: 김택수 (jazzsaxmafia)

+ + From 76c92227620c41ff68ce5e24d99b7eba4fc10dc7 Mon Sep 17 00:00:00 2001 From: Taeksoo Kim Date: Thu, 12 May 2016 21:17:48 +0900 Subject: [PATCH 126/199] Update convolutional-networks.md --- convolutional-networks.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/convolutional-networks.md b/convolutional-networks.md index 443d378e..1360868d 100644 --- a/convolutional-networks.md +++ b/convolutional-networks.md @@ -136,9 +136,7 @@ CONV 레이어는 ConvNet을 이루는 핵심 요소이다. CONV 레이어의 가끔은 파라미터 sharing에 대한 가정이 부적절할 수도 있다. 특히 입력 이미지가 중심을 기준으로 찍힌 경우 (예를 들면 이미지 중앙에 얼굴이 있는 이미지), 이미지의 각 영역에 대해 완전히 다른 feature들이 학습되어야 할 수 있다. 눈과 관련된 feature나 머리카락과 관련된 feature 등은 서로 다른 영역에서 학습될 것이다. 이런 경우에는 파라미터 sharing 기법을 접어두고 대신 **Locally-Connected Layer**라는 레이어를 사용하는 것이 좋다. **Numpy 예제.** 위에서 다룬 것들을 더 확실히 알아보기 위해 코드를 작성해보자. 입력 볼륨을 numpy 배열 `X`라고 하면: -- A *depth column* at position `(x,y)` would be the activations `X[x,y,:]`. - `(x,y)`위치에서의 *depth column*은 액티베이션 `X[x,y,:]`이 된다. -- A *depth slice*, or equivalently an *activation map* at depth `d` would be the activations `X[:,:,d]`. - depth `d`에서의 *depth slice*, 또는 *액티베이션 맵 (activation map)*은 `X[:,:,d]`가 된다. *컨볼루션 레이어 예제*. 입력 볼륨 `X`의 모양이 `X.shape: (11,11,4)`이고 제로 패딩은 사용하지 않으며($$P = 0$$) 필터 크기는 $$F = 5$$, stride $$S = 2$$라고 하자. 출력 볼륨의 spatial 크기 (가로/세로)는 (11-5)/2 + 1 = 4가 된다. 출력 볼륨의 액티베이션 맵 (`V`라고 하자) 는 아래와 같은 것이다 (아래에는 일부 요소만 나타냄). From 66368549c1d626952645353c5730368d0f68a560 Mon Sep 17 00:00:00 2001 From: Taeksoo Kim Date: Thu, 12 May 2016 21:18:30 +0900 Subject: [PATCH 127/199] Update convolutional-networks.md --- convolutional-networks.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/convolutional-networks.md b/convolutional-networks.md index 1360868d..1ee2bb64 100644 --- a/convolutional-networks.md +++ b/convolutional-networks.md @@ -172,7 +172,7 @@ Numpy에서 `*`연산은 두 배열 간의 elementwise 곱셈이라는 것을 - 파라미터 sharing로 인해 필터 당 $$F \cdot F \cdot D_1$$개의 가중치를 가져서 총 $$(F \cdot F \cdot D_1) \cdot K$$개의 가중치와 $$K$$개의 바이어스를 갖게 된다. - 출력 볼륨에서 $$d$$번째 depth slice ($$W_2 \times H_2$$ 크기)는 입력 볼륨에 $$d$$번째 필터를 stride $$S$$만큼 옮겨가며 컨볼루션 한 뒤 $$d$$번째 바이어스를 더한 결과이다. -흔한 Hyperparameter기본 세팅은 $$F = 3, S = 1, P = 1$$이다. 뒤에서 다룰 [ConvNet architectures](#architectures)에서 hyperparameter 세팅과 관련된 법칙이나 방식 등을 확인할 수 있다. +흔한 Hyperparameter기본 세팅은 $$F = 3, S = 1, P = 1$$이다. 뒤에서 다룰 [ConvNet 구조](#architectures)에서 hyperparameter 세팅과 관련된 법칙이나 방식 등을 확인할 수 있다. **컨볼루션 데모**. 아래는 컨볼루션 레이어 데모이다. 3차원 볼륨은 시각화하기 힘드므로 각 행마다 depth slice를 하나씩 배치했다. 각 볼륨은 입력 볼륨(파란색), 가중치 볼륨(빨간색), 출력 볼륨(녹색)으로 이뤄진다. 입력 볼륨의 크기는 $$W_1 = 5, H_1 = 5, D_1 = 3$$이고 컨볼루션 레이어의 파라미터들은 $$K = 2, F = 3, S = 2, P = 1$$이다. 즉, 2개의 $$3 \times 3$$크기의 필터가 각각 stride 2마다 적용된다. 그러므로 출력 볼륨의 spatial 크기 (가로/세로)는 (5 - 3 + 2)/2 + 1 = 3이다. 제로 패딩 $$P = 1$$ 이 적용되어 입력 볼륨의 가장자리가 모두 0으로 되어있다는 것을 확인할 수 있다. 아래의 영상에서 하이라이트 표시된 입력(파란색)과 필터(빨간색)이 elementwise로 곱해진 뒤 하나로 더해지고 bias가 더해지는걸 볼 수 있다. From e4ad9b572f860d1e7fb71cf5e8ca939552cb1986 Mon Sep 17 00:00:00 2001 From: jung_hojin Date: Fri, 13 May 2016 03:35:05 +0900 Subject: [PATCH 128/199] Fix autogenerated English captions for lecture 4 lines 201 ~ 400 Fixes autogenerated captions from lines 201 to 400. Some inaudible lines are marked with '[??]' for now. 
--- captions/En/Lecture4_en.srt | 667 ++++++++++++++++++------------------ 1 file changed, 331 insertions(+), 336 deletions(-) diff --git a/captions/En/Lecture4_en.srt b/captions/En/Lecture4_en.srt index b2eb9a16..69db59ba 100644 --- a/captions/En/Lecture4_en.srt +++ b/captions/En/Lecture4_en.srt @@ -989,102 +989,102 @@ of every single intermediate value in 201 00:14:58,399 --> 00:15:02,158 -that graph on the final loss function -and so will see many examples of this +that graph on the final loss function. +So we'll see many examples of this 202 00:15:02,158 --> 00:15:06,918 -truck is like her I'll go into a -specific example there is a slightly +throughout the lecture. I'll go into a +specific example that is a slightly 203 00:15:06,918 --> 00:15:11,298 larger and we'll work through it in -detail but i dont their own questions at +detail. But I don't know if there are any questions at 204 00:15:11,298 --> 00:15:20,389 -this point that I would like to ask -ahead I'm going to come back to that you +this point that anyone would like to ask. +Go ahead. If z is used by multiple nodes, I'm going to come back to that. 205 00:15:20,389 --> 00:15:25,538 -add the gradients the grading the -cognitive Adam so if Z is being employed +You add the gradients. The gradient, the +correct thing to do is you add them. So if z is being influenced 206 00:15:25,538 --> 00:15:29,928 -in multiple places in the circus the -back roads closed will add that will +in multiple places in the circuit, the +backward flows will add. I will 207 00:15:29,928 --> 00:15:31,539 -come back to that point +come back to that point. 208 00:15:31,539 --> 00:16:03,139 -like we're going to get the all of those -issues and we're gonna see ya you're +So I think, I would've repeated your question, but you're jumping ahead like 100 slides. So we're going to get the all of those +issues and we're going to see, you're 209 00:16:03,139 --> 00:16:05,769 -gonna get what we call banishing -gradient problems and so on +going to get what we call vanishing +gradient problems and so on. 210 00:16:05,769 --> 00:16:10,669 -we'll see let's go through another -example to make this more concrete so +We'll see. Okay, let's go through another +example to make this more concrete. 211 00:16:10,669 --> 00:16:14,318 -here we have another circuit it happens +So here we have another circuit. It happens to be computing a little two-dimensional 212 00:16:14,318 --> 00:16:18,179 -in Iran but for now don't worry about -that interpretation just think of this +sigmoid neuron, but for now don't worry about +that interpretation. Just think of this 213 00:16:18,179 --> 00:16:22,849 -as that's an expression so one over -one-plus key to the whatever number of +as, that's an expression so one-over- +one-plus-e-to-the-whatever, so the number of 214 00:16:22,850 --> 00:16:29,000 -inputs here is by Andrew function and we -have a single output over there and I +inputs here is five, and we're computing that function and we +have a single output over there, okay? 
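[Editor's note] The captions that follow walk through backpropagation on this circuit step by step. A minimal numeric sketch of the same walkthrough is given below, assuming the example values from the lecture slides (w0=2, x0=-1, w1=-3, x1=-2, w2=-3) for f(w,x) = 1/(1+e^-(w0*x0 + w1*x1 + w2)); the variable names are made up for illustration only.

```python
import math

# forward pass, gate by gate (values match the slide example)
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0
dot = w0*x0 + w1*x1 + w2        #  1.0  ('*' gates and '+' gates)
e   = math.exp(-dot)            #  0.37 (*-1 gate, then exp gate)
den = e + 1.0                   #  1.37 (+1 gate)
f   = 1.0 / den                 #  0.73 (1/x gate)

# backward pass: at each gate, local gradient times the gradient from above
dden = (-1.0 / den**2) * 1.0    # -0.53 (1/x gate; the recursion starts with 1.0)
de   = 1.0 * dden               # -0.53 (+1 gate passes the gradient through)
dneg = math.exp(-dot) * de      # -0.20 (exp gate: d/dx e^x = e^x)
ddot = -1.0 * dneg              #  0.20 (*-1 gate flips the sign)
dw2  = ddot                     #  0.20 ('+' gate distributes gradients equally)
dw0, dx0 = x0*ddot, w0*ddot     # -0.20, 0.40 ('*' gate swaps its inputs)
dw1, dx1 = x1*ddot, w1*ddot     # -0.40, -0.60

# collapsing everything after the dot product into one sigmoid gate
# gives the same ddot via the local gradient (1 - sigma(x)) * sigma(x)
sigma = 1.0 / (1.0 + math.exp(-dot))
print(round((1 - sigma) * sigma, 2))   # 0.2, same as ddot above
```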
215 00:16:29,000 --> 00:16:32,490 -translated that mathematical expression -into this competition in draft form so +And I translated that mathematical expression +into this computational graph form, so 216 00:16:32,490 --> 00:16:35,769 we have to recursively from inside out -compete with expression so a person do +compute this expression so we first do 217 00:16:35,769 --> 00:16:42,129 -all the little W times access and then +all the little w times x's, and then we add them all up and then we take a 218 00:16:42,129 --> 00:16:46,129 -negative of it and then we exponentially -that and they had one and then we +negative of it and then we exponentiate +that and then we add one and then we 219 00:16:46,129 --> 00:16:49,769 finally divide and we get the result of -the expression and so we're going to do +the expression. And so we're going to do 220 00:16:49,769 --> 00:16:52,409 -now is we're going to back propagate -through this expression we're going to +now is we're going to backpropagate +through this expression. We're going to 221 00:16:52,409 --> 00:16:56,500 @@ -1093,27 +1093,27 @@ single input value is on the output of 222 00:16:56,500 --> 00:17:07,230 -this expression that is degrading here +this expression, what is the gradient here. 223 00:17:07,230 --> 00:17:22,039 -so for now the US is just a binary plus -its entirety + gate and we have a plus +So for now, you're concerned about the interpretation of plus may be in these circles. For now, let's just assume that this plus is a binary '+', +It's a binary '+' gate, and we have there 224 00:17:22,039 --> 00:17:26,519 -one gate I'm making up these gates on -the spot and we'll see that what is a +plus one gate. I'm making up these gates on +the spot, and we'll see that what is a 225 00:17:26,519 --> 00:17:31,519 gate or is not a gate is kind of up to -you come back to this point of it so for +you. I'll come back to this point in a bit. 226 00:17:31,519 --> 00:17:35,639 -now I just like we have several more -gates that we're using throughout and so +So for now, I just like, we have several more +gates that we're using throughout, and so 227 00:17:35,640 --> 00:17:38,650 @@ -1122,530 +1122,525 @@ through this example several of these 228 00:17:38,650 --> 00:17:42,720 -derivatives exponentiation and we know +derivatives. So we have exponentiation and we know for every little local gate what these 229 00:17:42,720 --> 00:17:49,048 -local gradients are right so we can do -that using calculus so the extra tax and +local gradients are, right? So we can derive +that using calculus. So e^x derivative is e^x and 230 00:17:49,048 --> 00:17:52,900 -so on so these are all the operations +so on. So these are all the operations and also addition and multiplication 231 00:17:52,900 --> 00:17:56,040 which I'm assuming that you have -memorized in terms of what the great +memorized in terms of what the gradients 232 00:17:56,039 --> 00:17:58,970 -things look like they're going to start +look like. So we're going to start off at the end of the circuit and I've 233 00:17:58,970 --> 00:18:03,450 -already filled in a one point zero zero +already filled in a 1.00 in the back because that's how we always 234 00:18:03,450 --> 00:18:04,860 -start this recursion +start this recursion with a 1.0 235 00:18:04,859 --> 00:18:10,519 -1110 right but since that's the gradient -on the identity function now we're going +right since that's the gradient +on the identity function. 
Now we're going

236
00:18:10,519 --> 00:18:17,849
-to back propagate through this one over
-x operation ok so the relative of one of
+to backpropagate through this 1/x
+operation, okay? So the derivative of 1/x

237
00:18:17,849 --> 00:18:22,048
-wrecks the local gradient is a negative
-one over x squared so that none of Rex
+the local gradient is -1/(x^2),
+so that 1/x gate

238
00:18:22,048 --> 00:18:27,119
-gate during the forward pass received
-input 1.37 and right away that one of
+during the forward pass received
+input 1.37 and right away that 1/x gate

239
00:18:27,119 --> 00:18:30,759
-her ex Kate could have computed what the
-local gradients the local variant was
+could have computed what the
+local gradient was. The local gradient was

240
00:18:30,759 --> 00:18:35,048
-negative one over x squared and ordering
-back propagation and has to buy tramadol
+-1/(x^2) and now during backpropagation,
+it has to, by chain rule,

241
00:18:35,048 --> 00:18:40,750
 multiply that local gradient by the
-gradient of it on the final of the
+gradient of it on the final output of the circuit

242
00:18:40,750 --> 00:18:44,789
-circuit which is easy because it happens
-to be so what ends up being the
+which is easy because it happens
+to be at the end. So what ends up being the

243
00:18:44,789 --> 00:18:51,349
-expression for the back propagated
-reading here from one of my ex Kate
+expression for the backpropagated
+gradient here, from 1/x gate

244
00:18:51,349 --> 00:18:59,829
-but she always has two pieces local
-gradient times the gradient from or from
+The chain rule always has two pieces: local
+gradient times the gradient from the top or from above.

245
00:18:59,829 --> 00:19:18,069
-which is the gradient DFID X so that
-that is the local gradient
+So we get -1/x^2, which is the gradient df/dx.
+So that is the local gradient.

246
00:19:18,069 --> 00:19:23,480
-giving one over 3.7 squared and then
-multiplied by one point zero which is
+-1/1.37^2 and then multiplied by 1.0 which is

247
00:19:23,480 --> 00:19:27,940
-degrading from which is really just one
-because we just started and so applying
+the gradient from above, which is really just 1
+because we've just started, and I'm applying

248
00:19:27,940 --> 00:19:34,850
-general right away here and the other is
-negative 01534 that's the gradient on
+chain rule right away here and the output is
+-0.53. So that's the gradient on

249
00:19:34,849 --> 00:19:38,798
-that piece of the wire where this valley
-was blowing ok so it has a negative
+that piece of the wire, where this value
+was flowing, okay. So it has a negative

250
00:19:38,798 --> 00:19:43,889
-effect on the outfit you might expect
-that right because if you were to
+effect on the output. And you might expect
+that right, because if you were to

251
00:19:43,890 --> 00:19:47,850
 increase this value and then it goes
-through a gate of one over x then
+through a gate of 1/x, then if you

252
00:19:47,849 --> 00:19:50,939
-increased amount of Rex get smaller so
-that's why you're seeing negative
+increase this, 1/x gets smaller, so
+that's why you're seeing a negative

253
00:19:50,940 --> 00:19:55,620
-gradient rate we're going to continue
-back propagation here in the next gate
+gradient, right. So we're going to continue
+backpropagation here.
The next gate 254 00:19:55,619 --> 00:20:01,048 -in the circuit it's adding a constant of -one so the local gradient if you look at +in the circuit, it's adding a constant of +1, so the local gradient, if you look at 255 00:20:01,048 --> 00:20:06,960 -adding a constant to a value the -gradient off on exit is just one right +adding a constant to a value, the +gradient of, on x is just 1, right? 256 00:20:06,960 --> 00:20:13,169 -to talk to us and so the change gradient +From basic calculus. And so the chained gradient here that we continue along the wire 257 00:20:13,169 --> 00:20:22,940 -will be your local gradient which has -one time the gradient from above the +will be... We have a local gradient, which is +1 times the gradient from above the 258 00:20:22,940 --> 00:20:28,590 -gate which it has just learned is -negative Jul 23 2013 continues along the +gate, which it has just learned is +-0.53, okay? So -0.53 continues along the 259 00:20:28,589 --> 00:20:34,709 -way are unchanged and intuitively that -makes sense right because this is value +wire unchanged. And intuitively that +makes sense right, because this value 260 00:20:34,710 --> 00:20:38,319 floats and it has some influence on the -final circuit and if you're if you're +final circuit and now, if you're 261 00:20:38,319 --> 00:20:42,798 -adding one then its influence its rate -of change of slope toward the final +adding 1, then its influence, its rate +of change, its slope towards the final 262 00:20:42,798 --> 00:20:46,970 -value doesn't change if you increase -this by some amount the effect at the +value doesn't change. If you increase +this by some amount, the effect at the 263 00:20:46,970 --> 00:20:51,548 -end will be the same because the rate of -change doesn't change through the +1 +end will be the same, because the rate of +change doesn't change through the +1 gate. 264 00:20:51,548 --> 00:20:56,859 -gays just a constant officer continued -innovation here so the gradient of the +It's just a constant offset. Okay, we continued +derivation here. So the gradient of e^x is 265 00:20:56,859 --> 00:21:01,599 -axe the axe so you can come back -propagation we're going to perform +e^x, so to continue backpropagation we're going to perform, 266 00:21:01,599 --> 00:21:05,000 -gates input of negative one +so this gate saw input of negative one. 267 00:21:05,000 --> 00:21:08,329 -it right away could have completed its -local gradient and now it knows that the +It right away could have computed its +local gradient, and now it knows that the 268 00:21:08,329 --> 00:21:12,259 -gradient from above is negative point by -three so the continued backpropagation +gradient from above is -0.53. +So to continue backpropagation 269 00:21:12,259 --> 00:21:20,000 -here in applying chain rule would -received the rhetorical questions I'm +here and apply chain rule, we would +receive [STUDENT ANSWER] Okay, so these are most of the rhetorical questions so I'm [??] 270 00:21:20,000 --> 00:21:25,119 -not sure but but basically each of the -negative one which is the ex the ex +not sure, but yeah, basically e^(-1) +which is the e^x, 271 00:21:25,119 --> 00:21:30,569 -input to this expert eight times the -chain rule right to the point by three +the x input to this exp gate times the +chain rule, right, so the gradient from above is -0.53 272 00:21:30,569 --> 00:21:35,269 -so we keep multiplying their own so what -is the effect on me and what I have an +so we keep multiplying that on. 
So what
+is the effect on me and what do I have an

273
00:21:35,269 --> 00:21:39,069
-effect on the final end of the circuit
-those are being always multiplied so we
+effect on the final end of the circuit.
+Those are being always multiplied. So we

274
00:21:39,069 --> 00:21:46,859
-get negative 22 at this point so now we
-have a time to negative one gate so what
+get -0.2 at this point. So now we
+have a *(-1) gate. So what

275
00:21:46,859 --> 00:21:50,279
-ends up happening what happens to the
-gradient when you do it turns me on an
+ends up happening, what happens to the
+gradient when you do a times -1 in the

276
00:21:50,279 --> 00:21:57,139
-accomplished on da lips around right
-because we have basically constant input
+computational graph? It flips around, right?
+Because we have basically, a constant multiply of input

277
00:21:57,140 --> 00:22:02,038
 which happened to be a constant of
-negative one so negative one time one
+-1, so 1 * -1

278
00:22:02,038 --> 00:22:05,548
-time they dont give us negative one in
-the forward pass and so now we have to
+gave us -1 in the forward pass, and so now we have to

279
00:22:05,548 --> 00:22:09,569
-multiply by a that's the local gradient
-times the greeting from Bob which is
+multiply by a, that's the local gradient,
+times the gradient from above which is -0.2

280
00:22:09,569 --> 00:22:14,879
-fine too so we end up with just positive
-so now continue back propagation
+so we end up with just +0.2 now.
+So now we're continuing backpropagation

281
00:22:14,880 --> 00:22:21,110
-propagating + and this plus operation
-has multiple input here the green in the
+We're backpropagating '+' and this '+' operation
+has multiple input here, the gradient,

282
00:22:21,109 --> 00:22:25,599
-local gradient for the bus gate as one
-and 10 what ends up happening to the
+the local gradient for the '+' gate is 1
+and 1, so what ends up happening to,

283
00:22:25,599 --> 00:22:42,359
-brilliance flow along the upper buyers
+what gradients flow along the output wires?

284
00:22:42,359 --> 00:22:48,089
-surplus paid has a local gradient on all
-of its always will be just one because
+So the plus gate has a local gradient on all
+of its inputs always will be just one, right, because

285
00:22:48,089 --> 00:22:53,769
-if you just have a functioning you know
-experts why then for that function the
+if you just have a function, you know,
+x+y, then for that function

286
00:22:53,769 --> 00:22:58,109
-gradient on either X or Y is just one
+the gradient on either x or y is just one
 and so what you end up getting is just

287
00:22:58,109 --> 00:23:03,619
-one time spent two and so in fact for a
-plus gate always see see the same fact
+1 * 0.2. And so, in fact for a
+plus gate, always you see the same fact

288
00:23:03,619 --> 00:23:07,469
-where the local gradient all of its
-inputs is one and so whatever grading it
+where the local gradient of all of its
+inputs is 1, and so whatever gradient it

289
00:23:07,470 --> 00:23:11,289
-gets from above it just always
+gets from above, it just always
 distributes gradient equally to all of

290
00:23:11,289 --> 00:23:14,339
-its inputs because in the chain rule
-don't have multiplied and multiplied by
+its inputs, because in the chain rule,
+they'll get multiplied and when you multiply by 1

291
00:23:14,339 --> 00:23:18,129
-10 something remains unchanged surplus
-get this kind of like ingredient
+something remains unchanged. So a plus [??]
+gate, it's kind of like a gradient

292
00:23:18,130 --> 00:23:22,170
-distributor whereas something flows in
-from the top it all just spread out all
+distributor, where if something flows in
+from the top, it will just spread out all

293
00:23:22,170 --> 00:23:26,560
-the great teams equally to all of its
-children and so we've already received
+the gradients equally to all of its
+children. And so we've already received

294
00:23:26,559 --> 00:23:32,139
-one of the inputs gradient point to hear
+one of the inputs is gradient 0.2 here
 on the very final output of the circuit

295
00:23:32,140 --> 00:23:35,970
-and so this employees has been completed
+and so this influence has been computed
 through a series of applications of

296
00:23:35,970 --> 00:23:42,450
-trainer along the way there was another
-plus get that skipped over and so this
+chain rule along the way. There was another
+'+' gate that I skipped over, and so this

297
00:23:42,450 --> 00:23:47,090
-point you kind of this tribute to both
-20.2 equally so we've already done a
+0.2 kind of distributes to both,
+0.2 and 0.2 equally, so we've already done a

298
00:23:47,089 --> 00:23:51,750
-blockade and there's a multiply get
-there and so now we're going to back
+'+' gate, and there's a '*' gate there,
+and so now we're going to backpropagate

299
00:23:51,750 --> 00:23:55,940
-propagate through that multiply
-operation and so the local grade so the
+through that multiply operation

300
00:23:55,940 --> 00:24:06,450
-so what will be the gradient for w 00
-will be degrading 40 basically
+so what will be the gradient for w0 and x0?
+What will be the gradient for w0 specifically?

301
00:24:06,450 --> 00:24:19,059
-2000 you will be going in W one will be
-W 0:30 will be negative one times when
+[STUDENT ANSWER] Someone say 0? 0 will be wrong. It will be, so the gradient w1 will be, w0 sorry, will be

302
00:24:19,059 --> 00:24:24,389
-too good and the gradient on x zero will
-be there is a bug bite away in the slide
+-1 * 0.2. Good. And the gradient on x0 will
+be, there is a bug, by the way, in the slide

303
00:24:24,390 --> 00:24:27,840
 that I just noticed like few minutes
-before I actually create the class also
+before I actually created the class.

304
00:24:27,839 --> 00:24:34,289
-increase starting to class so you see .
-39 there it should be point for its
+Created the, started the class. So you see
+0.39 there it should be 0.4. It's

305
00:24:34,289 --> 00:24:37,480
-because of a bug in evangelization
-because I'm truncating a to the small
+because of a bug in the visualization
+because I'm truncating at 2-decimal

306
00:24:37,480 --> 00:24:41,190
-digits but basically that should be
-pointed or because the way you get that
+digits, but anyways, basically that should be
+0.4 because the way you get that

307
00:24:41,190 --> 00:24:45,400
-is two times pointed to get the point
-for just like I've written out there so
+is 2 * 0.2 gives you 0.4
+just like I've written out over there.

308
00:24:45,400 --> 00:24:50,980
-that's what the opportunity there okay
-so that we've been propagated the
+So that's what the output should be there.
+Okay, so with that, we've backpropagated this

309
00:24:50,980 --> 00:24:55,190
-circuit here and we get through this
+circuit here and we've backpropagated through this
 expression and so you might imagine in

310
00:24:55,190 --> 00:24:59,289
-there are actual downstream applications
-will have data and all the parameters as
+our actual downstream applications,
+we'll have data and all the parameters as inputs

311
00:24:59,289 --> 00:25:03,450
-inputs loss functions at the top at the
-end it will be forward pass to evaluate
+the loss function is at the top at the
+end, so we'll do forward pass to evaluate

312
00:25:03,450 --> 00:25:06,440
-the loss function and then we'll back
-propagate through every piece of
+the loss function and then we'll backpropagate
+through every piece of

313
00:25:06,440 --> 00:25:10,450
-competition we've done along the way and
-Welbeck propagate through every gate to
+computation we've done along the way, and
+we'll backpropagate through every gate to

314
00:25:10,450 --> 00:25:14,150
-get our imports and back up again just
-means supply chain rule many many times
+get our inputs, and backpropagate just
+means apply chain rule many many times

315
00:25:14,150 --> 00:25:21,720
-and we'll see how that is implemented in
-but the question i guess im going to
+and we'll see how that is implemented in a bit.
+Sorry, did you have a question? [STUDENT QUESTION]

316
00:25:21,720 --> 00:25:31,769
-skip that because it's the same I'm
-going to skip the other questions
+Oh yes, so I'm going to skip that because it's the same.
+So I'm going to skip the other '*' gate. Any other questions at this point? [STUDENT QUESTION]

317
00:25:31,769 --> 00:25:45,869
-so the cost of forward and backward
-propagation is roughly almost always end
+That's right. So the costs of forward and backward
+propagation are roughly equal. Well, it should be, it almost always ends

318
00:25:45,869 --> 00:25:49,500
 up being basically equal when you look
-at timings usually the backup a slightly
+at timings, usually the backward pass is slightly

319
00:25:49,500 --> 00:25:58,710
-slower idea so let's see one thing I
-want to point out before in one is that
+slower, but yeah. Okay, so let's see, one thing I
+wanted to point out, before we move on, is that
And so there's a +sigmoid gate here, and I could have done 327 00:26:27,769 --> 00:26:32,440 -that in a single go sort of and when I -would have had to do if I wanted to have +that in a single go, sort of, and what I +would have had to do, if I wanted to have 328 00:26:32,440 --> 00:26:37,980 -that gate as I need to compute an -expression for how this so what is the +that gate, is I need to compute an +expression for how this, so what is the 329 00:26:37,980 --> 00:26:41,670 -local gradient for the sigmoid get -basically so what is the gradient of the +local gradient for the sigmoid gate +basically? So what is the gradient on the 330 00:26:41,670 --> 00:26:44,470 -small gate on its input and I had to go +sigmoid gate on its input and I have to go through some math which I'm not going to 331 00:26:44,470 --> 00:26:46,980 go into detail but you end up with that -expression over there +expression over there. 332 00:26:46,980 --> 00:26:51,750 -it ends up being 1-6 next time segment -of access to local gradient and that +It ends up being (1-sigmoid(x)) * sigmoid(x). +That's the local gradient and that 333 00:26:51,750 --> 00:26:55,450 -allows me to put this piece into a -competition graph because once I know +allows me to now, put this piece into a +computational graph, because once I know 334 00:26:55,450 --> 00:26:58,819 @@ -1655,37 +1650,37 @@ everything else is defined just through 335 00:26:58,819 --> 00:27:02,389 chain rule and multiply everything -together so we can back propagate +together. So we can backpropagate 336 00:27:02,390 --> 00:27:06,720 -through the sigmoid get down and the way -that would look like is input to the +through this sigmoid gate now, and the way +that would look like is, the input to the 337 00:27:06,720 --> 00:27:11,750 -gate was one point zero that's what flu -went into the gate and punk 73 went out +sigmoid gate was 1.0, that's what +went into the sigmoid gate, and 0.73 went out. 338 00:27:11,750 --> 00:27:18,759 -so . 7360 facts okay and we want to -local gradient which is as we've seen +So 0.73 is sigma of x, okay? And we want the +local gradient which is, as we've seen 339 00:27:18,759 --> 00:27:26,450 -from the math on their backs so you get -access point cemetery multiplying 1-23 +from the math that I performed there (1 - sigma(x)) * sigma(x) +so you get, sigma(x) is 0.73, multiplying (1 - 0.73) 340 00:27:26,450 --> 00:27:31,170 -that's the local gradient and then times -will work we happened to be at the end +that's the local gradient and then times, +we happened to be at the end 341 00:27:31,170 --> 00:27:36,330 -of the circuit so times 10 even writing -so we end up with 12 and of course we +of the circuit, so times 1.0, which I'm not even writing. +So we end up with 0.2. 
And of course we

342
00:27:36,329 --> 00:27:37,649
 get the same answer

343
00:27:37,650 --> 00:27:42,220
-point to as we received before 12
-because calculus works but basically we
+0.2, as we received before, 0.2,
+because calculus works, but basically we

344
00:27:42,220 --> 00:27:44,480
 down and

345
00:27:44,480 --> 00:27:47,450
-one piece at a time or we could just
-have a single signaled gate and it's
+did one piece at a time or we could just
+have a single sigmoid gate and that's

346
00:27:47,450 --> 00:27:51,569
-kind of up to us and what level up here
-are key to break these expressions and
+kind of up to us at what level of hierarchy
+do we break these expressions

347
00:27:51,569 --> 00:27:52,339
-so you'd like to
+and so you'd like to

348
00:27:52,339 --> 00:27:55,829
-intuitively clustered these expressions
+intuitively, cluster these expressions
 into single gates if it's very efficient

349
00:27:55,829 --> 00:28:06,819
-or easy to direct the local radiance
-because then they become your pieces so
+or easy to derive the local gradients
+because then those become your pieces. [STUDENT QUESTION]

350
00:28:06,819 --> 00:28:10,529
-the question is do libraries typically
-do that I do they worry about you know
+Yes. So the question is, do libraries typically
+do that? Do they worry about, you know

351
00:28:10,529 --> 00:28:14,058
-what's what's easy to convince the
-computer and the answer is yes I would
+what's easy or convenient to
+compute and the answer is yeah, I would say so,

352
00:28:14,058 --> 00:28:17,480
-say so so he noted that there are some
+So if you noted that there are some
 piece of operation you'd like to do over

353
00:28:17,480 --> 00:28:20,798
-and over again and it has a very simple
-local gradient that's something very
+and over again, and it has a very simple
+local gradient, then that's something very

354
00:28:20,798 --> 00:28:24,900
 appealing to actually create a single
-unit of and we'll see some of those
+unit out of, and we'll see some of those

355
00:28:24,900 --> 00:28:30,230
-examples actually but I think I'd like
-to also point out that once you the
+examples actually in a bit I think. Okay, I'd like
+to also point out that once you,

356
00:28:30,230 --> 00:28:32,490
-reason I like to think about these
-compositional grass is it really hope
+the reason I like to think about these
+computational graphs, is it really helps

357
00:28:32,490 --> 00:28:36,289
-your intuition to think about how greedy
-and slow in a neural network it's not
+your intuition to think about how gradients
+flow in a neural network.
It's not just, 358 00:28:36,289 --> 00:28:39,369 -just you don't want this to be a black -box do you want to understand +you don't want this to be a black +box to you, you want to understand 359 00:28:39,369 --> 00:28:43,959 -intuitively how this happens and you +intuitively how this happens, and you start to develop after a while of 360 00:28:43,960 --> 00:28:47,850 -looking at additional graphs intuitions -about how these graybeards flow and this +looking at computational graphs intuitions +about how these gradients flow, and this 361 00:28:47,849 --> 00:28:52,029 -might help you debug some issues like -say will go to banish ingredient problem +by the way, helps you debug some issues like, +say, we'll go to vanishing gradient problem 362 00:28:52,029 --> 00:28:55,950 @@ -1792,22 +1787,22 @@ what's going wrong in your optimization 363 00:28:55,950 --> 00:28:59,250 -if you understand how greedy and slow -and networks will help you debug these +if you understand how gradients flow +in networks. It will help you debug these 364 00:28:59,250 --> 00:29:02,740 -networks much more efficiently and so -some information for example we already +networks much more efficiently. And so +some intuitions for example, we already 365 00:29:02,740 --> 00:29:07,609 -saw the eighth at Gate it has a little -reading the one to all of its inputs so +saw the add gate. It has a local +gradient of one to all of its inputs, so 366 00:29:07,609 --> 00:29:11,279 -it's just a greeting distributor that's +it's just a gradient distributor. That's like a nice way to think about it 367 @@ -1817,78 +1812,78 @@ anywhere in your score function or your 368 00:29:14,548 --> 00:29:18,740 -comment or anywhere else it's -distributed ratings the max kate is +ConvNet or anywhere else. It just +distributes gradients equally. The max gate is 369 00:29:18,740 --> 00:29:23,009 -instead a great writer and way this -works is if you look at the expression +instead, a gradient router, and the way this +works is, if you look at the expression 370 00:29:23,009 --> 00:29:30,970 -like we have great these markers don't -work so if you have a very simple binary +like, we have. Great, these markers don't +work. So if you have a very simple binary 371 00:29:30,970 --> 00:29:38,410 -expression of Maxim XY so this is a gate -then the gradient on x online if you +expression of max(x, y), so this is a gate. +Then, the gradient on x and y, if you 372 00:29:38,410 --> 00:29:42,570 -think about it the green on the larger -one of your inputs which is larger the +think about it, the gradient on the larger +one of your inputs, whichever one was larger 373 00:29:42,569 --> 00:29:46,389 -gradient on that guy is one and all this -and the smaller one is a greeting of +the gradient on that guy is one and all this, +and the smaller one has a gradient of 0. 374 00:29:46,390 --> 00:29:50,630 -zero and intuitively that because if one -of these was smaller than what it has no +And intuitively, that's because if one +of these was smaller, then wiggling it has no 375 00:29:50,630 --> 00:29:53,220 -effect on the out but because the other -guy's larger and that's what ends up +effect on the output because the other +guy is larger and that's what ends up 376 00:29:53,220 --> 00:29:57,009 -getting through the gate so you end up -with a gradient of one on the +propagating through the gate. 
So you end up
+with a gradient of 1 on the

377
00:29:57,009 --> 00:30:03,140
-larger one of the inputs and so that's
-why max cady as a gradient writer if I'm
+larger one of the inputs, and so that's
+why max gate is a gradient router. If I'm

378
00:30:03,140 --> 00:30:06,420
-actually and I have received several
-inputs one of them was the largest of
+a max gate and I have received several
+inputs, one of them was the largest of

379
00:30:06,420 --> 00:30:09,550
 all of them and that's the value that I
-propagated through the circuit and
+propagated through the circuit.

380
00:30:09,549 --> 00:30:12,909
-application time I'm just going to
+At backpropagation time, I'm just going to
 receive my gradient from above and I'm

381
00:30:12,910 --> 00:30:16,590
-going to write it to whoever was my
-largest impact it's a gradient writer
+going to route it to whoever was my
+largest input. So it's a gradient router,

382
00:30:16,589 --> 00:30:22,569
-and multiply gate is a gradient switcher
-actually don't think that's a very good
+and the multiply gate is a gradient switcher.
+Actually I don't think that's a very good

383
00:30:22,569 --> 00:30:26,960
 the fact that it's not actually

384
00:30:26,960 --> 00:30:39,150
-nevermind about that part so the
+nevermind about that part. Go ahead. [STUDENT QUESTION] So your question is what happens if the two

385
00:30:39,150 --> 00:30:53,470
 inputs are equal when you go through max
-Kade what happens I don't think it's
+gate. Yeah, what happens? [STUDENT ANSWER] I don't think it's

386
00:30:53,470 --> 00:30:57,559
-correct to distributed to all of them I
-think you have to you have to pick one
+correct to distribute it to all of them. I
+think you'd have to pick one.

387
00:30:57,559 --> 00:31:07,990
-that basically never happens in actual
-practice so max gradient here actually
+But that basically never happens in actual
+practice. Okay, so max gradient here, I actually

388
00:31:07,990 --> 00:31:13,019
-have an example is that here was larger
-than W so only is it has an influence on
+have an example. So z, here, was larger
+than w, so only z has an influence on

389
00:31:13,019 --> 00:31:16,839
-the output of this max Kade right so
-when two flows into the max gate and
+the output of this max gate, right? So
+when 2 flows into the max gate

390
00:31:16,839 --> 00:31:20,879
-gets read it and W gets a zero gradient
+it gets routed to z, and w gets a 0 gradient
 because its effect on the circuit is

391
00:31:20,880 --> 00:31:25,360
-nothing there is zero because when you
-change it doesn't matter when you change
+nothing. There is 0, because when you
+change it, it doesn't matter when you change

392
00:31:25,359 --> 00:31:29,689
-it because that is not a larger bally
-going through the competition grounds I
+it, because z is the larger value
+going through the computational graph.

393
00:31:29,690 --> 00:31:33,100
-have another note that is related to
-back propagation which we already
+I have another note that is related to
+backpropagation which we already

394
00:31:33,099 --> 00:31:36,490
-addressed through question I just want
-to briefly point out with it terribly
+addressed through a question.
I just wanted +to briefly point out with a terribly 395 00:31:36,490 --> 00:31:40,440 -bad luck and figure that if you have +bad looking figure that if you have these circuits and sometimes you have a 396 @@ -1962,23 +1957,23 @@ and is used in multiple parts of the 397 00:31:43,329 --> 00:31:47,179 -circuit the correct thing to do by -multivariate chain rule is to actually +circuit, the correct thing to do by +multivariate chain rule, is to actually 398 00:31:47,180 --> 00:31:55,110 add up the contributions at the -operation so gradients add a background +operation. So gradients add when they backpropagate 399 00:31:55,109 --> 00:32:00,009 -in backwards through the circuit if they -ever flow in in these backward flow +backwards through the circuit. If they +ever flow, they add up in these backward flow 400 00:32:00,009 --> 00:32:04,879 -right we're going to go into -implementation very simple just a couple +All right. We're going to go into +implementation very soon. I'll just some more 401 00:32:04,880 --> 00:32:05,700 From 990d5cbc79b808e7ecc7dc65ed9aac9da3bbbd85 Mon Sep 17 00:00:00 2001 From: Myungsub Choi Date: Fri, 13 May 2016 10:29:07 +0900 Subject: [PATCH 129/199] initial progress bar --- convolutional-networks.md | 18 ++++++++---------- css/main.css | 14 +++++++++++++- index.html | 19 +++++++++++++++++-- 3 files changed, 38 insertions(+), 13 deletions(-) diff --git a/convolutional-networks.md b/convolutional-networks.md index 1ee2bb64..bbcc8705 100644 --- a/convolutional-networks.md +++ b/convolutional-networks.md @@ -1,6 +1,6 @@ --- layout: page -permalink: /convolutional-networks-kr/ +permalink: /convolutional-networks/ --- Table of Contents: @@ -285,7 +285,7 @@ For example, if 224x224 image gives a volume of size [7x7x512] - i.e. a reductio - `INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC`. 이 경우는 POOL 레이어 하나 당 하나의 CONV 레이어가 존재한다. - `INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL]*3 -> [FC -> RELU]*2 -> FC` 이 경우는 각각의 POOL 레이어를 거치기 전에 여러 개의 CONV 레이어를 거치게 된다. 크고 깊은 신경망에서는 이런 구조가 적합하다. 여러 층으로 쌓인 CONV 레이어는 pooling 연산으로 인해 많은 정보가 파괴되기 전에 복잡한 feature들을 추출할 수 있게 해주기 때문이다. -*큰 리셉티브 필드를 가지는 CONV 레이어 하나 대신 여러개의 작은 필터를 가진 CONV 레이어를 쌓는 것이 좋다*. 3x3 크기의 CONV 레이어 3개를 쌓는다고 생각해보자 (물론 각 레이어 사이에는 비선형 함수를 넣어준다). 이 경우 첫 번째 CONV 레이어의 각 뉴런은 입력 볼륨의 3x3 영역을 보게 된다. 두 번째 CONV 레이어의 각 뉴런은 첫 번째 CONV 레이어의 3x3 영역을 보게 되어 결론적으로 입력 볼륨의 5x5 영역을 보게 되는 효과가 있다. 비슷하게, 세 번째 CONV 레이어의 각 뉴런은 두 번째 CONV 레이어의 3x3 영역을 보게 되어 입력 볼륨의 7x7 영역을 보는 것과 같아진다. 이런 방식으로 3개의 3x3 CONV 레이어를 사용하는 대신 7x7의 리셉티브 필드를 가지는 CONV 레이어 하나를 사용한다고 생각해 보자. 이 경우에도 각 뉴런은 입력 볼륨의 7x7 영역을 리셉티브 필드로 갖게 되지만 몇 가지 단점이 존재한다. 먼저, CONV 레이어 3개를 쌓은 경우에는 중간 중간 비선형 함수의 영향으로 표현력 높은 feature를 만드는 반면, 하나의 (7x7) CONV 레이어만 갖는 경우 각 뉴런은 입력에 대해 선형 함수를 적용하게 된다. 두 번째로, 모든 볼륨이 $$C$$ 개의 채널(또는 깊이)을 갖는다고 가정한다면, 7x7 CONV 레이어의 경우 $$C \times (7 \times 7 \times C)=49 C^2$$개의 파라미터를 갖게 된다. 반면 3개의 3x3 CONV 레이어의 경우는 $$3 \times (C \times (3 \times 3 \times)) = 27 C^2$$개의 파라미터만 갖게 된다. 직관적으로, 하나의 큰 필터를 갖는 CONV 레이어보다, 작은 필터를 갖는 여러 개의 CONV 레이어를 쌓는 것이 더 적은 파라미터만 사용하면서도 입력으로부터 더 좋은 feature를 추출하게 해준다. 단점이 있다면, backpropagation을 할 때 CONV 레이어의 중간 결과들을 저장하기 위해 더 많은 메모리 공간을 잡고 있어야 한다는 것이다. +*큰 리셉티브 필드를 가지는 CONV 레이어 하나 대신 여러개의 작은 필터를 가진 CONV 레이어를 쌓는 것이 좋다*. 3x3 크기의 CONV 레이어 3개를 쌓는다고 생각해보자 (물론 각 레이어 사이에는 비선형 함수를 넣어준다). 이 경우 첫 번째 CONV 레이어의 각 뉴런은 입력 볼륨의 3x3 영역을 보게 된다. 두 번째 CONV 레이어의 각 뉴런은 첫 번째 CONV 레이어의 3x3 영역을 보게 되어 결론적으로 입력 볼륨의 5x5 영역을 보게 되는 효과가 있다. 비슷하게, 세 번째 CONV 레이어의 각 뉴런은 두 번째 CONV 레이어의 3x3 영역을 보게 되어 입력 볼륨의 7x7 영역을 보는 것과 같아진다. 
이런 방식으로 3개의 3x3 CONV 레이어를 사용하는 대신 7x7의 리셉티브 필드를 가지는 CONV 레이어 하나를 사용한다고 생각해 보자. 이 경우에도 각 뉴런은 입력 볼륨의 7x7 영역을 리셉티브 필드로 갖게 되지만 몇 가지 단점이 존재한다. 먼저, CONV 레이어 3개를 쌓은 경우에는 중간 중간 비선형 함수의 영향으로 표현력 높은 feature를 만드는 반면, 하나의 (7x7) CONV 레이어만 갖는 경우 각 뉴런은 입력에 대해 선형 함수를 적용하게 된다. 두 번째로, 모든 볼륨이 $$C$$ 개의 채널(또는 깊이)을 갖는다고 가정한다면, 7x7 CONV 레이어의 경우 $$C \times (7 \times 7 \times C) = 49 C^2$$개의 파라미터를 갖게 된다. 반면 3개의 3x3 CONV 레이어의 경우는 $$3 \times (C \times (3 \times 3 \times C)) = 27 C^2$$개의 파라미터만 갖게 된다. 직관적으로, 하나의 큰 필터를 갖는 CONV 레이어보다, 작은 필터를 갖는 여러 개의 CONV 레이어를 쌓는 것이 더 적은 파라미터만 사용하면서도 입력으로부터 더 좋은 feature를 추출하게 해준다. 단점이 있다면, backpropagation을 할 때 CONV 레이어의 중간 결과들을 저장하기 위해 더 많은 메모리 공간을 잡고 있어야 한다는 것이다.

#### 레이어 크기 결정 패턴

@@ -313,11 +313,11 @@ For example, if 224x224 image gives a volume of size [7x7x512] - i.e. a reductio

 - **LeNet**. 최초의 성공적인 ConvNet 애플리케이션들은 1990년대에 Yann LeCun이 만들었다. 그 중에서도 zip 코드나 숫자를 읽는 [LeNet](http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf) 아키텍쳐가 가장 유명하다.
 - **AlexNet**. 컴퓨터 비전 분야에서 ConvNet을 유명하게 만든 것은 Alex Krizhevsky, Ilya Sutskever, Geoff Hinton이 만든 [AlexNet](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks)이다. AlexNet은 [ImageNet ILSVRC challenge](http://www.image-net.org/challenges/LSVRC/2014/) 2012에 출전해 2등을 큰 차이로 제치고 1등을 했다 (top 5 에러율 16%, 2등은 26%). 아키텍쳐는 LeNet과 기본적으로 유사하지만, 더 깊고 크다. 또한 (과거에는 하나의 CONV 레이어 이후에 바로 POOL 레이어를 쌓은 것과 달리) 여러 개의 CONV 레이어들이 쌓여 있다.
 - **ZF Net**. ILSVRC 2013년의 승자는 Matthew Zeiler와 Rob Fergus가 만들었다. 저자들의 이름을 따 [ZFNet](http://arxiv.org/abs/1311.2901)이라고 불린다. AlexNet에서 중간 CONV 레이어 크기를 조정하는 등 하이퍼파라미터들을 수정해 만들었다.
 - **GoogLeNet**. ILSVRC 2014의 승자는 [Szegedy et al.](http://arxiv.org/abs/1409.4842) 이 구글에서 만들었다. 이 모델의 가장 큰 기여는 파라미터의 개수를 엄청나게 줄여주는 Inception module을 제안한 것이다 (4M, AlexNet의 경우 60M). 뿐만 아니라, ConvNet 마지막에 FC 레이어 대신 Average 풀링을 사용해 별로 중요하지 않아 보이는 파라미터들을 많이 줄이게 된다.
 - **VGGNet**. ILSVRC 2014에서 2등을 한 네트워크는 Karen Simonyan과 Andrew Zisserman이 만든 [VGGNet](http://www.robots.ox.ac.uk/~vgg/research/very_deep/)이라고 불리우는 모델이다. 이 모델의 가장 큰 기여는 네트워크의 깊이가 좋은 성능에 있어 매우 중요한 요소라는 것을 보여준 것이다. 이들이 제안한 여러 개 모델 중 가장 좋은 것은 16개의 CONV/FC 레이어로 이뤄지며, 모든 컨볼루션은 3x3, 모든 풀링은 2x2만으로 이뤄져 있다. 비록 GoogLeNet보다 이미지 분류 성능은 약간 낮지만, 여러 Transfer Learning 과제에서 더 좋은 성능을 보인다는 것이 나중에 밝혀졌다. 그래서 VGGNet은 최근에 이미지 feature 추출을 위해 가장 많이 사용되고 있다. Caffe를 사용하면 Pretrained model을 받아 바로 사용하는 것도 가능하다. VGGNet의 단점은, 매우 많은 메모리를 사용하며 (140M) 많은 연산을 한다는 것이다.
 - **ResNet**. Kaiming He et al.이 만든 [Residual Network](http://arxiv.org/abs/1512.03385)가 ILSVRC 2015에서 우승을 차지했다. Skip connection이라는 특이한 구조를 사용하며 batch normalization을 많이 사용했다는 특징이 있다. 이 아키텍쳐는 또한 마지막 부분에서 FC 레이어를 사용하지 않는다. Kaiming의 발표자료 ([video](https://www.youtube.com/watch?v=1PGLj-uKT1w), [slides](http://research.microsoft.com/en-us/um/people/kahe/ilsvrc15/ilsvrc2015_deep_residual_learning_kaiminghe.pdf))나 Torch로 구현된 [최근 실험들](https://github.com/gcr/torch-residual-networks) 들도 확인할 수 있다.
-
+

**VGGNet의 세부 사항들**. [VGGNet](http://www.robots.ox.ac.uk/~vgg/research/very_deep/)에 대해 좀 더 자세히 파헤쳐 보자. 전체 VGGNet은 필터 크기 3x3, stride 1, 제로패딩 1로 이뤄진 CONV 레이어들과 2x2 필터 크기 (패딩은 없음)의 POOL 레이어들로 구성된다. 아래에서 각 단계의 처리 과정을 살펴보고, 각 단계의 결과 크기와 가중치 개수를 알아본다.

@@ -353,15 +353,15 @@ ConvNet에서 자주 볼 수 있는 특징으로써, 대부분의 메모리가



-#### 계산 관련 고려사항들
+#### 계산 관련 고려사항들

-ConvNet을 만들 때 일어나는 가장 큰 병목 현상은 메모리 병목이다. 최신 GPU들은 3/4/6GB의 메모리를 내장하고 있다. 가장 좋은 GPU들의 경우 12GB를 갖고 있다. 메모리와 관련해 주의깊게 살펴 볼 것은 크게 3가지이다.
+ConvNet을 만들 때 일어나는 가장 큰 병목 현상은 메모리 병목이다. 최신 GPU들은 3/4/6GB의 메모리를 내장하고 있다. 가장 좋은 GPU들의 경우 12GB를 갖고 있다. 메모리와 관련해 주의깊게 살펴 볼 것은 크게 3가지이다. - 중간 단계의 볼륨 크기: 매 레이어에서 발생하는 액티베이션들과 그에 상응하는 그라디언트 (액티베이션과 같은 크기)의 개수이다. 보통 대부분의 액티베이션들은 ConvNet의 앞쪽 레이어들에서 발생된다 (예: 첫 번째 CONV 레이어). 이 값들은 backpropagation에 필요하기 때문에 계속 메모리에 두고 있어야 한다. 학습이 아닌 테스트에만 ConvNet을 사용할 때는 현재 처리 중인 레이어의 액티베이션 값을 제외한 앞쪽 액티베이션들은 버리는 방식으로 구현할 수 있다. - 파라미터 크기: 신경망이 갖고 있는 파라미터의 개수이며, backpropagation을 위한 각 파라미터의 그라디언트, 그리고 최적화에 momentum, Adagrad, RMSProp 등을 사용한다면 이와 관련된 파라미터들도 캐싱해 놓아야 한다. 그러므로 파라미터 저장 공간은 기본적으로 (파라미터 개수의)3배 정도 더 필요하다. -- 모든 ConvNet 구현체는 이미지 데이터 배치 등을 위한 기타 용도의 메모리를 유지해야 한다. +- 모든 ConvNet 구현체는 이미지 데이터 배치 등을 위한 기타 용도의 메모리를 유지해야 한다. -일단 액티베이션, 그라디언트, 기타용도에 필요한 값들의 개수를 예상했다면, GB 스케일로 바꿔야 한다. 예측한 개수에 4를 곱해 바이트 수를 구하고 (floating point가 4바이트, double precision의 경우 8바이트 이므로), 1024로 여러 번 나눠 KB, MB, GB로 바꾼다. 만약 신경망의 크기가 너무 크다면, 배치 크기를 줄이는 등의 휴리스틱을 이용해 (대부분의 메모리가 액티베이션에 사용되므로) 가용 메모리에 맞게 만들어야 한다. +일단 액티베이션, 그라디언트, 기타용도에 필요한 값들의 개수를 예상했다면, GB 스케일로 바꿔야 한다. 예측한 개수에 4를 곱해 바이트 수를 구하고 (floating point가 4바이트, double precision의 경우 8바이트 이므로), 1024로 여러 번 나눠 KB, MB, GB로 바꾼다. 만약 신경망의 크기가 너무 크다면, 배치 크기를 줄이는 등의 휴리스틱을 이용해 (대부분의 메모리가 액티베이션에 사용되므로) 가용 메모리에 맞게 만들어야 한다. ### ConvNet의 시각화 및 이해 @@ -384,5 +384,3 @@ ConvNet을 만들 때 일어나는 가장 큰 병목 현상은 메모리 병목

번역: 김택수 (jazzsaxmafia)

- - diff --git a/css/main.css b/css/main.css index b495ff28..df573de0 100644 --- a/css/main.css +++ b/css/main.css @@ -57,7 +57,7 @@ a:visited { color: #205caa; } } .materials-item a{ color: #333; - display: block; + display: inline; padding: 3px; } .materials-item { @@ -125,6 +125,18 @@ a:visited { color: #205caa; } color: #009; } +/* Custom CSS rules for progress bar */ +.progress { + position: relative; +} +.progress span { + font-family: "Arial"; + position: absolute; + text-align:center; + top: 0%; + font-size: small; +} + /* Custom CSS rules for content */ .embedded-video { diff --git a/index.html b/index.html index 17b8b232..1fbccf7d 100644 --- a/index.html +++ b/index.html @@ -26,6 +26,9 @@ Assignment #1: 이미지 분류, kNN, SVM, Softmax, 뉴럴 네트워크 + + +
@@ -33,6 +36,9 @@ Assignment #2: Fully-Connected 네트워크, 배치 정규화(Batch Normalization), Dropout, 컨볼루션 신경망 + + +
@@ -40,6 +46,9 @@ Assignment #3: 회귀신경망(Recurrent Neural Networks), 이미지 캡셔닝(Captioning), 이미지 그라디언트, DeepDream + + +
@@ -113,7 +128,7 @@
- 최적화: 확률 그라디언트 하강(Stochastic Gradient Descent) + 최적화: 확률 그라디언트 하강(Stochastic Gradient Descent)
'지형'으로서의 최적화 목적 함수 (optimization landscapes), 국소 탐색(local search), 학습 속도(learning rate), 해석적(analytic)/수치적(numerical) 그라디언트 @@ -170,7 +185,7 @@
Module 2: Convolutional Neural Networks
- + 컨볼루션 신경망: 구조, Convolution / Pooling 레이어들
From 7802a7d80fc5b3ff2c21418b092b7057662780e1 Mon Sep 17 00:00:00 2001 From: Myungsub Choi Date: Fri, 13 May 2016 11:05:20 +0900 Subject: [PATCH 130/199] add progress bars --- css/main.css | 1 + index.html | 46 +++++++++++++++++++++++++++++++++++++++++----- 2 files changed, 42 insertions(+), 5 deletions(-) diff --git a/css/main.css b/css/main.css index df573de0..d3e55bb8 100644 --- a/css/main.css +++ b/css/main.css @@ -128,6 +128,7 @@ a:visited { color: #205caa; } /* Custom CSS rules for progress bar */ .progress { position: relative; + font-size: 16px; } .progress span { font-family: "Arial"; diff --git a/index.html b/index.html index 1fbccf7d..c7f9553e 100644 --- a/index.html +++ b/index.html @@ -79,15 +79,17 @@ Python / Numpy Tutorial + + +
IPython Notebook Tutorial - - - Complete! + + Complete!
@@ -95,14 +97,18 @@ Terminal.com Tutorial - Complete! + + Complete! +
AWS Tutorial - Complete! + + Complete! +
@@ -112,6 +118,9 @@ 이미지 분류: 데이터 기반 방법론, k-Nearest Neighbor, train/val/test 구분 + + +
L1/L2 거리, hyperparameter 탐색, 교차검증(cross-validation)
@@ -121,6 +130,9 @@ 선형 분류: Support Vector Machine, Softmax + + +
parameteric 접근법, bias 트릭, hinge loss, cross-entropy loss, L2 regularization, 웹 데모
@@ -130,6 +142,9 @@ 최적화: 확률 그라디언트 하강(Stochastic Gradient Descent) + + Complete! +
'지형'으로서의 최적화 목적 함수 (optimization landscapes), 국소 탐색(local search), 학습 속도(learning rate), 해석적(analytic)/수치적(numerical) 그라디언트
@@ -139,6 +154,9 @@ Backpropagation, Intuition + + +
연쇄 법칙 (chain rule) 해석, real-valued circuits, 그라디언트 흐름의 패턴
@@ -148,6 +166,9 @@ 신경망 파트 1: 네트워크 구조 정하기 + + +
생물학적 뉴런 모델, 활성 함수(activation functions), 신경망 구조, 표현력(representational power)
@@ -157,6 +178,9 @@ 신경망 파트 2: 데이터 준비 및 Loss + + +
전처리, weight 초기값 설정, 배치 정규화(batch normalization), regularization (L2/dropout), 손실함수
@@ -166,6 +190,9 @@ 신경망 파트 3: 학습 및 평가 + + +
그라디언트 체크, 버그 점검, 학습 과정 모니터링, momentum (+nesterov), 2차(2nd-order) 방법, Adagrad/RMSprop, hyperparameter 최적화, 모델 ensemble
@@ -188,6 +215,9 @@ 컨볼루션 신경망: 구조, Convolution / Pooling 레이어들 + + Complete! +
레이어(층), 공간적 배치, 레이어 패턴, 레이어 사이즈, AlexNet/ZFNet/VGGNet 사례 분석, 계산량에 관한 고려 사항들
@@ -197,6 +227,9 @@ 컨볼루션 신경망 분석 및 시각화 + + +
tSNE embeddings, deconvnets, 데이터에 대한 그라디언트, ConvNet 속이기, 사람과의 비교
@@ -206,6 +239,9 @@ Transfer Learning and Fine-tuning Convolutional Neural Networks + + +
From 97226e2d8d9c1809372038149a750f633f31f46d Mon Sep 17 00:00:00 2001 From: MaybeS Date: Fri, 13 May 2016 16:35:26 +0900 Subject: [PATCH 131/199] Update assignment1/softmax.ipynb --- assignments2016/assignment1/softmax.ipynb | 525 +++++++++++----------- 1 file changed, 260 insertions(+), 265 deletions(-) diff --git a/assignments2016/assignment1/softmax.ipynb b/assignments2016/assignment1/softmax.ipynb index 90914f36..78e68e00 100644 --- a/assignments2016/assignment1/softmax.ipynb +++ b/assignments2016/assignment1/softmax.ipynb @@ -1,308 +1,303 @@ { - "nbformat_minor": 0, - "nbformat": 4, "cells": [ { + "cell_type": "markdown", + "metadata": {}, "source": [ - "# Softmax exercise\n", - "\n", - "*Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. For more details see the [assignments page](http://vision.stanford.edu/teaching/cs231n/assignments.html) on the course website.*\n", - "\n", - "This exercise is analogous to the SVM exercise. You will:\n", - "\n", - "- implement a fully-vectorized **loss function** for the Softmax classifier\n", - "- implement the fully-vectorized expression for its **analytic gradient**\n", - "- **check your implementation** with numerical gradient\n", - "- use a validation set to **tune the learning rate and regularization** strength\n", - "- **optimize** the loss function with **SGD**\n", - "- **visualize** the final learned weights\n" - ], - "cell_type": "markdown", - "metadata": {} - }, + "# Softmax 연습\n", + "\n", + "*이 워크시트를 완성하고 제출하세요. (출력물과 워크시트에 포함되지 않은 코드들을 포함해서) 더 자세한 정보는 코스 웹사이트인 [숙제 페이지](http://vision.stanford.edu/teaching/cs231n/assignments.html)에서 볼 수 있습니다.*\n", + "\n", + "이번 연습은 SVM과 유사합니다. 아래와 같은 것들을 하게됩니다.\n", + "\n", + "- Softmax 분류기를 위한 완전히 벡터화된 **손실 함수**를 구현합니다.\n", + "- **분석 요소**를 위한 완전히 벡터화된 표현식을 구현합니다.\n", + "- 구현한것을 수치 요소로 체크합니다.\n", + "- 검증 셋을 이용해 **학습율과 정규화 강도를 튜닝**합니다.\n", + "- **SGD**를 사용해 손실 함수를 **최적화**합니다.\n", + "- 최종 학습 가중치를 **시각화**합니다." 
+ ] + }, { - "execution_count": null, - "cell_type": "code", + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], "source": [ - "import random\n", - "import numpy as np\n", - "from cs231n.data_utils import load_CIFAR10\n", - "import matplotlib.pyplot as plt\n", - "%matplotlib inline\n", - "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", - "plt.rcParams['image.interpolation'] = 'nearest'\n", - "plt.rcParams['image.cmap'] = 'gray'\n", - "\n", - "# for auto-reloading extenrnal modules\n", - "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", - "%load_ext autoreload\n", + "import random\n", + "import numpy as np\n", + "from cs231n.data_utils import load_CIFAR10\n", + "import matplotlib.pyplot as plt\n", + "%matplotlib inline\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['image.interpolation'] = 'nearest'\n", + "plt.rcParams['image.cmap'] = 'gray'\n", + "\n", + "# 외부 모듈의 auto-reloading을 위해 아래 링크를 확인하세요.\n", + "# http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "%load_ext autoreload\n", "%autoreload 2" - ], - "outputs": [], + ] + }, + { + "cell_type": "code", + "execution_count": null, "metadata": { "collapsed": false - } - }, - { - "execution_count": null, - "cell_type": "code", + }, + "outputs": [], "source": [ - "def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000, num_dev=500):\n", - " \"\"\"\n", - " Load the CIFAR-10 dataset from disk and perform preprocessing to prepare\n", - " it for the linear classifier. These are the same steps as we used for the\n", - " SVM, but condensed to a single function. \n", - " \"\"\"\n", - " # Load the raw CIFAR-10 data\n", - " cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'\n", - " X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)\n", - " \n", - " # subsample the data\n", - " mask = range(num_training, num_training + num_validation)\n", - " X_val = X_train[mask]\n", - " y_val = y_train[mask]\n", - " mask = range(num_training)\n", - " X_train = X_train[mask]\n", - " y_train = y_train[mask]\n", - " mask = range(num_test)\n", - " X_test = X_test[mask]\n", - " y_test = y_test[mask]\n", - " mask = np.random.choice(num_training, num_dev, replace=False)\n", - " X_dev = X_train[mask]\n", - " y_dev = y_train[mask]\n", - " \n", - " # Preprocessing: reshape the image data into rows\n", - " X_train = np.reshape(X_train, (X_train.shape[0], -1))\n", - " X_val = np.reshape(X_val, (X_val.shape[0], -1))\n", - " X_test = np.reshape(X_test, (X_test.shape[0], -1))\n", - " X_dev = np.reshape(X_dev, (X_dev.shape[0], -1))\n", - " \n", - " # Normalize the data: subtract the mean image\n", - " mean_image = np.mean(X_train, axis = 0)\n", - " X_train -= mean_image\n", - " X_val -= mean_image\n", - " X_test -= mean_image\n", - " X_dev -= mean_image\n", - " \n", - " # add bias dimension and transform into columns\n", - " X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])\n", - " X_val = np.hstack([X_val, np.ones((X_val.shape[0], 1))])\n", - " X_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))])\n", - " X_dev = np.hstack([X_dev, np.ones((X_dev.shape[0], 1))])\n", - " \n", - " return X_train, y_train, X_val, y_val, X_test, y_test, X_dev, y_dev\n", - "\n", - "\n", - "# Invoke the above function to get our data.\n", - "X_train, y_train, X_val, y_val, X_test, y_test, X_dev, y_dev = get_CIFAR10_data()\n", - "print 'Train data shape: 
', X_train.shape\n", - "print 'Train labels shape: ', y_train.shape\n", - "print 'Validation data shape: ', X_val.shape\n", - "print 'Validation labels shape: ', y_val.shape\n", - "print 'Test data shape: ', X_test.shape\n", - "print 'Test labels shape: ', y_test.shape\n", - "print 'dev data shape: ', X_dev.shape\n", + "def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000, num_dev=500):\n", + " \"\"\"\n", + " CIFAR-10 데이터 셋을 불러온 후 미리 준비된 선형 분류기에 전처리를 수행합니다.\n", + " 이 과정은 SVM에서 사용했던 방법과 같지만 하나의 함수로 압축되어 있습니다.\n", + " \"\"\"\n", + " # 원시 CIFAR-10 데이터를 불러옵니다.\n", + " cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'\n", + " X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)\n", + " \n", + " # 데이터에서 표본을 얻습니다.\n", + " mask = range(num_training, num_training + num_validation)\n", + " X_val = X_train[mask]\n", + " y_val = y_train[mask]\n", + " mask = range(num_training)\n", + " X_train = X_train[mask]\n", + " y_train = y_train[mask]\n", + " mask = range(num_test)\n", + " X_test = X_test[mask]\n", + " y_test = y_test[mask]\n", + " mask = np.random.choice(num_training, num_dev, replace=False)\n", + " X_dev = X_train[mask]\n", + " y_dev = y_train[mask]\n", + " \n", + " # 전처리: 이미지 데이터를 행으로 변형합니다.\n", + " X_train = np.reshape(X_train, (X_train.shape[0], -1))\n", + " X_val = np.reshape(X_val, (X_val.shape[0], -1))\n", + " X_test = np.reshape(X_test, (X_test.shape[0], -1))\n", + " X_dev = np.reshape(X_dev, (X_dev.shape[0], -1))\n", + " \n", + " # 데이터 정규화: 평균 이미지 빼기\n", + " mean_image = np.mean(X_train, axis = 0)\n", + " X_train -= mean_image\n", + " X_val -= mean_image\n", + " X_test -= mean_image\n", + " X_dev -= mean_image\n", + " \n", + " # 기저 차원을 더하고 열로 변형시킵니다.\n", + " X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])\n", + " X_val = np.hstack([X_val, np.ones((X_val.shape[0], 1))])\n", + " X_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))])\n", + " X_dev = np.hstack([X_dev, np.ones((X_dev.shape[0], 1))])\n", + " \n", + " return X_train, y_train, X_val, y_val, X_test, y_test, X_dev, y_dev\n", + "\n", + "\n", + "# 위 함수를 우리 데이터로 실행해봅니다.\n", + "X_train, y_train, X_val, y_val, X_test, y_test, X_dev, y_dev = get_CIFAR10_data()\n", + "print 'Train data shape: ', X_train.shape\n", + "print 'Train labels shape: ', y_train.shape\n", + "print 'Validation data shape: ', X_val.shape\n", + "print 'Validation labels shape: ', y_val.shape\n", + "print 'Test data shape: ', X_test.shape\n", + "print 'Test labels shape: ', y_test.shape\n", + "print 'dev data shape: ', X_dev.shape\n", "print 'dev labels shape: ', y_dev.shape" - ], - "outputs": [], - "metadata": { - "collapsed": false - } - }, + ] + }, { + "cell_type": "markdown", + "metadata": {}, "source": [ - "## Softmax Classifier\n", - "\n", - "Your code for this section will all be written inside **cs231n/classifiers/softmax.py**. 
\n" - ], - "cell_type": "markdown", - "metadata": {} - }, + "## Softmax 분류기\n", + "\n", + "**cs231n/classifiers/softmax.py**에 이번 섹션에 필요한 코드가 적혀있습니다.\n" + ] + }, { - "execution_count": null, - "cell_type": "code", - "source": [ - "# First implement the naive softmax loss function with nested loops.\n", - "# Open the file cs231n/classifiers/softmax.py and implement the\n", - "# softmax_loss_naive function.\n", - "\n", - "from cs231n.classifiers.softmax import softmax_loss_naive\n", - "import time\n", - "\n", - "# Generate a random softmax weight matrix and use it to compute the loss.\n", - "W = np.random.randn(3073, 10) * 0.0001\n", - "loss, grad = softmax_loss_naive(W, X_dev, y_dev, 0.0)\n", - "\n", - "# As a rough sanity check, our loss should be something close to -log(0.1).\n", - "print 'loss: %f' % loss\n", - "print 'sanity check: %f' % (-np.log(0.1))" - ], - "outputs": [], + "cell_type": "code", + "execution_count": null, "metadata": { "collapsed": false - } - }, + }, + "outputs": [], + "source": [ + "# 먼저 softmax 손실 함수를 구현하세요.\n", + "# cs231n/calssifiers/softmax.py 를 열고 softmax_loss_naive 함수를 구현하세요.\n", + "\n", + "from cs231n.classifiers.softmax import softmax_loss_naive\n", + "import time\n", + "\n", + "# 랜덤 softmax 가중치 배열을 만들고 손실을 계산하는데 사용합니다.\n", + "W = np.random.randn(3073, 10) * 0.0001\n", + "loss, grad = softmax_loss_naive(W, X_dev, y_dev, 0.0)\n", + "\n", + "# As a rough sanity check, our loss should be something close to -log(0.1).\n", + "print 'loss: %f' % loss\n", + "print 'sanity check: %f' % (-np.log(0.1))" + ] + }, { + "cell_type": "markdown", + "metadata": {}, "source": [ - "## Inline Question 1:\n", - "Why do we expect our loss to be close to -log(0.1)? Explain briefly.**\n", - "\n", - "**Your answer:** *Fill this in*\n" - ], - "cell_type": "markdown", - "metadata": {} - }, + "## 연습문제 1:\n", + "왜 손실이 -log(0.1)로 근사되는지 이유를 간단히 서술하세요.\n", + "\n", + "**당신의 답:** *여기에 쓰세요*" + ] + }, { - "execution_count": null, - "cell_type": "code", + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], "source": [ - "# Complete the implementation of softmax_loss_naive and implement a (naive)\n", - "# version of the gradient that uses nested loops.\n", - "loss, grad = softmax_loss_naive(W, X_dev, y_dev, 0.0)\n", - "\n", - "# As we did for the SVM, use numeric gradient checking as a debugging tool.\n", - "# The numeric gradient should be close to the analytic gradient.\n", - "from cs231n.gradient_check import grad_check_sparse\n", - "f = lambda w: softmax_loss_naive(w, X_dev, y_dev, 0.0)[0]\n", - "grad_numerical = grad_check_sparse(f, W, grad, 10)\n", - "\n", - "# similar to SVM case, do another gradient check with regularization\n", - "loss, grad = softmax_loss_naive(W, X_dev, y_dev, 1e2)\n", - "f = lambda w: softmax_loss_naive(w, X_dev, y_dev, 1e2)[0]\n", + "# softmax_loss_naived의 구현을 완성하고 중첩 루프를 이용한 버전을 구현해 보세요.\n", + "loss, grad = softmax_loss_naive(W, X_dev, y_dev, 0.0)\n", + "\n", + "# SVM에서 했던 것 처럼, 수치 요소를 디버깅 툴처럼 체크해보세요.\n", + "# The numeric gradient should be close to the analytic gradient.\n", + "from cs231n.gradient_check import grad_check_sparse\n", + "f = lambda w: softmax_loss_naive(w, X_dev, y_dev, 0.0)[0]\n", + "grad_numerical = grad_check_sparse(f, W, grad, 10)\n", + "\n", + "# SVM에서처럼, 정규화를 이용해 다른 요소를 체크해보세요.\n", + "loss, grad = softmax_loss_naive(W, X_dev, y_dev, 1e2)\n", + "f = lambda w: softmax_loss_naive(w, X_dev, y_dev, 1e2)[0]\n", "grad_numerical = grad_check_sparse(f, W, grad, 10)" - ], - "outputs": [], + ] + }, + 
{ + "cell_type": "code", + "execution_count": null, "metadata": { "collapsed": false - } - }, - { - "execution_count": null, - "cell_type": "code", + }, + "outputs": [], "source": [ - "# Now that we have a naive implementation of the softmax loss function and its gradient,\n", - "# implement a vectorized version in softmax_loss_vectorized.\n", - "# The two versions should compute the same results, but the vectorized version should be\n", - "# much faster.\n", - "tic = time.time()\n", - "loss_naive, grad_naive = softmax_loss_naive(W, X_dev, y_dev, 0.00001)\n", - "toc = time.time()\n", - "print 'naive loss: %e computed in %fs' % (loss_naive, toc - tic)\n", - "\n", - "from cs231n.classifiers.softmax import softmax_loss_vectorized\n", - "tic = time.time()\n", - "loss_vectorized, grad_vectorized = softmax_loss_vectorized(W, X_dev, y_dev, 0.00001)\n", - "toc = time.time()\n", - "print 'vectorized loss: %e computed in %fs' % (loss_vectorized, toc - tic)\n", - "\n", - "# As we did for the SVM, we use the Frobenius norm to compare the two versions\n", - "# of the gradient.\n", - "grad_difference = np.linalg.norm(grad_naive - grad_vectorized, ord='fro')\n", - "print 'Loss difference: %f' % np.abs(loss_naive - loss_vectorized)\n", + "# 이제 간단하게 구현된 softmax 손실함수와 요소와 soft_max_loss_vectorized에 구현된 벡터화된 버전이 있습니다.\n", + "# 이 두가지 버전은 같은 결과를 낼 것이지만 벡터화된 버전이 좀 더 빠를것 입니다.\n", + "tic = time.time()\n", + "loss_naive, grad_naive = softmax_loss_naive(W, X_dev, y_dev, 0.00001)\n", + "toc = time.time()\n", + "print 'naive loss: %e computed in %fs' % (loss_naive, toc - tic)\n", + "\n", + "from cs231n.classifiers.softmax import softmax_loss_vectorized\n", + "tic = time.time()\n", + "loss_vectorized, grad_vectorized = softmax_loss_vectorized(W, X_dev, y_dev, 0.00001)\n", + "toc = time.time()\n", + "print 'vectorized loss: %e computed in %fs' % (loss_vectorized, toc - tic)\n", + "\n", + "# As we did for the SVM, we use the Frobenius norm to compare the two versions\n", + "# of the gradient.\n", + "grad_difference = np.linalg.norm(grad_naive - grad_vectorized, ord='fro')\n", + "print 'Loss difference: %f' % np.abs(loss_naive - loss_vectorized)\n", "print 'Gradient difference: %f' % grad_difference" - ], - "outputs": [], + ] + }, + { + "cell_type": "code", + "execution_count": null, "metadata": { "collapsed": false - } - }, - { - "execution_count": null, - "cell_type": "code", + }, + "outputs": [], "source": [ - "# Use the validation set to tune hyperparameters (regularization strength and\n", - "# learning rate). You should experiment with different ranges for the learning\n", - "# rates and regularization strengths; if you are careful you should be able to\n", - "# get a classification accuracy of over 0.35 on the validation set.\n", - "from cs231n.classifiers import Softmax\n", - "results = {}\n", - "best_val = -1\n", - "best_softmax = None\n", - "learning_rates = [1e-7, 5e-7]\n", - "regularization_strengths = [5e4, 1e8]\n", - "\n", - "################################################################################\n", - "# TODO: #\n", - "# Use the validation set to set the learning rate and regularization strength. #\n", - "# This should be identical to the validation that you did for the SVM; save #\n", - "# the best trained softmax classifer in best_softmax. 
#\n", - "################################################################################\n", - "pass\n", - "################################################################################\n", - "# END OF YOUR CODE #\n", - "################################################################################\n", - " \n", - "# Print out results.\n", - "for lr, reg in sorted(results):\n", - " train_accuracy, val_accuracy = results[(lr, reg)]\n", - " print 'lr %e reg %e train accuracy: %f val accuracy: %f' % (\n", - " lr, reg, train_accuracy, val_accuracy)\n", - " \n", + "# Use the validation set to tune hyperparameters (regularization strength and\n", + "# learning rate). You should experiment with different ranges for the learning\n", + "# rates and regularization strengths; if you are careful you should be able to\n", + "# get a classification accuracy of over 0.35 on the validation set.\n", + "from cs231n.classifiers import Softmax\n", + "results = {}\n", + "best_val = -1\n", + "best_softmax = None\n", + "learning_rates = [1e-7, 5e-7]\n", + "regularization_strengths = [5e4, 1e8]\n", + "\n", + "################################################################################\n", + "# TODO: #\n", + "# Use the validation set to set the learning rate and regularization strength. #\n", + "# This should be identical to the validation that you did for the SVM; save #\n", + "# the best trained softmax classifer in best_softmax. #\n", + "################################################################################\n", + "pass\n", + "################################################################################\n", + "# END OF YOUR CODE #\n", + "################################################################################\n", + " \n", + "# Print out results.\n", + "for lr, reg in sorted(results):\n", + " train_accuracy, val_accuracy = results[(lr, reg)]\n", + " print 'lr %e reg %e train accuracy: %f val accuracy: %f' % (\n", + " lr, reg, train_accuracy, val_accuracy)\n", + " \n", "print 'best validation accuracy achieved during cross-validation: %f' % best_val" - ], - "outputs": [], + ] + }, + { + "cell_type": "code", + "execution_count": null, "metadata": { "collapsed": false - } - }, - { - "execution_count": null, - "cell_type": "code", + }, + "outputs": [], "source": [ - "# evaluate on test set\n", - "# Evaluate the best softmax on test set\n", - "y_test_pred = best_softmax.predict(X_test)\n", - "test_accuracy = np.mean(y_test == y_test_pred)\n", + "# evaluate on test set\n", + "# Evaluate the best softmax on test set\n", + "y_test_pred = best_softmax.predict(X_test)\n", + "test_accuracy = np.mean(y_test == y_test_pred)\n", "print 'softmax on raw pixels final test set accuracy: %f' % (test_accuracy, )" - ], - "outputs": [], + ] + }, + { + "cell_type": "code", + "execution_count": null, "metadata": { "collapsed": false - } - }, - { - "execution_count": null, - "cell_type": "code", + }, + "outputs": [], "source": [ - "# Visualize the learned weights for each class\n", - "w = best_softmax.W[:-1,:] # strip out the bias\n", - "w = w.reshape(32, 32, 3, 10)\n", - "\n", - "w_min, w_max = np.min(w), np.max(w)\n", - "\n", - "classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n", - "for i in xrange(10):\n", - " plt.subplot(2, 5, i + 1)\n", - " \n", - " # Rescale the weights to be between 0 and 255\n", - " wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)\n", - " plt.imshow(wimg.astype('uint8'))\n", - " plt.axis('off')\n", + "# Visualize the 
learned weights for each class\n", + "w = best_softmax.W[:-1,:] # strip out the bias\n", + "w = w.reshape(32, 32, 3, 10)\n", + "\n", + "w_min, w_max = np.min(w), np.max(w)\n", + "\n", + "classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n", + "for i in xrange(10):\n", + " plt.subplot(2, 5, i + 1)\n", + " \n", + " # Rescale the weights to be between 0 and 255\n", + " wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)\n", + " plt.imshow(wimg.astype('uint8'))\n", + " plt.axis('off')\n", " plt.title(classes[i])" - ], - "outputs": [], - "metadata": { - "collapsed": false - } + ] } - ], + ], "metadata": { "kernelspec": { - "display_name": "Python 2", - "name": "python2", - "language": "python" - }, + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, "language_info": { - "mimetype": "text/x-python", - "nbconvert_exporter": "python", - "name": "python", - "file_extension": ".py", - "version": "2.7.9", - "pygments_lexer": "ipython2", "codemirror_mode": { - "version": 2, - "name": "ipython" - } + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.5.1" } - } -} \ No newline at end of file + }, + "nbformat": 4, + "nbformat_minor": 0 +} From a7eb2ef66f250eae9fffc503beb5ca35a85053db Mon Sep 17 00:00:00 2001 From: Taeksoo Kim Date: Fri, 13 May 2016 16:35:57 +0900 Subject: [PATCH 132/199] =?UTF-8?q?=EC=98=81=EC=96=B4=20=EC=9B=90=EB=AC=B8?= =?UTF-8?q?=EC=9D=84=20=EC=95=88=20=EC=A7=80=EC=9A=B4=20=EB=AC=B8=EB=8B=A8?= =?UTF-8?q?=EC=9D=B4=20=EC=9E=88=EC=96=B4=EC=84=9C=20=EC=A7=80=EC=9B=80?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- convolutional-networks.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/convolutional-networks.md b/convolutional-networks.md index bbcc8705..4f924f03 100644 --- a/convolutional-networks.md +++ b/convolutional-networks.md @@ -258,8 +258,6 @@ FC 레이어와 CONV 레이어의 차이점은, CONV 레이어는 입력의 일 예를 들어,224x224 크기의 이미지를 입력으로 받으면 [7x7x512]의 볼륨을 출력하는 이 아키텍쳐에, ( 224/7 = 32배 줄어듦 ) 된 아키텍쳐에 384x384 크기의 이미지를 넣으면 [12x12x512] 크기의 볼륨을 출력하게 된다 (384/32 = 12 이므로). 이후 FC에서 CONV로 변환한 3개의 CONV 레이어를 거치면 [6x6x1000] 크기의 최종 볼륨을 얻게 된다 ( (12 - 7)/1 +1 =6 이므로). [1x1x1000]크기를 지닌 하나의 클래스 점수 벡터 대신 384x384 이미지로부터 6x6개의 클래스 점수 배열을 구했다는 것이 중요하다. -For example, if 224x224 image gives a volume of size [7x7x512] - i.e. a reduction by 32, then forwarding an image of size 384x384 through the converted architecture would give the equivalent volume in size [12x12x512], since 384/32 = 12. Following through with the next 3 CONV layers that we just converted from FC layers would now give the final volume of size [6x6x1000], since (12 - 7)/1 + 1 = 6. Note that instead of a single vector of class scores of size [1x1x1000], we're now getting and entire 6x6 array of class scores across the 384x384 image. - > 위의 내용은 384x384 크기의 이미지를 32의 stride 간격으로 224x224 크기로 잘라 각각을 원본 ConvNet (뒷쪽 3개 레이어가 FC인)에 적용한 것과 같은 결과를 보여준다. 당연히 (CONV레이어만으로) 변환된 ConvNet을 이용해 한 번에 이미지를 처리하는 것이 원본 ConvNet으로 36개 위치에 대해 반복적으로 처리하는 것 보다 훨씬 효율적이다. 36번의 처리 과정에서 같은 계산이 중복되기 때문이다. 이런 기법은 실전에서 성능 향상을 위해 종종 사용된다. 예를 들어 이미지를 크게 리사이즈 한 뒤 변환된 ConvNet을 이용해 여러 위치에 대한 클래스 점수를 구한 다음 그 점수들의 평균을 취하는 기법 등이 있다. 
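The arithmetic in the FC-to-CONV paragraph above follows the standard CONV sizing rule, output size = (W - F + 2P)/S + 1, with the 32x downsampling described there (e.g., five 2x2 poolings in a VGG-style net, 2**5 = 32). Below is a minimal sketch that checks those numbers, assuming that rule; the helper `conv_output_size` is hypothetical and not part of the course code:

```python
# Minimal sanity check for the CONV sizing rule quoted above:
# output width = (W - F + 2P) / S + 1.

def conv_output_size(w, f, s=1, p=0):
    """Spatial output size for input width w, filter f, stride s, zero-padding p."""
    assert (w - f + 2 * p) % s == 0, "filter does not tile the input evenly"
    return (w - f + 2 * p) // s + 1

# Five 2x2 poolings shrink the input by a factor of 2**5 = 32:
assert 224 // 32 == 7     # 224x224 input -> [7x7x512] volume
assert 384 // 32 == 12    # 384x384 input -> [12x12x512] volume

# The first FC layer converted to a CONV layer with F=7 (followed by 1x1 convs):
assert conv_output_size(7, 7) == 1    # original input: a [1x1x1000] score volume
assert conv_output_size(12, 7) == 6   # larger input: a [6x6x1000] grid, (12-7)/1 + 1 = 6
```

Evaluating the converted network once on the 384x384 image therefore yields the same 6x6 = 36 grids of class scores as running the original 224x224 network at 36 positions with stride 32, which is the efficiency point made in the note above.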
From 921d58d116a97a9c749b208f07f06a3c119fa700 Mon Sep 17 00:00:00 2001 From: MaybeS Date: Fri, 13 May 2016 16:42:04 +0900 Subject: [PATCH 133/199] Update newline --- assignments2016/assignment1/softmax.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/assignments2016/assignment1/softmax.ipynb b/assignments2016/assignment1/softmax.ipynb index 78e68e00..94fd47cf 100644 --- a/assignments2016/assignment1/softmax.ipynb +++ b/assignments2016/assignment1/softmax.ipynb @@ -123,7 +123,7 @@ }, "outputs": [], "source": [ - "# 먼저 softmax 손실 함수를 구현하세요.\n", + "# 먼저 중첩 루프를 사용해 softmax 손실 함수를 구현하세요.\n", "# cs231n/calssifiers/softmax.py 를 열고 softmax_loss_naive 함수를 구현하세요.\n", "\n", "from cs231n.classifiers.softmax import softmax_loss_naive\n", From 2c23d2be850a1005c661a84e22b7475e3056d7ea Mon Sep 17 00:00:00 2001 From: jung_hojin Date: Sun, 15 May 2016 02:36:54 +0900 Subject: [PATCH 134/199] Fill up inaudible words --- captions/En/Lecture4_en.srt | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/captions/En/Lecture4_en.srt b/captions/En/Lecture4_en.srt index 69db59ba..5d650195 100644 --- a/captions/En/Lecture4_en.srt +++ b/captions/En/Lecture4_en.srt @@ -344,7 +344,7 @@ going to recurrent neural networks in a 70 00:04:42,478 --> 00:04:45,848 bit, but what you end up doing is you end -up [??] this graph, so think about +up unrolling this graph, so think about 71 00:04:45,848 --> 00:04:51,658 @@ -477,8 +477,8 @@ the intermediates in our circuit until 97 00:06:39,300 --> 00:06:43,509 -at the very end, we're going to build up to [??] -it the gradients on the inputs, and so we +at the very end, we're going to build up to +get the gradients on the inputs, and so we 98 00:06:43,509 --> 00:06:47,680 @@ -1324,7 +1324,7 @@ So to continue backpropagation 269 00:21:12,259 --> 00:21:20,000 here and apply chain rule, we would -receive [STUDENT ANSWER] Okay, so these are most of the rhetorical questions so I'm [??] +receive [STUDENT ANSWER] Okay, so these are most of the rhetorical questions so I'm 270 00:21:20,000 --> 00:21:25,119 @@ -1431,7 +1431,7 @@ they'll get multiplied and when you multiply by 1 291 00:23:14,339 --> 00:23:18,129 -something remains unchanged. So a plus [??] +something remains unchanged. So a plus gate, it's kind of like a gradient 292 From ca0c9de2795c7aba6ceb6ac2cf5253d228150eba Mon Sep 17 00:00:00 2001 From: YB Date: Sun, 15 May 2016 22:36:51 -0400 Subject: [PATCH 135/199] Lecture1 - part 171~190 (out of 715) en / ko --- captions/En/Lecture1_en.srt | 64 ++++++++++++++++++------------------- captions/Ko/Lecture1_ko.srt | 41 ++++++++++++------------ 2 files changed, 53 insertions(+), 52 deletions(-) diff --git a/captions/En/Lecture1_en.srt b/captions/En/Lecture1_en.srt index 3d0bb97d..68d1b453 100644 --- a/captions/En/Lecture1_en.srt +++ b/captions/En/Lecture1_en.srt @@ -840,38 +840,38 @@ the of the of the real-world image so 171 00:19:20,319 --> 00:19:27,779 -that's the beginning of the modern you -know engineering +That's the beginning of the modern, you +know, engineering of vision. 172 00:19:27,779 --> 00:19:36,170 -vision it started with one team to copy -the world and wanted to make a copy of +It's started with wanting to copy +the world and wanting to make a copy of 173 00:19:36,170 --> 00:19:42,350 -the visual world it hasn't gone anywhere -close to wanting to engineer the +the visual world. 
+It hasn't got anywhere close to wanting to engineer the 174 00:19:42,349 --> 00:19:46,879 -understanding of the visual world right -now we're just talking about duplicating +understanding of the visual world. +Right now, we're just talking about duplicating 175 00:19:46,880 --> 00:19:53,760 -the visual world so that's one important +the visual world. so that's one important work to remember and of course after 176 00:19:53,759 --> 00:20:01,299 -camera obscura although we we we start -to see a whole series of successful in +camera Obscura that we we we start +to see a whole series of successful, you 177 00:20:01,299 --> 00:20:07,539 -all some film gets developed you know -like kodak was one of the first +know, some film gets developed, you know +like Kodak was one of the first 178 00:20:07,539 --> 00:20:12,329 @@ -880,23 +880,23 @@ and then we start to have camcorders and 179 00:20:12,329 --> 00:20:21,889 -and and all this very important +and and all this. Another very important important piece of work that I want you 180 00:20:21,890 --> 00:20:28,050 to be aware of as vision student is -absolutely nothing engineering work but +actually not a engineering work but 181 00:20:28,049 --> 00:20:32,710 -do you think science piece of science +science piece of science work that's starting to ask the question 182 00:20:32,710 --> 00:20:38,130 -is how does Visual work in our -biological bring you know we +is how does Vision work in our +biological brain? you know we 183 00:20:38,130 --> 00:20:45,760 @@ -905,38 +905,38 @@ years of evolution to get to really 184 00:20:45,759 --> 00:20:54,579 -fantastic visual system in humans but -what did evolution do during this time +fantastic visual system in mammals and humans but +what did evolution do during this time? 185 00:20:54,579 --> 00:21:01,759 -what kind of architecture did develop -from that simple trilobite to today +what kind of architecture did it develop +from that simple trilobite eye to today 186 00:21:01,759 --> 00:21:07,950 -yours and mine while very important -piece of work happened at Harvard lied +yours and mine? Well, very important +piece of work happened at Harvard like 187 00:21:07,950 --> 00:21:12,690 -to that time too young to very young -ambitious pulls the occupant in the +two at that time two young two very young +ambitious post-doc Hubel and Wiesel. 188 00:21:12,690 --> 00:21:21,500 -vehicle what they did is that they used +What they did is that they used awake but anaesthetized cats and then 189 00:21:21,500 --> 00:21:28,529 -there was not technology to build this -little needle electrode to push the +there was enough technology to build this +little needle called electrode to push the 190 00:21:28,529 --> 00:21:35,129 -electrons through to the the the the -skull is open until the bringing of the +electrode through into the the the the +skull is open into the brain of the 191 00:21:35,130 --> 00:21:42,180 @@ -961,7 +961,7 @@ of course but earliest stage for visual 195 00:22:02,369 --> 00:22:07,299 processing then there is tons and tons -of new orleans working on vision then we +of neurons working on vision then we 196 00:22:07,299 --> 00:22:12,419 diff --git a/captions/Ko/Lecture1_ko.srt b/captions/Ko/Lecture1_ko.srt index 3cff0561..7790c724 100644 --- a/captions/Ko/Lecture1_ko.srt +++ b/captions/Ko/Lecture1_ko.srt @@ -700,83 +700,84 @@ 171 00:19:20,319 --> 00:19:27,779 - 즉, 당신이 알고있는 현대 엔지니어링의 시작이다 + 이것이 현대 Vision 공학의 시작이라고 할 수 있습니다. 
172 00:19:27,779 --> 00:19:36,170 - 비전이 세상을 복사 한 팀과 시작의 복사본을 만들고 싶었다 + 비전은 세상을 복사하고 싶어서, 시각적인 세상을 복사하고 싶어서 시작되었죠. 173 00:19:36,170 --> 00:19:42,350 - 시각 세계는은을 설계하고자 어디서나 가까운 사라되지 않았습니다 + 이 시기까지는 시각적인 세상을 공학적으로 이해할 단계는 아니며 174 00:19:42,349 --> 00:19:46,879 - 시각 세계의 이해는 지금 우리는 단지 복제에 대해 얘기하고 + 단지 세상을 복제하고 있죠. 175 00:19:46,880 --> 00:19:53,760 - 그래서 시각적 세계는 하나의 중요한 기억해야 할 일과 물론 이후의 + 하지만 여전히 기억해야할 중요한 업적이지요. 176 00:19:53,759 --> 00:20:01,299 - 우리는 우리가 성공의 전체 시리즈를 참조하기 시작 우리하지만 카메라 옵스큐라 + 물론 Obscura 카메라 이후에 많은 발전이 있어 왔어요. 177 00:20:01,299 --> 00:20:07,539 - 모두가 어떤 영화는 코닥처럼 알고 개발되는 것은 최초의 일이었다 + 여러분도 알다시피 필름이 개발되고 Kodak에서 178 00:20:07,539 --> 00:20:12,329 - 회사는 상업 카메라를 개발하고 우리는 캠코더를 가지고 시작 + 상업 카메라를 개발하고 우리는 캠코더들까지 가지게 되었죠. 179 00:20:12,329 --> 00:20:21,889 - 과 및 모든 작업이 매우 중요 중요한 부분 당신을 원하는 + Vision 을 공부하는 학생으로서 180 00:20:21,890 --> 00:20:28,050 - 비전 학생이 절대적으로 아무것도 엔지니어링 작품으로 알고 있어야하지만, + 여러분들이 알고 있어야할 또 다들 중요한 점은 사실 공학적인 요소가 아닌 181 00:20:28,049 --> 00:20:32,710 - 당신은 질문을 시작하고 과학 연구의 과학 조각을 생각 + 과학적인 요소로써 이런 질문을 던집니다. 182 00:20:32,710 --> 00:20:38,130 - 우리의 생물학적 비주얼 작업 당신이 우리를 알고 가져 않는 방법입니다 + Vision 은 우리의 생물학적 뇌안에서 어떻게 작동할까요? 183 00:20:38,130 --> 00:20:45,760 - 우리는 지금 정말에 도착하는 진화의 5억4천만년했다 것을 알고있다 + 우리는 이제 5억 4천만년에 걸친 진화를 통해서 184 00:20:45,759 --> 00:20:54,579 - 환상적인 비주얼 인간의 시스템 만 한 일이 진화는이 기간 동안 수행 + 포유류와 인간이 가진 끝내주는 시각 시스탬이 생겼다는 것을 배웠어요. + 하지만 이 기간동안의 진화과정에서 무슨 일이 일어난 걸까요? 185 00:20:54,579 --> 00:21:01,759 - 오늘 간단 삼엽충에서 건축의 어떤 종류를 개발 않았다 + 삼엽충의 간단한 눈에서 오늘날의 여러분과 저의 눈까지 어떠한 구조를 발달시켜온 걸까요? 186 00:21:01,759 --> 00:21:07,950 - 작업의 매우 중요한 부분 하버드에서 일어난 당신과 나의 동안 거짓말 + 매우 중요한 연구가 하버드에서 187 00:21:07,950 --> 00:21:12,690 - 그 시간에 야심 찬 아주 젊은 너무 젊은는에서 탑승자를 가져옵니다 + 박사후 과정에 있었던 당시 매우 젊고 열정있는 Hubel과 Wiesel 두 사람에 의해 이루어졌어요. 188 00:21:12,690 --> 00:21:21,500 - 그들이 무슨 짓을했는지 차량은 다음 깨어 있지만 마취 고양이를 사용한다는 것입니다 + 그들은 고양이를 깨어있는 상태로 마취를 시키고 189 00:21:21,500 --> 00:21:28,529 - 를 밀어이 작은 바늘 전극을 구축하는 기술은 없었다 + Electrode라는 직은 바늘을 190 00:21:28,529 --> 00:21:35,129 - 상기하여 두개골을 통해 전자가의 지참까지 열려 + 두개골이 열린 상태의 고양이의 뇌로 집어 넣었습니다. 191 00:21:35,130 --> 00:21:42,180 From a26e4f4bed3b24da7865f6828d3e15ecbde4941a Mon Sep 17 00:00:00 2001 From: jung_hojin Date: Tue, 17 May 2016 01:47:45 +0900 Subject: [PATCH 136/199] Fix autogenerated English Captions from lecture 4 lines 401 ~ 600 Fixes autogenerated captions from 401 to 600. Some inaudible lines are marked with '[??]' for now. --- captions/En/Lecture4_en.srt | 693 ++++++++++++++++++------------------ 1 file changed, 346 insertions(+), 347 deletions(-) diff --git a/captions/En/Lecture4_en.srt b/captions/En/Lecture4_en.srt index 5d650195..64a91c2a 100644 --- a/captions/En/Lecture4_en.srt +++ b/captions/En/Lecture4_en.srt @@ -1973,932 +1973,931 @@ ever flow, they add up in these backward flow 400 00:32:00,009 --> 00:32:04,879 All right. We're going to go into -implementation very soon. I'll just some more +implementation very soon. I'll just take some more 401 00:32:04,880 --> 00:32:05,700 -of questions +questions. 402 00:32:05,700 --> 00:32:11,620 -thank you for the question the question -is is there ever like a loop in these +Thank you for the question. The question +is, is there ever, like a loop in these 403 00:32:11,619 --> 00:32:15,839 -graphs that will never be looks so there -are never any loops you might think that +graphs. There will never be loops, so there +are never any loops. 
You might think that

404
00:32:15,839 --> 00:32:18,589
if you use a recurrent neural network,
that there are loops in there

405
00:32:18,589 --> 00:32:21,658
but there are actually no loops because what we'll
do is we'll take a recurrent neural

406
00:32:21,659 --> 00:32:26,230
network and we will unfold it through time
steps and this will all become, there

407
00:32:26,230 --> 00:32:31,259
will never be a loop in the unfolded graph
where we've copied pasted that small recurrent net piece over time.

408
00:32:31,259 --> 00:32:39,538
You'll see that more when we actually
get into it but these are always DAGs, there are no loops. Okay, awesome.

409
00:32:39,538 --> 00:32:42,220
So let's look at the implementation of how this
is actually implemented in practice and

410
00:32:42,220 --> 00:32:46,860
I think it will help make this more
concrete as well. So we always have these

411
00:32:46,859 --> 00:32:52,038
graphs, computational graphs. These are the best way to
think about structuring neural networks.

412
00:32:52,038 --> 00:32:56,929
And so what we end up with is, all these
gates that we're going to see in a bit, but

413
00:32:56,929 --> 00:33:00,059
on top of the gates, there's something that
needs to maintain connectivity structure

414
00:33:00,058 --> 00:33:03,490
of this entire graph, what gates are
connected to each other. And so usually

415
00:33:03,490 --> 00:33:09,710
that's handled by a graph or a net object,
usually a net, and the net object has these

416
00:33:09,710 --> 00:33:13,679
two main pieces, which is the forward
and the backward piece. And this is just pseudo

417
00:33:13,679 --> 00:33:19,929
code, so this won't run, but basically, roughly the
idea is that in the forward pass

418
00:33:19,929 --> 00:33:23,759
we're iterating over all the gates in the circuit
that, and they're sorted in topological [??]

419
00:33:23,759 --> 00:33:27,980
order. What that means is that all the
inputs must come to every node before

420
00:33:27,980 --> 00:33:32,099
the output can be consumed.
So these are just ordered
from left to right and we're just

421
00:33:32,099 --> 00:33:35,969
forwarding, we're calling a forward on every
single gate along the way so we iterate

422
00:33:35,970 --> 00:33:39,600
over that graph and we just go forward in
every single piece and this net object will

423
00:33:39,599 --> 00:33:43,189
just make sure that happens in the
proper connectivity pattern. In backward

424
00:33:43,190 --> 00:33:46,620
pass, we're going in the exact reversed
order and we're calling backward on

425
00:33:46,619 --> 00:33:49,709
every single gate and these gates will
end up communicating gradients through each

426
00:33:49,710 --> 00:33:53,429
other and they all get chained up and
computing the analytic gradient at the back.

427
00:33:53,429 --> 00:33:57,860
So really a net object is a very thin
wrapper around all these gates, or as we

428
00:33:57,859 --> 00:34:01,879
will see they're called layers, layers or
gates. I'm going to use them interchangeably

429
00:34:01,880 --> 00:34:05,700
and they're just very thin wrappers
around connectivity structure of these

430
00:34:05,700 --> 00:34:09,369
gates and calling a forward and backward
function on them. And then let's look at

431
00:34:09,369 --> 00:34:12,950
a specific example of one of the gates
and how this might be implemented.

432
00:34:12,949 --> 00:34:16,759
And this is not just a pseudo code. This is
actually more like correct

433
00:34:16,760 --> 00:34:18,730
implementation something like this might [??] run

434
00:34:18,730 --> 00:34:23,769
at the end. So let's consider a multiply
gate and how it could be implemented.

435
00:34:23,769 --> 00:34:27,690
A multiply gate, in this case, is just a
binary multiply who receives two inputs [??]

436
00:34:27,690 --> 00:34:33,780
x and y. It computes their multiplication,
z = x * y and it returns z.

437
00:34:33,780 --> 00:34:38,950
And all these gates must basically satisfy this
API of a forward call and a backward call. How

438
00:34:38,949 --> 00:34:42,529
do you behave in a forward pass, and how
do you behave in a backward pass. And

439
00:34:42,530 --> 00:34:46,019
in a forward pass, we just compute whatever. In a
backward pass, we eventually end up

440
00:34:46,019 --> 00:34:52,639
learning about what is our gradient on
the final loss.
So dL/dz is what

441
00:34:52,639 --> 00:34:55,628
we learn. That's represented in this
variable dz, and right now

442
00:34:55,628 --> 00:35:00,639
everything here is scalars, so x, y, z are
numbers here. dz is also a number

443
00:35:00,639 --> 00:35:07,799
telling the influence on the end of the circuit. And what this gate
is in charge of in this backward pass is

444
00:35:07,800 --> 00:35:11,550
performing the little piece of chain rule.
So what we have to compute is how do you

445
00:35:11,550 --> 00:35:16,550
chain this gradient dz into your
inputs x and y. In other words, we have to compute dx and dy and we have to

446
00:35:16,550 --> 00:35:19,820
return those in the backward pass. And then
the computational graph will make sure

447
00:35:19,820 --> 00:35:23,720
that these get routed properly to all
the other gates. And if there are any

448
00:35:23,719 --> 00:35:27,919
edges that add up, the computational graph
might add all those gradients together.

449
00:35:27,920 --> 00:35:35,650
Okay, so how would we implement
the dx and dy? So for example, what is

450
00:35:35,650 --> 00:35:42,300
dx in this case? What would it be equal to,
the implementation?

451
00:35:42,300 --> 00:35:49,460
y * dz. Great. And, so y * dz. Additional point to make here by the way,

452
00:35:49,460 --> 00:35:53,659
note that I've added some lines in the forward
pass. We have to remember these values of

453
00:35:53,659 --> 00:35:57,509
x and y, because we end up using them in a
backward pass, so I'm assigning them to a

454
00:35:57,510 --> 00:36:01,000
'self.' because I need to remember
what x y are because I need access to

455
00:36:01,000 --> 00:36:04,949
them in my backward pass. In general, in
backpropagation, when we build these,

456
00:36:04,949 --> 00:36:09,359
when you actually do forward pass, every
single gate must remember the inputs and

457
00:36:09,360 --> 00:36:13,430
any kind of intermediate calculations that it has
performed that it needs to do, that needs

458
00:36:13,429 --> 00:36:17,069
access to in the backward pass.
So basically
when we end up running these networks at

459
00:36:17,070 --> 00:36:20,050
runtime, just always keep in mind that as
you're doing this forward pass, a huge

460
00:36:20,050 --> 00:36:22,890
amount of stuff gets cached in your
memory, and that all has to stick around

461
00:36:22,889 --> 00:36:25,909
because during backpropagation, you might
need access to some of those variables.

462
00:36:25,909 --> 00:36:30,779
And so, your memory ends up ballooning up
during the forward pass, and then in backward pass,

463
00:36:30,780 --> 00:36:33,690
it gets all consumed and we need all those
intermediates to actually compute the

464
00:36:33,690 --> 00:36:45,289
proper backward pass. So that's... [STUDENT QUESTION] Yes, so if you don't, if you know you don't want to do backward pass, then you can get rid of many of these things and you

465
00:36:45,289 --> 00:36:49,710
don't have to compute, you don't need to cache
them. So you can save memory for sure.

466
00:36:49,710 --> 00:36:54,110
But I don't think most implementations
actually worry about that. I don't

467
00:36:54,110 --> 00:36:57,280
think there's a lot of logic that deals
with that. Usually we end up remembering it

468
00:36:57,280 --> 00:37:09,370
anyway. I, yeah. [STUDENT QUESTION] I see. Yes, so I think if you're in the [??]
embedded device for example, and you worry

469
00:37:09,369 --> 00:37:11,949
really about your memory constraints, this is
something that you might take advantage

470
00:37:11,949 --> 00:37:15,539
of. If you know that a neural network only
has to run at test time, then you might

471
00:37:15,539 --> 00:37:18,750
want to make sure to go into the code to
make sure nothing gets cached in case

472
00:37:18,750 --> 00:37:33,130
you want to do a backward pass. Questions.
Yes. [STUDENT QUESTION] You're saying if we remember the local gradients in

473
00:37:33,130 --> 00:37:39,750
the forward pass, then we don't have to
remember the other intermediates? I think

474
00:37:39,750 --> 00:37:45,269
that might only be the case in
some simple expressions like this one. I'm

475
00:37:45,269 --> 00:37:49,170
not actually sure if that's true in general.
But I mean, you're in charge of, remember

476
00:37:49,170 --> 00:37:54,950
whatever you need to, perform the
backward pass, and on a gate-by-gate basis. You

477
00:37:54,949 --> 00:37:58,509
don't necce, you can remember whatever [memory footprint??]
you feel like. It has lower footprint and so on.

478
00:37:58,510 --> 00:38:04,420
You can be clever with that. Okay, so just to give
you guys an example of what this looks like in

479
00:38:04,420 --> 00:38:08,250
practice, we're going to look at specific
examples, say, in Torch. Torch is a deep

480
00:38:08,250 --> 00:38:11,480
learning framework, which we might
go into a bit near the end of the class.

481
00:38:11,480 --> 00:38:16,750
Some of you might end up using it for
your projects. If you go into the Github repo

482
00:38:16,750 --> 00:38:20,320
for Torch, and you'll look at,
basically, it's just a giant collection

483
00:38:20,320 --> 00:38:24,580
of these layer objects and these are the
gates. Layers, gates, the same thing. So there's

484
00:38:24,579 --> 00:38:27,429
all these layers. That's really what a
deep learning framework is. It's just a

485
00:38:27,429 --> 00:38:31,559
whole bunch of layers and a very thin
computational graph thing that keeps track

486
00:38:31,559 --> 00:38:36,420
of all the layer connectivity. And so really,
the image to have in mind is all these

487
00:38:36,420 --> 00:38:42,639
things are your Lego blocks, and then
we're building up these computational graphs out of

488
00:38:42,639 --> 00:38:44,829
your Lego blocks, out of the layers.
You're putting them together in various

489
00:38:44,829 --> 00:38:47,549
ways depending on what you want to
achieve, so you end up building all

490
00:38:47,550 --> 00:38:51,519
kinds of stuff. So that's how you work
with neural networks. So every library is

491
00:38:51,519 --> 00:38:54,809
just a whole set of layers that you
might want to compute, and every layer is

492
00:38:54,809 --> 00:38:58,840
just implementing a small function piece, and
that function piece knows how to do a

493
00:38:58,840 --> 00:39:02,670
forward and it knows how to do a backward.
So just [?? For the?]
specific example, let's

494
00:39:02,670 --> 00:39:10,150
look at the MulConstant layer in
Torch. The MulConstant layer performs

495
00:39:10,150 --> 00:39:16,039
just a scaling by a scalar. So it takes
some tensor X. So this is not a scalar

496
00:39:16,039 --> 00:39:19,300
but it's actually like an array of
numbers basically, because when we

497
00:39:19,300 --> 00:39:22,410
actually work with these, we do a lot of
[??] operation so we receive a tensor

498
00:39:22,409 --> 00:39:28,289
which is really just an n-dimensional
array, and we scale it by a constant. And you

499
00:39:28,289 --> 00:39:31,980
can see that this layer actually just has 40
lines. There's some initialization stuff.

500
00:39:31,980 --> 00:39:35,940
This is Lua, by the way. If this is
looking somewhat foreign to you, but there's

501
00:39:35,940 --> 00:39:40,510
initialization, where you actually
pass in that a that you want to use as

502
00:39:40,510 --> 00:39:44,630
your scaling, and then during the
forward pass which they call updateOutput

503
00:39:44,630 --> 00:39:49,170
in a forward pass all they do is
they just multiply aX and return it. And

504
00:39:49,170 --> 00:39:53,760
in the backward pass which they call
updateGradInput, there's an if statement

505
00:39:53,760 --> 00:39:56,510
here but really when you look at these
three lines, they're most important. You can

506
00:39:56,510 --> 00:39:59,690
see that all it's doing is it's copying into a
variable gradInput

507
00:39:59,690 --> 00:40:03,539
which you need to compute. That's your gradient [??]
that you're passing up. The gradInput is,

508
00:40:03,539 --> 00:40:08,309
you're copying gradOutput. gradOutput is
your gradient on final loss.

509
00:40:08,309 --> 00:40:11,989
You're copying that over into gradInput
and you're multiplying by the scalar,

510
00:40:11,989 --> 00:40:15,629
which is what you should be doing
because your local gradient is just a

511
00:40:15,630 --> 00:40:19,980
and so you take the output you have, you
take the gradient from above and you just

512
00:40:19,980 --> 00:40:23,150
scale it by a, which is what these three
lines are doing. And that's your gradInput

513
00:40:23,150 --> 00:40:27,849
and that's what you return. So
that's one of the hundreds of layers

514
00:40:27,849 --> 00:40:32,110
that are in Torch. We can also look
at examples in Caffe. Caffe is also a

515
00:40:32,110 --> 00:40:36,140
deep learning framework specifically for
images that you might be working with. Again, if

516
00:40:36,139 --> 00:40:39,690
you go into the layers directory in GitHub, you just see
all these layers. All of them implement

517
00:40:39,690 --> 00:40:43,490
the forward backward API. So just to give
you an example, there's a sigmoid layer in Caffe.

518
00:40:43,489 --> 00:40:51,269
So sigmoid layer takes a blob. So Caffe likes to
call these tensors blobs. So it takes a

519
00:40:51,269 --> 00:40:54,219
blob. It's just an n-dimensional array of
numbers, and it passes it

520
00:40:54,219 --> 00:40:57,949
elementwise through a sigmoid function. And so
it's computing in a forward pass a

521
00:40:57,949 --> 00:41:04,379
sigmoid, which you can see there. Let me use my
pointer. Okay, so there, it's calling, so a lot of

522
00:41:04,380 --> 00:41:07,840
this stuff is just boilerplate, getting
pointers to all the data, and then we

523
00:41:07,840 --> 00:41:11,730
have a bottom blob, and we're calling a
sigmoid function on the bottom and

524
00:41:11,730 --> 00:41:14,829
that's just a sigmoid function right
there. So that's what we compute. And in a

525
00:41:14,829 --> 00:41:18,719
backward pass, some boilerplate stuff, but
really what's important is we need to

526
00:41:18,719 --> 00:41:23,369
compute the gradient times the chain
rule here, so that's what you see in this

527
00:41:23,369 --> 00:41:26,150
line.
That's where the magic happens where +we take the diff, 528 00:41:26,150 --> 00:41:32,048 -so they call the greetings dips and you -compute the bottom diff is the top if +so they call the gradients diffs. And you +compute the bottom diff is the top diff 529 00:41:32,048 --> 00:41:36,869 -times this piece which is really the -that's the local gradient so this is +times this piece which is really the, +that's the local gradient, so this is 530 00:41:36,869 --> 00:41:41,960 chain rule happening right here through -that multiplication so and so every +that multiplication. So, and that's it. So every 531 00:41:41,960 --> 00:41:45,179 single layer just a forward backward API -and then you have a competition growth +and then you have a computational graph 532 00:41:45,179 --> 00:41:52,288 -on top or another object that troubled -connectivity and questions about some of +on top or a net object that keeps track of all the +connectivity. Any questions about some of 533 00:41:52,289 --> 00:42:00,849 -these implementations and so on +these implementations and so on? 534 00:42:00,849 --> 00:42:15,559 -because when you want to do right away -to a backward and I have a gradient and +Yes, thank you. [STUDENT QUESTION] So the question is, do we have to go through forward and backward for every update. The answer is yes, because when you want to do update, you need the gradient, and so you need to do forward on your sample minibatch. You do a forward. Right away you do a backward. +And now you have your analytic gradient. 535 00:42:15,559 --> 00:42:19,369 -I can do an update right up my alley -gradient and I change my way it's a tiny +And now I can do an update, where I take my analytic +gradient and I change my weights a tiny 536 00:42:19,369 --> 00:42:24,960 -bit and the direction the negative -direction of your writing so overcome +bit in the direction, the negative +direction of your gradient. So forward computes 537 00:42:24,960 --> 00:42:28,858 -the loss backward computer gradient and +the loss, backward computes your gradient, and then the update uses the gradient to 538 00:42:28,858 --> 00:42:33,278 -increment you are a bit so that's what -keeps happening Lupin III neural network +increment your weights a bit. So that's what +keeps happening in the loop. When you train a neural network 539 00:42:33,278 --> 00:42:36,318 -that's all that's happening forward -backward update forward backward state +that's all that's happening. Forward, +backward, update. Forward, backward, update. 540 00:42:36,318 --> 00:42:51,808 -will see that you're asking about the -for loop therefore Lapeer I do notice ok +We'll see that in a bit. Go ahead. [STUDENT QUESTION] You're asking about a +for loop. Oh, is there a for loop here? I didn't even notice. Okay. 541 00:42:51,809 --> 00:42:57,160 -yeah they have a for loop yes you'd like -us to be better eyes and that actually +Yeah, they have a for loop. Yes, so you'd like +this to be vectorized and that actually... 542 00:42:57,159 --> 00:43:03,679 -sure this is C++ so I think they just go -for it +Because this is C++, so I think they just do it. +Go for it. 543 00:43:03,679 --> 00:43:10,899 -yeah so this is a CPU implementation by -the way I should mention that this is a +Yeah, so this is a CPU implementation by +the way. I should mention that this is a 544 00:43:10,900 --> 00:43:14,599 -CPU implementation of a similar there's -a second file that implement the +CPU implementation of a sigmoid layer. 
+There's a second file that implements the 545 00:43:14,599 --> 00:43:19,420 -simulator on GPU and that's correct code -and so that's a separate file its +sigmoid layer on GPU and that's CUDA code. +And so that's a separate file. It 546 00:43:19,420 --> 00:43:21,980 -would-be sigmoid out see you or -something like that I'm not showing you +would be sigmoid.cu or +something like that. I'm not showing you that. 547 00:43:21,980 --> 00:43:30,349 -that the russians ok great so I like to -make is will be of course working with +Any questions? Okay, great. So one point I'd like to +make is, we'll be of course working with 548 00:43:30,349 --> 00:43:33,519 -better so these things flowing along our -grass are not just killers they're going +vectors, so these things flowing along our +graphs are not just scalars. They're going 549 00:43:33,519 --> 00:43:38,449 -to be entire back to us and so nothing -changes the only thing that is different +to be entire vectors. And so nothing +changes. The only thing that is different 550 00:43:38,449 --> 00:43:43,529 -now since these are vectors XY and Z are -vectors is that these local gradient +now since these are vectors, x, y, and z are +vectors, is that this local gradient 551 00:43:43,530 --> 00:43:47,530 -which before used to be just a scalar -now there in general for general +which before used to be just a scalar, +now they're in general, for general 552 00:43:47,530 --> 00:43:51,290 -expressions their full Jacobian matrices -and so it could be a major exodus +expressions, they're full Jacobian matrices. +And so Jacobian matrix is this 553 00:43:51,289 --> 00:43:54,670 two-dimensional matrix and basically -tells me what is the influence of every +tells you what is the influence of every 554 00:43:54,670 --> 00:43:58,010 -single element in X on every single -element of +single element in x on every single +element of z, 555 00:43:58,010 --> 00:44:01,880 -and that's what you can be a major -source and the gradient the same +and that's what Jacobian matrix +stores, and the gradient is the same 556 00:44:01,880 --> 00:44:08,960 -expression as before but now they hear -the IDX is a vector and DL Moody said is +expression as before, but now, say here, +dz/dx is a vector and dL/dz is... sorry. 557 00:44:08,960 --> 00:44:16,079 -designed as an actor and designed by Dax -is an entire Jacobian matrix end up with +dL/dz is a vector and dz/dx +is an entire Jacobian matrix, so you end up with 558 00:44:16,079 --> 00:44:32,130 an entire matrix-vector multiply to -actually change the gradient know so +actually chain the gradient backwards. [STUDENT QUESTION] 559 00:44:32,130 --> 00:44:36,380 -I'll come back to this point in a bit -you never actually end up forming the +No. So I'll come back to this point in a bit. +You never actually end up forming the full 560 00:44:36,380 --> 00:44:40,119 -Jacobian you'll never actually do this -matrix multiply most of the time this is +Jacobian. You'll never actually do this +matrix multiply most of the time. This is 561 00:44:40,119 --> 00:44:43,730 -just a general way of looking at you -know arbitrary function and I need to +just a general way of looking at, you +know, arbitrary function, and I need to 562 00:44:43,730 --> 00:44:46,260 -keep track of this and I think that +keep track of this. 
And I think that
these two are actually out of order

563
00:44:46,260 --> 00:44:49,569
because dz/dx is the Jacobian
which should be on the left side, so

564
00:44:49,568 --> 00:44:53,159
I think that's a mistaken slide because
this should be a matrix-vector multiply.

565
00:44:53,159 --> 00:44:57,618
So I'll show you why you don't actually
need to ever perform those Jacobians. So let's

566
00:44:57,619 --> 00:45:02,119
work with a specific example that is
relatively common in neural networks.

567
00:45:02,119 --> 00:45:06,869
Suppose we have this nonlinearity max(0, x).
So really what this operation

568
00:45:06,869 --> 00:45:11,068
is doing is it's receiving a vector, say
4096 numbers, which is a typical thing

569
00:45:11,068 --> 00:45:12,308
you might want to do.

570
00:45:12,309 --> 00:45:14,630
4096 numbers, real value, come in

571
00:45:14,630 --> 00:45:19,630
and you're computing an element-wise
thresholding at 0, so anything that is lower

572
00:45:19,630 --> 00:45:24,680
than 0 gets clamped to 0, and that's your
function that you're computing. And so output

573
00:45:24,679 --> 00:45:28,588
vector is of the same dimension. So
the question here I'd like to ask is

574
00:45:28,588 --> 00:45:40,268
what is the size of the Jacobian matrix
for this layer? 4096 by 4096. In principle,

575
00:45:40,268 --> 00:45:45,018
every single number in here could have
influenced every single number in there.

576
00:45:45,018 --> 00:45:49,459
But that's not the case necessarily,
right? So the second question is, so this

577
00:45:49,460 --> 00:45:52,949
is a huge matrix, 16 million
numbers, but why would you never form it? [??]

578
00:45:52,949 --> 00:46:02,719
What does the Jacobian actually look like? [STUDENT QUESTION] No, Jacobian will always be
a matrix, because every one of these 4096

579
00:46:02,719 --> 00:46:09,949
could have influenced every... It is, so the
Jacobian is still a giant 4096 by 4096

580
00:46:09,949 --> 00:46:14,558
matrix, but has special structure, right?
And what is that special structure? [STUDENT ANSWER]

581
00:46:14,559 --> 00:46:27,420
Yeah, so this Jacobian is huge.
So it's 4096 by 4096 matrix, but
+there are only elements on the diagonal

582
00:46:27,420 --> 00:46:33,700
-because this is an element was operation
-and moreover they're not just once but
+because this is an element-wise operation,
+and moreover, they're not just 1's, but

583
00:46:33,699 --> 00:46:38,129
-whichever element was less than zero it
-was clamped 20 so some of these ones
+for whichever element that was less than 0,
+it was clamped to 0, so some of these 1's

584
00:46:38,130 --> 00:46:42,798
-actually are zeros in whichever elements
-had a lower than zero value during the
+actually are zeros, in whichever elements
+had a lower-than-zero value during the

585
00:46:42,798 --> 00:46:47,429
-forward pass and so the Jacobian would
-just be almost no identity matrix but
+forward pass. And so the Jacobian would
+just be almost an identity matrix but

586
00:46:47,429 --> 00:46:52,250
-some of them are actually Sarah so you
+some of them are actually zero. So you
never actually would want to form the

587
@@ -2909,12 +2908,12 @@
so you never actually want to carry out

588
00:46:55,429 --> 00:47:00,808
this operation as a matrix-vector
-multiply because their special structure
+multiply, because (of) their special structure [??]

589
00:47:00,809 --> 00:47:04,150
-that we want to take advantage of and so
-in particular the gradient the backward
+that we want to take advantage of. And so
+in particular, the gradient, the backward

590
00:47:04,150 --> 00:47:09,269

591

less than zero and you want to kill the

592
00:47:14,159 --> 00:47:17,210
-gradient and those mentioned you want to
-set the gradient 20 in those dimensions
+gradient in those dimensions. You want to
+set the gradient to 0 in those dimensions.

593
00:47:17,210 --> 00:47:21,650
-so you take the grid out but here and
-whichever numbers were less than zero
+So you take the grid output here, and [??]
+whichever numbers were less than zero,

594
00:47:21,650 --> 00:47:25,910
-just set them 200 and then you can ask
+just set them to 0. Set those gradients to 0 and then you continue backward pass.

595
00:47:25,909 --> 00:47:52,230
-so very simple operations in the in the
-end in terms of
+So very simple operations in the
+end in terms of efficiency. [STUDENT QUESTION] That's right. So the question is, the communication between the gates is always just vectors. That's right. So this Jacobian

596
00:47:52,230 --> 00:47:55,940
-if you want to you can do that but
-that's internal to you and said the gate
+if you wanted to, you can form that but
+that's internal to you inside the gate.

597
00:47:55,940 --> 00:47:59,670
-and you can use that to do backdrop but
-what's going back to other dates they
+And you can use that to do backprop, but
+what's going back to other gates, they

598
00:47:59,670 --> 00:48:17,380
-only care about the gradient vector so
+only care about the gradient vector. [STUDENT QUESTION] Yes, so the question is, unless you end up having multiple outputs, because then for each output, we have to do this, so yeah. So we'll never actually run into that case

599
00:48:17,380 --> 00:48:20,430
because we almost always have a single
-out but skill and rallied in the end
+output, scalar value at the end

600
00:48:20,429 --> 00:48:24,129
-because we're interested in Los
-functions so we just have a single
+because we're interested in loss
+functions.
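As a minimal numpy sketch of exactly this point (assuming a 4096-dimensional input, as in the example), the backward pass through max(0, x) never builds the 4096-by-4096 Jacobian; the diagonal structure collapses the whole matrix-vector multiply into an elementwise mask:

```python
import numpy as np

x = np.random.randn(4096)       # input vector to the max(0, x) gate
z = np.maximum(0, x)            # forward: elementwise threshold at 0

dL_dz = np.random.randn(4096)   # upstream gradient

# The full Jacobian would be 4096 x 4096 and diagonal: 1 where x > 0,
# 0 where the input was clamped. Exploiting that structure, backprop
# reduces to masking the upstream gradient:
dL_dx = dL_dz * (x > 0)         # kill gradients where the input was negative
```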
So we just have a single 601 00:48:24,130 --> 00:48:27,318 From 95db9da7b5dc5822c66b9cb66f78db63dcdc0bc3 Mon Sep 17 00:00:00 2001 From: jung_hojin Date: Thu, 19 May 2016 03:20:08 +0900 Subject: [PATCH 137/199] Fix autogenerated English captions from lecture 4 lines 601 ~ 800 Fixes autogenerated captions from 601 to 800. Some inaudible lines are marked with '[??]' for now. --- captions/En/Lecture4_en.srt | 1294 +++++++++++++++++------------------ 1 file changed, 647 insertions(+), 647 deletions(-) diff --git a/captions/En/Lecture4_en.srt b/captions/En/Lecture4_en.srt index 64a91c2a..9f2e678c 100644 --- a/captions/En/Lecture4_en.srt +++ b/captions/En/Lecture4_en.srt @@ -2971,98 +2971,98 @@ functions. So we just have a single 601 00:48:24,130 --> 00:48:27,318 -number at the end that were interested -in trading for prospective if we had +number at the end that we're interested +in computing gradients respective. If we had [??] 602 00:48:27,318 --> 00:48:30,949 -multiple outputs then we have to keep +multiple outputs, then we have to keep track of all of those as well 603 00:48:30,949 --> 00:48:35,769 -imperil when we do the backpropagation -but we just have to get a rally loss +in parallel when we do the backpropagation. +But we just have scalar value loss 604 00:48:35,769 --> 00:48:45,880 -function so as not to worry about that -so I want to also make the point that +function so we don't have to worry about that. Okay, makes sense? +So I want to also make the point that 605 00:48:45,880 --> 00:48:51,230 -actually four thousand crazy usually we -use many batches so say many batch of a +actually 4096 dimensions is not even crazy. Usually we +use minibatches, so say, minibatch of a 606 00:48:51,230 --> 00:48:54,929 -hundred elements going through the same -time and then you end up with a hundred +100 elements going through at the same +time, and then you end up with 100 607 00:48:54,929 --> 00:48:59,038 -4096 emotional factors that are all -coming in peril but all the examples +4096-dimensional vectors that are all +coming in parallel, but all the examples 608 00:48:59,039 --> 00:49:02,539 -enemy better processed independently of -each other in peril and so that you +in the minibatch are processed independently of +each other in parallel, and so this Jacobian matrix 609 00:49:02,539 --> 00:49:08,869 -could really end up being four hundred -million so huge so you never formally is +really ends up being 400 million, 400,000 by 400,000. +So huge so you never form these, 610 00:49:08,869 --> 00:49:14,160 -basically and you takes to take care to +basically. And you take, you take care to actually take advantage of the sparsity 611 00:49:14,159 --> 00:49:17,538 structure in the Jacobian and you hand -code operations you don't actually right +code operations, so you don't actually write 612 00:49:17,539 --> 00:49:25,819 -before the generalized general inside -any gate implementation ok so I'd like +fully generalized chain rule inside +any gate implementation. Okay cool. So I'd like 613 00:49:25,818 --> 00:49:30,788 -to point out that your assignment he'll -be writing as Max and so on and I just +to point out that in your assignment, you'll +be writing SVMs and Softmax and so on, and I just kind of 614 00:49:30,789 --> 00:49:33,680 -wanted to give you a hint on the design +would like to give you a hint on the design [??] of how you actually should approach this 615 00:49:33,679 --> 00:49:39,769 -problem what you should do is just think -about it as a back propagation even if +problem. 
What you should do is just think
+about it as a back propagation, even if

616
00:49:39,769 --> 00:49:44,108
-you're doing this for classification
-optimization so roughly or structure
+you're doing this for linear classification
+optimization. So roughly, your structure

617
00:49:44,108 --> 00:49:50,048
-should look something like this where
-against major computation and units that
+should look something like this where...
+again, stage your computation in units that

618
00:49:50,048 --> 00:49:53,960
-you know the local gradient off and then
-do backdrop when you actually these
+you know the local gradient of and then
+do backprop when you actually evaluate these

619
00:49:53,960 --> 00:49:57,679
-gradients in your assignment so in the
-top your code will look something like
+gradients in your assignment. So in the
+top, your code will look something like

620
00:49:57,679 --> 00:49:59,679
structure because you're doing

621
00:49:59,679 --> 00:50:04,038
-everything in line so no crazy I just
-running like that that you have to do
+everything inline. So no crazy edges
+or anything like that that you have to do.

622
00:50:04,039 --> 00:50:07,200
-you will do that in a second assignment
-you'll actually come up with a graphic
+You will do that in the second assignment.
+You'll actually come up with a graph

623
00:50:07,199 --> 00:50:10,509
-object you implement your layers but my
-first assignment you're just doing it in
+object and you'll implement your layers. But in
+the first assignment, you're just doing it inline

624
00:50:10,510 --> 00:50:15,579
-line just straight up an awesome and so
-complete your scores based on wnx
+just straight up vanilla setup. And so
+compute your scores based on W and X.

625
00:50:15,579 --> 00:50:21,798
-compute these margins which are Maxim 0
-and the score differences compute the
+Compute these margins which are max of 0
+and the score differences, compute the

626
00:50:21,798 --> 00:50:26,239
-loss and then do backdrop and in
-particular I would really advise you to
+loss, and then do backprop. And in
+particular, I would really advise you to

627
00:50:26,239 --> 00:50:30,949
-have this intermediate course let you
-create a matrix and then compute the
+have these intermediate scores that you
+create. It's a matrix. And then compute the

628
00:50:30,949 --> 00:50:34,769
-gradient on scores before you can view
-the gradient on your weights and so
+gradient on scores before you compute
+the gradient on your weights. And so

629
00:50:34,769 --> 00:50:40,179
-chain chain rule here like you might be
-tempted to try to just arrived W the
+chain, use chain rule here. Otherwise, you might be
+tempted to try to just derive W, the

630
00:50:40,179 --> 00:50:43,798
-gradient on W equals and then implement
+gradient on W equals, and then implement [??]
that and that's an unhealthy way of

631
00:50:43,798 --> 00:50:47,349
-approaching problem so state your
-competition and do backdrop through this
+approaching the problem. So stage your
+computation and do backprop through this

632
00:50:47,349 --> 00:50:55,800
-course and they will help you out so
+scores and that will help you out. Okay, cool.

633
00:50:55,800 --> 00:51:01,570
-so far are hopelessly large so we end up
-in this competition structures and these
+So, let's see. Summary so far.
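A minimal numpy sketch of this staging advice (one possible reading of it, not the official assignment solution; shapes and names are illustrative): compute scores, margins, and loss forward, then backprop into the `scores` matrix first and only afterwards into the weights:

```python
import numpy as np

def svm_loss_staged(W, X, y):
    """Hinge loss staged through an explicit `scores` matrix.
    W: (D, C) weights, X: (N, D) data, y: (N,) correct class indices.
    A sketch of the staging idea, not the assignment's reference code."""
    N = X.shape[0]
    scores = X.dot(W)                                # stage 1: (N, C) scores
    correct = scores[np.arange(N), y][:, None]       # (N, 1) correct scores
    margins = np.maximum(0, scores - correct + 1.0)  # stage 2: hinge margins
    margins[np.arange(N), y] = 0
    loss = margins.sum() / N

    # Backprop: gradient on `scores` first...
    dscores = (margins > 0).astype(float)            # 1 where a margin is active
    dscores[np.arange(N), y] -= dscores.sum(axis=1)  # correct-class entries
    dscores /= N
    # ...and only then chain into the weights.
    dW = X.T.dot(dscores)                            # (D, C)
    return loss, dW
```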
Neural networks are hopelessly large, so we end up
+in these computational structures and these

634
00:51:01,570 --> 00:51:05,470
-intermediate nodes forward backward API
-for both the notes and also for the
+intermediate nodes, forward backward API
+for both the nodes and also for the

635
00:51:05,469 --> 00:51:08,869
-graph structure and infrastructure is
-usually a very thin wrapper on all these
+graph structure. And the graph structure is
+usually a very thin wrapper around all these

636
00:51:08,869 --> 00:51:12,059
-layers and it can handle the
-communication between him and his
+layers and it handles all the
+communication between them. And this

637
00:51:12,059 --> 00:51:16,380
communication is always along like
-doctors being passed around in practice
+vectors being passed around. In practice,

638
00:51:16,380 --> 00:51:19,289
-when we write these implementations what
-we're passing around our DS and
+when we write these implementations, what
+we're passing around are these

639
00:51:19,289 --> 00:51:23,079
-dimensional sensors really what that
-means is just an end dimensional array
+n-dimensional tensors. Really what that
+means is just an n-dimensional array.

640
00:51:23,079 --> 00:51:28,059
-array those are what goes between the
-gates and then internally every single
+So like a numpy array. Those are what goes between the
+gates, and then internally, every single

641
00:51:28,059 --> 00:51:33,529
gate knows what to do in the forward and
-backward pass ok so at this point I'm
+the backward pass. Okay, so at this point, I'm

642
00:51:33,530 --> 00:51:37,690
-going to end with that propagation and
-I'm going to go into neural networks so
+going to end with backpropagation and
+I'm going to go into neural networks. So

643
00:51:37,690 --> 00:51:49,860
any questions before we move on from
-background
+backprop? Go ahead. [STUDENT QUESTION]

644
00:51:49,860 --> 00:52:03,130
-operation challenging assignment almost
-is how do you make sure that you do all
+The summation is Li = blah? Yes, there is a sum there. So you want that to be vectorized operation that you... Yeah so basically, the challenge in your assignment almost
+is, how do you make sure that you do all

645
00:52:03,130 --> 00:52:06,750
-the sufficiently nicely with operations
-in numpy so that's going to be something
+this efficiently nicely with matrix vector operations
+in numpy, so that's going to be some of the

646
00:52:06,750 --> 00:52:18,030
-that brings our stuff that you guys are
-going to be like and what you want them
+brain teaser stuff that you guys are
+going to have to do. [STUDENT QUESTION] Yes, so it's up to you what you want your gates to be like, and what you want them

647
00:52:18,030 --> 00:52:24,490
-to be I don't think he'd want to do that
+to be. [STUDENT QUESTION] Yeah, I don't think you'd want to do that.

648
00:52:24,489 --> 00:52:30,739
-yeah I'm not sure maybe that works but
-it's up to you to design this and to
+Yeah, I'm not sure. Maybe that works. I don't know.
+But it's up to you to design this and to

649
00:52:30,739 --> 00:52:38,609
-back up through it so that's that's what
-we're going to go to neural networks is
+backprop through it. Yeah, so that's fun. Okay.
+So we're going to go to neural networks. This is

650
00:52:38,610 --> 00:52:44,010
-exactly what they look like you'll be
-involving me and this is what happens
+exactly what they look like.
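A minimal sketch of the forward/backward API just summarized (class and method names are illustrative, not from any particular framework): each gate implements only its own forward and backward pass, and the graph is a thin wrapper that routes vectors between gates:

```python
import numpy as np

class ReLUGate:
    """One node in the graph: implements only its own forward/backward."""
    def forward(self, x):
        self.x = x                        # cache input for the backward pass
        return np.maximum(0, x)
    def backward(self, dout):
        return dout * (self.x > 0)        # local gradient chained with dout

class SigmoidGate:
    def forward(self, x):
        self.out = 1.0 / (1.0 + np.exp(-x))
        return self.out
    def backward(self, dout):
        return dout * self.out * (1.0 - self.out)

class Net:
    """Thin wrapper around the gates: handles communication between them."""
    def __init__(self, gates):
        self.gates = gates
    def forward(self, x):
        for gate in self.gates:           # forward in topological order
            x = gate.forward(x)
        return x
    def backward(self, dout):
        for gate in reversed(self.gates): # backward in reverse order
            dout = gate.backward(dout)
        return dout

net = Net([ReLUGate(), SigmoidGate()])
out = net.forward(np.random.randn(5))     # vectors flow between the gates
dx = net.backward(np.ones(5))             # gradient with respect to the input
```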
So you'll be +implementing me, and this is just what happens 651 00:52:44,010 --> 00:52:46,770 -when you search on Google Images -networks this is I think the first +when you search on Google Images for +neural networks. This is I think the first 652 00:52:46,769 --> 00:52:51,590 -result of something like that so let's -look at the networks and before we dive +result or something like that. So let's +look at neural networks. And before we dive 653 00:52:51,590 --> 00:52:55,100 -into neural networks actually I'd like +into neural networks actually, I'd like to do it first without all the brain 654 00:52:55,099 --> 00:52:58,329 -stuff so forget that their neural forget +stuff. So forget that they're neural. Forget that they have any relation whatsoever 655 00:52:58,329 --> 00:53:03,170 -to brain they don't forget if you -thought that they did but they do let's +to a brain. They don't, but forget if you +thought that they did, that they do. Let's 656 00:53:03,170 --> 00:53:07,309 -just look at school functions well -before we thought that equals WX is what +just look at score functions. Well +before, we saw that f=Wx is what 657 00:53:07,309 --> 00:53:11,079 -we've been working with so far but now -as I said we're going to start to make +we've been working with so far. But now +as I said, we're going to start to make 658 00:53:11,079 --> 00:53:14,590 -that F more complex and so if you want +that f more complex. And so if you wanted to use a neural network then you're 659 00:53:14,590 --> 00:53:20,309 -going to change that equation to this so -this is a two-layer neural network and +going to change that equation to this. So +this is a two-layer neural network, and 660 00:53:20,309 --> 00:53:24,820 -that's what it looks like and it's just -a more complex mathematical expression X +that's what it looks like, and it's just +a more complex mathematical expression of x. 661 00:53:24,820 --> 00:53:30,230 -and so what's happening here as you -receive your input X and you make +And so what's happening here is, you +receive your input x, and you 662 00:53:30,230 --> 00:53:32,369 -multiplied by matrix just like we did -before +multiply it by a matrix, just like we did +before. 663 00:53:32,369 --> 00:53:36,619 -now what's coming next what comes next -is a nonlinearity or activation function +Now, what's coming next, what comes next +is a nonlinearity or activation function, 664 00:53:36,619 --> 00:53:39,710 -I'm going to go into several choices -that you might make for these in this +and we're going to go into several choices +that you might make for these. In this 665 00:53:39,710 --> 00:53:43,800 -case I'm using the threshold 0 as an -activation function so basically we're +case, I'm using the thresholding at 0 as an +activation function. So basically, we're 666 00:53:43,800 --> 00:53:47,780 -doing matrix multiply we threshold -everything they get 20 and then we do +doing matrix multiply, we threshold +everything negative to 0, and then we do 667 00:53:47,780 --> 00:53:52,240 -one more major supply and that gives us -are scarce and so if I was to drop this +one more matrix multiply, and that gives us +our scores. 
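The two-layer score function just described, f = W2 max(0, W1 x), as a minimal numpy sketch with illustrative CIFAR-10 shapes (3072 inputs, a 100-unit hidden layer, 10 class scores; the 0.01 initialization scale is an assumed choice, and biases are omitted as in the formula):

```python
import numpy as np

x = np.random.randn(3072)                 # one flattened CIFAR-10 image
W1 = 0.01 * np.random.randn(100, 3072)    # first-layer weights
W2 = 0.01 * np.random.randn(10, 100)      # second-layer weights

h = np.maximum(0, W1.dot(x))              # hidden layer: multiply, threshold at 0
scores = W2.dot(h)                        # one more matrix multiply: 10 scores
```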
And so if I was to draw this, 668 00:53:52,239 --> 00:53:58,169 -say in case of C for 10 with three South -3072 numbers going in the pixel values +say in case of CIFAR-10, with 3072 numbers +going in, those are the pixel values, 669 00:53:58,170 --> 00:54:02,110 -and before we just went one single major -metabolite discourse we went right away +and before, we just went one single matrix +multiply to scores. We went right away 670 00:54:02,110 --> 00:54:02,470 -22 +to 10 671 00:54:02,469 --> 00:54:05,899 -numbers but now we get to go through +numbers. But now, we get to go through this intermediate representation 672 00:54:05,900 --> 00:54:13,019 -pendants hidden state will call them -hidden layers so each of hundred-numbers +of hidden state. We'll call them +hidden layers. So hidden vector h of hundred numbers, say 673 00:54:13,019 --> 00:54:16,849 -or whatever you want your size of the -network to be so this is a high pressure +or whatever you want your size of the neural +network to be. So this is a hyperparameter. 674 00:54:16,849 --> 00:54:21,109 -that's a a hundred and we go through -this intermediate representation so make +that's, say, a hundred, and we go through +this intermediate representation. So matrix 675 00:54:21,108 --> 00:54:24,319 -sure to multiply gives us -hundred-numbers threshold at zero and +multiply gives us +hundred numbers, threshold at zero, and 676 00:54:24,320 --> 00:54:28,559 -then one will make sure that this course -and since we have more numbers we have +then one more matrix multiply to get the scores. +And since we have more numbers, we have 677 00:54:28,559 --> 00:54:33,820 more wiggle to do more interesting -things so I'm or one particular example +things. So a more, one particular example 678 00:54:33,820 --> 00:54:36,330 of something interesting you might want -to do what you might think that in the +to, you might think that a neural network 679 00:54:36,329 --> 00:54:40,210 -latter could do is going back to the +could do, is going back to this example of interpreting linear 680 00:54:40,210 --> 00:54:45,690 -classifiers on C part 10 and we saw the +classifiers on CIFAR-10, and we saw that the car class has this red car that tries to 681 00:54:45,690 --> 00:54:51,280 -merge all the modes of different car -space in different directions and so in +merge all the modes of different cars +facing in different directions. And so in 682 00:54:51,280 --> 00:54:57,980 -this case one single layer one single -leader crossfire had to go across all +this case, one single layer, one single +linear classifier had to go across all 683 00:54:57,980 --> 00:55:02,250 -those modes and we couldn't deal with -for example of different colors that +those modes, and we couldn't deal with +for example, cars of different colors. That 684 00:55:02,250 --> 00:55:05,190 -wasn't very natural to do but now we -have hundred-numbers in this +wasn't very natural to do. But now we +have hundred numbers in this 685 00:55:05,190 --> 00:55:08,289 -intermediate and so you might imagine -for example that one of those numbers +intermediate, and so you might imagine +for example, that one of those numbers 686 00:55:08,289 --> 00:55:11,539 -could be just picking up on the red -carpet leasing forward is just gotta +could be just picking up on red +car facing forward. It's just classifying, 687 00:55:11,539 --> 00:55:14,750 -find is there a wrecked car facing -forward another one could be red car +is there a red car facing +forward. 
Another one could be red car

688
00:55:14,750 --> 00:55:16,280
-facing slightly to the left
+facing slightly to the left,

689
00:55:16,280 --> 00:55:20,650
-let carvey seems like the right and
-those elements of age would only become
+left car facing slightly to the right, and
+those elements of h would only become

690
00:55:20,650 --> 00:55:24,358
positive if they find that thing in the
-image
+image,

691
00:55:24,358 --> 00:55:28,029
-otherwise they stay at zero and so
-another age might look for green cards
+otherwise, they stay at zero. And so
+another h might look for green cars

692
00:55:28,030 --> 00:55:31,180
-or yellow cards or whatever else in
-different orientations so now we can
+or yellow cars or whatever else in
+different orientations. So now we can

693
00:55:31,179 --> 00:55:35,669
have a template for all these different
-modes and so these neurons turn on or
+modes. And so these neurons turn on or

694
00:55:35,670 --> 00:55:41,869
off if they find the thing they're
-looking for some specific type and then
+looking for. Car of some specific type, and then

695
00:55:41,869 --> 00:55:46,660
-this W two major scan some across all
-those little card templates and I we
+this W2 matrix can sum across all
+those little car templates. So now we

696
00:55:46,659 --> 00:55:50,719
have like say twenty card templates of
-what you look like and now to complete
+what cars could look like, and now, to compute

697
00:55:50,719 --> 00:55:54,149
-the scoring classifier there's an
-additional measures so we have a choice
+the score of car classifier, there's an
+additional matrix multiply, so we have a choice

698
00:55:54,150 --> 00:55:58,700
-of a weighted sum over them and so if
-anyone of them turned on then through my
+of doing a weighted sum over them. And so if
+any one of them turned on, then through my

699
00:55:58,699 --> 00:56:02,269
-way it's somewhat positive weights
-presumably I would be adding up and
+weighted sum, with positive weights
+presumably, I would be adding up and

700
00:56:02,269 --> 00:56:07,358
-getting a higher score and so now I can
-have this multimodal our classifier
+getting a higher score. And so now I can
+have this multimodal car classifier

701
00:56:07,358 --> 00:56:13,098
through this additional hidden layer
-between there and wavy reason for why
+between there. So that's a handwavy reason for why

702
00:56:13,099 --> 00:56:14,720
these would do something more
-interesting
+interesting.

703
00:56:14,719 --> 00:56:49,509
-was a question for extra points in the
-assignment and do something fun or extra
+Was there a question? [STUDENT QUESTION] So the question is, if h had less than 10 units, would it be inferior to a linear classifier? I think that's... that's actually not obvious to me. It's an interesting question. I think... you could make that work. I think you could make it work. Yeah, I think that would actually work. Someone should try that for extra points in the
+assignment. So you'll have a section on the assignment do something fun or extra

704
00:56:49,510 --> 00:56:53,220
-and so you get the carpet whatever you
+and so you get to come up with whatever you
think is interesting experiment and will

705
00:56:53,219 --> 00:56:56,699
-give you some bonus points that's good
-candidate for for something you might
+give you some bonus points. So that's a good
+candidate for something you might

706
00:56:56,699 --> 00:56:59,659
-want to investigate whether that works
-or not
+want to investigate, whether that works
+or not.
707
00:56:59,659 --> 00:57:08,329
-questions
+Any other questions? Go ahead. [STUDENT QUESTION]

708
00:57:08,329 --> 00:57:34,989
-allocated over the different modes of
+Sorry, I don't think I understood the question. [STUDENT QUESTION] I see. So you're really asking about the layout of the h vector and how it gets allocated over the different modes of
the dataset and I don't have a good

709
00:57:34,989 --> 00:57:37,969
-answer for that this since we're going
+answer for that. Since we're going
to train this fully with

710
00:57:37,969 --> 00:57:39,500
-back-propagation
+backpropagation,

711
00:57:39,500 --> 00:57:42,690
-I think it's like a naive to think that
-there will be exact template for sale
+I think it's like naive to think that
+there will be exact template for, say a

712
00:57:42,690 --> 00:57:46,539
-let carvey seeing red carpet is left you
-probably want to find that you'll find
+left car facing, red car facing left. You
+probably won't find that. You'll find

713
00:57:46,539 --> 00:57:50,690
-these kind of like mixes and weird
-things intermediates and so on
+these kind of like mixes, and weird
+things, intermediates, and so on.

714
00:57:50,690 --> 00:57:55,630
-coming animal optimally find a way to
-truncate your data with its boundaries
+So this neural network will come in and it will optimally find a way to
+truncate your data with its linear boundaries

715
00:57:55,630 --> 00:57:59,809
-and kuwait's relegated just adjust the
-company could come alright so it's
+and these weights will all get adjusted
+just to make it come out right. So it's

716
00:57:59,809 --> 00:58:10,579
-really hard to say well become tangled
-up I think that's right so that's the
+really hard to say. It will all become tangled
+up I think. Go ahead. That's right. So that's the

717
00:58:10,579 --> 00:58:14,579
-size of hidden layer and a high
-primarily get to choose that so I chose
+size of a hidden layer, and it's a hyperparameter.
+We get to choose that. So I chose

718
00:58:14,579 --> 00:58:18,719
-hundred usually that's going to be
-usually you'll see that we're going to
+hundred. Usually that's going to be,
+usually, you'll see that with neural networks. We'll go into

719
00:58:18,719 --> 00:58:22,739
-this a lot but usually you want them to
-be as big as possible as its your
+this a lot, but usually you want them to
+be as big as possible, as it fits in your

720
00:58:22,739 --> 00:58:30,659
-computer and so on so more is better I'm
-going to that
+computer and so on, so more is better. So we'll go
+into that. Go ahead. [STUDENT QUESTION]

721
00:58:30,659 --> 00:58:38,639
-asking do we always take max 10 nature
-and we don't get this like five slides
+So you're asking, do we always take max of 0 and h,
+and we don't, and I'll get, it's like five slides

722
00:58:38,639 --> 00:58:44,359
-away somewhere to go into neural
-networks I guess maybe I should just go
+away. So I'm going to go into neural
+networks. I guess maybe I should preemptively just go

723
00:58:44,360 --> 00:58:48,390
-ahead and take questions near the end if
+ahead and take questions near the end. If
you wanted this to be a three-layer

724
00:58:48,389 --> 00:58:50,940
-neural network by the way there's a very
+neural network by the way, there's a very
simple way in which we just extend

725
00:58:50,940 --> 00:58:53,710
-that's right so we just keep continuing
-the same pattern we have all these
+this, right?
So we just keep continuing
+the same pattern where we have all these

726
00:58:53,710 --> 00:58:57,159
-intermediate hidden nodes and then we
+intermediate hidden nodes, and then we
can keep making our network deeper and

727
00:58:57,159 --> 00:58:59,750
-deeper and you can compute more
+deeper, and you can compute more
interesting functions because you're

728
00:58:59,750 --> 00:59:03,369
giving yourself more time to compute
-something interesting and henry VIII way
+something interesting in a handwavy way.

729
00:59:03,369 --> 00:59:09,559
-up one other slide I want to flash is
-that training a two-layer neural network
+Now, one other slide I wanted to flash is
+that, training a two-layer neural network,

730
00:59:09,559 --> 00:59:12,690
-I mean it's actually quite simple when
-it comes down to it so this is like
+I mean, it's actually quite simple when
+it comes down to it. So this is a slide

731
00:59:12,690 --> 00:59:17,349
-borrowed from Blockbuster and basically
-the price is roughly eleven lines of
+borrowed from a blog post I found, and basically
+it takes roughly eleven lines of

732
00:59:17,349 --> 00:59:21,980
Python to implement a two layer neural
-network during binary classification on
+network, doing binary classification on

733
00:59:21,980 --> 00:59:27,570
-what is this two-dimensional better to
-have a two dimensional data matrix X you
+what is this, two dimensional data. So you
+have a two dimensional data matrix X. You

734
00:59:27,570 --> 00:59:32,580
-have thirty three dimensional and you
-have a binary labels for why and then
+have, sorry it's three dimensional. And you
+have binary labels for y, and then

735
00:59:32,579 --> 00:59:36,579
-sin 0 sin 1 are your weight matrices
-wait one way to end so I think they're
+syn0 syn1 are your weight matrices
+weight1 weight2. And so I think they're

736
00:59:36,579 --> 00:59:41,150
-called central synapse but mature and
-then this is the opposition group here
+called syn for synapse but I'm not sure. And
+then this is the optimization loop here

737
00:59:41,150 --> 00:59:46,269
-and what you what you're seeing here I
-should use my point for more than just
+and what you're seeing here, I
+should use my pointer for more, what you're

738
00:59:46,269 --> 00:59:50,139
-being here as we're completing the first
-layer activations but and this is using
+seeing here is we're computing the first
+layer activations, but this is using

739
00:59:50,139 --> 00:59:54,069
-a signal nonlinearity not a max of 0
-necks and we're going to a bit of what
+a sigmoid nonlinearity not a max of 0 and X.
+And we'll go into a bit of what [??]

740
00:59:54,070 --> 00:59:58,650
-these nonlinearities might be more than
-one form is reviewing the first layer
+these nonlinearities might be. So sigmoid is
+one form. It's computing the first layer,

741
00:59:58,650 --> 01:00:03,059
-and the second layer and then its
+and then it's computing second layer, and then it's
computing here right away the backward

742
01:00:03,059 --> 01:00:08,130
-pass so this adult adult as the gradient
-gel to the gradient ml 1 and the
+pass. So this is the l2_delta. It's the gradient on
+l2, the gradient on l1, and the

743
01:00:08,130 --> 01:00:13,390
-gradient and this is a major update here
-so right away he's doing an update at
+gradient, and this is an update here.
+So right away he's doing an update at

744
01:00:13,389 --> 01:00:17,150
the same time as during the final piece
-of backdrop here where he formulated the
+of backprop here where he's formulating the

745
01:00:17,150 --> 01:00:22,519
-gradient on the W and right away he said
-adding 22 gradient here and some really
+gradient on the W, and right away he's
+adding to the gradient here. And so really

746
01:00:22,519 --> 01:00:24,630
-eleven lines supplies to train the
-neural network
+eleven lines suffice to train a
+neural network to do binary

747
01:00:24,630 --> 01:00:29,710
-classification the reason that this loss
-may look slightly different from what
+classification. The reason that this loss
+might look slightly different from what

748
01:00:29,710 --> 01:00:33,500
-you've seen right now is that this is a
-logistic regression loss so you saw a
+you've seen right now, is that this is a
+logistic regression loss. So you saw a

749
01:00:33,500 --> 01:00:37,159
-generalization of it which is a nice
-classifier into multiple dimensions but
+generalization of it which is softmax
+classifier into multiple dimensions. But

750
01:00:37,159 --> 01:00:40,149
updated here and you can go through this

751
01:00:40,150 --> 01:00:43,500
-in more detail by yourself but the
-logistic regression lost look slightly
+in more detail by yourself. But the
+logistic regression loss looks slightly

752
01:00:43,500 --> 01:00:50,539
-different and that's being that's inside
-there but otherwise yes this is not too
+different and that's being, that's inside
+there. But otherwise, yes, so this is not too

753
01:00:50,539 --> 01:00:55,320
-crazy of a competition and very few
-lines of code suffice actually train
+crazy of a computation, and very few
+lines of code suffice to actually train

754
01:00:55,320 --> 01:00:58,900
-these networks everything else is plus
-how do you make an official and how do
+these networks. Everything else is fluff.
+How do you make it efficient, how do

755
01:00:58,900 --> 01:01:03,019
-you there's a cross-validation pipeline
-that you need to have it all this stuff
+you... there's a cross-validation pipeline
+that you need to have and all this stuff

756
01:01:03,019 --> 01:01:07,050
that goes on top to actually give these
-large code bases but the kernel of it is
+large code bases, but the kernel of it is

757
01:01:07,050 --> 01:01:11,019
-quite simple we compute these layers
-forward pass backward pass through an
+quite simple. We compute these layers, do
+forward pass, we do backward pass, we do an

758
01:01:11,019 --> 01:01:18,840
-update when it rains but the rain is
-creating your personal initial random
+update, we keep iterating this over and over again. Go ahead. [STUDENT QUESTION]
+The random function is creating your first initial random

759
01:01:18,840 --> 01:01:24,170
-weights so you need to start somewhere
-so you generate a random W
+weights, so you need to start somewhere
+so you generate a random W.

760
01:01:24,170 --> 01:01:29,150
-now I want to mention that you'll also
+Okay. Now I wanted to mention that you'll also
be training a two-layer neural network

761
01:01:29,150 --> 01:01:32,070
-in this class so you'll be doing
+in this class, so you'll be doing
something very similar to this but

762
@@ -3775,365 +3775,365 @@
you might have different activation

763
01:01:34,949 --> 01:01:39,149
-functions but again just my advice to
-you when you implement this is staged
+functions.
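For reference, a sketch in the spirit of the eleven-line example being walked through here (the captions don't name the blog post; this reconstruction uses the same syn0/syn1 names and sigmoid layers, with the weight update fused into the backward pass as described):

```python
import numpy as np

X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])    # (4, 3) data matrix
y = np.array([[0],[1],[1],[0]])                    # (4, 1) binary labels
syn0 = 2 * np.random.random((3, 4)) - 1            # first weight matrix
syn1 = 2 * np.random.random((4, 1)) - 1            # second weight matrix
for j in range(60000):
    l1 = 1 / (1 + np.exp(-X.dot(syn0)))            # first layer (sigmoid)
    l2 = 1 / (1 + np.exp(-l1.dot(syn1)))           # second layer (sigmoid)
    l2_delta = (y - l2) * (l2 * (1 - l2))          # backward signal on l2
    l1_delta = l2_delta.dot(syn1.T) * (l1 * (1 - l1))  # backprop into l1
    syn1 += l1.T.dot(l2_delta)                     # update fused with backprop
    syn0 += X.T.dot(l1_delta)
```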
But again, just my advice to
+you when you implement this is, stage

764
01:01:39,150 --> 01:01:42,789
your computation into these intermediate
-results and then do proper
+results, and then do proper

765
01:01:42,789 --> 01:01:46,909
backpropagation into every intermediate
-result so you might have you compute
+result. So you might have, you compute

766
01:01:46,909 --> 01:01:54,460
-your computer you receive these weight
-matrices and also the biases I don't
+your... Let's see. You compute, you receive these weight
+matrices and also the biases. I don't

767
01:01:54,460 --> 01:01:59,940
-believe you have biases p.m. in your
-slot max but here you'll have biases so
+believe you have biases actually in your SVM and in your
+softmax, but here you'll have biases. So

768
01:01:59,940 --> 01:02:03,269
-take your weight matrices in the biases
-computer person later computers course
+take your weight matrices and the biases,
+compute the first hidden layer, compute your scores,

769
01:02:03,269 --> 01:02:08,429
-complete your loss and then do backward
-pass so backdrop in this course then
+compute your loss, and then do backward
+pass. So backprop into scores, then

770
01:02:08,429 --> 01:02:13,739
-backdrop into the weights at the second
-layer and backdrop into this h1 doctor
+backprop into the weights at the second
+layer, and backprop into this h1 vector,

771
01:02:13,739 --> 01:02:18,849
-and then through eight-run backdrop into
-the first weight matrices and spices do
+and then through h1, backprop into
+the first weight matrices and the first biases. Okay, so do

772
01:02:18,849 --> 01:02:22,929
-proper backpropagation here otherwise if
-you tried and right away just say what
+proper backpropagation here. Otherwise, if
+you try to right away, just say, what

773
01:02:22,929 --> 01:02:26,739
-is DWI on what is going on W one if you
+is dW1, what is the gradient on W1. If you
just try to make it a single expression

774
01:02:26,739 --> 01:02:31,099
-for it will be way too large and
-headaches so do it through a series of
+for it, it will be way too large and you'll have
+headaches. So do it through a series of

775
01:02:31,099 --> 01:02:32,619
-steps and back-propagation
+steps and back-propagation.

776
01:02:32,619 --> 01:02:36,119
-that's just a hint
+That's just a hint.

777
01:02:36,119 --> 01:02:39,940
-ok now I'd like to say that was the
+Okay. So now I'd like to, so that was the
presentation of neural networks without

778
01:02:39,940 --> 01:02:43,940
-all the bring stuff and it looks fairly
-simple so now we're going to make it
+all the brain stuff and it looks fairly
+simple. So now we're going to make it

779
01:02:43,940 --> 01:02:47,740
slightly more insane by folding in all
-kinds of like motivations mostly
+kinds of like motivations, mostly

780
01:02:47,739 --> 01:02:51,219
-historical about like how this came
-about that it's related to bring it all
+historical about like how this came
+about that it's related to brain at all.

781
01:02:51,219 --> 01:02:54,939
-and so we have neural networks and we
+And so, we have neural networks and we
have neurons inside these neural

782
01:02:54,940 --> 01:02:59,440
-networks so this is what I look like
-just what happens when you search on
+networks. So this is what neurons look like.
+This is just what happens when you search on

783
01:02:59,440 --> 01:03:03,800
-image search Iran so there you go now
+image search 'neurons'. So there you go.
Now
your actual biological neurons don't

784
01:03:03,800 --> 01:03:09,030
-look like this are currently more like
-that and so on
+look like this. Fortunately, they look more like
+that. And so a neuron,

785
01:03:09,030 --> 01:03:11,880
-just very briefly just to give you an
+just very briefly, just to give you an
idea about where this is all coming from

786
01:03:11,880 --> 01:03:17,220
-you have a cell body or so much like to
+you have the cell body or a Soma as people like to [??]
call it and it's got all these dendrites

787
01:03:17,219 --> 01:03:21,049
-that are connected to other neurons
+that are connected to other neurons. So
there's a cluster of other neurons and

788
01:03:21,050 --> 01:03:25,450
-somebody's over here and then drives are
-really these appendages that listen to
+cell bodies over here. And dendrites are
+really, these appendages that listen to

789
01:03:25,449 --> 01:03:30,869
-them so this is your inputs to in Iran
+them. So this is your inputs to a neuron,
and then it's got a single axon that

790
01:03:30,869 --> 01:03:35,839
-comes out of a neuron that carries the
-output of the competition at this number
+comes out of the neuron that carries the
+output of the computation that this neuron performs.

791
01:03:35,840 --> 01:03:40,579
-forms so usually usually have this
-neuron receives inputs if many of them
+So usually, usually you have this
+neuron, receives inputs. If many of them

792
01:03:40,579 --> 01:03:46,179
-online then this sell your own can
-choose to spike it sends an activation
+align, then this cell, this neuron can
+choose to spike. It sends an activation

793
01:03:46,179 --> 01:03:50,199
potential down the axon and then this
-actually like that were just out to
+axon, like, diverges out to

794
01:03:50,199 --> 01:03:54,659
connect to dendrites other neurons that
-are downstream so there are other
+are downstream. So there are other

795
01:03:54,659 --> 01:03:57,639
neurons here and their dendrites
-connected to the axons of these guys
+connected to the axons of these guys.

796
01:03:57,639 --> 01:04:02,299
-basically just neurons connected through
-these synapses between and we had these
+So basically, just neurons connected through
+these synapses in between and we had these

797
01:04:02,300 --> 01:04:05,840
-dendrites that Rd in particular on and
-this action on that actually carries the
+dendrites that are the input to a neuron and
+this axon that actually carries the

798
01:04:05,840 --> 01:04:10,410
-output on their own and so basically you
+output of a neuron. And so basically, you
can come up with a very crude model of a

799
01:04:10,409 --> 01:04:16,769
-neuron and it will look something like
-this we have so this is the cell body
+neuron, and it will look something like
+this. We have an axon, so this is the cell body

800
01:04:16,769 --> 01:04:20,909
-here on their own and just imagine an
-axon coming from a different neuron
+here of a neuron. And just imagine an
+axon coming from a different neuron,

801
01:04:20,909 --> 01:04:24,730
-someone at work and this neuron is
-connected to that Iran through this
+somewhere in the network, and this neuron is
+connected to that neuron through this

802
01:04:24,730 --> 01:04:29,840
-synapse and every one of these synapses
+synapse. And every one of these synapses
has a weight associated with it

803
01:04:29,840 --> 01:04:35,350
of how much this neuron likes that
-neuron basically and so actually carries
+neuron basically.
And so axon carries

804
01:04:35,349 --> 01:04:39,769
-this X it interacts in the synapse and
-they multiply and discrete model so you
+this x. It interacts in the synapse and
+they multiply in this crude model. So you

805
01:04:39,769 --> 01:04:44,989
-get W 00 flooding flowing to the summer
-and then that happens for many Iraqis
+get w0x0 flowing to the soma.
+And then that happens for many neurons

806
01:04:44,989 --> 01:04:45,849
-who have lots of
+so you have lots of

807
01:04:45,849 --> 01:04:51,500
-and puts up w times explosion and the
-cell body here it's just some offset by
+inputs of w times x flowing in. And the
+cell body here, it just performs a sum, offset by

808
01:04:51,500 --> 01:04:56,940
-bias and then if an activation function
-is met here so it passes through an
+a bias, and then if an activation function
+is met here, so it passes through an

809
01:04:56,940 --> 01:05:02,800
-activation function to actually complete
-the outfit of the sax on now in
+activation function to actually compute
+the output of this axon. Now in

810
01:05:02,800 --> 01:05:06,570
-biological models historically people
-like to use the sigmoid nonlinearity to
+biological models, historically people
+liked to use the sigmoid nonlinearity to

811
01:05:06,570 --> 01:05:11,730
-actually the reason for that is because
-you get a number between 0 and one and
+actually use for the activation function. The reason for that is because
+you get a number between 0 and 1, and

812
01:05:11,730 --> 01:05:15,420
you can interpret that as the rate at
-which this neuron inspiring for that
+which this neuron is firing for that

813
01:05:15,420 --> 01:05:19,809
-particular input so it's a rate between
-zero and one that's going through the
+particular input. So it's a rate between
+0 and 1 that's going through the

814
01:05:19,809 --> 01:05:23,889
-activation function so if this neuron is
-seen something that likes in the neurons
+activation function. So if this neuron is
+seeing something it likes, in the neurons

815
01:05:23,889 --> 01:05:27,900
-that connected to it it will start to
-spike a lot and the rate is described by
+that are connected to it, it will start to
+spike a lot, and the rate is described by

816
01:05:27,900 --> 01:05:33,139
-F off the impact oK so that's the crude
-model of neuron if I wanted to implement
+f of the input. Okay, so that's the crude
+model of the neuron. If I wanted to implement it

817
01:05:33,139 --> 01:05:38,819
-it would look something like this so and
-neuron function forward pass and receive
+it would look something like this. So a
+neuron_tick function forward pass, it receives

818
01:05:38,820 --> 01:05:44,500
-some inputs this is a vector and reform
-of the cell body so just a lawyer some
+some inputs. This is a vector and we form
+a sum at the cell body, so just a linear sum.

819
01:05:44,500 --> 01:05:49,980
-and we put the firing rate as a sigmoid
-off the Somali some and return to firing
+And we put, we compute the firing rate as a sigmoid
+of the cell body sum and return the firing

820
01:05:49,980 --> 01:05:53,579
-rate and then this can plug into
-different neurons right so you can
+rate. And then this can plug into
+different neurons, right? So you can

821
01:05:53,579 --> 01:05:56,710
-imagine you can actually see that this
+imagine, you can actually see that this
looks very similar to a linear

822
01:05:56,710 --> 01:06:02,750
-classifier radar for MIMO lehrer some
-here and we're passing through
+classifier, right?
We're forming a linear sum
+here, a weighted sum, and we're passing that through

823
01:06:02,750 --> 01:06:07,050
-nonlinearity so every single neuron in
-this model is really like a small your
+nonlinearity. So every single neuron in
+this model is really like a small linear

824
01:06:07,050 --> 01:06:11,530
-classifier but these authors plug into
-each other and they can work together to
+classifier, but these linear classifiers plug into
+each other, and they can work together to

825
01:06:11,530 --> 01:06:16,650
-do interesting things now 10 to make
-about neurons that they're very they're
+do interesting things. Now one note to make
+about neurons is that they're very, they're

826
01:06:16,650 --> 01:06:21,300
-not like biological neurons biological
-neurons are super complex so if you go
+not like biological neurons. Biological
+neurons are super complex, so if you go

827
01:06:21,300 --> 01:06:24,670
-around then you start saying that neural
-networks work like brain people are
+around and you start saying that neural
+networks work like brain, people are

828
01:06:24,670 --> 01:06:28,849
-starting to round people started firing
-at you and that's because there are
+starting to frown. People will start to frown
+at you and that's because neurons are

829
01:06:28,849 --> 01:06:33,650
-complex dynamical systems there are many
-different types of neurons they function
+complex, dynamical systems. There are many
+different types of neurons. They function

830
01:06:33,650 --> 01:06:38,550
-differently these dendrites there they
+differently. These dendrites, they
can perform lots of interesting

831
01:06:38,550 --> 01:06:42,140
-computation a good review article is in
-direct competition which I really
+computation. A good review article is
+Dendritic Computation, which I really

832
01:06:42,139 --> 01:06:46,069
-enjoyed these synapses are complex
-dynamical systems they're not just a
+enjoyed. These synapses are complex
+dynamical systems. They're not just a

833
01:06:46,070 --> 01:06:49,720
-single weight and we're not really sure
-of the brain uses rate code to
+single weight. And we're not really sure
+if the brain uses rate code to

834
01:06:49,719 --> 01:06:54,689
-communicate so very crude mathematical
-model and don't put his analogy too much
+communicate, so very crude mathematical
+model and don't push this analogy too much.

835
01:06:54,690 --> 01:06:57,960
-but it's good for a kind of like media
-articles
+But it's good for, kind of like, media
+articles,

836
01:06:57,960 --> 01:07:01,990
coming up again and again as we

837
01:07:01,989 --> 01:07:04,989
-explained that this works like a brain
-but I'm not going to go too deep into
+explained that this works like your brain.
+But I'm not going to go too deep into

838
01:07:04,989 --> 01:07:09,829
-this to go back to a question that was
-asked for there's an entire set of
+this. To go back to a question that was
+asked before, there's an entire set of

839
01:07:09,829 --> 01:07:17,559
-nonlinearities that we can choose from
-so historically signal has been used
+nonlinearities that we can choose from.
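A minimal numpy sketch of the crude neuron model and its neuron_tick forward pass described a few captions back (the weights and bias are assumed given; sigmoid plays the role of the firing-rate activation):

```python
import numpy as np

class Neuron:
    def __init__(self, weights, bias):
        self.weights = weights            # synaptic strengths w
        self.bias = bias

    def neuron_tick(self, inputs):
        """Forward pass: weighted sum at the cell body, then sigmoid."""
        cell_body_sum = np.sum(inputs * self.weights) + self.bias
        firing_rate = 1.0 / (1.0 + np.exp(-cell_body_sum))  # rate in (0, 1)
        return firing_rate
```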
+So historically, sigmoid has been used

840
01:07:17,559 --> 01:07:20,210
-quite a bit and we're going to go into
+quite a bit, and we're going to go into
much more detail over what these

841
01:07:20,210 --> 01:07:23,690
-nonlinearities are what are their trades
-tradeoffs and why you might want to use
+nonlinearities are, what are their
+tradeoffs, and why you might want to use

842
01:07:23,690 --> 01:07:27,838
-one or the other but for now just like a
-flash to mention that there are many to
+one or the other, but for now, I'd just like to
+flash to mention that there are many things to [??]

843
01:07:27,838 --> 01:07:28,579
-choose from
+choose from.

844
01:07:28,579 --> 01:07:33,940
-historically people use to 10 H as of
-2012 really became quite popular
+Historically people used to use sigmoid and tanh. As of
+2012, ReLU became quite popular.

845
01:07:33,940 --> 01:07:38,429
-it makes your networks quite a bit
-faster so right now if you want a
+It makes your networks converge quite a bit
+faster, so right now, if you wanted a

846
01:07:38,429 --> 01:07:40,429
default choice for nonlinearity

847
01:07:40,429 --> 01:07:45,679
-relew that's the current default
-recommendation and then there's a few
+use ReLU. That's the current default
+recommendation. And then there's a few, kind of a hipster

848
01:07:45,679 --> 01:07:51,489
-activation functions here and so are
-proposed a few years ago I max out is
+activation functions here. And so Leaky ReLUs were
+proposed a few years ago. Maxout is

849
01:07:51,489 --> 01:07:54,989
-interesting and very recently you lou
-and so you can come up with different
+interesting. And very recently ELU.
+And so you can come up with different

850
01:07:54,989 --> 01:07:58,319
activation functions and you can
-describe I these might work better or
+describe why these might work better or

851
01:07:58,320 --> 01:08:01,789
-not and so this is an active area of
-research is trying to go up by the
+not. And so this is an active area of
+research. It's trying to come up with these

852
01:08:01,789 --> 01:08:05,949
-activation functions that perform there
-had better properties in one way or
+activation functions that perform, that
+have better properties in one way or

853
01:08:05,949 --> 01:08:10,909
-another we're going to go into this much
-more details as soon in class but for
+another. So we're going to go into this with much
+more detail soon in the class. But for

854
01:08:10,909 --> 01:08:15,980
-now we have these morons we have a
-choice of activation function and then
+now, we have these neurons, we have a
+choice of activation function, and then

855
01:08:15,980 --> 01:08:19,259
-we runs these neurons into neural
-networks right so we just connect them
+we arrange these neurons into neural
+networks, right? So we just connect them

856
01:08:19,259 --> 01:08:23,140
-together so they can talk to each other
-and so here is an example of a what to
+together so they can talk to each other.
+And so here is an example of a

857
01:08:23,140 --> 01:08:27,170
-learn or relearn rowlett when you want
-to count the number of layers and their
+2-layer neural net or 3-layer neural net. When you want
+to count the number of layers in a

858
01:08:27,170 --> 01:08:30,829
-neural net you count the number of
-players that happened waits to hear the
+neural net, you count the number of
+layers that have weights.
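As a quick reference for the nonlinearities listed above, one-line numpy definitions (the leaky ReLU slope of 0.01 and ELU alpha of 1.0 are common defaults, not values stated in the lecture; Maxout is omitted because it combines two linear pieces rather than applying elementwise):

```python
import numpy as np

def sigmoid(x):    return 1.0 / (1.0 + np.exp(-x))
def tanh(x):       return np.tanh(x)
def relu(x):       return np.maximum(0, x)
def leaky_relu(x): return np.where(x > 0, x, 0.01 * x)        # 0.01: assumed slope
def elu(x, a=1.0): return np.where(x > 0, x, a * (np.exp(x) - 1))
```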
So here, the

859
01:08:30,829 --> 01:08:35,449
-input layer does not count as a later
-cuz there's no reason Iran's largest
+input layer does not count as a layer,
+because there's no... These neurons are just

860
01:08:35,449 --> 01:08:39,729
-single values they don't actually do any
-computation so we have two players here
+single values. They don't actually do any
+computation. So we have two layers here

861
01:08:39,729 --> 01:08:45,068
-that that have weights to learn it and
+that have weights. So it's a 2-layer net. And
we call these layers fully connected

862
01:08:45,069 --> 01:08:50,870
-layers and so that I shown you that a
-single neuron computer this little
+layers, and so, remember that I've shown you that a
+single neuron computes this little

863
01:08:50,869 --> 01:08:54,750
-weight at some and ambassador
-nonlinearity in a neural network the
+weighted sum, and then passes that through
+a nonlinearity. In a neural network, the

864
01:08:54,750 --> 01:08:58,829
reason we arrange these into layers is
-because Iranian them into layers allows
+because arranging them into layers allows

865
01:08:58,829 --> 01:09:01,759
-us to the competition much more
-efficiently so instead of having an
+us to perform the computation much more
+efficiently. So instead of having an

866
01:09:01,759 --> 01:09:04,460
amorphous blob of neurons and every one
-of them has to be computed independently
+of them has to be computed independently,

867
01:09:04,460 --> 01:09:08,699
having them in layers allows us to use
-vectorized operations and so we can
+vectorized operations. And so we can

868
01:09:08,699 --> 01:09:10,139
compute an entire set of

869
01:09:10,140 --> 01:09:14,410
neurons in a single hidden layer as just
-a single times amateurs multiply and
+a single matrix multiply. And

870
01:09:14,409 --> 01:09:17,619
that's why we arrange them in these
-layers where Iran since I deliver and
+layers, where neurons inside a layer can be

871
01:09:17,619 --> 01:09:21,119
-evaluate it completely in peril and they
-all say the same thing but it's a
+evaluated completely in parallel, and they
+all see the same input. So it's a

872
01:09:21,119 --> 01:09:25,519
computational trick to arrange them in
-leaders this is a three-layer neural net
+layers. So this is a 3-layer neural net

873
01:09:25,520 --> 01:09:30,500
-and this is how you would compute it
-just a bunch of major multiplies
+and this is how you would compute it.
+Just a bunch of matrix multiplies

874
01:09:30,500 --> 01:09:35,550
-followed by another activation followed
-by activation function as well now I'd
+followed by an activation function.
+So now I'd

875
01:09:35,550 --> 01:09:40,520
like to show you a demo of how these
-neural networks work so this is just
+neural networks work. So this is a JavaScript demo

876
01:09:40,520 --> 01:09:44,770
-grabbed a model shoot you in a bit but
-basically this is an example of a
+that I'll show you in a bit. But
+basically, this is an example of a

877
01:09:44,770 --> 01:09:50,080
-two-layer neural network classifying AP
-doing a binary classification task two
+two-layer neural network classifying a,
+doing a binary classification task. So we have two

878
01:09:50,079 --> 01:09:54,119
-closest red and green and so if these
-points in two dimensions and I'm drawing
+classes, red and green. And so we have these
+points in two dimensions, and I'm drawing

879
01:09:54,119 --> 01:09:58,109
the decision boundaries by the neural
-network and see what you can see is when
+network.
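A minimal numpy sketch of the layer-wise point just made: all neurons in a layer are evaluated with one matrix multiply, so the full 3-layer forward pass is three multiplies, each followed by the nonlinearity (sigmoid here; the layer sizes are illustrative, not from the slide):

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))   # activation function (sigmoid here)

x  = np.random.randn(3, 1)               # input vector (sizes illustrative)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

h1  = f(W1.dot(x)  + b1)                 # whole layer 1: one matrix multiply
h2  = f(W2.dot(h1) + b2)                 # layer 2
out = W3.dot(h2) + b3                    # output scores
```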
And so what you can see is, when

880
01:09:58,109 --> 01:10:01,969
-I train a neural network on this data
+I train a neural network on this data,
the more hidden neurons I have in my

881
01:10:01,970 --> 01:10:05,770
-head in later the more wiggle your
-electric cars right the more can compute
+hidden layer, the more wiggle your
+neural network has, right? The more it can compute

882
01:10:05,770 --> 01:10:12,290
-crazy functions and just show you also a
-regularization strength so this is the
+crazy functions. And just to show you also the effect of
+regularization strength. So this is the

883
01:10:12,289 --> 01:10:17,069
regularization of how much you penalize
-large W you can see that when you insist
+large W. So you can see that when you insist

884
01:10:17,069 --> 01:10:22,340
-that your WR very small you end up with
-a very smooth functions so they don't
+that your Ws are very small, you end up with
+very smooth functions, so they don't

885
01:10:22,340 --> 01:10:27,050
-have as much variance so these neural
-networks there's not as much wriggle
+have as much variance. So these neural
+networks, there's not as much wiggle

886
01:10:27,050 --> 01:10:31,090
-that they can give you and then you
-decrease the regularization these know
+that they can give you, and then as you
+decrease the regularization, these neural

887
01:10:31,090 --> 01:10:34,090
-that we can do more and more complex
-tasks so they can kind of get in and get
+networks can do more and more complex
+tasks, so they can kind of get in and get

888
01:10:34,090 --> 01:10:38,710
-these laws squeezed out points to cover
-them in the training data so let me show
+these little squeezed out points to cover
+them in the training data. So let me show

889
01:10:38,710 --> 01:10:41,489
you what this looks like

890
01:10:41,489 --> 01:10:47,079
-during training
+during training. Okay.

891
01:10:47,079 --> 01:10:53,010
-so there's some stuff to explain here
-let me first actually you can play with
+So there's some stuff to explain here.
+Let me first actually... So you can play with

892
01:10:53,010 --> 01:10:56,060
-this because it's all in javascript
+this because it's all in JavaScript.

893
01:10:56,060 --> 01:11:04,060
-alright so we're doing here as we have
-six neurons and this is a binary
+Okay. All right. So what we're doing here is we have
+six neurons, and this is a binary

894
01:11:04,060 --> 01:11:09,000
-classification there said with circle
-data and so we have a little cluster of
+classification dataset with circle
+data. And so we have a little cluster of

895
01:11:09,000 --> 01:11:13,520
-green dot separated by red dots and work
+green dots separated by red dots. And we're
training a neural network to classify

896
01:11:13,520 --> 01:11:18,080
-this dataset so if I restart the neural
-network it's just started off with the
+this dataset. So if I restart the neural
+network, it's just, starts off with a

897
01:11:18,079 --> 01:11:20,949
-random W and that it converges the
-decision boundary to actually classified
+random W, and then it converges the
+decision boundary to actually classify

898
01:11:20,949 --> 01:11:26,289
-the data showing on the right which is
-the cool part is one interpretation of
+the data.
What I'm showing on the right, which is +the cool part, this visualization, is one interpretation of 899 01:11:26,289 --> 01:11:29,529 -the neural network here is what I'm -taking that's great to hear and I'm +the neural network here, is what I'm +taking this grid here and I'm 900 01:11:29,529 --> 01:11:33,909 -showing how this space gets worked by -the neural network so you can interpret +showing how this space gets warped by +the neural network. So you can interpret 901 01:11:33,909 --> 01:11:37,619 what the neural network is doing is it's -using its hidden layer to transport your +using its hidden layer to transform your 902 01:11:37,619 --> 01:11:41,159 @@ -4466,403 +4466,403 @@ hidden layer can come in with a linear 903 01:11:41,159 --> 01:11:47,059 -classifier and classify your data so -here you see that the neural network +classifier and classify your data. So +here, you see that the neural network 904 01:11:47,060 --> 01:11:51,920 -arranges your space it works it such -that the second layer which is really a +arranges your space. It warps it such +that the second layer, which is really a 905 01:11:51,920 --> 01:11:56,779 linear classifier on top of the first -layer is can put a plane through it okay +layer, can put a plane through it, okay? 906 01:11:56,779 --> 01:11:59,939 -so it's working the space so that you -can put the plane through it and +So it's warping the space so that you +can put a plane through it and 907 01:11:59,939 --> 01:12:06,259 -separate out the points so let's look at -this again so you can really see what +separate out the points. So let's look at +this again. So you can roughly see what 908 01:12:06,260 --> 01:12:10,940 -happens gets worked for that you can -leave early classify the data this is +how this gets warped so that you can +linearly classify the data. This is 909 01:12:10,939 --> 01:12:13,569 something that people sometimes also -referred to as current trek it's +referred to as kernel trick. It's 910 01:12:13,569 --> 01:12:19,149 changing your data representation to a -space where two linearly separable ok +space where it's linearly separable. Okay. 911 01:12:19,149 --> 01:12:23,079 -now here's a question if we'd like to -separate the right now we have six +Now, here's a question. If we'd like to +separate, so right now we have six 912 01:12:23,079 --> 01:12:27,809 -neurons here and the intermediate layer +neurons here in the intermediate layer, and it allows us to separate out these 913 01:12:27,810 --> 01:12:33,580 -things so you can see actually those six -neurons roughly you can see these lines +data points. So you can see actually those six +neurons roughly. You can see these lines 914 01:12:33,579 --> 01:12:36,869 -here like they're kind of like these -functions of one of these neurons so +here, like they're kind of like these +functions of one of these neurons. So 915 01:12:36,869 --> 01:12:40,349 -here's a question for you what is the +here's a question for you, What is the minimum number of neurons for which this 916 01:12:40,350 --> 01:12:45,570 dataset is separable with a neural -network like if I want to know that work +network? If I want the neural network 917 01:12:45,569 --> 01:12:51,889 -to correctly classify this as a minimum +to correctly classify this, how many neurons do I need in the hidden layer as a minimum? [STUDENT ANSWER] 918 01:12:51,890 --> 01:13:15,270 -so into it with the way this work is 34 -so what happens with or is there is one +4? I heard some 3s, some 4s. Binary search. 
So intuitively, the way this would work is, let's see 4. +So what happens with 4 is, there is one 919 01:13:15,270 --> 01:13:18,910 -around here that went from this way to -that way this way to that way this way +neuron here that went from this way to +that way, this way to that way, this way 920 01:13:18,909 --> 01:13:22,689 -to that way there's more neurons that -are cutting up this plane and then +to that way. There's four neurons that +are cutting up this plane. And then 921 01:13:22,689 --> 01:13:27,039 -there's an additional layer that's a -weighted sum so in fact the lowest +there's an additional layer that's doing a +weighted sum. So in fact, the lowest 922 01:13:27,039 --> 01:13:34,739 -number here what would be three which -would work so with three neurons ok so +number here would be three, which +would work. So with three neurons... So 923 01:13:34,739 --> 01:13:39,189 -one plane second plane airplane so three -linear functions within the linearity +one plane, second plane, third plane. So three +linear functions with a nonlinearity, 924 01:13:39,189 --> 01:13:45,649 and then you can basically with three -lines you can carve out the space so +lines, you can carve out the space so 925 01:13:45,649 --> 01:13:52,429 -that the second layer can just combined -them when their numbers are 102 +that the second layer can just combine +them when their numbers are 1 and not 0. [STUDENT QUESTION] 926 01:13:52,430 --> 01:13:57,850 -certainly donate to this will break -because two lines are not enough I +At two? Certainly. So at two, this will break +because two lines are not enough. I 927 01:13:57,850 --> 01:14:03,900 -suppose this work something very good -here so with to basically it will find +suppose this works something [??] Not going to look very good +here. So with two, basically it will find 928 01:14:03,899 --> 01:14:07,239 the optimum way of just using these two -lines they're kind of creating this +lines. They're kind of creating this 929 01:14:07,239 --> 01:14:14,599 -tunnel and that the best you can do +tunnel and that's the best you can do. Okay? [STUDENT QUESTION] 930 01:14:14,600 --> 01:14:31,300 -I think if I was using rather I think -there would be much surrealism and I +The curve, I think... Which nonlinearity am I using? tanh? Yeah, I'm not sure exactly how that works out. If I was using ReLU, I think it would be much, so ReLU is the... Let me change to ReLU, and I 931 01:14:31,300 --> 01:14:50,460 -think you'd see sharp boundaries yeah -you can do for now let's do it because +think you'd see sharp boundaries. Yeah. +Yes, this is three. You can do four. So let's do... [STUDENT QUESTION] Yeah, that's because 932 01:14:50,460 --> 01:14:52,130 -some of these parts +because some of these parts 933 01:14:52,130 --> 01:14:58,119 -there's more than one of those revenues -are active and so you end up with there +there's more than one of those ReLUs +are active, and so you end up with... there 934 01:14:58,119 --> 01:15:02,359 -are really three lines I think like 123 -but then in some of the corners to revel +are really three lines. I think like one, two, three, +but then in some of the corners two ReLU 935 01:15:02,359 --> 01:15:05,689 -in your eyes are active and so these -weights will have its kind of funky you +neurons are active and so these +weights will add up. It's kind of funky. You 936 01:15:05,689 --> 01:15:12,649 -have to think about it but ok so let's -look at say twenty here so change to 20 +have to think about a bit. But okay. So let's +look at, say, twenty here. 
So I changed to twenty 937 01:15:12,649 --> 01:15:16,670 -so we have lots of space there and let's -look at different assets like a spiral +so we have lots of space there, and let's +look at different datasets like say spiral. 938 01:15:16,670 --> 01:15:22,390 -you can see how this thing just as I'm -doing this update will just go in there +So you can see how this thing just, as I'm +doing this update, it will just go in there 939 01:15:22,390 --> 01:15:32,800 -and figure that out very simple data -that is not my own circle and then ran +and figure that out. Very simple data +that is not. Spiral. Circle, and then random 940 01:15:32,800 --> 01:15:39,880 -him down so you could kind of goes in -there and it's like covers up the green +so random data, so you could, kind of goes in [??] +there, like covers up the green 941 01:15:39,880 --> 01:15:48,039 -lawns and the red ones and yeah and with -fewer say like I'm going to break this +ones and the red ones. And yeah. And with +fewer, say like five... I'm going to break this 942 01:15:48,039 --> 01:15:54,890 -now I'm not going to go with five yes -this will start working worse and worse +now. I'm not going to... Okay. So with five... Yes. + So this will start working worse and worse 943 01:15:54,890 --> 01:15:58,770 because you don't have enough capacity -to separate out this data so you can +to separate out this data. So you can 944 01:15:58,770 --> 01:16:05,270 -play with this in your free time and so -as a summary +play with this in your free time. Okay. And so +as a summary, 945 01:16:05,270 --> 01:16:10,690 -we arrange these neurons and neural -networks into political heirs +we arrange these neurons in neural +networks into fully connected layers. 946 01:16:10,689 --> 01:16:14,579 -look at that crop and how this gets -changing competition graphs and they're +We've looked at backprop and how this gets +chained in computational graphs. And they're 947 01:16:14,579 --> 01:16:19,149 -not really neural and as you'll see soon -the bigger the better and we'll go into +not really neural. And as you'll see soon, +the bigger the better, and we'll go into 948 01:16:19,149 --> 01:16:28,210 -that a lot I want to take questions -before I am just sorry questions we have +that a lot. I want to take questions +before I end. Just sorry. Were there any questions? Go ahead. We have 949 01:16:28,210 --> 01:16:29,359 -two more minutes +two more minutes. Sorry. 950 01:16:29,359 --> 01:16:36,899 -yes thank you +Yes, thank you. 951 01:16:36,899 --> 01:16:41,119 -so is it always better to have more -neurons and neural network the answer to +So is it always better to have more +neurons in your neural network? The answer to 952 01:16:41,119 --> 01:16:48,809 -that is yes more is always better it's -usually competition constraint so more +that is yes. More is always better. It's +usually computational constraint, so more will 953 01:16:48,810 --> 01:16:52,510 -always work better but then you have to -be careful to regularize it properly so +always work better, but then you have to +be careful to regularize it properly. So 954 01:16:52,510 --> 01:16:55,810 -the correct way to constrain you're not -worked over put your data is not by +the correct way to constrain your neural +network to not overfit your data is not by 955 01:16:55,810 --> 01:16:58,940 -making the network smaller the correct +making the network smaller. 
The correct
 way to do it is to increase the

956
01:16:58,939 --> 01:17:03,079
-regularization so you always want to use
-as larger network as you want but then
+regularization. So you always want to use
+as large a network as you want, but then

957
01:17:03,079 --> 01:17:06,269
 you have to make sure to properly
-regulate rise it but most of the time
+regularize it. But most of the time

958
01:17:06,270 --> 01:17:09,920
-because competition reasons why I don't
-have time to wait forever to train our
+because of computational reasons, you have a finite
+amount of time, you don't want to wait forever to train your

959
01:17:09,920 --> 01:17:19,980
-networks use smaller ones for practical
-reasons question arises equally
+networks. You'll use smaller ones for practical
+reasons. Question? [STUDENT QUESTION] Do you regularize each layer equally.

960
01:17:19,979 --> 01:17:25,509
-usually you do as a simplification you
-yeah most of the often when you see
+Usually you do as a simplification.
+Yeah. Most of the, often when you see

961
01:17:25,510 --> 01:17:28,030
-networks trained in practice they will
-be regularized the same way throughout
+networks get trained in practice, they will
+be regularized the same way throughout.

962
01:17:28,029 --> 01:17:33,809
-but you don't have to necessarily
+But you don't have to necessarily. Go ahead.

963
01:17:33,810 --> 01:17:40,500
-is anybody using secondary option in
-optimizing networks there is value
+Is there any value to use in second derivatives using hashing in [??]
+optimizing neural networks? There is value

964
01:17:40,500 --> 01:17:44,859
-sometimes when your data sets are small
-you can use things like lbs which I
+sometimes when your data sets are small.
+You can use things like L-BFGS which I

965
01:17:44,859 --> 01:17:47,729
-don't go into too much and it's the
-second order method but usually the data
+didn't go into too much, and that's a
+second order method, but usually the datasets

966
01:17:47,729 --> 01:17:50,500
-sets are really large and that's when
-I'll get you it doesn't work very well
+are really large and that's when
+L-BFGS doesn't work very well.

967
01:17:50,500 --> 01:17:57,039
-so you when you millions of the up with
-you can't do lbs for ya and LBJ is not
+So when you have millions of data points,
+you can't do L-BFGS for various reasons. Yeah. And L-BFGS is not

968
01:17:57,039 --> 01:18:01,970
-very good with many batch you always
-have to fall back by default
+very good with minibatch. You always
+have to do full batch by default. Question.

969
01:18:01,970 --> 01:18:16,650
-like how do you allocate not a good
-answer for that unfortunately so you
+[STUDENT QUESTION] So what is the tradeoff between depth and size roughly, like how do you allocate? Not a good
+answer for that unfortunately. So you

970
01:18:16,649 --> 01:18:20,899
-want a depth is good but maybe after
-like ten layers may be a simple data
+want, depth is good, but maybe after
+like ten layers maybe, if you have a simple dataset

971
01:18:20,899 --> 01:18:25,219
-said it's not really adding too much in
-one minute so I can still take some
+it's not really adding too much. We have
+one more minute so I can still take some

972
01:18:25,220 --> 01:18:35,990
-questions you have a question for the
+questions. You had a question. [??] 
[STUDENT QUESTION] Yeah, so the tradeoff between where do I allocate my 973 01:18:35,989 --> 01:18:40,019 -capacity to I want us to be deeper or do -I want it to be wider not a very good +capacity, do I want us to be deeper or do +I want it to be wider, not a very good 974 01:18:40,020 --> 01:18:47,860 -answer to that yes usually especially -with images we find that more layers are +answer to that. [STUDENT QUESTION] Yes, usually, especially +with images, we find that more layers are 975 01:18:47,859 --> 01:18:51,199 -critical but sometimes when you have -simple tastes like to do you are some +critical. But sometimes when you have +simple datasets like 2D or some 976 01:18:51,199 --> 01:18:55,359 other things like depth is not as -critical and so it's kind of slightly +critical, and so it's kind of slightly 977 01:18:55,359 --> 01:19:01,670 -data dependent +data dependent. We had a question over there. [STUDENT QUESTION] 978 01:19:01,670 --> 01:19:10,050 -different for different layers that -health usually it's not done usually +Different activation functions for different layers, does that +help? Usually it's not done. Usually we 979 01:19:10,050 --> 01:19:15,960 -just gonna pick one and go with it -that's for example will also see the +just gonna pick one and go with it. So say, for ConvNets +for example, we'll see that 980 01:19:15,960 --> 01:19:19,279 -most of them are changes with others and +most of them are [??] with ReLUs. And so you just use that throughout and 981 01:19:19,279 --> 01:19:22,389 -there's no real benefit to to switch -them around people don't play with that +there's no real benefit to switch +them around. People don't play with that 982 01:19:22,390 --> 01:19:26,660 -too much on principle you there's -nothing preventing you are so it is 420 +too much, but in principle, there's +nothing preventing you. So it is 4:20, 983 01:19:26,659 --> 01:19:29,789 -so we're going to end here but we'll see -a lot more neural networks so a lot of +so we're going to end here, but we'll see +a lot more neural networks, so a lot of 984 01:19:29,789 --> 01:19:31,238 -these questions will go through them +these questions we'll go through them. 
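
The corrected captions above repeatedly describe a fully connected layer as one matrix multiply followed by an elementwise nonlinearity. As an illustrative aside, not taken from the lecture materials themselves, a minimal NumPy sketch of the 2-layer forward pass being described, with made-up layer sizes, could look like this:

~~~python
import numpy as np

# Hypothetical sizes: a 3072-d input (e.g. a flattened 32x32x3 image),
# 100 hidden neurons, and 10 output scores.
x = np.random.randn(3072)               # input vector
W1 = 0.01 * np.random.randn(100, 3072)  # first-layer weights
W2 = 0.01 * np.random.randn(10, 100)    # second-layer weights

h = np.maximum(0, W1.dot(x))  # hidden layer: one matrix multiply, then ReLU
scores = W2.dot(h)            # output layer: a linear classifier on top of h
~~~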
+ From 48c256d9fa52cd60cbf9b08746efc2510283d217 Mon Sep 17 00:00:00 2001 From: Jihoon Lee Date: Fri, 20 May 2016 22:04:46 +0200 Subject: [PATCH 138/199] update old works --- captions/En/Lecture3_en.srt | 80 +++++++++++++------------- captions/Ko/Lecture3_ko.srt | 112 +++++++++++++++++++----------------- 2 files changed, 97 insertions(+), 95 deletions(-) diff --git a/captions/En/Lecture3_en.srt b/captions/En/Lecture3_en.srt index e708148a..64d625d0 100644 --- a/captions/En/Lecture3_en.srt +++ b/captions/En/Lecture3_en.srt @@ -10,7 +10,7 @@ administrative things first 3 00:00:09,429 --> 00:00:12,859 -just as a reminder of the person Simon +just as a reminder of the first assignment is due on next Wednesday so you have 4 @@ -31,7 +31,7 @@ of course he also have some late two 7 00:00:25,920 --> 00:00:29,960 days that you can use and allocate among -your silence as you see fit +your assignment as you see fit 8 00:00:29,960 --> 00:00:35,149 @@ -51,11 +51,11 @@ we're talking about the fact that this 11 00:00:42,950 --> 00:00:45,780 is actually a very difficult problem -right so he just consider the cross +right so you just consider the cross 12 00:00:45,780 --> 00:00:50,829 -product called the possible variations +product of all the possible variations that we have to be robust to when we 13 @@ -65,7 +65,7 @@ as cat just seems like such an 14 00:00:54,198 --> 00:00:58,049 -intractable and possible problem and not +intractable and impossible problem and not only do we know how to solve these 15 @@ -81,11 +81,11 @@ at human accuracy or even slightly 17 00:01:05,859 --> 00:01:11,829 surpassing it and some of those classes -and it's also runs nearly in real kind +and it's also runs nearly in real-time 18 00:01:11,829 --> 00:01:16,539 -of your phone and so basically and all +on your phone and so basically and all of this also happened in the last three 19 @@ -100,12 +100,12 @@ exciting oK so that's the problem of 21 00:01:23,609 --> 00:01:27,140 -classification of a commission we talked -specifically about the data German +classification of image recognition we talked +specifically about the data-driven 22 00:01:27,140 --> 00:01:30,450 -approaching the fact that we can't just +approach the fact that we can't just explicitly hardcode these classifiers so 23 @@ -120,17 +120,17 @@ having the validation splits where we 25 00:01:37,188 --> 00:01:41,408 -just had our hyper parameters and a test +just test out our hyper parameters and a test that that you don't touch too much we 26 00:01:41,409 --> 00:01:44,810 look specifically at the example of the -nearest neighbor classifier and someone +nearest neighbor classifier and some more 27 00:01:44,810 --> 00:01:48,618 -and the canyons neighbor classifier and +and K nearest neighbor classifiers and I talked about the secret India said 28 @@ -145,22 +145,21 @@ paratroop approach which is really that 30 00:01:58,438 --> 00:02:03,639 -we're writing a function from image -directly to the tennis courts that have +we're writing a function F from image +directly to the raw 10 scores if you have 10 classes 31 00:02:03,640 --> 00:02:07,618 -10 closest and the spermatic formerly -seem to be a long year for us so we just +And this parameteric form seem to be linear first. 32 00:02:07,618 --> 00:02:11,520 -have equal WX and we talked about the -interpretations of this linear +So we just have F=Wx. 
and we talked about the +interpretations of this linear classifer 33 00:02:11,520 --> 00:02:12,850 -classifier the fact that you can +the fact that you can 34 00:02:12,849 --> 00:02:16,039 @@ -170,16 +169,15 @@ that you can interpret it as these 35 00:02:16,039 --> 00:02:18,449 images being in the very -high-dimensional space and arlen your +high-dimensional space and our linear classifer 36 00:02:18,449 --> 00:02:23,560 -class partner kind of going in and -coloring this space my class course so +kind of going in and coloring this space by class course 37 00:02:23,560 --> 00:02:28,740 -to speak and so by the end of the class +so to speak. and so by the end of the class we got to this picture where we suppose 38 @@ -194,32 +192,32 @@ classes say 10 classes and support n and 40 00:02:36,530 --> 00:02:40,740 -basically this function assigning scores +basically this function f() is assigning scores for every single one of these images 41 00:02:40,740 --> 00:02:44,510 -with some particular setting off weights -which have chosen randomly here we got +with some particular setting of weights +which have chosen randomly here 42 00:02:44,509 --> 00:02:47,939 -some scores out and so some of these -results are good and some of them are +we get some scores out and so some of these +results are good and some of them are bad 43 00:02:47,939 --> 00:02:51,419 -bad so if you inspect this course for -example in the first image you can see +so if you inspect this scores, for example, +in the first image you can see that 44 00:02:51,419 --> 00:02:55,509 -that the correct class or just cat got a +the correct class which is cat got a score of 2.9 and that's kind of in the 45 00:02:55,509 --> 00:03:00,060 -middle so some some classes he received +middle so some some classes here received a higher score which is not very good 46 @@ -230,31 +228,31 @@ which is good for that particular image 47 00:03:03,289 --> 00:03:09,019 the car was very well classified because -the car was much higher than all of the +class score of car was much higher than all of the other ones 48 00:03:09,020 --> 00:03:12,980 -other ones and the Frog was enough -durable classified all right so we have +And the Frog was not very well classified at all. +right? 49 00:03:12,979 --> 00:03:18,199 -this notion that four different weights +So we had this notion that for different weights these different weights work better or 50 00:03:18,199 --> 00:03:21,389 worse on different images and of course -we're trying to find a way it's that +we're trying to find weights that 51 00:03:21,389 --> 00:03:26,209 -give us course that are consistent with -all the ground truth labels labels and +give us scores that are consistent with +all the ground truth labels. All the labels and data. 52 00:03:26,210 --> 00:03:30,490 -the data and so what we're going to do +And so what we're going to do now is so far with only I believe what I 53 diff --git a/captions/Ko/Lecture3_ko.srt b/captions/Ko/Lecture3_ko.srt index 4ad4589f..b9855c1c 100644 --- a/captions/Ko/Lecture3_ko.srt +++ b/captions/Ko/Lecture3_ko.srt @@ -1,206 +1,210 @@ 1 00:00:00,000 --> 00:00:05,400 - 그래서 우리는 손실 함수에 재료의 일부에 오늘 도착하기 전에 +오늘 수업 내용인 Loss function Optimiyation을 시작하기에 앞서서 2 00:00:05,400 --> 00:00:09,429 - 최적화 내가 먼저 일부 관리 일을 통해 가고 싶어 +몇가지 공지할 사항들이 있습니다. 3 00:00:09,429 --> 00:00:12,859 - 당신은 그래서 그냥 사람의 신호로 시몬은 다음 주 수요일에 기인한다 +첫번째 숙제 기한이 다음주 수요일까지입니다. 
4 00:00:12,859 --> 00:00:18,100 +약 9일정도 남아있구요 다음주 월요일은 휴일이기떄문에 약 구일 왼쪽 단지 경고로 월요일이 것 때문에 휴일입니다 5 00:00:18,100 --> 00:00:23,050 - 근무 시간에 더 클래스는, 그래서 확인하기 위해 따라 시간을 계획하지 +수업과 오피스 아워가 없습니다. 이에 맞춰서 계획해서 6 00:00:23,050 --> 00:00:25,920 - 당신은 그가 또한 일부 늦게이이 과정의 시간에 과제를 완료 할 수 있습니다 +숙제를 제 시간안에 끝내길 바라구요. 7 00:00:25,920 --> 00:00:29,960 - 당신이 사용하고 맞는 볼로 침묵 사이에서 할당 할 수있는 일 +Late day 룰을 숙제기한들에 맞춰 잘 사용하길 바랍니다. 8 00:00:29,960 --> 00:00:35,149 - 확인을 재료로 그래서 다이빙 먼저 나는 우리가 어디 당신을 생각 나게하고 싶습니다 +이제 수업을 시작합시다. 첫번째로 어디까지 진행했는지 보면.. 9 00:00:35,149 --> 00:00:39,100 - 현재 마지막으로 우리는이 문제를 시각적 인식으로 바라 보았다 +저번 시간에 이 시각 인식(Visual Recognition) 문제 중 10 00:00:39,100 --> 00:00:42,950 - 특히 이미지 분류에서 우리는이 사실에 대해 얘기 +이미지 분류법(Image classificatino)을 보았고, 이 문제가 실제로는 11 00:00:42,950 --> 00:00:45,780 - 그는 단지 십자가를 고려 바로 때문에 매우 어려운 문제가 실​​제로 +매우 어려운 문제였습니다. 모든 변화의 외적(cross product) 계산이 12 -00:00:45,780 --> 00:00:50,829 - 제품은 우리가 때를에 강력한해야 가능한 변화라고 +00:00:45,780 --> 00:00:54,198 +고양이와 같은 카테고리를 Robust하게 분류하는데 필요했던 것을 고려한다면 13 -00:00:50,829 --> 00:00:54,198 - 고양이 그냥 그런 것 같아 같은 이러한 범주 중 하나를 인식 - -14 00:00:54,198 --> 00:00:58,049 - 다루기 힘든 가능한 문제가 아니라 단지 우리가 이것들을 해결하는 방법을 알고 +풀기에 매우 어려운 문제처럼 보였지만, 이제는 15 00:00:58,049 --> 00:01:02,108 - 문제는 지금 그러나 우리는 종류의 수천을 위해이 문제를 해결할 수 있으며, +수천개의 카테고리르 위한 문제도 풀 수 있고 16 00:01:02,109 --> 00:01:05,859 - 당해 방법의 상태는 거의 인간의 정밀도 혹은 약간 작동 +최신의 기법들은 거의 사람의 정확도와 비슷하거나 17 00:01:05,859 --> 00:01:11,829 - 그것은 그 클래스의 일부를 돌파하고 또한 리얼 종류 거의 실행있어 +심지어 더 좋은 경우도 있습니다. 그리고 거의 실시간(nearly in time)으로 18 00:01:11,829 --> 00:01:16,539 - 휴대 전화 등 기본적으로이 모든의 또한 마지막 세에서 일어난 +전화기 수준의 기기에서 동작합니다. 이 모든 일들은 지난 3년간이루어졌고 19 00:01:16,540 --> 00:01:19,790 - 이 모든에 클래스의 말에 세 또한 수 있습니다 전문가 +이 코스 후에는 학생들은 모두 이 기술에 대한 전문가가 될 것입니다. 20 00:01:19,790 --> 00:01:23,609 - 기술 그 문제 그래서 정말 시원하고 확인을 흥분 때문에 +정말 멋지고 기대되는 일입니다. OK. 21 00:01:23,609 --> 00:01:27,140 - 위원회의 분류 우리는 데이터 독일어에 대해 구체적으로 이야기 +이것은 이미지 인식 분류문제입니다. 우리는 데이터 기반 접근법(Data-driven approach)에 22 00:01:27,140 --> 00:01:30,450 - 우리가 명시 적으로 이러한 분류를 하드 수 없다는 사실에 접근 +대해서 이야기했습니다. 이 분류기(Classifier)는 명시적으로 Hard-code할 수 없기떄문에 23 00:01:30,450 --> 00:01:34,100 - 우리는 실제로 다나에서 그들을 훈련을해야하고 그래서 우리의 생각 보았다 +데이터를 이용해서 분류기(Classifier)를 학습시켜야합니다. 그래서 다른 트레이닝 데이터와 24 00:01:34,099 --> 00:01:37,188 - 다른 갖는 유효성을 갖는 훈련 데이터는 어디 분할 우리 +Hyperparameter를 테스트 할수 있는 검증 테이터를 갖는 방법들, 그리고 25 00:01:37,188 --> 00:01:41,408 - 우리의 하이퍼 매개 변수와 너무 많이 만지지 않도록하는 테스트를했다 우리 +많이 건드릴 일이없는 테스트 셋에 대해서 보았습니다. 26 00:01:41,409 --> 00:01:44,810 - 가장 가까운 이웃 분류 누군가의 예에서 구체적으로 보면 +구체적으로 Nearest Neighbor Classifier와 27 00:01:44,810 --> 00:01:48,618 - 그리고 협곡 이웃 분류와 나는 비밀 인도에 대해 이야기했다 +몇몇의 K-NN Classifer의 예를 보았습니다. 28 00:01:48,618 --> 00:01:52,938 - 이는 우리의 도요타는 우리가 내가 소개 한 후이 수업 시간에 재생했다 +그리고 수업시간에 이야기했던 두개의 데이터 셋에 대해서 이야기했습니다. 29 00:01:52,938 --> 00:01:58,438 - 정말로, 즉 I가 공수 방식 불리는이 방법의 아이디어 +이후에 Paratroop(???) 접근이라고 붙인 접근법의 아이디어를 이야기했습니다. 30 00:01:58,438 --> 00:02:03,639 - 우리는 바로이 테니스 코트에 이미지에서 함수를 작성하는 +단순히 원래의 이미지로부터 Score를 그대로 가져오는 f()를 만듭니다. +10개의 클래스가 있으면 10개의 스코어를 가져옵니다. 31 00:02:03,640 --> 00:02:07,618 - (10)에 가장 가까운과 정액은 이전에 우리 단지를 우리에게 긴 해가 될 것 같다 +그래서 이 Parametric form은 우선 Linear한것처럼 보이게 됩니다. 32 00:02:07,618 --> 00:02:11,520 - 동일 WX 가지고 우리는이 선형의 해석에 대해 이야기 +그래서 F=Wx를 갖게 됩니다. 그리고 이 Linear Classifer에 대해서 +분석을 이전에 이야기했었는데. 
33 00:02:11,520 --> 00:02:12,850 - 당신이 할 수있는 그 사실을 분류 +실제로 Linear Classifer를 Matching Template으로 분석해도 되고 34 00:02:12,849 --> 00:02:16,039 - 일치하는 템플릿으로 해석하거나 이들로 해석 할 수 +아니면 상위 차원 공간에 있는 이미지들을 상상하고 35 00:02:16,039 --> 00:02:18,449 - 및 매우 높은 차원 공간에있는 이미지를 사용자의 알렌 +Linear Classifer가 이 공간 안에 들어가서 36 00:02:18,449 --> 00:02:23,560 - 클래스 파트너 종류의가는 그래서이 공간 내 수업 과정을 착색 +Class score에 맞춰서 색칠한다고 생각해도 됩니다. 37 00:02:23,560 --> 00:02:28,740 - 클래스의 말에 그렇게 말을하고 우리는 우리가 가정이 사진에 도착 +음.. 그래서 이전 수업 마지막에 이 사진들까지 왔습니다. 38 00:02:28,740 --> 00:02:32,240 - 우리는 여기에서 훈련 예 훈련 데이터 세트들을 단지 세 개의 이미지가 +Training data set에서 이 세장의 사진을 위와 같은 열과 함께 39 00:02:32,240 --> 00:02:36,530 - 우리가 가지고있는 열을 따라 몇 가지 클래스 10 클래스와 지원 n은 말과 +10개의 Class를 가지고 있다고 봅시다. 40 00:02:36,530 --> 00:02:40,740 - 기본적으로이 기능은 이러한 이미지의 모든 하나 하나에 대해 점수를 할당 +기본적으로이 이 함수 f()는 모든 한장한장의 이미지에 Score를 주게됩니다. 41 00:02:40,740 --> 00:02:44,510 - 여기에 무작위로 선택한 일부 특정 설정 해제 무게를 우리는 가지고 +무작위로 선정된 몇개의 Weight 세팅들과 함께요. 42 00:02:44,509 --> 00:02:47,939 - 아웃 등 일부 점수 결과 중 일부는 좋은 그들 중 일부는 +그러면 몇개의 좋고 나쁜 score들을 얻게됩니다. 43 00:02:47,939 --> 00:02:51,419 - 첫 번째 이미지 예를 들어,이 과정을 검사하면 당신이 볼 수 있도록 나쁜 +첫 번째 이미지를 예를들면 44 00:02:51,419 --> 00:02:55,509 - 올바른 클래스 또는 그냥 고양이는 2.9의 점수를 얻었고, 그것은에 가지 있다고 +올바른 Class인 고양이 Class는 애매한 2.9점을 받았고 45 00:02:55,509 --> 00:03:00,060 - 중간 그래서 일부 일부 클래스는 그는 매우 좋지 않다 높은 점수를받은 +몇몇 Class들이 고양이 Class보다 더 높은 점수를 받았습니다. +(높으면 원래 안되는거죠?) 46 00:03:00,060 --> 00:03:03,289 - 일부 클래스는 특정 이미지에 좋은 훨씬 낮은 점수를받은 +그리고 몇몇 Class들은 고양이에 비해 많이 낮은 점수를 받았습니다. +(이건 특정 이미지들에게 좋은 징후입니다.) 47 00:03:03,289 --> 00:03:09,019 - 차가 모두보다 높은 때문에 자동차 잘 분급 +두번쨰 사진인 자동차는 아주 잘 분류되었습니다. +다른 이미지들에 비해서 자동차 점수가 아주 높죠? 48 00:03:09,020 --> 00:03:12,980 - 다른 사람과 개구리는 내구성이 충분히 잘 그래서 우리는 모두 분류했다 +세번째인 사진인 개구리는 분류에 실패했습니다. 그렇죠? 49 00:03:12,979 --> 00:03:18,199 - 이 개념 네 가지 무게 서로 다른 가중치가 작동하는지 더 나은 또는 +이처럼 다른 Weight들은 여러 이미지들에게 +좋게 적용될수도 있고 나쁘게 적용될수 있다는걸 알았습니다. 50 00:03:18,199 --> 00:03:21,389 - 다른 이미지에 물론 더 우리는이 것을의 방법을 찾기 위해 노력하고 +그리고 알다시피 우리가 찾고자하는 것은 +모든 Ground Truth Label들과 일치하는 점수를 주는 51 00:03:21,389 --> 00:03:26,209 - 모든 지상 진실은 라벨 레이블과 함께 우리에게 일치 과정을 제공 +모든 라벨과 데이터들을 잘 분류할수 있는 Weight입니다. 
52 00:03:26,210 --> 00:03:30,490 From 06d34e4e04be22e5ebc122559495cde854928d8e Mon Sep 17 00:00:00 2001 From: Jihoon Lee Date: Fri, 20 May 2016 22:15:38 +0200 Subject: [PATCH 139/199] apply comments from old PR --- captions/En/Lecture3_en.srt | 8 ++++---- captions/Ko/Lecture3_ko.srt | 6 +++--- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/captions/En/Lecture3_en.srt b/captions/En/Lecture3_en.srt index 64d625d0..f93fe7e0 100644 --- a/captions/En/Lecture3_en.srt +++ b/captions/En/Lecture3_en.srt @@ -131,7 +131,7 @@ nearest neighbor classifier and some more 27 00:01:44,810 --> 00:01:48,618 and K nearest neighbor classifiers and -I talked about the secret India said +I talked about the CIFAR-10 dataset 28 00:01:48,618 --> 00:01:52,938 @@ -141,7 +141,7 @@ with during this class then I introduced 29 00:01:52,938 --> 00:01:58,438 the idea of this approach that I termed -paratroop approach which is really that +parametric approach which is really that 30 00:01:58,438 --> 00:02:03,639 @@ -187,8 +187,8 @@ dataset them just three images here 39 00:02:32,240 --> 00:02:36,530 -along the columns and we have some -classes say 10 classes and support n and +along the columns and we have some classes say +10 classes in CIFAR-10 40 00:02:36,530 --> 00:02:40,740 diff --git a/captions/Ko/Lecture3_ko.srt b/captions/Ko/Lecture3_ko.srt index b9855c1c..0ab59137 100644 --- a/captions/Ko/Lecture3_ko.srt +++ b/captions/Ko/Lecture3_ko.srt @@ -105,11 +105,11 @@ Hyperparameter를 테스트 할수 있는 검증 테이터를 갖는 방법들, 28 00:01:48,618 --> 00:01:52,938 -그리고 수업시간에 이야기했던 두개의 데이터 셋에 대해서 이야기했습니다. +그리고 수업시간에 이야기했던 CIFAR-10 데이터 셋에 대해서 이야기했습니다. 29 00:01:52,938 --> 00:01:58,438 -이후에 Paratroop(???) 접근이라고 붙인 접근법의 아이디어를 이야기했습니다. +이후에 Parametric Approach이라고 붙인 접근법의 아이디어를 이야기했습니다. 30 00:01:58,438 --> 00:02:03,639 @@ -151,7 +151,7 @@ Training data set에서 이 세장의 사진을 위와 같은 열과 함께 39 00:02:32,240 --> 00:02:36,530 -10개의 Class를 가지고 있다고 봅시다. +CIFAR-10내의 10개의 Class를 가지고 있다고 봅시다. 40 00:02:36,530 --> 00:02:40,740 From 0ff6bb432e426bcbdd31732864a679ed1e470a9c Mon Sep 17 00:00:00 2001 From: jung_hojin Date: Mon, 23 May 2016 01:58:50 +0900 Subject: [PATCH 140/199] Fill up empty slots --- captions/En/Lecture4_en.srt | 118 ++++++++++++++++++------------------ 1 file changed, 58 insertions(+), 60 deletions(-) diff --git a/captions/En/Lecture4_en.srt b/captions/En/Lecture4_en.srt index 9f2e678c..b3889d59 100644 --- a/captions/En/Lecture4_en.srt +++ b/captions/En/Lecture4_en.srt @@ -1324,7 +1324,7 @@ So to continue backpropagation 269 00:21:12,259 --> 00:21:20,000 here and apply chain rule, we would -receive [STUDENT ANSWER] Okay, so these are most of the rhetorical questions so I'm +receive... (Student is asking question) Okay, so these are most of the rhetorical questions so I'm 270 00:21:20,000 --> 00:21:25,119 @@ -1476,11 +1476,11 @@ through that multiply operation 300 00:23:55,940 --> 00:24:06,450 so what will be the gradient for w0 and x0? -What will be the gradient for w0 specfically? +What will be the gradient for w0 specifically? 301 00:24:06,450 --> 00:24:19,059 -[STUDENT ANSWER] Someone say 0? 0 will be wrong. It will be, so the gradient w1 will be, w0 sorry, will be +Someone say 0? 0 will be wrong. It will be, so the gradient w1 will be, w0 sorry, will be 302 00:24:19,059 --> 00:24:24,389 @@ -1550,12 +1550,12 @@ means apply chain rule many many times 315 00:25:14,150 --> 00:25:21,720 and we'll see how that is implemented in a bit. -Sorry, did you have a question? [STUDENT QUESTION] +Sorry, did you have a question? 
(Student is asking question) 316 00:25:21,720 --> 00:25:31,769 Oh yes, so I'm going to skip that because it's the same. -So I'm going to skip the other '*' gate. Any other questions at this point? [STUDENT QUESTION] +So I'm going to skip the other '*' gate. Any other questions at this point? (Student is asking question) 317 00:25:31,769 --> 00:25:45,869 @@ -1718,7 +1718,7 @@ into single gates if it's very efficient 349 00:27:55,829 --> 00:28:06,819 or easy to derive the local gradients -because then those become your pieces. [STUDENT QUESTION] +because then those become your pieces. (Student is asking question) 350 00:28:06,819 --> 00:28:10,529 @@ -1892,13 +1892,13 @@ the fact that it's not actually 384 00:30:26,960 --> 00:30:39,150 -nevermind about that part. Go ahead. [STUDENT QUESTION] So your +nevermind about that part. Go ahead. (Student is asking question) So your question is what happens if the two 385 00:30:39,150 --> 00:30:53,470 inputs are equal when you go through max -gate. Yeah, what happens? [STUDENT ANSWER] I don't think it's +gate. Yeah, what happens? (Student is answering) I don't think it's 386 00:30:53,470 --> 00:30:57,559 @@ -2062,7 +2062,7 @@ idea is that in the forward pass 418 00:33:19,929 --> 00:33:23,759 we're iterating over all the gates in the circuit -that, and they're sorted in topological [??] +that, and they're sorted in topological 419 00:33:23,759 --> 00:33:27,980 @@ -2136,7 +2136,7 @@ actually more like correct 433 00:34:16,760 --> 00:34:18,730 -implementation something like this might [??] +implementation. Something like this might run 434 @@ -2147,7 +2147,7 @@ gate and how it could be implemented. 435 00:34:23,769 --> 00:34:27,690 A multiply gate, in this case, is just a -binary multiply who receives two inputs [??] +binary multiply, so it receives two inputs 436 00:34:27,690 --> 00:34:33,780 @@ -2290,7 +2290,7 @@ intermediates to actually compute the 464 00:36:33,690 --> 00:36:45,289 -proper backward pass. So that's... [STUDENT QUESTION] Yes, so if you don't, if you know you don't want to do backward pass, then you can +proper backward pass. So that's... (Student is asking question) Yes, so if you don't, if you know you don't want to do backward pass, then you can get rid of many of these things and you 465 @@ -2310,7 +2310,7 @@ with that. Usually we end up remembering it 468 00:36:57,280 --> 00:37:09,370 -anyway. I, yeah. [STUDENT QUESTION] I see. Yes, so I think if you're in the [??] +anyway. (Student is asking question) I see. Yes, so I think if you're in the embedded device for example, and you worry 469 @@ -2331,7 +2331,7 @@ make sure nothing gets cached in case 472 00:37:18,750 --> 00:37:33,130 you want to do a backward pass. Questions. -Yes. [STUDENT QUESTION] You're saying if we remember the local gradients in +Yes. (Student is asking question) You're saying if we remember the local gradients in 473 00:37:33,130 --> 00:37:39,750 @@ -2351,11 +2351,11 @@ But I mean, you're in charge of, remember 476 00:37:49,170 --> 00:37:54,950 whatever you need to, perform the -backward pass, and on a gate-by-gate basis. You +backward pass, and on a gate-by-gate basis. 477 00:37:54,949 --> 00:37:58,509 -don't necce, you can remember whatever [memory footprint??] +You can remember whatever you feel like. It has lower footprint and so on. 478 @@ -2436,7 +2436,7 @@ that function piece knows how to do a 493 00:38:58,840 --> 00:39:02,670 forward and it knows how to do a backward. -So just [?? For the?] 
specific example, let's +So just to view the specific example, let's 494 00:39:02,670 --> 00:39:10,150 @@ -2456,7 +2456,7 @@ numbers basically, because when we 497 00:39:19,300 --> 00:39:22,410 actually work with these, we do a lot of -[??] operation so we receive a tensor +vectorized operation so we receive a tensor 498 00:39:22,409 --> 00:39:28,289 @@ -2505,7 +2505,7 @@ variable gradInput 507 00:39:59,690 --> 00:40:03,539 -which you need to compute. That's your gradient [??] +which it needs to compute. That's your gradient that you're passing up. The gradInput is, 508 @@ -2639,7 +2639,7 @@ these implementations and so on? 534 00:42:00,849 --> 00:42:15,559 -Yes, thank you. [STUDENT QUESTION] So the question is, do we have to go through forward and backward for every update. The answer is yes, because when you want to do update, you need the gradient, and so you need to do forward on your sample minibatch. You do a forward. Right away you do a backward. +Yes, thank you. (Student is asking question) So the question is, do we have to go through forward and backward for every update. The answer is yes, because when you want to do update, you need the gradient, and so you need to do forward on your sample minibatch. You do a forward. Right away you do a backward. And now you have your analytic gradient. 535 @@ -2669,7 +2669,7 @@ backward, update. Forward, backward, update. 540 00:42:36,318 --> 00:42:51,808 -We'll see that in a bit. Go ahead. [STUDENT QUESTION] You're asking about a +We'll see that in a bit. Go ahead. (Student is asking question) You're asking about a for loop. Oh, is there a for loop here? I didn't even notice. Okay. 541 @@ -2760,7 +2760,7 @@ is an entire Jacobian matrix, so you end up with 558 00:44:16,079 --> 00:44:32,130 an entire matrix-vector multiply to -actually chain the gradient backwards. [STUDENT QUESTION] +actually chain the gradient backwards. (Student is asking question) 559 00:44:32,130 --> 00:44:36,380 @@ -2853,11 +2853,11 @@ right? So the second question is, so this 577 00:45:49,460 --> 00:45:52,949 is a huge matrix, 16 million -numbers, but why would you never form it? [??] +numbers, but why would you never form it? 578 00:45:52,949 --> 00:46:02,719 -What does the Jacobian actually look like? [STUDENT QUESTION] No, Jacobian will always be +What does the Jacobian actually look like? (Student is asking question) No, Jacobian will always be a matrix, because every one of these 4096 579 @@ -2868,7 +2868,7 @@ Jacobian is still a giant 4096 by 4096 580 00:46:09,949 --> 00:46:14,558 matrix, but has special structure, right? -And what is that special structure? [STUDENT ANSWER] +And what is that special structure? (Student is answering) 581 00:46:14,559 --> 00:46:27,420 @@ -2907,8 +2907,8 @@ so you never actually want to carry out 588 00:46:55,429 --> 00:47:00,808 -this operation as a matrix-vector -multiply, because (of) their special structure [??] +this operation as a matrix vector +multiply, because of their special structure 589 00:47:00,809 --> 00:47:04,150 @@ -2932,7 +2932,7 @@ set the gradient to 0 in those dimensions. 593 00:47:17,210 --> 00:47:21,650 -So you take the grid output here, and [??] +So you take the grad output here, and whichever numbers were less than zero, 594 @@ -2942,7 +2942,7 @@ just set them to 0. Set those gradients to 0 and then you continue backward pass 595 00:47:25,909 --> 00:47:52,230 So very simple operations in the -end in terms of efficiency. [STUDENT QUESTION] That's right. 
So the question is, the commication between the gates is always just vectors. That's right. So this Jacobian +end in terms of efficiency. (Student is asking question) That's right. So the question is, the commication between the gates is always just vectors. That's right. So this Jacobian 596 00:47:52,230 --> 00:47:55,940 @@ -2956,7 +2956,7 @@ what's going back to other gates, they 598 00:47:59,670 --> 00:48:17,380 -only care about the gradient vector. [STUDENT QUESTION] Yes, so the question is, unless you end up having multiple outputs, because then for each output, we have to do this, so yeah. So +only care about the gradient vector. (Student is asking question) Yes, so the question is, unless you end up having multiple outputs, because then for each output, we have to do this, so yeah. So we'll never actually run into that case 599 @@ -3036,7 +3036,7 @@ be writing SVMs and Softmax and so on, and I just kind of 614 00:49:30,789 --> 00:49:33,680 -would like to give you a hint on the design [??] +would like to give you a hint on the design of how you actually should approach this 615 @@ -3116,7 +3116,7 @@ tempted to try to just derive W, the 630 00:50:40,179 --> 00:50:43,798 -gradient on W equals, and then implement [??] +gradient on W equals, and then implement that and that's an unhealthy way of 631 @@ -3181,7 +3181,7 @@ I'm going to go into neural networks. So 643 00:51:37,690 --> 00:51:49,860 any questions before we move on from -backprop? Go ahead. [STUDENT QUESTION] +backprop? Go ahead. (Student is asking a question) 644 00:51:49,860 --> 00:52:03,130 @@ -3196,11 +3196,11 @@ in numpy, so that's going to be some of the 646 00:52:06,750 --> 00:52:18,030 brain teaser stuff that you guys are -going to have to do. [STUDENT QUESTION] Yes, so it's up to you what you want your gates to be like, and what you want them +going to have to do. (Student is asking a question) Yes, so it's up to you what you want your gates to be like, and what you want them 647 00:52:18,030 --> 00:52:24,490 -to be. [STUDENT QUESTION] Yeah, I don't think you'd want to do that. +to be. (Student is asking a question) Yeah, I don't think you'd want to do that. 648 00:52:24,489 --> 00:52:30,739 @@ -3477,7 +3477,7 @@ interesting. 703 00:56:14,719 --> 00:56:49,509 -Was there a question? [STUDENT QUESTION] So the question is, if h had less than 10 units, would it be inferior to a linear classifier? I think that's... that's acutally not obvious to me. It's an interesting question. I think... you could make that work. I think you could make it work. Yeah, I think that would actually work. Someone should try that for extra points in the +Was there a question? (Student is asking a question) So the question is, if h had less than 10 units, would it be inferior to a linear classifier? I think that's... that's acutally not obvious to me. It's an interesting question. I think... you could make that work. I think you could make it work. Yeah, I think that would actually work. Someone should try that for extra points in the assignment. So you'll have a section on the assignment do something fun or extra 704 @@ -3497,11 +3497,11 @@ or not. 707 00:56:59,659 --> 00:57:08,329 -Any other questions? Go ahead. [STUDENT QUESTION] +Any other questions? Go ahead. (Student is asking a question) 708 00:57:08,329 --> 00:57:34,989 -Sorry, I don't think I understood the question. [STUDENT QUESTION] I see. 
So you're really asking about the layout of the h vector and how it gets allocated over the different modes of +Sorry, I don't think I understood the question. (Student is asking a question) I see. So you're really asking about the layout of the h vector and how it gets allocated over the different modes of the dataset and I don't have a good 709 @@ -3561,7 +3561,7 @@ be as big as possible, as it fits in your 720 00:58:22,739 --> 00:58:30,659 computer and so on, so more is better. So we'll go -into that. Go ahead. [STUDENT QUESTION] +into that. Go ahead. (Student is asking a question) 721 00:58:30,659 --> 00:58:38,639 @@ -3656,7 +3656,7 @@ layer activations, but this is using 739 00:59:50,139 --> 00:59:54,069 a sigmoid nonlinearity not a max of 0 and X. -And we'll go into a bit of what [??] +And we'll go into a bit of what 740 00:59:54,070 --> 00:59:58,650 @@ -3750,7 +3750,7 @@ forward pass, we do backward pass, we do an 758 01:01:11,019 --> 01:01:18,840 -update, we keep iterating this over and over again. Go ahead. [STUDENT QUESTION] +update, we keep iterating this over and over again. Go ahead. (Student is asking a question) The random function is creating your first initial random 759 @@ -3888,8 +3888,8 @@ idea about where this is all coming from 786 01:03:11,880 --> 01:03:17,220 -you have the cell body or a Soma as pople like to [??] -call it and it's got all these dendrites +you have the cell body or a Soma as people like to +call it, and it's got all these dendrites 787 01:03:17,219 --> 01:03:21,049 @@ -4536,7 +4536,7 @@ network? If I want the neural network 917 01:12:45,569 --> 01:12:51,889 -to correctly classify this, how many neurons do I need in the hidden layer as a minimum? [STUDENT ANSWER] +to correctly classify this, how many neurons do I need in the hidden layer as a minimum? (Student is answering) 918 01:12:51,890 --> 01:13:15,270 @@ -4576,7 +4576,7 @@ lines, you can carve out the space so 925 01:13:45,649 --> 01:13:52,429 that the second layer can just combine -them when their numbers are 1 and not 0. [STUDENT QUESTION] +them when their numbers are 1 and not 0. (Student is asking question) 926 01:13:52,430 --> 01:13:57,850 @@ -4585,7 +4585,7 @@ because two lines are not enough. I 927 01:13:57,850 --> 01:14:03,900 -suppose this works something [??] Not going to look very good +suppose this works... Not going to look very good here. So with two, basically it will find 928 @@ -4595,7 +4595,7 @@ lines. They're kind of creating this 929 01:14:07,239 --> 01:14:14,599 -tunnel and that's the best you can do. Okay? [STUDENT QUESTION] +tunnel and that's the best you can do. Okay? (Student is asking question) 930 01:14:14,600 --> 01:14:31,300 @@ -4604,7 +4604,7 @@ The curve, I think... Which nonlinearity am I using? tanh? Yeah, I'm not sure ex 931 01:14:31,300 --> 01:14:50,460 think you'd see sharp boundaries. Yeah. -Yes, this is three. You can do four. So let's do... [STUDENT QUESTION] Yeah, that's because +Yes, this is three. You can do four. So let's do... (Student is asking question) Yeah, that's because 932 01:14:50,460 --> 01:14:52,130 @@ -4643,11 +4643,11 @@ doing this update, it will just go in there 939 01:15:22,390 --> 01:15:32,800 and figure that out. Very simple data -that is not. Spiral. Circle, and then random +that is not. Spiral. Circle, and then random... 940 01:15:32,800 --> 01:15:39,880 -so random data, so you could, kind of goes in [??] 
+so random data, and so you could, kind of goes in there, like covers up the green 941 @@ -4741,7 +4741,7 @@ amount of time, you don't want to wait forever to train your 959 01:17:09,920 --> 01:17:19,980 networks. You'll use smaller ones for practical -reasons. Question? [STUDENT QUESTION] Do you regularize each layer equally. +reasons. Question? (Student is asking question) Do you regularize each layer equally. 960 01:17:19,979 --> 01:17:25,509 @@ -4755,11 +4755,11 @@ be regularized the same way throughout. 962 01:17:28,029 --> 01:17:33,809 -But you don't have to necessarily. Go ahead. +But you don't have to necessarily. Go ahead. (Student is asking question) 963 01:17:33,810 --> 01:17:40,500 -Is there any value to use in second derivatives using hashing in [??] +Is there any value to using second derivatives using hashing in optimizing neural networks? There is value 964 @@ -4789,7 +4789,7 @@ have to do full batch by default. Question. 969 01:18:01,970 --> 01:18:16,650 -[STUDENT QUESTION] So what is the tradeoff between depth and size roughly, like how do you allocate? Not a good +(Student is asking question) So what is the tradeoff between depth and size roughly, like how do you allocate? Not a good answer for that unfortunately. So you 970 @@ -4804,7 +4804,7 @@ one more minute so I can still take some 972 01:18:25,220 --> 01:18:35,990 -questions. You had a question. [??] [STUDENT QUESTION] Yeah, so the +questions. You had a question for a while. (Student is asking question) Yeah, so the tradeoff between where do I allocate my 973 @@ -4814,7 +4814,7 @@ I want it to be wider, not a very good 974 01:18:40,020 --> 01:18:47,860 -answer to that. [STUDENT QUESTION] Yes, usually, especially +answer to that. (Student is asking question) Yes, usually, especially with images, we find that more layers are 975 @@ -4829,7 +4829,7 @@ critical, and so it's kind of slightly 977 01:18:55,359 --> 01:19:01,670 -data dependent. We had a question over there. [STUDENT QUESTION] +data dependent. We had a question over there. (Student is asking question) 978 01:19:01,670 --> 01:19:10,050 @@ -4863,6 +4863,4 @@ a lot more neural networks, so a lot of 984 01:19:29,789 --> 01:19:31,238 -these questions we'll go through them. - - +these questions we'll go through them. \ No newline at end of file From ce0328cd9c9cfec7c1e488d494523f7e5b540b91 Mon Sep 17 00:00:00 2001 From: myungsub Date: Tue, 24 May 2016 14:31:50 +0900 Subject: [PATCH 141/199] review classification.md --- classification.md | 62 +++++++++++++++++++++++------------------------ 1 file changed, 31 insertions(+), 31 deletions(-) diff --git a/classification.md b/classification.md index 927b8dd8..d1ecbf19 100644 --- a/classification.md +++ b/classification.md @@ -19,9 +19,9 @@ permalink: /classification/ ## Image Classification(이미지 분류) -**동기**. 이 섹션에서는 이미지 분류 문제에 대해 다룰 것이다. 이미지 분류 문제란, 입력 이미지를 미리 정해진 카테고리 중 하나인 라벨로 분류하는 문제다. 문제 정의는 매우 간단하지만 다양한 활용 가능성이 있는 컴퓨터 비전 분야의 핵심적인 문제 중의 하나이다. 강의의 나중 파트에서도 살펴보겠지만, 이미지 분류와 멀어보이는 다른 컴퓨터 비전 분야의 여러 문제들 (물체 검출, 영상 분할 등)이 영상 분류 문제를 푸는 것으로 인해 해결될 수 있다. +**동기**. 이 섹션에서는 이미지 분류 문제에 대해 다룰 것이다. 이미지 분류 문제란, 입력 이미지를 미리 정해진 카테고리 중 하나인 라벨로 분류하는 문제다. 문제 정의는 매우 간단하지만 다양한 활용 가능성이 있는 컴퓨터 비전 분야의 핵심적인 문제 중의 하나이다. 강의의 나중 파트에서도 살펴보겠지만, 이미지 분류와 멀어보이는 다른 컴퓨터 비전 분야의 여러 문제들 (물체 검출, 영상 분할 등)이 이미지 분류 문제를 푸는 것으로 인해 해결될 수 있다. -**예시**. 예를 들어, 아래 그림의 이미지 분류 모델은 하나의 이미지와 4개의 분류가능한 라벨 *{cat, dog, hat, mug}* 이 있다. 그림에서 보다시피, 컴퓨터에서 이미지는 3차원 배열로 표현된다. 
이 예시에서 고양이 이미지는 가로 248픽셀(모니터의 화면을 구성하는 최소 단위, 역자 주), 세로 400픽셀로 구성되어 있고 3개의 색상 채널이 있는데 각각 Red, Green, Blue(RGB)로 불린다. 따라서 이 이미지는 248 x 400 x 3개(총 297,500개)의 픽셀로 구성되어 있다. 각 픽셀의 값은 0~255 범위의 정수값이다. 이미지 분류 문제는 이 수많은 값들을 *"cat"* 이라는 하나의 라벨로 변경하는 것이다.
+**예시**. 예를 들어, 아래 그림의 이미지 분류 모델은 하나의 이미지와 4개의 분류가능한 라벨 *{cat, dog, hat, mug}* 이 있다. 그림에서 보다시피, 컴퓨터에서 이미지는 3차원 배열로 표현된다. 이 예시에서 고양이 이미지는 가로 248픽셀(모니터의 화면을 구성하는 최소 단위, 역자 주), 세로 400픽셀로 구성되어 있고 Red, Green, Blue(RGB) 3개의 색상 채널이 있다. 따라서 이 이미지는 248 x 400 x 3개(총 297,600개)의 숫자로 구성되어 있다. 각 픽셀의 값은 0~255 범위의 정수값이다. 이미지 분류 문제는 이 수많은 값들을 *"cat"* 이라는 하나의 라벨로 변경하는 것이다.
@@ -33,7 +33,7 @@ permalink: /classification/ - **Viewpoint variation(시점 변화)**. 객체의 단일 인스턴스는 카메라에 의해 시점이 달라질 수 있다. - **Scale variation(크기 변화)**. 비주얼 클래스는 대부분 그것들의 크기의 변화를 나타낸다(이미지의 크기뿐만 아니라 실제 세계에서의 크기까지 포함함). - **Deformation(변형)**. 많은 객체들은 고정된 형태가 없고, 극단적인 형태로 변형될 수 있다. -- **Occlusion(폐색)**. 객체들은 전체가 보이지 않을 수 있다. 때로는 물체의 매우 적은 부분(매우 적은 픽셀)이 보인다. +- **Occlusion(폐색)**. 객체들은 전체가 보이지 않을 수 있다. 때로는 물체의 매우 적은 부분(매우 적은 픽셀)만이 보인다. - **Illumination conditions(조명 상태)**. 조명의 영향으로 픽셀 값이 변형된다. - **Background clutter(배경 분규)**. 객체가 주변 환경에 섞여(*blend*) 알아보기 힘들게 된다. - **Intra-class variation(내부클래스의 다양성)**. 분류해야할 클래스는 범위가 큰 것들이 많다. 예를 들어 *의자* 의 경우, 매우 다양한 형태의 객체가 있다. @@ -48,48 +48,48 @@ permalink: /classification/ **Data-driven approach(데이터 기반 방법론)**. 어떻게 하면 이미지를 각각의 카테고리로 분류하는 알고리즘을 작성할 수 있을까? 숫자를 정렬하는 알고리즘 작성과는 달리 고양이를 분별하는 알고리즘을 작성하는 것은 어렵다. 그러므로, 코드를 통해 직접적으로 모든 것을 카테고리로 분류하기 보다는 좀 더 쉬운 방법을 사용할 것이다. 먼저 컴퓨터에게 각 클래스에 대해 많은 예제를 주고 나서 이 예제들을 보고 시각적으로 학습할 수 있는 학습 알고리즘을 개발한다. - 이런 방법을 *data-driven approach(데이터 기반 아법론)* 이라고 한다. 이 방법은 라벨화가 된 이미지들 *training dataset(트레이닝 데이터 셋)* 이 처음 학습을 위해 필요하다. 아래 그림은 이런 데이터셋의 예이다. + 이런 방법을 *data-driven approach(데이터 기반 방법론)* 이라고 한다. 이 방법은 라벨화가 된 이미지들 *training dataset(학습 데이터셋)* 이 처음 학습을 위해 필요하다. 아래 그림은 이런 데이터셋의 예이다.
-  <div class="figcaption">4개의 카테고리에 대한 트레이닝 셋에 대한 예. 학습과정에서 천여개의 카테고리와 각 카테고리당 수십만개의 이미지가 있을 수 있다.</div>
+  <div class="figcaption">4개의 카테고리에 대한 학습 데이터셋의 예. 학습과정에서 천여 개의 카테고리와 각 카테고리당 수십만 개의 이미지가 있을 수 있다.</div>
</div>
-**The image classification pipeline(이미지 분류 파이프라인)**. 이제까지 이미지 분류는 픽셀값을 같고 있는 배열은 하나의 이미지로 표현하고 라벨을 할당하는 것이다라는 것을 살펴봤다. 우리의 완전한 파이프라인은 아래와 같이 공식화할 수 있다: +**The image classification pipeline(이미지 분류 파이프라인)**. 이미지 분류 문제란, 이미지를 픽셀들의 배열로 표현하고 각 이미지에 라벨을 하나씩 할당하는 문제라는 것을 이제까지 살펴보았다. 완전한 파이프라인은 아래와 같이 공식화할 수 있다: - **Input(입력):** 입력은 *N* 개의 이미지로 구성되어 있고, *K* 개의 별개의 클래스로 라벨화 되어 있다. 이 데이터를 *training set* 으로 사용한다. - **Learning(학습):** 학습에서 할 일은 트레이닝 셋을 이용해 각각의 클래스를 학습하는 것이다. 이 과정을 *training a classifier* 혹은 *learning a model* 이란 용어를 사용해 표현할 수 있다. -- **Evaluation(평가):** 마지막으로 새로운 이미지에 대해 어떤 라벨값으로 분류되는지 예측해봄으로써 분류기의 성능을 평가한다. 새로운 이미지의 라벨값과 분류기를 통해 예측된 라벨값을 비교할 수 있다. 직감적으로, 많은 예상치들이 실제 답과 일치하기를 기대한다. 이 것을 *ground truth(실측 자료)* 라고 한다. +- **Evaluation(평가):** 마지막으로 새로운 이미지에 대해 어떤 라벨로 분류되는지 예측해봄으로써 분류기의 성능을 평가한다. 새로운 이미지의 라벨과 분류기를 통해 예측된 라벨을 비교할 것이다. 직관적으로, 많은 예상치들이 실제 답과 일치하기를 기대하는 것이고, 이 것을 우리는 *ground truth(실측 자료)* 라고 한다. ## Nearest Neighbor Classifier(최근접 이웃 분류기) -첫번째 방법으로써, **Nearest Neighbor Classifier** 라 불리는 분류기를 개발할 것이다. 이 분류기는 컨볼루션 신경망 방법이 사용되지 않고 연습과정애서 매우 드물게 사용된다. 하지만 이 분류기는 이미지 분류 문제에 대한 기본적인 접근방법을 알 수 있다. +첫번째 방법으로써, **Nearest Neighbor Classifier** 라 불리는 분류기를 개발할 것이다. 이 분류기는 컨볼루션 신경망 방법과는 아무 상관이 없고 실제 문제를 풀 때 자주 사용되지는 않지만, 이미지 분류 문제에 대한 기본적인 접근 방법을 알 수 있도록 한다. -**이미지 분류 데이터셋의 예: CIFAR-10.** 하나의 유명한 이미지 분류 데이터셋은 CIFAR-10 dataset 이다. 이 데이터셋은 60,000개의 작은 이미지로 구성되어 있고, 각 이미지는 32x32픽셀 크기있다. 각 이미지는 10개의 클래스중 하나로 라벨화되어 있다(예를 들어, *"airplane, automobile, bird, etc"*). 이 60,000개의 이미지 중에 50,000개는 트레이싱 셋, 10,000개는 트레이닝 셋으로 분류된다. 아래의 그림에서 각 10개의 클래스에 대해 임의로 선정한 10개의 이미지들의 예를 볼 수 있다: +**이미지 분류 데이터셋의 예: CIFAR-10.** 간단하면서 유명한 이미지 분류 데이터셋 중의 하나는 CIFAR-10 dataset 이다. 이 데이터셋은 60,000개의 작은 이미지로 구성되어 있고, 각 이미지는 32x32 픽셀 크기이다. 각 이미지는 10개의 클래스중 하나로 라벨링되어 있다(Ex. *"airplane, automobile, bird, etc"*). 이 60,000개의 이미지 중에 50,000개는 학습 데이터셋 (트레이닝 셋), 10,000개는 테스트 (데이터)셋으로 분류된다. 아래의 그림에서 각 10개의 클래스에 대해 임의로 선정한 10개의 이미지들의 예를 볼 수 있다:
-  <div class="figcaption">좌: CIFAR-10 dataset 의 각 클래스 예. 우: 첫번째 열은 테스트 셋이고 나머지 열은 이 테스트셋에 대해서 트레이닝 셋에 있는 이미지 중 픽셀값 차에 따른 상위 10개의 최근접 이웃 이미지이다.</div>
+  <div class="figcaption">좌: CIFAR-10 dataset 의 각 클래스 예. 우: 첫번째 열은 테스트 셋이고 나머지 열은 이 테스트 셋에 대해서 트레이닝 셋에 있는 이미지 중 픽셀값 차에 따른 상위 10개의 최근접 이웃 이미지이다.</div>
</div>
-50,000개의 CIFAR-10 트레이닝 셋(하나의 라벨 당 5,000개의 이미지)이 주어진 상태에서 나머지 10,000개의 이미지에 대해 라벨화 하는 것을 가정해보자. 최근접 이웃 분류기는 테스트 이미지를 취해 모든 트레이닝 이미지와 비교를 하고 라벨 값을 예상할 것이다. 상단 이미지의 우측과 같이 10개의 테스트 이미지에 대한 결과를 확인할 수 있다. 10개의 이미지 중 3개만이 같은 클래스로 검색된 반면에, 7개의 이미지는 같은 클래스로 분류되지 않았다. 예를 들어, 8번째 행의 말 학습 이미지에 대한 첫번째 최근접 이웃 이미지는 붉은색의 차이다. 짐작컨데 이 경우는 검은색 배경의 영향이 큰 듯 하다. 결과적으로, 이 말 이미지는 차로 잘 못 분류될 것이다. +50,000개의 CIFAR-10 트레이닝 셋(하나의 라벨 당 5,000개의 이미지)이 주어진 상태에서 나머지 10,000개의 이미지에 대해 라벨화 하는 것을 가정해보자. 최근접 이웃 분류기는 테스트 이미지를 취해 모든 학습 이미지와 비교를 하고 라벨 값을 예상할 것이다. 상단 이미지의 우측과 같이 10개의 테스트 이미지에 대한 결과를 확인해보면, 10개의 이미지 중 3개만이 같은 클래스로 검색된 반면, 7개의 이미지는 같은 클래스로 분류되지 않았다. 예를 들어, 8번째 행의 말 학습 이미지에 대한 첫번째 최근접 이웃 이미지는 붉은색의 차이다. 짐작컨데 이 경우는 검은색 배경의 영향이 큰 듯 하다. 결과적으로, 이 말 이미지는 차로 잘못 분류될 것이다. -두개의 이미지를 비교하는 정확한 방법을 아직 명시하지 않았는데, 이 경우에는 32 x 32 x 3 크기의 두 블록이다. 가장 간단한 방법 중 하나는 이미지를 각각의 픽셀값으로 비교하고, 그 차이를 더해 모두 더하는 것이다. 다시 말해서 두 개의 이미지가 주어지고 그 것들을 $$ I_1, I_2 $$ 벡터로 나타냈을 때, 벡터 간의 **L1 distance(L1 거리)** 를 계산하는 것이 적절한 방법이다: +두개의 이미지(이 경우에는 32 x 32 x 3 크기의 두 블록)를 비교하는 정확한 방법을 아직 명시하지 않았다는 점을 눈치챘을 것이다. 가장 간단한 방법 중 하나는 이미지를 각각의 픽셀값으로 비교하고, 그 차이를 모두 더하는 것이다. 다시 말해서 두 개의 이미지가 주어지고 그 것들을 $$ I_1, I_2 $$ 벡터로 나타냈을 때, 벡터 간의 **L1 distance(L1 거리)** 를 계산하는 것이 한 가지 방법이다: $$ d_1 (I_1, I_2) = \sum_{p} \left| I^p_1 - I^p_2 \right| $$ -결과는 모든 픽셀값 차이의 합이다. 아래에 시각적인 절차가 있다: +결과는 모든 픽셀값 차이의 합이다. 아래에 그 과정을 시각화 하였다:
-  <div class="figcaption">An example of using pixel-wise differences to compare two images with L1 distance (for one color channel in this example). Two images are subtracted elementwise and then all differences are added up to a single number. If two images are identical the result will be zero. But if the images are very different the result will be large.</div>
+  <div class="figcaption">두 개의 이미지를 (각각의 색 채널마다의) L1 거리를 이용해서 비교할 때, 각 픽셀마다의 차이를 사용하는 예시. 두 이미지 벡터(행렬)의 각 성분마다 차를 계산하고, 그 차를 전부 더해서 하나의 숫자를 얻는다. 두 이미지가 똑같을 경우에는 결과가 0일 것이고, 두 이미지가 매우 다르다면 결과값이 클 것이다.</div>
</div>
-분류기를 코드상에서 어떻게 구현하는 과정을 살펴보자. 첫번째로 CIFAR-10 데이터를 4개의 배열을 통해 메모리로 불러온다. 각각은 트레이닝 데이터와 라벨, 테스트 데이터와 라벨이다. 아래 코드에 `Xtr`(크기 50,000 x 32 x 32 x 3)은 트레이닝 셋의 모든 이미지를 저장하고 1차원 배열인 `Ytr`(길이 50,000)은 트레이닝 데이터의 라벨을 저장한다. +다음으로, 분류기를 실제로 코드 상에서 어떻게 구현하는지 살펴보자. 첫 번째로 CIFAR-10 데이터를 메모리로 불러와 4개의 배열에 저장한다. 각각은 학습(트레이닝) 데이터와 그 라벨, 테스트 데이터와 그 라벨이다. 아래 코드에 `Xtr`(크기 50,000 x 32 x 32 x 3)은 트레이닝 셋의 모든 이미지를 저장하고 1차원 배열인 `Ytr`(길이 50,000)은 트레이닝 데이터의 라벨(0부터 9까지)을 저장한다. ~~~python Xtr, Ytr, Xte, Yte = load_CIFAR10('data/cifar10/') # 제공되는 함수 @@ -101,15 +101,15 @@ Xte_rows = Xte.reshape(Xte.shape[0], 32 * 32 * 3) # Xte_rows는 10000 x 3072 크 이제 모든 이미지를 배열의 각 행들로 얻었다. 아래에는 분류기를 어떻게 학습시키고 평가하는지에 대한 코드이다: ~~~python -nn = NearestNeighbor() # create a Nearest Neighbor classifier class -nn.train(Xtr_rows, Ytr) # train the classifier on the training images and labels -Yte_predict = nn.predict(Xte_rows) # predict labels on the test images -# and now print the classification accuracy, which is the average number -# of examples that are correctly predicted (i.e. label matches) +nn = NearestNeighbor() # Nearest Neighbor 분류기 클래스 생성 +nn.train(Xtr_rows, Ytr) # 학습 이미지/라벨을 활용하여 분류기 학습 +Yte_predict = nn.predict(Xte_rows) # 테스트 이미지들에 대해 라벨 예측 +# 그리고 분류 성능을 프린트한다 +# 정확도는 이미지가 올바르게 예측된 비율로 계산된다 (라벨이 같을 비율) print 'accuracy: %f' % ( np.mean(Yte_predict == Yte) ) ~~~ -일반적으로 평가 기준으로서 **accuracy(정확도)** 를 사용한다. 정확도는 예측값이 얼마나 일치한지 비율을 측정한다. 앞으로 만들어 볼 모든 분류기는 공통적인 API를 갖는다: 그것들은 데이터(X)와 데이터가 실제로 속하는 라벨(y)을 입력으로 받는 `train(X,y)` 형태의 함수가 있다. 내부적으로는 클래스는 특정한 종류의 라벨에 대한 모델과 그 값들이 데이터로부터 어떻게 예측될 수 있는지 만들어야 한다. 그 이후에 새로운 데이터로 부터 라벨을 예측하는 `predict(X)` 형태의 함수가 있다. 물론, 아직은 실제로 분류기가 작동하는 부분은 빠져있다. 이제 L1 거리를 이용한 간단한 최근접 이웃 분류기에 대한 구현방법을 소개한다: +일반적으로 평가 기준으로서 **accuracy(정확도)** 를 사용한다. 정확도는 예측값이 실제와 얼마나 일치하는지 그 비율을 측정한다. 앞으로 만들어볼 모든 분류기는 공통적인 API를 갖게 될 것이다: 데이터(X)와 데이터가 실제로 속하는 라벨(y)을 입력으로 받는 `train(X,y)` 형태의 함수가 있다는 점이다. 내부적으로, 이 함수는 라벨들을 활용하여 어떤 모델을 만들어야 하고, 그 값들이 데이터로부터 어떻게 예측될 수 있는지를 알아야 한다. 그 이후에는 새로운 데이터로 부터 라벨을 예측하는 `predict(X)` 형태의 함수가 있다. 물론, 아직까지는 실제 분류기 자체가 빠져있다. 다음은 앞의 형식을 만족하는 L1 거리를 이용한 간단한 최근접 이웃 분류기의 구현이다: ~~~python import numpy as np @@ -120,31 +120,31 @@ class NearestNeighbor(object): def train(self, X, y): """ X is N x D where each row is an example. Y is 1-dimension of size N """ - # the nearest neighbor classifier simply remembers all the training data + # nearest neighbor 분류기는 단순히 모든 학습 데이터를 기억해둔다. self.Xtr = X self.ytr = y def predict(self, X): """ X is N x D where each row is an example we wish to predict label for """ num_test = X.shape[0] - # lets make sure that the output type matches the input type + # 출력 type과 입력 type이 갖게 되도록 확인해준다. Ypred = np.zeros(num_test, dtype = self.ytr.dtype) # loop over all test rows for i in xrange(num_test): - # find the nearest training image to the i'th test image - # using the L1 distance (sum of absolute value differences) + # i번째 테스트 이미지와 가장 가까운 학습 이미지를 + # L1 거리(절대값 차의 총합)를 이용하여 찾는다. distances = np.sum(np.abs(self.Xtr - X[i,:]), axis = 1) - min_index = np.argmin(distances) # get the index with smallest distance - Ypred[i] = self.ytr[min_index] # predict the label of the nearest example + min_index = np.argmin(distances) # 가장 작은 distance를 갖는 인덱스를 찾는다. + Ypred[i] = self.ytr[min_index] # 가장 가까운 이웃의 라벨로 예측 return Ypred ~~~ -이 코드를 실행해보면 이 분류기는 CIFAR-10에 대해 정확도가 **38.6%** 밖에 되지 않다는 것을 확인할 수 있다. 
임의로 답을 결정하는 것(10개의 클래스가 있을 때 10%의 정확도)보다는 낫지만, 사람의 반응([약 94%](http://karpathy.github.io/2011/04/27/manually-classifying-cifar10/))이나 최신 컨볼루션 신경망의 성능(약 95%)에는 훨씬 미치지 못한다(최근 Kaggle 대회 [순위표](http://www.kaggle.com/c/cifar-10/leaderboard) 참고). +이 코드를 실행해보면 이 분류기는 CIFAR-10에 대해 정확도가 **38.6%** 밖에 되지 않는다는 것을 확인할 수 있다. 임의로 답을 결정하는 것(10개의 클래스가 있으므로 10%의 정확도)보다는 낫지만, 사람의 정확도([약 94%](http://karpathy.github.io/2011/04/27/manually-classifying-cifar10/))나 최신 컨볼루션 신경망의 성능(약 95%)에는 훨씬 미치지 못한다(최근 Kaggle 대회 [순위표](http://www.kaggle.com/c/cifar-10/leaderboard) 참고). -**거리 선택** -벡터간의 거리를 계산하는 방법은 많다. 다른 일반적인 선택으로는 두 벡터간의 유클리디안 거리를 계산하는 기하학적인 방법인 **L2 distance(L2 거리)** 의 사용을 고려해볼 수 있다. 이 거리는 아래의 식으로 얻는다: +**거리(distance) 선택** +벡터간의 거리를 계산하는 방법은 L1 거리 외에도 매우 많다. 또 다른 일반적인 선택으로, 기하학적으로 두 벡터간의 유클리디안 거리를 계산하는 것으로 해석할 수 있는 **L2 distance(L2 거리)** 의 사용을 고려해볼 수 있다. 이 거리의 계산 방식은 다음과 같다: $$ d_2 (I_1, I_2) = \sqrt{\sum_{p} \left( I^p_1 - I^p_2 \right)^2} @@ -179,7 +179,7 @@ In practice, you will almost always want to use k-Nearest Neighbor. But what val The k-nearest neighbor classifier requires a setting for *k*. But what number works best? Additionally, we saw that there are many different distance functions we could have used: L1 norm, L2 norm, there are many other choices we didn't even consider (e.g. dot products). These choices are called **hyperparameters** and they come up very often in the design of many Machine Learning algorithms that learn from data. It's often not obvious what values/settings one should choose. You might be tempted to suggest that we should try out many different values and see what works best. That is a fine idea and that's indeed what we will do, but this must be done very carefully. In particular, **we cannot use the test set for the purpose of tweaking hyperparameters**. Whenever you're designing Machine Learning algorithms, you should think of the test set as a very precious resource that should ideally never be touched until one time at the very end. Otherwise, the very real danger is that you may tune your hyperparameters to work well on the test set, but if you were to deploy your model you could see a significantly reduced performance. In practice, we would say that you **overfit** to the test set. Another way of looking at it is that if you tune your hyperparameters on the test set, you are effectively using the test set as the training set, and therefore the performance you achieve on it will be too optimistic with respect to what you might actually observe when you deploy your model. But if you only use the test set once at end, it remains a good proxy for measuring the **generalization** of your classifier (we will see much more discussion surrounding generalization later in the class). - +당신은 > Evaluate on the test set only a single time, at the very end. Luckily, there is a correct way of tuning the hyperparameters and it does not touch the test set at all. The idea is to split our training set in two: a slightly smaller training set, and what we call a **validation set**. Using CIFAR-10 as an example, we could for example use 49,000 of the training images for training, and leave 1,000 aside for validation. This validation set is essentially used as a fake test set to tune the hyper-parameters. 
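(역자 주: 위에서 정의한 L1/L2 거리는 numpy로 한 줄씩 간단히 확인해볼 수 있다. 아래는 두 이미지를 1차원 벡터로 폈다고 가정한 최소한의 스케치이며, `I1`, `I2` 라는 이름과 그 값들은 설명을 위해 임의로 정한 것이다.)

~~~python
import numpy as np

# 두 이미지를 1차원으로 폈다고 가정한 예시 벡터 (값은 임의로 정한 것)
I1 = np.array([56, 32, 10, 18], dtype=np.float64)
I2 = np.array([10, 25, 10, 4], dtype=np.float64)

d1 = np.sum(np.abs(I1 - I2))          # L1 거리: 절대값 차이의 총합 = 67.0
d2 = np.sqrt(np.sum((I1 - I2) ** 2))  # L2 거리: sqrt(2361) ~= 48.59
print 'L1: %f, L2: %f' % (d1, d2)
~~~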
From 1fded635c548781ff5ff7ad8b4ba986a147d3a64 Mon Sep 17 00:00:00 2001 From: myungsub Date: Tue, 24 May 2016 17:35:20 +0900 Subject: [PATCH 142/199] finish till validation --- classification.md | 58 ++++++++++++++++++++++++++--------------------- index.html | 2 +- 2 files changed, 33 insertions(+), 27 deletions(-) diff --git a/classification.md b/classification.md index d1ecbf19..9d902182 100644 --- a/classification.md +++ b/classification.md @@ -150,71 +150,72 @@ $$ d_2 (I_1, I_2) = \sqrt{\sum_{p} \left( I^p_1 - I^p_2 \right)^2} $$ -In other words we would be computing the pixelwise difference as before, but this time we square all of them, add them up and finally take the square root. In numpy, using the code from above we would need to only replace a single line of code. The line that computes the distances: +즉, 이전처럼 각 픽셀간의 차를 구하지만 각각에 제곱을 취하고, 전부 더한 다음에 최종적으로 제곱근을 취한다. NumPy를 사용한다면 위 코드를 사용하여 거리를 계산하는 아래의 코드 부분 딱 한 줄만 바꾸면 된다. ~~~python distances = np.sqrt(np.sum(np.square(self.Xtr - X[i,:]), axis = 1)) ~~~ -Note that I included the `np.sqrt` call above, but in a practical nearest neighbor application we could leave out the square root operation because square root is a *monotonic function*. That is, it scales the absolute sizes of the distances but it preserves the ordering, so the nearest neighbors with or without it are identical. If you ran the Nearest Neighbor classifier on CIFAR-10 with this distance, you would obtain **35.4%** accuracy (slightly lower than our L1 distance result). +위 코드에서는 `np.sqrt` 함수를 호출하는 것을 그대로 남겨두었지만, 제곱근 함수는 단조 함수이기 때문에 실제 nearest neighbor 응용에서 제곱근은 빼도 결과에 상관이 없다. 즉, 계산되는 거리들의 크기에는 차이가 생기겠지만 그 순서는 동일하기 때문에, 제곱근 함수를 포함할 때와 포함하지 않을 때의 nearest neighbor(최근접 이웃)는 동일하다. 이 거리 함수를 사용하여 Nearest Neighbor 분류기를 CIFAR-10 데이터셋에 돌린다면, **35.4%** 정확도를 얻을 수 있다 (L1 거리를 사용한 결과보다 조금 낮아졌다). -**L1 vs. L2.** It is interesting to consider differences between the two metrics. In particular, the L2 distance is much more unforgiving than the L1 distance when it comes to differences between two vectors. That is, the L2 distance prefers many medium disagreements to one big one. L1 and L2 distances (or equivalently the L1/L2 norms of the differences between a pair of images) are the most commonly used special cases of a [p-norm](http://planetmath.org/vectorpnorm). +**L1 vs. L2.** 두 거리 함수의 특징을 비교하는 것은 매우 흥미로운 주제이다. 일반적으로, L2 거리는 L1 거리에 비해 두 벡터간의 차가 커지는 것에 대해 훨씬 더 크게 반응한다. 즉, L2 거리는 하나의 큰 차이가 있는 것보다 여러 개의 적당한 차이가 생기는 것을 선호한다. L1/L2 거리(또는 두 이미지의 차에 대한 L1/L2 norm)는 일반적인 [p-norm](http://planetmath.org/vectorpnorm)의 형태 중 가장 많이 사용되는 두 가지이다. -## k - Nearest Neighbor Classifier +## k - Nearest Neighbor (kNN) 분류기 -You may have noticed that it is strange to only use the label of the nearest image when we wish to make a prediction. Indeed, it is almost always the case that one can do better by using what's called a **k-Nearest Neighbor Classifier**. The idea is very simple: instead of finding the single closest image in the training set, we will find the top **k** closest images, and have them vote on the label of the test image. In particular, when *k = 1*, we recover the Nearest Neighbor classifier. Intuitively, higher values of **k** have a smoothing effect that makes the classifier more resistant to outliers: +여태까지 예측을 할 때 가장 가까운 이미지의 라벨만을 사용하는 것을 이상하다고 생각할 수도 있을 것이다. 확실히, **k-Nearest Neighbor Classifier (kNN 분류기)** 라는 것을 사용한다면 거의 무조건 더 분류를 잘 할 수 있다. 아이디어는 매우 간단하다: 학습 데이터셋에서 가장 가까운 하나의 이미지만을 찾는 것이 아니라, 가장 가까운 **k** 개의 이미지를 찾아서 테스트 이미지의 라벨에 대해 투표하도록 하는 것이다. 
여기서 *k = 1* 인 경우, 원래의 Nearest Neighbor 분류기가 된다. 직관적으로 **k** 값이 커질수록 분류기는 이상점(outlier)에 더 강인하고, 분류 경계가 부드러워지는 효과가 있다.
-
An example of the difference between Nearest Neighbor and a 5-Nearest Neighbor classifier, using 2-dimensional points and 3 classes (red, blue, green). The colored regions show the decision boundaries induced by the classifier with an L2 distance. The white regions show points that are ambiguously classified (i.e. class votes are tied for at least two classes). Notice that in the case of a NN classifier, outlier datapoints (e.g. green point in the middle of a cloud of blue points) create small islands of likely incorrect predictions, while the 5-NN classifier smooths over these irregularities, likely leading to better generalization on the test data (not shown). Also note that the gray regions in the 5-NN image are caused by ties in the votes among the nearest neighbors (e.g. 2 neighbors are red, next two neighbors are blue, last neighbor is green).
+
Nearest Neighbor 분류기와 5-Nearest Neighbor 분류기의 차이 예시. 2차원 점과 3개의 클래스(라벨: red, blue, green)를 사용하였다. 색칠된 부분들은 L2 거리를 사용한 분류기를 통해 정해진 결정 경계(decision boundaries)이다. 흰색 부분들은 애매하게 분류(투표를 가장 많이 받은 라벨이 여러 개 있는 경우)된 점들을 나타낸다. NN 분류기의 경우 이상점들(e.g. 수많은 파란 점들 가운데에 있는 하나의 초록색 점)이 실제 결과와 맞지 않을 가능성이 큰 섬들을 형성하지만, 5-NN 분류기는 이런 조그마한 섬들이 생기지 않도록 부드럽게 이어주는 것을 확인하자. 이런 특성 덕분에 실제 테스트 데이터(그림에는 없음)에 적용할 때 더 나은 일반화(generalization) 성능을 보인다. 또한, 5-NN 분류기 결과에서 회색 부분들은 nearest neighbors 간의 투표에서 동점이 발생한 경우(e.g. 2개의 이웃이 red, 다음 2개가 blue, 마지막 이웃이 green)인 것을 확인하자.
-In practice, you will almost always want to use k-Nearest Neighbor. But what value of *k* should you use? We turn to this problem next. +실제 문제에 적용할 경우, 대부분은 NN 분류기보다는 k-Nearest Neighbor (kNN) 분류기를 사용하고 싶을 것이다. 그러나 어떤 *k* 값을 골라야 할까? 이 문제에 대해 지금부터 다룰 것이다. -### Validation sets for Hyperparameter tuning -The k-nearest neighbor classifier requires a setting for *k*. But what number works best? Additionally, we saw that there are many different distance functions we could have used: L1 norm, L2 norm, there are many other choices we didn't even consider (e.g. dot products). These choices are called **hyperparameters** and they come up very often in the design of many Machine Learning algorithms that learn from data. It's often not obvious what values/settings one should choose. +### Hyperparameter 튜닝을 위한 검증 셋 (Validation set) -You might be tempted to suggest that we should try out many different values and see what works best. That is a fine idea and that's indeed what we will do, but this must be done very carefully. In particular, **we cannot use the test set for the purpose of tweaking hyperparameters**. Whenever you're designing Machine Learning algorithms, you should think of the test set as a very precious resource that should ideally never be touched until one time at the very end. Otherwise, the very real danger is that you may tune your hyperparameters to work well on the test set, but if you were to deploy your model you could see a significantly reduced performance. In practice, we would say that you **overfit** to the test set. Another way of looking at it is that if you tune your hyperparameters on the test set, you are effectively using the test set as the training set, and therefore the performance you achieve on it will be too optimistic with respect to what you might actually observe when you deploy your model. But if you only use the test set once at end, it remains a good proxy for measuring the **generalization** of your classifier (we will see much more discussion surrounding generalization later in the class). -당신은 -> Evaluate on the test set only a single time, at the very end. +k-nearest neighbor 분류기는 *k* 를 정해줘야 한다. 그런데 어떤 값이 가장 좋을까? 또한, 앞서 우리는 여러 가지 거리 함수(L1 norm, L2 norm, 여기서 고려하지 않은 다른 종류들 - e.g.내적 - 도 매우 많다)에 대해서도 살펴보았다. 이러한 선택들을 **hyperparameters** 라 부르고, 데이터로부터 학습하는 많은 기계학습(머신러닝) 알고리즘 디자인에 등장한다. 그런데 어떤 값/세팅을 골라야 하는지에 대해서 확신이 있는 경우는 거의 없다. -Luckily, there is a correct way of tuning the hyperparameters and it does not touch the test set at all. The idea is to split our training set in two: a slightly smaller training set, and what we call a **validation set**. Using CIFAR-10 as an example, we could for example use 49,000 of the training images for training, and leave 1,000 aside for validation. This validation set is essentially used as a fake test set to tune the hyper-parameters. +여러 가지 다른 값들을 시도해보고, 어떤 것이 가장 좋은 성능을 보이는지 확인해보는 방법을 생각할 수 있다. 아래에서 우리도 실제로 이렇게 할 것이지만, 이 과정은 매우 조심스럽게 수행되어야 한다. 특히, **hyperparameter 값을 조정하기 위해 테스트 셋을 사용하면 절대 안 된다**. 우리가 머신러닝 알고리즘을 디자인할 때, 테스트 셋은 매우 귀한 리소스이고, 이론적으로는 실제로 알고리즘을 평가할 때인 맨 마지막 단 한 번을 제외하고는 절대 쳐다봐서는 안 된다. 그렇게 하지 않는다면 위험한 점은, 우리 모델의 hyperparameter 들이 테스트 셋에서는 잘 동작하도록 튜닝이 되어 있지만, 실전에서 모델을 사용(deploy)할 때 상당히 성능이 낮아지는 것을 확인할 수 있을 것이다. 머신러닝에서는 이것을 테스트 셋에 **overfit** 되었다고 말한다. 이를 다른 관점으로 바라본다면, 우리가 테스트 셋을 사용하여 hyperparameter 들을 튜닝했다는 것은 곧 우리가 테스트 셋을 마치 학습 데이터셋(트레이닝 셋)처럼 사용한 것이고, 우리 모델의 테스트 셋에서의 성능은 실제로 다른 데이터에 적용할 때에 비해 너무 낙관적이게 되어버린다. 
그러나 테스트 셋을 맨 마지막에 딱 한 번만 사용한다면, 그 때는 우리가 학습한 분류기의 **일반화(generalization)** 된 성능을 잘 평가할 수 있는 척도로 활용될 것이다. (이 수업의 나중 부분에서도 일반화에 관련된 주제를 다룰 것이다.) -Here is what this might look like in the case of CIFAR-10: +> 테스트 셋에 성능을 평가하는 것은 맨 마지막에 단 한 번만 하라. + +다행히도, hyperparameter 들을 튜닝하는 올바른 방법이 존재하고, 이 방법은 테스트 셋을 전혀 건드리지 않는다. 아이디어는, 우리가 갖고 있는 트레이닝 셋을 두 개로 쪼개는 것이다: 이른바 **검증 셋(validation set)** 으로 불리는, 약간 적은 수의 트레이닝 셋과 나머지로 나눈다. CIFAR-10 데이터셋을 예로 들면, 학습 이미지들 중에 49,000 장을 트레이닝 셋으로 삼고, 나머지 1,000 개를 검증(validation) 용으로 남겨놓는 것이다. 이 검증 셋은 hyperparameter 들을 튜닝할 때, 가짜 테스트 셋으로 활용된다. (역자 주: 즉, 실전 테스트인 수능을 준비하기 위한 모의고사라고 생각하면 된다.) + +CIFAR-10의 경우, 이런 식으로 나타낼 수 있을 것이다: ~~~python -# assume we have Xtr_rows, Ytr, Xte_rows, Yte as before -# recall Xtr_rows is 50,000 x 3072 matrix -Xval_rows = Xtr_rows[:1000, :] # take first 1000 for validation +# Xtr_rows, Ytr, Xte_rows, Yte 는 이전과 동일하게 갖고 있다고 가정하자. +# Xtr_rows 는 50,000 x 3072 행렬이었다. +Xval_rows = Xtr_rows[:1000, :] # 앞의 1000 개를 검증용으로 선택한다. Yval = Ytr[:1000] -Xtr_rows = Xtr_rows[1000:, :] # keep last 49,000 for train +Xtr_rows = Xtr_rows[1000:, :] # 뒤쪽의 49,000 개를 학습용으로 선택한다. Ytr = Ytr[1000:] -# find hyperparameters that work best on the validation set +# 검증 셋에서 가장 잘 동작하는 hyperparameter 들을 찾는다. validation_accuracies = [] for k in [1, 3, 5, 10, 20, 50, 100]: - # use a particular value of k and evaluation on validation data + # 특정 k 값을 정해서 검증 데이터에 대해 평가할 때 사용한다. nn = NearestNeighbor() nn.train(Xtr_rows, Ytr) - # here we assume a modified NearestNeighbor class that can take a k as input + # 여기서는 k를 input으로 받을 수 있도록 변형된 NearestNeighbor 클래스가 있다고 가정하자. Yval_predict = nn.predict(Xval_rows, k = k) acc = np.mean(Yval_predict == Yval) print 'accuracy: %f' % (acc,) - # keep track of what works on the validation set + # 검증 셋에 대한 정확도를 저장해 놓는다. validation_accuracies.append((k, acc)) ~~~ -By the end of this procedure, we could plot a graph that shows which values of *k* work best. We would then stick with this value and evaluate once on the actual test set. +이 과정이 끝나면, 어떤 *k* 값이 가장 잘 동작하는지를 그래프로 그려볼 수 있다. 그 뒤, 가장 잘 동작하는 k 값으로 정하고, 실제 테스트 셋에 대해 한 번 평가를 하면 된다. -> Split your training set into training set and a validation set. Use validation set to tune all hyperparameters. At the end run a single time on the test set and report performance. +> 학습 데이터셋을 트레이닝 셋과 검증 셋으로 나누고, 검증 셋을 활용하여 모든 hyperparameter 들을 튜닝하라. 마지막으로 테스트 셋에 대해서는 딱 한 번 돌려보고, 성능을 리포트한다. -**Cross-validation**. +**Cross-validation (교차 검증)**. In cases where the size of your training data (and therefore also the validation data) might be small, people sometimes use a more sophisticated technique for hyperparameter tuning called **cross-validation**. Working with our previous example, the idea is that instead of arbitrarily picking the first 1000 datapoints to be the validation set and rest training set, you can get a better and less noisy estimate of how well a certain value of *k* works by iterating over different validation sets and averaging the performance across these. For example, in 5-fold cross-validation, we would split the training data into 5 equal folds, use 4 of them for training, and 1 for validation. We would then iterate over which fold is the validation fold, evaluate the performance, and finally average the performance across the different folds.
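(역자 주: 위의 검증 코드는 'k를 input으로 받을 수 있도록 변형된 NearestNeighbor 클래스'를 가정하고 있다. 그 변형의 한 가지 가능한 스케치를 아래에 적어둔다. 동점(tie) 처리 등은 생략한, 하나의 예시 구현일 뿐이며, 라벨이 0 이상의 정수라고 가정한다.)

~~~python
import numpy as np

class NearestNeighbor(object): # (역자 주: 본문이 가정한 '변형된' 버전의 예시)
  def train(self, X, y):
    # 학습 데이터를 그대로 기억해둔다
    self.Xtr = X
    self.ytr = y

  def predict(self, X, k=1):
    num_test = X.shape[0]
    Ypred = np.zeros(num_test, dtype = self.ytr.dtype)
    for i in xrange(num_test):
      # i번째 테스트 이미지와 모든 학습 이미지 사이의 L1 거리
      distances = np.sum(np.abs(self.Xtr - X[i,:]), axis = 1)
      # 거리가 가장 가까운 k개 이웃의 라벨
      closest_y = self.ytr[np.argsort(distances)[:k]]
      # k개의 라벨 중 가장 많이 등장한 라벨로 예측 (다수결 투표)
      Ypred[i] = np.argmax(np.bincount(closest_y))
    return Ypred
~~~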
@@ -232,6 +233,7 @@ In cases where the size of your training data (and therefore also the validation
+ **Pros and Cons of Nearest Neighbor classifier.** It is worth considering some advantages and drawbacks of the Nearest Neighbor classifier. Clearly, one advantage is that it is very simple to implement and understand. Additionally, the classifier takes no time to train, since all that is required is to store and possibly index the training data. However, we pay that computational cost at test time, since classifying a test example requires a comparison to every single training example. This is backwards, since in practice we often care about the test time efficiency much more than the efficiency at training time. In fact, the deep neural networks we will develop later in this class shift this tradeoff to the other extreme: They are very expensive to train, but once the training is finished it is very cheap to classify a new test example. This mode of operation is much more desirable in practice. @@ -255,6 +257,7 @@ Here is one more visualization to convince you that using pixel differences to c In particular, note that images that are nearby each other are much more a function of the general color distribution of the images, or the type of background rather than their semantic identity. For example, a dog can be seen very near a frog since both happen to be on white background. Ideally we would like images of all of the 10 classes to form their own clusters, so that images of the same class are nearby to each other regardless of irrelevant characteristics and variations (such as the background). However, to get this property we will have to go beyond raw pixels. + ### Summary In summary: @@ -270,6 +273,7 @@ In summary: In next lectures we will embark on addressing these challenges and eventually arrive at solutions that give 90% accuracies, allow us to completely discard the training set once learning is complete, and they will allow us to evaluate a test image in less than a millisecond. + ### Summary: Applying kNN in practice If you wish to apply kNN in practice (hopefully not on images, or perhaps as only a baseline) proceed as follows: @@ -282,6 +286,7 @@ If you wish to apply kNN in practice (hopefully not on images, or perhaps as onl 6. Take note of the hyperparameters that gave the best results. There is a question of whether you should use the full training set with the best hyperparameters, since the optimal hyperparameters might change if you were to fold the validation data into your training set (since the size of the data would be larger). In practice it is cleaner to not use the validation data in the final classifier and consider it to be *burned* on estimating the hyperparameters. Evaluate the best model on the test set. Report the test set accuracy and declare the result to be the performance of the kNN classifier on your data. + #### Further Reading Here are some (optional) links you may find interesting for further reading: @@ -292,5 +297,6 @@ Here are some (optional) links you may find interesting for further reading: ---

-번역: 이옥민 (OkminLee) +번역: 이옥민 (OkminLee), + 최명섭(myungsub)

diff --git a/index.html b/index.html index c7f9553e..87724b9f 100644 --- a/index.html +++ b/index.html @@ -119,7 +119,7 @@ 이미지 분류: 데이터 기반 방법론, k-Nearest Neighbor, train/val/test 구분 - +
L1/L2 거리, hyperparameter 탐색, 교차검증(cross-validation) From 6e99188f1580ade500a6edb4f68befa2f4b4c743 Mon Sep 17 00:00:00 2001 From: myungsub Date: Wed, 25 May 2016 18:38:24 +0900 Subject: [PATCH 143/199] Image Classification Finished --- classification.md | 68 +++++++++++++++++++++++------------------------ index.html | 2 +- 2 files changed, 34 insertions(+), 36 deletions(-) diff --git a/classification.md b/classification.md index 9d902182..8a405a47 100644 --- a/classification.md +++ b/classification.md @@ -216,84 +216,82 @@ for k in [1, 3, 5, 10, 20, 50, 100]: > 학습 데이터셋을 트레이닝 셋과 검증 셋으로 나누고, 검증 셋을 활용하여 모든 hyperparameter 들을 튜닝하라. 마지막으로 테스트 셋에 대해서는 딱 한 번 돌려보고, 성능을 리포트한다. **Cross-validation (교차 검증)**. -In cases where the size of your training data (and therefore also the validation data) might be small, people sometimes use a more sophisticated technique for hyperparameter tuning called **cross-validation**. Working with our previous example, the idea is that instead of arbitrarily picking the first 1000 datapoints to be the validation set and rest training set, you can get a better and less noisy estimate of how well a certain value of *k* works by iterating over different validation sets and averaging the performance across these. For example, in 5-fold cross-validation, we would split the training data into 5 equal folds, use 4 of them for training, and 1 for validation. We would then iterate over which fold is the validation fold, evaluate the performance, and finally average the performance across the different folds. +학습 데이터셋의 크기가 작을 경우(검증 셋의 크기도 작을 것이다), 조금 더 정교한 방식으로 **교차 검증(cross-validation)** 이라는 hyperparameter 튜닝 방법을 사용한다. 앞의 예시에서처럼 첫 1000 개의 데이터를 검증 셋으로 사용하고 나머지를 학습(training) 셋으로 사용하는 대신, 어떤 *k* 값이 더 좋은지를 여러 가지 검증 셋에 대해 시험해보고 평균 성능을 확인해본다면 보다 잡음이 덜 섞이고 나은 예측을 할 수 있을 것이다. 예를 들어, 5-fold 교차 검증에서는 학습 데이터를 5개의 동일한 크기의 그룹(fold)으로 쪼갠 뒤, 4개를 학습용으로, 1개를 검증용으로 사용한다. 그 다음에는 어떤 그룹을 검증 셋으로 사용할지에 따라 iteration(반복)을 돌고, 성능을 평가하고, 각 그룹에 대해 평가한 성능을 평균낸다.
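(역자 주: 위에서 설명한 5-fold 교차 검증을 코드로 표현하면 대략 다음과 같은 모양이 된다. 앞에서 가정한, k를 입력으로 받는 변형된 NearestNeighbor 분류기가 있다고 가정한 스케치이다.)

~~~python
import numpy as np

num_folds = 5
# 학습 데이터를 5개의 같은 크기 그룹(fold)으로 쪼갠다
X_folds = np.array_split(Xtr_rows, num_folds)
y_folds = np.array_split(Ytr, num_folds)

for k in [1, 3, 5, 10, 20, 50, 100]:
  accuracies = []
  for i in range(num_folds):
    # i번째 그룹을 검증용으로, 나머지 4개 그룹을 학습용으로 사용
    X_val, y_val = X_folds[i], y_folds[i]
    X_tr = np.concatenate(X_folds[:i] + X_folds[i+1:])
    y_tr = np.concatenate(y_folds[:i] + y_folds[i+1:])
    nn = NearestNeighbor()
    nn.train(X_tr, y_tr)
    accuracies.append(np.mean(nn.predict(X_val, k=k) == y_val))
  # 각 k마다 5개 정확도의 평균을 본다
  print 'k = %d, 평균 정확도: %f' % (k, np.mean(accuracies))
~~~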
-
Example of a 5-fold cross-validation run for the parameter k. For each value of k we train on 4 folds and evaluate on the 5th. Hence, for each k we receive 5 accuracies on the validation fold (accuracy is the y-axis, each result is a point). The trend line is drawn through the average of the results for each k and the error bars indicate the standard deviation. Note that in this particular case, the cross-validation suggests that a value of about k = 7 works best on this particular dataset (corresponding to the peak in the plot). If we used more than 5 folds, we might expect to see a smoother (i.e. less noisy) curve.
+
파라미터 k 에 대한 5-fold 교차 검증 예시. 각 k 값마다 4개의 그룹에 대해 학습을 하고 다섯 번째 그룹을 사용하여 성능을 평가한다. 따라서, 각 k 마다 검증 셋으로 활용한 그룹들에서 5 개의 정확도가 나온다. (y축이 정확도를 나타내고, 각 결과는 점으로 표시하였다.) 그래프에서 선은 각 k 에서의 결과의 평균으로 그려져 있고, 에러 바는 표준 편차를 나타낸다. 이 경우, 이 데이터셋에 대해서는 k = 7 로 놓는 것이 가장 좋을 것(그래프에서 가장 높은 부분)이라고 교차 검증 결과가 말해준다. 만약 5개보다 더 많은 그룹 수를 사용했다면, 지금보다는 더 부드러운 곡선 형태(즉, 잡음이 덜 섞여있음)의 그래프를 볼 수 있을 것이다.
-**In practice**. In practice, people prefer to avoid cross-validation in favor of having a single validation split, since cross-validation can be computationally expensive. The splits people tend to use is between 50%-90% of the training data for training and rest for validation. However, this depends on multiple factors: For example if the number of hyperparameters is large you may prefer to use bigger validation splits. If the number of examples in the validation set is small (perhaps only a few hundred or so), it is safer to use cross-validation. Typical number of folds you can see in practice would be 3-fold, 5-fold or 10-fold cross-validation. +**실제 활용**. 교차 검증은 계산량이 매우 많아지기 때문에, 실제로 사람들은 교차 검증보다 하나의 검증 셋을 정해놓는 것을 선호한다. 보통은 학습 데이터의 50% ~ 90% 정도를 학습 용으로 쓰고 나머지를 검증 데이터로 활용하는데, 검증 데이터셋의 크기는 여러 가지 변수들에 의해 영향을 받는다. 예를 들어, hyperparameter 개수가 매우 많다면, 검증 데이터셋의 크기를 늘리는게 좋을 것이다. 검증 셋에 있는 데이터의 개수가 매우 적다면 (수백 개 정도), 교차 검증 방법을 사용하는 것이 더 안전하다. 보통은 3-fold, 5-fold, 10-fold 교차 검증을 주로 많이 사용한다.
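(역자 주: 하나의 검증 셋을 쓸 때, 앞의 예시처럼 앞쪽 1,000개를 그대로 떼어내는 대신 데이터를 무작위로 섞어서 나누는 경우도 많다. 아래는 90%/10% 분할을 가정한 간단한 스케치이며, 비율은 예시일 뿐이다.)

~~~python
import numpy as np

num_training = Xtr_rows.shape[0]
num_val = int(0.1 * num_training)           # 10% 를 검증용으로 (비율은 예시)
perm = np.random.permutation(num_training)  # 인덱스를 무작위로 섞는다
Xval_rows, Yval = Xtr_rows[perm[:num_val]], Ytr[perm[:num_val]]
Xtr_rows, Ytr = Xtr_rows[perm[num_val:]], Ytr[perm[num_val:]]
~~~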
-
Common data splits. A training and test set is given. The training set is split into folds (for example 5 folds here). The folds 1-4 become the training set. One fold (e.g. fold 5 here in yellow) is denoted as the Validation fold and is used to tune the hyperparameters. Cross-validation goes a step further iterates over the choice of which fold is the validation fold, separately from 1-5. This would be referred to as 5-fold cross-validation. In the very end once the model is trained and all the best hyperparameters were determined, the model is evaluated a single time on the test data (red).
+
데이터를 그룹으로 나누는 일반적인 방법. 학습 데이터셋과 테스트 셋은 주어져 있다. 학습 셋은 이 예시의 경우, 5개의 그룹으로 나누어져 있다. 이 중 1-4 그룹이 학습 셋이 되고, 나머지 하나(노란색 그룹 5)는 검증 셋이 되어 hyperparameter 들을 튜닝하는 데 사용된다. 교차 검증 방법은 여기서 한 단계 더 나아가서 어떤 그룹을 검증 셋으로 사용할지를 1-5까지 바꿔가며 전부 반복하고, 이를 5-fold 교차 검증이라 부른다. 모델의 학습이 끝나고 가장 좋은 hyperparameter 들이 정해진 이후에는, 마지막으로 모델을 테스트 데이터(빨간색)에 대해 딱 한 번 시험해보고 성능을 평가한다.
-**Pros and Cons of Nearest Neighbor classifier.** +**Nearest Neighbor 분류기의 장단점.** -It is worth considering some advantages and drawbacks of the Nearest Neighbor classifier. Clearly, one advantage is that it is very simple to implement and understand. Additionally, the classifier takes no time to train, since all that is required is to store and possibly index the training data. However, we pay that computational cost at test time, since classifying a test example requires a comparison to every single training example. This is backwards, since in practice we often care about the test time efficiency much more than the efficiency at training time. In fact, the deep neural networks we will develop later in this class shift this tradeoff to the other extreme: They are very expensive to train, but once the training is finished it is very cheap to classify a new test example. This mode of operation is much more desirable in practice. +Nearest Neighbor 분류기의 장점과 단점이 무엇인지 분석해보자. 당연히, 한 가지 장점은 방법을 이해하고 구현하는 것이 매우 쉽다는 점이다. 또한, 분류기를 학습할 때 단순히 학습 데이터셋을 저장하고 기억만 해놓으면 되기 때문에 학습 시간이 전혀 소요되지 않는다. 그러나, 학습 시의 계산량이 없는 것은 테스트할 때 모든 학습 데이터 예시들과 비교를 해야되기 때문에 계산량이 매우 많아지는 것으로 보상된다. 이것은 거꾸로인게, 보통 우리는 테스트할 때 얼마나 효율적인지에 관심이 많이 있고, 학습에 소요되는 시간이 얼마인지는 크게 중요하게 생각하지 않기 때문이다. 사실, 이 수업에서 나중에 다룰 (깊은) 뉴럴 네트워크, 또는 신경망 구조는 이 교환(tradeoff)을 반대 극단으로 이끈다. 뉴럴 네트워크는 학습할 때 매우 많은 계산량을 필요로 하지만, 학습이 끝나면 새로운 테스트 샘플을 분류하는데 매우 적은 계산만으로도 수행할 수 있다. 실제 환경에서는 이러한 형태가 더 바람직하다. -As an aside, the computational complexity of the Nearest Neighbor classifier is an active area of research, and several **Approximate Nearest Neighbor** (ANN) algorithms and libraries exist that can accelerate the nearest neighbor lookup in a dataset (e.g. [FLANN](http://www.cs.ubc.ca/research/flann/)). These algorithms allow one to trade off the correctness of the nearest neighbor retrieval with its space/time complexity during retrieval, and usually rely on a pre-processing/indexing stage that involves building a kdtree, or running the k-means algorithm. +딴 얘기지만, Nearest Neighbor 분류기의 계산량(computational complexity) 문제는 매우 활발한 연구 주제이고, 많은 **Approximate Nearest Neighbor** (ANN, 근사 최근접 이웃) 알고리즘 및 라이브러리들이 있어서 데이터셋 내에서 nearest neighbor를 찾는 것을 가속화해준다 (e.g. [FLANN](http://www.cs.ubc.ca/research/flann/)). 이 알고리즘들은 nearest neighbor를 찾는 것의 정확도를 조금 희생하여 공간(메모리)/시간(계산량) 복잡도를 크게 낮추도록 하고, 보통 kdtree나 k-means 알고리즘 등과 같은 전처리 기법에 의존하는 경우가 많다. -The Nearest Neighbor Classifier may sometimes be a good choice in some settings (especially if the data is low-dimensional), but it is rarely appropriate for use in practical image classification settings. One problem is that images are high-dimensional objects (i.e. they often contain many pixels), and distances over high-dimensional spaces can be very counter-intuitive. The image below illustrates the point that the pixel-based L2 similarities we developed above are very different from perceptual similarities: +Nearest Neighbor 분류기가 좋은 경우도 있지만 (특히 데이터의 차원이 낮을 때), 실제 이미지 분류 문제 세팅에서는 대부분 효과적이지 않다. 한 가지 문제는, 이미지가 매우 고차원 물체라는 것이고 (수많은 픽셀들로 이루어져 있다), 고차원 공간에서의 '거리'는 매우 직관적이지 않는 경우가 많다. 아래 그림을 보면, 사람이 보기에 비슷한 이미지로 느끼는 것과 위에서 살펴본 픽셀 값들의 L2 거리를 기준으로 비슷한 것은 매우 다르다는 것을 알 수 있다.
-
Pixel-based distances on high-dimensional data (and images especially) can be very unintuitive. An original image (left) and three other images next to it that are all equally far away from it based on L2 pixel distance. Clearly, the pixel-wise distance does not correspond at all to perceptual or semantic similarity.
+
고차원 데이터(이미지)에서의 픽셀값 기준 거리는 매우 비직관적인 경우가 많다. 원본 이미지(왼쪽)와 그 옆의 세 이미지는 픽셀값의 L2 거리를 기준으로 모두 같은 거리만큼 떨어져 있다. 이로 보아 픽셀값을 기준으로 한 거리는 인지적, 의미적으로 거의 연관이 없다고 생각할 수 있다.
-Here is one more visualization to convince you that using pixel differences to compare images is inadequate. We can use a visualization technique called t-SNE to take the CIFAR-10 images and embed them in two dimensions so that their (local) pairwise distances are best preserved. In this visualization, images that are shown nearby are considered to be very near according to the L2 pixelwise distance we developed above: +아래는 픽셀값의 차이만으로는 불충분하다는 점을 다시 한 번 보여주기 위한 시각화이다. 여기서는 t-SNE 라는 시각화 기법을 사용하여 CIFAR-10 이미지들을 서로간의 거리가 잘 보존되도록 2차원으로 투사시킨 것이다. 이 시각화에서, 가까이 있는 이미지들은 픽셀간의 L2 거리가 매우 가까울 것이라고 생각하면 된다.
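(역자 주: 아래 그림과 같은 임베딩을 직접 만들어 보고 싶다면, 예를 들어 scikit-learn의 t-SNE 구현을 활용할 수 있다. 다음은 scikit-learn이 설치되어 있다고 가정한 스케치이고, 샘플 개수와 파라미터는 모두 예시이다. t-SNE는 계산량이 많아서 보통 일부 데이터만 사용한다.)

~~~python
import numpy as np
from sklearn.manifold import TSNE

subset = Xtr_rows[:1000].astype(np.float64)             # 일부 이미지만 사용 (개수는 예시)
embedding = TSNE(n_components=2).fit_transform(subset)  # 1000 x 2 크기의 2차원 좌표
~~~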
-
CIFAR-10 images embedded in two dimensions with t-SNE. Images that are nearby on this image are considered to be close based on the L2 pixel distance. Notice the strong effect of background rather than semantic class differences. Click here for a bigger version of this visualization.
+
t-SNE로 2차원으로 투사시킨 CIFAR-10 이미지들. 여기서 서로 가까이 있는 이미지들은 픽셀간의 L2 거리가 가까울 것이라고 생각하면 된다. 실제 클래스의 의미적인 차이보다 배경이 끼치는 영향이 얼마나 큰지 확인할 수 있다. 시각화의 큰 버전은 여기 에서 확인할 수 있다.
-In particular, note that images that are nearby each other are much more a function of the general color distribution of the images, or the type of background rather than their semantic identity. For example, a dog can be seen very near a frog since both happen to be on white background. Ideally we would like images of all of the 10 classes to form their own clusters, so that images of the same class are nearby to each other regardless of irrelevant characteristics and variations (such as the background). However, to get this property we will have to go beyond raw pixels.
+여기서 특히, 서로 가까운 이미지들은 보편적인 색의 분포나 배경의 종류에 영향을 많이 받고 각자의 실제 의미가 담긴 클래스에는 큰 영향을 받지 않는 것을 확인할 수 있다. 예를 들어, 강아지와 개구리가 똑같이 흰 배경에 있어서 (실제 클래스가 다름에도 불구하고) 매우 가까이 위치한 것을 볼 수 있다. 이상적으로는 같은 클래스의 이미지들이 여러 변칙적인 성질과 변화(또는 배경)에 상관없이 가까이 있어서 10개의 클래스들이 각각 군집을 이뤄서 뭉쳐있었으면 좋겠지만, 이러한 성질을 위해서는 단순 픽셀값 이상의 것이 필요하다.


-### Summary
+### 요약

-In summary:

+- 여기서는 **이미지 분류(Image Classification)** 문제에 대해 살펴보았다. 각 이미지별로 한 개의 카테고리로 라벨링 되어있는 이미지들이 주어지고, 새로운 테스트 이미지들이 들어왔을 때 이 카테고리들 중 하나로 분류하도록 하고 예측값들의 정확도를 측정하였다.
+- 간단한 **Nearest Neighbor 분류기** 를 소개하였다. 이 분류기와 관련하여 여러 가지 hyperparameter (k의 값이라든지, 데이터를 비교할 때 사용하는 거리의 종류라든지) 들이 존재하는 것을 보았고, 어떤 것을 선택할지 확실한 답은 없다는 것을 보았다.
+- 이 hyperparameter 들을 올바르게 정하는 방법은 학습 데이터셋을 두 개로 (학습 셋과 **검증 셋(validation set)** 으로 불리는 가짜 테스트 셋) 나누는 것임을 배웠다. 검증 셋에서 여러 가지 hyperparameter 값들을 시험해 보았고, 가장 좋은 성능을 얻는 값을 찾을 수 있었다.
+- 학습 데이터가 적은 경우, 어떤 hyperparameter를 선택해야 하는지에 대해 보다 안정적인 방식인 **교차 검증(cross-validation)** 이라는 방법을 알게 되었다.
+- 가장 좋은 hyperparameter 값들을 찾은 뒤, 그것으로 값을 고정하고 실제 테스트 셋에 대해 마지막에 단 한 번 **평가** 를 한다.
+- Nearest Neighbor 분류기는 CIFAR-10 데이터셋에서 약 40% 정도의 정확도를 보이는 것을 확인하였다. 이 방법은 구현이 매우 간단하지만, 학습 데이터셋 전체를 메모리에 저장해야 하고, 새로운 테스트 이미지를 분류하고 평가할 때 계산량이 매우 많다.
+- 마지막으로, 단순히 픽셀 값들의 L1이나 L2 거리는 이미지의 클래스보다 배경이나 이미지의 전체적인 색깔 분포 등에 더 큰 영향을 받기 때문에 이미지 분류 문제에 있어서 충분하지 못하다는 점을 보았다.

-- We introduced the problem of **Image Classification**, in which we are given a set of images that are all labeled with a single category. We are then asked to predict these categories for a novel set of test images and measure the accuracy of the predictions.
-- We introduced a simple classifier called the **Nearest Neighbor classifier**. We saw that there are multiple hyper-parameters (such as value of k, or the type of distance used to compare examples) that are associated with this classifier and that there was no obvious way of choosing them.
-- We saw that the correct way to set these hyperparameters is to split your training data into two: a training set and a fake test set, which we call **validation set**. We try different hyperparameter values and keep the values that lead to the best performance on the validation set.
-- If the lack of training data is a concern, we discussed a procedure called **cross-validation**, which can help reduce noise in estimating which hyperparameters work best.
-- Once the best hyperparameters are found, we fix them and perform a single **evaluation** on the actual test set.
-- We saw that Nearest Neighbor can get us about 40% accuracy on CIFAR-10. It is simple to implement but requires us to store the entire training set and it is expensive to evaluate on a test image.
-- Finally, we saw that the use of L1 or L2 distances on raw pixel values is not adequate since the distances correlate more strongly with backgrounds and color distributions of images than with their semantic content.
- -In next lectures we will embark on addressing these challenges and eventually arrive at solutions that give 90% accuracies, allow us to completely discard the training set once learning is complete, and they will allow us to evaluate a test image in less than a millisecond. +다음 강의에서는 여기서의 문제들을 해결하기 위한 방법들에 대해 살펴보고, 최종적으로 90% 정도의 성능을 갖고, 학습이 완료된 이후에는 학습 데이터셋을 전부 없애버려도 상관없으며, 테스트 이미지를 1/1000 초 단위로 빠르게 분류하고 평가할 수 있도록 해주는 모델을 살펴볼 것이다. -### Summary: Applying kNN in practice +### 요약2: kNN을 실제로 적용하기 -If you wish to apply kNN in practice (hopefully not on images, or perhaps as only a baseline) proceed as follows: +실제 응용에서 kNN을 사용하고 싶다면 (이미지에는 적용하지 않는 것을 추천하지만, 베이스라인으로 시도해볼 수는 있을 것이다), 다음 과정을 따르면 된다: -1. Preprocess your data: Normalize the features in your data (e.g. one pixel in images) to have zero mean and unit variance. We will cover this in more detail in later sections, and chose not to cover data normalization in this section because pixels in images are usually homogeneous and do not exhibit widely different distributions, alleviating the need for data normalization. -2. If your data is very high-dimensional, consider using a dimensionality reduction technique such as PCA ([wiki ref](http://en.wikipedia.org/wiki/Principal_component_analysis), [CS229ref](http://cs229.stanford.edu/notes/cs229-notes10.pdf), [blog ref](http://www.bigdataexaminer.com/understanding-dimensionality-reduction-principal-component-analysis-and-singular-value-decomposition/)) or even [Random Projections](http://scikit-learn.org/stable/modules/random_projection.html). -3. Split your training data randomly into train/val splits. As a rule of thumb, between 70-90% of your data usually goes to the train split. This setting depends on how many hyperparameters you have and how much of an influence you expect them to have. If there are many hyperparameters to estimate, you should err on the side of having larger validation set to estimate them effectively. If you are concerned about the size of your validation data, it is best to split the training data into folds and perform cross-validation. If you can afford the computational budget it is always safer to go with cross-validation (the more folds the better, but more expensive). -4. Train and evaluate the kNN classifier on the validation data (for all folds, if doing cross-validation) for many choices of **k** (e.g. the more the better) and across different distance types (L1 and L2 are good candidates) -5. If your kNN classifier is running too long, consider using an Approximate Nearest Neighbor library (e.g. [FLANN](http://www.cs.ubc.ca/research/flann/)) to accelerate the retrieval (at cost of some accuracy). -6. Take note of the hyperparameters that gave the best results. There is a question of whether you should use the full training set with the best hyperparameters, since the optimal hyperparameters might change if you were to fold the validation data into your training set (since the size of the data would be larger). In practice it is cleaner to not use the validation data in the final classifier and consider it to be *burned* on estimating the hyperparameters. Evaluate the best model on the test set. Report the test set accuracy and declare the result to be the performance of the kNN classifier on your data. +1. 데이터 전처리 과정을 수행하라: 데이터의 각 특징(feature)들을 평균이 0, 표준편차가 1이 되도록 정규화하라. 정규화 관련된 내용은 강의의 나중 부분에서도 다루겠지만, 여기서 따로 다루지 않았던 이유는 이미지의 픽셀들은 보통 균등한 분포를 갖기 때문에 데이터 정규화가 크게 필요하지 않기 때문이다. +2. 
사용할 데이터가 매우 고차원 데이터라면, PCA ([wiki ref](http://en.wikipedia.org/wiki/Principal_component_analysis), [CS229ref](http://cs229.stanford.edu/notes/cs229-notes10.pdf), [blog ref](http://www.bigdataexaminer.com/understanding-dimensionality-reduction-principal-component-analysis-and-singular-value-decomposition/))나 아예 [Random Projection](http://scikit-learn.org/stable/modules/random_projection.html)과 같은 차원 축소 기법들을 적용하는 것을 고려해 보자.
3. 학습 데이터를 랜덤으로 학습/검증 셋(train/val split)으로 나누어라. 일반적으로, 70~90% 정도의 데이터를 학습용으로 사용한다. 이 세팅은 튜닝할 hyperparameter 들이 얼마나 많이 있는지에 따라, 각각이 얼마만큼의 영향을 끼칠지에 따라 달라진다. 정해야 할 hyperparameter의 개수가 많다면, 그것들을 효과적으로 정하기 위해 충분히 큰 검증 셋을 사용해야 한다. 검증 셋의 크기가 적당한지에 대해서 의문이 있다면, 학습 데이터를 그룹으로 나누어서 교차 검증을 하는 방법이 제일 좋다. 계산할 시간만 충분하다면, 교차 검증을 하는 것이 항상 더 안전하다 (그룹이 많을수록 더 좋지만, 그만큼 계산량도 늘어난다).
4. 여러 가지 **k** 값에 대해 (많이 해볼수록 좋다), 다른 종류의 거리 함수에 대해 (L1과 L2 거리를 주로 사용한다) kNN 분류기를 학습하고, 검증 셋으로 (또는 교차 검증을 사용한다면 모든 그룹에 대해) 평가해 보자.
5. 현재의 kNN 분류기가 너무 느리다면, 이를 가속하기 위해 Approximate Nearest Neighbor 라이브러리 (e.g. [FLANN](http://www.cs.ubc.ca/research/flann/))를 사용하는 것을 고려해보라. (성능은 조금 떨어질 것이다)
6. 가장 좋은 결과를 주는 hyperparameter 들을 기록해두라. 가장 좋은 hyperparameter 세팅으로 다시 전체 학습 데이터셋을 학습해야 하는지에 대한 점은 아직 확실하지 않다. 학습 셋에서 쪼갠 검증 셋을 다시 합친다면 최적의 hyperparameter 세팅이 바뀔 수도 있기 때문이다 (학습에 사용한 데이터셋의 크기가 커지기 때문에). 실제로는 최종 분류기에서 검증 셋은 사용하지 않는 편이 더 깔끔하고, 검증에 사용한 데이터들은 hyperparameter 들을 고르는 데 사용되어 *날아가버렸다* 고 생각해도 된다. 그 뒤, 최종 모델을 테스트 셋에 대해 성능을 평가해 보고, 그 테스트 셋에 대한 정확도를 현재 데이터로 학습한 kNN 분류기의 성능으로 발표하라.


#### 추가 읽기 자료

관심있을 법한 추가적인 읽기 자료 몇 가지를 선정해 두었다 (optional):

- [A Few Useful Things to Know about Machine Learning](http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf) 에서 section 6이 가장 연관이 있지만, 전체적인 내용을 다 읽는 것도 추천한다.

- [Recognizing and Learning Object Categories](http://people.csail.mit.edu/torralba/shortCourseRLOC/index.html). 물체 분류에 관한 ICCV 2005 (컴퓨터비전 분야에서 유명한) 학회의 short course.

---
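(역자 주: 위의 과정을 밑바닥부터 구현하는 대신 라이브러리로 빠르게 시도해보고 싶다면, 예를 들어 scikit-learn의 구현들을 조합해볼 수 있다. 아래는 scikit-learn이 설치되어 있다고 가정한 스케치이고, 차원 수나 k, 거리 종류 등의 값은 모두 예시이다. 본문이 직접 구현한 방법 그 자체는 아니라는 점에 유의하자.)

~~~python
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

pca = PCA(n_components=100)                     # 2번 단계: 차원 축소 (100은 예시)
Xtr_pca = pca.fit_transform(Xtr_rows)

knn = KNeighborsClassifier(n_neighbors=7, p=1)  # p=1 이면 L1 (Manhattan) 거리
knn.fit(Xtr_pca, Ytr)                           # kNN 학습은 사실상 데이터 저장뿐이다
Yte_predict = knn.predict(pca.transform(Xte_rows))
~~~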

diff --git a/index.html b/index.html index 87724b9f..1072b9ab 100644 --- a/index.html +++ b/index.html @@ -119,7 +119,7 @@ 이미지 분류: 데이터 기반 방법론, k-Nearest Neighbor, train/val/test 구분 - + Complete!

L1/L2 거리, hyperparameter 탐색, 교차검증(cross-validation) From e2f6b5b1d891727cf5d9c6aea3c93e4664a75c70 Mon Sep 17 00:00:00 2001 From: Sanghun Kang Date: Wed, 25 May 2016 21:28:40 +0900 Subject: [PATCH 144/199] =?UTF-8?q?=EC=98=A4=ED=83=80=20=EC=88=98=EC=A0=95?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 오타 수정 --- ipython-tutorial.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ipython-tutorial.md b/ipython-tutorial.md index b674e5ab..07d6716e 100644 --- a/ipython-tutorial.md +++ b/ipython-tutorial.md @@ -3,7 +3,7 @@ layout: page title: IPython Tutorial permalink: /ipython-tutorial/ --- -cs231s 수업에서는 프로그래밍 과제 진행을 위해 [IPython notebooks](http://ipython.org/)을 사용합니다. IPython notebook을 사용하면 여러분의 브라우저에서 Python 코드를 작성하고 실행할 수 있습니다. Python notebook를 사용하면 여러 조각의 코드를 아주 쉽게 수정하고 실행할 수 있습니다. 이런 장점 때문에 IPython notebook은 계산과학분야에서 널리 사용되고 있습니다. +cs231n 수업에서는 프로그래밍 과제 진행을 위해 [IPython notebooks](http://ipython.org/)을 사용합니다. IPython notebook을 사용하면 여러분의 브라우저에서 Python 코드를 작성하고 실행할 수 있습니다. Python notebook를 사용하면 여러 조각의 코드를 아주 쉽게 수정하고 실행할 수 있습니다. 이런 장점 때문에 IPython notebook은 계산과학분야에서 널리 사용되고 있습니다. IPython의 설치와 실행은 간단합니다. command line에서 다음 명령어를 입력하여 IPython을 설치합니다. From 2c30a55b496da338da1d1aec4bc5528419fc58d9 Mon Sep 17 00:00:00 2001 From: Sanghun Kang Date: Wed, 25 May 2016 23:12:14 +0900 Subject: [PATCH 145/199] translated till table of contents --- python-numpy-tutorial.md | 69 +++++++++++++++++++--------------------- 1 file changed, 32 insertions(+), 37 deletions(-) diff --git a/python-numpy-tutorial.md b/python-numpy-tutorial.md index 5191adab..257e6d50 100644 --- a/python-numpy-tutorial.md +++ b/python-numpy-tutorial.md @@ -21,56 +21,51 @@ Python: Numpy --> -This tutorial was contributed by [Justin Johnson](http://cs.stanford.edu/people/jcjohns/). +이 튜토리얼은 [Justin Johnson](http://cs.stanford.edu/people/jcjohns/)에 의해 작성되었습니다. -We will use the Python programming language for all assignments in this course. -Python is a great general-purpose programming language on its own, but with the -help of a few popular libraries (numpy, scipy, matplotlib) it becomes a powerful -environment for scientific computing. +cs231n 수업의 모든 과제에서는 프로그래밍 언어로 파이썬을 사용할 것입니다. +파이썬은 그 자체만으로도 훌륭한 범용 프로그래밍 언어이지만, 몇몇 라이브러리(numpy, scipy, matplotlib)의 도움으로 +계산과학 분야에서 강력한 개발 환경을 갖추게 됩니다. -We expect that many of you will have some experience with Python and numpy; -for the rest of you, this section will serve as a quick crash course both on -the Python programming language and on the use of Python for scientific -computing. +많은 분들이 파이썬과 numpy를 경험 해보셨을거라고 생각합니다. 경험 하지 못했을지라도 이 문서를 통해 +'프로그래밍 언어로서의 파이썬'과 '파이썬을 계산과학에 활용하는법'을 빠르게 훑을 수 있습니다. -Some of you may have previous knowledge in Matlab, in which case we also recommend the [numpy for Matlab users](http://wiki.scipy.org/NumPy_for_Matlab_Users) page. +만약 Matlab을 사용해보셨다면, [Matlab사용자를 위한 numpy](http://wiki.scipy.org/NumPy_for_Matlab_Users) 페이지를 추천해 드립니다. -You can also find an [IPython notebook version of this tutorial here](https://github.com/kuleshov/cs228-material/blob/master/tutorials/python/cs228-python-tutorial.ipynb) created by [Volodymyr Kuleshov](http://web.stanford.edu/~kuleshov/) and [Isaac Caswell](https://symsys.stanford.edu/viewing/symsysaffiliate/21335) for [CS 228](http://cs.stanford.edu/~ermon/cs228/index.html). 
+또한 [CS 228](http://cs.stanford.edu/~ermon/cs228/index.html)수업을 위해 [Volodymyr Kuleshov](http://web.stanford.edu/~kuleshov/) 와 [Isaac Caswell](https://symsys.stanford.edu/viewing/symsysaffiliate/21335)가 만든 [이 튜토리얼의 IPython notebook 버전](https://github.com/kuleshov/cs228-material/blob/master/tutorials/python/cs228-python-tutorial.ipynb)도 참조 할 수 있습니다. -Table of contents: +목차: -- [Python](#python) - - [Basic data types](#python-basic) - - [Containers](#python-containers) - - [Lists](#python-lists) - - [Dictionaries](#python-dicts) - - [Sets](#python-sets) - - [Tuples](#python-tuples) - - [Functions](#python-functions) - - [Classes](#python-classes) +- [파이썬](#python) + - [기본 자료형](#python-basic) + - [컨테이너](#python-containers) + - [리스트](#python-lists) + - [딕셔너리](#python-dicts) + - [집합](#python-sets) + - [튜플](#python-tuples) + - [함수](#python-functions) + - [클래스](#python-classes) - [Numpy](#numpy) - - [Arrays](#numpy-arrays) - - [Array indexing](#numpy-array-indexing) - - [Datatypes](#numpy-datatypes) - - [Array math](#numpy-math) - - [Broadcasting](#numpy-broadcasting) + - [배열](#numpy-arrays) + - [배열 색인](#numpy-array-indexing) + - [데이터타입](#numpy-datatypes) + - [배열 연산](#numpy-math) + - [브로드캐스팅](#numpy-broadcasting) - [SciPy](#scipy) - - [Image operations](#scipy-image) - - [MATLAB files](#scipy-matlab) - - [Distance between points](#scipy-dist) + - [이미지 작업](#scipy-image) + - [MATLAB 파일](#scipy-matlab) + - [두 점 사이의 거리](#scipy-dist) - [Matplotlib](#matplotlib) - [Plotting](#matplotlib-plotting) - [Subplots](#matplotlib-subplots) - - [Images](#matplotlib-images) + - [이미지](#matplotlib-images) -## Python +## 파이썬 -Python is a high-level, dynamically typed multiparadigm programming language. -Python code is often said to be almost like pseudocode, since it allows you -to express very powerful ideas in very few lines of code while being very -readable. As an example, here is an implementation of the classic quicksort -algorithm in Python: +파이썬은 고차원이고, 다중패러다임을 지원하는 동적 프로그래밍 언어이다. +짧지만 가독성 높은 코드 몇 줄로 수준 높은 아이디어들을 표현할수있기에 파이썬 코드는 거의 수도코드처럼 보인다고도 한다. +아래는 quicksort알고리즘의 파이썬 구현 예시이다: ~~~python def quicksort(arr): @@ -96,7 +91,7 @@ You can check your Python version at the command line by running `python --version`. -### Basic data types +### 기본 자료형 Like most languages, Python has a number of basic types including integers, floats, booleans, and strings. These data types behave in ways that are From 8a81171086bee6def6ba797f4f2cc0ace3aed470 Mon Sep 17 00:00:00 2001 From: Sanghun Kang Date: Sat, 28 May 2016 15:32:11 +0900 Subject: [PATCH 146/199] translated till string part --- python-numpy-tutorial.md | 259 +++++++++++++++++++-------------------- 1 file changed, 128 insertions(+), 131 deletions(-) diff --git a/python-numpy-tutorial.md b/python-numpy-tutorial.md index 257e6d50..11df0cf9 100644 --- a/python-numpy-tutorial.md +++ b/python-numpy-tutorial.md @@ -78,89 +78,86 @@ def quicksort(arr): return quicksort(left) + middle + quicksort(right) print quicksort([3,6,8,10,1,2,1]) -# Prints "[1, 1, 2, 3, 6, 8, 10]" +# 출력 "[1, 1, 2, 3, 6, 8, 10]" ~~~ -### Python versions -There are currently two different supported versions of Python, 2.7 and 3.4. -Somewhat confusingly, Python 3.0 introduced many backwards-incompatible changes -to the language, so code written for 2.7 may not work under 3.4 and vice versa. -For this class all code will use Python 2.7. +### 파이썬 버전 +현재 파이썬에는 두가지 버전이 있습니다. 파이썬 2.7 그리고 파이썬 3.4입니다. +혼란스럽게도, 파이썬3은 기존 파이썬2와 호환되지 않게 변경된 부분이 있습니다. 
그러므로 파이썬 2.7로 쓰여진 코드는 3.4 환경에서 동작하지 않고 그 반대도 마찬가지입니다.
이 수업에선 파이썬 2.7을 사용합니다.

커맨드라인에 아래의 명령어를 입력해서 현재 설치된 파이썬 버전을 확인할 수 있습니다.
`python --version`.


### 기본 자료형

다른 프로그래밍 언어들처럼, 파이썬에는 정수, 실수, 불린, 문자열 같은 기본 자료형이 있습니다.
파이썬 기본 자료형 역시 다른 프로그래밍 언어와 유사합니다.

**숫자:** 다른 언어와 마찬가지로 파이썬의 정수형(Integers)과 실수형(floats) 데이터 타입 역시 동일한 역할을 합니다:

~~~python
x = 3
print type(x) # 출력 "<type 'int'>"
print x # 출력 "3"
print x + 1 # 덧셈; 출력 "4"
print x - 1 # 뺄셈; 출력 "2"
print x * 2 # 곱셈; 출력 "6"
print x ** 2 # 제곱; 출력 "9"
x += 1
print x # 출력 "4"
x *= 2
print x # 출력 "8"
y = 2.5
print type(y) # 출력 "<type 'float'>"
print y, y + 1, y * 2, y ** 2 # 출력 "2.5 3.5 5.0 6.25"
~~~

다른 언어들과는 달리, 파이썬에는 증감 단항연산자(`x++`, `x--`)가 없습니다.

파이썬 역시 long 정수형과 복소수 데이터 타입이 구현되어 있습니다.
자세한 사항은 [문서](https://docs.python.org/2/library/stdtypes.html#numeric-types-int-float-long-complex)에서 찾아볼 수 있습니다.

**불린(Booleans):** 파이썬에는 논리 자료형의 모든 연산자들이 구현되어 있습니다.
그렇지만 기호(`&&`, `||`, 등.) 대신 영어 단어로 구현되어 있습니다:

~~~python
t = True
f = False
print type(t) # 출력 "<type 'bool'>"
print t and f # 논리 AND; 출력 "False"
print t or f # 논리 OR; 출력 "True"
print not t # 논리 NOT; 출력 "False"
print t != f # 논리 XOR; 출력 "True"
~~~

**문자열:** 파이썬은 문자열과 연관된 다양한 기능을 지원합니다:

~~~python
hello = 'hello' # 문자열을 표현할 땐 작은따옴표나
+print hello # 출력 "hello" +print len(hello) # 문자열 길이; 출력 "5" +hw = hello + ' ' + world # 문자열 연결 +print hw # 출력 "hello world" +hw12 = '%s %s %d' % (hello, world, 12) # sprintf 방식의 문자열 서식 지정 +print hw12 # 출력 "hello world 12" ~~~ String objects have a bunch of useful methods; for example: ~~~python s = "hello" -print s.capitalize() # Capitalize a string; prints "Hello" -print s.upper() # Convert a string to uppercase; prints "HELLO" -print s.rjust(7) # Right-justify a string, padding with spaces; prints " hello" -print s.center(7) # Center a string, padding with spaces; prints " hello " +print s.capitalize() # Capitalize a string; 출력 "Hello" +print s.upper() # Convert a string to uppercase; 출력 "HELLO" +print s.rjust(7) # Right-justify a string, padding with spaces; 출력 " hello" +print s.center(7) # Center a string, padding with spaces; 출력 " hello " print s.replace('l', '(ell)') # Replace all instances of one substring with another; - # prints "he(ell)(ell)o" -print ' world '.strip() # Strip leading and trailing whitespace; prints "world" + # 출력 "he(ell)(ell)o" +print ' world '.strip() # Strip leading and trailing whitespace; 출력 "world" ~~~ -You can find a list of all string methods [in the documentation](https://docs.python.org/2/library/stdtypes.html#string-methods). +모든 문자열 메소드는 [문서](https://docs.python.org/2/library/stdtypes.html#string-methods)에서 찾아볼 수 있습니다. ### Containers @@ -173,14 +170,14 @@ and can contain elements of different types: ~~~python xs = [3, 1, 2] # Create a list -print xs, xs[2] # Prints "[3, 1, 2] 2" -print xs[-1] # Negative indices count from the end of the list; prints "2" +print xs, xs[2] # 출력 "[3, 1, 2] 2" +print xs[-1] # Negative indices count from the end of the list; 출력 "2" xs[2] = 'foo' # Lists can contain elements of different types -print xs # Prints "[3, 1, 'foo']" +print xs # 출력 "[3, 1, 'foo']" xs.append('bar') # Add a new element to the end of the list -print xs # Prints "[3, 1, 'foo', 'bar']" +print xs # 출력 "[3, 1, 'foo', 'bar']" x = xs.pop() # Remove and return the last element of the list -print x, xs # Prints "bar [3, 1, 'foo']" +print x, xs # 출력 "bar [3, 1, 'foo']" ~~~ As usual, you can find all the gory details about lists [in the documentation](https://docs.python.org/2/tutorial/datastructures.html#more-on-lists). @@ -191,14 +188,14 @@ concise syntax to access sublists; this is known as *slicing*: ~~~python nums = range(5) # range is a built-in function that creates a list of integers -print nums # Prints "[0, 1, 2, 3, 4]" -print nums[2:4] # Get a slice from index 2 to 4 (exclusive); prints "[2, 3]" -print nums[2:] # Get a slice from index 2 to the end; prints "[2, 3, 4]" -print nums[:2] # Get a slice from the start to index 2 (exclusive); prints "[0, 1]" -print nums[:] # Get a slice of the whole list; prints ["0, 1, 2, 3, 4]" -print nums[:-1] # Slice indices can be negative; prints ["0, 1, 2, 3]" +print nums # 출력 "[0, 1, 2, 3, 4]" +print nums[2:4] # Get a slice from index 2 to 4 (exclusive); 출력 "[2, 3]" +print nums[2:] # Get a slice from index 2 to the end; 출력 "[2, 3, 4]" +print nums[:2] # Get a slice from the start to index 2 (exclusive); 출력 "[0, 1]" +print nums[:] # Get a slice of the whole list; 출력 ["0, 1, 2, 3, 4]" +print nums[:-1] # Slice indices can be negative; 출력 ["0, 1, 2, 3]" nums[2:4] = [8, 9] # Assign a new sublist to a slice -print nums # Prints "[0, 1, 8, 9, 4]" +print nums # 출력 "[0, 1, 8, 9, 4]" ~~~ We will see slicing again in the context of numpy arrays. @@ -208,7 +205,7 @@ We will see slicing again in the context of numpy arrays. 
animals = ['cat', 'dog', 'monkey'] for animal in animals: print animal -# Prints "cat", "dog", "monkey", each on its own line. +# 출력 "cat", "dog", "monkey", each on its own line. ~~~ If you want access to the index of each element within the body of a loop, @@ -218,7 +215,7 @@ use the built-in `enumerate` function: animals = ['cat', 'dog', 'monkey'] for idx, animal in enumerate(animals): print '#%d: %s' % (idx + 1, animal) -# Prints "#1: cat", "#2: dog", "#3: monkey", each on its own line +# 출력 "#1: cat", "#2: dog", "#3: monkey", each on its own line ~~~ **List comprehensions:** @@ -230,7 +227,7 @@ nums = [0, 1, 2, 3, 4] squares = [] for x in nums: squares.append(x ** 2) -print squares # Prints [0, 1, 4, 9, 16] +print squares # 출력 [0, 1, 4, 9, 16] ~~~ You can make this code simpler using a **list comprehension**: @@ -238,7 +235,7 @@ You can make this code simpler using a **list comprehension**: ~~~python nums = [0, 1, 2, 3, 4] squares = [x ** 2 for x in nums] -print squares # Prints [0, 1, 4, 9, 16] +print squares # 출력 [0, 1, 4, 9, 16] ~~~ List comprehensions can also contain conditions: @@ -246,7 +243,7 @@ List comprehensions can also contain conditions: ~~~python nums = [0, 1, 2, 3, 4] even_squares = [x ** 2 for x in nums if x % 2 == 0] -print even_squares # Prints "[0, 4, 16]" +print even_squares # 출력 "[0, 4, 16]" ~~~ @@ -256,15 +253,15 @@ an object in Javascript. You can use it like this: ~~~python d = {'cat': 'cute', 'dog': 'furry'} # Create a new dictionary with some data -print d['cat'] # Get an entry from a dictionary; prints "cute" -print 'cat' in d # Check if a dictionary has a given key; prints "True" +print d['cat'] # Get an entry from a dictionary; 출력 "cute" +print 'cat' in d # Check if a dictionary has a given key; 출력 "True" d['fish'] = 'wet' # Set an entry in a dictionary -print d['fish'] # Prints "wet" +print d['fish'] # 출력 "wet" # print d['monkey'] # KeyError: 'monkey' not a key of d -print d.get('monkey', 'N/A') # Get an element with a default; prints "N/A" -print d.get('fish', 'N/A') # Get an element with a default; prints "wet" +print d.get('monkey', 'N/A') # Get an element with a default; 출력 "N/A" +print d.get('fish', 'N/A') # Get an element with a default; 출력 "wet" del d['fish'] # Remove an element from a dictionary -print d.get('fish', 'N/A') # "fish" is no longer a key; prints "N/A" +print d.get('fish', 'N/A') # "fish" is no longer a key; 출력 "N/A" ~~~ You can find all you need to know about dictionaries [in the documentation](https://docs.python.org/2/library/stdtypes.html#dict). @@ -276,7 +273,7 @@ d = {'person': 2, 'cat': 4, 'spider': 8} for animal in d: legs = d[animal] print 'A %s has %d legs' % (animal, legs) -# Prints "A person has 2 legs", "A spider has 8 legs", "A cat has 4 legs" +# 출력 "A person has 2 legs", "A spider has 8 legs", "A cat has 4 legs" ~~~ If you want access to keys and their corresponding values, use the `iteritems` method: @@ -285,7 +282,7 @@ If you want access to keys and their corresponding values, use the `iteritems` m d = {'person': 2, 'cat': 4, 'spider': 8} for animal, legs in d.iteritems(): print 'A %s has %d legs' % (animal, legs) -# Prints "A person has 2 legs", "A spider has 8 legs", "A cat has 4 legs" +# 출력 "A person has 2 legs", "A spider has 8 legs", "A cat has 4 legs" ~~~ **Dictionary comprehensions:** @@ -295,7 +292,7 @@ dictionaries. 
For example: ~~~python nums = [0, 1, 2, 3, 4] even_num_to_square = {x: x ** 2 for x in nums if x % 2 == 0} -print even_num_to_square # Prints "{0: 0, 2: 4, 4: 16}" +print even_num_to_square # 출력 "{0: 0, 2: 4, 4: 16}" ~~~ @@ -305,15 +302,15 @@ the following: ~~~python animals = {'cat', 'dog'} -print 'cat' in animals # Check if an element is in a set; prints "True" -print 'fish' in animals # prints "False" +print 'cat' in animals # Check if an element is in a set; 출력 "True" +print 'fish' in animals # 출력 "False" animals.add('fish') # Add an element to a set -print 'fish' in animals # Prints "True" -print len(animals) # Number of elements in a set; prints "3" +print 'fish' in animals # 출력 "True" +print len(animals) # Number of elements in a set; 출력 "3" animals.add('cat') # Adding an element that is already in the set does nothing -print len(animals) # Prints "3" +print len(animals) # 출력 "3" animals.remove('cat') # Remove an element from a set -print len(animals) # Prints "2" +print len(animals) # 출력 "2" ~~~ As usual, everything you want to know about sets can be found @@ -329,7 +326,7 @@ in which you visit the elements of the set: animals = {'cat', 'dog', 'fish'} for idx, animal in enumerate(animals): print '#%d: %s' % (idx + 1, animal) -# Prints "#1: fish", "#2: dog", "#3: cat" +# 출력 "#1: fish", "#2: dog", "#3: cat" ~~~ **Set comprehensions:** @@ -338,7 +335,7 @@ Like lists and dictionaries, we can easily construct sets using set comprehensio ~~~python from math import sqrt nums = {int(sqrt(x)) for x in range(30)} -print nums # Prints "set([0, 1, 2, 3, 4, 5])" +print nums # 출력 "set([0, 1, 2, 3, 4, 5])" ~~~ @@ -351,9 +348,9 @@ Here is a trivial example: ~~~python d = {(x, x + 1): x for x in range(10)} # Create a dictionary with tuple keys t = (5, 6) # Create a tuple -print type(t) # Prints "" -print d[t] # Prints "5" -print d[(1, 2)] # Prints "1" +print type(t) # 출력 "" +print d[t] # 출력 "5" +print d[(1, 2)] # 출력 "1" ~~~ [The documentation](https://docs.python.org/2/tutorial/datastructures.html#tuples-and-sequences) has more information about tuples. @@ -372,7 +369,7 @@ def sign(x): for x in [-1, 0, 1]: print sign(x) -# Prints "negative", "zero", "positive" +# 출력 "negative", "zero", "positive" ~~~ We will often define functions to take optional keyword arguments, like this: @@ -384,8 +381,8 @@ def hello(name, loud=False): else: print 'Hello, %s' % name -hello('Bob') # Prints "Hello, Bob" -hello('Fred', loud=True) # Prints "HELLO, FRED!" +hello('Bob') # 출력 "Hello, Bob" +hello('Fred', loud=True) # 출력 "HELLO, FRED!" ~~~ There is a lot more information about Python functions [in the documentation](https://docs.python.org/2/tutorial/controlflow.html#defining-functions). @@ -410,8 +407,8 @@ class Greeter(object): print 'Hello, %s' % self.name g = Greeter('Fred') # Construct an instance of the Greeter class -g.greet() # Call an instance method; prints "Hello, Fred" -g.greet(loud=True) # Call an instance method; prints "HELLO, FRED!" +g.greet() # Call an instance method; 출력 "Hello, Fred" +g.greet(loud=True) # Call an instance method; 출력 "HELLO, FRED!" ~~~ You can read a lot more about Python classes [in the documentation](https://docs.python.org/2/tutorial/classes.html). 
@@ -437,15 +434,15 @@ and access elements using square brackets: import numpy as np a = np.array([1, 2, 3]) # Create a rank 1 array -print type(a) # Prints "" -print a.shape # Prints "(3,)" -print a[0], a[1], a[2] # Prints "1 2 3" +print type(a) # 출력 "" +print a.shape # 출력 "(3,)" +print a[0], a[1], a[2] # 출력 "1 2 3" a[0] = 5 # Change an element of the array -print a # Prints "[5, 2, 3]" +print a # 출력 "[5, 2, 3]" b = np.array([[1,2,3],[4,5,6]]) # Create a rank 2 array -print b.shape # Prints "(2, 3)" -print b[0, 0], b[0, 1], b[1, 0] # Prints "1 2 4" +print b.shape # 출력 "(2, 3)" +print b[0, 0], b[0, 1], b[1, 0] # 출력 "1 2 4" ~~~ Numpy also provides many functions to create arrays: @@ -454,18 +451,18 @@ Numpy also provides many functions to create arrays: import numpy as np a = np.zeros((2,2)) # Create an array of all zeros -print a # Prints "[[ 0. 0.] +print a # 출력 "[[ 0. 0.] # [ 0. 0.]]" b = np.ones((1,2)) # Create an array of all ones -print b # Prints "[[ 1. 1.]]" +print b # 출력 "[[ 1. 1.]]" c = np.full((2,2), 7) # Create a constant array -print c # Prints "[[ 7. 7.] +print c # 출력 "[[ 7. 7.] # [ 7. 7.]]" d = np.eye(2) # Create a 2x2 identity matrix -print d # Prints "[[ 1. 0.] +print d # 출력 "[[ 1. 0.] # [ 0. 1.]]" e = np.random.random((2,2)) # Create an array filled with random values @@ -501,9 +498,9 @@ b = a[:2, 1:3] # A slice of an array is a view into the same data, so modifying it # will modify the original array. -print a[0, 1] # Prints "2" +print a[0, 1] # 출력 "2" b[0, 0] = 77 # b[0, 0] is the same piece of data as a[0, 1] -print a[0, 1] # Prints "77" +print a[0, 1] # 출력 "77" ~~~ You can also mix integer indexing with slice indexing. @@ -526,14 +523,14 @@ a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]]) # original array: row_r1 = a[1, :] # Rank 1 view of the second row of a row_r2 = a[1:2, :] # Rank 2 view of the second row of a -print row_r1, row_r1.shape # Prints "[5 6 7 8] (4,)" -print row_r2, row_r2.shape # Prints "[[5 6 7 8]] (1, 4)" +print row_r1, row_r1.shape # 출력 "[5 6 7 8] (4,)" +print row_r2, row_r2.shape # 출력 "[[5 6 7 8]] (1, 4)" # We can make the same distinction when accessing columns of an array: col_r1 = a[:, 1] col_r2 = a[:, 1:2] -print col_r1, col_r1.shape # Prints "[ 2 6 10] (3,)" -print col_r2, col_r2.shape # Prints "[[ 2] +print col_r1, col_r1.shape # 출력 "[ 2 6 10] (3,)" +print col_r2, col_r2.shape # 출력 "[[ 2] # [ 6] # [10]] (3, 1)" ~~~ @@ -551,17 +548,17 @@ a = np.array([[1,2], [3, 4], [5, 6]]) # An example of integer array indexing. 
# The returned array will have shape (3,) and -print a[[0, 1, 2], [0, 1, 0]] # Prints "[1 4 5]" +print a[[0, 1, 2], [0, 1, 0]] # 출력 "[1 4 5]" # The above example of integer array indexing is equivalent to this: -print np.array([a[0, 0], a[1, 1], a[2, 0]]) # Prints "[1 4 5]" +print np.array([a[0, 0], a[1, 1], a[2, 0]]) # 출력 "[1 4 5]" # When using integer array indexing, you can reuse the same # element from the source array: -print a[[0, 0], [1, 1]] # Prints "[2 2]" +print a[[0, 0], [1, 1]] # 출력 "[2 2]" # Equivalent to the previous integer array indexing example -print np.array([a[0, 1], a[0, 1]]) # Prints "[2 2]" +print np.array([a[0, 1], a[0, 1]]) # 출력 "[2 2]" ~~~ One useful trick with integer array indexing is selecting or mutating one @@ -573,7 +570,7 @@ import numpy as np # Create a new array from which we will select elements a = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]]) -print a # prints "array([[ 1, 2, 3], +print a # 출력 "array([[ 1, 2, 3], # [ 4, 5, 6], # [ 7, 8, 9], # [10, 11, 12]])" @@ -582,12 +579,12 @@ print a # prints "array([[ 1, 2, 3], b = np.array([0, 2, 0, 1]) # Select one element from each row of a using the indices in b -print a[np.arange(4), b] # Prints "[ 1 6 7 11]" +print a[np.arange(4), b] # 출력 "[ 1 6 7 11]" # Mutate one element from each row of a using the indices in b a[np.arange(4), b] += 10 -print a # prints "array([[11, 2, 3], +print a # 출력 "array([[11, 2, 3], # [ 4, 5, 16], # [17, 8, 9], # [10, 21, 12]]) @@ -608,17 +605,17 @@ bool_idx = (a > 2) # Find the elements of a that are bigger than 2; # shape as a, where each slot of bool_idx tells # whether that element of a is > 2. -print bool_idx # Prints "[[False False] +print bool_idx # 출력 "[[False False] # [ True True] # [ True True]]" # We use boolean array indexing to construct a rank 1 array # consisting of the elements of a corresponding to the True values # of bool_idx -print a[bool_idx] # Prints "[3 4 5 6]" +print a[bool_idx] # 출력 "[3 4 5 6]" # We can do all of the above in a single concise statement: -print a[a > 2] # Prints "[3 4 5 6]" +print a[a > 2] # 출력 "[3 4 5 6]" ~~~ For brevity we have left out a lot of details about numpy array indexing; @@ -637,13 +634,13 @@ Here is an example: import numpy as np x = np.array([1, 2]) # Let numpy choose the datatype -print x.dtype # Prints "int64" +print x.dtype # 출력 "int64" x = np.array([1.0, 2.0]) # Let numpy choose the datatype -print x.dtype # Prints "float64" +print x.dtype # 출력 "float64" x = np.array([1, 2], dtype=np.int64) # Force a particular datatype -print x.dtype # Prints "int64" +print x.dtype # 출력 "int64" ~~~ You can read all about numpy datatypes [in the documentation](http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html). @@ -727,9 +724,9 @@ import numpy as np x = np.array([[1,2],[3,4]]) -print np.sum(x) # Compute sum of all elements; prints "10" -print np.sum(x, axis=0) # Compute sum of each column; prints "[4 6]" -print np.sum(x, axis=1) # Compute sum of each row; prints "[3 7]" +print np.sum(x) # Compute sum of all elements; 출력 "10" +print np.sum(x, axis=0) # Compute sum of each column; 출력 "[4 6]" +print np.sum(x, axis=1) # Compute sum of each row; 출력 "[3 7]" ~~~ You can find the full list of mathematical functions provided by numpy [in the documentation](http://docs.scipy.org/doc/numpy/reference/routines.math.html). 
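
As a brief aside not found in the original notes: `sum` is only one of several reductions that accept an `axis` argument; `mean`, `max`, and `argmax` follow the same pattern:

~~~python
import numpy as np

x = np.array([[1,2],[3,4]])

print np.mean(x)            # Mean of all elements; 출력 "2.5"
print np.max(x, axis=0)     # Maximum of each column; 출력 "[3 4]"
print np.argmax(x, axis=1)  # Index of the maximum within each row; 출력 "[1 1]"
~~~
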
@@ -743,15 +740,15 @@ simply use the `T` attribute of an array object: import numpy as np x = np.array([[1,2], [3,4]]) -print x # Prints "[[1 2] +print x # 출력 "[[1 2] # [3 4]]" -print x.T # Prints "[[1 3] +print x.T # 출력 "[[1 3] # [2 4]]" # Note that taking the transpose of a rank 1 array does nothing: v = np.array([1,2,3]) -print v # Prints "[1 2 3]" -print v.T # Prints "[1 2 3]" +print v # 출력 "[1 2 3]" +print v.T # 출력 "[1 2 3]" ~~~ Numpy provides many more functions for manipulating arrays; you can see the full list [in the documentation](http://docs.scipy.org/doc/numpy/reference/routines.array-manipulation.html). @@ -802,12 +799,12 @@ import numpy as np x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]]) v = np.array([1, 0, 1]) vv = np.tile(v, (4, 1)) # Stack 4 copies of v on top of each other -print vv # Prints "[[1 0 1] +print vv # 출력 "[[1 0 1] # [1 0 1] # [1 0 1] # [1 0 1]]" y = x + vv # Add x and vv elementwise -print y # Prints "[[ 2 2 4 +print y # 출력 "[[ 2 2 4 # [ 5 5 7] # [ 8 8 10] # [11 11 13]]" @@ -824,7 +821,7 @@ import numpy as np x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]]) v = np.array([1, 0, 1]) y = x + v # Add v to each row of x using broadcasting -print y # Prints "[[ 2 2 4] +print y # 출력 "[[ 2 2 4] # [ 5 5 7] # [ 8 8 10] # [11 11 13]]" @@ -935,7 +932,7 @@ from scipy.misc import imread, imsave, imresize # Read an JPEG image into a numpy array img = imread('assets/cat.jpg') -print img.dtype, img.shape # Prints "uint8 (400, 248, 3)" +print img.dtype, img.shape # 출력 "uint8 (400, 248, 3)" # We can tint the image by scaling each of the color channels # by a different scalar constant. The image has shape (400, 248, 3); From 72f5b6b772a08d44c88ac095699b6b42c161287e Mon Sep 17 00:00:00 2001 From: Sanghun Kang Date: Sat, 28 May 2016 16:43:41 +0900 Subject: [PATCH 147/199] String part Done --- python-numpy-tutorial.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/python-numpy-tutorial.md b/python-numpy-tutorial.md index 11df0cf9..18381fd5 100644 --- a/python-numpy-tutorial.md +++ b/python-numpy-tutorial.md @@ -145,17 +145,17 @@ hw12 = '%s %s %d' % (hello, world, 12) # sprintf 방식의 문자열 서식 지 print hw12 # 출력 "hello world 12" ~~~ -String objects have a bunch of useful methods; for example: +문자열 객체에는 유용한 메소드들이 많습니다; 예를 들어: ~~~python s = "hello" -print s.capitalize() # Capitalize a string; 출력 "Hello" -print s.upper() # Convert a string to uppercase; 출력 "HELLO" -print s.rjust(7) # Right-justify a string, padding with spaces; 출력 " hello" -print s.center(7) # Center a string, padding with spaces; 출력 " hello " -print s.replace('l', '(ell)') # Replace all instances of one substring with another; +print s.capitalize() # 문자열을 대문자로 시작하게함; 출력 "Hello" +print s.upper() # 모든 문자를 대문자로 바꿈; 출력 "HELLO" +print s.rjust(7) # 문자열 오른쪽 정렬, 빈공간은 여백으로 채움; 출력 " hello" +print s.center(7) # 문자열 가운데 정렬, 빈공간은 여백으로 채움; 출력 " hello " +print s.replace('l', '(ell)') # 첫번째 인자로 온 문자열을 두번째 인자 문자열로 바꿈; # 출력 "he(ell)(ell)o" -print ' world '.strip() # Strip leading and trailing whitespace; 출력 "world" +print ' world '.strip() # 문자열 앞뒤 공백 제거; 출력 "world" ~~~ 모든 문자열 메소드는 [문서](https://docs.python.org/2/library/stdtypes.html#string-methods)에서 찾아볼 수 있습니다. 
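
덧붙여, 원문에는 없는 보충 예시입니다: 위에서 본 sprintf 방식 외에 `format` 메소드로도 문자열 서식을 지정할 수 있습니다:

~~~python
s = '{} {} {}'.format('hello', 'world', 12)  # 위치 기반 서식 지정
print s                                      # 출력 "hello world 12"
print 'Hi, {name}!'.format(name='Fred')      # 이름 기반 서식 지정; 출력 "Hi, Fred!"
~~~
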
From 70c84602c6266cf0d7d280d17c17d0ec98385fb3 Mon Sep 17 00:00:00 2001 From: Sanghun Kang Date: Sat, 28 May 2016 17:00:08 +0900 Subject: [PATCH 148/199] list done --- python-numpy-tutorial.md | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/python-numpy-tutorial.md b/python-numpy-tutorial.md index 18381fd5..bbe2027b 100644 --- a/python-numpy-tutorial.md +++ b/python-numpy-tutorial.md @@ -160,27 +160,27 @@ print ' world '.strip() # 문자열 앞뒤 공백 제거; 출력 "world" 모든 문자열 메소드는 [문서](https://docs.python.org/2/library/stdtypes.html#string-methods)에서 찾아볼 수 있습니다. -### Containers +### 컨테이너 Python includes several built-in container types: lists, dictionaries, sets, and tuples. +파이썬은 다음과 같은 컨테이너 타입이 구현되어 있습니다: 리스트, 딕셔너리, 집합, 튜플 -#### Lists -A list is the Python equivalent of an array, but is resizeable -and can contain elements of different types: +#### 리스트 +리스트는 파이썬에서 배열같은 존재입니다. 그렇지만 배열과 달리 크기 변경이 가능하고 +서로 다른 자료형일지라도 하나의 리스트에 저장 될 수 있습니다: ~~~python -xs = [3, 1, 2] # Create a list +xs = [3, 1, 2] # 리스트 생성 print xs, xs[2] # 출력 "[3, 1, 2] 2" -print xs[-1] # Negative indices count from the end of the list; 출력 "2" -xs[2] = 'foo' # Lists can contain elements of different types +print xs[-1] # 인덱스가 음수일 경우 리스트의 끝에서부터 세어진다; 출력 "2" +xs[2] = 'foo' # 리스트는 자료형이 다른 요소들을 저장 할 수 있다 print xs # 출력 "[3, 1, 'foo']" -xs.append('bar') # Add a new element to the end of the list +xs.append('bar') # 리스트의 끝에 새 요소 추가 print xs # 출력 "[3, 1, 'foo', 'bar']" -x = xs.pop() # Remove and return the last element of the list +x = xs.pop() # 리스트의 마지막 요소 삭제하고 반환 print x, xs # 출력 "bar [3, 1, 'foo']" ~~~ -As usual, you can find all the gory details about lists -[in the documentation](https://docs.python.org/2/tutorial/datastructures.html#more-on-lists). +마찬가지로, 리스트에 대해 자세하 사항은 [문서](https://docs.python.org/2/tutorial/datastructures.html#more-on-lists)에서 찾아볼 수 있습니다. **Slicing:** In addition to accessing list elements one at a time, Python provides From 00709c45c37e0d506e027abd749d98df0fd741a6 Mon Sep 17 00:00:00 2001 From: Sanghun Kang Date: Sat, 28 May 2016 17:27:17 +0900 Subject: [PATCH 149/199] slicing done --- python-numpy-tutorial.md | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/python-numpy-tutorial.md b/python-numpy-tutorial.md index bbe2027b..f412da3b 100644 --- a/python-numpy-tutorial.md +++ b/python-numpy-tutorial.md @@ -61,8 +61,7 @@ cs231n 수업의 모든 과제에서는 프로그래밍 언어로 파이썬을 - [이미지](#matplotlib-images) -## 파이썬 - +## Python 파이썬은 고차원이고, 다중패러다임을 지원하는 동적 프로그래밍 언어이다. 짧지만 가독성 높은 코드 몇 줄로 수준 높은 아이디어들을 표현할수있기에 파이썬 코드는 거의 수도코드처럼 보인다고도 한다. 아래는 quicksort알고리즘의 파이썬 구현 예시이다: @@ -161,11 +160,12 @@ print ' world '.strip() # 문자열 앞뒤 공백 제거; 출력 "world" ### 컨테이너 -Python includes several built-in container types: lists, dictionaries, sets, and tuples. + 파이썬은 다음과 같은 컨테이너 타입이 구현되어 있습니다: 리스트, 딕셔너리, 집합, 튜플 #### 리스트 + 리스트는 파이썬에서 배열같은 존재입니다. 그렇지만 배열과 달리 크기 변경이 가능하고 서로 다른 자료형일지라도 하나의 리스트에 저장 될 수 있습니다: @@ -182,22 +182,22 @@ print x, xs # 출력 "bar [3, 1, 'foo']" ~~~ 마찬가지로, 리스트에 대해 자세하 사항은 [문서](https://docs.python.org/2/tutorial/datastructures.html#more-on-lists)에서 찾아볼 수 있습니다. 
-**Slicing:** -In addition to accessing list elements one at a time, Python provides -concise syntax to access sublists; this is known as *slicing*: +**슬라이싱:** +리스트의 요소로 한번에 접근하는것 이외에도, 파이썬은 리스트의 일부분에만 접근하는 간결한 문법을 제공한다; +이를 *슬라이싱*이라고 한다: ~~~python -nums = range(5) # range is a built-in function that creates a list of integers +nums = range(5) # range는 파이썬에 구현되어 있는 함수이며 정수들로 구성된 리스트를 만든다 print nums # 출력 "[0, 1, 2, 3, 4]" -print nums[2:4] # Get a slice from index 2 to 4 (exclusive); 출력 "[2, 3]" -print nums[2:] # Get a slice from index 2 to the end; 출력 "[2, 3, 4]" -print nums[:2] # Get a slice from the start to index 2 (exclusive); 출력 "[0, 1]" -print nums[:] # Get a slice of the whole list; 출력 ["0, 1, 2, 3, 4]" -print nums[:-1] # Slice indices can be negative; 출력 ["0, 1, 2, 3]" -nums[2:4] = [8, 9] # Assign a new sublist to a slice +print nums[2:4] # 인덱스 2에서 4(제외)까지 슬라이싱; 출력 "[2, 3]" +print nums[2:] # 인덱스 2에서 끝까지 슬라이싱; 출력 "[2, 3, 4]" +print nums[:2] # 처음부터 인덱스 2(제외)까지 슬라이싱; 출력 "[0, 1]" +print nums[:] # 전체 리스트 슬라이싱; 출력 ["0, 1, 2, 3, 4]" +print nums[:-1] # 슬라이싱 인덱스는 음수도 가능; 출력 ["0, 1, 2, 3]" +nums[2:4] = [8, 9] # 슬라이스된 리스트에 새로운 리스트 할당 print nums # 출력 "[0, 1, 8, 9, 4]" ~~~ -We will see slicing again in the context of numpy arrays. +numpy 배열 부분에서 다시 슬라이싱을 보게될것입니다. **Loops:** You can loop over the elements of a list like this: From 4ace2f93dbd500b4de1fbbe9d08486de77602260 Mon Sep 17 00:00:00 2001 From: Sanghun Kang Date: Sat, 28 May 2016 18:55:48 +0900 Subject: [PATCH 150/199] dictionary done --- python-numpy-tutorial.md | 75 ++++++++++++++++++++-------------------- 1 file changed, 37 insertions(+), 38 deletions(-) diff --git a/python-numpy-tutorial.md b/python-numpy-tutorial.md index f412da3b..48fc06e3 100644 --- a/python-numpy-tutorial.md +++ b/python-numpy-tutorial.md @@ -62,9 +62,9 @@ cs231n 수업의 모든 과제에서는 프로그래밍 언어로 파이썬을 ## Python -파이썬은 고차원이고, 다중패러다임을 지원하는 동적 프로그래밍 언어이다. -짧지만 가독성 높은 코드 몇 줄로 수준 높은 아이디어들을 표현할수있기에 파이썬 코드는 거의 수도코드처럼 보인다고도 한다. -아래는 quicksort알고리즘의 파이썬 구현 예시이다: +파이썬은 고차원이고, 다중패러다임을 지원하는 동적 프로그래밍 언어입니다. +짧지만 가독성 높은 코드 몇 줄로 수준 높은 아이디어들을 표현할수있기에 파이썬 코드는 거의 수도코드처럼 보인다고도 합니다. +아래는 quicksort알고리즘의 파이썬 구현 예시입니다: ~~~python def quicksort(arr): @@ -172,22 +172,22 @@ print ' world '.strip() # 문자열 앞뒤 공백 제거; 출력 "world" ~~~python xs = [3, 1, 2] # 리스트 생성 print xs, xs[2] # 출력 "[3, 1, 2] 2" -print xs[-1] # 인덱스가 음수일 경우 리스트의 끝에서부터 세어진다; 출력 "2" -xs[2] = 'foo' # 리스트는 자료형이 다른 요소들을 저장 할 수 있다 +print xs[-1] # 인덱스가 음수일 경우 리스트의 끝에서부터 세어짐; 출력 "2" +xs[2] = 'foo' # 리스트는 자료형이 다른 요소들을 저장 할 수 있습니다 print xs # 출력 "[3, 1, 'foo']" xs.append('bar') # 리스트의 끝에 새 요소 추가 print xs # 출력 "[3, 1, 'foo', 'bar']" x = xs.pop() # 리스트의 마지막 요소 삭제하고 반환 print x, xs # 출력 "bar [3, 1, 'foo']" ~~~ -마찬가지로, 리스트에 대해 자세하 사항은 [문서](https://docs.python.org/2/tutorial/datastructures.html#more-on-lists)에서 찾아볼 수 있습니다. +마찬가지로, 리스트에 대해 자세한 사항은 [문서](https://docs.python.org/2/tutorial/datastructures.html#more-on-lists)에서 찾아볼 수 있습니다. **슬라이싱:** -리스트의 요소로 한번에 접근하는것 이외에도, 파이썬은 리스트의 일부분에만 접근하는 간결한 문법을 제공한다; -이를 *슬라이싱*이라고 한다: +리스트의 요소로 한번에 접근하는것 이외에도, 파이썬은 리스트의 일부분에만 접근하는 간결한 문법을 제공합니다; +이를 *슬라이싱*이라고 합니다: ~~~python -nums = range(5) # range는 파이썬에 구현되어 있는 함수이며 정수들로 구성된 리스트를 만든다 +nums = range(5) # range는 파이썬에 구현되어 있는 함수이며 정수들로 구성된 리스트를 만듭니다 print nums # 출력 "[0, 1, 2, 3, 4]" print nums[2:4] # 인덱스 2에서 4(제외)까지 슬라이싱; 출력 "[2, 3]" print nums[2:] # 인덱스 2에서 끝까지 슬라이싱; 출력 "[2, 3, 4]" @@ -199,28 +199,28 @@ print nums # 출력 "[0, 1, 8, 9, 4]" ~~~ numpy 배열 부분에서 다시 슬라이싱을 보게될것입니다. 
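
원문에 없는 보충 예시를 하나 덧붙입니다: 슬라이스에는 세 번째 값으로 간격(step)을 지정할 수도 있습니다:

~~~python
nums = range(10)  # 파이썬 2에서 range는 리스트를 반환합니다
print nums[::2]   # 간격 2로 슬라이싱; 출력 "[0, 2, 4, 6, 8]"
print nums[::-1]  # 간격이 -1이면 순서가 뒤집힘; 출력 "[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]"
~~~
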
-**Loops:** You can loop over the elements of a list like this: +**반복문:** 아래와 같이 리스트의 요소들을 반복해서 조회할 수 있습니다: ~~~python animals = ['cat', 'dog', 'monkey'] for animal in animals: print animal -# 출력 "cat", "dog", "monkey", each on its own line. +# 출력 "cat", "dog", "monkey", 한 줄에 하나씩 출력. ~~~ -If you want access to the index of each element within the body of a loop, -use the built-in `enumerate` function: +만약 반복문 내에서 리스트 각 요소의 인덱스에 접근하고 싶다면, 'enumerate' 함수를 사용하세요: ~~~python animals = ['cat', 'dog', 'monkey'] for idx, animal in enumerate(animals): print '#%d: %s' % (idx + 1, animal) -# 출력 "#1: cat", "#2: dog", "#3: monkey", each on its own line +# 출력 "#1: cat", "#2: dog", "#3: monkey", 한 줄에 하나씩 출력. ~~~ -**List comprehensions:** -When programming, frequently we want to transform one type of data into another. -As a simple example, consider the following code that computes square numbers: +**리스트 comprehensions:** +프로그래밍을 하다보면, 자료형을 변환해야 하는 경우가 자주 있습니다. +간단한 예를 들자면, 숫자의 제곱을 계산하는 다음의 코드를 보세요: + ~~~python nums = [0, 1, 2, 3, 4] @@ -230,7 +230,7 @@ for x in nums: print squares # 출력 [0, 1, 4, 9, 16] ~~~ -You can make this code simpler using a **list comprehension**: +**리스트 comprehension**을 이용해 이 코드를 더 간단하게 만들 수 있습니다: ~~~python nums = [0, 1, 2, 3, 4] @@ -238,7 +238,7 @@ squares = [x ** 2 for x in nums] print squares # 출력 [0, 1, 4, 9, 16] ~~~ -List comprehensions can also contain conditions: +리스트 comprehensions에 조건을 추가 할 수도 있습니다: ~~~python nums = [0, 1, 2, 3, 4] @@ -247,33 +247,32 @@ print even_squares # 출력 "[0, 4, 16]" ~~~ -#### Dictionaries -A dictionary stores (key, value) pairs, similar to a `Map` in Java or -an object in Javascript. You can use it like this: +#### 딕셔너리 +자바의 '맵', 자바스크립트의 '오브젝트'와 유사하게, 파이썬의 '딕셔너리'는 (열쇠, 값) 쌍을 저장합니다. +아래와 같은 방식으로 딕셔너리를 사용할 수 있습니다: ~~~python -d = {'cat': 'cute', 'dog': 'furry'} # Create a new dictionary with some data -print d['cat'] # Get an entry from a dictionary; 출력 "cute" -print 'cat' in d # Check if a dictionary has a given key; 출력 "True" -d['fish'] = 'wet' # Set an entry in a dictionary +d = {'cat': 'cute', 'dog': 'furry'} # 새로운 딕셔너리를 만듭니다 +print d['cat'] # 딕셔너리의 값을 받음; 출력 "cute" +print 'cat' in d # 딕셔너리가 주어진 열쇠를 가지고 있는지 확인; 출력 "True" +d['fish'] = 'wet' # 딕셔너리의 값을 지정 print d['fish'] # 출력 "wet" # print d['monkey'] # KeyError: 'monkey' not a key of d -print d.get('monkey', 'N/A') # Get an element with a default; 출력 "N/A" -print d.get('fish', 'N/A') # Get an element with a default; 출력 "wet" -del d['fish'] # Remove an element from a dictionary -print d.get('fish', 'N/A') # "fish" is no longer a key; 출력 "N/A" +print d.get('monkey', 'N/A') # 딕셔너리의 값을 받음. 존재하지 않는 다면 'N/A'; 출력 "N/A" +print d.get('fish', 'N/A') # 딕셔너리의 값을 받음. 존재하지 않는 다면 'N/A'; 출력 "wet" +del d['fish'] # 딕셔너리에 저장된 요소 삭제 +print d.get('fish', 'N/A') # "fish"는 더이상 열쇠가 아님; 출력 "N/A" ~~~ -You can find all you need to know about dictionaries -[in the documentation](https://docs.python.org/2/library/stdtypes.html#dict). +딕셔너리에 관해 필요한 모든것은 [문서](https://docs.python.org/2/library/stdtypes.html#dict)에서 찾아볼 수 있습니다. -**Loops:** It is easy to iterate over the keys in a dictionary: +**반복문:** 딕셔너리의 열쇠는 쉽게 반복될 수 있습니다: ~~~python d = {'person': 2, 'cat': 4, 'spider': 8} for animal in d: legs = d[animal] print 'A %s has %d legs' % (animal, legs) -# 출력 "A person has 2 legs", "A spider has 8 legs", "A cat has 4 legs" +# 출력 "A person has 2 legs", "A spider has 8 legs", "A cat has 4 legs", 한 줄에 하나씩 출력. 
~~~ If you want access to keys and their corresponding values, use the `iteritems` method: @@ -282,12 +281,12 @@ If you want access to keys and their corresponding values, use the `iteritems` m d = {'person': 2, 'cat': 4, 'spider': 8} for animal, legs in d.iteritems(): print 'A %s has %d legs' % (animal, legs) -# 출력 "A person has 2 legs", "A spider has 8 legs", "A cat has 4 legs" +# 출력 "A person has 2 legs", "A spider has 8 legs", "A cat has 4 legs", 한 줄에 하나씩 출력. ~~~ -**Dictionary comprehensions:** -These are similar to list comprehensions, but allow you to easily construct -dictionaries. For example: +**딕셔너리 comprehensions:** +리스트 comprehensions과 유사한 딕셔너리 comprehensions을 통해 손쉽게 딕셔너리를 만들수 있습니다. +예시: ~~~python nums = [0, 1, 2, 3, 4] From 3e2f53febb492b27bf5db9250863d6abb489a680 Mon Sep 17 00:00:00 2001 From: Sanghun Kang Date: Sat, 28 May 2016 19:44:44 +0900 Subject: [PATCH 151/199] function done --- python-numpy-tutorial.md | 70 ++++++++++++++++++++-------------------- 1 file changed, 35 insertions(+), 35 deletions(-) diff --git a/python-numpy-tutorial.md b/python-numpy-tutorial.md index 48fc06e3..5b7e3769 100644 --- a/python-numpy-tutorial.md +++ b/python-numpy-tutorial.md @@ -61,6 +61,7 @@ cs231n 수업의 모든 과제에서는 프로그래밍 언어로 파이썬을 - [이미지](#matplotlib-images) + ## Python 파이썬은 고차원이고, 다중패러다임을 지원하는 동적 프로그래밍 언어입니다. 짧지만 가독성 높은 코드 몇 줄로 수준 높은 아이디어들을 표현할수있기에 파이썬 코드는 거의 수도코드처럼 보인다고도 합니다. @@ -90,6 +91,7 @@ print quicksort([3,6,8,10,1,2,1]) `python --version`. + ### 기본 자료형 다른 프로그래밍 언어들처럼, 파이썬에는 정수, 실수, 불린, 문자열같은 기본 자료형이 있습니다. @@ -159,13 +161,13 @@ print ' world '.strip() # 문자열 앞뒤 공백 제거; 출력 "world" 모든 문자열 메소드는 [문서](https://docs.python.org/2/library/stdtypes.html#string-methods)에서 찾아볼 수 있습니다. -### 컨테이너 +### 컨테이너 파이썬은 다음과 같은 컨테이너 타입이 구현되어 있습니다: 리스트, 딕셔너리, 집합, 튜플 -#### 리스트 +#### 리스트 리스트는 파이썬에서 배열같은 존재입니다. 그렇지만 배열과 달리 크기 변경이 가능하고 서로 다른 자료형일지라도 하나의 리스트에 저장 될 수 있습니다: @@ -247,6 +249,7 @@ print even_squares # 출력 "[0, 4, 16]" ~~~ + #### 딕셔너리 자바의 '맵', 자바스크립트의 '오브젝트'와 유사하게, 파이썬의 '딕셔너리'는 (열쇠, 값) 쌍을 저장합니다. 아래와 같은 방식으로 딕셔너리를 사용할 수 있습니다: @@ -263,7 +266,7 @@ print d.get('fish', 'N/A') # 딕셔너리의 값을 받음. 존재하지 않 del d['fish'] # 딕셔너리에 저장된 요소 삭제 print d.get('fish', 'N/A') # "fish"는 더이상 열쇠가 아님; 출력 "N/A" ~~~ -딕셔너리에 관해 필요한 모든것은 [문서](https://docs.python.org/2/library/stdtypes.html#dict)에서 찾아볼 수 있습니다. +딕셔너리에 관해 더 알고싶다면 [문서](https://docs.python.org/2/library/stdtypes.html#dict)를 참조하세요. **반복문:** 딕셔너리의 열쇠는 쉽게 반복될 수 있습니다: @@ -275,7 +278,7 @@ for animal in d: # 출력 "A person has 2 legs", "A spider has 8 legs", "A cat has 4 legs", 한 줄에 하나씩 출력. ~~~ -If you want access to keys and their corresponding values, use the `iteritems` method: +만약 열쇠와, 그에 상응하는 값에 접근하고 싶다면, 'iteritems' 메소드를 사용하세요: ~~~python d = {'person': 2, 'cat': 4, 'spider': 8} @@ -295,41 +298,38 @@ print even_num_to_square # 출력 "{0: 0, 2: 4, 4: 16}" ~~~ -#### Sets -A set is an unordered collection of distinct elements. As a simple example, consider -the following: + +#### 집합 +집합은 순서 구분이 없고 서로 다른 요소간의 모임입니다. 
예시:

~~~python
animals = {'cat', 'dog'}
print 'cat' in animals   # 요소가 집합에 포함되어 있는지 확인; 출력 "True"
print 'fish' in animals  # 출력 "False"
animals.add('fish')      # 요소를 집합에 추가
print 'fish' in animals  # 출력 "True"
print len(animals)       # 집합에 포함된 요소의 수; 출력 "3"
animals.add('cat')       # 이미 포함되어 있는 요소를 추가할 경우 아무 변화 없음
print len(animals)       # 출력 "3"
animals.remove('cat')    # 집합에서 요소 삭제
print len(animals)       # 출력 "2"
~~~

마찬가지로, 집합에 관해 더 알고 싶다면 [문서](https://docs.python.org/2/library/sets.html#set-objects)를 참조하세요.

**반복문:**
집합을 반복하는 구문은 리스트 반복 구문과 동일합니다;
그러나 집합은 순서가 없어서, 어떤 순서로 반복될지 추측할 수는 없습니다:

~~~python
animals = {'cat', 'dog', 'fish'}
for idx, animal in enumerate(animals):
    print '#%d: %s' % (idx + 1, animal)
# 출력 "#1: fish", "#2: dog", "#3: cat", 한 줄에 하나씩 출력.
~~~

**집합 comprehensions:**
리스트, 딕셔너리와 마찬가지로 집합 comprehensions을 통해 손쉽게 집합을 만들 수 있습니다.

~~~python
from math import sqrt
nums = {int(sqrt(x)) for x in range(30)}
print nums  # 출력 "set([0, 1, 2, 3, 4, 5])"
~~~


#### 튜플
튜플은 요소들 간 순서가 있으며 값이 변하지 않는 리스트입니다.
튜플은 많은 면에서 리스트와 유사합니다; 가장 중요한 차이점은 튜플은 '딕셔너리의 열쇠'와 '집합의 요소'가 될 수 있지만 리스트는 불가능하다는 점입니다.
여기 간단한 예시가 있습니다:

~~~python
d = {(x, x + 1): x for x in range(10)}  # 튜플을 열쇠로 하는 딕셔너리 생성
t = (5, 6)       # 튜플 생성
print type(t)    # 출력 "<type 'tuple'>"
print d[t]       # 출력 "5"
print d[(1, 2)]  # 출력 "1"
~~~
[문서](https://docs.python.org/2/tutorial/datastructures.html#tuples-and-sequences)에 튜플에 관한 더 많은 정보가 있습니다.


### 함수
파이썬 함수는 'def' 키워드를 통해 정의됩니다. 예시:

~~~python
def sign(x):
    if x > 0:
        return 'positive'
    elif x < 0:
        return 'negative'
    else:
        return 'zero'

for x in [-1, 0, 1]:
    print sign(x)
# 출력 "negative", "zero", "positive", 한 줄에 하나씩 출력.
~~~

아래처럼 선택적인 키워드 인자를 받는 함수를 정의하는 경우도 많습니다:

~~~python
def hello(name, loud=False):
    if loud:
        print 'HELLO, %s!' % name.upper()
    else:
        print 'Hello, %s' % name

hello('Bob')             # 출력 "Hello, Bob"
hello('Fred', loud=True) # 출력 "HELLO, FRED!"
~~~
파이썬 함수에 관해 더 많은 정보는 [문서](https://docs.python.org/2/tutorial/controlflow.html#defining-functions)를 참조하세요.
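
함수와 튜플을 함께 쓰는 보충 예시를 하나 덧붙입니다 (원문에는 없는 예시입니다). 파이썬 함수는 여러 값을 튜플로 묶어 반환할 수 있고, 호출하는 쪽에서 바로 풀어서 받을 수 있습니다:

~~~python
def min_max(nums):
    return min(nums), max(nums)  # 두 값을 하나의 튜플로 반환

lo, hi = min_max([3, 1, 4, 1, 5])  # 반환된 튜플을 언패킹
print lo, hi  # 출력 "1 5"
~~~
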
### Classes From ebb9c8cead0ec6f4ff2d15f95ce409e9ef5a003f Mon Sep 17 00:00:00 2001 From: Sanghun Kang Date: Sat, 28 May 2016 20:29:41 +0900 Subject: [PATCH 152/199] class done --- python-numpy-tutorial.md | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/python-numpy-tutorial.md b/python-numpy-tutorial.md index 5b7e3769..e8cfa51d 100644 --- a/python-numpy-tutorial.md +++ b/python-numpy-tutorial.md @@ -384,33 +384,32 @@ def hello(name, loud=False): hello('Bob') # 출력 "Hello, Bob" hello('Fred', loud=True) # 출력 "HELLO, FRED!" ~~~ -파이썬 함수에 관해 더 많은 정보는 [문서](https://docs.python.org/2/tutorial/controlflow.html#defining-functions)를 참조하세요. +파이썬 함수에 관한 더 많은 정보는 [문서](https://docs.python.org/2/tutorial/controlflow.html#defining-functions)를 참조하세요. ### Classes -The syntax for defining classes in Python is straightforward: +파이썬에서 클래스를 정의하는 구문은 복잡하지 않습니다: ~~~python class Greeter(object): - # Constructor + # 생성자 def __init__(self, name): - self.name = name # Create an instance variable + self.name = name # 인스턴스 변수 선언 - # Instance method + # 인스턴스 메소드 def greet(self, loud=False): if loud: print 'HELLO, %s!' % self.name.upper() else: print 'Hello, %s' % self.name -g = Greeter('Fred') # Construct an instance of the Greeter class -g.greet() # Call an instance method; 출력 "Hello, Fred" -g.greet(loud=True) # Call an instance method; 출력 "HELLO, FRED!" +g = Greeter('Fred') # Greeter 클래스의 인스턴스 생성 +g.greet() # 인스턴스 메소드 호출; 출력 "Hello, Fred" +g.greet(loud=True) # 인스턴스 메소드 호출; 출력 "HELLO, FRED!" ~~~ -You can read a lot more about Python classes -[in the documentation](https://docs.python.org/2/tutorial/classes.html). +파이썬 클래스에 관한 더 많은 정보는 [문서](https://docs.python.org/2/tutorial/classes.html)를 참조하세요. ## Numpy From 5ea392423e6a71ed71cfd68ca97e9691fabe74cb Mon Sep 17 00:00:00 2001 From: YB Date: Sat, 28 May 2016 11:38:56 -0400 Subject: [PATCH 153/199] Lecture1 - part 191~200 (out of 715) en / ko --- captions/En/Lecture1_en.srt | 42 ++++++++++++++++++------------------- captions/Ko/Lecture1_ko.srt | 27 ++++++++++++------------ 2 files changed, 35 insertions(+), 34 deletions(-) diff --git a/captions/En/Lecture1_en.srt b/captions/En/Lecture1_en.srt index 68d1b453..b9e8cdf3 100644 --- a/captions/En/Lecture1_en.srt +++ b/captions/En/Lecture1_en.srt @@ -940,57 +940,57 @@ skull is open into the brain of the 191 00:21:35,130 --> 00:21:42,180 -cut into an area what we already know -come primary visual cortex primary +cat into an area what we already know +called primary visual cortex. 192 00:21:42,180 --> 00:21:49,490 -visual cortex area do a lot of things -for for visual processing but before +Primary visual cortex is the area that +nuerons do a lot of things for for visual processing 193 00:21:49,490 --> 00:21:54,779 -visa we don't really know what primary -visual cortex is to be winter snow is +but before Hubel and Wiesel, +we don't really know what primary visual cortex is doing. 194 00:21:54,779 --> 00:22:02,369 -one of the earliest stage of the UI is -of course but earliest stage for visual +We just know it's one of the earliest stage on the eyes, +of course, but earliest stage for visual processing. 195 00:22:02,369 --> 00:22:07,299 -processing then there is tons and tons -of neurons working on vision then we +And then there is tons and tons +of neurons working on vision. 
196 00:22:07,299 --> 00:22:12,419 -really alter our to know what this is +And we really ought to know what this is because that's the beginning of vision 197 00:22:12,420 --> 00:22:20,300 -visual process in the bring so they they -put this electrode into the primary +visual process in the brain. +So they they put this electrode into the primary visual cortex 198 -00:22:20,299 --> 00:22:25,930 -visual cortex and an interesting this is -another interesting fact I don't drop my +00:22:20,300 --> 00:22:25,930 +and an interestingly, +this is another interesting fact. 199 00:22:25,930 --> 00:22:34,880 -stuff for sure you probably visual -cortex the first they come from being +I will drop off my stuff. I will show you. +Primary visual cortex, the first stage, or second depending on where they come from. 200 00:22:34,880 --> 00:22:40,910 -very very rough prosecutor first aid of -your cortical visual processing stage is +I'm being very very rough. +The First stage of your cortical visual processing stage is 201 00:22:40,910 --> 00:22:47,180 -in the back of your bring not near your +in the back of your brain not near your I know it's very interesting because 202 diff --git a/captions/Ko/Lecture1_ko.srt b/captions/Ko/Lecture1_ko.srt index 7790c724..9ba66761 100644 --- a/captions/Ko/Lecture1_ko.srt +++ b/captions/Ko/Lecture1_ko.srt @@ -773,51 +773,52 @@ 189 00:21:21,500 --> 00:21:28,529 - Electrode라는 직은 바늘을 + Electrode라는 작은 바늘을 190 00:21:28,529 --> 00:21:35,129 - 두개골이 열린 상태의 고양이의 뇌로 집어 넣었습니다. + 두개골이 열린 상태의 고양이의 뇌로. 191 00:21:35,130 --> 00:21:42,180 - 우리는 이미 일차 시각 피질의 차를 올 알고있는 지역으로 절단 + 일차 시각 피질이라고 알려진 부위에 넣습니다. 192 00:21:42,180 --> 00:21:49,490 - 시각 피질 영역은 영상 처리하지만, 전에 일을 많이 할 + 일차 시각 피질 영역에서는 많은 뉴런들이 시각 정보를 처리합니다. 193 00:21:49,490 --> 00:21:54,779 - 우리가 정말 모르는 비자는 시각 주요 내용 피질 겨울 눈이 될 것입니다 + 하지만 Hubel과 Wiesel 이전엔 일차 시각 피질이 정확이 무슨 일을 하는지 몰랐죠. 194 00:21:54,779 --> 00:22:02,369 - 사용자 인터페이스의 초기 단계 중 하나는 물론이지만 시각에 대한 초기 단계 + 단지 시각처리과정의 초기 단계라는 것과 195 00:22:02,369 --> 00:22:07,299 - 처리는 다음의 비전에 우리 작업 톤 뉴 올리언스의 톤이있다 + 엄청난 양의 뉴런이 있다는 것만 알고 있었어요. 196 00:22:07,299 --> 00:22:12,419 - 그 비전의 시작이기 때문에 정말이 무엇인지 알고 우리를 변경 + 이 일차 시각 피질은 뇌에서 시각 처리의 시작지점이기 때문에 꼭 알아야만 합니다. 197 00:22:12,420 --> 00:22:20,300 - (가) 그렇게 가져 시각 과정은 그들이 차에이 전극을 넣어 + 그렇게 그들은 일차 시각 피질에 전극을 넣습니다. 198 -00:22:20,299 --> 00:22:25,930 - 시각 피질이 흥미로운 내가 떨어 뜨리지 않는 또 다른 흥미로운 사실​​ 내 +00:22:20,300 --> 00:22:25,930 + 여기에서 또 다른 흥미로운 사실​​이 있습니다. 199 00:22:25,930 --> 00:22:34,880 - (가) 첫째가되는 것을 온 있는지 아마 시각 피질에 대한 물건 + 이것 좀 내려놓고 설명할게요. 일차 시각 피질, + 시작이 어디냐에 따라 첫번째 혹은 두번째의 시각 처리 과정이 이루어 지는 곳이죠. 200 00:22:34,880 --> 00:22:40,910 - 당신의 대뇌 피질의 시각 처리 단계의 아주 아주 거친 검사 응급 처치입니다 + 간략히 말해 이 시각 피질들의 첫번째 시각 처리 과정은 201 00:22:40,910 --> 00:22:47,180 From 3a0fc0c45bd8cc4fb735ebaaa3b81c9d4aaff47a Mon Sep 17 00:00:00 2001 From: Sanghun Kang Date: Sun, 29 May 2016 23:56:00 +0900 Subject: [PATCH 154/199] translated till integer array indexing --- python-numpy-tutorial.md | 143 +++++++++++++++++++-------------------- 1 file changed, 69 insertions(+), 74 deletions(-) diff --git a/python-numpy-tutorial.md b/python-numpy-tutorial.md index e8cfa51d..8369b580 100644 --- a/python-numpy-tutorial.md +++ b/python-numpy-tutorial.md @@ -47,7 +47,7 @@ cs231n 수업의 모든 과제에서는 프로그래밍 언어로 파이썬을 - [클래스](#python-classes) - [Numpy](#numpy) - [배열](#numpy-arrays) - - [배열 색인](#numpy-array-indexing) + - [배열 인덱싱](#numpy-array-indexing) - [데이터타입](#numpy-datatypes) - [배열 연산](#numpy-math) - [브로드캐스팅](#numpy-broadcasting) @@ -414,70 +414,66 @@ g.greet(loud=True) # 인스턴스 메소드 호출; 출력 "HELLO, FRED!" 
## Numpy

[Numpy](http://www.numpy.org/)는 파이썬이 계산과학 분야에 이용될 때 핵심 역할을 하는 라이브러리입니다.
Numpy는 고성능의 다차원 배열 객체와 이를 다룰 도구를 제공합니다. 만약 MATLAB에 익숙한 분이라면 Numpy를 학습하는 데 있어
[이 튜토리얼](http://wiki.scipy.org/NumPy_for_Matlab_Users)이 유용할 것입니다.

<a name='numpy-arrays'></a>

### 배열
Numpy 배열은 동일한 자료형을 가지는 값들이 격자판 형태로 있는 것입니다. 각각의 값들은 튜플(이때 튜플은 음이 아닌 정수만을 요소값으로 갖습니다.) 형태로 색인됩니다.
*rank*는 배열이 몇 차원인지를 의미합니다; *shape*는 각 차원의 크기를 알려주는 정수들이 모인 튜플입니다.

파이썬의 리스트를 중첩해 Numpy 배열을 초기화할 수 있고, 대괄호를 통해 각 요소에 접근할 수 있습니다:

~~~python
import numpy as np

a = np.array([1, 2, 3])  # rank가 1인 배열 생성
print type(a)            # 출력 "<type 'numpy.ndarray'>"
print a.shape            # 출력 "(3,)"
print a[0], a[1], a[2]   # 출력 "1 2 3"
a[0] = 5                 # 요소를 변경
print a                  # 출력 "[5, 2, 3]"

b = np.array([[1,2,3],[4,5,6]])  # rank가 2인 배열 생성
print b.shape                    # 출력 "(2, 3)"
print b[0, 0], b[0, 1], b[1, 0]  # 출력 "1 2 4"
~~~

리스트의 중첩이 아니더라도 Numpy는 배열을 만들기 위한 다양한 함수를 제공합니다.

~~~python
import numpy as np

a = np.zeros((2,2))  # 모든 값이 0인 배열 생성
print a              # 출력 "[[ 0.  0.]
                     #       [ 0.  0.]]"

b = np.ones((1,2))   # 모든 값이 1인 배열 생성
print b              # 출력 "[[ 1.  1.]]"

c = np.full((2,2), 7)  # 모든 값이 특정 상수인 배열 생성
print c                # 출력 "[[ 7.  7.]
                       #       [ 7.  7.]]"

d = np.eye(2)        # 2x2 단위 행렬 생성
print d              # 출력 "[[ 1.  0.]
                     #       [ 0.  1.]]"

e = np.random.random((2,2))  # 임의의 값으로 채워진 배열 생성
print e                      # 임의의 값 출력 "[[ 0.91940167  0.08143941]
                             #               [ 0.68744134  0.87236687]]"
~~~

배열 생성에 관한 다른 방법들은 [문서](http://docs.scipy.org/doc/numpy/user/basics.creation.html#arrays-creation)를 참조하세요.

<a name='numpy-array-indexing'></a>

### 배열 인덱싱
Numpy는 배열을 인덱싱하는 몇 가지 방법을 제공합니다.

**슬라이싱:**
파이썬 리스트와 유사하게, Numpy 배열도 슬라이싱이 가능합니다.
Numpy 배열은 다차원인 경우가 많기에, 각 차원별로 어떻게 슬라이스할 것인지 명확히 해야 합니다:

~~~python
import numpy as np

# 아래와 같은 요소를 가지는 rank가 2이고 shape가 (3, 4)인 배열 생성
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

# 슬라이싱을 이용하여 첫 두 행과 1열, 2열로 이루어진 부분배열을 만들어 봅시다;
# b는 shape가 (2,2)인 배열이 됩니다:
# [[2 3]
#  [6 7]]
b = a[:2, 1:3]

# 슬라이싱된 배열은 원본 배열과 같은 데이터를 참조합니다. 즉, 슬라이싱된 배열을 수정하면
# 원본 배열 역시 수정됩니다.
print a[0, 1]  # 출력 "2"
b[0, 0] = 77   # b[0, 0]은 a[0, 1]과 같은 데이터입니다
print a[0, 1]  # 출력 "77"
~~~

정수를 이용한 인덱싱과 슬라이싱을 혼합하여 사용할 수 있습니다.
하지만 이렇게 할 경우, 기존의 배열보다 낮은 rank의 배열이 얻어집니다.
이는 MATLAB이 배열 슬라이싱을 다루는 방식과는 차이가 있는 부분입니다:

~~~python
import numpy as np

# 아래와 같은 요소를 가지는 rank가 2이고 shape가 (3, 4)인 배열 생성
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

# 배열의 중간 행에 접근하는 두 가지 방법이 있습니다.
# 정수 인덱싱과 슬라이싱을 혼합해서 사용하면 낮은 rank의 배열이 생성되지만,
# 슬라이싱만 사용하면 원본 배열과 동일한 rank의 배열이 생성됩니다.
row_r1 = a[1, :]    # 배열 a의 두 번째 행을 rank가 1인 배열로
row_r2 = a[1:2, :]  # 배열 a의 두 번째 행을 rank가 2인 배열로
print row_r1, row_r1.shape  # 출력 "[5 6 7 8] (4,)"
print row_r2, row_r2.shape  # 출력 "[[5 6 7 8]] (1, 4)"

# 행이 아닌 열의 경우에도 마찬가지입니다:
col_r1 = a[:, 1]
col_r2 = a[:, 1:2]
print col_r1, col_r1.shape  # 출력 "[ 2  6 10] (3,)"
print col_r2, col_r2.shape  # 출력 "[[ 2]
                            #       [ 6]
                            #       [10]] (3, 1)"
~~~

**정수 배열 인덱싱:**
Numpy 배열을 슬라이싱하면, 결과로 얻어지는 배열은 언제나 원본 배열의 부분 배열입니다.
그러나 정수 배열 인덱싱을 한다면, 원본과 다른 배열을 만들 수 있습니다.
여기에 예시가 있습니다:

~~~python
import numpy as np

a = np.array([[1,2], [3, 4], [5, 6]])

# 정수 배열 인덱싱의 예.
+# 반환되는 배열의 shape는 (3,) print a[[0, 1, 2], [0, 1, 0]] # 출력 "[1 4 5]" -# The above example of integer array indexing is equivalent to this: +# 위에서 본 정수 배열 인덱싱 예제는 다음과 동일합니다: print np.array([a[0, 0], a[1, 1], a[2, 0]]) # 출력 "[1 4 5]" -# When using integer array indexing, you can reuse the same -# element from the source array: +# 정수 배열 인덱싱을 사용할 때, +# 원본 배열의 같은 요소를 재사용 할 수 있습니다: print a[[0, 0], [1, 1]] # 출력 "[2 2]" -# Equivalent to the previous integer array indexing example +# 위 예제는 다음과 동일합니다 print np.array([a[0, 1], a[0, 1]]) # 출력 "[2 2]" ~~~ -One useful trick with integer array indexing is selecting or mutating one -element from each row of a matrix: +정수 배열 인덱싱을 유용하게 사용하는 방법 중 하나는 행렬의 각 행에서 하나의 요소를 선택하거나 바꾸는 것입니다: ~~~python import numpy as np -# Create a new array from which we will select elements +# 요소를 선택할 새로운 배열 생성 a = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]]) print a # 출력 "array([[ 1, 2, 3], - # [ 4, 5, 6], - # [ 7, 8, 9], - # [10, 11, 12]])" + # [ 4, 5, 6], + # [ 7, 8, 9], + # [10, 11, 12]])" -# Create an array of indices +# 인덱스를 저장할 배열 생성 b = np.array([0, 2, 0, 1]) -# Select one element from each row of a using the indices in b + +# b에 저장된 인덱스를 이용해 각 행에서 하나의 요소를 선택합니다 print a[np.arange(4), b] # 출력 "[ 1 6 7 11]" -# Mutate one element from each row of a using the indices in b +# b에 저장된 인덱스를 이용해 각 행에서 하나의 요소를 변경합니다 a[np.arange(4), b] += 10 print a # 출력 "array([[11, 2, 3], - # [ 4, 5, 16], - # [17, 8, 9], - # [10, 21, 12]]) + # [ 4, 5, 16], + # [17, 8, 9], + # [10, 21, 12]]) ~~~ **Boolean array indexing:** From c78193487b82d925958e56395ece1ef6e0d0cf44 Mon Sep 17 00:00:00 2001 From: YB Date: Sun, 29 May 2016 11:55:37 -0400 Subject: [PATCH 155/199] Lecture1 - part 201~210 (out of 715) en / ko --- captions/En/Lecture1_en.srt | 40 ++++++++++++++++++------------------- captions/Ko/Lecture1_ko.srt | 23 +++++++++++---------- 2 files changed, 31 insertions(+), 32 deletions(-) diff --git a/captions/En/Lecture1_en.srt b/captions/En/Lecture1_en.srt index b9e8cdf3..b7b0e3fb 100644 --- a/captions/En/Lecture1_en.srt +++ b/captions/En/Lecture1_en.srt @@ -990,57 +990,55 @@ The First stage of your cortical visual processing stage is 201 00:22:40,910 --> 00:22:47,180 -in the back of your brain not near your -I know it's very interesting because +in the back of your brain not near your eye. +Okay? It's very interesting because 202 00:22:47,180 --> 00:22:51,788 -your own factory in cortical processing -is right +your olfactory cortical processing is right behind your nose. 203 00:22:51,788 --> 00:22:58,519 -behind her nose your auditory is right -behind every year but your primary +Your auditory is right behind your ear. + 204 00:22:58,519 --> 00:23:05,798 -visual cortex is the furthest from your -eye and another very interesting that in +but your primary visual cortex is the furthest from your eye +and another very interesting fact. 205 00:23:05,798 --> 00:23:11,099 -fact not only the primary there's a huge -area working on vision almost 50% of +In fact, not only the primary, +there's a huge area working on vision. 206 00:23:11,099 --> 00:23:17,888 -your brain is a love division this in -this the hardest and most important +Almost 50% of your brain is involved in vision. +Vision is the hardest and most important 207 00:23:17,888 --> 00:23:22,608 -sensory perceptual cognitive system in -the break and I'm not saying anything +sensory perceptual cognitive system in the brain. 
+I'm not saying anything 208 00:23:22,608 --> 00:23:29,839 -else does is not useful clearly but it -take nature of this long to develop this +else does is useful clearly, but it +take nature this long to develop this sensory system 209 00:23:29,839 --> 00:23:37,579 -this sensory system and it takes the -troop this much realist a space to be +and it takes the intro this much real estate space to be 210 00:23:37,579 --> 00:23:43,148 -used for the system why because it's so -important and it's so damn hard that's +used for the system. Why? +because it's so important and it's so damn hard. 211 00:23:43,148 --> 00:23:50,959 -why we need to get back to human reason +That's why we need to get back to human reason they were really ambitious they wanna 212 diff --git a/captions/Ko/Lecture1_ko.srt b/captions/Ko/Lecture1_ko.srt index 9ba66761..5d4d6543 100644 --- a/captions/Ko/Lecture1_ko.srt +++ b/captions/Ko/Lecture1_ko.srt @@ -777,7 +777,7 @@ 190 00:21:28,529 --> 00:21:35,129 - 두개골이 열린 상태의 고양이의 뇌로. + 두개골이 열린 상태의 고양이의 뇌로, 191 00:21:35,130 --> 00:21:42,180 @@ -822,43 +822,44 @@ 201 00:22:40,910 --> 00:22:47,180 - 당신이 근처에없는 가져 뒷면에 나는 그것이 있기 때문에 매우 흥미로운 알고 + 눈 근처가 아닌 뇌 뒷편에서 이루어 집니다. 매우 흥미로운 점은 202 00:22:47,180 --> 00:22:51,788 - 대뇌 피질의 처리에 자신의 공장이 맞다 + 후각 대뇌 피질은 코 바로 뒤에 있어요. 203 00:22:51,788 --> 00:22:58,519 - 그녀의 코 뒤에 청각 바로 매년 그러나 주 뒤에 + 청각 피질은 귀 바로 뒤에 있지요. 204 00:22:58,519 --> 00:23:05,798 - 시각 피질은 눈에서 가장 먼과에서 그 또 다른 매우 흥미로운 일이다 + 그런데 시각 피질은 눈에서 가장 먼 곳에서 이루어지죠. 205 00:23:05,798 --> 00:23:11,099 - 사실뿐만 아니라 기본 비전 거의 50 %의 작업 거대한 지역에있다 + 사실, 일차 시각 피질뿐만 아니라 많은 다른 부분들이 시각처리에 관여합니다. 206 00:23:11,099 --> 00:23:17,888 - 당신의 두뇌는이이 어려운 가장 중요한 사랑의 부서입니다 + 거의 50%의 뇌가 Vision과 관련되어있어요. + Vision은 뇌에서 가장 어렵고 중요한 감각 지각체계입니다. 207 00:23:17,888 --> 00:23:22,608 - 감각 지각인지 휴식 시간에 시스템 난 아무것도 말하고 있지 않다 + 제가 다른 체계들이 중요하지 않다는 것은 아니지만, 208 00:23:22,608 --> 00:23:29,839 - 다른 않습니다 분명히 도움이되지 않습니다하지만이 개발이 긴의 특성을 + 자연이 이 감각 체계를 발달 시키는데에 오랜 시간이 걸렸고, 209 00:23:29,839 --> 00:23:37,579 - 이 감각 시스템은 그것을 할 병력에게 공간이 많은 현실 소요 + 이렇게 큰 공간을 차지하고 있어요. 210 00:23:37,579 --> 00:23:43,148 - 너무 중요하기 때문에 왜 시스템에 사용 그렇게 빌어 먹을 하드 그건입니다 + 왜그럴까요? 너무 중요하고 엄청나게 어렵기 때문입니다. 211 00:23:43,148 --> 00:23:50,959 From 0dd2e55ebbaf9e7edc1514aa2bad64333a729963 Mon Sep 17 00:00:00 2001 From: YB Date: Sun, 29 May 2016 11:58:50 -0400 Subject: [PATCH 156/199] Lecture1 - En_208 fix --- captions/En/Lecture1_en.srt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/captions/En/Lecture1_en.srt b/captions/En/Lecture1_en.srt index b7b0e3fb..e86327c3 100644 --- a/captions/En/Lecture1_en.srt +++ b/captions/En/Lecture1_en.srt @@ -1024,7 +1024,7 @@ I'm not saying anything 208 00:23:22,608 --> 00:23:29,839 -else does is useful clearly, but it +else isn't useful clearly, but it take nature this long to develop this sensory system 209 From 4fa9bc7c83eed558f4ffa9d5f8f77c785cd47420 Mon Sep 17 00:00:00 2001 From: myungsub Date: Mon, 30 May 2016 12:46:05 +0900 Subject: [PATCH 157/199] fix typo & assignment #2 --- assignments2016/assignment1.md | 2 +- assignments2016/assignment2.md | 151 ++++++++++----------------------- assignments2016/assignment3.md | 2 +- 3 files changed, 49 insertions(+), 106 deletions(-) diff --git a/assignments2016/assignment1.md b/assignments2016/assignment1.md index fa7a4502..32e11f88 100644 --- a/assignments2016/assignment1.md +++ b/assignments2016/assignment1.md @@ -51,7 +51,7 @@ cd cs231n/datasets **IPython 시작:** CIFAR-10 data를 받았다면, `assignment1` 폴더의 IPython notebook server를 시작할 수 있습니다. 
IPython에 친숙하지 않다면 작성해둔 [IPython tutorial](/ipython-tutorial)를 읽어보는 것을 권장합니다. -**NOTE:** OSX에서 virtual environment를 실행하면, matplotlib 에러가 날 수 있습니다([이 문제에 관한 이슈](http://matplotlib.org/faq/virtualenv_faq.html)). IPython 서버를 `assignment1`폴더의 `start_ipython_osx.sh`로 실행하면 이 문제를 피해갈 수 있습니다; 이 스크립트는 virtual environment가 `.env`라고 되어있다고 가정하고 작성되었습니다.로 +**NOTE:** OSX에서 virtual environment를 실행하면, matplotlib 에러가 날 수 있습니다([이 문제에 관한 이슈](http://matplotlib.org/faq/virtualenv_faq.html)). IPython 서버를 `assignment1`폴더의 `start_ipython_osx.sh`로 실행하면 이 문제를 피해갈 수 있습니다; 이 스크립트는 virtual environment가 `.env`라고 되어있다고 가정하고 작성되었습니다. ### 과제 제출: 로컬 환경이나 Terminal에 상관없이, 이번 숙제를 마쳤다면 `collectSubmission.sh`스크립트를 실행하세요. 이 스크립트는 `assignment1.zip`파일을 만듭니다. 이 파일을 [the coursework](https://coursework.stanford.edu/portal/site/W16-CS-231N-01/)에 업로드하세요. diff --git a/assignments2016/assignment2.md b/assignments2016/assignment2.md index e5b36cef..77349fad 100644 --- a/assignments2016/assignment2.md +++ b/assignments2016/assignment2.md @@ -4,131 +4,74 @@ mathjax: true permalink: assignments2016/assignment2/ --- -In this assignment you will practice writing backpropagation code, and training -Neural Networks and Convolutional Neural Networks. The goals of this assignment -are as follows: - -- understand **Neural Networks** and how they are arranged in layered - architectures -- understand and be able to implement (vectorized) **backpropagation** -- implement various **update rules** used to optimize Neural Networks -- implement **batch normalization** for training deep networks -- implement **dropout** to regularize networks -- effectively **cross-validate** and find the best hyperparameters for Neural - Network architecture -- understand the architecture of **Convolutional Neural Networks** and train - gain experience with training these models on data - -## Setup -You can work on the assignment in one of two ways: locally on your own machine, -or on a virtual machine through Terminal.com. - -### Working in the cloud on Terminal - -Terminal has created a separate subdomain to serve our class, -[www.stanfordterminalcloud.com](https://www.stanfordterminalcloud.com). Register -your account there. The Assignment 2 snapshot can then be found [HERE](https://www.stanfordterminalcloud.com/snapshot/6c95ca2c9866a962964ede3ea5813d4c2410ba48d92cf8d11a93fbb13e08b76a). If you are -registered in the class you can contact the TA (see Piazza for more information) -to request Terminal credits for use on the assignment. Once you boot up the -snapshot everything will be installed for you, and you will be ready to start on -your assignment right away. We have written a small tutorial on Terminal -[here](/terminal-tutorial). - -### Working locally -Get the code as a zip file -[here](http://vision.stanford.edu/teaching/cs231n/winter1516_assignment2.zip). -As for the dependencies: - -**[Option 1] Use Anaconda:** -The preferred approach for installing all the assignment dependencies is to use -[Anaconda](https://www.continuum.io/downloads), which is a Python distribution -that includes many of the most popular Python packages for science, math, -engineering and data analysis. Once you install it you can skip all mentions of -requirements and you are ready to go directly to working on the assignment. 
- -**[Option 2] Manual install, virtual environment:** -If you do not want to use Anaconda and want to go with a more manual and risky -installation route you will likely want to create a -[virtual environment](http://docs.python-guide.org/en/latest/dev/virtualenvs/) -for the project. If you choose not to use a virtual environment, it is up to you -to make sure that all dependencies for the code are installed globally on your -machine. To set up a virtual environment, run the following: +이번 숙제에서 여러분은 backpropagation 코드를 작성하는 법을 연습하고, 기본 형태의 뉴럴 네트워크(신경망)와 컨볼루션 신경망을 학습해볼 것입니다. 이번 숙제의 목표는 다음과 같습니다. + +- **뉴럴 네트워크(신경망)** 에 대해 이해하고 레이어가 있는 구조가 어떻게 배치되어 있는지 이해하기 +- **backpropagation** 에 대해 이해하고 (벡터화된) 코드로 구현하기 +- 뉴럴 네트워크를 학습시키는데 필요한 여러 가지 **업데이트 규칙** 구현하기 +- 딥 뉴럴 네트워크를 학습하는데 필요한 **batch normalization** 구현하기 +- 네트워크를 regularization 할 때 필요한 **dropout** 구현하기 +- 효과적인 **교차 검증(cross validation)** 을 통해 뉴럴 네트워크 구조에서 사용되는 여러 가지 hyperparameter 들의 최적값 찾기 +- **컨볼루션 신경망** 구조에 대해 이해하고 이 모델들을 실제 데이터에 학습해보는 것을 경험하기 + +## 설치 +여러분은 다음 두가지 방법으로 숙제를 시작할 수 있습니다: Terminal.com을 이용한 가상 환경 또는 로컬 환경. + +### Terminal에서의 가상 환경. +Terminal에는 우리의 수업을 위한 서브도메인이 만들어져 있습니다. [www.stanfordterminalcloud.com](https://www.stanfordterminalcloud.com) 계정을 등록하세요. 이번 숙제에 대한 스냅샷은 [여기](https://www.stanfordterminalcloud.com/snapshot/6c95ca2c9866a962964ede3ea5813d4c2410ba48d92cf8d11a93fbb13e08b76a)에서 찾아볼 수 있습니다. 만약 수업에 등록되었다면, TA(see Piazza for more information)에게 이 수업을 위한 Terminal 예산을 요구할 수 있습니다. 처음 스냅샷을 실행시키면, 수업을 위한 모든 것이 설치되어 있어서 바로 숙제를 시작할 수 있습니다. [여기](/terminal-tutorial)에 Terminal을 위한 간단한 튜토리얼을 작성해 뒀습니다. + +### 로컬 환경 +[여기](http://vision.stanford.edu/teaching/cs231n/winter1516_assignment2.zip)에서 압축파일을 다운받고 다음을 따르세요. + +**[선택 1] Use Anaconda:** +과학, 수학, 공학, 데이터 분석을 위한 대부분의 주요 패키지들을 담고있는 [Anaconda](https://www.continuum.io/downloads)를 사용하여 설치하는 것이 흔히 사용하는 방법입니다. 설치가 다 되면 모든 요구사항(dependency)을 넘기고 바로 숙제를 시작해도 좋습니다. + +**[선택 2] 수동 설치, virtual environment:** +만약 Anaconda 대신 좀 더 일반적이면서 까다로운 방법을 택하고 싶다면 이번 과제를 위한 [virtual environment](http://docs.python-guide.org/en/latest/dev/virtualenvs/)를 만들 수 있습니다. 만약 virtual environment를 사용하지 않는다면 모든 코드가 컴퓨터에 전역적으로 종속되게 설치됩니다. Virtual environment의 설정은 아래를 참조하세요. ~~~bash -cd assignment2 -sudo pip install virtualenv # This may already be installed -virtualenv .env # Create a virtual environment -source .env/bin/activate # Activate the virtual environment -pip install -r requirements.txt # Install dependencies +cd assignment1 +sudo pip install virtualenv # 아마 먼저 설치되어 있을 겁니다. +virtualenv .env # virtual environment를 만듭니다. +source .env/bin/activate # virtual environment를 활성화 합니다. +pip install -r requirements.txt # dependencies 설치합니다. # Work on the assignment for a while ... -deactivate # Exit the virtual environment +deactivate # virtual environment를 종료합니다. ~~~ -**Download data:** -Once you have the starter code, you will need to download the CIFAR-10 dataset. -Run the following from the `assignment2` directory: +**데이터셋 다운로드:** +먼저 숙제를 시작하기전에 CIFAR-10 dataset를 다운로드해야 합니다. 아래 코드를 `assignment2` 폴더에서 실행하세요: ~~~bash cd cs231n/datasets ./get_datasets.sh ~~~ -**Compile the Cython extension:** Convolutional Neural Networks require a very -efficient implementation. We have implemented of the functionality using -[Cython](http://cython.org/); you will need to compile the Cython extension -before you can run the code. From the `cs231n` directory, run the following -command: +**Cython extension 컴파일하기:** 컨볼루션 신경망은 매우 효율적인 구현을 필요로 합니다. 
이 숙제를 위해서 [Cython](http://cython.org/)을 활용하여 여러 기능들을 구현해 놓았는데, 이를 위해 코드를 돌리기 전에 Cython extension을 컴파일 해야 합니다. `cs231n` 디렉토리에서 아래 명령어를 실행하세요: ~~~bash python setup.py build_ext --inplace ~~~ -**Start IPython:** -After you have the CIFAR-10 data, you should start the IPython notebook server -from the `assignment2` directory. If you are unfamiliar with IPython, you should -read our [IPython tutorial](/ipython-tutorial). - -**NOTE:** If you are working in a virtual environment on OSX, you may encounter -errors with matplotlib due to the -[issues described here](http://matplotlib.org/faq/virtualenv_faq.html). -You can work around this issue by starting the IPython server using the -`start_ipython_osx.sh` script from the `assignment2` directory; the script -assumes that your virtual environment is named `.env`. +**IPython 시작:** +CIFAR-10 data를 받았다면, `assignment1` 폴더의 IPython notebook server를 시작할 수 있습니다. IPython에 친숙하지 않다면 작성해둔 [IPython tutorial](/ipython-tutorial)를 읽어보는 것을 권장합니다. +**NOTE:** OSX에서 virtual environment를 실행하면, matplotlib 에러가 날 수 있습니다([이 문제에 관한 이슈](http://matplotlib.org/faq/virtualenv_faq.html)). IPython 서버를 `assignment2`폴더의 `start_ipython_osx.sh`로 실행하면 이 문제를 피해갈 수 있습니다; 이 스크립트는 virtual environment가 `.env`라고 되어있다고 가정하고 작성되었습니다. -### Submitting your work: -Whether you work on the assignment locally or using Terminal, once you are done -working run the `collectSubmission.sh` script; this will produce a file called -`assignment2.zip`. Upload this file under the Assignments tab on -[the coursework](https://coursework.stanford.edu/portal/site/W15-CS-231N-01/) -page for the course. +### 과제 제출: +로컬 환경이나 Terminal에 상관없이, 이번 숙제를 마쳤다면 `collectSubmission.sh`스크립트를 실행하세요. 이 스크립트는 `assignment2.zip`파일을 만듭니다. 이 파일을 [the coursework](https://coursework.stanford.edu/portal/site/W16-CS-231N-01/)에 업로드하세요. - -### Q1: Fully-connected Neural Network (30 points) -The IPython notebook `FullyConnectedNets.ipynb` will introduce you to our -modular layer design, and then use those layers to implement fully-connected -networks of arbitrary depth. To optimize these models you will implement several -popular update rules. +### Q1: Fully-connected 뉴럴 네트워크 (30 points) +`FullyConnectedNets.ipynb` IPython notebook 파일에서 모듈화된 레이어 디자인을 소개하고, 이 레이어들을 이용해서 임의의 깊이를 갖는 fully-connected 네트워크를 구현할 것입니다. 이 모델들을 최적화하기 위해서 자주 사용되는 여러 가지 업데이트 규칙들을 구현해야 할 것입니다. ### Q2: Batch Normalization (30 points) -In the IPython notebook `BatchNormalization.ipynb` you will implement batch -normalization, and use it to train deep fully-connected networks. +`BatchNormalization.ipynb` IPython notebook 파일에서는 batch normalization 을 구현하고, 이를 사용하여 깊은(deep) fully-connected 네트워크를 학습할 것입니다. ### Q3: Dropout (10 points) -The IPython notebook `Dropout.ipynb` will help you implement Dropout and explore -its effects on model generalization. - -### Q4: ConvNet on CIFAR-10 (30 points) -In the IPython Notebook `ConvolutionalNetworks.ipynb` you will implement several -new layers that are commonly used in convolutional networks. You will train a -(shallow) convolutional network on CIFAR-10, and it will then be up to you to -train the best network that you can. - -### Q5: Do something extra! (up to +10 points) -In the process of training your network, you should feel free to implement -anything that you want to get better performance. You can modify the solver, -implement additional layers, use different types of regularization, use an -ensemble of models, or anything else that comes to mind. 
If you implement these -or other ideas not covered in the assignment then you will be awarded some bonus -points. +`Dropout.ipynb` IPython notebook 파일에서는 Dropout을 구현하고, 이것이 모델의 일반화 성능에 어떤 영향을 미치는지 살펴볼 것입니다. + +### Q4: CIFAR-10 에서의 컨볼루션 신경망 (30 points) +`ConvolutionalNetworks.ipynb` IPython notebook 파일에서는 컨볼루션 신경망에서 흔히 사용되는 여러 새로운 레이어들을 구현할 것입니다. 먼저 CIFAR-10 데이터셋에 대해 (얕은, 깊지않은, 작은 규모의) 컨볼루션 신경망을 학습하고, 이후에는 가능한 한 최선의 노력을 다해서 최고의 성능을 뽑아내보길 바랍니다. +### Q5: 추가 과제: 뭔가 더 해보세요! (up to +10 points) +네트워크를 학습하는 과정 속에서, 더 좋은 성능을 위해 필요한 것이 있다면 얼마든지 추가적으로 구현하기 바랍니다. 최적화 기법(solver)을 바꿔도 좋고, 추가적인 레이어를 구현하거나, 다른 종류의 regularization 을 사용하고나, 모델 ensemble 등 생각나는 모든 것을 시도해 보세요. 이번 숙제에서 다루지 않은 새로운 아이디어를 구현한다면 추가 점수를 받을 수 있을 것입니다. diff --git a/assignments2016/assignment3.md b/assignments2016/assignment3.md index 452f4767..230ad3b0 100644 --- a/assignments2016/assignment3.md +++ b/assignments2016/assignment3.md @@ -51,7 +51,7 @@ cd cs231n/datasets ./get_pretrained_model.sh ~~~ -**Compile the Cython extension:** 컨볼루션 신경망은 매우 효율적인 구현이 필요합니다. [Cython](http://cython.org/)을 사용하여 필요한 기능들을 구현해 두어서, 코드를 돌리기 전에 Cython extension을 컴파일해 주어야 합니다. `cs231n` 디렉토리에서 다음 명령어를 입력하세요. +**Cython extension 컴파일하기:** 컨볼루션 신경망은 매우 효율적인 구현을 필요로 합니다. 이 숙제를 위해서 [Cython](http://cython.org/)을 활용하여 여러 기능들을 구현해 놓았는데, 이를 위해 코드를 돌리기 전에 Cython extension을 컴파일해 주어야 합크니다. `cs231n` 디렉토리에서 아래 명령어를 실행하세요: ~~~bash python setup.py build_ext --inplace From 75287eb00c7a3e1393d9bfaf4bbf4534e6e45ec3 Mon Sep 17 00:00:00 2001 From: Sanghun Kang Date: Tue, 31 May 2016 00:53:03 +0900 Subject: [PATCH 158/199] translating numpy broadcasting --- python-numpy-tutorial.md | 129 ++++++++++++++++----------------------- 1 file changed, 54 insertions(+), 75 deletions(-) diff --git a/python-numpy-tutorial.md b/python-numpy-tutorial.md index 8369b580..ac658a88 100644 --- a/python-numpy-tutorial.md +++ b/python-numpy-tutorial.md @@ -583,65 +583,62 @@ print a # 출력 "array([[11, 2, 3], # [10, 21, 12]]) ~~~ -**Boolean array indexing:** -Boolean array indexing lets you pick out arbitrary elements of an array. -Frequently this type of indexing is used to select the elements of an array -that satisfy some condition. Here is an example: +**불린 배열 인덱싱:** +불린 배열 인덱싱을 통해 배열속 요소를 취사 선택할 수 있습니다. +불린 배열 인덱싱은 특정 조건을 만족시키는 요소만 선택하고자 할 때 자주 사용됩니다. +다음은 그 예시입니다: ~~~python import numpy as np a = np.array([[1,2], [3, 4], [5, 6]]) -bool_idx = (a > 2) # Find the elements of a that are bigger than 2; - # this returns a numpy array of Booleans of the same - # shape as a, where each slot of bool_idx tells - # whether that element of a is > 2. +bool_idx = (a > 2) # 2보다 큰 a의 요소를 찾습니다; + # 이 코드는 a와 shape가 같고 불린 자료형을 요소로 하는 numpy 배열을 반환합니다, + # bool_idx의 각 요소는 동일한 위치에 있는 a의 + # 요소가 2보다 큰지를 말해줍니다. print bool_idx # 출력 "[[False False] - # [ True True] - # [ True True]]" + # [ True True] + # [ True True]]" -# We use boolean array indexing to construct a rank 1 array -# consisting of the elements of a corresponding to the True values -# of bool_idx +# 불린 배열 인덱싱을 통해 bool_idx에서 +# 참 값을 가지는 요소로 구성되는 +# rank 1인 배열을 구성할 수 있습니다. print a[bool_idx] # 출력 "[3 4 5 6]" -# We can do all of the above in a single concise statement: +# 위에서 한 모든것을 한 문장으로 할 수 있습니다: print a[a > 2] # 출력 "[3 4 5 6]" ~~~ -For brevity we have left out a lot of details about numpy array indexing; -if you want to know more you should -[read the documentation](http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html). +튜토리얼을 간결히 하고자 numpy 배열 인덱싱에 관한 많은 내용을 생략했습니다. 
조금 더 알고 싶다면 [문서](http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html)를 참조하세요.

-### Datatypes
-Every numpy array is a grid of elements of the same type.
-Numpy provides a large set of numeric datatypes that you can use to construct arrays.
-Numpy tries to guess a datatype when you create an array, but functions that construct
-arrays usually also include an optional argument to explicitly specify the datatype.
-Here is an example:
+
+### 자료형
+Numpy 배열은 동일한 자료형을 가지는 값들이 격자판 형태로 있는 것입니다.
+Numpy에선 배열을 구성하는데 사용할 수 있는 다양한 숫자 자료형을 제공합니다.
+Numpy는 배열이 생성될 때 자료형을 스스로 추측합니다. 그러나 배열을 생성할 때 명시적으로 특정 자료형을 지정할 수도 있습니다. 예시:

~~~python
import numpy as np

x = np.array([1, 2])  # Numpy가 자료형을 추측해서 선택
print x.dtype         # 출력 "int64"

x = np.array([1.0, 2.0])  # Numpy가 자료형을 추측해서 선택
print x.dtype             # 출력 "float64"

x = np.array([1, 2], dtype=np.int64)  # 특정 자료형을 명시적으로 지정
print x.dtype                         # 출력 "int64"
~~~

-You can read all about numpy datatypes
-[in the documentation](http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html).
+Numpy 자료형에 관한 자세한 사항은 [문서](http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html)를 참조하세요.

-### Array math
-Basic mathematical functions operate elementwise on arrays, and are available
-both as operator overloads and as functions in the numpy module:
+
+### 배열 연산
+기본적인 수학함수는 배열의 각 요소별로 동작하며 연산자를 통해 동작하거나 numpy 함수모듈을 통해 동작합니다:

~~~python
import numpy as np

x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

# 요소별 합; 둘 다 다음의 배열을 만듭니다
# [[ 6.0  8.0]
#  [10.0 12.0]]
print x + y
print np.add(x, y)

# 요소별 차; 둘 다 다음의 배열을 만듭니다
# [[-4.0 -4.0]
#  [-4.0 -4.0]]
print x - y
print np.subtract(x, y)

# 요소별 곱; 둘 다 다음의 배열을 만듭니다
# [[ 5.0 12.0]
#  [21.0 32.0]]
print x * y
print np.multiply(x, y)

# 요소별 나눗셈; 둘 다 다음의 배열을 만듭니다
# [[ 0.2         0.33333333]
#  [ 0.42857143  0.5       ]]
print x / y
print np.divide(x, y)

# 요소별 제곱근; 다음의 배열을 만듭니다
# [[ 1.          1.41421356]
#  [ 1.73205081  2.        ]]
print np.sqrt(x)
~~~

-Note that unlike MATLAB, `*` is elementwise multiplication, not matrix
-multiplication. We instead use the `dot` function to compute inner
-products of vectors, to multiply a vector by a matrix, and to
-multiply matrices. `dot` is available both as a function in the numpy
-module and as an instance method of array objects:
+MATLAB과 달리, '*'은 행렬곱이 아니라 요소별 곱입니다. Numpy에선 벡터의 내적, 벡터와 행렬의 곱, 행렬곱을 위해 '*'대신 'dot'함수를 사용합니다. 'dot'은 Numpy 모듈 함수로서도 배열 객체의 인스턴스 메소드로서도 이용 가능한 함수입니다:

~~~python
import numpy as np

x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])

v = np.array([9,10])
w = np.array([11, 12])

# 벡터의 내적; 둘 다 결과는 219
print v.dot(w)
print np.dot(v, w)

# 행렬과 벡터의 곱; 둘 다 결과는 rank 1 인 배열 [29 67]
print x.dot(v)
print np.dot(x, v)

# 행렬곱; 둘 다 결과는 rank 2인 배열
# [[19 22]
#  [43 50]]
print x.dot(y)
print np.dot(x, y)
~~~

-Numpy provides many useful functions for performing computations on
-arrays; one of the most useful is `sum`:
+Numpy는 배열 연산에 유용하게 쓰이는 많은 함수를 제공합니다.
가장 유용한 것 중 하나는 'sum'입니다:

~~~python
import numpy as np

x = np.array([[1,2],[3,4]])

print np.sum(x)          # 모든 요소를 합한 값을 연산; 출력 "10"
print np.sum(x, axis=0)  # 각 열에 대한 합을 연산; 출력 "[4 6]"
print np.sum(x, axis=1)  # 각 행에 대한 합을 연산; 출력 "[3 7]"
~~~

-You can find the full list of mathematical functions provided by numpy
-[in the documentation](http://docs.scipy.org/doc/numpy/reference/routines.math.html).
+Numpy가 제공하는 모든 수학함수들의 목록은 [문서](http://docs.scipy.org/doc/numpy/reference/routines.math.html)를 참조하세요.

-Apart from computing mathematical functions using arrays, we frequently
-need to reshape or otherwise manipulate data in arrays. The simplest example
-of this type of operation is transposing a matrix; to transpose a matrix,
-simply use the `T` attribute of an array object:
+배열연산을 하지 않더라도, 종종 배열의 모양을 바꾸거나 데이터를 처리해야 할 때가 있습니다.
+가장 간단한 예는 행렬의 주대각선을 기준으로 대칭되는 요소끼리 뒤바꾸는 것입니다; 이를 전치라고 하며 행렬을 전치하기 위해선, 간단하게 배열 객체의 'T' 속성을 사용하면 됩니다:

~~~python
import numpy as np

x = np.array([[1,2], [3,4]])
print x    # 출력 "[[1 2]
           #          [3 4]]"
print x.T  # 출력 "[[1 3]
           #          [2 4]]"

# rank 1인 배열을 전치할 경우 아무 일도 일어나지 않습니다:
v = np.array([1,2,3])
print v    # 출력 "[1 2 3]"
print v.T  # 출력 "[1 2 3]"
~~~

-Numpy provides many more functions for manipulating arrays; you can see the full list
-[in the documentation](http://docs.scipy.org/doc/numpy/reference/routines.array-manipulation.html).
+Numpy는 배열을 다루는 다양한 함수들을 제공합니다; 이러한 함수의 전체 목록은 [문서](http://docs.scipy.org/doc/numpy/reference/routines.array-manipulation.html)를 참조하세요.

-### Broadcasting
-Broadcasting is a powerful mechanism that allows numpy to work with arrays of different
-shapes when performing arithmetic operations. Frequently we have a smaller array and a
-larger array, and we want to use the smaller array multiple times to perform some operation
-on the larger array.
-For example, suppose that we want to add a constant vector to each
-row of a matrix. We could do it like this:
+
+### 브로드캐스팅
+브로드캐스팅은 Numpy에서 shape가 다른 배열 간에도 산술 연산이 가능하게 하는 메커니즘입니다. 종종 작은 배열과 큰 배열이 있을 때, 큰 배열을 대상으로 작은 배열을 여러 번 연산하고자 할 때가 있습니다. 예를 들어, 행렬의 각 행에 상수 벡터를 더하는 걸 생각해보세요. 이는 다음과 같은 방식으로 처리될 수 있습니다:

~~~python
import numpy as np

# 행렬 x의 각 행에 벡터 v를 더한 뒤,
# 그 결과를 행렬 y에 저장하고자 합니다
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
y = np.empty_like(x)   # x와 동일한 shape를 가지며 비어있는 행렬 생성

# 명시적 반복문을 통해 행렬 x의 각 행에 벡터 v를 더하는 방법
for i in range(4):
    y[i, :] = x[i, :] + v

# 이제 y는 다음과 같습니다
# [[ 2  2  4]
#  [ 5  5  7]
#  [ 8  8 10]
#  [11 11 13]]
print y
~~~

-This works; however when the matrix `x` is very large, computing an explicit loop
-in Python could be slow. Note that adding the vector `v` to each row of the matrix
-`x` is equivalent to forming a matrix `vv` by stacking multiple copies of `v` vertically,
-then performing elementwise summation of `x` and `vv`. We could implement this
-approach like this:
+위의 방식대로 하면 됩니다; 그러나 'x'가 매우 큰 행렬이라면, 파이썬의 명시적 반복문을 통해 연산을 수행했을 때 느릴 수 있습니다. 벡터 'v'를 행렬 'x'의 각 행에 더하는 것은 'v'를 여러 개 복사해 수직으로 쌓은 행렬 'vv'를 만들고 이 'vv'를 'x'에 더하는 것과 동일합니다.
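참고로, 위에서 설명한 방식(v를 세로로 쌓아 vv를 만든 뒤 요소별로 더하기)을 numpy의 `tile` 함수를 사용한다고 가정하고 간단히 스케치해 보면 대략 다음과 같은 모습이 됩니다 (변수 이름은 위 예제를 그대로 따랐습니다):

~~~python
import numpy as np

x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
vv = np.tile(v, (4, 1))  # v를 세로로 4번 쌓아 vv를 만듭니다
print vv                 # 출력 "[[1 0 1]
                         #          [1 0 1]
                         #          [1 0 1]
                         #          [1 0 1]]"
y = x + vv  # x와 vv를 요소별로 더합니다
print y     # 출력 "[[ 2  2  4]
            #          [ 5  5  7]
            #          [ 8  8 10]
            #          [11 11 13]]"
~~~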
이 과정을 아래의 코드로 구현할 수 있습니다: ~~~python import numpy as np From 65d707ba0a6686ba41b9f473a55ca6b27dbcf612 Mon Sep 17 00:00:00 2001 From: Sanghun Kang Date: Tue, 31 May 2016 15:35:46 +0900 Subject: [PATCH 159/199] translating numpy broadcasting --- python-numpy-tutorial.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/python-numpy-tutorial.md b/python-numpy-tutorial.md index ac658a88..ea98c3b4 100644 --- a/python-numpy-tutorial.md +++ b/python-numpy-tutorial.md @@ -387,7 +387,8 @@ hello('Fred', loud=True) # 출력 "HELLO, FRED!" 파이썬 함수에 관한 더 많은 정보는 [문서](https://docs.python.org/2/tutorial/controlflow.html#defining-functions)를 참조하세요. -### Classes + +### 클래스 파이썬에서 클래스를 정의하는 구문은 복잡하지 않습니다: @@ -412,6 +413,7 @@ g.greet(loud=True) # 인스턴스 메소드 호출; 출력 "HELLO, FRED!" 파이썬 클래스에 관한 더 많은 정보는 [문서](https://docs.python.org/2/tutorial/classes.html)를 참조하세요. + ## Numpy [Numpy](http://www.numpy.org/)는 파이썬이 계산과학분야에 이용될때 핵심 역할을 하는 라이브러리입니다. From 0b123b4c522821e34952a11db84fdc4ee66445ed Mon Sep 17 00:00:00 2001 From: myungsub Date: Wed, 1 Jun 2016 10:42:39 +0900 Subject: [PATCH 160/199] start linear-classify --- index.html | 8 +++---- linear-classify.md | 52 ++++++++++++++++++++++++++++------------------ 2 files changed, 36 insertions(+), 24 deletions(-) diff --git a/index.html b/index.html index 1072b9ab..713b0009 100644 --- a/index.html +++ b/index.html @@ -27,7 +27,7 @@ Assignment #1: 이미지 분류, kNN, SVM, Softmax, 뉴럴 네트워크 - +
@@ -37,7 +37,7 @@ 컨볼루션 신경망 - +
@@ -80,7 +80,7 @@ Python / Numpy Tutorial - +
@@ -131,7 +131,7 @@ 선형 분류: Support Vector Machine, Softmax - +
parametric 접근법, bias 트릭, hinge loss, cross-entropy loss, L2 regularization, 웹 데모

diff --git a/linear-classify.md b/linear-classify.md
index 03f76942..916e2abf 100644
--- a/linear-classify.md
+++ b/linear-classify.md
@@ -5,28 +5,30 @@ permalink: /linear-classify/

 Table of Contents:

-- [Intro to Linear classification](#intro)
-- [Linear score function](#score)
-- [Interpreting a linear classifier](#interpret)
-- [Loss function](#loss)
+- [선형 분류 소개](#intro)
+- [선형 스코어 함수](#score)
+- [선형 분류기 분석하기](#interpret)
+- [손실함수(Loss function)](#loss)
   - [Multiclass SVM](#svm)
-  - [Softmax classifier](#softmax)
+  - [Softmax 분류기](#softmax)
   - [SVM vs Softmax](#svmvssoftmax)
-- [Interactive Web Demo of Linear Classification](#webdemo)
-- [Summary](#summary)
+- [선형 분류 웹 데모](#webdemo)
+- [요약](#summary)

-## Linear Classification
-In the last section we introduced the problem of Image Classification, which is the task of assigning a single label to an image from a fixed set of categories. Morever, we described the k-Nearest Neighbor (kNN) classifier which labels images by comparing them to (annotated) images from the training set. As we saw, kNN has a number of disadvantages:
+
+## 선형 분류 (Linear Classification)

-- The classifier must *remember* all of the training data and store it for future comparisons with the test data. This is space inefficient because datasets may easily be gigabytes in size.
-- Classifying a test image is expensive since it requires a comparison to all training images.
+지난 섹션에서는 특정 카테고리에서 하나의 라벨을 이미지에 붙이는 문제인 이미지 분류에 대해 소개하였다. 또한, 학습 데이터셋에 있는 (라벨링된) 이미지들 중 가까이 있는 것들의 라벨을 활용하는 k-Nearest Neighbor (kNN) 분류기에 대해 설명하였다. 앞서 살펴보았듯이, kNN은 몇 가지 단점이 있다:

-**Overview**. We are now going to develop a more powerful approach to image classification that we will eventually naturally extend to entire Neural Networks and Convolutional Neural Networks. The approach will have two major components: a **score function** that maps the raw data to class scores, and a **loss function** that quantifies the agreement between the predicted scores and the ground truth labels. We will then cast this as an optimization problem in which we will minimize the loss function with respect to the parameters of the score function.
+- 이 분류기는 모든 학습 데이터를 *기억* 해야 하고, 나중에 테스트 데이터와 비교하기 위해 저장해 두어야 한다. 이것은 메모리 공간 관점에서 매우 비효율적이고, 일반적인 데이터셋들은 용량이 기가바이트 단위를 쉽게 넘기는 것이 많기 때문에 문제가 된다.
+- 테스트 이미지를 분류할 때 모든 학습 이미지와 다 비교를 해야 하기 때문에 계산량과 시간이 매우 많이 소요된다.
+
+**Overview**. 이번 노트에서는 이미지 분류를 위한 보다 강력한 방법들을 발전시켜나갈 것이고, 이는 나중에 뉴럴 네트워크와 컨볼루션 신경망으로 확장될 것이다. 이 방법들에는 두 가지 중요한 요소가 있다: 데이터를 클래스 스코어로 매핑시키는 **스코어 함수**, 그리고 예측한 스코어와 실제(ground truth) 라벨과의 차이를 정량화해주는 **손실 함수** 가 그 두 가지이다. 우리는 이를 최적화 문제로 바꾸어서 스코어 함수의 파라미터들에 대한 손실 함수를 최소화할 것이다.

-### Parameterized mapping from images to label scores
+
+### 이미지에서 라벨 스코어로의 파라미터화된 매핑(mapping)

 The first component of this approach is to define the score function that maps the pixel values of an image to confidence scores for each class. We will develop the approach with a concrete example. As before, let's assume a training dataset of images $ x_i \in R^D $, each associated with a label $ y_i $. Here $ i = 1 \dots N $ and $ y_i \in \{ 1 \dots K \} $. That is, we have **N** examples (each with a dimensionality **D**) and **K** distinct categories. For example, in CIFAR-10 we have a training set of **N** = 50,000 images, each with **D** = 32 x 32 x 3 = 3072 pixels, and **K** = 10, since there are 10 distinct classes (dog, cat, car, etc).
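To make these dimensions concrete, here is a minimal sketch, assuming numpy and the linear mapping $f(x_i; W, b) = W x_i + b$ that is introduced next; the array contents are random placeholders, only the shapes follow the CIFAR-10 numbers above:

~~~python
import numpy as np

# Dimensions from the CIFAR-10 example above
D, K = 3072, 10                     # D = 32 x 32 x 3 pixels, K = 10 classes

x = np.random.randn(D)              # one flattened image (placeholder values)
W = np.random.randn(K, D) * 0.001   # weight matrix (placeholder values)
b = np.zeros(K)                     # bias vector

scores = W.dot(x) + b               # one score per class
print scores.shape                  # prints "(10,)"
~~~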
We will now define the score function $f: R^D \mapsto R^K$ that maps the raw image pixels to class scores.

@@ -48,7 +50,8 @@ There are a few things to note:

 > Foreshadowing: Convolutional Neural Networks will map image pixels to scores exactly as shown above, but the mapping ( f ) will be more complex and will contain more parameters.

-### Interpreting a linear classifier
+
+### 선형 분류기 분석하기

 Notice that a linear classifier computes the score of a class as a weighted sum of all of its pixel values across all 3 of its color channels. Depending on precisely what values we set for these weights, the function has the capacity to like or dislike (depending on the sign of each weight) certain colors at certain positions in the image. For instance, you can imagine that the "ship" class might be more likely if there is a lot of blue on the sides of an image (which could likely correspond to water). You might expect that the "ship" classifier would then have a lot of positive weights across its blue channel weights (presence of blue increases score of ship), and negative weights in the red/green channels (presence of red/green decreases the score of ship).

@@ -106,13 +109,16 @@ With our CIFAR-10 example, $x_i$ is now [3073 x 1] instead of [3072 x 1] - (with

 **Image data preprocessing.** As a quick note, in the examples above we used the raw pixel values (which range from [0...255]). In Machine Learning, it is a very common practice to always perform normalization of your input features (in the case of images, every pixel is thought of as a feature). In particular, it is important to **center your data** by subtracting the mean from every feature. In the case of images, this corresponds to computing a *mean image* across the training images and subtracting it from every image to get images where the pixels range from approximately [-127 ... 127]. Further common preprocessing is to scale each input feature so that its values range from [-1, 1]. Of these, zero mean centering is arguably more important but we will have to wait for its justification until we understand the dynamics of gradient descent.

-### Loss function
+
+### 손실함수(Loss function)
+
 In the previous section we defined a function from the pixel values to class scores, which was parameterized by a set of weights $W$. Moreover, we saw that we don't have control over the data $ (x_i,y_i) $ (it is fixed and given), but we do have control over these weights and we want to set them so that the predicted class scores are consistent with the ground truth labels in the training data.

 For example, going back to the example image of a cat and its scores for the classes "cat", "dog" and "ship", we saw that the particular set of weights in that example was not very good at all: We fed in the pixels that depict a cat but the cat score came out very low (-96.8) compared to the other classes (dog score 437.9 and ship score 61.95). We are going to measure our unhappiness with outcomes such as this one with a **loss function** (or sometimes also referred to as the **cost function** or the **objective**). Intuitively, the loss will be high if we're doing a poor job of classifying the training data, and it will be low if we're doing well.

-#### Multiclass Support Vector Machine loss
+
+#### Multiclass Support Vector Machine 손실함수

 There are several ways to define the details of the loss function. As a first example we will first develop a commonly used loss called the **Multiclass Support Vector Machine** (SVM) loss.
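For reference while reading on, here is a minimal (unvectorized) numpy sketch of this data loss for a single example, assuming the margin formulation with $\Delta = 1$ that the next paragraphs derive; the helper name is illustrative only:

~~~python
import numpy as np

def L_i(x, y, W):
  # SVM data loss for one example (x, y): sum of margins over incorrect classes
  delta = 1.0                # the fixed margin Delta, discussed below
  scores = W.dot(x)          # class scores, shape (K,)
  margins = np.maximum(0, scores - scores[y] + delta)
  margins[y] = 0             # the correct class contributes no loss
  return np.sum(margins)
~~~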
The SVM loss is set up so that the SVM "wants" the correct class for each image to have a score higher than the incorrect classes by some fixed margin $\Delta$. Notice that it's sometimes helpful to anthropomorphise the loss functions as we did above: The SVM "wants" a certain outcome in the sense that the outcome would yield a lower loss (which is good).

@@ -152,6 +158,7 @@ A last piece of terminology we'll mention before we finish with this section is

+
 **Regularization**. There is one bug with the loss function we presented above. Suppose that we have a dataset and a set of parameters **W** that correctly classify every example (i.e. all scores are so that all the margins are met, and $L_i = 0$ for all i). The issue is that this set of **W** is not necessarily unique: there might be many similar **W** that correctly classify the examples. One easy way to see this is that if some parameters **W** correctly classify all examples (so loss is zero for each example), then any multiple of these parameters $ \lambda W $ where $ \lambda > 1 $ will also give zero loss because this transformation uniformly stretches all score magnitudes and hence also their absolute differences. For example, if the difference in scores between a correct class and a nearest incorrect class was 15, then multiplying all elements of **W** by 2 would make the new difference 30. In other words, we wish to encode some preference for a certain set of weights **W** over others to remove this ambiguity. We can do so by extending the loss function with a **regularization penalty** $R(W)$. The most common regularization penalty is the **L2** norm that discourages large weights through an elementwise quadratic penalty over all parameters:

@@ -252,7 +259,8 @@ where $C$ is a hyperparameter, and $y_i \in \\{ -1,1 \\} $. You can convince you

 **Aside: Other Multiclass SVM formulations.** It is worth noting that the Multiclass SVM presented in this section is one of few ways of formulating the SVM over multiple classes. Another commonly used form is the *One-Vs-All* (OVA) SVM which trains an independent binary SVM for each class vs. all other classes. Related, but less common to see in practice is also the *All-vs-All* (AVA) strategy. Our formulation follows the [Weston and Watkins 1999 (pdf)](https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es1999-461.pdf) version, which is a more powerful version than OVA (in the sense that you can construct multiclass datasets where this version can achieve zero data loss, but OVA cannot. See details in the paper if interested). The last formulation you may see is a *Structured SVM*, which maximizes the margin between the score of the correct class and the score of the highest-scoring incorrect runner-up class. Understanding the differences between these formulations is outside of the scope of the class. The version presented in these notes is a safe bet to use in practice, but the arguably simplest OVA strategy is likely to work just as well (as also argued by Rifkin et al. 2004 in [In Defense of One-Vs-All Classification (pdf)](http://www.jmlr.org/papers/volume5/rifkin04a/rifkin04a.pdf)).

-### Softmax classifier
+
+### Softmax 분류기

 It turns out that the SVM is one of two commonly seen classifiers. The other popular choice is the **Softmax classifier**, which has a different loss function. If you've heard of the binary Logistic Regression classifier before, the Softmax classifier is its generalization to multiple classes.
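As a rough preview of the loss introduced next, a minimal numpy sketch assuming unnormalized class scores f (the normalization and the numeric-stability shift are both discussed below; the helper name is illustrative only):

~~~python
import numpy as np

def softmax_loss_i(f, y):
  # cross-entropy loss for one example, given unnormalized scores f
  f = f - np.max(f)                  # shift scores for numeric stability
  p = np.exp(f) / np.sum(np.exp(f))  # normalized class probabilities
  return -np.log(p[y])               # negative log probability of the true class
~~~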
Unlike the SVM which treats the outputs $f(x_i,W)$ as (uncalibrated and possibly difficult to interpret) scores for each class, the Softmax classifier gives a slightly more intuitive output (normalized class probabilities) and also has a probabilistic interpretation that we will describe shortly. In the Softmax classifier, the function mapping $f(x_i; W) = W x_i$ stays unchanged, but we now interpret these scores as the unnormalized log probabilities for each class and replace the *hinge loss* with a **cross-entropy loss** that has the form:

@@ -301,6 +309,7 @@ p = np.exp(f) / np.sum(np.exp(f)) # safe to do, gives the correct answer

 **Possibly confusing naming conventions**. To be precise, the *SVM classifier* uses the *hinge loss*, or also sometimes called the *max-margin loss*. The *Softmax classifier* uses the *cross-entropy loss*. The Softmax classifier gets its name from the *softmax function*, which is used to squash the raw class scores into normalized positive values that sum to one, so that the cross-entropy loss can be applied. In particular, note that technically it doesn't make sense to talk about the "softmax loss", since softmax is just the squashing function, but it is a relatively commonly used shorthand.

+
 ### SVM vs. Softmax

 A picture might help clarify the distinction between the Softmax and SVM classifiers:

@@ -327,7 +336,8 @@ where the probabilites are now more diffuse. Moreover, in the limit where the we

 **In practice, SVM and Softmax are usually comparable.** The performance difference between the SVM and Softmax is usually very small, and different people will have different opinions on which classifier works better. Compared to the Softmax classifier, the SVM is a more *local* objective, which could be thought of either as a bug or a feature. Consider an example that achieves the scores [10, -2, 3] and where the first class is correct. An SVM (e.g. with desired margin of $\Delta = 1$) will see that the correct class already has a score higher than the margin compared to the other classes and it will compute loss of zero. The SVM does not care about the details of the individual scores: if they were instead [10, -100, -100] or [10, 9, 9] the SVM would be indifferent since the margin of 1 is satisfied and hence the loss is zero. However, these scenarios are not equivalent to a Softmax classifier, which would accumulate a much higher loss for the scores [10, 9, 9] than for [10, -100, -100]. In other words, the Softmax classifier is never fully happy with the scores it produces: the correct class could always have a higher probability and the incorrect classes always a lower probability and the loss would always get better. However, the SVM is happy once the margins are satisfied and it does not micromanage the exact scores beyond this constraint. This can intuitively be thought of as a feature: For example, a car classifier which is likely spending most of its "effort" on the difficult problem of separating cars from trucks should not be influenced by the frog examples, which it already assigns very low scores to, and which likely cluster around a completely different side of the data cloud.

-### Interactive web demo
+
+### 선형 분류 웹 데모
@@ -339,7 +349,8 @@ where the probabilites are now more diffuse. Moreover, in the limit where the we -### Summary + +### 요약 In summary, @@ -351,7 +362,8 @@ In summary, We now saw one way to take a dataset of images and map each one to class scores based on a set of parameters, and we saw two examples of loss functions that we can use to measure the quality of the predictions. But how do we efficiently determine the parameters that give the best (lowest) loss? This process is *optimization*, and it is the topic of the next section. -### Further Reading + +### 추가 읽기 자료료 These readings are optional and contain pointers of interest. From 8de67e3b995a02b29811f871928314aa82e914db Mon Sep 17 00:00:00 2001 From: jung_hojin Date: Thu, 2 Jun 2016 02:51:10 +0900 Subject: [PATCH 161/199] Sync captions with sound --- captions/En/Lecture4_en.srt | 1743 ++++++++++++++++++++++------------- 1 file changed, 1123 insertions(+), 620 deletions(-) diff --git a/captions/En/Lecture4_en.srt b/captions/En/Lecture4_en.srt index b3889d59..16daaa80 100644 --- a/captions/En/Lecture4_en.srt +++ b/captions/En/Lecture4_en.srt @@ -4,12 +4,12 @@ Okay, so let me dive into some administrative 2 -00:00:07,000 --> 00:00:14,669 -points first. So recall that +00:00:09,900 --> 00:00:14,900 +points first. So again, recall that assignment 1 is due next Wednesday. 3 -00:00:14,669 --> 00:00:19,050 +00:00:14,900 --> 00:00:19,050 You have about 150 hours left, and I use hours because there's a more @@ -84,7 +84,7 @@ there's quite a bit of, you know, more material 18 00:01:13,579 --> 00:01:16,548 -to beware of that might pop up in the +to be aware of that might pop up in the midterm, even though I'm covering some of 19 @@ -93,8 +93,8 @@ the most important stuff usually in the lecture, so do read through those lecture 20 -00:01:19,609 --> 00:01:25,618 -notes, they're complimentary to the lectures. +00:01:19,610 --> 00:01:25,618 +notes. They're complimentary to the lectures. And so the material for the midterm will be 21 @@ -124,7 +124,7 @@ training data, and this loss is made up of 26 00:01:49,379 --> 00:01:53,509 -two components, there's a data loss and +two components. There's a data loss and a regularization loss, right. And really what we want to do 27 @@ -144,7 +144,7 @@ gradient descent, where we iterate evaluating 30 00:02:07,069 --> 00:02:11,030 -the gradient on your weights doing a +the gradient on your weights, doing a parameter update and just repeating this 31 @@ -233,7 +233,7 @@ of computational graphs, instead of just taking, thinking of one giant expression 48 -00:03:22,479 --> 00:03:25,369 +00:03:22,480 --> 00:03:25,369 that you're going to derive with pen and paper the expression for the @@ -248,8 +248,8 @@ values flow, flowing through a 51 00:03:31,689 --> 00:03:35,509 -computational graph where you have these operations -along circles and they're +computational graph where you have these +operations along circles and they're 52 00:03:35,509 --> 00:03:38,979 @@ -272,7 +272,7 @@ these series of functions along the way, and at the end, we get a single number 56 -00:03:49,789 --> 00:03:53,590 +00:03:49,790 --> 00:03:53,590 which is the loss. 
And the reason that I'd like you to think about it this way is @@ -282,7 +282,7 @@ that, these expressions right now look very small and you might be able to 58 -00:03:57,068 --> 00:04:00,339 +00:03:57,070 --> 00:04:00,339 derive these gradients, but these expressions are, in computational graphs, are @@ -318,8 +318,8 @@ which is a paper from DeepMind, where 65 00:04:23,509 --> 00:04:26,329 -this is basically differentiable Turing -machine +this is basically differentiable +Turing machine 66 00:04:26,329 --> 00:04:30,128 @@ -332,7 +332,7 @@ performing on a tape is made smooth and is differentiable computer basically 68 -00:04:33,589 --> 00:04:39,519 +00:04:33,591 --> 00:04:39,519 and the computational graph of this is huge, and not only is this, this is not it @@ -367,7 +367,7 @@ Machine. It's just impossible, it would take like billions of pages, and so we 75 -00:05:03,649 --> 00:05:07,068 +00:05:03,651 --> 00:05:07,068 have to think about this more in terms of data structures of little functions @@ -377,9 +377,9 @@ transforming intermediate variables to guess the loss at the very end. Okay. So we're going 77 -00:05:11,709 --> 00:05:14,318 +00:05:11,711 --> 00:05:14,318 to be looking specifically at -competition graphs and how we can derive +computational graphs and how we can derive 78 00:05:14,319 --> 00:05:20,560 @@ -388,12 +388,12 @@ to the loss function at the very end. Okay. 79 00:05:20,560 --> 00:05:25,569 -So let's start off simple and concrete. So let's consider a very -small computational graph we have three +So let's start off simple and concrete. So let's +consider a very small computational graph where 80 00:05:25,569 --> 00:05:29,778 -scalars as inputs to this graph, x, y and +we have scalars as inputs to this graph, x, y and z, and they take on these specific values 81 @@ -407,19 +407,19 @@ or circuit, you'll hear me refer to these interchangeably either as a graph or 83 -00:05:38,668 --> 00:05:43,038 +00:05:38,670 --> 00:05:43,038 a circuit, so we have this graph that at -the end gives us this output +the end gives us this output -12. 84 00:05:43,038 --> 00:05:47,288 --12. Okay. So here what I've done is I've +Okay. So here what I've done is I've already pre-filled what we'll call the 85 00:05:47,288 --> 00:05:51,120 forward pass of this graph, where I set -the inputs and then I compute the outfits +the inputs and then I compute the outputs 86 00:05:51,120 --> 00:05:56,288 @@ -449,11 +449,11 @@ multiplication of q and z. And what I've written 91 00:06:14,788 --> 00:06:19,360 out here is, basically, what we want is the -gradients, the derivatives, df/dx, df/dy, df/dz +gradients, the derivatives, df/dx, df/dy, 92 00:06:19,360 --> 00:06:25,598 -And I've written out +df/dz. And I've written out the intermediate, these little gradients 93 @@ -496,8 +496,12 @@ identity function, so what is the derivative of it, 101 -00:06:56,019 --> 00:07:06,240 -identity mapping? What is the gradient of df by df? It's one, right? +00:06:56,021 --> 00:06:57,240 +identity mapping? + +101 +00:06:59,000 --> 00:07:06,240 +What is the gradient of df by df? It's one, right? So the identity has a gradient of one. 102 @@ -511,8 +515,11 @@ backwards through this graph. So, we want the gradient of f with respect to z. 104 -00:07:18,519 --> 00:07:27,089 -So what is that in this computational graph? (x+1) +00:07:18,519 --> 00:07:21,089 +So what is that in this computational graph? + +104 +00:07:24,019 --> 00:07:27,089 Okay, it's q, so we have that written out right 105 @@ -523,7 +530,7 @@ example? 
It's 3, right? So the gradient 106 00:07:32,879 --> 00:07:36,279 on z, according to this will, become -just 3. So I'm going to be writing the ingredients +just 3. So I'm going to be writing the gradients 107 00:07:36,279 --> 00:07:42,309 @@ -561,14 +568,14 @@ positive 3, will increase by 3h, so small change will result in a positive 114 -00:08:13,009 --> 00:08:21,560 -change in the output. Now the gradient -on q in this case will be, so df/dq +00:08:13,010 --> 00:08:18,560 +change in the output. Now the +gradient on q in this case will be 115 -00:08:21,560 --> 00:08:30,860 -is z. What is z? -4. Okay? So we get -a gradient of -4 on that path +00:08:21,009 --> 00:08:30,860 +So df/dq is z. What is z? -4. Okay? +So we get a gradient of -4 on that path 116 00:08:30,860 --> 00:08:34,599 @@ -600,14 +607,32 @@ I suppose. So we'd like to compute the gradient on f on y with respect to y 122 -00:08:54,328 --> 00:09:10,208 -and so the gradient on y with res, in -this particular graph, will become +00:08:54,328 --> 00:09:00,208 +and so the gradient on y in +this particular graph will become + +123 +00:09:03,909 --> 00:09:07,179 +Let's just guess and then we'll +see how this gets derived properly. + +123 +00:09:12,209 --> 00:09:15,208 +So I hear some murmurs of the right answer. +It will be -4. So let's see how. 123 -00:09:10,208 --> 00:09:23,979 -Let's just guess and then we'll see how this gets derived properly. So I hear some murmurs of the right answer. It will be -4. -So let's see how, so there are many ways to derive it at this point, because the expression is very small and you can kind of, glance at it, but the way I'd like to think about it is by applying chain rule, okay. +00:09:15,209 --> 00:09:17,800 +So there are many ways to derive it at this point + +123 +00:09:17,801 --> 00:09:21,000 +because the expression is very small and you can +kind of, glance at it, but the way I'd like to + +123 +00:09:21,001 --> 00:09:23,979 +think about this is by applying chain rule, okay. 124 00:09:23,980 --> 00:09:27,709 @@ -615,7 +640,7 @@ So the chain rule says that if you would like to derive the gradient of f on y 125 -00:09:27,708 --> 00:09:33,208 +00:09:27,710 --> 00:09:33,208 then it's equal to df/dq times dq/dy, right? And so we've @@ -640,7 +665,7 @@ of y on q, and that local influence of y on q is 1, because that's the local 130 -00:09:52,448 --> 00:09:58,969 +00:09:52,450 --> 00:09:58,969 as I'll refer to as the local derivative of y for the plus gate, and so the chain rule @@ -661,8 +686,8 @@ them. So we'll get -4 times 1 134 00:10:10,948 --> 00:10:14,588 -And so, this is kind of the, the crux of how back -propagation works. This is a very +And so, this is kind of the, the crux of how +backpropagation works. This is a very 135 00:10:14,589 --> 00:10:18,209 @@ -670,14 +695,14 @@ important to understand here that, we have these two pieces that we keep 136 -00:10:18,208 --> 00:10:24,289 +00:10:18,210 --> 00:10:24,289 multiplying through when we perform the chain rule. We have q computed x + y, and 137 00:10:24,289 --> 00:10:29,379 the derivative x and y, with respect to that -single expression is one and one. So keep +single expression is 1 and 1. So keep 138 00:10:29,379 --> 00:10:32,749 @@ -715,12 +740,12 @@ the correct thing to do is to multiply them, so we end up with -4 times 1 145 -00:11:00,350 --> 00:11:05,189 +00:11:00,351 --> 00:11:05,189 gets you -4. 
And so the way this works out is, basically what this is

146
00:11:05,190 --> 00:11:08,649
saying is that the influence of y on the
final output of the circuit is -4

147
00:11:08,649 --> 00:11:14,000
so if we increase y by some small
amount h, then the output of the circuit

148
00:11:14,000 --> 00:11:18,230
will react by decreasing by 4h. And the way
that end up working out is y has a

149
00:11:18,230 --> 00:11:21,810
positive influence on q, so increasing
y, slightly increases q

150
00:11:21,810 --> 00:11:27,959
which slightly decreases the output of the circuit,
okay? So chain rule is kind of giving us this

151
00:11:27,960 --> 00:11:29,320
correspondence. Go ahead.

152
00:11:29,320 --> 00:11:38,360
(Student is asking question)

152
00:11:38,360 --> 00:11:42,559
Yeap, thank you. So we're going to get into this.
You'll see many, basically this entire class

152
00:11:42,559 --> 00:11:45,259
is about this, so you'll see many
many instantiations of this and

153
00:11:45,259 --> 00:11:48,889
we'll go into much more detail on how
this works with vectors, but for now

154
00:11:48,889 --> 00:11:51,740
these are scalars, so these will just be

155
00:11:51,740 --> 00:11:53,740
numbers flowing through the graph.

156
00:11:53,740 --> 00:11:57,009
It will always be just be vectors and numbers.

157
00:11:57,009 --> 00:12:02,230
Raw vectors, numbers. Okay, and looking at x, we
have a very similar thing that happens.

158
00:12:02,230 --> 00:12:05,889
We want df/dx. That's our final objective,
 but, and we have to combine it.

159
00:12:05,889 --> 00:12:09,799
We know what the x is, what is x's influence on q
and what is q's influence

160
00:12:09,799 --> 00:12:14,000
on the final output of the circuit, and
so that ends up being the chain rule

161
00:12:14,000 --> 00:12:18,000
so we multiply the local gradient with
the gradient from above, and we end up

162
00:12:18,000 --> 00:12:22,000
with -4 on x as well. Okay, so
that's the basic idea here, and so

163
00:12:22,000 --> 00:12:27,000
the way to think about it is, during the
forward pass, these gates compute their

164
00:12:27,000 --> 00:12:31,000
outputs, and during the backward pass,
the gradients flow back along the graph

165
00:12:31,000 --> 00:12:35,000
and every gate chains the gradient from
above into the gradients on its inputs.

166
00:12:35,000 --> 00:12:40,000
So the way this works, you are a gate
embedded in a circuit, and

167
00:12:40,000 --> 00:12:46,169
these inputs flow into you, and
you're not sure what happens, but by the

168
00:12:46,169 --> 00:12:50,939
end of the circuit the loss gets computed, okay? And
that's the forward pass and then we're

169
00:12:50,940 --> 00:12:56,250
proceeding recursively in the reverse
order backwards, but before that actually,

170
00:12:56,250 --> 00:13:01,120
before I get to that part, right away when I get
x and y, the thing I'd like to point out that

171
00:13:01,120 --> 00:13:05,000
right away, I can compute the local
gradients on my inputs,

172
00:13:05,000 --> 00:13:09,000
so I can do that right away,

173
00:13:09,000 --> 00:13:12,000
without being aware of anything else in
the circuit,

174
00:13:12,000 --> 00:13:14,789
because I'm just a gate and I know what

175
00:13:14,789 --> 00:13:18,009
I'm performing, like say addition or
multiplication, so I know the influence that

176
00:13:18,009 --> 00:13:21,000
x and y have on my output value.

177
00:13:21,000 --> 00:13:25,389
And then eventually I'll learn about
what happens

178
00:13:25,389 --> 00:13:29,769
near the end, so the loss gets computed
and now we're going backwards, I'll eventually learn

179
00:13:29,769 --> 00:13:33,000
what is my influence on the final
output of the circuit,

180
00:13:33,000 --> 00:13:37,839
So I'll learn what is dL/dz in there.

181
00:13:37,839 --> 00:13:41,419
The gradient will flow into me and what I
have to do is I have to chain that

182
00:13:41,419 --> 00:13:45,278
gradient through this recursive case, so
I have to make sure to chain the

183
00:13:45,278 --> 00:13:50,000
gradient through my local gradients that
I've computed ahead of time, and

184
00:13:50,000 --> 00:13:55,000
that's what the chain rule really is.
It's this multiplication of the local

185
00:13:55,000 --> 00:14:00,000
gradient with the gradient from above,
and that gives us the gradients on the

186
00:14:00,000 --> 00:14:04,000
inputs, and those get passed on
backwards to the gates before us.

187
00:14:04,000 --> 00:14:08,000
So we take the gradient from above and we
chain it through the local gradient of the

188
00:14:08,000 --> 00:14:12,668
gate on the output, and we chain it
through the local gradient, and the same

189
00:14:12,669 --> 00:14:18,509
thing goes for y.
So it's just a
multiplication of that guy, that gradient

191
00:14:18,509 --> 00:14:22,000
with the local gradient, and those
gradients keep flowing backwards

192
00:14:22,000 --> 00:14:27,000
through the circuit, so these gates just
communicate to each other with these

193
00:14:27,000 --> 00:14:32,000
gradients, and they all get multiplied

194
00:14:32,000 --> 00:14:37,000
along the way, and basically the gates
communicate to each other the influence

195
00:14:37,000 --> 00:14:41,000
that they have on the final loss, so
these gradients are flowing backwards

196
00:14:41,000 --> 00:14:46,788
and they only interact, they get
multiplied through the circuit by these

197
00:14:46,788 --> 00:14:51,019
local gradients and you end up with, and
this process is called backpropagation.

198
00:14:51,019 --> 00:14:54,489
It's a way of computing, through a
recursive application of chain rule

199
00:14:54,489 --> 00:14:59,000
through the computational graph, the
influence of every single intermediate

200
00:14:59,000 --> 00:15:02,639
value in that graph on the final loss
function. So we're going to see, so

201
00:15:02,639 --> 00:15:06,919
let's see an example, a
specific example that is a slightly

202
00:15:06,918 --> 00:15:11,298
larger and we'll work through it in detail.
But I don't know if there are any questions at

203
00:15:11,298 --> 00:15:13,000
this point that anyone would like to ask.
Go ahead.

204
00:15:13,001 --> 00:15:16,000
What happens if z is used by two other nodes?

204
00:15:16,001 --> 00:15:19,000
If z is used by multiple nodes, I'm going to come back to that.

205
00:15:19,000 --> 00:15:23,537
You add the gradients. The gradient, the
correct thing to do is you add them.

206
00:15:23,538 --> 00:15:29,928
So if z is being influenced in multiple places
in the circuit, the backward flows will add.

207
00:15:29,928 --> 00:15:31,539
I will come back to that point. Go ahead.

208
00:15:31,539 --> 00:15:53,038
(Student is asking question)

208
00:15:53,039 --> 00:15:59,139
Yeap. So I think, I would've repeated your question,
but you're jumping ahead like 100 slides.

208
00:15:59,539 --> 00:16:03,139
So we're going to get the all of those
issues and we're going to see, you're

209
00:16:03,139 --> 00:16:07,000
asking about what happens with
vectorized operations, we'll get to that

210
00:16:07,000 --> 00:16:11,000
in a bit. Okay, so let's work with a
specific example here. So we have this

211
00:16:11,000 --> 00:16:15,000
function, this is a two-dimensional
sigmoid neuron, so it's

212
00:16:15,000 --> 00:16:19,000
one-over-

213
00:16:19,000 --> 00:16:22,850
one-plus-e-to-the-whatever, so the number of

214
00:16:22,850 --> 00:16:29,000
inputs here is five, and we're computing that function
and we have a single output over there, okay?

215
00:16:29,000 --> 00:16:32,490
And I've translated that mathematical expression
into this computational graph form, so

216
00:16:32,490 --> 00:16:36,000
we have the value of x and y flow in,
wait, w and x flow in,

217
00:16:36,000 --> 00:16:40,000
and then we do multiplications, we add
them all up,

218
00:16:40,000 --> 00:16:43,000
then we do a times -1, we exponentiate

219
00:16:43,000 --> 00:16:46,129
that and then we add one and then we

220
00:16:46,129 --> 00:16:49,769
finally divide and we get the result of
the expression. And so what we're going to do

221
00:16:49,769 --> 00:16:52,409
now is we're going to backpropagate
through this expression. We're going to

222
00:16:52,409 --> 00:16:56,500
compute what the influence of every
single input value is on the output of

222
00:16:56,500 --> 00:16:59,230
this expression, what is the gradient here.
Yeap, go ahead.

222
00:16:59,231 --> 00:17:10,229
(Student is asking question)

223
00:17:10,230 --> 00:17:15,229
So for now, so you're concerned about the
interpretation of plus may be in these circles.

223
00:17:15,230 --> 00:17:22,039
For now, let's just assume that this plus is a binary
plus.
It's a binary plus gate, and we have there

224
00:17:22,039 --> 00:17:26,519
plus one gate. I'm making up these gates
on the spot, and we'll see that what is a

225
00:17:26,519 --> 00:17:31,000
gate or is not a gate is kind of up to you.
I'll come back to this point in a bit.

226
00:17:31,001 --> 00:17:35,639
So for now, I just like, we have several more
gates that we're using throughout, and so

227
00:17:35,640 --> 00:17:38,650
I'd just like to write out as we go
through this example several of these

228
00:17:38,650 --> 00:17:42,720
derivatives. So we have exponentiation and
we know for every little local gate what these

229
00:17:42,720 --> 00:17:49,048
local gradients are, right, so we can
derive those using calculus, so

230
00:17:49,048 --> 00:17:52,798
1/x, its derivative is -1/x^2, and so on,

231
00:17:52,798 --> 00:17:56,038
which I'm assuming that you have
memorized in terms of what the gradients

232
00:17:56,040 --> 00:17:58,970
look like. So we're going to start off
at the end of the circuit and I've

233
00:17:58,970 --> 00:18:03,450
already filled in a 1.00 in the
back because that's how we always

234
00:18:03,450 --> 00:18:04,890
start this recursion with a 1.0

235
00:18:04,891 --> 00:18:10,519
right, since that's the gradient
on the identity function. Now we're going

236
00:18:10,519 --> 00:18:14,000
to backpropagate through this 1/x operation, okay?

237
00:18:14,000 --> 00:18:19,000
So the derivative of 1/x, the
local gradient, is -1/x^2,

238
00:18:19,000 --> 00:18:24,000
so that's what the gate knows,

239
00:18:24,000 --> 00:18:29,000
it knows its input was 1.37 and
it knows its local gradient,

240
00:18:29,000 --> 00:18:34,000
and now during backpropagation it has
to chain that local gradient with the

241
00:18:34,000 --> 00:18:39,000
gradient from above, which here happens
to be 1,

242
00:18:39,000 --> 00:18:44,789
which is easy because it happens to be
at the end. So what ends up being the

243
00:18:44,789 --> 00:18:49,349
expression for the backpropagated
gradient here, from the 1/x gate?

244
00:18:54,049 --> 00:18:59,048
The chain rule always has two pieces: local
gradient times the gradient from the top

244
00:18:59,049 --> 00:19:01,300
or from above.

245
00:19:04,301 --> 00:19:08,069
(Student is answering)

245
00:19:08,301 --> 00:19:12,500
Um, yeah. Okay. Yeah, so that's correct.

245
00:19:12,501 --> 00:19:18,069
So we get -1/x^2, which is the gradient
df/dx. So that is the local gradient.

246
00:19:18,069 --> 00:19:23,000
-1/1.37^2, and then times 1,

247
00:19:23,000 --> 00:19:28,000
which is the gradient from above, so we apply

248
00:19:28,000 --> 00:19:34,849
chain rule right away here and the output
is -0.53. So that's the gradient on

249
00:19:34,850 --> 00:19:38,798
that piece of the wire, where this value
was flowing, okay. So it has a negative

250
00:19:38,798 --> 00:19:43,000
effect on the output, and you might
expect that right, because if you

251
00:19:43,000 --> 00:19:47,849
increase this value and then it goes
through a gate of 1/x, then if you

252
00:19:47,851 --> 00:19:50,939
increase this, 1/x get smaller,
so that's why you're seeing a negative

253
00:19:50,940 --> 00:19:55,620
gradient, right. So we're going to
continue backpropagation here. The next gate

254
00:19:55,621 --> 00:19:58,400
in the circuit, it's adding a constant of 1,

255
00:19:58,400 --> 00:20:01,048
so the local gradient, if you look at

256
00:20:01,048 --> 00:20:06,960
adding a constant to a value, the
gradient on x is just 1, right,

256
00:20:06,961 --> 00:20:13,169
from basic calculus.
And so the chained
gradient here that we continue along the wire

257
00:20:13,169 --> 00:20:17,868
will be...
(Student is answering)

257
00:20:17,869 --> 00:20:22,940
We have a local gradient, which is
1, times the gradient from above the

258
00:20:22,940 --> 00:20:28,590
gate, which it has just learned is -0.53, okay?
So -0.53 continues along the

259
00:20:28,590 --> 00:20:34,709
wire unchanged. And intuitively that
makes sense right, because this value

260
00:20:34,709 --> 00:20:39,000
floats and it has some influence on the end,

261
00:20:39,000 --> 00:20:43,000
and now if you're adding one, then its
influence on the

262
00:20:43,000 --> 00:20:47,000
end doesn't change, because the one is a
constant, so if you change this value

263
00:20:47,000 --> 00:20:51,548
the rate of change of the
end will be the same, because the rate of

change doesn't change through the +1 gate.

264
00:20:51,548 --> 00:20:57,859
It's just a constant offset, okay? We continue
derivation here. So the gradient of e^x is

265
00:20:57,859 --> 00:21:01,599
e^x, so to continue backpropagation
we're going to perform,

266
00:21:01,599 --> 00:21:05,000
so this gate saw input of -1.

267
00:21:05,000 --> 00:21:08,329
It computed its output, and the

268
00:21:08,329 --> 00:21:12,259
gradient from above is -0.53.
So to continue backpropagation

269
00:21:12,259 --> 00:21:15,000
here and apply chain rule, we would receive...

269
00:21:15,000 --> 00:21:17,400
(Student is answering)

269
00:21:17,400 --> 00:21:20,000
Okay, so these are most of
the rhetorical questions so I'm

270
00:21:20,000 --> 00:21:25,119
not sure, but yeah, basically
e^(-1) which is the e^x,

271
00:21:25,119 --> 00:21:30,569
the x input to this exp gate times the chain rule,
right, so the gradient from above is -0.53

272
00:21:30,569 --> 00:21:35,269
times the local gradient, so what
is the effect on me and what do I have an

273
00:21:35,269 --> 00:21:39,069
effect on the final end of the circuit,
those are being always multiplied. So we

274
00:21:39,069 --> 00:21:44,000
end up with -0.2 in this case. So
the next gate here, what

275
00:21:44,000 --> 00:21:50,279
ends up happening, what happens to the
gradient when you do a times -1 in the

276
00:21:50,279 --> 00:21:53,139
computational graph?

276
00:21:53,139 --> 00:21:57,139
It flips around, right?
Because we have +basically, a constant multiply of input 277 00:21:57,140 --> 00:22:02,038 @@ -1368,7 +1463,8 @@ which happened to be a constant of 278 00:22:02,038 --> 00:22:05,548 -gave us -1 in the forward pass, and so now we have to +gave us -1 in the forward pass, +and so now we have to 279 00:22:05,548 --> 00:22:09,569 @@ -1383,15 +1479,15 @@ So now we're continuing backpropagation 281 00:22:14,880 --> 00:22:21,110 We're backpropagating '+' and this '+' operation -has multiple input here, the gradient, +has multiple inputs here, the gradient, 282 -00:22:21,109 --> 00:22:25,599 -the local gradient for the '+' gate is 1 +00:22:21,110 --> 00:22:25,599 +the local gradient for the plus gate is 1 and 1, so what ends up happening to, 283 -00:22:25,599 --> 00:22:42,359 +00:22:25,599 --> 00:22:27,359 what gradients flow along the output wires? 284 @@ -1431,7 +1527,7 @@ they'll get multiplied and when you multiply by 1 291 00:23:14,339 --> 00:23:18,129 -something remains unchanged. So a plus +something remains unchanged. So a '+' gate, it's kind of like a gradient 292 @@ -1445,7 +1541,7 @@ the gradients equally to all of its children. And so we've already received 294 -00:23:26,559 --> 00:23:32,139 +00:23:26,560 --> 00:23:32,139 one of the inputs is gradient 0.2 here on the very final output of the circuit @@ -1457,33 +1553,39 @@ through a series of applications of 296 00:23:35,970 --> 00:23:42,450 chain rule along the way. There was another -'+' gate that I skipped over, and so this +plus gate that I've skipped over, and so this 297 00:23:42,450 --> 00:23:47,090 0.2 kind of distributes to both -0.2. 0.2 equally so we've already done a +0.2 0.2 equally so we've already done a 298 -00:23:47,089 --> 00:23:51,750 -'+' gate, and there's a '*' gate there, +00:23:47,090 --> 00:23:51,750 +plus gate, and there's a multiply gate there, and so now we're going to backpropagate 299 00:23:51,750 --> 00:23:55,940 -through that multiply operation +through that multiply operation. +And so the local grad, so the, + +300 +00:23:55,940 --> 00:24:02,450 +so what will be the gradients for w0 and x0? +What will be the gradient for w0, specifically? 300 -00:23:55,940 --> 00:24:06,450 -so what will be the gradient for w0 and x0? -What will be the gradient for w0 specifically? +00:24:02,450 --> 00:24:06,450 +(Student is answering) 301 -00:24:06,450 --> 00:24:19,059 -Someone say 0? 0 will be wrong. It will be, so the gradient w1 will be, w0 sorry, will be +00:24:06,450 --> 00:24:17,059 +Did someone say 0? 0 will be wrong. It will be, +so the gradient w1 will be, w0 sorry, will be 302 -00:24:19,059 --> 00:24:24,389 +00:24:17,059 --> 00:24:24,389 -1 * 0.2. Good. And the gradient on x0 will be, there is a bug, by the way, in the slide @@ -1493,7 +1595,7 @@ that I just noticed like few minutes before I actually created the class. 304 -00:24:27,839 --> 00:24:34,289 +00:24:27,840 --> 00:24:34,289 Created the, started the class. So you see 0.39 there it should be 0.4. It's @@ -1515,12 +1617,12 @@ just like I've written out over there. 308 00:24:45,400 --> 00:24:50,980 So that's what the output should be there. 
-Okay, so that what we've backpropagated this +Okay, so we've backpropagated this 309 00:24:50,980 --> 00:24:55,190 -circuit here and we've backpropagated through this -expression and so you might imagine in +circuit here and we've backpropagated through +this expression and so you might imagine in 310 00:24:55,190 --> 00:24:59,289 @@ -1548,19 +1650,40 @@ get our inputs, and backpropagate just means apply chain rule many many times 315 -00:25:14,150 --> 00:25:21,720 +00:25:14,150 --> 00:25:18,220 and we'll see how that is implemented in a bit. -Sorry, did you have a question? (Student is asking question) +Sorry, did you have a question? 316 -00:25:21,720 --> 00:25:31,769 -Oh yes, so I'm going to skip that because it's the same. -So I'm going to skip the other '*' gate. Any other questions at this point? (Student is asking question) +00:25:18,220 --> 00:25:20,520 +(Student is asking question) + +316 +00:25:20,521 --> 00:25:23,021 +Oh yes, so I'm going to skip +that because it's the same. + +316 +00:25:23,021 --> 00:25:27,821 +So I'm going to skip the other times gate. +Any other questions at this point? + +316 +00:25:27,821 --> 00:25:32,969 +(Student is asking question) + +317 +00:25:32,969 --> 00:25:37,200 +That's right. so the costs of forward and +backward propagation are roughly equal. 317 -00:25:31,769 --> 00:25:45,869 -That's right. so the costs of forward and backward -propagation are roughly equal. Well, it should be, it almost always ends +00:25:37,200 --> 00:25:44,100 +(Student is asking question) + +317 +00:25:44,100 --> 00:25:45,869 +Well, it should be, it almost always ends 318 00:25:45,869 --> 00:25:49,500 @@ -1568,8 +1691,12 @@ up being basically equal when you look at timings, usually the backward pass is slightly 319 -00:25:49,500 --> 00:25:58,710 -slower, but yeah. Okay, so let's see, one thing I +00:25:49,500 --> 00:25:52,000 +slower, but yeah. + +319 +00:25:55,000 --> 00:25:58,710 +Okay, so let's see, one thing I wanted to point out, before we move on, is that 320 @@ -1578,13 +1705,13 @@ the setting of these gates, like these gates are arbitrary, so one thing I could 321 -00:26:02,349 --> 00:26:06,509 +00:26:02,350 --> 00:26:06,509 have done, for example, is, some of you may know this, I can collapse these gates 322 00:26:06,509 --> 00:26:10,549 -into one gate if I wanted to, for example. +into one gate if I wanted to. For example, There is something called the sigmoid function 323 @@ -1664,26 +1791,34 @@ went into the sigmoid gate, and 0.73 went out. 338 00:27:11,750 --> 00:27:18,759 -So 0.73 is sigma of x, okay? And we want the -local gradient which is, as we've seen +So 0.73 is sigma of x, okay? And now we want +the local gradient which is, as we've seen 339 -00:27:18,759 --> 00:27:26,450 -from the math that I performed there (1 - sigma(x)) * sigma(x) +00:27:18,759 --> 00:27:22,559 +from the math that I performed there +(1 - sigma(x)) * sigma(x) + +339 +00:27:22,559 --> 00:27:26,450 so you get, sigma(x) is 0.73, multiplying (1 - 0.73) 340 00:27:26,450 --> 00:27:31,170 that's the local gradient and then times, -we happened to be at the end +we happen to be at the end + +341 +00:27:31,170 --> 00:27:34,170 +of the circuit, so times 1.0, +which I'm not even writing. 341 -00:27:31,170 --> 00:27:36,330 -of the circuit, so times 1.0, which I'm not even writing. +00:27:34,170 --> 00:27:36,330 So we end up with 0.2. 
And of course we

342
00:27:36,330 --> 00:27:37,649
get the same answer

343
00:27:37,650 --> 00:27:42,220
as before,
because calculus works, but basically we

344
00:27:42,220 --> 00:27:44,480
could have broken up
this expression down and

345
00:27:44,480 --> 00:27:47,450
done piece by piece, or we could have
had a single sigmoid gate, and

346
00:27:47,450 --> 00:27:51,000
so you get to choose where you draw
the boundaries of these gates, so

347
00:27:51,000 --> 00:27:54,000
you can be very fine-grained or you can,

348
00:27:54,000 --> 00:27:55,829
intuitively, cluster these expressions
into single gates if it's very efficient

349
00:27:55,829 --> 00:27:59,800
or easy to derive the local gradients
because then those become your pieces.

349
00:28:00,000 --> 00:28:05,819
(Student is asking question)

350
00:28:05,819 --> 00:28:10,529
Yes. So the question is, do libraries
typically do that? Do they worry about, you know

351
00:28:10,529 --> 00:28:14,058
what's efficient to
compute and the answer is yeah, I would say so,

352
00:28:14,058 --> 00:28:17,480
So if you noticed that there are some
piece of operation you'd like to do over

353
00:28:17,480 --> 00:28:21,000
and over again, and it has a very simple
local gradient, then that's something

354
00:28:21,000 --> 00:28:24,900
very appealing to actually create a single
unit out of, and we'll see some of those

355
00:28:24,900 --> 00:28:30,230
examples actually in a bit I think.
Okay, I'd like to also point out that once you,

356
00:28:30,230 --> 00:28:32,490
these gates can have some very

357
00:28:32,490 --> 00:28:36,000
intuitive interpretations of what
happens to the gradients as they flow

358
00:28:36,000 --> 00:28:40,000
through the graph, so once you do enough
of these exercises, you start

359
00:28:40,000 --> 00:28:44,000
looking at computational graphs
intuitions about how these gradients flow, and this

360
00:28:44,000 --> 00:28:47,849
by the way, helps you debug some issues
like, say, we'll go to vanishing gradient problem

361
00:28:47,850 --> 00:28:52,029
it's much easier to understand
exactly what's going wrong in your

362
00:28:52,029 --> 00:28:56,000
optimization if you understand how
these gradients flow, so let's look at

363
00:28:56,000 --> 00:29:02,740
some intuitions for example, we already

364
00:29:02,740 --> 00:29:07,609
saw the add gate. It has a local
gradient of 1 to all of its inputs, so

365
00:29:07,609 --> 00:29:11,279
whatever gradient flows into the add
gate, it just distributes it equally

366
00:29:11,279 --> 00:29:15,000
to all of its inputs, so the add gate is
a gradient distributor.

367
00:29:15,000 --> 00:29:20,000
The max gate is instead a gradient
router, so if you look at what the

368
00:29:20,000 --> 00:29:25,000
max gate does in the forward pass,
it routes the larger of its inputs

369
00:29:25,000 --> 00:29:30,000
through, and so in the backward pass,

370
00:29:30,000 --> 00:29:35,000
if you think about it, only the larger
input had any effect on the output,

371
00:29:35,000 --> 00:29:39,000
so if you
think about it, the gradient on the larger

372
00:29:39,000 --> 00:29:42,569
one of your inputs, whichever one was larger

373
00:29:42,570 --> 00:29:46,389
the gradient on that guy is one and all this,
and the smaller one has a gradient of 0.

374
00:29:46,390 --> 00:29:49,629
And intuitively that's because the smaller
one didn't actually have any effect,

375
00:29:49,629 --> 00:29:53,220
only the larger
guy is larger and that's what ends up

376
00:29:53,220 --> 00:29:57,009
propagating through the gate.
So you end up with a gradient of 1 on the

377
00:29:57,009 --> 00:30:03,140
larger input, so in the forward pass, the
max gate looks at its inputs, picks the largest of

378
00:30:03,140 --> 00:30:06,950
all of them and that's the value that
I propagated through the circuit.

380
00:30:09,550 --> 00:30:12,909
At backpropagation time, I'm just going to
receive my gradient from above and I'm

381
00:30:12,910 --> 00:30:16,590
going to route it to whoever was my
largest input. So it's a gradient router.

382
00:30:17,000 --> 00:30:22,569
And the multiply gate is a gradient switcher.
Actually I don't think that's a very good

383
00:30:22,569 --> 00:30:26,960
way to look at it, but I'm referring to
the fact that it's not actually

384
00:30:26,960 --> 00:30:28,150
nevermind about that part.

384
00:30:29,560 --> 00:30:30,860
Go ahead.
+ +384 +00:30:30,860 --> 00:30:36,650 +(Student is asking question) + +384 +00:30:36,650 --> 00:30:39,150 +So your question is what happens if the two + +385 +00:30:39,150 --> 00:30:41,470 +inputs are equal when +you go through max gate. + +385 +00:30:44,150 --> 00:30:46,150 +Yeah, what happens? + +385 +00:30:46,150 --> 00:30:48,470 +(Student is answering) + +385 +00:30:48,470 --> 00:30:50,000 +Yeah, you pick one. Yeah. 385 -00:30:39,150 --> 00:30:53,470 -inputs are equal when you go through max -gate. Yeah, what happens? (Student is answering) I don't think it's +00:30:52,300 --> 00:30:53,470 +Yeah, I don't think it's 386 00:30:53,470 --> 00:30:57,559 @@ -1906,9 +2072,13 @@ correct to distributed to all of them. I think you'd have to pick one. 387 -00:30:57,559 --> 00:31:07,990 -But that basically never happens in actual -practice. Okay, so max gradient here, I actually +00:30:58,259 --> 00:31:01,990 +But that basically never +happens in actual practice. + +387 +00:31:05,559 --> 00:31:07,990 +Okay, so max gradient here, I actually 388 00:31:07,990 --> 00:31:13,019 @@ -1917,8 +2087,8 @@ than w, so only z has an influence on 389 00:31:13,019 --> 00:31:16,839 -the output of this max gate, right? So -when 2 flows into the max gate +the output of this max gate, right? +So when 2 flows into the max gate 390 00:31:16,839 --> 00:31:20,879 @@ -1931,7 +2101,7 @@ nothing. There is 0, because when you change it, it doesn't matter when you change 392 -00:31:25,359 --> 00:31:29,689 +00:31:25,360 --> 00:31:29,689 it, because z is the larger value going through the computational graph. @@ -1941,7 +2111,7 @@ I have another note that is related to backpropagation which we already 394 -00:31:33,099 --> 00:31:36,490 +00:31:33,100 --> 00:31:36,490 addressed through a question. I just wanted to briefly point out with a terribly @@ -1956,36 +2126,43 @@ value that branches out into a circuit and is used in multiple parts of the 397 -00:31:43,329 --> 00:31:47,179 +00:31:43,330 --> 00:31:47,179 circuit, the correct thing to do by multivariate chain rule, is to actually 398 -00:31:47,180 --> 00:31:55,110 -add up the contributions at the -operation. So gradients add when they backpropagate +00:31:47,180 --> 00:31:51,110 +add up the contributions at the operation. + +398 +00:31:51,110 --> 00:31:55,110 +So gradients add when they backpropagate 399 -00:31:55,109 --> 00:32:00,009 +00:31:55,110 --> 00:32:00,009 backwards through the circuit. If they ever flow, they add up in these backward flow 400 00:32:00,009 --> 00:32:04,879 All right. We're going to go into -implementation very soon. I'll just take some more +implementation very soon. I'll just take some 401 00:32:04,880 --> 00:32:05,700 -questions. +more questions. 402 -00:32:05,700 --> 00:32:11,620 +00:32:05,700 --> 00:32:08,820 +(Student is asking question) + +402 +00:32:08,820 --> 00:32:11,620 Thank you for the question. The question is, is there ever, like a loop in these 403 -00:32:11,619 --> 00:32:15,839 +00:32:11,620 --> 00:32:15,839 graphs. There will never be loops, so there are never any loops. 
You might think that @@ -1996,8 +2173,8 @@ that there are loops in there 405 00:32:18,589 --> 00:32:21,658 -but there are actually no loops because what we'll -do is we'll take a recurrent neural +but there are actually no loops because what +we'll do is we'll take a recurrent neural 406 00:32:21,659 --> 00:32:26,230 @@ -2005,14 +2182,26 @@ network and we will unfold it through time steps and this will all become, there 407 -00:32:26,230 --> 00:32:31,259 -will never be a loop in the unfolded graph -where we've copied pasted that small recurrent net piece over time. +00:32:26,230 --> 00:32:30,530 +will never be a loop in the unfolded graph where +we've copied pasted that small recurrent net piece + +407 +00:32:30,530 --> 00:32:31,259 +over time. 408 -00:32:31,259 --> 00:32:39,538 +00:32:31,259 --> 00:32:35,059 You'll see that more when we actually -get into it but these are always DAGs, there are no loops. Okay, awesome. +get into it but these are always DAGs + +408 +00:32:35,059 --> 00:32:36,338 +There are no loops. + +408 +00:32:38,059 --> 00:32:39,538 +Okay, awesome. 409 00:32:39,538 --> 00:32:42,220 @@ -2020,13 +2209,17 @@ So let's look at the implementation of how this is actually implemented in practice and 410 -00:32:42,220 --> 00:32:46,860 +00:32:42,220 --> 00:32:46,990 I think it will help make this more concrete as well. So we always have these 411 -00:32:46,859 --> 00:32:52,038 -graphs, computational graphs. These are the best way to +00:32:46,990 --> 00:32:48,938 +graphs, computational graphs. + +411 +00:32:48,938 --> 00:32:52,038 +These are the best way to think about structuring neural networks. 412 @@ -2040,7 +2233,7 @@ on top of the gates, there something's that needs to maintain connectivity structure 414 -00:33:00,058 --> 00:33:03,490 +00:33:00,059 --> 00:33:03,490 of this entire graph, what gates are connected to each other. And so usually @@ -2056,8 +2249,8 @@ and the backward piece. And this is just pseudo 417 00:33:13,679 --> 00:33:19,929 -code, so this won't run, but basically, roughly the -idea is that in the forward pass +code, so this won't run, but basically, +roughly the idea is that in the forward pass 418 00:33:19,929 --> 00:33:23,759 @@ -2071,8 +2264,8 @@ inputs must come to every node before 420 00:33:27,980 --> 00:33:32,099 -the output can be consumed. So these are just ordered -from left to right and we're just +the output can be consumed. So these are just +ordered from left to right and we're just 421 00:33:32,099 --> 00:33:35,969 @@ -2085,7 +2278,7 @@ over that graph and we just go forward in every single piece and this net object will 423 -00:33:39,599 --> 00:33:43,189 +00:33:39,600 --> 00:33:43,189 just make sure that happens in the proper connectivity pattern. In backward @@ -2095,7 +2288,7 @@ pass, we're going in the exact reversed order and we're calling backward on 425 -00:33:46,619 --> 00:33:49,709 +00:33:46,620 --> 00:33:49,709 every single gate and these gates will end up communicating gradients through each @@ -2110,9 +2303,9 @@ So really a net object is a very thin wrapper around all these gates, or as we 428 -00:33:57,859 --> 00:34:01,879 +00:33:57,860 --> 00:34:01,879 will see they're called layers, layers or -gates. I'm going to use interchangeably +gates. I'm going to use those interchangeably 429 00:34:01,880 --> 00:34:05,700 @@ -2130,14 +2323,13 @@ a specific example of one of the gates and how this might be implemented. 432 -00:34:12,949 --> 00:34:16,759 -And this is not just a pseudo code. 
This is -actually more like correct +00:34:12,950 --> 00:34:16,759 +And this is not just a pseudo code. +This is actually more like correct 433 00:34:16,760 --> 00:34:18,730 -implementation. Something like this might -run +implementation. Something like this might run 434 00:34:18,730 --> 00:34:23,769 @@ -2160,14 +2352,14 @@ And all these gates must basically satisfied this API of a forward call and a backward call. How 438 -00:34:38,949 --> 00:34:42,529 +00:34:38,950 --> 00:34:42,529 do you behave in a forward pass, and how do you behave in a backward pass. And 439 00:34:42,530 --> 00:34:46,019 -in a forward pass, we just compute whatever. In a -backward pass, we eventually end up +in a forward pass, we just compute whatever. +In a backward pass, we eventually end up 440 00:34:46,019 --> 00:34:52,639 @@ -2185,9 +2377,13 @@ everything here is scalars, so x, y, z are numbers here. dz is also a number 443 -00:35:00,639 --> 00:35:07,799 -telling the influence on the end of the circuit. And what this gate -is in charge of in this backward pass is +00:35:00,639 --> 00:35:03,639 +telling the influence on the end of the circuit. + +443 +00:35:03,639 --> 00:35:07,799 +And what this gate is in charge +of in this backward pass is 444 00:35:07,800 --> 00:35:11,550 @@ -2195,9 +2391,13 @@ performing the little piece of chain rule. So what we have to compute is how do you 445 -00:35:11,550 --> 00:35:16,550 -chain this gradient dz into your -inputs x and y. In other words, we have to compute dx and dy and we have to +00:35:11,550 --> 00:35:14,550 +chain this gradient dz into your inputs x and y. + +445 +00:35:14,550 --> 00:35:16,550 +In other words, we have to compute +dx and dy and we have to 446 00:35:16,550 --> 00:35:19,820 @@ -2210,23 +2410,24 @@ that these get routed properly to all the other gates. And if there are any 448 -00:35:23,719 --> 00:35:27,919 +00:35:23,720 --> 00:35:28,820 edges that add up, the computational graph -might add all those ingredients together. +might add all those gradients together. 449 -00:35:27,920 --> 00:35:35,650 +00:35:30,220 --> 00:35:35,650 Okay, so how would we implement the dx and dy? So for example, what is 450 -00:35:35,650 --> 00:35:42,300 +00:35:35,650 --> 00:35:40,300 dx in this case? What would it be equal to, the implementation? 451 -00:35:42,300 --> 00:35:49,460 -y * dz. Great. And, so y * dz. Additional point to make here by the way, +00:35:43,300 --> 00:35:49,460 +y * dz. Great. And, so y * dz. +Additional point to make here by the way, 452 00:35:49,460 --> 00:35:53,659 @@ -2235,7 +2436,7 @@ pass. We have to remember these values of 453 00:35:53,659 --> 00:35:57,509 -x and y, because we end up using them in a +x and y, because we end up using them in the backward pass, so I'm assigning them to a 454 @@ -2259,7 +2460,7 @@ any kind of intermediate calculations that it has performed that it needs to do, that needs 458 -00:36:13,429 --> 00:36:17,069 +00:36:13,430 --> 00:36:17,069 access to in the backward pass. So basically when we end up running these networks at @@ -2274,7 +2475,7 @@ amount of stuff gets cached in your memory, and that all has to stick around 461 -00:36:22,889 --> 00:36:25,909 +00:36:22,890 --> 00:36:25,909 because during backpropagation, you might need access to some of those variables. @@ -2289,9 +2490,22 @@ it gets all consumed and we need all those intermediates to actually compute the 464 -00:36:33,690 --> 00:36:45,289 -proper backward pass. So that's... 
(Student is asking question) Yes, so if you don't, if you know you don't want to do backward pass, then you can -get rid of many of these things and you +00:36:33,690 --> 00:36:36,000 +proper backward pass. So that's... + +464 +00:36:36,000 --> 00:36:41,089 +(Student is asking question) + +464 +00:36:41,089 --> 00:36:43,189 +Yes, so if you don't, if you know you +don't want to do backward pass, + +464 +00:36:43,189 --> 00:36:45,289 +then you can get rid of +many of these things and you 465 00:36:45,289 --> 00:36:49,710 @@ -2304,13 +2518,17 @@ But I don't think most implementations actually worriy about that. I don't 467 -00:36:54,110 --> 00:36:57,280 -think there's a lot of logic that deals -with that. Usually we end up remembering it +00:36:54,110 --> 00:36:58,280 +think there's a lot of logic that deals with that. +Usually we end up remembering it anyway. + +468 +00:37:00,280 --> 00:37:05,870 +(Student is asking question) 468 -00:36:57,280 --> 00:37:09,370 -anyway. (Student is asking question) I see. Yes, so I think if you're in the +00:37:05,870 --> 00:37:09,369 +I see. Yes, so I think if you're in the embedded device for example, and you worry 469 @@ -2329,18 +2547,26 @@ want to make sure to go into the code to make sure nothing gets cached in case 472 -00:37:18,750 --> 00:37:33,130 -you want to do a backward pass. Questions. -Yes. (Student is asking question) You're saying if we remember the local gradients in +00:37:18,750 --> 00:37:22,030 +you want to do a backward pass. +Questions. Yes. + +472 +00:37:22,030 --> 00:37:30,990 +(Student is asking question) + +472 +00:37:30,990 --> 00:37:33,130 +You're saying if we remember the local gradients in 473 -00:37:33,130 --> 00:37:39,750 +00:37:33,130 --> 00:37:39,250 the forward pass, then we don't have to -remember the other intermediates? I think +remember the other intermediates? 474 -00:37:39,750 --> 00:37:45,269 -that might only be the case in +00:37:39,250 --> 00:37:45,269 +I think that might only be the case in some simple expressions like this one. I'm 475 @@ -2354,7 +2580,7 @@ whatever you need to, perform the backward pass, and on a gate-by-gate basis. 477 -00:37:54,949 --> 00:37:58,509 +00:37:54,950 --> 00:37:58,509 You can remember whatever you feel like. It has lower footprint and so on. @@ -2389,7 +2615,7 @@ of these layer objects and these are the gates. Layers, gates, the same thing. So there's 484 -00:38:24,579 --> 00:38:27,429 +00:38:24,580 --> 00:38:27,429 all these layers. That's really what a deep learning framework is. It's just a @@ -2405,8 +2631,8 @@ the image to have in mind is all these 487 00:38:36,420 --> 00:38:42,639 -things are your Lego blocks, and then -we're building up these computational graphs out of +things are your Lego blocks, and then we're +building up these computational graphs out of 488 00:38:42,639 --> 00:38:44,829 @@ -2459,7 +2685,7 @@ actually work with these, we do a lot of vectorized operation so we receive a tensor 498 -00:39:22,409 --> 00:39:28,289 +00:39:22,410 --> 00:39:28,289 which is really just a n-dimensional array, and we scale it by a constant. And you @@ -2470,7 +2696,7 @@ lines. There's some initialization stuff. 500 00:39:31,980 --> 00:39:35,940 -This is Lua, by the way. If this is +This is Lua, by the way, if this is looking some foreign to you, but there's 501 @@ -2549,9 +2775,9 @@ deep learning framework specifically for images that you might be working with. 
Again, if 516 -00:40:36,139 --> 00:40:39,690 -you go into the layers directory in GitHub, you just see -all these layers. All of them implement +00:40:36,140 --> 00:40:39,690 +you go into the layers directory in GitHub, +you just see all these layers. All of them implement 517 00:40:39,690 --> 00:40:43,490 @@ -2559,7 +2785,7 @@ the forward backward API. So just to give you an example, there's a sigmoid layer in Caffe. 518 -00:40:43,489 --> 00:40:51,269 +00:40:43,490 --> 00:40:51,269 So sigmoid layer takes a blob. So Caffe likes to call these tensors blobs. So it takes a @@ -2590,8 +2816,8 @@ sigmoid function on the bottom and 524 00:41:11,730 --> 00:41:14,829 -that's just a sigmoid function right -there. So that's what we compute. And in a +that's just a sigmoid function right there. +So that's what we compute. And in the 525 00:41:14,829 --> 00:41:18,719 @@ -2605,8 +2831,8 @@ rule here, so that's what you see in this 527 00:41:23,369 --> 00:41:26,150 -line. That's where the magic happens where -we take the diff, +line. That's where the magic happens +where we take the diff, 528 00:41:26,150 --> 00:41:32,048 @@ -2635,55 +2861,86 @@ connectivity. Any questions about some of 533 00:41:52,289 --> 00:42:00,849 -these implementations and so on? +these implementations and so on? Go ahead. + +533 +00:41:54,000 --> 00:42:00,849 +(Student is asking question) + +534 +00:42:00,849 --> 00:42:04,759 +Yes, thank you. So the question is, do we have to +go through forward and backward for every update. + +534 +00:42:04,759 --> 00:42:09,259 +The answer is yes, because when you +want to do update, you need the gradient, + +534 +00:42:09,259 --> 00:42:11,849 +and so you need to do forward +on your sample minibatch. 534 -00:42:00,849 --> 00:42:15,559 -Yes, thank you. (Student is asking question) So the question is, do we have to go through forward and backward for every update. The answer is yes, because when you want to do update, you need the gradient, and so you need to do forward on your sample minibatch. You do a forward. Right away you do a backward. +00:42:11,849 --> 00:42:15,559 +You do a forward. Right away you do a backward. And now you have your analytic gradient. 535 00:42:15,559 --> 00:42:19,369 -And now I can do an update, where I take my analytic -gradient and I change my weights a tiny +And now I can do an update, where I take my +analytic gradient and I change my weights a tiny 536 00:42:19,369 --> 00:42:24,960 -bit in the direction, the negative -direction of your gradient. So forward computes +bit in the direction, the negative direction +of your gradient. So forward computes 537 00:42:24,960 --> 00:42:28,858 -the loss, backward computes your gradient, and -then the update uses the gradient to +the loss, backward computes your gradient, +and then the update uses the gradient to 538 -00:42:28,858 --> 00:42:33,278 -increment your weights a bit. So that's what -keeps happening in the loop. When you train a neural network +00:42:28,858 --> 00:42:33,000 +increment your weights a bit. So that's what keeps +happening in the loop. When you train a neural 539 -00:42:33,278 --> 00:42:36,318 -that's all that's happening. Forward, +00:42:33,000 --> 00:42:36,318 +network that's all that's happening. Forward, backward, update. Forward, backward, update. 540 -00:42:36,318 --> 00:42:51,808 -We'll see that in a bit. Go ahead. (Student is asking question) You're asking about a -for loop. Oh, is there a for loop here? I didn't even notice. Okay. +00:42:36,318 --> 00:42:38,808 +We'll see that in a bit. 
Go ahead. + +540 +00:42:38,808 --> 00:42:43,808 +(Student is asking question) + +540 +00:42:44,808 --> 00:42:47,008 +You're asking about a for-loop. + +540 +00:42:49,208 --> 00:42:51,808 +Oh, is there a for-loop here? +I didn't even notice. Okay. 541 00:42:51,809 --> 00:42:57,160 -Yeah, they have a for loop. Yes, so you'd like +Yeah, they have a for-loop. Yes, so you'd like this to be vectorized and that actually... 542 -00:42:57,159 --> 00:43:03,679 +00:42:57,160 --> 00:43:03,679 Because this is C++, so I think they just do it. Go for it. 543 -00:43:03,679 --> 00:43:10,899 +00:43:06,679 --> 00:43:10,899 Yeah, so this is a CPU implementation by the way. I should mention that this is a @@ -2698,12 +2955,12 @@ sigmoid layer on GPU and that's CUDA code. And so that's a separate file. It 546 -00:43:19,420 --> 00:43:21,980 +00:43:19,420 --> 00:43:22,280 would be sigmoid.cu or something like that. I'm not showing you that. 547 -00:43:21,980 --> 00:43:30,349 +00:43:23,580 --> 00:43:30,349 Any questions? Okay, great. So one point I'd like to make is, we'll be of course working with @@ -2733,7 +2990,7 @@ expressions, they're full Jacobian matrices. And so Jacobian matrix is this 553 -00:43:51,289 --> 00:43:54,670 +00:43:51,290 --> 00:43:54,670 two-dimensional matrix and basically tells you what is the influence of every @@ -2748,22 +3005,26 @@ and that's what Jacobian matrix stores, and the gradient is the same 556 -00:44:01,880 --> 00:44:08,960 +00:44:01,880 --> 00:44:10,960 expression as before, but now, say here, dz/dx is a vector and dL/dz is... sorry. 557 -00:44:08,960 --> 00:44:16,079 -dL/dz is a vector and dz/dx -is an entire Jacobian matrix, so you end up with +00:44:11,560 --> 00:44:16,079 +dL/dz is a vector and dz/dx is an +entire Jacobian matrix, so you end up with 558 -00:44:16,079 --> 00:44:32,130 +00:44:16,079 --> 00:44:20,130 an entire matrix-vector multiply to -actually chain the gradient backwards. (Student is asking question) +actually chain the gradient backwards. + +558 +00:44:20,130 --> 00:44:29,130 +(Student is asking question) 559 -00:44:32,130 --> 00:44:36,380 +00:44:31,530 --> 00:44:36,380 No. So I'll come back to this point in a bit. You never actually end up forming the full @@ -2788,12 +3049,12 @@ because dz/dx is the Jacobian which should be on the left side, so 564 -00:44:49,568 --> 00:44:53,159 +00:44:49,569 --> 00:44:53,859 I think that's a mistaken slide because this should be a matrix-vector multiply. 565 -00:44:53,159 --> 00:44:57,618 +00:44:53,859 --> 00:44:57,618 So I'll show you why you don't actually need to ever perform those Jacobians. So let's @@ -2822,23 +3083,27 @@ you might want to do. 571 00:45:14,630 --> 00:45:19,630 -and your computing an element-wise +and you're computing an element-wise thresholding at 0, so anything that is lower 572 00:45:19,630 --> 00:45:24,680 than 0 gets clamped to 0, and that's your -function that your computing. And so output +function that you're computing. And so output 573 -00:45:24,679 --> 00:45:28,588 +00:45:24,680 --> 00:45:28,588 vector is of the same dimension. So the question here I'd like to ask is 574 -00:45:28,588 --> 00:45:40,268 -what is the size of the Jacobian matrix -for this layer? 4096 by 4096. In principle, +00:45:28,588 --> 00:45:32,068 +what is the size of the +Jacobian matrix for this layer? + +574 +00:45:37,588 --> 00:45:40,268 +4096 by 4096. In principle, 575 00:45:40,268 --> 00:45:45,018 @@ -2847,18 +3112,26 @@ influenced every single number in there. 
576
00:45:45,018 --> 00:45:49,459
-But that's not the case necessarily,
-right? So the second question is, so this
+But that's not the case necessarily, right?
+So the second question is, so this

577
00:45:49,460 --> 00:45:52,949
-is a huge matrix, 16 million
-numbers, but why would you never form it?
+is a huge matrix, 16 million numbers,
+but why would you never form it?

+578
+00:45:52,949 --> 00:45:54,719
+What does the Jacobian actually look like?
+
+578
+00:45:54,719 --> 00:45:59,019
+(Student is asking question)

578
-00:45:52,949 --> 00:46:02,719
-What does the Jacobian actually look like? (Student is asking question) No, Jacobian will always be
-a matrix, because every one of these 4096
+00:45:59,019 --> 00:46:02,719
+No, Jacobian will always be a matrix
+because every one of these 4096

579
00:46:02,719 --> 00:46:09,949
numbers in here has influenced every
single number in there. But the
Jacobian is still a giant 4096 by 4096

580
00:46:09,949 --> 00:46:14,558
matrix, but has special structure, right?
-And what is that special structure? (Student is answering)
+And what is that special structure?
+
+580
+00:46:14,558 --> 00:46:17,558
+(Student is answering)

581
-00:46:14,559 --> 00:46:27,420
-Yeah, so this Jacobian is huge. So it's 4096 by 4096 matrix, but
+00:46:17,559 --> 00:46:20,420
+Yeah, so this Jacobian is huge.
+
+581
+00:46:21,259 --> 00:46:27,420
+So it's 4096 by 4096 matrix, but
there are only elements on the diagonal

582
because this is an element-wise
operation, and moreover, they're not just 1's, but

583
-00:46:33,699 --> 00:46:38,129
+00:46:33,700 --> 00:46:38,129
for whichever element that was less than 0,
it was clamped to 0, so some of these 1's

590
pass for this operation is very very
easy because you just want to look at

591
-00:47:09,268 --> 00:47:14,159
+00:47:09,269 --> 00:47:14,159
all the dimensions where your input was
less than zero and you want to kill the

592
00:47:14,159 --> 00:47:17,210
-gradient and those dimensions. You want to
+gradient in those dimensions. You want to
set the gradient to 0 in those dimensions.

593
whichever numbers were less than zero,

594
00:47:21,650 --> 00:47:25,910
-just set them to 0. Set those gradients to 0 and then you continue backward pass.
+just set them to 0. Set those gradients to 0
+and then you continue backward pass.

595
-00:47:25,909 --> 00:47:52,230
So very simple operations in the
-end in terms of efficiency. (Student is asking question) That's right. So the question is, the commication between the gates is always just vectors. That's right. So this Jacobian
+00:47:26,209 --> 00:47:30,209
+So very simple operations in the
+end in terms of efficiency.
+
+595
+00:47:30,209 --> 00:47:36,809
+(Student is asking question)
+
+595
+00:47:36,809 --> 00:47:37,300
+That's right.
+
+595
+00:47:37,300 --> 00:47:45,930
+(Student is asking question)
+
+595
+00:47:45,930 --> 00:47:51,830
+So the question is, the communication between the
+gates is always just vectors. That's right.

596
-00:47:52,230 --> 00:47:55,940
-if you wanted to, you can form that but
-that's internal to you inside the gate.
+00:47:51,830 --> 00:47:55,940
+So this Jacobian, if you wanted to, you can form
+that but that's internal to you inside the gate.

597
00:47:55,940 --> 00:47:59,670
And you can use that to do backprop, but
what's going back to other gates, they

598
-00:47:59,670 --> 00:48:17,380
-only care about the gradient vector. 
(Student is asking question) Yes, so the question is, unless you end up having multiple outputs, because then for each output, we have to do this, so yeah. So -we'll never actually run into that case +00:47:59,670 --> 00:48:02,870 +only care about the gradient vector. + +598 +00:48:02,870 --> 00:48:09,070 +(Student is asking question) + +598 +00:48:09,070 --> 00:48:12,070 +Yes, so the question is, unless +you end up having multiple outputs, + +598 +00:48:12,070 --> 00:48:15,070 +because then for each output, +we have to do this, so yeah. + +598 +00:48:15,070 --> 00:48:17,380 +So we'll never actually run into that case 599 00:48:17,380 --> 00:48:20,430 @@ -2965,14 +3281,14 @@ because we almost always have a single output, scalar value at the end 600 -00:48:20,429 --> 00:48:24,129 +00:48:20,430 --> 00:48:24,129 because we're interested in loss functions. So we just have a single 601 00:48:24,130 --> 00:48:27,318 number at the end that we're interested -in computing gradients respective. If we had [??] +in computing gradients with respect to. If we had 602 00:48:27,318 --> 00:48:30,949 @@ -2985,14 +3301,18 @@ in parallel when we do the backpropagation. But we just have scalar value loss 604 -00:48:35,769 --> 00:48:45,880 -function so we don't have to worry about that. Okay, makes sense? -So I want to also make the point that +00:48:35,769 --> 00:48:38,580 +function so we don't have to worry about that. + +604 +00:48:40,269 --> 00:48:46,080 +Okay, makes sense? So I want +to also make the point that actually 605 -00:48:45,880 --> 00:48:51,230 -actually 4096 dimensions is not even crazy. Usually we -use minibatches, so say, minibatch of a +00:48:46,080 --> 00:48:51,230 +4096 dimensions is not even crazy. Usually +we use minibatches, so say, minibatch of a 606 00:48:51,230 --> 00:48:54,929 @@ -3020,7 +3340,7 @@ basically. And you take, you take care to actually take advantage of the sparsity 611 -00:49:14,159 --> 00:49:17,538 +00:49:14,160 --> 00:49:17,538 structure in the Jacobian and you hand code operations, so you don't actually write @@ -3030,19 +3350,19 @@ fully generalized chain rule inside any gate implementation. Okay cool. So I'd like 613 -00:49:25,818 --> 00:49:30,788 -to point out that in your assignment, you'll -be writing SVMs and Softmax and so on, and I just kind of +00:49:25,819 --> 00:49:30,788 +to point out that in your assignment, you'll be +writing SVMs and Softmax and so on, and I just kind 614 00:49:30,789 --> 00:49:33,680 -would like to give you a hint on the design +of would like to give you a hint on the design of how you actually should approach this 615 -00:49:33,679 --> 00:49:39,769 +00:49:33,680 --> 00:49:39,769 problem. What you should do is just think -about it as a back propagation, even if +about it as a backpropagation, even if 616 00:49:39,769 --> 00:49:44,108 @@ -3080,7 +3400,7 @@ You will do that in the second assignment. You'll actually come up with a graph 623 -00:50:07,199 --> 00:50:10,509 +00:50:07,200 --> 00:50:10,509 object and you'll implement your layers. But in the first assignment, you're just doing it inline @@ -3111,8 +3431,8 @@ the gradient on your weights. And so 629 00:50:34,769 --> 00:50:40,179 -chain, use chain rule here. Otherwise, you might be -tempted to try to just derive W, the +chain, use chain rule here. Otherwise, you might +be tempted to try to just derive W, the 630 00:50:40,179 --> 00:50:43,798 @@ -3122,16 +3442,25 @@ that and that's an unhealthy way of 631 00:50:43,798 --> 00:50:47,349 approaching the problem. 
So stage your
-computation and do backdrop through this
+computation and do backprop through this

+632
+00:50:47,349 --> 00:50:49,900
+scores and that will help you out.

632
-00:50:47,349 --> 00:50:55,800
-scores and that will help you out. Okay. cool.
+00:50:51,500 --> 00:50:52,800
+Okay. Cool.

633
-00:50:55,800 --> 00:51:01,570
-So, let's see. Summary so far. Neural networks are hopelessly large, so we end up
-in this computational structures and these
+00:50:54,300 --> 00:50:59,570
+So, let's see. Summary so far.
+Neural networks are hopelessly large,
+
+633
+00:50:59,570 --> 00:51:01,570
+so we end up in this
+computational structures and these

634
00:51:01,570 --> 00:51:05,470
intermediate nodes, forward backward API
for both the nodes and also for the

635
-00:51:05,469 --> 00:51:08,869
+00:51:05,470 --> 00:51:08,869
graph structure. And the graph structure is
usually a very thin wrapper around all these

639
means is just an n-dimensional array.

640
00:51:23,079 --> 00:51:28,059
-So like an numpy array. Those are what goes between the
-gates, and then internally, every single
+So like a numpy array. Those are what goes
+between the gates, and then internally, every single

641
00:51:28,059 --> 00:51:33,529

642
going to end with backpropagation and I'm
going to go into neural networks. So

643
-00:51:37,690 --> 00:51:49,860
-any questions before we move on from
-backprop? Go ahead. (Student is asking a question)
+00:51:37,690 --> 00:51:40,390
+any questions before we move on from
+backprop? Go ahead.
+
+643
+00:51:40,390 --> 00:51:51,860
+(Student is asking a question)

644
-00:51:49,860 --> 00:52:03,130
-The summation is Li = blah? Yes, there is a sum there. So you want that to be vectorized operation that you... Yeah so basically, the challenge in your assignment almost
-is, how do you make sure that you do all
+00:51:51,860 --> 00:51:55,530
+The summation inside Li = blah?
+Yes, there is a sum there.
+
+644
+00:51:55,530 --> 00:52:00,130
+So you want that to be vectorized operation that
+you... Yeah so basically, the challenge in your
+
+644
+00:52:00,130 --> 00:52:03,130
+assignment almost is,
+how do you make sure that you do all

645
00:52:03,130 --> 00:52:06,750
this efficiently nicely with matrix vector operations
in numpy, so that's going to be some of the

646
-00:52:06,750 --> 00:52:18,030
+00:52:06,750 --> 00:52:09,750
brain teaser stuff that you guys are
-going to have to do. (Student is asking a question) Yes, so it's up to you what you want your gates to be like, and what you want them
+going to have to do.
+
+646
+00:52:09,750 --> 00:52:14,250
+(Student is asking a question)
+
+646
+00:52:14,250 --> 00:52:20,030
+Yes, so it's up to you what you want your gates
+to be like, and what you want them to be.

647
-00:52:18,030 --> 00:52:24,490
-to be. (Student is asking a question) Yeah, I don't think you'd want to do that.
+00:52:20,030 --> 00:52:22,490
+(Student is asking a question)
+
+647
+00:52:22,490 --> 00:52:24,490
+Yeah, I don't think you'd want to do that.

648
-00:52:24,489 --> 00:52:30,739
+00:52:25,490 --> 00:52:30,739
Yeah, I'm not sure. Maybe that works. I don't
know. But it's up to you to design this and to

649
So we're going to go to neural networks. This is

650
00:52:38,610 --> 00:52:44,010
exactly what they look like. 
So you'll be -implementing me, and this is just what happens +implementing these, and this is just what happens 651 00:52:44,010 --> 00:52:46,770 @@ -3223,7 +3579,7 @@ when you search on Google Images for neural networks. This is I think the first 652 -00:52:46,769 --> 00:52:51,590 +00:52:46,770 --> 00:52:51,590 result or something like that. So let's look at neural networks. And before we dive @@ -3233,7 +3589,7 @@ into neural networks actually, I'd like to do it first without all the brain 654 -00:52:55,099 --> 00:52:58,329 +00:52:55,100 --> 00:52:58,329 stuff. So forget that they're neural. Forget that they have any relation whatsoever @@ -3274,8 +3630,7 @@ receive your input x, and you 662 00:53:30,230 --> 00:53:32,369 -multiply it by a matrix, just like we did -before. +multiply it by a matrix, just like we did before. 663 00:53:32,369 --> 00:53:36,619 @@ -3303,7 +3658,7 @@ one more matrix multiply, and that gives us our scores. And so if I was to draw this, 668 -00:53:52,239 --> 00:53:58,169 +00:53:52,240 --> 00:53:58,169 say in case of CIFAR-10, with 3072 numbers going in, those are the pixel values, @@ -3312,24 +3667,20 @@ going in, those are the pixel values, and before, we just went one single matrix multiply to scores. We went right away -670 -00:54:02,110 --> 00:54:02,470 -to 10 - 671 -00:54:02,469 --> 00:54:05,899 -numbers. But now, we get to go through -this intermediate representation +00:54:02,110 --> 00:54:05,899 +to 10 numbers. But now, we get to go +through this intermediate representation 672 00:54:05,900 --> 00:54:13,019 -of hidden state. We'll call them -hidden layers. So hidden vector h of hundred numbers, say +of hidden state. We'll call them hidden layers. +So hidden vector h of hundred numbers, say 673 00:54:13,019 --> 00:54:16,849 or whatever you want your size of the neural -network to be. So this is a hyperparameter. +network to be. So this is a hyperparameter, 674 00:54:16,849 --> 00:54:21,109 @@ -3337,9 +3688,9 @@ that's, say, a hundred, and we go through this intermediate representation. So matrix 675 -00:54:21,108 --> 00:54:24,319 -multiply gives us -hundred numbers, threshold at zero, and +00:54:21,109 --> 00:54:24,319 +multiply gives us hundred +numbers, threshold at zero, and 676 00:54:24,320 --> 00:54:28,559 @@ -3357,7 +3708,7 @@ of something interesting you might want to, you might think that a neural network 679 -00:54:36,329 --> 00:54:40,210 +00:54:36,330 --> 00:54:40,210 could do, is going back to this example of interpreting linear @@ -3407,13 +3758,12 @@ facing slightly to the left, 689 00:55:16,280 --> 00:55:20,650 -left car facing slightly to the right, and +red car facing slightly to the right, and those elements of h would only become 690 00:55:20,650 --> 00:55:24,358 -positive if they find that thing in the -image, +positive if they find that thing in the image, 691 00:55:24,358 --> 00:55:28,029 @@ -3426,22 +3776,22 @@ or yellow cars or whatever else in different orientations. So now we can 693 -00:55:31,179 --> 00:55:35,669 +00:55:31,180 --> 00:55:35,669 have a template for all these different modes. And so these neurons turn on or 694 00:55:35,670 --> 00:55:41,869 -off if they find the thing they're -looking for. Car of some specific type, and then +off if they find the thing they're looking +for. Car of some specific type, and then 695 00:55:41,869 --> 00:55:46,660 -this W2 matrix scan sum across all +this W2 matrix can sum across all those little car templates. 
So now we

696
-00:55:46,659 --> 00:55:50,719
+00:55:46,660 --> 00:55:50,719
have like say twenty card templates of what
cars could look like, and now, to compute

697
the score of the car classifier, there's an additional

698
of doing a weighted sum over them. And so
if anyone of them turn on, then through my

699
-00:55:58,699 --> 00:56:02,269
+00:55:58,700 --> 00:56:02,269
weighted sum, with positive weights
presumably, I would be adding up and

700
00:56:02,269 --> 00:56:07,358
getting a higher score, and so I can
have this multimodal car classifier

701
00:56:07,358 --> 00:56:13,098
-through this additional hidden layer
-between there. So that's handwavy reason for why
+through this additional hidden layer in between
+there. So that's a handwavy reason for why

702
00:56:13,099 --> 00:56:14,720
-these would do something more
-interesting.
+these would do something more interesting.

703
-00:56:14,719 --> 00:56:49,509
-Was there a question? (Student is asking a question) So the question is, if h had less than 10 units, would it be inferior to a linear classifier? I think that's... that's acutally not obvious to me. It's an interesting question. I think... you could make that work. I think you could make it work. Yeah, I think that would actually work. Someone should try that for extra points in the
-assignment. So you'll have a section on the assignment do something fun or extra
+00:56:15,520 --> 00:56:16,509
+Was there a question? Yeah.
+
+703
+00:56:16,509 --> 00:56:26,350
+(Student is asking a question)
+
+703
+00:56:26,350 --> 00:56:32,509
+So the question is, if h had less than 10 units, would
+it be inferior to a linear classifier? I think that's...
+
+703
+00:56:33,200 --> 00:56:39,509
+that's actually not obvious to me. It's an interesting
+question. I think... you could make that work.
+
+703
+00:56:39,509 --> 00:56:40,509
+I think you could make it work.
+
+703
+00:56:43,509 --> 00:56:47,509
+Yeah, I think that would actually work. Someone
+should try that for extra points in the assignment.
+
+703
+00:56:47,509 --> 00:56:49,509
+So you'll have a section on the
+assignment do something fun or extra

704
00:56:49,510 --> 00:56:53,220
and so you get to come up with whatever you
-think is interesting experiment and will
+think is interesting experiment and we'll

705
-00:56:53,219 --> 00:56:56,699
+00:56:53,220 --> 00:56:56,699
give you some bonus points. So that's
good candidate for something you might

706
00:56:56,699 --> 00:56:59,659
-want to investigate, whether that works
-or not.
+want to investigate, whether that works or not.

707
-00:56:59,659 --> 00:57:08,329
-Any other questions? Go ahead. (Student is asking a question)
+00:56:59,659 --> 00:57:00,929
+Any other questions? Go ahead.
+
+707
+00:57:01,329 --> 00:57:11,329
+(Student is asking a question)

708
-00:57:08,329 --> 00:57:34,989
-Sorry, I don't think I understood the question. (Student is asking a question) I see. So you're really asking about the layout of the h vector and how it gets allocated over the different modes of
-the dataset and I don't have a good
+00:57:11,329 --> 00:57:13,589
+Sorry, I don't think I understood the question.
+
+708
+00:57:13,589 --> 00:57:26,989
+(Student is asking question)
+
+708
+00:57:26,989 --> 00:57:28,000
+I see.
+
+708
+00:57:28,900 --> 00:57:32,389
+So you're really asking about the layout of the
+h vector and how it gets allocated over
+
+708
+00:57:32,389 --> 00:57:34,989
+the different modes of the dataset
+and I don't have a good

709
00:57:34,989 --> 00:57:37,969
answer for that. 
Since we're going
to train this fully with

710
00:57:37,969 --> 00:57:39,500
backpropagation,

711
00:57:39,500 --> 00:57:42,690

712
just tell you that this neural network should

713
these kind of like mixes, and weird
things, intermediates, and so on.

714
-00:57:50,690 --> 00:57:55,630
-So this neural network will come in and it will optimally find a way to
-truncate your data with its linear boundaries
+00:57:50,690 --> 00:57:54,390
+So this neural network will come in and it will
+optimally find a way to truncate your data
+
+714
+00:57:54,390 --> 00:57:55,630
+with its linear boundaries

715
00:57:55,630 --> 00:57:59,809
and these weights will all get adjusted
just to make it come out right. So it's

716
-00:57:59,809 --> 00:58:10,579
-really hard to say. It will all become tangled
-up I think. Go ahead. That's right. So that's the
+00:57:59,809 --> 00:58:03,809
+really hard to say. It will all become
+tangled up I think. Go ahead.
+
+716
+00:58:03,809 --> 00:58:09,500
+(Student is asking question)
+
+716
+00:58:09,500 --> 00:58:10,579
+That's right. So that's the

717
00:58:10,579 --> 00:58:14,579
We get to choose that. So I chose

718
00:58:14,579 --> 00:58:18,719
-hundred. Usually that's going to be,
-usually, you'll see that with neural networks. We'll go into
+hundred. Usually that's going to be, usually,
+you'll see that with neural networks. We'll go into

719
00:58:18,719 --> 00:58:22,739
this a lot, but usually you want them to
be as big as possible, as it fits in your

720
-00:58:22,739 --> 00:58:30,659
-computer and so on, so more is better. So we'll go
-into that. Go ahead. (Student is asking a question)
+00:58:22,739 --> 00:58:27,659
+computer and so on, so more is better.
+We'll go into that. Go ahead.
+
+720
+00:58:27,659 --> 00:58:33,659
+(Student is asking question)

721
-00:58:30,659 --> 00:58:38,639
+00:58:33,659 --> 00:58:38,639
So you're asking, do we always take max of 0 and h,
and we don't, and I'll get, it's like five slides

722
00:58:38,639 --> 00:58:44,359
-away. So I'm going to go into neural
-networks. I guess maybe I should preemtively just go
+away. So I'm going to go into neural networks.
+I guess maybe I should preemptively just go

723
00:58:44,360 --> 00:58:48,390
-ahead and take questions near the end. If
-you wanted this to be a three-layer
+ahead and take questions near the end.
+If you wanted this to be a three-layer

724
-00:58:48,389 --> 00:58:50,940
+00:58:48,390 --> 00:58:50,940
neural network by the way, there's a
very simple way in which we just extend

725
it comes down to it. So this is a slide

731
00:59:12,690 --> 00:59:17,349
-borrowed from a blog post I found, and basically
-it suffices roughly eleven lines of
+borrowed from a blog post I found, and
+basically it suffices roughly eleven lines of

732
00:59:17,349 --> 00:59:21,980

734
have, sorry it's three dimensional. And
you have binary labels for y, and then

735
-00:59:32,579 --> 00:59:36,579
+00:59:32,580 --> 00:59:36,579
syn0 syn1 are your weight matrices
weight1 weight2. And so I think they're

736
then this is the optimization loop here

737
00:59:41,150 --> 00:59:46,269
-and what you're seeing here, I
-should use my pointer for more, what you're
+and what you're seeing here, I should
+use my pointer for more, what you're

738
00:59:46,269 --> 00:59:50,139

740
00:59:54,710 --> 00:59:58,650
forward pass and I have backward pass here in
one form. 
It's computing the first layer, 741 00:59:58,650 --> 01:00:03,059 -and then it's computing second layer, and then its -computing here right away the backward +and then it's computing second layer, and then +it's computing here right away the backward 742 01:00:03,059 --> 01:00:08,130 -pass. So this is the l2_delta. It's the gradient on -l2, the gradient on l1, and the +pass. So this is the l2_delta. It's the gradient +on l2, the gradient on l1, and the 743 01:00:08,130 --> 01:00:13,390 @@ -3679,7 +4087,7 @@ gradient, and this is an update here. So right away he's doing an update at 744 -01:00:13,389 --> 01:00:17,150 +01:00:13,390 --> 01:00:17,150 the same time as during the final piece of backprop here where he formulating the @@ -3705,7 +4113,7 @@ logistic regression loss. So you saw a 749 01:00:33,500 --> 01:00:37,159 -generalization of it which is softmax +generalization of it which is a softmax classifier into multiple dimensions. But 750 @@ -3749,9 +4157,18 @@ quite simple. We compute these layers, do forward pass, we do backward pass, we do an 758 -01:01:11,019 --> 01:01:18,840 -update, we keep iterating this over and over again. Go ahead. (Student is asking a question) -The random function is creating your first initial random +01:01:11,019 --> 01:01:14,540 +update, we keep iterating this over and over again. +Go ahead. + +758 +01:01:14,540 --> 01:01:16,240 +(Student is asking a question) + +758 +01:01:16,240 --> 01:01:18,840 +The random function is creating +your first initial random 759 01:01:18,840 --> 01:01:24,170 @@ -3774,7 +4191,7 @@ you're not using logistic regression and you might have different activation 763 -01:01:34,949 --> 01:01:39,149 +01:01:34,950 --> 01:01:39,149 functions. But again, just my advice to you when you implement this is, stage @@ -3790,13 +4207,13 @@ result. So you might have, you compute 766 01:01:46,909 --> 01:01:54,460 -your... Let's see. You compute, you receive these weight -matrices and also the biases. I don't +your... Let's see. You compute, you receive these +weight matrices and also the biases. I don't 767 01:01:54,460 --> 01:01:59,940 -believe you have biases actually in your SVM and in your -softmax, but here you'll have biases. So +believe you have biases actually in your SVM and in +your softmax, but here you'll have biases. So 768 01:01:59,940 --> 01:02:03,269 @@ -3806,7 +4223,7 @@ compute the first hidden layer, compute your scores, 769 01:02:03,269 --> 01:02:08,429 compute your loss, and then do backward -pass. So bacprop in to scores, then +pass. So backprop into scores, then 770 01:02:08,429 --> 01:02:13,739 @@ -3815,8 +4232,8 @@ layer, and backprop into this h1 vector, 771 01:02:13,739 --> 01:02:18,849 -and then through h1, backprop into -the first weight matrices and the first biases. Okay, so do +and then through h1, backprop into the first +weight matrices and the first biases. Okay, so do 772 01:02:18,849 --> 01:02:22,929 @@ -3857,7 +4274,7 @@ slightly more insane by folding in all kinds of like motivations, mostly 780 -01:02:47,739 --> 01:02:51,219 +01:02:47,740 --> 01:02:51,219 historical about like how this came about that it's related to brain at all. @@ -3873,13 +4290,13 @@ This is just what happens when you search on 783 01:02:59,440 --> 01:03:03,800 -image search 'neurons'. so there you go. Now +image search 'neurons', so there you go. Now your actual biological neurons don't 784 01:03:03,800 --> 01:03:09,030 -look like this. Fortunately, they look more like -that. And so a neuron, +look like this. 
Unfortunately, they look
+more like that. And so a neuron,

785
01:03:09,030 --> 01:03:11,880
just to give you an
idea about where this is all coming from

786
01:03:11,880 --> 01:03:17,220
-you have the cell body or a Soma as people like to
-call it, and it's got all these dendrites
+you have the cell body or a soma as people like
+to call it, and it's got all these dendrites

787
-01:03:17,219 --> 01:03:21,049
+01:03:17,220 --> 01:03:21,049
that are connected to other neurons. So
there's a cluster of other neurons and

788
cell bodies over here. And dendrites are
really, these appendages that listen to

789
-01:03:25,449 --> 01:03:30,869
+01:03:25,450 --> 01:03:30,869
them. So this is your inputs to
a neuron, and then it's got a single axon that

790
output of the computation that this
neurons performs.

791
01:03:35,840 --> 01:03:40,579
-So usunally, usually you have this
+So usually, usually you have this
neuron, receives inputs. If many of them

793
actually like diverges out to

794
01:03:50,199 --> 01:03:54,659
-connect to dendrites other neurons that
+connect to dendrites of other neurons that
are downstream. So there are other

795
01:03:54,659 --> 01:03:57,639
neurons here and their dendrites
-connected to the axons of these guys.
+connect to the axons of these guys.

796
01:03:57,639 --> 01:04:02,299

798
output of a neuron. And so basically, you can
come up with a very crude model of a

799
-01:04:10,409 --> 01:04:16,769
-neuron, and it will look something like
-this. We have an axon, so this is the cell body
+01:04:10,410 --> 01:04:16,769
+neuron, and it will look something like this.
+We have an axon, so this is the cell body

800
01:04:16,769 --> 01:04:20,909

803
of how much this neuron likes that
neuron basically. And so axon carries

804
-01:04:35,349 --> 01:04:39,769
+01:04:35,350 --> 01:04:39,769
this x. It interacts in the synapse and
they multiply in this crude model. So you

811
-01:05:06,570 --> 01:05:11,730
-actually use for the activation function. The reason for that is because
+01:05:06,570 --> 01:05:09,430
+actually use for the activation function.
+The reason for that is because
+
+811
+01:05:09,430 --> 01:05:11,730
you get a number between 0 and 1, and

812
particular input. So it's a rate between

814
01:05:19,809 --> 01:05:23,889
-activation function. So if this neuron is
+activation function. So if this neuron has
seen something it likes, in the neurons

815
looks very similar to a linear

822
01:05:56,710 --> 01:06:02,750
-classifier, right? We're forming a linear sum
-here, a weighted sum, and we're passing that through
+classifier, right? We're forming a linear sum here,
+a weighted sum, and we're passing that through

823
01:06:02,750 --> 01:06:07,050

computation. A good review article is
Dendritic Computation, which I really

832
-01:06:42,139 --> 01:06:46,069
+01:06:42,140 --> 01:06:46,069
enjoyed. These synapses are complex
dynamical systems. They're not just a

833
single weight. And we're not really
sure if the brain uses rate code to

834
-01:06:49,719 --> 01:06:54,689
+01:06:49,720 --> 01:06:54,689
communicate, so very crude mathematical
-model and don't put his analogy too much.
+model and don't push this analogy too much. 
835
01:06:54,690 --> 01:06:57,960
-But it's good for, kind of like, media
-articles,
+But it's good for, kind of like, media articles,

836
01:06:57,960 --> 01:07:01,990
so I suppose that's why this keeps
coming up again and again as we

837
-01:07:01,989 --> 01:07:04,989
+01:07:01,990 --> 01:07:04,989
explained that this works like your brain.
But I'm not going to go too deep into

838
this. To go back to a question that was
asked before, there's an entire set of

839
-01:07:09,829 --> 01:07:17,559
+01:07:09,829 --> 01:07:11,859
nonlinearities that we can choose from.
+
+839
+01:07:14,559 --> 01:07:17,559
So historically, sigmoid has been used

840
the most since

841
tradeoffs, and why you might want to use

842
01:07:23,690 --> 01:07:27,838
one or the other, but for now, I'd just like to
-flash to mention that there are many things to [??]
+flash them and mention that there are many things to

843
01:07:27,838 --> 01:07:28,579
choose from.

844
01:07:28,579 --> 01:07:33,940
-Historically people use to signmoid and tanh. As of
-2012, ReLU became quite popular.
+Historically people used sigmoid and tanh.
+As of 2012, ReLU became quite popular.

845
01:07:33,940 --> 01:07:38,429
It makes your networks converge quite a
bit faster, so right now, if you wanted a

846
-01:07:38,429 --> 01:07:40,429
-default choice for nonlinearity
+01:07:38,429 --> 01:07:41,429
+default choice for nonlinearity, use ReLU.

847
-01:07:40,429 --> 01:07:45,679
-use ReLU. That's the current default
-recommendation. And then there's a few, kind of a hipster
+01:07:41,429 --> 01:07:45,679
+That's the current default recommendation.
+And then there's a few, kind of a hipster

848
01:07:45,679 --> 01:07:51,489
-activation functions here. And so Leaky ReLUs were
-proposed a few years ago. Maxout is
+activation functions here. And so Leaky ReLUs
+were proposed a few years ago. Maxout is

849
01:07:51,489 --> 01:07:54,989

And so here is an example of a

857
01:08:23,140 --> 01:08:27,170
-2-layer neural net or 3-layer neural net. When you want
-to count the number of layers and the
+2-layer neural net or 3-layer neural net. When
+you want to count the number of layers and the

858
01:08:27,170 --> 01:08:30,829

layers, and so, remember that I shown you that
a single neuron computes this little

863
-01:08:50,869 --> 01:08:54,750
+01:08:50,870 --> 01:08:54,750
weighted sum, and then passed that
through nonlinearity. In a neural network, the

neurons in a single hidden layer as just
at a single times a matrix multiply. And

870
-01:09:14,409 --> 01:09:17,619
+01:09:14,410 --> 01:09:17,619
that's why we arrange them in these layers,
where neurons inside a layer can be

two-layer neural network classifying a, doing a
binary classification task. So we have two

878
-01:09:50,079 --> 01:09:54,119
+01:09:50,080 --> 01:09:54,119
classes, red and green. And so we have these
points in two dimensions, and I'm drawing

the more hidden neurons I have in my

881
01:10:01,970 --> 01:10:05,770
-hidden layer, the more wiggle your
-neural network has, right? The more it can compute
+hidden layer, the more wiggle your neural
+network has, right? The more it can compute

882
01:10:05,770 --> 01:10:12,290
-crazy functions. And just to show you effect also of
-regularization strength. So this is the
+crazy functions. 
And just to show you effect +also of regularization strength. So this is the 883 -01:10:12,289 --> 01:10:17,069 +01:10:12,290 --> 01:10:17,069 regularization of how much you penalize -large W. So you can see that when you insist +large Ws. So you can see that when you insist 884 01:10:17,069 --> 01:10:22,340 @@ -4435,14 +4858,14 @@ this dataset. So if I restart the neural network, it's just, starts off with a 897 -01:11:18,079 --> 01:11:20,949 +01:11:18,080 --> 01:11:20,949 random W, and then it converges the decision boundary to actually classify 898 01:11:20,949 --> 01:11:26,289 -the data. What I'm showing on the right, which is -the cool part, this visualization, is one interpretation of +the data. What I'm showing on the right, which is the +cool part, this visualization, is one interpretation of 899 01:11:26,289 --> 01:11:29,529 @@ -4495,7 +4918,7 @@ how this gets warped so that you can linearly classify the data. This is 909 -01:12:10,939 --> 01:12:13,569 +01:12:10,940 --> 01:12:13,569 something that people sometimes also referred to as kernel trick. It's @@ -4520,7 +4943,7 @@ data points. So you can see actually those six neurons roughly. You can see these lines 914 -01:12:33,579 --> 01:12:36,869 +01:12:33,580 --> 01:12:36,869 here, like they're kind of like these functions of one of these neurons. So @@ -4535,13 +4958,23 @@ dataset is separable with a neural network? If I want the neural network 917 -01:12:45,569 --> 01:12:51,889 -to correctly classify this, how many neurons do I need in the hidden layer as a minimum? (Student is answering) +01:12:45,570 --> 01:12:49,089 +to correctly classify this, how many neurons do +I need in the hidden layer as a minimum? 918 -01:12:51,890 --> 01:13:15,270 -4? I heard some 3s, some 4s. Binary search. So intuitively, the way this would work is, let's see 4. -So what happens with 4 is, there is one +01:12:57,890 --> 01:13:04,270 +Four? I heard some threes, some fours. +Binary search. + +918 +01:13:04,270 --> 01:13:08,870 +So intuitively, the way this +would work is, let's see four. + +918 +01:13:12,270 --> 01:13:15,270 +So what happens with four is, there is one 919 01:13:15,270 --> 01:13:18,910 @@ -4549,7 +4982,7 @@ neuron here that went from this way to that way, this way to that way, this way 920 -01:13:18,909 --> 01:13:22,689 +01:13:18,910 --> 01:13:22,689 to that way. There's four neurons that are cutting up this plane. And then @@ -4565,8 +4998,8 @@ would work. So with three neurons... So 923 01:13:34,739 --> 01:13:39,189 -one plane, second plane, third plane. So three -linear functions with a nonlinearity, +one plane, second plane, third plane. So +three linear functions with a nonlinearity, 924 01:13:39,189 --> 01:13:45,649 @@ -4574,9 +5007,13 @@ and then you can basically with three lines, you can carve out the space so 925 -01:13:45,649 --> 01:13:52,429 +01:13:45,649 --> 01:13:50,329 that the second layer can just combine -them when their numbers are 1 and not 0. (Student is asking question) +them when their numbers are 1 and not 0. + +925 +01:13:50,329 --> 01:13:52,429 +(Student is asking question) 926 01:13:52,430 --> 01:13:57,850 @@ -4585,40 +5022,58 @@ because two lines are not enough. I 927 01:13:57,850 --> 01:14:03,900 -suppose this works... Not going to look very good -here. So with two, basically it will find +suppose this works... Not going to look very +good here. 
So with two, basically it will find 928 -01:14:03,899 --> 01:14:07,239 -the optimum way of just using these two +01:14:03,900 --> 01:14:07,239 +the optimal way of just using these two lines. They're kind of creating this 929 -01:14:07,239 --> 01:14:14,599 -tunnel and that's the best you can do. Okay? (Student is asking question) +01:14:07,239 --> 01:14:11,239 +tunnel and that's the best you can do. Okay? + +929 +01:14:11,239 --> 01:14:14,599 +(Student is asking question) 930 -01:14:14,600 --> 01:14:31,300 -The curve, I think... Which nonlinearity am I using? tanh? Yeah, I'm not sure exactly how that works out. If I was using ReLU, I think it would be much, so ReLU is the... Let me change to ReLU, and I +01:14:18,600 --> 01:14:25,400 +The curve, I think... Which nonlinearity am I using? +tanh? Yeah, I'm not sure exactly how that works out. + +930 +01:14:25,400 --> 01:14:31,300 +If I was using ReLU, I think it would be much, +so ReLU is the... Let me change to ReLU, and I 931 -01:14:31,300 --> 01:14:50,460 +01:14:31,300 --> 01:14:41,460 think you'd see sharp boundaries. Yeah. -Yes, this is three. You can do four. So let's do... (Student is asking question) Yeah, that's because +Yes, this is three. You can do four. So let's do... + +931 +01:14:41,460 --> 01:14:47,460 +(Student is asking question) + +931 +01:14:47,460 --> 01:14:50,460 +Yeah, that's because, it's 932 01:14:50,460 --> 01:14:52,130 -because some of these parts +because in some of these parts 933 -01:14:52,130 --> 01:14:58,119 +01:14:52,130 --> 01:14:57,819 there's more than one of those ReLUs -are active, and so you end up with... there +are active, and so you end up with... 934 -01:14:58,119 --> 01:15:02,359 -are really three lines. I think like one, two, three, -but then in some of the corners two ReLU +01:14:57,819 --> 01:15:02,359 +There are really three lines. I think like one, two, +three, but then in some of the corners two ReLU 935 01:15:02,359 --> 01:15:05,689 @@ -4642,13 +5097,13 @@ doing this update, it will just go in there 939 01:15:22,390 --> 01:15:32,800 -and figure that out. Very simple data -that is not. Spiral. Circle, and then random... +and figure that out. Very simple dataset +is not... Spiral. Circle, and then random... 940 -01:15:32,800 --> 01:15:39,880 -so random data, and so you could, kind of goes in -there, like covers up the green +01:15:33,200 --> 01:15:39,880 +so random data, and so you could, kind +of goes in there, like covers up the green 941 01:15:39,880 --> 01:15:48,039 @@ -4667,8 +5122,8 @@ to separate out this data. So you can 944 01:15:58,770 --> 01:16:05,270 -play with this in your free time. Okay. And so -as a summary, +play with this in your free time. +Okay. And so as a summary, 945 01:16:05,270 --> 01:16:10,690 @@ -4676,32 +5131,40 @@ we arrange these neurons in neural networks into fully connected layers. 946 -01:16:10,689 --> 01:16:14,579 +01:16:10,690 --> 01:16:14,579 We've looked at backprop and how this gets chained in computational graphs. And they're 947 01:16:14,579 --> 01:16:19,149 -not really neural. And as you'll see soon, +not really neural. And as we'll see soon, the bigger the better, and we'll go into 948 -01:16:19,149 --> 01:16:28,210 -that a lot. I want to take questions -before I end. Just sorry. Were there any questions? Go ahead. We have +01:16:19,149 --> 01:16:23,510 +that a lot. I want to take questions before I end. +Just sorry. Were there any questions? Go ahead. 
+
+948
+01:16:23,510 --> 01:16:27,710
+(Student is asking question)

949
-01:16:28,210 --> 01:16:29,359
-two more minutes. Sorry.
+01:16:27,710 --> 01:16:29,359
+We have two more minutes. Sorry.
+
+948
+01:16:29,359 --> 01:16:35,710
+(Student is asking question)

950
-01:16:29,359 --> 01:16:36,899
+01:16:35,710 --> 01:16:36,899
Yes, thank you.

951
01:16:36,899 --> 01:16:41,119
-So is it always better to have more
-neurons in your neural network? The answer to
+So is it always better to have more neurons
+in your neural network? The answer to

952
01:16:41,119 --> 01:16:48,809

network to not overfit your data is not by

955
01:16:55,810 --> 01:16:58,940
-making the network smaller. The correct
-way to do it is to increase the
+making the network smaller.
+The correct way to do it is to increase the

956
-01:16:58,939 --> 01:17:03,079
+01:16:58,940 --> 01:17:03,079
regularization. So you always want to use
as large a network as you want, but then

957
you have to make sure to properly
regularize it. But most of the time

958
-01:17:06,270 --> 01:17:09,920
-because of computational reasons, you have finite
-amount of time, you don't want to wait forever to train your
+01:17:06,270 --> 01:17:09,320
+because of computational reasons, you have finite
+amount of time, you don't want to wait forever to
+
+959
+01:17:09,320 --> 01:17:14,980
+train your networks. You'll use smaller
+ones for practical reasons. Question?

959
-01:17:09,920 --> 01:17:19,980
-networks. You'll use smaller ones for practical
-reasons. Question? (Student is asking question) Do you regularize each layer equally.
+01:17:14,980 --> 01:17:17,780
+(Student is asking question)
+
+959
+01:17:17,780 --> 01:17:19,980
+Do you regularize each layer equally.

960
-01:17:19,979 --> 01:17:25,509
-Usually you do as a simplification.
+01:17:19,980 --> 01:17:25,509
+Usually you do, as a simplification.
Yeah. Most of the, often when you see

961
networks get trained in practice, they will
be regularized the same way throughout.

962
-01:17:28,029 --> 01:17:33,809
-But you don't have to necessarily. Go ahead. (Student is asking question)
+01:17:28,030 --> 01:17:31,030
+But you don't have to necessarily. Go ahead.
+
+962
+01:17:31,030 --> 01:17:35,710
+(Student is asking question)

963
-01:17:33,810 --> 01:17:40,500
-Is there any value to using second derivatives using hashing in
-optimizing neural networks? There is value
+01:17:35,710 --> 01:17:40,500
+Is there any value to using second derivatives using
+the Hessian in optimizing neural networks? There is value

964
01:17:40,500 --> 01:17:44,859

L-BFGS doesn't work very well.

967
01:17:50,500 --> 01:17:57,039
-So when you millions of data points,
-you can't do L-BFGS for various reasons. Yeah. And L-BFGS is not
+So when you have millions of data points, you can't do
+L-BFGS for various reasons. Yeah. And L-BFGS is

968
01:17:57,039 --> 01:18:01,970
-very good with minibatch. You always
+not very good with minibatch. You always
have to do full batch by default. Question.

969
-01:18:01,970 --> 01:18:16,650
-(Student is asking question) So what is the tradeoff between depth and size roughly, like how do you allocate? Not a good
-answer for that unfortunately. So you
+01:18:01,970 --> 01:18:09,950
+(Student is asking question)
+
+969
+01:18:09,950 --> 01:18:13,650
+So what is the tradeoff between depth and
+size roughly, like how do you allocate?
+
+969
+01:18:13,650 --> 01:18:16,450
+Not a good answer for that unfortunately. 
970
-01:18:16,649 --> 01:18:20,899
-want, depth is good, but maybe after
+01:18:16,450 --> 01:18:20,899
+So you want, depth is good, but maybe after
like ten layers maybe, if you have simple dataset

@@ -4803,22 +5286,38 @@
971
it's not really adding too much. We have
one more minute so I can still take some

972
-01:18:25,220 --> 01:18:35,990
-questions. You had a question for a while. (Student is asking question) Yeah, so the
-tradeoff between where do I allocate my
+01:18:25,220 --> 01:18:26,620
+questions. You had a question for a while.
+
+972
+01:18:26,620 --> 01:18:31,520
+(Student is asking question)
+
+972
+01:18:31,520 --> 01:18:35,990
+Yeah, so the tradeoff between
+where do I allocate my

973
-01:18:35,989 --> 01:18:40,019
+01:18:35,990 --> 01:18:40,019
capacity, do I want us to be deeper or do
I want it to be wider, not a very good

974
-01:18:40,020 --> 01:18:47,860
-answer to that. (Student is asking question) Yes, usually, especially
-with images, we find that more layers are
+01:18:40,020 --> 01:18:41,860
+answer to that.
+
+974
+01:18:41,860 --> 01:18:44,560
+(Student is asking question)
+
+974
+01:18:44,560 --> 01:18:47,860
+Yes, usually, especially with
+images, we find that more layers are

975
-01:18:47,859 --> 01:18:51,199
+01:18:47,860 --> 01:18:51,199
critical. But sometimes when you have
simple datasets like 2D or some

@@ -4828,23 +5327,27 @@
976
other things like depth is not as
critical, and so it's kind of slightly

977
-01:18:55,359 --> 01:19:01,670
+01:18:55,359 --> 01:18:59,670
data dependent. We had a question over there.
+
+977
+01:18:59,670 --> 01:19:05,670
+(Student is asking question)

978
-01:19:01,670 --> 01:19:10,050
-Different activation functions for different layers, does that
-help? Usually it's not done. Usually we
+01:19:05,670 --> 01:19:10,050
+Different activation functions for different layers,
+does that help? Usually it's not done. Usually we

979
01:19:10,050 --> 01:19:15,960
-just gonna pick one and go with it. So say, for ConvNets
-for example, we'll see that
+just kind of pick one and go with it.
+So say, for ConvNets for example, we'll see that

980
01:19:15,960 --> 01:19:19,279
-most of them are [??] with ReLUs. And
-so you just use that throughout and
+most of them are trained just with ReLUs.
+And so you just use that throughout and

981
01:19:19,279 --> 01:19:22,389
@@ -4857,10 +5360,10 @@
too much, but in principle, there's
nothing preventing you. So it is 4:20,

983
-01:19:26,659 --> 01:19:29,789
+01:19:26,660 --> 01:19:29,789
so we're going to end here, but we'll see
-a lot more neural networks, so a lot of
+lots more neural networks, so a lot of

984
-01:19:29,789 --> 01:19:31,238
-these questions we'll go through them.
+01:19:29,789 --> 01:19:31,738
+these questions, we'll go through them.
\ No newline at end of file

From 735e1edc47dfb15615fd12138f6d5abaf1babd7d Mon Sep 17 00:00:00 2001
From: jung_hojin
Date: Thu, 2 Jun 2016 22:06:16 +0900
Subject: [PATCH 162/199] Fix line numbers

---
 captions/En/Lecture4_en.srt | 2012 +++++++++++++++++------------------
 1 file changed, 1006 insertions(+), 1006 deletions(-)

diff --git a/captions/En/Lecture4_en.srt b/captions/En/Lecture4_en.srt
index 16daaa80..d85c7302 100644
--- a/captions/En/Lecture4_en.srt
+++ b/captions/En/Lecture4_en.srt
@@ -92,7 +92,7 @@ midterm, even though I'm covering some
of the most important stuff usually in
the lecture, so do read through those lecture

-20 
+20
00:01:19,610 --> 00:01:25,618
notes. They're complementary to the lectures. And
so the material for the midterm will be

@@ -398,8 +398,8 @@ z, and they take on these specific values

81
00:05:29,778 --> 00:05:35,069
-in this example of -2, 5 and
--4, and we have this very small graph
+in this example of -2, 5 and -4,
+and we have this very small graph

82
00:05:35,069 --> 00:05:38,669

@@ -499,4871 +499,4871 @@ derivative of it,
00:06:56,021 --> 00:06:57,240
identity mapping?

-101
+102
00:06:59,000 --> 00:07:06,240
What is the gradient of df by df? It's one, right? So the identity has a gradient of one.

-102
+103
00:07:06,240 --> 00:07:10,329
So that's our base case. We start off with
a one, and now we're going to go

-103
+104
00:07:10,329 --> 00:07:18,519
backwards through this graph. So, we want
the gradient of f with respect to z.

-104
+105
00:07:18,519 --> 00:07:21,089
So what is that in this computational graph?

-104
+106
00:07:24,019 --> 00:07:27,089
Okay, it's q, so we have that written out right

-105
+107
00:07:27,089 --> 00:07:32,879
here and what is q in this particular example? It's 3, right? So the gradient

-106
+108
00:07:32,879 --> 00:07:36,279
on z, according to this will, become just 3. So I'm going to be writing the gradients

-107
+109
00:07:36,279 --> 00:07:42,309
under the lines in red and the values are in green above the lines. So with the

-108
+110
00:07:42,310 --> 00:07:48,420
gradient on the, in the front is 1, and now the gradient on z is 3, and what red 3 is telling

-109
+111
00:07:48,420 --> 00:07:52,009
you really intuitively, keep in mind the interpretation of a gradient, is what

-110
+112
00:07:52,009 --> 00:07:58,459
that's saying is that the influence of z on the final value is positive and

-111
+113
00:07:58,459 --> 00:08:02,859
with, sort of a force of 3. So if I increment z by a small amount h

-112
+114
00:08:02,860 --> 00:08:07,759
then the output of the circuit will react by increasing, because it's a

-113
+115
00:08:07,759 --> 00:08:13,009
positive 3, will increase by 3h, so small change will result in a positive

-114
+116
00:08:13,010 --> 00:08:18,560
change in the output. Now the gradient on q in this case will be

-115
+117
00:08:21,009 --> 00:08:30,860
So df/dq is z. What is z? -4. Okay? So we get a gradient of -4 on that path

-116
+118
00:08:30,860 --> 00:08:34,599
of the circuit, and what that's saying is that if q were to increase, then the output

-117
+119
00:08:34,599 --> 00:08:39,740
of the circuit will decrease, okay, by, if you increase by h, the output of the circuit

-118
+120
00:08:39,740 --> 00:08:44,789
will decrease by 4h. That's the slope, is -4. Okay, now we're going

-119
+121
00:08:44,789 --> 00:08:48,480
to continue this recursive process through this plus gate and this is where things get

-120
+122
00:08:48,480 --> 00:08:49,039
slightly interesting

-121
+123
00:08:49,039 --> 00:08:54,328
I suppose. 
So we'd like to compute the gradient on f on y with respect to y -122 +124 00:08:54,328 --> 00:09:00,208 and so the gradient on y in this particular graph will become -123 +125 00:09:03,909 --> 00:09:07,179 Let's just guess and then we'll see how this gets derived properly. -123 +126 00:09:12,209 --> 00:09:15,208 So I hear some murmurs of the right answer. It will be -4. So let's see how. -123 +127 00:09:15,209 --> 00:09:17,800 So there are many ways to derive it at this point -123 +128 00:09:17,801 --> 00:09:21,000 because the expression is very small and you can kind of, glance at it, but the way I'd like to -123 +129 00:09:21,001 --> 00:09:23,979 think about this is by applying chain rule, okay. -124 +130 00:09:23,980 --> 00:09:27,709 So the chain rule says that if you would like to derive the gradient of f on y -125 +131 00:09:27,710 --> 00:09:33,208 then it's equal to df/dq times dq/dy, right? And so we've -126 +132 00:09:33,208 --> 00:09:36,438 computed both of those expressions, in particular dq/dy, we know, is -127 +133 00:09:36,438 --> 00:09:42,519 -4, so that's the effect of the influence of q on f, is df/dq, which is -128 +134 00:09:42,519 --> 00:09:46,619 -4, and now we know the local, we'd like to know the local influence -129 +135 00:09:46,619 --> 00:09:52,449 of y on q, and that local influence of y on q is 1, because that's the local -130 +136 00:09:52,450 --> 00:09:58,969 as I'll refer to as the local derivative of y for the plus gate, and so the chain rule -131 +137 00:09:58,970 --> 00:10:02,019 tells us that the correct thing to do to chain these two gradients, the local -132 +138 00:10:02,019 --> 00:10:06,139 gradient of y on q, and the, kind of global gradient of q on the -133 +139 00:10:06,139 --> 00:10:10,948 output of the circuit, is to multiply them. So we'll get -4 times 1 -134 +140 00:10:10,948 --> 00:10:14,588 And so, this is kind of the, the crux of how backpropagation works. This is a very -135 +141 00:10:14,589 --> 00:10:18,209 important to understand here that, we have these two pieces that we keep -136 +142 00:10:18,210 --> 00:10:24,289 multiplying through when we perform the chain rule. We have q computed x + y, and -137 +143 00:10:24,289 --> 00:10:29,379 the derivative x and y, with respect to that single expression is 1 and 1. So keep -138 +144 00:10:29,379 --> 00:10:32,749 in mind the interpretation of the gradient. What that's saying is that x and y have a -139 +145 00:10:32,749 --> 00:10:38,509 positive influence on q, with a slope of 1. So increasing x by h -140 +146 00:10:38,509 --> 00:10:44,548 will increase q by h, and what we'd eventually like is, we'd like the influence of y -141 +147 00:10:44,548 --> 00:10:49,980 on the final output of the circuit, And so the way this end up working is, you take -142 +148 00:10:49,980 --> 00:10:53,480 the influence of y on q, and we know the influence of q on the final loss -143 +149 00:10:53,480 --> 00:10:57,058 which is what we are recursively computing here through this graph, and -144 +150 00:10:57,058 --> 00:11:00,350 the correct thing to do is to multiply them, so we end up with -4 times 1 -145 +151 00:11:00,351 --> 00:11:05,189 gets you -4. And so the way this works out is, basically what this is -146 +152 00:11:05,190 --> 00:11:08,649 saying is that the influence of y on the final output of the circuit is -4 -147 +153 00:11:08,649 --> 00:11:14,649 so increasing y should decrease the output of the circuit by -4 times the -148 +154 00:11:14,649 --> 00:11:18,230 little change that you've made. 
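As a minimal sketch of the chain-rule bookkeeping just described (hypothetical Python, not code from the lecture), the whole forward and backward pass for f(x, y, z) = (x + y) * z fits in a few lines and reproduces the numbers on the slide:

    # forward pass, with the example's values
    x, y, z = -2.0, 5.0, -4.0
    q = x + y              # q = 3
    f = q * z              # f = -12

    # backward pass: start with df/df = 1.0 at the output
    df = 1.0
    dz = q * df            # df/dz = q          ->  3
    dq = z * df            # df/dq = z          -> -4
    dx = 1.0 * dq          # dq/dx = 1, chained -> -4
    dy = 1.0 * dq          # dq/dy = 1, chained -> -4

Every backward line is one application of the rule being described: a local gradient multiplied by the gradient flowing in from above.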
And the way that end up working out is y has a -149 +155 00:11:18,230 --> 00:11:21,810 positive influence on q, so increasing y, slightly increases q -150 +156 00:11:21,810 --> 00:11:27,959 which slightly decreases the output of the circuit, okay? So chain rule is kind of giving us this -151 +157 00:11:27,960 --> 00:11:29,320 correspondence. Go ahead. -152 +158 00:11:29,320 --> 00:11:38,360 (Student is asking question) -152 +159 00:11:38,360 --> 00:11:42,559 Yeap, thank you. So we're going to get into this. You'll see many, basically this entire class -152 +160 00:11:42,559 --> 00:11:45,259 is about this, so you'll see many many instantiations of this and -153 +161 00:11:45,259 --> 00:11:48,889 I'll drill this into you by the end of this class and you'll understand it. You will not -154 +162 00:11:48,889 --> 00:11:51,870 have any symbolic expressions anywhere once we compute this, once we're actually -155 +163 00:11:51,870 --> 00:11:54,639 implementing this and you'll see implementations of it later in this. -156 +164 00:11:54,639 --> 00:11:57,009 It will always be just be vectors and numbers. -157 +165 00:11:57,009 --> 00:12:02,230 Raw vectors, numbers. Okay, and looking at x, we have a very similar thing that happens. -158 +166 00:12:02,230 --> 00:12:05,889 We want df/dx. That's our final objective, but, and we have to combine it. -159 +167 00:12:05,889 --> 00:12:09,799 We know what the x's, what is x's influence on q and what is q's influence -160 +168 00:12:09,799 --> 00:12:13,979 on the end of the circuit, and so that ends up being the chain rule, so you take -161 +169 00:12:13,980 --> 00:12:19,240 -4 times 1 and gives you -4, okay? So the way this works, to generalize a -162 +170 00:12:19,240 --> 00:12:23,289 bit from this example and the way to think about it is as follows. You are a gate -163 +171 00:12:23,289 --> 00:12:28,429 embedded in a circuit and this is a very large computational graph or circuit and -164 +172 00:12:28,429 --> 00:12:32,250 you receive some inputs, some particular numbers x and y come in -165 +173 00:12:32,250 --> 00:12:39,059 and you perform some operation f on them and compute some output z. And now this -166 +174 00:12:39,059 --> 00:12:43,019 value of z goes into computational graph and something happens to it but you're just -167 +175 00:12:43,019 --> 00:12:46,169 a gate hanging out in a circuit and you're not sure what happens, but by the -168 +176 00:12:46,169 --> 00:12:50,939 end of the circuit the loss gets computed, okay? And that's the forward pass and then we're -169 +177 00:12:50,940 --> 00:12:56,250 proceeding recursively in the reverse order backwards, but before that actually, -170 +178 00:12:56,250 --> 00:13:01,120 before I get to that part, right away when I get x and y, the thing I'd like to point out that -171 +179 00:13:01,120 --> 00:13:05,279 during the forward pass, if you're this gate and you get to your values x and y -172 +180 00:13:05,279 --> 00:13:08,500 you compute your output z, and there's another thing you can compute right away and -173 +181 00:13:08,500 --> 00:13:10,230 that is the local gradients on x and y. -174 +182 00:13:10,230 --> 00:13:14,789 So I can compute those right away because I'm just a gate and I know what -175 +183 00:13:14,789 --> 00:13:18,009 I'm performing, like say addition or multiplication, so I know the influence that -176 +184 00:13:18,009 --> 00:13:24,259 x and y have on my output value, so I can compute those guys right away, okay? 
But then -177 +185 00:13:24,259 --> 00:13:25,389 what happens -178 +186 00:13:25,389 --> 00:13:29,769 near the end, so the loss gets computed and now we're going backwards, I'll eventually learn -179 +187 00:13:29,769 --> 00:13:32,499 about what is my influence on -180 +188 00:13:32,499 --> 00:13:37,839 the final output of the circuit, the loss. So I'll learn what is dL/dz in there. -181 +189 00:13:37,839 --> 00:13:41,419 The gradient will flow into me and what I have to do is I have to chain that -182 +190 00:13:41,419 --> 00:13:45,278 gradient through this recursive case, so I have to make sure to chain the -183 +191 00:13:45,278 --> 00:13:48,778 gradient through my operation that I performed and it turns out that the correct thing -184 +192 00:13:48,778 --> 00:13:52,068 to do here by chain rule, really what it's saying, is that the correct thing to do is to -185 +193 00:13:52,068 --> 00:13:56,068 multiply your local gradient with that gradient and that actually gives you the -186 +194 00:13:56,068 --> 00:13:57,838 dL/dx that gives you the -187 +195 00:13:57,839 --> 00:14:02,739 influence of x on the final output of the circuit. So really, chain rule is just -188 +196 00:14:02,739 --> 00:14:08,229 this added multiplication. where we take our, what I'll call, global gradient of this -189 +197 00:14:08,229 --> 00:14:12,669 gate on the output, and we chain it through the local gradient, and the same -190 +198 00:14:12,669 --> 00:14:18,509 thing goes for y. So it's just a multiplication of that guy, that gradient -191 +199 00:14:18,509 --> 00:14:22,889 by your local gradient if you're a gate. And then remember that these x's and y's -192 +200 00:14:22,889 --> 00:14:27,229 they are coming from different gates, right? So you end up with recursing -193 +201 00:14:27,229 --> 00:14:31,899 this process through the entire computational circuit, and so these gates -194 +202 00:14:31,899 --> 00:14:36,808 just basically communicate to each other the influence on the final loss, so they -195 +203 00:14:36,808 --> 00:14:39,688 tell each other, okay if this is a positive gradient that means you're positively -196 +204 00:14:39,688 --> 00:14:43,198 influencing the loss, if it's a negative gradient you're negatively -197 +205 00:14:43,198 --> 00:14:46,788 influencing the loss, and these just get all multiplied through the circuit by these -198 +206 00:14:46,788 --> 00:14:51,019 local gradients and you end up with, and this process is called backpropagation. -199 +207 00:14:51,019 --> 00:14:54,489 It's a way of computing through a recursive application of chain rule -200 +208 00:14:54,489 --> 00:14:58,399 through computational graph, the influence of every single intermediate value in -201 +209 00:14:58,399 --> 00:15:02,158 that graph on the final loss function. So we'll see many examples of this -202 +210 00:15:02,158 --> 00:15:06,918 throughout the lecture. I'll go into a specific example that is a slightly -203 +211 00:15:06,918 --> 00:15:11,298 larger and we'll work through it in detail. But I don't know if there are any questions at -204 +212 00:15:11,298 --> 00:15:13,000 this point that anyone would like to ask. Go ahead. -204 +213 00:15:13,001 --> 00:15:16,000 What happens if z is used by two other nodes? -204 +214 00:15:16,001 --> 00:15:19,000 If z is used by multiple nodes, I'm going to come back to that. -205 +215 00:15:19,000 --> 00:15:23,537 You add the gradients. The gradient, the correct thing to do is you add them. 
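The answer above ("you add them") can be checked with a tiny hedged example (made-up values, not from the lecture): if z fans out into two branches, the backward flows into z sum, by the multivariate chain rule.

    # f = z*a + z*b, so df/dz should be a + b
    z, a, b = 2.0, 3.0, -1.0
    u = z * a              # first use of z
    v = z * b              # second use of z
    f = u + v

    df = 1.0
    du = 1.0 * df          # '+' passes the gradient through
    dv = 1.0 * df
    dz = a * du + b * dv   # contributions from both uses add: 3 - 1 = 2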
-206
+216
00:15:23,538 --> 00:15:29,928
So if z is being influenced in multiple places in the circuit, the backward flows will add.

-207
+217
00:15:29,928 --> 00:15:31,539
I will come back to that point. Go ahead.

-208
+218
00:15:31,539 --> 00:15:53,038
(Student is asking question)

-208
+219
00:15:53,039 --> 00:15:59,139
Yeap. So I think, I would've repeated your question, but you're jumping ahead like 100 slides.

-208
+220
00:15:59,539 --> 00:16:03,139
So we're going to get to all of those issues and we're going to see, you're

-209
+221
00:16:03,139 --> 00:16:05,769
going to get what we call vanishing gradient problems and so on.

-210
+222
00:16:05,769 --> 00:16:10,669
We'll see. Okay, let's go through another example to make this more concrete.

-211
+223
00:16:10,669 --> 00:16:14,318
So here we have another circuit. It happens to be computing a little two-dimensional

-212
+224
00:16:14,318 --> 00:16:18,179
sigmoid neuron, but for now don't worry about that interpretation. Just think of this

-213
+225
00:16:18,179 --> 00:16:22,849
as, that's an expression so one-over-one-plus-e-to-the-whatever, so the number of

-214
+226
00:16:22,850 --> 00:16:29,000
inputs here is five, and we're computing that function and we have a single output over there, okay?

-215
+227
00:16:29,000 --> 00:16:32,490
And I've translated that mathematical expression into this computational graph form, so

-216
+228
00:16:32,490 --> 00:16:35,769
we have to recursively from inside out compute this expression so we first do

-217
+229
00:16:35,769 --> 00:16:42,129
all the little w times x's, and then we add them all up and then we take a

-218
+230
00:16:42,129 --> 00:16:46,129
negative of it and then we exponentiate that and then we add one and then we

-219
+231
00:16:46,129 --> 00:16:49,769
finally divide and we get the result of the expression. And so what we're going to do

-220
+232
00:16:49,769 --> 00:16:52,409
now is we're going to backpropagate through this expression. We're going to

-221
+233
00:16:52,409 --> 00:16:56,500
compute what the influence of every single input value is on the output of

-222
+234
00:16:56,500 --> 00:16:59,230
this expression, what is the gradient here. Yeap, go ahead.

-222
+235
00:16:59,231 --> 00:17:10,229
(Student is asking question)

-223
+236
00:17:10,230 --> 00:17:15,229
So for now, so you're concerned about the interpretation of plus may be in these circles.

-223
+237
00:17:15,230 --> 00:17:22,039
For now, let's just assume that this plus is a binary plus. It's a binary plus gate, and we have there

-224
+238
00:17:22,039 --> 00:17:26,519
plus one gate. I'm making up these gates on the spot, and we'll see that what is a

-225
+239
00:17:26,519 --> 00:17:31,000
gate or is not a gate is kind of up to you. I'll come back to this point in a bit.

-226
+240
00:17:31,001 --> 00:17:35,639
So for now, I just like, we have several more gates that we're using throughout, and so

-227
+241
00:17:35,640 --> 00:17:38,650
I'd just like to write out as we go through this example several of these

-228
+242
00:17:38,650 --> 00:17:42,720
derivatives. So we have exponentiation and we know for every little local gate what these

-229
+243
00:17:42,720 --> 00:17:49,048
local gradients are, right? So we can derive that using calculus. So e^x derivative is e^x and

-230
+244
00:17:49,048 --> 00:17:52,900
So these are all the operations and also addition and multiplication -231 +245 00:17:52,900 --> 00:17:56,040 which I'm assuming that you have memorized in terms of what the gradients -232 +246 00:17:56,040 --> 00:17:58,970 look like. So we're going to start off at the end of the circuit and I've -233 +247 00:17:58,970 --> 00:18:03,450 already filled in a 1.00 in the back because that's how we always -234 +248 00:18:03,450 --> 00:18:04,890 start this recursion with a 1.0 -235 +249 00:18:04,891 --> 00:18:10,519 right, since that's the gradient on the identity function. Now we're going -236 +250 00:18:10,519 --> 00:18:17,849 to backpropagate through this 1/x operation, okay? So the derivative of 1/x -237 +251 00:18:17,849 --> 00:18:22,048 the local gradient is -1/(x^2), so that 1/x gate -238 +252 00:18:22,048 --> 00:18:27,119 during the forward pass received input 1.37 and right away that 1/x gate -239 +253 00:18:27,119 --> 00:18:30,759 could have computed what the local gradient was. The local gradient was -240 +254 00:18:30,759 --> 00:18:35,048 -1/(x^2) and now during backpropagation, it has to, by chain rule, -241 +255 00:18:35,048 --> 00:18:40,750 multiply that local gradient by the gradient of it on the final output of the circuit -242 +256 00:18:40,750 --> 00:18:44,789 which is easy because it happens to be at the end. So what ends up being the -243 +257 00:18:44,789 --> 00:18:49,349 expression for the backpropagated gradient here, from the 1/x gate? -244 +258 00:18:54,049 --> 00:19:59,048 The chain rule always has two pieces: local gradient times the gradient from the top -244 +259 00:18:59,049 --> 00:19:01,300 or from above. -245 +260 00:19:04,301 --> 00:19:08,069 (Student is answering) -245 +261 00:19:08,301 --> 00:19:12,500 Um, yeah. Okay. Yeah, so that's correct. -245 +262 00:19:12,501 --> 00:19:18,069 So we get -1/x^2, which is the gradient df/dx. So that is the local gradient. -246 +263 00:19:18,069 --> 00:19:23,480 -1/3.7^2 and then multiplied by 1.0 which is -247 +264 00:19:23,480 --> 00:19:27,940 the gradient from above, which is really just 1 because we've just started, and I'm applying -248 +265 00:19:27,940 --> 00:19:34,850 chain rule right away here and the output is -0.53. So that's the gradient on -249 +266 00:19:34,850 --> 00:19:38,798 that piece of the wire, where this value was flowing, okay. So it has a negative -250 +267 00:19:38,798 --> 00:19:43,889 effect on the output. And you might expect that right, because if you were to -251 +268 00:19:43,890 --> 00:19:47,850 increase this value and then it goes through a gate of 1/x, then if you -252 +269 00:19:47,851 --> 00:19:50,939 increase this, 1/x get smaller, so that's why you're seeing a negative -253 +270 00:19:50,940 --> 00:19:55,620 gradient, right. So we're going to continue backpropagation here. The next gate -254 +271 00:19:55,621 --> 00:19:58,400 in the circuit, it's adding a constant of 1, -255 +272 00:19:58,400 --> 00:20:01,048 so the local gradient, if you look at -256 +273 00:20:01,048 --> 00:20:06,960 adding a constant to a value, the gradient on x is just 1, right, -256 +274 00:20:06,961 --> 00:20:13,169 from basic calculus. And so the chained gradient here that we continue along the wire -257 +275 00:20:13,169 --> 00:20:27,868 will be... (Student is answering) -257 +276 00:20:17,869 --> 00:20:22,940 We have a local gradient, which is 1, times the gradient from above the -258 +277 00:20:22,940 --> 00:20:28,590 gate, which it has just learned is -0.53, okay? 
So -0.53 continues along the -259 +278 00:20:28,590 --> 00:20:34,709 wire unchanged. And intuitively that makes sense right, because this value -260 +279 00:20:34,710 --> 00:20:38,319 floats and it has some influence on the final circuit and now, if you're -261 +280 00:20:38,319 --> 00:20:42,798 adding 1, then its influence, its rate of change, its slope towards the final -262 +281 00:20:42,798 --> 00:20:46,970 value doesn't change. If you increase this by some amount, the effect at the -263 +282 00:20:46,970 --> 00:20:51,548 end will be the same, because the rate of change doesn't change through the +1 gate. -264 +283 00:20:51,548 --> 00:20:57,859 It's just a constant offset, okay? We continue derivation here. So the gradient of e^x is -265 +284 00:20:57,859 --> 00:21:01,599 e^x, so to continue backpropagation we're going to perform, -266 +285 00:21:01,599 --> 00:21:05,000 so this gate saw input of -1. -267 +286 00:21:05,000 --> 00:21:08,329 It right away could have computed its local gradient, and now it knows that the -268 +287 00:21:08,329 --> 00:21:12,259 gradient from above is -0.53. So to continue backpropagation -269 +288 00:21:12,259 --> 00:21:15,000 here and apply chain rule, we would receive... -269 +289 00:21:15,000 --> 00:21:17,400 (Student is answering) -269 +290 00:21:17,400 --> 00:21:20,000 Okay, so these are most of the rhetorical questions so I'm -270 +291 00:21:20,000 --> 00:21:25,119 not sure, but yeah, basically e^(-1) which is the e^x, -271 +292 00:21:25,119 --> 00:21:30,569 the x input to this exp gate times the chain rule, right, so the gradient from above is -0.53 -272 +293 00:21:30,569 --> 00:21:35,269 so we keep multiplying that on. So what is the effect on me and what do I have an -273 +294 00:21:35,269 --> 00:21:39,069 effect on the final end of the circuit, those are being always multiplied. So we -274 +295 00:21:39,069 --> 00:21:46,859 get -0.2 at this point. So now we have a *(-1) gate. So what -275 +296 00:21:46,859 --> 00:21:50,279 ends up happening, what happens to the gradient when you do a times -1 in the -276 +297 00:21:50,279 --> 00:21:53,139 computational graph? -276 +298 00:21:53,139 --> 00:21:57,139 It flips around, right? Because we have basically, a constant multiply of input -277 +299 00:21:57,140 --> 00:22:02,038 which happened to be a constant of -1, so 1 * -1 -278 +300 00:22:02,038 --> 00:22:05,548 gave us -1 in the forward pass, and so now we have to -279 +301 00:22:05,548 --> 00:22:09,569 multiply by a, that's the local gradient, times the gradient from above which is -0.2 -280 +302 00:22:09,569 --> 00:22:14,879 so we end up with just +0.2 now. So now we're continuing backpropagation -281 +303 00:22:14,880 --> 00:22:21,110 We're backpropagating '+' and this '+' operation has multiple inputs here, the gradient, -282 +304 00:22:21,110 --> 00:22:25,599 the local gradient for the plus gate is 1 and 1, so what ends up happening to, -283 +305 00:22:25,599 --> 00:22:27,359 what gradients flow along the output wires? -284 +306 00:22:42,359 --> 00:22:48,089 So the plus gate has a local gradient on all of its inputs always will be just one, right, because -285 +307 00:22:48,089 --> 00:22:53,769 if you just have a function, you know, x+y, then for that function -286 +308 00:22:53,769 --> 00:22:58,109 the gradient on either x or y is just one and so what you end up getting is just -287 +309 00:22:58,109 --> 00:23:03,619 1 * 0.2. 
And so, in fact for a plus gate, always you see the same fact -288 +310 00:23:03,619 --> 00:23:07,469 where the local gradient of all of its inputs is 1, and so whatever gradient it -289 +311 00:23:07,470 --> 00:23:11,289 gets from above, it just always distributes gradient equally to all of -290 +312 00:23:11,289 --> 00:23:14,339 its inputs, because in the chain rule, they'll get multiplied and when you multiply by 1 -291 +313 00:23:14,339 --> 00:23:18,129 something remains unchanged. So a '+' gate, it's kind of like a gradient -292 +314 00:23:18,130 --> 00:23:22,170 distributor, where if something flows in from the top, it will just spread out all -293 +315 00:23:22,170 --> 00:23:26,560 the gradients equally to all of its children. And so we've already received -294 +316 00:23:26,560 --> 00:23:32,139 one of the inputs is gradient 0.2 here on the very final output of the circuit -295 +317 00:23:32,140 --> 00:23:35,970 and so this influence has been computed through a series of applications of -296 +318 00:23:35,970 --> 00:23:42,450 chain rule along the way. There was another plus gate that I've skipped over, and so this -297 +319 00:23:42,450 --> 00:23:47,090 0.2 kind of distributes to both 0.2 0.2 equally so we've already done a -298 +320 00:23:47,090 --> 00:23:51,750 plus gate, and there's a multiply gate there, and so now we're going to backpropagate -299 +321 00:23:51,750 --> 00:23:55,940 through that multiply operation. And so the local grad, so the, -300 +322 00:23:55,940 --> 00:24:02,450 so what will be the gradients for w0 and x0? What will be the gradient for w0, specifically? -300 +323 00:24:02,450 --> 00:24:06,450 (Student is answering) -301 +324 00:24:06,450 --> 00:24:17,059 Did someone say 0? 0 will be wrong. It will be, so the gradient w1 will be, w0 sorry, will be -302 +325 00:24:17,059 --> 00:24:24,389 -1 * 0.2. Good. And the gradient on x0 will be, there is a bug, by the way, in the slide -303 +326 00:24:24,390 --> 00:24:27,840 that I just noticed like few minutes before I actually created the class. -304 +327 00:24:27,840 --> 00:24:34,289 Created the, started the class. So you see 0.39 there it should be 0.4. It's -305 +328 00:24:34,289 --> 00:24:37,480 because of a bug in the visualization because I'm truncating at 2-decimal -306 +329 00:24:37,480 --> 00:24:41,190 digits, but anyways, basically that should be 0.4 because the way you get that -307 +330 00:24:41,190 --> 00:24:45,400 is 2 * 0.2 gives you 0.4 just like I've written out over there. -308 +331 00:24:45,400 --> 00:24:50,980 So that's what the output should be there. Okay, so we've backpropagated this -309 +332 00:24:50,980 --> 00:24:55,190 circuit here and we've backpropagated through this expression and so you might imagine in -310 +333 00:24:55,190 --> 00:24:59,289 our actual downstream applications, we'll have data and all the parameters as inputs -311 +334 00:24:59,289 --> 00:25:03,450 the loss function is at the top at the end, so we'll do forward pass to evaluate -312 +335 00:25:03,450 --> 00:25:06,440 the loss function and then we'll backpropagate through every piece of -313 +336 00:25:06,440 --> 00:25:10,450 computation we've done along the way, and we'll backpropagate through every gate to -314 +337 00:25:10,450 --> 00:25:14,150 get our inputs, and backpropagate just means apply chain rule many many times -315 +338 00:25:14,150 --> 00:25:18,220 and we'll see how that is implemented in a bit. Sorry, did you have a question? 
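Staged gate by gate, the backward pass just walked through fits in a short hedged script (the remaining slide inputs w1 = -3, x1 = -2, w2 = -3 are assumptions consistent with the sum of 1.0 used above); it reproduces the 0.2 gradients and the 0.4 that the visualization truncates to 0.39:

    import math

    w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

    # forward pass, inside out
    s = w0*x0 + w1*x1 + w2     # 1.0
    n = -1.0 * s               # *-1 gate
    e = math.exp(n)            # exp gate, ~0.37
    p = e + 1.0                # +1 gate, ~1.37
    f = 1.0 / p                # 1/x gate, ~0.73

    # backward pass, chain rule at every gate
    dp = (-1.0 / p**2) * 1.0   # 1/x local gradient -> ~-0.53
    de = 1.0 * dp              # +1 gate, unchanged -> ~-0.53
    dn = math.exp(n) * de      # exp local gradient -> ~-0.20
    ds = -1.0 * dn             # *-1 flips the sign -> ~+0.20
    dw0 = x0 * ds              # -0.2
    dx0 = w0 * ds              #  0.4
    dw1 = x1 * ds              # -0.4
    dx1 = w1 * ds              # -0.6
    dw2 = 1.0 * ds             #  0.2  ('+' distributes ds unchanged)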
-316 +339 00:25:18,220 --> 00:25:20,520 (Student is asking question) -316 +340 00:25:20,521 --> 00:25:23,021 Oh yes, so I'm going to skip that because it's the same. -316 +341 00:25:23,021 --> 00:25:27,821 So I'm going to skip the other times gate. Any other questions at this point? -316 +342 00:25:27,821 --> 00:25:32,969 (Student is asking question) -317 +343 00:25:32,969 --> 00:25:37,200 That's right. so the costs of forward and backward propagation are roughly equal. -317 +344 00:25:37,200 --> 00:25:44,100 (Student is asking question) -317 +345 00:25:44,100 --> 00:25:45,869 Well, it should be, it almost always ends -318 +346 00:25:45,869 --> 00:25:49,500 up being basically equal when you look at timings, usually the backward pass is slightly -319 +347 00:25:49,500 --> 00:25:52,000 slower, but yeah. -319 +348 00:25:55,000 --> 00:25:58,710 Okay, so let's see, one thing I wanted to point out, before we move on, is that -320 +349 00:25:58,710 --> 00:26:02,350 the setting of these gates, like these gates are arbitrary, so one thing I could -321 +350 00:26:02,350 --> 00:26:06,509 have done, for example, is, some of you may know this, I can collapse these gates -322 +351 00:26:06,509 --> 00:26:10,549 into one gate if I wanted to. For example, There is something called the sigmoid function -323 +352 00:26:10,549 --> 00:26:14,069 which has that particular form, so a sigma of x which is the sigmoid function -324 +353 00:26:14,069 --> 00:26:19,460 computes 1/(1+e^(-x)) and so I could have rewritten that -325 +354 00:26:19,460 --> 00:26:22,650 expression and I could have collapsed all of those gates that made up the sigmoid -326 +355 00:26:22,650 --> 00:26:27,769 gate into a single sigmoid gate. And so there's a sigmoid gate here, and I could have done -327 +356 00:26:27,769 --> 00:26:32,440 that in a single go, sort of, and what I would have had to do, if I wanted to have -328 +357 00:26:32,440 --> 00:26:37,980 that gate, is I need to compute an expression for how this, so what is the -329 +358 00:26:37,980 --> 00:26:41,670 local gradient for the sigmoid gate basically? So what is the gradient on the -330 +359 00:26:41,670 --> 00:26:44,470 sigmoid gate on its input and I have to go through some math which I'm not going to -331 +360 00:26:44,470 --> 00:26:46,980 go into detail but you end up with that expression over there. -332 +361 00:26:46,980 --> 00:26:51,750 It ends up being (1-sigmoid(x)) * sigmoid(x). That's the local gradient and that -333 +362 00:26:51,750 --> 00:26:55,450 allows me to now, put this piece into a computational graph, because once I know -334 +363 00:26:55,450 --> 00:26:58,819 how to compute the local gradient everything else is defined just through -335 +364 00:26:58,819 --> 00:27:02,389 chain rule and multiply everything together. So we can backpropagate -336 +365 00:27:02,390 --> 00:27:06,720 through this sigmoid gate now, and the way that would look like is, the input to the -337 +366 00:27:06,720 --> 00:27:11,750 sigmoid gate was 1.0, that's what went into the sigmoid gate, and 0.73 went out. -338 +367 00:27:11,750 --> 00:27:18,759 So 0.73 is sigma of x, okay? 
And now we want the local gradient which is, as we've seen

-339
+368
00:27:18,759 --> 00:27:22,559
from the math that I performed there (1 - sigma(x)) * sigma(x)

-339
+369
00:27:22,559 --> 00:27:26,450
so you get, sigma(x) is 0.73, multiplying (1 - 0.73)

-340
+370
00:27:26,450 --> 00:27:31,170
that's the local gradient and then times, we happen to be at the end

-341
+371
00:27:31,170 --> 00:27:34,170
of the circuit, so times 1.0, which I'm not even writing.

-341
+372
00:27:34,170 --> 00:27:36,330
So we end up with 0.2. And of course we

-342
+373
00:27:36,330 --> 00:27:37,649
get the same answer

-343
+374
00:27:37,650 --> 00:27:42,220
0.2, as we received before, 0.2, because calculus works, but basically we

-344
+375
00:27:42,220 --> 00:27:44,480
could have broken up this expression down and

-345
+376
00:27:44,480 --> 00:27:47,450
did one piece at a time or we could just have a single sigmoid gate and that's

-346
+377
00:27:47,450 --> 00:27:51,569
kind of up to us at what level of hierarchy do we break these expressions

-347
+378
00:27:51,569 --> 00:27:52,339
and so you'd like to

-348
+379
00:27:52,339 --> 00:27:55,829
intuitively, cluster these expressions into single gates if it's very efficient

-349
+380
00:27:55,829 --> 00:27:59,800
or easy to derive the local gradients because then those become your pieces.

-349
+381
00:28:00,000 --> 00:28:05,819
(Student is asking question)

-350
+382
00:28:05,819 --> 00:28:10,529
Yes. So the question is, do libraries typically do that? Do they worry about, you know

-351
+383
00:28:10,529 --> 00:28:14,058
what's easy to or convenient to compute and the answer is yeah, I would say so,

-352
+384
00:28:14,058 --> 00:28:17,480
So if you noticed that there are some piece of operation you'd like to do over

-353
+385
00:28:17,480 --> 00:28:20,798
and over again, and it has a very simple local gradient, then that's something very

-354
+386
00:28:20,798 --> 00:28:24,900
appealing to actually create a single unit out of, and we'll see some of those

-355
+387
00:28:24,900 --> 00:28:30,230
examples actually in a bit I think. Okay, I'd like to also point out that once you,

-356
+388
00:28:30,230 --> 00:28:32,490
the reason I like to think about these computational graphs, is it really helps

-357
+389
00:28:32,490 --> 00:28:36,289
your intuition to think about how gradients flow in a neural network. It's not just,

-358
+390
00:28:36,289 --> 00:28:39,369
you don't want this to be a black box to you, you want to understand

-359
+391
00:28:39,369 --> 00:28:43,959
intuitively how this happens, and you start to develop after a while of

-360
+392
00:28:43,960 --> 00:28:47,850
looking at computational graphs intuitions about how these gradients flow, and this

-361
+393
00:28:47,850 --> 00:28:52,029
by the way, helps you debug some issues like, say, we'll go to vanishing gradient problem

-362
+394
00:28:52,029 --> 00:28:55,950
it's much easier to understand exactly what's going wrong in your optimization

-363
+395
00:28:55,950 --> 00:28:59,250
if you understand how gradients flow in networks. It will help you debug these

-364
+396
00:28:59,250 --> 00:29:02,740
networks much more efficiently. And so some intuitions for example, we already

-365
+397
00:29:02,740 --> 00:29:07,609
saw the add gate. It has a local gradient of 1 to all of its inputs, so

-366
+398
00:29:07,609 --> 00:29:11,279
it's just a gradient distributor. 
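These flow patterns (the add gate here, plus the max and multiply gates discussed next) can be written down as one-line backward rules; a hedged sketch, not library code:

    def add_backward(dz):
        # '+' is a gradient distributor: both inputs get dz unchanged
        return dz, dz

    def max_backward(x, y, dz):
        # 'max' is a gradient router: only the larger input gets dz
        return (dz, 0.0) if x >= y else (0.0, dz)

    def mul_backward(x, y, dz):
        # '*' scales dz by the *other* input's value
        return y * dz, x * dz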
That's like a nice way to think about it -367 +399 00:29:11,279 --> 00:29:14,548 whenever you have a plus operation anywhere in your score function or your -368 +400 00:29:14,548 --> 00:29:18,740 ConvNet or anywhere else. It just distributes gradients equally. The max gate is -369 +401 00:29:18,740 --> 00:29:23,009 instead, a gradient router, and the way this works is, if you look at the expression -370 +402 00:29:23,009 --> 00:29:30,970 like, we have. Great, these markers don't work. So if you have a very simple binary -371 +403 00:29:30,970 --> 00:29:38,410 expression of max(x, y), so this is a gate. Then, the gradient on x and y, if you -372 +404 00:29:38,410 --> 00:29:42,570 think about it, the gradient on the larger one of your inputs, whichever one was larger -373 +405 00:29:42,570 --> 00:29:46,389 the gradient on that guy is one and all this, and the smaller one has a gradient of 0. -374 +406 00:29:46,390 --> 00:29:50,630 And intuitively, that's because if one of these was smaller, then wiggling it has no -375 +407 00:29:50,630 --> 00:29:53,220 effect on the output because the other guy is larger and that's what ends up -376 +408 00:29:53,220 --> 00:29:57,009 propagating through the gate. So you end up with a gradient of 1 on the -377 +409 00:29:57,009 --> 00:30:03,140 larger one of the inputs, and so that's why max gate is a gradient router. If I'm -378 +410 00:30:03,140 --> 00:30:06,420 a max gate and I have received several inputs, one of them was the largest of -379 +411 00:30:06,420 --> 00:30:09,550 all of them and that's the value that I propagated through the circuit. -380 +412 00:30:09,550 --> 00:30:12,909 At backpropagation time, I'm just going to receive my gradient from above and I'm -381 +413 00:30:12,910 --> 00:30:16,590 going to route it to whoever was my largest input. So it's a gradient router. -382 +414 00:30:17,000 --> 00:30:22,569 And the multiply gate is a gradient switcher. Actually I don't think that's a very good -383 +415 00:30:22,569 --> 00:30:26,960 way to look at it, but I'm referring to the fact that it's not actually -384 +416 00:30:26,960 --> 00:30:28,150 nevermind about that part. -384 +417 00:30:29,560 --> 00:30:30,860 Go ahead. -384 +418 00:30:30,860 --> 00:30:36,650 (Student is asking question) -384 +419 00:30:36,650 --> 00:30:39,150 So your question is what happens if the two -385 +420 00:30:39,150 --> 00:30:41,470 inputs are equal when you go through max gate. -385 +421 00:30:44,150 --> 00:30:46,150 Yeah, what happens? -385 +422 00:30:46,150 --> 00:30:48,470 (Student is answering) -385 +423 00:30:48,470 --> 00:30:50,000 Yeah, you pick one. Yeah. -385 +424 00:30:52,300 --> 00:30:53,470 Yeah, I don't think it's -386 +425 00:30:53,470 --> 00:30:57,559 correct to distributed to all of them. I think you'd have to pick one. -387 +426 00:30:58,259 --> 00:31:01,990 But that basically never happens in actual practice. -387 +427 00:31:05,559 --> 00:31:07,990 Okay, so max gradient here, I actually -388 +428 00:31:07,990 --> 00:31:13,019 have an example. So z, here, was larger than w, so only z has an influence on -389 +429 00:31:13,019 --> 00:31:16,839 the output of this max gate, right? So when 2 flows into the max gate -390 +430 00:31:16,839 --> 00:31:20,879 it gets routed to z, and w gets a 0 gradient because its effect on the circuit is -391 +431 00:31:20,880 --> 00:31:25,360 nothing. 
There is 0, because when you change it, it doesn't matter when you change -392 +432 00:31:25,360 --> 00:31:29,689 it, because z is the larger value going through the computational graph. -393 +433 00:31:29,690 --> 00:31:33,100 I have another note that is related to backpropagation which we already -394 +434 00:31:33,100 --> 00:31:36,490 addressed through a question. I just wanted to briefly point out with a terribly -395 +435 00:31:36,490 --> 00:31:40,440 bad looking figure that if you have these circuits and sometimes you have a -396 +436 00:31:40,440 --> 00:31:43,330 value that branches out into a circuit and is used in multiple parts of the -397 +437 00:31:43,330 --> 00:31:47,179 circuit, the correct thing to do by multivariate chain rule, is to actually -398 +438 00:31:47,180 --> 00:31:51,110 add up the contributions at the operation. -398 +439 00:31:51,110 --> 00:31:55,110 So gradients add when they backpropagate -399 +440 00:31:55,110 --> 00:32:00,009 backwards through the circuit. If they ever flow, they add up in these backward flow -400 +441 00:32:00,009 --> 00:32:04,879 All right. We're going to go into implementation very soon. I'll just take some -401 +442 00:32:04,880 --> 00:32:05,700 more questions. -402 +443 00:32:05,700 --> 00:32:08,820 (Student is asking question) -402 +444 00:32:08,820 --> 00:32:11,620 Thank you for the question. The question is, is there ever, like a loop in these -403 +445 00:32:11,620 --> 00:32:15,839 graphs. There will never be loops, so there are never any loops. You might think that -404 +446 00:32:15,839 --> 00:32:18,589 if you use a recurrent neural network, that there are loops in there -405 +447 00:32:18,589 --> 00:32:21,658 but there are actually no loops because what we'll do is we'll take a recurrent neural -406 +448 00:32:21,659 --> 00:32:26,230 network and we will unfold it through time steps and this will all become, there -407 +449 00:32:26,230 --> 00:32:30,530 will never be a loop in the unfolded graph where we've copied pasted that small recurrent net piece -407 +450 00:32:30,530 --> 00:32:31,259 over time. -408 +451 00:32:31,259 --> 00:32:35,059 You'll see that more when we actually get into it but these are always DAGs -408 +452 00:32:35,059 --> 00:32:36,338 There are no loops. -408 +453 00:32:38,059 --> 00:32:39,538 Okay, awesome. -409 +454 00:32:39,538 --> 00:32:42,220 So let's look at the implementation of how this is actually implemented in practice and -410 +455 00:32:42,220 --> 00:32:46,990 I think it will help make this more concrete as well. So we always have these -411 +456 00:32:46,990 --> 00:32:48,938 graphs, computational graphs. -411 +457 00:32:48,938 --> 00:32:52,038 These are the best way to think about structuring neural networks. -412 +458 00:32:52,038 --> 00:32:56,929 And so what we end up with is, all these gates that we're going to see a bit, but -413 +459 00:32:56,929 --> 00:33:00,059 on top of the gates, there something's that needs to maintain connectivity structure -414 +460 00:33:00,059 --> 00:33:03,490 of this entire graph, what gates are connected to each other. And so usually -415 +461 00:33:03,490 --> 00:33:09,710 that's handled by a graph or a net object, usually a net, and the net object has these -416 +462 00:33:09,710 --> 00:33:13,679 two main pieces, which is the forward and the backward piece. 
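A hedged sketch of such a net object (the names here are assumptions, not the Torch or Caffe API) makes the two pieces concrete:

    class Net:
        def __init__(self, gates):
            # gates are assumed sorted in topological order, so every
            # gate's inputs are computed before the gate itself runs
            self.gates = gates

        def forward(self):
            for gate in self.gates:             # left to right
                gate.forward()
            return self.gates[-1].output        # the loss at the very end

        def backward(self):
            for gate in reversed(self.gates):   # exact reverse order
                gate.backward()                 # each gate applies chain rule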
And this is just pseudo -417 +463 00:33:13,679 --> 00:33:19,929 code, so this won't run, but basically, roughly the idea is that in the forward pass -418 +464 00:33:19,929 --> 00:33:23,759 we're iterating over all the gates in the circuit that, and they're sorted in topological -419 +465 00:33:23,759 --> 00:33:27,980 order. What that means is that all the inputs must come to every node before -420 +466 00:33:27,980 --> 00:33:32,099 the output can be consumed. So these are just ordered from left to right and we're just -421 +467 00:33:32,099 --> 00:33:35,969 forwarding, we're calling a forward on every single gate along the way so we iterate -422 +468 00:33:35,970 --> 00:33:39,600 over that graph and we just go forward in every single piece and this net object will -423 +469 00:33:39,600 --> 00:33:43,189 just make sure that happens in the proper connectivity pattern. In backward -424 +470 00:33:43,190 --> 00:33:46,620 pass, we're going in the exact reversed order and we're calling backward on -425 +471 00:33:46,620 --> 00:33:49,709 every single gate and these gates will end up communicating gradients through each -426 +472 00:33:49,710 --> 00:33:53,429 other and they all get chained up and computing the analytic gradient at the back. -427 +473 00:33:53,429 --> 00:33:57,860 So really a net object is a very thin wrapper around all these gates, or as we -428 +474 00:33:57,860 --> 00:34:01,879 will see they're called layers, layers or gates. I'm going to use those interchangeably -429 +475 00:34:01,880 --> 00:34:05,700 and they're just very thin wrappers around connectivity structure of these -430 +476 00:34:05,700 --> 00:34:09,369 gates and calling a forward and backward function on them. And then let's look at -431 +477 00:34:09,369 --> 00:34:12,950 a specific example of one of the gates and how this might be implemented. -432 +478 00:34:12,950 --> 00:34:16,759 And this is not just a pseudo code. This is actually more like correct -433 +479 00:34:16,760 --> 00:34:18,730 implementation. Something like this might run -434 +480 00:34:18,730 --> 00:34:23,769 at the end. So let's consider a multiply gate and how it could be implemented. -435 +481 00:34:23,769 --> 00:34:27,690 A multiply gate, in this case, is just a binary multiply, so it receives two inputs -436 +482 00:34:27,690 --> 00:34:33,780 x and y. It computes their multiplication, z = x * y and it returns z. -437 +483 00:34:33,780 --> 00:34:38,950 And all these gates must basically satisfied this API of a forward call and a backward call. How -438 +484 00:34:38,950 --> 00:34:42,529 do you behave in a forward pass, and how do you behave in a backward pass. And -439 +485 00:34:42,530 --> 00:34:46,019 in a forward pass, we just compute whatever. In a backward pass, we eventually end up -440 +486 00:34:46,019 --> 00:34:52,639 learning about what is our gradient on the final loss. So dL/dz is what -441 +487 00:34:52,639 --> 00:34:55,628 we learn. That's represented in this variable dz, and right now -442 +488 00:34:55,628 --> 00:35:00,639 everything here is scalars, so x, y, z are numbers here. dz is also a number -443 +489 00:35:00,639 --> 00:35:03,639 telling the influence on the end of the circuit. -443 +490 00:35:03,639 --> 00:35:07,799 And what this gate is in charge of in this backward pass is -444 +491 00:35:07,800 --> 00:35:11,550 performing the little piece of chain rule. So what we have to compute is how do you -445 +492 00:35:11,550 --> 00:35:14,550 chain this gradient dz into your inputs x and y. 
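Written out as runnable (if simplified, scalar-only) Python rather than pseudocode, the multiply gate just described might look like this hedged sketch:

    class MultiplyGate:
        def forward(self, x, y):
            self.x, self.y = x, y   # cache inputs for the backward pass
            return x * y

        def backward(self, dz):
            # dz is dL/dz, the gradient from above; chain it through the
            # local gradients dz/dx = y and dz/dy = x
            dx = self.y * dz
            dy = self.x * dz
            return dx, dy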
-445
+493
00:35:14,550 --> 00:35:16,550
In other words, we have to compute dx and dy and we have to

-446
+494
00:35:16,550 --> 00:35:19,820
return those in the backward pass, And then the computational graph will make sure

-447
+495
00:35:19,820 --> 00:35:23,720
that these get routed properly to all the other gates. And if there are any

-448
+496
00:35:23,720 --> 00:35:28,820
edges that add up, the computational graph might add all those gradients together.

-449
+497
00:35:30,220 --> 00:35:35,650
Okay, so how would we implement the dx and dy? So for example, what is

-450
+498
00:35:35,650 --> 00:35:40,300
dx in this case? What would it be equal to, the implementation?

-451
+499
00:35:43,300 --> 00:35:49,460
y * dz. Great. And, so y * dz. Additional point to make here by the way,

-452
+500
00:35:49,460 --> 00:35:53,659
note that I've added some lines in the forward pass. We have to remember these values of

-453
+501
00:35:53,659 --> 00:35:57,509
x and y, because we end up using them in the backward pass, so I'm assigning them to a

-454
+502
00:35:57,510 --> 00:36:01,000
'self.' because I need to remember what x y are because I need access to

-455
+503
00:36:01,000 --> 00:36:04,949
them in my backward pass. In general, in backpropagation, when we build these,

-456
+504
00:36:04,949 --> 00:36:09,359
when you actually do forward pass, every single gate must remember the inputs in

-457
+505
00:36:09,360 --> 00:36:13,430
any kind of intermediate calculations that it has performed that it needs to do, that needs

-458
+506
00:36:13,430 --> 00:36:17,069
access to in the backward pass. So basically when we end up running these networks at

-459
+507
00:36:17,070 --> 00:36:20,050
runtime, just always keep in mind that as you're doing this forward pass, a huge

-460
+508
00:36:20,050 --> 00:36:22,890
amount of stuff gets cached in your memory, and that all has to stick around

-461
+509
00:36:22,890 --> 00:36:25,909
because during backpropagation, you might need access to some of those variables.

-462
+510
00:36:25,909 --> 00:36:30,779
And so, your memory ends up ballooning up during the forward pass, and then in backward pass,

-463
+511
00:36:30,780 --> 00:36:33,690
it gets all consumed and we need all those intermediates to actually compute the

-464
+512
00:36:33,690 --> 00:36:36,000
proper backward pass. So that's...

-464
+513
00:36:36,000 --> 00:36:41,089
(Student is asking question)

-464
+514
00:36:41,089 --> 00:36:43,189
Yes, so if you don't, if you know you don't want to do backward pass,

-464
+515
00:36:43,189 --> 00:36:45,289
then you can get rid of many of these things and you

-465
+516
00:36:45,289 --> 00:36:49,710
don't have to compute, you don't need to cache them. So you can save memory for sure.

-466
+517
00:36:49,710 --> 00:36:54,110
But I don't think most implementations actually worry about that. I don't

-467
+518
00:36:54,110 --> 00:36:58,280
think there's a lot of logic that deals with that. Usually we end up remembering it anyway.

-468
+519
00:37:00,280 --> 00:37:05,870
(Student is asking question)

-468
+520
00:37:05,870 --> 00:37:09,369
I see. Yes, so I think if you're in the embedded device for example, and you worry

-469
+521
00:37:09,369 --> 00:37:11,949
really about your memory constraints, this is something that you might take advantage

-470
+522
00:37:11,949 --> 00:37:15,539
If you know that a neural network only has to run in test time, then you might -471 +523 00:37:15,539 --> 00:37:18,750 want to make sure to go into the code to make sure nothing gets cached in case -472 +524 00:37:18,750 --> 00:37:22,030 you want to do a backward pass. Questions. Yes. -472 +525 00:37:22,030 --> 00:37:30,990 (Student is asking question) -472 +526 00:37:30,990 --> 00:37:33,130 You're saying if we remember the local gradients in -473 +527 00:37:33,130 --> 00:37:39,250 the forward pass, then we don't have to remember the other intermediates? -474 +528 00:37:39,250 --> 00:37:45,269 I think that might only be the case in some simple expressions like this one. I'm -475 +529 00:37:45,269 --> 00:37:49,170 not actually sure if that's true in general. But I mean, you're in charge of, remember -476 +530 00:37:49,170 --> 00:37:54,950 whatever you need to, perform the backward pass, and on a gate-by-gate basis. -477 +531 00:37:54,950 --> 00:37:58,509 You can remember whatever you feel like. It has lower footprint and so on. -478 +532 00:37:58,510 --> 00:38:04,420 You can be clever with that. Okay, so just to give you guy's example of what this looks like in -479 +533 00:38:04,420 --> 00:38:08,250 practice, we're going to look at specific examples, say, in Torch. Torch is a deep -480 +534 00:38:08,250 --> 00:38:11,480 learning framework, which we might go into a bit near the end of the class. -481 +535 00:38:11,480 --> 00:38:16,750 Some of you might end up using for your projects. If you go into the Github repo -482 +536 00:38:16,750 --> 00:38:20,320 for Torch, and you'll look at, basically, it's just a giant collection -483 +537 00:38:20,320 --> 00:38:24,580 of these layer objects and these are the gates. Layers, gates, the same thing. So there's -484 +538 00:38:24,580 --> 00:38:27,429 all these layers. That's really what a deep learning framework is. It's just a -485 +539 00:38:27,429 --> 00:38:31,559 whole bunch of layers and a very thin computational graph thing that keeps track -486 +540 00:38:31,559 --> 00:38:36,420 of all the layer connectivity. And so really, the image to have in mind is all these -487 +541 00:38:36,420 --> 00:38:42,639 things are your Lego blocks, and then we're building up these computational graphs out of -488 +542 00:38:42,639 --> 00:38:44,829 your Lego blocks, out of the layers. You're putting them together in various -489 +543 00:38:44,829 --> 00:38:47,549 ways depending on what you want to achieve, so you end building all -490 +544 00:38:47,550 --> 00:38:51,519 kinds of stuff. So that's how you work with neural networks. So every library is -491 +545 00:38:51,519 --> 00:38:54,809 just a whole set of layers that you might want to compute, and every layer is -492 +546 00:38:54,809 --> 00:38:58,840 just implementing a small function piece, and that function piece knows how to do a -493 +547 00:38:58,840 --> 00:39:02,670 forward and it knows how to do a backward. So just to view the specific example, let's -494 +548 00:39:02,670 --> 00:39:10,150 look at the MulConstant layer in Torch. The MulConstant layer performs -495 +549 00:39:10,150 --> 00:39:16,039 just a scaling by a scalar. So it takes some tensor X. 
So this is not a scalar -496 +550 00:39:16,039 --> 00:39:19,300 but it's actually like an array of numbers basically, because when we -497 +551 00:39:19,300 --> 00:39:22,410 actually work with these, we do a lot of vectorized operation so we receive a tensor -498 +552 00:39:22,410 --> 00:39:28,289 which is really just a n-dimensional array, and we scale it by a constant. And you -499 +553 00:39:28,289 --> 00:39:31,980 can see that this layer actually just has 40 lines. There's some initialization stuff. -500 +554 00:39:31,980 --> 00:39:35,940 This is Lua, by the way, if this is looking some foreign to you, but there's -501 +555 00:39:35,940 --> 00:39:40,510 initialization, where you actually pass in that a that you want to use as -502 +556 00:39:40,510 --> 00:39:44,630 your scaling, and then during the forward pass which they call updateOutput -503 +557 00:39:44,630 --> 00:39:49,170 in a forward pass all they do is they just multiply aX and return it. And -504 +558 00:39:49,170 --> 00:39:53,760 in the backward pass which they call updateGradInput, there's an if statement -505 +559 00:39:53,760 --> 00:39:56,510 here but really when you look at these three lines, they're most important. You can -506 +560 00:39:56,510 --> 00:39:59,690 see that all it's doing is it's copying into a variable gradInput -507 +561 00:39:59,690 --> 00:40:03,539 which it needs to compute. That's your gradient that you're passing up. The gradInput is, -508 +562 00:40:03,539 --> 00:40:08,309 you're copying gradOutput. gradOutput is your gradient on final loss. -509 +563 00:40:08,309 --> 00:40:11,989 You're copying that over into gradInput and you're multiplying by the scalar, -510 +564 00:40:11,989 --> 00:40:15,629 which is what you should be doing because your local gradient is just a -511 +565 00:40:15,630 --> 00:40:19,980 and so you take the output you have, you take the gradient from above and you just -512 +566 00:40:19,980 --> 00:40:23,150 scale it by a, which is what these three lines are doing. And that's your gradInput -513 +567 00:40:23,150 --> 00:40:27,849 and that's what you return. So that's one of the hundreds of layers -514 +568 00:40:27,849 --> 00:40:32,110 that are in Torch. We can also look at examples in Caffe. Caffe is also a -515 +569 00:40:32,110 --> 00:40:36,140 deep learning framework specifically for images that you might be working with. Again, if -516 +570 00:40:36,140 --> 00:40:39,690 you go into the layers directory in GitHub, you just see all these layers. All of them implement -517 +571 00:40:39,690 --> 00:40:43,490 the forward backward API. So just to give you an example, there's a sigmoid layer in Caffe. -518 +572 00:40:43,490 --> 00:40:51,269 So sigmoid layer takes a blob. So Caffe likes to call these tensors blobs. So it takes a -519 +573 00:40:51,269 --> 00:40:54,219 blob. It's just an n-dimensional array of numbers, and it passes it -520 +574 00:40:54,219 --> 00:40:57,949 elementwise through a sigmoid function. And so it's computing in a forward pass a -521 +575 00:40:57,949 --> 00:41:04,379 sigmoid, which you can see there. Let me use my pointer. Okay, so there, its calling, so a lot of -522 +576 00:41:04,380 --> 00:41:07,840 this stuff is just boilerplate, getting pointers to all the data, and then we -523 +577 00:41:07,840 --> 00:41:11,730 have a bottom blob, and we're calling a sigmoid function on the bottom and -524 +578 00:41:11,730 --> 00:41:14,829 that's just a sigmoid function right there. So that's what we compute. 
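Both directions of such an elementwise layer fit in a few lines; the following is a hedged NumPy sketch of the behavior being described, not the actual Caffe C++:

    import numpy as np

    class SigmoidLayer:
        def forward(self, bottom):
            # elementwise sigmoid of the input blob
            self.top = 1.0 / (1.0 + np.exp(-bottom))
            return self.top

        def backward(self, top_diff):
            # chain rule: multiply the gradient from above by the
            # local sigmoid gradient s * (1 - s)
            s = self.top
            return top_diff * s * (1.0 - s)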
And in the
-525
+579
00:41:14,829 --> 00:41:18,719
backward pass, some boilerplate stuff, but really what's important is we need to
-526
+580
00:41:18,719 --> 00:41:23,369
compute the gradient via the chain rule here, so that's what you see in this
-527
+581
00:41:23,369 --> 00:41:26,150
line. That's where the magic happens, where we take the diff,
-528
+582
00:41:26,150 --> 00:41:32,048
so they call the gradients diffs. And you compute: the bottom diff is the top diff
-529
+583
00:41:32,048 --> 00:41:36,869
times this piece, which is really the, that's the local gradient, so this is
-530
+584
00:41:36,869 --> 00:41:41,960
the chain rule happening right here through that multiplication. So, and that's it. So every
-531
+585
00:41:41,960 --> 00:41:45,179
single layer is just a forward backward API, and then you have a computational graph
-532
+586
00:41:45,179 --> 00:41:52,288
on top or a net object that keeps track of all the connectivity. Any questions about some of
-533
+587
00:41:52,289 --> 00:42:00,849
these implementations and so on? Go ahead.
-533
+588
00:41:54,000 --> 00:42:00,849
(Student is asking question)
-534
+589
00:42:00,849 --> 00:42:04,759
Yes, thank you. So the question is, do we have to go through forward and backward for every update?
-534
+590
00:42:04,759 --> 00:42:09,259
The answer is yes, because when you want to do an update, you need the gradient,
-534
+591
00:42:09,259 --> 00:42:11,849
and so you need to do forward on your sampled minibatch.
-534
+592
00:42:11,849 --> 00:42:15,559
You do a forward. Right away you do a backward. And now you have your analytic gradient.
-535
+593
00:42:15,559 --> 00:42:19,369
And now I can do an update, where I take my analytic gradient and I change my weights a tiny
-536
+594
00:42:19,369 --> 00:42:24,960
bit in the direction, the negative direction, of your gradient. So forward computes
-537
+595
00:42:24,960 --> 00:42:28,858
the loss, backward computes your gradient, and then the update uses the gradient to
-538
+596
00:42:28,858 --> 00:42:33,000
increment your weights a bit. So that's what keeps happening in the loop. When you train a neural
-539
+597
00:42:33,000 --> 00:42:36,318
network, that's all that's happening. Forward, backward, update. Forward, backward, update.
-540
+598
00:42:36,318 --> 00:42:38,808
We'll see that in a bit. Go ahead.
-540
+599
00:42:38,808 --> 00:42:43,808
(Student is asking question)
-540
+600
00:42:44,808 --> 00:42:47,008
You're asking about a for-loop.
-540
+601
00:42:49,208 --> 00:42:51,808
Oh, is there a for-loop here? I didn't even notice. Okay.
-541
+602
00:42:51,809 --> 00:42:57,160
Yeah, they have a for-loop. Yes, so you'd like this to be vectorized and that actually...
-542
+603
00:42:57,160 --> 00:43:03,679
Because this is C++, so I think they just do it. Go for it.
-543
+604
00:43:06,679 --> 00:43:10,899
Yeah, so this is a CPU implementation by the way. I should mention that this is a
-544
+605
00:43:10,900 --> 00:43:14,599
CPU implementation of a sigmoid layer. There's a second file that implements the
-545
+606
00:43:14,599 --> 00:43:19,420
sigmoid layer on GPU and that's CUDA code. And so that's a separate file. It
-546
+607
00:43:19,420 --> 00:43:22,280
would be sigmoid.cu or something like that. I'm not showing you that.
-547
+608
00:43:23,580 --> 00:43:30,349
Any questions? Okay, great. So one point I'd like to make is, we'll of course be working with
-548
+609
00:43:30,349 --> 00:43:33,519
vectors, so these things flowing along our graphs are not just scalars.
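For reference, a minimal numpy counterpart of that Caffe-style sigmoid gate (names are illustrative, not Caffe's actual C++ API): the forward pass caches the output, and the backward pass multiplies the incoming gradient by the local gradient y(1 − y), which is the "magic" chain-rule line described above.

```python
import numpy as np

class SigmoidGate(object):
    def forward(self, x):
        # elementwise sigmoid; cache the output ("top" blob) for backward
        self.y = 1.0 / (1.0 + np.exp(-x))
        return self.y

    def backward(self, top_diff):
        # chain rule: bottom_diff = top_diff * dy/dx, where the local
        # gradient dy/dx = y * (1 - y) in terms of the cached output
        return top_diff * self.y * (1.0 - self.y)
```

Training then just alternates forward (loss), backward (gradient), and a small parameter update in the negative gradient direction.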
They're going
-549
+610
00:43:33,519 --> 00:43:38,449
to be entire vectors. And so nothing changes. The only thing that is different
-550
+611
00:43:38,449 --> 00:43:43,529
now, since these are vectors, x, y, and z are vectors, is that this local gradient,
-551
+612
00:43:43,530 --> 00:43:47,530
which before used to be just a scalar, now, in general, for general
-552
+613
00:43:47,530 --> 00:43:51,290
expressions, they're full Jacobian matrices. And so the Jacobian matrix is this
-553
+614
00:43:51,290 --> 00:43:54,670
two-dimensional matrix that basically tells you what the influence of every
-554
+615
00:43:54,670 --> 00:43:58,010
single element in x is on every single element of z,
-555
+616
00:43:58,010 --> 00:44:01,880
and that's what the Jacobian matrix stores, and the gradient is the same
-556
+617
00:44:01,880 --> 00:44:10,960
expression as before, but now, say here, dz/dx is a vector and dL/dz is... sorry.
-557
+618
00:44:11,560 --> 00:44:16,079
dL/dz is a vector and dz/dx is an entire Jacobian matrix, so you end up with
-558
+619
00:44:16,079 --> 00:44:20,130
an entire matrix-vector multiply to actually chain the gradient backwards.
-558
+620
00:44:20,130 --> 00:44:29,130
(Student is asking question)
-559
+621
00:44:31,530 --> 00:44:36,380
No. So I'll come back to this point in a bit. You never actually end up forming the full
-560
+622
00:44:36,380 --> 00:44:40,119
Jacobian. You'll never actually do this matrix multiply most of the time. This is
-561
+623
00:44:40,119 --> 00:44:43,730
just a general way of looking at, you know, an arbitrary function, and I need to
-562
+624
00:44:43,730 --> 00:44:46,260
keep track of this. And I think that these two are actually out of order,
-563
+625
00:44:46,260 --> 00:44:49,569
because dz/dx is the Jacobian, which should be on the left side, so
-564
+626
00:44:49,569 --> 00:44:53,859
I think that's a mistake on the slide, because this should be a matrix-vector multiply.
-565
+627
00:44:53,859 --> 00:44:57,618
So I'll show you why you don't actually ever need to form those Jacobians. So let's
-566
+628
00:44:57,619 --> 00:45:02,119
work with a specific example that is relatively common in neural networks.
-567
+629
00:45:02,119 --> 00:45:06,869
Suppose we have this nonlinearity max(0, x). So really what this operation
-568
+630
00:45:06,869 --> 00:45:11,068
is doing is it's receiving a vector, say 4096 numbers, which is a typical thing
-569
+631
00:45:11,068 --> 00:45:12,308
you might want to do.
-570
+632
00:45:12,309 --> 00:45:14,630
4096 real-valued numbers come in
-571
+633
00:45:14,630 --> 00:45:19,630
and you're computing an element-wise thresholding at 0, so anything that is lower
-572
+634
00:45:19,630 --> 00:45:24,680
than 0 gets clamped to 0, and that's your function that you're computing. And so the output
-573
+635
00:45:24,680 --> 00:45:28,588
vector is of the same dimension. So the question here I'd like to ask is,
-574
+636
00:45:28,588 --> 00:45:32,068
what is the size of the Jacobian matrix for this layer?
-574
+637
00:45:37,588 --> 00:45:40,268
4096 by 4096. In principle,
-575
+638
00:45:40,268 --> 00:45:45,018
every single number in here could have influenced every single number in there.
-576
+639
00:45:45,018 --> 00:45:49,459
But that's not the case necessarily, right? So the second question is, so this
-577
+640
00:45:49,460 --> 00:45:52,949
is a huge matrix, 16 million numbers, but why would you never form it?
-578
+641
00:45:52,949 --> 00:45:54,719
What does the Jacobian actually look like?
-578
+642
00:45:54,719 --> 00:45:59,019
(Student is asking question)
-578
+643
00:45:59,019 --> 00:46:02,719
No, the Jacobian will always be a matrix, because every one of these 4096
-579
+644
00:46:02,719 --> 00:46:09,949
could have influenced every... It is, so the Jacobian is still a giant 4096 by 4096
-580
+645
00:46:09,949 --> 00:46:14,558
matrix, but it has special structure, right? And what is that special structure?
-580
+646
00:46:14,558 --> 00:46:17,558
(Student is answering)
-581
+647
00:46:17,559 --> 00:46:20,420
Yeah, so this Jacobian is huge.
-581
+648
00:46:21,259 --> 00:46:27,420
So it's a 4096 by 4096 matrix, but there are only elements on the diagonal
-582
+649
00:46:27,420 --> 00:46:33,700
because this is an element-wise operation, and moreover, they're not all 1's:
-583
+650
00:46:33,700 --> 00:46:38,129
whichever element was less than 0 got clamped to 0, so some of these 1's
-584
+651
00:46:38,130 --> 00:46:42,798
are actually zeros, in whichever dimensions had a lower-than-zero value during the
-585
+652
00:46:42,798 --> 00:46:47,429
forward pass. And so the Jacobian would just be almost an identity matrix, but
-586
+653
00:46:47,429 --> 00:46:52,250
some of them are actually zero. So you never actually would want to form the
-587
+654
00:46:52,250 --> 00:46:55,429
full Jacobian, because that's silly, and so you never actually want to carry out
-588
+655
00:46:55,429 --> 00:47:00,808
this operation as a matrix-vector multiply, because of this special structure
-589
+656
00:47:00,809 --> 00:47:04,150
that we want to take advantage of. And so in particular, the gradient, the backward
-590
+657
00:47:04,150 --> 00:47:09,269
pass for this operation is very very easy, because you just want to look at
-591
+658
00:47:09,269 --> 00:47:14,159
all the dimensions where your input was less than zero and you want to kill the
-592
+659
00:47:14,159 --> 00:47:17,210
gradient in those dimensions. You want to set the gradient to 0 in those dimensions.
-593
+660
00:47:17,210 --> 00:47:21,650
So you take the grad output here, and whichever numbers were less than zero,
-594
+661
00:47:21,650 --> 00:47:25,910
just set them to 0. Set those gradients to 0 and then you continue the backward pass.
-595
+662
00:47:26,209 --> 00:47:30,209
So very simple operations in the end in terms of efficiency.
-595
+663
00:47:30,209 --> 00:47:36,809
(Student is asking question)
-595
+664
00:47:36,809 --> 00:47:37,300
That's right.
-595
+665
00:47:37,300 --> 00:47:45,930
(Student is asking question)
-595
+666
00:47:45,930 --> 00:47:51,830
So the question is, is the communication between the gates always just vectors? That's right.
-596
+667
00:47:51,830 --> 00:47:55,940
So this Jacobian, if you wanted to, you can form that, but that's internal to you inside the gate.
-597
+668
00:47:55,940 --> 00:47:59,670
And you can use that to do backprop, but what's going back to other gates, they
-598
+669
00:47:59,670 --> 00:48:02,870
only care about the gradient vector.
-598
+670
00:48:02,870 --> 00:48:09,070
(Student is asking question)
-598
+671
00:48:09,070 --> 00:48:12,070
Yes, so the question is, unless you end up having multiple outputs,
-598
+672
00:48:12,070 --> 00:48:15,070
because then for each output, we have to do this, so yeah.
-598
+673
00:48:15,070 --> 00:48:17,380
So we'll never actually run into that case,
-599
+674
00:48:17,380 --> 00:48:20,430
because we almost always have a single output, scalar value, at the end,
-600
+675
00:48:20,430 --> 00:48:24,129
because we're interested in loss functions.
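Since the Jacobian of this elementwise max is diagonal, the backward pass can be written as a simple mask rather than a 4096-by-4096 matrix-vector multiply. A minimal sketch, with the forward-pass input cached:

```python
import numpy as np

def relu_forward(x):
    # elementwise threshold at zero; keep x around for the backward pass
    return np.maximum(0, x)

def relu_backward(x, grad_output):
    # the Jacobian is diagonal: 1 where x > 0 and 0 elsewhere, so instead
    # of forming it we just kill the gradient in the dimensions where the
    # input was negative during the forward pass
    grad_input = grad_output.copy()
    grad_input[x < 0] = 0
    return grad_input
```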
So we just have a single
-601
+676
00:48:24,130 --> 00:48:27,318
number at the end that we're interested in computing gradients with respect to. If we had
-602
+677
00:48:27,318 --> 00:48:30,949
multiple outputs, then we have to keep track of all of those as well,
-603
+678
00:48:30,949 --> 00:48:35,769
in parallel, when we do the backpropagation. But we just have a scalar-valued loss
-604
+679
00:48:35,769 --> 00:48:38,580
function, so we don't have to worry about that.
-604
+680
00:48:40,269 --> 00:48:46,080
Okay, makes sense? So I want to also make the point that actually
-605
+681
00:48:46,080 --> 00:48:51,230
4096 dimensions is not even crazy. Usually we use minibatches, so, say, a minibatch of
-606
+682
00:48:51,230 --> 00:48:54,929
100 elements going through at the same time, and then you end up with 100
-607
+683
00:48:54,929 --> 00:48:59,038
4096-dimensional vectors that are all coming in parallel, but all the examples
-608
+684
00:48:59,039 --> 00:49:02,539
in the minibatch are processed independently of each other in parallel, and so this Jacobian matrix
-609
+685
00:49:02,539 --> 00:49:08,869
really ends up being enormous, 409,600 by 409,600. So huge that you never form these,
-610
+686
00:49:08,869 --> 00:49:14,160
basically. And you take care to actually take advantage of the sparsity
-611
+687
00:49:14,160 --> 00:49:17,538
structure in the Jacobian and you hand-code operations, so you don't actually write
-612
+688
00:49:17,539 --> 00:49:25,819
a fully generalized chain rule inside any gate implementation. Okay, cool. So I'd like
-613
+689
00:49:25,819 --> 00:49:30,788
to point out that in your assignment, you'll be writing SVMs and Softmax and so on, and I just kind
-614
+690
00:49:30,789 --> 00:49:33,680
of would like to give you a hint on the design of how you actually should approach this
-615
+691
00:49:33,680 --> 00:49:39,769
problem. What you should do is just think about it as backpropagation, even if
-616
+692
00:49:39,769 --> 00:49:44,108
you're doing this for linear classification optimization. So roughly, your structure
-617
+693
00:49:44,108 --> 00:49:50,048
should look something like this where... again, stage your computation in units that
-618
+694
00:49:50,048 --> 00:49:53,960
you know the local gradient of, and then do backprop when you actually evaluate these
-619
+695
00:49:53,960 --> 00:49:57,679
gradients in your assignment. So at the top, your code will look something like
-620
+696
00:49:57,679 --> 00:49:59,679
this, where we don't have any graph structure because you're doing
-621
+697
00:49:59,679 --> 00:50:04,038
everything inline. So no crazy edges or anything like that that you have to do.
-622
+698
00:50:04,039 --> 00:50:07,200
You will do that in the second assignment. You'll actually come up with a graph
-623
+699
00:50:07,200 --> 00:50:10,509
object and you'll implement your layers. But in the first assignment, you're just doing it inline,
-624
+700
00:50:10,510 --> 00:50:15,579
just a straight-up vanilla setup. And so compute your scores based on W and X.
-625
+701
00:50:15,579 --> 00:50:21,798
Compute these margins, which are max of 0 and the score differences, compute the
-626
+702
00:50:21,798 --> 00:50:26,239
loss, and then do backprop. And in particular, I would really advise you to
-627
+703
00:50:26,239 --> 00:50:30,949
have these intermediate scores that you create. It's a matrix. And then compute the
-628
+704
00:50:30,949 --> 00:50:34,769
gradient on scores before you compute the gradient on your weights.
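A sketch of that staging advice for the multiclass SVM loss (shapes and names are assumptions for illustration, not the assignment's exact interface): forward goes scores, margins, loss; backward computes the gradient on the scores first and only then chains into W.

```python
import numpy as np

def svm_loss_staged(W, X, y, delta=1.0):
    # forward pass, staged: scores -> margins -> loss
    N = X.shape[0]
    scores = X.dot(W)                                  # (N, C) class scores
    correct = scores[np.arange(N), y][:, None]         # score of true class
    margins = np.maximum(0, scores - correct + delta)  # hinge margins
    margins[np.arange(N), y] = 0                       # no margin on true class
    loss = margins.sum() / N

    # backward pass, staged: dscores first, then chain rule into W
    dscores = (margins > 0).astype(float)              # 1 where margin is active
    dscores[np.arange(N), y] = -dscores.sum(axis=1)    # true-class column
    dscores /= N
    dW = X.T.dot(dscores)                              # chain through the scores
    return loss, dW
```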
And so
-629
+705
00:50:34,769 --> 00:50:40,179
chain, use the chain rule here. Otherwise, you might be tempted to try to just derive the
-630
+706
00:50:40,179 --> 00:50:43,798
gradient on W in one expression and then implement that, and that's an unhealthy way of
-631
+707
00:50:43,798 --> 00:50:47,349
approaching the problem. So stage your computation and do backprop through these
-632
+708
00:50:47,349 --> 00:50:49,900
scores, and that will help you out.
-632
+709
00:50:51,500 --> 00:50:52,800
Okay, cool.
-633
+710
00:50:54,300 --> 00:50:59,570
So, let's see. Summary so far. Neural networks are hopelessly large,
-633
+711
00:50:59,570 --> 00:51:01,570
so we end up with these computational structures and these
-634
+712
00:51:01,570 --> 00:51:05,470
intermediate nodes, a forward backward API for both the nodes and also for the
-635
+713
00:51:05,470 --> 00:51:08,869
graph structure. And the graph structure is usually a very thin wrapper around all these
-636
+714
00:51:08,869 --> 00:51:12,059
layers, and it handles all the communication between them. And this
-637
+715
00:51:12,059 --> 00:51:16,380
communication is always along, like, vectors being passed around. In practice,
-638
+716
00:51:16,380 --> 00:51:19,289
when we write these implementations, what we're passing around are these
-639
+717
00:51:19,289 --> 00:51:23,079
n-dimensional tensors. Really what that means is just an n-dimensional array.
-640
+718
00:51:23,079 --> 00:51:28,059
So like a numpy array. That's what goes between the gates, and then internally, every single
-641
+719
00:51:28,059 --> 00:51:33,529
gate knows what to do in the forward and the backward pass. Okay, so at this point, I'm
-642
+720
00:51:33,530 --> 00:51:37,690
going to end with backpropagation and I'm going to go into neural networks. So,
-643
+721
00:51:37,690 --> 00:51:40,390
any questions before we move on from backprop? Go ahead.
-643
+722
00:51:40,390 --> 00:51:51,860
(Student is asking a question)
-644
+723
00:51:51,860 --> 00:51:55,530
The summation inside Li = blah? Yes, there is a sum there.
-644
+724
00:51:55,530 --> 00:52:00,130
So you want that to be a vectorized operation that you... Yeah, so basically, the challenge in your
-644
+725
00:52:00,130 --> 00:52:03,130
assignment almost is, how do you make sure that you do all
-645
+726
00:52:03,130 --> 00:52:06,750
this efficiently and nicely with matrix-vector operations in numpy, so that's going to be some of the
-646
+727
00:52:06,750 --> 00:52:09,750
brain teaser stuff that you guys are going to have to do.
-646
+728
00:52:09,750 --> 00:52:14,250
(Student is asking a question)
-646
+729
00:52:14,250 --> 00:52:20,030
Yes, so it's up to you what you want your gates to be like, and what you want them to be.
-647
+730
00:52:20,030 --> 00:52:22,490
(Student is asking a question)
-647
+731
00:52:22,490 --> 00:52:24,490
Yeah, I don't think you'd want to do that.
-648
+732
00:52:25,490 --> 00:52:30,739
Yeah, I'm not sure. Maybe that works. I don't know. But it's up to you to design this and to
-649
+733
00:52:30,739 --> 00:52:38,609
backprop through. Yeah, so that's fun. Okay. So we're going to go to neural networks. This is
-650
+734
00:52:38,610 --> 00:52:44,010
exactly what they look like. So you'll be implementing these, and this is just what happens
-651
+735
00:52:44,010 --> 00:52:46,770
when you search on Google Images for neural networks. This is, I think, the first
-652
+736
00:52:46,770 --> 00:52:51,590
result or something like that. So let's look at neural networks.
And before we dive
-653
+737
00:52:51,590 --> 00:52:55,100
into neural networks actually, I'd like to do it first without all the brain
-654
+738
00:52:55,100 --> 00:52:58,329
stuff. So forget that they're neural. Forget that they have any relation whatsoever
-655
+739
00:52:58,329 --> 00:53:03,170
to a brain. They don't, but if you thought that they did, forget that for now. Let's
-656
+740
00:53:03,170 --> 00:53:07,309
just look at score functions. Well, before, we saw that f=Wx is what
-657
+741
00:53:07,309 --> 00:53:11,079
we've been working with so far. But now, as I said, we're going to start to make
-658
+742
00:53:11,079 --> 00:53:14,590
that f more complex. And so if you wanted to use a neural network, then you're
-659
+743
00:53:14,590 --> 00:53:20,309
going to change that equation to this. So this is a two-layer neural network, and
-660
+744
00:53:20,309 --> 00:53:24,820
that's what it looks like, and it's just a more complex mathematical expression of x.
-661
+745
00:53:24,820 --> 00:53:30,230
And so what's happening here is, you receive your input x, and you
-662
+746
00:53:30,230 --> 00:53:32,369
multiply it by a matrix, just like we did before.
-663
+747
00:53:32,369 --> 00:53:36,619
Now, what's coming next, what comes next is a nonlinearity or activation function,
-664
+748
00:53:36,619 --> 00:53:39,710
and we're going to go into several choices that you might make for these. In this
-665
+749
00:53:39,710 --> 00:53:43,800
case, I'm using the thresholding at 0 as an activation function. So basically, we're
-666
+750
00:53:43,800 --> 00:53:47,780
doing a matrix multiply, we threshold everything negative to 0, and then we do
-667
+751
00:53:47,780 --> 00:53:52,240
one more matrix multiply, and that gives us our scores. And so if I was to draw this,
-668
+752
00:53:52,240 --> 00:53:58,169
say in the case of CIFAR-10, with 3072 numbers going in, those are the pixel values,
-669
+753
00:53:58,170 --> 00:54:02,110
and before, we just went through one single matrix multiply to scores. We went right away
-671
+754
00:54:02,110 --> 00:54:05,899
to 10 numbers. But now, we get to go through this intermediate representation
-672
+755
00:54:05,900 --> 00:54:13,019
of hidden state. We'll call them hidden layers. So a hidden vector h of a hundred numbers, say,
-673
+756
00:54:13,019 --> 00:54:16,849
or whatever you want the size of your neural network to be. So this is a hyperparameter,
-674
+757
00:54:16,849 --> 00:54:21,109
that's, say, a hundred, and we go through this intermediate representation. So a matrix
-675
+758
00:54:21,109 --> 00:54:24,319
multiply gives us a hundred numbers, threshold at zero, and
-676
+759
00:54:24,320 --> 00:54:28,559
then one more matrix multiply to get the scores. And since we have more numbers, we have
-677
+760
00:54:28,559 --> 00:54:33,820
more wiggle to do more interesting things. So one particular example
-678
+761
00:54:33,820 --> 00:54:36,330
of something interesting that you might think a neural network
-679
+762
00:54:36,330 --> 00:54:40,210
could do, is going back to this example of interpreting linear
-680
+763
00:54:40,210 --> 00:54:45,690
classifiers on CIFAR-10, and we saw that the car class has this red car that tries to
-681
+764
00:54:45,690 --> 00:54:51,280
merge all the modes of different cars facing in different directions.
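The two-layer score function just described, written out in numpy. Sizes follow the CIFAR-10 example in the lecture (3072 inputs, a hidden size of 100, 10 class scores); the initialization scale is an assumption for illustration.

```python
import numpy as np

W1 = 0.01 * np.random.randn(100, 3072)  # first layer weights
W2 = 0.01 * np.random.randn(10, 100)    # second layer weights

def two_layer_scores(x):
    h = np.maximum(0, W1.dot(x))  # hidden layer: matrix multiply, threshold at 0
    return W2.dot(h)              # one more matrix multiply gives the 10 scores

x = np.random.randn(3072)         # a stand-in input image, flattened
s = two_layer_scores(x)           # 10 class scores
```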
And so in
-682
+765
00:54:51,280 --> 00:54:57,980
this case, one single layer, one single linear classifier, had to go across all
-683
+766
00:54:57,980 --> 00:55:02,250
those modes, and we couldn't deal with, for example, cars of different colors. That
-684
+767
00:55:02,250 --> 00:55:05,190
wasn't very natural to do. But now we have a hundred numbers in this
-685
+768
00:55:05,190 --> 00:55:08,289
intermediate, and so you might imagine, for example, that one of those numbers
-686
+769
00:55:08,289 --> 00:55:11,539
could be just picking up on a red car facing forward. It's just classifying:
-687
+770
00:55:11,539 --> 00:55:14,750
is there a red car facing forward? Another one could be a red car
-688
+771
00:55:14,750 --> 00:55:16,280
facing slightly to the left,
-689
+772
00:55:16,280 --> 00:55:20,650
a red car facing slightly to the right, and those elements of h would only become
-690
+773
00:55:20,650 --> 00:55:24,358
positive if they find that thing in the image,
-691
+774
00:55:24,358 --> 00:55:28,029
otherwise, they stay at zero. And so another element of h might look for green cars
-692
+775
00:55:28,030 --> 00:55:31,180
or yellow cars or whatever else in different orientations. So now we can
-693
+776
00:55:31,180 --> 00:55:35,669
have a template for all these different modes. And so these neurons turn on or
-694
+777
00:55:35,670 --> 00:55:41,869
off if they find the thing they're looking for, a car of some specific type, and then
-695
+778
00:55:41,869 --> 00:55:46,660
this W2 matrix can sum across all those little car templates. So now we
-696
+779
00:55:46,660 --> 00:55:50,719
have, like, say twenty car templates of what cars could look like, and now, to compute
-697
+780
00:55:50,719 --> 00:55:54,149
the score of the car classifier, there's an additional matrix multiply, so we have a choice
-698
+781
00:55:54,150 --> 00:55:58,700
of doing a weighted sum over them. And so if any one of them turns on, then through my
-699
+782
00:55:58,700 --> 00:56:02,269
weighted sum, with positive weights presumably, I would be adding up and
-700
+783
00:56:02,269 --> 00:56:07,358
getting a higher score. And so now I can have this multimodal car classifier
-701
+784
00:56:07,358 --> 00:56:13,098
through this additional hidden layer in between there. So that's a handwavy reason for why
-702
+785
00:56:13,099 --> 00:56:14,720
these would do something more interesting.
-703
+786
00:56:15,520 --> 00:56:16,509
Was there a question? Yeah.
-703
+787
00:56:16,509 --> 00:56:26,350
(Student is asking a question)
-703
+788
00:56:26,350 --> 00:56:32,509
So the question is, if h had less than 10 units, would it be inferior to a linear classifier? I think that's...
-703
+789
00:56:33,200 --> 00:56:39,509
that's actually not obvious to me. It's an interesting question. I think... you could make that work.
-703
+790
00:56:39,509 --> 00:56:40,509
I think you could make it work.
-703
+791
00:56:43,509 --> 00:56:47,509
Yeah, I think that would actually work. Someone should try that for extra points in the assignment.
-703
+792
00:56:47,509 --> 00:56:49,509
So you'll have a section on the assignment to do something fun or extra,
-704
+793
00:56:49,510 --> 00:56:53,220
and so you get to come up with whatever you think is an interesting experiment and we'll
-705
+794
00:56:53,220 --> 00:56:56,699
give you some bonus points. So that's a good candidate for something you might
-706
+795
00:56:56,699 --> 00:56:59,659
want to investigate, whether that works or not.
-707
+796
00:56:59,659 --> 00:57:00,929
Any other questions? Go ahead.
-707
+797
00:57:01,329 --> 00:57:11,329
(Student is asking a question)
-708
+798
00:57:11,329 --> 00:57:13,589
Sorry, I don't think I understood the question.
-708
+799
00:57:13,589 --> 00:57:26,989
(Student is asking question)
-708
+800
00:57:26,989 --> 00:57:28,000
I see.
-708
+801
00:57:28,900 --> 00:57:32,389
So you're really asking about the layout of the h vector and how it gets allocated over
-708
+802
00:57:32,389 --> 00:57:34,989
the different modes of the dataset, and I don't have a good
-709
+803
00:57:34,989 --> 00:57:39,500
answer for that. Since we're going to train this fully with backpropagation,
-711
+804
00:57:39,500 --> 00:57:42,690
I think it's naive to think that there will be an exact template for, say, a
-712
+805
00:57:42,690 --> 00:57:46,539
red car facing left. You probably won't find that. You'll find
-713
+806
00:57:46,539 --> 00:57:50,690
these kinds of mixes, and weird things, intermediates, and so on.
-714
+807
00:57:50,690 --> 00:57:54,390
So this neural network will come in and it will optimally find a way to carve up your data
-714
+808
00:57:54,390 --> 00:57:55,630
with its linear boundaries,
-715
+809
00:57:55,630 --> 00:57:59,809
and these weights will all get adjusted just to make it come out right. So it's
-716
+810
00:57:59,809 --> 00:58:03,809
really hard to say. It will all become tangled up, I think. Go ahead.
-716
+811
00:58:03,809 --> 00:58:09,500
(Student is asking question)
-716
+812
00:58:09,500 --> 00:58:10,579
That's right. So that's the
-717
+813
00:58:10,579 --> 00:58:14,579
size of a hidden layer, and it's a hyperparameter. We get to choose that. So I chose
-718
+814
00:58:14,579 --> 00:58:18,719
a hundred. Usually that's going to be, usually, you'll see that with neural networks. We'll go into
-719
+815
00:58:18,719 --> 00:58:22,739
this a lot, but usually you want them to be as big as possible, as big as fits in your
-720
+816
00:58:22,739 --> 00:58:27,659
computer and so on, so more is better. We'll go into that. Go ahead.
-720
+817
00:58:27,659 --> 00:58:33,659
(Student is asking question)
-721
+818
00:58:33,659 --> 00:58:38,639
So you're asking, do we always take max of 0 and h, and we don't. I'll get to it, it's like five slides
-722
+819
00:58:38,639 --> 00:58:44,359
away. So I'm going to go into neural networks. I guess maybe I should preemptively just go
-723
+820
00:58:44,360 --> 00:58:48,390
ahead and take questions near the end. If you wanted this to be a three-layer
-724
+821
00:58:48,390 --> 00:58:50,940
neural network, by the way, there's a very simple way in which we just extend
-725
+822
00:58:50,940 --> 00:58:53,710
this, right? So we just keep continuing the same pattern, where we have all these
-726
+823
00:58:53,710 --> 00:58:57,159
intermediate hidden nodes, and then we can keep making our network deeper and
-727
+824
00:58:57,159 --> 00:58:59,750
deeper, and you can compute more interesting functions, because you're
-728
+825
00:58:59,750 --> 00:59:03,369
giving yourself more time to compute something interesting, in a handwavy way.
-729
+826
00:59:03,369 --> 00:59:09,559
Now, one other slide I wanted to flash is that training a two-layer neural network,
-730
+827
00:59:09,559 --> 00:59:12,690
I mean, it's actually quite simple when it comes down to it.
So this is a slide
-731
+828
00:59:12,690 --> 00:59:17,349
borrowed from a blog post I found, and basically roughly eleven lines of
-732
+829
00:59:17,349 --> 00:59:21,980
Python suffice to implement a two-layer neural network, doing binary classification on,
-733
+830
00:59:21,980 --> 00:59:27,570
what is this, two-dimensional data. So you have a two-dimensional data matrix X. You
-734
+831
00:59:27,570 --> 00:59:32,580
have, sorry, it's three-dimensional. And you have binary labels for y, and then
-735
+832
00:59:32,580 --> 00:59:36,579
syn0 and syn1 are your weight matrices, weight 1 and weight 2. And so I think they're
-736
+833
00:59:36,579 --> 00:59:41,150
called syn for synapse, but I'm not sure. And then this is the optimization loop here,
-737
+834
00:59:41,150 --> 00:59:46,269
and what you're seeing here, I should use my pointer for more, what you're
-738
+835
00:59:46,269 --> 00:59:50,139
seeing here is we're computing the first layer activations, but this is using
-739
+836
00:59:50,139 --> 00:59:54,069
a sigmoid nonlinearity, not a max of 0 and X. And we'll go into a bit of what
-740
+837
00:59:54,070 --> 00:59:58,650
these nonlinearities might be. So sigmoid is one form. It's computing the first layer,
-741
+838
00:59:58,650 --> 01:00:03,059
and then it's computing the second layer, and then it's computing here, right away, the backward
-742
+839
01:00:03,059 --> 01:00:08,130
pass. So this is the l2_delta. It's the gradient on l2, the gradient on l1, and the
-743
+840
01:00:08,130 --> 01:00:13,390
gradient, and this is an update here. So right away he's doing an update at
-744
+841
01:00:13,390 --> 01:00:17,150
the same time as the final piece of backprop here, where he's formulating the
-745
+842
01:00:17,150 --> 01:00:22,519
gradient on the W, and right away he's adding to the gradient here. And so really,
-746
+843
01:00:22,519 --> 01:00:24,630
eleven lines suffice to train a neural network to do binary
-747
+844
01:00:24,630 --> 01:00:29,710
classification. The reason that this loss might look slightly different from what
-748
+845
01:00:29,710 --> 01:00:33,500
you've seen right now, is that this is a logistic regression loss. So you saw a
-749
+846
01:00:33,500 --> 01:00:37,159
generalization of it, the softmax classifier, to multiple classes. But
-750
+847
01:00:37,159 --> 01:00:40,149
this is basically a logistic loss being updated here, and you can go through this
-751
+848
01:00:40,150 --> 01:00:43,500
in more detail by yourself. But the logistic regression loss looks slightly
-752
+849
01:00:43,500 --> 01:00:50,539
different, and that's what's inside there. But otherwise, yes, so this is not too
-753
+850
01:00:50,539 --> 01:00:55,320
crazy of a computation, and very few lines of code suffice to actually train
-754
+851
01:00:55,320 --> 01:00:58,900
these networks. Everything else is fluff. How do you make it efficient, how do
-755
+852
01:00:58,900 --> 01:01:03,019
you... there's a cross-validation pipeline that you need to have and all this stuff
-756
+853
01:01:03,019 --> 01:01:07,050
that goes on top to actually give you these large code bases, but the kernel of it is
-757
+854
01:01:07,050 --> 01:01:11,019
quite simple. We compute these layers, do a forward pass, a backward pass, an
-758
+855
01:01:11,019 --> 01:01:14,540
update, and we keep iterating this over and over again. Go ahead.
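For reference, here is essentially that snippet (it originates from a short blog post by iamtrask, reproduced here from memory, so treat the details as approximate): X is a tiny 4-by-3 binary data matrix, y holds the binary labels, and the weight updates are folded right into the backward pass, as described above.

```python
import numpy as np

X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])  # toy data, 4 examples
y = np.array([[0,1,1,0]]).T                      # binary labels
syn0 = 2*np.random.random((3,4)) - 1             # "synapse" weight matrices
syn1 = 2*np.random.random((4,1)) - 1
for j in range(60000):
    l1 = 1/(1+np.exp(-(X.dot(syn0))))            # first layer, sigmoid
    l2 = 1/(1+np.exp(-(l1.dot(syn1))))           # second layer, sigmoid
    l2_delta = (y - l2) * (l2*(1-l2))            # gradient on l2
    l1_delta = l2_delta.dot(syn1.T) * (l1*(1-l1))  # backprop into l1
    syn1 += l1.T.dot(l2_delta)                   # update folded into backprop
    syn0 += X.T.dot(l1_delta)
```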
-758
+856
01:01:14,540 --> 01:01:16,240
(Student is asking a question)
-758
+857
01:01:16,240 --> 01:01:18,840
The random function is creating your first initial random
-759
+858
01:01:18,840 --> 01:01:24,170
weights, so you need to start somewhere, so you generate a random W.
-760
+859
01:01:24,170 --> 01:01:29,150
Okay. Now I wanted to mention that you'll also be training a two-layer neural network
-761
+860
01:01:29,150 --> 01:01:32,070
in this class, so you'll be doing something very similar to this, but
-762
+861
01:01:32,070 --> 01:01:34,950
you're not using logistic regression and you might have different activation
-763
+862
01:01:34,950 --> 01:01:39,149
functions. But again, just my advice to you when you implement this is, stage
-764
+863
01:01:39,150 --> 01:01:42,789
your computation into these intermediate results, and then do proper
-765
+864
01:01:42,789 --> 01:01:46,909
backpropagation into every intermediate result. So, let's see. You
-766
+865
01:01:46,909 --> 01:01:54,460
receive these weight matrices and also the biases. I don't
-767
+866
01:01:54,460 --> 01:01:59,940
believe you have biases actually in your SVM and in your softmax, but here you'll have biases. So
-768
+867
01:01:59,940 --> 01:02:03,269
take your weight matrices and the biases, compute the first hidden layer, compute your scores,
-769
+868
01:02:03,269 --> 01:02:08,429
compute your loss, and then do the backward pass. So backprop into scores, then
-770
+869
01:02:08,429 --> 01:02:13,739
backprop into the weights at the second layer, and backprop into this h1 vector,
-771
+870
01:02:13,739 --> 01:02:18,849
and then through h1, backprop into the first weight matrix and the first biases. Okay, so do
-772
+871
01:02:18,849 --> 01:02:22,929
proper backpropagation here. Otherwise, if you try to, right away, just say: what
-773
+872
01:02:22,929 --> 01:02:26,739
is dW1, what is the gradient on W1? If you just try to make a single expression
-774
+873
01:02:26,739 --> 01:02:31,099
for it, it will be way too large and you'll have headaches. So do it through a series of
-775
+874
01:02:31,099 --> 01:02:32,619
steps in backpropagation.
-776
+875
01:02:32,619 --> 01:02:36,119
That's just a hint.
-777
+876
01:02:36,119 --> 01:02:39,940
Okay. So now I'd like to, so that was the presentation of neural networks without
-778
+877
01:02:39,940 --> 01:02:43,940
all the brain stuff, and it looks fairly simple. So now we're going to make it
-779
+878
01:02:43,940 --> 01:02:47,740
slightly more insane by folding in all kinds of motivations, mostly
-780
+879
01:02:47,740 --> 01:02:51,219
historical, about how this came about and how it's related to the brain at all.
-781
+880
01:02:51,219 --> 01:02:54,939
And so, we have neural networks and we have neurons inside these neural
-782
+881
01:02:54,940 --> 01:02:59,440
networks. So this is what neurons look like. This is just what happens when you search on
-783
+882
01:02:59,440 --> 01:03:03,800
image search 'neurons', so there you go. Now your actual biological neurons don't
-784
+883
01:03:03,800 --> 01:03:09,030
look like this. Unfortunately, they look more like that. And so a neuron,
-785
+884
01:03:09,030 --> 01:03:11,880
just very briefly, just to give you an idea about where this is all coming from:
-786
+885
01:03:11,880 --> 01:03:17,220
you have the cell body, or soma, as people like to call it, and it's got all these dendrites
-787
+886
01:03:17,220 --> 01:03:21,049
that are connected to other neurons.
So there's a cluster of other neurons and
-788
+887
01:03:21,050 --> 01:03:25,450
cell bodies over here. And dendrites are really these appendages that listen to
-789
+888
01:03:25,450 --> 01:03:30,869
them. So these are the inputs to a neuron, and then it's got a single axon that
-790
+889
01:03:30,869 --> 01:03:35,839
comes out of the neuron that carries the output of the computation that this neuron performs.
-791
+890
01:03:35,840 --> 01:03:40,579
So usually you have this neuron that receives inputs, and if many of them
-792
+891
01:03:40,579 --> 01:03:46,179
align, then this cell, this neuron, can choose to spike. It sends an action
-793
+892
01:03:46,179 --> 01:03:50,199
potential down the axon, and then this actually, like, diverges out to
-794
+893
01:03:50,199 --> 01:03:54,659
connect to dendrites of other neurons that are downstream. So there are other
-795
+894
01:03:54,659 --> 01:03:57,639
neurons here and their dendrites connect to the axons of these guys.
-796
+895
01:03:57,639 --> 01:04:02,299
So basically, just neurons connected through these synapses in between, and we have these
-797
+896
01:04:02,300 --> 01:04:05,840
dendrites that are the inputs to a neuron and this axon that actually carries the
-798
+897
01:04:05,840 --> 01:04:10,410
output of a neuron. And so basically, you can come up with a very crude model of a
-799
+898
01:04:10,410 --> 01:04:16,769
neuron, and it will look something like this. We have an axon, so this is the cell body
-800
+899
01:04:16,769 --> 01:04:20,909
here of a neuron. And just imagine an axon coming from a different neuron,
-801
+900
01:04:20,909 --> 01:04:24,730
somewhere in the network, and this neuron is connected to that neuron through this
-802
+901
01:04:24,730 --> 01:04:29,840
synapse. And every one of these synapses has a weight associated with it
-803
+902
01:04:29,840 --> 01:04:35,350
of how much this neuron likes that neuron, basically. And so the axon carries
-804
+903
01:04:35,350 --> 01:04:39,769
this x. It interacts in the synapse and they multiply in this crude model. So you
-805
+904
01:04:39,769 --> 01:04:44,989
get w0x0 flowing to the soma. And then that happens for many neurons,
-806
+905
01:04:44,989 --> 01:04:45,849
so you have lots of
-807
+906
01:04:45,849 --> 01:04:51,500
inputs of w times x flowing in. And the cell body here just performs a sum, offset by
-808
+907
01:04:51,500 --> 01:04:56,940
a bias, and then that passes through an
-809
+908
01:04:56,940 --> 01:05:02,800
activation function to actually compute the output on this axon. Now in
-810
+909
01:05:02,800 --> 01:05:06,570
biological models, historically people liked to use the sigmoid nonlinearity
-811
+910
01:05:06,570 --> 01:05:09,430
for the activation function. The reason for that is because
-811
+911
01:05:09,430 --> 01:05:11,730
you get a number between 0 and 1, and
-812
+912
01:05:11,730 --> 01:05:15,420
you can interpret that as the rate at which this neuron is firing for that
-813
+913
01:05:15,420 --> 01:05:19,809
particular input. So it's a rate between 0 and 1 that's going through the
-814
+914
01:05:19,809 --> 01:05:23,889
activation function. So if this neuron has seen something it likes in the neurons
-815
+915
01:05:23,889 --> 01:05:27,900
that connect to it, it will start to spike a lot, and the rate is described by
-816
+916
01:05:27,900 --> 01:05:33,139
f of the input. Okay, so that's the crude model of the neuron.
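In code, that crude model is only a few lines. A minimal Python sketch, with a sigmoid for the activation function and with class and method names assumed for illustration:

```python
import numpy as np

class Neuron(object):
    def __init__(self, weights, bias):
        self.weights = weights  # one weight per input synapse
        self.bias = bias

    def forward(self, inputs):
        # weighted sum at the cell body, offset by a bias, then squashed
        cell_body_sum = np.sum(inputs * self.weights) + self.bias
        firing_rate = 1.0 / (1.0 + np.exp(-cell_body_sum))  # rate in (0, 1)
        return firing_rate
```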
If I wanted to implement it,
-817
+917
01:05:33,139 --> 01:05:38,819
it would look something like this. So a neuron_tick function, a forward pass: it receives
-818
+918
01:05:38,820 --> 01:05:44,500
some inputs. This is a vector, and we form a sum at the cell body, so just a linear sum.
-819
+919
01:05:44,500 --> 01:05:49,980
And we compute the firing rate as a sigmoid of the cell body sum and return the firing
-820
+920
01:05:49,980 --> 01:05:53,579
rate. And then this can plug into different neurons, right? So you can
-821
+921
01:05:53,579 --> 01:05:56,710
imagine, you can actually see that this looks very similar to a linear
-822
+922
01:05:56,710 --> 01:06:02,750
classifier, right? We're forming a linear sum here, a weighted sum, and we're passing that through
-823
+923
01:06:02,750 --> 01:06:07,050
a nonlinearity. So every single neuron in this model is really like a small linear
-824
+924
01:06:07,050 --> 01:06:11,530
classifier, but these linear classifiers plug into each other, and they can work together to
-825
+925
01:06:11,530 --> 01:06:16,650
do interesting things. Now one note to make about neurons is that they're very, they're
-826
+926
01:06:16,650 --> 01:06:21,300
not like biological neurons. Biological neurons are super complex, so if you go
-827
+927
01:06:21,300 --> 01:06:24,670
around and you start saying that neural networks work like the brain, people are
-828
+928
01:06:24,670 --> 01:06:28,849
going to start to frown. People will start to frown at you, and that's because neurons are
-829
+929
01:06:28,849 --> 01:06:33,650
complex dynamical systems. There are many different types of neurons. They function
-830
+930
01:06:33,650 --> 01:06:38,550
differently. These dendrites, they can perform lots of interesting
-831
+931
01:06:38,550 --> 01:06:42,140
computation. A good review article is Dendritic Computation, which I really
-832
+932
01:06:42,140 --> 01:06:46,069
enjoyed. These synapses are complex dynamical systems. They're not just a
-833
+933
01:06:46,070 --> 01:06:49,720
single weight. And we're not really sure if the brain uses a rate code to
-834
+934
01:06:49,720 --> 01:06:54,689
communicate. So it's a very crude mathematical model; don't push this analogy too much.
-835
+935
01:06:54,690 --> 01:06:57,960
But it's good for, kind of like, media articles,
-836
+936
01:06:57,960 --> 01:07:01,990
so I suppose that's why this keeps coming up again and again as we
-837
+937
01:07:01,990 --> 01:07:04,989
explain that this works like your brain. But I'm not going to go too deep into
-838
+938
01:07:04,989 --> 01:07:09,829
this. To go back to a question that was asked before, there's an entire set of
-839
+939
01:07:09,829 --> 01:07:11,859
nonlinearities that we can choose from.
-839
+940
01:07:14,559 --> 01:07:17,559
So historically, sigmoid has been used
-840
+941
01:07:17,559 --> 01:07:20,210
quite a bit, and we're going to go into much more detail over what these
-841
+942
01:07:20,210 --> 01:07:23,690
nonlinearities are, what their tradeoffs are, and why you might want to use
-842
+943
01:07:23,690 --> 01:07:27,838
one or the other, but for now, I'd just like to flash them and mention that there are many things to
-843
+944
01:07:27,838 --> 01:07:28,579
choose from.
-844
+945
01:07:28,579 --> 01:07:33,940
Historically, people used sigmoid and tanh. As of 2012, ReLU became quite popular.
-845
+946
01:07:33,940 --> 01:07:38,429
It makes your networks converge quite a bit faster, so right now, if you wanted a
-846
+947
01:07:38,429 --> 01:07:41,429
default choice for nonlinearity, use ReLU.
-847
+948
01:07:41,429 --> 01:07:45,679
That's the current default recommendation. And then there are a few, kind of, hipster
-848
+949
01:07:45,679 --> 01:07:51,489
activation functions here. And so Leaky ReLUs were proposed a few years ago. Maxout is
-849
+950
01:07:51,489 --> 01:07:54,989
interesting. And very recently, ELU. And so you can come up with different
-850
+951
01:07:54,989 --> 01:07:58,319
activation functions and you can describe why these might work better or
-851
+952
01:07:58,320 --> 01:08:01,789
not. And so this is an active area of research. It's trying to come up with these
-852
+953
01:08:01,789 --> 01:08:05,949
activation functions that perform, that have better properties in one way or
-853
+954
01:08:05,949 --> 01:08:10,909
another. So we're going to go into this in much more detail soon in the class. But for
-854
+955
01:08:10,909 --> 01:08:15,980
now, we have these neurons, we have a choice of activation function, and then
-855
+956
01:08:15,980 --> 01:08:19,259
we arrange these neurons into neural networks, right? So we just connect them
-856
+957
01:08:19,259 --> 01:08:23,140
together so they can talk to each other. And so here is an example of a
-857
+958
01:08:23,140 --> 01:08:27,170
2-layer neural net or 3-layer neural net. When you want to count the number of layers in a
-858
+959
01:08:27,170 --> 01:08:30,829
neural net, you count the number of layers that have weights. So here, the
-859
+960
01:08:30,829 --> 01:08:35,449
input layer does not count as a layer, because there's no... These neurons are just
-860
+961
01:08:35,449 --> 01:08:39,729
single values. They don't actually do any computation. So we have two layers here
-861
+962
01:08:39,729 --> 01:08:45,068
that have weights. So it's a 2-layer net. And we call these layers fully connected
-862
+963
01:08:45,069 --> 01:08:50,870
layers. And so, remember that I showed you that a single neuron computes this little
-863
+964
01:08:50,870 --> 01:08:54,750
weighted sum, and then passes that through a nonlinearity. In a neural network, the
-864
+965
01:08:54,750 --> 01:08:58,829
reason we arrange these into layers is because arranging them into layers allows
-865
+966
01:08:58,829 --> 01:09:01,759
us to perform the computation much more efficiently. So instead of having an
-866
+967
01:09:01,759 --> 01:09:04,460
amorphous blob of neurons where every one of them has to be computed independently,
-867
+968
01:09:04,460 --> 01:09:08,699
having them in layers allows us to use vectorized operations. And so we can
-868
+969
01:09:08,699 --> 01:09:10,139
compute an entire set of
-869
+970
01:09:10,140 --> 01:09:14,410
neurons in a single hidden layer with just a single matrix multiply. And
-870
+971
01:09:14,410 --> 01:09:17,619
that's why we arrange them in these layers, where neurons inside a layer can be
-871
+972
01:09:17,619 --> 01:09:21,119
evaluated completely in parallel, and they all see the same input. So it's a
-872
+973
01:09:21,119 --> 01:09:25,519
computational trick to arrange them in layers. So this is a 3-layer neural net,
-873
+974
01:09:25,520 --> 01:09:30,500
and this is how you would compute it. Just a bunch of matrix multiplies
-874
+975
01:09:30,500 --> 01:09:35,550
followed by activation functions.
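That "bunch of matrix multiplies followed by activation functions" for a 3-layer net, as a numpy sketch (a sigmoid is chosen here, and the layer sizes are assumptions for illustration):

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))   # activation function (sigmoid)
x = np.random.randn(3072, 1)             # a stand-in input column vector

W1 = 0.01 * np.random.randn(100, 3072); b1 = np.zeros((100, 1))
W2 = 0.01 * np.random.randn(100, 100);  b2 = np.zeros((100, 1))
W3 = 0.01 * np.random.randn(10, 100);   b3 = np.zeros((10, 1))

h1 = f(W1.dot(x) + b1)   # first hidden layer: one matrix multiply per layer
h2 = f(W2.dot(h1) + b2)  # second hidden layer
out = W3.dot(h2) + b3    # output scores: no nonlinearity on the last layer
```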
So now I'd
-875
+976
01:09:35,550 --> 01:09:40,520
like to show you a demo of how these neural networks work. So this is a JavaScript demo
-876
+977
01:09:40,520 --> 01:09:44,770
that I'll show you in a bit. But basically, this is an example of a
-877
+978
01:09:44,770 --> 01:09:50,080
two-layer neural network doing a binary classification task. So we have two
-878
+979
01:09:50,080 --> 01:09:54,119
classes, red and green. And so we have these points in two dimensions, and I'm drawing
-879
+980
01:09:54,119 --> 01:09:58,109
the decision boundaries of the neural network. And so what you can see is, when
-880
+981
01:09:58,109 --> 01:10:01,969
I train a neural network on this data, the more hidden neurons I have in my
-881
+982
01:10:01,970 --> 01:10:05,770
hidden layer, the more wiggle your neural network has, right? The more it can compute
-882
+983
01:10:05,770 --> 01:10:12,290
crazy functions. And just to show you also the effect of regularization strength. So this is the
-883
+984
01:10:12,290 --> 01:10:17,069
regularization of how much you penalize large Ws. So you can see that when you insist
-884
+985
01:10:17,069 --> 01:10:22,340
that your Ws are very small, you end up with very smooth functions, so they don't
-885
+986
01:10:22,340 --> 01:10:27,050
have as much variance. So these neural networks, there's not as much wiggle
-886
+987
01:10:27,050 --> 01:10:31,090
that they can give you, and then as you decrease the regularization, these neural
-887
+988
01:10:31,090 --> 01:10:34,090
networks can do more and more complex tasks, so they can kind of get in and cover
-888
+989
01:10:34,090 --> 01:10:38,710
these little squeezed-out points in the training data. So let me show
-889
+990
01:10:38,710 --> 01:10:41,489
you what this looks like
-890
+991
01:10:41,489 --> 01:10:47,079
during training. Okay.
-891
+992
01:10:47,079 --> 01:10:53,010
So there's some stuff to explain here. Let me first actually... So you can play with
-892
+993
01:10:53,010 --> 01:10:56,060
this, because it's all in JavaScript.
-893
+994
01:10:56,060 --> 01:11:04,060
Okay. All right. So what we're doing here is we have six neurons, and this is a binary
-894
+995
01:11:04,060 --> 01:11:09,000
classification dataset with circle data. And so we have a little cluster of
-895
+996
01:11:09,000 --> 01:11:13,520
green dots separated by red dots. And we're training a neural network to classify
-896
+997
01:11:13,520 --> 01:11:18,080
this dataset. So if I restart the neural network, it just starts off with a
-897
+998
01:11:18,080 --> 01:11:20,949
random W, and then it converges the decision boundary to actually classify
-898
+999
01:11:20,949 --> 01:11:26,289
the data. What I'm showing on the right, which is the cool part of this visualization, is one interpretation of
-899
+1000
01:11:26,289 --> 01:11:29,529
the neural network here: I'm taking this grid here and I'm
-900
+1001
01:11:29,529 --> 01:11:33,909
showing how this space gets warped by the neural network. So you can interpret
-901
+1002
01:11:33,909 --> 01:11:37,619
what the neural network is doing as using its hidden layer to transform your
-902
+1003
01:11:37,619 --> 01:11:41,159
input data in such a way that the second layer can come in with a linear
-903
+1004
01:11:41,159 --> 01:11:47,059
classifier and classify your data. So here, you see that the neural network
-904
+1005
01:11:47,060 --> 01:11:51,920
arranges your space.
It warps it such that the second layer, which is really a
-905
+1006
01:11:51,920 --> 01:11:56,779
linear classifier on top of the first layer, can put a plane through it, okay?
-906
+1007
01:11:56,779 --> 01:11:59,939
So it's warping the space so that you can put a plane through it and
-907
+1008
01:11:59,939 --> 01:12:06,259
separate out the points. So let's look at this again. So you can roughly see
-908
+1009
01:12:06,260 --> 01:12:10,940
how this gets warped so that you can linearly classify the data. This is
-909
+1010
01:12:10,940 --> 01:12:13,569
something that people sometimes also refer to as the kernel trick. It's
-910
+1011
01:12:13,569 --> 01:12:19,149
changing your data representation to a space where it's linearly separable. Okay.
-911
+1012
01:12:19,149 --> 01:12:23,079
Now, here's a question. If we'd like to separate, so right now we have six
-912
+1013
01:12:23,079 --> 01:12:27,809
neurons here in the intermediate layer, and it allows us to separate out these
-913
+1014
01:12:27,810 --> 01:12:33,580
data points. So you can actually see those six neurons roughly. You can see these lines
-914
+1015
01:12:33,580 --> 01:12:36,869
here; they're kind of like the functions of each of these neurons. So
-915
+1016
01:12:36,869 --> 01:12:40,349
here's a question for you: what is the minimum number of neurons for which this
-916
+1017
01:12:40,350 --> 01:12:45,570
dataset is separable with a neural network? If I want the neural network
-917
+1018
01:12:45,570 --> 01:12:49,089
to correctly classify this, how many neurons do I need in the hidden layer as a minimum?
-918
+1019
01:12:57,890 --> 01:13:04,270
Four? I heard some threes, some fours. Binary search.
-918
+1020
01:13:04,270 --> 01:13:08,870
So intuitively, the way this would work is, let's see four.
-918
+1021
01:13:12,270 --> 01:13:15,270
So what happens with four is, there is one
-919
+1022
01:13:15,270 --> 01:13:18,910
neuron here that went from this way to that way, this way to that way, this way
-920
+1023
01:13:18,910 --> 01:13:22,689
to that way. There are four neurons that are cutting up this plane. And then
-921
+1024
01:13:22,689 --> 01:13:27,039
there's an additional layer that's doing a weighted sum. So in fact, the lowest
-922
+1025
01:13:27,039 --> 01:13:34,739
number here would be three, which would work. So with three neurons... So
-923
+1026
01:13:34,739 --> 01:13:39,189
one plane, second plane, third plane. So three linear functions with a nonlinearity,
-924
+1027
01:13:39,189 --> 01:13:45,649
and then basically with three lines, you can carve out the space so
-925
+1028
01:13:45,649 --> 01:13:50,329
that the second layer can just combine them, where their outputs are 1 and not 0.
-925
+1029
01:13:50,329 --> 01:13:52,429
(Student is asking question)
-926
+1030
01:13:52,430 --> 01:13:57,850
At two? Certainly. So at two, this will break, because two lines are not enough. I
-927
+1031
01:13:57,850 --> 01:14:03,900
suppose this works... Not going to look very good here. So with two, basically it will find
-928
+1032
01:14:03,900 --> 01:14:07,239
the optimal way of just using these two lines. They're kind of creating this
-929
+1033
01:14:07,239 --> 01:14:11,239
tunnel, and that's the best you can do. Okay?
-929
+1034
01:14:11,239 --> 01:14:14,599
(Student is asking question)
-930
+1035
01:14:18,600 --> 01:14:25,400
The curve, I think... Which nonlinearity am I using? tanh? Yeah, I'm not sure exactly how that works out.
If I was using ReLU, I think it would be much, so ReLU is the... Let me change to ReLU, and I
-930
+1036
01:14:25,400 --> 01:14:31,300
think you'd see sharp boundaries. Yeah. Yes, this is three. You can do four. So let's do...
-931
+1037
01:14:31,300 --> 01:14:41,460
(Student is asking question)
-931
+1038
01:14:41,460 --> 01:14:47,460
Yeah, that's because, it's
-931
+1039
01:14:47,460 --> 01:14:50,460
because in some of these parts
-932
+1040
01:14:50,460 --> 01:14:52,130
more than one of those ReLUs is active, and so you end up with...
-933
+1041
01:14:52,130 --> 01:14:57,819
There are really three lines, I think, like one, two, three, but then in some of the corners two ReLU
-934
+1042
01:14:57,819 --> 01:15:02,359
neurons are active, and so these weights will add up. It's kind of funky. You
-935
+1043
01:15:02,359 --> 01:15:05,689
have to think about it a bit. But okay. So let's look at, say, twenty here. So I changed to twenty
-936
+1044
01:15:05,689 --> 01:15:12,649
so we have lots of space there, and let's look at different datasets, like, say, spiral.
-937
+1045
01:15:12,649 --> 01:15:16,670
So you can see how this thing just, as I'm doing this update, it will just go in there
-938
+1046
01:15:16,670 --> 01:15:22,390
and figure that out. A very simple dataset is not... Spiral. Circle, and then random...
-939
+1047
01:15:22,390 --> 01:15:32,800
so random data, and so it kind of goes in there and, like, covers up the green
-940
+1048
01:15:33,200 --> 01:15:39,880
ones and the red ones. And yeah. And with fewer, say like five... I'm going to break this
-941
+1049
01:15:39,880 --> 01:15:48,039
now. I'm not going to... Okay. So with five... Yes. So this will start working worse and worse,
-942
+1050
01:15:48,039 --> 01:15:54,890
because you don't have enough capacity to separate out this data. So you can
-943
+1051
01:15:54,890 --> 01:15:58,770
play with this in your free time. Okay. And so as a summary,
-944
+1052
01:15:58,770 --> 01:16:05,270
we arrange these neurons in neural networks into fully connected layers.
-945
+1053
01:16:05,270 --> 01:16:10,690
We've looked at backprop and how this gets chained in computational graphs. And they're
-946
+1054
01:16:10,690 --> 01:16:14,579
not really neural. And as we'll see soon, the bigger the better, and we'll go into
-947
+1055
01:16:14,579 --> 01:16:19,149
that a lot. I want to take questions before I end. Just sorry. Were there any questions? Go ahead.
-948
+1056
01:16:19,149 --> 01:16:23,510
(Student is asking question)
-948
+1057
01:16:23,510 --> 01:16:27,710
We have two more minutes. Sorry.
-949
+1058
01:16:27,710 --> 01:16:29,359
(Student is asking question)
-948
+1059
01:16:29,359 --> 01:16:35,710
Yes, thank you.
-950
+1060
01:16:35,710 --> 01:16:36,899
So is it always better to have more neurons in your neural network? The answer to
-951
+1061
01:16:36,899 --> 01:16:41,119
that is yes. More is always better. It's usually a computational constraint, so more will
-952
+1062
01:16:41,119 --> 01:16:48,809
always work better, but then you have to be careful to regularize it properly. So
-953
+1063
01:16:48,810 --> 01:16:52,510
the correct way to constrain your neural network to not overfit your data is not by
-954
+1064
01:16:52,510 --> 01:16:55,810
making the network smaller. The correct way to do it is to increase the
-955
+1065
01:16:55,810 --> 01:16:58,940
regularization.
-956
+1066
01:16:58,940 --> 01:17:03,079
So you always want to use as large a network as you can, but then
-957
+1067
01:17:03,079 --> 01:17:06,269
you have to make sure to properly regularize it. But most of the time,
-958
+1068
01:17:06,270 --> 01:17:09,320
because of computational reasons, you have a finite amount of time, you don't want to wait forever to
-959
+1069
01:17:09,320 --> 01:17:14,980
train your networks. You'll use smaller ones for practical reasons. Question?
-959
+1070
01:17:14,980 --> 01:17:17,780
(Student is asking question)
-959
+1071
01:17:17,780 --> 01:17:19,980
Do you regularize each layer equally?
-960
+1072
01:17:19,980 --> 01:17:25,509
Usually you do, as a simplification. Yeah. Most of the, often when you see
-961
+1073
01:17:25,510 --> 01:17:28,030
networks get trained in practice, they will be regularized the same way throughout.
-962
+1074
01:17:28,030 --> 01:17:31,030
But you don't have to necessarily. Go ahead.
-962
+1075
01:17:31,030 --> 01:17:35,710
(Student is asking question)
-963
+1076
01:17:35,710 --> 01:17:40,500
Is there any value to using second derivatives, using the Hessian, in optimizing neural networks? There is value
-964
+1077
01:17:40,500 --> 01:17:44,859
sometimes when your datasets are small. You can use things like L-BFGS, which I
-965
+1078
01:17:44,859 --> 01:17:47,729
didn't go into too much, and that's a second order method, but usually the datasets
-966
+1079
01:17:47,729 --> 01:17:50,500
are really large, and that's when L-BFGS doesn't work very well.
-967
+1080
01:17:50,500 --> 01:17:57,039
So when you have millions of data points, you can't do L-BFGS for various reasons. Yeah. And L-BFGS is
-968
+1081
01:17:57,039 --> 01:18:01,970
not very good with minibatches. You always have to do full batch by default. Question.
-969
+1082
01:18:01,970 --> 01:18:09,950
(Student is asking question)
-969
+1083
01:18:09,950 --> 01:18:13,650
So what is the tradeoff between depth and size roughly, like how do you allocate?
-969
+1084
01:18:13,650 --> 01:18:16,450
Not a good answer for that, unfortunately.
-970
+1085
01:18:16,450 --> 01:18:20,899
So you want, depth is good, but maybe after like ten layers, if you have a simple dataset,
-971
+1086
01:18:20,899 --> 01:18:25,219
it's not really adding too much. We have one more minute, so I can still take some
-972
+1087
01:18:25,220 --> 01:18:26,620
questions. You had a question for a while.
-972
+1088
01:18:26,620 --> 01:18:31,520
(Student is asking question)
-972
+1089
01:18:31,520 --> 01:18:35,990
Yeah, so the tradeoff between where do I allocate my
-973
+1090
01:18:35,990 --> 01:18:40,019
capacity, do I want it to be deeper or do I want it to be wider; not a very good
-974
+1091
01:18:40,020 --> 01:18:41,860
answer to that.
-974
+1092
01:18:41,860 --> 01:18:44,560
(Student is asking question)
-974
+1093
01:18:44,560 --> 01:18:47,860
Yes, usually, especially with images, we find that more layers are
-975
+1094
01:18:47,860 --> 01:18:51,199
critical. But sometimes when you have simple datasets, like 2D or some
-976
+1095
01:18:51,199 --> 01:18:55,359
other things, depth is not as critical, and so it's kind of slightly
-977
+1096
01:18:55,359 --> 01:18:59,670
data dependent. We had a question over there.
-977
+1097
01:18:59,670 --> 01:19:05,670
(Student is asking question)
-978
+1098
01:19:05,670 --> 01:19:10,050
Different activation functions for different layers, does that help? Usually it's not done. Usually we
-979
+1099
01:19:10,050 --> 01:19:15,960
just kind of pick one and go with it.
So say, for ConvNets for example, we'll see that

-980 +1100
01:19:15,960 --> 01:19:19,279
most of them are trained just with ReLUs. And so you just use that throughout and

-981 +1101
01:19:19,279 --> 01:19:22,389
there's no real benefit to switch them around. People don't play with that

-982 +1102
01:19:22,390 --> 01:19:26,660
too much, but in principle, there's nothing preventing you. So it is 4:20,

-983 +1103
01:19:26,660 --> 01:19:29,789
so we're going to end here, but we'll see lots of more neural networks, so a lot of

-984 +1104
01:19:29,789 --> 01:19:31,738
these questions, we'll go through them.
\ No newline at end of file

From 710887bdaa2f72a07c17a4413b71bda6221537f1 Mon Sep 17 00:00:00 2001
From: jung_hojin
Date: Thu, 2 Jun 2016 22:18:04 +0900
Subject: [PATCH 163/199] Minor fix

---
 captions/En/Lecture4_en.srt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/captions/En/Lecture4_en.srt b/captions/En/Lecture4_en.srt
index d85c7302..b72523f4 100644
--- a/captions/En/Lecture4_en.srt
+++ b/captions/En/Lecture4_en.srt
@@ -25,8 +25,8 @@ time. It's really running out. And you

 6
 00:00:29,278 --> 00:00:31,768
-know you might think that you have
-late days and so on but these assignments just get
+know you might think that you have late
+days and so on but these assignments just get

 7
 00:00:31,768 --> 00:00:38,640

From 143ac02d6325cc8a8a57b252d3b2f221d5789006 Mon Sep 17 00:00:00 2001
From: YB
Date: Sat, 4 Jun 2016 23:48:03 -0400
Subject: [PATCH 164/199] Lecture1 - part 211~225 (out of 715) en / ko

---
 captions/En/Lecture1_en.srt | 60 ++++++++++++++++++-------------------
 captions/Ko/Lecture1_ko.srt | 32 ++++++++++----------
 2 files changed, 46 insertions(+), 46 deletions(-)

diff --git a/captions/En/Lecture1_en.srt b/captions/En/Lecture1_en.srt
index e86327c3..1c52f268 100644
--- a/captions/En/Lecture1_en.srt
+++ b/captions/En/Lecture1_en.srt
@@ -1034,86 +1034,84 @@ and it takes the intro this much real estate space to be

 210
 00:23:37,579 --> 00:23:43,148
 used for the system. Why?
-because it's so important and it's so damn hard.
+Because it's so important and it's so damn hard.

 211
 00:23:43,148 --> 00:23:50,959
-That's why we need to get back to human reason
-they were really ambitious they wanna
+That's why we need to use this much space.
+OK, back to Hubel and Wiesel. They were really ambitious.

 212
 00:23:50,960 --> 00:23:56,028
-know what primary visual cortex is doing
+They wanna know what primary visual cortex is doing
 because this is the beginning of our

 213
 00:23:56,028 --> 00:24:02,878
-knowledge for deep learning neural
-network cats social then put the cats in
+knowledge for deep learning neural network.
+So, they were showing cats. So they put the cats in this room

 214
 00:24:02,878 --> 00:24:07,709
-this room and they were recording your
-activities when I say recording your
+and they were recording neural activities.
+When I say recording neural activity,

 215
 00:24:07,710 --> 00:24:11,659
-activities fair trial there basically
-trying to see you know if I put the
+they're basically trying to see,
+you know if I put the

 216
 00:24:11,659 --> 00:24:18,059
-electrode here like to the new office to
-the new house fire when they see
+neural electrode here,
+do the neurons fire when they see something?

217
-00:24:18,058 --> 00:24:25,308
-something so for example if they show if
-they show cat their ideas if I showed
+00:24:18,059 --> 00:24:25,308
+So, for example if they show cats,
+their idea is...

 218
 00:24:25,308 --> 00:24:30,519
-this kind of fish you know apparently at
-that time comes to eat fish rather than
+if I showed this kind of fish, you know,
+apparently at that time cats eat fish rather than these beans.

 219
 00:24:30,519 --> 00:24:42,019
-these beings with the cats no I like
-yellow happy and spikes and here's a
+With the cat's neuron like, you know,
+they're happy and start sending spikes.

 220
 00:24:42,019 --> 00:24:48,128
-story of scientific discovery is
-scientific discovery takes both luck and
+and the funny thing of a story of scientific discovery is

 221
 00:24:48,128 --> 00:24:52,449
-care and thoughtfulness they were shown
+scientific discovery takes both luck and care and thoughtfulness.

 222
 00:24:52,450 --> 00:24:58,740
-whatever mouse flower it just doesn't
-work the cats new are in the primary
+They were showing this cat fish, whatever mouse, flower.
+It just doesn't work. The cat's neuron in the primary

 223
 00:24:58,740 --> 00:25:02,839
-visual cortex was silent there was no
-spiking
+visual cortex was silent. There was no spiking.

 224
 00:25:02,839 --> 00:25:09,079
-very little spike in there were really
-frustrated but the good news is that
+Very little spike and they were really frustrated.
+But the good news is that

 225
 00:25:09,079 --> 00:25:14,509
 there was no computer at that time so
-what they have to do when they showed us
+what they have to do when they showed the cats

 226
 00:25:14,509 --> 00:25:21,740
-cats is they have to use a slight
-protector so they put his foot a slide
+that is a stimulus, they have to use a slide
+projector so they put his foot a slide

 227
 00:25:21,740 --> 00:25:26,799
diff --git a/captions/Ko/Lecture1_ko.srt b/captions/Ko/Lecture1_ko.srt
index 5d4d6543..a3211e40 100644
--- a/captions/Ko/Lecture1_ko.srt
+++ b/captions/Ko/Lecture1_ko.srt
@@ -863,63 +863,65 @@

 211
 00:23:43,148 --> 00:23:50,959
- 우리는 인간의 이성을 다시 얻을 필요가 왜 그들이 싶어 정말 야심했다
+ 그래서 이렇게 큰 공간을 차지하고 있는거죠.
+ 자 Hubel과 Wiesel로 돌아가보면, 그들은 매우 야심찼어요.

 212
 00:23:50,960 --> 00:23:56,028
- 이의 시작이기 때문에 일차 시각 피질이 무엇을하고 있는지 알고 우리의
+ 그들은 일차 시각 피질이 무엇을 하는지 알고 싶었어요.
+ 이것이 바로 Deep Learning Neural Network 연구의 시작이기 때문이죠.

 213
 00:23:56,028 --> 00:24:02,878
- 다음 소셜 깊은 학습 신경망 고양이에 대한 지식에 고양이를 넣어
+ 그들은 한 방에 고양이를 두고

 214
 00:24:02,878 --> 00:24:07,709
- 나는 당신의 기록 말할 때이 방은 그들이 당신의 활동을 기록했다
+ 신경 활동을 기록했어요. 이 신경 활동을 기록한다는 것은 다시 말해서

 215
 00:24:07,710 --> 00:24:11,659
- 내가 넣어 경우 활동 공정한 재판이 기본적으로 당신이 알고 보려고
+ 제가 전극을 여기에 넣고

 216
 00:24:11,659 --> 00:24:18,059
- 여기에 새로운 사무실 등 그들이 볼 수있는 새 집 화재 전극
+ 무언가를 보았을 때 뉴런이 활발하게 활동하는지를 보려고 하는 거죠.

 217
 00:24:18,058 --> 00:24:25,308
- 그래서 예를 들어 뭔가 내가 보여 주었다 경우 그들이 자신의 아이디어를 고양이를 보여 주면 그들이 보여 주면
+ 예를 들어 그들이 고양이에게..

 218
 00:24:25,308 --> 00:24:30,519
- 당신이 그 때 분명히 알고 물고기의이 종류보다는 물고기를 먹고 온다
+ 만약 제가 고양이에게 생선을 보여주면, 그 당시에는 분명히 고양이들은 콩사료보다는 물고기를 먹었죠.

 219
 00:24:30,519 --> 00:24:42,019
- 여기에 노란색 행복과 스파이크와 같은 더 나는 없다 고양이와 함께이 존재
+ 고양이의 뉴런이 기뻐 펄쩍뛰는 모습을 기대하는 것이죠.

 220
 00:24:42,019 --> 00:24:48,128
- 과학적 발견의 이야기는 과학적 발견이 소요 행운 모두와
+ 과학적 발견의 재미있는 점은 과학적 발견은 행운이 따라주면서

 221
 00:24:48,128 --> 00:24:52,449
- 관심과 배려 그들은 나타내었다
+ 관심과 깊은 고민의 과정이 있을 때 나타납니다.

 222
 00:24:52,450 --> 00:24:58,740
- 어떤 마우스 꽃 단지 새로운 고양이를 작동 차에하지 않습니다
+ 그들은 고양이에게 생선, 쥐, 꽃등을 보여주었습니다만, 그 어떤 것도 효과가 없었어요.

 223
 00:24:58,740 --> 00:25:02,839
- 시각 피질은 급상승가 없었다 침묵
+ 고양이의 일차 시각 피질은 조용했습니다.
224 00:25:02,839 --> 00:25:09,079 - 거기에 약간의 스파이크 정말 좌절했다 그러나 좋은 소식은 있다는 것입니다 + 아주 약간의 활동만을 보여 그들은 매우 불만스러웠죠. 그러나 좋은 소식은 225 00:25:09,079 --> 00:25:14,509 - 그들이 우리를 보였다 때이해야 할 무엇 때문에 그 시간에는 컴퓨터가 없었다 + 그 당시에 컴퓨터가 존재하지 않았다는 점 입니다. 그들은 이 고양이에게 226 00:25:14,509 --> 00:25:21,740 From ce018ec69a61a0270980fe75bbbb2c6c8139a14b Mon Sep 17 00:00:00 2001 From: 2wins Date: Sun, 5 Jun 2016 19:15:41 +0900 Subject: [PATCH 165/199] =?UTF-8?q?'=EC=B2=B4=EC=9D=B8=EB=A3=B0=EC=9D=84?= =?UTF-8?q?=20=EC=9D=B4=EC=9A=A9=ED=95=9C=20=EB=B3=B5=ED=95=A9=20=ED=91=9C?= =?UTF-8?q?=ED=98=84=EC=8B=9D'=20=EC=84=B9=EC=85=98=20=EC=9E=91=EC=84=B1?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- optimization-2.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/optimization-2.md b/optimization-2.md index 4dd25070..e6d7666e 100644 --- a/optimization-2.md +++ b/optimization-2.md @@ -7,7 +7,7 @@ Table of Contents: - [소개(Introduction)](#intro) - [그라디언트(Gradient)에 대한 간단한 표현과 이해](#grad) -- [복합적인 표현(Compound Expression), 체인룰(chain rule), Backpropagation)](#backprop) +- [복합 표현식(Compound Expression), 체인룰(chain rule), 역전파(Backpropagation)](#backprop) - [역전파(Backpropation)에 대한 직관적인 이해](#intuitive) - [모듈성 : 시그모이(Sigmoid)드 예제](#sigmoid) - [역전파(Backprop) 실제: 단계별 계산](#staged) @@ -19,9 +19,9 @@ Table of Contents: ### Introduction -**Motivation**. 이번 섹션에서 우리는 **역전파(Backpropagation)**에 대한 직관적인 이해를 바탕으로 전문지식을 더 키우고자 한다. Backpropagation은 Network 전체에 대해 반복적인 **체인룰(Chain rule)**을 적용하여 그라디언트(Gradient)를 계산하는 방법 중 하나이다. Backpropagation 과정과 세부 요소들에 대한 이해는 여러분에게 있어서 Neural Networks를 효과적으로 개발하고, 디자인하고 디버그하는데 중요하다고 볼 수 있다. +**Motivation**. 이번 섹션에서 우리는 **역전파(Backpropagation)**에 대한 직관적인 이해를 바탕으로 전문지식을 더 키우고자 한다. Backpropagation은 Network 전체에 대해 반복적인 **체인룰(Chain rule)**을 적용하여 그라디언트(Gradient)를 계산하는 방법 중 하나이다. Backpropagation 과정과 세부 요소들에 대한 이해는 여러분에게 있어서 신경망을 효과적으로 개발하고, 디자인하고 디버그하는 데 중요하다고 볼 수 있다. -**Problem statement**. 이번 섹션에서 공부할 핵심 문제는 다음과 같다 : 주어진 함 $$f(x)$$ 가 있고, $$x$$ 는 입력 값으로 이루어진 벡터이고, 주어진 입력 $$x$$에 대해서 함수 $$f$$의 그라디언트를 계산하고자 한다. (i.e. $$\nabla f(x)$$ ). +**Problem statement**. 이번 섹션에서 공부할 핵심 문제는 다음과 같다 : 주어진 함수 $$f(x)$$ 가 있고, $$x$$ 는 입력 값으로 이루어진 벡터이고, 주어진 입력 $$x$$에 대해서 함수 $$f$$의 그라디언트를 계산하고자 한다. (i.e. $$\nabla f(x)$$ ). **Motivation**. 우리가 이 문제에 관심을 기울이는 이유에 대해 Neural Network관점에서 좀더 구체적으로 살펴 보자. $$f$$는 Loss 함수 ( $$L$$ ) 에 해당하고 입력 값 $$x$$ 는 학습 데이터(Training data)와 Neural Network의 Weight라고 볼 수 있다. 예를 들면, Loss는 SVM Loss 함수가 될 수 있고, 입력 값은 학습 데이터 $$(x_i,y_i), i=1 \ldots N$$ 와 Weight, Bias $$W,b$$ 으로 볼 수 있다. 여기서 학습데이터는 미리 주어져서 고정 되어있는 값으로 볼 수 있고 (보통의 기계 학습에서 그러하듯..), Weight는 Neural Network의 학습을 위해 실제로 컨트롤 하는 값이다. 따라서 입력 값 $$x_i$$ 에 대한 그라디언트 계산이 쉬울지라도, 실제로는 파라미터(Parameter, Neural Network의 Weight) 값에 대한 Gradient를 일반적으로 계산하고, Gradient값을 활용하여 Parameter를 업데이트 할 수 있다. 하지만, Neural Network이 어떻게 작동하는지 해석하고, 시각화 하는 부분에서 입력 값 $x_i$에 대한 Gradient도 유용하게 활용 될 수 있는데, 이 부분은 본 강의의 뒷부분에 다룰 예정이다. @@ -66,9 +66,9 @@ $$ -### Compound expressions with chain rule +### 체인룰을 이용한 복합 표현식 -Lets now start to consider more complicated expressions that involve multiple composed functions, such as $f(x,y,z) = (x + y) z$. This expression is still simple enough to differentiate directly, but we'll take a particular approach to it that will be helpful with understanding the intuition behind backpropagation. In particular, note that this expression can be broken down into two expressions: $q = x + y$ and $f = q z$. 
Moreover, we know how to compute the derivatives of both expressions separately, as seen in the previous section. $f$ is just multiplication of $q$ and $z$, so $\frac{\partial f}{\partial q} = z, \frac{\partial f}{\partial z} = q$, and $q$ is addition of $x$ and $y$ so $ \frac{\partial q}{\partial x} = 1, \frac{\partial q}{\partial y} = 1 $. However, we don't necessarily care about the gradient on the intermediate value $q$ - the value of $\frac{\partial f}{\partial q}$ is not useful. Instead, we are ultimately interested in the gradient of $f$ with respect to its inputs $x,y,z$. The **chain rule** tells us that the correct way to "chain" these gradient expressions together is through multiplication. For example, $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x} $. In practice this is simply a multiplication of the two numbers that hold the two gradients. Lets see this with an example: +이제 $f(x,y,z) = (x + y) z$ 같은 다수의 복합 함수(composed functions)를 수반하는 더 복잡한 표현식을 고려해보자. 이 표현식은 여전히 바로 미분하기에 충분히 간단하지만, 우리는 이 식에 특별한 접근법을 적용할 것이다. 이는 역전파 뒤에 있는 직관을 이해하는데 도움이 될 것이다. 특히 이 식이 두 개의 표현식 $q = x + y$와 $f = q z$ 으로 분해될 수 있음에 주목하자. 게다가 이전 섹션에서 본 것처럼 우리는 두 식에 대한 미분값을 어떻게 따로따로 계산할지 알고 있다. $f$ 는 단지 $q$와 $z$의 곱이다. 따라서 $\frac{\partial f}{\partial q} = z, \frac{\partial f}{\partial z} = q$, 그리고 $q$는 $x$와 $y$의 합이므로 $\frac{\partial q}{\partial x} = 1, \frac{\partial q}{\partial y} = 1$이다. 하지만, 중간결과값인 $q$에 대한 기울기($\frac{\partial f}{\partial q}$)를 신경쓸 필요가 없다. 대신 궁극적으로 입력 $x,y,z$에 대한 $f$의 기울기에 관심이 있다. **체인룰**은 이러한 기울기 표현식들을 함께 연결시키는 적절한 방법이 곱하는 것이라는 것을 보여준다. 예를 들면, $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x} $와 같이 표현할 수 있다. 실제로 이는 단순히 두 기울기값을 담고 있는 두 수의 곱셈이다. 하나의 예를 통해 확인 해보자. ~~~python # set some inputs @@ -87,15 +87,15 @@ dfdx = 1.0 * dfdq # dq/dx = 1. And the multiplication here is the chain rule! dfdy = 1.0 * dfdq # dq/dy = 1 ~~~ -At the end we are left with the gradient in the variables `[dfdx,dfdy,dfdz]`, which tell us the sensitivity of the variables `x,y,z` on `f`!. This is the simplest example of backpropagation. Going forward, we will want to use a more concise notation so that we don't have to keep writing the `df` part. That is, for example instead of `dfdq` we would simply write `dq`, and always assume that the gradient is with respect to the final output. +결국 `[dfdx,dfdy,dfdz]` 변수들로 기울기가 표현되는데, 이는 `f`에 대한 변수 `x,y,z`의 민감도(sensitivity)를 보여준다. 이는 역전파의 가장 간단한 예이다. 더 나아가서 보다 간결한 표기법을 사용해서 `df` 파트를 계속 쓸 필요가 없도록 하고 싶을 것이다. 예를 들어 `dfdq` 대신에 단순히 `dq`를 쓰고 항상 기울기가 최종 출력에 관한 것이라 가정하는 것이다. -This computation can also be nicely visualized with a circuit diagram: +또한 이런 계산은 회로도를 가지고 다음과 같이 멋지게 시각화할 수 있다:
[그림: f(x,y,z) = (x + y) z 의 실수 값 회로도 (원본은 SVG). 전방 값(녹색): x = -2, y = 5, z = -4, q = x + y = 3, f = qz = -12 / 그라디언트(적색): ∂f/∂x = -4, ∂f/∂y = -4, ∂f/∂z = 3, ∂f/∂q = -4, ∂f/∂f = 1]
<div class="figcaption">
- The real-valued "circuit" on left shows the visual representation of the computation. The forward pass computes values from inputs to output (shown in green). The backward pass then performs backpropagation which starts at the end and recursively applies the chain rule to compute the gradients (shown in red) all the way to the inputs of the circuit. The gradients can be thought of as flowing backwards through the circuit. + 좌측에 실수 값으로 표현되는 "회로"는 이 계산에 대한 시각 표현을 보여준다. 전방 전달(forward pass)은 입력부터 출력까지 값을 계산한다 (녹색으로 표시). 그리고 나서 후방 전달(backward pass)는 역전파를 수행하는데, 이는 끝에서 시작해서 반복적으로 체인 룰을 적용해 회로 입력에 대한 모든 길에서 기울기 값 (적색으로 표시) 을 계산한다. 기울기 값은 회로를 통해 거꾸로 흐르는 것으로 볼 수 있다.
From 7f796933809b9a49492741e1d4d58677697037b9 Mon Sep 17 00:00:00 2001 From: 2wins Date: Sun, 5 Jun 2016 20:17:41 +0900 Subject: [PATCH 166/199] =?UTF-8?q?'=EC=97=AD=EC=A0=84=ED=8C=8C(backpropag?= =?UTF-8?q?ation)=EC=97=90=20=EB=8C=80=ED=95=9C=20=EC=A7=81=EA=B4=80?= =?UTF-8?q?=EC=A0=81=20=EC=9D=B4=ED=95=B4'=20=EC=84=B9=EC=85=98=20?= =?UTF-8?q?=EC=9E=91=EC=84=B1?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- optimization-2.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/optimization-2.md b/optimization-2.md index e6d7666e..55050d7f 100644 --- a/optimization-2.md +++ b/optimization-2.md @@ -66,7 +66,7 @@ $$ -### 체인룰을 이용한 복합 표현식 +### 체인룰(chain rule)을 이용한 복합 표현식 이제 $f(x,y,z) = (x + y) z$ 같은 다수의 복합 함수(composed functions)를 수반하는 더 복잡한 표현식을 고려해보자. 이 표현식은 여전히 바로 미분하기에 충분히 간단하지만, 우리는 이 식에 특별한 접근법을 적용할 것이다. 이는 역전파 뒤에 있는 직관을 이해하는데 도움이 될 것이다. 특히 이 식이 두 개의 표현식 $q = x + y$와 $f = q z$ 으로 분해될 수 있음에 주목하자. 게다가 이전 섹션에서 본 것처럼 우리는 두 식에 대한 미분값을 어떻게 따로따로 계산할지 알고 있다. $f$ 는 단지 $q$와 $z$의 곱이다. 따라서 $\frac{\partial f}{\partial q} = z, \frac{\partial f}{\partial z} = q$, 그리고 $q$는 $x$와 $y$의 합이므로 $\frac{\partial q}{\partial x} = 1, \frac{\partial q}{\partial y} = 1$이다. 하지만, 중간결과값인 $q$에 대한 기울기($\frac{\partial f}{\partial q}$)를 신경쓸 필요가 없다. 대신 궁극적으로 입력 $x,y,z$에 대한 $f$의 기울기에 관심이 있다. **체인룰**은 이러한 기울기 표현식들을 함께 연결시키는 적절한 방법이 곱하는 것이라는 것을 보여준다. 예를 들면, $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x} $와 같이 표현할 수 있다. 실제로 이는 단순히 두 기울기값을 담고 있는 두 수의 곱셈이다. 하나의 예를 통해 확인 해보자. @@ -101,15 +101,15 @@ dfdy = 1.0 * dfdq # dq/dy = 1
-### Intuitive understanding of backpropagation
+### 역전파(backpropagation)에 대한 직관적 이해

-Notice that backpropagation is a beautifully local process. Every gate in a circuit diagram gets some inputs and can right away compute two things: 1. its output value and 2. the *local* gradient of its inputs with respect to its output value. Notice that the gates can do this completely independently without being aware of any of the details of the full circuit that they are embedded in. However, once the forward pass is over, during backpropagation the gate will eventually learn about the gradient of its output value on the final output of the entire circuit. Chain rule says that the gate should take that gradient and multiply it into every gradient it normally computes for all of its inputs.
+역전파가 굉장히 지역적인(local) 프로세스임에 주목하자. 회로도 내의 모든 게이트(gate)는 몇 개의 입력을 받아들이고 곧바로 두 가지를 계산할 수 있다: 1. 게이트의 출력 값, 2. 게이트 출력에 대한 입력들의 *지역적* 기울기 값. 여기서 게이트들이 포함된 전체 회로의 세세한 부분을 모르더라도 완전히 독립적으로 값들을 계산할 수 있음을 주목하라. 하지만, 일단 전방 전달이 끝나면 역전파 과정에서 게이트는 결국 전체 회로의 마지막 출력에 대한 게이트 출력의 기울기 값에 관해 학습할 것이다. 체인룰을 통해 게이트는 이 기울기 값을 받아들여 모든 입력에 대해서 계산한 게이트의 모든 기울기 값에 곱한다.

-> This extra multiplication (for each input) due to the chain rule can turn a single and relatively useless gate into a cog in a complex circuit such as an entire neural network.
+> 체인룰 덕분에 이러한 각 입력에 대한 추가 곱셈은, 하나만 놓고 보면 상대적으로 쓸모 없어 보이는 게이트를 전체 신경망과 같은 복잡한 회로를 움직이는 톱니바퀴(cog)로 바꾸어 준다.

-Lets get an intuition for how this works by referring again to the example. The add gate received inputs [-2, 5] and computed output 3. Since the gate is computing the addition operation, its local gradient for both of its inputs is +1. The rest of the circuit computed the final value, which is -12. During the backward pass in which the chain rule is applied recursively backwards through the circuit, the add gate (which is an input to the multiply gate) learns that the gradient for its output was -4. If we anthropomorphize the circuit as wanting to output a higher value (which can help with intuition), then we can think of the circuit as "wanting" the output of the add gate to be lower (due to negative sign), and with a *force* of 4. To continue the recurrence and to chain the gradient, the add gate takes that gradient and multiplies it to all of the local gradients for its inputs (making the gradient on both **x** and **y** 1 * -4 = -4). Notice that this has the desired effect: If **x,y** were to decrease (responding to their negative gradient) then the add gate's output would decrease, which in turn makes the multiply gate's output increase.
+다시 위 예를 통해 이것이 어떻게 동작하는지에 대한 직관을 얻자. 덧셈 게이트는 입력 [-2, 5]를 받아 3을 출력한다. 이 게이트는 덧셈 연산을 하고 있기 때문에 두 입력에 대한 게이트의 지역적 기울기 값은 +1이 된다. 회로의 나머지 부분을 통해 최종 출력 값으로 -12가 나온다. 체인룰이 회로를 역으로 가로질러 반복적으로 적용되는 후방 전달 과정 동안, (곱셈 게이트의 입력인) 덧셈 게이트는 출력 값에 대한 기울기 값이 -4였다는 것을 학습한다. 만약 회로가 높은 값을 출력하기를 원하는 것으로 의인화하면 (이는 직관에 도움이 될 수 있다), 이 회로가 덧셈 게이트의 출력 값이 4의 *힘*으로 낮아지길 (음의 부호이기 때문) "원하는" 것으로 볼 수 있다. 반복을 지속하고 기울기 값을 연결하기 위해 덧셈 게이트는 이 기울기 값을 받아들이고 이를 모든 입력들에 대한 지역적 기울기 값에 곱한다 (**x**와 **y**에 대한 기울기 값이 1 * -4 = -4가 되도록). 다음의 원하는 효과가 있다는 사실에 주목하자. 만약 **x,y**가 (음의 기울기 값에 대한 반응으로) 감소한다면, 이 덧셈 게이트의 출력은 감소할 것이고 이는 다시 곱셈 게이트의 출력이 증가하도록 만들 것이다.

-Backpropagation can thus be thought of as gates communicating to each other (through the gradient signal) whether they want their outputs to increase or decrease (and how strongly), so as to make the final output value higher.
+따라서 역전파는 보다 큰 최종 출력 값을 얻도록 게이트들이 자신들의 출력이 (얼마나 강하게) 증가하길 원하는지 또는 감소하길 원하는지 (기울기 신호를 통해) 서로 소통하는 것으로 간주할 수 있다.
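위 단락의 "게이트들이 서로 소통한다"는 직관은 코드로 보면 더 분명해진다. 아래는 각 게이트를 forward/backward 메소드를 가진 객체로 표현한 최소한의 스케치이다. 강의노트의 실제 구현이 아니며 클래스 이름(AddGate, MultiplyGate)은 설명을 위해 임의로 정한 것이다:

~~~python
# 덧셈/곱셈 게이트를 지역적(local) 연산만 아는 객체로 구현한 스케치
class AddGate(object):
    def forward(self, a, b):
        return a + b
    def backward(self, dout):
        # 덧셈의 지역적 기울기는 양쪽 모두 1이므로, 받은 기울기를 그대로 전달
        return dout, dout

class MultiplyGate(object):
    def forward(self, a, b):
        self.a, self.b = a, b  # backward에서 쓰기 위해 입력 값을 기억
        return a * b
    def backward(self, dout):
        # 곱셈의 지역적 기울기는 서로 바뀐 입력 값
        return self.b * dout, self.a * dout

add, mul = AddGate(), MultiplyGate()
q = add.forward(-2, 5)       # q = 3
f = mul.forward(q, -4)       # f = -12
dq, dz = mul.backward(1.0)   # dq = -4.0, dz = 3.0
dx, dy = add.backward(dq)    # dx = dy = -4.0
print dx, dy, dz             # 본문 회로도의 적색 값과 동일
~~~

각 게이트는 전체 회로를 모른 채 자신의 지역적 기울기만 계산하고, 체인룰에 해당하는 `* dout` 곱셈을 통해 위에서 내려온 기울기와 연결된다는 점이 요지이다.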
### Modularity: Sigmoid example From 9c8ce381ea47af69f8f9c893ea0075f51adae051 Mon Sep 17 00:00:00 2001 From: Sanghun Kang Date: Tue, 7 Jun 2016 20:59:29 +0900 Subject: [PATCH 167/199] =?UTF-8?q?replaced=20=EB=B6=88=EB=A6=B0=20with=20?= =?UTF-8?q?=EB=B6=88=EB=A6=AC=EC=96=B8?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- python-numpy-tutorial.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/python-numpy-tutorial.md b/python-numpy-tutorial.md index ea98c3b4..6e9a63b8 100644 --- a/python-numpy-tutorial.md +++ b/python-numpy-tutorial.md @@ -94,7 +94,7 @@ print quicksort([3,6,8,10,1,2,1]) ### 기본 자료형 -다른 프로그래밍 언어들처럼, 파이썬에는 정수, 실수, 불린, 문자열같은 기본 자료형이 있습니다. +다른 프로그래밍 언어들처럼, 파이썬에는 정수, 실수, 불리언, 문자열같은 기본 자료형이 있습니다. 파이썬 기본 자료형 역시 다른 프로그래밍 언어와 유사합니다. **숫자:** 다른 언어와 마찬가지로 파이썬의 정수형(Integers)과 실수형(floats) 데이터 타입 역시 동일한 역할을 합니다 : @@ -120,7 +120,7 @@ print y, y + 1, y * 2, y ** 2 # 출력 "2.5 3.5 5.0 6.25" 파이썬 역시 long 정수형과 복소수 데이터 타입이 구현되어 있습니다. 자세한 사항은 [문서](https://docs.python.org/2/library/stdtypes.html#numeric-types-int-float-long-complex)에서 찾아볼 수 있습니다. -**불린(Booleans):** 파이썬에는 논리 자료형의 모든 연산자들이 구현되어 있습니다. +**불리언(Booleans):** 파이썬에는 논리 자료형의 모든 연산자들이 구현되어 있습니다. 그렇지만 기호(`&&`, `||`, 등.) 대신 영어 단어로 구현되어 있습니다 : ~~~python @@ -585,9 +585,9 @@ print a # 출력 "array([[11, 2, 3], # [10, 21, 12]]) ~~~ -**불린 배열 인덱싱:** -불린 배열 인덱싱을 통해 배열속 요소를 취사 선택할 수 있습니다. -불린 배열 인덱싱은 특정 조건을 만족시키는 요소만 선택하고자 할 때 자주 사용됩니다. +**불리언 배열 인덱싱:** +불리언 배열 인덱싱을 통해 배열속 요소를 취사 선택할 수 있습니다. +불리언 배열 인덱싱은 특정 조건을 만족시키는 요소만 선택하고자 할 때 자주 사용됩니다. 다음은 그 예시입니다: ~~~python @@ -596,7 +596,7 @@ import numpy as np a = np.array([[1,2], [3, 4], [5, 6]]) bool_idx = (a > 2) # 2보다 큰 a의 요소를 찾습니다; - # 이 코드는 a와 shape가 같고 불린 자료형을 요소로 하는 numpy 배열을 반환합니다, + # 이 코드는 a와 shape가 같고 불리언 자료형을 요소로 하는 numpy 배열을 반환합니다, # bool_idx의 각 요소는 동일한 위치에 있는 a의 # 요소가 2보다 큰지를 말해줍니다. @@ -604,7 +604,7 @@ print bool_idx # 출력 "[[False False] # [ True True] # [ True True]]" -# 불린 배열 인덱싱을 통해 bool_idx에서 +# 불리언 배열 인덱싱을 통해 bool_idx에서 # 참 값을 가지는 요소로 구성되는 # rank 1인 배열을 구성할 수 있습니다. print a[bool_idx] # 출력 "[3 4 5 6]" From 19c8f9dacf21dda6b34339c4e1fa9f85df6c279f Mon Sep 17 00:00:00 2001 From: Sanghun Kang Date: Tue, 7 Jun 2016 22:58:11 +0900 Subject: [PATCH 168/199] numpy done --- python-numpy-tutorial.md | 126 ++++++++++++++++++--------------------- 1 file changed, 58 insertions(+), 68 deletions(-) diff --git a/python-numpy-tutorial.md b/python-numpy-tutorial.md index 6e9a63b8..8fcc240b 100644 --- a/python-numpy-tutorial.md +++ b/python-numpy-tutorial.md @@ -763,125 +763,115 @@ for i in range(4): print y ~~~ -위의 방식대로 하면 됩니다; 그러나 'x'가 매우 큰 행렬이라면, 파이썬의 명시적 반복문을 통해 연산을 수행했을때 느릴 수 있습니다. 벡터 'v'를 행렬 'x'의 각 행에 더하는것은 'v'를 여러개 복사해 수직으로 쌓은 행렬 'vv'를 만들고 이 'vv'를 'x'에 더하는것과 동일합니다. 이 과정을 아래의 코드로 구현할 수 있습니다: +위의 방식대로 하면 됩니다; 그러나 'x'가 매우 큰 행렬이라면, 파이썬의 명시적 반복문을 이용한 위 코드는 매우 느려질 수 있습니다. 벡터 'v'를 행렬 'x'의 각 행에 더하는것은 'v'를 여러개 복사해 수직으로 쌓은 행렬 'vv'를 만들고 이 'vv'를 'x'에 더하는것과 동일합니다. 
이 과정을 아래의 코드로 구현할 수 있습니다: ~~~python import numpy as np -# We will add the vector v to each row of the matrix x, -# storing the result in the matrix y +# 벡터 v를 행렬 x의 각 행에 더한 뒤, +# 그 결과를 행렬 y에 저장하고자 합니다 x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]]) v = np.array([1, 0, 1]) -vv = np.tile(v, (4, 1)) # Stack 4 copies of v on top of each other +vv = np.tile(v, (4, 1)) # v의 복사본 4개를 위로 차곡차곡 쌓은게 vv print vv # 출력 "[[1 0 1] - # [1 0 1] - # [1 0 1] - # [1 0 1]]" -y = x + vv # Add x and vv elementwise + # [1 0 1] + # [1 0 1] + # [1 0 1]]" +y = x + vv # x와 vv의 요소별 합 print y # 출력 "[[ 2 2 4 - # [ 5 5 7] - # [ 8 8 10] - # [11 11 13]]" + # [ 5 5 7] + # [ 8 8 10] + # [11 11 13]]" ~~~ -Numpy broadcasting allows us to perform this computation without actually -creating multiple copies of `v`. Consider this version, using broadcasting: +Numpy 브로드캐스팅을 이용한다면 이렇게 v의 복사본을 여러개 만들지 않아도 동일한 연산을 할 수 있습니다. +아래는 브로드캐스팅을 이용한 예시 코드입니다: ~~~python import numpy as np -# We will add the vector v to each row of the matrix x, -# storing the result in the matrix y +# 벡터 v를 행렬 x의 각 행에 더한 뒤, +# 그 결과를 행렬 y에 저장하고자 합니다 x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]]) v = np.array([1, 0, 1]) -y = x + v # Add v to each row of x using broadcasting +y = x + v # 브로드캐스팅을 이용하여 v를 x의 각 행에 더하기 print y # 출력 "[[ 2 2 4] - # [ 5 5 7] - # [ 8 8 10] - # [11 11 13]]" + # [ 5 5 7] + # [ 8 8 10] + # [11 11 13]]" ~~~ -The line `y = x + v` works even though `x` has shape `(4, 3)` and `v` has shape -`(3,)` due to broadcasting; this line works as if `v` actually had shape `(4, 3)`, -where each row was a copy of `v`, and the sum was performed elementwise. +`x`의 shape가 `(4, 3)`이고 `v`의 shape가 `(3,)`라도 브로드캐스팅으로 인해 `y = x + v`는 문제없이 수행됩니다; +이때 'v'는 'v'의 복사본이 차곡차곡 쌓인 shape `(4, 3)`처럼 간주되어 'x'와 동일한 shape가 되며 이들간의 요소별 덧셈연산이 y에 저장됩니다. -Broadcasting two arrays together follows these rules: +두 배열의 브로드캐스팅은 아래의 규칙을 따릅니다: -1. If the arrays do not have the same rank, prepend the shape of the lower rank array - with 1s until both shapes have the same length. -2. The two arrays are said to be *compatible* in a dimension if they have the same - size in the dimension, or if one of the arrays has size 1 in that dimension. -3. The arrays can be broadcast together if they are compatible in all dimensions. -4. After broadcasting, each array behaves as if it had shape equal to the elementwise - maximum of shapes of the two input arrays. -5. In any dimension where one array had size 1 and the other array had size greater than 1, - the first array behaves as if it were copied along that dimension +1. 두 배열이 동일한 rank를 가지고 있지 않다면, 낮은 rank의 1차원 배열이 높은 rank 배열의 shape로 간주됩니다. +2. 특정 차원에서 두 배열이 동일한 크기를 갖거나, 두 배열들 중 하나의 크기가 1이라면 그 두 배열은 특정 차원에서 *compatible*하다고 여겨집니다. +3. 두 행렬이 모든 차원에서 compatible하다면, 브로드캐스팅이 가능합니다. +4. 브로드캐스팅이 이뤄지면, 각 배열 shape의 요소별 최소공배수로 이루어진 shape가 두 배열의 shape로 간주됩니다. +5. 차원에 상관없이 크기가 1인 배열과 1보다 큰 배열이 있을때, 크기가 1인 배열은 자신의 차원수만큼 복사되어 쌓인것처럼 간주된다. + +설명이 이해하기 부족하다면 [scipy문서](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)나 [scipy위키](http://wiki.scipy.org/EricsBroadcastingDoc)를 참조하세요. -If this explanation does not make sense, try reading the explanation -[from the documentation](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html) -or [this explanation](http://wiki.scipy.org/EricsBroadcastingDoc). +브로드캐스팅을 지원하는 함수를 *universal functions*라고 합니다. +*universal functions* 목록은 [문서](http://docs.scipy.org/doc/numpy/reference/ufuncs.html#available-ufuncs)를 참조하세요. -Functions that support broadcasting are known as *universal functions*. 
You can find -the list of all universal functions -[in the documentation](http://docs.scipy.org/doc/numpy/reference/ufuncs.html#available-ufuncs). - -Here are some applications of broadcasting: +브로드캐스팅을 응용한 예시들입니다: ~~~python import numpy as np -# Compute outer product of vectors -v = np.array([1,2,3]) # v has shape (3,) -w = np.array([4,5]) # w has shape (2,) -# To compute an outer product, we first reshape v to be a column -# vector of shape (3, 1); we can then broadcast it against w to yield -# an output of shape (3, 2), which is the outer product of v and w: +# 벡터의 외적을 계산 +v = np.array([1,2,3]) # v의 shape는 (3,) +w = np.array([4,5]) # w의 shape는 (2,) +# 외적을 게산하기 위해, 먼저 v를 shape가 (3,1)인 행벡터로 바꿔야 합니다; +# 그다음 이것을 w에 맞춰 브로드캐스팅한뒤 결과물로 shape가 (3,2)인 행렬을 얻습니다, +# 이 행렬은 v 와 w의 외적의 결과입니다: # [[ 4 5] # [ 8 10] # [12 15]] print np.reshape(v, (3, 1)) * w -# Add a vector to each row of a matrix +# 벡터를 행렬의 각 행에 더하기 x = np.array([[1,2,3], [4,5,6]]) -# x has shape (2, 3) and v has shape (3,) so they broadcast to (2, 3), -# giving the following matrix: +# x는 shape가 (2, 3)이고 v는 shape가 (3,)이므로 이 둘을 브로드캐스팅하면 shape가 (2, 3)인 +# 아래와 같은 행렬이 나옵니다: # [[2 4 6] # [5 7 9]] print x + v -# Add a vector to each column of a matrix -# x has shape (2, 3) and w has shape (2,). -# If we transpose x then it has shape (3, 2) and can be broadcast -# against w to yield a result of shape (3, 2); transposing this result -# yields the final result of shape (2, 3) which is the matrix x with -# the vector w added to each column. Gives the following matrix: +# 벡터를 행렬의 각 행에 더하기 +# x는 shape가 (2, 3)이고 w는 shape가 (2,)입니다. +# x의 전치행렬은 shape가 (3,2)이며 이는 w와 브로드캐스팅이 가능하고 결과로 shape가 (3,2)인 행렬이 생깁니다; +# 이 행렬을 전치하면 shape가 (2,3)인 행렬이 나오며 +# 이는 행렬 x의 각 열에 벡터 w을 더한 결과와 동일합니다. +# 아래의 행렬입니다: # [[ 5 6 7] # [ 9 10 11]] print (x.T + w).T -# Another solution is to reshape w to be a row vector of shape (2, 1); -# we can then broadcast it directly against x to produce the same -# output. +# 다른 방법은 w를 shape가 (2,1)인 열벡터로 변환하는 것입니다; +# 그런다음 이를 바로 x에 브로드캐스팅해 더하면 +# 동일한 결과가 나옵니다. print x + np.reshape(w, (2, 1)) -# Multiply a matrix by a constant: -# x has shape (2, 3). Numpy treats scalars as arrays of shape (); -# these can be broadcast together to shape (2, 3), producing the -# following array: +# 행렬의 스칼라배: +# x 의 shape는 (2, 3)입니다. Numpy는 스칼라를 shape가 ()인 배열로 취급합니다; +# 그렇기에 스칼라 값은 (2,3) shape로 브로드캐스트 될 수 있고, +# 아래와 같은 결과를 만들어 냅니다: # [[ 2 4 6] # [ 8 10 12]] print x * 2 ~~~ -Broadcasting typically makes your code more concise and faster, so you -should strive to use it where possible. +브로드캐스팅은 보통 코드를 간결하고 빠르게 해줍니다, 그러므로 가능하다면 최대한 사용하세요. ### Numpy Documentation -This brief overview has touched on many of the important things that you need to -know about numpy, but is far from complete. Check out the -[numpy reference](http://docs.scipy.org/doc/numpy/reference/) -to find out much more about numpy. +이 문서는 여러분이 numpy에 대해 알아야할 많은 중요한 사항들을 다루지만 완벽하진 않습니다. +numpy에 관한 더 많은 사항은 [numpy 레퍼런스](http://docs.scipy.org/doc/numpy/reference/)를 참조하세요. + ## SciPy Numpy provides a high-performance multidimensional array and basic tools to compute with and manipulate these arrays. 
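참고로, 앞의 numpy 부분에서 정리한 브로드캐스팅 규칙이 실제로 어떻게 적용되는지는 `np.broadcast` 객체로 직접 확인해 볼 수 있습니다. 아래는 이해를 돕기 위한 간단한 확인용 스케치로, 튜토리얼 본문에 있는 코드는 아닙니다:

~~~python
import numpy as np

x = np.ones((4, 3))
v = np.ones(3)
# 실제 연산 없이 브로드캐스팅 결과 shape만 계산해보기
print np.broadcast(x, v).shape  # 출력 "(4, 3)"

w = np.ones(2)
try:
    np.broadcast(x, w)  # (4, 3)과 (2,)는 compatible하지 않음
except ValueError:
    print 'shape (4, 3)과 (2,)는 브로드캐스팅이 불가능합니다'
~~~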
From 3e641af490199eea019daabb5b3962112f4a1fc1 Mon Sep 17 00:00:00 2001 From: Sanghun Kang Date: Wed, 8 Jun 2016 00:08:24 +0900 Subject: [PATCH 169/199] all translation done --- python-numpy-tutorial.md | 150 +++++++++++++++++++-------------------- 1 file changed, 75 insertions(+), 75 deletions(-) diff --git a/python-numpy-tutorial.md b/python-numpy-tutorial.md index 8fcc240b..a11f7b9b 100644 --- a/python-numpy-tutorial.md +++ b/python-numpy-tutorial.md @@ -873,43 +873,41 @@ numpy에 관한 더 많은 사항은 [numpy 레퍼런스](http://docs.scipy.org/ ## SciPy -Numpy provides a high-performance multidimensional array and basic tools to -compute with and manipulate these arrays. -[SciPy](http://docs.scipy.org/doc/scipy/reference/) -builds on this, and provides -a large number of functions that operate on numpy arrays and are useful for -different types of scientific and engineering applications. -The best way to get familiar with SciPy is to -[browse the documentation](http://docs.scipy.org/doc/scipy/reference/index.html). -We will highlight some parts of SciPy that you might find useful for this class. +Numpy는 고성능의 다차원 배열 객체와 이를 다룰 도구를 제공합니다. +numpy를 바탕으로 만들어진 [SciPy](http://docs.scipy.org/doc/scipy/reference/)는, +numpy 배열을 다루는 많은 함수들을 제공하며 다양한 과학, 공학분야에서 유용하게 사용됩니다. + +SciPy에 익숙해지는 최고의 방법은 [SciPy 공식 문서](http://docs.scipy.org/doc/scipy/reference/index.html)를 보는 것입니다. +이 문서에서는 scipy중 cs231n 수업에서 유용하게 쓰일 일부분만을 소개할것입니다. -### Image operations -SciPy provides some basic functions to work with images. -For example, it has functions to read images from disk into numpy arrays, -to write numpy arrays to disk as images, and to resize images. -Here is a simple example that showcases these functions: + +### 이미지 작업 +SciPy는 이미지를 다룰 기본적인 함수들을 제공합니다. +예를들자면, 디스크에 저장된 이미지를 numpy 배열로 읽어들이는 함수가 있으며, +numpy 배열을 디스크에 이미지로 저장하는 함수도 있고, 이미지의 크기를 바꾸는 함수도 있습니다. +이 함수들의 간단한 사용 예시입니다: ~~~python from scipy.misc import imread, imsave, imresize -# Read an JPEG image into a numpy array +# JPEG 이미지를 numpy 배열로 읽어들이기 img = imread('assets/cat.jpg') print img.dtype, img.shape # 출력 "uint8 (400, 248, 3)" -# We can tint the image by scaling each of the color channels -# by a different scalar constant. The image has shape (400, 248, 3); -# we multiply it by the array [1, 0.95, 0.9] of shape (3,); -# numpy broadcasting means that this leaves the red channel unchanged, -# and multiplies the green and blue channels by 0.95 and 0.9 -# respectively. +# 각각의 색깔 채널을 다른 상수값으로 스칼라배함으로써 +# 이미지의 색을 변화시킬수 있습니다. +# 이미지의 shape는 (400, 248, 3)입니다; +# 여기에 shape가 (3,)인 배열 [1, 0.95, 0.9]를 곱합니다; +# numpy 브로드캐스팅에 의해 이 배열이 곱해지며 붉은색 채널은 변하지 않으며, +# 초록색, 파란색 채널에는 각각 0.95, 0.9가 곱해집니다 img_tinted = img * [1, 0.95, 0.9] -# Resize the tinted image to be 300 by 300 pixels. +# 색변경 이미지를 300x300 픽셀로 크기 조절. img_tinted = imresize(img_tinted, (300, 300)) -# Write the tinted image back to disk +# 색변경 이미지를 디스크에 기록하기 imsave('assets/cat_tinted.jpg', img_tinted) ~~~ @@ -917,94 +915,93 @@ imsave('assets/cat_tinted.jpg', img_tinted)
- Left: The original image. - Right: The tinted and resized image. + Left: 원본 이미지. + Right: 색변경 & 크기변경 이미지.
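위에서 소개한 함수들만으로 다른 간단한 변형도 해볼 수 있습니다. 아래는 색 채널 축의 평균을 내서 흑백(그레이스케일) 근사 이미지를 만드는 참고용 스케치입니다. 본문에 없는 예시이며, 저장 경로(`assets/cat_gray.jpg`)는 설명을 위해 임의로 정한 것입니다:

~~~python
from scipy.misc import imread, imsave

img = imread('assets/cat.jpg')  # shape가 (400, 248, 3)인 uint8 배열
gray = img.mean(axis=2)         # 채널 축을 평균내면 shape는 (400, 248)
print gray.shape                # 출력 "(400, 248)"

# imsave는 2차원 배열을 흑백 이미지로 저장합니다 (경로는 임의로 정한 것)
imsave('assets/cat_gray.jpg', gray)
~~~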
-### MATLAB files -The functions `scipy.io.loadmat` and `scipy.io.savemat` allow you to read and -write MATLAB files. You can read about them -[in the documentation](http://docs.scipy.org/doc/scipy/reference/io.html). + +### MATLAB 파일 +`scipy.io.loadmat` 와 `scipy.io.savemat`함수를 통해 +matlab 파일을 읽고 쓸 수 있습니다. +[문서](http://docs.scipy.org/doc/scipy/reference/io.html)를 참조하세요. -### Distance between points -SciPy defines some useful functions for computing distances between sets of points. -The function `scipy.spatial.distance.pdist` computes the distance between all pairs -of points in a given set: +### 두 점 사이의 거리 +SciPy에는 점들간의 거리를 계산하기 위한 유용한 함수들이 정의되어 있습니다. + +`scipy.spatial.distance.pdist`함수는 주어진 점들 사이의 모든 거리를 계산합니다: ~~~python import numpy as np from scipy.spatial.distance import pdist, squareform -# Create the following array where each row is a point in 2D space: +# 각 행이 2차원 공간에서의 한 점을 의미하는 행렬을 생성: # [[0 1] # [1 0] # [2 0]] x = np.array([[0, 1], [1, 0], [2, 0]]) print x -# Compute the Euclidean distance between all rows of x. -# d[i, j] is the Euclidean distance between x[i, :] and x[j, :], -# and d is the following array: +# x가 나타내는 모든 점 사이의 유클리디안 거리를 계산. +# d[i, j]는 x[i, :]와 x[j, :]사이의 유클리디안 거리를 의미하며, +# d는 아래의 행렬입니다: # [[ 0. 1.41421356 2.23606798] # [ 1.41421356 0. 1. ] # [ 2.23606798 1. 0. ]] d = squareform(pdist(x, 'euclidean')) print d ~~~ -You can read all the details about this function -[in the documentation](http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html). +이 함수에 대한 자세한 사항은 [pidst 공식 문서](http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html)를 참조하세요. -A similar function (`scipy.spatial.distance.cdist`) computes the distance between all pairs -across two sets of points; you can read about it -[in the documentation](http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html). +`scipy.spatial.distance.cdist`도 위와 유사하게 점들 사이의 거리를 계산합니다. 자세한 사항은 [cdist 공식 문서](http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html)를 참조하세요. + ## Matplotlib -[Matplotlib](http://matplotlib.org/) is a plotting library. -In this section give a brief introduction to the `matplotlib.pyplot` module, -which provides a plotting system similar to that of MATLAB. +[Matplotlib](http://matplotlib.org/)는 plotting 라이브러리입니다. +이번에는 MATLAB의 plotting 시스템과 유사한 기능을 제공하는 +`matplotlib.pyplot` 모듈에 관한 간략한 소개가 있곘습니다., + ### Plotting -The most important function in matplotlib is `plot`, -which allows you to plot 2D data. Here is a simple example: +matplotlib에서 가장 중요한 함수는 2차원 데이터를 그릴수 있게 해주는 `plot`입니다. +여기 간단한 예시가 있습니다: ~~~python import numpy as np import matplotlib.pyplot as plt -# Compute the x and y coordinates for points on a sine curve +# 사인과 코사인 곡선의 x,y 좌표를 계산 x = np.arange(0, 3 * np.pi, 0.1) y = np.sin(x) -# Plot the points using matplotlib +# matplotlib를 이용해 점들을 그리기 plt.plot(x, y) -plt.show() # You must call plt.show() to make graphics appear. +plt.show() # 그래프를 나타나게 하기 위해선 plt.show()함수를 호출해야만 합니다. ~~~ -Running this code produces the following plot: +이 코드를 실행하면 아래의 그래프가 생성됩니다:
-With just a little bit of extra work we can easily plot multiple lines -at once, and add a title, legend, and axis labels: +약간의 몇가지 추가적인 작업을 통해 여러개의 그래프와, 제목, 범주, 축 이름을 한번에 쉽게 나타낼 수 있습니다: ~~~python import numpy as np import matplotlib.pyplot as plt -# Compute the x and y coordinates for points on sine and cosine curves +# 사인과 코사인 곡선의 x,y 좌표를 계산 x = np.arange(0, 3 * np.pi, 0.1) y_sin = np.sin(x) y_cos = np.cos(x) -# Plot the points using matplotlib +# matplotlib를 이용해 점들을 그리기 plt.plot(x, y_sin) plt.plot(x, y_cos) plt.xlabel('x axis label') @@ -1017,37 +1014,38 @@ plt.show()
-You can read much more about the `plot` function -[in the documentation](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot). +`plot`함수에 관한 더 많은 내용은 [문서](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot)를 참조하세요. + ### Subplots -You can plot different things in the same figure using the `subplot` function. -Here is an example: + +'subplot'함수를 통해 다른 내용들도 동일한 그림위에 나타낼수 있습니다. +여기 간단한 예시가 있습니다: ~~~python import numpy as np import matplotlib.pyplot as plt -# Compute the x and y coordinates for points on sine and cosine curves +# 사인과 코사인 곡선의 x,y 좌표를 계산 x = np.arange(0, 3 * np.pi, 0.1) y_sin = np.sin(x) y_cos = np.cos(x) -# Set up a subplot grid that has height 2 and width 1, -# and set the first such subplot as active. +# 높이가 2이고 너비가 1인 subplot 구획을 설정하고, +# 첫번째 구획을 활성화. plt.subplot(2, 1, 1) -# Make the first plot +# 첫번째 그리기 plt.plot(x, y_sin) plt.title('Sine') -# Set the second subplot as active, and make the second plot. +# 두번째 subplot 구획을 활성화 하고 그리기 plt.subplot(2, 1, 2) plt.plot(x, y_cos) plt.title('Cosine') -# Show the figure. +# 그림 보이기. plt.show() ~~~ @@ -1055,12 +1053,13 @@ plt.show()
-You can read much more about the `subplot` function -[in the documentation](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.subplot). +`subplot`함수에 관한 더 많은 내용은 +[문서](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.subplot)를 참조하세요. -### Images -You can use the `imshow` function to show images. Here is an example: + +### 이미지 +`imshow`함수를 사용해 이미지를 나타낼 수 있습니다. 여기 예시가 있습니다: ~~~python import numpy as np @@ -1070,20 +1069,21 @@ import matplotlib.pyplot as plt img = imread('assets/cat.jpg') img_tinted = img * [1, 0.95, 0.9] -# Show the original image +# 원본 이미지 나타내기 plt.subplot(1, 2, 1) plt.imshow(img) -# Show the tinted image +# 색변화된 이미지 나타내기 plt.subplot(1, 2, 2) -# A slight gotcha with imshow is that it might give strange results -# if presented with data that is not uint8. To work around this, we -# explicitly cast the image to uint8 before displaying it. +# imshow를 이용하며 주의할 점은 데이터의 자료형이 +# uint8이 아니라면 이상한 결과를 보여줄수도 있다는 것입니다. +# 그러므로 이미지를 나타내기 전에 명시적으로 자료형을 uint8로 형변환 해줍니다. + plt.imshow(np.uint8(img_tinted)) plt.show() ~~~
-
+
\ No newline at end of file From bef8b1c0df6e67b427d8af768ee6757455244ef2 Mon Sep 17 00:00:00 2001 From: myungsub Date: Wed, 8 Jun 2016 11:30:44 +0900 Subject: [PATCH 170/199] Tutorial part finished --- index.html | 2 +- linear-classify.md | 2 +- python-numpy-tutorial.md | 75 +++++++++++++++++++++------------------- 3 files changed, 42 insertions(+), 37 deletions(-) diff --git a/index.html b/index.html index 713b0009..e083ac78 100644 --- a/index.html +++ b/index.html @@ -80,7 +80,7 @@ Python / Numpy Tutorial - + Complete!
diff --git a/linear-classify.md b/linear-classify.md index 916e2abf..e02308ac 100644 --- a/linear-classify.md +++ b/linear-classify.md @@ -363,7 +363,7 @@ We now saw one way to take a dataset of images and map each one to class scores -### 추가 읽기 자료료 +### 추가 읽기 자료 These readings are optional and contain pointers of interest. diff --git a/python-numpy-tutorial.md b/python-numpy-tutorial.md index a11f7b9b..dcd5fd37 100644 --- a/python-numpy-tutorial.md +++ b/python-numpy-tutorial.md @@ -24,7 +24,7 @@ Numpy 이 튜토리얼은 [Justin Johnson](http://cs.stanford.edu/people/jcjohns/)에 의해 작성되었습니다. cs231n 수업의 모든 과제에서는 프로그래밍 언어로 파이썬을 사용할 것입니다. -파이썬은 그 자체만으로도 훌륭한 범용 프로그래밍 언어이지만, 몇몇 라이브러리(numpy, scipy, matplotlib)의 도움으로 +파이썬은 그 자체만으로도 훌륭한 범용 프로그래밍 언어이지만, 몇몇 라이브러리(numpy, scipy, matplotlib)의 도움으로 계산과학 분야에서 강력한 개발 환경을 갖추게 됩니다. 많은 분들이 파이썬과 numpy를 경험 해보셨을거라고 생각합니다. 경험 하지 못했을지라도 이 문서를 통해 @@ -63,8 +63,8 @@ cs231n 수업의 모든 과제에서는 프로그래밍 언어로 파이썬을 ## Python -파이썬은 고차원이고, 다중패러다임을 지원하는 동적 프로그래밍 언어입니다. -짧지만 가독성 높은 코드 몇 줄로 수준 높은 아이디어들을 표현할수있기에 파이썬 코드는 거의 수도코드처럼 보인다고도 합니다. +파이썬은 고차원이고, 다중패러다임을 지원하는 동적 프로그래밍 언어입니다. +짧지만 가독성 높은 코드 몇 줄로 수준 높은 아이디어들을 표현할수있기에 파이썬 코드는 거의 수도코드처럼 보인다고도 합니다. 아래는 quicksort알고리즘의 파이썬 구현 예시입니다: ~~~python @@ -76,13 +76,13 @@ def quicksort(arr): middle = [x for x in arr if x == pivot] right = [x for x in arr if x > pivot] return quicksort(left) + middle + quicksort(right) - + print quicksort([3,6,8,10,1,2,1]) # 출력 "[1, 1, 2, 3, 6, 8, 10]" ~~~ ### 파이썬 버전 -현재 파이썬에는 두가지 버전이 있습니다. 파이썬 2.7 그리고 파이썬 3.4입니다. +현재 파이썬에는 두가지 버전이 있습니다. 파이썬 2.7 그리고 파이썬 3.4입니다. 혼란스럽게도, 파이썬3은 기존 파이썬2와 호환되지 않게 변경된 부분이 있습니다. 그러므로 파이썬 2.7로 쓰여진 코드는 3.4환경에서 동작하지 않고 그 반대도 마찬가지입니다. 이 수업에선 파이썬 2.7을 사용합니다. @@ -117,10 +117,10 @@ print y, y + 1, y * 2, y ** 2 # 출력 "2.5 3.5 5.0 6.25" ~~~ 다른 언어들과는 달리, 파이썬에는 증감 단항연상자(`x++`, `x--`)가 없습니다. -파이썬 역시 long 정수형과 복소수 데이터 타입이 구현되어 있습니다. +파이썬 역시 long 정수형과 복소수 데이터 타입이 구현되어 있습니다. 자세한 사항은 [문서](https://docs.python.org/2/library/stdtypes.html#numeric-types-int-float-long-complex)에서 찾아볼 수 있습니다. -**불리언(Booleans):** 파이썬에는 논리 자료형의 모든 연산자들이 구현되어 있습니다. +**불리언(Booleans):** 파이썬에는 논리 자료형의 모든 연산자들이 구현되어 있습니다. 그렇지만 기호(`&&`, `||`, 등.) 대신 영어 단어로 구현되어 있습니다 : ~~~python @@ -130,7 +130,7 @@ print type(t) # 출력 "" print t and f # 논리 AND; 출력 "False" print t or f # 논리 OR; 출력 "True" print not t # 논리 NOT; 출력 "False" -print t != f # 논리 XOR; 출력 "True" +print t != f # 논리 XOR; 출력 "True" ~~~ **문자열:** 파이썬은 문자열과 연관된 다양한 기능을 지원합니다: @@ -158,7 +158,7 @@ print s.replace('l', '(ell)') # 첫번째 인자로 온 문자열을 두번째 # 출력 "he(ell)(ell)o" print ' world '.strip() # 문자열 앞뒤 공백 제거; 출력 "world" ~~~ -모든 문자열 메소드는 [문서](https://docs.python.org/2/library/stdtypes.html#string-methods)에서 찾아볼 수 있습니다. +모든 문자열 메소드는 [문서](https://docs.python.org/2/library/stdtypes.html#string-methods)에서 찾아볼 수 있습니다. @@ -394,18 +394,18 @@ hello('Fred', loud=True) # 출력 "HELLO, FRED!" ~~~python class Greeter(object): - + # 생성자 def __init__(self, name): self.name = name # 인스턴스 변수 선언 - + # 인스턴스 메소드 def greet(self, loud=False): if loud: print 'HELLO, %s!' % self.name.upper() else: print 'Hello, %s' % self.name - + g = Greeter('Fred') # Greeter 클래스의 인스턴스 생성 g.greet() # 인스턴스 메소드 호출; 출력 "Hello, Fred" g.greet(loud=True) # 인스턴스 메소드 호출; 출력 "HELLO, FRED!" @@ -423,10 +423,10 @@ Numpy는 고성능의 다차원 배열 객체와 이를 다룰 도구를 제공 ### 배열 -Numpy 배열은 동일한 자료형을 가지는 값들이 격자판 형태로 있는 것입니다. 각각의 값들은 튜플(이때 튜플은 양의 정수만을 요소값으로 갖습니다.) 형태로 색인됩니다. +Numpy 배열은 동일한 자료형을 가지는 값들이 격자판 형태로 있는 것입니다. 각각의 값들은 튜플(이때 튜플은 양의 정수만을 요소값으로 갖습니다.) 형태로 색인됩니다. *rank*는 배열이 몇차원인지를 의미합니다; *shape*는 는 각 차원의 크기를 알려주는 정수들이 모인 튜플입니다. 
-파이썬의 리스트를 중첩해 Numpy 배열을 초기화 할 수 있고, 대괄호를 통해 각 요소에 접근할 수 있습니다: +파이썬의 리스트를 중첩해 Numpy 배열을 초기화 할 수 있고, 대괄호를 통해 각 요소에 접근할 수 있습니다: ~~~python import numpy as np @@ -451,7 +451,7 @@ import numpy as np a = np.zeros((2,2)) # 모든 값이 0인 배열 생성 print a # 출력 "[[ 0. 0.] # [ 0. 0.]]" - + b = np.ones((1,2)) # 모든 값이 1인 배열 생성 print b # 출력 "[[ 1. 1.]]" @@ -462,7 +462,7 @@ print c # 출력 "[[ 7. 7.] d = np.eye(2) # 2x2 단위 행렬 생성 print d # 출력 "[[ 1. 0.] # [ 0. 1.]]" - + e = np.random.random((2,2)) # 임의의 값으로 채워진 배열 생성 print e # 임의의 값 출력 "[[ 0.91940167 0.08143941] # [ 0.68744134 0.87236687]]" @@ -486,7 +486,7 @@ import numpy as np # [ 9 10 11 12]] a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]]) -# 슬라이싱을 이용하여 첫 두행과 1열,2열로 이루어진 부분배열을 만들어 봅시다; +# 슬라이싱을 이용하여 첫 두행과 1열,2열로 이루어진 부분배열을 만들어 봅시다; # b는 shape가 (2,2)인 배열이 됩니다: # [[2 3] # [6 7]] @@ -597,15 +597,15 @@ a = np.array([[1,2], [3, 4], [5, 6]]) bool_idx = (a > 2) # 2보다 큰 a의 요소를 찾습니다; # 이 코드는 a와 shape가 같고 불리언 자료형을 요소로 하는 numpy 배열을 반환합니다, - # bool_idx의 각 요소는 동일한 위치에 있는 a의 + # bool_idx의 각 요소는 동일한 위치에 있는 a의 # 요소가 2보다 큰지를 말해줍니다. - + print bool_idx # 출력 "[[False False] # [ True True] # [ True True]]" -# 불리언 배열 인덱싱을 통해 bool_idx에서 -# 참 값을 가지는 요소로 구성되는 +# 불리언 배열 인덱싱을 통해 bool_idx에서 +# 참 값을 가지는 요소로 구성되는 # rank 1인 배열을 구성할 수 있습니다. print a[bool_idx] # 출력 "[3 4 5 6]" @@ -811,10 +811,10 @@ print y # 출력 "[[ 2 2 4] 3. 두 행렬이 모든 차원에서 compatible하다면, 브로드캐스팅이 가능합니다. 4. 브로드캐스팅이 이뤄지면, 각 배열 shape의 요소별 최소공배수로 이루어진 shape가 두 배열의 shape로 간주됩니다. 5. 차원에 상관없이 크기가 1인 배열과 1보다 큰 배열이 있을때, 크기가 1인 배열은 자신의 차원수만큼 복사되어 쌓인것처럼 간주된다. - + 설명이 이해하기 부족하다면 [scipy문서](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)나 [scipy위키](http://wiki.scipy.org/EricsBroadcastingDoc)를 참조하세요. -브로드캐스팅을 지원하는 함수를 *universal functions*라고 합니다. +브로드캐스팅을 지원하는 함수를 *universal functions*라고 합니다. *universal functions* 목록은 [문서](http://docs.scipy.org/doc/numpy/reference/ufuncs.html#available-ufuncs)를 참조하세요. 브로드캐스팅을 응용한 예시들입니다: @@ -825,7 +825,7 @@ import numpy as np # 벡터의 외적을 계산 v = np.array([1,2,3]) # v의 shape는 (3,) w = np.array([4,5]) # w의 shape는 (2,) -# 외적을 게산하기 위해, 먼저 v를 shape가 (3,1)인 행벡터로 바꿔야 합니다; +# 외적을 게산하기 위해, 먼저 v를 shape가 (3,1)인 행벡터로 바꿔야 합니다; # 그다음 이것을 w에 맞춰 브로드캐스팅한뒤 결과물로 shape가 (3,2)인 행렬을 얻습니다, # 이 행렬은 v 와 w의 외적의 결과입니다: # [[ 4 5] @@ -843,15 +843,15 @@ print x + v # 벡터를 행렬의 각 행에 더하기 # x는 shape가 (2, 3)이고 w는 shape가 (2,)입니다. -# x의 전치행렬은 shape가 (3,2)이며 이는 w와 브로드캐스팅이 가능하고 결과로 shape가 (3,2)인 행렬이 생깁니다; -# 이 행렬을 전치하면 shape가 (2,3)인 행렬이 나오며 +# x의 전치행렬은 shape가 (3,2)이며 이는 w와 브로드캐스팅이 가능하고 결과로 shape가 (3,2)인 행렬이 생깁니다; +# 이 행렬을 전치하면 shape가 (2,3)인 행렬이 나오며 # 이는 행렬 x의 각 열에 벡터 w을 더한 결과와 동일합니다. # 아래의 행렬입니다: # [[ 5 6 7] # [ 9 10 11]] print (x.T + w).T # 다른 방법은 w를 shape가 (2,1)인 열벡터로 변환하는 것입니다; -# 그런다음 이를 바로 x에 브로드캐스팅해 더하면 +# 그런다음 이를 바로 x에 브로드캐스팅해 더하면 # 동일한 결과가 나옵니다. print x + np.reshape(w, (2, 1)) @@ -896,7 +896,7 @@ from scipy.misc import imread, imsave, imresize img = imread('assets/cat.jpg') print img.dtype, img.shape # 출력 "uint8 (400, 248, 3)" -# 각각의 색깔 채널을 다른 상수값으로 스칼라배함으로써 +# 각각의 색깔 채널을 다른 상수값으로 스칼라배함으로써 # 이미지의 색을 변화시킬수 있습니다. # 이미지의 shape는 (400, 248, 3)입니다; # 여기에 shape가 (3,)인 배열 [1, 0.95, 0.9]를 곱합니다; @@ -923,7 +923,7 @@ imsave('assets/cat_tinted.jpg', img_tinted) ### MATLAB 파일 -`scipy.io.loadmat` 와 `scipy.io.savemat`함수를 통해 +`scipy.io.loadmat` 와 `scipy.io.savemat`함수를 통해 matlab 파일을 읽고 쓸 수 있습니다. [문서](http://docs.scipy.org/doc/scipy/reference/io.html)를 참조하세요. @@ -961,7 +961,7 @@ print d ## Matplotlib -[Matplotlib](http://matplotlib.org/)는 plotting 라이브러리입니다. +[Matplotlib](http://matplotlib.org/)는 plotting 라이브러리입니다. 
이번에는 MATLAB의 plotting 시스템과 유사한 기능을 제공하는 `matplotlib.pyplot` 모듈에 관한 간략한 소개가 있곘습니다., @@ -975,7 +975,7 @@ matplotlib에서 가장 중요한 함수는 2차원 데이터를 그릴수 있 import numpy as np import matplotlib.pyplot as plt -# 사인과 코사인 곡선의 x,y 좌표를 계산 +# 사인과 코사인 곡선의 x,y 좌표를 계산 x = np.arange(0, 3 * np.pi, 0.1) y = np.sin(x) @@ -996,7 +996,7 @@ plt.show() # 그래프를 나타나게 하기 위해선 plt.show()함수를 호 import numpy as np import matplotlib.pyplot as plt -# 사인과 코사인 곡선의 x,y 좌표를 계산 +# 사인과 코사인 곡선의 x,y 좌표를 계산 x = np.arange(0, 3 * np.pi, 0.1) y_sin = np.sin(x) y_cos = np.cos(x) @@ -1027,7 +1027,7 @@ plt.show() import numpy as np import matplotlib.pyplot as plt -# 사인과 코사인 곡선의 x,y 좌표를 계산 +# 사인과 코사인 곡선의 x,y 좌표를 계산 x = np.arange(0, 3 * np.pi, 0.1) y_sin = np.sin(x) y_cos = np.cos(x) @@ -1076,7 +1076,7 @@ plt.imshow(img) # 색변화된 이미지 나타내기 plt.subplot(1, 2, 2) -# imshow를 이용하며 주의할 점은 데이터의 자료형이 +# imshow를 이용하며 주의할 점은 데이터의 자료형이 # uint8이 아니라면 이상한 결과를 보여줄수도 있다는 것입니다. # 그러므로 이미지를 나타내기 전에 명시적으로 자료형을 uint8로 형변환 해줍니다. @@ -1086,4 +1086,9 @@ plt.show()
-
\ No newline at end of file +
+ +--- +

+번역: 강상훈 (sanghkaang) +

From 9d94123d207e3bcf221b55cbdd8eeb33d3f9f7ed Mon Sep 17 00:00:00 2001 From: Dongkyu Kim Date: Wed, 8 Jun 2016 14:27:53 +0900 Subject: [PATCH 171/199] =?UTF-8?q?=EC=9A=A9=EC=96=B4=20=ED=86=B5=EC=9D=BC?= =?UTF-8?q?=20=EB=B0=8F=20'=EC=8B=9C=EA=B7=B8=EB=AA=A8=EC=9D=B4=EB=93=9C?= =?UTF-8?q?=20=EC=98=88=EC=A0=9C'=20=EC=84=B9=EC=85=98=20=EC=9E=91?= =?UTF-8?q?=EC=84=B1=20=EC=8B=9C=EC=9E=91?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- optimization-2.md | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/optimization-2.md b/optimization-2.md index 55050d7f..7cff8932 100644 --- a/optimization-2.md +++ b/optimization-2.md @@ -9,7 +9,7 @@ Table of Contents: - [그라디언트(Gradient)에 대한 간단한 표현과 이해](#grad) - [복합 표현식(Compound Expression), 체인룰(chain rule), 역전파(Backpropagation)](#backprop) - [역전파(Backpropation)에 대한 직관적인 이해](#intuitive) -- [모듈성 : 시그모이(Sigmoid)드 예제](#sigmoid) +- [모듈성 : 시그모이드(Sigmoid) 예제](#sigmoid) - [역전파(Backprop) 실제: 단계별 계산](#staged) - [역박향 흐름의 패턴](#patters) - [벡터 기반의 그라디언트(Gradient) 계산)](#mat) @@ -44,7 +44,7 @@ $$ \frac{df(x)}{dx} = \lim_{h\ \to 0} \frac{f(x + h) - f(x)}{h} $$ -위에 수식을 기술적인 관점에서 보면, 왼쪽에 있는 분수 기호(가로바)는 오른쪽 분수 기호와 달리 나누기를 뜻하지는 않는다. 대신 연산자 $$ \frac{d}{dx} $$ 가 함수 $$f$$에 적용 되어 미분 된 함수를 의미 하는 것이다. 위의 수식을 이해하는 가장 좋은 방법은 $$h$$가 매우 작으면 함수 $$f$$는 직선으로 근사(Approximated) 될 수 있고, 미분 값은 그 직선의 기울기를 뜻한다. 다시말해, 만약 $$x = 4, y = -3$$ 이면 $$f(x,y) = -12$$ 가 되고, $$x$$에 대한 편미분 값은 $$x$$ $$\frac{\partial f}{\partial x} = -3$$ 으로 얻어진다. 이말은 즉슨, 우리가 x를 아주 조금 증가 시키면 전체 함수 값은 3배로 작아진다는 의미이다. (미분 값이 음수이므로). 이 것은 위의 수식을 재구성하면 이와 같이 간단히 보여 줄 수 있다 ( $$ f(x + h) = f(x) + h \frac{df(x)}{dx} $$ ).비슷하게, $$\frac{\partial f}{\partial y} = 4$$, 이므로, $$y$$ 값을 아주 작은 $$h$$ 만큼 증가 시킨다면 $$4h$$ 만큼 전체 함수 값은 증가하게 될 것이다. (이번에는 미분 값이 양수) +위에 수식을 기술적인 관점에서 보면, 왼쪽에 있는 분수 기호(가로바)는 오른쪽 분수 기호와 달리 나누기를 뜻하지는 않는다. 대신 연산자 $$ \frac{d}{dx} $$ 가 함수 $$f$$에 적용 되어 미분 된 함수를 의미 하는 것이다. 위의 수식을 이해하는 가장 좋은 방법은 $$h$$가 매우 작으면 함수 $$f$$는 직선으로 근사(Approximated) 될 수 있고, 미분 값은 그 직선의 기울기를 뜻한다. 다시말해, 만약 $$x = 4, y = -3$$ 이면 $$f(x,y) = -12$$ 가 되고, $$x$$에 대한 편미분 값은 $$x$$ $$\frac{\partial f}{\partial x} = -3$$ 으로 얻어진다. 이말은 즉슨, 우리가 x를 아주 조금 증가 시키면 전체 함수 값은 3배로 작아진다는 의미이다. (미분 값이 음수이므로). 이 것은 위의 수식을 재구성하면 이와 같이 간단히 보여 줄 수 있다 ( $$ f(x + h) = f(x) + h \frac{df(x)}{dx} $$ ). 비슷하게, $$\frac{\partial f}{\partial y} = 4$$, 이므로, $$y$$ 값을 아주 작은 $$h$$ 만큼 증가 시킨다면 $$4h$$ 만큼 전체 함수 값은 증가하게 될 것이다. (이번에는 미분 값이 양수) > 미분은 각 변수가 해당 값에서 전체 함수(Expression)의 결과 값에 영향을 미치는 민감도와 같은 개념이다. @@ -68,7 +68,7 @@ $$ ### 체인룰(chain rule)을 이용한 복합 표현식 -이제 $f(x,y,z) = (x + y) z$ 같은 다수의 복합 함수(composed functions)를 수반하는 더 복잡한 표현식을 고려해보자. 이 표현식은 여전히 바로 미분하기에 충분히 간단하지만, 우리는 이 식에 특별한 접근법을 적용할 것이다. 이는 역전파 뒤에 있는 직관을 이해하는데 도움이 될 것이다. 특히 이 식이 두 개의 표현식 $q = x + y$와 $f = q z$ 으로 분해될 수 있음에 주목하자. 게다가 이전 섹션에서 본 것처럼 우리는 두 식에 대한 미분값을 어떻게 따로따로 계산할지 알고 있다. $f$ 는 단지 $q$와 $z$의 곱이다. 따라서 $\frac{\partial f}{\partial q} = z, \frac{\partial f}{\partial z} = q$, 그리고 $q$는 $x$와 $y$의 합이므로 $\frac{\partial q}{\partial x} = 1, \frac{\partial q}{\partial y} = 1$이다. 하지만, 중간결과값인 $q$에 대한 기울기($\frac{\partial f}{\partial q}$)를 신경쓸 필요가 없다. 대신 궁극적으로 입력 $x,y,z$에 대한 $f$의 기울기에 관심이 있다. **체인룰**은 이러한 기울기 표현식들을 함께 연결시키는 적절한 방법이 곱하는 것이라는 것을 보여준다. 예를 들면, $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x} $와 같이 표현할 수 있다. 실제로 이는 단순히 두 기울기값을 담고 있는 두 수의 곱셈이다. 하나의 예를 통해 확인 해보자. +이제 $f(x,y,z) = (x + y) z$ 같은 다수의 복합 함수(composed functions)를 수반하는 더 복잡한 표현식을 고려해보자. 이 표현식은 여전히 바로 미분하기에 충분히 간단하지만, 우리는 이 식에 특별한 접근법을 적용할 것이다. 
이는 역전파 뒤에 있는 직관을 이해하는데 도움이 될 것이다. 특히 이 식이 두 개의 표현식 $q = x + y$와 $f = q z$ 으로 분해될 수 있음에 주목하자. 게다가 이전 섹션에서 본 것처럼 우리는 두 식에 대한 미분값을 어떻게 따로따로 계산할지 알고 있다. $f$ 는 단지 $q$와 $z$의 곱이다. 따라서 $\frac{\partial f}{\partial q} = z, \frac{\partial f}{\partial z} = q$, 그리고 $q$는 $x$와 $y$의 합이므로 $\frac{\partial q}{\partial x} = 1, \frac{\partial q}{\partial y} = 1$이다. 하지만, 중간 결과값인 $q$에 대한 그라디언트($\frac{\partial f}{\partial q}$)를 신경쓸 필요가 없다. 대신 궁극적으로 입력 $x,y,z$에 대한 $f$의 그라디언트에 관심이 있다. **체인룰**은 이러한 그라디언트 표현식들을 함께 연결시키는 적절한 방법이 곱하는 것이라는 것을 보여준다. 예를 들면, $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x} $와 같이 표현할 수 있다. 실제로 이는 단순히 두 그라디언트를 담고 있는 두 수의 곱셈이다. 하나의 예를 통해 확인 해보자. ~~~python # set some inputs @@ -87,7 +87,7 @@ dfdx = 1.0 * dfdq # dq/dx = 1. And the multiplication here is the chain rule! dfdy = 1.0 * dfdq # dq/dy = 1 ~~~ -결국 `[dfdx,dfdy,dfdz]` 변수들로 기울기가 표현되는데, 이는 `f`에 대한 변수 `x,y,z`의 민감도(sensitivity)를 보여준다. 이는 역전파의 가장 간단한 예이다. 더 나아가서 보다 간결한 표기법을 사용해서 `df` 파트를 계속 쓸 필요가 없도록 하고 싶을 것이다. 예를 들어 `dfdq` 대신에 단순히 `dq`를 쓰고 항상 기울기가 최종 출력에 관한 것이라 가정하는 것이다. +결국 `[dfdx,dfdy,dfdz]` 변수들로 그라디언트가 표현되는데, 이는 `f`에 대한 변수 `x,y,z`의 민감도(sensitivity)를 보여준다. 이는 역전파의 가장 간단한 예이다. 더 나아가서 보다 간결한 표기법을 사용해서 `df` 파트를 계속 쓸 필요가 없도록 하고 싶을 것이다. 예를 들어 `dfdq` 대신에 단순히 `dq`를 쓰고 항상 그라디언트가 최종 출력에 관한 것이라 가정하는 것이다. 또한 이런 계산은 회로도를 가지고 다음과 같이 멋지게 시각화할 수 있다: @@ -95,7 +95,7 @@ dfdy = 1.0 * dfdq # dq/dy = 1 -2-4x5-4y-43z3-4q+-121f*
- 좌측에 실수 값으로 표현되는 "회로"는 이 계산에 대한 시각 표현을 보여준다. 전방 전달(forward pass)은 입력부터 출력까지 값을 계산한다 (녹색으로 표시). 그리고 나서 후방 전달(backward pass)는 역전파를 수행하는데, 이는 끝에서 시작해서 반복적으로 체인 룰을 적용해 회로 입력에 대한 모든 길에서 기울기 값 (적색으로 표시) 을 계산한다. 기울기 값은 회로를 통해 거꾸로 흐르는 것으로 볼 수 있다. + 좌측에 실수 값으로 표현되는 "회로"는 이 계산에 대한 시각 표현을 보여준다. 전방 전달(forward pass)은 입력부터 출력까지 값을 계산한다 (녹색으로 표시). 그리고 나서 후방 전달(backward pass)는 역전파를 수행하는데, 이는 끝에서 시작해서 반복적으로 체인 룰을 적용해 회로 입력에 대한 모든 길에서 그라디언트 값 (적색으로 표시) 을 계산한다. 그라디언트 값은 회로를 통해 거꾸로 흐르는 것으로 볼 수 있다.
@@ -103,24 +103,24 @@ dfdy = 1.0 * dfdq # dq/dy = 1 ### 역전파(backpropagation)에 대한 직관적 이해 -역전파가 굉장히 지역적인(local) 프로세스임에 주목하자. 회로도 내의 모든 게이트(gate) 몇개의 입력을 받아드리고 곧 바로 두 가지를 계산할 수 있다: 1. 게이트의 출력 값, 2. 게이트 출력에 대한 입력들의 *지역적* 기울기 값. 여기서 게이트들이 포함된 전체 회로의 세세한 부분을 모르더라도 완전히 독립적으로 값들을 계산할 수 있음을 주목하라. 하지만, 일단 전방 전달이 끝나면 역전파 과정에서 게이트는 결국 전체 회로의 마지막 출력에 대한 게이트 출력의 기울기 값에 관해 학습할 것이다. 체인룰을 통해 게이트는 이 기울기 값을 받아들여 모든 입력에 대해서 계산한 게이트의 모든 기울기 값에 곱한다. +역전파가 굉장히 지역적인(local) 프로세스임에 주목하자. 회로도 내의 모든 게이트(gate) 몇개의 입력을 받아드리고 곧 바로 두 가지를 계산할 수 있다: 1. 게이트의 출력 값, 2. 게이트 출력에 대한 입력들의 *지역적* 그라디언트 값. 여기서 게이트들이 포함된 전체 회로의 세세한 부분을 모르더라도 완전히 독립적으로 값들을 계산할 수 있음을 주목하라. 하지만, 일단 전방 전달이 끝나면 역전파 과정에서 게이트는 결국 전체 회로의 마지막 출력에 대한 게이트 출력의 그라디언트 값에 관해 학습할 것이다. 체인룰을 통해 게이트는 이 그라디언트 값을 받아들여 모든 입력에 대해서 계산한 게이트의 모든 그라디언트 값에 곱한다. > 체인룰 덕분에 이러한 각 입력에 대한 추가 곱셈은 전체 신경망과 같은 복잡한 회로에서 상대적으로 쓸모 없는 개개의 게이트를 중요하지 않은 것으로 바꿀 수 있다. -다시 위 예를 통해 이것이 어떻게 동작하는지에 대한 직관을 얻자. 덧셈 게이트는 입력 [-2, 5]를 받아 3을 출력한다. 이 게이트는 덧셈 연산을 하고 있기 때문에 두 입력에 대한 게이트의 지역적 기울기 값은 +1이 된다. 회로의 나머지 부분을 통해 최종 출력 값으로 -12가 나온다. 체인룰이 회로를 역으로 가로질러 반복적으로 적용되는 후방 전달 과정 동안, (곱셈 게이트의 입력인) 덧셈 게이트는 출력 값에 대한 기울기 값이 -4였다는 것을 학습한다. 만약 회로가 높은 값을 출력하기를 원하는 것으로 의인화하면 (이는 직관에 도움이 될 수 있다), 이 회로가 덧셈 게이트의 출력 값이 4의 *힘*으로 낮아지길 (음의 부호이기 때문) "원하는" 것으로 볼 수 있다. 반복을 지속하고 기울기 값을 연결하기 위해 덧셈 게이트는 이 기울기 값을 받아들이고 이를 모든 입력들에 대한 지역적 기울기 값에 곱한다 (**x**와 **y**에 대한 기울기 값이 1 * -4 = -4가 되도록). 다음의 원하는 효과가 있다는 사실에 주목하자. 만약 **x,y**가 (음의 기울기 값에 대한 반응으로) 감소한다면, 이 덧셈 게이트의 출력은 감소할 것이고 이는 다시 곱셈 게이트의 출력이 증가하도록 만들 것이다. +다시 위 예를 통해 이것이 어떻게 동작하는지에 대한 직관을 얻자. 덧셈 게이트는 입력 [-2, 5]를 받아 3을 출력한다. 이 게이트는 덧셈 연산을 하고 있기 때문에 두 입력에 대한 게이트의 지역적 그라디언트 값은 +1이 된다. 회로의 나머지 부분을 통해 최종 출력 값으로 -12가 나온다. 체인룰이 회로를 역으로 가로질러 반복적으로 적용되는 후방 전달 과정 동안, (곱셈 게이트의 입력인) 덧셈 게이트는 출력 값에 대한 그라디언트 값이 -4였다는 것을 학습한다. 만약 회로가 높은 값을 출력하기를 원하는 것으로 의인화하면 (이는 직관에 도움이 될 수 있다), 이 회로가 덧셈 게이트의 출력 값이 4의 *힘*으로 낮아지길 (음의 부호이기 때문) "원하는" 것으로 볼 수 있다. 반복을 지속하고 그라디언트 값을 연결하기 위해 덧셈 게이트는 이 그라디언트 값을 받아들이고 이를 모든 입력들에 대한 지역적 그라디언트 값에 곱한다 (**x**와 **y**에 대한 그라디언트 값이 1 * -4 = -4가 되도록). 다음의 원하는 효과가 있다는 사실에 주목하자. 만약 **x,y**가 (음의 그라디언트 값에 대한 반응으로) 감소한다면, 이 덧셈 게이트의 출력은 감소할 것이고 이는 다시 곱셈 게이트의 출력이 증가하도록 만들 것이다. 따라서 역전파는 보다 큰 최종 출력 값을 얻도록 게이트들이 자신들의 출력이 (얼마나 강하게) 증가하길 원하는지 또는 감소하길 원하는지 서로 소통하는 것으로 간주할 수 있다. -### Modularity: Sigmoid example +### 모듈성: 시그모이드(Sigmoid) 예제 -The gates we introduced above are relatively arbitrary. Any kind of differentiable function can act as a gate, and we can group multiple gates into a single gate, or decompose a function into multiple gates whenever it is convenient. Lets look at another expression that illustrates this point: +위에서 본 게이트들은 상대적으로 임의로 선택된 것이다. 어떤 종류의 함수도 미분가능하다면 게이트로서 역할을 할 수 있다. 필요한 경우 여러 개의 게이트를 그룹지어서 하나의 게이트로 만들거나, 하나의 함수를 여러 개의 게이트로 분해할 수도 있다. 이러한 요점을 보여주는 다른 표현식을 살펴보자: $$ f(w,x) = \frac{1}{1+e^{-(w_0x_0 + w_1x_1 + w_2)}} $$ -as we will see later in the class, this expression describes a 2-dimensional neuron (with inputs **x** and weights **w**) that uses the *sigmoid activation* function. But for now lets think of this very simply as just a function from inputs *w,x* to a single number. The function is made up of multiple gates. In addition to the ones described already above (add, mul, max), there are four more: +나중에 다른 수업에서 보겠지만, 이 표현식은 *시그모이드 활성* 함수를 사용하는 2차원 뉴런(입력 **x**와 가중치 **w**를 갖는)을 나타낸다. 그러나 지금은 이를 매우 단순하게 *w, x*를 입력으로 받아 하나의 단일 숫자를 출력하는 하나의 함수정도로 생각하자. 이 함수는 여러개의 게이트로 구성된다. 
위에서 이미 설명한 게이트들(덧셈, 곱셈, 최대)에 더해 네 종류의 게이트가 더 있다: $$ f(x) = \frac{1}{x} @@ -140,7 +140,7 @@ f_a(x) = ax \frac{df}{dx} = a $$ -Where the functions $f_c, f_a$ translate the input by a constant of $c$ and scale the input by a constant of $a$, respectively. These are technically special cases of addition and multiplication, but we introduce them as (new) unary gates here since we do need the gradients for the constants. $c,a$. The full circuit then looks as follows: +여기서 $f_c, f_a$는 각각 입력을 상수 $c$만큼 이동시키고, 상수 $a$만큼 크기를 조정하는 함수이다. 이 함수들은 덧셈과 곰셈의 기술적으로 특별한 경우에 해당하지만, 여기서는 상수 $c,a$에 대한 그라디언트가 필요한 것이기에 (새로운) 단일 게이트로써 소개하고자 한다. 그러면 전체 회로는 다음과 같이 나타난다.
2.00-0.20w0-1.000.39x0-3.00-0.39w1-2.00-0.59x1-3.000.20w2-2.000.20*6.000.20*4.000.20+1.000.20+-1.00-0.20*-10.37-0.53exp1.37-0.53+10.731.001/x From 0a652505c60469ea5d059f2a7b4b636b1fd567af Mon Sep 17 00:00:00 2001 From: Dongkyu Kim Date: Wed, 8 Jun 2016 14:44:50 +0900 Subject: [PATCH 172/199] =?UTF-8?q?=EC=9A=A9=EC=96=B4=20=ED=86=B5=EC=9D=BC?= =?UTF-8?q?=20(=EC=97=AD=EC=A0=84=ED=8C=8C,=20=EC=B2=B4=EC=9D=B8=EB=A3=B0)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- optimization-2.md | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/optimization-2.md b/optimization-2.md index 7cff8932..23d00695 100644 --- a/optimization-2.md +++ b/optimization-2.md @@ -7,10 +7,10 @@ Table of Contents: - [소개(Introduction)](#intro) - [그라디언트(Gradient)에 대한 간단한 표현과 이해](#grad) -- [복합 표현식(Compound Expression), 체인룰(chain rule), 역전파(Backpropagation)](#backprop) -- [역전파(Backpropation)에 대한 직관적인 이해](#intuitive) +- [복합 표현식(Compound Expression), 연쇄 법칙(Chain rule), Backpropagation](#backprop) +- [Backpropation에 대한 직관적인 이해](#intuitive) - [모듈성 : 시그모이드(Sigmoid) 예제](#sigmoid) -- [역전파(Backprop) 실제: 단계별 계산](#staged) +- [Backprop 실제: 단계별 계산](#staged) - [역박향 흐름의 패턴](#patters) - [벡터 기반의 그라디언트(Gradient) 계산)](#mat) - [요약](#summary) @@ -19,14 +19,14 @@ Table of Contents: ### Introduction -**Motivation**. 이번 섹션에서 우리는 **역전파(Backpropagation)**에 대한 직관적인 이해를 바탕으로 전문지식을 더 키우고자 한다. Backpropagation은 Network 전체에 대해 반복적인 **체인룰(Chain rule)**을 적용하여 그라디언트(Gradient)를 계산하는 방법 중 하나이다. Backpropagation 과정과 세부 요소들에 대한 이해는 여러분에게 있어서 신경망을 효과적으로 개발하고, 디자인하고 디버그하는 데 중요하다고 볼 수 있다. +**Motivation**. 이번 섹션에서 우리는 **Backpropagation**에 대한 직관적인 이해를 바탕으로 전문지식을 더 키우고자 한다. Backpropagation은 네트워크 전체에 대해 반복적인 **연쇄 법칙(Chain rule)**을 적용하여 그라디언트(Gradient)를 계산하는 방법 중 하나이다. Backpropagation 과정과 세부 요소들에 대한 이해는 여러분에게 있어서 신경망을 효과적으로 개발하고, 디자인하고 디버그하는 데 중요하다고 볼 수 있다. **Problem statement**. 이번 섹션에서 공부할 핵심 문제는 다음과 같다 : 주어진 함수 $$f(x)$$ 가 있고, $$x$$ 는 입력 값으로 이루어진 벡터이고, 주어진 입력 $$x$$에 대해서 함수 $$f$$의 그라디언트를 계산하고자 한다. (i.e. $$\nabla f(x)$$ ). -**Motivation**. 우리가 이 문제에 관심을 기울이는 이유에 대해 Neural Network관점에서 좀더 구체적으로 살펴 보자. $$f$$는 Loss 함수 ( $$L$$ ) 에 해당하고 입력 값 $$x$$ 는 학습 데이터(Training data)와 Neural Network의 Weight라고 볼 수 있다. 예를 들면, Loss는 SVM Loss 함수가 될 수 있고, 입력 값은 학습 데이터 $$(x_i,y_i), i=1 \ldots N$$ 와 Weight, Bias $$W,b$$ 으로 볼 수 있다. 여기서 학습데이터는 미리 주어져서 고정 되어있는 값으로 볼 수 있고 (보통의 기계 학습에서 그러하듯..), Weight는 Neural Network의 학습을 위해 실제로 컨트롤 하는 값이다. 따라서 입력 값 $$x_i$$ 에 대한 그라디언트 계산이 쉬울지라도, 실제로는 파라미터(Parameter, Neural Network의 Weight) 값에 대한 Gradient를 일반적으로 계산하고, Gradient값을 활용하여 Parameter를 업데이트 할 수 있다. 하지만, Neural Network이 어떻게 작동하는지 해석하고, 시각화 하는 부분에서 입력 값 $x_i$에 대한 Gradient도 유용하게 활용 될 수 있는데, 이 부분은 본 강의의 뒷부분에 다룰 예정이다. +**Motivation**. 우리가 이 문제에 관심을 기울이는 이유에 대해 신경망 관점에서 좀더 구체적으로 살펴 보자. $$f$$는 손실 함수 ( $$L$$ ) 에 해당하고 입력 값 $$x$$ 는 학습 데이터(Training data)와 신경망의 Weight라고 볼 수 있다. 예를 들면, 손실 함수는 SVM Loss 함수가 될 수 있고, 입력 값은 학습 데이터 $$(x_i,y_i), i=1 \ldots N$$ 와 Weight, Bias $$W,b$$ 으로 볼 수 있다. 여기서 학습데이터는 미리 주어져서 고정 되어있는 값으로 볼 수 있고 (보통의 기계 학습에서 그러하듯..), Weight는 신경망의 학습을 위해 실제로 컨트롤 하는 값이다. 따라서 입력 값 $$x_i$$ 에 대한 그라디언트 계산이 쉬울지라도, 실제로는 파라미터(Parameter) 값에 대한 그라디언트를 일반적으로 계산하고, 그라디언트 값을 활용하여 파라미터를 업데이트 할 수 있다. 하지만, 신경망이 어떻게 작동하는지 해석하고, 시각화 하는 부분에서 입력 값 $x_i$에 대한 그라디언트도 유용하게 활용 될 수 있는데, 이 부분은 본 강의의 뒷부분에 다룰 예정이다. -여러분이 이미 Chain Rule을 통해 Gradient를 도출하는데 익숙하더라도 이 섹션을 간략히 훑어보기를 권장한다. 왜냐하면 이 섹션에서는 다른데서는 보기 힘든 Backpropagation에 대한 실제 숫자를 활용한 역방향 흐름(Backward Flow)에 대해 설명을 할 것이고, 이를 통해 여러분이 얻게 될 통찰력은 이번 강의 전체에 있어 도움이 될 것이라 생각하기 때문이다. 
+여러분이 이미 연쇄 법칙을 통해 그라디언트를 도출하는데 익숙하더라도 이 섹션을 간략히 훑어보기를 권장한다. 왜냐하면 이 섹션에서는 다른데서는 보기 힘든 Backpropagation에 대한 실제 숫자를 활용한 역방향 흐름(Backward Flow)에 대해 설명을 할 것이고, 이를 통해 여러분이 얻게 될 통찰력은 이번 강의 전체에 있어 도움이 될 것이라 생각하기 때문이다. @@ -68,7 +68,7 @@ $$ ### 체인룰(chain rule)을 이용한 복합 표현식 -이제 $f(x,y,z) = (x + y) z$ 같은 다수의 복합 함수(composed functions)를 수반하는 더 복잡한 표현식을 고려해보자. 이 표현식은 여전히 바로 미분하기에 충분히 간단하지만, 우리는 이 식에 특별한 접근법을 적용할 것이다. 이는 역전파 뒤에 있는 직관을 이해하는데 도움이 될 것이다. 특히 이 식이 두 개의 표현식 $q = x + y$와 $f = q z$ 으로 분해될 수 있음에 주목하자. 게다가 이전 섹션에서 본 것처럼 우리는 두 식에 대한 미분값을 어떻게 따로따로 계산할지 알고 있다. $f$ 는 단지 $q$와 $z$의 곱이다. 따라서 $\frac{\partial f}{\partial q} = z, \frac{\partial f}{\partial z} = q$, 그리고 $q$는 $x$와 $y$의 합이므로 $\frac{\partial q}{\partial x} = 1, \frac{\partial q}{\partial y} = 1$이다. 하지만, 중간 결과값인 $q$에 대한 그라디언트($\frac{\partial f}{\partial q}$)를 신경쓸 필요가 없다. 대신 궁극적으로 입력 $x,y,z$에 대한 $f$의 그라디언트에 관심이 있다. **체인룰**은 이러한 그라디언트 표현식들을 함께 연결시키는 적절한 방법이 곱하는 것이라는 것을 보여준다. 예를 들면, $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x} $와 같이 표현할 수 있다. 실제로 이는 단순히 두 그라디언트를 담고 있는 두 수의 곱셈이다. 하나의 예를 통해 확인 해보자. +이제 $f(x,y,z) = (x + y) z$ 같은 다수의 복합 함수(composed functions)를 수반하는 더 복잡한 표현식을 고려해보자. 이 표현식은 여전히 바로 미분하기에 충분히 간단하지만, 우리는 이 식에 특별한 접근법을 적용할 것이다. 이는 backpropagation 뒤에 있는 직관을 이해하는데 도움이 될 것이다. 특히 이 식이 두 개의 표현식 $q = x + y$와 $f = q z$ 으로 분해될 수 있음에 주목하자. 게다가 이전 섹션에서 본 것처럼 우리는 두 식에 대한 미분값을 어떻게 따로따로 계산할지 알고 있다. $f$ 는 단지 $q$와 $z$의 곱이다. 따라서 $\frac{\partial f}{\partial q} = z, \frac{\partial f}{\partial z} = q$, 그리고 $q$는 $x$와 $y$의 합이므로 $\frac{\partial q}{\partial x} = 1, \frac{\partial q}{\partial y} = 1$이다. 하지만, 중간 결과값인 $q$에 대한 그라디언트($\frac{\partial f}{\partial q}$)를 신경쓸 필요가 없다. 대신 궁극적으로 입력 $x,y,z$에 대한 $f$의 그라디언트에 관심이 있다. **체인룰**은 이러한 그라디언트 표현식들을 함께 연결시키는 적절한 방법이 곱하는 것이라는 것을 보여준다. 예를 들면, $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x} $와 같이 표현할 수 있다. 실제로 이는 단순히 두 그라디언트를 담고 있는 두 수의 곱셈이다. 하나의 예를 통해 확인 해보자. ~~~python # set some inputs @@ -87,7 +87,7 @@ dfdx = 1.0 * dfdq # dq/dx = 1. And the multiplication here is the chain rule! dfdy = 1.0 * dfdq # dq/dy = 1 ~~~ -결국 `[dfdx,dfdy,dfdz]` 변수들로 그라디언트가 표현되는데, 이는 `f`에 대한 변수 `x,y,z`의 민감도(sensitivity)를 보여준다. 이는 역전파의 가장 간단한 예이다. 더 나아가서 보다 간결한 표기법을 사용해서 `df` 파트를 계속 쓸 필요가 없도록 하고 싶을 것이다. 예를 들어 `dfdq` 대신에 단순히 `dq`를 쓰고 항상 그라디언트가 최종 출력에 관한 것이라 가정하는 것이다. +결국 `[dfdx,dfdy,dfdz]` 변수들로 그라디언트가 표현되는데, 이는 `f`에 대한 변수 `x,y,z`의 민감도(sensitivity)를 보여준다. 이는 backpropagation의 가장 간단한 예이다. 더 나아가서 보다 간결한 표기법을 사용해서 `df` 파트를 계속 쓸 필요가 없도록 하고 싶을 것이다. 예를 들어 `dfdq` 대신에 단순히 `dq`를 쓰고 항상 그라디언트가 최종 출력에 관한 것이라 가정하는 것이다. 또한 이런 계산은 회로도를 가지고 다음과 같이 멋지게 시각화할 수 있다: @@ -95,21 +95,21 @@ dfdy = 1.0 * dfdq # dq/dy = 1 -2-4x5-4y-43z3-4q+-121f*
- 좌측에 실수 값으로 표현되는 "회로"는 이 계산에 대한 시각 표현을 보여준다. 전방 전달(forward pass)은 입력부터 출력까지 값을 계산한다 (녹색으로 표시). 그리고 나서 후방 전달(backward pass)는 역전파를 수행하는데, 이는 끝에서 시작해서 반복적으로 체인 룰을 적용해 회로 입력에 대한 모든 길에서 그라디언트 값 (적색으로 표시) 을 계산한다. 그라디언트 값은 회로를 통해 거꾸로 흐르는 것으로 볼 수 있다. + 좌측에 실수 값으로 표현되는 "회로"는 이 계산에 대한 시각 표현을 보여준다. 전방 전달(forward pass)은 입력부터 출력까지 값을 계산한다 (녹색으로 표시). 그리고 나서 후방 전달(backward pass)는 backpropagation을 수행하는데, 이는 끝에서 시작해서 반복적으로 체인 룰을 적용해 회로 입력에 대한 모든 길에서 그라디언트 값 (적색으로 표시) 을 계산한다. 그라디언트 값은 회로를 통해 거꾸로 흐르는 것으로 볼 수 있다.
-### 역전파(backpropagation)에 대한 직관적 이해 +### Backpropagation에 대한 직관적 이해 -역전파가 굉장히 지역적인(local) 프로세스임에 주목하자. 회로도 내의 모든 게이트(gate) 몇개의 입력을 받아드리고 곧 바로 두 가지를 계산할 수 있다: 1. 게이트의 출력 값, 2. 게이트 출력에 대한 입력들의 *지역적* 그라디언트 값. 여기서 게이트들이 포함된 전체 회로의 세세한 부분을 모르더라도 완전히 독립적으로 값들을 계산할 수 있음을 주목하라. 하지만, 일단 전방 전달이 끝나면 역전파 과정에서 게이트는 결국 전체 회로의 마지막 출력에 대한 게이트 출력의 그라디언트 값에 관해 학습할 것이다. 체인룰을 통해 게이트는 이 그라디언트 값을 받아들여 모든 입력에 대해서 계산한 게이트의 모든 그라디언트 값에 곱한다. +backpropagation이 굉장히 지역적인(local) 프로세스임에 주목하자. 회로도 내의 모든 게이트(gate) 몇개의 입력을 받아드리고 곧 바로 두 가지를 계산할 수 있다: 1. 게이트의 출력 값, 2. 게이트 출력에 대한 입력들의 *지역적* 그라디언트 값. 여기서 게이트들이 포함된 전체 회로의 세세한 부분을 모르더라도 완전히 독립적으로 값들을 계산할 수 있음을 주목하라. 하지만, 일단 전방 전달이 끝나면 backpropagation 과정에서 게이트는 결국 전체 회로의 마지막 출력에 대한 게이트 출력의 그라디언트 값에 관해 학습할 것이다. 체인룰을 통해 게이트는 이 그라디언트 값을 받아들여 모든 입력에 대해서 계산한 게이트의 모든 그라디언트 값에 곱한다. > 체인룰 덕분에 이러한 각 입력에 대한 추가 곱셈은 전체 신경망과 같은 복잡한 회로에서 상대적으로 쓸모 없는 개개의 게이트를 중요하지 않은 것으로 바꿀 수 있다. 다시 위 예를 통해 이것이 어떻게 동작하는지에 대한 직관을 얻자. 덧셈 게이트는 입력 [-2, 5]를 받아 3을 출력한다. 이 게이트는 덧셈 연산을 하고 있기 때문에 두 입력에 대한 게이트의 지역적 그라디언트 값은 +1이 된다. 회로의 나머지 부분을 통해 최종 출력 값으로 -12가 나온다. 체인룰이 회로를 역으로 가로질러 반복적으로 적용되는 후방 전달 과정 동안, (곱셈 게이트의 입력인) 덧셈 게이트는 출력 값에 대한 그라디언트 값이 -4였다는 것을 학습한다. 만약 회로가 높은 값을 출력하기를 원하는 것으로 의인화하면 (이는 직관에 도움이 될 수 있다), 이 회로가 덧셈 게이트의 출력 값이 4의 *힘*으로 낮아지길 (음의 부호이기 때문) "원하는" 것으로 볼 수 있다. 반복을 지속하고 그라디언트 값을 연결하기 위해 덧셈 게이트는 이 그라디언트 값을 받아들이고 이를 모든 입력들에 대한 지역적 그라디언트 값에 곱한다 (**x**와 **y**에 대한 그라디언트 값이 1 * -4 = -4가 되도록). 다음의 원하는 효과가 있다는 사실에 주목하자. 만약 **x,y**가 (음의 그라디언트 값에 대한 반응으로) 감소한다면, 이 덧셈 게이트의 출력은 감소할 것이고 이는 다시 곱셈 게이트의 출력이 증가하도록 만들 것이다. -따라서 역전파는 보다 큰 최종 출력 값을 얻도록 게이트들이 자신들의 출력이 (얼마나 강하게) 증가하길 원하는지 또는 감소하길 원하는지 서로 소통하는 것으로 간주할 수 있다. +따라서 backpropagation은 보다 큰 최종 출력 값을 얻도록 게이트들이 자신들의 출력이 (얼마나 강하게) 증가하길 원하는지 또는 감소하길 원하는지 서로 소통하는 것으로 간주할 수 있다. ### 모듈성: 시그모이드(Sigmoid) 예제 From 13b6d0a35cd28311b5f442743e20bec8d73a57f5 Mon Sep 17 00:00:00 2001 From: Dongkyu Kim Date: Wed, 8 Jun 2016 14:47:49 +0900 Subject: [PATCH 173/199] =?UTF-8?q?=EC=9A=A9=EC=96=B4=20=ED=86=B5=EC=9D=BC?= =?UTF-8?q?=20(=EC=97=AD=EC=A0=84=ED=8C=8C,=20=EC=B2=B4=EC=9D=B8=EB=A3=B0)?= =?UTF-8?q?=20=EB=8B=A4=EC=8B=9C?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- optimization-2.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/optimization-2.md b/optimization-2.md index 23d00695..32ae548e 100644 --- a/optimization-2.md +++ b/optimization-2.md @@ -66,9 +66,9 @@ $$ -### 체인룰(chain rule)을 이용한 복합 표현식 +### 연쇄 법칙(Chain rule)을 이용한 복합 표현식 -이제 $f(x,y,z) = (x + y) z$ 같은 다수의 복합 함수(composed functions)를 수반하는 더 복잡한 표현식을 고려해보자. 이 표현식은 여전히 바로 미분하기에 충분히 간단하지만, 우리는 이 식에 특별한 접근법을 적용할 것이다. 이는 backpropagation 뒤에 있는 직관을 이해하는데 도움이 될 것이다. 특히 이 식이 두 개의 표현식 $q = x + y$와 $f = q z$ 으로 분해될 수 있음에 주목하자. 게다가 이전 섹션에서 본 것처럼 우리는 두 식에 대한 미분값을 어떻게 따로따로 계산할지 알고 있다. $f$ 는 단지 $q$와 $z$의 곱이다. 따라서 $\frac{\partial f}{\partial q} = z, \frac{\partial f}{\partial z} = q$, 그리고 $q$는 $x$와 $y$의 합이므로 $\frac{\partial q}{\partial x} = 1, \frac{\partial q}{\partial y} = 1$이다. 하지만, 중간 결과값인 $q$에 대한 그라디언트($\frac{\partial f}{\partial q}$)를 신경쓸 필요가 없다. 대신 궁극적으로 입력 $x,y,z$에 대한 $f$의 그라디언트에 관심이 있다. **체인룰**은 이러한 그라디언트 표현식들을 함께 연결시키는 적절한 방법이 곱하는 것이라는 것을 보여준다. 예를 들면, $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x} $와 같이 표현할 수 있다. 실제로 이는 단순히 두 그라디언트를 담고 있는 두 수의 곱셈이다. 하나의 예를 통해 확인 해보자. +이제 $f(x,y,z) = (x + y) z$ 같은 다수의 복합 함수(composed functions)를 수반하는 더 복잡한 표현식을 고려해보자. 이 표현식은 여전히 바로 미분하기에 충분히 간단하지만, 우리는 이 식에 특별한 접근법을 적용할 것이다. 이는 backpropagation 뒤에 있는 직관을 이해하는데 도움이 될 것이다. 
특히 이 식이 두 개의 표현식 $q = x + y$와 $f = q z$ 으로 분해될 수 있음에 주목하자. 게다가 이전 섹션에서 본 것처럼 우리는 두 식에 대한 미분값을 어떻게 따로따로 계산할지 알고 있다. $f$ 는 단지 $q$와 $z$의 곱이다. 따라서 $\frac{\partial f}{\partial q} = z, \frac{\partial f}{\partial z} = q$, 그리고 $q$는 $x$와 $y$의 합이므로 $\frac{\partial q}{\partial x} = 1, \frac{\partial q}{\partial y} = 1$이다. 하지만, 중간 결과값인 $q$에 대한 그라디언트($\frac{\partial f}{\partial q}$)를 신경쓸 필요가 없다. 대신 궁극적으로 입력 $x,y,z$에 대한 $f$의 그라디언트에 관심이 있다. **연쇄 법칙**은 이러한 그라디언트 표현식들을 함께 연결시키는 적절한 방법이 곱하는 것이라는 것을 보여준다. 예를 들면, $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x} $와 같이 표현할 수 있다. 실제로 이는 단순히 두 그라디언트를 담고 있는 두 수의 곱셈이다. 하나의 예를 통해 확인 해보자.

~~~python
# set some inputs
@@ -103,11 +103,11 @@ dfdy = 1.0 * dfdq # dq/dy = 1

### Backpropagation에 대한 직관적 이해

-backpropagation이 굉장히 지역적인(local) 프로세스임에 주목하자. 회로도 내의 모든 게이트(gate) 몇개의 입력을 받아드리고 곧 바로 두 가지를 계산할 수 있다: 1. 게이트의 출력 값, 2. 게이트 출력에 대한 입력들의 *지역적* 그라디언트 값. 여기서 게이트들이 포함된 전체 회로의 세세한 부분을 모르더라도 완전히 독립적으로 값들을 계산할 수 있음을 주목하라. 하지만, 일단 전방 전달이 끝나면 backpropagation 과정에서 게이트는 결국 전체 회로의 마지막 출력에 대한 게이트 출력의 그라디언트 값에 관해 학습할 것이다. 체인룰을 통해 게이트는 이 그라디언트 값을 받아들여 모든 입력에 대해서 계산한 게이트의 모든 그라디언트 값에 곱한다.
+backpropagation이 굉장히 지역적인(local) 프로세스임에 주목하자. 회로도 내의 모든 게이트(gate)는 몇 개의 입력을 받아들이고 곧바로 두 가지를 계산할 수 있다: 1. 게이트의 출력 값, 2. 게이트 출력에 대한 입력들의 *지역적* 그라디언트 값. 여기서 게이트들이 포함된 전체 회로의 세세한 부분을 모르더라도 완전히 독립적으로 값들을 계산할 수 있음을 주목하라. 하지만, 일단 전방 전달이 끝나면 backpropagation 과정에서 게이트는 결국 전체 회로의 마지막 출력에 대한 게이트 출력의 그라디언트 값에 관해 학습할 것이다. 연쇄 법칙을 통해 게이트는 이 그라디언트 값을 받아들여 모든 입력에 대해서 계산한 게이트의 모든 그라디언트 값에 곱한다.

-> 체인룰 덕분에 이러한 각 입력에 대한 추가 곱셈은 전체 신경망과 같은 복잡한 회로에서 상대적으로 쓸모 없는 개개의 게이트를 중요하지 않은 것으로 바꿀 수 있다.
+> 연쇄 법칙 덕분에 이러한 각 입력에 대한 추가 곱셈은, 단독으로는 상대적으로 쓸모 없는 개개의 게이트를 전체 신경망과 같은 복잡한 회로의 중요한 부품으로 바꾸어 놓는다.

-다시 위 예를 통해 이것이 어떻게 동작하는지에 대한 직관을 얻자. 덧셈 게이트는 입력 [-2, 5]를 받아 3을 출력한다. 이 게이트는 덧셈 연산을 하고 있기 때문에 두 입력에 대한 게이트의 지역적 그라디언트 값은 +1이 된다. 회로의 나머지 부분을 통해 최종 출력 값으로 -12가 나온다. 체인룰이 회로를 역으로 가로질러 반복적으로 적용되는 후방 전달 과정 동안, (곱셈 게이트의 입력인) 덧셈 게이트는 출력 값에 대한 그라디언트 값이 -4였다는 것을 학습한다. 만약 회로가 높은 값을 출력하기를 원하는 것으로 의인화하면 (이는 직관에 도움이 될 수 있다), 이 회로가 덧셈 게이트의 출력 값이 4의 *힘*으로 낮아지길 (음의 부호이기 때문) "원하는" 것으로 볼 수 있다. 반복을 지속하고 그라디언트 값을 연결하기 위해 덧셈 게이트는 이 그라디언트 값을 받아들이고 이를 모든 입력들에 대한 지역적 그라디언트 값에 곱한다 (**x**와 **y**에 대한 그라디언트 값이 1 * -4 = -4가 되도록). 다음의 원하는 효과가 있다는 사실에 주목하자. 만약 **x,y**가 (음의 그라디언트 값에 대한 반응으로) 감소한다면, 이 덧셈 게이트의 출력은 감소할 것이고 이는 다시 곱셈 게이트의 출력이 증가하도록 만들 것이다.
+다시 위 예를 통해 이것이 어떻게 동작하는지에 대한 직관을 얻자. 덧셈 게이트는 입력 [-2, 5]를 받아 3을 출력한다. 이 게이트는 덧셈 연산을 하고 있기 때문에 두 입력에 대한 게이트의 지역적 그라디언트 값은 +1이 된다. 회로의 나머지 부분을 통해 최종 출력 값으로 -12가 나온다. 연쇄 법칙이 회로를 역으로 가로질러 반복적으로 적용되는 후방 전달 과정 동안, (곱셈 게이트의 입력인) 덧셈 게이트는 출력 값에 대한 그라디언트 값이 -4였다는 것을 학습한다. 만약 회로가 높은 값을 출력하기를 원하는 것으로 의인화하면 (이는 직관에 도움이 될 수 있다), 이 회로가 덧셈 게이트의 출력 값이 4의 *힘*으로 낮아지길 (음의 부호이기 때문) "원하는" 것으로 볼 수 있다. 반복을 지속하고 그라디언트 값을 연결하기 위해 덧셈 게이트는 이 그라디언트 값을 받아들이고 이를 모든 입력들에 대한 지역적 그라디언트 값에 곱한다 (**x**와 **y**에 대한 그라디언트 값이 1 * -4 = -4가 되도록). 다음의 원하는 효과가 있다는 사실에 주목하자. 만약 **x,y**가 (음의 그라디언트 값에 대한 반응으로) 감소한다면, 이 덧셈 게이트의 출력은 감소할 것이고 이는 다시 곱셈 게이트의 출력이 증가하도록 만들 것이다.

따라서 backpropagation은 보다 큰 최종 출력 값을 얻도록 게이트들이 자신들의 출력이 (얼마나 강하게) 증가하길 원하는지 또는 감소하길 원하는지 서로 소통하는 것으로 간주할 수 있다.
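참고로, 게이트들이 연쇄 법칙으로 그라디언트를 주고받는 이 과정은 몇 줄의 코드로도 그대로 적을 수 있다. 아래는 앞서 나온 시그모이드 뉴런 예제 회로 $f(w,x) = 1/(1+e^{-(w_0x_0 + w_1x_1 + w_2)})$ 의 전방/후방 전달을 가정한 최소한의 스케치로, 입력 값은 회로도의 예시 값을 그대로 사용한 것이다:

~~~python
import math

w = [2.0, -3.0, -3.0]  # 회로도 예시의 가중치
x = [-1.0, -2.0]       # 회로도 예시의 입력

# 전방 전달: 뉴런의 출력 계산
dot = w[0]*x[0] + w[1]*x[1] + w[2]
f = 1.0 / (1 + math.exp(-dot))  # 시그모이드 출력, 약 0.73

# 후방 전달: 시그모이드의 지역 그라디언트 (1-f)*f 에 연쇄 법칙 적용
ddot = (1 - f) * f                           # dot에 대한 그라디언트, 약 0.20
dx = [w[0] * ddot, w[1] * ddot]              # 입력 x까지 전달된 그라디언트
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]  # 가중치 w까지 전달된 그라디언트
~~~

시그모이드의 미분이 $(1-\sigma(x))\sigma(x)$ 처럼 자신의 출력 값만으로 표현되기 때문에, 여러 개의 단순한 게이트를 이렇게 하나의 게이트로 묶으면 후방 전달이 훨씬 간결해진다.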
From 664e653bfdf7b00700a2e91abcd6877f36097a88 Mon Sep 17 00:00:00 2001 From: Sanghun Kang Date: Wed, 8 Jun 2016 22:07:41 +0900 Subject: [PATCH 174/199] =?UTF-8?q?=EB=A7=9E=EC=B6=A4=EB=B2=95=20=EC=88=98?= =?UTF-8?q?=EC=A0=95?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- python-numpy-tutorial.md | 162 +++++++++++++++++++-------------------- 1 file changed, 81 insertions(+), 81 deletions(-) diff --git a/python-numpy-tutorial.md b/python-numpy-tutorial.md index a11f7b9b..d121d481 100644 --- a/python-numpy-tutorial.md +++ b/python-numpy-tutorial.md @@ -21,18 +21,18 @@ Python: Numpy --> -이 튜토리얼은 [Justin Johnson](http://cs.stanford.edu/people/jcjohns/)에 의해 작성되었습니다. +이 튜토리얼은 [Justin Johnson](http://cs.stanford.edu/people/jcjohns/) 에 의해 작성되었습니다. cs231n 수업의 모든 과제에서는 프로그래밍 언어로 파이썬을 사용할 것입니다. 파이썬은 그 자체만으로도 훌륭한 범용 프로그래밍 언어이지만, 몇몇 라이브러리(numpy, scipy, matplotlib)의 도움으로 계산과학 분야에서 강력한 개발 환경을 갖추게 됩니다. -많은 분들이 파이썬과 numpy를 경험 해보셨을거라고 생각합니다. 경험 하지 못했을지라도 이 문서를 통해 -'프로그래밍 언어로서의 파이썬'과 '파이썬을 계산과학에 활용하는법'을 빠르게 훑을 수 있습니다. +많은 분들이 파이썬과 numpy를 경험 해보셨을 거라고 생각합니다. 경험하지 못했을지라도 이 문서를 통해 +'프로그래밍 언어로서의 파이썬'과 '파이썬을 계산과학에 활용하는 법'을 빠르게 훑을 수 있습니다. -만약 Matlab을 사용해보셨다면, [Matlab사용자를 위한 numpy](http://wiki.scipy.org/NumPy_for_Matlab_Users) 페이지를 추천해 드립니다. +만약 Matlab을 사용해보셨다면, [Matlab 사용자를 위한 numpy](http://wiki.scipy.org/NumPy_for_Matlab_Users) 페이지를 추천해 드립니다. -또한 [CS 228](http://cs.stanford.edu/~ermon/cs228/index.html)수업을 위해 [Volodymyr Kuleshov](http://web.stanford.edu/~kuleshov/) 와 [Isaac Caswell](https://symsys.stanford.edu/viewing/symsysaffiliate/21335)가 만든 [이 튜토리얼의 IPython notebook 버전](https://github.com/kuleshov/cs228-material/blob/master/tutorials/python/cs228-python-tutorial.ipynb)도 참조 할 수 있습니다. +또한 [CS 228](http://cs.stanford.edu/~ermon/cs228/index.html) 수업을 위해 [Volodymyr Kuleshov](http://web.stanford.edu/~kuleshov/) 와 [Isaac Caswell](https://symsys.stanford.edu/viewing/symsysaffiliate/21335) 가 만든 [이 튜토리얼의 IPython notebook 버전](https://github.com/kuleshov/cs228-material/blob/master/tutorials/python/cs228-python-tutorial.ipynb) 도 참조할 수 있습니다. 목차: @@ -48,7 +48,7 @@ cs231n 수업의 모든 과제에서는 프로그래밍 언어로 파이썬을 - [Numpy](#numpy) - [배열](#numpy-arrays) - [배열 인덱싱](#numpy-array-indexing) - - [데이터타입](#numpy-datatypes) + - [데이터 타입](#numpy-datatypes) - [배열 연산](#numpy-math) - [브로드캐스팅](#numpy-broadcasting) - [SciPy](#scipy) @@ -62,10 +62,10 @@ cs231n 수업의 모든 과제에서는 프로그래밍 언어로 파이썬을 -## Python -파이썬은 고차원이고, 다중패러다임을 지원하는 동적 프로그래밍 언어입니다. -짧지만 가독성 높은 코드 몇 줄로 수준 높은 아이디어들을 표현할수있기에 파이썬 코드는 거의 수도코드처럼 보인다고도 합니다. -아래는 quicksort알고리즘의 파이썬 구현 예시입니다: +## 파이썬 +파이썬은 고급 프로그래밍 언어(사람이 이해하기 쉽게 작성된 언어)이며, 다중패러다임을 지원하는 동적 프로그래밍 언어입니다. +짧지만 가독성 높은 코드 몇 줄로 수준 높은 아이디어들을 표현할 수 있기에 파이썬 코드는 거의 수도코드처럼 보인다고도 합니다. +아래는 quicksort 알고리즘을 파이썬으로 구현한 예시입니다: ~~~python def quicksort(arr): @@ -82,19 +82,19 @@ print quicksort([3,6,8,10,1,2,1]) ~~~ ### 파이썬 버전 -현재 파이썬에는 두가지 버전이 있습니다. 파이썬 2.7 그리고 파이썬 3.4입니다. +현재 파이썬에는 두 가지 버전이 있습니다. 파이썬 2.7 그리고 파이썬 3.4입니다. 혼란스럽게도, 파이썬3은 기존 파이썬2와 호환되지 않게 변경된 부분이 있습니다. 그러므로 파이썬 2.7로 쓰여진 코드는 3.4환경에서 동작하지 않고 그 반대도 마찬가지입니다. 이 수업에선 파이썬 2.7을 사용합니다. -커맨드라인에 아래의 명령어를 입력해서 현재 설치된 파이썬 버전을 확인 할 수 있습니다. +커맨드라인에 아래의 명령어를 입력해서 현재 설치된 파이썬 버전을 확인할 수 있습니다. `python --version`. ### 기본 자료형 -다른 프로그래밍 언어들처럼, 파이썬에는 정수, 실수, 불리언, 문자열같은 기본 자료형이 있습니다. +다른 프로그래밍 언어들처럼, 파이썬에는 정수, 실수, 불리언, 문자열 같은 기본 자료형이 있습니다. 파이썬 기본 자료형 역시 다른 프로그래밍 언어와 유사합니다. 
**숫자:** 다른 언어와 마찬가지로 파이썬의 정수형(Integers)과 실수형(floats) 데이터 타입 역시 동일한 역할을 합니다 : @@ -115,12 +115,12 @@ y = 2.5 print type(y) # 출력 "" print y, y + 1, y * 2, y ** 2 # 출력 "2.5 3.5 5.0 6.25" ~~~ -다른 언어들과는 달리, 파이썬에는 증감 단항연상자(`x++`, `x--`)가 없습니다. +다른 언어들과는 달리, 파이썬에는 증감 단항연산자(`x++`, `x--`)가 없습니다. 파이썬 역시 long 정수형과 복소수 데이터 타입이 구현되어 있습니다. 자세한 사항은 [문서](https://docs.python.org/2/library/stdtypes.html#numeric-types-int-float-long-complex)에서 찾아볼 수 있습니다. -**불리언(Booleans):** 파이썬에는 논리 자료형의 모든 연산자들이 구현되어 있습니다. +**불리언(Booleans):** 파이썬에는 논리 자료형의 모든 연산자가 구현되어 있습니다. 그렇지만 기호(`&&`, `||`, 등.) 대신 영어 단어로 구현되어 있습니다 : ~~~python @@ -136,8 +136,8 @@ print t != f # 논리 XOR; 출력 "True" **문자열:** 파이썬은 문자열과 연관된 다양한 기능을 지원합니다: ~~~python -hello = 'hello' # String 문자열을 표현할땐 따옴표나 -world = "world" # 쌍따옴표가 사용됩니다; 어떤걸 써도 상관없습니다. +hello = 'hello' # String 문자열을 표현할 땐 따옴표나 +world = "world" # 쌍따옴표가 사용됩니다; 어떤 걸 써도 상관없습니다. print hello # 출력 "hello" print len(hello) # 문자열 길이; 출력 "5" hw = hello + ' ' + world # 문자열 연결 @@ -150,11 +150,11 @@ print hw12 # 출력 "hello world 12" ~~~python s = "hello" -print s.capitalize() # 문자열을 대문자로 시작하게함; 출력 "Hello" +print s.capitalize() # 문자열을 대문자로 시작하게 함; 출력 "Hello" print s.upper() # 모든 문자를 대문자로 바꿈; 출력 "HELLO" print s.rjust(7) # 문자열 오른쪽 정렬, 빈공간은 여백으로 채움; 출력 " hello" print s.center(7) # 문자열 가운데 정렬, 빈공간은 여백으로 채움; 출력 " hello " -print s.replace('l', '(ell)') # 첫번째 인자로 온 문자열을 두번째 인자 문자열로 바꿈; +print s.replace('l', '(ell)') # 첫 번째 인자로 온 문자열을 두 번째 인자 문자열로 바꿈; # 출력 "he(ell)(ell)o" print ' world '.strip() # 문자열 앞뒤 공백 제거; 출력 "world" ~~~ @@ -168,14 +168,14 @@ print ' world '.strip() # 문자열 앞뒤 공백 제거; 출력 "world" #### 리스트 -리스트는 파이썬에서 배열같은 존재입니다. 그렇지만 배열과 달리 크기 변경이 가능하고 -서로 다른 자료형일지라도 하나의 리스트에 저장 될 수 있습니다: +리스트는 파이썬에서 배열 같은 존재입니다. 그렇지만 배열과 달리 크기 변경이 가능하고 +서로 다른 자료형일지라도 하나의 리스트에 저장될 수 있습니다: ~~~python xs = [3, 1, 2] # 리스트 생성 print xs, xs[2] # 출력 "[3, 1, 2] 2" print xs[-1] # 인덱스가 음수일 경우 리스트의 끝에서부터 세어짐; 출력 "2" -xs[2] = 'foo' # 리스트는 자료형이 다른 요소들을 저장 할 수 있습니다 +xs[2] = 'foo' # 리스트는 자료형이 다른 요소들을 저장할 수 있습니다 print xs # 출력 "[3, 1, 'foo']" xs.append('bar') # 리스트의 끝에 새 요소 추가 print xs # 출력 "[3, 1, 'foo', 'bar']" @@ -185,7 +185,7 @@ print x, xs # 출력 "bar [3, 1, 'foo']" 마찬가지로, 리스트에 대해 자세한 사항은 [문서](https://docs.python.org/2/tutorial/datastructures.html#more-on-lists)에서 찾아볼 수 있습니다. **슬라이싱:** -리스트의 요소로 한번에 접근하는것 이외에도, 파이썬은 리스트의 일부분에만 접근하는 간결한 문법을 제공합니다; +리스트의 요소로 한 번에 접근하는 것 이외에도, 파이썬은 리스트의 일부분에만 접근하는 간결한 문법을 제공합니다; 이를 *슬라이싱*이라고 합니다: ~~~python @@ -199,7 +199,7 @@ print nums[:-1] # 슬라이싱 인덱스는 음수도 가능; 출력 ["0, 1, nums[2:4] = [8, 9] # 슬라이스된 리스트에 새로운 리스트 할당 print nums # 출력 "[0, 1, 8, 9, 4]" ~~~ -numpy 배열 부분에서 다시 슬라이싱을 보게될것입니다. +numpy 배열 부분에서 다시 슬라이싱을 보게 될 것입니다. **반복문:** 아래와 같이 리스트의 요소들을 반복해서 조회할 수 있습니다: @@ -220,7 +220,7 @@ for idx, animal in enumerate(animals): ~~~ **리스트 comprehensions:** -프로그래밍을 하다보면, 자료형을 변환해야 하는 경우가 자주 있습니다. +프로그래밍을 하다 보면, 자료형을 변환해야 하는 경우가 자주 있습니다. 간단한 예를 들자면, 숫자의 제곱을 계산하는 다음의 코드를 보세요: @@ -240,7 +240,7 @@ squares = [x ** 2 for x in nums] print squares # 출력 [0, 1, 4, 9, 16] ~~~ -리스트 comprehensions에 조건을 추가 할 수도 있습니다: +리스트 comprehensions에 조건을 추가할 수도 있습니다: ~~~python nums = [0, 1, 2, 3, 4] @@ -257,16 +257,16 @@ print even_squares # 출력 "[0, 4, 16]" ~~~python d = {'cat': 'cute', 'dog': 'furry'} # 새로운 딕셔너리를 만듭니다 print d['cat'] # 딕셔너리의 값을 받음; 출력 "cute" -print 'cat' in d # 딕셔너리가 주어진 열쇠를 가지고 있는지 확인; 출력 "True" +print 'cat' in d # 딕셔너리가 주어진 열쇠를 가졌는지 확인; 출력 "True" d['fish'] = 'wet' # 딕셔너리의 값을 지정 print d['fish'] # 출력 "wet" # print d['monkey'] # KeyError: 'monkey' not a key of d print d.get('monkey', 'N/A') # 딕셔너리의 값을 받음. 
존재하지 않는 다면 'N/A'; 출력 "N/A" print d.get('fish', 'N/A') # 딕셔너리의 값을 받음. 존재하지 않는 다면 'N/A'; 출력 "wet" del d['fish'] # 딕셔너리에 저장된 요소 삭제 -print d.get('fish', 'N/A') # "fish"는 더이상 열쇠가 아님; 출력 "N/A" +print d.get('fish', 'N/A') # "fish"는 더 이상 열쇠가 아님; 출력 "N/A" ~~~ -딕셔너리에 관해 더 알고싶다면 [문서](https://docs.python.org/2/library/stdtypes.html#dict)를 참조하세요. +딕셔너리에 관해 더 알고 싶다면 [문서](https://docs.python.org/2/library/stdtypes.html#dict)를 참조하세요. **반복문:** 딕셔너리의 열쇠는 쉽게 반복될 수 있습니다: @@ -288,7 +288,7 @@ for animal, legs in d.iteritems(): ~~~ **딕셔너리 comprehensions:** -리스트 comprehensions과 유사한 딕셔너리 comprehensions을 통해 손쉽게 딕셔너리를 만들수 있습니다. +리스트 comprehensions과 유사한 딕셔너리 comprehensions을 통해 손쉽게 딕셔너리를 만들 수 있습니다. 예시: ~~~python @@ -300,7 +300,7 @@ print even_num_to_square # 출력 "{0: 0, 2: 4, 4: 16}" #### 집합 -집합은 순서 구분이 없고 서로 다른 요소간의 모임입니다. 예시: +집합은 순서 구분이 없고 서로 다른 요소 간의 모임입니다. 예시: ~~~python animals = {'cat', 'dog'} @@ -315,11 +315,11 @@ animals.remove('cat') # Remove an element from a set print len(animals) # 출력 "2" ~~~ -마찬가지로, 집합에 관해 더 알고싶다면 [문서](https://docs.python.org/2/library/sets.html#set-objects)를 참조하세요. +마찬가지로, 집합에 관해 더 알고 싶다면 [문서](https://docs.python.org/2/library/sets.html#set-objects)를 참조하세요. **반복문:** 집합을 반복하는 구문은 리스트 반복 구문과 동일합니다; -그러나 집합은 순서가 없어서, 어떤 순서로 반복될지 추측할순 없습니다: +그러나 집합은 순서가 없어서, 어떤 순서로 반복될지 추측할 순 없습니다: ~~~python animals = {'cat', 'dog', 'fish'} @@ -340,8 +340,8 @@ print nums # 출력 "set([0, 1, 2, 3, 4, 5])" #### 튜플 -튜플은 요소들 간 순서가 있으며 값이 변하지 않는 리스트입니다. -튜플은 많은 면에서 리스트와 유사합니다; 가장 중요한 차이점은 튜플은 '딕셔너리의 열쇠'와 '집합의 요소'가 될 수 있지만 리스트는 불가능하다는 점입니다. +튜플은 요소 간 순서가 있으며 값이 변하지 않는 리스트입니다. +튜플은 많은 면에서 리스트와 유사합니다; 가장 중요한 차이점은 튜플은 '딕셔너리의 열쇠'와 '집합의 요소'가 될 수 있지만, 리스트는 불가능하다는 점입니다. 여기 간단한 예시가 있습니다: ~~~python @@ -423,8 +423,8 @@ Numpy는 고성능의 다차원 배열 객체와 이를 다룰 도구를 제공 ### 배열 -Numpy 배열은 동일한 자료형을 가지는 값들이 격자판 형태로 있는 것입니다. 각각의 값들은 튜플(이때 튜플은 양의 정수만을 요소값으로 갖습니다.) 형태로 색인됩니다. -*rank*는 배열이 몇차원인지를 의미합니다; *shape*는 는 각 차원의 크기를 알려주는 정수들이 모인 튜플입니다. +Numpy 배열은 동일한 자료형을 가지는 값들이 격자판 형태로 있는 것입니다. 각각의 값들은 튜플(이때 튜플은 양의 정수만을 요소값으로 갖습니다.) 형태로 색인 됩니다. +*rank*는 배열이 몇 차원인지를 의미합니다; *shape*는 는 각 차원의 크기를 알려주는 정수들이 모인 튜플입니다. 파이썬의 리스트를 중첩해 Numpy 배열을 초기화 할 수 있고, 대괄호를 통해 각 요소에 접근할 수 있습니다: @@ -459,7 +459,7 @@ c = np.full((2,2), 7) # 모든 값이 특정 상수인 배열 생성 print c # 출력 "[[ 7. 7.] # [ 7. 7.]]" -d = np.eye(2) # 2x2 단위 행렬 생성 +d = np.eye(2) # 2x2 단위행렬 생성 print d # 출력 "[[ 1. 0.] # [ 0. 1.]]" @@ -472,10 +472,10 @@ print e # 임의의 값 출력 "[[ 0.91940167 0.08143941] ### 배열 인덱싱 -Numpy는 배열을 인덱싱하는 몇가지 방법을 제공합니다. +Numpy는 배열을 인덱싱하는 몇 가지 방법을 제공합니다. **슬라이싱:** -파이썬 리스트와 유사하게, Numpy 배열도 슬라이싱이 가능합니다. Numpy 배열은 다차원인 경우가 많기에, 각 차원별로 어떻게 슬라이스할건지 명확히 해야합니다: +파이썬 리스트와 유사하게, Numpy 배열도 슬라이싱이 가능합니다. Numpy 배열은 다차원인 경우가 많기에, 각 차원별로 어떻게 슬라이스할건지 명확히 해야 합니다: ~~~python import numpy as np @@ -486,7 +486,7 @@ import numpy as np # [ 9 10 11 12]] a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]]) -# 슬라이싱을 이용하여 첫 두행과 1열,2열로 이루어진 부분배열을 만들어 봅시다; +# 슬라이싱을 이용하여 첫 두 행과 1열, 2열로 이루어진 부분배열을 만들어 봅시다; # b는 shape가 (2,2)인 배열이 됩니다: # [[2 3] # [6 7]] @@ -514,11 +514,11 @@ import numpy as np # [ 9 10 11 12]] a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]]) -# 배열의 중간 행에 접근하는 두가지 방법이 있습니다. +# 배열의 중간 행에 접근하는 두 가지 방법이 있습니다. # 정수 인덱싱과 슬라이싱을 혼합해서 사용하면 낮은 rank의 배열이 생성되지만, # 슬라이싱만 사용하면 원본 배열과 동일한 rank의 배열이 생성됩니다. 
-row_r1 = a[1, :] # 배열a의 두번째 행을 rank가 1인 배열로 -row_r2 = a[1:2, :] # 배열a의 두번째 행을 rank가 2인 배열로 +row_r1 = a[1, :] # 배열a의 두 번째 행을 rank가 1인 배열로 +row_r2 = a[1:2, :] # 배열a의 두 번째 행을 rank가 2인 배열로 print row_r1, row_r1.shape # 출력 "[5 6 7 8] (4,)" print row_r2, row_r2.shape # 출력 "[[5 6 7 8]] (1, 4)" @@ -533,7 +533,7 @@ print col_r2, col_r2.shape # 출력 "[[ 2] **정수 배열 인덱싱:** Numpy 배열을 슬라이싱하면, 결과로 얻어지는 배열은 언제나 원본 배열의 부분 배열입니다. -그러나 정수 배열 인덱싱을 한다면, 원본과 다른 배열을 만들수 있습니다. +그러나 정수 배열 인덱싱을 한다면, 원본과 다른 배열을 만들 수 있습니다. 여기에 예시가 있습니다: ~~~python @@ -549,7 +549,7 @@ print a[[0, 1, 2], [0, 1, 0]] # 출력 "[1 4 5]" print np.array([a[0, 0], a[1, 1], a[2, 0]]) # 출력 "[1 4 5]" # 정수 배열 인덱싱을 사용할 때, -# 원본 배열의 같은 요소를 재사용 할 수 있습니다: +# 원본 배열의 같은 요소를 재사용할 수 있습니다: print a[[0, 0], [1, 1]] # 출력 "[2 2]" # 위 예제는 다음과 동일합니다 @@ -586,8 +586,8 @@ print a # 출력 "array([[11, 2, 3], ~~~ **불리언 배열 인덱싱:** -불리언 배열 인덱싱을 통해 배열속 요소를 취사 선택할 수 있습니다. -불리언 배열 인덱싱은 특정 조건을 만족시키는 요소만 선택하고자 할 때 자주 사용됩니다. +불리언 배열 인덱싱을 통해 배열 속 요소를 취사선택할 수 있습니다. +불리언 배열 인덱싱은 특정 조건을 만족하게 하는 요소만 선택하고자 할 때 자주 사용됩니다. 다음은 그 예시입니다: ~~~python @@ -620,8 +620,8 @@ print a[a > 2] # 출력 "[3 4 5 6]" ### 자료형 Numpy 배열은 동일한 자료형을 가지는 값들이 격자판 형태로 있는 것입니다. -Numpy에선 배열을 구성하는데 사용할 수 있는 다양한 숫자 자료형을 제공합니다. -Numpy는 배열이 생성될 때 자료형을 스스로 추측합니다, 그러나 배열을 생성할 때 명시적으로 특정 자료형을 지정할수도 있습니다. 예시: +Numpy에선 배열을 구성하는 데 사용할 수 있는 다양한 숫자 자료형을 제공합니다. +Numpy는 배열이 생성될 때 자료형을 스스로 추측합니다, 그러나 배열을 생성할 때 명시적으로 특정 자료형을 지정할 수도 있습니다. 예시: ~~~python import numpy as np @@ -678,7 +678,7 @@ print np.divide(x, y) print np.sqrt(x) ~~~ -MATLAB과 달리, '*'은 행렬곱이 아니라 요소별 곱입니다. Numpy에선 벡터의 내적, 벡터와 행렬의 곱, 행렬곱을 위해 '*'대신 'dot'함수를 사용합니다. 'dot'은 Numpy 모듈 함수로서도 배열 객체의 인스턴스 메소드로서도 이용 가능한 합수입니다: +MATLAB과 달리, '*'은 행렬 곱이 아니라 요소별 곱입니다. Numpy에선 벡터의 내적, 벡터와 행렬의 곱, 행렬곱을 위해 '*'대신 'dot'함수를 사용합니다. 'dot'은 Numpy 모듈 함수로서도 배열 객체의 인스턴스 메소드로서도 이용 가능한 합수입니다: ~~~python import numpy as np @@ -693,7 +693,7 @@ w = np.array([11, 12]) print v.dot(w) print np.dot(v, w) -# 행렬과 벡터의 곱; 둘 다 결과는 rank 1 인 배열 [29 67] +# 행렬과 벡터의 곱; 둘 다 결과는 rank 1인 배열 [29 67] print x.dot(v) print np.dot(x, v) @@ -704,7 +704,7 @@ print x.dot(y) print np.dot(x, y) ~~~ -Numpy는 배열 연산에 유용하게 쓰이는 많은 함수를 제공합니다. 가장 유용한건 'sum'입니다: +Numpy는 배열 연산에 유용하게 쓰이는 많은 함수를 제공합니다. 가장 유용한 건 'sum'입니다: ~~~python import numpy as np @@ -715,10 +715,10 @@ print np.sum(x) # 모든 요소를 합한 값을 연산; 출력 "10" print np.sum(x, axis=0) # 각 열에 대한 합을 연산; 출력 "[4 6]" print np.sum(x, axis=1) # 각 행에 대한 합을 연산; 출력 "[3 7]" ~~~ -Numpy가 제공하는 모든 수학함수들의 목록은 [문서](http://docs.scipy.org/doc/numpy/reference/routines.math.html)를 참조하세요. +Numpy가 제공하는 모든 수학함수의 목록은 [문서](http://docs.scipy.org/doc/numpy/reference/routines.math.html)를 참조하세요. -배열연산을 하지 않더라도, 종종 배열의 모양을 바꾸거나 데이터를 처리해야할 때가 있습니다. -가장 간단한 예는 행렬의 주대각선을 기준으로 대칭되는 요소끼리 뒤바꾸는 것입니다; 이를 전치라고 하며 행렬을 전치하기 위해선, 간단하게 배열 객체의 'T' 속성을 사용하면 됩니다: +배열연산을 하지 않더라도, 종종 배열의 모양을 바꾸거나 데이터를 처리해야 할 때가 있습니다. +가장 간단한 예는 행렬의 주 대각선을 기준으로 대칭되는 요소끼리 뒤바꾸는 것입니다; 이를 전치라고 하며 행렬을 전치하기 위해선, 간단하게 배열 객체의 'T' 속성을 사용하면 됩니다: ~~~python import numpy as np @@ -729,7 +729,7 @@ print x # 출력 "[[1 2] print x.T # 출력 "[[1 3] # [2 4]]" -# rank 1인 배열을 전치할경우 아무일도 일어나지 않습니다: +# rank 1인 배열을 전치할 경우 아무 일도 일어나지 않습니다: v = np.array([1,2,3]) print v # 출력 "[1 2 3]" print v.T # 출력 "[1 2 3]" @@ -740,7 +740,7 @@ Numpy는 배열을 다루는 다양한 함수들을 제공합니다; 이러한 ### 브로드캐스팅 -브로트캐스팅은 Numpy에서 shape가 다른 배열간에도 산술 연산이 가능하게 하는 메커니즘입니다. 종종 작은 배열과 큰 배열이 있을 때, 큰 배열을 대상으로 작은 배열을 여러번 연산하고자 할 때가 있습니다. 예를 들어, 행렬의 각 행에 상수 벡터를 더하는걸 생각해보세요. 이는 다음과 같은 방식으로 처리될 수 있습니다: +브로트캐스팅은 Numpy에서 shape가 다른 배열 간에도 산술 연산이 가능하게 하는 메커니즘입니다. 종종 작은 배열과 큰 배열이 있을 때, 큰 배열을 대상으로 작은 배열을 여러 번 연산하고자 할 때가 있습니다. 
예를 들어, 행렬의 각 행에 상수 벡터를 더하는 걸 생각해보세요. 이는 다음과 같은 방식으로 처리될 수 있습니다: ~~~python import numpy as np @@ -763,7 +763,7 @@ for i in range(4): print y ~~~ -위의 방식대로 하면 됩니다; 그러나 'x'가 매우 큰 행렬이라면, 파이썬의 명시적 반복문을 이용한 위 코드는 매우 느려질 수 있습니다. 벡터 'v'를 행렬 'x'의 각 행에 더하는것은 'v'를 여러개 복사해 수직으로 쌓은 행렬 'vv'를 만들고 이 'vv'를 'x'에 더하는것과 동일합니다. 이 과정을 아래의 코드로 구현할 수 있습니다: +위의 방식대로 하면 됩니다; 그러나 'x'가 매우 큰 행렬이라면, 파이썬의 명시적 반복문을 이용한 위 코드는 매우 느려질 수 있습니다. 벡터 'v'를 행렬 'x'의 각 행에 더하는 것은 'v'를 여러 개 복사해 수직으로 쌓은 행렬 'vv'를 만들고 이 'vv'를 'x'에 더하는것과 동일합니다. 이 과정을 아래의 코드로 구현할 수 있습니다: ~~~python import numpy as np @@ -772,7 +772,7 @@ import numpy as np # 그 결과를 행렬 y에 저장하고자 합니다 x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]]) v = np.array([1, 0, 1]) -vv = np.tile(v, (4, 1)) # v의 복사본 4개를 위로 차곡차곡 쌓은게 vv +vv = np.tile(v, (4, 1)) # v의 복사본 4개를 위로 차곡차곡 쌓은 것이 vv print vv # 출력 "[[1 0 1] # [1 0 1] # [1 0 1] @@ -784,7 +784,7 @@ print y # 출력 "[[ 2 2 4 # [11 11 13]]" ~~~ -Numpy 브로드캐스팅을 이용한다면 이렇게 v의 복사본을 여러개 만들지 않아도 동일한 연산을 할 수 있습니다. +Numpy 브로드캐스팅을 이용한다면 이렇게 v의 복사본을 여러 개 만들지 않아도 동일한 연산을 할 수 있습니다. 아래는 브로드캐스팅을 이용한 예시 코드입니다: ~~~python @@ -802,15 +802,15 @@ print y # 출력 "[[ 2 2 4] ~~~ `x`의 shape가 `(4, 3)`이고 `v`의 shape가 `(3,)`라도 브로드캐스팅으로 인해 `y = x + v`는 문제없이 수행됩니다; -이때 'v'는 'v'의 복사본이 차곡차곡 쌓인 shape `(4, 3)`처럼 간주되어 'x'와 동일한 shape가 되며 이들간의 요소별 덧셈연산이 y에 저장됩니다. +이때 'v'는 'v'의 복사본이 차곡차곡 쌓인 shape `(4, 3)`처럼 간주되어 'x'와 동일한 shape가 되며 이들 간의 요소별 덧셈연산이 y에 저장됩니다. 두 배열의 브로드캐스팅은 아래의 규칙을 따릅니다: -1. 두 배열이 동일한 rank를 가지고 있지 않다면, 낮은 rank의 1차원 배열이 높은 rank 배열의 shape로 간주됩니다. -2. 특정 차원에서 두 배열이 동일한 크기를 갖거나, 두 배열들 중 하나의 크기가 1이라면 그 두 배열은 특정 차원에서 *compatible*하다고 여겨집니다. +1. 두 배열이 동일한 rank를 가지고 있지 않다면, 낮은 rank의 1차원 배열이 높은 rank 배열의 shape로 간주합니다. +2. 특정 차원에서 두 배열이 동일한 크기를 갖거나, 두 배열 중 하나의 크기가 1이라면 그 두 배열은 특정 차원에서 *compatible*하다고 여겨집니다. 3. 두 행렬이 모든 차원에서 compatible하다면, 브로드캐스팅이 가능합니다. -4. 브로드캐스팅이 이뤄지면, 각 배열 shape의 요소별 최소공배수로 이루어진 shape가 두 배열의 shape로 간주됩니다. -5. 차원에 상관없이 크기가 1인 배열과 1보다 큰 배열이 있을때, 크기가 1인 배열은 자신의 차원수만큼 복사되어 쌓인것처럼 간주된다. +4. 브로드캐스팅이 이뤄지면, 각 배열 shape의 요소별 최소공배수로 이루어진 shape가 두 배열의 shape로 간주합니다. +5. 차원에 상관없이 크기가 1인 배열과 1보다 큰 배열이 있을 때, 크기가 1인 배열은 자신의 차원 수만큼 복사되어 쌓인 것처럼 간주합니다. 설명이 이해하기 부족하다면 [scipy문서](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)나 [scipy위키](http://wiki.scipy.org/EricsBroadcastingDoc)를 참조하세요. @@ -825,9 +825,9 @@ import numpy as np # 벡터의 외적을 계산 v = np.array([1,2,3]) # v의 shape는 (3,) w = np.array([4,5]) # w의 shape는 (2,) -# 외적을 게산하기 위해, 먼저 v를 shape가 (3,1)인 행벡터로 바꿔야 합니다; +# 외적을 계산하기 위해, 먼저 v를 shape가 (3,1)인 행벡터로 바꿔야 합니다; # 그다음 이것을 w에 맞춰 브로드캐스팅한뒤 결과물로 shape가 (3,2)인 행렬을 얻습니다, -# 이 행렬은 v 와 w의 외적의 결과입니다: +# 이 행렬은 v와 w 외적의 결과입니다: # [[ 4 5] # [ 8 10] # [12 15]] @@ -851,7 +851,7 @@ print x + v # [ 9 10 11]] print (x.T + w).T # 다른 방법은 w를 shape가 (2,1)인 열벡터로 변환하는 것입니다; -# 그런다음 이를 바로 x에 브로드캐스팅해 더하면 +# 그런 다음 이를 바로 x에 브로드캐스팅해 더하면 # 동일한 결과가 나옵니다. print x + np.reshape(w, (2, 1)) @@ -876,7 +876,7 @@ numpy에 관한 더 많은 사항은 [numpy 레퍼런스](http://docs.scipy.org/ Numpy는 고성능의 다차원 배열 객체와 이를 다룰 도구를 제공합니다. numpy를 바탕으로 만들어진 [SciPy](http://docs.scipy.org/doc/scipy/reference/)는, -numpy 배열을 다루는 많은 함수들을 제공하며 다양한 과학, 공학분야에서 유용하게 사용됩니다. +numpy 배열을 다루는 많은 함수를 제공하며 다양한 과학, 공학분야에서 유용하게 사용됩니다. SciPy에 익숙해지는 최고의 방법은 [SciPy 공식 문서](http://docs.scipy.org/doc/scipy/reference/index.html)를 보는 것입니다. 이 문서에서는 scipy중 cs231n 수업에서 유용하게 쓰일 일부분만을 소개할것입니다. @@ -885,7 +885,7 @@ SciPy에 익숙해지는 최고의 방법은 [SciPy 공식 문서](http://docs.s ### 이미지 작업 SciPy는 이미지를 다룰 기본적인 함수들을 제공합니다. -예를들자면, 디스크에 저장된 이미지를 numpy 배열로 읽어들이는 함수가 있으며, +예를들자면, 디스크에 저장된 이미지를 numpy 배열로 읽어 들이는 함수가 있으며, numpy 배열을 디스크에 이미지로 저장하는 함수도 있고, 이미지의 크기를 바꾸는 함수도 있습니다. 
이 함수들의 간단한 사용 예시입니다: @@ -897,14 +897,14 @@ img = imread('assets/cat.jpg') print img.dtype, img.shape # 출력 "uint8 (400, 248, 3)" # 각각의 색깔 채널을 다른 상수값으로 스칼라배함으로써 -# 이미지의 색을 변화시킬수 있습니다. +# 이미지의 색을 변화시킬 수 있습니다. # 이미지의 shape는 (400, 248, 3)입니다; # 여기에 shape가 (3,)인 배열 [1, 0.95, 0.9]를 곱합니다; # numpy 브로드캐스팅에 의해 이 배열이 곱해지며 붉은색 채널은 변하지 않으며, # 초록색, 파란색 채널에는 각각 0.95, 0.9가 곱해집니다 img_tinted = img * [1, 0.95, 0.9] -# 색변경 이미지를 300x300 픽셀로 크기 조절. +# 색변경 이미지를 300x300픽셀로 크기 조절. img_tinted = imresize(img_tinted, (300, 300)) # 색변경 이미지를 디스크에 기록하기 @@ -963,12 +963,12 @@ print d ## Matplotlib [Matplotlib](http://matplotlib.org/)는 plotting 라이브러리입니다. 이번에는 MATLAB의 plotting 시스템과 유사한 기능을 제공하는 -`matplotlib.pyplot` 모듈에 관한 간략한 소개가 있곘습니다., +`matplotlib.pyplot` 모듈에 관한 간략한 소개가 있겠습니다., ### Plotting -matplotlib에서 가장 중요한 함수는 2차원 데이터를 그릴수 있게 해주는 `plot`입니다. +matplotlib에서 가장 중요한 함수는 2차원 데이터를 그릴 수 있게 해주는 `plot`입니다. 여기 간단한 예시가 있습니다: ~~~python @@ -990,7 +990,7 @@ plt.show() # 그래프를 나타나게 하기 위해선 plt.show()함수를 호
-약간의 몇가지 추가적인 작업을 통해 여러개의 그래프와, 제목, 범주, 축 이름을 한번에 쉽게 나타낼 수 있습니다: +약간의 몇 가지 추가적인 작업을 통해 여러 개의 그래프와 제목, 범주, 축 이름을 한 번에 쉽게 나타낼 수 있습니다: ~~~python import numpy as np @@ -1020,7 +1020,7 @@ plt.show() ### Subplots -'subplot'함수를 통해 다른 내용들도 동일한 그림위에 나타낼수 있습니다. +'subplot'함수를 통해 다른 내용도 동일한 그림 위에 나타낼 수 있습니다. 여기 간단한 예시가 있습니다: ~~~python @@ -1033,14 +1033,14 @@ y_sin = np.sin(x) y_cos = np.cos(x) # 높이가 2이고 너비가 1인 subplot 구획을 설정하고, -# 첫번째 구획을 활성화. +# 첫 번째 구획을 활성화. plt.subplot(2, 1, 1) -# 첫번째 그리기 +# 첫 번째 그리기 plt.plot(x, y_sin) plt.title('Sine') -# 두번째 subplot 구획을 활성화 하고 그리기 +# 두 번째 subplot 구획을 활성화 하고 그리기 plt.subplot(2, 1, 2) plt.plot(x, y_cos) plt.title('Cosine') @@ -1077,7 +1077,7 @@ plt.imshow(img) plt.subplot(1, 2, 2) # imshow를 이용하며 주의할 점은 데이터의 자료형이 -# uint8이 아니라면 이상한 결과를 보여줄수도 있다는 것입니다. +# uint8이 아니라면 이상한 결과를 보여줄 수도 있다는 것입니다. # 그러므로 이미지를 나타내기 전에 명시적으로 자료형을 uint8로 형변환 해줍니다. plt.imshow(np.uint8(img_tinted)) From d16f38205c21d8c4d9dc49e18cf2236eb71451c3 Mon Sep 17 00:00:00 2001 From: Sanghun Kang Date: Wed, 8 Jun 2016 22:13:50 +0900 Subject: [PATCH 175/199] Update python-numpy-tutorial.md --- python-numpy-tutorial.md | 99 ++++++++++------------------------------ 1 file changed, 23 insertions(+), 76 deletions(-) diff --git a/python-numpy-tutorial.md b/python-numpy-tutorial.md index fbee6035..cba82cbe 100644 --- a/python-numpy-tutorial.md +++ b/python-numpy-tutorial.md @@ -24,7 +24,7 @@ Numpy 이 튜토리얼은 [Justin Johnson](http://cs.stanford.edu/people/jcjohns/) 에 의해 작성되었습니다. cs231n 수업의 모든 과제에서는 프로그래밍 언어로 파이썬을 사용할 것입니다. -파이썬은 그 자체만으로도 훌륭한 범용 프로그래밍 언어이지만, 몇몇 라이브러리(numpy, scipy, matplotlib)의 도움으로 +파이썬은 그 자체만으로도 훌륭한 범용 프로그래밍 언어이지만, 몇몇 라이브러리(numpy, scipy, matplotlib)의 도움으로 계산과학 분야에서 강력한 개발 환경을 갖추게 됩니다. 많은 분들이 파이썬과 numpy를 경험 해보셨을 거라고 생각합니다. 경험하지 못했을지라도 이 문서를 통해 @@ -62,17 +62,10 @@ cs231n 수업의 모든 과제에서는 프로그래밍 언어로 파이썬을 -<<<<<<< HEAD ## 파이썬 파이썬은 고급 프로그래밍 언어(사람이 이해하기 쉽게 작성된 언어)이며, 다중패러다임을 지원하는 동적 프로그래밍 언어입니다. 짧지만 가독성 높은 코드 몇 줄로 수준 높은 아이디어들을 표현할 수 있기에 파이썬 코드는 거의 수도코드처럼 보인다고도 합니다. 아래는 quicksort 알고리즘을 파이썬으로 구현한 예시입니다: -======= -## Python -파이썬은 고차원이고, 다중패러다임을 지원하는 동적 프로그래밍 언어입니다. -짧지만 가독성 높은 코드 몇 줄로 수준 높은 아이디어들을 표현할수있기에 파이썬 코드는 거의 수도코드처럼 보인다고도 합니다. -아래는 quicksort알고리즘의 파이썬 구현 예시입니다: ->>>>>>> 5b00df6adc57e61e9ec7627036d093ccddae1ac6 ~~~python def quicksort(arr): @@ -83,17 +76,13 @@ def quicksort(arr): middle = [x for x in arr if x == pivot] right = [x for x in arr if x > pivot] return quicksort(left) + middle + quicksort(right) - + print quicksort([3,6,8,10,1,2,1]) # 출력 "[1, 1, 2, 3, 6, 8, 10]" ~~~ ### 파이썬 버전 -<<<<<<< HEAD 현재 파이썬에는 두 가지 버전이 있습니다. 파이썬 2.7 그리고 파이썬 3.4입니다. -======= -현재 파이썬에는 두가지 버전이 있습니다. 파이썬 2.7 그리고 파이썬 3.4입니다. ->>>>>>> 5b00df6adc57e61e9ec7627036d093ccddae1ac6 혼란스럽게도, 파이썬3은 기존 파이썬2와 호환되지 않게 변경된 부분이 있습니다. 그러므로 파이썬 2.7로 쓰여진 코드는 3.4환경에서 동작하지 않고 그 반대도 마찬가지입니다. 이 수업에선 파이썬 2.7을 사용합니다. @@ -128,14 +117,10 @@ print y, y + 1, y * 2, y ** 2 # 출력 "2.5 3.5 5.0 6.25" ~~~ 다른 언어들과는 달리, 파이썬에는 증감 단항연산자(`x++`, `x--`)가 없습니다. -파이썬 역시 long 정수형과 복소수 데이터 타입이 구현되어 있습니다. +파이썬 역시 long 정수형과 복소수 데이터 타입이 구현되어 있습니다. 자세한 사항은 [문서](https://docs.python.org/2/library/stdtypes.html#numeric-types-int-float-long-complex)에서 찾아볼 수 있습니다. -<<<<<<< HEAD **불리언(Booleans):** 파이썬에는 논리 자료형의 모든 연산자가 구현되어 있습니다. -======= -**불리언(Booleans):** 파이썬에는 논리 자료형의 모든 연산자들이 구현되어 있습니다. ->>>>>>> 5b00df6adc57e61e9ec7627036d093ccddae1ac6 그렇지만 기호(`&&`, `||`, 등.) 
대신 영어 단어로 구현되어 있습니다 : ~~~python @@ -145,7 +130,7 @@ print type(t) # 출력 "" print t and f # 논리 AND; 출력 "False" print t or f # 논리 OR; 출력 "True" print not t # 논리 NOT; 출력 "False" -print t != f # 논리 XOR; 출력 "True" +print t != f # 논리 XOR; 출력 "True" ~~~ **문자열:** 파이썬은 문자열과 연관된 다양한 기능을 지원합니다: @@ -173,7 +158,7 @@ print s.replace('l', '(ell)') # 첫 번째 인자로 온 문자열을 두 번 # 출력 "he(ell)(ell)o" print ' world '.strip() # 문자열 앞뒤 공백 제거; 출력 "world" ~~~ -모든 문자열 메소드는 [문서](https://docs.python.org/2/library/stdtypes.html#string-methods)에서 찾아볼 수 있습니다. +모든 문자열 메소드는 [문서](https://docs.python.org/2/library/stdtypes.html#string-methods)에서 찾아볼 수 있습니다. @@ -409,18 +394,18 @@ hello('Fred', loud=True) # 출력 "HELLO, FRED!" ~~~python class Greeter(object): - + # 생성자 def __init__(self, name): self.name = name # 인스턴스 변수 선언 - + # 인스턴스 메소드 def greet(self, loud=False): if loud: print 'HELLO, %s!' % self.name.upper() else: print 'Hello, %s' % self.name - + g = Greeter('Fred') # Greeter 클래스의 인스턴스 생성 g.greet() # 인스턴스 메소드 호출; 출력 "Hello, Fred" g.greet(loud=True) # 인스턴스 메소드 호출; 출력 "HELLO, FRED!" @@ -438,15 +423,10 @@ Numpy는 고성능의 다차원 배열 객체와 이를 다룰 도구를 제공 ### 배열 -<<<<<<< HEAD Numpy 배열은 동일한 자료형을 가지는 값들이 격자판 형태로 있는 것입니다. 각각의 값들은 튜플(이때 튜플은 양의 정수만을 요소값으로 갖습니다.) 형태로 색인 됩니다. *rank*는 배열이 몇 차원인지를 의미합니다; *shape*는 는 각 차원의 크기를 알려주는 정수들이 모인 튜플입니다. -======= -Numpy 배열은 동일한 자료형을 가지는 값들이 격자판 형태로 있는 것입니다. 각각의 값들은 튜플(이때 튜플은 양의 정수만을 요소값으로 갖습니다.) 형태로 색인됩니다. -*rank*는 배열이 몇차원인지를 의미합니다; *shape*는 는 각 차원의 크기를 알려주는 정수들이 모인 튜플입니다. ->>>>>>> 5b00df6adc57e61e9ec7627036d093ccddae1ac6 -파이썬의 리스트를 중첩해 Numpy 배열을 초기화 할 수 있고, 대괄호를 통해 각 요소에 접근할 수 있습니다: +파이썬의 리스트를 중첩해 Numpy 배열을 초기화 할 수 있고, 대괄호를 통해 각 요소에 접근할 수 있습니다: ~~~python import numpy as np @@ -471,7 +451,7 @@ import numpy as np a = np.zeros((2,2)) # 모든 값이 0인 배열 생성 print a # 출력 "[[ 0. 0.] # [ 0. 0.]]" - + b = np.ones((1,2)) # 모든 값이 1인 배열 생성 print b # 출력 "[[ 1. 1.]]" @@ -482,7 +462,7 @@ print c # 출력 "[[ 7. 7.] d = np.eye(2) # 2x2 단위행렬 생성 print d # 출력 "[[ 1. 0.] # [ 0. 1.]]" - + e = np.random.random((2,2)) # 임의의 값으로 채워진 배열 생성 print e # 임의의 값 출력 "[[ 0.91940167 0.08143941] # [ 0.68744134 0.87236687]]" @@ -506,11 +486,7 @@ import numpy as np # [ 9 10 11 12]] a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]]) -<<<<<<< HEAD # 슬라이싱을 이용하여 첫 두 행과 1열, 2열로 이루어진 부분배열을 만들어 봅시다; -======= -# 슬라이싱을 이용하여 첫 두행과 1열,2열로 이루어진 부분배열을 만들어 봅시다; ->>>>>>> 5b00df6adc57e61e9ec7627036d093ccddae1ac6 # b는 shape가 (2,2)인 배열이 됩니다: # [[2 3] # [6 7]] @@ -621,15 +597,15 @@ a = np.array([[1,2], [3, 4], [5, 6]]) bool_idx = (a > 2) # 2보다 큰 a의 요소를 찾습니다; # 이 코드는 a와 shape가 같고 불리언 자료형을 요소로 하는 numpy 배열을 반환합니다, - # bool_idx의 각 요소는 동일한 위치에 있는 a의 + # bool_idx의 각 요소는 동일한 위치에 있는 a의 # 요소가 2보다 큰지를 말해줍니다. - + print bool_idx # 출력 "[[False False] # [ True True] # [ True True]]" -# 불리언 배열 인덱싱을 통해 bool_idx에서 -# 참 값을 가지는 요소로 구성되는 +# 불리언 배열 인덱싱을 통해 bool_idx에서 +# 참 값을 가지는 요소로 구성되는 # rank 1인 배열을 구성할 수 있습니다. print a[bool_idx] # 출력 "[3 4 5 6]" @@ -833,18 +809,12 @@ print y # 출력 "[[ 2 2 4] 1. 두 배열이 동일한 rank를 가지고 있지 않다면, 낮은 rank의 1차원 배열이 높은 rank 배열의 shape로 간주합니다. 2. 특정 차원에서 두 배열이 동일한 크기를 갖거나, 두 배열 중 하나의 크기가 1이라면 그 두 배열은 특정 차원에서 *compatible*하다고 여겨집니다. 3. 두 행렬이 모든 차원에서 compatible하다면, 브로드캐스팅이 가능합니다. -<<<<<<< HEAD 4. 브로드캐스팅이 이뤄지면, 각 배열 shape의 요소별 최소공배수로 이루어진 shape가 두 배열의 shape로 간주합니다. 5. 차원에 상관없이 크기가 1인 배열과 1보다 큰 배열이 있을 때, 크기가 1인 배열은 자신의 차원 수만큼 복사되어 쌓인 것처럼 간주합니다. -======= -4. 브로드캐스팅이 이뤄지면, 각 배열 shape의 요소별 최소공배수로 이루어진 shape가 두 배열의 shape로 간주됩니다. -5. 차원에 상관없이 크기가 1인 배열과 1보다 큰 배열이 있을때, 크기가 1인 배열은 자신의 차원수만큼 복사되어 쌓인것처럼 간주된다. 
- ->>>>>>> 5b00df6adc57e61e9ec7627036d093ccddae1ac6 설명이 이해하기 부족하다면 [scipy문서](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)나 [scipy위키](http://wiki.scipy.org/EricsBroadcastingDoc)를 참조하세요. -브로드캐스팅을 지원하는 함수를 *universal functions*라고 합니다. +브로드캐스팅을 지원하는 함수를 *universal functions*라고 합니다. *universal functions* 목록은 [문서](http://docs.scipy.org/doc/numpy/reference/ufuncs.html#available-ufuncs)를 참조하세요. 브로드캐스팅을 응용한 예시들입니다: @@ -855,11 +825,7 @@ import numpy as np # 벡터의 외적을 계산 v = np.array([1,2,3]) # v의 shape는 (3,) w = np.array([4,5]) # w의 shape는 (2,) -<<<<<<< HEAD # 외적을 계산하기 위해, 먼저 v를 shape가 (3,1)인 행벡터로 바꿔야 합니다; -======= -# 외적을 게산하기 위해, 먼저 v를 shape가 (3,1)인 행벡터로 바꿔야 합니다; ->>>>>>> 5b00df6adc57e61e9ec7627036d093ccddae1ac6 # 그다음 이것을 w에 맞춰 브로드캐스팅한뒤 결과물로 shape가 (3,2)인 행렬을 얻습니다, # 이 행렬은 v와 w 외적의 결과입니다: # [[ 4 5] @@ -877,19 +843,15 @@ print x + v # 벡터를 행렬의 각 행에 더하기 # x는 shape가 (2, 3)이고 w는 shape가 (2,)입니다. -# x의 전치행렬은 shape가 (3,2)이며 이는 w와 브로드캐스팅이 가능하고 결과로 shape가 (3,2)인 행렬이 생깁니다; -# 이 행렬을 전치하면 shape가 (2,3)인 행렬이 나오며 +# x의 전치행렬은 shape가 (3,2)이며 이는 w와 브로드캐스팅이 가능하고 결과로 shape가 (3,2)인 행렬이 생깁니다; +# 이 행렬을 전치하면 shape가 (2,3)인 행렬이 나오며 # 이는 행렬 x의 각 열에 벡터 w을 더한 결과와 동일합니다. # 아래의 행렬입니다: # [[ 5 6 7] # [ 9 10 11]] print (x.T + w).T # 다른 방법은 w를 shape가 (2,1)인 열벡터로 변환하는 것입니다; -<<<<<<< HEAD # 그런 다음 이를 바로 x에 브로드캐스팅해 더하면 -======= -# 그런다음 이를 바로 x에 브로드캐스팅해 더하면 ->>>>>>> 5b00df6adc57e61e9ec7627036d093ccddae1ac6 # 동일한 결과가 나옵니다. print x + np.reshape(w, (2, 1)) @@ -934,13 +896,8 @@ from scipy.misc import imread, imsave, imresize img = imread('assets/cat.jpg') print img.dtype, img.shape # 출력 "uint8 (400, 248, 3)" -<<<<<<< HEAD # 각각의 색깔 채널을 다른 상수값으로 스칼라배함으로써 # 이미지의 색을 변화시킬 수 있습니다. -======= -# 각각의 색깔 채널을 다른 상수값으로 스칼라배함으로써 -# 이미지의 색을 변화시킬수 있습니다. ->>>>>>> 5b00df6adc57e61e9ec7627036d093ccddae1ac6 # 이미지의 shape는 (400, 248, 3)입니다; # 여기에 shape가 (3,)인 배열 [1, 0.95, 0.9]를 곱합니다; # numpy 브로드캐스팅에 의해 이 배열이 곱해지며 붉은색 채널은 변하지 않으며, @@ -966,7 +923,7 @@ imsave('assets/cat_tinted.jpg', img_tinted) ### MATLAB 파일 -`scipy.io.loadmat` 와 `scipy.io.savemat`함수를 통해 +`scipy.io.loadmat` 와 `scipy.io.savemat`함수를 통해 matlab 파일을 읽고 쓸 수 있습니다. [문서](http://docs.scipy.org/doc/scipy/reference/io.html)를 참조하세요. @@ -1004,7 +961,7 @@ print d ## Matplotlib -[Matplotlib](http://matplotlib.org/)는 plotting 라이브러리입니다. +[Matplotlib](http://matplotlib.org/)는 plotting 라이브러리입니다. 이번에는 MATLAB의 plotting 시스템과 유사한 기능을 제공하는 `matplotlib.pyplot` 모듈에 관한 간략한 소개가 있겠습니다., @@ -1018,7 +975,7 @@ matplotlib에서 가장 중요한 함수는 2차원 데이터를 그릴 수 있 import numpy as np import matplotlib.pyplot as plt -# 사인과 코사인 곡선의 x,y 좌표를 계산 +# 사인과 코사인 곡선의 x,y 좌표를 계산 x = np.arange(0, 3 * np.pi, 0.1) y = np.sin(x) @@ -1039,7 +996,7 @@ plt.show() # 그래프를 나타나게 하기 위해선 plt.show()함수를 호 import numpy as np import matplotlib.pyplot as plt -# 사인과 코사인 곡선의 x,y 좌표를 계산 +# 사인과 코사인 곡선의 x,y 좌표를 계산 x = np.arange(0, 3 * np.pi, 0.1) y_sin = np.sin(x) y_cos = np.cos(x) @@ -1070,7 +1027,7 @@ plt.show() import numpy as np import matplotlib.pyplot as plt -# 사인과 코사인 곡선의 x,y 좌표를 계산 +# 사인과 코사인 곡선의 x,y 좌표를 계산 x = np.arange(0, 3 * np.pi, 0.1) y_sin = np.sin(x) y_cos = np.cos(x) @@ -1119,13 +1076,8 @@ plt.imshow(img) # 색변화된 이미지 나타내기 plt.subplot(1, 2, 2) -<<<<<<< HEAD # imshow를 이용하며 주의할 점은 데이터의 자료형이 # uint8이 아니라면 이상한 결과를 보여줄 수도 있다는 것입니다. -======= -# imshow를 이용하며 주의할 점은 데이터의 자료형이 -# uint8이 아니라면 이상한 결과를 보여줄수도 있다는 것입니다. ->>>>>>> 5b00df6adc57e61e9ec7627036d093ccddae1ac6 # 그러므로 이미지를 나타내기 전에 명시적으로 자료형을 uint8로 형변환 해줍니다. plt.imshow(np.uint8(img_tinted)) @@ -1135,8 +1087,3 @@ plt.show()
- ---- -

-번역: 강상훈 (sanghkaang) -

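앞에서 설명한 numpy 브로드캐스팅 규칙은 연산 결과의 shape를 직접 찍어 보면 이해하기 쉽습니다. 아래는 임의의 shape를 가정한 간단한 확인용 스케치입니다 (배열의 값 자체는 의미가 없습니다):

~~~python
import numpy as np

a = np.zeros((4, 3))  # shape (4, 3)
b = np.zeros(3)       # shape (3,)  -> (1, 3) -> (4, 3)으로 간주됨
c = np.zeros((4, 1))  # shape (4, 1) -> (4, 3)으로 간주됨

print (a + b).shape   # 출력 "(4, 3)"
print (a + c).shape   # 출력 "(4, 3)"
# np.zeros(4)는 shape가 (4,)라서 (4, 3)과 compatible하지 않으므로 a + np.zeros(4)는 에러가 납니다
~~~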
From baaf6c58c7ca1b4db2c4f0addd14b7cf390d7af6 Mon Sep 17 00:00:00 2001 From: Sanghun Kang Date: Wed, 8 Jun 2016 22:14:32 +0900 Subject: [PATCH 176/199] Update python-numpy-tutorial.md --- python-numpy-tutorial.md | 219 ++++++++++++++++++++------------------- 1 file changed, 112 insertions(+), 107 deletions(-) diff --git a/python-numpy-tutorial.md b/python-numpy-tutorial.md index cba82cbe..dcd5fd37 100644 --- a/python-numpy-tutorial.md +++ b/python-numpy-tutorial.md @@ -21,18 +21,18 @@ Python: Numpy --> -이 튜토리얼은 [Justin Johnson](http://cs.stanford.edu/people/jcjohns/) 에 의해 작성되었습니다. +이 튜토리얼은 [Justin Johnson](http://cs.stanford.edu/people/jcjohns/)에 의해 작성되었습니다. cs231n 수업의 모든 과제에서는 프로그래밍 언어로 파이썬을 사용할 것입니다. -파이썬은 그 자체만으로도 훌륭한 범용 프로그래밍 언어이지만, 몇몇 라이브러리(numpy, scipy, matplotlib)의 도움으로 +파이썬은 그 자체만으로도 훌륭한 범용 프로그래밍 언어이지만, 몇몇 라이브러리(numpy, scipy, matplotlib)의 도움으로 계산과학 분야에서 강력한 개발 환경을 갖추게 됩니다. -많은 분들이 파이썬과 numpy를 경험 해보셨을 거라고 생각합니다. 경험하지 못했을지라도 이 문서를 통해 -'프로그래밍 언어로서의 파이썬'과 '파이썬을 계산과학에 활용하는 법'을 빠르게 훑을 수 있습니다. +많은 분들이 파이썬과 numpy를 경험 해보셨을거라고 생각합니다. 경험 하지 못했을지라도 이 문서를 통해 +'프로그래밍 언어로서의 파이썬'과 '파이썬을 계산과학에 활용하는법'을 빠르게 훑을 수 있습니다. -만약 Matlab을 사용해보셨다면, [Matlab 사용자를 위한 numpy](http://wiki.scipy.org/NumPy_for_Matlab_Users) 페이지를 추천해 드립니다. +만약 Matlab을 사용해보셨다면, [Matlab사용자를 위한 numpy](http://wiki.scipy.org/NumPy_for_Matlab_Users) 페이지를 추천해 드립니다. -또한 [CS 228](http://cs.stanford.edu/~ermon/cs228/index.html) 수업을 위해 [Volodymyr Kuleshov](http://web.stanford.edu/~kuleshov/) 와 [Isaac Caswell](https://symsys.stanford.edu/viewing/symsysaffiliate/21335) 가 만든 [이 튜토리얼의 IPython notebook 버전](https://github.com/kuleshov/cs228-material/blob/master/tutorials/python/cs228-python-tutorial.ipynb) 도 참조할 수 있습니다. +또한 [CS 228](http://cs.stanford.edu/~ermon/cs228/index.html)수업을 위해 [Volodymyr Kuleshov](http://web.stanford.edu/~kuleshov/) 와 [Isaac Caswell](https://symsys.stanford.edu/viewing/symsysaffiliate/21335)가 만든 [이 튜토리얼의 IPython notebook 버전](https://github.com/kuleshov/cs228-material/blob/master/tutorials/python/cs228-python-tutorial.ipynb)도 참조 할 수 있습니다. 목차: @@ -48,7 +48,7 @@ cs231n 수업의 모든 과제에서는 프로그래밍 언어로 파이썬을 - [Numpy](#numpy) - [배열](#numpy-arrays) - [배열 인덱싱](#numpy-array-indexing) - - [데이터 타입](#numpy-datatypes) + - [데이터타입](#numpy-datatypes) - [배열 연산](#numpy-math) - [브로드캐스팅](#numpy-broadcasting) - [SciPy](#scipy) @@ -62,10 +62,10 @@ cs231n 수업의 모든 과제에서는 프로그래밍 언어로 파이썬을 -## 파이썬 -파이썬은 고급 프로그래밍 언어(사람이 이해하기 쉽게 작성된 언어)이며, 다중패러다임을 지원하는 동적 프로그래밍 언어입니다. -짧지만 가독성 높은 코드 몇 줄로 수준 높은 아이디어들을 표현할 수 있기에 파이썬 코드는 거의 수도코드처럼 보인다고도 합니다. -아래는 quicksort 알고리즘을 파이썬으로 구현한 예시입니다: +## Python +파이썬은 고차원이고, 다중패러다임을 지원하는 동적 프로그래밍 언어입니다. +짧지만 가독성 높은 코드 몇 줄로 수준 높은 아이디어들을 표현할수있기에 파이썬 코드는 거의 수도코드처럼 보인다고도 합니다. +아래는 quicksort알고리즘의 파이썬 구현 예시입니다: ~~~python def quicksort(arr): @@ -76,25 +76,25 @@ def quicksort(arr): middle = [x for x in arr if x == pivot] right = [x for x in arr if x > pivot] return quicksort(left) + middle + quicksort(right) - + print quicksort([3,6,8,10,1,2,1]) # 출력 "[1, 1, 2, 3, 6, 8, 10]" ~~~ ### 파이썬 버전 -현재 파이썬에는 두 가지 버전이 있습니다. 파이썬 2.7 그리고 파이썬 3.4입니다. +현재 파이썬에는 두가지 버전이 있습니다. 파이썬 2.7 그리고 파이썬 3.4입니다. 혼란스럽게도, 파이썬3은 기존 파이썬2와 호환되지 않게 변경된 부분이 있습니다. 그러므로 파이썬 2.7로 쓰여진 코드는 3.4환경에서 동작하지 않고 그 반대도 마찬가지입니다. 이 수업에선 파이썬 2.7을 사용합니다. -커맨드라인에 아래의 명령어를 입력해서 현재 설치된 파이썬 버전을 확인할 수 있습니다. +커맨드라인에 아래의 명령어를 입력해서 현재 설치된 파이썬 버전을 확인 할 수 있습니다. `python --version`. ### 기본 자료형 -다른 프로그래밍 언어들처럼, 파이썬에는 정수, 실수, 불리언, 문자열 같은 기본 자료형이 있습니다. +다른 프로그래밍 언어들처럼, 파이썬에는 정수, 실수, 불리언, 문자열같은 기본 자료형이 있습니다. 파이썬 기본 자료형 역시 다른 프로그래밍 언어와 유사합니다. 
**숫자:** 다른 언어와 마찬가지로 파이썬의 정수형(Integers)과 실수형(floats) 데이터 타입 역시 동일한 역할을 합니다 : @@ -115,12 +115,12 @@ y = 2.5 print type(y) # 출력 "" print y, y + 1, y * 2, y ** 2 # 출력 "2.5 3.5 5.0 6.25" ~~~ -다른 언어들과는 달리, 파이썬에는 증감 단항연산자(`x++`, `x--`)가 없습니다. +다른 언어들과는 달리, 파이썬에는 증감 단항연상자(`x++`, `x--`)가 없습니다. -파이썬 역시 long 정수형과 복소수 데이터 타입이 구현되어 있습니다. +파이썬 역시 long 정수형과 복소수 데이터 타입이 구현되어 있습니다. 자세한 사항은 [문서](https://docs.python.org/2/library/stdtypes.html#numeric-types-int-float-long-complex)에서 찾아볼 수 있습니다. -**불리언(Booleans):** 파이썬에는 논리 자료형의 모든 연산자가 구현되어 있습니다. +**불리언(Booleans):** 파이썬에는 논리 자료형의 모든 연산자들이 구현되어 있습니다. 그렇지만 기호(`&&`, `||`, 등.) 대신 영어 단어로 구현되어 있습니다 : ~~~python @@ -130,14 +130,14 @@ print type(t) # 출력 "" print t and f # 논리 AND; 출력 "False" print t or f # 논리 OR; 출력 "True" print not t # 논리 NOT; 출력 "False" -print t != f # 논리 XOR; 출력 "True" +print t != f # 논리 XOR; 출력 "True" ~~~ **문자열:** 파이썬은 문자열과 연관된 다양한 기능을 지원합니다: ~~~python -hello = 'hello' # String 문자열을 표현할 땐 따옴표나 -world = "world" # 쌍따옴표가 사용됩니다; 어떤 걸 써도 상관없습니다. +hello = 'hello' # String 문자열을 표현할땐 따옴표나 +world = "world" # 쌍따옴표가 사용됩니다; 어떤걸 써도 상관없습니다. print hello # 출력 "hello" print len(hello) # 문자열 길이; 출력 "5" hw = hello + ' ' + world # 문자열 연결 @@ -150,15 +150,15 @@ print hw12 # 출력 "hello world 12" ~~~python s = "hello" -print s.capitalize() # 문자열을 대문자로 시작하게 함; 출력 "Hello" +print s.capitalize() # 문자열을 대문자로 시작하게함; 출력 "Hello" print s.upper() # 모든 문자를 대문자로 바꿈; 출력 "HELLO" print s.rjust(7) # 문자열 오른쪽 정렬, 빈공간은 여백으로 채움; 출력 " hello" print s.center(7) # 문자열 가운데 정렬, 빈공간은 여백으로 채움; 출력 " hello " -print s.replace('l', '(ell)') # 첫 번째 인자로 온 문자열을 두 번째 인자 문자열로 바꿈; +print s.replace('l', '(ell)') # 첫번째 인자로 온 문자열을 두번째 인자 문자열로 바꿈; # 출력 "he(ell)(ell)o" print ' world '.strip() # 문자열 앞뒤 공백 제거; 출력 "world" ~~~ -모든 문자열 메소드는 [문서](https://docs.python.org/2/library/stdtypes.html#string-methods)에서 찾아볼 수 있습니다. +모든 문자열 메소드는 [문서](https://docs.python.org/2/library/stdtypes.html#string-methods)에서 찾아볼 수 있습니다. @@ -168,14 +168,14 @@ print ' world '.strip() # 문자열 앞뒤 공백 제거; 출력 "world" #### 리스트 -리스트는 파이썬에서 배열 같은 존재입니다. 그렇지만 배열과 달리 크기 변경이 가능하고 -서로 다른 자료형일지라도 하나의 리스트에 저장될 수 있습니다: +리스트는 파이썬에서 배열같은 존재입니다. 그렇지만 배열과 달리 크기 변경이 가능하고 +서로 다른 자료형일지라도 하나의 리스트에 저장 될 수 있습니다: ~~~python xs = [3, 1, 2] # 리스트 생성 print xs, xs[2] # 출력 "[3, 1, 2] 2" print xs[-1] # 인덱스가 음수일 경우 리스트의 끝에서부터 세어짐; 출력 "2" -xs[2] = 'foo' # 리스트는 자료형이 다른 요소들을 저장할 수 있습니다 +xs[2] = 'foo' # 리스트는 자료형이 다른 요소들을 저장 할 수 있습니다 print xs # 출력 "[3, 1, 'foo']" xs.append('bar') # 리스트의 끝에 새 요소 추가 print xs # 출력 "[3, 1, 'foo', 'bar']" @@ -185,7 +185,7 @@ print x, xs # 출력 "bar [3, 1, 'foo']" 마찬가지로, 리스트에 대해 자세한 사항은 [문서](https://docs.python.org/2/tutorial/datastructures.html#more-on-lists)에서 찾아볼 수 있습니다. **슬라이싱:** -리스트의 요소로 한 번에 접근하는 것 이외에도, 파이썬은 리스트의 일부분에만 접근하는 간결한 문법을 제공합니다; +리스트의 요소로 한번에 접근하는것 이외에도, 파이썬은 리스트의 일부분에만 접근하는 간결한 문법을 제공합니다; 이를 *슬라이싱*이라고 합니다: ~~~python @@ -199,7 +199,7 @@ print nums[:-1] # 슬라이싱 인덱스는 음수도 가능; 출력 ["0, 1, nums[2:4] = [8, 9] # 슬라이스된 리스트에 새로운 리스트 할당 print nums # 출력 "[0, 1, 8, 9, 4]" ~~~ -numpy 배열 부분에서 다시 슬라이싱을 보게 될 것입니다. +numpy 배열 부분에서 다시 슬라이싱을 보게될것입니다. **반복문:** 아래와 같이 리스트의 요소들을 반복해서 조회할 수 있습니다: @@ -220,7 +220,7 @@ for idx, animal in enumerate(animals): ~~~ **리스트 comprehensions:** -프로그래밍을 하다 보면, 자료형을 변환해야 하는 경우가 자주 있습니다. +프로그래밍을 하다보면, 자료형을 변환해야 하는 경우가 자주 있습니다. 
간단한 예를 들자면, 숫자의 제곱을 계산하는 다음의 코드를 보세요: @@ -240,7 +240,7 @@ squares = [x ** 2 for x in nums] print squares # 출력 [0, 1, 4, 9, 16] ~~~ -리스트 comprehensions에 조건을 추가할 수도 있습니다: +리스트 comprehensions에 조건을 추가 할 수도 있습니다: ~~~python nums = [0, 1, 2, 3, 4] @@ -257,16 +257,16 @@ print even_squares # 출력 "[0, 4, 16]" ~~~python d = {'cat': 'cute', 'dog': 'furry'} # 새로운 딕셔너리를 만듭니다 print d['cat'] # 딕셔너리의 값을 받음; 출력 "cute" -print 'cat' in d # 딕셔너리가 주어진 열쇠를 가졌는지 확인; 출력 "True" +print 'cat' in d # 딕셔너리가 주어진 열쇠를 가지고 있는지 확인; 출력 "True" d['fish'] = 'wet' # 딕셔너리의 값을 지정 print d['fish'] # 출력 "wet" # print d['monkey'] # KeyError: 'monkey' not a key of d print d.get('monkey', 'N/A') # 딕셔너리의 값을 받음. 존재하지 않는 다면 'N/A'; 출력 "N/A" print d.get('fish', 'N/A') # 딕셔너리의 값을 받음. 존재하지 않는 다면 'N/A'; 출력 "wet" del d['fish'] # 딕셔너리에 저장된 요소 삭제 -print d.get('fish', 'N/A') # "fish"는 더 이상 열쇠가 아님; 출력 "N/A" +print d.get('fish', 'N/A') # "fish"는 더이상 열쇠가 아님; 출력 "N/A" ~~~ -딕셔너리에 관해 더 알고 싶다면 [문서](https://docs.python.org/2/library/stdtypes.html#dict)를 참조하세요. +딕셔너리에 관해 더 알고싶다면 [문서](https://docs.python.org/2/library/stdtypes.html#dict)를 참조하세요. **반복문:** 딕셔너리의 열쇠는 쉽게 반복될 수 있습니다: @@ -288,7 +288,7 @@ for animal, legs in d.iteritems(): ~~~ **딕셔너리 comprehensions:** -리스트 comprehensions과 유사한 딕셔너리 comprehensions을 통해 손쉽게 딕셔너리를 만들 수 있습니다. +리스트 comprehensions과 유사한 딕셔너리 comprehensions을 통해 손쉽게 딕셔너리를 만들수 있습니다. 예시: ~~~python @@ -300,7 +300,7 @@ print even_num_to_square # 출력 "{0: 0, 2: 4, 4: 16}" #### 집합 -집합은 순서 구분이 없고 서로 다른 요소 간의 모임입니다. 예시: +집합은 순서 구분이 없고 서로 다른 요소간의 모임입니다. 예시: ~~~python animals = {'cat', 'dog'} @@ -315,11 +315,11 @@ animals.remove('cat') # Remove an element from a set print len(animals) # 출력 "2" ~~~ -마찬가지로, 집합에 관해 더 알고 싶다면 [문서](https://docs.python.org/2/library/sets.html#set-objects)를 참조하세요. +마찬가지로, 집합에 관해 더 알고싶다면 [문서](https://docs.python.org/2/library/sets.html#set-objects)를 참조하세요. **반복문:** 집합을 반복하는 구문은 리스트 반복 구문과 동일합니다; -그러나 집합은 순서가 없어서, 어떤 순서로 반복될지 추측할 순 없습니다: +그러나 집합은 순서가 없어서, 어떤 순서로 반복될지 추측할순 없습니다: ~~~python animals = {'cat', 'dog', 'fish'} @@ -340,8 +340,8 @@ print nums # 출력 "set([0, 1, 2, 3, 4, 5])" #### 튜플 -튜플은 요소 간 순서가 있으며 값이 변하지 않는 리스트입니다. -튜플은 많은 면에서 리스트와 유사합니다; 가장 중요한 차이점은 튜플은 '딕셔너리의 열쇠'와 '집합의 요소'가 될 수 있지만, 리스트는 불가능하다는 점입니다. +튜플은 요소들 간 순서가 있으며 값이 변하지 않는 리스트입니다. +튜플은 많은 면에서 리스트와 유사합니다; 가장 중요한 차이점은 튜플은 '딕셔너리의 열쇠'와 '집합의 요소'가 될 수 있지만 리스트는 불가능하다는 점입니다. 여기 간단한 예시가 있습니다: ~~~python @@ -394,18 +394,18 @@ hello('Fred', loud=True) # 출력 "HELLO, FRED!" ~~~python class Greeter(object): - + # 생성자 def __init__(self, name): self.name = name # 인스턴스 변수 선언 - + # 인스턴스 메소드 def greet(self, loud=False): if loud: print 'HELLO, %s!' % self.name.upper() else: print 'Hello, %s' % self.name - + g = Greeter('Fred') # Greeter 클래스의 인스턴스 생성 g.greet() # 인스턴스 메소드 호출; 출력 "Hello, Fred" g.greet(loud=True) # 인스턴스 메소드 호출; 출력 "HELLO, FRED!" @@ -423,10 +423,10 @@ Numpy는 고성능의 다차원 배열 객체와 이를 다룰 도구를 제공 ### 배열 -Numpy 배열은 동일한 자료형을 가지는 값들이 격자판 형태로 있는 것입니다. 각각의 값들은 튜플(이때 튜플은 양의 정수만을 요소값으로 갖습니다.) 형태로 색인 됩니다. -*rank*는 배열이 몇 차원인지를 의미합니다; *shape*는 는 각 차원의 크기를 알려주는 정수들이 모인 튜플입니다. +Numpy 배열은 동일한 자료형을 가지는 값들이 격자판 형태로 있는 것입니다. 각각의 값들은 튜플(이때 튜플은 양의 정수만을 요소값으로 갖습니다.) 형태로 색인됩니다. +*rank*는 배열이 몇차원인지를 의미합니다; *shape*는 는 각 차원의 크기를 알려주는 정수들이 모인 튜플입니다. -파이썬의 리스트를 중첩해 Numpy 배열을 초기화 할 수 있고, 대괄호를 통해 각 요소에 접근할 수 있습니다: +파이썬의 리스트를 중첩해 Numpy 배열을 초기화 할 수 있고, 대괄호를 통해 각 요소에 접근할 수 있습니다: ~~~python import numpy as np @@ -451,7 +451,7 @@ import numpy as np a = np.zeros((2,2)) # 모든 값이 0인 배열 생성 print a # 출력 "[[ 0. 0.] # [ 0. 0.]]" - + b = np.ones((1,2)) # 모든 값이 1인 배열 생성 print b # 출력 "[[ 1. 
1.]]" @@ -459,10 +459,10 @@ c = np.full((2,2), 7) # 모든 값이 특정 상수인 배열 생성 print c # 출력 "[[ 7. 7.] # [ 7. 7.]]" -d = np.eye(2) # 2x2 단위행렬 생성 +d = np.eye(2) # 2x2 단위 행렬 생성 print d # 출력 "[[ 1. 0.] # [ 0. 1.]]" - + e = np.random.random((2,2)) # 임의의 값으로 채워진 배열 생성 print e # 임의의 값 출력 "[[ 0.91940167 0.08143941] # [ 0.68744134 0.87236687]]" @@ -472,10 +472,10 @@ print e # 임의의 값 출력 "[[ 0.91940167 0.08143941] ### 배열 인덱싱 -Numpy는 배열을 인덱싱하는 몇 가지 방법을 제공합니다. +Numpy는 배열을 인덱싱하는 몇가지 방법을 제공합니다. **슬라이싱:** -파이썬 리스트와 유사하게, Numpy 배열도 슬라이싱이 가능합니다. Numpy 배열은 다차원인 경우가 많기에, 각 차원별로 어떻게 슬라이스할건지 명확히 해야 합니다: +파이썬 리스트와 유사하게, Numpy 배열도 슬라이싱이 가능합니다. Numpy 배열은 다차원인 경우가 많기에, 각 차원별로 어떻게 슬라이스할건지 명확히 해야합니다: ~~~python import numpy as np @@ -486,7 +486,7 @@ import numpy as np # [ 9 10 11 12]] a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]]) -# 슬라이싱을 이용하여 첫 두 행과 1열, 2열로 이루어진 부분배열을 만들어 봅시다; +# 슬라이싱을 이용하여 첫 두행과 1열,2열로 이루어진 부분배열을 만들어 봅시다; # b는 shape가 (2,2)인 배열이 됩니다: # [[2 3] # [6 7]] @@ -514,11 +514,11 @@ import numpy as np # [ 9 10 11 12]] a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]]) -# 배열의 중간 행에 접근하는 두 가지 방법이 있습니다. +# 배열의 중간 행에 접근하는 두가지 방법이 있습니다. # 정수 인덱싱과 슬라이싱을 혼합해서 사용하면 낮은 rank의 배열이 생성되지만, # 슬라이싱만 사용하면 원본 배열과 동일한 rank의 배열이 생성됩니다. -row_r1 = a[1, :] # 배열a의 두 번째 행을 rank가 1인 배열로 -row_r2 = a[1:2, :] # 배열a의 두 번째 행을 rank가 2인 배열로 +row_r1 = a[1, :] # 배열a의 두번째 행을 rank가 1인 배열로 +row_r2 = a[1:2, :] # 배열a의 두번째 행을 rank가 2인 배열로 print row_r1, row_r1.shape # 출력 "[5 6 7 8] (4,)" print row_r2, row_r2.shape # 출력 "[[5 6 7 8]] (1, 4)" @@ -533,7 +533,7 @@ print col_r2, col_r2.shape # 출력 "[[ 2] **정수 배열 인덱싱:** Numpy 배열을 슬라이싱하면, 결과로 얻어지는 배열은 언제나 원본 배열의 부분 배열입니다. -그러나 정수 배열 인덱싱을 한다면, 원본과 다른 배열을 만들 수 있습니다. +그러나 정수 배열 인덱싱을 한다면, 원본과 다른 배열을 만들수 있습니다. 여기에 예시가 있습니다: ~~~python @@ -549,7 +549,7 @@ print a[[0, 1, 2], [0, 1, 0]] # 출력 "[1 4 5]" print np.array([a[0, 0], a[1, 1], a[2, 0]]) # 출력 "[1 4 5]" # 정수 배열 인덱싱을 사용할 때, -# 원본 배열의 같은 요소를 재사용할 수 있습니다: +# 원본 배열의 같은 요소를 재사용 할 수 있습니다: print a[[0, 0], [1, 1]] # 출력 "[2 2]" # 위 예제는 다음과 동일합니다 @@ -586,8 +586,8 @@ print a # 출력 "array([[11, 2, 3], ~~~ **불리언 배열 인덱싱:** -불리언 배열 인덱싱을 통해 배열 속 요소를 취사선택할 수 있습니다. -불리언 배열 인덱싱은 특정 조건을 만족하게 하는 요소만 선택하고자 할 때 자주 사용됩니다. +불리언 배열 인덱싱을 통해 배열속 요소를 취사 선택할 수 있습니다. +불리언 배열 인덱싱은 특정 조건을 만족시키는 요소만 선택하고자 할 때 자주 사용됩니다. 다음은 그 예시입니다: ~~~python @@ -597,15 +597,15 @@ a = np.array([[1,2], [3, 4], [5, 6]]) bool_idx = (a > 2) # 2보다 큰 a의 요소를 찾습니다; # 이 코드는 a와 shape가 같고 불리언 자료형을 요소로 하는 numpy 배열을 반환합니다, - # bool_idx의 각 요소는 동일한 위치에 있는 a의 + # bool_idx의 각 요소는 동일한 위치에 있는 a의 # 요소가 2보다 큰지를 말해줍니다. - + print bool_idx # 출력 "[[False False] # [ True True] # [ True True]]" -# 불리언 배열 인덱싱을 통해 bool_idx에서 -# 참 값을 가지는 요소로 구성되는 +# 불리언 배열 인덱싱을 통해 bool_idx에서 +# 참 값을 가지는 요소로 구성되는 # rank 1인 배열을 구성할 수 있습니다. print a[bool_idx] # 출력 "[3 4 5 6]" @@ -620,8 +620,8 @@ print a[a > 2] # 출력 "[3 4 5 6]" ### 자료형 Numpy 배열은 동일한 자료형을 가지는 값들이 격자판 형태로 있는 것입니다. -Numpy에선 배열을 구성하는 데 사용할 수 있는 다양한 숫자 자료형을 제공합니다. -Numpy는 배열이 생성될 때 자료형을 스스로 추측합니다, 그러나 배열을 생성할 때 명시적으로 특정 자료형을 지정할 수도 있습니다. 예시: +Numpy에선 배열을 구성하는데 사용할 수 있는 다양한 숫자 자료형을 제공합니다. +Numpy는 배열이 생성될 때 자료형을 스스로 추측합니다, 그러나 배열을 생성할 때 명시적으로 특정 자료형을 지정할수도 있습니다. 예시: ~~~python import numpy as np @@ -678,7 +678,7 @@ print np.divide(x, y) print np.sqrt(x) ~~~ -MATLAB과 달리, '*'은 행렬 곱이 아니라 요소별 곱입니다. Numpy에선 벡터의 내적, 벡터와 행렬의 곱, 행렬곱을 위해 '*'대신 'dot'함수를 사용합니다. 'dot'은 Numpy 모듈 함수로서도 배열 객체의 인스턴스 메소드로서도 이용 가능한 합수입니다: +MATLAB과 달리, '*'은 행렬곱이 아니라 요소별 곱입니다. Numpy에선 벡터의 내적, 벡터와 행렬의 곱, 행렬곱을 위해 '*'대신 'dot'함수를 사용합니다. 
'dot'은 Numpy 모듈 함수로서도 배열 객체의 인스턴스 메소드로서도 이용 가능한 합수입니다: ~~~python import numpy as np @@ -693,7 +693,7 @@ w = np.array([11, 12]) print v.dot(w) print np.dot(v, w) -# 행렬과 벡터의 곱; 둘 다 결과는 rank 1인 배열 [29 67] +# 행렬과 벡터의 곱; 둘 다 결과는 rank 1 인 배열 [29 67] print x.dot(v) print np.dot(x, v) @@ -704,7 +704,7 @@ print x.dot(y) print np.dot(x, y) ~~~ -Numpy는 배열 연산에 유용하게 쓰이는 많은 함수를 제공합니다. 가장 유용한 건 'sum'입니다: +Numpy는 배열 연산에 유용하게 쓰이는 많은 함수를 제공합니다. 가장 유용한건 'sum'입니다: ~~~python import numpy as np @@ -715,10 +715,10 @@ print np.sum(x) # 모든 요소를 합한 값을 연산; 출력 "10" print np.sum(x, axis=0) # 각 열에 대한 합을 연산; 출력 "[4 6]" print np.sum(x, axis=1) # 각 행에 대한 합을 연산; 출력 "[3 7]" ~~~ -Numpy가 제공하는 모든 수학함수의 목록은 [문서](http://docs.scipy.org/doc/numpy/reference/routines.math.html)를 참조하세요. +Numpy가 제공하는 모든 수학함수들의 목록은 [문서](http://docs.scipy.org/doc/numpy/reference/routines.math.html)를 참조하세요. -배열연산을 하지 않더라도, 종종 배열의 모양을 바꾸거나 데이터를 처리해야 할 때가 있습니다. -가장 간단한 예는 행렬의 주 대각선을 기준으로 대칭되는 요소끼리 뒤바꾸는 것입니다; 이를 전치라고 하며 행렬을 전치하기 위해선, 간단하게 배열 객체의 'T' 속성을 사용하면 됩니다: +배열연산을 하지 않더라도, 종종 배열의 모양을 바꾸거나 데이터를 처리해야할 때가 있습니다. +가장 간단한 예는 행렬의 주대각선을 기준으로 대칭되는 요소끼리 뒤바꾸는 것입니다; 이를 전치라고 하며 행렬을 전치하기 위해선, 간단하게 배열 객체의 'T' 속성을 사용하면 됩니다: ~~~python import numpy as np @@ -729,7 +729,7 @@ print x # 출력 "[[1 2] print x.T # 출력 "[[1 3] # [2 4]]" -# rank 1인 배열을 전치할 경우 아무 일도 일어나지 않습니다: +# rank 1인 배열을 전치할경우 아무일도 일어나지 않습니다: v = np.array([1,2,3]) print v # 출력 "[1 2 3]" print v.T # 출력 "[1 2 3]" @@ -740,7 +740,7 @@ Numpy는 배열을 다루는 다양한 함수들을 제공합니다; 이러한 ### 브로드캐스팅 -브로트캐스팅은 Numpy에서 shape가 다른 배열 간에도 산술 연산이 가능하게 하는 메커니즘입니다. 종종 작은 배열과 큰 배열이 있을 때, 큰 배열을 대상으로 작은 배열을 여러 번 연산하고자 할 때가 있습니다. 예를 들어, 행렬의 각 행에 상수 벡터를 더하는 걸 생각해보세요. 이는 다음과 같은 방식으로 처리될 수 있습니다: +브로트캐스팅은 Numpy에서 shape가 다른 배열간에도 산술 연산이 가능하게 하는 메커니즘입니다. 종종 작은 배열과 큰 배열이 있을 때, 큰 배열을 대상으로 작은 배열을 여러번 연산하고자 할 때가 있습니다. 예를 들어, 행렬의 각 행에 상수 벡터를 더하는걸 생각해보세요. 이는 다음과 같은 방식으로 처리될 수 있습니다: ~~~python import numpy as np @@ -763,7 +763,7 @@ for i in range(4): print y ~~~ -위의 방식대로 하면 됩니다; 그러나 'x'가 매우 큰 행렬이라면, 파이썬의 명시적 반복문을 이용한 위 코드는 매우 느려질 수 있습니다. 벡터 'v'를 행렬 'x'의 각 행에 더하는 것은 'v'를 여러 개 복사해 수직으로 쌓은 행렬 'vv'를 만들고 이 'vv'를 'x'에 더하는것과 동일합니다. 이 과정을 아래의 코드로 구현할 수 있습니다: +위의 방식대로 하면 됩니다; 그러나 'x'가 매우 큰 행렬이라면, 파이썬의 명시적 반복문을 이용한 위 코드는 매우 느려질 수 있습니다. 벡터 'v'를 행렬 'x'의 각 행에 더하는것은 'v'를 여러개 복사해 수직으로 쌓은 행렬 'vv'를 만들고 이 'vv'를 'x'에 더하는것과 동일합니다. 이 과정을 아래의 코드로 구현할 수 있습니다: ~~~python import numpy as np @@ -772,7 +772,7 @@ import numpy as np # 그 결과를 행렬 y에 저장하고자 합니다 x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]]) v = np.array([1, 0, 1]) -vv = np.tile(v, (4, 1)) # v의 복사본 4개를 위로 차곡차곡 쌓은 것이 vv +vv = np.tile(v, (4, 1)) # v의 복사본 4개를 위로 차곡차곡 쌓은게 vv print vv # 출력 "[[1 0 1] # [1 0 1] # [1 0 1] @@ -784,7 +784,7 @@ print y # 출력 "[[ 2 2 4 # [11 11 13]]" ~~~ -Numpy 브로드캐스팅을 이용한다면 이렇게 v의 복사본을 여러 개 만들지 않아도 동일한 연산을 할 수 있습니다. +Numpy 브로드캐스팅을 이용한다면 이렇게 v의 복사본을 여러개 만들지 않아도 동일한 연산을 할 수 있습니다. 아래는 브로드캐스팅을 이용한 예시 코드입니다: ~~~python @@ -802,19 +802,19 @@ print y # 출력 "[[ 2 2 4] ~~~ `x`의 shape가 `(4, 3)`이고 `v`의 shape가 `(3,)`라도 브로드캐스팅으로 인해 `y = x + v`는 문제없이 수행됩니다; -이때 'v'는 'v'의 복사본이 차곡차곡 쌓인 shape `(4, 3)`처럼 간주되어 'x'와 동일한 shape가 되며 이들 간의 요소별 덧셈연산이 y에 저장됩니다. +이때 'v'는 'v'의 복사본이 차곡차곡 쌓인 shape `(4, 3)`처럼 간주되어 'x'와 동일한 shape가 되며 이들간의 요소별 덧셈연산이 y에 저장됩니다. 두 배열의 브로드캐스팅은 아래의 규칙을 따릅니다: -1. 두 배열이 동일한 rank를 가지고 있지 않다면, 낮은 rank의 1차원 배열이 높은 rank 배열의 shape로 간주합니다. -2. 특정 차원에서 두 배열이 동일한 크기를 갖거나, 두 배열 중 하나의 크기가 1이라면 그 두 배열은 특정 차원에서 *compatible*하다고 여겨집니다. +1. 두 배열이 동일한 rank를 가지고 있지 않다면, 낮은 rank의 1차원 배열이 높은 rank 배열의 shape로 간주됩니다. +2. 특정 차원에서 두 배열이 동일한 크기를 갖거나, 두 배열들 중 하나의 크기가 1이라면 그 두 배열은 특정 차원에서 *compatible*하다고 여겨집니다. 3. 
두 행렬이 모든 차원에서 compatible하다면, 브로드캐스팅이 가능합니다. -4. 브로드캐스팅이 이뤄지면, 각 배열 shape의 요소별 최소공배수로 이루어진 shape가 두 배열의 shape로 간주합니다. -5. 차원에 상관없이 크기가 1인 배열과 1보다 큰 배열이 있을 때, 크기가 1인 배열은 자신의 차원 수만큼 복사되어 쌓인 것처럼 간주합니다. - +4. 브로드캐스팅이 이뤄지면, 각 배열 shape의 요소별 최소공배수로 이루어진 shape가 두 배열의 shape로 간주됩니다. +5. 차원에 상관없이 크기가 1인 배열과 1보다 큰 배열이 있을때, 크기가 1인 배열은 자신의 차원수만큼 복사되어 쌓인것처럼 간주된다. + 설명이 이해하기 부족하다면 [scipy문서](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)나 [scipy위키](http://wiki.scipy.org/EricsBroadcastingDoc)를 참조하세요. -브로드캐스팅을 지원하는 함수를 *universal functions*라고 합니다. +브로드캐스팅을 지원하는 함수를 *universal functions*라고 합니다. *universal functions* 목록은 [문서](http://docs.scipy.org/doc/numpy/reference/ufuncs.html#available-ufuncs)를 참조하세요. 브로드캐스팅을 응용한 예시들입니다: @@ -825,9 +825,9 @@ import numpy as np # 벡터의 외적을 계산 v = np.array([1,2,3]) # v의 shape는 (3,) w = np.array([4,5]) # w의 shape는 (2,) -# 외적을 계산하기 위해, 먼저 v를 shape가 (3,1)인 행벡터로 바꿔야 합니다; +# 외적을 게산하기 위해, 먼저 v를 shape가 (3,1)인 행벡터로 바꿔야 합니다; # 그다음 이것을 w에 맞춰 브로드캐스팅한뒤 결과물로 shape가 (3,2)인 행렬을 얻습니다, -# 이 행렬은 v와 w 외적의 결과입니다: +# 이 행렬은 v 와 w의 외적의 결과입니다: # [[ 4 5] # [ 8 10] # [12 15]] @@ -843,15 +843,15 @@ print x + v # 벡터를 행렬의 각 행에 더하기 # x는 shape가 (2, 3)이고 w는 shape가 (2,)입니다. -# x의 전치행렬은 shape가 (3,2)이며 이는 w와 브로드캐스팅이 가능하고 결과로 shape가 (3,2)인 행렬이 생깁니다; -# 이 행렬을 전치하면 shape가 (2,3)인 행렬이 나오며 +# x의 전치행렬은 shape가 (3,2)이며 이는 w와 브로드캐스팅이 가능하고 결과로 shape가 (3,2)인 행렬이 생깁니다; +# 이 행렬을 전치하면 shape가 (2,3)인 행렬이 나오며 # 이는 행렬 x의 각 열에 벡터 w을 더한 결과와 동일합니다. # 아래의 행렬입니다: # [[ 5 6 7] # [ 9 10 11]] print (x.T + w).T # 다른 방법은 w를 shape가 (2,1)인 열벡터로 변환하는 것입니다; -# 그런 다음 이를 바로 x에 브로드캐스팅해 더하면 +# 그런다음 이를 바로 x에 브로드캐스팅해 더하면 # 동일한 결과가 나옵니다. print x + np.reshape(w, (2, 1)) @@ -876,7 +876,7 @@ numpy에 관한 더 많은 사항은 [numpy 레퍼런스](http://docs.scipy.org/ Numpy는 고성능의 다차원 배열 객체와 이를 다룰 도구를 제공합니다. numpy를 바탕으로 만들어진 [SciPy](http://docs.scipy.org/doc/scipy/reference/)는, -numpy 배열을 다루는 많은 함수를 제공하며 다양한 과학, 공학분야에서 유용하게 사용됩니다. +numpy 배열을 다루는 많은 함수들을 제공하며 다양한 과학, 공학분야에서 유용하게 사용됩니다. SciPy에 익숙해지는 최고의 방법은 [SciPy 공식 문서](http://docs.scipy.org/doc/scipy/reference/index.html)를 보는 것입니다. 이 문서에서는 scipy중 cs231n 수업에서 유용하게 쓰일 일부분만을 소개할것입니다. @@ -885,7 +885,7 @@ SciPy에 익숙해지는 최고의 방법은 [SciPy 공식 문서](http://docs.s ### 이미지 작업 SciPy는 이미지를 다룰 기본적인 함수들을 제공합니다. -예를들자면, 디스크에 저장된 이미지를 numpy 배열로 읽어 들이는 함수가 있으며, +예를들자면, 디스크에 저장된 이미지를 numpy 배열로 읽어들이는 함수가 있으며, numpy 배열을 디스크에 이미지로 저장하는 함수도 있고, 이미지의 크기를 바꾸는 함수도 있습니다. 이 함수들의 간단한 사용 예시입니다: @@ -896,15 +896,15 @@ from scipy.misc import imread, imsave, imresize img = imread('assets/cat.jpg') print img.dtype, img.shape # 출력 "uint8 (400, 248, 3)" -# 각각의 색깔 채널을 다른 상수값으로 스칼라배함으로써 -# 이미지의 색을 변화시킬 수 있습니다. +# 각각의 색깔 채널을 다른 상수값으로 스칼라배함으로써 +# 이미지의 색을 변화시킬수 있습니다. # 이미지의 shape는 (400, 248, 3)입니다; # 여기에 shape가 (3,)인 배열 [1, 0.95, 0.9]를 곱합니다; # numpy 브로드캐스팅에 의해 이 배열이 곱해지며 붉은색 채널은 변하지 않으며, # 초록색, 파란색 채널에는 각각 0.95, 0.9가 곱해집니다 img_tinted = img * [1, 0.95, 0.9] -# 색변경 이미지를 300x300픽셀로 크기 조절. +# 색변경 이미지를 300x300 픽셀로 크기 조절. img_tinted = imresize(img_tinted, (300, 300)) # 색변경 이미지를 디스크에 기록하기 @@ -923,7 +923,7 @@ imsave('assets/cat_tinted.jpg', img_tinted) ### MATLAB 파일 -`scipy.io.loadmat` 와 `scipy.io.savemat`함수를 통해 +`scipy.io.loadmat` 와 `scipy.io.savemat`함수를 통해 matlab 파일을 읽고 쓸 수 있습니다. [문서](http://docs.scipy.org/doc/scipy/reference/io.html)를 참조하세요. @@ -961,21 +961,21 @@ print d ## Matplotlib -[Matplotlib](http://matplotlib.org/)는 plotting 라이브러리입니다. +[Matplotlib](http://matplotlib.org/)는 plotting 라이브러리입니다. 
이번에는 MATLAB의 plotting 시스템과 유사한 기능을 제공하는
`matplotlib.pyplot` 모듈에 관한 간략한 소개가 있겠습니다.

### Plotting

matplotlib에서 가장 중요한 함수는 2차원 데이터를 그릴 수 있게 해주는 `plot`입니다.
여기 간단한 예시가 있습니다:

~~~python
import numpy as np
import matplotlib.pyplot as plt

# 사인과 코사인 곡선의 x,y 좌표를 계산
x = np.arange(0, 3 * np.pi, 0.1)
y = np.sin(x)

# matplotlib를 이용해 점들을 그리기
plt.plot(x, y)
plt.show() # 그래프를 나타나게 하기 위해선 plt.show() 함수를 호출해야만 합니다.
~~~
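참고로, 그래프를 화면에 띄우는 대신 파일로 저장하고 싶은 경우에는 `plt.show()` 대신 `plt.savefig` 함수를 사용할 수 있습니다. 아래는 위 예제를 조금 바꾼 간단한 스케치입니다 (저장 파일명 `sine.png`는 설명을 위해 임의로 정한 예시입니다):

~~~python
import numpy as np
import matplotlib.pyplot as plt

# 사인 곡선의 x,y 좌표를 계산
x = np.arange(0, 3 * np.pi, 0.1)
y = np.sin(x)

plt.plot(x, y)
plt.savefig('sine.png')  # 화면에 띄우는 대신 'sine.png' 파일로 저장 (파일명은 임의의 예시)
~~~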
-약간의 몇 가지 추가적인 작업을 통해 여러 개의 그래프와 제목, 범주, 축 이름을 한 번에 쉽게 나타낼 수 있습니다: +약간의 몇가지 추가적인 작업을 통해 여러개의 그래프와, 제목, 범주, 축 이름을 한번에 쉽게 나타낼 수 있습니다: ~~~python import numpy as np import matplotlib.pyplot as plt -# 사인과 코사인 곡선의 x,y 좌표를 계산 +# 사인과 코사인 곡선의 x,y 좌표를 계산 x = np.arange(0, 3 * np.pi, 0.1) y_sin = np.sin(x) y_cos = np.cos(x) @@ -1020,27 +1020,27 @@ plt.show() ### Subplots -'subplot'함수를 통해 다른 내용도 동일한 그림 위에 나타낼 수 있습니다. +'subplot'함수를 통해 다른 내용들도 동일한 그림위에 나타낼수 있습니다. 여기 간단한 예시가 있습니다: ~~~python import numpy as np import matplotlib.pyplot as plt -# 사인과 코사인 곡선의 x,y 좌표를 계산 +# 사인과 코사인 곡선의 x,y 좌표를 계산 x = np.arange(0, 3 * np.pi, 0.1) y_sin = np.sin(x) y_cos = np.cos(x) # 높이가 2이고 너비가 1인 subplot 구획을 설정하고, -# 첫 번째 구획을 활성화. +# 첫번째 구획을 활성화. plt.subplot(2, 1, 1) -# 첫 번째 그리기 +# 첫번째 그리기 plt.plot(x, y_sin) plt.title('Sine') -# 두 번째 subplot 구획을 활성화 하고 그리기 +# 두번째 subplot 구획을 활성화 하고 그리기 plt.subplot(2, 1, 2) plt.plot(x, y_cos) plt.title('Cosine') @@ -1076,8 +1076,8 @@ plt.imshow(img) # 색변화된 이미지 나타내기 plt.subplot(1, 2, 2) -# imshow를 이용하며 주의할 점은 데이터의 자료형이 -# uint8이 아니라면 이상한 결과를 보여줄 수도 있다는 것입니다. +# imshow를 이용하며 주의할 점은 데이터의 자료형이 +# uint8이 아니라면 이상한 결과를 보여줄수도 있다는 것입니다. # 그러므로 이미지를 나타내기 전에 명시적으로 자료형을 uint8로 형변환 해줍니다. plt.imshow(np.uint8(img_tinted)) @@ -1087,3 +1087,8 @@ plt.show()
+ +--- +

+번역: 강상훈 (sanghkaang) +

From a013c2526c2cdd2674b50657423709a6d2b5e546 Mon Sep 17 00:00:00 2001 From: Sanghun Kang Date: Wed, 8 Jun 2016 22:24:29 +0900 Subject: [PATCH 177/199] typo and orthography correction MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 오타와 맞춤법 수정(부산대 맞춤법 검사기 사용) --- python-numpy-tutorial.md | 220 +++++++++++++++++++-------------------- 1 file changed, 110 insertions(+), 110 deletions(-) diff --git a/python-numpy-tutorial.md b/python-numpy-tutorial.md index dcd5fd37..178b06b3 100644 --- a/python-numpy-tutorial.md +++ b/python-numpy-tutorial.md @@ -21,18 +21,18 @@ Python: Numpy --> -이 튜토리얼은 [Justin Johnson](http://cs.stanford.edu/people/jcjohns/)에 의해 작성되었습니다. +이 튜토리얼은 [Justin Johnson](http://cs.stanford.edu/people/jcjohns/) 에 의해 작성되었습니다. cs231n 수업의 모든 과제에서는 프로그래밍 언어로 파이썬을 사용할 것입니다. -파이썬은 그 자체만으로도 훌륭한 범용 프로그래밍 언어이지만, 몇몇 라이브러리(numpy, scipy, matplotlib)의 도움으로 +파이썬은 그 자체만으로도 훌륭한 범용 프로그래밍 언어이지만, 몇몇 라이브러리(numpy, scipy, matplotlib)의 도움으로 계산과학 분야에서 강력한 개발 환경을 갖추게 됩니다. -많은 분들이 파이썬과 numpy를 경험 해보셨을거라고 생각합니다. 경험 하지 못했을지라도 이 문서를 통해 -'프로그래밍 언어로서의 파이썬'과 '파이썬을 계산과학에 활용하는법'을 빠르게 훑을 수 있습니다. +많은 분들이 파이썬과 numpy를 경험 해보셨을 거라고 생각합니다. 경험하지 못했을지라도 이 문서를 통해 +'프로그래밍 언어로서의 파이썬'과 '파이썬을 계산과학에 활용하는 법'을 빠르게 훑을 수 있습니다. -만약 Matlab을 사용해보셨다면, [Matlab사용자를 위한 numpy](http://wiki.scipy.org/NumPy_for_Matlab_Users) 페이지를 추천해 드립니다. +만약 Matlab을 사용해보셨다면, [Matlab 사용자를 위한 numpy](http://wiki.scipy.org/NumPy_for_Matlab_Users) 페이지를 추천해 드립니다. -또한 [CS 228](http://cs.stanford.edu/~ermon/cs228/index.html)수업을 위해 [Volodymyr Kuleshov](http://web.stanford.edu/~kuleshov/) 와 [Isaac Caswell](https://symsys.stanford.edu/viewing/symsysaffiliate/21335)가 만든 [이 튜토리얼의 IPython notebook 버전](https://github.com/kuleshov/cs228-material/blob/master/tutorials/python/cs228-python-tutorial.ipynb)도 참조 할 수 있습니다. +또한 [CS 228](http://cs.stanford.edu/~ermon/cs228/index.html) 수업을 위해 [Volodymyr Kuleshov](http://web.stanford.edu/~kuleshov/) 와 [Isaac Caswell](https://symsys.stanford.edu/viewing/symsysaffiliate/21335) 가 만든 [이 튜토리얼의 IPython notebook 버전](https://github.com/kuleshov/cs228-material/blob/master/tutorials/python/cs228-python-tutorial.ipynb) 도 참조할 수 있습니다. 목차: @@ -48,7 +48,7 @@ cs231n 수업의 모든 과제에서는 프로그래밍 언어로 파이썬을 - [Numpy](#numpy) - [배열](#numpy-arrays) - [배열 인덱싱](#numpy-array-indexing) - - [데이터타입](#numpy-datatypes) + - [데이터 타입](#numpy-datatypes) - [배열 연산](#numpy-math) - [브로드캐스팅](#numpy-broadcasting) - [SciPy](#scipy) @@ -62,10 +62,10 @@ cs231n 수업의 모든 과제에서는 프로그래밍 언어로 파이썬을 -## Python -파이썬은 고차원이고, 다중패러다임을 지원하는 동적 프로그래밍 언어입니다. -짧지만 가독성 높은 코드 몇 줄로 수준 높은 아이디어들을 표현할수있기에 파이썬 코드는 거의 수도코드처럼 보인다고도 합니다. -아래는 quicksort알고리즘의 파이썬 구현 예시입니다: +## 파이썬 +파이썬은 고급 프로그래밍 언어(사람이 이해하기 쉽게 작성된 언어)이며, 다중패러다임을 지원하는 동적 프로그래밍 언어입니다. +짧지만 가독성 높은 코드 몇 줄로 수준 높은 아이디어들을 표현할 수 있기에 파이썬 코드는 거의 수도코드처럼 보인다고도 합니다. +아래는 quicksort 알고리즘을 파이썬으로 구현한 예시입니다: ~~~python def quicksort(arr): @@ -76,25 +76,25 @@ def quicksort(arr): middle = [x for x in arr if x == pivot] right = [x for x in arr if x > pivot] return quicksort(left) + middle + quicksort(right) - + print quicksort([3,6,8,10,1,2,1]) # 출력 "[1, 1, 2, 3, 6, 8, 10]" ~~~ ### 파이썬 버전 -현재 파이썬에는 두가지 버전이 있습니다. 파이썬 2.7 그리고 파이썬 3.4입니다. +현재 파이썬에는 두 가지 버전이 있습니다. 파이썬 2.7 그리고 파이썬 3.4입니다. 혼란스럽게도, 파이썬3은 기존 파이썬2와 호환되지 않게 변경된 부분이 있습니다. 그러므로 파이썬 2.7로 쓰여진 코드는 3.4환경에서 동작하지 않고 그 반대도 마찬가지입니다. 이 수업에선 파이썬 2.7을 사용합니다. -커맨드라인에 아래의 명령어를 입력해서 현재 설치된 파이썬 버전을 확인 할 수 있습니다. +커맨드라인에 아래의 명령어를 입력해서 현재 설치된 파이썬 버전을 확인할 수 있습니다. `python --version`. ### 기본 자료형 -다른 프로그래밍 언어들처럼, 파이썬에는 정수, 실수, 불리언, 문자열같은 기본 자료형이 있습니다. 
+다른 프로그래밍 언어들처럼, 파이썬에는 정수, 실수, 불리언, 문자열 같은 기본 자료형이 있습니다. 파이썬 기본 자료형 역시 다른 프로그래밍 언어와 유사합니다. **숫자:** 다른 언어와 마찬가지로 파이썬의 정수형(Integers)과 실수형(floats) 데이터 타입 역시 동일한 역할을 합니다 : @@ -115,12 +115,12 @@ y = 2.5 print type(y) # 출력 "" print y, y + 1, y * 2, y ** 2 # 출력 "2.5 3.5 5.0 6.25" ~~~ -다른 언어들과는 달리, 파이썬에는 증감 단항연상자(`x++`, `x--`)가 없습니다. +다른 언어들과는 달리, 파이썬에는 증감 단항연산자(`x++`, `x--`)가 없습니다. -파이썬 역시 long 정수형과 복소수 데이터 타입이 구현되어 있습니다. +파이썬 역시 long 정수형과 복소수 데이터 타입이 구현되어 있습니다. 자세한 사항은 [문서](https://docs.python.org/2/library/stdtypes.html#numeric-types-int-float-long-complex)에서 찾아볼 수 있습니다. -**불리언(Booleans):** 파이썬에는 논리 자료형의 모든 연산자들이 구현되어 있습니다. +**불리언(Booleans):** 파이썬에는 논리 자료형의 모든 연산자가 구현되어 있습니다. 그렇지만 기호(`&&`, `||`, 등.) 대신 영어 단어로 구현되어 있습니다 : ~~~python @@ -130,14 +130,14 @@ print type(t) # 출력 "" print t and f # 논리 AND; 출력 "False" print t or f # 논리 OR; 출력 "True" print not t # 논리 NOT; 출력 "False" -print t != f # 논리 XOR; 출력 "True" +print t != f # 논리 XOR; 출력 "True" ~~~ **문자열:** 파이썬은 문자열과 연관된 다양한 기능을 지원합니다: ~~~python -hello = 'hello' # String 문자열을 표현할땐 따옴표나 -world = "world" # 쌍따옴표가 사용됩니다; 어떤걸 써도 상관없습니다. +hello = 'hello' # String 문자열을 표현할 땐 따옴표나 +world = "world" # 쌍따옴표가 사용됩니다; 어떤 걸 써도 상관없습니다. print hello # 출력 "hello" print len(hello) # 문자열 길이; 출력 "5" hw = hello + ' ' + world # 문자열 연결 @@ -150,15 +150,15 @@ print hw12 # 출력 "hello world 12" ~~~python s = "hello" -print s.capitalize() # 문자열을 대문자로 시작하게함; 출력 "Hello" +print s.capitalize() # 문자열을 대문자로 시작하게 함; 출력 "Hello" print s.upper() # 모든 문자를 대문자로 바꿈; 출력 "HELLO" print s.rjust(7) # 문자열 오른쪽 정렬, 빈공간은 여백으로 채움; 출력 " hello" print s.center(7) # 문자열 가운데 정렬, 빈공간은 여백으로 채움; 출력 " hello " -print s.replace('l', '(ell)') # 첫번째 인자로 온 문자열을 두번째 인자 문자열로 바꿈; +print s.replace('l', '(ell)') # 첫 번째 인자로 온 문자열을 두 번째 인자 문자열로 바꿈; # 출력 "he(ell)(ell)o" print ' world '.strip() # 문자열 앞뒤 공백 제거; 출력 "world" ~~~ -모든 문자열 메소드는 [문서](https://docs.python.org/2/library/stdtypes.html#string-methods)에서 찾아볼 수 있습니다. +모든 문자열 메소드는 [문서](https://docs.python.org/2/library/stdtypes.html#string-methods)에서 찾아볼 수 있습니다. @@ -168,14 +168,14 @@ print ' world '.strip() # 문자열 앞뒤 공백 제거; 출력 "world" #### 리스트 -리스트는 파이썬에서 배열같은 존재입니다. 그렇지만 배열과 달리 크기 변경이 가능하고 -서로 다른 자료형일지라도 하나의 리스트에 저장 될 수 있습니다: +리스트는 파이썬에서 배열 같은 존재입니다. 그렇지만 배열과 달리 크기 변경이 가능하고 +서로 다른 자료형일지라도 하나의 리스트에 저장될 수 있습니다: ~~~python xs = [3, 1, 2] # 리스트 생성 print xs, xs[2] # 출력 "[3, 1, 2] 2" print xs[-1] # 인덱스가 음수일 경우 리스트의 끝에서부터 세어짐; 출력 "2" -xs[2] = 'foo' # 리스트는 자료형이 다른 요소들을 저장 할 수 있습니다 +xs[2] = 'foo' # 리스트는 자료형이 다른 요소들을 저장할 수 있습니다 print xs # 출력 "[3, 1, 'foo']" xs.append('bar') # 리스트의 끝에 새 요소 추가 print xs # 출력 "[3, 1, 'foo', 'bar']" @@ -185,7 +185,7 @@ print x, xs # 출력 "bar [3, 1, 'foo']" 마찬가지로, 리스트에 대해 자세한 사항은 [문서](https://docs.python.org/2/tutorial/datastructures.html#more-on-lists)에서 찾아볼 수 있습니다. **슬라이싱:** -리스트의 요소로 한번에 접근하는것 이외에도, 파이썬은 리스트의 일부분에만 접근하는 간결한 문법을 제공합니다; +리스트의 요소로 한 번에 접근하는 것 이외에도, 파이썬은 리스트의 일부분에만 접근하는 간결한 문법을 제공합니다; 이를 *슬라이싱*이라고 합니다: ~~~python @@ -199,7 +199,7 @@ print nums[:-1] # 슬라이싱 인덱스는 음수도 가능; 출력 ["0, 1, nums[2:4] = [8, 9] # 슬라이스된 리스트에 새로운 리스트 할당 print nums # 출력 "[0, 1, 8, 9, 4]" ~~~ -numpy 배열 부분에서 다시 슬라이싱을 보게될것입니다. +numpy 배열 부분에서 다시 슬라이싱을 보게 될 것입니다. **반복문:** 아래와 같이 리스트의 요소들을 반복해서 조회할 수 있습니다: @@ -220,7 +220,7 @@ for idx, animal in enumerate(animals): ~~~ **리스트 comprehensions:** -프로그래밍을 하다보면, 자료형을 변환해야 하는 경우가 자주 있습니다. +프로그래밍을 하다 보면, 자료형을 변환해야 하는 경우가 자주 있습니다. 
간단한 예를 들자면, 숫자의 제곱을 계산하는 다음의 코드를 보세요: @@ -240,7 +240,7 @@ squares = [x ** 2 for x in nums] print squares # 출력 [0, 1, 4, 9, 16] ~~~ -리스트 comprehensions에 조건을 추가 할 수도 있습니다: +리스트 comprehensions에 조건을 추가할 수도 있습니다: ~~~python nums = [0, 1, 2, 3, 4] @@ -257,16 +257,16 @@ print even_squares # 출력 "[0, 4, 16]" ~~~python d = {'cat': 'cute', 'dog': 'furry'} # 새로운 딕셔너리를 만듭니다 print d['cat'] # 딕셔너리의 값을 받음; 출력 "cute" -print 'cat' in d # 딕셔너리가 주어진 열쇠를 가지고 있는지 확인; 출력 "True" +print 'cat' in d # 딕셔너리가 주어진 열쇠를 가졌는지 확인; 출력 "True" d['fish'] = 'wet' # 딕셔너리의 값을 지정 print d['fish'] # 출력 "wet" # print d['monkey'] # KeyError: 'monkey' not a key of d print d.get('monkey', 'N/A') # 딕셔너리의 값을 받음. 존재하지 않는 다면 'N/A'; 출력 "N/A" print d.get('fish', 'N/A') # 딕셔너리의 값을 받음. 존재하지 않는 다면 'N/A'; 출력 "wet" del d['fish'] # 딕셔너리에 저장된 요소 삭제 -print d.get('fish', 'N/A') # "fish"는 더이상 열쇠가 아님; 출력 "N/A" +print d.get('fish', 'N/A') # "fish"는 더 이상 열쇠가 아님; 출력 "N/A" ~~~ -딕셔너리에 관해 더 알고싶다면 [문서](https://docs.python.org/2/library/stdtypes.html#dict)를 참조하세요. +딕셔너리에 관해 더 알고 싶다면 [문서](https://docs.python.org/2/library/stdtypes.html#dict)를 참조하세요. **반복문:** 딕셔너리의 열쇠는 쉽게 반복될 수 있습니다: @@ -288,7 +288,7 @@ for animal, legs in d.iteritems(): ~~~ **딕셔너리 comprehensions:** -리스트 comprehensions과 유사한 딕셔너리 comprehensions을 통해 손쉽게 딕셔너리를 만들수 있습니다. +리스트 comprehensions과 유사한 딕셔너리 comprehensions을 통해 손쉽게 딕셔너리를 만들 수 있습니다. 예시: ~~~python @@ -300,7 +300,7 @@ print even_num_to_square # 출력 "{0: 0, 2: 4, 4: 16}" #### 집합 -집합은 순서 구분이 없고 서로 다른 요소간의 모임입니다. 예시: +집합은 순서 구분이 없고 서로 다른 요소 간의 모임입니다. 예시: ~~~python animals = {'cat', 'dog'} @@ -315,11 +315,11 @@ animals.remove('cat') # Remove an element from a set print len(animals) # 출력 "2" ~~~ -마찬가지로, 집합에 관해 더 알고싶다면 [문서](https://docs.python.org/2/library/sets.html#set-objects)를 참조하세요. +마찬가지로, 집합에 관해 더 알고 싶다면 [문서](https://docs.python.org/2/library/sets.html#set-objects)를 참조하세요. **반복문:** 집합을 반복하는 구문은 리스트 반복 구문과 동일합니다; -그러나 집합은 순서가 없어서, 어떤 순서로 반복될지 추측할순 없습니다: +그러나 집합은 순서가 없어서, 어떤 순서로 반복될지 추측할 순 없습니다: ~~~python animals = {'cat', 'dog', 'fish'} @@ -340,8 +340,8 @@ print nums # 출력 "set([0, 1, 2, 3, 4, 5])" #### 튜플 -튜플은 요소들 간 순서가 있으며 값이 변하지 않는 리스트입니다. -튜플은 많은 면에서 리스트와 유사합니다; 가장 중요한 차이점은 튜플은 '딕셔너리의 열쇠'와 '집합의 요소'가 될 수 있지만 리스트는 불가능하다는 점입니다. +튜플은 요소 간 순서가 있으며 값이 변하지 않는 리스트입니다. +튜플은 많은 면에서 리스트와 유사합니다; 가장 중요한 차이점은 튜플은 '딕셔너리의 열쇠'와 '집합의 요소'가 될 수 있지만, 리스트는 불가능하다는 점입니다. 여기 간단한 예시가 있습니다: ~~~python @@ -394,18 +394,18 @@ hello('Fred', loud=True) # 출력 "HELLO, FRED!" ~~~python class Greeter(object): - + # 생성자 def __init__(self, name): self.name = name # 인스턴스 변수 선언 - + # 인스턴스 메소드 def greet(self, loud=False): if loud: print 'HELLO, %s!' % self.name.upper() else: print 'Hello, %s' % self.name - + g = Greeter('Fred') # Greeter 클래스의 인스턴스 생성 g.greet() # 인스턴스 메소드 호출; 출력 "Hello, Fred" g.greet(loud=True) # 인스턴스 메소드 호출; 출력 "HELLO, FRED!" @@ -417,16 +417,16 @@ g.greet(loud=True) # 인스턴스 메소드 호출; 출력 "HELLO, FRED!" ## Numpy [Numpy](http://www.numpy.org/)는 파이썬이 계산과학분야에 이용될때 핵심 역할을 하는 라이브러리입니다. -Numpy는 고성능의 다차원 배열 객체와 이를 다룰 도구를 제공합니다. 만약 MATLAB에 익숙한 분이라면 넘파이 학습을 시작하는데 있어 +Numpy는 고성능의 다차원 배열 객체와 이를 다룰 도구를 제공합니다. 만약 MATLAB에 익숙한 분이라면 Numpy 학습을 시작하는데 있어 [이 튜토리얼](http://wiki.scipy.org/NumPy_for_Matlab_Users)이 유용할 것입니다. ### 배열 -Numpy 배열은 동일한 자료형을 가지는 값들이 격자판 형태로 있는 것입니다. 각각의 값들은 튜플(이때 튜플은 양의 정수만을 요소값으로 갖습니다.) 형태로 색인됩니다. -*rank*는 배열이 몇차원인지를 의미합니다; *shape*는 는 각 차원의 크기를 알려주는 정수들이 모인 튜플입니다. +Numpy 배열은 동일한 자료형을 가지는 값들이 격자판 형태로 있는 것입니다. 각각의 값들은 튜플(이때 튜플은 양의 정수만을 요소값으로 갖습니다.) 형태로 색인 됩니다. +*rank*는 배열이 몇 차원인지를 의미합니다; *shape*는 는 각 차원의 크기를 알려주는 정수들이 모인 튜플입니다. 
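본문에서 설명한 *rank*와 *shape*는 각각 numpy 배열의 `ndim`, `shape` 속성으로 직접 확인해 볼 수 있습니다. 아래는 이를 보여주는 간단한 스케치입니다 (배열의 값들은 임의로 정한 예시입니다):

~~~python
import numpy as np

a = np.array([1, 2, 3])          # rank 1인 배열
b = np.array([[1, 2], [3, 4]])   # rank 2인 배열

print a.ndim, a.shape  # 출력 "1 (3,)"
print b.ndim, b.shape  # 출력 "2 (2, 2)"
~~~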
-파이썬의 리스트를 중첩해 Numpy 배열을 초기화 할 수 있고, 대괄호를 통해 각 요소에 접근할 수 있습니다: +파이썬의 리스트를 중첩해 Numpy 배열을 초기화 할 수 있고, 대괄호를 통해 각 요소에 접근할 수 있습니다: ~~~python import numpy as np @@ -451,7 +451,7 @@ import numpy as np a = np.zeros((2,2)) # 모든 값이 0인 배열 생성 print a # 출력 "[[ 0. 0.] # [ 0. 0.]]" - + b = np.ones((1,2)) # 모든 값이 1인 배열 생성 print b # 출력 "[[ 1. 1.]]" @@ -459,23 +459,23 @@ c = np.full((2,2), 7) # 모든 값이 특정 상수인 배열 생성 print c # 출력 "[[ 7. 7.] # [ 7. 7.]]" -d = np.eye(2) # 2x2 단위 행렬 생성 +d = np.eye(2) # 2x2 단위행렬 생성 print d # 출력 "[[ 1. 0.] # [ 0. 1.]]" - + e = np.random.random((2,2)) # 임의의 값으로 채워진 배열 생성 print e # 임의의 값 출력 "[[ 0.91940167 0.08143941] - # [ 0.68744134 0.87236687]]" + # [ 0.68744134 0.87236687]]" ~~~ 배열 생성에 관한 다른 방법들은 [문서](http://docs.scipy.org/doc/numpy/user/basics.creation.html#arrays-creation)를 참조하세요. ### 배열 인덱싱 -Numpy는 배열을 인덱싱하는 몇가지 방법을 제공합니다. +Numpy는 배열을 인덱싱하는 몇 가지 방법을 제공합니다. **슬라이싱:** -파이썬 리스트와 유사하게, Numpy 배열도 슬라이싱이 가능합니다. Numpy 배열은 다차원인 경우가 많기에, 각 차원별로 어떻게 슬라이스할건지 명확히 해야합니다: +파이썬 리스트와 유사하게, Numpy 배열도 슬라이싱이 가능합니다. Numpy 배열은 다차원인 경우가 많기에, 각 차원별로 어떻게 슬라이스할건지 명확히 해야 합니다: ~~~python import numpy as np @@ -486,7 +486,7 @@ import numpy as np # [ 9 10 11 12]] a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]]) -# 슬라이싱을 이용하여 첫 두행과 1열,2열로 이루어진 부분배열을 만들어 봅시다; +# 슬라이싱을 이용하여 첫 두 행과 1열, 2열로 이루어진 부분배열을 만들어 봅시다; # b는 shape가 (2,2)인 배열이 됩니다: # [[2 3] # [6 7]] @@ -514,11 +514,11 @@ import numpy as np # [ 9 10 11 12]] a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]]) -# 배열의 중간 행에 접근하는 두가지 방법이 있습니다. +# 배열의 중간 행에 접근하는 두 가지 방법이 있습니다. # 정수 인덱싱과 슬라이싱을 혼합해서 사용하면 낮은 rank의 배열이 생성되지만, # 슬라이싱만 사용하면 원본 배열과 동일한 rank의 배열이 생성됩니다. -row_r1 = a[1, :] # 배열a의 두번째 행을 rank가 1인 배열로 -row_r2 = a[1:2, :] # 배열a의 두번째 행을 rank가 2인 배열로 +row_r1 = a[1, :] # 배열a의 두 번째 행을 rank가 1인 배열로 +row_r2 = a[1:2, :] # 배열a의 두 번째 행을 rank가 2인 배열로 print row_r1, row_r1.shape # 출력 "[5 6 7 8] (4,)" print row_r2, row_r2.shape # 출력 "[[5 6 7 8]] (1, 4)" @@ -533,7 +533,7 @@ print col_r2, col_r2.shape # 출력 "[[ 2] **정수 배열 인덱싱:** Numpy 배열을 슬라이싱하면, 결과로 얻어지는 배열은 언제나 원본 배열의 부분 배열입니다. -그러나 정수 배열 인덱싱을 한다면, 원본과 다른 배열을 만들수 있습니다. +그러나 정수 배열 인덱싱을 한다면, 원본과 다른 배열을 만들 수 있습니다. 여기에 예시가 있습니다: ~~~python @@ -549,7 +549,7 @@ print a[[0, 1, 2], [0, 1, 0]] # 출력 "[1 4 5]" print np.array([a[0, 0], a[1, 1], a[2, 0]]) # 출력 "[1 4 5]" # 정수 배열 인덱싱을 사용할 때, -# 원본 배열의 같은 요소를 재사용 할 수 있습니다: +# 원본 배열의 같은 요소를 재사용할 수 있습니다: print a[[0, 0], [1, 1]] # 출력 "[2 2]" # 위 예제는 다음과 동일합니다 @@ -586,8 +586,8 @@ print a # 출력 "array([[11, 2, 3], ~~~ **불리언 배열 인덱싱:** -불리언 배열 인덱싱을 통해 배열속 요소를 취사 선택할 수 있습니다. -불리언 배열 인덱싱은 특정 조건을 만족시키는 요소만 선택하고자 할 때 자주 사용됩니다. +불리언 배열 인덱싱을 통해 배열 속 요소를 취사선택할 수 있습니다. +불리언 배열 인덱싱은 특정 조건을 만족하게 하는 요소만 선택하고자 할 때 자주 사용됩니다. 다음은 그 예시입니다: ~~~python @@ -597,15 +597,15 @@ a = np.array([[1,2], [3, 4], [5, 6]]) bool_idx = (a > 2) # 2보다 큰 a의 요소를 찾습니다; # 이 코드는 a와 shape가 같고 불리언 자료형을 요소로 하는 numpy 배열을 반환합니다, - # bool_idx의 각 요소는 동일한 위치에 있는 a의 + # bool_idx의 각 요소는 동일한 위치에 있는 a의 # 요소가 2보다 큰지를 말해줍니다. - + print bool_idx # 출력 "[[False False] # [ True True] # [ True True]]" -# 불리언 배열 인덱싱을 통해 bool_idx에서 -# 참 값을 가지는 요소로 구성되는 +# 불리언 배열 인덱싱을 통해 bool_idx에서 +# 참 값을 가지는 요소로 구성되는 # rank 1인 배열을 구성할 수 있습니다. print a[bool_idx] # 출력 "[3 4 5 6]" @@ -620,8 +620,8 @@ print a[a > 2] # 출력 "[3 4 5 6]" ### 자료형 Numpy 배열은 동일한 자료형을 가지는 값들이 격자판 형태로 있는 것입니다. -Numpy에선 배열을 구성하는데 사용할 수 있는 다양한 숫자 자료형을 제공합니다. -Numpy는 배열이 생성될 때 자료형을 스스로 추측합니다, 그러나 배열을 생성할 때 명시적으로 특정 자료형을 지정할수도 있습니다. 예시: +Numpy에선 배열을 구성하는 데 사용할 수 있는 다양한 숫자 자료형을 제공합니다. +Numpy는 배열이 생성될 때 자료형을 스스로 추측합니다, 그러나 배열을 생성할 때 명시적으로 특정 자료형을 지정할 수도 있습니다. 
예시: ~~~python import numpy as np @@ -678,7 +678,7 @@ print np.divide(x, y) print np.sqrt(x) ~~~ -MATLAB과 달리, '*'은 행렬곱이 아니라 요소별 곱입니다. Numpy에선 벡터의 내적, 벡터와 행렬의 곱, 행렬곱을 위해 '*'대신 'dot'함수를 사용합니다. 'dot'은 Numpy 모듈 함수로서도 배열 객체의 인스턴스 메소드로서도 이용 가능한 합수입니다: +MATLAB과 달리, '*'은 행렬 곱이 아니라 요소별 곱입니다. Numpy에선 벡터의 내적, 벡터와 행렬의 곱, 행렬곱을 위해 '*'대신 'dot'함수를 사용합니다. 'dot'은 Numpy 모듈 함수로서도 배열 객체의 인스턴스 메소드로서도 이용 가능한 합수입니다: ~~~python import numpy as np @@ -693,7 +693,7 @@ w = np.array([11, 12]) print v.dot(w) print np.dot(v, w) -# 행렬과 벡터의 곱; 둘 다 결과는 rank 1 인 배열 [29 67] +# 행렬과 벡터의 곱; 둘 다 결과는 rank 1인 배열 [29 67] print x.dot(v) print np.dot(x, v) @@ -704,7 +704,7 @@ print x.dot(y) print np.dot(x, y) ~~~ -Numpy는 배열 연산에 유용하게 쓰이는 많은 함수를 제공합니다. 가장 유용한건 'sum'입니다: +Numpy는 배열 연산에 유용하게 쓰이는 많은 함수를 제공합니다. 가장 유용한 건 'sum'입니다: ~~~python import numpy as np @@ -715,10 +715,10 @@ print np.sum(x) # 모든 요소를 합한 값을 연산; 출력 "10" print np.sum(x, axis=0) # 각 열에 대한 합을 연산; 출력 "[4 6]" print np.sum(x, axis=1) # 각 행에 대한 합을 연산; 출력 "[3 7]" ~~~ -Numpy가 제공하는 모든 수학함수들의 목록은 [문서](http://docs.scipy.org/doc/numpy/reference/routines.math.html)를 참조하세요. +Numpy가 제공하는 모든 수학함수의 목록은 [문서](http://docs.scipy.org/doc/numpy/reference/routines.math.html)를 참조하세요. -배열연산을 하지 않더라도, 종종 배열의 모양을 바꾸거나 데이터를 처리해야할 때가 있습니다. -가장 간단한 예는 행렬의 주대각선을 기준으로 대칭되는 요소끼리 뒤바꾸는 것입니다; 이를 전치라고 하며 행렬을 전치하기 위해선, 간단하게 배열 객체의 'T' 속성을 사용하면 됩니다: +배열연산을 하지 않더라도, 종종 배열의 모양을 바꾸거나 데이터를 처리해야 할 때가 있습니다. +가장 간단한 예는 행렬의 주 대각선을 기준으로 대칭되는 요소끼리 뒤바꾸는 것입니다; 이를 전치라고 하며 행렬을 전치하기 위해선, 간단하게 배열 객체의 'T' 속성을 사용하면 됩니다: ~~~python import numpy as np @@ -729,7 +729,7 @@ print x # 출력 "[[1 2] print x.T # 출력 "[[1 3] # [2 4]]" -# rank 1인 배열을 전치할경우 아무일도 일어나지 않습니다: +# rank 1인 배열을 전치할 경우 아무 일도 일어나지 않습니다: v = np.array([1,2,3]) print v # 출력 "[1 2 3]" print v.T # 출력 "[1 2 3]" @@ -740,7 +740,7 @@ Numpy는 배열을 다루는 다양한 함수들을 제공합니다; 이러한 ### 브로드캐스팅 -브로트캐스팅은 Numpy에서 shape가 다른 배열간에도 산술 연산이 가능하게 하는 메커니즘입니다. 종종 작은 배열과 큰 배열이 있을 때, 큰 배열을 대상으로 작은 배열을 여러번 연산하고자 할 때가 있습니다. 예를 들어, 행렬의 각 행에 상수 벡터를 더하는걸 생각해보세요. 이는 다음과 같은 방식으로 처리될 수 있습니다: +브로트캐스팅은 Numpy에서 shape가 다른 배열 간에도 산술 연산이 가능하게 하는 메커니즘입니다. 종종 작은 배열과 큰 배열이 있을 때, 큰 배열을 대상으로 작은 배열을 여러 번 연산하고자 할 때가 있습니다. 예를 들어, 행렬의 각 행에 상수 벡터를 더하는 걸 생각해보세요. 이는 다음과 같은 방식으로 처리될 수 있습니다: ~~~python import numpy as np @@ -763,7 +763,7 @@ for i in range(4): print y ~~~ -위의 방식대로 하면 됩니다; 그러나 'x'가 매우 큰 행렬이라면, 파이썬의 명시적 반복문을 이용한 위 코드는 매우 느려질 수 있습니다. 벡터 'v'를 행렬 'x'의 각 행에 더하는것은 'v'를 여러개 복사해 수직으로 쌓은 행렬 'vv'를 만들고 이 'vv'를 'x'에 더하는것과 동일합니다. 이 과정을 아래의 코드로 구현할 수 있습니다: +위의 방식대로 하면 됩니다; 그러나 'x'가 매우 큰 행렬이라면, 파이썬의 명시적 반복문을 이용한 위 코드는 매우 느려질 수 있습니다. 벡터 'v'를 행렬 'x'의 각 행에 더하는 것은 'v'를 여러 개 복사해 수직으로 쌓은 행렬 'vv'를 만들고 이 'vv'를 'x'에 더하는것과 동일합니다. 이 과정을 아래의 코드로 구현할 수 있습니다: ~~~python import numpy as np @@ -772,7 +772,7 @@ import numpy as np # 그 결과를 행렬 y에 저장하고자 합니다 x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]]) v = np.array([1, 0, 1]) -vv = np.tile(v, (4, 1)) # v의 복사본 4개를 위로 차곡차곡 쌓은게 vv +vv = np.tile(v, (4, 1)) # v의 복사본 4개를 위로 차곡차곡 쌓은 것이 vv print vv # 출력 "[[1 0 1] # [1 0 1] # [1 0 1] @@ -784,7 +784,7 @@ print y # 출력 "[[ 2 2 4 # [11 11 13]]" ~~~ -Numpy 브로드캐스팅을 이용한다면 이렇게 v의 복사본을 여러개 만들지 않아도 동일한 연산을 할 수 있습니다. +Numpy 브로드캐스팅을 이용한다면 이렇게 v의 복사본을 여러 개 만들지 않아도 동일한 연산을 할 수 있습니다. 아래는 브로드캐스팅을 이용한 예시 코드입니다: ~~~python @@ -802,19 +802,19 @@ print y # 출력 "[[ 2 2 4] ~~~ `x`의 shape가 `(4, 3)`이고 `v`의 shape가 `(3,)`라도 브로드캐스팅으로 인해 `y = x + v`는 문제없이 수행됩니다; -이때 'v'는 'v'의 복사본이 차곡차곡 쌓인 shape `(4, 3)`처럼 간주되어 'x'와 동일한 shape가 되며 이들간의 요소별 덧셈연산이 y에 저장됩니다. +이때 'v'는 'v'의 복사본이 차곡차곡 쌓인 shape `(4, 3)`처럼 간주되어 'x'와 동일한 shape가 되며 이들 간의 요소별 덧셈연산이 y에 저장됩니다. 두 배열의 브로드캐스팅은 아래의 규칙을 따릅니다: -1. 
두 배열이 동일한 rank를 가지고 있지 않다면, 낮은 rank의 1차원 배열이 높은 rank 배열의 shape로 간주됩니다. -2. 특정 차원에서 두 배열이 동일한 크기를 갖거나, 두 배열들 중 하나의 크기가 1이라면 그 두 배열은 특정 차원에서 *compatible*하다고 여겨집니다. +1. 두 배열이 동일한 rank를 가지고 있지 않다면, 낮은 rank의 1차원 배열이 높은 rank 배열의 shape로 간주합니다. +2. 특정 차원에서 두 배열이 동일한 크기를 갖거나, 두 배열 중 하나의 크기가 1이라면 그 두 배열은 특정 차원에서 *compatible*하다고 여겨집니다. 3. 두 행렬이 모든 차원에서 compatible하다면, 브로드캐스팅이 가능합니다. -4. 브로드캐스팅이 이뤄지면, 각 배열 shape의 요소별 최소공배수로 이루어진 shape가 두 배열의 shape로 간주됩니다. -5. 차원에 상관없이 크기가 1인 배열과 1보다 큰 배열이 있을때, 크기가 1인 배열은 자신의 차원수만큼 복사되어 쌓인것처럼 간주된다. - +4. 브로드캐스팅이 이뤄지면, 각 배열 shape의 요소별 최소공배수로 이루어진 shape가 두 배열의 shape로 간주합니다. +5. 차원에 상관없이 크기가 1인 배열과 1보다 큰 배열이 있을 때, 크기가 1인 배열은 자신의 차원 수만큼 복사되어 쌓인 것처럼 간주합니다. + 설명이 이해하기 부족하다면 [scipy문서](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)나 [scipy위키](http://wiki.scipy.org/EricsBroadcastingDoc)를 참조하세요. -브로드캐스팅을 지원하는 함수를 *universal functions*라고 합니다. +브로드캐스팅을 지원하는 함수를 *universal functions*라고 합니다. *universal functions* 목록은 [문서](http://docs.scipy.org/doc/numpy/reference/ufuncs.html#available-ufuncs)를 참조하세요. 브로드캐스팅을 응용한 예시들입니다: @@ -825,9 +825,9 @@ import numpy as np # 벡터의 외적을 계산 v = np.array([1,2,3]) # v의 shape는 (3,) w = np.array([4,5]) # w의 shape는 (2,) -# 외적을 게산하기 위해, 먼저 v를 shape가 (3,1)인 행벡터로 바꿔야 합니다; +# 외적을 계산하기 위해, 먼저 v를 shape가 (3,1)인 행벡터로 바꿔야 합니다; # 그다음 이것을 w에 맞춰 브로드캐스팅한뒤 결과물로 shape가 (3,2)인 행렬을 얻습니다, -# 이 행렬은 v 와 w의 외적의 결과입니다: +# 이 행렬은 v와 w 외적의 결과입니다: # [[ 4 5] # [ 8 10] # [12 15]] @@ -843,15 +843,15 @@ print x + v # 벡터를 행렬의 각 행에 더하기 # x는 shape가 (2, 3)이고 w는 shape가 (2,)입니다. -# x의 전치행렬은 shape가 (3,2)이며 이는 w와 브로드캐스팅이 가능하고 결과로 shape가 (3,2)인 행렬이 생깁니다; -# 이 행렬을 전치하면 shape가 (2,3)인 행렬이 나오며 +# x의 전치행렬은 shape가 (3,2)이며 이는 w와 브로드캐스팅이 가능하고 결과로 shape가 (3,2)인 행렬이 생깁니다; +# 이 행렬을 전치하면 shape가 (2,3)인 행렬이 나오며 # 이는 행렬 x의 각 열에 벡터 w을 더한 결과와 동일합니다. # 아래의 행렬입니다: # [[ 5 6 7] # [ 9 10 11]] print (x.T + w).T # 다른 방법은 w를 shape가 (2,1)인 열벡터로 변환하는 것입니다; -# 그런다음 이를 바로 x에 브로드캐스팅해 더하면 +# 그런 다음 이를 바로 x에 브로드캐스팅해 더하면 # 동일한 결과가 나옵니다. print x + np.reshape(w, (2, 1)) @@ -864,7 +864,7 @@ print x + np.reshape(w, (2, 1)) print x * 2 ~~~ -브로드캐스팅은 보통 코드를 간결하고 빠르게 해줍니다, 그러므로 가능하다면 최대한 사용하세요. +브로드캐스팅은 보통 코드를 간결하고 빠르게 해줍니다, 그러니 가능한 많이 사용하세요. ### Numpy Documentation 이 문서는 여러분이 numpy에 대해 알아야할 많은 중요한 사항들을 다루지만 완벽하진 않습니다. @@ -876,7 +876,7 @@ numpy에 관한 더 많은 사항은 [numpy 레퍼런스](http://docs.scipy.org/ Numpy는 고성능의 다차원 배열 객체와 이를 다룰 도구를 제공합니다. numpy를 바탕으로 만들어진 [SciPy](http://docs.scipy.org/doc/scipy/reference/)는, -numpy 배열을 다루는 많은 함수들을 제공하며 다양한 과학, 공학분야에서 유용하게 사용됩니다. +numpy 배열을 다루는 많은 함수를 제공하며 다양한 과학, 공학분야에서 유용하게 사용됩니다. SciPy에 익숙해지는 최고의 방법은 [SciPy 공식 문서](http://docs.scipy.org/doc/scipy/reference/index.html)를 보는 것입니다. 이 문서에서는 scipy중 cs231n 수업에서 유용하게 쓰일 일부분만을 소개할것입니다. @@ -885,7 +885,7 @@ SciPy에 익숙해지는 최고의 방법은 [SciPy 공식 문서](http://docs.s ### 이미지 작업 SciPy는 이미지를 다룰 기본적인 함수들을 제공합니다. -예를들자면, 디스크에 저장된 이미지를 numpy 배열로 읽어들이는 함수가 있으며, +예를들자면, 디스크에 저장된 이미지를 numpy 배열로 읽어 들이는 함수가 있으며, numpy 배열을 디스크에 이미지로 저장하는 함수도 있고, 이미지의 크기를 바꾸는 함수도 있습니다. 이 함수들의 간단한 사용 예시입니다: @@ -896,15 +896,15 @@ from scipy.misc import imread, imsave, imresize img = imread('assets/cat.jpg') print img.dtype, img.shape # 출력 "uint8 (400, 248, 3)" -# 각각의 색깔 채널을 다른 상수값으로 스칼라배함으로써 -# 이미지의 색을 변화시킬수 있습니다. +# 각각의 색깔 채널을 다른 상수값으로 스칼라배함으로써 +# 이미지의 색을 변화시킬 수 있습니다. # 이미지의 shape는 (400, 248, 3)입니다; # 여기에 shape가 (3,)인 배열 [1, 0.95, 0.9]를 곱합니다; # numpy 브로드캐스팅에 의해 이 배열이 곱해지며 붉은색 채널은 변하지 않으며, # 초록색, 파란색 채널에는 각각 0.95, 0.9가 곱해집니다 img_tinted = img * [1, 0.95, 0.9] -# 색변경 이미지를 300x300 픽셀로 크기 조절. +# 색변경 이미지를 300x300픽셀로 크기 조절. 
img_tinted = imresize(img_tinted, (300, 300)) # 색변경 이미지를 디스크에 기록하기 @@ -923,7 +923,7 @@ imsave('assets/cat_tinted.jpg', img_tinted) ### MATLAB 파일 -`scipy.io.loadmat` 와 `scipy.io.savemat`함수를 통해 +`scipy.io.loadmat` 와 `scipy.io.savemat`함수를 통해 matlab 파일을 읽고 쓸 수 있습니다. [문서](http://docs.scipy.org/doc/scipy/reference/io.html)를 참조하세요. @@ -961,21 +961,21 @@ print d ## Matplotlib -[Matplotlib](http://matplotlib.org/)는 plotting 라이브러리입니다. +[Matplotlib](http://matplotlib.org/)는 plotting 라이브러리입니다. 이번에는 MATLAB의 plotting 시스템과 유사한 기능을 제공하는 -`matplotlib.pyplot` 모듈에 관한 간략한 소개가 있곘습니다., +`matplotlib.pyplot` 모듈에 관한 간략한 소개가 있겠습니다., ### Plotting -matplotlib에서 가장 중요한 함수는 2차원 데이터를 그릴수 있게 해주는 `plot`입니다. +matplotlib에서 가장 중요한 함수는 2차원 데이터를 그릴 수 있게 해주는 `plot`입니다. 여기 간단한 예시가 있습니다: ~~~python import numpy as np import matplotlib.pyplot as plt -# 사인과 코사인 곡선의 x,y 좌표를 계산 +# 사인과 코사인 곡선의 x,y 좌표를 계산 x = np.arange(0, 3 * np.pi, 0.1) y = np.sin(x) @@ -990,13 +990,13 @@ plt.show() # 그래프를 나타나게 하기 위해선 plt.show()함수를 호
-약간의 몇가지 추가적인 작업을 통해 여러개의 그래프와, 제목, 범주, 축 이름을 한번에 쉽게 나타낼 수 있습니다: +약간의 몇 가지 추가적인 작업을 통해 여러 개의 그래프와 제목, 범주, 축 이름을 한 번에 쉽게 나타낼 수 있습니다: ~~~python import numpy as np import matplotlib.pyplot as plt -# 사인과 코사인 곡선의 x,y 좌표를 계산 +# 사인과 코사인 곡선의 x,y 좌표를 계산 x = np.arange(0, 3 * np.pi, 0.1) y_sin = np.sin(x) y_cos = np.cos(x) @@ -1020,27 +1020,27 @@ plt.show() ### Subplots -'subplot'함수를 통해 다른 내용들도 동일한 그림위에 나타낼수 있습니다. +'subplot'함수를 통해 다른 내용도 동일한 그림 위에 나타낼 수 있습니다. 여기 간단한 예시가 있습니다: ~~~python import numpy as np import matplotlib.pyplot as plt -# 사인과 코사인 곡선의 x,y 좌표를 계산 +# 사인과 코사인 곡선의 x,y 좌표를 계산 x = np.arange(0, 3 * np.pi, 0.1) y_sin = np.sin(x) y_cos = np.cos(x) # 높이가 2이고 너비가 1인 subplot 구획을 설정하고, -# 첫번째 구획을 활성화. +# 첫 번째 구획을 활성화. plt.subplot(2, 1, 1) -# 첫번째 그리기 +# 첫 번째 그리기 plt.plot(x, y_sin) plt.title('Sine') -# 두번째 subplot 구획을 활성화 하고 그리기 +# 두 번째 subplot 구획을 활성화 하고 그리기 plt.subplot(2, 1, 2) plt.plot(x, y_cos) plt.title('Cosine') @@ -1076,8 +1076,8 @@ plt.imshow(img) # 색변화된 이미지 나타내기 plt.subplot(1, 2, 2) -# imshow를 이용하며 주의할 점은 데이터의 자료형이 -# uint8이 아니라면 이상한 결과를 보여줄수도 있다는 것입니다. +# imshow를 이용하며 주의할 점은 데이터의 자료형이 +# uint8이 아니라면 이상한 결과를 보여줄 수도 있다는 것입니다. # 그러므로 이미지를 나타내기 전에 명시적으로 자료형을 uint8로 형변환 해줍니다. plt.imshow(np.uint8(img_tinted)) From 694131078e3f5af7eee32a169e6c6f67631b117b Mon Sep 17 00:00:00 2001 From: Sanghun Kang Date: Wed, 8 Jun 2016 22:31:07 +0900 Subject: [PATCH 178/199] unnecessary leading and trailing spaces removed --- python-numpy-tutorial.md | 50 ++++++++++++++++++++-------------------- 1 file changed, 25 insertions(+), 25 deletions(-) diff --git a/python-numpy-tutorial.md b/python-numpy-tutorial.md index 178b06b3..8c3ad8b8 100644 --- a/python-numpy-tutorial.md +++ b/python-numpy-tutorial.md @@ -63,8 +63,8 @@ cs231n 수업의 모든 과제에서는 프로그래밍 언어로 파이썬을 ## 파이썬 -파이썬은 고급 프로그래밍 언어(사람이 이해하기 쉽게 작성된 언어)이며, 다중패러다임을 지원하는 동적 프로그래밍 언어입니다. -짧지만 가독성 높은 코드 몇 줄로 수준 높은 아이디어들을 표현할 수 있기에 파이썬 코드는 거의 수도코드처럼 보인다고도 합니다. +파이썬은 고급 프로그래밍 언어(사람이 이해하기 쉽게 작성된 언어)이며, 다중패러다임을 지원하는 동적 프로그래밍 언어입니다. +짧지만 가독성 높은 코드 몇 줄로 수준 높은 아이디어들을 표현할 수 있기에 파이썬 코드는 거의 수도코드처럼 보인다고도 합니다. 아래는 quicksort 알고리즘을 파이썬으로 구현한 예시입니다: ~~~python @@ -76,13 +76,13 @@ def quicksort(arr): middle = [x for x in arr if x == pivot] right = [x for x in arr if x > pivot] return quicksort(left) + middle + quicksort(right) - + print quicksort([3,6,8,10,1,2,1]) # 출력 "[1, 1, 2, 3, 6, 8, 10]" ~~~ ### 파이썬 버전 -현재 파이썬에는 두 가지 버전이 있습니다. 파이썬 2.7 그리고 파이썬 3.4입니다. +현재 파이썬에는 두 가지 버전이 있습니다. 파이썬 2.7 그리고 파이썬 3.4입니다. 혼란스럽게도, 파이썬3은 기존 파이썬2와 호환되지 않게 변경된 부분이 있습니다. 그러므로 파이썬 2.7로 쓰여진 코드는 3.4환경에서 동작하지 않고 그 반대도 마찬가지입니다. 이 수업에선 파이썬 2.7을 사용합니다. @@ -117,10 +117,10 @@ print y, y + 1, y * 2, y ** 2 # 출력 "2.5 3.5 5.0 6.25" ~~~ 다른 언어들과는 달리, 파이썬에는 증감 단항연산자(`x++`, `x--`)가 없습니다. -파이썬 역시 long 정수형과 복소수 데이터 타입이 구현되어 있습니다. +파이썬 역시 long 정수형과 복소수 데이터 타입이 구현되어 있습니다. 자세한 사항은 [문서](https://docs.python.org/2/library/stdtypes.html#numeric-types-int-float-long-complex)에서 찾아볼 수 있습니다. -**불리언(Booleans):** 파이썬에는 논리 자료형의 모든 연산자가 구현되어 있습니다. +**불리언(Booleans):** 파이썬에는 논리 자료형의 모든 연산자가 구현되어 있습니다. 그렇지만 기호(`&&`, `||`, 등.) 대신 영어 단어로 구현되어 있습니다 : ~~~python @@ -158,7 +158,7 @@ print s.replace('l', '(ell)') # 첫 번째 인자로 온 문자열을 두 번 # 출력 "he(ell)(ell)o" print ' world '.strip() # 문자열 앞뒤 공백 제거; 출력 "world" ~~~ -모든 문자열 메소드는 [문서](https://docs.python.org/2/library/stdtypes.html#string-methods)에서 찾아볼 수 있습니다. +모든 문자열 메소드는 [문서](https://docs.python.org/2/library/stdtypes.html#string-methods)에서 찾아볼 수 있습니다. @@ -394,18 +394,18 @@ hello('Fred', loud=True) # 출력 "HELLO, FRED!" 
~~~python class Greeter(object): - + # 생성자 def __init__(self, name): self.name = name # 인스턴스 변수 선언 - + # 인스턴스 메소드 def greet(self, loud=False): if loud: print 'HELLO, %s!' % self.name.upper() else: print 'Hello, %s' % self.name - + g = Greeter('Fred') # Greeter 클래스의 인스턴스 생성 g.greet() # 인스턴스 메소드 호출; 출력 "Hello, Fred" g.greet(loud=True) # 인스턴스 메소드 호출; 출력 "HELLO, FRED!" @@ -423,10 +423,10 @@ Numpy는 고성능의 다차원 배열 객체와 이를 다룰 도구를 제공 ### 배열 -Numpy 배열은 동일한 자료형을 가지는 값들이 격자판 형태로 있는 것입니다. 각각의 값들은 튜플(이때 튜플은 양의 정수만을 요소값으로 갖습니다.) 형태로 색인 됩니다. +Numpy 배열은 동일한 자료형을 가지는 값들이 격자판 형태로 있는 것입니다. 각각의 값들은 튜플(이때 튜플은 양의 정수만을 요소값으로 갖습니다.) 형태로 색인 됩니다. *rank*는 배열이 몇 차원인지를 의미합니다; *shape*는 는 각 차원의 크기를 알려주는 정수들이 모인 튜플입니다. -파이썬의 리스트를 중첩해 Numpy 배열을 초기화 할 수 있고, 대괄호를 통해 각 요소에 접근할 수 있습니다: +파이썬의 리스트를 중첩해 Numpy 배열을 초기화 할 수 있고, 대괄호를 통해 각 요소에 접근할 수 있습니다: ~~~python import numpy as np @@ -451,7 +451,7 @@ import numpy as np a = np.zeros((2,2)) # 모든 값이 0인 배열 생성 print a # 출력 "[[ 0. 0.] # [ 0. 0.]]" - + b = np.ones((1,2)) # 모든 값이 1인 배열 생성 print b # 출력 "[[ 1. 1.]]" @@ -462,7 +462,7 @@ print c # 출력 "[[ 7. 7.] d = np.eye(2) # 2x2 단위행렬 생성 print d # 출력 "[[ 1. 0.] # [ 0. 1.]]" - + e = np.random.random((2,2)) # 임의의 값으로 채워진 배열 생성 print e # 임의의 값 출력 "[[ 0.91940167 0.08143941] # [ 0.68744134 0.87236687]]" @@ -486,7 +486,7 @@ import numpy as np # [ 9 10 11 12]] a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]]) -# 슬라이싱을 이용하여 첫 두 행과 1열, 2열로 이루어진 부분배열을 만들어 봅시다; +# 슬라이싱을 이용하여 첫 두 행과 1열, 2열로 이루어진 부분배열을 만들어 봅시다; # b는 shape가 (2,2)인 배열이 됩니다: # [[2 3] # [6 7]] @@ -599,13 +599,13 @@ bool_idx = (a > 2) # 2보다 큰 a의 요소를 찾습니다; # 이 코드는 a와 shape가 같고 불리언 자료형을 요소로 하는 numpy 배열을 반환합니다, # bool_idx의 각 요소는 동일한 위치에 있는 a의 # 요소가 2보다 큰지를 말해줍니다. - + print bool_idx # 출력 "[[False False] # [ True True] # [ True True]]" -# 불리언 배열 인덱싱을 통해 bool_idx에서 -# 참 값을 가지는 요소로 구성되는 +# 불리언 배열 인덱싱을 통해 bool_idx에서 +# 참 값을 가지는 요소로 구성되는 # rank 1인 배열을 구성할 수 있습니다. print a[bool_idx] # 출력 "[3 4 5 6]" @@ -811,7 +811,7 @@ print y # 출력 "[[ 2 2 4] 3. 두 행렬이 모든 차원에서 compatible하다면, 브로드캐스팅이 가능합니다. 4. 브로드캐스팅이 이뤄지면, 각 배열 shape의 요소별 최소공배수로 이루어진 shape가 두 배열의 shape로 간주합니다. 5. 차원에 상관없이 크기가 1인 배열과 1보다 큰 배열이 있을 때, 크기가 1인 배열은 자신의 차원 수만큼 복사되어 쌓인 것처럼 간주합니다. - + 설명이 이해하기 부족하다면 [scipy문서](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)나 [scipy위키](http://wiki.scipy.org/EricsBroadcastingDoc)를 참조하세요. 브로드캐스팅을 지원하는 함수를 *universal functions*라고 합니다. @@ -825,7 +825,7 @@ import numpy as np # 벡터의 외적을 계산 v = np.array([1,2,3]) # v의 shape는 (3,) w = np.array([4,5]) # w의 shape는 (2,) -# 외적을 계산하기 위해, 먼저 v를 shape가 (3,1)인 행벡터로 바꿔야 합니다; +# 외적을 계산하기 위해, 먼저 v를 shape가 (3,1)인 행벡터로 바꿔야 합니다; # 그다음 이것을 w에 맞춰 브로드캐스팅한뒤 결과물로 shape가 (3,2)인 행렬을 얻습니다, # 이 행렬은 v와 w 외적의 결과입니다: # [[ 4 5] @@ -843,7 +843,7 @@ print x + v # 벡터를 행렬의 각 행에 더하기 # x는 shape가 (2, 3)이고 w는 shape가 (2,)입니다. -# x의 전치행렬은 shape가 (3,2)이며 이는 w와 브로드캐스팅이 가능하고 결과로 shape가 (3,2)인 행렬이 생깁니다; +# x의 전치행렬은 shape가 (3,2)이며 이는 w와 브로드캐스팅이 가능하고 결과로 shape가 (3,2)인 행렬이 생깁니다; # 이 행렬을 전치하면 shape가 (2,3)인 행렬이 나오며 # 이는 행렬 x의 각 열에 벡터 w을 더한 결과와 동일합니다. # 아래의 행렬입니다: @@ -851,7 +851,7 @@ print x + v # [ 9 10 11]] print (x.T + w).T # 다른 방법은 w를 shape가 (2,1)인 열벡터로 변환하는 것입니다; -# 그런 다음 이를 바로 x에 브로드캐스팅해 더하면 +# 그런 다음 이를 바로 x에 브로드캐스팅해 더하면 # 동일한 결과가 나옵니다. print x + np.reshape(w, (2, 1)) @@ -961,7 +961,7 @@ print d ## Matplotlib -[Matplotlib](http://matplotlib.org/)는 plotting 라이브러리입니다. +[Matplotlib](http://matplotlib.org/)는 plotting 라이브러리입니다. 
이번에는 MATLAB의 plotting 시스템과 유사한 기능을 제공하는 `matplotlib.pyplot` 모듈에 관한 간략한 소개가 있겠습니다., @@ -975,7 +975,7 @@ matplotlib에서 가장 중요한 함수는 2차원 데이터를 그릴 수 있 import numpy as np import matplotlib.pyplot as plt -# 사인과 코사인 곡선의 x,y 좌표를 계산 +# 사인과 코사인 곡선의 x,y 좌표를 계산 x = np.arange(0, 3 * np.pi, 0.1) y = np.sin(x) @@ -1076,7 +1076,7 @@ plt.imshow(img) # 색변화된 이미지 나타내기 plt.subplot(1, 2, 2) -# imshow를 이용하며 주의할 점은 데이터의 자료형이 +# imshow를 이용하며 주의할 점은 데이터의 자료형이 # uint8이 아니라면 이상한 결과를 보여줄 수도 있다는 것입니다. # 그러므로 이미지를 나타내기 전에 명시적으로 자료형을 uint8로 형변환 해줍니다. From 838253ee5ac8b742d564b0b02bf29d779f5cffc6 Mon Sep 17 00:00:00 2001 From: Sanghun Kang Date: Wed, 8 Jun 2016 22:34:01 +0900 Subject: [PATCH 179/199] unnecessary leading and trailing spaces removed --- python-numpy-tutorial.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/python-numpy-tutorial.md b/python-numpy-tutorial.md index 8c3ad8b8..356da92d 100644 --- a/python-numpy-tutorial.md +++ b/python-numpy-tutorial.md @@ -24,7 +24,7 @@ Numpy 이 튜토리얼은 [Justin Johnson](http://cs.stanford.edu/people/jcjohns/) 에 의해 작성되었습니다. cs231n 수업의 모든 과제에서는 프로그래밍 언어로 파이썬을 사용할 것입니다. -파이썬은 그 자체만으로도 훌륭한 범용 프로그래밍 언어이지만, 몇몇 라이브러리(numpy, scipy, matplotlib)의 도움으로 +파이썬은 그 자체만으로도 훌륭한 범용 프로그래밍 언어이지만, 몇몇 라이브러리(numpy, scipy, matplotlib)의 도움으로 계산과학 분야에서 강력한 개발 환경을 갖추게 됩니다. 많은 분들이 파이썬과 numpy를 경험 해보셨을 거라고 생각합니다. 경험하지 못했을지라도 이 문서를 통해 @@ -130,7 +130,7 @@ print type(t) # 출력 "" print t and f # 논리 AND; 출력 "False" print t or f # 논리 OR; 출력 "True" print not t # 논리 NOT; 출력 "False" -print t != f # 논리 XOR; 출력 "True" +print t != f # 논리 XOR; 출력 "True" ~~~ **문자열:** 파이썬은 문자열과 연관된 다양한 기능을 지원합니다: @@ -597,7 +597,7 @@ a = np.array([[1,2], [3, 4], [5, 6]]) bool_idx = (a > 2) # 2보다 큰 a의 요소를 찾습니다; # 이 코드는 a와 shape가 같고 불리언 자료형을 요소로 하는 numpy 배열을 반환합니다, - # bool_idx의 각 요소는 동일한 위치에 있는 a의 + # bool_idx의 각 요소는 동일한 위치에 있는 a의 # 요소가 2보다 큰지를 말해줍니다. print bool_idx # 출력 "[[False False] @@ -814,7 +814,7 @@ print y # 출력 "[[ 2 2 4] 설명이 이해하기 부족하다면 [scipy문서](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)나 [scipy위키](http://wiki.scipy.org/EricsBroadcastingDoc)를 참조하세요. -브로드캐스팅을 지원하는 함수를 *universal functions*라고 합니다. +브로드캐스팅을 지원하는 함수를 *universal functions*라고 합니다. *universal functions* 목록은 [문서](http://docs.scipy.org/doc/numpy/reference/ufuncs.html#available-ufuncs)를 참조하세요. 브로드캐스팅을 응용한 예시들입니다: @@ -844,7 +844,7 @@ print x + v # 벡터를 행렬의 각 행에 더하기 # x는 shape가 (2, 3)이고 w는 shape가 (2,)입니다. # x의 전치행렬은 shape가 (3,2)이며 이는 w와 브로드캐스팅이 가능하고 결과로 shape가 (3,2)인 행렬이 생깁니다; -# 이 행렬을 전치하면 shape가 (2,3)인 행렬이 나오며 +# 이 행렬을 전치하면 shape가 (2,3)인 행렬이 나오며 # 이는 행렬 x의 각 열에 벡터 w을 더한 결과와 동일합니다. # 아래의 행렬입니다: # [[ 5 6 7] @@ -896,7 +896,7 @@ from scipy.misc import imread, imsave, imresize img = imread('assets/cat.jpg') print img.dtype, img.shape # 출력 "uint8 (400, 248, 3)" -# 각각의 색깔 채널을 다른 상수값으로 스칼라배함으로써 +# 각각의 색깔 채널을 다른 상수값으로 스칼라배함으로써 # 이미지의 색을 변화시킬 수 있습니다. # 이미지의 shape는 (400, 248, 3)입니다; # 여기에 shape가 (3,)인 배열 [1, 0.95, 0.9]를 곱합니다; @@ -923,7 +923,7 @@ imsave('assets/cat_tinted.jpg', img_tinted) ### MATLAB 파일 -`scipy.io.loadmat` 와 `scipy.io.savemat`함수를 통해 +`scipy.io.loadmat` 와 `scipy.io.savemat`함수를 통해 matlab 파일을 읽고 쓸 수 있습니다. [문서](http://docs.scipy.org/doc/scipy/reference/io.html)를 참조하세요. 
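위에서 언급한 `scipy.io.savemat`과 `scipy.io.loadmat`의 사용법을 보여주는 간단한 스케치를 덧붙입니다 (파일명 `data.mat`과 변수명 `x`는 설명을 위해 임의로 정한 가정입니다):

~~~python
import numpy as np
from scipy.io import loadmat, savemat

x = np.array([[1, 2, 3], [4, 5, 6]])
savemat('data.mat', {'x': x})  # 배열을 'x'라는 이름으로 data.mat 파일에 저장

data = loadmat('data.mat')     # MATLAB 파일을 딕셔너리 형태로 읽어오기
print data['x']                # 출력 "[[1 2 3]
                               #       [4 5 6]]"
~~~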
@@ -1027,7 +1027,7 @@ plt.show() import numpy as np import matplotlib.pyplot as plt -# 사인과 코사인 곡선의 x,y 좌표를 계산 +# 사인과 코사인 곡선의 x,y 좌표를 계산 x = np.arange(0, 3 * np.pi, 0.1) y_sin = np.sin(x) y_cos = np.cos(x) From ecca6db24c8d7e9a2946d14499430dec9b783b36 Mon Sep 17 00:00:00 2001 From: Sanghun Kang Date: Wed, 8 Jun 2016 22:35:09 +0900 Subject: [PATCH 180/199] unnecessary leading and trailing spaces removed --- python-numpy-tutorial.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/python-numpy-tutorial.md b/python-numpy-tutorial.md index 356da92d..8355cf01 100644 --- a/python-numpy-tutorial.md +++ b/python-numpy-tutorial.md @@ -996,7 +996,7 @@ plt.show() # 그래프를 나타나게 하기 위해선 plt.show()함수를 호 import numpy as np import matplotlib.pyplot as plt -# 사인과 코사인 곡선의 x,y 좌표를 계산 +# 사인과 코사인 곡선의 x,y 좌표를 계산 x = np.arange(0, 3 * np.pi, 0.1) y_sin = np.sin(x) y_cos = np.cos(x) From 2a9fda7544ffd29d97e106dc7d099ea0a97c8cb5 Mon Sep 17 00:00:00 2001 From: myungsub Date: Mon, 13 Jun 2016 22:25:57 +0900 Subject: [PATCH 181/199] update linear-classify --- index.html | 2 +- linear-classify.md | 38 +++++++++++++++++++------------------- 2 files changed, 20 insertions(+), 20 deletions(-) diff --git a/index.html b/index.html index e083ac78..0a98eb30 100644 --- a/index.html +++ b/index.html @@ -131,7 +131,7 @@ 선형 분류: Support Vector Machine, Softmax - +
parameteric 접근법, bias 트릭, hinge loss, cross-entropy loss, L2 regularization, 웹 데모 diff --git a/linear-classify.md b/linear-classify.md index e02308ac..d3b7d525 100644 --- a/linear-classify.md +++ b/linear-classify.md @@ -30,44 +30,44 @@ Table of Contents: ### 이미지에서 라벨 스코어로의 파라미터화된 매핑(mapping) -The first component of this approach is to define the score function that maps the pixel values of an image to confidence scores for each class. We will develop the approach with a concrete example. As before, let's assume a training dataset of images $ x_i \in R^D $, each associated with a label $ y_i $. Here $ i = 1 \dots N $ and $ y_i \in \{ 1 \dots K \} $. That is, we have **N** examples (each with a dimensionality **D**) and **K** distinct categories. For example, in CIFAR-10 we have a training set of **N** = 50,000 images, each with **D** = 32 x 32 x 3 = 3072 pixels, and **K** = 10, since there are 10 distinct classes (dog, cat, car, etc). We will now define the score function $f: R^D \mapsto R^K$ that maps the raw image pixels to class scores. +먼저, 이미지의 픽셀 값들을 각 클래스에 대한 신뢰도 점수 (confidence score)로 매핑시켜주는 스코어 함수를 정의한다. 여기서는 구체적인 예시를 통해 각 과정을 살펴볼 것이다. 이전 노트에서처럼, 학습 데이터셋 이미지들인 $$ x_i \in R^D $$가 있고, 각각이 해당 라벨 $$ y_i $$를 갖고 있다고 하자. 여기서 $$ i = 1 \dots N $$, 그리고 $$ y_i \in \{ 1 \dots K \} $$이다. 즉, 학습할 데이터 **N** 개가 있고 (각각은 **D** 차원의 벡터이다.), 총 **K** 개의 서로 다른 카테고리(클래스)가 있다. 예를 들어, CIFAR-10 에서는 **N** = 50,000 개의 학습 데이터 이미지들이 있고, 각각은 **D** = 32 x 32 x 3 = 3072 픽셀로 이루어져 있으며, (dog, cat, car, 등등) 10개의 서로 다른 클래스가 있으므로 **K** = 10 이다. 이제 이미지의 픽셀값들을 클래스 스코어로 매핑해 주는 스코어 함수 $$f: R^D \mapsto R^K$$ 을 아래에 정의할 것이다. -**Linear classifier.** In this module we will start out with arguably the simplest possible function, a linear mapping: +**선형 분류기 (Linear Classifier).** 이 파트에서는 가장 단순한 함수라고 할 수 있는 선형 매핑 함수로 시작할 것이다. $$ f(x_i, W, b) = W x_i + b $$ -In the above equation, we are assuming that the image $x_i$ has all of its pixels flattened out to a single column vector of shape [D x 1]. The matrix **W** (of size [K x D]), and the vector **b** (of size [K x 1]) are the **parameters** of the function. In CIFAR-10, $x_i$ contains all pixels in the i-th image flattened into a single [3072 x 1] column, **W** is [10 x 3072] and **b** is [10 x 1], so 3072 numbers come into the function (the raw pixel values) and 10 numbers come out (the class scores). The parameters in **W** are often called the **weights**, and **b** is called the **bias vector** because it influences the output scores, but without interacting with the actual data $x_i$. However, you will often hear people use the terms *weights* and *parameters* interchangeably. +위 식에서, 우리는 각 이미지 $$x_i$$의 모든 픽셀들이 [D x 1] 모양을 갖는 하나의 열 벡터로 평평하게 했다고 가정하였다. [K x D] 차원의 행렬 **W** 와 [K x 1] 차원의 벡터 **b** 는 이 함수의 **파라미터** 이다. CIFAR-10 에서 $$x_i$$ 는 i번째 이미지의 모든 픽셀을 [3072 x 1] 크기로 평평하게 모양을 바꾼 열 벡터가 될 것이고, **W** 는 [10 x 3072], **b** 는 [10 x 1] 여서 3072 개의 숫자가 함수의 입력(이미지 픽셀 값들)으로 들어와 10개의 숫자가 출력(클래스 스코어)으로 나오게 된다. **W** 안의 파라미터들은 보통 **weight** 라고 불리고, **b** 는 **bias 벡터** 라 불리는데, 그 이유는 b가 실제 입력 데이터인 $$x_i$$와의 아무런 상호 작용이 없이 출력 스코어 값에는 영향을 주기 때문이다. 그러나 보통 일반적으로 사람마다 *weight* 와 *파라미터(parameter)* 두 개의 용어를 혼용해서 사용하는 경우가 많다. -There are a few things to note: +여기서 몇 가지 짚고 넘어갈 점이 있다. -- First, note that the single matrix multiplication $W x_i$ is effectively evaluating 10 separate classifiers in parallel (one for each class), where each classifier is a row of **W**. 
-- Notice also that we think of the input data $ (x_i, y_i) $ as given and fixed, but we have control over the setting of the parameters **W,b**. Our goal will be to set these in such way that the computed scores match the ground truth labels across the whole training set. We will go into much more detail about how this is done, but intuitively we wish that the correct class has a score that is higher than the scores of incorrect classes. -- An advantage of this approach is that the training data is used to learn the parameters **W,b**, but once the learning is complete we can discard the entire training set and only keep the learned parameters. That is because a new test image can be simply forwarded through the function and classified based on the computed scores. -- Lastly, note that to classifying the test image involves a single matrix multiplication and addition, which is significantly faster than comparing a test image to all training images. +- 먼저, 한 번의 행렬곱 $$W x_i$$ 만으로 10 개의 로 다른 분류기(각 클래스마다 하나씩)를 병렬로 계산하는 효과를 나타내고 있다는 점을 살펴보자. 이 때 **W** 행렬의 각 열이 각각 하나의 분류기가 된다. +- 또한, 여기서 입력 데이터 $$ (x_i, y_i) $$는 주어진 값이고 고정되어 있지만, 파라미터들인 **W, b** 의 세팅은 우리가 조절할 수 있다는 점을 생각하자. 우리의 최종 목표는 전체 학습 데이터에 대해서 우리가 계산할 스코어 값들이 실제 (ground truth) 라벨과 가장 잘 일치하도록 이 파라미터 값들을 정하는 것이다. 이후(아래)에 자세한 방법에 대해 다룰 것이지만, 직관적으로 간략하게 말하자면 올바르게 잘 맞춘 클래스가 틀린 클래스들보다 더 높은 스코어를 갖도록 조절할 것이다. +- 이러한 방식의 장점은, 학습 데이터가 파라미터들인 **W, b** 를 학습하는데 사용되지만 학습이 끝난 이후에는 학습된 파라미터들만 남기고, 학습에 사용된 데이터셋은 더 이상 필요가 없다는 (따라서 메모리에서 지워버려도 된다는) 점이다. 그 이유는, 새로운 테스트 이미지가 입력으로 들어올 때 위의 함수에 의해 스코어를 계산하고, 계산된 스코어를 통해 바로 분류되기 때문이다. +- 마지막으로, 테스트 이미지를 분류할 때 행렬곱 한 번과 덧셈 한 번을 하는 계산만 필요하다는 점을 주목하자. 이것은 테스트 이미지를 모든 학습 이미지와 비교하는 것에 비하면 매우 빠르다. -> Foreshadowing: Convolutional Neural Networks will map image pixels to scores exactly as shown above, but the mapping ( f ) will be more complex and will contain more parameters. +> 스포일러: 컨볼루션 신경망(Convolutional Neural Networks)은 정확히 위의 방식처럼 이미지 픽셀 값을 스코어 값으로 매핑시켜 주지만, 매핑시켜주는 함수 ( f ) 가 훨씬 더 복잡해지고 더 많은 수의 파라미터를 갖고 있을 것이다. ### 선형 분류기 분석하기 -Notice that a linear classifier computes the score of a class as a weighted sum of all of its pixel values across all 3 of its color channels. Depending on precisely what values we set for these weights, the function has the capacity to like or dislike (depending on the sign of each weight) certain colors at certain positions in the image. For instance, you can imagine that the "ship" class might be more likely if there is a lot of blue on the sides of an image (which could likely correspond to water). You might expect that the "ship" classifier would then have a lot of positive weights across its blue channel weights (presence of blue increases score of ship), and negative weights in the red/green channels (presence of red/green descreases the score of ship). +선형 분류기는 클래스 스코어를 이미지의 모든 픽셀 값들의 가중치 합으로 스코어를 계산하고, 이 때 각 픽셀의 3 개의 색 채널을 모두 고려하는 것에 주목하자. 이 때 각 가중치(파라미터, weights)에 어떤 값을 주느냐에 따라 스코어 함수는 이미지의 특정 위치에서 특정 색깔을 선호하거나 선호하지 않거나 (가중치 값의 부호에 따라) 할 수 있다. 예를 들어, "ship" 클래스는 이미지의 가장자리 부분에 파란색이 많은 경우에 (강, 바다 등의 물에 해당하는 색) 스코어 값이 더 높아질 것이라고 추측해 볼 수 있을 것이다. 즉, "ship" 분류기는 파란색 채널의 파라미터(weights)들이 양의 값을 갖고 (파란색이 존재하는 것이 ship의 스코어를 증가시키도록), 빨강/초록색 채널에는 음의 값을 갖는 파라미터들이 많을 것이라고 (빨간색/초록색의 존재는 ship의 스코어를 감소시키도록) 예상할 수 있다.
-
An example of mapping an image to class scores. For the sake of visualization, we assume the image only has 4 pixels (4 monochrome pixels, we are not considering color channels in this example for brevity), and that we have 3 classes (red (cat), green (dog), blue (ship) class). (Clarification: in particular, the colors here simply indicate 3 classes and are not related to the RGB channels.) We stretch the image pixels into a column and perform matrix multiplication to get the scores for each class. Note that this particular set of weights W is not good at all: the weights assign our cat image a very low cat score. In particular, this set of weights seems convinced that it's looking at a dog.
+
이미지에서 클래스 스코어로의 매핑 예시. 시각화를 위해서, 이미지가 픽셀 4개만으로 이루어져 있고 (간결함을 위해 색 채널은 고려하지 않는 단일 채널이라고 생각하자), 3개의 클래스(빨강 (cat), 초록 (dog), 파랑 (ship) 클래스)가 있다고 하자. (주: 여기에서의 색깔은 3개의 클래스를 나타내기 위함이고, RGB 채널과는 전혀 상관이 없다.) 이제 이미지 픽셀들을 펼쳐서 열 벡터로 만들고 각 클래스에 대해 행렬곱을 수행하면 스코어 값을 얻을 수 있다. 여기서 정해준 파라미터 W 값들은 매우 안 좋은 예시인 것을 확인하자: 현재의 파라미터는 고양이(cat) 이미지에 매우 낮은 cat 스코어를 준다. 특히, 이 파라미터 값들은 지금 보고 있는 이미지를 dog이라고 확신하는 것처럼 보인다.
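위 그림의 계산 과정은 numpy로 간단히 옮겨 볼 수 있다. 아래는 이를 보여주는 스케치이다 (픽셀 값과 W, b의 숫자들은 설명을 위해 임의로 정한 가정이다):

~~~python
import numpy as np

# 4개의 픽셀을 펼친 열 벡터와, 3개 클래스 (cat, dog, ship)에 대한 파라미터
x = np.array([56.0, 231.0, 24.0, 2.0])   # shape (4,), 임의의 픽셀 값
W = np.array([[0.2, -0.5, 0.1, 2.0],     # shape (3, 4), 각 행이 한 클래스의 분류기 역할
              [1.5, 1.3, 2.1, 0.0],
              [0.0, 0.25, 0.2, -0.3]])   # 값들은 임의로 정한 가정
b = np.array([1.1, 3.2, -1.2])           # shape (3,), 임의의 bias

scores = W.dot(x) + b  # 행렬곱 한 번으로 세 클래스의 스코어를 동시에 계산
print scores           # cat, dog, ship 순서의 클래스 스코어
~~~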
-**Analogy of images as high-dimensional points.** Since the images are stretched into high-dimensional column vectors, we can interpret each image as a single point in this space (e.g. each image in CIFAR-10 is a point in 3072-dimensional space of 32x32x3 pixels). Analogously, the entire dataset is a (labeled) set of points. +**이미지와 고차원 공간 상의 점에 대한 비유.** 이미지들을 고차원 열 벡터로 펼쳤기 때문에, 우리는 각 이미지를 이 고차원 공간 상의 하나의 점으로 생각할 수 있다 (e.g. CIFAR-10 데이터셋의 각 이미지는 32x32x3 개의 픽셀로 이루어진 3072-차원 공간 상의 한 점이 된다). 마찬가지로 생각하면, 전체 데이터셋은 라벨링된 고차원 공간 상의 점들의 집합이 될 것이다. -Since we defined the score of each class as a weighted sum of all image pixels, each class score is a linear function over this space. We cannot visualize 3072-dimensional spaces, but if we imagine squashing all those dimensions into only two dimensions, then we can try to visualize what the classifier might be doing: +위에서 각 클래스에 대한 스코어를 이미지의 모든 픽셀에 대한 가중치 합으로 정의했기 때문에, 각 클래스 스코어는 이 공간 상에서의 선형 함수값이 된다. 3072-차원 공간은 시각화할 수 없지만, 2차원으로 축소시켰다고 상상해보면 우리의 분류기가 어떤 행동을 하는지를 시각화하려고 시도해볼 수 있을 것이다:
- Cartoon representation of the image space, where each image is a single point, and three classifiers are visualized. Using the example of the car classifier (in red), the red line shows all points in the space that get a score of zero for the car class. The red arrow shows the direction of increase, so all points to the right of the red line have positive (and linearly increasing) scores, and all points to the left have a negative (and linearly decreasing) scores. + 이미지 공간의 시각화. 각 이미지는 하나의 점에 해당되고, 3 개의 분류기가 표시되어 있다. 자동차(car) 분류기(빨간색)를 예로 들어보면, 빨간색 선이 이 공간 상에서 car 클래스에 대해 스코어 값이 0이 되는 모든 점을 나타낸 것이다. 빨간색 화살표는 스코어가 증가하는 방향을 나타낸 것으로, 빨간색 선의 오른쪽에 있는 점들은 양의 (그리고 선형적으로 증가하는) 스코어 값을 가질 것이고, 왼쪽의 점들은 음의 (그리고 선형적으로 감소하는) 스코어 값을 가질 것이다.
@@ -118,7 +118,7 @@ For example, going back to the example image of a cat and its scores for the cla -#### Multiclass Support Vector Machine 손실함수수 +#### Multiclass Support Vector Machine 손실함수 There are several ways to define the details of the loss function. As a first example we will first develop a commonly used loss called the **Multiclass Support Vector Machine** (SVM) loss. The SVM loss is set up so that the SVM "wants" the correct class for each image to a have a score higher than the incorrect classes by some fixed margin $\Delta$. Notice that it's sometimes helpful to anthropomorphise the loss functions as we did above: The SVM "wants" a certain outcome in the sense that the outcome would yield a lower loss (which is good). @@ -339,16 +339,16 @@ where the probabilites are now more diffuse. Moreover, in the limit where the we ### 선형 분류 웹 데모 -
- + +
We have written an interactive web demo to help your intuitions with linear classifiers. The demo visualizes the loss functions discussed in this section using a toy 3-way classification on 2D data. The demo also jumps ahead a bit and performs the optimization, which we will discuss in full detail in the next section.
- - + + ### 요약 From 145f7dd7974fe0dd2e9233d9a7b820d74bbb664a Mon Sep 17 00:00:00 2001 From: Dongkyu Kim Date: Sat, 18 Jun 2016 10:04:21 +0900 Subject: [PATCH 182/199] =?UTF-8?q?=20'=EC=8B=9C=EA=B7=B8=EB=AA=A8?= =?UTF-8?q?=EC=9D=B4=EB=93=9C=20=EC=98=88=EC=A0=9C'=20=EC=84=B9=EC=85=98?= =?UTF-8?q?=20=EC=9E=91=EC=84=B1=20=EC=99=84=EB=A3=8C?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- optimization-2.md | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/optimization-2.md b/optimization-2.md index 32ae548e..55188b56 100644 --- a/optimization-2.md +++ b/optimization-2.md @@ -95,7 +95,7 @@ dfdy = 1.0 * dfdq # dq/dy = 1 -2-4x5-4y-43z3-4q+-121f*
- 좌측에 실수 값으로 표현되는 "회로"는 이 계산에 대한 시각 표현을 보여준다. 전방 전달(forward pass)은 입력부터 출력까지 값을 계산한다 (녹색으로 표시). 그리고 나서 후방 전달(backward pass)는 backpropagation을 수행하는데, 이는 끝에서 시작해서 반복적으로 체인 룰을 적용해 회로 입력에 대한 모든 길에서 그라디언트 값 (적색으로 표시) 을 계산한다. 그라디언트 값은 회로를 통해 거꾸로 흐르는 것으로 볼 수 있다. + 좌측에 실수 값으로 표현되는 "회로"는 이 계산에 대한 시각 표현을 보여준다. 전방 전달(forward pass)은 입력부터 출력까지 값을 계산한다 (녹색으로 표시). 그리고 나서 후방 전달(backward pass)은 backpropagation을 수행하는데, 이는 끝에서 시작해서 반복적으로 연쇄 법칙을 적용해 회로 입력에 대한 모든 길에서 그라디언트 값(적색으로 표시)을 계산한다. 그라디언트 값은 회로를 통해 거꾸로 흐르는 것으로 볼 수 있다.
@@ -120,7 +120,7 @@ $$ f(w,x) = \frac{1}{1+e^{-(w_0x_0 + w_1x_1 + w_2)}} $$ -나중에 다른 수업에서 보겠지만, 이 표현식은 *시그모이드 활성* 함수를 사용하는 2차원 뉴런(입력 **x**와 가중치 **w**를 갖는)을 나타낸다. 그러나 지금은 이를 매우 단순하게 *w, x*를 입력으로 받아 하나의 단일 숫자를 출력하는 하나의 함수정도로 생각하자. 이 함수는 여러개의 게이트로 구성된다. 위에서 이미 설명한 게이트들(덧셈, 곱셈, 최대)에 더해 네 종류의 게이트가 더 있다: +나중에 다른 수업에서 보겠지만, 이 표현식은 *시그모이드 활성* 함수를 사용하는 2차원 뉴런(입력 **x**와 가중치 **w**를 갖는)을 나타낸다. 그러나 지금은 이를 매우 단순하게 *w,x*를 입력으로 받아 하나의 단일 숫자를 출력하는 하나의 함수정도로 생각하자. 이 함수는 여러개의 게이트로 구성된다. 위에서 이미 설명한 게이트들(덧셈, 곱셈, 최대)에 더해 네 종류의 게이트가 더 있다: $$ f(x) = \frac{1}{x} @@ -145,12 +145,12 @@ $$
(그림: 시그모이드 뉴런 예시 회로 다이어그램. 각 게이트(*, +, exp, 1/x)의 전방 전달 값과 그라디언트 값이 표시되어 있다.)
- Example circuit for a 2D neuron with a sigmoid activation function. The inputs are [x0,x1] and the (learnable) weights of the neuron are [w0,w1,w2]. As we will see later, the neuron computes a dot product with the input and then its activation is softly squashed by the sigmoid function to be in range from 0 to 1. + 시그모이드 활성 함수를 갖는 2차원 뉴런에 대한 예시 회로. 입력은 [x0,x1]이고 뉴런의 (학습 가능한) 파라미터 값들은 [w0,w1,w2]이다. 나중에 보겠지만, 뉴런은 입력을 가지고 내적을 계산하고 이 입력의 활성 함수 출력 값은 0부터 1사이의 범위에 들어가도록 시그모이드 함수에 의해 압착(squash)이 된다.
-In the example above, we see a long chain of function applications that operates on the result of the dot product between **w,x**. The function that these operations implement is called the *sigmoid function* $\sigma(x)$. It turns out that the derivative of the sigmoid function with respect to its input simplifies if you perform the derivation (after a fun tricky part where we add and subtract a 1 in the numerator): +위 예제에서 **w,x** 사이의 내적의 결과로 동작하는 함수 적용(function applications)의 긴 체인을 보았다. 이러한 연산을 제공하는 함수를 *시그모이드 함수(sigmoid function)* $\sigma(x)$ 라고 한다. 만약 (분자에 1을 더하고 다시 빼는 재미있지만 까다로운 과정을 거친 후에)미분을 한다면 입력에 대한 시그모이드 함수의 미분값은 단순화할 수 있는 것으로 알려져 있다. $$ \sigma(x) = \frac{1}{1+e^{-x}} \\\\ @@ -158,7 +158,7 @@ $$ = \left( 1 - \sigma(x) \right) \sigma(x) $$ -As we see, the gradient turns out to simplify and becomes surprisingly simple. For example, the sigmoid expression receives the input 1.0 and computes the ouput 0.73 during the forward pass. The derivation above shows that the *local* gradient would simply be (1 - 0.73) * 0.73 ~= 0.2, as the circuit computed before (see the image above), except this way it would be done with a single, simple and efficient expression (and with less numerical issues). Therefore, in any real practical application it would be very useful to group these operations into a single gate. Lets see the backprop for this neuron in code: +보이는 것처럼 그라디언트는 단순화되면서 놀라울만큼 간단해진다.예를 들어 시그모이드 표현은 전방 전달(forward pass) 과정에서 입력 1.0을 받아 출력 0.73을 계산한다. 단일의 단순하고 효율적인 표현식을 이용해 (그리고 더 적은 수치적인 문제를 갖고) 계산하는 방식을 제외하고서, 마치 이전에 본 회로가 계산했던 것(위 그림을 보라)과 비슷하게 위의 미분은 *지역(local)* 그라디언트 값이 단순히 (1 - 0.73) * 0.73 ~= 0.2 가 됨을 보여준다. 그러므로 어떤 실제 실용적인 적용에서 그러한 연산들을 단일 게이트로 묶어주는 것은 매우 유용하다고 할 수 있다. 코드에서 이 뉴런에 대한 backprop를 살펴보자: ~~~python w = [2,-3,-3] # assume some random weights and data @@ -175,20 +175,20 @@ dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot] # backprop into w # we're done! we have the gradients on the inputs to the circuit ~~~ -**Implementation protip: staged backpropagation**. As shown in the code above, in practice it is always helpful to break down the forward pass into stages that are easily backpropped through. For example here we created an intermediate variable `dot` which holds the output of the dot product between `w` and `x`. During backward pass we then successively compute (in reverse order) the corresponding variables (e.g. `ddot`, and ultimately `dw, dx`) that hold the gradients of those variables. +**구현 팁(protip): 단계적 backpropagation**. 위 코드에서 볼 수 있듯이, 전방 전달(forward pass)를 쉽게 backprop되는 단계들로 잘게 분해하는 것은 실질적으로 항상 도움이 된다. 예를 들어 우리는 여기서 `w`와 `x` 사이의 내적의 결과를 담는 중간 변수 `dot`를 만들었다. 그리고나서 후방 전달(backward pass) 과정에서 그러한 변수들의 그라디언트 값들을 담은 해당 변수들(예: `ddot` 및 궁극적으로는 `dw, dx`)을 성공적으로 계산한다(역순으로). -The point of this section is that the details of how the backpropagation is performed, and which parts of the forward function we think of as gates, is a matter of convenience. It helps to be aware of which parts of the expression have easy local gradients, so that they can be chained together with the least amount of code and effort. +이 섹션에서 요점은 어떻게 backpropagation이 수행되는 지와 전방 함수(forward function)의 어느 부분을 게이트로 취급해야할 지에 대한 세부사항은 편의성 문제라는 것이다. 이는 표현식의 어느 부분들이 쉬운 지역 그라디언트를 가지며, 가장 적은 코드의 양과 노력으로 이들을 함께 묶을 수 있는지를 이해하는데 도움이 된다. -### Backprop in practice: Staged computation +### 실제 backprop: 단계적 계산 -Lets see this with another example. Suppose that we have a function of the form: +또 다른 예제를 통해 확인해보자. 
다음과 같은 형태의 함수가 있다고 가정하자: $$ f(x,y) = \frac{x + \sigma(y)}{\sigma(x) + (x+y)^2} $$ -To be clear, this function is completely useless and it's not clear why you would ever want to compute its gradient, except for the fact that it is a good example of backpropagation in practice. It is very important to stress that if you were to launch into performing the differentiation with respect to either $x$ or $y$, you would end up with very large and complex expressions. However, it turns out that doing so is completely unnecessary because we don't need to have an explicit function written down that evaluates the gradient. We only have to know how to compute it. Here is how we would structure the forward pass of such expression: +명확히 말하면, 실제 backpropagation의 좋은 예제라는 사실 외에는 이 함수는 완전히 쓸모가 없으며 따라서 왜 여러분이 이 함수의 그라디언트를 그토록 계산해야 하는지 그 이유도 뚜렷하지 않다. 만약 여러분들이 $x$ 또는 $y$에 관해서 미분을 수행한다면 결국 매우 크고 복잡한 식을 얻게 될 것이다. 하지만, 그라디언트를 계산하는 명확한 함수(explicit function)를 쓸 필요가 없기 때문에 그렇게 미분하는 것은 완전히 불필요한 것으로 알려져있다. 우리는 단지 어떻게 이를 계산하는지만 알면 된다. 다음은 우리가 어떻게 그러한 표현식에 대해 전방 전달(forward pass)을 구조화 하는지를 나타낸 것이다: ~~~python x = 3 # example values @@ -205,7 +205,7 @@ invden = 1.0 / den #(7) f = num * invden # done! #(8) ~~~ -Phew, by the end of the expression we have computed the forward pass. Notice that we have structured the code in such way that it contains multiple intermediate variables, each of which are only simple expressions for which we already know the local gradients. Therefore, computing the backprop pass is easy: We'll go backwards and for every variable along the way in the forward pass (`sigy, num, sigx, xpy, xpysqr, den, invden`) we will have the same variable, but one that begins with a `d`, which will hold the gradient of the output of the circuit with respect to that variable. Additionally, note that every single piece in our backprop will involve computing the local gradient of that expression, and chaining it with the gradient on that expression with a multiplication. For each row, we also highlight which part of the forward pass it refers to: +표현식의 마지막에서 전방 전달(forward pass)을 계산했다. 각각이 단순한 표현식들인 다수의 중간 변수들을 포함하는 방식으로 코드를 구조화한 것에 주목하자, 우리는 이미 이 표현식들에 대한 지역 그라디언트 값을 알고 있다. 그러므로, backprop 전달을 계산하는 것은 쉬운 일이다: 전방 전달 과정의 모든 변수들(`sigy, num, sigx, xpy, xpysqr, den, invden`)에 대해 역방향으로 가면서 똑같은 변수들을 볼 것이다, 다만 해당 변수에 대한 회로 출력의 그라디언트를 담는 것을 나타내기 위해 변수명 앞에 `d`를 붙인다. 추가로, backprop에서 모든 단일 조각이 이 표현식에 대한 지역 그라디언트을 계산하고 곱셈 형태로 이 그라디언트 값을 연결하는 과정을 수반할 것이다. 각 행마다 전방 전달 과정에서 어느 부분에 해당하는지 표시한 것이다: ~~~python # backprop f = num * invden @@ -231,11 +231,11 @@ dy += ((1 - sigy) * sigy) * dsigy #(1) # done! phew ~~~ -Notice a few things: +몇 가지 주의할 점: -**Cache forward pass variables**. To compute the backward pass it is very helpful to have some of the variables that were used in the forward pass. In practice you want to structure your code so that you cache these variables, and so that they are available during backpropagation. If this is too difficult, it is possible (but wasteful) to recompute them. +**전방 전달 변수들을 저장(cache)하라**. 후방 전달을 계산하기 위해 전방 전달에서 사용한 일부 변수들을 가지고 있는 것은 정말 유용하다. 실제로 여러분은 이 변수들을 저장해서 backpropagation 동안 이용할 수 있도록 코드를 구성하고 싶을 것이다. 이것이 너무 어려운 일이라면, 이 변수들을 다시 계산할 수 있다(물론 비효율적이지만). -**Gradients add up at forks**. The forward expression involves the variables **x,y** multiple times, so when we perform backpropagation we must be careful to use `+=` instead of `=` to accumulate the gradient on these variables (otherwise we would overwrite it). 
This follows the *multivariable chain rule* in Calculus, which states that if a variable branches out to different parts of the circuit, then the gradients that flow back to it will add. +**갈래길에서 그라디언트는 더해진다**. 전방 표현식은 변수 **x,y**를 여러번 수반하므로, backpropagation을 수행할 때 이 변수들에 대한 그라디언트 값을 축적하기 위해 `=` 대신 `+=`를 사용해야 하는 점에 주의해야 한다 (그렇게 하지 않으면 덮어쓰게 된다). 이는 Calculus에 나오는 *다변수 연쇄 법칙(multivariate chain rule)*을 따른다, Calculus에는 하나의 변수가 회로의 다른 부분들로 가지를 뻗어나가면, 반환하는 그라디언트는 더해질 것이라고 명시되어 있다. ### Patterns in backward flow From af934a9d60f46c5a887413b472f729eb1f569508 Mon Sep 17 00:00:00 2001 From: Dongkyu Kim Date: Sat, 18 Jun 2016 10:07:52 +0900 Subject: [PATCH 183/199] =?UTF-8?q?optimization-2=20=EC=A7=84=ED=96=89?= =?UTF-8?q?=EC=83=81=ED=99=A9=20=EC=97=85=EB=8D=B0=EC=9D=B4=ED=8A=B8?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- index.html | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/index.html b/index.html index 713b0009..ec4a66b4 100644 --- a/index.html +++ b/index.html @@ -152,10 +152,10 @@
- Backpropagation, Intuition + Backpropagation, 직관 - +
연쇄 법칙 (chain rule) 해석, real-valued circuits, 그라디언트 흐름의 패턴 From 11303ddd6466da4537391a9eeeeacf7cb22106a2 Mon Sep 17 00:00:00 2001 From: MaybeS Date: Sun, 26 Jun 2016 14:39:51 +0900 Subject: [PATCH 184/199] Assignment1 softmax Update --- assignments2016/assignment1/softmax.ipynb | 28 +++++++++++------------ 1 file changed, 13 insertions(+), 15 deletions(-) diff --git a/assignments2016/assignment1/softmax.ipynb b/assignments2016/assignment1/softmax.ipynb index 94fd47cf..c319fe0c 100644 --- a/assignments2016/assignment1/softmax.ipynb +++ b/assignments2016/assignment1/softmax.ipynb @@ -192,8 +192,7 @@ "toc = time.time()\n", "print 'vectorized loss: %e computed in %fs' % (loss_vectorized, toc - tic)\n", "\n", - "# As we did for the SVM, we use the Frobenius norm to compare the two versions\n", - "# of the gradient.\n", + "# SVM에서 했던것 처럼, Frobenius 방법을 사용해 두 버전의 요소를 비교할 것입니다.\n", "grad_difference = np.linalg.norm(grad_naive - grad_vectorized, ord='fro')\n", "print 'Loss difference: %f' % np.abs(loss_naive - loss_vectorized)\n", "print 'Gradient difference: %f' % grad_difference" @@ -207,10 +206,9 @@ }, "outputs": [], "source": [ - "# Use the validation set to tune hyperparameters (regularization strength and\n", - "# learning rate). You should experiment with different ranges for the learning\n", - "# rates and regularization strengths; if you are careful you should be able to\n", - "# get a classification accuracy of over 0.35 on the validation set.\n", + "# 검증셋을 이용하여 hyperparameters(정규화 강도와 학습률)를 튜닝하세요.\n", + "# 다른 범위에 대해 학습률과 정규화 강도를 실험해 보세요.\n", + "# 검증셋에 대해 0.35 이상의 분류 정확도를 얻어야 합니다.\n", "from cs231n.classifiers import Softmax\n", "results = {}\n", "best_val = -1\n", @@ -220,16 +218,16 @@ "\n", "################################################################################\n", "# TODO: #\n", - "# Use the validation set to set the learning rate and regularization strength. #\n", - "# This should be identical to the validation that you did for the SVM; save #\n", - "# the best trained softmax classifer in best_softmax. #\n", + "# 검증셋을 이용해 학습률과 정규화 강도를 설정합니다. #\n", + "# 이것은 SVM에서의 검증과 같아야합니다; #\n", + "# 가장 잘 학습된 softmax 분류기를 best_softmax에 저장하세요. 
#\n", "################################################################################\n", "pass\n", "################################################################################\n", - "# END OF YOUR CODE #\n", + "# 코드의 끝 #\n", "################################################################################\n", " \n", - "# Print out results.\n", + "# 결과를 출력합니다.\n", "for lr, reg in sorted(results):\n", " train_accuracy, val_accuracy = results[(lr, reg)]\n", " print 'lr %e reg %e train accuracy: %f val accuracy: %f' % (\n", @@ -246,8 +244,8 @@ }, "outputs": [], "source": [ - "# evaluate on test set\n", - "# Evaluate the best softmax on test set\n", + "# 테스트 셋으로 평가해 봅니다.\n", + "# 테스트 셋에서 최고의 softmax를 평가해 봅니다.\n", "y_test_pred = best_softmax.predict(X_test)\n", "test_accuracy = np.mean(y_test == y_test_pred)\n", "print 'softmax on raw pixels final test set accuracy: %f' % (test_accuracy, )" @@ -261,7 +259,7 @@ }, "outputs": [], "source": [ - "# Visualize the learned weights for each class\n", + "# 각 클래스에 대한 학습 된 가중치를 시각화\n", "w = best_softmax.W[:-1,:] # strip out the bias\n", "w = w.reshape(32, 32, 3, 10)\n", "\n", @@ -271,7 +269,7 @@ "for i in xrange(10):\n", " plt.subplot(2, 5, i + 1)\n", " \n", - " # Rescale the weights to be between 0 and 255\n", + " # 가중치를 0과 255사이로 재조정\n", " wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)\n", " plt.imshow(wimg.astype('uint8'))\n", " plt.axis('off')\n", From 4bf68926d134a4ebd4694a3a9bf5a18e6fa42d23 Mon Sep 17 00:00:00 2001 From: Myungsub Choi Date: Tue, 28 Jun 2016 17:43:19 +0900 Subject: [PATCH 185/199] fix type & update linear-classify --- convolutional-networks-korean.md | 390 ------------------------------- linear-classify.md | 14 +- neural-networks-2.kr.md | 2 +- 3 files changed, 8 insertions(+), 398 deletions(-) delete mode 100644 convolutional-networks-korean.md diff --git a/convolutional-networks-korean.md b/convolutional-networks-korean.md deleted file mode 100644 index 9e1ce5e4..00000000 --- a/convolutional-networks-korean.md +++ /dev/null @@ -1,390 +0,0 @@ ---- -layout: page -permalink: /convolutional-networks-kr/ ---- - -Table of Contents: - -- [Architecture Overview](#overview) -- [ConvNet Layers](#layers) - - [Convolutional Layer](#conv) - - [Pooling Layer](#pool) - - [Normalization Layer](#norm) - - [Fully-Connected Layer](#fc) - - [Converting Fully-Connected Layers to Convolutional Layers](#convert) -- [ConvNet Architectures](#architectures) - - [Layer Patterns](#layerpat) - - [Layer Sizing Patterns](#layersizepat) - - [Case Studies](#case) (LeNet / AlexNet / ZFNet / GoogLeNet / VGGNet) - - [Computational Considerations](#comp) -- [Additional References](#add) - -## 컨볼루션 신경망 (CNN/ConvNets) - -컨볼루션 신경망 (Convolutional Neural Network, 이하 CNN)은 앞 장에서 다룬 일반 신경망과 매우 유사하다. CNN은 학습 가능한 가중치 (weight)와 바이어스(bias)로 구성되어 있다. 각 뉴런은 입력을 받아 내적 연산( dot product )을 한 뒤 선택에 따라 비선형 (non-linear) 연산을 한다. 전체 네트워크는 일반 신경망과 마찬가지로 미분 가능한 하나의 스코어 함수 (score function)을 갖게 된다 (맨 앞쪽에서 로우 이미지 (raw image)를 읽고 맨 뒤쪽에서 각 클래스에 대한 점수를 구하게 됨). 또한 CNN은 마지막 레이어에 (SVM/Softmax와 같은) 손실 함수 (loss function)을 가지며, 우리가 일반 신경망을 학습시킬 때 사용하던 각종 기법들을 동일하게 적용할 수 있다. - -CNN과 일반 신경망의 차이점은 무엇일까? CNN 아키텍쳐는 입력 데이터가 이미지라는 가정 덕분에 이미지 데이터가 갖는 특성들을 인코딩 할 수 있다. 이러한 아키텍쳐는 포워드 함수 (forward function)을 더욱 효과적으로 구현할 수 있고 네트워크를 학습시키는데 필요한 모수 (parameter)의 수를 크게 줄일 수 있게 해준다. - - - -### 아키텍쳐 개요 - -앞 장에서 보았듯이 신경망은 입력받은 벡터를 일련의 히든 레이어 (hidden layer) 를 통해 변형 (transform) 시킨다. 각 히든 레이어는 뉴런들로 이뤄져 있으며, 각 뉴런은 앞쪽 레이어 (previous layer)의 모든 뉴런과 연결되어 있다 (fully connected). 
같은 레이어 내에 있는 뉴런들끼리는 연결이 존재하지 않고 서로 독립적이다. 마지막 Fully-connected 레이어는 출력 레이어라고 불리며, 분류 문제에서 클래스 점수 (class score)를 나타낸다.
-
-일반 신경망은 이미지를 다루기에 적절하지 않다. CIFAR-10 데이터의 경우 각 이미지가 32x32x3 (가로,세로 32, 3개 컬러 채널)로 이뤄져 있어서 첫 번째 히든 레이어 내의 하나의 뉴런의 경우 32x32x3=3072개의 가중치가 필요하지만, 더 큰 이미지를 사용할 경우에는 같은 구조를 이용하는 것이 불가능하다. 예를 들어 200x200x3의 크기를 가진 이미지는 같은 뉴런에 대해 200x200x3=120,000개의 가중치를 필요로 하기 때문이다. 더욱이, 이런 뉴런이 레이어 내에 여러 개 존재하므로 모수의 개수가 크게 증가하게 된다. 이와 같이 Fully-connectivity는 심한 낭비이며 많은 수의 모수는 곧 오버피팅(overfitting)으로 귀결된다.
-
-CNN은 입력이 이미지로 이뤄져 있다는 특징을 살려 좀 더 합리적인 방향으로 아키텍쳐를 구성할 수 있다. 특히 일반 신경망과 달리, CNN의 레이어들은 가로,세로,깊이의 3개 차원을 갖게 된다 (여기에서 말하는 깊이란 전체 신경망의 깊이가 아니라 액티베이션 볼륨 (activation volume) 에서의 3번째 차원을 이야기함). 예를 들어 CIFAR-10 이미지는 32x32x3 (가로,세로,깊이) 의 차원을 갖는 입력 액티베이션 볼륨 (activation volume)이라고 볼 수 있다. 조만간 보겠지만, 하나의 레이어에 위치한 뉴런들은 일반 신경망과는 달리 앞 레이어의 전체 뉴런이 아닌 일부에만 연결이 되어 있다. CNN 아키텍쳐는 전체 이미지를 클래스 점수들로 이뤄진 하나의 벡터로 만들어주기 때문에 마지막 출력 레이어는 1x1x10 (10은 CIFAR-10 데이터의 클래스 개수) 의 차원을 가지게 된다. 이에 대한 그림은 아래와 같다:
좌: 일반 3-레이어 신경망. 우: 그림과 같이 CNN은 뉴런들을 3차원으로 배치한다. CNN의 모든 레이어는 3차원 입력 볼륨을 3차원 출력 볼륨으로 변환 (transform) 시킨다. 이 예제에서 붉은 색으로 나타난 입력 레이어는 이미지를 입력으로 받으므로, 이 레이어의 가로/세로/채널은 각각 이미지의 가로/세로/3(Red,Green,Blue) 이다.
- -> CNN은 여러 레이어로 이루어져 있다. 각각의 레이어는 3차원의 볼륨을 입력으로 받고 미분 가능한 함수를 거쳐 3차원의 볼륨을 출력하는 간단한 기능을 한다. - - - -### CNN을 이루는 레이어들 - -위에서 다룬 것과 같이, CNN의 각 레이어는 미분 가능한 변환 함수를 통해 하나의 액티베이션 볼륨을 또다른 액티베이션 볼륨으로 변환 (transform) 시킨다. CNN 아키텍쳐에서는 크게 컨볼루셔널 레이어, 풀링 레이어, Fully-connected 레이어라는 3개 종류의 레이어가 사용된다. 전체 CNN 아키텍쳐는 이 3 종류의 레이어들을 쌓아 만들어진다. - -*예제: 아래에서 더 자세하게 배우겠지만, CIFAR-10 데이터를 다루기 위한 간단한 CNN은 [INPUT-CONV-RELU-POOL-FC]로 구축할 수 있다. - -- INPUT 입력 이미지가 가로32, 세로32, 그리고 RGB 채널을 가지는 경우 입력의 크기는 [32x32x3]. -- CONV 레이어는 입력 이미지의 일부 영역과 연결되어 있으며, 이 연결된 영역과 자신의 가중치의 내적 연산 (dot product) 을 계산하게 된다. 결과 볼륨은 [32x32x12]와 같은 크기를 갖게 된다. -- RELU 레이어는 max(0,x)와 같이 각 요소에 적용되는 액티베이션 함수 (activation function)이다. 이 레이어는 볼륨의 크기를 변화시키지 않는다 ([32x32x12]) -- POOL 레이어는 (가로,세로) 차원에 대해 다운샘플링 (downsampling)을 수행해 [16x16x12]와 같이 줄어든 볼륨을 출력한다. -- FC (fully-connected) 레이어는 클래스 점수들을 계산해 [1x1x10]의 크기를 갖는 볼륨을 출력한다. 10개 숫자들은 10개 카테고리에 대한 클래스 점수에 해당한다. 레이어의 이름에서 유추 가능하듯, 이 레이어는 이전 볼륨의 모든 요소와 연결되어 있다. - -이와 같이, CNN은 픽셀 값으로 이뤄진 원본 이미지를 각 레이어를 거치며 클래스 점수로 변환 (transform) 시킨다. 한 가지 기억할 것은, 어떤 레이어는 모수 (parameter)를 갖지만 어떤 레이어는 모수를 갖지 않는다는 것이다. 특히 CONV/FC 레이어들은 단순히 입력 볼륨만이 아니라 가중치(weight)와 바이어스(bias) 또한 포함하는 액티베이션(activation) 함수이다. 반면 RELU/POOL 레이어들은 고정된 함수이다. CONV/FC 레이어의 모수 (parameter)들은 각 이미지에 대한 클래스 점수가 해당 이미지의 레이블과 같아지도록 그라디언트 디센트 (gradient descent)로 학습된다. - -요약해보면: - -- CNN 아키텍쳐는 여러 레이어를 통해 입력 이미지 볼륨을 출력 볼륨 ( 클래스 점수 )으로 변환시켜 준다. -- CNN은 몇 가지 종류의 레이어로 구성되어 있다. CONV/FC/RELU/POOL 레이어가 현재 가장 많이 쓰인다. -- 각 레이어는 3차원의 입력 볼륨을 미분 가능한 함수를 통해 3차원 출력 볼륨으로 변환시킨다. -- 모수(parameter)가 있는 레이어도 있고 그렇지 않은 레이어도 있다 (FC/CONV는 모수를 갖고 있고, RELU/POOL 등은 모수가 없음). -- 초모수 (hyperparameter)가 있는 레이어도 있고 그렇지 않은 레이어도 있다 (CONV/FC/POOL 레이어는 초모수를 가지며 RELU는 가지지 않음). - -
- CNN 아키텍쳐의 액티베이션 (activation) 예제. 첫 볼륨은 로우 이미지(raw image)를 다루며, 마지막 볼륨은 클래스 점수들을 출력한다. 입/출력 사이의 액티베이션들은 그림의 각 열에 나타나 있다. 3차원 볼륨을 시각적으로 나타내기가 어렵기 때문에 각 행마다 볼륨들의 일부만 나타냈다. 마지막 레이어는 모든 클래스에 대한 점수를 나타내지만 여기에서는 상위 5개 클래스에 대한 점수와 레이블만 표시했다. 전체 웹 데모는 우리의 웹사이트 상단에 있다. 여기에서 사용된 아키텍쳐는 작은 VGG Net이다. -
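(역자 참고: 아래는 원문에 없는 간단한 스케치로, 위의 [INPUT-CONV-RELU-POOL-FC] 예제에서 각 레이어가 볼륨의 크기를 어떻게 바꾸는지를 numpy 배열의 shape만으로 따라가 본 것이다. 실제 가중치 연산은 생략하고 난수 텐서로 모양만 흉내낸 가정이며, 필터 개수 12 등의 숫자는 위 예제 설명을 따른 것이다.)

~~~python
import numpy as np

x = np.random.randn(32, 32, 3)       # INPUT: [32x32x3] CIFAR-10 이미지
conv = np.random.randn(32, 32, 12)   # CONV: 12개 필터를 적용했다고 가정한 결과 [32x32x12]
relu = np.maximum(0, conv)           # RELU: 요소별 max(0,x), 볼륨 크기는 그대로 [32x32x12]
pool = relu[::2, ::2, :]             # POOL: 가로/세로 2배 다운샘플링을 흉내낸 것 [16x16x12]
scores = np.random.randn(1, 1, 10)   # FC: 10개 클래스 점수 [1x1x10]

for name, v in [('INPUT', x), ('CONV', conv), ('RELU', relu),
                ('POOL', pool), ('FC', scores)]:
    print(name, v.shape)
~~~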
- -이제 각각의 레이어에 대해 초모수(hyperparameter)나 연결성 (connectivity) 등의 세부 사항들을 알아보도록 하자. - - - -#### 컨볼루셔널 레이어 (이하 CONV) - -CONV 레이어는 CNN을 이루는 핵심 요소이다. CONV 레이어의 출력은 3차원으로 정렬된 뉴런들로 해석될 수 있다. 이제부터는 뉴런들의 연결성 (connectivity), 그들의 공간상의 배치, 그리고 모수 공유(parameter sharing) 에 대해 알아보자. - -**개요 및 직관적인 설명.** CONV 레이어의 모수(parameter)들은 일련의 학습가능한 필터들로 이뤄져 있다. 각 필터는 가로/세로 차원으로는 작지만 깊이 (depth) 차원으로는 전체 깊이를 아우른다. 포워드 패스 (forward pass) 때에는 각 필터를 입력 볼륨의 가로/세로 차원으로 슬라이딩 시키며 (정확히는 convolve 시키며) 2차원의 액티베이션 맵 (activation map)을 생성한다. 필터를 입력 위로 슬라이딩 시킬 때, 필터와 입력의 요소들 사이의 내적 연산 (dot product)이 이뤄진다. 직관적으로 설명하면, 이 신경망은 입력의 특정 위치의 특정 패턴에 대해 반응하는 (activate) 필터를 학습한다. 이런 액티베이션 맵 (activation map)을 깊이 (depth) 차원을 따라 쌓은 것이 곧 출력 볼륨이 된다. 그러므로 출력 볼륨의 각 요소들은 입력의 작은 영역만을 취급하고, 같은 액티베이션 맵 내의 뉴런들은 같은 모수들을 공유한다 (같은 필터를 적용한 결과이므로). 이제 이 과정에 대해 좀 더 깊이 파헤쳐보자. - -**로컬 연결성 (Local connectivity).** 이미지와 같은 고차원 입력을 다룰 때에는, 현재 레이어의 한 뉴런을 이전 볼륨의 모든 뉴런들과 연결하는 것이 비 실용적이다. 대신에 우리는 레이어의 각 뉴런을 입력 볼륨의 로컬한 영역(local region)에만 연결할 것이다. 이 영역은 리셉티브 필드 (receptive field)라고 불리는 초모수 (hyperparameter) 이다. 깊이 차원 측면에서는 항상 입력 볼륨의 총 깊이를 다룬다 (가로/세로는 작은 영역을 보지만 깊이는 전체를 본다는 뜻). 공간적 차원 (가로/세로)와 깊이 차원을 다루는 방식이 다르다는 걸 기억하자. - -*예제 1*. 예를 들어 입력 볼륨의 크기가 (CIFAR-10의 RGB 이미지와 같이) [32x32x3]이라고 하자. 만약 리셉티브 필드의 크기가 5x5라면, CONV 레이어의 각 뉴런은 입력 볼륨의 [5x5x3] 크기의 영역에 가중치 (weight)를 가하게 된다 (총 5x5x3=75 개 가중치). 입력 볼륨 (RGB 이미지)의 깊이가 3이므로 마지막 숫자가 3이 된다는 것을 기억하자. - -*예제 2*. 입력 볼륨의 크기가 [16x16x20]이라고 하자. 3x3 크기의 리셉티브 필드를 사용하면 CONV 레이어의 각 뉴런은 입력 볼륨과 3x3x20=180 개의 연결을 갖게 된다. 이번에도 입력 볼륨의 깊이가 20이므로 마지막 숫자가 20이 된다는 것을 기억하자. - -
- 좌: 입력 볼륨(붉은색, 32x32x3 크기의 CIFAR-10 이미지)과 첫번째 컨볼루션 레이어 볼륨. 컨볼루션 레이어의 각 뉴런은 입력 볼륨의 일부 영역에만 연결된다 (가로/세로 공간 차원으로는 일부 연결, 깊이(컬러 채널) 차원은 모두 연결). 컨볼루션 레이어의 깊이 차원의 여러 뉴런 (그림에서 5개)들이 모두 입력의 같은 영역을 처리한다는 것을 기억하자 (깊이 차원과 관련해서는 아래에서 더 자세히 알아볼 것임). 우: 입력의 일부 영역에만 연결된다는 점을 제외하고는, 이전 신경망 챕터에서 다뤄지던 뉴런들과 똑같이 내적 연산과 비선형 함수로 이뤄진다. -
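(역자 참고: 아래는 원문에 없는 스케치로, 컨볼루션 레이어의 뉴런 하나가 "입력의 로컬 영역과의 내적 + bias + 비선형 함수"로 동작한다는 위 그림의 설명을, 예제 1의 수치(5x5x3 리셉티브 필드, 가중치 75개)로 확인해 본 것이다. 가중치와 bias 값은 임의로 둔 가정이다.)

~~~python
import numpy as np

X = np.random.randn(32, 32, 3)   # CIFAR-10 크기의 입력 볼륨
w = np.random.randn(5, 5, 3)     # 한 뉴런의 가중치: 5*5*3 = 75개
b = 0.1                          # bias (임의의 값)

patch = X[0:5, 0:5, :]           # 이 뉴런이 연결된 [5x5x3] 로컬 영역
a = np.sum(patch * w) + b        # 내적 연산 (elementwise 곱의 합) + bias
out = max(0.0, a)                # ReLU 비선형 함수
print(w.size, out)               # 75, 그리고 이 뉴런의 액티베이션
~~~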
-
-**공간적 배치**. 지금까지는 컨볼루션 레이어의 한 뉴런과 입력 볼륨의 연결에 대해 알아보았다. 그러나 아직 출력 볼륨에 얼마나 많은 뉴런들이 있는지, 그리고 그 뉴런들이 어떤 식으로 배치되는지는 다루지 않았다. 3개의 hyperparameter들이 출력 볼륨의 크기를 결정하게 된다. 그 3개 요소는 바로 **깊이, stride, 그리고 제로 패딩 (zero-padding)** 이다. 이들에 대해 알아보자:
-
-1. 먼저, 출력 볼륨의 **깊이** 는 우리가 결정할 수 있는 요소이다. 컨볼루션 레이어의 뉴런들 중 입력 볼륨 내 동일한 영역과 연결된 뉴런의 개수를 의미한다. 마치 일반 신경망에서 히든 레이어 내의 모든 뉴런들이 같은 입력값과 연결된 것과 비슷하다. 앞으로 살펴보겠지만, 이 뉴런들은 입력에 대해 서로 다른 특징 (feature)에 활성화된다 (activate). 예를 들어, 이미지를 입력으로 받는 첫 번째 컨볼루션 레이어의 경우, 깊이 축에 따른 각 뉴런들은 이미지의 서로 다른 엣지, 색깔, 블롭(blob) 등에 활성화된다. 앞으로는 인풋의 서로 같은 영역을 바라보는 뉴런들을 **깊이 컬럼 (depth column)**이라고 부르겠다.
-2. 두 번째로 어떤 간격 (가로/세로의 공간적 간격) 으로 깊이 컬럼을 할당할 지를 의미하는 **stride**를 결정해야 한다. 만약 stride가 1이라면, 깊이 컬럼을 1칸마다 할당하게 된다 (한 칸 간격으로 깊이 컬럼 할당). 이럴 경우 각 깊이 컬럼들은 receptive field 상 넓은 영역이 겹치게 되고, 출력 볼륨의 크기도 매우 커지게 된다. 반대로, 큰 stride를 사용한다면 receptive field끼리 좁은 영역만 겹치게 되고 출력 볼륨도 작아지게 된다 (깊이는 작아지지 않고 가로/세로만 작아지게 됨).
-3. 조만간 살펴보겠지만, 입력 볼륨의 가장자리를 0으로 패딩하는 것이 좋을 때가 있다. 이 **zero-padding**은 hyperparameter이다. zero-padding을 사용할 때의 장점은, 출력 볼륨의 공간적 크기(가로/세로)를 조절할 수 있다는 것이다. 특히 입력 볼륨의 공간적 크기를 유지하고 싶은 경우 (입력의 가로/세로 = 출력의 가로/세로) 사용하게 된다.
-
-출력 볼륨의 공간적 크기 (가로/세로)는 입력 볼륨 크기 ($$W$$), CONV 레이어의 리셉티브 필드 크기($$F$$)와 stride ($$S$$), 그리고 제로 패딩 (zero-padding) 사이즈 ($$P$$) 의 함수로 계산할 수 있다: $$(W - F + 2P)/S + 1$$. 이를 통해 알맞은 크기를 계산하면 된다. 만약 이 값이 정수가 아니라면 stride가 잘못 정해진 것이다. 이 경우 뉴런들이 대칭을 이루며 깔끔하게 배치되는 것이 불가능하다. 다음 예제를 보면 이 수식을 좀 더 직관적으로 이해할 수 있을 것이다:
- 공간적 배치에 관한 그림. 이 예제에서는 가로/세로 공간적 차원 중 하나만 고려한다 (x축). 리셉티브 필드 F=3, 입력 사이즈 W=5, 제로 패딩 P=1. 좌: 뉴런들이 stride S=1을 갖고 배치된 경우, 출력 사이즈는 (5-3+2)/1 +1 = 5이다. 우: stride S=2인 경우 (5-3+2)/2 + 1 = 3의 출력 사이즈를 가진다. Stride S=3은 사용할 수 없다. (5-3+2) = 4가 3으로 나눠지지 않기 때문에 출력 볼륨의 뉴런들이 깔끔히 배치되지 않는다.
- 이 예에서 뉴런들의 가중치는 [1,0,-1] (가장 오른쪽) 이며 bias는 0이다. 이 가중치는 노란 뉴런들 모두에게 공유된다 (아래에서 parameter sharing에 대해 살펴보라).
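(역자 참고: 아래는 원문에 없는 스케치로, 출력 크기 공식 $(W - F + 2P)/S + 1$ 과 위 그림의 세 가지 stride 예시를 그대로 코드로 옮겨 본 것이다.)

~~~python
def conv_output_size(W, F, S, P):
    """(W - F + 2P)/S + 1 이 정수가 아니면 잘못된 hyperparameter 세팅이다."""
    size = (W - F + 2 * P) / S + 1
    if size != int(size):
        raise ValueError('stride가 맞지 않음: 뉴런들을 깔끔히 배치할 수 없다')
    return int(size)

print(conv_output_size(W=5, F=3, S=1, P=1))   # 5 (위 그림의 왼쪽 예)
print(conv_output_size(W=5, F=3, S=2, P=1))   # 3 (위 그림의 오른쪽 예)
# conv_output_size(W=5, F=3, S=3, P=1)        # ValueError: (5-3+2)=4는 3으로 나눠지지 않음
~~~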
- -*제로 패딩 사용*. 위 예제의 왼쪽 그림에서, 입력과 출력의 차원이 모두 5라는 것을 기억하자. 리셉티브 필드가 3이고 제로 패딩이 1이기 때문에 이런 결과가 나오는 것이다. 만약 제로 패딩이 사용되지 않았다면 출력 볼륨의 크기는 3이 될 것이다. 일반적으로, 제로 패딩을 $$P = (F - 1)/2$$ , stride $$S = 1$$로 세팅하면 입/출력의 크기가 같아지게 된다. 이런 방식으로 사용하는 것이 일반적이며, 앞으로 컨볼루션 신경망에 대해 다루면서 그 이유에 대해 더 알아볼 것이다. - -*Stride에 대한 constraints*. 공간적 배치와 관련된 hyperparameter들은 상호 constraint들이 존재한다는 것을 기억하자. 예를 들어, 입력 사이즈 $$W=10$$이고 제로 패딩이 사용되지 않았고 $$P=0$$, 필터 사이즈가 $$F=3$$이라면, stride $$S=2$$를 사용하는 것이 불가능하다. $$(W - F + 2P)/S + 1 = (10 - 3 + 0) / 2 + 1 = 4.5$$이 정수가 아니기 때문이다. 그러므로 hyperparameter를 이런 식으로 설정하면 컨볼루션 신경망 관련 라이브러리들은 exception을 낸다. 컨볼루션 신경망의 구조 관련 섹션에서 확인하겠지만, 전체 신경망이 잘 돌아가도록 이런 숫자들을 설정하는 과정은 매우 골치 아프다. 제로 패딩이나 다른 신경망 디자인 비법들을 사용하면 훨씬 수월하게 진행할 수 있다. - -*실제 예제*. 이미지넷 대회에서 우승한 [Krizhevsky et al.](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks) 의 모델의 경우 [227x227x3] 크기의 이미지를 입력으로 받는다. 첫 번째 컨볼루션 레이어에서는 리셉티브 필드 $$F=11$$, stride $$S=4$$를 사용했고 제로 패딩은 사용하지 않았다 $$P=0$$. (227 - 11)/4 +1=55 이고 컨볼루션 레이어의 깊이는 $$K=96$$이므로 이 컨볼루션 레이어의 크기는 [55x55x96]이 된다. 각각의 55\*55\*96개 뉴런들은 입력 볼륨의 [11x11x3]개 뉴런들과 연결되어 있다. 그리고 각 깊이의 모든 96개 뉴런들은 입력 볼륨의 같은 [11x11x3] 영역에 서로 다른 가중치를 가지고 연결된다. - -**파라미터 공유**. 파라미터 공유 기법은 컨볼루션 레이어의 파라미터 개수를 조절하기 위해 사용된다. 위의 실제 예제에서 보았듯, 첫 번째 컨볼루션 레이어에는 55\*55\*96 = 290,400 개의 뉴런이 있고 각각의 뉴런은 11\*11\*3 = 363개의 가중치와 1개의 바이어스를 가진다. 첫 번째 컨볼루션 레이어만 따져도 총 파라미터 개수는 290400*364=105,705,600개가 된다. 분명히 이 숫자는 너무 크다. - -사실 적절한 가정을 통해 파라미터 개수를 크게 줄이는 것이 가능하다: (x,y)에서 어떤 patch feature가 유용하게 사용되었다면, 이 feature는 다른 위치 (x2,y2)에서도 유용하게 사용될 수 있다. 3차원 볼륨의 한 슬라이스 (깊이 차원으로 자른 2차원 슬라이스) 를 **depth slice**라고 하자 ([55x55x96] 사이즈의 볼륨은 각각 [55x55]의 크기를 가진 96개의 depth slice임). 앞으로는 각 depth slice 내의 뉴런들이 같은 가중치와 바이어스를 가지도록 제한할 것이다. 이런 파라미터 공유 기법을 사용하면, 예제의 첫 번째 컨볼루션 레이어는 (depth slice 당) 96개의 고유한 가중치를 가져서 총 96\*11\*11\*3 = 34,848개의 고유한 가중치, 또는 바이어스를 합쳐서 34,944개의 파라미터를 갖게 된다. 또는 각 depth slice에 존재하는 55*55개의 뉴런들은 모두 같은 파라미터를 사용하게 된다. 실제로는 backpropagation 과정에서 각 depth slice 내의 모든 뉴런들이 가중치에 대한 gradient를 계산하겠지만, 가중치 업데이트 할 때에는 이 gradient들을 합해 사용한다. - -한 depth slice내의 모든 뉴런들이 같은 가중치 벡터를 갖기 때문에 컨볼루션 레이어의 forward pass는 입력 볼륨과 가중치 간의 **컨볼루션**으로 계산될 수 있다 (컨볼루션 레이어라는 이름이 붙은 이유). 그러므로 컨볼루션 레이어의 가중치는 **필터(filter)** 또는 **커널(kernel)**이라고 부른다. 컨볼루션의 결과물은 **액티베이션 맵(activation map, [55x55] 사이즈)** 이 되며 각 깊이에 해당하는 필터의 액티베이션 맵들을 쌓으면 최종 출력 볼륨 ([55x55x96] 사이즈) 가 된다. - -
- Krizhevsky et al. 에서 학습된 필터의 예. 96개의 필터 각각은 [11x11x3] 사이즈이며, 하나의 depth slice 내 55*55개 뉴런들이 이 필터들을 공유한다. 만약 이미지의 특정 위치에서 가로 엣지 (edge)를 검출하는 것이 중요했다면, 이미지의 다른 위치에서도 같은 특성이 중요할 수 있다 (이미지의 translationally-invariant한 특성 때문). 그러므로 55*55개 뉴런 각각에 대해 가로 엣지 검출 필터를 재학습 할 필요가 없다. -
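(역자 참고: 아래는 원문에 없는 스케치로, 위 본문의 파라미터 수 계산, 즉 공유가 없을 때 290,400개 뉴런 x 364개 파라미터 vs. 공유할 때 96개 필터 x (11\*11\*3 + 1)개를 그대로 확인해 본 것이다.)

~~~python
neurons = 55 * 55 * 96             # 첫 번째 CONV 레이어의 뉴런 수 = 290,400
per_neuron = 11 * 11 * 3 + 1       # 뉴런당 가중치 363개 + bias 1개 = 364

no_sharing = neurons * per_neuron  # 공유가 없다면: 105,705,600개
with_sharing = 96 * per_neuron     # depth slice당 한 벌만 학습: 34,944개
print(no_sharing, with_sharing)    # 105705600 34944
~~~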
- -가끔은 파라미터 sharing에 대한 가정이 부적절할 수도 있다. 특히 입력 이미지가 중심을 기준으로 찍힌 경우 (예를 들면 이미지 중앙에 얼굴이 있는 이미지), 이미지의 각 영역에 대해 완전히 다른 feature들이 학습되어야 할 수 있다. 눈과 관련된 feature나 머리카락과 관련된 feature 등은 서로 다른 영역에서 학습될 것이다. 이런 경우에는 파라미터 sharing 기법을 접어두고 대신 **Locally-Connected Layer**라는 레이어를 사용하는 것이 좋다. - -**Numpy 예제.** 위에서 다룬 것들을 더 확실히 알아보기 위해 코드를 작성해보자. 입력 볼륨을 numpy 배열 `X`라고 하면: -- A *depth column* at position `(x,y)` would be the activations `X[x,y,:]`. -- `(x,y)`위치에서의 *depth column*은 액티베이션 `X[x,y,:]`이 된다. -- A *depth slice*, or equivalently an *activation map* at depth `d` would be the activations `X[:,:,d]`. -- depth `d`에서의 *depth slice*, 또는 *액티베이션 맵 (activation map)*은 `X[:,:,d]`가 된다. - -*컨볼루션 레이어 예제*. 입력 볼륨 `X`의 모양이 `X.shape: (11,11,4)`이고 제로 패딩은 사용하지 않으며($$P = 0$$) 필터 크기는 $$F = 5$$, stride $$S = 2$$라고 하자. 출력 볼륨의 spatial 크기 (가로/세로)는 (11-5)/2 + 1 = 4가 된다. 출력 볼륨의 액티베이션 맵 (`V`라고 하자) 는 아래와 같은 것이다 (아래에는 일부 요소만 나타냄). - -- `V[0,0,0] = np.sum(X[:5,:5,:] * W0) + b0` -- `V[1,0,0] = np.sum(X[2:7,:5,:] * W0) + b0` -- `V[2,0,0] = np.sum(X[4:9,:5,:] * W0) + b0` -- `V[3,0,0] = np.sum(X[6:11,:5,:] * W0) + b0` - -Numpy에서 `*`연산은 두 배열 간의 elementwise 곱셈이라는 것을 기억하자. 또한 `W0`는 가중치 벡터이고 `b0`은 바이어스라는 것도 기억하자. 여기에서 `W0`의 모양은 `W0.shape: (5,5,4)`라고 가정하자 (필터 사이즈는 5, depth는 4). 각 위치에서 일반 신경망에서와 같이 내적 연산을 수행하게 된다. 또한 파라미터 sharing 기법으로 같은 가중치, 바이어스가 사용되고 가로 차원에 대해 2 (stride)칸씩 옮겨가며 연산이 이뤄진다는 것을 볼 수 있다. 출력 볼륨의 두 번째 액티베이션 맵을 구성하는 방법은: - -- `V[0,0,1] = np.sum(X[:5,:5,:] * W1) + b1` -- `V[1,0,1] = np.sum(X[2:7,:5,:] * W1) + b1` -- `V[2,0,1] = np.sum(X[4:9,:5,:] * W1) + b1` -- `V[3,0,1] = np.sum(X[6:11,:5,:] * W1) + b1` -- `V[0,1,1] = np.sum(X[:5,2:7,:] * W1) + b1` (example of going along y) -- `V[2,3,1] = np.sum(X[4:9,6:11,:] * W1) + b1` (or along both) - -위 예제는 `V`의 두 번째 depth 차원 (인덱스 1)을 인덱싱하고 있다. 두 번째 액티베이션 맵을 계산하므로, 여기에서 사용된 가중치는 이전 예제와 달리 `W1`이다. 보통 액티베이션 맵이 구해진 뒤 ReLU와 같은 elementwise 연산이 가해지는 경우가 많은데, 위 예제에서는 다루지 않았다. - -**요약**. To summarize, the Conv Layer: - -- $$W_1 \times H_1 \times D_1$$ 크기의 볼륨을 입력받는다. -- 4개의 hyperparameter가 필요하다: - - 필터 개수 $$K$$, - - 필터의 가로/세로 Spatial 크기 $$F$$, - - Stride $$S$$, - - 제로 패딩 $$P$$. -- $$W_2 \times H_2 \times D_2$$ 크기의 출력 볼륨을 생성한다: - - $$W_2 = (W_1 - F + 2P)/S + 1$$ - - $$H_2 = (H_1 - F + 2P)/S + 1$$ (i.e. 가로/세로는 같은 방식으로 계산됨) - - $$D_2 = K$$ -- 파라미터 sharing로 인해 필터 당 $$F \cdot F \cdot D_1$$개의 가중치를 가져서 총 $$(F \cdot F \cdot D_1) \cdot K$$개의 가중치와 $$K$$개의 바이어스를 갖게 된다. -- 출력 볼륨에서 $$d$$번째 depth slice ($$W_2 \times H_2$$ 크기)는 입력 볼륨에 $$d$$번째 필터를 stride $$S$$만큼 옮겨가며 컨볼루션 한 뒤 $$d$$번째 바이어스를 더한 결과이다. - -흔한 Hyperparameter기본 세팅은 $$F = 3, S = 1, P = 1$$이다. 뒤에서 다룰 [ConvNet architectures](#architectures)에서 hyperparameter 세팅과 관련된 법칙이나 방식 등을 확인할 수 있다. - -**컨볼루션 데모**. 아래는 컨볼루션 레이어 데모이다. 3차원 볼륨은 시각화하기 힘드므로 각 행마다 depth slice를 하나씩 배치했다. 각 볼륨은 입력 볼륨(파란색), 가중치 볼륨(빨간색), 출력 볼륨(녹색)으로 이뤄진다. 입력 볼륨의 크기는 $$W_1 = 5, H_1 = 5, D_1 = 3$$이고 컨볼루션 레이어의 파라미터들은 $$K = 2, F = 3, S = 2, P = 1$$이다. 즉, 2개의 $$3 \times 3$$크기의 필터가 각각 stride 2마다 적용된다. 그러므로 출력 볼륨의 spatial 크기 (가로/세로)는 (5 - 3 + 2)/2 + 1 = 3이다. 제로 패딩 $$P = 1$$ 이 적용되어 입력 볼륨의 가장자리가 모두 0으로 되어있다는 것을 확인할 수 있다. 아래의 영상에서 하이라이트 표시된 입력(파란색)과 필터(빨간색)이 elementwise로 곱해진 뒤 하나로 더해지고 bias가 더해지는걸 볼 수 있다. - -
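(역자 참고: 원문의 인터랙티브 데모는 여기에 옮겨올 수 없으므로, 대신 위에서 요약한 CONV 레이어의 forward pass를 제로 패딩, stride, 필터별 파라미터 공유까지 포함해 가장 단순한 반복문으로 적어 본 스케치를 덧붙인다. 원문에는 없는 코드이며, 데모와 같은 세팅($$W_1 = H_1 = 5, D_1 = 3, K = 2, F = 3, S = 2, P = 1$$)을 가정한다.)

~~~python
import numpy as np

def conv_forward_naive(X, W, b, S, P):
    """X: [H,W,D] 입력 볼륨, W: [K,F,F,D] 필터들, b: [K] bias."""
    K, F, _, D = W.shape
    Xp = np.pad(X, ((P, P), (P, P), (0, 0)), mode='constant')   # 제로 패딩
    H_out = (X.shape[0] - F + 2 * P) // S + 1
    W_out = (X.shape[1] - F + 2 * P) // S + 1
    V = np.zeros((H_out, W_out, K))
    for d in range(K):                 # 각 필터 = 출력 볼륨의 한 depth slice
        for i in range(H_out):
            for j in range(W_out):
                patch = Xp[i*S:i*S+F, j*S:j*S+F, :]
                V[i, j, d] = np.sum(patch * W[d]) + b[d]   # 내적 + bias
    return V

X = np.random.randn(5, 5, 3)
W = np.random.randn(2, 3, 3, 3)   # K=2개의 3x3x3 필터 (slice 내 모든 뉴런이 공유)
b = np.random.randn(2)
print(conv_forward_naive(X, W, b, S=2, P=1).shape)   # (3, 3, 2)
~~~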
- -**매트릭스 곱으로 구현**. 컨볼루션 연산은 필터와 이미지의 로컬한 영역간의 내적 연산을 한 것과 같다. 컨볼루션 레이어의 일반적인 구현 패턴은 이 점을 이용해 컨볼루션 레이어의 forward pass를 다음과 같이 하나의 큰 매트릭스 곱으로 계산된다: - -1. 이미지의 각 로컬 영역을 열 벡터로 stretch 한다 (이런 연산을 보통 **im2col** 이라고 부름). 예를 들어, 만약 [227x227x3] 사이즈의 입력이 11x11x3 사이즈와 strie 4의 필터와 컨볼루션 한다면, 이미지에서 [11x11x3] 크기의 픽셀 블록을 가져와 11\*11\*3=363 크기의 열 벡터로 바꾸게 된다. 이 과정을 stride 4마다 하므로 가로, 세로에 대해 각각 (227-11)/4+1=55, 총 55\*55=3025 개 영역에 대해 반복하게 되고, 출력물인 `X_col`은 [363x3025]의 사이즈를 갖게 된다. 각각의 열 벡터는 리셉티브 필드를 1차원으로 stretch 한 것이고, 이 리셉티브 필드는 주위 리셉티브 필드들과 겹치므로 입력 볼륨의 여러 값들이 여러 출력 열벡터에 중복되어 나타날 수 있다. -2. 컨볼루션 레이어의 가중치는 비슷한 방식으로 행 벡터 형태로 stretch된다. 예를 들어 [11x11x3]사이즈의 총 96개 필터가 있다면, [96x363] 사이즈의 W_row가 만들어진다. -3. 이제 컨볼루션 연산은 하나의 큰 매트릭스 연산 `np.dot(W_row, X_col)`를 계산하는 것과 같다. 이 연산은 모든 필터와 모든 리셉티브 필터 영역들 사이의 내적 연산을 하는 것과 같다. 우리의 예에서는 각 영역에 대한 각각의 필터를 각각의 영역에 적용한 [96x3025] 사이즈의 출력물이 얻어진다. -4. 결과물은 [55x55x96] 차원으로 reshape 한다. - -이 방식은 입력 볼륨의 여러 값들이 `X_col`에 여러 번 복사되기 때문에 메모리가 많이 사용된다는 단점이 있다. 그러나 매트릭스 연산과 관련된 많은 효율적 구현방식들을 사용할 수 있다는 장점도 있다 ([BLAS](http://www.netlib.org/blas/) API 가 하나의 예임). 뿐만 아니라 같은 *im2col* 아이디어는 풀링 연산에서 재활용 할 수도 있다 (뒤에서 다루게 된다). - -**Backpropagation.** 컨볼루션 연산의 backward pass 역시 컨볼루션 연산이다 (가로/세로가 뒤집어진 필터를 사용한다는 차이점이 있음). 간단한 1차원 예제를 가지고 쉽게 확인해볼 수 있다. - - -#### 풀링 레이어 (Pooling Layer) - -CNN 구조 내에 컨볼루션 레이어들 중간중간에 주기적으로 풀링 레이어를 넣는 것이 일반적이다. 풀링 레이어가 하는 일은 네트워크의 파라미터의 개수나 연산량을 줄이기 위해 representation의 spatial한 사이즈를 줄이는 것이다. 이는 오버피팅을 조절하는 효과도 가지고 있다. 풀링 레이어는 MAX 연산을 각 depth slice에 대해 독립적으로 적용하여 spatial한 크기를 줄인다. 사이즈 2x2와 stride 2가 가장 많이 사용되는 풀링 레이어이다. 각 depth slice를 가로/세로축을 따라 1/2로 downsampling해 75%의 액티베이션은 버리게 된다. 이 경우 MAX 연산은 4개 숫자 중 최대값을 선택하게 된다 (같은 depth slice 내의 2x2 영역). Depth 차원은 변하지 않는다. 풀링 레이어의 특징들은 일반적으로 아래와 같다: - -- $$W_1 \times H_1 \times D_1$$ 사이즈의 입력을 받는다 -- 3가지 hyperparameter를 필요로 한다. - - Spatial extent $$F$$ - - Stride $$S$$ -- $$W_2 \times H_2 \times D_2$$ 사이즈의 볼륨을 만든다 - - $$W_2 = (W_1 - F)/S + 1$$ - - $$H_2 = (H_1 - F)/S + 1$$ - - $$D_2 = D_1$$ -- 입력에 대해 항상 같은 연산을 하므로 파라미터는 따로 존재하지 않는다 -- 풀링 레이어에는 보통 제로 패딩을 하지 않는다 - -일반적으로 실전에서는 두 종류의 max 풀링 레이어만 널리 쓰인다. 하나는 overlapping 풀링이라고도 불리는 $$F = 3, S = 2$$ 이고 하나는 더 자주 쓰이는 $$F = 2, S = 2$$ 이다. 큰 리셉티브 필드에 대해서 풀링을 하면 보통 너무 많은 정보를 버리게 된다. - -**일반적인 풀링**. Max 풀링 뿐 아니라 *average 풀링*, *L2-norm 풀링* 등 다른 연산으로 풀링할 수도 있다. Average 풀링은 과거에 많이 쓰였으나 최근에는 Max 풀링이 더 좋은 성능을 보이며 점차 쓰이지 않고 있다. - -
- 풀링 레이어는 입력 볼륨의 각 depth slice를 spatial하게 downsampling한다. 좌: 이 예제에서는 입력 볼륨이 [224x224x64]이며 필터 크기 2, stride 2로 풀링해 [112x112x64] 크기의 출력 볼륨을 만든다. 볼륨의 depth는 그대로 유지된다는 것을 기억하자. 우: 가장 널리 쓰이는 max 풀링. 2x2의 4개 숫자에 대해 max를 취하게 된다.
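(역자 참고: 아래는 원문에 없는 스케치로, 가장 널리 쓰이는 2x2, stride 2 max 풀링을 numpy로 적어 본 것이다. 바로 아래 본문에서 이야기하는 "max 위치를 저장해 두었다가 backprop에 사용한다"는 점도 주석으로 표시해 두었다.)

~~~python
import numpy as np

def max_pool_forward(X, F=2, S=2):
    """X: [H,W,D] 볼륨을 각 depth slice별로 독립적으로 다운샘플링한다."""
    H, W, D = X.shape
    H_out, W_out = (H - F) // S + 1, (W - F) // S + 1
    out = np.zeros((H_out, W_out, D))
    for i in range(H_out):
        for j in range(W_out):
            window = X[i*S:i*S+F, j*S:j*S+F, :]      # [F,F,D] 영역
            out[i, j, :] = window.max(axis=(0, 1))   # depth별로 max 선택
            # 실제 구현에서는 여기서 max의 위치(argmax)도 함께 저장해 두었다가
            # backward pass 때 그 위치로만 그라디언트를 흘려보낸다
    return out

X = np.random.randn(224, 224, 64)
print(max_pool_forward(X).shape)   # (112, 112, 64) - depth는 유지된다
~~~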
- -**Backpropagation**. Backpropagation 챕터에서 max(x,y)의 backward pass는 그냥 forward pass에서 가장 큰 값을 가졌던 입력의 gradient를 보내는 것과 같다고 배운 것을 기억하자. 그러므로 forward pass 과정에서 보통 max 액티베이션의 위치를 저장해두었다가 backpropagation 때 사용한다. - -**최근의 발전된 내용들**. - -- [Fractional Max-Pooling](http://arxiv.org/abs/1412.6071) 2x2보다 더 작은 필터들로 풀링하는 방식. 1x1, 1x2, 2x1, 2x2 크기의 필터들을 임의로 조합해 풀링한다. 매 forward pass마다 grid들이 랜덤하게 생성되고, 테스트 때에는 여러 grid들의 예측 점수들의 평균치를 사용하게 된다. -- [Striving for Simplicity: The All Convolutional Net](http://arxiv.org/abs/1412.6806) 라는 논문은 컨볼루션 레이어만 반복하며 풀링 레이어를 사용하지 않는 방식을 제안한다. Representation의 크기를 줄이기 위해 가끔씩 큰 stride를 가진 컨볼루션 레이어를 사용한다. - -풀링 레이어가 보통 representation의 크기를 심하게 줄이기 때문에 (이런 효과는 작은 데이터셋에서만 오버피팅 방지 효과 등으로 인해 도움이 됨), 최근 추세는 점점 풀링 레이어를 사용하지 않는 쪽으로 발전하고 있다. - - -#### Normalization 레이어 - -실제 두뇌의 억제 메커니즘 구현 등을 위해 많은 종류의 normalization 레이어들이 제안되었다. 그러나 이런 레이어들이 실제로 주는 효과가 별로 없다는 것이 알려지면서 최근에는 거의 사용되지 않고 있다. Normalization에 대해 알고 싶다면 Alex Krizhevsky의 글을 읽어보기 바란다 [cuda-convnet library API](http://code.google.com/p/cuda-convnet/wiki/LayerParams#Local_response_normalization_layer_(same_map)). - - -#### Fully-connected 레이어 - -Fully connected 레이어 내의 뉴런들은 일반 신경망 챕터에서 보았듯이이전 레이어의 모든 액티베이션들과 연결되어 있다. 그러므로 Fully connected레이어의 액티베이션은 매트릭스 곱을 한 뒤 바이어스를 더해 구할 수 있다. 더 많은 정보를 위해 강의 노트의 "신경망" 섹션을 보기 바란다. - - -#### FC 레이어를 CONV 레이어로 변환하기 - -FC 레이어와 CONV 레이어의 차이점은, CONV 레이어는 입력의 일부 영역에만 연결되어 있고, CONV 볼륨의 많은 뉴런들이 파라미터를 공유한다는 것 뿐이라는 것을 알아 둘 필요가 있다. 두 레이어 모두 내적 연산을 수행하므로 실제 함수 형태는 동일하다. 그러므로 FC 레이어를 CONV 레이어로 변환하는 것이 가능하다: - -- 모든 CONV 레이어는 동일한 forward 함수를 수행하는 FC 레이어 짝이 있다. 이 경우의 가중치 매트릭스는 몇몇 블록을 제외하고 모두 0으로 이뤄지며 (local connectivity: 입력의 일부 영역에만 연결된 특성), 이 블록들 중 여러개는 같은 값을 지니게 된다 (파라미터 공유). - -- 반대로, 모든 FC 레이어는 CONV 레이어로 변환될 수 있다. 예를 들어, $$7 \times 7 \times 512$$ 크기의 입력을 받고 $$K= 4906$$ 인 FC 레이어는 $$F = 7, P = 0, S = 1, K = 4096$$인 CONV 레이어로 표현 가능하다. 바꿔 말하면, 필터의 크기를 입력 볼륨의 크기와 동일하게 만들고 $$1 \times 1 \times 4906$$ 크기의 아웃풋을 출력할 수 있다. 각 depth에 대해 하나의 값만 구해지므로 (필터의 가로/세로가 입력 볼륨의 가로/세로와 같으므로) FC 레이어와 같은 결과를 얻게 된다. - -**FC->CONV 변환**. 이 두 변환 중, FC 레이어를 CONV 레이어로의 변환은 매우 실전에서 매우 유용하다. 224x224x3의 이미지를 입력으로 받고 일련의 CONV레이어와 POOL 레이어를 이용해 7x7x512의 액티베이션을 만드는 컨볼루션넷 아키텍쳐를 생각해 보자 (뒤에서 살펴 볼 *AlexNet* 아키텍쳐에서는 입력의 spatial(가로/세로) 크기를 반으로 줄이는 풀링 레이어 5개를 사용해 7x7x512의 액티베이션을 만든다. 224/2/2/2/2/2 = 7이기 때문이다). AlexNet은 여기에 4096의 크기를 갖는 FC 레이어 2개와 클래스 스코어를 계산하는 1000개 뉴런으로 이뤄진 마지막 FC 레이어를 사용한다. 이 마지막 3개의 FC 레이어를 CONV 레이어로 변환하는 방법을 아래에서 배우게 된다: - -- [7x7x512]의 입력 볼륨을 받는 첫 번째 FC 레이어를 $$F = 7$$의 필터 크기를 갖는 CONV 레이어로 바꾼다. 이 때 출력 볼륨의 크기는 [1x1x4096] 이 된다. -- 두 번째 FC 레이어를 $$F = 1$$ 필터 사이즈의 CONV 레이어로 바꾼다. 이 때 출력 볼륨의 크기는 [1x1s4096]이 된다. -- 같은 방식으로 마지막 FC 레이어를 $$F = 1$$의 CONV 레이어를 바꾼다. 출력 볼륨의 크기는 [1x1x1000]이 된다. - -각각의 변환은 일반적으로 FC 레이어의 가중치 $$W$$를 CONV 레이어의 필터로 변환하는 과정을 수반한다. 이런 변환을 하고 나면, 큰 이미지 (가로/세로가 224보다 큰 이미지)를 단 한번의 forward pass만으로 마치 이미지를 "슬라이딩"하면서 여러 영역을 읽은 것과 같은 효과를 준다. - -예를 들어,224x224 크기의 이미지를 입력으로 받으면 [7x7x512]의 볼륨을 출력하는 이 아키텍쳐에, ( 224/7 = 32배 줄어듦 ) 된 아키텍쳐에 384x384 크기의 이미지를 넣으면 [12x12x512] 크기의 볼륨을 출력하게 된다 (384/32 = 12 이므로). 이후 3개 CONV 레이어 - -For example, if 224x224 image gives a volume of size [7x7x512] - i.e. a reduction by 32, then forwarding an image of size 384x384 through the converted architecture would give the equivalent volume in size [12x12x512], since 384/32 = 12. Following through with the next 3 CONV layers that we just converted from FC layers would now give the final volume of size [6x6x1000], since (12 - 7)/1 + 1 = 6. 
Note that instead of a single vector of class scores of size [1x1x1000], we're now getting and entire 6x6 array of class scores across the 384x384 image. - -> Evaluating the original ConvNet (with FC layers) independently across 224x224 crops of the 384x384 image in strides of 32 pixels gives an identical result to forwarding the converted ConvNet one time. - -Naturally, forwarding the converted ConvNet a single time is much more efficient than iterating the original ConvNet over all those 36 locations, since the 36 evaluations share computation. This trick is often used in practice to get better performance, where for example, it is common to resize an image to make it bigger, use a converted ConvNet to evaluate the class scores at many spatial positions and then average the class scores. - -Lastly, what if we wanted to efficiently apply the original ConvNet over the image but at a stride smaller than 32 pixels? We could achieve this with multiple forward passes. For example, note that if we wanted to use a stride of 16 pixels we could do so by combining the volumes received by forwarding the converted ConvNet twice: First over the original image and second over the image but with the image shifted spatially by 16 pixels along both width and height. - -- An IPython Notebook on [Net Surgery](https://github.com/BVLC/caffe/blob/master/examples/net_surgery.ipynb) shows how to perform the conversion in practice, in code (using Caffe) - - -### ConvNet Architectures - -We have seen that Convolutional Networks are commonly made up of only three layer types: CONV, POOL (we assume Max pool unless stated otherwise) and FC (short for fully-connected). We will also explicitly write the RELU activation function as a layer, which applies elementwise non-linearity. In this section we discuss how these are commonly stacked together to form entire ConvNets. - - -#### Layer Patterns -The most common form of a ConvNet architecture stacks a few CONV-RELU layers, follows them with POOL layers, and repeats this pattern until the image has been merged spatially to a small size. At some point, it is common to transition to fully-connected layers. The last fully-connected layer holds the output, such as the class scores. In other words, the most common ConvNet architecture follows the pattern: - -`INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC` - -where the `*` indicates repetition, and the `POOL?` indicates an optional pooling layer. Moreover, `N >= 0` (and usually `N <= 3`), `M >= 0`, `K >= 0` (and usually `K < 3`). For example, here are some common ConvNet architectures you may see that follow this pattern: - -- `INPUT -> FC`, implements a linear classifier. Here `N = M = K = 0`. -- `INPUT -> CONV -> RELU -> FC` -- `INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC`. Here we see that there is a single CONV layer between every POOL layer. -- `INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL]*3 -> [FC -> RELU]*2 -> FC` Here we see two CONV layers stacked before every POOL layer. This is generally a good idea for larger and deeper networks, because multiple stacked CONV layers can develop more complex features of the input volume before the destructive pooling operation. - -*Prefer a stack of small filter CONV to one large receptive field CONV layer*. Suppose that you stack three 3x3 CONV layers on top of each other (with non-linearities in between, of course). In this arrangement, each neuron on the first CONV layer has a 3x3 view of the input volume. 
A neuron on the second CONV layer has a 3x3 view of the first CONV layer, and hence by extension a 5x5 view of the input volume. Similarly, a neuron on the third CONV layer has a 3x3 view of the 2nd CONV layer, and hence a 7x7 view of the input volume. Suppose that instead of these three layers of 3x3 CONV, we only wanted to use a single CONV layer with 7x7 receptive fields. These neurons would have a receptive field size of the input volume that is identical in spatial extent (7x7), but with several disadvantages. First, the neurons would be computing a linear function over the input, while the three stacks of CONV layers contain non-linearities that make their features more expressive. Second, if we suppose that all the volumes have $$C$$ channels, then it can be seen that the single 7x7 CONV layer would contain $$C \times (7 \times 7 \times C) = 49 C^2$$ parameters, while the three 3x3 CONV layers would only contain $$3 \times (C \times (3 \times 3 \times C)) = 27 C^2$$ parameters. Intuitively, stacking CONV layers with tiny filters as opposed to having one CONV layer with big filters allows us to express more powerful features of the input, and with fewer parameters. As a practical disadvantage, we might need more memory to hold all the intermediate CONV layer results if we plan to do backpropagation. - - -#### Layer Sizing Patterns - -Until now we've omitted mentions of common hyperparameters used in each of the layers in a ConvNet. We will first state the common rules of thumb for sizing the architectures and then follow the rules with a discussion of the notation: - -The **input layer** (that contains the image) should be divisible by 2 many times. Common numbers include 32 (e.g. CIFAR-10), 64, 96 (e.g. STL-10), or 224 (e.g. common ImageNet ConvNets), 384, and 512. - -The **conv layers** should be using small filters (e.g. 3x3 or at most 5x5), using a stride of $$S = 1$$, and crucially, padding the input volume with zeros in such way that the conv layer does not alter the spatial dimensions of the input. That is, when $$F = 3$$, then using $$P = 1$$ will retain the original size of the input. When $$F = 5$$, $$P = 2$$. For a general $$F$$, it can be seen that $$P = (F - 1) / 2$$ preserves the input size. If you must use bigger filter sizes (such as 7x7 or so), it is only common to see this on the very first conv layer that is looking at the input image. - -The **pool layers** are in charge of downsampling the spatial dimensions of the input. The most common setting is to use max-pooling with 2x2 receptive fields (i.e. $$F = 2$$), and with a stride of 2 (i.e. $$S = 2$$). Note that this discards exactly 75% of the activations in an input volume (due to downsampling by 2 in both width and height). Another sligthly less common setting is to use 3x3 receptive fields with a stride of 2, but this makes. It is very uncommon to see receptive field sizes for max pooling that are larger than 3 because the pooling is then too lossy and agressive. This usually leads to worse performance. - -*Reducing sizing headaches.* The scheme presented above is pleasing because all the CONV layers preserve the spatial size of their input, while the POOL layers alone are in charge of down-sampling the volumes spatially. 
In an alternative scheme where we use strides greater than 1 or don't zero-pad the input in CONV layers, we would have to very carefully keep track of the input volumes throughout the CNN architecture and make sure that all strides and filters "work out", and that the ConvNet architecture is nicely and symmetrically wired. - -*Why use stride of 1 in CONV?* Smaller strides work better in practice. Additionally, as already mentioned stride 1 allows us to leave all spatial down-sampling to the POOL layers, with the CONV layers only transforming the input volume depth-wise. - -*Why use padding?* In addition to the aforementioned benefit of keeping the spatial sizes constant after CONV, doing this actually improves performance. If the CONV layers were to not zero-pad the inputs and only perform valid convolutions, then the size of the volumes would reduce by a small amount after each CONV, and the information at the borders would be "washed away" too quickly. - -*Compromising based on memory constraints.* In some cases (especially early in the ConvNet architectures), the amount of memory can build up very quickly with the rules of thumb presented above. For example, filtering a 224x224x3 image with three 3x3 CONV layers with 64 filters each and padding 1 would create three activation volumes of size [224x224x64]. This amounts to a total of about 10 million activations, or 72MB of memory (per image, for both activations and gradients). Since GPUs are often bottlenecked by memory, it may be necessary to compromise. In practice, people prefer to make the compromise at only the first CONV layer of the network. For example, one compromise might be to use a first CONV layer with filter sizes of 7x7 and stride of 2 (as seen in a ZF net). As another example, an AlexNet uses filer sizes of 11x11 and stride of 4. - - -#### Case studies - -There are several architectures in the field of Convolutional Networks that have a name. The most common are: - -- **LeNet**. The first successful applications of Convolutional Networks were developed by Yann LeCun in 1990's. Of these, the best known is the [LeNet](http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf) architecture that was used to read zip codes, digits, etc. -- **AlexNet**. The first work that popularized Convolutional Networks in Computer Vision was the [AlexNet](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks), developed by Alex Krizhevsky, Ilya Sutskever and Geoff Hinton. The AlexNet was submitted to the [ImageNet ILSVRC challenge](http://www.image-net.org/challenges/LSVRC/2014/) in 2012 and significantly outperformed the second runner-up (top 5 error of 16% compared to runner-up with 26% error). The Network had a similar architecture basic as LeNet, but was deeper, bigger, and featured Convolutional Layers stacked on top of each other (previously it was common to only have a single CONV layer immediately followed by a POOL layer). -- **ZF Net**. The ILSVRC 2013 winner was a Convolutional Network from Matthew Zeiler and Rob Fergus. It became known as the [ZFNet](http://arxiv.org/abs/1311.2901) (short for Zeiler & Fergus Net). It was an improvement on AlexNet by tweaking the architecture hyperparameters, in particular by expanding the size of the middle convolutional layers. -- **GoogLeNet**. The ILSVRC 2014 winner was a Convolutional Network from [Szegedy et al.](http://arxiv.org/abs/1409.4842) from Google. 
Its main contribution was the development of an *Inception Module* that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M). Additionally, this paper uses Average Pooling instead of Fully Connected layers at the top of the ConvNet, eliminating a large amount of parameters that do not seem to matter much. -- **VGGNet**. The runner-up in ILSVRC 2014 was the network from Karen Simonyan and Andrew Zisserman that became known as the [VGGNet](http://www.robots.ox.ac.uk/~vgg/research/very_deep/). Its main contribution was in showing that the depth of the network is a critical component for good performance. Their final best network contains 16 CONV/FC layers and, appealingly, features an extremely homogeneous architecture that only performs 3x3 convolutions and 2x2 pooling from the beginning to the end. It was later found that despite its slightly weaker classification performance, the VGG ConvNet features outperform those of GoogLeNet in multiple transfer learning tasks. Hence, the VGG network is currently the most preferred choice in the community when extracting CNN features from images. In particular, their [pretrained model](http://www.robots.ox.ac.uk/~vgg/research/very_deep/) is available for plug and play use in Caffe. A downside of the VGGNet is that it is more expensive to evaluate and uses a lot more memory and parameters (140M). -- **ResNet**. [Residual Network](http://arxiv.org/abs/1512.03385) developed by Kaiming He et al. was the winner of ILSVRC 2015. It features an interesting architecture with special *skip connections* and features heavy use of batch normalization. The architecture is also missing fully connected layers at the end of the network. The reader is also referred to Kaiming's presentation ([video](https://www.youtube.com/watch?v=1PGLj-uKT1w), [slides](http://research.microsoft.com/en-us/um/people/kahe/ilsvrc15/ilsvrc2015_deep_residual_learning_kaiminghe.pdf)), and some [recent experiments](https://github.com/gcr/torch-residual-networks) that reproduce these networks in Torch. - -**VGGNet in detail**. -Lets break down the [VGGNet](http://www.robots.ox.ac.uk/~vgg/research/very_deep/) in more detail. The whole VGGNet is composed of CONV layers that perform 3x3 convolutions with stride 1 and pad 1, and of POOL layers that perform 2x2 max pooling with stride 2 (and no padding). 
We can write out the size of the representation at each step of the processing and keep track of both the representation size and the total number of weights: - -~~~ -INPUT: [224x224x3] memory: 224*224*3=150K weights: 0 -CONV3-64: [224x224x64] memory: 224*224*64=3.2M weights: (3*3*3)*64 = 1,728 -CONV3-64: [224x224x64] memory: 224*224*64=3.2M weights: (3*3*64)*64 = 36,864 -POOL2: [112x112x64] memory: 112*112*64=800K weights: 0 -CONV3-128: [112x112x128] memory: 112*112*128=1.6M weights: (3*3*64)*128 = 73,728 -CONV3-128: [112x112x128] memory: 112*112*128=1.6M weights: (3*3*128)*128 = 147,456 -POOL2: [56x56x128] memory: 56*56*128=400K weights: 0 -CONV3-256: [56x56x256] memory: 56*56*256=800K weights: (3*3*128)*256 = 294,912 -CONV3-256: [56x56x256] memory: 56*56*256=800K weights: (3*3*256)*256 = 589,824 -CONV3-256: [56x56x256] memory: 56*56*256=800K weights: (3*3*256)*256 = 589,824 -POOL2: [28x28x256] memory: 28*28*256=200K weights: 0 -CONV3-512: [28x28x512] memory: 28*28*512=400K weights: (3*3*256)*512 = 1,179,648 -CONV3-512: [28x28x512] memory: 28*28*512=400K weights: (3*3*512)*512 = 2,359,296 -CONV3-512: [28x28x512] memory: 28*28*512=400K weights: (3*3*512)*512 = 2,359,296 -POOL2: [14x14x512] memory: 14*14*512=100K weights: 0 -CONV3-512: [14x14x512] memory: 14*14*512=100K weights: (3*3*512)*512 = 2,359,296 -CONV3-512: [14x14x512] memory: 14*14*512=100K weights: (3*3*512)*512 = 2,359,296 -CONV3-512: [14x14x512] memory: 14*14*512=100K weights: (3*3*512)*512 = 2,359,296 -POOL2: [7x7x512] memory: 7*7*512=25K weights: 0 -FC: [1x1x4096] memory: 4096 weights: 7*7*512*4096 = 102,760,448 -FC: [1x1x4096] memory: 4096 weights: 4096*4096 = 16,777,216 -FC: [1x1x1000] memory: 1000 weights: 4096*1000 = 4,096,000 - -TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd) -TOTAL params: 138M parameters -~~~ - -As is common with Convolutional Networks, notice that most of the memory is used in the early CONV layers, and that most of the parameters are in the last FC layers. In this particular case, the first FC layer contains 100M weights, out of a total of 140M. - - - - -#### Computational Considerations - -The largest bottleneck to be aware of when constructing ConvNet architectures is the memory bottleneck. Many modern GPUs have a limit of 3/4/6GB memory, with the best GPUs having about 12GB of memory. There are three major sources of memory to keep track of: - -- From the intermediate volume sizes: These are the raw number of **activations** at every layer of the ConvNet, and also their gradients (of equal size). Usually, most of the activations are on the earlier layers of a ConvNet (i.e. first Conv Layers). These are kept around because they are needed for backpropagation, but a clever implementation that runs a ConvNet only at test time could in principle reduce this by a huge amount, by only storing the current activations at any layer and discarding the previous activations on layers below. -- From the parameter sizes: These are the numbers that hold the network **parameters**, their gradients during backpropagation, and commonly also a step cache if the optimization is using momentum, Adagrad, or RMSProp. Therefore, the memory to store the parameter vector alone must usually be multiplied by a factor of at least 3 or so. -- Every ConvNet implementation has to maintain **miscellaneous** memory, such as the image data batches, perhaps their augmented versions, etc. 
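(Translator's aside: the snippet below is not part of the original notes; it is a rough sketch of this memory bookkeeping using the VGGNet totals from the table above. The x2 factor for activations covers their gradients and the x3 factor for parameters covers gradients plus an optimizer step cache, following the three sources just listed; the bytes-to-GB conversion follows the note just below.)

~~~python
def memory_gb(num_activations, num_params, misc_values=0, bytes_per_value=4):
    """Rough ConvNet memory estimate in GB (4 bytes per float32 value)."""
    total_values = (num_activations * 2    # activations + their gradients
                    + num_params * 3       # params + grads + step cache
                    + misc_values)         # data batches, augmentation, ...
    return total_values * bytes_per_value / 1024.0 ** 3

# VGGNet ballpark from the table above: ~24M activations/image, ~138M params
print(round(memory_gb(num_activations=24 * 10**6, num_params=138 * 10**6), 2))  # ~1.72
~~~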
- -Once you have a rough estimate of the total number of values (for activations, gradients, and misc), the number should be converted to size in GB. Take the number of values, multiply by 4 to get the raw number of bytes (since every floating point is 4 bytes, or maybe by 8 for double precision), and then divide by 1024 multiple times to get the amount of memory in KB, MB, and finally GB. If your network doesn't fit, a common heuristic to "make it fit" is to decrease the batch size, since most of the memory is usually consumed by the activations. - -### Visualizing and Understanding Convolutional Networks - -In the [next section](../understanding-cnn/) of these notes we look at visualizing and understanding Convolutional Neural Networks. - - - -### Additional Resources - -Additional resources related to implementation: - -- [DeepLearning.net tutorial](http://deeplearning.net/tutorial/lenet.html) walks through an implementation of a ConvNet in Theano -- [cuda-convnet2](https://code.google.com/p/cuda-convnet2/) by Alex Krizhevsky is a ConvNet implementation that supports multiple GPUs -- [ConvNetJS CIFAR-10 demo](http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html) allows you to play with ConvNet architectures and see the results and computations in real time, in the browser. -- [Caffe](http://caffe.berkeleyvision.org/), one of the most popular ConvNet libraries. -- [Example Torch 7 ConvNet](https://github.com/nagadomi/kaggle-cifar10-torch7) that achieves 7% error on CIFAR-10 with a single model -- [Ben Graham's Sparse ConvNet](https://www.kaggle.com/c/cifar-10/forums/t/10493/train-you-very-own-deep-convolutional-network/56310) package, which Ben Graham used to great success to achieve less than 4% error on CIFAR-10. - ---- -

-번역: 김택수 (jazzsaxmafia) -

diff --git a/linear-classify.md b/linear-classify.md index d3b7d525..36d6347f 100644 --- a/linear-classify.md +++ b/linear-classify.md @@ -71,27 +71,27 @@ $$
-As we saw above, every row of $W$ is a classifier for one of the classes. The geometric interpretation of these numbers is that as we change one of the rows of $W$, the corresponding line in the pixel space will rotate in different directions. The biases $b$, on the other hand, allow our classifiers to translate the lines. In particular, note that without the bias terms, plugging in $ x_i = 0 $ would always give score of zero regardless of the weights, so all lines would be forced to cross the origin. +위에서 살펴보았듯이, $$W$$의 각 행은 각각의 클래스를 구별하는 분류기이다. 각 행에 있는 숫자들을 기하학적으로 해석해보자면, 우리가 $$W$$의 하나의 행을 바꾸면 픽셀 공간에서 해당하는 선이 다른 방향으로 회전할 것이다. 반면에, bias인 $$b$$는 분류기가 그 선들을 평행이동 할 수 있도록 해준다. 특히, bias가 없다면 $$ x_i = 0 $$가 입력으로 들어왔을 때 파라미터 값들에 상관없이 항상 스코어가 0이 될 것이고, 모든 (분류) 선들이 원점을 지나야만 할 것이다. -**Interpretation of linear classifiers as template matching.** -Another interpretation for the weights $W$ is that each row of $W$ corresponds to a *template* (or sometimes also called a *prototype*) for one of the classes. The score of each class for an image is then obtained by comparing each template with the image using an *inner product* (or *dot product*) one by one to find the one that "fits" best. With this terminology, the linear classifier is doing template matching, where the templates are learned. Another way to think of it is that we are still effectively doing Nearest Neighbor, but instead of having thousands of training images we are only using a single image per class (although we will learn it, and it does not necessarily have to be one of the images in the training set), and we use the (negative) inner product as the distance instead of the L1 or L2 distance. +**템플릿 매칭으로서의 선형 분류기 해석.** +파라미터 $$W$$에 대해 다른 방식으로 해석해보면, $$W$$의 각 행은 각 클래스별 *템플릿* (또는 *프로토타입*)에 해당된다. 이미지의 각 클래스 스코어는 각 템플릿들을 이미지와 *내적(inner product, 또는 dot product)*을 통해 하나하나 비교함으로써 계산되고, 이 스코어를 기준으로 가장 잘 "맞는" 것이 무엇인지 정한다. 즉, 선형 분류기가 결국 템플릿 매칭을 하고 있고, 각 템플릿이 학습을 통해 배워진다고 할 수 있다. 또다른 방식으로 생각해보면, 우리는 Nearest Neighbor와 비슷한 것을 하고 있는데, 수 천 장의 학습 이미지를 갖고 있지 않고 각 클래스마다 한 장의 이미지만 사용한다고 볼 수 있다. (다만, 그 이미지를 학습하고, 학습 데이터셋에 실제로 존재하는 이미지일 필요는 없다.) 이 때, 거리 함수로는 L1이나 L2 거리를 사용하지 않고 서로 내적한 것(의 반대 부호인 값)을 사용한다.
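(역자 참고: 아래는 원문에 없는 스케치로, "스코어 = 각 클래스 템플릿과 이미지의 내적"이라는 위 해석을 그대로 코드로 옮겨 본 것이다. CIFAR-10 기준의 크기(펼쳐진 3072차원 이미지, 10개 클래스)를 가정하며, 가중치는 난수로 둔 것이다.)

~~~python
import numpy as np

x = np.random.randn(3072)       # 펼쳐진 32*32*3 이미지
W = np.random.randn(10, 3072)   # 각 행이 클래스 하나의 템플릿
b = np.random.randn(10)

scores = W.dot(x) + b           # 템플릿 10개와의 내적을 행렬곱 한 번으로 계산
print(scores.shape, scores.argmax())   # (10,) 그리고 가장 잘 "맞는" 클래스
~~~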
- Skipping ahead a bit: Example learned weights at the end of learning for CIFAR-10. Note that, for example, the ship template contains a lot of blue pixels as expected. This template will therefore give a high score once it is matched against images of ships on the ocean with an inner product. + 약간의 선행학습: CIFAR-10 데이터셋에 학습된 파라미터들의 시각화 예시. 예를 들어 ship 템플릿을 보면, 예상할 수 있듯이 많은 수의 파란색 픽셀들로 이루어져 있다는 점에 주목하자. 이 템플릿은 배(ship)가 바다 위에 떠있는 이미지와 내적을 통해 비교되었을 때, 높은 스코어 값을 가질 것이다.
-Additionally, note that the horse template seems to contain a two-headed horse, which is due to both left and right facing horses in the dataset. The linear classifier *merges* these two modes of horses in the data into a single template. Similarly, the car classifier seems to have merged several modes into a single template which has to identify cars from all sides, and of all colors. In particular, this template ended up being red, which hints that there are more red cars in the CIFAR-10 dataset than of any other color. The linear classifier is too weak to properly account for different-colored cars, but as we will see later neural networks will allow us to perform this task. Looking ahead a bit, a neural network will be able to develop intermediate neurons in its hidden layers that could detect specific car types (e.g. green car facing left, blue car facing front, etc.), and neurons on the next layer could combine these into a more accurate car score through a weighted sum of the individual car detectors. +추가적으로, horse 템플릿은 머리가 두 개인 말(horse)이 있는 것처럼 보이는데, 이것은 데이터셋 안에 왼쪽을 보고 있는 말과 오른쪽을 보고 있는 말이 섞여있기 때문이다. 선형 분류기는 말에 대한 이 두 가지 모드를 하나의 템플릿으로 *합친* 것을 확인할 수 있다. 이와 비슷한 현상으로, car 분류기는 모든 방향 및 색깔의 자동차 모양들을 하나의 템플릿으로 합쳐 놓았다. 특히, 이 템플릿이 결과적으로 붉은 색을 띄는 것으로 보아 CIFAR-10 데이터셋에는 다른 색깔에 비해 빨간색 자동차가 더 많다는 점을 알 수 있다. 선형 분류기는 여러 가지 색깔의 자동차를 제대로 분류하기에는 너무 모델이 단순하지만, 나중에 배울 뉴럴 네트워크는 이를 해결할 수 있다. 약간만 미리 살펴보자면, 뉴럴 네트워크는 히든 레이어의 각 뉴런들이 특정 자동차 타입 (e.g. 왼쪽을 바라보고 있는 초록색 자동차, 정면을 보고 있는 파란색 차, 등등)을 검출하도록 할 수 있고, 다음 레이어의 뉴런들이 이 정보들을 종합하여 각각의 자동차 타입 검출기의 점수의 가중치 합을 통해 보다 정확한 (자동차에 대한) 스코어를 계산할 수 있다. -**Bias trick.** Before moving on we want to mention a common simplifying trick to representing the two parameters $W,b$ as one. Recall that we defined the score function as: +**Bias 트릭.** 다음 내용으로 넘어가기 전에, 두 파라미터 $$W, b$$를 하나로 표현하는 간단한 트릭을 소개한다. 앞에서 스코어 함수는 아래와 같이 정의되었다. $$ f(x_i, W, b) = W x_i + b $$ -As we proceed through the material it is a little cumbersome to keep track of two sets of parameters (the biases $b$ and weights $W$) separately. A commonly used trick is to combine the two sets of parameters into a single matrix that holds both of them by extending the vector $x_i$ with one additional dimension that always holds the constant $1$ - a default *bias dimension*. With the extra dimension, the new score function will simplify to a single matrix multiply: +앞으로 내용을 전개해 나갈 때 두 가지 파라미터를 (bias $$b$$와 weight $$W$$) 매번 동시에 고려해야 한다면 표현이 번거로워진다. 흔히 사용하는 트릭은 이 두 파라미터들을 하나의 행렬로 합치고, $$x_i$$를 항상 $$1$$의 값을 갖는 한 차원 - 디폴트 *bias* 차원 - 을 늘리는 방식이다. 이 한 차원 추가하는 것으로, 새 스코어 함수는 행렬곱 한 번으로 계산이 가능해진다: $$ f(x_i, W) = W x_i diff --git a/neural-networks-2.kr.md b/neural-networks-2.kr.md index 55302f8e..75b89ceb 100644 --- a/neural-networks-2.kr.md +++ b/neural-networks-2.kr.md @@ -225,7 +225,7 @@ dropout이 처음 소개된 이후로 실제 적용 사례에서 나타난 성 **forward pass에서 노이즈 관련하여** 넓은 의미에서 보자면 dropout은 신경망의 forward pass에서 stochastic(확률적) 접근을 도입하는 것으로 볼 수 있다. testing 과정에서 노이즈 감소하게 되는데 이는 *분석적 해석*은 `확률 $p$ 만큼 곱해진 결과`라고 볼 수 있고, *수치적 해석*은 `랜덤하게 선택된 forward pass를 여러차례 수행한 결과의 평균`이라고 볼 수 있다. 동일한 관점에서의 연구들 중 하나인 [DropConnect](http://cs.nyu.edu/~wanli/dropc/)를 보면 forward pass 동안 가중치 값을 0으로 설정하는 것으로 볼 수 있다. Convolutional 신경망에서 dropout과 함께 stochastic(확률적) 풀링(pooling), 부분 풀링, 데이터 augmentation 등의 기법을 같이 사용하여 추가적인 성능 향상을 기대할 수 있다. 이에 대해서는 뒤에서 더 자세히 살펴 볼 것이다. -**Bias resularization**. Linear Classification 파트에서 설명했듯이, bias 텀은 regularization을 적용하지 않는 것이 일반적인데, 이는 학습된 가중치와 곱셈 연산을 하지 않기 때문에 목적 함수에서 데이터 dimension을 결정하는 요소로 작용하지 않는다. 
그러나 실제 적용 사례들을 보면 bias 텀에 regularization을 적용하였을 때 심각한 성능 저하가 나타나는 경우는 극히 드문 것으로 알려져 있다. 이는 모든 가중치 텀의 갯수와 비교했을 때 bais 텀의 갯수는 무시할 만한 수준이어서 so the classifier can "afford to" use the biases if it needs them to obtain a better data loss. +**Bias regularization**. Linear Classification 파트에서 설명했듯이, bias 텀은 regularization을 적용하지 않는 것이 일반적인데, 이는 학습된 가중치와 곱셈 연산을 하지 않기 때문에 목적 함수에서 데이터 dimension을 결정하는 요소로 작용하지 않는다. 그러나 실제 적용 사례들을 보면 bias 텀에 regularization을 적용하였을 때 심각한 성능 저하가 나타나는 경우는 극히 드문 것으로 알려져 있다. 이는 모든 가중치 텀의 갯수와 비교했을 때 bais 텀의 갯수는 무시할 만한 수준이어서 so the classifier can "afford to" use the biases if it needs them to obtain a better data loss. **레이어별 정규화**. 마지막 출력 레이어를 제외하고 레이어를 각각 따로 정규화 하는 것은 일반적인 방법이 아니다. 레이어 별 정규화를 적용한 논문수도 상대적으로 매우 적은 편이다. From b3eac0038e87c9e5b0806bdfcdc787b0d3eb0f66 Mon Sep 17 00:00:00 2001 From: YB Date: Mon, 4 Jul 2016 22:59:37 -0400 Subject: [PATCH 186/199] ecture1 - part 226~245 (out of 715) en / ko --- captions/En/Lecture1_en.srt | 76 ++++++++++++++++++------------------- captions/Ko/Lecture1_ko.srt | 49 +++++++++++++----------- 2 files changed, 64 insertions(+), 61 deletions(-) diff --git a/captions/En/Lecture1_en.srt b/captions/En/Lecture1_en.srt index 1c52f268..77101706 100644 --- a/captions/En/Lecture1_en.srt +++ b/captions/En/Lecture1_en.srt @@ -1111,100 +1111,100 @@ what they have to do when they showed this cats 226 00:25:14,509 --> 00:25:21,740 that is a stimulus, they have to use a slide -projector so they put his foot a slide +projector so they put a a slide of fish 227 00:25:21,740 --> 00:25:26,799 -of a fish and then wait till the new on -Spike if the new imposes bike they take +and then wait till the neuron spike. +If the neuron doesn't spike, they take 228 00:25:26,799 --> 00:25:29,960 -the slide out putting another slight +the slide out and put in another slide. 229 00:25:29,960 --> 00:25:38,630 -notice I would have liked this like you -know this film I don't you remember to +Then, they noticed every time they changed slide, +like this, you know, this square-ish film. 230 00:25:38,630 --> 00:25:46,890 -use glasser film whatever the Douro -spikes that's weird you know like the +I don't you remember if they use glass or film whatever. +The neuron spikes. That's weird you know like the 231 00:25:46,890 --> 00:25:51,940 -actual mouse official flower didn't -drive the new excite the new role but +actual mouse and fish and flower didn't +drive the neuron or excite the neuron but 232 00:25:51,940 --> 00:25:59,759 the the the movement of taking the slide -out nor could he has sliding did excite +out or putting a slide in did excite neuron. 233 00:25:59,759 --> 00:26:03,140 -the new I can be the catalyst think -you'll finally they're changing the new +It can be the careless thinking of +finally they're changing the new 234 00:26:03,140 --> 00:26:13,410 -you know new objects for me so it turned -out there is created by this life that +you know new objects for me. So, it turned +out there is edge that's created by this slide that 235 00:26:13,410 --> 00:26:18,240 -they're changing right slide the -whatever it's a square rectangular plate +they're changing. Right? the slide +whatever it's a square rectangular plate. 236 00:26:18,240 --> 00:26:28,120 -and that moving edge or excited the -nuance so they're really taste the after +And that moving edge grow or excited the neurons. +So they really chase after that observations. 
237 -00:26:28,119 --> 00:26:34,859 -that observations you know if they were -too frustrated or to have missed that +00:26:28,120 --> 00:26:34,859 +you know if they were too frustrated or too careless, +they would have missed that, 238 00:26:34,859 --> 00:26:41,359 -but they're not the they're really taste -after that and realize new songs in the +but they were not. They're really chase +after that and realized neurons in the 239 00:26:41,359 --> 00:26:48,279 -primary visual cortex are organized in -columns and for every column of the new +primary visual cortex are organized in columns, +and for every column of the neurons, 240 00:26:48,279 --> 00:27:01,309 -Alice they'd like to see a specific -orientation of the of the bars rather +they'd like to see a specific orientation of the stimulus. +The simple oriented bars rather 241 00:27:01,309 --> 00:27:02,980 -than the Fisher a mouse +than the fish or a mouse. 242 00:27:02,980 --> 00:27:07,519 -you know I'm a bit of a simple story -because there are still numerous in +You know, I'm making this little bit of a simple story +because there are still neurons in 243 00:27:07,519 --> 00:27:10,940 -primary visual cortex we don't know what -they like they don't like simple +primary visual cortex we don't know what they like. +They don't like simple 244 00:27:10,940 --> 00:27:17,570 -oriented but by large with a human -visitor found that the beginning of +oriented bars but by large Hubel and Wiesel +found that the beginning of 245 -00:27:17,569 --> 00:27:23,779 -visual processing is not a holistic fish -or malice the beginning of visual +00:27:17,570 --> 00:27:23,779 +visual processing is not a holistic fish or mouse. +The beginning of visual 246 00:27:23,779 --> 00:27:29,178 diff --git a/captions/Ko/Lecture1_ko.srt b/captions/Ko/Lecture1_ko.srt index a3211e40..ec639c26 100644 --- a/captions/Ko/Lecture1_ko.srt +++ b/captions/Ko/Lecture1_ko.srt @@ -888,7 +888,7 @@ 무언가를 보았을 때 뉴런이 활발하게 활동하는지를 보려고 하는 거죠. 217 -00:24:18,058 --> 00:24:25,308 +00:24:18,059 --> 00:24:25,308 예를 들어 그들이 고양이에게.. 218 @@ -925,83 +925,86 @@ 226 00:25:14,509 --> 00:25:21,740 - 고양이는 그들이 그의 발 슬라이드 넣을 수 있도록 약간의 보호를 사용해야 할 것이다 + 자극적인 것들을 보여주기 위해서 슬라이드 프로젝터를 사용해야했어요. 227 00:25:21,740 --> 00:25:26,799 - 새로운 강요하며 자전거가 걸릴 경우 물고기의 다음 스파이크에 새로운까지 기다려 + 그들은 생선이 그려진 슬라이드를 넣고 뉴런이 활동하는지 지켜봅니다. 228 00:25:26,799 --> 00:25:29,960 - 슬라이드 아웃 다른 약간을 넣어 + 활동이 없다면 다른 슬라이드로 교체를 했죠. 229 00:25:29,960 --> 00:25:38,630 - 이 영화는 내가 당신이 기억하지 알고처럼이 좋아하는 것주의 사항 + 그런데 슬라이드를 바꿀 때 마다 뉴런이 활동하는것을 발견했어요. 230 00:25:38,630 --> 00:25:46,890 - 당신이처럼 알고 이상한 그 무엇이든 도우루가 스파이크 래셔 필름을 사용 + 그 사각형의 필름있죠? 유리로 되있는지 필름인지 기억은 안나지만, 아무튼 뉴런이 반을을 했죠. 231 00:25:46,890 --> 00:25:51,940 - 실제 마우스 공식 꽃 새가 새 역할을 자극 운전하지 않았지만 + 실제 쥐, 생선이나 꽃에 뉴런이 반응하진 않았어요. 232 00:25:51,940 --> 00:25:59,759 - 상기 아웃 슬라이드를 복용의 이동도 그는 흥분 않았다 슬라이딩 한 수 + 슬라이드를 꺼내거나 집어넣는 움직임에 뉴런이 반응을 했죠. 233 00:25:59,759 --> 00:26:03,140 - 나는 촉매가 될 수있는 새로운 당신이 마침내 새를 변경하는 것 같아요 + 고양이가 단지 "아 드디어 새로운 물체를 보여주려는 구나~" 하고 생각하는 것일 수도 있겠죠. 234 00:26:03,140 --> 00:26:13,410 - 그것은이 생활에 의해이 생성되어 있도록 나를 위해 새로운 객체를 알고 + 알고보니 그들이 슬라이드를 교체하면서 투영된 선이 있었어요. 235 00:26:13,410 --> 00:26:18,240 - 그들은 그것이 정사각형, 직사각형 판이야 무엇이든 바로 슬라이드를 변경하고 + 그 정사각형인가 네모난 판 말이에요. 236 00:26:18,240 --> 00:26:28,120 - 것을 에지 또는 흥분 뉘앙스를 이동 그들은 정말 후 맛을하고 그래서 + 그 움직이는 선이 뉴런을 활동하게 만들었고 + 그들은 그 점에 대해 조사하기 시작했죠. 237 -00:26:28,119 --> 00:26:34,859 - 너무 좌절하거나 놓친 것 인 경우에 관찰 당신은 알고 +00:26:28,120 --> 00:26:34,859 + 그들이 좌절을 했다거나 부주의 했다면 아마 놓쳐버릴 수도 있었죠. 
238 00:26:34,859 --> 00:26:41,359 - 하지만 그들은 그들이 정말로 그 후 맛과 새로운 노래를 실현하고 아니에요 + 하지만 그들은 계속 단서를 쫒았고 일차 시각피질의 뉴런들이 239 00:26:41,359 --> 00:26:48,279 - 차 시각 피질은 열에 새의 모든 컬럼에 대해 구성되어 있습니다 + 기둥모양으로 구성되어 있으며, 각각의 뉴런기둥은 240 00:26:48,279 --> 00:27:01,309 - 앨리스는이 바의 특정 방향을 오히려보고 싶습니다 + 특정한 방향성을 가지고 자극을 본다는 것을 알게됩니다. 241 00:27:01,309 --> 00:27:02,980 - 피셔 마우스보다 + 물고기 혹은 쥐 보다 간단한 하나의 방향을 가진 선을 보는 것이죠. 242 00:27:02,980 --> 00:27:07,519 - 당신은 여전히​​ 수많은 있기 때문에 나는 간단한 이야기​​ 좀 해요 알고 + 제가 이 이야기를 단순화시켜서 말하고는 있지만 243 00:27:07,519 --> 00:27:10,940 - 차 시각 피질 우리는 그들이 단순한 마음에 안 좋아하는지 모르겠어요 + 일차시각피질에는 우리가 무엇에 반응하는지 알아내지 못한 뉴런들이 아직 있어요. 244 00:27:10,940 --> 00:27:17,570 - 지향하지만 인간의 방문자 대형하여 처음의 발견 + 그 뉴런들은 단순한 방향성을 가진 선에 반응하지는 않죠. + 하지만 Hubel과 Wiesel은 시각처리의 시작이 245 -00:27:17,569 --> 00:27:23,779 - 시각 처리는 전체적인 생선이나 악의 시각의 시작되지 않습니다 +00:27:17,570 --> 00:27:23,779 + 시각 처리는 전체적인 생선이나 쥐의 모습이 아니라는 것을 발견했어요. + 시각 처리의 시작은 246 00:27:23,779 --> 00:27:29,178 From 8cca7ccb353671fdc671432ca053376e303d38a6 Mon Sep 17 00:00:00 2001 From: MaybeS Date: Thu, 7 Jul 2016 19:32:29 +0900 Subject: [PATCH 187/199] back --softmax update --- assignments2016/assignment1/softmax.ipynb | 28 ++++++++++++----------- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/assignments2016/assignment1/softmax.ipynb b/assignments2016/assignment1/softmax.ipynb index c319fe0c..94fd47cf 100644 --- a/assignments2016/assignment1/softmax.ipynb +++ b/assignments2016/assignment1/softmax.ipynb @@ -192,7 +192,8 @@ "toc = time.time()\n", "print 'vectorized loss: %e computed in %fs' % (loss_vectorized, toc - tic)\n", "\n", - "# SVM에서 했던것 처럼, Frobenius 방법을 사용해 두 버전의 요소를 비교할 것입니다.\n", + "# As we did for the SVM, we use the Frobenius norm to compare the two versions\n", + "# of the gradient.\n", "grad_difference = np.linalg.norm(grad_naive - grad_vectorized, ord='fro')\n", "print 'Loss difference: %f' % np.abs(loss_naive - loss_vectorized)\n", "print 'Gradient difference: %f' % grad_difference" @@ -206,9 +207,10 @@ }, "outputs": [], "source": [ - "# 검증셋을 이용하여 hyperparameters(정규화 강도와 학습률)를 튜닝하세요.\n", - "# 다른 범위에 대해 학습률과 정규화 강도를 실험해 보세요.\n", - "# 검증셋에 대해 0.35 이상의 분류 정확도를 얻어야 합니다.\n", + "# Use the validation set to tune hyperparameters (regularization strength and\n", + "# learning rate). You should experiment with different ranges for the learning\n", + "# rates and regularization strengths; if you are careful you should be able to\n", + "# get a classification accuracy of over 0.35 on the validation set.\n", "from cs231n.classifiers import Softmax\n", "results = {}\n", "best_val = -1\n", @@ -218,16 +220,16 @@ "\n", "################################################################################\n", "# TODO: #\n", - "# 검증셋을 이용해 학습률과 정규화 강도를 설정합니다. #\n", - "# 이것은 SVM에서의 검증과 같아야합니다; #\n", - "# 가장 잘 학습된 softmax 분류기를 best_softmax에 저장하세요. #\n", + "# Use the validation set to set the learning rate and regularization strength. #\n", + "# This should be identical to the validation that you did for the SVM; save #\n", + "# the best trained softmax classifer in best_softmax. 
#\n", "################################################################################\n", "pass\n", "################################################################################\n", - "# 코드의 끝 #\n", + "# END OF YOUR CODE #\n", "################################################################################\n", " \n", - "# 결과를 출력합니다.\n", + "# Print out results.\n", "for lr, reg in sorted(results):\n", " train_accuracy, val_accuracy = results[(lr, reg)]\n", " print 'lr %e reg %e train accuracy: %f val accuracy: %f' % (\n", @@ -244,8 +246,8 @@ }, "outputs": [], "source": [ - "# 테스트 셋으로 평가해 봅니다.\n", - "# 테스트 셋에서 최고의 softmax를 평가해 봅니다.\n", + "# evaluate on test set\n", + "# Evaluate the best softmax on test set\n", "y_test_pred = best_softmax.predict(X_test)\n", "test_accuracy = np.mean(y_test == y_test_pred)\n", "print 'softmax on raw pixels final test set accuracy: %f' % (test_accuracy, )" @@ -259,7 +261,7 @@ }, "outputs": [], "source": [ - "# 각 클래스에 대한 학습 된 가중치를 시각화\n", + "# Visualize the learned weights for each class\n", "w = best_softmax.W[:-1,:] # strip out the bias\n", "w = w.reshape(32, 32, 3, 10)\n", "\n", @@ -269,7 +271,7 @@ "for i in xrange(10):\n", " plt.subplot(2, 5, i + 1)\n", " \n", - " # 가중치를 0과 255사이로 재조정\n", + " # Rescale the weights to be between 0 and 255\n", " wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)\n", " plt.imshow(wimg.astype('uint8'))\n", " plt.axis('off')\n", From f3b812d1062d503266d46ed501de15c433d25ac5 Mon Sep 17 00:00:00 2001 From: MaybeS Date: Thu, 7 Jul 2016 19:39:42 +0900 Subject: [PATCH 188/199] TEST-Update Softmax --- assignments2016/assignment1/softmax.ipynb | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/assignments2016/assignment1/softmax.ipynb b/assignments2016/assignment1/softmax.ipynb index 94fd47cf..9eb64f7e 100644 --- a/assignments2016/assignment1/softmax.ipynb +++ b/assignments2016/assignment1/softmax.ipynb @@ -192,8 +192,7 @@ "toc = time.time()\n", "print 'vectorized loss: %e computed in %fs' % (loss_vectorized, toc - tic)\n", "\n", - "# As we did for the SVM, we use the Frobenius norm to compare the two versions\n", - "# of the gradient.\n", + "# ASVM에서 했던것 처럼, Frobenius 방법을 사용해 두 버전의 요소를 비교할 것입니다.\n", "grad_difference = np.linalg.norm(grad_naive - grad_vectorized, ord='fro')\n", "print 'Loss difference: %f' % np.abs(loss_naive - loss_vectorized)\n", "print 'Gradient difference: %f' % grad_difference" From 83c7fea71c8c4f4fe84b31d76fe7a80a976796ab Mon Sep 17 00:00:00 2001 From: MaybeS Date: Thu, 7 Jul 2016 19:42:50 +0900 Subject: [PATCH 189/199] Update softmax --- assignments2016/assignment1/softmax.ipynb | 25 +++++++++++------------ 1 file changed, 12 insertions(+), 13 deletions(-) diff --git a/assignments2016/assignment1/softmax.ipynb b/assignments2016/assignment1/softmax.ipynb index 9eb64f7e..1d364dc4 100644 --- a/assignments2016/assignment1/softmax.ipynb +++ b/assignments2016/assignment1/softmax.ipynb @@ -206,10 +206,9 @@ }, "outputs": [], "source": [ - "# Use the validation set to tune hyperparameters (regularization strength and\n", - "# learning rate). 
You should experiment with different ranges for the learning\n",
-    "# rates and regularization strengths; if you are careful you should be able to\n",
-    "# get a classification accuracy of over 0.35 on the validation set.\n",
+    "# 검증셋을 이용하여 hyperparameters(정규화 강도와 학습률)를 튜닝하세요.\n",
+    "# 다른 범위에 대해 학습률과 정규화 강도를 실험해 보세요.\n",
+    "# 검증셋에 대해 0.35 이상의 분류 정확도를 얻어야 합니다.\n",
    "from cs231n.classifiers import Softmax\n",
    "results = {}\n",
    "best_val = -1\n",
@@ -218,16 +218,16 @@
    "\n",
    "################################################################################\n",
    "# TODO: #\n",
-    "# Use the validation set to set the learning rate and regularization strength. #\n",
-    "# This should be identical to the validation that you did for the SVM; save #\n",
-    "# the best trained softmax classifer in best_softmax. #\n",
+    "# 검증셋을 이용해 학습률과 정규화 강도를 설정합니다. #\n",
+    "# 이것은 SVM에서의 검증과 같아야 합니다; #\n",
+    "# 가장 잘 학습된 softmax 분류기를 best_softmax에 저장하세요. #\n",
    "################################################################################\n",
    "pass\n",
    "################################################################################\n",
-    "# END OF YOUR CODE #\n",
+    "# 코드의 끝 #\n",
    "################################################################################\n",
    " \n",
-    "# Print out results.\n",
+    "# 결과를 출력합니다.\n",
    "for lr, reg in sorted(results):\n",
    " train_accuracy, val_accuracy = results[(lr, reg)]\n",
    " print 'lr %e reg %e train accuracy: %f val accuracy: %f' % (\n",
@@ -244,8 +244,8 @@
   },
   "outputs": [],
   "source": [
-    "# evaluate on test set\n",
-    "# Evaluate the best softmax on test set\n",
+    "# 테스트 셋으로 평가해 봅니다.\n",
+    "# 테스트 셋에서 최고의 softmax를 평가해 봅니다.\n",
    "y_test_pred = best_softmax.predict(X_test)\n",
    "test_accuracy = np.mean(y_test == y_test_pred)\n",
    "print 'softmax on raw pixels final test set accuracy: %f' % (test_accuracy, )"
@@ -259,7 +259,7 @@
   },
   "outputs": [],
   "source": [
-    "# Visualize the learned weights for each class\n",
+    "# 각 클래스에 대한 학습 된 가중치를 시각화\n",
    "w = best_softmax.W[:-1,:] # strip out the bias\n",
    "w = w.reshape(32, 32, 3, 10)\n",
    "\n",
    "w_min, w_max = np.min(w), np.max(w)\n",
    "\n",
    "for i in xrange(10):\n",
    " plt.subplot(2, 5, i + 1)\n",
    " \n",
-    "# Rescale the weights to be between 0 and 255\n",
+    "# 가중치를 0과 255사이로 재조정\n",
    "wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)\n",
    "plt.imshow(wimg.astype('uint8'))\n",
    "plt.axis('off')\n",
From d6a139b0e9f37004a1c159a319a671c4c955158d Mon Sep 17 00:00:00 2001
From: j-min
Date: Tue, 19 Jul 2016 00:02:28 +0900
Subject: [PATCH 190/199] update Lecture10_Ko~35:00

---
 captions/Ko/Lecture10_ko.srt | 7716 +++++++++++++++++-----------------
 1 file changed, 3857 insertions(+), 3859 deletions(-)

diff --git a/captions/Ko/Lecture10_ko.srt b/captions/Ko/Lecture10_ko.srt
index fdf66e82..6881c6bb 100644
--- a/captions/Ko/Lecture10_ko.srt
+++ b/captions/Ko/Lecture10_ko.srt
@@ -1,3859 +1,3857 @@
-1
-00:00:00,000 --> 00:00:04,129
- 마이크 테스트

2
00:00:04,129 --> 00:00:12,109
 - 오늘의 주제는 Recurrent neural networks 입니다.

3
00:00:12,109 --> 00:00:15,199
 - 개인적으로 가장 좋아하는 주제이고

4
00:00:15,199 --> 00:00:18,960
 - 또 여러 형태로 사용하고 있는 NN 모델이기도 하죠. 재밌어요.

5
00:00:18,960 --> 00:00:23,009
 - 강의 진행에 관해서 언급할 게 있는데,

6
00:00:23,009 --> 00:00:26,089
 - 수요일에 중간 고사가 있어요.

7
00:00:26,089 --> 00:00:32,738
 - 다들 중간고사 기대하고 있는거 다 알아요. 사실 별로 기대하는 것 같이 보이지는 않네요.

8
00:00:32,738 --> 00:00:37,979
 - 수요일에 과제가 나갈 거에요.

9
00:00:37,979 --> 00:00:40,429
 - 제출 기한은 2주 뒤 월요일까지입니다. 
- -10 -00:00:40,429 --> 00:00:43,399 - 그런데 저희가 원래 월요일에 이걸 발표하려 했는데 늦어져서 - -11 -00:00:43,399 --> 00:00:47,129 - 아마 제출 기한이 수요일 즈음으로 미뤄질 것 같네요. - -12 -00:00:47,130 --> 00:00:51,179 - 2번째 과제는 금요일까지고, 3-late day를 사용할 수 있어요. 그런데 너무 일찍 사용하지는 마세요. - -13 -00:00:51,179 --> 00:00:55,119 - 2번째 과제는 금요일까지고, 3-late day를 사용할 수 있어요. 그런데 너무 일찍 사용하지는 마세요. - -14 -00:00:55,119 --> 00:01:01,089 - 몇 명이나 끝냈나요? 72명? 거의 다 끝냈네요, 좋아요. - -15 -00:01:01,090 --> 00:01:04,549 - 자 우리는 Convolutional Neural Network (CNN)에 대해서 얘기하고 있었죠. - -16 -00:01:04,549 --> 00:01:07,820 - 지난 수업에서는 CNN에 대한 시각화와 간단한 이해에 대해서 다루었고, - -17 -00:01:07,819 --> 00:01:11,618 - 이런 그림과 비디오들을 살펴보면서 CNN이 어떻게 작동하는지 살펴보았죠. - -18 -00:01:11,618 --> 00:01:14,938 - 이런 그림과 비디오들을 살펴보면서 CNN이 어떻게 작동하는지 살펴보았죠. - -19 -00:01:14,938 --> 00:01:17,828 - 이런 그림과 비디오들을 살펴보면서 CNN이 어떻게 작동하는지 살펴보았죠. - -20 -00:01:17,828 --> 00:01:24,188 - 그리고 맨 마지막 그림에서 본 것처럼 디버깅도 해 보았고요. - -21 -00:01:24,188 --> 00:01:27,408 - 지난 주말에 트위터에서 새로운 시각화 자료를 찾았는데요, - -22 -00:01:27,409 --> 00:01:32,569 - 신기하죠? - -23 -00:01:32,569 --> 00:01:37,118 - 사실 설명이 없어서 정확히 어떤 방법으로 이걸 만든 건지는 잘 모르겠네요. - -24 -00:01:37,118 --> 00:01:43,099 - 그래도 멋있지 않아요? 이건 거북이고, 저건 타란튤라 거미이고, - -25 -00:01:43,099 --> 00:01:47,468 - 이건 체인이고, 저건 개들인데, - -26 -00:01:47,468 --> 00:01:50,509 -제가 보기에 이건 어떤 최적화 기법을 이미지에 적용한 것 같은데, - -27 -00:01:50,509 --> 00:01:53,679 - 뭔가 다른 regularization 방법을 적용한 것 같네요 - -28 -00:01:53,679 --> 00:01:57,049 - 음, 여기에는 bilateral filter (쌍방향 필터) 를 적용한 것 같네요. - -29 -00:01:57,049 --> 00:01:59,420 - 음, 여기에는 bilateral filter (쌍방향 필터) 를 적용한 것 같네요. - -30 -00:01:59,420 --> 00:02:03,659 - 그래도 솔직히 정확히 어떤 기법을 적용한 것인지는 잘 모르겠어요. - -31 -00:02:03,659 --> 00:02:04,549 - 오늘의 주제는 RNN입니다. - -32 -00:02:04,549 --> 00:02:10,360 - 오늘의 주제는 RNN입니다. - -33 -00:02:10,360 --> 00:02:13,520 - RNN의 강점은 네트워크 아키텍쳐를 구성하는 데에 자유도가 놓다는 것입니다. - -34 -00:02:13,520 --> 00:02:15,870 - RNN의 강점은 네트워크 아키텍쳐를 구성하는 데에 자유도가 놓다는 것입니다. - -35 -00:02:15,870 --> 00:02:18,650 - 일반적으로 NN을 왼쪽 그림과 같이 구성할 때는 (역자주: Vanilla NN) - -36 -00:02:18,650 --> 00:02:22,849 - 여기 빨간색으로 표시된 것처럼 고정된 크기의 input vector를 사용하고, - -37 -00:02:22,848 --> 00:02:27,639 - 초록색의 hidden layer들을 통해 작동시키며, 마찬가지로 고정된 크기의 파란색 output vector를 출력합니다. - -38 -00:02:27,639 --> 00:02:30,738 - 마찬가지로 고정된 크기의 이미지를 입력으로 받고, - -39 -00:02:30,739 --> 00:02:34,469 - 고정된 크기의 이미지를 벡터 형태로 출력합니다. - -40 -00:02:34,469 --> 00:02:38,239 - RNN에서는 이러한 작업을 계속 반복할 수 있습니다. input, output 모두에서 가능하죠. - -41 -00:02:38,239 --> 00:02:41,319 - 오늘 다룰 image captioning(이미지에 상응하는 자막/주석 생성) 을 예로 들면, - -42 -00:02:41,318 --> 00:02:44,689 - 고정된 크기의 이미지를 RNN에 입력하게 됩니다. - -43 -00:02:44,689 --> 00:02:47,829 - 그리고 그 RNN은 해당 이미지를 설명하는 단어/문장 들을 출력하게 되죠. - -44 -00:02:47,829 --> 00:02:52,560 - 그리고 그 RNN은 해당 이미지를 설명하는 단어/문장 들을 출력하게 되죠. - -45 -00:02:52,560 --> 00:02:55,969 - Sentiment classification(감정 분류)를 예로 들면, - -46 -00:02:55,969 --> 00:02:59,759 - (어떤 문장의) 단어들과 그 순서을 입력으로 받아서, - -47 -00:02:59,759 --> 00:03:03,828 - 그 문장의 느낌이 긍정적인지 또는 부정적인지를 출력하게 됩니다. - -48 -00:03:03,829 --> 00:03:07,590 - 또 다른 예로 machine translation (역자주: 구글 번역과 같은 알고리즘 번역) 에서는, - -49 -00:03:07,590 --> 00:03:12,069 - 어떤 영어 문장을 입력으로 받고, 프랑스어로 출력해야 합니다. - -50 -00:03:12,068 --> 00:03:17,119 - 그래서 우리는 이 영어 문장을 RNN에 입력하고 (이것을 Sequence to Sequence 라 부름) - -51 -00:03:17,120 --> 00:03:20,280 - 그래서 우리는 이 영어 문장을 RNN에 입력하고 (이것을 Sequence to Sequence 라 부름) - -52 -00:03:20,280 --> 00:03:25,169 - RNN은 이 영어 문장을 프랑스어 문장으로 번역합니다. 
- -53 -00:03:25,169 --> 00:03:28,000 - 마지막 예 video classification(영상 분류) 에서는, - -54 -00:03:28,000 --> 00:03:31,699 - 각 프레임 (순간 캡쳐 화면) 이 어떤 속성을 지니는지, - -55 -00:03:31,699 --> 00:03:35,429 - 그리고 그 전의 모든 프레임과의 관계는 어떻게 되는지도 고려합니다. - -56 -00:03:35,430 --> 00:03:38,739 - 그리고 그 전의 모든 프레임과의 관계는 어떻게 되는지도 고려합니다. - -57 -00:03:38,739 --> 00:03:41,909 - 그러니까 RNN은 각각의 프레임이 어떤 속성을 지니는지 분류하고, - -58 -00:03:41,909 --> 00:03:44,680 - 이전까지의 모든 프레임을 입력으로 받는 함수가 되어, - -59 -00:03:44,680 --> 00:03:48,760 - 앞으로의 프레임을 예측하는 아키텍쳐를 제공합니다. - -60 -00:03:48,759 --> 00:03:52,388 - 만약 맨 왼쪽 그림과 같이 입력과 출력의 순서에 관한 정보를 가지고 있지 않아도 RNN을 사용할 수 있습니다. - -61 -00:03:52,389 --> 00:03:55,250 - 만약 맨 왼쪽 그림과 같이 입력과 출력의 순서에 관한 정보를 가지고 있지 않아도 RNN을 사용할 수 있습니다. - -62 -00:03:55,250 --> 00:04:01,560 - 예를 들어, 제가 좋아하는 딥마인드의 한 논문에서는 - -63 -00:04:01,560 --> 00:04:05,189 - 번지로 된 집 주소 이미지를 문자로 변환했습니다. - -64 -00:04:05,189 --> 00:04:09,750 - 여기서는 단순히 CNN을 사용해서 이미지 자체가 몇 번지를 나타내는지를 분류하지 않고, - -65 -00:04:09,750 --> 00:04:13,530 - 여기서는 단순히 CNN을 사용해서 이미지 자체가 몇 번지를 나타내는지를 분류하지 않고, - -66 -00:04:13,530 --> 00:04:16,649 - RNN을 사용해서 작은 CNN이 이미지를 돌아다니면서 읽어들였습니다. - -67 -00:04:16,649 --> 00:04:19,779 - RNN을 사용해서 작은 CNN이 이미지를 돌아다니면서 읽어들였습니다. - -68 -00:04:19,779 --> 00:04:23,969 - 이렇게 RNN은 번지 주소 이미지를 왼쪽으로 오른쪽으로 순차적으로 읽는 방법을 학습했습니다. - -69 -00:04:23,970 --> 00:04:26,870 - 이렇게 RNN은 번지 주소 이미지를 왼쪽으로 오른쪽으로 순차적으로 읽는 방법을 학습했습니다. - -70 -00:04:26,870 --> 00:04:32,019 - 반대로 생각할 수도 있습니다. 이것은 DRAW라는 유명한 논문인데요, - -71 -00:04:32,019 --> 00:04:35,879 - 여기서는 이미지 샘플 하나하나가 무엇인지 개별적으로 판단하지 않고, - -72 -00:04:35,879 --> 00:04:39,490 - 여기서는 이미지 샘플 하나하나가 무엇인지 개별적으로 판단하지 않고, - -73 -00:04:39,490 --> 00:04:42,860 - RNN이 여러 이미지를 하나의 큰 캔버스의 형태로 한번에 출력합니다. - -74 -00:04:42,860 --> 00:04:47,540 - RNN이 여러 이미지를 하나의 큰 캔버스의 형태로 한번에 출력합니다. - -75 -00:04:47,540 --> 00:04:50,200 - 이 방법은 한 번지수 이미지에 대한 입력 결과를 곧바로 출력하지 않고, 보다 많은 계산을 거친다는 점에서 강력합니다. - -76 -00:04:50,199 --> 00:04:53,479 - 이 방법은 한 번지수 이미지에 대한 입력 결과를 곧바로 출력하지 않고, 보다 많은 계산을 거친다는 점에서 강력합니다. 질문 있나요? - -77 -00:04:53,480 --> 00:05:14,189 - (질문) 그림에서 화살표는 무엇인가요? - -78 -00:05:14,189 --> 00:05:19,310 - 화살표는 functional dependence를 나타냅니다. 조금 있다가 좀 더 자세하게 살펴 볼 거에요. - -79 -00:05:19,310 --> 00:05:23,139 - 화살표는 functional dependence를 나타냅니다. 조금 있다가 좀 더 자세하게 살펴 볼 거에요. - -80 -00:05:23,139 --> 00:05:37,168 - (질문) 그림에서 나타나는 숫자들은 무엇인가요? - -81 -00:05:37,168 --> 00:05:41,219 - 이것들은 실제 사진이 아니라 RNN이 학습 후 출력한 결과물입니다. - -82 -00:05:41,220 --> 00:05:44,830 - 이것들은 실제 사진이 아니라 RNN이 학습 후 출력한 결과물입니다. - -83 -00:05:44,829 --> 00:05:48,219 - (질문) 그러니까 실제 사진이 아니라 만들어진 거라는 거죠? - -84 -00:05:48,220 --> 00:05:51,689 - 네, 꽤 실제 사진처럼 포이기는 하지만, 이것들은 만들어진 이미지입니다. - -85 -00:05:51,689 --> 00:05:55,809 - RNN은 이런 초록색 박스처럼 생겼습니다. - -86 -00:05:55,809 --> 00:06:00,979 - RNN은 계속해서 input vector를 입력받습니다. - -87 -00:06:00,978 --> 00:06:04,859 - RNN은 계속해서 input vector를 입력받습니다. - -88 -00:06:04,860 --> 00:06:08,538 - RNN 내부에는 여러 state가 있는데, 이는 매 시간에 입력받는 input vector의 형태로 나타낼 수 있습니다. - -89 -00:06:08,538 --> 00:06:12,988 - RNN 내부에는 여러 state가 있는데, 이는 매 시간에 입력받는 input vector의 형태로 나타낼 수 있습니다. - -90 -00:06:12,988 --> 00:06:17,258 - RNN에는 또한 weight(가중치)를 설정할 수 있고, 이를 조정함으로써 RNN의 작동을 조절할 수 있습니다. - -91 -00:06:17,259 --> 00:06:20,829 - RNN에는 또한 weight(가중치)를 설정할 수 있고, 이를 조정함으로써 RNN의 작동을 조절할 수 있습니다. - -92 -00:06:20,829 --> 00:06:25,769 - 우리는 물론 RNN의 출력 결과물에도 관심을 갖고 있지만, - -93 -00:06:25,769 --> 00:06:30,429 - 우리는 물론 RNN의 출력 결과물에도 관심을 갖고 있지만, - -94 -00:06:30,428 --> 00:06:33,988 - RNN은 이 중간에 있는, 시간에 따라 이미지를 입력받고 출력하는 단계인 이 초록색 박스라는 것을 알아두셨으면 합니다. 
- -95 -00:06:33,988 --> 00:06:36,688 - RNN은 이 중간에 있는, 시간에 따라 이미지를 입력받고 출력하는 단계인 이 초록색 박스라는 것을 알아두셨으면 합니다. - -96 -00:06:36,689 --> 00:06:39,489 - RNN은 이 중간에 있는, 시간에 따라 이미지를 입력받고 출력하는 단계인 이 초록색 박스라는 것을 알아두셨으면 합니다. - -97 -00:06:39,488 --> 00:06:44,838 - RNN은 이 중간에 있는, 시간에 따라 이미지를 입력받고 출력하는 단계인 이 초록색 박스라는 것을 알아두셨으면 합니다. - -98 -00:06:44,838 --> 00:06:50,610 - RNN의 각 state는 vector들의 집합으로 나타낼 수 있고, 여기서는 h로 표기하겠습니다. - -99 -00:06:50,610 --> 00:06:55,399 - RNN의 각 state는 vector들의 집합으로 나타낼 수 있고, 여기서는 h로 표기하겠습니다. - -100 -00:06:55,399 --> 00:07:00,939 - 각각의 state(h_t) 는 바로 전 단계의 state(h_t-1)과 input vector(x_t)들의 함수로 나타낼 수 있습니다. - -101 -00:07:00,939 --> 00:07:05,769 - 각각의 state(h_t) 는 바로 전 단계의 state(h_t-1)과 input vector(x_t)들의 함수로 나타낼 수 있습니다. - -102 -00:07:05,769 --> 00:07:08,338 - 여기서의 함수는 Recurrence funtion 이라고 하고 파라미터 W(가중치)를 갖습니다. - -103 -00:07:08,338 --> 00:07:13,728 - 우리는 W 값을 변경함에 따라 RNN이 다른 결과를 보이는 걸 확인할 수 있습니다. - -104 -00:07:13,728 --> 00:07:16,228 - 우리는 W 값을 변경함에 따라 RNN이 다른 결과를 보이는 걸 확인할 수 있습니다. - -105 -00:07:16,228 --> 00:07:19,338 - 따라서 우리는 우리가 원하는 결과를 만들어낼 수 있는 적절한 W를 찾기 위해 training을 거칠 것이죠. - -106 -00:07:19,338 --> 00:07:23,639 - 따라서 우리는 우리가 원하는 결과를 만들어낼 수 있는 적절한 W를 찾기 위해 training을 거칠 것이죠. - -107 -00:07:23,639 --> 00:07:28,209 - 여기서 기억해야 할 것은 매 단계마다 같은 함수와 같은 W를 사용한다는 것입니다. - -108 -00:07:28,209 --> 00:07:31,778 - 여기서 기억해야 할 것은 매 단계마다 같은 함수와 같은 W를 사용한다는 것입니다. - -109 -00:07:31,778 --> 00:07:35,928 - 그래서 입력이나 출력 vector의 길이를 고려할 필요가 없습니다. - -110 -00:07:35,928 --> 00:07:38,778 - 그래서 입력이나 출력 vector의 길이를 고려할 필요가 없습니다. - -111 -00:07:38,778 --> 00:07:43,528 - 그래서 입력이나 출력 vector의 길이를 고려할 필요가 없습니다. - -112 -00:07:43,528 --> 00:07:46,769 - RNN을 구현하는 가장 간단한 방법은 Vanilla RNN 입니다. - -113 -00:07:46,769 --> 00:07:50,309 - RNN을 구현하는 가장 간단한 방법은 Vanilla RNN 입니다. - -114 -00:07:50,309 --> 00:07:54,569 - 여기서 RNN을 구성하는 것은 단 하나의 hidden state h 입니다. - -115 -00:07:54,569 --> 00:08:00,569 - 여기서 RNN을 구성하는 것은 단 하나의 hidden state h 입니다. - -116 -00:08:00,569 --> 00:08:04,039 - 그리고 여기 Recurrence(재귀) 식은 각 hidden state를 시간과 현재 input (x_t)로 어떻게 나타낼 수 있는지 알려줍니다. - -117 -00:08:04,038 --> 00:08:04,688 - 그리고 여기 Recurrence(재귀) 식은 각 hidden state를 시간과 현재 input (x_t)로 어떻게 나타낼 수 있는지 알려줍니다. - -118 -00:08:04,689 --> 00:08:08,369 - 그리고 여기 Recurrence 식은 각 hidden state를 시간과 현재 input (x_t)로 어떻게 나타낼 수 있는지 알려줍니다. - -119 -00:08:08,369 --> 00:08:10,349 - 가중치 행렬 W_hh와 W_xh에 직전 단계의 hidden state h 와 input vector x가 각각 곱해지고, - -120 -00:08:10,348 --> 00:08:15,238 - 가중치 행렬 W_hh와 W_xh에 직전 단계의 hidden state h_t-1 와 input vector x가 각각 곱해지고, - -121 -00:08:15,238 --> 00:08:18,238 - 이것이 tanh 함수에 의해 새로운 hidden state h_t로 결정되는 방식으로 업데이트 됩니다. - -122 -00:08:18,238 --> 00:08:21,978 - 이것이 tanh 함수에 의해 새로운 hidden state h_t로 결정되는 방식으로 업데이트 됩니다. - -123 -00:08:21,978 --> 00:08:26,199 - 이러한 재귀 식은 h가 시간과 현재 입력에 따라 업데이트되는 함수라는 것을 보여줍니다. - -124 -00:08:26,199 --> 00:08:29,769 -이러한 재귀 식은 h가 시간과 현재 입력에 따라 업데이트되는 함수라는 것을 보여줍니다. - -125 -00:08:29,769 --> 00:08:34,129 - h 바로 다음에 결과물이 행렬의 형태로 출력되는 형태가 가장 간단한 형태의 RNN입니다. - -126 -00:08:34,129 --> 00:08:37,528 - h 바로 다음에 결과물이 행렬의 형태로 출력되는 형태가 가장 간단한 형태의 RNN입니다. - -127 -00:08:37,528 --> 00:08:42,288 - 이게 어떻게 작동되는지 간단히 설명드리기 위해 예를 들자면, - -128 -00:08:42,288 --> 00:08:46,639 - 이게 어떻게 작동되는지 간단히 설명드리기 위해 예를 들자면, - -129 -00:08:46,639 --> 00:08:49,299 - 이런 추상적인 x, h, y 등에 의미를 부여할 수 있습니다. - -130 -00:08:49,299 --> 00:08:53,059 - 이런 추상적인 x, h, y 등에 의미를 부여할 수 있습니다. - -131 -00:08:53,059 --> 00:08:56,149 - 예를 들어 이러한 문자 수준 언어 모델에 RNN을 적용하는 것 말이죠. - -132 -00:08:56,149 --> 00:08:59,899 - 저는 이 예시를 참 좋아합니다. 직관적이고 재밌거든요. 
- -133 -00:08:59,899 --> 00:09:04,698 - 그래서 RNN 기반 문자 수준 언어 모델에서는, RNN에 문자열의 순서를 주고, - -134 -00:09:04,698 --> 00:09:07,859 - 그래서 RNN 기반 문자 수준 언어 모델에서는, RNN에 문자열의 순서를 주고, - -135 -00:09:07,860 --> 00:09:10,899 - 그래서 RNN 기반 문자 수준 언어 모델에서는, RNN에 문자열의 순서를 주고, - -136 -00:09:10,899 --> 00:09:14,299 - 지금까지의 관찰 결과를 바탕으로 각각의 단계에서 다음에 올 문자는 무엇인지 예측하게 합니다. - -137 -00:09:14,299 --> 00:09:16,909 - 지금까지의 관찰 결과를 바탕으로 각각의 단계에서 다음에 올 문자는 무엇인지 예측하게 합니다. - -138 -00:09:16,909 --> 00:09:21,120 - 간단한 예를 한번 보죠. - -139 -00:09:21,120 --> 00:09:25,610 - 여기서 training 문자열 'hello'를 주면, - -140 -00:09:25,610 --> 00:09:29,870 - 우리의 현재 어휘 목록에는 'h, e , l, o' 이렇게 4글자가 있겠죠 - -141 -00:09:29,870 --> 00:09:33,289 - 그러니까 RNN은 우리의 training 문자열 데이터를 바탕으로 다음에 올 글자가 무엇인지 예측하게 됩니다. - -142 -00:09:33,289 --> 00:09:37,000 - 구체적으로, h, e, l, o를 각각 순서대로 하나씩 RNN에 입력해 줍니다. - -143 -00:09:37,000 --> 00:09:40,509 - 여기서 가로축은 시간입니다. (역자주: 오른쪽으로 갈수록 뒤) - -144 -00:09:40,509 --> 00:09:47,110 - h는 첫번째, e는 두번째, 그다음 l, 그다음 l - -145 -00:09:47,110 --> 00:09:50,629 - 여기서는 'one-hot' 표기법을 사용하고 있습니다. (역자주: 0과 1로만 나타내는 것) - -146 -00:09:50,629 --> 00:09:53,889 - 여기서는 'one-hot' 표기법을 사용하고 있습니다. (역자주: 0과 1로만 나타내는 것) - -147 -00:09:53,889 --> 00:09:58,129 - 그리고 아까 본 재귀 식을 사용합니다. - -148 -00:09:58,129 --> 00:10:01,860 - 처음에 h에는 0만 들어가 있습니다. - -149 -00:10:01,860 --> 00:10:04,720 - 그래서 매 시간 단계마다 이 재귀 식을 이용해서 hidden state 벡터를 계산합니다. - -150 -00:10:04,720 --> 00:10:08,790 - hidden state에 3개의 (안들림) 가 있습니다. - -151 -00:10:08,789 --> 00:10:11,099 - 각 시점에서 이전까지 입력받은 모든 문자들을 요약해서 표현합니다. - -152 -00:10:11,100 --> 00:10:13,040 - 각 시점에서 이전까지 입력받은 모든 문자들을 요약해서 표현합니다. - -153 -00:10:13,039 --> 00:10:15,759 - 각 시점에서 이전까지 입력받은 모든 문자들을 요약해서 표현합니다. - -154 -00:10:15,759 --> 00:10:20,159 - 이런 방법으로 매 시간 단계마다 바로 다음 순서 에 올 문자를 예측할 것입니다. - -155 -00:10:20,159 --> 00:10:23,139 - 이런 방법으로 매 시간 단계마다 바로 다음 순서에 올 문자를 예측할 것입니다. - - -156 -00:10:23,139 --> 00:10:27,569 - 우리는 이 4 개의 문자(역자주: h, e, l, o)를 가지고 있고, 매 시간 단계마다 이 4개의 문자 중 어떤 문자가 오는지 예측할 것입니다. - -157 -00:10:27,570 --> 00:10:32,100 - 우리는 이 4 개의 문자(역자주: h, e, l, o)를 가지고 있고, 매 시간 단계마다 이 4개의 문자 중 어떤 문자가 오는지 예측할 것입니다. - -158 -00:10:32,100 --> 00:10:37,139 - 제일 처음에는 H를 입력할 것입니다. - -159 -00:10:37,139 --> 00:10:40,799 - RNN은 현재의 weight를 바탕으로 다음에 어떤 문자가 올 지 예측합니다. - -160 -00:10:40,799 --> 00:10:42,959 - RNN은 현재의 weight를 바탕으로 다음에 어떤 문자가 올 지 예측합니다. - -161 -00:10:42,960 --> 00:10:47,950 - 현재 normalized 되지 않은 수치로는, (역자주: 맨 위 왼쪽 사각형 안의 숫자) h는 1.0, e는 2.2, - -162 -00:10:47,950 --> 00:10:52,640 - l은 -3.0 , o는 4.1라는 숫자의 정도로 나타날 것입니다. - -163 -00:10:52,639 --> 00:10:56,409 - 물론 우리는 이 training sequence에서 h 다음에 e가 온다는 것을 알고 있습니다. - -164 -00:10:56,409 --> 00:11:00,669 - 그러니까 여기 초록색으로 적혀 있는 e의 2.2라는 숫자가 정답이 되는 것이죠. - -165 -00:11:00,669 --> 00:11:04,559 - 그래서 이 숫자는 커야 하고, 다른 숫자들은 작아져야 합니다. - -166 -00:11:04,559 --> 00:11:07,799 - 이처럼 매 시간 단계마다 우리는 다음에 올 타겟 문자를 갖고 있습니다. - -167 -00:11:07,799 --> 00:11:12,209 - 타겟에 해당하는 숫자는 커야 하고, 나머지 숫자는 작아야 합니다. - -168 -00:11:12,210 --> 00:11:15,470 - 타겟에 해당하는 숫자는 커야 하고, 나머지 숫자는 작아야 합니다. - -169 -00:11:15,470 --> 00:11:19,950 - 그래서 이러한 정보는 loss function(손실 함수)의 gradient signal에 포함됩니다. - -170 -00:11:19,950 --> 00:11:23,220 - 그리고 그러한 loss 들은 이 연결들은 통해 back-propagation 됩니다. - -171 -00:11:23,220 --> 00:11:26,600 - 매 시간 단계에 softmax classifier을 갖고 있다고 합시다. 
- -172 -00:11:26,600 --> 00:11:31,300 - 그래서 매 시간 단계마다 softmax classifier가 다음에 어떤 문자가 와야 할 지를 예측하고, - -173 -00:11:31,299 --> 00:11:34,269 - 그리고 모든 loss들은 맨 위(역자주: output layer)부터 거꾸로 그래프를 내려오면서 계산되어서 - -174 -00:11:34,269 --> 00:11:37,879 - 그리고 모든 loss들은 맨 위(역자주: output layer)부터 거꾸로 그래프를 내려오면서 계산되어서 - -175 -00:11:37,879 --> 00:11:41,179 - 그리고 모든 loss들은 맨 위(역자주: output layer)부터 거꾸로 그래프를 내려오면서 계산되어서 - -176 -00:11:41,179 --> 00:11:44,479 - weight 행렬에 gradient를 주어 적절한 값으로 변화시켜 RNN이 문자를 보다 정확하게 예측하게 합니다. - -177 -00:11:44,480 --> 00:11:50,039 - weight 행렬에 gradient를 주어 적절한 값으로 교정시켜 RNN이 문자를 보다 정확하게 예측하게 합니다. - -178 -00:11:50,039 --> 00:11:53,599 - 그러니까 여러분이 RNN에 문자를 입력하면 RNN은 보다 정확한 행동(역자주: 여기서는 문자 예측)을 하는 것이죠. - -179 -00:11:53,600 --> 00:11:57,750 - 이제 어떻게 데이터를 학습시키는지에 대해 상상이 좀 갈 거에요. - -180 -00:11:57,750 --> 00:12:02,879 - 여기 그림에 대해 질문이 있나요? - -181 -00:12:02,879 --> 00:12:08,750 - (질문): W_xh와 W_hy는 항상 일정한 값을 가지나요? - -182 -00:12:08,750 --> 00:12:13,320 - (답변): W(weight) 들은 매 recurrence 단계 마다 항상 일정한 값을 가집니다. - -183 -00:12:13,320 --> 00:12:17,010 - (답변): W(weight) 들은 매 recurrence 단계 마다 항상 일정한 값을 가집니다. - -184 -00:12:17,009 --> 00:12:23,830 - 여기서 우리는 W_xh, W_hh, W_yh를 각각 4번씩 사용했습니다. - -185 -00:12:23,830 --> 00:12:27,720 - 여러분이 backpropagation을 할 때, 동일한 weight 행렬에 이러한 gradient 들을 계속 더한다는 것을 명심해야 합니다. - -186 -00:12:27,720 --> 00:12:30,750 - 여러분이 backpropagation을 할 때, 동일한 weight 행렬에 이러한 gradient 들을 계속 더한다는 것을 명심해야 합니다. - -187 -00:12:30,750 --> 00:12:35,879 - 그리고 이것은 우리가 길이가 다양한 입력값들을 사용할 수 있게 해 줍니다. - -188 -00:12:35,879 --> 00:12:38,960 - 그리고 이것은 우리가 길이가 다양한 입력값들을 사용할 수 있게 해 줍니다. - -189 -00:12:38,960 --> 00:12:48,540 - 그러니까 정해진 길이의 입력값들을 사용하지 않아도 된다는 것이죠. - -190 -00:12:48,539 --> 00:12:52,579 - (질문): 처음 h_0를 어떻게 초기화하나요? - -191 -00:12:52,580 --> 00:13:00,650 - (답변): 0으로 놓는 것이 가장 일반적입니다. - -192 -00:13:00,649 --> 00:13:01,289 - (질문): 입력값의 순서는 영향을 미치나요? - -193 -00:13:01,289 --> 00:13:11,299 - (질문): 입력값의 순서는 영향을 미치나요? -194 -00:13:11,299 --> 00:13:14,359 - (답변): 여기서는 중요하지 않습니다. hidden state는 지금까지 들어온 모든 값을 반영하거든요. - -195 -00:13:14,360 --> 00:13:17,870 - (답변): 여기서는 중요하지 않습니다. hidden state는 지금까지 들어온 모든 값을 반영하거든요. - -196 -00:13:17,870 --> 00:13:21,299 - (답변): 여기서는 중요하지 않습니다. hidden state는 지금까지 들어온 모든 값을 반영하거든요. - -197 -00:13:21,299 --> 00:13:26,859 - (답변): 여기서는 중요하지 않습니다. hidden state는 지금까지 들어온 모든 값을 반영하거든요. - -198 -00:13:26,860 --> 00:13:31,590 - 보다 구체적인 예들로 확실히 설명드리겠습니다. - -199 -00:13:31,590 --> 00:13:36,149 - 문자 단위의 언어 모델 코드는 매우 간단합니다. - -200 -00:13:36,149 --> 00:13:38,980 - 여러분들이 나중에 찾아볼 수 있게 GitHub에 올려 놓았어요. - -201 -00:13:38,980 --> 00:13:43,350 - 이것은 NumPy 기반의 100줄 길이의 문자 단위 RNN 코드입니다. - -202 -00:13:43,350 --> 00:13:47,220 - 이것은 NumPy 기반의 100줄 길이의 문자 단위 RNN 코드입니다. - -203 -00:13:47,220 --> 00:13:49,840 - 실제로 RNN이 어떻게 학습하는지를 알기 위해서 이 코드를 단계별로 살펴볼게요. - -204 -00:13:49,840 --> 00:13:53,220 - 실제로 RNN이 어떻게 학습하는지를 알기 위해서 이 코드를 단계별로 살펴볼게요. - -205 -00:13:53,220 --> 00:13:58,250 - 코드를 블록들로 나누어 하나하나 살펴보겠습니다. - -206 -00:13:58,250 --> 00:14:02,389 - 처음에는 보다시피 NumPy만 사용합니다. - -207 -00:14:02,389 --> 00:14:05,569 - 여기에 우리가 입력받을 것은 문자들의 대용량 순서 .txt 데이터입니다. - -208 -00:14:05,570 --> 00:14:10,090 - 여기에 우리가 입력받을 것은 문자들의 대용량 순서 .txt 데이터입니다. - -209 -00:14:10,090 --> 00:14:14,810 - 이 파일의 모든 문자를 읽어들이고, mapping dictionary를 생성합니다. - -210 -00:14:14,809 --> 00:14:18,179 - mapping dictionary는 문자에 index를 대응시키고, 또 반대로 index에 문자를 대응시킵니다. - -211 -00:14:18,179 --> 00:14:23,120 - 그러니까 문자를 순서대로 배열하는 것입니다. - -212 -00:14:23,120 --> 00:14:27,350 - 여기 보면 아주 긴 문자열이 들어 있는 큰 데이터를 읽어들이네요. 
- -213 -00:14:27,350 --> 00:14:30,860 - 우리는 이 데이터를 배열해서 각 문자에 index를 지정할 것입니다. - -214 -00:14:30,860 --> 00:14:36,300 - 그리고 여기에 보다시피 initialization(초깃값 설정)을 하게 됩니다. - -215 -00:14:36,299 --> 00:14:39,899 - hidden size(hidden state의 크기)는 hyperparameter(바뀌지 않는 값) 입니다. 여기서는 100으로 설정했습니다. - -216 -00:14:39,899 --> 00:14:43,100 - hidden size(hidden state의 크기)는 hyperparameter(바뀌지 않는 값) 입니다. 여기서는 100으로 설정했습니다. - -217 -00:14:43,100 --> 00:14:46,720 - 여기 있는 건 learning rate 이고요. - -218 -00:14:46,720 --> 00:14:51,019 - 25가 지정되어 있는 seq_length는 여러분이 RNN을 공부하다 보면 나오는 parameter 입니다. - -219 -00:14:51,019 --> 00:14:53,899 - 많은 경우 우리의 입력 데이터는 너무 커서 RNN에 한꺼번에 넣을 수가 없습니다. - -220 -00:14:53,899 --> 00:14:56,870 - 이것은 우리가 backpropagation을 하는 동안 메모리에 데이터를 저장해 두어야 하는데 여기에 한계가 있기 때문이죠 - -221 -00:14:56,870 --> 00:15:00,070 - 이것은 우리가 backpropagation을 하는 동안 메모리에 데이터를 저장해 두어야 하는데 여기에 한계가 있기 때문이죠 - -222 -00:15:00,070 --> 00:15:03,540 - 이것은 우리가 backpropagation을 하는 동안 메모리에 데이터를 저장해 두어야 하는데 여기에 한계가 있기 때문이죠 - -223 -00:15:03,539 --> 00:15:07,139 - 그래서 우리는 입력 데이터를 몇 개의 데이터로 쪼개고, 여기서는 길이가 25인 데이터들로 쪼갰습니다. - -224 -00:15:07,139 --> 00:15:09,230 - 그래서 우리는 입력 데이터를 몇 개의 데이터로 쪼개고, 여기서는 길이가 25인 데이터들로 쪼갰습니다. - -225 -00:15:09,230 --> 00:15:14,769 - 그러니까 한 번에 처리할 문자의 개수가 25개인 것입니다. - -226 -00:15:14,769 --> 00:15:19,509 - 다시 설명하면, 한 번에 backpropagation 하는 문자의 개수가 25인 것이고, - -227 -00:15:19,509 --> 00:15:22,149 - 한 번에 모든 데이터를 기억해서 backpropagation 할 수 없기 때문에, 하나의 크기가 25개인 덩어리 데이터들로 나누어서 처리합니다. - -228 -00:15:22,149 --> 00:15:26,899 - 한 번에 모든 데이터를 기억해서 backpropagation 할 수 없기 때문에, 하나의 크기가 25개인 덩어리 데이터들로 나누어서 처리합니다. - -229 -00:15:26,899 --> 00:15:30,789 - 여기 보이는 행렬들은 random 함수를 이용해서 초기값이 무작위적으로 입력됩니다. - -230 -00:15:30,789 --> 00:15:34,709 - Wxh, Whh, Wxy은 모두 우리가 backpropagation을 통해 학습시킬 대상들입니다. - -231 -00:15:34,710 --> 00:15:36,790 - Wxh, Whh, Wxy은 모두 우리가 backpropagation을 통해 학습시킬 대상들입니다. - -232 -00:15:36,789 --> 00:15:40,699 - loss function은 넘어가고 맨 밑 부분을 살펴보겠습니다. - -233 -00:15:40,700 --> 00:15:44,020 - 이 부분은 Main loop입니다. 이 중에서 몇 부분을 한번 살펴보죠. - -234 -00:15:44,019 --> 00:15:48,399 - 이 부분에서 어떤 변수들에 0을 대입하는 초기화가 진행됩니다. - -235 -00:15:48,399 --> 00:15:50,829 - 그리고 계속해서 loop을 돌리게 되죠. - -236 -00:15:50,830 --> 00:15:54,960 - 우리가 지금 보고 있는 것은 전체 데이터의 한 batch 입니다. - -237 -00:15:54,960 --> 00:15:58,970 - 전체 데이터 세트에서 크기 25의 문자 batch를 가지를 list input으로 넣어줍니다. - -238 -00:15:58,970 --> 00:16:03,019 - 그리고 그 list input은 각 문자에 대응되는 25개의 숫자를 갖고 있습니다. - -239 -00:16:03,019 --> 00:16:06,919 - 타겟들은 여기 index에 1을 더한 값이 되는데요, - -240 -00:16:06,919 --> 00:16:09,909 - 이것은 타겟들이 현재 순서가 아니라 바로 다음 순서에 나올 문자들이기 때문에 그렇습니다. - -241 -00:16:09,909 --> 00:16:15,269 - 그러니까 list input에는 25개의 문자에 대응되는 25개의 숫자가 있고, 타겟 문자는 그 숫자들에서 1을 더한 index에 대응되는 문자들입니다. - -242 -00:16:15,269 --> 00:16:20,689 - 그러니까 list input에는 25개의 문자에 대응되는 25개의 숫자가 있고, 타겟 문자는 그 숫자들에서 1을 더한 index에 대응되는 문자들입니다. - -243 -00:16:20,690 --> 00:16:26,480 - 이것은 sampling 코드입니다. - -244 -00:16:26,480 --> 00:16:30,659 - 매 시간 단계에서 RNN을 학습시키면서, 현재 RNN이 어떻게 사고하고 있는지에 알아보기 위한 sample을 출력합니다. - -245 -00:16:30,659 --> 00:16:35,370 - 매 시간 단계에서 RNN을 학습시키면서, 현재 RNN이 어떻게 사고하고 있는지에 알아보기 위한 sample을 출력합니다. - -246 -00:16:35,370 --> 00:16:40,320 - 우리가 문자 단위의 RNN을 사용할 때에는 - -247 -00:16:40,320 --> 00:16:43,570 - RNN이 매 시간 단계마다 바로 다음에 올 문자들의 순서를 출력합니다. 
- -248 -00:16:43,570 --> 00:16:46,379 - 그러니까 sampling 후 그것을 다시 입력값으로 주고, 다음 sample을 또다시 입력값으로 주는 방식으로 모든 sample을 입력한 다음, - -249 -00:16:46,379 --> 00:16:49,259 - 그러니까 sampling 후 그것을 다시 입력값으로 주고, 다음 sample을 또다시 입력값으로 주는 방식으로 모든 sample을 입력한 다음, - -250 -00:16:49,259 --> 00:16:52,769 - 그러니까 sampling 후 그것을 다시 입력값으로 주고, 다음 sample을 또다시 입력값으로 주는 방식으로 모든 sample을 입력한 다음, - -251 -00:16:52,769 --> 00:16:56,549 - RNN에게 추상적인 문자열을 만들라고 지시할 수 있게 됩니다. - -252 -00:16:56,549 --> 00:17:00,549 - 이게 이 코드의 기능이고, 이것은 조금 있다 살펴볼 sample function을 사용합니다. - -253 -00:17:00,549 --> 00:17:04,250 - 여기서는 loss function을 불러옵니다. - -254 -00:17:04,250 --> 00:17:09,160 - loss function은 입력값, 타겟 문자, hprev 을 입력받습니다. - -255 -00:17:09,160 --> 00:17:13,900 - hprev는 h from previous chunk 을 뜻합니다. - -256 -00:17:13,900 --> 00:17:18,179 - 우리가 크기가 25인 batch들을 사용하는데, - -257 -00:17:18,179 --> 00:17:22,400 - hidden state에서는 바로 전 batch의 마지막 문자가 무엇인지에 대한 정보가 필요하고, 이 마지막 문자를 다음 batch의 첫 h 에 입력하게 됩니다. - -258 -00:17:22,400 --> 00:17:26,140 - 그러니까 h가 batch 에서 그 다음 batch 로 제대로 넘어가기 위해서 h prev을 사용하는 것입니다. - -259 -00:17:26,140 --> 00:17:30,700 - 그리고 그 h prev는 backpropagation 할 때만 사용됩니다. - -260 -00:17:30,700 --> 00:17:35,558 - 그 h prev을 loss fuction에 입력하면, loss. gradient, weight 행렬, 그리고 bias를 출력합니다. - -261 -00:17:35,558 --> 00:17:39,319 - 그 h prev을 loss fuction에 입력하면, loss. gradient, weight 행렬, 그리고 bias를 출력합니다. - -262 -00:17:39,319 --> 00:17:44,149 - 여기에서 loss를 print 하고, 여기에선 parameter들을 loss function이 하라는 대로 업데이트합니다. - -263 -00:17:44,150 --> 00:17:47,429 - 실제로 업데이트가 되는 것은 여기 adagrad update 라고 적혀 있는 부분이네요. - -264 -00:17:47,429 --> 00:17:53,100 - 여기 gradient 계산을 위한 변수들을 제곱한 값들을 계속 더해 줍니다. - -265 -00:17:53,099 --> 00:17:56,819 - 그리고 이것들로 adagrad를 업데이트 하죠. - -266 -00:17:56,819 --> 00:18:00,639 - 이제 loss funcion을 살펴보겠습니다. - -267 -00:18:00,640 --> 00:18:05,790 - 이 블록이 loss fuction이고, foward와 backward 방법들로 이루어져 있습니다. - -268 -00:18:05,789 --> 00:18:08,990 - 처음에는 forward pass, 나중에는 초록색으로 적혀 있는 backward pass를 수행합니다. - -269 -00:18:08,990 --> 00:18:13,130 - 처음에는 forward pass, 나중에는 초록색으로 적혀 있는 backward pass를 수행합니다. - -270 -00:18:13,130 --> 00:18:18,919 - forward pass에서는 input을 target을 향하게 만듭니다. - -271 -00:18:18,919 --> 00:18:23,360 - 여기서 25개의 index를 받지만, 반복문을 25번 실행하는 것이 아니라, - -272 -00:18:23,359 --> 00:18:27,500 - 여기 있는 성분이 모두 0인 input vector에 one-hot 인코딩을 하게 됩니다. - -273 -00:18:27,500 --> 00:18:32,169 - 그러니까 input에 대응되는 bit를 1로 지정하는 것이죠. - -274 -00:18:32,169 --> 00:18:34,110 - one hot encoding을 이용해서 input을 주고, - -275 -00:18:34,109 --> 00:18:39,229 - 밑에 있는 recurrence 공식을 이용해서 계산합니다. - -276 -00:18:39,230 --> 00:18:42,210 - hs[t]는 매 시간 단계의 모든 값들을 기록합니다. - -277 -00:18:42,210 --> 00:18:46,910 - recurrence 공식과 이 두 줄의 코드를 통해 hidden state vector과 output vector 을 계산합니다. - -278 -00:18:46,910 --> 00:18:50,779 - 여기서는 softmax function(역자주: cross entropy loss)을 이용해서 normalization을 구현합니다. - -279 -00:18:50,779 --> 00:18:54,440 - softmax function에서의 loss는 정답(역자주: 타겟 문자)이 나올 확률의 log를 취하고 거기에 -1을 곱한 값입니다. - -280 -00:18:54,440 --> 00:18:58,190 - softmax function에서의 loss는 정답(역자주: 타겟 문자)이 나올 확률의 log를 취하고 거기에 -1을 곱한 값입니다. - -281 -00:18:58,190 --> 00:19:02,779 - 지금까지 forward pass 를 살펴보았고, 이제 그래프를 통해 backpropagation을 살펴보겠습니다. - -282 -00:19:02,779 --> 00:19:06,899 - backward pass에서는, 25번째 문자에서 첫번째 문자까지 거슬러 올라갑니다. - -283 -00:19:06,900 --> 00:19:08,530 - backward pass에서는, 25번째 문자에서 첫번째 문자까지 거슬러 올라갑니다. - -284 -00:19:08,529 --> 00:19:12,899 - backward pass에서는, 25번째 문자에서 첫번째 문자까지 거슬러 올라갑니다. - -285 -00:19:12,900 --> 00:19:16,509 - 여기서는 softmax, activation 등을 통한 backpropagation이 수행됩니다. 
- -286 -00:19:16,509 --> 00:19:19,089 - 그리고 모든 gradient와 parameter들을 더해주죠. - -287 -00:19:19,089 --> 00:19:23,379 - 한 가지 짚고 넘어갈 점은, Whh를 비롯한 행렬에서의 gradient 계산에서 '+='을 사용하고 있다는 것입니다. - -288 -00:19:23,380 --> 00:19:27,210 - 한 가지 짚고 넘어갈 점은, Whh를 비롯한 행렬에서의 gradient 계산에서 '+='을 사용하고 있다는 것입니다. - -289 -00:19:27,210 --> 00:19:31,210 - 한 가지 짚고 넘어갈 점은, Whh를 비롯한 행렬에서의 gradient 계산에서 '+='을 사용하고 있다는 것입니다. - -290 -00:19:31,210 --> 00:19:34,590 - 우리는 매 시간 단계마다 weight 행렬들이 gradient를 받고, 이 값들을 모두 더해 주어야 하기 때문에, 이 행렬을 계속 쓰게 됩니다. - -291 -00:19:34,589 --> 00:19:37,449 - 우리는 매 시간 단계마다 weight 행렬들이 gradient를 받고, 이 값들을 모두 더해 주어야 하기 때문에, 이 행렬을 계속 쓰게 됩니다. - -292 -00:19:37,450 --> 00:19:43,980 - 그리고 계속해서 backpropagation을 하게 되죠. - -293 -00:19:43,980 --> 00:19:48,130 - 여기에서 나온 gradient는 loss function에 사용되고, 결국 parameter를 업데이트하게 됩니다. - -294 -00:19:48,130 --> 00:19:52,580 - 마지막으로 sampling function입니다. - -295 -00:19:52,579 --> 00:19:55,960 - 여기서 RNN을 지금까지 학습한 training 데이터를 바탕으로 실제로 새로운 문자열 데이터를 출력하게 됩니다. - -296 -00:19:55,960 --> 00:19:59,058 - 여기서 RNN을 지금까지 학습한 training 데이터를 바탕으로 실제로 새로운 문자열 데이터를 출력하게 됩니다. - -297 -00:19:59,058 --> 00:20:02,048 - 여기서 문자열을 초기화해주었고, - -298 -00:20:02,048 --> 00:20:06,759 - 피곤해질 때까지 (역자주: 미리 설정한 recurrence가 끝날 때까지) 다음 작업들을 반복합니다. - -299 -00:20:06,759 --> 00:20:09,289 - recurrence 공식 실행, 각 문자에 대한 확률분포 계산, 샘플링, one-hot 인코딩, 그리고 그 결과물을 다음 시간 단계로 재입력 - -300 -00:20:09,289 --> 00:20:10,450 - recurrence 공식 실행, 각 문자에 대한 확률분포 계산, 샘플링, one-hot 인코딩, 그리고 그 결과물을 다음 시간 단계로 재입력 - -301 -00:20:10,450 --> 00:20:15,640 - recurrence 공식 실행, 각 문자에 대한 확률분포 계산, 샘플링, one-hot 인코딩, 그리고 그 결과물을 다음 시간 단계로 재입력 - - -302 -00:20:15,640 --> 00:20:22,460 - 이 작업들을 충분히 많은 문자열을 출력할 때까지 계속 수행합니다. - -303 -00:20:22,460 --> 00:20:27,190 - (질문: 안들림 => 답변) 우리는 매 batch 마다 25개의 softmax classifier를 갖고 있습니다. - -304 -00:20:27,190 --> 00:21:04,680 - (답변) 그 classifier 들은 한번에 backpropagation을 진행하고, 반대방향으로 모든 결과물들을 더해주죠. - -305 -00:21:04,680 --> 00:21:14,910 - 그게 우리가 이걸 쓰는 이유죠. 다음 질문? - -306 -00:21:14,910 --> 00:21:19,259 - (질문) 여기서 regularization을 쓰나요? - -307 -00:21:19,259 --> 00:21:23,720 - (답변) 여기서는 빠져 있습니다. 일반적으로 RNN에서는 다른 알고리즘만큼 regularization이 흔하게 적용되지는 않습니다. - -308 -00:21:23,720 --> 00:21:27,269 - (답변) 여기서는 빠져 있습니다. 일반적으로 RNN에서는 다른 알고리즘만큼 regularization이 흔하게 적용되지는 않습니다. - -309 -00:21:27,269 --> 00:21:38,379 - (답변) 가끔 아주 좋지 않은 결과를 낳기도 해서, 저는 그냥 사용하지 않을 때도 있습니다. 일종의 hyperparameter이죠. 다음 질문? (질문 안들림) - -310 -00:21:38,380 --> 00:21:48,260 - (답변) 여기서의 문자들은 아주 기초적인 수준입니다. 그래서 실제로 이런 문자가 존재하는지 별로 신경쓰지는 않아요. - -311 -00:21:48,259 --> 00:21:51,839 - (답변) 여기서의 문자들은 아주 기초적인 수준입니다. 그래서 실제로 이런 문자가 존재하는지 별로 신경쓰지는 않아요. - -312 -00:21:51,839 --> 00:21:56,289 - 문자들의 index와 그것들의 순서 정도만을 고려할 뿐이죠. - -313 -00:21:56,289 --> 00:21:58,569 - 다음 질문? - -314 -00:21:58,569 --> 00:22:08,009 - (질문) space 대신 일정한 segment size(25)를 이용하는 이유가 있나요? - -315 -00:22:08,009 --> 00:22:13,460 - (질문) space 대신 일정한 segment size(25)를 이용하는 이유가 있나요? - -316 -00:22:13,460 --> 00:22:18,630 - (답변) 크기가 25인 batch 말고 space로 구분하는 것 역시 가능할 것 같습니다. 하지만 거기에는 언어에 대한특별한 가정이 필요해서 권장되지 않아요. - -317 -00:22:18,630 --> 00:22:22,530 - 자세한 이유는 좀 있다가 살펴보도록 하겠습니다. - -318 -00:22:22,529 --> 00:22:25,359 - 이 코드에는 어떤 문자열도 입력할 수 있어요. 이걸 갖고 여러 가지를 해 볼게요. - -319 -00:22:25,359 --> 00:22:31,539 - 여기 우리가 출처를 모르는 어떤 문자열이 있습니다. - -320 -00:22:31,539 --> 00:22:34,889 - 그리고 이 문자열을 RNN에 학습시키고, RNN이 문자열을 만들어내게 할 거에요. - -321 -00:22:34,890 --> 00:22:40,670 - 예를 들어, 셰익스피어의 모든 작품을 입력할 수 있습니다. - -322 -00:22:40,670 --> 00:22:44,789 - 크기가 좀 크긴 하지만, 이건 단지 문자열일 뿐이에요. - -323 -00:22:44,789 --> 00:22:48,289 - 크기가 좀 크긴 하지만, 이건 단지 문자열일 뿐이에요. 
- -324 -00:22:48,289 --> 00:22:51,909 - RNN 셰익스피어의 작품을 학습시키고, 셰익스피어의 시에서의 다음 문자를 예측하게끔 할 수 있습니다. - -325 -00:22:51,910 --> 00:22:54,650 - 처음에는 학습이 되어 있지 않기 때문에, 결과물들은 매우 무작위적인 문자열입니다. - -326 -00:22:54,650 --> 00:22:59,030 - 처음에는 학습이 되어 있지 않기 때문에, 결과물들은 매우 무작위적인 문자열입니다. - -327 -00:22:59,029 --> 00:23:03,200 - 하지만 학습을 통해 RNN은 이 문자열 안에는 단어들이 있고, 단어들 사이에 space가 있고, 따옴표의 사용법을 이해하기 되죠. - -328 -00:23:03,200 --> 00:23:06,930 - 하지만 학습을 통해 RNN은 이 문자열 안에는 단어들이 있고, 단어들 사이에 space가 있고, 따옴표의 사용법을 이해하기 되죠. - -329 -00:23:06,930 --> 00:23:11,490 - 하지만 학습을 통해 RNN은 이 문자열 안에는 단어들이 있고, 단어들 사이에 space가 있고, 쌍따옴표(")의 사용법을 이해하기 되죠. - -330 -00:23:11,490 --> 00:23:16,420 - 그리고 'here', 'on', 'and so on' 과 같은 기본적인 표현들을 알게 됩니다. - -331 -00:23:16,420 --> 00:23:18,820 - 그리고 RNN을 계속 학습시킬수록, 이러한 표현들이 점점 정제되는 것을 확인할 수 있습니다. - -332 -00:23:18,819 --> 00:23:22,609 - 예를 들어 "를 한번 사용하면 "를 한번 더 사용해서 인용구를 닫아 주는 것들을 익히는 거죠. - -333 -00:23:22,609 --> 00:23:26,379 - 또 문장이 마침표로 끝나는 것 역시 따로 가르치지 않고도 패턴만으로 익히게 됩니다. - -334 -00:23:26,380 --> 00:23:29,630 - 또 문장이 마침표로 끝나는 것 역시 따로 가르치지 않고도 통계적 패턴만으로 익히게 됩니다. - -335 -00:23:29,630 --> 00:23:30,580 - 그리고 마침내 '셰익스피어 문학' 자체를 생성할 수 있게 되죠. - -336 -00:23:30,579 --> 00:23:34,349 - 여기 RNN이 만들어낸 작품을 읽어볼게요. - -337 -00:23:34,349 --> 00:23:38,740 - (읽는 중) "Alas, I think he shall come approached and the day..." - -338 -00:23:38,740 --> 00:23:42,900 - (읽는 중) "Alas, I think he shall come approached and the day..." - -339 -00:23:42,900 --> 00:23:45,460 - (읽는 중) "Alas, I think he shall come approached and the day..." - -340 -00:23:45,460 --> 00:23:56,909 - (질문) 하지만 이것들은 25개가 넘는 문자로 이루어진 문장은 기억할 수가 없기 때문에 제대로 생성할 수 없죠? - -341 -00:23:56,909 --> 00:24:02,679 - (답변) 네 맞습니다. 그거 사실 되게 알아차리기 힘든 부분이라 제가 나중에 말하려고 했었어요. - -342 -00:24:02,679 --> 00:24:05,980 - 우리는 셰익스피어 작품이 아니라 다른 것들에도 이것을 활용할 수 있습니다. - -343 -00:24:05,980 --> 00:24:08,960 - 이것들은 제가 Justin과 작년에 만들어본 것들입니다. - -344 -00:24:08,960 --> 00:24:12,990 - Justin은 한 대수기하학 책의 LaTeX 소스를 RNN에 학습시켰습니다. - -345 -00:24:12,990 --> 00:24:18,069 - Justin은 한 대수기하학 책의 LaTeX 소스를 RNN에 학습시켰습니다. - -346 -00:24:18,069 --> 00:24:23,398 - 그리고 RNN은 수학책을 집필했죠. - -347 -00:24:23,398 --> 00:24:27,199 - 물론 RNN은 LaTeX 형식으로 결과물을 출력하지 않아서 저희가 약간 손봐주긴 했지만, - -348 -00:24:27,200 --> 00:24:30,009 - 물론 RNN은 LaTeX 형식으로 결과물을 출력하지 않아서 저희가 약간 손봐주긴 했지만, - -349 -00:24:30,009 --> 00:24:33,890 - 어쨌든 한두 번 손보고 나니 보시는 바와 같이 수학책이 되었어요. - -350 -00:24:33,890 --> 00:24:37,200 - 어쨌든 한두 번 손보고 나니 보시는 바와 같이 수학책이 되었어요. - -351 -00:24:37,200 --> 00:24:42,460 - 살펴보면, RNN은 proof(정리)를 쓰는 방법을 배웠네요. 수학적 정리의 끝에는 저렇게 사각형을 쓰죠. - -352 -00:24:42,460 --> 00:24:47,090 - lemma(소정리)를 비롯한 다른 것들도 만들어 냈고요. - -353 -00:24:47,089 --> 00:24:52,428 - 그림을 그리는 방법도 배웠네요. - -354 -00:24:52,429 --> 00:24:56,720 - 제가 가장 좋아하는 부분은 여기 왼쪽 상단에 있는 "Proof. Omitted" 부분입니다. - -355 -00:24:56,720 --> 00:24:59,650 - RNN도 귀찮았나 봐요 (웃음) - -356 -00:24:59,650 --> 00:25:05,780 - RNN도 귀찮았나 봐요 (웃음) - -357 -00:25:05,779 --> 00:25:12,480 - 전반적으로 보면 RNN은 꽤 대수기하학책 같이 보이는 걸 만들어 냈어요. - -358 -00:25:12,480 --> 00:25:16,160 - 뭐 세부적인 부분은 제가 대수기하를 잘 몰라서 말하기 그렇지만, 전반적으로 괜찮아요. - -359 -00:25:16,160 --> 00:25:19,529 - 저는 이어서 문자 단위 RNN으로 표현할 수 있는 가장 어렵고 추상적인 것들이 무엇이 있을까 생각했고, - -360 -00:25:19,529 --> 00:25:22,769 - 소스 코드에 생각이 미쳤습니다. - -361 -00:25:22,769 --> 00:25:27,879 - 그래서 리누스 토발즈의 GitHub에 들어가 리눅스의 모든 C 코드를 가져왔습니다. - -362 -00:25:27,880 --> 00:25:30,850 - 이 C 코드는 자그마치 700MB나 됩니다. - -363 -00:25:30,849 --> 00:25:35,079 - 이 코드를 RNN에게 학습시켰고, RNN은 코드를 생성해 냈습니다. - -364 -00:25:35,079 --> 00:25:39,849 - 이게 바로 RNN이 생성해낸 코드입니다. 
- -365 -00:25:39,849 --> 00:25:42,949 - 살펴보면 함수를 생성했고, 변수를 지정하고, 문법적 오류가 거의 없습니다. - -366 -00:25:42,950 --> 00:25:47,460 - 변수를 어떻게 사용하는지도 아는 것 같고, - -367 -00:25:47,460 --> 00:25:53,230 - indentation (들여쓰기)도 적절히 했고, 주석도 달았습니다. - -368 -00:25:53,230 --> 00:25:58,089 - 괄호를 열고 닫지 않는 등의 실수를 찾아보기가 매우 힘들었습니다. - -369 -00:25:58,089 --> 00:26:01,808 - 이런 것들은 RNN이 배우기 가장 쉬운 것들 중 하나거든요. - -370 -00:26:01,808 --> 00:26:04,058 - RNN의 실수들 중에는 쓰이지 않을 변수를 선언하거나, 선언하지도 않은 변수를 불러오기를 시도는 것들이 있었습니다. - -371 -00:26:04,058 --> 00:26:07,240 - RNN의 실수들 중에는 쓰이지 않을 변수를 선언하거나, 선언하지도 않은 변수를 불러오기를 시도는 것들이 있었습니다. - -372 -00:26:07,240 --> 00:26:09,929 - 그러니까 아직 매우 높은 단계의 코딩 수준에는 도달하지 못한 거죠. - -373 -00:26:09,929 --> 00:26:12,509 - 하지만 그런 것들을 제외하고 보면 꽤 코딩을 잘 했습니다. - -374 -00:26:12,509 --> 00:26:17,460 - 새로운 GPU 라이센스에 관한 주석을 다는 방법도 배웠네요. - -375 -00:26:17,460 --> 00:26:22,009 - 새로운 GPU 라이센스에 관한 주석을 다는 방법도 배웠네요. - -376 -00:26:22,009 --> 00:26:25,779 - GPL 라이센스 다음에는 #include, 매크로 코드 등이 오는 것도 배웠고요. - -377 -00:26:25,779 --> 00:26:33,879 - (질문) 이건 (아까 보여준) min char-rnn 으로 만들어낸 건가요? - -378 -00:26:33,880 --> 00:26:37,169 - (답변) min char-rnn은 그냥 작동 원리를 알려주기 위해 만들어낸 장난감 같은 거고, - -379 -00:26:37,169 --> 00:26:41,230 - (답변) 실제로는 min char-rnn의 확장판인 torch 기반 char-rnn을 으로 구현했고, GPU를 이용해서 처리했습니다. - -380 -00:26:41,230 --> 00:26:45,009 - (답변) 실제로는 min char-rnn의 확장판인 torch 기반 char-rnn을 으로 구현했고, GPU를 이용해서 처리했습니다. - -381 -00:26:45,009 --> 00:26:49,269 - 이 부분은 수업 마지막 부분에 다룰 것인데, 3-layer LSTM 이라는 것입니다. - -382 -00:26:49,269 --> 00:26:52,289 - 이건 RNN의 복잡한 버전이라고 생각하면 됩니다. - -383 -00:26:52,289 --> 00:26:58,839 - 좀 더 이해가 쉽도록 예를 들어 볼게요. - -384 -00:26:58,839 --> 00:27:02,089 - 이건 작년에 저희가 이런 것들을 가지고 만들어본 것들입니다. - -385 -00:27:02,089 --> 00:27:08,949 - 저희는 문자 단위 RNN에 신경과학적으로 접근을 해 보았습니다. - -386 -00:27:08,950 --> 00:27:13,110 - hidden state 내부 특정 cell의 excitement(흥분) 여부에 따라 색을 칠해 봤습니다. - -387 -00:27:13,109 --> 00:27:17,119 - hidden state 내부 특정 cell의 excitement(흥분) 여부에 따라 색을 칠해 봤습니다. - -388 -00:27:17,119 --> 00:27:18,699 - hidden state 내부 특정 cell의 excitement(흥분) 여부에 따라 색을 칠해 봤습니다. - -389 -00:27:18,700 --> 00:27:23,470 - 보시다시피, hidden state의 뉴런들의 상태를 해석하는 일이 쉽지가 않습니다. - -390 -00:27:23,470 --> 00:27:27,110 - 보시다시피, hidden state의 뉴런들의 상태를 해석하는 일이 쉽지가 않습니다. - -391 -00:27:27,109 --> 00:27:29,829 - 왜냐하면 어떤 뉴런들은 매우 낮은 단계에서의 작업을 맡거든요. - -392 -00:27:29,829 --> 00:27:33,859 - 예를 들면, 'h 다음에 e가 얼마나 자주 오는가' 가 있네요. - -393 -00:27:33,859 --> 00:27:37,928 - 하지만 어떤 cell 들은 해석하기가 꽤 용이했습니다. - -394 -00:27:37,929 --> 00:27:41,830 - 여기 보시는 것은 인용구 검출 cell 입니다. - -395 -00:27:41,829 --> 00:27:46,460 - 이 cell은 처음 따옴표가 나오면 켜지고, 따옴표가 다시 나타나면 꺼집니다. - -396 -00:27:46,460 --> 00:27:50,610 - 이건 그냥 backpropagation의 결과로 나온 것입니다. - -397 -00:27:50,609 --> 00:27:54,329 - RNN은 문자열의 길이가 따옴표들의 사이에 있을때와 따옴표 바깥에 있을 때에 다르다는 것을 파악했습니다. - -398 -00:27:54,329 --> 00:27:57,639 - 그래서 hidden state의 특정 부분들을 현재 문자들이 인용구 안에 있는지 파악하게 했습니다. - -399 -00:27:57,640 --> 00:28:00,650 - 그래서 hidden state의 특정 부분들을 현재 문자들이 인용구 안에 있는지 파악하게 했습니다. - -400 -00:28:00,650 --> 00:28:05,159 - 이것이 아까 (질문했던 사람)의 질문에 답을 해줄 것 같은데요, - -401 -00:28:05,159 --> 00:28:06,500 - 이 RNN의 seq_length는 100 이었습니다.(역자주: batch 크기가 100) - -402 -00:28:06,500 --> 00:28:10,269 - 하지만 실제로 이 인용구들의 크기를 재어 보면 100보다 훨씬 길다는 것을 알 수 있습니다. - -403 -00:28:10,269 --> 00:28:16,220 - 제가 보기에 대략 250정도 인 것 같네요. - -404 -00:28:16,220 --> 00:28:20,190 - 그러니까 우리는 한 번에 크기가 100인 backpropagation만을 진행했고, RNN에게는 그때만이 유일한 학습 기회입니다. - -405 -00:28:20,190 --> 00:28:23,460 - 그러니까 문자열 크기가 100이 넘어가면 그 앞뒤의 dependencies(종속성, 관계) 에 대해서는 직접적으로 학습하지를 않습니다. 
-406 -00:28:23,460 --> 00:28:27,809 - 그러니까 문자열 크기가 100이 넘어가면 그 앞뒤의 dependencies(종속성, 관계) 에 대해서는 직접적으로 학습하지를 않습니다. - -407 -00:28:27,809 --> 00:28:31,159 - 하지만 이 결과는 실제 문자열의 길이보다 작은 크기의 batch 들로 학습한다고 해도, batch 크기보다 긴 문자열에 대해서도 잘 작동할 수 있다는 것을 보여주네요. - -408 -00:28:31,160 --> 00:28:36,580 - 하지만 이 결과는 실제 문자열의 길이보다 작은 크기의 batch 들로 학습한다고 해도, batch 크기보다 긴 문자열에 대해서도 잘 작동할 수 있다는 것을 보여주네요. - -409 -00:28:36,579 --> 00:28:39,859 - 그러니까 batch 크기는 100이었지만, - -410 -00:28:39,859 --> 00:28:44,759 - 크기가 수백이 넘는 문자열의 dependecies 도 잘 잡아낸 것이죠. - -411 -00:28:44,759 --> 00:28:48,890 - 이것은 톨스토이의 <전쟁과 평화> 데이터 입니다. - -412 -00:28:48,890 --> 00:28:52,460 - 이 데이터 세트는 대략 80문자마다 한 번 줄이 바뀝니다. - -413 -00:28:52,460 --> 00:28:57,819 - 이 데이터 세트는 대략 80문자마다 한 번 줄이 바뀝니다. - -414 -00:28:57,819 --> 00:29:02,470 - 그리고 우리는 줄 길이 tracking cell을 찾아냈습니다. - -415 -00:29:02,470 --> 00:29:06,539 - 이 cell은 줄이 처음 시작하면 1로 시작해서, 문자열이 진행될수록 천천히 그 값이 감소합니다. - -416 -00:29:06,539 --> 00:29:09,019 - RNN은 현재 자신이 어느 시간 단계에 있는지 알아야 하기 때문에 이 기능은 매우 유용합니다. - -417 -00:29:09,019 --> 00:29:13,059 - RNN은 현재 자신이 어느 시간 단계에 있는지 알아야 하기 때문에 이 기능은 매우 유용합니다. - -418 -00:29:13,059 --> 00:29:15,149 - 이를 통해서 언제 줄을 바꾸어야 하는지 알 수 있기 때문이죠. - -419 -00:29:15,150 --> 00:29:19,280 - 이것 말고도 if 문을 감지하는 cell도 찾아냈고, - -420 -00:29:19,279 --> 00:29:23,970 - 인용구과 주석을 감지하는 cell 도 찾아냈고, - -421 -00:29:23,970 --> 00:29:28,710 - 상대적으로 deep한 코드를 감지하는 cell 도 찾아냈습니다. - -422 -00:29:28,710 --> 00:29:33,150 - 다른 역할을 수행하는 cell 들도 찾을 수 있을 것이고, 중요한 것은 이것들이 전부 backpropagation 에서 나왔다는 겁니다. - -423 -00:29:33,150 --> 00:29:36,710 - 되게 마법같은 일이죠. - -424 -00:29:36,710 --> 00:29:42,130 - (질문) 어떻게 cell 하나하나가 흥분했는지 알 수 있었죠? - -425 -00:29:42,130 --> 00:29:49,110 - (답변) 이 LSTM 에서는 대략 2100개의 cell 들이 있었습니다. 저는 그냥 하나하나 다 살펴봤어요. - -426 -00:29:49,109 --> 00:29:54,589 - (답변) 대부분은 규칙을 찾기가 어려웠지만, 약 5%에 해당하는 cell들에 대해서 살펴본 것들과 같은 규칙을 찾을 수 있었습니다. - -427 -00:29:54,589 --> 00:30:00,429 - (질문) 그러니까 어떤 cell들은 켜고, 어떤 cell들은 끄는 방식으로 찾은 건가요? - -428 -00:30:00,430 --> 00:30:05,310 - (답변) 오 제가 질문을 잘못 이해했었네요. 저희는 RNN 전체를 실행시켰고, 특정 hidden state의 흥분 상태를 관찰했습니다. - -429 -00:30:05,309 --> 00:30:09,679 - (답변) 오 제가 질문을 잘못 이해했었네요. 저희는 RNN 전체를 실행시켰고, 특정 hidden state의 흥분 상태를 관찰했습니다. - -430 -00:30:09,680 --> 00:30:14,470 - (답변) 그러니까 그냥 실행은 그대로 하되, 특정 hidden state의 상태를 기록하고 살펴본 것입니다. - -431 -00:30:14,470 --> 00:30:20,900 - 이해가 되셨나요? - -432 -00:30:20,900 --> 00:30:23,940 - 그러니까 저는 여기서 hidden state 단 한 부분만을 여기 슬라이드에 나타냈습니다. - -433 -00:30:23,940 --> 00:30:27,740 - 물론 hidden state 에는 이 부분 말고도 다른 일들을 하는 cell들이 많이 있죠. - -434 -00:30:27,740 --> 00:30:30,349 - 이것들은 모두 동시에, 다른 기능을 수행합니다. - -435 -00:30:30,349 --> 00:30:41,899 - (질문) 여기서의 hidden state의 layer은 1개인가요? - -436 -00:30:41,900 --> 00:30:50,150 - (답변) Multi-layer RNN을 말씀하시는 건가요? 그것에 대해서는 좀 있다가 설명드리겠습니다. 여기서는 Multi-layer을 썼지만, Single-layer을 썼어도 결과는 비슷했을 거에요. - -437 -00:30:50,150 --> 00:31:00,490 - (질문: 안들림) (답변): 이 hidden state 들은 -1 ~ 1의 값을 가집니다. tanh 함수의 결과물이거든요. - -438 -00:31:00,490 --> 00:31:04,120 - (답변) 이건 우리가 아직 다루지 않은 LSTM에 대한 것들입니다. 한 cell에 배정된 값은 -1~1 이라는 것 정도만 알아두세요. - -439 -00:31:04,119 --> 00:31:11,869 - (답변) 이건 우리가 아직 다루지 않은 LSTM에 대한 것들입니다. 한 cell에 배정된 값은 -1~1 이라는 것 정도만 알아두세요. - -440 -00:31:11,869 --> 00:31:15,609 - RNN은 매우 잘 작동하고, 이러한 시퀀스 모델을 잘 학습할 수 있습니다. - -441 -00:31:15,609 --> 00:31:19,039 - 대략 1년 전에 어떤 사람들이 이걸 컴퓨터 비전-image aptioning 분야에 적용해 보았습니다. - -442 -00:31:19,039 --> 00:31:22,039 - 대략 1년 전에 어떤 사람들이 이걸 컴퓨터 비전-image captioning 분야에 적용해 보았습니다. 
- -443 -00:31:22,039 --> 00:31:25,210 - 여기서는 어떤 하나의 사진을 가지고 단어의 배열을 생성해 보았는데요, - -444 -00:31:25,210 --> 00:31:27,840 - RNN은 여기서 매우 잘 작동했습니다. - -445 -00:31:27,839 --> 00:31:32,490 - RNN은 여기서 매우 잘 작동했습니다. - -446 -00:31:32,490 --> 00:31:36,240 - 여기 한 부분을 보시면, - -447 -00:31:36,240 --> 00:31:43,039 - 사실 이건 제 논문이기 때문에 저 사진들은 제가 마음대로 쓸 수 있죠. - -448 -00:31:43,039 --> 00:31:46,629 - CNN에 이미지를 입력했는데요, - -449 -00:31:46,630 --> 00:31:48,990 - 잘 살펴보시면 사실 이것은 CNN과 RNN의 두 부분으로 구성되어 있다는 것을 발견할 수 있습니다. - -450 -00:31:48,990 --> 00:31:51,750 - 잘 살펴보시면 사실 이것은 CNN과 RNN의 두 부분으로 구성되어 있다는 것을 발견할 수 있습니다. - -451 -00:31:51,750 --> 00:31:55,460 - CNN은 이미지 처리를, RNN은 단어들의 순서 결정을 맡았습니다. - -452 -00:31:55,460 --> 00:31:58,470 - 제가 강의 처음에 했던 레고 블록 비유를 기억한다면, - -453 -00:31:58,470 --> 00:32:01,039 - CNN과 RNN을 그림에 보이는 화살표와 같이 연결시킨 것을 이해할 수 잇을 것입니다. - -454 -00:32:01,039 --> 00:32:04,509 - CNN과 RNN을 그림에 보이는 화살표와 같이 연결시킨 것을 이해할 수 잇을 것입니다. - -455 -00:32:04,509 --> 00:32:07,829 - 저희가 여기서 잘한 점은 여기서 RNN 단어 생성 모델의 입력값을 적절히 조절했다는 것입니다. - -456 -00:32:07,829 --> 00:32:11,349 - 그러니까 아무 텍스트나 RNN에 입력한 것이 아니라, - -457 -00:32:11,349 --> 00:32:14,939 - CNN의 결과물을 RNN의 입력값으로 받아온 것이죠. - -458 -00:32:14,940 --> 00:32:21,220 - 좀 더 자세히 설명드리겠습니다. forward pass 부분부터요. - -459 -00:32:21,220 --> 00:32:24,110 - 여기 test image가 있습니다. - -460 -00:32:24,109 --> 00:32:27,679 - 우리는 이 이미지에서 단어들의 시퀀스를 만들어보고 싶어요. - -461 -00:32:27,680 --> 00:32:31,240 - 그래서 다음과 같이 이미지를 먼저 처리했습니다. - -462 -00:32:31,240 --> 00:32:35,250 - 먼저 이미지를 CNN에 입력했습니다. 여기서 쓰인 CNN은 VGG net 이었습니다. - -463 -00:32:35,250 --> 00:32:37,349 - 그리고 여기 conv들과 maxpool 들을 통과시켰죠. - -464 -00:32:37,349 --> 00:32:40,149 - 일반적으로 마지막에는 softmax classifier가 위치합니다. - -465 -00:32:40,150 --> 00:32:44,440 - softmax는 확률분포를 출력하죠. 예를 들어 1000개의 카테고리가 있다면 각 카테고리에 대한 확률분포를요. - -466 -00:32:44,440 --> 00:32:47,420 - 근데 여기서 우리는 softmax를 사용하지 않았습니다. - -467 -00:32:47,420 --> 00:32:50,750 - 대신 이 끝부분을 RNN의 시작 부분과 연결시켰죠. - -468 -00:32:50,750 --> 00:32:54,880 - RNN 입력에 처음에는 특별한 벡터들을 사용했습니다. - -469 -00:32:54,880 --> 00:33:00,410 - RNN 에 입력되는 벡터들의 차원은 300이었고요, - -470 -00:33:00,410 --> 00:33:02,700 - RNN의 첫 iteration에는 무조건 이 벡터를 사용했습니다. - -471 -00:33:02,700 --> 00:33:05,750 - 그럼으로써 RNN이 이것이 시퀀스의 시작임을 파악할 수 있게 했습니다. - -472 -00:33:05,750 --> 00:33:09,039 - 그리고 아까 살펴본 recurrence 공식 (Vanilla NN)을 사용했습니다. - -473 -00:33:09,039 --> 00:33:13,769 - 그리고 아까 살펴본 recurrence 공식 (Vanilla NN)을 사용했습니다. - -474 -00:33:13,769 --> 00:33:18,779 - 우리는 WSH 시간 섹스를 계산하지만 whhhy 지금 우리가 원하는 위치에있는 우리가했습니다 연대 - -475 -00:33:18,779 --> 00:33:23,500 - 아까는 (Wxh*x + Whh*h)과 0으로 초기화되는 h_0을 사용했다면, - -476 -00:33:23,500 --> 00:33:28,089 - 아까는 (Wxh*x + Whh*h)과 0으로 초기화되는 h_0을 사용했다면, - -477 -00:33:28,089 --> 00:33:33,649 - 이번에는 v를 추가해서 (Wxh*x + Whh*h + Wih*v) 를 사용했습니다. 
- -478 -00:33:33,650 --> 00:33:38,040 - 등이 여기에 주석의 정상이며 우리가 추가 한 상호 작용과 - -479 -00:33:38,039 --> 00:33:43,399 - 이 이미지 정보가에 나오는 방법을 우리에게 말해 추가 무게 행렬 W - -480 -00:33:43,400 --> 00:33:46,380 - 처음으로 직장에서 재발 역할 때문에 지금 여러 가지 방법이 있습니다 - -481 -00:33:46,380 --> 00:33:48,940 - 실제로 실제로 이미지를 연결하는 여러 가지 방법이 재발 플레이 - -482 -00:33:48,940 --> 00:33:51,690 - 이 지금이 중 하나만 및 간단한 것 중 하나에 - -483 -00:33:51,690 --> 00:33:55,750 - 아마도이 와인에 여기 난생 처음 단계에서 0 벡터 인 - -484 -00:33:55,750 --> 00:34:00,009 - 이 작동 방식하도록 시퀀스의 첫 번째 단어에 걸쳐 분포 - -485 -00:34:00,009 --> 00:34:05,490 - 예를 들어 당신이 볼 수있다 당신이 상상할 수있는 그 질량이 구조 - -486 -00:34:05,490 --> 00:34:09,699 - 모자는 다음 연합 네트워크 강한 같은 물건에 의해 인식 될 수있다 - -487 -00:34:09,699 --> 00:34:12,939 - 내 조건 사랑의이 상호 작용을 통해 들어갈 상태에서 칠 - -488 -00:34:12,940 --> 00:34:17,039 - 단어 짚의 확률이 약간 높을 수있다 특정 상태 - -489 -00:34:17,039 --> 00:34:20,519 - 바로 그래서 당신은 강한 같은 텍스처가 영향을 미칠 수 있다는 것을 상상 - -490 -00:34:20,519 --> 00:34:23,940 - 강력한 그래서 번호 중 하나 (10) 내부의 가능성이 있기 때문에 높은 것으로 - -491 -00:34:23,940 --> 00:34:28,470 - 그들의 구조와는 그래서 지금부터 군대는 정글이 작업의 종류에있다 - -492 -00:34:28,469 --> 00:34:32,269 - 그것은이 케이스에 시퀀스 내의 다음 치료 및 다음 단어를 예측한다 - -493 -00:34:32,269 --> 00:34:36,550 - 그래서 우리는 최대 그 양말로부터 전송 된 화상 정보를 기억하고 - -494 -00:34:36,550 --> 00:34:40,629 - 아마 우리가 그 분포에서 샘플링 가능성이 가장 높은 단어였다 - -495 -00:34:40,628 --> 00:34:44,710 - 참으로 강한 단어 우리는 강한 걸립니다 우리는에 연결하려고 할 것 - -496 -00:34:44,710 --> 00:34:47,519 - 본인은 생각이 경우 다시 그렇게 바닥에 모든 작업을 기록 - -497 -00:34:47,519 --> 00:34:52,190 - 강한 강한 단어와 연관 그래서 우리는 단어 수준과 침구를 사용하는 - -498 -00:34:52,190 --> 00:34:55,750 - 삼백 국가 박사는 우리는 삼백를 표현하는 법을 배워야거야 - -499 -00:34:55,750 --> 00:35:00,010 - 국가마다 하나의 고유 한 보석에 대한 표현과 우리는 그 플러그 - -500 -00:35:00,010 --> 00:35:02,940 - 삼백 아르 논에 숫자와 설명을 얻기 위해 다시 전달 - -501 -00:35:02,940 --> 00:35:07,090 - 하나는 우리가 이러한 모든 특성을 우리가 얻을 왜 내 두 번째 세계와 순서 - -502 -00:35:07,090 --> 00:35:08,010 - 그것에서 샘플을 다시 - -503 -00:35:08,010 --> 00:35:12,490 - 워드 모자 가능성이 있다고 가정 지금 우리는 모자 400 훨씬 나이 프리젠 테이션을 - -504 -00:35:12,489 --> 00:35:18,299 - 그리고 거기의 분포를 얻을 후 우리는 다시 샘플링하고 우리는 때까지 샘플 - -505 -00:35:18,300 --> 00:35:21,350 - 우리는 특별한 샘플 및 진정의 끝에있는 기간 토큰 - -506 -00:35:21,349 --> 00:35:24,900 - 문장하고는 arnaz 지금이에서 생성 할 것을 우리에게 알려줍니다 - -507 -00:35:24,900 --> 00:35:30,280 - 군대는 그렇게 확인 밀짚 모자 기간이 이미지를 설명했을 포인트 - -508 -00:35:30,280 --> 00:35:34,010 - 치수와 그의 아내 사진의 수는 단어의 숫자 당신의 - -509 -00:35:34,010 --> 00:35:39,220 - 특수 토큰과 우리가 항상 먹이 산업을위한 어휘 +1 - -510 -00:35:39,219 --> 00:35:43,609 - 다른 단어에 해당하는 부문과 얘기 특별한 시작과 - -511 -00:35:43,610 --> 00:35:46,250 - 우리는 언제나 그 전부 단일 통해 전파 - -512 -00:35:46,250 --> 00:35:49,769 - 시간은 무작위로이 국유화하거나 당신은 무료로 BG 그물을 초기화 할 수 있습니다 - -513 -00:35:49,769 --> 00:35:52,099 - 다음 분을 위해 무역 - -514 -00:35:52,099 --> 00:35:56,319 - 배포판은 다음 그라데이션을 인코딩 한 다음이를 통해 백업 - -515 -00:35:56,320 --> 00:35:59,700 - 전체 단일 모델로 것이나 그냥 모든 공동에서 훈련하고 얻을 - -516 -00:35:59,699 --> 00:36:08,389 - 캡션 또는 이미지 캡처 확인 질문을 많이하지만 네 삼백 - -517 -00:36:08,389 --> 00:36:12,609 - 감정 묻어은 너무 이미지 모든 단어의 단지 독립적있어 - -518 -00:36:12,610 --> 00:36:18,430 - 그렇게 우리가 그것으로 얻을 파산거야와 관련된 300 번호를 가지고 - -519 -00:36:18,429 --> 00:36:21,769 - 당신은 무작위로 초기화 한 다음이 더 나은 섹스에 들어갈 백업 할 수 있습니다 - -520 -00:36:21,769 --> 00:36:25,360 - 그 묻어은 주위 그냥 매개 변수를 다른 이동합니다 오른쪽 그래서 - -521 -00:36:25,360 --> 00:36:30,530 - 그것에 대해 생각하는 방법은 모두를위한 하나의 홉 표현을 데입니다입니다 - -522 -00:36:30,530 --> 00:36:34,960 - 단어는 당신은 거대한 W 매트릭스 곳 하나 하나가 - -523 -00:36:34,960 --> 00:36:40,130 - 그 백 농장과 W 곱셈과 승 300 밖으로하지만 크기가 - -524 -00:36:40,130 --> 00:36:43,530 - 효과적으로 하나가 부러 밖으로 따 버릴거야있는 뭔가 w - -525 -00:36:43,530 --> 00:36:47,560 - 나는 당신이 그 마음에 들지 않는 경우 그래서 그냥 생각이 한랭 전선의 종류의 걸거야 - -526 
-00:36:47,559 --> 00:36:50,279 - 침대에서 단지 하나의 호퍼 프리젠 테이션으로 생각하고 수행 할 수 있습니다 - -527 -00:36:50,280 --> 00:36:58,920 - 교육에 토큰 네 말에 최대 네 그것의 모델러를 그런 식으로 생각 - -528 -00:36:58,920 --> 00:37:02,769 - 데이터는 우리가 예술에서 기대하는 올바른 순서는 내가 할 수있는 첫 번째 단어입니다 - -529 -00:37:02,769 --> 00:37:07,969 - 기대 때문에 매일 훈련 예 일종의 특별이 - -530 -00:37:07,969 --> 00:37:10,288 - 그리고 진행 토큰 - -531 -00:37:10,289 --> 00:37:28,929 - 당신이 유선 수 다르게 우리는 모든 단일 상태로 연결이 밝혀 - -532 -00:37:28,929 --> 00:37:32,999 - 그것은 실제로 당신이 단지에 연결하면 실제로 잘 작동 악화 때문에 작동 - -533 -00:37:32,998 --> 00:37:36,718 - 시간 단계 최초의 다음 아르 논은이이 두 작업을 저글링하는 - -534 -00:37:36,719 --> 00:37:40,829 - 그것은 예술과 그것을 통해 기억 할 필요가 무엇 이미지에 대한 기억 - -535 -00:37:40,829 --> 00:37:45,179 - 또한 이러한 모든 의상을 생산해야하고 어떻게 든 거기에 그렇게하고 싶어 - -536 -00:37:45,179 --> 00:38:04,209 - 일부는 사실 클래스 직후 나는 당신을 줄 수있는 이유를 전진 - -537 -00:38:04,208 --> 00:38:10,208 - 단일 인스턴스는 이미지와 단어의 순서와 우리가 대응합니다 - -538 -00:38:10,208 --> 00:38:16,328 - 여기에 그 단어를 연결 것이고, I를 우리는 이미지를 이야기하고 우리가하여야한다 - -539 -00:38:16,329 --> 00:38:22,159 - 그래서 와서 당신이 모든 사람들은 바닥에 계획되지 않은 한 기차 시간 - -540 -00:38:22,159 --> 00:38:25,528 - 이미지 런던과 다음이 그래프를 풀다 당신은 당신의 손실을 - -541 -00:38:25,528 --> 00:38:29,389 - 당신이 조심 있다면 배경이 다음 이미지의 배치를 할 수 있으며, - -542 -00:38:29,389 --> 00:38:33,108 - 그래서 당신의 이미지를 한 경우에는 때로는 서로 다른 길이의 시퀀스가 - -543 -00:38:33,108 --> 00:38:36,199 - 당신이 난 것을 확인 말을해야하기 때문에 훈련 데이터는 조심해야 - -544 -00:38:36,199 --> 00:38:41,059 - 아마 다음의 몇 가지를 최대 스무 단어의 배치를 처리하고자 - -545 -00:38:41,059 --> 00:38:44,499 - 코드에서 당신이 알고에 그 문장이 짧거나 더 이상 필요가있을 것입니다 - -546 -00:38:44,498 --> 00:38:48,188 - 일부 일부 일부 문장은 다른 사람보다 더 오래 있기 때문에 걱정 - -547 -00:38:48,188 --> 00:38:55,368 - 우리는 내가 갈 물건이 너무 많은 질문이 - -548 -00:38:55,369 --> 00:39:03,450 - 그 완전히 공동으로이 모든 것을 전파하도록 네 감사합니다 - -549 -00:39:03,449 --> 00:39:07,538 - 훈련은 인터넷으로 기차를 미리 할 수​​ 있도록 한 다음 그 단어를 넣어 - -550 -00:39:07,539 --> 00:39:10,190 - 이하지만 당신은 공동으로 모든 훈련을 원하고 그 큰이야 - -551 -00:39:10,190 --> 00:39:15,429 - 우리는 우리가 검색 기능을 알아낼 수 있기 때문에 실제로 이점 - -552 -00:39:15,429 --> 00:39:20,368 - 더 좋은 말은 그래서 당신은이 훈련하는 이미지를 설명하기 위해 - -553 -00:39:20,369 --> 00:39:23,890 - 실제로 우리가 인구 조사 자료에이 시도는 일반적인 욕구 중 하나를 설정합니다 - -554 -00:39:23,889 --> 00:39:27,368 - 마이크로 소프트 코코라고하는 것은, 그래서 그냥 당신이처럼 보이는 무엇의 아이디어를 제공합니다 - -555 -00:39:27,369 --> 00:39:31,499 - 대략 각 이미지 80 이미지와 다섯 문장의 설명이 있었다 - -556 -00:39:31,498 --> 00:39:35,288 - 그래서 당신은 단지 사람들에게 아마존 기계 터크를 사용하여 얻은 것은 우리에게주세요 - -557 -00:39:35,289 --> 00:39:39,710 - 문장 이미지에 대한 설명과 기록 및 데이터 세트를 종료하고 - -558 -00:39:39,710 --> 00:39:43,249 - 그래서 당신은 당신이 예상 할 수있는이 모델에게 결과의 종류를 훈련 할 때 또는 - -559 -00:39:43,248 --> 00:39:49,078 - 약 좀이 같은이 너무 이러한 이미지를 설명하는 우리의 무엇이다 - -560 -00:39:49,079 --> 00:39:52,329 - 이 이것이 검은 셔츠 연주 기타 또는 건설 사람이다라고 말한다 - -561 -00:39:52,329 --> 00:39:55,710 - 도로 또는 두 젊은 여자에 작업 오렌지 시티 웨스트에서 노동자 재생 - -562 -00:39:55,710 --> 00:40:00,528 - 레고 장난감이나 소년 그건 아니에요 웨이크 보드에 물론 공중제비를하고있다 - -563 -00:40:00,528 --> 00:40:04,650 - 웨이크 보드는하지만 매우 재미 실패 사례도 있습니다 가까이있는 - -564 -00:40:04,650 --> 00:40:07,680 - 또한이 야구 방망이를 들고 어린 소년입니다 보여주고 싶은 - -565 -00:40:07,679 --> 00:40:12,338 - 이 고양이는 여자의 원격 제어와 함께 소파에 앉아있다 - -566 -00:40:12,338 --> 00:40:15,710 - 거울 앞의 테디 베어를 들고 - -567 -00:40:15,710 --> 00:40:22,400 - 여기 질감은 아마 무슨 일이 것은 그것을 만든 것입니다 확신 해요 - -568 -00:40:22,400 --> 00:40:26,289 - 이 테디 베어가 있다고 생각하고 마지막은 서 창녀입니다 - -569 -00:40:26,289 --> 00:40:30,409 - 거리 도로의 중간 그래서 분명히 일부 확실하지 아무 말 없다 무엇 - -570 -00:40:30,409 --> 00:40:34,858 - 이 나온 모델의 단지 간단한 종류 그래서 거기에 무슨 일이 있었 - -571 -00:40:34,858 --> 00:40:37,619 - 작년 모델의 이러한 종류의 상단에 작업하려고 많은 사람들이 있었다 - -572 -00:40:37,619 --> 00:40:41,559 - 난 그냥 당신에게 11 레벨의 아이디어를 제공하고자 그들을 더 복잡하게 - -573 
-00:40:41,559 --> 00:40:44,929 - 흥미로운 단지 사람들이 기본 아키텍처를 연주하는 방법에 대한 아이디어를 얻을 수 - -574 -00:40:44,929 --> 00:40:51,329 - 그래서 이것은 현재 모델에서 발견 경우 지난해 종이는 우리 - -575 -00:40:51,329 --> 00:40:55,608 - 단지 처음에 시간을 이미지로 한 시간을 공급 한 경우를 - -576 -00:40:55,608 --> 00:40:59,480 - 이 놀 수있는 것은 실제로 다시 볼 수있는 난폭 한 재발 성 신경 네트워크입니다 - -577 -00:40:59,480 --> 00:41:03,130 - 무선 않는 작동 기술 화상의 화상 및 참조 부 - -578 -00:41:03,130 --> 00:41:07,180 - 당신이 허용 등이 모든 단어를 생성하는 등의 단어가 없습니다 - -579 -00:41:07,179 --> 00:41:10,460 - 실제로 이미지 옆 모습을하고 다른 기능을 찾아 - -580 -00:41:10,460 --> 00:41:13,470 - 그것은 다음에 설명 할 수 있습니다 당신은 실제로 완전히에서이 작업을 수행 할 수있는 작업 - -581 -00:41:13,469 --> 00:41:17,899 - 그들은 단지이 말뿐만 아니라 측면을 생성하지 않도록 학습 가능한 방법 - -582 -00:41:17,900 --> 00:41:21,289 - 여기서 이미지에 다음보고하는 등이 작동하는 방식 만을 수행하지 않습니다 - -583 -00:41:21,289 --> 00:41:24,259 - 아웃 아르 논하지만 당신은 아마 다음 하나의 시퀀스에 대한 분배있어 - -584 -00:41:24,260 --> 00:41:29,250 - 하지만 제공이 오는 당신은 발륨은 우리가 전달이 경우 말을 않는 - -585 -00:41:29,250 --> 00:41:37,389 - 512 활성화 부피 (512) (14)에 의해 14를 얻었고에서 모든 및 주석 - -586 -00:41:37,389 --> 00:41:40,179 - 우리는 단지 그 분포를 인정하지 않습니다하지만 당신은 또한을 방출 한 시간 - -587 -00:41:40,179 --> 00:41:44,358 - 모양까지 키처럼 좀입니다 오백열둘 차원 사진 - -588 -00:41:44,358 --> 00:41:48,019 - 당신은 이미지 옆에 그래서 실제로 나는이 생각하지 않습니다 찾기 위해 원하는 것을 - -589 -00:41:48,019 --> 00:41:51,210 - 그들은이 특별한 종이에 무슨 짓을하지만, 이것은 당신이 연결할 수 있습니다 한 방법입니다 - -590 -00:41:51,210 --> 00:41:54,510 - 이 위로이 사진을보고 뭔가는 아르 논에서 방출되는 단지 - -591 -00:41:54,510 --> 00:41:58,430 - 그냥 약간의 무게와 다음이 그림은 점 수를 사용하여 예측처럼 - -592 -00:41:58,429 --> 00:42:03,618 - 제품이 모든 (14) (14)에 의해 위치가 그래서 우리는 이러한 모든 점 제품을 함께 - -593 -00:42:03,619 --> 00:42:09,108 - 우리는 우리가 지금 우리가 다음 우리 (14)의 호환성에 의해 기본적으로 14 계산 달성 - -594 -00:42:09,108 --> 00:42:13,949 - 그것은 모두 당신의 있도록 그래서 기본적으로 우리는이 모든 것을 정상화 이것에 부드러운 최대를 넣어 - -595 -00:42:13,949 --> 00:42:17,149 - 이 14 (14)에 의해, 그래서 우리는 이미지를 통해 긴장 부르는이를 얻을 수 - -596 -00:42:17,150 --> 00:42:21,230 - 아마 이미지에 지금 아르 논에 대한 흥미로운 내용을 통해지도, - -597 -00:42:21,230 --> 00:42:25,889 - 우리는이와이 사람의 가중 합을 수행하라는 메시지가이 문제를 사용 - -598 -00:42:25,889 --> 00:42:27,239 - 현출 - -599 -00:42:27,239 --> 00:42:30,929 - 그래서 오늘 아침은 기본적으로는 어떻게 생각하는지의 신화는 현재 수 - -600 -00:42:30,929 --> 00:42:36,089 - 그것에 대한 흥미가 돌아갑니다 당신은의 가중 합을하고 결국 - -601 -00:42:36,090 --> 00:42:39,850 - 엘리스 팀이 시점에서보고 싶은 기능의 종류 - -602 -00:42:39,849 --> 00:42:44,809 - 시간 등 섬의 생성 물건, 예를 들어 그것을 결정할 수 있습니다 - -603 -00:42:44,809 --> 00:42:49,400 - 지금과 같은 객체에 대한보고 싶은 그 확인은 벡터 파일을 인정 - -604 -00:42:49,400 --> 00:42:53,220 - 물건 같은 개체의 숫자는이 때의 정액과 상호 작용 - -605 -00:42:53,219 --> 00:42:57,379 - 위원회 주석 어쩌면 그 지역 같은 개체의 일부는 오는 - -606 -00:42:57,380 --> 00:43:01,700 - 점등 및 천장처럼 떨어지는 정품 인증에서이지도를 참조 - -607 -00:43:01,699 --> 00:43:05,949 - 4514 화나게하고 당신은 그 부분에 관심을 집중 결국 - -608 -00:43:05,949 --> 00:43:10,059 - 이 상호 작용을 통해 그래서 당신은 기본적으로 그냥 할 수있는 조회 이미지 - -609 -00:43:10,059 --> 00:43:14,130 - 이미지에 당신은 문장을 설명하고 그래서이 뭔가 우리 동안 - -610 -00:43:14,130 --> 00:43:17,360 - 부드러운 구금으로 참조 실제로 몇 강연이가는 것 - -611 -00:43:17,360 --> 00:43:21,050 - 그래서 우리는 군대가 실제로하지 않은 수있는이 같은 일을 다루려고 - -612 -00:43:21,050 --> 00:43:26,880 - 선택적 입력을 처리하는 등의 수입을 통해 관심과 그 그래서 I - -613 -00:43:26,880 --> 00:43:30,030 - 그냥 당신에게 그 무엇의 미리보기를 제공하기 위해 약 한 시간 그것을 가지고 싶어 - -614 -00:43:30,030 --> 00:43:34,490 - 우리가 중 한 가지 방법으로 우리의 삶을 더 복잡하게하려면 이제 괜찮아 보이는 - -615 -00:43:34,489 --> 00:43:39,259 - 이 당신을 제공합니다, 그래서 우리가 그 층을 쌓아하는 것입니다 할 수있는 당신이 더 많은 것을 알고 - -616 -00:43:39,260 --> 00:43:43,570 - 깊은 물건은 일반적으로 더 나은 우리가에 가지 방법 중 하나를이를 시작하는 방법을 작동 - -617 -00:43:43,570 --> 00:43:46,809 - 적어도 당신은 재발 성 신경 네트워크를 쌓을 수 많은 방법이있다 그러나이 - -618 -00:43:46,809 --> 00:43:49,409 - 사람들이 당신이 할 수 실제로 사용하는 것이 바로 그 중 
614
00:43:30,030 --> 00:43:34,490
 - 자, 이제 모델을 더 복잡하게 만들 수 있는 또 다른 방법을 볼게요.

615
00:43:34,489 --> 00:43:39,259
 - 바로 layer들을 쌓아 올리는(stacking) 것입니다.

616
00:43:39,260 --> 00:43:43,570
 - 더 깊게 쌓으면 보통 더 잘 작동한다는 것은 다들 아시죠.

617
00:43:43,570 --> 00:43:46,809
 - RNN을 쌓는 방법에는 여러 가지가 있지만,

618
00:43:46,809 --> 00:43:49,409
 - 이것이 사람들이 실제로 사용하는 방법 중 하나입니다.

619
00:43:49,409 --> 00:43:53,339
 - 한 RNN의 hidden state 벡터를 그대로 다른 RNN의 입력으로 연결하는 것이죠.

620
00:43:53,340 --> 00:43:59,170
 - 이 그림을 보면,

621
00:43:59,170 --> 00:44:02,750
 - 시간 축은 수평 방향으로 흐르고, 위쪽 방향으로는 서로 다른 layer들이 쌓여 있습니다.

622
00:44:02,750 --> 00:44:05,960
 - 이 그림에는 세 개의 독립적인 recurrent

623
00:44:05,960 --> 00:44:09,858
 - neural network가 있고, 각각 자신만의 weight 집합을 갖고 있습니다.

624
00:44:09,858 --> 00:44:16,299
 - 서로의 출력을 입력으로 받으면서, 전체가 항상 공동으로 학습됩니다.

625
00:44:16,300 --> 00:44:19,119
 - 첫 번째 RNN을 다 학습시킨 뒤 두 번째를 학습시키는 식이 아니라,

626
00:44:19,119 --> 00:44:22,700
 - 맨 위에서부터 backpropagation이 전체 구조를 타고 내려가는 것입니다.

627
00:44:22,699 --> 00:44:25,980
 - 재귀 식을 조금 더 일반적인 규칙으로 써 보면,

628
00:44:25,980 --> 00:44:29,280
 - 사실 아까와 완전히 똑같은 공식입니다.

629
00:44:29,280 --> 00:44:35,390
 - 각 hidden state는 바로 아래 깊이(depth)의 벡터와

630
00:44:35,389 --> 00:44:39,469
 - 바로 전 시간 단계의 벡터를 받아서, 이 둘을 이어붙인 것에 W 변환을 적용하고

631
00:44:39,469 --> 00:44:40,519
 - tanh로 squash합니다.

632
00:44:40,519 --> 00:44:44,509
 - 이 표기가 조금 헷갈린다면, 이렇게 기억하세요.

633
00:44:44,510 --> 00:44:51,760
 - Wxh*x + Whh*h 는, x와 h를 이어붙인(concatenate) 벡터에

634
00:44:51,760 --> 00:44:56,260
 - 하나의 행렬을 곱하는 것으로 다시 쓸 수 있습니다.

635
00:44:56,260 --> 00:45:03,680
 - x와 h를 쌓아 하나의 열 벡터로 만들면,

636
00:45:03,679 --> 00:45:07,690
 - W 행렬의 첫 번째 부분이 Wxh에 해당하고,

637
00:45:07,690 --> 00:45:12,700
 - 두 번째 부분이 Whh에 해당하게 되는 것이죠.

638
00:45:12,699 --> 00:45:16,099
 - 그래서 입력들을 쌓아 올리고 하나의 W 변환만 쓰는

639
00:45:16,099 --> 00:45:24,759
 - 형태로 같은 식을 표현할 수 있습니다.

640
00:45:24,760 --> 00:45:29,780
 - 지금까지는 hidden state에 시간과 깊이, 두 개의 index를 붙여 쌓아 보았습니다.

641
00:45:29,780 --> 00:45:33,510
 - 층을 쌓는 것과는 별개로, 재귀 공식 자체를

642
00:45:33,510 --> 00:45:37,030
 - 조금 더 복잡하게 만드는 방향도 있습니다.

643
00:45:37,030 --> 00:45:40,300
 - 지금까지 우리는 아주 간단한 재귀 식만 보았는데요,

644
00:45:40,300 --> 00:45:44,480
 - 실제로는 이런 기본적인 공식을 거의 사용하지 않습니다.

645
00:45:44,480 --> 00:45:48,170
 - 기본 RNN 대신 거의 항상 쓰이는 것이

646
00:45:48,170 --> 00:45:52,059
 - LSTM(Long Short-Term Memory)입니다. 요즘 논문들은 기본적으로 모두 이것을 사용하죠.

647
00:45:52,059 --> 00:45:56,500
 - 여러분도 프로젝트에서 RNN을 쓰게 된다면

648
00:45:56,500 --> 00:46:00,989
 - 이 공식을 사용하게 될 거예요. 이 시점에서 강조하고 싶은 것은,

649
00:46:00,989 --> 00:46:04,729
 - LSTM도 결국 재귀 식이라는 점에서는 RNN과 똑같고, 단지

650
00:46:04,730 --> 00:46:09,050
 - 조금 더 복잡한 함수일 뿐이라는 것입니다. 여전히 아래 깊이의 벡터와

651
00:46:09,050 --> 00:46:13,789
 - 같은 깊이의 이전 시간 단계 벡터를 입력으로 받고,

652
00:46:13,789 --> 00:46:18,309
 - W 변환을 통해 연산하는 것은 같습니다. 다만 이제는

653
00:46:18,309 --> 00:46:21,869
 - 새로운 hidden state를 만들어내는 방식이

654
00:46:21,869 --> 00:46:25,539
 - 조금 더 복잡해졌을 뿐입니다.

655
00:46:25,539 --> 00:46:28,900
 - 바로 state를 업데이트하지 않고, 그 전에 몇 단계를 더 거치는 것이죠.

656
00:46:28,900 --> 00:46:33,050
 - 왜 정확히 이런 공식인지, 이것이 왜 더 좋은지는

657
00:46:33,050 --> 00:46:41,609
 - 이제부터 몇 가지 세부 사항과 함께 살펴보겠습니다.

658
00:46:41,608 --> 00:46:49,909
 - 일단 지금은 절 믿고 따라와 주세요.

659
00:46:49,909 --> 00:46:56,480
 - LSTM을 다룬 온라인 영상이나 Google 이미지 검색에서 이런 다이어그램들을 보게 되는데,

660
00:46:56,480 --> 00:47:00,989
 - 저는 처음 봤을 때 별로 도움이 되지 않았어요.

661
00:47:00,989 --> 00:47:04,048
 - 정말 무섭게 생겼죠. 이걸 그린 사람이 무슨 일이 일어나는지 제대로 알고 있었는지도 잘 모르겠어요.

662
00:47:04,048 --> 00:47:08,170
 - 저는 LSTM을 이해하고 있는데도 아직 이 두 다이어그램이 무엇을 말하는지 모르겠거든요.

663
00:47:08,170 --> 00:47:14,289
 - 그래서 LSTM을 직접 분해해서 살펴보려고 합니다. 다소 까다로운 내용이니,

664
00:47:14,289 --> 00:47:18,329
 - 그림에 기대기보다 천천히 단계별로 따라가 보죠.
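
(역자주: 바로 아래 665번부터 단계별로 살펴볼 LSTM의 한 시간 단계를 미리 코드로 정리하면 대략 다음과 같습니다. 위에서 설명한 것처럼 x와 h를 이어붙여 하나의 W로 곱하는 형태이며, 크기 n, d는 설명을 위한 가정입니다.)

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x, h, c, W):
    n = h.shape[0]
    z = np.dot(W, np.vstack([x, h]))   # [x; h]를 이어붙여 4n개의 숫자를 계산
    i = sigmoid(z[0:n])                # input 게이트 (0~1)
    f = sigmoid(z[n:2*n])              # forget 게이트 (0~1): 0이면 cell을 리셋
    o = sigmoid(z[2*n:3*n])            # output 게이트 (0~1)
    g = np.tanh(z[3*n:4*n])            # cell에 더해질 값 (-1~1)
    c = f * c + i * g                  # cell state 업데이트 (요소별 곱셈)
    h = o * np.tanh(c)                 # cell의 일부만 hidden state로 내보냄
    return h, c

n, d = 5, 4                            # hidden 크기 n, 입력 크기 d (가정)
W = np.random.randn(4 * n, d + n) * 0.01
x, h, c = np.random.randn(d, 1), np.zeros((n, 1)), np.zeros((n, 1))
h, c = lstm_step(x, h, c, W)
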
-665 -00:47:18,329 --> 00:47:24,220 - 형식은 우리가 미국의 방정식이 있고 난 그래서 여기에없는 스팀 확인을 위해 완벽하다 - -666 -00:47:24,219 --> 00:47:28,238 - 우리는이 두 벡터를 가지고 위치를 상단에 여기에 첫 번째 부분에 초점을 맞출 것 - -667 -00:47:28,239 --> 00:47:32,720 - 아래로부터의 상태에서 이렇게 X와 HHS 이전 전에 사고 있지만, - -668 -00:47:32,719 --> 00:47:37,848 - 우리는 변환 W를 통해 지금 모두 잭슨 href가 크기 경우를 만났다 - -669 -00:47:37,849 --> 00:47:40,950 - 그래서 우리는 어떤을 위해 생산 끝날거야 숫자를 보낼있다 - -670 -00:47:40,949 --> 00:47:46,068 - (21)에 의해 제시되었다이 w 매트릭스를 통해 확인 번호는 그래서 우리는 이러한이 - -671 -00:47:46,068 --> 00:47:51,108 - 그들이 입력 짧은 것 OMG 경우 사 및 차원 벡터 나가뿐만 - -672 -00:47:51,108 --> 00:47:57,328 - 그리고 G는 나는 당신과 그렇게 ISI없이 신호를 통과 단지를 무엇 확실하지 않다 - -673 -00:47:57,329 --> 00:48:05,859 - 게이트 및 G는 방법에게 지금이 실제로 작동이 길을 똑바로 세입자 게이트로 이동 - -674 -00:48:05,858 --> 00:48:09,420 - 그것에 대해 생각하는 가장 좋은 방법은 내가 깜빡 한 가지가 실제로 언급하는 것입니다 - -675 -00:48:09,420 --> 00:48:15,028 - 이전 슬라이드는 일반적으로 하나의 HVAC 시도 말합니다 할 네트워크를 필요로하지 않습니다 - -676 -00:48:15,028 --> 00:48:18,018 - 매번 중지하고 그에게 물었다 실제로 두 벡터 모든이 - -677 -00:48:18,018 --> 00:48:23,618 - 한 시간 때문에 우리는 세포 상태 벡터를 참조 전화를 매도록 - -678 -00:48:23,619 --> 00:48:29,470 - 시간 단계는 우리가 위험에 두 기관이 있고 그리고에서와 같이 여기 벡터를 참조하십시오 - -679 -00:48:29,469 --> 00:48:33,558 - 노란색 그래서 우리는 기본적으로 두 벡터 여기 공간에있는 모든 단일 지점을 가지고 - -680 -00:48:33,559 --> 00:48:37,849 - 그들이하는 일은 그들이 기본적 그래서이 셀 상태에서 작동하고있다 - -681 -00:48:37,849 --> 00:48:41,680 - 전에 당신 아래의 내용에 따라 해당 사용자 컨텍스트 당신은 결국 - -682 -00:48:41,679 --> 00:48:45,199 - 이들과 함께 세포 상태에서 작동 - -683 -00:48:45,199 --> 00:48:50,509 - 그리고 옹 요소와 그것에 대해 생각하는 새로운 방법 내가 통해 갈거야된다 - -684 -00:48:50,510 --> 00:48:58,290 - 이 0 또는 1 우리가 원하는 I NO처럼 이진 않습니다에 대해이 방법을 많이 생각합니다 - -685 -00:48:58,289 --> 00:49:01,199 - 그들에게 우리가 그들을 게이트의 해석이 생각하고 싶다 갖고 싶어 할 수 - -686 -00:49:01,199 --> 00:49:05,449 - 영웅이 그들이다 그것의로 우리는 물론 우리가 원하기 때문에 그들에게 이상 신호를 만들 - -687 -00:49:05,449 --> 00:49:08,348 - 우리는하지만, 모든 것을 통해 전파 백업 할 수 있도록이 미분 될 수 있습니다 - -688 -00:49:08,349 --> 00:49:11,960 - 우리의 상황에 기반을 계산 한 바로 진 것들로 이노 생각 - -689 -00:49:11,960 --> 00:49:17,740 - 항상 여기서 뭘에서이 참조 다음 당신은 무엇을 기준으로 그를 볼 수있는 - -690 -00:49:17,739 --> 00:49:22,250 - 이 문은 다음과 디아즈 우리는이 페이지의 값을 데이트 끝날거야 무슨 - -691 -00:49:22,250 --> 00:49:29,289 - 특히이 에피소드는 TUS을 종료하는 데 사용됩니다 게이트를 잊지 - -692 -00:49:29,289 --> 00:49:34,869 - (20) 태양 전지 등의 보호소 가장 생각되는 세포들을 재설정 - -693 -00:49:34,869 --> 00:49:38,700 - 우리와 함께 (20)이 상호 작용보다 기본적으로 우리가 할 수있는 하나 최근 이러한 카운터 - -694 -00:49:38,699 --> 00:49:42,368 - 이것은 자신의 레이저 포인터가 부족합니다 곱셈의 요소입니다 - -695 -00:49:42,369 --> 00:49:45,530 - 배터리 때문에 - -696 -00:49:45,530 --> 00:49:50,140 - 상호 작용 0 당신은 우리가를 재설정 할 수 있도록 그 셀을 제로 것이다 볼 수 있습니다 - -697 -00:49:50,139 --> 00:49:53,969 - 카운터 그리고 우리는 또한 우리는이를 통해 추가 할 수있는 카운터에 추가 할 수 있습니다 - -698 -00:49:53,969 --> 00:50:00,459 - 상호 작용 I 번 G와 11 사이와 G는 부정적 일 사이이기 때문에 - -699 -00:50:00,460 --> 00:50:05,900 - (10)에 기본적으로 한 12 매 있도록 모든 세포 사이의 숫자를 추가 - -700 -00:50:05,900 --> 00:50:09,338 - 우리는이를 재설정 할 수있는 모든 세포에서 이러한 카운터를 하나의 시간 단계 - -701 -00:50:09,338 --> 00:50:13,588 - 국가 2012 케이트를 잊어 버렸거나 우리는 하나 사이의 숫자를 추가 할 수 있습니다 - -702 -00:50:13,588 --> 00:50:18,039 - 12 그래서 확인을 하나 하나 셀은 우리가 다음 셀 업데이트 및 수행 방법 - -703 -00:50:18,039 --> 00:50:24,029 - 업데이트가 찌그러 세포 그렇게 10 HFC는 셀을 숙청되고 끝 머리 - -704 -00:50:24,030 --> 00:50:28,760 - 그렇게 만 셀 상태의 일부와 위로로 누출이 업데이트에 의해 변조 - -705 -00:50:28,760 --> 00:50:33,500 - 숨겨진 상태가이 벡터에 의해 변조 오 그래서 우리는 단지의 일부를 공개 선택 - -706 -00:50:33,500 --> 00:50:39,530 - 암탉 상태와 학습 가능 방법으로 세포는 몇 가지가있다 - -707 -00:50:39,530 --> 00:50:43,910 - 에 하이라이트의 종류 여기에 아마 여기에 가장 혼란스러운 부분에 우리가 걸이다 - -708 -00:50:43,909 --> 00:50:47,500 - 여기에 D I 배 하나 하나 사이의 숫자를 추가하지만 가지의 - -709 -00:50:47,500 --> 00:50:51,809 - 우리는 단지 
거기 G가 있다면 대신 다음 이미 사이에 이름 : Jeez 때문에 혼란 - -710 -00:50:51,809 --> 00:50:56,679 - 8 11 왜 우리는 내가 여러 번 G 무엇을하지 실제로 우리가 제공하는 모든 필요합니까 우리 - -711 -00:50:56,679 --> 00:50:58,279 - 원하는에 의해 바다를 구현하는 것입니다 - -712 -00:50:58,280 --> 00:51:02,330 - 하나 하나 사이의 숫자는 그래서는 대한 내 성 부품의 종류의 - -713 -00:51:02,329 --> 00:51:08,989 - 마지막으로 내가 한 대답은 당신이 G에 대해 생각하면 그것의 기능 있다고 생각합니다 - -714 -00:51:08,989 --> 00:51:16,159 - 당신의 문맥의 선형 함수는 하나의 기회가 오른쪽으로 레이저 프린터가 없습니다 - -715 -00:51:16,159 --> 00:51:26,649 - 확인 그래서 G는 G 그래서 확인을 지역 310 세의 함수의 선형 함수로 - -716 -00:51:26,650 --> 00:51:30,579 - 우리가 청바지를 추가 한 경우 10 시간 등에 의해 숙청 이전에 접촉하는 경우 - -717 -00:51:30,579 --> 00:51:35,349 - 추가하여, 그래서 나는 시간 그녀는 그 종류의 매우 간단한 함수 같은 것 - -718 -00:51:35,349 --> 00:51:38,929 - 이 난 후 실제로 더 있어요 곱셈 상호 작용을 갖는 - -719 -00:51:38,929 --> 00:51:42,710 - 실제로 우리가 추가하는 것을 표현 할 수 있습니다 풍부한 기능 - -720 -00:51:42,710 --> 00:51:47,010 - 이전 테스트의 기능을 생각하는 또 다른 방법으로 상태를 몸통 - -721 -00:51:47,010 --> 00:51:50,620 - 이 약이 기본적으로 방법이 두 개념을 분리하는 것 - -722 -00:51:50,619 --> 00:51:54,159 - 많은 우리가 G 인 셀 상태로 추가 싶어하고 우리가 원하는 수행 - -723 -00:51:54,159 --> 00:51:58,129 - 나는 우리가 실제로 무엇을이 조작 가능성이 있으므로 모든 상태를 해결 - -724 -00:51:58,130 --> 00:52:03,280 - 또한 될 수 있음이 두 디커플링에 의해 통해 천재 우리가 원하는 이동 - -725 -00:52:03,280 --> 00:52:08,470 - 동적 측면에서 몇 가지 좋은 특성을 가지고 어떻게이 모든 증기 기차하지만, - -726 -00:52:08,469 --> 00:52:12,039 - 우리는 단지 그 오스틴 공식처럼 결국 나는 실제로 갈거야 - -727 -00:52:12,039 --> 00:52:14,059 - 자세한 세부 사항에서이뿐만 아니라 통해 - -728 -00:52:14,059 --> 00:52:21,400 - 확인 상기 제 1 상호 작용 이제 셀 C가 흐르는으로 이것에 대해 생각하고 - -729 -00:52:21,400 --> 00:52:28,269 - 여기 그래서 경제적으로 그 시그 모이 약간의 DOTC 그렇게 노력하다 - -730 -00:52:28,269 --> 00:52:32,559 - 곱셈의 상호 작용으로 자신을 게이팅 F 제로는 것입니다 그래서 만약 - -731 -00:52:32,559 --> 00:52:38,409 - 셀을 차단하고 세포학 부분이 기본적으로 제공되는 카운터를 재설정 - -732 -00:52:38,409 --> 00:52:44,799 - 당신은 완은 기본적으로 하위 상태 누수가 유일한 상태로 추가하고있다 - -733 -00:52:44,800 --> 00:52:51,100 - 언덕 상태로하지만 너무 의해 문이 가도록 한 후 10 시간 통해 - -734 -00:52:51,099 --> 00:52:55,380 - 전기 만 결정 사실로 밝혀 몇 가지 상태에있는 부품 - -735 -00:52:55,380 --> 00:52:59,610 - 매각하지 않았다 숨겨진 그리고 당신은 알 수가이 고속도로뿐만 아니라, - -736 -00:52:59,610 --> 00:53:03,720 - STM의 다음 반복으로 이동뿐만 아니라 실제로까지 폐쇄 - -737 -00:53:03,719 --> 00:53:07,159 - 상위 계층이 우리가 실제로 종료 상태 교리의 머리이기 때문에 - -738 -00:53:07,159 --> 00:53:11,250 - 까지 우리 위에 팀으로보고하거나이 예측에 간다 - -739 -00:53:11,250 --> 00:53:14,510 - 이 기본적으로 방법을 풀다 때 그래서 그것이 가지처럼 보이는 - -740 -00:53:14,510 --> 00:53:19,270 - 지금은 내 자신 그게 전부의 혼란도를 가지고있는이 나는 우리가 끝난 것 같아요 - -741 -00:53:19,269 --> 00:53:24,550 - 그러나 아래에서 입력 벡터를 얻을 수와 최대 당신은 당신의 자신의 상태에서이 - -742 -00:53:24,550 --> 00:53:26,090 - (248) - -743 -00:53:26,090 --> 00:53:31,030 - 그들은 다음 차원 벡터 및 모든 거 알아 fije 네 성문을 결정 - -744 -00:53:31,030 --> 00:53:35,110 - 는 셀 상태에서 동작하고, 셀의 상태가 변조 방법을 종료 - -745 -00:53:35,110 --> 00:53:38,610 - 당신이 한 번 실제로 우리는 일부 국가를 설정하고 하나 사이에 번호를 추가하면 - -746 -00:53:38,610 --> 00:53:42,630 - (12) 국가의 셀 상태는 그것의 일부는 학습 가능에서 누수 밖으로 누출 - -747 -00:53:42,630 --> 00:53:45,840 - 방법 및 다음 중 하나를 예측까지 갈 수 또는 다음에 갈 수 있습니다 - -748 -00:53:45,840 --> 00:53:52,269 - 미국 팀의 반복은 향후 그래서 그게 그렇게이 그렇게 추한 모습입니다 - -749 -00:53:52,269 --> 00:53:58,429 - 문제는 당신의 마음에 아마 그래서 우리는 거 야 우리가 간다 않은 이유입니다 - -750 -00:53:58,429 --> 00:54:02,649 - 이 특별한 방법 I에서이 Look을 수행하는 이유가 뭔가의 모든 통해 - -751 -00:54:02,650 --> 00:54:05,639 - 알고 싶어한다 분석가 많은 다양한 있다는 것을이 시점이 - -752 -00:54:05,639 --> 00:54:09,309 - 이 시점하지만 강의 사람들의 말은 이런 식으로 많이 연주 - -753 -00:54:09,309 --> 00:54:12,840 - 우리는 종류의 합리적인 것 같은 것으로이에 수렴했지만 - -754 -00:54:12,840 --> 00:54:15,510 - 당신이 실제로하지 않는이에 수 많은 작은 비틀기가있다 - -755 -00:54:15,510 --> 00:54:18,930 - 당신 같은 사람들 게이트의 일부를 제거 할 수 있습니다 많은하여 성능을 저하 - -756 -00:54:18,929 
--> 00:54:20,359 - 아마 연루 등 - -757 -00:54:20,360 --> 00:54:25,200 - 당신은 할 수의 악취가이 바다가 될 수 볼 밝혀 그것을 잘 작동합니다 - -758 -00:54:25,199 --> 00:54:28,619 - 일반적으로하지만 좌석의 어린 나이로 때로는 약간 더 있었다 I - -759 -00:54:28,619 --> 00:54:33,869 - 우리는 CSI가의 비트와 함께 결국 왜를위한 아주 좋은 이유가 생각하지 않습니다 - -760 -00:54:33,869 --> 00:54:37,039 - 괴물하지만 실제로 좀 법무부 카운터의 측면에서 의미가 생각 - -761 -00:54:37,039 --> 00:54:40,739 - 그 0으로 재설정 할 수 있습니다 또는 당신은 하나 (12)을 사이에 작은 숫자를 추가 할 수 있습니다 - -762 -00:54:40,739 --> 00:54:46,039 - 지금은 좋은 실제로 비교적 단순한 이해하는 것처럼 그렇게는 가지이다 - -763 -00:54:46,039 --> 00:54:49,300 - 이것은 우리 자신보다 훨씬 더 그리고 우리는 약간에 가야 정확하게 이유 - -764 -00:54:49,300 --> 00:54:55,330 - 다른 그림은 재발 성 신경 있도록 구별을 그립니다 - -765 -00:54:55,329 --> 00:54:59,259 - 어떤 상태 벡터 권리가 네트워크 당신은 그것을 통해 운영하고 있고이있어 - -766 -00:54:59,260 --> 00:55:02,260 - 완전히이 재발 식을 통해로 변신 그래서 당신은 종료 - -767 -00:55:02,260 --> 00:55:06,280 - 시간 물건 시간에서 상태 벡터를 변경까지 당신은 미국 것을 알 수 있습니다 - -768 -00:55:06,280 --> 00:55:11,140 - 팀 대신 셀 미국이 흐르는 우리가 효과적으로 무슨 일을하고있다 - -769 -00:55:11,139 --> 00:55:15,250 - 우리는 세포에서 찾고 그것의 일부는 국가의 머리에 누수로 - -770 -00:55:15,250 --> 00:55:19,329 - 우리가 이득을 다음 잊어 버린 경우 셀에서 동작하는 방법을 결정하는 상태 - -771 -00:55:19,329 --> 00:55:22,869 - 기본적으로 그냥하여 셀을 조정 끝 - -772 -00:55:22,869 --> 00:55:28,509 - 함수로 쳐다 보면서 몇 가지 물건이 그래서 그래서 여기 활성 상호 작용 - -773 -00:55:28,510 --> 00:55:33,040 - 우리는 영혼의 상태를 변경 결국 그것이 무엇이든 셀 상태의 다음 - -774 -00:55:33,039 --> 00:55:37,190 - 대신 바로이 첨가제는 대신, 그래서 그것을 변환의 - -775 -00:55:37,190 --> 00:55:38,429 - 변형 - -776 -00:55:38,429 --> 00:55:42,929 - 그런 상호 작용이나 뭐 이제이 실제로 뭔가 당신을 생각 나게한다 - -777 -00:55:42,929 --> 00:55:48,839 - 우리가 이미 염두에두고 클래스에 적용되었음을 그, 그래 맞아 - -778 -00:55:48,840 --> 00:55:53,240 - 그래서이 같은 사실은 고체와 같은 일이 이렇게 기본적으로 직렬 공진입니다 - -779 -00:55:53,239 --> 00:55:56,299 - 일반적으로 우리가 표현 거주자가 변화하고 진정으로 - -780 -00:55:56,300 --> 00:56:00,019 - 여기에이 스킵 연결 및 당신은 기본적으로 주민들이를 볼 수 있습니다 - -781 -00:56:00,019 --> 00:56:04,690 - 우리가 지금 여기이 X이 때문에 첨가제의 상호 작용 우리는 약간의 계산에 기초 않는다 - -782 -00:56:04,690 --> 00:56:10,240 - 다음 섹스 그리고 우리는 행위와 첨가제의 상호 작용을 가지고 있고 그래서는이다 - -783 -00:56:10,239 --> 00:56:12,959 - 같은 멋진로 발생하는 기본 주민들의 블록과 그 사실의 - -784 -00:56:12,960 --> 00:56:18,440 - 물론 우리는 우리가 여기있어 이러한 상호 작용을 가지고 전은 세포이며, 우리가 간다 - -785 -00:56:18,440 --> 00:56:22,619 - 다음 몇 가지 기능은 당신과 떨어져 우리는이 세포 상태 만에 추가 할 수 - -786 -00:56:22,619 --> 00:56:26,900 - LSD와는 달리 주민들은 또한 추가 된 날짜를 잊지하시기 바랍니다있다 - -787 -00:56:26,900 --> 00:56:31,519 - 이뿐만 아니라 신호의 일부를 차단하도록 선택할 경우 제어를 잊지 있지만, - -788 -00:56:31,519 --> 00:56:33,679 - 그렇지 않으면 나는 그것이 가지 생각 때문에 대통령처럼 매우 보인다 - -789 -00:56:33,679 --> 00:56:36,710 - 보고 아키텍처와 매우 유사 종류에 수렴하고 그 재미 - -790 -00:56:36,710 --> 00:56:40,429 - 보인다 곳은 재발 성 신경 네트워크에서 끝의 두 소득을 작동 - -791 -00:56:40,429 --> 00:56:43,809 - 같은 동적으로 어떻게 든 실제로 이러한 첨가제를 가지고 훨씬 좋네요이다 - -792 -00:56:43,809 --> 00:56:48,739 - 당신이 실제로 훨씬 더 효과적으로 그렇게 전파 할 수 있도록 상호 작용 - -793 -00:56:48,739 --> 00:56:49,779 - 그 시점에 - -794 -00:56:49,780 --> 00:56:53,860 - 분석 팀 사이의 뒷면 전파 역학에 대해 생각 - -795 -00:56:53,860 --> 00:56:57,760 - 특히 미국 팀에 좀 그라디언트를 주입하면 매우 명확하고 - -796 -00:56:57,760 --> 00:57:01,120 - 가끔 내가 생기를 주입하고이 그림의 끝을 보자, 그래서 만약 여기에 - -797 -00:57:01,119 --> 00:57:05,239 - 다음이 플러스 상호 작용은 바로 여기 그냥 재료 고속도로처럼 - -798 -00:57:05,239 --> 00:57:09,299 - 이 동영상은 모든 탭 추가 상호 작용 오른쪽으로 흐르는 것 같은 - -799 -00:57:09,300 --> 00:57:13,240 - 내가 그라데이션 시간의 어느 지점을 연결하는 경우 버전은 동일하므로 분산 때문에 - -800 -00:57:13,239 --> 00:57:16,849 - 여기에 단지 물론 그라데이션도 다시 모든 방법을 날려 가고 - -801 -00:57:16,849 --> 00:57:20,809 - 이러한 행위를 통해 흘러 그들이에 자신의 재료를 기여 결국 - -802 -00:57:20,809 --> 00:57:25,630 - 독서 흐름합니다하지만 당신은 우리가 우리의 강렬한으로 참조 무​​엇으로 끝낼 수 없을거야 - -803 -00:57:25,630 --> 00:57:30,110 - 이 그라디언트 
그냥 제로로 이동을 사망 어디에 문제가 지역 사라지는라고 - -804 -00:57:30,110 --> 00:57:32,880 - 당신은 다시 통해 전파 내가 예를 보여 드리겠습니다로 - -805 -00:57:32,880 --> 00:57:36,640 - 완전히이 조금 수중 음파 탐지기에서 발생하는 이유 떨어져 지금 우리는이 배니싱이 - -806 -00:57:36,639 --> 00:57:40,670 - 나는 당신을 보여줄 것 그라데이션 문제는 이유는이 때문에 애널리스트 오전 발생 - -807 -00:57:40,670 --> 00:57:45,210 - 그냥 판의 고속도로 매 시간 단계의 이러한 구배가 - -808 -00:57:45,210 --> 00:57:47,130 - 우리는 위의 미국 팀에 주입 - -809 -00:57:47,130 --> 00:57:54,829 - 그냥 세포를 통과하고 등급이에서 마무리 결국하지 않습니다 - -810 -00:57:54,829 --> 00:57:57,339 - 어쩌면 내가 몇 가지 질문을 가리 혼란 기능에 대한 질문이 있습니다 - -811 -00:57:57,338 --> 00:58:01,849 - 여기하지만 마지막으로 한 다음 그 후 나는 arnaz가에 있었던 이유에 갈거야 - -812 -00:58:01,849 --> 00:58:03,059 - 그린 즈 버러 - -813 -00:58:03,059 --> 00:58:09,789 - 예 000 벡터가 중요한 것입니다 - -814 -00:58:09,789 --> 00:58:13,400 - 내가 하나가 특별히 매우 중요 아니라고 생각 밝혀 - -815 -00:58:13,400 --> 00:58:16,660 - 나는 스페이스 오디세이 그들이 대답 할 다른 무엇을 보여 드리겠습니다 종이가있다 - -816 -00:58:16,659 --> 00:58:21,719 - 정말 거기에이 걸릴 물건 아웃하지만 물건을 연주 또한 같은있다 - -817 -00:58:21,719 --> 00:58:25,588 - 당신이 그렇게이 셀 상태가 여기에있을 수 추가 할 수 있습니다 사람들의 연결 - -818 -00:58:25,588 --> 00:58:29,538 - 사람들이 정말 재생할 수 있도록 실제로 입력으로 더 나은 숨겨진 상태에 넣어 - -819 -00:58:29,539 --> 00:58:32,049 - 이 아키텍처 그들은 바로 이러한 반복을 많이 시도 - -820 -00:58:32,048 --> 00:58:37,230 - 방정식과 거의 모든 약 동일한 일부 작동 당신이 우리와 끝까지 - -821 -00:58:37,230 --> 00:58:40,490 - 그것을 우리는 약간은 매우 가지 혼란이있는, 그래서 때로는 있었다있어 - -822 -00:58:40,489 --> 00:58:45,699 - 그들은했다 어디 용지를 표시하려면이 방법은 그들이 DS 업데이트를 처리 - -823 -00:58:45,699 --> 00:58:49,538 - 방정식은 업데이트 방정식을 통해 나무를 내장하고있다 그리고 그들은했다 - -824 -00:58:49,539 --> 00:58:52,950 - 이 같은 무작위 돌연변이 물건과 서로 다른 잔디의 모든 종류의 시도 - -825 -00:58:52,949 --> 00:58:57,028 - 사용자가 업데이트 할 수 그들 대부분은 그들 중 일부의 일부를 파괴에 대해 작동 - -826 -00:58:57,028 --> 00:58:59,858 - 정말보다 훨씬 더 않습니다처럼은 동일하지만 아무것도에 대한 작업 - -827 -00:58:59,858 --> 00:59:08,150 - 분석 팀과 질문 재발 성 신경 네트워크가 왜 가고있다 - -828 -00:59:08,150 --> 00:59:15,389 - 또한 끔찍한 역류 비디오 - -829 -00:59:15,389 --> 00:59:22,000 - 와 재발 성 신경 네트워크에서 사라지는 그라데이션 문제를 보여주는 - -830 -00:59:22,000 --> 00:59:29,250 - 모두에 대해 우리가 재발보고있는 것처럼 우리가 여기에 표시하고 줄기 - -831 -00:59:29,250 --> 00:59:33,039 - 많은 기간 많은 시간 단계에 걸쳐 신경망 다음 주입 그라데이션 - -832 -00:59:33,039 --> 00:59:36,760 - 그것은 백 스물여덟번째 시간 단계의 말을 우리는 파산하고 - -833 -00:59:36,760 --> 00:59:40,028 - 네트워크를 통해 재료와 우리는 그라데이션이 무엇인지보고있는 - -834 -00:59:40,028 --> 00:59:44,699 - 용 나는 체중의 입력 타입 숨겨진 매트릭스 하나에 모든 행렬 생각 - -835 -00:59:44,699 --> 00:59:49,009 - 한 시간 간격 때문에 실제로 통해 전체 업데이트를 얻기 위해 그 기억 - -836 -00:59:49,010 --> 00:59:52,289 - 다시 우리가 실제로 여기에 모든 그라디언트를 추가하고 그래서 무엇 무엇이다 - -837 -00:59:52,289 --> 00:59:56,760 - 어떻게 여기에 표시되는 것은 배경으로 우리는 단지에서 성분을 주입하는 것입니다 - -838 -00:59:56,760 --> 01:00:00,799 - 우리가 시간과 강한 조각을 통해 배경을 120 시간 단계 - -839 -01:00:00,798 --> 01:00:04,088 - 그 전파의 당신이보고있는 것은 미국 팀이 당신을 많이 준다이다 - -840 -01:00:04,088 --> 01:00:06,699 - 많이있다, 그래서이 역 전파에 걸쳐 그라데이션 - -841 -01:00:06,699 --> 01:00:11,000 - 단지 바로이 기술을 통해 흐르는되는 정보는 전원 사망 - -842 -01:00:11,000 --> 01:00:15,210 - 그냥 욕심 우리는 추방은 그냥 아무 거기에 작은 숫자가된다라고 - -843 -01:00:15,210 --> 01:00:18,750 - 내가 단계 그렇게되는 시간에 대해 표시를 생각이 경우 너무 그라데이션 - -844 -01:00:18,750 --> 01:00:22,679 - 우리가하지 않았다 주입 모든 정보와 10 배 단계 등 - -845 -01:00:22,679 --> 01:00:26,149 - 네트워크를 통해 흘러 모든 때문에 매우 긴 종속성을 배울 수 있습니다 - -846 -01:00:26,150 --> 01:00:29,720 - 우리가 왜이 볼 수 있도록 상관 관계 구조는 아래가 사망 한 - -847 -01:00:29,719 --> 01:00:39,399 - 조금 동적으로 발생이 채널이 너무 재미 그가처럼 몇 가지 코멘트 - -848 -01:00:39,400 --> 01:00:40,490 - YouTube 또는 뭔가 - -849 -01:00:40,489 --> 01:00:44,779 - 그래 - -850 -01:00:44,780 --> 01:00:53,170 - 확인 그래서 우리가 재발 성 신경 네트워크가 여기 아주 간단한 예를 살펴 보자 - -851 -01:00:53,170 --> 01:00:56,300 - 내가 보여주는 아니에요이 재발 성 신경 
네트워크에 당신을 위해 전개거야 것을 - -852 -01:00:56,300 --> 01:01:03,960 - 우리가있어 모든 입력은 자신의 상태 업데이트가 너무 whaaa 교회와 대기 상태가 - -853 -01:01:03,960 --> 01:01:07,260 - 상호 작용을 칠 숨겨진 나는 기본적으로 재발을 전달하려고 해요 - -854 -01:01:07,260 --> 01:01:12,380 - 신경망 때문에 T-오십를 사용하고 여기에 내가 어떤 차 시간 단계를하지를 않습니다 - -855 -01:01:12,380 --> 01:01:16,260 - 내가 무슨 일을하고있어 WHAS 시간을 그 위에 다음 이전 세입자와 물건과입니다 - -856 -01:01:16,260 --> 01:01:20,570 - 그래서 이것은 모든 입력 벡터를 무시 들어오는 단지 전진 패스입니다 - -857 -01:01:20,570 --> 01:01:25,280 - 단지 WHAS 시간 H 임계 값 WHAS 시간 세이 임계 값 등 - -858 -01:01:25,280 --> 01:01:29,500 - 그 전진 패스의 다음 뒤로 여기가 연출하고있어 여기서 통과 - -859 -01:01:29,500 --> 01:01:33,820 - 마지막 단계에서 여기에 임의의 기울기에 의해 50 시간 단계에서 매우 - -860 -01:01:33,820 --> 01:01:37,880 - 뒤쪽으로 이동 한 후 무작위 및 그라데이션을 주입 나는 그렇게 백업 - -861 -01:01:37,880 --> 01:01:41,059 - 당신은 백업이 권한을 통해 여기 내가 사용하고 있습니다 통해 백업해야 할 때 - -862 -01:01:41,059 --> 01:01:46,170 - 오히려 곱셈 등 400 WH보다 곱셈 어를 통해 배경을 얻을 - -863 -01:01:46,170 --> 01:01:51,800 - 그래서 여기서주의 할 것은 여기에서 매우이다 나는 개발자 브라운 백을하고있는 중이 야 - -864 -01:01:51,800 --> 01:01:54,980 - 수입을 어디에서 관련 바로 잡고 아무것도 통해 전파 - -865 -01:01:54,980 --> 01:02:02,309 - 나는 WH 시간마다 작업을 제로보다 작은 여기서 포기하고 있었다 - -866 -01:02:02,309 --> 01:02:06,570 - 우리가 실제로 WH 행렬 곱 경우 우리는 그렇게 비선형 성을하기 전에 - -867 -01:02:06,570 --> 01:02:09,570 - 당신이 실제로 무슨 일을 볼 때가는 매우 펑키 뭔가가있다 - -868 -01:02:09,570 --> 01:02:13,300 - 당신이 시간을 통해 뒤로 이동으로 NHS의 구배이 DHS에 - -869 -01:02:13,300 --> 01:02:18,160 - 당신이 보는 것처럼 매우 걱정입니다 재미있는 구조의 매우 종류가 있습니다 - -870 -01:02:18,159 --> 01:02:22,210 - 등이 우리가 여기 무슨 일을하는지와 같은 루프에 연결되는 방식 - -871 -01:02:22,210 --> 01:02:33,409 - 두 시간 간격 - -872 -01:02:33,409 --> 01:02:43,849 - 제로 그래 나는 생각하고 가끔 어쩌면 반군이 모든 있었다 출력의 - -873 -01:02:43,849 --> 01:02:47,630 - 죽은 당신을 죽일 수 듯하지만 그건 정말 문제 아니다 - -874 -01:02:47,630 --> 01:02:51,470 - 더 걱정 문제는 그 모든 쇼가 될 것 잘하지만 착용 한 생각 - -875 -01:02:51,469 --> 01:02:55,500 - 사람들이 쉽게 우리가 걸 볼 수 있습니다뿐만 아니라 발견 할 수 있습니다 문제 - -876 -01:02:55,500 --> 01:03:00,380 - 때문에에 또 다시 이상이 whah 행렬 곱 - -877 -01:03:00,380 --> 01:03:04,840 - 앞으로 우리가 매일 반복에 awhh 곱 통과 - -878 -01:03:04,840 --> 01:03:09,670 - 다시 우리가이 전파 결국 모든 숨겨진 상태를 통해 전파 - -879 -01:03:09,670 --> 01:03:13,820 - 무형 문화 유산 konnte 체스와 backrub 어 공식은 실제로 것을 밝혀 - -880 -01:03:13,820 --> 01:03:19,000 - 당신은 whah 행렬 곱 인사말 신호를 가지고 우리는 종료 - -881 -01:03:19,000 --> 01:03:26,199 - 그라데이션이 whah 유지를 곱한 도착까지 그 다음 WH 관계자를 곱한 - -882 -01:03:26,199 --> 01:03:32,019 - 그렇게 우리는 그렇게하지 ​​매트릭스 W​​H 나이 오십 번 곱 결국 - -883 -01:03:32,019 --> 01:03:37,509 - 이 가진 문제는 녹색 신호는 기본적으로 두 가지 경우처럼 일어날 수 있다는 것입니다 - -884 -01:03:37,510 --> 01:03:41,080 - 당신은 아마 규모 행렬없는 스칼라 값 작업에 대한 생각 - -885 -01:03:41,079 --> 01:03:45,469 - 그때 임의의 번호를 가지고 있다면 두 번째 번호가 나는 유지 - -886 -01:03:45,469 --> 01:03:48,509 - 그래서 또 다시 두 번째 숫자에 의해 첫 번째 숫자를 곱한 - -887 -01:03:48,510 --> 01:03:55,990 - 다시 그 순서는 바로 같은 플레이 자신의 경우에 무엇을 이동 않습니다 - -888 -01:03:55,989 --> 01:04:01,849 - 번호 하나 내가 죽거나 아직 경우 두 번째 번호를 정확히 절전 모드로 전환 - -889 -01:04:01,849 --> 01:04:05,119 - 일년 실제로 폭발하지만, 그렇지 않는 경우에만 위치하도록 - -890 -01:04:05,119 --> 01:04:09,679 - 정말 나쁜 일이 죽을 중 하나 일어나고 또는 우리는 우리가 큰이 여기 폭발 - -891 -01:04:09,679 --> 01:04:12,659 - 도시 우리는 하나의 번호가없는 있지만, 사실은이 같은 일이 일어난다이다 - -892 -01:04:12,659 --> 01:04:16,599 - 그것의 일반화는 WHS 장축 반경 스펙트럼에서 일어나는 - -893 -01:04:16,599 --> 01:04:21,839 - 이는 그 행렬의 최대 고유 한 후보다 큰 것이다 - -894 -01:04:21,840 --> 01:04:25,220 - 이 시민은 완전히 사망의 1도 이하의 경우 무선 신호가 폭발 - -895 -01:04:25,219 --> 01:04:30,549 - 그래서 기본적으로 박사 탄 때문에이 재발이 매우 이상한이 있기 때문에 - -896 -01:04:30,550 --> 01:04:34,680 - 공식 우리는 매우 끔찍 역학에 결국 그리고 그것은 매우 불안정입니다 - -897 -01:04:34,679 --> 01:04:39,949 - 그냥 그렇게 연습이 처리 된 방법을 폭발하고 또는 사망했다 - -898 -01:04:39,949 --> 01:04:44,439 
- 당신은 폭발 그라디언트에게 인사말 마치 하나의 간단한 하키를 제어 할 수 있습니다 - -899 -01:04:44,440 --> 01:04:45,720 - 폭발 당신은 그것을 클릭 - -900 -01:04:45,719 --> 01:04:50,789 - 그래서 사람들은 실제로 매우 누덕 누덕 기운 솔루션처럼하지만 경우에이 관행을 - -901 -01:04:50,789 --> 01:04:55,119 - 두 번 다섯 분 노먼 린 크램 펫 (25) 요소 위에합니까을 읽고있는 나 - -902 -01:04:55,119 --> 01:04:58,150 - 당신이 저하되어 클리핑을 수행 할 수 있도록 그런 일이 그 방법을의 - -903 -01:04:58,150 --> 01:05:01,829 - 폭발 등급을 매기는 문제를 해결하고 당신은 당신이 기록하고있어하지 않습니다 - -904 -01:05:01,829 --> 01:05:06,049 - 더 이상 폭발 그러나 녹색당은 여전히​​ 직장과 엘리스에서 카니발에서 사라질 수 있습니다 - -905 -01:05:06,050 --> 01:05:08,310 - 팀 때문에 이들의 사라지는 그라데이션 문제에 아주 좋은 것입니다 - -906 -01:05:08,309 --> 01:05:12,429 - 단지와 첨가제의 상호 작용에 따라 변화되는 세포의 고속도로 - -907 -01:05:12,429 --> 01:05:17,309 - 당신은 당신이이기 때문에 경우에 당신이 경우 구배는 단지 그들이 아래로 죽지 않을 날려 - -908 -01:05:17,309 --> 01:05:21,000 - 이러한 이유 대략이다처럼 같은 나이 또는 무언가에 의해 곱 - -909 -01:05:21,000 --> 01:05:26,909 - 단지 더 동적으로 우리는 항상 팀 그래서 우리는 그라데이션 클리핑을 수행 할 - -910 -01:05:26,909 --> 01:05:30,149 - 일반적으로 달라스 팀의 기울기가 잠재적으로 폭발 할 수 있기 때문에 - -911 -01:05:30,150 --> 01:05:33,400 - 여전히 그들은 일반적으로 사라하지 않는했다 - -912 -01:05:33,400 --> 01:05:48,608 - 재발 성 신경 네트워크뿐만 아니라에 대한 엘리스 팀은 분명하지 않다 어디를 - -913 -01:05:48,608 --> 01:05:53,769 - 당신이 플러그 것입니다 정확히 같은이 식의 명확하지에 뛰어들 것 - -914 -01:05:53,769 --> 01:06:00,619 - 상대적으로 어디에 아마 대신 G에서 월의 많은 다음에 참석하기 때문에 - -915 -01:06:00,619 --> 01:06:08,690 - 여기 huug하지만 재판매는 바로 이렇게 하나의 방향으로 성장할 것 - -916 -01:06:08,690 --> 01:06:11,980 - 어쩌면 당신은 실제로 좋은 아니에요 작게 있도록 만드는 끝낼 수 없다 - -917 -01:06:11,980 --> 01:06:18,539 - 난 당신이 알고있는 가정 아이디어는 이렇게 연결하는 명확한 방법이 없습니다 기본적으로됩니다 - -918 -01:06:18,539 --> 01:06:25,380 - 여기에 행을 너무 좋아 한 것은 나는이 초 고속도로의 측면에서 그 통지 - -919 -01:06:25,380 --> 01:06:29,780 - 네 개의 얻을 문이있을 때이 그라디언트 이러한 관점은 실제로 고장 - -920 -01:06:29,780 --> 01:06:33,310 - 네 개의 얻을 때 때문에 케이트의 우리는 이러한 행위의 일부를 잊을 수있는 곳 - -921 -01:06:33,309 --> 01:06:37,150 - 내가 문을 잊지 때마다 곱셈 상호 작용은 다음에 그것과 세가와 - -922 -01:06:37,150 --> 01:06:41,470 - 다음 그라데이션을 죽이고 물론 역류 때문에 이러한 슈퍼 중단됩니다 - -923 -01:06:41,469 --> 01:06:45,250 - 당신이없는 경우 고속도로 가지 사실 어느 문을 잊지하지만 당신은 경우 - -924 -01:06:45,250 --> 01:06:50,000 - a는 다음 그라디언트를 죽일 수 그들의줬고, 그래서 실제로 잊지했다 - -925 -01:06:50,000 --> 01:06:54,710 - 우리는 우리와 함께 연주 할 때 팀은 우리가 가끔 사람들이 때 가정 오스틴의 사용이다 - -926 -01:06:54,710 --> 01:06:58,099 - 긍정적 인 편견 때문에 함께 초기화에 그들이 처음 잊지 얻을 - -927 -01:06:58,099 --> 01:06:58,769 - 에 의한 - -928 -01:06:58,769 --> 01:07:05,699 - 나에 설정하는 것을 잊지 항상 종류의 내가 처음에 생각 해제 - -929 -01:07:05,699 --> 01:07:08,679 - 그래서 처음에 녹색 아주 잘 이야기하고 미국 팀은 배울 수있는 방법 - -930 -01:07:08,679 --> 01:07:12,779 - 그 해당 바이어스 용으로 나중에 사람들이 재생되도록 한 번에 그들을 차단하기 - -931 -01:07:12,780 --> 01:07:17,530 - 수십 년 때때로 그래서 여기에 지난 밤 나는 그 비용을 언급하고 싶었다 - -932 -01:07:17,530 --> 01:07:21,580 - 공간이 그래서 많은 사람들은 기본적으로이 꽤 플레이 한 - -933 -01:07:21,579 --> 01:07:26,119 - 그들이 아키텍처로 다양한 변화를 시도 오디세이 용지 거기 - -934 -01:07:26,119 --> 01:07:32,829 - 잠재적 인 변화의 큰 숫자 이상이 검색을 수행하려고 여기에 종이 - -935 -01:07:32,829 --> 01:07:36,940 - LST 방정식 그리고 그들은 많은 검색을했고, 그들은 아무것도 찾지 못했습니다 - -936 -01:07:36,940 --> 01:07:42,300 - 그건 그냥 애널리스트 오전 너무 좋아하고있어보다 실질적으로 더 잘 작동 - -937 -01:07:42,300 --> 01:07:45,560 - 또한 상대적으로 실제로 인기가 있고 내가 실제로 것 GRU - -938 -01:07:45,559 --> 01:07:50,159 - 당신이 콜로세움 그것의 변화를 개의 DRU 사용 할 수 있습니다 것이 좋습니다 - -939 -01:07:50,159 --> 01:07:54,460 - 그것은 짧은 점이다 대해도 좋은 상호 작용으로 결정했다 - -940 -01:07:54,460 --> 01:07:59,400 - 작은 공식과 단지 하나있는 테네시을 갖지 않는 트랙터 - -941 -01:07:59,400 --> 01:08:03,130 - 구현은 현명한 단지 하나가 가진 기억 단지 좋네요 있도록 만 H가 - -942 -01:08:03,130 --> 01:08:07,590 - 단지 작은 간단한 일이 같은 앞으로 과​​거 두 가지 요인에 차질 - -943 -01:08:07,590 --> 01:08:12,190 - 그 불쾌한의 혜택의 대부분을 갖고있는 것 같아요하지만 그래서는 GRU과라고 - -944 -01:08:12,190 --> 
01:08:16,730 - 거의 항상 멋진에 대한 내 경험에 작동하고 그래서 당신은 수도 - -945 -01:08:16,729 --> 01:08:19,939 - 그것을 사용하려는 또는 당신은 그들이 모두 좀 동일한 대해 알고 마지막 시간을 사용할 수 있습니다 - -946 -01:08:19,939 --> 01:08:28,088 - 그래서 누군가가 마구는 아주 좋은하지만의 RaWR하고 실제로하지 않는 것입니다 - -947 -01:08:28,088 --> 01:08:29,130 - 아주 잘 작동 - -948 -01:08:29,130 --> 01:08:32,420 - 소유즈 미국 팀은 무엇을 그들에 대해 좋은 데요 것은 이상한 갖는 것입니다 대신 사용된다 - -949 -01:08:32,420 --> 01:08:36,000 - 그리스을 허용 이러한 첨가제의 상호 작용은 매우 잘 재생 당신은하지 않습니다 - -950 -01:08:36,000 --> 01:08:39,579 - 사라지는 품종 문제를 얻을 우리는 여전히 폭발에 대해 조금 걱정 - -951 -01:08:39,579 --> 01:08:44,269 - 이 사람들은 때때로 내가이 여자 클립을 참조하는 것이 일반적 그래서 문제를 공급 - -952 -01:08:44,270 --> 01:08:46,670 - 더 간단한 구조가 정말하려고하는 말 것 - -953 -01:08:46,670 --> 01:08:50,838 - 연결과 무슨 깊은 거기에 뭔가 오는 방법을 이해 - -954 -01:08:50,838 --> 01:08:53,899 - 주민과 엘리스 팀 사이에 이들에 대해 뭔가 깊은있다 - -955 -01:08:53,899 --> 01:08:57,579 - 나는 우리가 아직 정확히 그 이유는 완전히 이해되지 것 같아요 상호 작용 - -956 -01:08:57,579 --> 01:09:02,210 - 그래서 잘 작동하고 어떤 부분은 시원했고, 그래서 우리가 필요하다고 생각 - -957 -01:09:02,210 --> 01:09:05,119 - 공간 이론과 경험을 모두 이해하고 그것은 매우이야 - -958 -01:09:05,119 --> 01:09:10,979 - 벌리고 연구의 영역과 그래서 그래서 - -959 -01:09:10,979 --> 01:09:23,469 - 스포츠 (10) 그러나 나는 내가 그렇지 않은 그래서 폭발 가정 할 수 클래스의 끝 - -960 -01:09:23,470 --> 01:09:27,020 - 명확 왜 것이라고하지만 당신은 세포 상태로 그라데이션을 주입 유지 - -961 -01:09:27,020 --> 01:09:30,069 - 그래서 어쩌면 때때로 큰 얻을 수 있습니다 저하 - -962 -01:09:30,069 --> 01:09:33,960 - 그것은 그들을 수집하는 것이 일반적이지만 중요 할 수 있으므로 한 시간으로 아마 생각 - -963 -01:09:33,960 --> 01:09:40,829 - 그리고, 나는 그 시점하지만 비뇨기과 기초 I에 대해 확실히 백퍼센트 아니에요 - -964 -01:09:40,829 --> 01:09:46,640 - 흥미로운 무슨 생각 그래 나는 우리가 여기까지해야한다고 생각하지 않습니다 있지만 난 - -965 -01:09:46,640 --> 01:09:47,569 - 여기에 질문을 드리겠습니다 +1 +00:00:00,000 --> 00:00:04,129 +마이크 테스트 + +2 +00:00:04,129 --> 00:00:12,109 +오늘은 주제는 Recurrent neural networks 입니다. + +3 +00:00:12,109 --> 00:00:15,199 +개인적으로 가장 좋아하는 주제이고 + +4 +00:00:15,199 --> 00:00:18,960 +또 여러 형태로 사용하고 있는 NN 모델이기도 하죠. 재밌어요. + +5 +00:00:18,960 --> 00:00:23,009 +강의 진행에 관해서 언급할 게 있는데, + +6 +00:00:23,009 --> 00:00:26,089 +수요일에 중간 고사가 있어요. + +7 +00:00:26,089 --> 00:00:32,738 +다들 중간고사 기대하고 있는거 다 알아요. 사실 별로 기대하는 것 같이 보이지는 않네요. + +8 +00:00:32,738 --> 00:00:37,979 +수요일에 과제가 나갈 거에요. + +9 +00:00:37,979 --> 00:00:40,429 +제출 기한은 2주 뒤 월요일까지입니다. + +10 +00:00:40,429 --> 00:00:43,399 +그런데 저희가 원래 월요일에 이걸 발표하려 했는데 늦어져서 + +11 +00:00:43,399 --> 00:00:47,129 +아마 제출 기한이 수요일 즈음으로 미뤄질 것 같네요. + +12 +00:00:47,130 --> 00:00:51,179 +2번째 과제는 금요일까지고, 3-late day를 사용할 수 있어요. 그런데 너무 일찍 사용하지는 마세요. + +13 +00:00:51,179 --> 00:00:55,119 +2번째 과제는 금요일까지고, 3-late day를 사용할 수 있어요. 그런데 너무 일찍 사용하지는 마세요. + +14 +00:00:55,119 --> 00:01:01,089 +몇 명이나 끝냈나요? 72명? 거의 다 끝냈네요, 좋아요. + +15 +00:01:01,090 --> 00:01:04,549 +자 우리는 Convolutional Neural Network (CNN)에 대해서 얘기하고 있었죠. + +16 +00:01:04,549 --> 00:01:07,820 +지난 수업에서는 CNN에 대한 시각화와 간단한 이해에 대해서 다루었고, + +17 +00:01:07,819 --> 00:01:11,618 +이런 그림과 비디오들을 살펴보면서 CNN이 어떻게 작동하는지 살펴보았죠. + +18 +00:01:11,618 --> 00:01:14,938 +이런 그림과 비디오들을 살펴보면서 CNN이 어떻게 작동하는지 살펴보았죠. + +19 +00:01:14,938 --> 00:01:17,828 +이런 그림과 비디오들을 살펴보면서 CNN이 어떻게 작동하는지 살펴보았죠. + +20 +00:01:17,828 --> 00:01:24,188 +그리고 맨 마지막 그림에서 본 것처럼 디버깅도 해 보았고요. + +21 +00:01:24,188 --> 00:01:27,408 +지난 주말에 트위터에서 새로운 시각화 자료를 찾았는데요, + +22 +00:01:27,409 --> 00:01:32,569 +신기하죠? + +23 +00:01:32,569 --> 00:01:37,118 +사실 설명이 없어서 정확히 어떤 방법으로 이걸 만든 건지는 잘 모르겠네요. + +24 +00:01:37,118 --> 00:01:43,099 +그래도 멋있지 않아요? 
이건 거북이고, 저건 타란튤라 거미이고,
+
+25
+00:01:43,099 --> 00:01:47,468
+이건 체인이고, 저건 개들인데,
+
+26
+00:01:47,468 --> 00:01:50,509
+제가 보기에 이건 어떤 최적화 기법을 이미지에 적용한 것 같은데,
+
+27
+00:01:50,509 --> 00:01:53,679
+뭔가 다른 regularization 방법을 적용한 것 같네요
+
+28
+00:01:53,679 --> 00:01:57,049
+음, 여기에는 bilateral filter (쌍방향 필터) 를 적용한 것 같네요.
+
+29
+00:01:57,049 --> 00:01:59,420
+음, 여기에는 bilateral filter (쌍방향 필터) 를 적용한 것 같네요.
+
+30
+00:01:59,420 --> 00:02:03,659
+그래도 솔직히 정확히 어떤 기법을 적용한 것인지는 잘 모르겠어요.
+
+31
+00:02:03,659 --> 00:02:04,549
+오늘의 주제는 RNN입니다.
+
+32
+00:02:04,549 --> 00:02:10,360
+오늘의 주제는 RNN입니다.
+
+33
+00:02:10,360 --> 00:02:13,520
+RNN의 강점은 네트워크 아키텍쳐를 구성하는 데에 자유도가 높다는 것입니다.
+
+34
+00:02:13,520 --> 00:02:15,870
+RNN의 강점은 네트워크 아키텍쳐를 구성하는 데에 자유도가 높다는 것입니다.
+
+35
+00:02:15,870 --> 00:02:18,650
+일반적으로 NN을 왼쪽 그림과 같이 구성할 때는 (역자주: Vanilla NN)
+
+36
+00:02:18,650 --> 00:02:22,849
+여기 빨간색으로 표시된 것처럼 고정된 크기의 input vector를 사용하고,
+
+37
+00:02:22,848 --> 00:02:27,639
+초록색의 hidden layer들을 통해 작동시키며, 마찬가지로 고정된 크기의 파란색 output vector를 출력합니다.
+
+38
+00:02:27,639 --> 00:02:30,738
+마찬가지로 고정된 크기의 이미지를 입력으로 받고,
+
+39
+00:02:30,739 --> 00:02:34,469
+고정된 크기의 이미지를 벡터 형태로 출력합니다.
+
+40
+00:02:34,469 --> 00:02:38,239
+RNN에서는 이러한 작업을 계속 반복할 수 있습니다. input, output 모두에서 가능하죠.
+
+41
+00:02:38,239 --> 00:02:41,319
+오늘 다룰 image captioning(이미지에 상응하는 자막/주석 생성) 을 예로 들면,
+
+42
+00:02:41,318 --> 00:02:44,689
+고정된 크기의 이미지를 RNN에 입력하게 됩니다.
+
+43
+00:02:44,689 --> 00:02:47,829
+그리고 그 RNN은 해당 이미지를 설명하는 단어/문장 들을 출력하게 되죠.
+
+44
+00:02:47,829 --> 00:02:52,560
+그리고 그 RNN은 해당 이미지를 설명하는 단어/문장 들을 출력하게 되죠.
+
+45
+00:02:52,560 --> 00:02:55,969
+Sentiment classification(감정 분류)를 예로 들면,
+
+46
+00:02:55,969 --> 00:02:59,759
+(어떤 문장의) 단어들과 그 순서를 입력으로 받아서,
+
+47
+00:02:59,759 --> 00:03:03,828
+그 문장의 느낌이 긍정적인지 또는 부정적인지를 출력하게 됩니다.
+
+48
+00:03:03,829 --> 00:03:07,590
+또 다른 예로 machine translation (역자주: 구글 번역과 같은 알고리즘 번역) 에서는,
+
+49
+00:03:07,590 --> 00:03:12,069
+어떤 영어 문장을 입력으로 받고, 프랑스어로 출력해야 합니다.
+
+50
+00:03:12,068 --> 00:03:17,119
+그래서 우리는 이 영어 문장을 RNN에 입력하고 (이것을 Sequence to Sequence 라 부름)
+
+51
+00:03:17,120 --> 00:03:20,280
+그래서 우리는 이 영어 문장을 RNN에 입력하고 (이것을 Sequence to Sequence 라 부름)
+
+52
+00:03:20,280 --> 00:03:25,169
+RNN은 이 영어 문장을 프랑스어 문장으로 번역합니다.
+
+53
+00:03:25,169 --> 00:03:28,000
+마지막 예 video classification(영상 분류) 에서는,
+
+54
+00:03:28,000 --> 00:03:31,699
+각 프레임 (순간 캡쳐 화면) 이 어떤 속성을 지니는지,
+
+55
+00:03:31,699 --> 00:03:35,429
+그리고 그 전의 모든 프레임과의 관계는 어떻게 되는지도 고려합니다.
+
+56
+00:03:35,430 --> 00:03:38,739
+그리고 그 전의 모든 프레임과의 관계는 어떻게 되는지도 고려합니다.
+
+57
+00:03:38,739 --> 00:03:41,909
+그러니까 RNN은 각각의 프레임이 어떤 속성을 지니는지 분류하고,
+
+58
+00:03:41,909 --> 00:03:44,680
+이전까지의 모든 프레임을 입력으로 받는 함수가 되어,
+
+59
+00:03:44,680 --> 00:03:48,760
+앞으로의 프레임을 예측하는 아키텍쳐를 제공합니다.
+
+60
+00:03:48,759 --> 00:03:52,388
+맨 왼쪽 그림과 같이 입력과 출력의 순서에 관한 정보를 가지고 있지 않아도 RNN을 사용할 수 있습니다.
+
+61
+00:03:52,389 --> 00:03:55,250
+맨 왼쪽 그림과 같이 입력과 출력의 순서에 관한 정보를 가지고 있지 않아도 RNN을 사용할 수 있습니다.
+
+62
+00:03:55,250 --> 00:04:01,560
+예를 들어, 제가 좋아하는 딥마인드의 한 논문에서는
+
+63
+00:04:01,560 --> 00:04:05,189
+번지로 된 집 주소 이미지를 문자로 변환했습니다.
+
+64
+00:04:05,189 --> 00:04:09,750
+여기서는 단순히 CNN을 사용해서 이미지 자체가 몇 번지를 나타내는지를 분류하지 않고,
+
+65
+00:04:09,750 --> 00:04:13,530
+여기서는 단순히 CNN을 사용해서 이미지 자체가 몇 번지를 나타내는지를 분류하지 않고,
+
+66
+00:04:13,530 --> 00:04:16,649
+RNN을 사용해서 작은 CNN이 이미지를 돌아다니면서 읽어들였습니다.
+
+67
+00:04:16,649 --> 00:04:19,779
+RNN을 사용해서 작은 CNN이 이미지를 돌아다니면서 읽어들였습니다.
+
+68
+00:04:19,779 --> 00:04:23,969
+이렇게 RNN은 번지 주소 이미지를 왼쪽에서 오른쪽으로 순차적으로 읽는 방법을 학습했습니다.
+
+69
+00:04:23,970 --> 00:04:26,870
+이렇게 RNN은 번지 주소 이미지를 왼쪽에서 오른쪽으로 순차적으로 읽는 방법을 학습했습니다.
+
+70
+00:04:26,870 --> 00:04:32,019
+반대로 생각할 수도 있습니다. 이것은 DRAW라는 유명한 논문인데요,
+
+71
+00:04:32,019 --> 00:04:35,879
+여기서는 이미지 샘플 하나하나가 무엇인지 개별적으로 판단하지 않고,
+
+72
+00:04:35,879 --> 00:04:39,490
+여기서는 이미지 샘플 하나하나가 무엇인지 개별적으로 판단하지 않고,
+
+73
+00:04:39,490 --> 00:04:42,860
+RNN이 여러 이미지를 하나의 큰 캔버스의 형태로 한번에 출력합니다.
+
+74
+00:04:42,860 --> 00:04:47,540
+RNN이 여러 이미지를 하나의 큰 캔버스의 형태로 한번에 출력합니다.
+
+75
+00:04:47,540 --> 00:04:50,200
+이 방법은 한 번지수 이미지에 대한 입력 결과를 곧바로 출력하지 않고, 보다 많은 계산을 거친다는 점에서 강력합니다.
+
+76
+00:04:50,199 --> 00:04:53,479
+이 방법은 한 번지수 이미지에 대한 입력 결과를 곧바로 출력하지 않고, 보다 많은 계산을 거친다는 점에서 강력합니다. 질문 있나요?
+
+77
+00:04:53,480 --> 00:05:14,189
+(질문) 그림에서 화살표는 무엇인가요?
+
+78
+00:05:14,189 --> 00:05:19,310
+화살표는 functional dependence를 나타냅니다. 조금 있다가 좀 더 자세하게 살펴 볼 거에요.
+
+79
+00:05:19,310 --> 00:05:23,139
+화살표는 functional dependence를 나타냅니다. 조금 있다가 좀 더 자세하게 살펴 볼 거에요.
+
+80
+00:05:23,139 --> 00:05:37,168
+(질문) 그림에서 나타나는 숫자들은 무엇인가요?
+
+81
+00:05:37,168 --> 00:05:41,219
+이것들은 실제 사진이 아니라 RNN이 학습 후 출력한 결과물입니다.
+
+82
+00:05:41,220 --> 00:05:44,830
+이것들은 실제 사진이 아니라 RNN이 학습 후 출력한 결과물입니다.
+
+83
+00:05:44,829 --> 00:05:48,219
+(질문) 그러니까 실제 사진이 아니라 만들어진 거라는 거죠?
+
+84
+00:05:48,220 --> 00:05:51,689
+네, 꽤 실제 사진처럼 보이기는 하지만, 이것들은 만들어진 이미지입니다.
+
+85
+00:05:51,689 --> 00:05:55,809
+RNN은 이런 초록색 박스처럼 생겼습니다.
+
+86
+00:05:55,809 --> 00:06:00,979
+RNN은 계속해서 input vector를 입력받습니다.
+
+87
+00:06:00,978 --> 00:06:04,859
+RNN은 계속해서 input vector를 입력받습니다.
+
+88
+00:06:04,860 --> 00:06:08,538
+RNN은 내부에 state를 가지고 있고, 매 시간 입력받는 input vector에 따라 이 state를 갱신할 수 있습니다.
+
+89
+00:06:08,538 --> 00:06:12,988
+RNN은 내부에 state를 가지고 있고, 매 시간 입력받는 input vector에 따라 이 state를 갱신할 수 있습니다.
+
+90
+00:06:12,988 --> 00:06:17,258
+RNN에는 또한 weight(가중치)를 설정할 수 있고, 이를 조정함으로써 RNN의 작동을 조절할 수 있습니다.
+
+91
+00:06:17,259 --> 00:06:20,829
+RNN에는 또한 weight(가중치)를 설정할 수 있고, 이를 조정함으로써 RNN의 작동을 조절할 수 있습니다.
+
+92
+00:06:20,829 --> 00:06:25,769
+우리는 물론 RNN의 출력 결과물에도 관심을 갖고 있지만,
+
+93
+00:06:25,769 --> 00:06:30,429
+우리는 물론 RNN의 출력 결과물에도 관심을 갖고 있지만,
+
+94
+00:06:30,428 --> 00:06:33,988
+RNN은 이 중간에 있는, 시간에 따라 벡터를 입력받고 출력하는 단계인 이 초록색 박스라는 것을 알아두셨으면 합니다.
+
+95
+00:06:33,988 --> 00:06:36,688
+RNN은 이 중간에 있는, 시간에 따라 벡터를 입력받고 출력하는 단계인 이 초록색 박스라는 것을 알아두셨으면 합니다.
+
+96
+00:06:36,689 --> 00:06:39,489
+RNN은 이 중간에 있는, 시간에 따라 벡터를 입력받고 출력하는 단계인 이 초록색 박스라는 것을 알아두셨으면 합니다.
+
+97
+00:06:39,488 --> 00:06:44,838
+RNN은 이 중간에 있는, 시간에 따라 벡터를 입력받고 출력하는 단계인 이 초록색 박스라는 것을 알아두셨으면 합니다.
+
+98
+00:06:44,838 --> 00:06:50,610
+RNN의 각 state는 vector들의 집합으로 나타낼 수 있고, 여기서는 h로 표기하겠습니다.
+
+99
+00:06:50,610 --> 00:06:55,399
+RNN의 각 state는 vector들의 집합으로 나타낼 수 있고, 여기서는 h로 표기하겠습니다.
+
+100
+00:06:55,399 --> 00:07:00,939
+각각의 state(h_t) 는 바로 전 단계의 state(h_t-1)과 input vector(x_t)들의 함수로 나타낼 수 있습니다.
+
+101
+00:07:00,939 --> 00:07:05,769
+각각의 state(h_t) 는 바로 전 단계의 state(h_t-1)과 input vector(x_t)들의 함수로 나타낼 수 있습니다.
+
+102
+00:07:05,769 --> 00:07:08,338
+여기서의 함수는 Recurrence function 이라고 하고 파라미터 W(가중치)를 갖습니다.
+
+103
+00:07:08,338 --> 00:07:13,728
+우리는 W 값을 변경함에 따라 RNN이 다른 결과를 보이는 걸 확인할 수 있습니다.
+
+104
+00:07:13,728 --> 00:07:16,228
+우리는 W 값을 변경함에 따라 RNN이 다른 결과를 보이는 걸 확인할 수 있습니다.
+
+105
+00:07:16,228 --> 00:07:19,338
+따라서 우리는 우리가 원하는 결과를 만들어낼 수 있는 적절한 W를 찾기 위해 training을 거칠 것이죠.
+
+106
+00:07:19,338 --> 00:07:23,639
+따라서 우리는 우리가 원하는 결과를 만들어낼 수 있는 적절한 W를 찾기 위해 training을 거칠 것이죠.
+
+107
+00:07:23,639 --> 00:07:28,209
+여기서 기억해야 할 것은 매 단계마다 같은 함수와 같은 W를 사용한다는 것입니다.
+
+108
+00:07:28,209 --> 00:07:31,778
+여기서 기억해야 할 것은 매 단계마다 같은 함수와 같은 W를 사용한다는 것입니다.
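(역자 보충) 위에서 설명한 recurrence h_t = f_W(h_{t-1}, x_t)와 '매 단계 같은 함수, 같은 W를 재사용한다'는 점을 NumPy로 옮겨 본 최소한의 스케치입니다. f_W를 선형 변환 + tanh로 둔 것과 각 차원의 크기는 설명을 위한 가정입니다.

```python
import numpy as np

rng = np.random.default_rng(0)
state_size, input_size = 4, 3       # 크기는 설명용 가정
W = rng.standard_normal((state_size, state_size + input_size)) * 0.1  # 모든 단계가 공유하는 파라미터

def f_W(h_prev, x):
    # recurrence: h_t = f_W(h_{t-1}, x_t). 여기서는 간단히 '선형 변환 + tanh'라고 가정
    return np.tanh(W @ np.concatenate([h_prev, x]))

h = np.zeros(state_size)            # h_0은 0으로 초기화 (뒤의 Q&A에서 언급되는 일반적인 방식)
sequence = [rng.standard_normal(input_size) for _ in range(7)]  # 길이가 7인 임의의 입력 시퀀스
for x_t in sequence:
    h = f_W(h, x_t)                 # 시퀀스 길이와 무관하게 매 단계 같은 함수와 같은 W를 재사용
```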
+ +109 +00:07:31,778 --> 00:07:35,928 +그래서 입력이나 출력 시퀀스의 길이를 고려할 필요가 없습니다. + +110 +00:07:35,928 --> 00:07:38,778 +그래서 입력이나 출력 시퀀스의 길이를 고려할 필요가 없습니다. + +111 +00:07:38,778 --> 00:07:43,528 +그래서 입력이나 출력 시퀀스의 길이를 고려할 필요가 없습니다. + +112 +00:07:43,528 --> 00:07:46,769 +RNN을 구현하는 가장 간단한 방법은 Vanilla RNN 입니다. + +113 +00:07:46,769 --> 00:07:50,309 +RNN을 구현하는 가장 간단한 방법은 Vanilla RNN 입니다. + +114 +00:07:50,309 --> 00:07:54,569 +여기서 RNN을 구성하는 것은 단 하나의 hidden state h 입니다. + +115 +00:07:54,569 --> 00:08:00,569 +여기서 RNN을 구성하는 것은 단 하나의 hidden state h 입니다. + +116 +00:08:00,569 --> 00:08:04,039 +그리고 여기 Recurrence(재귀) 식은 각 hidden state를 시간과 현재 input (x_t)로 어떻게 나타낼 수 있는지 알려줍니다. + +117 +00:08:04,038 --> 00:08:04,688 +그리고 여기 Recurrence(재귀) 식은 각 hidden state를 시간과 현재 input (x_t)로 어떻게 나타낼 수 있는지 알려줍니다. + +118 +00:08:04,689 --> 00:08:08,369 +그리고 여기 Recurrence 식은 각 hidden state를 시간과 현재 input (x_t)로 어떻게 나타낼 수 있는지 알려줍니다. + +119 +00:08:08,369 --> 00:08:10,349 +가중치 행렬 W_hh와 W_xh에 직전 단계의 hidden state h 와 input vector x가 각각 곱해지고, + +120 +00:08:10,348 --> 00:08:15,238 +가중치 행렬 W_hh와 W_xh에 직전 단계의 hidden state h_t-1 와 input vector x가 각각 곱해지고, + +121 +00:08:15,238 --> 00:08:18,238 +이것이 tanh 함수에 의해 새로운 hidden state h_t로 결정되는 방식으로 업데이트 됩니다. + +122 +00:08:18,238 --> 00:08:21,978 +이것이 tanh 함수에 의해 새로운 hidden state h_t로 결정되는 방식으로 업데이트 됩니다. + +123 +00:08:21,978 --> 00:08:26,199 +이러한 재귀 식은 h가 시간과 현재 입력에 따라 업데이트되는 함수라는 것을 보여줍니다. + +124 +00:08:26,199 --> 00:08:29,769 +이러한 재귀 식은 h가 시간과 현재 입력에 따라 업데이트되는 함수라는 것을 보여줍니다. + +125 +00:08:29,769 --> 00:08:34,129 +h 바로 다음에 결과물이 행렬의 형태로 출력되는 형태가 가장 간단한 형태의 RNN입니다. + +126 +00:08:34,129 --> 00:08:37,528 +h 바로 다음에 결과물이 행렬의 형태로 출력되는 형태가 가장 간단한 형태의 RNN입니다. + +127 +00:08:37,528 --> 00:08:42,288 +이게 어떻게 작동되는지 간단히 설명드리기 위해 예를 들자면, + +128 +00:08:42,288 --> 00:08:46,639 +이게 어떻게 작동되는지 간단히 설명드리기 위해 예를 들자면, + +129 +00:08:46,639 --> 00:08:49,299 +이런 추상적인 x, h, y 등에 의미를 부여할 수 있습니다. + +130 +00:08:49,299 --> 00:08:53,059 +이런 추상적인 x, h, y 등에 의미를 부여할 수 있습니다. + +131 +00:08:53,059 --> 00:08:56,149 +예를 들어 이러한 문자 수준 언어 모델에 RNN을 적용하는 것 말이죠. + +132 +00:08:56,149 --> 00:08:59,899 +저는 이 예시를 참 좋아합니다. 직관적이고 재밌거든요. + +133 +00:08:59,899 --> 00:09:04,698 +그래서 RNN 기반 문자 수준 언어 모델에서는, RNN에 문자열의 순서를 주고, + +134 +00:09:04,698 --> 00:09:07,859 +그래서 RNN 기반 문자 수준 언어 모델에서는, RNN에 문자열의 순서를 주고, + +135 +00:09:07,860 --> 00:09:10,899 +그래서 RNN 기반 문자 수준 언어 모델에서는, RNN에 문자열의 순서를 주고, + +136 +00:09:10,899 --> 00:09:14,299 +지금까지의 관찰 결과를 바탕으로 각각의 단계에서 다음에 올 문자는 무엇인지 예측하게 합니다. + +137 +00:09:14,299 --> 00:09:16,909 +지금까지의 관찰 결과를 바탕으로 각각의 단계에서 다음에 올 문자는 무엇인지 예측하게 합니다. + +138 +00:09:16,909 --> 00:09:21,120 +간단한 예를 한번 보죠. + +139 +00:09:21,120 --> 00:09:25,610 +여기서 training 문자열 'hello'를 주면, + +140 +00:09:25,610 --> 00:09:29,870 +우리의 현재 어휘 목록에는 'h, e , l, o' 이렇게 4글자가 있겠죠 + +141 +00:09:29,870 --> 00:09:33,289 +그러니까 RNN은 우리의 training 문자열 데이터를 바탕으로 다음에 올 글자가 무엇인지 예측하게 됩니다. + +142 +00:09:33,289 --> 00:09:37,000 +구체적으로, h, e, l, o를 각각 순서대로 하나씩 RNN에 입력해 줍니다. + +143 +00:09:37,000 --> 00:09:40,509 +여기서 가로축은 시간입니다. (역자주: 오른쪽으로 갈수록 뒤) + +144 +00:09:40,509 --> 00:09:47,110 +h는 첫번째, e는 두번째, 그다음 l, 그다음 l + +145 +00:09:47,110 --> 00:09:50,629 +여기서는 'one-hot' 표기법을 사용하고 있습니다. (역자주: 0과 1로만 나타내는 것) + +146 +00:09:50,629 --> 00:09:53,889 +여기서는 'one-hot' 표기법을 사용하고 있습니다. (역자주: 0과 1로만 나타내는 것) + +147 +00:09:53,889 --> 00:09:58,129 +그리고 아까 본 재귀 식을 사용합니다. + +148 +00:09:58,129 --> 00:10:01,860 +처음에 h에는 0만 들어가 있습니다. + +149 +00:10:01,860 --> 00:10:04,720 +그래서 매 시간 단계마다 이 재귀 식을 이용해서 hidden state 벡터를 계산합니다. + +150 +00:10:04,720 --> 00:10:08,790 +hidden state에 3개의 (안들림) 가 있습니다. 
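(역자 보충) 본문의 Vanilla RNN 재귀 식 h_t = tanh(W_hh·h_{t-1} + W_xh·x_t)과 출력 y_t = W_hy·h_t를 한 단계만 계산해 본 스케치입니다. 어휘가 'h, e, l, o' 4글자이고 hidden 크기가 100이라는 본문의 예를 따랐으며, 변수 이름은 예시용입니다.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, vocab_size = 100, 4    # 본문의 예: hidden 100, 어휘는 'h, e, l, o'
Wxh = rng.standard_normal((hidden_size, vocab_size)) * 0.01   # input -> hidden
Whh = rng.standard_normal((hidden_size, hidden_size)) * 0.01  # hidden -> hidden
Why = rng.standard_normal((vocab_size, hidden_size)) * 0.01   # hidden -> output

def rnn_step(h_prev, x):
    h = np.tanh(Whh @ h_prev + Wxh @ x)   # 본문의 재귀 식
    y = Why @ h                           # 다음 문자에 대한 (정규화 전) 점수
    return h, y

x = np.zeros(vocab_size); x[0] = 1        # 'h'의 one-hot 인코딩
h0 = np.zeros(hidden_size)                # 처음에 h에는 0만 들어 있음
h1, y1 = rnn_step(h0, x)
```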
+
+151
+00:10:08,789 --> 00:10:11,099
+각 시점에서 이전까지 입력받은 모든 문자들을 요약해서 표현합니다.
+
+152
+00:10:11,100 --> 00:10:13,040
+각 시점에서 이전까지 입력받은 모든 문자들을 요약해서 표현합니다.
+
+153
+00:10:13,039 --> 00:10:15,759
+각 시점에서 이전까지 입력받은 모든 문자들을 요약해서 표현합니다.
+
+154
+00:10:15,759 --> 00:10:20,159
+이런 방법으로 매 시간 단계마다 바로 다음 순서에 올 문자를 예측할 것입니다.
+
+155
+00:10:20,159 --> 00:10:23,139
+이런 방법으로 매 시간 단계마다 바로 다음 순서에 올 문자를 예측할 것입니다.
+
+156
+00:10:23,139 --> 00:10:27,569
+우리는 이 4개의 문자(역자주: h, e, l, o)를 가지고 있고, 매 시간 단계마다 이 4개의 문자 중 어떤 문자가 오는지 예측할 것입니다.
+
+157
+00:10:27,570 --> 00:10:32,100
+우리는 이 4개의 문자(역자주: h, e, l, o)를 가지고 있고, 매 시간 단계마다 이 4개의 문자 중 어떤 문자가 오는지 예측할 것입니다.
+
+158
+00:10:32,100 --> 00:10:37,139
+제일 처음에는 h를 입력할 것입니다.
+
+159
+00:10:37,139 --> 00:10:40,799
+RNN은 현재의 weight를 바탕으로 다음에 어떤 문자가 올 지 예측합니다.
+
+160
+00:10:40,799 --> 00:10:42,959
+RNN은 현재의 weight를 바탕으로 다음에 어떤 문자가 올 지 예측합니다.
+
+161
+00:10:42,960 --> 00:10:47,950
+현재 normalized 되지 않은 수치로는, (역자주: 맨 위 왼쪽 사각형 안의 숫자) h는 1.0, e는 2.2,
+
+162
+00:10:47,950 --> 00:10:52,640
+l은 -3.0, o는 4.1이라는 숫자의 정도로 나타날 것입니다.
+
+163
+00:10:52,639 --> 00:10:56,409
+물론 우리는 이 training sequence에서 h 다음에 e가 온다는 것을 알고 있습니다.
+
+164
+00:10:56,409 --> 00:11:00,669
+그러니까 여기 초록색으로 적혀 있는 e의 2.2라는 숫자가 정답이 되는 것이죠.
+
+165
+00:11:00,669 --> 00:11:04,559
+그래서 이 숫자는 커야 하고, 다른 숫자들은 작아져야 합니다.
+
+166
+00:11:04,559 --> 00:11:07,799
+이처럼 매 시간 단계마다 우리는 다음에 올 타겟 문자를 갖고 있습니다.
+
+167
+00:11:07,799 --> 00:11:12,209
+타겟에 해당하는 숫자는 커야 하고, 나머지 숫자는 작아야 합니다.
+
+168
+00:11:12,210 --> 00:11:15,470
+타겟에 해당하는 숫자는 커야 하고, 나머지 숫자는 작아야 합니다.
+
+169
+00:11:15,470 --> 00:11:19,950
+그래서 이러한 정보는 loss function(손실 함수)의 gradient signal에 포함됩니다.
+
+170
+00:11:19,950 --> 00:11:23,220
+그리고 그러한 loss 들은 이 연결들을 통해 back-propagation 됩니다.
+
+171
+00:11:23,220 --> 00:11:26,600
+매 시간 단계에 softmax classifier를 갖고 있다고 합시다.
+
+172
+00:11:26,600 --> 00:11:31,300
+그래서 매 시간 단계마다 softmax classifier가 다음에 어떤 문자가 와야 할 지를 예측하고,
+
+173
+00:11:31,299 --> 00:11:34,269
+그리고 모든 loss들은 맨 위(역자주: output layer)부터 거꾸로 그래프를 내려오면서 계산되어서
+
+174
+00:11:34,269 --> 00:11:37,879
+그리고 모든 loss들은 맨 위(역자주: output layer)부터 거꾸로 그래프를 내려오면서 계산되어서
+
+175
+00:11:37,879 --> 00:11:41,179
+그리고 모든 loss들은 맨 위(역자주: output layer)부터 거꾸로 그래프를 내려오면서 계산되어서
+
+176
+00:11:41,179 --> 00:11:44,479
+weight 행렬에 gradient를 주어 적절한 값으로 변화시켜 RNN이 문자를 보다 정확하게 예측하게 합니다.
+
+177
+00:11:44,480 --> 00:11:50,039
+weight 행렬에 gradient를 주어 적절한 값으로 교정시켜 RNN이 문자를 보다 정확하게 예측하게 합니다.
+
+178
+00:11:50,039 --> 00:11:53,599
+그러니까 여러분이 RNN에 문자를 입력하면 RNN은 보다 정확한 행동(역자주: 여기서는 문자 예측)을 하는 것이죠.
+
+179
+00:11:53,600 --> 00:11:57,750
+이제 어떻게 데이터를 학습시키는지에 대해 상상이 좀 갈 거에요.
+
+180
+00:11:57,750 --> 00:12:02,879
+여기 그림에 대해 질문이 있나요?
+
+181
+00:12:02,879 --> 00:12:08,750
+(질문): W_xh와 W_hy는 항상 일정한 값을 가지나요?
+
+182
+00:12:08,750 --> 00:12:13,320
+(답변): W(weight) 들은 매 recurrence 단계 마다 항상 일정한 값을 가집니다.
+
+183
+00:12:13,320 --> 00:12:17,010
+(답변): W(weight) 들은 매 recurrence 단계 마다 항상 일정한 값을 가집니다.
+
+184
+00:12:17,009 --> 00:12:23,830
+여기서 우리는 W_xh, W_hh, W_hy를 각각 4번씩 사용했습니다.
+
+185
+00:12:23,830 --> 00:12:27,720
+여러분이 backpropagation을 할 때, 동일한 weight 행렬에 이러한 gradient 들을 계속 더한다는 것을 명심해야 합니다.
+
+186
+00:12:27,720 --> 00:12:30,750
+여러분이 backpropagation을 할 때, 동일한 weight 행렬에 이러한 gradient 들을 계속 더한다는 것을 명심해야 합니다.
+
+187
+00:12:30,750 --> 00:12:35,879
+그리고 이것은 우리가 길이가 다양한 입력값들을 사용할 수 있게 해 줍니다.
+
+188
+00:12:35,879 --> 00:12:38,960
+그리고 이것은 우리가 길이가 다양한 입력값들을 사용할 수 있게 해 줍니다.
+
+189
+00:12:38,960 --> 00:12:48,540
+그러니까 정해진 길이의 입력값들을 사용하지 않아도 된다는 것이죠.
+
+190
+00:12:48,539 --> 00:12:52,579
+(질문): 처음 h_0를 어떻게 초기화하나요?
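(역자 보충) 한 시간 단계의 softmax classifier가 loss와 gradient signal을 어떻게 만드는지를 보여주는 스케치입니다. 점수 1.0, 2.2, -3.0, 4.1은 본문 슬라이드의 예시이고, 순서가 (h, e, l, o)라는 것은 가정입니다.

```python
import numpy as np

y = np.array([1.0, 2.2, -3.0, 4.1])   # 슬라이드 예시의 정규화 전 점수 (h, e, l, o 순서 가정)
target = 1                             # 'h' 다음의 정답 문자인 'e'의 index

p = np.exp(y) / np.sum(np.exp(y))      # softmax: 점수 -> 확률분포
loss = -np.log(p[target])              # cross-entropy: 정답 확률이 클수록 loss가 작아진다
dy = p.copy(); dy[target] -= 1         # dL/dy: 여기서부터 그래프를 거꾸로 내려가며 backprop
```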
+
+191
+00:12:52,580 --> 00:13:00,650
+(답변): 0으로 놓는 것이 가장 일반적입니다.
+
+192
+00:13:00,649 --> 00:13:01,289
+(질문): 입력값의 순서는 영향을 미치나요?
+
+193
+00:13:01,289 --> 00:13:11,299
+(질문): 입력값의 순서는 영향을 미치나요?
+
+194
+00:13:11,299 --> 00:13:14,359
+(답변): 여기서는 중요하지 않습니다. hidden state는 지금까지 들어온 모든 값을 반영하거든요.
+
+195
+00:13:14,360 --> 00:13:17,870
+(답변): 여기서는 중요하지 않습니다. hidden state는 지금까지 들어온 모든 값을 반영하거든요.
+
+196
+00:13:17,870 --> 00:13:21,299
+(답변): 여기서는 중요하지 않습니다. hidden state는 지금까지 들어온 모든 값을 반영하거든요.
+
+197
+00:13:21,299 --> 00:13:26,859
+(답변): 여기서는 중요하지 않습니다. hidden state는 지금까지 들어온 모든 값을 반영하거든요.
+
+198
+00:13:26,860 --> 00:13:31,590
+보다 구체적인 예들로 확실히 설명드리겠습니다.
+
+199
+00:13:31,590 --> 00:13:36,149
+문자 단위의 언어 모델 코드는 매우 간단합니다.
+
+200
+00:13:36,149 --> 00:13:38,980
+여러분들이 나중에 찾아볼 수 있게 GitHub에 올려 놓았어요.
+
+201
+00:13:38,980 --> 00:13:43,350
+이것은 NumPy 기반의 100줄 길이의 문자 단위 RNN 코드입니다.
+
+202
+00:13:43,350 --> 00:13:47,220
+이것은 NumPy 기반의 100줄 길이의 문자 단위 RNN 코드입니다.
+
+203
+00:13:47,220 --> 00:13:49,840
+실제로 RNN이 어떻게 학습하는지를 알기 위해서 이 코드를 단계별로 살펴볼게요.
+
+204
+00:13:49,840 --> 00:13:53,220
+실제로 RNN이 어떻게 학습하는지를 알기 위해서 이 코드를 단계별로 살펴볼게요.
+
+205
+00:13:53,220 --> 00:13:58,250
+코드를 블록들로 나누어 하나하나 살펴보겠습니다.
+
+206
+00:13:58,250 --> 00:14:02,389
+처음에는 보다시피 NumPy만 사용합니다.
+
+207
+00:14:02,389 --> 00:14:05,569
+여기서 우리가 입력받을 것은 문자들로 이루어진 대용량 .txt 시퀀스 데이터입니다.
+
+208
+00:14:05,570 --> 00:14:10,090
+여기서 우리가 입력받을 것은 문자들로 이루어진 대용량 .txt 시퀀스 데이터입니다.
+
+209
+00:14:10,090 --> 00:14:14,810
+이 파일의 모든 문자를 읽어들이고, mapping dictionary를 생성합니다.
+
+210
+00:14:14,809 --> 00:14:18,179
+mapping dictionary는 문자에 index를 대응시키고, 또 반대로 index에 문자를 대응시킵니다.
+
+211
+00:14:18,179 --> 00:14:23,120
+그러니까 문자를 순서대로 배열하는 것입니다.
+
+212
+00:14:23,120 --> 00:14:27,350
+여기 보면 아주 긴 문자열이 들어 있는 큰 데이터를 읽어들이네요.
+
+213
+00:14:27,350 --> 00:14:30,860
+우리는 이 데이터를 배열해서 각 문자에 index를 지정할 것입니다.
+
+214
+00:14:30,860 --> 00:14:36,300
+그리고 여기에 보다시피 initialization(초깃값 설정)을 하게 됩니다.
+
+215
+00:14:36,299 --> 00:14:39,899
+hidden size(hidden state의 크기)는 hyperparameter(학습 전에 사람이 정해 주는 값) 입니다. 여기서는 100으로 설정했습니다.
+
+216
+00:14:39,899 --> 00:14:43,100
+hidden size(hidden state의 크기)는 hyperparameter(학습 전에 사람이 정해 주는 값) 입니다. 여기서는 100으로 설정했습니다.
+
+217
+00:14:43,100 --> 00:14:46,720
+여기 있는 건 learning rate 이고요.
+
+218
+00:14:46,720 --> 00:14:51,019
+25가 지정되어 있는 seq_length는 여러분이 RNN을 공부하다 보면 나오는 parameter 입니다.
+
+219
+00:14:51,019 --> 00:14:53,899
+많은 경우 우리의 입력 데이터는 너무 커서 RNN에 한꺼번에 넣을 수가 없습니다.
+
+220
+00:14:53,899 --> 00:14:56,870
+이것은 우리가 backpropagation을 하는 동안 메모리에 데이터를 저장해 두어야 하는데 여기에 한계가 있기 때문이죠
+
+221
+00:14:56,870 --> 00:15:00,070
+이것은 우리가 backpropagation을 하는 동안 메모리에 데이터를 저장해 두어야 하는데 여기에 한계가 있기 때문이죠
+
+222
+00:15:00,070 --> 00:15:03,540
+이것은 우리가 backpropagation을 하는 동안 메모리에 데이터를 저장해 두어야 하는데 여기에 한계가 있기 때문이죠
+
+223
+00:15:03,539 --> 00:15:07,139
+그래서 우리는 입력 데이터를 몇 개의 데이터로 쪼개고, 여기서는 길이가 25인 데이터들로 쪼갰습니다.
+
+224
+00:15:07,139 --> 00:15:09,230
+그래서 우리는 입력 데이터를 몇 개의 데이터로 쪼개고, 여기서는 길이가 25인 데이터들로 쪼갰습니다.
+
+225
+00:15:09,230 --> 00:15:14,769
+그러니까 한 번에 처리할 문자의 개수가 25개인 것입니다.
+
+226
+00:15:14,769 --> 00:15:19,509
+다시 설명하면, 한 번에 backpropagation 하는 문자의 개수가 25인 것이고,
+
+227
+00:15:19,509 --> 00:15:22,149
+한 번에 모든 데이터를 기억해서 backpropagation 할 수 없기 때문에, 하나의 크기가 25개인 덩어리 데이터들로 나누어서 처리합니다.
+
+228
+00:15:22,149 --> 00:15:26,899
+한 번에 모든 데이터를 기억해서 backpropagation 할 수 없기 때문에, 하나의 크기가 25개인 덩어리 데이터들로 나누어서 처리합니다.
+
+229
+00:15:26,899 --> 00:15:30,789
+여기 보이는 행렬들은 random 함수를 이용해서 초기값이 무작위적으로 입력됩니다.
+
+230
+00:15:30,789 --> 00:15:34,709
+Wxh, Whh, Why는 모두 우리가 backpropagation을 통해 학습시킬 대상들입니다.
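(역자 보충) 방금 설명한 데이터 읽기와 initialization 부분을 본문 내용대로 재구성한 스케치입니다. min char-rnn의 해당 부분과 거의 같은 형태이고, 'input.txt'라는 파일 이름은 설명을 위한 가정입니다.

```python
import numpy as np

data = open('input.txt', 'r').read()                 # 대용량 문자열 .txt 데이터 (파일명은 가정)
chars = list(set(data))                              # 데이터에 등장하는 고유 문자들
vocab_size = len(chars)
char_to_ix = {ch: i for i, ch in enumerate(chars)}   # 문자 -> index 대응
ix_to_char = {i: ch for i, ch in enumerate(chars)}   # index -> 문자 대응

hidden_size = 100      # hidden state의 크기 (hyperparameter)
seq_length = 25        # 한 번에 backpropagation할 덩어리(chunk)의 길이
learning_rate = 1e-1

# weight 행렬들은 작은 난수로 무작위 초기화
Wxh = np.random.randn(hidden_size, vocab_size) * 0.01   # input -> hidden
Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden
Why = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden -> output
bh = np.zeros((hidden_size, 1))                          # hidden bias
by = np.zeros((vocab_size, 1))                           # output bias
```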
+
+231
+00:15:34,710 --> 00:15:36,790
+Wxh, Whh, Why는 모두 우리가 backpropagation을 통해 학습시킬 대상들입니다.
+
+232
+00:15:36,789 --> 00:15:40,699
+loss function은 넘어가고 맨 밑 부분을 살펴보겠습니다.
+
+233
+00:15:40,700 --> 00:15:44,020
+이 부분은 Main loop입니다. 이 중에서 몇 부분을 한번 살펴보죠.
+
+234
+00:15:44,019 --> 00:15:48,399
+이 부분에서 어떤 변수들에 0을 대입하는 초기화가 진행됩니다.
+
+235
+00:15:48,399 --> 00:15:50,829
+그리고 계속해서 loop을 돌리게 되죠.
+
+236
+00:15:50,830 --> 00:15:54,960
+우리가 지금 보고 있는 것은 전체 데이터의 한 batch 입니다.
+
+237
+00:15:54,960 --> 00:15:58,970
+전체 데이터 세트에서 크기 25의 문자 batch를 가져와 list input으로 넣어줍니다.
+
+238
+00:15:58,970 --> 00:16:03,019
+그리고 그 list input은 각 문자에 대응되는 25개의 숫자를 갖고 있습니다.
+
+239
+00:16:03,019 --> 00:16:06,919
+타겟들은 여기 index에 1을 더한 값이 되는데요,
+
+240
+00:16:06,919 --> 00:16:09,909
+이것은 타겟들이 현재 순서가 아니라 바로 다음 순서에 나올 문자들이기 때문에 그렇습니다.
+
+241
+00:16:09,909 --> 00:16:15,269
+그러니까 list input에는 25개의 문자에 대응되는 25개의 숫자가 있고, 타겟 문자는 그 숫자들에서 1을 더한 index에 대응되는 문자들입니다.
+
+242
+00:16:15,269 --> 00:16:20,689
+그러니까 list input에는 25개의 문자에 대응되는 25개의 숫자가 있고, 타겟 문자는 그 숫자들에서 1을 더한 index에 대응되는 문자들입니다.
+
+243
+00:16:20,690 --> 00:16:26,480
+이것은 sampling 코드입니다.
+
+244
+00:16:26,480 --> 00:16:30,659
+매 시간 단계에서 RNN을 학습시키면서, 현재 RNN이 어떻게 사고하고 있는지 알아보기 위한 sample을 출력합니다.
+
+245
+00:16:30,659 --> 00:16:35,370
+매 시간 단계에서 RNN을 학습시키면서, 현재 RNN이 어떻게 사고하고 있는지 알아보기 위한 sample을 출력합니다.
+
+246
+00:16:35,370 --> 00:16:40,320
+우리가 문자 단위의 RNN을 사용할 때에는
+
+247
+00:16:40,320 --> 00:16:43,570
+RNN이 매 시간 단계마다 바로 다음에 올 문자들의 순서를 출력합니다.
+
+248
+00:16:43,570 --> 00:16:46,379
+그러니까 sampling 후 그것을 다시 입력값으로 주고, 다음 sample을 또다시 입력값으로 주는 방식으로 모든 sample을 입력한 다음,
+
+249
+00:16:46,379 --> 00:16:49,259
+그러니까 sampling 후 그것을 다시 입력값으로 주고, 다음 sample을 또다시 입력값으로 주는 방식으로 모든 sample을 입력한 다음,
+
+250
+00:16:49,259 --> 00:16:52,769
+그러니까 sampling 후 그것을 다시 입력값으로 주고, 다음 sample을 또다시 입력값으로 주는 방식으로 모든 sample을 입력한 다음,
+
+251
+00:16:52,769 --> 00:16:56,549
+RNN에게 임의의 문자열을 만들라고 지시할 수 있게 됩니다.
+
+252
+00:16:56,549 --> 00:17:00,549
+이게 이 코드의 기능이고, 이것은 조금 있다 살펴볼 sample function을 사용합니다.
+
+253
+00:17:00,549 --> 00:17:04,250
+여기서는 loss function을 불러옵니다.
+
+254
+00:17:04,250 --> 00:17:09,160
+loss function은 입력값, 타겟 문자, hprev 을 입력받습니다.
+
+255
+00:17:09,160 --> 00:17:13,900
+hprev는 h from previous chunk 을 뜻합니다.
+
+256
+00:17:13,900 --> 00:17:18,179
+우리가 크기가 25인 batch들을 사용하는데,
+
+257
+00:17:18,179 --> 00:17:22,400
+hidden state에서는 바로 전 batch의 마지막 hidden state 값이 무엇인지에 대한 정보가 필요하고, 이 값을 다음 batch의 첫 h에 넣어주게 됩니다.
+
+258
+00:17:22,400 --> 00:17:26,140
+그러니까 h가 batch 에서 그 다음 batch 로 제대로 넘어가기 위해서 h prev을 사용하는 것입니다.
+
+259
+00:17:26,140 --> 00:17:30,700
+그리고 그 h prev는 backpropagation 할 때만 사용됩니다.
+
+260
+00:17:30,700 --> 00:17:35,558
+그 h prev을 loss function에 입력하면, loss, gradient, weight 행렬, 그리고 bias를 출력합니다.
+
+261
+00:17:35,558 --> 00:17:39,319
+그 h prev을 loss function에 입력하면, loss, gradient, weight 행렬, 그리고 bias를 출력합니다.
+
+262
+00:17:39,319 --> 00:17:44,149
+여기에서 loss를 print 하고, 여기에선 parameter들을 loss function이 하라는 대로 업데이트합니다.
+
+263
+00:17:44,150 --> 00:17:47,429
+실제로 업데이트가 되는 것은 여기 adagrad update 라고 적혀 있는 부분이네요.
+
+264
+00:17:47,429 --> 00:17:53,100
+여기서는 gradient를 제곱한 값들을 변수에 계속 더해 줍니다.
+
+265
+00:17:53,099 --> 00:17:56,819
+그리고 이것들로 adagrad를 업데이트 하죠.
+
+266
+00:17:56,819 --> 00:18:00,639
+이제 loss function을 살펴보겠습니다.
+
+267
+00:18:00,640 --> 00:18:05,790
+이 블록이 loss function이고, forward와 backward 방법들로 이루어져 있습니다.
+
+268
+00:18:05,789 --> 00:18:08,990
+처음에는 forward pass, 나중에는 초록색으로 적혀 있는 backward pass를 수행합니다.
+
+269
+00:18:08,990 --> 00:18:13,130
+처음에는 forward pass, 나중에는 초록색으로 적혀 있는 backward pass를 수행합니다.
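(역자 보충) 위에서 설명한 main loop의 핵심만 추린 스케치입니다. 앞의 초기화 스케치의 변수들을 그대로 사용하고, lossFun과 sample은 바로 뒤에서 스케치하는 함수들이므로 실제로 실행하려면 그 정의가 먼저 있어야 합니다. while True나 'n % 100마다 샘플 출력' 같은 세부 값은 설명용 가정입니다.

```python
# adagrad용 제곱 gradient 누적 변수들
mWxh, mWhh, mWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
mbh, mby = np.zeros_like(bh), np.zeros_like(by)
n, p = 0, 0
while True:
    if p + seq_length + 1 >= len(data) or n == 0:
        hprev = np.zeros((hidden_size, 1))   # 데이터의 처음으로 돌아오면 상태를 0으로 초기화
        p = 0
    inputs = [char_to_ix[ch] for ch in data[p:p + seq_length]]           # 문자 25개
    targets = [char_to_ix[ch] for ch in data[p + 1:p + seq_length + 1]]  # index를 1씩 민 타겟

    if n % 100 == 0:
        print(sample(hprev, inputs[0], 200))  # 현재 모델이 '어떻게 사고하는지' 샘플 출력

    loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
    for param, dparam, mem in zip([Wxh, Whh, Why, bh, by],
                                  [dWxh, dWhh, dWhy, dbh, dby],
                                  [mWxh, mWhh, mWhy, mbh, mby]):
        mem += dparam * dparam                                   # 제곱 gradient 누적
        param += -learning_rate * dparam / np.sqrt(mem + 1e-8)   # adagrad update
    p += seq_length
    n += 1
```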
+ +270 +00:18:13,130 --> 00:18:18,919 +forward pass에서는 input을 target을 향하게 만듭니다. + +271 +00:18:18,919 --> 00:18:23,360 +여기서 25개의 index를 받지만, 반복문을 25번 실행하는 것이 아니라, + +272 +00:18:23,359 --> 00:18:27,500 +여기 있는 성분이 모두 0인 input vector에 one-hot 인코딩을 하게 됩니다. + +273 +00:18:27,500 --> 00:18:32,169 +그러니까 input에 대응되는 bit를 1로 지정하는 것이죠. + +274 +00:18:32,169 --> 00:18:34,110 +one hot encoding을 이용해서 input을 주고, + +275 +00:18:34,109 --> 00:18:39,229 +밑에 있는 recurrence 공식을 이용해서 계산합니다. + +276 +00:18:39,230 --> 00:18:42,210 +hs[t]는 매 시간 단계의 모든 값들을 기록합니다. + +277 +00:18:42,210 --> 00:18:46,910 +recurrence 공식과 이 두 줄의 코드를 통해 hidden state vector과 output vector 을 계산합니다. + +278 +00:18:46,910 --> 00:18:50,779 +여기서는 softmax function을 이용해서 normalization을 구현합니다. + +279 +00:18:50,779 --> 00:18:54,440 +softmax function에서의 loss는 정답(역자주: 타겟 문자)이 나올 확률의 log를 취하고 거기에 -1을 곱한 값입니다.(역자주: cross entropy loss) + +280 +00:18:54,440 --> 00:18:58,190 +softmax function에서의 loss는 정답(역자주: 타겟 문자)이 나올 확률의 log를 취하고 거기에 -1을 곱한 값입니다.(역자주: cross entropy loss) + +281 +00:18:58,190 --> 00:19:02,779 +지금까지 forward pass 를 살펴보았고, 이제 그래프를 통해 backpropagation을 살펴보겠습니다. + +282 +00:19:02,779 --> 00:19:06,899 +backward pass에서는, 25번째 문자에서 첫번째 문자까지 거슬러 올라갑니다. + +283 +00:19:06,900 --> 00:19:08,530 +backward pass에서는, 25번째 문자에서 첫번째 문자까지 거슬러 올라갑니다. + +284 +00:19:08,529 --> 00:19:12,899 +backward pass에서는, 25번째 문자에서 첫번째 문자까지 거슬러 올라갑니다. + +285 +00:19:12,900 --> 00:19:16,509 +여기서는 softmax, activation 등을 통한 backpropagation이 수행됩니다. + +286 +00:19:16,509 --> 00:19:19,089 +그리고 모든 gradient와 parameter들을 더해주죠. + +287 +00:19:19,089 --> 00:19:23,379 +한 가지 짚고 넘어갈 점은, Whh를 비롯한 행렬에서의 gradient 계산에서 '+='을 사용하고 있다는 것입니다. + +288 +00:19:23,380 --> 00:19:27,210 +한 가지 짚고 넘어갈 점은, Whh를 비롯한 행렬에서의 gradient 계산에서 '+='을 사용하고 있다는 것입니다. + +289 +00:19:27,210 --> 00:19:31,210 +한 가지 짚고 넘어갈 점은, Whh를 비롯한 행렬에서의 gradient 계산에서 '+='을 사용하고 있다는 것입니다. + +290 +00:19:31,210 --> 00:19:34,590 +우리는 매 시간 단계마다 weight 행렬들이 gradient를 받고, 이 값들을 모두 더해 주어야 하기 때문에, 이 행렬을 계속 쓰게 됩니다. + +291 +00:19:34,589 --> 00:19:37,449 +우리는 매 시간 단계마다 weight 행렬들이 gradient를 받고, 이 값들을 모두 더해 주어야 하기 때문에, 이 행렬을 계속 쓰게 됩니다. + +292 +00:19:37,450 --> 00:19:43,980 +그리고 계속해서 backpropagation을 하게 되죠. + +293 +00:19:43,980 --> 00:19:48,130 +여기에서 나온 gradient는 loss function에 사용되고, 결국 parameter를 업데이트하게 됩니다. + +294 +00:19:48,130 --> 00:19:52,580 +마지막으로 sampling function입니다. + +295 +00:19:52,579 --> 00:19:55,960 +여기서 RNN을 지금까지 학습한 training 데이터를 바탕으로 실제로 새로운 문자열 데이터를 출력하게 됩니다. + +296 +00:19:55,960 --> 00:19:59,058 +여기서 RNN을 지금까지 학습한 training 데이터를 바탕으로 실제로 새로운 문자열 데이터를 출력하게 됩니다. + +297 +00:19:59,058 --> 00:20:02,048 +여기서 문자열을 초기화해주었고, + +298 +00:20:02,048 --> 00:20:06,759 +피곤해질 때까지 (역자주: 미리 설정한 recurrence가 끝날 때까지) 다음 작업들을 반복합니다. + +299 +00:20:06,759 --> 00:20:09,289 +recurrence 공식 실행, 각 문자에 대한 확률분포 계산, 샘플링, one-hot 인코딩, 그리고 그 결과물을 다음 시간 단계로 재입력 + +300 +00:20:09,289 --> 00:20:10,450 +recurrence 공식 실행, 각 문자에 대한 확률분포 계산, 샘플링, one-hot 인코딩, 그리고 그 결과물을 다음 시간 단계로 재입력 + +301 +00:20:10,450 --> 00:20:15,640 +recurrence 공식 실행, 각 문자에 대한 확률분포 계산, 샘플링, one-hot 인코딩, 그리고 그 결과물을 다음 시간 단계로 재입력 + +302 +00:20:15,640 --> 00:20:22,460 +이 작업들을 충분히 많은 문자열을 출력할 때까지 계속 수행합니다. + +303 +00:20:22,460 --> 00:20:27,190 +(질문: 안들림 => 답변) 우리는 매 batch 마다 25개의 softmax classifier를 갖고 있습니다. + +304 +00:20:27,190 --> 00:21:04,680 +(답변) 그 classifier 들은 한번에 backpropagation을 진행하고, 반대방향으로 모든 결과물들을 더해주죠. + +305 +00:21:04,680 --> 00:21:14,910 +그게 우리가 이걸 쓰는 이유죠. 다음 질문? + +306 +00:21:14,910 --> 00:21:19,259 +(질문) 여기서 regularization을 쓰나요? + +307 +00:21:19,259 --> 00:21:23,720 +(답변) 여기서는 빠져 있습니다. 
일반적으로 RNN에서는 다른 알고리즘만큼 regularization이 흔하게 적용되지는 않습니다.
+
+308
+00:21:23,720 --> 00:21:27,269
+(답변) 여기서는 빠져 있습니다. 일반적으로 RNN에서는 다른 알고리즘만큼 regularization이 흔하게 적용되지는 않습니다.
+
+309
+00:21:27,269 --> 00:21:38,379
+(답변) 가끔 아주 좋지 않은 결과를 낳기도 해서, 저는 그냥 사용하지 않을 때도 있습니다. 일종의 hyperparameter이죠. 다음 질문? (질문 안들림)
+
+310
+00:21:38,380 --> 00:21:48,260
+(답변) 여기서의 문자들은 아주 기초적인 수준입니다. 그래서 실제로 이런 문자가 존재하는지 별로 신경쓰지는 않아요.
+
+311
+00:21:48,259 --> 00:21:51,839
+(답변) 여기서의 문자들은 아주 기초적인 수준입니다. 그래서 실제로 이런 문자가 존재하는지 별로 신경쓰지는 않아요.
+
+312
+00:21:51,839 --> 00:21:56,289
+문자들의 index와 그것들의 순서 정도만을 고려할 뿐이죠.
+
+313
+00:21:56,289 --> 00:21:58,569
+다음 질문?
+
+314
+00:21:58,569 --> 00:22:08,009
+(질문) space 대신 일정한 segment size(25)를 이용하는 이유가 있나요?
+
+315
+00:22:08,009 --> 00:22:13,460
+(질문) space 대신 일정한 segment size(25)를 이용하는 이유가 있나요?
+
+316
+00:22:13,460 --> 00:22:18,630
+(답변) 크기가 25인 batch 말고 space로 구분하는 것 역시 가능할 것 같습니다. 하지만 거기에는 언어에 대한 특별한 가정이 필요해서 권장되지 않아요.
+
+317
+00:22:18,630 --> 00:22:22,530
+자세한 이유는 좀 있다가 살펴보도록 하겠습니다.
+
+318
+00:22:22,529 --> 00:22:25,359
+이 코드에는 어떤 문자열도 입력할 수 있어요. 이걸 갖고 여러 가지를 해 볼게요.
+
+319
+00:22:25,359 --> 00:22:31,539
+여기 우리가 출처를 모르는 어떤 문자열이 있습니다.
+
+320
+00:22:31,539 --> 00:22:34,889
+그리고 이 문자열을 RNN에 학습시키고, RNN이 문자열을 만들어내게 할 거에요.
+
+321
+00:22:34,890 --> 00:22:40,670
+예를 들어, 셰익스피어의 모든 작품을 입력할 수 있습니다.
+
+322
+00:22:40,670 --> 00:22:44,789
+크기가 좀 크긴 하지만, 이건 단지 문자열일 뿐이에요.
+
+323
+00:22:44,789 --> 00:22:48,289
+크기가 좀 크긴 하지만, 이건 단지 문자열일 뿐이에요.
+
+324
+00:22:48,289 --> 00:22:51,909
+RNN에 셰익스피어의 작품을 학습시키고, 셰익스피어의 시에서의 다음 문자를 예측하게끔 할 수 있습니다.
+
+325
+00:22:51,910 --> 00:22:54,650
+처음에는 학습이 되어 있지 않기 때문에, 결과물들은 매우 무작위적인 문자열입니다.
+
+326
+00:22:54,650 --> 00:22:59,030
+처음에는 학습이 되어 있지 않기 때문에, 결과물들은 매우 무작위적인 문자열입니다.
+
+327
+00:22:59,029 --> 00:23:03,200
+하지만 학습을 통해 RNN은 이 문자열 안에는 단어들이 있고, 단어들 사이에 space가 있고, 쌍따옴표(")의 사용법을 이해하게 되죠.
+
+328
+00:23:03,200 --> 00:23:06,930
+하지만 학습을 통해 RNN은 이 문자열 안에는 단어들이 있고, 단어들 사이에 space가 있고, 쌍따옴표(")의 사용법을 이해하게 되죠.
+
+329
+00:23:06,930 --> 00:23:11,490
+하지만 학습을 통해 RNN은 이 문자열 안에는 단어들이 있고, 단어들 사이에 space가 있고, 쌍따옴표(")의 사용법을 이해하게 되죠.
+
+330
+00:23:11,490 --> 00:23:16,420
+그리고 'here', 'on', 'and so on' 과 같은 기본적인 표현들을 알게 됩니다.
+
+331
+00:23:16,420 --> 00:23:18,820
+그리고 RNN을 계속 학습시킬수록, 이러한 표현들이 점점 정제되는 것을 확인할 수 있습니다.
+
+332
+00:23:18,819 --> 00:23:22,609
+예를 들어 "를 한번 사용하면 "를 한번 더 사용해서 인용구를 닫아 주는 것들을 익히는 거죠.
+
+333
+00:23:22,609 --> 00:23:26,379
+또 문장이 마침표로 끝나는 것 역시 따로 가르치지 않고도 패턴만으로 익히게 됩니다.
+
+334
+00:23:26,380 --> 00:23:29,630
+또 문장이 마침표로 끝나는 것 역시 따로 가르치지 않고도 통계적 패턴만으로 익히게 됩니다.
+
+335
+00:23:29,630 --> 00:23:30,580
+그리고 마침내 '셰익스피어 문학' 자체를 생성할 수 있게 되죠.
+
+336
+00:23:30,579 --> 00:23:34,349
+여기 RNN이 만들어낸 작품을 읽어볼게요.
+
+337
+00:23:34,349 --> 00:23:38,740
+(읽는 중) "Alas, I think he shall come approached and the day..."
+
+338
+00:23:38,740 --> 00:23:42,900
+(읽는 중) "Alas, I think he shall come approached and the day..."
+
+339
+00:23:42,900 --> 00:23:45,460
+(읽는 중) "Alas, I think he shall come approached and the day..."
+
+340
+00:23:45,460 --> 00:23:56,909
+(질문) 하지만 이것들은 25개가 넘는 문자로 이루어진 문장은 기억할 수가 없기 때문에 제대로 생성할 수 없죠?
+
+341
+00:23:56,909 --> 00:24:02,679
+(답변) 네 맞습니다. 그거 사실 되게 알아차리기 힘든 부분이라 제가 나중에 말하려고 했었어요.
+
+342
+00:24:02,679 --> 00:24:05,980
+우리는 셰익스피어 작품이 아니라 다른 것들에도 이것을 활용할 수 있습니다.
+
+343
+00:24:05,980 --> 00:24:08,960
+이것들은 제가 Justin과 작년에 만들어본 것들입니다.
+
+344
+00:24:08,960 --> 00:24:12,990
+Justin은 한 대수기하학 책의 LaTeX 소스를 RNN에 학습시켰습니다.
+
+345
+00:24:12,990 --> 00:24:18,069
+Justin은 한 대수기하학 책의 LaTeX 소스를 RNN에 학습시켰습니다.
+
+346
+00:24:18,069 --> 00:24:23,398
+그리고 RNN은 수학책을 집필했죠.
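(역자 보충) 앞에서 설명한 loss function(forward + backward pass)과 sampling 함수를 본문 설명대로 옮긴 스케치입니다. 앞의 스케치들에서 정의한 전역 변수를 사용하며, gradient clipping 한 줄은 원 강의에서 언급되는 '폭발하는 gradient 대책'을 따라 넣은 것입니다.

```python
def lossFun(inputs, targets, hprev):
    xs, hs, ps = {}, {}, {}
    hs[-1] = np.copy(hprev)
    loss = 0.0
    for t in range(len(inputs)):                      # forward pass: input에서 target 방향으로
        xs[t] = np.zeros((vocab_size, 1))
        xs[t][inputs[t]] = 1                          # one-hot 인코딩
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1] + bh)   # 재귀 식
        ys = Why @ hs[t] + by
        ps[t] = np.exp(ys) / np.sum(np.exp(ys))       # softmax로 normalization
        loss += -np.log(ps[t][targets[t], 0])         # -log(정답 문자의 확률)

    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dhnext = np.zeros_like(hs[0])
    for t in reversed(range(len(inputs))):            # backward pass: 마지막 문자 -> 첫 문자
        dy = np.copy(ps[t])
        dy[targets[t]] -= 1                           # softmax의 gradient
        dWhy += dy @ hs[t].T                          # '+=' : 공유 weight에 gradient를 계속 누적
        dby += dy
        dh = Why.T @ dy + dhnext
        dhraw = (1 - hs[t] * hs[t]) * dh              # tanh의 미분
        dbh += dhraw
        dWxh += dhraw @ xs[t].T
        dWhh += dhraw @ hs[t - 1].T
        dhnext = Whh.T @ dhraw
    for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
        np.clip(dparam, -5, 5, out=dparam)            # gradient가 폭발하는 것을 막는 clipping
    return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs) - 1]

def sample(h, seed_ix, n):
    x = np.zeros((vocab_size, 1))
    x[seed_ix] = 1                                    # 시작 문자 하나를 one-hot으로 입력
    ixes = []
    for _ in range(n):
        h = np.tanh(Wxh @ x + Whh @ h + bh)           # recurrence 공식 실행
        y = Why @ h + by
        p = np.exp(y) / np.sum(np.exp(y))             # 각 문자에 대한 확률분포 계산
        ix = np.random.choice(vocab_size, p=p.ravel())  # 분포에서 다음 문자를 샘플링
        x = np.zeros((vocab_size, 1))
        x[ix] = 1                                     # 뽑힌 문자를 one-hot으로 다시 재입력
        ixes.append(ix)
    return ''.join(ix_to_char[ix] for ix in ixes)     # index들을 문자열로 변환
```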
+
+347
+00:24:23,398 --> 00:24:27,199
+물론 RNN은 LaTeX 형식으로 결과물을 출력하지 않아서 저희가 약간 손봐주긴 했지만,
+
+348
+00:24:27,200 --> 00:24:30,009
+물론 RNN은 LaTeX 형식으로 결과물을 출력하지 않아서 저희가 약간 손봐주긴 했지만,
+
+349
+00:24:30,009 --> 00:24:33,890
+어쨌든 한두 번 손보고 나니 보시는 바와 같이 수학책이 되었어요.
+
+350
+00:24:33,890 --> 00:24:37,200
+어쨌든 한두 번 손보고 나니 보시는 바와 같이 수학책이 되었어요.
+
+351
+00:24:37,200 --> 00:24:42,460
+살펴보면, RNN은 proof(증명)를 쓰는 방법을 배웠네요. 증명의 끝에는 저렇게 사각형을 쓰죠.
+
+352
+00:24:42,460 --> 00:24:47,090
+lemma(소정리)를 비롯한 다른 것들도 만들어 냈고요.
+
+353
+00:24:47,089 --> 00:24:52,428
+그림을 그리는 방법도 배웠네요.
+
+354
+00:24:52,429 --> 00:24:56,720
+제가 가장 좋아하는 부분은 여기 왼쪽 상단에 있는 "Proof. Omitted" 부분입니다.
+
+355
+00:24:56,720 --> 00:24:59,650
+RNN도 귀찮았나 봐요 (웃음)
+
+356
+00:24:59,650 --> 00:25:05,780
+RNN도 귀찮았나 봐요 (웃음)
+
+357
+00:25:05,779 --> 00:25:12,480
+전반적으로 보면 RNN은 꽤 대수기하학책 같이 보이는 걸 만들어 냈어요.
+
+358
+00:25:12,480 --> 00:25:16,160
+뭐 세부적인 부분은 제가 대수기하를 잘 몰라서 말하기 그렇지만, 전반적으로 괜찮아요.
+
+359
+00:25:16,160 --> 00:25:19,529
+저는 이어서 문자 단위 RNN으로 표현할 수 있는 가장 어렵고 추상적인 것들이 무엇이 있을까 생각했고,
+
+360
+00:25:19,529 --> 00:25:22,769
+소스 코드에 생각이 미쳤습니다.
+
+361
+00:25:22,769 --> 00:25:27,879
+그래서 리누스 토발즈의 GitHub에 들어가 리눅스의 모든 C 코드를 가져왔습니다.
+
+362
+00:25:27,880 --> 00:25:30,850
+이 C 코드는 자그마치 700MB나 됩니다.
+
+363
+00:25:30,849 --> 00:25:35,079
+이 코드를 RNN에게 학습시켰고, RNN은 코드를 생성해 냈습니다.
+
+364
+00:25:35,079 --> 00:25:39,849
+이게 바로 RNN이 생성해낸 코드입니다.
+
+365
+00:25:39,849 --> 00:25:42,949
+살펴보면 함수를 생성했고, 변수를 지정하고, 문법적 오류가 거의 없습니다.
+
+366
+00:25:42,950 --> 00:25:47,460
+변수를 어떻게 사용하는지도 아는 것 같고,
+
+367
+00:25:47,460 --> 00:25:53,230
+indentation (들여쓰기)도 적절히 했고, 주석도 달았습니다.
+
+368
+00:25:53,230 --> 00:25:58,089
+괄호를 열고 닫지 않는 등의 실수를 찾아보기가 매우 힘들었습니다.
+
+369
+00:25:58,089 --> 00:26:01,808
+이런 것들은 RNN이 배우기 가장 쉬운 것들 중 하나거든요.
+
+370
+00:26:01,808 --> 00:26:04,058
+RNN의 실수들 중에는 쓰이지 않을 변수를 선언하거나, 선언하지도 않은 변수를 불러오기를 시도하는 것들이 있었습니다.
+
+371
+00:26:04,058 --> 00:26:07,240
+RNN의 실수들 중에는 쓰이지 않을 변수를 선언하거나, 선언하지도 않은 변수를 불러오기를 시도하는 것들이 있었습니다.
+
+372
+00:26:07,240 --> 00:26:09,929
+그러니까 아직 매우 높은 단계의 코딩 수준에는 도달하지 못한 거죠.
+
+373
+00:26:09,929 --> 00:26:12,509
+하지만 그런 것들을 제외하고 보면 꽤 코딩을 잘 했습니다.
+
+374
+00:26:12,509 --> 00:26:17,460
+새로운 GPL 라이센스에 관한 주석을 다는 방법도 배웠네요.
+
+375
+00:26:17,460 --> 00:26:22,009
+새로운 GPL 라이센스에 관한 주석을 다는 방법도 배웠네요.
+
+376
+00:26:22,009 --> 00:26:25,779
+GPL 라이센스 다음에는 #include, 매크로 코드 등이 오는 것도 배웠고요.
+
+377
+00:26:25,779 --> 00:26:33,879
+(질문) 이건 (아까 보여준) min char-rnn 으로 만들어낸 건가요?
+
+378
+00:26:33,880 --> 00:26:37,169
+(답변) min char-rnn은 그냥 작동 원리를 알려주기 위해 만들어낸 장난감 같은 거고,
+
+379
+00:26:37,169 --> 00:26:41,230
+(답변) 실제로는 min char-rnn의 확장판인 torch 기반 char-rnn으로 구현했고, GPU를 이용해서 처리했습니다.
+
+380
+00:26:41,230 --> 00:26:45,009
+(답변) 실제로는 min char-rnn의 확장판인 torch 기반 char-rnn으로 구현했고, GPU를 이용해서 처리했습니다.
+
+381
+00:26:45,009 --> 00:26:49,269
+이 부분은 수업 마지막 부분에 다룰 것인데, 3-layer LSTM 이라는 것입니다.
+
+382
+00:26:49,269 --> 00:26:52,289
+이건 RNN의 복잡한 버전이라고 생각하면 됩니다.
+
+383
+00:26:52,289 --> 00:26:58,839
+좀 더 이해가 쉽도록 예를 들어 볼게요.
+
+384
+00:26:58,839 --> 00:27:02,089
+이건 작년에 저희가 이런 것들을 가지고 만들어본 것들입니다.
+
+385
+00:27:02,089 --> 00:27:08,949
+저희는 문자 단위 RNN에 신경과학적으로 접근을 해 보았습니다.
+
+386
+00:27:08,950 --> 00:27:13,110
+hidden state 내부 특정 cell의 excitement(흥분) 여부에 따라 색을 칠해 봤습니다.
+
+387
+00:27:13,109 --> 00:27:17,119
+hidden state 내부 특정 cell의 excitement(흥분) 여부에 따라 색을 칠해 봤습니다.
+
+388
+00:27:17,119 --> 00:27:18,699
+hidden state 내부 특정 cell의 excitement(흥분) 여부에 따라 색을 칠해 봤습니다.
+
+389
+00:27:18,700 --> 00:27:23,470
+보시다시피, hidden state의 뉴런들의 상태를 해석하는 일이 쉽지가 않습니다.
+
+390
+00:27:23,470 --> 00:27:27,110
+보시다시피, hidden state의 뉴런들의 상태를 해석하는 일이 쉽지가 않습니다.
+
+391
+00:27:27,109 --> 00:27:29,829
+왜냐하면 어떤 뉴런들은 매우 낮은 단계에서의 작업을 맡거든요.
+
+392
+00:27:29,829 --> 00:27:33,859
+예를 들면, 'h 다음에 e가 얼마나 자주 오는가' 가 있네요.
+
+393
+00:27:33,859 --> 00:27:37,928
+하지만 어떤 cell 들은 해석하기가 꽤 용이했습니다.
+
+394
+00:27:37,929 --> 00:27:41,830
+여기 보시는 것은 인용구 검출 cell 입니다.
+
+395
+00:27:41,829 --> 00:27:46,460
+이 cell은 처음 따옴표가 나오면 켜지고, 따옴표가 다시 나타나면 꺼집니다.
+
+396
+00:27:46,460 --> 00:27:50,610
+이건 그냥 backpropagation의 결과로 나온 것입니다.
+
+397
+00:27:50,609 --> 00:27:54,329
+RNN은 문자열의 통계적 특성이 따옴표 안에 있을 때와 따옴표 바깥에 있을 때 서로 다르다는 것을 파악했습니다.
+
+398
+00:27:54,329 --> 00:27:57,639
+그래서 hidden state의 특정 부분들을 현재 문자들이 인용구 안에 있는지 파악하게 했습니다.
+
+399
+00:27:57,640 --> 00:28:00,650
+그래서 hidden state의 특정 부분들을 현재 문자들이 인용구 안에 있는지 파악하게 했습니다.
+
+400
+00:28:00,650 --> 00:28:05,159
+이것이 아까 (질문했던 사람)의 질문에 답을 해줄 것 같은데요,
+
+401
+00:28:05,159 --> 00:28:06,500
+이 RNN의 seq_length는 100 이었습니다.(역자주: batch 크기가 100)
+
+402
+00:28:06,500 --> 00:28:10,269
+하지만 실제로 이 인용구들의 크기를 재어 보면 100보다 훨씬 길다는 것을 알 수 있습니다.
+
+403
+00:28:10,269 --> 00:28:16,220
+제가 보기에 대략 250정도 인 것 같네요.
+
+404
+00:28:16,220 --> 00:28:20,190
+그러니까 우리는 한 번에 크기가 100인 backpropagation만을 진행했고, RNN에게는 그때만이 유일한 학습 기회입니다.
+
+405
+00:28:20,190 --> 00:28:23,460
+그러니까 문자열 크기가 100이 넘어가면 그 앞뒤의 dependencies(종속성, 관계) 에 대해서는 직접적으로 학습하지를 않습니다.
+
+406
+00:28:23,460 --> 00:28:27,809
+그러니까 문자열 크기가 100이 넘어가면 그 앞뒤의 dependencies(종속성, 관계) 에 대해서는 직접적으로 학습하지를 않습니다.
+
+407
+00:28:27,809 --> 00:28:31,159
+하지만 이 결과는 실제 문자열의 길이보다 작은 크기의 batch 들로 학습한다고 해도, batch 크기보다 긴 문자열에 대해서도 잘 작동할 수 있다는 것을 보여주네요.
+
+408
+00:28:31,160 --> 00:28:36,580
+하지만 이 결과는 실제 문자열의 길이보다 작은 크기의 batch 들로 학습한다고 해도, batch 크기보다 긴 문자열에 대해서도 잘 작동할 수 있다는 것을 보여주네요.
+
+409
+00:28:36,579 --> 00:28:39,859
+그러니까 batch 크기는 100이었지만,
+
+410
+00:28:39,859 --> 00:28:44,759
+크기가 수백이 넘는 문자열의 dependencies 도 잘 잡아낸 것이죠.
+
+411
+00:28:44,759 --> 00:28:48,890
+이것은 톨스토이의 <전쟁과 평화> 데이터 입니다.
+
+412
+00:28:48,890 --> 00:28:52,460
+이 데이터 세트는 대략 80문자마다 한 번 줄이 바뀝니다.
+
+413
+00:28:52,460 --> 00:28:57,819
+이 데이터 세트는 대략 80문자마다 한 번 줄이 바뀝니다.
+
+414
+00:28:57,819 --> 00:29:02,470
+그리고 우리는 줄 길이 tracking cell을 찾아냈습니다.
+
+415
+00:29:02,470 --> 00:29:06,539
+이 cell은 줄이 처음 시작하면 1로 시작해서, 문자열이 진행될수록 천천히 그 값이 감소합니다.
+
+416
+00:29:06,539 --> 00:29:09,019
+RNN은 현재 자신이 어느 시간 단계에 있는지 알아야 하기 때문에 이 기능은 매우 유용합니다.
+
+417
+00:29:09,019 --> 00:29:13,059
+RNN은 현재 자신이 어느 시간 단계에 있는지 알아야 하기 때문에 이 기능은 매우 유용합니다.
+
+418
+00:29:13,059 --> 00:29:15,149
+이를 통해서 언제 줄을 바꾸어야 하는지 알 수 있기 때문이죠.
+
+419
+00:29:15,150 --> 00:29:19,280
+이것 말고도 if 문을 감지하는 cell도 찾아냈고,
+
+420
+00:29:19,279 --> 00:29:23,970
+인용구와 주석을 감지하는 cell 도 찾아냈고,
+
+421
+00:29:23,970 --> 00:29:28,710
+상대적으로 deep한 코드를 감지하는 cell 도 찾아냈습니다.
+
+422
+00:29:28,710 --> 00:29:33,150
+다른 역할을 수행하는 cell 들도 찾을 수 있을 것이고, 중요한 것은 이것들이 전부 backpropagation 에서 나왔다는 겁니다.
+
+423
+00:29:33,150 --> 00:29:36,710
+되게 마법같은 일이죠.
+
+424
+00:29:36,710 --> 00:29:42,130
+(질문) 어떻게 cell 하나하나가 흥분했는지 알 수 있었죠?
+
+425
+00:29:42,130 --> 00:29:49,110
+(답변) 이 LSTM 에서는 대략 2100개의 cell 들이 있었습니다. 저는 그냥 하나하나 다 살펴봤어요.
+
+426
+00:29:49,109 --> 00:29:54,589
+(답변) 대부분은 규칙을 찾기가 어려웠지만, 약 5%에 해당하는 cell들에 대해서 살펴본 것들과 같은 규칙을 찾을 수 있었습니다.
+
+427
+00:29:54,589 --> 00:30:00,429
+(질문) 그러니까 어떤 cell들은 켜고, 어떤 cell들은 끄는 방식으로 찾은 건가요?
+
+428
+00:30:00,430 --> 00:30:05,310
+(답변) 오 제가 질문을 잘못 이해했었네요. 저희는 RNN 전체를 실행시켰고, 특정 hidden state의 흥분 상태를 관찰했습니다.
+
+429
+00:30:05,309 --> 00:30:09,679
+(답변) 오 제가 질문을 잘못 이해했었네요. 저희는 RNN 전체를 실행시켰고, 특정 hidden state의 흥분 상태를 관찰했습니다.
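(역자 보충) 위 설명처럼 '실행은 그대로 두고 특정 cell의 값만 기록'하는 과정을 앞의 스케치들로 단순화해 본 것입니다. 본문에서 실제로 관찰한 것은 LSTM의 cell이지만, 여기서는 Vanilla RNN의 tanh hidden state(-1 ~ 1 사이 값)로 대신했고, cell_index=42나 예시 문자열 같은 값들은 모두 가정입니다.

```python
def trace_cell(text, cell_index):
    # 텍스트를 한 글자씩 넣으며 hidden state의 한 성분(activation)만 기록한다
    h = np.zeros((hidden_size, 1))
    activations = []
    for ch in text:                            # text의 문자들이 모두 어휘에 있다고 가정
        x = np.zeros((vocab_size, 1))
        x[char_to_ix[ch]] = 1
        h = np.tanh(Wxh @ x + Whh @ h + bh)
        activations.append(h[cell_index, 0])   # tanh 출력이므로 -1 ~ 1 사이 값
    return activations

# 예: 따옴표 안/밖에서 값이 확연히 달라지는 cell이 있는지 글자별로 눈으로 확인해 볼 수 있다
acts = trace_cell('"To be, or not to be" - Hamlet', cell_index=42)
```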
+ +430 +00:30:09,680 --> 00:30:14,470 +(답변) 그러니까 그냥 실행은 그대로 하되, 특정 hidden state의 상태를 기록하고 살펴본 것입니다. + +431 +00:30:14,470 --> 00:30:20,900 +이해가 되셨나요? + +432 +00:30:20,900 --> 00:30:23,940 +그러니까 저는 여기서 hidden state 단 한 부분만을 여기 슬라이드에 나타냈습니다. + +433 +00:30:23,940 --> 00:30:27,740 +물론 hidden state 에는 이 부분 말고도 다른 일들을 하는 cell들이 많이 있죠. + +434 +00:30:27,740 --> 00:30:30,349 +이것들은 모두 동시에, 다른 기능을 수행합니다. + +435 +00:30:30,349 --> 00:30:41,899 +(질문) 여기서의 hidden state의 layer은 1개인가요? + +436 +00:30:41,900 --> 00:30:50,150 +(답변) Multi-layer RNN을 말씀하시는 건가요? 그것에 대해서는 좀 있다가 설명드리겠습니다. 여기서는 Multi-layer을 썼지만, Single-layer을 썼어도 결과는 비슷했을 거에요. + +437 +00:30:50,150 --> 00:31:00,490 +(질문: 안들림) (답변): 이 hidden state 들은 -1 ~ 1의 값을 가집니다. tanh 함수의 결과물이거든요. + +438 +00:31:00,490 --> 00:31:04,120 +(답변) 이건 우리가 아직 다루지 않은 LSTM에 대한 것들입니다. 한 cell에 배정된 값은 -1~1 이라는 것 정도만 알아두세요. + +439 +00:31:04,119 --> 00:31:11,869 +(답변) 이건 우리가 아직 다루지 않은 LSTM에 대한 것들입니다. 한 cell에 배정된 값은 -1~1 이라는 것 정도만 알아두세요. + +440 +00:31:11,869 --> 00:31:15,609 +RNN은 매우 잘 작동하고, 이러한 시퀀스 모델을 잘 학습할 수 있습니다. + +441 +00:31:15,609 --> 00:31:19,039 +대략 1년 전에 어떤 사람들이 이걸 컴퓨터 비전-image captioning 분야에 적용해 보았습니다. + +442 +00:31:19,039 --> 00:31:22,039 +대략 1년 전에 어떤 사람들이 이걸 컴퓨터 비전-image captioning 분야에 적용해 보았습니다. + +443 +00:31:22,039 --> 00:31:25,210 +여기서는 어떤 하나의 사진을 가지고 단어의 배열을 생성해 보았는데요, + +444 +00:31:25,210 --> 00:31:27,840 +RNN은 여기서 매우 잘 작동했습니다. + +445 +00:31:27,839 --> 00:31:32,490 +RNN은 여기서 매우 잘 작동했습니다. + +446 +00:31:32,490 --> 00:31:36,240 +여기 한 부분을 보시면, + +447 +00:31:36,240 --> 00:31:43,039 +사실 이건 제 논문이기 때문에 저 사진들은 제가 마음대로 쓸 수 있죠. + +448 +00:31:43,039 --> 00:31:46,629 +CNN에 이미지를 입력했는데요, + +449 +00:31:46,630 --> 00:31:48,990 +잘 살펴보시면 사실 이것은 CNN과 RNN의 두 부분으로 구성되어 있다는 것을 발견할 수 있습니다. + +450 +00:31:48,990 --> 00:31:51,750 +잘 살펴보시면 사실 이것은 CNN과 RNN의 두 부분으로 구성되어 있다는 것을 발견할 수 있습니다. + +451 +00:31:51,750 --> 00:31:55,460 +CNN은 이미지 처리를, RNN은 단어들의 순서 결정을 맡았습니다. + +452 +00:31:55,460 --> 00:31:58,470 +제가 강의 처음에 했던 레고 블록 비유를 기억한다면, + +453 +00:31:58,470 --> 00:32:01,039 +CNN과 RNN을 그림에 보이는 화살표와 같이 연결시킨 것을 이해할 수 잇을 것입니다. + +454 +00:32:01,039 --> 00:32:04,509 +CNN과 RNN을 그림에 보이는 화살표와 같이 연결시킨 것을 이해할 수 잇을 것입니다. + +455 +00:32:04,509 --> 00:32:07,829 +저희가 여기서 잘한 점은 여기서 RNN 단어 생성 모델의 입력값을 적절히 조절했다는 것입니다. + +456 +00:32:07,829 --> 00:32:11,349 +그러니까 아무 텍스트나 RNN에 입력한 것이 아니라, + +457 +00:32:11,349 --> 00:32:14,939 +CNN의 결과물을 RNN의 입력값으로 받아온 것이죠. + +458 +00:32:14,940 --> 00:32:21,220 +좀 더 자세히 설명드리겠습니다. forward pass 부분부터요. + +459 +00:32:21,220 --> 00:32:24,110 +여기 test image가 있습니다. + +460 +00:32:24,109 --> 00:32:27,679 +우리는 이 이미지에서 단어들의 시퀀스를 만들어보고 싶어요. + +461 +00:32:27,680 --> 00:32:31,240 +그래서 다음과 같이 이미지를 먼저 처리했습니다. + +462 +00:32:31,240 --> 00:32:35,250 +먼저 이미지를 CNN에 입력했습니다. 여기서 쓰인 CNN은 VGG net 이었습니다. + +463 +00:32:35,250 --> 00:32:37,349 +그리고 여기 conv들과 maxpool 들을 통과시켰죠. + +464 +00:32:37,349 --> 00:32:40,149 +일반적으로 마지막에는 softmax classifier가 위치합니다. + +465 +00:32:40,150 --> 00:32:44,440 +softmax는 확률분포를 출력하죠. 예를 들어 1000개의 카테고리가 있다면 각 카테고리에 대한 확률분포를요. + +466 +00:32:44,440 --> 00:32:47,420 +근데 여기서 우리는 softmax를 사용하지 않았습니다. + +467 +00:32:47,420 --> 00:32:50,750 +대신 이 끝부분을 RNN의 시작 부분과 연결시켰죠. + +468 +00:32:50,750 --> 00:32:54,880 +RNN 입력에 처음에는 특별한 벡터들을 사용했습니다. + +469 +00:32:54,880 --> 00:33:00,410 +RNN 에 입력되는 벡터들의 차원은 300이었고요, + +470 +00:33:00,410 --> 00:33:02,700 +RNN의 첫 iteration에는 무조건 이 벡터를 사용했습니다. + +471 +00:33:02,700 --> 00:33:05,750 +그럼으로써 RNN이 이것이 시퀀스의 시작임을 파악할 수 있게 했습니다. + +472 +00:33:05,750 --> 00:33:09,039 +그리고 아까 살펴본 recurrence 공식 (Vanilla NN)을 사용했습니다. 
+ +473 +00:33:09,039 --> 00:33:13,769 +그리고 아까 살펴본 recurrence 공식 (Vanilla NN)을 사용했습니다. + +474 +00:33:13,769 --> 00:33:18,779 +아까는 (Wxh*x + Whh*h)과 0으로 초기화되는 h_0을 사용했다면, + +475 +00:33:18,779 --> 00:33:23,500 +아까는 (Wxh*x + Whh*h)과 0으로 초기화되는 h_0을 사용했다면, + +476 +00:33:23,500 --> 00:33:28,089 +아까는 (Wxh*x + Whh*h)과 0으로 초기화되는 h_0을 사용했다면, + +477 +00:33:28,089 --> 00:33:33,649 +이번에는 v를 추가해서 (Wxh*x + Whh*h + Wih*v) 를 사용했습니다. + +478 +00:33:33,650 --> 00:33:38,040 +v는 CNN의 맨 마지막 출력값이고, + +479 +00:33:38,039 --> 00:33:43,399 +Wih는 v에 들어 있는 이미지에 대한 정보를 RNN에게 전달해주기 위한 가중치 행렬입니다. + +480 +00:33:43,400 --> 00:33:46,380 +RNN에 이미지의 정보를 전달해 주는 방법은 실제로 여러 가지가 있고, + +481 +00:33:46,380 --> 00:33:48,940 +이것은 그 중 쉬운 한 방법일 뿐입니다. + +482 +00:33:48,940 --> 00:33:51,690 +이것은 그 중 쉬운 한 방법일 뿐입니다. + +483 +00:33:51,690 --> 00:33:55,750 +t = 0 에서의 y_0 벡터는 시퀀스의 첫번째 단어의 확률분포입니다 + +484 +00:33:55,750 --> 00:34:00,009 +t = 0 에서의 y0 벡터는 시퀀스의 첫번째 단어의 확률분포입니다 + +485 +00:34:00,009 --> 00:34:05,490 +이것이 작동하는 방식을 설명해 볼게요. + +486 +00:34:05,490 --> 00:34:09,699 +여기 그림에 밀짚모자가 보이시죠 + +487 +00:34:09,699 --> 00:34:12,939 +이 부분은 CNN에 의해 '지푸라기 같은' 물체로 인식됩니다. + +488 +00:34:12,940 --> 00:34:17,039 +Wih는 이 부분의 hidden state의 값이 특정 state로 넘어갈 때 '지푸라기'이라는 단어가 출력되게 하는 확률을 높이는 데 영향을 미칩니다. + +489 +00:34:17,039 --> 00:34:20,519 +그래서 '지푸라기 같은' 질감을 가진 이미지가 실제로 '지푸라기'이라는 단어의 출현 확률을 높이는 것이죠. + +490 +00:34:20,519 --> 00:34:23,940 +y0의 값 중 하나가 커지게 되는 방식으로요. + +491 +00:34:23,940 --> 00:34:28,470 +y0의 값 중 하나가 커지게 되는 방식으로요. + +492 +00:34:28,469 --> 00:34:32,269 +이제 RNN은 두 가지 작업을 처리해야 합니다. + +493 +00:34:32,269 --> 00:34:36,550 +다음 순서에 어떤 단어가 올지를 예측하고, 현재 이미지 정보를 기억해야 합니다. + +494 +00:34:36,550 --> 00:34:40,629 +우리가 이 softmax로부터 샘플링을 했을때 실제로 이 부분에서 가장 출현 확률이 높은 단어가 '지푸라기'라면, + +495 +00:34:40,628 --> 00:34:44,710 +우리가 이 softmax로부터 샘플링을 했을때 실제로 이 부분에서 가장 출현 확률이 높은 단어가 '지푸라기'라면, + +496 +00:34:44,710 --> 00:34:47,519 +우리는 이 단어를 기록하고 이것을 다시 RNN에 넣어줍니다. + +497 +00:34:47,519 --> 00:34:52,190 +이 단계에서 우리는 단어 단위 embedding을 사용하고 있습니다. + +498 +00:34:52,190 --> 00:34:55,750 +'지푸라기' 라는 단어는 차원이 300인 벡터의 한 원소입니다. + +499 +00:34:55,750 --> 00:35:00,010 +현재 우리가 여기서 사용하는 단어 사전에는 지푸라기를 비롯한 각기 다른 벡터의 형태로 표시되는 300개의 단어들이 존재합니다. + +500 +00:35:00,010 --> 00:35:02,940 +이 300개의 단어를 RNN에 입력하면 그 출력값 y1은 바로 다음 순서에 올 단어를 예측합니다. 
+ +501 +00:35:02,940 --> 00:35:07,090 +하나는 우리가 이러한 모든 특성을 우리가 얻을 왜 내 두 번째 세계와 순서 + +502 +00:35:07,090 --> 00:35:08,010 +그것에서 샘플을 다시 + +503 +00:35:08,010 --> 00:35:12,490 +워드 모자 가능성이 있다고 가정 지금 우리는 모자 400 훨씬 나이 프리젠 테이션을 + +504 +00:35:12,489 --> 00:35:18,299 +그리고 거기의 분포를 얻을 후 우리는 다시 샘플링하고 우리는 때까지 샘플 + +505 +00:35:18,300 --> 00:35:21,350 +우리는 특별한 샘플 및 진정의 끝에있는 기간 토큰 + +506 +00:35:21,349 --> 00:35:24,900 +문장하고는 arnaz 지금이에서 생성 할 것을 우리에게 알려줍니다 + +507 +00:35:24,900 --> 00:35:30,280 +군대는 그렇게 확인 밀짚 모자 기간이 이미지를 설명했을 포인트 + +508 +00:35:30,280 --> 00:35:34,010 +치수와 그의 아내 사진의 수는 단어의 숫자 당신의 + +509 +00:35:34,010 --> 00:35:39,220 +특수 토큰과 우리가 항상 먹이 산업을위한 어휘 +1 + +510 +00:35:39,219 --> 00:35:43,609 +다른 단어에 해당하는 부문과 얘기 특별한 시작과 + +511 +00:35:43,610 --> 00:35:46,250 +우리는 언제나 그 전부 단일 통해 전파 + +512 +00:35:46,250 --> 00:35:49,769 +시간은 무작위로이 국유화하거나 당신은 무료로 BG 그물을 초기화 할 수 있습니다 + +513 +00:35:49,769 --> 00:35:52,099 +다음 분을 위해 무역 + +514 +00:35:52,099 --> 00:35:56,319 +배포판은 다음 그라데이션을 인코딩 한 다음이를 통해 백업 + +515 +00:35:56,320 --> 00:35:59,700 +전체 단일 모델로 것이나 그냥 모든 공동에서 훈련하고 얻을 + +516 +00:35:59,699 --> 00:36:08,389 +캡션 또는 이미지 캡처 확인 질문을 많이하지만 네 삼백 + +517 +00:36:08,389 --> 00:36:12,609 +감정 묻어은 너무 이미지 모든 단어의 단지 독립적있어 + +518 +00:36:12,610 --> 00:36:18,430 +그렇게 우리가 그것으로 얻을 파산거야와 관련된 300 번호를 가지고 + +519 +00:36:18,429 --> 00:36:21,769 +당신은 무작위로 초기화 한 다음이 더 나은 섹스에 들어갈 백업 할 수 있습니다 + +520 +00:36:21,769 --> 00:36:25,360 +그 묻어은 주위 그냥 매개 변수를 다른 이동합니다 오른쪽 그래서 + +521 +00:36:25,360 --> 00:36:30,530 +그것에 대해 생각하는 방법은 모두를위한 하나의 홉 표현을 데입니다입니다 + +522 +00:36:30,530 --> 00:36:34,960 +단어는 당신은 거대한 W 매트릭스 곳 하나 하나가 + +523 +00:36:34,960 --> 00:36:40,130 +그 백 농장과 W 곱셈과 승 300 밖으로하지만 크기가 + +524 +00:36:40,130 --> 00:36:43,530 +효과적으로 하나가 부러 밖으로 따 버릴거야있는 뭔가 w + +525 +00:36:43,530 --> 00:36:47,560 +나는 당신이 그 마음에 들지 않는 경우 그래서 그냥 생각이 한랭 전선의 종류의 걸거야 + +526 +00:36:47,559 --> 00:36:50,279 +침대에서 단지 하나의 호퍼 프리젠 테이션으로 생각하고 수행 할 수 있습니다 + +527 +00:36:50,280 --> 00:36:58,920 +교육에 토큰 네 말에 최대 네 그것의 모델러를 그런 식으로 생각 + +528 +00:36:58,920 --> 00:37:02,769 +데이터는 우리가 예술에서 기대하는 올바른 순서는 내가 할 수있는 첫 번째 단어입니다 + +529 +00:37:02,769 --> 00:37:07,969 +기대 때문에 매일 훈련 예 일종의 특별이 + +530 +00:37:07,969 --> 00:37:10,288 +그리고 진행 토큰 + +531 +00:37:10,289 --> 00:37:28,929 +당신이 유선 수 다르게 우리는 모든 단일 상태로 연결이 밝혀 + +532 +00:37:28,929 --> 00:37:32,999 +그것은 실제로 당신이 단지에 연결하면 실제로 잘 작동 악화 때문에 작동 + +533 +00:37:32,998 --> 00:37:36,718 +시간 단계 최초의 다음 아르 논은이이 두 작업을 저글링하는 + +534 +00:37:36,719 --> 00:37:40,829 +그것은 예술과 그것을 통해 기억 할 필요가 무엇 이미지에 대한 기억 + +535 +00:37:40,829 --> 00:37:45,179 +또한 이러한 모든 의상을 생산해야하고 어떻게 든 거기에 그렇게하고 싶어 + +536 +00:37:45,179 --> 00:38:04,209 +일부는 사실 클래스 직후 나는 당신을 줄 수있는 이유를 전진 + +537 +00:38:04,208 --> 00:38:10,208 +단일 인스턴스는 이미지와 단어의 순서와 우리가 대응합니다 + +538 +00:38:10,208 --> 00:38:16,328 +여기에 그 단어를 연결 것이고, I를 우리는 이미지를 이야기하고 우리가하여야한다 + +539 +00:38:16,329 --> 00:38:22,159 +그래서 와서 당신이 모든 사람들은 바닥에 계획되지 않은 한 기차 시간 + +540 +00:38:22,159 --> 00:38:25,528 +이미지 런던과 다음이 그래프를 풀다 당신은 당신의 손실을 + +541 +00:38:25,528 --> 00:38:29,389 +당신이 조심 있다면 배경이 다음 이미지의 배치를 할 수 있으며, + +542 +00:38:29,389 --> 00:38:33,108 +그래서 당신의 이미지를 한 경우에는 때로는 서로 다른 길이의 시퀀스가 + +543 +00:38:33,108 --> 00:38:36,199 +당신이 난 것을 확인 말을해야하기 때문에 훈련 데이터는 조심해야 + +544 +00:38:36,199 --> 00:38:41,059 +아마 다음의 몇 가지를 최대 스무 단어의 배치를 처리하고자 + +545 +00:38:41,059 --> 00:38:44,499 +코드에서 당신이 알고에 그 문장이 짧거나 더 이상 필요가있을 것입니다 + +546 +00:38:44,498 --> 00:38:48,188 +일부 일부 일부 문장은 다른 사람보다 더 오래 있기 때문에 걱정 + +547 +00:38:48,188 --> 00:38:55,368 +우리는 내가 갈 물건이 너무 많은 질문이 + +548 +00:38:55,369 --> 00:39:03,450 +그 완전히 공동으로이 모든 것을 전파하도록 네 감사합니다 + +549 +00:39:03,449 --> 00:39:07,538 +훈련은 인터넷으로 기차를 미리 
할 수​​ 있도록 한 다음 그 단어를 넣어 + +550 +00:39:07,539 --> 00:39:10,190 +이하지만 당신은 공동으로 모든 훈련을 원하고 그 큰이야 + +551 +00:39:10,190 --> 00:39:15,429 +우리는 우리가 검색 기능을 알아낼 수 있기 때문에 실제로 이점 + +552 +00:39:15,429 --> 00:39:20,368 +더 좋은 말은 그래서 당신은이 훈련하는 이미지를 설명하기 위해 + +553 +00:39:20,369 --> 00:39:23,890 +실제로 우리가 인구 조사 자료에이 시도는 일반적인 욕구 중 하나를 설정합니다 + +554 +00:39:23,889 --> 00:39:27,368 +마이크로 소프트 코코라고하는 것은, 그래서 그냥 당신이처럼 보이는 무엇의 아이디어를 제공합니다 + +555 +00:39:27,369 --> 00:39:31,499 +대략 각 이미지 80 이미지와 다섯 문장의 설명이 있었다 + +556 +00:39:31,498 --> 00:39:35,288 +그래서 당신은 단지 사람들에게 아마존 기계 터크를 사용하여 얻은 것은 우리에게주세요 + +557 +00:39:35,289 --> 00:39:39,710 +문장 이미지에 대한 설명과 기록 및 데이터 세트를 종료하고 + +558 +00:39:39,710 --> 00:39:43,249 +그래서 당신은 당신이 예상 할 수있는이 모델에게 결과의 종류를 훈련 할 때 또는 + +559 +00:39:43,248 --> 00:39:49,078 +약 좀이 같은이 너무 이러한 이미지를 설명하는 우리의 무엇이다 + +560 +00:39:49,079 --> 00:39:52,329 +이 이것이 검은 셔츠 연주 기타 또는 건설 사람이다라고 말한다 + +561 +00:39:52,329 --> 00:39:55,710 +도로 또는 두 젊은 여자에 작업 오렌지 시티 웨스트에서 노동자 재생 + +562 +00:39:55,710 --> 00:40:00,528 +레고 장난감이나 소년 그건 아니에요 웨이크 보드에 물론 공중제비를하고있다 + +563 +00:40:00,528 --> 00:40:04,650 +웨이크 보드는하지만 매우 재미 실패 사례도 있습니다 가까이있는 + +564 +00:40:04,650 --> 00:40:07,680 +또한이 야구 방망이를 들고 어린 소년입니다 보여주고 싶은 + +565 +00:40:07,679 --> 00:40:12,338 +이 고양이는 여자의 원격 제어와 함께 소파에 앉아있다 + +566 +00:40:12,338 --> 00:40:15,710 +거울 앞의 테디 베어를 들고 + +567 +00:40:15,710 --> 00:40:22,400 +여기 질감은 아마 무슨 일이 것은 그것을 만든 것입니다 확신 해요 + +568 +00:40:22,400 --> 00:40:26,289 +이 테디 베어가 있다고 생각하고 마지막은 서 창녀입니다 + +569 +00:40:26,289 --> 00:40:30,409 +거리 도로의 중간 그래서 분명히 일부 확실하지 아무 말 없다 무엇 + +570 +00:40:30,409 --> 00:40:34,858 +이 나온 모델의 단지 간단한 종류 그래서 거기에 무슨 일이 있었 + +571 +00:40:34,858 --> 00:40:37,619 +작년 모델의 이러한 종류의 상단에 작업하려고 많은 사람들이 있었다 + +572 +00:40:37,619 --> 00:40:41,559 +난 그냥 당신에게 11 레벨의 아이디어를 제공하고자 그들을 더 복잡하게 + +573 +00:40:41,559 --> 00:40:44,929 +흥미로운 단지 사람들이 기본 아키텍처를 연주하는 방법에 대한 아이디어를 얻을 수 + +574 +00:40:44,929 --> 00:40:51,329 +그래서 이것은 현재 모델에서 발견 경우 지난해 종이는 우리 + +575 +00:40:51,329 --> 00:40:55,608 +단지 처음에 시간을 이미지로 한 시간을 공급 한 경우를 + +576 +00:40:55,608 --> 00:40:59,480 +이 놀 수있는 것은 실제로 다시 볼 수있는 난폭 한 재발 성 신경 네트워크입니다 + +577 +00:40:59,480 --> 00:41:03,130 +무선 않는 작동 기술 화상의 화상 및 참조 부 + +578 +00:41:03,130 --> 00:41:07,180 +당신이 허용 등이 모든 단어를 생성하는 등의 단어가 없습니다 + +579 +00:41:07,179 --> 00:41:10,460 +실제로 이미지 옆 모습을하고 다른 기능을 찾아 + +580 +00:41:10,460 --> 00:41:13,470 +그것은 다음에 설명 할 수 있습니다 당신은 실제로 완전히에서이 작업을 수행 할 수있는 작업 + +581 +00:41:13,469 --> 00:41:17,899 +그들은 단지이 말뿐만 아니라 측면을 생성하지 않도록 학습 가능한 방법 + +582 +00:41:17,900 --> 00:41:21,289 +여기서 이미지에 다음보고하는 등이 작동하는 방식 만을 수행하지 않습니다 + +583 +00:41:21,289 --> 00:41:24,259 +아웃 아르 논하지만 당신은 아마 다음 하나의 시퀀스에 대한 분배있어 + +584 +00:41:24,260 --> 00:41:29,250 +하지만 제공이 오는 당신은 발륨은 우리가 전달이 경우 말을 않는 + +585 +00:41:29,250 --> 00:41:37,389 +512 활성화 부피 (512) (14)에 의해 14를 얻었고에서 모든 및 주석 + +586 +00:41:37,389 --> 00:41:40,179 +우리는 단지 그 분포를 인정하지 않습니다하지만 당신은 또한을 방출 한 시간 + +587 +00:41:40,179 --> 00:41:44,358 +모양까지 키처럼 좀입니다 오백열둘 차원 사진 + +588 +00:41:44,358 --> 00:41:48,019 +당신은 이미지 옆에 그래서 실제로 나는이 생각하지 않습니다 찾기 위해 원하는 것을 + +589 +00:41:48,019 --> 00:41:51,210 +그들은이 특별한 종이에 무슨 짓을하지만, 이것은 당신이 연결할 수 있습니다 한 방법입니다 + +590 +00:41:51,210 --> 00:41:54,510 +이 위로이 사진을보고 뭔가는 아르 논에서 방출되는 단지 + +591 +00:41:54,510 --> 00:41:58,430 +그냥 약간의 무게와 다음이 그림은 점 수를 사용하여 예측처럼 + +592 +00:41:58,429 --> 00:42:03,618 +제품이 모든 (14) (14)에 의해 위치가 그래서 우리는 이러한 모든 점 제품을 함께 + +593 +00:42:03,619 --> 00:42:09,108 +우리는 우리가 지금 우리가 다음 우리 (14)의 호환성에 의해 기본적으로 14 계산 달성 + +594 +00:42:09,108 --> 00:42:13,949 +그것은 모두 당신의 있도록 그래서 기본적으로 우리는이 모든 것을 정상화 이것에 부드러운 최대를 넣어 + +595 +00:42:13,949 --> 00:42:17,149 +이 14 (14)에 의해, 그래서 우리는 이미지를 통해 
긴장 부르는이를 얻을 수 + +596 +00:42:17,150 --> 00:42:21,230 +아마 이미지에 지금 아르 논에 대한 흥미로운 내용을 통해지도, + +597 +00:42:21,230 --> 00:42:25,889 +우리는이와이 사람의 가중 합을 수행하라는 메시지가이 문제를 사용 + +598 +00:42:25,889 --> 00:42:27,239 +현출 + +599 +00:42:27,239 --> 00:42:30,929 +그래서 오늘 아침은 기본적으로는 어떻게 생각하는지의 신화는 현재 수 + +600 +00:42:30,929 --> 00:42:36,089 +그것에 대한 흥미가 돌아갑니다 당신은의 가중 합을하고 결국 + +601 +00:42:36,090 --> 00:42:39,850 +엘리스 팀이 시점에서보고 싶은 기능의 종류 + +602 +00:42:39,849 --> 00:42:44,809 +시간 등 섬의 생성 물건, 예를 들어 그것을 결정할 수 있습니다 + +603 +00:42:44,809 --> 00:42:49,400 +지금과 같은 객체에 대한보고 싶은 그 확인은 벡터 파일을 인정 + +604 +00:42:49,400 --> 00:42:53,220 +물건 같은 개체의 숫자는이 때의 정액과 상호 작용 + +605 +00:42:53,219 --> 00:42:57,379 +위원회 주석 어쩌면 그 지역 같은 개체의 일부는 오는 + +606 +00:42:57,380 --> 00:43:01,700 +점등 및 천장처럼 떨어지는 정품 인증에서이지도를 참조 + +607 +00:43:01,699 --> 00:43:05,949 +4514 화나게하고 당신은 그 부분에 관심을 집중 결국 + +608 +00:43:05,949 --> 00:43:10,059 +이 상호 작용을 통해 그래서 당신은 기본적으로 그냥 할 수있는 조회 이미지 + +609 +00:43:10,059 --> 00:43:14,130 +이미지에 당신은 문장을 설명하고 그래서이 뭔가 우리 동안 + +610 +00:43:14,130 --> 00:43:17,360 +부드러운 구금으로 참조 실제로 몇 강연이가는 것 + +611 +00:43:17,360 --> 00:43:21,050 +그래서 우리는 군대가 실제로하지 않은 수있는이 같은 일을 다루려고 + +612 +00:43:21,050 --> 00:43:26,880 +선택적 입력을 처리하는 등의 수입을 통해 관심과 그 그래서 I + +613 +00:43:26,880 --> 00:43:30,030 +그냥 당신에게 그 무엇의 미리보기를 제공하기 위해 약 한 시간 그것을 가지고 싶어 + +614 +00:43:30,030 --> 00:43:34,490 +우리가 중 한 가지 방법으로 우리의 삶을 더 복잡하게하려면 이제 괜찮아 보이는 + +615 +00:43:34,489 --> 00:43:39,259 +이 당신을 제공합니다, 그래서 우리가 그 층을 쌓아하는 것입니다 할 수있는 당신이 더 많은 것을 알고 + +616 +00:43:39,260 --> 00:43:43,570 +깊은 물건은 일반적으로 더 나은 우리가에 가지 방법 중 하나를이를 시작하는 방법을 작동 + +617 +00:43:43,570 --> 00:43:46,809 +적어도 당신은 재발 성 신경 네트워크를 쌓을 수 많은 방법이있다 그러나이 + +618 +00:43:46,809 --> 00:43:49,409 +사람들이 당신이 할 수 실제로 사용하는 것이 바로 그 중 하나입니다 + +619 +00:43:49,409 --> 00:43:53,339 +똑바로 그냥 서로 그렇게 한 아르 논에 대한 자극이에 하네스를 연결 + +620 +00:43:53,340 --> 00:43:59,170 +우리가 이전에 주 사진의 디렉터 등이 이미지 + +621 +00:43:59,170 --> 00:44:02,750 +시간 축이 수평으로 이동 한 다음 우리가 다른이 위쪽으로가는 + +622 +00:44:02,750 --> 00:44:05,960 +이 특정 이미지의 의식 등 세 가지 별도의 재발이 있습니다 + +623 +00:44:05,960 --> 00:44:09,858 +신경 네트워크는 무게의 자신의 세트와 각각이 대령이다 그 + +624 +00:44:09,858 --> 00:44:16,299 +난 그냥 서로 먹이를하지 그래서이 항상 공동으로 더 거기에 훈련되어 작동합니다 + +625 +00:44:16,300 --> 00:44:19,119 +기차는 먼저 모든 단지 하나의 경쟁 성장의 두 번째 임기 하나 원 + +626 +00:44:19,119 --> 00:44:22,700 +배경으로는 상단이 재발 식을 통해 얻을 수 + +627 +00:44:22,699 --> 00:44:25,980 +상아 영국은 여전히​​ 우리는 여전히있어 더 일반적인 규칙을 만들 가능성이 높습니다 + +628 +00:44:25,980 --> 00:44:29,280 +똑같은 일을하면 우리는 우리가 복용하고있는 같은 공식을하지 않았다된다 + +629 +00:44:29,280 --> 00:44:35,390 +우린 시간 전에에서 아래 아래 깊이와 효과에서 강의 + +630 +00:44:35,389 --> 00:44:39,469 +를 절단하고 퍼팅이 w 변환과를 통해 지원 + +631 +00:44:39,469 --> 00:44:40,519 +스매싱 10 각 + +632 +00:44:40,519 --> 00:44:44,509 +당신이 이것에 대해 약간 혼란스러워하는 경우에 당신이 기억한다면, 그래서 거기있다 + +633 +00:44:44,510 --> 00:44:51,760 +WRX H 시간의 X 플러스 당신이 다시 작성할 수 있습니다 whah 시간의 H는 엑손의 연결입니다 + +634 +00:44:51,760 --> 00:44:56,260 +하나의 행렬 곱 H 바로 그래서 난에 침을 국가 스틱 것처럼 + +635 +00:44:56,260 --> 00:45:03,680 +기본적으로 무슨 일이 끝나는 다음 하나의 열 벡터와 나는이 w 행렬이 + +636 +00:45:03,679 --> 00:45:07,690 +최대 일어나고 당신의 WX 연령이 행렬과 WH의 첫 번째 부분 + +637 +00:45:07,690 --> 00:45:12,700 +미국에서 두 번째로 당신의 매트릭스의 일부 등 식의이 종류는 기록 될 수있다 + +638 +00:45:12,699 --> 00:45:16,099 +식으로 당신은 당신의 입력을 쌓아 단일 W가 어디 + +639 +00:45:16,099 --> 00:45:24,759 +변환은 같은 식 있도록 그래서 우리가이는 중지 할 수 있습니다 방법 + +640 +00:45:24,760 --> 00:45:29,780 +두 시간 색인되는 이후로 지금 다음이 발표하고 + +641 +00:45:29,780 --> 00:45:33,510 +우리는 또한이 더 복잡한이 적층 공유되지 수 있습니다 지금은 한 방향으로 발생 + +642 +00:45:33,510 --> 00:45:37,030 +그들을 실제로 그렇게 지금 약간 더 반복 공식을 사용하여 + +643 +00:45:37,030 --> 00:45:40,300 +지금까지 우리는 복귀에 대한 매우 간단한 재발 
수식으로 보았다 + +644 +00:45:40,300 --> 00:45:44,480 +실제로 작품은 실제로 거의 지금과 같은 공식을 사용하고 + +645 +00:45:44,480 --> 00:45:48,170 +기본 네트워크는 매우 드물게 우리가 그것에게 부르는 사용합니다 대신 사용되지 않습니다 + +646 +00:45:48,170 --> 00:45:52,059 +LSD와 오랜 단기 기억은 그래서 이것은 기본적으로 모든 서류에 사용된다 + +647 +00:45:52,059 --> 00:45:56,500 +지금이 공식은 당신이 인 경우도 프로젝트를 사용하는 것입니다 + +648 +00:45:56,500 --> 00:46:00,989 +사용이 현재 작동하지만 나는이 시점에서 주목하고 싶은 모든입니다 + +649 +00:46:00,989 --> 00:46:04,729 +동일은 알렌과 마찬가지로이 재발 수식은이 단지의 + +650 +00:46:04,730 --> 00:46:09,050 +약간 더 복잡한 기능을 확인 우리는 여전히 낮은에서 사진을 촬영하고 + +651 +00:46:09,050 --> 00:46:13,789 +그리고 이전의 시간에 입력 같은 깊이 이전 재산이었다 + +652 +00:46:13,789 --> 00:46:18,309 +연락 그들 앗 전송을 통해 이르렀 그러나 지금 우리는이 더이 + +653 +00:46:18,309 --> 00:46:21,869 +복잡성과 방법을 우리가 실제로이 지점에서 뉴 헤이븐 상태를 달성 + +654 +00:46:21,869 --> 00:46:25,539 +시간은 그래서 우리는 단지 약간 더 복잡한되고있어 방법에서 북한 이탈 주민을 결합 + +655 +00:46:25,539 --> 00:46:28,900 +아래 실제로 단지 더 상태를 제목에 업데이트를 수행하기 전에 + +656 +00:46:28,900 --> 00:46:33,050 +이 동기를 부여 정확히 복잡한 공식은 그래서 우리는 몇 가지 세부 사항에 갈거야 + +657 +00:46:33,050 --> 00:46:41,609 +공식 이유는 실제로 오스틴에서 사용할 수있는 더 좋은 생각이 될 수 있습니다 + +658 +00:46:41,608 --> 00:46:49,909 +그리고 우리가 지금 당장 그것을 통해 갈거야 의미가 나를 신뢰하게 그렇다면 당신 + +659 +00:46:49,909 --> 00:46:56,480 +오후 4시 일부 온라인 비디오를 차단하거나 Google 이미지는 다이어그램을 찾을 수 있습니다로 이동 + +660 +00:46:56,480 --> 00:47:00,989 +정말 도움이되지 않는이처럼 사람에게 내가 그를 처음봤을 때 생각 + +661 +00:47:00,989 --> 00:47:04,048 +이 사람이 정말 그가 무슨 일이 일어나고 있는지 정말 확신했다 겁처럼 정말 무서워되고 + +662 +00:47:04,048 --> 00:47:08,170 +나는 엘리스 팀을 이해하고 난 여전히이 두 다이어그램이 무엇인지 모르는에 + +663 +00:47:08,170 --> 00:47:14,289 +나는 목록을 파괴하려고하는거야하고 ​​까다로운 물건의 종류, 그래서 그렇게 확인 + +664 +00:47:14,289 --> 00:47:18,329 +그것을 통해 단계의 종류 당신이 정말로이 도면에 강의 있도록 넣어 + +665 +00:47:18,329 --> 00:47:24,220 +형식은 우리가 미국의 방정식이 있고 난 그래서 여기에없는 스팀 확인을 위해 완벽하다 + +666 +00:47:24,219 --> 00:47:28,238 +우리는이 두 벡터를 가지고 위치를 상단에 여기에 첫 번째 부분에 초점을 맞출 것 + +667 +00:47:28,239 --> 00:47:32,720 +아래로부터의 상태에서 이렇게 X와 HHS 이전 전에 사고 있지만, + +668 +00:47:32,719 --> 00:47:37,848 +우리는 변환 W를 통해 지금 모두 잭슨 href가 크기 경우를 만났다 + +669 +00:47:37,849 --> 00:47:40,950 +그래서 우리는 어떤을 위해 생산 끝날거야 숫자를 보낼있다 + +670 +00:47:40,949 --> 00:47:46,068 +(21)에 의해 제시되었다이 w 매트릭스를 통해 확인 번호는 그래서 우리는 이러한이 + +671 +00:47:46,068 --> 00:47:51,108 +그들이 입력 짧은 것 OMG 경우 사 및 차원 벡터 나가뿐만 + +672 +00:47:51,108 --> 00:47:57,328 +그리고 G는 나는 당신과 그렇게 ISI없이 신호를 통과 단지를 무엇 확실하지 않다 + +673 +00:47:57,329 --> 00:48:05,859 +게이트 및 G는 방법에게 지금이 실제로 작동이 길을 똑바로 세입자 게이트로 이동 + +674 +00:48:05,858 --> 00:48:09,420 +그것에 대해 생각하는 가장 좋은 방법은 내가 깜빡 한 가지가 실제로 언급하는 것입니다 + +675 +00:48:09,420 --> 00:48:15,028 +이전 슬라이드는 일반적으로 하나의 HVAC 시도 말합니다 할 네트워크를 필요로하지 않습니다 + +676 +00:48:15,028 --> 00:48:18,018 +매번 중지하고 그에게 물었다 실제로 두 벡터 모든이 + +677 +00:48:18,018 --> 00:48:23,618 +한 시간 때문에 우리는 세포 상태 벡터를 참조 전화를 매도록 + +678 +00:48:23,619 --> 00:48:29,470 +시간 단계는 우리가 위험에 두 기관이 있고 그리고에서와 같이 여기 벡터를 참조하십시오 + +679 +00:48:29,469 --> 00:48:33,558 +노란색 그래서 우리는 기본적으로 두 벡터 여기 공간에있는 모든 단일 지점을 가지고 + +680 +00:48:33,559 --> 00:48:37,849 +그들이하는 일은 그들이 기본적 그래서이 셀 상태에서 작동하고있다 + +681 +00:48:37,849 --> 00:48:41,680 +전에 당신 아래의 내용에 따라 해당 사용자 컨텍스트 당신은 결국 + +682 +00:48:41,679 --> 00:48:45,199 +이들과 함께 세포 상태에서 작동 + +683 +00:48:45,199 --> 00:48:50,509 +그리고 옹 요소와 그것에 대해 생각하는 새로운 방법 내가 통해 갈거야된다 + +684 +00:48:50,510 --> 00:48:58,290 +이 0 또는 1 우리가 원하는 I NO처럼 이진 않습니다에 대해이 방법을 많이 생각합니다 + +685 +00:48:58,289 --> 00:49:01,199 +그들에게 우리가 그들을 게이트의 해석이 생각하고 싶다 갖고 싶어 할 수 + +686 +00:49:01,199 --> 00:49:05,449 +영웅이 그들이다 그것의로 우리는 물론 우리가 원하기 때문에 그들에게 이상 신호를 만들 + +687 +00:49:05,449 --> 00:49:08,348 +우리는하지만, 모든 것을 통해 전파 백업 할 수 있도록이 미분 될 수 있습니다 + +688 +00:49:08,349 --> 00:49:11,960 +우리의 상황에 기반을 계산 한 
바로 진 것들로 이노 생각 + +689 +00:49:11,960 --> 00:49:17,740 +항상 여기서 뭘에서이 참조 다음 당신은 무엇을 기준으로 그를 볼 수있는 + +690 +00:49:17,739 --> 00:49:22,250 +이 문은 다음과 디아즈 우리는이 페이지의 값을 데이트 끝날거야 무슨 + +691 +00:49:22,250 --> 00:49:29,289 +특히이 에피소드는 TUS을 종료하는 데 사용됩니다 게이트를 잊지 + +692 +00:49:29,289 --> 00:49:34,869 +(20) 태양 전지 등의 보호소 가장 생각되는 세포들을 재설정 + +693 +00:49:34,869 --> 00:49:38,700 +우리와 함께 (20)이 상호 작용보다 기본적으로 우리가 할 수있는 하나 최근 이러한 카운터 + +694 +00:49:38,699 --> 00:49:42,368 +이것은 자신의 레이저 포인터가 부족합니다 곱셈의 요소입니다 + +695 +00:49:42,369 --> 00:49:45,530 +배터리 때문에 + +696 +00:49:45,530 --> 00:49:50,140 +상호 작용 0 당신은 우리가를 재설정 할 수 있도록 그 셀을 제로 것이다 볼 수 있습니다 + +697 +00:49:50,139 --> 00:49:53,969 +카운터 그리고 우리는 또한 우리는이를 통해 추가 할 수있는 카운터에 추가 할 수 있습니다 + +698 +00:49:53,969 --> 00:50:00,459 +상호 작용 I 번 G와 11 사이와 G는 부정적 일 사이이기 때문에 + +699 +00:50:00,460 --> 00:50:05,900 +(10)에 기본적으로 한 12 매 있도록 모든 세포 사이의 숫자를 추가 + +700 +00:50:05,900 --> 00:50:09,338 +우리는이를 재설정 할 수있는 모든 세포에서 이러한 카운터를 하나의 시간 단계 + +701 +00:50:09,338 --> 00:50:13,588 +국가 2012 케이트를 잊어 버렸거나 우리는 하나 사이의 숫자를 추가 할 수 있습니다 + +702 +00:50:13,588 --> 00:50:18,039 +12 그래서 확인을 하나 하나 셀은 우리가 다음 셀 업데이트 및 수행 방법 + +703 +00:50:18,039 --> 00:50:24,029 +업데이트가 찌그러 세포 그렇게 10 HFC는 셀을 숙청되고 끝 머리 + +704 +00:50:24,030 --> 00:50:28,760 +그렇게 만 셀 상태의 일부와 위로로 누출이 업데이트에 의해 변조 + +705 +00:50:28,760 --> 00:50:33,500 +숨겨진 상태가이 벡터에 의해 변조 오 그래서 우리는 단지의 일부를 공개 선택 + +706 +00:50:33,500 --> 00:50:39,530 +암탉 상태와 학습 가능 방법으로 세포는 몇 가지가있다 + +707 +00:50:39,530 --> 00:50:43,910 +에 하이라이트의 종류 여기에 아마 여기에 가장 혼란스러운 부분에 우리가 걸이다 + +708 +00:50:43,909 --> 00:50:47,500 +여기에 D I 배 하나 하나 사이의 숫자를 추가하지만 가지의 + +709 +00:50:47,500 --> 00:50:51,809 +우리는 단지 거기 G가 있다면 대신 다음 이미 사이에 이름 : Jeez 때문에 혼란 + +710 +00:50:51,809 --> 00:50:56,679 +8 11 왜 우리는 내가 여러 번 G 무엇을하지 실제로 우리가 제공하는 모든 필요합니까 우리 + +711 +00:50:56,679 --> 00:50:58,279 +원하는에 의해 바다를 구현하는 것입니다 + +712 +00:50:58,280 --> 00:51:02,330 +하나 하나 사이의 숫자는 그래서는 대한 내 성 부품의 종류의 + +713 +00:51:02,329 --> 00:51:08,989 +마지막으로 내가 한 대답은 당신이 G에 대해 생각하면 그것의 기능 있다고 생각합니다 + +714 +00:51:08,989 --> 00:51:16,159 +당신의 문맥의 선형 함수는 하나의 기회가 오른쪽으로 레이저 프린터가 없습니다 + +715 +00:51:16,159 --> 00:51:26,649 +확인 그래서 G는 G 그래서 확인을 지역 310 세의 함수의 선형 함수로 + +716 +00:51:26,650 --> 00:51:30,579 +우리가 청바지를 추가 한 경우 10 시간 등에 의해 숙청 이전에 접촉하는 경우 + +717 +00:51:30,579 --> 00:51:35,349 +추가하여, 그래서 나는 시간 그녀는 그 종류의 매우 간단한 함수 같은 것 + +718 +00:51:35,349 --> 00:51:38,929 +이 난 후 실제로 더 있어요 곱셈 상호 작용을 갖는 + +719 +00:51:38,929 --> 00:51:42,710 +실제로 우리가 추가하는 것을 표현 할 수 있습니다 풍부한 기능 + +720 +00:51:42,710 --> 00:51:47,010 +이전 테스트의 기능을 생각하는 또 다른 방법으로 상태를 몸통 + +721 +00:51:47,010 --> 00:51:50,620 +이 약이 기본적으로 방법이 두 개념을 분리하는 것 + +722 +00:51:50,619 --> 00:51:54,159 +많은 우리가 G 인 셀 상태로 추가 싶어하고 우리가 원하는 수행 + +723 +00:51:54,159 --> 00:51:58,129 +나는 우리가 실제로 무엇을이 조작 가능성이 있으므로 모든 상태를 해결 + +724 +00:51:58,130 --> 00:52:03,280 +또한 될 수 있음이 두 디커플링에 의해 통해 천재 우리가 원하는 이동 + +725 +00:52:03,280 --> 00:52:08,470 +동적 측면에서 몇 가지 좋은 특성을 가지고 어떻게이 모든 증기 기차하지만, + +726 +00:52:08,469 --> 00:52:12,039 +우리는 단지 그 오스틴 공식처럼 결국 나는 실제로 갈거야 + +727 +00:52:12,039 --> 00:52:14,059 +자세한 세부 사항에서이뿐만 아니라 통해 + +728 +00:52:14,059 --> 00:52:21,400 +확인 상기 제 1 상호 작용 이제 셀 C가 흐르는으로 이것에 대해 생각하고 + +729 +00:52:21,400 --> 00:52:28,269 +여기 그래서 경제적으로 그 시그 모이 약간의 DOTC 그렇게 노력하다 + +730 +00:52:28,269 --> 00:52:32,559 +곱셈의 상호 작용으로 자신을 게이팅 F 제로는 것입니다 그래서 만약 + +731 +00:52:32,559 --> 00:52:38,409 +셀을 차단하고 세포학 부분이 기본적으로 제공되는 카운터를 재설정 + +732 +00:52:38,409 --> 00:52:44,799 +당신은 완은 기본적으로 하위 상태 누수가 유일한 상태로 추가하고있다 + +733 +00:52:44,800 --> 00:52:51,100 +언덕 상태로하지만 너무 의해 문이 가도록 한 후 10 시간 통해 + +734 +00:52:51,099 --> 00:52:55,380 +전기 만 결정 사실로 밝혀 몇 가지 상태에있는 부품 + 
+735 +00:52:55,380 --> 00:52:59,610 +매각하지 않았다 숨겨진 그리고 당신은 알 수가이 고속도로뿐만 아니라, + +736 +00:52:59,610 --> 00:53:03,720 +STM의 다음 반복으로 이동뿐만 아니라 실제로까지 폐쇄 + +737 +00:53:03,719 --> 00:53:07,159 +상위 계층이 우리가 실제로 종료 상태 교리의 머리이기 때문에 + +738 +00:53:07,159 --> 00:53:11,250 +까지 우리 위에 팀으로보고하거나이 예측에 간다 + +739 +00:53:11,250 --> 00:53:14,510 +이 기본적으로 방법을 풀다 때 그래서 그것이 가지처럼 보이는 + +740 +00:53:14,510 --> 00:53:19,270 +지금은 내 자신 그게 전부의 혼란도를 가지고있는이 나는 우리가 끝난 것 같아요 + +741 +00:53:19,269 --> 00:53:24,550 +그러나 아래에서 입력 벡터를 얻을 수와 최대 당신은 당신의 자신의 상태에서이 + +742 +00:53:24,550 --> 00:53:26,090 +(248) + +743 +00:53:26,090 --> 00:53:31,030 +그들은 다음 차원 벡터 및 모든 거 알아 fije 네 성문을 결정 + +744 +00:53:31,030 --> 00:53:35,110 +는 셀 상태에서 동작하고, 셀의 상태가 변조 방법을 종료 + +745 +00:53:35,110 --> 00:53:38,610 +당신이 한 번 실제로 우리는 일부 국가를 설정하고 하나 사이에 번호를 추가하면 + +746 +00:53:38,610 --> 00:53:42,630 +(12) 국가의 셀 상태는 그것의 일부는 학습 가능에서 누수 밖으로 누출 + +747 +00:53:42,630 --> 00:53:45,840 +방법 및 다음 중 하나를 예측까지 갈 수 또는 다음에 갈 수 있습니다 + +748 +00:53:45,840 --> 00:53:52,269 +미국 팀의 반복은 향후 그래서 그게 그렇게이 그렇게 추한 모습입니다 + +749 +00:53:52,269 --> 00:53:58,429 +문제는 당신의 마음에 아마 그래서 우리는 거 야 우리가 간다 않은 이유입니다 + +750 +00:53:58,429 --> 00:54:02,649 +이 특별한 방법 I에서이 Look을 수행하는 이유가 뭔가의 모든 통해 + +751 +00:54:02,650 --> 00:54:05,639 +알고 싶어한다 분석가 많은 다양한 있다는 것을이 시점이 + +752 +00:54:05,639 --> 00:54:09,309 +이 시점하지만 강의 사람들의 말은 이런 식으로 많이 연주 + +753 +00:54:09,309 --> 00:54:12,840 +우리는 종류의 합리적인 것 같은 것으로이에 수렴했지만 + +754 +00:54:12,840 --> 00:54:15,510 +당신이 실제로하지 않는이에 수 많은 작은 비틀기가있다 + +755 +00:54:15,510 --> 00:54:18,930 +당신 같은 사람들 게이트의 일부를 제거 할 수 있습니다 많은하여 성능을 저하 + +756 +00:54:18,929 --> 00:54:20,359 +아마 연루 등 + +757 +00:54:20,360 --> 00:54:25,200 +당신은 할 수의 악취가이 바다가 될 수 볼 밝혀 그것을 잘 작동합니다 + +758 +00:54:25,199 --> 00:54:28,619 +일반적으로하지만 좌석의 어린 나이로 때로는 약간 더 있었다 I + +759 +00:54:28,619 --> 00:54:33,869 +우리는 CSI가의 비트와 함께 결국 왜를위한 아주 좋은 이유가 생각하지 않습니다 + +760 +00:54:33,869 --> 00:54:37,039 +괴물하지만 실제로 좀 법무부 카운터의 측면에서 의미가 생각 + +761 +00:54:37,039 --> 00:54:40,739 +그 0으로 재설정 할 수 있습니다 또는 당신은 하나 (12)을 사이에 작은 숫자를 추가 할 수 있습니다 + +762 +00:54:40,739 --> 00:54:46,039 +지금은 좋은 실제로 비교적 단순한 이해하는 것처럼 그렇게는 가지이다 + +763 +00:54:46,039 --> 00:54:49,300 +이것은 우리 자신보다 훨씬 더 그리고 우리는 약간에 가야 정확하게 이유 + +764 +00:54:49,300 --> 00:54:55,330 +다른 그림은 재발 성 신경 있도록 구별을 그립니다 + +765 +00:54:55,329 --> 00:54:59,259 +어떤 상태 벡터 권리가 네트워크 당신은 그것을 통해 운영하고 있고이있어 + +766 +00:54:59,260 --> 00:55:02,260 +완전히이 재발 식을 통해로 변신 그래서 당신은 종료 + +767 +00:55:02,260 --> 00:55:06,280 +시간 물건 시간에서 상태 벡터를 변경까지 당신은 미국 것을 알 수 있습니다 + +768 +00:55:06,280 --> 00:55:11,140 +팀 대신 셀 미국이 흐르는 우리가 효과적으로 무슨 일을하고있다 + +769 +00:55:11,139 --> 00:55:15,250 +우리는 세포에서 찾고 그것의 일부는 국가의 머리에 누수로 + +770 +00:55:15,250 --> 00:55:19,329 +우리가 이득을 다음 잊어 버린 경우 셀에서 동작하는 방법을 결정하는 상태 + +771 +00:55:19,329 --> 00:55:22,869 +기본적으로 그냥하여 셀을 조정 끝 + +772 +00:55:22,869 --> 00:55:28,509 +함수로 쳐다 보면서 몇 가지 물건이 그래서 그래서 여기 활성 상호 작용 + +773 +00:55:28,510 --> 00:55:33,040 +우리는 영혼의 상태를 변경 결국 그것이 무엇이든 셀 상태의 다음 + +774 +00:55:33,039 --> 00:55:37,190 +대신 바로이 첨가제는 대신, 그래서 그것을 변환의 + +775 +00:55:37,190 --> 00:55:38,429 +변형 + +776 +00:55:38,429 --> 00:55:42,929 +그런 상호 작용이나 뭐 이제이 실제로 뭔가 당신을 생각 나게한다 + +777 +00:55:42,929 --> 00:55:48,839 +우리가 이미 염두에두고 클래스에 적용되었음을 그, 그래 맞아 + +778 +00:55:48,840 --> 00:55:53,240 +그래서이 같은 사실은 고체와 같은 일이 이렇게 기본적으로 직렬 공진입니다 + +779 +00:55:53,239 --> 00:55:56,299 +일반적으로 우리가 표현 거주자가 변화하고 진정으로 + +780 +00:55:56,300 --> 00:56:00,019 +여기에이 스킵 연결 및 당신은 기본적으로 주민들이를 볼 수 있습니다 + +781 +00:56:00,019 --> 00:56:04,690 +우리가 지금 여기이 X이 때문에 첨가제의 상호 작용 우리는 약간의 계산에 기초 않는다 + +782 +00:56:04,690 --> 00:56:10,240 +다음 섹스 그리고 우리는 행위와 첨가제의 상호 작용을 가지고 있고 그래서는이다 + +783 
+00:56:10,239 --> 00:56:12,959 +같은 멋진로 발생하는 기본 주민들의 블록과 그 사실의 + +784 +00:56:12,960 --> 00:56:18,440 +물론 우리는 우리가 여기있어 이러한 상호 작용을 가지고 전은 세포이며, 우리가 간다 + +785 +00:56:18,440 --> 00:56:22,619 +다음 몇 가지 기능은 당신과 떨어져 우리는이 세포 상태 만에 추가 할 수 + +786 +00:56:22,619 --> 00:56:26,900 +LSD와는 달리 주민들은 또한 추가 된 날짜를 잊지하시기 바랍니다있다 + +787 +00:56:26,900 --> 00:56:31,519 +이뿐만 아니라 신호의 일부를 차단하도록 선택할 경우 제어를 잊지 있지만, + +788 +00:56:31,519 --> 00:56:33,679 +그렇지 않으면 나는 그것이 가지 생각 때문에 대통령처럼 매우 보인다 + +789 +00:56:33,679 --> 00:56:36,710 +보고 아키텍처와 매우 유사 종류에 수렴하고 그 재미 + +790 +00:56:36,710 --> 00:56:40,429 +보인다 곳은 재발 성 신경 네트워크에서 끝의 두 소득을 작동 + +791 +00:56:40,429 --> 00:56:43,809 +같은 동적으로 어떻게 든 실제로 이러한 첨가제를 가지고 훨씬 좋네요이다 + +792 +00:56:43,809 --> 00:56:48,739 +당신이 실제로 훨씬 더 효과적으로 그렇게 전파 할 수 있도록 상호 작용 + +793 +00:56:48,739 --> 00:56:49,779 +그 시점에 + +794 +00:56:49,780 --> 00:56:53,860 +분석 팀 사이의 뒷면 전파 역학에 대해 생각 + +795 +00:56:53,860 --> 00:56:57,760 +특히 미국 팀에 좀 그라디언트를 주입하면 매우 명확하고 + +796 +00:56:57,760 --> 00:57:01,120 +가끔 내가 생기를 주입하고이 그림의 끝을 보자, 그래서 만약 여기에 + +797 +00:57:01,119 --> 00:57:05,239 +다음이 플러스 상호 작용은 바로 여기 그냥 재료 고속도로처럼 + +798 +00:57:05,239 --> 00:57:09,299 +이 동영상은 모든 탭 추가 상호 작용 오른쪽으로 흐르는 것 같은 + +799 +00:57:09,300 --> 00:57:13,240 +내가 그라데이션 시간의 어느 지점을 연결하는 경우 버전은 동일하므로 분산 때문에 + +800 +00:57:13,239 --> 00:57:16,849 +여기에 단지 물론 그라데이션도 다시 모든 방법을 날려 가고 + +801 +00:57:16,849 --> 00:57:20,809 +이러한 행위를 통해 흘러 그들이에 자신의 재료를 기여 결국 + +802 +00:57:20,809 --> 00:57:25,630 +독서 흐름합니다하지만 당신은 우리가 우리의 강렬한으로 참조 무​​엇으로 끝낼 수 없을거야 + +803 +00:57:25,630 --> 00:57:30,110 +이 그라디언트 그냥 제로로 이동을 사망 어디에 문제가 지역 사라지는라고 + +804 +00:57:30,110 --> 00:57:32,880 +당신은 다시 통해 전파 내가 예를 보여 드리겠습니다로 + +805 +00:57:32,880 --> 00:57:36,640 +완전히이 조금 수중 음파 탐지기에서 발생하는 이유 떨어져 지금 우리는이 배니싱이 + +806 +00:57:36,639 --> 00:57:40,670 +나는 당신을 보여줄 것 그라데이션 문제는 이유는이 때문에 애널리스트 오전 발생 + +807 +00:57:40,670 --> 00:57:45,210 +그냥 판의 고속도로 매 시간 단계의 이러한 구배가 + +808 +00:57:45,210 --> 00:57:47,130 +우리는 위의 미국 팀에 주입 + +809 +00:57:47,130 --> 00:57:54,829 +그냥 세포를 통과하고 등급이에서 마무리 결국하지 않습니다 + +810 +00:57:54,829 --> 00:57:57,339 +어쩌면 내가 몇 가지 질문을 가리 혼란 기능에 대한 질문이 있습니다 + +811 +00:57:57,338 --> 00:58:01,849 +여기하지만 마지막으로 한 다음 그 후 나는 arnaz가에 있었던 이유에 갈거야 + +812 +00:58:01,849 --> 00:58:03,059 +그린 즈 버러 + +813 +00:58:03,059 --> 00:58:09,789 +예 000 벡터가 중요한 것입니다 + +814 +00:58:09,789 --> 00:58:13,400 +내가 하나가 특별히 매우 중요 아니라고 생각 밝혀 + +815 +00:58:13,400 --> 00:58:16,660 +나는 스페이스 오디세이 그들이 대답 할 다른 무엇을 보여 드리겠습니다 종이가있다 + +816 +00:58:16,659 --> 00:58:21,719 +정말 거기에이 걸릴 물건 아웃하지만 물건을 연주 또한 같은있다 + +817 +00:58:21,719 --> 00:58:25,588 +당신이 그렇게이 셀 상태가 여기에있을 수 추가 할 수 있습니다 사람들의 연결 + +818 +00:58:25,588 --> 00:58:29,538 +사람들이 정말 재생할 수 있도록 실제로 입력으로 더 나은 숨겨진 상태에 넣어 + +819 +00:58:29,539 --> 00:58:32,049 +이 아키텍처 그들은 바로 이러한 반복을 많이 시도 + +820 +00:58:32,048 --> 00:58:37,230 +방정식과 거의 모든 약 동일한 일부 작동 당신이 우리와 끝까지 + +821 +00:58:37,230 --> 00:58:40,490 +그것을 우리는 약간은 매우 가지 혼란이있는, 그래서 때로는 있었다있어 + +822 +00:58:40,489 --> 00:58:45,699 +그들은했다 어디 용지를 표시하려면이 방법은 그들이 DS 업데이트를 처리 + +823 +00:58:45,699 --> 00:58:49,538 +방정식은 업데이트 방정식을 통해 나무를 내장하고있다 그리고 그들은했다 + +824 +00:58:49,539 --> 00:58:52,950 +이 같은 무작위 돌연변이 물건과 서로 다른 잔디의 모든 종류의 시도 + +825 +00:58:52,949 --> 00:58:57,028 +사용자가 업데이트 할 수 그들 대부분은 그들 중 일부의 일부를 파괴에 대해 작동 + +826 +00:58:57,028 --> 00:58:59,858 +정말보다 훨씬 더 않습니다처럼은 동일하지만 아무것도에 대한 작업 + +827 +00:58:59,858 --> 00:59:08,150 +분석 팀과 질문 재발 성 신경 네트워크가 왜 가고있다 + +828 +00:59:08,150 --> 00:59:15,389 +또한 끔찍한 역류 비디오 + +829 +00:59:15,389 --> 00:59:22,000 +와 재발 성 신경 네트워크에서 사라지는 그라데이션 문제를 보여주는 + +830 +00:59:22,000 --> 00:59:29,250 +모두에 대해 우리가 재발보고있는 것처럼 우리가 여기에 표시하고 줄기 + +831 +00:59:29,250 
--> 00:59:33,039 +많은 기간 많은 시간 단계에 걸쳐 신경망 다음 주입 그라데이션 + +832 +00:59:33,039 --> 00:59:36,760 +그것은 백 스물여덟번째 시간 단계의 말을 우리는 파산하고 + +833 +00:59:36,760 --> 00:59:40,028 +네트워크를 통해 재료와 우리는 그라데이션이 무엇인지보고있는 + +834 +00:59:40,028 --> 00:59:44,699 +용 나는 체중의 입력 타입 숨겨진 매트릭스 하나에 모든 행렬 생각 + +835 +00:59:44,699 --> 00:59:49,009 +한 시간 간격 때문에 실제로 통해 전체 업데이트를 얻기 위해 그 기억 + +836 +00:59:49,010 --> 00:59:52,289 +다시 우리가 실제로 여기에 모든 그라디언트를 추가하고 그래서 무엇 무엇이다 + +837 +00:59:52,289 --> 00:59:56,760 +어떻게 여기에 표시되는 것은 배경으로 우리는 단지에서 성분을 주입하는 것입니다 + +838 +00:59:56,760 --> 01:00:00,799 +우리가 시간과 강한 조각을 통해 배경을 120 시간 단계 + +839 +01:00:00,798 --> 01:00:04,088 +그 전파의 당신이보고있는 것은 미국 팀이 당신을 많이 준다이다 + +840 +01:00:04,088 --> 01:00:06,699 +많이있다, 그래서이 역 전파에 걸쳐 그라데이션 + +841 +01:00:06,699 --> 01:00:11,000 +단지 바로이 기술을 통해 흐르는되는 정보는 전원 사망 + +842 +01:00:11,000 --> 01:00:15,210 +그냥 욕심 우리는 추방은 그냥 아무 거기에 작은 숫자가된다라고 + +843 +01:00:15,210 --> 01:00:18,750 +내가 단계 그렇게되는 시간에 대해 표시를 생각이 경우 너무 그라데이션 + +844 +01:00:18,750 --> 01:00:22,679 +우리가하지 않았다 주입 모든 정보와 10 배 단계 등 + +845 +01:00:22,679 --> 01:00:26,149 +네트워크를 통해 흘러 모든 때문에 매우 긴 종속성을 배울 수 있습니다 + +846 +01:00:26,150 --> 01:00:29,720 +우리가 왜이 볼 수 있도록 상관 관계 구조는 아래가 사망 한 + +847 +01:00:29,719 --> 01:00:39,399 +조금 동적으로 발생이 채널이 너무 재미 그가처럼 몇 가지 코멘트 + +848 +01:00:39,400 --> 01:00:40,490 +YouTube 또는 뭔가 + +849 +01:00:40,489 --> 01:00:44,779 +그래 + +850 +01:00:44,780 --> 01:00:53,170 +확인 그래서 우리가 재발 성 신경 네트워크가 여기 아주 간단한 예를 살펴 보자 + +851 +01:00:53,170 --> 01:00:56,300 +내가 보여주는 아니에요이 재발 성 신경 네트워크에 당신을 위해 전개거야 것을 + +852 +01:00:56,300 --> 01:01:03,960 +우리가있어 모든 입력은 자신의 상태 업데이트가 너무 whaaa 교회와 대기 상태가 + +853 +01:01:03,960 --> 01:01:07,260 +상호 작용을 칠 숨겨진 나는 기본적으로 재발을 전달하려고 해요 + +854 +01:01:07,260 --> 01:01:12,380 +신경망 때문에 T-오십를 사용하고 여기에 내가 어떤 차 시간 단계를하지를 않습니다 + +855 +01:01:12,380 --> 01:01:16,260 +내가 무슨 일을하고있어 WHAS 시간을 그 위에 다음 이전 세입자와 물건과입니다 + +856 +01:01:16,260 --> 01:01:20,570 +그래서 이것은 모든 입력 벡터를 무시 들어오는 단지 전진 패스입니다 + +857 +01:01:20,570 --> 01:01:25,280 +단지 WHAS 시간 H 임계 값 WHAS 시간 세이 임계 값 등 + +858 +01:01:25,280 --> 01:01:29,500 +그 전진 패스의 다음 뒤로 여기가 연출하고있어 여기서 통과 + +859 +01:01:29,500 --> 01:01:33,820 +마지막 단계에서 여기에 임의의 기울기에 의해 50 시간 단계에서 매우 + +860 +01:01:33,820 --> 01:01:37,880 +뒤쪽으로 이동 한 후 무작위 및 그라데이션을 주입 나는 그렇게 백업 + +861 +01:01:37,880 --> 01:01:41,059 +당신은 백업이 권한을 통해 여기 내가 사용하고 있습니다 통해 백업해야 할 때 + +862 +01:01:41,059 --> 01:01:46,170 +오히려 곱셈 등 400 WH보다 곱셈 어를 통해 배경을 얻을 + +863 +01:01:46,170 --> 01:01:51,800 +그래서 여기서주의 할 것은 여기에서 매우이다 나는 개발자 브라운 백을하고있는 중이 야 + +864 +01:01:51,800 --> 01:01:54,980 +수입을 어디에서 관련 바로 잡고 아무것도 통해 전파 + +865 +01:01:54,980 --> 01:02:02,309 +나는 WH 시간마다 작업을 제로보다 작은 여기서 포기하고 있었다 + +866 +01:02:02,309 --> 01:02:06,570 +우리가 실제로 WH 행렬 곱 경우 우리는 그렇게 비선형 성을하기 전에 + +867 +01:02:06,570 --> 01:02:09,570 +당신이 실제로 무슨 일을 볼 때가는 매우 펑키 뭔가가있다 + +868 +01:02:09,570 --> 01:02:13,300 +당신이 시간을 통해 뒤로 이동으로 NHS의 구배이 DHS에 + +869 +01:02:13,300 --> 01:02:18,160 +당신이 보는 것처럼 매우 걱정입니다 재미있는 구조의 매우 종류가 있습니다 + +870 +01:02:18,159 --> 01:02:22,210 +등이 우리가 여기 무슨 일을하는지와 같은 루프에 연결되는 방식 + +871 +01:02:22,210 --> 01:02:33,409 +두 시간 간격 + +872 +01:02:33,409 --> 01:02:43,849 +제로 그래 나는 생각하고 가끔 어쩌면 반군이 모든 있었다 출력의 + +873 +01:02:43,849 --> 01:02:47,630 +죽은 당신을 죽일 수 듯하지만 그건 정말 문제 아니다 + +874 +01:02:47,630 --> 01:02:51,470 +더 걱정 문제는 그 모든 쇼가 될 것 잘하지만 착용 한 생각 + +875 +01:02:51,469 --> 01:02:55,500 +사람들이 쉽게 우리가 걸 볼 수 있습니다뿐만 아니라 발견 할 수 있습니다 문제 + +876 +01:02:55,500 --> 01:03:00,380 +때문에에 또 다시 이상이 whah 행렬 곱 + +877 +01:03:00,380 --> 01:03:04,840 +앞으로 우리가 매일 반복에 awhh 곱 통과 + +878 +01:03:04,840 --> 01:03:09,670 +다시 우리가이 전파 결국 모든 숨겨진 상태를 통해 전파 + +879 +01:03:09,670 --> 01:03:13,820 +무형 
문화 유산 konnte 체스와 backrub 어 공식은 실제로 것을 밝혀 + +880 +01:03:13,820 --> 01:03:19,000 +당신은 whah 행렬 곱 인사말 신호를 가지고 우리는 종료 + +881 +01:03:19,000 --> 01:03:26,199 +그라데이션이 whah 유지를 곱한 도착까지 그 다음 WH 관계자를 곱한 + +882 +01:03:26,199 --> 01:03:32,019 +그렇게 우리는 그렇게하지 ​​매트릭스 W​​H 나이 오십 번 곱 결국 + +883 +01:03:32,019 --> 01:03:37,509 +이 가진 문제는 녹색 신호는 기본적으로 두 가지 경우처럼 일어날 수 있다는 것입니다 + +884 +01:03:37,510 --> 01:03:41,080 +당신은 아마 규모 행렬없는 스칼라 값 작업에 대한 생각 + +885 +01:03:41,079 --> 01:03:45,469 +그때 임의의 번호를 가지고 있다면 두 번째 번호가 나는 유지 + +886 +01:03:45,469 --> 01:03:48,509 +그래서 또 다시 두 번째 숫자에 의해 첫 번째 숫자를 곱한 + +887 +01:03:48,510 --> 01:03:55,990 +다시 그 순서는 바로 같은 플레이 자신의 경우에 무엇을 이동 않습니다 + +888 +01:03:55,989 --> 01:04:01,849 +번호 하나 내가 죽거나 아직 경우 두 번째 번호를 정확히 절전 모드로 전환 + +889 +01:04:01,849 --> 01:04:05,119 +일년 실제로 폭발하지만, 그렇지 않는 경우에만 위치하도록 + +890 +01:04:05,119 --> 01:04:09,679 +정말 나쁜 일이 죽을 중 하나 일어나고 또는 우리는 우리가 큰이 여기 폭발 + +891 +01:04:09,679 --> 01:04:12,659 +도시 우리는 하나의 번호가없는 있지만, 사실은이 같은 일이 일어난다이다 + +892 +01:04:12,659 --> 01:04:16,599 +그것의 일반화는 WHS 장축 반경 스펙트럼에서 일어나는 + +893 +01:04:16,599 --> 01:04:21,839 +이는 그 행렬의 최대 고유 한 후보다 큰 것이다 + +894 +01:04:21,840 --> 01:04:25,220 +이 시민은 완전히 사망의 1도 이하의 경우 무선 신호가 폭발 + +895 +01:04:25,219 --> 01:04:30,549 +그래서 기본적으로 박사 탄 때문에이 재발이 매우 이상한이 있기 때문에 + +896 +01:04:30,550 --> 01:04:34,680 +공식 우리는 매우 끔찍 역학에 결국 그리고 그것은 매우 불안정입니다 + +897 +01:04:34,679 --> 01:04:39,949 +그냥 그렇게 연습이 처리 된 방법을 폭발하고 또는 사망했다 + +898 +01:04:39,949 --> 01:04:44,439 +당신은 폭발 그라디언트에게 인사말 마치 하나의 간단한 하키를 제어 할 수 있습니다 + +899 +01:04:44,440 --> 01:04:45,720 +폭발 당신은 그것을 클릭 + +900 +01:04:45,719 --> 01:04:50,789 +그래서 사람들은 실제로 매우 누덕 누덕 기운 솔루션처럼하지만 경우에이 관행을 + +901 +01:04:50,789 --> 01:04:55,119 +두 번 다섯 분 노먼 린 크램 펫 (25) 요소 위에합니까을 읽고있는 나 + +902 +01:04:55,119 --> 01:04:58,150 +당신이 저하되어 클리핑을 수행 할 수 있도록 그런 일이 그 방법을의 + +903 +01:04:58,150 --> 01:05:01,829 +폭발 등급을 매기는 문제를 해결하고 당신은 당신이 기록하고있어하지 않습니다 + +904 +01:05:01,829 --> 01:05:06,049 +더 이상 폭발 그러나 녹색당은 여전히​​ 직장과 엘리스에서 카니발에서 사라질 수 있습니다 + +905 +01:05:06,050 --> 01:05:08,310 +팀 때문에 이들의 사라지는 그라데이션 문제에 아주 좋은 것입니다 + +906 +01:05:08,309 --> 01:05:12,429 +단지와 첨가제의 상호 작용에 따라 변화되는 세포의 고속도로 + +907 +01:05:12,429 --> 01:05:17,309 +당신은 당신이이기 때문에 경우에 당신이 경우 구배는 단지 그들이 아래로 죽지 않을 날려 + +908 +01:05:17,309 --> 01:05:21,000 +이러한 이유 대략이다처럼 같은 나이 또는 무언가에 의해 곱 + +909 +01:05:21,000 --> 01:05:26,909 +단지 더 동적으로 우리는 항상 팀 그래서 우리는 그라데이션 클리핑을 수행 할 + +910 +01:05:26,909 --> 01:05:30,149 +일반적으로 달라스 팀의 기울기가 잠재적으로 폭발 할 수 있기 때문에 + +911 +01:05:30,150 --> 01:05:33,400 +여전히 그들은 일반적으로 사라하지 않는했다 + +912 +01:05:33,400 --> 01:05:48,608 +재발 성 신경 네트워크뿐만 아니라에 대한 엘리스 팀은 분명하지 않다 어디를 + +913 +01:05:48,608 --> 01:05:53,769 +당신이 플러그 것입니다 정확히 같은이 식의 명확하지에 뛰어들 것 + +914 +01:05:53,769 --> 01:06:00,619 +상대적으로 어디에 아마 대신 G에서 월의 많은 다음에 참석하기 때문에 + +915 +01:06:00,619 --> 01:06:08,690 +여기 huug하지만 재판매는 바로 이렇게 하나의 방향으로 성장할 것 + +916 +01:06:08,690 --> 01:06:11,980 +어쩌면 당신은 실제로 좋은 아니에요 작게 있도록 만드는 끝낼 수 없다 + +917 +01:06:11,980 --> 01:06:18,539 +난 당신이 알고있는 가정 아이디어는 이렇게 연결하는 명확한 방법이 없습니다 기본적으로됩니다 + +918 +01:06:18,539 --> 01:06:25,380 +여기에 행을 너무 좋아 한 것은 나는이 초 고속도로의 측면에서 그 통지 + +919 +01:06:25,380 --> 01:06:29,780 +네 개의 얻을 문이있을 때이 그라디언트 이러한 관점은 실제로 고장 + +920 +01:06:29,780 --> 01:06:33,310 +네 개의 얻을 때 때문에 케이트의 우리는 이러한 행위의 일부를 잊을 수있는 곳 + +921 +01:06:33,309 --> 01:06:37,150 +내가 문을 잊지 때마다 곱셈 상호 작용은 다음에 그것과 세가와 + +922 +01:06:37,150 --> 01:06:41,470 +다음 그라데이션을 죽이고 물론 역류 때문에 이러한 슈퍼 중단됩니다 + +923 +01:06:41,469 --> 01:06:45,250 +당신이없는 경우 고속도로 가지 사실 어느 문을 잊지하지만 당신은 경우 + +924 +01:06:45,250 --> 01:06:50,000 +a는 다음 그라디언트를 죽일 수 그들의줬고, 그래서 실제로 잊지했다 + +925 +01:06:50,000 --> 01:06:54,710 +우리는 우리와 함께 연주 할 때 팀은 
우리가 가끔 사람들이 때 가정 오스틴의 사용이다 + +926 +01:06:54,710 --> 01:06:58,099 +긍정적 인 편견 때문에 함께 초기화에 그들이 처음 잊지 얻을 + +927 +01:06:58,099 --> 01:06:58,769 +에 의한 + +928 +01:06:58,769 --> 01:07:05,699 +나에 설정하는 것을 잊지 항상 종류의 내가 처음에 생각 해제 + +929 +01:07:05,699 --> 01:07:08,679 +그래서 처음에 녹색 아주 잘 이야기하고 미국 팀은 배울 수있는 방법 + +930 +01:07:08,679 --> 01:07:12,779 +그 해당 바이어스 용으로 나중에 사람들이 재생되도록 한 번에 그들을 차단하기 + +931 +01:07:12,780 --> 01:07:17,530 +수십 년 때때로 그래서 여기에 지난 밤 나는 그 비용을 언급하고 싶었다 + +932 +01:07:17,530 --> 01:07:21,580 +공간이 그래서 많은 사람들은 기본적으로이 꽤 플레이 한 + +933 +01:07:21,579 --> 01:07:26,119 +그들이 아키텍처로 다양한 변화를 시도 오디세이 용지 거기 + +934 +01:07:26,119 --> 01:07:32,829 +잠재적 인 변화의 큰 숫자 이상이 검색을 수행하려고 여기에 종이 + +935 +01:07:32,829 --> 01:07:36,940 +LST 방정식 그리고 그들은 많은 검색을했고, 그들은 아무것도 찾지 못했습니다 + +936 +01:07:36,940 --> 01:07:42,300 +그건 그냥 애널리스트 오전 너무 좋아하고있어보다 실질적으로 더 잘 작동 + +937 +01:07:42,300 --> 01:07:45,560 +또한 상대적으로 실제로 인기가 있고 내가 실제로 것 GRU + +938 +01:07:45,559 --> 01:07:50,159 +당신이 콜로세움 그것의 변화를 개의 DRU 사용 할 수 있습니다 것이 좋습니다 + +939 +01:07:50,159 --> 01:07:54,460 +그것은 짧은 점이다 대해도 좋은 상호 작용으로 결정했다 + +940 +01:07:54,460 --> 01:07:59,400 +작은 공식과 단지 하나있는 테네시을 갖지 않는 트랙터 + +941 +01:07:59,400 --> 01:08:03,130 +구현은 현명한 단지 하나가 가진 기억 단지 좋네요 있도록 만 H가 + +942 +01:08:03,130 --> 01:08:07,590 +단지 작은 간단한 일이 같은 앞으로 과​​거 두 가지 요인에 차질 + +943 +01:08:07,590 --> 01:08:12,190 +그 불쾌한의 혜택의 대부분을 갖고있는 것 같아요하지만 그래서는 GRU과라고 + +944 +01:08:12,190 --> 01:08:16,730 +거의 항상 멋진에 대한 내 경험에 작동하고 그래서 당신은 수도 + +945 +01:08:16,729 --> 01:08:19,939 +그것을 사용하려는 또는 당신은 그들이 모두 좀 동일한 대해 알고 마지막 시간을 사용할 수 있습니다 + +946 +01:08:19,939 --> 01:08:28,088 +그래서 누군가가 마구는 아주 좋은하지만의 RaWR하고 실제로하지 않는 것입니다 + +947 +01:08:28,088 --> 01:08:29,130 +아주 잘 작동 + +948 +01:08:29,130 --> 01:08:32,420 +소유즈 미국 팀은 무엇을 그들에 대해 좋은 데요 것은 이상한 갖는 것입니다 대신 사용된다 + +949 +01:08:32,420 --> 01:08:36,000 +그리스을 허용 이러한 첨가제의 상호 작용은 매우 잘 재생 당신은하지 않습니다 + +950 +01:08:36,000 --> 01:08:39,579 +사라지는 품종 문제를 얻을 우리는 여전히 폭발에 대해 조금 걱정 + +951 +01:08:39,579 --> 01:08:44,269 +이 사람들은 때때로 내가이 여자 클립을 참조하는 것이 일반적 그래서 문제를 공급 + +952 +01:08:44,270 --> 01:08:46,670 +더 간단한 구조가 정말하려고하는 말 것 + +953 +01:08:46,670 --> 01:08:50,838 +연결과 무슨 깊은 거기에 뭔가 오는 방법을 이해 + +954 +01:08:50,838 --> 01:08:53,899 +주민과 엘리스 팀 사이에 이들에 대해 뭔가 깊은있다 + +955 +01:08:53,899 --> 01:08:57,579 +나는 우리가 아직 정확히 그 이유는 완전히 이해되지 것 같아요 상호 작용 + +956 +01:08:57,579 --> 01:09:02,210 +그래서 잘 작동하고 어떤 부분은 시원했고, 그래서 우리가 필요하다고 생각 + +957 +01:09:02,210 --> 01:09:05,119 +공간 이론과 경험을 모두 이해하고 그것은 매우이야 + +958 +01:09:05,119 --> 01:09:10,979 +벌리고 연구의 영역과 그래서 그래서 + +959 +01:09:10,979 --> 01:09:23,469 +스포츠 (10) 그러나 나는 내가 그렇지 않은 그래서 폭발 가정 할 수 클래스의 끝 + +960 +01:09:23,470 --> 01:09:27,020 +명확 왜 것이라고하지만 당신은 세포 상태로 그라데이션을 주입 유지 + +961 +01:09:27,020 --> 01:09:30,069 +그래서 어쩌면 때때로 큰 얻을 수 있습니다 저하 + +962 +01:09:30,069 --> 01:09:33,960 +그것은 그들을 수집하는 것이 일반적이지만 중요 할 수 있으므로 한 시간으로 아마 생각 + +963 +01:09:33,960 --> 01:09:40,829 +그리고, 나는 그 시점하지만 비뇨기과 기초 I에 대해 확실히 백퍼센트 아니에요 + +964 +01:09:40,829 --> 01:09:46,640 +흥미로운 무슨 생각 그래 나는 우리가 여기까지해야한다고 생각하지 않습니다 있지만 난 + +965 +01:09:46,640 --> 01:09:47,569 +여기에 질문을 드리겠습니다 From 1f45a7bd2888d771a4d14cda40f102ffd131301a Mon Sep 17 00:00:00 2001 From: YB Date: Fri, 29 Jul 2016 20:52:22 -0400 Subject: [PATCH 191/199] Lecture1 - part 246~260 (out of 715) en / ko --- captions/En/Lecture1_en.srt | 59 ++++++++++++++++++------------------- captions/Ko/Lecture1_ko.srt | 38 +++++++++++++----------- 2 files changed, 49 insertions(+), 48 deletions(-) diff --git a/captions/En/Lecture1_en.srt b/captions/En/Lecture1_en.srt index 77101706..a8b3e439 100644 --- a/captions/En/Lecture1_en.srt +++ b/captions/En/Lecture1_en.srt @@ 
-1208,81 +1208,78 @@ The beginning of visual 246 00:27:23,779 --> 00:27:29,178 -processing is simple structures of the -world +processing is simple structures of the world, 247 00:27:29,179 --> 00:27:40,890 -oriented and this is a very deep deep -implications are signs as well as +edges, oriented edges and this is a very deep deep +implication to both neurophysiology, neuroscience as well as 248 00:27:40,890 --> 00:27:47,870 -engineering modeling it's later when we -visualize our dealer network features +engineering modeling. It's later when we +visualize our deep neural network features, 249 -00:27:47,869 --> 00:27:57,069 -will see that simple like structure in -emerging from our from our model and +00:27:47,870 --> 00:27:57,069 +we'll see that simple edge-like structure in +emerging from our model. 250 00:27:57,069 --> 00:28:03,298 -even though the discovery was later -fifties and early sixties they won the +Even though the discovery was later +fifties and early sixties, they won the 251 00:28:03,298 --> 00:28:12,039 -nobel medical price for this work in -1981 so that was another very important +Nobel medical prize for this work in 1981. +So, that was another very important 252 00:28:12,039 --> 00:28:25,928 -piece of work related to vision and -visual processing so that's another +piece of work related to vision and visual processing. +So, when did computer vision begin? 253 00:28:25,929 --> 00:28:35,620 -interesting story the precursor of -computer vision as a modern field was +That's another interesting story, history. +the precursor of computer vision as a modern field was 254 -00:28:35,619 --> 00:28:42,779 -this particular dissipation by Larry -Roberts in 1963 it's called block world +00:28:35,620 --> 00:28:42,779 +this particular dissertation by Larry Roberts in 1963. +It's called block world. 255 00:28:42,779 --> 00:28:49,889 -he just as humulin visa we're -discovering that the visual world in our +He, just as Hubel and Wiesel were discovering +that the visual world in our 256 00:28:49,890 --> 00:29:00,380 -brain is organized by simple like -structures Larry Roberts as early as PhD +brain is organized by simple edge-like +structures, Larry Roberts as early as Computer Science PhD 257 00:29:00,380 --> 00:29:06,350 -students were trying to extract these -like structures +students, were trying to extract these edge-like structures 258 00:29:06,349 --> 00:29:08,980 -images +and images as a as a piece of engineering work. 259 00:29:08,980 --> 00:29:16,210 -as a as a piece of engineering work in -this particular case his goal is that +In this particular case his goal is that 260 00:29:16,210 --> 00:29:22,210 -you know both you and not as humans can +you know, you and I as humans can recognize blocks no matter how it's 261 00:29:22,210 --> 00:29:28,009 -turned right like we know it's a saint +turned, right? We know it's the same block these two are the same block even 262 diff --git a/captions/Ko/Lecture1_ko.srt b/captions/Ko/Lecture1_ko.srt index ec639c26..56d377c6 100644 --- a/captions/Ko/Lecture1_ko.srt +++ b/captions/Ko/Lecture1_ko.srt @@ -1008,63 +1008,67 @@ 246 00:27:23,779 --> 00:27:29,178 - 처리는 세계의 간단한 구조입니다 + 우리 세상의 간단한 구조들 입니다. 247 00:27:29,179 --> 00:27:40,890 - 배향이 매우 깊은 깊이 영향뿐만 아니라 징후이며 + 선, 방향을 가진 선들이죠. 이것은 신경생리학, 신경과학 248 00:27:40,890 --> 00:27:47,870 - 우리는 우리의 딜러 네트워크 기능을 시각화 할 때 엔지니어링 모델링은 나중에 야 + 또한 공학 모델링에 있어서도 매우 중요한 발견입니다. 
+ 우리가 이후에 Deep Neural Network feature들을 시각화 할 때에도 249 -00:27:47,869 --> 00:27:57,069 - 우리의 모델 우리에서 신흥 구조 등의 간단한 표시되고 +00:27:47,870 --> 00:27:57,069 + 간단한 선과같은 구조가 우리의 모델에 사용되는것을 볼 수 있어요. 250 00:27:57,069 --> 00:28:03,298 - 발견했다하더라도 나중에 쉰과 60 년대 초반은 그들이 원 + 이 발견은 50년대 후반과 60년대 초반에 이루어 졌지만, 251 00:28:03,298 --> 00:28:12,039 - 1981 년이 작품에 대한 노벨 의학 가격은 그래서 또 다른 매우 중요 + 그들이 노벨의학상을 받은 건 1981년 이었습니다. 252 00:28:12,039 --> 00:28:25,928 - 즉 다른, 그래서 작품의 조각의 비전과 시각적 처리와 관련된 + 이 발견은 비젼과 시각처리와 관련해 매우 중요한 업적이었던 것입니다. + 그럼 과연 컴퓨터 비전은 언제 시작되었을까요? 253 00:28:25,929 --> 00:28:35,620 - 재미있는 이야기 현대 필드와 컴퓨터 비전의 전구체이었다 + 이것도 재미있는 역사의 한 부분입니다. + 현대학문의 분야로써 컴퓨터 비전은 시발점은 254 -00:28:35,619 --> 00:28:42,779 - 이 블록의 세계라고 1963 년 래리 로버츠에 의해이 특정 손실 +00:28:35,620 --> 00:28:42,779 + 1963년 Larry Roberts의 "조각 세계" 라는 특별한 논문이었어요. 255 00:28:42,779 --> 00:28:49,889 - 그는 단지 휴 물린 비자로 우리의 시각 세계 있음을 발견하고 우리의 + Hubel과 Weisel이 우리 뇌에서 세상을 시각적으로 256 00:28:49,890 --> 00:29:00,380 - 뇌는 빠르면 박사와 같은 구조 래리 로버츠처럼 간단한에 의해 구성되어있다 + 선과 같은 구조로 받아들이는 것을 연구한 것처럼 257 00:29:00,380 --> 00:29:06,350 - 학생들은 다음과 같은 구조를 추출하고있었습니다 + Larry Roberts는 컴퓨터 과학 박사과정 학생으로서 258 00:29:06,349 --> 00:29:08,980 - 이미지 + 이러한 선과 같은 구조와 이미지를 공학적으로 추출하려는 노력을 했어요. 259 00:29:08,980 --> 00:29:16,210 - 이 특정한 경우에 엔지니어링 작업의 조각 같은 그의 목표는 것입니다 + 이 특별한 연구의 목표는.. 260 00:29:16,210 --> 00:29:22,210 - 당신은 모두 당신을 알고 인간은 아무리 그것이 얼마나 블록을 인식 할 수 없습니다하지로 + 음 여러분과 저처럼 사람들은 상자가 어떤식으로 회전을 한다해도 + 우리는 그 상자를 인지할 수가 있죠? 261 00:29:22,210 --> 00:29:28,009 From d5e4305173b10d970d0759d30dc5d856b1feb607 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=EB=B0=B0=EC=A7=80=EC=9A=B4?= Date: Tue, 2 Aug 2016 15:10:34 +0900 Subject: [PATCH 192/199] Translate assignment1/two_layer_net --- .../assignment1/two_layer_net.ipynb | 130 +++++++++--------- 1 file changed, 64 insertions(+), 66 deletions(-) diff --git a/assignments2016/assignment1/two_layer_net.ipynb b/assignments2016/assignment1/two_layer_net.ipynb index 917853bf..3a9aea94 100644 --- a/assignments2016/assignment1/two_layer_net.ipynb +++ b/assignments2016/assignment1/two_layer_net.ipynb @@ -4,8 +4,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Implementing a Neural Network\n", - "In this exercise we will develop a neural network with fully-connected layers to perform classification, and test it out on the CIFAR-10 dataset." + "# 뉴럴 네트워크의 구현\n", + "이번에 우리는 완전 연결 레이어로 뉴럴 네트워크를 만들어 분류를 수행하고 CIFAR-10 데이터셋으로 테스트 해볼 것 입니다." ] }, { @@ -16,7 +16,7 @@ }, "outputs": [], "source": [ - "# A bit of setup\n", + "# 설치\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", @@ -24,12 +24,12 @@ "from cs231n.classifiers.neural_net import TwoLayerNet\n", "\n", "%matplotlib inline\n", - "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # 기본 그래프 사이즈 설정\n", "plt.rcParams['image.interpolation'] = 'nearest'\n", "plt.rcParams['image.cmap'] = 'gray'\n", "\n", - "# for auto-reloading external modules\n", - "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "# 외부 모듈 자동 불러오기\n", + "# 참고. http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", "%load_ext autoreload\n", "%autoreload 2\n", "\n", @@ -42,7 +42,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We will use the class `TwoLayerNet` in the file `cs231n/classifiers/neural_net.py` to represent instances of our network. The network parameters are stored in the instance variable `self.params` where keys are string parameter names and values are numpy arrays. 
Below, we initialize toy data and a toy model that we will use to develop your implementation." + "우리는 네트워크의 인스턴스를 나타내기 위해 `cs231n/classifiers/neural_net.py` 파일의 `TwoLayerNet` 클래스를 사용할 것입니다. 네트워크 파라메터는 인스턴스 변수 `self.params`에 키는 파라메터 이름인 문자열이고 값은 numpy 배열로 저장되어 있습니다. 아래에서 구현에 사용할 toy 데이터와 toy 모델을 초기화 합니다." ] }, { @@ -53,8 +53,8 @@ }, "outputs": [], "source": [ - "# Create a small net and some toy data to check your implementations.\n", - "# Note that we set the random seed for repeatable experiments.\n", + "# 작은 net을 만들고 toy 데이터로 구현을 체크해 봅니다.\n", + "# 반복되는 실험에서 우리가 랜덤 시드를 설정한다는 것을 주의하세요.\n", "\n", "input_size = 4\n", "hidden_size = 10\n", @@ -79,10 +79,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Forward pass: compute scores\n", - "Open the file `cs231n/classifiers/neural_net.py` and look at the method `TwoLayerNet.loss`. This function is very similar to the loss functions you have written for the SVM and Softmax exercises: It takes the data and weights and computes the class scores, the loss, and the gradients on the parameters. \n", + "# Forward pass: 점수 계산하기\n", + "`cs231n/classifiers/neural_net.py`파일을 열고 `TwoLayerNet.loss` 방법에 대해서 확인해 보세요. 이 함수는 SVM과 Softmax에서 작성했던 손실함수와 매우 유사합니다: 데이터와 가중치로 클래스의 점수, 손실정도, 매개변수의 그라디언트를 계산합니다.\n", "\n", - "Implement the first part of the forward pass which uses the weights and biases to compute the scores for all inputs." + "Forward pass의 첫 번째 부분의 구현은 모든 입력에 대한 점수를 계산하기 위해 가중치와 biases를 사용합니다." ] }, { @@ -107,7 +107,7 @@ "print correct_scores\n", "print\n", "\n", - "# The difference should be very small. We get < 1e-7\n", + "# 차이가 매우 작을 것입니다. 우리는 <1e-7 정도 나왔습니다.\n", "print 'Difference between your scores and correct scores:'\n", "print np.sum(np.abs(scores - correct_scores))" ] @@ -141,7 +141,7 @@ "metadata": {}, "source": [ "# Backward pass\n", - "Implement the rest of the function. This will compute the gradient of the loss with respect to the variables `W1`, `b1`, `W2`, and `b2`. Now that you (hopefully!) have a correctly implemented forward pass, you can debug your backward pass using a numeric gradient check:" + "함수의 남은 부분을 구현합니다. 손실의 그라디언트와 respect to the variables `W1`, `b1`, `W2` and `b2` 이제 정확한 froward pass를 구현했습니다, 이제 backward pass를 수치 그라디언트 체크로 디버그 할 수 있을 것 입니다." ] }, { @@ -154,13 +154,12 @@ "source": [ "from cs231n.gradient_check import eval_numerical_gradient\n", "\n", - "# Use numeric gradient checking to check your implementation of the backward pass.\n", - "# If your implementation is correct, the difference between the numeric and\n", - "# analytic gradients should be less than 1e-8 for each of W1, W2, b1, and b2.\n", + "# 수치 그라디언트 체크로 backward pass에서 구현한 것을 체크합니다.\n", + "# 만약 구현이 맞았다면, 각 W1, W2, b1, b2에 대해서 numeric 과 통계적 그라디언트는 1e-8 이하의 차이가 있을 것 입니다.\n", "\n", "loss, grads = net.loss(X, y, reg=0.1)\n", "\n", - "# these should all be less than 1e-8 or so\n", + "# 모두 해봐야 1e-8 이하 정도일 것입니다.\n", "for param_name in grads:\n", " f = lambda W: net.loss(X, y, reg=0.1)[0]\n", " param_grad_num = eval_numerical_gradient(f, net.params[param_name], verbose=False)\n", @@ -171,10 +170,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Train the network\n", - "To train the network we will use stochastic gradient descent (SGD), similar to the SVM and Softmax classifiers. Look at the function `TwoLayerNet.train` and fill in the missing sections to implement the training procedure. This should be very similar to the training procedure you used for the SVM and Softmax classifiers. 
You will also have to implement `TwoLayerNet.predict`, as the training process periodically performs prediction to keep track of accuracy over time while the network trains.\n", + "# 네트워크 훈련\n", + "네트워크를 훈련시키기 위해 우리는 SVM과 Softmax분류기와 비슷한 stochastic gradient descent(SGD)를 사용할 것입니다. `TwoLayerNet.train` 함수를 보고 비워있는 부분을 채워 넣어서 훈련 프로시저를 구현해 보세요. 이것은 SVM과 Softmax 분류기에서 사용한 훈련 과정과 매우 비슷할 것입니다. 또한 훈련 과정은 정기적으로 네트워크가 훈련되는 동안 정확도를 유지하기위한 예측모델을 수행하는 `TwoLayerNet.predict`도 구현해야 합니다.\n", "\n", - "Once you have implemented the method, run the code below to train a two-layer network on toy data. You should achieve a training loss less than 0.2." + "한번 메서드를 구현하면, 아래의 코드를 실행시켜 toy 데이터의 two-layer 네트워크를 학습시킵니다. 손실은 0.2미만이어야 합니다." ] }, { @@ -192,7 +191,7 @@ "\n", "print 'Final training loss: ', stats['loss_history'][-1]\n", "\n", - "# plot the loss history\n", + "# 손실 기록 그래프\n", "plt.plot(stats['loss_history'])\n", "plt.xlabel('iteration')\n", "plt.ylabel('training loss')\n", @@ -204,8 +203,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Load the data\n", - "Now that you have implemented a two-layer network that passes gradient checks and works on toy data, it's time to load up our favorite CIFAR-10 data so we can use it to train a classifier on a real dataset." + "# 데이터 불러오기\n", + "이제 그라데이션 검사를 통과하고 toy 데이터에서 작동하는 two-layer 네트워크를 구현해야 합니다.\n", + "분류기에 실제 데이터셋을 학습시키기위해 우리가 좋아하는 CIFAR-10 데이터를 불러올 시간입니다." ] }, { @@ -220,15 +220,14 @@ "\n", "def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000):\n", " \"\"\"\n", - " Load the CIFAR-10 dataset from disk and perform preprocessing to prepare\n", - " it for the two-layer neural net classifier. These are the same steps as\n", - " we used for the SVM, but condensed to a single function. \n", + " CIFAR-10 데이터셋을 디스크에서 불러와서 two-layer 신경 망 분류기를 위해 준비한 사전 작업을\n", + " 수행합니다. 우리가 SVM에서 했던 작업과 비슷하지만 하나의 함수로 축약되어 있습니다.\n", " \"\"\"\n", - " # Load the raw CIFAR-10 data\n", + " # 원본 CIFAR-10 데이터를 불러옵니다.\n", " cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'\n", " X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)\n", " \n", - " # Subsample the data\n", + " # 데이터 표본\n", " mask = range(num_training, num_training + num_validation)\n", " X_val = X_train[mask]\n", " y_val = y_train[mask]\n", @@ -239,13 +238,13 @@ " X_test = X_test[mask]\n", " y_test = y_test[mask]\n", "\n", - " # Normalize the data: subtract the mean image\n", + " # 데이터를 정규화 시킵니다.: 평균 이미지를 뺍니다.\n", " mean_image = np.mean(X_train, axis=0)\n", " X_train -= mean_image\n", " X_val -= mean_image\n", " X_test -= mean_image\n", "\n", - " # Reshape data to rows\n", + " # 데이터를 열(row)로 변형시킵니다.\n", " X_train = X_train.reshape(num_training, -1)\n", " X_val = X_val.reshape(num_validation, -1)\n", " X_test = X_test.reshape(num_test, -1)\n", @@ -253,7 +252,7 @@ " return X_train, y_train, X_val, y_val, X_test, y_test\n", "\n", "\n", - "# Invoke the above function to get our data.\n", + "# 데이터를 얻기위해 위의 함수들을 호출합니다.\n", "X_train, y_train, X_val, y_val, X_test, y_test = get_CIFAR10_data()\n", "print 'Train data shape: ', X_train.shape\n", "print 'Train labels shape: ', y_train.shape\n", @@ -267,8 +266,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Train a network\n", - "To train our network we will use SGD with momentum. In addition, we will adjust the learning rate with an exponential learning rate schedule as optimization proceeds; after each epoch, we will reduce the learning rate by multiplying it by a decay rate." + "# 망 학습시키기\n", + "네트워크를 학습시키기 위해 모멘텀과 SGD를 사용합니다. 
추가적으로 지수적인 학습 정도를 최적화되도로 조절합니다;\n", + "각 epoch 후에 학습 정도를 decay rate를 곱해서 감소시킵니다." ] }, { @@ -284,13 +284,13 @@ "num_classes = 10\n", "net = TwoLayerNet(input_size, hidden_size, num_classes)\n", "\n", - "# Train the network\n", + "# 망 학습시키기\n", "stats = net.train(X_train, y_train, X_val, y_val,\n", " num_iters=1000, batch_size=200,\n", " learning_rate=1e-4, learning_rate_decay=0.95,\n", " reg=0.5, verbose=True)\n", "\n", - "# Predict on the validation set\n", + "# 검증 셋에 대해 확인하기\n", "val_acc = (net.predict(X_val) == y_val).mean()\n", "print 'Validation accuracy: ', val_acc\n", "\n" @@ -300,12 +300,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Debug the training\n", - "With the default parameters we provided above, you should get a validation accuracy of about 0.29 on the validation set. This isn't very good.\n", + "# 디버그\n", + "위에서 제공한 기본 파라메터로, 0.29 정도의 검증 정확도를 얻을 수 있을 것입니다. 별로 좋지 않죠.\n", "\n", - "One strategy for getting insight into what's wrong is to plot the loss function and the accuracies on the training and validation sets during optimization.\n", + "통찰력을 얻기위한 하나의 전략은 최적화 중의 학습과 검증 셋에 대한 손실 함수와 정확도 그래프가 틀린 이유를 찾아보는 것 입니다.\n", "\n", - "Another strategy is to visualize the weights that were learned in the first layer of the network. In most neural networks trained on visual data, the first layer weights typically show some visible structure when visualized." + "다른 전략은 네트워크의 첫 레이어가 학습한 가중치를 시각화 해보는 것입니다. 대부분 시각 데이터를 학습한 뉴럴 네트워크의 첫 레이어의 가중치는 일반적으로 시각화 했을때 몇 가지 눈에보이는 구조를 갖습니다." ] }, { @@ -316,7 +316,7 @@ }, "outputs": [], "source": [ - "# Plot the loss function and train / validation accuracies\n", + "# 손실함수와 학습 / 검증 정확도 그래프\n", "plt.subplot(2, 1, 1)\n", "plt.plot(stats['loss_history'])\n", "plt.title('Loss history')\n", @@ -342,7 +342,7 @@ "source": [ "from cs231n.vis_utils import visualize_grid\n", "\n", - "# Visualize the weights of the network\n", + "# 네트워크 가중치 시각화\n", "\n", "def show_net_weights(net):\n", " W1 = net.params['W1']\n", @@ -358,15 +358,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Tune your hyperparameters\n", + " # hyperparameters 튜닝하기\n", "\n", - "**What's wrong?**. Looking at the visualizations above, we see that the loss is decreasing more or less linearly, which seems to suggest that the learning rate may be too low. Moreover, there is no gap between the training and validation accuracy, suggesting that the model we used has low capacity, and that we should increase its size. On the other hand, with a very large model we would expect to see more overfitting, which would manifest itself as a very large gap between the training and validation accuracy.\n", + "**무엇이 문제인가?**. 위의 시각화를 살펴보면, 손실이 더(혹은 덜) 선형적으로 감소하고 있음을 확인할 수 있습니다. 이것은 학습률이 너무 낮을 수 있음을 시사하는것 처럼 보입니다. 게다가 학습과 검증 정확도 사이에 차이가 없다는것은 우리가 사용한 모델이 적은 용량을 가지고 있음을 나타내고 용량을 증가시켜야될 필요가 있습니다. On the other hand, with a very large model we would expect to see more overfitting, which would manifest itself as a very large gap between the training and validation accuracy.\n", "\n", - "**Tuning**. Tuning the hyperparameters and developing intuition for how they affect the final performance is a large part of using Neural Networks, so we want you to get a lot of practice. Below, you should experiment with different values of the various hyperparameters, including hidden layer size, learning rate, numer of training epochs, and regularization strength. You might also consider tuning the learning rate decay, but you should be able to get good performance using the default value.\n", + "**튜닝**. 
hyperparameters를 튜닝하면서, 각 값이 최종 성능에 어떻게 영향을 끼치는지에 대한 직관을 얻기 위해 많은 연습을 해야 합니다. 아래에서는 hidden layer 크기, 학습률, 학습 epoch 수, 정규화 강도 등 다양한 hyperparameters의 값들로 실험해야 합니다. 또한 학습률의 decay를 튜닝하는 것도 생각해 볼 수 있지만, 기본값으로도 좋은 성능을 얻을 수 있을 것입니다.\n",
    "\n",
-    "**Approximate results**. You should be aim to achieve a classification accuracy of greater than 48% on the validation set. Our best network gets over 52% on the validation set.\n",
+    "**대략적인 결과**. 검증 셋에 대한 분류 정확도 48% 이상을 목표로 합시다. 우리의 가장 좋은 네트워크는 검증 셋에 대해 52% 이상의 정확도를 얻었습니다.\n",
    "\n",
-    "**Experiment**: You goal in this exercise is to get as good of a result on CIFAR-10 as you can, with a fully-connected Neural Network. For every 1% above 52% on the Test set we will award you with one extra bonus point. Feel free implement your own techniques (e.g. PCA to reduce dimensionality, or adding dropout, or adding features to the solver, etc.)."
+    "**실험**: 이 연습의 목표는 fully-connected 신경망으로 CIFAR-10에서 가능한 한 좋은 결과를 얻는 것입니다. 테스트 세트에서 52% 이상의 정확도를 얻으면 그 위로 1%마다 추가 보너스 점수를 얻을 수 있습니다. 자신만의 기법을 자유롭게 구현해 보세요. (예: PCA로 차원 줄이기, dropout 추가하기, solver에 특징 추가하기 등)"
   ]
  },
  {
@@ -377,23 +377,21 @@
   },
   "outputs": [],
   "source": [
-    "best_net = None # store the best model into this \n",
+    "best_net = None # 여기에 가장 좋은 모델을 저장하세요.\n",
    "\n",
    "#################################################################################\n",
-    "# TODO: Tune hyperparameters using the validation set. Store your best trained #\n",
-    "# model in best_net. #\n",
+    "# TODO: 검증 셋을 이용하여 hyperparameters를 튜닝하세요. 가장 잘 학습된 모델은 best_net에 저장 #\n",
+    "# 하세요. #\n",
    "# #\n",
-    "# To help debug your network, it may help to use visualizations similar to the #\n",
-    "# ones we used above; these visualizations will have significant qualitative #\n",
-    "# differences from the ones we saw above for the poorly tuned network. #\n",
+    "# 우리가 위에서 사용한 시각화를 비슷하게 사용하면 디버그하는데 도움이 될 것입니다; #\n",
+    "# 시각화를 통해 위에서 잘 학습되지 않은 네트워크와 중요한 질적 차이를 보일 것 입니다. #\n",
    "# #\n",
-    "# Tweaking hyperparameters by hand can be fun, but you might find it useful to #\n",
-    "# write code to sweep through possible combinations of hyperparameters #\n",
-    "# automatically like we did on the previous exercises. #\n",
+    "# 손으로 hyperparameters를 미세조정하는것은 재밌을 수 있지만, 이전 연습에서 했던 것 처럼 자동으로 #\n",
+    "# 가능한 hyperparameters를 찾는 코드를 작성하는 것이 더 유용하다는 것을 알게 될 것입니다. 
#\n", "#################################################################################\n", "pass\n", "#################################################################################\n", - "# END OF YOUR CODE #\n", + "# 코드의 끝 #\n", "#################################################################################" ] }, @@ -405,7 +403,7 @@ }, "outputs": [], "source": [ - "# visualize the weights of the best network\n", + "# 가장 좋은 네트워크의 가중치를 시각화 합니다.\n", "show_net_weights(best_net)" ] }, @@ -413,10 +411,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Run on the test set\n", - "When you are done experimenting, you should evaluate your final trained network on the test set; you should get above 48%.\n", + "# 테스트 세트로 실행하기\n", + "실험이 끝났다면, 최종으로 학습된 네트워크를 테스트 세트로 실행해 봅니다; 48%이상의 정확도를 얻어야 합니다.\n", "\n", - "**We will give you extra bonus point for every 1% of accuracy above 52%.**" + "**52%이상의 각 1% 마다 추가 점수를 얻을 수 있습니다.**" ] }, { @@ -434,21 +432,21 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 2", + "display_name": "Python 3", "language": "python", - "name": "python2" + "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", - "version": 2 + "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", - "pygments_lexer": "ipython2", - "version": "2.7.11" + "pygments_lexer": "ipython3", + "version": "3.5.2" } }, "nbformat": 4, From cc435fa37aa76bef0006809d85c016fe5a68b14b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=EB=B0=B0=EC=A7=80=EC=9A=B4?= Date: Tue, 2 Aug 2016 15:17:38 +0900 Subject: [PATCH 193/199] Update fix & change --- assignments2016/assignment1/two_layer_net.ipynb | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/assignments2016/assignment1/two_layer_net.ipynb b/assignments2016/assignment1/two_layer_net.ipynb index 3a9aea94..f8396433 100644 --- a/assignments2016/assignment1/two_layer_net.ipynb +++ b/assignments2016/assignment1/two_layer_net.ipynb @@ -116,8 +116,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Forward pass: compute loss\n", - "In the same function, implement the second part that computes the data and regularizaion loss." + "# Forward pass: 손실 계산하기\n", + "같은 함수에서, 데이터와 정규화 손실 정도를 계산하는 부분을 구현해 봅시다." ] }, { @@ -131,7 +131,7 @@ "loss, _ = net.loss(X, y, reg=0.1)\n", "correct_loss = 1.30378789133\n", "\n", - "# should be very small, we get < 1e-12\n", + "# 매우 작을 것 입니다, 우리는 1e-12보다 적은 값을 얻었습니다.\n", "print 'Difference between your loss and correct loss:'\n", "print np.sum(np.abs(loss - correct_loss))" ] @@ -403,7 +403,7 @@ }, "outputs": [], "source": [ - "# 가장 좋은 네트워크의 가중치를 시각화 합니다.\n", + "# 가장 좋은 네트워크의 가중치를 시각화 합니다.\n", "show_net_weights(best_net)" ] }, From ec8c16b645c3fab4f6647d7ea1968c23af538cfe Mon Sep 17 00:00:00 2001 From: MaybeS Date: Thu, 4 Aug 2016 15:44:51 +0900 Subject: [PATCH 194/199] =?UTF-8?q?fix=20trans=20&=20=ED=9B=88=EB=A0=A8->?= =?UTF-8?q?=20=ED=95=99=EC=8A=B5?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- assignments2016/assignment1/two_layer_net.ipynb | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/assignments2016/assignment1/two_layer_net.ipynb b/assignments2016/assignment1/two_layer_net.ipynb index f8396433..3e7eb844 100644 --- a/assignments2016/assignment1/two_layer_net.ipynb +++ b/assignments2016/assignment1/two_layer_net.ipynb @@ -141,7 +141,7 @@ "metadata": {}, "source": [ "# Backward pass\n", - "함수의 남은 부분을 구현합니다. 
손실의 그라디언트와 respect to the variables `W1`, `b1`, `W2` and `b2` 이제 정확한 froward pass를 구현했습니다, 이제 backward pass를 수치 그라디언트 체크로 디버그 할 수 있을 것 입니다."
+    "함수의 남은 부분을 구현합니다. W1, b1, W2, b2 변수들에 대한 손실함수의 그라디언트를 구현합니다. 정확한 forward pass를 구현했으니, 이제 backward pass를 수치 그라디언트 체크로 디버그할 수 있을 것입니다."
   ]
  },
 
@@ -170,8 +170,8 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
-    "# 네트워크 훈련\n",
-    "네트워크를 훈련시키기 위해 우리는 SVM과 Softmax분류기와 비슷한 stochastic gradient descent(SGD)를 사용할 것입니다. `TwoLayerNet.train` 함수를 보고 비워있는 부분을 채워 넣어서 훈련 프로시저를 구현해 보세요. 이것은 SVM과 Softmax 분류기에서 사용한 훈련 과정과 매우 비슷할 것입니다. 또한 훈련 과정은 정기적으로 네트워크가 훈련되는 동안 정확도를 유지하기위한 예측모델을 수행하는 `TwoLayerNet.predict`도 구현해야 합니다.\n",
+    "# 네트워크 학습\n",
+    "네트워크를 학습시키기 위해 우리는 SVM과 Softmax 분류기와 비슷한 stochastic gradient descent(SGD)를 사용할 것입니다. `TwoLayerNet.train` 함수를 보고 비어있는 부분을 채워 넣어서 학습 프로시저를 구현해 보세요. 이것은 SVM과 Softmax 분류기에서 사용한 학습 과정과 매우 비슷할 것입니다. 또한 학습 도중 주기적으로 예측을 수행하여 정확도를 추적할 수 있도록 `TwoLayerNet.predict`도 구현해야 합니다.\n",
    "\n",
    "한번 메서드를 구현하면, 아래의 코드를 실행시켜 toy 데이터의 two-layer 네트워크를 학습시킵니다. 손실은 0.2미만이어야 합니다."
   ]
  },
@@ -360,7 +360,7 @@
  "source": [
    "# hyperparameters 튜닝하기\n",
    "\n",
-    "**무엇이 문제인가?**. 위의 시각화를 살펴보면, 손실이 거의 선형적으로 감소하고 있음을 확인할 수 있습니다. 이는 학습률이 너무 낮을 수 있음을 시사합니다. 게다가 학습 정확도와 검증 정확도 사이에 차이가 없다는 것은 우리가 사용한 모델의 용량이 작다는 것을 나타내므로, 용량을 증가시킬 필요가 있습니다. On the other hand, with a very large model we would expect to see more overfitting, which would manifest itself as a very large gap between the training and validation accuracy.\n",
+    "**무엇이 문제인가?**. 위의 시각화를 살펴보면, 손실이 거의 선형적으로 감소하고 있음을 확인할 수 있습니다. 이는 학습률이 너무 낮을 수 있음을 시사합니다. 게다가 학습 정확도와 검증 정확도 사이에 차이가 없다는 것은 우리가 사용한 모델의 용량이 작다는 것을 나타내므로, 용량을 증가시킬 필요가 있습니다. 반면에, 매우 큰 모델을 사용한다면 overfitting이 발생할 수 있는데, 이 경우 학습 정확도와 검증 정확도 사이에 매우 큰 차이가 나는 것을 확인할 수 있을 것입니다.\n",
    "\n",
    "**튜닝**. hyperparameters를 튜닝하면서, 각 값이 최종 성능에 어떻게 영향을 끼치는지에 대한 직관을 얻기 위해 많은 연습을 해야 합니다. 아래에서는 hidden layer 크기, 학습률, 학습 epoch 수, 정규화 강도 등 다양한 hyperparameters의 값들로 실험해야 합니다. 또한 학습률의 decay를 튜닝하는 것도 생각해 볼 수 있지만, 기본값으로도 좋은 성능을 얻을 수 있을 것입니다.\n",

From dedd771266fdc4a2b7f6d117084258ab0af4690b Mon Sep 17 00:00:00 2001
From: YB
Date: Mon, 15 Aug 2016 21:36:56 -0400
Subject: [PATCH 195/199] Lecture1 - part 261~280 (out of 715) en / ko

---
 captions/En/Lecture1_en.srt | 70 ++++++++++++++++++-------------------
 captions/Ko/Lecture1_ko.srt | 45 +++++++++++++-----------
 2 files changed, 58 insertions(+), 57 deletions(-)

diff --git a/captions/En/Lecture1_en.srt b/captions/En/Lecture1_en.srt
index a8b3e439..ab9b9320 100644
--- a/captions/En/Lecture1_en.srt
+++ b/captions/En/Lecture1_en.srt
@@ -1275,77 +1275,76 @@
 In this particular case his goal is that
 
 260
 00:29:16,210 --> 00:29:22,210
 you know, you and I as humans can
-recognize blocks no matter how it's
+recognize blocks no matter how it's turned, right?
 
 261
 00:29:22,210 --> 00:29:28,009
-turned, right? We know it's the same
-block these two are the same block even
+We know it's the same block.
+These two are the same block, even though
 
 262
 00:29:28,009 --> 00:29:33,019
-though the lighting changed and the
-orientation changed and he's conjuncture
+the lighting changed and the orientation changed.
+And his conjecture
 
 263
 00:29:33,019 --> 00:29:40,720
-is that just like we thought told us
-it's the edges that define this the
+is that just like Hubel and Wiesel told us,
+it's the edges that define the structure. 
264
00:29:40,720 --> 00:29:46,419
-structure that the edges defying the
-laws shape and they don't change
+That the edges define the shape and they don't change,

265
00:29:46,419 --> 00:29:53,290
-relevant all these internal things so
-Larry Roberts Road a PhD dissertation to
+rather than all these internal things.
+So, Larry Roberts wrote a PhD dissertation to

266
00:29:53,289 --> 00:29:59,250
-just extract these edges you know if
+just extract these edges. You know if
 your work as a PhD student computer

267
00:29:59,250 --> 00:30:03,990
 vision this is like you know this is
-like undergraduate computer vision would
+like undergraduate computer vision wouldn't

268
00:30:03,990 --> 00:30:10,210
-have been had PhD theses but that was
-the first precursor computer vision PhD
+have been a PhD thesis but that was
+the first precursor computer vision PhD thesis.

269
00:30:10,210 --> 00:30:18,819
-theses like Robert since interest he
-gave up his computer vision afterwards
+And, as for Larry Roberts,
+he gave up his work in computer vision afterwards,

270
00:30:18,819 --> 00:30:27,189
-and DARPA I was one of the inventors of
-the internet we didn't do too badly by
+and went to DARPA.
+He was one of the inventors of the Internet. He didn't do too badly by

271
00:30:27,190 --> 00:30:34,490
-giving up computer vision but we always
-like to say that the birth of computer
+giving up computer vision.
+But we always like to say that the birthday of computer

272
00:30:34,490 --> 00:30:43,960
 vision as a modern field is in the
-summer of 1966 the summer of 1966 MIT
+summer of 1966. In the summer of 1966,

273
00:30:43,960 --> 00:30:49,548
-artificial intelligence lab was
-established before that actually for one
+MIT artificial intelligence lab was established.
+Before that, actually, here's one

274
00:30:49,548 --> 00:30:55,819
-piece of history should feel proud of
-them for student this there are two
+piece of history you should feel proud of
+as a Stanford student: there are two

275
00:30:55,819 --> 00:31:02,579
 pioneering artificial intelligence labs
 established in the world in the early

276
00:31:02,579 --> 00:31:10,329
-1960's one by Marvin Minsky at MIT one
-by John McCarthy at suffered a stone for
+1960's, one by Marvin Minsky at MIT, one
+by John McCarthy at Stanford.

277
00:31:10,329 --> 00:31:15,369
-the artificial intelligence lab was
-established before the computer science
+At Stanford, the artificial intelligence lab was
+established before the computer science department

278
00:31:15,369 --> 00:31:21,479
-department and professor John McCarthy
-who founded and I love is the one who is
+and professor John McCarthy
+who founded the AI Lab is the one who is

279
00:31:21,480 --> 00:31:22,490
 responsible for

280
00:31:22,490 --> 00:31:26,450
-the term artificial intelligence so
-that's a little bit of a problem
+the term artificial intelligence.
+So, that's a little bit of proud Stanford history.

281
00:31:26,450 --> 00:31:31,720
-stempler history but anyway we have to
-give us credit for starting the field of
+But anyway, we have to give ourselves credit for starting the field of

282
00:31:31,720 --> 00:31:41,380

diff --git a/captions/Ko/Lecture1_ko.srt b/captions/Ko/Lecture1_ko.srt
index 56d377c6..6d7d94a2 100644
--- a/captions/Ko/Lecture1_ko.srt
+++ b/captions/Ko/Lecture1_ko.srt
@@ -1067,88 +1067,91 @@
 
 260
 00:29:16,210 --> 00:29:22,210
 음 여러분과 저처럼 사람들은 상자가 어떤식으로 변한다해도
 우리는 그 상자를 인지할 수가 있죠?
261
00:29:22,210 --> 00:29:28,009
- 우리가 성자 블록이 두 알고 같은 블록도 있습니다 오른쪽과 같이 설정
+ 조명이 변하고 방향이 달라져도

262
00:29:28,009 --> 00:29:33,019
- 조명이 변화하고 있지만 방향이 변경 그는 국면이다
+ 이 두 상자는 같아요. 그의 요지는

263
00:29:33,019 --> 00:29:40,720
- 우리가 우리에게 생각하는 것처럼이 이것을 정의 가장자리는 것이다
+ Hubel과 Wiesel이 말한 것처럼 구조를 정의하는 것은

264
00:29:40,720 --> 00:29:46,419
- 구조 가장자리가 법률의 형태를 무시하고는 변경하지 않는 것이
+ 가장자리의 선들이라는 거죠.
+ 이 선들이 모양을 정하고, 내부의 모든 것들과 다르게 이 선들은 변하지 않습니다.

265
00:29:46,419 --> 00:29:53,290
- 래리 로버츠 도로 박사 학위 논문에 아주 관련된 모든 내부 물건
+ 그래서, Larry Roberts는 이 선들을 추출하는 것을 주제로 박사 학위 논문을 썼어요.

266
00:29:53,289 --> 00:29:59,250
- 그냥 있는지 알이 가장자리를 추출 박사 과정 학생 컴퓨터로 작업
+ 아시다시피 지금 컴퓨터 비전의 박사과정 학생에게는

267
00:29:59,250 --> 00:30:03,990
- 이 학부 컴퓨터 비전처럼 알처럼이 비전은 것
+ 이 작업은 학사과정의 작업이며 박사논문이 될 수 없었겠지요.

268
00:30:03,990 --> 00:30:10,210
- 박사 학위 논문을했지만, 즉 제 1 전구체 컴퓨터 비전 박사이었다되었습니다
+ 하지만 이 논문이 최초의 선구자적인 컴퓨터 비전 논문이었습니다.

269
00:30:10,210 --> 00:30:18,819
- 그는 이후 자신의 컴퓨터 비전을 포기 관심 때문에 로버트 같은 논문
+ Larry Roberts는 이후 컴퓨터 비전에 관련된 연구를 포기했습니다.

270
00:30:18,819 --> 00:30:27,189
- 과 DARPA는 내가 너무 심하게하여 우리가하지 않은 인터넷의 발명자 중 하나였다
+ 그리고는 DARPA에 들어갔죠. 인터넷의 창시자중의 한 명이었습니다.

271
00:30:27,190 --> 00:30:34,490
- 컴퓨터 비전을 포기하지만 우리는 항상 좋아하는 말을 해당 컴퓨터의 탄생
+ 컴퓨터 비전을 포기했지만 나쁘지 않은 업적이네요.
+ 현대 학문으로서의 컴퓨터비전의 생일은

272
00:30:34,490 --> 00:30:43,960
- 현대 필드와 비전은 1966 년 여름에 1966 MIT의 여름
+ 1966년 여름이라고 합니다. 1966년 여름, MIT의

273
00:30:43,960 --> 00:30:49,548
- 인공 지능 연구소는 하나 실제로 그 전에 설립되었다
+ 인공 지능 연구소가 설립되었어요. 그전에 여러분이

274
00:30:49,548 --> 00:30:55,819
- 역사의 조각이 학생에 대한 그들의 자부심을 느껴야한다이 두 가지가 있습니다
+ Stanford 학생으로서 자부심을 느낄만한 이야기가 있어요.

275
00:30:55,819 --> 00:31:02,579
- 초기에 세계에 설립 선구적인 인공 지능 연구실
+ 세계에서 선구적인 인공 지능 연구실이 두곳이 있습니다.

276
00:31:02,579 --> 00:31:10,329
- 존 맥카시에 의해 MIT 하나에서 마빈 민스키에 의해 1960 년대 하나에 돌에 대한 고통
+ 1960년대 초에 한 곳은 Marvin Minsky에 의해 MIT에서,
+ 또 하나는 John McCarthy에 의해 Stanford에서 세워집니다.

277
00:31:10,329 --> 00:31:15,369
- 인공 지능 연구실은 컴퓨터 과학 전에 설립되었다
+ Stanford에서 인공 지능 연구실은 컴퓨터 과학대학 전에 설립되었어요.

278
00:31:15,369 --> 00:31:21,479
- 설립 내가 사랑하는 부서 및 교수 존 맥카시는 하나입니다
+ 그리고 인공지능 연구실을 세운 John McCarthy 교수가

279
00:31:21,480 --> 00:31:22,490
- 에 대한 책임
+ 인공지능이라는 말을 만든 것이지요.

280
00:31:22,490 --> 00:31:26,450
- 그 문제의 조금 그래서 용어 인공 지능
+ 여러분들이 자랑스러워할 Stanford 역사를 조금 이야기 해드렸어요.

281
00:31:26,450 --> 00:31:31,720

From 5a25df8c22eb06c86eeb266ffddfbc0dd194c5e8 Mon Sep 17 00:00:00 2001
From: ygchoistat
Date: Thu, 18 Aug 2016 13:52:30 +0900
Subject: =?UTF-8?q?neural=20network=203=20=EB=B2=88?=
 =?UTF-8?q?=EC=97=AD=20=EC=99=84=EB=A3=8C?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 neural-networks-3.md | 225 +++++++++++++++++++++++--------------------
 1 file changed, 119 insertions(+), 106 deletions(-)

diff --git a/neural-networks-3.md b/neural-networks-3.md
index 4b579e98..77e3a00e 100644
--- a/neural-networks-3.md
+++ b/neural-networks-3.md
@@ -6,7 +6,7 @@ permalink: /neural-networks-3/
 
 Table of Contents:
 
 - [그라디언트 점검 (Gradient checks)](#gradcheck)
-- [Sanity checks](#sanitycheck)
+- [제대로 돌아가는지 확인하기 (Sanity checks)](#sanitycheck)
 - [학습 과정 돌보기 (Babysitting the learning process)](#baby)
   - [손실 함수 (Loss function)](#loss)
   - [훈련/검증 성능 (Train/val accuracy)](#accuracy)
@@ -82,68 +82,74 @@
 $$
 
-**Be careful with the step size h**. It is not necessarily the case that smaller is better, because when $h$ is much smaller, you may start running into numerical precision problems. Sometimes when the gradient doesn't check, it is possible that you change $h$ to be 1e-4 or 1e-6 and suddenly the gradient will be correct. 
This [wikipedia article](http://en.wikipedia.org/wiki/Numerical_differentiation) contains a chart that plots the value of **h** on the x-axis and the numerical gradient error on the y-axis. +**Step size h에 주의하라**. 꼭 작을 수록 좋은 건 아닌 게, $h$가 훨씬 작으면 수치적인 정확도(numerical precision) 문제에 부딪힐 수 있다. 가끔 그라디언트 체크가 잘 안 되면, $h$를 1e-4나 1e-6 정도로 조정하여 보라. 갑자기 될 수도 있다. 링크된 [위키피디아 기사](http://en.wikipedia.org/wiki/Numerical_differentiation)에는 **h**에 따른 수치적 그라디언트 오차가 xy-plot으로 조사되어 있다. -**Gradcheck during a "characteristic" mode of operation**. It is important to realize that a gradient check is performed at a particular (and usually random), single point in the space of parameters. Even if the gradient check succeeds at that point, it is not immediately certain that the gradient is correctly implemented globally. Additionally, a random initialization might not be the most "characteristic" point in the space of parameters and may in fact introduce pathological situations where the gradient seems to be correctly implemented but isn't. For instance, an SVM with very small weight initialization will assign almost exactly zero scores to all datapoints and the gradients will exhibit a particular pattern across all datapoints. An incorrect implementation of the gradient could still produce this pattern and not generalize to a more characteristic mode of operation where some scores are larger than others. Therefore, to be safe it is best to use a short **burn-in** time during which the network is allowed to learn and perform the gradient check after the loss starts to go down. The danger of performing it at the first iteration is that this could introduce pathological edge cases and mask an incorrect implementation of the gradient. -**Don't let the regularization overwhelm the data**. It is often the case that a loss function is a sum of the data loss and the regularization loss (e.g. L2 penalty on weights). One danger to be aware of is that the regularization loss may overwhelm the data loss, in which case the gradients will be primarily coming from the regularization term (which usually has a much simpler gradient expression). This can mask an incorrect implementation of the data loss gradient. Therefore, it is recommended to turn off regularization and check the data loss alone first, and then the regularization term second and independently. One way to perform the latter is to hack the code to remove the data loss contribution. Another way is to increase the regularization strength so as to ensure that its effect is non-negligible in the gradient check, and that an incorrect implementation would be spotted. +**"특징적인" 연산이 수행되는 곳에서 그라디언트 체크를 (Gradcheck during a "characteristic" mode of operation)**. 그라디언트 체크는 파라미터 공간(parameter space)의 특정한 (보통 랜덤인) 점 위에서 수행됨을 기억하자. 그라디언트 체크가 한 점에서는 성공한다 하여도 다른 점에서 맞게 수행되리라고는 믿기 힘들다. 게다가, 초기값을 랜덤하게 줄 경우(random initialization) 그 점은 파라미터 공간의 가장 "특징적인(characteristic)" 점이 아닐 수도 있고, 분명 제대로 코딩(implement)된 듯한 그라디언트가 사실 잘 계산되지 않는 병적인 상황을 야기할 수도 있다. 예를 들어, SVM에서 초기 웨이트값을 매우 작게 설정하면, 모든 데이터 포인트에 거의 0에 근접한 점수를 부여할 것이고 그라디언트 값들 또한 모든 데이터에 걸쳐 어떤 패턴을 나타낼 것이다. 만약 그라디언트 구현이 잘못되었다면 이 패턴을 계속 만들어낼 것이고 좀더 특징적인 계산으로 (e.g. 몇몇 점수가 다른 것보다 큰 경우) 일반화하지 못할 수도 있다. 그러므로, 안전하게 가려면, 네트워크가 학습을 시작할 무렵 짧은 번인(**burn-in**)을 이용하고, 손실(loss)가 하강하기 시작한 뒤에 그라디언트 체크를 수행하는 것이 최선이다. 요컨대, 첫번째 iteration에서부터 그라디언트 체크를 수행하면 그 때만의 병적인(pathological) 오류 때문에 우리가 정말로 정확하게 그라디언트 체크를 수행하는 부분에서의 오류를 놓칠 수도 있다. -**Remember to turn off dropout/augmentations**. 
When performing gradient check, remember to turn off any non-deterministic effects in the network, such as dropout, random data augmentations, etc. Otherwise these can clearly introduce huge errors when estimating the numerical gradient. The downside of turning off these effects is that you wouldn't be gradient checking them (e.g. it might be that dropout isn't backpropagated correctly). Therefore, a better solution might be to force a particular random seed before evaluating both $f(x+h)$ and $f(x-h)$, and when evaluating the analytic gradient. -**Check only few dimensions**. In practice the gradients can have sizes of million parameters. In these cases it is only practical to check some of the dimensions of the gradient and assume that the others are correct. **Be careful**: One issue to be careful with is to make sure to gradient check a few dimensions for every separate parameter. In some applications, people combine the parameters into a single large parameter vector for convenience. In these cases, for example, the biases could only take up a tiny number of parameters from the whole vector, so it is important to not sample at random but to take this into account and check that all parameters receive the correct gradients. +**정규화가 데이터를 압도하게 하지 마라 (Don't let the regularization overwhelm the data)**. 가끔, 손실함수(loss function)는 데이터 손실과 정규화(regularization) 손실 (e.g. 웨이트값(weight)들에 대한 L2 벌점(penalty))의 합으로 이루어져 있다. 하나 알고 있어야 하는 위험은, 정규화 손실이 데이터 손실을 압도할 수 있다는 것인데, 이 경우 그라디언트는 주로 (그라디언트 표현이 훨씬 간단한) 정규화 항(term)에서 올 것이다. 이 경우 데이터 손실 그라디언트가 올바르게 구현되지 못하는 상황을 감출 수도 있다. 그러므로, 먼저 정규화를 끄고 데이터 손실 부분만 체크를 수행하길 추천하며 그 다음에 정규화 항을 따로 점검해 보라. 정규화 항만 따로 어떻게 점검 하냐고? 하나의 방법은 코드를 해킹(hack)하여 데이터 손실 부분을 제거하는 것이다. 다른 방법으로는 정규화 항의 강도(strength)를 높여서 그 효과가 그라디언트 체크 수행시 무시할 수 없게 키운 뒤 (정규화 항 부분에서의) 잘못된 그라디언트가 감지되도록 하라. + + +**드랍아웃과 augmentation을 끄라 (Remember to turn off dropout/augmentations)**. 그라디언트 체크를 수행하는 동안, 네트워크에서 결정되지 않은(non-deterministic) 효과, 이를테면 드랍아웃(dropout), 임의 자료 확대(random data augmentations), 등을 반드시 꺼 두어라. 당연한 이야기지만 이들을 꺼두지 않으면 수치적 그라디언트 근사에서 대규모의 오차가 생길 수 있다. 이 효과들을 끌 경우 단점은 이들의 그라디언트 체크를 수행할수 없다는 것이다 (e.g. 드랍아웃이 올바르게 역전파(backpropagate)되지 않을 수 있다). 그러므로 $f(x+h)$ and $f(x-h)$ 및 수식으로 계산된(analytic) 그라디언트를 계산하기 전에 시드(seed)를 특정 값으로 고정하는 것이 좀더 나은 해결책일 수도 있다. + + +**몇 개의 차원에서만 체크하라 (Check only few dimensions)**. 실제 데이터에서 그라디언트는 수백만개의 파라미터값을 가질 수도 있다. 이런 경우엔 오직 몇 차원의 그라디언트들만 체크 하고 다른 것들은 잘 계산되었다고 믿는 것이 현실적일 수도 있다. **조심하라**: 모든 '분리된 파라미터'들에 대해서 적은 차원의 그라디언트 체크를 수행하라. 몇몇 용례에서는, 사람들이 파라미터들을 편의상 하나의 큰 파라미터 벡터로 결합한다. 이 경우, 이를테면, 편향값(bias)들은 전체 벡터에서 아주 적은 수만 차지하고 있을 수 있으므로, 이를 반영하여 샘플한 뒤 모든 파라미터들이 올바른 그라디언트를 받고 있는지 확인하는 것이 중요하다. + -### Before learning: sanity checks Tips/Tricks +### 학습 전에: 제대로 돌아가는지 확인하는 팁과 트릭들 (Before learning: sanity checks Tips/Tricks) -Here are a few sanity checks you might consider running before you plunge into expensive optimization: +풀려는 최적화 문제가 매우 비싸(expensive)지기 전에, 다음 절차들을 돌려볼 만하다. -- **Look for correct loss at chance performance.** Make sure you're getting the loss you expect when you initialize with small parameters. It's best to first check the data loss alone (so set regularization strength to zero). For example, for CIFAR-10 with a Softmax classifier we would expect the initial loss to be 2.302, because we expect a diffuse probability of 0.1 for each class (since there are 10 classes), and Softmax loss is the negative log probability of the correct class so: -ln(0.1) = 2.302. 
For The Weston Watkins SVM, we expect all desired margins to be violated (since all scores are approximately zero), and hence expect a loss of 9 (since margin is 1 for each wrong class). If you're not seeing these losses there might be issue with initialization. -- As a second sanity check, increasing the regularization strength should increase the loss -- **Overfit a tiny subset of data**. Lastly and most importantly, before training on the full dataset try to train on a tiny portion (e.g. 20 examples) of your data and make sure you can achieve zero cost. For this experiment it's also best to set regularization to zero, otherwise this can prevent you from getting zero cost. Unless you pass this sanity check with a small dataset it is not worth proceeding to the full dataset. Note that it may happen that you can overfit very small dataset but still have an incorrect implementation. For instance, if your datapoints' features are random due to some bug, then it will be possible to overfit your small training set but you will never notice any generalization when you fold it your full dataset. +- **맞는 손실함수를 찾아라 ?? (Look for correct loss at chance performance.)** +적은 수의 파라미터로 초기화할 때는 당신이 기대한 손실함수값(loss)를 얻는지 확인하라. 먼저 데이터 손실함수 (data loss) 하나만 확인하는 것이 가장 낫다 (따라서 정규화 강도(regularization strength)는 영으로 설정하여라). 예를 들어, CIFAR-10에 Softmax 분류기를 이용할 경우 초기 손실함수값을 2.302로 기대할 수 있는데, 왜냐하면, -ln(0.1) = 2.302 -- 각 클래스에 확률이 0.1로 분산되었을 테고 Softmax 손실함수는 올바른 분류 확률에 음의 로그를 취한 값이기 때문이다. Weston Watkins SVM을 사용할 경우에는, (모든 점수(score)가 어림잡아 0이기 때문에) 고려되는 모든 마진값(margin)이 위반될 테니 9의 손실값을 기대할 수 있다 (마진값은 각각 잘못 분류된 클래스마다 1이다). 이런 손실값들이 나오지 않으면 초기화에 문제가 있을 수 있다. +- 두 번째 확인 절차로써, 정규화 강도를 올릴 수록 손실함수값이 올라가야 한다. +- **자료의 작은 부분집합으로 과적합해 보라 (Overfit a tiny subset of data)**. 마지막으로 가장 중요한 사항인데, 전체 데이터셋으로 훈련을 시작하기 전에, 작은 부분으로 훈련을 시도하여 보고 (한 20개의 자료 정도), 0의 비용(cost)을 달성할 수 있는지 확인하여 보라. 이 실험에서도 역시 정규화 강도는 0으로 설정하는 것이 가장 나으며, 그렇지 않으면 0의 비용을 얻을 수 없을 것이다. 작은 자료에서의 이러한 확인 과정이 제대로 끝나지 않으면 전체 데이터셋으로 나아가는 것은 무가치하다. 하나 강조할 것은, 아주 작은 데이터셋에 성공적으로 과적합하였지만 여전히 코딩(implementation)이 올바르게 이루어지지 않았을 수 있다. 예를 들어, 가지고 있는 데이터 포인트(datapoint)들의 특성(feature)들이 어떤 버그 때문에 임의로(randomly) 선정된 경우, 작은 훈련 집합(training set)에의 과적합은 성공할지라도 그게 전체 데이터셋으로 일반화되지 않을 수도 있다. -### Babysitting the learning process +### 학습 과정 돌보기 (Babysitting the learning process) -There are multiple useful quantities you should monitor during training of a neural network. These plots are the window into the training process and should be utilized to get intuitions about different hyperparameter settings and how they should be changed for more efficient learning. +신경망을 훈련하는 중에 몇몇 쓸모있는 값(quantitity)은 모니터링해야 한다. 이런 도표들은 학습 과정을 지켜보는 창문이다. 좀더 효율적인 학습을 위한 하이퍼파라미터(hyperparameter) 조정도 여기서 직관적 영감을 얻는다. -The x-axis of the plots below are always in units of epochs, which measure how many times every example has been seen during training in expectation (e.g. one epoch means that every example has been seen once). It is preferable to track epochs rather than iterations since the number of iterations depends on the arbitrary setting of batch size. +도표의 x축은 언제나 에폭(epoch)을 단위로 한다. 에폭(epoch)은 각 자료(example)가 몇 번이나 학습(SGD iteration--역자 주)에 사용되었는가를 재는 용어이다. (이를테면 1 에폭이 지났다는 것은 모든 자료가 한 번씩 SGD iteration에 사용되었음을 뜻한다.) x축으로 SGD 알고리즘 반복횟수(iteration)를 할 수도 있겠지만 에폭이 더 선호되는 편이다. 반복 횟수(iteration number)은 배치 사이즈(batch size)의 선택에 따라 임의로 바뀔 수 있기 때문이다. 
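(역자 주) 참고로, 앞 절('학습 전에: 제대로 돌아가는지 확인하기')의 첫 번째 절차인 초기 손실값 점검을 코드로 옮겨 보면 대략 다음과 같은 스케치가 될 수 있다. 이는 원문에 없는 예시이며, 아래의 `loss` 값은 자리표시용이므로 실제로는 각자의 모형에서 계산한 값으로 바꾸어야 한다:

~~~python
import numpy as np

# 클래스가 10개인 Softmax 분류기를 가정한 간단한 점검 스케치.
# 파라미터를 작은 난수로 초기화하면 데이터 손실은 -ln(1/10) = 2.302 근처여야 한다.
# (정규화 강도는 0으로 두고 데이터 손실만 본다.)
num_classes = 10
expected_loss = -np.log(1.0 / num_classes)  # 약 2.302

# 실제로는 loss = net.loss(X_train, y_train, reg=0.0) 처럼 자신의 모형에서 계산한다.
loss = 2.31  # 여기서는 예시 값

if abs(loss - expected_loss) > 0.1:
    print('초기 손실값이 기대값과 다르다: 초기화를 점검하라')
~~~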
-#### Loss function +#### 손실 함수 (Loss function) -The first quantity that is useful to track during training is the loss, as it is evaluated on the individual batches during the forward pass. Below is a cartoon diagram showing the loss over time, and especially what the shape might tell you about the learning rate: +손실 함수(loss)는 forward pass 동안 개개의 배치(batch)에서 계산되고 따라서 훈련(training) 과정에서 추적하기 용이하다. 아래는 시간에 따른 손실 그래프의 모양을 여러 학습 속도(learning rate)에 따라 그려본 것이다. 각각의 모양이 시사하는 바도 함께 적었다:
- Left: A cartoon depicting the effects of different learning rates. With low learning rates the improvements will be linear. With high learning rates they will start to look more exponential. Higher learning rates will decay the loss faster, but they get stuck at worse values of loss (green line). This is because there is too much "energy" in the optimization and the parameters are bouncing around chaotically, unable to settle in a nice spot in the optimization landscape. Right: An example of a typical loss function over time, while training a small network on CIFAR-10 dataset. This loss function looks reasonable (it might indicate a slightly too small learning rate based on its speed of decay, but it's hard to say), and also indicates that the batch size might be a little too low (since the cost is a little too noisy). + 좌측: 훈련 과정에서 학습 속도의 영향. 낮은 학습 속도로는 선형적인 향상이 이루어질 것이다. 높은 학습 속도에서는 좀더 지수적인(exponential) 향상이 보일 것이다. 더 높은 학습 속도는 손실의 감소를 가속할 것이나, 더 나쁜 손실값에 빠지게 할 수도 있다 (초록 선). 그 이유는 최적화에 너무 많은 "에너지"가 가해져서 파라미터값들이 혼돈스러운 형태로 움직이고 (최적화 목적함수 모양에서) 좋은 곳에 정착하기가 힘들어지기 때문이다. 우측: 전형적인 손실 함수의 예. x축은 시간(epoch)이고 CIFAR-10 데이터셋에서 작은 신경망을 훈련하였다. 이 손실함수의 모양은 적절해 보이고 (손실 감소의 속ㄷ를 보았을 때, 약간 학습 속도가 너무 작은 감이 있으나 뭐라 말하기 어렵다) 배치 사이즈는 너무 작은 것으로 보인다 (비용(cost)에 너무 노이즈가 많다).
-The amount of "wiggle" in the loss is related to the batch size. When the batch size is 1, the wiggle will be relatively high. When the batch size is the full dataset, the wiggle will be minimal because every gradient update should be improving the loss function monotonically (unless the learning rate is set too high). +손실 함수의 "씰룩거림"은 배치 사이즈와 연관이 있다. 만일 배치 사이즈가 1이면 훨씬 더 많이 씰룩거릴 것이다. 만일 배치 사이즈가 전체 데이터셋이면 이 씰룩거림은 최소화될 것인데, 왜냐하면 모든 그라디언트 업데이트가 손실함수를 단조적으로 향상시킬 것이기 때문이다 (학습 속도가 너무 크지만 않다면). -Some people prefer to plot their loss functions in the log domain. Since learning progress generally takes an exponential form shape, the plot appears more as a slightly more interpretable straight line, rather than a hockey stick. Additionally, if multiple cross-validated models are plotted on the same loss graph, the differences between them become more apparent. +어떤 사람들은 손실함수의 로그값의 그래프를 선호하기도 한다. 일반적으로 학습 과정은 어떤 지수적인 모양(하키 스틱 모양)을 취하고 있기 때문에, 로그 손실 그래프는 좀 더 해석이 용이한 직선의 모양처럼 보인다. 부가적인 사항으로, 만약 여러 개의 교차검증 모형(의 손실 그래프)를 같은 그래프 위에 그리면, (로그 손실 그래프로 보면) 그들 사이의 차이가 좀 더 명백해지는 장점이 있다. -Sometimes loss functions can look funny [lossfunctions.tumblr.com](http://lossfunctions.tumblr.com/). +가끔 손실 함수 모양이 우스꽝스러울 때도 있다. [lossfunctions.tumblr.com](http://lossfunctions.tumblr.com/). -#### Train/Val accuracy +#### 훈련/검증 정확도 (Train/Val accuracy) -The second important quantity to track while training a classifier is the validation/training accuracy. This plot can give you valuable insights into the amount of overfitting in your model: +훈련/검증 정확도(training/validation accuracy)는 분류기 훈련시 추적해야 할 또다른 중요한 값이다. 이 플롯은 당신의 모형이 과적합(overfitting) 중인지를 발견할 수 있는 값진 인사이트를 제공한다:
- The gap between the training and validation accuracy indicates the amount of overfitting. Two possible cases are shown in the diagram on the left. The blue validation error curve shows very small validation accuracy compared to the training accuracy, indicating strong overfitting (note, it's possible for the validation accuracy to even start to go down after some point). When you see this in practice you probably want to increase regularization (stronger L2 weight penalty, more dropout, etc.) or collect more data. The other possible case is when the validation accuracy tracks the training accuracy fairly well. This case indicates that your model capacity is not high enough: make the model larger by increasing the number of parameters. + 훈련/검증 정확도의 차이는 오버피팅의 정도를 가리킬 수 있다. 두 가능한 경우는 그림의 왼쪽에 나타나 있다. 파란색 (검증 오류) 곡선은 훈련 정확도에 비하여 매우 낮은 검증 정확도를 보여주고 있는데, 이는 강한 과적합의 가능성을 시사한다 (어떤 지점 이후에 검증 정확도가 갑자기 떨어질 수 있는 것도 가능하다). 실제로 당신이 이 현상을 보게 되면 아마 정규화(regularization)을 쓰거나 (더 강한 L2 벌점(penalty)나 드랍아웃 등) 데이터를 더 모으고 싶을 것이다. 다른 가능성으로는 검증 정확도가 훈련 정확도를 꽤 잘 따라가는 것이다. 이것은 당신의 모델의 수용량이 충분히 높지 않음을 시사할 수도 있다. 파라미터(웨이트)의 개수를 늘려서 모형을 더 크게 만들어 봐라.
-#### Ratio of weights:updates +#### 웨이트의 현재값과 변화량의 비율 (Ratio of weights:updates) -The last quantity you might want to track is the ratio of the update magnitudes to to the value magnitudes. Note: *updates*, not the raw gradients (e.g. in vanilla sgd this would be the gradient multiplied by the learning rate). You might want to evaluate and track this ratio for every set of parameters independently. A rough heuristic is that this ratio should be somewhere around 1e-3. If it is lower than this then the learning rate might be too low. If it is higher then the learning rate is likely too high. Here is a specific example: +마지막으로, 웨이트의 현재 크기와 업데이트로 인한 변화량의 크기를 비교해 볼 수도 있다. (Note: 그냥 날 것의 그라디언트 값이 아니라, 웨이트의 *변화량*이다 (이를테면 vanilla SGD에서는 학습 속도(learning rate)와 그라디언트의 곱이다).) 모든 파라미터(의 집합)마다 독립적으로 이 비율을 추적/계산하고 싶은가? 대충 짚자면 이 비율은 1e-3 근처여야 한다. 이보다 낮으면 학습 속도(learning rate)가 너무 낮은 것이다. 이보다 크면 학습 속도가 너무 크다. 특정한 예를 들자면 아래와 같다: ~~~python # assume parameter vector W and its gradient vector dW @@ -154,49 +160,50 @@ W += update # the actual update print update_scale / param_scale # want ~1e-3 ~~~ -Instead of tracking the min or the max, some people prefer to compute and track the norm of the gradients and their updates instead. These metrics are usually correlated and often give approximately the same results. +최솟값이나 최댓값을 추적할 수도 있고, 그라디언트와 업데이트값의 놈(norm)을 계산하고 추적할 수도 있다. 이 지표들은 대개 연관성이 높아서 거의 비슷한 결과를 준다. -#### Activation / Gradient distributions per layer +#### 층별 활성값 및 그라디언트의 분포 (Activation / Gradient distributions per layer) -An incorrect initialization can slow down or even completely stall the learning process. Luckily, this issue can be diagnosed relatively easily. One way to do so is to plot activation/gradient histograms for all layers of the network. Intuitively, it is not a good sign to see any strange distributions - e.g. with tanh neurons we would like to see a distribution of neuron activations between the full range of [-1,1], instead of seeing all neurons outputting zero, or all neurons being completely saturated at either -1 or 1. +올바르지 않은 초기값 설정(initialization)은 학습 과정을 느리게 하거나 완전히 망칠 수 있다. 운좋게도 이 이슈는 상대적으로 쉽게 분석할 수 있다. 한 방법은 활성값/그라디언트값의 히스토그램을 망(network)의 모든 층(layer)마다 그려보는 것이다. 직관적으로 생각해 보면, 만일 이상한 분포가 나오면 좋은 징조가 아닐 수 있다 - 이를테면, tanh 뉴런(neuron)에서는 활성값이 [-1,1]의 전 범위에 걸쳐 분산되어 있는 모습을 보고 싶다. 혹시 모든 활성값이 0을 내놓거나 -1 혹은 1에 집중되어 있으면 문제가 있는 것이다. -#### First-layer Visualizations +#### 첫번째 층의 시각화 (First-layer Visualizations) -Lastly, when one is working with image pixels it can be helpful and satisfying to plot the first-layer features visually: +마지막으로, 만일 당신이 이미지 픽셀에 관련된 일을 한다면 첫 층의 특징(feature)들을 시각화하는 것이 많은 도움이 될 수도 있다.
- Examples of visualized weights for the first layer of a neural network. Left: Noisy features indicate could be a symptom: Unconverged network, improperly set learning rate, very low weight regularization penalty. Right: Nice, smooth, clean and diverse features are a good indication that the training is proceeding well. + 신경망 첫 층의 웨이트값(weight)를 시각화한 에. 좌측: 특징값(feature)에 잡음(noise)이 많을 때 나타날 수 있는 증상: 수렴하지 않은 망(network), 적절하지 않은 학습 속도(learning rate), 매우 낮은 정규화 페널티(regularization penalty). 우측: 부드럽고 깨끗하며 다양한 피쳐값들이 보이는 경우 훈련이 잘 진행되고 있다는 지표일 수 있다.
-### Parameter updates +### 파라미터값의 업데이트 (Parameter updates) -Once the analytic gradient is computed with backpropagation, the gradients are used to perform a parameter update. There are several approaches for performing the update, which we discuss next. +수식적으로 그라디언트값은 역전파(backpropagation)으로 계산되고 이는 파라미터값 업데이트를 위해 사용된다. 업데이트를 수행하는 몇 접근법들이 있는데 후술하겠다. + +딥 네트워크에서의 최적화 문제는 지금 가장 활발히 연구가 진행되고 있는 분야이다. 이 섹션에서는 (당신이 자주 보았을) 공통적으로 자주 쓰이는 테크닉과 그것들의 직관적인 아이디어를 살펴 본다. 디테일한 사항은 수업의 범위를 넘으므로 다루지 않는다. 흥미 있는 독자는 후에 등장할 몇 참고문헌을 봐도 좋다. -We note that optimization for deep networks is currently a very active area of research. In this section we highlight some established and common techniques you may see in practice, briefly describe their intuition, but leave a detailed analysis outside of the scope of the class. We provide some further pointers for an interested reader. -#### SGD and bells and whistles +#### SGD와 벨, 호루라기(?) (SGD and bells and whistles) -**Vanilla update**. The simplest form of update is to change the parameters along the negative gradient direction (since the gradient indicates the direction of increase, but we usually wish to minimize a loss function). Assuming a vector of parameters `x` and the gradient `dx`, the simplest update has the form: +**바닐라 업데이트 (Vanilla update)**. 가장 간단한 업데이트 형태는 그라디언트의 반대방향으로 파라미터를 업데이트하는 것이다(왜냐하면 그라디언트는 증가하는 방향을 가리키니까. 그렇지만 우리는 손실함수를 최소화하고 싶어한다). 파라미터의 벡터를 `x`라 하고 그라디언트를 `dx`라 쓰면, 가장 간단한 업데이트는 다음과 같: ~~~python # Vanilla update x += - learning_rate * dx ~~~ -where `learning_rate` is a hyperparameter - a fixed constant. When evaluated on the full dataset, and when the learning rate is low enough, this is guaranteed to make non-negative progress on the loss function. +여기서 학습속도 `learning_rate` 는 하이퍼파라미터(hyperparamter)이고 고정된 상수이다. 만일 `dx`가 전체 데이터셋에서 계산되고 학습 속도가 충분히 작을 때, 최소한 나쁜 프로세스는 아님을 보장한다. -**Momentum update** is another approach that almost always enjoys better converge rates on deep networks. This update can be motivated from a physical perspective of the optimization problem. In particular, the loss can be interpreted as a the height of a hilly terrain (and therefore also to the potential energy since $U = mgh$ and therefore $ U \propto h $ ). Initializing the parameters with random numbers is equivalent to setting a particle with zero initial velocity at some location. The optimization process can then be seen as equivalent to the process of simulating the parameter vector (i.e. a particle) as rolling on the landscape. +**모멘텀 업데이트 (Momentum update)**는, 적어도 딥 네트워크에서는, 바닐라 업데이트보다 더 잘 수렴한다. 이 방법은 최적화 문제(optimization problem)를 물리학적 관점에서 바라보는 데서 유래했다. 자세히 말하자면, 손실함수는 구릉지대에서 높이에 해당한다 (그래서 포텐셜 에너지에도 대응되는데 $U = mgh$이고 따라서 $ U \propto h $이다). 파라미터의 초기값을 임의로 정하는 것은 입자를 어떤 위치에서 0의 속도로 세팅하는 것과 똑같다. 이 상황에서 최적화 과정은 파라미터 벡터(즉 입자)를 '굴리는' 과정과 동일하다 볼 수 있다. -Since the force on the particle is related to the gradient of potential energy (i.e. $F = - \nabla U $ ), the **force** felt by the particle is precisely the (negative) **gradient** of the loss function. Moreover, $F = ma $ so the (negative) gradient is in this view proportional to the acceleration of the particle. Note that this is different from the SGD update shown above, where the gradient directly integrates the position. Instead, the physics view suggests an update in which the gradient only directly influences the velocity, which in turn has an effect on the position: +입자에 작용하는 힘(force)은 포텐셜 에너지의 그라디언트 (즉 $F = - \nabla U $ )와 관련되어 있으므로, 입자가 느끼는 **힘**은은 정확하게 손실함수의 그라디언트(의 반대부호)이다. 
게다가 $F = ma$이므로 그 그라디언트(의 반대부호)는 입자에 작용하는 가속도에 비례한다. 위에서의 SGD와 다른 점을 발견했는가? SGD는 위치값(현재 파라미터값 - 역자주)에 그라디언트가 직접 합쳐진다. 모멘텀 업데이트는, 물리학적 관점에서, 그라디언트가 오직 속도(velocity)에만 직접적으로 영향을 주고 속도가 위치값(position)에 영향을 줄 것을 제안하고 있다: ~~~python # Momentum update @@ -204,22 +211,24 @@ v = mu * v - learning_rate * dx # integrate velocity x += v # integrate position ~~~ -Here we see an introduction of a `v` variable that is initialized at zero, and an additional hyperparameter (`mu`). As an unfortunate misnomer, this variable is in optimization referred to as *momentum* (its typical value is about 0.9), but its physical meaning is more consistent with the coefficient of friction. Effectively, this variable damps the velocity and reduces the kinetic energy of the system, or otherwise the particle would never come to a stop at the bottom of a hill. When cross-validated, this parameter is usually set to values such as [0.5, 0.9, 0.95, 0.99]. Similar to annealing schedules for learning rates (discussed later, below), optimization can sometimes benefit a little from momentum schedules, where the momentum is increased in later stages of learning. A typical setting is to start with momentum of about 0.5 and anneal it to 0.99 or so over multiple epochs. +여기서 우리는 새로운 변수 `v`를 도입하고 0으로 초기화했다. `mu`는 또 하나의 하이퍼파라미터(hyperparamter)이다. +정확한 용어는 아니지만 우리는 이 `mu`를 *모멘텀(운동량)*이라 부르기로 한다. (보통 0.9로 설정한다) 사실 마찰 계수라고 부르는 쪽이 더 `mu`에 맞기는 하다. 이 변수는 입자의 현재 속도 및 운동에너지를 효과적으로 감소시키도록 도와준다. 이게 없다면 아마 입자는 언덕의 아래쪽에 절대 멈추지 못할 것이다. 만약 모멘텀을 교차검증(cross-validation)으로 선택한다면 보통 [0.5, 0.9, 0.95, 0.99]로 설정한다. 에폭에 따라 모멘텀의 크기를 조정하면 최적화(optimization)에 더 이로울 수도 있다. 이를테면 시작할 때는 0.5의 모멘텀으로 시작하되 몇 번의 에폭을 지나면 0.99로 설정할 수도 있다. 이는 학습 속도의 스케줄을 담금질(annealing)하는 것과도 비슷하다. (뒤에 논의할 예정이다) + +> 모멘텀 업데이트를 쓰면, (파라미터 벡터가 업데이트되는) 속도의 방향은 그라디언트들이 많이 향하는 방향으로 축적될 것이다. -> With Momentum update, the parameter vector will build up velocity in any direction that has consistent gradient. -**Nesterov Momentum** is a slightly different version of the momentum update has recently been gaining popularity. It enjoys stronger theoretical converge guarantees for convex functions and in practice it also consistenly works slightly better than standard momentum. +최근에 많은 주목을 받은 **Nesterov 모멘텀 (Nesterov Momentum)** 은 모멘텀 업데이트와 조금 다르다. 볼록함수(convex function)에서는 이 업데이트가 강력한 이론적 성질을 갖고 있고, 실제상황에서도 보통의 모멘텀 방법론보다 (많은 경우에서) 조금 더 낫다고 한다. -The core idea behind Nesterov momentum is that when the current parameter vector is at some position `x`, then looking at the momentum update above, we know that the momentum term alone (i.e. ignoring the second term with the gradient) is about to nudge the parameter vector by `mu * v`. Therefore, if we are about to compute the gradient, we can treat the future approximate position `x + mu * v` as a "lookahead" - this is a point in the vicinity of where we are soon going to end up. Hence, it makes sense to compute the gradient at `x + mu * v` instead of at the "old/stale" position `x`. +Nesterov 모멘텀의 핵심 아이디어는 다음과 같다. 만약 현재 파라미터 벡터가 `x`라는 어떤 위치에 있다고 치고 위의 모멘텀 엄데이트를 보자. 만일 위의 integrate velocity 과정에서 뒷항없이 `v = mu * v` 만 있다고 가정하면, 다음 위치로 `x + mu * v`가 "예견"될 것이다. 그러므로 이전의/오래된 위치 `x` 대신 예견된 위치 `x + mu * v`에서 그라디언트를 계산하는 것이 합리적일 수 있다.
- Nesterov momentum. Instead of evaluating gradient at the current position (red circle), we know that our momentum is about to carry us to the tip of the green arrow. With Nesterov momentum we therefore instead evaluate the gradient at this "looked-ahead" position. + Nesterov 모멘텀. 지금 위치(붉은색 원)에서 모멘텀에 의해 연두색 화살표의 끝점으로 이동할 상황이다. Nesterov 모멘텀은 현재 위치에서 그라디언트를 계산하는 것이 아니라 이 "예견된" 위치(화살표 끝점)에서 그라디언트를 계산한다.
-That is, in a slightly awkward notation, we would like to do the following: +다른 말로 하면, 다음과 같이 계산한다. (notation이 조금 이상하다.) ~~~python x_ahead = x + mu * v @@ -228,7 +237,7 @@ v = mu * v - learning_rate * dx_ahead x += v ~~~ -However, in practice people prefer to express the update to look as similar to vanilla SGD or to the previous momentum update as possible. This is possible to achieve by manipulating the update above with a variable transform `x_ahead = x + mu * v`, and then expressing the update in terms of `x_ahead` instead of `x`. That is, the parameter vector we are actually storing is always the ahead version. The equations in terms of `x_ahead` (but renaming it back to `x`) then become: +실제 용례에서 사람들은 위 식을 재서술하여 바닐라 SGD나 이전의 모멘텀 업데이트의 꼴처럼 고칠 때가 있다. 이를테면 `x_ahead = x + mu * v` 부분을 손보고, 업데이트를 `x`의 관점이 아닌 `x_ahead`의 관점에서 서술하면 (그리고 `x_ahead`를 `x`로 고쳐쓰면) 아래와 같다. 사족을 달자면 이제 우리가 저장하는 파라미터 벡터는 언제나 "예견된" 버전이다. ~~~python v_prev = v # back this up @@ -236,51 +245,54 @@ v = mu * v - learning_rate * dx # velocity update stays the same x += -mu * v_prev + (1 + mu) * v # position update changes form ~~~ -We recommend this further reading to understand the source of these equations and the mathematical formulation of Nesterov's Accelerated Momentum (NAG): + +위 식들의 출처와 Nesterov's Accelerated Momentum의 수학적 서술에 대해 더 알아보고 싶으면 아래를 참조하라. - [Advances in optimizing Recurrent Networks](http://arxiv.org/pdf/1212.0901v2.pdf) by Yoshua Bengio, Section 3.5. - [Ilya Sutskever's thesis](http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf) (pdf) contains a longer exposition of the topic in section 7.2 -#### Annealing the learning rate +#### 학습 속도 담금질 (Annealing the learning rate) -In training deep networks, it is usually helpful to anneal the learning rate over time. Good intuition to have in mind is that with a high learning rate, the system contains too much kinetic energy and the parameter vector bounces around chaotically, unable to settle down into deeper, but narrower parts of the loss function. Knowing when to decay the learning rate can be tricky: Decay it slowly and you'll be wasting computation bouncing around chaotically with little improvement for a long time. But decay it too aggressively and the system will cool too quickly, unable to reach the best position it can. There are three common types of implementing the learning rate decay: +깊은 신경망의 훈련에서 시간에 따라 훈련 속도를 담금질(anneal, 조정)하는 건 언제나 도움이 된다. 이 직관을 기억해 두면 도움이 된다: 높은 학습 속도에서는, 전체 시스템이 너무 높은 운동 에너지를 갖고 있어서 파라미터 벡터가 혼돈스럽게 튀고, (손실 함수의) 좁고 깊숙한 골짜기 안으로 쏙 들어가서 정착하기 힘들다. +그러면 학습 속도를 언제 줄일 것인가? 좀 tricky할 것이다. 우선 천천히 줄여봐라. 그러면 오랜 시간동안 거의 제자리에서 혼돈스럽게 왔다갔다 할 것이다. 그렇지만 너무 빨리 줄이면 전체 시스템이 너무 빨리 식을 것이고, 갈 수 있는 최적의 장소에 도달하지 못할 수 있다. 학습속도를 감소시키는 방법은 보통 다음 세 가지가 있다. -- **Step decay**: Reduce the learning rate by some factor every few epochs. Typical values might be reducing the learning rate by a half every 5 epochs, or by 0.1 every 20 epochs. These numbers depend heavily on the type of problem and the model. One heuristic you may see in practice is to watch the validation error while training with a fixed learning rate, and reduce the learning rate by a constant (e.g. 0.5) whenever the validation error stops improving. -- **Exponential decay.** has the mathematical form $\alpha = \alpha_0 e^{-k t}$, where $\alpha_0, k$ are hyperparameters and $t$ is the iteration number (but you can also use units of epochs). -- **1/t decay** has the mathematical form $\alpha = \alpha_0 / (1 + k t )$ where $a_0, k$ are hyperparameters and $t$ is the iteration number. 
+- **계단식 감소 (step decay)**: 몇 에폭마다 일정량만큼 학습 속도를 줄인다. 전형적으로는 5 에폭마다 반으로 줄이거나 20 에폭마다 1/10씩 줄이기도 한다. 이 숫자들은 전적으로 문제와 모형의 타입에 의존한다. 실전에서는, 우선 고정된 학습 속도로 검증오차(validation error)를 살펴보다가, 검증오차가 개선되지 않을 때마다 학습 속도를 감소시키는 (이를테면 0.5정도?) 방법을 택하기도 한다. +- **지수적 감소 (exponential decay)**는 $\alpha = \alpha_0 e^{-k t}$ 꼴을 뜻한다. 여기서 $\alpha_0, k$는 초모수(hyperparameter)이고 $t$는 반복 횟수이다 (물론 에폭을 단위로 해도 된다.) +- **1/t 감소**는 $\alpha = \alpha_0 / (1 + k t )$ 꼴을 뜻하고 여기서 $a_0, k$는 초모수이고 $t$는 반복 횟수이다. -In practice, we find that the step decay dropout is slightly preferable because the hyperparameters it involves (the fraction of decay and the step timings in units of epochs) are more interpretable than the hyperparameter $k$. Lastly, if you can afford the computational budget, err on the side of slower decay and train for a longer time. +실전에서는 계단식 감소 방식이 조금 더 선호될만 한데, 관련된 초모수들(몇 에폭마다 감소시킬지, 그리고 감소율)이 $k$에 비해서 해석이 더 쉽기 때문이다. 마지막으로, 계산 자원이 충분하다면, 감소율을 좀 더 낮춰서 오랜 시간동안 (모형을) 훈련시켜라. -#### Second order methods +#### 이차 근사 방법들 (Second order methods) -A second, popular group of methods for optimization in context of deep learning is based on [Newton's method](http://en.wikipedia.org/wiki/Newton%27s_method_in_optimization), which iterates the following update: +딥러닝의 맥락에서 두 번째로 대중적인 최적화 방법은 [뉴턴 방법(Newton's method)](http://en.wikipedia.org/wiki/Newton%27s_method_in_optimization)인데 다음과 같은 업데이트 방식을 뜻한다: $$ x \leftarrow x - [H f(x)]^{-1} \nabla f(x) $$ -Here, $H f(x)$ is the [Hessian matrix](http://en.wikipedia.org/wiki/Hessian_matrix), which is a square matrix of second-order partial derivatives of the function. The term $\nabla f(x)$ is the gradient vector, as seen in Gradient Descent. Intuitively, the Hessian describes the local curvature of the loss function, which allows us to perform a more efficient update. In particular, multiplying by the inverse Hessian leads the optimization to take more aggressive steps in directions of shallow curvature and shorter steps in directions of steep curvature. Note, crucially, the absence of any learning rate hyperparameters in the update formula, which the proponents of these methods cite this as a large advantage over first-order methods. - -However, the update above is impractical for most deep learning applications because computing (and inverting) the Hessian in its explicit form is a very costly process in both space and time. For instance, a Neural Network with one million parameters would have a Hessian matrix of size [1,000,000 x 1,000,000], occupying approximately 3725 gigabytes of RAM. Hence, a large variety of *quasi-Newton* methods have been developed that seek to approximate the inverse Hessian. Among these, the most popular is [L-BFGS](http://en.wikipedia.org/wiki/Limited-memory_BFGS), which uses the information in the gradients over time to form the approximation implicitly (i.e. the full matrix is never computed). +여기서 $H f(x)$는 [헤시안 행렬(Hessian matrix)](http://en.wikipedia.org/wiki/Hessian_matrix)로, (다변수 함수의) 2차 미분으로 이루어진 정방행렬을 뜻한다. $\nabla f(x)$ 항은 (그라디언트 감소 Gradient Descent에서 보았던) 그라디언트 벡터이다. 직관적으로 헤시안 행렬은 어떤 함수의 국지적인 곡률(curvature)을 뜻하고 이 정보로 울이는 더 효율적인 업데이트를 수행할 수 있다. 특별히, 헤시안 행렬의 역행렬을 곱함으로써, 휨이 약한 방향으로는 더 공격적으로 그리고 휨이 강한 방향으로는 짧게짧게 움직일 수 있다. 일차 근사 방법에 비해 뉴턴 방법이 가지는 강점은, 위의 업데이트 공식을 보면 학습 속도(learning rate)에 대한 초모수(hyperparameter)가 없다는 것이다. -However, even after we eliminate the memory concerns, a large downside of a naive application of L-BFGS is that it must be computed over the entire training set, which could contain millions of examples. 
Unlike mini-batch SGD, getting L-BFGS to work on mini-batches is more tricky and an active area of research. +그렇지만 위의 업데이트는 거의 모든 실제 상황에서는 쓸모가 없는 게, 공식 그대로(explicitly) 헤시안 행렬을 계산한다면 (역행렬을 취하는 일 포함하여) 상상도 못할 시간과 메모리가 필요하다. 예를 들면, 모수가 백만개 정도인 신경망은 [1,000,000 x 1,000,000] 크기의 헤시안 행렬을 필요로 하고 이는 3725GB의 램(RAM)을 필요로 한다. 그 결과로 다양한 *유사-뉴턴* 방법이 역-헤시안 행렬을 근사하기 위해 고안되었다. 이 방법론들 중 [L-BFGS](http://en.wikipedia.org/wiki/Limited-memory_BFGS)가 가장 대중적이다. L-BFGS는 시간(iteration)에 따른 그라디언트의 변화를 (간접적으로) 근사에 이용한다. 즉, 전체 행렬은 절대 계산되지 않는다. -**In practice**, it is currently not common to see L-BFGS or similar second-order methods applied to large-scale Deep Learning and Convolutional Neural Networks. Instead, SGD variants based on (Nesterov's) momentum are more standard because they are simpler and scale more easily. +그렇다고 해도, 메모리 걱정을 없앴다고 할지라도, L-BFGS를 그냥 적용하자면 큰 단점이 하나 있는데 바로 전체 훈련 집합(traning set) 전체를 대상으로 계산하여야 한다는 점이다. 수백만 개체가 있는 그 데이터셋 말이다. 배치(Batch)-SGD와는 달리, 미니배치(mini-batch)에서 L-BFGS가 작동하게 하는 방법은 좀더 꼼수를 필요로 하며 활발한 연구 분야이다. + +**실제 상황에서는**, 지금까지는, L-BFGS나 다른 이차 근사 방법이 대규모 딥러닝이나 CNN에서 사용되지는 않는 게 보통이다. 표준적으로는 SGD와 그 변종들 (모멘텀이나 Nesterov's 모멘텀)이 훨씬 간단하고 계산도 빨라서 많이 사용된다. -Additional references: +추가 참고문헌: -- [Large Scale Distributed Deep Networks](http://research.google.com/archive/large_deep_networks_nips2012.html) is a paper from the Google Brain team, comparing L-BFGS and SGD variants in large-scale distributed optimization. -- [SFO](http://arxiv.org/abs/1311.2115) algorithm strives to combine the advantages of SGD with advantages of L-BFGS. +- [Large Scale Distributed Deep Networks](http://research.google.com/archive/large_deep_networks_nips2012.html)은 Google Brain team이 출판하였다. 대규모 분산 최적화 (large-scale distributed optimization)에서 L-BFGS와 SGD(의 변형 방법론들)을 비교하였다. +- [SFO](http://arxiv.org/abs/1311.2115) 알고리즘은 SGD와 L-BFGS의 장점을 혼합하고자 노력하였다. -#### Per-parameter adaptive learning rate methods +#### 파라미터별 데이터-맞춤 학습 속도 (Per-parameter adaptive learning rates) -All previous approaches we've discussed so far manipulated the learning rate globally and equally for all parameters. Tuning the learning rates is an expensive process, so much work has gone into devising methods that can adaptively tune the learning rates, and even do so per parameter. Many of these methods may still require other hyperparameter settings, but the argument is that they are well-behaved for a broader range of hyperparameter values than the raw learning rate. In this section we highlight some common adaptive methods you may encounter in practice: +지금까지 논의된 접근법들은 모든 파라미터에 똑같은 학습 속도를 적용하였다. 학습 속도의 튜닝(tuning)은 계산이 많은(expensive) 작업인지라, 데이터에 맞추어(adaptively) 자동으로 학습 속도를 정하는 방법을 찾고자 많은 사람들이 노력하였다. 파라미터별로 학습 속도를 다르게 하고 이를 데이터-맞춤으로 정하려는 노력들 또한 있었다. 이러한 방법들은 보통 또다른 초모수(hyperparameter) 세팅이 필요하긴 하지만, 이 초모수는 넓은 범위에서 잘 작동하는 편이라 일반적인 학습 속도 튜닝보다는 덜 까다롭다. 이번 절에서는 실전에서 마주칠 수도 있는 주요 데이터-맞춤 방법들을 조망해본다: -**Adagrad** is an adaptive learning rate method originally proposed by [Duchi et al.](http://jmlr.org/papers/v12/duchi11a.html). + +**Adagrad**는 데이터-맞춤 학습속도 조정 방법 중 하나이고 [Duchi et al.](http://jmlr.org/papers/v12/duchi11a.html) 에서 처음 제안되었다. ~~~python # Assume the gradient dx and parameter vector x @@ -288,18 +300,18 @@ cache += dx**2 x += - learning_rate * dx / (np.sqrt(cache) + eps) ~~~ -Notice that the variable `cache` has size equal to the size of the gradient, and keeps track of per-parameter sum of squared gradients. This is then used to normalize the parameter update step, element-wise. 
Notice that the weights that receive high gradients will have their effective learning rate reduced, while weights that receive small or infrequent updates will have their effective learning rate increased. Amusingly, the square root operation turns out to be very important and without it the algorithm performs much worse. The smoothing term `eps` (usually set somewhere in range from 1e-4 to 1e-8) avoids division by zero. A downside of Adagrad is that in case of Deep Learning, the monotonic learning rate usually proves too aggressive and stops learning too early. +위에서 변수 `cache`는 그라디언트 벡터의 사이즈와 동일한 사이즈를 갖고 있다. `cache`의 각 성분은 (해당 성분에 대응하는) 그라디언트의 제곱값들을 계속 추적하고 있고, 파라미터 업데이트에서, 성분별로, 일종의 표준화 기능을 수행한다. 주목할 점은, 높은 그라디언트값을 갖는 웨이트값(weight)들은 점점 실질적인 학습속도(effective learning rate)가 감소하고 / 그라디언트 값이 낮거나 업데이트가 거의 없는 웨이트값들은 실질 학습속도가 증가한다는 것이다. 놀랍게도 제곱근(square root) 연산이 여기서 중요한 비중을 차지한다. 제곱근이 없다면 알고리즘의 성능이 많이 나빠진다. 변수 `eps`는 분모가 너무 0에 가깝지 않도록 안정화 역할을 하고 주로 1e-4에서 1e-8의 값이 할당된다. Adagrad의 단점이 있다면, 딥러닝의 경우에는, 학습 속도가 단조적이라 너무 한 방향으로 급진적(aggressive)으로 나가거나, 혹은 학습을 너무 빨리 멈출 가능성도 있다. -**RMSprop.** RMSprop is a very effective, but currently unpublished adaptive learning rate method. Amusingly, everyone who uses this method in their work currently cites [slide 29 of Lecture 6](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) of Geoff Hinton's Coursera class. The RMSProp update adjusts the Adagrad method in a very simple way in an attempt to reduce its aggressive, monotonically decreasing learning rate. In particular, it uses a moving average of squared gradients instead, giving: +**RMSprop.** RMSprop는 매우 효과적이지만 아직 출판되지 않은 데이터-맞춤 학습속도 조정 방법이다. 현재는 Geoff Hinton의 Coursera 강의 중 다음 슬라이드를 인용한다: [slide 29 of Lecture 6](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) (역자 주: 2016년 8월 현재에도 검색결과 논문을 찾지는 못하였습니다. Goodfellow et al.의 책 [](http://www.deeplearningbook.org)의 8장에 줄글로 설명이 있습니다.) RMSProp 업데이트는 Adagrad를 간단히 조정하여 급진적이고 단조감소하는 학습속도를 경감시켰다. 어떻게? 제곱 그라디언트의 평균(Adagrad처럼)이 아니라, 이동평균(moving average)을 사용한다: ~~~python cache = decay_rate * cache + (1 - decay_rate) * dx**2 x += - learning_rate * dx / (np.sqrt(cache) + eps) ~~~ -Here, `decay_rate` is a hyperparameter and typical values are [0.9, 0.99, 0.999]. Notice that the `x+=` update is identical to Adagrad, but the `cache` variable is a "leaky". Hence, RMSProp still modulates the learning rate of each weight based on the magnitudes of its gradients, which has a beneficial equalizing effect, but unlike Adagrad the updates do not get monotonically smaller. +여기서 `decay_rate`는 초모수이고 보통 [0.9, 0.99, 0.999] 중 하나의 값을 취한다. 주목할 점은 `+=` 업데이트는 Adagrad와 동등하지만, `cache`가 "어디선가 샌다". 따라서 RMSProp은 여전히 각 웨이트값을 (그것의 과거 그라디언트) 값으로) 조정하여 성분별로 실질 학습속도를 비슷하게 만드는 효과는 갖고 있지만, Adagrad처럼 학습 속도가 단조적으로 줄지는 않는다. -**Adam.** [Adam](http://arxiv.org/abs/1412.6980) is a recently proposed update that looks a bit like RMSProp with momentum. The (simplified) update looks as follows: +**Adam.** [Adam](http://arxiv.org/abs/1412.6980)은 최근에 제안된 방법인데 RMSProp에 모멘텀(momentum)을 혼합한 것처럼 보인다. 간단하게 쓰면 업데이트는 다음과 같다: ~~~python m = beta1*m + (1-beta1)*dx @@ -307,82 +319,83 @@ v = beta2*v + (1-beta2)*(dx**2) x += - learning_rate * m / (np.sqrt(v) + eps) ~~~ -Notice that the update looks exactly as RMSProp update, except the "smooth" version of the gradient `m` is used instead of the raw (and perhaps noisy) gradient vector `dx`. Recommended values in the paper are `eps = 1e-8`, `beta1 = 0.9`, `beta2 = 0.999`. 
In practice Adam is currently recommended as the default algorithm to use, and often works slightly better than RMSProp. However, it is often also worth trying SGD+Nesterov Momentum as an alternative. The full Adam update also includes a *bias correction* mechanism, which compensates for the fact that in the first few time steps the vectors `m,v` are both initialized and therefore biased at zero, before they fully "warm up". We refer the reader to the paper for the details, or the course slides where this is expanded on. +업데이트는 RMSProp의 업데이트 방식과 정확히 같아 보이는데, 그냥 (노이즈가 껴있을 수도 있는) 그라디언트 `dx` 대신에 "안정화된" 버전인 `m`이 사용되었다는 점이 다르다. 논문에 따르면 추천되는 초모수값들은 `eps = 1e-8`, `beta1 = 0.9`, `beta2 = 0.999`이다. 실전에서 Adam은 기본 알고리즘으로 추천되고 있고, 가끔은 RMSProp보다 조금 더 잘 하기도 한다. 그러나 SGD+Nesterov Momentum도 대안으로 해볼만 하다. Adam 업데이트 절차에는 *편향 보정(bias correction)* 매커니즘이 반영되어 있는데, 벡터 `m,v`가 나중에 완벽하게 "워밍업" 되기 전에 (iteration의 처음 몇 스텝에서) 초기화되어 0에 편향되어 있다는 점을 보상하기 위해서이다. 자세한 사항은 논문이나 강의 코스 슬라이드를 참조하라. -Additional References: +추가 참고문헌: -- [Unit Tests for Stochastic Optimization](http://arxiv.org/abs/1312.6055) proposes a series of tests as a standardized benchmark for stochastic optimization. +- [Unit Tests for Stochastic Optimization](http://arxiv.org/abs/1312.6055)는 (지금까지 제안된) 확률적 최적화(stochastic optimization) 방법들을 평가하는 표준적인 테스트들을 제안하고 있다.
- Animations that may help your intuitions about the learning process dynamics. Left: Contours of a loss surface and time evolution of different optimization algorithms. Notice the "overshooting" behavior of momentum-based methods, which make the optimization look like a ball rolling down the hill. Right: A visualization of a saddle point in the optimization landscape, where the curvature along different dimension has different signs (one dimension curves up and another down). Notice that SGD has a very hard time breaking symmetry and gets stuck on the top. Conversely, algorithms such as RMSprop will see very low gradients in the saddle direction. Due to the denominator term in the RMSprop update, this will increase the effective learning rate along this direction, helping RMSProp proceed. Images credit: Alec Radford. + 이 동영상이 학습 과정에서의 동역학(dynamics)를 직관적으로 이해하는데 도움이 되길 바란다. + 왼쪽: 손실 함수의 등고선 위에서 각 최적화 알고리즘들의 시간(iteration)에 따른 변화. 모멘텀-기반 방법론들의 "급가속" 행동들을 주목하라. 이게 최적화를 마치 언덕을 내려가는 공처럼 보이게 만든다. 오른쪽: 목적함수에 안장점(saddle point)가 있을 때의 시각화. 안장점은 그라디언트가 0이지만 헤시안 행렬의 고유치(eigenvalue)에 양수/음수가 섞여있을 때 발생한다. SGD는 안장점에서 빠져나오는 데 매우 힘든 시간을 겪는다. 반대로, RMSprop같은 알고리즘들은 안장의 방향으로 매우 작은 그라디언트를 마주하게 되지만 분모-표준화 성질 덕분에 이 방향의 실질 학습속도를 높아질 수 있고 따라서 이 방향으로 빠져나올 수 있다. Images credit: Alec Radford.

-### Hyperparameter optimization
+### 초모수 최적화 (Hyperparameter optimization)

-As we've seen, training Neural Networks can involve many hyperparameter settings. The most common hyperparameters in context of Neural Networks include:
+일전에 본 대로, 신경망(neural network)의 훈련에는 많은 초모수(hyperparameter) 설정이 관련된다. 신경망 관련 논의에서 가장 빈번하게 등장하는 초모수는 다음과 같다:

-- the initial learning rate
-- learning rate decay schedule (such as the decay constant)
-- regularization strength (L2 penalty, dropout strength)
+- 학습속도의 초기값 (the initial learning rate)
+- 학습속도 경감 계획, 이를테면 경감 상수 (learning rate decay schedule (such as the decay constant))
+- L2나 드랍아웃 페널티의 정규화 강도 (regularization strength (L2 penalty, dropout strength))

-But as saw, there are many more relatively less sensitive hyperparameters, for example in per-parameter adaptive learning methods, the setting of momentum and its schedule, etc. In this section we describe some additional tips and tricks for performing the hyperparameter search:
+그렇지만 역시 본 대로, 상대적으로 덜 민감한 초모수들도 있는데, 이를테면 파라미터별 데이터-맞춤 학습 방법이나 모멘텀과 그 스케쥴 설정 등이 그렇다. 이번 절에서는 초모수 검색을 수행하기 위한 추가적인 팁과 트릭들을 소개한다:

-**Implementation**. Larger Neural Networks typically require a long time to train, so performing hyperparameter search can take many days/weeks. It is important to keep this in mind since it influences the design of your code base. One particular design is to have a **worker** that continuously samples random hyperparameters and performs the optimization. During the training, the worker will keep track of the validation performance after every epoch, and writes a model checkpoint (together with miscellaneous training statistics such as the loss over time) to a file, preferably on a shared file system. It is useful to include the validation performance directly in the filename, so that it is simple to inspect and sort the progress. Then there is a second program which we will call a **master**, which launches or kills workers across a computing cluster, and may additionally inspect the checkpoints written by workers and plot their training statistics, etc.
+**코드 구성 단계에서 (Implementation)**. 큰 신경망은 대개 긴 학습시간이 걸리고, 따라서 초모수 검색에는 며칠, 몇 주가 걸릴 수도 있다. 코드를 짤 때 이 점을 염두에 두는 것이 중요하다 (코드 베이스의 구성이 달라질 수도 있다). 하나 가능한 코드 구성은, 초모수를 임의로 선택하여 최적화를 수행하는 **일꾼(worker)**을 만드는 것이다. 이 일꾼에게 훈련 과정에서 매 에폭 뒤의 검증 성능을 쭉 추적하여 모형의 체크포인트들을 (다른 훈련 통계량들, 이를테면 시간에 따른 손실함수값들과 함께) 파일에 저장케 하라. 공유 파일 시스템 위에 저장하면 더 좋다. 검증 성능을 아예 직접 파일 이름에 써 놓는 것도 괜찮다. 그러면 진행 상황을 확인하고 정렬하기가 간단해진다. 그리고 **마스터(master)**라 불릴 두번째 프로그램을 만들어서 계산 클러스터 전체에 걸쳐 일꾼들을 개시(launch)하거나 끝내(kill)게 하라. 혹은 마스터는 일꾼이 작성한 체크포인트들을 조사하고 훈련 통계량들로 그림을 그릴 수도 있다.

-**Prefer one validation fold to cross-validation**. In most cases a single validation set of respectable size substantially simplifies the code base, without the need for cross-validation with multiple folds. You'll hear people say they "cross-validated" a parameter, but many times it is assumed that they still only used a single validation set.
+**교차검증보다는 단일한 검증 집합 (Prefer one validation fold to cross-validation)**. 많은 경우에, 적당한 크기의 검증 집합을 하나 설정해 두어 한 번만 검증하는 것이, 여러 번의 교차검증보다 코드를 단순하게 만든다. 사람들이 어떤 파라미터를 "교차검증" 했다고 얘기해도, 많은 경우에 실제로는 단일한 검증 집합만 썼을 것이다.

-**Hyperparameter ranges**. Search for hyperparameters on log scale. For example, a typical sampling of the learning rate would look as follows: `learning_rate = 10 ** uniform(-6, 1)`. That is, we are generating a random random with a uniform distribution, but then raising it to the power of 10. The same strategy should be used for the regularization strength. 
Intuitively, this is because learning rate and regularization strength have multiplicative effects on the training dynamics. For example, a fixed change of adding 0.01 to a learning rate has huge effects on the dynamics if the learning rate is 0.001, but nearly no effect if the learning rate when it is 10. This is because the learning rate multiplies the computed gradient in the update. Therefore, it is much more natural to consider a range of learning rate multiplied or divided by some value, than a range of learning rate added or subtracted to by some value. Some parameters (e.g. dropout) are instead usually searched in the original scale (e.g. `dropout = uniform(0,1)`).
+**초모수의 범위 (Hyperparameter ranges)**. 로그 스케일로 초모수를 찾아라. 예를 들어, 학습 속도의 전형적인 샘플링은 다음과 같은 모습일 것이다: `learning_rate = 10 ** uniform(-6, 1)`. 다시 말하면, 균등분포에서 난수를 뽑은 뒤에 이를 10의 거듭제곱 지수로 쓰는 것이다. 같은 전략이 정규화 강도 검색에도 사용되어야 한다. 왜냐고? 직관적으로, 학습 속도와 정규화 강도는 학습 동역학에 배수적인(multiplicative) 효과가 있기 때문이다 - 학습 속도는 업데이트에서 그라디언트에 곱해지는 수이다. 이를테면, 학습 속도에 0.01을 더하는 고정된 변화는 학습 속도가 0.001일 때에는 동역학에 큰 영향을 미치지만 학습 속도가 10인 경우에는 거의 영향이 없다. 그러므로 학습 속도의 범위는 어떤 값을 계속 곱하거나 나누는 것으로 생각하는 편이 (빼거나 더하는 것보다) 훨씬 자연스럽다. 대신에, 어떤 초모수들(이를테면 드랍아웃)은 보통의 스케일에서 검색된다 (예. `dropout = uniform(0,1)`).

-**Prefer random search to grid search**. As argued by Bergstra and Bengio in [Random Search for Hyper-Parameter Optimization](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf), "randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid". As it turns out, this is also usually easier to implement.
+**그리드 검색보다는 임의 검색 (Prefer random search to grid search)**. Bergstra and Bengio는 논문 [Random Search for Hyper-Parameter Optimization](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf)에서 "randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid" (임의로 선택된 시도들이 그리드 위의 시도들보다 초모수 최적화에 더 효율적)라고 논증하였다. 그리고 밝혀진 대로, 임의 검색이 구현하기도 더 쉽다.

- Core illustration from Random Search for Hyper-Parameter Optimization by Bergstra and Bengio. It is very often the case that some of the hyperparameters matter much more than others (e.g. top hyperparam vs. left one in this figure). Performing random search rather than grid search allows you to much more precisely discover good values for the important ones.
+ Bergstra and Bengio의 논문 (Random Search for Hyper-Parameter Optimization)의 핵심을 도식화한 그림. 어떤 초모수는 다른 것보다 훨씬 중요할 때가 많다 (예. 이 그림에서 위쪽에 그려진 중요한 초모수 vs. 왼쪽에 그려진 중요하지 않은 초모수). 그리드 검색보다 임의 검색을 수행하면 중요한 초모수의 좋은 값을 훨씬 더 정밀하게 발견할 수 있다.
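
(역자 주: 위에서 설명한 로그 스케일 샘플링과 임의 검색을 결합하면 대략 다음과 같은 모습이 된다. `train_and_eval`은 주어진 초모수로 모형을 훈련하고 검증 정확도를 돌려준다고 가정한 가상의 함수이다.)

~~~python
import numpy as np

def train_and_eval(learning_rate, reg):
    # 가상의 함수: 실제로는 여기서 모형을 훈련하고 검증 정확도를 반환한다.
    # 데모를 위해, 특정 초모수 근방에서 높아지는 가짜 "정확도"를 흉내낸다.
    return 1.0 - 0.1 * abs(np.log10(learning_rate) + 3) \
               - 0.01 * abs(np.log10(reg) + 4)

best_acc, best_hp = -1.0, None
for trial in range(100):
    # 학습 속도와 정규화 강도는 로그 스케일에서 임의로 샘플링한다
    lr = 10 ** np.random.uniform(-6, 1)
    reg = 10 ** np.random.uniform(-7, 0)
    acc = train_and_eval(lr, reg)
    if acc > best_acc:
        best_acc, best_hp = acc, (lr, reg)
print('best val accuracy %.3f with lr=%e, reg=%e' % ((best_acc,) + best_hp))
~~~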
-**Careful with best values on border**. Sometimes it can happen that you're searching for a hyperparameter (e.g. learning rate) in a bad range. For example, suppose we use `learning_rate = 10 ** uniform(-6, 1)`. Once we receive the results, it is important to double check that the final learning rate is not at the edge of this interval, or otherwise you may be missing more optimal hyperparameter setting beyond the interval.
+**가장 좋은 값이 경계에 있으면 조심하라 (Careful with best values on border)**. 가끔은 초모수 검색 범위(이를테면 학습 속도의 범위)가 나쁘게 설정되었을 수도 있다. 이를테면, `learning_rate = 10 ** uniform(-6, 1)`을 사용한다고 가정하여 보자. 한번 결과를 받았으면, 최종적으로 선택된 학습 속도가 이 구간의 끝자락에 있지는 않은지 다시 확인하는 것이 중요하다. 그렇지 않으면, 당신은 구간 밖에 있는 더 최적의 초모수 설정을 놓치고 있을는지도 모른다.

-**Stage your search from coarse to fine**. In practice, it can be helpful to first search in coarse ranges (e.g. 10 ** [-6, 1]), and then depending on where the best results are turning up, narrow the range. Also, it can be helpful to perform the initial coarse search while only training for 1 epoch or even less, because many hyperparameter settings can lead the model to not learn at all, or immediately explode with infinite cost. The second stage could then perform a narrower search with 5 epochs, and the last stage could perform a detailed search in the final range for many more epochs (for example).
+**성긴 검색에서 촘촘한 검색으로 (Stage your search from coarse to fine)**. 실전에서는, 처음에는 널찍한 범위에서 검색을 하다가 (예. 10 ** [-6, 1]), 좋은 결과가 어디에서 나오느냐에 따라 범위를 좁힐 수도 있다. 또한, 처음의 성긴 검색에서는 1 에폭 혹은 그보다 더 적게만 훈련하는 게 도움이 될 수도 있는데, 왜냐하면 많은 초모수 세팅에서는 모형이 전혀 학습을 못하거나 손실함수값이 즉시 무한대로 폭발할 수도 있기 때문이다. 두 번째 단계에서는 좀더 좁은 범위에서의 검색을 5 에폭 정도로 수행할 수 있을 것이다. 그리고 마지막 검색에서는 최종적으로 좁혀진 범위에서 훨씬 많은 에폭의 훈련을 수행해도 좋겠다.

-**Bayesian Hyperparameter Optimization** is a whole area of research devoted to coming up with algorithms that try to more efficiently navigate the space of hyperparameters. The core idea is to appropriately balance the exploration - exploitation trade-off when querying the performance at different hyperparameters. Multiple libraries have been developed based on these models as well, among some of the better known ones are [Spearmint](https://github.com/JasperSnoek/spearmint), [SMAC](http://www.cs.ubc.ca/labs/beta/Projects/SMAC/), and [Hyperopt](http://jaberg.github.io/hyperopt/). However, in practical settings with ConvNets it is still relatively difficult to beat random search in a carefully-chosen intervals. See some additional from-the-trenches discussion [here](http://nlpers.blogspot.com/2014/10/hyperparameter-search-bayesian.html).
+**베이지안 초모수 최적화 (Bayesian Hyperparameter Optimization)**는 초모수 공간을 좀 더 효율적으로 항해하는 알고리즘을 고안하기 위한 연구 분야이다. 핵심 아이디어는 서로 다른 초모수들의 성능을 평가할 때 탐험(exploration)-개발(exploitation)의 상충(trade-off)에서 적절한 균형을 찾는 것이다. 많은 라이브러리들이 이 모형에 기반하여 개발되었고 그 중에 잘 알려진 것은 [Spearmint](https://github.com/JasperSnoek/spearmint), [SMAC](http://www.cs.ubc.ca/labs/beta/Projects/SMAC/), 그리고 [Hyperopt](http://jaberg.github.io/hyperopt/)이다. 그러나, ConvNet 관련된 실전 세팅에서는 조심스레 선택된 구간에서의 임의 검색을 이기기가 아직은 상대적으로 어렵다. 실전 현장에서의(from-the-trenches) 추가적인 논의는 [여기](http://nlpers.blogspot.com/2014/10/hyperparameter-search-bayesian.html)를 참조하라.

-## Evaluation
+## 평가

-### Model Ensembles
+### 모형 앙상블 (Model Ensembles)

-In practice, one reliable approach to improving the performance of Neural Networks by a few percent is to train multiple independent models, and at test time average their predictions. As the number of models in the ensemble increases, the performance typically monotonically improves (though with diminishing returns). 
Moreover, the improvements are more dramatic with higher model variety in the ensemble. There are a few approaches to forming an ensemble: +실전에서, 신경망(neural network)의 성능을 몇 퍼센트 끌어올릴 수 있는 믿을 만한 방법이 하나 있는데 바로 여러 개의 독립적인 모형을 만들고 테스트 때 그들의 평균 예측을 취하는 것이다. 앙상블에 관여하는 모형이 많아지면, 보통 성능은 단조적으로 개선된다 (비록 개선 정도가 점점 떨어질지라도). 게다가, 앙상블 내에서 모형의 다양함이 늘어날수록 성능의 개선은 더 극적이다. 아래는 앙상블을 구축하는 몇 가지 방법이다: -- **Same model, different initializations**. Use cross-validation to determine the best hyperparameters, then train multiple models with the best set of hyperparameters but with different random initialization. The danger with this approach is that the variety is only due to initialization. -- **Top models discovered during cross-validation**. Use cross-validation to determine the best hyperparameters, then pick the top few (e.g. 10) models to form the ensemble. This improves the variety of the ensemble but has the danger of including suboptimal models. In practice, this can be easier to perform since it doesn't require additional retraining of models after cross-validation -- **Different checkpoints of a single model**. If training is very expensive, some people have had limited success in taking different checkpoints of a single network over time (for example after every epoch) and using those to form an ensemble. Clearly, this suffers from some lack of variety, but can still work reasonably well in practice. The advantage of this approach is that is very cheap. -- **Running average of parameters during training**. Related to the last point, a cheap way of almost always getting an extra percent or two of performance is to maintain a second copy of the network's weights in memory that maintains an exponentially decaying sum of previous weights during training. This way you're averaging the state of the network over last several iterations. You will find that this "smoothed" version of the weights over last few steps almost always achieves better validation error. The rough intuition to have in mind is that the objective is bowl-shaped and your network is jumping around the mode, so the average has a higher chance of being somewhere nearer the mode. +- **같은 모형, 다른 초기화 (Same model, different initializations)**. 교차 검증으로 최고의 초모수를 결정한 다음에, 같은 초모수를 이용하되 초기값을 임의로 다양하게 여러 모형을 훈련한다. 이 접근법의 위험은, 모형의 다양성이 오직 다양한 초기값에서만 온다는 것이다. +- **교차 검증 동안 발견되는 최고의 모형들 (Top models discovered during cross-validation)**. 교차 검증으로 최고의 초모수(들)를 결정한 다음에, 몇 개의 최고 모형을 선정하여 (예. 10개) 이들로 앙상블을 구축한다. 이 방법은 앙상블 내의 다양성을 증대시키나, 준-최적 모형을 포함할 수도 있는 위험이 있다. 실전에서는 이를 수행하는 게 (위보다) 쉬운 편인데, 교차 검증 뒤에 추가적인 모형의 재훈련이 필요없기 때문이다. +- **한 모형에서 다른 체크포인트들을 (Different checkpoints of a single model)**. 만약 훈련이 매우 값비싸면, 어떤 사람들은 단일한 네트워크의 체크포인트들을 (이를테면 매 에폭 후) 앙상블하여 제한적인 성공을 거둔 바 있음을 기억해 두라. 명백하게 이 방법은 다양성이 떨어지지만, 실전에서는 합리적으로 잘 작동할 수 있다. 이 방법은 매우 간편하고 저렴하다는 것이 장점이다. +- **훈련 동안의 모수값들에 평균을 취하기 (Running average of parameters during training)**. 훈련 동안 (시간에 따른) 웨이트 값들의 지수 하강 합(exponentially decaying sum)을 저장하는 제 2의 네트워크를 만들면 언제나 몇 퍼센트의 이득을 값싸게 취할 수 있다. 이 방식으로 당신은 최근 몇 iteration 동안의 네트워크에 평균을 취한다고 생각할 수도 있다. 마지막 몇 스텝 동안의 웨이트값들을 이렇게 "안정화" 시킴으로써 당신은 언제나 더 나은 검증 오차를 얻을 수 있다. 거친 직관으로 생각하자면, 목적함수는 볼(bowl)-모양이고 당신의 네트워크는 극값(mode) 주변을 맴돌 것이므로, 평균을 취하면 극값에 더 가까운 어딘가에 다다를 기회가 더 많아질 것이다. -One disadvantage of model ensembles is that they take longer to evaluate on test example. 
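
(역자 주: 위 목록의 마지막 항목인 "훈련 동안의 모수값 평균"을 간단히 스케치하면 다음과 같다. 감쇠 상수 `ema_decay`와 토이 그라디언트는 설명을 위해 가정한 값들이다.)

~~~python
import numpy as np

ema_decay = 0.999                     # 지수 하강 합의 감쇠 상수 (가정한 값)
x = np.random.randn(10)               # 훈련 중인 웨이트 벡터
x_ema = x.copy()                      # 평가용 "안정화된" 제 2의 웨이트 복사본

for step in range(1000):
    dx = 2 * x                        # 실제로는 역전파로 얻은 그라디언트
    x += -1e-2 * dx                   # 보통의 SGD 업데이트
    # 웨이트값들의 지수 이동평균을 함께 유지한다
    x_ema = ema_decay * x_ema + (1 - ema_decay) * x

# 테스트 시에는 x 대신 x_ema로 예측하면 종종 검증 오차가 더 좋아진다
print(np.linalg.norm(x), np.linalg.norm(x_ema))
~~~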
An interested reader may find the recent work from Geoff Hinton on ["Dark Knowledge"](https://www.youtube.com/watch?v=EK61htlw8hY) inspiring, where the idea is to "distill" a good ensemble back to a single model by incorporating the ensemble log likelihoods into a modified objective.
+모형 앙상블의 단점이 하나 있다면 테스트 샘플에 모형을 적용할 때 평가(evaluation)에 더 시간이 걸린다는 점이다. 흥미로운 독자는 Geoff Hinton의 ["Dark Knowledge"](https://www.youtube.com/watch?v=EK61htlw8hY)에서 영감을 얻을 수도 있겠다. 여기서의 아이디어는 좋은 앙상블 모형을 하나의 모형으로 "증류"하는 것인데, 앙상블 모형의 로그-가능도(log-likelihood)를 어떤 변형된 목적함수로 통합하는 작업과 관련이 있다.

-## Summary
+## 요약 (Summary)

-To train a Neural Network:
+신경망(neural network)을 훈련하기 위하여:

-- Gradient check your implementation with a small batch of data and be aware of the pitfalls.
-- As a sanity check, make sure your initial loss is reasonable, and that you can achieve 100% training accuracy on a very small portion of the data
-- During training, monitor the loss, the training/validation accuracy, and if you're feeling fancier, the magnitude of updates in relation to parameter values (it should be ~1e-3), and when dealing with ConvNets, the first-layer weights.
-- The two recommended updates to use are either SGD+Nesterov Momentum or Adam.
-- Decay your learning rate over the period of the training. For example, halve the learning rate after a fixed number of epochs, or whenever the validation accuracy tops off.
-- Search for good hyperparameters with random search (not grid search). Stage your search from coarse (wide hyperparameter ranges, training only for 1-5 epochs), to fine (narrower rangers, training for many more epochs)
-- Form model ensembles for extra performance
+- 코드를 짜는 중간중간에 작은 배치로 그라디언트를 체크하고, 빠지기 쉬운 함정(pitfall)들을 인지하고 있으라.
+- 코드가 제대로 돌아가는지 확인하는 방법으로, 손실함수값의 초기값이 합리적인지 그리고 데이터의 일부분으로 100%의 훈련 정확도를 달성할 수 있는지 확인하라.
+- 훈련 동안, 손실함수와 훈련/검증 정확도를 계속 살펴보고, (이게 좀 더 멋져 보이면) 현재 파라미터 값 대비 업데이트 값 또한 살펴보라 (대충 ~1e-3 정도 되어야 한다). 만약 ConvNet을 다루고 있다면, 첫 층의 웨이트값도 살펴보라.
+- 업데이트 방법으로 추천하는 건 SGD+Nesterov Momentum 혹은 Adam이다.
+- 학습 속도를 훈련 동안 계속 하강시켜라. 예를 들면, 정해진 에폭 수 뒤에 (혹은 검증 정확도가 상승하다가 하강세로 꺾이면) 학습 속도를 반으로 깎아라.
+- 초모수 검색은 그리드 검색이 아닌 임의 검색으로 수행하라. 처음에는 성긴 규모에서 탐색하다가 (넓은 초모수 범위, 1-5 에폭 정도만 학습), 점점 촘촘하게 검색하라 (좁은 범위, 더 많은 에폭에서 학습).
+- 추가적인 개선을 위하여 모형 앙상블을 구축하라.

-## Additional References
+## 추가 참고문헌

- [SGD](http://research.microsoft.com/pubs/192769/tricks-2012.pdf) tips and tricks from Leon Bottou
- [Efficient BackProp](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf) (pdf) from Yann LeCun

From 6ef27d4053efc4d20f6a677cb7efc6e39d5f0753 Mon Sep 17 00:00:00 2001
From: YB
Date: Mon, 5 Sep 2016 17:38:03 -0400
Subject: [PATCH 197/199] Lecture1 - part 281~325 (out of 715) en / ko

---
 captions/En/Lecture1_en.srt | 159 ++++++++++++++++++------------------
 captions/Ko/Lecture1_ko.srt | 110 ++++++++++++++-----------
 2 files changed, 142 insertions(+), 127 deletions(-)

diff --git a/captions/En/Lecture1_en.srt b/captions/En/Lecture1_en.srt
index ab9b9320..c836c419 100644
--- a/captions/En/Lecture1_en.srt
+++ b/captions/En/Lecture1_en.srt
@@ -1377,69 +1377,70 @@
So, that's a little bit of a proud of Stanford history.

281
00:31:26,450 --> 00:31:31,720
-But, anyway we have to give us credit for starting the field of
+But, anyway we have to give MIT this credit
+for starting the field of computer vision,

282
00:31:31,720 --> 00:31:41,380
-computer vision because in the summer of
-1966 a professor at MIT I decided it's
+because in the summer of 1966,
+a professor at MIT AI lab decided it's time to solve vision. 
283 00:31:41,380 --> 00:31:46,630 -time to salvage you know so I was -established we will start to understand +You know, so AI was established. +We start to understand, you know, first all the logic and all this. 284 00:31:46,630 --> 00:31:55,010 -I think this proves probably invented at -that time but anyway +I think this proves probably invented at that time but anyway, 285 00:31:55,009 --> 00:32:01,109 -vision is so easy you open your eyes you -see the world how can this be love one +vision is so easy. You open your eyes, +you see the world. How hard can this be? 286 00:32:01,109 --> 00:32:04,109 -summer so +Let's solve in one summer. +So, especially MIT students are smart, right? 287 00:32:04,109 --> 00:32:18,729 -so the summer vision project is an -attempt to use our visual system this +So, the summer vision project is an attempt +to use our summer workers effectively +in a construction of significant part of a visual system. 288 00:32:18,730 --> 00:32:24,329 -was the proposal from last number and -maybe they didn't use their summer work +This was the proposal from that summer +and maybe they didn't use their summer workers effectively, 289 00:32:24,329 --> 00:32:30,490 -effectively but in any case how -individual was not solved in that silver +but in any case, computer vision was not solved in that summer. 290 00:32:30,490 --> 00:32:35,740 -since then they become the fastest -growing field of computer vision and I +Since then, they become the fastest +growing field of computer vision and AI. 291 00:32:35,740 --> 00:32:43,679 -if you go to today's premium computer -vision conferences cost CPR or ICC we we +If you go to today's premium computer +vision conferences called CVPR or ICCV, 292 00:32:43,679 --> 00:32:52,160 -have like 2000 to 2500 researchers +we have like 2000 to 2500 researchers worldwide attending this conference and 293 00:32:52,160 --> 00:33:00,620 -very practical note 44 students if you +very practical note for students if you are a good computer vision / machine 294 -00:33:00,619 --> 00:33:05,369 +00:33:00,620 --> 00:33:05,369 learning students you will not worry about jobs in Silicon Valley or or @@ -1450,158 +1451,158 @@ the most exciting field but that was the 296 00:33:11,569 --> 00:33:19,210 -birth of computer vision which means +birthday of computer vision which means this year is the fiftieth anniversary of 297 00:33:19,210 --> 00:33:25,829 computer vision that's a very exciting -year in computer vision I we have a +year in computer vision and we have come 298 00:33:25,829 --> 00:33:28,529 -caller who long long way +a long long way 299 00:33:28,529 --> 00:33:31,660 -ok so continued of computer vision +ok so continue on the history of computer vision 300 00:33:31,660 --> 00:33:38,169 this is a person to remember David Mark -he he was also at MIT at that time +he was also at MIT at that time 301 00:33:38,169 --> 00:33:50,240 -working with a number of shimon tommy -tommy pope Jill and Mark himself died +working with a number of very influential +computer vision scientists like Shimon Ullman, Tommy Poggio. +and Mark himself died 302 00:33:50,240 --> 00:33:58,808 -early in the seventies very influential -book called vision it's a very book +early in the seventies and he wrote very influential +book called "Vision". It's a very thin book. 303 00:33:58,808 --> 00:34:08,148 -smart thinking about vision he took a -lot of insights from your signs where he +And David Mark's thinking about vision, he took a +lot of insights from neuro-science. 
304 00:34:08,148 --> 00:34:14,868 -said that he want reasonable give us the -concept of simple structure regions +We have already said that +Hubel and Wiesel give us the concept of simple structure. 305 00:34:14,869 --> 00:34:16,539 -start with +Vision starts with simple structure. 306 00:34:16,539 --> 00:34:23,259 -simple structure in today and start with -a holistic fish or holistic not was mark +It didn't start with a holistic fish or holistic mouse. 307 00:34:23,260 --> 00:34:28,679 -give us the next important insight and -these two inside together is the +David Mark gave us the next important insight +and these two insights together is the 308 00:34:28,679 --> 00:34:35,740 beginning of deep learning architecture -is that vision is hierarchical you know +is that vision is hierarchical. 309 -00:34:35,739 --> 00:34:44,029 -so you will have easily said ok we start -simple but this major world is extremely +00:34:35,740 --> 00:34:44,029 +You know so, Hubel and Wiesel said ok we start simple, +but Hubel and Wiesel didn't say we end simple. +This visual world is extremely complex. 310 00:34:44,030 --> 00:34:49,540 -complex in fact I take a picture a -regular picture today with my eiffel +In fact, I take a picture, a regular picture today with my iPhone. 311 -00:34:49,539 --> 00:34:58,309 -there is no my iPhone's resolution let's -suppose it's like turned up exhaust the +00:34:49,540 --> 00:34:58,309 +There is, I don't know my iPhone's resolution. +Let's suppose it's like 10 mega-pixels. 312 00:34:58,309 --> 00:35:05,059 -potential combination of pixels or -picture in that is bigger than the total +The potential combination of pixels to form +a picture in that is bigger than the total 313 00:35:05,059 --> 00:35:11,429 -number of atoms in the universe that's -how complex vision can be is it's it's +number of atoms in the universe. +That's how complex vision can be. 314 00:35:11,429 --> 00:35:18,539 -really really complex human beings are -told us they are simple David Mark told +It's really really complex. +So, Hubel and Wiesel told us to start simple. +David Mark told 315 00:35:18,539 --> 00:35:25,130 -us build a hierarchical model of course -there mark didn't tell us to build it in +us build a hierarchical model. Of course +David Mark didn't tell us to build it in 316 00:35:25,130 --> 00:35:29,400 -the coalition on your network which will -cover for the rest of the quarter but +the covolution neural network which will +cover for the rest of the quarter 317 00:35:29,400 --> 00:35:36,990 -his ideas is to represent or to think -about it image we think about it in +but his ideas is this. To represent or to think +about an image, we think about it in 318 00:35:36,989 --> 00:35:42,129 -several layers the first one he thinks +several layers. The first one he thinks we should think about that edge image 319 00:35:42,130 --> 00:35:49,110 -which is clearly an inspiration noted -took the inspiration from these oh and +which is clearly an inspiration, +took the inspiration from Hubel and Wiesel and 320 00:35:49,110 --> 00:35:52,579 -he personally called this the primal -sketch +he personally called this the Primal Sketch. 321 00:35:52,579 --> 00:35:55,730 -you know the name is Sophie explain it +You know, the name is self-explainary. 
322 00:35:55,730 --> 00:36:02,400 -explained every now and then you think -about one-half the this is work you +and then you think about 2.5D +This is where you 323 00:36:02,400 --> 00:36:08,829 -start to Rick reconcile your 2d image -with the 3d world you recognize there is +start to reconcile your 2D image +with the 3D world. You recognize there is 324 00:36:08,829 --> 00:36:15,679 -layers right I look at you right now I -don't think half of you only has upheld +layers right? I look at you right now. +I don't think half of you only has 325 00:36:15,679 --> 00:36:17,239 -in the neck + head and the neck 326 00:36:17,239 --> 00:36:22,799 -even though that's all I see there is I -know you're all concluded by the row in +even though that's all I see, but there is, +I know you're occluded by the row in 327 00:36:22,800 --> 00:36:29,680 -front of you and this challenge will -post problem to solve +front of you and this is the fundamental challenge of the Vision. +We have ill-post problem to solve 328 00:36:29,679 --> 00:36:38,118 diff --git a/captions/Ko/Lecture1_ko.srt b/captions/Ko/Lecture1_ko.srt index 6d7d94a2..72eb704c 100644 --- a/captions/Ko/Lecture1_ko.srt +++ b/captions/Ko/Lecture1_ko.srt @@ -1155,183 +1155,197 @@ 281 00:31:26,450 --> 00:31:31,720 - stempler 역사 어쨌든 우리는 분야를 시작하는 우리에게 신용을 제공해야 + 하지만 컴퓨터 비전을 시작한 업적은 MIT에게 돌아갑니다. 282 00:31:31,720 --> 00:31:41,380 - 컴퓨터 비전이 때문에 1966 년 여름에 MIT의 교수 나는 그것의 결정 + 1966년 여름 MIT의 인공지능 연구실 교수는 비전 연구를 시작하기로 결정합니다. 283 00:31:41,380 --> 00:31:46,630 - 시간은 그래서 우리가 이해하기 시작 설립되었다 알고 구제하기 + 그로인해 인공지능 연구실이 설립되고 우리는 이런 저런 로직들을 이해할 수 있게 됩니다. 284 00:31:46,630 --> 00:31:55,010 - 나는 이것이 아마 그 시간에 어쨌든 발명 입증한다 생각 + 이것이 아마 그 시간에 발명되었다는 것을 보여준다고 생각해요. 285 00:31:55,009 --> 00:32:01,109 - 비전 당신이 당신의 눈을 너무 쉽게 열려있는이 사랑의 하나가 될 수있는 방법을 세상을보고 + 아무튼 비전은 너무 쉬워요. 눈을 뜨고 세상을 바라보면 됩니다. + 이게 어려우면 얼마나 어렵겠어요? 286 00:32:01,109 --> 00:32:04,109 - 여름 그래서 + 그러니 이 문제를 여름안에 풀어봅시다! MIT학생들은 똘똘하잖아요? 287 00:32:04,109 --> 00:32:18,729 - 그래서 여름 비전 프로젝트는 우리의 시각 시스템이 사용하기위한 시도이다 + 그래서 여름 비전 프로젝트는 여름동안의 인력을 효율적으로 사용해서 + 우리의 시각 시스템의 상당한 부분을 만들어내려는 시도였어요. 288 00:32:18,730 --> 00:32:24,329 - 마지막 번호의 제안이었다 어쩌면 그들은 그들의 여름 작업을 사용하지 않은 + 그리고 이 시도가 그 해 여름동안의 계획이었어요. + 하지만 아마 그들이 충분히 효율적이지 못했었기 때문일까요? 289 00:32:24,329 --> 00:32:30,490 - 효과적으로하지만 어떤 경우에 어떻게 개인이 그 실버 해결되지 않았다 + 컴퓨터 비전의 문제는 그 해 여름안에 해결되지 않았죠. 290 00:32:30,490 --> 00:32:35,740 - 그 이후로 그들은 컴퓨터 비전 및 I의 가장 빠르게 성장하는 분야가 될 + 하지만 그 이후로 컴퓨터 비전과 인공지능은 가장 빠르게 성장하는 분야가 되었습니다. 291 00:32:35,740 --> 00:32:43,679 - 오늘의 프리미엄 컴퓨터 비전 컨퍼런스에 가면 CPR 또는 우리 ICC 비용 + 오늘날 CPVR이나 ICCV와 같은 유명한 컴퓨터 비전 컨퍼런스에는 292 00:32:43,679 --> 00:32:52,160 - 2000 2500에 대한 연구가 전 세계적으로이 회의에 참석처럼 가지고 + 전 세계적으로 2천에서 2천 5백명이 넘는 연구자들이 참가하고 있어요. 293 00:32:52,160 --> 00:33:00,620 - 매우 실용적인 노트 (44) 학생 당신은 좋은 컴퓨터 비전 / 기계 경우 + 학생들을 위한 매우 현실적인 이야기를 해보자면 294 00:33:00,619 --> 00:33:05,369 - 학생들이 학습 당신은 실리콘 밸리 나 또는 작업에 대해 걱정하지 않습니다 + 여러분이 훌륭한 머신러닝 혹은 비전을 공부한 학생들이라면 + 실리콘 밸리 혹은 어떤 곳에서도 취업걱정할 일은 없을거예요. 295 00:33:05,369 --> 00:33:11,569 - 다른 곳에서는 실제로 가장 흥미로운 분야 중 하나입니다하지만 그래서이었다 + 비전은 실제로 가장 흥미로운 분야 중 하나이며 그때에 바로 비전의 생일입니다. 296 00:33:11,569 --> 00:33:19,210 - 올해 의미 컴퓨터 비전의 탄생의 50 주년입니다 + 말인즉슨 올해가 컴퓨터 비전의 탄생의 50주년입니다. 297 00:33:19,210 --> 00:33:25,829 - 우리가이 컴퓨터 비전 I에서 매우 흥미로운 년의 컴퓨터 비전 + 아주 신나는 연도이며 298 00:33:25,829 --> 00:33:28,529 - 발신자 사람 오래 오래 방법 + 지금까지 비전은 참으로 먼 길을 걸어왔어요. 299 00:33:28,529 --> 00:33:31,660 - 확인 그래서 컴퓨터 비전의 계속 + 자 다시 컴퓨터 비전의 역사로 돌아가 봅시다. 300 00:33:31,660 --> 00:33:38,169 - 이 그가 그가 그 시간에 MIT에서도했다 데이비드 마크를 기억하는 사람 + 여러분이 기억해야할 사람이 하나 있습니다. 
+ David Mark, 그 또한 그 당시 MIT에서 301 00:33:38,169 --> 00:33:50,240 - 사망 시몬 토미 토미 교황 질 마크 자신의 번호와 작업 + Shimon Ullman, Tommy Poggio와 같은 많은 영향력있는 컴퓨터 비전 연구자들과 작업했습니다. 302 00:33:50,240 --> 00:33:58,808 - 초기 70 년대의 비전라는 매우 영향력있는 책은 매우 책입니다 + 그리고 그 자신은 70년대 초에 일찍 세상을 떠낫지만 "Vision" 이라는 매우 영향력있는 책을 펴냈어요. + 매우 얇은 책이죠. 303 00:33:58,808 --> 00:34:08,148 - 비전에 대한 스마트 생각 그는 당신의 표지판 통찰력을 많이했다 그가 어디 + David Mark가 가진 비전에 대한 생각들에 있어서 그는 신경과학에서 많은 영감을 받았어요. 304 00:34:08,148 --> 00:34:14,868 - 그는 우리에게 단순한 구조 영역의 개념을 제공 합리적인한다고 말했다 + 우리는 Hubel과 Wiesel이 발견한 단순한 구조에 대한 컨셉을 이미 다루었죠. 305 00:34:14,869 --> 00:34:16,539 - 시작 + 비전은 단순한 구조에서 시작합니다. 306 00:34:16,539 --> 00:34:23,259 - 오늘의 간단한 구조와 전체적인 물고기로 시작하거나 전체적인 마크 아니었다 + 비전은 신성한 물고기나 쥐에서 시작하지 않았어요. 307 00:34:23,260 --> 00:34:28,679 - 함께 인 우리에게 다음으로 중요한 통찰력과 두 내부를 제공 + David Mark는 그 다음 가장 중요한 컴셉에 대한 통찰을 합니다. + 이 두 가지 동찰이 바로 308 00:34:28,679 --> 00:34:35,740 - 그 비전은 깊은 학습 아키텍처입니다 시작하는 당신이 알고있는 계층이다 + 딥러닝의 구조의 시작인데 그것은 바로 비전은 계층적이라는 것이죠. 309 -00:34:35,739 --> 00:34:44,029 - 그래서 당신은 쉽게 말한 것입니다 확인 우리는 간단하게 시작하지만,이 세계 주요 매우입니다 +00:34:35,740 --> 00:34:44,029 + Hubel과 Wiesel은 간단한 구조로부터 시작한다고 했지만, 간단한 구조로 끝난다고 말하진 않았죠. + 시각적인 세계는 매우 복잡합니다. 310 00:34:44,030 --> 00:34:49,540 - 사실 복잡한 내 에펠 오늘 사진을 일반 사진을 촬영 + 제가 사진을 찍습니다. 아이폰으로 보통 사진을 오늘 찍어요. 311 -00:34:49,539 --> 00:34:58,309 - 내 아이폰의 해상도의 그것의 같은 켜져를 배기를 가정하자 전혀 없다 +00:34:49,540 --> 00:34:58,309 + 음.. 정확히 내 아이폰의 해상도는 모르지만 10 메가픽셀이라고 가정합시다. 312 00:34:58,309 --> 00:35:05,059 - 그 픽셀 또는 그림의 잠재적 인 조합은 전체보다 크다 + 그 해상도 안에서 하나의 그림을 구성할 수 있는 픽셀들의 조합개수는 313 00:35:05,059 --> 00:35:11,429 - 얼마나 복잡한 비전의 우주에있는 원자의 수는 그것의의입니다 + 우주에 존재하는 원자들의 수보다도 많아요. + 그만큼 비전은 복잡해질 수가 있어요. 314 00:35:11,429 --> 00:35:18,539 - 정말 정말 복잡한 인간은 간단한 데이비드 마크는 이야기입니다 우리에게있다 + 비전은 정말 정말 복잡합니다. + Hubel과 Wiesel은 간단한 구조로 시작하라고 했으며, + David Mark는 315 00:35:18,539 --> 00:35:25,130 - 우리는 물론 계층 적 모델을 구축이 마크에 구축 알려하지 않았다 + 계층적인 모델을 만들라고 했어요. + 물론 David Mark가 Convolution Neural Network를 만들라고 하지는 않았죠. 316 00:35:25,130 --> 00:35:29,400 - 분기의 나머지 다룰 것 네트워크의 연합 있지만, + 우리는 나머지 분기동안 그에대해 다룰 것입니다. 317 00:35:29,400 --> 00:35:36,990 - 자신의 아이디어를 표현하기 위해 또는 우리가 그것에 대해 생각하는 이미지 그것에 대해 생각하는 것입니다 + 그의 아이디어는 이렇습니다. 하나의 이미지를 표현하거나 생각을 할 때, 318 -00:35:36,989 --> 00:35:42,129 - 여러 레이어 그는 우리가 그 가장자리 이미지에 대해 생각해야한다고 생각 첫 번째 +00:35:36,989 --> 00:35:42,129잰 + 우리는 여러 계층으로 나누어 생각을 합니다. 그가 생각하는 계층들 중의 첫번째는 선으로 이루어진 이미지입니다. 319 00:35:42,130 --> 00:35:49,110 - 이는 분명히 영감이 주목 오 이들로부터 영감을 가져다가 + 분명히 Hubel과 Wiesel에게서 영감을 받았죠. 320 00:35:49,110 --> 00:35:52,579 - 그는 개인적으로이 시원 스케치라고 + 그는 개인적으로 이것을 Primal Sketch라고 부릅니다. 321 00:35:52,579 --> 00:35:55,730 - 당신의 이름은 소피가 그것을 설명 알고 + 이름이 그 계층에 대해 설명해줍니다. 322 00:35:55,730 --> 00:36:02,400 - 모든 이제 다음은이 작업이 약 1/2 생각 설​​명 + 그 다음에 우리는 2.5 차원으로 생각을 합니다. 이 계층이 바로 323 00:36:02,400 --> 00:36:08,829 - 당신이 인식 3D 세계로 2D 이미지를 조정 릭에 시작 + 당신이 2D 이미지를 3D 세상으로 인식하기 시작하는 계증입니다. + 당신은 324 00:36:08,829 --> 00:36:15,679 - 레이어 바로 지금 당장 당신의 절반 만지지하고있다 생각하지 않는다 당신을보고 + 층들이 존재한다는 것을 인지합니다. 지금 제가 여러분을 바라볼 때, + 제가 여러분들이 머리와 목만 있다고 325 00:36:15,679 --> 00:36:17,239 - 목 + 생각하지는 않아요. 326 00:36:17,239 --> 00:36:22,799 From 9d248a6531cd5acb43fe1877010a656ce089c5f5 Mon Sep 17 00:00:00 2001 From: myungsub Date: Mon, 12 Sep 2016 19:27:33 +0900 Subject: [PATCH 198/199] fix markdown syntax --- index.html | 2 +- neural-networks-3.md | 74 +++++++++++++++++++++++++++----------------- 2 files changed, 47 insertions(+), 29 deletions(-) diff --git a/index.html b/index.html index e6be4127..5c16f963 100644 --- a/index.html +++ b/index.html @@ -191,7 +191,7 @@ 신경망 파트 3: 학습 및 평가 - +
그라디언트 체크, 버그 점검, 학습 과정 모니터링, momentum (+nesterov), 2차(2nd-order) 방법, Adagrad/RMSprop, hyperparameter 최적화, 모델 ensemble diff --git a/neural-networks-3.md b/neural-networks-3.md index 7aaf02f1..3fbac7cb 100644 --- a/neural-networks-3.md +++ b/neural-networks-3.md @@ -29,34 +29,35 @@ Table of Contents: 이전 섹션들에서는 레이어를 몇 층 쌓고 레이어별로 몇 개의 유닛을 준비할지(newwork connectivity), 데이터를 어떻게 준비하고 어떤 손실 함수(loss function)를 선택할지 논하였다. 말하자면 이전 섹션들은 주로 뉴럴 네트워크(Neural Network)의 정적인 부분인데, 본 섹션에서는 동적인 부분들을 소개한다. 파라미터(parameter)를 학습하고 좋은 초-파라미터(hyperparamter)를 찾는 과정 등을 다룰 예정이다. + ### 그라디언트 체크 (Gradient Checks) 이론적인 그라디언트 체크라 하면, 수치적으로 계산한(numerical) 그라디언트와 수식으로 계산한(analytic) 그라디언트를 비교하는 정도라 매우 간단하다고 생각할 수도 있겠다. 그렇지만 이 작업을 직접 실현해 보면 훨씬 복잡하고 뜬금없이 오차가 발생하기도 쉽다는 것을 깨달을 것이다. 이제 팁, 트릭, 조심할 이슈들 몇 개를 소개하고자 한다. -**같은 근사라 하여도 이론적으로 더 정확도가 높은 근사 공식이 있다 (Use the centered formula)**. 그라디언트($\frac{df(x)}{dx}$)를 수치적으로 근사한다 하면 보통 다음 유한 차분 근사(finite difference approximation)를 떠올릴 것이다: +**같은 근사라 하여도 이론적으로 더 정확도가 높은 근사 공식이 있다 (Use the centered formula)**. 그라디언트($$\frac{df(x)}{dx}$$)를 수치적으로 근사한다 하면 보통 다음 유한 차분 근사(finite difference approximation)를 떠올릴 것이다: $$ \frac{df(x)}{dx} = \frac{f(x + h) - f(x)}{h} \hspace{0.1in} \text{(bad, do not use)} $$ -여기서 $h$는 아주 작은 수이고 보통 1e-5 정도의 수를 사용한다. 위 식보다는 아래의 *중심화된(centered)* 차분 공식이 경험적으로는 훨씬 낫다: +여기서 $$h$$는 아주 작은 수이고 보통 1e-5 정도의 수를 사용한다. 위 식보다는 아래의 *중심화된(centered)* 차분 공식이 경험적으로는 훨씬 낫다: $$ \frac{df(x)}{dx} = \frac{f(x + h) - f(x - h)}{2h} \hspace{0.1in} \text{(use instead)} $$ -물론 이 공식은 $f(x+h)$ 말고도 $f(x-h)$도 계산하여야 하므로 최초 식보다 계산량이 두 배 많지만 훨씬 정확한 근사를 제공한다. $f(x+h)$ 및 $f(x-h)$의 ($x$ 근방에서의) 테일러 전개를 고려하면 이유를 금방 알 수 있다. 첫 식은 $O(h)$의 오차가 있는 데 반해 두번째 식은 오차가 $O(h^2)$이다 (즉, 이차 근사이다). -- 역자 주 : (1) 테일러 전개에서 $f(x + h) = f(x) + hf'(x) + O(h)$로부터 $f'(x) - \frac{(f(x+h)-f(x)}{h} = O(h)$. (2) $h$가 보통 벡터이므로 $O(h)$보다는 $O(\|h\|)$가 더 정확한 표현이나 편의상 $\|\cdot\|$을 생략한 듯 보입니다. +물론 이 공식은 $$f(x+h)$$ 말고도 $$f(x-h)$$도 계산하여야 하므로 최초 식보다 계산량이 두 배 많지만 훨씬 정확한 근사를 제공한다. $$f(x+h)$$ 및 $$f(x-h)$$의 ($$x$$ 근방에서의) 테일러 전개를 고려하면 이유를 금방 알 수 있다. 첫 식은 $$O(h)$$의 오차가 있는 데 반해 두번째 식은 오차가 $$O(h^2)$$이다 (즉, 이차 근사이다). -- 역자 주 : (1) 테일러 전개에서 $$f(x + h) = f(x) + hf'(x) + O(h)$$로부터 $$f'(x) - \frac{(f(x+h)-f(x)}{h} = O(h)$$. (2) $$h$$가 보통 벡터이므로 $$O(h)$$보다는 $$O(\|h\|)$$가 더 정확한 표현이나 편의상 $$\|\cdot\|$$을 생략한 듯 보입니다. -**상대 오차를 사용하라 (Use relative error for the comparison)**. 그라디언트의 (수식으로 계산한, analytic) 참값 $f'_a$와 수치적(numerical) 근사값 $f'_n$을 비교하려면 어떤 디테일을 점검하여야 할까? 이 둘이 비슷하지 않음(not compatible)을 어떻게 알아낼 수 있을까? 가장 쉽게는 둘의 절대 오차 $\mid f'_a - f'_n \mid $ 혹은 그 제곱을 쭉 추적하여 이 값(들)이 언젠가 어느 한계점(threshold)를 넘으면 그라디언트 오류라 할 수도 있겠다. 그렇지만 절대 오차에는 문제가 있는 것이, 가령 절대 오차가 1e-4라 가정하여 보자. 만약 $f'_a$와 $f'_n$ 모두 1.0 언저리라면 1e-4의 오차 정도는 매우 훌륭한 근사이고 $f'_a \approx f'_n$이라 할 수 있다. 그런데 만약 두 그라디언트가 1e-5거나 더 작은 값이라면? 그렇다면 1e-4는 매우 큰 차이가 되고 근사가 실패했다고 보아야 한다. 따라서 절대 오차와 두 그라디언트 값의 비율을 고려하는 *상대 오차*가 더 적절하다. 언제나!: +**상대 오차를 사용하라 (Use relative error for the comparison)**. 그라디언트의 (수식으로 계산한, analytic) 참값 $$f'_a$$와 수치적(numerical) 근사값 $$f'_n$$을 비교하려면 어떤 디테일을 점검하여야 할까? 이 둘이 비슷하지 않음(not compatible)을 어떻게 알아낼 수 있을까? 가장 쉽게는 둘의 절대 오차 $$\mid f'_a - f'_n \mid $$ 혹은 그 제곱을 쭉 추적하여 이 값(들)이 언젠가 어느 한계점(threshold)를 넘으면 그라디언트 오류라 할 수도 있겠다. 그렇지만 절대 오차에는 문제가 있는 것이, 가령 절대 오차가 1e-4라 가정하여 보자. 만약 $$f'_a$$와 $$f'_n$$ 모두 1.0 언저리라면 1e-4의 오차 정도는 매우 훌륭한 근사이고 $$f'_a \approx f'_n$$이라 할 수 있다. 그런데 만약 두 그라디언트가 1e-5거나 더 작은 값이라면? 그렇다면 1e-4는 매우 큰 차이가 되고 근사가 실패했다고 보아야 한다. 따라서 절대 오차와 두 그라디언트 값의 비율을 고려하는 *상대 오차*가 더 적절하다. 
언제나!: $$ \frac{\mid f'_a - f'_n \mid}{\max(\mid f'_a \mid, \mid f'_n \mid)} $$ -보통의 상대 오차 공식은 분모에 $f'_a$ 혹은 $f'_n$ 둘 중 하나만 있지만, 나는 둘의 최대값을 분모로 선호하는 편이다. 그래야 공식에 대칭성이 생기고 둘 중 하나가 exactly 0이 되어 분모가 0이 되는 사태를 방지할 수 있다 (ReLU를 사용하면 자주 일어나는 문제이다). $f'_a$와 $f'_n$가 모두 exact 0이 된다면? 이 때는 상대 오차를 점검할 필요 없이 그라디언트 체크를 통과하여야 한다. 당신의 코드가 이 상황을 감안하여 조직된 코드인지 점검하여 보라. +보통의 상대 오차 공식은 분모에 $$f'_a$$ 혹은 $$f'_n$$ 둘 중 하나만 있지만, 나는 둘의 최대값을 분모로 선호하는 편이다. 그래야 공식에 대칭성이 생기고 둘 중 하나가 exactly 0이 되어 분모가 0이 되는 사태를 방지할 수 있다 (ReLU를 사용하면 자주 일어나는 문제이다). $$f'_a$$와 $$f'_n$$가 모두 exact 0이 된다면? 이 때는 상대 오차를 점검할 필요 없이 그라디언트 체크를 통과하여야 한다. 당신의 코드가 이 상황을 감안하여 조직된 코드인지 점검하여 보라. 실제 상황에서의 유용한 가이드: @@ -71,43 +72,45 @@ $$ **이중정확성 변수를 사용하라 (Use double precision)**. 흔히들 실수하는 것이, 그라디언트 체크를 계산하는 데 단일정확성 부동소숫점(single precision floating point) 변수를 사용하는 경우가 있다. 단일정확성 변수를 쓰면 그라디언트 계산이 맞다 하더라도 상대 오차가 (1e-2 정도로) 커지는 경우가 종종 있다. 내 경험상으로는 이중정확성 변수를 쓰면 상대 오차가 1e-2에서 1e-8까지 개선되는 경우도 봤다. -**부동소숫점 연산이 활성화되는 범위에서 계산하라 (Stick around active range of floating point)**. 당신 좀더 세심한 코드를 작성하고 실수를 줄이려면 ["모든 컴퓨터 사이언티스트들이 부동소숫점 연산에 대해 알아야 하는 것들(What Every Computer Scientist Should Know About Floating-Point Arithmetic)"](http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html) 를 읽는 게 좋다. 예를 들어, 신경망에서는 손실함수(loss function)를 배치별로(over batch)로 normalize하는 것이 보통이다 (역자 주 : 그라디언트 합을 배치 사이즈로 나누는 장면을 지칭하는 듯). 그렇지만 한 자료당(per datapoint) 그라디언트가 매우 작다면, 거기에 또 데이터 갯수를 *부가적으로* 나눌 경우 매우 작은 수가 되고 더욱더 많은 수치적인 문제가 생길 수 있다. 그래서 필자는 $f'_a$ 혹은 $f'_n$의 계산값을 계속 찍어보고 두 값이 너무 작지 않은가 확인하는 편이다. (대충 1e-10 혹은 그보다 작은 크기의 값이면 걱정하여라) 만약 두 값이 너무 작다면, 적당히 상수를 곱하여 부동소숫점 표현이 조금 더 "괜찮도록" (부동소숫점 표현에서 지수 부분이 0이 되도록) 만들 수도 있다. +**부동소숫점 연산이 활성화되는 범위에서 계산하라 (Stick around active range of floating point)**. 당신 좀더 세심한 코드를 작성하고 실수를 줄이려면 ["모든 컴퓨터 사이언티스트들이 부동소숫점 연산에 대해 알아야 하는 것들(What Every Computer Scientist Should Know About Floating-Point Arithmetic)"](http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html) 를 읽는 게 좋다. 예를 들어, 신경망에서는 손실함수(loss function)를 배치별로(over batch)로 normalize하는 것이 보통이다 (역자 주 : 그라디언트 합을 배치 사이즈로 나누는 장면을 지칭하는 듯). 그렇지만 한 자료당(per datapoint) 그라디언트가 매우 작다면, 거기에 또 데이터 갯수를 *부가적으로* 나눌 경우 매우 작은 수가 되고 더욱더 많은 수치적인 문제가 생길 수 있다. 그래서 필자는 $$f'_a$$ 혹은 $$f'_n$$의 계산값을 계속 찍어보고 두 값이 너무 작지 않은가 확인하는 편이다. (대충 1e-10 혹은 그보다 작은 크기의 값이면 걱정하여라) 만약 두 값이 너무 작다면, 적당히 상수를 곱하여 부동소숫점 표현이 조금 더 "괜찮도록" (부동소숫점 표현에서 지수 부분이 0이 되도록) 만들 수도 있다. -**목적함수에서의 꺾인 점 (Kinks in the objective)**. *꺾인 점(kink)*들에서 부정확한 계산이 발생할 수 있는데 이를 그라디언트 체크 과정에서도 염두에 두고 있어야 한다. 꺾인 점(kink)은 목적함수의 미분 불가능한 부분을 지칭하는 용어이다. ReLU 함수 ($max(0,x)$), 서포트 벡터 머신(SVM) 목적함수나 맥스아웃 뉴런(maxout neuron) 등을 사용하면 발생할 수 있다. 꺾인 점이 야기시킬 수 있는 문제는 대략 이렇다. ReLU 함수의 그라디언트를 $x = -1e6$에서 체크한다고 생각하여 보자. $x < 0$이므로 $f'_a$는 정확히 $0$이다. 그렇지만, 수치적으로 계산된 그라디언트는 $f(x+h)$가 꺾인 점을 넘을 수도 있으므로 (이를테면 $h > 1e-6$인 경우) 갑자기 $0$이 아닌 값을 내놓게 될 수도 있다. 이런 병적인(?) 경우까지 신경써야 하냐고 물을 수도 있겠는데, 사실 매우 흔하다. 예를 들어 CIFAR-10를 위해 서포트 벡터 머신(SVM)을 쓴다고 하면, 데이터가 50,000개이고(50,000 examples) 한 데이터당 $max(0,x)$ 항이 9개씩 있으니 결국 45만개의 ReLU항과 맞닥뜨리게 된다. 게다가 서포트 벡터 머신 분류기(SVM classifier)와 신경망(neural network)을 붙이면 ReLU들 때문에 꺾인 점이 더 늘어날 수도 있다. +**목적함수에서의 꺾인 점 (Kinks in the objective)**. *꺾인 점(kink)*들에서 부정확한 계산이 발생할 수 있는데 이를 그라디언트 체크 과정에서도 염두에 두고 있어야 한다. 꺾인 점(kink)은 목적함수의 미분 불가능한 부분을 지칭하는 용어이다. ReLU 함수 ($$max(0,x)$$), 서포트 벡터 머신(SVM) 목적함수나 맥스아웃 뉴런(maxout neuron) 등을 사용하면 발생할 수 있다. 꺾인 점이 야기시킬 수 있는 문제는 대략 이렇다. ReLU 함수의 그라디언트를 $$x = -1e6$$에서 체크한다고 생각하여 보자. $$x < 0$$이므로 $$f'_a$$는 정확히 $$0$$이다. 그렇지만, 수치적으로 계산된 그라디언트는 $$f(x+h)$$가 꺾인 점을 넘을 수도 있으므로 (이를테면 $$h > 1e-6$$인 경우) 갑자기 $$0$$이 아닌 값을 내놓게 될 수도 있다. 이런 병적인(?) 
경우까지 신경써야 하냐고 물을 수도 있겠는데, 사실 매우 흔하다. 예를 들어 CIFAR-10를 위해 서포트 벡터 머신(SVM)을 쓴다고 하면, 데이터가 50,000개이고(50,000 examples) 한 데이터당 $$max(0,x)$$ 항이 9개씩 있으니 결국 45만개의 ReLU항과 맞닥뜨리게 된다. 게다가 서포트 벡터 머신 분류기(SVM classifier)와 신경망(neural network)을 붙이면 ReLU들 때문에 꺾인 점이 더 늘어날 수도 있다. -다행히도, 손실함수를 계산할 때 꺾인 점을 넘어서 계산했는지 (a kink was crossed) 여부를 알 수 있다. $max(x,y)$ 꼴 함수에서 $x$, $y$ 중 누가 "이겼는지"를 계속 기록해둔다고 생각해 보자. $f(x+h)$와 $f(x-h)$를 계산할 때 적어도 하나의 "승자"가 바뀐다면, 꺾인 점을 넘는 현상이 발생한 것이고 그렇다면 수치적인 그라디언트가 정확한 값이 아닐 수도 있다. +다행히도, 손실함수를 계산할 때 꺾인 점을 넘어서 계산했는지 (a kink was crossed) 여부를 알 수 있다. $$max(x,y)$$ 꼴 함수에서 $$x$$, $$y$$ 중 누가 "이겼는지"를 계속 기록해둔다고 생각해 보자. $$f(x+h)$$와 $$f(x-h)$$를 계산할 때 적어도 하나의 "승자"가 바뀐다면, 꺾인 점을 넘는 현상이 발생한 것이고 그렇다면 수치적인 그라디언트가 정확한 값이 아닐 수도 있다. **적은 수의 데이터만 써라 (Use only few datapoints)** 꺾인 점과 관련된 하나의 해결책은 더 적은 데이터를 쓰는 것이다. 손실함수가 꺾인 점을 포함하고 있으면 (ReLU나 margin loss등을 썼을 경우처럼) 데이터가 적을수록 더 적은 꺾인 점을 포함할 것이고, 따라서 유한 차분 근사(finite different approximation) 과정에서 꺾인 점을 가로지르는 경우가 더 적을 것이다. 게다가, ~2 혹은 3개의 데이터에 대해서만 그라디언트 체크를 수행하는 게 거의 배치(batch) 전부에 대해 그라디언트 체크하는 게 될 테니 훨씬 빠르고 효율적이다. (역자 주 : 그렇지만 배치 사이즈가 작아지면 다른 쪽에서 문제가 생길 수도 있을 것 같은데..) -**Step size h에 주의하라**. 꼭 작을 수록 좋은 건 아닌 게, $h$가 훨씬 작으면 수치적인 정확도(numerical precision) 문제에 부딪힐 수 있다. 가끔 그라디언트 체크가 잘 안 되면, $h$를 1e-4나 1e-6 정도로 조정하여 보라. 갑자기 될 수도 있다. 링크된 [위키피디아 기사](http://en.wikipedia.org/wiki/Numerical_differentiation)에는 **h**에 따른 수치적 그라디언트 오차가 xy-plot으로 조사되어 있다. +**Step size h에 주의하라**. 꼭 작을 수록 좋은 건 아닌 게, $$h$$가 훨씬 작으면 수치적인 정확도(numerical precision) 문제에 부딪힐 수 있다. 가끔 그라디언트 체크가 잘 안 되면, $$h$$를 1e-4나 1e-6 정도로 조정하여 보라. 갑자기 될 수도 있다. 링크된 [위키피디아 기사](http://en.wikipedia.org/wiki/Numerical_differentiation)에는 **h**에 따른 수치적 그라디언트 오차가 xy-plot으로 조사되어 있다. **"특징적인" 연산이 수행되는 곳에서 그라디언트 체크를 (Gradcheck during a "characteristic" mode of operation)**. 그라디언트 체크는 파라미터 공간(parameter space)의 특정한 (보통 랜덤인) 점 위에서 수행됨을 기억하자. 그라디언트 체크가 한 점에서는 성공한다 하여도 다른 점에서 맞게 수행되리라고는 믿기 힘들다. 게다가, 초기값을 랜덤하게 줄 경우(random initialization) 그 점은 파라미터 공간의 가장 "특징적인(characteristic)" 점이 아닐 수도 있고, 분명 제대로 코딩(implement)된 듯한 그라디언트가 사실 잘 계산되지 않는 병적인 상황을 야기할 수도 있다. 예를 들어, SVM에서 초기 웨이트값을 매우 작게 설정하면, 모든 데이터 포인트에 거의 0에 근접한 점수를 부여할 것이고 그라디언트 값들 또한 모든 데이터에 걸쳐 어떤 패턴을 나타낼 것이다. 만약 그라디언트 구현이 잘못되었다면 이 패턴을 계속 만들어낼 것이고 좀더 특징적인 계산으로 (e.g. 몇몇 점수가 다른 것보다 큰 경우) 일반화하지 못할 수도 있다. 그러므로, 안전하게 가려면, 네트워크가 학습을 시작할 무렵 짧은 번인(**burn-in**)을 이용하고, 손실(loss)가 하강하기 시작한 뒤에 그라디언트 체크를 수행하는 것이 최선이다. 요컨대, 첫번째 iteration에서부터 그라디언트 체크를 수행하면 그 때만의 병적인(pathological) 오류 때문에 우리가 정말로 정확하게 그라디언트 체크를 수행하는 부분에서의 오류를 놓칠 수도 있다. -**정규화가 데이터를 압도하게 하지 마라 (Don't let the regularization overwhelm the data)**. 가끔, 손실함수(loss function)는 데이터 손실과 정규화(regularization) 손실 (e.g. 웨이트값(weight)들에 대한 L2 벌점(penalty))의 합으로 이루어져 있다. 하나 알고 있어야 하는 위험은, 정규화 손실이 데이터 손실을 압도할 수 있다는 것인데, 이 경우 그라디언트는 주로 (그라디언트 표현이 훨씬 간단한) 정규화 항(term)에서 올 것이다. 이 경우 데이터 손실 그라디언트가 올바르게 구현되지 못하는 상황을 감출 수도 있다. 그러므로, 먼저 정규화를 끄고 데이터 손실 부분만 체크를 수행하길 추천하며 그 다음에 정규화 항을 따로 점검해 보라. 정규화 항만 따로 어떻게 점검 하냐고? 하나의 방법은 코드를 해킹(hack)하여 데이터 손실 부분을 제거하는 것이다. 다른 방법으로는 정규화 항의 강도(strength)를 높여서 그 효과가 그라디언트 체크 수행시 무시할 수 없게 키운 뒤 (정규화 항 부분에서의) 잘못된 그라디언트가 감지되도록 하라. +**정규화가 데이터를 압도하게 하지 마라 (Don't let the regularization overwhelm the data)**. 가끔, 손실함수(loss function)는 데이터 손실과 정규화(regularization) 손실 (e.g. 웨이트값(weight)들에 대한 L2 벌점(penalty))의 합으로 이루어져 있다. 하나 알고 있어야 하는 위험은, 정규화 손실이 데이터 손실을 압도할 수 있다는 것인데, 이 경우 그라디언트는 주로 (그라디언트 표현이 훨씬 간단한) 정규화 항(term)에서 올 것이다. 이 경우 데이터 손실 그라디언트가 올바르게 구현되지 못하는 상황을 감출 수도 있다. 그러므로, 먼저 정규화를 끄고 데이터 손실 부분만 체크를 수행하길 추천하며 그 다음에 정규화 항을 따로 점검해 보라. 정규화 항만 따로 어떻게 점검 하냐고? 하나의 방법은 코드를 해킹(hack)하여 데이터 손실 부분을 제거하는 것이다. 
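
(역자 주: 참고로, 이 절 앞부분에서 설명한 중심 차분 공식과 상대 오차를 이용한 그라디언트 체크를, 몇 개의 차원만 임의로 골라 수행하는 식으로 스케치하면 다음과 같다. `f`는 손실함수, `analytic_grad`는 수식으로 계산한 그라디언트라 가정한다.)

~~~python
import numpy as np

def grad_check_sparse(f, x, analytic_grad, num_checks=5, h=1e-5):
    # 임의의 몇 개 차원에서만 수치적/수식적 그라디언트를 비교한다
    for _ in range(num_checks):
        ix = tuple(np.random.randint(m) for m in x.shape)
        oldval = x[ix]
        x[ix] = oldval + h
        fxph = f(x)                  # f(x + h)
        x[ix] = oldval - h
        fxmh = f(x)                  # f(x - h)
        x[ix] = oldval               # 원래 값으로 복원
        grad_numerical = (fxph - fxmh) / (2 * h)   # 중심 차분 공식
        grad_analytic = analytic_grad[ix]
        # 둘 다 정확히 0인 경우는 본문의 설명대로 따로 처리해야 한다
        rel_error = abs(grad_numerical - grad_analytic) \
            / max(abs(grad_numerical), abs(grad_analytic))
        print('numerical: %f analytic: %f, relative error: %e'
              % (grad_numerical, grad_analytic, rel_error))

# 사용 예: f(x) = sum(x**2)의 그라디언트는 2x
x = np.random.randn(3, 4)
grad_check_sparse(lambda z: np.sum(z**2), x, 2 * x)
~~~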
다른 방법으로는 정규화 항의 강도(strength)를 높여서 그 효과가 그라디언트 체크 수행시 무시할 수 없게 키운 뒤 (정규화 항 부분에서의) 잘못된 그라디언트가 감지되도록 하라. -**드랍아웃과 augmentation을 끄라 (Remember to turn off dropout/augmentations)**. 그라디언트 체크를 수행하는 동안, 네트워크에서 결정되지 않은(non-deterministic) 효과, 이를테면 드랍아웃(dropout), 임의 자료 확대(random data augmentations), 등을 반드시 꺼 두어라. 당연한 이야기지만 이들을 꺼두지 않으면 수치적 그라디언트 근사에서 대규모의 오차가 생길 수 있다. 이 효과들을 끌 경우 단점은 이들의 그라디언트 체크를 수행할수 없다는 것이다 (e.g. 드랍아웃이 올바르게 역전파(backpropagate)되지 않을 수 있다). 그러므로 $f(x+h)$ and $f(x-h)$ 및 수식으로 계산된(analytic) 그라디언트를 계산하기 전에 시드(seed)를 특정 값으로 고정하는 것이 좀더 나은 해결책일 수도 있다. +**드랍아웃과 augmentation을 끄라 (Remember to turn off dropout/augmentations)**. 그라디언트 체크를 수행하는 동안, 네트워크에서 결정되지 않은(non-deterministic) 효과, 이를테면 드랍아웃(dropout), 임의 자료 확대(random data augmentations), 등을 반드시 꺼 두어라. 당연한 이야기지만 이들을 꺼두지 않으면 수치적 그라디언트 근사에서 대규모의 오차가 생길 수 있다. 이 효과들을 끌 경우 단점은 이들의 그라디언트 체크를 수행할수 없다는 것이다 (e.g. 드랍아웃이 올바르게 역전파(backpropagate)되지 않을 수 있다). 그러므로 $$f(x+h)$$ and $$f(x-h)$$ 및 수식으로 계산된(analytic) 그라디언트를 계산하기 전에 시드(seed)를 특정 값으로 고정하는 것이 좀더 나은 해결책일 수도 있다. **몇 개의 차원에서만 체크하라 (Check only few dimensions)**. 실제 데이터에서 그라디언트는 수백만개의 파라미터값을 가질 수도 있다. 이런 경우엔 오직 몇 차원의 그라디언트들만 체크 하고 다른 것들은 잘 계산되었다고 믿는 것이 현실적일 수도 있다. **조심하라**: 모든 '분리된 파라미터'들에 대해서 적은 차원의 그라디언트 체크를 수행하라. 몇몇 용례에서는, 사람들이 파라미터들을 편의상 하나의 큰 파라미터 벡터로 결합한다. 이 경우, 이를테면, 편향값(bias)들은 전체 벡터에서 아주 적은 수만 차지하고 있을 수 있으므로, 이를 반영하여 샘플한 뒤 모든 파라미터들이 올바른 그라디언트를 받고 있는지 확인하는 것이 중요하다. + ### 학습 전에: 제대로 돌아가는지 확인하는 팁과 트릭들 (Before learning: sanity checks Tips/Tricks) 풀려는 최적화 문제가 매우 비싸(expensive)지기 전에, 다음 절차들을 돌려볼 만하다. - **맞는 손실함수를 찾아라 ?? (Look for correct loss at chance performance.)** -적은 수의 파라미터로 초기화할 때는 당신이 기대한 손실함수값(loss)를 얻는지 확인하라. 먼저 데이터 손실함수 (data loss) 하나만 확인하는 것이 가장 낫다 (따라서 정규화 강도(regularization strength)는 영으로 설정하여라). 예를 들어, CIFAR-10에 Softmax 분류기를 이용할 경우 초기 손실함수값을 2.302로 기대할 수 있는데, 왜냐하면, -ln(0.1) = 2.302 -- 각 클래스에 확률이 0.1로 분산되었을 테고 Softmax 손실함수는 올바른 분류 확률에 음의 로그를 취한 값이기 때문이다. Weston Watkins SVM을 사용할 경우에는, (모든 점수(score)가 어림잡아 0이기 때문에) 고려되는 모든 마진값(margin)이 위반될 테니 9의 손실값을 기대할 수 있다 (마진값은 각각 잘못 분류된 클래스마다 1이다). 이런 손실값들이 나오지 않으면 초기화에 문제가 있을 수 있다. +적은 수의 파라미터로 초기화할 때는 당신이 기대한 손실함수값(loss)를 얻는지 확인하라. 먼저 데이터 손실함수 (data loss) 하나만 확인하는 것이 가장 낫다 (따라서 정규화 강도(regularization strength)는 영으로 설정하여라). 예를 들어, CIFAR-10에 Softmax 분류기를 이용할 경우 초기 손실함수값을 2.302로 기대할 수 있는데, 왜냐하면, -ln(0.1) = 2.302 -- 각 클래스에 확률이 0.1로 분산되었을 테고 Softmax 손실함수는 올바른 분류 확률에 음의 로그를 취한 값이기 때문이다. Weston Watkins SVM을 사용할 경우에는, (모든 점수(score)가 어림잡아 0이기 때문에) 고려되는 모든 마진값(margin)이 위반될 테니 9의 손실값을 기대할 수 있다 (마진값은 각각 잘못 분류된 클래스마다 1이다). 이런 손실값들이 나오지 않으면 초기화에 문제가 있을 수 있다. - 두 번째 확인 절차로써, 정규화 강도를 올릴 수록 손실함수값이 올라가야 한다. -- **자료의 작은 부분집합으로 과적합해 보라 (Overfit a tiny subset of data)**. 마지막으로 가장 중요한 사항인데, 전체 데이터셋으로 훈련을 시작하기 전에, 작은 부분으로 훈련을 시도하여 보고 (한 20개의 자료 정도), 0의 비용(cost)을 달성할 수 있는지 확인하여 보라. 이 실험에서도 역시 정규화 강도는 0으로 설정하는 것이 가장 나으며, 그렇지 않으면 0의 비용을 얻을 수 없을 것이다. 작은 자료에서의 이러한 확인 과정이 제대로 끝나지 않으면 전체 데이터셋으로 나아가는 것은 무가치하다. 하나 강조할 것은, 아주 작은 데이터셋에 성공적으로 과적합하였지만 여전히 코딩(implementation)이 올바르게 이루어지지 않았을 수 있다. 예를 들어, 가지고 있는 데이터 포인트(datapoint)들의 특성(feature)들이 어떤 버그 때문에 임의로(randomly) 선정된 경우, 작은 훈련 집합(training set)에의 과적합은 성공할지라도 그게 전체 데이터셋으로 일반화되지 않을 수도 있다. +- **자료의 작은 부분집합으로 과적합해 보라 (Overfit a tiny subset of data)**. 마지막으로 가장 중요한 사항인데, 전체 데이터셋으로 훈련을 시작하기 전에, 작은 부분으로 훈련을 시도하여 보고 (한 20개의 자료 정도), 0의 비용(cost)을 달성할 수 있는지 확인하여 보라. 이 실험에서도 역시 정규화 강도는 0으로 설정하는 것이 가장 나으며, 그렇지 않으면 0의 비용을 얻을 수 없을 것이다. 작은 자료에서의 이러한 확인 과정이 제대로 끝나지 않으면 전체 데이터셋으로 나아가는 것은 무가치하다. 하나 강조할 것은, 아주 작은 데이터셋에 성공적으로 과적합하였지만 여전히 코딩(implementation)이 올바르게 이루어지지 않았을 수 있다. 
예를 들어, 가지고 있는 데이터 포인트(datapoint)들의 특성(feature)들이 어떤 버그 때문에 임의로(randomly) 선정된 경우, 작은 훈련 집합(training set)에의 과적합은 성공할지라도 그게 전체 데이터셋으로 일반화되지 않을 수도 있다. + ### 학습 과정 돌보기 (Babysitting the learning process) 신경망을 훈련하는 중에 몇몇 쓸모있는 값(quantitity)은 모니터링해야 한다. 이런 도표들은 학습 과정을 지켜보는 창문이다. 좀더 효율적인 학습을 위한 하이퍼파라미터(hyperparameter) 조정도 여기서 직관적 영감을 얻는다. @@ -115,6 +118,7 @@ $$ 도표의 x축은 언제나 에폭(epoch)을 단위로 한다. 에폭(epoch)은 각 자료(example)가 몇 번이나 학습(SGD iteration--역자 주)에 사용되었는가를 재는 용어이다. (이를테면 1 에폭이 지났다는 것은 모든 자료가 한 번씩 SGD iteration에 사용되었음을 뜻한다.) x축으로 SGD 알고리즘 반복횟수(iteration)를 할 수도 있겠지만 에폭이 더 선호되는 편이다. 반복 횟수(iteration number)은 배치 사이즈(batch size)의 선택에 따라 임의로 바뀔 수 있기 때문이다. + #### 손실 함수 (Loss function) 손실 함수(loss)는 forward pass 동안 개개의 배치(batch)에서 계산되고 따라서 훈련(training) 과정에서 추적하기 용이하다. 아래는 시간에 따른 손실 그래프의 모양을 여러 학습 속도(learning rate)에 따라 그려본 것이다. 각각의 모양이 시사하는 바도 함께 적었다: @@ -134,6 +138,7 @@ $$ 가끔 손실 함수 모양이 우스꽝스러울 때도 있다. [lossfunctions.tumblr.com](http://lossfunctions.tumblr.com/). + #### 훈련/검증 정확도 (Train/Val accuracy) 훈련/검증 정확도(training/validation accuracy)는 분류기 훈련시 추적해야 할 또다른 중요한 값이다. 이 플롯은 당신의 모형이 과적합(overfitting) 중인지를 발견할 수 있는 값진 인사이트를 제공한다: @@ -147,6 +152,7 @@ $$
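
(역자 주: 지금까지 설명한 손실함수와 훈련/검증 정확도의 추적을 아주 간단한 골격으로 스케치하면 다음과 같다. `train_one_epoch`과 `accuracy`는 실제 구현을 대신하는 가상의 함수이다.)

~~~python
import numpy as np

def train_one_epoch():
    return np.random.rand()          # 가상의 함수: 에폭 평균 손실을 흉내낸다

def accuracy(split):
    return np.random.rand()          # 가상의 함수: 해당 집합에서의 정확도를 흉내낸다

history = {'loss': [], 'train_acc': [], 'val_acc': []}
for epoch in range(10):
    history['loss'].append(train_one_epoch())
    history['train_acc'].append(accuracy('train'))
    history['val_acc'].append(accuracy('val'))
    # 훈련-검증 정확도의 차이가 크게 벌어지면 과적합의 신호이다
    gap = history['train_acc'][-1] - history['val_acc'][-1]
    print('epoch %d: loss %.3f, train/val gap %+.3f'
          % (epoch, history['loss'][-1], gap))
~~~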
+ #### 웨이트의 현재값과 변화량의 비율 (Ratio of weights:updates) 마지막으로, 웨이트의 현재 크기와 업데이트로 인한 변화량의 크기를 비교해 볼 수도 있다. (Note: 그냥 날 것의 그라디언트 값이 아니라, 웨이트의 *변화량*이다 (이를테면 vanilla SGD에서는 학습 속도(learning rate)와 그라디언트의 곱이다).) 모든 파라미터(의 집합)마다 독립적으로 이 비율을 추적/계산하고 싶은가? 대충 짚자면 이 비율은 1e-3 근처여야 한다. 이보다 낮으면 학습 속도(learning rate)가 너무 낮은 것이다. 이보다 크면 학습 속도가 너무 크다. 특정한 예를 들자면 아래와 같다: @@ -163,12 +169,14 @@ print update_scale / param_scale # want ~1e-3 최솟값이나 최댓값을 추적할 수도 있고, 그라디언트와 업데이트값의 놈(norm)을 계산하고 추적할 수도 있다. 이 지표들은 대개 연관성이 높아서 거의 비슷한 결과를 준다. + #### 층별 활성값 및 그라디언트의 분포 (Activation / Gradient distributions per layer) 올바르지 않은 초기값 설정(initialization)은 학습 과정을 느리게 하거나 완전히 망칠 수 있다. 운좋게도 이 이슈는 상대적으로 쉽게 분석할 수 있다. 한 방법은 활성값/그라디언트값의 히스토그램을 망(network)의 모든 층(layer)마다 그려보는 것이다. 직관적으로 생각해 보면, 만일 이상한 분포가 나오면 좋은 징조가 아닐 수 있다 - 이를테면, tanh 뉴런(neuron)에서는 활성값이 [-1,1]의 전 범위에 걸쳐 분산되어 있는 모습을 보고 싶다. 혹시 모든 활성값이 0을 내놓거나 -1 혹은 1에 집중되어 있으면 문제가 있는 것이다. + #### 첫번째 층의 시각화 (First-layer Visualizations) 마지막으로, 만일 당신이 이미지 픽셀에 관련된 일을 한다면 첫 층의 특징(feature)들을 시각화하는 것이 많은 도움이 될 수도 있다. @@ -182,6 +190,7 @@ print update_scale / param_scale # want ~1e-3
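
(역자 주: 위의 층별 활성값 분포를 확인하는 아이디어를 작은 tanh 네트워크로 스케치하면 다음과 같다. 층의 크기와 초기화 방식은 설명을 위해 임의로 정한 것이다.)

~~~python
import numpy as np

np.random.seed(0)
h = np.random.randn(1000, 500)                    # 임의의 입력 데이터
for layer in range(5):                            # 5개의 은닉층 (가정)
    W = np.random.randn(500, 500) / np.sqrt(500)  # 1/sqrt(n)로 보정된 초기화
    h = np.tanh(h.dot(W))
    # 평균/표준편차가 층이 깊어져도 유지되는지, [-1,1] 끝에 포화되지는 않는지 본다
    print('layer %d: mean %+.4f, std %.4f, saturated %.1f%%'
          % (layer + 1, h.mean(), h.std(),
             100 * np.mean(np.abs(h) > 0.99)))
~~~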
+ ### 파라미터값의 업데이트 (Parameter updates) 수식적으로 그라디언트값은 역전파(backpropagation)으로 계산되고 이는 파라미터값 업데이트를 위해 사용된다. 업데이트를 수행하는 몇 접근법들이 있는데 후술하겠다. @@ -190,7 +199,8 @@ print update_scale / param_scale # want ~1e-3 -#### SGD와 벨, 호루라기(?) (SGD and bells and whistles) + +#### SGD와 그 외 (SGD and bells and whistles) **바닐라 업데이트 (Vanilla update)**. 가장 간단한 업데이트 형태는 그라디언트의 반대방향으로 파라미터를 업데이트하는 것이다(왜냐하면 그라디언트는 증가하는 방향을 가리키니까. 그렇지만 우리는 손실함수를 최소화하고 싶어한다). 파라미터의 벡터를 `x`라 하고 그라디언트를 `dx`라 쓰면, 가장 간단한 업데이트는 다음과 같: @@ -201,9 +211,9 @@ x += - learning_rate * dx 여기서 학습속도 `learning_rate` 는 하이퍼파라미터(hyperparamter)이고 고정된 상수이다. 만일 `dx`가 전체 데이터셋에서 계산되고 학습 속도가 충분히 작을 때, 최소한 나쁜 프로세스는 아님을 보장한다. -**모멘텀 업데이트 (Momentum update)**는, 적어도 딥 네트워크에서는, 바닐라 업데이트보다 더 잘 수렴한다. 이 방법은 최적화 문제(optimization problem)를 물리학적 관점에서 바라보는 데서 유래했다. 자세히 말하자면, 손실함수는 구릉지대에서 높이에 해당한다 (그래서 포텐셜 에너지에도 대응되는데 $U = mgh$이고 따라서 $ U \propto h $이다). 파라미터의 초기값을 임의로 정하는 것은 입자를 어떤 위치에서 0의 속도로 세팅하는 것과 똑같다. 이 상황에서 최적화 과정은 파라미터 벡터(즉 입자)를 '굴리는' 과정과 동일하다 볼 수 있다. +**모멘텀 업데이트 (Momentum update)**는, 적어도 딥 네트워크에서는, 바닐라 업데이트보다 더 잘 수렴한다. 이 방법은 최적화 문제(optimization problem)를 물리학적 관점에서 바라보는 데서 유래했다. 자세히 말하자면, 손실함수는 구릉지대에서 높이에 해당한다 (그래서 포텐셜 에너지에도 대응되는데 $$U = mgh$$이고 따라서 $$ U \propto h $$이다). 파라미터의 초기값을 임의로 정하는 것은 입자를 어떤 위치에서 0의 속도로 세팅하는 것과 똑같다. 이 상황에서 최적화 과정은 파라미터 벡터(즉 입자)를 '굴리는' 과정과 동일하다 볼 수 있다. -입자에 작용하는 힘(force)은 포텐셜 에너지의 그라디언트 (즉 $F = - \nabla U $ )와 관련되어 있으므로, 입자가 느끼는 **힘**은은 정확하게 손실함수의 그라디언트(의 반대부호)이다. 게다가 $F = ma$이므로 그 그라디언트(의 반대부호)는 입자에 작용하는 가속도에 비례한다. 위에서의 SGD와 다른 점을 발견했는가? SGD는 위치값(현재 파라미터값 - 역자주)에 그라디언트가 직접 합쳐진다. 모멘텀 업데이트는, 물리학적 관점에서, 그라디언트가 오직 속도(velocity)에만 직접적으로 영향을 주고 속도가 위치값(position)에 영향을 줄 것을 제안하고 있다: +입자에 작용하는 힘(force)은 포텐셜 에너지의 그라디언트 (즉 $$F = - \nabla U $$ )와 관련되어 있으므로, 입자가 느끼는 **힘**은은 정확하게 손실함수의 그라디언트(의 반대부호)이다. 게다가 $$F = ma$$이므로 그 그라디언트(의 반대부호)는 입자에 작용하는 가속도에 비례한다. 위에서의 SGD와 다른 점을 발견했는가? SGD는 위치값(현재 파라미터값 - 역자주)에 그라디언트가 직접 합쳐진다. 모멘텀 업데이트는, 물리학적 관점에서, 그라디언트가 오직 속도(velocity)에만 직접적으로 영향을 주고 속도가 위치값(position)에 영향을 줄 것을 제안하고 있다: ~~~python # Momentum update @@ -224,11 +234,11 @@ Nesterov 모멘텀의 핵심 아이디어는 다음과 같다. 만약 현재 파
- Nesterov 모멘텀. 지금 위치(붉은색 원)에서 모멘텀에 의해 연두색 화살표의 끝점으로 이동할 상황이다. Nesterov 모멘텀은 현재 위치에서 그라디언트를 계산하는 것이 아니라 이 "예견된" 위치(화살표 끝점)에서 그라디언트를 계산한다. + Nesterov 모멘텀. 지금 위치(붉은색 원)에서 모멘텀에 의해 연두색 화살표의 끝점으로 이동할 상황이다. Nesterov 모멘텀은 현재 위치에서 그라디언트를 계산하는 것이 아니라 이 "예견된" 위치(화살표 끝점)에서 그라디언트를 계산한다.

-다른 말로 하면, 다음과 같이 계산한다. (notation이 조금 이상하다.)
+다른 말로 하면, 다음과 같이 계산한다. (notation이 조금 이상하다.)

~~~python
x_ahead = x + mu * v
# evaluate dx_ahead (the gradient at x_ahead instead of at x)
v = mu * v - learning_rate * dx_ahead
x += v
~~~

@@ -253,18 +263,20 @@ x += -mu * v_prev + (1 + mu) * v # position update changes form



+
#### 학습 속도 담금질 (Annealing the learning rate)

깊은 신경망의 훈련에서 시간에 따라 훈련 속도를 담금질(anneal, 조정)하는 건 언제나 도움이 된다. 이 직관을 기억해 두면 도움이 된다: 높은 학습 속도에서는, 전체 시스템이 너무 높은 운동 에너지를 갖고 있어서 파라미터 벡터가 혼돈스럽게 튀고, (손실 함수의) 좁고 깊숙한 골짜기 안으로 쏙 들어가서 정착하기 힘들다.

그러면 학습 속도를 언제 줄일 것인가? 좀 tricky할 것이다. 우선 천천히 줄여봐라. 그러면 오랜 시간동안 거의 제자리에서 혼돈스럽게 왔다갔다 할 것이다. 그렇지만 너무 빨리 줄이면 전체 시스템이 너무 빨리 식을 것이고, 갈 수 있는 최적의 장소에 도달하지 못할 수 있다. 학습속도를 감소시키는 방법은 보통 다음 세 가지가 있다.

- **계단식 감소 (step decay)**: 몇 에폭마다 일정량만큼 학습 속도를 줄인다. 전형적으로는 5 에폭마다 반으로 줄이거나 20 에폭마다 1/10씩 줄이기도 한다. 이 숫자들은 전적으로 문제와 모형의 타입에 의존한다. 실전에서는, 우선 고정된 학습 속도로 검증오차(validation error)를 살펴보다가, 검증오차가 개선되지 않을 때마다 학습 속도를 감소시키는 (이를테면 0.5정도?) 방법을 택하기도 한다.
- **지수적 감소 (exponential decay)**는 $$\alpha = \alpha_0 e^{-k t}$$ 꼴을 뜻한다. 여기서 $$\alpha_0, k$$는 초모수(hyperparameter)이고 $$t$$는 반복 횟수이다 (물론 에폭을 단위로 해도 된다.)
- **1/t 감소**는 $$\alpha = \alpha_0 / (1 + k t )$$ 꼴을 뜻하고 여기서 $$a_0, k$$는 초모수이고 $$t$$는 반복 횟수이다.

실전에서는 계단식 감소 방식이 조금 더 선호될만 한데, 관련된 초모수들(몇 에폭마다 감소시킬지, 그리고 감소율)이 $$k$$에 비해서 해석이 더 쉽기 때문이다. 마지막으로, 계산 자원이 충분하다면, 감소율을 좀 더 낮춰서 오랜 시간동안 (모형을) 훈련시켜라.

+
#### 이차 근사 방법들 (Second order methods)

딥러닝의 맥락에서 두 번째로 대중적인 최적화 방법은 [뉴턴 방법(Newton's method)](http://en.wikipedia.org/wiki/Newton%27s_method_in_optimization)인데 다음과 같은 업데이트 방식을 뜻한다:

$$
x \leftarrow x - [H f(x)]^{-1} \nabla f(x)
$$

여기서 $$H f(x)$$는 [헤시안 행렬(Hessian matrix)](http://en.wikipedia.org/wiki/Hessian_matrix)로, (다변수 함수의) 2차 미분으로 이루어진 정방행렬을 뜻한다. $$\nabla f(x)$$ 항은 (그라디언트 감소 Gradient Descent에서 보았던) 그라디언트 벡터이다. 직관적으로 헤시안 행렬은 어떤 함수의 국지적인 곡률(curvature)을 뜻하고 이 정보로 우리는 더 효율적인 업데이트를 수행할 수 있다. 특별히, 헤시안 행렬의 역행렬을 곱함으로써, 휨이 약한 방향으로는 더 공격적으로 그리고 휨이 강한 방향으로는 짧게짧게 움직일 수 있다. 일차 근사 방법에 비해 뉴턴 방법이 가지는 강점은, 위의 업데이트 공식을 보면 학습 속도(learning rate)에 대한 초모수(hyperparameter)가 없다는 것이다.

그렇지만 위의 업데이트는 거의 모든 실제 상황에서는 쓸모가 없는 게, 공식 그대로(explicitly) 헤시안 행렬을 계산한다면 (역행렬을 취하는 일 포함하여) 상상도 못할 시간과 메모리가 필요하다. 예를 들면, 모수가 백만개 정도인 신경망은 [1,000,000 x 1,000,000] 크기의 헤시안 행렬을 필요로 하고 이는 3725GB의 램(RAM)을 필요로 한다. 그 결과로 다양한 *유사-뉴턴* 방법이 역-헤시안 행렬을 근사하기 위해 고안되었다. 이 방법론들 중 [L-BFGS](http://en.wikipedia.org/wiki/Limited-memory_BFGS)가 가장 대중적이다. L-BFGS는 시간(iteration)에 따른 그라디언트의 변화를 (간접적으로) 근사에 이용한다. 즉, 전체 행렬은 절대 계산되지 않는다. 그렇다고 해도, 메모리 걱정을 없앴다고 할지라도, L-BFGS를 그냥 적용하자면 큰 단점이 하나 있는데 바로 전체 훈련 집합(training set)을 대상으로 계산하여야 한다는 점이다. 수백만 개체가 있는 그 데이터셋 말이다. 
배치(Batch)-SGD와는 달리, 미니배치(mini-batch)에서 L-BFGS가 작동하게 하는 방법은 좀더 꼼수를 필요로 하며 활발한 연구 분야이다. - + **실제 상황에서는**, 지금까지는, L-BFGS나 다른 이차 근사 방법이 대규모 딥러닝이나 CNN에서 사용되지는 않는 게 보통이다. 표준적으로는 SGD와 그 변종들 (모멘텀이나 Nesterov's 모멘텀)이 훨씬 간단하고 계산도 빨라서 많이 사용된다. 추가 참고문헌: @@ -287,11 +299,12 @@ $$ - [SFO](http://arxiv.org/abs/1311.2115) 알고리즘은 SGD와 L-BFGS의 장점을 혼합하고자 노력하였다. + #### 파라미터별 데이터-맞춤 학습 속도 (Per-parameter adaptive learning rates) 지금까지 논의된 접근법들은 모든 파라미터에 똑같은 학습 속도를 적용하였다. 학습 속도의 튜닝(tuning)은 계산이 많은(expensive) 작업인지라, 데이터에 맞추어(adaptively) 자동으로 학습 속도를 정하는 방법을 찾고자 많은 사람들이 노력하였다. 파라미터별로 학습 속도를 다르게 하고 이를 데이터-맞춤으로 정하려는 노력들 또한 있었다. 이러한 방법들은 보통 또다른 초모수(hyperparameter) 세팅이 필요하긴 하지만, 이 초모수는 넓은 범위에서 잘 작동하는 편이라 일반적인 학습 속도 튜닝보다는 덜 까다롭다. 이번 절에서는 실전에서 마주칠 수도 있는 주요 데이터-맞춤 방법들을 조망해본다: - + **Adagrad**는 데이터-맞춤 학습속도 조정 방법 중 하나이고 [Duchi et al.](http://jmlr.org/papers/v12/duchi11a.html) 에서 처음 제안되었다. ~~~python @@ -335,6 +348,7 @@ x += - learning_rate * m / (np.sqrt(v) + eps)
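
(역자 주: 위에서 소개된 Adagrad의 성분별 학습속도 조정을, 그라디언트 스케일이 크게 다른 두 성분을 가진 토이 문제 위에서 스케치하면 다음과 같다. 토이 함수와 초모수값들은 설명을 위해 가정한 것이다.)

~~~python
import numpy as np

# f(x) = 50*x0**2 + 0.5*x1**2 : 두 성분의 그라디언트 크기가 100배 차이난다
def grad(x):
    return np.array([100.0 * x[0], 1.0 * x[1]])

x = np.array([1.0, 1.0])
cache = np.zeros_like(x)
learning_rate, eps = 0.1, 1e-8

for step in range(200):
    dx = grad(x)
    cache += dx**2                                   # 성분별 제곱 그라디언트 누적
    x += -learning_rate * dx / (np.sqrt(cache) + eps)

# 그라디언트가 큰 성분은 실질 학습속도가 줄고, 작은 성분은 상대적으로 유지된다
print(x)
~~~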
+ ### 초모수 최적화 (Hyperparameter optimization) 일전에 본 대로, 신경망(neural network)의 훈련에는 많은 초모수(hyperparamter) 설정이 관련된다. 신경망 관련 논의에서 가장 빈번하게 등장하는 초모수는 다음과 같다: @@ -367,21 +381,24 @@ x += - learning_rate * m / (np.sqrt(v) + eps) **베이지안 초모수 최적화 (Bayesian Hyperparameter Optimization)**는 초모수 공간을 좀 더 효율적으로 항해하는 방법을 고안하기 위한 분야이다. 핵심 아이디어는 초모수들의 성능을 평가할 때 탐험(exploration)-개발(exploitation)의 상충(trade-off)에서 적절한 균형을 찾는 것이다. 많은 라이브러리들이 이 모형에 기반하여 개발되었고 그 중에 잘 알려진 것은 [Spearmint](https://github.com/JasperSnoek/spearmint), [SMAC](http://www.cs.ubc.ca/labs/beta/Projects/SMAC/), 그리고 [Hyperopt](http://jaberg.github.io/hyperopt/)이다. 그러나, ConvNet 관련된 실전 세팅에서는 아직 조심스레 선택된 구간에서의 임의 검색이 상대적으로 더 뛰어나다. 딥러닝의 최전선 참호에서(from-the-trenches) 진행중인 논의를 참조하라. [here](http://nlpers.blogspot.com/2014/10/hyperparameter-search-bayesian.html). + ## 평가 + ### 모형 앙상블 (Model Ensembles) 실전에서, 신경망(neural network)의 성능을 몇 퍼센트 끌어올릴 수 있는 믿을 만한 방법이 하나 있는데 바로 여러 개의 독립적인 모형을 만들고 테스트 때 그들의 평균 예측을 취하는 것이다. 앙상블에 관여하는 모형이 많아지면, 보통 성능은 단조적으로 개선된다 (비록 개선 정도가 점점 떨어질지라도). 게다가, 앙상블 내에서 모형의 다양함이 늘어날수록 성능의 개선은 더 극적이다. 아래는 앙상블을 구축하는 몇 가지 방법이다: - **같은 모형, 다른 초기화 (Same model, different initializations)**. 교차 검증으로 최고의 초모수를 결정한 다음에, 같은 초모수를 이용하되 초기값을 임의로 다양하게 여러 모형을 훈련한다. 이 접근법의 위험은, 모형의 다양성이 오직 다양한 초기값에서만 온다는 것이다. -- **교차 검증 동안 발견되는 최고의 모형들 (Top models discovered during cross-validation)**. 교차 검증으로 최고의 초모수(들)를 결정한 다음에, 몇 개의 최고 모형을 선정하여 (예. 10개) 이들로 앙상블을 구축한다. 이 방법은 앙상블 내의 다양성을 증대시키나, 준-최적 모형을 포함할 수도 있는 위험이 있다. 실전에서는 이를 수행하는 게 (위보다) 쉬운 편인데, 교차 검증 뒤에 추가적인 모형의 재훈련이 필요없기 때문이다. -- **한 모형에서 다른 체크포인트들을 (Different checkpoints of a single model)**. 만약 훈련이 매우 값비싸면, 어떤 사람들은 단일한 네트워크의 체크포인트들을 (이를테면 매 에폭 후) 앙상블하여 제한적인 성공을 거둔 바 있음을 기억해 두라. 명백하게 이 방법은 다양성이 떨어지지만, 실전에서는 합리적으로 잘 작동할 수 있다. 이 방법은 매우 간편하고 저렴하다는 것이 장점이다. +- **교차 검증 동안 발견되는 최고의 모형들 (Top models discovered during cross-validation)**. 교차 검증으로 최고의 초모수(들)를 결정한 다음에, 몇 개의 최고 모형을 선정하여 (예. 10개) 이들로 앙상블을 구축한다. 이 방법은 앙상블 내의 다양성을 증대시키나, 준-최적 모형을 포함할 수도 있는 위험이 있다. 실전에서는 이를 수행하는 게 (위보다) 쉬운 편인데, 교차 검증 뒤에 추가적인 모형의 재훈련이 필요없기 때문이다. +- **한 모형에서 다른 체크포인트들을 (Different checkpoints of a single model)**. 만약 훈련이 매우 값비싸면, 어떤 사람들은 단일한 네트워크의 체크포인트들을 (이를테면 매 에폭 후) 앙상블하여 제한적인 성공을 거둔 바 있음을 기억해 두라. 명백하게 이 방법은 다양성이 떨어지지만, 실전에서는 합리적으로 잘 작동할 수 있다. 이 방법은 매우 간편하고 저렴하다는 것이 장점이다. - **훈련 동안의 모수값들에 평균을 취하기 (Running average of parameters during training)**. 훈련 동안 (시간에 따른) 웨이트 값들의 지수 하강 합(exponentially decaying sum)을 저장하는 제 2의 네트워크를 만들면 언제나 몇 퍼센트의 이득을 값싸게 취할 수 있다. 이 방식으로 당신은 최근 몇 iteration 동안의 네트워크에 평균을 취한다고 생각할 수도 있다. 마지막 몇 스텝 동안의 웨이트값들을 이렇게 "안정화" 시킴으로써 당신은 언제나 더 나은 검증 오차를 얻을 수 있다. 거친 직관으로 생각하자면, 목적함수는 볼(bowl)-모양이고 당신의 네트워크는 극값(mode) 주변을 맴돌 것이므로, 평균을 취하면 극값에 더 가까운 어딘가에 다다를 기회가 더 많아질 것이다. 모형 앙상블의 단점이 하나 있다면 테스트 샘플에 모형을 적용할 때 평가(evaluation)에 더 시간이 걸린다는 점이다. 흥미로운 독자는 Geoff Hinton의 ["Dark Knowledge"](https://www.youtube.com/watch?v=EK61htlw8hY)에서 영감을 얻을 수도 있겠다. 여기서의 아이디어는 좋은 앙상블 모형을 하나의 모형으로 "증류"하는 것인데, 앙상블 모형의 로그-가능도(log-likelihood)를 어떤 변형된 목적함수로 통합하는 작업과 관련이 있다. + ## 요약 (Summary) 신경망(neural network)를 훈련하기 위하여: @@ -390,11 +407,12 @@ x += - learning_rate * m / (np.sqrt(v) + eps) - 코드가 제대로 돌아가는지 확인하는 방법으로, 손실함수값의 초기값이 합리적인지 그리고 데이터의 일부분으로 100&%의 훈련 정확도를 달성할 수 있는지 확인하라. - 훈련 동안, 손실함수와 훈련/검증 정확도를 계속 살펴보고, (이게 좀 더 멋져 보이면) 현재 파라미터 값 대비 업데이트 값 또한 살펴보라 (대충 ~1e-3 정도 되어야 한다). 만약 ConvNet을 다루고 있다면, 첫 층의 웨이트값도 살펴보라. - 업데이트 방법으로 추천하는 건 SGD+Nesterov Momentum 혹은 Adam이다. -- 학습 속도를 훈련 동안 계속 하강시켜라. 예를 들면, 정해진 에폭 수 뒤에 (혹은 검증 정확도가 상승하다가 하강세로 꺾이면) 학습 속도를 반으로 깎아라. +- 학습 속도를 훈련 동안 계속 하강시켜라. 예를 들면, 정해진 에폭 수 뒤에 (혹은 검증 정확도가 상승하다가 하강세로 꺾이면) 학습 속도를 반으로 깎아라. 
- 초모수 검색은 그리드 검색이 아닌 임의 검색으로 수행하라. 처음에는 성긴 규모에서 탐색하다가 (넓은 초모수 범위, 1-5 에폭 정도만 학습), 점점 촘촘하게 검색하라 (좁은 범위, 더 많은 에폭에서 학습). - 추가적인 개선을 위하여 모형 앙상블을 구축하라. + ## 추가 참고문헌 - [SGD](http://research.microsoft.com/pubs/192769/tricks-2012.pdf) tips and tricks from Leon Bottou From 4a3dcdef4deb553c7ba12f891aaa4ddd3ccf7e96 Mon Sep 17 00:00:00 2001 From: myungsub Date: Mon, 12 Sep 2016 19:36:57 +0900 Subject: [PATCH 199/199] fix additional markdown syntax --- neural-networks-1.md | 43 +++++++++++++++++++++++-------------- optimization-2.md | 50 +++++++++++++++++++++++++------------------- 2 files changed, 55 insertions(+), 38 deletions(-) diff --git a/neural-networks-1.md b/neural-networks-1.md index cfda6639..afc07e81 100644 --- a/neural-networks-1.md +++ b/neural-networks-1.md @@ -22,21 +22,23 @@ permalink: /neural-networks-1/ ## 간단한 소개 -뇌에 비유하지 않고도 신경망(neural networks)를 소개할 수 있다. 이 선형분류에 관한 섹션에서, $W$가 행렬이고 $x$가 입력 열벡터(column vector)로서 이미지의 모든 픽셀 정보값을 가질 때, $ s = W x $ 형태의 공식을 이용하여 주어진 이미지를 가지고 각 카테고리에 해당하는 스코어를 계산했었다. CIFAR-10의 경우, $x$는 크기가 [3072x1]인 열벡터이고, $W$는 크기가 [10x3072]인 행렬이었다. 따라서, 출력 스코어는 크기가 [10x1]인 벡터가 된다. (역자 주: 숫자 1개가 클래스 1개랑 관련있음.) +뇌에 비유하지 않고도 신경망(neural networks)를 소개할 수 있다. 이 선형분류에 관한 섹션에서, $$W$$가 행렬이고 $$x$$가 입력 열벡터(column vector)로서 이미지의 모든 픽셀 정보값을 가질 때, $$ s = W x $$ 형태의 공식을 이용하여 주어진 이미지를 가지고 각 카테고리에 해당하는 스코어를 계산했었다. CIFAR-10의 경우, $$x$$는 크기가 [3072x1]인 열벡터이고, $$W$$는 크기가 [10x3072]인 행렬이었다. 따라서, 출력 스코어는 크기가 [10x1]인 벡터가 된다. (역자 주: 숫자 1개가 클래스 1개랑 관련있음.) -신경망(neural network)는 그 대신, 예컨대 이런 류의 것을 계산한다: $ s = W_2 \max(0, W_1 x) $. 여기서 $W_1$는, 역시 예를 들자면, 크기가 [100x3072]인 행렬로서 이미지를 100차원짜리 중간단계 벡터로 전환하는 것일 수도 있겠다. $max(0,-) $ 함수는 비선형함수로서 $W_1 x $의 각 원소에 적용된다. (밑에서 다루겠지만), 이러한 비선형성을 구현하기 위한 방법은 여러 개 있지만, 이 함수는 흔히 쓰이는 것이고 단순히 모든 0 이하값을 0으로 막아버린다. 끝으로, 행렬 $W_2$은 크기 [10x100]짜리 행렬일 수도 있겠다. 그래서 결국에는 클래스 스코어(class score)로 쓰일 숫자 10개를 내놓게 된다. 비선형성이 계산에 있어 결정적이라는 점을 주목하자. 만약에 비선형성이 없다면, 이 행렬들은 서로 곱해져서 결국에는 하나의 행렬이 되고, 예측 스코어(score)도 역시나 입력값의 선형 함수(linear function)이 되고 만다. 이 비선형성에서 우리는 *wiggle*을 찾는다. 파라미터 $W_2, W_1$는 확률그라디언트로 학습시키고, 그 그라디언트들은 연쇄법칙(과 backpropagation)으로 계산하여 구한다. +신경망(neural network)는 그 대신, 예컨대 이런 류의 것을 계산한다: $$ s = W_2 \max(0, W_1 x) $$. 여기서 $$W_1$$는, 역시 예를 들자면, 크기가 [100x3072]인 행렬로서 이미지를 100차원짜리 중간단계 벡터로 전환하는 것일 수도 있겠다. $$max(0,-) $$ 함수는 비선형함수로서 $$W_1 x $$의 각 원소에 적용된다. (밑에서 다루겠지만), 이러한 비선형성을 구현하기 위한 방법은 여러 개 있지만, 이 함수는 흔히 쓰이는 것이고 단순히 모든 0 이하값을 0으로 막아버린다. 끝으로, 행렬 $$W_2$$은 크기 [10x100]짜리 행렬일 수도 있겠다. 그래서 결국에는 클래스 스코어(class score)로 쓰일 숫자 10개를 내놓게 된다. 비선형성이 계산에 있어 결정적이라는 점을 주목하자. 만약에 비선형성이 없다면, 이 행렬들은 서로 곱해져서 결국에는 하나의 행렬이 되고, 예측 스코어(score)도 역시나 입력값의 선형 함수(linear function)이 되고 만다. 이 비선형성에서 우리는 *wiggle*을 찾는다. 파라미터 $$W_2, W_1$$는 확률그라디언트로 학습시키고, 그 그라디언트들은 연쇄법칙(과 backpropagation)으로 계산하여 구한다. -3단계 신경망(neural network)는 $ s = W_3 \max(0, W_2 \max(0, W_1 x)) $랑 비슷하다. 이 때, $W_3, W_2, W_1$들은 모두 파라미터(parameter)들이고 추후에 학습시킨다. 중간 단계 벡터의 크기들은 하이퍼파라미터(hyperparameter)로서 나중에 어떻게 정하는지 알아보겠다. 이제 뉴런(neuron) 혹은 네트워크의 입장에서 이 계산들을 어떻게 해석해야하는지 알아보자. +3단계 신경망(neural network)는 $$ s = W_3 \max(0, W_2 \max(0, W_1 x)) $$랑 비슷하다. 이 때, $$W_3, W_2, W_1$$들은 모두 파라미터(parameter)들이고 추후에 학습시킨다. 중간 단계 벡터의 크기들은 하이퍼파라미터(hyperparameter)로서 나중에 어떻게 정하는지 알아보겠다. 이제 뉴런(neuron) 혹은 네트워크의 입장에서 이 계산들을 어떻게 해석해야하는지 알아보자. -## 뉴런 하나 모델링하기 + +## 뉴런 하나 모델링하기 The area of Neural Networks has originally been primarily inspired by the goal of modeling biological neural systems, but has since diverged and become a matter of engineering and achieving good results in Machine Learning tasks. 
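
(A quick numerical illustration of the layered score computation $$ s = W_3 \max(0, W_2 \max(0, W_1 x)) $$ described above, as a minimal numpy sketch; the hidden sizes 100 and 50 are arbitrary hyperparameters chosen for this example:)

~~~python
import numpy as np

x = np.random.randn(3072, 1)          # a CIFAR-10 image as a column vector
W1 = 0.01 * np.random.randn(100, 3072)
W2 = 0.01 * np.random.randn(50, 100)
W3 = 0.01 * np.random.randn(10, 50)

h1 = np.maximum(0, W1.dot(x))         # first hidden layer, max(0,-) non-linearity
h2 = np.maximum(0, W2.dot(h1))        # second hidden layer
s = W3.dot(h2)                        # class scores, shape (10, 1)
print(s.shape)
~~~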
## Modeling one neuron

The area of Neural Networks has originally been primarily inspired by the goal of modeling biological neural systems, but has since diverged and become a matter of engineering and achieving good results in Machine Learning tasks. Nonetheless, we begin our discussion with a very brief and high-level description of the biological system that a large portion of this area has been inspired by.

### Biological motivation and connections

The basic computational unit of the brain is a **neuron**. Approximately 86 billion neurons can be found in the human nervous system and they are connected with approximately 10^14 - 10^15 **synapses**. The diagram below shows a cartoon drawing of a biological neuron (left) and a common mathematical model (right). Each neuron receives input signals from its **dendrites** and produces output signals along its (single) **axon**. The axon eventually branches out and connects via synapses to dendrites of other neurons. In the computational model of a neuron, the signals that travel along the axons (e.g. $$x_0$$) interact multiplicatively (e.g. $$w_0 x_0$$) with the dendrites of the other neuron based on the synaptic strength at that synapse (e.g. $$w_0$$). The idea is that the synaptic strengths (the weights $$w$$) are learnable and control the strength of influence (and its direction: excitatory (positive weight) or inhibitory (negative weight)) of one neuron on another. In the basic model, the dendrites carry the signal to the cell body where they all get summed. If the final sum is above a certain threshold, the neuron can *fire*, sending a spike along its axon. In the computational model, we assume that the precise timings of the spikes do not matter, and that only the frequency of the firing communicates information. Based on this *rate code* interpretation, we model the *firing rate* of the neuron with an **activation function** $$f$$, which represents the frequency of the spikes along the axon.
Historically, a common choice of activation function is the **sigmoid function** $$\sigma$$, since it takes a real-valued input (the signal strength after the sum) and squashes it to range between 0 and 1. We will see details of these activation functions later in this section.
@@ -56,25 +58,27 @@ class Neuron(object):
    return firing_rate
~~~

In other words, each neuron performs a dot product with the input and its weights, adds the bias and applies the non-linearity (or activation function), in this case the sigmoid $$\sigma(x) = 1/(1+e^{-x})$$. We will go into more details about different activation functions at the end of this section.

**Coarse model.** It's important to stress that this model of a biological neuron is very coarse: for example, there are many different types of neurons, each with different properties. The dendrites in biological neurons perform complex nonlinear computations. The synapses are not just a single weight, they're a complex non-linear dynamical system. The exact timing of the output spikes in many systems is known to be important, suggesting that the rate code approximation may not hold. Due to all these and many other simplifications, be prepared to hear groaning sounds from anyone with some neuroscience background if you draw analogies between Neural Networks and real brains. See this [review](https://physics.ucsd.edu/neurophysics/courses/physics_171/annurev.neuro.28.061604.135703.pdf) (pdf), or more recently this [review](http://www.sciencedirect.com/science/article/pii/S0959438814000130) if you are interested.

### Single neuron as a linear classifier

The mathematical form of the model neuron's forward computation might look familiar to you. As we saw with linear classifiers, a neuron has the capacity to "like" (activation near one) or "dislike" (activation near zero) certain linear regions of its input space. Hence, with an appropriate loss function on the neuron's output, we can turn a single neuron into a linear classifier:

**Binary Softmax classifier**. For example, we can interpret $$\sigma(\sum_iw_ix_i + b)$$ to be the probability of one of the classes $$P(y_i = 1 \mid x_i; w)$$. The probability of the other class would be $$P(y_i = 0 \mid x_i; w) = 1 - P(y_i = 1 \mid x_i; w)$$, since they must sum to one. With this interpretation, we can formulate the cross-entropy loss as we have seen in the Linear Classification section, and optimizing it would lead to a binary Softmax classifier (also known as *logistic regression*). Since the sigmoid function is restricted to be between 0-1, the predictions of this classifier are based on whether the output of the neuron is greater than 0.5. A minimal training sketch is given below.

**Binary SVM classifier**. Alternatively, we could attach a max-margin hinge loss to the output of the neuron and train it to become a binary Support Vector Machine.
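To make the binary Softmax interpretation concrete, here is a minimal sketch (ours, not from the notes) of one sigmoid neuron trained as a logistic regression classifier; the tiny dataset, the learning rate of 1.0 and the 100 steps are arbitrary choices for illustration:

~~~python
import numpy as np

# tiny made-up dataset: 4 examples with 2 features each, labels 0 or 1
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -2.0], [-2.0, 1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w, b = np.zeros(2), 0.0

for step in range(100):
  p = 1.0 / (1.0 + np.exp(-(X.dot(w) + b)))  # sigma(w^T x + b) = P(y = 1 | x)
  dz = (p - y) / len(y)   # gradient of the mean cross-entropy loss w.r.t. the summed input
  w -= 1.0 * X.T.dot(dz)  # gradient descent step on the weights
  b -= 1.0 * np.sum(dz)   # ... and on the bias

print(p > 0.5)  # predict class 1 wherever the neuron's output exceeds 0.5
~~~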
**Regularization interpretation**. The regularization loss in both SVM/Softmax cases could in this biological view be interpreted as *gradual forgetting*, since it would have the effect of driving all synaptic weights $$w$$ towards zero after every parameter update.

> A single neuron can be used to implement a binary classifier (e.g. binary Softmax or binary SVM classifiers)

### Commonly used activation functions

Every activation function (or *non-linearity*) takes a single number and performs a certain fixed mathematical operation on it. There are several activation functions you may encounter in practice:
[Figure. Left: the sigmoid non-linearity squashes real numbers into the range [0,1]. Right: the tanh non-linearity squashes real numbers into the range [-1,1].]
**Sigmoid.** The sigmoid non-linearity has the mathematical form $$\sigma(x) = 1 / (1 + e^{-x})$$ and is shown in the image above on the left. As alluded to in the previous section, it takes a real-valued number and "squashes" it into range between 0 and 1. In particular, large negative numbers become 0 and large positive numbers become 1. The sigmoid function has seen frequent use historically since it has a nice interpretation as the firing rate of a neuron: from not firing at all (0) to fully-saturated firing at an assumed maximum frequency (1). In practice, the sigmoid non-linearity has recently fallen out of favor and it is rarely ever used. It has two major drawbacks:

- *Sigmoids saturate and kill gradients*. A very undesirable property of the sigmoid neuron is that when the neuron's activation saturates at either tail of 0 or 1, the gradient at these regions is almost zero. Recall that during backpropagation, this (local) gradient will be multiplied to the gradient of this gate's output for the whole objective. Therefore, if the local gradient is very small, it will effectively "kill" the gradient and almost no signal will flow through the neuron to its weights and recursively to its data. Additionally, one must pay extra caution when initializing the weights of sigmoid neurons to prevent saturation. For example, if the initial weights are too large then most neurons would become saturated and the network will barely learn.
- *Sigmoid outputs are not zero-centered*. This is undesirable since neurons in later layers of processing in a Neural Network (more on this soon) would be receiving data that is not zero-centered. This has implications on the dynamics during gradient descent, because if the data coming into a neuron is always positive (e.g. $$x > 0$$ elementwise in $$f = w^Tx + b$$), then the gradient on the weights $$w$$ will during backpropagation become either all positive, or all negative (depending on the gradient of the whole expression $$f$$). This could introduce undesirable zig-zagging dynamics in the gradient updates for the weights. However, notice that once these gradients are added up across a batch of data the final update for the weights can have variable signs, somewhat mitigating this issue. Therefore, this is an inconvenience but it has less severe consequences compared to the saturated activation problem above.

**Tanh.** The tanh non-linearity is shown on the image above on the right. It squashes a real-valued number to the range [-1, 1]. Like the sigmoid neuron, its activations saturate, but unlike the sigmoid neuron its output is zero-centered. Therefore, in practice the *tanh non-linearity is always preferred to the sigmoid non-linearity*. Also note that the tanh neuron is simply a scaled sigmoid neuron; in particular the following holds: $$ \tanh(x) = 2 \sigma(2x) - 1 $$.
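As a quick standalone check (our own snippet, not part of the original notes) of the saturation point above: the local gradients of both squashing functions collapse toward zero as the input grows, which is exactly the backprop signal that gets "killed":

~~~python
import numpy as np

x = np.array([0.0, 2.0, 5.0, 10.0])  # activations from mild to strongly saturated
sig = 1.0 / (1.0 + np.exp(-x))
print(sig * (1 - sig))      # sigmoid local gradient: 0.25, 0.105, 0.0066, ~4.5e-05
print(1 - np.tanh(x) ** 2)  # tanh local gradient:    1.0,  0.071, 0.00018, ~8.2e-09
~~~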
@@ -98,24 +102,26 @@ Every activation function (or *non-linearity*) takes a single number and perform
[Figure. Left: the Rectified Linear Unit (ReLU) activation function, which is zero when x < 0 and then linear with slope 1 when x > 0. Right: a plot from the Krizhevsky et al. (pdf) paper indicating the 6x improvement in convergence with the ReLU unit compared to the tanh unit.]
**ReLU.** The Rectified Linear Unit has become very popular in the last few years. It computes the function $$f(x) = \max(0, x)$$. In other words, the activation is simply thresholded at zero (see image above on the left). There are several pros and cons to using the ReLUs:

- (+) It was found to greatly accelerate (e.g. a factor of 6 in [Krizhevsky et al.](http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf)) the convergence of stochastic gradient descent compared to the sigmoid/tanh functions. It is argued that this is due to its linear, non-saturating form.
- (+) Compared to tanh/sigmoid neurons that involve expensive operations (exponentials, etc.), the ReLU can be implemented by simply thresholding a matrix of activations at zero.
- (-) Unfortunately, ReLU units can be fragile during training and can "die". For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again. If this happens, then the gradient flowing through the unit will forever be zero from that point on. That is, the ReLU units can irreversibly die during training since they can get knocked off the data manifold. For example, you may find that as much as 40% of your network can be "dead" (i.e. neurons that never activate across the entire training dataset) if the learning rate is set too high. With a proper setting of the learning rate this is less frequently an issue.

**Leaky ReLU.** Leaky ReLUs are one attempt to fix the "dying ReLU" problem. Instead of the function being zero when x < 0, a leaky ReLU will instead have a small negative slope (of 0.01, or so). That is, the function computes $$f(x) = \mathbb{1}(x < 0) (\alpha x) + \mathbb{1}(x>=0) (x)$$ where $$\alpha$$ is a small constant. Some people report success with this form of activation function, but the results are not always consistent. The slope in the negative region can also be made into a parameter of each neuron, as seen in PReLU neurons, introduced in [Delving Deep into Rectifiers](http://arxiv.org/abs/1502.01852), by Kaiming He et al., 2015. However, the consistency of the benefit across tasks is presently unclear. A short sketch of both variants follows below.
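Here is a minimal side-by-side sketch (ours, with made-up pre-activations) of the two functions, together with the kind of "dead unit" check that the TLDR below recommends monitoring:

~~~python
import numpy as np

z = np.random.randn(256, 100)         # hypothetical pre-activations (batch x units)
relu = np.maximum(0, z)               # hard threshold at zero
leaky = np.where(z > 0, z, 0.01 * z)  # small negative slope instead of a hard zero

# fraction of units that never fire on this batch; near zero here since z is random,
# but in a trained net with too high a learning rate it can become large for ReLU
print(np.mean(np.all(relu == 0, axis=0)), np.mean(np.all(leaky == 0, axis=0)))
~~~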
**Maxout**. Other types of units have been proposed that do not have the functional form $$f(w^Tx + b)$$ where a non-linearity is applied on the dot product between the weights and the data. One relatively popular choice is the Maxout neuron (introduced recently by [Goodfellow et al.](http://www-etud.iro.umontreal.ca/~goodfeli/maxout.html)) that generalizes the ReLU and its leaky version. The Maxout neuron computes the function $$\max(w_1^Tx+b_1, w_2^Tx + b_2)$$. Notice that both ReLU and Leaky ReLU are a special case of this form (for example, for ReLU we have $$w_1, b_1 = 0$$). The Maxout neuron therefore enjoys all the benefits of a ReLU unit (linear regime of operation, no saturation) and does not have its drawbacks (dying ReLU). However, unlike the ReLU neurons it doubles the number of parameters for every single neuron, leading to a high total number of parameters.

This concludes our discussion of the most common types of neurons and their activation functions. As a last comment, it is very rare to mix and match different types of neurons in the same network, even though there is no fundamental problem with doing so.

**TLDR**: "*What neuron type should I use?*" Use the ReLU non-linearity, be careful with your learning rates and possibly monitor the fraction of "dead" units in a network. If this concerns you, give Leaky ReLU or Maxout a try. Never use sigmoid. Try tanh, but expect it to work worse than ReLU/Maxout.

## Neural Network architectures

### Layer-wise organization

**Neural Networks as neurons in graphs**. Neural Networks are modeled as collections of neurons that are connected in an acyclic graph. In other words, the outputs of some neurons can become inputs to other neurons. Cycles are not allowed since that would imply an infinite loop in the forward pass of a network. Instead of amorphous blobs of connected neurons, Neural Network models are often organized into distinct layers of neurons. For regular neural networks, the most common layer type is the **fully-connected layer** in which neurons between two adjacent layers are fully pairwise connected, but neurons within a single layer share no connections. Below are two example Neural Network topologies that use a stack of fully-connected layers:

@@ -138,6 +144,7 @@

To give you some context, modern Convolutional Networks contain on the order of 100 million parameters and are usually made up of approximately 10-20 layers (hence *deep learning*). However, as we will see, the number of *effective* connections is significantly greater due to parameter sharing. More on this in the Convolutional Neural Networks module.
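As a quick sizing illustration (our own sketch, using the layer sizes of the 3-layer example network in the next subsection: 3 inputs, two hidden layers of 4 neurons, one output), counting learnable parameters is just weights plus biases:

~~~python
sizes = [3, 4, 4, 1]  # input, two hidden layers, output
weights = sum(n_in * n_out for n_in, n_out in zip(sizes, sizes[1:]))  # 12 + 16 + 4 = 32
biases = sum(sizes[1:])                                               # 4 + 4 + 1 = 9
print(weights + biases)  # 41 learnable parameters
~~~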
### Example feed-forward computation

*Repeated matrix multiplications interwoven with activation function*. One of the primary reasons that Neural Networks are organized into layers is that this structure makes it very simple and efficient to evaluate Neural Networks using matrix vector operations. Working with the example three-layer neural network in the diagram above, the input would be a [3x1] vector. All connection strengths for a layer can be stored in a single matrix. For example, the first hidden layer's weights `W1` would be of size [4x3], and the biases for all units would be in the vector `b1`, of size [4x1]. Here, every single neuron has its weights in a row of `W1`, so the matrix vector multiplication `np.dot(W1,x)` evaluates the activations of all neurons in that layer. Similarly, `W2` would be a [4x4] matrix that stores the connections of the second hidden layer, and `W3` a [1x4] matrix for the last (output) layer. The full forward pass of this 3-layer neural network is then simply three matrix multiplications, interwoven with the application of the activation function:
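The patch elides the original listing at this point, so the following is a reconstruction written to match the shapes described above; the random values stand in for learned parameters and a real input:

~~~python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))  # activation function (here: sigmoid)
x = np.random.randn(3, 1)               # random input vector of three numbers (3x1)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)  # first hidden layer parameters
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)  # second hidden layer parameters
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)  # output layer parameters

h1 = f(np.dot(W1, x) + b1)   # first hidden layer activations (4x1)
h2 = f(np.dot(W2, h1) + b2)  # second hidden layer activations (4x1)
out = np.dot(W3, h2) + b3    # output neuron (1x1)
~~~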
@@ -156,13 +163,14 @@ In the above code, `W1,W2,W3,b1,b2,b3` are the learnable parameters of the netwo

> The forward pass of a fully-connected layer corresponds to one matrix multiplication followed by a bias offset and an activation function.

### Representational power

One way to look at Neural Networks with fully-connected layers is that they define a family of functions that are parameterized by the weights of the network. A natural question that arises is: what is the representational power of this family of functions? In particular, are there functions that cannot be modeled with a Neural Network?

It turns out that Neural Networks with at least one hidden layer are *universal approximators*. That is, it can be shown (e.g. see [*Approximation by Superpositions of Sigmoidal Function*](http://www.dartmouth.edu/~gvc/Cybenko_MCSS.pdf) from 1989 (pdf), or this [intuitive explanation](http://neuralnetworksanddeeplearning.com/chap4.html) from Michael Nielsen) that given any continuous function $$f(x)$$ and some $$\epsilon > 0$$, there exists a Neural Network $$g(x)$$ with one hidden layer (with a reasonable choice of non-linearity, e.g. sigmoid) such that $$ \forall x, \mid f(x) - g(x) \mid < \epsilon $$. In other words, the neural network can approximate any continuous function.

If one hidden layer suffices to approximate any function, why use more layers and go deeper? The answer is that the fact that a two-layer Neural Network is a universal approximator is, while mathematically cute, a relatively weak and useless statement in practice. In one dimension, the "sum of indicator bumps" function $$g(x) = \sum_i c_i \mathbb{1}(a_i < x < b_i)$$ where $$a,b,c$$ are parameter vectors is also a universal approximator, but no one would suggest that we use this functional form in Machine Learning. Neural Networks work well in practice because they compactly express nice, smooth functions that fit well with the statistical properties of data we encounter in practice, and are also easy to learn using our optimization algorithms (e.g. gradient descent). Similarly, the fact that deeper networks (with multiple hidden layers) can work better than single-hidden-layer networks is an empirical observation, despite the fact that their representational power is equal.

As an aside, in practice it is often the case that 3-layer neural networks will outperform 2-layer nets, but going even deeper (4,5,6-layer) rarely helps much more. This is in stark contrast to Convolutional Networks, where depth has been found to be an extremely important component for a good recognition system (e.g. on the order of 10 learnable layers). One argument for this observation is that images contain hierarchical structure (e.g. faces are made up of eyes, which are made up of edges, etc.), so several layers of processing make intuitive sense for this data domain.

The full story is, of course, much more involved and a topic of much recent research. If you are interested in these topics we recommend for further reading:

- [FitNets: Hints for Thin Deep Nets](http://arxiv.org/abs/1412.6550)

### Setting number of layers and their sizes

How do we decide on what architecture to use when faced with a practical problem? Should we use no hidden layers? One hidden layer? Two hidden layers? How large should each layer be? First, note that as we increase the size and number of layers in a Neural Network, the **capacity** of the network increases. That is, the space of representable functions grows since the neurons can collaborate to express many different functions. For example, suppose we had a binary classification problem in two dimensions. We could train three separate neural networks, each with one hidden layer of some size, and obtain the following classifiers:

@@ -200,6 +209,7 @@ To reiterate, the regularization strength is the preferred way to control the ov

The takeaway is that you should not be using smaller networks because you are afraid of overfitting. Instead, you should use as big of a neural network as your computational budget allows, and use other regularization techniques to control overfitting.

## Summary

In summary,

- We discussed the fact that larger networks will always work better than smaller networks, but their higher model capacity must be appropriately addressed with stronger regularization (such as higher weight decay), or they might overfit.
We will see more forms of regularization (especially dropout) in later sections.

## Additional References

- [deeplearning.net tutorial](http://www.deeplearning.net/tutorial/mlp.html) with Theano

diff --git a/optimization-2.md b/optimization-2.md
index 55188b56..723df3b9 100644
--- a/optimization-2.md
+++ b/optimization-2.md
@@ -19,11 +19,11 @@ Table of Contents:

### Introduction

**Motivation**. In this section we will develop expertise with an intuitive understanding of **backpropagation**, which is a way of computing gradients through a network by recursively applying the **chain rule**. Understanding this process and its subtleties is critical for you to be able to effectively develop, design and debug neural networks.

**Problem statement**. The core problem studied in this section is as follows: we are given some function $$f(x)$$, where $$x$$ is a vector of inputs, and we want to compute the gradient of $$f$$ at the given input $$x$$ (i.e. $$\nabla f(x)$$).

**Motivation**. To be more specific about why we care about this problem from a neural network perspective: $$f$$ corresponds to the loss function ($$L$$), and the inputs $$x$$ consist of the training data and the neural network weights. For example, the loss could be the SVM loss function, and the inputs would be both the training data $$(x_i,y_i), i=1 \ldots N$$ and the weights and biases $$W,b$$. Note that (as is usually the case in machine learning) we think of the training data as given and fixed, and of the weights as the values we actually control. Hence, even though it is easy to compute the gradient on the input examples $$x_i$$, in practice we usually compute the gradient for the parameters and use it to perform a parameter update. However, the gradient on the inputs $$x_i$$ can still be useful, for example for visualizing and interpreting what the neural network is doing, which we will cover later in this class.

Even if you are already comfortable deriving gradients with the chain rule, we encourage you to at least skim this section, since it presents a rarely discussed view of backpropagation as a backward flow of real-valued signals, and the insight you gain from it should help you throughout this class.

@@ -32,13 +32,13 @@

### Simple expressions and interpretation of the gradient

Let us start simple, before developing machinery for more complex models. Define a simple function f that multiplies two numbers x and y: $$f(x,y) = x y$$. It is a matter of simple calculus to derive the partial derivative for either input:

$$
f(x,y) = x y \hspace{0.5in} \rightarrow \hspace{0.5in} \frac{\partial f}{\partial x} = y \hspace{0.5in} \frac{\partial f}{\partial y} = x
$$

**Interpretation**. Keep in mind what the derivative tells you: it is the amount of change in the function's value for a tiny (very close to zero) change of an input variable around its current value:

$$
\frac{df(x)}{dx} = \lim_{h\ \to 0} \frac{f(x + h) - f(x)}{h}
$$

@@ -48,7 +48,7 @@

> The derivative on each variable is like a sensitivity: it tells you how much that variable, at its current value, influences the value of the whole expression.
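A quick numeric illustration of this definition (our own snippet, not part of the original notes): a finite difference with a small $$h$$ reproduces the analytic gradients of the multiplication example:

~~~python
f = lambda x, y: x * y
x, y, h = 4.0, -3.0, 1e-5
print((f(x + h, y) - f(x, y)) / h)  # ~ -3.0, matching df/dx = y
print((f(x, y + h) - f(x, y)) / h)  # ~  4.0, matching df/dy = x
~~~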
As mentioned before, the gradient $$\nabla f$$ is the vector of partial derivatives, so in equations we have $$\nabla f = [\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}] = [y, x]$$. Even though the gradient is technically a vector, for simplicity we will often use expressions such as *"the gradient on x"* instead of the technically correct phrase *"the partial derivative on x"*.

We can also derive the derivatives (the gradient) for expressions such as the following:

@@ -68,7 +68,7 @@

### Compound expressions with the chain rule

Let us now consider more complicated expressions that involve multiple composed functions, such as $$f(x,y,z) = (x + y) z$$. This expression is still simple enough to differentiate directly, but we will take a particular approach to it, which will be helpful for understanding the intuition behind backpropagation. In particular, note that this expression can be broken down into two expressions: $$q = x + y$$ and $$f = q z$$. Moreover, as seen in the previous section, we know how to compute the derivatives of both expressions separately. $$f$$ is just the product of $$q$$ and $$z$$, so $$\frac{\partial f}{\partial q} = z, \frac{\partial f}{\partial z} = q$$, and $$q$$ is the sum of $$x$$ and $$y$$, so $$\frac{\partial q}{\partial x} = 1, \frac{\partial q}{\partial y} = 1$$. However, we do not need to care about the gradient on the intermediate value $$q$$ ($$\frac{\partial f}{\partial q}$$); instead we are ultimately interested in the gradient of $$f$$ with respect to the inputs $$x,y,z$$. The **chain rule** tells us that the correct way to "chain" these gradient expressions together is through multiplication. For example, $$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x}$$. In practice this is simply a multiplication of the two numbers that hold the two gradients. Let us see this with an example:

~~~python
# set some inputs
x = -2; y = 5; z = -4

# perform the forward pass
q = x + y # q becomes 3
f = q * z # f becomes -12

# perform the backward pass (backpropagation) in reverse order:
# first backprop through f = q * z
dfdz = q # df/dz = q, so gradient on z becomes 3
dfdq = z # df/dq = z, so gradient on q becomes -4
# now backprop through q = x + y
dfdx = 1.0 * dfdq # dq/dx = 1. And the multiplication here is the chain rule!
dfdy = 1.0 * dfdq # dq/dy = 1
~~~
### Intuitive understanding of backpropagation

Notice that backpropagation is a strikingly local process. Every gate in a circuit diagram receives some inputs and can right away compute two things: 1. its output value, and 2. the *local* gradient of its output with respect to its inputs. Note that the gates can compute these values completely independently, without being aware of any of the details of the full circuit they are embedded in. However, once the forward pass is over, during backpropagation each gate will eventually learn the gradient of the final output of the entire circuit with respect to its own output. The chain rule says that the gate should take that gradient and multiply it into every local gradient it computed for all of its inputs.

> Thanks to the chain rule, this extra multiplication for each input can turn a single, relatively useless gate into a cog in a complex circuit such as an entire neural network.

Let us get an intuition for how this works using the example above. The add gate received inputs [-2, 5] and computed output 3. Since the gate is computing the addition operation, its local gradient for both of its inputs is +1. The rest of the circuit computed the final value, which is -12. During the backward pass, in which the chain rule is applied recursively backwards through the circuit, the add gate (which is an input to the multiply gate) learns that the gradient on its output was -4. If we anthropomorphize the circuit as wanting to output a higher value (which can help with intuition), then we can think of the circuit as "wanting" the output of the add gate to be lower (because of the negative sign), and with a *force* of 4. To continue the recurrence and to chain the gradients, the add gate takes that gradient and multiplies it into all of the local gradients for its inputs (making the gradient on both **x** and **y** 1 * -4 = -4). Notice that this has the desired effect: if **x,y** decrease (responding to their negative gradient), then the add gate's output will decrease, which in turn makes the multiply gate's output increase.

Backpropagation can thus be thought of as gates communicating to each other whether they want their outputs to increase or decrease (and how strongly), so as to make the final output value larger.

### Modularity: sigmoid example

The gates we introduced above are relatively arbitrary. Any kind of function can act as a gate as long as it is differentiable, and when convenient we can group multiple gates into a single gate, or decompose one function into multiple gates. Let us look at another expression that illustrates this point:

$$
f(w,x) = \frac{1}{1+e^{-(w_0x_0 + w_1x_1 + w_2)}}
$$

As we will see later in another class, this expression describes a 2-dimensional neuron (with inputs **x** and weights **w**) that uses the *sigmoid activation* function. For now, let us think of it very simply as just a function that takes *w,x* as inputs and outputs a single number. The function is made up of multiple gates. In addition to the ones already described above (add, multiply, max), there are four more:

$$
f(x) = \frac{1}{x}
\hspace{1in} \rightarrow \hspace{1in}
\frac{df}{dx} = -1/x^2
\\\\
f_c(x) = c + x
\hspace{1in} \rightarrow \hspace{1in}
\frac{df}{dx} = 1
\\\\
f(x) = e^x
\hspace{1in} \rightarrow \hspace{1in}
\frac{df}{dx} = e^x
\\\\
f_a(x) = ax
\hspace{1in} \rightarrow \hspace{1in}
\frac{df}{dx} = a
$$

Here, the functions $$f_c, f_a$$ translate the input by a constant $$c$$ and scale it by a constant $$a$$, respectively. These are technically special cases of addition and multiplication, but we introduce them as (new) unary gates here since we do not need the gradients for the constants $$c,a$$. The full circuit then looks as follows:
[Figure: example circuit for a 2-dimensional neuron with a sigmoid activation function. The inputs are [x0,x1] = [-1.00,-2.00] and the weights are [w0,w1,w2] = [2.00,-3.00,-3.00]; the *, +, *-1, exp, +1 and 1/x gates are annotated with their forward-pass values and backward-pass gradients.]
In the example above, we see a long chain of function applications that operates on the result of the dot product between **w,x**. The function that these operations implement is called the *sigmoid function* $$\sigma(x)$$. It turns out that the derivative of the sigmoid function with respect to its input simplifies if you perform the derivation (after a fun and tricky part where a 1 is added to and subtracted from the numerator):

$$
\sigma(x) = \frac{1}{1+e^{-x}} \\\\
\rightarrow \hspace{0.3in} \frac{d\sigma(x)}{dx} = \frac{e^{-x}}{(1+e^{-x})^2} = \left( \frac{1 + e^{-x} - 1}{1 + e^{-x}} \right) \left( \frac{1}{1+e^{-x}} \right)
= \left( 1 - \sigma(x) \right) \sigma(x)
$$

Because the local gradient is this simple, the backward pass through the whole sigmoid gate collapses to a single expression (the forward lines elided by the patch are filled in here so the fragment runs standalone):

~~~python
import math

w = [2,-3,-3] # assume some random weights and data
x = [-1, -2]

# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2]
f = 1.0 / (1 + math.exp(-dot)) # sigmoid function

# backward pass through the neuron (we can do it all at once!)
ddot = (1 - f) * f # gradient on dot variable, using the sigmoid gradient derivation
dx = [w[0] * ddot, w[1] * ddot] # backprop into x
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot] # backprop into w
~~~

The point of this section is that the details of how the backpropagation is performed, and which parts of the forward function we think of as gates, are a matter of convenience. It helps to be aware of which parts of the expression have easy local gradients, so that they can be chained together with the least amount of code and effort.

### Backprop in practice: staged computation

Let us see this with another example. Suppose that we have a function of the form:

$$
f(x,y) = \frac{x + \sigma(y)}{\sigma(x) + (x+y)^2}
$$

The forward pass is structured as a sequence of simple, numbered stages (the lines elided by the patch hunks are filled in so that the chain of stages is complete):

~~~python
import math

x = 3 # example values
y = -4

# forward pass
sigy = 1.0 / (1 + math.exp(-y)) # sigmoid in numerator   #(1)
num = x + sigy # numerator                               #(2)
sigx = 1.0 / (1 + math.exp(-x)) # sigmoid in denominator #(3)
xpy = x + y                                              #(4)
xpysqr = xpy**2                                          #(5)
den = sigx + xpysqr # denominator                        #(6)
invden = 1.0 / den                                       #(7)
f = num * invden # done!                                 #(8)
~~~

and the backward pass then visits the stages in reverse order:

~~~python
# backprop f = num * invden
dnum = invden # gradient on numerator                    #(8)
dinvden = num                                            #(8)
# backprop invden = 1.0 / den
dden = (-1.0 / (den**2)) * dinvden                       #(7)
# backprop den = sigx + xpysqr
dsigx = (1) * dden                                       #(6)
dxpysqr = (1) * dden                                     #(6)
# backprop xpysqr = xpy**2
dxpy = (2 * xpy) * dxpysqr                               #(5)
# backprop xpy = x + y
dx = (1) * dxpy                                          #(4)
dy = (1) * dxpy                                          #(4)
# backprop sigx = 1.0 / (1 + math.exp(-x))
dx += ((1 - sigx) * sigx) * dsigx # Notice += !! See notes below #(3)
# backprop num = x + sigy
dx += (1) * dnum                                         #(2)
dsigy = (1) * dnum                                       #(2)
# backprop sigy = 1.0 / (1 + math.exp(-y))
dy += ((1 - sigy) * sigy) * dsigy                        #(1)
~~~

**Cache forward pass variables**. To compute the backward pass it is very helpful to have some of the variables that were used in the forward pass. In practice you want to structure your code so that you cache these variables and have them available during backpropagation. If this is too difficult, it is possible (but wasteful) to recompute them.

**Gradients add up at forks**. The forward expression involves the variables **x,y** multiple times, so when we perform backpropagation we must be careful to use `+=` instead of `=` to accumulate the gradient on these variables (otherwise we would overwrite it). This follows the *multivariate chain rule* in Calculus, which states that if a variable branches out to different parts of the circuit, then the gradients that flow back to it add up.

### Patterns in backward flow

It is interesting to note that in many cases the backward-flowing gradient can be interpreted on an intuitive level. For example, the three most commonly used gates in neural networks (*add, mul, max*) all have very simple interpretations in terms of how they act during backpropagation. Consider this example circuit:

@@ -256,11 +260,12 @@

The **add gate** always takes the gradient on its output and distributes it equally to all of its inputs, regardless of what their values were during the forward pass.

The **max gate** routes the gradient. Unlike the add gate which distributed the gradient unchanged to all its inputs, the max gate distributes the gradient (unchanged) to exactly one of its inputs (the input that had the highest value during the forward pass). This is because the local gradient for a max gate is 1.0 for the highest value, and 0.0 for all other values. In the example circuit above, the max operation routed the gradient of 2.00 to the **z** variable, which had a higher value than **w**, and the gradient on **w** remains zero.
The **multiply gate** is a little less easy to interpret. Its local gradients are the input values (except switched), and this is multiplied by the gradient on its output during the chain rule. In the example above, the gradient on **x** is -8.00, which is -4.00 x 2.00.

*Unintuitive effects and their consequences*. Notice that if one of the inputs to the multiply gate is very small and the other is very big, then the multiply gate will do something slightly unintuitive: it will assign a relatively huge gradient to the small input and a tiny gradient to the large input. Note that in linear classifiers, where the weights are dot producted $$w^Tx_i$$ (multiplied) with the inputs, this implies that the scale of the data has an effect on the magnitude of the gradient for the weights. For example, if you multiplied all input data examples $$x_i$$ by 1000 during preprocessing, then the gradient on the weights would be 1000 times larger, and you'd have to lower the learning rate by that factor to compensate. This is why preprocessing matters a lot, sometimes in subtle ways! Having an intuitive understanding of how the gradients flow can help you debug some of these cases.

### Gradients for vectorized operations

The above sections were concerned with single variables, but all concepts extend in a straight-forward manner to matrix and vector operations. However, one must pay closer attention to dimensions and transpose operations. For the matrix-matrix multiplication `D = W.dot(X)`, the gradients follow the transpose pattern (the partially elided block is completed here, with the missing import added so it runs standalone):

~~~python
import numpy as np

# forward pass
W = np.random.randn(5, 10)
X = np.random.randn(10, 3)
D = W.dot(X)

# now suppose we had the gradient on D from above in the circuit
dD = np.random.randn(*D.shape) # same shape as D
dW = dD.dot(X.T) # .T gives the transpose of the matrix
dX = W.T.dot(dD)
~~~

**Work with small, explicit examples**. Some people may find it difficult at first to derive the gradient updates for some vectorized expressions. Our recommendation is to explicitly write out a minimal vectorized example, derive the gradient on paper and then generalize the pattern to its efficient, vectorized form; a minimal check in that spirit is sketched after the summary below.

### Summary

- We developed intuition for what the gradients mean, how they flow backwards in the circuit, and how they communicate which part of the circuit should increase or decrease and with what force to make the final output higher.
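In that spirit, here is a small, self-contained check (our own sketch) that the transpose pattern above matches a numerical gradient; the scalar loss `np.sum(D)` and the shapes are arbitrary choices for illustration:

~~~python
import numpy as np

W = np.random.randn(5, 10)
X = np.random.randn(10, 3)
dD = np.ones((5, 3))  # gradient of loss = np.sum(D) with respect to D = W.dot(X)
dW = dD.dot(X.T)      # analytic gradient on W, via the transpose pattern
dX = W.T.dot(dD)      # analytic gradient on X

h = 1e-5
Wp = W.copy(); Wp[0, 0] += h  # nudge a single entry of W
numerical = (np.sum(Wp.dot(X)) - np.sum(W.dot(X))) / h
print(numerical, dW[0, 0])    # the two numbers should agree closely
~~~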