Jekyll feed, generated 2017-09-21T16:23:49+00:00. http://atishay.me/ | Atishay Jain
Personal website of Atishay Jain, developer at Adobe Systems with over 6 years of experience developing a wide range of Adobe products, from your favorite desktop apps and websites (full stack) to mobile. I have worked on a variety of platforms and technology domains. I love solving problems and believe in searching for the best way to solve a problem before diving in.
Easy Deep Learning Part X - Tips from the experts (2017-09-21) http://atishay.me/blog/2017/09/21/Deep-Learning-Part-10
<h4 id="recap">Recap</h4>
<p>So far we have looked at dense and convolutional neural networks. We then tried our hand at transfer learning, that is, taking a complicated pre-trained model and tuning it to a different data set, getting wonderful results for that data set.</p>
<p>In this post we look at the identifying characteristics and new techniques that we can pick up from the best papers in the ImageNet challenge.</p>
<h4 id="5x5-as-two-stacks-of-3x3">5x5 as two stacks of 3x3</h4>
<p>We already talked about how two stacked 3x3 convolutions are more efficient than a single 5x5 convolution. Now is a good time to explain why. Let’s take a patch of pixels. A 5x5 filter (5 along the width and 5 along the height) uses 25 weights, while a 3x3 filter uses only 9. Two stacked 3x3 layers cover the same 5x5 region but use only 9+9=18 weights. The same area is therefore represented by fewer variables, which brings faster processing to the table. With more experiments we have figured out that the additional variables of the 5x5 filter make the overall model slower without adding enough value to justify their inclusion.</p>
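The counting argument above can be checked directly. This is a minimal sketch, counting weights per output channel for a single input channel and ignoring biases (the filter sizes are the only inputs):

```python
# Weights needed per output channel for a single input channel,
# biases ignored: one 5x5 filter vs two stacked 3x3 filters.
def conv_params(*filter_sizes):
    """Total weights in a chain of filters, each given as (height, width)."""
    return sum(h * w for h, w in filter_sizes)

five_by_five = conv_params((5, 5))                 # 25 weights
two_three_by_three = conv_params((3, 3), (3, 3))   # 9 + 9 = 18 weights

print(five_by_five, two_three_by_three)
```

Both arrangements see a 5x5 region of the input, but the stacked version does it with fewer weights.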
<h4 id="1x1-convolutions">1x1 Convolutions</h4>
<p>These convolutions were popularized by their use in GoogLeNet’s Inception architecture. The idea looks stupid at first. We are looking at a single pixel; how does this help the growing knowledge of the network? We already talked about how CNNs map hierarchy, and a 1x1 filter does not seem to add to it. Well, there is one thing we forgot to note: a pixel is not a single number. At the input it has three channels - R, G & B - and deeper in the network it can have many more. A 1x1 convolution combines these channels into a smaller set of numbers. That reduces the operations required in the following 3x3 convolutions, because otherwise they need to run across all the channels. This is a useful performance optimization; as you might have seen, we are already running into bottlenecks with the processing power of our machines. The relation between the channels at a given location is not spatial, so it does not make sense to fold it into the 3x3 convolution.
Another point to note is that these are rarely just 3 channels. We could have 200 parallel 3x3 convolutions running on an image, giving a 200-channel image, and there the 1x1 convolution really saves a lot.</p>
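To put a number on the savings, here is a rough weight count for a 3x3 convolution with and without a 1x1 bottleneck squeezing the channels first. The channel counts (200 in, 64 out, squeezed to 32) are illustrative assumptions, not taken from any particular network:

```python
# Weights in a 3x3 convolution over many channels, with and without a
# 1x1 bottleneck that first squeezes the channels (biases ignored).
def conv_weights(k, in_ch, out_ch):
    """Weights of a k x k convolution mapping in_ch channels to out_ch."""
    return k * k * in_ch * out_ch

in_channels, out_channels, squeezed = 200, 64, 32   # illustrative sizes

direct = conv_weights(3, in_channels, out_channels)
bottleneck = (conv_weights(1, in_channels, squeezed)
              + conv_weights(3, squeezed, out_channels))

print(direct)      # 115200
print(bottleneck)  # 24832 - under a quarter of the direct cost
```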
<h4 id="highways-and-residual-networks">Highways and Residual Networks</h4>
<p>One problem with making networks very deep is loss of signal from the leaves to the root. At each stage we multiply by a weight, so the early few layers get weighted down by a long chain of managers, each with their own biases and weights, and the information dies down. Residual networks solve this problem by creating skip-level connections. If a level can skip a few managers, it can send information up and get feedback back across fewer levels, and the overall flow of information is much smoother. The intermediate managers still do their processing, but the skip-level managers take input both from the managers and from the nodes that report to them. This allows the skip-level managers to provide more direct feedback and hence fixes the problem of information loss. These networks are also called highway networks, as they create an information highway that allows fluid flow of information from the lowest-level employees to the CEO.</p>
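A minimal numpy sketch of the skip connection: a deliberately tiny weight matrix stands in for layers whose signal has died down, and the skip path carries the input past them intact. The numbers here are made up for illustration:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

x = np.array([1.0, 2.0, 3.0, 4.0])   # the incoming signal
W = 0.01 * np.eye(4)                 # a layer whose weights have shrunk toward zero

# Plain chain of two such layers: the signal all but dies out.
plain = relu(W @ relu(W @ x))

# Residual block: the skip connection adds the input back, so the
# original signal survives alongside the (tiny) transformation.
residual = plain + x

print(plain)     # on the order of 1e-4
print(residual)  # close to the original x
```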
<h4 id="inception">Inception</h4>
<p>Inception is kind of the next level of information sharing, at least at the lower levels. The concept of Inception is that the manager of a pixel should look at the pixel itself alongside the results of 1x1, 3x3 and 5x5 (which can be represented as two stacks of 3x3, one over the other) subordinates before taking a decision for that pixel and passing it on. This concept can be used in tandem with ResNet and the information highway. Essentially, instead of having a single 3x3 convolution, we run a 1x1, a 3x3 and a 5x5 convolution in parallel and feed them all into a cell that can act on the combined information. In our organization analogy, the director does not just rely on the managers but on the individual employees as well, all of them, for his inputs. Then his boss’s boss relies on all the directors as well as the intermediate managers.</p>
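A toy sketch of the branch-and-concatenate idea: a crude box filter stands in for the learned convolutions, and the branch outputs are stacked along the channel axis so the next layer sees every scale at once. The 6x6 image and the filter choices are illustrative only:

```python
import numpy as np

# A 6x6 single-channel "image"; each branch produces a same-size map.
img = np.arange(36, dtype=float).reshape(6, 6)

def avg_filter(image, k):
    """Crude k x k box filter with edge clipping - stands in for a k x k conv."""
    out = np.empty_like(image)
    h, w = image.shape
    r = k // 2
    for i in range(h):
        for j in range(w):
            out[i, j] = image[max(0, i - r):i + r + 1, max(0, j - r):j + r + 1].mean()
    return out

branch_1x1 = img                   # identity stands in for a 1x1 conv
branch_3x3 = avg_filter(img, 3)
branch_5x5 = avg_filter(img, 5)

# The Inception cell stacks all branch outputs along the channel axis.
stacked = np.stack([branch_1x1, branch_3x3, branch_5x5], axis=-1)
print(stacked.shape)  # (6, 6, 3)
```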
<h4 id="7x1-and-1x7-shortcuts">7x1 and 1x7 shortcuts</h4>
<p>After multiple experiments we are at a point where we know it is better to go deeper than wider. Therefore we have tried various approaches that reduce the number of variables, so that we can go deeper with the same CPU power. In the 3x3 chain, you might have noticed the redundancy: each pixel is visited 9 times so that each future pixel can take input from 9 different pixels. This seems like a good opportunity for optimization, and here come the 3x1 and 1x3 convolutions (and, for larger filters, the 7x1 and 1x7 of the heading). First collapse along one axis with a 3x1 filter (hence the name), without visiting the same pixel twice; then look across the other axis with a 1x3 filter. This allows stacking to continue normally and reduces the number of parameters for a look across the 9 pixels from 9 to just 6 (3+3), i.e. 2/3 of the original. Now the information cannot really travel losslessly through such a network. There will definitely be losses: we cannot squeeze so much information through these variables, especially in the intermediate layer of 1/3rd the size. There are tricks the network can learn if the need comes up, for example packing information into regions or bits of the input, but yes, a lot of information gets lost. That is why Inception uses this technique with a parallel track that passes the useful information from the lower pixels alongside. Another thing to remember is that there is a lot of redundancy in images: if you down-sample an image, it is very likely that you will still be able to recognize the objects inside it. SqueezeNet has also found that this technique works without the parallel Inception track in some cases.</p>
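The savings from the factorization can be counted directly (single input and output channel, biases ignored):

```python
# Weights for a k x k filter vs the k x 1 followed by 1 x k factorization
# (single channel in and out, biases ignored).
def full(k):
    return k * k

def factorized(k):
    return k + k   # a k x 1 filter plus a 1 x k filter

for k in (3, 7):
    print(k, full(k), factorized(k))
# 3: 9 vs 6 weights; 7: 49 vs 14 weights
```

The gap widens quickly with filter size, which is why the trick matters most for the 7x7 filters of the heading.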
<h4 id="summary">Summary</h4>
<p>In this post we looked at more techniques in convolutional neural networks: 3x3 stacks, the 1x1 convolution, the 7x1 and 1x7 shortcuts, residual networks and the concept of Inception.</p>
<p>In the next post we will play more with convolutional neural networks and look at an interesting application of them - style transfer.</p>
The best models have some real clever tricks to get through the last mile. They are worth learning.
Easy Deep Learning Part IX - Using the model library (2017-09-19) http://atishay.me/blog/2017/09/19/Deep-Learning-Part-9
<p>This is the ninth part of an intended multi-part series on deep learning. You should read <a href="/blog/2017/08/16/Deep-Learning-Part-1">Part 1</a>, <a href="/blog/2017/08/18/Deep-Learning-Part-2">Part 2</a>, <a href="/blog/2017/08/21/Deep-Learning-Part-3">Part 3</a>, <a href="/blog/2017/08/22/Deep-Learning-Part-4">Part 4</a>, <a href="/blog/2017/08/24/Deep-Learning-Part-5">Part 5</a>, <a href="/blog/2017/08/29/Deep-Learning-Part-6">Part 6</a>, <a href="/blog/2017/08/30/Deep-Learning-Part-7">Part 7</a>, <a href="/blog/2017/09/18/Deep-Learning-Part-8">Part 8</a> before heading over here.</p>
<h4 id="recap">Recap</h4>
<p>In the previous posts we got to understand the intuitions behind regular (dense) and convolutional neural networks. Now it is time to see why we could have skipped all of that work.</p>
<h4 id="transfer-learning">Transfer Learning</h4>
<p>Building neural networks from scratch is a great way to learn the details of how they work, but there is a much better way if we are looking to apply the principles in practice. When we introduced convolutional neural networks, we talked about starting with a patch of 9 pixels and then collecting information up from there. The key point in this architecture is that there is not a lot of difference in the 3x3 patches of images from many different categories. So if we have a good enough database of real-life images (<a href="http://image-net.org">ImageNet</a> is one such database), we should have the most practical 3x3 patches covered. That means we can reuse the weights from that learning on a new task.
The concept of reusing parts of a state-of-the-art trained model is called <strong>transfer learning</strong>. This is very powerful, and it instantly makes deep learning very accessible. You do not need tons of CPU power and a huge data set to get going on a specific problem. You can tune some of the most complicated networks to work reasonably well with very little data.</p>
<h4 id="model-zoo">Model Zoo</h4>
<p>The concept of a model zoo is to have a place where we can download pre-coded and pre-trained models for common tasks. Keras comes with a built-in set of such models.</p>
<h4 id="how-do-we-transfer-the-model">How do we transfer the model?</h4>
<p>If you are training the model to recognize an image, you need to change the set of classes or categories you are grouping the images into. The way to do that is to change the last layer to group the images into a different set of classes and then retrain. You will notice that in many state-of-the-art models, the last few layers are dense. This is because the earlier layers summarize the local information into meaningful chunks that can be properly classified by the equations in the final layers, and it is that summary that needs to be reused. Therefore we knock off the final dense layers and replace them with a fresh set of dense layers that train on the new data that we have.</p>
<h4 id="freeze">Freeze</h4>
<p>There is one more concept that is very important in transfer learning. Say you have very little sample data. If you let training change the weights of the first few layers (treating the pre-trained weights as just an initialization), there is a high chance you end up throwing out a lot of cases that the initial model covered and your data does not. Since your data set is small, it may not contain examples of all the 3x3 patches present in the real world where the model will be used, and during training you might throw away the better weights present in the initialization. Therefore it is a good idea to freeze the initial part of the network. How much you freeze depends on the data you have: if you have a lot of data, freeze the least; if you have very little, freeze most of the network, and your trained model will handle more cases than your data covers.</p>
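What freezing means mechanically can be sketched without any framework: a frozen weight simply receives no gradient update. In Keras this is what setting a layer's trainable attribute to False achieves (as the code below does); the numbers here are made up for illustration:

```python
import numpy as np

# Two "layers" of weights: the first pre-trained (to be frozen),
# the second freshly initialized (to be trained). Values are made up.
pretrained = np.array([0.5, -0.3])
fresh = np.array([0.1, 0.2])

def apply_update(weights, gradient, is_frozen, lr=0.1):
    """One gradient step that leaves frozen weights untouched."""
    return np.where(is_frozen, weights, weights - lr * gradient)

grad = np.array([1.0, 1.0])
updated_frozen = apply_update(pretrained, grad, np.array([True, True]))
updated_fresh = apply_update(fresh, grad, np.array([False, False]))

print(updated_frozen)  # identical to pretrained
print(updated_fresh)   # moved by lr * grad
```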
<h4 id="cifar-100">Cifar 100</h4>
<p>The pre-trained models are so good that CIFAR-10 is not really a challenge for them. So we are going to move to its bigger brother, CIFAR-100, where we will find that even a set of 100 classes is a cinch for the Xception network.</p>
<h4 id="code-time">Code time</h4>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras.applications.xception</span> <span class="kn">import</span> <span class="n">Xception</span>
<span class="kn">from</span> <span class="nn">keras.layers</span> <span class="kn">import</span> <span class="n">Dense</span><span class="p">,</span> <span class="n">GlobalAveragePooling2D</span>
<span class="kn">from</span> <span class="nn">keras.models</span> <span class="kn">import</span> <span class="n">Model</span>
<span class="kn">from</span> <span class="nn">keras.datasets</span> <span class="kn">import</span> <span class="n">cifar100</span>
<span class="kn">import</span> <span class="nn">keras</span>
<span class="c"># Load CIFAR dataset</span>
<span class="p">(</span><span class="n">x_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">),</span> <span class="p">(</span><span class="n">x_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span> <span class="o">=</span> <span class="n">cifar100</span><span class="o">.</span><span class="n">load_data</span><span class="p">()</span>
<span class="c"># Convert inputs to float between 0 & 1</span>
<span class="n">x_train</span> <span class="o">=</span> <span class="n">x_train</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="s">'float32'</span><span class="p">)</span> <span class="o">/</span> <span class="mf">255.0</span>
<span class="n">x_test</span> <span class="o">=</span> <span class="n">x_test</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="s">'float32'</span><span class="p">)</span> <span class="o">/</span> <span class="mf">255.0</span>
<span class="c"># Convert output into one hot encoding (Just like before)</span>
<span class="n">y_train</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">to_categorical</span><span class="p">(</span><span class="n">y_train</span><span class="p">)</span>
<span class="n">y_test</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">to_categorical</span><span class="p">(</span><span class="n">y_test</span><span class="p">)</span>
<span class="c"># Download the Xception model.</span>
<span class="n">base_model</span> <span class="o">=</span> <span class="n">Xception</span><span class="p">(</span><span class="n">weights</span><span class="o">=</span><span class="s">'imagenet'</span><span class="p">,</span> <span class="n">include_top</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="c"># add a global spatial average pooling layer. Just like</span>
<span class="c"># maxpool takes and average.</span>
<span class="c"># You can use maxpool here as well, hardly makes a difference.</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">base_model</span><span class="o">.</span><span class="n">output</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">GlobalAveragePooling2D</span><span class="p">()(</span><span class="n">x</span><span class="p">)</span>
<span class="c"># let's add a fully-connected layer</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">1024</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)(</span><span class="n">x</span><span class="p">)</span>
<span class="c"># and a logistic layer to the 100 classes</span>
<span class="n">predictions</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">100</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'softmax'</span><span class="p">)(</span><span class="n">x</span><span class="p">)</span>
<span class="c"># this is the model we will train</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="n">base_model</span><span class="o">.</span><span class="nb">input</span><span class="p">,</span> <span class="n">outputs</span><span class="o">=</span><span class="n">predictions</span><span class="p">)</span>
<span class="c"># first: train only the top layers (which were randomly initialized)</span>
<span class="c"># i.e. freeze all convolutional Xception layers</span>
<span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="n">base_model</span><span class="o">.</span><span class="n">layers</span><span class="p">:</span>
<span class="n">layer</span><span class="o">.</span><span class="n">trainable</span> <span class="o">=</span> <span class="bp">False</span>
<span class="n">model</span><span class="o">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s">'categorical_crossentropy'</span><span class="p">,</span>
<span class="n">optimizer</span><span class="o">=</span><span class="s">'sgd'</span><span class="p">,</span>
<span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s">'accuracy'</span><span class="p">])</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">x_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">)</span>
<span class="n">loss_and_metrics</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">x_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">128</span><span class="p">)</span>
<span class="k">print</span> <span class="n">loss_and_metrics</span>
</code></pre>
</div>
<p>The code is terse and clear. I don’t think there is much commentary to add. The comments explain everything.</p>
<h4 id="summary">Summary</h4>
<p>In this post we reused a pre-existing model and transferred its learning into the CIFAR-100 model that we built and trained. The concept of transfer learning is very important in the deep learning toolkit. It provides a way to train with minimal data and still get some great results.</p>
<p>In the next post we will discuss some of the pieces of the pre-existing models so that we can learn from their tools to improve our trade.</p>
The real power of deep learning is reuse - stand on the shoulders of giants.
Easy Deep Learning Part VIII - ConvNets on CIFAR 10 (2017-09-18) http://atishay.me/blog/2017/09/18/Deep-Learning-Part-8
<p>This is the eighth part of an intended multi-part series on deep learning. You should read <a href="/blog/2017/08/16/Deep-Learning-Part-1">Part 1</a>, <a href="/blog/2017/08/18/Deep-Learning-Part-2">Part 2</a>, <a href="/blog/2017/08/21/Deep-Learning-Part-3">Part 3</a>, <a href="/blog/2017/08/22/Deep-Learning-Part-4">Part 4</a>, <a href="/blog/2017/08/24/Deep-Learning-Part-5">Part 5</a>, <a href="/blog/2017/08/29/Deep-Learning-Part-6">Part 6</a>, <a href="/blog/2017/08/30/Deep-Learning-Part-7">Part 7</a> before heading over here.</p>
<h4 id="recap">Recap</h4>
<p>In the previous posts we introduced neural networks, what they are and how they work, as well as convolutional neural networks, which look at local information and generate the next layer by collecting information from a set of 9 neighbors.</p>
<h4 id="cifar-10">CIFAR-10</h4>
<p>We have played a lot with MNIST, and now it is time to introduce the much more complicated CIFAR data set. With this, we will finally fulfill the “is it a cat” detection that we discussed in an early post. In comparison to modern-day data sets, CIFAR-10 is very simple. It does have a bigger brother, CIFAR-100, but for now we will not talk about it. This data set consists of 50k training images of 10 types of objects - airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. Though it has just 10 classes, it is a lot more complicated than MNIST: all 8s look very similar, but birds are very different from one another, and automobiles come in a variety of shapes, sizes and colors. CIFAR provides 32x32x3 colored images to test against.</p>
<h4 id="code">Code</h4>
<p>I am now going to put in vanilla code with very few tricks from the set I described in the previous posts. I will not be optimizing this and playing around with all the options. You will understand why in the next few posts.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras.models</span> <span class="kn">import</span> <span class="n">Sequential</span>
<span class="kn">from</span> <span class="nn">keras.layers</span> <span class="kn">import</span> <span class="n">Dense</span>
<span class="kn">from</span> <span class="nn">keras.layers</span> <span class="kn">import</span> <span class="n">Dropout</span>
<span class="kn">from</span> <span class="nn">keras.datasets</span> <span class="kn">import</span> <span class="n">cifar10</span>
<span class="kn">from</span> <span class="nn">keras.layers</span> <span class="kn">import</span> <span class="n">Flatten</span>
<span class="kn">from</span> <span class="nn">keras.layers.convolutional</span> <span class="kn">import</span> <span class="n">Conv2D</span>
<span class="kn">from</span> <span class="nn">keras.layers.convolutional</span> <span class="kn">import</span> <span class="n">MaxPooling2D</span>
<span class="kn">import</span> <span class="nn">keras</span>
<span class="c"># Load CIFAR dataset</span>
<span class="p">(</span><span class="n">x_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">),</span> <span class="p">(</span><span class="n">x_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span> <span class="o">=</span> <span class="n">cifar10</span><span class="o">.</span><span class="n">load_data</span><span class="p">()</span>
<span class="c"># Convert inputs to float between 0 & 1</span>
<span class="n">x_train</span> <span class="o">=</span> <span class="n">x_train</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="s">'float32'</span><span class="p">)</span> <span class="o">/</span> <span class="mf">255.0</span>
<span class="n">x_test</span> <span class="o">=</span> <span class="n">x_test</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="s">'float32'</span><span class="p">)</span> <span class="o">/</span> <span class="mf">255.0</span>
<span class="c"># Convert output into one hot encoding (Just like before)</span>
<span class="n">y_train</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">to_categorical</span><span class="p">(</span><span class="n">y_train</span><span class="p">)</span>
<span class="n">y_test</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">to_categorical</span><span class="p">(</span><span class="n">y_test</span><span class="p">)</span>
<span class="c"># we could have said 10 here but this allows to change to CIFAR 100 and work with the same code.</span>
<span class="n">num_classes</span> <span class="o">=</span> <span class="n">y_test</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">([</span>
<span class="n">Conv2D</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span><span class="mi">3</span><span class="p">),</span>
<span class="n">padding</span><span class="o">=</span><span class="s">'same'</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">),</span>
<span class="n">Dropout</span><span class="p">(</span><span class="mf">0.2</span><span class="p">),</span>
<span class="n">MaxPooling2D</span><span class="p">(),</span>
<span class="n">Flatten</span><span class="p">(),</span>
<span class="n">Dense</span><span class="p">(</span><span class="mi">512</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">),</span>
<span class="n">Dropout</span><span class="p">(</span><span class="mf">0.5</span><span class="p">),</span>
<span class="n">Dense</span><span class="p">(</span><span class="n">num_classes</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'softmax'</span><span class="p">)</span>
<span class="p">])</span>
<span class="n">model</span><span class="o">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s">'categorical_crossentropy'</span><span class="p">,</span>
<span class="n">optimizer</span><span class="o">=</span><span class="s">'sgd'</span><span class="p">,</span>
<span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s">'accuracy'</span><span class="p">])</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">x_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">)</span>
<span class="n">loss_and_metrics</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">x_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">128</span><span class="p">)</span>
<span class="k">print</span> <span class="n">loss_and_metrics</span>
</code></pre>
</div>
<p>This code should look almost the same as the first code we wrote for MNIST, apart from the fact that we are using ConvNets. We start with a 3x3 conv layer, add some dropout, add a max-pool to reduce to a 16x16 image, then convert it to a flat list of numbers to feed into the same dense network as before. We use max-pooling because without it training would be a lot slower. This is a good time to discuss performance. Dense layers are heavy; they involve a lot of computation. They are good because they can carry information anywhere, but we need to be extra careful not to use too many of them, as they really slow things down. Softmax is also a heavy operation, and good advice is to reduce the number of parameters to manageable levels before invoking the softmax.</p>
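To make the cost argument concrete, here is a rough weight count for the layers above (biases ignored); the flattened size follows from one 2x2 max-pool on the 32x32 input:

```python
# Rough weight counts for the model above (biases ignored), showing
# why the dense layers dominate the cost.
conv = 3 * 3 * 3 * 32        # 3x3 kernel, 3 input channels, 32 filters
flat = 16 * 16 * 32          # feature map after one 2x2 max-pool on 32x32
dense_1 = flat * 512         # the 512-unit dense layer
dense_2 = 512 * 10           # the 10-class softmax layer

print(conv)                  # 864
print(dense_1)               # 4194304 - over 4 million weights
print(dense_1 // conv)       # the dense layer is thousands of times bigger
```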
<h4 id="summary">Summary</h4>
<p>In this post we built a very simple convolutional neural network to classify the CIFAR-10 images into various categories. We saw how simple the neural network engines make it to model a complicated piece of the network with just one function call.</p>
<p>It came out to be a short post, as we did not go through many of the optimizations. The reason why will become very clear in the next post, when we use the model zoo instead.</p>
Using ConvNets in Keras is as easy as it gets. This post shows some sample code.
The gooey stuff inside of your head (2017-09-17) http://atishay.me/blog/2017/09/17/gooey
<p>This speech went with a slide show that you can view <a href="/assets/img/blog/gooey.pdf">here</a>. This was a very technical topic to be presented to average individuals uninitiated to the world of technology and code. The biggest challenge for this speech was to get some content across and still make sure everyone understood it.</p>
<p>Aristotle believed, as many of us still do, that the heart is the seat of love, the kidney of fear and the liver of anger. Neurologists have found that the most complicated organ, the one protected by the hardest layer of bone in our body, the brain, is the seat of all emotions. We have been trying to understand it for centuries, and after years of research we have started to map parts of it into code. I have been studying this now for almost a year. Fellow toastmasters, citizens of Silicon Valley and dear guests - deep learning is a direct result of messing with the gooey stuff inside of the head. We are now living in a world where machines have a true brain - they can see and hear, read and write, speak and understand the world around them just like us. I aim to show you today how it works.</p>
<p>Let’s go back 60 years into the research lab of Nobel Prize winners Hubel and Wiesel. They wanted to know how vision worked. They took an unlucky cat, put a hole in its head and inserted electrodes. Fun fact: the eyes are here at the front, but the occipital lobe that creates vision is here, right at the back. Hubel and Wiesel wanted to see what excites that cat. They showed it pictures of food, fish or females. The answer was - none. Frustrated, the scientists spent days trying to figure things out. 10% of science is work; the rest is just luck. What they realized, and won the Nobel Prize for, was that it was not the picture itself but the insertion of the next slide that really excited the cells. Not the food, not the fish, not even the female: the first few cells look at simpler things, like dark and light, horizontal and vertical, circles and squares. It is a stack of neurons, like an organization’s hierarchy, where the information is sifted until a condensed summary goes up to the CEO.</p>
<p>Deep learning works the same way. Each image consists of millions of pixels. At the lowest level, the equation assigns each pixel a weight. A few layers later, it looks at lines, circles and squares. A few layers further come ears, eyes and noses, and then, after a long chain, we get the result of what the contents of the image are. That is why it is called deep: it consists of a very deep chain of very easy mathematical equations that eventually work out what the image contains.
Now why is it called learning? Because we train it like the big cats. It looks at an image. If it identifies it correctly, it gets the meat; otherwise, the whip. Let me explain how that truly works. I already told you it resembles an organization hierarchy where managers upon managers condense information. Let’s assume each of us is a brain cell. The question I have is - is this a cat?<br />
&lt;member 1&gt; Do you see the eyes of a cat?<br />
1 point.<br />
&lt;member 2&gt; Do you see whiskers?<br />
I trust her more - 2 points.<br />
&lt;member 3&gt; Do you see a tail?<br />
I do kind of trust him, so -1.<br />
Now for my personal judgement - hmm, 1 point.<br />
So I say yes - 3 points. And then of course my manager punishes me: it is not a cat. And I distribute the punishment. I take two dollars from her, one from him, and give him back one. I take one from my pocket and give it back to my manager. Next time, she will be more careful, he will be more careful, and he will be more confident.</p>
<p>This is 90% of deep learning. Add a few more tricks and you can really go deep. This creates a wonderful set of pattern-recognition and pattern-creation systems. The sights, the sounds, the smells are all patterns. And the big deal about patterns: deep neural networks are so good at them that Elon Musk fears machines getting better than humans.</p>
<p>We may know our brains a lot better now, but we haven’t yet built robots with emotions. These complicated machines that we currently have consist of layers of simple equations that bring enormous pattern recognition capabilities. Even though a lot of our lives involves pattern generation and recognition, we are still a lot more. Machines have been and will continue to be faithful helpers that do the boring stuff while we imagine and work towards a wonderful future. Deep learning is a wonderful application of the gooey stuff inside our head. Fear not and enjoy the wonderful future where machines speak and hear, read and write, see and show the wonderful world around us. Thank you.</p>Deep learning as a speech for toastmasters.The very basics for the very simple.Easy Deep Learning Part VII - Convolutional Neural Networks2017-08-30T00:00:00+00:002017-08-30T00:00:00+00:00http://atishay.me/blog/2017/08/30/Deep-Learning-Part-7<p>This is the seventh part of an intended multi-part series on deep learning. You should read <a href="/blog/2017/08/16/Deep-Learning-Part-1">Part 1</a>, <a href="/blog/2017/08/18/Deep-Learning-Part-2">Part 2</a>, <a href="/blog/2017/08/21/Deep-Learning-Part-3">Part 3</a>, <a href="/blog/2017/08/22/Deep-Learning-Part-4">Part 4</a>, <a href="/blog/2017/08/24/Deep-Learning-Part-5">Part 5</a>, <a href="/blog/2017/08/29/Deep-Learning-Part-6">Part 6</a> before heading over here.</p>
<h4 id="recap">Recap</h4>
<p>By now you should be comfortable with what a neuron is and what a neural network (or a stack of neurons) means. We have so far described a neuron to represent <code class="highlighter-rouge">g(AX + b)</code>, which was chained together in multiple layers to create a deep network. Next we talked about concepts like dropout, different activation functions, regularization and different initializations to get better results.</p>
<h4 id="what-is-missing">What is missing?</h4>
<p>When I described the core concept of having depth, I talked about teams looking at different parts of the image and giving a decision. But when we implemented it, we created a dense layer where everyone looks at everything. And we know that may be a bad use of resources. Most information in the image is associated close to each other. Randomly arranging the eyes, nose and ears won’t make a face, and looking at random places for them is definitely not a great idea. Therefore it makes sense to have teams look at small parts and then take their decision. The next question would be team size. The bigger the teams, the closer we get to the original dense network problem. Therefore it’s better to start with the smallest teams and grow them if things don’t work well. So what is the smallest team size? An image is two dimensional (actually 3 because of RGB, but since the third dimension is so small, we don’t really talk about it in our convolutions here) and therefore we need a 2D convolution. The smallest symmetrical one is everyone looking at one pixel. But then we can’t build a hierarchy with one element in the next layer looking at only 1 pixel (1x1). (There is a 1x1 module you might find in certain networks. It is too advanced for now. We discuss that in part 10.) What is the next one? It is not 2x2, because we cannot choose which 2 we need (left or right). Therefore it has to be 3x3, i.e. 9 pixels. People have tried 5 and 7 pixel rows but have realized over time that adding another layer is better than making it bigger. It is faster because of fewer variables and leaves little reason to go higher (more on this in Post 10).
So what does it look like? You can get a fair idea from the <a href="https://github.com/vdumoulin/conv_arithmetic">visualization from vdumoulin</a> displayed in the first image below.</p>
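<p>To make this concrete, here is a minimal numpy sketch (my own illustration, not code from the series) of a 3x3 “team” sliding over an image. Note how two stacked 3x3 passes end up seeing a full 5x5 patch while carrying only 9+9 weights:</p>

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Naive 2D "valid" convolution: slide the kernel over the image."""
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

img = np.arange(25, dtype=float).reshape(5, 5)
k3 = np.ones((3, 3)) / 9.0           # one 3x3 team: 9 weights

once = conv2d_valid(img, k3)         # 5x5 -> 3x3
twice = conv2d_valid(once, k3)       # 3x3 -> 1x1: sees the whole 5x5 patch
print(twice.shape)                   # (1, 1), reached with 18 weights, not 25
```

Real libraries do the same thing much faster, but the sliding-window idea is all there is to it.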
<h4 id="ideas-from-classical-computer-vision">Ideas from classical computer vision</h4>
<p>Convolutions are not a new concept. They have been there in computer vision for a long time. The idea of a convolution is very simple. If you look at an image, one thing you definitely see is the color, but color in itself is not very interesting. It is the change in color that interests us. The change in color across individual pixels is really what defines everything. A smooth change would probably mean a gradient, zero change would be solid color and a huge change would be an edge. A lot of effort has already gone into finding hand coded kernels (specific values of these convolutions) for the detection of edges, corners etc. as well as for styling images into different variants like blurring or sharpening.</p>
<p>You can play with and understand the concepts using the visualization <a href="http://setosa.io/ev/image-kernels/">here</a>.</p>
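<p>As a quick illustration of such a hand coded kernel (my example, not from the linked page), a Sobel-style filter responds with zero on solid color and strongly where the color jumps:</p>

```python
import numpy as np

# A classic hand-coded 3x3 kernel: zero response on flat color,
# strong response where the color changes from left to right.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-2.0, 0.0, 2.0],
                   [-1.0, 0.0, 1.0]])

img = np.zeros((5, 5))
img[:, 3:] = 1.0                     # dark left half, bright right half

def conv2d_valid(img, kernel):
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

edges = conv2d_valid(img, kernel)    # 0.0 on the flat part, 4.0 at the edge
```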
<h4 id="size-problems">Size problems</h4>
<p>You might have noticed that the approach above reduces the output size by two pixels in both rows and columns. Going back to the manager analogy, each manager has 9 people reporting, but not all people have 9 managers. The pixels at the edges don’t have all their neighbors, and therefore the higher-level views are not centered on them. In a classic organization that is not a problem (rather a good thing). You may assume that since we want a single number as an output, it is a good thing. And I would agree. But there is one reason why we want the size to stay the same. The reason is simple. We want to stack multiple layers. We cannot go very deep if we lose pixels each time. We would also have to write separate layers for each kernel size, and we could not try different model pieces at different parts of the network; all of that together makes our work difficult. Therefore we add padding to the original image so that it stays consistent. The padding is all zeros, and a smart network should be able to set the weights so that those pixels won’t matter. After all, if x is 0, ax is also going to be 0. The second image below is another image from the same visualization, on padding.</p>
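<p>A tiny numpy sketch of the padding arithmetic (illustrative only):</p>

```python
import numpy as np

img = np.arange(16, dtype=float).reshape(4, 4)

# One ring of zero padding: a 4x4 image becomes 6x6, so after a 3x3
# "valid" convolution the output is 6 - 3 + 1 = 4 -- the original size.
padded = np.pad(img, 1)              # pads with zeros by default
print(padded.shape)                  # (6, 6)
```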
<h4 id="more-tricks">More tricks</h4>
<p><strong>Stride</strong>: You might have also noticed that reporting to 9 managers is sort of confusing. There is a lot of duplication of information. Of course it is good to some extent that the same pixel can be measured differently, but do remember that the basic thing we did with MNIST worked well. There might not be enough information in a single signal (pixel or manager’s output) to require a huge set of weights looking at it. Therefore, to speed up the network we use a stride, i.e. keep only a few center pixels and consolidate the output. You can understand stride easily from the third visualization.
<strong>Transpose</strong>: With a stride we lose the benefits we added padding for. So we add the padding back to get to the same size via a transpose. The fourth visualization explains this concept.</p>
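<p>Here is a rough numpy sketch of the two ideas on a 6x6 grid (shapes only, no learned weights; my own illustration):</p>

```python
import numpy as np

x = np.arange(36, dtype=float).reshape(6, 6)

# Stride 2: keep every second "center" pixel in each direction, 6x6 -> 3x3.
strided = x[::2, ::2]

# Transpose goes the other way: scatter the 3x3 back onto a zero-filled
# 6x6 grid; a following convolution then fills in the gaps.
up = np.zeros((6, 6))
up[::2, ::2] = strided
```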
<p><img class="col-md-6 col-lg-3 img-fluid rounded" src="http://atishay.me/assets/img/blog/conv.gif" />
<img class="col-md-6 col-lg-3 img-fluid rounded" src="http://atishay.me/assets/img/blog/convpad.gif" />
<img class="col-md-6 col-lg-3 img-fluid rounded" src="http://atishay.me/assets/img/blog/convstride.gif" />
<img class="col-md-6 col-lg-3 img-fluid rounded" src="http://atishay.me/assets/img/blog/convtrans.gif" /></p>
<h4 id="why-these-tricks">Why these tricks?</h4>
<p>You should be tempted to ask: why do these tricks? What do we save? Why not just pad? And the answer is: just pad. That is a great starting point and indeed works the best. The tricks are applied just to save calculation and speed the network up. Let me explain that. Say you have 81 pixels. Now a standard convolution will mean you have 81 managers, with each one having 9 weights. That means 81 x 9 weights. Now look at a stride/transpose chain. In the stride part each pixel has just one weight, so we have 81 weights. The pixels reduce to 9 (the square root). Each of these then has 9 weights (the rest are all zeros). So we are reduced to 81 + 81 weights from 81 x 9. That’s a lot of saving. Using this we can add a few more layers to make a network deeper and still fit in the same RAM. As I said at the start, the concept of a neural network is simple; it is the optimizations we need because of our slow machines that make it complicated.</p>
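<p>The back-of-the-envelope count from this paragraph, written out:</p>

```python
# 81 pixels (a 9x9 grid), 3x3 kernels.
plain = 81 * 9               # every output pixel carries its own 9 weights
stride_transpose = 81 + 81   # one weight per pixel in each of the two stages
print(plain, stride_transpose)   # 729 162
```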
<h4 id="maxoutaverageout">Maxout/Averageout</h4>
<p>There is one more concept that needs to be understood before going to code. This one is more of an optimization than a real trick. We can run multiple convolutions on the input data, and it is a good idea to do that. This is because one convolution ideally carries only one type of information. E.g. one convolution could be a line detector, another a corner detector, a blob or circle detector etc. With many convolutions on a big image we have a lot of data. We reduce this to get better performance. There are many ways to reduce data: averaging (Averageout), picking one of them, or picking a min or max. Maxout, or picking the max in a set, is very popular. We run a convolution and pick the max of a small kernel (a convolution area like 9 pixels) and keep that, discarding the others. Why max? Because based on the intuitions from ReLU it seems like a good idea. But we are free to try other optimizations. It is an important trick to summarize the inputs and is therefore very popular, especially after a few <code class="highlighter-rouge">Conv</code> layers.</p>
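<p>A minimal numpy sketch of the max operation on a 4x4 input (my example values), keeping only the strongest signal in each 2x2 block:</p>

```python
import numpy as np

x = np.array([[1., 3., 2., 4.],
              [5., 6., 1., 0.],
              [0., 2., 7., 8.],
              [1., 1., 3., 2.]])

# Split the 4x4 grid into 2x2 blocks and keep the max of each block,
# shrinking the data from 4x4 to 2x2 and discarding the rest.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
```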
<h4 id="convolutions-outside-images">Convolutions outside images</h4>
<p>The concept of dividing local stuff and then putting a manager onto global stuff is not just applicable to images. Words are formed by looking at nearby characters, sentences via words and paragraphs via sentences. So this logic can be used on sentences of text. This logic can also be used with voice as that also consists of a similar pattern. I hope you can imagine how these simple concepts change everything.</p>
<h4 id="summary">Summary</h4>
<p>Here we discussed the intuition and some concepts around Convolutions and why having local information passed onto the next layer is a good idea.</p>
<p>In the <a href="/blog/2017/09/18/Deep-Learning-Part-8">next post</a> we will apply this to the CIFAR data set and show some results from the convolutional neural networks that can really amaze us all.</p>CovNets and local information can really make results better. Simple Problem - Simple Solution.Easy Deep Learning Part VI - Contracts, Options and Futures2017-08-29T00:00:00+00:002017-08-29T00:00:00+00:00http://atishay.me/blog/2017/08/29/Deep-Learning-Part-6<p>This is the sixth part of an intended multi-part series on deep learning. You should read <a href="/blog/2017/08/16/Deep-Learning-Part-1">Part 1</a>, <a href="/blog/2017/08/18/Deep-Learning-Part-2">Part 2</a>, <a href="/blog/2017/08/21/Deep-Learning-Part-3">Part 3</a>, <a href="/blog/2017/08/22/Deep-Learning-Part-4">Part 4</a>, <a href="/blog/2017/08/24/Deep-Learning-Part-5">Part 5</a> before heading over here.</p>
<h4 id="recap">Recap</h4>
<p>In the previous posts we came up with the equation of a neuron to be <code class="highlighter-rouge">f(X) = g(Ax + b)</code> and talked about how we can stack one neuron over another to get a chain and make the network deep. We also talked about SGD and how we can slowly change our random parameters to get to the correct answer by going in the direction of the gradient. We also talked about how over fitting prevents us from going deeper, and also the fact that deep learning is slow and most of the difficulty is that machines are not fast enough.</p>
<h4 id="regularization">Regularization</h4>
<p>Now that we know over fitting to the data is a problem, we need a way to prevent it from happening. If you are still in the world of the cat eye teams, you would call me stupid. Because your task was to classify the given images correctly and you are doing that, and getting great outputs. But the fact that you follow the contract word by word and I still don’t get what I need means that I have to mess with the contract and change it to reflect what I want. So what do I want? I want you to give correct output for images you have never seen. How do I go about doing that? By preventing you from making your output too specific.
The answer is to force our parameters to be uniform. We don’t want the network to turn most values in <code class="highlighter-rouge">A</code> to zero and only take a few inputs. What we want is for <code class="highlighter-rouge">A</code> and <code class="highlighter-rouge">b</code> to be mostly uniform. Therefore we change our loss function. Well dear maths, don’t just optimize for giving the right probabilities, also optimize for keeping <code class="highlighter-rouge">A</code> and <code class="highlighter-rouge">b</code> as uniform as possible so that it generalizes.
We sometimes use the L1 penalty, which is the average of the absolute values of <code class="highlighter-rouge">A</code>, but mostly the L2 penalty (the average of the squares) is better. Either way, the -ve and +ve values do not cancel each other.</p>
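<p>A minimal numpy sketch of the changed loss (the weights and the 0.01 strength are illustrative numbers of my choosing):</p>

```python
import numpy as np

A = np.array([[3.0, -3.0],
              [0.1,  0.1]])
data_loss = 0.5              # whatever the "right probabilities" part came to

# L1 penalty: average of absolute values; L2 penalty: average of squares.
# Absolutes / squares stop -ve and +ve weights from cancelling out.
l1 = np.mean(np.abs(A))
l2 = np.mean(A ** 2)
total_loss = data_loss + 0.01 * l2   # big weights now cost the optimizer
```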
<h4 id="dropout">Dropout</h4>
<p>Another trick to get a good improvement in the models is dropout. The key concept is very simple. Say you are again managing a team looking at the picture of a cat. If in all training examples cats have ears, you will assume that without ears a cat cannot exist. Now in the real world, if you find an image of a cat with headphones, you are likely to label it not a cat. This is the same over fitting problem we talked about earlier. In dropout we solve it by cheating. After seeing an image, we randomly select some teams to walk out and not give any feedback. Now we are left with a smaller number of teams, and we learn that we cannot rely on just one signal to identify. This diversification makes our network stronger. Mathematically we just need to set the output of a neuron we want to disable to <code class="highlighter-rouge">0</code>. It would be dead in the rest of the calculation. In most libraries dropout is just a function call, so there is little reason not to use it.</p>
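<p>A rough sketch of the idea in numpy (the 0.4 rate matches the Keras code later in the post; real libraries also rescale the surviving signals during training, which I skip here):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
team_outputs = np.ones(10)           # ten teams, each reporting a signal

# Dropout with rate 0.4: each team is silenced with probability 0.4 on
# this pass, so the layer above cannot rely on any single signal.
keep_mask = rng.random(10) >= 0.4
heard = team_outputs * keep_mask
```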
<h4 id="activation-function">Activation Function</h4>
<p>We used softmax as our activation function for the inner layer. But the inner network can work with scores and doesn’t need to emit probabilities. That means we have a lot of other options inside. The most popular one is ReLU. ReLU has a very simple concept: it sets the score to 0 if it is negative and lets it pass through otherwise. This gives a lot of advantages. For continuous functions, there is no clean way to say “this factor is irrelevant”. With the slow gradient it really will take some time before we get close to zero, and we will never get to exactly 0. ReLU is fast. We do not need logs any more and therefore the computation gets real quick. Unless you are a researcher, ReLU should be the default for middle layers.</p>
<p>With softmax the sum of probabilities has to be 1, which means a layer can give only one useful signal. Now suppose we had both a cat and a dog in the picture, softmax would get lost. In this case a sigmoid would be better. Sigmoid gives us a number between 0 and 1 and we can treat it as a probability, even though it may not be exactly the same. (I won’t go into what makes a sigmoid, but if you read the theory, you can easily understand how it squashes the inputs to be between 0 &amp; 1. It again comes from statistics’ world of logistic regression.)</p>
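<p>A quick numpy sketch of the two activations (illustrative scores of my choosing):</p>

```python
import numpy as np

scores = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])

# ReLU: negative scores become exactly 0, positive ones pass through.
relu = np.maximum(scores, 0.0)

# Sigmoid: squashes any score into (0, 1) independently per neuron, so a
# picture with both a cat and a dog can score high on both.
sigmoid = 1.0 / (1.0 + np.exp(-scores))
```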
<h4 id="initialization">Initialization</h4>
<p>I already played a trick on you when I said random initialization. Why not <code class="highlighter-rouge">0</code>? Why not anything else? Well, this requires research and experimentation. Xavier Glorot et al. did that and found an initial state that works better (glorot_normal). You don’t really need to know the exact theory. It is a better parameter for softmax in most cases and worth being the default. For sigmoid, 0 is a good default. This initialization theory has been subjective and may not yield better results, but it does serve as a better default anyway.</p>
<h4 id="faster-descent">Faster descent</h4>
<p>Now SGD is very good, but it is slow. The reason is that it moves one step at a time. Therefore we may take a lot more time to reach the optimum than we would if we moved faster. The solution that was discovered was momentum, i.e. move faster if we are going in the same direction as before. If most previous images are decreasing a variable, maybe we can decrease it more and get there faster. Of course you can overshoot, and the rest of the images will bring it back. Momentum gives a good boost over raw SGD. The overshooting problem was later addressed by other methods, and we should be using Adam to get to the results faster.</p>
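<p>A toy sketch of momentum on a single weight (made-up gradients, all pointing the same way):</p>

```python
gradients = [1.0, 1.0, 1.0, 1.0]     # same direction, batch after batch
lr, mu = 0.1, 0.9                    # learning rate and momentum factor

w_sgd = w_mom = velocity = 0.0
for g in gradients:
    w_sgd -= lr * g                      # plain SGD: fixed-size step
    velocity = mu * velocity + lr * g    # momentum: speed builds up
    w_mom -= velocity
# w_mom has travelled further than w_sgd in the same number of steps
```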
<h4 id="learning-rate-and-decay">Learning rate and decay</h4>
<p>The learning rate we talked about is defined in keras, but the default decay is <code class="highlighter-rouge">0</code>. Adding some decay can get a little bit closer to the local optimum than before.</p>
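<p>A sketch of the usual time-based schedule (illustrative numbers; a formula of this shape is what keras applies when decay is non-zero):</p>

```python
# Time-based learning rate decay: every update shrinks the step a little,
# so later updates settle closer to the optimum.
lr0, decay = 0.01, 1e-4
lrs = [lr0 / (1.0 + decay * step) for step in range(1000)]
```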
<h4 id="validation">Validation</h4>
<p>You might have realized that all these options require a lot of trial and error to be optimized. If we are experimenting, it is a good idea to split the training set into a training and a validation set. We can then use the training set for training, the validation set to try out various hyper parameters, and the test set for the final verification before putting the code into production.</p>
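<p>A minimal sketch of such a split (a simple 80/20 holdout on toy data; libraries offer a shortcut like a validation-split option in <code class="highlighter-rouge">fit</code>):</p>

```python
import numpy as np

x = np.arange(100, dtype=float).reshape(100, 1)
y = np.arange(100)

# Hold out the last 20% of the training data as a validation set for
# tuning hyper parameters; the test set stays untouched until the end.
split = int(0.8 * len(x))
x_train, x_val = x[:split], x[split:]
y_train, y_val = y[:split], y[split:]
```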
<h4 id="code-changes">Code changes</h4>
<p>Again, all the numbers I have put in are hyper parameters and by playing with them you might find better results. This post is to introduce the options that you have, not to find the optimal ones for MNIST. Let us put all these in code.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras.datasets</span> <span class="kn">import</span> <span class="n">mnist</span>
<span class="kn">from</span> <span class="nn">keras.models</span> <span class="kn">import</span> <span class="n">Sequential</span>
<span class="kn">from</span> <span class="nn">keras.layers</span> <span class="kn">import</span> <span class="n">Dense</span><span class="p">,</span> <span class="n">Activation</span>
<span class="kn">import</span> <span class="nn">keras</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="p">(</span><span class="n">x_train2d</span><span class="p">,</span> <span class="n">y_train</span><span class="p">),</span> <span class="p">(</span><span class="n">x_test2d</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span> <span class="o">=</span> <span class="n">mnist</span><span class="o">.</span><span class="n">load_data</span><span class="p">()</span>
<span class="n">x_train</span> <span class="o">=</span> <span class="n">x_train2d</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">784</span><span class="p">)</span>
<span class="n">x_test</span> <span class="o">=</span> <span class="n">x_test2d</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">784</span><span class="p">)</span>
<span class="n">y_train</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">to_categorical</span><span class="p">(</span><span class="n">y_train</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
<span class="n">y_test</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">to_categorical</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">([</span>
<span class="n">Dense</span><span class="p">(</span><span class="n">units</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span><span class="n">input_dim</span><span class="o">=</span><span class="mi">784</span><span class="p">,</span> <span class="n">kernel_initializer</span><span class="o">=</span><span class="s">'glorot_normal'</span><span class="p">,</span> <span class="n">kernel_regularizer</span><span class="o">=</span><span class="n">regularizers</span><span class="o">.</span><span class="n">l2</span><span class="p">(</span><span class="mf">0.01</span><span class="p">)),</span>
<span class="n">Activation</span><span class="p">(</span><span class="s">'relu'</span><span class="p">),</span>
<span class="n">Dropout</span><span class="p">(</span><span class="mf">0.4</span><span class="p">),</span>
<span class="n">Dense</span><span class="p">(</span><span class="n">units</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">input_dim</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">kernel_initializer</span><span class="o">=</span><span class="s">'glorot_normal'</span><span class="p">,</span> <span class="n">kernel_regularizer</span><span class="o">=</span><span class="n">regularizers</span><span class="o">.</span><span class="n">l2</span><span class="p">(</span><span class="mf">0.01</span><span class="p">)),</span>
<span class="n">Activation</span><span class="p">(</span><span class="s">'softmax'</span><span class="p">),</span>
<span class="p">])</span>
<span class="n">model</span><span class="o">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s">'categorical_crossentropy'</span><span class="p">,</span>
<span class="n">optimizer</span><span class="o">=</span><span class="s">'adam'</span><span class="p">,</span>
<span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s">'accuracy'</span><span class="p">])</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">x_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">)</span>
<span class="n">loss_and_metrics</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">x_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">128</span><span class="p">)</span>
<span class="k">print</span> <span class="n">loss_and_metrics</span>
</code></pre>
</div>
<h4 id="summary">Summary</h4>
<p>We are a long way from the <code class="highlighter-rouge">AX + b</code> that we started with, but the changes are all minimal and incremental, and conceptually nothing much has changed. Here we talked about some of the options that we have while starting with a model and how to tweak the defaults.</p>
<p>In the <a href="/blog/2017/08/30/Deep-Learning-Part-7">next part</a> we will figure out the way to use some of the local information in the image and get some great improvements via another of the buzzwords - Convolutional Neural Networks.</p>Defaults are good, but playing with them can eke out the next 2% that we are looking for.Easy Deep Learning Part V - Lets go Deep2017-08-24T00:00:00+00:002017-08-24T00:00:00+00:00http://atishay.me/blog/2017/08/24/Deep-Learning-Part-5<p>This is the fifth part of an intended multi-part series on deep learning. You should read <a href="/blog/2017/08/16/Deep-Learning-Part-1">Part 1</a>, <a href="/blog/2017/08/18/Deep-Learning-Part-2">Part 2</a>, <a href="/blog/2017/08/21/Deep-Learning-Part-3">Part 3</a>, <a href="/blog/2017/08/22/Deep-Learning-Part-4">Part 4</a> before heading over here.</p>
<h4 id="recap">Recap</h4>
<p>In the previous sections we defined our deep learning task of identifying the contents of an image:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>g(AX + b) = Probability of being a content type
</code></pre>
</div>
<p>where <code class="highlighter-rouge">X</code> is the huge matrix that makes up the image and A are the weights, b the biases and g is the function that converts scores into probabilities like softmax.
We realized that the traditional way of solving equations does not work well with this case and therefore we described SGD as a means to solve the equation to get <code class="highlighter-rouge">A</code> and <code class="highlighter-rouge">b</code>. Then we wrote the code for the same using Keras.</p>
<h4 id="limitations-with-our-equation">Limitations with our equation</h4>
<p>Let us congratulate ourselves for solving the seemingly impossible problem in the previous post. Now for some reality check. The current equation is very basic and does not solve many cases. Here is why:</p>
<ul>
<li>The information in the neighboring pixels is not shared and therefore all weights are on individual pixels.</li>
  <li>The global information about the entire contents of the image is also not available.
This makes it harder for complicated models. A cat is a lot more complicated than a digit.</li>
</ul>
<h4 id="intuition-behind-depth">Intuition behind depth</h4>
<p>Now for the time being forget that you are someone reading the post. You are now the manager that manages the vision in the eye. You want to recognize a cat. The first thing you would do is to find a team. Then you would divide the work among this team. Team A - try finding the whiskers, Team B - the ears, Team C - the eyes and so on. Then, based on the results of what those teams find, and adding your own personal belief, you will give the probability to the user. And back propagation works the same way. You punish those people more whom you relied on the most if they made a mistake. You can now also play the majority-rules game: if the ear team does not find the ear, but the others say it is a cat, well, it might be a cat wearing headphones. So they might just be missing ears, or our ear folks are wrong, and if this turns out to be a cat, they will be punished.
So the passes remain the same, we just ask the teams for their probabilities and then decide. The code is also very similar. Here is the set of equations:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>g(AX + b) = X'
g'(A'X' + b') = y
</code></pre>
</div>
<p>You might have seen in the code already, the model consists of just these equations for the forward pass.</p>
<p>The other intuition comes from statistics. The original equation is very close to linear regression (which in 2D space is fitting a line). As you can expect, images are a lot more complex and we cannot divide them into cat or not-cat with a line.</p>
<h4 id="why-g--g">Why g & g’?</h4>
<p>Now we don’t necessarily need probabilities and we could work with scores, right? But let me show you the maths again:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>AX + b = X'
g'(A'X' + b') = y
Substituting X'
g'(A'AX + A'b + b') = y
</code></pre>
</div>
<p>Now <code class="highlighter-rouge">A'A</code> is a constant and <code class="highlighter-rouge">A'b + b'</code> is another. And we haven’t earned anything. <code class="highlighter-rouge">g</code> and <code class="highlighter-rouge">g'</code> are both essential. The function <code class="highlighter-rouge">g</code> in a neuron provides a way to make the method non-linear, and is therefore called <strong>non-linearity</strong> apart from the activation that we already defined it with. The neural network we just created is two <strong>layers</strong> deep. A strong neural network can have hundred of layers.</p>
<h4 id="isnt-the-back-propagation-affected">Isn’t the back propagation affected?</h4>
<p>Well, that is why we found the formal method with calculus. There is a chain rule in calculus that makes this very simple: we multiply gradients until we get to the right weights to update.</p>
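<p>A toy scalar sketch of the chain rule at work (made-up numbers, ReLU as the <code class="highlighter-rouge">g</code> in between):</p>

```python
# A tiny two-layer scalar "network": y = a2 * g(a1 * x), with g = ReLU.
x, a1, a2 = 2.0, 0.5, 3.0
h = max(a1 * x, 0.0)                 # forward pass: h = g(a1 * x) = 1.0
y = a2 * h                           # y = 3.0

# Backward pass: gradients multiply as we walk back through the layers.
dy_da2 = h                           # outer weight sees its own input
dy_dh = a2                           # how y reacts to the inner output
dh_da1 = x if a1 * x > 0 else 0.0    # ReLU passes the gradient when active
dy_da1 = dy_dh * dh_da1              # chain rule: 3.0 * 2.0 = 6.0
```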
<h4 id="shall-we-make-it-deeeeeeeeeeeep">Shall we make it deeeeeeeeeeeep</h4>
<p>Mostly, deeper networks do produce better results. But there is a limit. You can see this in organizations from our analogy too. The longer the management hierarchy, the more signal is lost between the lower levels and the higher-ups. In a neural network, during the back propagation steps, each time we go a layer deeper we have to multiply with the weights of the next layer. After a few hundred layers the product of these weights makes the updates so small that the initial few layers live on with their starting random weights for a long time. This is one form of the so-called <strong>Vanishing Gradient</strong> problem. The data requirement also increases with the depth of the network. If we really want it deep, it gets a lot more data hungry. That is why the state of the art networks take weeks to finish training on the fastest GPUs and need huge data sets. There is one more reason attached to the same thing. As the number of parameters in A & b increases, the model gets a lot of leeway, and after some time it actually has enough parameters to fit the entire data set in an “if this image, then this label” kind of equation. Then we have the same old over fitting problem. So going deep is good, until some depth.</p>
<h4 id="lets-see-some-code">Lets see some code</h4>
<p>Well, the beauty of the libraries that we use is that going deep is fairly easy if we do the basic thing. The code change needed in <code class="highlighter-rouge">keras</code> is really minimal.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">model</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">([</span>
<span class="n">Dense</span><span class="p">(</span><span class="n">units</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span><span class="n">input_dim</span><span class="o">=</span><span class="mi">784</span><span class="p">),</span>
<span class="n">Activation</span><span class="p">(</span><span class="s">'softmax'</span><span class="p">),</span>
<span class="n">Dense</span><span class="p">(</span><span class="n">units</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">input_dim</span><span class="o">=</span><span class="mi">1000</span><span class="p">),</span>
<span class="n">Activation</span><span class="p">(</span><span class="s">'softmax'</span><span class="p">),</span>
<span class="p">])</span>
</code></pre>
</div>
<p>We have brought 1000 teams that see the initial data and we take input from those 1000 outputs. This will run very slow. We will get to speeding it up and improving it in the next post.</p>
<h4 id="summary">Summary</h4>
<p>In this post we talked about how we can add more variables and allow the equation (from now on called a model) to be more complicated by having a chain of layers. We also discussed why that seems like a good idea and how we can make the model deeper in keras.</p>
<p>In the <a href="/blog/2017/08/29/Deep-Learning-Part-6">next post</a> we will talk about how to solve some of the problems and roadblocks we hit by depth and some common tricks we can use to get improvements.</p>Lets make the network actually deeper. Understand how the maths changes - Or does it?Easy Deep Learning Part IV - Working code2017-08-22T00:00:00+00:002017-08-22T00:00:00+00:00http://atishay.me/blog/2017/08/22/Deep-Learning-Part-4<p>This is the fourth part of an intended multi-part series on deep learning. You should read <a href="/blog/2017/08/16/Deep-Learning-Part-1">Part 1</a>, <a href="/blog/2017/08/18/Deep-Learning-Part-2">Part 2</a>, <a href="/blog/2017/08/21/Deep-Learning-Part-3">Part 3</a> before heading over here.</p>
<h4 id="recap">Recap</h4>
<p>In the previous sections we defined our deep learning task of identifying the contents of an image:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>g(AX + b) = Probability of being a content type
</code></pre>
</div>
<p>where <code class="highlighter-rouge">X</code> is the huge matrix that makes up the image, <code class="highlighter-rouge">A</code> the weights, <code class="highlighter-rouge">b</code> the biases, and <code class="highlighter-rouge">g</code> a function, like softmax, that converts scores into probabilities.
We realized that the traditional way of solving equations does not work well with this case and therefore we described SGD as a means to solve the equation to get <code class="highlighter-rouge">A</code> and <code class="highlighter-rouge">b</code>.</p>
<h4 id="technical-considerations">Technical Considerations</h4>
<p>I hope you feel the same as I do right now: deep learning as a concept is not very difficult. We use a very simple equation, fill it up with random numbers and slowly tweak them until we are done. The challenge comes with the implementation. In any practical use case we are talking about a few million multiplications per image. That means we need a beefy GPU, as running those multiplications one by one will turn out to be very slow. Add to that, we need floating point numbers (not integers), which take more space (4 bytes by default). Therefore for a single 1 megapixel image we are talking about roughly 12GB of GPU memory to store the 3 billion entries of <code class="highlighter-rouge">A</code> alone. Even then it takes hours to train. The ImageNet models used to recognize images can take a month on a cluster of 5 of the fastest GPUs available.
So we make some compromises:</p>
<ul>
<li>We take small images and most of the time their size is a power of 2.</li>
<li>We train in batches. Since gradient calculation is very expensive, we run the forward pass for multiple images at a time and once one set of images goes through we calculate the overall loss across all those images and push that back into the equation.</li>
<li>Deep learning libraries do most of the heavy lifting for us and automatically divide between the various machines, effectively use the GPU, and also calculate the gradient to run the back propagation. We just define the model.</li>
<li>Because of the very specific input requirements, the data gathering is the toughest part of a deep learning system. We need to clean the data and get it to the correct shape and sizes.</li>
</ul>
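<p>The batching idea from the list above can be sketched in plain numpy (a toy illustration; the deep learning libraries do this for you):</p>

```python
import numpy as np

# Toy stand-in for 100 flattened "images" of 8 numbers each.
X = np.arange(100 * 8, dtype=float).reshape(100, 8)
batch_size = 32

# Walk the data in chunks; the last batch may be smaller.
for start in range(0, len(X), batch_size):
    batch = X[start:start + batch_size]
    # the forward pass runs on `batch`; once the whole batch is done,
    # the combined loss is pushed back into the equation in one update
    print(batch.shape)
```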
<h4 id="libraries">Libraries</h4>
<p>This is a very biased topic. I don’t want to go into the benefits and disadvantages of each library. Anyway, if you are not doing research but instead just tweaking something that is already there (which is what you should be doing), there is no point in arguing about it. All are good enough. The biggest guns in the market are Caffe2 (Facebook), TensorFlow (Google) and CNTK (Microsoft). Again, the goals of all these libraries are different. They are made for modifying the core of the networks and messing with stuff like calculus, which you don’t need to go into just yet.
For the sake of simplicity, I use <a href="http://www.keras.io">keras</a>. This is one of the simplest libraries to use, with the minimal amount of code you need to write. The library is built over CNTK, TensorFlow and Theano, so you can go deeper into the lower levels if you so desire. This library also enables me to export models that you can visualize in Javascript. Since it can be built over TensorFlow, you can export its models to mobile and run them there.</p>
<h4 id="mnist">MNIST</h4>
<p>The MNIST data set is a data set of black and white (saves us the RGB channels) images of handwritten numbers (not cat pictures) from 0-9, all labelled correctly. They are available as 28x28 pixel images (not 1 megapixel). It is a very popular data set for trying out complicated networks, as the problem is just right: not so heavy that it requires a huge amount of processing time, not so simple that it can be solved easily by other means, and not too complicated, so it can be solved by the simplest of neural networks.</p>
<h4 id="installation">Installation</h4>
<p>From the python website, install python. Remember to enable pip (or use get-pip.py to download pip). Then run <code class="highlighter-rouge">pip install keras</code> to get keras.
You can now run python files with <code class="highlighter-rouge">python filename</code>. Google is your friend here (better than me) and you can always go to <a href="http://keras.io">keras.io</a> for the latest installation instructions.</p>
<h4 id="code">Code</h4>
<ol>
<li><strong>Import the python modules to use</strong>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras.datasets</span> <span class="kn">import</span> <span class="n">mnist</span>
<span class="kn">from</span> <span class="nn">keras.models</span> <span class="kn">import</span> <span class="n">Sequential</span>
<span class="kn">from</span> <span class="nn">keras.layers</span> <span class="kn">import</span> <span class="n">Dense</span><span class="p">,</span> <span class="n">Activation</span>
<span class="kn">import</span> <span class="nn">keras</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
</code></pre>
</div>
</li>
<li><strong>Load the data</strong> Keras comes bundled with MNIST. This is a data set of handwritten numbers (0-9). The images are all 28x28 and labelled with the correct number. Keras defines 60k images in the training set and 10k in the testing set.
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="p">(</span><span class="n">x_train2d</span><span class="p">,</span> <span class="n">y_train</span><span class="p">),</span> <span class="p">(</span><span class="n">x_test2d</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span> <span class="o">=</span> <span class="n">mnist</span><span class="o">.</span><span class="n">load_data</span><span class="p">()</span>
</code></pre>
</div>
</li>
<li><strong>Fit the data to our equation</strong> We do not use the fact that the image is 2-dimensional; we just create a flat list of 784 numbers. The -1 in the reshape tells numpy to infer the first dimension from the remaining ones. We create one matrix of size [1x784] for each image and stack all the images one under the other. Also, the output is written as a single number. We convert that to categories, i.e. one slot per possible digit. This is because we want the probabilities of each number (0, 1, 2..9) separately. This type of encoding is called <strong>categorical</strong> or <strong>one hot encoding</strong>. In a more advanced network, we could also keep this as a single number, where the equation outputs between 0 & 0.1 for 1, 0.1 & 0.2 for 2 and so on. Probabilities are easier to understand and therefore we prefer the one hot output.
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">x_train</span> <span class="o">=</span> <span class="n">x_train2d</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">784</span><span class="p">)</span>
<span class="n">x_test</span> <span class="o">=</span> <span class="n">x_test2d</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">784</span><span class="p">)</span>
<span class="n">y_train</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">to_categorical</span><span class="p">(</span><span class="n">y_train</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
<span class="n">y_test</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">to_categorical</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
</code></pre>
</div>
</li>
<li><strong>Define our equation</strong> The equation is called a model in keras. The code should be very intuitive. You can ignore the term <em>Sequential</em> for now; since we have only one element, it is not technically a sequence. Next is Dense with 10 outputs and 784 inputs. Here we define the sizes of <code class="highlighter-rouge">y</code> & <code class="highlighter-rouge">X</code> in <code class="highlighter-rouge">AX + b = y</code>. The sizes of <code class="highlighter-rouge">A</code> and <code class="highlighter-rouge">b</code> are inferred by the network automatically. Dense means all inputs and outputs are connected. We will get into other types of networks later. The activation function is softmax, the only one we have discussed so far.
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">model</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">([</span>
<span class="n">Dense</span><span class="p">(</span><span class="n">units</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span><span class="n">input_dim</span><span class="o">=</span><span class="mi">784</span><span class="p">),</span>
<span class="n">Activation</span><span class="p">(</span><span class="s">'softmax'</span><span class="p">),</span>
<span class="p">])</span>
</code></pre>
</div>
</li>
<li><strong>Define the loss function</strong> We apply cross entropy loss to multiple categories. The optimization algorithm is SGD and we are looking for better accuracy.
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">model</span><span class="o">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s">'categorical_crossentropy'</span><span class="p">,</span>
<span class="n">optimizer</span><span class="o">=</span><span class="s">'sgd'</span><span class="p">,</span>
<span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s">'accuracy'</span><span class="p">])</span>
</code></pre>
</div>
</li>
<li><strong>Run the training set</strong> This method takes a lot of time. It takes the batch size (to group multiple images in a single pass) and the number of epochs (the number of times to run through the whole data set).
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">x_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">)</span>
</code></pre>
</div>
</li>
</ol>
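<p>To see that there is no magic in <code class="highlighter-rouge">to_categorical</code>, here is a plain-numpy equivalent (for illustration only):</p>

```python
import numpy as np

def one_hot(labels, num_classes):
    """A hand-rolled stand-in for keras.utils.to_categorical."""
    out = np.zeros((len(labels), num_classes))
    out[np.arange(len(labels)), labels] = 1   # one 1 per row, at the label
    return out

print(one_hot(np.array([2, 0, 9]), 10))
```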
<p>That is it. We have a trained model. Now we can pass in a new image and get the corresponding probabilities. To verify the accuracy of our model (remember the over-fitting problem: we need to know how it performs on new, unseen data) we evaluate it on the test set:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">loss_and_metrics</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">x_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">128</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">loss_and_metrics</span><span class="p">)</span>
</code></pre>
</div>
<p>This should give you an accuracy of <code class="highlighter-rouge">92-93%</code>. Just like that.</p>
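<p>For this one-layer model, prediction is nothing more than the equation itself. A hand-rolled numpy sketch of <code class="highlighter-rouge">g(AX + b)</code>, with random weights standing in for the trained ones:</p>

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=1, keepdims=True))  # stable exp
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
A = rng.normal(size=(784, 10))   # weights (random here; trained in keras)
b = np.zeros(10)                 # biases
x = rng.normal(size=(1, 784))    # one flattened "image"

probs = softmax(x @ A + b)       # g(AX + b): one probability per digit
print(probs.argmax())            # the predicted digit
```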
<p>Here is the full code:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras.datasets</span> <span class="kn">import</span> <span class="n">mnist</span>
<span class="kn">from</span> <span class="nn">keras.models</span> <span class="kn">import</span> <span class="n">Sequential</span>
<span class="kn">from</span> <span class="nn">keras.layers</span> <span class="kn">import</span> <span class="n">Dense</span><span class="p">,</span> <span class="n">Activation</span>
<span class="kn">import</span> <span class="nn">keras</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="p">(</span><span class="n">x_train2d</span><span class="p">,</span> <span class="n">y_train</span><span class="p">),</span> <span class="p">(</span><span class="n">x_test2d</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span> <span class="o">=</span> <span class="n">mnist</span><span class="o">.</span><span class="n">load_data</span><span class="p">()</span>
<span class="n">x_train</span> <span class="o">=</span> <span class="n">x_train2d</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">784</span><span class="p">)</span>
<span class="n">x_test</span> <span class="o">=</span> <span class="n">x_test2d</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">784</span><span class="p">)</span>
<span class="n">y_train</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">to_categorical</span><span class="p">(</span><span class="n">y_train</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
<span class="n">y_test</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">to_categorical</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">([</span>
<span class="n">Dense</span><span class="p">(</span><span class="n">units</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span><span class="n">input_dim</span><span class="o">=</span><span class="mi">784</span><span class="p">),</span>
<span class="n">Activation</span><span class="p">(</span><span class="s">'softmax'</span><span class="p">),</span>
<span class="p">])</span>
<span class="n">model</span><span class="o">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s">'categorical_crossentropy'</span><span class="p">,</span>
<span class="n">optimizer</span><span class="o">=</span><span class="s">'sgd'</span><span class="p">,</span>
<span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s">'accuracy'</span><span class="p">])</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">x_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">)</span>
<span class="n">loss_and_metrics</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">x_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">128</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">loss_and_metrics</span><span class="p">)</span>
</code></pre>
</div>
<p>Now that we have working code, we can really zoom ahead. You know the basics. Give me some time to build upon them. In <a href="/blog/2017/08/24/Deep-Learning-Part-5">part 5</a> we will talk about deeper networks, how and why.</p>Time for some action. Our first deep learning model - handwriting recognitionEasy Deep Learning Part III - Training & Testing2017-08-21T00:00:00+00:002017-08-21T00:00:00+00:00http://atishay.me/blog/2017/08/21/Deep-Learning-Part-3<p>This is the third part of an intended multi-part series on deep learning. You should read <a href="/blog/2017/08/16/Deep-Learning-Part-1">Part 1</a> and <a href="/blog/2017/08/18/Deep-Learning-Part-2">Part 2</a> before heading over here.</p>
<h4 id="recap">Recap</h4>
<p>In the previous sections we defined our deep learning task of detecting if an image is of a cat via the equation:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>g(AX + b) = Probability of being a cat
</code></pre>
</div>
<p>where <code class="highlighter-rouge">X</code> is the huge matrix that makes up the image, <code class="highlighter-rouge">A</code> the weights, <code class="highlighter-rouge">b</code> the biases, and <code class="highlighter-rouge">g</code> a function, like softmax, that converts scores into probabilities.
Using this equation we need to find <code class="highlighter-rouge">A</code> and <code class="highlighter-rouge">b</code> using the known image label pairs(in the training phase) and then use this <code class="highlighter-rouge">A</code> and <code class="highlighter-rouge">b</code> to find the label for a unlabelled image(in the testing phase).</p>
<h4 id="why-is-this-so-hard">Why is this so hard?</h4>
<p>Let me remind you that <code class="highlighter-rouge">AX + b</code> is a lot of numbers (3 billion parameters for 1000 categories from 1 megapixel images). So we have 3 billion unknowns, and one image gives us just one equation. So we would need 3 billion images to solve for this equation exactly. And this is the simplest equation possible. Add to that the fact that the images need to be independent (<code class="highlighter-rouge">2x + y = 4</code> and <code class="highlighter-rouge">4x + 2y = 8</code> are dependent equations). We also need to make sure that the equations don’t contradict (e.g. <code class="highlighter-rouge">x + y = 5</code> and <code class="highlighter-rouge">x + y = 6</code> cannot both be solved). With images being huge sets of numbers, this is a real problem. We can also have more images than variables, in which case we are stuck with unsolvable equations. Testing is easy, as you just have to multiply, add and run softmax on a calculator. But getting perfect equations is impossible.</p>
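<p>The parameter count quoted above is just multiplication (assuming 1 megapixel RGB images and 1000 categories):</p>

```python
inputs = 1_000_000 * 3      # 1 megapixel, 3 color channels
outputs = 1000              # categories
weights = inputs * outputs  # entries in A
biases = outputs            # entries in b
print(weights + biases)     # about 3 billion unknowns
```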
<h4 id="what-do-we-do">What do we do?</h4>
<p>We approximate. We cannot expect to solve the equation but what we can do is find one of the solutions that satisfy all or most of our training data and assume it is our solution. Since it works for most of the known cases, it is very likely it will work for the unknown ones.</p>
<h4 id="how-do-we-do-that">How do we do that?</h4>
<p>Since we are already in the realm of approximation, we start with a true approximation: all random numbers. That somehow gives us an answer. The answer is wrong, but from no answer we now have an answer. Now we need to tweak it to get the right answer. Here is how we go about doing that. We put the first image in. We get a probability for each class. Say, for example, it said with probabilities of <code class="highlighter-rouge">0.2</code>, <code class="highlighter-rouge">0.7</code>, <code class="highlighter-rouge">0.1</code> that it is a cat, a dog and a horse respectively. But we know it is a cat. So we take the error, which is <code class="highlighter-rouge">0.8</code> for the cat, and modify <code class="highlighter-rouge">A</code> & <code class="highlighter-rouge">b</code> such that the error would be <code class="highlighter-rouge">0</code>. Then we pick up the next image and continue to do this again and again. We repeat the same images, and after a while, considering the huge number of options we have for the variables (we have very few images and a lot more variables), they adjust themselves to values that pass in most cases.</p>
<h4 id="which-variables-do-we-tweak-and-by-how-much">Which variables do we tweak and by how much?</h4>
<p>A billion variables is not a small thing. We cannot possibly keep track of all the variables we modified for the previous image. So we definitely need a way to pick which variables to change. The solution is clever, but very practical.
Let us pause the maths a little and go back to common sense. You take a decision (like whether to buy a Honda or a Ford). You ask multiple folks to score one of them (whichever they think is best) and pick the one with the maximum score. Now the decision turns out badly. Whom would you blame? The ones who scored the car you ended up buying. Do you punish them equally? Definitely not. The one who gave a higher score gets more punishment. The greater the impact on the result, the greater the reward or the punishment.
The same logic applies here. You look at the raw numbers. The weights in the cases where the output was higher get reduced more than the ones where the output was lower. First convert the probability back to a score. Since we used powers (<script type="math/tex">e^x</script>) we use <code class="highlighter-rouge">log</code> to get back to scores. Then we subtract the error. This is the beauty of the simple equation <code class="highlighter-rouge">AX + b</code> that we started with. Note that in the day and age of computers, unless you are doing it on paper, you can re-use the mathematics already implemented in the library. So I am not going into the raw equations, which you can find in any textbook on the subject. I hope you can see that doing this multiple times should get to some good result. We can have multiple results, all good enough, depending on where we start and which images we see first. You can compare it with a child learning: the lessons learnt when we are young leave a deeper imprint on us than those we learn later.</p>
<h4 id="some-terminology">Some terminology</h4>
<p>The phase of training where we push the error back into the equation is called <strong>back propagation</strong> or the <strong>backward pass</strong>, while the calculation of scores (and probabilities) using the equation is called <strong>forward propagation</strong> or the <strong>forward pass</strong>. Note that in the testing phase we only do forward propagation. The error, when converted into score terms, is called the <strong>loss</strong> and is calculated only during training. This loss is applied based on the impact of a weight. The loss type for the softmax function we just described is called <strong>cross entropy</strong> loss. It is just the difference in the logarithmic space, to counter the power we raised for making everything positive. The impact is officially called the <strong>gradient</strong>. The method of slowly approaching the result by modifying each value in the direction of its impact is called <strong>stochastic gradient descent</strong> (SGD). (It is called stochastic because we don’t wait for all pictures before taking a decision.) Also, during training we repeat the same images again and again. One run through all the images is called one <strong>epoch</strong>. We run multiple epochs to fit most of our training data.</p>
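<p>All of these terms can be seen in one place in a toy sketch (plain numpy, tiny made-up data, nothing to do with real images): a forward pass, the cross entropy loss, the gradient and an SGD update per epoch.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(32, 20))           # a "batch" of 32 tiny inputs
labels = rng.integers(0, 3, size=32)    # the true class of each, 3 classes
Y = np.eye(3)[labels]                   # one-hot targets

A = np.zeros((20, 3))                   # weights, starting from a dumb guess
b = np.zeros(3)                         # biases
lr = 0.1                                # learning rate

def forward(X):
    scores = X @ A + b                  # forward pass: the AX + b scores
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)         # softmax probabilities

def loss(P):
    return -np.mean(np.sum(Y * np.log(P), axis=1))  # cross entropy

start_loss = loss(forward(X))
for epoch in range(100):
    P = forward(X)
    grad = (P - Y) / len(X)             # backward pass: the gradient of the
    A -= lr * (X.T @ grad)              #   loss, pushed into A and b as an
    b -= lr * grad.sum(axis=0)          #   SGD update
print(start_loss > loss(forward(X)))    # the loss went down
```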
<p>There is one more term that I would like to introduce. Images come in all shapes and sizes. It might happen that outliers (one weird training image) throw the equation in a weird direction. What we want is slow and gradual learning, so that we converge to the conclusion that most images support. We might not get to 100% accuracy, but we will not be jumping around forever, and we can stop when we do not get better results any more. Therefore we multiply the loss by a number called the <strong>learning rate</strong> before we go on to modify our variables. And to get the best results we decrease the learning rate over time, because close to the result we do not want to lose the accuracy that was gained from so many images earlier.</p>
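<p>One common shape for such a schedule (an illustrative choice, not the only one) divides the starting rate by a slowly growing factor:</p>

```python
base_lr = 0.1
decay = 0.05

# The learning rate shrinks each epoch, so late updates are gentle.
for epoch in range(5):
    lr = base_lr / (1 + decay * epoch)
    print(epoch, round(lr, 4))
```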
<h4 id="over-fitting">Over Fitting</h4>
<p>There is one problem we will have to deal with, and that is over-fitting. With 3 billion variables, there is a possibility of tying in too tightly to the known cases, e.g. “if image 1 then this, else that” kinds of solutions. We won’t notice it, but somewhere that might happen. The solution is to reduce the number of variables or to force the use of different variables each time. No need to think too much about this; I just wanted to introduce the term. We will talk about solutions when we get to more complicated networks.</p>
<h4 id="generalization">Generalization</h4>
<p>The logic that I described for back propagation, taking the log and subtracting, is specific to softmax and the equation <code class="highlighter-rouge">AX + b</code>. It is good to start with a simple equation, but we know this is not enough to capture all image use cases. (We might need even more variables.) So we need to formalize the method and find some way of measuring the impact correctly. In the early days of deep learning, for complicated equations, the amount was calculated manually using some clever equations, until it was figured out that there is a way to get the best possible value of the impact accurately. I do enter a bit of <em>hated</em> maths here, but I have to. The solution is calculus. Don’t worry, you don’t need to know calculus to do deep learning in the modern world. The libraries have calculus built in and we don’t need to do anything manually. Via calculus we can calculate the derivative, or gradient, of the entire equation <code class="highlighter-rouge">f(X)</code> with respect to any individual variable, like <script type="math/tex">\frac{\delta f(X)}{\delta A}</script>, and then multiply the loss with the learning rate and this impact to apply it to the value of A.</p>
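<p>You can even check the calculus numerically. A sketch with a one-variable toy function (not a real network): the derivative from calculus matches the slope measured by nudging the variable a tiny bit.</p>

```python
# Toy "network": f(a) = (a*x + b - y)^2, a squared error in one weight a.
x, b, y = 2.0, 1.0, 9.0

def f(a):
    return (a * x + b - y) ** 2

a = 3.0
analytic = 2 * (a * x + b - y) * x                # df/da, from calculus
eps = 1e-6
numeric = (f(a + eps) - f(a - eps)) / (2 * eps)   # measured finite difference
print(analytic)
```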
<h4 id="regression">Regression</h4>
<p>For the statistics lovers, the logic is the same as the regression we use to fit an equation to a set of points in 2D space. Indeed, deep learning is just a generalized version of
linear regression, with many more features and inputs, and vaguer and more complicated ones at that. The loss has been replaced: from L2 loss to cross entropy (which is to some extent the same thing, just on the scores).</p>
<p>For common folks who don’t understand statistics: congratulations, you just learnt your stats 101. This is exactly the method used to divide a set of inputs into two groups by a straight line (you might have heard of the method of least squares). In stats, you use the same equation without the logs (as we started with) and use as the loss the difference in scores, or rather the mean squared difference (since losses of -5 and +5 would otherwise cancel out to 0).</p>
<p>The calculus idea also came from statistics as that is what we use for finding the minimum in linear regression.</p>
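<p>As a sketch of that stats-101 connection, here is a least-squares line fit in numpy (toy data, illustration only):</p>

```python
import numpy as np

# Fit y = a*x + b to points on a known line, the method-of-least-squares way.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                       # points on y = 2x + 1, so the fit is exact

A = np.vstack([x, np.ones_like(x)]).T   # design matrix: columns [x, 1]
(a, b), *rest = np.linalg.lstsq(A, y, rcond=None)
print(round(a, 6), round(b, 6))
```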
<h4 id="summary">Summary</h4>
<p>Since we cannot get exact answers, we approximate. We start with random weights for all variables. We make two passes over the equation: a forward pass, where we use the weights to get the probabilities, and a backward pass, where we use the true answer to calculate the loss. Then, based on how much gradient each weight had for that loss, we adjust our weights. We keep doing this multiple times until the results are good enough.</p>
<p>Now starts the fun part. In the next part we apply what we just learnt to get to some real action. We will do a full pass over our basic neural network for some real fun: handwritten number recognition. Zoom ahead to <a href="/blog/2017/08/22/Deep-Learning-Part-4">here</a> when you are ready.</p>Complete our model by going into training and backtracking.Free2017-08-20T00:00:00+00:002017-08-20T00:00:00+00:00http://atishay.me/blog/2017/08/20/Free<p>This time I have two versions of the speech: the initial draft, which makes a good read but takes a beating on delivery, and the final one that I delivered.</p>
<h4 id="initial-draft">Initial Draft</h4>
<p>Introduction before speech: Atishay wants to apologize to all medical professionals. The contents of his speech are partially true and truth hurts.</p>
<p>On April 24, 2010, in White Bear Township, Minnesota, a 68-year-old retiree, Erwin Lingitz, was hungry. Very hungry. So hungry that he committed the biggest heist of the century – 1.5 pounds, mainly sausage and beef sticks, all free samples, were stolen… He was beaten up and arrested. CCTV cameras were installed in all stores. (slight pause) My retirement plans were ruined.</p>
<p>Free is a marketing term - Fools Ready for Easy Extortion. There is a star at the end of each offer – “Conditions Apply”. Even if I could catch a hawk and transplant my brain inside of it, I wouldn’t be able to read those conditions. (pause) To get them you need to go to customer support, (hush) quietly tiptoe inside and (bend) hide under the table. (Pause) (speak fast) Or just go to Facebook. (Pause) Last month on Facebook, I found one such benefit in my health insurance – Free Health checkup. Free. Who would miss it. They have free candies in the waiting room.</p>
<p>I was about to learn their price, the hard way. Right as I entered, a sexy salesgi…, attendant gave me this (Take out paper). I am good with form filling. Throwaway email… Done. Correct phone number at the bottom. Just in case. No extra check marks. Careful with that opt-out. Do I mention the sick leave for the Giants’ game? Na. And don’t forget to leave a few blanks . . . she would come over. (Fold paper) She did – no ad on the internet can match this. This is like a salesman telling you that the car is free. (Aside, with smiling face) Sir, we only charge for the brakes. I happily parted with five hundred dollars.</p>
<p>Now I was a lab rat – oversized gown, labelled collar and always scared. (Fidgety eyes) “Those machines. Hope they won’t turn me into a frog.” For the next one hour, I was a voodoo doll pinched to punish. Then came optometry. (Point at someone in the audience) Ok Sir, do you know English? Is A better or A? A or A? (I think A). Okay. Let’s go with B… And then the last one – masters of trickery. “Have you eaten?” After two hours of this ordeal who would say yes. "Good". Trick question - no food. Had it been a yes, (in muted voice) “You would have won another day off from work.”</p>
<p>Those who think failing college exams is difficult, you always have an option – sit behind (someone from the audience) next time. Not with this (Take out paper). Fidgety… Trembling… I opened the sheet. … (Wide open mouth, to exclaim) - “Aaa… Ancient Greek. Where do you find an interpreter?” Another “Free” session with the doctor.</p>
<p>I was scared but prepared. Twenty minutes are free. Rest I pay. The doctor pounced upon the report. (Doctor’s Voice, again with the same sheet of paper. Serious look) “Hmm… Interesting… The results look good. “ (Fold paper)<br />
(Sigh)<br />
“Except this one thing. (Pointing finger) You ate all my candies. You are overweight.”</p>
<p>My life flowed in front of my eyes. I had always been so thin that I fell between the cracks.(Pause) As a child, I was face to face with the manifestation of God – my fat young brother. Of course, I lost. “Oh so cute. Le Le Le Le”.<br />
I waited in a corner (Aside, eyes wide open, slightly bend) “Momma, my candy.”<br />
In my teens, I faced worse odds, against the manifestation of the devil - the bully KR - two times my size - “Hey matchstick, where’s my treat?”.<br />
By college my enemies had completely taken over - the manifestations of laziness.<br />
Now when I finally get to have candies, you won’t let me. I am a Toastmaster. I will… I will evaluate you. (Pause for effect) Good. Very well practiced. No ahs and ums or filler words. Look again. You’re using notes. That’s my Ice Breaker. You do it every day. Where is the vocal variety? The three-act structure? (Pause) (Strut across the stage) I decided to show some body language – “Get to the point.”<br />
(Looks around and walks to a spot.) “Is this the point?”.<br />
(Raised eyes, angry look as if to question) Hmmm.<br />
(Scared, shivering) “200 bucks”.</p>
<p>I took the prescription (take out the paper again) and left. Our founding fathers wanted free to stand for freedom. Free to think, free to say. By the twenty-eighth amendment, free only stood for free beer, that too with a star. Erwin’s case has been dismissed. But the fact remains – the last E in Free stands for “Extortion”. (Flip the paper)<br />
The paper had “Free” written on it, with the last E spelling out “Extortion”.</p>
<h4 id="final-version">Final Version</h4>
<p><br />
Fellow Toastmasters and Guests, what do you think when you see the words “On Sale”? <br />
How about “Free” ?<br />
Last month on Facebook, I found an ad for a free health checkup.<br />
Instantly my brain said, “Don’t be a sucker!!!”<br />
<br />
But my heart went on… Awwwww… give it a try… see what happens…<br />
<br />
The next day, I was off from work and into the hospital. As soon as I entered, a very attractive receptionist gave me a form to fill out.<br />
<br />
I signed up for all the free stuff. I even picked up some of the candies from the waiting room. Then she explained to me that some of the most important tests were paid, and the free ones don’t really make sense without them.<br />
<br />
“Uh-oh”<br />
<br />
But… charmed by her, and not wanting my effort of going all the way there to go to waste…<br />
<br />
I signed up for the paid tests.<br />
<br />
Within minutes it began.<br />
<br />
Hospitals are gloomy. Seriously, after the cemetery, where else do you expect to find all those dead?<br />
<br />
First was the CAT scan. They made me lie down in a small coffin-like box that went into a machine. I expected a frog to come out from the other side.<br />
<br />
Then the eye exam. He made me recite the alphabet in random order multiple times.<br />
<br />
The worst part was – with and without that special lens, it all looked the same.<br />
<br />
Next, I went through the blood test. He asked me whether I was fasting.<br />
<br />
I misunderstood and thought that he was offering me food. (funny)<br />
<br />
It turns out they don’t provide food, but instead ask you to come over again if you have eaten something.<br />
<br />
Anyways, I finished the tests and headed back home.<br />
A few days later… the results… Yes, the results… came in the mail.<br />
<br />
Not email … but snail mail.<br />
<br />
I opened it, “What”<br />
<br />
I couldn’t understand anything.<br />
<br />
I needed an interpreter!!!<br />
<br />
So, I had to have another visit to the doctor.<br />
<br />
Having already shelled out money the last time, I got smart and I did my homework.<br />
<br />
I knew I had to be back within twenty minutes to save myself from paying.<br />
<br />
But the doctors are so clever. They take so much time.<br />
<br />
“Hmmm”.<br />
“Hmmm”.<br />
“What is it?”<br />
<br />
“All is good”.<br />
<br />
Sigh of relief.<br />
<br />
“Just one thing – You ate all my candy. You are overweight.”<br />
<br />
Then he began his long speech about the problems that could arise from being obese. I did not believe I was fat. I had always been thin. And no one likes being poked at for their flaws.<br />
Now it was my turn.<br />
As a Toastmaster, I started evaluating his speech. His speech was dull and boring. I stopped him and asked what he wanted me to do. He prescribed me some expensive drugs from his pharmacy and I left.<br />
<br />
And by the way, he had 10 ahs and ums.<br />
After the entire experience, in addition to the reports and what the doctor said, it felt like they were trying to extract as much money as they could out of the health examination that was supposed to be free.<br />
Even if we continue to fall for free stuff, we should remember that the real motive of giving free things is to make you pay in some other way.<br />
<br />
Fellow Toastmasters and Guests, would you ever do this?<br />
<br />
Would I ever do this again?<br />
<br />
My mind says… Hell no!!!<br />
<br />
But my heart says… Maybe… Just maybe…<br />
<br />
Toastmasters????</p>If you really love free stuff this one is for you. I share an experience of a free health checkup.