Theano is an amazing Python package for deep learning that can use NVIDIA's CUDA toolkit to run on the GPU. For the math operations at the heart of many machine learning algorithms (such as matrix multiplication), the GPU is orders of magnitude faster than the CPU. While setting up AWS to run my thesis experiments, I realized many instructions were out of date. Hopefully these steps will help you get your deep learning models up and running on AWS. These instructions use Ubuntu 14.04 64-bit with CUDA 7.0 on a g2.2xlarge instance.

Log into your AWS management console and click over to EC2. There are two options for creating your instance:

- **On-Demand Instances**: This option guarantees that you have your section of a machine for computing. Use this option if you cannot handle potential interruptions and are willing to pay more. As of this writing, a g2.2xlarge instance costs $0.65/hr. To create a dedicated instance, go to "Instances" and click "Launch Instance".
- **Spot Instances**: This option gives you left-over compute power at a much cheaper rate. Use this option if you can handle potential interruptions. Spot instances use a bidding system to determine who gets the left-over compute power; you can look at the history of rates under "Spot Requests". As of this writing, a g2.2xlarge spot instance costs $0.0642/hr (10% of the price!) and has seemed fairly stable over the last few months. I ended up using a few spot requests, bidding $0.07.

When choosing an AMI to use for the instance, pick the latest 64-bit Ubuntu (I used ami-df6a8b9b, which is Ubuntu 14.04). You should also set up the security group for your IP to have ssh access, and download a new private/public key file.

Follow the instructions on Amazon to ssh into your instance. For Ubuntu on AWS, I found you have to use:

```
ssh -i [path/to/key.pem] ubuntu@[DNS]
```

Once you ssh into the instance, I recommend using screen (or tmux) to keep the session alive so your programs can keep running after you disconnect. Here are all the commands, in the order I used them, to set up Theano:

1) update the default packages

```
sudo apt-get update
sudo apt-get -y dist-upgrade
```

2) create a new screen named theano (or look at using Tmux instead)

```
screen -S "theano"
```

3) install all the dependencies

```
sudo apt-get install -y gcc g++ gfortran build-essential git wget linux-image-generic libopenblas-dev python-dev python-pip python-nose python-numpy python-scipy
```

4) install the bleeding-edge version of Theano

```
sudo pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git
```

5) grab the latest (7.0) CUDA toolkit repository package

```
sudo wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/cuda-repo-ubuntu1404_7.0-28_amd64.deb
```

6) install the CUDA repository package

```
sudo dpkg -i cuda-repo-ubuntu1404_7.0-28_amd64.deb
```

7) update the package list and install CUDA, including the driver (this takes a while)

```
sudo apt-get update
sudo apt-get install -y cuda
```

8) update your PATH to include CUDA's nvcc, and set LD_LIBRARY_PATH (single quotes keep $PATH from being expanded by the current shell)

```
echo -e '\nexport PATH=/usr/local/cuda/bin:$PATH\nexport LD_LIBRARY_PATH=/usr/local/cuda/lib64' >> ~/.bashrc
```

9) reboot the system for cuda to load

```
sudo reboot
```

Wait a little bit for the reboot and then ssh back into the instance.

10) install the included samples and test CUDA

```
cuda-install-samples-7.0.sh ~/
cd NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
make
./deviceQuery
```

Make sure it shows that the GPU exists.

11) set up the Theano config file to use the GPU by default

```
echo -e "\n[global]\nfloatX=float32\ndevice=gpu\nmode=FAST_RUN\n\n[nvcc]\nfastmath=True\n\n[cuda]\nroot=/usr/local/cuda" >> ~/.theanorc
```
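For reference, a working ~/.theanorc for this setup looks like the following (note that mode=FAST_RUN belongs under the [global] section):

```
[global]
floatX=float32
device=gpu
mode=FAST_RUN

[nvcc]
fastmath=True

[cuda]
root=/usr/local/cuda
```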

That's it! Now you can use git to pull whatever repo you need and test Theano. I recommend using the GPU test script from Theano's documentation.

These instructions were partially gathered from Erik Bernhardsson and Erik Hazzard, as well as the Ubuntu instructions from Theano.

Netflix has a great post showing that you can customize some functions in the Cuda kernel to speed up Theano by ~70%. I haven't done this, but it is definitely something to check out.


Imagine you are trying to recognize someone's handwriting - whether they drew a '7' or a '9'. From years of seeing handwritten digits, you automatically notice the vertical line with a horizontal top section. If you see a closed loop in the top section of the digit, you think it is a '9'. If it is more like a horizontal line, you think of it as a '7'. Easy enough. What it took for you to correctly recognize the digit, however, is an impressive display of fitting smaller features together to make the whole - noticing contrasted edges to make lines, seeing a horizontal vs. vertical line, noticing the positioning of the vertical section underneath the horizontal section, noticing a loop in the horizontal section, etc.

Ultimately, this is what deep learning or representation learning is meant to do: discover multiple levels of features that work together to define increasingly more abstract aspects of the data (in our case, initial image pixels to lines to full-blown numbers). This post is going to be a rough summary of two main survey papers:

- Representation Learning: A Review and New Perspectives by Yoshua Bengio, Aaron Courville, and Pascal Vincent
- Deep Learning of Representations: Looking Forward by Yoshua Bengio

Current machine learning algorithms' performance depends heavily on the particular features of the data chosen as inputs. For example, document classification (such as marking emails as spam or not) can be performed by breaking down the input document into bag-of-words or n-grams as features. Choosing the correct feature representation of input data, or feature engineering, is a way that people can bring prior knowledge of a domain to increase an algorithm's computational performance and accuracy. To move towards general artificial intelligence, algorithms need to be less dependent on this feature engineering and better learn to identify the explanatory factors of input data on their own.
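As a concrete illustration of this kind of feature engineering, here is a minimal bag-of-words sketch in Python (the tokenization is deliberately naive, and bag_of_words is a hypothetical helper of mine, not from any library):

```python
from collections import Counter

def bag_of_words(document):
    # Naive featurization: lowercase, split on whitespace,
    # and count how often each token occurs.
    return Counter(document.lower().split())

features = bag_of_words("Free money now claim your free prize")
print(features["free"])  # 2
```

A spam classifier would then consume these token counts instead of the raw text; choosing counts vs. n-grams vs. TF-IDF weights is exactly the feature-engineering decision the paragraph above describes.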

Deep learning tries to move in this direction by capturing a 'good' representation of input data by using compositions of non-linear transformations. A 'good' representation can be defined as one that disentangles underlying factors of variation for input data. It turns out that deep learning approaches can find useful abstract representations of data across many domains: it has had great commercial success powering most of Google and Microsoft's current speech recognition, image classification, natural language processing, object recognition, etc. Facebook is also planning on using deep learning approaches to understand its users^{1}. Deep learning has been so impactful in industry that MIT Technology Review named it as a top-10 breakthrough technology of 2013^{2}.

So how do you build a deep representation of input data? The central idea is to learn a hierarchy of features one level at a time where the input to one computational level is the output of the previous level for an arbitrary number of levels. Otherwise, 'shallow' representations (most current algorithms like regression or svm) go directly from input data to output classification.

One good analogue for deep representations is neurons in the brain (a motivation for artificial neural networks) - the output of a group of neurons is agglomerated as the input to more neurons to form a hierarchical layer structure. Each layer *N* is composed of *h* computational nodes that connect to each computational node in layer *N+1*. See the image below for an example:

(source)

There are two main ways to interpret the computation performed by these layered deep architectures:

- **Probabilistic graphical models** have nodes in each layer that are considered as latent random variables. In this case, you care about the probability distribution of the input data *x* and the hidden latent random variables *h* that describe the input data in the joint distribution *p(x,h)*. These latent random variables describe a distribution over the observed data.
- **Direct encoding (neural network) models** have nodes in each layer that are considered as computational units. This means each node *h* performs some computation (normally nonlinear, like a sigmoidal function, hyperbolic tangent, or rectified linear unit) given its inputs from the previous layer.

To get started, principal component analysis (PCA) is a simple feature extraction algorithm that can span both of these interpretations. PCA learns a linear transform *h = f(x) = W^{T}x + b*, where the columns of the weight matrix *W* form an orthogonal basis of the directions of greatest variance in the data and *b* is an offset vector.

From a probabilistic viewpoint, PCA is simply finding the principal eigenvectors of the covariance matrix of the data. This means that you are finding which features of the input data explain the most variance in the data^{3}.

From an encoding viewpoint, PCA is performing a linear computation over the input data to form a hidden representation *h* that has a lower dimensionality than the data.
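The encoding view can be sketched in a few lines of numpy (variable names and sizes are my own; this is illustrative, not a production PCA):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 5)  # 200 samples, 5 features

# Center the data; this offset plays the role of b in h = W^T x + b
mu = X.mean(axis=0)
Xc = X - mu

# Principal eigenvectors of the covariance matrix
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # re-sort by descending variance
W = eigvecs[:, order[:2]]                # keep the top 2 components

H = Xc @ W   # h = W^T (x - mu) for every sample
print(H.shape)  # (200, 2)
```

Each row of H is a 2-dimensional hidden representation of a 5-dimensional input, which is exactly the dimensionality reduction described above.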

Note that because PCA is a linear transformation of the input *x*, it cannot really be stacked in layers because the composition of linear operations is just another linear operation. There would be no abstraction benefit of multiple layers. To form powerful deep representations, we will look at stacking Restricted Boltzmann Machines (RBM) from a probability viewpoint and nonlinear auto-encoders from a direct encoding viewpoint.

A Boltzmann machine is a network of symmetrically-coupled binary random variables or units. This means that it is a fully-connected, undirected graph. This graph can be divided into two parts:

- The *visible* binary units *x* that make up the input data, and
- The *hidden* or latent binary units *h* that explain away the dependencies between the visible units *x* through their mutual interactions.

(A graphical representation of an example Boltzmann machine. Each undirected edge represents a dependency; in this example there are 3 hidden units and 4 visible units.)

Boltzmann machines describe this pattern of interaction through the distribution over the joint space *[x,h]* with the energy function:

*E_{Θ}(x,h) = -½x^{T}Ux - ½h^{T}Vh - x^{T}Wh - b^{T}x - d^{T}h*

where the model parameters Θ are {*U,V,W,b,d* }.

Trying to evaluate conditional probabilities over this fully connected graph ends up being an intractable problem. For example, computing the conditional probability of a hidden variable given the visibles, *P(h_{i} | x)*, requires marginalizing over all of the other hidden variables, a computation that grows exponentially with the number of units.

However, we can restrict the graph from being fully connected to only containing the interactions between the visible units *x* and hidden units *h*.


This gives us an RBM, which is a *bipartite* graph with the visible and hidden units forming distinct layers. Calculating the conditional distribution *P(h_{i} | x)* becomes tractable: given the visible units, the hidden units are conditionally independent of one another, so the distribution factorizes as *P(h | x) = ∏_{i} P(h_{i} | x)*.

Very successful deep learning algorithms stack multiple RBMs together, where the hiddens *h* from the visible input data *x* become the new input data for another RBM for an arbitrary number of layers.
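To make the factorized conditional concrete, here is a small numpy sketch. The weights below are random placeholders (a real RBM would learn them, e.g. with contrastive divergence), and all names are mine:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(1)
n_visible, n_hidden = 4, 3
W = rng.randn(n_visible, n_hidden) * 0.1  # visible-hidden interaction weights
d = np.zeros(n_hidden)                    # hidden biases

x = np.array([1.0, 0.0, 1.0, 1.0])        # one visible configuration

# Because the graph is bipartite, the hiddens are conditionally
# independent given x: P(h_i = 1 | x) = sigmoid(x . W[:, i] + d_i),
# so the whole vector of probabilities comes from one matrix product.
p_h_given_x = sigmoid(x @ W + d)
print(p_h_given_x.shape)  # (3,)
```

When stacking RBMs, this probability vector (or a sample from it) is exactly what becomes the "visible" input to the next RBM in the stack.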

There are a few drawbacks to the probabilistic approach to deep architectures:

- The posterior distribution *P(h | x)* becomes incredibly complicated if the model has more than a few interconnected layers. We are forced to resort to sampling or approximate inference techniques to solve the distribution, which carry computational and approximation-error costs.
- Calculating this distribution over latent variables still does not give a usable *feature vector* to train a final classifier to make this algorithm useful for AI tasks. For example, we calculate all of these hidden distributions that explain the variations over the handwritten digit recognition problem, but they do not give a final classification of a number. Actual feature values are normally derived from the distribution by taking the latent variable's expected value, and these are then used as the input to a normal machine learning classifier, such as logistic regression.

To get around the problem of deriving useful feature values, an auto-encoder is a non-probabilistic alternative approach to deep learning where the hidden units produce usable numeric feature values. An auto-encoder directly maps an input *x* to a hidden layer *h* through a parameterized closed-form equation called an *encoder*. Typically, this encoder function is a nonlinear transformation of the input to *h* in the form:

*h = f_{Θ}(x) = s(Wx + b)*

This resulting transformation is the *feature-vector* or *representation* computed from input *x*.

Conversely, a *decoder* function is used to then map from this feature space *h* back to the input space, which results in a *reconstruction* *x'*. This decoder is also a parameterized closed-form equation that nonlinearly 'undoes' the encoding function:

*x' = g_{Θ}(h) = s(W'h + d)*

In both cases, the nonlinear function *s* is normally an element-wise sigmoid, hyperbolic tangent nonlinearity, or rectifier linear unit.

Thus, the goal of an auto-encoder is to minimize a loss function over the reconstruction error given the training data. Model parameters Θ are {*W,b,W',d* }, with the weight matrix *W* most often having 'tied' weights such that *W' = W ^{T }*.
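The encoder/decoder pair with tied weights can be sketched in a few lines of numpy (untrained, randomly initialized parameters; sigmoid standing in for *s*; sizes and names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(2)
n_input, n_hidden = 8, 3
W = rng.randn(n_input, n_hidden) * 0.1
b = np.zeros(n_hidden)   # encoder bias
d = np.zeros(n_input)    # decoder bias

x = rng.rand(n_input)

h = sigmoid(x @ W + b)        # encoder: h = s(Wx + b)
x_rec = sigmoid(h @ W.T + d)  # decoder with tied weights W' = W^T

# Training would minimize this reconstruction error over the dataset
loss = np.sum((x - x_rec) ** 2)
```

Unlike the RBM's conditional distribution, h here is a directly usable numeric feature vector, which is the practical advantage described above.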

Stacking auto-encoders in layers is the same process as with RBMs: the hidden representation *h* from one auto-encoder becomes the input *x* for the next.

One disadvantage of auto-encoders is that they can easily memorize the training data - i.e. find the model parameters that map every input seen to a perfect reconstruction with zero error - given enough hidden units *h*. To combat this problem, regularization is necessary, which gives rise to variants such as sparse auto-encoders, contractive auto-encoders, or denoising auto-encoders.
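One such variant can be sketched quickly: a denoising auto-encoder corrupts the input and is trained to reconstruct the *clean* version, so plain memorization no longer achieves zero error. A rough numpy illustration with untrained parameters and masking noise (all names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(4)
n_input, n_hidden = 8, 3
W = rng.randn(n_input, n_hidden) * 0.1

x = rng.rand(n_input)

# Corrupt the input by randomly zeroing ~30% of its entries.
keep = rng.binomial(1, 0.7, size=x.shape)
x_tilde = x * keep

h = sigmoid(x_tilde @ W)   # encode the corrupted input
x_rec = sigmoid(h @ W.T)   # reconstruct from the corrupted input

# Crucially, the loss compares the reconstruction to the CLEAN x,
# not to x_tilde -- the model must learn to fill in what was removed.
loss = np.sum((x - x_rec) ** 2)
```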

A practical advantage of auto-encoder variants is that they define a simple, tractable optimization objective that can be used to monitor progress.

Deep learning is currently a very active research topic. Many problems stand in the way of reaching more general AI-level performance:

*Scaling computations* - the more complex the input space (such as harder AI problems), the larger the deep networks have to be to capture its representation. These computations scale much worse than linearly, and current research in parallelizing the training algorithms and creating convolutional architectures is meant to make these algorithms useful in practice. In a convolutional architecture, a hidden unit's output does not feed into every hidden unit in the next layer; units connect only to other hidden units within the same spatial area. Further, there are so many hyper-parameters for these algorithms (number of layers, hidden units, nonlinear functions, training procedures) that choosing them is considered an 'art'.
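The difference between full and local connectivity can be sketched with a 1-D numpy example (the sizes and the shared 3-wide filter are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.RandomState(3)
x = rng.rand(10)  # 10 hidden-unit outputs from the previous layer

# Fully connected: every unit in the next layer sees all 10 inputs.
W_full = rng.randn(10, 10)
full = x @ W_full             # 10 x 10 = 100 weights

# Convolutional/local: each output unit only sees a window of 3
# neighbouring inputs, and the same small filter is shared everywhere.
w_local = rng.randn(3)        # just 3 weights total
local = np.array([x[i:i + 3] @ w_local for i in range(len(x) - 2)])
print(local.shape)  # (8,)
```

The local version has far fewer parameters (3 vs. 100 here), which is exactly how convolutional restrictions make large networks computationally feasible.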

*Optimization* - as the input datasets grow larger and larger (growing faster than the size of the models), training error and generalization error converge. Optimization difficulty during training of deep architectures comes from both finding local minima and having ill-conditioning (the two main types of optimization difficulties in continuous optimization problems). Better optimization can have an impact on scaling computations, and is interesting to study to obtain better generalization of the algorithms. Layer-wise pretraining has helped immensely in recent years for optimization during training deep architectures.

*Inference and sampling* - all probabilistic models except for the RBM require a non-trivial form of inference (guessing values of the latent variables *h* given the conditional distribution over *x*). Inference and sampling techniques can be slow during training as well as have difficulties since the distributions can be incredibly complex and often have a very large number of modes.

*Disentangling* - finding the 'underlying factors' that explain the input data. Complex input data arise from the interaction of many interrelated sources - such as lights casting shadows, object material properties, etc. for image recognition. This would allow for very powerful cross-task learning, leading to a representation that can 'zoom in' on the relevant features in the learned representation given the current problem. Disentanglement is the most ambitious challenge presented so far, as well as the one with the most far-reaching impact towards more general AI.

- Deep learning is about creating an abstract hierarchical representation of the input data to create useful features for traditional machine learning algorithms. Each layer in the hierarchy learns a more abstract and complex feature of the data, such as edges to eyes to faces.
- This representation gets its power of abstraction by stacking nonlinear functions, where the output of one layer becomes the input to the next.
- The two main schools of thought for analyzing deep architectures are *probabilistic* vs. *direct encoding*.
- The probabilistic interpretation means that each layer defines a distribution of hidden units given the observed input, *P(h | x)*.
- The direct encoding interpretation learns two separate functions - the *encoder* and *decoder* - to transform the observed input to the feature space and then back to the observed space.
- These architectures have had great commercial success so far, powering many natural language processing and image recognition tasks at companies like Google and Microsoft.

If you would like to learn more about the subject, check out this awesome hub or Bengio's page!

If you are inclined to use deep learning in code, Theano is an amazing open-source Python package developed by Bengio's LISA group at the University of Montreal.

Update: HN discussion

*build something you love*

- Choosing your cofounders is incredibly important, as you will be a huge part of each other's lives for a long time.
- Success as a company only makes things harder - the most fun and least stressful time is ironically when Evernote could have gone out of business at any minute. It is not fun day-to-day, but it is fun looking back month-to-month.
- 5 years ago, it would have been stupid advice to say 'build something for yourself'; today, it would be stupid not to. If you like something, chances are 10 million people somewhere else are as weird as you and will like it too. Building something you want gives you the unique advantage of knowing when it is truly great or not. Make something sufficiently epic to be your life's work.

*create your own algorithm for success*

- Ultimately, building a company is about the feedback loop - you make something, see what people like/don't like, and adjust.
- A good way to test potential board members is to have a mock board meeting; you might be surprised with the results. Having people who ask the right questions is more valuable than people who give the best answers.
- Most successful entrepreneurs don't follow some fixed algorithm of previous success; they make one to their own advantage. The best example of this is Elon Musk with Tesla, bypassing the traditional success algorithm of auto makers and cutting out the middleman.

*look for a rifle focus on product and decisiveness*

- A successful entrepreneur is someone with a rifle focus on the product and is very decisive. Hire fast and fire fast.
- Always follow the metrics - if users like your product, it is a good product.
- Get one term sheet asap and use it as leverage to speed up the process with other investors.

*know a secret*

- Want to be in the sweet spot of a good idea that seems like a bad idea. Challenge a social norm. Powerful people will dismiss it as a toy or hobby.
- Successful startups usually come from unbundling functions done by others (newspapers, schools), hobbies, and challenging social norms.
- The best ideas come from direct experience and domain expertise.

*have a big vision*

- If you think of a good reason for why someone should help you, they usually will help you (like Dell did with VMWare).
- Bringing something to market that a few people love is much better than a product that many people like.
- Having deadlines helps with everything.

*voice vs. exit*

- To make change, you decide to either use your voice to create it from within or exit to create it externally.
- Exit is the reason voice has strength (like a BATNA in negotiations)
- Exit normally amplifies voice because it is radical.

*find something to work on that you care about more than yourself*

- Do well and do good.
- Making the world smaller = making the world better.
- Think about the long-term effects of actions - sometimes things that help people on day 1 hurt people on day 365. For example, humanitarian efforts that gave away rice in Haiti put all the local rice farmers out of business.

*actively make yourself a better person*

- Keep a list for everything you are doing and notes for people - it is a great memory device.
- Have a list of do's and don'ts that you look at multiple times a day. Create habits to make yourself better.
- Act professional to be professional and think about the small details - tuck in your shirt, stand up straight, be present.

*don't avoid mistakes because you can't - just keep the determination to push through*

- There is no secret sauce to why he was able to make Facebook - just a ton of determination, so much so that they would have lockdowns when they felt someone else was doing something better.
- The vision from the beginning was connecting *everyone* in the world.
- When hiring someone, ask the question: would I want to work for them?

*making a startup is like training to go to the Olympics*

- You are going to fail more times than you succeed - perseverance and confidence are key. Be a 'cockroach'.
- You can always pivot an idea; you can't really pivot your partners. Take time choosing who you work with.
- Give it 100% before you quit.

This blog’s main purpose, however, is going to be my thesis research in deep learning as well as some thoughts on entrepreneurship. Hopefully we can have a dialogue :)

Sebastian Thrun and me at the AAAI '12 conference!
