12b: Deep Neural Nets

Image recognition by a deep neural net

Convolution: a neuron looks for patterns in a small portion (10×10 px) of an image (256×256 px), the process is repeated by moving this small area little by litte.

Pooling: The result of the convolution is computed as a point for each portion analyzed. By a similar step by step process, a small set of points are computed into values by choosing the maximum value (“max pooling”).

By reproducing the pooling process multiple times (100x), and feeding it to a neural net, it will compute how likely the initial image is recognized as a known category.


A small number of neurons (~2), the “hidden layer“, a bottleneck of neurons between two columns of multiple neurons (~10) is used to obtain output values z[n] that are the same as input values x[n].

Such results implies that a form a generalization is accomplished by the hidden layer, or rather, a form of encoded generalization, as the actual parameters of the bottleneck of neurons seems not so obvious to understand.

Final layer of neurons

As the neural net is trained with parameters and thresholds, the shape and corresponding equation of the sigmoid function is adapted to properly sort positive and negative results, by maximizing the probability of sorting examples properly.


Instead of sorting by the maximum value and the corresponding category, the final output is an array of the most probable categories (~5 categories).


The problems of neural nets is that they can get blocked in local maximum areas. To prevent this, at each computation, one neuron is deactivated to check if its behavior is skewing the neural net. At each new computation another is shut down, or dropped out, to check all neurons.

Thanks to wider neural networks, neural nets can avoid being jammed into local maximum as they can analyze local maximum through more parameters.

See also: 

Boltzmann machine

12a: Neural Nets

Modeling biological neurons

Neural nets are modeled upon real biological neurons, which have the following characteristics:

  1. All or none: the input from each entry in the neural net is either O or 1, the output is also 0 or 1
  2. Cumulative influence: the influence of various input neurons accumulates to produce the final output result, if a certain threshold is reached
  3. Synaptic weight: each input neuron is given a weight depending on its importance

Characteristics that are not modeled

Several characteristics of biological neurons are not modeled in neural nets, as their purpose is not clearly understood in brain information processing.

  • Real neurons also have a refractory period when they do not respond to stimuli after firing.
  • In a real neuron, the resulting output can go to either one (of a few) axonal bifurcations.
  • The timing of the various inputs in the dendritic tree is not well understood in the resulting pulse in the axon.

Neural net principles

In a neural net, the resulting vector z, is a function of the different inputs x[n], the weights w[n] and the thresholds t[n]. A neural net is therefore a function approximator.

By comparing a known desired results vector (such as the content of a picture) with the actual output vector, a performance function can be determined to know how well the neural net is performing.

To simplify the performance function, thresholds are eliminated by adding an extra weight w[0] that nullify the threshold, and the step function resulting in {0, 1} values is smoothed to a sigmoid function resulting in the [0, 1] interval.


Backpropagation is the name of the algorithm generally used to train a neural net.

Varying the weights little by little and with a certain randomization allows the performance function to measure if progress is being made or not, and to improve the weighing accordingly to progress towards an optimal performance.

The amount of computation of the performance function is linearly increased by the depth and squared by the width of the net.

11. Learning: Identification Trees, Disorder

Identification Trees


  • Non-numeric data
  • Not all characteristics matter
  • Some do matter but not all of the time
  • Cost: certain tests may be more expensive than others

Occam’s razor

The objective is to build the smallest tree possible (to reduce costs and computation) and because the simplest explanation is always the best.

Testing data

Small data sets

The different tests upon the data can be ranked by the homogeneous groups it produces and the total number of items in each homogeneous group.

Using the most efficient tests first, the remaining ambiguous data is checked through the other tests and so until until all the data is sorted.

Large data sets

For large data sets, no tests may divide the data in homogeneous group. The results of tests must therefore be ranked according to their level of disorder.

If P = Positive results, N = negative results, T = total

Disorder (Set) = – P/T log[2] ( P/T )  – N/T log[2] ( N/T )

The resulting curve of this equation is a parabolic curve with max at y = 1 for x = 1/2 and min at y = 0 for x = {0,1}

So the quality of each test can be defined as follows:

Quality (Test) = Sum[for each set produced] ( Disorder (Set) ) * Number of samples in set / Number of samples handed by test

Decision boundaries

Contrary to nearest neighbors, identification tests always separate the data space in two equal parts parallel to the space axis.

10. Introduction to Learning, Nearest Neighbors


Regularity: “Bulldozer computing”

  • Nearest neighbors: pattern recognition
  • Neural nets: mimic biology
  • Boosting: theory

Constraints: human-like learning

  • One-shot
  • Explanation-based learning

Nearest neighbor

A detection mechanism generates a vector of features. These features are converted in a vector of values that is compared to a library of possibilities to find the closest match in order to recognize patterns, objects, etc.

Method: Standard objects (ex: electric covers) are positioned in a space according to recognizable characteristics (ex: size, hole size). Decision boundaries between the standards are then established in the space to define areas of attribution to a nearest neighbor. Objects are then sorted according to which area the belong to.

Another method of sorting objects (ex: newspaper articles) to recognize could be to compare the angle of their vectors in the space with the vectors of the standard objects.

In the case of a robotic arm, instead of solving equations of angles at the joints which cannot be implemented in real life, due to friction and wear, a table of values is gathered at each position during a learning phase. During the working phase, the closest set of values from the table is used to complete the task.


  • Spread: Values can be concentrated in the space making it difficult to discern objects. Solution: norm the data using statistical analysis.
  • Subject matter: make sure to measure values that do actually make a difference between objects, not values that generate confusing results.
  • Relevancy: use data that is relevant to the matter at hand, not just any data that is independent from the target results.