14. Learning: Sparse Spaces, Phonology

Visual recognition to sound production

Sounds can be classified in a table of the 14 distinctive features of each sound, notably whether they are "voiced" or not. With such a matrix of sounds, a "speaking machine" can be built to produce words and sentences.

Example: a machine pronouncing plurals

The machine proceeds from recognition through different characteristics of words (verbs, nouns, plurals…) and constraints (pronunciation of the plural) to produce sounds. Its input comes from visual recognition and its output is sounds mimicking a voice, producing the right plural ending: "z", "s" or "e-z".

  1. The vision apparatus recognizes objects
  2. The information flows to the meaning register, which labels the objects, and through the characteristics register, which determines whether they are plural or not
  3. The information from the meaning register also flows through the characteristics register to determine whether the words are nouns or verbs, and to a buffer which stores the beginning letters of the word while waiting to learn whether a "z", "s" or "e-z" plural sound needs to be added.
  4. The buffered letters move on to the pronunciation apparatus, waiting to be processed by the plural constraint.
  5. The word is sent to the plural constraint processor.
  6. If the word is plural and its last sound is voiced, the constraint adds a "z" sound; if unvoiced, an "s" sound is added. If the word is not plural, nothing is added, ending the word's pronunciation.
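The voicing rule in step 6 can be sketched as a small function. The voiced-letter and sibilant sets below are rough, letter-based assumptions for illustration, not the real 14-feature phoneme classification:

```python
# Sketch of step 6: choose the plural ending from the word's final sound.
# VOICED and SIBILANTS are rough, letter-based stand-ins (assumptions),
# not the actual 14-feature phoneme matrix.

VOICED = set("bdgvzmnlrwy") | set("aeiou")   # letters that usually end voiced
SIBILANTS = set("szx")                       # endings that take the "e-z" sound

def plural_suffix(word: str, is_plural: bool) -> str:
    """Return the sound added to pronounce the plural, if any."""
    if not is_plural:
        return ""              # not plural: nothing is added
    last = word[-1].lower()
    if last in SIBILANTS:
        return "e-z"           # e.g. "foxes"
    if last in VOICED:
        return "z"             # voiced final sound, e.g. "dogs"
    return "s"                 # unvoiced final sound, e.g. "cats"
```

For example, `plural_suffix("dog", True)` yields `"z"` while `plural_suffix("cat", True)` yields `"s"`.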

Such a machine is called a propagator because its connections are reversible: by hearing a plural word, the machine can infer that there are many of the objects to be seen by the visual apparatus.

Building phonological algorithm

  1. Collect positive (+) and negative (–) examples
  2. Pick a positive example as a seed
  3. Generalize by discarding the information that does not differentiate the + and – examples
  4. Quit when generalizing further would cover a negative example; otherwise go back to step 3 and generalize more
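The loop above can be sketched as follows, assuming examples are encoded as tuples of binary features and `None` marks a "don't care" position; the feature layout and the data are hypothetical:

```python
# Sketch of the seed-and-generalize loop. Examples are tuples of binary
# features; None in the pattern means "don't care". The three features and
# the example data are made up for illustration.

def covers(pattern, example):
    """A pattern covers an example if every non-wildcard feature matches."""
    return all(p is None or p == e for p, e in zip(pattern, example))

def generalize(seed, negatives):
    """Wildcard one feature at a time, unless that would cover a negative."""
    pattern = list(seed)
    for i in range(len(pattern)):
        trial = pattern[:]
        trial[i] = None                          # discard this feature
        if not any(covers(trial, n) for n in negatives):
            pattern = trial                      # still excludes all negatives
    return tuple(pattern)

seed = (1, 0, 1)                                 # a positive example
negatives = [(0, 0, 1), (0, 1, 0)]
generalize(seed, negatives)                      # (1, None, None)
```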

Sparse space

The plural machine works because, in a high-dimensional sparse space, a hyperplane can easily separate two sets of examples; the phoneme features constitute such a 14-dimensional space.

Generalizing to problem solving with AI

  1. Define the problem
  2. Devise a representation suiting the problem
  3. Determine an approach / method
  4. Pick a mechanism or devise an algorithm
  5. Experiment

Trouble ensues when people focus on a specific mechanism rather than on the problem at hand.

The right representation does the following:

  • it makes the problem explicit
  • it exposes constraints
  • it generates localness to solve the problem: the answer can be pinpointed rather than gathered from various sources, areas, places, etc.

13. Learning: Genetic Algorithms

Mimicking chromosomes

Reproduction loop

Strings of 0s and 1s simulate chromosomes in a binary system. An initial population of chromosomes is allowed to mutate (modification of certain bits in the strings), to cross over with other strings during replication, or to remain the same.

Mutation: enables a step-by-step modification of chromosomes, allowing hill climbing to improve fitness little by little

Cross over: enables the aggregation of the best characteristics from different chromosomes into a new one

The chromosome's "genotype" encodes the phenotype of the corresponding individual (animal, person, program…), which is expressed as its fitness to the environment. Fitness provides a probability of survival to the next generation through a selection process.

Through the alteration of chromosomes (mutation, cross-over, or remaining the same) and selection by fitness before reproduction, chromosomes can evolve from one generation to the next.
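The two alteration operators can be sketched on bit strings as follows; the mutation rate and the random crossover point are illustrative choices:

```python
# Sketch of the two alteration operators on bit-string chromosomes; the
# mutation rate and the random crossover point are illustrative.
import random

def mutate(chrom: str, rate: float = 0.1) -> str:
    """Flip each bit with probability `rate` (small hill-climbing steps)."""
    return "".join("10"[int(b)] if random.random() < rate else b for b in chrom)

def crossover(a: str, b: str) -> str:
    """Splice the beginning of one chromosome onto the end of another."""
    point = random.randrange(1, len(a))   # cut somewhere strictly inside
    return a[:point] + b[point:]

random.seed(0)                            # reproducible demo
child = crossover(mutate("11110000"), mutate("00001111"))
```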

This process of selection and reproduction is bound to get stuck in local maxima, but it allows for a lot of intervention by the algorithm designer to circumvent such problems.

Defining the probability of survival

By fitness

Probability of survival = Fitness / Sum of fitness of all chromosomes
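The formula above, transcribed directly; `random.choices` then applies the same fitness weighting to pick a survivor (the fitness values are illustrative):

```python
# Fitness-proportional survival: probability = fitness / total fitness.
# random.choices applies the same weighting to pick a survivor.
import random

def survival_probabilities(fitnesses):
    total = sum(fitnesses)
    return [f / total for f in fitnesses]

probs = survival_probabilities([4.0, 3.0, 2.0, 1.0])   # [0.4, 0.3, 0.2, 0.1]

random.seed(1)
survivor = random.choices(["a", "b", "c", "d"], weights=[4.0, 3.0, 2.0, 1.0])[0]
```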

Rank Space

The probability of survival of the highest-ranking individual is defined as a constant P[c]. If this individual is not selected, the second highest-ranking individual is considered, with probability (1-P[c]) P[c], and so on:

  • P[1] = P[c]
  • P[2] = (1-P[c]) P[c]
  • P[3] = (1-P[c])^2 P[c]
  • P[n-1] = (1-P[c])^(n-2) P[c]
  • P[n] = (1-P[c])^(n-1)
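The probabilities above, computed for n ranked individuals; note that the last rank P[n] takes the leftover mass, so the distribution sums to exactly 1:

```python
# Rank-space survival probabilities for n ranked individuals; the last
# rank takes the leftover probability so the distribution sums to 1.

def rank_space_probs(n: int, pc: float):
    probs = [(1 - pc) ** (i - 1) * pc for i in range(1, n)]   # P[1]..P[n-1]
    probs.append((1 - pc) ** (n - 1))                         # P[n]
    return probs

rank_space_probs(4, 0.5)   # [0.5, 0.25, 0.125, 0.125]
```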

By fitness and diversity

Both the most fit and the most diverse individuals are retained for reproduction in the next generation. Diversity makes genetic algorithms much more effective and keeps them from getting stuck in local maxima.

Issues with genetic algorithms

The transition from genotype to phenotype is still not well understood.

The effectiveness of a genetic algorithm may lie in its creator's ability to program it with the right parameters rather than in the algorithm itself. It may also come from the richness of the environment, as other types of algorithms may do as well or better.

Applications

Crossing over can combine good options (beginnings, ends) from different individuals, which is useful when planning a series of steps: combining different plans and different series of steps can result in a better plan.

Combining different rules from a rule-based expert system can work the same way, for example in horse-racing prediction.

Genetic algorithms may also be used to encode "creatures": their size, number of parts, articulations or methods of operation…

12b: Deep Neural Nets

Image recognition by a deep neural net

Convolution: a neuron looks for patterns in a small portion (10×10 px) of an image (256×256 px); the process is repeated by moving this small window little by little.

Pooling: the result of the convolution is computed as a point for each portion analyzed. By a similar step-by-step process, each small set of points is reduced to a single value, typically by choosing the maximum value ("max pooling").

By repeating this convolution-and-pooling process many times (~100x) and feeding the result to a neural net, the net computes how likely the initial image is to belong to a known category.
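Max pooling can be sketched as follows, using a toy 4×4 grid and a 2×2 window with stride 2 rather than the image sizes quoted above:

```python
# Sketch of max pooling: slide a 2x2 window with stride 2 over a grid of
# convolution outputs and keep each window's maximum. The 4x4 grid is a
# toy stand-in for the image sizes quoted above.

def max_pool(grid, size=2):
    """Reduce each size x size block of the grid to its maximum value."""
    return [
        [max(grid[r + dr][c + dc] for dr in range(size) for dc in range(size))
         for c in range(0, len(grid[0]), size)]
        for r in range(0, len(grid), size)
    ]

feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]]
max_pool(feature_map)   # [[4, 2], [2, 8]]
```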

Autocoding

A small number of neurons (~2), the "hidden layer", forms a bottleneck between two layers of many neurons (~10); the net is trained so that the output values z[n] are the same as the input values x[n].

Such results imply that a form of generalization is accomplished by the hidden layer, or rather a form of encoded generalization, as the actual parameters of the bottleneck neurons are not obvious to interpret.

Final layer of neurons

As the neural net is trained, its parameters and thresholds adapt the shape (and corresponding equation) of the sigmoid function to properly sort positive and negative results, maximizing the probability of classifying the examples correctly.

Softmax

Instead of outputting only the maximum value and its corresponding category, the final output is an array of the most probable categories (~5 categories).
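A sketch of the softmax computation, using the standard subtract-the-max trick for numerical stability; the scores are made up:

```python
# Softmax over the final layer's raw scores; subtracting the max before
# exponentiating avoids overflow without changing the result.
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])        # probabilities summing to 1
ranked = sorted(range(len(probs)), key=lambda i: -probs[i])[:5]  # top ~5
```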

Dropout

One problem with neural nets is that they can get stuck in local maxima. To prevent this, at each computation one neuron is deactivated, or dropped out, to check whether its behavior is skewing the net; at each new computation another neuron is shut down, until all neurons have been checked.
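The one-neuron-at-a-time scheme described above can be sketched as a simple masking function over a layer's activations (the values are illustrative; in common practice a random fraction of neurons is dropped instead):

```python
# Sketch of the one-at-a-time dropout scheme: zero out a single neuron's
# activation per pass. Activation values are illustrative assumptions.

def drop_one(activations, index):
    """Deactivate the neuron at `index` for this computation."""
    return [0.0 if i == index else a for i, a in enumerate(activations)]

activations = [0.9, 0.4, 0.7]
masks = [drop_one(activations, i) for i in range(len(activations))]
```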

Thanks to wider networks, neural nets can avoid being jammed into local maxima, as more parameters let them work their way around such maxima.

See also: 

Boltzmann machine

12a: Neural Nets

Modeling biological neurons

Neural nets are modeled upon real biological neurons, which have the following characteristics:

  1. All or none: the input from each entry in the neural net is either 0 or 1, and the output is also 0 or 1
  2. Cumulative influence: the influence of various input neurons accumulates to produce the final output result, if a certain threshold is reached
  3. Synaptic weight: each input neuron is given a weight depending on its importance

Characteristics that are not modeled

Several characteristics of biological neurons are not modeled in neural nets, as their purpose is not clearly understood in brain information processing.

  • Real neurons also have a refractory period when they do not respond to stimuli after firing.
  • In a real neuron, the resulting output can go down any one of several axonal bifurcations.
  • How the timing of the various inputs in the dendritic tree shapes the resulting pulse in the axon is not well understood.

Neural net principles

In a neural net, the resulting vector z is a function of the different inputs x[n], the weights w[n] and the thresholds t[n]. A neural net is therefore a function approximator.

By comparing a vector of known desired results (such as the content of a picture) with the actual output vector, a performance function can be defined to measure how well the neural net is performing.

To simplify the performance function, thresholds are eliminated by adding an extra weight w[0] that nullifies the threshold, and the step function producing {0, 1} values is smoothed into a sigmoid function producing values in the [0, 1] interval.
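Both simplifications can be sketched as follows: the threshold t is folded into an extra weight w[0] = -t on a constant input of 1, and the hard step is replaced by a sigmoid (weights, inputs and threshold are illustrative):

```python
# Sketch of the threshold-elimination trick plus sigmoid smoothing.
# The weights, inputs and threshold below are illustrative values.
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def neuron(xs, ws, t):
    # w0 = -t on a constant input 1 replaces the explicit threshold test
    total = -t * 1.0 + sum(x * w for x, w in zip(xs, ws))
    return sigmoid(total)            # output in (0, 1), not a hard {0, 1}

neuron([1.0, 0.0], [2.0, 1.0], t=1.0)   # sigmoid(1.0) ≈ 0.73
```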

Backpropagation

Backpropagation is the name of the algorithm generally used to train a neural net.

Varying the weights little by little, with a certain randomization, allows the performance function to measure whether progress is being made, and the weights to be adjusted accordingly to progress toward optimal performance.

The amount of computation of the performance function grows linearly with the depth of the net and quadratically with its width.

11. Learning: Identification Trees, Disorder

Identification Trees

Problems

  • Non-numeric data
  • Not all characteristics matter
  • Some do matter but not all of the time
  • Cost: certain tests may be more expensive than others

Occam’s razor

The objective is to build the smallest tree possible, both to reduce cost and computation and because the simplest explanation is often the best.

Testing data

Small data sets

The different tests on the data can be ranked by the homogeneous groups they produce and by the total number of items in those homogeneous groups.

Using the most efficient tests first, the remaining ambiguous data is checked through the other tests, and so on until all the data is sorted.

Large data sets

For large data sets, no test may divide the data into homogeneous groups. Test results must therefore be ranked according to their level of disorder.

With P = number of positive results, N = number of negative results, T = total:

Disorder(Set) = – (P/T) log[2](P/T) – (N/T) log[2](N/T)

The resulting curve peaks at y = 1 for x = P/T = 1/2 and reaches its minimum, y = 0, at x = 0 and x = 1.

So the quality of each test can be defined as follows:

Quality(Test) = Sum[for each set produced] ( Disorder(Set) × Number of samples in set / Number of samples handled by the test )
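Both formulas can be transcribed directly, with each set given as (positive, negative) counts and the 0·log(0) terms taken as 0 by convention:

```python
# Disorder and test quality, transcribed from the formulas above; each set
# is given as (P, N) counts, and 0 * log2(0) is taken as 0 by convention.
import math

def disorder(p, n):
    t = p + n
    result = 0.0
    for count in (p, n):
        if count:
            result -= count / t * math.log2(count / t)
    return result

def quality(sets):
    """Weighted sum of disorders; lower values indicate a better test."""
    total = sum(p + n for p, n in sets)
    return sum(disorder(p, n) * (p + n) / total for p, n in sets)

disorder(1, 1)               # 1.0: maximum disorder at a 50/50 split
disorder(4, 0)               # 0.0: a homogeneous set
quality([(2, 0), (1, 1)])    # 0.5: one clean set, one maximally mixed
```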

Decision boundaries

Contrary to nearest-neighbor methods, identification trees always divide the data space with decision boundaries parallel to the axes.