17. Learning: Boosting

“Wisdom of a weighted crowd of experts”

Classifiers

Classifiers are tests that produce binary choices about samples. They are considered strong classifiers if their error rate is close to 0,  weak classifiers if their error rate is close to 0.5.

By using multiple classifiers with different weights, data samples can be sorted or grouped according to different characteristics.

Decision tree stumps

Aside from classifiers, a decision tree can be used to sort positive and negative samples in a 2-dimension space. By adding weights to different tests, some samples can be emphasized over the others. The total sum of weights must always be constrained to 1 to ensure a proper distribution of samples.

Dividing the space

By minimizing the error rate of the tests from the weights, the algorithm can cut the space to sort positive and negative examples.

No over fitting

Boosting algorithms seems not to be over fitting, as the decision tree stumps tends to be very tightly close to outlying samples, only excluding them from the space.

Decision boundaries

Separating positive and negative example with a straight line that is as far as possible from both positive and negative examples, a median that maximizes the space between positive and negative examples.

Constraints are applied to build a support vector (u) and define a constant b that allow to sort positive examples from negative ones. The width of a “street” between the positive and negative values is maximized.

Going through the algebra, the resulting equation show that the optimization depends only on the dot product of pair of samples.

The decision rule that defines if a sample is positive or negative only depends on the dot product of the sample vector and the unknown vector.

No local maximum

Such support vector algorithm can be proven to be evolving in a convex space, meaning that it will never be blocked at a local maximum.

Non linearity

The algorithm cannot find a median between data which cannot be linearly separable. A transformation can however be applied to the space to reorganize the samples so that they can be linearly separable. Certain transformations can however create an over fitting model that becomes useless by only sorting the example data.

One-shot learning

Learning in human-like way, in one shot: learning something definite from each example.

The evolving model

Comparing an initial model example, a seed, with a near miss or another example, the evolving model understands an important characteristic for each new near miss or example compared.

The evolving model develops a set of heuristics to describe the seed, specializing with near misses (reducing the potential matches) or generalizing with examples (broadening the potential matches) the characteristics of the seed.

• Require link heuristic: specialization
• Forbid link heuristic: specialization
• Extend set heuristic: generalization
• Drop link heuristic: generalization
• Climb tree heuristic: generalization

Felicity conditions

The teacher and learner must know about each other to achieve the best learning. The learner must talk to himself to understand what he is doing.

How to package ideas better

To better communicate ideas to others in order to achieve better results, the following 5 characteristics makes communication more effective.

• Symbol: ease to remember the idea
• Slogan: focus the idea
• Surprise: catch the attention
• Salient: one thing to stand out
• Story: helps transmission to people

Visual recognition to sound production

Words can be classified in tables representing the 14 features of each sound, notably if they are “voiced” or not. With such a matrix of sounds, a “speaking machine” can be built to produce words and sentences.

Example: a machine pronouncing plurals

The machine processes from recognition through different characteristics of words (verbs, nouns, plurals…) and constraints (pronunciation of plural) to produce sounds. The machine’s input comes from visual recognition and its output is sounds to mimic a voice to produce the right sound, “z“, “s” or “e-z“.

1. The vision apparatus recognizes objects
2. The information flows to the meaning register, labeling the objects, as well as through characteristics register to define if they are plural or not
3. The information from the meaning register also flows through the characteristics register to define if they are nouns or verbs, and to a buffer which stores the beginning of the word letters while waiting to know if a “z” or “s” or “e-z” plural sound needs to be added.
4. The buffer letters moves to be processed by the pronunciation apparatus, waiting to be processed by the plural constraint.
5. The word is send to the plural constraint processor.
6. If the word is plural and the last letter is voiced, the constraint adds a “z” sound, if not voiced an “s” sound is added. If the word is not plural, nothing is added, ending the word’s pronunciation.

Such machine is called a propagator as connections are reversible: by hearing a plural word, the machine can figure where are many of the objects to be seen by the visual apparatus.

Building phonological algorithm

1. Collect + and – examples
2. Pick and seed positive
3. Generalize by churning useless information where + and – examples are not differentiated
4. Quit when cover a negative example, or go back to step 3 and generalize more

Sparse space

The plural machine works because in a high dimensional sparse space, a hyperplane can easily separate two sets of examples, as phonemes constitutes a 14-dimension space.

Generalizing to problem solving with AI

1. Define the problem
2. Devise a representation suiting the problem
3. Determine an approach / method
4. Pick a mechanism or devise an algorithm
5. Experiment

Troubles ensues when people focus on a specific mechanism rather the problem at hand.

The right representation does the following:

• it makes the problem explicit
• it exposes constraint
• it generates localness to solve the problem: the answer can be pinpointed rather than gathered from various sources, areas, places, etc.

Mimicking chromosomes

Reproduction loop

Strings of 0 and 1 simulate chromosomes in binary systems. An initial population of chromosomes is allowed to mutate (modification of certain bit in the strings), cross over with other strings during replication, or to remain the same.

Mutation: enables a step by step modification of chromosomes, allowing hill-climbing to improve fitness little by little

Cross over: enables the aggregation of the best characteristics from different chromosomes into a new one

The chromosomes “genotype”, encodes the phenotype of corresponding individuals (animals, peoples, programs…), which is represented by its fitness to the environment. The fitness provides a probability of survival to the next generation through a selection process.

Through the alteration (mutation, cross over, remain) of chromosomes and selection (fitness) before reproduction, chromosomes can evolve from one generation to the next.

This process of selection and reproduction is bound to get blocked in local maxima, but allows for a lot of intervention by the algorithm designer to circumvent such problems.

Defining the probability of survival

By fitness

Probability of survival = Fitness / Sum of fitness of all chromosomes

Rank Space

The probability of survival of the highest ranking individual is defined as a constant P[c]. If this highest ranking individual is not reaching the selection threshold, then the second highest ranking individual is checked, with a probability of (1-P[c]) P[c] and so on:

• P[1] = P[c]
• P[2] = (1-P[c]) P[c]
• P[3] = (1-P[c])^2 P[c]
• P[n-1] = (1-P[c])^(n-2) P[c]
• P[n] = (1-P[c])^(n-1)

By fitness and diversity

Both the most fit and most diverse individuals are retained for reproduction in the next generation. Diversity allows genetic algorithms to be much more effective and to not get blocked in local maxima.

Issues with genetic algorithm

The transition of genotype to phenotype is still not well understood.

The effectiveness of a genetic algorithm may lie in its creator’s ability to program it with the right parameters rather than the algorithm itself. It may also come from the richness of the environment, as other types of algorithm may do as well or better.

Applications

The combination of different good options from crossing over (beginning, end) can be useful in planning a series of steps. Combining different plans and different series of steps can result in a better plan.

Combining different rules from a rule-based expert system, such as horse-racing prediction.

Genetic algorithm may be used to encode “creatures“: their size, number of parts, articulations or method of operations…