With a joint probability table, recording the tally of each combination of events allows us to measure the probability of each event, conditional or unconditional probabilities, the independence of events, etc.
The problem with such a table is that as the number of variables increases, the number of rows in the table grows exponentially.
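As a toy illustration, such a table over two binary events can be stored as a tally of the four combinations; the counts below are made-up numbers, not from the notes:

```python
# Tally of each combination of two binary events a and b.
# The counts are illustrative assumptions.
tally = {
    (True, True): 30,   # a and b both occurred
    (True, False): 10,  # a occurred, b did not
    (False, True): 20,
    (False, False): 40,
}
total = sum(tally.values())

def p(a=None, b=None):
    """Probability of the outcomes matching the given values (None = any)."""
    return sum(count for (va, vb), count in tally.items()
               if (a is None or va == a) and (b is None or vb == b)) / total

p_a = p(a=True)                              # unconditional: P(a) = 40/100
p_a_given_b = p(a=True, b=True) / p(b=True)  # conditional: P(a,b)/P(b) = 30/50
print(p_a, p_a_given_b)
```

With n binary variables the table needs 2^n rows, which is the exponential growth mentioned above.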
Reminders of probability formulas
Basic axioms of probability
0 ≤ P(a) ≤ 1
P(True) = 1 ; P(False) = 0
P(a+b) = P(a) + P(b) – P(a,b)
Basic definitions of probability
P(a/b) = P(a,b) / P(b)
P(a,b) = P(a/b) P(b)
P(a,b,c) = P(a/b,c) P(b,c) = P(a/b,c) P(b/c) P(c)
Chain rule of probability
By generalizing the previous formula, we obtain the following chain rule:
P(x1, …, xn) = P(xn / xn-1, …, x1) … P(x2 / x1) P(x1)
P(a/b) = P(a) if a and b are independent
If a and b are conditionally independent given z
P(a/b,z) = P(a/z)
P(a,b/z) = P(a/z) P(b/z)
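These two identities can be checked numerically. A minimal sketch, assuming made-up conditional tables: build a joint distribution in which a and b are independent given z, then verify that P(a/b,z) = P(a/z):

```python
# Conditional independence check; all numbers are illustrative assumptions.
p_z = {0: 0.4, 1: 0.6}
p_a_given_z = {0: 0.3, 1: 0.8}   # P(a=1 / z)
p_b_given_z = {0: 0.5, 1: 0.2}   # P(b=1 / z)

# Build the joint distribution as P(z) * P(a/z) * P(b/z).
joint = {}
for z in (0, 1):
    for a in (0, 1):
        for b in (0, 1):
            pa = p_a_given_z[z] if a else 1 - p_a_given_z[z]
            pb = p_b_given_z[z] if b else 1 - p_b_given_z[z]
            joint[(a, b, z)] = p_z[z] * pa * pb

def p(**fixed):
    """Marginal probability of the given variable assignments."""
    idx = {"a": 0, "b": 1, "z": 2}
    return sum(pr for vals, pr in joint.items()
               if all(vals[idx[k]] == v for k, v in fixed.items()))

lhs = p(a=1, b=1, z=1) / p(b=1, z=1)   # P(a / b, z)
rhs = p(a=1, z=1) / p(z=1)             # P(a / z)
print(lhs, rhs)
```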
Causal relations between events can be represented in nets (Bayesian networks). These models highlight that each event depends only on its parents: given its parents, a node is conditionally independent of its other non-descendants. By recording the conditional probabilities at each node, the number of tables and rows is significantly smaller than in a general table tallying all combinations of events.
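A minimal sketch of such a net (the rain/sprinkler/wet structure and all numbers are illustrative assumptions, not from the notes): each node stores only a table conditioned on its parents, and the full joint probability is recovered by multiplying the node tables:

```python
# Network: rain -> sprinkler, rain -> wet, sprinkler -> wet.
p_rain = {True: 0.2, False: 0.8}
p_sprinkler = {True: {True: 0.01, False: 0.99},    # P(sprinkler / rain)
               False: {True: 0.40, False: 0.60}}
p_wet = {(True, True): 0.99, (True, False): 0.90,  # P(wet=True / rain, sprinkler)
         (False, True): 0.80, (False, False): 0.05}

def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet) = P(rain) P(sprinkler/rain) P(wet/rain,sprinkler)."""
    pw = p_wet[(rain, sprinkler)]
    return (p_rain[rain] * p_sprinkler[rain][sprinkler]
            * (pw if wet else 1 - pw))

# Sanity check: the reconstructed joint distribution sums to 1.
total = sum(joint(r, s, w) for r in (True, False)
            for s in (True, False) for w in (True, False))
print(total)
```

Here the three node tables hold 2 + 4 + 8 numbers (half of which are redundant), versus 2^3 rows for every added variable combination in a full joint table; the saving grows with the number of variables.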
By analyzing the difference between a current state and a desired state, a set of intermediary steps can be created to solve the problem; this is the problem-solving hypothesis.
SOAR (State Operator And Result)
Key parts of the SOAR architecture:
Long-term memory and short-term memory
Assertions and rules (production)
Preferences systems between the rules
Problem spaces (make a space and search through that space)
Universal sub-goaling: new problems that emerge during the resolution become entirely new goals, with their own assertions, rules, etc.
SOAR relies on the symbol system hypothesis. It primarily deals with deliberative thinking.
Created by Marvin Minsky to tackle more complex problems, this architecture involves thinking on several layers, from instinctive and learned reactions up to deliberative thinking and self-reflection.
It is based upon the common sense hypothesis.
System created by Rodney Brooks. By generalizing layers of abstraction in the building of robots (such as for robot vision and movement), modifications to one layer don't interfere with the computations of the other layers, allowing better incremental improvement of the system as a whole.
It primarily deals with instinctive reaction and learned reaction.
This is the creature hypothesis: if a machine can act like an insect, then it will be easier to develop it further later. This architecture relies upon the following principles:
Use the world instead of a model: reacting to the world constantly
Finite state machines
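A minimal sketch of these principles (the layer names and behaviors are illustrative assumptions): each layer is a simple rule reacting directly to the current percept, with no world model, and layers are tried in priority order so that more urgent behaviors override the rest:

```python
def avoid(percept):
    # Urgent layer: steer away from obstacles.
    return "turn" if percept.get("obstacle") else None

def wander(percept):
    # Default layer: move around when nothing more urgent applies.
    return "forward"

LAYERS = [avoid, wander]  # earlier layers take precedence

def act(percept):
    """React to the world: the first layer with an opinion decides."""
    for layer in LAYERS:
        action = layer(percept)
        if action is not None:
            return action

print(act({"obstacle": True}))
print(act({"obstacle": False}))
```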
Based upon language, this system involves the perception and description of events, which then allows us to understand stories and, further, culture, both at the macro (country, religion…) and micro (family…) levels. This system relies upon the strong story hypothesis.
In a semantic net, a diagram of relations between objects, essential notions can be defined as follows:
Combinators: linking objects together
Reification: actions implying results
Localization: a frame where objects and actions happen
Story sequence: a series of actions happening linearly in time
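Two of these notions can be sketched in code (the representation and the Macbeth example are illustrative assumptions): combinators become labeled links between objects, and a story sequence is an ordered list of actions:

```python
from collections import defaultdict

# Each object maps to a list of (relation, object) links.
links = defaultdict(list)

def combine(subject, relation, obj):
    """Combinator: link two objects with a labeled relation."""
    links[subject].append((relation, obj))

combine("Macbeth", "murders", "Duncan")
combine("Macbeth", "becomes", "king")

# Story sequence: a series of actions happening linearly in time.
story = [("Macbeth", "murders", "Duncan"),
         ("Macbeth", "becomes", "king")]

print(links["Macbeth"])
```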
In natural language, knowledge is generally organized from general categories to basic objects and finally specific objects.
Another element of language is recording change in the evolution of objects during the unfolding of stories.
Language also tracks movement in the description of actions.
An agent makes an object move from a source to a destination using an instrument. A co-agent may be involved, the action might be aimed towards a beneficiary, and the motion may be helped by a conveyance, etc. In English, prepositions tend to be used to define the role of each part in the action, enabling the recording of interactions.
Agents' actions determine transitions in the semantic net, which result in trajectories.
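A minimal sketch of such an action frame (the role names follow the text; the specific story and wording are illustrative assumptions), showing how prepositions mark each role when the action is rendered in English:

```python
# One action frame: each slot is a thematic role from the text.
frame = {
    "agent": "Robin",
    "object": "ball",
    "source": "house",
    "destination": "garden",
    "instrument": "hand",
    "beneficiary": "Dana",
}

# "from", "to", "with", "for" signal source, destination, instrument, beneficiary.
sentence = (f'{frame["agent"]} moved the {frame["object"]} '
            f'from the {frame["source"]} to the {frame["destination"]} '
            f'with her {frame["instrument"]} for {frame["beneficiary"]}.')
print(sentence)
```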
Each type of story implies a number of characteristics that correspond to the situation. Example: events can be disasters or parties; they have a time and place, and involve people, casualties, money, places…
Classifiers are tests that produce binary choices about samples. They are considered strong classifiers if their error rate is close to 0, weak classifiers if their error rate is close to 0.5.
By using multiple classifiers with different weights, data samples can be sorted or grouped according to different characteristics.
Decision tree stumps
Aside from classifiers, a decision tree can be used to sort positive and negative samples in a 2-dimensional space. By adding weights to different tests, some samples can be emphasized over others. The total sum of weights must always be constrained to 1 to ensure a proper probability distribution over the samples.
Dividing the space
By picking the tests that minimize the weighted error rate, the algorithm can cut the space to sort positive and negative examples.
No overfitting
Boosting algorithms seem not to overfit, as the decision tree stumps tend to cut very close to outlying samples, only excluding them from the space.
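A toy boosting sketch (the data, thresholds, and round count are illustrative assumptions): each round picks the stump with the lowest weighted error, gives it a vote proportional to that error, then re-weights the samples so the misclassified ones are emphasized, keeping the sample weights summing to 1:

```python
import math

# Toy 1-D data; the labels are not separable by any single stump.
X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1, 1, -1, -1, 1, 1]

def stumps():
    # Candidate tests: sign depends on which side of a threshold x falls.
    for t in [0.5, 1.5, 2.5, 3.5, 4.5]:
        yield lambda x, t=t: 1 if x > t else -1
        yield lambda x, t=t: 1 if x < t else -1

def adaboost(rounds=3):
    w = [1.0 / len(X)] * len(X)          # uniform weights, summing to 1
    ensemble = []
    for _ in range(rounds):
        # Choose the stump minimizing the weighted error rate.
        best = min(stumps(), key=lambda h: sum(
            wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi))
        err = sum(wi for wi, xi, yi in zip(w, X, y) if best(xi) != yi)
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-12))
        ensemble.append((alpha, err, best))
        # Emphasize the misclassified samples, then renormalize to sum to 1.
        w = [wi * math.exp(-alpha * yi * best(xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble, w

def predict(ensemble, x):
    """The stumps vote, each with weight alpha."""
    return 1 if sum(a * h(x) for a, _, h in ensemble) > 0 else -1

ensemble, w = adaboost()
print([round(err, 3) for _, err, _ in ensemble])  # each weighted error < 0.5
print(round(sum(w), 6))                           # weights still sum to 1
```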
Support vector machines
Positive and negative examples are separated with a straight line that is as far as possible from both: a median that maximizes the space between positive and negative examples.
Constraints are applied to build a vector w (the normal to the median) and define a constant b that allow positive examples to be sorted from negative ones. The width of a “street” between the positive and negative values is maximized.
Going through the algebra, the resulting equations show that the optimization depends only on the dot products of pairs of samples.
The decision rule that defines if a sample is positive or negative only depends on the dot product of the sample vector and the unknown vector.
No local maximum
Such a support vector algorithm can be proven to be working in a convex space, meaning that it will never get stuck at a local maximum.
The algorithm cannot find a median between data that are not linearly separable. A transformation can, however, be applied to the space to reorganize the samples so that they become linearly separable. Some transformations, though, create an overfitted model that becomes useless by only sorting the example data.
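A minimal sketch of such a transformation (the data and the feature map x → x² are illustrative assumptions, and this is a hand-rolled separability check, not a full support vector machine):

```python
# 1-D data where the positives sit between the negatives:
# no single threshold on x can separate them.
X = [-2.0, -1.0, 1.0, 2.0]
y = [-1, 1, 1, -1]

def separable_by_threshold(values, labels):
    """True if some threshold puts all +1 on one side and all -1 on the other."""
    for t in sorted(values):
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        if (all(l == 1 for l in left) and all(l == -1 for l in right)) or \
           (all(l == -1 for l in left) and all(l == 1 for l in right)):
            return True
    return False

print(separable_by_threshold(X, y))   # not separable in the original space

# Transform the space: map each x to x^2, which folds the two negative
# extremes together and makes a single threshold sufficient.
phi = [x * x for x in X]
print(separable_by_threshold(phi, y))
```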