Programming Artificial Intelligence 101

Programming Language

Though several programming languages can be used to develop Artificial Intelligence software and web applications, Python is the one most often used and recommended.

Documentation and learning resources for the language are available on the official Python website.

Other languages that are often used include Java, Lisp, Prolog, and C++.

Programming Tools

Here is a series of programming tools and other systems to assist in the programming of AI applications.

  • TensorFlow: an open-source software library for Machine Intelligence
  • Caffe: a deep learning framework made with expression, speed, and modularity in mind
  • Keras: a high-level neural networks API, written in Python
  • PyTorch: an open-source machine learning library for Python, used for applications such as natural language processing

Cloud services providers

Certain AI applications may require a lot of computing power to process large amounts of data. Fortunately, several companies provide such services on a regular or on-demand basis. Here are the most famous cloud computing services.

24. Reviews and Exercises

Gathered here are the tutorials and exercises, called “mega-recitations”, that apply the concepts presented in the course.

Mega-R1. Rule-Based Systems

Mega-R2. Basic Search, Optimal Search

Mega-R3. Games, Minimax, Alpha-Beta

Mega-R4. Neural Nets

Mega-R5. Support Vector Machines

Mega-R6. Boosting

Mega-R7. Near Misses, Arch Learning

23. Model Merging, Cross-Modal Coupling, Course Summary

Bayesian Story Merging

Using the probabilistic model discovery studied previously, concepts and ideas can be analyzed and merged when they are similar.

Cross-Modal Coupling

By analyzing the correspondences between clusters in two sets of data, regular correspondences between subsets of the data can be identified. According to Prof. Patrick Winston, this system of correspondences is very likely to be present in human intelligence.

Applications of AI

  • Scientific approach: understanding how AI works
  • Engineering approach: implementing AI applications

The most interesting applications are not to replace people with artificial intelligence but to work in tandem with people.

Using large amounts of computing power and data is becoming more common, but an interesting question is how little information is needed to solve a given problem.

Genesis system

The system translates stories into an internal language in order to understand them and display them as diagrams. It can read stories on different levels and use different “personas” to understand the same story differently.

Humans may not be intelligent enough to build a machine that is as intelligent as them.

22. Probabilistic Inference II

Belief nets

Continued from previous class

Event diagrams must always be arranged so that they have final nodes and no loops. Probabilities are recorded in a table for each event; the tables are filled in by repeating experiments and tallying the occurrences of each event.

Bayesian inference

Several models can be drawn for a given set of events. To find out which model is right, the formulas of Bayesian probability can be used to check whether events are independent, to make probabilities easier to compute, and to choose the more appropriate model.

  • P(a/b) = P(a,b) / P(b)
  • P(a/b) P(b) = P(a,b) = P(b/a) P(a)
  • P(a/b) = P(b/a) P(a) / P(b)

Defining a as a class and b as the evidence, the probability of the class given the evidence can be obtained through these formulas.

P(class/evidence) = P(evidence/class) P(class) / P(evidence)

Using the evidence from experience, classes can be inferred by analyzing the results and their corresponding probabilities.
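As a rough illustration of the formula above, here is a minimal Python sketch. All the numbers are hypothetical and chosen only for the example; P(evidence) is obtained by summing over the two cases "class" and "not class".

```python
def posterior(prior, likelihood, evidence_prob):
    """P(class/evidence) = P(evidence/class) * P(class) / P(evidence)."""
    return likelihood * prior / evidence_prob

# Hypothetical numbers, not taken from the course.
p_class = 0.01                  # P(class)
p_evidence_given_class = 0.9    # P(evidence/class)
p_evidence_given_other = 0.05   # P(evidence/not class)

# Total probability of the evidence, summed over both cases.
p_evidence = (p_evidence_given_class * p_class
              + p_evidence_given_other * (1 - p_class))

p_class_given_evidence = posterior(p_class, p_evidence_given_class, p_evidence)
print(round(p_class_given_evidence, 4))
```

Even with strong evidence, the class stays unlikely here because its prior probability is small.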

Structure discovery

Given data from experience or simulation, the right model can be identified as the one that better matches the observed probabilities. This makes it possible to choose between two existing models.

However, if many models can be created, the volume of data makes it impossible to compare them all. The solution is to keep two models and compare them repeatedly. At each trial, the losing model is modified in search of improvements, until a model meets certain criteria for success.

A trick is to use the sum of the logarithms of the probabilities rather than the product of the probabilities, as a large number of trials makes the product too small to compute properly.

To avoid local maxima, a radical rearrangement of structure is launched after a certain number of trials.


This Bayesian structure discovery works quite well in situations where a diagnosis must be made: medical diagnosis, lie detection, symptoms of an aircraft or a program not working…

21. Probabilistic Inference I

Probabilities in Artificial Intelligence

With a joint probability table, recording the tally of each combination of events allows us to measure the probability of each event, conditional and unconditional probabilities, the independence of events, etc.

The problem with such a table is that as the number of variables increases, the number of rows grows exponentially.
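A minimal Python sketch of such a table, assuming three binary events named a, b and c with invented tallies (the counts are hypothetical and sum to 100). The helper names prob and cond are illustrative only.

```python
# Hypothetical tallies for three binary events (a, b, c).
variables = ("a", "b", "c")
tallies = {
    (True, True, True): 12,
    (True, True, False): 8,
    (True, False, True): 6,
    (True, False, False): 14,
    (False, True, True): 9,
    (False, True, False): 11,
    (False, False, True): 13,
    (False, False, False): 27,
}

def prob(**fixed):
    """P(fixed events): share of the tally found in the matching rows."""
    total = sum(tallies.values())
    matching = sum(
        count for row, count in tallies.items()
        if all(row[variables.index(name)] == value
               for name, value in fixed.items())
    )
    return matching / total

def cond(target, given):
    """P(target/given) = P(target, given) / P(given)."""
    return prob(**target, **given) / prob(**given)

p_a = prob(a=True)
p_a_given_b = cond({"a": True}, {"b": True})
print(p_a, p_a_given_b)   # a and b are independent only if these two match
print(len(tallies) == 2 ** len(variables))   # one row per combination
```

With n binary events the table needs 2 to the power n rows, which is what makes it impractical for many variables.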

Reminders of probabilities formulas

Basic axioms of probability

  • 0 ≤ P(a) ≤ 1
  • P(True) = 1 ; P(False) = 0
  • P(a+b) = P(a) + P(b) – P(a,b)

Basic definitions of probability

  • P(a/b) = P(a,b) / P(b)
  • P(a,b) = P(a/b) P(b)
  • P(a,b,c) = P(a/b,c) P(b,c) = P(a/b,c) P(b/c) P(c)

Chain rule of probability

By generalizing the previous formula, we obtain the following chain rule:

P(xₙ, …, x₁) = P(xₙ/xₙ₋₁, …, x₁) P(xₙ₋₁/xₙ₋₂, …, x₁) … P(x₂/x₁) P(x₁)


Independent events

P(a/b) = P(a) if a and b are independent

Conditional independence

If a and b are independent given z

  • P(a/b,z) = P(a/z)
  • P(a,b/z) = P(a/z) P(b/z)

Belief nets

Causal relations between events can be represented in nets. These models assert that, given its parents, any event is independent of all its non-descendants. Recording the probabilities at each node, the number of tables and rows is significantly smaller than in a general table tallying all events.
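To give an idea of the savings, here is a small Python sketch that counts the rows needed in each case. The net and its event names are hypothetical, and every event is assumed to be binary, so a node with k parents needs 2 to the power k rows.

```python
# Hypothetical net: each event maps to the list of its parents.
parents = {
    "burglar": [],
    "raccoon": [],
    "dog_barks": ["burglar", "raccoon"],
    "police_called": ["dog_barks"],
    "trash_can_tipped": ["raccoon"],
}

def belief_net_rows(net):
    """One row per combination of parent values, summed over all nodes."""
    return sum(2 ** len(node_parents) for node_parents in net.values())

def joint_table_rows(net):
    """One row per combination of values of all events."""
    return 2 ** len(net)

print(belief_net_rows(parents), joint_table_rows(parents))   # 10 32
```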

19. Architectures: GPS, SOAR, Subsumption, Society of Mind

General Problem Solver

By analyzing the difference between the current state and the desired state, a set of intermediate steps can be created to solve the problem. This is the problem-solving hypothesis.

SOAR (State Operator And Result)

SOAR Components:

  • Long-term memory
  • Short-term memory
  • Vision system
  • Action system

Key parts of the SOAR architecture:

  1. Long-term memory and short-term memory
  2. Assertions and rules (production)
  3. Preference systems between the rules
  4. Problem spaces (make a space and search through that space)
  5. Universal sub-goaling: new problems that emerge during resolution become entire new goals, with their own assertions, rules, etc.

SOAR relies on the symbol system hypothesis. It primarily deals with deliberative thinking.

Emotion machine

Created by Marvin Minsky to tackle more complex problems, this architecture involves thinking on several layers:

  • Reflective thinking
    • Self-conscious
    • Self-reflective
  • Deliberative thinking
  • Learned reaction
  • Instinctive reaction

It is based upon the common sense hypothesis.


Subsumption architecture

This system was created by Rodney Brooks. By organizing the construction of robots into layers of abstraction (such as robot vision and movement), modifications to one layer do not interfere with the computation of the other layers, allowing for better incremental improvement of the system as a whole.

It primarily deals with instinctive reaction and learned reaction.

This is the creature hypothesis: if a machine can act as an insect, then it will be easy to develop it further later. This architecture relies upon the following principles:

  1. No representation
  2. Use the world instead of a model: reacting to the world constantly
  3. Finite state machines


Genesis architecture

Based upon language, this system involves the perception and description of events, which then make it possible to understand stories and, further, culture at both the macro (country, religion…) and micro (family…) levels. This system relies upon the strong story hypothesis.

18. Representations: Classes, Trajectories, Transitions


Semantic nets

In a semantic net, a diagram of relations between objects, essential notions can be defined as follows:

  • Combinators: linking objects together
  • Reification: actions implying results
  • Localization: a frame where objects and actions happen
  • Story sequence: a series of actions happening linearly in time


Classification

In natural language, knowledge is generally organized from general categories to basic objects and finally specific objects.


Transitions

Another element of language is the recording of changes in objects as stories unfold.


Trajectories

Language also tracks movement in the description of actions.

An agent makes an object move from a source to a destination using an instrument. A co-agent may be involved, the action might be aimed towards a beneficiary, the motion may be helped by a conveyance, etc. In English, prepositions tend to be used to define the role of each part in the action, which makes it possible to record these interactions.

In language corpora such as the Wall Street Journal Corpus, about 25% of the content generally describes a transition or a trajectory.

Story sequences

Agents’ actions determine transitions in the semantic net, which result in trajectories.

Story libraries

Each type of story implies a number of characteristics that correspond to the situation. Example: events can be disasters or parties, and they have a time and place, people involved, casualties, money…

17. Learning: Boosting

“Wisdom of a weighted crowd of experts”


Classifiers

Classifiers are tests that produce binary choices about samples. They are considered strong classifiers if their error rate is close to 0, and weak classifiers if their error rate is close to 0.5.

By using multiple classifiers with different weights, data samples can be sorted or grouped according to different characteristics.
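The weighted vote can be sketched in a few lines of Python. The three weak classifiers and their weights below are hypothetical; each classifier returns +1 or -1 for a 2-dimension point, and the combined classifier returns the sign of the weighted sum.

```python
def strong_classifier(weighted_classifiers):
    """Combine (weight, classifier) pairs into one weighted-vote classifier."""
    def classify(sample):
        score = sum(weight * h(sample) for weight, h in weighted_classifiers)
        return 1 if score >= 0 else -1
    return classify

# Hypothetical weak classifiers on points (x, y), each returning +1 or -1.
def h1(point):
    return 1 if point[0] > 2 else -1

def h2(point):
    return 1 if point[1] > 3 else -1

def h3(point):
    return 1 if point[0] > 6 else -1

H = strong_classifier([(0.4, h1), (0.9, h2), (0.3, h3)])
print(H((3, 5)), H((7, 1)))   # 1 -1
```

For the point (7, 1), two of the three classifiers vote +1, but the heavier weight of the second one decides the result.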

Decision tree stumps

A decision tree stump, a decision tree reduced to a single test, can be used as a classifier to sort positive and negative samples in a 2-dimension space. By assigning weights to the samples, some samples can be emphasized over the others. The sum of the weights must always be constrained to 1 so that they form a proper distribution over the samples.

Dividing the space

By minimizing the weighted error rate of the tests, the algorithm can cut the space to sort positive and negative examples.
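One round of this process can be sketched in Python as follows. The notes do not give the update formulas, so this sketch assumes the usual AdaBoost rules: the stump with the lowest weighted error is chosen, its vote is half the logarithm of (1 - error) / error, and the weights are rescaled so that the misclassified samples carry half of the total. The samples, labels and stumps are hypothetical.

```python
import math

def boosting_round(samples, labels, weights, stumps):
    """Pick the stump with the lowest weighted error, then reweight samples.

    Assumes the chosen stump has an error strictly between 0 and 1.
    """
    def weighted_error(h):
        return sum(w for x, y, w in zip(samples, labels, weights) if h(x) != y)

    best = min(stumps, key=weighted_error)
    error = weighted_error(best)
    alpha = 0.5 * math.log((1 - error) / error)   # vote of this stump

    # Correct samples share half of the weight, misclassified ones the other half.
    new_weights = [
        w / (2 * (1 - error)) if best(x) == y else w / (2 * error)
        for x, y, w in zip(samples, labels, weights)
    ]
    return best, alpha, new_weights

# Hypothetical 1-dimension data and threshold stumps.
samples = [1, 2, 3, 4, 5, 6]
labels = [1, 1, -1, 1, -1, -1]
weights = [1 / len(samples)] * len(samples)
stumps = [lambda x, t=t: 1 if x < t else -1 for t in (0.5, 2.5, 4.5)]

best, alpha, weights = boosting_round(samples, labels, weights, stumps)
print(round(alpha, 3), [round(w, 2) for w in weights])
```

After the round, the one misclassified sample carries a weight of about 0.5, so the next stump is pushed to classify it correctly.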

No overfitting

Boosting algorithms seem not to overfit, as the decision tree stumps tend to wrap very tightly around outlying samples, excluding only them from the space.

16. Learning: Support Vector Machines

Decision boundaries

Positive and negative examples are separated with a straight line that is as far as possible from both: a median that maximizes the space between positive and negative examples.

Constraints are applied to build a support vector (u) and define a constant b that together allow positive examples to be sorted from negative ones. The width of a “street” between the positive and negative values is maximized.

Going through the algebra, the resulting equation shows that the optimization depends only on the dot products of pairs of samples.

The decision rule that determines whether a sample is positive or negative depends only on the dot products of the sample vectors and the unknown vector.
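A minimal Python sketch of such a decision rule, assuming the usual form: the unknown is positive when the sum over the samples of alpha × label × (sample · unknown), plus b, is at least 0. The names alphas and labels, and all the values below, are assumptions made for the example; the two samples and their multipliers were worked out by hand so that each sample sits exactly on its side of the street.

```python
def dot(u, v):
    """Dot product of two vectors given as sequences of numbers."""
    return sum(a * b for a, b in zip(u, v))

def classify(unknown, samples, labels, alphas, b):
    """Decision rule that only uses dot products with the unknown vector."""
    score = sum(alpha * label * dot(sample, unknown)
                for alpha, label, sample in zip(alphas, labels, samples)) + b
    return 1 if score >= 0 else -1

# Hypothetical values: one negative and one positive sample.
samples = [(1.0, 1.0), (3.0, 3.0)]
labels = [-1, 1]
alphas = [0.25, 0.25]
b = -2.0

print(classify((4.0, 3.0), samples, labels, alphas, b))   # 1
print(classify((0.0, 1.0), samples, labels, alphas, b))   # -1
```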

No local maximum

Such a support vector algorithm can be proven to evolve in a convex space, meaning that it will never get stuck at a local maximum.

Non-linearity

The algorithm cannot find a median between data that are not linearly separable. A transformation can however be applied to the space to reorganize the samples so that they become linearly separable. Certain transformations can however create an overfitted model that becomes useless because it only sorts the example data.
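As a small illustration, the Python sketch below uses hypothetical 1-dimension samples that no single threshold can separate, and an assumed transformation x → (x, x²). After the transformation, the second coordinate alone separates the two groups, so a straight line can.

```python
def transform(x):
    """Hypothetical transformation from 1 dimension to 2 dimensions."""
    return (x, x * x)

def separable_by_threshold(positives, negatives):
    """True if all positives lie strictly on one side of all negatives."""
    return max(positives) < min(negatives) or max(negatives) < min(positives)

# Hypothetical samples: positives on the outside, negatives in the middle.
positives = [-3.0, 3.0]
negatives = [-1.0, 1.0]

print(separable_by_threshold(positives, negatives))   # False

# Compare the second coordinate (x squared) after the transformation.
squared_positives = [transform(x)[1] for x in positives]
squared_negatives = [transform(x)[1] for x in negatives]
print(separable_by_threshold(squared_positives, squared_negatives))   # True
```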

15. Learning: Near Misses, Felicity Conditions

One-shot learning

Learning in a human-like way, in one shot: learning something definite from each example.

The evolving model

By comparing an initial model example, a seed, with a near miss or another example, the evolving model learns an important characteristic from each new near miss or example.

The evolving model develops a set of heuristics to describe the seed, either specializing its characteristics with near misses (reducing the potential matches) or generalizing them with examples (broadening the potential matches).

  • Require link heuristic: specialization
  • Forbid link heuristic: specialization
  • Extend set heuristic: generalization
  • Drop link heuristic: generalization
  • Climb tree heuristic: generalization
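The two specialization heuristics can be sketched in Python as follows. This is only an illustration under simplifying assumptions: a description is a set of (object, relation, object) links, a near miss differs from the seed by exactly one link, and the function names and the arch description are hypothetical.

```python
def update_model(seed, required, forbidden, near_miss):
    """Specialize the model from one near miss that differs by a single link."""
    missing = seed - near_miss   # links of the seed that the near miss lacks
    extra = near_miss - seed     # links of the near miss that the seed lacks
    if len(missing) == 1 and not extra:
        required = required | missing    # require link heuristic
    elif len(extra) == 1 and not missing:
        forbidden = forbidden | extra    # forbid link heuristic
    return required, forbidden

def matches(candidate, required, forbidden):
    """A candidate matches if it has every required link and no forbidden one."""
    return required <= candidate and not (forbidden & candidate)

# Hypothetical description of an arch as a set of links.
support = ("posts", "support", "top")
left_of = ("left post", "left-of", "right post")
touches = ("left post", "touches", "right post")
seed = {support, left_of}

required, forbidden = set(), set()
required, forbidden = update_model(seed, required, forbidden, {left_of})
required, forbidden = update_model(seed, required, forbidden, seed | {touches})

print(required == {support}, forbidden == {touches})   # True True
print(matches(seed, required, forbidden))               # True
```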

Felicity conditions

The teacher and learner must know about each other to achieve the best learning. The learner must talk to himself to understand what he is doing.

How to package ideas better

To better communicate ideas to others in order to achieve better results, the following 5 characteristics make communication more effective.

  • Symbol: makes the idea easy to remember
  • Slogan: focuses the idea
  • Surprise: catches the attention
  • Salient: one thing that stands out
  • Story: helps transmit the idea to people