11. Learning: Identification Trees, Disorder

Identification Trees

Problems

Non-numeric data
Not all characteristics matter
Some do matter but not all of the time
Cost: certain tests may be more expensive than others

Occam’s razor

The objective is to build the smallest tree possible (to reduce costs and computation) and because the simplest explanation is always the best.

Testing data

Small data sets

The different tests upon the data can be ranked by the homogeneous groups it produces and the total number of items in each homogeneous group.

Using the most efficient tests first, the remaining ambiguous data is checked through the other tests and so until until all the data is sorted.

Large data sets

For large data sets, no tests may divide the data in homogeneous group. The results of tests must therefore be ranked according to their level of disorder.

If P = Positive results, N = negative results, T = total

Disorder (Set) = – P/T log[2] ( P/T ) – N/T log[2] ( N/T )

The resulting curve of this equation is a parabolic curve with max at y = 1 for x = 1/2 and min at y = 0 for x = {0,1}

So the quality of each test can be defined as follows:

Quality (Test) = Sum[for each set produced] ( Disorder (Set) ) * Number of samples in set / Number of samples handed by test

Decision boundaries

Contrary to nearest neighbors, identification tests always separate the data space in two equal parts parallel to the space axis.