Identification Trees
Problems
- Non-numeric data
- Not all characteristics matter
- Some do matter but not all of the time
- Cost: certain tests may be more expensive than others
Occam’s razor
The objective is to build the smallest tree possible (to reduce costs and computation) and because the simplest explanation is always the best.
Testing data
Small data sets
The different tests upon the data can be ranked by the homogeneous groups it produces and the total number of items in each homogeneous group.
Using the most efficient tests first, the remaining ambiguous data is checked through the other tests and so until until all the data is sorted.
Large data sets
For large data sets, no tests may divide the data in homogeneous group. The results of tests must therefore be ranked according to their level of disorder.
If P = Positive results, N = negative results, T = total
Disorder (Set) = – P/T log[2] ( P/T ) – N/T log[2] ( N/T )
The resulting curve of this equation is a parabolic curve with max at y = 1 for x = 1/2 and min at y = 0 for x = {0,1}
So the quality of each test can be defined as follows:
Quality (Test) = Sum[for each set produced] ( Disorder (Set) ) * Number of samples in set / Number of samples handed by test
Decision boundaries
Contrary to nearest neighbors, identification tests always separate the data space in two equal parts parallel to the space axis.