version3.whatsnew
The following is a (by no means exhaustive) list of what has changed
between version 2.01 and version 3 of OC1.
Animation: A (very experimental) animation option is provided with
this version of OC1. When working with 2-D data, it lets you watch
the tree induction process. When you specify a file name with the -A
option for MKTREE, using no cross validation, all intermediate
hyperplanes generated by OC1 are dumped into the specified file. The
DISPLAY program can then be used to produce a PostScript(R) animation
from this file. Be warned that this facility is in a very preliminary
stage. I found it useful in understanding what goes on while a tree is
being built, so I decided to provide the code for the hackers. If it
doesn't work, don't blame me!
I made a slight extension to the data file format. OC1 can now work
with data files in which attributes are separated by blanks, tabs or
commas. A "?" for any attribute is treated as a missing value.
Missing values for an attribute are replaced by the mean value of
that attribute over the entire dataset. This extension should make it
easier to use OC1 with UC-Irvine datasets. Please note that the
attribute values still need to be numeric (integer or real), and the
class labels have to be integers.
By popular demand (!), OC1 now writes both the unpruned and pruned
trees into two files (as opposed to only the pruned tree, as before)
when pruning is used. The default names of the decision tree files are
<training data>.dt.unpruned and <training data>.dt. If a filename is
specified with the -D option, <file name>.unpruned is used for storing
the unpruned tree, when pruning is used.
A new option for mktree is -b <axis parallel bias>. This option lets
the user specify a bias in favour of axis-parallel splits.
(Currently, mktree does not support a bias in the other direction.)
ap_bias is a number greater than or equal to 1.0. At any node of the
decision tree, an oblique split is preferred to an axis-parallel
split only if the ratio of the axis-parallel impurity to the oblique
impurity is greater than ap_bias.
Another new option for mktree is -K. This provides a way of making
OC1 act (very much) like CART with linear combinations. I have
implemented the CART perturbation algorithm as described in Chapter 5
of Breiman's book. However, I have not implemented backward feature
elimination or the attribute normalization at each tree node.
Nevertheless, this mode can be used as an easy check on whether
randomization will be useful for your domain.
The OC1 paper included with the package has been changed from the
AAAI-93 paper to the JAIR-94 paper. The new paper has considerably
more useful information than the earlier one.
The impurity measure Sum-of-Impurities is now more appropriately named
Variance. The implementation has been corrected to remove the effect
of the particular class-numbering scheme on this measure.
Both -j and -m options for mktree can now be used for specifying the
maximum number of random jumps tried at each local minimum. Similarly,
both -i and -r options for mktree can be used to specify the number of
restarts.
Just like the -a (axis parallel only) mode, there is now a
-o (oblique only) mode for mktree. When this mode is used,
the oblique perturbation algorithm does not start with the best
axis parallel split on any restart. In addition, the best
oblique split obtained is returned, without bothering to compare
it with the best axis parallel split.
Mktree now has a very verbose mode (-v option specified twice).
In this mode, individual (hill-climbing/random jumps) perturbations
made by the algorithm are also displayed.
The files train_util.c and classify_util.c have been combined into
tree_util.c.
The -N (normalize) option now has a different meaning. Previously,
OC1 normalized all attribute values to the [0,1] range when the -N
option was specified. This can cause problems in several situations,
so there is no longer any global normalization. However, before
inducing an oblique hyperplane at a tree node, the subset of
instances used at that node is now normalized, by default, to lie in
the positive quadrant. This is because OC1's hill-climbing
perturbation algorithm requires that all attribute values be positive
(page 10, JAIR paper). OC1 sometimes works even when this requirement
is not met, but the exact reasons why elude me. The -N option can now
be used to SWITCH OFF this default normalization.
You can now make cost-complexity pruning use the 0-SE rule or the
1-SE rule by changing the constant NO_OF_STD_ERRORS in oc1.h. The
default is the 0-SE rule.
Note:
Memory deallocation still doesn't work well in some situations.
I have not updated several module headers, especially the "Is Called
By Modules" and "Calls Modules" fields. I shall do so as soon as I
get time!