Classification and Regression Trees {#estwagon}
===================================
# Overview {#cart-overview}
As part of its tools for statistical modelling, EST includes methods
for automatically building decision trees and decision lists from
feature data, to predict either fixed classes (classification) or
gaussians (regression). \ref Wagon is the basic program that provides
this facility.
The construction of CARTs (classification and regression trees) is
best described in @cite breiman1984classification and has become a common basic
method for building statistical models from simple feature data.
CART is powerful because it can deal with incomplete data and multiple
types of features (floats, enumerated sets), both in input features
and predicted features, and the trees it produces often contain rules
which are humanly readable.
Decision trees contain a binary question (yes/no answer) about
some feature at each node in the tree. The leaves of the tree
contain the best prediction based on the training data. Decision
lists are a reduced form of this where one answer to each question
leads directly to a leaf node. A tree's leaf node may be a single
member of some class, a probability density function (over some
discrete class), a predicted mean value for a continuous feature
or a gaussian (mean and standard deviation for a continuous value).
Theoretically the predicted value may be anything for which a function
can be defined that gives a measure of *impurity* for a set of
samples, and a distance measure between impurities.
The basic algorithm is: given a set of samples (feature vectors), find
the question about some feature which splits the data minimising the
mean "impurity" of the two partitions. This splitting is applied
recursively on each partition until some stop criterion is reached
(e.g. a minimum number of samples in the partition).
The basic CART building algorithm is a *greedy algorithm*: it makes
the locally best choice at each step, which may be suboptimal, but a
full search for a fully optimized set of questions would be
computationally very expensive (and often not much better). Although
there are pathological cases, in most data sets this greediness is not
a problem.
The basic building algorithm starts with a set of feature vectors
representing samples. At each stage all possible questions about all
possible features are asked of the data, finding out how each question
splits the data. The impurity of each partitioning is measured and the
question that generates the least impure partitions is selected. This
process is applied recursively on each sub-partition until some stop
criterion is met (e.g. a minimal number of samples in a partition).
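
As a concrete illustration of this greedy recursion, here is a minimal
sketch in Python. It is not Wagon's implementation; `Node`, `make_leaf`,
`questions` and `impurity` are placeholders for the pieces described
above (a question is a predicate over a sample, `make_leaf` builds
whatever leaf representation is wanted, and `impurity` is one of the
measures described in the next section).

    from dataclasses import dataclass
    from typing import Any, Callable

    @dataclass
    class Node:
        """An internal tree node: a question and its yes/no subtrees."""
        question: Callable      # predicate over a single sample
        yes: Any                # subtree (Node or leaf) for "yes" answers
        no: Any                 # subtree (Node or leaf) for "no" answers

    def build_tree(samples, questions, impurity, make_leaf, stop=50):
        """Greedily split samples on the question minimising summed impurity."""
        if len(samples) < stop:
            return make_leaf(samples)
        best = None
        for q in questions:                       # every question on every feature
            yes = [s for s in samples if q(s)]
            no = [s for s in samples if not q(s)]
            if not yes or not no:                 # question does not split the data
                continue
            score = impurity(yes) + impurity(no)  # least impure partitioning wins
            if best is None or score < best[0]:
                best = (score, q, yes, no)
        if best is None:
            return make_leaf(samples)
        _, q, yes, no = best
        return Node(q,
                    build_tree(yes, questions, impurity, make_leaf, stop),
                    build_tree(no, questions, impurity, make_leaf, stop))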
## Impurities {#estwagonimpurities}
The *impurity* of a set of samples is designed to capture how similar
the samples are to each other. The smaller the number, the less impure
the sample set is.
For sample sets with continuous predictees Wagon uses the variance
times the number of sample points. The variance alone could be used,
but this would overly favour very small sample sets. As the test that
uses the impurity is trying to minimise it over a partitioning of the
data, multiplying each part by its number of samples encourages larger
partitions, which we have found leads to better decision trees in
general.
For sample sets with discrete predictees Wagon uses the entropy
times the number of sample points. Again the number of sample
points is used so that small sample sets are not unfairly favoured.
The entropy for a sample set is calculated as:
\f[ E = -\sum_{x \in class} prob(x) \log(prob(x)) \f]
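
For concreteness, a direct transcription of these two measures (a
sketch, not Wagon's source code) might look like:

    import math
    from collections import Counter

    def continuous_impurity(values):
        """Variance of the predicted values times the number of samples."""
        n = len(values)
        mean = sum(values) / n
        variance = sum((v - mean) ** 2 for v in values) / n
        return variance * n

    def discrete_impurity(labels):
        """Entropy of the class distribution times the number of samples."""
        n = len(labels)
        entropy = -sum((c / n) * math.log(c / n) for c in Counter(labels).values())
        return entropy * n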
Other impurity measures could be used if required. For example an
experimental clustering technique used for unit selection actually
used an impurity calculated as the mean Euclidean distance between all
vectors of parameters in the sample set. However the above two are the
more standard measures.
## Question forming {#estwagonoverviewquestion}
Wagon has to automatically form questions about each feature in the
data.
For discrete features, questions are built for each member of the set,
e.g. whether feature n has value x. Our implementation does not
currently support more complex questions which could achieve better
results (though at the expense of training time). Questions about a
feature being in some subset of the class members may give smaller
trees. If the data requires distinguishing the values a, b and c from
d, e and f, our method would require three separate questions, while
if subset questions could be formed this could be done in one step,
which would not only give a smaller tree but would also avoid
unnecessarily splitting the samples for a, b and c. In general subset
forming is exponential in the number of items in the class, though
there are heuristic techniques that can reduce this. However these are
currently not supported. Note, however, that the tree formalism
produced by Wagon does support such questions (with the operator
"in"), but Wagon itself will never produce them, though other tree
building techniques (e.g. building by hand) may use this form of
question.
For continuous features Wagon tries to find a partition of the range
of the values that best optimizes the average impurity of the
partitions. This is currently done by linearly splitting the range
into a predefined number of subparts (10 by default) and testing each
split. This again isn't optimal but does offer reasonable accuracy
without requiring vast amounts of computation.
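
A sketch of that search (again illustrative rather than Wagon's actual
code; `value_of` extracts the feature's value from a sample and
`impurity` is one of the measures above):

    def best_threshold(samples, value_of, impurity, n_splits=10):
        """Try equally spaced cut points, return the best (threshold, score)."""
        values = [value_of(s) for s in samples]
        lo, hi = min(values), max(values)
        best = None
        for i in range(1, n_splits):
            threshold = lo + (hi - lo) * i / n_splits
            below = [s for s in samples if value_of(s) < threshold]
            above = [s for s in samples if value_of(s) >= threshold]
            if not below or not above:
                continue
            score = impurity(below) + impurity(above)
            if best is None or score < best[1]:
                best = (threshold, score)
        return best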
## Tree Building criteria {#estwagonoverviewtreebuild}
There are many ways to constrain the tree building algorithm to help
build the "best" tree. Wagon supports many of these (though there
are almost certainly others that it does not).
In the most basic form of the tree building algorithm a fully
exhaustive classification of all samples would be achieved. This, of
course, is unlikely to perform well on samples that are not contained
within the training data. Thus the object is to build a
classification/regression tree that will be most suitable for new
unseen samples. The most basic method to achieve this is not to build
a full tree but to require that there are at least n samples in a
partition before a question split is considered. We refer to this as
the *stop* value. A number like 50 as a stop value will often be good,
but depending on the amount of data you have, its distribution, etc.,
various stop values may produce more general trees.
A second method for building "good" trees is to *hold out* some of
the training data and build a (probably over-trained) tree with a small
stop value. Then prune the tree back to where it best matches the
held out data. This can often produce better results than a fixed
stop value as this effectively allows the stop value to vary through
different parts of the tree depending on how general the prediction
is when compared against held out data.
It is often better to try to build more balanced trees. A small stop
value may cause the tree building algorithm to find small coherent
sets of samples with very specific questions; the resulting tree
becomes heavily lop-sided and (perhaps) not optimal. Rather than using
the same literal stop value everywhere, more balanced trees can be
built if the stop value is defined to be some percentage of the number
of samples under consideration. We call this percentage a *balance*
factor. The stop value is then the larger of the defined fixed stop
value and the balance factor times the number of samples under
consideration.
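
In other words (a sketch of the rule just described, with hypothetical
argument names):

    def effective_stop(n_samples, stop=50, balance=0.0):
        """Stop value used for a partition of n_samples samples: the larger
        of the fixed stop value and balance * n_samples."""
        return max(stop, balance * n_samples)

For example, with a balance factor of 0.05 a partition of 10000
samples will not be split below 500 samples, while small partitions
still fall back to the fixed stop value of 50.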
To some extent the multiplication of the entropy (or variance) by
the number of samples in the impurity measure is also a way to
combat imbalance in tree building.
A good technique we have found is to build trees in a *stepwise*
fashion. In this case, instead of considering all features at once
when building the best tree, we incrementally build trees, looking for
which individual feature best increases the accuracy of the built tree
on the provided test data. Unlike within the tree building process,
where we are looking for the best question over all features, this
technique limits which features are available for consideration. It
first builds a tree for each individual feature on its own, looking
for which single feature provides the best tree. Having selected that
feature, it then builds n-1 trees combining the best feature from the
first round with each of the remaining features. This process
continues until no more features add to the accuracy or some stopping
criterion (a minimum percentage improvement) is not met.
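
The following sketch captures the stepwise loop described above;
`build` and `score` are hypothetical callables standing in for a full
Wagon build (respecting stop, balance, etc.) and an accuracy measure
on the test data, not part of Wagon's actual interface:

    def stepwise_select(features, train, test, build, score, min_improvement=0.0):
        """Greedily add the feature that most improves accuracy on the test data."""
        selected, best_score = [], None
        remaining = list(features)
        while remaining:
            round_best = None
            for f in remaining:                 # try adding each remaining feature
                tree = build(selected + [f], train)
                s = score(tree, test)
                if round_best is None or s > round_best[0]:
                    round_best = (s, f)
            s, f = round_best
            if best_score is not None and s - best_score <= min_improvement:
                break                           # no feature improves accuracy enough
            selected.append(f)
            remaining.remove(f)
            best_score = s
        return selected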
This technique is also a greedy technique but we've found that when
many features are presented, especially when some are highly
correlated with each other, stepwise building produces a significantly
more robust tree on external test data. It also typically builds
smaller trees. But of course there is a cost in computation time.
While using the stepwise option each new feature added is printed out.
Care should be taken in interpreting what this means. It does not
necessarily give the order and relative importance of the features,
but may be useful in showing which features are particularly
important.
Stepwise building tests each successive tree against the specified
test set (the balance, held out and stop options are respected for
each build). As this uses the test set while optimizing the tree, it
is not valid to view the specified test set as a genuine test set.
Another externally held-out test set should be used to test the
accuracy of the resulting tree.
Another technique that has been shown to work in tree building is
called *random forests*. In this technique multiple trees are built,
each with some random subset of the features presented, and then the
multiple trees are combined by some technique (e.g. simple averaging).
This often works much better. Although Wagon doesn't offer building
random forests as a direct option, it can be (and is) used to do this.
You can vary the features in the description file by marking them with
*ignore*, or you can use the *dropout* options *-dof* and *-dos* which
specify a probability of ignoring a feature or sample at each stage of
the build process. It is not clear if the *-dof* or *-dos* options are
useful; we have only fully investigated random forests in the voice
building process through externally selecting features, but we added
these internal options (with trendier names) as part of our
investigation of these techniques.
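
The external feature-selection route amounts to a few lines of
scripting; the sketch below shows the general idea (simple averaging
over trees trained on random feature subsets), reusing the
hypothetical `build` and a `predict` helper rather than anything Wagon
itself provides:

    import random

    def build_forest(train, features, build, n_trees=20, feature_fraction=0.7):
        """Build n_trees trees, each seeing only a random subset of the features."""
        forest = []
        for _ in range(n_trees):
            k = max(1, int(len(features) * feature_fraction))
            forest.append(build(random.sample(features, k), train))
        return forest

    def forest_predict(forest, sample, predict):
        """Combine the trees by simple averaging of their predictions."""
        return sum(predict(tree, sample) for tree in forest) / len(forest)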
## Data format {#cart-formats}
The input data for wagon (and some other model building tools in the
Edinburgh Speech Tools library) should consist of feature vectors and
a description of the fields in these vectors.
### Feature vectors {#estwagon-featvectors}
A feature vector file has one sample per line, with feature values as
whitespace-separated tokens. If your feature values contain whitespace
then you must quote them using double quotes.
The (Festival) program `dumpfeats` is specifically
designed to generate such files from databases of utterances but
these files may be generated from any data source.
Each vector must have the same number of features (and in the same
order). Features may be specified as "ignored" in the description (or
at model building time), so it is common that data files contain more
features than are actually used in model building. By default the
first feature in a data file is the predictee, though (at least in
wagon) the predictee field can be named at tree building time to be
other than the first field.
Features can be discrete or continuous, but at present must be
single-valued; "multi-valued" or "list-valued" features are not
currently supported. Note this means that a feature may have different
values in different samples, but in a particular sample a particular
feature can only have one value. An example data file (each line is
one sample; the first field is the predictee):
    0.399 pau sh 0 0 0 1 1 0 0 0 0 0 0
    0.082 sh iy pau onset 0 1 0 0 1 1 0 0 1
    0.074 iy hh sh coda 1 0 1 0 1 1 0 0 1
    0.048 hh ae iy onset 0 1 0 1 1 1 0 1 1
    0.062 ae d hh coda 1 0 0 1 1 1 0 1 1
    0.020 d y ae coda 2 0 1 1 1 1 0 1 1
    0.082 y ax d onset 0 1 0 1 1 1 1 1 1
    0.082 ax r y coda 1 0 0 1 1 1 1 1 1
    0.036 r d ax coda 2 0 1 1 1 1 1 1 1
Note it is common to have thousands, even hundreds of thousands, of
samples in a data file, and the number of features can often be in the
hundreds, though it can also be less than ten depending on the task.
### Data descriptions {#estwagon-datadescr}
A data file also requires a description file which names and
classifies the features in the data file. Features must have names so
they can be referred to in the decision tree (or other model output)
and must also be classified by their type. The basic types available
for features are:
- `continuous` for features that range over reals (e.g. duration of phones)
- `categorial` for features with a pre-defined list of possible values (e.g. phone names)
- `string` for features with an open class of discrete values (e.g. words)
- `vectors` like floats but as vectors of floats (e.g. MFCC data)
The data description consists of a parenthesized list of feature
descriptions. Each feature description consists of the feature name
and its type (and/or possible values). Feature names, by convention,
should be feature names in the sense of the features (and pathnames)
used throughout the utterance structures in the Edinburgh Speech
Tools. The expected way to use models generated from feature sets in
the Edinburgh Speech Tools is to apply them to items. In that sense,
having a feature name be a feature (or relative pathname) of an item
avoids the extra step of extracting features into a separate table
before applying the model. However it should also be stated that to
wagon these names are arbitrary tokens and their semantics are
irrelevant to the tree building process.
A typical description file would look like this; this one is suitable
for the data file given above:
    ((segment_duration float)
     ( name aa ae ah ao aw ax ay b ch d dh dx eh el em en er ey f g
       hh ih iy jh k l m n nx ng ow oy p r s sh t th uh uw v w y z zh pau )
     ( n.name 0 aa ae ah ao aw ax ay b ch d dh dx eh el em en er ey f g
       hh ih iy jh k l m n nx ng ow oy p r s sh t th uh uw v w y z zh pau )
     ( p.name 0 aa ae ah ao aw ax ay b ch d dh dx eh el em en er ey f g
       hh ih iy jh k l m n nx ng ow oy p r s sh t th uh uw v w y z zh pau )
     (position_type 0 onset coda)
     (R:Sylstructure.parent.R:Syllable.p.syl_break float)
     (R:Sylstructure.parent.syl_break float)
     (R:Sylstructure.parent.R:Syllable.n.syl_break float)
     (R:Sylstructure.parent.R:Syllable.p.stress 0 1)
     (R:Sylstructure.parent.stress 0 1)
     (R:Sylstructure.parent.R:Syllable.n.stress 0 1)
    )
There are also a number of special symbols that may be used in a
description file. If the type (the first token after the name) is
`ignore`, the feature will be ignored in the model building process.
You may also specify features to ignore at tree building time, but it
is often convenient to explicitly ignore feature(s) in the description
file.
For open categorial features the token `_other_` should appear first
in the list of possible values. This actually allows features to have
a partially closed set plus an open set of values.
A description file can't be generated fully automatically from a data
set, though an approximation is possible. In particular, it is not
possible to decide automatically whether a feature is continuous or
whether its example values merely happen to look like numbers. The
script `make_wagon_desc` takes a datafile, a file containing only the
names of the features, and the name of the description file it will
create. This is often a useful first pass, though the result almost
certainly must be hand edited afterwards.
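
Based on that description, a typical invocation would look something
like the following (the file names are placeholders; check the script
itself for the exact argument order and options):

    make_wagon_desc feats.data feats.names feats.desc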
### Tree format {#estwagon-treeformat}
The generated tree files are written as Lisp s-expressions as this
is by far the easiest external method to represent trees. Even if
the trees are read by something other than Lisp it is easy to
write a reader for such a format. The syntax of a tree is:
    TREE ::= LEAF | QUESTION-NODE

    QUESTION-NODE ::= "(" QUESTION YES-NODE NO-NODE ")"

    YES-NODE ::= TREE

    NO-NODE ::= TREE

    QUESTION ::= "(" FEATURENAME "is" VALUE ")" |
                 "(" FEATURENAME "=" FLOAT ")" |
                 "(" FEATURENAME "<" FLOAT ")" |
                 "(" FEATURENAME ">" FLOAT ")" |
                 "(" FEATURENAME "matches" REGEX ")" |
                 "(" FEATURENAME "in" "(" VALUE0 VALUE1 ... ")" ")"

    LEAF ::= "(" STDDEV MEAN ")" |
             "(" "(" VALUE0 PROB0 ")" "(" VALUE1 PROB1 ")" ... MOSTPROBVAL ")" |
             any other lisp s-expression
Note that not all of the question types are generated by Wagon but
they are supported by the interpreters.
The leaf nodes differ depending on the type of the predictee. For
continuous predictees (regression trees) the leaves consist of a pair
of floats, the stddev and mean. For discrete predictees
(classification trees) the leaves are a probability density function
over the members of the class, with the last member of the list being
the most probable value. Note that in both cases the last value of the
leaf list is the answer desired in many cases.
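
As a sketch of how small such a reader can be, the following Python
interprets a tree of this form, assuming the s-expression has already
been parsed into nested Python lists (e.g. a question node becomes
`[["name", "is", "pau"], YES, NO]` and a regression leaf `[0.05, 0.2]`)
and that a sample is a dictionary mapping feature names to values. This
is illustrative only, not the Speech Tools interpreter:

    import re

    def is_question_node(node):
        """A question node is (QUESTION YES NO); a question has 3 elements."""
        return (isinstance(node, list) and len(node) == 3
                and isinstance(node[0], list) and len(node[0]) == 3)

    def ask(question, sample):
        feature, op, value = question
        x = sample[feature]
        if op == "is":      return x == value
        if op == "=":       return float(x) == float(value)
        if op == "<":       return float(x) < float(value)
        if op == ">":       return float(x) > float(value)
        if op == "matches": return re.match(value, str(x)) is not None
        if op == "in":      return x in value
        raise ValueError("unknown operator: %s" % op)

    def predict(tree, sample):
        """Follow yes/no branches until a leaf is reached and return the leaf."""
        while is_question_node(tree):
            tree = tree[1] if ask(tree[0], sample) else tree[2]
        return tree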
Here's a small example tree:
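
(The tree below is hand-written for illustration in the syntax above,
predicting segment duration from the example features; it is not
actual Wagon output and its numbers are invented.)

    ((name is pau)
     (0.138 0.399)
     ((position_type is onset)
      (0.028 0.071)
      (0.019 0.052)))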
# Functions {#estwagon-functions}
# Programs {#estwagon-programs}
The following executable programs are provided for building and
testing decision trees:
- \ref wagon_manual: the core tree building program
- \ref wagon_test_manual: applies a trained tree to data and tests its accuracy.