Theoretically the predicted value may be anything for which a function can be
defined that gives a measure of *impurity* for a set of samples, and a distance
measure between impurities. The basic algorithm is: given a set of samples
(feature vectors), find the question about some feature which splits the data
minimising the mean "impurity" of the two partitions. Recursively apply this
splitting on each partition until some stop criterion is reached (e.g. a
minimum number of samples in the partition).

The basic CART building algorithm is a *greedy algorithm* in that it makes the
locally best choice at each stage. This is suboptimal, but a full search for a
fully optimized set of questions would be computationally very expensive (and
often not much better). Although there are pathological cases, in most data
sets this greediness is not a problem.

The basic building algorithm starts with a set of feature vectors representing
samples. At each stage all possible questions for all possible features are
asked about the data, finding out how each question splits the data. A
measurement of the impurity of each partitioning is made, and the question that
generates the least impure partitions is selected. This process is applied
recursively on each sub-partition until some stop criterion is met (e.g. a
minimal number of samples in a partition).

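To make the procedure concrete, here is a minimal, self-contained sketch of
such a greedy builder for purely numeric features with a continuous predictee.
It only illustrates the algorithm described above and is not Wagon's actual
implementation: the `Sample` and `Node` types, the variance-times-N impurity
and the way candidate thresholds are chosen are all simplifying assumptions.

```cpp
// Illustrative sketch only, NOT Wagon's implementation: a greedy regression
// tree builder over numeric features, using variance times the number of
// samples as the impurity and a minimum partition size as the stop criterion.
#include <cstddef>
#include <limits>
#include <memory>
#include <vector>

struct Sample { std::vector<double> features; double predictee; };

struct Node {
    int feature = -1;                  // -1 marks a leaf
    double threshold = 0.0;            // question: features[feature] < threshold ?
    double value = 0.0;                // leaf prediction: mean predictee of the partition
    std::unique_ptr<Node> yes, no;
};

static double impurity(const std::vector<Sample> &s)   // variance * N (see below)
{
    if (s.empty()) return 0.0;
    double n = s.size(), sum = 0.0, sumsq = 0.0;
    for (const Sample &x : s) { sum += x.predictee; sumsq += x.predictee * x.predictee; }
    double mean = sum / n;
    return (sumsq / n - mean * mean) * n;
}

std::unique_ptr<Node> build(const std::vector<Sample> &samples, std::size_t stop)
{
    auto node = std::make_unique<Node>();
    double sum = 0.0;
    for (const Sample &x : samples) sum += x.predictee;
    node->value = samples.empty() ? 0.0 : sum / samples.size();
    if (samples.size() < stop || samples.size() < 2)   // stop criterion: make a leaf
        return node;

    double best = std::numeric_limits<double>::max();
    std::vector<Sample> best_yes, best_no;
    for (std::size_t f = 0; f < samples[0].features.size(); f++)  // every feature
        for (const Sample &x : samples)   // every seen value as a candidate threshold
        {                                 // (Wagon divides the range instead; see below)
            double t = x.features[f];
            std::vector<Sample> yes, no;
            for (const Sample &y : samples)
                (y.features[f] < t ? yes : no).push_back(y);
            if (yes.empty() || no.empty()) continue;
            double score = impurity(yes) + impurity(no);  // least impure split wins
            if (score < best)
            {
                best = score;
                node->feature = static_cast<int>(f);
                node->threshold = t;
                best_yes = yes;
                best_no = no;
            }
        }
    if (node->feature < 0) return node;   // no valid split found: leave as a leaf
    node->yes = build(best_yes, stop);    // recurse on each sub-partition
    node->no  = build(best_no, stop);
    return node;
}
```
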
## Impurities {#estwagonimpurities}

The *impurity* of a set of samples is designed to capture how similar the
samples are to each other. The smaller the number, the less impure the sample
set is.

For sample sets with continuous predictees Wagon uses the variance times the
number of sample points. The variance alone could be used, but this would
overly favour very small sample sets. As the test that uses the impurity is
trying to minimise it over a partitioning of the data, multiplying each part by
the number of samples encourages larger partitions, which we have found lead to
better decision trees in general.

For sample sets with discrete predictees Wagon uses the entropy times the
number of sample points. Again the number of sample points is used so that
small sample sets are not unfairly favoured. The entropy for a sample set is
calculated as:

\f[ E = -\sum_{x \in class} prob(x) * \log(prob(x)) \f]

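As a minimal sketch (not Wagon's actual code), the discrete impurity can be
computed as below; the function name `discrete_impurity` and the use of
`std::string` class labels are illustrative assumptions only.

```cpp
// Minimal sketch, not Wagon's actual code: the impurity of a partition with a
// discrete predictee, computed as the entropy of the class distribution times
// the number of samples.
#include <cmath>
#include <map>
#include <string>
#include <vector>

double discrete_impurity(const std::vector<std::string> &classes)
{
    double n = classes.size();
    if (n == 0) return 0.0;
    std::map<std::string, int> counts;
    for (const std::string &c : classes) counts[c]++;
    double entropy = 0.0;
    for (const auto &p : counts)
    {
        double prob = p.second / n;
        entropy -= prob * std::log(prob);   // E = -sum prob(x) * log(prob(x))
    }
    return entropy * n;   // multiplying by n keeps tiny partitions from being favoured
}
```
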
Note however that the tree formalism produced by Wagon does support such
questions (with the operator "in"), but Wagon will never produce these
questions itself, though other tree building techniques (e.g. by hand) may use
this form of question.

For continuous features Wagon tries to find a partition of the range of the
values that best optimizes the average impurity of the partitions. This is
currently done by linearly splitting the range into a predefined number of
subparts (10 by default) and testing each split. This again isn't optimal but
does offer reasonable accuracy without requiring vast amounts of computation.

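A minimal sketch of that range-splitting step might look as follows; the
helper `candidate_thresholds` is a hypothetical name, not part of Wagon, and
simply divides the observed range into the default 10 subparts.

```cpp
// Minimal sketch, not Wagon's actual code: candidate split points for a
// continuous feature, made by linearly dividing the observed range into a
// fixed number of subparts (10 by default, as described above).
#include <algorithm>
#include <vector>

std::vector<double> candidate_thresholds(const std::vector<double> &values,
                                         int subparts = 10)
{
    std::vector<double> thresholds;
    if (values.empty() || subparts < 2) return thresholds;
    double lo = *std::min_element(values.begin(), values.end());
    double hi = *std::max_element(values.begin(), values.end());
    double step = (hi - lo) / subparts;
    for (int i = 1; i < subparts; i++)          // interior boundaries only
        thresholds.push_back(lo + i * step);
    return thresholds;   // each threshold t is then tried as "feature < t ?"
}
```
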
## Tree Building criteria {#estwagonoverviewtreebuild}

There are many ways to constrain the tree building algorithm to help build the
"best" tree. Wagon supports many of these (though there are almost certainly
others that it does not).

In the most basic form of the tree building algorithm a fully exhaustive
classification of all samples would be achieved. This, of course, is unlikely
to be good when given samples that are not contained within the training data.
Thus the object is to build a classification/regression tree that will be most
suitable for new unseen samples. The most basic method to achieve this is not
to build a full tree but to require that there are at least n samples in a
partition before a question split is considered. We refer to that as the
*stop* value.

sets of samples with very specific questions. The resulting tree becomes
heavily lop-sided and (perhaps) not optimal. Rather than having the same
literal stop value, more balanced trees can be built if the stop value is
defined to be some percentage of the number of samples under consideration.
This percentage we call a *balance* factor. Thus the stop value is the larger
of the defined fixed stop value or the balance factor times the number of
samples.

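In other words, for a partition of N samples under consideration the effective
stop value described above is

\f[ stop' = \max(stop, balance \times N) \f]

so, for example, a fixed stop of 50 with a balance factor of 0.05 gives an
effective stop of 100 for a partition of 2000 samples, but falls back to the
fixed value of 50 for partitions of fewer than 1000 samples.
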
To some extent the multiplication of the entropy (or variance) by the number of
samples in the impurity measure is also a way to combat imbalance in tree
building.

A good technique we have found is to build trees in a *stepwise* fashion. In
this case, instead of considering all features when building the best tree, we
incrementally build trees, looking for which individual feature best increases
the accuracy of the built tree on the provided test data. Unlike within the
tree building process, where we are looking for the best question over all
features, this technique limits which features are available for
consideration. It first builds a tree using each and only the features provided

but may be useful if showing which features are particularly important to this
build.

Stepwise tests each successive tree against the specified test set (balance,
held out and stop options are respected for each build). As this uses the test
set while optimizing the tree, it is not valid to view the specified test set
as a genuine test set. Another externally held out test set should be used to
test the accuracy of the generated tree.

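The following is a small, self-contained sketch of that stepwise loop. It is an
illustration of the idea rather than Wagon's code: `build_and_score` is a
hypothetical callback that builds a tree restricted to the given feature
indices (respecting the balance, held out and stop settings) and returns its
accuracy on the specified test set.

```cpp
// Illustrative sketch of stepwise feature selection, not Wagon's code.
#include <algorithm>
#include <functional>
#include <vector>

std::vector<int> stepwise_select(
    int nfeatures,
    const std::function<double(const std::vector<int> &)> &build_and_score)
{
    std::vector<int> selected;            // features chosen so far, in order
    double best_score = -1.0;
    bool improved = true;
    while (improved)
    {
        improved = false;
        int best_f = -1;
        for (int f = 0; f < nfeatures; f++)          // try adding each unused feature
        {
            if (std::find(selected.begin(), selected.end(), f) != selected.end())
                continue;
            std::vector<int> trial = selected;
            trial.push_back(f);
            double score = build_and_score(trial);   // build and test this candidate
            if (score > best_score) { best_score = score; best_f = f; }
        }
        if (best_f >= 0) { selected.push_back(best_f); improved = true; }
    }
    return selected;   // stops once no remaining feature improves test accuracy
}
```
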
Another technique that has been shown to work in tree building is called
*random forests*. In this technique multiple trees are built, each with some
random subset of the features presented, and the multiple trees are then
combined by some technique (e.g. simple averaging). This often works much
better. Although Wagon doesn't offer building random forests as a direct
option, it can be (and is) used to do this. You can vary the features in the
description file by marking them with *ignore*, or you can use the *dropout*
options *-dof* and *-dos*, which specify a probability of ignoring a feature or
sample at each stage of the build process. It's not clear if the *-dof* or
*-dos* options are useful; we have only fully investigated random forests in
the voice building process through externally selecting features, but we added
these internal options (with trendier names) as part of our investigation.

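As an illustration of the combination step only (this is not a built-in Wagon
option), averaging the outputs of several independently built trees might look
like the sketch below, with each tree represented by a hypothetical prediction
callback.

```cpp
// Illustrative sketch only: combining several independently built trees by
// simple averaging, as in a random forest.
#include <functional>
#include <vector>

double forest_predict(
    const std::vector<std::function<double(const std::vector<double> &)>> &trees,
    const std::vector<double> &features)
{
    if (trees.empty()) return 0.0;
    double sum = 0.0;
    for (const auto &predict_with_tree : trees)
        sum += predict_with_tree(features);
    return sum / trees.size();   // simple average of the individual tree predictions
}
```
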
## Data format {#cart-formats}

The input data for wagon (and some other model building tools in the Edinburgh
Speech Tools library) should consist of feature vectors, and a description of
the fields in these vectors.

### Feature vectors {#estwagon-featvectors}

A feature vector is a file with one sample per line, with feature values as
white space separated tokens. If your feature values contain