"Fossies" - the Fresh Open Source Software Archive  

Source code changes of the file "doc/estwagon.md" between
speech_tools-2.4-release.tar.gz and speech_tools-2.5.0-release.tar.gz

About: The speech_tools - Edinburgh Speech Tools Library (used by the Festival Speech Synthesis System).

skipping to change at line 40
Theoretically the predicted value may be anything for which a function
can be defined that gives a measure of *impurity*
for a set of samples, and a distance measure between impurities.
The basic algorithm is: given a set of samples (feature vectors), find
the question about some feature which splits the data minimising the
mean "impurity" of the two partitions. Recursively apply this
splitting on each partition until some stop criterion is reached
(e.g. a minimum number of samples in the partition).
The basic CART building algorithm is a *greedy algorithm* in that it
chooses the locally best discriminatory feature at each stage in the
process. This is suboptimal, but a full search for a fully optimized
set of questions would be computationally very expensive (and often
not much better). Although there are pathological cases, in most data
sets this greediness is not a problem.
The basic building algorithm starts with a set of feature vectors
representing samples. At each stage all possible questions for all
possible features are asked about the data, finding out how each
question splits the data. A measurement of impurity of each
partitioning is made and the question that generates the least impure
partitions is selected. This process is applied recursively on each
sub-partition until some stop criterion is met (e.g. a
minimal number of samples in a partition).
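
The recursion can be pictured with the following sketch (illustrative
Python only, not Wagon's implementation; `impurity` and
`candidate_questions` are assumed helpers supplied by the caller):

```python
def build_tree(samples, candidate_questions, impurity, min_samples=50):
    # Stop criterion: too few samples left to consider splitting.
    if len(samples) < min_samples:
        return {"leaf": samples}

    best = None
    for question in candidate_questions:
        yes = [s for s in samples if question(s)]
        no = [s for s in samples if not question(s)]
        if not yes or not no:
            continue  # this question does not split the partition
        # Combined impurity of the two partitions.
        score = impurity(yes) + impurity(no)
        if best is None or score < best[0]:
            best = (score, question, yes, no)

    if best is None:
        return {"leaf": samples}

    _, question, yes, no = best
    return {"question": question,
            "yes": build_tree(yes, candidate_questions, impurity, min_samples),
            "no": build_tree(no, candidate_questions, impurity, min_samples)}
```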

## Impurities {#estwagonimpurities}

The *impurity* of a set of samples is designed to
capture how similar the samples are to each other. The smaller
the number, the less impure the sample set is.

For sample sets with continuous predictees Wagon uses the variance
times the number of sample points. The variance alone could
be used but this would overly favour very small sample sets. As the
test that uses the impurity is trying to minimise it over
a partitioning of the data, multiplying each part by the number
of samples encourages larger partitions, which we have found
lead to better decision trees in general.
For sample sets with discrete predictees Wagon uses the entropy
times the number of sample points. Again the number of sample
points is used so that small sample sets are not unfairly favoured.
The entropy for a sample set is calculated as:
\f[ E = -\sum_{x \in class} prob(x) * \log(prob(x)) \f]
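
As a sketch (again not Wagon's code), the two impurity measures
described above might be computed as follows, where `values` are the
predictee values of the samples in one partition:

```python
import math
from collections import Counter

def continuous_impurity(values):
    # Variance of the predictee values, times the number of samples.
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return variance * n

def discrete_impurity(values):
    # Entropy of the class distribution, times the number of samples.
    n = len(values)
    counts = Counter(values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy * n
```
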
skipping to change at line 109
Note however that the tree formalism produced by Wagon does support
such questions (with the operator "in"), but Wagon will never produce
these questions itself, though other tree building techniques (e.g. by hand)
may use this form of question.
For continuous features Wagon tries to find a partition of
the range of the values that best optimizes the average
impurity of the partitions. This is currently done by linearly
splitting the range into a predefined number of subparts (10 by default)
and testing each split. This again isn't optimal but does
offer reasonable accuracy without requiring vast amounts of
computation.
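
That search can be sketched roughly as follows (illustrative only; the
`impurity` helper over a set of samples is assumed as in the earlier
sketch, and the default of 10 subparts mirrors the figure mentioned in
the text):

```python
def best_continuous_split(samples, feature, impurity, n_splits=10):
    # Try equally spaced thresholds across the feature's range and keep
    # the one whose two partitions have the lowest combined impurity.
    values = [s[feature] for s in samples]
    lo, hi = min(values), max(values)
    best_threshold, best_score = None, None
    for i in range(1, n_splits):
        threshold = lo + (hi - lo) * i / n_splits
        below = [s for s in samples if s[feature] < threshold]
        above = [s for s in samples if s[feature] >= threshold]
        if not below or not above:
            continue
        score = impurity(below) + impurity(above)
        if best_score is None or score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score
```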

## Tree Building criteria {#estwagonoverviewtreebuild}

There are many ways to constrain the tree building algorithm to help
build the "best" tree. Wagon supports many of these (though there
are almost certainly others that it does not).
In the most basic forms of the tree building algorithm a fully
exhaustive classification of all samples would be achieved. This, of
course, is unlikely to be good when given samples that are not
contained within the training data. Thus the object is to build a
classification/regression tree that will be most suitable for new
unseen samples. The most basic method to achieve this is not to build
a full tree but to require that there are at least n samples in a
partition before a question split is considered. We refer to that as
the stop value.
skipping to change at line 151
sets of samples with very specific questions. The resulting tree becomes
heavily lop-sided and (perhaps) not optimal. Rather than having the
same literal stop value, more balanced trees can be built if the stop
value is defined to be some percentage of the number of samples under
consideration. This percentage we call a *balance*
factor. Thus the stop value is the largest of the
defined fixed stop value or the balance factor times the number of
samples.
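
For example (a sketch, not Wagon's code), the effective stop value
under a balance factor might be computed as:

```python
def effective_stop(fixed_stop, balance_factor, n_samples):
    # The larger of the fixed stop value and balance * number of samples.
    return max(fixed_stop, balance_factor * n_samples)

# With a fixed stop of 50 and a balance factor of 0.05, a partition of
# 2000 samples gets an effective stop value of 100.
print(effective_stop(50, 0.05, 2000))  # -> 100.0
```
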
To some extent the multiplication of the entropy (or variance) by
the number of samples in the impurity measure is also a way to
combat imbalance in tree building.
A good technique we have found is to build trees in a *stepwise*
fashion. In this case, instead of considering all
features in building the best tree, we incrementally build trees looking
for which individual feature best increases the accuracy of the built
tree on the provided test data. Unlike within the tree building
process, where we are looking for the best question over all features,
this technique limits which features are available for consideration.
It first builds a tree using each and only the features provided
skipping to change at line 187
but may be useful for showing which features are particularly important
to this build.

Stepwise tests each successive tree against the specified test set (balance,
held out and stop options are respected for each build). As this
uses the test set while optimizing the tree, it is not valid
to view the specified test set as a genuine test set. Another
externally held out test set should be used to test the accuracy of the
generated tree.
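
In outline, the stepwise feature selection behaves something like this
sketch (not Wagon's implementation; `build_tree` and `accuracy` are
assumed helpers):

```python
def stepwise_features(all_features, train, test, build_tree, accuracy):
    # Greedily grow the feature set, adding whichever remaining feature
    # most improves accuracy on the provided test data.
    selected, best_acc = [], None
    improved = True
    while improved and len(selected) < len(all_features):
        improved, best_feature = False, None
        for feature in all_features:
            if feature in selected:
                continue
            tree = build_tree(train, selected + [feature])
            acc = accuracy(tree, test)
            if best_acc is None or acc > best_acc:
                best_acc, best_feature, improved = acc, feature, True
        if best_feature is not None:
            selected.append(best_feature)
    return selected
```
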
Another technique that has been shown to work in tree building is
called *random forests*. In this technique multiple trees are built,
each with some random subset of the features presented, and the
multiple trees are then combined by some technique (e.g. simple
averaging). This often works much better. Although Wagon doesn't offer
building random forests as a direct option, it can be (and is) used to
do this. You can vary the features in the description file by marking
them with *ignore*, or you can use the *dropout* options *-dof* and
*-dos*, which specify a probability of ignoring a feature or sample at
each stage of the build process. It's not clear if the *-dof* or
*-dos* options are useful; we have only fully investigated random
forests in the voice building process through externally selecting
features, but we added these internal options (with trendier names) as
part of our investigation.
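
The external approach can be sketched as follows (hypothetical helpers,
not part of the speech tools API): train several trees, each on a
random subset of the features, then average their predictions:

```python
import random

def train_forest(samples, features, build_tree, n_trees=20, subset_size=None):
    # Build several trees, each seeing only a random subset of the features.
    if subset_size is None:
        subset_size = max(1, len(features) // 2)
    return [build_tree(samples, random.sample(features, subset_size))
            for _ in range(n_trees)]

def predict_forest(forest, predict, sample):
    # Combine the trees by simple averaging of their predictions.
    predictions = [predict(tree, sample) for tree in forest]
    return sum(predictions) / len(predictions)
```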

## Data format {#cart-formats}

The input data for wagon (and some other model building tools
in the Edinburgh Speech Tools library) should consist of
feature vectors, and a description of the fields in these vectors.

### Feature vectors {#estwagon-featvectors}

A feature vector is a file with one sample per line, with feature values
as white space separated tokens. If your feature values contain
End of changes. 6 change blocks. 9 lines changed or deleted, 23 lines changed or added.
