Decision Trees
Entropy 2-class
If (\cal I) is the set of training examples, (p_+) ratio of posititive examples in (\cal I) and (p_-) negative examples then entropy which measures impurity:
\[\text{Entropy}(I) = -p_+\log_2p_+ - p_-\log _2p_-\]This represents expected number of bits to encode the class of a randomly drawn member of (\cal I)
Information Gain
Expected reduction in entropy due to sorting on attribute (x_i)
\[\text{Gain}({\cal I}, x_i)=\text{Entropy}(I)-\sum_{v\in\text{values}(x _i)}\frac{|\cal I_v|}{|\cal I|}\text{Entropy}(\cal I_v)\]- (\text{values}(x_i)): all possible values of (x_i)
- (\cal I_v\subset \cal I) : points in (\cal I) that take value (v) for (x_i)
Mutual Information Gain & Information Gain
Mutual information is the measure of mutual dependence of two random variables.
Mutual information gain for two random variables (X) and (Y) is:
\[I(X;Y)=\sum_{y\in Y}\sum_{x\in X}p(x, y)\log\left(\frac{p(x,y)}{p(x)p(y)}\right)\]In decision trees mutual information gain and information gain are synonymous:
\[I(X;Y)=H(Y)-H(Y|X)\]ID3 Algorithm
Recursively perform on all examples.
- For all (+)/(-) examples return single node with corresponding label.
- Otherwise split on attribute that best classifies current set of examples.
Decision Trees Features
- Hypothesis space is complete
- Only single hypothesis as output
- No backtracking
- Statistically based choices
- True bias hard to estimate
- Shorter trees are preferred
- Trees with high information gain near root are preferred
Overfitting
Hypothesis (h\in \cal H) overfits (X) if (\exists h’) such that:
(E(h|X)< E(h’|X)) but (E_p(h)>E_p(h’)) where (p) is the entire distribution of data.
Strategies:
- Stop when insignificant information gain
- post pruning - subtrees/rules/
- using ensembles
Problems
- Real values functions - sort and split on midpoints
- Alternative measures for selecting attributes - gain ratio / cost sensitive information gain
- Missing attribute values - most commomn value at node / node with same target value / probability based on descendents of node
Advantages / Disadvantages
- Easy to explain
- No hyperparameters
- Overfitting
- Low accuracies (cf. other approaches)
Random Forests
Instane Bagging / Bootstrap aggregation
- Generate (K) different bootstrapped trainig datasets (sampling with replacement)
- Learn a decision tree on each
- Final prediction is the majority/average vote of all.
Freature bagging
- Improved than instance bagging
- Build a decision tree using a subset of (m) features rather than all (D) features, typically (m\approx \sqrt D)