Decision Trees
Entropy (2-class)
If $\cal I$ is the set of training examples, $p_+$ the fraction of positive examples in $\cal I$ and $p_-$ the fraction of negative examples, then the entropy, which measures impurity, is:
\[\text{Entropy}({\cal I}) = -p_+\log_2p_+ - p_-\log _2p_-\]This represents the expected number of bits needed to encode the class of a randomly drawn member of $\cal I$.
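A minimal Python sketch of the two-class entropy, assuming labels coded as 0/1; the helper name `entropy2` is illustrative, not from the notes:

```python
import numpy as np

def entropy2(labels):
    """Two-class entropy in bits; zero-probability terms are dropped (0*log2(0) = 0)."""
    labels = np.asarray(labels)
    p_pos = np.mean(labels == 1)               # p_+ : fraction of positive examples
    probs = np.array([p_pos, 1.0 - p_pos])     # [p_+, p_-]
    probs = probs[probs > 0]                   # avoid log2(0)
    return float(-np.sum(probs * np.log2(probs)))

# Example: 9 positive and 5 negative examples -> ~0.940 bits
print(entropy2([1] * 9 + [0] * 5))
```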
Information Gain
Expected reduction in entropy due to sorting on attribute $x_i$
\[\text{Gain}({\cal I}, x_i)=\text{Entropy}({\cal I})-\sum_{v\in\text{values}(x _i)}\frac{|{\cal I}_v|}{|{\cal I}|}\text{Entropy}({\cal I}_v)\]- $\text{values}(x_i)$: all possible values of $x_i$
- ${\cal I}_v\subseteq {\cal I}$: points in $\cal I$ that take value $v$ for $x_i$
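A self-contained sketch of $\text{Gain}({\cal I}, x_i)$ for a categorical attribute; the function names are illustrative:

```python
import numpy as np

def entropy(labels):
    """Entropy of a label array in bits (any number of classes)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(labels, attribute_values):
    """Gain(I, x_i): entropy of I minus the weighted entropy of each subset I_v."""
    labels = np.asarray(labels)
    attribute_values = np.asarray(attribute_values)
    n = len(labels)
    weighted = 0.0
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]
        weighted += (len(subset) / n) * entropy(subset)
    return entropy(labels) - weighted

# Toy check: an attribute that splits the labels perfectly has gain == entropy(labels)
y = np.array([1, 1, 0, 0])
x = np.array(["a", "a", "b", "b"])
print(information_gain(y, x))  # 1.0
```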
Mutual Information & Information Gain
Mutual information measures the mutual dependence of two random variables.
For two random variables $X$ and $Y$ it is defined as:
\[I(X;Y)=\sum_{y\in Y}\sum_{x\in X}p(x, y)\log\left(\frac{p(x,y)}{p(x)p(y)}\right)\]In decision trees, information gain is exactly the mutual information between the split attribute $X$ and the class label $Y$:
\[I(X;Y)=H(Y)-H(Y|X)\]
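A small numeric check of this identity on a made-up joint distribution $p(x,y)$ (the numbers are arbitrary, chosen only to sum to 1): the double-sum definition and $H(Y)-H(Y|X)$ give the same value.

```python
import numpy as np

# Joint distribution p(x, y) for a binary attribute X (rows) and binary label Y (columns).
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])

p_x = p_xy.sum(axis=1)   # marginal p(x)
p_y = p_xy.sum(axis=0)   # marginal p(y)

# Mutual information from the double-sum definition.
mi = sum(p_xy[i, j] * np.log2(p_xy[i, j] / (p_x[i] * p_y[j]))
         for i in range(2) for j in range(2))

# The information-gain form H(Y) - H(Y|X).
h_y = -np.sum(p_y * np.log2(p_y))
h_y_given_x = -sum(p_xy[i, j] * np.log2(p_xy[i, j] / p_x[i])
                   for i in range(2) for j in range(2))

print(mi, h_y - h_y_given_x)  # the two values agree
```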
ID3 Algorithm
Build the tree recursively, starting with all training examples at the root:
- If all examples at the current node are positive (or all negative), return a single leaf node with that label.
- Otherwise split on the attribute that best classifies the current examples (highest information gain) and recurse on each resulting subset.
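A sketch of this recursion for categorical attributes, assuming a majority-label leaf when no attributes remain; it is an illustrative reconstruction, not a full ID3 implementation (no handling of continuous values or pruning):

```python
import numpy as np
from collections import Counter

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(y, x_col):
    n = len(y)
    weighted = sum((np.sum(x_col == v) / n) * entropy(y[x_col == v])
                   for v in np.unique(x_col))
    return entropy(y) - weighted

def id3(X, y, attributes):
    """X: 2-D array of categorical attribute values, y: labels,
    attributes: column indices still available for splitting.
    Returns a leaf label or a dict {(attribute, value): subtree}."""
    # Base cases: pure node, or no attributes left -> majority label.
    if len(np.unique(y)) == 1:
        return y[0]
    if not attributes:
        return Counter(y).most_common(1)[0][0]

    # Split on the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(y, X[:, a]))
    remaining = [a for a in attributes if a != best]

    tree = {}
    for v in np.unique(X[:, best]):
        mask = X[:, best] == v
        tree[(best, v)] = id3(X[mask], y[mask], remaining)
    return tree

# Tiny demo: attribute 0 perfectly predicts the label, attribute 1 is noise.
X = np.array([["a", "p"], ["a", "q"], ["b", "p"], ["b", "q"]])
y = np.array([1, 1, 0, 0])
print(id3(X, y, [0, 1]))   # splits on attribute 0 only
```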
Decision Tree Features
- Hypothesis space is complete
- Only single hypothesis as output
- No backtracking
- Statistically based choices
- Exact inductive bias is hard to characterise
- Shorter trees are preferred
- Trees with high information gain near root are preferred
Overfitting
Hypothesis $h\in \cal H$ overfits the training data $X$ if $\exists h'\in\cal H$ such that:
$E(h|X)< E(h'|X)$ but $E_p(h)>E_p(h')$, where $E(\cdot|X)$ is the error on $X$ and $E_p(\cdot)$ the error over the entire data distribution $p$.
Strategies:
- Stop splitting when the information gain becomes insignificant
- Post-pruning: prune subtrees or rules after growing the full tree (example after this list)
- Use ensembles (e.g. random forests)
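As one concrete example of post-pruning (not necessarily the variant meant above), a sketch using scikit-learn's cost-complexity pruning, selecting the pruning strength `ccp_alpha` on a held-out validation set; the dataset is synthetic and only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Candidate pruning strengths from the cost-complexity pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Pick the alpha that performs best on the validation set.
best_alpha = max(
    path.ccp_alphas,
    key=lambda a: DecisionTreeClassifier(ccp_alpha=a, random_state=0)
                  .fit(X_train, y_train).score(X_val, y_val),
)
pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X_train, y_train)
print(pruned.tree_.node_count, pruned.score(X_val, y_val))
```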
Problems
- Real-valued attributes: sort the values and consider splits at midpoints (sketch after this list)
- Alternative measures for selecting attributes: gain ratio / cost-sensitive information gain
- Missing attribute values: use the most common value at the node / the most common value among examples with the same target value / assign probabilities based on the value distribution at the node
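For the real-valued case, a sketch that evaluates the midpoint between each pair of consecutive distinct sorted values and keeps the threshold with the highest information gain; `best_threshold` is an illustrative helper name:

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def best_threshold(values, y):
    """Return the midpoint threshold with the highest information gain
    for a binary split values <= t vs. values > t."""
    values, y = np.asarray(values, dtype=float), np.asarray(y)
    uniq = np.unique(values)                       # sorted distinct values
    candidates = (uniq[:-1] + uniq[1:]) / 2.0      # midpoints between neighbours
    base, n = entropy(y), len(y)

    def gain(t):
        left = values <= t
        return base - (left.sum() / n) * entropy(y[left]) \
                    - ((~left).sum() / n) * entropy(y[~left])

    best = max(candidates, key=gain)
    return best, gain(best)

# Labels flip at 2.5, so the best midpoint is 2.5 with gain 1 bit.
print(best_threshold([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1]))
```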
Advantages / Disadvantages
- (+) Easy to explain
- (+) No hyperparameters
- (-) Prone to overfitting
- (-) Lower accuracy than other approaches
Random Forests
Instance Bagging / Bootstrap Aggregation
- Generate $K$ different bootstrapped training datasets (sampling with replacement)
- Learn a decision tree on each
- Final prediction is the majority vote (classification) or average (regression) over all $K$ trees.
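A sketch of instance bagging with $K$ bootstrapped decision trees and a majority vote, using scikit-learn trees on synthetic data ($K=25$ and the dataset sizes are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

K = 25
trees = []
for _ in range(K):
    # Bootstrap sample: draw n training indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# Majority vote over the K trees (labels are 0/1).
votes = np.stack([t.predict(X_test) for t in trees])
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("bagged accuracy:", np.mean(y_pred == y_test))
```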
Feature Bagging
- Improves on instance bagging by decorrelating the trees
- Build each tree considering only a random subset of $m$ of the $D$ features at each split, typically $m\approx \sqrt D$
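Random forests combine both ideas; a minimal example using scikit-learn's `RandomForestClassifier`, where `max_features="sqrt"` corresponds to $m\approx\sqrt D$ features considered at each split (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree is grown on a bootstrap sample (instance bagging) and only
# ~sqrt(D) randomly chosen features are considered per split (feature bagging).
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print("random forest accuracy:", forest.score(X_test, y_test))
```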