# Learning Decision Trees, DNF, and Augmented Decision Trees (Computational and Statistical Learning Theory, Lecture 2)

Disclaimer: these writeups aren’t detailed to be sufficient for learning or even making sense of the material; they’re just detailed enough to force myself to re-work all the theorems from scratch (and littered with enough keywords so that I can Google up my more detailed handwritten notes). I don’t have handwritten notes for the first few lectures, but I’m writing them up anyway, just to make myself prove all the theorems.

Learning Disjunctions (or Conjunctions): Set the initial hypothesis to everything (2n). Cross off at least one variable for each mistake: MB = $O(n)$. We basically do the same thing for conjunctions.

Decision Tree: See notes for definition. Size = number of nodes = $s$.

How powerful are these compared to decision lists? Just consider parity! We need to remove the restriction that each block is a single boolean variable to express an arbitrary DT as a DL.

How large a DL do we need then? Call an n-decision list one with blocks of size n. We can obviously convert a DT to a DL by traversing the tree, and putting each traversal in a block. How can we do better? Consider the concept of rank: rank of a leaf = 0. Otherwise, rank = max(rank($T_1$,$T_2$) when $T_1 \neq rank(T_2)$ and rank $= 1 + rank(T_1) = 1 + rank(T_2)$ otherwise.

What’s the rank of a tree with $m$ leaves? At most $\log_2 m$: by induction, number of leaves is at most $2^{rank(T_1)} + 2^{rank(T_2)}$. But any rank $k$ DT is a k-DL: Induct on the depth of the tree. Let $x_i$ be the variable at some level. WLOG, $T_1$ has rank $k - 1$. So our DL is the DL of $T_1$ with $x_i$, and the DL of $T_2$.

So we can learn a DT in $n^O(\log s)$.

What’s the next step up in complexity from a DT? DNF (Sum of products for us EE folks). Why is it a step up? We can take any DT and turn it into DNF by traversing the tree and setting that traversal to one of the conjunctions.

Augmented Decision Tree: this is just a DT with DNF formulas at the leaves. If this is a t-DT, any poly sized DNF with $s$ terms over $n$ variables has a $\frac{2n}{t} \ln s + 1$-DT. Consider all terms with length > t: take the most common variable, and put it at the root. Repeat.

Now we can extend our DL to have DNFs at the outputs instead of 0s and 1s. Say we have an augmented DT with rank R and term length t. We can turn an R-DT into a extended R-DL, and convert that into an (R+t)-DL by pushing variables from the DNFs into the decision blocks. This means we can turn an $s$ term $n$ variable DNF into a $\frac{2n}{t} \ln s + 1 + t$-DL. This is minimum when $t = \sqrt{2n \ln(s)}$, so our MB is $O(n^{\sqrt n})$

’05 scribe notes