# Learning Decision Trees, DNF, and Augmented Decision Trees (Computational and Statistical Learning Theory, Lecture 2)

Disclaimer: these writeups aren’t detailed to be sufficient for learning or even making sense of the material; they’re just detailed enough to force myself to re-work all the theorems from scratch (and littered with enough keywords so that I can Google up my more detailed handwritten notes). I don’t have handwritten notes for the first few lectures, but I’m writing them up anyway, just to make myself prove all the theorems.

Learning Disjunctions (or Conjunctions): Set the initial hypothesis to everything (2n). Cross off at least one variable for each mistake: MB = $O(n)$. We basically do the same thing for conjunctions.

Decision Tree: See notes for definition. Size = number of nodes = $s$.

How powerful are these compared to decision lists? Just consider parity! We need to remove the restriction that each block is a single boolean variable to express an arbitrary DT as a DL.

How large a DL do we need then? Call an n-decision list one with blocks of size n. We can obviously convert a DT to a DL by traversing the tree, and putting each traversal in a block. How can we do better? Consider the concept of rank: rank of a leaf = 0. Otherwise, rank = max(rank($T_1$,$T_2$) when $T_1 \neq rank(T_2)$ and rank $= 1 + rank(T_1) = 1 + rank(T_2)$ otherwise.

What’s the rank of a tree with $m$ leaves? At most $\log_2 m$: by induction, number of leaves is at most $2^{rank(T_1)} + 2^{rank(T_2)}$. But any rank $k$ DT is a k-DL: Induct on the depth of the tree. Let $x_i$ be the variable at some level. WLOG, $T_1$ has rank $k - 1$. So our DL is the DL of $T_1$ with $x_i$, and the DL of $T_2$.

So we can learn a DT in $n^O(\log s)$.

What’s the next step up in complexity from a DT? DNF (Sum of products for us EE folks). Why is it a step up? We can take any DT and turn it into DNF by traversing the tree and setting that traversal to one of the conjunctions.

Augmented Decision Tree: this is just a DT with DNF formulas at the leaves. If this is a t-DT, any poly sized DNF with $s$ terms over $n$ variables has a $\frac{2n}{t} \ln s + 1$-DT. Consider all terms with length > t: take the most common variable, and put it at the root. Repeat.

Now we can extend our DL to have DNFs at the outputs instead of 0s and 1s. Say we have an augmented DT with rank R and term length t. We can turn an R-DT into a extended R-DL, and convert that into an (R+t)-DL by pushing variables from the DNFs into the decision blocks. This means we can turn an $s$ term $n$ variable DNF into a $\frac{2n}{t} \ln s + 1 + t$-DL. This is minimum when $t = \sqrt{2n \ln(s)}$, so our MB is $O(n^{\sqrt n})$

’05 scribe notes

# Mistake Bound Model and Learning Decision Lists (Computational and Statistical Learning Theory, Lecture 1)

Scribing and writing up summaries of the gnofai MCMC tutorial made me realize how useful it is to force myself to review my notes by writing up summaries, so I’m going to write up searchable summaries of my handwritten notes for the Computational/Statistical Learning Theory class Adam Klivans and Pradeep Ravikumar ran last semester. I missed the first month of class due to a scheduling conflict, but I’ll write up summaries of those lectures anyway.

Apropos of nothing, I’m still shocked by how friendly and accommodating people are here at UT. Adam was nice enough to move the class to a time that would work for me even though I was just another nameless first-semester grad student from another department.

WARNING: these writeups aren’t detailed to be sufficient for learning or even making sense of the material; they’re just detailed enough to force myself to re-work all the theorems as exercises (and detailed enough that I can Google up my handwritten notes).

Mistake Bound (MB) model: the learner wants to learn some $f$. The learner gets some unlabeled example, $x \in \{0,1\}^n$, makes a prediction based on $h$, and then finds out the truth, $f(x)$. The learner updates if it made a mistake (which is equivalent to always being able to update for obvious reasons).

Definition: A concept class, $C$, over $\{0,1\}^n$ is efficiently learnable in MB if there exists an algorithm $A$ such that for any $c \in C$, A has an MB $M(c) = poly(n,size(c))$ and its running time in each trial is also $poly(n,size(c))$, where $size(c)$ is the description length of c under some encoding scheme. If the MB is $poly(f)$ instead of $poly(n)$, i.e., if it’s poly in the “size” of the function being learned, as opposed to poly in the number of features in the domain (which we care about in the high-dimensional case), then $A$ is attribute efficient.

Learning Decision Lists: We can learn a DL ($\{0,1\}^n \rightarrow \{0,1\}$) of length k with MB $O(nk)$. Start with everything at the top of the list, and push it down every time we make a mistake, a.k.a., use the bag algorithm. This gives us $2nk$. Note that this doesn’t actually produce a DL, but the hyp we get will be consistent with $c$. Of course we can get a DL from our list of bags by picking arbitrarily from each bag.

’05 scribe notes (which seem to be broken at the moment)
’06 scribe notes