Bootstrapping methods for NLP is a huge area, and we have invariably left out some important work. For that, we apologize in advance.

# Self-training

**David Yarowsky. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. ACL 1995.**

The "Yarowsky algorithm" of this paper is the most famous early application of self-training to NLP. Yarowsky addresses the word sense disambiguation problem. While he calls his method "unsupervised", it falls under our semi-supervised learning rubric since his seed listt effectively functions as labeled data. For each word, he divides sources of information into "one sense per discourse" and "one sense per collocation". His algorithm proceeds to repeatedly learn collocations (in the form of a decision list) to support a particular sense. He corrects these collocations by allowing only one sense per discourse. While he does not explicitly address multiple independent views or how they can affect one another, he does mention the importance of redundant information in unsupervised learning.

**Steven Abney. Understanding the Yarowsky Algorithm. Computational Linguistics 2004.**

In this paper Steven Abney proposes several variants of the Yarowsky algorithm that can be analyzed in terms of the objective functions they minimize. Some of the most interesting results are that algorithms very similar to the original Yarowsky algorithm minimize an upper bound on negative log-likelihood. In the end, though, this work does not relate the losses minimized to our true objective metric in semi-supervised learning: error.

**David McClosky and Eugene Charniak. Effective Self-training for Parsing. NAACL 2006.**

A recent, successful large-scale application of bootstrapping techniques to one of the most well-studied canonical NLP problems. The paper demonstrates the best results (as of this writing) on the standard Penn treebank Wall Street Journal test set. At a high level, the Charniak parser works in two stages: a generative parser followed by a discriminative re-ranker. The self-training process proceeds by re-training the generative model on the outputs of the discriminative model. Unfortunately, the Charniak parser is a complicated machine, and it is not clear from the paper exactly how and when the self-training procedure works.

# Co-training and co-boosting

**Avrim Blum and Tom Mitchell. Combining Labeled and Unlabeled Data with Co-training. COLT 1998.**

The original co-training paper was the first to articulate explicitly assumptions under which a self-training algorithm could perform well. Blum and Mitchell give a PAC-style, realizable bound on learning, where they assume two independent views of the data (subsets of feature space), each with its own concept class C1 and C2. They show that if

1. C1 and C2 both contain concepts $f_1$ and $f_2$ such that for every <x,y> pair, $y = f_{1}(x_{1})$ and $y = f_{2}(x_{2})$

and

2. X1 and X2 are conditionally independent. That is, $p(X1,X2 | Y) = p(X1 | Y)p(X2 | Y)$

then a a weak learner can be boosted into a good model using only unlabeled data. They also give a co-trained Naive Bayes algorithm which is not directly related to their theory, but which exploits two views effectively for web page classification.

**Michael Collins and Yoram Singer. Unsupervised Models for Named Entity Classification. EMNLP 1999.**

The co-boosting model suggested by Collins and Singer is the primary co-training model of the tutorial, so a lot of space is devoted to it in the slides themselves. In addition to co-boosting, Collins and Singer also test several other bootstrapping models, including an EM model, and several variants of the Yarowsky algorithm. Because of this, and a thorough discussion, the paper is a particularly worthwhile read for people interested in co-training and bootstrapping in general.

**David Rosenberg and Peter Bartlett. The Rademacher Complexity of Co-reguarlized Kernel Classes. AISTATS 2007.**

Rosenberg and Bartlett give a theoretical analysis of multi-view learning in terms of co-regularization. This paper examines a linear (kernel) regression problem, with two views. Instead of explicitly retraining on the unlabeled data, they consider a regularization term which penalizes the disagreement between the two views $(f_{1}(x_{1})- f_{2}(x_{2}))^2$. The main result of this paper bounds the Rademacher complexity of the kernel class in terms of the value of this regularization term.

**Kedar Bellare, Partha Pratim Talukdar, Giridhar Kumaran, Fernando Pereira, Mark Liberman, Andrew McCallum and Mark Dredze. Lightly Supervised Attribute Extraction for Web Search. NIPS 2007 Workshop on Machine Learning for Web Search.**

We chose to present the problem and results from this paper because we believe it represents an interesting application of semi-supervised learning techniques to a non-standard problem. Bellare et al. motivate their work using the problem of query expansion, and they ask the question "How could one expand the query of a particular entity using its attributes?" They present results for both their co-training and self-training methods on an information retrieval task.

# SSL with latent variables

**Bernard Merialdo. Tagging English Text with a Probabilistic Model. Computational Linguistics 1994.**

Merialdo's main purpose in this paper is to compare supervised (training by counting co-occurrence of words and tags) and unsupervised (treating the tags as hidden and using EM) methods. His conclusion is that unlabeled data is only useful when the amount of labeled data is very small. Otherwise, augmenting labeled data with lots of unlabeled data will only degrade performance.

**Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning Journal 2000.**

This paper describes extensive experiments on using Naive Bayes together with EM and unlabeled data for the problem of document classification. Nigam et al. perform extensive experiments on a variety of text classification datasets. They show that in many cases, they can achieve reasonable performance using only a small number of labeled examples. The takeaway message from this paper is most similar to that of Merialdo, however. Only with very small amounts of labeled data is the unlabeled data helpful for their tasks.

# Using linguistic side information

**Aria Haghighi and Dan Klein. Prototype-driven Learning for Sequence Models. NAACL 2006.**

The prototype-driven learning model of Haghighi and Klein is one of the models we focus on in the tutorial, so once again we have already devoted significant space to it in the slides. All of the papers we discuss in the tutorial use seed lists to "label" unlabeled data, but only Haghighi and Klein use it for sequences and focus explicitly on the benefit that may be derived from this setup. Coupled with distributional similarity, they are able to derive effective models using only prototypes, rather than a more traditional labeled data set.

**Aria Haghighi's open source implementation of prototype-driven sequence learning.**

The above links to a GPL-ed re-implementation of the Haghighi and Klein NAACL 2006 paper.

**Gregory Druck, Gideon Mann, and Andrew McCallum. Learning from Labeled Features using Generalized Expectation Criteria. SIGIR 2008.**

This paper generalizes the idea of linguistic constraints, and provides an interesting set of experiments to test the usefulness of such labeled features. Druck et al. consider a discriminative model where individual features are labeled with some number of labels (possibly 0). They then propose to find a model which has minimum KL divergence to the empirical distribution derived from these constraints and unlabeled data, along with a regularization term. Their experiments include a "labeling test" where they time annotators labeling features vs. labeling instances. They show that labeling features is typically much more efficient than labeling instances, for small amounts of data.