Graph-based SSL

# Minimum cuts

Avrim Blum and Shuchi Chawla. Learning from Labeled and Unlabeled Data using Graph Mincuts. ICML 2001.

The original "semi-supervised learning with mincuts" paper. Blum and Chawla describe an algorithm analogous to "transductive nearest neighbor". Given annotated positive and negative instances with a graph whose edges represent similarity, the algorithm suggests computing the minimum cut between positive and negative instances. Blum and Chawla motivate this algorithm under a particular generative assumption, where instances are generated in particular well-behaved regions, and all the instances in a particular region have the same label. They demonstrate results on several of the UCI datasets.

Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast Approximate Energy Minimization via Graph Cuts. Pattern Analysis and Machine Intelligence.

Boykov et al. study graph cuts from the perspective of computer vision, where they are an important mechanism for formulating problems such as segmentation and de-noising. Unlike Blum and Chawla, who study binary classification, this paper examines multi-way min cuts, which are NP-hard to find. Boykov et al. propose an algorithm for approximately solving multi-way min cuts and test it on a variety of vision problems (but not on semi-supervised learning). While the problems are not directly related to NLP, the paper is interesting and important because of its exploration of approximate labeling algorithms under a variety of settings.

Bo Pang and Lillian Lee. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. ACL 2004.

The most famous application of graph-based min cuts in NLP: Pang and Lee use graph cuts to identify sentence subjectivity in documents. This paper has considerable space devoted to it in the presentation. Pang and Lee train isolated sentence "subjectivity classifiers" for movie reviews by learning a model to distinguish movie reviews from movie summaries. The isolated scores for sentences are combined into a per-document graph using proximity (in the document itself). After finding the subjective sentences in a document using min cuts, Pang and Lee then proceed to use only these sentences to classify a document as having positive or negative sentiment.

# Harmonic functions

Xiaojin Zhu, Zoubin Gharhamani, and John Lafferty. Semi-supervised Learning using Gaussian Fields and Harmonic Functions. ICML 2003.

The bulk of the harmonic functions section of the tutorial is devoted to this paper. It directly addresses many aspects of the harmonic function approach to learning. Given a graph on labeled and unlabeled instances, Zhu et al. introduce the Gaussian field formulation for labeling functions on the graph. They also demonstrate connections to random walks, normalized cuts, and graph kernels. Finally they suggest algorithms for discovering the energy-minimizing (harmonic) function f, including the closed form solution suggested by the connection to random walks and electric networks. At the end of the work, they also suggest methods for learning similarity parameters from data.

Zheng-Yu Niu, Dong-Hong Ji, and Chew-Lim Tan. Word Sense Disambiguation Using Label Propagation and Semi-supervised Learning. ACL 2005.

This is paper serves as one of the examples for the harmonic functions section. Niu et al. apply label propagation to the problem of semi-supervised word sense disambiguation. They create a graph based on direct similarity between instance vectors, using either Euclidean distance or Jensen-Shannon distance (normalizing the positive-valued feature vectors to create probability distributions) and compare the performance of the harmonic function labeler to supervised models.

Andrew Goldberg and Jerry Zhu. Seeing stars when there aren't many stars: Graph-based semi-supervised learning for sentiment categorization. NAACL 2006 TextGraphs workshop.

In this paper, Goldberg and Zhu directly apply harmonic functions to sentiment categorization. They define a graph where each instance node is connected to a label node with a particular strength. For labeled data, the connection is given strength 1. For unlabeled data, the connection is given some strength proportional to an external classifier. In addition to this, they connect neighboring instances based on a variety of heuristics. As in Zhu, Gharhamani, and Lafferty (2003), their energy function is quadratic, allowing them to compute the minimum-energy labeling of the graph in closed form. For small amounts of labeled data, they are able to significantly improve over a supervised baseline.

# Manifold regularization

Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. JMLR 2005.

If you read a single paper on manifold regularization, this should be the one. It begins by describing the technique of using the graph Laplacian as a regularizer derived from unlabeled data in a semi-supervised learning framework. Belkin et al. motivate the graph Laplacian by showing that it converges to the Laplace-Beltrami operator on a manifold. This operator, in turn, can be seen as measuring smoothness of a function on a manifold. In the case of finite labeled and unlabeled data, they prove a representer theorem for the solution of a manifold-regularized loss minimization problem. Then they use this theorem to formulate Laplace-regularized SVM and least squares algorithms. Unlike the harmonic function estimation techniques of last section, the end result of these algorithms are classification (or regression) functions. Thus there is no need for specialized procedures to label out-of-sample test points. Belkin et al. test their algorithms on a variety of datasets (including the WebKB text classification dataset).

Andrew Goldberg, Xiaojin Zhu, and Stephen Wright. Dissimilarity in graph-based semi-supervised classification. AISTATS 2007.

In this paper, Goldberg, Zhu, and Wright explore the use of dissimilarity information in graphs. Each pair of instances $x_i$ and $x_j$ may have one of two types of edges: positive or negative. These edges may be represented in an a matrix $S$ where $s_{ij} \in \{-1,0,1\}$ (indicating negative-edge, no-edge, or positive-edge, respectively). For binary classification, the resulting per-edge penalty has the form $w_{ij}(f_{i}-s_{ij}f_{j})^2$. Goldberg et al. also derive a multiclass version of this penalty and use a "mixed-graph" manifold regularization technique to learn semi-supervised classifiers. They demonstrate the usefulness of negative weights in the particular context of political view (left-wing or right-wing) classification. For this task, they devise helpful heuristics which predict disagreement between posts. This disagreements correspond to opposing political views and can be computed from unlabeled data.

Dengyong Zhou, Chris Burges, and Tao Tao. Transductive Link Spam Detection. Workshop on Adversarial Information Retrieval on the Web, 2007.

This paper describes an application of graph-based transductive learning to the problem of link spam detection. Given a graph whose nodes are hosts and whose edges are hyperlinks between pages on different hosts, the task is to determine whether a distinct host is spam or not. The directedness of edges plays a crucial role in this task, since a host that points to spam hosts is likely to be spam itself, but a host which is pointed to by spam hosts is not necessarily spam. Zhou et al. employ a directed analog to the harmonic function energy minimization problem of Zhu, Gharhamani, and Lafferty (2003). They then solve a quadratic minimization problem to find the optimal labeling of a graph. In their case, the transductive setting is more appropriate, since many web search engines employ a static index and do not attempt to dynamically classify a web page as spam for every query.

John Lafferty and Larry Wasserman. Statistical Analysis of Semi-Supervised Regression. NIPS 2007.

In this paper, Lafferty and Wasserman examine the manifold and semi-supervised smoothness assumptions from the perspective of minimax convergence theory. Its main result is quite surprising: Suppose we are performing kernel regression with the standard product kernel. Then from the minimax perspective, manifold regularization using unlabeled data does not yield faster rates of convergence than simply using the labeled data with an appropriate choice of bandwidth. The bandwidth parameter itself can be chosen on a small heldout data set. While surprising, this negative result does seem to coincide with empirical observations on graph-based SSL: Graphs that are based directly on feature vectors do not consistently improve baseline supervised models. On the other hand, this result does not hold for graphs built in other ways (i.e. from hyperlink structure or via negative links).