
In this section, we survey a selection of methods that aim to increase the recall of weak models. These are the existing methods we found to be closest to the needs of the Label Expansion step of the framework presented in chapter 3.

2.3.1 Pairwise Feedback

Constrained clustering techniques introduce pairwise feedback to improve clustering performance in a semi-supervised setting. Pairwise feedback, in the form of pairwise linkage constraints, is a set of must-link pairs and a set of cannot-link pairs that indicate whether the two data points of a pair belong to the same class. Inspired by this concept, Boecking and Dubrawski [2019] incorporate pairwise feedback into the data programming framework to increase labeling accuracy. Pairwise feedback is used to tie labeling functions together across different samples in order to improve the modeling of the latent class variable in the data programming generative model.

Pairwise linkage constraints are encoded as a symmetric matrix A ∈ {−1, 0, 1}^{n×n}, where n is the dataset size. Often, only must-link pairs are available and pairwise linkage is thus encoded as A ∈ {0, 1}^{n×n}. The values of A are encoded as follows (a short code sketch of this encoding is given after the list):

Must-link – data points i and j that should be in the same class are encoded as A_{i,j} = 1

Cannot-link – data points i and j that should not be in the same class are encoded as A_{i,j} = −1

Abstain – no feedback on class membership of data points i and j is encoded as A_{i,j} = 0
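
The encoding itself is straightforward. The following sketch (our illustration, not code from Boecking and Dubrawski [2019]) builds A from lists of must-link and cannot-link index pairs using NumPy:

```python
import numpy as np

def build_constraint_matrix(n, must_link, cannot_link):
    """Encode pairwise linkage constraints as a symmetric n x n matrix.

    must_link and cannot_link are iterables of (i, j) index pairs;
    all remaining entries stay 0 (abstain).
    """
    A = np.zeros((n, n), dtype=np.int8)
    for i, j in must_link:
        A[i, j] = A[j, i] = 1      # same class
    for i, j in cannot_link:
        A[i, j] = A[j, i] = -1     # different classes
    return A

# Example: 5 data points, one must-link pair and one cannot-link pair.
A = build_constraint_matrix(5, must_link=[(0, 1)], cannot_link=[(2, 4)])
```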

Domain-specific heuristics created by a domain expert may be used to construct pairwise linkage constraints for the problem at hand. Furthermore, Mutual k-Nearest Neighbor (MKNN) graphs built from a similarity function defined across the dataset may be used to construct the constraints. The choice of similarity function is left up to the domain expert.

Boecking and Dubrawski [2019] test data programming with pairwise feedback on the 20 Newsgroups text classification dataset.6 Sixteen keyword-based labeling functions gathered from domain experts achieved 46.94% accuracy using a majority vote and 56.45% accuracy using Snorkel. MKNN pairs constructed from TF-IDF similarity (see chapter 1) by computing 10 nearest neighbors increased the accuracy all the way to 75.98%.
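
As an illustration of the MKNN construction, a minimal sketch using scikit-learn's TfidfVectorizer and NearestNeighbors is given below; the exact pipeline and parameters of Boecking and Dubrawski [2019] may differ.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def mknn_must_link_pairs(documents, k=10):
    """Return mutual k-nearest-neighbor pairs under TF-IDF cosine similarity."""
    X = TfidfVectorizer().fit_transform(documents)
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(X)
    # Indices of the k nearest neighbors of each document (column 0 is the document itself).
    neigh = nn.kneighbors(X, return_distance=False)[:, 1:]
    neigh_sets = [set(row) for row in neigh]
    pairs = []
    for i, row in enumerate(neigh):
        for j in row:
            # Keep the pair only if the relation is mutual, and count it once.
            if i < j and i in neigh_sets[j]:
                pairs.append((i, j))
    return pairs
```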

2.3.2 Epoxy

Large language models such as BERT can be used as the discriminative model in weak supervision. After fine-tuning for the classification problem on annotated data, the language model is equipped to find synonyms, relax word order, or otherwise provide predictions beyond the scope of heuristics-based labeling functions. Although preferable to training from scratch, fine-tuning still requires non-negligible training time. This hinders the possibility for the data programmer to interactively develop labeling functions with feedback from the discriminative model.

Label propagation [Zhu and Ghahramani, 2002, Iscen et al., 2019] is a technique in semi-supervised learning where labels from annotated data are propagated through unannotated data using an embedding space. Epoxy7 [Chen et al., 2020] takes inspiration from label propagation. As seen in Figure 2.8, labels generated by weak supervision are propagated across unlabeled data using pre-trained language model embeddings to guide the propagation. The embeddings need not be tuned. High-accuracy, low-coverage labeling function predictions are extended to nearby unannotated training data points, increasing coverage.
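
To make the mechanism concrete, the following simplified sketch extends a single labeling function's votes to abstained points whose embeddings fall within a cosine-distance threshold of a voted point. This is our illustration only; the actual Epoxy method involves further modeling details not reproduced here.

```python
import numpy as np

def extend_lf_votes(embeddings, votes, radius=0.3):
    """Extend one labeling function's votes to nearby abstained points.

    embeddings : (n, d) array of fixed, pre-trained text embeddings
    votes      : length-n array with values in {-1, 0, 1}, where 0 means abstain
    radius     : cosine-distance threshold within which votes are copied
    """
    # Normalize rows so that the dot product equals cosine similarity.
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labeled = np.flatnonzero(votes != 0)
    abstained = np.flatnonzero(votes == 0)
    extended = votes.copy()
    if labeled.size == 0 or abstained.size == 0:
        return extended
    # Cosine distance from every abstained point to every voted point.
    dist = 1.0 - E[abstained] @ E[labeled].T
    nearest = dist.argmin(axis=1)
    close_enough = dist[np.arange(len(abstained)), nearest] <= radius
    # Copy the vote of the nearest voted point if it lies within the radius.
    extended[abstained[close_enough]] = votes[labeled[nearest[close_enough]]]
    return extended
```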

The advantage of this approach is its interactivity. No model fine-tuning is necessary and the whole procedure takes less than half a second. The authors' evaluation shows that the accuracy of a generative model enhanced by Epoxy label propagation is comparable to using a fine-tuned BERT as the discriminative model.

6https://scikit-learn.org/stable/datasets/index.html#newsgroups-dataset

7https://github.com/HazyResearch/epoxy

Figure 2.8: Epoxy Overview, from Chen et al. [2020]

2.3.3 Passage Ranking

Passage ranking is a task in information retrieval where a query and associated candidate passages are given and a model must learn to rank the passages by judged relevance. The most common passage ranking baselines are the unsupervised methods TF-IDF and BM25. State-of-the-art methods make use of supervised learning. However, hand-labeling is difficult. Xu et al. [2019] propose to use weak supervision on the passage ranking task and describe successful weak supervision strategies.

To define labeling functions, the ranking problem is transformed into a classification problem that decides whether a passage is strongly related to a query. Given a similarity function between the query and candidate passages, a labeling function is automatically created using positive and negative sampling. For the top-1 passage, the labeling function outputs a positive label. For the bottom half of passages, a negative label is returned. For the remaining passages, the labeling function abstains.

The following similarity functions are used as sources of weak supervision:

1. BM25 score

2. TF-IDF score

3. cosine similarity of Universal Sentence Encoder representations

4. cosine similarity of the last hidden layer of a pre-trained BERT
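
This sampling scheme can be expressed as a generic labeling function template. The sketch below is our illustration, not the exact implementation of Xu et al. [2019]; it assumes a `similarity(query, passage)` callable (e.g. one of the four scores above) and illustrative label constants.

```python
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

def make_ranking_lf(similarity):
    """Turn a query-passage similarity function into a labeling function."""
    def lf(query, passages, index):
        """Label the passage at `index` among the candidate `passages`."""
        scores = [similarity(query, p) for p in passages]
        ranking = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
        rank = ranking.index(index)
        if rank == 0:                         # top-1 passage: strongly related
            return POSITIVE
        if rank >= len(passages) / 2:         # bottom half of the ranking: not related
            return NEGATIVE
        return ABSTAIN                        # middle of the ranking: no vote
    return lf
```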

2.3.4 Iterative Expansion for Text

The authors of SLP (subsection 2.2.1) follow up with an iterative data programming method [Mallinar et al., 2020] that introduces an iterative batch label-expansion procedure for Snorkel, combining traditional data programming with Elasticsearch's "More Like This" feature to expand labels in batches and thus augment the created training datasets. It is essentially an iterative procedure that uses search to rank and select unlabeled examples by leveraging labeled examples. The newly selected examples form a weak supervision strategy and thus become a labeling function in data programming.
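
A rough sketch of one expansion step using Elasticsearch's "More Like This" query is given below. The index name, field name, batch size, and client version (elasticsearch-py 8.x) are our assumptions; the exact query configuration of Mallinar et al. [2020] is not reproduced here.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def expand_labels(labeled_texts, batch_size=50):
    """Retrieve unlabeled documents most similar to the already-labeled ones."""
    query = {
        "more_like_this": {
            "fields": ["text"],       # placeholder field name
            "like": labeled_texts,    # seed examples for this class
            "min_term_freq": 1,
            "min_doc_freq": 1,
        }
    }
    response = es.search(index="documents", query=query, size=batch_size)
    # The top-ranked hits become the next batch of weakly labeled examples.
    return [hit["_source"]["text"] for hit in response["hits"]["hits"]]
```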