Cross-training: Learning probabilistic relations between taxonomies
Cross-training: Learning
probabilistic relations
between taxonomies
Sunita Sarawagi
Soumen Chakrabarti
Shantanu Godbole IIT Bombay
Chakrabarti
KDD2003
2
Document classification
Set of labels A Bookmark folders, Yahoo topics Training documents, each with one A label Supervised approach Use training docs to induce classifier Invoke classifier on each unlabeled document in isolation Semi-supervised approach Unlabeled documents available during training Nigam et al. show how to exploit them collectively
Chakrabarti
KDD2003
3
Cross-training from another taxonomy
Another set of labels B, partly related but not identical to A A=Dmoz topics, B=Yahoo topics A=Personal bookmark topics, B=Yahoo topics Training docs come in two flavors now Fully labeled with A and B labels (rare) Half-labeled with either an A or a B label Can B make classification for A more accurate (and vice versa)? Inductive transfer, multi-task learning
DA
DB
Chakrabarti
KDD2003
4
Motivation
Symmetric taxonomy mapping Ecommerce catalogs: A=distributor, B=retailer Web directories: A = Dmoz, B = Yahoo Incomplete taxonomies, small training sets Bookmark taxonomy vs. Yahoo Cartesian label spaces
UK
USA
00/font>
Regional
Top
Sports
Baseball
Cricket
Region00/font>
00/font>Topic
Label-pair-
conditioned
term distribution
Chakrabarti
KDD2003
5
Obvious approach: Labels as features
A-label known, estimate B-label Suppose we have A+B labeled training set Discrete valued 00abel column0000/font>
Multinomial na茂ve Bayes too biased, cannot balance heterogeneous features Do not have fully-labeled data Must guess 00/font> (use soft scores instead of 0/1)
Term feature values
00/font>
00/i>
Augmented feature vector
Target label
Chakrabarti
KDD2003
6
SVM-CT: Cross-trained SVM
S(A,0)
Train
DA00/font>DB
Docs having only A-labels
One-vs-rest SVM
ensemble for A:
returns |A| scores
for each test doc
(signed distance
from separator)
DB00/font>DA
Docs having only B-labels
Test
Test
output
00/font> t 00/font>
00/font>|A|00/font>
00/i>
Label
Text features
S(B,1)
Train
One-vs-rest SVM
ensemble for B
(target label set)
Test case with
A-label known
(coded using a
vector of +1
and 00)
Term features
00/font>1,00/font>,00/font>1,+1,00/font>1,00/font>
S(A,1)
S(B,2)
S(A,2)
00/font>
Chakrabarti
KDD2003
7
SVM-CT anecdotes
Discriminant reveals relations between A and B One-to-one, many-to-one, related, antagonistic However, accuracy gains are meager
Positive
Negative
Chakrabarti
KDD2003
8
EM1D: Info from unlabeled docs
Use training docs to induce initial classifier for taxonomy B, say Repeat until classifier satisfactory Estimate Pr(00/font>|d) for unlabeled doc d, 000B Reweigh d by factor Pr(00/font>|d) and add to training set for label 00/font> Retrain classifier EM1D: Expectation maximization with one label set B (Nigam et al.) Ignores labels from another taxonomy A
Chakrabarti
KDD2003
9
Stratified EM1D
Target labels = B B-labeled docs are labeled training instances Consider A-labeled docs labeled 00/font> These are unlabeled for taxonomy B Run EM1D for each row 00/font> Test instance has 00known Invoke semi-supervised model for row 00to classify EM2D minus 2D model interaction
00/font>A topics
B-topics00/font>
Docs in DA00i>DB
labeled 00/font>
00/font>
DB00i>DA: docs
with B-labels
Docs in DA00i>DB
labeled 00/font>00/font>
Chakrabarti
KDD2003
10
EM2D: Cartesian product EM
Initialize with fully labeled
docs which go to a
specific (0000 cell Smear training doc
across label row or column Uniform smear could be bad Use a na茂ve Bayes classifier to seed Parameters extended from EM1D 00sub>00/sub>,00/sub> prior probability for label pair (0000 00sub>00/sub>,00/sub>,t multinomial term probability for (0000
Labels in A00/font>
Labels in B00/font>
A-labeled doc
B-labeled doc
Chakrabarti
KDD2003
11
EM2D updates
E-step for an A-labeled document
M-step
Updated
class-pair
priors
Updated
class-pair-
conditioned
term stats
Chakrabarti
KDD2003
12
Applying EM2D to a test doc
Mapping a B-labeled test doc d to an A label (e-commerce catalogs) Given 00/font>, find argmax00/sub> Pr(0000d) Classifying a document d with no labels to an A label Aggregation For each 00 compute 00sub>00/sub> Pr(0000d), pick best 00/font> Guessing (EM2D-G) Guess the best 00/font>* using a B-classifier Find argmax00/sub> Pr(0000|d) EM pitfalls: damping factor, early stopping
Chakrabarti
KDD2003
13
Experiments
Selected 5 Dmoz and Yahoo subtree pairs Compare EM2D against Na茂ve Bayes, best #features and smoothing EM1D: ignore labels from other taxonomy, consider as unlabeled docs Stratified EM1D Mapping test doc with A-label to B-label or vice versa Classifying zero-labeled test doc Accuracy = fraction with correct labels
Chakrabarti
KDD2003
14
Accuracy benefits in mapping
EM1D and NB are close, because training set sizes for each taxonomy are not too small EM2D > Stratified EM1D > NB 2d transfer of model info seems important
Improvement
over NB: 30% best,
10% average
Chakrabarti
KDD2003
15
Asymmetric setting
Few (only 300) bookmarked URLs
(taxonomy B, target) Many Yahoo URLs, larger number of classes (taxonomy A) Need to control damping factor (= importance of labeled :: unlabeled) to tackle population skew
Chakrabarti
KDD2003
16
Zero-labeled test documents
EM1D improves accuracy only for 12 train docs EM2D with guessing improves beyond EM1D In fact, better than aggregating scores to 1d Choice of unlabeled:labeled damping ratio L may be important to get benefits
Chakrabarti
KDD2003
17
Robustness to initialization
Seeding choices: hard (best class), NB scores, uniform Smear a fraction uniformly, rest by NB scores EM2D is robust to wide range of smear fractions Fully uniform smearing can fail (local optima)
Uniform
smear
Na茂ve
Bayes
smear
Chakrabarti
KDD2003
18
Related work
Multi-task learning, 00ife-long learning00 inductive transfer (Thrun, Caruana) Find earlier learning tasks similar to current Reuse models, features, parameters Co-training (Blum, Mitchell) Two learners over a single label set Partitioned feature set Catalog mapping (Agrawal, Srikant) Two-label docs to estimate priors Raise prior to exponent, tune by validation EM2D: generative model, slightly better accuracy
Chakrabarti
KDD2003
19
Summary and future work
Two algorithms for cross-training EM-based semi-supervised algorithm EM2D SVM-based algorithm SVM-CT Benefits Improved accuracy Interpretable mappings between label sets General issue: how best to deal with a large number of heterogeneous attributes? Future work Brittle na茂ve Bayes scores in EM2D Small relative gains in SVM-CT:
better kernels? feature selection?
Given the extra label info, can we improve upon transductive algorithms which use only a pool of unlabeled documents?
Because SVM has its own bias. (Narrate self-testing experiment?)
Turns out that this does not do as well as our final choice00/font>
Priors pi reveal correlation or otherwise between a pair of labels.
Discuss figs 10, 14, 12
What is L? NB in only one L
filetype:ppt
file time: 2008-02-16
file size:199168
Click Here To Download...