Reducing the size of the vocabulary in (bag-of-words) document representation

1. Introduction

1.1 Experimental Goal

Reducing the size of the distinguishing vocabulary in order to lower the cost of the feature extraction process.

1.2 Dataset

500 documents are randomly sampled for each of the following 20 topics:

Figure 1: 20 topics as random search queries

"Family Law", "Bankruptcy", "Criminal Law", "Administrative Law", "Damages", "Damage Awards", "Guarantee and Indemnity", "Income Tax", "Indians, Inuit and Metis", "Injunctions", "Master and Servant", "Motor Vehicles", "Municipal Law", "Quebec Family", "Real Property", "Workers' Compensation", "Food and Drug Control", "Common Law", "Contracts", "Actions", "Aliens"

The resulting dataset contains 8904 documents and 109 topics. Of these, 17 classes have more than 500 documents each, as follows:

Figure 2: 17 topics with more than 500 documents, used for the experiments in this report

(Family Law, 519) - (Real Property, 532) - (Injunctions, 515) - (Contracts, 565) - (Workers' Compensation, 502) - (Damage Awards, 521) - (Income Tax, 505) - (Criminal Law, 529) - (Guarantee and Indemnity, 503) - (Master and Servant, 508) - (Aliens, 503) - (Bankruptcy, 505) - (Damages, 517) - (Motor Vehicles, 502) - (Food and Drug Control, 501) - (Administrative Law, 583) - (Municipal Law, 506)

The following pairs of classes share more than 50 documents:

(Actions & Practice, 145) - (Guarantee and Indemnity & Practice, 53) - (Patents of Invention & Food and Drug Control, 112) - (Practice & Food and Drug Control, 67) - (Food and Drug Control & Courts, 55)

1.3 Method

The effects of Non-negative Matrix Factorization (NMF) and the chi-squared test for dimension reduction (feature selection) are assessed and compared in the following classification pipeline.

Table 1: Classification pipeline

  Feature extraction: tf-idf of documents for a given vocabulary
  Feature selection (to be investigated): chi-squared test & Non-negative Matrix Factorization
  Classifier: Multinomial Naive Bayes (17 classes)
  Evaluation metrics: precision, recall, F-score
  Implementation tool: scikit-learn

Given a vocabulary, the classification pipeline is implemented as follows:

  1. Multi-topic documents are replicated once per topic, and a single topic is assigned to each copy.

  2. Tf-idf vectors are calculated for training documents, using only the words included in the vocabulary.

  3. A Multinomial Naive Bayes classifier is trained using the features and labels acquired in steps 1 and 2 for the training documents.

  4. Term frequencies are calculated for each test document. Using the IDF values computed on the training documents in step 2, tf-idf vectors are calculated for the test documents.

  5. The classifier trained in step 3 is applied to predict the topic of each test document, using the tf-idf vector calculated in step 4.

  6. The confusion matrix is calculated by macro averaging of predictions over all topics.

The above classification pipeline is run before and after feature selection to compare the effect of feature selection. Note that step 1 is a simplifying step that makes it possible to study the effect of dimension reduction in a single-label classification task. The pipeline assigns a single topic to each document, which in the best case is counted as a success in only one class and as a failure in the other related topics. There is therefore a practical upper bound on accuracy in this task, and a 100% classification rate is impossible. Classification metrics are thus only useful for comparison and for assessing the effect of vocabulary size reduction.
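The steps above can be sketched end to end with scikit-learn. The toy corpus, labels, and k value below are placeholders for illustration; the report's actual dataset, vocabulary, and feature-selection settings differ:

```python
# Minimal sketch of the Table 1 pipeline: tf-idf -> feature selection ->
# multinomial naive Bayes -> macro-averaged metrics. Toy corpus, not the
# real dataset.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_recall_fscore_support

train_docs = ["the debtor filed for bankruptcy",
              "custody of the child after divorce",
              "the creditor proved the debt in bankruptcy",
              "spousal support and divorce decree"]
train_labels = ["Bankruptcy", "Family Law", "Bankruptcy", "Family Law"]
test_docs = ["bankruptcy discharge of the debtor",
             "divorce and child custody order"]
test_labels = ["Bankruptcy", "Family Law"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),        # step 2: tf-idf over the vocabulary
    ("select", SelectKBest(chi2, k=5)),  # feature selection under study
    ("clf", MultinomialNB()),            # step 3: multinomial naive Bayes
])
pipeline.fit(train_docs, train_labels)   # IDF is learned from training docs only
pred = pipeline.predict(test_docs)       # steps 4-5: vectorize and classify
p, r, f, _ = precision_recall_fscore_support(
    test_labels, pred, average="macro", zero_division=0)  # step 6: macro averaging
print(list(pred), round(f, 2))
```

For the NMF variant, the `SelectKBest` step would instead be replaced by restricting the vectorizer to an NMF-derived vocabulary (e.g. `TfidfVectorizer(vocabulary=...)`).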

The dataset is divided into two subsets as follows:

  • Train dataset: 6800 documents

    • First 400 samples of the 17 topics shown in Figure 2
  • Test dataset: 2016 documents

    • The rest of the samples from 17 topics shown in Figure 2

We avoid shuffling and random sampling when forming the train and test sets, so that replicated documents end up in either the train set or the test set, but never in both.
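Assuming the documents of each topic are kept in their original order (a hypothetical storage layout; the report does not describe the actual one), the positional split can be sketched as:

```python
# Positional train/test split: the first 400 documents of each topic go to
# the training set, the rest to the test set. Two hypothetical topics shown;
# the report uses the 17 topics of Figure 2.
docs_by_topic = {
    "Family Law": [f"family-doc-{i}" for i in range(519)],
    "Bankruptcy": [f"bankruptcy-doc-{i}" for i in range(505)],
}

train, test = [], []
for topic, docs in docs_by_topic.items():
    train += [(doc, topic) for doc in docs[:400]]  # first 400 samples per topic
    test += [(doc, topic) for doc in docs[400:]]   # remaining samples per topic

print(len(train), len(test))
```

With all 17 topics of Figure 2 this yields the 6800/2016 split reported above.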

2. Document Cleaning

2.1 Goal

Reducing the size of the vocabulary by cleaning the text and removing irrelevant words.

2.2 Method

Removing the following tokens from the vocabulary:

  • punctuation and numbers

  • words that appear in 10 or fewer of the 6800 training documents from the 17 topics

  • words that appear in more than 70 percent of the 6800 training documents from the 17 topics

  • English stopwords

  • one- and two-letter words
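Most of these rules map directly onto `CountVectorizer` parameters in scikit-learn. A sketch on a toy corpus (the thresholds mirror the list above, except that `min_df` is lowered so the tiny example keeps some words):

```python
# Cleaning rules as vectorizer settings: the token pattern drops punctuation,
# numbers and one/two-letter words; stop_words drops English stopwords;
# min_df/max_df drop too-rare and too-frequent words. On the real corpus the
# report keeps words in more than 10 of 6800 documents and uses max_df = 0.7.
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["The plaintiff appealed the award of damages.",
              "Damages of $5,000 were awarded to the plaintiff.",
              "The accused pleaded guilty to the offence."]

vectorizer = CountVectorizer(
    token_pattern=r"(?u)\b[a-zA-Z]{3,}\b",  # letters only, length >= 3
    stop_words="english",
    min_df=1,     # toy value; the report requires more than 10 documents
    max_df=0.7,   # drop words appearing in more than 70% of documents
)
vectorizer.fit(train_docs)
print(sorted(vectorizer.vocabulary_))
```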

2.3 Results

After document cleaning, the size of the vocabulary is reduced to 19745 words.

The classification pipeline described in Table 1 is implemented using this vocabulary and the following results are obtained.

3. Unsupervised topic extraction using Non-negative Matrix Factorization (NMF)

Non-negative Matrix Factorization (NMF) fits additive models to a corpus of documents and extracts a list of terms for each topic. The method is unsupervised, meaning that the labels of the documents are not provided to the NMF model. However, the number of topics and the number of terms that represent each topic need to be specified.

3.1 Goal

There are two main goals:

  • Manually assessing topic extraction by reviewing the lists of words that NMF extracts for the 17 topics in the corpus

  • Assessing the effect of the number of words per topic on the evaluation metrics of the classification pipeline described in Table 1.

3.2 Method

  1. Tf-idf vectors are calculated for the vocabulary acquired in Section 2 (after document cleaning).

  2. NMF is used to extract n words for each of the 17 topics. The extracted words will be manually reviewed by a lawyer.

  3. A new vocabulary is formed, composed of the 17 x n words extracted for all the topics.

  4. The new vocabulary is used to run the classification pipeline described in Table 1.

  5. Steps 2 to 4 are repeated for n = [10, 20, 30, 40, 50, 100, 200, 300].

3.3 Result

The following blocks show the terms extracted to represent the 17 topics (given in Figure 2), with n = 50. NMF models the corpus as an additive combination of 17 topics and extracts 50 terms for each. Note that NMF is unsupervised, so topic labels are not assigned to the extracted sets of terms.

[Topic #0 through Topic #16: extracted term lists not reproduced here]

The effect of vocabulary size on the classification metrics:

Figure 3: Macro-average of classification metrics (precision, recall and F-score) versus vocabulary size. The applied classification pipeline is described in Table 1. The vocabulary is acquired from NMF model.

Classification results for a vocabulary size of 1700 (100 words per topic):

4. Supervised dimension reduction using Chi-square statistical test

In this section, the chi-square test is applied to measure the dependence between the training features and the corresponding labels of the training documents. The test removes the features that are most independent of the assigned topic and can therefore be assumed irrelevant for classification.

4.1 Goal

  • Manual investigation of the vocabulary that the chi-square test chooses for classification
  • Assessing the effect of the size of the vocabulary on the classification metrics acquired from the classification pipeline described in Table 1.

4.2 Method

  1. Tf-idf vectors are calculated for the training documents using the vocabulary produced after document cleaning, where the size of the vocabulary is 19745.

  2. A chi-square test is applied to the vectors calculated in step 1, given the corresponding labels of the training documents.

  3. The k most relevant features are extracted as the most discriminant words among topics in the corpus.

  4. The classification pipeline described in Table 1 is run with the vocabulary acquired in step 3.

  5. Steps 3 and 4 are repeated for k = [50, 100, 200, 300, 500, 1000, 2000, 5000].

4.3 Results

300 words extracted by chi-square:

'convicted', 'bankrupts', 'plaintiff', 'plaintiffs', 'mrs', 'ankle', 'employees', 'decree', 'fear', 'guarantors', 'hospital', 'officer', 'earnings', 'dismissal', 'shoulder', 'injury', 'caveat', 'protection', 'spousal', 'neck', 'bank', 'sewer', 'motor', 'manager', 'student', 'principal', 'dosage', 'noninfringement', 'surety', 'judicial', 'discharge', 'guilty', 'immigration', 'applicant', 'clause', 'country', 'spouses', 'interim', 'compliance', 'benefits', 'trademark', 'medical', 'municipal', 'bylaws', 'accused', 'patently', 'bia', 'complainant', 'proposal', 'injuries', 'guarantees', 'device', 'cyanamid', 'abbott', 'creditors', 'employer', 'product', 'deterrence', 'matrimonial', 'bankrupt', 'drogue', 'spouse', 'accident', 'sentenced', 'certiorari', 'ands', 'sentences', 'marital', 'decision', 'salary', 'victim', 'speeding', 'guarantee', 'knee', 'breath', 'wcat', 'separation', 'apotexs', 'medicines', 'loss', 'commission', 'nds', 'mother', 'injunctions', 'creditor', 'trustees', 'privileges', 'suffer', 'unsecured', 'superintendent', 'probation', 'computing', 'interlocutory', 'jury', 'petitioner', 'offender', 'patent', 'vehicle', 'contract', 'parents', 'traffic', 'discomfort', 'support', 'minister', 'surplus', 'offence', 'headaches', 'relief', 'sri', 'city', 'cbrns', 'tablets', 'council', 'novopharm', 'defendant', 'boundary', 'sentence', 'accuseds', 'taxation', 'guarantor', 'constable', 'drugs', 'agreement', 'tribunal', 'employee', 'appeals', 'irpa', 'suffering', 'strata', 'signed', 'cbr', 'code', 'tax', 'municipalities', 'physiotherapy', 'wife', 'whiplash', 'disability', 'termination', 'hydrochloride', 'pfizer', 'demand', 'taxpayer', 'debt', 'applicants', 'worker', 'company', 'panel', 'award', 'bias', 'compensation', 'leg', 'easement', 'offences', 'possession', 'nupharm', 'debtor', 'employment', 'drivers', 'astrazeneca', 'prra', 'defendants', 'custody', 'lilly', 'patented', 'titles', 'workplace', 'registrar', 'property', 'assignment', 'workers', 'damages', 'marriage', 
'insolvency', 'reassessment', 'humanitarian', 'driving', 'compassionate', 'invention', 'revenue', 'home', 'crown', 'radar', 'medicinal', 'arm', 'left', 'condominium', 'cervical', 'infringement', 'lands', 'surgery', 'xrays', 'work', 'driver', 'assault', 'collar', 'harm', 'noa', 'suspension', 'access', 'allegation', 'pharmaceutical', 'iad', 'parcel', 'cprd', 'income', 'teva', 'deed', 'refugee', 'bankruptcy', 'drug', 'spine', 'fracture', 'persecution', 'review', 'monograph', 'owners', 'future', 'rpd', 'security', 'speed', 'capital', 'noc', 'formulation', 'omeprazole', 'pain', 'father', 'citizenship', 'taxpayers', 'sentencing', 'board', 'suffered', 'notice', 'dtc', 'prohibition', 'wcb', 'lot', 'generic', 'alcohol', 'irreparable', 'doctor', 'apotex', 'patents', 'patient', 'children', 'taxable', 'loan', 'convenience', 'road', 'fence', 'deduction', 'town', 'regulations', 'criminal', 'lieu', 'husband', 'patentee', 'boards', 'convention', 'land', 'hip', 'nisi', 'divorce', 'police', 'muscle', 'injunction', 'tenement', 'trustee', 'sexual', 'offenders', 'bylaw', 'mortgage', 'vehicles', 'job', 'workmens', 'medicine', 'parent', 'cccd', 'rjrmacdonald', 'trial', 'child', 'rightofway', 'conviction', 'canada', 'licence', 'commissioner', 'debts', 'collision', 'injunctive', 'title', 'maintenance', 'fcj', 'highway', 'visa', 'municipality', 'credit', 'imprisonment', 'trademarks', 'cra', 'merck'

Classification metrics when the number of words extracted by the chi-square test is 1000.

The effect of vocabulary size on the classification metrics:

Figure 4: Macro-average of classification metrics (precision, recall and F-score) versus vocabulary size. The applied classification pipeline is described in Table 1. The vocabulary is acquired from Chi-square test.

5. Summary

5.1 Goal

  • Comparing results obtained from NMF and Chi-square

5.2 Result

A summary of the results obtained from this experiment is shown below. The best classification metrics are achieved with the chi-square test at a vocabulary size of 1000. The best classification metrics with NMF are achieved at a vocabulary size of 1700.

Figure 5: training and test time for vocabularies acquired from document cleaning, chi-square with 1000 words, and NMF with 1700 words

Figure 6: Macro-average of classification metrics acquired from document cleaning, Chi-square with 1000 words and NMF with 1700 words

6. Conclusion

The results of this experiment show that both NMF and the chi-square test can be used to reduce the size of the distinguishing vocabulary without degrading the classification metrics. The main difference between the two methods is that NMF does not need labeled documents and extracts a list of words for each topic, while the chi-square test requires assigned topics as labels and extracts a list of distinguishing words across all topics.

Here, naive Bayes classification has been used for the sake of comparison and to assess the effect of dimension reduction. Generalization of these results to a multi-label classification task needs to be investigated in the next experiment.

In the next experiment, dimension reduction and other features such as n-grams will be investigated in a multi-label classification framework.