Improvement and further analysis of the trained 17-class Naive Bayes classifier

At this stage we have a Naive Bayes classifier up and running on a subset of our data set. In this post we walk through our work to improve and analyze the results of the 17-class classifier described previously.

1. Goal

  1. Investigation of adding bigrams as features
  2. Investigation of most informative terms based on the trained Naive Bayes classifier in order to understand and explain the results
  3. Investigation of using the trained Naive Bayes as a multi-label classifier

2. Method

  1. Cleaning the documents (as described in our previous post)
  2. Extracting tf-idf weights for both unigram and bigram features
  3. Selecting 2000 features using the chi-square test
  4. Implementing the classification pipeline described in our previous post
  5. Comparing the new results with our previous results
  6. Finding the top 10 terms for each class by selecting the features with the highest coefficients in the trained Naive Bayes Classifier
  7. Analyzing the effect of bigrams and the confusion matrix based on the words found in step 6.

The dataset used in this experiment is the same as in the last report: 8816 documents from 17 legal topics (shown in Table 1).

Note that the terms found in step 6 are different from those discussed in the feature selection step. In feature selection, we use the chi-square test to find the most distinguishing features before training the classifier. In step 6, by contrast, we inspect the trained classifier to find out which features carry the highest weights when the linear model formed during training calculates the probability of a class. In other words, the goal of feature selection is to reduce the dimension of the feature vector and the cost of classification, while the goal of this analysis is to explain the results of the trained classifier and to better understand the model formed during training.
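Steps 2 to 4 can be sketched as a single pipeline. The post does not name the library used, so the scikit-learn classes below, the helper name build_pipeline and the toy corpus are assumptions made for illustration only, not our actual code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def build_pipeline(n_features=2000):
    """Tf-idf over unigrams and bigrams -> chi-square selection -> Naive Bayes."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigrams and bigrams
        ("select", SelectKBest(chi2, k=n_features)),     # keep the k best terms
        ("nb", MultinomialNB()),
    ])


# Hypothetical toy corpus; the real data set has 8816 cleaned documents.
toy_docs = [
    "judicial review of the tribunal decision",
    "application for judicial review dismissed",
    "income tax appeal to the tax court",
    "taxpayer appeal of income tax assessment",
]
toy_labels = ["Administrative Law", "Administrative Law", "Income Tax", "Income Tax"]

# k is lowered because the toy vocabulary is far smaller than 2000 terms.
toy_model = build_pipeline(n_features=5).fit(toy_docs, toy_labels)
toy_prediction = toy_model.predict(["judicial review of a tribunal"])
```

Setting ngram_range=(1, 2) is what lets bigrams such as "judicial review" compete with unigrams for the same fixed budget of selected features.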

3. Results

3.1 Classification Summary

Steps 1-4 described above were carried out to obtain the classification results below. Adding bigrams to the features improved macro-averaged precision, recall and f1-score by about 5% in comparison to the previous results.
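For reference, macro-averaging computes each metric per class and then takes the unweighted mean over classes, so small classes count as much as large ones. A minimal sketch, assuming scikit-learn and using made-up labels rather than our results:

```python
from sklearn.metrics import precision_recall_fscore_support

# Made-up labels for three classes, for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# average="macro" takes the unweighted mean of the per-class scores.
macro_precision, macro_recall, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
```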

The improvement from adding bigrams can be explained by looking at the probability of each feature given a class and examining the top terms for each class, the top 10 of which are shown in Table 1. Note that we use 2000 features for training the classifier; the 170 terms shown in Table 1 do not cover all the features, only the most informative ones. 13 of these 170 terms are bigrams, which helps explain how adding bigrams improves classification accuracy.

Table 1: Top 10 terms related to each class that have the highest probability given the class acquired from trained Naive Bayes classifier

0 Administrative Law: judicial - judicial review - hearing - tribunal - review - commission - appeal - application - applicant - board
1 Aliens: application - minister - visa - board - canada - citizenship - officer - refugee - immigration - applicant
2 Bankruptcy: debtor - registrar - payment - income - debt - discharge - creditor - trustee - bankruptcy - bankrupt
3 Contracts: pay - clause - purchase - term - company - party - agreement - defendant - plaintiff - contract
4 Criminal Law: criminal code - code - judge - trial - crown - appeal - criminal - offence - sentence - accuse
5 Damage Awards: work - general damage - left - loss - neck - injury - damage - pain - accident - plaintiff
6 Damages: benefit - judge - accident - trial judge - award - defendant - trial - loss - damage - plaintiff
7 Family Law: mother - support - maintenance - parent - marriage - respondent - petitioner - divorce - custody - child
8 Food and Drug Control: notice - allegation - medicine - notice compliance - minister - noc - regulation - apotex - drug - patent
9 Guarantee and Indemnity: company - agreement - surety - loan - mortgage - plaintiff - defendant - bank - guarantor - guarantee
10 Income Tax: taxation year - taxation - appeal - income tax - tax court - appellant - minister - taxpayer - income - tax
11 Injunctions: balance convenience - applicant - irreparable harm - interlocutory injunction - irreparable - interlocutory - harm - defendant - plaintiff - injunction
12 Master and Servant: salary - termination - work - notice - employer - defendant - dismissal - employee - employment - plaintiff
13 Motor Vehicles: suspension - traffic - speed - highway - offence - driver - drive - motor - motor vehicle - vehicle
14 Municipal Law: section - power - plaintiff - land - town - municipal - council - municipality - city - bylaw
15 Real Property: deed - possession - defendant - easement - owner - plaintiff - lot - property - title - land
16 Workers' Compensation: review - employer - injury - tribunal - commission - appeal - worker compensation - compensation - board - worker

3.2 Analysis of confusion matrix

The confusion matrix of this 17-class classifier is shown in Figure 1. As marked in the figure, the highest rate of misclassification occurs for cases from class “Contracts” that the trained classifier labels as “Guarantee and Indemnity”. Table 1 shows that these two classes share 4 of their top 10 terms: company, agreement, defendant and plaintiff. This overlap explains the high rate of misclassification between the two classes. The second highest misclassification rate occurs between classes “Damages” and “Damage Awards”, for the same reason: they have many top terms in common.

Figure 1: Confusion matrix of the trained classifier. For topics related to numerical class labels (0-16), see Table 1
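The overlap argument can be checked mechanically against Table 1. The sketch below copies the top-10 lists of the four classes discussed above and intersects every pair; the function name shared_terms is ours, chosen for illustration.

```python
from itertools import combinations

# Top-10 terms copied from Table 1 for the four classes discussed above.
top_terms = {
    "Contracts": ["pay", "clause", "purchase", "term", "company", "party",
                  "agreement", "defendant", "plaintiff", "contract"],
    "Guarantee and Indemnity": ["company", "agreement", "surety", "loan",
                                "mortgage", "plaintiff", "defendant", "bank",
                                "guarantor", "guarantee"],
    "Damage Awards": ["work", "general damage", "left", "loss", "neck",
                      "injury", "damage", "pain", "accident", "plaintiff"],
    "Damages": ["benefit", "judge", "accident", "trial judge", "award",
                "defendant", "trial", "loss", "damage", "plaintiff"],
}


def shared_terms(terms_by_class):
    """Return the sorted terms that each pair of classes has in common."""
    return {
        (a, b): sorted(set(terms_by_class[a]) & set(terms_by_class[b]))
        for a, b in combinations(terms_by_class, 2)
    }


overlaps = shared_terms(top_terms)
```

Both of the most confused pairs share four of their ten top terms, which matches the pattern seen in the confusion matrix.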

3.3 Investigation of using trained Naive Bayes as a multilabel classifier

Of the 8816 documents used in this experiment, only 4 are multi-label, i.e. have more than one label assigned. These documents were replicated in each of their classes. Although this dataset is not suitable for evaluating the trained classifier on a multi-label classification task, we still analyzed the output of the classifier for these 4 documents. Figure 2 shows the probabilities that the trained classifier assigns to one of them.

The true topics of this document, taken from the annotations, are ['Food and Drug Control', 'Administrative Law', 'Criminal Law'], which by the numbering in Table 1 correspond to classes 8, 0 and 4 respectively. The top prediction of the trained Naive Bayes model is one of these three classes, and the other two also receive relatively high probabilities, as does class #13 (“Motor Vehicles”), which is not among the annotated topics. This observation leads to the hypothesis that the trained Naive Bayes model can potentially be used as a multi-label classifier by setting a threshold on the calculated class probabilities instead of relying only on the highest one.

This hypothesis needs to be investigated further on a dataset that is richer in multi-topic documents. However, the method is not straightforward to evaluate and may increase the misclassification rate.

Figure 2: Probabilities of 17 classes calculated by the trained classifier for a single multi-topic document.
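The thresholding idea can be sketched as follows. The probabilities, class names and threshold value below are invented for illustration and are not read from Figure 2; with scikit-learn, the input would typically be one row of the model's predict_proba output (an assumption about tooling).

```python
import numpy as np


def labels_above_threshold(probabilities, class_names, threshold=0.15):
    """Return every class whose probability reaches the threshold.

    Falls back to the single most probable class if none reaches it.
    """
    probabilities = np.asarray(probabilities)
    picked = [class_names[i] for i in np.flatnonzero(probabilities >= threshold)]
    if not picked:
        picked = [class_names[int(np.argmax(probabilities))]]
    return picked


# Invented probabilities for four placeholder classes.
example_probs = [0.45, 0.30, 0.05, 0.20]
example_names = ["class A", "class B", "class C", "class D"]
example_labels = labels_above_threshold(example_probs, example_names)
```

Choosing the threshold is the hard part: a low value recovers more true labels but also admits unrelated classes, as class #13 illustrates for the document in Figure 2.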

4. Conclusion

This experiment shows that adding bigrams to the feature set increases classification accuracy without increasing the number of features. Using 2000 features drawn from both unigrams and bigrams and selected by the chi-square test, we achieved 89% macro-averaged precision, which is 5% better than using 2000 unigram features selected in the same manner.

By looking at the most probable terms in each class, we also confirmed that misclassification occurs mostly among classes that have common content.

Finally, the experiment showed that the class probabilities calculated by the trained Naive Bayes classifier may be used to produce a set of labels for multi-label documents. The feasibility of this method needs to be investigated further in future experiments.