Comparative Study of Supervised Multi-Label Classification Models for Legal Case Topic Classification

1. Introduction

1.1 Goals

  • Implementing a complete supervised multi-label classification pipeline

  • Choosing the metrics for comparison of different multi-label classifiers

  • Comparing different benchmark classifiers in a one-vs-rest classification pipeline

  • Comparing supervised multi-output classifiers in a multi-label classification pipeline

1.2 Dataset

Figure 1: Summary of the sampled dataset. a) number of single-topic, bi-topic and three-topic documents in the sampled dataset. b) number of documents in each topic.

10K documents are randomly selected from the 11 topics that have more than 5K samples. Figure 1 summarizes the topics and their class sizes in the sampled dataset. Out of the 10K documents, 7760 are annotated with only one topic, 1881 are annotated with two topics and only 57 are assigned to three topics. The percentages of multi-topic documents in the sampled dataset are consistent with those in the whole dataset.

1.3 Method

The following pipeline is implemented in order to compare various models of supervised multi-label classifiers.

Table 1: Classification pipeline used to compare the classifiers

Step | Method | scikit-learn module
Cleaning | Lemmatizing using lemma_ from the spaCy library; removing punctuation and numbers, words that appear in 10 or fewer documents, words that appear in 70 percent of the documents, English stopwords, and one- and two-letter words |
Feature Extraction | Tf-idf vectors of unigram and bigram features | TfidfTransformer
Feature Selection | Chi-square test (percentile = 50) | SelectPercentile, chi2
Label Transformation | Binarization of labels | MultiLabelBinarizer
Classifier (to be investigated) | One-vs-rest (various benchmark classifiers); multi-output classifiers |
Evaluation Method | 5-fold cross-validation | model_selection.KFold
Evaluation Metrics | Precision, recall, F1-score, accuracy | metrics
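
The steps of the pipeline table can be sketched with scikit-learn as follows. This is a minimal sketch, not the implementation used in this study: the spaCy lemmatization step is omitted, CountVectorizer is assumed as the source of the unigram and bigram counts fed to TfidfTransformer, and the mapping of the cleaning thresholds to min_df=11 and max_df=0.7 is an assumption. The function name build_pipeline and the toy documents are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer


def build_pipeline(classifier, min_df=11, max_df=0.7, percentile=50):
    """Assemble the pipeline steps around one base classifier (hypothetical helper)."""
    return Pipeline([
        # Unigram and bigram counts; min_df and max_df drop rare and very common terms.
        ("counts", CountVectorizer(ngram_range=(1, 2), stop_words="english",
                                   min_df=min_df, max_df=max_df)),
        ("tfidf", TfidfTransformer()),
        # Keep the top-scoring half of the features under the chi-square test.
        ("select", SelectPercentile(chi2, percentile=percentile)),
        ("clf", OneVsRestClassifier(classifier)),
    ])


# Toy corpus (hypothetical), far too small for the default frequency thresholds,
# so they are relaxed here.
docs = [
    "court ruled evidence admissible trial",
    "police arrest criminal charge trial",
    "civil rights discrimination claim court",
    "criminal evidence seized police search",
    "discrimination civil claim employment rights",
    "arrest charge criminal sentence",
]
topics = [("Evidence",), ("Criminal Law",), ("Civil Rights",),
          ("Criminal Law", "Evidence"), ("Civil Rights",), ("Criminal Law",)]

Y = MultiLabelBinarizer().fit_transform(topics)
model = build_pipeline(MultinomialNB(), min_df=1, max_df=1.0)
model.fit(docs, Y)
predictions = model.predict(docs)
```

Keeping the base classifier as an argument lets the same pipeline be reused for every model under comparison.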

1.3.1 Label Binarization

Using MultiLabelBinarizer, a scikit-learn preprocessing tool, an 11-element binary vector is assigned to each document, where the jth element is 1 only if the document is assigned to the jth topic, j = 1, 2, …, 11.

Topics | Numerical labels | Binarized label
('Civil Rights', 'Criminal Law', 'Evidence') | [2, 4, 7] | [0,0,1,0,1,0,0,1,0,0,0]
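
A minimal sketch of the binarization for the example above. It assumes that topics are identified by integer ids 0 to 10 and that positions in the binary vector are counted from zero; the actual topic names and their order are not reproduced here.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Assumption: the 11 topics are encoded as integer ids 0..10.
mlb = MultiLabelBinarizer(classes=list(range(11)))

# One document annotated with the topics whose ids are 2, 4 and 7.
binarized = mlb.fit_transform([[2, 4, 7]])
print(binarized.tolist())

# The transformation is reversible.
recovered = mlb.inverse_transform(binarized)
```

Passing the classes explicitly fixes the length and order of the vector even when a batch of documents does not contain every topic.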

1.3.2 Dealing with All-zero predictions

If the classifier predicts none of the topics for a document, we assign the class with the highest prediction score to that document. This modification improves the F-score. By default, scikit-learn handles such samples by setting their precision, recall and F-score to zero when averaging the metrics of the classifier.
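
The fallback described above can be sketched as follows. This is an illustration under assumptions: the per-class scores are taken to come from something like the classifier's decision function or predicted probabilities, and the helper name fill_empty_predictions is hypothetical.

```python
import numpy as np


def fill_empty_predictions(scores, predictions):
    """Give every all-zero prediction row the single label with the highest score."""
    scores = np.asarray(scores, dtype=float)
    fixed = np.array(predictions, dtype=int)  # copy, so the input stays untouched
    empty_rows = np.flatnonzero(fixed.sum(axis=1) == 0)
    if empty_rows.size:
        best_label = scores[empty_rows].argmax(axis=1)
        fixed[empty_rows, best_label] = 1
    return fixed


# Toy example (hypothetical scores): the second document has no predicted topic.
scores = [[-0.2, 0.5, -1.0],
          [-0.9, -0.1, -0.4]]
predictions = [[0, 1, 0],
               [0, 0, 0]]
fixed = fill_empty_predictions(scores, predictions)
```

Rows that already contain at least one predicted topic are left unchanged.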

2. Evaluation metrics for comparison

2.1 Goal

Our objective is to choose a few metrics to describe the performance of a multi-label classifier in order to compare various classifiers defined in Table 1.

2.2 Method

  • Reviewing different methods of averaging metrics in a multi-label classifier

  • Comparing different averaging methods for a single classifier

2.3 Results

2.3.1 Review

In this work, the following metrics are measured in order to evaluate a classification model:

  • Precision: the fraction of the predictions made by the classifier that are correct

  • Recall: the fraction of everything that should be predicted that is actually predicted

  • F1-score: balances precision and recall

  • Accuracy: the fraction of the decisions made by the classifier that are correct

In a multi-label classification task, metrics can be calculated across classes or samples, which results in many numbers. Summarizing the results to few numbers for comparison purposes is tricky and can be done in various ways. The right choice of averaging for each experiment depends on the number of classes, average number of topics assigned to documents, number of documents in each class and how sensitive various classes are to the wrong decisions made by the classifier.

1. Micro-averaging: Flattens the binarized true labels and predicted labels and evaluates the model as a single binary classifier.

  • If the number of topics assigned to each document is much smaller than the total number of topics, i.e. only a few elements of the binarized true labels are 1, this method yields a very high number of true negatives, which can be misleading. Metrics that count true negatives, such as accuracy, are then inflated and are not realistic indicators of the performance of the classifier, whereas precision remains well interpretable.

2. Macro-averaging: Calculates the metrics for each class and averages the results.

  • Gives equal weights to classes. Classifiers usually have low performance on infrequent classes. Macro-averaging overemphasizes these classes. Therefore, Macro averaging should be interpreted with caution in case of class imbalance.

3. Weighted-averaging: Calculates the metrics for each class and averages them with size of the classes as weights.

  • Reduces the effect of misclassifications that are caused by relatively small class sizes.

4. Samples-averaging: Calculates the metrics for each sample and averages the results.

  • Number of topics assigned to a document are taken into account. A single prediction is weighted by the length of true labels for that sample. Measures how the classifier is doing on samples, in general.

5. Exact-match: Only exact matches of binarized labels are counted as successes.

  • This is a very strict way of evaluating a multi-label classifier. For example, if two of the topics of a three-topic document are predicted correctly, the decision is still counted as a failure.
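
The averaging methods listed above map onto the average parameter of the scoring functions in sklearn.metrics. The sketch below applies them to a small hypothetical example (four documents, three topics) that is not taken from the dataset of this study.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy data (hypothetical): 4 documents, 3 topics, already binarized.
Y_true = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
Y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]])

precision = {}
recall = {}
for average in ("micro", "macro", "weighted", "samples"):
    precision[average] = precision_score(Y_true, Y_pred, average=average)
    recall[average] = recall_score(Y_true, Y_pred, average=average)

# On multi-label input, accuracy_score is the exact-match (subset) accuracy.
exact_match = accuracy_score(Y_true, Y_pred)

for average in precision:
    print(f"{average:>8}: precision={precision[average]:.3f} "
          f"recall={recall[average]:.3f}")
print(f"exact match: {exact_match:.3f}")
```

Even on this tiny example the four averages of the same predictions differ, which is the effect discussed in this section.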

2.3.2 Comparing averaging methods for one-vs-rest Naive Bayes classifier

Metrics per class for a trained classifier are shown in Figure 2. This is a one-vs-rest classifier with a Naive Bayes classification model from the pipeline described in Table 1. Table 2 shows the metrics acquired for this classifier from each of the averaging methods described in Section 2.3.1.

Figure 2: Metrics per class for a trained classifier

Table 2 shows how different ways of averaging can lead to very different results. In our case each label is an 11-element binary vector with at most three 1s (documents are assigned to one, two or three topics). Micro-averaged accuracy is therefore strongly inflated by the high number of true negatives. Macro-averaged recall, on the other hand, overemphasizes the poor performance of the classifier on topics with few documents. Weighting the metrics by class size results in higher precision and recall than macro-averaging, but the accuracy remains unrealistically high and is not a true indicator of the performance of the classifier. In this work, we chose to compare classification models using the samples average, since we are interested in how well the classifier predicts the topics assigned to each document and want to avoid overemphasizing true negatives. The accuracy metrics that are marked with (*) are not supported in sklearn.metrics.

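Table 2: Metrics of the one-vs-rest Naive Bayes classifier for each averaging method

Metric | Micro | Macro | Weighted | Samples | Exact-match
Precision (%) | 77 | 72 | 77 | 80 | -
Recall (%) | 79 | 63 | 79 | 82 | -
F1-score (%) | 78 | 62 | 75 | 79 | -
Accuracy (%) | 95 | 95 | 93 | 75 | 64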

2.3.3 Summary of results

We chose the “samples averaging” method to compare the classifiers. Metrics acquired using this method are not affected by the high number of true negatives. In particular, samples-averaged accuracy is a suitable indicator of classifier performance, since it is not as unrealistic as the micro-, macro- and weighted-averaged accuracies and not as strict as the exact-match accuracy.

Given a multi-label classifier h and N documents d_n, n = 1, 2, …, N, let Y_n be the set of true labels assigned to d_n and Z_n = h(d_n) the set of labels predicted by the classifier for d_n. Table 3 summarizes the mathematical description of the samples-averaged metrics.

Table 3: Equations that describe samples averaging of classification metrics
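
The equations of Table 3 are not reproduced in this text. The sketch below therefore assumes the commonly used set-based definitions, each averaged over the N documents: precision |Y_n ∩ Z_n| / |Z_n|, recall |Y_n ∩ Z_n| / |Y_n|, F1-score 2|Y_n ∩ Z_n| / (|Y_n| + |Z_n|) and accuracy |Y_n ∩ Z_n| / |Y_n ∪ Z_n|. The function name and the toy data are hypothetical.

```python
import numpy as np


def samples_averaged_metrics(Y_true, Y_pred):
    """Per-document set-based metrics, averaged over all documents."""
    Y = np.asarray(Y_true, dtype=bool)
    Z = np.asarray(Y_pred, dtype=bool)
    intersection = (Y & Z).sum(axis=1)
    union = (Y | Z).sum(axis=1)
    n_true = Y.sum(axis=1)
    n_pred = Z.sum(axis=1)

    def safe_ratio(numerator, denominator):
        # A ratio with a zero denominator counts as 0 for that document.
        return np.divide(numerator, denominator,
                         out=np.zeros(len(numerator)), where=denominator > 0)

    return {
        "precision": float(safe_ratio(intersection, n_pred).mean()),
        "recall": float(safe_ratio(intersection, n_true).mean()),
        "f1": float(safe_ratio(2 * intersection, n_true + n_pred).mean()),
        "accuracy": float(safe_ratio(intersection, union).mean()),
        "exact_match": float((Y == Z).all(axis=1).mean()),
    }


# Toy data (hypothetical): 4 documents, 3 topics.
Y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
Y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]]
metrics = samples_averaged_metrics(Y_true, Y_pred)
```

Under these assumed definitions a partially correct prediction earns partial credit in the samples-averaged accuracy, while the exact-match accuracy counts it as a failure.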

3. Comparing various one-vs-rest classifiers

3.1 Goal

Implementation, evaluation and comparison of various benchmark classifiers in a one-vs-rest classification pipeline described in Table 1.

3.2 Method

  • Implementing a one-vs-rest pipeline described in Table 1

  • Integrating the following benchmark classifiers in the pipeline

    • Naive Bayes (NB), Ridge classifier, Passive Aggressive, Linear Regression classifier, Perceptron classifier, linear Support Vector Machine (SVM) trained by liblinear, linear SVM trained by Stochastic Gradient Descent (SGD) algorithm
  • Calculating the following metrics for each classifier

    • Samples-averaged precision, recall, F1-score and accuracy, exact-match accuracy, and average training time of the cross-validated classifiers
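
The evaluation loop of this method can be sketched as follows. It is a simplified illustration that reports only the samples-averaged F1-score and the mean training time; the function name cross_validate_ovr and the synthetic data are hypothetical stand-ins for the real pipeline and dataset.

```python
import time

import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB


def cross_validate_ovr(base_classifier, X, Y, n_splits=5, seed=0):
    """Return mean samples-averaged F1 and mean training time over K folds."""
    scores, train_times = [], []
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        # OneVsRestClassifier clones the base estimator, so it can be reused.
        model = OneVsRestClassifier(base_classifier)
        start = time.perf_counter()
        model.fit(X[train_idx], Y[train_idx])
        train_times.append(time.perf_counter() - start)
        Y_pred = model.predict(X[test_idx])
        scores.append(f1_score(Y[test_idx], Y_pred,
                               average="samples", zero_division=0))
    return float(np.mean(scores)), float(np.mean(train_times))


# Synthetic multi-label data (hypothetical) with non-negative count features.
X, Y = make_multilabel_classification(n_samples=120, n_features=20,
                                      n_classes=4, random_state=0)
mean_f1, mean_seconds = cross_validate_ovr(MultinomialNB(), X, Y)
```

Calling the function once per base classifier gives one row of a comparison table.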

3.3 Result

Table 4 summarizes the results acquired from multiple one-vs-rest classifiers. Among all implemented classifiers, the linear SVM trained by the SGD algorithm achieves the highest metrics. Results per class acquired from this classifier are shown in Figure 3. We observed that all classifiers show a very poor recall score on the topics “Courts” and “Evidence”.

With the loss function set to “hinge”, which is the default setting in scikit-learn, the stochastic gradient descent (SGD) learning algorithm fits a linear support vector machine (SVM) to the data. The gradient of the loss function is estimated one sample at a time, and the model is updated along the way with a decreasing learning rate. The model is regularized by adding a penalty term to the loss function. We used the Elastic Net penalty, which is a combination of the L2 and L1 norms.
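
A minimal sketch of such a classifier in a one-vs-rest scheme, using the settings listed in Table 4. Two points are assumptions: the iteration count is passed as max_iter, the name that current scikit-learn releases use for the older n_iter parameter, and l1_ratio is left at the library default because the report does not state it. The synthetic data are hypothetical.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier

sgd_svm = OneVsRestClassifier(
    SGDClassifier(
        loss="hinge",          # hinge loss yields a linear SVM
        penalty="elasticnet",  # combination of the L1 and L2 penalties
        alpha=0.0001,          # regularization strength, as listed in Table 4
        l1_ratio=0.15,         # L1 share of the penalty (library default, assumed)
        max_iter=50,           # called n_iter in older scikit-learn releases
        tol=None,              # run all epochs instead of stopping early
        random_state=0,
    )
)

# Synthetic multi-label data (hypothetical) in place of the tf-idf features.
X, Y = make_multilabel_classification(n_samples=100, n_features=20,
                                      n_classes=4, random_state=0)
sgd_svm.fit(X, Y)
predictions = sgd_svm.predict(X)
```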

Figure 3: Results acquired from SGD one-vs-rest classifier for all classes

Advantages of SGD

  • Supports the partial_fit method, i.e. it can be fed with batches of examples. It can therefore be used in an out-of-core approach and is suitable for learning from data that does not fit into main memory. In this case SGD needs to be used with HashingVectorizer.

  • Has been successfully applied to large-scale and sparse machine learning problems often encountered in text classification and natural language processing. Given that the data is sparse, the classifiers in this module easily scale to problems with more than 10^5 training examples and more than 10^5 features.

  • It is efficient and offers many parameters for tuning.
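
The out-of-core use mentioned in the first advantage can be sketched as follows. This is an illustration under assumptions, not part of the experiments in this report: one SGDClassifier is kept per topic and updated batch by batch, the helper names and toy batches are hypothetical, and three topics are used instead of eleven.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

N_LABELS = 3  # hypothetical; the report uses 11 topics

# HashingVectorizer is stateless, so it needs no pass over the full corpus.
vectorizer = HashingVectorizer(ngram_range=(1, 2), n_features=2**18)
models = [SGDClassifier(loss="hinge", penalty="elasticnet", random_state=0)
          for _ in range(N_LABELS)]


def train_on_batch(texts, Y_batch):
    """Update one binary classifier per topic with a single batch."""
    X_batch = vectorizer.transform(texts)
    for j, model in enumerate(models):
        # classes must be given because a batch may lack one of the two values.
        model.partial_fit(X_batch, Y_batch[:, j], classes=np.array([0, 1]))


def predict(texts):
    """Stack the per-topic predictions into a binarized label matrix."""
    X = vectorizer.transform(texts)
    return np.column_stack([model.predict(X) for model in models])


# Two toy batches (hypothetical) standing in for chunks read from disk.
train_on_batch(["court ruled evidence admissible", "police arrest criminal charge"],
               np.array([[1, 0, 0], [0, 1, 0]]))
train_on_batch(["civil rights discrimination claim", "criminal evidence seized"],
               np.array([[0, 0, 1], [1, 1, 0]]))
predicted = predict(["evidence admissible in court", "civil rights claim"])
```

Only the current batch is held in memory at any time, which is what makes the approach suitable for data larger than main memory.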

Disadvantages of SGD

  • Requires a number of hyperparameters such as the regularization parameter and the number of iterations.

  • Is sensitive to feature scaling. For best results using the default learning rate schedule, the data should have zero mean and unit variance.

Table 4: Results acquired from implementing one-vs-rest classifier pipeline described in Table 1 for various classification models

Prec., Rec., F1 and Acc. are samples-averaged metrics (%).

Classifier model | scikit-learn module | Prec. | Rec. | F1 | Acc. | Exact-match acc. (%) | Train time (sec)
Naive Bayes | naive_bayes.MultinomialNB | 80 | 82 | 79 | 76 | 64 | 59
Ridge Classifier | linear_model.RidgeClassifier(tol=1e-2, solver="sag") | 87 | 82 | 83 | 80 | 71 | 62
Linear SVM trained by liblinear | svm.LinearSVC | 87 | 84 | 84 | 81 | 72 | 60
Perceptron | linear_model.Perceptron(n_iter=100) | 81 | 83 | 80 | 76 | 64 | 65
Linear SVM trained by SGD | linear_model.SGDClassifier(alpha=0.0001, n_iter=50, penalty="elasticnet") | 89 | 84 | 85 | 82 | 74 | 71
Passive Aggressive | linear_model.PassiveAggressiveClassifier | 82 | 83 | 81 | 76 | 64 | 69
Linear Regression | linear_model.LinearRegression | 67 | 73 | 68 | 63 | 49 | 228

4. Multi-output classifiers

4.1 Goal

Implementation, evaluation and comparison of various benchmark multi-output classifier models

4.2 Method

  • Implementing a classification pipeline described in Table 1

  • Integrating the following benchmark multi-output classifiers in the pipeline

    • Random Forest classifier, Decision Tree , K-Nearest Neighbours and Multi-Layer Perceptron (MLP) neural network
  • Calculating the following metrics for each classifier

    • Samples-averaged precision, recall, F1-score and accuracy, exact-match accuracy, and average training time of the cross-validated classifiers
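
Unlike the one-vs-rest scheme, these models accept the binarized label matrix directly. The sketch below illustrates this for three of the listed classifiers on synthetic data; the data, the train/test split and the number of neighbours are hypothetical, and the MLP is left out because its layer configuration is not fully specified in this report.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-label data (hypothetical) in place of the tf-idf features.
X, Y = make_multilabel_classification(n_samples=200, n_features=30,
                                      n_classes=5, random_state=0)
X_train, X_test = X[:150], X[150:]
Y_train, Y_test = Y[:150], Y[150:]

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

predictions = {}
for name, model in models.items():
    # Each model is fitted once on the full binarized label matrix.
    model.fit(X_train, Y_train)
    predictions[name] = model.predict(X_test)
```

Each prediction is again a binarized label matrix, so the same samples-averaged metrics apply as for the one-vs-rest classifiers.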

4.3 Result

Table 5 summarizes the results obtained from the multi-output classifiers. Random Forest and the MLP neural network outperform the decision tree and KNN classifiers. The neural network has been used in its simplest form, since the amount of data in this experiment is too small to train a proper neural network classifier. Unlike all other classifiers in this report, which have been trained with 5K features selected by the chi-square test, the MLP has been trained with only 1K features.

No hidden layer has been used. It is interesting to see that such a simple neural network model results in metrics very close to those of the Random Forest classifier and the SGD one-vs-rest classifier. Figure 4 shows the metrics obtained from the MLP model per class.

We observed that the MLP classifier is the only classifier that reaches a recall of more than 30% for the two classes “Courts” and “Evidence”. The other classifiers show poor recall for these two classes. These results encourage further investigation of neural network models for this classification task. However, scikit-learn does not support GPUs and is not an optimized tool for investigating the capabilities of a neural network model.

Table 5: Results obtained from multi-output classifiers in the pipeline described in Table 1

Prec., Rec., F1 and Acc. are samples-averaged metrics (%).

Classifier model | scikit-learn module | Prec. | Rec. | F1 | Acc. | Exact-match acc. (%) | Train time (sec)
Random Forest | ensemble.RandomForestClassifier(n_estimators=100) | 85 | 79 | 78 | 78 | 70 | 90
KNN | neighbors.KNeighborsClassifier(N=100) | 67 | 60 | 63 | 60 | 54 | 58
Decision Tree | tree.DecisionTreeClassifier | 74 | 73 | 71 | 68 | 57 | 80
MLP | neural_network.MLPClassifier(solver='lbfgs', alpha=1e-5, random_state=1) | 86 | 84 | 84 | 80 | 70 | 78

5. Summary

In a set of experiments, multiple ways of performing a topic classification task with the scikit-learn library have been investigated on a sampled dataset. Figure 5 summarizes the metrics obtained from these classifiers.

The highest classification metrics are obtained using a one-vs-rest SVM classifier trained by the SGD algorithm. The significance of this advantage needs to be investigated in future experiments. The main advantage of using a one-vs-rest scheme with a linear classifier is that such models are very cheap and easy to scale up. Moreover, SGD supports learning with partial_fit. Based on our observations, we decided to work on optimizing this classifier on the whole dataset that is available to us. Our observations are consistent with the cheat-sheet graph provided in the scikit-learn documentation.

Figure 5: Comparison of metrics obtained from various multi-label classifiers

6. Future work

  1. Implementing grid search to tune the parameters of the SGD classifier for this specific task

  2. Investigation of setting a threshold on probability of predictions made by the classifier

  3. Using unannotated part of data to implement a semi-supervised training scheme

Samuel Witherspoon