Exploratory Investigation of a Sample Dataset
Although we started our project somewhat backwards (by launching Compass first), part of the team has since returned to a more R&D-focused mentality. As a starting point, we need a better understanding of the extent and limitations of the dataset and its hand annotations, and we need to figure out how much data cleaning is required.
We are sharing our experiences, not as a template to be followed, but rather as an honest representation of our R&D efforts.
- Deciding on a data structure that stores documents and their related labels in a single variable
- Initial work on identifying the descriptive statistics of a sample dataset
- Number of documents: 13359
Sampling method: random sampling of 1000 cases with each of the following labels:
"Family Law", "Bankruptcy", "Criminal Law", "Administrative Law", "Damages", "Damage Awards", "Guarantee and Indemnity", "Income Tax", "Indians, Inuit and Metis", "Injunctions", "Master and Servant", "Motor Vehicles", "Municipal Law", "Quebec Family", "Real Property", "Workers' Compensation", "Food and Drug Control", "Common Law", "Contracts", "Actions", "Aliens"
Although the search was done with 20 topics, the total number of documents is less than 20K because the classes overlap: a document can be assigned more than one topic. Headnotes were removed from the documents.
Section 1: Comparing various data structures in Python
We compared the following candidates for a data structure that stores the sample dataset in a single variable relating labels to documents:
- Combinations of lists, dictionaries and tuples
- pandas dataframes
We reviewed the features of these data structures in Python and implemented example cases.
A pandas dataframe seems to be a good choice, at least in the experiment phase, because it is:
- Fast, flexible, and expressive
- Designed to work with “relational” or “labeled” data
- Easy to integrate within a scientific computing environment with numpy and easy to visualize
- Easy to group, split or filter by specifications of items
- Easy to add new attributes of a class or remove the unused ones
- Easy to export as an Excel file, a figure, or an HTML table
A pandas dataframe can be understood as a simple table that organizes the information about a single topic into rows and columns. Each row holds all the attributes (columns) of one item. Defining a dataframe is very similar to designing a table that presents a relational or labeled dataset.
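To make the table analogy concrete, here is a minimal sketch of a dataframe built from a list of dictionaries. The records, ids and column names are invented for illustration and are not taken from our dataset.

```python
import pandas as pd

# Toy records; ids, texts and field names are made up for illustration.
records = [
    {"id": "doc_001", "text": "custody of the children ...", "topics": ["Family Law"]},
    {"id": "doc_002", "text": "the debtor filed ...", "topics": ["Bankruptcy", "Contracts"]},
    {"id": "doc_003", "text": "breach of the agreement ...", "topics": ["Contracts"]},
]

# Each dictionary becomes one row; each key becomes one column.
df = pd.DataFrame(records)

# Add a derived attribute as a new column.
df["n_topics"] = df["topics"].apply(len)

# Filter rows by a property of the item (here: documents labeled "Contracts").
contracts = df[df["topics"].apply(lambda topics: "Contracts" in topics)]

# Drop a column that is not needed for a summary view.
slim = df.drop(columns=["text"])

# Export the summary view as an HTML table.
html_table = slim.to_html(index=False)
```

The same operations on nested lists and dictionaries would need explicit loops, which is the main reason the dataframe felt more convenient during experimentation.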
Section 2: Storing a sample dataset of documents in pandas dataframes
- Flexible and intuitive relation between each single label present in the dataset and all the documents with that label
- Initial work on identifying the descriptive statistics of a sample dataset stored in pandas dataframes
Two pandas dataframes are formed, which are summarized in Table 1. The original sample dataset is stored in JSON format. The dataframe ‘data’ is built directly from the list of dictionaries obtained by reading the JSON file. Each row of this dataframe represents a single document. Columns of this dataframe include:
- id of the document, which is the filename of the original document
- text of the document
- topics that the document is annotated with
- topic that was used to acquire this document in the sampling process.
An example of this dataframe is shown below:
An example of a pandas dataframe
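As a rough sketch of this loading step, the snippet below turns a list of dictionaries read from JSON into a dataframe with one row per document. The field names ("id", "text", "topics", "search_topic") and the sample records are assumptions made for illustration; the actual keys in our JSON file may differ.

```python
import json

import pandas as pd

# Stand-in for the contents of the JSON file. In practice the list would come
# from json.load() on the real file. Field names here are assumptions.
raw = """
[
  {"id": "case_0001.txt", "text": "...", "topics": ["Family Law", "Practice"],
   "search_topic": "Family Law"},
  {"id": "case_0002.txt", "text": "...", "topics": ["Bankruptcy"],
   "search_topic": "Bankruptcy"}
]
"""
records = json.loads(raw)

# One row per document, one column per attribute.
data = pd.DataFrame(records, columns=["id", "text", "topics", "search_topic"])
print(data.shape)
```

Keeping the topics as a list in a single column preserves the multi-label annotation of each document in one row.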
The dataframe ‘metadoc’ relates each label to a meta-document formed by joining all the documents with that label. Meta-documents are useful for finding the most frequent words under a specific topic. If words are counted within a single document, document-specific words such as names of people appear among the most frequent words. One strategy to remove them would be to identify names and other document-specific words explicitly, which is expensive and hard to implement. Counting over meta-documents instead dilutes some of the document-specific words automatically.
To form meta-documents, each multi-label document is treated as multiple single-label documents, so the document is duplicated in every class it belongs to. The dataframe ‘metadoc’ includes the following columns:
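The effect can be illustrated with a toy example. The three "documents" below are invented sentences, and top_words is a hypothetical helper written for this sketch, not part of our code.

```python
import re
from collections import Counter


def top_words(text, n=3):
    """Return the n most frequent lowercase words in text with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


# Invented toy documents sharing one label.
docs = [
    "smith claims damages after smith was injured",
    "jones seeks damages jones alleges negligence caused damages",
    "the court awarded damages to taylor for negligence",
]

# In a single document a party name is the most frequent word.
single_top = top_words(docs[0], n=1)

# In the meta-document the topic word outweighs any individual name.
meta_top = top_words(" ".join(docs), n=1)
```

Names do not vanish from the meta-document, but they fall in rank relative to words that recur across documents with the same label.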
- List of ids of documents related to the label
- Other features or statistical parameters calculated for classes can be added to this dataframe if needed.
For example, the size of each class in the sample dataset is added to the dataframe as a count and as a percentage. An example of the metadoc variable is shown below:
Table 1: pandas dataframes used in this experiment
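One possible way to build such a dataframe is sketched below with explode and groupby. The toy data and the column names (label, doc_ids, text, count, percentage) are assumptions for illustration, and the percentage is computed relative to the number of documents, which is one of several reasonable definitions.

```python
import pandas as pd

# Toy stand-in for the 'data' dataframe; values are made up.
data = pd.DataFrame({
    "id": ["d1", "d2", "d3"],
    "text": ["alpha", "beta", "gamma"],
    "topics": [["Damages", "Torts"], ["Damages"], ["Contracts"]],
})

# One row per (document, label) pair: a multi-label document is duplicated
# once for every label it carries.
exploded = data.explode("topics").rename(columns={"topics": "label"})

# One row per label: the ids of its documents, the joined meta-document text,
# and the class size.
metadoc = (
    exploded.groupby("label")
    .agg(
        doc_ids=("id", list),
        text=("text", " ".join),
        count=("id", "size"),
    )
    .reset_index()
)

# Class size as a percentage of the number of documents (assumed definition).
metadoc["percentage"] = 100 * metadoc["count"] / len(data)
```

Because documents are duplicated across labels, the percentages can add up to more than 100.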
There are 135 classes in this sample dataset. Labels and sizes of classes are as follows:
(Bailment, 8) - (Restitution, 29) - (Damage Awards, 321) - (Government Programs, 10) - (Chattel Mortgages and Bills of Sale, 5) - (Auctions, 3) - (Public Utilities, 1) - (Municipal Law, 1031) - (Deeds and Documents, 17) - (Sale of Land, 15) - (Personal Property, 19) - (Professional Occupations, 13) - (Armed Forces, 1) - (Persons of Unsound Mind, 1) - (Torts, 313) - (Brokers, 14) - (Quebec Nominate Contracts, 2) - (Income Tax, 1008) - (Expropriation, 3) - (Mechanics' Liens, 6) - (Aliens, 1008) - (Family Law, 1035) - (Trade Regulation, 29) - (Estoppel, 66) - (Master and Servant, 45) - (Indians, Inuit and Métis, 1) - (Statutes, 99) - (Specific Performance, 6) - (Releases, 9) - (Quebec Obligations, 1) - (Insurance, 54) - (Animals, 4) - (Fraud and Misrepresentation, 58) - (Labour Law, 55) - (Negotiable Instruments, 18) - (Constitutional Law, 34) - (Carriers, 3) - (Health, 7) - (Liens, 2) - (National Security, 1) - (Highways, 16) - (Trademarks, Names and Designs, 4) - (Habeas Corpus, 2) - (Agency, 21) - (Receivers, 22) - (Waiver, 7) - (Coroners, 1) - (Infants, 4) - (Motor Vehicles, 1005) - (Trusts, 17) - (Arbitration, 13) - (Conditional Sales, 9) - (Contempt, 7) - (Liquor Control, 4) - (Conflict of Laws, 15) - (Banks and Banking, 68) - (Food and Drug Control, 6) - (Evidence, 172) - (Joint Ventures, 6) - (Shipping and Navigation, 12) - (Mistake, 13) - (Actions, 458) - (Unemployment Insurance, 1) - (Libel and Slander, 10) - (Execution, 3) - (Customs, 1) - (Landlord and Tenant, 28) - (Civil Rights, 104) - (International Law, 2) - (Barristers and Solicitors, 28) - (Damages, 1107) - (Bankruptcy, 1019) - (Guarantee and Indemnity, 854) - (Real Property, 1058) - (Narcotic Control, 2) - (Copyright, 1) - (Contracts, 1218) - (Quebec Civil Law, 2) - (Medicine, 17) - (Gifts, 3) - (Associations, 2) - (Aeronautics, 2) - (Land Regulation, 44) - (Real Property Tax, 14) - (Patents of Invention, 5) - (Mortgages, 117) - (Criminal Law, 1119) - (Practice, 609) - (Business Law, 1) - (Mines and Minerals, 11) - (Replevin, 2) - (Fish and Game, 6) - (Securities Regulation, 7) - (Gaming and Betting, 1) - (Quebec Property, 1) - (Interest, 70) - (Company Law, 83) - (Waters, 5) - (Time, 2) - (Crown, 66) - (Building Contracts, 90) - (Admiralty, 1) - (Courts, 127) - (Post Office, 2) - (Subrogation, 6) - (Wills, 2) - (Sales and Service Taxes, 3) - (Equity, 74) - (Creditors and Debtors, 28) - (Trials, 76) - (Elections, 4) - (Administrative Law, 1100) - (Police, 24) - (Franchises, 2) - (Railways, 3) - (Education, 6) - (Injunctions, 1029) - (Quebec Procedure, 4) - (Common Law, 48) - (Devolution of Estates, 1) - (Quebec Family, 16) - (Consumer Law, 8) - (Perpetuities, 1) - (Social Assistance, 2) - (Partnership, 6) - (Executors and Administrators, 5) - (Pollution Control, 7) - (Prisons, 3) - (Telecommunications, 3) - (Workers' Compensation, 7) - (Sale of Goods, 8) - (Limitation of Actions, 86) - (Hospitals, 4) - (Extradition, 1) - (Choses in Action, 18)
The following pairs of classes have more than 100 documents in common:
(Damage Awards & Damages, 295) - (Torts & Actions, 101) - (Torts & Damages, 131) - (Actions & Practice, 241) - (Damages & Practice, 116) - (Guarantee and Indemnity & Contracts, 138) - (Guarantee and Indemnity & Mortgages, 101) - (Guarantee and Indemnity & Practice, 174)
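Pairwise overlaps like these can be counted directly from the per-document topic lists. The following is a minimal sketch on invented data; the variable names and the toy threshold are illustrative, and on the real sample the threshold would be 100.

```python
from collections import Counter
from itertools import combinations

# Toy topic annotations, one list per document (made up for illustration).
topics_per_doc = [
    ["Damages", "Damage Awards", "Torts"],
    ["Damages", "Damage Awards"],
    ["Actions", "Practice"],
    ["Damages", "Torts"],
]

pair_counts = Counter()
for topics in topics_per_doc:
    # Sorting the de-duplicated labels gives every unordered pair one
    # canonical key, so (A, B) and (B, A) are counted together.
    for pair in combinations(sorted(set(topics)), 2):
        pair_counts[pair] += 1

# Keep pairs that share more than `threshold` documents.
threshold = 1
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n > threshold}
```

Each document contributes one count to every pair of labels it carries, so a document with k labels adds to k(k-1)/2 pairs.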
A bar plot of classes with more than 500 samples is shown in Figure 1.
Figure 1. Size of classes with more than 500 documents
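A plot of this kind can be produced along the following lines. This is a sketch assuming matplotlib is available; it uses only a handful of the class sizes listed above rather than the full table, and the axis labels are our own choice.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd

# A small subset of the class sizes listed above.
class_sizes = pd.Series({
    "Contracts": 1218,
    "Criminal Law": 1119,
    "Damages": 1107,
    "Practice": 609,
    "Actions": 458,
    "Torts": 313,
})

# Keep only classes with more than 500 documents, largest first.
large = class_sizes[class_sizes > 500].sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(large.index, large.values)
ax.set_xlabel("Class")
ax.set_ylabel("Number of documents")
ax.tick_params(axis="x", rotation=45)
fig.tight_layout()
```

Calling fig.savefig with a file name of your choice writes the figure to disk.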
Dataframes are suitable data structures for exploratory analysis of a sample dataset. Using this data structure, we formed a meta-document for each label, consisting of all the documents related to that label. Descriptive statistics of the dataset, such as the distribution of class sizes or the joint distribution of multiple classes, can be calculated and added as columns of the dataframe for future reference or visualization.
- Whether the method generalizes needs to be investigated before it can be applied to compute descriptive statistics for the whole dataset.
- The variable metadoc will be used to extract the most relevant vocabulary for feature extraction.