Accepted JURIX 2017 paper: Semi-Supervised Training Method for Legal Semantic Search

IMRSV Data Labs is excited to announce that our long paper has been accepted to JURIX 2017, the 30th international conference on Legal Knowledge and Information Systems!

This is one of the most prestigious conferences in the AI+Law field. As described on their website, JURIX is “an international forum for research on the intersection of Law, Artificial Intelligence and Information Systems, under the auspices of The JURIX Foundation for Legal Knowledge Systems”.

The selection criteria for this conference are demanding. Papers need to score well in terms of relevance, development and originality of research, technical quality, significance, literature review, and overall evaluation. The acceptance rate is only about 30% for short and demo papers, and about 20% for full papers. IMRSV Data Labs is honoured that our paper, “A Semi-Supervised Training Method for Semantic Search of Legal Facts in Canadian Immigration Cases”, has been accepted as a full paper and presentation.

Following the proceedings, our paper will be published by IOS Press in their series Frontiers in Artificial Intelligence and Applications.

Our company will be well-represented at this year’s conference in Luxembourg City. We are looking forward to sharing our work, as well as gaining knowledge from the speakers and other attendees!

An overview of our presentation

Semantic barriers prevent many Canadians from having adequate access to the legal system. Due to the high cost of retaining a lawyer, a growing number of Canadians are representing themselves in court. We were inspired to help.

We envisioned an immigration-specific search algorithm to make legal research more efficient, thorough, and user-friendly. Existing search engines are not always effective because they can only match keywords to their results. In particular, traditional legal search is based on finding exact matches to a given combination of keywords in a set of legal cases. With our semantic search approach, users can instead enter search strings, and the system automatically takes into account the meaning of the words and the similarity between sentences. This removes the need to be familiar with the vocabulary used in legal documents, while still returning relevant results that aren’t exact matches.

In our testing, we found that the more similar the facts of two cases were, the more likely their legal outcomes or judgments were to be similar as well. Older cases could thus be used to accurately predict the outcomes of new cases. Of note, our system was designed to find sentences that assert a fact of the case and to limit the search to only these sentences. This was important because matching fact sentences against other sentence types, such as sentences that demonstrate reasoning or sentences in which litigants state their positions, could lead to misleading results.

We also decided to train our model using domain-specific language and domain-specific documents, as opposed to a more general training with general definitions of words. Although pre-trained word embeddings are available and provided by NLP tools, for a semantic search tool of immigration law cases, a word embedding model trained on the particular set of documents is preferred because it can capture the legal meaning of terms. With this understanding of legal vocabulary, sentences which are most similar to queries thus rank at the top of the retrieval results. The similarity score can also be used to estimate the relevance of a sentence with respect to a query.
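To make the ranking idea concrete, here is a minimal Python sketch. It is an illustration only, not the implementation from our paper: it assumes that a sentence is represented by the average of its word vectors and that similarity is measured with cosine similarity. The three-dimensional vectors in `TOY_EMBEDDINGS` and the function names are invented for this example; a real system would use embeddings trained on the immigration law corpus.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration only.
# A real model would be trained on the immigration law corpus.
TOY_EMBEDDINGS = {
    "applicant": [0.9, 0.1, 0.0],
    "claimant": [0.8, 0.2, 0.1],
    "fled": [0.1, 0.9, 0.1],
    "left": [0.2, 0.8, 0.0],
    "country": [0.1, 0.2, 0.9],
    "homeland": [0.0, 0.3, 0.8],
    "hearing": [0.5, 0.0, 0.5],
    "adjourned": [0.7, 0.0, 0.1],
}


def sentence_vector(sentence):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [
        TOY_EMBEDDINGS[word]
        for word in sentence.lower().split()
        if word in TOY_EMBEDDINGS
    ]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_sentences(query, sentences):
    """Return (score, sentence) pairs, most similar to the query first."""
    query_vec = sentence_vector(query)
    if query_vec is None:
        return []
    scored = []
    for sentence in sentences:
        vec = sentence_vector(sentence)
        if vec is not None:
            scored.append((cosine(query_vec, vec), sentence))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


facts = ["claimant left homeland", "hearing adjourned"]
ranking = rank_sentences("applicant fled country", facts)
```

In this toy setup the query shares no keywords with "claimant left homeland", yet that sentence ranks first because its words sit close to the query's words in the embedding space. The score attached to each sentence is the kind of similarity value that can serve as a relevance estimate.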

Figure 1. Steps of the proposed semi-supervised method

Our results show that the proposed semi-supervised method outperforms commonly used classifiers in detecting facts when the amount of training data is relatively small. We were able to establish the feasibility of detecting fact-asserting sentences and searching for semantically similar facts in a large Canadian immigration law corpus when only 0.3% of the corpus was manually annotated. In this binary classification task, a supervised fastText classifier that is initialized with immigration law word embeddings is more effective than benchmark classification methods used in legal applications.

Table 3 of the JURIX paper

Based on this tested algorithm, we launched a search engine that accepts either a combination of keywords or a full sentence as a query, then returns links to the legal cases in the immigration law corpus that include the most similar fact sentences.
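The overall flow of such a search can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not our production code: the `Sentence` record, the `is_fact` flag (assumed to come from the fact-detection classifier), the `search_cases` function, and the example.org links are all hypothetical. The `token_overlap` function is a deliberately simple stand-in for the embedding-based similarity score, used only to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    case_url: str   # hypothetical link to the case containing the sentence
    text: str
    is_fact: bool   # assumed to be the output of the fact-detection classifier


def token_overlap(a, b):
    """Stand-in similarity: Jaccard overlap of lowercased tokens.

    The real system would use an embedding-based similarity instead.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def search_cases(query, sentences, similarity=token_overlap, top_k=2):
    """Return up to top_k case links, ordered by their best fact sentence."""
    # Only fact-asserting sentences take part in the search.
    facts = [s for s in sentences if s.is_fact]
    ranked = sorted(facts, key=lambda s: similarity(query, s.text), reverse=True)
    links = []
    for sentence in ranked:
        if len(links) >= top_k:
            break
        if sentence.case_url not in links:
            links.append(sentence.case_url)
    return links


corpus = [
    Sentence("https://example.org/case/1", "the applicant fled the country in 2010", True),
    Sentence("https://example.org/case/2", "the applicant argues the officer erred", False),
    Sentence("https://example.org/case/3", "the applicant arrived in canada in 2012", True),
]
links = search_cases("applicant fled the country", corpus, top_k=1)
```

Because the second sentence is a litigant's position and not a fact, it is never returned, even for a query that matches it word for word. Passing a different `similarity` function is how an embedding-based score would be plugged in.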

Figure 3. Example of the results returned by the developed system

What this means moving forward

IMRSV Data Labs is furthering research and development with Natural Language Understanding and Natural Language Processing methods. However, we have since broadened the scope of our work.

Given that our technology is language-agnostic and domain-agnostic, it can be applied across any industry. The advances we are making thus have implications in many diverse fields.

Our work can be applied not only to legal research, but also to human resource tasks, enterprise management systems, individual organisation tools, and any program or service dealing with information management and analysis.

If you are interested in learning how to make AI work for you, contact us now.

Catherine Guo