ai-law's Issues

A larger, more diverse labelled dataset for classifying extracted holdings

In #1, Daming manually labelled 200 extracted cases from 2010, marking whether each extracted holding is in fact a holding. In #2 he reported an F1 score of roughly 86% (?) for binary classification. In Zoom discussions, he highlighted a data imbalance issue: we have few samples of extractions that are not holdings, and there are many different reasons why they are not.

Ideally, we want an F1 score close to 95% for binary classification on the extracted holdings, so that we can confidently rely on the classifier to filter the extractions. The filtered extractions feed the next stage of the pipeline, where we dissect the extracted sentences and identify both the cited case and the holding text itself, thereby completing the components required for the "automated holding extraction tool" we discussed. The filtered extractions can also provide reference summaries for the summarization pipeline behind the "automated case summary tool" we discussed.

There are two ways to get from 86% to 95%: improve our extraction method, or get more data for the classifier. We will discuss each separately; here, we focus on getting more data. About two weeks ago, I completed the pipeline for automated extraction over all ~150 years of data in our AWS S3 bucket. There are around 700k files in the bucket, and most of them should represent cases (there may be duplicate files per case). From what I have observed, the extraction code produces no extracted holding summary for many cases. Indeed, many cases are deficient (e.g. prisoner habeas corpus petitions) or straightforward and do not require detailed analysis in the form of an opinion. For reference, cases from the year 2010 produced ~400 extracted summaries, so we should expect roughly 50k-100k extracted summaries in total.

From the 50k-100k extracted summaries, we should aim to label 10% of them to perfect our automated tools, and then use the automated tools to label the remaining 90%. For this 10% of labelled data, we should ensure the cases are diverse across year, case type (civil / criminal), and outcome (affirm / reverse).

10% of 50k cases is 5k cases. We cannot manually label all of them ourselves, but we also cannot fully outsource the work before gaining clarity on the labelling task itself. Here is my view of the most sensible next steps:

  1. Create a tool to randomly but deterministically (by a seed) sample the extracted holdings from our AWS S3 bucket (see the sketch after this list).
  2. Create a tool, based on simple pattern matching, to label the case type and outcome of the corresponding cases in (1). I recall Daniel has a database (stored in Stata) with some metadata for each case in the AWS S3 bucket; it may already contain this information.
  3. Create a tool to compute and verify statistical metrics for the cases in (1), so we can ensure their diversity.
  4. Manually label 400-800 more cases in (1), following similar format and methodology in https://docs.google.com/spreadsheets/d/1BibkKuGlbjnYOBQCZiK_0G3EwnQcTDt-L2ONNA6lsSQ/edit#gid=0
  5. Compute the improvement in F1 for each increment of 200 data points
  6. Create a one-page guideline / ruleset for the labelling task
  7. Recruit 3-5 labelers to label the rest (Upwork, Amazon Mechanical Turk), using tools such as Datasaur, following the guideline in (6).
  8. Aggregate the labelers' results in a reasonable way (see for example, section 4.3 in SummEval)
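
For step (1), a minimal sketch of a seeded sampler over the S3 bucket is below. The bucket name, key prefix, and file layout are placeholders, and it assumes each extraction is stored as its own S3 object; the real pipeline's layout may differ.

```python
import random

import boto3


def sample_extracted_holdings(bucket, prefix, n, seed=42):
    """Deterministically sample n extraction files under an S3 prefix.

    The same (bucket, prefix, n, seed) always yields the same sample,
    so labelling batches are reproducible.
    """
    s3 = boto3.client("s3")
    keys = []
    # Collect all object keys under the prefix (the listing is paginated).
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    # Sort first so the sample does not depend on S3 listing order.
    keys.sort()
    rng = random.Random(seed)
    return rng.sample(keys, min(n, len(keys)))


# Hypothetical usage; the bucket and prefix names are made up.
# batch = sample_extracted_holdings("ai-law-opinions", "extracted_holdings/", n=400, seed=7)
```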

Holding extraction quality improvement

The method we currently use for extracting holdings relies on searching for the lemma "hold" and filtering for a particular type of part-of-speech tag (verb-like). This method has proven effective on recent data (we have tried cases from 2010 onwards), but not on old opinions (from the 19th century). In the old opinions, word usage is liberal and citations to other cases are scarce; more often than not, the text surrounding "hold" does not represent a summary of a previous case's holding. Since we have not looked at cases between 1900 and 2000, we do not know whether this presents a more fundamental problem for the lemma + POS methodology. The majority of holdings were developed in the latter half of that century, so the method's performance in that period matters more than in others.
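
For reference, here is a minimal sketch of the lemma + POS filter described above, assuming a spaCy pipeline; the actual extraction code may use a different NLP stack or additional filters.

```python
import spacy

# Small English pipeline; any model with a tagger, lemmatizer,
# and sentence boundaries will do.
nlp = spacy.load("en_core_web_sm")


def extract_holding_sentences(opinion_text):
    """Return sentences containing a verb-like use of the lemma 'hold'."""
    doc = nlp(opinion_text)
    hits = []
    for sent in doc.sents:
        if any(tok.lemma_ == "hold" and tok.pos_ == "VERB" for tok in sent):
            hits.append(sent.text)
    return hits


# Example (made-up sentence):
# extract_holding_sentences("The court held that the statute did not apply.")
```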

Nonetheless, we should look for better ways to extract reference summaries from older cases (or more recent ones with similar characteristics). One idea is to search for implicit summaries: since these cases rarely cite precedents, they have to explain the reasoning behind their own decision in detail, which means they are more likely to already contain a good summary of their own opinion.

Existing methods and comparable datasets for long document summarization

From my previous notes:

A good starting point is https://paperswithcode.com/sota/text-summarization-on-arxiv . A dataset is provided there, extracted from physics papers on arXiv. It is more comparable to court opinions in both length and structure than the datasets used in SummEval (such as CNN/DailyMail).

  • It is unclear how easy it would be to run the models listed. However, 10 out of 14 papers provide code. At the very least we can run them manually, one by one. If we want to run them in batch and in a more automated way, we would need to write a data adapter to transform our data format into each of theirs and vice versa.
  • It is also unclear how much compute is needed to run these models; models like BigBird might require a lot.

In sum, we have two ways forward:

  • extract short documents from opinions, summarize them, and compare against CNN/DailyMail and the models referenced by SummEval
  • work with models for long-document summarization and compare against them. Since nobody has systematically studied and compared these models before, this would require much more work: we would need to understand each codebase, write data converters, figure out the resources needed, and set up experiments and automation on the cloud.

For summarizing long documents, the comparable datasets are:

Use narrower rules to get high precision

Related to #1, another direction is to restrict extraction to the phrases "holding that" and "held that", making the extraction itself ~100% accurate, and then use other corpora as negative data points. The hope is that such a classifier can achieve very high precision.
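
A minimal sketch of the narrow rule is below; the exact phrase list and the naive sentence splitting are assumptions and can be tightened further.

```python
import re

# Narrow, high-precision patterns: only "holding that ..." / "held that ...".
NARROW_PATTERN = re.compile(r"\b(holding|held)\s+that\b", re.IGNORECASE)


def narrow_positive_sentences(opinion_text):
    """Return sentences matching the narrow rules (near-certain positives)."""
    # Naive sentence split; a proper sentence segmenter would be better.
    sentences = re.split(r"(?<=[.!?])\s+", opinion_text)
    return [s for s in sentences if NARROW_PATTERN.search(s)]


# Sentences drawn from other corpora (or non-matching sentences) can then serve
# as negative data points when training the high-precision classifier.
```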

Other alternative datasets for court opinions

Two other datasets, crawled from the following sources, have come up in discussions:

  1. Justia
  2. Wikipedia

Both datasets have hand-crafted holdings. The downsides are that the quantity is small (only a few hundred are available) and that they seem to be limited to influential cases, such as those decided by the US Supreme Court. They may not be good datasets for large-scale automated training, but we can use them later to manually evaluate the summary/extraction quality of our methods.

Notes on value of court opinion summarization

From my previous notes:

I have been looking into past work on datasets, metrics, and summarization evaluation. The references in the Salesforce critique paper (ACL, 2019) and the CMU paper (EMNLP, 2020) provide great starting points. Another great piece of work is a more recent Yale + Salesforce paper (ACL, 2021), which also provides a tool to evaluate 23 models and 15 metrics simultaneously.

I have also looked into the Bloomberg data files a little (now in the AWS S3 bucket). I would say the value of this court opinion dataset is as follows:

  • Shows little layout bias (the CNNDM dataset is prone to having the most relevant information in the opening sentences; see screenshot 1 from the ACL 2019 paper).
  • Provides golden reference summaries that are highly correlated with human quality judgments (judges and clerks have already curated them). The quality of reference summaries has been a key weakness of the CNNDM and TAC datasets; this was noted in both the ACL 2021 and ACL 2019 papers, and the EMNLP paper also reported very low human quality-judgment scores for reference summaries.
  • First dataset with rich metadata (judge’s bias, affiliation, word embedding, area of law, etc.).
  • The problem context provides a more objective way of assessing the quality of human judgments and of summaries: e.g., every summary should contain the disposition, some basic facts, the parties, the key points of law, and perhaps some brief reasoning.
  • This is a great, unique dataset for testing a model's ability to generate consistent and factually accurate summaries. The ACL 2021 paper (referencing another piece of their work from EMNLP 2020) notes that "hallucinated facts" are a common issue in generated summaries, accounting for as much as roughly 30% of errors. See also "Evaluating the Factual Consistency of Abstractive Text Summarization".
  • The court opinion dataset is probably large enough to build a model from scratch (i.e. pre-train on it), rather than merely fine-tuning an existing pre-trained model. It would be very interesting to explore how well summarization models trained on the court opinion dataset generalize to other domains.
  • The experimental results should demonstrate some of these points. We should probably choose 2-3 of these bullet points as the focus of the paper; the remaining ones can be left for future work.

[0616]: next steps for auto summary extraction

1: get two sets:
a) POS(Verb) + lemma(hold): this has already been done
b) lemma(hold) only: a superset of a)
Compute b) - a) and inspect the difference; if it is small, the POS filter is not discarding many holdings (see the sketch after these notes).
Tip: optimize the pipeline so it only runs the POS and lemma components.

2: rely on italics to pinpoint the citation and the summary (conclusion)
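
A sketch of the comparison in (1), assuming both sets are built per opinion with spaCy; the per-case bookkeeping (IDs, aggregation across the corpus) is left out.

```python
import spacy

# Tip from the note: drop pipeline components we do not need (e.g. NER);
# the parser is kept here only because sentence boundaries come from it.
nlp = spacy.load("en_core_web_sm", disable=["ner"])


def hold_sentence_sets(opinion_text):
    """Return (a, b): a = lemma 'hold' tagged as a verb, b = lemma 'hold' with any POS."""
    doc = nlp(opinion_text)
    a, b = set(), set()
    for sent in doc.sents:
        for tok in sent:
            if tok.lemma_ == "hold":
                b.add(sent.text)
                if tok.pos_ == "VERB":
                    a.add(sent.text)
    return a, b


# Accumulate b - a over a sample of cases; if the difference stays small,
# the POS filter is not discarding many candidate holdings.
```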
