
PrivacyGLUE: A Benchmark Dataset for General Language Understanding in Privacy Policies

This repository is the official codebase for the paper "PrivacyGLUE: A Benchmark Dataset for General Language Understanding in Privacy Policies", published in the MDPI Applied Sciences special issue on NLP and Applications.

PrivacyGLUE is the first comprehensive privacy-oriented NLP benchmark, comprising 7 relevant and high-quality privacy tasks for measuring general language understanding in the privacy language domain. We release results for the BERT, RoBERTa, Legal-BERT, Legal-RoBERTa and PrivBERT pretrained language models and perform model-pair agreement analysis to detect examples where models benefited from domain specialization. Our findings show that PrivBERT, the only model pretrained on privacy policies, outperforms the other models by an average of 2–3% across all PrivacyGLUE tasks, highlighting the importance of in-domain pretraining for privacy policies.

Note that a previous version of this paper was submitted to the ACL Rolling Review (ARR) on 16th December 2022 before resubmission to the MDPI Applied Sciences special issue on NLP and applications on 3rd February 2023.

Table of Contents

  1. Tasks
  2. Leaderboard
  3. Dependencies
  4. Initialization
  5. Usage
  6. Notebooks
  7. Test
  8. Citation

Tasks 🏃

| Task | Study | Type | Train/Dev/Test Instances | Classes |
| --- | --- | --- | --- | --- |
| OPP-115 | Wilson et al. (2016)* | Multi-label sequence classification | 2,185/550/697 | 12 |
| PI-Extract | Bui et al. (2021) | Multi-task token classification | 2,579/456/1,029 | 3/3/3/3** |
| Policy-Detection | Amos et al. (2021) | Binary sequence classification | 773/137/391 | 2 |
| PolicyIE-A | Ahmad et al. (2021) | Multi-class sequence classification | 4,109/100/1,041 | 5 |
| PolicyIE-B | Ahmad et al. (2021) | Multi-task token classification | 4,109/100/1,041 | 29/9** |
| PolicyQA | Ahmad et al. (2020) | Reading comprehension | 17,056/3,809/4,152 | -- |
| PrivacyQA | Ravichander et al. (2019) | Binary sequence classification | 157,420/27,780/62,150 | 2 |

*Data splits were not defined in Wilson et al. (2016) and were instead taken from Mousavi et al. (2020).

**PI-Extract and PolicyIE-B consist of four and two subtasks respectively; the number of BIO token classes per subtask is separated by a forward-slash character.
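
To make the multi-task setup concrete, here is a minimal sketch of what a single token-classification example might look like. The tokens, field names and BIO labels below are purely illustrative and are not the actual PI-Extract tag set.

```python
# Illustrative multi-task token-classification example: each subtask
# assigns its own BIO tag sequence over the same tokens. All label and
# field names here are hypothetical, not the actual PI-Extract tag set.
example = {
    "tokens": ["We", "collect", "your", "email", "address", "."],
    # Subtask 1: spans of data that is collected
    "collect_tags": ["O", "O", "B-COLLECT", "I-COLLECT", "I-COLLECT", "O"],
    # Subtask 2: spans of data that is shared (none in this sentence)
    "share_tags": ["O", "O", "O", "O", "O", "O"],
}

# Each subtask has its own class inventory (B-X, I-X and O), which is
# why the Classes column lists one count per subtask.
```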

Leaderboard 🏁

Our current leaderboard consists of the BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), Legal-BERT (Chalkidis et al., 2020), Legal-RoBERTa (Geng et al., 2021) and PrivBERT (Srinath et al., 2021) models.

| Task | Metric* | BERT | RoBERTa | Legal-BERT | Legal-RoBERTa | PrivBERT |
| --- | --- | --- | --- | --- | --- | --- |
| OPP-115 | m-F1 | 78.4±0.6 | 79.5±1.1 | 79.6±1.0 | 79.1±0.7 | 82.1±0.5 |
|  | µ-F1 | 84.0±0.5 | 85.4±0.5 | 84.3±0.7 | 84.7±0.3 | 87.2±0.4 |
| PI-Extract | m-F1 | 60.0±2.7 | 62.4±4.4 | 59.5±3.0 | 60.5±3.9 | 66.4±3.4 |
|  | µ-F1 | 60.0±2.7 | 62.4±4.4 | 59.5±3.0 | 60.5±3.9 | 66.4±3.4 |
| Policy-Detection | m-F1 | 85.3±1.8 | 86.9±1.3 | 86.6±1.0 | 86.4±2.0 | 87.3±1.1 |
|  | µ-F1 | 92.1±1.2 | 92.7±0.8 | 92.7±0.5 | 92.4±1.3 | 92.9±0.8 |
| PolicyIE-A | m-F1 | 72.9±1.7 | 73.2±1.6 | 73.2±1.5 | 73.5±1.5 | 75.3±2.2 |
|  | µ-F1 | 84.7±1.0 | 84.8±0.6 | 84.7±0.5 | 84.8±0.3 | 86.2±1.0 |
| PolicyIE-B | m-F1 | 50.3±0.7 | 52.8±0.6 | 51.5±0.7 | 53.5±0.5 | 55.4±0.7 |
|  | µ-F1 | 50.3±0.5 | 54.5±0.7 | 52.2±1.0 | 53.6±0.9 | 55.7±1.3 |
| PolicyQA | s-F1 | 55.7±0.5 | 57.4±0.4 | 55.3±0.7 | 56.3±0.6 | 59.3±0.5 |
|  | EM | 28.0±0.9 | 30.0±0.5 | 27.5±0.6 | 28.6±0.9 | 31.4±0.6 |
| PrivacyQA | m-F1 | 53.6±0.8 | 54.4±0.3 | 53.6±0.8 | 54.4±0.5 | 55.3±0.6 |
|  | µ-F1 | 90.0±0.1 | 90.2±0.0 | 90.1±0.1 | 90.2±0.1 | 90.2±0.1 |

*m-F1, µ-F1, s-F1 and EM refer to the Macro-F1, Micro-F1, Sample-F1 and Exact Match metrics, respectively.
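
As a refresher on how the two F1 aggregation modes differ, here is a minimal sketch using scikit-learn on toy labels (not benchmark data):

```python
from sklearn.metrics import f1_score

# Toy multi-class labels, purely for illustrating the aggregation modes
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]

# Macro-F1 (m-F1): unweighted mean of per-class F1 scores, so rare
# classes count as much as frequent ones
macro = f1_score(y_true, y_pred, average="macro")

# Micro-F1 (µ-F1): F1 over globally pooled counts, dominated by
# frequent classes
micro = f1_score(y_true, y_pred, average="micro")

print(f"m-F1: {macro:.3f}, µ-F1: {micro:.3f}")
```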

Dependencies 🔍

  1. This repository was tested against Python version 3.8.13 and CUDA version 11.7 (a quick version-check sketch follows this list). Create a virtual environment with the same Python version and install dependencies with poetry:

    $ poetry install
    

    Alternatively, install dependencies in the virtual environment using pip:

    $ pip install -r requirements.txt
    
  2. Install Git LFS to access upstream task data. We utilized version 3.2.0 in our implementation.

  3. Optional: To further develop this repository, install pre-commit to set up pre-commit hooks for code checks.
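
To quickly verify that an environment matches the tested versions, a small check like the one below can help; it assumes PyTorch is among the installed dependencies:

```python
import sys

import torch  # assumed to be installed as part of the project dependencies

# This repository was tested against Python 3.8.13 and CUDA 11.7
print(sys.version.split()[0])     # active Python version
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())  # whether a GPU is usable
```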

Initialization 🔥

  1. To prepare git submodules and data, execute:

    $ bash scripts/prepare.sh
    
  2. Optional: To install pre-commit hooks for further development of this repository, execute:

    $ pre-commit install
    

Usage ❄️

We use the run_privacy_glue.sh script to run PrivacyGLUE benchmark experiments:

usage: run_privacy_glue.sh [option...]

optional arguments:
  --cuda_visible_devices       <str>
                               comma separated string of integers passed
                               directly to the "CUDA_VISIBLE_DEVICES"
                               environment variable
                               (default: 0)

  --fp16                       enable 16-bit mixed precision computation
                               through NVIDIA Apex for training
                               (default: False)

  --model_name_or_path         <str>
                               model to be used for fine-tuning. Currently only
                               the following are supported:
                               "bert-base-uncased",
                               "roberta-base",
                               "nlpaueb/legal-bert-base-uncased",
                               "saibo/legal-roberta-base",
                               "mukund/privbert"
                               (default: bert-base-uncased)

  --no_cuda                    disable CUDA even when available (default: False)

  --overwrite_cache            overwrite caches used in preprocessing
                               (default: False)

  --overwrite_output_dir       overwrite run directories and saved checkpoint(s)
                               (default: False)

  --preprocessing_num_workers  <int>
                               number of workers to be used for preprocessing
                               (default: None)

  --task                       <str>
                               task to be worked on. The following values are
                               accepted: "opp_115", "piextract",
                               "policy_detection", "policy_ie_a", "policy_ie_b",
                               "policy_qa", "privacy_qa", "all"
                               (default: all)

  --wandb                      log metrics and results to wandb
                               (default: False)

  -h, --help                   show this help message and exit

To run the PrivacyGLUE benchmark for a supported model against all tasks, execute:

$ bash scripts/run_privacy_glue.sh --cuda_visible_devices <device_id> \
                                   --model_name_or_path <model> \
                                   --fp16

Note: Replace the <device_id> argument with a GPU ID or comma-separated GPU IDs to run single-GPU or multi-GPU fine-tuning respectively. Correspondingly, replace the <model> argument with one of our supported models listed in the usage documentation above.
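
All supported checkpoints are standard Hugging Face models, so they can also be loaded directly for inspection. The following is a minimal sketch using the generic transformers API, not the repository's actual training code; the num_labels of 2 is an assumption (e.g. for policy_detection):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Any supported checkpoint from the usage documentation, e.g. PrivBERT
model_name = "mukund/privbert"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels is task-dependent; 2 is assumed here (e.g. policy_detection)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("We may share your data with third parties.", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2])
```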

Notebooks 📖

We use the following Jupyter notebooks for analyses outside of the PrivacyGLUE benchmark:

| Notebook | Description |
| --- | --- |
| visualize_domain_embeddings.ipynb | Compute and visualize BERT embeddings for Wikipedia, EURLEX and Privacy Policies using t-SNE and UMAP |
| visualize_results.ipynb | Plot benchmark results and perform significance testing |
| inspect_predictions.ipynb | Inspect test-set predictions for model-pair agreement analysis |
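
For orientation, the embed-then-project idea behind visualize_domain_embeddings.ipynb can be sketched as follows. This is an assumed pipeline using transformers and scikit-learn, not the notebook's exact code:

```python
import torch
from sklearn.manifold import TSNE
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Stand-in documents for the three domains compared in the notebook
texts = [
    "Wikipedia-style encyclopedic text ...",
    "EURLEX-style legal text ...",
    "Privacy-policy text ...",
]

with torch.no_grad():
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**enc).last_hidden_state
    # Mean-pool the final hidden states into one vector per document
    mask = enc["attention_mask"].unsqueeze(-1)
    embeddings = (hidden * mask).sum(1) / mask.sum(1)

# Project to 2D for plotting; perplexity must be below the sample count
points = TSNE(n_components=2, perplexity=2).fit_transform(embeddings.numpy())
print(points.shape)  # (3, 2)
```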

Test 🔬

  1. To run unit tests, execute:

    $ make test
    
  2. To run integration tests, execute:

    $ CUDA_VISIBLE_DEVICES=<device_id> make integration
    

    Note: Replace the <device_id> argument with a GPU ID or comma-separated GPU IDs to run single-GPU or multi-GPU integration tests respectively. Alternatively, pass an empty string to run CPU integration tests.

Citation 🏛️

If you found PrivacyGLUE useful, we kindly ask you to cite our paper as follows:

@Article{app13063701,
  AUTHOR =       {Shankar, Atreya and Waldis, Andreas and Bless, Christof and
                  Andueza Rodriguez, Maria and Mazzola, Luca},
  TITLE =        {PrivacyGLUE: A Benchmark Dataset for General Language
                  Understanding in Privacy Policies},
  JOURNAL =      {Applied Sciences},
  VOLUME =       {13},
  YEAR =         {2023},
  NUMBER =       {6},
  ARTICLE-NUMBER = {3701},
  URL =          {https://www.mdpi.com/2076-3417/13/6/3701},
  ISSN =         {2076-3417},
  ABSTRACT =     {Benchmarks for general language understanding have been
                  rapidly developing in recent years of NLP research,
                  particularly because of their utility in choosing
                  strong-performing models for practical downstream
                  applications. While benchmarks have been proposed in the legal
                  language domain, virtually no such benchmarks exist for
                  privacy policies despite their increasing importance in modern
                  digital life. This could be explained by privacy policies
                  falling under the legal language domain, but we find evidence
                  to the contrary that motivates a separate benchmark for
                  privacy policies. Consequently, we propose PrivacyGLUE as the
                  first comprehensive benchmark of relevant and high-quality
                  privacy tasks for measuring general language understanding in
                  the privacy language domain. Furthermore, we release
                  performances from multiple transformer language models and
                  perform model-pair agreement analysis to detect tasks
                  where models benefited from domain specialization. Our
                  findings show the importance of in-domain pretraining for
                  privacy policies. We believe PrivacyGLUE can accelerate NLP
                  research and improve general language understanding for humans
                  and AI algorithms in the privacy language domain, thus
                  supporting the adoption and acceptance rates of solutions
                  based on it.},
  DOI =          {10.3390/app13063701}
}
