Multiscale Positive-Unlabeled Detection of AI-Generated Texts

Yuchuan Tian, Hanting Chen, Xutao Wang, Zheyuan Bai, Qinghua Zhang, Ruifeng Li, Chao Xu, Yunhe Wang

The official code for our paper "Multiscale Positive-Unlabeled Detection of AI-Generated Texts".

Paper Link: https://arxiv.org/pdf/2305.18149.pdf

BibTeX-formatted citation:

@misc{tian2023multiscale,
      title={Multiscale Positive-Unlabeled Detection of AI-Generated Texts}, 
      author={Yuchuan Tian and Hanting Chen and Xutao Wang and Zheyuan Bai and Qinghua Zhang and Ruifeng Li and Chao Xu and Yunhe Wang},
      year={2023},
      eprint={2305.18149},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Detector Models

We have open-sourced the detector models from the paper, listed below.

Links for Detectors: Google Drive, Baidu Disk (PIN: 1234)

We have also uploaded the detector models to HuggingFace, where easy-to-use demos and online APIs are provided.

Variants	HC3-Full-En	HC3-Sent-En
seed0	98.68	82.84
seed1 HuggingFace: en v1	98.56	87.06
seed2	97.97	86.02
Avg.	98.40$\pm$0.31	85.31$\pm$1.80
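The Avg. row can be reproduced from the three per-seed scores. The sketch below is an inference from the numbers in the table, not something the authors state: the reported spread matches the population standard deviation (dividing by n), not the sample standard deviation. The names `RESULTS` and `summarize` are illustrative only.

```python
from statistics import mean, pstdev

# Per-seed scores copied from the table above (seed0, seed1, seed2).
RESULTS = {
    "HC3-Full-En": [98.68, 98.56, 97.97],
    "HC3-Sent-En": [82.84, 87.06, 86.02],
}


def summarize(scores):
    """Return (mean, population standard deviation), both rounded to 2 decimals."""
    return round(mean(scores), 2), round(pstdev(scores), 2)


for name, scores in RESULTS.items():
    avg, std = summarize(scores)
    print(f"{name}: {avg:.2f} +/- {std:.2f}")
```

With the sample standard deviation (`statistics.stdev`) the second column would come out at about 2.20 instead of 1.80, which is why the population form is assumed here.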

Stronger Detectors

We have also open-sourced detector models with strengthened training strategies. Specifically, we develop a strong Chinese detector AIGC_detector_zhv2, which demonstrates similar performance to SOTA closed-source Chinese detectors on various texts, including news articles, poetry, essays, etc. The DEMOs and APIs are available on HuggingFace.

Detector	Google Drive	Baidu Disk	HuggingFace Link
English, version 2 (env2)	Google Drive	Baidu Disk (PIN:1234)	en v2
Chinese, version 2 (zhv2)	Google Drive	Baidu Disk (PIN:1234)	zh v2

About the Dataset

Here we provide the official link for the HC3 dataset: Dataset Link. For convenience, we also provide identical copies of the dataset on Google Drive and Baidu Disk (PIN: 1234). We acknowledge the marvelous work by the HC3 authors.

Data Preprocessing

In Appendix B of our paper, we proposed the removal of redundant spaces in human texts of the HC3-English dataset. We have provided a helper function en_cleaning in corpus_cleaning_kit.py that takes a sentence string as input and returns a preprocessed sentence without redundant spaces.
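To illustrate what "redundant spaces" can look like, here is a minimal sketch of this kind of cleaning. It is not the repository's `en_cleaning` implementation; the function name `remove_redundant_spaces` and the specific rules (collapsing whitespace, dropping spaces before punctuation, rejoining split contractions) are assumptions chosen for the example. Refer to corpus_cleaning_kit.py for the actual procedure.

```python
import re


def remove_redundant_spaces(text: str) -> str:
    """Illustrative cleaner (hypothetical, not the repo's en_cleaning)."""
    # Collapse any run of whitespace into a single space and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Drop spaces that precede closing punctuation, e.g. "know ," -> "know,".
    text = re.sub(r"\s+([,.!?;:)\]])", r"\1", text)
    # Drop spaces that follow an opening bracket, e.g. "( see" -> "(see".
    text = re.sub(r"([(\[])\s+", r"\1", text)
    # Rejoin contractions split by tokenization, e.g. "do n't" -> "don't".
    text = re.sub(r"\s+(n't|'s|'re|'ve|'ll|'d|'m)\b", r"\1", text)
    return text


print(remove_redundant_spaces("I do n't know , but  it 's fine ."))
```

The rules are applied in a fixed order so that whitespace is normalized before the punctuation and contraction patterns run.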

Here we provide a cleaned version of HC3-English, in which all answers are cleaned (i.e., redundant spaces are removed). However, please use the original version of HC3 to reproduce the experiments in our paper, because the cleaning procedure is already embedded in the training and validation scripts.

Cleaned HC3-English Link: Google Drive, Baidu Disk (PIN: 1234)

Preparation

Install the required packages:

pip install -r requirements.txt

Download the datasets to the directory ./data
Download the nltk package punkt (this can be done through the nltk API: nltk.download('punkt'))
Download the pretrained models (this can be done automatically by transformers)

Before running, the directory should contain the following files:

├── data
│   ├── unfilter_full
│   │   ├── en_test.csv
│   │   └── en_train.csv
│   └── unfilter_sent
│       ├── en_test.csv
│       └── en_train.csv
├── README.md
├── corpus_cleaning_kit.py
├── dataset.py
├── multiscale_kit.py
├── option.py
├── pu_loss_mod.py
├── prior_kit.py
├── requirements.txt
├── train.py
└── utils.py
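The layout above can be checked before launching a run. The following is a small sketch, assuming it is executed from the repository root; `EXPECTED_FILES` and `find_missing_files` are hypothetical names introduced for this example and are not part of the repository. README.md is left out of the check because it is not needed for training.

```python
from pathlib import Path

# Relative paths taken from the directory tree above (README.md omitted).
EXPECTED_FILES = [
    "data/unfilter_full/en_test.csv",
    "data/unfilter_full/en_train.csv",
    "data/unfilter_sent/en_test.csv",
    "data/unfilter_sent/en_train.csv",
    "corpus_cleaning_kit.py",
    "dataset.py",
    "multiscale_kit.py",
    "option.py",
    "pu_loss_mod.py",
    "prior_kit.py",
    "requirements.txt",
    "train.py",
    "utils.py",
]


def find_missing_files(root="."):
    """Return the expected relative paths that do not exist under root."""
    root_path = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root_path / rel).is_file()]


missing = find_missing_files(".")
if missing:
    print("Missing files:")
    for rel in missing:
        print("  " + rel)
else:
    print("All expected files are present.")
```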

Training

The script for training is train.py.

RoBERTa on HC3-English

Commands for seed=0,1,2:

CUDA_VISIBLE_DEVICES=0 python train.py --batch-size 32 --max-sequence-length 512 --train-data-file unfilter_full/en_train.csv --val-data-file unfilter_full/en_test.csv --model-name roberta-base --local-data data --lamb 0.4 --prior 0.2 --pu_type dual_softmax_dyn_dtrun --len_thres 55 --aug_min_length 1 --max-epochs 1 --weight-decay 0 --mode original_single --aug_mode sentence_deletion-0.25 --clean 1 --val_file1 unfilter_sent/en_test.csv --quick_val 1 --learning-rate 5e-05 --seed 0

CUDA_VISIBLE_DEVICES=0 python train.py --batch-size 32 --max-sequence-length 512 --train-data-file unfilter_full/en_train.csv --val-data-file unfilter_full/en_test.csv --model-name roberta-base --local-data data --lamb 0.4 --prior 0.2 --pu_type dual_softmax_dyn_dtrun --len_thres 55 --aug_min_length 1 --max-epochs 1 --weight-decay 0 --mode original_single --aug_mode sentence_deletion-0.25 --clean 1 --val_file1 unfilter_sent/en_test.csv --quick_val 1 --learning-rate 5e-05 --seed 1

CUDA_VISIBLE_DEVICES=0 python train.py --batch-size 32 --max-sequence-length 512 --train-data-file unfilter_full/en_train.csv --val-data-file unfilter_full/en_test.csv --model-name roberta-base --local-data data --lamb 0.4 --prior 0.2 --pu_type dual_softmax_dyn_dtrun --len_thres 55 --aug_min_length 1 --max-epochs 1 --weight-decay 0 --mode original_single --aug_mode sentence_deletion-0.25 --clean 1 --val_file1 unfilter_sent/en_test.csv --quick_val 1 --learning-rate 5e-05 --seed 2

BERT on HC3-English

Commands for seed=0,1,2:

CUDA_VISIBLE_DEVICES=0 python train.py --batch-size 32 --max-sequence-length 512 --train-data-file unfilter_full/en_train.csv --val-data-file unfilter_full/en_test.csv --model-name bert-base-cased --local-data data --lamb 0.5 --prior 0.3 --pu_type dual_softmax_dyn_dtrun --len_thres 60 --aug_min_length 1 --max-epochs 1 --weight-decay 0 --mode original_single --aug_mode sentence_deletion-0.25 --clean 1 --val_file1 unfilter_sent/en_test.csv --quick_val 1 --learning-rate 5e-05 --seed 0


CUDA_VISIBLE_DEVICES=0 python train.py --batch-size 32 --max-sequence-length 512 --train-data-file unfilter_full/en_train.csv --val-data-file unfilter_full/en_test.csv --model-name bert-base-cased --local-data data --lamb 0.5 --prior 0.3 --pu_type dual_softmax_dyn_dtrun --len_thres 60 --aug_min_length 1 --max-epochs 1 --weight-decay 0 --mode original_single --aug_mode sentence_deletion-0.25 --clean 1 --val_file1 unfilter_sent/en_test.csv --quick_val 1 --learning-rate 5e-05 --seed 1


CUDA_VISIBLE_DEVICES=0 python train.py --batch-size 32 --max-sequence-length 512 --train-data-file unfilter_full/en_train.csv --val-data-file unfilter_full/en_test.csv --model-name bert-base-cased --local-data data --lamb 0.5 --prior 0.3 --pu_type dual_softmax_dyn_dtrun --len_thres 60 --aug_min_length 1 --max-epochs 1 --weight-decay 0 --mode original_single --aug_mode sentence_deletion-0.25 --clean 1 --val_file1 unfilter_sent/en_test.csv --quick_val 1 --learning-rate 5e-05 --seed 2
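The six training commands in this section differ only in --model-name, --lamb, --prior, --len_thres, and --seed. The sketch below rebuilds them from those values so that all seeds can be launched from one place. The flags and values are copied from the commands above; the helper names `MODEL_SETTINGS`, `build_train_command`, and `run_seeds` are hypothetical and not part of the repository.

```python
import os
import subprocess

# Settings that differ between the two backbones, copied from the commands above.
MODEL_SETTINGS = {
    "roberta-base": {"lamb": "0.4", "prior": "0.2", "len_thres": "55"},
    "bert-base-cased": {"lamb": "0.5", "prior": "0.3", "len_thres": "60"},
}


def build_train_command(model_name, seed):
    """Return the train.py argument list for one backbone and one seed."""
    settings = MODEL_SETTINGS[model_name]
    return [
        "python", "train.py",
        "--batch-size", "32",
        "--max-sequence-length", "512",
        "--train-data-file", "unfilter_full/en_train.csv",
        "--val-data-file", "unfilter_full/en_test.csv",
        "--model-name", model_name,
        "--local-data", "data",
        "--lamb", settings["lamb"],
        "--prior", settings["prior"],
        "--pu_type", "dual_softmax_dyn_dtrun",
        "--len_thres", settings["len_thres"],
        "--aug_min_length", "1",
        "--max-epochs", "1",
        "--weight-decay", "0",
        "--mode", "original_single",
        "--aug_mode", "sentence_deletion-0.25",
        "--clean", "1",
        "--val_file1", "unfilter_sent/en_test.csv",
        "--quick_val", "1",
        "--learning-rate", "5e-05",
        "--seed", str(seed),
    ]


def run_seeds(model_name, seeds=(0, 1, 2), gpu="0"):
    """Run training once per seed on the given GPU, stopping at the first failure."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
    for seed in seeds:
        subprocess.run(build_train_command(model_name, seed), env=env, check=True)
```

Calling `run_seeds("roberta-base")` or `run_seeds("bert-base-cased")` from the repository root would run the three seeds sequentially; nothing is executed until one of these calls is made.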

Acknowledgement

Our code references the following GitHub repository:

https://github.com/openai/gpt-2-output-dataset

We sincerely thank its authors for open-sourcing their work.
