Automatic Recognition and Classification of Future Work Sentences from Academic Articles in a Specific Domain

Overview

Dataset and source code for the paper "Automatic Recognition and Classification of Future Work Sentences from Academic Articles in a Specific Domain".

This paper aims to automatically recognize and classify Future Work Sentences (FWS) in academic articles. We take the Natural Language Processing (NLP) domain as an example and use papers from three major conferences, namely ACL, EMNLP and NAACL (available at https://aclanthology.org/), as the experimental dataset. Our work covers the following aspects:

FWS Recognition: After manually annotating future work sentences, we use traditional machine learning models, including Logistic Regression (LR), Naïve Bayes (NB), Support Vector Machine (SVM) and Random Forest (RF), to decide whether a sentence is an FWS.
FWS Classification: After recognition, we classify the FWS in each paper into six types (Method, Resources, Evaluation, Application, Problem and Other) using BERT, SciBERT, TextCNN and Bi-LSTM models.
FWS Evaluation: In addition, we compare the keywords extracted from FWS with those extracted from the abstracts of other papers published several years later, to evaluate the effectiveness of FWS.
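
To make the recognition step concrete, here is a minimal sketch of FWS recognition as binary sentence classification. It is not the code in this repository: it is a small from-scratch multinomial Naïve Bayes written with the standard library only, and the class name `NaiveBayesFWS`, the tokenizer, and the toy sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


class NaiveBayesFWS:
    """Multinomial Naive Bayes with add-one smoothing.

    Labels follow the dataset convention: 1 = FWS, 0 = Non-FWS.
    Hypothetical helper class, not part of the released source code.
    """

    def fit(self, sentences, labels):
        self.word_counts = {0: Counter(), 1: Counter()}
        self.class_counts = Counter(labels)
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(tokenize(sentence))
        self.vocab = set(self.word_counts[0]) | set(self.word_counts[1])
        return self

    def predict(self, sentence):
        total = sum(self.class_counts.values())
        scores = {}
        for label in (0, 1):
            # Start from the log prior of the class.
            score = math.log(self.class_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for token in tokenize(sentence):
                if token in self.vocab:  # words never seen in training are skipped
                    score += math.log((self.word_counts[label][token] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy, made-up training sentences (not taken from the released corpus).
train_sentences = [
    "In future work, we plan to extend our model to other languages.",
    "We leave the evaluation on larger datasets for future work.",
    "In the future, we will explore additional features.",
    "Our model outperforms the baseline on all datasets.",
    "Table 2 shows the results of the experiments.",
    "We use a bidirectional encoder to represent each sentence.",
]
train_labels = [1, 1, 1, 0, 0, 0]

model = NaiveBayesFWS().fit(train_sentences, train_labels)
prediction = model.predict("We plan to explore this in future work.")
print(prediction)
```

The same interface (fit on labelled sentences, predict a 0/1 label) would apply to the other traditional models listed above; only the scoring rule changes.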

Directory structure

FWS                                                  Root directory
├─ Dataset                                           Experimental datasets
│    ├─ Corpus For KeyphraseExtraction               Corpus for content analysis of FWS                 
│    │    └─ Title and Abstract.csv                  Corpus for content analysis of FWS, including titles and abstracts
│    │
│    ├─ Corpus_For_FWS_Recognition.csv               Training dataset for FWS recognition 
│    ├─ Corpus_For_FWS_Recognition_Predict.csv       Sample testing dataset for recognition of FWS
│    ├─ Corpus_For_FWS_TypeClassify.csv              Training dataset for FWS classification 
│    └─ Corpus_For_FWS_TypeClassify_Predict.csv      Sample testing dataset for FWS classification 
│   
├─ FWS Classification                                Module of FWS classification  
│    ├─ Bert.py                                      Source code of BERT/SciBERT classification model
│    ├─ Bilstm.py                                    Source code of Bi-LSTM model
│    ├─ TextCNN.py                                   Source code of TextCNN model
│    ├─ logs.txt                                     Log file which records classification performance of classification model
│    ├─ main.py                                      Source code for selecting a model to train Corpus_For_FWS_Recognition by command line arguments
│    ├─ predict.py                                   Source code for using trained model to predict label of FWS in test dataset
│    ├─ run.py                                       Source code to start training process of FWS classification
│    └─ weights                                      Model's weight
│           ├─ bilstm                                Weight of Bi-LSTM model
│           └─ textcnn                               Weight of TextCNN model
│
├─ FWS Recognition                                   Module of FWS recognition 
│    ├─ main.py                                      Source code of data preprocessing, training and testing of FWS recognition model
│    └─ run.py                                       Source code to start training of FWS recognition
│
└─ README.md

Dataset description

We release all of our training datasets in the Dataset directory:

Corpus_For_FWS_Recognition.csv: Training dataset for FWS recognition. It contains 9,009 FWS and 55,887 Non-FWS.

Corpus_For_FWS_TypeClassify.csv: Training dataset for FWS type classification. It contains 9,009 records.

Each line of Corpus_For_FWS_Recognition.csv includes:

id: Paper ID in ACL Anthology.

year: Year of publication.

text: Content of FWS or Non-FWS.

label: 1 for FWS, 0 for Non-FWS.

chapter: Type of chapter headings.
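
As an illustration of the fields above, the following sketch reads rows in this layout and counts the labels. It assumes the file has a header row with exactly these column names in this order; the sample rows, the paper IDs and the `chapter` values are made up, and `label_distribution` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Made-up rows in the documented column layout. The real data lives in
# Dataset/Corpus_For_FWS_Recognition.csv; header row and column order are assumptions.
sample_csv = io.StringIO(
    "id,year,text,label,chapter\n"
    'X00-0001,2018,"In future work, we plan to add more languages.",1,conclusion\n'
    'X00-0001,2018,"Our model outperforms the baseline.",0,experiments\n'
    'X00-0002,2019,"Table 1 lists the dataset statistics.",0,experiments\n'
)


def label_distribution(csv_file):
    """Count how many rows carry each label (1 = FWS, 0 = Non-FWS)."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[int(row["label"])] += 1
    return counts


counts = label_distribution(sample_csv)
fws_share = counts[1] / sum(counts.values())
print(counts[1], counts[0], round(fws_share, 3))
```

To run it on the released file, pass an opened file object instead of `sample_csv`, for example `open("Dataset/Corpus_For_FWS_Recognition.csv", newline="", encoding="utf-8")`. The encoding is a guess and may need adjusting.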

Each line of Corpus_For_FWS_TypeClassify.csv includes:

id: Paper ID in ACL Anthology.

lable: Six types of FWS including method, resources, evaluation, application, problem and other.

text: Content of FWS.

Additionally, we release a sample of our test dataset. If you need the full data, please contact us.

Quick start

To reproduce our experimental results, follow these steps:

Recognition

Open a terminal in the FWS Recognition directory and run this command:

python run.py

Classification

Open a terminal in the FWS Classification directory and run this command:

python run.py

Extract keywords

We provide two notebooks. Follow the steps in them to extract keywords and perform the related preprocessing.
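
Once keywords have been extracted, comparing the keywords of FWS with those of later abstracts can be sketched as a set-overlap computation. The Jaccard measure used here is an assumption chosen for illustration and is not necessarily the comparison used in the paper or the notebooks; the keyword lists are made up and `jaccard` is a hypothetical helper.

```python
def jaccard(keywords_a, keywords_b):
    """Jaccard overlap of two keyword lists, ignoring case and surrounding spaces."""
    a = {k.lower().strip() for k in keywords_a}
    b = {k.lower().strip() for k in keywords_b}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Made-up keywords standing in for the output of a keyword extraction step.
fws_keywords = ["low-resource languages", "data augmentation", "domain adaptation"]
later_abstract_keywords = [
    "Data Augmentation",
    "domain adaptation",
    "pre-trained models",
    "question answering",
]

overlap = jaccard(fws_keywords, later_abstract_keywords)
print(overlap)
```

A higher overlap would suggest that topics named in future work sentences reappear in later abstracts, which is the intuition behind the FWS evaluation described in the overview.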

Citation

Please cite the following paper if you use these codes and datasets in your work.

Chengzhi Zhang, Yi Xiang, Wenke Hao, Zhicheng Li, Yuchen Qian, Yuzhuo Wang. Automatic Recognition and Classification of Future Work Sentences from Academic Articles in a Specific Domain. Journal of Informetrics, 2023, 17(1): 101373. [doi] [arXiv] [Dataset & Source Code]
