Story2Personality

Story2Personality is a new narrative understanding benchmark for predicting a character's personality from that character's narrative text in a script. We release the dataset and code for our two papers: "Machine Narrative Comprehension in Fictional Characters Personality Prediction Task" (NAACL Student Research Workshop 2022) and "MBTI Personality Prediction for Fictional Characters Using Movie Scripts" (EMNLP 2022).

Step 0: Env Setup

conda env create -f person_environment.yml
conda activate person
python -m spacy download en

Step 1: Data Parsing

Our data parser first reads the narrative books and movie scripts from HTML files, then extracts the utterances spoken by recognized characters. The whole process can take 3 to 5 hours. If you are only interested in the data, you can download it via this link and unzip it into the root folder.

# move the downloaded "dialog_scene_mention_dicts.zip" to the root folder
unzip dialog_scene_mention_dicts.zip

If you would like to see how the raw text data is processed, first download the HTML files from OneDrive. The contents are the union of the NarrativeQA dataset and Movie-Script-Database. Please unzip the downloaded file into the root folder of the repo.

# move the downloaded "raw_texts.zip" to the root folder
unzip raw_texts.zip

We also share some other preprocessed files in the preprocessed/ folder, which our parser depends on. The following command generates dialog_dict.pickle, scene_dict.pickle, and mention_dict.pickle from scratch.

python parse.py

At this point, you will have three .pickle files containing dictionaries of "what people say" and "who is mentioned" in a dialogue or a scene.
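To get a first look at the parser output, the files can be loaded with Python's standard pickle module. The snippet below is a minimal sketch, not part of the repo: it assumes the three files sit in the root folder, and the helper names load_parsed_dicts and summarize are hypothetical. Because the internal layout of each dictionary is not documented here, it only reports the type, size, and a few keys instead of assuming a schema. Note that unpickling runs arbitrary code, so only load files from a source you trust.

```python
import pickle
from pathlib import Path

# File names as produced by parse.py; their location (repo root) is an assumption.
PICKLE_NAMES = ("dialog_dict.pickle", "scene_dict.pickle", "mention_dict.pickle")


def load_parsed_dicts(root="."):
    """Load whichever of the three parser outputs exist under `root`."""
    loaded = {}
    for name in PICKLE_NAMES:
        path = Path(root) / name
        if path.exists():
            with path.open("rb") as f:
                loaded[name] = pickle.load(f)
    return loaded


def summarize(obj, max_keys=3):
    """Describe a loaded object without assuming anything about its schema."""
    summary = {"type": type(obj).__name__}
    if hasattr(obj, "__len__"):
        summary["size"] = len(obj)
    if isinstance(obj, dict):
        summary["sample_keys"] = list(obj)[:max_keys]
    return summary


if __name__ == "__main__":
    for name, obj in load_parsed_dicts().items():
        print(name, summarize(obj))
```

Running the script from the root folder prints one summary line per file found, which is usually enough to decide how to iterate over the entries.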

Step 2: Model Training and Inference

To use the data for modeling, go to dataset/ and download one of the tokenized datasets. This format is more convenient for training and testing than the .pickle files. More details will be provided in the future.

Citation

If you find this repo useful, please consider citing our paper:

@article{sang2022mbti,
  title={MBTI Personality Prediction for Fictional Characters Using Movie Scripts},
  author={Sang, Yisi and Mou, Xiangyang and Yu, Mo and Wang, Dakuo and Li, Jing and Stanton, Jeffrey},
  journal={arXiv preprint arXiv:2210.10994},
  year={2022}
}
