NER_SW_Fanfic

NER on Star Wars fan fiction using spaCy and gensim word vectors

In my pursuit of learning NLP, I decided to try some NER modeling on my own Star Wars fan fiction. I like using my own writing (fiction and non-fiction) for personal NLP projects because there's a fair amount of it, it's readily available, and it's fun. My Star Wars fan fiction was a good choice for NER because it contains a lot of entities, many of which a standard off-the-shelf model wouldn't pick up or label correctly: original or obscure character names and places, and specific types of entities like droids and species.

This project is broken into four notebooks. The first notebook shows file importation, data cleaning, and splitting the corpus into training and testing data.
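As a rough illustration of the splitting step, here is a minimal sketch using only the standard library. The function name `split_corpus`, the seed, and the placeholder corpus are hypothetical; the hold-out size of 50 is an assumption based on the "50 lines" mentioned in the testing notes further down, and the notebook itself may split the data differently.

```python
import random


def split_corpus(lines, holdout_size=50, seed=42):
    """Shuffle the lines reproducibly and hold out `holdout_size` of them for testing."""
    if holdout_size >= len(lines):
        raise ValueError("holdout_size must be smaller than the corpus")
    shuffled = list(lines)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    # The first `holdout_size` shuffled lines become the test set, the rest train.
    return shuffled[holdout_size:], shuffled[:holdout_size]


# Placeholder corpus standing in for the cleaned fan fiction sentences.
corpus = [f"Sentence number {i}." for i in range(200)]
train_lines, test_lines = split_corpus(corpus)
print(len(train_lines), len(test_lines))  # 150 50
```

Fixing the seed keeps the hold-out set stable between runs, so later notebooks evaluate against the same lines.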

The second notebook uses the training data to build word vectors with gensim. It includes code for unigrams, bigrams, and trigrams.

The third notebook takes the training data and uses spaCy's entity ruler to create some specific labels to annotate the data. It is not a full dramatis personae of every character and entity, but it does include ones that are particularly odd and that earlier versions of the models were getting consistently wrong. For example, 'Han' is a specific entity labeled as PERSON. Early models kept labeling Han as a NORP, which makes sense if the model took it to mean Han Chinese. Other specific labels include SPECIES, DROID, FORCE as its own label, and FORM for different lightsaber combat forms. The training data are annotated, converted to spaCy 3 format, and split into training and validation data.
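The annotation step can be sketched roughly as follows, assuming spaCy 3.x and a blank English pipeline. The labels come from the description above, but the specific patterns (other than 'Han') and the example sentence are hypothetical stand-ins, not the notebook's actual pattern list.

```python
import spacy
from spacy.tokens import DocBin

# A blank pipeline has no statistical NER, so every entity below comes from the ruler.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Illustrative patterns only; the real project uses a longer, hand-curated list.
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Han"},
    {"label": "SPECIES", "pattern": "Wookiee"},
    {"label": "DROID", "pattern": "Artoo"},
    {"label": "FORCE", "pattern": "Force"},
])

doc = nlp("Han asked the Wookiee to follow Artoo and trust the Force")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)

# Annotated docs can be packed into spaCy 3's binary training format.
doc_bin = DocBin(docs=[doc])
# doc_bin.to_disk("train.spacy")  # hypothetical output path
```

String patterns match on exact token text, which is why 'Han' as a standalone token is labeled PERSON regardless of what a statistical model would have guessed.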

The fourth notebook reloads the word vectors and adds the NER pipe to the model, showing the command line instruction in the notebook. The model is then trained, again showing the command line instruction in the notebook along with its output. The best model is reloaded and tested on the hold-out data from Notebook 1.
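For readers who have not used the spaCy 3 command line, the sketch below assembles the two kinds of commands this step relies on, `spacy init vectors` and `spacy train`, as argument lists. Every file and directory name is a hypothetical placeholder, and the `--paths.vectors` override assumes a config whose `[paths]` section defines `vectors`; the exact commands in the notebook may differ.

```python
import shlex
import sys


def spacy_init_vectors_cmd(vectors_path, output_dir, lang="en"):
    """Command that converts word2vec-format vectors into a spaCy vectors directory."""
    return [sys.executable, "-m", "spacy", "init", "vectors", lang, vectors_path, output_dir]


def spacy_train_cmd(config_path, output_dir, train_path, dev_path, vectors_dir):
    """Command that trains a pipeline from a config, overriding the data and vector paths."""
    return [
        sys.executable, "-m", "spacy", "train", config_path,
        "--output", output_dir,
        "--paths.train", train_path,
        "--paths.dev", dev_path,
        "--paths.vectors", vectors_dir,
    ]


# Hypothetical paths, for illustration only.
init_cmd = spacy_init_vectors_cmd("vectors.txt", "sw_vectors")
train_cmd = spacy_train_cmd("config.cfg", "output", "train.spacy", "dev.spacy", "sw_vectors")

for cmd in (init_cmd, train_cmd):
    print(shlex.join(cmd))
    # import subprocess; subprocess.run(cmd, check=True)  # uncomment to actually run
```

After training, spaCy writes `model-best` and `model-last` folders under the output directory, so the best model can be reloaded with `spacy.load("output/model-best")` (again using the placeholder output path).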

The tested output uses displacy's entity rendering to show which entities got labeled from those 50 lines. Many were PERSON entities that were called out in the annotated data, while some were not, notably 'Kalick,' 'Adan,' 'Sheev Palpatine,' and, most interestingly, 'Dathomiri Nightsisters.' I am pretty sure this is the only time that bigram appears in the entire corpus of over 18 thousand sentences, so I thought it was interesting that it got picked up.
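A small sketch of displacy's entity rendering, assuming spaCy 3.x. To keep it self-contained, the entity is set by hand on a blank pipeline rather than predicted by the trained model, and the sentence and the PERSON label on 'Sheev Palpatine' are illustrative assumptions.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Sheev Palpatine met the Dathomiri Nightsisters")

# Manually mark tokens 0-1 as an entity; a trained model would fill doc.ents itself.
doc.ents = [Span(doc, 0, 2, label="PERSON")]

# jupyter=False returns the HTML markup as a string instead of displaying it inline.
html = displacy.render(doc, style="ent", page=False, jupyter=False)
print(html[:80])
```

In a notebook, leaving out `jupyter=False` lets displacy draw the highlighted entities directly in the output cell.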

This model used word vectors and bigrams. Trigrams didn't seem to improve the model very much. The original model, which had less thorough text cleaning, no word vectors, and fewer explicitly annotated entities, had an F1 score around 0.83. The current model has an F1 score of 0.946 and a PERSON F1 score of 0.978.
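For context on what these scores measure, here is a minimal standard-library sketch of entity-level precision, recall, and F1, where a predicted entity counts as correct only if its span and label both match the gold annotation. The function name and the toy spans are hypothetical, and the numbers it prints are unrelated to the project's results.

```python
def entity_f1(gold, predicted):
    """Return (precision, recall, f1) over sets of (start, end, label) entity spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example: two of three predictions match the gold spans exactly.
gold_spans = {(0, 3, "PERSON"), (10, 15, "DROID"), (20, 25, "SPECIES")}
pred_spans = {(0, 3, "PERSON"), (10, 15, "DROID"), (30, 35, "FORCE")}

precision, recall, f1 = entity_f1(gold_spans, pred_spans)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
```

A per-label score such as the PERSON figure follows from the same calculation after filtering both sets down to that one label.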
