This project forked from hazyresearch/bootleg

Self-Supervision for Named Entity Disambiguation at the Tail

Home Page: http://hazyresearch.stanford.edu/bootleg

License: Apache License 2.0

Python 88.95% Jupyter Notebook 10.53% Shell 0.43% Makefile 0.09%

bootleg's Introduction

Self-Supervision for Named Entity Disambiguation at the Tail

Bootleg is a self-supervised named entity disambiguation (NED) system for English built to improve disambiguation of entities that occur infrequently, or not at all, in training data. We call these entities tail entities. This is a critical task as the majority of entities are rare. The core insight behind Bootleg is that these tail entities can be disambiguated by reasoning over entity types and relations. We give an overview of how Bootleg achieves this below. For details, please see our blog post and paper.

Note that Bootleg is actively under development and feedback is welcome. Submit bugs on the Issues page or feel free to submit your contributions as a pull request.

Update 9-25-2021: We changed our architecture to be a biencoder. Our entity textual input still has all the goodness of types and KG relations, but our model now requires less storage space and has improved performance. A secret to getting the biencoder to work over the tail was heavy masking of the mention in the context encoder and entity title in the entity encoder.

Update 2-15-2021: We made a major rewrite of the codebase and moved to using Emmental for training. Check out the changelog for details.

Getting Started

Install via

pip install bootleg

Check out our installation and quickstart guide here.

Models

Below is the link to download the English Bootleg model. The download comes with the saved model and config to run the model. We show in our quickstart guide and end-to-end tutorial how to load a config and run a model.

| Model | Description | Number of Parameters | Link |
|---|---|---|---|
| BootlegUncased | Uses titles, descriptions, types, and KG relations. Trained on uncased data. | 1.3B | Download |

Tutorials

We provide tutorials to help users get familiar with Bootleg here.

Bootleg Overview

Given an input sentence, Bootleg outputs a predicted entity for each detected mention. Bootleg first extracts the mentions in the sentence. For each mention, it retrieves the set of possible candidate entities and any structural information about each candidate, e.g., type information or knowledge graph (KG) information. Bootleg uses this information to generate an entity embedding through a Transformer entity encoder. The mention and its surrounding context are encoded by a context encoder. For each mention, the candidate entity whose embedding has the highest dot product with the context embedding is selected.
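The final selection step can be illustrated with a minimal sketch, assuming the context embedding and the candidate entity embeddings have already been computed. The function name `select_entity`, the candidate IDs, and the toy vectors are hypothetical and are not part of Bootleg's API; in Bootleg the embeddings come from Transformer encoders.

```python
from typing import Dict, List, Tuple


def dot(u: List[float], v: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def select_entity(
    context_emb: List[float],
    candidate_embs: Dict[str, List[float]],
) -> Tuple[str, float]:
    """Return the (candidate id, score) pair with the highest dot product.

    Hypothetical helper: it only mirrors the scoring rule described above.
    """
    if not candidate_embs:
        raise ValueError("at least one candidate is required")
    scores = {cid: dot(context_emb, emb) for cid, emb in candidate_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy 3-dimensional embeddings (made up for illustration).
context = [0.9, 0.1, 0.0]
candidates = {
    "candidate_a": [1.0, 0.0, 0.0],
    "candidate_b": [0.0, 1.0, 0.0],
}
best_id, best_score = select_entity(context, candidates)
print(best_id, best_score)
```

Because the entity and context encoders are separate, the entity side can in principle be computed once and reused across many mentions, which is the usual appeal of a biencoder design.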

Dataflow

More details can be found here

Inference

Given a pretrained model, we support three types of inference: --mode eval, --mode dump_preds, and --mode dump_embs. Eval mode is the fastest option; it runs the test files through the model and outputs aggregated quality metrics to the log. Dump_preds mode writes the individual predictions and their corresponding probabilities to a jsonlines file, which is useful for error analysis. Dump_embs mode is the same as dump_preds, but additionally outputs entity embeddings, which can then be read and processed in a downstream system. See this notebook for an example that uses a downloaded Bootleg model.
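For error analysis, a jsonlines prediction file can be scanned for low-confidence predictions. The following is a minimal sketch that only assumes one JSON object per line; the field names `qids` and `probs`, the sample records, and the helper names are hypothetical, so check the keys in your own output file before relying on them.

```python
import json
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_jsonl(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one dict per non-empty line of jsonlines input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def low_confidence(
    records: Iterable[Dict],
    threshold: float,
    ids_key: str = "qids",    # hypothetical field name
    probs_key: str = "probs",  # hypothetical field name
) -> List[Tuple[int, str, float]]:
    """Collect (record index, predicted id, probability) below the threshold."""
    flagged = []
    for i, rec in enumerate(records):
        for pred_id, prob in zip(rec.get(ids_key, []), rec.get(probs_key, [])):
            if prob < threshold:
                flagged.append((i, pred_id, prob))
    return flagged


# In-memory stand-in for a prediction file (contents made up for illustration).
sample_lines = [
    '{"sentence": "first example", "qids": ["Q1"], "probs": [0.95]}',
    "",
    '{"sentence": "second example", "qids": ["Q2", "Q3"], "probs": [0.40, 0.88]}',
]
records = list(parse_jsonl(sample_lines))
flagged = low_confidence(records, threshold=0.5)
print(flagged)
```

To run this against a real file, pass an open file handle to `parse_jsonl` instead of the in-memory list, since a file object also iterates line by line.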

Entity Embedding Extraction

As we have a separate encoder for generating an entity representation, we also support dumping all entities to create a single entity embedding matrix for use downstream. This is done through the bootleg.extract_all_entities script. See this notebook for an example that uses a downloaded Bootleg model.
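One common downstream use of such a matrix is a nearest-neighbor lookup between entities. The sketch below assumes the matrix has already been loaded as a list of rows and that a mapping from entity ID to row index is available; the IDs, the vectors, and the function names are hypothetical and say nothing about the file format the script actually writes.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def most_similar(
    entity_id: str,
    matrix: List[List[float]],
    id_to_row: Dict[str, int],
) -> str:
    """Return the ID of the other entity closest to entity_id by cosine similarity."""
    query = matrix[id_to_row[entity_id]]
    others = [eid for eid in id_to_row if eid != entity_id]
    if not others:
        raise ValueError("need at least two entities")
    return max(others, key=lambda eid: cosine(query, matrix[id_to_row[eid]]))


# Toy matrix with one row per entity (values made up for illustration).
matrix = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
id_to_row = {"entity_a": 0, "entity_b": 1, "entity_c": 2}
neighbor = most_similar("entity_a", matrix, id_to_row)
print(neighbor)
```

For a matrix with many entities, a vectorized or approximate nearest-neighbor library would be a better fit than this linear scan.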

Training

We recommend using GPUs for training Bootleg models. For large datasets, we support distributed training with PyTorch's DistributedDataParallel framework to distribute batches across multiple GPUs. Check out the Basic Training and Advanced Training tutorials for more information and sample data!

Downstream Tasks

Bootleg produces contextual entity embeddings (as well as learned static embeddings) that can be used in downstream tasks, such as relation extraction and question answering. Check out the tutorial to see how this is done.

Other Languages

The released Bootleg model only supports English, but we have trained multi-lingual models using Wikipedia and Wikidata. If you have interest in doing this, please let us know with an issue request or email [email protected]. We have data prep code to help prepare multi-lingual data.
