
This project forked from williamleif/histwords


Collection of tools for building diachronic/historical word vectors

Home Page: http://nlp.stanford.edu/projects/histwords/

License: Apache License 2.0

Python 93.56% Shell 5.25% C 0.52% Makefile 0.68%


# Word Embeddings for Historical Text

Author: William Hamilton ([email protected])

Overview

An eclectic collection of tools for analyzing historical language change using vector space semantics.


Pre-trained historical embeddings

Various embeddings (for many languages, constructed using different embedding approaches) are available on the project website.

Some pre-trained word2vec (i.e., SGNS) historical word vectors for multiple languages (constructed via Google N-grams) are also available here:

All except Chinese contain embeddings for the decades in the range 1800s-1990s (2000s are excluded because of sampling changes in the N-grams corpus). The Chinese data starts in 1950.

Embeddings constructed using the Corpus of Historical American English (COHA) are also available:

example.sh contains an example run showing how to download and use the embeddings. example.py shows how to use the vector representations in Python (assuming you have already run the example.sh script).
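As a concrete illustration of working with downloaded vectors, here is a minimal sketch of loading a plain-text word-vector file into NumPy. The one-word-plus-components-per-line layout is an assumption for illustration only; the actual downloads are read by the loading code in this repository, not by this helper.

```python
import numpy as np

def load_text_vectors(path):
    """Load word vectors from a whitespace-separated text file,
    one word followed by its vector components per line.
    NOTE: this file layout is a hypothetical example; the repo's
    own modules handle the real on-disk format of the downloads."""
    vocab, rows = [], []
    with open(path) as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vocab.append(parts[0])
            rows.append([float(x) for x in parts[1:]])
    return vocab, np.array(rows)
```

Keeping the vocabulary as a list alongside a single matrix makes lookups (`vocab.index(word)`) and bulk operations (matrix products for nearest neighbors) straightforward.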

This paper describes how the embeddings were constructed. If you make use of these embeddings in your research, please cite the following:

@inproceedings{hamilton_diachronic_2016,
  title = {Diachronic {Word} {Embeddings} {Reveal} {Statistical} {Laws} of {Semantic} {Change}},
  url = {http://arxiv.org/abs/1605.09096},
  booktitle = {Proc. {Assoc}. {Comput}. {Ling}. ({ACL})},
  author = {Hamilton, William L. and Leskovec, Jure and Jurafsky, Dan},
  year = {2016}
}

Code organization

The structure of the code (in terms of folder organization) is as follows:

Main folder for using historical embeddings:

Folders with pre-processing code and active research code (potentially unstable):

example.py shows how to compute the similarity series for two words over time, which is how we evaluated different methods against the attested semantic shifts listed in our paper.
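The idea behind a similarity series can be sketched in a few lines: for each decade, take the cosine similarity between the two words' vectors. The dict-of-dicts layout (`decade -> {word: vector}`) below is an illustrative assumption, not the repository's actual embedding classes.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity_series(embeds_by_decade, w1, w2):
    """Cosine similarity between two words across decades.
    `embeds_by_decade` maps decade -> {word: vector}; this layout
    is a hypothetical stand-in for the repo's embedding objects."""
    series = {}
    for decade, vecs in sorted(embeds_by_decade.items()):
        if w1 in vecs and w2 in vecs:  # both words must be attested
            series[decade] = cosine(vecs[w1], vecs[w2])
    return series
```

A word whose meaning shifts (e.g. "gay", one of the attested shifts discussed in the paper) will show a falling similarity to words tied to its older sense.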

If you want to learn historical embeddings for new data, the code in the sgns directory is recommended; it can be run with the default settings. As long as your corpus has at least 100 million words per time period, this is the best method. For smaller corpora, use the representations/ppmigen.py code followed by the vecanalysis/makelowdim.py code (to learn SVD embeddings). In either case, the vecanalysis/seq_procrustes.py code should then be used to align the learned embeddings. The default hyperparameters should suffice for most use cases.
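The alignment step rests on orthogonal Procrustes: rotate one decade's embedding matrix onto the next so that cosine distances are preserved while the coordinate axes line up. A minimal sketch of that core computation (the repository's seq_procrustes.py additionally handles vocabulary intersection and chaining across all decades):

```python
import numpy as np

def orthogonal_procrustes_align(base, other):
    """Rotate `other` (n x d) onto `base` (n x d) using the orthogonal
    matrix R minimizing ||other @ R - base||_F.
    Classic solution: with U S Vt = svd(other.T @ base), R = U @ Vt.
    This is a sketch of the idea behind vecanalysis/seq_procrustes.py,
    not the repo's full implementation."""
    u, _, vt = np.linalg.svd(other.T @ base)
    r = u @ vt  # orthogonal: rotation (possibly with reflection)
    return other @ r
```

Because R is orthogonal, within-decade similarities are unchanged; only the shared coordinate frame moves, which is what makes cross-decade comparisons meaningful.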

Dependencies

Core dependencies:

You will also need Jupyter/IPython to run any IPython notebooks.

Contributors: williamleif
