bhoov / flyvec

A biologically inspired method to create sparse, binary word vectors

Home Page: https://bhoov.github.io/flyvec

License: Apache License 2.0

Jupyter Notebook 31.67% Python 18.16% Makefile 0.27% Cuda 44.66% C++ 5.23%

flyvec's Introduction

FlyVec

Sparse Binary Word Embeddings Inspired by the Fruit Fly Brain

Code based on the ICLR 2021 paper Can a Fruit Fly Learn Word Embeddings?.

In this work we use a well-established neurobiological network motif from the mushroom body of the fruit fly brain to learn sparse binary word embeddings from raw unstructured text. This package allows the user to access pre-trained word embeddings and generate sparse binary hash codes for individual words.

Interactive demos of the learned concepts are available at flyvec.org.

How to use

Install from Pip (recommended)

pip install flyvec

Installing from Source

After cloning:

conda env create -f environment-dev.yml
conda activate flyvec
pip install -e .

Basic Usage

The example below shows how to access the binary word embedding of an individual token at the default hash length k=50.

import numpy as np
from flyvec import FlyVec

model = FlyVec.load()
embed_info = model.get_sparse_embedding("market"); embed_info
{'token': 'market',
 'id': 1180,
 'embedding': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1,
        0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0], dtype=int8)}

Changing the Hash Length

The user can obtain the FlyVec embeddings for any hash length using the following example.

small_embed = model.get_sparse_embedding("market", 4); np.sum(small_embed['embedding'])
4
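The output above shows that the hash length equals the number of ones in the embedding. As a rough illustration of how a binary code with exactly k active bits can be derived from real-valued activations, here is a minimal top-k sketch. It is an assumption for illustration only, not FlyVec's internal code: the function name `top_k_binary_hash` and the random activations are hypothetical, and the vector length of 400 is chosen only to match the size of the embedding printed earlier.

```python
import numpy as np

def top_k_binary_hash(activations, k):
    """Return an int8 vector with ones at the k largest activations."""
    activations = np.asarray(activations)
    if not 0 < k <= activations.size:
        raise ValueError("k must be between 1 and the number of units")
    # argpartition puts the indices of the k largest values in the last k slots.
    top_idx = np.argpartition(activations, -k)[-k:]
    code = np.zeros(activations.size, dtype=np.int8)
    code[top_idx] = 1
    return code

rng = np.random.default_rng(0)
activations = rng.normal(size=400)  # hypothetical activations, not model output
code = top_k_binary_hash(activations, 4)
print(int(code.sum()))  # prints 4
```

A smaller k gives a sparser code, which is consistent with the sum of 4 reported for hash length 4.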

Handling "unknown" tokens

FlyVec uses a simple, word-based tokenizer. The provided model uses a vocabulary with about 20,000 words, all lower-cased, with special tokens for numbers (<NUM>) and unknown words (<UNK>). Unknown tokens have the token id of 0, which can be used to filter unknown tokens.

unk_embed = model.get_sparse_embedding("DefNotAWord")
if unk_embed['id'] == 0:
    print("I AM THE UNKNOWN TOKEN DON'T USE ME FOR ANYTHING IMPORTANT")
I AM THE UNKNOWN TOKEN DON'T USE ME FOR ANYTHING IMPORTANT
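To apply the same check to many tokens at once, the id test can be wrapped in a small filter. The sketch below assumes each record is a dict with at least the 'token' and 'id' keys shown in the examples above. The helper name `drop_unknown` and the hand-written records are hypothetical stand-ins, so the snippet runs without downloading the model.

```python
def drop_unknown(embedding_info, unk_id=0):
    """Keep only records whose vocabulary id differs from the unknown-token id."""
    return [e for e in embedding_info if e["id"] != unk_id]

# Hand-written stand-ins for the dicts returned by get_sparse_embedding;
# the 'embedding' key is omitted here for brevity.
records = [
    {"token": "market", "id": 1180},
    {"token": "<UNK>", "id": 0},
]

known = drop_unknown(records)
print([e["token"] for e in known])  # prints ['market']
```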

Batch generating word embeddings

Embeddings for individual words in a sentence can be obtained using this snippet.

sentence = "Supreme Court dismissed the criminal charges."
tokens = model.tokenize(sentence)
embedding_info = [model.get_sparse_embedding(t) for t in tokens]
embeddings = np.array([e['embedding'] for e in embedding_info])
print("TOKENS: ", [e['token'] for e in embedding_info])
print("EMBEDDINGS: ", embeddings)
TOKENS:  ['supreme', 'court', 'dismissed', 'the', 'criminal', 'charges']
EMBEDDINGS:  [[0 1 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 1 0]]
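Once the embeddings are stacked into one array, a simple way to compare words is to count the active bits they share. The sketch below is one possible similarity measure and is an assumption, not the evaluation procedure used in the paper. The function name `pairwise_overlap` and the toy 6-bit codes are hypothetical; real embeddings from the model would be passed in the same way.

```python
import numpy as np

def pairwise_overlap(embeddings):
    """Count shared active bits for every pair of rows in a binary matrix."""
    # Cast up from int8 so the dot products cannot overflow.
    e = np.asarray(embeddings, dtype=np.int32)
    return e @ e.T

# Toy binary codes standing in for real FlyVec embeddings.
toy = np.array(
    [
        [1, 1, 0, 0, 1, 0],
        [1, 0, 0, 0, 1, 1],
        [0, 0, 1, 1, 0, 0],
    ],
    dtype=np.int8,
)

overlap = pairwise_overlap(toy)
print(overlap.tolist())  # prints [[3, 2, 0], [2, 3, 0], [0, 0, 2]]
```

The diagonal holds the number of active bits in each code, so dividing by it gives a normalized overlap if all codes use the same hash length.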

FlyVec vocabulary

The vocabulary under the hood uses the gensim Dictionary and can be accessed by either IDs (ints) or Tokens (strs).

# The tokens in the vocabulary
print(model.token_vocab[:5])

# The IDs that correspond to those tokens
print(model.vocab[:5])

# The dictionary object itself
model.dictionary;
['properties', 'a', 'among', 'and', 'any']
[2, 3, 4, 5, 6]

Simple word embeddings

Only care about the sparse, context-independent word embeddings for our small vocabulary? Get precomputed word vectors at hash_length=51 below:

wget https://raw.githubusercontent.com/bhoov/flyvec/master/simple-flyvec-embeddings.json

Training

Please note that the training code is included, though code for processing the inputs is not.

Prerequisites

You need a Python environment with numpy installed and a system that supports CUDA, with nvcc and g++ available.

Building the Source Files

flyvec_compile

(Or, if using from source, you can also run make training)

Note that you will see some warnings. This is expected.

Training

flyvec_train path/to/encodings.npy path/to/offsets.npy -o save/checkpoints/in/this/directory

Description of Inputs

  • encodings.npy -- An np.int32 array representing the tokenized vocabulary-IDs of the input corpus, of shape (N,) where N is the number of tokens in the corpus
  • offsets.npy -- An np.uint64 array of shape (C,) where C is the number of chunks in the corpus. Each value is the index within encodings.npy at which a new chunk starts. (Chunks can be thought of as sentences or paragraphs within the corpus: boundaries that the sliding window does not cross.)
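As a concrete illustration of the two arrays described above, the sketch below builds them from a toy corpus that is already tokenized into vocabulary IDs. The IDs and chunk boundaries are made up, and the reading that offsets holds the start index of every chunk (so the first entry is 0) is an assumption based on the description; check it against the training code before relying on it.

```python
import os
import tempfile

import numpy as np

# Hypothetical toy corpus: three chunks of made-up vocabulary IDs.
chunks = [[5, 12, 7], [3, 9], [8, 8, 2, 4]]

# Flat int32 array of all token IDs, shape (N,).
encodings = np.concatenate([np.asarray(c, dtype=np.int32) for c in chunks])

# Start index of each chunk within encodings, shape (C,).
lengths = np.array([len(c) for c in chunks], dtype=np.uint64)
offsets = np.zeros(len(chunks), dtype=np.uint64)
offsets[1:] = np.cumsum(lengths)[:-1]

# Written to a temporary directory here; use your own paths for training.
out_dir = tempfile.mkdtemp()
np.save(os.path.join(out_dir, "encodings.npy"), encodings)
np.save(os.path.join(out_dir, "offsets.npy"), offsets)

print(encodings.tolist())  # prints [5, 12, 7, 3, 9, 8, 8, 2, 4]
print(offsets.tolist())    # prints [0, 3, 5]
```

Each element of encodings is a single integer ID, not a one-hot vector.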

Description of Outputs

  • model_X.npy -- Stores checkpoints after every epoch within the specified output directory

See flyvec_train --help for more options.

Debugging tips

BadZipFile

You see:

>>> File "/usr/lib/python3.6/zipfile.py", line 1198, in _RealGetContents
>>>    raise BadZipFile("File is not a zip file")
>>> zipfile.BadZipFile: File is not a zip file

Run:

from flyvec import FlyVec
FlyVec.load(force_redownload=True)


Citation

If you use this in your work, please cite:

@article{liang2021flyvec,
  title={Can a Fruit Fly Learn Word Embeddings?},
  author={Liang, Yuchen and Ryali, Chaitanya K and Hoover, Benjamin and Grinberg, Leopold and Navlakha, Saket and Zaki, Mohammed J and Krotov, Dmitry},
  journal={arXiv preprint arXiv:2101.06887},
  year={2021},
  url={https://arxiv.org/abs/2101.06887}
}

flyvec's People

Contributors

bhoov, dimakrotov

flyvec's Issues

Some questions about the EMPIRICAL EVALUATION

When we use OWT_hid_400_W_11_LR_0.0002_14.npy to evaluate experiments, we have the following questions:

  1. STATIC WORD EMBEDDINGS EVALUATION. The experimental results on several datasets are much worse than those in the paper, especially on the RW dataset. For a dataset of the form (Word1: [w10, w11], Word2: [w20, w21], Score: [s1, s2]), we took the following approach:
hs_10 = model.get_sparse_embedding(w10)['embedding']
hs_20 = model.get_sparse_embedding(w20)['embedding']
p1 = sum(hs_10==hs_20) / hs_10.shape[0]

hs_11 = model.get_sparse_embedding(w11)['embedding']
hs_21 = model.get_sparse_embedding(w21)['embedding']
p2 = sum(hs_11==hs_21) / hs_11.shape[0]

p = pd.Series([p1, p2])
s = pd.Series(Score)
model_score = p.corr(s, method='spearman')
  2. CONTEXT-DEPENDENT WORD EMBEDDINGS. How do we get the 10 nearest neighbor words in the hash code space? Or, how do you convert the static word embeddings to the context-dependent word embeddings?

Glove embeddings for reproducibility purpose

Hi, since you retrained the GloVe embeddings on a corpus different from the one used in the original implementation, can you provide us with those new embeddings? It would be helpful in reproducing your results.

input data format

I am confused about the input data format, i.e., encodings.npy and offsets.npy.

Is each element in encodings.npy a one-hot vector?

Can you provide a detailed demo of them?

Integrated DVC remote

Hey @bhoov, I took a look at your project and it seems really cool. I remember reading about it when the paper came out. I'm working on a platform called DAGsHub that combines Git, DVC, and MLflow and since your project uses DVC I thought I'd try my hand at migrating this project over, and you can see the migration here:
https://dagshub.com/Dean/flyvec

I pushed the data to DAGsHub storage which is a free integrated DVC remote so that you can view the content of your data folder from the UI. It can be connected to this GitHub in such a way that you push code here and data to DAGsHub.

Would love to hear your thoughts, and if you want me to transfer ownership to you on DAGsHub, I'd be happy to do that.

Context-dependent word embeddings

The Flyvec paper mentions that context-dependent word embeddings can be created, but it seems like the API only provides methods for context-independent embeddings. How can context-dependent word embeddings be created?

In particular, how can Figure 4 from the FlyVec paper be created?

It seems like the first half of Flyvec.synapses is the lookup table for the dense "context" word embeddings. Are these embeddings combined with the "target" word embedding to create the context-dependent word embedding? (e.g. by mean pooling, max pooling?)

How do you deal with word probabilities?

I am also trying to implement the network from this paper, but I have encountered some problems. I set the vocabulary size to 20000. For a rare word Wi, vi/pi will be a huge number, which causes the program to output NaN.

Training scripts

Hi, do you plan to open source the scripts used to train flyvec?
