avidale / compress-fasttext

Tools for shrinking fastText models (in gensim format)

License: MIT License

Python 5.59% Jupyter Notebook 94.41%

fasttext fasttext-embeddings nlp python word-embeddings

compress-fasttext's Introduction

Hi! I am David Dale, research engineer in natural language processing. 👋

You can read more about me in Russian or English.

I am open to collaboration, especially on creating NLP tools (such as machine translation models) for lower-resourced languages.

Some of my best repos are:

dialogic - for developing multiplatform chatbots and voice skills in Python
compress-fasttext - for bringing lightweight, fast and accurate word embeddings to your project
python-ruwordnet - for those who want to understand language beyond embeddings and need a Russian thesaurus
dependency-paraphraser - a simple tool for paraphrasing that respects sentence structure
word-mover-grammar - a constituency grammar parser that supports word embeddings
weirdMath - a collection of small Python etudes, mostly about data science

You can also take a look at my Hugging Face contributions, including:

a tiny Russian BERT
the first public Russian NLI model
the only Russian multitask T5 model
one of the largest Russian paraphrase datasets

My Telegram channels:

https://t.me/izolenta_mebiusa - about programming and NLP
https://t.me/matchast - about applied math and data science

compress-fasttext's Issues

Problem loading back the saved FastTextKeyedVectors

Hello! I tried to compress a fastText model and then load the saved gensim model back. Loading raised this exception:

Python 3.9.7 (default, Sep 10 2021, 14:59:43) 
[GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import gensim
>>> sm = gensim.models.fasttext.FastTextKeyedVectors.load('/root/py/train/eng-small')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/.local/share/virtualenvs/py-np11W-p9/lib/python3.9/site-packages/gensim/models/fasttext.py", line 995, in load
    return super(FastTextKeyedVectors, cls).load(fname_or_handle, **kwargs)
  File "/root/.local/share/virtualenvs/py-np11W-p9/lib/python3.9/site-packages/gensim/utils.py", line 487, in load
    obj._load_specials(fname, mmap, compress, subname)
  File "/root/.local/share/virtualenvs/py-np11W-p9/lib/python3.9/site-packages/gensim/models/fasttext.py", line 1019, in _load_specials
    self.adjust_vectors()  # recompose full-word vectors
  File "/root/.local/share/virtualenvs/py-np11W-p9/lib/python3.9/site-packages/gensim/models/fasttext.py", line 1177, in adjust_vectors
    self.vectors = self.vectors_vocab[:].copy()
TypeError: 'NoneType' object is not subscriptable

Note: saw this warning while compressing:

/root/.local/share/virtualenvs/py-np11W-p9/lib/python3.9/site-packages/scipy/cluster/vq.py:607: UserWarning: One of the clusters is empty. Re-run kmeans with a different initialization.
  warnings.warn("One of the clusters is empty. "

However, the issue persists even on reruns where the warning is not printed.

Pipfile:

...
[packages]
gensim = "==4.1.2"
compress-fasttext = "==0.1.1"
pqkmeans = "*"
python-Levenshtein = "*"
...

The same happens with gensim==4.0.0.

Thank you!

Revert the compressed vectors to gensim format

I am using this pre-trained model: ft_cc.en.300_freqprune_50K_5K_pq_100.bin

That's my code:

ft_gensim = compress_fasttext.models.CompressedFastTextKeyedVectors.load(org_model_path)
new_vocab = ft_gensim.key_to_index
new_vectors = ft_gensim.vectors
new_ngrams = ft_gensim.vectors_ngrams

print(type(new_vectors)) # <class 'compress_fasttext.navec_like.PQ'>
print(type(new_ngrams)) # <class 'compress_fasttext.prune.RowSparseMatrix'>
new_vectors = DecomposedMatrix.compress(new_vectors, n_components=100, fp16=True)
new_ngrams = DecomposedMatrix.compress(new_ngrams, n_components=100, fp16=True)

I get this error:
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (6,) + inhomogeneous part.
Is there a way to convert the vectors and ngrams back to gensim format to do this compress operation?

Using CompressText Source to compress models

I want to use the compress-fasttext source code to compress models without installing it via pip.

Here's my code:

org_model_path = "cc.en.300.bin"
ft = load_facebook_model(org_model_path).wv
ft_fp16 = make_new_fasttext_model(ft, ft.vectors.astype(np.float16), ft.vectors_ngrams.astype(np.float16))

I see this error

TypeError: __init__() missing 1 required positional argument: 'compatible_hash'

It's due to this creation in compress.py

new_ft = cls(
       vector_size=ft.vector_size,
       min_n=ft.min_n,
       max_n=ft.max_n,
       bucket=new_vectors_ngrams.shape[0],
   )

Any clue?

I tried passing compatible_hash as True or False, but then ran into another error:

AttributeError: 'CompressedFastTextKeyedVectors' object has no attribute 'index_to_key'
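
Both errors would be consistent with running the current source against gensim 3.x: to my knowledge, the gensim 3.8 FastTextKeyedVectors constructor still took a compatible_hash argument, and index_to_key only appeared in gensim 4. A minimal sketch of a version check under that assumption; parse_major and gensim_major_version are hypothetical helper names, not part of gensim or compress-fasttext:

```python
from importlib import metadata


def parse_major(version_string):
    """Return the leading integer of a version string such as '4.1.2' or '4.0.0b0'."""
    return int(version_string.split(".")[0])


def gensim_major_version():
    """Return the installed gensim major version, or None if gensim is not installed."""
    try:
        return parse_major(metadata.version("gensim"))
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    major = gensim_major_version()
    if major is None:
        print("gensim is not installed")
    elif major < 4:
        # Assumption: attributes such as index_to_key require gensim 4 or newer.
        print("gensim", major, "detected; consider upgrading to gensim 4 or newer")
    else:
        print("gensim", major, "detected")
```

If the check reports a major version below 4, upgrading gensim is probably a simpler route than patching the constructor call.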

Compress FastText to Uploaded Models

model_path = "./models/en/ft_cc.en.300_freqprune_50K_5K_pq_100.bin"

big_model = gensim.models.fasttext.FastTextKeyedVectors.load(model_path)

small_model = compress_fasttext.prune_ft_freq(big_model, pq=True)

Why does running compression on one of the already uploaded (pre-compressed) models not work?

Gives this error:

TypeError: loop of ufunc does not support argument 0 of type RowSparseMatrix which has no callable conjugate method

Supervised fastText models are not supported

I want to load the model using gensim:

from gensim.models.fasttext import FastText
FastText.load_fasttext_format("cc.de.300.compressed.bin")

But I get the error:

File "C:\Users\dd\Projects\wordembeddingservice\venv\lib\site-packages\gensim\models\_fasttext_bin.py", line 194, in _load_vocab
    raise NotImplementedError("Supervised fastText models are not supported")
NotImplementedError: Supervised fastText models are not supported

Is there a way to get it working?

load model issues

I trained an unsupervised Korean model and tried to load it with compress_fasttext, but got an error: invalid load key, '\xba'. It seems the wrong encoding is used when loading the model.

error message:
1460 with open(fname, 'rb') as f:
-> 1461 return _pickle.load(f, encoding='latin1') # needed because loading from S3 doesn't support readline()
1462
1463

UnpicklingError: invalid load key, '\xba'.
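
The byte '\xba' in the message is a useful clue. Native fastText .bin files begin with the 32-bit magic number 793712314 (0x2F4F16BA) stored little-endian, so their first byte is 0xBA, whereas the load() call in the traceback expects a pickle. A minimal sketch for telling the two apart before loading; the function name and the returned labels are made up for illustration:

```python
import struct

FASTTEXT_MAGIC = 793712314  # 0x2F4F16BA, first int32 of a native fastText .bin file


def sniff_model_format(path):
    """Guess the on-disk format from the first bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(4)
    if len(head) == 4 and struct.unpack("<i", head)[0] == FASTTEXT_MAGIC:
        return "facebook-bin"  # native format: try gensim's load_facebook_model(path).wv
    if head[:1] == b"\x80":
        return "pickle"  # protocol 2+ pickle, e.g. a model written with .save()
    return "unknown"
```

A result of "facebook-bin" suggests the file should go through load_facebook_model rather than a pickle-based load().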

Incompatibility with Numpy >1.19.*

While compressing the models, I get the following error because of the usage of numpy.float in pq_encoder_light.py:

AttributeError: module 'numpy' has no attribute 'float'.
np.float was a deprecated alias for the builtin float. To avoid this error in existing code, use float by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.float64 here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
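
Until the package itself is patched, the replacement suggested by the message can be applied directly. A minimal sketch, assuming NumPy is installed; the shim at the bottom is a stopgap for third-party code that cannot be edited, not something compress-fasttext provides:

```python
import numpy as np

# Preferred fix in code you control: use the builtin float (or np.float64) as dtype.
codes = np.zeros((4, 3), dtype=float)
print(codes.dtype)

# Stopgap for third-party code that still refers to np.float:
# restore the alias before importing that code.
if "float" not in np.__dict__:
    np.float = float
```

The shim has to run before the module that uses np.float is imported, and it is best removed once the dependency is fixed.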

Add description how to load Facebook implementation fasttext model as FastTextKeyedVectors

I was trying to load the Facebook implementation of a fastText model from DeepPavlov (http://docs.deeppavlov.ai/en/master/features/pretrained_vectors.html#fasttext) as described in README.md, but the module raises an error:
_pickle.UnpicklingError: invalid load key, '\xba'.

I solved this by loading the model with gensim.models.fasttext.load_facebook_model and taking its .wv attribute, which is a FastTextKeyedVectors object:

from gensim.models.fasttext import load_facebook_model
import compress_fasttext
big_model = load_facebook_model('path-to-original-model').wv
small_model = compress_fasttext.prune_ft_freq(big_model, pq=True)
small_model.save('path-to-new-model')

Maybe add this information in README.md?

error install pqkmeans

When I try to install pqkmeans, I find this error:

  running install_egg_info
  Copying lshash3.egg-info to build/bdist.linux-x86_64/wheel/lshash3-0.0.8-py3.8.egg-info
  running install_scripts
  error: invalid command 'bdist_wininst'
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for lshash3

Any clue about lshash3?

broken features after compression?

Is it expected behaviour that some model features stop working after compression? For example:
model.most_similar raises "IndexError: list index out of range"
model.doesnt_match always returns the first element of the input list

Can't compress fasttext when loaded from facebook format directly

Hello again! I am not sure whether this is a compress-fasttext or a gensim problem, but here we go.

I am getting the following

Traceback (most recent call last):  
  File "/src/makesmall.py", line 11, in <module>
    small_model = compress_fasttext.prune_ft_freq(orig)
  File "/usr/local/lib/python3.9/site-packages/compress_fasttext/compress.py", line 206, in prune_ft_freq
    ngram_norms = np.linalg.norm(ft.vectors_ngrams, axis=-1)
AttributeError: 'FastText' object has no attribute 'vectors_ngrams'

for the code

import sys
from gensim.models import fasttext
from gensim.test.utils import datapath
import compress_fasttext

[inpath, outpath] = sys.argv[1:3]
print("Loading original from", inpath)
orig = fasttext.load_facebook_model(datapath(inpath))

print("Compressing")
small_model = compress_fasttext.prune_ft_freq(orig)

print("Saving compressed to", outpath)
small_model.save(outpath)

but when roundtripping the gensim model by save & load, it works:

import sys
from gensim.models import fasttext
from gensim.test.utils import datapath
import compress_fasttext

[inpath, outpath] = sys.argv[1:3]
print("Loading original from", inpath)
big = fasttext.load_facebook_model(datapath(inpath))

# Note: roundtripping gensim to disk, otherwise compress doesn't work.
print("Saving back in gensim format")
big.wv.save(outpath + ".gensim")

print("Loading gensim")
big = fasttext.FastTextKeyedVectors.load(outpath + ".gensim")

print("Compressing")
small_model = compress_fasttext.prune_ft_freq(big)

print("Saving compressed to", outpath + ".compressed")
small_model.save(outpath + ".compressed")

env:

gensim == 4.1.2
compress-fasttext == 0.1.2

Thank you!
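
The traceback says prune_ft_freq received a 'FastText' object, which suggests the full model was passed in rather than its keyed vectors; the round trip works because it saves and reloads big.wv. A minimal sketch of a guard that avoids the detour through disk, assuming the full model exposes its vectors as .wv (as the second script does); as_keyed_vectors is a hypothetical helper:

```python
def as_keyed_vectors(model):
    """Return model.wv for a full FastText model, or the object itself otherwise."""
    return getattr(model, "wv", model)


# Intended use (variable names follow the scripts in this issue):
#   orig = fasttext.load_facebook_model(inpath)
#   small_model = compress_fasttext.prune_ft_freq(as_keyed_vectors(orig))
```

Passing orig.wv directly is equivalent; the helper only makes the call tolerant of both kinds of object.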

After unpacking vectors get_vector returns zeroes for OOV words..

This refers to #8.

Even with compressed loading, OOV vectors are all zeroes, while the model that was saved returned actual vectors.

To reproduce, call

ft.get_vector('pythom')

on the model before saving, and again after compressing, saving, and loading.
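
One way to make the comparison concrete is to run the same check on the model at each stage. A minimal sketch that relies only on get_vector and a gensim 4 style key_to_index mapping (older gensim versions may not have it); the helper name is hypothetical:

```python
def oov_vector_is_zero(model, word):
    """Return True if `word` is out of vocabulary and its vector is all zeros."""
    if word in model.key_to_index:
        raise ValueError(f"{word!r} is in the vocabulary; pick a misspelled or rare word")
    # Compare element by element so that plain lists and array-like vectors both work.
    return not any(float(x) != 0.0 for x in model.get_vector(word))
```

Calling oov_vector_is_zero(ft, 'pythom') on the original model and on the reloaded one shows at which stage the subword information is lost.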

gensim 4.0.0b0

At the moment, this seems not to work with gensim 4.0.0. Any plans to fix this?

import compress_fasttext
  File "/opt/miniconda3/envs/vectorian2021/lib/python3.8/site-packages/compress_fasttext/__init__.py", line 1, in <module>
    from compress_fasttext import compress, decomposition, evaluation, navec_like, prune, quantization, utils
  File "/opt/miniconda3/envs/vectorian2021/lib/python3.8/site-packages/compress_fasttext/compress.py", line 7, in <module>
    from .prune import prune_ngrams, prune_vocab, count_buckets, RowSparseMatrix
  File "/opt/miniconda3/envs/vectorian2021/lib/python3.8/site-packages/compress_fasttext/prune.py", line 7, in <module>
    from gensim.models.utils_any2vec import ft_ngram_hashes
ModuleNotFoundError: No module named 'gensim.models.utils_any2vec'

Error while compressing

I am trying to compress the fastText wiki-news model:

https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip

With the first approach, load_facebook_model(), I got this error:
NotImplementedError: Supervised fastText models are not supported

With the second approach, gensim's own load(), I got:
return _pickle.load(f, encoding='latin1')
_pickle.UnpicklingError: invalid load key, '9'.
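
Both failures would be consistent with the download being a text-format .vec file rather than a binary model: such files start with a header line holding the word count and the dimension, which may be where the '9' comes from. A minimal sketch that checks for this header, assuming the archive has already been unzipped; read_vec_header is a hypothetical helper:

```python
def read_vec_header(path):
    """Return (n_words, dim) if the file starts with a word2vec-style text header, else None."""
    with open(path, "rb") as f:
        first_line = f.readline(64)  # the header is short, so 64 bytes is plenty
    parts = first_line.split()
    if len(parts) == 2 and all(p.isdigit() for p in parts):
        return int(parts[0]), int(parts[1])
    return None
```

If this returns a pair, the file holds whole-word vectors only. gensim's KeyedVectors.load_word2vec_format reads that format, but without n-gram vectors the fastText-specific pruning is unlikely to apply.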

AssertionError when trying to compress facebook model

fasttext 0.9.2
compress-fasttext 0.1.4
gensim-4.3.2
Ubuntu 22.04
numpy 1.23.5 (is this the issue?)

Code:
import compress_fasttext
from gensim.models.fasttext import load_facebook_model

gensim_model = load_facebook_model('fasttext-best-4-2-24.bin').wv
small_model = compress_fasttext.prune_ft_freq(gensim_model, pq=True)

Error:
AssertionError Traceback (most recent call last)
Cell In[30], line 2
1 gensim_model = load_facebook_model('fasttext-best-4-2-24.bin').wv
----> 2 small_model = compress_fasttext.prune_ft_freq(gensim_model, pq=True)

File ~/git/SIEGE/env/lib/python3.10/site-packages/compress_fasttext/compress.py:231, in prune_ft_freq(ft, new_vocab_size, new_ngrams_size, fp16, pq, qdim, centroids, prune_by_norm, norm_power)
229 top_voc, top_vec = prune_vocab(ft, new_vocab_size=new_vocab_size)
230 if pq and len(top_vec) > 0:
--> 231 top_vec = quantize(top_vec, qdim=qdim, centroids=centroids)
232 elif fp16:
233 top_vec = top_vec.astype(np.float16)

File ~/git/SIEGE/env/lib/python3.10/site-packages/compress_fasttext/quantization.py:23, in quantize(matrix, qdim, centroids, sample, iterations, verbose)
20 indexes = np.random.randint(vectors, size=sample)
21 selection = matrix[indexes]
---> 23 encoder.fit(selection)
24 indexes = encoder.transform(matrix).astype(PQ.index_type(centroids))
25 codes = encoder.codewords

File ~/git/SIEGE/env/lib/python3.10/site-packages/compress_fasttext/pq_encoder_light.py:68, in PQEncoder.fit(self, x_train)
66 assert x_train.ndim == 2
67 N, D = x_train.shape
---> 68 assert self.Ks < N, "the number of training vector should be more than Ks"
69 assert D % self.M == 0, "input dimension must be dividable by M"
70 self.Ds = int(D / self.M)

AssertionError: the number of training vector should be more than Ks
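
The two assertions visible in the traceback (Ks < N and D % M == 0) can be checked up front. A minimal sketch, assuming Ks corresponds to centroids, M to qdim, and N to the number of vectors handed to fit; check_pq_params is a hypothetical helper, not part of compress-fasttext:

```python
def check_pq_params(n_train, dim, qdim, centroids):
    """Return a list of reasons why product quantization would reject these settings."""
    problems = []
    if not centroids < n_train:
        problems.append(f"need more than {centroids} training vectors, got {n_train}")
    if dim % qdim != 0:
        problems.append(f"dimension {dim} is not divisible by qdim={qdim}")
    return problems


# A model with only 200 vectors and 255 centroids trips the first condition.
print(check_pq_params(n_train=200, dim=300, qdim=100, centroids=255))
```

If the first condition fails for a small model, lowering centroids or calling prune_ft_freq with pq=False are the options the signature in the traceback offers.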

Saving a compressed model in regular gensim format

Hi, is there a way to save a compressed model in regular gensim format? I can't install compress-fasttext where my application will run, so being able to run model.most_similar("word") only with gensim would be great.

Thanks in advance!

Cannot load the compressed model with the Facebook executable

./fastText/fasttext nn compressed.bin
terminate called after throwing an instance of 'std::invalid_argument'
what(): compressed.bin has wrong file format!
Aborted

similar_by_word/similar_by_key broken after compression

Hello, I am exploring the use of your library to compress some very big unsupervised models. However, while testing the full functionality, I encountered this error:

>>> small_model.similar_by_key("ciao")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/skrymr/anaconda3/envs/semantic39/lib/python3.9/site-packages/gensim/models/keyedvectors.py", line 889, in similar_by_key
    return self.most_similar(positive=[key], topn=topn, restrict_vocab=restrict_vocab)
  File "/home/skrymr/anaconda3/envs/semantic39/lib/python3.9/site-packages/gensim/models/keyedvectors.py", line 842, in most_similar
    mean = self.get_mean_vector(keys, weight, pre_normalize=True, post_normalize=True, ignore_missing=False)
  File "/home/skrymr/anaconda3/envs/semantic39/lib/python3.9/site-packages/gensim/models/keyedvectors.py", line 507, in get_mean_vector
    mean = np.zeros(self.vector_size, self.vectors.dtype)
AttributeError: 'PQ' object has no attribute 'dtype'

I compressed the original FastText model with prune_ft_freq() using default parameters, except for new_vocab_size and new_ngrams_size, both of which I kept at the model's original values to avoid losing information.
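
The traceback shows most_similar reading self.vectors.dtype, which the compressed PQ matrix does not provide. Until that is handled in the library, a brute-force fallback can be built on get_vector alone. A minimal, dependency-free sketch, assuming the model exposes index_to_key and get_vector; it scans the whole vocabulary per query, and the function names are hypothetical:

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two equal-length numeric sequences (0.0 if either is all zeros)."""
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm = math.sqrt(sum(float(a) ** 2 for a in u)) * math.sqrt(sum(float(b) ** 2 for b in v))
    return dot / norm if norm else 0.0


def brute_force_similar(model, key, topn=10):
    """Return the topn (word, similarity) pairs closest to `key`, excluding `key` itself."""
    query = model.get_vector(key)
    scored = (
        (cosine(query, model.get_vector(word)), word)
        for word in model.index_to_key
        if word != key
    )
    return [(word, score) for score, word in heapq.nlargest(topn, scored)]
```

For a large vocabulary, a vectorized version would be much faster; this form only shows that the lookup does not need the dtype attribute.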

Reproduce and Compress ft_cc.en.300_freqprune_50K_5K_pq_100

I am trying to reproduce the ft_cc.en.300_freqprune_50K_5K_pq_100.bin model from the original fastText model.

This is my code:


org_model_path = 'cc.en.300.bin'
print(fasttext.util.download_model('en', if_exists='ignore'))  # English
ft = load_facebook_model(org_model_path).wv

small_model = compress_fasttext.prune_ft_freq(
        ft,
        new_vocab_size=5_000,
        new_ngrams_size=2000_000,
        fp16=False,
        pq=True,
        qdim=100,
        centroids=255,
        prune_by_norm=True,
        norm_power=1,
)

The generated model (143MB) and the posted model (12MB) sizes are different. Can you please point out what's missing?

ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject.

I installed the package with !pip install compress-fasttext in Colab, and the installation works fine, but import compress_fasttext fails. I tried upgrading numpy and building pycocotools, as suggested in various Stack Overflow answers, but nothing works.

ValueError Traceback (most recent call last)
in
1 get_ipython().system('pip install compress-fasttext')
----> 2 import compress_fasttext

7 frames
/usr/local/lib/python3.8/dist-packages/compress_fasttext/__init__.py in
----> 1 from compress_fasttext import compress, decomposition, evaluation, navec_like, prune, quantization, utils
2 from compress_fasttext import models
3 from compress_fasttext.compress import make_new_fasttext_model, quantize_ft, svd_ft, prune_ft, prune_ft_freq
4 from compress_fasttext.models import CompressedFastTextKeyedVectors

/usr/local/lib/python3.8/dist-packages/compress_fasttext/compress.py in
3
4 import numpy as np
----> 5 from gensim.models.fasttext import FastTextKeyedVectors
6
7 from .decomposition import DecomposedMatrix

/usr/local/lib/python3.8/dist-packages/gensim/__init__.py in
9 import logging
10
---> 11 from gensim import parsing, corpora, matutils, interfaces, models, similarities, utils # noqa:F401
12
13

/usr/local/lib/python3.8/dist-packages/gensim/corpora/__init__.py in
4
5 # bring corpus classes directly into package namespace, to save some typing
----> 6 from .indexedcorpus import IndexedCorpus # noqa:F401 must appear before the other classes
7
8 from .mmcorpus import MmCorpus # noqa:F401

/usr/local/lib/python3.8/dist-packages/gensim/corpora/indexedcorpus.py in
12 import numpy
13
---> 14 from gensim import interfaces, utils
15
16 logger = logging.getLogger(__name__)

/usr/local/lib/python3.8/dist-packages/gensim/interfaces.py in
17 import logging
18
---> 19 from gensim import utils, matutils
20
21

/usr/local/lib/python3.8/dist-packages/gensim/matutils.py in
1028 try:
1029 # try to load fast, cythonized code if possible
-> 1030 from gensim._matutils import logsumexp, mean_absolute_difference, dirichlet_expectation
1031
1032 except ImportError:

/usr/local/lib/python3.8/dist-packages/gensim/_matutils.pyx in init gensim._matutils()

ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

Can you compress the model 'fasttext-zh-vectors'?

I cannot find a zh model at https://zenodo.org/records/4905385

compress_fasttext 0.0.7 doesn't work with gensim 3.7.2

I tried to compress a gensim 3.7.2 fastText model with compress_fasttext 0.0.7:

import gensim
import compress_fasttext

big_model = gensim.models.fasttext.FastTextKeyedVectors.load('path-to-original-model')
small_model = compress_fasttext.prune_ft_freq(big_model, pq=True) #ERROR
small_model.save('path-to-new-model')

I got this error when calling prune_ft_freq:

AttributeError: 'FastText' object has no attribute 'vectors_ngrams'

Alternatively, with the prune_ft function:

AttributeError: 'FastText' object has no attribute 'vocab'

Is gensim 3.7.2 too old, or am I missing something? Was there a version of compress_fasttext that supported it?