
Vecalign

Vecalign is an accurate sentence alignment algorithm that remains fast even for very long documents. In conjunction with LASER, Vecalign works in about 100 languages (i.e., roughly 100 × 100 language pairs), without the need for a machine translation system or lexicon.

Vecalign uses similarity of multilingual sentence embeddings to judge the similarity of sentences.
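
Concretely, sentence similarity reduces to the cosine between embedding vectors. A minimal sketch of that comparison, using made-up toy vectors rather than actual LASER output:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" of a sentence and its translation
src = [0.1, 0.7, 0.2, 0.0]
tgt = [0.1, 0.6, 0.3, 0.1]
```

A real LASER embedding has 1024 dimensions, but the comparison is the same.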

[Figure: multilingual sentence embedding space, based on a Facebook AI post]

Vecalign uses an approximation to Dynamic Programming based on Fast Dynamic Time Warping that is linear in time and space with respect to the number of sentences being aligned.

[Figure: visualization of the dynamic programming approximation]
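
Vecalign's actual approximation recursively downsamples the embeddings and then refines the alignment within a narrow band around the coarse path. As a toy illustration of why restricting the DP to a diagonal band yields linear time, here is a banded DTW over 1-D sequences (an illustrative sketch, not the repository's algorithm):

```python
import numpy as np

def banded_dtw(x, y, band=3):
    """DTW alignment cost between 1-D sequences x and y, restricted to a
    diagonal band of half-width `band`. Full DP is O(len(x)*len(y));
    the band makes it O(len(x)*band)."""
    n, m = len(x), len(y)
    INF = float("inf")
    D = np.full((n + 1, m + 1), INF)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        # only consider j near the diagonal projection of i
        j_center = int(round(i * m / n))
        for j in range(max(1, j_center - band), min(m, j_center + band) + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Fast DTW recovers (most of) the accuracy of the full DP by choosing the band from a coarser alignment instead of a fixed diagonal.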

License

Copyright 2019 Brian Thompson

Vecalign is released under the Apache License, Version 2.0. For convenience, the dev and test datasets from Bleualign are provided. Bleualign is Copyright 2010 Rico Sennrich and is released under the GNU General Public License, Version 2.

Build Vecalign

You will need Python 3.6+ with numpy and cython. You can build an environment using conda as follows:

# Use latest conda
conda update conda -y
# Create conda environment
conda create --force -y --name vecalign python=3.7
# Activate new environment
source `conda info --base`/etc/profile.d/conda.sh # See: https://github.com/conda/conda/issues/7980
conda activate vecalign
# Install required packages
conda install -y -c anaconda cython
conda install -y -c anaconda numpy
pip install mcerp 

Note that Vecalign contains Cython code, but there is no need to build it manually, as it is compiled automatically by pyximport.

Run Vecalign (using provided embeddings)

./vecalign.py --alignment_max_size 8 --src bleualign_data/dev.de --tgt bleualign_data/dev.fr \
   --src_embed bleualign_data/overlaps.de bleualign_data/overlaps.de.emb  \
   --tgt_embed bleualign_data/overlaps.fr bleualign_data/overlaps.fr.emb

Alignments are written to stdout:

[0]:[0]:0.156006
[1]:[1]:0.160997
[2]:[2]:0.217155
[3]:[3]:0.361439
[4]:[4]:0.346332
[5]:[5]:0.211873
[6]:[6, 7, 8]:0.507506
[7]:[9]:0.252747
[8, 9]:[10, 11, 12]:0.139594
[10, 11]:[13]:0.273751
[12]:[14]:0.165397
[13]:[15, 16, 17]:0.436312
[14]:[18, 19, 20, 21]:0.734142
[]:[22]:0.000000
[]:[23]:0.000000
[]:[24]:0.000000
[]:[25]:0.000000
[15]:[26, 27, 28]:0.840094
...

The first two entries on each line are the source and target sentence indexes for the alignment, respectively. The third entry is the sentence alignment cost computed by Vecalign. Note that this cost includes normalization but does not include the penalty terms for alignments containing more than one sentence. The cost is set to zero for insertions/deletions. Also note that results may vary slightly due to randomness in the normalization.
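
Each output line is easy to consume programmatically. A small helper (hypothetical, not part of the repository) that splits one line into source indexes, target indexes, and cost:

```python
def parse_alignment_line(line):
    """Parse one Vecalign output line, e.g. '[8, 9]:[10, 11, 12]:0.139594',
    into (src_indexes, tgt_indexes, cost)."""
    src, tgt, cost = line.strip().split(":")
    to_ints = lambda s: [int(x) for x in s.strip("[]").split(",") if x.strip()]
    return to_ints(src), to_ints(tgt), float(cost)
```

Empty brackets (insertions/deletions) parse to an empty index list.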

To score against a gold alignment, use the "-g" flag. The flags "-s", "-t", and "-g" can accept multiple arguments. This is primarily useful for scoring, as the output alignments will all be concatenated together on stdout. For example, to align and score the Bleualign test set:

./vecalign.py --alignment_max_size 8 --src bleualign_data/test*.de --tgt bleualign_data/test*.fr \
   --gold bleualign_data/test*.defr  \
   --src_embed bleualign_data/overlaps.de bleualign_data/overlaps.de.emb  \
   --tgt_embed bleualign_data/overlaps.fr bleualign_data/overlaps.fr.emb > /dev/null

This should give results that approximately match the Vecalign paper:


 ---------------------------------
|             |  Strict |    Lax  |
|-------------|---------|---------|
| Precision   |   0.899 |   0.985 |
| Recall      |   0.904 |   0.987 |
| F1          |   0.902 |   0.986 |
 ---------------------------------

Note: Run ./vecalign.py -h for the full set of sentence alignment options. For stand-alone scoring against a gold reference, see score.py.

Embed your own documents

The Vecalign repository contains overlap and embedding files for the Bleualign dev/test files. This section shows how those files were made, as an example of running on new data.

Vecalign requires not only embeddings of the sentences in each document, but also embeddings of concatenations of consecutive sentences. These multi-sentence embeddings are needed to consider 1-many, many-1, and many-many alignments.

To create a file containing all the sentence combinations in the dev and test files from Bleualign:

./overlap.py -i bleualign_data/dev.fr bleualign_data/test*.fr -o bleualign_data/overlaps.fr -n 10
./overlap.py -i bleualign_data/dev.de bleualign_data/test*.de -o bleualign_data/overlaps.de -n 10

Note: Run ./overlap.py -h to see the full set of overlap options.

bleualign_data/overlaps.fr and bleualign_data/overlaps.de are text files containing one or more sentences per line.
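
Conceptually, the overlap files enumerate every concatenation of up to n consecutive sentences. A simplified sketch of that enumeration (not the actual overlap.py implementation):

```python
def make_overlaps(sentences, n):
    """Return all concatenations of up to n consecutive sentences,
    deduplicated and sorted -- a simplified sketch of what overlap.py
    writes, one concatenation per output line."""
    overlaps = set()
    for start in range(len(sentences)):
        for size in range(1, n + 1):
            if start + size <= len(sentences):
                overlaps.add(" ".join(sentences[start:start + size]))
    return sorted(overlaps)
```

With n=10 (as in the commands above), each document contributes up to 10 overlap lines per sentence position.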

These files must then be embedded using a multilingual sentence embedder.

We recommend the Language-Agnostic SEntence Representations (LASER) toolkit from Facebook, as it has strong performance and comes with a pretrained model that works well in about 100 languages. However, Vecalign should work with other embedding methods as well. Embeddings must be provided as a binary file containing float32 values.

The following assumes LASER is installed and the LASER environment variable has been set.

To embed the Bleualign files using LASER:

$LASER/tasks/embed/embed.sh bleualign_data/overlaps.fr bleualign_data/overlaps.fr.emb [fra]
$LASER/tasks/embed/embed.sh bleualign_data/overlaps.de bleualign_data/overlaps.de.emb [deu]

Always refer to the LASER documentation for the latest usage of this script, as usage may vary across LASER versions.

Note that LASER will not overwrite an embedding file if it exists, so you may first need to run rm bleualign_data/overlaps.fr.emb bleualign_data/overlaps.de.emb.

Document Alignment

We propose using Vecalign to rescore document alignment candidates, in conjunction with candidate generation using a document embedding method that retains sentence order information. Example code for our document embedding method is provided here.

Publications

If you use Vecalign, please cite our Vecalign paper:

@inproceedings{thompson-koehn-2019-vecalign,
    title = "{V}ecalign: Improved Sentence Alignment in Linear Time and Space",
    author = "Thompson, Brian and Koehn, Philipp",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1136",
    doi = "10.18653/v1/D19-1136",
    pages = "1342--1348",
}

If you use the provided document embedding code or use Vecalign for document alignment, please cite our document alignment paper:

@inproceedings{thompson-koehn-2020-exploiting,
    title = "Exploiting Sentence Order in Document Alignment",
    author = "Thompson, Brian and Koehn, Philipp",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.emnlp-main.483",
    doi = "10.18653/v1/2020.emnlp-main.483",
    pages = "5997--6007",
}


vecalign's Issues

how to enable local sentence reordering?

Greetings! I read through the paper and came across this line:

Following prior work, we assume non-crossing alignments but allow local sentence reordering within an alignment.

But I couldn't find anywhere that local sentence reordering is explained. Can you give me some insight into it?

Possible Error in the score.py file in the _precision function

Hi,
In the evaluation script (score.py), precisely here:
https://github.com/thompsonb/vecalign/blob/ca96a30716f12241e14f836b06705107c771987c/score.py#L57C5-L57C5

I've noticed that the for loop iterates over the variable "testalign", which contains the alignments generated by the algorithm. The problem is that if the algorithm fails to align a source sentence, this is not counted as an error.
For example, if you call _precision(testalign=goldalign[:2], goldalign=goldalign), the resulting F1 is 1 even though you predicted only two alignments out of all the possible ones.
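
The failure mode can be reproduced with a simplified precision function that mirrors the described logic (a sketch, not the exact score.py code):

```python
def precision_over_test(testalign, goldalign):
    """Count a predicted alignment as correct if it appears in gold.
    Because we iterate only over testalign, gold alignments the
    algorithm never predicted are not counted as errors."""
    gold = set(goldalign)
    if not testalign:
        return 1.0
    correct = sum(1 for a in testalign if a in gold)
    return correct / len(testalign)

gold = [((0,), (0,)), ((1,), (1,)), ((2,), (2,))]
# Predicting only 2 of the 3 gold alignments still gives precision 1.0
```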

overlaps without test files?

Perhaps a silly question, but in the demo it seems like you create the files with the overlapping sentences from both the dev and the test files. In my case, I just have a parallel corpus of a few Arabic and English texts that I want to align, and I don't have any dev or test files due to the small size of the corpus. Do I need these to align the files, or is there some way to get around it?

Why remove global mean when halving vectors?

Thanks for sharing the excellent source code. I am confused about the vector halving function:

def downsample_vectors(vecs1):
    a, b, c = vecs1.shape
    half = np.empty((a, b // 2, c), dtype=np.float32)
    for ii in range(a):
        # average consecutive vectors
        for jj in range(0, b - b % 2, 2):
            v1 = vecs1[ii, jj, :]
            v2 = vecs1[ii, jj + 1, :]
            half[ii, jj // 2, :] = v1 + v2
        # compute mean for all vectors
        mean = np.mean(half[ii, :, :], axis=0)
        for jj in range(0, b - b % 2, 2):
            # remove mean
            half[ii, jj // 2, :] = half[ii, jj // 2, :] - mean
    # make vectors norm==1 so dot product is cosine distance
    make_norm1(half)
    return half

Why do you remove the global mean along the first axis instead of simply dividing the vectors by 2? Is there a reason for this? It would be very helpful if you could share your motivation and insights.
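
For what it's worth, after the final norm-1 normalization, summing vs. averaging (dividing by 2) is equivalent, since uniform scaling cancels; the mean removal is the only step that changes the resulting cosine similarities. A quick numpy check (illustrative, not repo code):

```python
import numpy as np

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([4.0, 5.0, 6.0])

norm = lambda v: v / np.linalg.norm(v)  # scale a vector to unit length

# Summing vs. averaging consecutive vectors: identical after normalization
summed = v1 + v2
averaged = (v1 + v2) / 2
assert np.allclose(norm(summed), norm(averaged))

# Removing the mean, by contrast, does change the normalized vector
centered = summed - np.mean(summed)
```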

some confusions

Hello, I'm wondering: besides using the command line, is there another way to call your toolkit? I want to wrap it into a function and call it conveniently. Do you have any suggestions?

The dataset conversion to ladder format

@thompsonb, I'm trying to replicate the work done in your paper, in particular the results in Table 1.
How did you convert the dataset in the "bleualign_data" directory to hunalign's ladder-style format?
Is there a script to do that, or did you do it manually?

Question: about "failed to find overlap=3, will use random vector"

Thank you for the awesome tool!

I ran overlap.py with -n 2, then ran vecalign.py with --alignment_max_size 3, out of fear that LASER might take too much time. This gave me failed to find overlap=3, will use random vector for every single line in the original text, which makes sense, but....

First question:
I am curious about the rationale behind using random vectors when an overlap=3 embedding is not present. Or is this closer to an arbitrary choice, because a missing overlap embedding is something that should not happen in the first place?

Second question:
Once again, I am running with --alignment_max_size 3. Presumably the code considers (0,3) and (3,0) pairs as possibilities, along with (1,2), (2,1), and (1,1); otherwise it would not require overlap=3 embeddings. But where and why are (0,3) and (3,0) being considered? Surely (0,3) and (3,0) are not in the final_alignment_types produced by the function make_alignment_types.

Actually, why are (0,3) and (3,0) being considered at all? Is considering (0,2), (0,1), (1,0), and (2,0) not enough?

show alignments

I can see just lines of numbers on stdout; how can I see the produced alignments and export them to a file?
Thanks for your efforts.

some problems

Hey,
thank you for providing a very easy-to-use tool. But why do multiple source sentences correspond to a single translation in my sentence alignment results?
Can anyone explain why this happens?
Thank you.

[Embedding phase error]

I've tried to re-run the project. However, when I tried to use LASER to embed my data with the command line:

./embed.sh ~/source/samsung/vecalign/bleualign_data/overlaps.vi ~/source/samsung/vecalign/bleualign_data/overlaps.vi.emb [vi] 

or

python embed.py --input ~/source/samsung/vecalign/bleualign_data/overlaps.vi --output ~/source/samsung/vecalign/bleualign_data/overlaps.vi.emb --encoder ../models/laser2.pt --spm-model ../models/laser2.spm --verbose

I got this error:
[error screenshot]

The format of my input file is:
[input file screenshot]
Does anyone have this problem? And how can I fix it?

error in the make_del_knob function?

In the make_del_knob function, when the size product (e_size * f_size) is smaller than sample_size (20000 by default), the script ends up calculating the similarity score for all combinations of the src and tgt sentences, plus the remainder (20000 - e_size * f_size). Is this behavior a mistake or an intended feature? It creates a biased histogram of the "real" distribution by calculating multiple pairs on the 0:0 indexed sentences.

if e_size * f_size < sample_size:
    # dont sample, just compute full matrix
    sample_size = e_size * f_size
    x_idxs = np.zeros(sample_size, dtype=np.int32)
    y_idxs = np.zeros(sample_size, dtype=np.int32)
    c = 0
    for ii in range(e_size):
        for jj in range(f_size):
            x_idxs[c] = ii
            y_idxs[c] = jj
            c += 1
else:
    # get random samples
    x_idxs = np.random.choice(range(e_size), size=sample_size, replace=True).astype(np.int32)
    y_idxs = np.random.choice(range(f_size), size=sample_size, replace=True).astype(np.int32)

# output
random_scores = np.empty(sample_size, dtype=np.float32)

score_path(x_idxs, y_idxs,
           e_laser_norms, f_laser_norms,
           e_laser, f_laser,
           random_scores, )

Support for Bitext mining?

Thanks for your great work!

I am wondering whether Vecalign supports bitext mining, i.e. finding all parallel texts while ignoring sentence order.

Our research focuses on translations of Wikipedia. However, it's quite difficult to obtain a correct alignment, since many of the sentences to be aligned are shuffled within the article.

I will be very grateful if you could help me out.

RuntimeWarning: overflow encountered in square

Hi, I often see this warning:

dp_utils.py:112: RuntimeWarning: overflow encountered in square
norm = np.sqrt(np.square(vecs0[ii, jj, :]).sum())

I can provide sample data if you need. Thx
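
This warning arises when the squares of large float32 values exceed the float32 range (about 3.4e38) and overflow to inf. Casting to float64 before squaring avoids it (a general numpy workaround, not necessarily the repository's fix):

```python
import numpy as np

v = np.array([3e19, 4e19], dtype=np.float32)

# float32 squares overflow: (3e19)**2 = 9e38 > float32 max (~3.4e38)
unsafe = np.sqrt(np.square(v).sum())  # inf, with the RuntimeWarning above

# cast to float64 before squaring to stay in range
safe = np.sqrt(np.square(v.astype(np.float64)).sum())  # ~5e19
```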
