
Phonetic Word Embeddings Suite (PWESuite)

Evaluation suite for phonetic (phonological) word embeddings and an additional model based on Panphon distance learning. This repository accompanies the paper PWESuite: Phonetic Word Embeddings and Tasks They Facilitate at LREC-COLING 2024. Watch the 12-minute introduction to PWESuite.

Abstract: Mapping words into a fixed-dimensional vector space is the backbone of modern NLP. While most word embedding methods successfully encode semantic information, they overlook phonetic information that is crucial for many tasks. We develop three methods that use articulatory features to build phonetically informed word embeddings. To address the inconsistent evaluation of existing phonetic word embedding methods, we also contribute a task suite to fairly evaluate past, current, and future methods. We evaluate both (1) intrinsic aspects of phonetic word embeddings, such as word retrieval and correlation with sound similarity, and (2) extrinsic performance on tasks such as rhyme and cognate detection and sound analogies. We hope our task suite will promote reproducibility and inspire future phonetic embedding research.

The suite contains the following tasks:

  • Correlation with human sound similarity judgement
  • Correlation with articulatory distance
  • Nearest neighbour retrieval
  • Rhyme detection
  • Cognate detection
  • Sound analogies

Run pip3 install -e . to install this repository and its dependencies.

Embedding evaluation

In order to run all the evaluations, you first need to compute embeddings for the provided words. These can be downloaded from our Huggingface dataset:

>>> from datasets import load_dataset
>>> dataset = load_dataset("zouharvi/pwesuite-eval", split="train")
>>> dataset[10]
{'token_ort': 'aachener', 'token_ipa': 'ɑːkən', 'lang': 'en', 'purpose': 'main', 'token_arp': 'AA1 K AH0 N ER0'}

Note that each entry contains token_ort, token_ipa, token_arp, lang, and purpose. For training, only the words marked with purpose == "main" should be used. Note that unknown/low-frequency phonemes or letters are replaced with 😕.
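For example, a minimal sketch of selecting only the training words by filtering on the purpose field (variable names are illustrative):

    from datasets import load_dataset

    dataset = load_dataset("zouharvi/pwesuite-eval", split="train")
    # Keep only the words intended for training the embedding model.
    train_words = [row for row in dataset if row["purpose"] == "main"]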

After running the embedding for each line/word, save the result as either a Pickle or an NPZ file. The data structure can be either (1) a list of lists or numpy arrays, or (2) a single numpy array. The loader will automatically parse the file and check that the dimensions are consistent.
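A hedged sketch of one way to save the embeddings (the array below is a random placeholder and the file names are illustrative):

    import pickle
    import numpy as np

    # Placeholder: one 300-dimensional vector per word, in the same order as the dataset rows.
    embds = np.random.rand(1000, 300)

    # Option 1: Pickle (a list of numpy arrays also works).
    with open("your_embd.pkl", "wb") as f:
        pickle.dump(embds, f)

    # Option 2: NPZ.
    np.savez("your_embd.npz", embds)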

After this, you are all set to run all the evaluations using ./suite_evaluation/eval_all.py --embd your_embd.pkl. Alternatively, you can invoke individual tasks: ./suite_evaluation/eval_{correlations,human_similarity,retrieval,analogy,rhyme,cognate}.py.

For a demo, see this Jupyter notebook.

Misc

Contact the authors if you encounter any issues using this evaluation suite. Read the associated paper and, for now, cite it as:

@article{zouhar2023pwesuite,
  title={{PWESuite}: {P}honetic Word Embeddings and Tasks They Facilitate},
  author={Zouhar, Vil{\'e}m and Chang, Kalvin and Cui, Chenxuan and Carlson, Nathaniel and Robinson, Nathaniel and Sachan, Mrinmaya and Mortensen, David},
  journal={arXiv preprint arXiv:2304.02541},
  year={2023},
  url={https://arxiv.org/abs/2304.02541}
}

Compute details

The most compute-intensive tasks were training the Metric Learner and the Triplet Margin models, which took a quarter of an hour and 2 hours on a GTX 1080 Ti, respectively. For the research presented in this paper, we estimate 100 GPU hours overall.

The BERT embeddings were extracted as an average across the last layer. The INSTRUCTOR embeddings were used with the prompt "Represent the word for sound similarity retrieval:". For BPEmb and fastText, we used the best available models (trained on the most data) with a dimensionality of 300.
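As a hedged illustration, mean-pooling the last layer with Hugging Face transformers could look like the following sketch (the checkpoint name and feeding the plain word form are assumptions, not the paper's exact setup):

    import torch
    from transformers import AutoTokenizer, AutoModel

    # Assumed checkpoint; the text above does not pin a specific BERT model.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModel.from_pretrained("bert-base-multilingual-cased")

    def embed(word: str) -> torch.Tensor:
        inputs = tokenizer(word, return_tensors="pt")
        with torch.no_grad():
            last_hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_dim)
        # Average across the last layer's token representations.
        return last_hidden.mean(dim=1).squeeze(0)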

The metric learner uses a bidirectional LSTM with 2 layers, a hidden state size of 150, and a dropout of 30%. The batch size is 128 and the learning rate is 0.01. The autoencoder follows the same hyperparameters for both the encoder and the decoder; the difference is its learning rate, 0.005, which was chosen empirically.
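A minimal sketch of an encoder with these hyperparameters (the class and variable names, the input featurization, the pooling, and the training loss are not specified above and are assumptions):

    import torch
    import torch.nn as nn

    class MetricLearnerEncoder(nn.Module):
        # BiLSTM with 2 layers, hidden size 150, and 30% dropout, as stated above.
        def __init__(self, input_dim, hidden_dim=150):
            super().__init__()
            self.lstm = nn.LSTM(
                input_dim, hidden_dim,
                num_layers=2, dropout=0.3,
                bidirectional=True, batch_first=True,
            )

        def forward(self, x):  # x: (batch, seq_len, input_dim)
            output, _ = self.lstm(x)
            # Mean-pool over time to get a fixed-width word embedding (pooling choice is an assumption).
            return output.mean(dim=1)

    model = MetricLearnerEncoder(input_dim=24)  # input_dim depends on the articulatory featurization
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # batch size 128, learning rate 0.01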

poster

pwesuite's People

Contributors

kalvinchang, natbcar, zouharvi


pwesuite's Issues

Plot original + new space

t-SNE can take an N×N matrix of original distances and compute a new 2D vector space that we can examine in a plot. To verify that the correlations are meaningful, we could compute t-SNE both of the original space given by the metric and of the new one given by the fixed-width vectors.

Note that t-SNE does not solve our original task because it does not fit a reusable function and needs to be re-run for any new data.
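A hedged sketch of both projections with scikit-learn (the distance matrix and vectors below are random placeholders):

    import numpy as np
    from sklearn.manifold import TSNE

    # Placeholder data: dist is an N x N symmetric matrix of original distances,
    # embds are the fixed-width vectors produced by the embedding model.
    dist = np.random.rand(200, 200)
    dist = (dist + dist.T) / 2
    np.fill_diagonal(dist, 0)
    embds = np.random.rand(200, 64)

    # t-SNE of the original space, from the precomputed distance matrix.
    xy_orig = TSNE(metric="precomputed", init="random").fit_transform(dist)
    # t-SNE of the new space, from the fixed-width vectors.
    xy_new = TSNE().fit_transform(embds)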

Compute neighbourhood accuracy

The reason this metric learning failed on IR may be that it increases the correlation globally but does not improve the local neighbourhood structure as much, which is what we mostly care about.

If the neighbourhood accuracy is low, we can include it explicitly in the model loss.
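A hedged sketch of one way to compute such a neighbourhood accuracy, as the average overlap between the k nearest neighbours under the original metric and in the embedding space (the function name and k are illustrative):

    import numpy as np
    from scipy.spatial.distance import cdist

    def neighbourhood_accuracy(dist_orig, embds, k=10):
        # Average overlap of the k nearest neighbours under the original distances
        # versus the distances in the learned embedding space.
        dist_new = cdist(embds, embds)
        overlaps = []
        for i in range(len(embds)):
            nn_orig = set(np.argsort(dist_orig[i])[1:k + 1])  # skip the word itself
            nn_new = set(np.argsort(dist_new[i])[1:k + 1])
            overlaps.append(len(nn_orig & nn_new) / k)
        return float(np.mean(overlaps))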

File location and inconsistency in data attribute order

Hello,

I trust this message finds you well. I've encountered a couple of challenges while working with the provided code, and I wanted to bring them to your attention. Here are the main issues:

  1. File Location: I'm having difficulty locating the file necessary to generate the dataset (main/prepare_data.sh) within the repository. Could you provide guidance on where to find this file?

  2. Data Loading and Task Specification: In an effort to contribute, I took the initiative to implement data loading from Hugging Face (for the evaluation phase). However, I observed a discrepancy in the task specification within eval_all.py:

    [
        (*x, y) for x, y in zip(data_multi_all, data_embd)
        if x[3] == "human_similarity"
    ]

    Here, the purpose information is specified in the 4th position. Upon further analysis of the evaluate_human_similarity function:

    def evaluate_human_similarity(data_multi_hs):
        tok_to_embd = {}
        for (token_ort, token_ipa, lang, pronunciation, purpose, embd) in data_multi_hs:
            tok_to_embd[token_ort] = embd

    I noticed that the purpose information is expected in the 5th position. This mismatch caused a bug for me, and I had to change x[3] to x[4] to resolve it. I'd appreciate your insights on this matter, as I want to ensure I'm not overlooking any crucial details.

    Another problem I noticed is that the order expected by these functions differs from the order given by the Hugging Face dataset:
    Hugging Face order:
    ['token_ort', 'token_ipa', 'token_arp', 'lang', 'purpose']
    evaluation order:
    (token_ort, token_ipa, lang, pronunciation, purpose, embd)
    To address this, I've adjusted my preprocessing steps to align with the expected order. However, this solution is inconvenient and introduces inconsistency in the codebase.
    I wanted to bring this to your attention and ask for your opinion on a more sustainable resolution.
    Thank you for your time and assistance.

Best regards

Migrate experiments to new loader [high priority]

Recently we transitioned to using the Hugging Face loader, with a dictionary for each token instead of a positional array in multi.tsv.

I manually fixed the evaluation code, which is the most important part, but the experiments are still using the positional format.
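A hedged sketch of an adapter that turns the dictionary rows plus embeddings into the positional tuples the evaluation code expects (the mapping of the pronunciation slot to token_arp is an assumption based on the issue above, and the function name is illustrative):

    def to_positional(rows, embds):
        # (token_ort, token_ipa, lang, pronunciation, purpose, embd), per the evaluation order above.
        return [
            (r["token_ort"], r["token_ipa"], r["lang"], r["token_arp"], r["purpose"], e)
            for r, e in zip(rows, embds)
        ]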

Upload trained models to Huggingface

Hi, first of all, thanks for this interesting research! I would like to ask about the models mentioned in the paper: would it be possible to upload the fully trained models to Huggingface? I noticed you have already uploaded the data there, but I couldn't find any models on your profile.

Please let me know if this would be possible.
