Evaluation suite for phonetic (phonological) word embeddings and an additional model based on Panphon distance learning. This repository accompanies the paper PWESuite: Phonetic Word Embeddings and Tasks They Facilitate at LREC-COLING 2024. Watch the 12-minute introduction to PWESuite.
Abstract: Mapping words into a fixed-dimensional vector space is the backbone of modern NLP. While most word embedding methods successfully encode semantic information, they overlook phonetic information that is crucial for many tasks. We develop three methods that use articulatory features to build phonetically informed word embeddings. To address the inconsistent evaluation of existing phonetic word embedding methods, we also contribute a task suite to fairly evaluate past, current, and future methods. We evaluate both (1) intrinsic aspects of phonetic word embeddings, such as word retrieval and correlation with sound similarity, and (2) extrinsic performance on tasks such as rhyme and cognate detection and sound analogies. We hope our task suite will promote reproducibility and inspire future phonetic embedding research.
The suite contains the following tasks:
- Correlation with human sound similarity judgement
- Correlation with articulatory distance
- Nearest neighbour retrieval
- Rhyme detection
- Cognate detection
- Sound analogies
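For instance, sound analogies are typically solved by vector arithmetic over the embeddings. A minimal sketch with toy 2-D vectors (the words and values below are illustrative, not from the dataset):

```python
import numpy as np

# toy embeddings; in the suite these come from your phonetic embedding model
embd = {
    "pat": np.array([1.0, 0.0]),
    "bat": np.array([1.0, 1.0]),
    "pit": np.array([0.0, 0.0]),
    "bit": np.array([0.0, 1.0]),
}

# analogy: pat : bat :: pit : ?
query = embd["bat"] - embd["pat"] + embd["pit"]
answer = min(
    (w for w in embd if w != "pit"),
    key=lambda w: np.linalg.norm(embd[w] - query),
)
print(answer)  # "bit"
```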
Run `pip3 install -e .` to install this repository and its dependencies.
In order to run all the evaluations, you first need to run your embedding model on the provided words. These can be downloaded from our Huggingface dataset:
>>> from datasets import load_dataset
>>> dataset = load_dataset("zouharvi/pwesuite-eval", split="train")
>>> dataset[10]
{'token_ort': 'aachener', 'token_ipa': 'ɑːkən', 'lang': 'en', 'purpose': 'main', 'token_arp': 'AA1 K AH0 N ER0'}
Note that each line contains `token_ort`, `token_ipa`, `token_arp`, and `lang`. For training, only the words marked with `purpose=="main"` should be used. Note that unknown/low-frequency phonemes or letters are replaced with 😕.
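Selecting the training subset then amounts to filtering on the `purpose` field. A minimal sketch using toy rows in the same format as the dataset entry above (the second row is invented for illustration):

```python
# rows as returned by load_dataset("zouharvi/pwesuite-eval", split="train")
rows = [
    {"token_ort": "aachener", "token_ipa": "ɑːkən", "lang": "en",
     "purpose": "main", "token_arp": "AA1 K AH0 N ER0"},
    {"token_ort": "cat", "token_ipa": "kæt", "lang": "en",
     "purpose": "human_similarity", "token_arp": "K AE1 T"},
]

# only rows with purpose == "main" are used for training
train_rows = [r for r in rows if r["purpose"] == "main"]
print(len(train_rows))  # 1
```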
After running the embedding for each line/word, save it as either a Pickle or NPZ file. The data structure can be either (1) a list of lists or numpy arrays, or (2) a numpy array. The loader will automatically parse the file and check that the dimensions are consistent.
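A minimal sketch of saving embeddings in one of the accepted formats (the vector count, dimensionality, and file names here are illustrative):

```python
import pickle
import numpy as np

# hypothetical embeddings: one fixed-dimensional vector per word, in dataset order
embds = np.random.rand(5, 300).astype(np.float32)

# option 1: Pickle (a list of numpy arrays also works)
with open("your_embd.pkl", "wb") as f:
    pickle.dump(embds, f)

# option 2: NPZ
np.savez("your_embd.npz", embds)

# sanity check: the loader expects consistent dimensions across all vectors
with open("your_embd.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded.shape)  # (5, 300)
```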
After this, you are all set to run all the evaluations using `./suite_evaluation/eval_all.py --embd your_embd.pkl`. Alternatively, you can invoke individual tasks: `./suite_evaluation/eval_{correlations,human_similarity,retrieval,analogy,rhyme,cognate}.py`.
For a demo, see this Jupyter notebook.
Contact the authors if you encounter any issues using this evaluation suite. Read the associated paper, and for now cite it as:
@article{zouhar2023pwesuite,
title={{PWESuite}: {P}honetic Word Embeddings and Tasks They Facilitate},
author={Zouhar, Vil{\'e}m and Chang, Kalvin and Cui, Chenxuan and Carlson, Nathaniel and Robinson, Nathaniel and Sachan, Mrinmaya and Mortensen, David},
journal={arXiv preprint arXiv:2304.02541},
year={2023},
url={https://arxiv.org/abs/2304.02541}
}
The most compute-intensive tasks were training the Metric Learner and Triplet Margin models, which took a quarter of an hour and two hours on a GTX 1080 Ti, respectively. For the research presented in this paper, we estimate 100 GPU-hours overall.
The BERT embeddings were extracted as an average across the last layer. The INSTRUCTOR embeddings were used with the prompt "Represent the word for sound similarity retrieval:". For BPEmb and fastText, we used the best models (those trained on the most data) with a dimensionality of 300.
The metric learner uses a bidirectional LSTM with 2 layers, a hidden state size of 150, and dropout of 30%. The batch size is 128 and the learning rate is 0.01. The autoencoder uses the same hyperparameters for both the encoder and decoder; the difference is its learning rate, 0.005, which was chosen empirically.
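For illustration, an encoder with these hyperparameters can be sketched in PyTorch as follows. The mean-pooling over time steps is an assumption for this sketch; the actual model may aggregate the LSTM states differently:

```python
import torch
import torch.nn as nn

class MetricLearnerEncoder(nn.Module):
    # hyperparameters from the paper: 2-layer BiLSTM, hidden size 150, dropout 0.3
    def __init__(self, vocab_size=100, embd_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embd_dim)
        self.lstm = nn.LSTM(
            embd_dim, 150, num_layers=2, dropout=0.3,
            bidirectional=True, batch_first=True,
        )

    def forward(self, x):
        out, _ = self.lstm(self.embed(x))
        # mean-pool over time; bidirectional 150 -> 300-dim word vector
        return out.mean(dim=1)

model = MetricLearnerEncoder()
# batch of 8 words, 12 phoneme ids each (random ids for illustration)
vec = model(torch.randint(0, 100, (8, 12)))
print(vec.shape)  # torch.Size([8, 300])
```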
pwesuite's Issues
Plot original + new space
t-SNE can take an N^2 matrix of original distances and compute a new 2D vector space that we can examine in a plot. To verify that the correlations are meaningful, we could compute t-SNE of the original space given by the metric, but also of the new one, given by the fixed-width vectors.
Note that t-SNE does not solve our original task, because it does not fit a replicable function and needs to be re-run for any new data.
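A sketch of the projection step using scikit-learn's t-SNE on a precomputed distance matrix (random vectors stand in for the real embeddings; `metric="precomputed"` requires `init="random"`):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embds = rng.normal(size=(20, 300))  # stand-in fixed-width vectors

# N^2 matrix of pairwise distances (here Euclidean; any metric works)
dist = np.linalg.norm(embds[:, None] - embds[None, :], axis=-1)

# project the precomputed distances into 2D for plotting
proj = TSNE(
    n_components=2, metric="precomputed", init="random",
    perplexity=5, random_state=0,
).fit_transform(dist)
print(proj.shape)  # (20, 2)
```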
Compute neighbourhood accuracy
This metric learning may have failed on IR because it globally increases the correlation but not the local neighbourhood structure, which is what we mostly care about.
If the neighbourhood accuracy is low, we can include it explicitly in the model loss.
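Neighbourhood accuracy is not defined precisely here; one plausible formulation, assumed for illustration, checks whether each item's nearest neighbour agrees between the original metric and the embedding space:

```python
import numpy as np

def neighbourhood_accuracy(dist_orig, dist_embd):
    """Fraction of items whose nearest neighbour under the original
    metric is also the nearest neighbour in the embedding space."""
    n = dist_orig.shape[0]
    # mask the diagonal so an item is never its own neighbour
    eye = np.eye(n, dtype=bool)
    nn_orig = np.where(eye, np.inf, dist_orig).argmin(axis=1)
    nn_embd = np.where(eye, np.inf, dist_embd).argmin(axis=1)
    return (nn_orig == nn_embd).mean()

# identical distance matrices -> perfect neighbourhood agreement
d = np.array([[0.0, 1.0, 5.0], [1.0, 0.0, 5.0], [5.0, 5.0, 0.0]])
print(neighbourhood_accuracy(d, d))  # 1.0
```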
Huggingface upload broken
Nonmatching split size
File location and inconsistency in data attribute order
Hello,
I trust this message finds you well. I've encountered a couple of challenges while working with the provided code, and I wanted to bring them to your attention. Here are the main issues:
- File location: I'm having difficulty locating the `main/prepare_data.sh` file necessary to generate the dataset within the repository. Could you provide guidance on where to find this file?
- Data loading and task specification: In an effort to contribute, I implemented data loading from Hugging Face (for the evaluation phase). However, I observed a discrepancy in the task specification within `eval_all.py`:

```python
[(*x, y) for x, y in zip(data_multi_all, data_embd) if x[3] == "human_similarity"]
```

Here, the purpose information is specified in the 4th position. However, in the `evaluate_human_similarity` function:

```python
def evaluate_human_similarity(data_multi_hs):
    tok_to_embd = {}
    for (token_ort, token_ipa, lang, pronunciation, purpose, embd) in data_multi_hs:
        tok_to_embd[token_ort] = embd
```

the purpose information is expected in the 5th position. This caused a bug for me, and I had to change `x[3]` to `x[4]` to resolve the issue. I'd appreciate your insights on this matter, as I want to ensure I'm not overlooking any crucial details.

Another problem I noticed is that the order expected by these functions differs from the order given by the Hugging Face dataset:
Hugging Face order: `['token_ort', 'token_ipa', 'token_arp', 'lang', 'purpose']`

Evaluation order: `(token_ort, token_ipa, lang, pronunciation, purpose, embd)`
To address this, I've adjusted my preprocessing steps to align with the expected order. However, this solution is inconvenient and introduces inconsistency in the codebase.
I wanted to bring this to your attention and seek your opinion on a more sustainable resolution.
Thank you for your time and assistance. Best regards
Broken demo link
In the readme, the following link does not work anymore.
For a demo, see this Jupyter notebook.
It should be replaced with this URL: https://github.com/zouharvi/pwesuite/blob/master/demo.ipynb
Train decoder-only from RNN's outputs
This could provide an additional loss and also allow synthetically predicting sound analogies instead of just retrieving them.
Migrate experiments to new loader [high priority]
Recently we transitioned to using the Hugging Face loader, with a dictionary for each token instead of a positional array in `multi.tsv`. I manually fixed the evaluation code, which is the most important part, but the experiments are still using positional tokens.
Upload trained models to Huggingface
Hi, first of all, thanks for this interesting research! I would like to ask about the models mentioned in the paper: would it be possible to upload the fully trained models to Huggingface? I noticed you had already uploaded the data there, but I couldn't find any models on your profile.
Please let me know if this would be possible.
Unify vocab for language mismatch testing