
Phonetic Word Embeddings Suite (PWESuite)

Evaluation suite for phonetic (phonological) word embeddings and an additional model based on Panphon distance learning. This repository accompanies the paper PWESuite: Phonetic Word Embeddings and Tasks They Facilitate at LREC-COLING 2024. Watch the 12-minute introduction to PWESuite.

Abstract: Mapping words into a fixed-dimensional vector space is the backbone of modern NLP. While most word embedding methods successfully encode semantic information, they overlook phonetic information that is crucial for many tasks. We develop three methods that use articulatory features to build phonetically informed word embeddings. To address the inconsistent evaluation of existing phonetic word embedding methods, we also contribute a task suite to fairly evaluate past, current, and future methods. We evaluate both (1) intrinsic aspects of phonetic word embeddings, such as word retrieval and correlation with sound similarity, and (2) extrinsic performance on tasks such as rhyme and cognate detection and sound analogies. We hope our task suite will promote reproducibility and inspire future phonetic embedding research.

The suite contains the following tasks:

  • Correlation with human sound similarity judgement
  • Correlation with articulatory distance
  • Nearest neighbour retrieval
  • Rhyme detection
  • Cognate detection
  • Sound analogies

Run pip3 install -e . to install this repository and its dependencies.

Embedding evaluation

In order to run all the evaluations, you first need to compute embeddings for the provided words. These can be downloaded from our Huggingface dataset:

>>> from datasets import load_dataset
>>> dataset = load_dataset("zouharvi/pwesuite-eval", split="train")
>>> dataset[10]
{'token_ort': 'aachener', 'token_ipa': 'ɑːkən', 'lang': 'en', 'purpose': 'main', 'token_arp': 'AA1 K AH0 N ER0'}

Note that each entry contains token_ort, token_ipa, token_arp, lang, and purpose. For training, only the words marked with purpose == "main" should be used. Note that unknown/low-frequency phonemes or letters are replaced with 😕.
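For example, a minimal sketch of selecting only the training words by filtering on the purpose field (variable names are illustrative):

    from datasets import load_dataset

    dataset = load_dataset("zouharvi/pwesuite-eval", split="train")
    # Keep only the words intended for training the embedding model.
    train_words = [row for row in dataset if row["purpose"] == "main"]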

After running the embedding for each line/word, save the result as either a Pickle or an NPZ file. The data structure can be either (1) a list of lists or numpy arrays, or (2) a single numpy array. The loader will automatically parse the file and check that the dimensions are consistent.
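A hedged sketch of one way to save the embeddings (the array below is a random placeholder and the file names are illustrative):

    import pickle
    import numpy as np

    # Placeholder: one 300-dimensional vector per word, in the same order as the dataset rows.
    embds = np.random.rand(1000, 300)

    # Option 1: Pickle (a list of numpy arrays also works).
    with open("your_embd.pkl", "wb") as f:
        pickle.dump(embds, f)

    # Option 2: NPZ.
    np.savez("your_embd.npz", embds)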

After this, you are all set to run all the evaluations using ./suite_evaluation/eval_all.py --embd your_embd.pkl. Alternatively, you can invoke individual tasks: ./suite_evaluation/eval_{correlations,human_similarity,retrieval,analogy,rhyme,cognate}.py.

For a demo, see this Jupyter notebook.

Misc

Contact the authors if you encounter any issues using this evaluation suite. Read the associated paper and, for now, cite it as:

@article{zouhar2023pwesuite,
  title={{PWESuite}: {P}honetic Word Embeddings and Tasks They Facilitate},
  author={Zouhar, Vil{\'e}m and Chang, Kalvin and Cui, Chenxuan and Carlson, Nathaniel and Robinson, Nathaniel and Sachan, Mrinmaya and Mortensen, David},
  journal={arXiv preprint arXiv:2304.02541},
  year={2023},
  url={https://arxiv.org/abs/2304.02541}
}

Compute details

The most compute-intensive tasks were training the Metric Learner and the Triplet Margin models, which took a quarter of an hour and 2 hours on a GTX 1080 Ti, respectively. For the research presented in this paper, we estimate 100 GPU hours overall.

The BERT embeddings were extracted as an average across the last layer. The INSTRUCTOR embeddings were used with the prompt "Represent the word for sound similarity retrieval:". For BPEmb and fastText, we used the best available models (trained on the most data) with a dimensionality of 300.
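As a hedged illustration, mean-pooling the last layer with Hugging Face transformers could look like the following sketch (the checkpoint name and feeding the plain word form are assumptions, not the paper's exact setup):

    import torch
    from transformers import AutoTokenizer, AutoModel

    # Assumed checkpoint; the text above does not pin a specific BERT model.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModel.from_pretrained("bert-base-multilingual-cased")

    def embed(word: str) -> torch.Tensor:
        inputs = tokenizer(word, return_tensors="pt")
        with torch.no_grad():
            last_hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_dim)
        # Average across the last layer's token representations.
        return last_hidden.mean(dim=1).squeeze(0)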

The metric learner uses a bidirectional LSTM with 2 layers, a hidden state size of 150, and a dropout of 30%. The batch size is 128 and the learning rate is 0.01. The autoencoder follows the same hyperparameters for both the encoder and the decoder; the difference is its learning rate, 0.005, which was chosen empirically.
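A minimal sketch of an encoder with these hyperparameters (the class and variable names, the input featurization, the pooling, and the training loss are not specified above and are assumptions):

    import torch
    import torch.nn as nn

    class MetricLearnerEncoder(nn.Module):
        # BiLSTM with 2 layers, hidden size 150, and 30% dropout, as stated above.
        def __init__(self, input_dim, hidden_dim=150):
            super().__init__()
            self.lstm = nn.LSTM(
                input_dim, hidden_dim,
                num_layers=2, dropout=0.3,
                bidirectional=True, batch_first=True,
            )

        def forward(self, x):  # x: (batch, seq_len, input_dim)
            output, _ = self.lstm(x)
            # Mean-pool over time to get a fixed-width word embedding (pooling choice is an assumption).
            return output.mean(dim=1)

    model = MetricLearnerEncoder(input_dim=24)  # input_dim depends on the articulatory featurization
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # batch size 128, learning rate 0.01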

poster

pwesuite's People

Contributors

kalvinchang, natbcar, zouharvi


pwesuite's Issues

Plot original + new space

t-SNE can take an N×N matrix of original distances and compute a new 2D vector space that we can examine in a plot. To verify that the correlations are meaningful, we could compute t-SNE both of the original space given by the metric and of the new one given by the fixed-width vectors.

Note that t-SNE does not solve our original task because it does not fit a reusable function and needs to be re-run for any new data.
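A hedged sketch of both projections with scikit-learn (the distance matrix and vectors below are random placeholders):

    import numpy as np
    from sklearn.manifold import TSNE

    # Placeholder data: dist is an N x N symmetric matrix of original distances,
    # embds are the fixed-width vectors produced by the embedding model.
    dist = np.random.rand(200, 200)
    dist = (dist + dist.T) / 2
    np.fill_diagonal(dist, 0)
    embds = np.random.rand(200, 64)

    # t-SNE of the original space, from the precomputed distance matrix.
    xy_orig = TSNE(metric="precomputed", init="random").fit_transform(dist)
    # t-SNE of the new space, from the fixed-width vectors.
    xy_new = TSNE().fit_transform(embds)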

Compute neighbourhood accuracy

The reason this metric learning failed on IR may be that it increases the correlation globally but does not improve the local neighbourhood structure as much, which is what we mostly care about.

If the neighbourhood accuracy is low, we can include it explicitly in the model loss.
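A hedged sketch of one way to compute such a neighbourhood accuracy, as the average overlap between the k nearest neighbours under the original metric and in the embedding space (the function name and k are illustrative):

    import numpy as np
    from scipy.spatial.distance import cdist

    def neighbourhood_accuracy(dist_orig, embds, k=10):
        # Average overlap of the k nearest neighbours under the original distances
        # versus the distances in the learned embedding space.
        dist_new = cdist(embds, embds)
        overlaps = []
        for i in range(len(embds)):
            nn_orig = set(np.argsort(dist_orig[i])[1:k + 1])  # skip the word itself
            nn_new = set(np.argsort(dist_new[i])[1:k + 1])
            overlaps.append(len(nn_orig & nn_new) / k)
        return float(np.mean(overlaps))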

File location and inconsistency in data attribute order

Hello,

I trust this message finds you well. I've encountered a couple of challenges while working with the provided code, and I wanted to bring them to your attention. Here are the main issues:

  1. File Location: I'm having difficulty locating the file necessary to generate the dataset (main/prepare_data.sh) within the repository. Could you provide guidance on where to find this file?

  2. Data Loading and Task Specification: In an effort to contribute, I took the initiative to implement data loading from Hugging Face (for the evaluation phase). However, I observed a discrepancy in the task specification within eval_all.py:

    [
        (*x, y) for x, y in zip(data_multi_all, data_embd)
        if x[3] == "human_similarity"
    ]

    Here, the purpose information is specified in the 4th position. Upon further analysis of the evaluate_human_similarity function:

    def evaluate_human_similarity(data_multi_hs):
        tok_to_embd = {}
        for (token_ort, token_ipa, lang, pronunciation, purpose, embd) in data_multi_hs:
            tok_to_embd[token_ort] = embd

    I noticed that the purpose information is expected in the 5th position. This mismatch caused a bug for me, and I had to change x[3] to x[4] to resolve it. I'd appreciate your insights on this matter, as I want to ensure I'm not overlooking any crucial details.

    Another problem I noticed is that the order expected by these functions differs from the order given by the Hugging Face dataset:
    Hugging Face order:
    ['token_ort', 'token_ipa', 'token_arp', 'lang', 'purpose']
    evaluation order:
    (token_ort, token_ipa, lang, pronunciation, purpose, embd)
    To address this, I've adjusted my preprocessing steps to align with the expected order. However, this solution is inconvenient and introduces inconsistency in the codebase.
    I wanted to bring this to your attention and ask for your opinion on a more sustainable resolution.
    Thank you for your time and assistance.

Best regards

Migrate experiments to new loader [high priority]

Recently we transitioned to using the Hugging Face loader, with a dictionary for each token instead of a positional array in multi.tsv.

I manually fixed the evaluation code, which is the most important part, but the experiments are still using the positional format.
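A hedged sketch of an adapter that turns the dictionary rows plus embeddings into the positional tuples the evaluation code expects (the mapping of the pronunciation slot to token_arp is an assumption based on the issue above, and the function name is illustrative):

    def to_positional(rows, embds):
        # (token_ort, token_ipa, lang, pronunciation, purpose, embd), per the evaluation order above.
        return [
            (r["token_ort"], r["token_ipa"], r["lang"], r["token_arp"], r["purpose"], e)
            for r, e in zip(rows, embds)
        ]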

Upload trained models to Huggingface

Hi, first of all, thanks for this interesting research! I would like to ask about the models mentioned in the paper: would it be possible to upload the fully trained models to Huggingface? I noticed you have already uploaded the data there, but I couldn't find any models on your profile.

Please let me know if this would be possible.
