
This project forked from ulf1/sentence-embedding-evaluation-german


Basically SentEval with German language downstream tasks

License: Apache License 2.0


sentence-embedding-evaluation-german

Sentence embedding evaluation for German.

This library is inspired by SentEval but focuses on German language downstream tasks.

Downstream tasks

The available downstream tasks are listed in the table below. If you think that a dataset is missing and should be added, please open an issue.

| task | type | text type | lang | #train | #test | target | info |
|---|---|---|---|---|---|---|---|
| TOXIC | 👿 toxic comments | facebook comments | de-DE | 3244 | 944 | binary {0,1} | GermEval 2021, comments subtask 1, 📁 📖 |
| ENGAGE | 🤗 engaging comments | facebook comments | de-DE | 3244 | 944 | binary {0,1} | GermEval 2021, comments subtask 2, 📁 📖 |
| FCLAIM | ☝️ fact-claiming comments | facebook comments | de-DE | 3244 | 944 | binary {0,1} | GermEval 2021, comments subtask 3, 📁 📖 |
| VMWE | ☝️ verbal idioms | newspaper | de-DE | 6652 | 1447 | binary (figuratively, literally) | GermEval 2021, verbal idioms, 📁 📖 |
| OL19-A | 👿 offensive language | tweets | de-DE | 3980 | 3031 | binary {0,1} | GermEval 2018, 📁 📖 |
| OL19-B | 👿 offensive language, fine-grained | tweets | de-DE | 3980 | 3031 | 4 catg. (profanity, insult, abuse, oth.) | GermEval 2018, 📁 📖 |
| OL19-C | 👿 explicit vs. implicit offense | tweets | de-DE | 1921 | 930 | binary (explicit, implicit) | GermEval 2018, 📁 📖 |
| OL18-A | 👿 offensive language | tweets | de-DE | 5009 | 3398 | binary {0,1} | GermEval 2018, 📁 |
| OL18-B | 👿 offensive language, fine-grained | tweets | de-DE | 5009 | 3398 | 4 catg. (profanity, insult, abuse, oth.) | GermEval 2018, 📁 |
| ABSD-1 | 🤷 relevance classification | 'Deutsche Bahn' customer feedback | de-DE | 19432 | 2555 | binary | GermEval 2017, 📁 |
| ABSD-2 | 😃😐😡 sentiment analysis | 'Deutsche Bahn' customer feedback | de-DE | 19432 | 2555 | 3 catg. (pos., neg., neutral) | GermEval 2017, 📁 |
| ABSD-3 | 🛤️ aspect categories | 'Deutsche Bahn' customer feedback | de-DE | 19432 | 2555 | 20 catg. | GermEval 2017, 📁 |
| MIO-S | 😃😐😡 sentiment analysis | 'Der Standard' newspaper article web comments | de-AT | 1799 | 1800 | 3 catg. | One Million Posts Corpus, 📁 |
| MIO-O | 🤷 off-topic comments | 'Der Standard' newspaper article web comments | de-AT | 1799 | 1800 | binary | One Million Posts Corpus, 📁 |
| MIO-I | 👿 inappropriate comments | 'Der Standard' newspaper article web comments | de-AT | 1799 | 1800 | binary | One Million Posts Corpus, 📁 |
| MIO-D | 👿 discriminating comments | 'Der Standard' newspaper article web comments | de-AT | 1799 | 1800 | binary | One Million Posts Corpus, 📁 |
| MIO-F | 💡 feedback comments | 'Der Standard' newspaper article web comments | de-AT | 3019 | 3019 | binary | One Million Posts Corpus, 📁 |
| MIO-P | ✉️ personal story comments | 'Der Standard' newspaper article web comments | de-AT | 4668 | 4668 | binary | One Million Posts Corpus, 📁 |
| MIO-A | ✴️ argumentative comments | 'Der Standard' newspaper article web comments | de-AT | 1799 | 1800 | binary | One Million Posts Corpus, 📁 |
| SBCH-S | 😃😐😡 sentiment analysis | 'chatmania' app comments, only comments labelled as Swiss German are included | gsw | 394 | 394 | 3 catg. | SB-CH Corpus, 📁 |
| SBCH-L | ⛰️ dialect classification | 'chatmania' app comments | gsw | 748 | 748 | binary | SB-CH Corpus, 📁 |
| ARCHI | ⛰️ dialect classification | Audio transcriptions of interviews in four dialect regions of Switzerland | gsw | 18809 | 4743 | 4 catg. | ArchiMob, 📁 📖 |
| LSDC | 🌊 dialect classification | several genres (e.g. formal texts, fairytales, novels, poetry, theatre plays) from the 19th to 21st centuries. Extinct Lower Prussia excluded. Gronings excluded due to lack of test examples. | nds | 74140 | 8602 | 14 catg. | Lower Saxon Dialect Classification, 📁 📖 |
| KLEX-P | 🤔 text level | Conceptual complexity classification of texts written for adults (Wikipedia), children between 6-12 (Klexikon), and beginner readers (MiniKlexikon); Paragraph split indicated by <eop> or * | de | 8264 | 8153 | 3 catg. | 📁 📖 |

Download datasets

bash download-datasets.sh

Check if files were actually downloaded

find ./datasets/**/ -exec ls -lh {} \;
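
The same check can also be done from Python, for instance right before starting an evaluation. This is a minimal sketch using only the standard library; the helper name `list_dataset_files` is hypothetical and not part of this package, and `./datasets` is assumed to be the folder that the download script fills.

```python
from pathlib import Path
from typing import List, Tuple


def list_dataset_files(folder: str = "./datasets") -> List[Tuple[str, int]]:
    """Return (relative path, size in bytes) for every file below `folder`.

    Hypothetical helper for illustration; not part of the package.
    """
    root = Path(folder)
    if not root.is_dir():
        return []
    files = [p for p in root.rglob("*") if p.is_file()]
    return sorted((p.relative_to(root).as_posix(), p.stat().st_size) for p in files)


found = list_dataset_files()
if not found:
    print("No files found; did download-datasets.sh run successfully?")
for name, size in found:
    print(f"{size:>12d}  {name}")
```

Files with a size of 0 bytes in this listing usually point to a failed download.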

Usage example

Import the required Python packages.

from typing import List
import sentence_embedding_evaluation_german as seeg
import torch

Step (1) Load your pretrained model

In the following example, we generate a random embedding matrix for demonstration purposes.

# (1) Instantiate an embedding model
emb_dim = 512
vocab_sz = 128
emb = torch.randn((vocab_sz, emb_dim), requires_grad=False)
emb = torch.nn.Embedding.from_pretrained(emb)
assert not emb.weight.requires_grad  # from_pretrained freezes the weights by default

Step (2) Specify your preprocessor function

You need to specify your own preprocessing routine. The preprocessor function must convert a batch of strings (List[str]) into a list of feature vectors, i.e. sentence embeddings (List[List[float]]); the example below returns them stacked as a 2-D tensor with one row per sentence. In the following example, we derive pseudo token IDs from the characters, look up their vectors in our random matrix, and average them into one feature vector per sentence for demonstration purposes.

# (2) Specify the preprocessing
def preprocesser(batch: List[str], params: dict = None) -> torch.Tensor:
    """ Specify your embedding or pretrained encoder here
    Parameters:
    -----------
    batch : List[str]
        A list of sentences as strings
    params : dict
        The params dictionary
    Returns:
    --------
    torch.Tensor
        A 2-D tensor with one embedding vector per sentence
    """
    features = []
    for sent in batch:
        # Map each character to a pseudo token ID in [0, 128).
        # Fall back to ID 0 so that an empty string still yields a vector.
        ids = torch.tensor([ord(c) % 128 for c in sent] or [0])
        # Look up the token vectors and average them over the sentence
        h = emb(ids)
        features.append(h.mean(dim=0))
    features = torch.stack(features, dim=0)
    return features
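
Before starting a long evaluation run, it can help to check that a preprocessor honours the contract described above: one vector per input sentence, all with the same dimension. The following is a minimal sketch; `check_preprocessor` and `toy_preprocessor` are hypothetical names for illustration and not part of the package. The toy function embeds each sentence as a normalized character histogram so that the example runs without PyTorch.

```python
from typing import Callable, List, Sequence


def toy_preprocessor(batch: List[str], params: dict = None) -> List[List[float]]:
    """Embed each sentence as a normalized histogram over 8 character buckets."""
    dim = 8
    features = []
    for sent in batch:
        vec = [0.0] * dim
        for c in sent:
            vec[ord(c) % dim] += 1.0
        total = sum(vec) or 1.0  # avoid division by zero for empty strings
        features.append([v / total for v in vec])
    return features


def check_preprocessor(fn: Callable, batch: Sequence[str]) -> int:
    """Call `fn` on `batch` and return the embedding dimension.

    Raises ValueError if the number of vectors differs from the number of
    sentences, or if the vectors do not all have the same length.
    """
    features = fn(list(batch), None)
    if len(features) != len(batch):
        raise ValueError(f"expected {len(batch)} vectors, got {len(features)}")
    dims = {len(vec) for vec in features}
    if len(dims) != 1:
        raise ValueError(f"inconsistent embedding dimensions: {sorted(dims)}")
    return dims.pop()


dim = check_preprocessor(toy_preprocessor, ["Guten Morgen!", "", "Grüezi mitenand"])
print(dim)  # 8
```

Because the check relies only on `len()` and iteration, it should also accept a preprocessor that returns a 2-D tensor instead of nested lists.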

Step (3) Training settings

We suggest training a final layer with a bias term ('bias': True), a loss function weighted by the class frequency ('balanced': True), a batch size of 128, and 500 epochs without early stopping.

# (3) Training settings
params = {
    'datafolder': './datasets',
    'bias': True,
    'balanced': True,
    'batch_size': 128, 
    'num_epochs': 500,
    # 'early_stopping': True,
    # 'split_ratio': 0.2,  # if early_stopping=True
    # 'patience': 5,  # if early_stopping=True
}
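
To make the 'balanced' option more concrete: a common way to weight a loss by class frequency is to give each class a weight inversely proportional to its count. The sketch below uses the heuristic n_samples / (n_classes * class_count). This formula is an assumption for illustration; the exact weighting used inside this package is not stated here, and `balanced_class_weights` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Toy labels: class 0 is three times as frequent as class 1
weights = balanced_class_weights([0, 0, 0, 1])
print(weights)  # {0: 0.6666666666666666, 1: 2.0}
```

With such weights, an error on the rare class contributes more to the loss than an error on the frequent class, which matters for skewed binary tasks.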

Step (4) Downstream tasks

We suggest running the following downstream tasks. FCLAIM flags comments that require manual fact-checking because they contain reasoning, arguments or claims that might be false. VMWE differentiates texts with figurative or literal multi-word expressions. OL19-C distinguishes between explicit and implicit offensive language. ABSD-2 is a sentiment analysis dataset with customer reviews. These four datasets can be assumed to be Standard German from Germany (de-DE). MIO-P flags Austrian German (de-AT) comments that contain personal stories. ARCHI is a Swiss German (gsw) and LSDC a Low German (nds) dialect identification task.

# (4) Specify downstream tasks
downstream_tasks = ['FCLAIM', 'VMWE', 'OL19-C', 'ABSD-2', 'MIO-P', 'ARCHI', 'LSDC']
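
The suggested selection covers four language varieties. As a small illustration, the sketch below groups the task names by the `lang` column of the table above. The dictionary `TASK_LANG` is copied by hand from that table and `group_by_language` is a hypothetical helper; neither is provided by the package.

```python
from collections import defaultdict
from typing import Dict, List

# Language tag per suggested task, copied from the `lang` column of the table
TASK_LANG: Dict[str, str] = {
    'FCLAIM': 'de-DE', 'VMWE': 'de-DE', 'OL19-C': 'de-DE', 'ABSD-2': 'de-DE',
    'MIO-P': 'de-AT', 'ARCHI': 'gsw', 'LSDC': 'nds',
}


def group_by_language(tasks: List[str]) -> Dict[str, List[str]]:
    """Group task names by language tag, keeping the input order."""
    groups = defaultdict(list)
    for task in tasks:
        groups[TASK_LANG[task]].append(task)
    return dict(groups)


downstream_tasks = ['FCLAIM', 'VMWE', 'OL19-C', 'ABSD-2', 'MIO-P', 'ARCHI', 'LSDC']
for lang, names in group_by_language(downstream_tasks).items():
    print(lang, names)
```

Such a grouping can be handy for reporting scores per language variety instead of one figure over all tasks.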

Step (5) Run the experiments

Finally, start the evaluation. The suggested downstream tasks (step 4) with 500 epochs (step 3) might require 10-40 minutes, but this depends heavily on your computing resources. So grab a ☕ or 🍵.

# (5) Run experiments
results = seeg.evaluate(downstream_tasks, preprocesser, **params)
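
Since a full run can take a while, it may be worth writing the returned object to disk right away. The sketch below assumes only that `results` is built from plain Python containers; its exact structure is not described in this section, so the sample value used here is a made-up stand-in. `save_results` is a hypothetical helper, and `default=str` is a fallback that stringifies anything `json` cannot encode natively.

```python
import json
from pathlib import Path
from typing import Any


def save_results(results: Any, path: str = "results.json") -> Path:
    """Write `results` as pretty-printed UTF-8 JSON and return the file path."""
    out = Path(path)
    with out.open("w", encoding="utf-8") as fp:
        json.dump(results, fp, indent=2, ensure_ascii=False, default=str)
    return out


# Made-up stand-in; replace with the object returned by seeg.evaluate(...)
example = [{"task": "FCLAIM", "note": "placeholder, not a real result"}]
saved = save_results(example, "results-example.json")
print(saved.read_text(encoding="utf-8"))
```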

Demo notebooks

Start Jupyter

source .venv/bin/activate
jupyter lab

Open a demo notebook

Appendix

Installation & Downloads

The sentence-embedding-evaluation-german git repo is available as a PyPI package.

pip install sentence-embedding-evaluation-german
pip install git+ssh://git@github.com/ulf1/sentence-embedding-evaluation-german.git

You need to download the datasets as well. If you run the following code, the datasets should be in a folder ./datasets.

wget -q "https://raw.githubusercontent.com/ulf1/sentence-embedding-evaluation-german/main/download-datasets.sh" -O download-datasets.sh 
bash download-datasets.sh

Development work for this package

Install a virtual environment

python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt --no-cache-dir
pip install -r requirements-dev.txt --no-cache-dir
pip install -r requirements-demo.txt --no-cache-dir

(If your git repo is stored in a folder whose path contains whitespace, don't use the subfolder .venv. Use an absolute path without whitespace instead.)

Install conda environment for GPU

conda install -y pip
conda create -y --name gpu-venv-seeg python=3.9 pip
conda activate gpu-venv-seeg
# install CUDA support
conda install -y cudatoolkit=11.3.1 cudnn=8.3.2 -c conda-forge
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
pip install torch==1.12.1+cu113 torchvision torchaudio -f https://download.pytorch.org/whl/torch_stable.html
# install other packages
pip install -r requirements.txt --no-cache-dir
pip install -r requirements-dev.txt --no-cache-dir
pip install -r requirements-demo.txt --no-cache-dir
watch -n 0.5 nvidia-smi

Python commands

  • Jupyter for the examples: jupyter lab
  • Check syntax: flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')

Publish package

pandoc README.md --from markdown --to rst -s -o README.rst
python setup.py sdist 
twine upload -r pypi dist/*

Clean up

find . -type f -name "*.pyc" | xargs rm
find . -type d -name "__pycache__" | xargs rm -r
rm -r .pytest_cache
rm -r .venv

New Dataset recommendation

If you want to recommend another or a new dataset, please open an issue.

Troubleshooting

If you have trouble getting this package running, please open an issue for support.

Contributing

Please contribute using GitHub Flow. Create a branch, add commits, and open a pull request.

Citation

If you want to use this package in a research paper, please open an issue, because we have not yet decided how to make this package citable. You should at least mention the PyPI version in your paper to ensure reproducibility.

You certainly need to cite the actual evaluation datasets in your paper. Please check the hyperlinks in the info column of the table above.
