Coder Social home page Coder Social logo

legal-nlp-scrapers's Introduction

German Datasets / Resources

This repository contains scrapers and lists of links for various online sources of data in Leichte Sprache.

Preprocessing

To make use of our preprocessing pipeline refer to the following steps.

  1. Create a new folder (with the name of your dataset).
  2. Add folders for monolingual and/or aligned data.
    1. For monolingual data: place another subfolder 'monolingual' inside the first folder.
    2. For aligned data: place another subfolder 'aligned' inside the first folder.
  3. Add the text files
    1. For monolingual data: compress your raw .txt files into a zip file called 'corpus.zip' to your and place it in your 'monolingual' folder. Make sure the filenames in your corpus.zip meet the following criteria:
    • no spaces
    • no german "umlaut"
    • no other special characters as brackets ...
    1. For aligned data: place your csv file (arbitrary name) in your 'aligned' folder
      make sure to have one column normal_phrase and one column simple_phrase in your csv.
  4. If it already exists you can delete the current clean_data folder.
  5. Run python3 preprocess.py to generate preprocessed data.
  6. You can now find your cleaned data in the folder clean_data.

If you have added new data, repeat steps 4 to 6 to update the already cleaned data again.

Troubleshooting

  • If characters are corrupted switch between text.decode() and text.decode("latin-1")

Access to manually aligned test set

dictionary-meta.csv contains a list of urls to dictionary entries we used for compiling a sentence-wise manually aligned dataset. The election program and MDR news corpus can be provided upon request.

Scraped monolingual data overview

source # articles # sentences quality type
hurraki 3 911 56 785 good lexicon
nachrichtenleicht 7 709 122 842 good news
klexikon 2 350 80 042 medium lexicon
ndr 1 817 60 749 good news
einfachstars 6 488 129 674 good news
hdasprachtechnologie 44 4 210 good misc.
lebenshilfe 396 7 144 good misc.
kurier 4 519 67 827 good news
total 27 234 529 654

Scraped parallel data overview

source # articles # sentences quality type
original_kurier 3 476 77 587 good news
total 3 476 77 587

Sentence aligned data overview

source # sentences creation method leichte sprache type
wiki_translation 488 001 scraper no lexicon
apa_a2 9 456 lha scraper yes news
apa_b1 10 268 lha scraper yes news
kurier 40 772 vecalign scraper yes news
total 548 497

Validation dataset

source # sentences creation method leichte sprache type
mdr news 50 manually aligned yes news
mdr dictionary 50 manually aligned yes lexicon
brandeins(*) 106 scraped (color coding) yes various
wiki_auto_test 147 manually reviewed no lexicon
wiki_auto_dev 59 manually reviewed no lexicon
total 412

Test dataset

source # sentences creation method leichte sprache type
wahlprogramm 107 manually aligned yes election program
mdr news 50 manually aligned yes news
mdr dictionary 50 manually aligned yes lexicon
brandeins(*) 106 scraped (color coding) yes various
total 313

(*) Created with the help of Daniel Berger ([email protected])

Creation Methods

manually aligned: source and target phrases/sentences are aligned by hand by a human person
manually reviewed: alignment is already done (in this case english alignments are translated to german) and a human person corrects grammar and other mistakes
scraped (color coding): target and source phrases are displayed in different colors and can be automatically aligned

Mainly Monolingual Data

Parallel Data

Topic # chars (leichte) # chars (standard) # sentences (leichte) # sentences (standard) Notes
Election Programs 40 406 782 843 773 6 376 -
Bible 32 663 28 044 806 338 -
Tales 104 647 92 335 2 332 587 -
BRK 9 263 26 534 152 170 -
News 6 197 6 305 102 81 Standard German translation is sometimes mixed with "Leichte Sprache"
Books 1 863 2 666 45 29 -

Potential Issues

Severely unbalanced topics

Much of the data comes from government websites or texts in a legal context

Possible solutions
Use sentence embedding methods to cluster the phrases from the data.
You may be able to recognize clusters that can be assigned to certain topics or writing styles.
Phrases from these clusters can then be weighted differently.

legal-nlp-scrapers's People

Contributors

brjezierski avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.