Tatoeba-Cangarejo---Find-Sentences-without-Direct-or-Indirect-Translations

Note that this is a planned project; nothing has been added to Cangarejo's original script yet.

Started with Cangarejo's Python 3 script

My aim is to produce the same type of output, but limited to sentences owned by native speakers. This should lower the number of errors found in the data, and most language learners would probably prefer to learn from sentences created by native speakers. Once members have finished translating all the sentences owned by native speakers, they could run Cangarejo's original script to find all the other sentences.

I would use the list of native speakers that I have compiled (https://bit.ly/nativespeakers) rather than trust the "self-claimed" native languages in the data exported from tatoeba.org. My list also identifies native speakers who were part of the project before TRANG added the feature that exports a member's native language. I have also chosen not to trust members who list multiple native languages, since many seem either to exaggerate or to be over-confident in their "nativeness" in languages other than their native language. I write to such members, ask them to identify their "real" native language or strongest language, and add them to my list under that language.

Files needed

  • sentences_detailed.csv: https://downloads.tatoeba.org/exports/sentences_detailed.csv

  • links.csv: https://downloads.tatoeba.org/exports/links.csv

  • nativespeakers.tsv

These are the data statements from http://bit.ly/nativespeakers, with the quotes and commas removed. You can use the file in this directory or harvest newer data from http://bit.ly/nativespeakers, which is much more likely to be up to date.

Example:

acm	hasenj	46
acm	salemazez	1
afb	hamzah	45
afb	Huda_Mohammad	45
afb	hadalabadi	3
afb	Muhammed_abdoon	2
afb	OmarSyr	1
etc.
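
A minimal sketch of how this file could be loaded and used to keep only sentences owned by listed native speakers. It assumes that the first two tab-separated columns are the language code and the username (the README does not say what the third, numeric column means, so it is ignored here), and that each row of sentences_detailed.csv is tab-separated with the sentence id, language, text and username as its first four fields. The function names are hypothetical and are not part of Cangarejo's script.

```python
def load_native_speakers(lines):
    """Return a set of (language, username) pairs from nativespeakers.tsv rows."""
    pairs = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        lang, username = fields[0], fields[1]
        pairs.add((lang, username))
    return pairs


def native_owned(sentence_lines, native_pairs):
    """Yield (sentence_id, lang, text) for sentences owned by a listed native speaker.

    Assumed row layout: id, lang, text, username, then any further fields.
    A sentence is kept only if its owner is listed for that same language.
    """
    for line in sentence_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # not enough columns to identify the owner
        sentence_id, lang, text, username = fields[:4]
        if (lang, username) in native_pairs:
            yield sentence_id, lang, text
```

Both functions accept any iterable of lines, so with real data they could be given open file objects, for example open("nativespeakers.tsv", encoding="utf-8"). Matching on the (language, username) pair rather than on the username alone reflects the decision to credit each member with a single native language.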

The original HTML file looked like this:

"acm	hasenj	46",
"acm	salemazez	1",
"afb	hamzah	45",
"afb	Huda_Mohammad	45",
"afb	hadalabadi	3",
"afb	Muhammed_abdoon	2",
"afb	OmarSyr	1",
"afr	CJuser01	1013",
etc.
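
The conversion from these data statements to nativespeakers.tsv rows could be sketched as follows. This assumes each statement sits on its own line in exactly the quoted, comma-terminated form shown above; lines that do not match are skipped. The function name is hypothetical.

```python
def data_statements_to_tsv(lines):
    """Strip the surrounding quotes and the trailing comma from each data statement."""
    rows = []
    for line in lines:
        text = line.strip()
        if text.endswith(","):
            text = text[:-1]  # drop the trailing comma
        # Keep only lines that are fully wrapped in double quotes.
        if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
            rows.append(text[1:-1])
    return rows
```

Writing the returned rows to a file, one per line, would give the tab-separated format shown in the earlier example.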

Contributors

ckjpn, cangareijo
