My aim is to create the same type of output, but to limit the results to sentences owned by native speakers. This should help lower the number of errors found in the data. Also, most language learners would probably prefer to learn from sentences created by native speakers. After members have finished translating all the sentences owned by native speakers, they could run Cangarejo's original script to find all the other sentences.
I would use the list of native speakers that I have compiled (https://bit.ly/nativespeakers) rather than trust the "self-claimed" native languages in the exported data from tatoeba.org. My list also identifies native speakers who were part of the project before TRANG added the feature that includes a member's native language in the exported data. Also, I have chosen not to trust those who have multiple native languages listed, since many seem either to exaggerate or to be over-confident about their "nativeness" in languages other than their native language. I do write to such members, ask them to identify their "real" native language or strongest language, and add them to my list with that language.
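The filtering idea described above could be sketched roughly as follows. This is a minimal illustration, not the actual script: it assumes each native-speaker row holds a language code, a username, and a number separated by whitespace, and that each sentence row is tab-separated with the columns id, language, text, username in that order. The function names and the two sample sentences are made up for the demo.

```python
import csv
import io


def load_native_speakers(lines):
    """Return a set of (language, username) pairs from whitespace-separated rows."""
    pairs = set()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank lines or stray text such as "etc."
            pairs.add((fields[0], fields[1]))
    return pairs


def native_sentences(rows, native_pairs):
    """Yield sentence rows whose owner is listed as native in that language.

    Assumed column order: id, language, text, username (later columns ignored).
    """
    for row in rows:
        if len(row) >= 4 and (row[1], row[3]) in native_pairs:
            yield row


# Small in-memory demo; the sentence rows below are invented for illustration.
speakers = load_native_speakers(["afr CJuser01 1013", "acm hasenj 46"])
sample = io.StringIO(
    "1\tafr\tHallo.\tCJuser01\n"
    "2\tafr\tDankie.\tsomeone_else\n"
)
reader = csv.reader(sample, delimiter="\t", quoting=csv.QUOTE_NONE)
kept = list(native_sentences(reader, speakers))
for row in kept:
    print("\t".join(row))
```

Matching on the (language, username) pair rather than on the username alone keeps a member's sentences in other languages out of the results, which is the point of using the curated list.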
Or: https://downloads.tatoeba.org/exports/sentences_detailed.csv
Or: https://downloads.tatoeba.org/exports/links.csv
- nativespeakers.tsv
These are the data statements from http://bit.ly/nativespeakers, with the quotes and commas removed. You can use the file in this directory, or harvest newer data from http://bit.ly/nativespeakers, which is much more likely to be up to date.
acm hasenj 46
acm salemazez 1
afb hamzah 45
afb Huda_Mohammad 45
afb hadalabadi 3
afb Muhammed_abdoon 2
afb OmarSyr 1
etc.
"acm hasenj 46",
"acm salemazez 1",
"afb hamzah 45",
"afb Huda_Mohammad 45",
"afb hadalabadi 3",
"afb Muhammed_abdoon 2",
"afb OmarSyr 1",
"afr CJuser01 1013",
etc.
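If you harvest newer data, the quoted data statements can be turned back into the plain form with a few lines of Python. This is only a sketch under the assumption that every data statement looks like the samples above (three fields inside double quotes, followed by a comma); the function names are hypothetical.

```python
def clean_statement(statement):
    """Remove the double quotes and commas from one data statement."""
    return statement.replace('"', "").replace(",", "").strip()


def clean_statements(statements):
    """Clean every statement, keeping only rows with exactly three fields."""
    cleaned = []
    for statement in statements:
        row = clean_statement(statement)
        if len(row.split()) == 3:  # drops blank lines and filler such as "etc."
            cleaned.append(row)
    return cleaned


harvested = [
    '"acm hasenj 46",',
    '"acm salemazez 1",',
    '"afb hamzah 45",',
    "etc.",
]
for row in clean_statements(harvested):
    print(row)
```

The three-field check is a simple guard so that anything other than a data statement does not end up in nativespeakers.tsv.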