hermitdave / frequencywords
Repository for Frequency Word List Generator and processed files
License: MIT License
The following language files are named with "_50k" appended to the filename, but do not contain 50k words:
10282 content/2016/af/af_50k.txt
2350 content/2016/bn/bn_50k.txt
7131 content/2016/br/br_50k.txt
31631 content/2016/eo/eo_50k.txt
2968 content/2016/hi/hi_50k.txt
1972 content/2016/hy/hy_50k.txt
3402 content/2016/kk/kk_50k.txt
2604 content/2016/ml/ml_50k.txt
7504 content/2016/si/si_50k.txt
927 content/2016/ta/ta_50k.txt
1402 content/2016/te/te_50k.txt
6033 content/2016/tl/tl_50k.txt
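A check like the one above can be reproduced with a short script. This is only a sketch: the function name `short_50k_files` is hypothetical, and it assumes one entry per line and the `content/<year>/<lang>/` layout shown in the listing.

```python
from pathlib import Path


def short_50k_files(root, expected=50_000):
    """Return (line_count, path) pairs for *_50k.txt files with fewer than `expected` entries."""
    results = []
    for path in sorted(Path(root).rglob("*_50k.txt")):
        with path.open(encoding="utf-8") as handle:
            # Count non-empty lines; each line is assumed to hold one word entry.
            count = sum(1 for line in handle if line.strip())
        if count < expected:
            results.append((count, path.as_posix()))
    return results
```

Calling `short_50k_files("content/2016")` from a checkout of the repository should list the files that fall short of 50,000 entries.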
For German words, it would be really beneficial if they were written properly: nouns are capitalized.
So not "freund" but "Freund".
This would allow this list to be used for spellchecking.
Hi @hermitdave ,
first, thanks for putting this list together! I just wanted to ask whether these words were collected from the whole OpenSubtitles database, or only from the parallel subset. Concretely, I am working with the English and the Hebrew corpora, and it would be useful for me to know whether they were collected from the same movies.
Thanks!
Raquel
In Ukrainian (and Russian, Bulgarian) there are plenty of words with the "'" sign in them. I believe it is a completely different character than the Latin "'". It's not like in English, where you can drop the "'" and the word still makes sense ("he's" becomes "he"). For example, the Ukrainian word for "computer" is "комп'ютер", and "комп" does not mean anything on its own. There are hundreds of words like that.
https://en.wikipedia.org/wiki/Ukrainian_alphabet#Letter_names_and_pronunciation
Can anyone change that and rerun the word count calculations for Ukrainian?
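One possible way to keep such words intact is to treat an apostrophe between two letters as part of the word. The sketch below is not the generator's actual tokenizer, which is not shown here; `APOSTROPHES`, `WORD_RE`, and `tokenize` are hypothetical names, and the set of apostrophe-like characters is an assumption.

```python
import re

# Apostrophe-like characters that may appear inside a word (an assumed set):
# U+0027 APOSTROPHE, U+2019 RIGHT SINGLE QUOTATION MARK,
# U+02BC MODIFIER LETTER APOSTROPHE.
APOSTROPHES = "'\u2019\u02bc"

# A word is a run of letters, optionally continued by an apostrophe
# that is immediately followed by more letters.
WORD_RE = re.compile(rf"[^\W\d_]+(?:[{APOSTROPHES}][^\W\d_]+)*")


def tokenize(text):
    """Split text into lowercase words, keeping word-internal apostrophes."""
    return [match.group(0).lower() for match in WORD_RE.finditer(text)]


print(tokenize("Мій комп'ютер"))
```

With this rule, "комп'ютер" stays a single token, while quotation marks at the start or end of a word are still dropped.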
Hi @hermitdave,
Thanks for putting together these frequency lists and making them available on GitHub! We'd like to use the frequency words lists on your site for our software project. Specifically, we'd like to use them as stop words in our text analysis.
Our software project is an open-source project licensed under GPL. You can find it here: https://github.com/Yoast/javascript.
Unfortunately, the CC-BY-SA 3.0 license used for the data in your project isn't compatible with GPL. Would you perhaps consider upgrading the license of your data to CC-BY-SA 4.0? That license is compatible with GPL, see https://creativecommons.org/share-your-work/licensing-considerations/compatible-licenses/. Alternatively, a GPL or LGPL license would also work for us.
Changing the license to CC-BY-SA 4.0 would open up the data for wider usage in open source software development, so I'm sure many others would benefit from it as well.
Best,
Manuel
All of the terms of the CC licence revolve around "copy and redistribute the material"
There is nothing that grants anyone the ability to simply use the "material" (i.e. the word data).
For example, someone might store a word list in a database in order to make decisions based on the popularity of words entered by a user.
In that case the data is not actually shared. Some aspect of a display might change, but the "material" is not shared.
I believe "THE", listed in the top 202, should be "TEH", meaning "tea".
https://raw.githubusercontent.com/hermitdave/FrequencyWords/master/content/2018/id/id_full.txt
You'll find "THE" between "boleh" and "masara".
The most prominent Indonesian dictionary is KBBI issued by the government. Even KBBI doesn't register "THE" as an Indonesian word.
https://kbbi.kemdikbud.go.id/entri/the
When I work in Microsoft Excel, for example, "TEH" (tea) is always auto-corrected to "THE".
Hi,
I used your frequency list for a master's thesis in psychology.
How can I reference you? For now, I have the following:
"Now, the words of the LibreOffice English dictionary were matched with their respective frequencies extracted from a frequency list developed by Hermit (2016)."
Best regards,
Koen
The Ukrainian files contain Russian-only words, i.e. words that do not exist in the Ukrainian language.
The simplest first-order filter is to ignore words with letters ё, ъ, ы, э
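The first-order filter described above could look roughly like this. It is a sketch that assumes each entry has the form "word count"; the names `RUSSIAN_ONLY_LETTERS`, `looks_russian_only`, and `filter_ukrainian` are hypothetical.

```python
# Letters used in Russian but not in the Ukrainian alphabet.
RUSSIAN_ONLY_LETTERS = set("ёъыэ")


def looks_russian_only(word):
    """Return True if the word contains a letter that Ukrainian does not use."""
    return any(ch in RUSSIAN_ONLY_LETTERS for ch in word.lower())


def filter_ukrainian(entries):
    """Keep only 'word count' lines whose word passes the letter check."""
    kept = []
    for line in entries:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if not looks_russian_only(parts[0]):
            kept.append(line)
    return kept


print(filter_ukrainian(["це 10", "это 9", "був 8", "был 7"]))
```

This only catches words that contain one of the four letters; Russian words spelled entirely with letters shared by both alphabets would still pass.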
There seem to be some characters that should be ignored in Urdu:
https://github.com/hermitdave/FrequencyWords/blob/master/content/2018/ur/ur_full.txt
See lines 3 and 5
As someone who doesn't speak Urdu, what makes me think these are punctuation rather than words is the characters' names:
ARABIC COMMA
and ARABIC FULL STOP
If you can confirm that this is a real issue and not just my lack of knowledge of the language, my fix would be to add
،
and
۔
in the ignored characters list
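Entries like these can be found without knowing the language by checking Unicode general categories. The sketch below uses Python's standard `unicodedata` module and assumes "word count" lines; the function name `punctuation_only_entries` is hypothetical, and this is not the generator's own filtering code.

```python
import unicodedata


def punctuation_only_entries(lines):
    """Return (token, character names) for entries made up solely of punctuation."""
    flagged = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        token = parts[0]
        # Unicode general categories starting with "P" are punctuation.
        if all(unicodedata.category(ch).startswith("P") for ch in token):
            names = [unicodedata.name(ch, "UNKNOWN") for ch in token]
            flagged.append((token, names))
    return flagged


sample = ["\u060c 100", "\u06d4 90", "\u06c1\u06d2 80"]
for token, names in punctuation_only_entries(sample):
    print(token, names)
```

Running the same check over any language's list would surface other punctuation marks that slipped through as "words".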
The MIT license permits reuse [...] provided that all copies of the licensed software include a copy of the MIT License terms.
The MIT license is a "non copyleft permissive license" that is suited to software but not to data.
Data would be better off under a more suitable license...
Preferably Public domain.
CC-by-sa would do as well : )
Looking at the "en" list, you see words like don
and 't
The issue presents a bit differently in 2016 and 2018 but it exists in both of them.
Currently the entries are just "WORD OCCURRENCECOUNT".
I think it would be highly useful for many people to have "WORD OCCURRENCECOUNT TYPE", where TYPE specifies the word type. The word type should follow the tag convention used in natural language processing: NN = noun, VB = verb, JJ = adjective, ...
I am in the process of doing this; the Stanford tagger in combination with the NLTK module seems to be the most usable option. I am having installation trouble at the moment.
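A rough sketch of the proposed format, independent of which tagger is used: `add_word_types` and `toy_tagger` are hypothetical names, and the tagger is passed in as a callable that takes a list of words and returns (word, tag) pairs. With NLTK and its tagger data installed, `nltk.pos_tag` has that shape and could be passed in instead of the toy lookup, which exists only to keep the example self-contained.

```python
def add_word_types(lines, tagger):
    """Turn 'WORD COUNT' lines into 'WORD COUNT TYPE' lines using the given tagger."""
    parsed = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        parsed.append((parts[0], parts[1]))
    tagged = tagger([word for word, _ in parsed])
    return [
        f"{word} {count} {tag}"
        for (word, count), (_, tag) in zip(parsed, tagged)
    ]


def toy_tagger(words):
    """Hypothetical stand-in tagger backed by a tiny lookup table."""
    lookup = {"you": "PRP", "run": "VB", "house": "NN"}
    return [(word, lookup.get(word, "NN")) for word in words]


for entry in add_word_types(["you 100", "run 50", "house 20"], toy_tagger):
    print(entry)
```

Note that a frequency list has no sentence context, so tags assigned to isolated words are only approximate; a word like "run" can be a noun or a verb.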