michmech / irish-word-frequency
About 6,500 Irish lemmas ordered by corpus frequency, with noise removed.
License: Open Data Commons Open Database License v1.0
Would it be possible to have a version of this with all the words just separated by a comma and nothing else? For use in Clozemaster.com.
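One way to produce such a version yourself is sketched below. The repository's file layout is not described here, so the sketch assumes one entry per line with the lemma as the first tab-separated field; the function name `to_comma_separated` and the sample rows are hypothetical.

```python
def to_comma_separated(lines):
    """Join the lemma from each non-empty line with commas.

    Assumption: one entry per line, with the lemma as the first
    tab-separated field; any later fields (such as a count) are dropped.
    """
    lemmas = []
    for line in lines:
        lemma = line.split("\t")[0].strip()
        if lemma:  # skip blank lines
            lemmas.append(lemma)
    return ",".join(lemmas)


# Made-up sample rows, only to show the shape of the input.
sample = ["agus\t100", "coinne\t50", "", "Gaillimh"]
print(to_comma_separated(sample))
```

To run it on a real file, pass an open file object (opened with `encoding="utf-8"`) instead of `sample`, and adjust the separator if the list uses something other than tabs.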
Hi,
I'm interested in the script / methodology used to construct this list.
Specifically, 'coinne' comes up quite high in the frequency list, but I imagine that's because of its use in phrases such as 'i gcoinne' (against), 'gan choinne' (unexpectedly) and 'os coinne' (in front of/opposite).
From a language-learning point of view, I'd like to learn these phraselets separately, so my idea is to allow bigrams alongside high-frequency words. E.g. given the corpus frequency for 'coinne' as 8507, maybe the above 3 phrases have (say) frequencies of 4000, 3000, and 1000, in which case they would appear in the top 6,500 list and bump the plain 'coinne' version off the list (which would now have a frequency score of 507 after subtracting the bigram frequencies).
Is the source code for how this list was created available?
With thanks!
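The re-ranking idea in this issue can be sketched as follows. This is not the method used to build the list (its source is what the issue asks about); it is a minimal illustration that assumes phrase counts are already available and that each phrase is attributed to a single head lemma. The function name `rerank_with_phrases` is hypothetical, and the counts are the illustrative figures from the issue, not corpus data.

```python
def rerank_with_phrases(word_freq, phrase_freq, top_n=6500):
    """Merge phrase counts into a word list and re-rank.

    word_freq:   {lemma: corpus count}
    phrase_freq: {phrase: (head lemma, corpus count)}
    Each phrase count is subtracted from its head lemma so that the
    same occurrences are not counted twice.
    """
    merged = dict(word_freq)
    for phrase, (head, count) in phrase_freq.items():
        merged[head] = merged.get(head, 0) - count
        merged[phrase] = count
    # Keep only entries with a positive residual count, highest first.
    ranked = sorted(
        ((item, n) for item, n in merged.items() if n > 0),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_n]


# Illustrative figures taken from the issue text (the phrase counts are guesses).
words = {"coinne": 8507}
phrases = {
    "i gcoinne": ("coinne", 4000),
    "gan choinne": ("coinne", 3000),
    "os coinne": ("coinne", 1000),
}
for item, n in rerank_with_phrases(words, phrases):
    print(item, n)
```

With these figures the plain 'coinne' entry is left with 8507 - 4000 - 3000 - 1000 = 507, so whether it stays in the top 6,500 depends on where a count of 507 falls in the full list.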
You should not have replied to the other issues; you are just encouraging me :D
I came across this one this week.
'téarmach, a1. Terminal.' is in https://www.teanglann.ie/ but there is nothing in https://www.focloir.ie/.
A search in the corpus at http://corpas.focloir.ie/ reveals only téarma and ghearrtéarmach.
I guess the answer is that téarmaí (terms) has been incorrectly lemmatized to the adjective rather than the noun.
I was wondering why 'dobhar' appears so high up in the list, and after puzzling over the dictionary entries on focloir & teanglann, I remembered that Gaoth Dobhair would likely be a common Gaeltacht placename mentioned in the source texts. I just want to mention it as an issue in case others use this repository, and to add a query as to whether proper names were correctly identified. (I know Gaillimh is in the list and kept capitalized, which is fine.)
Sorry, I just wanted to register a further issue, although I know this is an old repository.
I'm wondering why cál is so high up the list, as 'kale/cabbage' doesn't seem to merit such a high position.
Anyhow, it's probably time I dived into creating a similar word frequency list myself from the source texts, as then I'll be able to investigate myself!