
GF WordNet

  1. The Lexicon
  2. WordNet Domains
  3. Treebank
  4. VerbNet
  5. Wikipedia Images
  6. Browsing and Editing
  7. The Python Interface

The GF WordNet is a lexicon based on the Princeton WordNet and Wikidata, adapted to integrate with the GF Resource Grammars Library (RGL). Following the GF model, the lexicon consists of an abstract syntax with one abstract identifier for each word sense, while the concrete syntaxes define the corresponding linearizations in each language. A synset, then, consists of a set of abstract identifiers instead of words.

The lexicon includes nouns, verbs, adjectives and adverbs from WordNet, as well as people and place names from Wikidata. Some structural words, such as prepositions and conjunctions, are also included. The overall size is summarized in the table below:

WordNet    adjectives, nouns, verbs, etc.    100 thousand
Wikidata   given names                        64 thousand
           family names                      531 thousand
           place names                       3.7 million
Total                                        4.3 million

The initial development was mostly focused on English, Swedish and Bulgarian. The WordNets for all other languages were bootstrapped from existing resources and aligned using statistical methods. They are only partly checked, either by matching with Wikipedia or by human feedback. Many of the translations may be correct, but inconsistencies can be expected as well. For details check:

Unlike the original WordNet, we cover grammatical and morphological features as well as semantic ones. This is necessary to make the lexicon compatible with the Resource Grammars Library (RGL).

The Lexicon

Each entry in the lexicon represents the full morphology, the precise syntactic category, and one particular sense of a word. When words across different languages share the same meaning, they are represented by a single cross-lingual id. For example, in WordNetEng.gf we have the following definitions of blue in English:

lin blue_1_A = mkA "blue" "bluer";
lin blue_2_A = mkA "blue" "bluer";
lin blue_3_A = mkA "blue" "bluer";
lin blue_4_A = mkA "blue" "bluer";
lin blue_5_A = mkA "blue" "bluer";
lin blue_6_A = mkA "blue" "bluer";
lin blue_7_A = mkA "blue" "bluer";
lin blue_8_A = mkA "blue" "bluer";

since they represent different senses and thus different translations. In WordNetSwe.gf we have:

lin blue_1_A = L.blue_A ;
lin blue_2_A = L.blue_A ; --guessed
lin blue_3_A = mkA "deppig" ;
lin blue_4_A = mkA "vulgär" ;
lin blue_5_A = mkA "pornografisk" ;
lin blue_6_A = mkA "aristokratisk" ;
lin blue_7_A = L.blue_A ; --guessed
lin blue_8_A = L.blue_A ; --guessed

and in WordNetBul.gf:

lin blue_1_A = mkA086 "син" ;
lin blue_2_A = mkA086 "син" ; --guessed
lin blue_3_A = mkA076 "потиснат" ;
lin blue_4_A = mkA079 "вулгарен" ;
lin blue_5_A = mkA078 "порнографски" ;
lin blue_6_A = mkA079 "аристократичен" ;
lin blue_7_A = mkA086 "син" ; --guessed
lin blue_8_A = mkA086 "син" ; --guessed

The definitions use the standard RGL syntactic categories, which are a lot more descriptive than the tags "n", "v", "a", and "r" in WordNet. In addition, we use the RGL paradigms to implement the morphology.

Note also that not all translations are equally reliable for all languages. In the example above, the comment --guessed means that the translation was extracted from an existing translation lexicon, but we are not sure whether it accurately represents the right sense. Similarly, you may also see the comment --unchecked, which means that the chosen translation comes from an existing WordNet, but further checking is needed to guarantee that it is the most idiomatic translation.

The English lexicon also contains information about gender. All senses that refer to a human being are tagged with masculine, feminine or human gender. In cases where a word can be either masculine or feminine, it is further split into two senses; these usually have different translations in many languages. The information about which words refer to humans is based on the WordNet hierarchy. In the English RGL the gender information is relevant, for instance, when choosing between who/which and herself/himself/itself.
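For illustration, a pair of gender-tagged entries could look roughly as follows; this is a sketch assuming the ParadigmsEng overload mkN : Gender -> N -> N, and the ids are made up, not necessarily the ones in WordNetEng.gf:

-- illustrative gender-tagged entries (sketch)
lin actor_1_N = mkN masculine (mkN "actor") ;
lin actress_1_N = mkN feminine (mkN "actress") ;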

The abstract syntax WordNet.gf defines all abstract ids in the lexicon. Almost every definition is followed by a comment consisting of the corresponding WordNet 3.1 synset offset, followed by a dash and the WordNet tag. After that comes a tab, followed by the WordNet gloss. As in WordNet, the gloss is followed by a list of examples, but here we retain only the examples relevant to the current lexical id. In other words, if a synset contains several words, only the examples containing the current abstract id are retained.
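For instance, an entry looks roughly like this (the synset offset and the wording of the gloss here are illustrative, not copied from the actual file):

fun blue_1_A : A ; -- 00370869-a	of the color intermediate between green and violet; "blue flowers"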

The verbs in the lexicon are also distinguished based on their valency, i.e. transitive, intransitive, ditransitive, etc. The valency of a verb in a given sense is determined by its example, but there is also a (still partial) integration of VerbNet.
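The valency is visible directly in the RGL category of the abstract id, as in this sketch (the ids are illustrative):

fun die_1_V : V ;    -- intransitive
fun see_1_V2 : V2 ;  -- transitive
fun give_1_V3 : V3 ; -- ditransitive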

Some of the nouns and adjectives are typically accompanied by a preposition and a noun phrase. In those cases they are classified as N2 and A2. This helps in parsing and also lets us choose the right preposition in every language.
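In English, such entries would look roughly as follows; this is a sketch assuming the ParadigmsEng opers mkN2 : N -> Prep -> N2 and mkA2 : A -> Prep -> A2, with illustrative ids:

lin mother_2_N2 = mkN2 (mkN "mother") (mkPrep "of") ;
lin married_1_A2 = mkA2 (mkA "married") (mkPrep "to") ;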

WordNet Domains

The data also integrates WordNet Domains. If the synset for an entry has one or more domains in WordNet Domains, they are listed in the abstract syntax at the beginning of the gloss, surrounded by square brackets. In addition to those, some more domains were added by analysing the glosses in the original WordNet. The taxonomy of the domains is stored in domains.txt.
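An entry with a domain would thus look roughly like this (the offset, domain and gloss are chosen for illustration):

fun cat_1_N : N ; -- 02124272-n	[animals] feline mammal usually having thick soft fur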

Treebank

In order to make the lexical development more stable, we have also started a treebank consisting of all examples from the Princeton WordNet (see examples.txt). The examples are automatically parsed with the Parse.gf grammar in GF. For each example there is also the original English sentence, as well as the Swedish and Bulgarian seed translations from Google Translate.

Some of the examples are already checked. This means that the abstract syntax tree has been corrected, the right senses are used, and the seed translations have been replaced with translations obtained by using the Parse grammar. The translations are also manually checked for validity.

The format of a treebank entry is as follows:

abs: PhrUtt NoPConj (UttS (UseCl (TTAnt TPast ASimul) PPos (PredVP (DetCN (DetQuant DefArt NumSg) (UseN storm_1_N)) (UseV abate_2_V)))) NoVoc
eng: The storm abated
swe: stormen avtog
bul: бурята отслабна
key: 1 abate_2_V 00245945-v

If the first line starts with "abs*" instead of "abs:" then the entry is not checked yet. If the line starts with "abs^" then the entry is checked but some more work is needed.

The last line in the entry contains the abstract ids for which this example is intended. The number 1 here means that there is only one id, but there can be more if several ids share the example. After the abstract ids comes the synset offset in WordNet 3.1. If the same synset has translations in either the BulTreebank WordNet or Svenskt OrdNät, then these are also listed after the offset. When possible, those translations should be used.
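For instance, a key line for an example shared by two ids would look like this (the ids and the offset are made up for illustration):

key: 2 storm_1_N tempest_1_N 11515325-n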

VerbNet

VerbNet is invaluable in ensuring that the verb valencies are as consistent as possible. VerbNet also gives us the semantics of the verbs as first-order logical formulas.

The VerbNet classes and frames are stored in examples.txt. For instance, the following record encodes the class begin-55.1:

class: begin-55.1
role:  Agent +animate +organization
role:  Theme
role:  Instrument

After the class definition, there are one or more frame definitions:

frm: PredVP Agent (ComplVV Verb ? ? Theme)
sem: begin(E,Theme), cause(Agent,E)
key: go_on_VV proceed_VV begin_to_1_VV start_to_1_VV commence_to_1_VV resume_to_1_VV undertake_1_VV get_34_VV set_about_3_VV set_out_1_VV get_down_to_VV

The line marked with frm: represents the syntax of the frame, expressed in the abstract syntax. After that, sem: contains the first-order formula. Finally, key: lists all abstract syntax entries belonging to that frame.

After a frame, there may be one or more examples which illustrate how the frame is applied to different members of the class.

Wikipedia Images

Quite often the glosses are not clear enough to understand the meaning of a synset. In the initial project development, about a fifth of the lexical entries in the lexicon were linked to the corresponding Wikipedia articles. This has several benefits: the articles are far more informative than the glosses, and they are translated, which helps in bootstrapping other languages. Finally, we can now also show images associated with many of the lexemes, which, as we know, are worth more than a thousand words.

Later, the links created in this project were merged with the links that Wikidata provides via property P8814. This resulted in a large set of links which is also of superior quality to what GF WordNet and Wikidata had before.

In order to speed up compilation, the set of links is cached in the file images.txt, which can be regenerated at any time by running the script bootstrap/images.hs. The file looks as follows:

gothenburg_1_LN	Q25287;Gothenburg;commons/3/30/Flag_of_Gothenburg.svg	Q25287;Gothenburg;commons/a/a8/Gothenburg_new_montage_2012.png

The fields of a record are separated by whitespace. The first field is the abstract lexical id, and it is followed by one field per image. An image field consists of three parts separated by semicolons: the Wikidata Qid, the relative page location in the English Wikipedia, and the relative path to the thumbnail image.

Browsing and Editing

An online search interface for the lexicon is available here:

https://cloud.grammaticalframework.org/wordnet/

If you have editing rights to the GF WordNet repository, it is also possible to log in and edit the data via the web interface. Both the lexicon and the corpus are editable. Once you finish a batch of changes, you can also commit directly.

The Python Interface

You can also use the WordNet as a Python library, similar in style to nltk.corpus.wordnet; see the Python README.

gf-wordnet's People

Contributors: aarneranta, daherb, ekaterinavoloshina, guscarrian, harisont, inariksit, krangelov, odanoburu


gf-wordnet's Issues

Best way to "install" and use gf-wordnet

This project has nice documentation and pointers to other resources and projects, but for a complete beginner there is no "getting started" section.

What is the preferred place to keep the gf-wordnet files? There seem to be many possibilities and it's hard to choose between them. One option seems to be keeping the gf-wordnet folder under /usr/local/share/gf-rgl/gf-wordnet.

Russian

Hi!

Would it be possible to add Russian in the very near future, and if so how long would that take approximately? Or how could I do it myself?

I'm trying to adapt some GF-based grammar exercises to Russian for this beginner Swedish course aimed at Ukrainian refugees, and the GF WordNet would make everything much easier.

Can't login for edits from web interface

I've tried to make some adjustments to Italian words from the web interface but for the past couple of days I've been unable to log in, even though I had had no issues in the past.

Plurale tantum nouns

Do you have a policy on how to handle plurale tantum nouns? For example:

WordNetSpa.gf:lin day_off_CN = UseN (mkN "vacaciones") ; --guessed
WordNetSpa.gf:lin holiday_1_N = mkN "vacaciones" ;
WordNetSpa.gf:lin leave_1_N = mkN "vacaciones" ; --guessed
WordNetSpa.gf:lin vacation_1a_N = mkN "vacaciones" ;
WordNetSpa.gf:lin vacation_1b_N = mkN "vacaciones" ;

Currently, this produces "un vacaciones", "los vacacioneses", which is incorrect. I would like to correct all these entries like this:

oper vacación_N : N = mkN "vacación" "vacaciones" Fem ;
lin day_off_CN : CN = UseN vacación_N ;

lin vacation_1b_N = vacación_N ;

I perceive GF-wordnet as a low-level lexical resource, and think it's the responsibility of an application grammarian to make sure that holiday_1_N is used in a plural NP. Is this the way you have thought of GF-wordnet as well, or do you have other visions?

I would be okay with oper vacación_N : N = mkN "vacaciones" "vacaciones" Fem as well, with the inconvenience that we'd get "una vacaciones", which doesn't seem correct. But it's still better than "los vacacioneses".

In Python, function langs() is not available in global scope

Even when using from wordnet import *, the function langs() described in the Python README is not available in the global scope:

Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from wordnet import *
>>> langs()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'langs' is not defined. Did you mean: 'range'?

but other functions such as synsets are:

>>> synsets('eng','dog')
[Synset('02086723-n'), Synset('10133978-n'), Synset('10042764-n'), Synset('09905672-n'), Synset('07692347-n'), Synset('03907626-n'), Synset('02712903-n'), Synset('02005890-v')]

Problem when committing from the web UI

I was fixing some Estonian words from the web UI, when I was done I pressed Commit, and got this message:

Patch up WordNetEst.gf

$ git commit --author Inari Listenmaa <[email protected]> --message progress WordNet*.gf examples.txt
[master ced5cbdb] progress
 Author: Inari Listenmaa <[email protected]>
 1 file changed, 4 insertions(+), 4 deletions(-)
$ git push https://github.com/GrammaticalFramework/gf-wordnet
To https://github.com/GrammaticalFramework/gf-wordnet
 ! [rejected]          master -> master (fetch first)
error: failed to push some refs to 'https://inariksit:[email protected]/GrammaticalFramework/gf-wordnet'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

Done.

I noticed that you had made the latest commit 8 minutes before me, so I guess the website hadn't had time to update?

I then tried to press Commit again, and got this message instead:

$ git commit --author Inari Listenmaa <[email protected]> --message progress WordNet*.gf examples.txt
On branch master
Your branch is ahead of 'origin/master' by 3 commits.
  (use "git push" to publish your local commits)

Changes not staged for commit:
	modified:   www/js/gf-wordnet.js

Untracked files:
	build/
	status.svg
	www/argument.png
	www/dg-conceptual-editor.js
	www/dg-grammarian.css
	www/dg-grammarian.html
	www/dg-grammarian.js
	www/gf-wordnet.js
	www/js/blockly.min.js
	www/js/dg-conceptual-editor.js
	www/js/dg-grammarian.js
	www/js/rgl.blockly.xml
	www/no-argument.png
	www/rgl.blockly.xml
	www/rgl.xml

no changes added to commit


Done.

It's not a big deal since I only changed 4 words and I can redo the changes, but I'm just letting you know that this happened.

Parse example in the Python README.md

I think it would be most helpful for beginner users to have an example of how to use the parse function in python/README.md (and maybe a helper in _api.py to access this function, if that would be helpful).

That way, users could very easily explore the round trip from parsing to expression to linearization.

Web interface broken

I wanted to correct some Estonian words. I pressed Log in and signed in to GitHub; after redirection I get this:
[screenshot]
After refreshing it shows this:
[screenshot]

python-webkit not maintained

mirroring andresriancho/w3af#13635.

this package is no longer maintained and has been dropped from the latest Ubuntu and Debian releases, so the wordnet-ide cannot run straightforwardly on these platforms.

I imagine this is low priority, so alternatives I can think of are compiling from source (instructions here) or downgrading.

Using secondary dependencies

Currently the primary UD dependencies are used, but what we actually need is the secondary dependencies, since they better reflect the semantics. There are two problems:

  • the training should be made to work with DAGs instead of trees
  • not all treebanks have secondary dependencies but it looks like it is possible to recreate them automatically.

Status for synonyms

The web interface shows the status of a lexical item as checked/unchecked, but only if it is an item that you have searched for. By contrast, when you open an item, it shows a list of synonyms, and for those there is no status.

Somali orthography

Great to have Somali coverage in GF!

I'm curious to know where the data is from. It looks rather linguistic, with lots of IPA characters.

lin question_1_V2 = mkV2 (mkV "suʔal") noPrep ; --guessed
lin put_6_V2 = mkV2 (mkV "ɖɪg") noPrep ; --guessed
lin squeeze_1_V2 = mkV2 (mkV "tuːǯi") noPrep ; --guessed

The sources I've read that try to adhere to Somali conventions (like this one), would use ' for the glottal stop, and dh for the retroflex. The word squeeze would be written tuuji. There are many other examples as well, and I'm not quite sure about all of them. So I'd like to know the source, so I can make a more educated guess.

I would like the GF resource to reflect prevalent usage rather than grammar books. If you agree, I can take care of the transliteration.

Smart paradigms for verbs

This is a TODO item for myself.

When I wrote the fragments of the Somali RG 2 years ago, I left the verb morphology quite unfinished. I was trying to decide which form should be the dictionary form, and found different conventions in different lexica and grammars. I didn't come to a conclusion, so my current implementation takes an artificial stem, which is not an actual word form. I can adjust my smart paradigms to work with different forms, but it's good to keep in mind that if we get words from different sources, their dictionary forms may not be the same.

Retain the link between bigrams and source sentences

For better debugging of the statistical model, it would be useful to retain a link between each extracted bigram and the sentences it originated from. The sentences should then be retrievable via the Web interface.

Confusion between Verb, Adjective and Noun

It looks like in Bulgarian some of the participles are annotated as either adjectives or nouns depending on the context. In contrast, in GF these are treated as verb forms and are coerced to other types later, at the syntactic level. This problem seems to be relevant only to Bulgarian, but it is worth double-checking for other languages.

In Python, German and French downloads fail

Following the instruction of the Python README, the following works for English:

>>> wordnet.download(['eng'])
Download and boot the grammar 355MB (Expanded to 2719MB)
Download the semantics database 2814MB done
Reload wordnet

but fails for French and German:

wordnet.download(['fra'])
Download and boot the grammar 138MBTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/steph/.local/lib/python3.10/site-packages/wordnet/__init__.py", line 54, in download
    pgf.bootNGF(readinto, path+"Parse.ngf")
pgf.PGFError: reached end of file while reading the grammar

I am using Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux, on Ubuntu 22.04.1, with wordnet installed using pip3 as explained in the README.

Using VerbNet for supervision

VerbNet implicitly contains information about the possible arguments of a verb. We can use that to supervise the learning algorithm.

Better visualization for the word clouds.

Currently the word clouds are too small and have a fixed size regardless of the number of words; this is a limitation of the visualization library. Also, almost all words are shown in the same font size, probably because the probabilities are very close, so there should be a better way to map probabilities to font sizes. It might also help to add unigram smoothing of the probabilities to get more variation in the sizes.

Alignment in the examples

When the UI shows the examples, it would be nice to see the word alignment. One way to do this is to highlight the aligned words in the other languages when the user clicks a word. It would also be nice to show the gloss for the selected word.
