circse / lemlat3

Morphological analyzer and lemmatizer for Latin.

Home Page: http://www.lemlat3.eu/

C 88.70% Shell 8.58% M4 2.72%

lemmatization morphological-analysis latin

lemlat3's Introduction

LEMLAT 3.0

About

Contribution of the CIRCSE Research Centre to the Latin morphological analyzer and lemmatizer LEMLAT 3.0.

NB: LEMLAT analyzes types, not tokens; it therefore does not disambiguate words in context.

LEMLAT 3.0 website: http://www.lemlat3.eu/
LEMLAT 3.0 credits: http://www.lemlat3.eu/about/credits/

To cite LEMLAT 3.0 (first release), you can adapt the following:

Marco Passarotti, Paolo Ruffolo, Flavio M. Cecchini, Eleonora Litta, Marco Budassi (2018) LEMLAT 3.0.

To cite all versions/releases of LEMLAT 3.0 use DOI: https://doi.org/10.5281/zenodo.1492133

Documentation and use

See the Wiki of this repository for the full documentation and instructions on how to run LEMLAT 3.0.

Copyright

LEMLAT 3.0 is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


lemlat3's Issues

Explain the file system

Add a GitHub Wiki page to explain how this repository's file system works. It's not immediately obvious which folder contains what and what all the files are.

Abort trap 6

Here I am again. Two more errors.
I ran LEMLAT on a file of about 75,000 words, and after a few minutes of activity it gave me "Abort trap 6".

I looked at the output: it had done the lemmatization, but not for the complete file. I then deleted the output files, but as soon as I press Enter to restart LEMLAT it tells me this:

I will try running it on the server to check that this is not a time-out problem on my computer; I don't know. I wanted to share this in case you want to add these errors to a troubleshooting section.

Common words and forms missing from LEMLAT

We have tested LEMLAT on a corpus of classical Latin texts from a university reading list. The corpus contains some 23,700 words and 8,538 different word forms: Terence's Adelphoe, Horace's Odes Bk. 1, Tibullus Bk. 1, Seneca's Letters Bk. 1 (all editions from the PerseusDL collection). Besides various forms of personal names (and some typos in our sources), there were 40 word forms not recognized by LEMLAT. This is a tiny percentage of all forms (under 0.5%), and the full list is below. Some of the reasons for non-recognition seem to be orthographical (ë; omitted -p- in emta, demsi; oe in foeneraret; words joined instead of separated, as in illiusmodi). Some have to do with meter in comedy: the elided -n', from -ne, is regularly not recognized by LEMLAT. Some missing forms are fairly common: norimus, nosse.

I propose that the forms from the list below be added to the LEMLAT database.


adteruisse
audistin
coëmisse
demseris
demsi
egon
emta
emtae
emtam
foeneraret
haecine
hancine
hocine
hoscine
illan
illiusmodi
ipsus
lucu
men
norimus
nosse
nossem
nostin
numquidnam
poëta
poëtae
posthaec
propediem
quamobrem
quamprimum
quandoquidem
quorundam
quotannis
refrixerit
sumtuosa
tamdiu
tantummodo
tercentenas
tetigin
tun
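
Until such forms are in the database, the purely orthographic diaeresis cases (coëmisse, poëta, poëtae) could in principle be handled on the input side. The following is a minimal sketch of such a pre-normalization step; the function name `strip_diaeresis` is hypothetical, the step is not part of LEMLAT, and whether LEMLAT then recognizes the normalized form has to be checked case by case.

```python
import unicodedata


def strip_diaeresis(form: str) -> str:
    """Remove the combining diaeresis (U+0308) from a word form.

    Works for both precomposed (ë) and decomposed (e + U+0308) input.
    """
    decomposed = unicodedata.normalize("NFD", form)
    without_mark = decomposed.replace("\u0308", "")
    return unicodedata.normalize("NFC", without_mark)


for form in ("coëmisse", "poëta", "nosse"):
    print(form, "->", strip_diaeresis(form))
```

The other categories in the list (elided -n, omitted -p-, joined words) are not covered by this step.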

fe-gen constraint (example: lemma "virus")

virus (u0803) -> the gender is not produced in the output.
The problem is as follows. The program has a constraint (on the "lessario") whereby a les with codles='fe' that is used to produce the lemma must not have any value in the 'gen' field (and this is indeed the current situation in the "lessario"). As a result, the gender is not written in the lemma output (it is, however, written correctly for the forms). If I add the gender in the 'gen' field on these rows, the form given on that row of the lessario is no longer analyzed.
This case arises in two possible situations:
(a) in a clem made up of more than one les: the pathological les has clem='v' and codles='fe';
(b) in a clem made up of a single les: the les is pathological if it has codles='fe'.
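
The two situations (a) and (b) can be written down as a small predicate, which may help when scanning the lessario for affected rows. This is only a sketch: a les is modeled as a plain dict with `clem` and `codles` keys (taken from the description above), and `clem_size`, the number of les entries in the same clem, is a hypothetical parameter. It does not reproduce LEMLAT's own code.

```python
def is_pathological(les: dict, clem_size: int) -> bool:
    """Return True if this les falls under case (a) or (b) of the fe-gen issue.

    les       -- one lessario row, modeled here as a dict (assumption)
    clem_size -- how many les entries make up the clem (assumption)
    """
    if les.get("codles") != "fe":
        return False
    if clem_size > 1:
        # Case (a): several les in the clem; only the one with clem='v' is affected.
        return les.get("clem") == "v"
    # Case (b): a single les in the clem with codles='fe'.
    return True


print(is_pathological({"clem": "v", "codles": "fe"}, clem_size=2))
print(is_pathological({"clem": "", "codles": "fe"}, clem_size=1))
```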

Difference in output file names and paths

I noticed that, launching LEMLAT on a file like this:

./lemlat_client -i <path>/input.txt -c output.csv

the list of unknown forms is saved under the name input.txt.unk, and not as output.csv.unk, as might be expected;
input.txt.unk is created in the location given by <path>, and not in the folder where LEMLAT is launched (which is where output.csv goes), again contrary to expectations.

Maybe it would help consistency if the unknown list got the name output.csv.unk and if both files were saved in the same location, either <path> or the local folder (option?).
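
Until the naming is changed in LEMLAT itself, the behaviour described above can be evened out after the run. The sketch below assumes only what this issue reports (the unknown list is written as the input file name plus .unk, next to the input file); the function names are hypothetical and nothing here calls LEMLAT.

```python
import shutil
from pathlib import Path


def default_unk_path(input_path: str) -> Path:
    """Where the issue reports the unknown list lands: <input>.unk next to the input."""
    p = Path(input_path)
    return p.with_name(p.name + ".unk")


def wanted_unk_path(output_path: str) -> Path:
    """Where the issue would prefer it: <output>.unk next to the output file."""
    p = Path(output_path)
    return p.with_name(p.name + ".unk")


def relocate_unk(input_path: str, output_path: str) -> Path:
    """Move <input>.unk to <output>.unk after a run and return the new path."""
    src = default_unk_path(input_path)
    dst = wanted_unk_path(output_path)
    if src.exists():
        shutil.move(str(src), str(dst))
    return dst
```

Calling `relocate_unk("<path>/input.txt", "output.csv")` after the command above would leave output.csv and output.csv.unk side by side in the local folder.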

Thank you!

Database tables in English

Hi, a "feature" for the future: translate the Italian names of the tables in the MySQL DB (e.g. lessario and lemmario) into English, for those who do not know Italian?

Grouping identical analyses into a single entry

Create a filter/function to group identical analyses into a single entry. For example, analyses 18 and 19 of forma (Du Cange) are identical:

============================ANALYSIS 18==================================

SEGMENTATION:	form -a

---------------------morphological feats 1 ----------------------------
--bfs--

Case:	Ablative
Gender:	Feminine
Number:	Singular
---------------------morphological feats 2 ----------------------------
--nfs--

Case:	Nominative
Gender:	Feminine
Number:	Singular
---------------------morphological feats 3 ----------------------------
--vfs--

Case:	Vocative
Gender:	Feminine
Number:	Singular
	============================LEMMA =================================
	forma                         N1   D68HA f
	-----------------------morphological feats-------------------------
	NcA

	PoS:	Noun
	Type:	Common
	Inflexional Category:	I decl
	-----------------------derivational info---------------------------
	IS DERIVED: NO

============================ANALYSIS 19==================================

SEGMENTATION:	form -a

---------------------morphological feats 1 ----------------------------
--bfs--

Case:	Ablative
Gender:	Feminine
Number:	Singular
---------------------morphological feats 2 ----------------------------
--nfs--

Case:	Nominative
Gender:	Feminine
Number:	Singular
---------------------morphological feats 3 ----------------------------
--vfs--

Case:	Vocative
Gender:	Feminine
Number:	Singular
	============================LEMMA =================================
	forma                         N1   D68HB f
	-----------------------morphological feats-------------------------
	NcA

	PoS:	Noun
	Type:	Common
	Inflexional Category:	I decl
	-----------------------derivational info---------------------------
	IS DERIVED: NO
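
In the example above, the two analyses differ only in the identifier on the lemma line (D68HA vs. D68HB). A grouping filter could therefore key each analysis on everything except that identifier. The following is a rough sketch of the idea, not LEMLAT code: the `Analysis` record and its field names are hypothetical, and reading the third column of the lemma line as a lemma identifier is an assumption.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    # Hypothetical record: field names are illustrative, not LEMLAT's own.
    segmentation: str
    feats: tuple       # inflectional codes, e.g. ("bfs", "nfs", "vfs")
    lemma: str
    codlem: str
    lemma_id: str      # assumed to be the column that differs (D68HA / D68HB)
    gender: str
    codmorf: str


def group_identical(analyses):
    """Merge analyses that differ only in lemma_id, keeping first-seen order.

    Returns a list of (analysis_without_id, [lemma_ids]) pairs.
    """
    groups = {}
    for a in analyses:
        key = a._replace(lemma_id="")
        groups.setdefault(key, []).append(a.lemma_id)
    return list(groups.items())


a18 = Analysis("form -a", ("bfs", "nfs", "vfs"), "forma", "N1", "D68HA", "f", "NcA")
a19 = a18._replace(lemma_id="D68HB")
for entry, ids in group_identical([a18, a19]):
    print(entry.lemma, entry.codlem, "/".join(ids))  # prints: forma N1 D68HA/D68HB
```

Keeping the list of identifiers on the merged entry means no information is lost when the two analyses are shown as one.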

Running on Windows 10 64-bit

Hello,
I have tried running LemLat3 on Windows 10 64-bit and cannot get it to work. I have even tried running it in compatibility mode with Windows 8 but it simply opens and about a second later closes. Is there a way to run it on the Windows 10 64-bit platform?

path length

Path length is fixed.
It must be 'freed'.

Error while processing some input strings (batch processing)

Three types of input cause an error ('segmentation fault') in batch mode:

strings containing the backslash character '\'
strings containing some non-ASCII characters (further investigation is needed to determine exactly which)
strings longer than 30 characters

A quick workaround is to filter the input file in advance (note that the filtered-out words would not be analyzed anyway), for example with this simple cascade of sed commands in bash:

LANG=C sed  's/\\/ /g' input_file  | sed -E  "s/[^\x00-\x7F]+/ /g" |\
 sed 's/[-_\/\$[:alnum:]]\{30,\}/ /g' > alt_input_file

Thanks to Enrique.
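
For setups where sed is not available, the same three substitutions can be sketched in Python. This mirrors the sed cascade above under the assumption that [:alnum:] is meant as ASCII letters and digits; the function name `sanitize_line` is hypothetical.

```python
import re

# Runs of non-ASCII characters, replaced by a single space.
_NON_ASCII = re.compile(r"[^\x00-\x7F]+")
# Runs of 30 or more word-like characters, as in the third sed expression.
_LONG_RUN = re.compile(r"[-_/$A-Za-z0-9]{30,}")


def sanitize_line(line: str) -> str:
    """Blank out the three kinds of input reported to crash batch mode."""
    line = line.replace("\\", " ")      # backslashes
    line = _NON_ASCII.sub(" ", line)    # non-ASCII runs
    return _LONG_RUN.sub(" ", line)     # over-long strings


def sanitize_file(src: str, dst: str) -> None:
    """Write a filtered copy of src to dst, line by line."""
    with open(src, encoding="utf-8", errors="replace") as fin, \
         open(dst, "w", encoding="ascii") as fout:
        for line in fin:
            fout.write(sanitize_line(line))
```

As with the sed version, the affected words are replaced by spaces rather than repaired, so they simply drop out of the analysis.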

Problem with XML output

When running LEMLAT3 locally from the command line (./lemlat -i q10.txt -x q10lemlat.xml; I was testing it on Quintilian's Inst. 10), the XML that gets produced is not well-formed, because some lemma attribute values contain either the single-quote (apostrophe) or the double-quote (") character, for example acci"do in the following fragment:

<Lemmas>
	<Lemma is_derived="false" lemma="acci"do" codmorf="VmH" codlem="V3" gen="" n_id="a0269"/>
</Lemmas>

Proposed solution: follow the W3C Extensible Markup Language (XML) 1.1 (Second Edition) recommendation:


To allow attribute values to contain both single and double quotes, 
the apostrophe or single-quote character (') may be represented 
as "&apos;", and the double-quote character (") as "&quot;".
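
As an illustration of the proposed escaping (not of LEMLAT's own output routine), here is a minimal sketch using Python's standard library. The helper name `xml_attr` is hypothetical; `xml.sax.saxutils.escape` is the standard-library function that handles &, < and >, and the extra mapping adds the two quote characters.

```python
from xml.sax.saxutils import escape


def xml_attr(value: str) -> str:
    """Escape a string for use inside a quoted XML attribute value."""
    return escape(value, {'"': "&quot;", "'": "&apos;"})


lemma = 'acci"do'
print(f'<Lemma lemma="{xml_attr(lemma)}"/>')  # prints: <Lemma lemma="acci&quot;do"/>
```

An XML parser reading this attribute back returns the original string acci"do, so no information is lost.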

Treatment of punctuation

I have noticed that punctuation marks apart from the hyphen - are not analyzed by LEMLAT, not even as unknown wordforms in the unk file (where "-" lands). However, when e.g. feeding a list of wordforms, one per line, to LEMLAT, it would be better to have the possibility to retrieve all of them in either of the two output files.

Also, LEMLAT automatically splits a string wherever a ' appears, creating two wordforms that are subsequently analyzed, without this being mentioned in the inline output message. There should be some option to change this behaviour and make LEMLAT analyze each token as it is. Since this also happens with ".", it is very relevant for the treatment of abbreviations, which are very often tokenized as "T." or "F." to distinguish them from occurrences of the isolated letters "T" or "F".

Command for selecting multiple lexical bases

Hi Paolo,
In this section:

Enter a wordform
OR one of the following commands:
	\h to show this HELP
	\q to QUIT
	\B select BASE LES source
	\O select ONOMASTICON LES source
	\D select DU CANGE LES source
	\A select ALL LES source
	\a to output in 'lemresult' FILE
	\d to output on SCREEN

Would it be possible to add a command that allows selecting combinations of lexical bases?

Provide the lemma DB raw data?

Hi there :)
I have come across the project multiple times, but one thing that troubles me is the inability to find the list of lemmas, say in a raw format like CSV/TSV. I think it would be pretty helpful to have access to this kind of list as a user of the application.

Cheers!

Incompatible with 64-bit Linux

The lemlat program, at least in the version contained in the linux_embedded.tgz archive, cannot be executed on 64-bit Linux systems. The 32-bit subsystem has to be installed.

For example, on Ubuntu, you first need to run:

sudo apt-get install libc6-i386 
sudo apt-get install ia32-libs

Typo in output

Missing 'd' in No Worforms

inpraesentiarum/impraesentiarum

i9917
The form "inpraesentiarum" is not analyzed.
The les is "impraesentiarum".
The code i04 is correctly present in a_gra, which handles the graphical alternation inp/imp.

Code

Is it possible to see the source code of LemLat? We are studying the MySQL DB and can see all the tables, but it would also be very useful to see how they interact through the code.

Anglicizing, part one

Hi Paolo,
Here are a few corrections to make to the English in LEMLAT's output.

Fix 1

The word LEMMI should be translated as LEMMAS. See the example below:

---------------------morphological feats   -----------------------------

LEMMI:
	============================LEMMA 1: IPO ==========================
	nae                           I    n0131
	============================LEMMA 2: IPER==========================
	ne                            I    n0131

Fix 2

If we want to be really picky, we are currently using a mix of American and British English. For example, the LEMLAT preamble reads: LEMLAT: latin morphological lemmatizer, with -z- in lemmatizer, but then there is -s- in analysed (see below). It would be better to make everything American and therefore write analyzed.

Input    wordform : puellarum
Analysed wordform : puellarum

Fix 3

In this section:

Enter a wordform
OR one of the following commands:
	\h to show this HELP
	\q to QUIT
	\B select BASE LES source
	\O select ONOMASTICON LES source
	\D select DU CANGE LES source
	\A select ALL LES source
	\a to output in 'lemresult' FILE
	\d to output on SCREEN

prepositions (to) and articles (the) are missing;
the LES is, in my opinion, not necessary.

I would rewrite it like this:

Enter a wordform
OR one of the following commands:
	\h to show this HELP guide
	\q to QUIT LEMLAT
	\B to search in the BASE source only
	\O to search in the ONOMASTICON source only
	\D to search in the DU CANGE source only
	\A to search in ALL sources
	\a to output results in a 'lemresult' FILE
	\d to output results on the SCREEN

@passarom: do you agree?
@gersh0m: Is it quicker for you to make the changes, or shall I do it? In that case, tell me where to make them. Thanks! :-)

Web version of LEMLAT

Create a web version of LEMLAT for users who would rather not install+run it locally.

Frequency filter/function

For homographs, it would be helpful to have a "frequency function" in LEMLAT to point users to the most probable/frequent lemma of a given form. For example, LEMLAT lemmatises the form id to is and et, the latter being much less frequent than the former.
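
One simple shape such a function could take is a lookup in a lemma frequency table followed by a sort. The sketch below is only an illustration of that idea: the function name `rank_lemmas` is hypothetical, and the counts are made-up placeholders, not corpus data.

```python
def rank_lemmas(candidates, freq):
    """Sort candidate lemmas by descending frequency.

    Lemmas missing from the table count as 0; ties keep their input order.
    """
    return sorted(candidates, key=lambda lemma: -freq.get(lemma, 0))


# Placeholder counts for illustration only; a real table would come from a corpus.
placeholder_freq = {"is": 100, "et": 1}
print(rank_lemmas(["et", "is"], placeholder_freq))  # prints: ['is', 'et']
```

Because the ranking is applied to LEMLAT's output rather than inside the analyzer, all analyses would still be available; only their order changes.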

Fatal error in defaults handling

Two days ago, after 4-5 months of inactivity, I cloned the latest version of LemLat from this repository, but this error prevents me from continuing. Do you have any suggestions?
Thanks!