proycon / pynlpl

PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build simple language models. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).

Home Page: https://pypi.python.org/pypi/PyNLPl

License: GNU General Public License v3.0

Python 99.72% Shell 0.17% C++ 0.11%

nlp python computational-linguistics linguistics library folia machine-learning language-modelling search-algorithms evaluation-metrics text-processing nlp-library natural-language-processing

pynlpl's Introduction

PyNLPl - Python Natural Language Processing Library

The library is divided into several packages and modules. It works on Python 2.7, as well as Python 3.

The following modules are available:

pynlpl.datatypes - Extra datatypes (priority queues, patterns, tries)
pynlpl.evaluation - Evaluation & experiment classes (parameter search, wrapped progressive sampling, class evaluation (precision/recall/f-score/auc), sampler, confusion matrix, multithreaded experiment pool)
pynlpl.formats.cgn - Module for parsing CGN (Corpus Gesproken Nederlands) part-of-speech tags
pynlpl.formats.folia - Extensive library for reading and manipulating the documents in FoLiA format (Format for Linguistic Annotation).
pynlpl.formats.fql - Extensive library for the FoLiA Query Language (FQL), built on top of pynlpl.formats.folia. FQL is currently documented here.
pynlpl.formats.cql - Parser for the Corpus Query Language (CQL), as also used by Corpus Workbench and Sketch Engine. Contains a convertor to FQL.
pynlpl.formats.giza - Module for reading GIZA++ word alignment data
pynlpl.formats.moses - Module for reading Moses phrase-translation tables.
pynlpl.formats.sonar - Largely obsolete module for pre-releases of the SoNaR corpus, use pynlpl.formats.folia instead.
pynlpl.formats.timbl - Module for reading Timbl output (consider using python-timbl instead though)
pynlpl.lm.lm - Module for a simple language model, plus a reader for ARPA language model data (as used by SRILM).
pynlpl.search - Various search algorithms (Breadth-first, depth-first, beam-search, hill climbing, A star, various variants of each)
pynlpl.statistics - Frequency lists, Levenshtein, common statistics and information theory functions
pynlpl.textprocessors - Simple tokeniser, n-gram extraction
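
To give a feel for the basic tasks mentioned above (n-gram extraction and frequency lists), here is a minimal standard-library sketch. It deliberately does not use PyNLPl's own API; the helper name `ngrams` and the sample sentence are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield every contiguous n-gram in `tokens` as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


# Hypothetical sample input, tokenised on whitespace.
tokens = "to be or not to be".split()

# A frequency list is simply a count per n-gram.
bigram_counts = Counter(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))
```

The modules in pynlpl.textprocessors and pynlpl.statistics provide their own classes for these tasks; consult the API documentation for their actual names and signatures.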

Installation

Download and install the latest stable version directly from the Python Package Index with pip install pynlpl (or pip3 for Python 3 on most systems). For global installations prepend sudo.

Alternatively, clone this repository and run python setup.py install (or python3 setup.py install for Python 3 on most systems). Prepend sudo for global installations.

This software may also be found in certain Linux distributions, such as the latest versions of Debian/Ubuntu, as python-pynlpl and python3-pynlpl. PyNLPl is also included in our LaMachine distribution.

Documentation

API Documentation can be found here.

pynlpl's Issues

No function in folia.py to retrieve all chunks from folia document (or paragraph or sentence)

Fix FQL test failure

Update package for debian

Split FoLiA library to independent repository

The FoLiA library (including the FQL library and the FoLiA Set library) is by far the most developed and largest part of PyNLPl. I think it merits being split off from PyNLPl into its own project, to make it clearer to the user what it does and to make versioning easier.

This is planned to coincide with upcoming FoLiA v1.6

[NameError] name 'loadsetdefinition' is not defined

Declaring a new annotation type got broken somehow:

  File "/scratch2/www/flat/env/lib/python3.4/site-packages/foliadocserve/foliadocserve.py", line 578, in query
    result =  query(doc,False,self.debug >= 2)
  File "/scratch2/www/flat/env/lib/python3.4/site-packages/pynlpl/formats/fql.py", line 1892, in __call__
    doc.declare(Class,decset,**defaults)
  File "/scratch2/www/flat/env/lib/python3.4/site-packages/pynlpl/formats/folia.py", line 6656, in declare
    self.setdefinitions[set] = loadsetdefinition(set) #will raise exception on error
[QUERY FAILED] FoLiA Error in henniebrugman/wolf016hist01_01: [NameError] name 'loadsetdefinition' is not defined

[folia] Allow underscore as first char in an NCName

Was not properly implemented or got broken again somehow.
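
For reference, the rule at stake can be sketched as a simple check. This is an ASCII-only approximation (the XML specification also admits many non-ASCII letters), and the names `NCNAME_ASCII` and `is_ncname_ascii` are hypothetical, not part of the FoLiA library.

```python
import re

# ASCII-only approximation of an XML NCName: the first character is a letter
# or underscore; later characters may also be digits, dots or hyphens.
# Colons are never allowed.
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_ncname_ascii(value):
    """Return True if `value` matches the ASCII-only NCName approximation."""
    return NCNAME_ASCII.fullmatch(value) is not None
```

Under this sketch, an identifier such as "_id1" is accepted, while "1id" and "a:b" are rejected.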

[folia] word.previous() violates constraint when given an explicit constraint like Sentence

I’ve encountered a problem with the usage of previous() in FoLiA.
As far as I can see, if the token is the first word of a sentence, previous() doesn’t return None; instead, it returns the last word of the previous sentence, and sometimes even a word before that.
Note that I explicitly pass the sentence as a constraint, like this: token.previous(folia.Word, [folia.Sentence])
This works well for next(): it returns None if the current token is the last word of a sentence. So “token.next(folia.Word, [folia.Sentence])” works well.
Here are my observations:
flooding . None
 --------------------
 Prev | Token | Next
 --------------------
 flooding There was
(Here ‘flooding’ is the last word in the first sentence, and ‘There’ is the first word in the second sentence.)

For your information, previous() sometimes returns a word even earlier than the last word of the previous sentence.
Here is the example I’ve seen:
two days of
days of intense
of intense rainfall
intense rainfall .
rainfall . None
--------------------
Prev | Token | Next
--------------------
days A very
A very slow-moving
very slow-moving lo
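
The behaviour expected in this report can be sketched without FoLiA at all. The snippet below is only an illustration of the intended semantics, assuming sentences are plain lists of tokens; the function names and sample data are hypothetical and are not the library's implementation.

```python
def previous_in_sentence(sentences, s_idx, w_idx):
    """Return the token before position w_idx in sentence s_idx,
    or None when the token is the first one in its sentence."""
    if w_idx == 0:
        return None  # never cross the sentence boundary
    return sentences[s_idx][w_idx - 1]


def next_in_sentence(sentences, s_idx, w_idx):
    """Return the token after position w_idx in sentence s_idx,
    or None when the token is the last one in its sentence."""
    sentence = sentences[s_idx]
    if w_idx == len(sentence) - 1:
        return None  # never cross the sentence boundary
    return sentence[w_idx + 1]


# Hypothetical two-sentence sample, loosely based on the tokens quoted above.
sentences = [["intense", "rainfall", "."], ["A", "very", "slow-moving"]]
```

With a sentence constraint, both directions should stop at the boundary: the first token of the second sentence has no previous token, just as the last token of the first sentence has no next token.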

Documentation build

Hey Maarten,

I have noticed a couple of issues when building the documentation for the package. I have prepared Pull Requests to fix some of them, and I would appreciate it if you could take a look at, comment on, or merge them.

1) ImportError traceback:

WARNING: [autosummary] failed to import u'pynlpl.formats.folia.AlignmentReference': no module named pynlpl.formats.folia.AlignmentReference

This happens because the class is actually named AlignReference (https://github.com/proycon/pynlpl/blob/master/formats/folia.py#L4385)

/builddir/build/BUILD/pynlpl-1.0.9/docs/lm.rst:8: WARNING: autodoc: failed to import module u'pynlpl.lm.srilm'; the following exception was raised:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/sphinx/ext/autodoc.py", line 519, in import_object
    __import__(self.modname)
  File "/builddir/build/BUILD/pynlpl/lm/srilm.py", line 21, in <module>
    import srilmcc
ImportError: No module named srilmcc

[The SRI Language Modeling Toolkit](http://www.speech.sri.com/projects/srilm/) is needed to use the lm.SRILM model. This is an undocumented dependency, and its absence needs to be handled when it is not configured.
PR: #20
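
One common way to handle an optional dependency of this kind is a guarded import. The sketch below shows the general pattern only; it is an assumption about the approach, not the change made in the PR, and the helper name `optional_import` is hypothetical.

```python
import importlib


def optional_import(name):
    """Import module `name` and return it, or return None if it is missing."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# Callers can then check for None and raise a clearer error (or skip the
# feature) instead of failing at import time.
json_module = optional_import("json")
missing_module = optional_import("hypothetical_module_that_is_not_installed")
```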

2) AttributeError traceback:

/builddir/build/BUILD/pynlpl-1.0.9/docs/_autosummary/pynlpl.formats.folia.AllowTokenAnnotation.rst:42: WARNING: autodoc: failed to import method u'AllowTokenAnnotation.__iter__' from module u'pynlpl.formats.folia'; the following exception was raised:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/sphinx/ext/autodoc.py", line 526, in import_object
    obj = self.get_attr(obj, part)
  File "/usr/lib/python2.7/site-packages/sphinx/ext/autodoc.py", line 422, in get_attr
    return safe_getattr(obj, name, *defargs)
  File "/usr/lib/python2.7/site-packages/sphinx/util/inspect.py", line 125, in safe_getattr
    raise AttributeError(name)
AttributeError: __iter__
/builddir/build/BUILD/pynlpl-1.0.9/docs/_autosummary/pynlpl.formats.folia.AllowTokenAnnotation.rst:43: WARNING: autodoc: failed to import method u'AllowTokenAnnotation.__len__' from module u'pynlpl.formats.folia'; the following exception was raised:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/sphinx/ext/autodoc.py", line 526, in import_object
    obj = self.get_attr(obj, part)
  File "/usr/lib/python2.7/site-packages/sphinx/ext/autodoc.py", line 422, in get_attr
    return safe_getattr(obj, name, *defargs)
  File "/usr/lib/python2.7/site-packages/sphinx/util/inspect.py", line 125, in safe_getattr
    raise AttributeError(name)
AttributeError: __len__

Happens because AllowTokenAnnotation does not have __len__ and __iter__.
PR: #21
3) Warnings

docstring of pynlpl.formats.folia.AbstractAnnotationLayer.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.AbstractElement:13: ERROR: Unexpected indentation.
docstring of pynlpl.formats.folia.AbstractElement:15: WARNING: Block quote ends without a blank line; unexpected unindent.
docstring of pynlpl.formats.folia.AbstractElement.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.AbstractSpanAnnotation.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.AbstractStructureElement.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.AbstractTokenAnnotation.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.ActorFeature.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.Alternative.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.AlternativeLayers.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.BegindatetimeFeature.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.Cell.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.Chunk.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.ChunkingLayer.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.CoreferenceChain.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.CoreferenceLayer.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.CoreferenceLink.parsexml:3: WARNING: Inline literal start-string without end-string.
....

I have fixed some of them in Pull Request #22.

Thank you,
Iryna

Enable proper confusion matrix in case of a dissimilarity between goals and observations

At the moment, the confusion matrix module in evaluation.py seems to model the matrix based on the categories that are seen in the observations. It would make more sense, however, if the matrix was made based on the category set in goals. For example, in case of two categories in the goals of which only one was predicted, the matrix is currently plotted as one-by-one instead of a 2x2 matrix with one cell valued '0'.
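
The proposal can be sketched as follows. This is a minimal standard-library illustration, not the code in evaluation.py; the function name and sample data are hypothetical. Note that, in this sketch, an observed category that never occurs in the goals would be left out of the matrix.

```python
from collections import Counter


def confusion_matrix(goals, observations):
    """Build a square matrix whose rows (goal) and columns (observation)
    are both indexed by the categories that occur in `goals`."""
    labels = sorted(set(goals))
    counts = Counter(zip(goals, observations))
    # Counter returns 0 for (goal, observation) pairs that were never seen.
    matrix = [[counts[(g, o)] for o in labels] for g in labels]
    return labels, matrix


# Two categories in the goals, but only "a" is ever predicted.
labels, matrix = confusion_matrix(["a", "b", "a"], ["a", "a", "a"])
```

Here the result stays a 2x2 matrix, with zero-valued cells for the category that was never predicted.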

port 8020 still valid?

"No connection could be made because the target machine actively refused it."

using port 8020 according to manual (import FrogClient from python 3.6)

clients.frogclient.FrogClient assumes no FoLiA output

Title says it all, the FrogClient assumes the Frog server responds with a column format. When the server is running with -X, the output from FrogClient.process is [].

[folia] Layer has an effect on text serialisation when last word has space="no"

Continuation of LanguageMachines/ucto#50: If the last word has space=no but a layer follows, then the serialisation or text validation checking ignores this space=no!
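
As background, the intended meaning of space="no" (no space follows the word) can be sketched like this. It is a simplified illustration on (text, space_after) pairs, not the FoLiA library's serialisation code; the function name and sample tokens are hypothetical.

```python
def serialise(words):
    """Join (text, space_after) pairs into a string, omitting the separator
    after every word whose space_after flag is False."""
    parts = []
    for text, space_after in words:
        parts.append(text)
        if space_after:
            parts.append(" ")
    # Drop the trailing separator left by the last word, if any.
    return "".join(parts).rstrip(" ")


# "Hello" and "world" carry the equivalent of space="no".
text = serialise([("Hello", False), (",", True), ("world", False), ("!", True)])
```

The point of the issue is that this flag on the last word should be honoured regardless of whether an annotation layer follows it.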

Regenerate and reupload API documentation

[FQL] APPEND/PREPEND for suggestion for insertion not working

APPEND (AS CORRECTION OF https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/spellingcorrection.foliaset.xml WITH class "missingpunctuation" annotator "puncrecase" annotatortype "auto" datetime now SUGGESTION (ADD w WITH text ",") ) FOR ID "untitled.p.1.s.7.w.6" RETURN nothing

Query appends it as first item in the sentence rather than the correct place.

Needed for Gecco.

Formats.Giza: IntersectionAlignment does not append the 0-0 alignment

The result of the intersection of the WordAlignment and the MultiWordAlignment does not contain the 0-0 alignment (it is always "None"). This happens because of line 297 in the code, which fails when revalignment[0] = 0:
if revalignment[i] and revalignment[i] in x:

Worked well for me with just if revalignment[i] in x:
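
The underlying cause is that 0 is falsy in Python, so a truthiness check also filters out a legitimate alignment to index 0. The sketch below reproduces the effect on hypothetical sample data; only the names `revalignment` and `x` come from the report.

```python
# Hypothetical sample data: index 0 aligns to 0, index 1 to 2, index 2 is unaligned.
revalignment = [0, 2, None]
x = [0, 2]

# Truthiness check as in the reported line: the alignment to 0 is skipped.
truthy_check = [i for i, a in enumerate(revalignment) if a and a in x]

# Membership only, as suggested in the report.
membership_only = [i for i, a in enumerate(revalignment) if a in x]

# An explicit None test keeps the original intent while admitting 0.
explicit_none = [i for i, a in enumerate(revalignment) if a is not None and a in x]
```

Both alternatives keep index 0; the explicit None test additionally documents that only missing alignments are meant to be skipped.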

update debian package for v1.2.5

[folia] Text validation error (normalisation problem between sentences?)

Continuation of LanguageMachines/ucto#35 by @JessedeDoes:

Expected: (space after "Mo" due to default sentence delimiter)

Versoek van het Zuyd-Hollandse Synode aan Haar Ho. Mo. , dat bij het inwilligen van een nieuw octroy de Compagnie een goede somme gelds soude contribueeren tot onderhoud van een Seminarium. Het getal der predikanten in Indiën a°. 1647 gebragt op ’t getal van 28. Verdeelinge van deselve (blz. 12).

Got: (No space after "Mo", word carries space="no")

Versoek van het Zuyd-Hollandse Synode aan Haar Ho. Mo., dat bij het inwilligen van een nieuw octroy de Compagnie een goede somme gelds soude contribueeren tot onderhoud van een Seminarium. Het getal der predikanten in Indiën a°. 1647 gebragt op ’t getal van 28. Verdeelinge van deselve (blz. 12).

This occurs in multiple places; the tests do not seem to cover it, however (hence it was only discovered now). Libfolia/folialint works fine too.

[folia] Finish deep validation

planned for FoLiA v1.4, related to issue proycon/folia#14

[folia] implement alias mechanism

See proycon/folia#31

Possible to request all Frog output from frog server?

Hi,
when using Frog (from the Docker Lamachine image) interactively, I get 9 columns.

However, when calling Frog with frogclient.process(....), I only get the first 4 columns:

Is this a limitation of the Frog server (which doesn't send the other columns, such as NER), or are the columns dropped on the client side? If so, is it possible to edit the script to include all columns?

Thank you Maarten, your software is very helpful for Dutch NLP!

$ foliavalidator Knaagtandje_.folia.xml
Validated successfully: Knaagtandje_.folia.xml

$ folialint Knaagtandje_.folia.xml
FAIL: XML error: attempt to add <t> with class=current to element: Knaagtandje_.id61 which already has a <t> with that class

This is an issue for the BASILEX corpus where multiple elements have been incorrectly put under paragraphs; curation needed.