proycon / pynlpl

PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build simple language models. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).

Home Page: https://pypi.python.org/pypi/PyNLPl

License: GNU General Public License v3.0

Python 99.72% Shell 0.17% C++ 0.11%
nlp python computational-linguistics linguistics library folia machine-learning language-modelling search-algorithms evaluation-metrics

pynlpl's Introduction

PyNLPl - Python Natural Language Processing Library

PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build simple language models. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).

The library is divided into several packages and modules. It works on Python 2.7, as well as Python 3.

The following modules are available:

  • pynlpl.datatypes - Extra datatypes (priority queues, patterns, tries)
  • pynlpl.evaluation - Evaluation & experiment classes (parameter search, wrapped progressive sampling, class evaluation (precision/recall/f-score/auc), sampler, confusion matrix, multithreaded experiment pool)
  • pynlpl.formats.cgn - Module for parsing CGN (Corpus Gesproken Nederlands) part-of-speech tags
  • pynlpl.formats.folia - Extensive library for reading and manipulating documents in the FoLiA format (Format for Linguistic Annotation).
  • pynlpl.formats.fql - Extensive library for the FoLiA Query Language (FQL), built on top of pynlpl.formats.folia. FQL is currently documented here.
  • pynlpl.formats.cql - Parser for the Corpus Query Language (CQL), as also used by Corpus Workbench and Sketch Engine. Contains a converter to FQL.
  • pynlpl.formats.giza - Module for reading GIZA++ word alignment data
  • pynlpl.formats.moses - Module for reading Moses phrase-translation tables.
  • pynlpl.formats.sonar - Largely obsolete module for pre-releases of the SoNaR corpus; use pynlpl.formats.folia instead.
  • pynlpl.formats.timbl - Module for reading Timbl output (consider using python-timbl instead, though)
  • pynlpl.lm.lm - Module for simple language models, plus a reader for ARPA language model data (as used by SRILM).
  • pynlpl.search - Various search algorithms (Breadth-first, depth-first, beam-search, hill climbing, A star, various variants of each)
  • pynlpl.statistics - Frequency lists, Levenshtein, common statistics and information theory functions
  • pynlpl.textprocessors - Simple tokeniser, n-gram extraction
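
To make the "n-grams and frequency lists" use case concrete, here is a minimal sketch in plain Python. It deliberately does not call the PyNLPl API; the function names (tokenise, ngrams, frequency_list) and the two-line corpus are assumptions made up for illustration, and the tokeniser is far cruder than a real one.

```python
from collections import Counter


def tokenise(text):
    """Very crude whitespace tokeniser (illustration only)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Yield every n-gram in tokens as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def frequency_list(lines, n=1):
    """Count how often each n-gram occurs across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(tokenise(line), n))
    return counts


# Hypothetical toy corpus
corpus = ["to be or not to be", "to be is to do"]
bigrams = frequency_list(corpus, n=2)
print(bigrams.most_common(3))
```

N-grams never cross line boundaries in this sketch, because each line is tokenised and windowed on its own.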

Installation

Download and install the latest stable version directly from the Python Package Index with pip install pynlpl (or pip3 for Python 3 on most systems). For global installations prepend sudo.

Alternatively, clone this repository and run python setup.py install (or python3 setup.py install for Python 3 on most systems). Prepend sudo for global installations.

This software may also be found in certain Linux distributions, such as the latest versions of Debian/Ubuntu, as python-pynlpl and python3-pynlpl. PyNLPl is also included in our LaMachine distribution.

Documentation

API Documentation can be found here.

pynlpl's People

Contributors

bazsi, fkunneman, irushchyshyn, kosloot, mhkuu, nsaphra, proycon, sfischer13, vanatteveldt

pynlpl's Issues

[folia] Validator does not complain when multiple t attributes of the same class are defined!

foliavalidator should raise an exception. folialint does show correct behaviour:

$ foliavalidator Knaagtandje_.folia.xml
Validated successfully: Knaagtandje_.folia.xml

$ folialint Knaagtandje_.folia.xml
FAIL: XML error: attempt to add <t> with class=current to element: Knaagtandje_.id61 which already has a <t> with that class

This is an issue for the BASILEX corpus where multiple elements have been incorrectly put under paragraphs; curation needed.
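
The check that is missing can be sketched with the standard library alone. This is not how foliavalidator or folialint is implemented; it is a hypothetical helper (duplicate_text_classes) that assumes the FoLiA namespace http://ilk.uvt.nl/folia and the default text class "current" seen in the folialint message above. The sample is a stripped-down fragment, not a complete valid FoLiA document.

```python
import xml.etree.ElementTree as ET
from collections import Counter

FOLIA_NS = "http://ilk.uvt.nl/folia"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def duplicate_text_classes(root):
    """Return (element id, class) pairs for elements that have more than
    one direct <t> child with the same class."""
    problems = []
    for element in root.iter():
        classes = Counter(
            child.get("class", "current")  # assumed default class
            for child in element
            if child.tag == "{%s}t" % FOLIA_NS
        )
        for cls, count in classes.items():
            if count > 1:
                problems.append((element.get(XML_ID), cls))
    return problems


# Hypothetical fragment: doc.p.1 carries two <t> of class "current"
sample = """<FoLiA xmlns="http://ilk.uvt.nl/folia">
  <text xml:id="doc.text">
    <p xml:id="doc.p.1">
      <t>first text</t>
      <t>second text</t>
    </p>
    <p xml:id="doc.p.2">
      <t>only text</t>
      <t class="original">older text</t>
    </p>
  </text>
</FoLiA>"""

problems = duplicate_text_classes(ET.fromstring(sample))
print(problems)
```

Only direct children are counted, so a <t> nested deeper (for example inside a word within the paragraph) does not clash with the paragraph's own text.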

[FQL] APPEND/PREPEND for suggestion for insertion not working

APPEND (AS CORRECTION OF https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/spellingcorrection.foliaset.xml WITH class "missingpunctuation" annotator "puncrecase" annotatortype "auto" datetime now SUGGESTION (ADD w WITH text ",") ) FOR ID "untitled.p.1.s.7.w.6" RETURN nothing

Query appends it as first item in the sentence rather than the correct place.

Needed for Gecco.

[folia] Text validation error (normalisation problem between sentences?)

Continuation of LanguageMachines/ucto#35 by @JessedeDoes:

Expected: (space after "Mo" due to default sentence delimiter)

Versoek van het Zuyd-Hollandse Synode aan Haar Ho. Mo. , dat bij het inwilligen van een nieuw octroy de Compagnie een goede somme gelds soude contribueeren tot onderhoud van een Seminarium. Het getal der predikanten in Indiën a°. 1647 gebragt op ’t getal van 28. Verdeelinge van deselve (blz. 12).

Got: (No space after "Mo", word carries space="no")

Versoek van het Zuyd-Hollandse Synode aan Haar Ho. Mo., dat bij het inwilligen van een nieuw octroy de Compagnie een goede somme gelds soude contribueeren tot onderhoud van een Seminarium. Het getal der predikanten in Indiën a°. 1647 gebragt op ’t getal van 28. Verdeelinge van deselve (blz. 12).

This occurs in multiple places; the tests do not seem to cover it, however, which is why it is only being discovered now. Libfolia/folialint works fine too.

Enable proper confusion matrix in case of a dissimilarity between goals and observations

At the moment, the confusion matrix module in evaluation.py seems to build the matrix from the categories that are seen in the observations. It would make more sense, however, if the matrix were built from the category set in the goals. For example, when the goals contain two categories of which only one was predicted, the matrix is currently plotted as one-by-one instead of as a 2x2 matrix with zero-valued cells for the category that was never predicted.
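
The requested behaviour can be sketched as follows. This is a standalone illustration, not the code in evaluation.py: the function name confusion_matrix and the toy labels are assumptions. The label set is taken from the goals first; any label that appears only in the observations is appended, which is an extra design choice of this sketch.

```python
from collections import Counter


def confusion_matrix(goals, observations):
    """Build a square matrix indexed [goal][observation].

    Labels come from the goals, followed by any labels that occur
    only in the observations.
    """
    labels = sorted(set(goals))
    labels += sorted(set(observations) - set(labels))
    counts = Counter(zip(goals, observations))
    matrix = [[counts[(g, o)] for o in labels] for g in labels]
    return labels, matrix


# Hypothetical data: two goal categories, only "pos" is ever predicted
goals = ["pos", "neg", "pos", "neg"]
observations = ["pos", "pos", "pos", "pos"]
labels, matrix = confusion_matrix(goals, observations)
print(labels)
print(matrix)
```

Because the rows and columns share one label list, the never-predicted category still gets its own column, filled with zeros.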

[folia] word.previous() violates constraint when given an explicit constraint like Sentence

I’ve run into a problem with previous() in FoLiA.
As far as I can see, if the token is the first word of a sentence, previous() does not return None; instead, it returns the last word of the previous sentence, and sometimes even a word before that.
I explicitly pass the sentence as a constraint, like this: token.previous(folia.Word, [folia.Sentence])
This works well for next(): it returns None if the current token is the last word of a sentence, so token.next(folia.Word, [folia.Sentence]) behaves correctly.
Here are my observations:
flooding . None
 --------------------
 Prev | Token | Next
 --------------------
 flooding There was
(Here ‘flooding’ is the last word in the first sentence, and ‘There’ is the first word in the second sentence.)
Note that previous() sometimes returns a word even earlier than the last word of the previous sentence.
Here is an example I’ve seen:
two days of
days of intense
of intense rainfall
intense rainfall .
rainfall . None
--------------------
Prev | Token | Next
--------------------
days A very
A very slow-moving
very slow-moving lo
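
The behaviour the reporter expects can be stated as a small standalone sketch. It does not use pynlpl.formats.folia; sentences are modelled as plain lists of tokens, and the helper name neighbours and the sample words are assumptions for illustration only.

```python
def neighbours(sentences, s, i):
    """Return (previous, token, next) for token i of sentence s.

    Neither neighbour ever crosses the sentence boundary: the first
    token has previous None and the last token has next None.
    """
    sentence = sentences[s]
    prev_token = sentence[i - 1] if i > 0 else None
    next_token = sentence[i + 1] if i + 1 < len(sentence) else None
    return prev_token, sentence[i], next_token


# Hypothetical two-sentence document
sentences = [
    ["Heavy", "flooding", "."],
    ["There", "was", "rain", "."],
]

for s, sentence in enumerate(sentences):
    for i in range(len(sentence)):
        print(neighbours(sentences, s, i))
```

With a sentence constraint, the row for "There" should therefore read None, There, was, mirroring what next() already does at the end of a sentence.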

Split FoLiA library to independent repository

The FoLiA library (including the FQL library and the FoLiA Set library) is by far the most developed and largest part of PyNLPl. I think it merits being split off from PyNLPl into its own project, to give users more clarity about what it does and to make versioning easier.

This is planned to coincide with the upcoming FoLiA v1.6.

Formats.Giza: IntersectionAlignment does not append the 0-0 alignment

The result of intersecting the WordAlignment and the MultiWordAlignment never contains the 0-0 alignment (it is always "None"). This happens because of line 297 in the code, where the first condition is false when revalignment[0] = 0:
if revalignment[i] and revalignment[i] in x:

It worked well for me with just: if revalignment[i] in x:
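
The underlying pitfall is that 0 is falsy in Python, so a bare truthiness test cannot tell "aligned to position 0" apart from "not aligned". The sketch below isolates that condition; the two helper functions and the sample lists are hypothetical and are not taken from pynlpl.formats.giza.

```python
def intersect_truthy(revalignment, x):
    """Uses the condition quoted above: an alignment to position 0 is
    dropped because 0 is falsy."""
    return [i for i in range(len(revalignment))
            if revalignment[i] and revalignment[i] in x]


def intersect_explicit(revalignment, x):
    """Tests for None explicitly, so an alignment to position 0 is kept."""
    return [i for i in range(len(revalignment))
            if revalignment[i] is not None and revalignment[i] in x]


revalignment = [0, 2, None]  # hypothetical reverse alignment
x = [0, 2]                   # hypothetical candidate positions

print(intersect_truthy(revalignment, x))
print(intersect_explicit(revalignment, x))
```

Dropping the first operand entirely, as suggested in the report, gives the same result as long as x never contains None; the explicit None test just makes that intent visible.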

Possible to request all Frog output from frog server?

Hi,
when using Frog (from the Docker LaMachine image) interactively, I get 9 columns.
However, when calling Frog with frogclient.process(....), I only get the first 4 columns.
Is this a limitation of the Frog server (which doesn't send columns such as NER), or are the columns dropped on the client side? If so, is it possible to edit the script to include all columns?

Thank you Maarten, your software is very helpful for Dutch NLP!

Documentation build

Hey Maarten,

I have noticed a couple of issues when building the documentation for the package. I have prepared Pull Requests to fix some of them, and I would appreciate it if you could take a look at, comment on, or merge them.

  1. ImportError traceback:
WARNING: [autosummary] failed to import u'pynlpl.formats.folia.AlignmentReference': no module named pynlpl.formats.folia.AlignmentReference

This happens because the class is actually named AlignReference (https://github.com/proycon/pynlpl/blob/master/formats/folia.py#L4385)

/builddir/build/BUILD/pynlpl-1.0.9/docs/lm.rst:8: WARNING: autodoc: failed to import module u'pynlpl.lm.srilm'; the following exception was raised:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/sphinx/ext/autodoc.py", line 519, in import_object
    __import__(self.modname)
  File "/builddir/build/BUILD/pynlpl/lm/srilm.py", line 21, in <module>
    import srilmcc
ImportError: No module named srilmcc

The SRI Language Modeling Toolkit (http://www.speech.sri.com/projects/srilm/) is needed to use the lm.SRILM model. This is an undocumented dependency, and its absence needs to be handled when it is not installed.
PR: #20

  2. AttributeError traceback
/builddir/build/BUILD/pynlpl-1.0.9/docs/_autosummary/pynlpl.formats.folia.AllowTokenAnnotation.rst:42: WARNING: autodoc: failed to import method u'AllowTokenAnnotation.__iter__' from module u'pynlpl.formats.folia'; the following exception was raised:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/sphinx/ext/autodoc.py", line 526, in import_object
    obj = self.get_attr(obj, part)
  File "/usr/lib/python2.7/site-packages/sphinx/ext/autodoc.py", line 422, in get_attr
    return safe_getattr(obj, name, *defargs)
  File "/usr/lib/python2.7/site-packages/sphinx/util/inspect.py", line 125, in safe_getattr
    raise AttributeError(name)
AttributeError: __iter__
/builddir/build/BUILD/pynlpl-1.0.9/docs/_autosummary/pynlpl.formats.folia.AllowTokenAnnotation.rst:43: WARNING: autodoc: failed to import method u'AllowTokenAnnotation.__len__' from module u'pynlpl.formats.folia'; the following exception was raised:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/sphinx/ext/autodoc.py", line 526, in import_object
    obj = self.get_attr(obj, part)
  File "/usr/lib/python2.7/site-packages/sphinx/ext/autodoc.py", line 422, in get_attr
    return safe_getattr(obj, name, *defargs)
  File "/usr/lib/python2.7/site-packages/sphinx/util/inspect.py", line 125, in safe_getattr
    raise AttributeError(name)
AttributeError: __len__

This happens because AllowTokenAnnotation does not have __len__ and __iter__.
PR: #21

  3. Warnings

docstring of pynlpl.formats.folia.AbstractAnnotationLayer.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.AbstractElement:13: ERROR: Unexpected indentation.
docstring of pynlpl.formats.folia.AbstractElement:15: WARNING: Block quote ends without a blank line; unexpected unindent.
docstring of pynlpl.formats.folia.AbstractElement.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.AbstractSpanAnnotation.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.AbstractStructureElement.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.AbstractTokenAnnotation.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.ActorFeature.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.Alternative.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.AlternativeLayers.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.BegindatetimeFeature.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.Cell.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.Chunk.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.ChunkingLayer.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.CoreferenceChain.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.CoreferenceLayer.parsexml:3: WARNING: Inline literal start-string without end-string.
docstring of pynlpl.formats.folia.CoreferenceLink.parsexml:3: WARNING: Inline literal start-string without end-string.
....

I have fixed some of them in Pull Request #22.
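
For the missing srilmcc module reported in the ImportError item above, one common way to handle an optional compiled dependency is to guard the import and fail only when the feature is actually used. This is a generic sketch and not necessarily what PR #20 does; SRILMUnavailable and require_srilm are hypothetical names.

```python
try:
    import srilmcc  # optional compiled SRILM binding
except ImportError:
    srilmcc = None  # importing this module still succeeds without SRILM


class SRILMUnavailable(RuntimeError):
    """Raised when SRILM functionality is requested but not installed."""


def require_srilm():
    """Return the srilmcc module, or raise a descriptive error."""
    if srilmcc is None:
        raise SRILMUnavailable(
            "The srilmcc extension is not available; "
            "install SRILM and its Python binding to use this feature."
        )
    return srilmcc
```

With a guard like this, documentation tools can import the module without the binding being present, and users without SRILM get a clear message instead of a bare ImportError.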

Thank you,
Iryna

port 8020 still valid?

"No connection could be made because the target machine actively refused it."

I am using port 8020 as described in the manual (importing FrogClient in Python 3.6).

[NameError] name 'loadsetdefinition' is not defined

Declaring a new annotation type got broken somehow:

  File "/scratch2/www/flat/env/lib/python3.4/site-packages/foliadocserve/foliadocserve.py", line 578, in query
    result =  query(doc,False,self.debug >= 2)
  File "/scratch2/www/flat/env/lib/python3.4/site-packages/pynlpl/formats/fql.py", line 1892, in __call__
    doc.declare(Class,decset,**defaults)
  File "/scratch2/www/flat/env/lib/python3.4/site-packages/pynlpl/formats/folia.py", line 6656, in declare
    self.setdefinitions[set] = loadsetdefinition(set) #will raise exception on error
[QUERY FAILED] FoLiA Error in henniebrugman/wolf016hist01_01: [NameError] name 'loadsetdefinition' is not defined
