
morphosyntactic_parser_nl's Introduction

morphosyntactic_parser_nl

Wrapper around the Dutch Alpino parser. It takes as input a text/NAF/KAF file containing either raw text or tokens (produced by a tokeniser and sentence splitter) and generates the term layer (lemmas and rich morphological information), the constituency layer, and the dependency layer.

Requirements and installation

There are two dependencies: the Alpino parser and the KafNafParserPy library for parsing NAF/KAF objects.

Step 1. For the Alpino parser you have two choices.

  1. For a local install, visit the Alpino homepage and follow the instructions to get Alpino installed, or run install_alpino.sh. Make sure to set ALPINO_HOME to point to the installation.
  2. To use an Alpino server instance (e.g. through alpino-docker), point ALPINO_SERVER to the HTTP address of the server (e.g. ALPINO_SERVER=http://localhost:5002).

Step 2. The KafNafParserPy library can be installed through pip or from GitHub.
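For example, installing from PyPI (this assumes the package is published under the same name as the library):

pip install KafNafParserPy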

Once you have completed the previous two steps, clone this repository to your machine. If you use a local Alpino installation, tell the library where Alpino is installed by setting the environment variable ALPINO_HOME to the correct path on your machine:

export ALPINO_HOME=/home/a/b/c/Alpino

Usage

The simplest way to call the parser is to run the script run_parser.sh, found in the root folder of the repository. It reads a NAF/KAF file from standard input and writes the resulting NAF/KAF file to standard output. The subfolder examples contains two example input files with the corresponding expected output files. From the root folder of the repository you can run:

cat examples/file1.in.kaf | run_parser.sh > my_output.kaf

The result in my_output.kaf should be the same as the file examples/file1.out.kaf (except for the timestamps).

You can specify the maximum number of minutes that Alpino may spend parsing each sentence. Sentences that take longer than this value are skipped, and no term, constituency, or dependency information will be generated for the tokens of those sentences. The parameter is -t or --time.
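For example, to skip sentences that take longer than two minutes, you can call the Python script directly (a sketch; it assumes the script reads NAF/KAF from standard input and writes to standard output, as run_parser.sh does):

cat examples/file1.in.kaf | python core/morph_syn_parser.py -t 2 > my_output.kaf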

You can get the full description of the parameters by calling python core/morph_syn_parser.py -h, which prints the following:

usage: morph_syn_parser.py [-h] [-v] [-t MAX_MINUTES]

Morphosyntactic parser based on Alpino

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -t MAX_MINUTES, --time MAX_MINUTES
                        Maximum number of minutes per sentence. Sentences that
                        take longer will be skipped and not parsed (value must
                        be a float)

If you want to use this library from another Python module, you can import the main function and reuse it in your own scripts. The function is called run_morph_syn_parser and is located in core/morph_syn_parser.py. It takes two parameters, an input and an output, which can be file names (strings), open file descriptors, or streams.
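A minimal sketch of calling it from another script, assuming the repository root is on the Python path and Alpino is configured via ALPINO_HOME or ALPINO_SERVER (the text/binary mode of the streams is an assumption):

from core.morph_syn_parser import run_morph_syn_parser

# File names work directly:
run_morph_syn_parser("examples/file1.in.kaf", "my_output.kaf")

# Open file objects (or other streams) work as well:
with open("examples/file1.in.kaf") as infile, open("my_output.kaf", "w") as outfile:
    run_morph_syn_parser(infile, outfile)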

Contact

morphosyntactic_parser_nl's People

Contributors

rubenizquierdo, vanatteveldt, sarnoult, paulhuygen, antske


morphosyntactic_parser_nl's Issues

pipe symbols (|) not escaped in input

Sentences containing a pipe symbol / vertical bar (|) are not processed correctly. Alpino uses this character to separate line ids from sentences, so if a sentence contains a pipe the left-hand side is treated as an id, causing the file 1.xml not to be found, and the parse is not included in the output:

(newsreader-env)wva@study-linux: {master} ~/newsreader_pipe_nl$ echo "Hallo daar| doeg" |  java -jar $MDIR/ixa-pipe-tok/target/ixa-pipe-tok-1.8.4.jar tok -l nl | $MDIR/morphosyntactic_parser_nl/run_parser.sh
CLI options: Namespace(normalize=default, notok=false, inputkaf=false, offsets=true, outputFormat=naf, hardParagraph=no, untokenizable=no, lang=nl, kafversion=v1.naf)
ixa-pipe-tok tokenized 4 tokens at 1908.31 tokens per second.
Calling to Alpino at /data/wva/newsreader_pipe_nl/tools/Alpino with 1 sentences...
hdrug: process 14878 on host study-linux (datime(2016,7,17,13,34,17))
[doeg]
Q#Hallo daar \|doeg|1|1|0.749662063
Not found the file /tmp/tmpOfK5Ti/1.xml

This results in the following output (without terms):

<?xml version='1.0' encoding='UTF-8'?>
<NAF xml:lang="nl" version="v1.naf">
  <nafHeader>
    <linguisticProcessors layer="text">
      <lp name="ixa-pipe-tok-nl" beginTimestamp="2016-07-17T13:34:17+0200" endTimestamp="2016-07-17T13:34:17+0200" version="1.8.4-9bb9cddd179cbd489b085776417cd8f1b8a4b10a" hostname="study-linux"/>
    </linguisticProcessors>
    <linguisticProcessors layer="terms">
      <lp name="Morphosyntactic parser based on Alpino" version="0.2_22sept2015" timestamp="2016-07-17T13:34:18CEST" beginTimestamp="2016-07-17T13:34:18CEST" endTimestamp="2016-07-17T13:34:18CEST" hostname="study-linux"/>
    </linguisticProcessors>
    <linguisticProcessors layer="constituents">
      <lp name="Morphosyntactic parser based on Alpino" version="0.2_22sept2015" timestamp="2016-07-17T13:34:18CEST" beginTimestamp="2016-07-17T13:34:18CEST" endTimestamp="2016-07-17T13:34:18CEST" hostname="study-linux"/>
    </linguisticProcessors>
    <linguisticProcessors layer="deps">
      <lp name="Morphosyntactic parser based on Alpino" version="0.2_22sept2015" timestamp="2016-07-17T13:34:18CEST" beginTimestamp="2016-07-17T13:34:18CEST" endTimestamp="2016-07-17T13:34:18CEST" hostname="study-linux"/>
    </linguisticProcessors>
  </nafHeader>
  <text>
    <wf id="w1" offset="0" length="5" sent="1" para="1">Hallo</wf>
    <wf id="w2" offset="6" length="4" sent="1" para="1">daar</wf>
    <wf id="w3" offset="10" length="1" sent="1" para="1">|</wf>
    <wf id="w4" offset="12" length="4" sent="1" para="1">doeg</wf>
  </text>
</NAF>

Note that this is not an error condition, so the "not found the file" does not raise an exception (which it probably should?)

$ echo $?
0

(this seems to be the root cause of ixa-ehu/ixa-pipe-nerc#11)
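A minimal sketch of the check suggested above, with hypothetical function and variable names (the actual code structure of the repository may differ):

import os

def load_parse(tmp_dir, sentence_id):
    # Fail loudly instead of silently skipping the sentence when Alpino
    # did not produce the expected per-sentence parse file.
    path = os.path.join(tmp_dir, "{}.xml".format(sentence_id))
    if not os.path.exists(path):
        raise RuntimeError("No Alpino parse for sentence {}: {} is missing"
                           .format(sentence_id, path))
    return path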

escaping long '--' sequences in comments

Character escaping for comments (in alpino_dependency.py and convert_penn_to_kaf.py) currently replaces '--' by '-'. This leads to a ValueError with lxml.etree when documents contain longer dash sequences, e.g., '------'.
Perhaps we could use '&ndash;' as a replacement for '--'?
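A minimal sketch (not the repository's actual code) showing why a single replace is not enough and how collapsing whole dash runs avoids the ValueError:

import re
from lxml import etree

text = "head ------ tail"
naive = text.replace("--", "-")     # "head --- tail": still contains "--"
safe = re.sub(r"-{2,}", "-", text)  # "head - tail"
comment = etree.Comment(safe)       # etree.Comment(naive) would raise ValueError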

unicode in comments

The parser creates comments for the relations to make it easier to trace them. If these comments contain unusual unicode, however, the Java parser chokes on them (see https://bugs.openjdk.java.net/browse/JDK-8072081)

Although this is not strictly a problem caused by the parser (as it is a Java bug triggered by the IXA NERC module), I think the easiest solution is to strip or escape "strange" unicode characters in the parser step.

Session and example files:
https://gist.github.com/vanatteveldt/6492fc3b97ba6f2a87c81462c71fe8a2
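A minimal sketch of such stripping; the exact set of characters to remove is an assumption (here everything outside the XML 1.0 character ranges of the Basic Multilingual Plane is dropped):

import re

# Keep tab, newline, carriage return and the BMP characters that are valid
# in XML 1.0; drop everything else (control characters, surrogates, etc.).
STRANGE_CHARS = re.compile(u'[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD]')

def clean_comment(text):
    return STRANGE_CHARS.sub(u'', text)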

create setup.py and push to pip

I want to use this module in my nlpipe tool, so it would be nice if this could be pushed to pip.

To create a setup.py, it would make sense to rename the 'core' folder to 'morphosyntactic_parser_nl' and move the driver (__name__ == '__main__') code to __main__.py.

I'll happily make those changes and add a setup.py, but first I wanted to know whether you agree and whether you are willing to push it to pip when that is done?

(cc @antske )
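A minimal setup.py sketch for the proposal above; the version number, description and dependency list are assumptions rather than the project's actual metadata:

from setuptools import setup, find_packages

setup(
    name="morphosyntactic_parser_nl",
    version="0.2",
    description="Wrapper around the Dutch Alpino parser producing NAF/KAF "
                "term, constituency and dependency layers",
    packages=find_packages(),
    install_requires=["KafNafParserPy"],
)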
