
morphosyntactic_parser_nl's Introduction

morphosyntactic_parser_nl

Wrapper around the Dutch Alpino parser. It takes as input a text/NAF/KAF file containing either raw text or tokens (produced by a tokeniser and sentence splitter) and generates the term layer (lemmas and rich morphological information), the constituency layer, and the dependency layer.

Requirements and installation

There are two dependencies: the Alpino parser and the KafNafParserPy library for parsing NAF/KAF objects.

Step 1. For the Alpino parser you have two choices.

  1. For a local install, visit the Alpino homepage and follow the instructions to get Alpino installed, or run install_alpino.sh. Make sure to set ALPINO_HOME to point to the installation.
  2. To use an Alpino server instance (e.g. through alpino-docker), point ALPINO_SERVER to the HTTP address of the server (e.g. ALPINO_SERVER=http://localhost:5002).

Step 2. The KafNafParserPy library can be installed through pip or from GitHub.
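For example, installing from PyPI (this assumes the package is published under the same name as the library):

pip install KafNafParserPy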

Once you have completed the previous two steps, clone this repository to your machine. If you use a local Alpino installation, tell the library where Alpino is installed by setting the environment variable ALPINO_HOME to the correct path on your machine:

export ALPINO_HOME=/home/a/b/c/Alpino

Usage

The simplest way to call the parser is to run the script run_parser.sh, found in the root folder of the repository. It reads a NAF/KAF file from standard input and writes the resulting NAF/KAF file to standard output. The subfolder examples contains two example input files with the corresponding expected output files. From the root folder of the repository you can run:

cat examples/file1.in.kaf | run_parser.sh > my_output.kaf

The result in my_output.kaf should be the same as the file examples/file1.out.kaf (except for the timestamps).

You can specify the maximum number of minutes that Alpino may spend parsing each sentence. Sentences that take longer than this value are skipped, and no term, constituency, or dependency information will be generated for the tokens of those sentences. The parameter is -t or --time.
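For example, to skip sentences that take longer than two minutes, you can call the Python script directly (a sketch; it assumes the script reads NAF/KAF from standard input and writes to standard output, as run_parser.sh does):

cat examples/file1.in.kaf | python core/morph_syn_parser.py -t 2 > my_output.kaf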

You can get the full description of the parameters by calling python core/morph_syn_parser.py -h, which prints the following:

usage: morph_syn_parser.py [-h] [-v] [-t MAX_MINUTES]

Morphosyntactic parser based on Alpino

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -t MAX_MINUTES, --time MAX_MINUTES
                        Maximum number of minutes per sentence. Sentences that
                        take longer will be skipped and not parsed (value must
                        be a float)

If you want to use this library from another Python module, you can import the main function and reuse it in your own scripts. The function is called run_morph_syn_parser and is located in core/morph_syn_parser.py. It takes two parameters, an input and an output, which can be file names (strings), open file descriptors, or streams.
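A minimal sketch of calling it from another script, assuming the repository root is on the Python path and Alpino is configured via ALPINO_HOME or ALPINO_SERVER (the text/binary mode of the streams is an assumption):

from core.morph_syn_parser import run_morph_syn_parser

# File names work directly:
run_morph_syn_parser("examples/file1.in.kaf", "my_output.kaf")

# Open file objects (or other streams) work as well:
with open("examples/file1.in.kaf") as infile, open("my_output.kaf", "w") as outfile:
    run_morph_syn_parser(infile, outfile)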

Contact

morphosyntactic_parser_nl's People

Contributors

rubenizquierdo, vanatteveldt, sarnoult, paulhuygen, antske


morphosyntactic_parser_nl's Issues

pipe symbols (|) not escaped in input

Sentences containing a pipe symbol / vertical bar (|) are not processed correctly. Alpino uses this character to separate line ids from sentences, so if a sentence contains a pipe the left-hand side is treated as an id, causing the file 1.xml not to be found, and the parse is not included in the output:

(newsreader-env)wva@study-linux: {master} ~/newsreader_pipe_nl$ echo "Hallo daar| doeg" |  java -jar $MDIR/ixa-pipe-tok/target/ixa-pipe-tok-1.8.4.jar tok -l nl | $MDIR/morphosyntactic_parser_nl/run_parser.sh
CLI options: Namespace(normalize=default, notok=false, inputkaf=false, offsets=true, outputFormat=naf, hardParagraph=no, untokenizable=no, lang=nl, kafversion=v1.naf)
ixa-pipe-tok tokenized 4 tokens at 1908.31 tokens per second.
Calling to Alpino at /data/wva/newsreader_pipe_nl/tools/Alpino with 1 sentences...
hdrug: process 14878 on host study-linux (datime(2016,7,17,13,34,17))
[doeg]
Q#Hallo daar \|doeg|1|1|0.749662063
Not found the file /tmp/tmpOfK5Ti/1.xml

This results in the following output (without terms):

<?xml version='1.0' encoding='UTF-8'?>
<NAF xml:lang="nl" version="v1.naf">
  <nafHeader>
    <linguisticProcessors layer="text">
      <lp name="ixa-pipe-tok-nl" beginTimestamp="2016-07-17T13:34:17+0200" endTimestamp="2016-07-17T13:34:17+0200" version="1.8.4-9bb9cddd179cbd489b085776417cd8f1b8a4b10a" hostname="study-linux"/>
    </linguisticProcessors>
    <linguisticProcessors layer="terms">
      <lp name="Morphosyntactic parser based on Alpino" version="0.2_22sept2015" timestamp="2016-07-17T13:34:18CEST" beginTimestamp="2016-07-17T13:34:18CEST" endTimestamp="2016-07-17T13:34:18CEST" hostname="study-linux"/>
    </linguisticProcessors>
    <linguisticProcessors layer="constituents">
      <lp name="Morphosyntactic parser based on Alpino" version="0.2_22sept2015" timestamp="2016-07-17T13:34:18CEST" beginTimestamp="2016-07-17T13:34:18CEST" endTimestamp="2016-07-17T13:34:18CEST" hostname="study-linux"/>
    </linguisticProcessors>
    <linguisticProcessors layer="deps">
      <lp name="Morphosyntactic parser based on Alpino" version="0.2_22sept2015" timestamp="2016-07-17T13:34:18CEST" beginTimestamp="2016-07-17T13:34:18CEST" endTimestamp="2016-07-17T13:34:18CEST" hostname="study-linux"/>
    </linguisticProcessors>
  </nafHeader>
  <text>
    <wf id="w1" offset="0" length="5" sent="1" para="1">Hallo</wf>
    <wf id="w2" offset="6" length="4" sent="1" para="1">daar</wf>
    <wf id="w3" offset="10" length="1" sent="1" para="1">|</wf>
    <wf id="w4" offset="12" length="4" sent="1" para="1">doeg</wf>
  </text>
</NAF>

Note that this is not an error condition, so the "not found the file" does not raise an exception (which it probably should?)

$ echo $?
0

(this seems to be the root cause of ixa-ehu/ixa-pipe-nerc#11)
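A minimal sketch of the check suggested above, with hypothetical function and variable names (the actual code structure of the repository may differ):

import os

def load_parse(tmp_dir, sentence_id):
    # Fail loudly instead of silently skipping the sentence when Alpino
    # did not produce the expected per-sentence parse file.
    path = os.path.join(tmp_dir, "{}.xml".format(sentence_id))
    if not os.path.exists(path):
        raise RuntimeError("No Alpino parse for sentence {}: {} is missing"
                           .format(sentence_id, path))
    return path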

escaping long '--' sequences in comments

Character escaping for comments (in alpino_dependency.py and convert_penn_to_kaf.py) currently replaces '--' by '-'. This leads to a ValueError with lxml.etree when documents contain longer dash sequences, e.g., '------'.
Perhaps we could use '&ndash;' as a replacement for '--'?
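A minimal sketch (not the repository's actual code) showing why a single replace is not enough and how collapsing whole dash runs avoids the ValueError:

import re
from lxml import etree

text = "head ------ tail"
naive = text.replace("--", "-")     # "head --- tail": still contains "--"
safe = re.sub(r"-{2,}", "-", text)  # "head - tail"
comment = etree.Comment(safe)       # etree.Comment(naive) would raise ValueError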

unicode in comments

The parser creates comments for the relations to make it easier to trace them. If these comments contain unusual unicode, however, the Java parser chokes on them (see https://bugs.openjdk.java.net/browse/JDK-8072081)

Although this is not strictly a problem caused by the parser (as it is a Java bug triggered by the IXA NERC module), I think the easiest solution is to strip or escape "strange" unicode characters in the parser step.

Session and example files:
https://gist.github.com/vanatteveldt/6492fc3b97ba6f2a87c81462c71fe8a2
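A minimal sketch of such stripping; the exact set of characters to remove is an assumption (here everything outside the XML 1.0 character ranges of the Basic Multilingual Plane is dropped):

import re

# Keep tab, newline, carriage return and the BMP characters that are valid
# in XML 1.0; drop everything else (control characters, surrogates, etc.).
STRANGE_CHARS = re.compile(u'[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD]')

def clean_comment(text):
    return STRANGE_CHARS.sub(u'', text)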

create setup.py and push to pip

I want to use this module in my nlpipe tool, so it would be nice if this could be pushed to pip.

To create a setup.py, it would make sense to rename the 'core' folder to 'morphosyntactic_parser_nl' and move the driver (__name__ == '__main__') code to __main__.py.

I'll happily make those changes and add a setup.py, but first I wanted to know whether you agree and whether you are willing to push it to pip when that is done?

(cc @antske )
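A minimal setup.py sketch for the proposal above; the version number, description and dependency list are assumptions rather than the project's actual metadata:

from setuptools import setup, find_packages

setup(
    name="morphosyntactic_parser_nl",
    version="0.2",
    description="Wrapper around the Dutch Alpino parser producing NAF/KAF "
                "term, constituency and dependency layers",
    packages=find_packages(),
    install_requires=["KafNafParserPy"],
)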
