
jigg's People

Contributors

fyamamoto10, hiroshinoji, pecorarista, sakabar, shigetowatanabe-tome, ttakayuki


jigg's Issues

Stanford dependency conversion annotator

Recently I found that the latest constituency -> dependency conversion in Stanford CoreNLP that outputs Universal Dependencies is very robust and works pretty well even when the empty category and auxiliary tag information are stripped off (the accuracy is above 99.6). This suggests we can reliably obtain UD dependencies from the output of other constituency parsers, such as the Berkeley parser. I will implement a new annotator that does just this.
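
A minimal sketch of the planned annotator's core, assuming a CoreNLP 3.6-era API on the classpath (the input tree string is a toy example; a real annotator would take the tree from the upstream parser's annotation):

import edu.stanford.nlp.trees.{Tree, UniversalEnglishGrammaticalStructure}
import scala.collection.JavaConverters._

// a constituency tree from any parser (e.g., Berkeley), read as a CoreNLP Tree
val tree = Tree.valueOf("(ROOT (S (NP (DT The) (NN cat)) (VP (VBZ sleeps))))")
// convert to Universal Dependencies and read off the typed dependencies
val gs = new UniversalEnglishGrammaticalStructure(tree)
for (dep <- gs.typedDependencies.asScala)
  println(s"${dep.reln}(${dep.gov.word}, ${dep.dep.word})")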

pipeline_server.py: TimeOut Error

There seems to be a bug when trying to submit long pieces of text to the pipeline_server.

To replicate the error, initiate the server, then submit the following text:

" 【ニューヨーク=清水石珠実】「世紀の合併」と言われながら失敗した米メディア大手タイムワーナーとネット大手AOLの統合から15年。そのタイムワーナーを通信大手AT&Tが買収することで、米国で業種を超えたメディア再編がまた動き出した。背景にはスマートフォン(スマホ)を通じたインターネット動画の普及がある。通信とメディア業界の境界線はこれまで以上に薄まっている。  ネット動画を巡る異業種間合併の先駆けは2011年の米ケーブルテレビ最大手コムキャストによるNBCユニバーサルの買収だ。ネット配信するコンテンツの拡充が目的で、コムキャストは今年6月にも米アニメ制作大手のドリームワークス・アニメーションを傘下に収めた。  AT&Tが狙うのもコムキャスト型のビジネスモデルで、垂直統合による複合メディア化は両社を軸に進む。  AT&Tは15年に米プロフットボールリーグ(NFL)など人気番組の放映権を持つ衛星テレビ大手のディレクTVを485億ドル(約5兆円)で買収。電話契約者の頭打ちが予想される中、総売上高の3割強を占めるエンターテインメント事業の拡充を進めてきた。  ランドール・スティーブンソン最高経営責任者(CEO)は「世界的に優れたコンテンツを映画、テレビ、モバイル端末すべてで提供できるようになる」と買収の意義を述べた。スマホの付加価値を高め動画がもたらす広告収入も増やす考え。  タイムワーナーも通信大手と組む事情がある。近年はネットフリックスやアマゾン・ドット・コムなど定額課金でネット動画の配信を扱うシリコンバレーの新興企業が台頭。こうした企業が独自番組の配信も始めた結果、ケーブルテレビの契約が減っている。多様化する番組配信への対応は急務だった。  さらに20年ごろには「5G」と呼ばれる毎秒10ギガ(ギガは10億)ビットの通信速度を持つ携帯電話の無線通信規格が実用化される見込み。ネット動画配信の市場が拡大するのは確実で、買収を後押ししたとみられる。  ただ課題もある。タイムワーナーは01年にインターネット大手AOLと経営統合したが、09年には合併を解消した。企業文化で折り合いがつかず、インターネットバブルの崩壊も影響した。異業種合併で成果をあげることは容易ではない。  独禁当局の判断も注目だ。コンテンツがAT&Tに独占的に囲い込まれるのは米政府も望まない。コムキャストがNBCユニバーサルを買収した際は、司法省と米連邦通信委員会の承認を得るのに1年以上かかった。米ニューヨーク・タイムズ紙はAT&Tが競合社向けにタイムワーナーのコンテンツ使用料をつり上げる恐れを指摘する。  22日午前には大統領選のトランプ候補が「一握りの会社にあまりにも権力が集中している」と買収に反対する考えを表明。政治リスクもある。  それでも米メディア企業がからむ業界再編の動きが後をたたない。  放送業界では05年に分離したCBSとバイアコムの再統合が取り沙汰される。タイムワーナーを巡っては米アップルも買収に興味を示したもよう。21世紀フォックスも数年前に買収を提案した。  7月には米携帯首位のベライゾン・コミュニケーションズが米ヤフーの買収を決めた。傘下のネット大手AOLとの一体運営で、コンテンツ配信やデジタル広告事業の強化を目指す。  「動画の未来はモバイル端末であり、モバイル端末の未来は動画だ」――。22日にAT&Tとタイムワーナーが出した記者発表文にはこんな言葉が盛り込まれた。通信とメディアの融合は今後も続きそうな気配だ。  AT&T 電話を発明したグラハム・ベルの電話会社を源流とする米通信大手。業界2位の米携帯業界で飽和感が強まるなか、買収を通じて「海外」「娯楽」分野を強化している。この2年間でメキシコの携帯大手2社や米衛星放送大手ディレクTVを買収した。競合する米携帯最大手ベライゾン・コミュニケーションズが買収で合意したヤフーのネット事業にも関心を示していた。本社はテキサス州ダラス。2015年12月通期の売上高は1468億ドル、純利益は133億ドル。  タイムワーナー 1990年に米映画会社ワーナー・ブラザーズの親会社が出版大手タイムを買収して誕生した。「セックス・アンド・ザ・シティ」など人気ドラマの製作で定評のある有料テレビ局「HBO」やニュース専門局「CNN」も傘下に抱える。2001年、インターネット大手アメリカ・オンライン(AOL)に買収されたが、相乗効果を生み出せず09年に分離。動画コンテンツ事業に特化するため14年には出版部門タイムも切り離した。15年12月通期の売上高は281億ドル、純利益は38億ドル。本社はニューヨーク市。"

The server returns the following error:

ERROR:__main__:Error: Timeout with input '" 【ニューヨーク=清水石珠実】「世紀の合併」と言われながら失敗した米メディア大手タイムワーナーとネット大手AOLの統合から15年。そのタイムワーナーを通信大手AT&Tが買収することで、米国で業種を超えたメディア再編がまた動き出した。背景にはスマートフォン(スマホ)を通じたインターネット動画の普及がある。通信とメディア業界の境界線はこれまで以上に薄まっている。  ネット動画を巡る異業種間合併の先駆けは2011年の米ケーブルテレビ最大手コムキャストによるNBCユニバーサルの買収だ。ネット配信するコンテンツの拡充が目的で、コムキャストは今年6月にも米アニメ制作大手のドリームワークス・アニメーションを傘下に収めた。  AT&Tが狙うのもコムキャスト型のビジネスモデルで、垂直統合による複合メデ?'

Notice that the input shown in the error is not the same as the submitted text; the article seems to be truncated after 複合メデ, so not all of the data is being sent to the server.

At first I thought it might be a mecab problem with its buffer size (which you can change with the -b option), but this does not seem to be the case.

Looking into it a bit more: if I read the input from a file and break each sentence onto a new line, the data is processed correctly. Is there a limit on the maximum number of characters per line?

Give definition of each requirement

We may have to specify the meaning of each requirement tag more rigorously.

Currently it is vague what each requirement tag guarantees. For example, "mecab -d /path/jumandic" and juman currently yield the same requirement, TokenizeWithJuman, but the information given by mecab with jumandic is less than that given by juman, and the pipeline fails when connecting mecab and KNP.
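
One possible refinement, as a hedged sketch (the names below are illustrative, not existing Jigg identifiers): split the coarse requirement so that KNP can demand exactly what only juman provides:

sealed trait Requirement
// mecab -d jumandic: surface forms and jumandic-style POS only
case object TokenizeWithJumanDic extends Requirement
// full juman output (e.g., 代表表記), sufficient as input to KNP
case object TokenizeWithJuman extends Requirement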

UDPipe annotator

I have a plan to support UDPipe (http://ufal.mff.cuni.cz/udpipe), which supports parsing all languages in UD 2.0, runs very fast, and shows near state-of-the-art accuracy.

While the UDPipe jar is found on Maven Central, this thread indicates the native JNI-related code may not be available, or may depend on the environment.

Maybe I will not use Maven and will instead assume users prepare a single jar themselves.

Output in JSON

The current system outputs the analysis results in XML, but in many cases a simpler format like JSON is more user-friendly. I'd like to have an option to change the output format to JSON.
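
A rough sketch of one possible XML-to-JSON mapping; the ".tag"/".child" key names are an assumption here, though they mirror the shape visible in the server logs quoted in later issues, and string escaping is simplified for brevity:

import scala.xml.{Elem, Node, Text}

def toJson(node: Node): String = node match {
  case e: Elem =>
    val attrs = e.attributes.asAttrMap.toSeq.map { case (k, v) => s""""$k":"$v"""" }
    val text  = e.child.collect { case Text(t) if t.trim.nonEmpty => t.trim }.mkString
    val kids  = e.child.collect { case c: Elem => toJson(c) }
    val fields =
      Seq(s"""".tag":"${e.label}"""") ++
      (if (text.nonEmpty) Seq(s""""text":"$text"""") else Nil) ++
      attrs ++
      (if (kids.nonEmpty) Seq(s"""".child":[${kids.mkString(",")}]""") else Nil)
    fields.mkString("{", ",", "}")
  case other => s""""${other.text.trim}""""
}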

Support CoreNLP 3.7.0

The CoreNLP major version has been updated from 3.6.0 to 3.7.0. Jigg's wrapper, StanfordCoreNLPAnnotator, relies on some internal mechanisms of 3.6.0, so some problems may occur with the new version.

Always trim in XMLUtil.text?

XMLUtil.text is used to extract the text field of a given node. Here is the current issue (in the REPL):

scala>  val props = jigg.util.ArgumentsParser.parse(List("-annotators", "corenlp[tokenize,ssplit]"))
scala> val pipe = new Pipeline(props)
scala> val x = pipe.annotate("aaa")
x: scala.xml.Node =
<root><document id="d1"><sentences><sentence id="s1" characterOffsetBegin="0" characterOffsetEnd="3">
          aaa
          <tokens annotators="corenlp"><token characterOffsetEnd="3" characterOffsetBegin="0" id="t2" form="aaa"/></tokens>
        </sentence></sentences></document></root>
scala> jigg.util.XMLUtil.text((x\\"sentence").head)
res24: String =
"
          aaa

        "

This means that the current text returns a string that still contains redundant white space. One option to resolve this is to make text always trim its result. I think this is natural and (probably) causes no problems.

If the text field contains some new lines, e.g.,

<root><document id="d1"><sentences><sentence id="s1" characterOffsetBegin="0" characterOffsetEnd="3">
          aaa
bbb
          <tokens annotators="corenlp"><token characterOffsetEnd="3" characterOffsetBegin="0" id="t2" form="aaa"/></tokens>
        </sentence></sentences></document></root>

Then, the above solution still returns the intended string:

aaa
bbb

I think there is no case in which the text contains leading or trailing white space that should be kept.
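
A minimal sketch of the proposed behavior: collect only the direct Text children (ignoring element children such as <tokens>) and always trim the result:

import scala.xml.{Node, Text}

def text(node: Node): String =
  node.child.collect { case Text(t) => t }.mkString.trim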

Consistent error handling

Here is a proposal for how to keep track of errors in the output XML when they are detected.

Example:

<chunks annotators="cabocha" errors="cabocha">
<error by="cabocha">error message</error>
</chunks>

That is, an error message is wrapped in <error>, which records the annotator that caused the error.

This design may also handle the situation where multiple annotators annotate the same XML element and only one of them fails:

<tokens annotators="ssplit tokenize pos" errors="pos">
<token id="0" offsetBegin="0" offsetEnd="1">I</token>
...
<error by="pos">error message</error>
</tokens>

The errors attribute on each element may be redundant, but it seems useful for checking errors. I'm not sure.
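
A sketch of what the proposal could look like with scala.xml (assuming the annotated element is an Elem):

import scala.xml.{Attribute, Elem, Null, Text}

// append an <error> child and record the failing annotator in the errors attribute
def addError(e: Elem, by: String, msg: String): Elem = {
  val withChild = e.copy(child = e.child :+ <error by={by}>{msg}</error>)
  withChild % Attribute(None, "errors", Text(by), Null)
}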

JSON/XML Schema

We probably need documentation that defines the schema of the output from each annotator.
I couldn't find anything in the repo. Where would you suggest I start?

Rethinking behavior of DocumentAnnotator

Currently, dsplit is always performed before ssplit. The default annotator for this is RegexDocumentAnnotator, which segments the given text using a regex (the default pattern is two or more successive new lines). This <document> annotation is currently used only in knpDoc, which annotates document-level anaphora.

Considering the usual scenario for using knpDoc, most users probably do not assume document layers, since KNP by default assumes that each sentence comprises a document. In the current implementation, however, a user cannot specify that each sentence comprises a document, since document segmentation must be specified before sentence segmentation.

This problem might be resolved by changing the default behavior of dsplit. The following is one proposal:

  • if dsplit is not given, no document segmentation is performed before sentence segmentation, and a <document> tag is given for each sentence (i.e., <document><sentences><sentence> is created for each sentence) along with ssplit;
  • if dsplit is given explicitly, document segmentation is performed before sentence segmentation.

A problem might be compatibility with CoreNLP, which does not provide a dsplit annotator and (probably) always treats an input text as a single document.

Implement Document Annotator

Implement a simple document annotator. In a way similar to RegexSentenceAnnotator, use regular expressions to split documents. The default regex should be "\n{2,}".

AAA
BBB

CCC

should be annotated as:

<root>
  <document id="d0">
    AAA
    BBB
  </document>
  <document id="d1">
    CCC
  </document>
</root>
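
A minimal sketch of such an annotator's core (standalone, not tied to Jigg's Annotator interface):

import scala.xml.Elem

// split the raw text into documents with the default pattern "\n{2,}"
def dsplit(text: String, pattern: String = "\\n{2,}"): Elem =
  <root>{
    text.split(pattern).toSeq.zipWithIndex.map { case (doc, i) =>
      <document id={ s"d$i" }>{ doc }</document>
    }
  }</root>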

Development branch bug - simple query not working on server.

Following this fix, I have reassembled the develop branch.

However, running a simple query like the one below returns a "The requested resource could not be found." error (with no error on the server's standard output).

curl --data-urlencode "annotators=corenlp[tokenize,ssplit]" --data 'aaa bbb' "http://localhost:8080/annotate?outputFormat=json"

The problem seems restricted to English annotation:

# Parse - "これは桃" - this works
curl --data-urlencode "annotators=ssplit,mecab" --data '%E3%81%93%E3%82%8C%E3%81%AF%E6%A1%83' "http://localhost:8080/annotate?outputFormat=json"

but it is not limited to strings; files also fail:

$ cat en.txt
my name is slim.
this is a car.
$ curl --data-urlencode "annotators=corenlp[tokenize,ssplit]" --data @en.txt "http://localhost:8080/annotate?outputFormat=json"
The requested resource could not be found.

One other, probably separate, issue: despite this solution, adding corenlp[tokenize,ssplit] as an annotator fails to parse sentences containing line breaks.

curl --data-urlencode "annotators=corenlp[tokenize,ssplit]" --data 'aaa%0Abbb' "http://localhost:8080/annotate?outputFormat=json"

Below is the server log in this case. Could this be fixed by ensuring that ssplit occurs before tokenize, perhaps?

[INFO] [02/23/2017 10:58:28.317] [jigg-server-akka.actor.default-dispatcher-4] [akka://jigg-server/user/$a] Pipeline is updated. New property: {annotators=corenlp[tokenize,ssplit]}
[jigg-server-akka.actor.default-dispatcher-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[jigg-server-akka.actor.default-dispatcher-3] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
[jigg-server-akka.actor.default-dispatcher-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[ERROR] [02/23/2017 10:58:29.268] [jigg-server-akka.actor.default-dispatcher-3] [akka.actor.ActorSystemImpl(jigg-server)] Error during processing of request: 'Illegal unquoted character ((CTRL-CHAR, code 10)): has to be escaped using backslash to be included in string value
 at [Source: {".tag":"root",".child":[{".tag":"document","id":"d0",".child":[{".tag":"sentences",".child":[{".tag":"sentence","text":"aaa
bbb","id":"s0","characterOffsetBegin":"0","characterOffsetEnd":"7",".child":[{".tag":"tokens","annotators":"corenlp",".child":[{".tag":"token","characterOffsetEnd":"3","characterOffsetBegin":"0","id":"t0","form":"aaa"},{".tag":"token","characterOffsetEnd":"7","characterOffsetBegin":"4","id":"t1","form":"bbb"}]}]}]}]}]}; line: 1, column: 126]'. Completing with 500 Internal Server Error response.

PipelineServer cannot annotate in parallel with external softwares (e.g., mecab)

Originally raised in #60.

Inspecting the behavior, I found that the annotators whose parallelism is managed inside Jigg crash only when launched from PipelineServer. Usual annotation from the command line (jigg.pipeline.Pipeline) does not cause errors, though.

Thread-safe annotators, e.g., kuromoji, can be used safely. This suggests that the current method of handling parallelism for annotators wrapping external software causes problems when they are launched from PipelineServer.

Currently, their parallelism is managed using Java's LinkedBlockingQueue, a thread-safe container that holds the resources (IO processes). This is not a sophisticated approach, so I want to replace the mechanism with Akka instead.
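
For reference, a sketch of the current scheme as described (the type parameter P stands in for a hypothetical wrapper around an external IO process such as mecab):

import java.util.concurrent.LinkedBlockingQueue

// a thread-safe pool of external process handles; take() blocks until one is free
class ProcessPool[P](make: () => P, size: Int) {
  private val queue = new LinkedBlockingQueue[P]()
  (0 until size).foreach(_ => queue.put(make()))

  def withProcess[A](f: P => A): A = {
    val p = queue.take()
    try f(p) finally queue.put(p)
  }
}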

JACCG/ CoreNLP annotator error when running pipeline_server.py

I can run jaccg on the terminal without issue.
When using the below, however:

./script/pipeline_server.py -P "-Xmx4g -cp jigg-0.6.1.jar jigg.pipeline.Pipeline -annotators ssplit,mecab,jaccg"

The below error is returned.

I have extended the timeout to give the necessary libraries/packages time to load, but there still seems to be a problem: the command prompt character ">" is never returned.

Using the mecab and cabocha annotators works fine; the error occurs only with jaccg.

Running corenlp with all annotation options also returns a similar error:

./script/pipeline_server.py -P "-Xmx4g -cp jigg-0.6.1.jar jigg.pipeline.Pipeline -annotators corenlp[tokenize,ssplit,parse,lemma,ner,dcoref]"

ERROR
INFO:__main__:java -Xmx4g -cp jigg-0.6.1.jar jigg.pipeline.Pipeline -annotators ssplit,mecab,jaccg
INFO:__main__:Spawn done!
Traceback (most recent call last):
  File "./script/pipeline_server.py", line 83, in <module>
    pipeline = Pipeline(options.pipeline)
  File "./script/pipeline_server.py", line 20, in __init__
    self.pipeline.expect("> ", timeout=5000)
  File "/usr/local/lib/python2.7/site-packages/pexpect/spawnbase.py", line 321, in expect
    timeout, searchwindowsize, async)
  File "/usr/local/lib/python2.7/site-packages/pexpect/spawnbase.py", line 345, in expect_list
    return exp.expect_loop(timeout)
  File "/usr/local/lib/python2.7/site-packages/pexpect/expect.py", line 105, in expect_loop
    return self.eof(e)
  File "/usr/local/lib/python2.7/site-packages/pexpect/expect.py", line 50, in eof
    raise EOF(msg)
pexpect.exceptions.EOF: End Of File (EOF). Empty string style platform.
<pexpect.pty_spawn.spawn object at 0x10d356810>
command: /usr/bin/java
args: ['/usr/bin/java', '-Xmx4g', '-cp', 'jigg-0.6.1.jar', 'jigg.pipeline.Pipeline', '-annotators', 'ssplit,mecab,jaccg']
buffer (last 100 chars): ''
before (last 100 chars): ' < str>: Path to the trained model (you can omit this if you load a jar which packs models) []\r\n'
after: <class 'pexpect.exceptions.EOF'>
match: None
match_index: None
exitstatus: None
flag_eof: True
pid: 81078
child_fd: 6
closed: False
timeout: 30
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 2000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0.05
delayafterclose: 0.1
delayafterterminate: 0.1
searcher: searcher_re:
    0: re.compile("> ")

Resolving annotation conflicts

Currently, if we apply two annotators that annotate the same element, both annotations are added to the result. Stanford CoreNLP instead overrides the old annotation. Following this, I implemented a method that checks whether the same elements already exist when adding XML elements. Such a duplicate occurs, e.g., when running a joint parser of POS tags and trees after applying a POS tagger.

I plan to push this modification, but I was also wondering whether this overriding method is the best way to resolve conflicts. Maybe it would be better to also output a warning, but that may be future work.
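
The override check, as a minimal sketch over scala.xml (not the actual patch):

import scala.xml.Elem

// drop existing children with the same label before adding the new annotation,
// mimicking CoreNLP's overriding behavior
def addOrOverride(parent: Elem, newChild: Elem): Elem =
  parent.copy(child = parent.child.filterNot(_.label == newChild.label) :+ newChild)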

Improve key names for some XML nodes

Some key names decided at an early stage do not seem good. For example, below is the current output for each token by mecab or kuromoji:

<token id="s0_0" surf="日本語" pos="名詞" pos1="一般" pos2="*" pos3="*" inflectionType="*" inflectionForm="*" base="日本語" reading="ニホンゴ" pronounce="ニホンゴ"/>

Probably the inflection* keys should be replaced with conjugate*. Following kuromoji, it may also be better to change the base number of the pos keys, e.g., replacing pos with pos1, which is 品詞大分類 (the top-level POS category), etc.

Scripts to convert Jigg outputs into other famous output format

In some cases we may want to output the annotation result in other popular formats, such as the CoNLL format, which may be useful as input to other software. Another format might be just the PTB-style S-expression trees converted from Jigg's parse trees.

Probably such a mechanism should be implemented as an external tool (script) that converts the Jigg output into another format, rather than via a Jigg option for choosing the output format as in CoreNLP.
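
As a sketch of how small such a converter could be (the attribute names follow the mecab token shown in the previous issue; a real script would need to handle the full schema and the dependency layer):

import scala.xml.Node

def toConll(sentence: Node): String =
  (sentence \\ "token").zipWithIndex.map { case (t, i) =>
    Seq((i + 1).toString, (t \ "@surf").text, (t \ "@base").text, (t \ "@pos").text)
      .mkString("\t")
  }.mkString("\n")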

get raw text from output XML

We have no way to get the raw text from the output XML. We use the text method of an XML instance to get text, but this method returns not only the parent node's text but also its children's.

So we have to implement a new method that gets only the raw text from the output XML.

Support Chinese processing

Stanford CoreNLP supports Chinese processing. Some problems may occur in Jigg's CoreNLP wrapper; some tests are needed first.

Use document ID in KNP annotator

Coreference resolution and predicate-argument structure analysis in KNP work on a document rather than a sentence. A document boundary is given to KNP by specifying a document ID. The KNP annotator should use this function to let KNP know document boundaries.

KNP receives a document ID in the following structure:

# S-ID:X-Y

where X is a document ID and Y is a sentence ID. Sentences with the same X are regarded as belonging to the same document, while those with different X are processed as belonging to different documents. More specifically, coreference resolution and predicate-argument structure analysis find antecedents only within sentences sharing the same X.

See http://www.lr.pi.titech.ac.jp/~sasano/knp/input.html for details.
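
So the annotator only has to emit one comment line per sentence; a trivial sketch:

// sentences sharing the same X (document ID) form one document for KNP
def sId(docId: String, sentIdx: Int): String = s"# S-ID:$docId-$sentIdx"

// e.g., sId("d1", 0) == "# S-ID:d1-0"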

Improve tests for annotators of external softwares

Currently, unit tests for annotators of external software, such as mecab and KNP, internally call the commands of that software to check the outputs. This mechanism should be changed.
For example, the current tests assume that the output of a given tool is fixed for a given input, but it may change with a software update.

Each test should check only the wrapping mechanism of each annotator and should not depend on the output of external software.
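
A sketch of the idea: feed a fixed, known tool output to the parsing side of the wrapper instead of whatever the installed command happens to produce (parseToken here is a stand-in for the annotator's real parsing logic):

// a canned mecab output line (surface TAB feature CSV), frozen in the test
val fixedMecabOutput = Seq(
  "太郎\t名詞,固有名詞,人名,名,*,*,太郎,タロウ,タロー",
  "EOS")

def parseToken(line: String): (String, String) = {
  val Array(surf, feats) = line.split("\t", 2)
  (surf, feats.split(",")(0))
}

assert(parseToken(fixedMecabOutput.head) == ("太郎", "名詞"))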

Detecting typos/errors in command-line options

It is desirable to output errors or warnings when an unsupported argument or a wrong value for a key is given. For example, ccg accepts a kBest option, and it would be desirable to warn when a user instead gives kbest. Also, some options accept only a value from a fixed set; for example, the dictionary option in kuromoji, which I'm now implementing, would only accept values such as ipa, juman, etc. A type check already works for some options (e.g., Int options), but this mechanism should be extended, possibly using enums (currently all options except numeric ones are strings).
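
A sketch of what an enum-backed check could look like for the kuromoji dictionary option (illustrative names only, not Jigg's actual option machinery):

sealed abstract class Dict(val name: String)
object Dict {
  case object Ipa   extends Dict("ipa")
  case object Juman extends Dict("juman")
  val all = Seq(Ipa, Juman)

  // fail loudly on an unknown value instead of silently accepting any string
  def parse(v: String): Dict = all.find(_.name == v).getOrElse(
    sys.error(s"Unknown dictionary '$v'; expected one of: ${all.map(_.name).mkString(", ")}"))
}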

Ambiguity preserving analysis

The current CCG parser accepts one-best outputs from morphological analysis, but these outputs often contain errors, which cause significant problems when constructing semantic representations along CCG derivations.

To solve this, the system needs to produce n-best outputs that include not only ambiguities in CCG parsing but also those in morphological analysis. This requires some mechanism for passing ambiguity-preserving analyses from morphological analyzers to parsers.

The easiest solution is to modify the CCG parser to use lattice-style outputs from morphological analyzers. However, the problem of ambiguity-preserving analysis is pervasive at various levels, and it would be great if we could have some unified framework for this problem (maybe too ambitious?).

Japanese ccg model is not available.

Hi, I tried to use this software library for Japanese CCG.
But the script "download_ccg_model.sh" doesn't work correctly:
it always shows "403 Forbidden" to me.

Is this software only for internal use?

JSON outputter cannot handle CCG categories

Example:

$ java -cp "*" jigg.pipeline.Pipeline -annotators "ssplit,kuromoji,jaccg" -outputFormat json
Loading parser model in ccg-models/parser/beam=64.ser.gz ...done [5.2 sec]
> 東京は晴れです
Exception in thread "main" com.fasterxml.jackson.core.JsonParseException: Unrecognized character escape 'N' (code 78)
 at [Source: {".tag":"root",".child":[{".tag":"document","id":"d0",".child":[{".tag":"sentences",".child":[{".tag":"sentence","text":"東京は晴れです","id":"s0","characterOffsetBegin":"0","characterOffsetEnd":"7",".child":[{".tag":"tokens","annotators":"kuromoji",".child":[{".tag":"token","id":"s0_0","form":"東京","characterOffsetBegin":"0","characterOffsetEnd":"2","pos":"名詞","pos1":"固有名詞","pos2":"地域","pos3":"一般","cType":"*","cForm":"*","lemma":"東京","yomi":"トウキョウ","pron":"トーキョー"},{".tag":"token","id":"s0_1","form":"は","characterOffsetBegin":"2","characterOffsetEnd":"3","pos":"助詞","pos1":"係助詞","pos2":"*","pos3":"*","cType":"*","cForm":"*","lemma":"は","yomi":"ハ","pron":"ワ"},{".tag":"token","id":"s0_2","form":"晴れ","characterOffsetBegin":"3","characterOffsetEnd":"5","pos":"名詞","pos1":"一般","pos2":"*","pos3":"*","cType":"*","cForm":"*","lemma":"晴れ","yomi":"ハレ","pron":"ハレ"},{".tag":"token","id":"s0_3","form":"です","characterOffsetBegin":"5","characterOffsetEnd":"7","pos":"助動詞","pos1":"*","pos2":"*","pos3":"*","cType":"特殊・デス","cForm":"基本形","lemma":"です","yomi":"デス","pron":"デス"}]},{".tag":"ccg","annotators":"jaccg","root":"s0_sp0","id":"s0_ccg0","score":"319.1745958328247",".child":[{".tag":"span","id":"s0_sp0","begin":"0","end":"4","symbol":"S[mod=nm,form=base,fin=f]","rule":"&gt;","children":"s0_sp1 s0_sp4"},{".tag":"span","id":"s0_sp1","begin":"0","end":"2","symbol":"S[fin=f]/S[fin=f]","rule":"&lt;","children":"s0_sp2 s0_sp3"},{".tag":"span","id":"s0_sp2","begin":"0","end":"1","symbol":"NP[mod=nm,case=nc,fin=f]","children":"s0_0"},{".tag":"span","id":"s0_sp3","begin":"1","end":"2","symbol":"(S[fin=f]/S[fin=f])\NP[mod=nm,case=nc,fin=f]","children":"s0_1"},{".tag":"span","id":"s0_sp4","begin":"2","end":"4","symbol":"S[mod=nm,form=base,fin=f]","rule":"&lt;","children":"s0_sp5 s0_sp6"},{".tag":"span","id":"s0_sp5","begin":"2","end":"3","symbol":"NP[mod=nm,case=nc,fin=f]","children":"s0_2"},{".tag":"span","id":"s0_sp6","begin":"3","end":"4","symbol":"S[mod=nm,form=base,fin=f]\NP[mod=nm,case=nc,fin=f]","children":"s0_3"}]}]}]}]}]}; line: 1, column: 1609]
    at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1581)
    at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:533)
    at com.fasterxml.jackson.core.base.ParserMinimalBase._handleUnrecognizedCharacterEscape(ParserMinimalBase.java:510)
    at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._decodeEscaped(ReaderBasedJsonParser.java:2208)
        ...

The problem seems to be in escaping the backslash \ in CCG categories like N\N.
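
A minimal sketch of the escaping the outputter needs when building JSON string values by hand (this would also cover the unescaped newline reported in the PipelineServer issue above):

def escapeJson(s: String): String = s.flatMap {
  case '\\' => "\\\\"   // N\N -> N\\N, the case at hand
  case '"'  => "\\\""
  case '\n' => "\\n"
  case '\r' => "\\r"
  case '\t' => "\\t"
  case c if c < ' ' => "\\u%04x".format(c.toInt)  // other control characters
  case c => c.toString
}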

xml output error when using PipeLine server

I am trying to run a simple curl request that outputs XML; the request below runs fine when using JSON.

$ curl -s --data 'this is my first.' "http://localhost:8080/annotate?outputFormat=xml&annotators=corenlp%5Btokenize%2Cssplit%5D"

The server logs show the following:

Server online at localhost:8080/
Press RETURN to stop...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Volumes/disk/InvestmentPlanning/Development/test/textPipeline/components/textPreProcessor/jigg/target/jigg-0.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Volumes/disk/InvestmentPlanning/Development/test/textPipeline/components/textPreProcessor/jigg/target/jigg-assembly-0.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Volumes/disk/InvestmentPlanning/Development/test/textPipeline/components/textPreProcessor/jigg/target/jigg-assembly-server-0.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.SimpleLoggerFactory]
[jigg-server-akka.actor.default-dispatcher-8] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[jigg-server-akka.actor.default-dispatcher-8] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
[jigg-server-akka.actor.default-dispatcher-8] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
Uncaught error from thread [jigg-server-akka.actor.default-dispatcher-8] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[jigg-server]
java.lang.NoSuchMethodError: jigg.pipeline.Pipeline.writeTo(Ljava/io/Writer;Lscala/xml/Node;)V
	at jigg.pipeline.PipelineServer$$anonfun$1$$anonfun$apply$1$$anonfun$apply$2$$anonfun$apply$3$$anonfun$2.outputBy$1(PipelineServer.scala:140)
	at jigg.pipeline.PipelineServer$$anonfun$1$$anonfun$apply$1$$anonfun$apply$2$$anonfun$apply$3$$anonfun$2.apply(PipelineServer.scala:144)
	at jigg.pipeline.PipelineServer$$anonfun$1$$anonfun$apply$1$$anonfun$apply$2$$anonfun$apply$3$$anonfun$2.apply(PipelineServer.scala:134)
	at scala.util.Success$$anonfun$map$1.apply(Try.scala:237)
	at scala.util.Try$.apply(Try.scala:192)
	at scala.util.Success.map(Try.scala:237)
	at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
	at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
	at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
	at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
	at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
	at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
	at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Server does not return annotations

I am processing text data, and in about 5~10% of cases Jigg fails to return the expected results.
The Jigg internal errors don't give many clues as to the underlying cause; any ideas?

# curl command
curl --data-urlencode 'annotators=ssplit,mecab,cabocha' --data-urlencode 'q=特集——福田氏?小沢氏?日本の顔、次の衆院選、政治の岐路「福田色」腐心 ねじれ国会、なお壁に 二〇〇八年は日本の政治にとって大きな岐路になる。年内に次期衆院選が実施されるとの観測が強いためだ。福田政権を継続させるのか、小沢一郎代表が率いる最大野党、**党に政権を担わせるのか、それとも「第三の道」を選ぶのか。有権者の一票が日本の進路に大きな重みを持つ。 首相は今年前半に外交で「福田カラー」を出し、年後半から社会保障などの難題に取り組むシナリオを描く。この中で一番、実績をアピールできるタイミングを選び、解散・総選挙に打って出る構えだ。 首相が外交の看板に掲げるのが、良好な日米関係を足場にアジア各国との連携を強める「共鳴外交」。日米同盟の堅持と対中関係などの強化を両立させる路線だ。「対米追随外交」をやり玉に挙げる小沢氏の批判をかわしつつ、独自色を出す狙いだ。 福田外交にとっての大きなヤマ場は七月の主要国首脳会議だ。首相は地ならしを進めるため、五月の大型連休に欧州の主要国を歴訪する方向。サミット参加国の首脳陣と顔見知りになり、議長として指導力を発揮しやすくするためだ。 サミットの主要議題となる地球温暖化対策では、米中など温暖化ガスの主要排出国との協力がカギを握る。今春に来日する予定の胡錦濤**国家主席にも協力を要請する腹づもりだ。 年後半には次期衆院選の大きな争点になる内政の課題も待ったなしだ。特に大きいのが、消費税問題。二〇〇九年度には基礎年金国庫負担割合の引き上げが控える。首相はその財源を確保するため、年末の〇九年度税制改正大綱に向けて消費税率引き上げ論議に取り組む意向だ。 首相は秋以降の展開をにらみ、すでに一つの仕掛けを施している。社会保障制度の将来像を描く国民会議の創設がそれだ。政党代表だけでなく、経済界や労働界、有識者らもメンバーに加えることで、給付と負担のあり方に関する国民の合意形成につなげる狙いだ。 社会保障論議と並行して「成長力の強化」にも力を入れる方針だ。増税論議に偏らない姿勢を強調し、税率の引き上げ幅をできるだけ圧縮する努力を示すことで、国民の理解を得る思惑がのぞく。 首相は次期衆院選でこれらの政策を訴えるとみられるが、過半数の議席を確保できたとしても、どこまで公約を実現できるかは不透明だ。参院で与野党が逆転する「ねじれ国会」の構図は変わらないからだ。 現在、与党は参院で否決された法案を衆院で再議決・可決するのに必要な三分の二の議席を握るが、この勢力も維持は困難との見方が多い。有権者はこうした事情も考慮に入れ、一票を投じる必要がある。' 'http://localhost:8080/annotate?outputFormat=json'
# Jigg internal error
[ERROR] [07/26/2017 12:51:50.741] [jigg-server-akka.actor.default-dispatcher-16] [akka.actor.ActorSystemImpl(jigg-server)] Error during processing of request: 'java.lang.AssertionError: assertion failed
    at scala.Predef$.assert(Predef.scala:156)
    at jigg.pipeline.AnnotatingInParallel$class.divideBy(AnnotatingInParallel.scala:74)
    at jigg.pipeline.AnnotatingInParallel$class.annotateInParallel(AnnotatingInParallel.scala:54)
    at jigg.pipeline.MecabAnnotator.annotateInParallel(MecabAnnotator.scala:27)
    at jigg.pipeline.AnnotatingSentencesInParallel$$anonfun$annotate$1.apply(AnnotatingInParallel.scala:95)
    at jigg.pipeline.AnnotatingSentencesInParallel$$anonfun$annotate$1.apply(AnnotatingInParallel.scala:93)
    at jigg.util.XMLUtil$RichNode.jigg$util$XMLUtil$RichNode$$recurse$2(XMLUtil.scala:93)
    at jigg.util.XMLUtil$RichNode$$anonfun$3.apply(XMLUtil.scala:94)
    at jigg.util.XMLUtil$RichNode$$anonfun$3.apply(XMLUtil.scala:94)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at jigg.util.XMLUtil$RichNode.jigg$util$XMLUtil$RichNode$$recurse$2(XMLUtil.scala:94)
    at jigg.util.XMLUtil$RichNode$$anonfun$3.apply(XMLUtil.scala:94)
    at jigg.util.XMLUtil$RichNode$$anonfun$3.apply(XMLUtil.scala:94)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at jigg.util.XMLUtil$RichNode.jigg$util$XMLUtil$RichNode$$recurse$2(XMLUtil.scala:94)
    at jigg.util.XMLUtil$RichNode.replaceAll(XMLUtil.scala:97)
    at jigg.pipeline.AnnotatingSentencesInParallel$class.annotate(AnnotatingInParallel.scala:93)
    at jigg.pipeline.MecabAnnotator.annotate(MecabAnnotator.scala:27)
    at jigg.pipeline.Pipeline.annotateRecur$1(Pipeline.scala:351)
    at jigg.pipeline.Pipeline.annotate(Pipeline.scala:360)
    at jigg.pipeline.Pipeline$$anonfun$annotateText$1.apply(Pipeline.scala:341)
    at jigg.pipeline.Pipeline$$anonfun$annotateText$1.apply(Pipeline.scala:339)
    at jigg.pipeline.Pipeline.process(Pipeline.scala:327)
    at jigg.pipeline.Pipeline.annotateText(Pipeline.scala:339)
    at jigg.pipeline.Pipeline.annotate(Pipeline.scala:344)
    at jigg.pipeline.PipelineServer$$anonfun$1$$anonfun$apply$1$$anonfun$apply$2$$anonfun$apply$3$$anonfun$4.apply(PipelineServer.scala:159)
    at jigg.pipeline.PipelineServer$$anonfun$1$$anonfun$apply$1$$anonfun$apply$2$$anonfun$apply$3$$anonfun$4.apply(PipelineServer.scala:157)
    at scala.util.Success$$anonfun$map$1.apply(Try.scala:237)
    at scala.util.Try$.apply(Try.scala:192)
    at scala.util.Success.map(Try.scala:237)
    at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
    at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
    at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
    at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
    at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
    at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
'. Completing with 500 Internal Server Error response.

mecab buffer size

The following returns a server error: pipeline Server - Internal Error [500].

curl -s "http://localhost:8080/annotate?outputFormat=json&annotators=ssplit%2Cmecab%2Ccabocha" --data " 米議会上院は17日、オクラホマ州司法長官のスコット・プルイット氏(48)を環境保護局(EPA)長官とする人事案を賛成52票、反対46票で承認した。プルイット氏は、オバマ政権の地球温暖化対策などに批判的で、EPAを相手取り訴訟を繰り返すなど規制反対の急先鋒(きゅうせんぽう)として知られる。米国の環境規制が抜本的に変わる可能性がある。  プルイット氏は同日、宣誓し就任した。プルイット氏は2015年、オバマ政権が温暖化対策の一環として火力発電所から出る二酸化炭素(CO2)の排出基準を設けたことに反発し、反対する27州が参加する集団訴訟を起こした中心人物の一人。オバマ氏在任中の規制導入を阻む形となり、反規制派から「功労者」とみられてきた。CO2や大気汚染物質を出す石炭に厳しかった環境規制について「エネルギーに勝ち組も負け組もない」と見直す方針を示していた。  米メディアが、プルイット氏を支援する政治団体が化石燃料企業から政治献金を受けていると報じるなど、たびたびエネルギー産業とのつながりを問題視されてきた。**党議員らは、採決前にプルイット氏が化石燃料企業と交わした電子メールの公表を求めたが実現しなかった。市民団体が地元オクラホマ州の地裁に起こした訴訟に絡み、プルイット氏はメールの公表を命じられているという。環境保護団体がプルイット氏に反対するEPA元職員らの署名を募ったところ、770人以上(16日現在)が参加するなど「身内」からも反発が強まっている。  プルイット氏の下で環境規制の見直しが進めば、米国が温暖化対策の新たな国際ルール「パリ協定」で温室効果ガスの削減目標に掲げた「2025年に05年比で26~28%減」は達成できなくなる可能性が高い。**に続く排出国である米国が対策に消極的な姿勢に転じれば、各国による取り組みにも影響がでそうだ。(ボストン= 小林哲 )"

If I remove the cabocha annotator, I get the desired output, suggesting this is an issue with that annotator.

If I reduce the original document size (removing half of the characters) and then include the cabocha annotator, the desired output is obtained.

I have seen this problem before with mecab/cabocha. Could this be related to the buffer size, perhaps? How can I alter this value in the Jigg framework?

http://localhost:8080/help/mecab
 -b, --input-buffer-size=INT    set input buffer size (default 8192)

The response from the server is as below;

[ERROR] [02/21/2017 15:16:20.754] [jigg-server-akka.actor.default-dispatcher-245] [akka.actor.ActorSystemImpl(jigg-server)] Error during processing of request: 'null'. Completing with 500 Internal Server Error response.

Cabocha NE mode

Cabocha can perform named entity recognition if the "-n 1" option is given. Maybe an extended cabocha annotator (cabochaNE?) should be provided, which satisfies the NamedEntity requirement.

Missing englishPCFG.ser.gz

I am running the server from the target directory; it seems as though edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz cannot be found. Should I be running the server from elsewhere, or re-installing CoreNLP?

Client

$ curl --data-urlencode "annotators=corenlp[tokenize,ssplit,parse]" --data-urlencode 'my name is slim shady' "http://localhost:8080/annotate?outputFormat=json"
There was an internal server error.

Server

[jigg-server-akka.actor.default-dispatcher-7] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[jigg-server-akka.actor.default-dispatcher-7] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[jigg-server-akka.actor.default-dispatcher-7] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[jigg-server-akka.actor.default-dispatcher-7] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... 
[ERROR] [02/28/2017 13:24:14.537] [jigg-server-akka.actor.default-dispatcher-7] [akka.actor.ActorSystemImpl(jigg-server)] Error during processing of request: 'java.io.IOException: Unable to open "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz" as class path, filename or URL'. Completing with 500 Internal Server Error response.

Improve error handling

To increase usability, errors in invoking or running annotators should be handled properly, and helpful messages should be shown. For example (a sketch of the first item follows the list):

  • Show an instruction to download the model file when a CCG model file is not found.
  • Instruct the user to install mecab etc. when the necessary command is not found.
  • Improve the message shown when requirements are not satisfied.
  • Show a message when a path specification is wrong (necessary libraries are not found).
  • Check the Java version and show a message if a wrong version is used.
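
A hypothetical sketch of the first item (ParserModel and doLoad are made-up stand-ins for the real model type and loading routine):

import java.io.FileNotFoundException

case class ParserModel(path: String)

def doLoad(path: String): ParserModel =
  if (new java.io.File(path).exists) ParserModel(path)
  else throw new FileNotFoundException(path)

def loadModel(path: String): ParserModel =
  try doLoad(path)
  catch {
    case _: FileNotFoundException =>
      sys.error(s"CCG model not found at $path; run ./script/download_ccg_model.sh to download it.")
  }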

How to get "the Japanese CCGBank (Uematsu et al. 2013)"

Hi, I am trying to use your software for Japanese CCG parsing.
I want to modify the CCG parser for an experiment.
How can I get "the Japanese CCGBank (Uematsu et al. 2013)"?

The background of this question:
I will experiment with the CCG parser to translate natural sentences into DSL scripts.

Sentence and document level parallelization

The SentenceAnnotator trait is designed to abstract sentence-level parallelization for all subclass annotators, but currently no parallelization is implemented. It is also unclear whether currently implemented components such as the CCG parser are actually parallelizable (by calling the newSentenceAnnotation method concurrently). Maybe we have to define rules that a component must follow to enable sentence-level parallelization. CoreNLP's SentenceAnnotator class also handles sentence-level parallelization in its annotate method; it should be consulted when deciding the rules and implementation details.

Also, some annotators, such as KNPAnnotator with document-level anaphora, require document-level rather than sentence-level parallelization. Such classes should probably inherit another trait, DocumentAnnotator, which abstracts document-level processing.
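
One possible contract, sketched here: a component is eligible for sentence-level parallelism only if newSentenceAnnotation is safe to call concurrently, in which case annotation can simply map over sentences with a parallel collection (pre-2.13 Scala):

import scala.xml.Node

trait ParallelSentenceAnnotator {
  // must be thread-safe for the parallel map below to be valid
  def newSentenceAnnotation(sentence: Node): Node

  def annotateSentences(sentences: Seq[Node]): Seq[Node] =
    sentences.par.map(newSentenceAnnotation).seq
}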

Special preprocessing for some annotators

It is known that hankaku (half-width) numbers are not recommended for JUMAN (and for Mecab as well?). I'm not sure, but one option may be to preprocess the text appropriately in those annotators.

Another related issue I found is that Mecab ignores half spaces in a sentence, while JUMAN does not. Below are examples:

$ echo あ い | juman
あ あ あ 感動詞 12 * 0 * 0 * 0 "代表表記:あ/あ"
  \  \  特殊 1 空白 6 * 0 * 0 NIL
い い いい 形容詞 3 * 0 イ形容詞イ段 19 文語基本形 18 "代表表記:良い/よい 反義:形容詞:悪い/わるい"
EOS

$ echo あ い | mecab
あ フィラー,*,*,*,*,*,あ,ア,ア
い 名詞,一般,*,*,*,*,い,イ,イ
EOS

In the current implementation, JumanAnnotator cannot handle an input containing a half space correctly, since the analyzed token line for a half space (' ') differs from that of other tokens. One way to solve this is to add preprocessing that removes half spaces from the input in JumanAnnotator. Another is to preserve the original JUMAN output and modify the implementation to correctly parse the line for a half-space token.
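
The first option is essentially a one-liner, sketched here (only ASCII half spaces are removed; full-width spaces are left untouched):

def preprocessForJuman(s: String): String = s.replace(" ", "")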

CI for testing behaviors of Annotators

There are test classes for some annotators (https://github.com/mynlp/jigg/tree/master/src/test/scala/jigg/pipeline), but they are limited, or very superficial.

One particular problem is that we cannot test the behavior of annotators that rely on external model files, including the CoreNLP annotators, since Jigg does not hold their models internally. This is also the case for annotators of non-Maven software, including mecab, KNP, etc.

Currently, for example, when I update the version of CoreNLP in build.sbt, I don't carefully check how the behavior of the annotators changes; I just check that Jigg's wrapper does not output errors when executing. This is bad.

We need a more systematic test mechanism for this external software, perhaps with some CI tool?

Handling Special Characters

The majority of documents include special characters such as quotes (") or carriage returns.
My documents use "\n" to represent carriage returns, and I escape the leading \ with another \, as below.

"1. 対処すべき課題\\nこれは桃だ。"

Currently Jigg processes all of this as one sentence and includes the character sequence "\n", even though the document can be split in two and これは桃だ。 is a sentence.

My workaround is to replace the above with the below; that is, to introduce a sentence-termination character. In Japanese, "。" is unambiguous, so this solution is probably reasonable, but I cannot be as sure about the "." character in English.

"1. 対処すべき課題。これは桃だ。"

Are there any 'better' methods for handling such special characters?

how to run PipelineServer in background?

I can't run PipelineServer in the background, since it requires standard input at StdIn.readLine().

➜  jigg % java -Xmx1G -cp "target/*:models/jigg-models.jar" jigg.pipeline.PipelineServer > server.log  2>&1 &
[1] 38375
➜  jigg %
[1]  + 38375 suspended (tty input)  java -Xmx1G -cp "target/*:models/jigg-models.jar" jigg.pipeline.PipelineServe

➜  jigg % curl --data-urlencode 'annotators=corenlp[tokenize,ssplit]' \
         --data-urlencode 'q=Please annotate me!' \
         'http://localhost:8080/annotate?outputFormat=json'
curl: (7) Failed to connect to localhost port 8080: Connection refused

Do we need to modify PipelineServer#run() to run it in the background?
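
A hypothetical sketch of such a change (assuming the server is an Akka ActorSystem, as the logs suggest): only read stdin when a console is attached, and otherwise block on termination:

import akka.actor.ActorSystem
import scala.concurrent.Await
import scala.concurrent.duration.Duration

def awaitShutdown(system: ActorSystem): Unit =
  if (System.console() != null) {
    scala.io.StdIn.readLine()  // "Press RETURN to stop..."
    system.terminate()
  } else {
    Await.ready(system.whenTerminated, Duration.Inf)  // background mode: wait for termination
  }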

Sentence split exceptions

I want to run a dependency parse over the text below:

北越銀行の久須美隆頭取は「日本経済は先行きに不透明感が増しており、景気回復の持続へ試練の時を迎えている。地域活性化に向けての努力が大事」と述べた。

Prior to running the dependency parser, Jigg splits sentences at each occurrence of "。":

北越銀行の久須美隆頭取は「日本経済は先行きに不透明感が増しており、景気回復の持続へ試練の時を迎えている。
地域活性化に向けての努力が大事」と述べた。

which results in two sentences being parsed separately.

Is it possible to avoid this?
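
One possible approach, sketched as a standalone function (an illustration, not Jigg's RegexSentenceAnnotator): split on 。 only when outside 「」 quotes:

def splitOutsideQuotes(text: String): Seq[String] = {
  val out = scala.collection.mutable.Buffer.empty[String]
  val buf = new StringBuilder
  var depth = 0
  for (c <- text) {
    buf += c
    c match {
      case '「' => depth += 1
      case '」' => depth = math.max(0, depth - 1)
      case '。' if depth == 0 => out += buf.toString; buf.clear()
      case _ =>
    }
  }
  if (buf.nonEmpty) out += buf.toString
  out.toSeq
}

On the example above, the 。 inside the 「…」 quotation would no longer trigger a split, so the whole utterance stays one sentence.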
