Issue discussion for aadah / snowball

Errors when indexing wikipedia

Hi,
I've been trying to set up snowball with Wikipedia.
I'm at the step of indexing Wikipedia with Elasticsearch, which I do by running:
python index/index.py
(By the way, I believe this step is missing from the README.)

The script successfully starts the CoreNLP server but then fails with the following errors:

...
...
INFO:CoreNLP_PyWrapper:Successful ping. The server has started.
INFO:CoreNLP_PyWrapper:Subprocess is ready.
Python(1828,0x7fffefa9d3c0) malloc: *** mach_vm_map(size=8872781095355314176) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Process Process-2:
Traceback (most recent call last):
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "index/index.py", line 35, in work
parse = parser.combined_parse(xml, text, self.corenlp)
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 221, in combined_parse
sentences = get_sentences(text, corenlp)
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 203, in get_sentences
sents = parse(text, corenlp)['sentences']
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 88, in parse
return corenlp.parse_doc(text)
File "/Library/Python/2.7/site-packages/stanford_corenlp_pywrapper/sockwrap.py", line 226, in parse_doc
return self.send_command_and_parse_result(cmd, timeout, raw=raw)
File "/Library/Python/2.7/site-packages/stanford_corenlp_pywrapper/sockwrap.py", line 246, in send_command_and_parse_result
data = self.send_command_and_get_string_result(cmd, timeout)
File "/Library/Python/2.7/site-packages/stanford_corenlp_pywrapper/sockwrap.py", line 289, in send_command_and_get_string_result
data = self.outpipe_fp.read(remaining_size)
INFO:CoreNLP_JavaServer: INPUT: 1 documents, 50939 characters, 8747 tokens, 50939.0 char/doc, 8747.0 tok/doc RATES: 0.125 doc/sec, 1092.4 tok/sec
MemoryError

WARNING:CoreNLP_PyWrapper:Bad JSON returned from subprocess; returning null.
WARNING:CoreNLP_PyWrapper:Bad JSON length 392719, starts with: 'J","NN","."],"lemmas":["Industrial","agriculture","base","on","large-scale","monoculture","farming","have","become","the","dominant","agricultural","methodology","."],"tokens":["Industrial","agriculture","based","on","large-scale","monoculture","farming","has","become","the","dominant","agricultural","methodology","."],"char_offsets":[[590,600],[601,612],[613,618],[619,621],[622,633],[634,645],[646,653],[654,657],[658,664],[665,668],[669,677],[678,690],[691,702],[702,703]],"ner":["O","O","O","O","O","O","O","O","O","O","O","O","O","O"],"normner":["","","","","","","","","","","","","",""]},{"pos":["NNP","NNP",",","NN","NN",",","NNS","JJ","IN","NNS","CC","NNS",",","CC","JJ","NNS","VBP","IN","JJ","NNS","RB","VBD","NNS","IN","NN",",","CC","IN","DT","JJ","NN","VBP","VBN","JJ","JJ","NN","CC","JJ","JJ","NN","NNS","."],"lemmas":["Modern","agronomy",",","plant","breeding",",","agrochemical","such","as","pesticide","and","fertilizer",",","and","technological","development","have","in","many","c'
Process Process-1:
Traceback (most recent call last):
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "index/index.py", line 35, in work
parse = parser.combined_parse(xml, text, self.corenlp)
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 221, in combined_parse
sentences = get_sentences(text, corenlp)
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 203, in get_sentences
sents = parse(text, corenlp)['sentences']
TypeError: 'NoneType' object has no attribute '__getitem__'

I see three problems here: the memory allocation error, the "Bad JSON" warning, and the TypeError at the end. I'm not sure which of these is the root cause, or whether they are connected.
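
My guess is that at least the last two are connected: the warning says the wrapper is "returning null", and the TypeError is what Python raises when None is indexed with ['sentences']. Below is a minimal sketch of a guard for that case. It assumes parse(text, corenlp) returns either a dict or None; the helper name get_sentences_safely and the two stand-in parsers are hypothetical and are not part of snowball or the CoreNLP wrapper. It would only skip the failing document, not fix the underlying memory error.

```python
def get_sentences_safely(parse_fn, text):
    """Return the parsed sentences, or [] if the parser returned None.

    parse_fn is a hypothetical stand-in for the parse call in
    corenlp/parser.py; it is assumed to return a dict or None.
    """
    result = parse_fn(text)
    if result is None:
        # Nothing usable came back (e.g. after a "Bad JSON" warning),
        # so skip this document instead of indexing into None.
        return []
    return result.get('sentences', [])


# Stand-in parsers for illustration only.
def failing_parse(text):
    return None


def working_parse(text):
    return {'sentences': [{'tokens': text.split()}]}


empty = get_sentences_safely(failing_parse, "Industrial agriculture")
parsed = get_sentences_safely(working_parse, "Industrial agriculture")
```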

I'm using the same CoreNLP and parser versions as in the example config file: "stanford-corenlp-full-2015-04-20".
The CoreNLP Python wrapper was installed from:
https://github.com/brendano/stanford_corenlp_pywrapper

Log files:

log_0.log

INFO (LOGGER 0): Now processing file: /Users/mac/git/wikipedia/extracted2/AA/wiki_00

log_1.log

INFO (LOGGER 1): Now processing file: /Users/mac/git/wikipedia/extracted2/AA/wiki_01

Wikipedia was extracted using the latest version of wikipedia-extractor, because the version in this repository wasn't working for me.

Would love to hear your thoughts.

Thanks,
Simon.
