snowball's Issues

Errors when indexing wikipedia

Hi,
I've been trying to set up snowball with Wikipedia.
I'm at the step of indexing Wikipedia with Elasticsearch, which I do by running:
python index/index.py
(By the way, I believe this step is missing from the README.)

The script starts the CoreNLP server successfully but then fails with the following errors:

...
...
INFO:CoreNLP_PyWrapper:Successful ping. The server has started.
INFO:CoreNLP_PyWrapper:Subprocess is ready.
Python(1828,0x7fffefa9d3c0) malloc: *** mach_vm_map(size=8872781095355314176) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Process Process-2:
Traceback (most recent call last):
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "index/index.py", line 35, in work
parse = parser.combined_parse(xml, text, self.corenlp)
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 221, in combined_parse
sentences = get_sentences(text, corenlp)
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 203, in get_sentences
sents = parse(text, corenlp)['sentences']
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 88, in parse
return corenlp.parse_doc(text)
File "/Library/Python/2.7/site-packages/stanford_corenlp_pywrapper/sockwrap.py", line 226, in parse_doc
return self.send_command_and_parse_result(cmd, timeout, raw=raw)
File "/Library/Python/2.7/site-packages/stanford_corenlp_pywrapper/sockwrap.py", line 246, in send_command_and_parse_result
data = self.send_command_and_get_string_result(cmd, timeout)
File "/Library/Python/2.7/site-packages/stanford_corenlp_pywrapper/sockwrap.py", line 289, in send_command_and_get_string_result
data = self.outpipe_fp.read(remaining_size)
INFO:CoreNLP_JavaServer: INPUT: 1 documents, 50939 characters, 8747 tokens, 50939.0 char/doc, 8747.0 tok/doc RATES: 0.125 doc/sec, 1092.4 tok/sec
MemoryError

WARNING:CoreNLP_PyWrapper:Bad JSON returned from subprocess; returning null.
WARNING:CoreNLP_PyWrapper:Bad JSON length 392719, starts with: 'J","NN","."],"lemmas":["Industrial","agriculture","base","on","large-scale","monoculture","farming","have","become","the","dominant","agricultural","methodology","."],"tokens":["Industrial","agriculture","based","on","large-scale","monoculture","farming","has","become","the","dominant","agricultural","methodology","."],"char_offsets":[[590,600],[601,612],[613,618],[619,621],[622,633],[634,645],[646,653],[654,657],[658,664],[665,668],[669,677],[678,690],[691,702],[702,703]],"ner":["O","O","O","O","O","O","O","O","O","O","O","O","O","O"],"normner":["","","","","","","","","","","","","",""]},{"pos":["NNP","NNP",",","NN","NN",",","NNS","JJ","IN","NNS","CC","NNS",",","CC","JJ","NNS","VBP","IN","JJ","NNS","RB","VBD","NNS","IN","NN",",","CC","IN","DT","JJ","NN","VBP","VBN","JJ","JJ","NN","CC","JJ","JJ","NN","NNS","."],"lemmas":["Modern","agronomy",",","plant","breeding",",","agrochemical","such","as","pesticide","and","fertilizer",",","and","technological","development","have","in","many","c'
Process Process-1:
Traceback (most recent call last):
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "index/index.py", line 35, in work
parse = parser.combined_parse(xml, text, self.corenlp)
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 221, in combined_parse
sentences = get_sentences(text, corenlp)
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 203, in get_sentences
sents = parse(text, corenlp)['sentences']
TypeError: 'NoneType' object has no attribute '__getitem__'

I see three problems here: the memory allocation error, the "Bad JSON" warning, and the TypeError at the end. I'm not sure which of these is the root cause, or whether they are connected.
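One quick check on whether the errors are connected is to look at the size the failed allocation asked for. The sketch below assumes (this is not confirmed by anything in the log) that the Python wrapper reads an 8-byte big-endian length prefix from the Java subprocess before reading the payload. Under that assumption, packing the reported size back into 8 bytes shows what the reader actually consumed as a "length". The function name is made up for illustration.

```python
import struct

# Size reported in the malloc error above.
FAILED_ALLOC_SIZE = 8872781095355314176


def size_as_bytes(size):
    """Return the 8 bytes that unpack to `size` as a big-endian uint64."""
    return struct.pack(">Q", size)


if __name__ == "__main__":
    # The last two bytes may not match the original input, because an
    # allocator can round a large request up to a page boundary.
    print(repr(size_as_bytes(FAILED_ALLOC_SIZE)))
```

The packed bytes begin with `{"sent`, which looks like the start of a JSON document rather than a length. If the assumption about the length prefix holds, that would be consistent with the reader getting out of sync with the stream: JSON text is read as a size (the MemoryError), the remaining data no longer parses (the "Bad JSON" warning), and the resulting null value is then subscripted (the TypeError). This is a hypothesis, not a diagnosis.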

  • I'm using the same CoreNLP and parser versions as in the example config file: "stanford-corenlp-full-2015-04-20".

  • CoreNLP Python wrapper installed from:
    https://github.com/brendano/stanford_corenlp_pywrapper

  • log files:
    log_0.log

    INFO (LOGGER 0): Now processing file: /Users/mac/git/wikipedia/extracted2/AA/wiki_00

    log_1.log

    INFO (LOGGER 1): Now processing file: /Users/mac/git/wikipedia/extracted2/AA/wiki_01

  • Wikipedia was extracted using the latest version of wikipedia-extractor, because the version in this repository wasn't working for me.
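The final TypeError in the log happens because the parse result is None and is then subscripted with ['sentences']. As a stopgap, a worker could skip such documents instead of crashing. The following is a minimal sketch, not the project's code: `get_sentences_safe` is a hypothetical helper, and `corenlp` is assumed to be any object with the `parse_doc(text)` method seen in the traceback.

```python
import logging

logger = logging.getLogger(__name__)


def get_sentences_safe(text, corenlp):
    """Return the parsed sentences, or [] when no parse result came back.

    Hypothetical helper for illustration. `corenlp` is any object exposing
    parse_doc(text), as called in the traceback above.
    """
    result = corenlp.parse_doc(text)
    if result is None:
        # Matches the case where the wrapper logs "Bad JSON ... returning null".
        logger.warning(
            "Skipping document of %d characters: no parse result", len(text)
        )
        return []
    return result.get("sentences", [])
```

This only hides the symptom: documents that fail to parse are dropped from the index, and the underlying memory error still needs to be resolved.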

Would love to hear your thoughts.

Thanks,
Simon.
