quanteda / spacyr Goto Github PK

R wrapper to spaCy NLP

R 82.81% Python 15.54% Shell 0.46% CSS 1.18%

extract-entities nlp r spacy speech-tagging

spacyr's Introduction

About

quanteda is an R package for managing and analyzing text, created and maintained by Kenneth Benoit and Kohei Watanabe. Its creation was funded by the European Research Council grant ERC-2011-StG 283794-QUANTESS and its continued development is supported by the Quanteda Initiative CIC.

For more details, see https://quanteda.io.

quanteda version 4

The quanteda 4.0 is a major release that improves functionality and performance and further improves function consistency by removing previously deprecated functions. It also includes significant new tokeniser rules that make the default tokeniser smarter than ever, with new Unicode and ICU-compliant rules enabling it to work more consistently with even more languages.

We describe more fully these significant changes in:

an article about the new external pointer tokens objects;
an article showing performance benchmarks for the new external pointer tokens objects, as well as some of the tokeniser improvements in v4; and
the changelog for v4 a full listing of the changes, improvements, and deprecations in v4.

The quanteda family of packages

We completed the trend of splitting quanteda into modular packages with the release of v3. The quanteda family of packages includes the following:

quanteda: contains all of the core natural language processing and textual data management functions
quanteda.textmodels: contains all of the text models and supporting functions, namely the textmodel_*() functions. This was split from the main package with the v2 release
quanteda.textstats: statistics for textual data, namely the textstat_*() functions, split with the v3 release
quanteda.textplots: plots for textual data, namely the textplot_*() functions, split with the v3 release

We are working on additional package releases, available in the meantime from our GitHub pages:

quanteda.sentiment: Functions and lexicons for sentiment analysis using dictionaries
quanteda.tidy: Extensions for manipulating document variables in core quanteda objects using your favourite tidyverse functions

and more to come.

How To…

Install (binaries) from CRAN

The normal way from CRAN, using your R GUI or

install.packages("quanteda")

(New for quanteda v4.0) For Linux users: Because all installations on Linux are compiled, Linux users will first need to install the Intel oneAPI Threading Building Blocks for parallel computing for installation to work.

To install TBB on Linux:

# Fedora, CentOS, RHEL
sudo yum install tbb-devel

# Debian and Ubuntu
sudo apt install libtbb-dev

Windows or macOS users do not have to install TBB or any other packages to enable parallel computing when installing quanteda from CRAN.

Compile from source (macOS and Windows)

Because this compiles some C++ and Fortran source code, you will need to have installed the appropriate compilers to build the development version.

You will also need to install TBB:

macOS:

First, you will need to install XCode command line tools.

xcode-select --install

Then install the TBB libraries and the pkg-config utility: (after installing Homebrew):

brew install tbb pkg-config

Finally, you will need to install gfortran.

Windows:

Install RTools, which includes the TBB libraries.

Use quanteda

See the quick start guide to learn how to use quanteda.

Get Help

Read out documentation at https://quanteda.io.
Check out the quanteda cheatsheet.
Submit a question on the quanteda channel on StackOverflow.
See our tutorial site.

Cite the package

Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. (2018) “quanteda: An R package for the quantitative analysis of textual data”. Journal of Open Source Software 3(30), 774. https://doi.org/10.21105/joss.00774.

For a BibTeX entry, use the output from citation(package = "quanteda").

Leave Feedback

If you like quanteda, please consider leaving feedback or a testimonial here.

Contribute

Contributions in the form of feedback, comments, code, and bug reports are most welcome. How to contribute:

Fork the source code, modify, and issue a pull request through the project GitHub page. See our Contributor Code of Conduct and the all-important quanteda Style Guide.
Issues, bug reports, and wish lists: File a GitHub issue.
Contact the maintainer by email.

spacyr's People

Contributors

Stargazers

Watchers

spacyr's Issues

parser removes non-ASCII characters

Did we actually fail to check this??

require(spacyr)
## Loading required package: spacyr
spacy_initialize(model = "de")
## Finding a python executable with spacy installed...
## spaCy (language model: de) is installed in /usr/local/bin/python
## successfully initialized (spaCy Version: 1.8.2, language model: de)

spacy_parse("Müller will einen gut lesbaren, knappen Programmierstil fördern.")
##    doc_id sentence_id token_id           token           lemma   pos entity
## 1   text1           1        1           Mller           mller  NOUN       
## 2   text1           1        2            will            will  VERB       
## 3   text1           1        3           einen           einen   DET       
## 4   text1           1        4             gut             gut   ADJ       
## 5   text1           1        5        lesbaren        lesbaren   ADJ       
## 6   text1           1        6               ,               , PUNCT       
## 7   text1           1        7         knappen         knappen   ADJ       
## 8   text1           1        8 Programmierstil programmierstil  NOUN       
## 9   text1           1        9          frdern          frdern  VERB       
## 10  text1           1       10               .               . PUNCT

Quotes in text are problematic, especially single quotes

In short:

double quotes " are ok if double-escaped
single quotes simply do not work, as far as I can tell

spacy_parse("Failure for \'single\' quotes.")
## File "<string>", line 2
##   texts =' [ "Failure for 'single' quotes." ] '
##                                  ^
## SyntaxError: invalid syntax
##    docname id   tokens   lemma google penn
## 1:   text1  0     This    this    DET   DT
## 2:   text1  1      \t      \t   SPACE   SP
## 3:   text1  2      tab     tab   NOUN   NN
## 4:   text1  3 succeeds succeed   VERB  VBZ
## 5:   text1  4        .       .  PUNCT    .

spacy_parse("Failure for \\\'single\\\' quotes.")
## File "<string>", line 2
##   texts =' [ "Failure for 'single' quotes." ] '
##                                  ^
## SyntaxError: invalid syntax
##    docname id   tokens   lemma google penn
## 1:   text1  0     This    this    DET   DT
## 2:   text1  1      \t      \t   SPACE   SP
## 3:   text1  2      tab     tab   NOUN   NN
## 4:   text1  3 succeeds succeed   VERB  VBZ
## 5:   text1  4        .       .  PUNCT    .

spacy_parse("Failure for \"double\" quotes.")
## Error in python.exec(python.command) : 
##  Expecting , delimiter: line 1 column 18 (char 17) 

spacy_parse("Success for \\\"double\\\" quotes.")
##    docname id  tokens   lemma google penn
## 1:   text1  0 Success success   NOUN   NN
## 2:   text1  1     for     for    ADP   IN
## 3:   text1  2       "       "  PUNCT   ``
## 4:   text1  3  double  double    ADJ   JJ
## 5:   text1  4       "       "  PUNCT   ''
## 6:   text1  5  quotes   quote   NOUN  NNS
## 7:   text1  6       .       .  PUNCT    .

spaCy lemmatization only for English

In spaCy, lemmatization is only possible for English texts: https://spacy.io/docs/api/language-models. However, the output of spaCy in Python and, consequently, the output of spacy_parse() does contain something called "lemma" - mostly lower-case versions of the tokens.
Maybe it would make sense to at least issue a warning if language model != english and lemma = TRUE.

spacy_parse() does not accept quanteda corpus object

The spacy_parse() function does not accept a quanteda corpus object as input despite saying so in the documentation:

x a character object, a quanteda corpus, or a TIF-compliant corpus data.frame (see https://github.com/ropensci/tif)

Example

> require(quanteda); packageVersion("quanteda")
[1] ‘0.9.9.65’
> require(spacyr); packageVersion("spacyr")
Loading required package: spacyr
[1] ‘0.9.0’
> spacy_initialize(model = "en", python_executable = "/usr/local/bin/python3")
successfully initialized (spaCy , language model: en)
> spacy_parse(data_corpus_inaugural)
Error in UseMethod("spacy_parse") : 
  no applicable method for 'spacy_parse' applied to an object of class "c('corpus', 'list')"

error at initialize

I still couldnt get spacyr to run.

I load librat spacyr successfully by entering library(spacyr)
Run spacy_finalize() I got the following error.

Finding a python executable with spacy installed...
spaCy (language model: en) is installed in C:\Python\python.exe
Error in py_run_file_impl(file, local, convert) :
Unable to open file '' (does it exist?)

Please could you help me?
Thank you

Consider a more efficient data structure

The current data.table is quite inefficient from a space standpoint.

Current format (data.table):

   docname id     tokens      lemma google penn head_id dep_rel
1:   text1  0        And        and   CONJ   CC       2      cc
2:   text1  1        now        now    ADV   RB       2  advmod
3:   text1  2        for        for    ADP   IN       2    ROOT
4:   text1  3  something  something   NOUN   NN       2    pobj
5:   text1  4 completely completely    ADV   RB       5  advmod
6:   text1  5  different  different    ADJ   JJ       3    amod
7:   text1  6          .          .  PUNCT    .       2   punct

Options:

hash docname, tokens, lemma, pos (google/penn), dep_rel
only use lemma if it differs from token
use common types for tokens and lemmas
consider an index of document beginnings to replace docname, would be ndoc in length instead of sum(ntoken) in length

We should compare speeds and object sizes of these, since data.table does some sort of internal hashing/factorization by default.

Drop quanteda section from README?

At the moment, we cannot knit README.rmd because quanteda method does not exist. I can just drop the section but I think @kbenoit can decide how to address.

Consider options for named entities

I wonder if it makes sense to tag the tokens as named entities as we currently do, or whether this "breaks" named entities that are most often multi-word NEs.

Maybe a function to simply return a list of NEs per text? Or to process the elements of a multi-word NE such that they will form a multi-word "token" in the output data.table?

What is the structure returned by the Python calls?

can spacy_initialize() only execute fully if needed?

In other words, is there a way to see if a second call to this function needs to be executed, if the connection is already open and working? Maybe a force = TRUE option to override the check, if the user wants it? (It would be FALSE by default.)

Goal: To save time.

Strange Results on Windows

Running this job with identical data on my Macbook and on my Windows machine I get completley dfifferent results. I perform the same analysis 100 times on each machine. On OSX I get (as I expected) 100 identical results, in windows I get 100 different results. If I run the same problem on windows in python (same spacy installation) I get the same results as on OSX/R/spacyr.

On both Systems I use R 3.3.3 (Microsoft MRAN)

library(dplyr)
require(spacyr)
require(foreach)

load("MyData.RDATA")
spacy_initialize(condaenv="py36", model="de")


runid <- 0
res <- foreach(i = seq(1,100,1), .combine = "rbind", .errorhandling = "remove") %do% {
  print(i)
  spacy_parse(my_data$text) %>% 
    mutate(runid=i) %>% 
    group_by(runid,pos) %>% 
    summarise(N=n())
}

save(res, file="testruns1.rdata")
tab<-xtabs(N~runid+pos, data=res)
df <- as.data.frame.matrix(mat<-as.matrix(tab))
summary(df)

This is sSummary of the OSX and Python/Windows results:

ADJ    4937
ADP    6558
ADV    6140
AUX    4091
CONJ   1849
DET    8892
INTJ      2
NOUN  13858
NUM     913
PART   1530
PRON   6088
PROPN  3353
PUNCT 11942
SCONJ   982
SPACE     1
VERB   8134
X       146

This is a Summary of R/spacyr/Windows results:

     pos   Min   Max  Median
1    ADJ  5286  8045  5477.0
2    ADP  7053 10655  7276.0
3    ADV  6549  9501  6751.0
4    AUX  4351  5905  4450.0
5   CONJ  1978  2744  2032.0
6    DET  9517 13533  9783.0
7   INTJ     1     9     2.0
8   NOUN 14885 21740 15289.0
9    NUM   957  1620   988.0
10  PART  1610  2248  1653.0
11  PRON  6461  8677  6593.0
12 PROPN  3546  5472  3663.5
13 PUNCT 12738 18225 13049.5
14 SCONJ  1051  1424  1084.0
15 SPACE     1     5     1.0
16  VERB  8669 11645  8884.5
17     X   147   273   159.0

change options to spacy_parse() to set pos tag type

add to the signature for spacy_parse():

tag_scheme = c("google", "penn", "all")
- "google": Google simplified scheme
- "penn": Penn scheme
- "all": add both, which is the current default

Python virtualenv

With reticulate it should be possible to use python virtual env. Will implement at some point.

Japanese tokeniser and tagger?

From @koheiw:

By the way, in order to quickly support multiple languages, we can use online APIs. They are probably not for large scale analysis, but cater to the need of majority of the people.
I am interested in Rosette. Its POS tagger takes around 0.3 sec for one document in inaugCorpus. Japanese works fairly well too.

Kohei

devtools::install_github("hrbrmstr/rosette")
library(rosette)
library(quanteda)

res1 <- ros_morph(texts(inaugCorpus)[[1]])
res2 <- ros_morph("Original Ghostbuster Dan Aykroyd, who also co-wrote the 1984 Ghostbusters film, couldn’t be more pleased with the new all-female Ghostbusters cast, telling The Hollywood Reporter, 'The Aykroyd family is delighted by this inheritance of the Ghostbusters torch by these most magnificent women in comedy.'")

res3 <- ros_morph("政治とは社会に対して全体的な影響を及ぼし、社会で生きるひとりひとりの人の人生にも様々な影響を及ぼす複雑な領域である。広辞苑では「人間集団における秩序の形成と解体をめぐって、人が他者に対して、また他者と共に行う営み。」としているわけであるが、政治は、社会や社会に生きるひとりひとりの人にとってそもそも何が重要なことなのか、社会がどのような状態であることが良い状態なのか、ということも扱い、様々ある人々の意志からどれを選び集団の意志とするか、どのような方法でそれを選ぶか、といったこととも深く関係している。")

spacy_parse() returns previous object value after Python failures

This is a special type of error, since it happens in Python and not in R. It causes the last successful object to be returned. A side effect of reference classes? (Whenever I see the <<- operator, I get nervous!)

spacy_parse("Parsing this sentence is easy.")
#    docname id   tokens    lemma google penn
# 1:   text1  0  Parsing    parse   VERB  VBG
# 2:   text1  1     this     this    DET   DT
# 3:   text1  2 sentence sentence   NOUN   NN
# 4:   text1  3       is       be   VERB  VBZ
# 5:   text1  4     easy     easy    ADJ   JJ
# 6:   text1  5        .        .  PUNCT    .

spacy_parse("Failure for \'single\' quotes.")
## File "<string>", line 2
##   texts =' [ "Failure for 'single' quotes." ] '
##                                  ^
## SyntaxError: invalid syntax
##    docname id   tokens    lemma google penn
## 1:   text1  0  Parsing    parse   VERB  VBG
## 2:   text1  1     this     this    DET   DT
## 3:   text1  2 sentence sentence   NOUN   NN
## 4:   text1  3       is       be   VERB  VBZ
## 5:   text1  4     easy     easy    ADJ   JJ
## 6:   text1  5        .        .  PUNCT    .

Using spacyr when knitting rmarkdown file

Hi,
Love the package! I am just having some issues when trying to knit (in RStudio) an r-markdown file that contains calls to spacyr. I get the error
Error in check_spacy_model(python_executable, model) : C:\Users\jason\AppData\Local\Programs\Python\Python36\python.exe is not a python executable

I am on Windows 10.
Is this a knitr issue, or is there something that can be done when calling spacr?

Thank you.

Add entity_consolidate()

The existing function name get_all_named_entities() is too long and offers no options.

Proposal:

entity_extract(x, types = c("named", "extended", "all") - would perform same functionality as get_all_named_entities() but offer just named entities (the first part of the table at https://spacy.io/docs/usage/entity-recognition#entity-types), extended entities (the second part of that table), or all of them.
entity_consolidate(spacy_parsed_object, concatenator = "_") would consolidate the named entities into a single token, replace the sequence with the single concatenated "token", and renumber the token_id within sentence.

Fix Travis build failure

It's failing because spaCy is not installed for Python. Given the size of this object and the required language files, this may not be feasible.

Add sentence id to spacy_parse output

How did we miss this before?? 😕

add lemma option to spacy_parse()

lemma = FALSE would be default; lemma column in the data.table will only be returned if lemma = TRUE

Implement find_spacy

Enhancement of spacy_initialize(): The function will look for the python executables with spaCy installed.

A zero-token length "sentence" crashes the dependency parser renumbering

This could be a simple fix but points to a deeper issue: Why are there zero-length sentences?

I discovered this when working with a large Hansard file here: https://www.dropbox.com/s/f0tkvz0s9durumz/data_corpus_speeches66.RData?dl=0

You can set break points and find the object with which(lengths(ntokens_by_sent) == 0) in https://github.com/kbenoit/spacyr/blob/master/R/spacy_parse.R#L69.

Note I already changed lines 72-73 since this error crashed 1:length(x) for zero-length objects. I had always read (and mostly followed) the advice on always using seq_along() for this, but here it proved totally necessary. (since 1:0 -> [1] 1 0).

Integration with reticulate

In https://github.com/statsmaths/coreNLP/issues/1 I understood spacyr to be running over reticulate, but that doesn't seem to be the case at the moment. Are there plans to move towards reticulate?

error when tag_pos = TRUE along in spacy_parse()

> txt <- "And now for something completely different."
> spacy_parse(txt)
   docname sentence_id token_id     tokens tag_detailed tag_google
1:   text1           1        1        And           CC      CCONJ
2:   text1           1        2        now           RB        ADV
3:   text1           1        3        for           IN        ADP
4:   text1           1        4  something           NN       NOUN
5:   text1           1        5 completely           RB        ADV
6:   text1           1        6  different           JJ        ADJ
7:   text1           1        7          .            .      PUNCT
> spacy_parse(txt, pos_tag = FALSE)
Error in py_run_string_impl(code, convert) : 
  ValueError: sentence boundary detection requires the dependency parse, which requires data to be installed. If you haven't done so, run: 
python -m spacy.en.download all
to install the data

Detailed traceback: 
  File "<string>", line 1, in <module>
  File "<string>", line 54, in ntokens_by_sent
  File "spacy/tokens/doc.pyx", line 427, in __get__ (spacy/tokens/doc.cpp:9669)
    raise ValueError(

Expand options available through posTag.py

Figure out what other options are available in spaCy and build in options to call them.

In the Python code
In the R wrapper function tag()

Improvements to spacy_parse()

Issue

some arguments override other arguments, namely full_parse = TRUE sets all others to TRUE
the tagset_* arguments only apply when pos_tag = TRUE
the argument list could be more "spaCy-like", meaning closer to the Python structure

Proposal

The new function signature would be:

spacy_parse(x, pos = TRUE, tag = FALSE, lemma = TRUE, entity = TRUE, parse = FALSE)

where:

pos means return the Universal dependencies tagset to which all spaCy language models must map their tags
tag means return the detailed tagset code, which is the OntoNotes 5 version of the Penn Treebank tag set for the en language model, but something different for other languages (e.g. German).
the others are obvious, although parse is possibly ambiguous, since it overlaps with the name of this function, while meaning specifically to return the dependency parsing information.

The defaults are set to reflect what most users would want, namely tokenisation and tagging. Tagging is by default the Universal scheme, but the additional details tags can be activated with one argument to change the default. We throw in lemmas for free. Dependency parsing requires one more argument to be specified to override the default.

Possibly

rename this function to something more general, such as spacy_process(), although I don't like that at all. But the idea would be to find a verb that encompasses all of the activities of tokenizing, tagging, parsing, and extracting NEs.
an alternative I like less is to separate these functions into spacy_tokenize(), spacy_tag(), spacy_parse, spacy_getentities(). These are not even separate steps in the back end.

Escaped space characters need special handling

\n and \t characters fail unless double escaped in the text. This appears to be something that happens in the translation to JSON, in the call rPython::python.assign("texts", x) of the function process_document().

spacy_parse("This \n newline fails.")
## Error in python.exec(python.command) : 
##  Invalid control character at: line 1 column 10 (char 9) 
spacy_parse("This \\n newline succeeds.")
##    docname id   tokens   lemma google penn
## 1:   text1  0     This    this    DET   DT
## 2:   text1  1      \n      \n   SPACE   SP
## 3:   text1  2  newline newline   NOUN   NN
## 4:   text1  3 succeeds succeed   VERB  VBZ
## 5:   text1  4        .       .  PUNCT    .
spacy_parse("This \t tab fails.")
## Error in python.exec(python.command) : 
##  Invalid control character at: line 1 column 10 (char 9) 
spacy_parse("This \\t tab succeeds.")
##    docname id   tokens   lemma google penn
## 1:   text1  0     This    this    DET   DT
## 2:   text1  1      \t      \t   SPACE   SP
## 3:   text1  2      tab     tab   NOUN   NN
## 4:   text1  3 succeeds succeed   VERB  VBZ
## 5:   text1  4        .       .  PUNCT    .

Likely failure of automated spaCy python detection

Regarding:
https://stackoverflow.com/questions/44266536/spacyr-installation-on-r-usr-local-bin-python-is-not-a-python-executable?noredirect=1#comment75542933_44266536

If the python2 and 3 co-exist in a system, and only python3 has spaCy, detection of spaCy python will likely fail, because which -a python will not return a path to python 3. Need to run which -a python3 as well.

How to structure the parsed object and tokens

(This continues the discussion from issue #1.)

Following our chat, I think I have a clearer answer.

We parse a corpus, and store the results as a complete information set of tagged tokens with dependencies as an option.

inputs: character, or quanteda corpus class object
output: a corpus_tagged class object that is based on the quanteda corpus class. Internally, it would have a data.table containing something similar to your data.frame from get_dependency_data() with some way also to flag or store named entities. Note that this would hide the way that spacy_parse() works from the user, so it just creates that, calls it, returns the information requested, and closes the spacy referenced object.

To get the things we want for non-quanteda users, we create extractor functions that happen to be the same as in quanteda, to get:

docnames
tokens
tags
sentences
tagged tokens as in the print method for the older tag object already defined, e.g. a named list of concatenated forms such as "and_DET", with a user-defined concatenator
entities
ntoken
ntypes
nsentence

How many of these we want to enable independently of quanteda we can discuss.

Figure out how to pass Python call through Rcpp

The current call in https://github.com/kbenoit/spacyr/blob/master/R/tag.R#L71 using system2() has a lot of overhead.

Passing directly involves implementing the instructions in http://gallery.rcpp.org/articles/rcpp-python/.

Integration with cleanNLP

At the rOpenSci Text Workshop we discussed how we might combine functionality from coreNLP into spacyr. I'll summarize some of the ideas here so we don't forget them:

keep cleanNLP as a wrapper around coreNLP, but we should move our combined efforts with spacy into this package
generally keep the current return format of spacyr, a single data frame, compared to the normalized tables of cleanNLP
generally switch the backend to the approach of coreNLP, using reticulate and its direct API points; this stops the need to maintain all the complex cpp code and makes it more consistent for users to use different version of Python and spacy models
we can try to integrate some of the extra tags I grab or create (i.e., sentence ids) into the output

@amatsuo suggested the best approach would be for him to take the first pass at integrating my approach with reticulate into spacyr, and then I could go back through and submit pull requests to fill in the rough edges.

Need to trap wrong python_executable in spacy_initialize()

> spacy_initialize("/usr/local/bin/notpython")
Finding a python executable with spacy installed...
 Error in file.info(x, extra_cols = FALSE) : invalid filename argument

Add appveyor CI for Windows build tests

See ?devtools::add_appveyor() and you can also see how this is working for quanteda and readtexts.

We would need to add (to appveyor.yml) the script to install spaCy, and maybe even Python.

Broken `spacy_parse`?

> library(spacyr)
> spacy_initialize()
> out <- spacy_parse(inaugTexts[1:2])
 Hide Traceback
 
 Rerun with Debug
 Error in python.exec(python.command) : 
  Invalid control character at: line 1 column 71 (char 70) 
4.
stop(ret$error.desc) 
3.
python.exec(python.command) 
2.
rPython::python.assign("texts", x) at parse.R#42
1.
spacy_parse(inaugTexts[1:2])

Add methods to quanteda for handling spacyr objects

These would include a way to create a corpus from the parsed corpus created by spacyr, replacing the character vector that is the current core of a quanteda corpus. docvars would need to be indexed to docnames, but kept separate.

quanteda methods for corpus objects would need to be defined for the new class of corpus_parsed objects created in this way.

caught segfault

I get the following error when using spacyr with knitr (when just running markdown snippets, R Studio crashes altogether):

Code:

spacy_initialize()

txt <- unlist(enc2utf8(as.character(cdp_regulatory_risks_reporting_text_descr_building$CC5.1a..Description)))
parsed <- spacy_parse(txt, entity = TRUE)
#entity_consolidate(parsed)
entitties <- entity_extract(parsed)
entitties %>% select(entity) %>%
distinct(entity) %>%
arrange()

spacy_finalize()

Error

*** caught segfault ***
address 0x0, cause 'memory not mapped'

Traceback:
1: .Call(_reticulate_py_run_string_impl, code, local, convert)
2: py_run_string_impl(code, local, convert)
3: reticulate::py_run_string(pystring)
4: spacyr_pyexec("timestamps = spobj.parse(texts)")
5: process_document(x)
6: spacy_parse.character(txt, entity = TRUE)
7: spacy_parse(txt, entity = TRUE)
8: eval(expr, envir, enclos)
9: eval(expr, envir, enclos)
10: withVisible(eval(expr, envir, enclos))
11: withCallingHandlers(withVisible(eval(expr, envir, enclos)), warning = wHandler, error = eHandler, message = mHandler)
12: handle(ev <- withCallingHandlers(withVisible(eval(expr, envir, enclos)), warning = wHandler, error = eHandler, message = mHandler))
13: timing_fn(handle(ev <- withCallingHandlers(withVisible(eval(expr, envir, enclos)), warning = wHandler, error = eHandler, message = mHandler)))
14: evaluate_call(expr, parsed$src[[i]], envir = envir, enclos = enclos, debug = debug, last = i == length(out), use_try = stop_on_error != 2L, keep_warning = keep_warning, keep_message = keep_message, output_handler = output_handler, include_timing = include_timing)
15: evaluate(code, envir = env, new_device = FALSE, keep_warning = !isFALSE(options$warning), keep_message = !isFALSE(options$message), stop_on_error = if (options$error && options$include) 0L else 2L, output_handler = knit_handlers(options$render, options))
16: in_dir(input_dir(), evaluate(code, envir = env, new_device = FALSE, keep_warning = !isFALSE(options$warning), keep_message = !isFALSE(options$message), stop_on_error = if (options$error && options$include) 0L else 2L, output_handler = knit_handlers(options$render, options)))
17: block_exec(params)
18: call_block(x)
19: process_group.block(group)
20: process_group(group)
21: withCallingHandlers(if (tangle) process_tangle(group) else process_group(group), error = function(e) { setwd(wd) cat(res, sep = "\n", file = output %n% "") message("Quitting from lines ", paste(current_lines(i), collapse = "-"), " (", knit_concord$get("infile"), ") ") })
22: process_file(text, output)
23: knitr::knit(knit_input, knit_output, envir = envir, quiet = quiet, encoding = encoding)
24: rmarkdown::render("/Users/x/Documents/Dissertation/My files/DataLab/R Scripts/CDP2017/CDP2017.Rmd", encoding = "UTF-8")
An irrecoverable exception occurred. R is aborting now ...

sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.6

locale:
[1] C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] Rcpp_0.12.13 digest_0.6.12 withr_2.0.0 rprojroot_1.2 R6_2.2.2 backports_1.1.1
[7] git2r_0.19.0 magrittr_1.5 evaluate_0.10.1 httr_1.3.1 stringi_1.1.5 curl_3.0
[13] rmarkdown_1.6 devtools_1.13.3 tools_3.3.2 stringr_1.2.0 yaml_2.1.14 memoise_1.1.0
[19] htmltools_0.3.6 knitr_1.17

Loading spacy does not work on Windows with Python 2.7

Split of #19 (comment)

The problem is that on Windows with Python 2.7, loading spacy on the Python command line works but loading spacy via the spacyr package fails:

PS C:\Users\cm> . 'C:\Program Files\R\R-3.3.3\bin\R.exe' --no-save -q
> library("spacyr")
> spacy_initialize()
Traceback (most recent call last):
  File "<string>", line 13, in <module>
  File "C:\Users\cm\AppData\Roaming\Python\Python27\site-packages\spacy\__init__.py", line 5, in <module>
    from .deprecated import resolve_model_name
  File "C:\Users\cm\AppData\Roaming\Python\Python27\site-packages\spacy\deprecated.py", line 8, in <module>
    from .cli import download
  File "C:\Users\cm\AppData\Roaming\Python\Python27\site-packages\spacy\cli\__init__.py", line 5, in <module>
    from .train import train, train_config
  File "C:\Users\cm\AppData\Roaming\Python\Python27\site-packages\spacy\cli\train.py", line 7, in <module>
    from ..scorer import Scorer
  File "C:\Users\cm\AppData\Roaming\Python\Python27\site-packages\spacy\scorer.py", line 4, in <module>
    from .gold import tags_to_entities
ImportError: DLL load failed: The specified module could not be found.
> q()
PS C:\Users\cm> python
Python 2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:53:40) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import spacy
>>> nlp = spacy.load("en")

    Warning: no model found for 'en'

    Only loading the 'en' tokenizer.

>>> exit()

should token_id be unique only within sentence_id?

Right now token_id is unique across the entire token set, so that it matches the row number. This is then linked to the head_token_id, which is within sentence. This creates differences when we tag dependencies for the same sentence, depending on whether it follows another sentence.

Example:

spacy_parse("My cat ate two mice.  I played golf.", full_parse = TRUE)
#    docname sentence_id token_id tokens  lemma tag_detailed tag_google head_token_id dep_rel named_entity
# 1:   text1           1        1     My -PRON-         PRP$        ADJ             2    poss             
# 2:   text1           1        2    cat    cat           NN       NOUN             3   nsubj             
# 3:   text1           1        3    ate    eat          VBD       VERB             3    ROOT             
# 4:   text1           1        4    two    two           CD        NUM             5  nummod   CARDINAL_B
# 5:   text1           1        5   mice  mouse          NNS       NOUN             3    dobj             
# 6:   text1           1        6      .      .            .      PUNCT             3   punct             
# 7:   text1           1        7                         SP      SPACE             6                     
# 8:   text1           2        8      I -PRON-          PRP       PRON             2   nsubj             
# 9:   text1           2        9 played   play          VBD       VERB             2    ROOT             
# 10:   text1           2       10   golf   golf           NN       NOUN             2    dobj             
# 11:   text1           2       11      .      .            .      PUNCT             2   punct             
spacy_parse("I played golf.", full_parse = TRUE)
#    docname sentence_id token_id tokens  lemma tag_detailed tag_google head_token_id dep_rel named_entity
# 1:   text1           1        1      I -PRON-          PRP       PRON             2   nsubj             
# 2:   text1           1        2 played   play          VBD       VERB             2    ROOT             
# 3:   text1           1        3   golf   golf           NN       NOUN             2    dobj             
# 4:   text1           1        4      .      .            .      PUNCT             2   punct

This seems like an undesirable behaviour to me, so I'd advocate making token_id a serial number within sentence, and having head_token_id link to that unique key within sentence.

spacy_parse crashes with UTF-8 characters in Windows

Works fine on macOS, but this crashes R in Windows 10:

require(spacyr)
## Loading required package: spacyr
spacy_initialize()
## Finding a python executable with spacy installed...
## spaCy (language model: en) is installed in C:\Users\kbenoit\AppData\Local\Programs\Python\Python36\python.exe
## successfully initialized (spaCy Version: 1.9.0, language model: en)

spacy_parse("One B-17 costs $275,000, while now one B-36 costs $3 Ѕ million.")

The "Ѕ" is the trouble-maker here, which has somehow entered the text.

Session info:

sessionInfo()
# R version 3.4.1 (2017-06-30)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
# 
# Matrix products: default
# 
# locale:
# [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
# [4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252    
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] spacyr_0.9.0  quanteda_0.99
# 
# loaded via a namespace (and not attached):
# [1] Rcpp_0.12.12        magrittr_1.5        MASS_7.3-47         splines_3.4.1       munsell_0.4.3       xtable_1.8-2        colorspace_1.3-2   
# [8] lattice_0.20-35     rlang_0.1.2         fastmatch_1.1-0     minqa_1.2.4         stringr_1.2.0       plyr_1.8.4          tools_3.4.1        
# [15] grid_3.4.1          nlme_3.1-131        data.table_1.10.4   gtable_0.2.0        gtools_3.5.0        lme4_1.1-13         lazyeval_0.2.0     
# [22] RcppParallel_4.3.20 tibble_1.3.3        Matrix_1.2-10       reshape2_1.4.2      nloptr_1.0.4        ggplot2_2.2.1       stringi_1.1.5      
# [29] compiler_3.4.1      BradleyTerry2_1.0-6 scales_0.4.1        profileModel_0.5-9  jsonlite_1.5        reticulate_1.0      brglm_0.6.1        
# [36] lubridate_1.6.0     sophistication_0.53

start serial id from 1, not zero

Why do we start counting from 0 for id?

> spacy_parse(txt, dependency = TRUE)
    docname id     tokens google penn head_id dep_rel
 1:   text1  0        And   CONJ   CC       2      cc
 2:   text1  1        now    ADV   RB       2  advmod
 3:   text1  2        for    ADP   IN       2    ROOT
 4:   text1  3  something   NOUN   NN       2    pobj
 5:   text1  4 completely    ADV   RB       5  advmod
 6:   text1  5  different    ADJ   JJ       3    amod
 7:   text1  6          .  PUNCT    .       2   punct
 8:   text1  7       This    DET   DT       8   nsubj
 9:   text1  8         is   VERB  VBZ       8    ROOT
10:   text1  9        the    DET   DT      11     det
11:   text1 10     second    ADJ   JJ      11    amod
12:   text1 11   sentence   NOUN   NN       8    attr
13:   text1 12          .  PUNCT    .       8   punct

travis, testthat

At the moment, the travis-build fails at a very early stage because suddenly testthat cannot be installed. Do you have any idea why?

Here are the lines of logs:

Package r-cran-testthat is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'r-cran-testthat' has no installation candidate

The command "eval sudo apt-get install -y r-cran-testthat " failed. Retrying, 2 of 3.

https://travis-ci.org/kbenoit/spacyr/builds/225633567#L669

Set option for using different language

A couple of months ago, German has been added as a language to spaCy and more languages are planned (see also https://explosion.ai/blog/german-model). In Python, once I installed (python -m spacy.en.download), the German language model can be loaded by using nlp = spacy.load('de'). However, as far as I see spacyr seems to use English as the default and I did not find a command to switch between languages.

If this is correct (maybe I just did not find this possibility), could we add an option to spacy_initialize() to switch between languages?

Error message needs improvement when spacy is not installed

When python exists but space is not installed. Also we should check when python is not installed.

> require(spacyr)
Loading required package: spacyr
> spacy_initialize()
Finding a python executable with spacy installed...
Error in file.info(x, extra_cols = FALSE) : invalid filename argument

Session info:

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.6

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] spacyr_0.3.0      quanteda_0.9.9-50

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10        ca_0.70             devtools_1.12.0    
 [4] munsell_0.4.3       colorspace_1.3-2    lattice_0.20-34    
 [7] R6_2.2.0            fastmatch_1.1-0     httr_1.2.1         
[10] plyr_1.8.4          tools_3.3.2         grid_3.3.2         
[13] data.table_1.10.4   gtable_0.2.0        git2r_0.18.0       
[16] withr_1.0.2         assertthat_0.1      lazyeval_0.2.0     
[19] RcppParallel_4.3.20 digest_0.6.12       tibble_1.2         
[22] Matrix_1.2-8        ggplot2_2.2.1       SnowballC_0.5.1    
[25] curl_2.3            memoise_1.0.0       stringi_1.1.5      
[28] scales_0.4.1        reticulate_0.7

improvements to spacy_initialize()

A few suggestions:

Change in argument names:

lang to model, and let the user specify this. I am not sure how to check if it's installed, although I imagine we could trap the Python error if it's not. Default would be en, but it also could be any of the models at https://spacy.io/docs/usage/models#download.
use_python, use_virtualenv, use_virtualenv: would it be possible to combine these into one argument called PYTHONPATH, which looks for python or python3 in that location, and this could include both anaconda and virtualenv environments? Even better we get this from the system by default.

I also don't like that we have to restart R if we get this wrong, e.g. this means I now have to restart R and reattach spacyr, then call spacy_initialize() with a correct path to python3 (where I do have spaCy and the language modules installed).

> spacy_initialize()
No python executable is specified, spacyr will use system default python
 (system default python: /usr/bin/python).
 Error in py_run_file_impl(file, convert) : 
  ImportError: No module named semver

Detailed traceback: 
  File "<string>", line 9, in <module>
  File "/Library/Python/2.7/site-packages/spacy/__init__.py", line 1, in <module>
    from . import util
  File "/Library/Python/2.7/site-packages/spacy/util.py", line 8, in <module>
    import sputnik
  File "/Library/Python/2.7/site-packages/sputnik/__init__.py", line 4, in <module>
    from .pool import Pool
  File "/Library/Python/2.7/site-packages/sputnik/pool.py", line 5, in <module>
    from . import util
  File "/Library/Python/2.7/site-packages/sputnik/util.py", line 9, in <module>
    import semver

Also it needs an example.

problems installing language file in Windows

Following our own installation instructions, in Windows 10, I was able to install successfully python 2.7 and spaCy.

However:

C:\Users\kbenoit>python -m spacy.en.download all
C:\Python27\python.exe: DLL load failed: The application has failed to start because its side-by-side configuration is incorrect. Please see the application event log or use the command-line sxstrace.exe tool for more detail.

Package loading fails on Linux with Python 3.6

The installation of the current master branch fails because the package cannot be loaded on Linux with Python 3.6. The specific problem is that spacyr thinks that the Python 3 library is /usr/lib/libpython3.6.so whereas it is actually /usr/lib/libpython3.6m.so (full output below).

The problematic line is the regular expression used to extract the version information here:
https://github.com/kbenoit/spacyr/blob/master/R/config.R.in#L24

downloads % wget -nv https://github.com/kbenoit/spacyr/archive/master.zip -O spacyr-master.zip
2017-04-18 16:35:43 URL:https://codeload.github.com/kbenoit/spacyr/zip/master [319603/319603] -> "spacyr-master.zip" [1]
downloads % unzip spacyr-master.zip
Archive:  spacyr-master.zip
downloads % R CMD INSTALL --build --no-inst spacyr-master/
* installing to library ‘/home/cmueller/.R/x86_64-pc-linux-gnu-library/3.3’
* installing *source* package ‘spacyr’ ...
Using python binary from PATH
configure: creating ./config.status
config.status: creating src/Makevars
config.status: creating R/config.R
** libs
g++ -I/usr/include/R/ -DNDEBUG -I/usr/include/python3.6m -I/usr/include/python3.6m -D PYTHONLIBFILE=libpython3.6.so -D_FORTIFY_SOURCE=2 -I"/home/cmueller/.R/x86_64-pc-linux-gnu-library/3.3/Rcpp/include"   -fpic  -march=native -O2 -pipe -fstack-protector-strong  -c RcppExports.cpp -o RcppExports.o
g++ -I/usr/include/R/ -DNDEBUG -I/usr/include/python3.6m -I/usr/include/python3.6m -D PYTHONLIBFILE=libpython3.6.so -D_FORTIFY_SOURCE=2 -I"/home/cmueller/.R/x86_64-pc-linux-gnu-library/3.3/Rcpp/include"   -fpic  -march=native -O2 -pipe -fstack-protector-strong  -c python.cpp -o python.o
g++ -shared -L/usr/lib64/R/lib -Wl,-O1,--sort-common,--as-needed,-z,relro -o spacyr.so RcppExports.o python.o -L/usr/lib -lpython3.6m -lpthread -ldl -lutil -lm -Xlinker -export-dynamic -L/usr/lib64/R/lib -lR
installing to /home/cmueller/.R/x86_64-pc-linux-gnu-library/3.3/spacyr/libs
** R
** data
*** moving datasets to lazyload DB
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
Error : .onLoad failed in loadNamespace() for 'spacyr', details:
  call: py_initialize(config$libpython)
  error: /usr/lib/libpython3.6.so: cannot open shared object file: No such file or directory
Error: loading failed
Execution halted
ERROR: loading failed
* removing ‘/home/cmueller/.R/x86_64-pc-linux-gnu-library/3.3/spacyr’
downloads % ls -1 /usr/lib/libpython3.6*
/usr/lib/libpython3.6m.so
/usr/lib/libpython3.6m.so.1.0

Add progress bar for spacy_parse()?

I was wondering whether we should add a progress bar in the R console when running spacy_parse(). Especially for a large corpus POS tagging can take quite a while, so it could be useful to know about the progress. txtProgressBar is part of base R and might be an easy solution without any package dependencies.

Type error when initializing spacy

Hello! I am getting the following error when trying to initialize spacy:

spacy_initialize()
Finding a python executable with spacy installed...
spaCy (language model: en) is installed in C:\ProgramData\Anaconda3\python.exe
Error in py_run_file_impl(file, local, convert) :
TypeError: 'int' object is not callable

Detailed traceback:
File "", line 9, in
File "C:\PROGRA~~3\ANACON~~1\lib\site-packages\spacy_init_.py", line 10, in
from . import en, de, zh, es, it, hu, fr, pt, nl, sv, fi, bn, he, nb, ja
File "C:\PROGRA~~3\ANACON~~1\lib\site-packages\spacy\en_init_.py", line 4, in
from ..language import Language
File "C:\PROGRA~~3\ANACON~~1\lib\site-packages\spacy\language.py", line 11, in
from .train import Trainer
File "C:\PROGRA~~3\ANACON~~1\lib\site-packages\spacy\train.py", line 5, in
import tqdm
File "C:\PROGRA~~3\ANACON~~1\lib\site-packages\tqdm_init_.py", line 1, in
from ._tqdm import tqdm
File "C:\PROGRA~~3\ANACON~~1\lib\site-packages\tqdm_tqdm.py", line 14, in
from ._utils import _supports_unicode, _environ_cols_wrapper, _range, _unich,
File "C:\PROGRA~~3\ANACON~~1\lib\site-packages\tqdm_utils.py", line 31, in
colorama

Any ideas what the problem can be?

Thanks!

Installation problem Windows 10 & Python 3.6.0 |Anaconda 4.3.1

I just tried to install spacyr but I receive this error message:

>devtools::install_github("kbenoit/spacyr")
Downloading GitHub repo kbenoit/spacyr@master
from URL https://api.github.com/repos/kbenoit/spacyr/zipball/master
Installing spacyr
"C:/PROGRA~1/R/R-33~1.3/bin/x64/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD  \
  INSTALL "C:/Users/binis/AppData/Local/Temp/RtmpO6cQH6/devtools18bc70cc7fa/kbenoit-spacyr-ecec433"  \
  --library="C:/Users/binis/Documents/R/win-library/3.3" --install-tests 

* installing *source* package 'spacyr' ...
Using python binary at C:/Users/binis/Anaconda3/
C:/Users/binis/Anaconda3/
C:/Users/binis/Anaconda3/: not found
C:/Users/binis/Anaconda3/: not found
** libs
Warning: this package has a non-empty 'configure.win' file,
so building only the main architecture

C:/Users/binis/Anaconda3/: not found
C:/Users/binis/Anaconda3/: not found
C:/Rtools/mingw_64/bin/g++  -I"C:/PROGRA~1/R/R-33~1.3/include" -DNDEBUG    -I"C:/Users/binis/Documents/R/win-library/3.3/Rcpp/include" -I"d:/Compiler/gcc-4.9.3/local330/include"  -I"\include" -I"\PCBuild" -DMS_WIN64   -O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function -Wno-ignored-attributes -Wno-deprecated-declarations -c RcppExports.cpp -o RcppExports.o
C:/Users/binis/Anaconda3/: not found
C:/Users/binis/Anaconda3/: not found
C:/Rtools/mingw_64/bin/g++  -I"C:/PROGRA~1/R/R-33~1.3/include" -DNDEBUG    -I"C:/Users/binis/Documents/R/win-library/3.3/Rcpp/include" -I"d:/Compiler/gcc-4.9.3/local330/include"  -I"\include" -I"\PCBuild" -DMS_WIN64   -O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function -Wno-ignored-attributes -Wno-deprecated-declarations -c python.cpp -o python.o
python.cpp:2:20: fatal error: Python.h: No such file or directory
 #include <Python.h>
                    ^
compilation terminated.
make: *** [python.o] Error 1
Warning: running command 'make -f "Makevars.win" -f "C:/PROGRA~1/R/R-33~1.3/etc/x64/Makeconf" -f "C:/PROGRA~1/R/R-33~1.3/share/make/winshlib.mk" -f "C:/Users/binis/Documents/.R/Makevars" SHLIB_LDFLAGS='$(SHLIB_CXXLDFLAGS)' SHLIB_LD='$(SHLIB_CXXLD)' SHLIB="spacyr.dll" WIN=64 TCLBIN=64 OBJECTS="RcppExports.o python.o"' had status 2
ERROR: compilation failed for package 'spacyr'
* removing 'C:/Users/binis/Documents/R/win-library/3.3/spacyr'
Error: Command failed (1)

It does not seem to be an issue with spacy itself since
system("python -c \"import spacy; spacy.load('en'); print('OK')\"")

returns OK. I'm not 100% sure where SPACY_PYTHON needs to point since 'C:/Users/binis/Anaconda3/: not found' is displayed multiple times, but my python.exe is in:
Sys.setenv(SPACY_PYTHON="C:/Users/binis/Anaconda3/")
This is also were I set PATH in Environment variables to - this means 'python' in command prompt points to this folder. Spacy is installed, apparently correctly, in 'C:\Users\binis\Anaconda3\Lib'

Error after spacy_initilize

I got an error after running spacy_initialize() as per below. Please kindly advice. Thank you.

Python space is already attached. If you want to swtich to a different Python, please restart R.
Error in py_run_file_impl(file, local, convert) :
TypeError: 'int' object is not callable

Detailed traceback:
File "", line 9, in
File "C:\Users\nhaswell\AppData\Local\CONTIN~~1\ANACON~~1\lib\site-packages\spacy_init_.py", line 8, in
from . import en, de, zh, es, it, hu, fr, pt, nl, sv, fi, bn, he
File "C:\Users\nhaswell\AppData\Local\CONTIN~~1\ANACON~~1\lib\site-packages\spacy\en_init_.py", line 4, in
from ..language import Language
File "C:\Users\nhaswell\AppData\Local\CONTIN~~1\ANACON~~1\lib\site-packages\spacy\language.py", line 11, in
from .train import Trainer
File "C:\Users\nhaswell\AppData\Local\CONTIN~~1\ANACON~~1\lib\site-packages\spacy\train.py", line 5, in
import tqdm
File "C:\Users\nhaswell\AppData\Local\CONTIN~~1\ANACON~~1\lib\site-packages\tqdm_init_.py", line 1, in
from ._tqdm import tqdm
File "C:\Users\nhaswell\AppData\Local\CONTIN~~1\ANACON~~1\lib\site-packages\tqdm_tqdm.py", line 14, in
from ._utils

Make output conform to the tokens data.frame of the TIF

as per https://github.com/ropensci/tif

I will create a branch and implement this once we've settled on the other API changes.