
textworkshop17: Text Analysis R Developers' Workshop

London, April 21-22, 2017 (invitation only)


Welcome to the repository for the Text Analysis R Developers' Workshop. This event is supported by European Research Council grant ERC-2011-StG 283794-QUANTESS and the LSE's Social and Economic Data Sciences unit.

  • Participants
  • Schedule
  • Please post ideas for projects, discussion topics, and sessions as issues. These can be suggestions of any kind: things you would like to work on or discuss, questions for other participants, etc.

The event hashtag is #RtextSIG17.

Code of conduct

To ensure a safe, enjoyable, and friendly experience for everyone who participates, we have a code of conduct. It applies to everyone attending in person or remotely, and to interactions in the issues.

Support

This meeting is made possible by generous support (financial, moral, and technical) from the European Research Council (grant ERC-2011-StG 283794-QUANTESS), the LSE's Social and Economic Data Sciences unit, and rOpenSci.

Contributors

amatsuo, gogamza, haiyanlw, jeroen, karthik, kbenoit


Issues

Proposal for Interoperability Formats

We are proposing three formats for interoperability of text data between packages.

Corpus - a normal data frame with S3 class equal to c("corpus", "data.frame"). It has no row names and has at least two columns. The first column is called doc_id and is a character vector with UTF-8 encoding; document ids must be unique. The second column is called text and must also be a UTF-8 encoded character vector. Each document is represented by a single string in text, i.e. one row of the data frame. Additional document-level metadata columns and corpus-level attributes are allowed but not required.
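For concreteness, here is a minimal sketch of a conforming corpus object (the column values are illustrative):

```r
# A minimal corpus conforming to the proposed format; values are illustrative
corpus <- data.frame(
  doc_id = c("doc1", "doc2"),                        # unique ids, UTF-8
  text   = c("First document.", "Second document."), # one string per document
  author = c("KB", "JO"),                            # optional metadata column
  stringsAsFactors = FALSE
)
class(corpus) <- c("corpus", "data.frame")
rownames(corpus) <- NULL  # keep only automatic row names, per the spec
```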

Document Term Matrix - should be a sparse numeric matrix with document ids as row names and terms as column names, both character vectors, with one row name per document and one column name per term. Document ids and terms must be unique. We suggest the dgCMatrix class from the Matrix package.
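For example, using Matrix::sparseMatrix(), which returns a dgCMatrix by default (the counts here are illustrative):

```r
library(Matrix)

# Two documents, three terms; entries are illustrative term counts
dtm <- sparseMatrix(
  i = c(1, 1, 2, 2),
  j = c(1, 2, 2, 3),
  x = c(2, 1, 3, 1),
  dimnames = list(c("doc1", "doc2"), c("text", "data", "frame"))
)
class(dtm)  # "dgCMatrix"
```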

Tokens/Annotation - a normal data frame with S3 class equal to c("token", "data.frame"). We define a token to be a single element of a character vector. The first column of this data frame is called doc_id and is a character vector with UTF-8 encoding. The second column is called token_index and is an integer vector, with values starting at 1. The third column is called token and is a UTF-8 encoded character vector. Additional annotations can be provided as further columns. We suggest the following names and data types for common annotations: pos (character vector; UTF-8), lemma (character vector; UTF-8), and sentence_id (integer).
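And a minimal sketch of a conforming tokens object:

```r
# A minimal tokens object conforming to the proposed format
tokens <- data.frame(
  doc_id      = rep("doc1", 3),              # UTF-8 character
  token_index = 1:3,                         # integer, starting at 1
  token       = c("First", "document", "."), # UTF-8 character
  pos         = c("ADJ", "NOUN", "PUNCT"),   # optional annotation column
  sentence_id = rep(1L, 3),                  # optional annotation column
  stringsAsFactors = FALSE
)
class(tokens) <- c("token", "data.frame")
```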

Functionality and performance benchmark challenge

Here we would agree on a series of tests using the same test text(s), and compare performance on a set of agreed tasks. This would provide a nice way of seeing how the various packages and approaches handle each function, and of course also a basis for comparing notes on performance.

Tasks could include:

  • tokenizing
  • creating ngrams and skip grams
  • tagging parts of speech, recognising named entities, etc.
  • feature selection (removing the same set of stopwords)
  • creating a document-term/feature matrix
  • creating term co-occurrence matrices
  • computing similarities and distances among documents
  • computing text-based statistics (e.g. reading ease, lexical diversity)
  • applying dictionaries
  • finding key words and other pattern searches
  • estimating models (text2vec, scaling models, supervised machine learning, whatever)

We'd need to agree on a set of clean, appropriate texts, of course. We could set up a protocol for writing functions for one's own package, and call those in sections of an .Rmd file we edit in the repo.
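As a sketch, one task section of such an .Rmd could time the same tokenization call across packages, e.g. with the microbenchmark package (the packages and calls below are plausible examples only; each author would supply their own wrapper):

```r
# A sketch of one benchmark task: word tokenization of a shared test text
library(microbenchmark)

txt <- rep("The quick brown fox jumps over the lazy dog.", 1000)

microbenchmark(
  tokenizers = tokenizers::tokenize_words(txt),
  quanteda   = quanteda::tokens(txt),
  stringi    = stringi::stri_split_boundaries(txt, type = "word"),
  times = 10
)
```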

Asian languages roundtable and code challenge

What if we set out a set of test texts in Chinese, Japanese, and Korean for testing by the various packages? The challenge could be, as applicable (a baseline sketch follows the list):

  • tokenise words and sentences
  • tag parts of speech
  • select tokens through pattern matching
  • create a document-term matrix
  • attempt analysis using the document-term matrix or other constructed data objects
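As a baseline for the tokenisation task, here is a sketch using stringi's ICU word-boundary rules, which segment CJK scripts without relying on whitespace; dedicated morphological analysers (e.g. MeCab- or Kuromoji-based packages) would be the more interesting entries:

```r
library(stringi)

# ICU break iteration segments Japanese without whitespace cues;
# skip_word_none drops boundary pieces containing no word characters
stri_split_boundaries("東京都に住んでいます", type = "word", skip_word_none = TRUE)
```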

Antiword

The antiword command-line tool, which extracts text from binary Word (.doc) files, could easily be wrapped into an R package.
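A sketch of what the wrapper's interface could look like (the package and function names below are assumptions, not an existing API):

```r
# Hypothetical interface for an antiword wrapper package
# install.packages("antiword")
text <- antiword::antiword("report.doc")  # extract the text of a .doc file
cat(text)
```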

Package interoperability

I think a key issue to discuss is how to make R text packages interoperable, so that new packages extend functionality rather than compete with one another, and so that objects created in one package can be used with a minimum of conversion in other packages. Some areas to discuss:

  • corpus objects, both in-memory and out-of-memory
  • tokenization
  • document/term (or feature) matrices
  • output of models

A potential output of such a session would be a draft set of best practices, and perhaps a list of key places where objects are not currently interoperable but could be fixed by improvements to existing R packages.

Parsing/POS Tagging in R

Hi All,

I am hoping we can convene the folks working on NLP packages in R for a discussion of the current state of the art, and support for various underlying POS tagging/parsing libraries (spaCy, OpenNLP, CoreNLP, etc.). I tend to rely on Stanford's CoreNLP and other less portable options in my research, and would really like to know more about the current frontier in tagging/parsing in R.

Best,
Matt Denny
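As one point of reference, here is a sketch of calling spaCy from R via the spacyr package (this assumes spaCy and its Python dependencies are installed; the API may differ across versions):

```r
library(spacyr)
spacy_initialize()  # start a spaCy backend via Python

# One row per token, with doc_id, token, lemma, pos, ... columns
parsed <- spacy_parse("The cat sat on the mat.", pos = TRUE)
parsed

spacy_finalize()    # shut the backend down
```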

Wrapping C/C++ libraries

If people know of any useful C/C++ libs that would be nice to wrap into an R package, I am happy to assist with that!
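For anyone new to this, a minimal sketch of binding C++ into an R session with Rcpp (the function is a trivial stand-in for a real text-processing routine):

```r
library(Rcpp)

# Compile a C++ function and bind it into the R session
cppFunction("
int countSpaces(std::string x) {
  int n = 0;
  for (size_t i = 0; i < x.size(); ++i)
    if (x[i] == ' ') ++n;
  return n;
}")

countSpaces("wrap C/C++ libraries in R")  # 4
```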
