
textworkshop17: Text Analysis R Developers' Workshop

London, April 21-22, 2017 (invitation only)


Welcome to the repository for the Text Analysis R Developers' Workshop. This event is supported by European Research Council grant ERC-2011-StG 283794-QUANTESS and the LSE's Social and Economic Data Sciences unit.

  • Participants
  • Schedule
  • Please post ideas for projects, discussion topics, and sessions as issues. These can be suggestions of any kind: things you would like to work on or discuss, questions for other participants, etc.

The event hashtag is #RtextSIG17.

Code of conduct

To ensure a safe, enjoyable, and friendly experience for everyone who participates, we have a code of conduct. It applies to everyone attending in person or remotely, and to interactions in the issues.

Support

This meeting is made possible by generous support (financial, moral, and technical) from the European Research Council (grant ERC-2011-StG 283794-QUANTESS), the LSE's Social and Economic Data Sciences unit, and rOpenSci.

Contributors

amatsuo, gogamza, haiyanlw, jeroen, karthik, kbenoit


Issues

Proposal for Interoperability Formats

We are proposing three formats for interoperability of text data between packages.

Corpus - a normal data frame with S3 class equal to c("corpus", "data.frame"). It has no row names and has at least two columns. The first column is called doc_id and is a character vector with UTF-8 encoding; document ids must be unique. The second column is called text and must also be a UTF-8 encoded character vector. Each document is represented by a single string in text, i.e. one row of the data frame. Additional document-level metadata columns and corpus-level attributes are allowed but not required.
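For concreteness, here is a minimal sketch of a conforming corpus object (the column values are illustrative):

```r
# A minimal corpus conforming to the proposed format; values are illustrative
corpus <- data.frame(
  doc_id = c("doc1", "doc2"),                        # unique ids, UTF-8
  text   = c("First document.", "Second document."), # one string per document
  author = c("KB", "JO"),                            # optional metadata column
  stringsAsFactors = FALSE
)
class(corpus) <- c("corpus", "data.frame")
rownames(corpus) <- NULL  # keep only automatic row names, per the spec
```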

Document Term Matrix - should be a sparse numeric matrix with document ids as row names and terms as column names, both character vectors, with one row name per document and one column name per term. Document ids and terms must be unique. We suggest the dgCMatrix class from the Matrix package.
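For example, using Matrix::sparseMatrix(), which returns a dgCMatrix by default (the counts here are illustrative):

```r
library(Matrix)

# Two documents, three terms; entries are illustrative term counts
dtm <- sparseMatrix(
  i = c(1, 1, 2, 2),
  j = c(1, 2, 2, 3),
  x = c(2, 1, 3, 1),
  dimnames = list(c("doc1", "doc2"), c("text", "data", "frame"))
)
class(dtm)  # "dgCMatrix"
```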

Tokens/Annotation - a normal data frame with S3 class equal to c("token", "data.frame"). We define a token to be a single element of a character vector. The first column of this data frame is called doc_id and is a character vector with UTF-8 encoding. The second column is called token_index and is an integer vector, with values starting at 1. The third column is called token and is a UTF-8 encoded character vector. Additional annotations can be provided as further columns. We suggest the following names and data types for common annotations: pos (character vector; UTF-8), lemma (character vector; UTF-8), and sentence_id (integer).
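And a minimal sketch of a conforming tokens object:

```r
# A minimal tokens object conforming to the proposed format
tokens <- data.frame(
  doc_id      = rep("doc1", 3),              # UTF-8 character
  token_index = 1:3,                         # integer, starting at 1
  token       = c("First", "document", "."), # UTF-8 character
  pos         = c("ADJ", "NOUN", "PUNCT"),   # optional annotation column
  sentence_id = rep(1L, 3),                  # optional annotation column
  stringsAsFactors = FALSE
)
class(tokens) <- c("token", "data.frame")
```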

Functionality and performance benchmark challenge

Here we would agree on a series of tests using the same test text(s), and compare performance on a set of agreed tasks. This would provide a nice way of seeing how the various packages and approaches handle each function, and of course also a basis for comparing notes on performance.

Tasks could include:

  • tokenizing
  • creating ngrams and skip grams
  • tagging parts of speech, recognising named entities, etc.
  • feature selection (removing the same set of stopwords)
  • creating a document-term/feature matrix
  • creating term co-occurrence matrices
  • computing similarities and distances among documents
  • computing text-based statistics (e.g. reading ease, lexical diversity)
  • applying dictionaries
  • finding key words and other pattern searches
  • estimating models (text2vec, scaling models, supervised machine learning, whatever)

We'd need to agree on a set of clean, appropriate texts, of course. We could set up a protocol for writing functions for one's own package, and call those in sections of an .Rmd file we edit in the repo.
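As a sketch, one task section of such an .Rmd could time the same tokenization call across packages, e.g. with the microbenchmark package (the packages and calls below are plausible examples only; each author would supply their own wrapper):

```r
# A sketch of one benchmark task: word tokenization of a shared test text
library(microbenchmark)

txt <- rep("The quick brown fox jumps over the lazy dog.", 1000)

microbenchmark(
  tokenizers = tokenizers::tokenize_words(txt),
  quanteda   = quanteda::tokens(txt),
  stringi    = stringi::stri_split_boundaries(txt, type = "word"),
  times = 10
)
```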

Asian languages roundtable and code challenge

What if we set out a set of test texts in Chinese, Japanese, and Korean for testing by the various packages? The challenge could be, as applicable (a baseline sketch follows the list):

  • tokenise words and sentences
  • tag parts of speech
  • select tokens through pattern matching
  • create a document-term matrix
  • attempt analysis using the document-term matrix or other constructed data objects
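As a baseline for the tokenisation task, here is a sketch using stringi's ICU word-boundary rules, which segment CJK scripts without relying on whitespace; dedicated morphological analysers (e.g. MeCab- or Kuromoji-based packages) would be the more interesting entries:

```r
library(stringi)

# ICU break iteration segments Japanese without whitespace cues;
# skip_word_none drops boundary pieces containing no word characters
stri_split_boundaries("東京都に住んでいます", type = "word", skip_word_none = TRUE)
```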

Antiword

The antiword command-line tool, which extracts text from binary Word (.doc) files, could easily be wrapped into an R package.
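A sketch of what the wrapper's interface could look like (the package and function names below are assumptions, not an existing API):

```r
# Hypothetical interface for an antiword wrapper package
# install.packages("antiword")
text <- antiword::antiword("report.doc")  # extract the text of a .doc file
cat(text)
```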

Package interoperability

I think a key issue to discuss is how to make R text packages interoperable, so that new packages extend functionality rather than compete with one another, and so that objects created in one package can be used with a minimum of conversion in other packages. Some areas to discuss:

  • corpus objects, both in-memory and out-of-memory
  • tokenization
  • document/term (or feature) matrices
  • output of models

A potential output of such a session would be a draft set of best practices, and perhaps a list of key places where objects are not currently interoperable but could be fixed by improvements to existing R packages.

Parsing/POS Tagging in R

Hi All,

I am hoping we can convene the folks working on NLP packages in R for a discussion of the current state of the art, and support for various underlying POS tagging/parsing libraries (spaCy, OpenNLP, CoreNLP, etc.). I tend to rely on Stanford's CoreNLP and other less portable options in my research, and would really like to know more about the current frontier in tagging/parsing in R.

Best,
Matt Denny
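As one point of reference, here is a sketch of calling spaCy from R via the spacyr package (this assumes spaCy and its Python dependencies are installed; the API may differ across versions):

```r
library(spacyr)
spacy_initialize()  # start a spaCy backend via Python

# One row per token, with doc_id, token, lemma, pos, ... columns
parsed <- spacy_parse("The cat sat on the mat.", pos = TRUE)
parsed

spacy_finalize()    # shut the backend down
```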

Wrapping C/C++ libraries

If people know of any useful C/C++ libs that would be nice to wrap into an R package, I am happy to assist with that!
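For anyone new to this, a minimal sketch of binding C++ into an R session with Rcpp (the function is a trivial stand-in for a real text-processing routine):

```r
library(Rcpp)

# Compile a C++ function and bind it into the R session
cppFunction("
int countSpaces(std::string x) {
  int n = 0;
  for (size_t i = 0; i < x.size(); ++i)
    if (x[i] == ' ') ++n;
  return n;
}")

countSpaces("wrap C/C++ libraries in R")  # 4
```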
