Coder Social home page Coder Social logo

tmt's Introduction

tmt

The aim of tmt is to provide some added functionality to the tm package by facilitating the cleaning and analysis of messy text data. You can track and contribute to development of tmt at https://github.com/schaunwheeler/tmt.

Package text manipulation tools

Housecleaning:

  • SetEnvVarW searches for external applications necessary to run various text cleaning and analysis functions. It only works on Windows and should be used with extreme caution, as it has not been extensively tested.

  • NOTE: if switching between a Windows operating system and some other operating system, text encoding can get confused. Use the following to avoid some of that confusion:

    Sys.setlocale('LC_ALL', 'C')

Preparation:

  • MakeWordLists take lists of sentiment-laden words from General Inquirer (http://www.wjh.harvard.edu/~inquirer/), the list compiled by Finn Årup (http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010), or the list compiled by Bing Liu (http://www.cs.uic.edu/~liub/FBS/sentiment-analysis. html#lexicon) - or from all three lists (the default). More information about these lists and other can be found at http://sentiment.christopherpotts.net/ lexicons.html. The function combines and de-dupes the lists, preprocesses the words to make them compatible with use a regular expression patterns, and deconflicts the sentiment words with the list of common English stop words from the tm package. It returns a list with three elements - 'pos', 'neg', and 'stops'.

  • CleanText is a general set of text-cleaning rules. It changes everything to lower case, removes numbers and punctuation except in contractions, ensures that most contractions with omitted apostrophes get those apostrophes inserted, and removes specified stop words. A 'neutralize' option replaces all empty strings with the word 'neutral' (for purposes of the sentiment-analysis function in this package). A vector of words to remove can also be passed to the function.

  • PdfToText converts PDF files to text files. It's argument, 'pdfloc' is the path of the folder in which the PDFs to convert are located. As a default, it removes the PDF versions of the files after they have been converted to text.

Spelling Correction:

  • aspellCheck is a wrapper for the aspell() function in the R utils package. It takes as input a single character string. The output has three modes. "eval" returns a logical vector indicating whether each word (words are delineated by single spaces) was found in the dictionary. "sugg" returns a list where each misspelled word is given all suggested alternatives, and where each correctly spelled word is given NA. "fix" replaces each misspelled word with the word suggested as its most likely alternative. Proper nouns are not considered viable alternatives. The 'sep' option take a logical value and specifies whether two separate words should be considered a viable alternative to a misspelled word. 'cap.flag' takes one of three values: 'none' (the default) does nothing; 'first' tells the function to ignore all words that start with a capital letter; "all" tells the function to ignore all words that are entirely composed of capital letters. This allows subject-specific words to survive the spell check. The 'ignore' option takes a character vector and, like cap.flag, give the function a list of words to ignore. 'split.missing' takes a logical value. When set to TRUE, it makes a call to SplitWords() function in this package in each case where a word is not found in the dictionary and a viable alternative cannot be found.

  • aplitWords takes a character vector, splits it into single words (delineated by spaces), keeps all correctly spelled words the same, and splits all misspelled words into multiple words. It does this by splitting the original word into each possible combination of two words, taking the longest first-word option that is recognized in the dictionary, and then repeating the process for the remaining characters. A 'correction' option take a logical value. If TRUE, once a word is split, it checks the remaining characters against the dictionary and, if misspelled but having at least one viable alternative, that alternative is used instead of continuing the splitting.

  • NOTE: The aspellCheck() and splitWords() functions requires that aspell be
    installed and on the PATH environmental variable. On Windows, download aspell and dictionaries from http://aspell.net/win32/, and set the environmental PATH variable to include aspell, possibly using the SetEnvVarW() function in this package. On a Mac, download Cocoaspell from http://cocoaspell.leuski.net/, then go into the folder where the dictionaries are kept (probably library/ Application/Support/cocoAspell/aspell6-en-6.0-0/) and copy all of the dictionary files. Then open the Finder, click on Go>Go to Folder, type "/usr", then navigate to /usr/loca/lib/aspell-0.60/ and paste them all in that directory.

  • completeStem combined aspell() with the tm package's stemCompletion() function. Given a vector of sample text, it compiles that text into a dictionary. It then evaluates each word from a character string of stemmed (as well as unstemmed, if you like) words using aspell(). If a word is found in the dictionary, the function keeps it. If the word is not found in the dictionary, the function calls stemCompletion to complete it.

Basic Sentiment Analysis:

  • GetSentiment takes a vector of texts, a list of positive-sentiment words, and a list of negative-sentiment words, and uses regular-expressions to score each text. It returns a data frame of scores for each text: 'words' gives the number of words in each text; 'positives' and 'negatives' give the number of positive or negative words, respectively, in each text; 'polarity' gives the differences between the number of positive and number of negative words in each text, divided by the total number of sentiment-laden (positive or negative) words; 'subjectivity' gives the total number of sentiment-laden words divided by the total number of words; 'positivity' and 'negativity' give the total of positive or negative words, respectively, divided by the total number of words; 'balance' gives the number of positive words minus the number of negative words, divided by the total number of words.

  • PlotSentiment takes a vector of texts that have already been scored (and assumes the scores run from negative to positive with zero as a neutral point) and plots the top n words, as measured either by frequency or polarity. The default is to use binary measurement - to count a word occurrence only once per text. If binary is set to FALSE, then the results will reflect total word occurrences within the corpus rather than within individual texts.

tmt's People

Contributors

schaunwheeler avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

tmt's Issues

CRAN

Please put this into cran - would be very useful to lots of people !

Installation Error: 'Snowball' is not available for 'tmt'

Hello,

I get the following error and I have not found a solution elsewhere. I'm not sure where to begin. I'm running R 3.2.1 on Windows.

install_github("schaunwheeler/tmt")
    Downloading github repo schaunwheeler/tmt@master
    Installing tmt
    Skipping 1 packages not available: Snowball
    "C:/PROGRA~1/R/R-32~1.1/bin/x64/R" --no-site-file --no-environ --no-save  \
      --no-restore CMD INSTALL  \
      "C:/Users/Ryan/AppData/Local/Temp/RtmpQLvphB/devtoolse8c338658c/schaunwheeler-tmt-     c3d1c8e"  \
      --library="C:/Users/Ryan/Documents/R/win-library/3.2" --install-tests 

    ERROR: dependency 'Snowball' is not available for package 'tmt'
    * removing 'C:/Users/Ryan/Documents/R/win-library/3.2/tmt'
     Error: Command failed (1)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.