
text-miner's Introduction


text-miner

text mining utilities for node.js

Introduction

The text-miner package can be easily installed via npm:

npm install text-miner

To require the module in a project, we can use the expression

var tm = require( 'text-miner' );

Corpus

The fundamental data type in the text-miner module is the Corpus. An instance of this class wraps a collection of documents and provides several methods to interact with this collection and to perform processing tasks such as stemming and stopword removal.

A new corpus is created by calling the constructor

var my_corpus = new tm.Corpus([]);

where [] is an array of text documents that form the data of the corpus. The class supports method chaining, so that multiple methods can be invoked one after another, e.g.

my_corpus
	.trim()
	.toLower()

The following methods and properties are part of the Corpus class:

Methods

.addDoc(doc)

Adds a single document to the corpus. The document has to be a string.
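
For example:

my_corpus.addDoc( 'Insert a new document into the corpus.' );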

.addDocs(docs)

Adds a collection of documents (in the form of an array of strings) to the corpus.
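
For example:

my_corpus.addDocs( [ 'First additional document.', 'Second additional document.' ] );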

.clean()

Strips extra whitespace from all documents, leaving at most one whitespace character between any two other characters.
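
For instance (the document text below is purely illustrative):

var c = new tm.Corpus( [ 'This   has   extra   whitespace.' ] );
c.clean();
// the document now reads 'This has extra whitespace.'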

.map(fun)

Applies the function fun to each document in the corpus and replaces each document with the result of its respective function call.
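
As a sketch, one could append a marker to every document (the transformation itself is arbitrary):

my_corpus.map( function( doc ) {
	// return value becomes the new document text
	return doc + ' [processed]';
});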

.removeInterpunctuation()

Removes interpunctuation characters (! ? . , ; -) from all documents.

.removeNewlines()

Removes newline characters (\n) from all documents.

.removeWords(words[, case_insensitive])

Removes all words in the supplied words array from all documents. This function is usually invoked to remove stopwords. For convenience, the text-miner package ships with a list of stopwords for different languages. These are stored in the STOPWORDS object of the module.

Currently, stopwords for the following languages are included:

STOPWORDS.DE
STOPWORDS.EN
STOPWORDS.ES
STOPWORDS.IT

As a concrete example, we could remove all English stopwords from the corpus my_corpus as follows:

my_corpus.removeWords( tm.STOPWORDS.EN )

The second (optional) parameter, case_insensitive, expects a Boolean indicating whether to ignore case. The default value is false.
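
To ignore case when matching stopwords, pass true as the second argument:

my_corpus.removeWords( tm.STOPWORDS.EN, true );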

.removeDigits()

Removes any digits occurring in the texts.

.removeInvalidCharacters()

Removes all characters which are unknown or unrepresentable in Unicode.

.stem(type)

Performs stemming of the words in each document. Two stemmers are supported: Porter and Lancaster. The former is the default option. Passing "Lancaster" to the type parameter of the function ensures that the latter is used.
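
Both options in code:

my_corpus.stem();              // Porter stemmer (default)
my_corpus.stem( 'Lancaster' ); // Lancaster stemmer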

.toLower()

Converts all characters in the documents to lower-case.

.toUpper()

Converts all characters in the documents to upper-case.

.trim()

Strips off whitespace at the beginning and end of each document.

DocumentTermMatrix / TermDocumentMatrix

We can pass a corpus to the DocumentTermMatrix or TermDocumentMatrix constructor in order to create a document-term matrix or a term-document matrix, respectively. Objects derived from either share the same methods, but differ in how the underlying matrix is represented: a DocumentTermMatrix has rows corresponding to documents and columns corresponding to words, whereas a TermDocumentMatrix has rows corresponding to words and columns corresponding to documents.

var terms = new tm.DocumentTermMatrix( my_corpus );

An instance of either DocumentTermMatrix or TermDocumentMatrix has the following properties and methods:

Properties

.vocabulary

An array holding all the words occurring in the corpus, in an order corresponding to the column entries of the document-term matrix (or the row entries of the term-document matrix).

.data

The document-term or term-document matrix, implemented as a nested array in JavaScript. For a document-term matrix, rows correspond to individual documents, while each column index corresponds to the respective word in vocabulary. Each entry of data holds the number of times the word appears in the respective document. The array is sparse: each entry which is undefined corresponds to a count of zero.
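
As a rough sketch (the exact vocabulary ordering and tokenization are implementation details, so the values below are only illustrative):

var corpus = new tm.Corpus( [ 'the cat sat', 'the dog sat sat' ] );
var dtm = new tm.DocumentTermMatrix( corpus );
// dtm.vocabulary might look like: [ 'the', 'cat', 'sat', 'dog' ]
// dtm.data might then look like:
// [ [ 1, 1, 1 ],
//   [ 1, undefined, 2, 1 ] ]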

.nDocs

The number of documents in the matrix.

.nTerms

The number of distinct words appearing in the documents.

Methods

.findFreqTerms( n )

Returns all terms which appear n or more times in the corpus, in alphabetical order. The return value is an array of objects of the form {word: "<word>", count: <number>}.
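
For example, to retrieve all terms appearing at least twice (the output shown is illustrative):

var frequent = terms.findFreqTerms( 2 );
// e.g. [ { word: 'sat', count: 3 }, { word: 'the', count: 2 } ]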

.removeSparseTerms( percent )

Removes all words from the document-term matrix which appear in fewer than percent of the documents.
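
A sketch, assuming percent is supplied as a fraction between 0 and 1:

terms.removeSparseTerms( 0.1 ); // drop words appearing in fewer than 10% of the documents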

.weighting( fun )

Applies a weighting scheme to the entries of the document-term matrix. The weighting method expects a function as its argument, which is then applied to the entries of the document-term matrix. Currently, the function weightTfIdf, which calculates the term frequency-inverse document frequency (tf-idf) for each word, is the only built-in weighting function.
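
For example, to apply tf-idf weighting:

terms.weighting( tm.weightTfIdf );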

.fill_zeros()

Turns the document-term matrix into a non-sparse matrix by replacing each undefined value with zero and saves the result.
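
For example:

terms.fill_zeros();
// all undefined entries in terms.data are now 0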

Utils

The module exports several other utility functions.

.expandContractions( str )

Replaces all English contractions occurring in str with their expanded equivalents, e.g. "don't" is changed to "do not". The resulting string is returned.
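
Using the example from above:

var expanded = tm.expandContractions( "I don't know" );
// expanded should now read 'I do not know'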

.weightTfIdf( terms )

Weights the terms of a document-term or term-document matrix by term frequency-inverse document frequency (tf-idf). Mutates the input DocumentTermMatrix or TermDocumentMatrix object.

Data

.STOPWORDS

An object with four keys: DE, EN, ES and IT, each of which is an array of stopwords for the German, English, Spanish and Italian language, respectively.

{
	"EN": [
		"a",
		"a's",
		"able",
		"about",
		"above",
		// (...)  
	],
	"DE": [
		// (...)
	],
	// (...)
}

.CONTRACTIONS

The keys of the CONTRACTIONS object are the contracted expressions and the corresponding values are arrays of the possible expansions.

{
	"ain't": ["am not", "are not", "is not", "has not","have not"],
	"aren't": ["are no", "am not"],
	"can't": ["cannot"],
	// (...)
}

Unit Tests

Run tests via the command npm test


License

MIT license.

text-miner's People

Contributors

dependabot[bot], mcariatm, planeshifter, praneethmendu


text-miner's Issues

Add Snowball (Porter2) Stemmer as an option

You currently offer the widely used Porter stemmer and also the Lancaster stemmer. The former is less aggressive and the latter is oftentimes too aggressive. It would be nice to implement Snowball (Porter2) as well, which sits in the middle in terms of aggressiveness as well as performance.

Also, awesome library. Thanks for making it. I'm incorporating it into a user-friendly GUI app at the moment :)

tm.Terms is not a constructor when running examples/get_terms.js and creating_corpus.js

var terms = new tm.Terms( corpus );
            ^

TypeError: tm.Terms is not a constructor
    at Object.<anonymous> (D:\coding\pku\text-miner\examples\get_terms.js:10:13)
    at Module._compile (internal/modules/cjs/loader.js:701:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:712:10)
    at Module.load (internal/modules/cjs/loader.js:600:32)
    at tryModuleLoad (internal/modules/cjs/loader.js:539:12)
    at Function.Module._load (internal/modules/cjs/loader.js:531:3)
    at Function.Module.runMain (internal/modules/cjs/loader.js:754:12)
    at startup (internal/bootstrap/node.js:283:19)
    at bootstrapNodeJSCore (internal/bootstrap/node.js:622:3)

Frequent substring mining and collocation extraction

Does text-miner have a method to find frequent substrings in a list of strings?
I found several implementations of this algorithm on Stack Overflow, but I don't know if this library includes such an algorithm.

I suppose that the same algorithm could be used for collocation extraction, searching for sequences of words instead of sequences of characters.

Corpus#removeWords is not working properly with unicode characters

Observed

If you have a word like zurück in your documents, and you have this set of words to remove: ['zur'],
then this step will remove zur inside the word, converting zurück into ück.
That happens because the function uses word boundaries (\b), which are known not to work with Unicode.

Expected

  • the function uses a Unicode-compatible regexp.

stemCompletion?

Hi,

I really like this package, it would be nice if you could implement a stemCompletion method.

Best,

readme typo

The readme says removeSparseWords but the code is removeSparseTerms.

Stopwords key issue

Hey man, just a quick heads-up: on v0.2.x there is an object structure issue with STOPWORDS. It looks like it is meant to be structured like this:

tm = { STOPWORDS: { EN: ..., ES: ..., } }

But it is actually structured like this:

tm = { STOPWORDS:{ STOPWORDS:{ EN: ... } } }

Not a huge issue, but figured you'd want a heads up so you can edit the documentation

Way to export documents / data from Corpus

I have been trying to use this lib to perform some basic cleanup and processing on text datasets. I am having trouble figuring out how to get the documents/text back out of the corpus after some functions have been applied.

const tm = require('text-miner');

const comments = ['here is the string value 1',
  'Another second string to be processed',
  'One more sentence'
];

let corpus = new tm.Corpus(comments);
corpus = corpus
  .clean()
  .toLower()
  .removeInterpunctuation()
  .removeWords(tm.STOPWORDS.EN);

// TODO what is the function or way to get the cleaned data back out?
// ie. for example something like this?
let cleanMessages = corpus.getDocuments();

Text Miner returns "empty" vocabulary item instead of first stopword

I've tested this a few times with different stopwords and configurations and can get repeated result.

var corpus = new TextMiner.Corpus([])

corpus.addDoc("wat cash money you go to boots and cats and dogs with me")
corpus.removeWords(TextMiner.STOPWORDS.EN)

var terms = new TextMiner.Terms(corpus)

=> [ 'wat', 'cash', 'money', '', 'boots', 'cats', 'dogs' ]

But if I move the first English stopword, you, two words to the left, the "blank" word shifts two places to the left:

corpus.addDoc("wat you cash money go to boots and cats and dogs with me")

=> [ 'wat', '', 'cash', 'money', 'boots', 'cats', 'dogs' ]

typo in code?

Hi,
line 97, self.dtm[d2] = self.dtm[d2].splice(w, 0.1);
Shouldn't it be self.dtm[d2] = self.dtm[d2].splice(w, 1); ?
Best,
