post-processing

This set of tools post-processes NMF outputs from SmallK.

Topic Modeling

The topic_modeling/ directory contains applications useful for extracting topic modeling results from NMF input and output files.

If NMF was run on a sparse term-document matrix and the associated dictionary and documents files exist, the output W and H matrices can be combined with these original files to generate topic models.
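As a rough illustration of how a factor matrix and a dictionary file combine into topics, the sketch below picks the highest-weighted terms in each column of W. It assumes W has one row per dictionary term and one column per topic; the function name `top_terms_per_topic` and the toy data are hypothetical and are not taken from generate_outputs.py.

```python
def top_terms_per_topic(w_rows, dictionary, maxterms):
    """Return the maxterms highest-weighted terms for each column of W.

    w_rows: rows of W, one per term, each holding k topic weights (assumed layout).
    dictionary: list of terms, where dictionary[i] labels row i of W.
    """
    k = len(w_rows[0])
    topics = []
    for topic in range(k):
        # Rank term indices by their weight in this topic's column, largest first.
        ranked = sorted(
            range(len(w_rows)),
            key=lambda i: w_rows[i][topic],
            reverse=True,
        )
        topics.append([dictionary[i] for i in ranked[:maxterms]])
    return topics


# Hypothetical toy data: 4 terms, 2 topics.
dictionary = ["apple", "banana", "engine", "wheel"]
W = [
    [0.9, 0.0],
    [0.7, 0.1],
    [0.0, 0.8],
    [0.2, 0.6],
]
topics = top_terms_per_topic(W, dictionary, maxterms=2)
print(topics)
```

The `maxterms` argument here plays the same role as the `--maxterms` option described below: it caps how many terms are reported per topic.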

## generate_outputs.py

Running the application without any arguments will present a list of required and optional input arguments.

python topic_modeling/generate_outputs.py

Output:

usage: generate_outputs.py [-h]
                           --dictpath dictpath
                           --docpath docpath
                           --matrixpath matrixpath
                           --infile_w infile_w
                           --infile_h infile_h
                           --maxterms maxterms
                           [--outdir outdir]
                           [--tabstrip tabstrip]
                           [--format format]

### Inputs

dictpath: filepath for the input dictionary file, such as dictionary.txt

docpath: filepath for the input documents file, such as documents.txt

matrixpath: filepath for the input matrix file, such as matrix.mtx

infile_w: filepath for the NMF W matrix output, such as w.csv

infile_h: filepath for the NMF H matrix output, such as h.csv

maxterms: number of top terms desired in the output files

[outdir]: output directory, defaults to the current directory

[tabstrip]: whether or not to strip content from the dictionary file beyond the first tab, defaults to False

[format]: cluster topics output file format (XML or JSON), defaults to XML
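To make the tabstrip option concrete, here is a minimal sketch of what stripping content beyond the first tab could look like. The helper name `load_dictionary` and the sample lines are hypothetical; the actual parsing inside generate_outputs.py may differ.

```python
def load_dictionary(lines, tabstrip=False):
    """Return one term per input line, optionally dropping text after the first tab."""
    terms = []
    for line in lines:
        term = line.rstrip("\n")
        if tabstrip:
            # Keep only the text before the first tab character.
            term = term.split("\t", 1)[0]
        terms.append(term)
    return terms


# Hypothetical dictionary lines where some terms carry extra tab-separated columns.
sample = ["apple\t42\n", "banana\t17\textra\n", "cherry\n"]
stripped = load_dictionary(sample, tabstrip=True)
unstripped = load_dictionary(sample)
print(stripped)
```

With stripping enabled, only the leading term of each line would reach the top-terms output; without it, the full line is kept.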

### Outputs

This application will generate three files:

assignments_fuzzy_[k].csv: This file contains one line per document with k elements, where the i-th element is the probability that the document belongs to the i-th cluster. This is the soft clustering result.

assignments_[k].csv: This file contains one element per document, representing the cluster to which that document was assigned. This is the hard clustering result.

clusters_[k].{xml,json}: This file contains JSON or XML formatted data that represents the cluster ids, the number of documents per cluster, and the top terms per cluster.
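The relationship between the soft and hard results can be sketched as follows. This assumes that a document's fuzzy row is its non-negative cluster weights normalized to sum to 1, and that the hard assignment is the cluster with the largest probability. Both are assumptions made for illustration rather than confirmed details of generate_outputs.py, and the function names are hypothetical.

```python
def fuzzy_assignment(weights):
    """Normalize non-negative cluster weights so they sum to 1."""
    total = sum(weights)
    if total == 0:
        # Arbitrary choice for this sketch: spread probability evenly.
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


def hard_assignment(probabilities):
    """Return the index of the most probable cluster."""
    return max(range(len(probabilities)), key=lambda c: probabilities[c])


# Hypothetical weights for one document over k = 3 clusters.
doc_weights = [1.0, 3.0, 0.0]
fuzzy_row = fuzzy_assignment(doc_weights)
hard_label = hard_assignment(fuzzy_row)
print(fuzzy_row, hard_label)
```

Under these assumptions, each line of assignments_fuzzy_[k].csv would correspond to one `fuzzy_row`, and each element of assignments_[k].csv to one `hard_label`.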

### Usage

A sample run of this application might look like the following:

python topic_modeling/generate_outputs.py \
	--dictpath test_data/dictionary.txt \
	--docpath test_data/documents.txt \
	--matrixpath test_data/matrix.mtx \
	--infile_w test_data/w.csv \
	--infile_h test_data/h.csv \
	--maxterms 5 \
	--outdir results

This would generate the following files:

results/
	clusters_10.xml
	assignments_fuzzy_10.csv
	assignments_10.csv

## reduce_outputs.py

Running the application without any arguments will present a list of required and optional input arguments.

python topic_modeling/reduce_outputs.py

Output:

usage: reduce_outputs.py [-h]
                         --topicspath topicspath
                         --fuzzypath fuzzypath
                         [--outdir outdir]

### Inputs

topicspath: filepath for the input topics file, such as clusters_10.json

fuzzypath: filepath for the input fuzzy assignments file, such as assignments_fuzzy_10.csv

[outdir]: output directory, defaults to the current directory

### Outputs

This application will generate two files:

assignments_labels_[k].csv: This file contains one line per document with k cluster ids, ordered by decreasing probability that the document belongs to each cluster. This is the soft clustering result. For example, a particular line might look like

6,5,2,8,3,9,1,4,7,0

This would mean that the document is most likely to belong to cluster 6, then cluster 5, and so on.

clusters_[k].csv: This file contains one line per cluster with top terms for that cluster separated by commas.
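The ordering used in assignments_labels_[k].csv can be reproduced from a single fuzzy assignment row by sorting cluster ids by decreasing probability. The sketch below is illustrative only: the function name `rank_clusters` and the probability values are hypothetical, chosen so that the result matches the example line above.

```python
def rank_clusters(probabilities):
    """Return cluster ids ordered from most to least probable."""
    return sorted(
        range(len(probabilities)),
        key=lambda c: probabilities[c],
        reverse=True,
    )


# Hypothetical fuzzy row for one document with k = 10 clusters.
fuzzy_row = [0.01, 0.04, 0.15, 0.08, 0.03, 0.20, 0.30, 0.02, 0.12, 0.05]
ranking = rank_clusters(fuzzy_row)
label_line = ",".join(str(c) for c in ranking)
print(label_line)  # 6,5,2,8,3,9,1,4,7,0
```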

### Usage

A sample run of this application might look like the following:

python topic_modeling/reduce_outputs.py \
	--topicspath results/clusters_10.xml \
	--fuzzypath results/assignments_fuzzy_10.csv \
	--outdir reduced_results

This would generate the following files:

reduced_results/
	clusters_10.csv
	assignments_labels_10.csv
