bayon's Introduction

Tutorial in Japanese

Tutorial in English

Overview

Bayon is a simple and fast hard-clustering tool.

Bayon supports Repeated Bisection clustering and K-means clustering.

Install

% ./configure
% make
% sudo make install

Usage

Clustering input data

% bayon -n num [options] file
% bayon -l limit [options] file
   -n, --number=num      the number of clusters
   -l, --limit=lim       limit value of cluster bisection
   -p, --point           output similarity points
   -c, --clvector=file   save the vectors of cluster centroids
   --clvector-size=num   max size of output vectors of
                         cluster centroids (default: 50)
   --method=method       clustering method (rb, kmeans), default: rb
   --seed=seed           set a seed for random number generator

Get similar clusters for each input document

% bayon -C file [options] file
   -C, --classify=file   target vectors
   --inv-keys=num        max size of the keys of each vector to be
                         looked up in inverted index (default: 20)
   --inv-size=num        max size of the inverted index of each key
                         (default: 100)
   --classify-size=num   max size of output similar groups
                         (default: 20)

Common options

   --vector-size=num     max size of each input vector
   --idf                 apply idf to input vectors
   -h, --help            show help messages
   -v, --version         show the version and exit
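
The --idf option weights input vectors by inverse document frequency. This README does not state which formula bayon uses, so the sketch below shows only the textbook form, idf(key) = log(N / df(key)), as an assumption. The function name `apply_idf` is hypothetical and not part of bayon.

```python
import math

def apply_idf(vectors):
    """Multiply each value by log(N / df(key)).

    `vectors` maps document_id -> {key: value}.
    Assumption: textbook idf; bayon's own formula may differ.
    """
    n_docs = len(vectors)
    # Document frequency: in how many documents each key occurs.
    df = {}
    for features in vectors.values():
        for key in features:
            df[key] = df.get(key, 0) + 1
    return {
        doc_id: {key: value * math.log(n_docs / df[key])
                 for key, value in features.items()}
        for doc_id, features in vectors.items()
    }
```

Under this form, a key that appears in every document gets weight 0, so it no longer influences the clustering.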

Example

  • clustering (number_of_output_clusters = 100)
% bayon -n 100 input.tsv > cluster.tsv
  • clustering (save vectors of cluster centroids)
% bayon -n 100 -c centroid.tsv input.tsv > cluster.tsv
  • classification (get similar clusters for input documents)
% bayon -C centroid.tsv input.tsv > classify.tsv
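
When bayon is driven from a script, the command lines above can be assembled programmatically. The following is a minimal sketch that uses only the options listed in this README. The helper name `clustering_command` is hypothetical, and requiring exactly one of -n or -l is an assumption of this sketch, based on the two usage forms shown under Usage.

```python
def clustering_command(input_path, number=None, limit=None,
                       centroid_path=None, method=None, seed=None,
                       idf=False, point=False):
    """Build an argument list for a bayon clustering run."""
    # Assumption: the two usage forms (-n / -l) are alternatives.
    if (number is None) == (limit is None):
        raise ValueError("give exactly one of number or limit")
    cmd = ["bayon"]
    if number is not None:
        cmd += ["-n", str(number)]
    else:
        cmd += ["-l", str(limit)]
    if point:
        cmd.append("-p")
    if centroid_path is not None:
        cmd += ["-c", centroid_path]
    if method is not None:
        cmd.append("--method=" + method)
    if seed is not None:
        cmd.append("--seed=" + str(seed))
    if idf:
        cmd.append("--idf")
    cmd.append(input_path)
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, stdout=out_file, check=True)` to write the clusters to a file, as the shell redirection does in the examples above.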

Format of Input Data

List of the vectors of input documents for clustering and classification

document_id1 \t key1-1 \t value1-1 \t key1-2 \t value1-2 \t ...\n
document_id2 \t key2-1 \t value2-1 \t key2-2 \t value2-2 \t ...\n
...
  • document_id : string
  • key : string
  • value : double
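
An input file in this format can be produced from in-memory feature counts. The sketch below assumes that the spaces around \t in the layout above are only there for readability and that fields are separated by a single tab. The function name `format_vector_line` is hypothetical.

```python
def format_vector_line(doc_id, features):
    """Return one input line: document_id, then key/value pairs, tab-separated."""
    fields = [doc_id]
    for key, value in features.items():
        # Tabs or newlines inside a field would corrupt the line structure.
        if "\t" in key or "\n" in key:
            raise ValueError("key must not contain tab or newline: %r" % key)
        fields.append(key)
        fields.append(repr(float(value)))
    return "\t".join(fields) + "\n"

def write_vectors(vectors, path):
    """Write {document_id: {key: value}} to `path`, one document per line."""
    with open(path, "w", encoding="utf-8") as out:
        for doc_id, features in vectors.items():
            out.write(format_vector_line(doc_id, features))
```

For example, `format_vector_line("doc1", {"apple": 2, "banana": 0.5})` yields the line `doc1`, `apple`, `2.0`, `banana`, `0.5` joined by tabs.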

List of the vectors of cluster centroids

cluster_id1 \t key1-1 \t value1-1 \t key1-2 \t value1-2 \t ...\n
cluster_id2 \t key2-1 \t value2-1 \t key2-2 \t value2-2 \t ...\n
...
  • cluster_id : string
  • key : string
  • value : double
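
Because document vectors and centroid vectors share the same layout, one reader can serve both. This is a minimal sketch, assuming single-tab separators. The function name `parse_vector_line` is hypothetical.

```python
def parse_vector_line(line):
    """Parse 'id<TAB>key<TAB>value...' into (id, {key: value})."""
    fields = line.rstrip("\n").split("\t")
    ident, rest = fields[0], fields[1:]
    # Keys and values must come in pairs.
    if len(rest) % 2 != 0:
        raise ValueError("odd number of key/value fields for %r" % ident)
    vector = {rest[i]: float(rest[i + 1]) for i in range(0, len(rest), 2)}
    return ident, vector
```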

Format of Output Data

List of clusters (output of clustering)

cluster_id1 \t document_id1 \t document_id2 \t document_id3 \t ...\n
cluster_id2 \t document_id4 \t document_id5 \t document_id6 \t ...\n
...
  • cluster_id : integer (>= 1)
  • document_id : string
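
The clustering output can be read back into a mapping from cluster to members. A minimal sketch, assuming single-tab separators; the function name `parse_clusters` is hypothetical.

```python
def parse_clusters(lines):
    """Parse clustering output into {cluster_id: [document_id, ...]}."""
    clusters = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        clusters[int(fields[0])] = fields[1:]
    return clusters
```

With the file from the examples above, this could be called as `parse_clusters(open("cluster.tsv", encoding="utf-8"))`.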

List of clusters with similarity values between documents and clusters (when clustering with the --point option)

cluster_id1 \t document_id1 \t point1 \t document_id2 \t point2 \t ...\n
cluster_id2 \t document_id3 \t point3 \t document_id4 \t point4 \t ...\n
...
  • cluster_id : integer (>= 1)
  • document_id : string
  • point : double
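
With --point, each document ID is followed by its similarity value, so the members have to be read in pairs. A minimal sketch, assuming single-tab separators; the function name `parse_clusters_with_points` is hypothetical.

```python
def parse_clusters_with_points(lines):
    """Parse --point output into {cluster_id: [(document_id, point), ...]}."""
    clusters = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        cluster_id, rest = int(fields[0]), fields[1:]
        if len(rest) % 2 != 0:
            raise ValueError("unpaired document/point in cluster %d" % cluster_id)
        clusters[cluster_id] = [(rest[i], float(rest[i + 1]))
                                for i in range(0, len(rest), 2)]
    return clusters
```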

List of the vectors of cluster centroids (when clustering with the --clvector option)

cluster_id1 \t key1-1 \t value1-1 \t key1-2 \t value1-2 \t ...\n
cluster_id2 \t key2-1 \t value2-1 \t key2-2 \t value2-2 \t ...\n
...
  • cluster_id : integer (>= 1)
  • key : string
  • value : double

List of similar clusters for each input document

document_id1 \t cluster_id1 \t point1 \t cluster_id2 \t point2 \t ...\n
document_id2 \t cluster_id3 \t point3 \t cluster_id4 \t point4 \t ...\n
...
  • document_id : string
  • cluster_id : string
  • point : double
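
A common use of the classification output is to assign each document to its most similar cluster. The sketch below takes the pair with the highest point on each line and makes no assumption about the order in which bayon lists the pairs. The function name `best_cluster_per_document` is hypothetical.

```python
def best_cluster_per_document(lines):
    """Map document_id -> (cluster_id, point) with the highest point."""
    best = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        doc_id, rest = fields[0], fields[1:]
        if len(rest) < 2:
            continue  # no similar cluster listed for this document
        pairs = [(rest[i], float(rest[i + 1]))
                 for i in range(0, len(rest) - 1, 2)]
        best[doc_id] = max(pairs, key=lambda pair: pair[1])
    return best
```

Cluster IDs are kept as strings here, matching the type given for cluster_id in this format.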

Requirements

  • C++ compiler with STL (Standard Template Library)

Recommended

  • google-sparsehash
    • If google-sparsehash is not installed, this clustering tool uses "__gnu_cxx::hash_map" or "std::map"

License

GPL2 (GNU General Public License Version 2)

Author

Mizuki Fujisawa <[email protected]>

bayon's Issues

Missing tags and downloads

This repository has tags for bayon 0.0.1, 0.0.2, 0.0.3, 0.0.4, 0.0.6, 0.0.8 and 0.0.9.

Where are the tags for 0.0.5, 0.0.7, 0.0.10, 0.0.11, 0.1.0 and 0.1.1? Can they be created, please? Those versions were available for download at Google Code:

https://code.google.com/archive/p/bayon/downloads

Can a GitHub "release" be created for each tag, and the tarballs for each version be uploaded to them?

Thanks.
