Coder Social home page Coder Social logo

pscan-hadoop's Introduction

PSCAN in Hadoop

A graph clustering algorithm called SCAN implemented in Apache Hadoop by a method called as PSCAN. This algorithm is linear in nature and has complexity of O(m), where m is the number of edges in the graph.

This is an implementation of the paper

Zhao, W., Martha, V., & Xu, X. (2013, March). PSCAN: a parallel Structural clustering algorithm for big networks in MapReduce. In Advanced Information Networking and Applications (AINA), 2013 IEEE 27th International Conference on (pp. 862-869). IEEE

which is a MapReduce implementation of the paper

X.Xu, N.Yuruk, Z. Feng, T. Schweiger. SCAN: a structural clustering algorithm for networks,Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 824-833, 2007.

I have provided a sample graph network which is the Zachary Karate Club graph.

Note: There are slight changes in my code which are not present in the PSCAN paper, but the underlying results are same.

Content

  1. Flow of the program
  2. Preprocessing Graph
  3. Parameters of input
  4. Output in Hadoop

Flow of the Program

Let's assume that we have a graph such as this-

sample_graph

The input for this graph is assumed to be in the format of node-adjacency list.

1 2,6
2 1,3,5,6
3 2,4
4 3
5 2,6
6 1,2,5,7,8
7 6
8 6

If the graph is in form of edges, check out the Preprocessing section.

We first implement a MapReduce program, to calculate the structural similarity of all the edges in the graph. If this value is above a certain threshold (default=0.5), then only we output those edges, otherwise we dont consider them. The output for this looks like this -

1,2	0.5163977794943222
2,3	0.5163977794943222
2,5	0.5163977794943222
3,4	0.8164965809277261
6,7	0.8164965809277261
6,8	0.8164965809277261

Next, we take this output and then process this in the form of <vertex, structure_info> form of pairs. For this we implement another MapReduce program. The output for this looks like -

1	1:1:2
2	1:2:3,1,5
3	1:3:2,4
4	1:4:3
5	1:5:2
6	1:6:8,7
7	1:7:6
8	1:8:6

The second parameter of the input is called the structure_info. This contains the status, label and adjacency list of the vertex. Status tells us if the vertex is active(1) or not(0), the label tells us the cluster group which the node belongs to. Initially, the label is the vertex itself.

This input is the processed by the next MapReduce which is an iterative process (refer to the paper for more information). This iteration is dependent on the diameter of the graph, i.e., if the diameter is 4, this program will iterate 4 times. The output looks like the one above, but the status of all the nodes should be 0 by the end of all iterations.

1	0:1:2
2	0:1:3,1,5
3	0:1:2,4
4	0:1:3
5	0:1:2
6	0:6:8,7
7	0:6:6
8	0:6:6

This output is then given to the last MapReduce program which basically groups all clusters together. This looks like -

1	4,3,2,1,5
6	6,8,7

Preprocessing Graph

If the graph is given in the form of edge input format, i.e.,

1 2
2 3
2 5
3 4
6 7
6 8

then just run the graph_procc.py file given in the repo. Change the input name of the file, with whatever file you have and set output accordingly. This will convert the graph to node-adjaceny list format as shown above.

Parameters of Input

While running Hadoop 3, the input to run the program should be given in the format -

hadoop jar pscan.jar driver <directory-to-input-file> <directory-to-output-file> <diameter-of-graph>

As I stated before, the diameter of the graph is a new addition that I have made for input.

Output in Hadoop

pscan-hadoop's People

Contributors

mihirgupte avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.