twitter / cassovary Goto Github PK
View Code? Open in Web Editor NEWCassovary is a simple big graph processing library for the JVM
Home Page: http://twitter.com/cassovary
License: Apache License 2.0
Cassovary is a simple big graph processing library for the JVM
Home Page: http://twitter.com/cassovary
License: Apache License 2.0
Would be nice to have some more specific error message.
If you start working on this, please leave a comment here.
When using sbt, we are currently getting:
[warn] There may be incompatibilities among your library dependencies.
[warn] Here are some of the libraries that were evicted:
[warn] * com.google.guava:guava:13.0.1 -> 16.0.1
[warn] Run 'evicted' to see detailed eviction warnings
It should be cross-built so as to allow both scala 2.9 and 2.10 users to be use it.
The Cassovary tree, as shipped, only builds against Scala 2.8.1.
Local experiments with building for 2.9.1 and 2.9.2 work (compile and pass all tests), but specs has to be updated to 1.6.9 compiled for 2.9.1.
I would submit a pull request, but my current changes break 2.8.1 builds for some reason I have yet to debug. I redefine 'specs' as follows:
val specs = buildScalaVersion match {
case "2.8.1" =>
"org.scala-tools.testing" % "specs_2.8.1" % "1.6.6" % "test" withSources()
case "2.9.1" | "2.9.2" =>
"org.scala-tools.testing" % "specs_2.9.1" % "1.6.9" % "test" withSources()
}
and then set build.scala.versions
to “2.8.1 2.9.1”. 2.9.1 builds & passes all tests, as does 2.9.2 (++2.9.2 test
), but 2.8.1 does not for some reason.
Simplify it to both LeftNode and RightNode being the same BipartiteNode and the choice of which ids belong to LHS and which ids belong to RHS left to the user. The constructor of BipartiteGraph can take in a function parameter isLeftNode: Int => Boolean that tells whether an id is for left or right side of the graph, which should be checked for neighbors of a BipartiteNode and a runtime exception thrown if a left node has another left node has a neighbor (or right node has another right node as a neighbor).
CC @AnishShah
This is related to #36 but different. We want the ability of node ids to be Long. The use case is when storing a portion of a huge graph in memory on one machine. The nodes can be too many to be able to map to int. Even though the number of nodes on a single machine are small enough, they can refer to arbitrary remote nodes in their adjacency list.
This can be done by making Node and Graph to be of type[@specialized(Int, Long) V], making 'id' to be of type V and doing suitable refactoring. When the node is of type Long, a hash can be supplied to map node id to Node as required by getNodeById()
Such as cosine similarity and jaccard similarity when given
(1) Two nodes N1 and N2, find the similarity values between them in a given direction (Out or In)
(2) One node N1, find the top K similar nodes to N1
Cosine(N1, N2) = neighbors(N1) ∩ neighbors(N2) divided by (sqrt(numNeighbors(N1)) * sqrt(numNeighbors(N2))
and Jaccard(N1,N2) = neighbors(N1) ∩ neighbors(N2) divided by neighbors(N1) ∪ neighbors(N2)
Please add the algorithms in com.twitter.cassovary.algorithms
https://github.com/twitter/ostrich is currently used to maintain internal Stats in Cassovary. Ostrich does not seem to be actively maintained. Instead, should move to https://github.com/twitter/commons/tree/master/src/java/com/twitter/common/metrics as is used in Finagle-stats (http://twitter.github.io/finagle/guide/Metrics.html) as well.
In the future, please use git-tag for releases.
Documentation contains a lot of errors - esp. param names.
Hi, i encountered the problem when i tried to run the example.
The error is as following:
Compiling HelloGraph ...
error: error while loading TestGraphs, Scala signature TestGraphs has wrong version
expected: 4.1
found: 5.0
one error found
Running HelloGraph...
Exception in thread "main" java.lang.NoClassDefFoundError: HelloGraph
Caused by: java.lang.ClassNotFoundException: HelloGraph
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: HelloGraph. Program will exit.
Is it due to the sbt and scala version problem?
The version of scala are different for building and using.
$ ./sbt
[info] Standard project rules 0.12.7 loaded (2011-05-24).
[warn] No .svnrepo file; no svn repo will be configured.
[info] Building project cassovary 1.0.1-SNAPSHOT against Scala 2.8.1
[info] using CassovaryProject with sbt 0.7.4 and Scala 2.7.7
In some applications one needs fast intersections of adjacency lists and fast search of whether a node v is a neighbor of a node u. In such cases, a graph that keeps the adjacency list in sorted order would be great.
This should work for both in and out directions in both ArrayBasedDirectedGraph and SharedArrayBasedDirectedGraph.
When running HelloLoadGraph the program does not finish. There is something wrong with reading graph from file.
After the node renumbering, there is currently no way to read graphs with long / string ids. Assume a specified single char separator (i.e., similar to the use of space right now) in both cases.
Noticed an interesting issue when first experimenting with Cassovary. When trying to do PageRank on a 25,000 node graph I kept experiencing many OutOfMemory errors, even with 4G allocated to the java heap.
I realized one of the problems was due to the ID scheme being used by my nodes. Basically, some nodes had an ID similar to 9080256 and so the PageRank algorithm was iterating as many times as the maximum node ID, even though 9 million node IDs were not valid. It could be that I didn't properly read the source/docs but I can see this being a design flaw. It might be more helpful to explain the purpose of node IDs or to change the node ID scheme altogether.
i.a. some examples would be useful.
CC @AnishShah
I have an application where I needed an undirected graph. I actually didn't realize until today that "Mutual" in StoredGraphDir meant undirected and was starting to write my own UndirectedGraph wrapper class when I noticed it. How should we make this more clear? For example I could add a comment to StoredGraphDir along the lines of
Mutual, // In a mutual graph, the outbound and inbound neighbors of each node are equal, and space is typically saved by only storing the outbound neighbors of each node
and/or add to the documentation of DirectedGraph that it actually also supports undirected graphs. Or should there be an UndirectedGraph wrapper in Cassovary?
@pankajgupta
Currently this object uses a couple of hand-rolled methods "parallelForEach" and "parallelWork" in com.twitter.cassovary.util.ExecutorUtils. It will be more elegant to use scala futures and/or promises instead.
Current pagerank implementation in com.twitter.cassovary.algorithms is single-threaded. Implement a multi-thread version to make it run faster. Also, report benchmark results comparing the parallel version with the current version (see PagerankBenchmark in cassovary-benchmarks)
Currently for Traversals you have to specify the direction of edges that will be used. This is to allow walks on directed graphs as if they were undirected.
for example, the randomized algorithm in http://arxiv.org/abs/1212.2264 or one of the other algorithms mentioned in this paper.
When adding cassovary as dependency in another project (according to Readme):
libraryDependencies += "com.twitter" %% "cassovary" % "4.0.0"
in build.sbt
I get compile error "object cassovary is not a member of package com.twitter".
I checked cassovary jar in Ivy repository cache and it does not contain any classes (just metainf directory). It's the same, when I compile cassovary with sbt package - jar in target directory is empty, and there's no classes in main target directory.
Workaround is:
libraryDependencies += "com.twitter" %% "cassovary-core" % "4.0.0"
Should we modify Readme or fix the build to match Readme?
Currently it only reports speed.
Motivated by #138 where a centrality score is being computed per node and it might be sometimes most efficient to keep information per node stored in an array.
Currently we have graph.tourist.InfoKeeper that keeps info in a mutable.Map which precludes storing infoPerNode in an array.
How about declaring a new trait that has get/put. and allowing infoPerNode to return a collection of this trait type. This can be either a mutable.Map (as now) or an Array with implicit conversions from the Map and Array to this trait?
Comments @szymonm ?
When running ./sbt update
, I have the following issues:
[warn] Host binaries.local.twitter.com not found. url=http://binaries.local.twitter.com/maven/commons-lang/commons-lang/2.2/commons-lang-2.2-sources.jar
[info] You probably access the destination server through a proxy server that is not well configured.
[warn] Host binaries.local.twitter.com not found. url=http://binaries.local.twitter.com/maven/commons-lang/commons-lang/2.2/commons-lang-2.2-javadoc.jar
[info] You probably access the destination server through a proxy server that is not well configured.
[warn] Host binaries.local.twitter.com not found. url=http://binaries.local.twitter.com/maven/com/twitter/util-logging/1.8.5/util-logging-1.8.5.pom
[info] You probably access the destination server through a proxy server that is not well configured.
[warn] Host binaries.local.twitter.com not found. url=http://binaries.local.twitter.com/maven/com/twitter/util-logging/1.8.5/util-logging-1.8.5.jar
[info] You probably access the destination server through a proxy server that is not well configured.
[warn] Host binaries.local.twitter.com not found. url=http://binaries.local.twitter.com/maven/com/twitter/util-logging/1.8.5/util-logging-1.8.5-sources.jar
[info] You probably access the destination server through a proxy server that is not well configured.
There are a few examples in examples/scala (and in examples/java) that demonstrate how to use the cassovary library. These are built and run using a hacky bash script examples/scala/build_run_example.sh (which uses examples/build_run_example.sh). Instead we should be using the run command in sbt to build and run these examples.
Another alternative is to build a separate sbt project just for examples and use cassovary library as a dependency from maven central (as explained in README.md)
Cassovary uses sbt (see http://www.scala-sbt.org) and has its build definition in build.sbt. It would be nicer to port that to a scala based build definition.
Currently the methods generateRandomGraph() and generateRandomUndirectedGraph() in TestGraph.scala are slow as they iterate over each of the potential O(n^2) edges to determine whether that edge should be picked or not.
Please could you release a build for Scala 2.11?
It doesn't seem to me that this is necessary. The original idea I guess was to ensure that the files are read during graph construction and not upfront. But even if we just have Seq[Iterator[NodeIdEdgesMaxId]] only the iterators are constructed and not actually traversed. Hence it seems like the extra empty parentheses are un-necessary.
Would you agree @szymonm ?
NodeRenumberer is a utility to convert the node ids supplied to a more compact, sequentially increasing integer node ids used internally. However, the current implementation only allows integer node ids to be supplied at the input. Would be great to allow Long and String node ids as well (the internal node ids would still be integer).
We use fastutil 6.4.4 while 6.5.16 is the current.
The randomly generated graph does not consistently provide a triangle count of 0 +/- 20.
CC @AnishShah
Now, both graph readers use Regex matching to read graphs. This is very inefficient for the Int case. Could be performed better avoiding regex for something like readInt from InputStream.
Now that cassovary is open sourced, it should be published to maven central every release.
To do this with SBT, see this link: http://www.scala-sbt.org/using_sonatype.html
If you need help doing this, consult with the Scalding project who does this...
http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22scalding_2.8.1%22
As @szymonm mentioned in #141 something like:
trait Graph[NodeType <: Node] {
getNodeById: Option[NodeType]
...
}
That allows Node to be subclassed and used in the respective Graph. e.g., in #141 we would do:
trait UndirectedGraph extends Graph[UndirectedNode] {
getNodeById: Option[UndirectedNode]
}
It only seems to keep the out edges in a shared array.
Right now, the graph data structure can not have labels on nodes or edges. Allow arbitrary labels in a new class. As a special case, allow only one int to be associated with each edge and each node.
Compiling in scala 2.10 gives a few warnings. To look at these warnings do: "./sbt clean update ++2.10.3 test"
'org.specs' has been deprecated and the tests should move to using org.scalatest. An example of the latter is in com.twitter.cassovary.graph.TestGraphSpec.
Use datasets on http://snap.stanford.edu/data/index.html and benchmark performance of a couple of algorithms for very big graphs, such as (1) Global pagerank, (2) Personalized pagerank for every node in the graph.
A violation of an assert() statement results in a stack trace. The asserts should be removed and replaced by "shouldEqual" type statements.
cc @szymonm
There are currently two implementations for dynamic directed graphs. src/main/scala/com/twitter/cassovary/graph/{DynamicDirectedGraphHashMap.scala, SynchronizedDynamicGraph.scala}. Generate a synthetic update stream (adds/deletes) on a large graph and benchmark the performance of an algorithm such as calculatePersonalizedReputation() or bfsWalk() in src/main/scala/com/twitter/cassovary/graph/GraphUtils.scala while the graph is handling updates.
If I try running the examples I get:
Compiling HelloGraph ...
Running HelloGraph...
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.augmentString(Ljava/lang/String;)Ljava/lang/String;
at HelloGraph$.main(HelloGraph.scala:24)
at HelloGraph.main(HelloGraph.scala)
I've tried on both Linux and OSX with same result. Seems to be a issue with library dependencies and compiling with one scala version and trying to run with another... However I'm just starting to learn Scala so not sure what to do here...
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.