danker

danker is a collection of Bash and Python 3 scripts that enables the computation of PageRank on Wikipedia on normal off-the-shelf hardware (e.g., a quad-core CPU, 8 GB of main memory, and 250 GB of hard disk storage). The "--bigmem/-b" option speeds up the computation when enough main memory is available (how much is needed depends on the Wikipedia language edition, the project, and your hardware configuration).

  • INPUT Wikipedia language edition, e.g. "en", OR "ALL" (for computing PageRank on the union of all language editions using the bag-of-links model); optional parameter "--bigmem/-b".
  • PROCESSING danker downloads the required Wikipedia dump files (https://dumps.wikimedia.org/LANGwiki/latest/), resolves links, redirects, and Wikidata Q-IDs, produces a link file, and computes PageRank.
  • OUTPUT
    • LANG-DUMPDATE.links - a link file (edge list, tab-separated), the input for PageRank. Every line reads from left to right:
    # Q-ID left --links to--> Q-ID right
    Q30    Q46
    
    • LANG-DUMPDATE.links.rank - Wikidata Q-IDs with their respective PageRank scores (sorted in descending order)
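    Both files are plain TSV and can be inspected with standard command-line tools, for example (the dump date in the file names is hypothetical):

      $ head -n 5 en-20240101.links.rank                   # top-5 ranked entities (file is sorted descending)
      $ awk -F'\t' '$1 == "Q42"' en-20240101.links.rank    # look up the score of a single Q-ID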

Requirements

  • python>=3.5
  • csvkit>=1.0.4
  • numpy>=1.16.3 and networkx>=2.3 for running unit tests

Usage

  usage: ./danker.sh [-h] [-p PROJECT] [-i ITERATIONS] [-d DAMPING] [-s START]
                     [-b] [-l]
                     wikilang

  Compute PageRank on Wikipedia.

  positional arguments:
    wikilang              Wikipedia language edition, e.g. "en". "ALL" for
                          computing PageRank over all languages available in a
                          project.

  optional arguments:
    -h, --help            show this help message and exit
    -p PROJECT, --project PROJECT
                          Wiki project, currently supported [wiki, books,
                          source, versity, news]. (default: wiki)
    -i ITERATIONS, --iterations ITERATIONS
                          PageRank number of iterations. (default: 40)
    -d DAMPING, --damping DAMPING
                          PageRank damping factor. (default: 0.85)
    -s START, --start START
                          PageRank starting value. (default: 0.1)
    -b, --bigmem          PageRank big memory flag. (default: False)
    -l, --links           Only extract links (skip PageRank). (default: False)

Examples

  • Compute PageRank on the current dump of English Wikipedia:

    $ ./danker.sh en
    $ ./danker.sh en --bigmem
  • Compute PageRank on the union of all language editions:

    $ ./danker.sh ALL
    $ ./danker.sh ALL --bigmem    # caution, you will need some main memory for that
  • Compute PageRank for each Wikipedia language edition separately:

    $ for i in $(./script/get_languages.sh); do ./danker.sh "$i"; done
    $ for i in $(./script/get_languages.sh); do ./danker.sh "$i" -b; done
  • Compute PageRank on the English version of Wikibooks:

    $ ./danker.sh en --project books
    $ ./danker.sh en --bigmem --project books
  • Compute PageRank on any other graph (see FAQ item 5 below); a minimal sketch with an illustrative file name:
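
    $ # assumption: my-graph.links is a tab-separated edge list; the positional arguments are damping, iterations, and start value
    $ ./danker/danker.py my-graph.links 0.85 40 0.1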

Download

The output of ./danker.sh ALL on bi-weekly Wikipedia dumps is available at https://danker.s3.amazonaws.com/index.html (CC BY-SA 3.0).

Previous work

Before danker, I performed a number of experiments with DBpedia "page links" datasets, most of which are documented at https://web.archive.org/web/20180222182923/https://people.aifb.kit.edu/ath/.

Test

The unit tests ensure correctness and compare danker's results to the PageRank implementation of NetworkX. They require the numpy and networkx libraries to be installed.

Execute the unit tests as follows:

python3 -m unittest test/danker_test.py

The test directory contains a small graph with which you can try out the PageRank core of danker.

$ ./danker/danker.py ./test/graphs/test.links 0.85 40 1
1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28.29.30.31.32.33.34.35.36.37.38.39.40.
Computation of PageRank on './test/graphs/test.links' with danker took 0.00 seconds.
C	3.1828140590777672
B	3.5642607869667629
A	0.30410528185693986
D	0.3626006631927996
F	0.3626006631927996
E	0.75035528185693967
G	0.15000000000000002
H	0.15000000000000002
I	0.15000000000000002
K	0.15000000000000002
L	0.15000000000000002
$ ./danker/danker.py ./test/graphs/test.links ./test/graphs/test.links.right 0.85 40 1
1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28.29.30.31.32.33.34.35.36.37.38.39.40.
Computation of PageRank on './test/graphs/test.links' with danker took 0.01 seconds.
A	0.30410528185693986
B	3.5642607869667629
C	3.1828140590777672
D	0.3626006631927996
E	0.75035528185693967
F	0.3626006631927996
G	0.15000000000000002
L	0.15000000000000002
K	0.15000000000000002
I	0.15000000000000002
H	0.15000000000000002

If you normalize the output values (divide each by 11, the number of nodes), they compare well to https://commons.wikimedia.org/wiki/File:PageRank-Beispiel.png; if you compute percentages (divide by the sum), they are similar to https://commons.wikimedia.org/wiki/File:PageRanks-Example.svg (same graph).
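A minimal Python sketch of both normalizations (assuming the ranks printed above were redirected to a file named test.rank; the file name and the line filtering are illustrative):

    # Read the danker.py output, skipping any progress/status lines
    ranks = {}
    with open("test.rank") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2:
                ranks[parts[0]] = float(parts[1])

    n = len(ranks)                    # 11 nodes in the test graph
    total = sum(ranks.values())
    for node, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
        print(node, score / n, 100 * score / total)   # divide by N, or express as a percentage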

License

This software is licensed under GPLv3 (see https://www.gnu.org/licenses/).

PageRank

For a comprehensive overview of the history of PageRank, I recommend the following reading: Sebastiano Vigna. Spectral Ranking.

FAQ

  1. The source code of danker is licensed under GPL v3. What about the output?

    When you use danker to compute PageRank on Wikipedia yourself, the output has no license and can be used without attribution. However, if you use the PageRank scores provided for download on this web site, you need to provide attribution (https://creativecommons.org/licenses/by-sa/3.0/). In any case, if you find this work useful, it would be nice if you referenced this page or, if you use the scores in an academic work, cited the following paper (or both):

    @InCollection{Thalhammer2016,
        Title                    = {{PageRank on Wikipedia: Towards General Importance Scores for Entities}},
        Author                   = {Andreas Thalhammer and Achim Rettinger},
        Booktitle                = {The Semantic Web: ESWC 2016 Satellite Events, Heraklion, Crete, Greece, May 29 -- June 2, 2016, Revised Selected Papers},
        Publisher                = {Springer International Publishing},
        Year                     = {2016},
        Address                  = {Cham},
        Month                    = oct,
        Pages                    = {227--240},
        Doi                      = {10.1007/978-3-319-47602-5_41},
        ISBN                     = {978-3-319-47602-5},
        Url                      = {https://dx.doi.org/10.1007/978-3-319-47602-5_41}
    }
    
  2. The output format is a tab-separated values (TSV) file with Wikidata Q-IDs and the respective rank. Can I have format xyz?

    We consider TSV sufficient. Any other format and/or mapping can easily be produced with a simple script (see the sketch below).
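    For example, a minimal sketch (illustrative file names) that converts a .links.rank file to JSON:

        import csv, json

        with open("en-20240101.links.rank") as f:
            ranks = {qid: float(score) for qid, score in csv.reader(f, delimiter="\t")}

        with open("en-20240101.links.rank.json", "w") as out:
            json.dump(ranks, out)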

  3. Why is this not programmed with Apache Hadoop/NetworkX/etc.?

    We believe that ranking computations should be transparent: ideally, everyone who wants to verify the computed rankings should be able to do so. Therefore, we also support computation on off-the-shelf hardware, accepting that the computation does not finish within one or two days (the Wikipedia dumps are not published that frequently either). In general, the provided code can be extended and ported to other platforms (under consideration of the license terms).

  4. Why does it take so long (up to one week) to compute PageRank with the ALL option?

    This is in line with the previous point: we want to provide software that everyone with a standard laptop and some time can use. Of course, it is possible to speed up the computation at the cost of more memory and computing power, but we strongly believe that "this is for everyone".

  5. Can I use danker to compute PageRank on graphs other than Wikipedia?

    Sure, you can use the file ./danker/danker.py to compute PageRank on your own graph. If you pass a "right-sorted" file via the optional parameter right_sorted, the slower method with a low memory footprint will be used automatically; otherwise the memory-intensive method is used. You can sort tab-separated files with the Unix command sort --key=2 -o output-file input-file. Type ./danker/danker.py -h for the available options. In addition, you can use danker as a library in your Python 3.x code (cf. https://danker.rtfd.org).
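    For example, following the command-line usage shown in the Test section above (file names are illustrative):

        $ sort --key=2 -o my-graph.links.right my-graph.links
        $ ./danker/danker.py my-graph.links my-graph.links.right 0.85 40 0.1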

  6. Why do the scores not form a nice probability distribution?

    There are multiple reasons for this. First, we do not compute the normalized version of PageRank: instead of (1 - damping)/N (where N is the total number of nodes), we only use (1 - damping). This does not change the ranking, as it only multiplies all scores by a constant factor. Second, in theory, the scores of the non-normalized version should add up to N. This would only hold if there were no dangling nodes (pages with no outgoing links); these serve as energy sinks. One way to mitigate this would be to create links from dangling nodes to all pages (including themselves). However, this would also only introduce a constant factor and therefore has no effect on the final ranking either. More information on the topic can be found in Monica Bianchini, Marco Gori, and Franco Scarselli. 2005. Inside PageRank. ACM Trans. Internet Technol. 5, 1 (February 2005), 92-128. DOI: https://doi.org/10.1145/1052934.1052938
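    In code form, a minimal sketch of the update described above (illustrative names, not danker's actual implementation):

        def update(node, rank, in_links, out_degree, damping=0.85):
            """One non-normalized PageRank update for a single node."""
            incoming = sum(rank[q] / out_degree[q] for q in in_links[node])
            return (1 - damping) + damping * incoming
            # the normalized variant would use (1 - damping) / len(rank) instead of (1 - damping)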

  7. Sorted edge lists are not a common graph representation format (compared to adjacency lists or adjacency matrices). Why are they useful in this particular case?

    This is a good question, and there are multiple aspects to it. We know that the graph would not easily fit into some 8 GB of memory (we have ~3bn edges). The good news is: we don't have to fit it. Random access to all out-/in-links of a specific node is not needed for computing PageRank, as we access every node anyway (some of these ideas are presented in Taher H. Haveliwala. 1999. Efficient Computation of PageRank. http://ilpubs.stanford.edu:8090/386/).

    With sorted edge lists we gain two main advantages:

    1. We can walk through the graph node by node just by reading the lines of a file consecutively (see the sketch at the end of this answer).
    2. We can quickly switch from the best layout for accessing out-links to the best layout for accessing in-links by sorting on the second column ("best" refers to this specific case).

    Trade-offs:

    1. We use much more disk space than actually needed because we repeat nodes (compared to adjacency lists). Still, the computation usually needs < 100 GB of space, and disk space is cheaper than memory.
    2. Isolated nodes cannot be represented with edge lists. However, their PageRank would simply be (1 - damping).
    3. Computation by iterating over files is much slower than keeping the graph in memory. If your graph fits into memory, you can use the --bigmem/-b option to speed up the computation.
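    A minimal sketch of one iteration that streams a right-sorted edge list (illustrative code, not danker's actual implementation; it assumes the per-node rank and out-degree dictionaries fit in memory while the edges stay on disk):

        def iterate(right_sorted_file, rank, out_degree, damping=0.85):
            """Yield (node, new_score) as soon as all in-links of a node have been read."""
            current, acc = None, 0.0
            with open(right_sorted_file) as f:
                for line in f:
                    src, dst = line.rstrip("\n").split("\t")
                    if dst != current:
                        if current is not None:
                            yield current, (1 - damping) + damping * acc
                        current, acc = dst, 0.0
                    acc += rank[src] / out_degree[src]
            if current is not None:
                yield current, (1 - damping) + damping * acc
            # nodes without in-links never appear as a target; their new score is (1 - damping)
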
  8. What is the ALL option and what is the bag-of-links model?

    Naturally, the PageRank algorithm works with lists rather than sets. It also does not make any assumptions about the uniqueness of a link on a given web site. Therefore, if a link to page B appears multiple times on page A (say twice), page B gets twice the share of A's PageRank score compared to any other page C that is linked only once on page A. The following example shows that scenario:

    A -> B
    A -> C
    A -> B
    

    B gets 2/3 * PR(A) while C gets 1/3 * PR(A) (the damping factor is ignored here for brevity). This basic principle holds for every web site and can also be leveraged for Wikipedia. In the particular case of Wikipedia, however, there is a "one link only" policy, and multiple occurrences of a link may be arbitrary or due to quality issues of the article. With the ALL option, in contrast, we extract and align the page links of all Wikipedia language editions via the Wikidata identifier of each article. In this mode, we leverage the different language editions as link-voting entities - the above example with Wikipedia language editions:

    A -> B   # en
    A -> C   # en
    A -> B   # de
    

    For example, the link Q159 -> Q46 appears in 254 different language editions. As stated above, PageRank automatically handles multiple link occurrences and adjusts the scores accordingly. The whole methodology is similar to the bag-of-words model from the information retrieval domain, where word counts are tracked and the resulting frequencies are fed into a probabilistic model (such as topic modeling). Because of this analogy, we informally call this method the "bag-of-links" model.

Contributors

athalhammer, joernhees
