csgillespie / powerlaw

The poweRlaw package


This package implements both the discrete and continuous maximum likelihood estimators for fitting the power-law distribution to data, using the methods described in Clauset et al. 2009. It also provides functions for fitting the log-normal and Poisson distributions. Additionally, a goodness-of-fit based approach is used to estimate the lower cut-off for the scaling region.

The code developed in this package was influenced by the Python and R code found on Aaron Clauset's website, in particular the R code of Laurent Dubroca and Cosma Shalizi.

To cite this package in academic work, please use:

Gillespie, C. S. “Fitting heavy tailed distributions: the poweRlaw package.” Journal of Statistical Software, 64(2) 2015. (pdf).

For a different way of handling power-law type distributions, see

Gillespie, C. S. "Estimating the number of casualties in the American Indian war: a Bayesian analysis using the power law distribution." Annals of Applied Statistics, 2017. (pdf)

Installation

This package is hosted on CRAN and can be installed in the usual way:

install.packages("poweRlaw")

Alternatively, the development version can be installed from GitHub using the devtools package:

install.packages("devtools")
devtools::install_github("csgillespie/poweRlaw")

Getting Started

To get started, load the package

library("poweRlaw")

then work through the four vignettes (available from the package's CRAN page).

Alternatively, you can access the vignettes from within the package:

browseVignettes("poweRlaw")

The plots below show the lines of best fit to the Moby Dick and blackout data sets (from Clauset et al, 2009).

[Figure: cumulative CDFs of the Moby Dick and blackout data sets, with lines of best fit.]
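A minimal fitting session, using the bundled moby data set, looks like this (a sketch of the standard workflow, not a substitute for the vignettes):

```r
library("poweRlaw")
data("moby")                 # word frequencies from Moby Dick
m_pl = displ$new(moby)       # discrete power-law object
est = estimate_xmin(m_pl)    # lower cut-off chosen via the KS statistic
m_pl$setXmin(est)
plot(m_pl)                   # empirical CDF on log-log axes
lines(m_pl, col = 2)         # fitted power law
```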

Other information

  • If you have any suggestions or find bugs, please use the GitHub issue tracker
  • Feel free to submit pull requests

Development of this package was supported by Jumping Rivers

powerlaw's People

Contributors: csgillespie, linzhp, maxbiostat, mbjoseph, mpadge

powerlaw's Issues

estimates comparison OLS, MLE + bootstrapping log-normal

Hi Colin, first off: many thanks for your work. This is amazing!

Two (probably very silly) questions:

  1. Running OLS and MLE on the same data gives estimates of alpha that differ by one. I'm confused: both estimate the scale parameter and use the PDF, right? For a given dataset I get an OLS estimate of -1.3 and an MLE (with the same x_min) of -2.26. I looked at the Clauset paper, but I can't see where I go wrong.
  2. I am trying to bootstrap the Kolmogorov-Smirnov test for the power-law hypothesis, but now for a log-normal distribution. Unfortunately, R crashes every time, both on a server and on a laptop.

All the best,
Glenn
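(A likely explanation for question 1: on log-log axes the CCDF of a power law has slope -(alpha - 1), while the PDF has slope -alpha, so an OLS fit to the empirical CCDF sits roughly one below the MLE of alpha. A minimal sketch with simulated data:)

```r
library(poweRlaw)
set.seed(1)
x = rplcon(5000, xmin = 1, alpha = 2.5)  # continuous power law, true alpha = 2.5

# MLE targets the PDF exponent alpha
m = conpl$new(x)
m$setXmin(1)
estimate_pars(m)$pars                    # roughly 2.5

# OLS on the log CCDF recovers roughly alpha - 1
x_s = sort(x)
ccdf = (length(x_s) - seq_along(x_s) + 1) / length(x_s)
coef(lm(log(ccdf) ~ log(x_s)))[2]        # roughly -(2.5 - 1) = -1.5
```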

function copy() does not work on conpl objects, hence bootstrap crashes

m_m = conpl$new(abs(rt(1000,3)))
m_m$copy()
Error in x$values : $ operator is invalid for atomic vectors

displ works:

m_m = displ$new(round(10*abs(rt(1000,3))))
m_m$copy()
Reference class object of class "displ"
Field "datatype":
[1] "discrete"
Field "internal":
$freq
[1] 37 74 63 75 75 64 57 58 54 49 34 38 35 28 27 25 24 15 18 19 10 10 7 9 9 5 7 6 3 5 2 4 1 2 5 4 3 3 5 3 3 3 1 1 2 2 2 1 1 2 2 1 1 1 1 1 1 1
[59] 1

[snip]
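(Until copy() works on conpl objects, one possible workaround is to rebuild the object from the same data and carry over the fitted fields by hand; a sketch using the standard getter/setter methods:)

```r
library(poweRlaw)
m = conpl$new(abs(rt(1000, 3)))
m$setXmin(estimate_xmin(m))

# manual "copy": new object over the same data, then transfer the fit
m2 = conpl$new(m$dat)
m2$setXmin(m$getXmin())
m2$setPars(m$getPars())
```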

estimate_xmin() makes no use of data_max

It appears as if estimate_xmin() does not make any use of its data_max argument.

Either data_max should be respected or it should be removed as a function argument.

On a model, m.pl, from a vector whose length is greater than 53 million, I ran estimate_xmin(), once with data_max = 1e+7 and once with data_max = 1.5e+6. The results (after several hours) were identical.

Also I've now looked in the source, and I see no use of data_max within estimate_xmin(). (Though I am not familiar with R, and so there might be some variable scoping thing going on.) But it looks like data_max is only used for bootstrap() and bootstrap_p().
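(A possible interim workaround: estimate_xmin does honour its xmins argument, so the search range can be capped by hand; a sketch using the bundled moby data in place of the large vector above:)

```r
library(poweRlaw)
data("moby")
m = displ$new(moby)

# restrict the candidate lower cut-offs manually instead of relying on data_max
cand = unique(m$dat)
est = estimate_xmin(m, xmins = cand[cand <= 1000])
```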

Support for new/custom distributions?

I wonder if it would be possible to add functionality for new or custom distributions. For example, I'd like to fit and test power law with exponential cutoff (like the Python package does), but also others, like Stretched Exponential, Weibull, Generalized Exponential, and even Lognormal with Exponential cutoff.

Maybe this is already possible. If so, a tutorial would be very helpful.

problem plotting distribution object

I don't know whether this is a bug, or whether I did something wrong, but when I try to plot my distribution object, I get the following error message:

plot(USBlogFreqPow)
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ

However, I do not see any problem with the actual object:

USBlogFreqPow
Reference class object of class "displ"
Field "xmin":
[1] 2
Field "pars":
[1] 1.526647
Field "no_pars":
[1] 1

The vector with the distribution is approximately 250,000 elements long.

I use a 64 bit Windows 7 machine and R version 3.1.0 (2014-04-10) -- "Spring Dance"

install_github: Error: client error: (403) Forbidden

Following the readme,

install_github('poweRlaw', 'csgillespie')
Installing github repo(s) poweRlaw/master from csgillespie
Installing poweRlaw.zip from https://api.github.com/repos/csgillespie/poweRlaw/zipball/master
Error: client error: (403) Forbidden

Thanks for your help

Damien
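(The two-argument form install_github('poweRlaw', 'csgillespie') belongs to older devtools releases; current devtools expects a single "user/repo" string, which may also be why the 403 appears here:)

```r
# current devtools syntax: one "user/repo" string
devtools::install_github("csgillespie/poweRlaw")
```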

powerlaw graph plot issue

I have a dataset containing the centralities of a network, and I want to fit poweRlaw distributions to the data. My centralities have a lot of zeroes, except for degree centrality. When I run conpl on the centralities one by one, it gives me errors; the commands I am using are given below:

v = fun.zero.omit(Centrality1InDegree)
m_pl = conpl$new(v)
plot(m_pl)
Error in xy.coords(x, y, xlabel, ylabel, log) : 'x' and 'y' lengths differ

Kindly guide me on where I am going wrong. Your help will be appreciated!

Question about plotting with an external cumulative distribution function

I fit power law and exponential distributions using your package, and I fit a power law with exponential cutoff using the code by Cosma Shalizi. When I try to plot them, the power law with exponential cutoff is shifted upwards, like this:

[screenshot: the exponential-cutoff line sits above the other fitted lines]

The question is: how do I make the lines start at the same place?
The code I use is like this:

m = conpl$new()
m$setXmin(1);m$setPars(2.1)
# Generate random numbers
testP <- dist_rand(m, 100)
m$setXmin(30);m$setPars(2.5)
testP <- c(testP,dist_rand(m, 100))

m <- conpl$new(testP)
(est = estimate_xmin(m))


plot(m,xlab="Area",ylab="CCDF")
m$setPars(est$pars)
m$setXmin(est$xmin) 
lines(m,col=2)

(est1 <- powerexp.fit(testP,est$xmin))
x <- sort(unique(testP))
y <- ppowerexp(x,est1$xmin,est1$exponent,est1$rate,lower.tail=F)
lines(x,y,col=3)
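(A likely cause of the offset: ppowerexp gives the CCDF of the fitted tail, i.e. conditional on x >= xmin, while poweRlaw's plot shows the CCDF of the whole sample. Rescaling by the empirical proportion of data above xmin usually lines them up; a sketch reusing the variable names above:)

```r
# scale the conditional tail CCDF by P(X >= xmin) estimated from the data
p_tail = mean(testP >= est$xmin)
lines(x, y * p_tail, col = 3)
```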

Question on power-law fit

I am trying to fit a power law to node degree data; each line in the file gives a particular node's degree. A histogram of the observed data is pasted below (Figure 1). The estimated xmin value is 19, with a power-law exponent of 10.6.

I found a strange problem when I plot the m_pl object: the x scale of the plot covers the range 1:30 (Figure 2), although the maximum degree is 16,668.

Thanks in advance for any pointers.

Best, Andrej

[Figure 1: histogram of the degree data. Figure 2: the plotted degree distribution.]

library(poweRlaw)
library(repmis) # Need to download data from Dropbox

# Download data from Dropbox
url <- "https://dl.dropboxusercontent.com/u/3340528/degrees.txt"
data <- repmis::source_data(url, header = TRUE)

# Estimate parameters
deg <- data$degree
tab <- table(deg)
m_pl <- displ$new(tab)
est <- estimate_xmin(m_pl)
m_pl$setXmin(est)

# Plot distribution
plot(m_pl)
lines(m_pl)

Show method for distribution objects

Currently the distribution object prints out the data twice due to the way things are stored. A nice show method would solve this duplication.

Implement method for upper bound

The original method tests for power-law behaviour right of a lower bound. But sometimes one would like to test for power-law behavior left of an upper bound. The KS function should be applicable.

Estimate bootstrap finishing time

Do a couple of bootstraps and provide an estimate of finishing time. Will also need to implement a method to combine bootstrap datasets.

Constants and exponents for PDF and CCDF

Using the moby data

library(poweRlaw)
data("moby")

and computing the corresponding power-law object

m_m <- displ$new(moby)
(est = estimate_xmin(m_m))
m_m$setXmin(est)

I plot the PDF

plot(m_m$internal$values, m_m$internal$freq/sum(m_m$internal$freq), log = "xy")

Then I add the estimated curve:

curve(m_m$internal$constant*x^(-m_m$internal$pars), from = m_m$internal$xmin, add = T)

Obviously, the constant is still wrong (it seems to be the one for the CCDF). How can I get the constant for the PDF?
A corresponding question is: If I plot the CCDF

plot(m_m$internal$values, m_m$internal$cum_n/max(m_m$internal$cum_n), axes = T, xlab = "", ylab = "", log = "xy")

how does

lines(m_m)

get the right exponent? Simply by subtracting 1?

Thanks a lot

log normal lines and CDF

Hi,
I have two questions:
First, when I use the log-normal distribution and call lines(m_ln), no new curve appears, although this works for other distributions. Do you have any idea why?
Secondly, if I understand correctly, the curve drawn by lines() (for any distribution) does not match the parameters reported by distrib$pars, because those refer to the PDF while what is displayed is the CDF. Am I correct? If so, is there a way to obtain the parameters corresponding to the CDF, or a simple relation between the CDF and PDF parameters that I can use to compute them?
Thank you for the awesome work...

bootstrap_p Error in checkForRemoteErrors

Based on the stack overflow discussion here:
http://stackoverflow.com/questions/25561694/error-with-bootstrap-function-in-powerlaw?noredirect=1#comment39925932_25561694

Two examples of me trying to run the bootstrap_p function

bs_p<-bootstrap_p(deg_pl, xmins = deg_pl$xmin, pars = deg_pl$pars, no_of_sims = 1, threads = 1)
Error in checkForRemoteErrors(val) :   one node produced an error: could not find function "rpldis"

bs_p = bootstrap_p(deg_pl)
Error in checkForRemoteErrors(val) : 
  100 nodes produced errors; first error: could not find function "rpldis"

sessionInfo()
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] snowfall_1.84-6 snow_0.3-13     poweRlaw_0.20.5 RMySQL_0.9-3    DBI_0.2-7       igraph_0.7.1   

loaded via a namespace (and not attached):
[1] VGAM_0.9-4     parallel_3.0.2 stats4_3.0.2   tools_3.0.2 

subscript out of bounds

Hi,
when bootstrapping small datasets, a few nodes in the cluster return a "subscript out of bounds" error. This happens stochastically, with probability approaching 1 as no_of_sims grows: with 10 bootstrap runs it does not happen; with 50 it happens sometimes; with 1000 it always happens. Not all nodes return an error.

This happens, empirically and in my experience, when the data (subsetted for some characteristics of interest) has few points only (say, <10). Surely more data points are desirable for fitting work, but life is unfair in some of its applications :-)

The problem is independent of the number of threads used.

Xmax for power law head?

If I have a discrete distribution whose head, rather than the tail, seems to follow a power law, how can I test the hypothesis with poweRlaw?

One thought I have is to take the reciprocal of each value and estimate xmin. However, I would have to use conpl for the new distribution. Will that affect the testing result?
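(The reciprocal idea can be sketched as follows; the data here are hypothetical, and note that 1/x is no longer integer-valued, hence the continuous conpl class:)

```r
library(poweRlaw)
set.seed(1)
deg = sample(1:100, 500, replace = TRUE)  # hypothetical discrete data

m = conpl$new(1 / deg)   # the head of deg becomes the tail of 1/deg
est = estimate_xmin(m)   # est$xmin corresponds to 1/xmax on the original scale
```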

Error in bootstrap_p

When I try to run bootstrap_p on a particular dataset, I get an error: 'x' must be an array of at least two dimensions. I've played around with the function arguments but haven't been able to fix it easily.

I've put up a Gist with some sample code, which you can source with:

library(devtools)
source_gist("https://gist.github.com/jkeirstead/7f228e7cc6a4e677b011")

How do you calculate KS in estimate_xmin.R

We were looking at your functions to understand how they work and noticed that, in the continuous case, get_KS_statistic calculates the KS distance between the estimated distribution and a uniform CDF instead of the empirical CDF, which seems not quite right.

bootstrap_p error on dispois$new() object

I can apply the bootstrap_p function to objects of type displ, dislnorm and disexp, but not to dispois. It gives me the following error:

library(poweRlaw)
x <- c(1,1,1,2,2,3,11)
m = dispois$new(x)
m$setXmin(estimate_xmin(m))
m$setPars(estimate_pars(m))
bootstrap_p(m, no_of_sims=1, threads=1)

Error in checkForRemoteErrors(val) :
one node produced an error: unable to find an inherited method for function 'dist_rand' for signature '"dispois"'

problem with ppldis ?

I would like to use the ppldis function in one-sample KS tests, but my test of the procedure suggests an error in ppldis. Here is the R code for my test:

x<-rpldis(1000, alpha=2, xmin=1) #synthetic dataset with known parameters
ks.test(x, "ppldis", alpha = 2, xmin=1) #compares x to probability function with same parameter values. Should show no significant difference (p>0.1).

p is consistently 0.00 (with a high D of 0.6), which should not be the case. I should get a large p (no significant difference), and I do get that result when I perform the test with continuous power law functions:

x<-rplcon(1000, alpha=2, xmin=0.1)
ks.test(x, "pplcon", alpha = 2, xmin=0.1)

I am a bit confused how this could happen, and I suspect a problem with ppldis. Any thoughts?

(P.S., Thank you, Colin for a great set of tools)

running "plot" in Rmd files with cache = TRUE

Hi,
It seems like I have another issue with plot() now. I am running some functions in R chunks in an Rmd file, and I get an error:

[screenshot of the plot error in the rendered Rmd]

I think this is linked to the fact that the previous chunk was run with cache = TRUE, because I did not have the problem with cache = FALSE. When I comment out the following lines:

plot(FreqPow)
lines(FreqPow)

then the other lines in the same chunk work perfectly:

summary(FreqPow$internal)
df <- cbind(FreqPow$internal$freq,FreqPow$internal$values)
head(df)
tail(df)
head(TestFreqTable)
tail(TestFreqTable)

Documentation on gof

It would be great to have better documentation on the goodness-of-fit gof field in the bootstrap results. The original Clauset paper suggests this is a KS measure, but the bootstrap KS results are quite different from the gof field (perhaps this is similar to #27?). At the moment, I'm not sure how to interpret this value.

Bug in lines method

Hi, I think I have found another small bug. I think line 8 of lines-methods.R should be:

y = dist_cdf(x, x_axs, FALSE)

rather than

y = dist_cdf(m, x_axs, FALSE)

This causes the lines function to fail if you are not calling it with
a variable named m.

Best regards

Issue in tutorial

Working through the tutorial (R 2.15.2 on linux-x64, Ubuntu 12.04), everything works fine until I get to this point:

xmin = estimate_xmin(m)
Error in envRefSetField(x, what, refObjectClass(x), selfEnv, value) :
‘v’ is not a field in class “displ”

The install went OK from what I can tell.

Cache data cdf

When estimating xmin via the bootstrap, the data CDF is recalculated for every simulation. Instead, the result should be cached and reused.

Error when data have 0 (zero) values

Some of my distributions have zero values. The frequency of this value is important to my analysis, but I can't fit the discrete Power Law.

Is it impossible to deal with?

Traceback:

9 stop("Data should be strictly positive (no zeros)")
8 check_discrete_data(dat)
7 .Object$initialize(...)
6 initialize(value, ...)
5 initialize(value, ...)
4 methods::new(def, ...)
3 displ$new(data)
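(Two common workarounds, each of which changes the model and so needs justifying for the analysis at hand: drop the zeros, or shift the whole sample by one before fitting. A sketch with hypothetical data:)

```r
library(poweRlaw)
x = c(0, 0, 1, 1, 2, 3, 5, 8)   # hypothetical counts including zeros

m_drop  = displ$new(x[x > 0])   # option 1: discard the zeros
m_shift = displ$new(x + 1)      # option 2: shift the support to start at 1
```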

conlnorm warning


library(poweRlaw)
dt=read.table("read.txt")
read_ln=conlnorm$new(dt$V1)
read_ln$setXmin(estimate_xmin(read_ln))
Warning message:
In min(which(internal[["dat"]] >= (x - .Machine$double.eps^0.5))) :
 no non-missing arguments to min; returning Inf

See gist for data

Bug in plfit.r

Add a note in the vignette about the bug in plfit.r when xmin == 1: the fit calculation uses the indexing 1:(xmin - 1), which is wrong in that case (in R, 1:0 yields c(1, 0) rather than an empty vector).

Question - input data format (raw data vs. frequencies)

Hello,

I would like to ask which kind of data I should use as input when creating a displ$new(data) object. The example says "The Moby Dick dataset contains the frequency of unique words"; does that mean that, before passing my data to displ$new, it should be in frequency format?

For a better picture, here is an example. If I repeatedly throw a 10-sided die, I get these numbers:

3
7
5
3
2
1
10
8
4
1
1
1

If I make frequencies from these numbers, I would get:

1 - 4
2 - 1
3 - 2
4 - 1
5 - 1
6 - 0
7 - 1
8 - 1
9 - 0
10 - 1

Which numbers do I pass as an argument to the function: the "raw data" (3, 7, 5, 3, ...) or the "frequencies" (4, 1, 2, 1, ...)?

I am asking because the displ object has an internal field containing frequencies. But those frequencies seem to be "frequencies of frequencies", which made me think the input should be the raw data.

Thank You,

Stepan
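(For reference, displ$new expects the raw observations; the frequencies are tabulated internally. A sketch with the die throws above:)

```r
library(poweRlaw)
throws = c(3, 7, 5, 3, 2, 1, 10, 8, 4, 1, 1, 1)  # raw observations
m = displ$new(throws)    # pass raw data, not the frequency table
m$internal$freq          # counts for each unique observed value
```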

Uncertainty of alpha

I've noticed that the estimated alpha is often above the mean or median of the bootstrapped alphas. Why is that? Is it because drawing from the original data with replacement tends to miss the extreme events?

Also, there are several ways to present the uncertainty of alpha after the bootstrapping. For example, for one of my distributions with alpha = 2.64 (P>0.1), there are the following options for 1000 bootstraps:

mean and standard error: 2.27+-0.00
mean and 95% confidence interval: 2.27+-0.01
mean and standard deviation: 2.27+-0.12
median and percentiles [0.025,0.975]: 2.25 [2.15,2.63]

Is there a standard, or should one be preferred for statistical reasons? If you ask me, it's a judgement call...

Different values for Matlab, Python and R powerlaw fit

Hello,

This R package is awesome and returns the results reported by Clauset et al. (2009).

I tested the Moby Dick data under Matlab, Python and your R implementation of power-law fitting. The main problem I have is that the results are different.

For instance:

  • Python returns: xmin = 1 and alpha = 2.02
  • R returns: xmin = 7 and alpha = 1.95 (only R reproduces the results obtained in Clauset et al. (2009))
  • Matlab returns: xmin = 1

What are the main differences between these implementations?
Which package should I consider to be right?

Thank you for your help.
Best regards.

Ranaivo Razakanirina
