npeet's People

Contributors

dizcza, gregversteeg, maxwellrebo, naught101, stevenchowell

npeet's Issues

Readme lacks installation instructions

The readme file lacks installation instructions -- I think it would be very helpful to include that.

I was able to puzzle out that

$ python3 setup.py build
$ python3 setup.py install

are enough to build and install the package. Maybe that info, or something like it, could be in the readme.

Also, I found that after installing, import entropy_estimators fails, but import npeet.entropy_estimators succeeds. The readme mentions the former, should it contain the latter instead?
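
(As a quick post-install sanity check — a minimal sketch, with made-up toy data:)

import npeet.entropy_estimators as ee

# six 1-D samples; the default k=3 requires at least k + 1 samples
samples = [[1.2], [3.4], [2.1], [0.7], [4.5], [2.9]]
print(ee.entropy(samples))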

Could you add a license file to NPEET?

Thank you for publishing your package NPEET on GitHub. Could you add a license file (preferably open-source license, such as MIT, GNU, BSD licenses) to your package? Thanks!

It doesn't work

kbriggs:~/Downloads/NPEET> python3 test.py
For a uniform distribution with width alpha, the differential entropy is log_2 alpha, setting alpha = 2
and using k=1, 2, 3, 4, 5
Traceback (most recent call last):
File "./test.py", line 16, in
print("result:", [ee.entropy([[2 * random.random()] for i in range(1000)], k=j + 1) for j in range(5)])
File "./test.py", line 16, in
print("result:", [ee.entropy([[2 * random.random()] for i in range(1000)], k=j + 1) for j in range(5)])
File "/home/kbriggs/Downloads/NPEET/entropy_estimators.py", line 28, in entropy
return (const + d * np.mean(map(log, nn))) / log(base)
File "/usr/local/lib/python3.4/dist-packages/numpy/core/fromnumeric.py", line 2909, in mean
out=out, **kwargs)
File "/usr/local/lib/python3.4/dist-packages/numpy/core/_methods.py", line 82, in _mean
ret = ret / rcount
TypeError: unsupported operand type(s) for /: 'map' and 'int'


kbriggs:~/Downloads/NPEET> python2 test.py
For a uniform distribution with width alpha, the differential entropy is log_2 alpha, setting alpha = 2
and using k=1, 2, 3, 4, 5
('result:', [0.95063690299507952, 0.98051458362141108, 1.0803462913574611, 1.0316551234094444, 1.0289725544677049])

Gaussian random variables

Conditional Mutual Information
covariance matrix
[[4 3 1]
[3 4 1]
[1 1 2]]
('true CMI(x:y|x)', 0.5148736716970265)
('samples used', [10, 25, 50, 100, 200])
('estimated CMI', [0.24721094773861269, 0.39550091844389834, 0.46211431227905897, 0.48994541664326197, 0.49993287186420526])
('95% conf int. (a, b) means (mean - a, mean + b)is interval\n', [(0.32947891495120885, 0.47883105656907937), (0.42410443410041138, 0.40553319741348437), (0.29607520550148525, 0.29646667472554578), (0.17646000212101254, 0.19139043703562886), (0.1623733550388789, 0.19292824321772967)])
Mutual Information
('true MI(x:y)', 0.5963225389711981)
('samples used', [10, 25, 50, 100, 200])
('estimated MI', [0.32218586252030301, 0.54386805987295483, 0.59630897787131887, 0.60762939695898355, 0.60418593716673841])
('95% conf int.\n', [(0.42363251380820954, 0.46980791508516928), (0.46583034399247247, 0.50990786157500079), (0.35170125121037665, 0.33635610406503746), (0.23654160340493391, 0.30032100823502828), (0.2007355329953654, 0.17193438029361319)])

IF you permute the indices of x, e.g., MI(X:Y) = 0
('samples used', [10, 25, 50, 100, 200])
('estimated MI', [0.032435448506589186, -0.027013576228861892, -0.0048799193000058135, 0.0023174460892350754, -0.0002141277047037321])
('95% conf int.\n', [(0.28988781354141774, 0.41434201574331025), (0.24203605944116111, 0.29849816049646066), (0.18081726377075832, 0.18040335534919902), (0.15879645329878422, 0.22733498191676946), (0.13263900209136867, 0.13325413690339941)])

Test of the discrete entropy estimators

For z = y xor x, w/x, y uniform random binary, we should get H(x)=H(y)=H(z) = 1, H(x:y) etc = 0, H(x:y|z) = 1
Traceback (most recent call last):
File "./test.py", line 116, in
print("H(x), H(y), H(z)", ee.entropyd(x), ee.entropyd(y), ee.entropyd(z))
File "/home/kbriggs/Downloads/NPEET/entropy_estimators.py", line 114, in entropyd
return entropyfromprobs(hist(sx), base=base)
File "/home/kbriggs/Downloads/NPEET/entropy_estimators.py", line 149, in hist
sx = discretize(sx)
File "/home/kbriggs/Downloads/NPEET/entropy_estimators.py", line 280, in discretize
return [discretize_one(x) for x in xs]
File "/home/kbriggs/Downloads/NPEET/entropy_estimators.py", line 275, in discretize_one
if len(x) > 1:
TypeError: object of type 'int' has no len()
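
(For context on the first traceback: under Python 3, map() returns a lazy iterator, so np.mean(map(log, nn)) receives a map object rather than a sequence and fails. A minimal sketch of the kind of change that avoids it — an illustration, not necessarily the fix that was applied in the repository:)

import numpy as np
from math import log

nn = [0.5, 0.25, 0.75]              # stand-in for the k-NN distances used inside entropy()
# np.mean(map(log, nn))             # raises TypeError on Python 3
print(np.mean(list(map(log, nn))))  # materialize the iterator ...
print(np.mean(np.log(nn)))          # ... or use NumPy's vectorized log directly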

raises error in continuous entropy

Hi,

Thanks for sharing your work. I want to use the continuous entropy estimator from your project in mine.

I have a matrix like this:

x =  tf.Variable(   [   [0.96,    -0.65,    0.99,    -0.1   ],
                        [0.97,    0.33,    0.25  ,    0.05  ],
                        [0.9,     0.001,    0.009,    0.33  ],
                        [-0.60,   -0.1,    -0.3,     -0.5   ],
                        [0.49,    -0.8,     -0.05,   -0.0036],
                        [0.0  ,   -0.45,    0.087,    0.023 ],
                        [0.3,     -0.23,    0.82,    -0.28  ]])

When I apply the ee.entropy, I receive this error:

    rev = 1/ee.entropy(row)
  File "/home/sgnbx/Downloads/NPEET/npeet/entropy_estimators.py", line 21, in entropy
    assert k <= len(x) - 1, "Set k smaller than num. samples - 1"
TypeError: object of type 'Tensor' has no len() 

This is my code:


import tensorflow as tf
from npeet import entropy_estimators as ee

def rev_entropy(x):
    def row_entropy(row):
        rev = 1/ee.entropy(row)
        return rev
    rev= tf.map_fn(row_entropy, x, dtype=tf.float32)
    return rev

x =  tf.Variable(   [   [0.96,    -0.65,    0.99,    -0.1   ],
                        [0.97,    0.33,    0.25  ,    0.05  ],
                        [0.9,     0.001,    0.009,    0.33  ],
                        [-0.60,   -0.1,    -0.3,     -0.5   ],
                        [0.49,    -0.8,     -0.05,   -0.0036],
                        [0.0  ,   -0.45,    0.087,    0.023 ],
                        [0.3,     -0.23,    0.82,    -0.28  ]])

p = (x + tf.abs(x)) / 2
ent_p = rev_entropy(p)

print(ent_p)

Can you please explain how I can choose `k` here?
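
(For reference, a minimal sketch of the input shape the estimator expects — plain nested lists or NumPy arrays of shape (n_samples, n_dims) rather than symbolic Tensors. The per-row reshape below is only an illustration using values from the matrix above; it does not address the TensorFlow integration. Per the assert in the traceback, k must stay at most n_samples - 1, so the default k=3 is the largest valid choice for rows of length 4.)

import numpy as np
from npeet import entropy_estimators as ee

x = np.array([[0.96, -0.65, 0.99, -0.10],
              [0.97,  0.33, 0.25,  0.05]])

# each row becomes a set of four 1-D samples, i.e. shape (4, 1)
row_entropies = [ee.entropy(row.reshape(-1, 1), k=3) for row in x]
print(row_entropies)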

What are the Units of the Entropy Output? / Differential Entropy Magnitude is Wrong

Thank you for providing these entropy estimators as open source.

I am having difficulty understanding the units of the continuous entropy estimates produced by npeet. I wrote the code below to test this:

#!/usr/bin/env python3

from entropy_estimators import continuous as paulbrodersen
from npeet import entropy_estimators as npeet
from scipy import stats
import math
import pandas as pd
import numpy as np

uniform = stats.uniform(loc=0, scale=math.e) # Uniform distribution from 0 to e
cauchy = stats.cauchy(scale=0.01)
levy_stable = stats.levy_stable(alpha=2.0, beta=0.0, scale=0.01)

count = 5000
uniform_observations = uniform.rvs(size=count)
cauchy_observations = cauchy.rvs(size=count)
levy_stable_observations = levy_stable.rvs(size=count)

distributions = ["uniform to e", "cauchy", "levy stable"]
scipy_analytical = [uniform.entropy(), cauchy.entropy(), levy_stable.entropy()]
paulbrodersen_results = [paulbrodersen.get_h(uniform_observations, k=5),
                         paulbrodersen.get_h(cauchy_observations, k=5),
                         paulbrodersen.get_h(levy_stable_observations, k=5)]
npeet_results = [npeet.entropy(np.reshape(uniform_observations, [count, 1]), k=5),
                 npeet.entropy(np.reshape(cauchy_observations, [count, 1]), k=5),
                 npeet.entropy(np.reshape(levy_stable_observations, [count, 1]), k=5)]

results = pd.DataFrame({"distribution": distributions,
                        "scipy analytical": scipy_analytical,
                        "paulbrodersen": paulbrodersen_results,
                        "npeet": npeet_results})
print(results)

The result:

   distribution     scipy analytical  paulbrodersen     npeet
0  uniform to e                  1.0       0.994836  1.435245
1        cauchy     -2.0741459390188      -2.066179 -2.980866
2   levy stable  -2.8396580625037564      -2.836570 -4.092305

paulbrodersen's library also implements the Kraskov differential entropy estimation technique using k-nearest neighbors. Notice that its estimates for the Lévy stable and Cauchy distributions are very close to the analytical results from SciPy's differential entropy calculation.

A nat is defined as the information content of the uniform distribution on the interval [0, e]. You can see in the table above that paulbrodersen's implementation does produce an entropy estimate of ~1.0 for the uniform distribution on the interval 0 to e. Kraskov's paper mentions the following:

where “log” will always mean natural logarithm so that information is measured in natural units

indicating that the paper is using the natural log and base e, which will produce values in nats.

However, npeet's estimates are very different from the expected values in nats. Is this library not producing values in the nats unit? I attempted to convert values from bits to nats using the conversion factor 1 nat = 1 / log(2) bits, but this did not improve the comparison.

Any pointers would be very helpful.
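
(One thing worth checking — a minimal sketch, assuming entropy accepts a base keyword as in entropy_estimators.py, where the default appears to be base=2, i.e. bits:)

import math
import numpy as np
from npeet import entropy_estimators as npeet

obs = np.random.uniform(0, math.e, size=(5000, 1))
print(npeet.entropy(obs))                # default base=2 -> value in bits
print(npeet.entropy(obs, base=math.e))   # base=e -> value in nats, ~1.0 for U[0, e]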

Trying hard to find how to install the package

for some reason the 'setup.py' file only gives this warning:

runfile('C:/Users/Yonatan/Documents/GitHub/NPEET/setup.py', wdir='C:/Users/Yonatan/Documents/GitHub/NPEET')
Reloaded modules: npeet, npeet.entropy_estimators
An exception has occurred, use %tb to see the full traceback.
Traceback (most recent call last):
  File "C:\Users\Yonatan\anaconda3\lib\distutils\core.py", line 134, in setup
    ok = dist.parse_command_line()
  File "C:\Users\Yonatan\anaconda3\lib\site-packages\setuptools\dist.py", line 707, in parse_command_line
    result = _Distribution.parse_command_line(self)
  File "C:\Users\Yonatan\anaconda3\lib\distutils\dist.py", line 501, in parse_command_line
    raise DistutilsArgError("no commands supplied")
DistutilsArgError: no commands supplied

During handling of the above exception, another exception occurred:

SystemExit: usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
   or: setup.py --help [cmd1 cmd2 ...]
   or: setup.py --help-commands
   or: setup.py cmd --help

error: no commands supplied

I tried to install from CMD too, with no response.
But for some reason that is not clear to me, I can use the library from the test.py file.

Help pls! Somebody has an explanation?

[Screenshot attached: Capture]

** The test file for some reason works perfectly...
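
(For reference, the usage string in the traceback is asking for a command argument; the lines below are the generic setuptools invocations, run from a regular command prompt in the repository root rather than via runfile — they are not taken from the NPEET readme:)

$ python setup.py install

or, with a recent pip:

$ pip install .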

How to estimate MMI

Hi Greg,

I would like to compute MMI (aka interaction information) for 3 variables. There are several ways to do this by combining entropies and mutual informations, for example

I(X:Y:Z) = I(X:Y) - I(X:Y|Z)

or

I(X:Y:Z) = H(X) + H(Y) + H(Z) - H(XY) - H(XZ) - H(YZ) + H(XYZ)

  • Is there a difference as to which formula to use?
  • Are there some tricks to keep in mind to minimize effect of bias?
  • I'm interested in determining whether MMI is significant (shuffle test), but also in its sign, to evaluate whether MMI is more on the synergistic or the redundant side. If MMI turns out to be significantly different from zero in a shuffle test, is it reasonable to conclude that it likely has the correct sign as well?
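
(A minimal sketch of estimating the first identity with NPEET — assuming mi accepts an optional conditioning argument z, as in recent versions of entropy_estimators.py (older versions expose the same quantity as cmi). The data below are made up, and the sketch does not address the bias question:)

import numpy as np
from npeet import entropy_estimators as ee

n = 2000
x = np.random.normal(size=(n, 1))
z = np.random.normal(size=(n, 1))
y = x + z + 0.1 * np.random.normal(size=(n, 1))

mi_xy  = ee.mi(x, y)         # I(X:Y)
cmi_xy = ee.mi(x, y, z=z)    # I(X:Y|Z)
print("interaction information I(X:Y:Z) ~", mi_xy - cmi_xy)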

Best way to compute mutual information in high dimension when all but one variable are iid

Thanks for making this wonderful package.

I'm trying to compute the mutual information in high dimension, but the case I am interested in is exceptionally simple, and hence there may be a faster method than using the built-in function.

Specifically, I have a function $f(x_{1},\dots,x_{n})$ where $n$ is large, and I would like to estimate the mutual information between the random variable $F = f(X_{1},\dots,X_{n})$ and the independent and identically distributed (iid) random variables $X_{1},\dots,X_{n}$ (so $I(F; X_{1},\dots,X_{n})$), given a large number of samples. Given that all but one of the variables are iid, I'm hoping the calculation simplifies dramatically.

The documentation has a comment which is rather suggestive but I confess I don't really understand what is being said:

"On the other hand, in high-dimensions, KDTree’s are not very good, and you might be reduced to the speed of brute force $N^2$ search for nearest neighbors. If $N$ is too big, I recommend using multiple, random, small subsets of points ($N′ &lt;&lt; N$) to estimate entropies, then average over them."

If I have $N$ samples of the form $\{F, X_{1},\dots,X_{n}\}_{i}$, where $i$ runs from 1 to $N$, is this just saying that one should take a subset of size $M < N$ of these samples and compute the mutual information, do this multiple times, and then average the results? Or is it saying to somehow construct the mutual information by some averaging over lower-dimensional samples $\{F, X_{1},\dots,X_{m}\}_{i}$, where $m < n$? If the former, then why is this advantageous compared to using the built-in method? If the latter, then how exactly does this work?

However, this question about the documentation may not be relevant. Perhaps there is a more direct way of answering my primary question.

Thanks again for the wonderful code!
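
(A minimal sketch of the first reading of that passage — averaging the estimator over random subsets of the N samples — using made-up data and npeet's mi; it only illustrates the mechanics, not which reading the documentation intends:)

import numpy as np
from npeet import entropy_estimators as ee

rng = np.random.default_rng(0)
N, n = 20000, 10
X = rng.normal(size=(N, n))            # samples of (X_1, ..., X_n)
F = X.sum(axis=1, keepdims=True)       # samples of F = f(X_1, ..., X_n)

subset_size, repeats = 500, 20
estimates = []
for _ in range(repeats):
    idx = rng.choice(N, size=subset_size, replace=False)
    estimates.append(ee.mi(F[idx], X[idx]))
print("I(F : X_1..X_n) ~", np.mean(estimates), "+/-", np.std(estimates))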

Negative mutual information after using shuffle (but correct trend)

Dear Greg,

I am using npeet to estimate mutual information in a distributed least-squares problem, but I often get negative mutual information even with the use of shuffle_test. Despite that, one interesting thing is that even though most of the results are negative, the trend seems right. As shown in the attached figure, the blue line first increases and then converges, while the red line starts far away from the blue line and then converges. This trend is what I expected, but I cannot explain the negative values. Do you have any idea about this? Thanks in advance.
[Figure attached: mutual]

Entropy does not increase with variance

Hi Greg,

I want to use your package to study some neuroscience data. I am having a problem with a basic sanity check.

The entropy of a uniform distribution theoretically scales as a logarithm of its standard deviation. I would expect that the entropy of the distribution in range [0, 100] would be log(100) + const, whereas the entropy for the range [0, 1] would be log(1) + const. However, my test seems to show that the entropy computed by NPEET does not change with increasing standard deviation. Why is that?

Here is a minimal example:

import numpy as np
import matplotlib.pyplot as plt
from npeet.entropy_estimators import entropy

data = np.random.uniform(0, 1, 1000)
alphaLst = np.arange(1, 100)
hLst = [entropy(a * data[:, None]) for a in alphaLst]

plt.figure()
plt.plot(alphaLst, hLst)
plt.show()

If possible, I would really appreciate a suggestion soon; I kind of discovered this problem during a validation study, and I need to submit some results soon.

Thanks,
Aleksejs

can it be used for feature_selection.mutual_info?

Great code, thanks.

Can it be used for feature_selection.mutual_info_?
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html

And then for spectral_coclustering?
https://scikit-learn.org/stable/auto_examples/bicluster/plot_spectral_coclustering.html#sphx-glr-auto-examples-bicluster-plot-spectral-coclustering-py

PS: could you share a link to a simple code example that explains what "entropy estimation from k-nearest neighbors distances" is, please?
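
(On the last question — a minimal sketch of the idea behind "entropy estimation from k-nearest-neighbors distances", i.e. the Kozachenko–Leonenko estimator this family of estimators builds on, written against scipy's cKDTree with the max norm; the d*log(2) term is the log-volume of the unit max-norm ball. This is an illustration of the idea, not NPEET's exact code.)

import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def knn_entropy_nats(x, k=3):
    """Kozachenko-Leonenko differential entropy estimate, in nats.

    x: array of shape (n_samples, n_dims); k: number of neighbors."""
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    tree = cKDTree(x)
    # distance from each point to its k-th nearest neighbor (excluding itself),
    # measured in the Chebyshev (max) norm
    eps = tree.query(x, k=k + 1, p=np.inf)[0][:, -1]
    return digamma(n) - digamma(k) + d * np.log(2) + d * np.mean(np.log(eps))

# sanity check: uniform samples on [0, 2] have differential entropy log(2) ~ 0.693 nats
print(knn_entropy_nats(np.random.uniform(0, 2, size=(5000, 1))))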

Question: on how to compute conditional mutual information against a set of features

Hi there!

Thanks for making this package available to us. I was wondering how it would be possible to compute the conditional mutual information of features X and Y conditioned on a set of features S (note that the set can have one or more features).

If it is possible, how would you proceed? Can this workflow be applied to continuous data (all features) and to discrete data (all features)?

Many thanks for your help,

Ivan
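
(A minimal sketch of how this is often done with these estimators — assuming mi accepts a conditioning argument z that may have several columns; the arrays are made up. For all-discrete features, the discrete counterparts, e.g. midd / cmidd if present in the installed version, would be the analogous route.)

import numpy as np
from npeet import entropy_estimators as ee

n = 1000
S = np.random.normal(size=(n, 3))                     # conditioning set: three features as columns
X = S[:, [0]] + 0.1 * np.random.normal(size=(n, 1))
Y = S[:, [0]] + 0.1 * np.random.normal(size=(n, 1))

# I(X : Y | S): the whole conditioning set is passed as one 2-D array
print(ee.mi(X, Y, z=S))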

CMI

Hi, I noticed that the cmi function can also return a negative value. Can you give me some tips on how to obtain the correct result? Thanks

Unexpected behaviour in the mutual information calculation?

Dear Greg,

I think I have found a bug in the code, unless I am doing something seriously wrong. Consider this minimal example

import numpy as np
import npeet.entropy_estimators as ee

x = np.random.normal(0,1,10000)
y = np.random.normal(0,1,10000)
xy = np.array([x,y]).T
entrTrue1D = 0.5*(1 + np.log(2*np.pi))

print('H(X) =', ee.entropy(x[:, None], base=np.exp(1), k=3), 'expected', entrTrue1D)
print('H(Y) =', ee.entropy(y[:, None], base=np.exp(1), k=3), 'expected', entrTrue1D)
print('H(XY) =', ee.entropy(xy, base=np.exp(1), k=3), 'expected', 2*entrTrue1D)
print('I(X:X) =', ee.mi(x[:, None], x[:, None], base=np.exp(1), k=3), 'expected', entrTrue1D)
print('I(XY:XY) =', ee.mi(xy, xy, base=np.exp(1), k=3), 'expected', 2*entrTrue1D)

The output is as follows:

H(X) = 1.4081517115316977 expected 1.4189385332046727
H(Y) = 1.3950320463484136 expected 1.4189385332046727
H(XY) = 2.794510292787968 expected 2.8378770664093453
I(X:X) = 7.95417270271105 expected 1.4189385332046727
I(XY:XY) = 7.954172702711051 expected 2.8378770664093453

Problems:

  • Mutual information I(X:X) of a Gaussian variable with itself is much larger than its entropy H(X)
  • Mutual information I(XY:XY) of a pair of Gaussian variables with itself is much larger than its entropy H(XY), however, not larger than the estimated I(X:X)
  • Both of the above results get worse with increasing number of datapoints

Could you please tell me what is going on and, if possible, how I can fix it?

Best regards,
Aleksejs

entropy value is negative

Hello,
I am using your package to calculate the entropy of a continuous variable; however, the entropy value I got is a negative number. I also tried centropy(x, x), the conditional entropy of a variable on itself. The result is supposed to be zero; however, it sometimes returned a positive or negative number, not close to zero. Could you help me understand the issue? For the discrete case, the result looks fine.
Thanks

Compute the Jensen–Shannon divergence

Hi there! Thank you for making and sharing such a useful package.

I've noticed that estimation of the Jensen–Shannon divergence is currently not supported in this package. Do you have any plans to add it in the future? If not, is there any workaround that uses the currently supported functions to compute the Jensen–Shannon divergence? Thanks a lot~
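
(One possible workaround, sketched under the assumption that the package exposes kldiv(x, xp) as in entropy_estimators.py: estimate JSD(P, Q) = 0.5*KL(P||M) + 0.5*KL(Q||M), with a pooled sample standing in for the mixture M. The pooled sample is only an empirical stand-in for the true mixture, so this is an approximation, not a feature of the package.)

import numpy as np
from npeet import entropy_estimators as ee

p_samples = np.random.normal(0.0, 1.0, size=(2000, 1))   # samples from P
q_samples = np.random.normal(1.0, 1.0, size=(2000, 1))   # samples from Q
m_samples = np.vstack([p_samples, q_samples])            # pooled sample approximating M = (P + Q)/2

jsd = 0.5 * ee.kldiv(p_samples, m_samples) + 0.5 * ee.kldiv(q_samples, m_samples)
print("estimated Jensen-Shannon divergence:", jsd)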

Unexpected scaling of mutual information with variance

Dear Greg,

I am getting strange behaviour in estimated mutual information. I want to check how much the mutual information I(aX, Y) depends on a positive scalar factor a. To the best of my knowledge, analytically mutual information should be completely independent of the scalar factor. However, when I try to estimate it with NPEET, the mutual information is decreasing significantly with increasing alpha. Can you comment on this please?

Here is the minimal example:

import numpy as np
import matplotlib.pyplot as plt
from npeet.entropy_estimators import mi
x = np.random.uniform(0,1,(1000,1))
y = np.random.uniform(0,1,(1000,1))
z = 0.5*x + 0.5*y
alphaLst = np.arange(1, 100)

miLst = [mi(a*x, z) for a in alphaLst]

plt.figure()
plt.plot(alphaLst, miLst)
plt.show()

Why mutual information I(x;x) is not equal to h(x)?

Hi, following your example, I tried to test whether the mutual information estimation makes sense or not. But there seems to be something wrong with the mutual information estimator, because it is very strange that I(x;x) and h(x) are not the same.
Do you have any idea about this?

x = [[1.3],[3.7],[5.1],[2.4],[3.4]]
y = [[1.5],[3.32],[5.3],[2.3],[3.3]]
ee.mi(x,x)
Out[182]: 0.36067376022224085

ee.entropy(x)
Out[183]: 2.706665509186988

ee.entropy(y)
Out[184]: 2.6794531992583743

ee.mi(y,y)
Out[185]: 0.36067376022224085
