
logisticpca's People

Contributors

andland, wrathematics


logisticpca's Issues

Biplot

Add type = 'biplot' option to plot commands.
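A minimal sketch of what such an option might produce, assuming the fitted lpca object stores scores in $PCs and loadings in $U (treat both field names as assumptions to verify against the package source):

# sketch only: overlay loadings (arrows) on the score plot of a fitted lpca object
lpca_biplot = function(fit, scale_loadings = 2) {
  plot(fit$PCs[, 1], fit$PCs[, 2], pch = 19, col = "grey40",
       xlab = "PC1", ylab = "PC2", main = "Logistic PCA biplot")
  arrows(0, 0, scale_loadings * fit$U[, 1], scale_loadings * fit$U[, 2],
         length = 0.08, col = "red")
  text(scale_loadings * fit$U[, 1], scale_loadings * fit$U[, 2],
       labels = rownames(fit$U), pos = 3, col = "red", cex = 0.7)
}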

Predictions for new data

Dear Prof. Andrew J.,
I've been reading your article in the Journal of Multivariate Analysis, "Dimensionality reduction for binary data through the projection of natural parameters." As I understand it, you obtain the principal component scores by relating the natural parameters of the saturated model to the natural parameters of the Bernoulli distribution. I generate high-dimensional correlated binary data by thresholding a mixture of multivariate normals at quantile cutoffs, fit the principal components of the binary data with your R package, and then predict the principal component scores of new data. Surprisingly, when the new data are generated by the same mechanism and only the random seed differs, the predicted principal component scores vary widely. How should I understand this phenomenon?
I look forward to your reply.
Yours sincerely
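For reference, a minimal sketch of the workflow described in this issue; logisticPCA and its predict method are from the package, while the data-generation details (MASS::mvrnorm with an AR(1)-style correlation, thresholded at zero) are assumptions standing in for the mixture construction:

library(logisticPCA)
library(MASS)

set.seed(1)
n = 200; d = 20
Sigma = 0.5^abs(outer(1:d, 1:d, "-"))             # AR(1)-style correlation
x_train = (mvrnorm(n, mu = rep(0, d), Sigma = Sigma) > 0) * 1

set.seed(2)                                       # same mechanism, different seed
x_new = (mvrnorm(n, mu = rep(0, d), Sigma = Sigma) > 0) * 1

fit = logisticPCA(x_train, k = 2, m = 4)
scores_new = predict(fit, x_new)                  # predicted PC scores for the new data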

Add weights

It might not be a good idea to include weights in this package if they significantly slow performance.

m parameter calculated by cv.lpca function

Not really an "issue," but a question: does the m parameter calculated by the cv.lpca function correspond to anything meaningful in the data or the output? The reason I ask is that it seems to correlate with the number of "clusters" in the PCA plot. Maybe this is just a coincidence?

Cross-validation speed

Hi,

I'm trying to evaluate the three different methods shipped with this package on my data. The data is a 76x4623 matrix.

Estimating m with cv.lpca() is extremely slow for a matrix of this size; just the first iteration of the function at m = 1 took more than 24 hours. Is there any way to speed this up? For now I am just using logisticSVD(), which takes under a minute, but I'm interested in comparing the different approaches.

Best,
Ollie
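One way to cut the runtime, sketched under the assumption that cv.lpca accepts the ks, ms, and folds arguments shown in the package README: fix k, search a coarse grid of m values with fewer folds, and refine only around the winner.

library(logisticPCA)
ms_grid = c(2, 6, 10)                            # coarse grid first
cv_coarse = cv.lpca(dat, ks = 2, ms = ms_grid, folds = 3)
best_m = ms_grid[which.min(cv_coarse)]           # refine near this value if needed
fit = logisticPCA(dat, k = 2, m = best_m)

Here dat stands in for the 76x4623 matrix.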

Refactor methods

I have written different method functions for lpca, lsvd, and clpca. Many of them can probably be combined (see the sketch after this list). The methods to combine are:

  • print
  • plot
  • fitted (not for clpca)
  • predict
  • cv
  • cv.plot
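A minimal sketch of one way to combine them: a shared helper that thin class-specific S3 methods dispatch to (the helper name and field names below are hypothetical):

# hypothetical shared printer; each class keeps only a thin S3 wrapper
print_bpca = function(x, ...) {
  cat("Rank", x$k, "solution\n")
  cat(round(100 * x$prop_deviance_expl, 1), "% of deviance explained\n", sep = "")
  invisible(x)
}
print.lpca  = print_bpca
print.lsvd  = print_bpca
print.clpca = print_bpca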

Update irlba

The irlba package was recently updated to version 2.0.0. This update adds the ability to get the first few eigenvectors of a symmetric matrix using partial_eigen. In the past, I used the irlba function for this (without assuming symmetry), and it was very inefficient (hence use_irlba = FALSE for logisticPCA). If partial_eigen improves on this, I should also update generalizedPCA.

There is also the ability to center and scale, but that probably won't matter since our matrices are not sparse.
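A sketch of the intended usage, assuming the partial_eigen interface documented for irlba >= 2.0.0 (matrix as the first argument, n for the number of eigenpairs, and a list return with values and vectors):

library(irlba)
set.seed(1)
A = crossprod(matrix(rnorm(500 * 50), 500, 50))  # 50x50 symmetric PSD matrix
pe = partial_eigen(A, n = 3)                     # leading 3 eigenpairs
ev = eigen(A, symmetric = TRUE)                  # full decomposition, for comparison
max(abs(pe$values - ev$values[1:3]))             # should be near zero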

Add Tipping's formulation

Tipping, M. E. (1998). Probabilistic visualisation of high-dimensional binary data. NIPS 11, pp. 592-598.

prop_deviance_expl for each Principal Component

Hi,

I was wondering if logisticPCA::logisticSVD function might incorporate a way to retrieve (or calculate) the proportion of deviance explained by each principal component, in addition to the overall proportion explained by the model, which is already implemented.

I think this might be a nice new feature for the package!

Best regards,
Martín
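In the meantime, a rough way to approximate this (a sketch, not the package's method): fit models of increasing rank and difference the overall prop_deviance_expl values, keeping in mind that the fits are not exactly nested across k.

library(logisticPCA)
data("house_votes84")
pde = sapply(1:4, function(k) logisticSVD(house_votes84, k = k)$prop_deviance_expl)
per_component = diff(c(0, pde))   # approximate share for each successive component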

Different options for cross validation

Due to its structure, the convex formulation may prefer higher values of M regardless of k under the current setup of cv.clpca. It may be better to measure how well it reconstructs missing values of the matrix. Also, it is common for users to have a single holdout set of validation data instead of looping over all folds. A sketch of the matrix-completion idea follows the list below.

  • Matrix completion
  • Holdout validation
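A sketch of the matrix-completion option; it assumes the fitter tolerates NAs (missing-data handling is discussed elsewhere in this tracker) and that fitted() can return probabilities via type = "response":

# hide a random fraction of entries, fit, and score predictions on the held-out cells
holdout_deviance = function(x, k, m, prop = 0.1) {
  mask = matrix(runif(length(x)) < prop, nrow = nrow(x))
  x_train = x
  x_train[mask] = NA
  fit = logisticPCA(x_train, k = k, m = m)
  p_hat = fitted(fit, type = "response")
  # Bernoulli deviance on the held-out entries only
  -2 * sum(x[mask] * log(p_hat[mask]) + (1 - x[mask]) * log(1 - p_hat[mask]))
}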

Update citation

Should the citation refer to the paper?

Add @references tags to the logisticPCA and convexLogisticPCA functions.

Failed installing package logisticPCA

Hi,
I couldn't install your package. I get this error:

** installing vignettes
** testing if installed package can be loaded
Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
object 'checkCompilerOptions' not found
Calls: ::: -> get
Execution halted
ERROR: loading failed

  • removing ‘/mnt/gpfs/pt2/lib/R_3.2.3/logisticPCA’
    Installation failed: Command failed (1)

Do you have any ideas that could help?

Thanks

Functions crashing

Hi,

I am having the following issues. First, logisticSVD fails:

> logisticPCA::logisticSVD(bdata,k=2)
45 rows and 395 columns
Rank 2 solution

20.3% of deviance explained
11 iterations to converge
Warning message:
In logisticPCA::logisticSVD(bdata, k = 2) :
  Algorithm stopped because deviance increased.
This should not happen!
            Try rerunning with partial_decomp = FALSE

Second, logisticPCA also fails:

> logisticPCA(bdata,k=2)
Error in eigen(mat_temp, symmetric = TRUE) : 
  error code 1 from Lapack routine 'dsyevr'

No idea what's wrong, everything else works just fine. Unfortunately, I can't share the data.

Enhancement: Add examples to the README.

While trying to use LPCA in an analysis, I found it hard to work out how to call the methods. I think it would be helpful to include examples of how to call each method so this confusion can be mitigated.
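Agreed. Something along these lines, using the house_votes84 data that ships with the package (a sketch based on the documented interfaces; the m values and the plot type are assumptions):

library(logisticPCA)
data("house_votes84")

# the three formulations, each reducing to k = 2 dimensions
logsvd_model  = logisticSVD(house_votes84, k = 2)
logpca_model  = logisticPCA(house_votes84, k = 2, m = 4)
clogpca_model = convexLogisticPCA(house_votes84, k = 2, m = 4)

# cross-validate m for logisticPCA, then plot the scores
logpca_cv = cv.lpca(house_votes84, ks = 2, ms = 1:10)
plot(logisticPCA(house_votes84, k = 2, m = which.min(logpca_cv)), type = "scores")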

Add dataset

Possibilities include:

  • congressional voting
  • hall of fame voting

log_like_Bernoulli incorrect when missing data?

n = 100
d = 10
# fully observed binary matrix
x = matrix(sample(c(0, 1), d * n, TRUE), nrow = n)
log_like_Bernoulli(x = x, theta = outer(rep(1, n), gtools::logit(colMeans(x, na.rm = TRUE))))

# mask about 25% of entries; nrow = n so the mask's dimensions match x
which_missing = matrix(runif(n * d) < 0.25, nrow = n)
x[which_missing] = NA
log_like_Bernoulli(x = x, theta = outer(rep(1, n), gtools::logit(colMeans(x, na.rm = TRUE))))

The log-likelihood goes up with less data? (Each observed Bernoulli entry contributes a non-positive term, so summing over fewer observed entries can only raise the total; the two values are not directly comparable.)

running logisticPCA on large matrix

Hi,

I'm trying to run logisticPCA on a 105x91802 binary matrix. I've been getting the following error:

Error: vector memory exhausted (limit reached?)

The line of code causing the issue is qTq = crossprod(q), since this requires computing a 91802x91802 matrix, which my laptop can't handle.

Is there a workaround for this? Regular PCA still works for the matrix.
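For scale, a dense 91802x91802 double-precision matrix by itself needs roughly 63 GiB, which explains the failure:

91802^2 * 8 / 2^30   # ~62.8 GiB for one dense double matrix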

Thanks,
Ayan

Change M to m

To be consistent with the papers. In vignette/tests too.

Missing data

The current handling of missing data isn't optimal. Fix it.

Create methods

Create print and plot methods for the lpca class. Do the same for the lsvd class, in addition to a predict method. Possibly add a cv method as well.
