Coder Social home page Coder Social logo

theochem / procrustes Goto Github PK

View Code? Open in Web Editor NEW
106.0 106.0 20.0 19.66 MB

Python library for finding the optimal transformation(s) that makes two matrices as close as possible to each other.

Home Page: https://procrustes.qcdevs.org/

License: GNU General Public License v3.0

Python 100.00%
chemical-structure-similarity generalized-procrustes-analysis generalized-procrustes-problem gpa matrix-transformations molecular-alignment optimization optimization-algorithms orthogonal-procrustes-problem permutation-procrustes-problem procrustes procrustes-analysis quadratic-assignment-problem rotational-procrustes-problem softassign-method symmetric-procrustes-problem

procrustes's People

Contributors

abhishekmittal15 avatar ali-tehrani avatar banrovegrie avatar fanwangm avatar farnazh avatar janosh avatar kimt33 avatar lajd avatar msricher avatar paulwayers avatar wilhadams avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

procrustes's Issues

More efficient evaluation of the error function for k-opt

In permutation Procrustes, the error is computed by explicit multiplications like (A x P) or (P1 x A x P2). This is extremely inefficient because one is only rearranging rows/columns of A (an O(n^2) operation) but one is using matrix multiplication (which is O(n^3)) to do this. This can be done efficiently by:

  1. For each permutation matrix, construct the corresponding permutation of indices; then constructed the matrix with permuted rows/columns with fancy indexing.
  2. For each permutation matrix and the original matrix, construct a sparse matrix; use sparse matrix multiplication. The sparse matrix representation for A should be constructed only once or otherwise you'll give up much of the gains due to overhead.

(a) would be preferred in my view. It amounts to a more efficient way to evaluate the error function for k-opt in its application to Procrustes.

Web Site

We should get the https://procrustes.qcdevs.org/ web site working.

Can't install from pip

I am on Python 3.8.0. When I try to install procrustes from pip, I get this error:

ERROR: Could not find a version that satisfies the requirement qc-procrustes (from versions: none)
ERROR: No matching distribution found for qc-procrustes

Also, I can't clone the repo, since I get permission denied:

Cloning into 'procrustes'...
[email protected]: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

Protein alignment with IOData

I tried doing protein alignment with IOData as you have built for the manuscript and the code snippet seems not working.

import numpy as np

from iodata import load_one
from procrustes import rotational

# load PDB
pdb = load_one("2hhb.pdb")

# get coordinates of C_alpha atoms in chains A & C (in angstrom)
chainid = pdb.extra['chainid']
attypes = pdb.atffparams['attypes']
ca_a = pdb.atcoords[(chainid == 'A') & (attypes == 'CA')] * 0.529177
ca_c = pdb.atcoords[(chainid == 'C') & (attypes == 'CA')] * 0.529177

print("RMSD of intial coordinates:")   # output: 39.4
print(np.sqrt(np.mean(np.sum((ca_a - ca_c)**2, axis=1))))

# rotational Procrustes analysis
result = rotational(ca_a, ca_c, translate=True)

# compute transformed (translated & rotated) coordinates of chain A
ca_at = np.dot(result.new_a, result.array_u)

print("RMSD of transformed coordinates:")   # output: 0.23
print(np.sqrt(np.mean(np.sum((ca_at - result.new_b)**2, axis=1))))

I guess the problem is that there is no attribute called chainid after checking the latest source code.
Do you know which part I get things wrong? Thank you.
@FarnazH

Non square data in softassign function - cannot reshape array

Dear all,
I do not know how to get the permutation matrix (like what I would get from the softassign function). I tried the following code, that works in the case of of a square matrix a (d=n), but not for a non square matrix. The error I get is

File "/home/yann/.virtualenvs/work/lib/python3.8/site-packages/procrustes/softassign.py", line 201, in softassign
    c_tensor = array_c.reshape(row_num, row_num, row_num, row_num)

ValueError: cannot reshape array of size 400 into shape (10,10,10,10)
import numpy as np
from procrustes import *

from scipy.stats import ortho_group

# random input 10x2 matrix A
n=10
d=2
a = np.random.rand(n, d)

# random orthogonal 2x2 matrix T
t = ortho_group.rvs(d)

# target matrix B (which is a shifted AxT)
b = np.dot(a, t) + np.random.rand(1, d)

# orthogonal Procrustes analysis with translation
result = orthogonal(a, b, scale=True, translate=True)

# display Procrustes results
# error (expected to be zero)
print(result.error)

res = softassign(a, b)
print(res['t'])

How could I get the permutation matrix directly from the orthogonal function ? Is this a bug ?

thank you very much for your time
best regards

yann

Would be nice to have the used scaling and translation returned in the answer

Now the scaling factor and translation vector are calculated in the utilities but they are not returned in the answer (result = orthogonal(a, b, scale=True, translate=True)). For 3D-graphics calculations, these would be handy to scale and translate any new points found from the distance matrix (I get preliminary coordinates from MDS) back into the real coordinates of b.

kopt_heuristic

Move the functions for kopt_heuristic to a new module called kopt. They are currently in utils module.

Test error in "SVD Converge" in generic.py

During testing with github action, the follow test failed "test_generic_rectangular_translate_scale" for test_generic.py" with m = 829, n = 878 on python 3.7.

The error is
def _raise_linalgerror_svd_nonconvergence(err, flag): raise LinAlgError("SVD did not converge") E numpy.linalg.LinAlgError: SVD did not converge.

For more info on the github action, see https://github.com/Ali-Tehrani/procrustes/runs/2058961187.

However, running this test on my own machine it passes. So it seems the error is hardware related. Google searching this returns a 2019 SciPy issue [1]. I would recommend to decrease the matrix size so that it isn't hardware/MKL/OpenBlas dependent.

[1] - scipy/scipy#10032

Add References to Docstrings

I'm happy with putting references centrally. I don't worry so much about keeping them up-to-date, or having long documentation (especially if the references are just at the very end anyway), since most of the references we should be including are just "the classics." But I know that because our documentation is embedded in the code, it can be annoying to have a lot of documentation before the code even starts.

I guess the question, then, is whether it is good to have the specific reference where a reader can get more information about the underlying algorithm/method and its justification. I would tend towards including references in the code I guess (where one might wonder what's going on) but as long as there is a source for the references, it is OK I think.

Originally posted by @PaulWAyers in #68 (comment)

Deprecated `pinv2` function in `scipy.linalg`

The pinv2 in Scipy has been deprecated since 1.7.0, and it has been removed for 1.9.0 and above, scipy/scipy#16245. Originally, the (Moore-Penrose) pseudo-inverse is computed by SVD including all 'large' singular values (using scipy.linalg.pinv2) and scipy.linalg.pinv obtains the (Moore-Penrose) pseudo-inverse by using least-squares solver. The least-squares implementation is less efficient, but more robust, than the SVD implementation (

use_svd : bool, optional
If True, the (Moore-Penrose) pseudo-inverse is computed by singular-value decomposition
(SVD) including all 'large' singular values (using `scipy.linalg.pinv2`).
If False, the the (Moore-Penrose) pseudo-inverse is computed by least-squares solver
(using `scipy.linalg.pinv`). The least-squares implementation is less efficient, but more
robust, than the SVD implementation.
). But later, I think both functions use the SVD solution. Therefore, it was proposed to remove pinv2.

So, for newer versions of Scipy >= 1.9.0, it would be a problem to use pinv2. Therefore, I think the easiest solution is to stick to pinv and remove pinv2. Another option is to provide different solutions for different versions of Scipy greater than 1.9.0 or less than 1.9.0. What do you prefer? @PaulWAyers @Ali-Tehrani

Improve efficiency of kopt implementation

Actually, why make a permutation matrix (mostly zeros and ones) and multiply by A (cost of O(n^3)). You can directly compute A with permuted rows/columns, then compute the error in (permuted) A. This has cost O(n^2) (computing the error) instead of O(n^3). IF the error is reduced, you can return the permutation, or even the permutation matrix (which can be constructed at the very end).

Originally posted by @PaulWAyers in #62 (comment)

translation, scling, weighting for positive semi-definite Procrustes problem

The psdp_woodgate (#170) has set up a good starting point for implementing PSDP algorithms. Thanks! @banrovegrie

We are missing a few initial setting up such as scaling, translation, etc. You can add this code blocks like

 # check inputs
    new_a, new_b = setup_input_arrays(
        a,
        b,
        unpad_col,
        unpad_row,
        pad,
        translate,
        scale,
        check_finite,
        weight,
    )
    if new_a.shape != new_b.shape:
        raise ValueError(
            f"Shape of A and B does not match: {new_a.shape} != {new_b.shape} "
            "Check pad, unpad_col, and unpad_row arguments."
        )

And you will need to change the finial output accordingly, such as new_a=new_a and new_b=new_b. Sample codes can be found at

def orthogonal(
a: np.ndarray,
b: np.ndarray,
pad: bool = True,
translate: bool = False,
scale: bool = False,
unpad_col: bool = False,
unpad_row: bool = False,
check_finite: bool = True,
weight: Optional[np.ndarray] = None,
lapack_driver: str = "gesvd",
) -> ProcrustesResult:
r"""Perform orthogonal Procrustes.
Given a matrix :math:`\mathbf{A}_{m \times n}` and a reference matrix :math:`\mathbf{B}_{m
\times n}`, find the orthogonal transformation matrix :math:`\mathbf{Q}_{n
\times n}` that makes :math:`\mathbf{AQ}` as close as possible to :math:`\mathbf{B}`.
In other words,
.. math::
\underbrace{\min}_{\left\{\mathbf{Q} | \mathbf{Q}^{-1} = {\mathbf{Q}}^\dagger \right\}}
\|\mathbf{A}\mathbf{Q} - \mathbf{B}\|_{F}^2
This Procrustes method requires the :math:`\mathbf{A}` and :math:`\mathbf{B}` matrices to
have the same shape, which is gauranteed with the default ``pad`` argument for any given
:math:`\mathbf{A}` and :math:`\mathbf{B}` matrices. In preparing the :math:`\mathbf{A}` and
:math:`\mathbf{B}` matrices, the (optional) order of operations is: **1)** unpad zero
rows/columns, **2)** translate the matrices to the origin, **3)** weight entries of
:math:`\mathbf{A}`, **4)** scale the matrices to have unit norm, **5)** pad matrices with zero
rows/columns so they have the same shape.
Parameters
----------
a : ndarray
The 2D-array :math:`\mathbf{A}` which is going to be transformed.
b : ndarray
The 2D-array :math:`\mathbf{B}` representing the reference matrix.
pad : bool, optional
Add zero rows (at the bottom) and/or columns (to the right-hand side) of matrices
:math:`\mathbf{A}` and :math:`\mathbf{B}` so that they have the same shape.
translate : bool, optional
If True, both arrays are centered at origin (columns of the arrays will have mean zero).
scale : bool, optional
If True, both arrays are normalized with respect to the Frobenius norm, i.e.,
:math:`\text{Tr}\left[\mathbf{A}^\dagger\mathbf{A}\right] = 1` and
:math:`\text{Tr}\left[\mathbf{B}^\dagger\mathbf{B}\right] = 1`.
unpad_col : bool, optional
If True, zero columns (with values less than 1.0e-8) on the right-hand side of the intial
:math:`\mathbf{A}` and :math:`\mathbf{B}` matrices are removed.
unpad_row : bool, optional
If True, zero rows (with values less than 1.0e-8) at the bottom of the intial
:math:`\mathbf{A}` and :math:`\mathbf{B}` matrices are removed.
check_finite : bool, optional
If True, convert the input to an array, checking for NaNs or Infs.
weight : ndarray, optional
The 1D-array representing the weights of each row of :math:`\mathbf{A}`. This defines the
elements of the diagonal matrix :math:`\mathbf{W}` that is multiplied by :math:`\mathbf{A}`
matrix, i.e., :math:`\mathbf{A} \rightarrow \mathbf{WA}`.
lapack_driver : {'gesvd', 'gesdd'}, optional
Whether to use the more efficient divide-and-conquer approach ('gesdd') or the more robust
general rectangular approach ('gesvd') to compute the singular-value decomposition with
`scipy.linalg.svd`.
Returns
-------
res : ProcrustesResult
The Procrustes result represented as a class:`utils.ProcrustesResult` object.
Notes
-----
The optimal orthogonal matrix is obtained by,
.. math::
\mathbf{Q}^{\text{opt}} =
\arg \underbrace{\min}_{\left\{\mathbf{Q} \left| {\mathbf{Q}^{-1} = {\mathbf{Q}}^\dagger}
\right. \right\}} \|\mathbf{A}\mathbf{Q} - \mathbf{B}\|_{F}^2 =
\arg \underbrace{\max}_{\left\{\mathbf{Q} \left| {\mathbf{Q}^{-1} = {\mathbf{Q}}^\dagger}
\right. \right\}} \text{Tr}\left[\mathbf{Q^\dagger}\mathbf{A^\dagger}\mathbf{B}\right]
The solution is obtained using the singular value decomposition (SVD) of the
:math:`\mathbf{A}^\dagger \mathbf{B}` matrix,
.. math::
\mathbf{A}^\dagger \mathbf{B} &= \tilde{\mathbf{U}} \tilde{\mathbf{\Sigma}}
\tilde{\mathbf{V}}^{\dagger} \\
\mathbf{Q}^{\text{opt}} &= \tilde{\mathbf{U}} \tilde{\mathbf{V}}^{\dagger}
The singular values are always listed in decreasing order, with the smallest singular
value in the bottom-right-hand corner of :math:`\tilde{\mathbf{\Sigma}}`.
Examples
--------
>>> import numpy as np
>>> from scipy.stats import ortho_group
>>> from procrustes import orthogonal
>>> a = np.random.rand(5, 3) # random input matrix
>>> q = ortho_group.rvs(3) # random orthogonal transformation
>>> b = np.dot(a, q) + np.random.rand(1, 3) # random target matrix
>>> result = orthogonal(a, b, translate=True, scale=False)
>>> print(result.error) # error (should be zero)
>>> print(result.t) # transformation matrix (same as q)
>>> print(result.new_a) # translated array a
>>> print(result.new_b) # translated array b
"""
# check inputs
new_a, new_b = setup_input_arrays(
a,
b,
unpad_col,
unpad_row,
pad,
translate,
scale,
check_finite,
weight,
)
if new_a.shape != new_b.shape:
raise ValueError(
f"Shape of A and B does not match: {new_a.shape} != {new_b.shape} "
"Check pad, unpad_col, and unpad_row arguments."
)
# calculate SVD of A.T * B
u, _, vt = scipy.linalg.svd(np.dot(new_a.T, new_b), lapack_driver=lapack_driver)
# compute optimal orthogonal transformation
u_opt = np.dot(u, vt)
# compute one-sided error
error = compute_error(new_a, new_b, u_opt)
return ProcrustesResult(error=error, new_a=new_a, new_b=new_b, t=u_opt, s=None)
.

Given that you want the first PR to be merged in a quicker manner, I have merged #170. You can fix this in a new PR together with adding testing codes or you can just create a small PR for this.

Update and polish the sphinx docs

This is a reminder for myself that a few things need to be fixed.

  • Update the header
  • Update contact information using QC-Dev community
  • Update the footnote information in conf.py
  • Update the equations and make it consistent with our manuscript.
  • Update installation guide
  • Update quick start example
  • Add API rst file for kopt module
  • fix examples, following . This related to #26
  • Update docstrings if necessary: equations, argument names, and default parameters. This relates to #25.
  • Add RST file style checker

Support Weight Matrices

Currently, we allow weights for each dimension, and then use a (diagonal) weight matrix. In some applications in quantum mechanics, it would be useful to have a positive semidefinite weight matrix (which would be symmetric in all cases I can imagine). The idea would be to let the user pass a vector (then construct a diagonal weight matrix) or a matrix (then use the matrix).

In the cases I'm thinking about, the weight matrix would be the atomic-orbital overlap matrix or the one-electron reduced density matrix.

Fix Binder

I noticed that the binder doesn't work.

Clarify weight argument

The weigh argument is added to all flavors of Procrustes but it does not show up in the mathematical formulation of the problem, also the documentation is not clear at all on how it would be used.

Support of missing values for generalized Procrustes analysis

When performing generalized Procrustes analysis, it is very likely to have missing values in the configuration matrices. Casper J. Albers and John C. Gower proposed a strategy to deal with such a situation, _Albers, C. J., & Gower, J. C. (2010). A general approach to handling missing values in Procrustes analysis. Advances in Data Analysis and Classification, 4(4), 223-237.. A very nice feature of their algorithm is they can also deal with centring and/or standardisation.

This issue relates to #76.

PyPI

Make sure that similar to IOData, we can install Procrustes through:

pip install qc-procrustes

2-sided Orthogonal Procrustes for Rectangular Matrices

In chapter 7.5.2 of Procrustes Problems by J. C. Gower and G. G. Dijksterhuis an algorithm for 2-sided orthogonal Procrustes is presented. This generalizes the current treatment, which requires that the matrices being matched be symmetric and square. This is also explained in section 3.3 of Pythagoras Papadimitriou's Ph.D. thesis at the Univ. of Manchester.
papad93.pdf

Right now, the object is to find an orthogonal matrices, Q1 and Q2, such that || Q1 A Q2 - B || is minimized, where A and B are symmetric square matrices and have the same shape (i.e., both are mxm matrices). However, this reference shows that this can work even as long as A and B have the same shape; they need not be square or symmetric.

This could be an embellishment on the existing method (it contains the old method as a special case) or it could be implemented as a separate method. I'd tend to do the latter, and leave the easy case as a routine that can be used where applicable.

L-1 procrustes

Hi, Thanks for the project! I'm interested in solving procrustes using a more robust metric of dispersion than F norm.

I saw a couple papers online that solve e.g. smooth surrogates of l1, but wondering if you have any pointers in particular.

An Eigenvalue-Based Method for the Unbalanced Orhtogonal Procrustes Problem

In our implementations, when two input matrices have different shapes (unbalanced), zero-padding is used to make them balanced. An alternative method is proposed in a recent SIAM paper,

LH Zhang, WH Yang, C Shen, J Ying, SIAM J. Matrix Anal. Appl., 41(3), 957โ€“983..

In this paper, they showed that the orthogonal Procrustes can be transformed into an eigenvalue minimization problem, in which an adapted self-consistent field (SCF) iteration is used to solve the problem.

We can leave this for future implementation. Anyone interested to implement the algorithm is also welcome!

Optimization over doubly-stochastic matrices

Analogous to what was done with k-opt, we should add functionality to optimize a user-specified function over the set of doubly-stochastic matrices. (Doubly stochastic matrices are nonnegative matrices that have row-sum and column-sum equal to 1.)

  • scipy.opt could be used for the (local) optimization starting from a user-specified starting point. This is a classic problem for the slsqp algorithm but the trust-constr is also appropriate. Which algorithm could be a user option. Or just choose one; this is intended as a useful utility and should not be confused with a cutting-edge treatment for this problem.
  • The default starting point could be constructed as (1) take a random orthogonal matrix; (2) square all the elements).
  • as a user option, the function could return the permutation matrix that is closest to the doubly-stochastic matrix that is found.

This sort of problem showed up in @wilhadams work. For an objective function of the form in one-sided permutation Procrustes, it amounts to a (slow, iterative) alternative to the Hungarian algorithm for the assignment problem.

Note that in Procrustes problems of a quite general (tensor-ish) sense, it is usually important to maximize (or minimize the negative of) the cosine/overlap between the optimizee and the target, as direct minimization of the error doubles the degree of the optimization.

More algorithms on `PSDP`

The original "algorithm 3" has been peer reviewed (I think; it is published in a conference proceedings and that is never a sure thing).
http://www.scielo.org.co/pdf/rcm/v55n1/0034-7426-rcm-55-01-109.pdf

However, I really like Fanwang's suggestion. The Oviedo paper (algorithm 3) performs similarly to what they call PARATAN and what the above reference calls PARTAN. The above reference seems to have good performance, and has four closely related algorithms which could be implemented simulataneously, switching between the algorithms as specified by the user by a keyword. This is explained at the top of Section 5.

Specifically, the

  • projected gradient method
  • fast gradient method
  • parallel tangents method
  • semi-analytic fast gradient method

This gives a nice alternative to the first two algorithms, which focus on a more analytic approach (more like the other methods we have considered). This is a numerical approach, and perhaps might be faster (hard to know until we test).

Originally posted by @PaulWAyers in #189 (comment)

General 2-sided Procrustes Problem

We currently support the solution of the general (unconstrained transformation) one-sided Procrustes problem but not the two-sided unconstrained Procrustes problem. A statement, and solution, of this problem can be found in section 3.2 of the Ph.D. thesis of Pythagoras Papadimitriou (Univ. Manchester). It would be good to support this case also.
papad93.pdf

lapack_driver argument

From #69 & #77: Add a function variable "lapack_driver" to orthogonal, permutation, rotational, and symmetric functions with docstring lapack_driver : {"gesvd", "gesdd"}, optional used in the singular value decomposition function from SciPy. Only allowed two options, with "gesvd" being less-efficient than "gesdd" but is more robust. Default is "gesvd".

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.