phasesresearchlab / espei

Fitting thermodynamic models with pycalphad - https://doi.org/10.1557/mrc.2019.59

Home Page: http://espei.org

License: MIT License

Python 100.00%
calphad materials materials-science pycalphad python thermodynamics

espei's People

Contributors

bocklund, jwsiegel2510, richardotis, toastedcrumpets, wahab2604, zhyrek


espei's Issues

Be able to provide a TDB with existing parameters to parameter selection

The new Database will be added to the current Database and the initial contributions would be subtracted out. The ideal use case for this is

  • Using literature magnetic models
  • Fitting ternary interaction parameters with fixed binary models.

Eventually, when we implement unary fitting, unaries can be passed in this way to be fixed.

Fitting HM, SM, CPM in parameter selection

Right now, ESPEI can perform parameter selection for HM_FORM/HM_MIX, SM_FORM/SM_MIX, or CPM_FORM/CPM_MIX data, but it would be useful to be able to fit HM-HM(SER), SM and CPM directly as well.

In the past, I would have suggested the following workaround procedure, which requires no changes to the ESPEI code, to fit absolute energies that are compatible with SGTE:

  1. Create a custom reference state for ESPEI where the energy is zero for the lattice stability of the SER phase (advanced, optional technique: if you want to keep existing lattice stabilities, update them in the custom reference state to not make them relative to the GHSER__ symbol)
  2. Treat HM-HM(SER), SM, CPM data as the _FORM in ESPEI datasets and fit them. This will give energies for each phase referenced to GHSER__ functions which are zero, so the absolute energies and derivatives are fit.

I'm now convinced that fitting absolute-valued versions of these data is useful enough that ESPEI should allow such data to be fit when given by a user. It shouldn't be too much work for a student who wants to pick up this project.

I'm not completely sure, but it might "just work" to change espei.parameter_selection.utils.shift_reference_state so that it does not raise on HM, SM, or CPM data. One would have to decide whether it is the user's responsibility to shift to HM-HM(SER) or whether ESPEI should do it automatically (which should be possible now that Database().refstates data is supported by pycalphad and used by ESPEI).
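As a sketch of the shifting arithmetic only (a hypothetical helper, not part of ESPEI), shifting an absolute enthalpy to the HM-HM(SER) convention means subtracting the composition-weighted SER reference energies:

```python
# Hypothetical helper (not ESPEI API) illustrating the arithmetic of
# shifting an absolute molar quantity to the HM-HM(SER) convention.
def shift_to_ser(value, composition, ser_values):
    """Subtract the composition-weighted SER reference from an absolute value.

    value       : absolute molar enthalpy (J/mol-atom)
    composition : dict mapping component -> mole fraction
    ser_values  : dict mapping component -> H(SER) of that pure component
    """
    reference = sum(x * ser_values[comp] for comp, x in composition.items())
    return value - reference

# Example: an absolute enthalpy of -1000 J/mol-atom in a 50/50 A-B alloy,
# with made-up SER reference energies for the pure components
shifted = shift_to_ser(-1000.0, {"A": 0.5, "B": 0.5}, {"A": -500.0, "B": -300.0})
# reference = 0.5*(-500) + 0.5*(-300) = -400, so shifted = -600
```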

Test that the ZPF error is actually zero when it should be

If known tielines are used as datasets from a database, calculating ZPF error for that database and equilibria should give 0 error.

I don't have a test case or any reason to suspect this is broken, but it would be a good sanity check.

Enable saving the random state, so that restarts are still reproducible

ESPEI can be deterministic and reproducible, but restarting resets the random state.

That means one run of 1000 steps and two runs of 500 steps each (1000 total) will give different results, even though each is deterministic.

A solution is to be able to dump and load the random state on restart.
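A minimal sketch of dumping and loading the state, assuming a NumPy RandomState generator (the helper names are illustrative, not ESPEI API):

```python
import pickle

import numpy as np

def dump_state(rng):
    # Serialize the generator's internal state so a restart can resume it
    return pickle.dumps(rng.get_state())

def load_state(rng, blob):
    # Restore a previously dumped state into any generator instance
    rng.set_state(pickle.loads(blob))

# One run of 1000 draws...
rng = np.random.RandomState(42)
one_run = rng.rand(1000)

# ...matches two runs of 500 draws each when the state is saved and restored
rng = np.random.RandomState(42)
part1 = rng.rand(500)
blob = dump_state(rng)

restarted = np.random.RandomState(0)  # a fresh generator for the "restart"
load_state(restarted, blob)
part2 = restarted.rand(500)
```

In practice the pickled blob would be written to a file alongside the trace and probability arrays so a restarted run can pick it up.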

Refactor multiplot to include phase diagram plotting

pycalphad's eqplot filters the active phases and sorts them alphabetically to get the phase names from pycalphad.plot.utils.phase_legend. If phases are not sorted and inactive phases are not removed, multiplot will not produce the same phase_legend and colors as eqplot.
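A minimal sketch of the preprocessing multiplot would need to share with eqplot before calling phase_legend (the function here is a hypothetical stand-in, not the pycalphad API):

```python
def legend_phases(candidate_phases, active_phases):
    """Filter to active phases and sort alphabetically, mirroring the
    preprocessing eqplot performs before calling
    pycalphad.plot.utils.phase_legend. Hypothetical helper, not ESPEI API."""
    active = set(active_phases)
    return sorted(p for p in candidate_phases if p in active)

# The same set of phases always yields the same legend order, and therefore
# the same color assignment, regardless of input ordering
phases = legend_phases(["LIQUID", "FCC_A1", "LAVES_C15"], {"FCC_A1", "LIQUID"})
# phases == ["FCC_A1", "LIQUID"]
```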

Using Model class when building phases never calculates a reasonable error

We would like to be able to optimize databases that may use custom models, so we should support building phases with the Model class and its subclasses (which has the added benefit of being more performant than CompiledModel).

See the diff from commit 2bb8c49, where CompiledModel was added and Model removed.

It looks like we previously passed around callables for the objective as well as the gradient and Hessian.

Some doubt about parameter selection?

I see that ESPEI uses AICc to prevent over-parameterization.
However, the F-test was mentioned in the doctoral thesis "SOFTWARE ARCHITECTURE FOR CALPHAD MODELING OF PHASE STABILITY AND TRANSFORMATIONS IN ALLOY ADDITIVE MANUFACTURING PROCESSES".

In the thesis, AICc is used to fit single-phase parameters and the F-test is used to fit multi-phase parameters. It looks like ESPEI only uses AICc and does not use the F-test.

Questions:
(1) Is AICc suitable for multi-phase fitting?
(2) Why was the F-test not taken into consideration?

Tracing memory leaks in long-running jobs

Note that this is a sketch of a procedure and contains code that has not yet been tested.

Key tools:

  • pympler - find which objects are using the most memory
  • objgraph - generate flow graphs of backreferences to any object
  • pyrasite - attach a Python console to a running process

This will require some modification to the base code. Use the SummaryTracker from pympler. Add this code somewhere before MCMC sampling starts:

from pympler import tracker

tr = tracker.SummaryTracker()
# Calibrate by calling print_diff() a few times until it reports no changed objects
tr.print_diff()
tr.print_diff()
tr.print_diff()
# ... create all the objects here (all of the function's setup code) ...
# ... all of the sampling code runs here ...

Start the sampling job with just a single core. Then use pyrasite-shell to connect to the running Python process by PID: http://pyrasite.readthedocs.io/en/latest/Shell.html

In the remote shell:

import objgraph ; tr = objgraph.by_type('SummaryTracker')[0] ; tr.print_diff()

This will give output like:

                                     types |   # objects |   total size
========================================== | =========== | ============                                                                                      
                              <class 'list |       18730 |      1.71 MB                                                                                      
                               <class 'str |       18961 |      1.35 MB                                                                                      
              <class 'sip.methoddescriptor |        8287 |    453.20 KB                                                                                      
                              <class 'dict |         513 |    304.12 KB                                                                                      
                               <class 'int |        5375 |    152.21 KB                                                                                      
  <class 'sympy.core.assumptions.StdFactKB |         144 |     90.94 KB                                                                                      
                               <class 'set |          43 |     72.91 KB                                                                                      
                 <class '_lrucache.hashseq |         509 |     57.82 KB                                                                                      
                             <class 'tuple |         473 |     29.11 KB                                                                                      
          <class 'sympy.core.numbers.Float |         338 |     26.41 KB                                                                                      
                   <class '_lrucache.clist |         509 |     23.86 KB                                                                                      
           <class 'tinydb.database.Element |          32 |     15.50 KB
            <class 'sip.variabledescriptor |         184 |     12.94 KB
      <class 'PyQt4.QtCore.QLocale.Country |         248 |     12.59 KB
                <class 'sympy.core.mul.Mul |         174 |     12.23 KB

To start spot checking object backreferences, use

objgraph.show_backrefs(objgraph.by_type('list')[0], max_depth=10)

A graph will be rendered as a PNG and written to a temporary directory. The path to the graph will be output to the console.

Documentation

The following functions should be fully documented with a description, arguments/keyword arguments, returns, and examples (if applicable):

  • espei.paramselect._fit_parameters could be improved to make it clear that this selects the model from data with the AIC

The next set of functions are short and just need the minimal description, inputs and outputs.

  • espei.paramselect._build_feature_matrix
  • espei.paramselect._generate_symmetric_group
  • espei.core_utils.get_data
  • espei.core_utils.get_samples
  • espei.core_utils.symmetry_filter
  • espei.paramselect.estimate_hyperplane
  • espei.paramselect.tieline_error
  • espei.paramselect.multi_phase_fit

Web documentation

  • Note that single phase data is stored per-atom, e.g. (J/mol-atom) rather than (J/mol-form).
  • contribution guide
  • where to get support

Benchmark starting distribution for MCMC

Currently our walkers (concurrent chains) are initialized by sampling a Gaussian distribution that has a standard deviation of 10% of the parameter.

My understanding of the ensemble sampler implemented in emcee is that the distribution that new parameters are selected from depends on the other active walkers. This means that the initial rate of convergence is strongly dependent on the distribution used to generate these walkers.

Initializing chains from larger Gaussian distributions means that we are less certain about our parameters initially and we will be searching a larger space in the initial iterations. Having too large a distribution initially might mean slow parameter convergence because the chains have to scale down to the relevant sampling space. Having too small of an initial distribution can cause the reverse, in that we waste a lot of time scaling up our sampling space.

We should benchmark different starting points for a given number of MCMC steps and compare the rate of convergence of parameter mixing with a single run that has 'fully' converged.
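A sketch of the current initialization scheme, with the standard deviation fraction exposed as the knob to benchmark (the function name and signature are illustrative, not ESPEI API):

```python
import numpy as np

def initialize_walkers(params, n_walkers, std_frac=0.10, seed=None):
    """Sample walkers from Gaussians centered on each parameter with a
    standard deviation of std_frac * |parameter|. ESPEI's current default
    corresponds to std_frac=0.10; benchmarking would sweep this value."""
    rng = np.random.RandomState(seed)
    params = np.asarray(params, dtype=float)
    return rng.normal(loc=params, scale=std_frac * np.abs(params),
                      size=(n_walkers, params.size))

# 100 walkers around two parameters; each column is one parameter's samples
walkers = initialize_walkers([1000.0, -50.0], n_walkers=100, seed=0)
# walkers.shape == (100, 2); column means are near the starting parameters
```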

Plot data points with a literature reference in the legend

ZKL suggested some kind of Mendeley integration. I think it would also be reasonable and fit into the spirit of ESPEI to use bibtex files (possibly managed and imported/exported from Mendeley). There are several benefits to using bibtex:

  1. With bibtex the references can live and be versioned in plain text with the data sets they are referenced in.
  2. We can use the common functionality and existing tooling for bibtex to generate reference legends in figures based on bibtex labels, including styles. (Matplotlib supports LaTeX rendering, which will be helpful here when LaTeX macros are used in titles - like chemistry notation.)
  3. Reference libraries can be managed with existing software and be agnostic to which software is used (almost everything supports bibtex).

Test suite

The following should be tested in order to have unit tests covering the core functionality of ESPEI

  • MCMC likelihood function, espei.paramselect.lnprob
  • espei.core_utils.get_data retrieves the right data (do in migration from TinyDB 2 to TinyDB 3)
  • espei.paramselect.fit_formation_energy should work for endmembers and interactions (mixing). Test against two one-data-point cases of formation energy for each, then one with temperature dependence for the endmember.
  • AIC parameter selection should choose the right model with the right values (espei.paramselect._fit_parameters). When writing this test, make sure to verify that the chosen test case really is the lowest AIC among all the models and that all the possible models (parameter combinations) were chosen.
  • espei.core_utils.endmembers_from_interaction are properly computed for several cases of mixing sublattices
  • espei.core_utils.get_samples are properly computed for several cases of mixing sublattices
  • espei.core_utils.build_sitefractions properly constructs site fractions from sublattice configurations and occupancies
  • espei.paramselect._generate_symmetric_group handles cases with and without symmetry correctly

URI and multiple-output support

This is a feature request which is probably out of scope for #28.

Could every place where the run settings file accepts a filename or path also accept a general URI (e.g., https, ssh, git)? I think urlparse/urllib in the stdlib makes this a reasonable request.
See: https://stackoverflow.com/questions/22238090/validating-urls-in-python
One complicating factor is all the calls to open() and np.load() would need to get filtered through urllib, but I think this would be a very nice feature long term: Download datasets pinned to a Git repo, upload output TDBs to an S3 bucket, etc.
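A sketch of the dispatch layer this filtering would need, using the stdlib urlparse (the function name and scheme list are illustrative, not a proposed API):

```python
from urllib.parse import urlparse

def classify_path(value):
    """Decide whether a settings value is a remote URI or a local path,
    so that open()/np.load() calls can be routed through urllib when
    needed. Sketch only; the scheme list here is an assumption."""
    scheme = urlparse(value).scheme
    if scheme in ("http", "https", "ssh", "git", "s3"):
        return "remote"
    return "local"

classify_path("https://example.com/datasets.json")  # -> "remote"
classify_path("input-datasets/data.json")           # -> "local"
```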

Related to this, being able to specify the output key multiple times would be useful once it would be possible to write results out to multiple remote locations.

Fit to other thermochemical data

Enable fitting to thermochemical data such as activities.

Should MCMC also consider this along with single-phase data (e.g., heat capacities)?

Issues reproducing Cu-Mg example

I had several issues running the Cu-Mg example from the ESPEI website. I installed ESPEI using the conda command, and took the Cu-Mg data directory from the ESPEI-datasets repository.

I first tried reproducing the diagram from the section titled "First-principles phase diagram". The code ran successfully, but the returned phase diagram didn't match the example well (attached image: diagram_dft).

I then tried reproducing the results in the MCMC optimization section. I wasn't able to successfully perform the MCMC optimization. The code returned numerous errors over the course of several minutes and eventually hung with no further output.

This file contains the full python output when I ran the optimization:
espei_mcmc_error.txt

Here is my python version and installed packages/versions:
python_info.txt

Validate input JSON datasets

Check...

  1. Phases defined are in the system
  2. Phases in the sublattice model are the same as those defined in phases
  3. Same as above for components
  4. Sublattices also
  5. The shape of values is correct based on the conditions
  6. Should failures be input errors, validation errors, or JSON errors?

This is less of an issue when things are automated.
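As a sketch of check 5 above, assuming the usual ESPEI dataset layout of one values axis each for P, T, and sublattice configurations (the function name is illustrative, and a real validator would cover the other checks too):

```python
import numpy as np

def check_values_shape(dataset):
    """Verify that the shape of 'values' matches the conditions:
    (len(P), len(T), number of sublattice configurations).
    Sketch only; raises ValueError on mismatch."""
    conds = dataset["conditions"]
    expected = (
        len(np.atleast_1d(conds["P"])),
        len(np.atleast_1d(conds["T"])),
        len(dataset["solver"]["sublattice_configurations"]),
    )
    actual = np.array(dataset["values"]).shape
    if actual != expected:
        raise ValueError(f"values shape {actual} != expected {expected}")

ds = {
    "conditions": {"P": 101325, "T": [300, 400]},
    "solver": {"sublattice_configurations": [["CU"], ["MG"]]},
    "values": [[[1.0, 2.0], [3.0, 4.0]]],
}
check_values_shape(ds)  # shape (1, 2, 2) matches, no error
```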

Implement AICc

AICc aims to prevent over-parameterization for small numbers of samples.

$$ \mathrm{AICc} = \mathrm{AIC} + \frac{2k^2 + 2k}{n - k - 1} = 2k - 2\ln(L) + \frac{2k^2 + 2k}{n - k - 1} $$

where k is the number of parameters, L is the likelihood, and n is the number of samples.

AICc collapses to AIC for large n.

All that needs to be done is to change the formulation in the paramselect.py module.
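A minimal implementation of the formula above (a sketch; paramselect.py may organize this differently):

```python
def aicc(k, n, log_likelihood):
    """Corrected Akaike information criterion:
    AICc = 2k - 2 ln(L) + (2k^2 + 2k) / (n - k - 1)"""
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k ** 2 + 2 * k) / (n - k - 1)

# For small n the correction term penalizes extra parameters heavily;
# for large n it vanishes and AICc approaches AIC.
aicc(3, 10, -100.0)       # noticeably larger than the AIC of 206
aicc(3, 10 ** 9, -100.0)  # essentially equal to the AIC of 206
```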

Allow for selecting number of cores to run on with -n option in emcee fitting

For example, espei -n 4 will set n_workers=4 on dask. Currently, the dask scheduler is hardcoded to use half of the available processors with multiprocessing.

This will require adding the argparse argument n with a default. The default should be half of the available cores for dask and all of the MPI ranks.

The implementer should make a judgment call on whether or not the -n option should support MPI. Would it make sense to use fewer than the available MPI ranks?
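A sketch of the proposed argparse option with the suggested default (illustrative, not the actual run_espei.py code):

```python
import argparse
import multiprocessing

def build_parser():
    """Sketch of the proposed -n option; the default of half the available
    cores mirrors the currently hardcoded dask behavior."""
    parser = argparse.ArgumentParser(prog="espei")
    parser.add_argument(
        "-n", "--n-workers", type=int,
        default=max(1, multiprocessing.cpu_count() // 2),
        help="number of dask workers (default: half of available cores)",
    )
    return parser

args = build_parser().parse_args(["-n", "4"])
# args.n_workers == 4
```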

Implement emcee multiprocessing

Since MPIPool has shown we aren't required to use dask, we could support multiprocessing as well, especially in light of #22.

This would need changes to

  1. The schema to include multiprocessing as an option (and perhaps the default). The option for scheduler could be either 'emcee' (simpler and more understandable) or 'multiprocessing' (more accurate)
  2. Pass the option correctly from run_espei.py. Pass emcee's InterruptiblePool as an object, like with MPIPool and dask's client.

We shouldn't need any changes to paramselect.py, but this should be tested on multiple platforms, if possible.

Fix CI package constraints

pycalphad is constraining our dependencies to dask<0.20 and sympy<1.2. Once pycalphad 0.7.1 is released, these should be fixed and we can lift the constraints in Travis.

Limit the degrees of freedom for non-active phases in MCMC to prevent them from diverging?

Phases that do not have phase equilibria data should have their parameters fixed before the MCMC run.

A particular phase in an ESPEI run can have single phase DFT data and no phase equilibria. This means that the parameters that were calculated in the single phase fitting have no effect on the error function that is used in the MCMC run.

When parameters have no effect on the error function, they diverge when used in emcee because the ensemble sampler scales them up to infinity in an attempt to force that parameter to affect the error function.

Run ESPEI via input files, rather than command line arguments

A first draft and feedback were written in this gist.

The current iteration is:

Header area.
Include any metadata above the `---`.
---
# core run settings
run_type: full # choose full | dft | mcmc
phase_models: input.json
datasets: input-datasets # path to datasets. Defaults to current directory.
scheduler: dask # can be dask | MPIPool

# control output
verbosity: 0 # integer verbosity level 0 | 1 | 2, where 2 is most verbose.
output_tdb: out.tdb
tracefile: chain.npy # name of the file containing the mcmc chain array
probfile: lnprob.npy # name of the file containing the mcmc ln probability array

# the following only take effect for full or mcmc runs
mcmc:
  mcmc_steps: 2000
  mcmc_save_interval: 100

  # the following take effect for only mcmc runs
  input_tdb: null # TDB file used to start the mcmc run
  restart_chain: null # restart the mcmc fitting from a previous calculation

This issue will focus on the development of a first generation input file structure and spec, and also as a place to brainstorm options that should be user-facing.
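A stdlib-only sketch of splitting the header metadata from the settings body at the `---` delimiter (a real implementation would likely use a YAML parser such as PyYAML, especially for the nested mcmc block):

```python
def parse_input_file(text):
    """Split the free-form header from the settings body at the first
    '---' line, then parse flat 'key: value' pairs, stripping '#'
    comments. Nested blocks (like mcmc:) are skipped in this sketch."""
    header, _, body = text.partition("\n---\n")
    settings = {}
    for line in body.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" in line and not line.endswith(":"):
            key, _, value = line.partition(":")
            settings[key.strip()] = value.strip()
    return header.strip(), settings

header, settings = parse_input_file(
    "Header area.\n---\nrun_type: full # choose full | dft | mcmc\noutput_tdb: out.tdb\n"
)
# header == "Header area."; settings == {"run_type": "full", "output_tdb": "out.tdb"}
```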

Error releasing un-acquired lock in dask

I was using distributed (1.18.0) when this error occurred, and changed to distributed (1.16.3).

  File "/Applications/anaconda/envs/my_pycalphad/bin/espei", line 11, in <module>
    sys.exit(main())
  File "/Applications/anaconda/envs/my_pycalphad/lib/python3.6/site-packages/espei/run_espei.py", line 135, in main
    mcmc_steps=args.mcmc_steps, save_interval=args.save_interval)
  File "/Applications/anaconda/envs/my_pycalphad/lib/python3.6/site-packages/espei/paramselect.py", line 754, in fit
    for i, result in enumerate(sampler.sample(walkers, iterations=mcmc_steps)):
  File "/Applications/anaconda/envs/my_pycalphad/lib/python3.6/site-packages/emcee/ensemble.py", line 259, in sample
    lnprob[S0])
  File "/Applications/anaconda/envs/my_pycalphad/lib/python3.6/site-packages/emcee/ensemble.py", line 332, in _propose_stretch
    newlnprob, blob = self._get_lnprob(q)
  File "/Applications/anaconda/envs/my_pycalphad/lib/python3.6/site-packages/emcee/ensemble.py", line 382, in _get_lnprob
    results = list(M(self.lnprobfn, [p[i] for i in range(len(p))]))
  File "/Applications/anaconda/envs/my_pycalphad/lib/python3.6/site-packages/espei/utils.py", line 39, in map
    result = [x.result() for x in result]
  File "/Applications/anaconda/envs/my_pycalphad/lib/python3.6/site-packages/espei/utils.py", line 39, in <listcomp>
    result = [x.result() for x in result]
  File "/Applications/anaconda/envs/my_pycalphad/lib/python3.6/site-packages/distributed/client.py", line 155, in result
    six.reraise(*result)
  File "/Applications/anaconda/envs/my_pycalphad/lib/python3.6/site-packages/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/Applications/anaconda/envs/my_pycalphad/lib/python3.6/site-packages/distributed/protocol/pickle.py", line 59, in loads
    return pickle.loads(x)
RuntimeError: cannot release un-acquired lock

dask workers can sometimes die without warning

I haven't been able to reproduce it consistently, but dask workers sometimes die when using the dask scheduler.

To debug this, I turned on debugging output by creating the scheduler with LocalCluster(n_workers=cores, threads_per_worker=1, processes=True, silence_logs=verbosity[output_settings['verbosity']]).

I am still waiting for that job's workers to die so I can see the output. For now, as iterations in emcee complete, the results are processed in Python (known to be happening because of the progress bar output). During this time, the LocalCluster debugging gives output like:

distributed.core - WARNING - Event loop was unresponsive for 1.69s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.

Usually I get two similar messages in a row.

As another possibility, the most recent time I was able to reproduce this was when I had two instances of ESPEI running at the same time. I wouldn't think that the different client instances would interact, but maybe it should be investigated.

Support input datasets to be CSV

Convert to JSON and validate internally.

Could be useful for anything digitized, particularly equilibria. Formatting problems are much easier to handle.

Not too much improvement if the data is already stored as arrays.
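A sketch of such a conversion using the stdlib csv and json modules; the column names and dataset keys here are illustrative, not the actual ESPEI dataset schema:

```python
import csv
import io
import json

def csv_rows_to_dataset(csv_text, output="ZPF"):
    """Convert digitized CSV rows into a JSON-serializable dataset dict
    that could then be passed through the existing JSON validation.
    Column names ('T', 'value') and keys are assumptions for this sketch."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {
        "output": output,
        "conditions": {"T": [float(r["T"]) for r in rows]},
        "values": [float(r["value"]) for r in rows],
    }

dataset = csv_rows_to_dataset("T,value\n300,0.12\n400,0.34\n")
json.dumps(dataset)  # confirms the result is JSON-serializable
```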

Implement reference state shifting

espei.paramselect._shift_reference_state should handle non-_FORM or non-_MIX outputs, but there needs to be a way to specify what the reference state is if, for example, CPM data is passed.
