phasesresearchlab / espei

Fitting thermodynamic models with pycalphad - https://doi.org/10.1557/mrc.2019.59

Home Page: http://espei.org

License: MIT License

Python 100.00%
calphad materials materials-science pycalphad python thermodynamics

espei's People

Contributors

bocklund, jwsiegel2510, richardotis, toastedcrumpets, wahab2604, zhyrek


espei's Issues

Be able to provide a TDB with existing parameters to parameter selection

The new Database will be added to the current Database and the initial contributions would be subtracted out. The ideal use case for this is

  • Using literature magnetic models
  • Fitting ternary interaction parameters with fixed binary models.

Eventually, when we implement unary fitting, unaries can be passed in this way to be fixed.

Fitting HM, SM, CPM in parameter selection

Right now, ESPEI can perform parameter selection for HM_FORM/HM_MIX, SM_FORM/SM_MIX, or CPM_FORM/CPM_MIX data, but it would be useful to be able to fit HM-HM(SER), SM and CPM directly as well.

In the past, I would have suggested the following workaround procedure, which requires no changes to the ESPEI code, to fit absolute energies that are compatible with SGTE:

  1. Create a custom reference state for ESPEI where the energy is zero for the lattice stability of the SER phase (advanced, optional technique: if you want to keep existing lattice stabilities, update them in the custom reference state to not make them relative to the GHSER__ symbol)
  2. Treat HM-HM(SER), SM, CPM data as the _FORM in ESPEI datasets and fit them. This will give energies for each phase referenced to GHSER__ functions which are zero, so the absolute energies and derivatives are fit.

I'm now convinced that fitting absolute-valued versions of these data is useful enough that ESPEI should allow such data to be fit when given by a user. It shouldn't be too much work for a student who wants to pick up this project.

I'm not completely sure, but it might "just work" to change espei.parameter_selection.utils.shift_reference_state so that it does not raise on HM, SM, or CPM data. One would have to decide whether it is the user's responsibility to shift to HM-HM(SER) or whether ESPEI should do it automatically (which should be possible now that Database().refstates data is supported by pycalphad and used by ESPEI).
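As a sketch of the shifting arithmetic only (a hypothetical helper, not part of ESPEI), shifting an absolute enthalpy to the HM-HM(SER) convention means subtracting the composition-weighted SER reference energies:

```python
# Hypothetical helper (not ESPEI API) illustrating the arithmetic of
# shifting an absolute molar quantity to the HM-HM(SER) convention.
def shift_to_ser(value, composition, ser_values):
    """Subtract the composition-weighted SER reference from an absolute value.

    value       : absolute molar enthalpy (J/mol-atom)
    composition : dict mapping component -> mole fraction
    ser_values  : dict mapping component -> H(SER) of that pure component
    """
    reference = sum(x * ser_values[comp] for comp, x in composition.items())
    return value - reference

# Example: an absolute enthalpy of -1000 J/mol-atom in a 50/50 A-B alloy,
# with made-up SER reference energies for the pure components
shifted = shift_to_ser(-1000.0, {"A": 0.5, "B": 0.5}, {"A": -500.0, "B": -300.0})
# reference = 0.5*(-500) + 0.5*(-300) = -400, so shifted = -600
```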

Test that the ZPF error is actually zero when it should be

If known tielines are used as datasets from a database, calculating ZPF error for that database and equilibria should give 0 error.

I don't have a test case or any reason to suspect this is broken, but it would be a good sanity check.

Enable saving the random state, so that restarts are still reproducible

ESPEI can be deterministic and reproducible, but restarting resets the random state.

That means one run of 1000 steps and two runs of 500 steps each (1000 total) will give different results, even though each is deterministic.

A solution is to be able to dump and load the random state on restart.
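A minimal sketch of dumping and loading the state, assuming a NumPy RandomState generator (the helper names are illustrative, not ESPEI API):

```python
import pickle

import numpy as np

def dump_state(rng):
    # Serialize the generator's internal state so a restart can resume it
    return pickle.dumps(rng.get_state())

def load_state(rng, blob):
    # Restore a previously dumped state into any generator instance
    rng.set_state(pickle.loads(blob))

# One run of 1000 draws...
rng = np.random.RandomState(42)
one_run = rng.rand(1000)

# ...matches two runs of 500 draws each when the state is saved and restored
rng = np.random.RandomState(42)
part1 = rng.rand(500)
blob = dump_state(rng)

restarted = np.random.RandomState(0)  # a fresh generator for the "restart"
load_state(restarted, blob)
part2 = restarted.rand(500)
```

In practice the pickled blob would be written to a file alongside the trace and probability arrays so a restarted run can pick it up.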

Refactor multiplot to include phase diagram plotting

pycalphad's eqplot filters the active phases and sorts them alphabetically to get the phase names from pycalphad.plot.utils.phase_legend. If phases are not sorted and inactive phases are not removed, multiplot will not produce the same phase_legend and colors as eqplot.
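A minimal sketch of the preprocessing multiplot would need to share with eqplot before calling phase_legend (the function here is a hypothetical stand-in, not the pycalphad API):

```python
def legend_phases(candidate_phases, active_phases):
    """Filter to active phases and sort alphabetically, mirroring the
    preprocessing eqplot performs before calling
    pycalphad.plot.utils.phase_legend. Hypothetical helper, not ESPEI API."""
    active = set(active_phases)
    return sorted(p for p in candidate_phases if p in active)

# The same set of phases always yields the same legend order, and therefore
# the same color assignment, regardless of input ordering
phases = legend_phases(["LIQUID", "FCC_A1", "LAVES_C15"], {"FCC_A1", "LIQUID"})
# phases == ["FCC_A1", "LIQUID"]
```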

Using Model class when building phases never calculates a reasonable error

We would like to be able to optimize databases that may use custom models, so we should support building phases with the Model class and its subclasses (which has the added benefit of being more performant than CompiledModel).

See the diff from commit 2bb8c49, where CompiledModel was added and Model removed.

It looks like we previously passed around callables for the objective as well as the gradient and Hessian.

Some doubt about parameter selection?

I see that ESPEI uses AICc to prevent over-parameterization.
However, the F-test was mentioned in the doctoral thesis "SOFTWARE ARCHITECTURE FOR CALPHAD MODELING OF PHASE STABILITY AND TRANSFORMATIONS IN ALLOY ADDITIVE MANUFACTURING PROCESSES".

In the thesis, AICc is used to fit single-phase parameters and the F-test is used to fit multi-phase parameters. It looks like ESPEI only uses AICc and does not use the F-test.

Questions:
(1) Is AICc suitable for multi-phase fitting?
(2) Why was the F-test not taken into consideration?

Tracing memory leaks in long-running jobs

Note that this is a sketch of a procedure and contains code that has not yet been tested.

Key tools:

  • pympler - find which objects are using the most memory
  • objgraph - generate flow graphs of backreferences to any object
  • pyrasite - attach a Python console to a running process

This will require some modification to the base code. Use the SummaryTracker from pympler. Add this code somewhere before MCMC sampling starts:

from pympler import tracker

tr = tracker.SummaryTracker()
# Calibrate by calling print_diff() a few times until it reports no changed objects
tr.print_diff()
tr.print_diff()
tr.print_diff()
# ... create all the objects here (all of the function's setup code) ...
# ... all of the sampling code runs here ...

Start the sampling job with just a single core. Then use pyrasite-shell to connect to the running Python process by PID: http://pyrasite.readthedocs.io/en/latest/Shell.html

In the remote shell:

import objgraph ; tr = objgraph.by_type('SummaryTracker')[0] ; tr.print_diff()

This will give output like:

                                     types |   # objects |   total size
========================================== | =========== | ============                                                                                      
                              <class 'list |       18730 |      1.71 MB                                                                                      
                               <class 'str |       18961 |      1.35 MB                                                                                      
              <class 'sip.methoddescriptor |        8287 |    453.20 KB                                                                                      
                              <class 'dict |         513 |    304.12 KB                                                                                      
                               <class 'int |        5375 |    152.21 KB                                                                                      
  <class 'sympy.core.assumptions.StdFactKB |         144 |     90.94 KB                                                                                      
                               <class 'set |          43 |     72.91 KB                                                                                      
                 <class '_lrucache.hashseq |         509 |     57.82 KB                                                                                      
                             <class 'tuple |         473 |     29.11 KB                                                                                      
          <class 'sympy.core.numbers.Float |         338 |     26.41 KB                                                                                      
                   <class '_lrucache.clist |         509 |     23.86 KB                                                                                      
           <class 'tinydb.database.Element |          32 |     15.50 KB
            <class 'sip.variabledescriptor |         184 |     12.94 KB
      <class 'PyQt4.QtCore.QLocale.Country |         248 |     12.59 KB
                <class 'sympy.core.mul.Mul |         174 |     12.23 KB

To start spot checking object backreferences, use

objgraph.show_backrefs(objgraph.by_type('list')[0], max_depth=10)

A graph will be rendered as a PNG and written to a temporary directory. The path to the graph will be output to the console.

Documentation

The following functions should be fully documented with a description, arguments/keyword arguments, returns, and examples (if applicable):

  • espei.paramselect._fit_parameters could be improved to make it clear that this selects the model from data with the AIC

The next set of functions are short and just need the minimal description, inputs and outputs.

  • espei.paramselect._build_feature_matrix
  • espei.paramselect._generate_symmetric_group
  • espei.core_utils.get_data
  • espei.core_utils.get_samples
  • espei.core_utils.symmetry_filter
  • espei.paramselect.estimate_hyperplane
  • espei.paramselect.tieline_error
  • espei.paramselect.multi_phase_fit

Web documentation

  • Note that single phase data is stored per-atom, e.g. (J/mol-atom) rather than (J/mol-form).
  • contribution guide
  • where to get support

Benchmark starting distribution for MCMC

Currently our walkers (concurrent chains) are initialized by sampling a Gaussian distribution that has a standard deviation of 10% of the parameter.

My understanding of the ensemble sampler implemented in emcee is that the distribution that new parameters are selected from depends on the other active walkers. This means that the initial rate of convergence is strongly dependent on the distribution used to generate these walkers.

Initializing chains from larger Gaussian distributions means that we are less certain about our parameters initially and we will be searching a larger space in the initial iterations. Having too large a distribution initially might mean slow parameter convergence because the chains have to scale down to the relevant sampling space. Having too small of an initial distribution can cause the reverse, in that we waste a lot of time scaling up our sampling space.

We should benchmark different starting points for a given number of MCMC steps and compare the rate of convergence of parameter mixing with a single run that has 'fully' converged.
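A sketch of the current initialization scheme, with the standard deviation fraction exposed as the knob to benchmark (the function name and signature are illustrative, not ESPEI API):

```python
import numpy as np

def initialize_walkers(params, n_walkers, std_frac=0.10, seed=None):
    """Sample walkers from Gaussians centered on each parameter with a
    standard deviation of std_frac * |parameter|. ESPEI's current default
    corresponds to std_frac=0.10; benchmarking would sweep this value."""
    rng = np.random.RandomState(seed)
    params = np.asarray(params, dtype=float)
    return rng.normal(loc=params, scale=std_frac * np.abs(params),
                      size=(n_walkers, params.size))

# 100 walkers around two parameters; each column is one parameter's samples
walkers = initialize_walkers([1000.0, -50.0], n_walkers=100, seed=0)
# walkers.shape == (100, 2); column means are near the starting parameters
```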

Plot data points with a literature reference in the legend

ZKL suggested some kind of Mendeley integration. I think it would also be reasonable and fit into the spirit of ESPEI to use bibtex files (possibly managed and imported/exported from Mendeley). There are several benefits to using bibtex:

  1. With bibtex the references can live and be versioned in plain text with the data sets they are referenced in.
  2. We can use the common functionality and existing tooling for bibtex to generate reference legends in figures based on bibtex labels, including styles. (Matplotlib supports LaTeX rendering, which will be helpful here when LaTeX macros are used in titles - like chemistry notation.)
  3. Reference libraries can be managed with existing software and be agnostic to which software is used (almost everything supports bibtex).

Test suite

The following should be tested in order to have unit tests covering the core functionality of ESPEI

  • MCMC likelihood function, espei.paramselect.lnprob
  • espei.core_utils.get_data retrieves the right data (do in migration from TinyDB 2 to TinyDB 3)
  • espei.paramselect.fit_formation_energy should work for endmembers and interactions (mixing). Test against two one-data-point cases of formation energy for each, then one with temperature dependence for the endmember.
  • AIC parameter selection should choose the right model with the right values (espei.paramselect._fit_parameters). When writing this test, make sure to verify that the chosen test case really is the lowest AIC among all the models and that all the possible models (parameter combinations) were chosen.
  • espei.core_utils.endmembers_from_interaction are properly computed for several cases of mixing sublattices
  • espei.core_utils.get_samples are properly computed for several cases of mixing sublattices
  • espei.core_utils.build_sitefractions properly constructs site fractions from sublattice configurations and occupancies
  • espei.paramselect._generate_symmetric_group handles cases with and without symmetry correctly

URI and multiple-output support

This is a feature request which is probably out of scope for #28.

Could every place where the run settings file accepts a filename or path also accept a general URI (e.g., https, ssh, git)? I think urlparse/urllib in the stdlib makes this a reasonable request.
See: https://stackoverflow.com/questions/22238090/validating-urls-in-python
One complicating factor is all the calls to open() and np.load() would need to get filtered through urllib, but I think this would be a very nice feature long term: Download datasets pinned to a Git repo, upload output TDBs to an S3 bucket, etc.
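A sketch of the dispatch layer this filtering would need, using the stdlib urlparse (the function name and scheme list are illustrative, not a proposed API):

```python
from urllib.parse import urlparse

def classify_path(value):
    """Decide whether a settings value is a remote URI or a local path,
    so that open()/np.load() calls can be routed through urllib when
    needed. Sketch only; the scheme list here is an assumption."""
    scheme = urlparse(value).scheme
    if scheme in ("http", "https", "ssh", "git", "s3"):
        return "remote"
    return "local"

classify_path("https://example.com/datasets.json")  # -> "remote"
classify_path("input-datasets/data.json")           # -> "local"
```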

Related to this, being able to specify the output key multiple times would be useful once it would be possible to write results out to multiple remote locations.

Fit to other thermochemical data

Enable fitting to thermochemical data such as activities.

Should MCMC also consider this along with single-phase data (e.g., heat capacities)?

Issues reproducing Cu-Mg example

I had several issues running the Cu-Mg example from the ESPEI website. I installed ESPEI using the conda command, and took the Cu-Mg data directory from the ESPEI-datasets repository.

I first tried reproducing the diagram from the section titled "First-principles phase diagram". The code ran successfully, but the returned phase diagram didn't match the example well (attached image: diagram_dft).

I then tried reproducing the results in the MCMC optimization section. I wasn't able to successfully perform the MCMC optimization. The code returned numerous errors over the course of several minutes and eventually hung with no further output.

This file contains the full python output when I ran the optimization:
espei_mcmc_error.txt

Here is my python version and installed packages/versions:
python_info.txt

Validate input JSON datasets

Check...

  1. Phases defined are in the system
  2. Phases in the sublattice model are the same as those defined in phases
  3. Same as above for components
  4. Sublattices also
  5. The shape of values is correct based on the conditions
  6. Should failures be input errors, validation errors, or JSON errors?

This is less of an issue when things are automated.
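As a sketch of check 5 above, assuming the usual ESPEI dataset layout of one values axis each for P, T, and sublattice configurations (the function name is illustrative, and a real validator would cover the other checks too):

```python
import numpy as np

def check_values_shape(dataset):
    """Verify that the shape of 'values' matches the conditions:
    (len(P), len(T), number of sublattice configurations).
    Sketch only; raises ValueError on mismatch."""
    conds = dataset["conditions"]
    expected = (
        len(np.atleast_1d(conds["P"])),
        len(np.atleast_1d(conds["T"])),
        len(dataset["solver"]["sublattice_configurations"]),
    )
    actual = np.array(dataset["values"]).shape
    if actual != expected:
        raise ValueError(f"values shape {actual} != expected {expected}")

ds = {
    "conditions": {"P": 101325, "T": [300, 400]},
    "solver": {"sublattice_configurations": [["CU"], ["MG"]]},
    "values": [[[1.0, 2.0], [3.0, 4.0]]],
}
check_values_shape(ds)  # shape (1, 2, 2) matches, no error
```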

Implement AICc

AICc aims to prevent over-parameterization for small numbers of samples.

$$ \mathrm{AICc} = \mathrm{AIC} + \frac{2k^2 + 2k}{n - k - 1} = 2k - 2\ln(L) + \frac{2k^2 + 2k}{n - k - 1} $$

where k is the number of parameters, L is the likelihood, and n is the number of samples.

AICc collapses to AIC for large n.

All that needs to be done is to change the formulation in the paramselect.py module.
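A minimal implementation of the formula above (a sketch; paramselect.py may organize this differently):

```python
def aicc(k, n, log_likelihood):
    """Corrected Akaike information criterion:
    AICc = 2k - 2 ln(L) + (2k^2 + 2k) / (n - k - 1)"""
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k ** 2 + 2 * k) / (n - k - 1)

# For small n the correction term penalizes extra parameters heavily;
# for large n it vanishes and AICc approaches AIC.
aicc(3, 10, -100.0)       # noticeably larger than the AIC of 206
aicc(3, 10 ** 9, -100.0)  # essentially equal to the AIC of 206
```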

Allow for selecting number of cores to run on with -n option in emcee fitting

For example, espei -n 4 will set n_workers=4 on dask. Currently, the dask scheduler is hardcoded to use half of the available processors with multiprocessing.

This will require adding the argparse argument n with a default. The default should be half of the available cores for dask and all of the MPI ranks.

The implementer should make a judgment call on whether or not the -n option should support MPI. Would it make sense to use fewer than the available MPI ranks?
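A sketch of the proposed argparse option with the suggested default (illustrative, not the actual run_espei.py code):

```python
import argparse
import multiprocessing

def build_parser():
    """Sketch of the proposed -n option; the default of half the available
    cores mirrors the currently hardcoded dask behavior."""
    parser = argparse.ArgumentParser(prog="espei")
    parser.add_argument(
        "-n", "--n-workers", type=int,
        default=max(1, multiprocessing.cpu_count() // 2),
        help="number of dask workers (default: half of available cores)",
    )
    return parser

args = build_parser().parse_args(["-n", "4"])
# args.n_workers == 4
```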

Implement emcee multiprocessing

Since MPIPool has shown we aren't required to use dask, we could support multiprocessing as well, especially in light of #22.

This would need changes to

  1. The schema to include multiprocessing as an option (and perhaps the default). The option for scheduler could be either 'emcee' (simpler and more understandable) or 'multiprocessing' (more accurate)
  2. Pass the option correctly from run_espei.py. Pass emcee's InterruptiblePool as an object, like with MPIPool and dask's client.

We shouldn't need any changes to paramselect.py, but this should be tested on multiple platforms, if possible.

Fix CI package constraints

pycalphad is constraining our dependencies to dask<0.20 and sympy<1.2. Once pycalphad 0.7.1 is released, these should be fixed and we can lift the constraints in Travis.

Limit the degrees of freedom for non-active phases in MCMC to prevent them from diverging?

Phases that do not have phase equilibria data should have their parameters fixed before the MCMC run.

A particular phase in an ESPEI run can have single phase DFT data and no phase equilibria. This means that the parameters that were calculated in the single phase fitting have no effect on the error function that is used in the MCMC run.

When parameters have no effect on the error function, they diverge when used in emcee because the ensemble sampler scales them up to infinity in an attempt to force that parameter to affect the error function.

Run ESPEI via input files, rather than command line arguments

A first draft and feedback were written in this gist.

The current iteration is:

Header area.
Include any metadata above the `---`.
---
# core run settings
run_type: full # choose full | dft | mcmc
phase_models: input.json
datasets: input-datasets # path to datasets. Defaults to current directory.
scheduler: dask # can be dask | MPIPool

# control output
verbosity: 0 # integer verbosity level 0 | 1 | 2, where 2 is most verbose.
output_tdb: out.tdb
tracefile: chain.npy # name of the file containing the mcmc chain array
probfile: lnprob.npy # name of the file containing the mcmc ln probability array

# the following only take effect for full or mcmc runs
mcmc:
  mcmc_steps: 2000
  mcmc_save_interval: 100

  # the following take effect for only mcmc runs
  input_tdb: null # TDB file used to start the mcmc run
  restart_chain: null # restart the mcmc fitting from a previous calculation

This issue will focus on the development of a first generation input file structure and spec, and also as a place to brainstorm options that should be user-facing.
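A stdlib-only sketch of splitting the header metadata from the settings body at the `---` delimiter (a real implementation would likely use a YAML parser such as PyYAML, especially for the nested mcmc block):

```python
def parse_input_file(text):
    """Split the free-form header from the settings body at the first
    '---' line, then parse flat 'key: value' pairs, stripping '#'
    comments. Nested blocks (like mcmc:) are skipped in this sketch."""
    header, _, body = text.partition("\n---\n")
    settings = {}
    for line in body.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" in line and not line.endswith(":"):
            key, _, value = line.partition(":")
            settings[key.strip()] = value.strip()
    return header.strip(), settings

header, settings = parse_input_file(
    "Header area.\n---\nrun_type: full # choose full | dft | mcmc\noutput_tdb: out.tdb\n"
)
# header == "Header area."; settings == {"run_type": "full", "output_tdb": "out.tdb"}
```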

Error releasing un-acquired lock in dask

I was using distributed (1.18.0) when this error occurred, and changed to distributed (1.16.3).

  File "/Applications/anaconda/envs/my_pycalphad/bin/espei", line 11, in <module>
    sys.exit(main())
  File "/Applications/anaconda/envs/my_pycalphad/lib/python3.6/site-packages/espei/run_espei.py", line 135, in main
    mcmc_steps=args.mcmc_steps, save_interval=args.save_interval)
  File "/Applications/anaconda/envs/my_pycalphad/lib/python3.6/site-packages/espei/paramselect.py", line 754, in fit
    for i, result in enumerate(sampler.sample(walkers, iterations=mcmc_steps)):
  File "/Applications/anaconda/envs/my_pycalphad/lib/python3.6/site-packages/emcee/ensemble.py", line 259, in sample
    lnprob[S0])
  File "/Applications/anaconda/envs/my_pycalphad/lib/python3.6/site-packages/emcee/ensemble.py", line 332, in _propose_stretch
    newlnprob, blob = self._get_lnprob(q)
  File "/Applications/anaconda/envs/my_pycalphad/lib/python3.6/site-packages/emcee/ensemble.py", line 382, in _get_lnprob
    results = list(M(self.lnprobfn, [p[i] for i in range(len(p))]))
  File "/Applications/anaconda/envs/my_pycalphad/lib/python3.6/site-packages/espei/utils.py", line 39, in map
    result = [x.result() for x in result]
  File "/Applications/anaconda/envs/my_pycalphad/lib/python3.6/site-packages/espei/utils.py", line 39, in <listcomp>
    result = [x.result() for x in result]
  File "/Applications/anaconda/envs/my_pycalphad/lib/python3.6/site-packages/distributed/client.py", line 155, in result
    six.reraise(*result)
  File "/Applications/anaconda/envs/my_pycalphad/lib/python3.6/site-packages/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/Applications/anaconda/envs/my_pycalphad/lib/python3.6/site-packages/distributed/protocol/pickle.py", line 59, in loads
    return pickle.loads(x)
RuntimeError: cannot release un-acquired lock

dask workers can sometimes die without warning

I haven't been able to reproduce it consistently, but dask workers sometimes die when using the dask scheduler.

To debug this, I turned on debugging output by creating the scheduler with LocalCluster(n_workers=cores, threads_per_worker=1, processes=True, silence_logs=verbosity[output_settings['verbosity']]).

I am still waiting for that job's workers to die so I can see the output. For now, as iterations in emcee complete, the results are processed in Python (known to be happening because of the progress bar output). During this time, the LocalCluster debugging gives output like:

distributed.core - WARNING - Event loop was unresponsive for 1.69s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.

Usually I get two similar messages in a row.

As another possibility, the most recent time I was able to reproduce this was when I had two instances of ESPEI running at the same time. I wouldn't think that the different client instances would interact, but maybe it should be investigated.

Support input datasets to be CSV

Convert to JSON and validate internally.

Could be useful for anything digitized, particularly equilibria. Formatting problems are much easier to handle.

Not too much improvement if the data is already stored as arrays.
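A sketch of such a conversion using the stdlib csv and json modules; the column names and dataset keys here are illustrative, not the actual ESPEI dataset schema:

```python
import csv
import io
import json

def csv_rows_to_dataset(csv_text, output="ZPF"):
    """Convert digitized CSV rows into a JSON-serializable dataset dict
    that could then be passed through the existing JSON validation.
    Column names ('T', 'value') and keys are assumptions for this sketch."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {
        "output": output,
        "conditions": {"T": [float(r["T"]) for r in rows]},
        "values": [float(r["value"]) for r in rows],
    }

dataset = csv_rows_to_dataset("T,value\n300,0.12\n400,0.34\n")
json.dumps(dataset)  # confirms the result is JSON-serializable
```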

Implement reference state shifting

espei.paramselect._shift_reference_state should handle non-_FORM or non-_MIX outputs, but there needs to be a way to specify what the reference state is if, for example, CPM data is passed.
