sparks-baird / nomad-examples

Examples of using the Novel Materials Discovery (NOMAD) database, especially downloading all chemical formulas.

Home Page: https://doi.org/10.6084/m9.figshare.19319783.v3

License: MIT License

nomad-examples's Introduction

nomad-examples

Installation

Clone or download the repository. To clone:

git clone https://github.com/sparks-baird/nomad-examples.git
cd nomad-examples

Install the dependencies, e.g. via:

pip install -r requirements.txt

Reproducer

Use all_formula_basic_metadata.py to download the data from NOMAD and to do some basic processing. This might take somewhere around an hour.

python all_formula_basic_metadata.py

Use remove_duplicate_compositions.py to process the chemical formulas down to a list of unique chemical compositions (represented as reduced formulas). This also might take around an hour.

python remove_duplicate_compositions.py

Data Descriptions

The data is available via figshare (DOI: 10.6084/m9.figshare.19319783.v3) and was downloaded on 2022-03-07. Four files are available: all-formula.csv, unique-formula.csv, unique-reduced-formula.csv, and bad-formula.csv, with 11680557, 764431, 695612, and 15 rows, respectively. Descriptions are given below.

all-formula.csv

all-formula.csv contains two columns: calc_id (Calculation ID) and formula (Chemical Formula). The entries were restricted to VASP DFT calculations and include neither noble gases nor radioactive elements. Some calculation IDs have missing chemical formulas.

unique-formula.csv

The list has also been filtered down to unique (non-reduced) chemical formulas in unique-formula.csv along with the calc_id for each unique formula. No structural information is included directly in this data.
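As a rough illustration of this filtering step, the sketch below keeps one calc_id per distinct formula using only the standard library. It is not the repository's code: the function name unique_formulas, the choice to keep the first calc_id encountered, and the sample rows are all assumptions made for the example.

```python
import csv
import io


def unique_formulas(rows):
    """Return {formula: calc_id}, keeping the first calc_id seen per formula.

    Assumption: "first seen" is an arbitrary tie-break chosen for this sketch.
    """
    first_seen = {}
    for row in rows:
        formula = (row.get("formula") or "").strip()
        if not formula:
            continue  # some calculation IDs have no chemical formula
        first_seen.setdefault(formula, row["calc_id"])
    return first_seen


# Made-up sample in the two-column layout described for all-formula.csv.
sample = io.StringIO(
    "calc_id,formula\n"
    "a1,Si2O4\n"
    "a2,SiO2\n"
    "a3,Si2O4\n"
    "a4,\n"
)
unique = unique_formulas(csv.DictReader(sample))
print(unique)
```

Note that Si2O4 and SiO2 stay separate here, since this step compares the formulas as written rather than their reduced forms.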

unique-reduced-formula.csv

The file you are probably most interested in is unique-reduced-formula.csv, because it is the most curated and is directly usable with e.g. pymatgen. It contains three columns: calc_id, reduced_formula, and factor, which correspond to the Calculation ID, the reduced formula (e.g. Si2O4 --> SiO2), and the factor by which the formula was reduced (e.g. for Si2O4 --> SiO2 the factor is 2). The formulas were first parsed via the pymatgen.core.Composition class.
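To make the relationship between a formula, its reduced formula, and the factor concrete, here is a minimal pure-Python sketch that divides the element counts by their greatest common divisor. It is a simplified stand-in, not pymatgen's implementation: the helper reduce_formula is hypothetical, it only handles flat formulas with integer counts, and pymatgen applies its own conventions for element ordering and special cases, so its output can differ.

```python
import re
from functools import reduce
from math import gcd

# One element symbol (e.g. "Si") followed by an optional integer count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def reduce_formula(formula):
    """Return (reduced_formula, factor) for a flat formula such as 'Si2O4'."""
    counts = {}
    pos = 0
    for match in TOKEN.finditer(formula):
        if match.start() != pos:
            break  # unparsable characters between tokens
        element, digits = match.groups()
        amount = int(digits) if digits else 1
        if amount == 0:
            raise ValueError(f"zero count in formula: {formula!r}")
        counts[element] = counts.get(element, 0) + amount
        pos = match.end()
    if pos != len(formula) or not counts:
        raise ValueError(f"cannot parse formula: {formula!r}")

    factor = reduce(gcd, counts.values())
    parts = []
    for element, amount in counts.items():
        reduced_amount = amount // factor
        parts.append(element + (str(reduced_amount) if reduced_amount > 1 else ""))
    return "".join(parts), factor


print(reduce_formula("Si2O4"))
print(reduce_formula("Fe2O3"))
```

Multiplying each count in the reduced formula by the factor recovers the original counts, which is why the factor is stored alongside reduced_formula.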

bad-formula.csv

Finally, bad-formula.csv contains the 15 formulas that were skipped during processing because, for various reasons, they could not be parsed with pymatgen.core.Composition.
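The general pattern of setting aside unparsable formulas can be sketched as follows. This is an assumption about the shape of the processing, not the repository's code: partition_parsable and toy_parser are hypothetical names, and toy_parser is only a placeholder for whichever real parser is used.

```python
def partition_parsable(formulas, parser):
    """Split formulas into (parsed, skipped) according to whether parser succeeds."""
    parsed, skipped = [], []
    for formula in formulas:
        try:
            parsed.append((formula, parser(formula)))
        except Exception:  # broad on purpose: any parser failure means "skip"
            skipped.append(formula)
    return parsed, skipped


def toy_parser(formula):
    """Hypothetical stand-in: accept alphanumeric strings that start with a capital."""
    if not (formula.isalnum() and formula[:1].isupper()):
        raise ValueError(f"cannot parse formula: {formula!r}")
    return formula


good, bad = partition_parsable(["SiO2", "Fe2O3", "??", ""], toy_parser)
print(good)
print(bad)
```

Keeping the skipped entries in their own list, rather than dropping them silently, is what makes a file like bad-formula.csv possible.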

Future Work

Download all of the crystal structures and reduce them to a list of unique phases, each with a CIF file.

Issues

See something missing? Please don't hesitate to drop me a note in issues.

nomad-examples's People

Contributors

sgbaird

nomad-examples's Issues

Adding basic material properties

Hey!
Have you thought of pairing each entry with some basic info about the materials?

One could definitely retrieve them on their own by searching via calc_id, but adding some relevant results for each calculation would make this a cool, ready-to-use training dataset for ML!

Some of this additional info (targets, if you prefer) could be formation energies (one for each calculation, and maybe an average value for each unique formula?) and some qualitative labels like point groups and Wyckoff positions.

Suggesting this as a pretty cost-effective way to improve the quality of the dataset!
