openff-curation's Introduction

openFF-curation

The purpose of this repository is to allow users to track changes to the curation files used in the Open-FF project. These files are mostly changed manually when new material is added to the FracFocus source or when new errors are found.

The files in this repository are copies of the actual files used to build data sets in the Open-FF project. Copies are made so that sorting and filtering can be made consistent from one commit to the next. Note that in these files, all cells are quoted (not just those that require it). Further, the quoting character is the dollar sign ('$') because it is one of a very few characters in a simple ASCII set that do not turn up in the FracFocus data.
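A minimal sketch of reading and writing a file in this format with Python's standard `csv` module is shown here. It assumes the files are comma-delimited, which the text above does not state, and the helper names and the sample row are hypothetical; only the `$` quote character and the quote-everything rule come from the description above.

```python
import csv
import io


def read_curation_rows(text):
    """Parse curation-file text in which cells are quoted with '$'.

    Assumption: cells are separated by commas.
    """
    reader = csv.reader(io.StringIO(text), quotechar="$")
    return list(reader)


def write_curation_rows(rows):
    """Serialize rows with every cell quoted by '$', as described above."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        quotechar="$",
        quoting=csv.QUOTE_ALL,  # quote all cells, not only those that need it
        lineterminator="\n",
    )
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample; a comma inside a cell is protected by the '$' quoting.
sample = "$CASNumber$,$IngredientName$\n$7732-18-5$,$Water, fresh$\n"
rows = read_curation_rows(sample)
round_trip = write_curation_rows(rows)
```

Quoting every cell keeps the output stable between commits, because a cell never switches between quoted and unquoted form when its content changes.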

openff-curation's Issues

Carrier curation tasks

One of the current curation tasks is identifying which record(s) of a disclosure represent the water carrier. This is necessary because Open-FF needs a clear indication of the water carrier records to calculate the mass of all the chemicals in the record. Since version 10 of OpenFF (CodeOcean version), this has mostly been done with algorithmic searches and categorization of each disclosure. With this technique, we can identify classes of disclosures that are formatted the same way and therefore treat them the same way.

Currently, the carrier records of about 165,000 disclosures have been identified this way, and another 49,000 have been identified as problem disclosures from which we cannot extract masses (most of these are FFVersion 1, in which no chemical records are included). This still leaves around 50,000 to curate. Below are some of the remaining characteristics that we can use to either identify the carrier or mark disclosures as problems:

  • The largest % (usually over 60%) is identified as a salt solution, not water. This is usually something like "4% KCl." Clearly, these are the water carriers, just not identified as water.
  • Disclosures that have most of the characteristics of a water-dominated disclosure, but in which the records that are most likely water are not labeled as water.
  • Disclosures with more proppant than water, sometimes with carbon dioxide or nitrogen. I'm not comfortable taking the water in these as a water carrier - at least from the perspective of anchoring the mass calculations. If I can find enough with MassIngredient confirmations, I should be able to move them into either carrier or problem categories.
  • Chlorine dioxide at 100% or close to it.
  • Many of the remaining disclosures may fit into previously written algorithms if they are tweaked a bit, such as changing the cutoff for water percentage.
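The kind of rule described in the list above can be sketched roughly as follows. This is not the Open-FF algorithm: the function name, the category labels, the regular expressions, and the tuple layout of the records are all assumptions made for illustration. Only the 60% figure, the "4% KCl" example, and the idea of an adjustable cutoff come from the text.

```python
import re

# Assumed patterns: a name containing the word "water", or a salt
# solution written like "4% KCl" (the NaCl variant is a guess).
WATER_NAME = re.compile(r"\bwater\b", re.IGNORECASE)
SALT_SOLUTION = re.compile(r"\b\d+(\.\d+)?\s*%\s*(kcl|nacl)\b", re.IGNORECASE)


def classify_disclosure(records, cutoff=60.0):
    """Classify one disclosure from its (ingredient_name, percent) records.

    Returns one of the hypothetical labels "problem", "water_carrier",
    "salt_solution_carrier", or "unresolved".
    """
    if not records:
        # No chemical records at all, so no mass can be extracted.
        return "problem"
    name, percent = max(records, key=lambda record: record[1])
    if percent < cutoff:
        return "unresolved"
    if WATER_NAME.search(name):
        return "water_carrier"
    if SALT_SOLUTION.search(name):
        return "salt_solution_carrier"
    return "unresolved"


salt_case = classify_disclosure([("4% KCl", 85.0), ("Sand", 12.0)])
strict_case = classify_disclosure([("Water", 55.0), ("Sand", 40.0)])
relaxed_case = classify_disclosure([("Water", 55.0), ("Sand", 40.0)], cutoff=50.0)
```

Exposing the cutoff as a parameter makes it possible to re-run an existing rule on the remaining disclosures with a relaxed threshold, which is the kind of tweak the last bullet suggests.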

9072-35-9 doesn't resolve to Crude Oil

Records with CASNumber 9072-35-9 are identified in IngredientName as Petroleum, but that is not what that CAS number resolves to, and it is not clear which CAS number is intended. Probably should be Conflicting...

Heavy water, 7789-20-0

This version of water seems very unlikely to be used in fracking, as it is quite expensive to produce, especially at the masses suggested in the data. More likely it is here because of sloppy data prep.

I would love to get other opinions before I set it as a mistake and change it to 7732-18-5.

Company: UPPI?

Should 'UPPI' be translated together with Universal Pressure Pumping?

Problem of `IngredientName` synonyms that are too general

There are some synonyms of specific CASRNs that are so general that they cause too many "conflictingID" issues when producing the CAS|Ing table. For example, a large number of compounds used in FF are "nonyl phenol ethoxylates", but at least one CASRN, 9016-45-9, has that term as a synonym. The CAS|Ing result should be that the CASNumber is used for bgCAS and the bgSource is set to CAS_only.

How do we find those over general terms and mark them as not useful for distinguishing between like compounds?
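One possible starting point, offered only as a sketch, is to count how many distinct CAS numbers each name is paired with in the data and flag names that exceed a threshold. The function name, the `max_cas` threshold, and the normalization (trimming and lowercasing) are assumptions, and the second CAS value in the sample is a placeholder rather than a real registry number.

```python
from collections import defaultdict


def overly_general_names(pairs, max_cas=1):
    """Return names paired with more than `max_cas` distinct CAS numbers.

    `pairs` is an iterable of (cas_number, ingredient_name) tuples.
    Names are compared after trimming whitespace and lowercasing.
    """
    cas_by_name = defaultdict(set)
    for cas, name in pairs:
        cas_by_name[name.strip().lower()].add(cas.strip())
    return {
        name: sorted(cas_set)
        for name, cas_set in cas_by_name.items()
        if len(cas_set) > max_cas
    }


# "placeholder-1" stands in for some other CAS number; it is not real data.
sample_pairs = [
    ("9016-45-9", "Nonyl phenol ethoxylates"),
    ("placeholder-1", "nonyl phenol ethoxylates"),
    ("7732-18-5", "Water"),
]
flagged = overly_general_names(sample_pairs)
```

A name flagged this way would still need manual review, since a high count could also reflect data-entry errors rather than a genuinely generic term.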
