asu-olfactory's People

Contributors

enicolasgomez, filippomc, jrmartin, slarson, zsinnema


asu-olfactory's Issues

Provide API examples

As discussed in the meeting, we need some API endpoint payload/response examples in order to create the API service layer that will sit on top of the molecules API and mimic PubChem's API, so that our API is plug-and-play for clients.

The same applies to the front end: https://olfactory.dev.metacell.us/test
Any ideas are welcome, as we can continue working on the merge process in parallel to this task.
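To seed the discussion, here is one hypothetical payload/response pair, modeled on PubChem's PUG REST property endpoint (the exact shape should be confirmed against PubChem's documentation before our layer commits to it; the CID and values below are illustrative):

```python
# Hypothetical request our service layer would have to answer, mimicking
# PubChem PUG REST:
#   GET /rest/pug/compound/cid/297/property/Title/JSON
# Hypothetical response body with PubChem's PropertyTable envelope:
response = {
    "PropertyTable": {
        "Properties": [
            {"CID": 297, "Title": "Methane"}
        ]
    }
}

# Clients written against PubChem unwrap the envelope like this:
properties = response["PropertyTable"]["Properties"]
```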

Evaluate populate_parallel performance

I have simplified the whole process and made use of laziness and parallelism by using two libraries:

Dask - for creating the subsets of CSVs
asyncio, aiopg - for bulk inserting in parallel

This is here:

https://github.com/MetaCell/asu-olfactory/blob/feature/34_parallel/applications/pub-chem-index/tasks/ingestion/populate_parallel.py

This approach should be much more performant than what we had before.

Please use the head command on the CSVs to reduce task times when evaluating its efficiency.

Also, I'm not sure the await calls for the Postgres inserts actually run in parallel (they are async, but probably not parallel, since each awaits the cursor before continuing).

Investigate how to run those tasks in parallel.
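The difference to check for can be illustrated without a database: sequential awaits serialize the waits, while asyncio.gather lets them overlap. In the sketch below, asyncio.sleep stands in for an aiopg bulk insert (the function names are hypothetical):

```python
import asyncio
import time

async def insert_chunk(chunk_id: int, delay: float = 0.05) -> int:
    # Stand-in for "await cur.execute(...)" against Postgres.
    await asyncio.sleep(delay)
    return chunk_id

async def run_sequential(n: int) -> list:
    # Each await finishes before the next starts: total time ~ n * delay.
    return [await insert_chunk(i) for i in range(n)]

async def run_concurrent(n: int) -> list:
    # gather schedules all coroutines at once: total time ~ delay.
    return list(await asyncio.gather(*(insert_chunk(i) for i in range(n))))

t0 = time.perf_counter()
seq = asyncio.run(run_sequential(4))
t_seq = time.perf_counter() - t0

t0 = time.perf_counter()
conc = asyncio.run(run_concurrent(4))
t_conc = time.perf_counter() - t0
```

Note that with aiopg, gather only helps if each task acquires its own connection from the pool; tasks sharing a single cursor still serialize.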

Test performance of different calls

Test the performance of requests to different endpoints.
For example:

  • Test an endpoint that looks at all tables: mesh/1/properties/synonyms_filtered,synonyms_unfiltered,title,iupac,mass....
    And report the results.
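A small, hypothetical timing harness for these comparisons; `fn` would be a zero-argument callable that issues one request, e.g. `lambda: requests.get(base_url + "/mesh/1/properties/title")` (the URL is an assumption):

```python
import time
from statistics import mean

def time_call(fn, repeats: int = 5) -> float:
    # Run fn `repeats` times and return the mean wall-clock latency
    # in seconds, so endpoints can be compared on equal footing.
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return mean(samples)
```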

Data Merge Optimization

The merged data is a NoSQL-style dataframe served by a SQL engine. There are some N-to-1 relationships that produce duplicated rows (e.g., synonym to CID). Ideally, we would run a process that merges all of these into an array.

I anticipate this will be a CPU-intensive process, and given the Postgres plugin's performance it won't produce much gain (maybe 20%?). We still need to evaluate the merge POC's performance.
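The merge idea can be sketched in SQL: collapse the duplicated N-to-1 rows with an aggregate. In Postgres this would be array_agg(synonym); the runnable sketch below uses sqlite3's group_concat purely as a stand-in, and the table/column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE synonyms (cid INTEGER, synonym TEXT)")
conn.executemany(
    "INSERT INTO synonyms VALUES (?, ?)",
    [(297, "methane"), (297, "marsh gas"), (702, "ethanol")],
)
# One row per CID, synonyms collapsed into a single delimited value
# (array_agg would yield a real array column in Postgres).
rows = conn.execute(
    "SELECT cid, group_concat(synonym, '|') FROM synonyms "
    "GROUP BY cid ORDER BY cid"
).fetchall()
```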

Molecules that have no CID

There are valid molecules that have no CID (usually they have either never been synthesized or at least never been characterized in the public literature). In the PubChem data you've worked with, there are files (being normalized into tables) such as CIDSmiles, CIDInchi, and CIDInchiKey. Every CID has a SMILES, an InChI, and an InChIKey (the InChIKey is just a hash of the InChI). SMILES is nice because it is somewhat readable, but InChI is truly unique: every unique molecule is guaranteed to have exactly one InChI. InChI can be converted into SMILES, into images of the molecule, etc. And InChIKey is fixed length, which makes it good for fast one-way lookups.

Of the billions of molecules that can exist, only millions have CIDs. So for any prospective mapping/prediction (e.g., what would these previously unsynthesized molecules be predicted to smell like?), the input might be a list of InChIs or SMILES. Some of those we can link to a known CID using the CIDInchi table, but others we cannot, so there may be a need for tables which are indexed by InChI. These do not need autocomplete; even InChIKey is probably sufficient here. Converting from InChI to anything else is one line in Python (with RDKit).
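As a small illustration of why InChIKey suits fast lookups: it is a fixed 27-character pattern (14 characters, a dash, 10, a dash, 1), so it can be validated and indexed cheaply. The RDKit one-liner mentioned above would be along the lines of `Chem.MolToSmiles(Chem.MolFromInchi(inchi))`; the stdlib sketch below only checks the key format:

```python
import re

# InChIKey layout: 14-char connectivity hash, dash, 10-char block
# (stereo/isotope hash plus flag characters), dash, 1-char protonation flag.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")

def is_inchikey(s: str) -> bool:
    return INCHIKEY_RE.fullmatch(s) is not None
```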

Parsing Issue with CID-Title

When parsing CID-Title with the populate_parallel.py script, we hit the following error:

/usr/local/lib/python3.9/site-packages/aiopg/pool.py:478: ResourceWarning: Unclosed 1 connections in <aiopg.pool.Pool object at 0x7ff94469abe0>
  warnings.warn(
2022-10-04 17:52:25,524 [INFO] populate_parallel.go: Ouput folder %s /tmp/CID/CID-Title/
Traceback (most recent call last):
  File "/populate_parallel.py", line 157, in <module>
    loop.run_until_complete(go())
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/populate_parallel.py", line 150, in go
    df.to_csv(output + "export-*.csv", sep='\t', index=False)
  File "/usr/local/lib/python3.9/site-packages/dask/dataframe/core.py", line 1705, in to_csv
    return to_csv(self, filename, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/dask/dataframe/io/csv.py", line 972, in to_csv
    return list(dask.compute(*values, **compute_kwargs))
  File "/usr/local/lib/python3.9/site-packages/dask/base.py", line 600, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/dask/threaded.py", line 89, in get
    results = get_async(
  File "/usr/local/lib/python3.9/site-packages/dask/local.py", line 511, in get_async
    raise_exception(exc, tb)
  File "/usr/local/lib/python3.9/site-packages/dask/local.py", line 319, in reraise
    raise exc
  File "/usr/local/lib/python3.9/site-packages/dask/local.py", line 224, in execute_task
    result = _execute_task(task, data)
  File "/usr/local/lib/python3.9/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/usr/local/lib/python3.9/site-packages/dask/optimization.py", line 990, in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
  File "/usr/local/lib/python3.9/site-packages/dask/core.py", line 149, in get
    result = _execute_task(task, cache)
  File "/usr/local/lib/python3.9/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/usr/local/lib/python3.9/site-packages/dask/dataframe/io/csv.py", line 129, in __call__
    df = pandas_read_text(
  File "/usr/local/lib/python3.9/site-packages/dask/dataframe/io/csv.py", line 182, in pandas_read_text
    df = reader(bio, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/pandas/util/_decorators.py", line 317, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 611, in _read
    return parser.read(nrows)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1772, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 243, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 808, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 890, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1037, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1068, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1159, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas/_libs/parsers.pyx", line 1246, in pandas._libs.parsers.TextReader._string_convert
  File "pandas/_libs/parsers.pyx", line 1444, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 1: invalid continuation byte
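The UnicodeDecodeError means some CID-Title lines are not valid UTF-8 (the byte 0xE2 appears without the continuation bytes UTF-8 requires). A hedged workaround, to be validated against the actual dump: pass an explicit encoding, or a lossy error handler, to the Dask/pandas read_csv call. The snippet below only demonstrates the decoding behavior on a synthetic byte string:

```python
raw = b"Some title with a stray byte \xe2 in it"  # synthetic example line

# Strict UTF-8 decoding fails, which is what the traceback shows.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 maps every byte to a character, so it never fails (0xE2 -> 'â');
# errors="replace" keeps UTF-8 but substitutes U+FFFD for bad bytes.
# In Dask this would be e.g. dd.read_csv(path, encoding="latin-1") or
# dd.read_csv(path, encoding_errors="replace") (pandas >= 1.3).
as_latin1 = raw.decode("latin-1")
as_replaced = raw.decode("utf-8", errors="replace")
```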

Fix normalization script

It is splitting values that contain a comma (,) in their definition; for example, this single synonym ends up as three rows:

3 2,4-Cyclohexadiene-1-carboxylicacid
3 6-dihydroxy-
3 (1R,6S)-rel-
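Assuming the raw lines are tab-separated CID<TAB>synonym (an assumption to verify against the dump), the fix is to split only on the tab, and only once, so commas inside the synonym survive. The synonym below is reconstructed from the three fragments in the example above:

```python
line = "3\t2,4-Cyclohexadiene-1-carboxylicacid, 3,6-dihydroxy-, (1R,6S)-rel-"

# Split once, on the tab only; commas are data here, not delimiters.
cid, synonym = line.split("\t", 1)
```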

Results ordering

I'm not sure I yet have strong opinions on how the results should be sorted in the JSON that gets returned (and the corresponding dropdown in the app). But I think it would be reasonable to have exact matches returned first. So if I search for "methane", there is probably one exact match and so the CID,Synonym pair 297,methane should be the first one in the list.
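The exact-match-first idea can be sketched as follows (function and variable names are hypothetical; in SQL the same effect could come from something like `ORDER BY (synonym = query) DESC, synonym`):

```python
def sort_results(query, results):
    # Case-insensitive exact matches first, then the rest alphabetically.
    # `results` is a list of (CID, synonym) pairs.
    q = query.lower()
    return sorted(results, key=lambda r: (r[1].lower() != q, r[1].lower()))

hits = [(887, "methanol"), (297, "methane"), (6324, "methanethiol")]
ordered = sort_results("methane", hits)
```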

Evaluate SQL fuzzy query like search performance

@jrmartin, if you look at the normalization script here:

https://github.com/MetaCell/asu-olfactory/blob/main/normalize.py

we need to grab the normalized data and bulk insert it into a PostgreSQL instance running locally, with a single table having CID (numeric) and Synonym (varchar), in which both columns will be indexed.

When the process is finished it will populate around 35M rows; we need to evaluate performance.

summarizing:

  • start up a PostgreSQL instance with 1 database and create 1 table with the given description
  • download the dataset:
    https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-Synonym-unfiltered.gz
  • run normalize.py on it
  • this will create around 600 files with 50,000 rows each
  • check that the normalization worked as expected with some random sampling (manually)
  • create a bulk_insert.py script that goes over the files and bulk inserts them into the newly created table
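A hedged sketch of what bulk_insert.py could look like, assuming the normalized files are tab-separated CID<TAB>synonym. Against Postgres, `insert_rows` would wrap an executemany or COPY (e.g. psycopg2's `cur.copy_from`); here a plain collector is injected so the control flow is runnable without a server:

```python
import io

def bulk_insert(files, insert_rows):
    # Iterate the normalized files, parse each non-blank line into
    # [cid, synonym], hand each batch to insert_rows, and return the
    # total number of rows processed.
    total = 0
    for f in files:
        rows = [line.rstrip("\n").split("\t", 1) for line in f if line.strip()]
        insert_rows(rows)
        total += len(rows)
    return total

collected = []
n = bulk_insert(
    [io.StringIO("297\tmethane\n702\tethanol\n"), io.StringIO("962\twater\n")],
    collected.extend,
)
```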

finally:

  • evaluate the performance of LIKE '%[value]%' queries
  • if it is not good enough, we'll need to evaluate configurations that avoid a full table scan (e.g., trigram indexes)

Here's a description of how the data is expected to be normalized:

https://app.zenhub.com/workspaces/asu-olfactory-625d568a637ec8001a4e40ac/issues/metacell/asu-olfactory/1

Fixing Deployment Issues, Improvements

  • Sort results, exact match on top
  • Exact match (true/false) is missing on results returned for the /properties endpoints
  • If exact=true is passed as a URL parameter, only return exact matches; if there are none, return zero results
  • Truncate table names passed as properties and returned as results
  • Remove CIDs from the /properties results, only leave the one
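The exact=true behavior described above can be sketched as follows (names are hypothetical):

```python
def filter_exact(query, results, exact=False):
    # With exact=True, keep only case-insensitive exact matches on the
    # synonym; an empty list is returned when there are none.
    # `results` is a list of (CID, synonym) pairs.
    if not exact:
        return results
    q = query.lower()
    return [r for r in results if r[1].lower() == q]
```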
