metacell / asu-olfactory
License: MIT License
As discussed in the meeting, we need some API endpoint payload/response examples in order to create the API service layer that will sit on top of the molecules API. The goal is to mimic PubChem's API so that ours is plug and play for clients.
The same applies to the front end: https://olfactory.dev.metacell.us/test
Any ideas are welcome, as we can continue working on the merge process in parallel to these tasks.
According to our last meeting, it might be the case that we don't need the verification check against the PubChem DB.
I have simplified the whole process and made use of laziness and parallelism by using two libraries:
Dask - for creating the subset of CSVs
asyncio, aiopg - for bulk inserting in parallel
This is here:
This approach should be much more performant than what we had before.
Please use the head command on the CSVs to reduce task times when evaluating its efficiency.
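The head-style subsetting can also be done from Python; a minimal sketch in plain Python (file names and the helper are illustrative, the real pipeline uses Dask):

```python
def sample_csv(src: str, dst: str, n_rows: int = 1000) -> int:
    """Copy the header plus the first n_rows data lines of src into dst.

    A stand-in for shell `head -n`: lets the pipeline be timed on a small
    subset before touching the full 5 GB file.
    """
    written = 0
    with open(src, encoding="utf-8", errors="replace") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for i, line in enumerate(fin):
            if i > n_rows:  # header line + n_rows data lines
                break
            fout.write(line)
            written += 1
    return written
```

Timing a run against such a sample gives a rough per-row cost that can be extrapolated to the full file.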
Also, regarding the await calls for the Postgres insert: I'm not sure they run in parallel (they are async, but probably not parallel, since each one waits for the cursor to come back).
Investigate how to run those tasks in parallel.
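One way to make the inserts overlap is to schedule them with asyncio.gather instead of awaiting each cursor sequentially. A minimal sketch, with the aiopg call replaced by a stub coroutine (bulk_insert and the chunk data are hypothetical names, not from the repo):

```python
import asyncio

async def bulk_insert(chunk: list) -> int:
    """Stub for an aiopg insert; replace the sleep with the real
    pool.acquire() / cursor.execute() round-trip."""
    await asyncio.sleep(0.01)  # simulates waiting on the Postgres cursor
    return len(chunk)

async def insert_all(chunks: list) -> list:
    # Sequentially awaiting each call serialises the waits:
    #     for c in chunks: await bulk_insert(c)
    # gather() instead submits every coroutine before waiting on any of
    # them, so the cursor round-trips overlap on the event loop.
    return await asyncio.gather(*(bulk_insert(c) for c in chunks))

results = asyncio.run(insert_all([[1, 2], [3], [4, 5, 6]]))
```

Note that true concurrency against Postgres also requires each task to acquire its own connection from the pool; tasks sharing a single cursor would still serialize server-side.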
To avoid using a proxy, and following cloud harness best practices, we should move the client-side code to
https://hub.docker.com/_/nginx
so static content is served from the same URL, avoiding any CORS issues or proxying.
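A minimal sketch of what the nginx side could look like, assuming the built client bundle is copied into the image's default `/usr/share/nginx/html` and the molecules API is exposed under a same-origin path (the `/api/` path and `backend` upstream name are placeholders, not values from the repo):

```nginx
server {
    listen 80;

    # Serve the built client bundle from the same origin as the API,
    # so the browser never issues a cross-origin request.
    location / {
        root      /usr/share/nginx/html;
        try_files $uri $uri/ /index.html;
    }

    # Same-origin path for the molecules API (upstream name is a placeholder).
    location /api/ {
        proxy_pass http://backend:8080/;
    }
}
```

With both the static files and the API behind one origin, no CORS headers or separate dev proxy are needed.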
Test the performance of different endpoint requests.
For example:
the call made to load the 5 GB of data into Postgres.
The merged data is a NoSQL-style dataframe served by a SQL engine. There are some N-to-1 relationships that produce duplicated rows (e.g., synonym to CID). Ideally, we would run a process that merges all of these into an array.
I anticipate this is going to be a CPU-consuming process, and given the Postgres plugin's performance it won't produce much gain (20% maybe?). We still need to evaluate the merge POC's performance.
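The merge itself can be expressed directly in SQL (roughly `SELECT cid, array_agg(synonym) FROM synonyms GROUP BY cid`); a quick sketch of the same collapse in plain Python, with toy row data (only 297/methane is a real CID/synonym pair):

```python
from collections import defaultdict

def merge_synonyms(rows):
    """Collapse duplicated (cid, synonym) rows into one list per CID,
    mirroring Postgres's array_agg(synonym) ... GROUP BY cid."""
    merged = defaultdict(list)
    for cid, synonym in rows:
        merged[cid].append(synonym)
    return dict(merged)

# Toy data: the 702 rows are made up for illustration.
rows = [(297, "methane"), (297, "marsh gas"), (702, "ethanol")]
# merge_synonyms(rows) -> {297: ["methane", "marsh gas"], 702: ["ethanol"]}
```

Doing this once at load time trades CPU during ingestion for deduplicated rows at query time.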
Right now we are only using a single term, which could be a CID or a synonym.
Use different CSV files to search by different term types (e.g., SMILES).
Maybe one endpoint per type.
The FTP site with all the files for digestion: https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/
There are valid molecules that have no CID (usually they have either never been synthesized or at least never been characterized in the public literature). In the PubChem data you've worked with, there are files (being normalized into tables) such as CIDSmiles, CIDInchi, and CIDInchiKey. Every CID has a SMILES, an InChI, and an InChIKey (the InChIKey is just a hash of the InChI). SMILES is nice because it is somewhat readable, but InChI is truly unique: every unique molecule is guaranteed to have exactly one InChI. An InChI can be converted into SMILES, into images of the molecule, etc. And the InChIKey is fixed-length, which makes it good for fast one-way lookups.
Of the billions of molecules that can exist, only millions have CIDs. So for any prospective mapping/prediction (e.g., what would this previously unsynthesized molecule be predicted to smell like?), the input might be a list of InChIs or SMILES. For some of those we can link to a known CID using the CIDInchi table. But for others we cannot, so there may be a need for tables indexed by InChI. These do not need autocomplete; even the InChIKey is probably sufficient here. Converting from InChI to anything else is one line in Python (with rdkit).
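The one-line conversions mentioned above, assuming rdkit is installed (methane's InChI used as the example):

```python
from rdkit import Chem

inchi = "InChI=1S/CH4/h1H4"        # methane
mol = Chem.MolFromInchi(inchi)     # parse the InChI into a molecule object
smiles = Chem.MolToSmiles(mol)     # canonical SMILES, "C" for methane
key = Chem.InchiToInchiKey(inchi)  # fixed-length 27-char hash, good for lookups
```

The fixed 27-character InChIKey is what makes it a convenient index key for the one-way lookup tables described above.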
As described in the following diagram.
As the POC of merging tables together didn't work as expected, we will be serving each text table as a separate endpoint.
When parsing CID-Title with the populate_parallel.py script, we have an issue parsing the text for CID-Title:
```
/usr/local/lib/python3.9/site-packages/aiopg/pool.py:478: ResourceWarning: Unclosed 1 connections in <aiopg.pool.Pool object at 0x7ff94469abe0>
  warnings.warn(
2022-10-04 17:52:25,524 [INFO] populate_parallel.go: Ouput folder %s /tmp/CID/CID-Title/
Traceback (most recent call last):
  File "/populate_parallel.py", line 157, in <module>
    loop.run_until_complete(go())
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/populate_parallel.py", line 150, in go
    df.to_csv(output + "export-*.csv", sep='\t', index=False)
  File "/usr/local/lib/python3.9/site-packages/dask/dataframe/core.py", line 1705, in to_csv
    return to_csv(self, filename, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/dask/dataframe/io/csv.py", line 972, in to_csv
    return list(dask.compute(*values, **compute_kwargs))
  File "/usr/local/lib/python3.9/site-packages/dask/base.py", line 600, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/dask/threaded.py", line 89, in get
    results = get_async(
  File "/usr/local/lib/python3.9/site-packages/dask/local.py", line 511, in get_async
    raise_exception(exc, tb)
  File "/usr/local/lib/python3.9/site-packages/dask/local.py", line 319, in reraise
    raise exc
  File "/usr/local/lib/python3.9/site-packages/dask/local.py", line 224, in execute_task
    result = _execute_task(task, data)
  File "/usr/local/lib/python3.9/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/usr/local/lib/python3.9/site-packages/dask/optimization.py", line 990, in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
  File "/usr/local/lib/python3.9/site-packages/dask/core.py", line 149, in get
    result = _execute_task(task, cache)
  File "/usr/local/lib/python3.9/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/usr/local/lib/python3.9/site-packages/dask/dataframe/io/csv.py", line 129, in __call__
    df = pandas_read_text(
  File "/usr/local/lib/python3.9/site-packages/dask/dataframe/io/csv.py", line 182, in pandas_read_text
    df = reader(bio, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/pandas/util/_decorators.py", line 317, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 611, in _read
    return parser.read(nrows)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1772, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 243, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 808, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 890, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1037, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1068, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1159, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas/_libs/parsers.pyx", line 1246, in pandas._libs.parsers.TextReader._string_convert
  File "pandas/_libs/parsers.pyx", line 1444, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 1: invalid continuation byte
```
It's also splitting values that contain a comma (,) in their definition, so a single title ends up shredded across multiple rows, e.g.:

```
3 2,4-Cyclohexadiene-1-carboxylicacid
3 6-dihydroxy-
3 (1R,6S)-rel-
```
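Since CID-Title is tab-separated, commas inside titles should survive as long as the reader is pinned to `\t` as the delimiter; a sketch with the csv module (the sample line is a reconstruction of the rows above, not taken from the actual file):

```python
import csv
import io

# One CID-Title line: the title itself contains commas, so splitting on
# ',' (or letting a reader guess the delimiter) shreds it into pieces.
raw = "3\t2,4-Cyclohexadiene-1-carboxylic acid, 3,6-dihydroxy-, (1R,6S)-rel-\n"

reader = csv.reader(io.StringIO(raw), delimiter="\t")
cid, title = next(reader)
# cid == "3"; the commas stay inside the single title field
```

The UnicodeDecodeError in the traceback above separately suggests some titles are not valid UTF-8; passing a more forgiving encoding (e.g. `encoding="latin-1"`, or an error handler) to the reader is worth trying.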
Provide a similar API, but plug in our performance improvements.
Check sorting (type of molecule, CID, etc.).
CID search is taking between 5 and 10 seconds,
while fuzzy search is around 0.2 ms.
I'm not sure I yet have strong opinions on how the results should be sorted in the JSON that gets returned (and the corresponding dropdown in the app). But I think it would be reasonable to have exact matches returned first. So if I search for "methane", there is probably one exact match and so the CID,Synonym pair 297,methane should be the first one in the list.
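That ordering can be sketched as a sort key that puts exact matches first (function and field layout are hypothetical; only the 297/methane pair comes from the note above):

```python
def sort_results(results, query):
    """Order (cid, synonym) pairs so exact matches on the query come first,
    with the remaining hits sorted alphabetically after them."""
    q = query.lower()
    # Tuple key: False (exact match) sorts before True (non-match).
    return sorted(results, key=lambda r: (r[1].lower() != q, r[1].lower()))

# Toy rows; CIDs other than 297 are made up for illustration.
hits = [(111, "methanethiol"), (297, "methane"), (222, "methanal")]
# sort_results(hits, "methane")[0] == (297, "methane")
```

The same two-level key generalizes easily, e.g. adding a prefix-match tier between exact matches and everything else.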
@jrmartin if you look at the normalization script here:
https://github.com/MetaCell/asu-olfactory/blob/main/normalize.py
we need to grab the normalized data and bulk insert it into a PostgreSQL instance running locally, with a single table having CID (numeric) and Synonym (varchar), in which both columns will be indexed.
When the process is finished it will populate around 35M rows; we need to evaluate performance.
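A sketch of the target schema and indexes, using an in-memory sqlite3 database purely as a stand-in for the local PostgreSQL instance (in Postgres the types would be e.g. `numeric` / `varchar`, and the 35M-row load would use COPY rather than row-by-row inserts; table and index names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the local Postgres instance
cur = conn.cursor()
cur.execute("CREATE TABLE molecules (cid INTEGER, synonym TEXT)")
# Both columns indexed, as required for lookup by CID or by synonym.
cur.execute("CREATE INDEX idx_cid ON molecules (cid)")
cur.execute("CREATE INDEX idx_synonym ON molecules (synonym)")

cur.executemany(
    "INSERT INTO molecules VALUES (?, ?)",  # real load would be bulk COPY
    [(297, "methane"), (297, "marsh gas")],
)
rows = cur.execute(
    "SELECT synonym FROM molecules WHERE cid = ?", (297,)
).fetchall()
```

For performance evaluation, the interesting comparison is the same equality lookup with and without the two indexes at the 35M-row scale.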
Summarizing:
Finally, here's a description of how the data is expected to be normalized:
Right now the pod is set to 2 GB; since the trigram module is an in-memory kind of module, it will need at least 16 GB.
Based on #31, create Python scripts that will generate the new composite file to be ingested.
Implement the GET /molecules endpoint so it receives a single CID or a single synonym and queries the DB following the same approach used in:
https://github.com/MetaCell/asu-olfactory/blob/main/scripts/lookup.py
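The core of that handler is deciding whether the incoming term is a CID or a synonym and building the matching query; a framework-agnostic sketch (table/column names are carried over from the schema notes above as assumptions, and lookup.py may do this differently):

```python
def build_molecule_query(term: str):
    """Return (sql, params) for a single CID or a single synonym.

    Purely numeric terms are treated as CIDs; anything else is looked up
    as a synonym. Table and column names are assumptions, not from lookup.py.
    """
    if term.isdigit():
        return "SELECT cid, synonym FROM molecules WHERE cid = %s", (int(term),)
    return "SELECT cid, synonym FROM molecules WHERE synonym = %s", (term,)

sql, params = build_molecule_query("297")
# numeric term -> CID branch, params == (297,)
```

The GET handler then just executes the returned statement against the pool and serializes the rows to JSON; parameterized queries keep the synonym path safe from injection.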