metacell / asu-olfactory
License: MIT License
As discussed in the meeting, we need some API endpoint payload/response examples in order to create the API service layer that will sit on top of the molecules API. The goal is to mimic PubChem's API so that ours is plug and play for clients.
The same applies to the front end: https://olfactory.dev.metacell.us/test
Any ideas are welcome, as we can continue working on the merge process in parallel to these tasks.
According to our last meeting, it might be the case that we don't need the verification check against the PubChem DB.
I have simplified the whole process and made use of laziness and parallelism by using two libraries:
Dask - for creating the subset of CSVs
asyncio, aiopg - for bulk inserting in parallel
This is here:
This approach should be much more performant than what we had before.
Please use the head command on the CSVs to reduce task times when evaluating its efficiency.
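The head-style subsetting can also be done from Python; a minimal sketch in plain Python (file names and the helper are illustrative, the real pipeline uses Dask):

```python
def sample_csv(src: str, dst: str, n_rows: int = 1000) -> int:
    """Copy the header plus the first n_rows data lines of src into dst.

    A stand-in for shell `head -n`: lets the pipeline be timed on a small
    subset before touching the full 5 GB file.
    """
    written = 0
    with open(src, encoding="utf-8", errors="replace") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for i, line in enumerate(fin):
            if i > n_rows:  # header line + n_rows data lines
                break
            fout.write(line)
            written += 1
    return written
```

Timing a run against such a sample gives a rough per-row cost that can be extrapolated to the full file.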
Also, regarding the await calls for the Postgres insert: I'm not sure they run in parallel (they are async, but probably not parallel, since each one waits for the cursor to come back).
Investigate how to run those tasks in parallel.
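One way to make the inserts overlap is to schedule them with asyncio.gather instead of awaiting each cursor sequentially. A minimal sketch, with the aiopg call replaced by a stub coroutine (bulk_insert and the chunk data are hypothetical names, not from the repo):

```python
import asyncio

async def bulk_insert(chunk: list) -> int:
    """Stub for an aiopg insert; replace the sleep with the real
    pool.acquire() / cursor.execute() round-trip."""
    await asyncio.sleep(0.01)  # simulates waiting on the Postgres cursor
    return len(chunk)

async def insert_all(chunks: list) -> list:
    # Sequentially awaiting each call serialises the waits:
    #     for c in chunks: await bulk_insert(c)
    # gather() instead submits every coroutine before waiting on any of
    # them, so the cursor round-trips overlap on the event loop.
    return await asyncio.gather(*(bulk_insert(c) for c in chunks))

results = asyncio.run(insert_all([[1, 2], [3], [4, 5, 6]]))
```

Note that true concurrency against Postgres also requires each task to acquire its own connection from the pool; tasks sharing a single cursor would still serialize server-side.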
To avoid using a proxy, and following cloud harness best practices, we should move the client-side code to
https://hub.docker.com/_/nginx
so static content is served from the same URL, avoiding any CORS issues or proxying.
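A minimal sketch of what the nginx side could look like, assuming the built client bundle is copied into the image's default `/usr/share/nginx/html` and the molecules API is exposed under a same-origin path (the `/api/` path and `backend` upstream name are placeholders, not values from the repo):

```nginx
server {
    listen 80;

    # Serve the built client bundle from the same origin as the API,
    # so the browser never issues a cross-origin request.
    location / {
        root      /usr/share/nginx/html;
        try_files $uri $uri/ /index.html;
    }

    # Same-origin path for the molecules API (upstream name is a placeholder).
    location /api/ {
        proxy_pass http://backend:8080/;
    }
}
```

With both the static files and the API behind one origin, no CORS headers or separate dev proxy are needed.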
Test the performance of different endpoint requests.
For example:
the call made to load the 5 GB of data into Postgres.
The merged data is a NoSQL-style dataframe served by a SQL engine. There are some N-to-1 relationships that produce duplicated rows (e.g., synonym to CID). Ideally, we would run a process that merges all of these into an array.
I anticipate this is going to be a CPU-consuming process, and given the Postgres plugin's performance it won't produce much gain (20% maybe?). We still need to evaluate the merge POC's performance.
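The merge itself can be expressed directly in SQL (roughly `SELECT cid, array_agg(synonym) FROM synonyms GROUP BY cid`); a quick sketch of the same collapse in plain Python, with toy row data (only 297/methane is a real CID/synonym pair):

```python
from collections import defaultdict

def merge_synonyms(rows):
    """Collapse duplicated (cid, synonym) rows into one list per CID,
    mirroring Postgres's array_agg(synonym) ... GROUP BY cid."""
    merged = defaultdict(list)
    for cid, synonym in rows:
        merged[cid].append(synonym)
    return dict(merged)

# Toy data: the 702 rows are made up for illustration.
rows = [(297, "methane"), (297, "marsh gas"), (702, "ethanol")]
# merge_synonyms(rows) -> {297: ["methane", "marsh gas"], 702: ["ethanol"]}
```

Doing this once at load time trades CPU during ingestion for deduplicated rows at query time.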
Right now we are only using a single term, which could be a CID or a synonym.
Use different CSV files to search by different term types (e.g., SMILES).
Maybe one endpoint per type.
The FTP site with all the files for digestion: https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/
There are valid molecules that have no CID (usually they have either never been synthesized or at least never been characterized in the public literature). In the PubChem data you've worked with, there are files (being normalized into tables) such as CIDSmiles, CIDInchi, and CIDInchiKey. Every CID has a SMILES, an InChI, and an InChIKey (the InChIKey is just a hash of the InChI). SMILES is nice because it is somewhat readable, but InChI is truly unique: every unique molecule is guaranteed to have exactly one InChI. An InChI can be converted into SMILES, into images of the molecule, etc. And the InChIKey is fixed-length, which makes it good for fast one-way lookups.
Of the billions of molecules that can exist, only millions have CIDs. So for any prospective mapping/prediction (e.g., what would this previously unsynthesized molecule be predicted to smell like?), the input might be a list of InChIs or SMILES. For some of those we can link to a known CID using the CIDInchi table. But for others we cannot, so there may be a need for tables indexed by InChI. These do not need autocomplete; even the InChIKey is probably sufficient here. Converting from InChI to anything else is one line in Python (with rdkit).
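The one-line conversions mentioned above, assuming rdkit is installed (methane's InChI used as the example):

```python
from rdkit import Chem

inchi = "InChI=1S/CH4/h1H4"        # methane
mol = Chem.MolFromInchi(inchi)     # parse the InChI into a molecule object
smiles = Chem.MolToSmiles(mol)     # canonical SMILES, "C" for methane
key = Chem.InchiToInchiKey(inchi)  # fixed-length 27-char hash, good for lookups
```

The fixed 27-character InChIKey is what makes it a convenient index key for the one-way lookup tables described above.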
As described in the following diagram.
As the POC of merging tables together didn't work as expected, we will be serving each text table as a separate endpoint.
When parsing CID-Title with the populate_parallel.py script, we have an issue parsing the text for CID-Title:
```
/usr/local/lib/python3.9/site-packages/aiopg/pool.py:478: ResourceWarning: Unclosed 1 connections in <aiopg.pool.Pool object at 0x7ff94469abe0>
  warnings.warn(
2022-10-04 17:52:25,524 [INFO] populate_parallel.go: Ouput folder %s /tmp/CID/CID-Title/
Traceback (most recent call last):
  File "/populate_parallel.py", line 157, in <module>
    loop.run_until_complete(go())
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/populate_parallel.py", line 150, in go
    df.to_csv(output + "export-*.csv", sep='\t', index=False)
  File "/usr/local/lib/python3.9/site-packages/dask/dataframe/core.py", line 1705, in to_csv
    return to_csv(self, filename, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/dask/dataframe/io/csv.py", line 972, in to_csv
    return list(dask.compute(*values, **compute_kwargs))
  File "/usr/local/lib/python3.9/site-packages/dask/base.py", line 600, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/dask/threaded.py", line 89, in get
    results = get_async(
  File "/usr/local/lib/python3.9/site-packages/dask/local.py", line 511, in get_async
    raise_exception(exc, tb)
  File "/usr/local/lib/python3.9/site-packages/dask/local.py", line 319, in reraise
    raise exc
  File "/usr/local/lib/python3.9/site-packages/dask/local.py", line 224, in execute_task
    result = _execute_task(task, data)
  File "/usr/local/lib/python3.9/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/usr/local/lib/python3.9/site-packages/dask/optimization.py", line 990, in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
  File "/usr/local/lib/python3.9/site-packages/dask/core.py", line 149, in get
    result = _execute_task(task, cache)
  File "/usr/local/lib/python3.9/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/usr/local/lib/python3.9/site-packages/dask/dataframe/io/csv.py", line 129, in __call__
    df = pandas_read_text(
  File "/usr/local/lib/python3.9/site-packages/dask/dataframe/io/csv.py", line 182, in pandas_read_text
    df = reader(bio, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/pandas/util/_decorators.py", line 317, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 611, in _read
    return parser.read(nrows)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1772, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 243, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 808, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 890, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1037, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1068, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1159, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas/_libs/parsers.pyx", line 1246, in pandas._libs.parsers.TextReader._string_convert
  File "pandas/_libs/parsers.pyx", line 1444, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 1: invalid continuation byte
```
It's also splitting values that contain a comma (,) in their definition, so a single title ends up shredded across multiple rows, e.g.:

```
3 2,4-Cyclohexadiene-1-carboxylicacid
3 6-dihydroxy-
3 (1R,6S)-rel-
```
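Since CID-Title is tab-separated, commas inside titles should survive as long as the reader is pinned to `\t` as the delimiter; a sketch with the csv module (the sample line is a reconstruction of the rows above, not taken from the actual file):

```python
import csv
import io

# One CID-Title line: the title itself contains commas, so splitting on
# ',' (or letting a reader guess the delimiter) shreds it into pieces.
raw = "3\t2,4-Cyclohexadiene-1-carboxylic acid, 3,6-dihydroxy-, (1R,6S)-rel-\n"

reader = csv.reader(io.StringIO(raw), delimiter="\t")
cid, title = next(reader)
# cid == "3"; the commas stay inside the single title field
```

The UnicodeDecodeError in the traceback above separately suggests some titles are not valid UTF-8; passing a more forgiving encoding (e.g. `encoding="latin-1"`, or an error handler) to the reader is worth trying.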
Provide a similar API, but plug in our performance improvements.
Check sorting (type of molecule, CID, etc.).
CID search is taking between 5 and 10 seconds,
while fuzzy search is around 0.2 ms.
I'm not sure I yet have strong opinions on how the results should be sorted in the JSON that gets returned (and the corresponding dropdown in the app). But I think it would be reasonable to have exact matches returned first. So if I search for "methane", there is probably one exact match and so the CID,Synonym pair 297,methane should be the first one in the list.
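That ordering can be sketched as a sort key that puts exact matches first (function and field layout are hypothetical; only the 297/methane pair comes from the note above):

```python
def sort_results(results, query):
    """Order (cid, synonym) pairs so exact matches on the query come first,
    with the remaining hits sorted alphabetically after them."""
    q = query.lower()
    # Tuple key: False (exact match) sorts before True (non-match).
    return sorted(results, key=lambda r: (r[1].lower() != q, r[1].lower()))

# Toy rows; CIDs other than 297 are made up for illustration.
hits = [(111, "methanethiol"), (297, "methane"), (222, "methanal")]
# sort_results(hits, "methane")[0] == (297, "methane")
```

The same two-level key generalizes easily, e.g. adding a prefix-match tier between exact matches and everything else.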
@jrmartin if you look at the normalization script here:
https://github.com/MetaCell/asu-olfactory/blob/main/normalize.py
we need to grab the normalized data and bulk insert it into a PostgreSQL instance running locally, with a single table having CID (numeric) and Synonym (varchar), in which both columns will be indexed.
When the process is finished it will populate around 35M rows; we need to evaluate performance.
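A sketch of the target schema and indexes, using an in-memory sqlite3 database purely as a stand-in for the local PostgreSQL instance (in Postgres the types would be e.g. `numeric` / `varchar`, and the 35M-row load would use COPY rather than row-by-row inserts; table and index names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the local Postgres instance
cur = conn.cursor()
cur.execute("CREATE TABLE molecules (cid INTEGER, synonym TEXT)")
# Both columns indexed, as required for lookup by CID or by synonym.
cur.execute("CREATE INDEX idx_cid ON molecules (cid)")
cur.execute("CREATE INDEX idx_synonym ON molecules (synonym)")

cur.executemany(
    "INSERT INTO molecules VALUES (?, ?)",  # real load would be bulk COPY
    [(297, "methane"), (297, "marsh gas")],
)
rows = cur.execute(
    "SELECT synonym FROM molecules WHERE cid = ?", (297,)
).fetchall()
```

For performance evaluation, the interesting comparison is the same equality lookup with and without the two indexes at the 35M-row scale.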
Summarizing:
Finally, here's a description of how the data is expected to be normalized:
Right now the pod is set to 2 GB; since the trigram module is an in-memory kind of module, it will need at least 16 GB.
Based on #31, create Python scripts that will generate the new composite file to be ingested.
Implement the GET /molecules endpoint so it receives a single CID or a single synonym and queries the DB following the same approach used in:
https://github.com/MetaCell/asu-olfactory/blob/main/scripts/lookup.py
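The core of that handler is deciding whether the incoming term is a CID or a synonym and building the matching query; a framework-agnostic sketch (table/column names are carried over from the schema notes above as assumptions, and lookup.py may do this differently):

```python
def build_molecule_query(term: str):
    """Return (sql, params) for a single CID or a single synonym.

    Purely numeric terms are treated as CIDs; anything else is looked up
    as a synonym. Table and column names are assumptions, not from lookup.py.
    """
    if term.isdigit():
        return "SELECT cid, synonym FROM molecules WHERE cid = %s", (int(term),)
    return "SELECT cid, synonym FROM molecules WHERE synonym = %s", (term,)

sql, params = build_molecule_query("297")
# numeric term -> CID branch, params == (297,)
```

The GET handler then just executes the returned statement against the pool and serializes the rows to JSON; parameterized queries keep the synonym path safe from injection.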