dcplib's People

Contributors

bento007, chmreid, dailydreaming, kislyuk, maniarathi, mdunitz, mweiden, natanlao, parthshahva, sampierson, xbrianh

dcplib's Issues

Endpoint for providing associated graph based on "links.json".

Functionality to transform experimental graphs into bundles, and the reverse.

This ticket proposes an endpoint that takes over some of the custom associations currently made by ingest and stored in MongoDB, so that functionality in dcplib does most of the heavy lifting.

That is, given a links.json file with uuids and versions, the function would return a metadata graph designed for use by ingest. Ingest currently does this with an internal MongoDB table; the aim is to greatly reduce that internal use, or replace it entirely.
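As a rough sketch of what such a function might do, the snippet below turns a links.json-like document into an adjacency map. The layout it reads (a top-level "links" list whose entries carry "process", "inputs" and "outputs") and the name build_graph are assumptions made for illustration; the ticket does not specify either, and versions are ignored here.

```python
from collections import defaultdict


def build_graph(links_doc):
    """Return {node_id: set(downstream node_ids)} from a links.json-like dict.

    ASSUMED layout (not confirmed by the ticket): each entry in
    links_doc["links"] names one process plus its input and output uuids.
    """
    graph = defaultdict(set)
    for link in links_doc.get("links", []):
        process = link["process"]
        graph.setdefault(process, set())
        for uuid in link.get("inputs", []):
            graph[uuid].add(process)       # input entity feeds the process
        for uuid in link.get("outputs", []):
            graph[process].add(uuid)       # process produces the output entity
            graph.setdefault(uuid, set())  # make sure leaf nodes appear too
    return dict(graph)
```

A real implementation would also need to key nodes by uuid and version and to resolve each node to its metadata document.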

[ETL] Per-stage parallelization type control

Previously (v1.x.x), the dispatch_executor_class parameter passed to extract controlled the parallelization type (threads vs. processes) used to execute transformers. This made it possible to keep a high thread limit during extraction while spawning processes for CPU-intensive jobs during transformation. The current version applies the concurrent.futures.Executor passed as dispatch_executor_class to both the extraction and transform stages.

In order to optimize ETL performance, would it be possible to re-enable the old behavior?
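One possible shape for per-stage control is sketched below. The function run_stages and its parameters extract_executor_class and transform_executor_class are hypothetical names, not part of dcplib, and the two stages run one after the other here rather than being pipelined.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def run_stages(items, extract_one, transform_one,
               extract_executor_class=ThreadPoolExecutor,
               transform_executor_class=ProcessPoolExecutor,
               max_extract_workers=32, max_transform_workers=None):
    """Run extraction and transformation with separately chosen executors."""
    # I/O-bound stage: many threads are cheap while waiting on the network.
    with extract_executor_class(max_workers=max_extract_workers) as pool:
        extracted = list(pool.map(extract_one, items))
    # CPU-bound stage: processes sidestep the GIL.
    with transform_executor_class(max_workers=max_transform_workers) as pool:
        return list(pool.map(transform_one, extracted))
```

With ProcessPoolExecutor, transform_one and its inputs must be picklable, so it needs to be a module-level function rather than a lambda or closure.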

ETL: add more tests

Increase test coverage in the ETL library:

  • Test for retrieving all bundle contents (not just the first 500)
  • Test for continuing on failures
  • Test for any other kwarg settings or major uncovered branches
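A test for the first bullet could look roughly like the following. Both iter_all_bundles and the fetch_page(token) callable are hypothetical stand-ins for whatever paged search call the ETL actually makes; the point is only that the assertion covers more than one page of 500.

```python
import unittest


def iter_all_bundles(fetch_page):
    """Yield every bundle by following page tokens until none is left.

    fetch_page(token) -> (bundles, next_token) is a HYPOTHETICAL interface.
    """
    token = None
    while True:
        bundles, token = fetch_page(token)
        yield from bundles
        if token is None:
            break


class TestPagination(unittest.TestCase):
    def test_returns_more_than_first_page(self):
        total, page_size = 1200, 500
        data = list(range(total))

        def fake_fetch_page(token):
            # The token is simply the start offset of the next page.
            start = token or 0
            end = min(start + page_size, total)
            return data[start:end], (end if end < total else None)

        self.assertEqual(list(iter_all_bundles(fake_fetch_page)), data)
```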

ETL logic error

# make load
python -m dcpquery.db init
INFO:dcpquery.db:Initializing database at postgresql+psycopg2://root:***@/dcpquery
INFO:dcpquery.db:Initializing database
INFO:dcpquery.db:Migrating database at postgresql+psycopg2://root:***@/dcpquery
INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO  [alembic.runtime.migration] Will assume transactional DDL.
python -m dcpquery.db load
INFO:tweak:Loaded configuration from /mnt/query-service/.venv37/lib/python3.7/site-packages/hca/default_config.json
INFO:tweak:Loaded configuration from /home/akislyuk/.config/hca/config.json
INFO:dcplib.etl:Scanning 526009 bundles
INFO:dcplib.etl:
Loading bundles: 0 loaded so far. 0% of bundles loaded
.WARNING:dcplib.networking:Waiting 10s before redirect per Retry-After header
..WARNING:dcplib.networking:Waiting 10s before redirect per Retry-After header
...............WARNING:dcplib.networking:Waiting 10s before redirect per Retry-After header
.WARNING:dcplib.networking:Waiting 10s before redirect per Retry-After header
.WARNING:dcplib.networking:Waiting 10s before redirect per Retry-After header
.WARNING:dcplib.networking:Waiting 10s before redirect per Retry-After header
..WARNING:dcplib.networking:Waiting 10s before redirect per Retry-After header
WARNING:dcplib.networking:Waiting 10s before redirect per Retry-After header
..WARNING:dcplib.networking:Waiting 10s before redirect per Retry-After header
WARNING:dcplib.networking:Waiting 10s before redirect per Retry-After header
.WARNING:dcplib.networking:Waiting 10s before redirect per Retry-After header
.WARNING:dcplib.networking:Waiting 10s before redirect per Retry-After header
..WARNING:dcplib.networking:Waiting 10s before redirect per Retry-After header
WARNING:dcplib.networking:Waiting 10s before redirect per Retry-After header
..WARNING:dcplib.networking:Waiting 10s before redirect per Retry-After header
WARNING:dcplib.networking:Waiting 10s before redirect per Retry-After header
..WARNING:dcplib.networking:Waiting 10s before redirect per Retry-After header
WARNING:dcplib.networking:Waiting 10s before redirect per Retry-After header
..WARNING:dcplib.networking:Waiting 10s before redirect per Retry-After header
WARNING:dcplib.networking:Waiting 10s before redirect per Retry-After header
..WARNING:dcplib.networking:Waiting 10s before redirect per Retry-After header
WARNING:dcplib.networking:Waiting 10s before redirect per Retry-After header
..WARNING:dcplib.networking:Waiting 10s before redirect per Retry-After header
WARNING:dcplib.networking:Waiting 10s before redirect per Retry-After header
..Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnt/query-service/dcpquery/db/__main__.py", line 71, in <module>
    **extractor_args
  File "/mnt/query-service/.venv37/lib/python3.7/site-packages/dcplib/etl/__init__.py", line 96, in extract
    loader(bundle=future.result())
  File "/mnt/query-service/dcpquery/etl/__init__.py", line 83, in load_bundle
    bundle_row = Bundle(uuid=bundle["uuid"],
TypeError: 'NoneType' object is not subscriptable
Makefile:126: recipe for target 'load' failed
make: *** [load] Error 1
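The traceback shows future.result() returning None and the loader then indexing it. One defensive option, sketched below, is to skip empty results before calling the loader. The helper load_completed is hypothetical, and treating None as "nothing to load" is an assumption; the underlying question of why extraction returns None would still need an answer.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

logger = logging.getLogger(__name__)


def load_completed(futures, loader):
    """Call loader(bundle=...) for each finished future, skipping empty results."""
    loaded = skipped = 0
    for future in as_completed(futures):
        bundle = future.result()
        if bundle is None:
            # Assumption: None means the extract/transform step produced nothing.
            logger.warning("Skipping a bundle whose extraction returned None")
            skipped += 1
            continue
        loader(bundle=bundle)
        loaded += 1
    return loaded, skipped


if __name__ == "__main__":
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(lambda b=b: b) for b in ({"uuid": "x"}, None)]
        print(load_completed(futures, loader=lambda bundle: None))
```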

Add Mock Fusillade client

As part of the integration of Fusillade for authentication and authorization, the data store needed to set up a mock Fusillade server for testing purposes. As other components adopt Fusillade, this code will be useful for everyone to incorporate into their test suite.

In the tests/infra/ directory of humancellatlas/data-store there is a MockFusilladeHandler. Any test that needs to call endpoints protected by Fusillade first has to change the authorization URL (part of the component's configuration) to point to the local mock Fusillade server, then start the server:

# This code goes into your test
def setUpModule():
    # Change the component configuration to use the mock Fusillade server
    # as the authentication/authorization URL
    MockFusilladeHandler.start_serving()

and shut it down when finished with the test:

def tearDownModule():
    MockFusilladeHandler.stop_serving()

The mock Fusillade server will run on 127.0.0.1:X (where X is a randomly-selected unused TCP port).

This code (plus documentation) should be added to dcplib so others can use it in their own components' tests.
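For readers who have not seen the data-store code, here is a minimal sketch of the general pattern using only the standard library. MockAuthHandler is a hypothetical stand-in, not the actual MockFusilladeHandler, and the {"result": true} response body is made up for illustration rather than taken from Fusillade's API.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class MockAuthHandler(BaseHTTPRequestHandler):
    """HYPOTHETICAL stand-in; not the data-store's MockFusilladeHandler."""
    _server = None
    _thread = None

    def do_POST(self):
        # Drain the request body, then always answer "allowed".
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        body = json.dumps({"result": True}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep test output quiet

    @classmethod
    def start_serving(cls):
        # Port 0 asks the OS for any unused TCP port.
        cls._server = HTTPServer(("127.0.0.1", 0), cls)
        cls._thread = threading.Thread(target=cls._server.serve_forever, daemon=True)
        cls._thread.start()
        return "http://127.0.0.1:{}".format(cls._server.server_port)

    @classmethod
    def stop_serving(cls):
        cls._server.shutdown()
        cls._server.server_close()
        cls._thread.join()
        cls._server = cls._thread = None


if __name__ == "__main__":
    url = MockAuthHandler.start_serving()
    request = urllib.request.Request(url + "/evaluate", data=b"{}", method="POST")
    with urllib.request.urlopen(request) as response:
        print(json.loads(response.read()))
    MockAuthHandler.stop_serving()
```

Returning the base URL from start_serving gives the test something to write into the component's authorization URL setting.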

Drop support for Python 2.7

“DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won’t be maintained after that date. A future version of pip will drop support for Python 2.7.”

The Tech Arch team has approved dropping Python 2.7 support.

[ETL] Document "pre-heating DSS checkout" usage

The performance bottleneck during extraction is waiting for files to complete the DSS checkout process. The ETL can "touch" all required files before attempting to download them, which preemptively starts the DSS checkouts for all files. Is this usage documented somewhere?
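Until it is documented, the idea can be illustrated with a two-phase sketch. The function fetch_with_preheat and its touch and download callables are hypothetical and do not reflect dcplib's actual interface; touch stands for a cheap request that starts a checkout, download for the request that fetches the content.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_with_preheat(files, touch, download, max_workers=16):
    """Touch every file first, then download them all.

    touch(f) and download(f) are HYPOTHETICAL callables supplied by the caller.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Phase 1: start all checkouts; return values are ignored.
        list(pool.map(touch, files))
        # Phase 2: download; many checkouts may have finished by now.
        return list(pool.map(download, files))
```

The benefit comes from the checkouts proceeding in the background while the remaining files are still being touched.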

Read timeouts

Every GET request to the DSS seems to have a small probability of timing out. When dcplib is run with a large number of files/bundles, encountering a timeout becomes nearly certain, and the whole process fails.

An example traceback is below.

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/dcplib/etl/__init__.py", line 144, in get_files_to_fetch_for_bundle
    with open(f"{self.sd}/files/{f['uuid']}.{f['version']}", "rb") as fh:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/files/f8d226c2-f1b5-48a1-ba9f-fb7f67eeb3ad.2019-05-18T075613.240050Z'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/response.py", line 302, in _error_catcher
    yield
  File "/usr/lib/python3/dist-packages/urllib3/response.py", line 384, in read
    data = self._fp.read(amt)
  File "/usr/lib/python3.6/http/client.py", line 449, in read
    n = self.readinto(b)
  File "/usr/lib/python3.6/http/client.py", line 493, in readinto
    n = self.fp.readinto(b)
  File "/usr/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.6/ssl.py", line 1012, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.6/ssl.py", line 874, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib/python3.6/ssl.py", line 631, in read
    v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/requests/models.py", line 750, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/usr/lib/python3/dist-packages/urllib3/response.py", line 436, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/usr/lib/python3/dist-packages/urllib3/response.py", line 401, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python3/dist-packages/urllib3/response.py", line 307, in _error_catcher
    raise ReadTimeoutError(self._pool, None, 'Read timed out.')
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='org-hca-dss-checkout-prod.s3.amazonaws.com', port=443): Read timed out.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.6/concurrent/futures/process.py", line 175, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.6/dist-packages/dcplib/etl/__init__.py", line 60, in extract_transform_one
    bundle_uuid, bundle_version, fetched_files = self.get_files_to_fetch_for_bundle(bundle_uuid, bundle_version)
  File "/usr/local/lib/python3.6/dist-packages/dcplib/etl/__init__.py", line 150, in get_files_to_fetch_for_bundle
    self._get_file(f, bundle_uuid, bundle_version)
  File "/usr/local/lib/python3.6/dist-packages/dcplib/etl/__init__.py", line 171, in _get_file
    res = http.get(f"{self.dss_client.host}/files/{f['uuid']}", params={"replica": "aws", "version": f["version"]})
  File "/usr/local/lib/python3.6/dist-packages/dcplib/networking.py", line 56, in get
    return self(url=url, method="GET", *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/dcplib/networking.py", line 50, in __call__
    return self.sessions[get_ident()].request(*args, timeout=self.timeout_policy, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 524, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 659, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 659, in <listcomp>
    history = [resp for resp in gen] if allow_redirects else []
  File "/usr/local/lib/python3.6/dist-packages/dcplib/networking.py", line 19, in resolve_redirects
    for rv in super(Session, self).resolve_redirects(resp, req, **kwargs):
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 238, in resolve_redirects
    **adapter_kwargs
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 677, in send
    r.content
  File "/usr/local/lib/python3.6/dist-packages/requests/models.py", line 828, in content
    self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
  File "/usr/local/lib/python3.6/dist-packages/requests/models.py", line 757, in generate
    raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='org-hca-dss-checkout-prod.s3.amazonaws.com', port=443): Read timed out.
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "loader.py", line 167, in <module>
    load(_args)
  File "loader.py", line 91, in load
    dispatcher_executor_class=concurrent.futures.ProcessPoolExecutor)
  File "/mnt/matrix/common/etl/__init__.py", line 73, in etl_dss_bundles
    dispatch_executor_class=dispatcher_executor_class
  File "/usr/local/lib/python3.6/dist-packages/dcplib/etl/__init__.py", line 99, in extract
    extract_result = future.result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
requests.exceptions.ConnectionError: None: None
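One common mitigation is to retry the failing call with exponential backoff instead of letting a single timeout abort the run. The helper below is a generic sketch, not dcplib code; call_with_retries and its parameters are hypothetical names. Note that requests.exceptions.ConnectionError does not derive from the builtin ConnectionError, so a caller wrapping requests would pass retry_on=(requests.exceptions.ConnectionError,).

```python
import time


def call_with_retries(fn, retry_on=(ConnectionError, TimeoutError),
                      attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff on the given exceptions."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the failure
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Whether a retry is safe depends on the request being idempotent, which holds for the GET shown in the traceback.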

Loosen the version constraint of puremagic to avoid dependency conflicts

Hi, dcplib pins puremagic to puremagic ==1.4, which causes trouble for its direct downstream project cgp-dss-data-loader, which also depends on puremagic.

To make matters worse, the downstream projects [hca] of commandtax also depend on puremagic.

Could you please loosen the version constraint of puremagic?
This would let users of both dcplib and puremagic upgrade their third-party libraries in a timely manner and reduce technical debt.

Solution

The dependency trees of your project and affected downstream projects are shown as follows.
Taking the version constraints of upstream and downstream projects into comprehensive consideration, you can

  1. Loosen the constraint to puremagic >=1.4.

  2. Try to add an upper bound for puremagic’s version constraint, according to your compatibility.
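Combining the two options might look like the setup.py excerpt below. This is a hypothetical fragment: the upper bound of 2 is only a placeholder, since the issue does not say which puremagic releases dcplib is compatible with.

```python
# Hypothetical excerpt from a setup.py; the "< 2" upper bound is a
# placeholder and would need to be checked against actual compatibility.
install_requires = [
    "puremagic >= 1.4, < 2",
]
```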

@kislyuk Please let me know your choice. I can submit a PR to fix this issue.

Thanks for your attention.
Best,
Neolith
