
Access modes · intake-bluesky · CLOSED · 7 comments

danielballan commented on July 26, 2024
Access modes


Comments (7)

danielballan commented on July 26, 2024

I should clarify that this list is aspirational: 4 and 5 are still in progress.


danielballan commented on July 26, 2024

Dictated by @gwbischof....

At the bottom we have:

read_partition()  # This talks to the 'backend', files or databases. It may or may not layer on a Filler or DaskFiller depending on the parameters passed.

Note that "read_partition" is a function name that means something to intake itself as is involved in the client--server code.

Then we have:

canonical_unfilled() -> Includes Datum, Resource, and Events with datum_ids in them
canonical()  -> No Datum, Resource docs. Events have numpy arrays in them.
canonical_dask() -> No Datum, Resource docs. Events have dask arrays in them.

All of these call read_partition under the hood.
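
A minimal usage sketch of how these generators might be consumed (assuming `run` is a BlueskyRun-like object; the loop bodies are illustrative only):

# Hypothetical usage sketch; `run` stands in for a BlueskyRun-like object.
for name, doc in run.canonical_unfilled():
    # Yields all document types, including Resource and Datum;
    # Events still carry datum_ids rather than array data.
    print(name, doc.get('uid'))

for name, doc in run.canonical():
    # No Resource or Datum documents; Events carry numpy arrays.
    pass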

Finally we have:

read() -> Calls `canonical()` under the hood and packs the results into an `xarray.Dataset` backed by numpy arrays.
to_dask() -> Calls `canonical_dask()` under the hood and packs the results into an `xarray.Dataset` backed by a mixture of dask arrays and numpy arrays.

We intentionally do not include a read_unfilled() option. If you want unfilled data, you need the Resource and Datum documents to interpret them, so you need canonical_unfilled(). We don't have a place to put the Resource and Datum documents in an xarray.Dataset.

Note that read() and to_dask() are the standard methods on a DataSource in intake itself.
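
A hedged usage sketch of those two standard methods (assuming `stream` is one of the run's event-stream data sources; the field name 'image' is hypothetical):

ds = stream.read()        # xarray.Dataset backed by numpy arrays, fully loaded
lazy = stream.to_dask()   # xarray.Dataset backed (partly) by dask arrays
arr = lazy['image'].compute()  # 'image' is a hypothetical field name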


gwbischof commented on July 26, 2024

canonical_unfilled, canonical, and canonical_dask will get a chunk_size argument.
chunk_size is the size of the event or datum pages.
We can pass chunk_size to read_partition by embedding it in the index argument.
What if read() and to_dask() call read_partition with a partition_size of None, indicating that the whole run should be read? This could solve the problem of iterating the generators from the beginning each time the next partition is needed for read() and to_dask().
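
A rough sketch of what "embedding chunk_size in the index argument" could look like (the index layout and values here are hypothetical, not a settled API):

# Hypothetical: the index passed to read_partition carries the page size,
# so the backend can size its event/datum pages accordingly.
index = {'filler_type': 'normal', 'chunk_size': 5000}
docs = stream.read_partition(index)

# read() / to_dask() would pass partition_size=None to mean "the whole run".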


danielballan commented on July 26, 2024
class BlueskyRun(...):
    def canonical(self):
        "Interface documents from streams in this Run."
        ... stream.read_partition_canonical(filler_type='normal')

    def canonical_dask(self):
        "Interface documents from streams in this Run."
        ... stream.read_partition_canonical(filler_type='dask')

    def canonical_unfilled(self):
        "Interface documents from streams in this Run."
        ... yield from stream.read_partition_canonical(filler_type='null')

class BlueskyEventStream(intake_xarray.base.DataSourceMixin):
    def _open_dataset(self):
         # Use read_partition_canonical with the dask filler.
         self._ds = # xarray.Dataset backed by dask.arrays

    def read_partition_canonical(self, filler_type, ...) -> list:
        "This hits the databases and gets Documents."

    def read_partition(self, partition) -> [list or dask.array]:
        "Decide if this is for the canonical path or the xarray path."
        # The first element of partition tells us how to interpret the rest of the elements.
        canonical_switch, partition_ = partition
        if canonical_switch:
            args, kwargs = partition_
            return self.read_partition_canonical(*args, **kwargs)
        else:
            # Defer to base class (intake_xarray mechanism).
            return super().read_partition(partition_)

    # These we don't have to write -- they come from the base class, but here's roughly what they do:

    def read(self):
         "Hand the local user a copy of the filled-in Dataset."
         return self._ds.load()
    def to_dask(self):
         "Hand the local user a copy of the dask-ified Dataset."
         return self._ds
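
A rough illustration of the dispatch above (the partition tuples follow this sketch, with hypothetical values):

# Canonical path: the first element flags the canonical branch; the second
# carries (args, kwargs) for read_partition_canonical.
docs = stream.read_partition((True, ((), {'filler_type': 'dask'})))

# xarray path: fall through to the intake_xarray base class partitioning.
chunk = stream.read_partition((False, 0))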


danielballan commented on July 26, 2024

Summary of a spontaneous call with @tacaswell on databroker design:

  • Recall that we plan three Broker implementations in the codebase to deal with back-compat. The v0 Broker is the current one; the v1 Broker provides a v0-compatible API but uses intake underneath; the v2 Broker has a totally new API much closer to standard intake. The v2 Broker "is a" Catalog (just adds some extra methods) and v1 Broker "has a" v2 Broker. This means we only instantiate one Catalog (and associated HTTP connection, caches, other resources) underlying both APIs.
  • Functions like from_config can create a Broker/Catalog and access a subcatalog if needed to give the user what they expect.
  • After some consideration of alternatives, we are happy with the laziness logic living in the Filler, not in the handlers themselves.
  • The methods that return documents should always include the Datum and Resource documents. From the theoretical point of view, we never want to hand the user an uninterpretable foreign key, and filled Events still have foreign keys. From a practical point of view, there is no constituency of users who want just some of the documents. Users who don't want to think about Datum/Resource will be using xarrays, while users who want documents will want all the documents.
  • The methods read() and to_dask() don't feel like they really need to be separate methods. They both return {xarrays, dataframes}, just with different things inside, so they have return type stability. It could be done as read(dask=True) or read(delayed=True). I wonder if intake would be open to adding that. Along those lines, instead of adding three methods named canonical_something, a single canonical method with a required fill argument would be sufficient and simpler; it could take values like 'yes', 'no', or 'delayed' (see the sketch after this list). Referring to the previous bullet point, if all these modes return all the document types, a single canonical method would have return type stability. We may eventually want different kinds of lazy (e.g. different frameworks for creating the proxy objects), but that can be configured when the catalog is created, not per call.
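
A minimal sketch of the unified interface floated above (the fill values 'yes'/'no'/'delayed' and the delayed keyword follow the proposal; everything else is illustrative):

class BlueskyRun(...):
    def canonical(self, fill):
        "Yield documents; fill is 'yes', 'no', or 'delayed'."
        ...

    def read(self, delayed=False):
        "Return an xarray.Dataset, dask-backed if delayed=True."
        ...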


danielballan commented on July 26, 2024

This means we only instantiate one Catalog (and associated HTTP connection, caches, other resources) underlying both APIs.

To elaborate, we want to support the ability to gradually update scripts, opting into the new API in places, without necessarily updating everything at once. This feels especially important for users like @CJ-Wright who have a lot of code that expects Header objects that will take some time to update, but might want to opt into v2's new search API.

from databroker.v1 import Broker

db = Broker.named(...)
db  # a v1-style databroker
db.v2  # an accessor to a v2-style databroker, sharing the same Catalog underneath
db.v2.header_version = 1  # set v2 to return v1-style Headers instead of Entries
headers = db.v2.search({...}, descriptors={...})  # new v2-only feature, but accessible from v1!
db.get_events(headers)  # back to v0/v1-style code


danielballan commented on July 26, 2024

This is implemented and documented.

