Comments (7)
I should clarify that this list is aspirational: 4 and 5 are still in progress.
from intake-bluesky.
Dictated by @gwbischof....
At the bottom we have:
read_partition() # This talks to the 'backend', files or databases. It may or may not layer on a Filler or DaskFiller, depending on the parameters passed.
Note that read_partition is a method name that means something to intake itself and is involved in the client-server code.
Then we have:
canonical_unfilled() -> Includes Datum, Resource, and Events with datum_ids in them.
canonical() -> No Datum or Resource docs. Events have numpy arrays in them.
canonical_dask() -> No Datum or Resource docs. Events have dask arrays in them.
All of these call read_partition under the hood.
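To make the layering concrete, here is a minimal runnable sketch. The document payloads, the single-partition call, and the function bodies are stand-ins for illustration, not the real intake-bluesky code:

```python
# Hypothetical sketch of the layering described above: three thin wrappers
# over a shared read_partition(), differing only in which filler they request.
def read_partition(index, filler_type):
    # Stand-in for the backend call; real code hits files or databases.
    return [("event", {"filled": filler_type != "null", "index": index})]

def canonical_unfilled():
    # Unfilled: Datum/Resource docs pass through, Events keep datum_ids.
    yield from read_partition(0, filler_type="null")

def canonical():
    # Filled eagerly: Events carry numpy arrays.
    yield from read_partition(0, filler_type="normal")

def canonical_dask():
    # Filled lazily: Events carry dask arrays.
    yield from read_partition(0, filler_type="dask")
```

The point is only that all three generators are the same shape and differ in a single parameter passed down to read_partition.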
Finally we have:
read() -> Calls `canonical()` under the hood and packs the results into an `xarray.Dataset` backed by numpy arrays.
to_dask() -> Calls `canonical_dask()` under the hood and packs the results into an `xarray.Dataset` backed by a mixture of dask arrays and numpy arrays.
We intentionally do not include a read_unfilled() option. If you want unfilled data, you need the Resource and Datum documents to interpret them, so you need canonical_unfilled(). We don't have a place to put the Resource and Datum documents in an xarray.Dataset.
Note that read() and to_dask() are the standard methods on a DataSource in intake itself.
canonical_unfilled, canonical, and canonical_dask will get a chunk_size argument. chunk_size is the size of the event or datum pages. We can pass chunk_size to read_partition by embedding it in the index argument.
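Since intake's read_partition takes a single index argument, one way to smuggle chunk_size through is to make the index a tuple. A hypothetical sketch (these names and the tuple convention are assumptions, not intake API):

```python
# Hypothetical: embed chunk_size in the index argument as a (number, size) pair.
def read_partition(index):
    partition_number, chunk_size = index  # unpack the embedded argument
    # Real code would fetch event/datum pages of size chunk_size here.
    return {"partition": partition_number, "chunk_size": chunk_size}

result = read_partition((0, 500))
```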
What if read() and to_dask() call read_partition with a partition_size of None, indicating to read the whole run? This could solve the problem of iterating the generators from the beginning each time you need the next partition for read and to_dask.
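A hypothetical sketch of that convention (the names and the stand-in document list are assumptions):

```python
# partition_size=None means "read the whole run", so read()/to_dask() get
# everything in one call instead of re-iterating the document generators
# from the start for each successive partition.
def read_partition(partition_size):
    documents = list(range(10))  # stand-in for the run's document stream
    if partition_size is None:
        return documents         # whole run in one pass
    return documents[:partition_size]
```

With partition_size=None the underlying generators are consumed exactly once.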
class BlueskyRun(...):
    def canonical(self):
        "Interface documents from streams in this Run."
        ... stream.read_partition_canonical(filler_type='normal')

    def canonical_dask(self):
        "Interface documents from streams in this Run."
        ... stream.read_partition_canonical(filler_type='dask')

    def canonical_unfilled(self):
        "Interface documents from streams in this Run."
        ... yield from stream.read_partition_canonical(filler_type='null')

class BlueskyEventStream(intake_xarray.base.DataSourceMixin):
    def _open_dataset(self):
        # Use read_partition_canonical with the dask filler.
        self._ds = ...  # xarray.Dataset backed by dask arrays

    def read_partition_canonical(self, filler_type, ...) -> list:
        "This hits the databases and gets Documents."

    def read_partition(self, partition) -> [list or dask.array]:
        "Decide if this is for the canonical path or the xarray path."
        # The first element of partition tells us how to interpret the rest.
        canonical_switch, partition_ = partition
        if canonical_switch:
            args, kwargs = partition_
            return self.read_partition_canonical(*args, **kwargs)
        else:
            # Defer to the base class (intake_xarray mechanism).
            return super().read_partition(partition_)

    # These we don't have to write -- they come from the base class,
    # but here's roughly what they do:
    def read(self):
        "Hand the local user a copy of the filled-in Dataset."
        return self._ds.load()

    def to_dask(self):
        "Hand the local user a copy of the dask-ified Dataset."
        return self._ds
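The dispatch inside read_partition can be distilled into a small runnable example (the two path functions here are stand-ins; in the real sketch they are methods on BlueskyEventStream):

```python
# Stand-in for the canonical path: would hit the databases and return Documents.
def read_partition_canonical(filler_type):
    return ["documents", filler_type]

# Stand-in for the xarray path: would defer to intake_xarray's base class.
def base_read_partition(partition):
    return ["xarray-path", partition]

def read_partition(partition):
    # The first element of partition tells us how to interpret the rest.
    canonical_switch, partition_ = partition
    if canonical_switch:
        args, kwargs = partition_
        return read_partition_canonical(*args, **kwargs)
    return base_read_partition(partition_)
```

The single read_partition entry point is what lets one DataSource serve both the document consumers and the xarray consumers over intake's client-server protocol.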
Summary of a spontaneous call with @tacaswell on databroker design:
- Recall that we plan three Broker implementations in the codebase to deal with back-compat. The v0 Broker is the current one; the v1 Broker provides a v0-compatible API but uses intake underneath; the v2 Broker has a totally new API much closer to standard intake. The v2 Broker "is a" Catalog (just adds some extra methods) and v1 Broker "has a" v2 Broker. This means we only instantiate one Catalog (and associated HTTP connection, caches, other resources) underlying both APIs.
- Functions like from_config can create a Broker/Catalog and access a subcatalog if needed to give the user what they expect.
- After some consideration of alternatives, we are happy with the laziness logic living in the Filler, not in the handlers themselves.
- The methods that return documents should always include the Datum and Resource documents. From the theoretical point of view, we never want to hand the user an uninterpretable foreign key, and filled Events still have foreign keys. From a practical point of view, there is no constituency of users who want just some of the documents. Users who don't want to think about Datum/Resource will be using xarrays, while users who want documents will want all the documents.
- The methods read() and to_dask() don't feel like they really need to be separate methods. They both return {xarrays, dataframes}, just with different things inside, so they have return type stability. It could be done as read(dask=True) or read(delayed=True). I wonder if intake would be open to adding that. Along those lines, instead of adding three methods named canonical_something, a single canonical method with a required fill argument would be sufficient and simpler. It could take values like 'yes', 'no', or 'delayed'. Referring to the previous bullet point, if all these modes return all the document types, a single canonical method would have return type stability. We eventually want different kinds of lazy (e.g. different frameworks for creating the proxy objects), but that can be configured when the catalog is created, not per call.
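The proposed single canonical(fill=...) method could look roughly like this. The mapping of fill values onto filler types is my assumption for illustration, not settled API:

```python
# Hypothetical sketch of a single canonical() with a required fill argument.
FILLERS = {"yes": "normal", "no": "null", "delayed": "dask"}

def canonical(fill):
    if fill not in FILLERS:
        raise ValueError(f"fill must be one of {sorted(FILLERS)}")
    filler_type = FILLERS[fill]
    # Every mode returns all document types, so the return type is stable.
    return {"filler_type": filler_type,
            "includes": ["Resource", "Datum", "Event"]}
```

Because all three modes include Resource and Datum documents, callers never need to branch on which document types they got back.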
This means we only instantiate one Catalog (and associated HTTP connection, caches, other resources) underlying both APIs.
To elaborate, we want to support the ability to gradually update scripts, opting into the new API in places, without necessarily updating everything at once. This feels especially important for users like @CJ-Wright, who have a lot of code that expects Header objects that will take some time to update but might want to opt into v2's new search API.
from databroker.v1 import Broker
db = Broker.named(...)
db # a v1-style databroker
db.v2 # an accessor to a v2-style databroker, sharing the same Catalog underneath
db.v2.header_version = 1 # set v2 to return v1-style Headers instead of Entries
headers = db.v2.search({...}, descriptors={...}) # new v2-only feature, but accessible from v1!
db.get_events(headers) # back to v0/v1-style code
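The "v1 has-a v2" relationship behind that snippet can be sketched in a few lines. These class bodies are illustrative stand-ins, not databroker's implementation:

```python
# Hypothetical sketch: both API styles wrap one shared catalog object,
# so only one HTTP connection / cache / resource set is ever created.
class BrokerV2:
    def __init__(self, catalog):
        self.catalog = catalog       # the single shared Catalog
        self.header_version = 2      # can be set to 1 for v1-style Headers

class BrokerV1:
    def __init__(self, catalog):
        self._catalog = catalog
        self.v2 = BrokerV2(catalog)  # accessor onto the same catalog

catalog = object()                   # stand-in for the real Catalog
db = BrokerV1(catalog)
assert db.v2.catalog is catalog      # one Catalog underlies both APIs
```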
from intake-bluesky.
This is implemented and documented.