
intake-stac's Introduction

Intake: Take 2

A general Python package for describing, loading and processing data



Taking the pain out of data access and distribution

Intake is an open-source package to:

  • describe your data declaratively
  • gather data sets into catalogs
  • search catalogs and services to find the right data you need
  • load, transform and output data in many formats
  • work with third party remote storage and compute platforms

Documentation is available at Read the Docs.

Please report issues at https://github.com/intake/intake/issues

Install

Recommended method using conda:

conda install -c conda-forge intake

You can also install using pip, in which case you have a choice as to how many of the optional dependencies you install; the simplest option has the fewest requirements:

pip install intake

Note that you may well need specific drivers and other plugins, which usually have additional dependencies of their own.
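One quick way to see which drivers are currently available (a minimal sketch; intake.registry is intake's runtime mapping of driver names):

import intake

# Plugin packages such as intake-xarray or intake-stac add entries to this
# registry once installed.
print(sorted(intake.registry))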

Development

  • Create a development Python environment with the required dependencies, ideally with conda. The requirements can be found in the yml files in the scripts/ci/ directory of this repo.
    • e.g. conda env create -f scripts/ci/environment-py311.yml and then conda activate test_env
  • Install intake using pip install -e .
  • Use pytest to run tests.
  • Create a fork on GitHub to be able to submit PRs.
  • We respect, but do not enforce, PEP 8 standards; all new code should be covered by tests.

intake-stac's People

Contributors

andersy005, dependabot[bot], fnattino, jsignell, jukent, matthewhanson, ocefpaf, pre-commit-ci[bot], richardscottoz, scottyhq, tomaugspurger, wildintellect


intake-stac's Issues

Stack / Mosaic STAC item collections

This STAC blog (specifically the section under Additional Flexibility - proj:shape and proj:transform) gave me the idea that GDAL VRTs with VRT mosaicking could be leveraged in intake-stac to stack and mosaic item collections.

Here is a gist showing an example of temporally stacking a single band of Sentinel-2a COGS using this approach: https://gist.github.com/rmg55/1acf804ef1af0c7934b265b3a653a486

A similar approach could also be used to mosaic data (although I have yet to try it). To use this feature, the STAC projection extension would need to be implemented in the STAC catalog. I could not find any existing work by the Open Data Cube folks on this effort (they were referenced in the blog post mentioned above).
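As a rough illustration of the VRT idea (a sketch only: the hrefs are placeholders, and gdal.BuildVRT plus the /vsicurl/ prefix are standard GDAL features rather than intake-stac API):

from osgeo import gdal
import rioxarray

# Placeholder asset hrefs; in practice these might be pulled from an
# intake-stac item collection.
hrefs = [
    'https://example.com/tile_1_B04.tif',
    'https://example.com/tile_2_B04.tif',
]

# Build a VRT that mosaics the remote COGs; /vsicurl/ lets GDAL read them
# via HTTP range requests.
vrt = gdal.BuildVRT('mosaic.vrt', ['/vsicurl/' + h for h in hrefs])
vrt = None  # close the dataset so the VRT is flushed to disk

# Open the mosaic lazily with dask-backed chunks.
da = rioxarray.open_rasterio('mosaic.vrt', chunks={'x': 2048, 'y': 2048})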

A few questions:

  1. Is this the optimal way to stitch together datasets that are organized in granules/tiled files?
  2. Does it seem likely that the projection extension will be incorporated in other catalogs?
  3. Is this a feature that would make sense to incorporate directly into intake-stac (if so I would be happy to work on this)?

Tagging @scottyhq and @jhamman

KeyError: 'open_stac_item_collection'

I downloaded the AWS Earth Search notebook from the examples and, without changing anything, just ran the cells...

But every time I get the following error and don't know why:
(screenshot: notebook traceback ending in KeyError: 'open_stac_item_collection')

understanding the sat-stac --> intake object mapping

Intake-stac is simply an opinionated way to turn STAC into Intake. We're currently using sat-stac, which has the following classes:

  • Thing: A Thing is not a STAC entity, it is a low-level parent class that is used by Catalog, Collection, and Item and includes the attributes they all have in common (read and save JSON, get links).
  • Catalog: A catalog is the simplest STAC object, containing an id, a description, the STAC version, and a list of links.
  • Collection: A Collection is a STAC catalog with some additional fields that describe a group of data, such as the provider, license, along with temporal and spatial extent.
  • Item: The Item class implements the STAC Item, and has several convenience functions such as retrieving the collection and getting assets by common band name (if using the EO Extension).
  • Items: The Items class does not correspond to a STAC object. It is a FeatureCollection of Items, possibly from multiple collections. It is used to save and load sets of Items as a FeatureCollection file, along with convenience functions for extracting info across the set.

Intake has some natural analogs that we need to map these classes to:

  • Catalog: Manages a hierarchy of data sources as a collective unit. A catalog is a set of available data sources for an individual entity (remote server, local file, or a local directory of files). This can be expanded to include a collection of subcatalogs, which are then managed as a single unit.
  • CatalogEntry: A single item appearing in a catalog
  • DataSource: An object which can produce data

I think it would be good to discuss, at a high level, how to implement these mappings and to produce some documentation on the subject.
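For instance, one possible shape of the Catalog-side mapping (a sketch with assumed names, not the current implementation): a satstac.Catalog becomes an intake Catalog whose children are registered as entries.

import satstac
from intake.catalog import Catalog
from intake.catalog.local import LocalCatalogEntry

class StacCatalog(Catalog):
    """Sketch: wrap a satstac.Catalog as an intake Catalog."""
    name = 'stac_catalog'

    def __init__(self, url, **kwargs):
        self._stac_obj = satstac.Catalog.open(url)
        super().__init__(**kwargs)

    def _load(self):
        # Each STAC child becomes a CatalogEntry; sub-catalogs are catalogs
        # themselves, and Items would eventually map to DataSources.
        for child in self._stac_obj.children():
            self._entries[child.id] = LocalCatalogEntry(
                name=child.id,
                description=child.id,
                driver='stac_catalog',
                args={'url': child.filename},
            )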

cc @martindurant, @matthewhanson, @scottyhq, @jonahjoughin

ValueError: Can't clean for JSON for intake.catalog.local.LocalCatalogEntry

Running into an error when outputting an intake.catalog.local.LocalCatalogEntry in a Jupyter notebook. print(entry) works, but display(entry) raises a ValueError: Can't clean for JSON.

pinging @jhamman and @martindurant for help sorting this one out. I think it's likely a simple fix.

import intake 
import intake_stac
print(intake.__version__) #0.5.3
print(intake_stac.__version__) #0.2.1

cat = intake.open_stac_catalog('https://storage.googleapis.com/pdd-stac/disasters/catalog.json')
list(cat)
entry = cat['Houston-East-20170831-103f-100d-0f4f-RGB']
type(entry) #intake.catalog.local.LocalCatalogEntry
print(entry)
"""
name: Houston-East-20170831-103f-100d-0f4f-RGB
container: catalog
plugin: ['stac_item']
description: 
direct_access: True
user_parameters: []
metadata: 
args: 
  stac_obj: Houston-East-20170831-103f-100d-0f4f-RGB
"""
display(entry)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
    916             method = get_real_method(obj, self.print_method)
    917             if method is not None:
--> 918                 method()
    919                 return True
    920 

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake/catalog/entry.py in _ipython_display_(self)
    113         }, metadata={
    114             'application/json': {'root': contents["name"]}
--> 115         }, raw=True)
    116 
    117     def __getattr__(self, attr):

/srv/conda/envs/notebook/lib/python3.7/site-packages/IPython/core/display.py in display(include, exclude, metadata, transient, display_id, *objs, **kwargs)
    309     for obj in objs:
    310         if raw:
--> 311             publish_display_data(data=obj, metadata=metadata, **kwargs)
    312         else:
    313             format_dict, md_dict = format(obj, include=include, exclude=exclude)

/srv/conda/envs/notebook/lib/python3.7/site-packages/IPython/core/display.py in publish_display_data(data, metadata, source, transient, **kwargs)
    120         data=data,
    121         metadata=metadata,
--> 122         **kwargs
    123     )
    124 

/srv/conda/envs/notebook/lib/python3.7/site-packages/ipykernel/zmqshell.py in publish(self, data, metadata, source, transient, update)
    127         # hooks before potentially sending.
    128         msg = self.session.msg(
--> 129             msg_type, json_clean(content),
    130             parent=self.parent_header
    131         )

/srv/conda/envs/notebook/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
    189         out = {}
    190         for k,v in iteritems(obj):
--> 191             out[unicode_type(k)] = json_clean(v)
    192         return out
    193     if isinstance(obj, datetime):

/srv/conda/envs/notebook/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
    189         out = {}
    190         for k,v in iteritems(obj):
--> 191             out[unicode_type(k)] = json_clean(v)
    192         return out
    193     if isinstance(obj, datetime):

/srv/conda/envs/notebook/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
    189         out = {}
    190         for k,v in iteritems(obj):
--> 191             out[unicode_type(k)] = json_clean(v)
    192         return out
    193     if isinstance(obj, datetime):

/srv/conda/envs/notebook/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
    189         out = {}
    190         for k,v in iteritems(obj):
--> 191             out[unicode_type(k)] = json_clean(v)
    192         return out
    193     if isinstance(obj, datetime):

/srv/conda/envs/notebook/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
    195 
    196     # we don't understand it, it's probably an unserializable object
--> 197     raise ValueError("Can't clean for JSON: %r" % obj)

ValueError: Can't clean for JSON: Houston-East-20170831-103f-100d-0f4f-RGB

Use markdown instead of rst for basic text files

Markdown is designed to be readable without being compiled, while RST is more focused on the compiled output.

Top-level text files, such as the changelog, README, and contributing info, are often not rendered and will be read as plain text, and so should be in Markdown format.

The documentation under docs should remain as RST since it is primarily used to render the documentation output.

Move tests outside of package directory

Move the intake_stac/tests directory to a top-level tests/ directory to reflect the typical way of providing tests. This way they are not distributed along with the intake-stac package.

rasterio driver - authentication

I really like intake-stac and together with xpublish it becomes a very powerful solution :-)

  1. Open Assets

The example shown in the notebook using Landsat data on AWS is great. The assets are freely available, so rasterio simply accesses those resources using the href.

  2. Assets requiring authentication

I've got a use-case where the assets require a simple authentication mechanism. In such cases, I usually update the href for rasterio with:

from urllib.parse import urlparse

def get_vsi_url(enclosure, username, api_key):
    # Embed the credentials in a GDAL /vsicurl/ URL:
    # scheme://username:api_key@netloc/api<path>
    parsed_url = urlparse(enclosure)

    url = '/vsicurl/%s://%s:%s@%s/api%s' % (parsed_url.scheme,
                                            username,
                                            api_key,
                                            parsed_url.netloc,
                                            parsed_url.path)

    return url
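For context, a hypothetical call (URL and credentials are placeholders):

url = get_vsi_url('https://catalog.example.com/downloads/scene.tif',
                  'my-username', 'my-api-key')
# '/vsicurl/https://my-username:my-api-key@catalog.example.com/api/downloads/scene.tif'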

Would such an approach be possible with intake-stac?

StacItem / FeatureCollection --> ?

STAC Items are described as geojson FeatureCollections (here's an example). I think it should be possible to generalize the loading of assets in a single FeatureCollection into a single meta object.

    "assets": {
        "B01": {
            "href": "https://sentinel-s2-l1c.s3.amazonaws.com/tiles/53/M/NQ/2018/10/11/0/B01.jp2",
            "type": "image/jp2",
            "eo:bands": [
                0
            ],
            "title": "Band 1 (coastal)"
        },
        "B02": {
            "href": "https://sentinel-s2-l1c.s3.amazonaws.com/tiles/53/M/NQ/2018/10/11/0/B02.jp2",
            "type": "image/jp2",
            "eo:bands": [
                1
            ],
            "title": "Band 2 (blue)"
        },
        "B03": {
            "href": "https://sentinel-s2-l1c.s3.amazonaws.com/tiles/53/M/NQ/2018/10/11/0/B03.jp2",
            "type": "image/jp2",
            "eo:bands": [
                2
            ],
            "title": "Band 3 (green)"
        },
    ...
    }

Perhaps this object is an xarray.DataArray with a 'band' dimension in the example linked above?

Of course there are challenges with "extra" assets to consider. For example, the catalog above also includes the following assets:

        "thumbnail": {
            "href": "https://roda.sentinel-hub.com/sentinel-s2-l1c/tiles/53/M/NQ/2018/10/11/0/preview.jpg"
        },
        "tki": {
            "href": "https://sentinel-s2-l1c.s3.amazonaws.com/tiles/53/M/NQ/2018/10/11/0/TKI.jp2",
            "description": "True Color Image"
        },
        "metadata": {
            "href": "https://roda.sentinel-hub.com/sentinel-s2-l1c/tiles/53/M/NQ/2018/10/11/0/metadata.xml"
        }

Perhaps this is what the eo extension is for, and we could use that metadata as a tool for determining which assets can be combined?
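For example, a minimal sketch of that idea, assuming assets is the dictionary shown above: only assets that declare eo:bands are treated as combinable.

# 'thumbnail', 'tki' and 'metadata' above would be filtered out here.
band_assets = {key: asset for key, asset in assets.items()
               if 'eo:bands' in asset}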

Xarray attributes (CRS, transform) missing

I have issues opening some catalogs on https://earth-search.aws.element84.com/v0.
When I try to perform an operation with rioxarray (e.g., a clip operation) I get the error:

MissingCRS: CRS not found. Please set the CRS with 'rio.write_crs()'

That's because the transform, crs and res attributes of the xarray returned from the to_dask() method have not been set. I noticed this only happens for catalogues from the beginning of 2018.
For this reason, rioxarray can't georeference the data properly.
Why have those attributes not been set?

A snippet of code to reproduce the error:

from shapely import wkt
from shapely.geometry import mapping
import intake
import intake_stac
import rioxarray.merge
import satsearch

wkt_geometry = "Polygon ((9.59935220198788208 44.99328353692637705, 9.82295202809843637 44.99189113160298348, 9.82469113104509972 45.11301999240544092, 9.60061886478441018 45.11441825799879268, 9.59935220198788208 44.99328353692637705))"

start_date = "2018-01-01"
end_date = "2018-01-03"

results = satsearch.Search.search(
    url="https://earth-search.aws.element84.com/v0",
    intersects=mapping(wkt.loads(wkt_geometry)),
    datetime="%s/%s" % (start_date, end_date),
    query={"eo:cloud_cover": {"lte": 100}},
    collections=["sentinel-s2-l2a"],
)

items = results.items()

asset_key = 'SCL'  # scene classification asset
_wkt_bounds = wkt.loads(wkt_geometry)
nodataval = 0
xarray_all_patches = []

for item in items:
    print("Processing item {} for asset {}".format(item, asset_key))
    single_item = intake.open_stac_item(item)  # stac_item_collection[str(item_by_date)]

    asset_xarray = single_item[asset_key](chunks=dict(band=1, y=2048, x=2048)).to_dask()
    asset_xarray.rio.set_nodata(nodataval)
    asset_clipped = asset_xarray.rio.clip([mapping(_wkt_bounds)], crs=4326, all_touched=True, drop=True)
    xarray_all_patches.append(asset_clipped)

mosaic = rioxarray.merge.merge_arrays(xarray_all_patches, precision=50, nodata=nodataval)

The clip operation with rioxarray is the one that fails (asset_clipped = asset_xarray.rio.clip([mapping(_wkt_bounds)], crs=4326, all_touched=True, drop=True)).
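Until the root cause is sorted out, one possible workaround sketch is to set the CRS manually before clipping (the EPSG code below is a placeholder; the real one would come from the item's projection metadata):

# Hypothetical workaround: write the CRS onto the array ourselves before
# any rioxarray operation that needs georeferencing.
asset_xarray = asset_xarray.rio.write_crs('EPSG:32632')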

Update tutorial with current catalogs

The documentation build is broken right now because the example STAC catalogs we are using in the tutorial are stale. It would be great if someone could update the tutorial with a new set of catalogs.

Can't save catalog yaml representation

intake version 0.5.4
intake-stac version 0.2.2

Currently, opening a stac catalog (JSON) gives us an intake catalog (which can be represented as YAML). We inherit intake's save method, but that method fails with the following traceback:

cat = intake.open_stac_catalog('https://storage.googleapis.com/pdd-stac/disasters/catalog.json')
cat.save('pdd-stac.yaml')
---------------------------------------------------------------------------
ConstructorError                          Traceback (most recent call last)
<ipython-input-97-91d0ff8d57f2> in <module>
----> 1 cat.save('pdd-stac.yaml')
      2 
      3 ''' 
      4 ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:python/object:satstac.item.Item'
      5   in "<unicode string>", line 4, column 17:

~/miniconda3/envs/intake-stac/lib/python3.8/site-packages/intake/catalog/base.py in save(self, url, storage_options)
    290         from fsspec import open_files
    291         with open_files([url], **(storage_options or {}), mode='wt')[0] as f:
--> 292             f.write(self.serialize())
    293 
    294     @reload_on_change

~/Documents/GitHub/intake-stac/intake_stac/catalog.py in serialize(self)
     74         output = {"metadata": self.metadata, "sources": {}}
     75         for key, entry in self.items():
---> 76             output["sources"][key] = yaml.safe_load(entry.yaml())["sources"]
     77         return yaml.dump(output)
     78 

~/miniconda3/envs/intake-stac/lib/python3.8/site-packages/yaml/__init__.py in safe_load(stream)
    160     to be safe for untrusted input.
    161     """
--> 162     return load(stream, SafeLoader)
    163 
    164 def safe_load_all(stream):

~/miniconda3/envs/intake-stac/lib/python3.8/site-packages/yaml/__init__.py in load(stream, Loader)
    112     loader = Loader(stream)
    113     try:
--> 114         return loader.get_single_data()
    115     finally:
    116         loader.dispose()

~/miniconda3/envs/intake-stac/lib/python3.8/site-packages/yaml/constructor.py in get_single_data(self)
     41         node = self.get_single_node()
     42         if node is not None:
---> 43             return self.construct_document(node)
     44         return None
     45 

~/miniconda3/envs/intake-stac/lib/python3.8/site-packages/yaml/constructor.py in construct_document(self, node)
     50             self.state_generators = []
     51             for generator in state_generators:
---> 52                 for dummy in generator:
     53                     pass
     54         self.constructed_objects = {}

~/miniconda3/envs/intake-stac/lib/python3.8/site-packages/yaml/constructor.py in construct_yaml_map(self, node)
    403         data = {}
    404         yield data
--> 405         value = self.construct_mapping(node)
    406         data.update(value)
    407 

~/miniconda3/envs/intake-stac/lib/python3.8/site-packages/yaml/constructor.py in construct_mapping(self, node, deep)
    208         if isinstance(node, MappingNode):
    209             self.flatten_mapping(node)
--> 210         return super().construct_mapping(node, deep=deep)
    211 
    212     def construct_yaml_null(self, node):

~/miniconda3/envs/intake-stac/lib/python3.8/site-packages/yaml/constructor.py in construct_mapping(self, node, deep)
    133                 raise ConstructorError("while constructing a mapping", node.start_mark,
    134                         "found unhashable key", key_node.start_mark)
--> 135             value = self.construct_object(value_node, deep=deep)
    136             mapping[key] = value
    137         return mapping

~/miniconda3/envs/intake-stac/lib/python3.8/site-packages/yaml/constructor.py in construct_object(self, node, deep)
     90                     constructor = self.__class__.construct_mapping
     91         if tag_suffix is None:
---> 92             data = constructor(self, node)
     93         else:
     94             data = constructor(self, tag_suffix, node)

~/miniconda3/envs/intake-stac/lib/python3.8/site-packages/yaml/constructor.py in construct_undefined(self, node)
    417 
    418     def construct_undefined(self, node):
--> 419         raise ConstructorError(None, None,
    420                 "could not determine a constructor for the tag %r" % node.tag,
    421                 node.start_mark)

ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:python/object:satstac.item.Item'
  in "<unicode string>", line 4, column 17:
          stac_obj: !!python/object:satstac.item.Item

You can use print(cat['Houston-East-20170831-103f-100d-0f4f-RGB'].yaml()) to see where these exclamation-point tags come from:

sources:
  Houston-East-20170831-103f-100d-0f4f-RGB:
    args:
      stac_obj: !!python/object:satstac.item.Item
        _assets_by_common_name: null
        _collection: null
        _data:
          assets:
            mosaic:
              href: https://storage.googleapis.com/pdd-stac/disasters/hurricane-harvey/0831/Houston-East-20170831-103f-100d-0f4f-3-band.tif
              title: 3 Band RGB Mosaic
              type: image/vnd.stac.geotiff; cloud-optimized=true
            thumbnail:
              href: https://storage.googleapis.com/pdd-stac/disasters/hurricane-harvey/0831/Houston-East-20170831-103f-100d-0f4f-3-band.png
              title: Thumbnail
              type: image/png
          bbox:
          - -95.73737276800716
          - 29.561332400220497
          - -95.05332428370095
          - 30.157560439570304
          geometry:

Move dask chunk specification to to_dask() method

This came up in #29 (comment)

Currently in intake-stac (0.2.2), to set the chunk size you (somewhat unintuitively) invoke the entry:

# syntax in intake-stac 0.2.2
da = entry.B1(chunks=dict(band=1,x=2048,y=2048)).to_dask()

Thoughts on refactoring to a more intuitive syntax (maybe just incorporating it into the read() method, to match the way intake works with CSV files to return a pandas DataFrame)?

# returns xarray object with embedded dask arrays
da = entry.B1.read(chunks=dict(band=1,x=2048,y=2048)) 

Intake GUI browser + predefined plots

One really powerful feature would be to use the built-in intake GUI to browse the catalog and create plots (maybe the lowest-res overview image for each band?). But currently there are no predefined plots declared.

When running the following code and selecting a band under "Sources" a user sees "No predefined plots found - declare these in the catalog"

# `results` come from satsearch
catalog = intake.open_stac_item_collection(results.items())
intake.gui.add(catalog)
intake.gui

You can run the code-block above with this example notebook:
https://github.com/pangeo-data/pangeo-tutorial/blob/agu2019/notebooks/amazon-web-services/landsat8.ipynb

Proposed change: have stack_bands() return xarray.Dataset with common band names

@jsignell did some awesome work putting together the stack_bands() in #19

I'd like to propose a few modifications (building off examples/planet_disaster_data.ipynb) and can follow up with a pull request:

  1. Return common_name labels if they are used to select bands (notice 'B4' and 'B5' in the DataArray below):
    landsat['LC80110312014230LGN00'].stack_bands(['red', 'nir']).to_dask()
<xarray.DataArray (band: 2, y: 7941, x: 7811)>
dask.array<concatenate, shape=(2, 7941, 7811), dtype=uint16, chunksize=(1, 7941, 7811), chunktype=numpy.ndarray>
Coordinates:
  * y        (y) float64 4.742e+06 4.742e+06 4.742e+06 ... 4.504e+06 4.504e+06
  * x        (x) float64 3.183e+05 3.183e+05 3.184e+05 ... 5.526e+05 5.526e+05
  * band     (band) <U2 'B4' 'B5'
  2. Return an xarray Dataset instead of a DataArray by default (da.to_dataset(dim='band')), so we end up with this:
<xarray.Dataset>
Dimensions:  (x: 7811, y: 7941)
Coordinates:
  * y        (y) float64 4.742e+06 4.742e+06 4.742e+06 ... 4.504e+06 4.504e+06
  * x        (x) float64 3.183e+05 3.183e+05 3.184e+05 ... 5.526e+05 5.526e+05
Data variables:
    red       (y, x) uint16 dask.array<chunksize=(7941, 7811), meta=np.ndarray>
    nir       (y, x) uint16 dask.array<chunksize=(7941, 7811), meta=np.ndarray>

Can I use opendap?

Hi! I see that the STAC catalog asset 'type' determines the intake driver, but I don't see opendap in the list there, while it is listed separately as a driver for intake-xarray. Is there a way I can specify "opendap" as the driver, or some other way around this? Thank you!
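One untested idea, mirroring the drivers-dictionary workaround shown in the application/xml issue further down: map the relevant asset media type to intake-xarray's opendap driver (the media-type key here is a guess):

import intake_stac

# Hypothetical: route a guessed OPeNDAP media type to the 'opendap' driver
# registered by intake-xarray.
intake_stac.catalog.drivers['application/x-ogc-dods'] = 'opendap'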

intake-stac 0.3 release plan

I propose making a new release that is compatible with more recent STAC versions. Happy to follow this up with a PR, but want some versioning feedback first. cc @jhamman @matthewhanson @andersy005

Currently the sat-stac dependency has a minimum pin, and we've only worked with STAC v0.6 catalogs:

sat-stac>=0.1.3

| sat-stac  | STAC  |
| --------- | ----- |
| 0.[1,2].x | 0.6.x |

There have been a lot of changes in recent STAC versions leading up to the 1.0 release. I envision most catalogs will be >1.0 going forward, so backwards compatibility isn't a great concern. So I suggest that for an intake-stac 0.3 release we pin to sat-stac==0.4.*.

The table below shows the corresponding versions between sat-stac and STAC:

| sat-stac | STAC  |
| -------- | ----  |
| 0.1.x    | 0.6.x - 0.7.x |
| 0.2.x    | 0.6.x - 0.7.x |
| 0.3.x    | 0.6.x - 0.9.x |
| 0.4.x    | 0.6.x - 1.0.0-beta.1 |

We could use this STAC 1.0.0-beta.1 endpoint for testing, https://earth-search.aws.element84.com/v0/collections, since it is the default in the sat-search library.

Iterating through catalog items is awkward and slow

A common need is getting URLs from item assets within a catalog, which involves iterating over hundreds of items. Here is a quick example:

import satsearch
import intake

bbox = [35.48, -3.24, 35.58, -3.14] # (min lon, min lat, max lon, max lat)
dates = '2010-07-01/2020-08-15'

URL='https://earth-search.aws.element84.com/v0'
results = satsearch.Search.search(url=URL,
                                  collections=['sentinel-s2-l2a-cogs'], # note collection='sentinel-s2-l2a-cogs' doesn't work
                                  datetime=dates,
                                  bbox=bbox,    
                                  sortby=['+properties.datetime'])
print('%s items' % results.found())
itemCollection = results.items()
#489 items

Initializing the catalog is fast!

%%time 
catalog = intake.open_stac_item_collection(itemCollection)
#CPU times: user 3.69 ms, sys: 0 ns, total: 3.69 ms
#Wall time: 3.7 ms

Iterating through items is slow, and I'm a bit confused by the syntax too. I find myself wanting to use an integer index to get the first item in a catalog (first_item = catalog[0]), or to simplify the code block below to hrefs = [item.band.metadata.href for item in catalog], but currently iterating through a catalog returns item IDs as strings.

%%time 
band = 'visual'
hrefs = [catalog[item][band].metadata['href'] for item in catalog]
#CPU times: user 4.6 s, sys: 1.23 ms, total: 4.6 s
#Wall time: 4.61 s

As for speed, it only takes microseconds to iterate through the underlying JSON via sat-stac:

%%time 
band = 'visual'
hrefs = [i.assets[band]['href'] for i in catalog._stac_obj]
#CPU times: user 684 µs, sys: 0 ns, total: 684 µs
#Wall time: 689 µs

@martindurant any suggestions here? I'm a bit perplexed about where the code lives to handle list(catalog) or for item in catalog: ...

Odd NoneType error on catalog = intake.open_stac_item_collection(items)

On the AWS Sentinel example notebook:

print('%s items' % results.found())
items = results.items()
items.save('my-s2-l2a-cogs.json')

18 items

type(items)

satstac.itemcollection.ItemCollection

len(items)
18

catalog = intake.open_stac_item_collection(items)
TypeError                                 Traceback (most recent call last)
<ipython-input-9-cd702f8449e9> in <module>
----> 1 intake.open_stac_item_collection(items)

TypeError: 'NoneType' object is not callable

Driver for `application/xml`

Thanks a lot for this fantastic plugin, it makes it really easy to access data from STAC catalogs!

I am currently working with the public Sentinel-2 COGs collection on AWS, and I have noticed an issue when trying to access the XML metadata file with Intake/Intake-STAC. The issue seems to originate from the fact that this asset is labeled as application/xml, but only text/xml is present in the dictionary of drivers in Intake-STAC. Do you think it makes sense to add 'application/xml': 'textfiles' to the dictionary of drivers? If so, I would be happy to contribute a PR.

Thanks so much in advance!

Here is a minimum working example:

import intake

item_url = 'https://earth-search.aws.element84.com/v0/collections/sentinel-s2-l2a-cogs/items/S2A_31UFU_20210211_0_L2A'
item = intake.open_stac_item(item_url)
item['metadata'].read()

Traceback:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/opt/miniconda3/envs/test/lib/python3.7/site-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
    198             try:
--> 199                 file = self._cache[self._key]
    200             except KeyError:

/opt/miniconda3/envs/test/lib/python3.7/site-packages/xarray/backends/lru_cache.py in __getitem__(self, key)
     52         with self._lock:
---> 53             value = self._cache[key]
     54             self._cache.move_to_end(key)

KeyError: [<function open at 0x7fa073ff4f80>, ('https://roda.sentinel-hub.com/sentinel-s2-l2a/tiles/31/U/FU/2021/2/11/0/metadata.xml',), 'r', ()]

During handling of the above exception, another exception occurred:

CPLE_OpenFailedError                      Traceback (most recent call last)
rasterio/_base.pyx in rasterio._base.DatasetBase.__init__()

rasterio/_shim.pyx in rasterio._shim.open_dataset()

rasterio/_err.pyx in rasterio._err.exc_wrap_pointer()

CPLE_OpenFailedError: '/vsicurl/https://roda.sentinel-hub.com/sentinel-s2-l2a/tiles/31/U/FU/2021/2/11/0/metadata.xml' not recognized as a supported file format.

During handling of the above exception, another exception occurred:

RasterioIOError                           Traceback (most recent call last)
<ipython-input-9-760ed87ebc5b> in <module>
----> 1 item['metadata'].read()

/opt/miniconda3/envs/test/lib/python3.7/site-packages/intake_xarray/base.py in read(self)
     37     def read(self):
     38         """Return a version of the xarray with all the data in memory"""
---> 39         self._load_metadata()
     40         return self._ds.load()
     41 

/opt/miniconda3/envs/test/lib/python3.7/site-packages/intake/source/base.py in _load_metadata(self)
    234         """load metadata only if needed"""
    235         if self._schema is None:
--> 236             self._schema = self._get_schema()
    237             self.dtype = self._schema.dtype
    238             self.shape = self._schema.shape

/opt/miniconda3/envs/test/lib/python3.7/site-packages/intake_xarray/raster.py in _get_schema(self)
    100 
    101         if self._ds is None:
--> 102             self._open_dataset()
    103 
    104             ds2 = xr.Dataset({'raster': self._ds})

/opt/miniconda3/envs/test/lib/python3.7/site-packages/intake_xarray/raster.py in _open_dataset(self)
     89         else:
     90             self._ds = xr.open_rasterio(files, chunks=self.chunks,
---> 91                                         **self._kwargs)
     92 
     93     def _get_schema(self):

/opt/miniconda3/envs/test/lib/python3.7/site-packages/xarray/backends/rasterio_.py in open_rasterio(filename, parse_coordinates, chunks, cache, lock)
    274 
    275     manager = CachingFileManager(rasterio.open, filename, lock=lock, mode="r")
--> 276     riods = manager.acquire()
    277     if vrt_params is not None:
    278         riods = WarpedVRT(riods, **vrt_params)

/opt/miniconda3/envs/test/lib/python3.7/site-packages/xarray/backends/file_manager.py in acquire(self, needs_lock)
    179             An open file object, as returned by ``opener(*args, **kwargs)``.
    180         """
--> 181         file, _ = self._acquire_with_cache_info(needs_lock)
    182         return file
    183 

/opt/miniconda3/envs/test/lib/python3.7/site-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
    203                     kwargs = kwargs.copy()
    204                     kwargs["mode"] = self._mode
--> 205                 file = self._opener(*self._args, **kwargs)
    206                 if self._mode == "w":
    207                     # ensure file doesn't get overriden when opened again

/opt/miniconda3/envs/test/lib/python3.7/site-packages/rasterio/env.py in wrapper(*args, **kwds)
    433 
    434         with env_ctor(session=session):
--> 435             return f(*args, **kwds)
    436 
    437     return wrapper

/opt/miniconda3/envs/test/lib/python3.7/site-packages/rasterio/__init__.py in open(fp, mode, driver, width, height, count, crs, transform, dtype, nodata, sharing, **kwargs)
    218         # None.
    219         if mode == 'r':
--> 220             s = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
    221         elif mode == "r+":
    222             s = get_writer_for_path(path, driver=driver)(

rasterio/_base.pyx in rasterio._base.DatasetBase.__init__()

RasterioIOError: '/vsicurl/https://roda.sentinel-hub.com/sentinel-s2-l2a/tiles/31/U/FU/2021/2/11/0/metadata.xml' not recognized as a supported file format.

The output of item['metadata'].metadata:

{'title': 'Original XML metadata',
 'type': 'application/xml',
 'roles': ['metadata'],
 'href': 'https://roda.sentinel-hub.com/sentinel-s2-l2a/tiles/31/U/FU/2021/2/11/0/metadata.xml',
 'catalog_dir': ''}

"Hacking" the drivers' dictionary allows to correctly retrieve the file:

import intake_stac
intake_stac.catalog.drivers['application/xml'] = 'textfiles'

import intake
item = intake.open_stac_item(item_url)
item['metadata'].read()

I am using Intake version 0.6.2 and Intake-STAC version 0.3.0.

pystac 1.0 compatibility

PySTAC is preparing for 1.0, which includes some breaking changes. Running locally, I see some failures.

========================================================================================================= ERRORS =========================================================================================================
____________________________________________________________________________________ ERROR at setup of test_cat_from_item_collection _____________________________________________________________________________________
E   pystac.errors.STACTypeError: JSON does not represent a STAC object.
________________________________________________________________________________________ ERROR at setup of test_cat_to_geopandas _________________________________________________________________________________________
E   pystac.errors.STACTypeError: JSON does not represent a STAC object.
_____________________________________________________________________________ ERROR at setup of test_cat_to_geopandas_crs[IGNF:ETRS89UTM28] ______________________________________________________________________________
E   pystac.errors.STACTypeError: JSON does not represent a STAC object.
________________________________________________________________________________ ERROR at setup of test_cat_to_geopandas_crs[epsg:26909] _________________________________________________________________________________
E   pystac.errors.STACTypeError: JSON does not represent a STAC object.
____________________________________________________________________________________ ERROR at setup of test_cat_to_missing_geopandas _____________________________________________________________________________________
E   pystac.errors.STACTypeError: JSON does not represent a STAC object.
_____________________________________________________________________________________ ERROR at setup of test_load_satsearch_results ______________________________________________________________________________________
E   pystac.errors.STACTypeError: JSON does not represent a STAC object.
======================================================================================================== FAILURES ========================================================================================================
/home/taugspurger/src/stac-extensions/intake-stac/intake_stac/catalog.py:302: ValueError: STAC Item must implement "eo" extension to use this method
/home/taugspurger/src/stac-extensions/intake-stac/intake_stac/catalog.py:302: ValueError: STAC Item must implement "eo" extension to use this method
/home/taugspurger/src/stac-extensions/intake-stac/intake_stac/catalog.py:302: ValueError: STAC Item must implement "eo" extension to use this method
/home/taugspurger/src/stac-extensions/intake-stac/intake_stac/tests/test_catalog.py:167: AssertionError: Regex pattern 'ANG not found in list of eo:bands in collection' does not match 'STAC Item must implement "eo" extension to use this method'.
/home/taugspurger/src/stac-extensions/intake-stac/intake_stac/tests/test_catalog.py:174: AssertionError: Regex pattern "'B8', 'B9', 'blue', 'cirrus'" does not match 'STAC Item must implement "eo" extension to use this method'.
/home/taugspurger/src/stac-extensions/intake-stac/intake_stac/catalog.py:302: ValueError: STAC Item must implement "eo" extension to use this method
================================================================================================ short test summary info =================================================================================================
FAILED intake_stac/tests/test_catalog.py::test_cat_item_stacking - ValueError: STAC Item must implement "eo" extension to use this method
FAILED intake_stac/tests/test_catalog.py::test_cat_item_stacking_using_common_name - ValueError: STAC Item must implement "eo" extension to use this method
FAILED intake_stac/tests/test_catalog.py::test_cat_item_stacking_path_as_pattern - ValueError: STAC Item must implement "eo" extension to use this method
FAILED intake_stac/tests/test_catalog.py::test_cat_item_stacking_dims_of_different_type_raises_error - AssertionError: Regex pattern 'ANG not found in list of eo:bands in collection' does not match 'STAC Item must i...
FAILED intake_stac/tests/test_catalog.py::test_cat_item_stacking_dims_with_nonexistent_band_raises_error - AssertionError: Regex pattern "'B8', 'B9', 'blue', 'cirrus'" does not match 'STAC Item must implement "eo" e...
FAILED intake_stac/tests/test_catalog.py::test_cat_item_stacking_dims_of_different_size_regrids - ValueError: STAC Item must implement "eo" extension to use this method
ERROR intake_stac/tests/test_catalog.py::test_cat_from_item_collection - pystac.errors.STACTypeError: JSON does not represent a STAC object.
ERROR intake_stac/tests/test_catalog.py::test_cat_to_geopandas - pystac.errors.STACTypeError: JSON does not represent a STAC object.
ERROR intake_stac/tests/test_catalog.py::test_cat_to_geopandas_crs[IGNF:ETRS89UTM28] - pystac.errors.STACTypeError: JSON does not represent a STAC object.
ERROR intake_stac/tests/test_catalog.py::test_cat_to_geopandas_crs[epsg:26909] - pystac.errors.STACTypeError: JSON does not represent a STAC object.
ERROR intake_stac/tests/test_catalog.py::test_cat_to_missing_geopandas - pystac.errors.STACTypeError: JSON does not represent a STAC object.
ERROR intake_stac/tests/test_catalog.py::test_load_satsearch_results - pystac.errors.STACTypeError: JSON does not represent a STAC object.
========================================================================================= 6 failed, 13 passed, 6 errors in 9.74s =========================================================================================

I'll look into these over the next couple days.

Consider changing approach to versioning

I find the versioning code to be a bit confusing, and it might be a little overkill. But maybe this is required because intake-stac is a plugin for intake?

If not, then I'd say we should just have a single version.py file to specify the version.
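For instance, something as simple as this sketch (the version number is illustrative):

# intake_stac/version.py -- single source of truth for the package version
__version__ = '0.3.0'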

Adding support for Zarr datasets

As part of Pangeo's general integration of STAC, we currently have a STAC Catalog roughly mirroring Pangeo's Intake catalogs, as well as support for rendering Zarr metadata with STAC Browser. Another major step forward with this integration would be adding support to load Zarr datasets through Intake-STAC.

What steps need to be taken to make something like this happen? At the moment, Zarr datasets are represented in STAC as Collections with a single asset - a link to the consolidated metadata file of the Zarr dataset, with a role of zarr-consolidated-metadata; an example of this here:

{
  "stac_version": "1.0.0-beta.2",
  "stac_extensions": [
    "collection-assets"
  ],
  "id": "sea_surface_height",
  "title": "sea-surface altimetry data from The Copernicus Marine Environment",
  "description": "",
  "keywords": [],
  "extent": {
    "spatial": {
      "bbox": [
        []
      ]
    },
    "temporal": {
      "interval": [
        []
      ]
    }
  },
  ...
  "assets": {
    "zmetadata": {
      "href": "https://storage.googleapis.com/pangeo-cmems-duacs/.zmetadata",
      "description": "Consolidated metadata file for Zarr store",
      "type": "application/json",
      "roles": [
        "metadata",
        "zarr-consolidated-metadata"
      ]
    }
  }
}
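For reference, loading the asset above could ultimately map to something like this sketch (plain fsspec/xarray, not intake-stac API; the store URL is the zmetadata href with the '.zmetadata' suffix dropped):

import fsspec
import xarray as xr

# Map the remote Zarr store and open it via its consolidated metadata.
mapper = fsspec.get_mapper('https://storage.googleapis.com/pangeo-cmems-duacs')
ds = xr.open_zarr(mapper, consolidated=True)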

Some random obstacles that come to mind:

  • How can storage options (such as requester pays status) be specified for an individual dataset?
  • Should we only be focusing on consolidated Zarr datasets, or generalize Zarr representation in STAC to encompass non-consolidated datasets as well?
  • Spatial/temporal extent can probably be decided by looking at the Zarr metadata, but if this is impossible should it just default to the widest possible ranges?

Fix stack_bands

There have been some upstream changes in the eo extension that require us to rework the stack_bands method. I've marked the tests for this method upstream in #54.

Basically, the band info has moved to the asset properties:

  "assets": {
    ...
    "B1": {
      "href": "https://landsat-pds.s3.amazonaws.com/c1/L8/152/038/LC08_L1TP_152038_20200611_20200611_01_RT/LC08_L1TP_152038_20200611_20200611_01_RT_B1.TIF",
      "type": "image/tiff; application=geotiff",
      "eo:bands": [
        {
          "name": "B1",
          "common_name": "coastal",
          "center_wavelength": 0.44,
          "full_width_half_max": 0.02
        }
      ],
      "title": "Band 1 (coastal)"
    },

How to integrate intake-stac and geopandas

Converting STAC catalogs (JSON) to intake catalogs is great for facilitating browsing images with intake.gui and loading remote data directly into xarray objects. When loading data returned from a dynamic API like sat-search (radiantearth/stac-spec#691), we could also take advantage of Geopandas for querying the catalog and visualizing metadata, but currently (0.2.2) the integration is a bit awkward:

properties =  ["eo:row=027",
               "eo:column=047",
               "landsat:tier=T1"] 
results = satsearch.Search.search(collection='landsat-8-l1', 
                        sort=['<datetime'], #earliest scene first
                        property=properties)
items = results.items()

It would be great to easily load results.items() into a GeoDataFrame directly:

results.items().geojson() # returns a python dictionary
#gf = gpd.GeoDataFrame(results.items().geojson()) #ValueError: arrays must all be same length

This works and provides a very convenient tabular HTML display:

items.save('subset.geojson')
gf = gpd.read_file('subset.geojson')
display(gf)

(screenshot: GeoDataFrame rendered as an HTML table)

We could consider adding the same geopandas HTML view directly to the catalog and LocalCatalogEntry objects:

# prints <Intake catalog: <class 'satstac.itemcollection.ItemCollection'>>
cat = intake.open_stac_item_collection(results.items())
display(cat)

# also HTML table?
display(cat[sceneid].metadata)

(screenshot: entry metadata rendered as an HTML table)

Enabling geopandas-like methods on an intake catalog wrapping a satstac.itemcollection.ItemCollection would be very useful! Maybe it's best to just add a cat.to_geopandas() method to enable things like the following?:

gf = cat.to_geopandas()
gf.iloc[0]
gf[gf["eo:cloud_cover"] < 50]
#gf.query("eo:cloud_cover < 50") # Geopandas doesn't like ":" in column names
gf.rename(columns={"eo:cloud_cover": "cloud_cover"}).query("cloud_cover < 50")
gf.plot() #plots footprint polygons of STAC items
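A minimal sketch of what to_geopandas() could look like, assuming the catalog wraps a satstac.ItemCollection exposing .geojson() as used above:

import geopandas as gpd

def to_geopandas(cat):
    # Build a GeoDataFrame straight from the item collection's GeoJSON;
    # STAC geometries are in WGS84.
    features = cat._stac_obj.geojson()['features']
    return gpd.GeoDataFrame.from_features(features, crs='EPSG:4326')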

Sub-catalogs should wrap LocalCatalogEntry

In talking with @martindurant today, we can implement lazy sub-catalogs by wrapping our individual Catalog types as explicit LocalCatalogEntry objects. We currently implement sub-catalogs as entries directly, for example the StacCollection._load() method includes:

https://github.com/pangeo-data/intake-stac/blob/7b81b8d4ff36b9399ed9c0f5a96074d0627f2ba4/intake_stac/catalog.py#L110-L111

It sounds like we should be able to wrap this sub-catalog (e.g. StacItem) in LocalCatalogEntry objects, similar to this:

self._entries[item.id] = LocalCatalogEntry(
    name=item.id,
    description=item.id,
    driver=StacItem,
    args={'catalog': item}
)

This doesn't seem to work just yet (see intake/intake#348), but it's something we should do ASAP.

Add a `StacCollection.search` method

Purely as a convenience, it'd be nice to have a StacCollection.search method that uses pystac-client to search an endpoint with a specific collection.

cat = intake.open_stac_catalog("/path/to/catalog")
collection = cat["my-collection"]
collection.search(bbox=bbox)

The .search method would use pystac-client

  1. Find the link with a "rel": "search" and set that as the endpoint.
  2. Specify collections=[self.id] to limit the search to just that collection.
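A sketch of those two steps with pystac-client (attribute access on the wrapped STAC object is assumed):

from pystac_client import Client

def search(self, **params):
    # 1. Use the collection's rel="search" link as the API endpoint.
    endpoint = next(link.href for link in self._stac_obj.links
                    if link.rel == 'search')
    # 2. Restrict the search to this collection's id.
    return Client.open(endpoint).search(collections=[self._stac_obj.id],
                                        **params)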

I see now that intake's base Catalog class apparently defines a search method, which appears to do some kind of text-based search on the items. I suspect that most STAC users would expect search to behave like STAC search.

Allow opening of multiple assets

A single asset may have 1 band, or multiple bands.

When opening the assets as an xarray using intake, the user should be able to specify multiple assets and have them appended to return a 3-D xarray.

This will require some sort of validation, such as checking that the bands all share the same x and y size. This validation can be kept to a minimum for now.
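A minimal sketch of that behaviour (entry access follows the patterns used elsewhere in these issues; xr.concat does the appending):

import xarray as xr

def open_assets(item, asset_keys):
    # Open each requested asset lazily.
    arrays = [item[key].to_dask() for key in asset_keys]
    # Minimal validation: all assets must share the same x/y sizes.
    sizes = {(a.sizes['x'], a.sizes['y']) for a in arrays}
    if len(sizes) > 1:
        raise ValueError(f'assets have mismatched x/y sizes: {sizes}')
    # Append along the 'band' dimension to return a 3-D array.
    return xr.concat(arrays, dim='band')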

Write tests

We now have some alpha code in this repo but it needs tests. Ideally, we should host a few small catalogs right here in this repo. In the name of simplicity, I'd be fine pointing to the sat-stac test cases for now.

[question] transfer intake-stac to Intake GitHub organization?

I've been wondering for a while if it makes sense for this repo to live in the pangeo-data org, or if it would be better to have it elsewhere. It may make sense to put this project with the Intake Org (@martindurant), or if that doesn't fly, perhaps closer to other STAC tools.

In my mind, the goals of a potential transfer would be to increase the package's visibility and encourage outside contributions.

Pinging a few folks that have participated in the project to date. @scottyhq, @matthewhanson, @andersy005, @jsignell, and @martindurant.

intake-stac setup and error on aws search example

Hello,

I've noticed that there is an error running the AWS search notebook example. At cell 11, loading the item into Dask with da = item.B04(chunks=dict(band=1, y=2048, x=2048)).to_dask(), I get the following error:

ValueError: No plugins loaded for this entry: image/tiff; application=geotiff; profile=cloud-optimized A listing of installable plugins can be found at https://intake.readthedocs.io/en/latest/plugin-directory.html .

I've ensured that the module is installed, and the example in the intake-stac README does run fine.

Initially I was running this inside the pangeo notebook Docker image and thought that might be the issue, but I get the same error when just installing the dependencies in a clean virtual environment.

How to pass `xarray_kwargs` from STAC catalog through to `intake`

Hi! I am setting up a STAC catalog which is to be searched using pystac-client, with the results then read into Python with intake, intake-stac, and intake-xarray. I would like to pass xarray_kwargs from my STAC catalog all the way through to the resulting intake catalog, so that I can read the datasets into xarray directly using information stored in the STAC catalog entries. I can pass xarray_kwargs into the "metadata" section of an intake entry, but not into a known xarray_kwargs attribute that is actually used when the dataset is read in. Is there a way to encode this properly in the STAC catalog so it passes all the way through? Thank you for any help.
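For what it's worth, the xarray-assets STAC extension defines asset fields for exactly this purpose, though whether intake-stac honours them may depend on the version; a hypothetical asset entry:

# Hypothetical STAC asset carrying xarray-assets extension fields.
asset = {
    'href': 's3://my-bucket/my-dataset.zarr',
    'type': 'application/vnd+zarr',
    'xarray:open_kwargs': {'consolidated': True},
    'xarray:storage_options': {'requester_pays': True},
}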

add s3 URL support to intake-stac

I was recently part of a group trying to use intake-stac to bring some files into Dask from S3. Unfortunately, the data in question was not public and neither were the catalog files, so I wanted to use s3-style URLs for everything. When I tried the following:

import intake
from intake import open_stac_catalog
cat = open_stac_catalog('s3://put-real-bucket-here/catalog.json')

I got the error:

---------------------------------------------------------------------------
STACError                                 Traceback (most recent call last)
<ipython-input-6-c2ac70a95199> in <module>
----> 1 cat = open_stac_catalog('s3://sit-giovanni-website/catalog.json')

~/Software/python3/envs/stac/lib/python3.8/site-packages/intake_stac/catalog.py in __init__(self, stac_obj, **kwargs)
     30             self._stac_obj = stac_obj
     31         elif isinstance(stac_obj, str):
---> 32             self._stac_obj = self._stac_cls.open(stac_obj)
     33         else:
     34             raise ValueError(

~/Software/python3/envs/stac/lib/python3.8/site-packages/satstac/thing.py in open(cls, filename)
     56                 dat = json.loads(dat)
     57             else:
---> 58                 raise STACError('%s does not exist locally' % filename)
     59         return cls(dat, filename=filename)
     60 

STACError: s3://put-real-bucket-here/catalog.json does not exist locally

It looks to me as though STAC thinks this is a file path rather than an S3 URL. Our time was short and I couldn't figure out if there was some other way to get STAC to take an S3 URL.

At the same time, we were hoping to put s3 URLs in our item catalog entries, e.g.:

{
  "id": "AIRX3STD_006_SurfAirTemp_A/AIRX3STD_006_SurfAirTemp_A_2002_08.parquet",
  "stac_version": "0.9.0",
  "title": "Air temperature at surface (Daytime/Ascending) [AIRS AIRX3STD v006] for 2002-08-01 00:00:00+00:00 - 2002-08-31 23:59:59+00:00",
  "description": "Parquet file containing data from Air temperature at surface (Daytime/Ascending) [AIRS AIRX3STD v006] for 2002-08-01 00:00:00+00:00 - 2002-08-31 23:59:59+00:00",
  "type": "Feature",
  "bbox": [
    -180.0,
    -90.0,
    180.0,
    90.0
  ],
  "geometry": {
    "type": "Polygon",
    "coordinates": [
      [
        [
          -180.0,
          -90.0
        ],
        [
          -180.0,
          90.0
        ],
        [
          180.0,
          90.0
        ],
        [
          180.0,
          -90.0
        ],
        [
          -180.0,
          -90.0
        ]
      ]
    ]
  },
  "properties": {
    "datetime": "2002-08-01T00:00:00Z",
    "start_datetime": "2002-08-01T00:00:00Z",
    "end_datetime": "2002-08-31T23:59:59Z",
    "created": "2020-07-27T18:35:54Z",
    "license": "Apache",
    "platform": "AIRS",
    "instruments": [
      "AIRS"
    ],
    "mission": "AIRS"
  },
  "assets": {
    "AIRX3STD_006_SurfAirTemp_A_2002_08.parquet": {
      "href": "s3://fill-in-real-bucket-here/some/path/AIRX3STD_006_SurfAirTemp_A_2002_08.parquet",
      "title": "Air temperature at surface (Daytime/Ascending) [AIRS AIRX3STD v006], 2002-08-01 00:00:00+00:00 - 2002-08-31 23:59:59+00:00 (Parquet)",
      "type": "parquet",
      "roles": [
        "data"
      ]
    }
  },
  "links": [
    {
      "rel": "self",
      "href": "N/A"
    }
  ]
}

This one may be more of a stretch. I don't know if the STAC spec has s3-style URLs in mind. My two-minute evaluation of the item spec (https://github.com/radiantearth/stac-spec/blob/master/item-spec/json-schema/item.json) is inconclusive.
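As a possible workaround sketch for the catalog-opening half: fetch the JSON with fsspec/s3fs and hand the parsed dict to satstac directly, sidestepping its local-path check (constructor usage inferred from the traceback above):

import json

import fsspec
import satstac

# Read the catalog JSON ourselves; fsspec resolves the s3:// URL.
with fsspec.open('s3://put-real-bucket-here/catalog.json') as f:
    cat = satstac.Catalog(json.load(f))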

How to deal with STAC assets that don't declare 'type'

intake-stac currently assumes that every asset has an associated type (e.g. 'image/png').

if entry_type is NULL_TYPE:
    warnings.warn(
        f'TODO: handle case with entry without type field. This entry was: {entry}'
    )

Apparently STAC assets are not required to list a type. For example, I was just trying to work with this one, which has:

  "assets": {
    "data": {
      "href": "https://grfn.asf.alaska.edu/door/download/S1-GUNW-A-R-087-tops-20141023_20141011-154036-33520N_31618N-PP-b56c-v2_0_2.nc"
    },

https://cmr.earthdata.nasa.gov/cmr-stac/ASF/collections/C1595422627-ASF/items/G1635991338-ASF

This has to do with compatibility with https://cmr.earthdata.nasa.gov/cmr-stac. Any ideas on the best approach for these cases, @jhamman @matthewhanson?
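One possible fallback sketch: when an asset declares no type, guess a driver from the href extension (the mapping below is purely illustrative):

import os

# Illustrative suffix-to-driver mapping for assets without a 'type' field.
SUFFIX_DRIVERS = {'.nc': 'netcdf', '.tif': 'rasterio', '.parquet': 'parquet'}

def guess_driver(href):
    return SUFFIX_DRIVERS.get(os.path.splitext(href)[1].lower())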

Intake-STAC with NASA CMR STAC proxy: Authentication

As part of STAC sprint 6 I was trying out intake-stac with https://github.com/nasa/cmr-stac. It would be absolutely amazing to integrate intake-stac with that endpoint to facilitate working with NASA datasets! But there are multiple things to work out. First and foremost is how to deal with authentication.

Unlike boto3 cloud credentials, NASA uses an 'Earthdata Login' (https://urs.earthdata.nasa.gov/documentation). Typically, science users keep their username and password in a ~/.netrc file, which is used any time they retrieve a file. This mechanism doesn't currently work with the intake-stac .to_dask() method. For example:

item['data'].metadata
#{'href': 'https://grfn.asf.alaska.edu/door/download/S1-GUNW-A-R-087-tops-20141023_20141011-153856-27545N_25464N-PP-1a1a-v2_0_2.nc'}
da = item['data'].to_dask()

Leads to a big traceback:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
    197             try:
--> 198                 file = self._cache[self._key]
    199             except KeyError:

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/lru_cache.py in __getitem__(self, key)
     52         with self._lock:
---> 53             value = self._cache[key]
     54             self._cache.move_to_end(key)

KeyError: [<class 'netCDF4._netCDF4.Dataset'>, ('https://grfn.asf.alaska.edu/door/download/S1-GUNW-A-R-087-tops-20141023_20141011-153856-27545N_25464N-PP-1a1a-v2_0_2.nc',), 'r', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('persist', False))]

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
<ipython-input-15-90d7a2a112b8> in <module>
----> 1 da = item['data'].to_dask()

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/intake_xarray/base.py in to_dask(self)
     67     def to_dask(self):
     68         """Return xarray object where variables are dask arrays"""
---> 69         return self.read_chunked()
     70 
     71     def close(self):

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/intake_xarray/base.py in read_chunked(self)
     42     def read_chunked(self):
     43         """Return xarray object (which will have chunks)"""
---> 44         self._load_metadata()
     45         return self._ds
     46 

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/intake/source/base.py in _load_metadata(self)
    124         """load metadata only if needed"""
    125         if self._schema is None:
--> 126             self._schema = self._get_schema()
    127             self.datashape = self._schema.datashape
    128             self.dtype = self._schema.dtype

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/intake_xarray/base.py in _get_schema(self)
     16 
     17         if self._ds is None:
---> 18             self._open_dataset()
     19 
     20             metadata = {

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/intake_xarray/netcdf.py in _open_dataset(self)
     56             _open_dataset = xr.open_dataset
     57 
---> 58         self._ds = _open_dataset(url, chunks=self.chunks, **kwargs)
     59 
     60     def _add_path_to_ds(self, ds):

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, group, decode_cf, mask_and_scale, decode_times, autoclose, concat_characters, decode_coords, engine, chunks, lock, cache, drop_variables, backend_kwargs, use_cftime, decode_timedelta)
    507         if engine == "netcdf4":
    508             store = backends.NetCDF4DataStore.open(
--> 509                 filename_or_obj, group=group, lock=lock, **backend_kwargs
    510             )
    511         elif engine == "scipy":

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/netCDF4_.py in open(cls, filename, mode, format, group, clobber, diskless, persist, lock, lock_maker, autoclose)
    356             netCDF4.Dataset, filename, mode=mode, kwargs=kwargs
    357         )
--> 358         return cls(manager, group=group, mode=mode, lock=lock, autoclose=autoclose)
    359 
    360     def _acquire(self, needs_lock=True):

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/netCDF4_.py in __init__(self, manager, group, mode, lock, autoclose)
    312         self._group = group
    313         self._mode = mode
--> 314         self.format = self.ds.data_model
    315         self._filename = self.ds.filepath()
    316         self.is_remote = is_remote_uri(self._filename)

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/netCDF4_.py in ds(self)
    365     @property
    366     def ds(self):
--> 367         return self._acquire()
    368 
    369     def open_store_variable(self, name, var):

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/netCDF4_.py in _acquire(self, needs_lock)
    359 
    360     def _acquire(self, needs_lock=True):
--> 361         with self._manager.acquire_context(needs_lock) as root:
    362             ds = _nc4_require_group(root, self._group, self._mode)
    363         return ds

~/miniconda3/envs/intake-stac-gui/lib/python3.7/contextlib.py in __enter__(self)
    110         del self.args, self.kwds, self.func
    111         try:
--> 112             return next(self.gen)
    113         except StopIteration:
    114             raise RuntimeError("generator didn't yield") from None

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/file_manager.py in acquire_context(self, needs_lock)
    184     def acquire_context(self, needs_lock=True):
    185         """Context manager for acquiring a file."""
--> 186         file, cached = self._acquire_with_cache_info(needs_lock)
    187         try:
    188             yield file

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
    202                     kwargs = kwargs.copy()
    203                     kwargs["mode"] = self._mode
--> 204                 file = self._opener(*self._args, **kwargs)
    205                 if self._mode == "w":
    206                     # ensure file doesn't get overriden when opened again

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.__init__()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4._ensure_nc_success()

OSError: [Errno -78] NetCDF: Authorization failure: b'https://grfn.asf.alaska.edu/door/download/S1-GUNW-A-R-087-tops-20141023_20141011-153856-27545N_25464N-PP-1a1a-v2_0_2.nc'

Full example here: https://gist.github.com/scottyhq/04fe1e2d0b946b97228f6922cf001bbd
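
One interim workaround, until authenticated access is wired into intake-stac itself, is to fetch the asset over an authenticated requests session and then open the local copy with xarray. This is a minimal sketch, assuming ~/.netrc holds valid Earthdata credentials; it is not part of the intake-stac API:

# Minimal workaround sketch (not an intake-stac API): download the asset,
# then open the local file with xarray. Assumes ~/.netrc contains:
#   machine urs.earthdata.nasa.gov
#       login <username>
#       password <password>
import requests
import xarray as xr

url = item['data'].metadata['href']
local = url.split('/')[-1]

with requests.Session() as session:
    # requests follows the Earthdata login redirects and applies the
    # matching ~/.netrc credentials for each host it is redirected to
    r = session.get(url, allow_redirects=True)
    r.raise_for_status()
    with open(local, 'wb') as f:
        f.write(r.content)

da = xr.open_dataset(local, chunks={})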

Implement Intake catalogs for search results

Currently the plugin allows one to open an existing static catalog and crawl it to open up STAC Items as an Intake catalog of assets.

In many cases though, users will want to take the output of a search result from a STAC API and open up all those Items as an Intake catalog of catalogs of assets.

The STAC Items class can be used here, which is what the sat-search library uses. It's not clear whether sat-search itself provides enough additional features to be worth adding as a dependency as well.

Additionally, the Items class provides a way to save and load a STAC catalog as a single file: a FeatureCollection with additional top-level fields containing the collections. This approach will probably make it into the next STAC version (radiantearth/stac-spec#385), so this becomes just adding the ability to open either a traditional STAC catalog or a single-file catalog.
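
A rough sketch of what this could look like is below. The open_stac_item_collection entry point is hypothetical (it does not exist yet), and the sat-search parameters and asset key are illustrative, since they vary between versions:

# Hypothetical sketch: open_stac_item_collection is a proposed entry point,
# and the search parameters / asset key are illustrative.
from satsearch import Search
import intake

results = Search(bbox=[-110, 39.5, -105, 40.5],
                 datetime='2018-02-01/2018-02-10').items()

cat = intake.open_stac_item_collection(results)  # proposed entry point
list(cat)                                # one sub-catalog of assets per Item
da = cat[list(cat)[0]]['B02'].to_dask()  # drill into an Item's assets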

support parquet files in catalog

As I said in #48, I was recently involved in a group trying to use intake-stac with some data we have sitting in s3. This data is in parquet format. I've used intake-parquet on this data with no problem to get a dask dataframe. But when I try with intake-stac,

import intake
from intake import open_stac_catalog
cat = open_stac_catalog('https://not.the.real.url/catalog.json')
df = cat["AIRX3STD_006_SurfAirTemp_A/AIRX3STD_006_SurfAirTemp_A_2002_08.parquet"].get().to_dask()

I get the error:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-10-25d227182f13> in <module>
----> 1 df = cat["AIRX3STD_006_SurfAirTemp_A/AIRX3STD_006_SurfAirTemp_A_2002_08.parquet"].get().to_dask()

~/Software/python3/envs/stac/lib/python3.8/site-packages/intake/source/base.py in to_dask(self)
    219     def to_dask(self):
    220         """Return a dask container for this data source"""
--> 221         raise NotImplementedError
    222 
    223     def to_spark(self):

NotImplementedError: 

I assume that intake-stac is keying off the "type" field of the item's assets. Parquet doesn't have an official mime-type, so I tried 'parquet' without success. I then re-read your README and realized that if intake-stac is built on top of intake-xarray, then you probably can't read in parquet regardless of what I put in the "type" field.
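
In the meantime, one workaround sketch (the catalog key and metadata layout here are assumptions about this particular setup, not a confirmed intake-stac API) is to pull the asset's href out of the entry and hand it straight to intake-parquet:

# Workaround sketch: bypass intake-stac's driver selection and open the
# asset's href directly with intake-parquet (assumes it is installed).
import intake

entry = cat["AIRX3STD_006_SurfAirTemp_A/AIRX3STD_006_SurfAirTemp_A_2002_08.parquet"]
href = entry.metadata['href']        # assumes the entry exposes its href
source = intake.open_parquet(href)   # driver registered by intake-parquet
df = source.to_dask()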

Would it be possible to add parquet via the intake-parquet library?
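
For reference, here is one way the driver selection could be wired up internally. This is a sketch; the mapping name, media-type strings, and fallback are illustrative rather than intake-stac's actual code:

# Sketch of a media-type -> intake driver mapping; names are illustrative.
ASSET_TYPE_TO_DRIVER = {
    'application/netcdf': 'netcdf',
    'image/tiff; application=geotiff; profile=cloud-optimized': 'rasterio',
    'application/x-parquet': 'parquet',  # proposed addition via intake-parquet
}

def driver_for_asset(asset):
    # Pick an intake driver from a STAC asset's 'type' field, falling back
    # to a generic driver for unknown media types.
    return ASSET_TYPE_TO_DRIVER.get(asset.get('type'), 'textfiles')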

I'm wondering if parquet is beyond the scope of the STAC catalog spec? I don't see parquet in STAC's list of media types here. But then I don't see zarr either, and I'm guessing that you support zarr with intake-stac because it's the favored data format for Pangeo.
