dlr-eoc / ukis-pysat

The ukis-pysat package provides generic classes and functions to query, access and process multi-spectral and SAR satellite images

License: Apache License 2.0

Language: Python
Topics: satellite-data, gis, ukis, stac

ukis-pysat's People

Contributors: boehnkec, fabianhenkel, fwfichtner, marcelofa, mwieland, s-kowa, sandrogroth, shiwakotisurendra, zajquor

ukis-pysat's Issues

Thoughts about a metadata object

Currently we create, harmonize and store metadata as GeoJSON for each source directly in the construct_metadata method. This works fine at the moment, but is error-prone if we want to expand the metadata attributes, add further sources or modify existing metadata at some stage. Therefore, it would be good to have a dedicated metadata object with clearly defined attributes, dtypes, constraints, validation routines and descriptions.

An interesting approach that is more flexible than just using a dictionary could be dataclasses, which were added to the Python standard library in 3.7. This would restrict us to Python >= 3.7.

What are your thoughts on such a restriction and the possible gains from a dataclass?
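For illustration, a minimal sketch of what such a metadata dataclass could look like (the field names here are placeholders, not the final attribute set):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Metadata:
    # placeholder attributes with explicit dtypes and defaults
    id: str
    platform: str
    cloudcover: Optional[float] = None
    processingsteps: List[str] = field(default_factory=list)

    def __post_init__(self):
        # simple validation routine, as suggested above
        if self.cloudcover is not None and not 0 <= self.cloudcover <= 100:
            raise ValueError("cloudcover must be between 0 and 100")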

env_get with boolean

Is your feature request related to a problem? Please describe.
env_get does not do very much at the moment; it would be quite useful if it could return booleans.

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered
We could also use another package which does that, but I think it's very easy to implement.
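For illustration, a sketch of such an extension (the boolean keyword and the set of accepted truthy strings are assumptions):

import os

def env_get(key, boolean=False):
    """Return the value of an environment variable, optionally cast to bool."""
    try:
        value = os.environ[key]
    except KeyError:
        raise KeyError(f"No environment variable {key} found")
    if boolean:
        # map common truthy strings to True, everything else to False
        return value.lower() in ("true", "1", "yes", "y")
    return value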

Accept WKT string as AOI

Downloading data requires an area of interest. The AOI is provided either as a GeoJSON file or as a bounding box tuple of lat/lon coordinates (str, tuple).

The function prep_aoi(aoi).wkt performs these steps:

  • Read a GeoJSON file and convert the geometry parameter into a shapely.geometry.shape object
  • Parse a bounding box tuple into a shapely.box object

I find the choice between a GeoJSON file and a bounding box quite restrictive. A bounding box cannot represent AOIs more complex than a quadrilateral. On the other hand, a GeoJSON file requires I/O operations both on the user's side and on the ukis-pysat side.

I suggest also accepting WKT strings as valid AOI input.
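A sketch of how prep_aoi could additionally detect and parse WKT strings (the detection heuristic is an assumption):

import shapely.wkt

def prep_aoi(aoi):
    # accept a WKT string in addition to a GeoJSON file path or bbox tuple
    if isinstance(aoi, str) and aoi.lstrip().upper().startswith(("POINT", "POLYGON", "MULTIPOLYGON")):
        return shapely.wkt.loads(aoi)
    # ... existing GeoJSON file and bbox handling would follow here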

Missing files in sdist

It appears that the manifest is missing at least one file necessary to build
from the sdist for version 0.2.0. You're in good company, about 5% of other
projects updated in the last year are also missing files.

+ /tmp/venv/bin/pip3 wheel --no-binary ukis-pysat -w /tmp/ext ukis-pysat==0.2.0
Looking in indexes: http://10.10.0.139:9191/root/pypi/+simple/
Collecting ukis-pysat==0.2.0
  Downloading http://10.10.0.139:9191/root/pypi/%2Bf/bb0/9df8106b1d6c7/ukis-pysat-0.2.0.tar.gz (17 kB)
    ERROR: Command errored out with exit status 1:
     command: /tmp/venv/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-wheel-qv8a31e2/ukis-pysat/setup.py'"'"'; __file__='"'"'/tmp/pip-wheel-qv8a31e2/ukis-pysat/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-wheel-qv8a31e2/ukis-pysat/pip-egg-info
         cwd: /tmp/pip-wheel-qv8a31e2/ukis-pysat/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-wheel-qv8a31e2/ukis-pysat/setup.py", line 37, in <module>
        install_requires=open("requirements.txt").read().splitlines(),
    FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
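The usual fix is to ship the missing file with the sdist. Assuming setup.py keeps reading requirements.txt at build time (as the traceback shows), adding this line to MANIFEST.in should suffice:

include requirements.txt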

get_valid_data_bbox uses wrong transform

Describe the bug
get_valid_data_bbox uses the window's transform instead of the transform of the whole dataset, which always yields a subset of the actual valid-data bbox.
The current test for get_valid_data_bbox does not use an input dataset suitable for the test (it is all 0s).

Expected behavior
Use the correct transform to get the correct result.
The test should use a proper input dataset.
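A sketch of the intended behavior using rasterio's window utilities (the actual internals of get_valid_data_bbox may differ):

from rasterio.windows import bounds, get_data_window

def get_valid_data_bbox(dataset, nodata=0):
    window = get_data_window(dataset.read(), nodata=nodata)
    # key point: convert the window using the transform of the whole
    # dataset, not a transform derived from the window itself
    return bounds(window, dataset.transform)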

Substitute get_metadata with download_metadata in Source()

get_metadata() currently loads metadata from an existing JSON file. We do not yet support comprehensive file metadata queries, to which this method would belong. On the other hand, we lack a method that downloads metadata (complementary to download_image() and download_quicklook()). For consistency we should substitute the get_metadata method with a download_metadata method.

`get_ts_from_sentinel_filename()` does not return a defined date format

Describe the bug
get_ts_from_sentinel_filename() does not return a defined date format, but just a substring of the filename. It would be more useful to return something that common parsers can read. I would even expect this to happen and therefore label this as a bug.

Expected behavior
Return a timestamp in ISO 8601 format (YYYY-MM-DD), e.g. via isoformat().
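A sketch of the conversion, assuming the timestamp substring (e.g. "20200413T100549") has already been sliced out of the filename:

from datetime import datetime

ts = datetime.strptime("20200413T100549", "%Y%m%dT%H%M%S")
print(ts.isoformat())  # 2020-04-13T10:05:49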

Memory Leak in Image

The Image class has a memory leak that becomes significant when running multiple iterations or performing multiple transformations on an Image instance. The following script highlights the problem:

import numpy as np
import rasterio
import psutil
from ukis_pysat.raster import Image

def main():
    crs = "EPSG:32612"
    transform = rasterio.Affine(30.0, 0.0, 358485.0, 0.0, -30.0, 4265115.0)

    for i in range(100):
        arr = np.random.randint(0, 100, size=(2000,7200,1), dtype='int16')
        with Image(data=arr, crs=crs, transform=transform, dimorder="last") as img:
            img.warp(dst_crs="EPSG:4326", resolution=0.003)
            print(img.dataset.bounds)

        mem_pct = psutil.virtual_memory().percent
        print(f'Memory used: {mem_pct}%')

if __name__ == '__main__':
    main()

Running this script on the current implementation of Image, memory usage increases linearly. I could narrow this down to the __update_dataset method, which initializes a MemoryFile as follows:

        memfile = MemoryFile()
        with memfile.open(**meta) as ds:
            ds.write(self.__arr)

        return memfile.open()

Here the memory file is never closed, so each call to __update_dataset adds to the memory footprint. A possible fix could be:

        memfile = MemoryFile()
        with memfile.open(**meta) as ds:
            ds.write(self.__arr)
        self.dataset = memfile.open()
        memfile.close()

Running the above test script against the fixed version of __update_dataset keeps memory consumption at a stable level.

Yet another memory leak that adds on top of this one is related to the rasterio version. The above fix works fine on rasterio<=1.1.4. With the release of rasterio 1.1.5, changes were made to the MemoryFile class that caused memory leaks. For details see the following rasterio issue: rasterio/rasterio#1953
Solution for the time being: do not upgrade to rasterio 1.1.5.

Evaluate type hints

We could think about using type hints for data and file. For raster this only really makes sense with dedicated support by mypy, see the roadmap.
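For illustration, hinting the file helpers might look like this (a sketch; the actual signatures would need to be checked against the modules):

from pathlib import Path
from typing import Optional, Union

def env_get(key: str) -> str: ...

def unpack(src_file: Union[str, Path], target_dir: Optional[Union[str, Path]] = None) -> None: ...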

Bounds option for Image.warp

It would be great to have an option to limit warping to a subset of the image in order to speed up computation. Currently, warping is always performed on the whole Image. A simple solution could be to add an optional "bounds" parameter that passes bounding box coordinates (in the source CRS) to the method. A simple implementation could look as follows (NOTE: this example already considers #60 and #62):

    def warp(
        self, dst_crs, resampling_method=0, num_threads=4, resolution=None, bounds=None, nodata=None, target_align=None
    ):
        """Reproject a source raster to a destination raster.
        :param dst_crs: CRS or dict, Target coordinate reference system.
        :param resampling_method: Resampling algorithm, int, defaults to 0 (Nearest)
            numbers: https://github.com/mapbox/rasterio/blob/master/rasterio/enums.py#L28
        :param num_threads: int, number of workers, optional (default: 4)
        :param resolution: tuple (x resolution, y resolution) or float, optional.
            Target resolution, in units of target coordinate reference system.
        :param bounds: tuple bounding box coordinates in source coordinate reference system, optional.
        :param target_align: raster to which to align resolution, extent and gridspacing, optional (Image).
        :param nodata: nodata value of source, int or float, optional.
        """
        if target_align:
            transform = target_align.dataset.transform
            width = target_align.dataset.width
            height = target_align.dataset.height

        else:
            if not bounds:
                bounds = self.dataset.bounds

            if resolution:
                transform, width, height = calculate_default_transform(
                    self.dataset.crs, dst_crs, self.dataset.width, self.dataset.height, *bounds, resolution=resolution,
                )

            else:
                transform, width, height = calculate_default_transform(
                    self.dataset.crs, dst_crs, self.dataset.width, self.dataset.height, *bounds,
                )

        destination = np.zeros((self.dataset.count, height, width), self.__arr.dtype)

        self.__arr, transform = reproject(
            source=self.__arr,
            destination=destination,
            src_transform=self.dataset.transform,
            src_crs=self.dataset.crs,
            src_nodata=nodata,
            dst_transform=transform,
            dst_crs=dst_crs,
            dst_nodata=nodata,
            resampling=resampling_method,
            num_threads=num_threads,
        )

        self.__update_dataset(dst_crs, transform, nodata=nodata)

Module names

Not 100% happy with the naming of the modules "data.py" and "downloads.py". Currently "data.py" deals with all the raster stuff, while "downloads.py" deals more generally with data access and metadata queries. "downloads.py" is (or will be) about more than downloading, and "data.py" is more about raster processing than data management. Personally I find this confusing and would suggest renaming the two modules as follows:

"downloads.py" -> "data.py"
"data.py" -> "raster.py"

What do you think about this?

Image dataset update

When we do a transformation on the Image object, the underlying rasterio dataset should be updated accordingly. This does not happen at the moment and may be confusing and lead to problems when chaining several transformations. For example:

img = Image(path="dummy.tif")
img.mask_img().warp()

This would create incorrect output, because mask_img() updates transform and arr, but not the underlying dataset, which warp() relies on. At the moment warp() would be applied to the original dataset and not to the one transformed by mask_img().

Handle unavailable products

We do not provide any offline handling, i.e. for products that cannot be reached due to network issues or that have been archived.

For instance, Copernicus SciHub moves Sentinel data older than one year to the Long Term Archive (LTA). Although metadata can still be fetched, the data is treated as offline and requires restoration from the LTA:
Product ee4c3e9e-c1c9-43b2-a4fd-775ba63aec7b is not online. Triggering retrieval from long term archive.
Furthermore, the user is not allowed to send more than two or three LTA requests. Any further request will throw an HTTP 403 error and raise a SentinelAPILTAError.

If too many requests are sent in a given amount of time, the sentinelsat module raises an HTTP 429 error.

The sentinelsat API offers some functionalities, e.g. checking the online status of a resource:
self.api.get_product_odata(product_uuid)['Online']

Error handling could be provided through:

except sentinelsat.SentinelAPIError as e:
    if e.response.status_code == 403:
        logger.warning(f"Requests exceed user quota, {e}")

Dangling rasterio.dataset

We don't enforce closing of the dataset, the documentation is not very clear that you should do this (even though we offer the possibility), and we don't even close img in our tests.

We could offer using ukis-pysat as a context manager, or at least document this better and use it ourselves.
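A minimal sketch of context-manager support on Image (assuming a close() method that closes the underlying dataset):

class Image:
    # ... existing methods ...

    def close(self):
        self.dataset.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()

Usage would then be with Image(path="dummy.tif") as img: ..., which guarantees the dataset is closed even if an exception occurs.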

__update_dataset() in raster.Image() does not update dtype

__update_dataset() does not update dtype. The following is a simple example that shows the problem:

from ukis_pysat.raster import Image
from ukis_pysat.members import Platform


file = "D:/tmp/test/T32UPU_20200413T100549_B01.jp2"

# read image
img = Image(data=file, dimorder="last")
print(img.arr.dtype)
>> uint16

# dn2toa
# NOTE: this modifies img dtype
img.dn2toa(platform=Platform.Sentinel2)
print(img.arr.dtype)
>> float32

# subset
# BUG: this overwrites the img dtype modified by dn2toa
img.mask_image(bbox=(654700.76, 663334.23, 5324479.29, 5330987.60), crop=True, pad=False)
print(img.arr.dtype)
>> uint16

Suggest making sure that __update_dataset() takes what it can get from the array (including its dtype) and enforces passing crs and transform information separately (similar to what we do when creating an Image from an array, just in memory).

Dataset cannot be updated after warping image

When trying to warp an image using Image().warp() I get the following error message:

  File "run_procs.py", line 119, in <module>
    num_threads=settings.general["num_threads"],
  File "D:\Workspaces\python\proc_s2_flood\procs\prepdata.py", line 468, in run
    img = prepdata.get_image()
  File "D:\Workspaces\python\proc_s2_flood\procs\prepdata.py", line 382, in get_image
    num_threads=self.num_threads)
  File "D:\Workspaces\python\proc_s2_flood\ukis_pysat\raster.py", line 206, in warp
    self.dataset = self.__update_dataset(self.dataset.meta).open()
  File "C:\Users\wiel_mc\AppData\Local\conda\conda\envs\pysatenv36new\lib\site-packages\rasterio\env.py", line 386, in wrapper
    return f(*args, **kwds)
  File "C:\Users\wiel_mc\AppData\Local\conda\conda\envs\pysatenv36new\lib\site-packages\rasterio\io.py", line 132, in open
    writer = get_writer_for_driver(driver)
  File "C:\Users\wiel_mc\AppData\Local\conda\conda\envs\pysatenv36new\lib\site-packages\rasterio\io.py", line 185, in get_writer_for_driver
    raise ValueError("'driver' is required to write dataset.")
ValueError: 'driver' is required to write dataset.

This seems to be related to updating the dataset and may also occur when trying to run other methods that modify the dataset (e.g. subset) - I haven't tested this though.

Running ukis-pysat on Win 7 with Python 3.6.

Build array subset from window tiles

Provide a method to build an array subset from window tiles.
The current workflow is to iterate over subset windows in a for-loop via the get_tiles method. Subsetting the array is currently left to the user, who indexes the array with the window properties:

for window_tile in data.Image(img_path).get_tiles():
    (row_start, row_stop), (col_start, col_stop) = window_tile.toranges()
    array[0][row_start:row_stop, col_start:col_stop]

This can be outsourced to a new method in ukis_pysat.data:

def get_subset(self, window_tile, array, band=0):
    (row_start, row_stop), (col_start, col_stop) = window_tile.toranges()
    return array[band][row_start: row_stop, col_start: col_stop]

Performance decreases when requesting the image array for each call:

def get_subset(self, window_tile, band=0):
    (row_start, row_stop), (col_start, col_stop) = window_tile.toranges()
    return self.arr[band][row_start: row_stop, col_start: col_stop]

This can be solved by handing the array to the new method directly. However, this requires refactoring the constructor of the Image class so that arrays are accepted without passing a dataset, as is currently required here.

Possible solution:

if isinstance(dataset, rasterio.io.DatasetReader) and isinstance(arr, np.ndarray):
    self.dataset = dataset
    self.arr = arr
elif isinstance(arr, np.ndarray):
    self.arr = arr

Handle case where pad_width<0 in _pad_to_bbox

Currently _pad_to_bbox cannot deal with cases where pad_width happens to be <0 (e.g., when bbox is smaller than the image and no padding would be required). We should be able to handle such cases and not immediately throw an error. A simple way would be to set pad_width=0 in such cases. The following would be a possible implementation:

    def _pad_to_bbox(self, bbox, mode="constant", constant_values=0):
        """Buffers array with biggest difference to bbox and adjusts affine transform matrix. Can be used to fill
        array with nodata values before masking in case bbox only partially overlaps dataset bounds.
        :param bbox: bounding box of type tuple or Shapely Polygon
        :param mode: str, how to pad, see rasterio.pad. Optional (default: 'constant')
        :param constant_values: nodata value, padding should be filled with, optional (default: 0)
        :return: closed, buffered dataset in memory
        """
        if isinstance(bbox, polygon.Polygon):
            bbox = bbox.bounds
        elif isinstance(bbox, tuple):
            pass
        else:
            raise TypeError("bbox must be of type tuple or Shapely Polygon")

        max_diff_ur = np.max(np.subtract(bbox[2:], tuple(self.dataset.bounds[2:])))
        max_diff_ll = np.max(np.subtract(tuple(self.dataset.bounds[:2]), bbox[:2]))
        max_diff = max(max_diff_ll, max_diff_ur)  # buffer in units

        pad_width = math.ceil(max_diff / self.dataset.transform.to_gdal()[1])  # units / pixel_size

        if pad_width < 0:
            # make sure that pad_width is not negative
            pad_width = 0

        destination = np.zeros(
            (self.dataset.count, self.__arr.shape[1] + 2 * pad_width, self.__arr.shape[2] + 2 * pad_width,),
            self.__arr.dtype,
        )

        for i in range(0, self.dataset.count):
            destination[i], transform = rasterio.pad(
                self.__arr[i], self.dataset.transform, pad_width, mode, constant_values=constant_values,
            )

        self.__arr = destination

        self.dataset.close()
        return self.__update_dataset(self.dataset.crs, transform, nodata=self.dataset.nodata)

Make aoi in Source() more flexible

aoi in Source() currently only accepts GeoJSON in WGS84. Make this more flexible to work with 1) any CRS and 2) files as well as bbox coordinates; basically, accept the same input as aoi in Image().

We could do something like this, but it would add pyproj as a dependency:

import geojson
import pyproj
from shapely.geometry import shape
from shapely.ops import transform

def prep_aoi(aoi, src_proj=None):
    """ Converts to Geojson Geometry and reprojects aoi to WGS84

    :param aoi: aoi
    :param src_proj: pyproj.proj.Proj, Source Projection, default: None
    :return: geojson.geometry.Polygon
    """
    if isinstance(aoi, geojson.geometry.Polygon):
        aoi = shape(aoi)

    if src_proj:
        project = pyproj.Transformer.from_proj(src_proj, pyproj.Proj(init='epsg:4326'))
        aoi = transform(project.transform, aoi)
    return geojson.dumps(aoi.__geo_interface__, sort_keys=True)

Target align option for Image.warp

In many applications it is extremely useful to be able to force-align two raster grids so that they match each other's CRS, extent, dimensions and grid alignment. With a combination of warp and mask we can come close to aligning two rasters (this actually works in many cases). It may, however, happen that the raster grids are shifted at subpixel resolution, which prevents them from being exactly on the same grid, so they cannot be compared at array level with, for example, NumPy.

I propose the following solution to add a target_align option to warp, which aligns one Image instance to another one:

    def warp(self, dst_crs, resampling_method=0, num_threads=4, resolution=None, nodata=None, target_align=None):
        """Reproject a source raster to a destination raster.
        :param dst_crs: CRS or dict, Target coordinate reference system.
        :param resampling_method: Resampling algorithm, int, defaults to 0 (Nearest)
            numbers: https://github.com/mapbox/rasterio/blob/master/rasterio/enums.py#L28
        :param num_threads: int, number of workers, optional (default: 4)
        :param resolution: tuple (x resolution, y resolution) or float, optional.
            Target resolution, in units of target coordinate reference system.
        :param target_align: raster to which to align resolution, extent and gridspacing, optional (Image).
        :param nodata: nodata value of source, int or float, optional.
        """
        if target_align:
            transform = target_align.dataset.transform
            width = target_align.dataset.width
            height = target_align.dataset.height

        else:
            if resolution:
                transform, width, height = calculate_default_transform(
                    self.dataset.crs,
                    dst_crs,
                    self.dataset.width,
                    self.dataset.height,
                    *self.dataset.bounds,
                    resolution=resolution,
                )

            else:
                transform, width, height = calculate_default_transform(
                    self.dataset.crs, dst_crs, self.dataset.width, self.dataset.height, *self.dataset.bounds,
                )

        destination = np.zeros((self.dataset.count, height, width), self.__arr.dtype)

        self.__arr, transform = reproject(
            source=self.__arr,
            destination=destination,
            src_transform=self.dataset.transform,
            src_crs=self.dataset.crs,
            src_nodata=nodata,
            dst_transform=transform,
            dst_crs=dst_crs,
            dst_nodata=nodata,
            resampling=resampling_method,
            num_threads=num_threads,
        )

        self.dataset.close()
        self.dataset = self.__update_dataset(dst_crs, transform, nodata=nodata)

Datetime Object to ESA date

Is your feature request related to a problem? Please describe.
We now have the option to create a datetime object from ESA dates; it would be nice to also be able to go back. This can be helpful when naming output files.

Describe the solution you'd like
Just a simple function to change the format back.
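A sketch of such a helper (the function name is hypothetical; the format string mirrors the compact timestamps in ESA filenames, e.g. "20200413T100549"):

from datetime import datetime

def datetime_to_esa_date(dt):
    """Format a datetime back into the compact ESA timestamp string."""
    return dt.strftime("%Y%m%dT%H%M%S")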

Make pack and unpack methods more flexible

In their current implementation, unpack and pack require the file format to be specified. This could easily be extracted from the file extension. Also, the parameter names could use some renaming to be more self-explanatory. Suggest modifying these methods as follows:

def unpack(src_file, target_dir=None):
    """Unpack an archive, does not add anything to shutil, but makes aware it exists.

    :param src_file: path to source file to be extracted including extension.
    :param target_dir: directory where the archive is unpacked. If not provided, the current working directory is used.
    """
    if target_dir:
        shutil.unpack_archive(src_file, target_dir)
    else:
        shutil.unpack_archive(src_file)


def pack(src_dir, target_file):
    """Create an archive file (such as zip or tar) from a directory.

    :param src_dir: source directory to be archived.
    :param target_file: path to target file including extension.
    """
    base = os.path.basename(target_file)
    name = base.split(".")[0]
    format = base.split(".")[1]
    archive_from = os.path.dirname(src_dir)
    archive_to = os.path.basename(src_dir.strip(os.sep))
    shutil.make_archive(name, format, archive_from, archive_to)
    shutil.move("%s.%s" % (name, format), target_file)

Add array setter

Instantiating an Image object with self.img = ukis_pysat.raster.Image(self.path_to_img) gives access to attributes of the Image class, e.g. self.img.arr.

At some point it is desirable to alter self.img.arr. However, self.img.arr = self.img.arr * 10 fails with an AttributeError.

Suggest extending the property with a setter. The current implementation is:

@property
def arr(self):
    """array property"""
    if self.dimorder == "first":
        return self.__arr
    else:
        return reshape_as_image(self.__arr)

This could be extended with:

@arr.setter
def arr(self, value):
    self.__arr = value

Mosaic method

It would be great if we could provide a method for merging multiple raster files into one mosaic. With #64 we got a first implementation based on a straightforward use of rasterio.merge.merge.

However, a mosaic is not a simple merge of multiple rasters. We have to discuss the following points (please add if incomplete):

  • Placement: The method is situated in ukis_pysat.raster, which seems logical as we are working on a raster basis. This implies calling the method via an Image object with src.build_mosaic(<list_of_datasets_to_merge>); see the sketch after this list. So far the object src is not really connected to the mosaic itself and is more or less an initiator for the method call.
  • Geometry: With the first draft, the raster files are merged and share a maximum bounding box. Although it is possible to call ukis_pysat.raster.warp and crop the result to a desired geometry, we may offer this functionality upon merging. This could also be connected to the original Image object or to another bounding box.
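A minimal sketch of what such a method could build on (build_mosaic is the hypothetical name from above; rasterio.merge.merge does the actual work):

from rasterio.merge import merge

def build_mosaic(datasets, bounds=None):
    """Merge a list of open rasterio datasets; optionally crop to bounds."""
    # bounds: optional (left, bottom, right, top) in the datasets' CRS
    arr, transform = merge(datasets, bounds=bounds)
    return arr, transform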

Create ukis_pysat.raster-Image with 2D array

Is your feature request related to a problem? Please describe.

  • ukis_pysat.raster only treats 3D arrays. Numerous external methods work on a 2D basis, and using np.squeeze() or np.expand_dims() adds unnecessary code.

  • Building a ukis_pysat.raster.Image() from a 2D array fails with a misleading error message:

    rasterio._err.CPLE_FileIOError: Free disk space available is 68524220416 bytes, whereas 413266544908 are at least necessary.

    We could narrow this problem down to the fact that feeding a 2D array into a method that expects 3D arrays simply does not work.

Describe the solution you'd like
Allow initialization of ukis_pysat.raster.Image with both 2D and 3D arrays. If I'm not mistaken, rasterio is limited to 3D arrays. @MWieland and @fwfichtner, what is your opinion on this?
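A sketch of how the constructor could normalize 2D input internally (assuming rasterio's (bands, rows, cols) layout):

import numpy as np

# inside Image.__init__, before the array is used further
if arr.ndim == 2:
    arr = arr[np.newaxis, :, :]  # (rows, cols) -> (1, rows, cols)
elif arr.ndim != 3:
    raise TypeError("arr must be a 2D or 3D numpy array")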

Consider array band count when writing Image

Currently write_to_file() takes width and height from the Image array but its band number (count) from the dataset.

In case one initializes an Image() from an existing dataset and an array, it may happen that the wrong band number is used internally when writing to file. Take for example the case where someone computes a simple NDVI from a multi-band Image() and wants to write the result (a single-band image) back to file.

img = Image(path="multiband_file.tif")
ndvi_arr = (img.arr[3] - img.arr[2]) / (img.arr[3] + img.arr[2])
ndvi = Image(dataset=img.dataset, array=ndvi_arr)
ndvi.write_to_file("ndvi_file.tif")

This would result in an error upon writing the array to file, because img band number != ndvi band number.

The following could be a possible solution to account for this:

    def write_to_file(self, path_to_file, dtype, driver="GTiff", compress=None):
        """
        Write a dataset to file.
        :param path_to_file: str, path to new file
        :param dtype: datatype, like np.uint16, 'float32' or 'min' to use the minimum type to represent values
        :param driver: str, optional (default: 'GTiff')
        :param compress: compression, e.g. 'lzw' (default: None)
        """
        if dtype == 'min':
            dtype = get_minimum_dtype(self.__arr)

        profile = self.dataset.meta
        profile.update(
            {
                "driver": driver,
                "height": self.__arr.shape[-2],
                "width": self.__arr.shape[-1],
                "count": self.__arr.shape[0],
                "dtype": dtype,
                "transform": self.transform,
                "crs": self.crs,
            }
        )

        if compress:
            profile.update({"compress": compress})

        with rasterio.open(path_to_file, "w", **profile) as dst:
            dst.write(self.__arr.astype(dtype), indexes=list(range(1, self.__arr.shape[0] + 1)))

VRT support

GDAL offers building virtual raster datasets (VRTs), and so does rasterio; this gives us the possibility to implement it behind an easy-to-use interface.

Consider the following:

import os

import rasterio.enums
import rasterio.shutil
import rasterio.vrt

from ukis_pysat import raster as pysat_raster

vrt_dir = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'tests', 'testfiles', 'vrt')

for sample in os.listdir(vrt_dir):
    src = pysat_raster.Image(os.path.join(vrt_dir, sample))
    with rasterio.vrt.WarpedVRT(src.dataset, crs=src.dataset.crs, resampling=rasterio.enums.Resampling.nearest) as vrt:
        rasterio.shutil.copy(vrt, os.path.join(vrt_dir, 'test.vrt'), driver='VRT')

vrt = pysat_raster.Image(os.path.join(vrt_dir, 'test.vrt'))

Image coordinate order

We use the coordinate order of rasterio when loading images, which is channels-first (channels, rows, cols). Most image processing, visualization and machine learning libraries, however, use channels-last (rows, cols, channels) as their image coordinate convention (e.g. scikit-image, opencv, pillow, matplotlib, etc.).

In order to avoid confusion and to adjust to user requirements more easily it would be great if we could provide an option to change the coordinate order upon instantiation of an image object.

I know that Rasterio provides conversion methods here, but it would be convenient if the order could be set once (e.g. as env var or parameter in Image init) to comply with a user's requirements.

Tensorflow for example provides such a switch between channels first and channels last to account for performance differences when training on GPU or CPU.
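For reference, the rasterio conversion helpers mentioned above (a small sketch of the two layouts):

import numpy as np
from rasterio.plot import reshape_as_image, reshape_as_raster

arr_first = np.zeros((3, 100, 200), dtype="uint8")  # (bands, rows, cols)
arr_last = reshape_as_image(arr_first)              # (100, 200, 3)
assert reshape_as_raster(arr_last).shape == arr_first.shape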

Image write compression and dtype

The Image.write_to_file() method could be expanded to support optional image compression (e.g. Packbits or LZW). This is necessary to reduce the storage space for example when saving binary masks to disk.

To achieve this we basically would have to update the profile with "compress" and optionally also "nbits" attributes. Beyond this, the "tiled" attribute could be useful to support.

Also, this method defines uint16 as the default dtype. In my opinion it would be more reasonable to default to the dtype of the Image array, which would keep image loading and writing consistent.

Add local file directories as data sources

It would be great to be able to read metadata from local file system so that we can perform tasks like: Give me all S2 scenes in directory x with less than y% cloud cover. TinyDB could be a possible package to get started with.
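For illustration, a TinyDB-based index could look like this (the schema, scene id and paths are assumptions):

from tinydb import TinyDB, Query

db = TinyDB("scene_index.json")
db.insert({"id": "S2A_MSIL1C_20200413T100549", "cloudcover": 12.5, "path": "/data/s2"})

Scene = Query()
hits = db.search(Scene.id.matches("S2.*") & (Scene.cloudcover < 20))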

Consider dimorder when initializing Image() from array

It is possible to initialize an Image() object from an existing rasterio.dataset and an array as follows:

img = Image(dataset=ds, array=arr, dimorder="first")

To work correctly, this currently assumes that the array has shape (bands, rows, cols) regardless of the dimorder parameter setting.

Suggest modifying the init of Image to consider this case and allow passing arrays with different dimorder. The following would be a working example:

    def __init__(self, path=None, dataset=None, arr=None, dimorder="first"):
        if path:
            if isinstance(path, str):
                self.dataset = rasterio.open(path)
                self.__arr = self.dataset.read()
            else:
                raise TypeError("path must be of type str")
        else:
            if isinstance(dataset, rasterio.io.DatasetReader) and isinstance(arr, np.ndarray):
                self.dataset = dataset
                if dimorder == "first":
                    self.__arr = arr
                elif dimorder == "last":
                    self.__arr = reshape_as_raster(arr)
                else:
                    raise AttributeError("dimorder for bands or channels must be either 'first' or 'last'.")
            else:
                raise TypeError(
                    "dataset must be of type rasterio.io.DatasetReader and arr must be of type numpy.ndarray"
                )
        self.dimorder = dimorder
        self.transform = self.dataset.transform
        self.crs = self.dataset.crs
        self.da_arr = None

Please note that this modification would not interfere with any internal array manipulation, but would just allow consistent, flexible array I/O for Image().

Implement boundless padding with reflect mode for tiling

We should implement boundless padding with reflect mode for tile creation; a small padding sketch follows below.
To be boundless we only need to skip the intersection with the bounding_window: .intersection(bounding_window) (link)
The padding might be rather complex in combination with the dataset.
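For the padding itself, numpy's reflect mode could be a starting point (a sketch; wiring it into windowed dataset reads is the open question):

import numpy as np

pad = 32  # hypothetical tile overlap in pixels
tile = np.zeros((1, 256, 256), dtype="uint8")  # (bands, rows, cols)
padded = np.pad(tile, ((0, 0), (pad, pad), (pad, pad)), mode="reflect")
assert padded.shape == (1, 320, 320)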

Split up package into subsets

Is your feature request related to a problem? Please describe.
To use UKIS-pysat we always require GDAL and other heavy dependencies to be installed. That's not very handy for lightweight and small Docker images.

Describe the solution you'd like
Split up the package into subsets as suggested here, so you can install only the functionality you need:

pip install ukis-pysat  # would then only install ukis_pysat.file
pip install ukis-pysat[complete]  # would then install everything
pip install ukis-pysat[data]  # would then install ukis_pysat.file and ukis_pysat.data
pip install ukis-pysat[raster]  # would then install ukis_pysat.file and ukis_pysat.raster

Additional context
We need to carefully describe this in the Documentation, similar to what they have done in dask.
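A sketch of how the extras could be wired up in setup.py (the dependency lists here are placeholders, not the real requirements):

from setuptools import setup

extras = {
    "data": ["sentinelsat"],  # placeholder dependency lists
    "raster": ["rasterio"],
}
extras["complete"] = sorted({dep for deps in extras.values() for dep in deps})

setup(
    name="ukis-pysat",
    extras_require=extras,
    # the base install_requires would then cover ukis_pysat.file only
)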

Nodata values not available when creating dataset from array

Creating an Image.dataset from a numpy array produces a rasterio dataset with the following metadata:

meta = {
    "driver": "GTiff",
    "dtype": self.__arr.dtype,
    "nodata": nodata,
    "height": self.__arr.shape[-2],
    "width": self.__arr.shape[-1],
    "count": self.__arr.shape[0],
    "crs": crs,
    "transform": transform,
}

The instance nodata is given as an argument of the method: __update_dataset(self, crs, transform, nodata=None). However, nodata is not passed in the call self.__update_dataset(crs, transform), which leaves the new dataset without a nodata value.

Suggest introducing a nodata value when creating datasets from arrays, i.e. self.__update_dataset(crs, transform, nodata=nodata). This requires passing the nodata value to the instantiation of the Image class, __init__(self, data, dimorder="first", crs=None, transform=None, nodata=None), if usage is desired.

Satellite image band list

For any image preprocessing and analysis steps we should be able to store and keep track of the band information of a satellite image. In particular, we should know which band index refers to which wavelength and which unit (e.g. digital numbers, top of atmosphere, sigma naught, etc.). This is essential to be able to compute even the simplest band indices (e.g. NDVI) or to select an appropriate band subset during preprocessing.

We already have a dedicated metadata attribute for this ("bandlist") but do not fill it upon metadata creation, because this information is not directly provided by the source metadata (e.g. EarthExplorer). It could look like the following for a band subset of a Sentinel-2 image:

'bandlist': [
    {'index': 0, 'name': 'B02', 'wavelength': 'blue', 'unit': 'DN'},
    {'index': 1, 'name': 'B03', 'wavelength': 'green', 'unit': 'DN'},
    {'index': 2, 'name': 'B04', 'wavelength': 'red', 'unit': 'DN'},
    {'index': 3, 'name': 'B08', 'wavelength': 'nir', 'unit': 'DN'},
    {'index': 4, 'name': 'B11', 'wavelength': 'swir1', 'unit': 'DN'},
    {'index': 5, 'name': 'B12', 'wavelength': 'swir2', 'unit': 'DN'},
]

We could compile this ourselves from a set of lookup tables for unique combinations of "Platform" and "Producttype". We basically do this already here for Landsat.

This should be put in a public place (possibly here) and be expanded upon. Then we can compile the respective bandlist "on the fly" when we query the satellite image metadata. This would also allow a user to simply update it in a standardized way later on as part of a more complex processing chain.

Add nbits option to write_to_file

The GeoTIFF-specific nbits option allows creating a file with fewer than 8 bits per sample by passing a value from 1 to 7. This becomes particularly useful when writing binary masks, as it can save significant disk space. Related to #46. Therefore, suggest adding this option to the write_to_file method.
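rasterio forwards nbits to GDAL as a creation option, so the wiring could be as simple as this sketch (the parameter name and placement are assumptions):

    def write_to_file(self, path_to_file, nbits=None):
        """Write the dataset to file, optionally with the NBITS creation option."""
        profile = self.dataset.meta
        if nbits:
            # hypothetical pass-through of the GeoTIFF NBITS creation option
            profile.update({"nbits": nbits})
        with rasterio.open(path_to_file, "w", **profile) as dst:
            dst.write(self.__arr)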

Provide documentation for datahub log in

Connecting to a datahub requires user credentials to be set according to the target datahub, e.g. Copernicus SciHub.
Calling src = Source(source=Datahub.Scihub) looks for user credentials but fails if they are not set in the environment:

  File "<env_dir>\lib\site-packages\ukis_pysat\data.py", line 57, in __init__
    self.user = env_get("SCIHUB_USER")
  File "<env_dir>\lib\site-packages\ukis_pysat\file.py", line 23, in env_get
    raise KeyError(f"No environment variable {key} found")
KeyError: 'No environment variable SCIHUB_USER found'

The documentation on this is rather sparse and should be more detailed.

There are different ways of providing user credentials, depending on the OS and development tools.

  • User credentials could be provided as plain text, which is not recommended.
  • PyCharm offers the possibility to store the credentials in an environment in the Run/Debug configuration dialog.
  • Credentials could be set in the following way (example in plain text, but can be loaded from files):
    os.environ["SCIHUB_USER"] = "Tim"
    os.environ["SCIHUB_PW"] = "TheEnchanter"
    

Wrong field types in Metadata

Some of the field types in Metadata need revision. In particular, the fields processingsteps and bandlist should be of type list (with default value None) rather than str.
