dlr-eoc / ukis-pysat
The ukis-pysat package provides generic classes and functions to query, access and process multi-spectral and SAR satellite images
License: Apache License 2.0
Add additional tests to test_download in order to check metadata queries, image and quicklook downloads for Sentinel-3 imagery from Scihub.
Currently we create, harmonize and store metadata in GeoJSON for each source directly in the construct_metadata method. This works fine at the moment, but is error-prone in case we want to expand the metadata attributes, add further sources or modify existing metadata at some stage. Therefore, it would be good to have a dedicated metadata object with clearly defined attributes, dtypes, constraints, validation routines and descriptions.
An interesting approach to solve this in a more flexible way than just using a dictionary could be dataclasses, which have been part of the Python standard library since 3.7. This would restrict us to Python >= 3.7.
What are your thoughts on such a restriction and the possible gains from a dataclass?
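As a sketch of what such a metadata dataclass could look like (the field names here are illustrative, not the actual ukis-pysat attributes):

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SceneMetadata:
    # illustrative attributes; the real set would be harmonized across sources
    srcid: str
    platformname: str
    cloudcoverpercentage: Optional[float] = None

    def __post_init__(self):
        # a simple validation routine of the kind the issue asks for
        if self.cloudcoverpercentage is not None and not 0 <= self.cloudcoverpercentage <= 100:
            raise ValueError("cloudcoverpercentage must be between 0 and 100")


meta = SceneMetadata(srcid="S2A_MSIL1C_...", platformname="Sentinel-2", cloudcoverpercentage=12.5)
print(asdict(meta))  # serializes back to a plain dict, e.g. for GeoJSON properties
```

The asdict round-trip would keep the existing GeoJSON storage working while centralizing validation in one place.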
Is your feature request related to a problem? Please describe.
env_get does not do very much at the moment; it would be quite useful if it could return booleans.
Describe the solution you'd like
A clear and concise description of what you want to happen.
Describe alternatives you've considered
We could also use another package which does that, but I think it's very easy to implement.
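A minimal sketch of such a boolean-aware env_get; the signature is an assumption, not the current ukis-pysat API:

```python
import os


def env_get(name, boolean=False):
    # hypothetical extension: interpret common truthy strings as booleans
    value = os.environ[name]
    if boolean:
        return value.lower() in ("true", "1", "yes")
    return value


os.environ["UKIS_PYSAT_DEBUG"] = "True"
print(env_get("UKIS_PYSAT_DEBUG", boolean=True))  # True
```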
Downloading data requires an area of interest. The AOI is either provided as a GEOJSON file or a bounding box tuple with lat lon coordinates (String, Tuple).
The function prep_aoi(aoi).wkt performs these steps:
I find the choice between a GeoJSON file and a bounding box quite restricting. A bounding box cannot resolve AOIs more complex than a quadrilateral. On the other hand, a GeoJSON file requires I/O operations both on the user side and on the ukis-pysat side.
I suggest also accepting WKT strings as valid AOI input.
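Input detection for the three AOI flavours could then look roughly like this (a sketch with a hypothetical helper name; the actual parsing of the WKT string would be done with shapely.wkt.loads):

```python
def detect_aoi_type(aoi):
    # hypothetical helper: decide how prep_aoi should treat its input
    if isinstance(aoi, tuple) and len(aoi) == 4:
        return "bbox"
    if isinstance(aoi, str) and aoi.lstrip().upper().startswith(("POLYGON", "MULTIPOLYGON", "POINT")):
        return "wkt"
    if isinstance(aoi, str):
        return "geojson_file"
    raise TypeError("aoi must be a bbox tuple, a WKT string or a path to a GeoJSON file")


print(detect_aoi_type("POLYGON ((0 0, 1 0, 1 1, 0 0))"))  # wkt
```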
It appears that the manifest is missing at least one file necessary to build
from the sdist for version 0.2.0. You're in good company, about 5% of other
projects updated in the last year are also missing files.
+ /tmp/venv/bin/pip3 wheel --no-binary ukis-pysat -w /tmp/ext ukis-pysat==0.2.0
Looking in indexes: http://10.10.0.139:9191/root/pypi/+simple/
Collecting ukis-pysat==0.2.0
Downloading http://10.10.0.139:9191/root/pypi/%2Bf/bb0/9df8106b1d6c7/ukis-pysat-0.2.0.tar.gz (17 kB)
ERROR: Command errored out with exit status 1:
command: /tmp/venv/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-wheel-qv8a31e2/ukis-pysat/setup.py'"'"'; __file__='"'"'/tmp/pip-wheel-qv8a31e2/ukis-pysat/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-wheel-qv8a31e2/ukis-pysat/pip-egg-info
cwd: /tmp/pip-wheel-qv8a31e2/ukis-pysat/
Complete output (5 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-wheel-qv8a31e2/ukis-pysat/setup.py", line 37, in <module>
install_requires=open("requirements.txt").read().splitlines(),
FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
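Assuming the project packages with setuptools, the usual fix is to declare the file in a MANIFEST.in so that it ends up in the sdist (a sketch, not the project's actual packaging configuration):

```
include requirements.txt
```

Rebuilding the sdist and checking that requirements.txt appears in the resulting archive verifies the fix.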
Describe the bug
get_valid_data_bbox uses the window's transform instead of that of the whole dataset, which always leads to a subset of the actual valid data bbox.
The current test for get_valid_data_bbox does not use an input dataset suitable for the test (it's all 0s).
Expected behavior
Use the correct transform to get correct result.
We should use a proper input dataset.
get_metadata() currently loads metadata from an existing json file. We do not yet support comprehensive file metadata queries, where this method would belong to. On the other hand we miss a method that downloads metadata (complementary to download_image() and download_quicklook()). For consistency we should substitute the get_metadata method with a download_metadata method.
Describe the bug
get_ts_from_sentinel_filename() does not return a defined date format, but just a substring of the filename. It would be more useful to return something which is readable for common parsers. I would even expect this to happen and therefore label this as a bug.
Expected behavior
Return a timestamp in ISO 8601 (YYYY-MM-DD), e.g. with isoformat().
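The fix could be as small as parsing the substring and calling isoformat(); a sketch with a hypothetical wrapper name, assuming the timestamp part of Sentinel filenames follows %Y%m%dT%H%M%S:

```python
from datetime import datetime


def sentinel_ts_to_iso(ts_substring):
    # hypothetical wrapper: turn the raw filename substring into an ISO 8601 string
    return datetime.strptime(ts_substring, "%Y%m%dT%H%M%S").isoformat()


print(sentinel_ts_to_iso("20151102T094537"))  # 2015-11-02T09:45:37
```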
Image class has a memory leak that becomes significant when running multiple iterations or performing multiple transformations on an Image instance. The following script highlights the problem:
import numpy as np
import rasterio
import psutil

from ukis_pysat.raster import Image


def main():
    crs = "EPSG:32612"
    transform = rasterio.Affine(30.0, 0.0, 358485.0, 0.0, -30.0, 4265115.0)
    for i in range(100):
        arr = np.random.randint(0, 100, size=(2000, 7200, 1), dtype="int16")
        with Image(data=arr, crs=crs, transform=transform, dimorder="last") as img:
            img.warp(dst_crs="EPSG:4326", resolution=0.003)
            print(img.dataset.bounds)
        mem_pct = psutil.virtual_memory().percent
        print(f"Memory used: {mem_pct}%")


if __name__ == "__main__":
    main()
Running this script on the current implementation of Image, memory usage increases linearly. I could narrow this down to the __update_dataset method, which initializes a MemoryFile as follows:
memfile = MemoryFile()
with memfile.open(**meta) as ds:
    ds.write(self.__arr)
return memfile.open()
Here the memory file is not properly closed, and with each call to __update_dataset we add to the memory footprint. A possible fix for this could be:
memfile = MemoryFile()
with memfile.open(**meta) as ds:
    ds.write(self.__arr)
self.dataset = memfile.open()
memfile.close()
Running the above test script on the fixed version of __update_dataset keeps the memory consumption at a stable level.
Yet another memory leak issue that adds on top of this one is related to the rasterio version. The above fix works fine on rasterio<=1.1.4. With the release of rasterio 1.1.5, changes to the MemoryFile class were made that caused memory leaks. For details see the following rasterio issue: rasterio/rasterio#1953
Solution for the time being: do not upgrade to rasterio 1.1.5.
We could think about using type hints for data and file. For raster it only really makes sense with dedicated support by mypy, see roadmap.
It would be great to have an option to limit warping to a subset of the image in order to speed up computation. Currently, warping is always performed on the whole Image. A simple solution could be to add an optional "bounds" parameter that passes the bounding box coordinates in source crs coordinates to the method. A simple implementation could look as follows (NOTE: this example already considers #60 and #62 ):
def warp(
    self, dst_crs, resampling_method=0, num_threads=4, resolution=None, bounds=None, nodata=None, target_align=None
):
    """Reproject a source raster to a destination raster.

    :param dst_crs: CRS or dict, Target coordinate reference system.
    :param resampling_method: Resampling algorithm, int, defaults to 0 (Nearest)
        numbers: https://github.com/mapbox/rasterio/blob/master/rasterio/enums.py#L28
    :param num_threads: int, number of workers, optional (default: 4)
    :param resolution: tuple (x resolution, y resolution) or float, optional.
        Target resolution, in units of target coordinate reference system.
    :param bounds: tuple bounding box coordinates in source coordinate reference system, optional.
    :param target_align: raster to which to align resolution, extent and gridspacing, optional (Image).
    :param nodata: nodata value of source, int or float, optional.
    """
    if target_align:
        transform = target_align.dataset.transform
        width = target_align.dataset.width
        height = target_align.dataset.height
    else:
        if not bounds:
            bounds = self.dataset.bounds
        if resolution:
            transform, width, height = calculate_default_transform(
                self.dataset.crs, dst_crs, self.dataset.width, self.dataset.height, *bounds, resolution=resolution,
            )
        else:
            transform, width, height = calculate_default_transform(
                self.dataset.crs, dst_crs, self.dataset.width, self.dataset.height, *bounds,
            )
    destination = np.zeros((self.dataset.count, height, width), self.__arr.dtype)
    self.__arr, transform = reproject(
        source=self.__arr,
        destination=destination,
        src_transform=self.dataset.transform,
        src_crs=self.dataset.crs,
        src_nodata=nodata,
        dst_transform=transform,
        dst_crs=dst_crs,
        dst_nodata=nodata,
        resampling=resampling_method,
        num_threads=num_threads,
    )
    self.__update_dataset(dst_crs, transform, nodata=nodata)
I'm not 100% happy with the naming of the modules "data.py" and "downloads.py". Currently "data.py" deals with all the raster stuff, while "downloads.py" deals more generally with data access and metadata queries. "downloads.py" is/will be more than downloading, and "data.py" is more about raster processing than data management. Personally I find this confusing and would suggest renaming the two modules as follows:
"downloads.py" -> "data.py"
"data.py" -> "raster.py"
What do you think about this?
When we do a transformation on the Image object, the underlying rasterio dataset should be updated accordingly. This does not happen at the moment, which may be confusing and can lead to problems when chaining several transformations. For example:
img = Image(path="dummy.tif")
img.mask_img().warp()
This would create an incorrect output, because mask_img() updates transform and arr, but not the entire dataset. The dataset is, however, used by warp(). At the moment warp() would be applied on the original dataset and not on the transformed one from mask_img().
We do not provide any offline handling, i.e. for cases where a product cannot be reached due to network issues or has been archived.
For instance, Copernicus SciHub moves Sentinel data to the LTA (long term archive) if it is older than one year. Although metadata can be fetched, the data is treated as offline and requires restoration from the LTA:
Product ee4c3e9e-c1c9-43b2-a4fd-775ba63aec7b is not online. Triggering retrieval from long term archive.
Furthermore, the user is not allowed to send more than two or three LTA requests. Any further request will throw an HTTP 403 error and raise a SentinelAPILTAError.
If too many requests are sent in a given amount of time, the sentinelsat module raises an HTTP 429 error.
The sentinelsat API offers some functionalities, e.g. checking the online status of a resource:
self.api.get_product_odata(product_uuid)['Online']
Error handling could be provided through:
except sentinelsat.SentinelAPIError as e:
    if e.response.status_code == 403:
        logger.warning(f"Requests exceed user quota, {e}")
We don't enforce closing of the dataset, and the documentation is not even very clear that you should do this (even though we offer the possibility); nor do we close img when we're testing.
We could offer using ukis-pysat as a context manager, or at least document this better and use it ourselves.
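Supporting the context manager protocol only needs __enter__/__exit__ that close the underlying dataset. A minimal stand-in sketch (the real Image would close its rasterio dataset instead of setting a flag):

```python
class Image:
    # stand-in for ukis_pysat.raster.Image, to illustrate the protocol only
    def __init__(self, path):
        self.path = path
        self.closed = False

    def close(self):
        # in ukis-pysat this would close the underlying rasterio dataset
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()


with Image("dummy.tif") as img:
    pass
print(img.closed)  # True
```

Users who prefer explicit handling can still call close() directly; the with-block just guarantees it.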
The filter method in MetadataCollection is not very flexible at the moment. Suggest expanding it to allow fuzzy search and search on list values. This can be useful, for example, to filter a MetadataCollection by a list of srcids or tileids.
We should try to further reduce the number of dependencies. In particular matplotlib and pillow are only used in one spot and should be substituted if possible.
__update_dataset() does not update dtype. The following is a simple example that shows the problem:
from ukis_pysat.raster import Image
from ukis_pysat.members import Platform
file = "D:/tmp/test/T32UPU_20200413T100549_B01.jp2"
# read image
img = Image(data=file, dimorder="last")
print(img.arr.dtype)
>> uint16
# dn2toa
# NOTE: this modifies img dtype
img.dn2toa(platform=Platform.Sentinel2)
print(img.arr.dtype)
>> float32
# subset
# BUG: this overwrites the img dtype modified by dn2toa
img.mask_image(bbox=(654700.76, 663334.23, 5324479.29, 5330987.60), crop=True, pad=False)
print(img.arr.dtype)
>> uint16
Suggest making sure that __update_dataset(), when called, takes what it can get from the array and enforces passing crs and transform information separately (similar to what we do when creating an Image from an array, just in memory).
When trying to warp an image using Image().warp() I get the following error message:
File "run_procs.py", line 119, in <module>
num_threads=settings.general["num_threads"],
File "D:\Workspaces\python\proc_s2_flood\procs\prepdata.py", line 468, in run
img = prepdata.get_image()
File "D:\Workspaces\python\proc_s2_flood\procs\prepdata.py", line 382, in get_image
num_threads=self.num_threads)
File "D:\Workspaces\python\proc_s2_flood\ukis_pysat\raster.py", line 206, in warp
self.dataset = self.__update_dataset(self.dataset.meta).open()
File "C:\Users\wiel_mc\AppData\Local\conda\conda\envs\pysatenv36new\lib\site-packages\rasterio\env.py", line 386, in wrapper
return f(*args, **kwds)
File "C:\Users\wiel_mc\AppData\Local\conda\conda\envs\pysatenv36new\lib\site-packages\rasterio\io.py", line 132, in open
writer = get_writer_for_driver(driver)
File "C:\Users\wiel_mc\AppData\Local\conda\conda\envs\pysatenv36new\lib\site-packages\rasterio\io.py", line 185, in get_writer_for_driver
raise ValueError("'driver' is required to write dataset.")
ValueError: 'driver' is required to write dataset.
This seems to be related to updating the dataset and may also occur when trying to run other methods that modify the dataset (e.g. subset) - I haven't tested this though.
Running ukis-pysat on Win 7 with Python 3.6.
We need some kind of smoothing with a Tukey window in the new implementation.
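For reference, a pure-Python Tukey (tapered cosine) window of the kind scipy.signal.windows.tukey provides; whether this is exactly the smoothing intended here is an assumption:

```python
import math


def tukey(n, alpha=0.5):
    # Tukey window: cosine tapers on both edges, flat top in between.
    # alpha=0 gives a rectangular window, alpha=1 a Hann window.
    if alpha <= 0:
        return [1.0] * n
    window = []
    edge = alpha * (n - 1) / 2  # length of each cosine taper
    for i in range(n):
        if i < edge:
            window.append(0.5 * (1 + math.cos(math.pi * (2 * i / (alpha * (n - 1)) - 1))))
        elif i <= (n - 1) * (1 - alpha / 2):
            window.append(1.0)
        else:
            window.append(0.5 * (1 + math.cos(math.pi * (2 * i / (alpha * (n - 1)) - 2 / alpha + 1))))
    return window


w = tukey(9, alpha=0.5)  # tapers to 0 at the ends, flat 1.0 in the middle
```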
Provide a method to build an array subset from window tiles.
The current workflow is to iterate over the windows returned by the get_tiles method in a for-loop. Subsetting the array is currently performed by the user by indexing the array with the window properties:
for window_tile in data.Image(img_path).get_tiles():
    (row_start, row_stop), (col_start, col_stop) = window_tile.toranges()
    array[0][row_start:row_stop, col_start:col_stop]
This can be outsourced to a new method in ukis_pysat.data:
def get_subset(self, window_tile, array, band=0):
    (row_start, row_stop), (col_start, col_stop) = window_tile.toranges()
    return array[band][row_start:row_stop, col_start:col_stop]
Performance decreases when requesting the image array for each call:
def get_subset(self, window_tile, band=0):
    (row_start, row_stop), (col_start, col_stop) = window_tile.toranges()
    return self.arr[band][row_start:row_stop, col_start:col_stop]
This can be solved by handing the array to the new method directly. However, this requires refactoring the constructor of the Image class so that arrays are accepted without handing over a dataset, as is currently done here.
Possible solution:
if isinstance(dataset, rasterio.io.DatasetReader) and isinstance(arr, np.ndarray):
    self.dataset = dataset
    self.arr = arr
elif isinstance(arr, np.ndarray):
    self.arr = arr
Currently _pad_to_bbox cannot deal with cases where pad_width happens to be < 0 (e.g., when the bbox is smaller than the image and no padding would be required). We should be able to handle such cases and not immediately throw an error. A simple way would be to set pad_width=0 in such cases. The following would be a possible implementation:
def _pad_to_bbox(self, bbox, mode="constant", constant_values=0):
    """Buffers array with biggest difference to bbox and adjusts affine transform matrix. Can be used to fill
    array with nodata values before masking in case bbox only partially overlaps dataset bounds.

    :param bbox: bounding box of type tuple or Shapely Polygon
    :param mode: str, how to pad, see rasterio.pad. Optional (default: 'constant')
    :param constant_values: nodata value, padding should be filled with, optional (default: 0)
    :return: closed, buffered dataset in memory
    """
    if isinstance(bbox, polygon.Polygon):
        bbox = bbox.bounds
    elif isinstance(bbox, tuple):
        pass
    else:
        raise TypeError("bbox must be of type tuple or Shapely Polygon")
    max_diff_ur = np.max(np.subtract(bbox[2:], tuple(self.dataset.bounds[2:])))
    max_diff_ll = np.max(np.subtract(tuple(self.dataset.bounds[:2]), bbox[:2]))
    max_diff = max(max_diff_ll, max_diff_ur)  # buffer in units
    pad_width = math.ceil(max_diff / self.dataset.transform.to_gdal()[1])  # units / pixel_size
    if pad_width < 0:
        # make sure that pad_width is not negative
        pad_width = 0
    destination = np.zeros(
        (self.dataset.count, self.__arr.shape[1] + 2 * pad_width, self.__arr.shape[2] + 2 * pad_width,),
        self.__arr.dtype,
    )
    for i in range(0, self.dataset.count):
        destination[i], transform = rasterio.pad(
            self.__arr[i], self.dataset.transform, pad_width, mode, constant_values=constant_values,
        )
    self.__arr = destination
    self.dataset.close()
    return self.__update_dataset(self.dataset.crs, transform, nodata=self.dataset.nodata)
aoi in Source() currently only accepts a geojson in WGS84. Make this more flexible to work with 1.) any CRS and 2.) files and bbox coordinates. Basically accept the same input as aoi in Image().
We could do something like this, but that would add pyproj as a dependency:
def prep_aoi(aoi, src_proj=None):
    """Converts to Geojson Geometry and reprojects aoi to WGS84

    :param aoi: aoi
    :param src_proj: pyproj.proj.Proj, Source Projection, default: None
    :return: geojson.geometry.Polygon
    """
    if isinstance(aoi, geojson.geometry.Polygon):
        aoi = shape(aoi)
    if src_proj:
        project = pyproj.Transformer.from_proj(src_proj, pyproj.Proj(init='epsg:4326'))
        aoi = transform(project.transform, aoi)
    return geojson.dumps(aoi.__geo_interface__, sort_keys=True)
In many applications it is extremely useful to be able to force-align two raster grids to match each other's crs, extent, dimensions and grid alignment. With a combination of warp and mask we can come close to aligning two rasters (this actually works in many cases). It may, however, happen that the raster grids are shifted with subpixel resolution, which prevents them from being exactly on the same grid, and therefore they cannot be compared at array level with, for example, numpy.
I propose the following solution: add a target_align option to warp, which aligns one Image instance to another one:
def warp(self, dst_crs, resampling_method=0, num_threads=4, resolution=None, nodata=None, target_align=None):
    """Reproject a source raster to a destination raster.

    :param dst_crs: CRS or dict, Target coordinate reference system.
    :param resampling_method: Resampling algorithm, int, defaults to 0 (Nearest)
        numbers: https://github.com/mapbox/rasterio/blob/master/rasterio/enums.py#L28
    :param num_threads: int, number of workers, optional (default: 4)
    :param resolution: tuple (x resolution, y resolution) or float, optional.
        Target resolution, in units of target coordinate reference system.
    :param target_align: raster to which to align resolution, extent and gridspacing, optional (Image).
    :param nodata: nodata value of source, int or float, optional.
    """
    if target_align:
        transform = target_align.dataset.transform
        width = target_align.dataset.width
        height = target_align.dataset.height
    else:
        if resolution:
            transform, width, height = calculate_default_transform(
                self.dataset.crs,
                dst_crs,
                self.dataset.width,
                self.dataset.height,
                *self.dataset.bounds,
                resolution=resolution,
            )
        else:
            transform, width, height = calculate_default_transform(
                self.dataset.crs, dst_crs, self.dataset.width, self.dataset.height, *self.dataset.bounds,
            )
    destination = np.zeros((self.dataset.count, height, width), self.__arr.dtype)
    self.__arr, transform = reproject(
        source=self.__arr,
        destination=destination,
        src_transform=self.dataset.transform,
        src_crs=self.dataset.crs,
        src_nodata=nodata,
        dst_transform=transform,
        dst_crs=dst_crs,
        dst_nodata=nodata,
        resampling=resampling_method,
        num_threads=num_threads,
    )
    self.dataset.close()
    self.dataset = self.__update_dataset(dst_crs, transform, nodata=nodata)
Is your feature request related to a problem? Please describe.
We now have the option to create a datetime object from ESA dates, it would be nice to also be able to go back. This can be helpful when naming output files.
Describe the solution you'd like
Just a simple function to change the format back.
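A simple sketch of such a reverse function, assuming the ESA date format is %Y%m%dT%H%M%S as used in Sentinel product names (the function name is illustrative):

```python
from datetime import datetime


def datetime_to_esa_date(dt):
    # hypothetical counterpart to the existing ESA-date-to-datetime helper
    return dt.strftime("%Y%m%dT%H%M%S")


print(datetime_to_esa_date(datetime(2015, 11, 2, 9, 45, 37)))  # 20151102T094537
```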
Describe the bug
In certain cases there are multiple underscores in S3 filenames, see the naming conventions.
To Reproduce
Use the following example filename S3A_OL_1_EFR____20151102T094537_20151102T094837_20151103T075458_0180_090_022______LN2_D_NT_001.SEN3 with get_ts_from_sentinel_filename().
Expected behavior
Give back 20151102T094537.
The geocoding information that we assign to Sentinel-1 quicklook images in save_quicklook_image() does not consider the varying image sizes of this sensor (unlike the fixed tile size of Sentinel-2). This may lead to misalignments of the quicklooks with respect to the actual images.
In their current implementation, unpack and pack require the file format to be specified. This could easily be extracted from the file extension. Also, the parameter names could use some renaming to be more self-explanatory. Suggest modifying these methods as follows:
def unpack(src_file, target_dir=None):
    """Unpack an archive, does not add anything to shutil, but makes aware it exists.

    :param src_file: path to source file to be extracted including extension.
    :param target_dir: directory where the archive is unpacked. If not provided, the current working directory is used.
    """
    if target_dir:
        shutil.unpack_archive(src_file, target_dir)
    else:
        shutil.unpack_archive(src_file)


def pack(src_dir, target_file):
    """Create an archive file (such as zip or tar) from a directory.

    :param src_dir: source directory to be archived.
    :param target_file: path to target file including extension.
    """
    base = os.path.basename(target_file)
    name = base.split(".")[0]
    format = base.split(".")[1]
    archive_from = os.path.dirname(src_dir)
    archive_to = os.path.basename(src_dir.strip(os.sep))
    shutil.make_archive(name, format, archive_from, archive_to)
    shutil.move("%s.%s" % (name, format), target_file)
Instancing an Image object with self.img = ukis_pysat.raster.Image(self.path_to_img) gives access to attributes of the Image class, e.g. self.img.arr.
At some point it is desired to alter self.img.arr. However, self.img.arr = self.img.arr * 10 fails with an AttributeError.
Suggest extending the property decorator. The current implementation is:
@property
def arr(self):
    """array property"""
    if self.dimorder == "first":
        return self.__arr
    else:
        return reshape_as_image(self.__arr)
This could be extended with:
@arr.setter
def arr(self, value):
    self.__arr = value
It would be great if we could provide a method for merging multiple raster files into one mosaic. With #64 we got a first implementation based on a simple use of rasterio.merge.merge.
However, a mosaic is not a simple merge of multiple rasters. We have to discuss the following points, please add if incomplete:
- ukis_pysat.raster seems logical as the place for this, as we are working on a raster basis. This implies calling the method via an Image object with src.build_mosaic(<list_of_datasets_to_merge>). So far the object src is not really connected to the mosaic itself and is more or less an initiator for the method call.
- With ukis_pysat.raster.warp and cropping the result to a desired geometry we may offer this functionality upon merging. This could also be connected to the original Image object or to another bounding box.
Is your feature request related to a problem? Please describe.
ukis_pysat.raster only treats 3D arrays. Numerous external methods work on a 2D basis, and using np.squeeze() or np.expand_dims() adds unnecessary code.
Building a ukis_pysat.raster.Image() from a 2D array will fail with a misleading error message:
rasterio._err.CPLE_FileIOError: Free disk space available is 68524220416 bytes, whereas 413266544908 are at least necessary.
We could narrow this problem down to the fact that bringing a 2D array into a 3D method simply does not work.
Describe the solution you'd like
Allow initialization of ukis_pysat.raster.Image with both 2D and 3D arrays. If I'm not mistaken, rasterio is limited to 3D arrays. @MWieland and @fwfichtner, what is your opinion on this?
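One way to allow 2D input is to promote it internally; a sketch of what __init__ could do (the helper name is hypothetical, and the axis choice follows the dimorder parameter):

```python
import numpy as np


def to_3d(arr, dimorder="first"):
    # hypothetical helper: promote a 2D array to the 3D shape rasterio expects
    if arr.ndim == 2:
        axis = 0 if dimorder == "first" else -1
        return np.expand_dims(arr, axis=axis)
    return arr


print(to_3d(np.zeros((4, 5)), dimorder="first").shape)  # (1, 4, 5)
```

3D input would pass through unchanged, so existing callers would not be affected.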
Currently write_to_file() takes width and height from the Image array but its band number (count) from the dataset.
If one initializes an Image() from an existing dataset and an array, it may happen that the wrong band number is used internally when writing to file. Take for example the case where someone computes a simple NDVI from a multi-band Image() and wants to write the result (a single-band image) back to file:
img = Image(path="multiband_file.tif")
ndvi_arr = (img.arr[3] - img.arr[2]) / (img.arr[3] + img.arr[2])
ndvi = Image(dataset=img.dataset, array=ndvi_arr)
ndvi.write_to_file("ndvi_file.tif")
This would result in an error upon writing the array to file, because img band number != ndvi band number.
The following could be a possible solution to account for this:
def write_to_file(self, path_to_file, dtype, driver="GTiff", compress=None):
    """Write a dataset to file.

    :param path_to_file: str, path to new file
    :param dtype: datatype, like np.uint16, 'float32' or 'min' to use the minimum type to represent values
    :param driver: str, optional (default: 'GTiff')
    :param compress: compression, e.g. 'lzw' (default: None)
    """
    if dtype == "min":
        dtype = get_minimum_dtype(self.__arr)
    profile = self.dataset.meta
    profile.update(
        {
            "driver": driver,
            "height": self.__arr.shape[-2],
            "width": self.__arr.shape[-1],
            "count": self.__arr.shape[0],
            "dtype": dtype,
            "transform": self.transform,
            "crs": self.crs,
        }
    )
    if compress:
        profile.update({"compress": compress})
    with rasterio.open(path_to_file, "w", **profile) as dst:
        dst.write(self.__arr.astype(dtype), indexes=range(1, self.__arr.shape[0] + 1))
GDAL offers building virtual raster datasets (VRTs), and so does rasterio; this gives us the possibility to implement it with an easy-to-use interface.
Consider the following:
import os

import rasterio.shutil
from ukis_pysat import raster as pysat_raster

vrt_dir = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'tests', 'testfiles', 'vrt')

for sample in os.listdir(vrt_dir):
    src = pysat_raster.Image(os.path.join(vrt_dir, sample))
    with rasterio.vrt.WarpedVRT(src.dataset, crs=src.dataset.crs, resampling=rasterio.enums.Resampling.nearest) as vrt:
        rasterio.shutil.copy(vrt, os.path.join(vrt_dir, 'test.vrt'), driver='VRT')

vrt = pysat_raster.Image(os.path.join(vrt_dir, 'test.vrt'))
We use the coordinate order of rasterio when loading images, which is channels first (channels, rows, cols). Most image processing, visualization and machine learning libraries, however, use channels last (rows, cols, channels) for image coordinate convention (e.g. scikit-image, opencv, pillow, matplotlib, etc.).
In order to avoid confusion and to adjust to user requirements more easily it would be great if we could provide an option to change the coordinate order upon instantiation of an image object.
I know that Rasterio provides conversion methods here, but it would be convenient if the order could be set once (e.g. as env var or parameter in Image init) to comply with a user's requirements.
Tensorflow for example provides such a switch between channels first and channels last to account for performance differences when training on GPU or CPU.
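For reference, the two conventions are simple transposes of each other; this is essentially what rasterio's reshape_as_image / reshape_as_raster helpers compute, expressed here with plain numpy:

```python
import numpy as np


def channels_first_to_last(arr):
    # (bands, rows, cols) -> (rows, cols, bands)
    return np.transpose(arr, (1, 2, 0))


def channels_last_to_first(arr):
    # (rows, cols, bands) -> (bands, rows, cols)
    return np.transpose(arr, (2, 0, 1))


print(channels_first_to_last(np.zeros((3, 100, 200))).shape)  # (100, 200, 3)
```

Setting the convention once at Image init and converting at the array property boundary would keep the internal representation fixed.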
When creating an Image instance from an array, the array dimorder is not considered anymore.
The Image.write_to_file() method could be expanded to support optional image compression (e.g. Packbits or LZW). This is necessary to reduce the storage space for example when saving binary masks to disk.
To achieve this we basically would have to update the profile with "compress" and optionally also "nbits" attributes. Beyond this, the "tiled" attribute could be useful to support.
Also, this method defines uint16 as the default dtype. In my opinion it would be more reasonable to set the default dtype to the dtype of the Image array. This would keep image loading and writing consistent.
Describe the bug
Metadata from EarthExplorer does not return the correct Platformname anymore (like it used to?). Parsing suddenly goes wrong.
To Reproduce
Run the test for query_metadata, like https://github.com/dlr-eoc/ukis-pysat/runs/950561059
Expected behavior
The platform name should be a part of Platform(Enum).
Additional context
This is the reason all our tests currently fail.
It would be great to be able to read metadata from local file system so that we can perform tasks like: Give me all S2 scenes in directory x with less than y% cloud cover. TinyDB could be a possible package to get started with.
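Independent of the storage backend (TinyDB or otherwise), the query boils down to filtering local metadata records. A sketch with illustrative field names:

```python
def filter_scenes(records, platform, max_cloudcover):
    # "give me all scenes of <platform> with less than <max_cloudcover>% cloud cover";
    # records without a cloud cover value are treated as fully clouded
    return [
        r for r in records
        if r.get("platformname") == platform and r.get("cloudcoverpercentage", 100) < max_cloudcover
    ]


records = [
    {"srcid": "a", "platformname": "Sentinel-2", "cloudcoverpercentage": 5.0},
    {"srcid": "b", "platformname": "Sentinel-2", "cloudcoverpercentage": 80.0},
    {"srcid": "c", "platformname": "Sentinel-1"},
]
print([r["srcid"] for r in filter_scenes(records, "Sentinel-2", 20)])  # ['a']
```

With TinyDB the same records could be persisted to a JSON file and queried with its Query objects instead of the list comprehension.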
It is possible to initialize an Image() object from an existing rasterio.dataset and an array as follows:
img = Image(dataset=ds, array=arr, dimorder="first")
To work correctly, this assumes that the array is of shape (bands, rows, cols) regardless of the dimorder parameter setting.
Suggest modifying the init of Image to consider this case and allow passing arrays of different dimorder. The following would be a working example:
def __init__(self, path=None, dataset=None, arr=None, dimorder="first"):
    if path:
        if isinstance(path, str):
            self.dataset = rasterio.open(path)
            self.__arr = self.dataset.read()
        else:
            raise TypeError("path must be of type str")
    else:
        if isinstance(dataset, rasterio.io.DatasetReader) and isinstance(arr, np.ndarray):
            self.dataset = dataset
            if dimorder == "first":
                self.__arr = arr
            elif dimorder == "last":
                self.__arr = reshape_as_raster(arr)
            else:
                raise AttributeError("dimorder for bands or channels must be either 'first' or 'last'.")
        else:
            raise TypeError(
                "dataset must be of type rasterio.io.DatasetReader and arr must be of type numpy.ndarray"
            )
    self.dimorder = dimorder
    self.transform = self.dataset.transform
    self.crs = self.dataset.crs
    self.da_arr = None
Please note that this modification would not interfere with any internal array manipulation but just allow for consistent flexible array IO to Image().
We should implement a boundless padding with reflect mode for the tile creation.
For being boundless, we only need to not intersect with the bounding_window: .intersection(bounding_window) (link)
The padding might be rather complex in combination with the dataset.
Is your feature request related to a problem? Please describe.
To use UKIS-pysat we always require GDAL and other heavy dependencies to be installed. That's not very handy for lightweight and small Docker images.
Describe the solution you'd like
Split up the package into subsets as suggested here, so you can install only the functionality you need:
pip install ukis-pysat # would then only install ukis_pysat.file
pip install ukis-pysat[complete] # would then install everything
pip install ukis-pysat[data] # would then install ukis_pysat.file and ukis_pysat.data
pip install ukis-pysat[raster] # would then install ukis_pysat.file and ukis_pysat.raster
Additional context
We need to carefully describe this in the Documentation, similar to what they have done in dask.
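In setuptools this maps to extras_require; a sketch with illustrative dependency lists (which dependencies belong to which subset would still need to be decided):

```python
# hypothetical setup.py fragment
extras_require = {
    "data": ["requests"],
    "raster": ["rasterio", "numpy"],
}
# "complete" is the union of all subsets
extras_require["complete"] = sorted({dep for deps in extras_require.values() for dep in deps})

print(extras_require["complete"])  # ['numpy', 'rasterio', 'requests']
```

Code in the optional modules would then raise an informative ImportError when a subset's dependencies are missing, as dask does.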
When plotting a pandas DataFrame that holds the return values of a MetadataCollection, the columns are cut off at a fixed length. To avoid this behaviour, suggest adding pd.set_option().
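Two standard pandas display options lift the truncation (where exactly to set them in MetadataCollection is open):

```python
import pandas as pd

pd.set_option("display.max_colwidth", None)  # never cut off cell contents
pd.set_option("display.max_columns", None)   # show all columns
```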
Creating an Image.dataset from a numpy array produces a rasterio dataset with the following metadata:
meta = {
"driver": "GTiff",
"dtype": self.__arr.dtype,
"nodata": nodata,
"height": self.__arr.shape[-2],
"width": self.__arr.shape[-1],
"count": self.__arr.shape[0],
"crs": crs,
"transform": transform,
}
The instance nodata is given as an argument of the method __update_dataset(self, crs, transform, nodata=None). However, nodata is not passed in the call self.__update_dataset(crs, transform), which leaves the new dataset without a nodata value.
I suggest introducing a nodata value when creating datasets from arrays, i.e. self.__update_dataset(crs, transform, nodata=nodata). This requires handing the nodata value to the instantiation of the Image class, __init__(self, data, dimorder="first", crs=None, transform=None, nodata=None), if usage is desired.
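The propagation can be sketched as follows; the class is heavily simplified (no rasterio dataset, no dimorder handling), only the nodata flow from __init__ into the metadata is shown:

```python
import numpy as np

class Image:
    """Simplified sketch; only the nodata propagation is of interest."""

    def __init__(self, data, dimorder="first", crs=None, transform=None, nodata=None):
        self.__arr = data
        # pass nodata through instead of dropping it
        self.__update_dataset(crs, transform, nodata=nodata)

    def __update_dataset(self, crs, transform, nodata=None):
        self.meta = {
            "driver": "GTiff",
            "dtype": self.__arr.dtype,
            "nodata": nodata,  # now set when provided by the caller
            "height": self.__arr.shape[-2],
            "width": self.__arr.shape[-1],
            "count": self.__arr.shape[0],
            "crs": crs,
            "transform": transform,
        }

img = Image(np.zeros((3, 4, 5)), nodata=0)
```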
Implement a test for dn2toa().
For any image preprocessing and analysis steps we should be able to store and keep track of the band information of a satellite image. In particular, we should know which band index refers to which wavelength and which unit (e.g. digital numbers, top of atmosphere, sigma naught, etc.). This is essential to be able to compute even the simplest band indices (e.g. NDVI) or to select an appropriate band subset during preprocessing.
We already have a dedicated metadata attribute for this ("bandlist") but do not fill it upon metadata creation, because this information is not directly provided by the source metadata (e.g. EarthExplorer). It could look like the following for a band subset of a Sentinel-2 image:
'bandlist': [
    {'index': 0, 'name': 'B02', 'wavelength': 'blue', 'unit': 'DN'},
    {'index': 1, 'name': 'B03', 'wavelength': 'green', 'unit': 'DN'},
    {'index': 2, 'name': 'B04', 'wavelength': 'red', 'unit': 'DN'},
    {'index': 3, 'name': 'B08', 'wavelength': 'nir', 'unit': 'DN'},
    {'index': 4, 'name': 'B11', 'wavelength': 'swir1', 'unit': 'DN'},
    {'index': 5, 'name': 'B12', 'wavelength': 'swir2', 'unit': 'DN'}
]
We could compile this ourselves from a bunch of lookup tables for unique combinations of "Platform" and "Producttype". We basically do this already here for Landsat.
This should be put in a public place (possibly here) and be expanded upon. Then we can compile the respective bandlist "on the fly" when we query the satellite image metadata. This would also allow a user to simply update it in a standardized way later on as part of a more complex processing chain.
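A lookup-table sketch of the idea, keyed on (platform, producttype); the key "S2MSI1C" is an assumed example, and only the Sentinel-2 subset from above is filled in:

```python
# Sketch of a bandlist lookup table per (platform, producttype).
# The producttype key "S2MSI1C" is an assumption for illustration.
BANDLISTS = {
    ("Sentinel-2", "S2MSI1C"): [
        {"index": 0, "name": "B02", "wavelength": "blue",  "unit": "DN"},
        {"index": 1, "name": "B03", "wavelength": "green", "unit": "DN"},
        {"index": 2, "name": "B04", "wavelength": "red",   "unit": "DN"},
        {"index": 3, "name": "B08", "wavelength": "nir",   "unit": "DN"},
        {"index": 4, "name": "B11", "wavelength": "swir1", "unit": "DN"},
        {"index": 5, "name": "B12", "wavelength": "swir2", "unit": "DN"},
    ],
}

def compile_bandlist(platform, producttype):
    """Fill the 'bandlist' metadata attribute on the fly during a query;
    unknown combinations return an empty list."""
    return BANDLISTS.get((platform, producttype), [])
```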
The GeoTIFF-specific nbits option allows creating a file with less than 8 bits per sample by passing a value from 1 to 7. This becomes particularly useful when writing binary masks, as it can save significant disk space. Related to #46. Therefore, I suggest adding this option to the write_to_file method.
We could consider using pathlib instead of os.path for better readability and a more modern API.
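A small side-by-side sketch of the difference (the file names are made up for illustration):

```python
import os.path
from pathlib import Path

# os.path style: nested function calls
src = os.path.join("scenes", "S2A_tile", "B04.jp2")
stem = os.path.splitext(os.path.basename(src))[0]

# pathlib style: one object, chainable operators and properties
p = Path("scenes") / "S2A_tile" / "B04.jp2"
assert p.stem == stem       # same result, clearer intent
assert str(p) == src        # interchangeable with the os.path string
```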
Connecting to a datahub requires user credentials to be set for the target datahub, e.g. Copernicus SciHub.
Calling src = Source(source=Datahub.Scihub) looks for user credentials but fails if they are not set in the environment:
File "<env_dir>\lib\site-packages\ukis_pysat\data.py", line 57, in __init__
self.user = env_get("SCIHUB_USER")
File "<env_dir>\lib\site-packages\ukis_pysat\file.py", line 23, in env_get
raise KeyError(f"No environment variable {key} found")
KeyError: 'No environment variable SCIHUB_USER found'
The documentation on this is rather sparse and should be more detailed.
There are different ways of providing user credentials, depending on the OS and development tools, e.g. from within Python:
os.environ["SCIHUB_USER"] = "Tim"
os.environ["SCIHUB_PW"] = "TheEnchanter"
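A defensive pattern, in the spirit of env_get from the traceback above, is to check the credentials early and fail with a message that points at the fix (the helper name require_env is hypothetical):

```python
import os

def require_env(key):
    """Return the value of an environment variable, or raise a KeyError
    that tells the user how to set it (hypothetical helper)."""
    try:
        return os.environ[key]
    except KeyError:
        raise KeyError(
            f"No environment variable {key} found - set it before "
            f"connecting to the datahub, e.g. os.environ['{key}'] = '...'"
        )

os.environ["SCIHUB_USER"] = "Tim"
user = require_env("SCIHUB_USER")
```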
The method query_metadata takes a set of arguments and fills the sentinelsat API query.
Additional requirements such as the product type, e.g. GRD for Sentinel-1 data, are not accepted.
A solution would be to accept **kwargs at the method call. However, kwargs is currently emptied several times, e.g. here. This should be taken into account when implementing this feature.
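A sketch of forwarding **kwargs without the emptying issue: work on a local copy so that popping values never mutates a dict the caller (or a later code path) still needs. The function body is simplified and does not call the real sentinelsat API.

```python
def query_metadata(platform, date, aoi, **kwargs):
    """Simplified sketch of forwarding extra query filters.
    Does not call sentinelsat; it only builds the query dict."""
    kwargs = dict(kwargs)  # local copy; popping leaves the original intact
    producttype = kwargs.pop("producttype", None)  # e.g. "GRD" for Sentinel-1
    query = {"platformname": platform, "date": date, "area": aoi}
    if producttype:
        query["producttype"] = producttype
    query.update(kwargs)  # pass any remaining hub-specific filters through
    return query
```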
Some of the field types in Metadata need revision. In particular, the fields processingsteps and bandlist should be of type list (with default value None), rather than str.