microsoft / torchgeo

TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data

Home Page: https://www.osgeo.org/projects/torchgeo/
License: MIT License
During the `GeoDataset` refactor (#37), all existing `GeoDataset`s were moved to `VisionDataset`. Now that the `GeoDataset` API has settled down, we should attempt to convert many of these `VisionDataset`s that have geospatial information back to `GeoDataset`.

We may need a `STACDataset` base class that subclasses `GeoDataset` and describes how to pull geospatial information from STAC JSON files.
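A minimal sketch of the metadata-extraction half of such a base class, assuming STAC items arrive as plain JSON dicts (the `STACDataset` name and `_parse_item` helper are hypothetical, not existing torchgeo API):

```python
import json


class STACDataset:
    """Hypothetical GeoDataset subclass that indexes STAC items.

    Sketch only: pulls the bounding box, timestamp, and CRS out of a
    STAC item's JSON, which is what a spatiotemporal index would need.
    """

    @staticmethod
    def _parse_item(item: dict) -> dict:
        props = item.get("properties", {})
        return {
            "bbox": item["bbox"],                 # [minx, miny, maxx, maxy]
            "datetime": props.get("datetime"),    # ISO 8601 string
            "crs": props.get("proj:epsg", 4326),  # proj extension, else WGS 84
        }


# Usage with a toy STAC item:
item = json.loads(
    '{"bbox": [0, 0, 1, 1], '
    '"properties": {"datetime": "2021-01-01T00:00:00Z"}}'
)
meta = STACDataset._parse_item(item)
```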
Like `torchvision`, all of our datasets have a `base_folder` attribute that specifies the subdirectory of `root` that we want to store downloaded data to. However, this also prevents users from using data that already exists on a system with a different directory structure. Most users will only need 1 or 2 datasets at a time, so instead of specifying `data` and having data end up in `data/sentinel2`, users can just specify `data/sentinel2` themselves. This gives more flexibility and simplifies the dataset code.
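In other words (a sketch of the proposal; the `sentinel2` folder name is just illustrative):

```python
from pathlib import Path

# Current behavior: the dataset class appends its own base_folder to root
root = Path("data")
base_folder = "sentinel2"       # chosen by the dataset class, not the user
current = root / base_folder    # data/sentinel2

# Proposed behavior: the user passes the final directory directly, so
# pre-existing data in any layout can be used as-is
proposed = Path("data/sentinel2")
```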
Add datasets for various generations of Sentinel satellites.
https://github.com/Microsoft/CanadianBuildingFootprints
This dataset contains 11,842,186 computer generated building footprints in all Canadian provinces and territories. This data is freely available for download and use.
For GeoDatasets, at sampling time, we should know the CRS, bounding box, and resolution of the image that gets sampled. As the image is passed through transforms like `Resample` or `Warp`, we should be able to recompute the new CRS/bbox/res pretty easily. However, depending on the padding and stride used in convolutional and pooling layers, the bounding box and resolution may change significantly as the image is passed through the model. In order to save or stitch together our predictions, we'll need to be able to compute the new bbox/res.

We have two possible options:

1. …
2. A wrapper around `nn.Module` (the neural network) that computes the resulting bbox/res.

In the short-term, we will likely go with 1 since it involves the least work. In the long-term, we may end up going with 2 since we'll want to be able to design networks that can take advantage of this kind of geospatial information.
Here are some ideas:
When using ZipDataset with random samplers, the index should come from whichever dataset is tile-based. When using ZipDataset with grid samplers, the index should come from whichever dataset is not tile-based. Not yet sure how to handle something like Landsat + Sentinel, but we can figure that out another day.
Class hierarchy:
Make sure to document the difference between samplers and batch samplers and when to use which. Should store samplers and batch samplers in different files and combine in `__init__` like we do with datasets. Add `utils.py` for things like `_to_tuple`.
Question: if I'm using an LRU cache and BatchSampler and multiple workers, if something isn't yet in the cache, will PyTorch spawn multiple workers all trying to warp the entire tile? It may actually be faster to use a single worker in this case.
The VHR-10 Dataset consists of both "positive" images that contain objects of interest, as well as "negative" images that only contain background data. Currently, our dataset only handles positive images. We could add a `split` argument that allows users to select between "positive", "negative", and "both" image sets.

The problem is that this greatly increases the complexity of the code in the data loader because the annotations file doesn't list annotations/filenames for negative images. For this reason, even `torchvision`'s COCO dataset doesn't contain unlabeled images.
We should set up some kind of Azure pipeline that can automatically generate and deploy new pre-trained weights whenever a new model/dataset is added or whenever we create a new release.
The following list enumerates all tasks that need to be completed before:
This list may change over time as we reevaluate the remaining tasks that need to be done.
https://drcog.org/services-and-resources/data-maps-and-modeling/regional-land-use-land-cover-project
A pilot land use land cover endeavor was undertaken by DRCOG, the Babbitt Center for Land and Water Policy, and the Conservation Innovation Center in 2019. During this pilot, 1,000 square miles of the Denver region were classified at 1-meter resolution using high-resolution imagery acquired as part of the 2018 Denver Regional Aerial Photography Project. Eight classes were identified: structures, impervious surface, water, grassland/prairie, tree canopy, irrigated lands/turf, cropland and barren/rock.
So2Sat LCZ42: A Benchmark Dataset for Global Local Climate Zones Classification
Links:
All the tiffs in the Chesapeake series have embedded color tables except for the Chesapeake7 tiff. This causes the `cmap = src.colormap(1)` line to throw a `ValueError`.
We need to figure out how to render Jupyter Notebooks in our documentation so that we can provide easy-to-use tutorials for new users. This should work similarly to https://pytorch.org/tutorials/.
Ideally I would like to be able to test these tutorials so that they stay up-to-date.
The ChesapeakeMD dataset fails when attempting to extract the downloaded zip file, as `zipfile.ZipFile` doesn't support the deflate64 compression type that `_MD_STATEWIDE.zip` uses.
Use sphinx/napoleon to generate documentation and upload it to readthedocs.io. This will be a good chance to reserve the torchgeo domain. Only problem is that I don't think readthedocs has free private docs, so this may have to wait until it is public.
The availability of curated large-scale training data is a crucial factor for the development of well-generalizing deep learning methods for the extraction of geoinformation from multi-sensor remote sensing imagery. While quite some datasets have already been published by the community, most of them suffer from rather strong limitations, e.g. regarding spatial coverage, diversity or simply number of available samples. Exploiting the freely available data acquired by the Sentinel satellites of the Copernicus program implemented by the European Space Agency, as well as the cloud computing facilities of Google Earth Engine, we provide a dataset consisting of 180,662 triplets of dual-pol synthetic aperture radar (SAR) image patches, multi-spectral Sentinel-2 image patches, and MODIS land cover maps. With all patches being fully georeferenced at a 10 m ground sampling distance and covering all inhabited continents during all meteorological seasons, we expect the dataset to support the community in developing sophisticated deep learning-based approaches for common tasks such as scene classification or semantic segmentation for land cover mapping.
The following libraries provide APIs for performing augmentations:
https://captain-whu.github.io/DOTA/index.html
DOTA is a large-scale dataset for object detection in aerial images. It can be used to develop and evaluate object detectors in aerial images. The images are collected from different sensors and platforms. Each image is of the size in the range from 800 × 800 to 20,000 × 20,000 pixels and contains objects exhibiting a wide variety of scales, orientations, and shapes.
For transforms/trainers/models, the rST file simply tells Sphinx to automatically generate all documentation. For datasets, we instead hard-code the order and section titles, meaning that the file needs to be updated every time a new dataset is added. This isn't that much work, but I keep forgetting to do it. We should see if there's an easy way to get some of the same structure we have now and still autogenerate the entire documentation page.
Things to investigate:
- Can we use `__all__` to change the order in which each dataset gets documented? If not, we'll just have to stop caring about the order.
- Can we restructure things so that `datasets.rst` never needs to be updated?

This dataset contains high-resolution aerial imagery from the USDA NAIP program [1], high-resolution land cover labels from the Chesapeake Conservancy [2], low-resolution land cover labels from the USGS NLCD 2011 dataset [3], low-resolution multi-spectral imagery from Landsat 8 [4], and high-resolution building footprint masks from Microsoft Bing [5], formatted to accelerate machine learning research into land cover mapping.
When using pytest, deprecation warnings for both torchgeo and all of its dependencies are displayed. The version of tensorboard I'm using raises hundreds of deprecation warnings:
.spack-env/view/lib/python3.8/site-packages/tensorboard/compat/proto/tensor_shape_pb2.py:18
/home/t-astewart/torchgeo/.spack-env/view/lib/python3.8/site-packages/tensorboard/compat/proto/tensor_shape_pb2.py:18: DeprecationWarning: Call to deprecated create function FileDescriptor(). Note: Create unlinked descriptors is going to go away. Please use get/find descriptors from generated code or query the descriptor_pool.
DESCRIPTOR = _descriptor.FileDescriptor(
...
.spack-env/view/lib/python3.8/site-packages/tensorboard/util/tensor_util.py:114
/home/t-astewart/torchgeo/.spack-env/view/lib/python3.8/site-packages/tensorboard/util/tensor_util.py:114: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
np.bool: SlowAppendBoolArrayToTensorProto,
The first set of warnings about `FileDescriptor`/`FieldDescriptor` seems to have been fixed in master, as these are no longer present in the source code.
The second set of warnings was fixed in tensorflow/tensorboard#5138.
We should silence these specific warnings since they will be taken care of when a new tensorboard release comes out.
The default config files use the GPU.
We should replace all of our print statements and verbose parameters with Python's logging library. This will allow for more uniform access to messages. I'm thinking of the following levels:
- `logging.INFO`: all operations that could be slow (download, checksum, decompression, extraction, indexing)
- `logging.DEBUG`: all file access

We should add a `torchgeo.utils` package that provides utilities for common operations, including stitching together patches to create a prediction on an entire tile. This is equivalent to `torchvision.utils`. See https://arxiv.org/pdf/1805.12219.pdf for a survey of common stitching techniques that we should support. This includes clipping (best default) and averaging. May also want to add weighted averaging. Let's see what libraries exist for this. If rasterio can do this for us for free, that would be awesome.
Dataset of Sentinel-2 imagery and crop types used in one of the competitions in the CV4A workshop at ICLR 2020.
Links:
https://github.com/rwightman/pytorch-image-models
It may be useful to add wrappers around some of these models with pre-trained weights for satellite imagery tasks.
Once we release a paper on TorchGeo, we can use the following citation format file to directly add citation instructions to the repo:
This does not yet support citing external papers, but that feature is coming soon:
Before releasing, we should determine the minimum supported version of each dependency. We should also consider adding a test against these minimum versions just to make sure compatibility doesn't silently break.
The Cars Overhead With Context (COWC) data set is a large set of annotated cars from overhead. It is useful for training a device such as a deep neural network to learn to detect and/or count cars.
Our number of dependencies is rapidly increasing. We should think about which of these dependencies are required (`install_requires`) vs. optional (`extras_require`). Here is a proposal:

This gets a bit tricky, and is currently at odds with our dependency list. For example, `rasterio` is only used in `RasterDataset`, and `fiona` is only used in `VectorDataset`. While almost half of our datasets are `RasterDataset`s, only 1 is currently a `VectorDataset`. Also, `matplotlib` is only needed to plot example samples.
Here is another proposal:
This may be a better default. Most users will likely run `pip install torchgeo`, which will install only the things in `install_requires`. We don't want a useless installation to be the default, and `extras_require` is off by default. Alternatively, we could do a better job of documenting the recommended way to install TorchGeo and specify that you may want `pip install torchgeo[datasets,train]` or something like that.

Another thing to consider is how to handle these optional imports. We can't put them at the module level (in the case of datasets) so we use lazy imports instead. We also may want to create a wrapper like `importorraise` (akin to `pytest`'s `importorskip`) that prints a more useful error message upon `ImportError`. We could go even further and use lazy imports for almost all imports, not just optional ones. This will greatly speed up importing torchgeo.
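A sketch of what `importorraise` could look like (the name comes from the `importorskip` analogy above; the `extra` parameter and error message wording are assumptions):

```python
import importlib


def importorraise(name, extra=None):
    """Import an optional dependency, or fail with an actionable message.

    extra: name of the torchgeo extras group that would provide the
    dependency, e.g. "datasets" (hypothetical grouping).
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        hint = f"torchgeo[{extra}]" if extra else name
        raise ImportError(
            f"{name} is not installed and is required for this feature; "
            f"try `pip install {hint}`"
        ) from None


# Usage inside a dataset's lazy import:
json = importorraise("json")  # stdlib, so this always succeeds
```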
Before `RasterDataset` and `VectorDataset` were added, each dataset class had to be tested separately. Now that most of the logic has been consolidated in `RasterDataset` and `VectorDataset`, we should move those tests to `tests/datasets/test_geo.py`. This will make it much easier to add new datasets without having to add expansive tests.
https://www.maxar.com/products/satellite-imagery
It seems like the imagery isn't free/open-source, but they do have samples we could use to write a data loader: https://resources.maxar.com/product-samples
The `__getitem__` methods in Chesapeake and CDL datasets return a key "masks" while SEN12MS and LandcoverAI return "mask". Should we choose one or is there some reason we need to differentiate?
We want to test several popular image sources, as well as both raster and vector labels.
There is also a question of which file formats to test. For example, sampling from GeoJSON can take 3 min per getitem, whereas ESRI Shapefile only takes 1 sec per getitem (#69 (comment)).
For the warping strategy, we should test the following possibilities:
What is the upfront cost of these pre-processing steps?
Example notebook: https://gist.github.com/calebrob6/d9bc5609ff638d601e2c35a1ab0a2dec
The dataset contains 516M building detections, across an area of 19.4M km2 (64% of the African continent).
For each building in this dataset we include the polygon describing its footprint on the ground, a confidence score indicating how sure we are that this is a building, and a Plus Code corresponding to the centre of the building. There is no information about the type of building, its street address, or any details other than its geometry.
All of our datasets share various utilities for download/checksum/decompression/extraction of datasets available online. For now, I've been trying to use the utilities provided in `torchvision.datasets.utils` as much as possible, but these have many limitations. Specifically, the decompression/extraction logic doesn't handle many of our dataset formats (bz2, rar, etc.). I was trying to submit PRs to add these features to torchvision, but they don't seem interested in many of them. Even if they do get merged, they will require a dependency on torchvision@master to actually use. We may want to write our own utilities instead of using torchvision's, at least for decompression/extraction. I'm a little afraid of writing my own utilities for downloading because they are complicated (especially for Google Drive) and would require internet access to test (slow).
For more info, see:
Many datasets like Sentinel2 have a different resolution per band. Currently we don't handle this and things crash when you try to concatenate bands with different shape. There are a few options for how to handle this:
Option 3 makes it hard to automatically detect the "image" keys during data augmentation, but offers the greatest flexibility for modeling. Options 1 and 2 aren't mutually exclusive, and are probably the easiest to implement in the short term.
We also need to add transforms for resampling.
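A minimal nearest-neighbour sketch of such a resampling step, assuming bands arrive as 2-D lists and the coarser band is upsampled by an integer factor (`upsample_nearest` is a hypothetical name; a real transform would likely use `torch.nn.functional.interpolate` or rasterio's resampling instead):

```python
def upsample_nearest(band, factor):
    """Upsample a 2-D band by an integer factor with nearest-neighbour.

    Enough to make, e.g., a 20 m Sentinel-2 band match a 10 m band
    (factor=2) so the bands can be stacked into one tensor.
    """
    h, w = len(band), len(band[0])
    return [
        [band[i // factor][j // factor] for j in range(w * factor)]
        for i in range(h * factor)
    ]


coarse = [[1, 2], [3, 4]]
fine = upsample_nearest(coarse, 2)  # now 4x4, stackable with 2x-finer bands
```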
Need better docs describing the format of the dataset.
https://github.com/qubvel/segmentation_models.pytorch
It may be useful to add wrappers around some of these models with pre-trained weights for satellite imagery tasks.
We don't really need it for plotting, and it adds a lot of complexity to get it installed in CI.
https://www.fsa.usda.gov/programs-and-services/aerial-photography/imagery-programs/naip-imagery/
The National Agriculture Imagery Program (NAIP) acquires aerial imagery during the agricultural growing seasons in the continental U.S. A primary goal of the NAIP program is to make digital ortho photography available to governmental agencies and the public within a year of acquisition.
NAIP is administered by the USDA's Farm Service Agency (FSA) through the Aerial Photography Field Office in Salt Lake City. This "leaf-on" imagery is used as a base layer for GIS programs in FSA's County Service Centers, and is used to maintain the Common Land Unit (CLU) boundaries.
What is the expected behavior when a dataset is not downloaded, and someone passes `download=True`, `checksum=False`?
I would expect that we download the dataset if it doesn't exist, but not verify that download to be correct, however I think the LandcoverAI dataset (at least) will just return that the dataset exists.
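The behavior I would expect, sketched with hypothetical injected callables (this is not the actual torchgeo implementation, just the semantics described above):

```python
def ensure_dataset(exists, download, checksum, fetch, verify):
    """Expected semantics of the download/checksum flags.

    exists/fetch/verify are callables injected so the logic is testable;
    real datasets would check the filesystem instead.
    """
    if not exists():
        if not download:
            raise RuntimeError("Dataset not found; use download=True to fetch it")
        fetch()
    # checksum=False means: trust whatever is on disk (or was just fetched)
    if checksum and not verify():
        raise RuntimeError("Dataset found but checksum does not match")
```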
We should add some kind of glossary to document common terms that may be unfamiliar to either:
Examples:
See https://sublime-and-sphinx-guide.readthedocs.io/en/latest/glossary.html
The LandCover.ai (Land Cover from Aerial Imagery) dataset is a dataset for automatic mapping of buildings, woodlands, water and roads from aerial images.
We should add a GeoDataset for OpenStreetMap: https://www.openstreetmap.org/
There are a couple of approaches that we could take:
- … in `__getitem__` (slow)

Use pytest and codecov to get 100% coverage and prevent bugs.
https://landsat.gsfc.nasa.gov/
Add data loaders for various generations of Landsat data.
In my mind, there are several different reasons that someone might want to combine two GeoDatasets:
Right now, ZipDataset is designed to exclusively handle case 1. Case 2 doesn't work because the "image" key gets replaced instead of merged or concatenated. Case 3 doesn't work because we explicitly check for overlap between datasets.
We need to think about whether it is possible to support all possible use cases, whether there are any additional use cases, and how to implement this support. Ideally, these could all be wrapped into `ZipDataset` so that addition handles everything. Hopefully we don't need to add an additional ABC for `MergeDataset` or something like that.
We should add integration with IPython by defining a `_repr_png_` method that displays an image using the appropriate RGB channels.
It seems like this is defined at the class level, so we would have to either register this for ALL torch.Tensor objects (doesn't allow customization for which bands to use) or do something else at the Dataset level.
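A sketch of the Dataset-level option: wrap the sample in a small display object that implements `_repr_png_`, so IPython renders it inline without registering anything on `torch.Tensor`. The wrapper class, and the hand-rolled grayscale PNG encoder standing in for real rendering, are illustrative only:

```python
import struct
import zlib


def _png_chunk(tag, data):
    # length + tag + data + CRC32 over tag+data, per the PNG spec
    chunk = tag + data
    return struct.pack(">I", len(data)) + chunk + struct.pack(">I", zlib.crc32(chunk))


def encode_png(pixels):
    """Encode a 2-D list of 0-255 grayscale values as PNG bytes."""
    h, w = len(pixels), len(pixels[0])
    raw = b"".join(b"\x00" + bytes(row) for row in pixels)  # filter byte 0 per row
    ihdr = struct.pack(">IIBBBBB", w, h, 8, 0, 0, 0, 0)     # 8-bit grayscale
    return (b"\x89PNG\r\n\x1a\n"
            + _png_chunk(b"IHDR", ihdr)
            + _png_chunk(b"IDAT", zlib.compress(raw))
            + _png_chunk(b"IEND", b""))


class SampleDisplay:
    """Wraps a dataset sample so IPython shows it inline (hypothetical)."""

    def __init__(self, band):
        self.band = band  # 2-D list of grayscale values; real code would pick RGB bands

    def _repr_png_(self):
        # IPython's display machinery calls this automatically and
        # renders the returned PNG bytes
        return encode_png(self.band)
```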