sat-extractor's Introduction


SatExtractor

Build, deploy, and extract imagery from public satellite constellations with a single command line.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Authentication
  5. Contributing
  6. License
  7. Citation
  8. Acknowledgments

About The Project

  • tldr: SatExtractor fetches all revisits in a date range, from any public satellite constellation, for a given GeoJSON region, and stores them in a cloud-friendly format.

The sheer volume of image data makes it difficult to create datasets to train models quickly and reliably. Existing methods for extracting satellite images take a long time to process and impose user quotas that restrict access.

Therefore, we created SatExtractor, an open-source extraction tool that performs worldwide dataset extractions using serverless providers such as Google Cloud Platform or AWS, built on a common existing standard: STAC.

The tool scales horizontally as needed, extracting revisits and storing them in zarr format to be easily used by deep learning models.
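Once an extraction has finished, the resulting arrays can be read back with plain zarr. A minimal sketch, assuming a gcp deployment and an invented store path (the real layout depends on your storage_root, dataset name and preparer config):

import gcsfs
import zarr

fs = gcsfs.GCSFileSystem()  # picks up your GCP credentials

# Hypothetical patch path; inspect your bucket for the actual layout.
arr = zarr.open_array(
    fs.get_mapper("gs://my-bucket/cordoba/patch_0/sentinel-2/data"),
    mode="r",
)
first_revisit_band = arr[0, 0]  # loads a single revisit/band as a numpy array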

It is fully configurable using Hydra.

(back to top)

Getting Started

SatExtractor needs a cloud provider to work. Before you start using it, you'll need to create and configure a cloud provider account.

We provide the implementation to work with Google Cloud, but SatExtractor is implemented to be easily extensible to other providers.

Structure

The package has a modular, configurable structure: essentially a pipeline of 6 steps, each in its own module.

  • Builder: contains the logic to build the container that will run the extraction.

    SatExtractor is based on a Docker container. The Dockerfile in the root dir builds the core package; a reference to the specific provider extraction logic must be added to it explicitly (see the gcp example in providers/gcp).

    This is done by setting the ENV PROVIDER variable to point to the provider directory. In the default Dockerfile it is set to gcp: ENV PROVIDER providers/gcp .

  • Stac: converts a public constellation to the STAC standard.

    If the original constellation is not already in the STAC standard, it should be converted. To do so, you have to implement the constellation-specific STAC converter. Sentinel-2 and Landsat 7/8 examples can be found in src/satextractor/stac . The function that is actually called to perform the conversion is set in the stac Hydra config file ( conf/stac/gcp.yaml ).
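    As a rough illustration of what a converter produces, here is a minimal sketch that builds a single STAC item with pystac; the id, geometry, href and properties are invented for the example and do not reflect the actual converters in src/satextractor/stac:

    import datetime
    import pystac

    # Hypothetical scene metadata; a real converter reads this from the
    # provider's public catalogue of the constellation.
    item = pystac.Item(
        id="S2A_MSIL1C_20200101_EXAMPLE",
        geometry={
            "type": "Polygon",
            "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
        },
        bbox=[0, 0, 1, 1],
        datetime=datetime.datetime(2020, 1, 1),
        properties={"constellation": "sentinel-2"},
    )
    item.add_asset(
        "B02",  # one asset per band
        pystac.Asset(href="gs://example-bucket/B02.jp2", media_type=pystac.MediaType.JPEG2000),
    )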
  • Tiler: creates tiles (patches) of the given region to perform the extraction.

    The Tiler splits the region into tiles using the SentinelHub splitter. For example, if a tile size of 10000 m is set, the patches in your storage will be 10000 m on a side. The tiler config can be found in conf/tiler/utm.yaml ; the tile size can be specified there.
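    A minimal sketch of the splitting step, calling the sentinelhub package directly (the AOI path and tile size are placeholders; the real call lives in the tiler module):

    import geopandas as gpd
    from sentinelhub import CRS, UtmZoneSplitter

    region = gpd.read_file("output/cordoba/aoi.geojson")  # the AOI GeoJSON
    splitter = UtmZoneSplitter(
        list(region.geometry),  # shapely geometries to split
        crs=CRS.WGS84,          # CRS of the input geometries
        bbox_size=10000,        # tile size in metres, as in conf/tiler/utm.yaml
    )
    tiles = splitter.get_bbox_list()  # one UTM bounding box per patch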
  • Scheduler: decides how those tiles are going to be scheduled by creating extraction tasks.

    The Scheduler takes the resulting tiles from the Tiler and groups them into bigger areas to be extracted.

    For example, if the Tiler split the region into 1000x1000 m tiles, the scheduler can be set to group them into UTM splits of, say, 100000x100000 m (100 km). The scheduler also calculates the intersection between the patches and the constellation's STAC assets. At the end, you'll have an object called ExtractionTask with the information to extract one revisit, one band and multiple patches. Each ExtractionTask will be sent to the cloud provider to perform the actual extraction.

    The scheduler config can be found in conf/scheduler/utm.yaml .
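    Conceptually, each ExtractionTask bundles one revisit and one band with the patches to extract. A simplified, hypothetical sketch of such a structure (the real class lives in the satextractor source and carries more fields):

    from dataclasses import dataclass
    from typing import Any, List

    @dataclass
    class ExtractionTask:
        task_id: str        # unique id for the task
        constellation: str  # e.g. "sentinel-2"
        band: str           # one band per task, e.g. "B02"
        sensing_time: str   # the revisit being extracted
        item: Any           # intersecting STAC item holding the asset hrefs
        tiles: List[Any]    # the patches to extract for this revisit/band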

  • Preparer: prepares the file structure in cloud storage.

    The Preparer creates the cloud file structure: the zarr groups and arrays needed to later store the extracted patches.

    The gcp preparer config can be found in conf/preparer/gcp.yaml .
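    A rough sketch of the kind of array the Preparer sets up (path, shape, chunking and dtype are illustrative; see conf/preparer/gcp.yaml and the preparer module for the real layout):

    import gcsfs
    import numpy as np
    import zarr

    fs = gcsfs.GCSFileSystem()  # credentials come from token.json in practice
    patch_path = "gs://my-bucket/cordoba/patch_0/sentinel-2"  # hypothetical

    n_revisits, n_bands = 12, 13  # e.g. monthly revisits over a year, 13 bands

    # One (revisits, bands, height, width) array per patch/constellation,
    # chunked so each extraction task can write a single revisit/band slice.
    zarr.open_array(
        fs.get_mapper(f"{patch_path}/data"),
        mode="w",
        shape=(n_revisits, n_bands, 1000, 1000),
        chunks=(1, 1, 1000, 1000),
        dtype=np.uint16,
    )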

  • Deployer: deploys the extraction tasks created by the scheduler to perform the extraction.

    The Deployer sends one message per ExtractionTask to the cloud provider to perform the actual extraction. It works by publishing messages to a PubSub queue to which the extraction service is subscribed. When a new message (ExtractionTask) arrives, it is automatically processed by the autoscaling cloud service. The gcp deployer config can be found in conf/deployer/gcp.yaml .
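    A minimal sketch of the publishing side (project, topic and payload are invented for the example; the real deployer serialises the ExtractionTasks produced by the scheduler):

    import json
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    # Hypothetical project and topic names.
    topic_path = publisher.topic_path("my-gcp-project", "sat-extractor-tasks")

    # One message per ExtractionTask; the subscribed service picks it up,
    # performs the extraction, and autoscales with the queue depth.
    payload = json.dumps({"task_id": "task-0", "band": "B02"}).encode("utf-8")
    future = publisher.publish(topic_path, data=payload)
    future.result()  # block until the message is accepted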

All the steps are optional; the user decides which ones to run in the main config file.

Prerequisites

To run SatExtractor we recommend using a virtual environment, and a cloud provider user should already have been created.

Installation

  1. Clone the repo
    git clone https://github.com/FrontierDevelopmentLab/sat-extractor
  2. Install python packages
    pip install .

(back to top)

Usage

🔴🔴🔴

- WARNING!!!!:
Running SatExtractor will use your billable cloud provider services.
We strongly recommend testing it with a small region to get acquainted
with the process and get a first sense of your cloud provider costs
for the datasets you want to generate. Make sure you run all your
cloud provider services in the same region to avoid extra costs.

🔴🔴🔴

Once a cloud provider user is set up and the package is installed, you'll need to grab the GeoJSON of the region you want (you can get it from the super-cool tool geojson.io) and change the config files.

  1. Choose a region name (e.g. cordoba below) and create an output directory for it:
mkdir output/cordoba
  2. Save the region GeoJSON as aoi.geojson and store it in the folder you just created.
  3. Open the config.yaml and you'll see something like this:
dataset_name: cordoba
output: ./output/${dataset_name}

log_path: ${output}/main.log
credentials: ${output}/token.json
gpd_input: ${output}/aoi.geojson
item_collection: ${output}/item_collection.geojson
tiles: ${output}/tiles.pkl
extraction_tasks: ${output}/extraction_tasks.pkl

start_date: 2020-01-01
end_date: 2020-02-01

constellations:
  - sentinel-2
  - landsat-5
  - landsat-7
  - landsat-8

defaults:
  - stac: gcp
  - tiler: utm
  - scheduler: utm
  - deployer: gcp
  - builder: gcp
  - cloud: gcp
  - preparer: gcp
  - _self_
tasks:
  - build
  - stac
  - tile
  - schedule
  - prepare
  - deploy

hydra:
  run:
    dir: .

The important things here are to set dataset_name to <your_region_name>, define the start_date and end_date for your revisits, choose your constellations, and set the tasks to be run (you'll likely want to run the build task only once and then comment it out).

Important: the token.json contains the credentials needed to access your cloud provider. In this example it contains the gcp credentials. See the Authentication section below for instructions on getting it.

  4. Open the cloud/<provider>.yaml and add your account info as in the provided default file. The storage_root must point to an existing bucket or bucket directory. user_id is simply used for naming resources. (Optional) You can choose different configurations by changing the module configs (builder, stac, tiler, scheduler, preparer, etc.). There you can change things like patch_size and chunk_size.

  5. Run python src/satextractor/cli.py and enjoy!
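Since the pipeline is configured with Hydra, any of the values in config.yaml can also be overridden on the command line using Hydra's standard key=value syntax. For example (the values here are illustrative; quote the list if your shell expands brackets):

python src/satextractor/cli.py dataset_name=cordoba start_date=2021-01-01 end_date=2021-02-01 tasks=[stac,tile,schedule,prepare,deploy]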

See the open issues for a full list of proposed features (and known issues).

(back to top)

Authentication

Google Cloud

To get the token.json for Google Cloud, the recommended approach is to create a service account:

  1. Go to Credentials
  2. Click Create Credentials and choose Service account
  3. Enter a name (e.g. sat-extractor) and click Create and Continue
  4. Under Select a role, choose Basic -> Editor and then click Done
  5. Choose the account from the list and then go to the Keys tab
  6. Click Add key -> Create new key -> JSON and save the file that gets downloaded
  7. Rename to token.json and you're done!

For building the sat-extractor service, you may also need to configure the credentials used by the cloud provider's command-line toolkit. Permissions at the project-owner level are recommended. If using Google Cloud Platform, you can authorize the gcloud toolkit to access Google Cloud Platform with your Google credentials by running gcloud auth login. You may also need to run gcloud config set project your-proj-name for sat-extractor to work properly.
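For convenience, the two commands mentioned above:

gcloud auth login
gcloud config set project your-proj-name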

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Distributed under the BSD 2-Clause License. See LICENSE.txt for more information.

(back to top)

Citation

If you use this repo, please cite:

@software{dorr_francisco_2021_5609657,
  author       = {Dorr, Francisco and
                  Kruitwagen, Lucas and
                  Ramos, Raúl and
                  García, Dolores and
                  Gottfriedsen, Julia and
                  Kalaitzis, Freddie},
  title        = {SatExtractor},
  month        = oct,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {v0.1.0},
  doi          = {10.5281/zenodo.5609657},
  url          = {https://doi.org/10.5281/zenodo.5609657}
}

(back to top)

Acknowledgments

This work is the result of the 2021 ESA Frontier Development Lab World Food Embeddings team. We are grateful to all organisers, mentors and sponsors for providing us this opportunity. We thank Google Cloud for providing computing and storage resources to complete this work.

sat-extractor's People

Contributors: carderne, frandorr, lkruitwagen, rramosp

sat-extractor's Issues

Get DLQ working

@frandorr @Lkruitwagen

From here, the service account for PubSub looks like this:

PUBSUB_SERVICE_ACCOUNT="service-${project-number}@gcp-sa-pubsub.iam.gserviceaccount.com"

i.e. instead of using the service account from the token.json.

From CLI:

PROJ_NUMBER=$(gcloud projects list \
--filter="$(gcloud config get-value project)" \
--format="value(PROJECT_NUMBER)")

PUBSUB_SERVICE_ACCOUNT="service-${PROJ_NUMBER}@gcp-sa-pubsub.iam.gserviceaccount.com"

And then bind the account as already done:

gcloud pubsub topics add-iam-policy-binding "$DLQ_TOPIC" \
  --member="serviceAccount:$PUBSUB_SERVICE_ACCOUNT" \
  --role=roles/pubsub.publisher

gcloud pubsub subscriptions add-iam-policy-binding "$MAIN_SUBSCRIPTION" \
  --member="serviceAccount:$PUBSUB_SERVICE_ACCOUNT" \
  --role=roles/pubsub.subscriber

mask and percentiles

These don't seem to be used for anything?

mask_path = f"{patch_constellation_path}/mask"
zarr.open_array(
    fs_mapper(mask_path),
    "w",
    shape=(len(sensing_times), len(bands)),
    chunks=(1, 1),
    dtype=np.uint8,
)
percentiles_path = f"{patch_constellation_path}/percentiles_0to100_5incr"
zarr.open_array(
    fs_mapper(percentiles_path),
    "w",
    shape=(len(sensing_times), len(bands), 21),
    chunks=(1, 1, 21),
    dtype=np.float32,
)

Succeeds but gets error

@frandorr just sharing this here from last week

[2021-10-29 13:17:48,746][grpc._plugin_wrapping][ERROR] - AuthMetadataPluginCallback "<google.auth.transport.grpc.AuthMetadataPlugin object at 0x7fc4d86812b0>" raised exception!
Traceback (most recent call last):
  File "/home/chris/.virtualenvs/ox/lib/python3.9/site-packages/grpc/_plugin_wrapping.py", line 89, in __call__
    self._metadata_plugin(
  File "/home/chris/.virtualenvs/ox/lib/python3.9/site-packages/google/auth/transport/grpc.py", line 101, in __call__
    callback(self._get_authorization_headers(context), None)
  File "/home/chris/.virtualenvs/ox/lib/python3.9/site-packages/google/auth/transport/grpc.py", line 87, in _get_authorization_headers
    self._credentials.before_request(
  File "/home/chris/.virtualenvs/ox/lib/python3.9/site-packages/google/auth/credentials.py", line 134, in before_request
    self.apply(headers)
  File "/home/chris/.virtualenvs/ox/lib/python3.9/site-packages/google/auth/credentials.py", line 110, in apply
    _helpers.from_bytes(token or self.token)
  File "/home/chris/.virtualenvs/ox/lib/python3.9/site-packages/google/auth/_helpers.py", line 129, in from_bytes
    raise ValueError("{0!r} could not be converted to unicode".format(value))
ValueError: None could not be converted to unicode

Loosen version locking

I think it might be necessary to loosen the dependency version locking in setup.py a bit. It's quite difficult to e.g. pip install prefect[google] sat-extractor because of the conflicts.

Dockerfile path not found

In the gcp builder the Dockerfile path is set to: dockerfile_path = Path(__file__).parents[3]
It returns '/home/fran/miniconda3/envs/sat-extractor/lib/python3.9'.

It should be changed to pass the path as a parameter.

GCP dead-letter-queue not properly being created

When the pubsub cloud run subscription is created, it also creates a dlq, but it doesn't assign the correct roles and permissions and doesn't create the topic for the dlq.

As the dlq doesn't exist, extraction task messages that fail will loop forever in the main queue, restarting the cloud run service until the messages are manually purged.

We should add these permissions and create the dlq topic automatically to avoid an infinite loop.

Store bands info

Current implementation doesn't store the bands for each constellation.
It would be nice to have that info stored. Some ideas:

  • Store a simple metadata json at constellation level (easiest)
  • At the end of each extraction, create a STAC catalog that contains the metadata (maybe better, but would take longer to implement)
  • Store the info at array level, something like xarray coordinates
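For the first (and easiest) option, a minimal sketch of what writing such a constellation-level file could look like (the path and band list are invented for the example):

import json
import gcsfs

fs = gcsfs.GCSFileSystem()
# Hypothetical constellation-level path inside the dataset store.
with fs.open("gs://my-bucket/cordoba/sentinel-2/metadata.json", "w") as f:
    json.dump({"bands": ["B02", "B03", "B04", "B08"]}, f)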

Only works with pip install -e .

This:

cd sat-extractor
pyenv global 3.9.7
mkvirtualenv test
pip install .
python ./src/satextractor/cli.py

Fails with:

ImportError: Encountered error: `No module named 'satextractor.builder'` when
loading module 'satextractor.builder.gcp_builder.build_gcp'

However, installing with pip install -e . (what I did initially, which is why I didn't notice this) works fine.

Maybe because when using a non-editable install, cli.py gets confused about whether it should be looking for modules in its directory or in the somewhere-else/site-packages/satextractor directory...
