nrcan / geo-deep-learning

Deep learning applied to georeferenced datasets

Home Page: https://geo-deep-learning.readthedocs.io/en/latest/

License: MIT License

Python 99.58% Dockerfile 0.42%
deep-learning cnn pytorch unet semantic-segmentation deeplabv3 remote-sensing

geo-deep-learning's People

Contributors

bstdenis, charlesauthier, dan-eli, epeterson12, felegare, fmigneault, lemairecarl, lucarom, mkutu, mpelchat04, ms888ekb, plstcharles, remtav, richardscottoz, valhassan, veurman3, ychoquet, ymoisan


geo-deep-learning's Issues

Re-factor AWS-specific code

Specific areas of the code that deal with AWS are a bit clumsy. Also, some of the AWS-specific work is performed in the Lambda function used on AWS to launch processing and, more importantly, copy results (logs, outputs, etc.) to a permanent S3 datastore. Things to do:

  1. Take all copying/transfer of files/results out of the Lambda function and do it in the code
  2. Devise a callable class to do it, i.e. push and copy files

Python dataclass?
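A minimal sketch of what such a callable helper could look like, using a Python dataclass and boto3. The class name, the bucket/prefix fields and the method names are assumptions, not existing GDL code.

```python
from dataclasses import dataclass

import boto3


@dataclass
class S3Transfer:
    """Hypothetical helper that pushes local results (logs, outputs) to S3."""
    bucket: str
    prefix: str = ""

    def __post_init__(self):
        self.client = boto3.client("s3")

    def push(self, local_path: str, key: str) -> None:
        # upload_file performs a managed (multipart) transfer for large files
        self.client.upload_file(local_path, self.bucket, f"{self.prefix}{key}")

    def fetch(self, key: str, local_path: str) -> None:
        self.client.download_file(self.bucket, f"{self.prefix}{key}", local_path)
```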

AWS - Entity too large when calling the PutObject operation

During the sample-creation step, the following error occurs on AWS:

Number of samples created:  {'trn': 13005, 'val': 5523}
Transfering Samples to the bucket
Traceback (most recent call last):
  File "images_to_samples.py", line 276, in <module>
    params['sample']['mask_reference'])
  File "images_to_samples.py", line 245, in main
    bucket.put_object(Key=final_samples_folder + '/trn_samples.hdf5', Body=trn_samples)
  File "/home/ec2-user/miniconda3/lib/python3.6/site-packages/boto3/resources/factory.py", line 520, in do_action
    response = action(self, *args, **kwargs)
  File "/home/ec2-user/miniconda3/lib/python3.6/site-packages/boto3/resources/action.py", line 83, in __call__
    response = getattr(parent.meta.client, operation_name)(**params)
  File "/home/ec2-user/miniconda3/lib/python3.6/site-packages/botocore/client.py", line 320, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/ec2-user/miniconda3/lib/python3.6/site-packages/botocore/client.py", line 623, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (EntityTooLarge) when calling the PutObject operation: Your proposed upload exceeds the maximum allowed size

This happens when transferring the hdf5 files trn_samples.hdf5 and val_samples.hdf5 to the s3 bucket. In our code, the error is located here.

The function put_object should be replaced with upload_file.
Link to a Stack Overflow thread explaining the difference between the two functions: https://stackoverflow.com/questions/43739415/what-is-the-difference-between-file-upload-and-put-object-when-uploading-fil

Link to boto 3 upload_file function: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3.html#uploads
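A sketch of the suggested change, reusing the `bucket` resource and `final_samples_folder` key prefix from the traceback above; `samples_folder` (the local directory holding the HDF5 files) is an assumed variable. `upload_file` performs a managed multipart upload, so it is not subject to the single-PutObject size limit.

```python
# Before (fails with EntityTooLarge for files over the single-PUT limit):
# bucket.put_object(Key=final_samples_folder + '/trn_samples.hdf5', Body=trn_samples)

# After: let boto3 manage a multipart upload from the local file path
bucket.upload_file(samples_folder + '/trn_samples.hdf5',   # samples_folder: assumed local path
                   final_samples_folder + '/trn_samples.hdf5')
bucket.upload_file(samples_folder + '/val_samples.hdf5',
                   final_samples_folder + '/val_samples.hdf5')
```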

Manage imbalanced data classes

Imbalanced classes are common and very frustrating to deal with.
Here's a histogram of the training and validation data, showing the imbalance.
[Histogram of per-class counts in the training and validation data]

In this example, classes 1, 5, 6 and 8 are over-represented compared to classes 3, 4 and 7.
The idea would be to:

  • Identify the under and over represented classes;
  • Artificially augment the number of data within the under-represented classes;
  • Reduce the amount of data in the over-represented classes.

To do so, I suggest adding the following to the sample preparation already in place (a sketch follows this list):

  • Calculate the number of pixels of each class
    • The under-represented classes would be those where: number of pixels for the class < mean - (2 x standard deviation)
    • The over-represented classes would be those where: number of pixels for the class > mean + (2 x standard deviation)
  • For each sample containing under-represented classes, apply data augmentation (flip, rotate, crop and rescale) and write the result to the HDF5 file.
  • Delete samples containing ONLY pixels from over-represented classes.
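A rough sketch of the pixel-count rule proposed above, assuming the label rasters are available as numpy arrays of integer class indices; the thresholds follow the mean ± 2σ criterion.

```python
import numpy as np


def split_classes_by_frequency(label_arrays, num_classes):
    """Return (under, over) represented class lists based on total pixel counts."""
    counts = np.zeros(num_classes, dtype=np.int64)
    for labels in label_arrays:
        # labels are assumed to be non-negative integer class indices
        counts += np.bincount(labels.ravel(), minlength=num_classes)[:num_classes]

    mean, std = counts.mean(), counts.std()
    under = [c for c in range(num_classes) if counts[c] < mean - 2 * std]
    over = [c for c in range(num_classes) if counts[c] > mean + 2 * std]
    return under, over
```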

Config file validation

Add functions to validate some of the parameters in the config.yaml file (a sketch of such checks follows the lists below).

Data preparation

  1. The specified number of classes should be the same as the number of classes contained in the provided vector file (shp).

Training

  1. If class weights are provided, the length of the list should be equal to the number of classes.
  2. The number of training and validation samples should be less than or equal to the number contained in the HDF5 files.
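A sketch of what such validation helpers could look like; the parameter dictionary layout and key names mirror typical config.yaml usage but are assumptions.

```python
def validate_training_params(params, hdf5_sample_counts):
    """Hypothetical checks run before training starts."""
    num_classes = params['global']['num_classes']            # assumed key layout
    class_weights = params['training'].get('class_weights')

    if class_weights is not None and len(class_weights) != num_classes:
        raise ValueError(f"Expected {num_classes} class weights, "
                         f"got {len(class_weights)}")

    for split in ('trn', 'val'):
        requested = params['training'][f'num_{split}_samples']
        available = hdf5_sample_counts[split]                 # counts read from the HDF5 files
        if requested > available:
            raise ValueError(f"Requested {requested} {split} samples but the "
                             f"HDF5 file only contains {available}")
```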

NaN values in input images

Current behavior

Not-a-number (NaN) and infinite (inf) values present in input images are currently stored as-is in the HDF5 files. Those values affect the loss calculation, resulting in a NaN loss and therefore in a model that learns nothing.
Here's an example of the output (a warning from numpy); the training and validation losses are set to nan because of it:

Start:

Epoch 0/49
--------------------
/home/ec2-user/miniconda3/lib/python3.6/site-packages/numpy/core/_methods.py:32: RuntimeWarning: invalid value encountered in reduce
  return umr_minimum(a, axis, None, out, keepdims, initial)
Training Loss: nan
Training iou: 0.9478
Training precision: 0.8983
Training recall: 0.9478
Training f1-score: 0.9224
Validation Loss: nan
Validation iou: 0.9395
Validation precision: 0.8827
Validation recall: 0.9395
Validation f1-score: 0.9102
Current elapsed time 27m 17s  
...

Expected behavior

NaN values in images should be written as 0 in the HDF5 files.
Infinite values should be replaced with finite values (e.g. clipped).
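A minimal sketch of the proposed fix, applied to each image array before it is written to the HDF5 file; `np.nan_to_num` replaces NaN with 0 and maps ±inf to the largest/smallest finite values.

```python
import numpy as np


def clean_image(arr):
    """Replace NaN with 0 and +/-inf with finite values before writing samples."""
    return np.nan_to_num(arr)
```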

Add task manager

A tool will be developed to help us manage tasks in the context of geo-deep-learning. This is a stub ticket while waiting for the task manager project to have its own Github repo.

Add the choice of model in config file

Currently, the NN model used is hardcoded. Most parameters of the model are in the config.yaml file.

The next step would be to factor out the model type from the code altogether, which probably means adding model-specific sections in the configuration file.

Add a Validation for the Number of Classes in Classification Tasks

In train_model.py, there isn't any verification that the number of classes used for training in classification tasks is the same as the parameter given in the config file. The model is built based on the config file's value for num_classes.

If the wrong number of classes is used for training, it results in meaningless outputs from the model and IndexError: list index out of range during classification.

We need to add a check to avoid running meaningless trainings.
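A sketch of the kind of fail-fast check that could be added to train_model.py; the way the dataset labels are gathered is an assumption.

```python
def check_num_classes(dataset_labels, expected_num_classes):
    """Fail fast if the classes found in the data don't match the config value."""
    found = sorted(set(int(label) for label in dataset_labels))
    if len(found) != expected_num_classes:
        raise ValueError(
            f"Config declares {expected_num_classes} classes but {len(found)} "
            f"distinct labels were found in the data: {found}")
```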

Use OGC-WCS together with GeoTIFF files

OGC-WCS -- Web Coverage Service -- allows one to request a specific area and get a GeoTIFF response. Since GeoTIFF is the data format used for training/inference, using a WCS would free users from having to store GeoTIFF files locally. This in turn makes integration with a Big Data environment that exposes its raster datasets through WCS easier. We ought to make using GeoTIFF files generated on the fly through WCS requests just as easy as using local GeoTIFF files.
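A sketch of a plain WCS 1.0.0 GetCoverage request that returns a GeoTIFF for a bounding box; the service URL, coverage identifier, CRS and bbox are placeholders.

```python
import requests

WCS_URL = "https://example.org/wcs"          # placeholder endpoint
params = {
    "service": "WCS",
    "version": "1.0.0",
    "request": "GetCoverage",
    "coverage": "ortho_rgb",                 # placeholder coverage identifier
    "crs": "EPSG:2960",                      # placeholder CRS
    "bbox": "300000,5040000,301000,5041000", # placeholder extent
    "width": 1000,
    "height": 1000,
    "format": "GeoTIFF",
}

resp = requests.get(WCS_URL, params=params, timeout=120)
resp.raise_for_status()
with open("tile.tif", "wb") as f:
    f.write(resp.content)                    # GeoTIFF usable like a local file
```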

Migrate the project to AWS

  • Read and save data in S3 buckets

    • Load Images
    • Load Config file
    • Load Preparation CSV
    • Save Samples
    • Load Samples
    • Save Trained Models
    • Load Trained Models
    • Metrics and Logs
    • Classified Images
  • Test Multi-GPU

  • Wrap sample creation, training, and classification in Lambda functions

First, training should be run on EC2 instances; then, SageMaker's options and advantages should be explored.

Web demo

It would be worthwhile to investigate the possibility of having a web demo to show each step in the process, especially training. See this demo for ideas.

Federated learning

"In a typical machine learning system, an optimization algorithm like Stochastic Gradient Descent (SGD) runs on a large dataset partitioned homogeneously across servers in the cloud. Such highly iterative algorithms require low-latency, high-throughput connections to the training data. But in the Federated Learning setting, the data is distributed across millions of devices in a highly uneven fashion. In addition, these devices have significantly higher-latency, lower-throughput connections and are only intermittently available for training."
https://ai.googleblog.com/2017/04/federated-learning-collaborative.html

Could this be applied at a later stage, e.g. to optimize our model over time?

Set up a CI/test environment

  • Implement Continuous Integration

See this for example.

A brief review of the current status of various CI options still shows Travis CI at the top. Travis is free for publicly accessible GitHub projects.

An example of CI on deep learning projects. Requires Docker.

  • Find a set of test images

Small public-domain test images that we could use for quick checks that the machinery works (a sketch of such a smoke test follows).
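A sketch of the kind of quick smoke test CI could run; rather than depending on downloaded imagery, it builds a tiny synthetic GeoTIFF with rasterio and checks that it round-trips. The test name and layout are illustrative only.

```python
# test_smoke.py -- minimal CI smoke-test idea (names are hypothetical)
import numpy as np
import rasterio
from rasterio.transform import from_origin


def test_tiny_geotiff_roundtrip(tmp_path):
    """Write and re-read a tiny synthetic GeoTIFF to exercise the raster I/O machinery."""
    path = tmp_path / "tiny.tif"
    data = np.random.randint(0, 255, (3, 64, 64), dtype=np.uint8)
    profile = {
        "driver": "GTiff", "height": 64, "width": 64, "count": 3,
        "dtype": "uint8", "crs": "EPSG:32618",
        "transform": from_origin(500000, 5000000, 10, 10),
    }
    with rasterio.open(path, "w", **profile) as dst:
        dst.write(data)
    with rasterio.open(path) as src:
        assert src.count == 3
        assert src.read().shape == data.shape
```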

Lidar point cloud classification

IMPORTANT: This is a duplicate of #2.

Context

Airborne lidar systems generate data in the form of point clouds. The American Society for Photogrammetry and Remote Sensing (ASPRS) -- which now tags itself as The Imaging and Geospatial Information Society -- has a Lidar division the mission of which is "to provide a forum for collection development and dissemination of information related to the best practices in developing maintaining and operating kinematic laser scanners and associated sensors".

The LAS Working Group maintains the LAS file format standard, which is the default format for storing lidar data points. The latest version of the standard, LAS 1.4, provides the structure of an individual point data record:

Table 1: Point Data Record Format 0

| Item | Format | Size | Required |
|---|---|---|---|
| X | long | 4 bytes | * |
| Y | long | 4 bytes | * |
| Z | long | 4 bytes | * |
| Intensity | unsigned short | 2 bytes | |
| Return Number | 3 bits (bits 0–2) | 3 bits | * |
| Number of Returns (given pulse) | 3 bits (bits 3–5) | 3 bits | * |
| Scan Direction Flag | 1 bit (bit 6) | 1 bit | * |
| Edge of Flight Line | 1 bit (bit 7) | 1 bit | * |
| Classification | unsigned char | 1 byte | * |
| Scan Angle Rank | char | 1 byte | * |
| User Data | unsigned char | 1 byte | |
| Point Source ID | unsigned short | 2 bytes | * |

The classification item in the table above is further specified in the following two tables:

Table 2: Classification Bit definition (field encoding)

| Bit | Field Name | Description |
|---|---|---|
| 0:4 | Classification | Standard ASPRS classification from 0–31 as defined in the classification table for legacy point formats (see Table 3 below) |
| 5 | Synthetic | If set, this point was created by a technique other than LIDAR collection, such as digitizing from a photogrammetric stereo model or traversing a waveform. |
| 6 | Key-point | If set, this point is considered a model key-point and thus generally should not be withheld by a thinning algorithm. |
| 7 | Withheld | If set, this point should not be included in processing (synonymous with Deleted). |

Table 3: ASPRS Standard LIDAR Point Class Values

| Classification Value (bits 0:4) | Meaning |
|---|---|
| 0 | Created, never classified |
| 1 | Unclassified |
| 2 | Ground |
| 3 | Low Vegetation |
| 4 | Medium Vegetation |
| 5 | High Vegetation |
| 6 | Building |
| 7 | Low Point (noise) |
| 8 | Model Key-point (mass point) |
| 9 | Water |
| 10 | Reserved for ASPRS Definition |
| 11 | Reserved for ASPRS Definition |
| 12 | Overlap Points |
| 13–31 | Reserved for ASPRS Definition |

Therefore, the format for the classification field is a bit encoding with the lower five bits used for the class value (as shown in table 3 above) and the three high bits used for flags.
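A small sketch of how that bit layout can be decoded from raw legacy classification bytes with numpy; this is illustrative only and not tied to any particular LAS reader.

```python
import numpy as np


def decode_classification(class_bytes):
    """Split legacy LAS classification bytes into class value and flag bits."""
    class_bytes = np.asarray(class_bytes, dtype=np.uint8)
    class_value = class_bytes & 0b00011111        # bits 0:4 -> ASPRS class (Table 3)
    synthetic = (class_bytes >> 5) & 1            # bit 5
    key_point = (class_bytes >> 6) & 1            # bit 6
    withheld = (class_bytes >> 7) & 1             # bit 7
    return class_value, synthetic, key_point, withheld
```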

Problem

The issue here is that the classification of lidar points is both expensive (typically 30% of total data acquisition cost) and rather unreliable. We would like to devise a deep learning approach to help us classify data points independently, that is, irrespective of class values assigned (or not) by the data provider.

Literature review

There are roughly two types of methods for lidar point cloud classification using DL:

  • rasterization of certain point attributes so that CNNs may be used
  • direct manipulation of the point cloud, for example using voxels

The methods we are most interested in are those that have a PyTorch implementation.

CNN

Classifying airborne LiDAR point clouds via deep features learned by a multi-scale convolutional neural network

"With several selected attributes of LiDAR point clouds, our method first creates a group of multi-scale contextual images for each point in the data using interpolation. Taking the contextual images as inputs, a multi-scale convolutional neural network (MCNN) is then designed and trained to learn the deep features of LiDAR points across various scales. A softmax regression classifier (SRC) is finally employed to generate classification results of the data with a combination of the deep features learned from various scales. "

3D Point Cloud Classification and Segmentation using 3D Modified Fisher Vector Representation for Convolutional Neural Networks

"The point cloud ... the common solution [for classification] of transforming the data into a 3D voxel grid introduces its own challenges, mainly large memory size ... we propose a novel 3D point cloud representation called 3D Modified Fisher Vectors (3DmFV) ... it combines the discrete structure of a grid with continuous generalization of Fisher vectors ... Using the grid enables us to design a new CNN architecture for point cloud classification and part segmentation."

Large-scale Point Cloud Semantic Segmentation with Superpoint Graphs

Point clouds

Using CNNs implies rasterization of point attribute values, which in turn implies degradation of the level of information contained in the point cloud. Ideally, we would like our algorithms to work directly in the point cloud. The main problem here is the data volume. We should strive to find approaches that minimize data I/O. Things like Entwine may allow us to request point data for training or inference via web services and therefore avoid data duplication.

PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation

Point cloud ... due to its irregular format, most researchers transform such data to regular 3D voxel grids or collections of images.

"This, however, renders data unnecessarily voluminous and causes issues. In this paper, we design a novel type of neural network that directly consumes point clouds and well respects the permutation invariance of points in the input. Our network, named PointNet, provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing. Though simple, PointNet is highly efficient and effective."

Spherical Convolutional Neural Network for 3D Point Clouds

"We propose a neural network for 3D point cloud processing that exploits `spherical' convolution kernels and octree partitioning of space. The proposed metric-based spherical kernels systematically quantize point neighborhoods to identify local geometric structures in data ... The network architecture itself is guided by octree data structuring that takes full advantage of the sparse nature of irregular point clouds. We specify spherical kernels with the help of neurons in each layer that in turn are associated with spatial locations. We exploit this association to avert dynamic kernel generation during network training, that enables efficient learning with high resolution point clouds. We demonstrate the utility of the spherical convolutional neural network for 3D object classification on standard benchmark datasets."

Interactive Visualization of 10M+ 3D Points with New Open-Source Python Package PPTK

"The PPTK viewer is part of a larger effort of developing a Python toolkit for not only visualizing but also processing point data."

Manage discontinuities when classifying

When classifying images, we can see linear artifacts in the final output, due to the moving window used to assign classes being smaller than the image being classified. An example:

[Example of linear artifacts in a classified output]

The solution would be to manage the moving-window overlap and take the mean of the predictions over overlapping regions. See Audebert et al. (2016).
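A simplified sketch of overlap averaging for tiled inference: class probabilities are accumulated in a sum array together with a hit-count array, then divided before the argmax. The `predict` callable and the chunk/stride values are placeholders, and edge handling is left out for clarity.

```python
import numpy as np


def sliding_window_inference(image, predict, chunk=512, stride=256, num_classes=5):
    """Average class probabilities over overlapping windows to avoid seam artifacts.

    Assumes image height/width are covered exactly by the window grid; real code
    must also handle the last partial tiles along each edge.
    """
    h, w = image.shape[-2:]
    probs = np.zeros((num_classes, h, w), dtype=np.float32)
    counts = np.zeros((h, w), dtype=np.float32)

    for row in range(0, h - chunk + 1, stride):
        for col in range(0, w - chunk + 1, stride):
            window = image[..., row:row + chunk, col:col + chunk]
            probs[:, row:row + chunk, col:col + chunk] += predict(window)
            counts[row:row + chunk, col:col + chunk] += 1

    return (probs / counts).argmax(axis=0)
```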

Save Error Logs from AWS Instances to an S3 bucket

Currently, any errors or exceptions that are thrown in EC2 instances running our deep learning programs are lost when the instance terminates.

We should save some form of logs to facilitate debugging. The important logs could be stored in an S3 bucket.

Should be done after #4 to ensure continuity and to avoid having to redo work.
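A tiny sketch of the idea: catch the exception at the top level and push the traceback to S3 before the instance terminates. The bucket name and key are placeholders.

```python
import traceback

import boto3


def run_with_error_log(main_fn, bucket="my-gdl-bucket", key="Logs/last_error.txt"):
    """Run main_fn and, on failure, save the traceback to S3 before re-raising."""
    try:
        main_fn()
    except Exception:
        boto3.client("s3").put_object(Bucket=bucket, Key=key,
                                      Body=traceback.format_exc().encode("utf-8"))
        raise
```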

Multi GPU support

We need to support multi-GPU training with data parallelism.
The number of GPUs to use should be set in the config file.
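A minimal sketch using torch.nn.DataParallel, with the number of GPUs read from a config dict; the config key names are assumptions.

```python
import torch
from torch import nn


def wrap_model_for_gpus(model: nn.Module, params: dict) -> nn.Module:
    """Move the model to GPU(s) and wrap it in DataParallel when several are requested."""
    num_gpus = params.get('global', {}).get('num_gpus', 1)   # assumed config key
    if torch.cuda.is_available() and num_gpus > 0:
        model = model.cuda()
        if num_gpus > 1:
            model = nn.DataParallel(model, device_ids=list(range(num_gpus)))
    return model
```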

Scratchpad

  • Log files in JSON (easier parsing and future interop?)
  • Trap standard error (AWS) and add the full Python traceback to output.txt
  • Lambda functions in GH?? (t3.large is back!)
  • Possible memory leak during images_to_samples.py (with file ...; wrap in a Python lambda?)
  • Log file renaming --> add the config file name to the log name

Add more models for classification tasks

Currently, inception-v3 is the only model that can be used for classification tasks. It requires input images of 299x299 pixels.

Other models, such as ResNet and DenseNet from torchvision, should be modified to accept different numbers of input channels and added to our project.
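A sketch of how a torchvision ResNet could be adapted to an arbitrary number of input channels and classes; whether pretrained weights are kept for the modified layers is a design choice left out here.

```python
from torch import nn
from torchvision import models


def build_resnet18(num_channels: int, num_classes: int) -> nn.Module:
    """ResNet-18 adapted to num_channels input bands and num_classes outputs."""
    model = models.resnet18(pretrained=False)
    # Replace the first convolution to accept e.g. 4-band (R+G+B+NIR) imagery
    model.conv1 = nn.Conv2d(num_channels, 64, kernel_size=7, stride=2,
                            padding=3, bias=False)
    # Replace the classification head
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```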

Add live graphical updates for the training process

See #55.

It would be nice to have a live graphing system to visualize per-class metrics as the training/validation process evolves. We could also set alarms (and highlight them in the graph?) when the loss function starts increasing, and potentially stop the process.

We currently dump sklearn's classification report into log files. We could try to plug that output into something like this: python-live-graph-update-from-a-changing-text-file

Longer-term goal: implement structured logging. That would allow us to analyze several log files (e.g. from training a dataset with different parameters over time) in log-munging software.

Interesting item : Eliot, the causal logging library. "Most logging systems tell you what happened in your application, whereas eliot also tells you why it happened."
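A sketch of what structured (JSON-lines) metric logging could look like using only the standard library; the record fields shown are illustrative.

```python
import json
import time


def log_metrics(path, epoch, split, metrics):
    """Append one JSON record per epoch/split so logs stay machine-parsable."""
    record = {"time": time.time(), "epoch": epoch, "split": split, **metrics}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")


# Example:
# log_metrics("logs/train.jsonl", 3, "val", {"loss": 0.41, "iou": 0.87})
```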

GDAL 2.2.2 import error

GDAL 2.2.2 has an issue importing osgeo when used in a conda environment. The issue has been fixed upstream; updating GDAL to 2.3.2 should resolve it.

Capacity to choose the channels of input images

Let's say you have a 4-channel image (R+G+B+NIR): it would be nice to be able to specify only the channels you're interested in, instead of having to preprocess the image and store a copy with just the channels you want to test.
The channel choice should be a parameter in the yaml config file. Linked to #56.
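A sketch of reading only selected bands with rasterio, driven by a config-style list; the config key and file name are assumptions.

```python
import rasterio

# e.g. in config.yaml (assumed key): bands: [1, 2, 4]  -> R, G, NIR of an RGBN image
selected_bands = [1, 2, 4]

with rasterio.open("image_rgbn.tif") as src:
    data = src.read(selected_bands)   # shape: (len(selected_bands), height, width)
```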

Define and Store metadata in STAC

Currently, training is performed on a list of GeoTIFF input images using reference data in GeoPackage files. That list of inputs is stored in csv files. For the results we store just the weights of our model (.pth file).

To make our models interoperable, we need to write out the model together with related weights; those items are our final shareable outputs. Also, should we care to implement checks on whether a particular dataset is amenable to inference using a given model, we need to store all inputs somewhere.

Initially we thought of using HDF to store both the inputs to and outputs of our models. It now appears one of the STAC extensions might be a more logical approach, as STAC is much more web-friendly than HDF.

Managing and building a model

GDL's supported models are loaded using a simple switch in model_choice.py. DL architectures are very dynamic, and one can assume that the more of them we support, the more config parameters we will have to handle. Therefore, we could benefit from a dynamic import function capable of directly instantiating a class given its fully qualified name. An example of such an implementation is provided here.
Models could then be constructed automatically from the parameters dictionary in the config file. Global parameters such as num_classes could be detected and inspected.
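A sketch of dynamic instantiation from a fully qualified class name, as could replace the switch in model_choice.py; the config layout shown in the usage comment is hypothetical.

```python
import importlib


def instantiate(fq_class_name: str, **kwargs):
    """Instantiate a class from its fully qualified name, e.g. 'models.unet.UNetSmall'."""
    module_name, class_name = fq_class_name.rsplit(".", 1)
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**kwargs)


# Hypothetical usage with a config dict:
# model = instantiate(params['model']['class'], **params['model']['kwargs'])
```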

Loading pretrained model with different number of classes

GDL currently does not support modifying an architecture when loading pretrained weights, so one cannot take a model pretrained to predict 4 classes and reuse it for a new training run that predicts 3 classes.
We could benefit from a loading process in 3 steps (see the sketch after this list):

  1. load the original model
  2. load the pretrained weights
  3. modify the architecture
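A sketch of those three steps for a PyTorch model; the `final_layer` attribute and the `'model'` checkpoint key are assumptions about how the model and weights are stored, not GDL's actual layout.

```python
import torch
from torch import nn


def load_pretrained(model: nn.Module, checkpoint_path: str, new_num_classes: int):
    """1) build original model, 2) load pretrained weights, 3) adapt the architecture."""
    state = torch.load(checkpoint_path, map_location='cpu')
    # Step 2: load weights saved for the original number of classes
    model.load_state_dict(state['model'] if 'model' in state else state)
    # Step 3: replace the final classification layer (assumed to be model.final_layer)
    in_channels = model.final_layer.in_channels
    model.final_layer = nn.Conv2d(in_channels, new_num_classes, kernel_size=1)
    return model
```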

Deep learning applied to lidar point cloud classification

Context

Classification of lidar point clouds is resource-intensive as it generally involves human interaction to obtain decent results. Furthermore, it has been shown that classification results and accuracies vary between data producers. Finally, classification amounts to about 30% of lidar data acquisition costs.

Given that context, how can Deep Learning (DL) help in the classification process?

Issues

DL architectures used in imagery are mostly CNNs. Convolution implies pixels. However, the points in a lidar cloud do not correspond to pixels, and therefore CNNs are likely not directly applicable in this case.

Thoughts

Could we use prior data (DEM, imagery, etc.) to initialize weights in a scene-specific, rough manner rather than using a random initialization?

Could we use standard unsupervised clustering algorithms like K-Means or DBSCAN? If so, what is the minimum number of points we can run clustering algorithms against?

References & projects

Classifying airborne LiDAR point clouds via deep features learned by a multi-scale convolutional neural network Paywalled, unfortunately ...

Deep Semantic Classification for 3D LiDAR Data. Key to using CNNs with point clouds: "The input to our network is a set of three channel 2D images generated by unwrapping 360° 3D LiDAR data onto a spherical 2D plane".

PyLidar: based on lesser-known libraries ([SPDLib](http://www.spdlib.org/)), but developed by a government organization (New South Wales, Australia) and still looks pretty interesting: "Supported formats are: SPD V3, SPD V4, Riegl RXP, LAS, LVIS, ASCII and Pulsewaves"

Deep learning-based tree classification using mobile LiDAR data

PointNet: "Deep Learning on Point Sets for 3D Classification and Segmentation"; it is based on this work from Stanford University.

Sync/cp of log files fails with AWS cli, due to ":" in the filename

Command: aws s3 sync s3://bucket_name/Logs/tmp/ W:\work2\Logs\tmp  
Error:  
download failed: s3://bucket_name/Logs/tmp/2019-02-20_01:41_output.txt to W:\work2/Logs\tmp\2019-02-20_01:41_output.txt [Errno 2] No such file or directory: 'W:\\work2\\Logs\\tmp\\2019-02-20_01:41_output.txt.78DaB819'

The sync works fine when the filename does not contain a ":".
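A sketch of a timestamp format without ":" so that log filenames stay compatible with Windows paths and `aws s3 sync`.

```python
from datetime import datetime

# '2019-02-20_01:41_output.txt' breaks on Windows / aws s3 sync; use '-' instead of ':'
stamp = datetime.now().strftime("%Y-%m-%d_%H-%M")
log_name = f"{stamp}_output.txt"
```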

Parametrize 'arch' when saving checkpoints in train_model.py

When saving checkpoints, the 'arch' item of the saved dictionary is hardcoded in main(), both for checkpoints and for the final model. We should parametrize it so that the saved architecture is always the one actually used for training, instead of always being "unetsmall".
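A sketch of saving the architecture name from the config rather than a hardcoded string; the dictionary keys shown are assumptions about the checkpoint layout.

```python
import torch


def save_checkpoint(path, model, optimizer, epoch, params):
    """Save a checkpoint whose 'arch' reflects the model actually trained."""
    torch.save({
        'epoch': epoch,
        'arch': params['global']['model_name'],   # was hardcoded to "unetsmall"
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
    }, path)
```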

Image_classification.py refactoring

A couple of inconsistencies in image_classification.py (see the argument-parsing sketch after this list):

  • The name of the module: image_classification.py is confusing. To be coherent, the module should be named inference.py;
  • Every reference to the task of per-pixel classification (remote sensing vocabulary) should instead use the term semantic segmentation;
  • As input, the module should accept a list of images. The current behavior is to load all images in a folder;
  • The path where classified images are written should also be passed as a parameter. The current behavior is to write classified images in the same folder as the input.
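A sketch of the parameters the renamed inference.py could accept (an explicit image list and an output directory); purely illustrative.

```python
import argparse

parser = argparse.ArgumentParser(description="Semantic segmentation inference")
parser.add_argument("images", nargs="+", help="GeoTIFF image(s) to segment")
parser.add_argument("--output-dir", required=True,
                    help="Directory where segmented rasters are written")
parser.add_argument("--config", required=True, help="Path to config.yaml")
args = parser.parse_args()
```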

"sqlite3_open" error whith geopackage

The error occur in images_to_samples.py, when we have more than 1000 images to prepare.

...  
Images/mtq_f09_257000_5077000_50_int_rou_asp_slo.tif
{'trn': 42601, 'val': 12404}
Images/15_2525134f09_dc_50_int_rou_asp_slo.tif
{'trn': 42601, 'val': 12501}ERROR 4: sqlite3_open(chapeau_mtm09_lac_riv_build.gpkg) failed: unable to open database file

Images/mtq_f09_260000_5081000_50_int_rou_asp_slo.tif
{'trn': 42601, 'val': 12516}
Images/15_2495160f09_dc_45_50_int_rou_asp_slo.tif
{'trn': 42601, 'val': 12714}
Images/15_2575131f09_dc_50_int_rou_asp_slo.tif
{'trn': 42601, 'val': 12714}
Images/15_2605131f09_dc_50_int_rou_asp_slo.tif
{'trn': 42601, 'val': 12756}
Images/15_2455150f09_dc_50_int_rou_asp_slo.tif
{'trn': 42601, 'val': 12793}
Images/mtq_f09_262000_5093000_50_int_rou_asp_slo.tif
{'trn': 42601, 'val': 12799}
Images/15_2715113f09_dc_50_int_rou_asp_slo.tif
{'trn': 42601, 'val': 12854}
Images/15_2505138f09_dc_50_int_rou_asp_slo.tif
{'trn': 42601, 'val': 12858}
Images/15_2595126f09_dc_50_int_rou_asp_slo.tif
{'trn': 42601, 'val': 12888}
Traceback (most recent call last):
  File "images_to_samples.py", line 258, in <module>
    params['sample']['mask_reference'])
  File "images_to_samples.py", line 209, in main
    vector_to_raster(info['gpkg'], info['attribute_name'], label_raster)
  File "images_to_samples.py", line 133, in vector_to_raster
    source_layer = source_ds.GetLayer()
AttributeError: 'NoneType' object has no attribute 'GetLayer'

Possible solutions include (see the sketch after this list):

  • Using Fiona to read/write GeoPackages: Fiona is OGR's neat and nimble API for Python programmers.
  • Using Rasterio for raster read/write: Rasterio reads and writes geospatial raster datasets.
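A sketch of vector_to_raster rewritten with Fiona and rasterio.features.rasterize, which sidesteps the GDAL/OGR bindings that fail above; the function name mirrors the traceback but this is a proposal, not a drop-in patch.

```python
import fiona
import rasterio
from rasterio import features


def vector_to_raster(gpkg_path, attribute_name, reference_raster_path, out_path):
    """Burn vector features into a raster aligned with the reference image."""
    with rasterio.open(reference_raster_path) as ref:
        meta = ref.meta.copy()
        shape, transform = ref.shape, ref.transform

    with fiona.open(gpkg_path) as src:
        shapes = [(feat["geometry"], int(feat["properties"][attribute_name]))
                  for feat in src]

    burned = features.rasterize(shapes, out_shape=shape, transform=transform,
                                fill=0, dtype="uint8")

    meta.update(count=1, dtype="uint8")
    with rasterio.open(out_path, "w", **meta) as dst:
        dst.write(burned, 1)
```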

Add classification capacity

Objectives

The intent of this project is to provide a basic system for deep learning applications in remote sensing (RS). As of today, the application can perform semantic segmentation of images. To provide broader RS capabilities, there is (at least) one task we need to add to the system: classification.

The definition of both these tasks and the difference between the two can be found in the lexicon.

Error using images_to_samples.py with 1 band input images

The band-number assertion in utils.py throws an error when the input images have a single band.

Traceback (most recent call last):
  File "D:/Processus/pycharm/geo-deep-learning/images_to_samples.py", line 246, in <module>
    params['sample']['mask_reference'])
  File "D:/Processus/pycharm/geo-deep-learning/images_to_samples.py", line 179, in main
    assert_band_number(info['tif'], number_of_bands)
  File "D:\Processus\pycharm\geo-deep-learning\utils.py", line 89, in assert_band_number
    assert len(in_array.shape) == 2, msg
AssertionError: The number of bands in the input image and the parameter 'number_of_bands' in the yaml file must be the same
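A sketch of an assert_band_number that treats a 2D array as a single band instead of failing; the function name matches utils.py, but the body is a proposal simplified to operate on an already-loaded numpy array.

```python
def assert_band_number(in_array, number_of_bands):
    """Accept both (H, W) single-band and (H, W, B) multi-band arrays."""
    found = 1 if in_array.ndim == 2 else in_array.shape[-1]
    msg = (f"The input image has {found} band(s) but 'number_of_bands' "
           f"in the yaml file is {number_of_bands}")
    assert found == number_of_bands, msg
```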

CoordConv and spatial resolution

CoordConv works by giving convolutions access to their own input coordinates through the use of extra coordinate channels (Liu et al., 2018).

With remote sensing imagery, we also have access to the spatial resolution of the raster, which could be included as an additional input channel. Including this prior knowledge in the model should improve generalization when training on datasets with multiple spatial resolutions.
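A sketch of appending normalized coordinate channels, plus a constant spatial-resolution channel, to an input batch; how the resolution value should be normalized is left open.

```python
import torch


def add_coord_and_resolution_channels(x: torch.Tensor, resolution: float) -> torch.Tensor:
    """x: (N, C, H, W) batch. Returns (N, C+3, H, W) with row, col and resolution channels."""
    n, _, h, w = x.shape
    rows = torch.linspace(-1, 1, h, dtype=x.dtype, device=x.device)
    rows = rows.view(1, 1, h, 1).expand(n, 1, h, w)
    cols = torch.linspace(-1, 1, w, dtype=x.dtype, device=x.device)
    cols = cols.view(1, 1, 1, w).expand(n, 1, h, w)
    res = torch.full((n, 1, h, w), resolution, dtype=x.dtype, device=x.device)
    return torch.cat([x, rows, cols, res], dim=1)
```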

Add samples to an already existing hdf5 file

Issue

Preparing a large dataset can become quite time-consuming. If you have new data to add to your dataset, you don't want to prepare everything all over again from scratch (see the sketch after the considerations below).

Considerations

  • Path to the HDF5 files as a parameter in config.yaml
  • Validation of the size of the samples already in the HDF5 files (height, width and number of channels)
  • The final HDF5 files must have all samples in random order
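A sketch of appending samples to an existing HDF5 dataset by creating it with an unlimited first dimension (maxshape) and resizing; the dataset name is an assumption.

```python
import h5py


def append_samples(hdf5_path, new_samples):
    """Append an array of shape (n, H, W, C) to the 'sat_img' dataset."""
    with h5py.File(hdf5_path, "a") as f:
        if "sat_img" not in f:
            f.create_dataset("sat_img", data=new_samples,
                             maxshape=(None,) + new_samples.shape[1:])
        else:
            dset = f["sat_img"]
            start = dset.shape[0]
            dset.resize(start + new_samples.shape[0], axis=0)
            dset[start:] = new_samples
```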

Memory error - Optimization to increase batch size

When training with samples of 256x256 pixels, a batch size above 32 causes an out-of-memory error from CUDA. We have to find a way to optimize the process in order to increase the batch size.

RunTimeError: cuda runtime error (2) : out of memory at /opt/conda/ .../THCStorage.cu:58

NOTE: May be specific to our (GC HPC) computing environment.
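One possible workaround (not necessarily the intended optimization) is gradient accumulation, which simulates a larger effective batch size without increasing GPU memory use. The sketch assumes the usual model/criterion/optimizer/dataloader variables from a training loop; `accumulation_steps` is a new, hypothetical parameter.

```python
accumulation_steps = 4          # effective batch size = batch_size * accumulation_steps
optimizer.zero_grad()
for step, (inputs, labels) in enumerate(dataloader):
    outputs = model(inputs.cuda())
    loss = criterion(outputs, labels.cuda()) / accumulation_steps
    loss.backward()             # gradients accumulate across iterations
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```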

Spark/PyTorch integration

It would be interesting to see how PyTorch can be integrated in the CCMEO big data architecture deployed on AWS.

Add training metadata in weights file name

Training always results in a weights file with the same name: checkpoint.pth. Along the lines of #51, we should modify the weights filename so that it contains metadata about the training run it results from: project, data type (RGB, lidar), date, epoch number, etc.

Adding the epoch number to the name would indicate when the best set of weights was obtained. However, we may want to append _outof_Y, where Y is the total number of epochs, to whatever the best checkpoint is once training is done. In that case, ..._27.pth would mean the best epoch so far is 27 but the training process is not complete; the final name would be ..._27_outof_50.pth, for example.
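A sketch of building such a filename from the metadata fields listed above; the argument names and the example in the docstring are illustrative.

```python
from datetime import datetime


def checkpoint_name(project, data_type, arch, epoch, total_epochs=None):
    """e.g. 'project1_RGB_unetsmall_2019-03-01_27_outof_50.pth' (hypothetical example)."""
    stamp = datetime.now().strftime("%Y-%m-%d")
    name = f"{project}_{data_type}_{arch}_{stamp}_{epoch}"
    if total_epochs is not None:
        name += f"_outof_{total_epochs}"
    return name + ".pth"
```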
