NAFlora-1M

NAFlora-1M: continental-scale high-resolution fine-grained plant classification dataset

Updates

August 25th, 2023:

Overview
Training script

June 14th, 2023:

Initialized repository

Overview

In botany, a ‘flora’ is a complete account of the plants found in a geographic region. The dichotomous keys and detailed descriptions of diagnostic morphological features contained within a flora are used by botanists to determine which names to apply to plant specimens. This competition dataset aims to encapsulate the flora of North America so that we can test the capability of artificial intelligence to replicate this traditional tool, a crucial first step toward harnessing AI’s potential botanical applications.

The NAFlora-1M dataset comprises 1.05 M images of 15,501 vascular plant taxa, which constitute more than 90% of the taxa documented in North America. The dataset is restricted to vascular land plants (lycophytes, ferns, gymnosperms, and flowering plants).

Our dataset has a long-tail distribution. The number of images per taxon ranges from seven to 100. Although more images are available for some taxa, we capped the maximum per taxon to keep the training data sufficient but manageable in size.
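The capping step can be illustrated with a minimal sketch. It assumes uniform random sampling within each taxon, which is only one possible choice; the selection procedure actually used for NAFlora-1M is not described here, and the function name and record layout are hypothetical.

```python
import random
from collections import defaultdict


def cap_images_per_taxon(records, max_per_taxon=100, seed=0):
    """Keep at most `max_per_taxon` randomly chosen images for each taxon.

    `records` is an iterable of (image_id, taxon_id) pairs. This layout is
    hypothetical and not taken from the dataset's own tooling.
    """
    by_taxon = defaultdict(list)
    for image_id, taxon_id in records:
        by_taxon[taxon_id].append(image_id)

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    capped = {}
    for taxon_id, image_ids in by_taxon.items():
        if len(image_ids) > max_per_taxon:
            image_ids = rng.sample(image_ids, max_per_taxon)
        capped[taxon_id] = sorted(image_ids)
    return capped


# Toy data: taxon 0 has 250 images, taxon 1 has only 7.
records = [(i, 0) for i in range(250)] + [(1000 + i, 1) for i in range(7)]
capped = cap_images_per_taxon(records)
counts = {taxon_id: len(ids) for taxon_id, ids in capped.items()}
```

In this toy example the large taxon is reduced to 100 images while the small one keeps all seven, which reproduces the long tail described above.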

Training

python3 src/naflora1m_train_and_infer.py

-------------------------------------------------------------------

/usr/local/lib/python3.10/dist-packages/keras/initializers/initializers.py:120: UserWarning: The initializer VarianceScaling is unseeded and being called multiple times, which will return identical values each time (even if the initializer is unseeded). Please update your code to provide a seed to the initializer, or avoid using the same initalizer instance more than once.
  warnings.warn(
Downloading data from https://github.com/leondgarse/keras_efficientnet_v2/releases/download/effnetv2_pretrained/efficientnetv2-s-21k.h5
194646348/194646348 [==============================] - 1s 0us/step
>>>> Load pretrained from: /root/.keras/models/efficientnetv2/efficientnetv2-s-21k.h5
EfficientNetV2S
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 EfficientNetV2S (Functional  (None, 12, 12, 1280)     20331360  
 )                                                               
                                                                 
 global_average_pooling2d (G  (None, 1280)             0         
 lobalAveragePooling2D)                                          
                                                                 
 dropout (Dropout)           (None, 1280)              0         
                                                                 
 dense (Dense)               (None, 1024)              1311744   
                                                                 
 dropout_1 (Dropout)         (None, 1024)              0         
                                                                 
 dense_1 (Dense)             (None, 15501)             15888525  
                                                                 
=================================================================
Total params: 37,531,629
Trainable params: 37,377,757
Non-trainable params: 153,872
_________________________________________________________________

grab config info
done - saving config info to ./EfficientNetV2S_380_OCEP30_FC_CLSBW10_None_configs.json
model summary saved to EfficientNetV2S_380_OCEP30_FC_CLSBW10_None_model_summary.txt. initialization is done
{'name': 'SGDW', 'learning_rate': {'class_name': 'OneCycle', 'config': {'initial_learning_rate': 0.006999999999999999, 'maximal_learning_rate': 0.7, 'cycle_size': 49230, 'scale_mode': 'cycle', 'shift_peak': 0.2}}, 'decay': 0.0, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 1e-05, 'exclude_from_weight_decay': None}
Epoch 1/30
   6/1641 [..............................] - ETA: 20:16 - loss: 84.9471 - f1_score: 0.0000e+00
WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0074s vs `on_train_batch_end` time: 28.2103s). Check your callbacks.
1641/1641 [==============================] - 1491s 743ms/step - loss: 6.2415 - f1_score: 0.0019 - time: 1490.8735
Epoch 2/30
1641/1641 [==============================] - 1222s 745ms/step - loss: 3.1388 - f1_score: 0.1323 - time: 1221.8033
Epoch 3/30
1641/1641 [==============================] - 1224s 746ms/step - loss: 2.2029 - f1_score: 0.3254 - time: 1223.6055
Epoch 4/30
1641/1641 [==============================] - 1225s 746ms/step - loss: 1.8870 - f1_score: 0.4320 - time: 1224.5351
Epoch 5/30
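The numbers in the log above can be cross-checked with a few lines of arithmetic. The sketch below recomputes the parameter counts of the two dense layers and compares the OneCycle `cycle_size` with the number of steps in the run. The variable and function names are ours, and reading `cycle_size` as steps per epoch times epochs is an inference from the logged values, not something the training script confirms.

```python
# Values copied from the model summary and training log above.
backbone_params = 20_331_360  # EfficientNetV2S
num_features = 1280           # output width of global_average_pooling2d
hidden_units = 1024           # dense
num_classes = 15_501          # dense_1, one unit per taxon


def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


dense = dense_params(num_features, hidden_units)
dense_1 = dense_params(hidden_units, num_classes)
total = backbone_params + dense + dense_1

# 1641 steps per epoch for 30 epochs.
steps_per_epoch = 1641
epochs = 30
cycle_size = steps_per_epoch * epochs

print(dense, dense_1, total, cycle_size)
```

The results agree with the summary (1,311,744 and 15,888,525 parameters for the dense layers, 37,531,629 in total) and with the logged `cycle_size` of 49230, which suggests that a single learning-rate cycle spans the whole 30-epoch run.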

Details

The dataset contains 15,501 vascular plant species in total, with roughly 800k training images and 200k test images. The table below lists the top 10 families ranked by the number of species.

Family	Number of Species	Train Images	Test Images
Asteraceae	1,998	110,007	27,605
Fabaceae	1,070	59,152	14,803
Poaceae	964	53,547	13,399
Cyperaceae	780	45,447	11,410
Boraginaceae	454	23,724	5,948
Brassicaceae	402	19,033	4,752
Plantaginaceae	380	21,054	5,265
Polygonaceae	359	18,899	4,714
Rosaceae	356	20,628	5,165
Lamiaceae	309	16,854	4,239
___	___	___	___
Top-10 total	7,072	388,345	97,300
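As a quick sanity check on the table, the sketch below re-adds the per-family rows and derives two ratios: the share of all species covered by the top 10 families and the test fraction within those families. The counts are copied from the table; the variable names are ours.

```python
# (species, train images, test images) per family, copied from the table above.
top10 = {
    "Asteraceae": (1998, 110007, 27605),
    "Fabaceae": (1070, 59152, 14803),
    "Poaceae": (964, 53547, 13399),
    "Cyperaceae": (780, 45447, 11410),
    "Boraginaceae": (454, 23724, 5948),
    "Brassicaceae": (402, 19033, 4752),
    "Plantaginaceae": (380, 21054, 5265),
    "Polygonaceae": (359, 18899, 4714),
    "Rosaceae": (356, 20628, 5165),
    "Lamiaceae": (309, 16854, 4239),
}
TOTAL_SPECIES = 15501

# Column-wise sums over the ten rows.
species, train, test = (sum(column) for column in zip(*top10.values()))

species_share = species / TOTAL_SPECIES  # fraction of all species in the top 10
test_fraction = test / (train + test)    # test share within the top 10 families

print(species, train, test)
print(f"{species_share:.1%} of species, {test_fraction:.1%} test images")
```

The sums match the "Top-10 total" row. The ten families account for a little under half of all species, and their images follow the same approximate 80/20 train/test split as the dataset as a whole.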

How to access the data

This section describes how to access the data.

Kaggle competition

NAFlora-1M was benchmarked in the Herbarium 2022: The flora of North America Kaggle competition.

Annotation Format

We follow the annotation format of the COCO dataset and add additional fields. The annotations are stored in the JSON format and are organized as follows:

{
  "annotations" : [annotation],
  "categories" : [category],
  "genera" : [genus],
  "images" : [image],
  "distances" : [distance],
  "licenses" : [license],
  "institutions" : [institution]
}


annotation {
  "image_id" : int,
  "category_id" : int,
  "genus_id" : int,
  "institution_id" : int   
}

image {
  "image_id" : int,
  "file_name" : str,
  "license" : int
}

category {
  "category_id" : int,
  "scientificName" : str,
  # We also provide a super-category for each species.
  "authors" : str, # corresponds to the 'authors' field in the WCVP
  "family" : str, # corresponds to the 'family' field in the WCVP
  "genus" : str, # corresponds to the 'genus' field in the WCVP
  "species" : str # corresponds to the 'species' field in the WCVP
}

genus {
  "genus_id" : int,
  "genus" : str
}

distance {
  # We provide the pairwise evolutionary distance between genera (genus_id_x < genus_id_y).
  "genus_id_x" : int,
  "genus_id_y" : int,
  "distance" : float
}

institution {
  "institution_id" : int,
  "collectionCode" : str
}

license {
  "id" : int,
  "name" : str,
  "url" : str
}
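To show how these records fit together, here is a minimal sketch that joins annotations to images and categories and looks up a genus-to-genus distance. It runs on a small hand-written sample rather than the real metadata file. All ids, names, and file paths in it are made up, the helper names are ours, and the zero distance returned for identical genera is an assumption.

```python
import json

# Hand-written sample following the layout above; every value is invented.
sample = json.loads("""
{
  "annotations": [
    {"image_id": 1, "category_id": 10, "genus_id": 100, "institution_id": 0},
    {"image_id": 2, "category_id": 11, "genus_id": 101, "institution_id": 0}
  ],
  "categories": [
    {"category_id": 10, "scientificName": "Genusa speciesa"},
    {"category_id": 11, "scientificName": "Genusb speciesb"}
  ],
  "genera": [
    {"genus_id": 100, "genus": "Genusa"},
    {"genus_id": 101, "genus": "Genusb"}
  ],
  "images": [
    {"image_id": 1, "file_name": "images/a.jpg", "license": 0},
    {"image_id": 2, "file_name": "images/b.jpg", "license": 0}
  ],
  "distances": [
    {"genus_id_x": 100, "genus_id_y": 101, "distance": 0.25}
  ],
  "licenses": [],
  "institutions": []
}
""")


def index_by(items, key):
    """Build a dict that maps item[key] to the item itself."""
    return {item[key]: item for item in items}


categories = index_by(sample["categories"], "category_id")
images = index_by(sample["images"], "image_id")

# Map each image file to the scientific name of its annotated category.
labels = {
    images[ann["image_id"]]["file_name"]: categories[ann["category_id"]]["scientificName"]
    for ann in sample["annotations"]
}

# Each pair is stored once, with the smaller genus id first.
distances = {
    (d["genus_id_x"], d["genus_id_y"]): d["distance"] for d in sample["distances"]
}


def genus_distance(genus_a, genus_b):
    """Look up the distance between two genera in either argument order."""
    if genus_a == genus_b:
        return 0.0  # assumption: identical genera are at distance zero
    return distances[(min(genus_a, genus_b), max(genus_a, genus_b))]


print(labels)
print(genus_distance(101, 100))
```

For the real metadata, the `json.loads` call on the sample string would be replaced by `json.load` on the opened annotation file.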

Evaluation through late submission

You can obtain performance metrics on our test data through the competition's submissions page.

Submissions to the Kaggle competition are CSV files in the following format:

Id,predicted
12345,0 
67890,83

The Id column corresponds to the test image id. The predicted column holds exactly one category id, which identifies the scientificName (species).
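A submission file in this format can be produced with Python's standard `csv` module. The sketch below assumes the predictions are held in a dict from test image id to predicted category id; the function name `write_submission` and that dict layout are ours, not part of the repository.

```python
import csv
import io


def write_submission(predictions, out):
    """Write {image_id: category_id} pairs as an Id,predicted CSV."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Id", "predicted"])
    for image_id, category_id in sorted(predictions.items()):
        writer.writerow([image_id, category_id])


# The two example rows shown above.
predictions = {12345: 0, 67890: 83}

buffer = io.StringIO()
write_submission(predictions, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

To write a file instead of an in-memory buffer, pass a handle opened with `open("submission.csv", "w", newline="")`.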

Terms of Use

CC BY-NC-ND-4.0: Commercial use of the data and pre-trained model is restricted.

Pretrained Models

Pretrained models and sample code will soon be released.
