Modify YOLOv8n's structure or functions to optimize its performance on small objects
SOD (Salient Object Detection) dataset is a collection of images that are annotated and labeled to identify the salient objects present in them. Salient objects refer to the visually distinct or important elements in an image that tend to attract human attention. SOD datasets are commonly used in computer vision research for developing and evaluating algorithms and models that can automatically detect and segment salient objects in images. These datasets typically consist of images along with corresponding ground truth annotations where salient regions or objects are marked or outlined.
The dataset combines three other datasets: the Stanford Drone Dataset, Vision Meets Drones, and the Unmanned Aerial Vehicle Benchmark: Object Detection and Tracking.
The YOLOv8 architecture builds upon previous versions of the YOLO algorithm, utilizing a convolutional neural network divided into two main parts: the backbone and the head.
The backbone of YOLOv8 is a modified CSPDarknet-style network: a stack of strided convolutions and C2f blocks, ending in an SPPF block, that uses cross-stage partial connections to improve information flow between layers.
The head of YOLOv8 consists entirely of convolutional layers (it has no fully connected layers) and is responsible for predicting the bounding boxes and class probabilities of detected objects. Unlike earlier YOLO versions, its anchor-free Detect module has no separate objectness output.
A significant feature of YOLOv8 is its capability for multi-scale object detection, achieved through a feature pyramid network. This network produces feature maps at several resolutions and attaches a detection layer to each of them, enabling the model to accurately identify objects of varying sizes within an image.
In YOLOv8, the "head" is the part of the network that processes the feature maps produced by the backbone. It mainly includes three kinds of components: detection layers, upsample layers, and route (concatenation) layers.
Detection layers convert input feature maps into bounding-box predictions. In YOLOv8, each detection layer applies convolutions to the feature map of one scale and outputs box coordinates and class probabilities for every grid cell. YOLOv8 is anchor-free, so these predictions are made directly at each location rather than relative to predefined anchor boxes; each detection layer is tied to one feature-map scale instead.
Upsample layers increase the resolution of a feature map. In YOLOv8 they use nearest-neighbor interpolation (nn.Upsample with a scale factor of 2) rather than a learned deconvolution, doubling the height and width of a low-resolution map so that it can be combined with a higher-resolution one. Upsampling mainly serves to improve the model's sensitivity to small objects.
Route layers (named Concat in the YOLOv8 configuration) join feature maps from different levels. A route layer concatenates the output of the previous layer with the feature map of an earlier layer along the channel dimension, producing a feature map that carries information from several scales. This multi-scale feature fusion helps the model detect objects of different sizes and types.
In summary, the "head" of YOLOv8 converts feature maps into bounding-box predictions by combining detection, upsample, and route layers, and its multi-scale feature fusion allows efficient detection of targets of different sizes and types.
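The upsample and route steps described above can be illustrated with a minimal pure-Python sketch. The helper names `upsample_nearest` and `concat_channels` are hypothetical and written only for this illustration; they are not part of Ultralytics or PyTorch, and feature maps are represented as plain nested lists rather than tensors.

```python
def upsample_nearest(fmap, scale=2):
    """Nearest-neighbor upsampling of one 2D channel given as a list of rows."""
    out = []
    for row in fmap:
        # Repeat every value `scale` times horizontally.
        wide = [v for v in row for _ in range(scale)]
        # Repeat the widened row `scale` times vertically.
        out.extend([list(wide) for _ in range(scale)])
    return out


def concat_channels(a, b):
    """Concatenate two feature maps (lists of 2D channels) along the channel axis."""
    size_a = (len(a[0]), len(a[0][0]))
    size_b = (len(b[0]), len(b[0][0]))
    if size_a != size_b:
        raise ValueError("spatial sizes must match before concatenation")
    return a + b


# A deep, low-resolution map (1 channel, 2x2) and a shallower one (2 channels, 4x4).
deep = [[[1, 2], [3, 4]]]
shallow = [[[0] * 4 for _ in range(4)] for _ in range(2)]

upsampled = [upsample_nearest(ch) for ch in deep]  # now 1 channel, 4x4
fused = concat_channels(upsampled, shallow)        # 3 channels, 4x4

print(upsampled[0])
print(len(fused), len(fused[0]), len(fused[0][0]))
```

The sketch shows why upsampling must come before concatenation: the two maps can only be stacked channel-wise once their height and width agree.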
In standard object detection tasks, when the dataset contains small objects, missed detections or poor detection quality often occur. The reason is as follows:
By default, the YOLOv8 model has 3 detection heads, which perform multi-scale detection of targets. For a 640x640 input, the P3/8 detection feature map is 80x80 and is used to detect objects larger than about 8x8 pixels; the P4/16 detection feature map is 40x40 and is used to detect objects larger than about 16x16; and the P5/32 detection feature map is 20x20 and is used to detect objects larger than about 32x32.
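The feature-map sizes quoted above follow directly from dividing the input size by each stride. A small sketch of that arithmetic, assuming a square 640x640 input (the `imgsz` used for training in this document); `grid_sizes` is a hypothetical helper name:

```python
def grid_sizes(imgsz=640, strides=(8, 16, 32)):
    """Return {stride: (rows, cols)} of the detection grid for a square input."""
    return {s: (imgsz // s, imgsz // s) for s in strides}


for stride, (h, w) in grid_sizes().items():
    level = stride.bit_length() - 1  # 8 -> P3, 16 -> P4, 32 -> P5
    print(f"P{level}/{stride}: {h}x{w} grid, each cell covers {stride}x{stride} pixels")
```

Because each grid cell covers stride x stride input pixels, an object much smaller than the stride of the finest head occupies only a fraction of one cell.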
It follows that the model may detect tiny objects poorly when they are smaller than the finest of these scales, or when one of their dimensions (width or height) is too small.
To improve the detection of small targets, we add a small-object detection layer (for example, a 160x160 detection feature map at P2/4 for detecting targets larger than about 4x4 pixels). To achieve this, we keep the backbone unchanged and adjust only the structure of the head.
In order to evaluate the performance of the optimized YOLOv8n network, we conducted validation on various datasets using YOLOv3 and YOLOv5 as well. This allowed us to analyze the differences in validation speed and accuracy among the three models. From both structural and parametric perspectives, we compared YOLOv8 with YOLOv3 and YOLOv5 to determine the strengths and weaknesses of our modified YOLOv8n network. This comprehensive comparison enables us to assess the practical value of our optimized model.
It is important to recognize that our modified YOLOv8n network may only outperform other networks in specific situations. The modification we made focused on adding small object detection layers to extract shallower features, which may result in inferior performance in ordinary cases. Therefore, our comparison aims to further explore the niche where our model excels, as well as its future development and potential.
# YOLOv8.0s head
head:
- [-1, 1, nn.Upsample, [None, 2, 'nearest']]
- [[-1, 6], 1, Concat, [1]] # cat backbone P4
- [-1, 3, C2f, [512]] # 12
- [-1, 1, nn.Upsample, [None, 2, 'nearest']]
- [[-1, 4], 1, Concat, [1]] # cat backbone P3
- [-1, 3, C2f, [256]] # 15 (P3/8-small)
- [-1, 1, nn.Upsample, [None, 2, 'nearest']]
- [[-1, 2], 1, Concat, [1]] # cat backbone P2
- [-1, 3, C2f, [128]] # 18 (P2/4-xsmall)
- [-1, 1, Conv, [256, 3, 2]]
- [[-1, 15], 1, Concat, [1]] # cat head P3
- [-1, 3, C2f, [256]] # 21 (P3/8-small)
- [-1, 1, Conv, [512, 3, 2]]
- [[-1, 12], 1, Concat, [1]] # cat head P4
- [-1, 3, C2f, [512]] # 24 (P4/16-medium)
- [-1, 1, Conv, [512, 3, 2]]
- [[-1, 9], 1, Concat, [1]] # cat head P5
- [-1, 3, C2f, [1024]] # 27 (P5/32-large)
- [[18, 21, 24, 27], 1, Detect, [nc]] # Detect(P2, P3, P4, P5)
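As a sanity check on the head listing above, the following sketch traces the stride of every layer and confirms that the four Detect inputs sit at strides 4, 8, 16, and 32. It assumes the backbone is the stock YOLOv8 backbone with layers 0 to 9 (the text states the backbone is unchanged); `BACKBONE_STRIDES`, `HEAD`, and `trace_strides` are hypothetical names for this illustration, and only the fields needed for stride tracking are copied from the listing.

```python
# Strides of the stock YOLOv8 backbone layers 0-9 (assumption: backbone unchanged).
BACKBONE_STRIDES = [2, 4, 4, 8, 8, 16, 16, 32, 32, 32]

# (from, module, scale) per head layer; scale is the Upsample factor or Conv stride.
HEAD = [
    (-1, "Upsample", 2), ([-1, 6], "Concat", None), (-1, "C2f", None),   # 10-12
    (-1, "Upsample", 2), ([-1, 4], "Concat", None), (-1, "C2f", None),   # 13-15
    (-1, "Upsample", 2), ([-1, 2], "Concat", None), (-1, "C2f", None),   # 16-18
    (-1, "Conv", 2), ([-1, 15], "Concat", None), (-1, "C2f", None),      # 19-21
    (-1, "Conv", 2), ([-1, 12], "Concat", None), (-1, "C2f", None),      # 22-24
    (-1, "Conv", 2), ([-1, 9], "Concat", None), (-1, "C2f", None),       # 25-27
]
DETECT_FROM = [18, 21, 24, 27]


def trace_strides(backbone_strides, head):
    """Return the stride of every layer (backbone followed by head)."""
    strides = list(backbone_strides)
    for src, module, scale in head:
        sources = src if isinstance(src, list) else [src]
        in_strides = [strides[i] for i in sources]  # -1 resolves to the previous layer
        if module == "Concat" and len(set(in_strides)) != 1:
            raise ValueError(f"layer {len(strides)}: Concat inputs differ: {in_strides}")
        s = in_strides[0]
        if module == "Upsample":
            s //= scale  # upsampling divides the stride
        elif module == "Conv":
            s *= scale   # a strided convolution multiplies it
        strides.append(s)
    return strides


strides = trace_strides(BACKBONE_STRIDES, HEAD)
for idx in DETECT_FROM:
    print(f"layer {idx}: stride {strides[idx]}, grid {640 // strides[idx]}x{640 // strides[idx]}")
```

The check also verifies that every Concat joins maps of equal stride, which is the condition for the concatenation to be valid.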
We use the following commands to train the model on the specified dataset and obtain the corresponding results.
%pip install ultralytics
import ultralytics
ultralytics.checks()
# Load a model
from ultralytics import YOLO
model = YOLO('original_yolov8/yolov8n.yaml') # build a new model from YAML
# Train the model
results = model.train(data='uz.yaml', epochs=30, imgsz=640)