I'm trying to run 'python tools/train.py -n yolox-s -d 1 -b 6 --fp16 -o', and the followi

Training error: An error has been caught in function 'launch', process 'MainProcess' about yolox HOT 11 CLOSED

lijianqing317 commented on May 13, 2024

Training error: An error has been caught in function 'launch', process 'MainProcess'

Comments (11)

lijianqing317 commented on May 13, 2024

Training works normally now. I modified coco.py as follows:
def pull_item(self, index):
    id_ = self.ids[index]
    im_ann = self.coco.loadImgs(id_)[0]
    width = im_ann["width"]
    height = im_ann["height"]
    name_f = im_ann["file_name"]
    # load image and preprocess
    img_file = os.path.join(
        # self.data_dir, self.name, "{:012}".format(id_) + ".jpg"
        self.data_dir, self.name, "{}".format(name_f)
    )
    img = cv2.imread(img_file)
    assert img is not None

Main cause: in a standard COCO annotation file the id and file_name are directly related, e.g. id: 397133, file_name: "000000397133.jpg". In my own COCO-style dataset the id and file_name have no direct relationship, e.g. "id": 2, "file_name": "38_1796_0.jpg", so the image cannot be found from the id. The fix is to build the path from file_name instead.
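
The id versus file_name mismatch described above can be illustrated with a small sketch. This is not YOLOX code: the two helper functions, the `datasets/COCO` directory and the `train2017` split name are hypothetical placeholders, and only the two image entries are taken from the comment. It builds the image path both ways and shows that the two schemes agree only when file_name happens to be the zero-padded id.

```python
import os


def image_path_from_id(data_dir, split, image_id):
    """Path scheme that assumes the file name is the 12-digit zero-padded id."""
    return os.path.join(data_dir, split, "{:012}".format(image_id) + ".jpg")


def image_path_from_file_name(data_dir, split, image_entry):
    """Path scheme that trusts the file_name stored in the annotation entry."""
    return os.path.join(data_dir, split, image_entry["file_name"])


# Two entries shaped like items of the "images" list in a COCO annotation file.
standard_entry = {"id": 397133, "file_name": "000000397133.jpg"}
custom_entry = {"id": 2, "file_name": "38_1796_0.jpg"}

# "datasets/COCO" and "train2017" are placeholder values, not paths from the thread.
for entry in (standard_entry, custom_entry):
    by_id = image_path_from_id("datasets/COCO", "train2017", entry["id"])
    by_name = image_path_from_file_name("datasets/COCO", "train2017", entry)
    print(entry["id"], by_id == by_name)
# 397133 True
# 2 False
```

For the custom entry the id-based scheme points at 000000000002.jpg, a file that would not exist in such a dataset, so cv2.imread would return None and the assert would fail.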

Joker316701882 commented on May 13, 2024

It seems you didn't properly set the path of your dataset.

hccv commented on May 13, 2024

Modify coco.py:
def pull_item(self, index):
    id_ = self.ids[index]
    img_file = self.data_dir + self.name + id_ + ".jpg"
    img = cv2.imread(img_file)
    # check that the read succeeded before accessing img.shape
    assert img is not None
    height, width, c = img.shape

    res = self.load_anno(index)
    img_info = (height, width)

    return img, res, img_info, id_

commented on May 13, 2024

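After making this change to pull_item in coco.py I still get an error. It looks like a problem with multi-threaded data loading: the error appears after a few images have been read. Were you able to train after making this change?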

hccv commented on May 13, 2024

def load_anno(self, index):
    id_ = self.ids[index]
    anno_ids = self.coco.getAnnIds(imgIds=[id_], iscrowd=False)
    annotations = self.coco.loadAnns(anno_ids)
    img_file = self.data_dir + self.name + id_ + ".jpg"
    img = cv2.imread(img_file)
    height, width, c = img.shape
    valid_objs = []
    for obj in annotations:
        x1 = np.max((0, obj["bbox"][0]))
        y1 = np.max((0, obj["bbox"][1]))
        x2 = np.min((width - 1, x1 + np.max((0, obj["bbox"][2] - 1))))
        y2 = np.min((height - 1, y1 + np.max((0, obj["bbox"][3] - 1))))
        if obj["area"] > 0 and x2 >= x1 and y2 >= y1:
            obj["clean_bbox"] = [x1, y1, x2, y2]
            valid_objs.append(obj)
    objs = valid_objs
    num_objs = len(objs)
    res = np.zeros((num_objs, 5))
    for ix, obj in enumerate(objs):
        cls = self.class_ids.index(obj["category_id"])
        res[ix, 0:4] = obj["clean_bbox"]
        res[ix, 4] = cls
    return res

Make the same change to this function as well. On my side the first epoch trained to completion, but a different error was reported when training the second epoch.
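
The box clean-up step inside load_anno can be tried in isolation. The following is a minimal sketch that uses plain Python instead of NumPy; the function names `clip_xywh_to_xyxy` and `is_valid` and the sample boxes are made up for illustration and are not part of YOLOX.

```python
def clip_xywh_to_xyxy(bbox, width, height):
    """Convert a COCO [x, y, w, h] box to [x1, y1, x2, y2], clipped to the image."""
    x, y, w, h = bbox
    x1 = max(0, x)
    y1 = max(0, y)
    x2 = min(width - 1, x1 + max(0, w - 1))
    y2 = min(height - 1, y1 + max(0, h - 1))
    return [x1, y1, x2, y2]


def is_valid(area, box):
    """Same test as in the loop above: positive area and non-inverted corners."""
    x1, y1, x2, y2 = box
    return area > 0 and x2 >= x1 and y2 >= y1


# A box fully inside a 100x100 image is only converted.
print(clip_xywh_to_xyxy([10, 20, 30, 40], 100, 100))   # [10, 20, 39, 59]

# A box that sticks out of the image is clipped to the borders.
print(clip_xywh_to_xyxy([-5, 90, 20, 30], 100, 100))   # [0, 90, 19, 99]

# A box that starts beyond the right edge ends up inverted and is rejected.
outside = clip_xywh_to_xyxy([150, 10, 20, 20], 100, 100)
print(outside, is_valid(400, outside))                  # [150, 10, 99, 29] False
```

Because the clipping depends on width and height, wrong image dimensions would silently change which annotations survive the validity test.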

commented on May 13, 2024

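This part (load_anno) should be fine as it is: I converted my data to COCO format, so the built-in version works without changes.
The real problem is that the launch function reports an error while the data is being read. I suspect the issue is in YOLOX's own mechanism.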

commented on May 13, 2024

I converted my dataset to COCO format again and training now starts. There is no need to change the original functions.

JasenWangLab commented on May 13, 2024

I ran into the following problem during training. Have you seen it on your side?

` x = torch.cuda.FloatTensor(256, 1024, block_mem)
      │     │    │                      └ -1722
      │     │    └ <class 'torch.cuda.FloatTensor'>
      │     └ <module 'torch.cuda' from '/home/cqu/anaconda3/envs/pytorch38/lib/python3.8/site-packages/torch/cuda/__init__.py'>
      └ <module 'torch' from '/home/cqu/anaconda3/envs/pytorch38/lib/python3.8/site-packages/torch/__init__.py'>

RuntimeError: Trying to create tensor with negative dimension -1722: [256, 1024, -1722]`

The code that triggers the problem is this part:

`def get_total_and_free_memory_in_Mb(cuda_device):
    devices_info_str = os.popen(
        "nvidia-smi --query-gpu=memory.total,memory.used --format=csv,nounits,noheader"
    )
    devices_info = devices_info_str.read().strip().split("\n")
    total, used = devices_info[int(cuda_device)].split(",")
    return int(total), int(used)

def occumpy_mem(cuda_device, mem_ratio=0.9):
    """
    pre-allocate gpu memory for training to avoid memory Fragmentation.
    """
    total, used = get_total_and_free_memory_in_Mb(cuda_device)
    max_mem = int(total * mem_ratio)
    print(max_mem, used)
    block_mem = max_mem - used
    x = torch.cuda.FloatTensor(256, 1024, block_mem)
    del x
    time.sleep(5)`

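After changing mem_ratio from 0.9 to 1, the negative dimension error above no longer appears. However, no matter what batch size I use, the GPU memory request always falls slightly short, and I don't know how to fix it.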

`File "/home/cqu/wjw/test/DandC/YOLOX-main/yolox/utils/metric.py", line 39, in occumpy_mem
    x = torch.cuda.FloatTensor(256, 1024, block_mem)
        │     │    │                      └ 705
        │     │    └ <class 'torch.cuda.FloatTensor'>
        │     └ <module 'torch.cuda' from '/home/cqu/anaconda3/envs/pytorch38/lib/python3.8/site-packages/torch/cuda/__init__.py'>
        └ <module 'torch' from '/home/cqu/anaconda3/envs/pytorch38/lib/python3.8/site-packages/torch/__init__.py'>

RuntimeError: CUDA out of memory. Tried to allocate 706.00 MiB (GPU 0; 23.70 GiB total capacity; 285.06 MiB already allocated; 705.06 MiB free; 306.00 MiB reserved in total by PyTorch)`

hccv commented on May 13, 2024

Does reducing the batch size not help either?

JasenWangLab commented on May 13, 2024

False alarm: another program was running on the GPU and occupying its memory...
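
Both errors are consistent with that explanation, and the arithmetic in occumpy_mem can be sketched without a GPU. The `total` and `used` values below are assumptions: they do not appear in the thread and were chosen so that the formula reproduces the two block_mem values shown in the tracebacks (-1722 and 705) on a card of roughly 23.70 GiB. The helpers `block_mem` and `safe_block_mem` are hypothetical names, and the clamping variant is only one possible guard, not what YOLOX does.

```python
# A float32 tensor of shape (256, 1024, 1) takes 256 * 1024 * 4 bytes = 1 MiB,
# so the last dimension passed to FloatTensor is effectively a size in MiB.
MIB_PER_UNIT = 256 * 1024 * 4 / (1024 * 1024)


def block_mem(total_mib, used_mib, mem_ratio):
    """Same formula as in occumpy_mem: target share of the card minus what is in use."""
    return int(total_mib * mem_ratio) - used_mib


def safe_block_mem(total_mib, used_mib, mem_ratio):
    """Hypothetical guard: never request a negative amount."""
    return max(0, block_mem(total_mib, used_mib, mem_ratio))


# Assumed values in MiB (not taken from the logs): a nearly full ~23.70 GiB card.
total, used = 24268, 23563

print(MIB_PER_UNIT)                       # 1.0
print(block_mem(total, used, 0.9))        # -1722, the negative dimension
print(block_mem(total, used, 1.0))        # 705, asks for all remaining memory
print(safe_block_mem(total, used, 0.9))   # 0
```

With mem_ratio set to 1 the request equals everything nvidia-smi reports as unused, which leaves no headroom and matches the out-of-memory message (706.00 MiB requested, 705.06 MiB free). Freeing the GPU from the other program removes both symptoms without touching mem_ratio.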

pistachio0812 commented on May 13, 2024

False alarm: another program was running on the GPU and occupying its memory...

That makes sense.
