
dddd_trainer's Introduction

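dddd_trainer: the training tool for ddddocr

The training tool used to build ddddocr is now officially open source.

Only NVIDIA GPUs are supported for GPU training; AMD and other cards are not supported for now.

The project is built on PyTorch. It supports training CNN and CRNN models, resuming from checkpoints, and automatic export to ONNX. Exported models can be deployed directly with ddddocr and ocr_api_server.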

Supported training environments

Windows/Linux

macOS supports CPU training only

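1. Required deep-learning environment (needed by every deep-learning project, not just this one; not needed for CPU training)

Before starting, check the PyTorch website for the PyTorch build that matches your system and hardware. For NVIDIA cards older than the 30 series, such as the 2080 Ti, choose a CUDA version below 11 (for example CUDA 10.2). 30-series cards only support CUDA 11, so choose CUDA 11 or later (for example CUDA 11.3). Then install PyTorch with the command the website shows for your selection. Because PyTorch releases so frequently, many PyPI mirrors cache only the CPU build, so the CUDA build has to be installed with the command from the official site.

Install CUDA and cuDNN

Choose the versions that match your GPU model and operating system:

cuda

cudnn

The CUDA version supported by your cuDNN build must match the CUDA version you installed, and different CUDA versions support different GPUs. For 20-series cards simply choose CUDA 10.2; for 30-series cards choose CUDA 11.3. If you run into problems at this stage, search online (for example on Baidu); this is a common setup question.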

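2. Training

  • All variables below are written as {param}, meaning you can replace them as needed. Do not type the braces when you run a command. For example, for the step that creates a new training project you can simply write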

python app.py create test_project

  • 1. Clone this project locally

git clone https://github.com/sml2h3/dddd_trainer.git

  • 2. Enter the project directory and install the dependencies

pip install -r requirements.txt -i https://pypi.douban.com/simple

  • 3. Create a new training project

python app.py create {project_name}

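To create a CNN project, add the --single flag. CNN projects are for classifying a whole image: for example, recognizing which character an image shows when it contains exactly one character (do not use CNN mode when the image contains several characters), or telling whether a picture shows a lion or a rabbit. For most OCR tasks, do not use --single.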

python app.py create {project_name} --single

project_name is the name of the project; avoid special characters in it.

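  • 4. Prepare the data

    The project supports two ways of organizing data.

    A. Labels taken from file names

    All images sit in one folder and are named as shown below, with the label before the underscore. /root/images_set is the image directory and can be any path.

    /root/images_set/
    |---- abcde_<random hash>.jpg
    |---- sdae_<random hash>.jpg
    |---- 酱闷肘子_<random hash>.jpg

    For example, an image whose text reads mkGu can be named

    mkGu_000001d00f140741741ed9916240d8d5.jpg

    To cover all use cases, dddd_trainer does not normalize letter case. If you want the model to distinguish upper and lower case, label the samples with the correct case, as in the example above.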
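
    To make the naming rule concrete, here is a minimal sketch of how a label could be recovered from such a file name. It assumes the label is everything before the last underscore; the helper name label_from_filename is hypothetical and is not taken from dddd_trainer's source, which may parse names differently.

```python
import os


def label_from_filename(path: str) -> str:
    """Return the label part of a file named `<label>_<hash>.<ext>`.

    Assumption: the label is everything before the LAST underscore.
    Letter case is returned untouched, matching the note above.
    """
    stem = os.path.splitext(os.path.basename(path))[0]
    label, sep, _hash = stem.rpartition("_")
    if not sep or not label:
        raise ValueError(f"no '<label>_<hash>' pattern in {path!r}")
    return label


print(label_from_filename("/root/images_set/mkGu_000001d00f140741741ed9916240d8d5.jpg"))
```

    Running a check like this over a folder before caching is a cheap way to spot files that carry no label at all.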

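    B. Labels taken from a file

    Because some datasets are organized differently or use special characters in labels, the project can also import data from a txt file. The dataset directory must contain a labels.txt file and an images folder. /root/images_set is the dataset directory and can be any path.

    labels.txt lists every image under /root/images_set/images by its path relative to /root/images_set/images. The images folder may contain subdirectories.

    In this mode the image file names are arbitrary. They may or may not contain the label, because the label is not read from the file name.

    Examples:

    a. No subdirectories under images

    /root/images_set/
    |---- labels.txt
    |---- images
          |---- <random hash>.jpg
          |---- <random hash>.jpg
          |---- 酱闷肘子_<random hash>.jpg
    
    Contents of labels.txt (the tab character \t separates the file name from the label on each line)
    <random hash>.jpg\tabcd
    <random hash>.jpg\tsdae
    酱闷肘子_<random hash>.jpg\t酱闷肘子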
    

    b. Subdirectories under images

    /root/images_set/
    |---- labels.txt
    |---- images
          |---- aaaa
                |---- <random hash>.jpg
          |---- 酱闷肘子_<random hash>.jpg
    
    Contents of labels.txt (the tab character \t separates the file name from the label on each line)
    aaaa/<random hash>.jpg\tabcd
    aaaa/<random hash>.jpg\tsdae
    酱闷肘子_<random hash>.jpg\t酱闷肘子
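
    The labels.txt layout above can be produced with a few lines of Python. This is a sketch under stated assumptions: the function name write_labels is hypothetical, UTF-8 encoding is assumed, and the temporary directory stands in for /root/images_set.

```python
import os
import tempfile


def write_labels(dataset_root: str, labels: dict) -> str:
    """Write labels.txt with one `<relative path><TAB><label>` line per image.

    Keys of `labels` are paths relative to `<dataset_root>/images`,
    using forward slashes for subdirectories.
    """
    out_path = os.path.join(dataset_root, "labels.txt")
    with open(out_path, "w", encoding="utf-8") as f:  # encoding is an assumption
        for rel_path, label in labels.items():
            if "\t" in rel_path or "\t" in label:
                raise ValueError("tab is reserved as the separator")
            f.write(f"{rel_path}\t{label}\n")
    return out_path


# Demo in a temporary directory standing in for /root/images_set
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "images", "aaaa"))
path = write_labels(root, {"aaaa/0001.jpg": "abcd", "0002.jpg": "酱闷肘子"})
with open(path, encoding="utf-8") as f:
    print(f.read())
```

    Rejecting tabs inside paths and labels keeps every line unambiguous, since the tab is the only separator the format has.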
    
    

    To help newcomers understand this section, the project also provides two basic datasets for testing:

    Dataset 1, Dataset 2

  • 5. Edit the configuration file

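Model:
    CharSet: []     # Character set; do not edit, it is generated automatically
    ImageChannel: 1 # Number of image channels: 1 to train on grayscale, 3 to train on color. If set to 1 and the dataset is in color, images are converted to grayscale in memory during training; you do not need to convert them yourself, and the files on disk are not modified
    ImageHeight: 64 # Image height after automatic resizing, in px; must be a multiple of 16
    ImageWidth: -1  # Image width after automatic resizing, in px; -1 means the width is adjusted automatically
    Word: false     # Whether this is a CNN model; set by the arguments of the create command, do not edit by hand
System:
    Allow_Ext: [jpg, jpeg, png, bmp]  # Supported image extensions; other files are ignored automatically
    GPU: true                         # Whether to train on the GPU; requires the environment from step 1
    GPU_ID: 0                         # GPU device index; 0 is the first card
    Path: ''                          # Dataset root directory; filled in by the cache step, change it only if the dataset has moved
    Project: test                     # Project name, i.e. {project_name}
    Val: 0.03                         # Fraction of the data used for validation; 0.03 means 3% of the images are picked automatically at cache time and used for validation during training. Re-run the cache step after changing this value
Train:
    BATCH_SIZE: 32                                    # Training batch size; depends mainly on GPU memory or RAM, so experiment to find what fits. Usually a multiple of 16, such as 16, 32, 64, 128
    CNN: {NAME: ddddocr}                              # Feature-extraction model; supported values are ddddocr, effnetv2_l, effnetv2_m, effnetv2_xl, effnetv2_s, mobilenetv2, mobilenetv3_s, mobilenetv3_l
    DROPOUT: 0.3                                      # Do not change unless you know what you are doing
    LR: 0.01                                          # Initial learning rate
    OPTIMIZER: SGD                                    # Optimizer; do not change
    SAVE_CHECKPOINTS_STEP: 2000                       # Save a checkpoint every this many steps
    TARGET: {Accuracy: 0.97, Cost: 0.05, Epoch: 20}   # Training targets; when all are met at once, training stops and the ONNX model is saved. Accuracy is the minimum accuracy, Cost is the maximum loss, Epoch is the minimum number of epochs
    TEST_BATCH_SIZE: 32                               # Validation batch size; same considerations as BATCH_SIZE
    TEST_STEP: 1000                                   # Run validation every this many steps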

The configuration file is located at projects/{project_name}/config.yaml under the project root.
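
The constraints stated in the comments (channel count, height as a multiple of 16, all TARGET conditions met together) can be expressed as two small checks. This is only an illustrative sketch: the function names are hypothetical, the dict mirrors what a YAML loader would return for config.yaml, and the inclusive comparisons are an assumption rather than the trainer's confirmed behaviour.

```python
def validate_config(cfg: dict) -> list:
    """Return a list of human-readable problems found in a config dict."""
    problems = []
    model = cfg["Model"]
    if model["ImageChannel"] not in (1, 3):
        problems.append("ImageChannel must be 1 (grayscale) or 3 (color)")
    if model["ImageHeight"] <= 0 or model["ImageHeight"] % 16 != 0:
        problems.append("ImageHeight must be a positive multiple of 16")
    if not 0 < cfg["System"]["Val"] < 1:
        problems.append("Val must be a fraction between 0 and 1")
    return problems


def target_reached(target: dict, accuracy: float, loss: float, epoch: int) -> bool:
    """True only when all three TARGET conditions hold at the same time."""
    return (
        accuracy >= target["Accuracy"]   # minimum accuracy
        and loss <= target["Cost"]       # maximum loss
        and epoch >= target["Epoch"]     # minimum number of epochs
    )


cfg = {
    "Model": {"ImageChannel": 1, "ImageHeight": 64, "ImageWidth": -1},
    "System": {"Val": 0.03},
    "Train": {"TARGET": {"Accuracy": 0.97, "Cost": 0.05, "Epoch": 20}},
}
print(validate_config(cfg))
# Accuracy and loss are good enough, but epoch 12 is below the minimum of 20
print(target_reached(cfg["Train"]["TARGET"], accuracy=0.98, loss=0.04, epoch=12))
```

The second call shows why a run can keep going even after accuracy looks good: the epoch minimum still has to be met.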

  • 6. Cache the data

python app.py cache {project_name} /root/images_set/

If the labels are read from labels.txt:

python app.py cache {project_name} /root/images_set/ file

  • 7. Start or resume training

python app.py train {project_name}

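  • 8. Deployment

You can start training in the meantime. ddddocr and ocr_api_server are still being adapted to load trained models; this section will be updated once that work is done.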

dddd_trainer's People

Contributors

sml2h3


dddd_trainer's Issues

Val Data Number is 0

When I run the cache command, cache.train.tmp is written correctly, but cache.val.tmp is empty.
When I then run the train command, reading cache.val.tmp logs "Read Cache File End! Caches Num is 0."
Training finally fails with: ValueError: num_samples should be a positive integer value, but got num_samples=0
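
One possible cause, not confirmed from dddd_trainer's source: if the number of validation images is obtained by truncating n * Val, a small dataset yields zero validation images at Val: 0.03. The sketch below only illustrates that arithmetic; both function names are hypothetical.

```python
import math


def val_count_truncating(n_images: int, val_ratio: float) -> int:
    """Validation size if the product is simply truncated (assumed behaviour)."""
    return int(n_images * val_ratio)


def val_count_at_least_one(n_images: int, val_ratio: float) -> int:
    """Alternative that always keeps one validation image when there are at least two images."""
    if n_images < 2:
        return 0
    return max(1, math.ceil(n_images * val_ratio))


for n in (20, 33, 34):
    print(n, val_count_truncating(n, 0.03), val_count_at_least_one(n, 0.03))
```

If this is the cause, raising Val in config.yaml or adding more images, then re-running the cache step, should give a non-empty cache.val.tmp.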

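Error after training on the test dataset for a while

While training the model, the following appears after some time:
UserWarning: Exporting a model to ONNX with a batch_size other than 1, with a variable length with LSTM can cause an error when running the ONNX model with a different batch size. Make sure to save the model with a batch size of 1, or define the initial states (h0/c0) as inputs of the model.

The packages were installed following the tutorial, and the dataset is the second test dataset (Dataset 2).

The configuration file is as follows: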

Model:
    CharSet: [' ', S, '4', F, X, '9', E, Q, V, U, '1', J, R, '5', '7', Z, H, G, P,
        A, '2', '6', '8', Y, B, I, L, W, K, T, D, C, '3']
    ImageChannel: 1
    ImageHeight: 32
    ImageWidth: -1
    Word: false
System:
    Allow_Ext: [jpg, jpeg, png, bmp]
    GPU: true
    GPU_ID: 0
    Path: images
    Project: my_test
    Val: 0.03
Train:
    BATCH_SIZE: 16
    CNN: {NAME: ddddocr}
    DROPOUT: 0.3
    LR: 0.01
    OPTIMIZER: SGD
    SAVE_CHECKPOINTS_STEP: 2000
    TARGET: {Accuracy: 0.97, Cost: 0.05, Epoch: 20}
    TEST_BATCH_SIZE: 16
    TEST_STEP: 1000

Switching from ddddocr to another CNN such as mobilenetv2 makes the loss NaN

utils.train:start:110 - [2023-10-07-17_21_35] Epoch: 1 Step: 100 LastLoss: nan AvgLoss: nan Lr: 0.01
2023-10-07 17:21:39.964 | INFO | utils.train:start:110 - [2023-10-07-17_21_39] Epoch: 2 Step: 200 LastLoss: nan AvgLoss: nan Lr: 0.01

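How to do incremental training

First of all, ddddocr is great.
Second, once a model has been trained, how can I add new data and continue training on top of it?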

Data type error when using a trained model with ddddocr!

Running recognition with my trained model raises a data type error.
The code is:

    import ddddocr

    ocr = ddddocr.DdddOcr()

    ocr = ddddocr.DdddOcr(det=False, ocr=False, import_onnx_path="testocr_1.0_28_19000_2022-03-21-20-44-59.onnx", charsets_path="charsets.json")

    with open("0CB7_1644684875.png", 'rb') as f:
        image = f.read()

    res = ocr.classification(image)
    print(res)

The error output is:

2 : INVALID_ARGUMENT : Unexpected input data type. Actual: (tensor(double)) , expected: (tensor(float))

Traceback (most recent call last):
  File "D:\python\ddddocr\1-验证码识别.py", line 10, in <module>
    res = ocr.classification(image)
  File "C:\Program Files\Python39\lib\site-packages\ddddocr\__init__.py", line 1629, in classification
    ort_outs = self.__ort_session.run(None, ort_inputs)
  File "C:\Users\GCB\AppData\Roaming\Python\Python39\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 195, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Unexpected input data type. Actual: (tensor(double)) , expected: (tensor(float))

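Error when re-running the train command after interrupting training with Ctrl+C

The environment is macOS on an M1 Pro, training with mps. Training had already produced more than 10 checkpoints, but after interrupting it and running the train command again, the following error appears: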

 File "/Users/username/opt/anaconda3/envs/OCR_trainer/lib/python3.8/site-packages/torch/serialization.py", line 267, in default_restore_location
    raise RuntimeError("don't know how to restore data location of "
RuntimeError: don't know how to restore data location of torch.storage.UntypedStorage (tagged with mps:0)

How can this be resolved?

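Bug in multi-channel image preprocessing

Steps to reproduce:
1. Set the training configuration to 3 channels (color image training)
2. After training finishes, run the model with the ddddocr project

The program first fails with the same error as in:
#2

Looking at the source, the error is at
ddddocr/__init__.py:1629
I changed that line to force the dtype:
ort_inputs = {'input1': np.array([image], dtype=np.float32)}

Running the program again gives another error:

INVALID_ARGUMENT : Invalid rank for input: input1 Got: 5 Expected: 4 Please fix either the inputs or the model.

I added a line here to print the shape of input1:

print(ort_inputs['input1'].shape) # output: (1, 1, 160, 649, 3)

My guess is that ddddocr builds the input as for single-channel images, i.e. (1, 1, 160, 649). Since I trained on color images, the preprocessing should differ from the single-channel case.

However, I don't know how the author intended color images to be preprocessed, so I don't know what to change here.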

export() got an unexpected keyword argument '_retain_param_name'

Traceback (most recent call last):
  File "D:\dddd_trainer-main\app.py", line 33, in <module>
    fire.Fire(App)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\site-packages\fire\core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\site-packages\fire\core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "D:\dddd_trainer-main\app.py", line 28, in train
    trainer.start()
  File "D:\dddd_trainer-main\utils\train.py", line 152, in start
    self.net.export_onnx(self.net, dummy_input,
  File "D:\dddd_trainer-main\nets\__init__.py", line 216, in export_onnx
    torch.onnx.export(net, dummy_input, graph_path, export_params=True, verbose=False,
TypeError: export() got an unexpected keyword argument '_retain_param_name'

What does this error mean?
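
The TypeError says that the installed torch.onnx.export does not accept the _retain_param_name keyword that nets/__init__.py passes. Most likely the installed PyTorch is newer than the version the project was written against, so the usual options are to install an older PyTorch or to remove that argument from the call. As a generic illustration of the second option, the sketch below drops keyword arguments a function no longer accepts. The helper call_with_supported_kwargs and the stand-in export function are hypothetical; they are not part of PyTorch or dddd_trainer.

```python
import inspect


def call_with_supported_kwargs(fn, *args, **kwargs):
    """Call fn, dropping keyword arguments its signature does not accept."""
    params = inspect.signature(fn).parameters
    accepts_any = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if not accepts_any:
        kwargs = {k: v for k, v in kwargs.items() if k in params}
    return fn(*args, **kwargs)


# Stand-in for an export function from a newer library version that no
# longer has the `_retain_param_name` parameter (hypothetical signature).
def export(model, dummy_input, path, export_params=True, verbose=False):
    return (model, path, export_params)


result = call_with_supported_kwargs(
    export, "net", "dummy", "model.onnx",
    export_params=True, _retain_param_name=False,
)
print(result)
```

Silently dropping arguments hides version differences, so for a one-off fix, deleting the argument at the call site is usually clearer.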

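StopIteration problem

Whenever I add one particular image, StopIteration is raised, but I have no idea what is wrong with this image; all images were collected in the same way.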
0+1_56165942e06a39fbe0fe6034524cc773ba9f59fe

Traceback (most recent call last):
  File "E:\Daz3D Workshop\Enhance_Queue\_dddd_trainer\utils\train.py", line 124, in start
    test_inputs, test_labels, test_labels_length = next(val_iter)
  File "C:\Users\T1me\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\utils\data\dataloader.py", line 521, in __next__
    data = self._next_data()
  File "C:\Users\T1me\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\utils\data\dataloader.py", line 560, in _next_data
    index = self._next_index()  # may raise StopIteration
  File "C:\Users\T1me\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\utils\data\dataloader.py", line 512, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:\Daz3D Workshop\Enhance_Queue\_dddd_trainer\app.py", line 33, in <module>
    fire.Fire(App)
  File "C:\Users\T1me\AppData\Local\Programs\Python\Python39\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "C:\Users\T1me\AppData\Local\Programs\Python\Python39\lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "C:\Users\T1me\AppData\Local\Programs\Python\Python39\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "E:\Daz3D Workshop\Enhance_Queue\_dddd_trainer\app.py", line 28, in train
    trainer.start()
  File "E:\Daz3D Workshop\Enhance_Queue\_dddd_trainer\utils\train.py", line 128, in start
    test_inputs, test_labels, test_labels_length = next(val_iter)
  File "C:\Users\T1me\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\utils\data\dataloader.py", line 521, in __next__
    data = self._next_data()
  File "C:\Users\T1me\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\utils\data\dataloader.py", line 560, in _next_data
    index = self._next_index()  # may raise StopIteration
  File "C:\Users\T1me\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\utils\data\dataloader.py", line 512, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
StopIteration
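
The traceback shows next(val_iter) failing at train.py line 124 and then again at line 128 while the first StopIteration is being handled. One reading, which is an inference from the traceback and not confirmed from the source, is that the retry also finds an empty iterator, meaning the validation loader yields no batches at all. The sketch below shows a restart helper that reports this case explicitly; the names are hypothetical and plain lists stand in for a DataLoader.

```python
def next_or_restart(iterator, make_iterator):
    """Return (item, iterator); rebuild the iterator once if it is exhausted.

    Raises ValueError with a clear message when even a fresh iterator is
    empty, instead of letting a second StopIteration escape.
    """
    try:
        return next(iterator), iterator
    except StopIteration:
        iterator = make_iterator()
        try:
            return next(iterator), iterator
        except StopIteration:
            raise ValueError("validation data is empty: no batches to iterate") from None


batches = [["a", "b"], ["c"]]
it = iter(batches)
seen = []
for _ in range(3):
    item, it = next_or_restart(it, lambda: iter(batches))
    seen.append(item)
print(seen)  # the third call wraps around to the first batch
```

With an explicit error, the next thing to check would be whether the validation cache file actually contains any entries.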

Training shows LastLoss: nan AvgLoss: nan. What causes this and how can it be fixed? Thanks

There are 122 images in total. The configuration is:
Model:
    CharSet: [' ', '8', '2', r, v, h, m, b, '5', w, '7', t, k, '6', y, p, '3', l,
        q, x, a, e, f, n, s, '4']
    ImageChannel: 1
    ImageHeight: 64
    ImageWidth: -1
    Word: false
System:
    Allow_Ext: [jpg, jpeg, png, bmp]
    GPU: true
    GPU_ID: 0
    Path: E:\Val\images_login
    Project: mlogin
    Val: 0.03
Train:
    BATCH_SIZE: 32
    CNN: {NAME: ddddocr}
    DROPOUT: 0.3
    LR: 0.01
    OPTIMIZER: SGD
    SAVE_CHECKPOINTS_STEP: 2000
    TARGET: {Accuracy: 0.97, Cost: 0.05, Epoch: 20}
    TEST_BATCH_SIZE: 32
    TEST_STEP: 1000

2022-07-12 00:38:15.250 | INFO | utils.train:start:108 - [2022-07-12-00_38_15] Epoch: 140500 Step: 421500 LastLoss: nan AvgLoss: nan Lr: 0.00015268545525806817
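
Once a loss becomes NaN it normally stays NaN, because the weights themselves have been overwritten with NaN, so the log line above at step 421500 says little about when training actually broke. A small check like the following sketch (hypothetical helper, plain floats standing in for logged losses) finds the first non-finite value, so that the run can be stopped and restarted from an earlier checkpoint.

```python
import math


def first_bad_step(losses):
    """Return the index of the first NaN/inf loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


history = [2.31, 1.87, float("nan"), float("nan")]
print(first_bad_step(history))
```

A lower LR is a common first thing to try when losses diverge; whether it resolves this particular report is not confirmed.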

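Can this run on a PC with a 2060 GPU?

I'm a complete beginner and wanted to try this out. I have spent 3 days on it and still cannot get the environment working. Could someone kindly give me some guidance? Ideally a checklist covering the Python, CUDA and cuDNN versions. My machine is an HP OMEN (暗影精灵) with a 3050 GPU and 8 GB.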

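Is training on slider CAPTCHAs supported?

Sorry, I have never used Python. I found this open-source project through a web search, and it really is great.

In my tests the recognition rate for slider CAPTCHAs is rather low, but the documentation does not explain how to organize the folders for slider CAPTCHAs.

Or how to label them?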

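Model comparison

Is there any comparison data for these models? Which one converges fastest, which gives the best results, and which runs fastest? Also, I would like to convert the model to TensorFlow and then port it to a mobile platform; which model is the best fit for that?

ddddocr
effnetv2_l
effnetv2_m
effnetv2_xl
effnetv2_s
mobilenetv2
mobilenetv3_s
mobilenetv3_l

Also, is the ddddocr model original, or is it adapted from another model?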

Cannot resume automatically after an interruption caused by a full disk

RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

2023-07-26 09:59:13.215 | INFO     | __main__:__init__:12 - 
Hello baby~
2023-07-26 09:59:13.216 | INFO     | __main__:train:26 - 
Start Train ----> images98

2023-07-26 09:59:13.221 | INFO     | utils.train:__init__:41 - 
Taget:
min_Accuracy: 0.97
min_Epoch: 20
max_Loss: 0.05
2023-07-26 09:59:13.221 | INFO     | utils.train:__init__:45 - 
USE GPU ----> 0
2023-07-26 09:59:13.221 | INFO     | utils.train:__init__:52 - 
Search for history checkpoints...
Traceback (most recent call last):
  File "/www/wwwroot/dddd_trainer/app.py", line 33, in <module>
    fire.Fire(App)
  File "/www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/lib/python3.6/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/lib/python3.6/site-packages/fire/core.py", line 480, in _Fire
    target=component.__name__)
  File "/www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/lib/python3.6/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/www/wwwroot/dddd_trainer/app.py", line 27, in train
    trainer = train.Train(project_name)
  File "/www/wwwroot/dddd_trainer/utils/train.py", line 63, in __init__
    os.path.join(self.checkpoints_path, newer_checkpoint), self.device)
  File "/www/wwwroot/dddd_trainer/nets/__init__.py", line 223, in load_checkpoint
    param = torch.load(path, map_location=device)
  File "/www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/lib/python3.6/site-packages/torch/serialization.py", line 600, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/lib/python3.6/site-packages/torch/serialization.py", line 242, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
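
"failed finding central directory" means the newest checkpoint is not a complete zip archive, which is what a write cut short by a full disk leaves behind. A common workaround is to delete or skip the truncated file so that an older, intact checkpoint is loaded. The sketch below picks the newest file that is still a complete zip; it assumes the checkpoints use torch.save's zip-based format, and the function and file names are hypothetical.

```python
import os
import tempfile
import time
import zipfile


def newest_readable_checkpoint(paths):
    """Return the most recently modified path that is a complete zip archive."""
    for path in sorted(paths, key=os.path.getmtime, reverse=True):
        if zipfile.is_zipfile(path):  # False when the central directory is missing
            return path
    return None


# Demo: one intact archive and a newer, truncated copy of it
tmp = tempfile.mkdtemp()
good = os.path.join(tmp, "ckpt_old.bin")
bad = os.path.join(tmp, "ckpt_new.bin")
with zipfile.ZipFile(good, "w") as z:
    z.writestr("data.pkl", b"weights")
with open(good, "rb") as src, open(bad, "wb") as dst:
    data = src.read()
    dst.write(data[: len(data) // 2])  # simulate a write cut short
now = time.time()
os.utime(good, (now - 100, now - 100))
os.utime(bad, (now, now))  # the truncated file is the newest
print(os.path.basename(newest_readable_checkpoint([good, bad])))
```

Passing this check only shows that the archive is structurally complete; it does not prove that the stored weights load correctly.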

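Color image training: acc is still 0.0 after 24 hours

Sample data (images are 150 px high; the dataset has 900 images):
[image]

Configuration file (the images are 150 px high, so the height is set to 144 and the width to automatic):
[image]

PyTorch version: 1.10.2, CUDA version: 10.2, GPU: 2060ti
Training on the test dataset succeeded in 24 minutes; on my own data, acc is still 0.0 after 24 hours.
Screenshot of the run (24 hours in, the loss keeps decreasing but acc stays at 0.0; color images):
[image]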

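Suggestion: validate the project name when creating a project

I suggest that python app.py create {project_name} check whether project_name contains an underscore and, if it does, raise an error instead of creating the project.

Reason:
https://github.com/sml2h3/dddd_trainer/blob/main/utils/train.py#L58

When checkpoints are loaded here, the file name is split on underscores, so loading fails if the project name itself contains an underscore. That is what happened to me: after several days of training, the model could not be loaded when I started training again.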
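
The effect is easy to reproduce in isolation. The sketch below assumes file names shaped like the exported model name quoted in an earlier issue (testocr_1.0_28_19000_2022-03-21-20-44-59), read as project, version, epoch, step and timestamp; that reading and both parser functions are illustrative assumptions, not the code at train.py#L58.

```python
def parse_from_left(stem: str) -> dict:
    """Naive parsing: assumes the project name contains no underscore."""
    parts = stem.split("_")
    return {"project": parts[0], "epoch": parts[2], "step": parts[3]}


def parse_from_right(stem: str) -> dict:
    """Peel the four trailing fields off from the right, keeping the name whole."""
    project, _version, epoch, step, _timestamp = stem.rsplit("_", 4)
    return {"project": project, "epoch": epoch, "step": step}


print(parse_from_left("testocr_1.0_28_19000_2022-03-21-20-44-59"))
print(parse_from_left("my_test_1.0_28_19000_2022-03-21-20-44-59"))   # fields shift
print(parse_from_right("my_test_1.0_28_19000_2022-03-21-20-44-59"))  # still correct
```

Parsing from the right would make underscores in project names harmless; rejecting them at create time, as suggested above, is the simpler safeguard.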

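How to prepare training material

These are the raw images. I have two questions. First, how do I prepare the answers (labels) for them; is doing it by hand the only way? Second, this is a click-to-select CAPTCHA; how should it be trained, can everything simply be trained together?
[image]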

Random hash

Does "random hash" simply mean any arbitrary value? I don't understand what the random hash is supposed to be.

How to train on CPU

I set GPU to false in the config file.
After running python app.py train test, CPU usage did not change noticeably.
[image]

What does this error mean?

2023-07-26 04:37:11.717 | INFO     | utils.train:start:110 - [2023-07-26-04_37_11]	Epoch: 35952	Step: 1725700	LastLoss: 0.00010419684986118227	AvgLoss: 0.00012866455923358445	Lr: 2.7344957135266526e-10
2023-07-26 04:37:14.401 | INFO     | utils.train:start:110 - [2023-07-26-04_37_14]	Epoch: 35954	Step: 1725800	LastLoss: 0.00011213675315957516	AvgLoss: 0.00012821589938539547	Lr: 2.7344957135266526e-10
2023-07-26 04:37:17.084 | INFO     | utils.train:start:110 - [2023-07-26-04_37_17]	Epoch: 35956	Step: 1725900	LastLoss: 0.00011715076107066125	AvgLoss: 0.00012866744575148914	Lr: 2.7344957135266526e-10
Traceback (most recent call last):
  File "/www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/lib/python3.6/site-packages/torch/serialization.py", line 379, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/lib/python3.6/site-packages/torch/serialization.py", line 499, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
OSError: [Errno 28] No space left on device

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/www/wwwroot/dddd_trainer/app.py", line 33, in <module>
    fire.Fire(App)
  File "/www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/lib/python3.6/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/lib/python3.6/site-packages/fire/core.py", line 480, in _Fire
    target=component.__name__)
  File "/www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/lib/python3.6/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/www/wwwroot/dddd_trainer/app.py", line 28, in train
    trainer.start()
  File "/www/wwwroot/dddd_trainer/utils/train.py", line 120, in start
    "epoch": self.epoch, "step": self.step, "lr": lr})
  File "/www/wwwroot/dddd_trainer/nets/__init__.py", line 188, in save_model
    torch.save(net, path)
  File "/www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/lib/python3.6/site-packages/torch/serialization.py", line 380, in save
    return
  File "/www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/lib/python3.6/site-packages/torch/serialization.py", line 259, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:300] . unexpected pos 17094848 vs 17094736
terminate called after throwing an instance of 'c10::Error'
  what():  [enforce fail at inline_container.cc:300] . unexpected pos 17094848 vs 17094736
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x47 (0x7fe49ddc3ae7 in /www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x2797840 (0x7fe4e3e5c840 in /www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x2792e1c (0x7fe4e3e57e1c in /www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #3: caffe2::serialize::PyTorchStreamWriter::writeRecord(std::string const&, void const*, unsigned long, bool) + 0xb5 (0x7fe4e3e5fa85 in /www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #4: caffe2::serialize::PyTorchStreamWriter::writeEndOfFile() + 0x173 (0x7fe4e3e5fd73 in /www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #5: caffe2::serialize::PyTorchStreamWriter::~PyTorchStreamWriter() + 0x125 (0x7fe4e3e5ffe5 in /www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0xb43ee3 (0x7fe56504eee3 in /www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x2a73f8 (0x7fe5647b23f8 in /www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x2a86fe (0x7fe5647b36fe in /www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #9: /www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/bin/python3() [0x480622]
frame #10: /www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/bin/python3() [0x434697]
frame #11: /www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/bin/python3() [0x4346a7]
frame #12: /www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/bin/python3() [0x4346a7]
frame #13: /www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/bin/python3() [0x4346a7]
frame #14: /www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/bin/python3() [0x4346a7]
frame #15: /www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/bin/python3() [0x4346a7]
frame #16: /www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/bin/python3() [0x4346a7]
frame #17: /www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/bin/python3() [0x4346a7]
frame #18: PyDict_SetItemString + 0x3b7 (0x4a2647 in /www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/bin/python3)
frame #19: PyImport_Cleanup + 0x71 (0x565771 in /www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/bin/python3)
frame #20: /www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/bin/python3() [0x421d98]
frame #21: Py_Main + 0x640 (0x43b7d0 in /www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/bin/python3)
frame #22: main + 0x162 (0x41d982 in /www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/bin/python3)
frame #23: __libc_start_main + 0xf5 (0x7fe57390c555 in /lib64/libc.so.6)
frame #24: /www/wwwroot/dddd_trainer/11ddbaf3386aea1f2974eee984542152_venv/bin/python3() [0x41da40]

Aborted
[root@RTX3090 dddd_trainer]# 
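
The first exception in this log is the root cause: OSError: [Errno 28] No space left on device. The later RuntimeError and the C++ stack frames are follow-on failures from the half-written file. Checking free space before each save avoids leaving a truncated checkpoint behind. A minimal sketch using only the standard library follows; the helper name and the 64 MB margin are arbitrary assumptions.

```python
import shutil


def has_free_space(directory: str, required_bytes: int) -> bool:
    """True if the filesystem holding `directory` has at least `required_bytes` free."""
    return shutil.disk_usage(directory).free >= required_bytes


# The failed write in the log stopped at about 17 MB, so ask for a few
# times that as headroom before saving (the margin is an arbitrary choice).
print(has_free_space(".", 64 * 1024 * 1024))
```

With a save every SAVE_CHECKPOINTS_STEP steps and more than 1.7 million steps in this run, old checkpoints accumulate quickly, so pruning them periodically is worth considering.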

acc is always 0

    def tester(self, inputs, labels, labels_length):
        predict = self.get_features(inputs)
        pred_decode_labels = []
        labels_list = []
        correct_list = []
        error_list = []
        i = 0
        labels = [int(x) for x in labels.tolist()]
        # labels = labels.tolist()

Here the labels are all floats, so the later comparison

            if label_res == pred_res:
                correct_list.append(ids)
            else:
                error_list.append(ids)

is almost always false.

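Why is my acc always 0? I have not changed the code

[image]
I am using images like the one above. After labeling them I trained for a long time, but acc stays at 0. AvgLoss and the other values all went down, yet acc never moves. Looking at the tester code, correct_list is always empty. Can anyone tell me what the cause is?

After I changed ImageChannel, acc went up immediately.

May I ask which configuration settings you changed? I am fairly new to this.
Like you, I changed the ImageChannel parameter, but the recognition rate is still 0. Could the problem be my images? The example image shown on the project home page has no preprocessing applied, so I also fed mine in unprocessed. Should I apply something like denoising first?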

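Are there any reference training times?

OS: macOS Ventura 13.1
Processor: 2GHz Quad-Core Intel Core i5
Memory: 16 GB 3733MHz LPDDR4X
Dataset: 1700+ images
So far I have trained on CPU for 24x3 hours, and Acc keeps hovering between 0.2 and 0.3

OS: Ubuntu 22.04.2 LTS 64-bit
Processor: 12th Gen Intel Core i5-12400 x 12
Graphics: Nvidia RTX 3060ti 8g
RAM: 32g
CUDA: 12.0

Dataset: 1700+ images
So far I have trained for 11 hours, and Acc keeps hovering between 0.4 and 0.6

Could anyone share their training times?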

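Could the training dataset be open-sourced?

First of all, thank you very much for such an excellent open-source library. However, models I train myself do not perform as well as the model that ships with the library, so I would like to ask whether the training dataset could be released. For the cases that are recognized poorly, we could then add our own samples and train further.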
