SphereFace bug

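This post records problems encountered while training the SphereFace model, including a training loss that does not decrease effectively and a model snapshot that fails to save, along with possible causes.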


I0915 05:11:15.613822  3017 solver.cpp:218] Iteration 27700 (1.1337 iter/s, 88.2068s/100 iters), loss = 4.18249
I0915 05:11:15.613924  3017 solver.cpp:237]     Train net output #0: lambda = 5
I0915 05:11:15.613945  3017 solver.cpp:237]     Train net output #1: softmax_loss = 4.18249 (* 1 = 4.18249 loss)
I0915 05:11:15.613955  3017 sgd_solver.cpp:105] Iteration 27700, lr = 0.001
I0915 05:12:43.843036  3017 solver.cpp:218] Iteration 27800 (1.13341 iter/s, 88.2292s/100 iters), loss = 2.92496
I0915 05:12:43.843133  3017 solver.cpp:237]     Train net output #0: lambda = 5
I0915 05:12:43.843152  3017 solver.cpp:237]     Train net output #1: softmax_loss = 2.92496 (* 1 = 2.92496 loss)
I0915 05:12:43.843163  3017 sgd_solver.cpp:105] Iteration 27800, lr = 0.001
I0915 05:14:12.033928  3017 solver.cpp:218] Iteration 27900 (1.1339 iter/s, 88.1909s/100 iters), loss = 3.67801
I0915 05:14:12.034019  3017 solver.cpp:237]     Train net output #0: lambda = 5
I0915 05:14:12.034037  3017 solver.cpp:237]     Train net output #1: softmax_loss = 3.67801 (* 1 = 3.67801 loss)
I0915 05:14:12.034049  3017 sgd_solver.cpp:105] Iteration 27900, lr = 0.001
I0915 05:15:38.687129  3017 solver.cpp:447] Snapshotting to binary proto file result/sphereface_model_iter_28000.caffemodel
F0915 05:15:39.744756  3017 io.cpp:69] Check failed: proto.SerializeToOstream(&output)
*** Check failure stack trace: ***
    @     0x7f8640241daa  (unknown)
    @     0x7f8640241ce4  (unknown)
    @     0x7f86402416e6  (unknown)
    @     0x7f8640244687  (unknown)
    @     0x7f864086940c  caffe::WriteProtoToBinaryFile()
    @     0x7f8640847152  caffe::Solver<>::SnapshotToBinaryProto()
    @     0x7f8640847261  caffe::Solver<>::Snapshot()
    @     0x7f864084a2d8  caffe::Solver<>::Solve()
    @           0x408085  train()
    @           0x4059ac  main
    @     0x7f863f551f45  (unknown)
    @           0x40620b  (unknown)
    @              (nil)  (unknown)
Aborted (core dumped)
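
To judge convergence from more than a few lines, the loss values can be pulled out of the log and averaged. The following is a minimal sketch, assuming the log lines have exactly the `solver.cpp` format shown above; the function names `parse_losses` and `mean_loss` are made up for illustration and are not part of Caffe.

```python
import re

# Matches solver.cpp lines such as:
#   "... Iteration 27700 (1.1337 iter/s, 88.2068s/100 iters), loss = 4.18249"
# The "Iteration N, lr = ..." lines from sgd_solver.cpp do not match, because
# the pattern requires the parenthesised timing info after the iteration number.
LOSS_RE = re.compile(r"Iteration (\d+) \(.*?\), loss = ([0-9.eE+-]+)")


def parse_losses(log_text):
    """Return a list of (iteration, loss) pairs found in a Caffe training log."""
    return [(int(m.group(1)), float(m.group(2))) for m in LOSS_RE.finditer(log_text)]


def mean_loss(pairs):
    """Average loss over the parsed pairs, or None if nothing was parsed."""
    if not pairs:
        return None
    return sum(loss for _, loss in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical file name; point this at wherever the training log was saved.
    with open("train.log") as f:
        pairs = parse_losses(f.read())
    print(pairs[-5:], mean_loss(pairs[-50:]))
```

Averaging over a window smooths the per-iteration noise visible above, where the loss moves between 2.9 and 4.2 within 200 iterations.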

There are two main problems:

A. Training has not converged well: softmax_loss hovers around 3.x.

B. The model snapshot cannot be saved.
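
For problem B, the `Check failed: proto.SerializeToOstream(&output)` message is raised while the snapshot file is being written. The log does not say why. Two common causes worth ruling out, offered here as assumptions rather than a diagnosis, are a missing or unwritable output directory and a full disk. The sketch below checks both before training starts. The helper name `snapshot_dir_ok` and the 1 GiB threshold are hypothetical, and the prefix `result/sphereface_model` is inferred from the file name in the log.

```python
import os
import shutil


def snapshot_dir_ok(snapshot_prefix, min_free_bytes=1 << 30):
    """Check the directory of a snapshot prefix before training.

    snapshot_prefix: e.g. "result/sphereface_model" (inferred from the log).
    min_free_bytes:  illustrative threshold; choose something larger than
                     the size of one .caffemodel plus one .solverstate.
    Returns (ok, message).
    """
    directory = os.path.dirname(snapshot_prefix) or "."
    if not os.path.isdir(directory):
        return False, "directory does not exist: " + directory
    if not os.access(directory, os.W_OK):
        return False, "directory is not writable: " + directory
    free = shutil.disk_usage(directory).free
    if free < min_free_bytes:
        return False, "only %d bytes free in %s" % (free, directory)
    return True, "ok"


if __name__ == "__main__":
    print(snapshot_dir_ok("result/sphereface_model"))
```

If the check passes and the snapshot still fails, the cause lies elsewhere and this sketch will not help.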

One difference from the author's setup is that my batch size is 128.
Running on multiple GPUs fails with this error:

 Multi-GPU execution not available - rebuild with USE_NCCL
*** Check failure stack trace: ***
    @     0x7f569c608add  google::LogMessage::Fail()
    @     0x7f569c60aa43  google::LogMessage::SendToLog()
    @     0x7f569c608659  google::LogMessage::Flush()
    @     0x7f569c60b44e  google::LogMessageFatal::~LogMessageFatal()
    @           0x40acd9  train()
    @           0x406e73  main
    @     0x7f569b814f45  __libc_start_main
    @           0x407755  (unknown)
Aborted (core dumped)


sphereface 5.5ms

ap:           98.676096
eer:          95.300000
tpr001:       94.500000
tpr0001:      93.000000
tpr00001:     NaN
tpr000001:    NaN
tpr0:         75.733333


