I0915 05:11:15.613924 3017 solver.cpp:237] Train net output #0: lambda = 5
I0915 05:11:15.613945 3017 solver.cpp:237] Train net output #1: softmax_loss = 4.18249 (* 1 = 4.18249 loss)
I0915 05:11:15.613955 3017 sgd_solver.cpp:105] Iteration 27700, lr = 0.001
I0915 05:12:43.843036 3017 solver.cpp:218] Iteration 27800 (1.13341 iter/s, 88.2292s/100 iters), loss = 2.92496
I0915 05:12:43.843133 3017 solver.cpp:237] Train net output #0: lambda = 5
I0915 05:12:43.843152 3017 solver.cpp:237] Train net output #1: softmax_loss = 2.92496 (* 1 = 2.92496 loss)
I0915 05:12:43.843163 3017 sgd_solver.cpp:105] Iteration 27800, lr = 0.001
I0915 05:14:12.033928 3017 solver.cpp:218] Iteration 27900 (1.1339 iter/s, 88.1909s/100 iters), loss = 3.67801
I0915 05:14:12.034019 3017 solver.cpp:237] Train net output #0: lambda = 5
I0915 05:14:12.034037 3017 solver.cpp:237] Train net output #1: softmax_loss = 3.67801 (* 1 = 3.67801 loss)
I0915 05:14:12.034049 3017 sgd_solver.cpp:105] Iteration 27900, lr = 0.001
I0915 05:15:38.687129 3017 solver.cpp:447] Snapshotting to binary proto file result/sphereface_model_iter_28000.caffemodel
F0915 05:15:39.744756 3017 io.cpp:69] Check failed: proto.SerializeToOstream(&output)
*** Check failure stack trace: ***
@ 0x7f8640241daa (unknown)
@ 0x7f8640241ce4 (unknown)
@ 0x7f86402416e6 (unknown)
@ 0x7f8640244687 (unknown)
@ 0x7f864086940c caffe::WriteProtoToBinaryFile()
@ 0x7f8640847152 caffe::Solver<>::SnapshotToBinaryProto()
@ 0x7f8640847261 caffe::Solver<>::Snapshot()
@ 0x7f864084a2d8 caffe::Solver<>::Solve()
@ 0x408085 train()
@ 0x4059ac main
@ 0x7f863f551f45 (unknown)
@ 0x40620b (unknown)
@ (nil) (unknown)
Aborted (core dumped)
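There are two main problems:
A. Training has not converged well; softmax_loss stays around 3.x.
B. The model cannot be saved (the snapshot fails).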
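For problem B, the failed check `proto.SerializeToOstream(&output)` in `io.cpp` is raised when the snapshot file cannot be written. Two possible causes worth ruling out are a missing `result/` directory and a full disk. The following is only a minimal sketch of such a pre-flight check, not part of Caffe: the helper name `check_snapshot_target` and the 1 GiB free-space threshold are assumptions, and the prefix `result/sphereface_model` is taken from the snapshot path in the log.

```python
import os
import shutil


def check_snapshot_target(snapshot_prefix, min_free_bytes=1 << 30):
    """Return a list of problems that could make writing a snapshot fail.

    snapshot_prefix: path prefix of the snapshot files (hypothetical helper
    argument, mirroring the solver's snapshot prefix).
    min_free_bytes: assumed safety margin, 1 GiB by default.
    """
    problems = []
    directory = os.path.dirname(snapshot_prefix) or "."
    if not os.path.isdir(directory):
        problems.append("directory does not exist: " + directory)
        return problems
    if not os.access(directory, os.W_OK):
        problems.append("directory is not writable: " + directory)
    free = shutil.disk_usage(directory).free
    if free < min_free_bytes:
        problems.append("only %d bytes free in %s" % (free, directory))
    return problems


if __name__ == "__main__":
    # Prefix taken from the log line "Snapshotting to binary proto file ...".
    for problem in check_snapshot_target("result/sphereface_model"):
        print(problem)
```

Run it from the same working directory as the training command. If it prints nothing, the directory and free space are probably not the cause.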
The difference from the author's setup is that my batch size is 128.
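For problem A, the per-iteration loss in the log jumps between roughly 2.9 and 4.2, so single values say little about the trend. The sketch below extracts the `Iteration N (...), loss = X` lines and smooths them with a moving average. The function names and the window size are assumptions chosen for illustration, and the regular expression is based only on the log format shown above.

```python
import re

# Matches solver lines such as:
# "... Iteration 27800 (1.13341 iter/s, 88.2292s/100 iters), loss = 2.92496"
LOSS_RE = re.compile(r"Iteration (\d+) \(.*\), loss = ([-+0-9.eE]+)")


def read_losses(lines):
    """Return (iteration, loss) pairs found in an iterable of log lines."""
    pairs = []
    for line in lines:
        match = LOSS_RE.search(line)
        if match:
            pairs.append((int(match.group(1)), float(match.group(2))))
    return pairs


def moving_average(values, window):
    """Average each value with up to window - 1 preceding values."""
    averages = []
    total = 0.0
    for i, value in enumerate(values):
        total += value
        if i >= window:
            total -= values[i - window]
        averages.append(total / min(i + 1, window))
    return averages


if __name__ == "__main__":
    sample = [
        "I0915 05:12:43.843036 3017 solver.cpp:218] Iteration 27800 "
        "(1.13341 iter/s, 88.2292s/100 iters), loss = 2.92496",
        "I0915 05:14:12.033928 3017 solver.cpp:218] Iteration 27900 "
        "(1.1339 iter/s, 88.1909s/100 iters), loss = 3.67801",
    ]
    pairs = read_losses(sample)
    smoothed = moving_average([loss for _, loss in pairs], window=2)
    for (iteration, _), value in zip(pairs, smoothed):
        print(iteration, value)
```

Feeding the full training log through `read_losses` and using a larger window should show whether the loss is still decreasing or has plateaued.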
Error reported when using multiple GPUs:
Multi-GPU execution not available - rebuild with USE_NCCL
*** Check failure stack trace: ***
@ 0x7f569c608add google::LogMessage::Fail()
@ 0x7f569c60aa43 google::LogMessage::SendToLog()
@ 0x7f569c608659 google::LogMessage::Flush()
@ 0x7f569c60b44e google::LogMessageFatal::~LogMessageFatal()
@ 0x40acd9 train()
@ 0x406e73 main
@ 0x7f569b814f45 __libc_start_main
@ 0x407755 (unknown)
Aborted (core dumped)
sphereface 5.5ms
ap: 98.676096
eer: 95.300000
tpr001: 94.500000
tpr0001: 93.000000
tpr00001: NaN
tpr000001: NaN
tpr0: 75.733333