UMDFaces Dataset

Hi, thanks for your project. Could you share the UMDFaces dataset with me?


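I have a question about the implementation of cosin_add_m_layer: what does the following code in Forward_cpu() mean? The ArcFace paper seems to mention only the else branch below. Could you explain the intent of the check cos_t[i * dim + gt] <= threshold and the corresponding handling? Thanks!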

if(cos_t[i * dim + gt] <= threshold)
{
    top_data[i * dim + gt] = cos_t[i * dim + gt] - sin(M_PI - m_) * m_;
    tpflag[i * dim + gt] = 1.0f;
}
else
{
    top_data[i * dim + gt] = cos_t[i * dim + gt] * cos_m - sin_theta * sin_m;
}
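
One possible reading of the two branches, sketched in Python rather than the layer's C++: cos(theta + m) stops decreasing once theta + m exceeds pi, so implementations in the insightface style switch to a linear penalty past that point. The sketch assumes threshold = cos(pi - m), which the snippet above does not show, and `arcface_target_logit` is a hypothetical name, not a function from this repository.

```python
import math

def arcface_target_logit(cos_t, m=0.5):
    """Margin-adjusted logit for the ground-truth class (illustrative only)."""
    # Assumption: the threshold marks the angle where theta + m would pass pi.
    threshold = math.cos(math.pi - m)
    if cos_t <= threshold:
        # Fallback branch: subtract a constant so the logit keeps decreasing
        # with theta instead of rising again once theta + m > pi.
        return cos_t - math.sin(math.pi - m) * m
    # Regular branch: cos(theta + m) expanded with the angle-sum identity.
    sin_t = math.sqrt(max(0.0, 1.0 - cos_t * cos_t))
    return cos_t * math.cos(m) - sin_t * math.sin(m)
```

With this choice the output never increases as theta grows from 0 to pi, which is presumably the point of the extra branch.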

Caffe training is still very slow; did the author ever find the cause?

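@xialuxi, thank you for your reply. This morning I read the implementation of the adaface paper.
Did the author later find out why training with Caffe is so slow? The forward pass seems quite fast to me.
My settings are batch size = 56, two 1080 GPUs, iter_size: 6.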
I0513 12:49:24.228211 23509 solver.cpp:243] Iteration 0, loss = 24.1503
I0513 12:49:24.228235 23509 solver.cpp:259] Train net output #0: accuracy-t = 0.839286
I0513 12:49:24.228257 23509 solver.cpp:259] Train net output #1: softmax_loss = 22.5511 (* 1 = 22.5511 loss)
I0513 12:49:24.228299 23509 sgd_solver.cpp:138] Iteration 0, lr = 0.01
I0513 12:54:50.852994 23509 solver.cpp:243] Iteration 100, loss = 20.3324
I0513 12:54:50.853057 23509 solver.cpp:259] Train net output #0: accuracy-t = 0.928571
I0513 12:54:50.853081 23509 solver.cpp:259] Train net output #1: softmax_loss = 17.7896 (* 1 = 17.7896 loss)
I0513 12:54:50.923504 23509 sgd_solver.cpp:138] Iteration 100, lr = 0.01
I0513 13:00:39.458894 23509 solver.cpp:243] Iteration 200, loss = 18.8438
I0513 13:00:39.459019 23509 solver.cpp:259] Train net output #0: accuracy-t = 0.964286
I0513 13:00:39.459044 23509 solver.cpp:259] Train net output #1: softmax_loss = 16.4004 (* 1 = 16.4004 loss)
I0513 13:00:39.500185 23509 sgd_solver.cpp:138] Iteration 200, lr = 0.01
I0513 13:06:26.364652 23509 solver.cpp:243] Iteration 300, loss = 18.461
I0513 13:06:26.364759 23509 solver.cpp:259] Train net output #0: accuracy-t = 0.946429
I0513 13:06:26.364783 23509 solver.cpp:259] Train net output #1: softmax_loss = 16.8125 (* 1 = 16.8125 loss)
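
As a rough reading of the log above (a sketch, not a profile), the timestamps of iterations 0 and 100 give the time per solver iteration. It assumes Caffe's usual behaviour of running `iter_size` forward/backward passes per iteration and that the batch size of 56 applies per pass; `seconds_between` is a hypothetical helper.

```python
from datetime import datetime

def seconds_between(t0, t1):
    """Seconds between two glog-style HH:MM:SS.ffffff timestamps on the same day."""
    fmt = "%H:%M:%S.%f"
    return (datetime.strptime(t1, fmt) - datetime.strptime(t0, fmt)).total_seconds()

# Timestamps copied from the "Iteration 0" and "Iteration 100" lines.
elapsed = seconds_between("12:49:24.228211", "12:54:50.852994")
sec_per_iter = elapsed / 100        # about 3.27 s per solver iteration
sec_per_pass = sec_per_iter / 6     # iter_size: 6, so about 0.54 s per pass
images_per_iter = 56 * 6            # images accumulated per iteration (assumed per GPU)
```

So each logged iteration already covers 336 images under these assumptions, which makes the per-iteration time look slower than a single pass really is.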


Hi, I compared the cpp and cu code and found a bug in how the diff is calculated in the CosinAddmBackward function: the result must be multiplied by bottom_diff[index * dim + gt] when calculating bottom_diff. The following code should be used:
bottom_diff[index * dim + gt] = bottom_diff[index * dim + gt] * (cos(bais) + sin(bais) * cos_theta / sin_theta);

instead of:
bottom_diff[index * dim + gt] = cos(bais) + sin(bais) * cos_theta / sin_theta;
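
To illustrate why the upstream gradient has to be kept, here is a small finite-difference check in Python. It is only a sketch: the function names are hypothetical, `top_diff` stands for whatever gradient arrives from the layer above, and it assumes the layer input is cos(theta) and the output is cos(theta + m).

```python
import math

def margin_forward(cos_t, m):
    """Forward pass for the target class: cos(theta + m) with theta = acos(cos_t)."""
    return math.cos(math.acos(cos_t) + m)

def margin_backward(cos_t, m, top_diff):
    """Chain rule: upstream gradient times the local derivative w.r.t. cos_t."""
    sin_t = math.sqrt(1.0 - cos_t * cos_t)
    local_grad = math.cos(m) + math.sin(m) * cos_t / sin_t
    return top_diff * local_grad

def numeric_backward(cos_t, m, top_diff, eps=1e-6):
    """Central finite difference of top_diff * margin_forward, for comparison."""
    slope = (margin_forward(cos_t + eps, m) - margin_forward(cos_t - eps, m)) / (2 * eps)
    return top_diff * slope
```

Dropping the `top_diff` factor would give the same value for every upstream gradient, which is what the proposed fix avoids.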

Did anyone try AdaCos?

I will try it in a couple of days, compare it with ArcFace on MegaFace and other tests, and post the results. However, I'm a bit confused about the M parameter.

Compile on Ubuntu

Hi, I compiled your repository as you described. I did the following:

1) I downloaded the repository (project).
2) In the caffe-windows directory I edited Makefile.config as described in the Caffe installation instructions:
2.1) cd caffe-windows
2.2) for req in $(cat python/requirements.txt); do pip install --trusted-host $req; done
2.3) cp Makefile.config.example Makefile.config
2.4) gedit Makefile.config
PYTHON_INCLUDE := /usr/include/python2.7
INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include /usr/include/hdf5/serial
LIBRARY_DIRS := $(PYTHON_LIB) /usr/local/lib /usr/lib /usr/lib/x86_64-linux-gnu /usr/lib/x86_64-linux-gnu/hdf5/serial
3) Copy cosin_add_m_layer.hpp to the directory: ./caffe/include/caffe/layers/
4) Copy cosin_add_m_layer.cpp to the directory: ./caffe/src/caffe/layers/
5) According to the proto file, modify the ./caffe/src/caffe/proto/caffe.proto file accordingly.
6) I also copied combined_margin_layer.cpp and combined_margin_layer.hpp to the locations described in steps 3 and 4.
7) make -j8
8) make py
9) make test -j8
But after these steps, when I run the following command:
make runtest -j8
it fails for some layer tests, so I have not run your repository yet. I must be missing something; can you correct my compilation steps?

Thank you for your time.

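What is the difference between the previous and the current cosin_add_m scale implementation?

Previously there were two layers: one adds the angular margin m, and the other applies a scale of 64 or 128.
The new ArcFace version merges them into a single layer, with no scale value to set. What is the difference between this and the implementation above?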
layer {
  name: "cosin_add_m"
  type: "CosinAddm"
  bottom: "temp_fc6"
  bottom: "label"
  top: "fc6_margin"
  cosin_add_m_param {
    m: 0.5
  }
}
layer {
  name: "fc6_margin_scale"
  type: "Scale"
  bottom: "fc6_margin"
  top: "fc6_margin_scale"
  param {
    lr_mult: 0
    decay_mult: 0
  }
  scale_param {
    filler {
      type: "constant"
      value: 64
    }
  }
}

The modified addm layer:
layer {
  name: "adacos_add_m_scale"
  type: "AdaCosAddmScale"
  bottom: "fc6"
  bottom: "label"
  top: "fc6_margin_scale"
  adacos_add_m_scale_param {
    m: 0.5
    num_classes: 10575
  }
}
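
The visible difference between the two snippets is that the constant Scale layer (64) disappears and the merged layer receives num_classes instead. Assuming AdaCosAddmScale follows the AdaCos paper (this is not confirmed by the snippets), the scale would be derived from the class count rather than set by hand. A minimal Python sketch of the paper's fixed scale, with a hypothetical function name:

```python
import math

def adacos_fixed_scale(num_classes):
    """Fixed scale sqrt(2) * ln(C - 1) from the AdaCos paper (assumed to apply here)."""
    return math.sqrt(2.0) * math.log(num_classes - 1)

# For the 10575 classes in the prototxt above this is roughly 13.1, not 64.
s_init = adacos_fixed_scale(10575)
```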



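Hello author, is your LMDB data built from cropped and aligned images plus labels? What do the labels contain? Is it one folder per person with multiple images, so that the number of classes equals the number of person folders? How is the data prepared?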
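
A common convention, not confirmed by the author here, is one folder per identity, with the folder's index used as the integer class label and a "path label" list file passed to Caffe's convert_imageset tool. A minimal Python sketch under that assumption (both function names are hypothetical):

```python
import os

def make_list_lines(identity_to_images):
    """Map each identity folder to an integer label and emit 'path label' lines."""
    lines = []
    for label, identity in enumerate(sorted(identity_to_images)):
        for image_name in sorted(identity_to_images[identity]):
            lines.append(f"{identity}/{image_name} {label}")
    return lines

def scan_identities(root):
    """Collect {folder_name: [image files]} from a one-folder-per-person layout."""
    return {
        d: os.listdir(os.path.join(root, d))
        for d in os.listdir(root)
        if os.path.isdir(os.path.join(root, d))
    }
```

With this layout the number of classes is simply the number of identity folders, and labels run from 0 to that number minus one.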


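Hi xialuxi:
The softmax computation in the ArcFace loss differs from the ordinary softmax:
the denominator separates the y_i term from the j terms. Where in the code is this part computed? And where is the derivative written?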
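
One way to see this: if a margin layer changes only the ground-truth logit and an ordinary softmax loss follows, the split denominator appears automatically. The Python sketch below checks that numerically; the function names are hypothetical and it is not the repository's code.

```python
import math

def softmax_cross_entropy(logits, label):
    """Plain softmax loss, as a stock softmax-with-loss layer would compute it."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]

def arcface_loss_direct(cos_t, label, m, s):
    """ArcFace loss with the y_i term and the j != y_i terms written separately."""
    target = math.exp(s * math.cos(math.acos(cos_t[label]) + m))
    others = sum(math.exp(s * c) for j, c in enumerate(cos_t) if j != label)
    return -math.log(target / (target + others))

def arcface_loss_layered(cos_t, label, m, s):
    """Same loss as: margin on the target logit, scale by s, ordinary softmax loss."""
    logits = list(cos_t)
    logits[label] = math.cos(math.acos(cos_t[label]) + m)
    return softmax_cross_entropy([s * z for z in logits], label)
```

Under this view the derivative is also split: the margin layer's backward pass handles the target logit, and the softmax loss handles everything else.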

Does not work when training a two-class classification model

Hello guys,
First, thanks to the author for his excellent work.
I use this ArcFace loss to fine-tune a two-class classification model, but I don't know why it does not work. The loss does not decrease and the accuracy jumps around. The parameters are m = 0.5, s = 64. I have tried some other parameters, but the result is always the same.
Has anyone encountered a similar problem? Thanks.


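In your implementation, theta_med is computed as the mean angle over all samples in the batch across all classes, whereas the paper says "the median of all corresponding classes' angles". My understanding is that this refers to each sample's angle to its label class. Is that right?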
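
For comparison, here is a Python sketch of the dynamic scale as the AdaCos paper describes it, where theta_med is the median of each sample's angle to its own label class. This is a reading of the paper, not the repository's implementation, and `adacos_dynamic_scale` is a hypothetical name.

```python
import math
from statistics import median

def adacos_dynamic_scale(cos_theta, labels, prev_s):
    """One scale update; cos_theta is a list of per-sample rows of class cosines."""
    def angle(c):
        # Clamp before acos to guard against rounding slightly outside [-1, 1].
        return math.acos(max(-1.0, min(1.0, c)))

    # Median over the batch of the angle between each sample and its label class.
    theta_med = median(angle(row[y]) for row, y in zip(cos_theta, labels))

    # B_avg: batch average of the summed exp(s * cos) over non-label classes.
    b_avg = sum(
        math.exp(prev_s * c)
        for row, y in zip(cos_theta, labels)
        for k, c in enumerate(row)
        if k != y
    ) / len(labels)

    return math.log(b_avg) / math.cos(min(math.pi / 4.0, theta_med))
```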


What kind of optimizer (Adam, SGD) did you use for the tests on, e.g., CASIA-WebFace?

I'm asking because it looks like the gradients in the final layer with AdaCos are lower than with a fixed s=20 (98 classes). AdaCos also gets lower scores. I suspect the gradients may be too low for the model to learn.
I'm using Adam, and the results are as follows (the dataset is CARS196):

  1. Adam fixed: 0.79
  2. Adam AdaCos: 0.745
  3. Adam AdaCos x2 bigger lr: 0.755

In general AdaCos works worse for some reason; I'm not sure why. Maybe it is also because the average angle for non-similar classes is smaller than in the case of faces.
Or we need a more adaptive LR method for this problem.


Iteration 285400(1.51479 iter/s,132.031s/200 iters),loss=1.58242
Train net output #0:accuracy_hat=1
Train net output #1:accuracy_hat_arc=0.4375
Train net output #2:loss_hat=1.58242(*1 = 1.58242 loss)



Landmark detection is inaccurate



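I train the network with a batch size of 128 and a learning rate that decreases from 0.001 to 0.00001. After 20,000 iterations the train loss keeps oscillating between 3 and 4 and will not go down. Changing the initial learning rate gives the same result. How did you set the training parameters?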


1. For the ArcFace gradient, the code has: cos_m + sin_m * cos_t[i * dim + gt] / sin_theta, which is sin(theta+m)/sin(theta). My own calculation gives -sin(theta+m). Am I missing something?
2. For the combined margin gradient, your code has: m1 * pow(1 - pow(bottom_data[i * dim + gt], 2), -0.5) * sin(m1_x_m2[i * dim + gt]), which is m1 * sin(m1 * theta+m2) / sin(theta). My own calculation gives -m1 * sin(m1 * theta+m2). How did you arrive at your result? Many thanks!
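
One way to reconcile the two results, assuming the layer's input is x = cos(theta) rather than theta itself: -sin(theta+m) is the derivative with respect to theta, and the code's expression is the derivative with respect to x. The extra factor comes from d(theta)/dx:

```latex
x = \cos\theta, \qquad \theta = \arccos x, \qquad
\frac{d\theta}{dx} = -\frac{1}{\sin\theta}

\frac{\partial \cos(\theta + m)}{\partial x}
  = -\sin(\theta + m)\,\frac{d\theta}{dx}
  = \frac{\sin(\theta + m)}{\sin\theta}
  = \cos m + \sin m\,\frac{\cos\theta}{\sin\theta}

\frac{\partial \cos(m_1\theta + m_2)}{\partial x}
  = -m_1 \sin(m_1\theta + m_2)\,\frac{d\theta}{dx}
  = \frac{m_1 \sin(m_1\theta + m_2)}{\sin\theta}
```

The factor pow(1 - x^2, -0.5) in the combined margin code is exactly 1/sin(theta).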

Error compiling Combined Margin Loss?

Hello, I just downloaded your latest Combined Margin Loss files, but compilation fails with errors. Could you take a look at what the problem is?

Severity Code Description Project File Line Suppression State
Error C2065 'arccos_x': undeclared identifier libcaffe E:\AMSoftmax-master\Caffe-AM-Softmax\caffe-windows\src\caffe\layers\combined_margin_layer.cpp 32
Severity Code Description Project File Line Suppression State
Error C2228 left of '.mutable_cpu_data' must have class/struct/union libcaffe E:\AMSoftmax-master\Caffe-AM-Softmax\caffe-windows\src\caffe\layers\combined_margin_layer.cpp 32




Focal loss

I noticed that you use Focal loss as a second loss. What's the purpose?

Could you share a wingloss example?

Could you share a wingloss example for the train prototxt, like this EuclideanLoss one?

layer {
  name: "loss"
  type: "EuclideanLoss"
  bottom: "fc2"
  bottom: "label"
  top: "loss"
  loss_weight: 100
}
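
For reference while waiting for a prototxt example, this is the per-coordinate wing loss as defined in the Wing Loss paper, sketched in Python. It is not the repository's layer: the defaults w = 10 and epsilon = 2 are assumptions for illustration, and no layer parameter names are implied.

```python
import math

def wing_loss(x, w=10.0, epsilon=2.0):
    """Wing loss for a single residual x = prediction - target."""
    abs_x = abs(x)
    # Constant that makes the two branches meet at |x| = w.
    c = w - w * math.log(1.0 + w / epsilon)
    if abs_x < w:
        return w * math.log(1.0 + abs_x / epsilon)  # log region for small errors
    return abs_x - c                                # L1-like region for large errors
```

Unlike EuclideanLoss, the small-error region is logarithmic, so small landmark errors keep a comparatively large gradient.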


layer {
  name: "cosin_add_m"
  type: "CosinAddm"
  bottom: "concat_fc"
  bottom: "label"
  top: "fc6_margin"
  cosin_add_m_param {
    m: 0.5
  }
}

layer {
  name: "fc6_margin_scale"
  type: "Scale"
  bottom: "fc6_margin"
  top: "fc6_margin_scale"
  param {
    lr_mult: 0
    decay_mult: 0
  }
  scale_param {
    filler {
      type: "constant"
      value: 64
    }
  }
}

layer {
  name: "concat_loss"
  type: "SoftmaxWithLoss"
  bottom: "fc6_margin_scale"
  bottom: "label"
  top: "concat_loss"
}

layer {
  name: "concat_loss"
  type: "SoftmaxWithLoss"
  bottom: "concat_fc"
  bottom: "label"
  top: "concat_loss"
}
then it converges. I can't figure out why.

How to get the similarity between two faces?

Hello, is there a demo that uses a caffemodel to get the distance and similarity between two faces, like the deploy/ directory of the original insightface? Thanks, waiting for your reply.
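
Until such a demo exists, the comparison step can be sketched independently of Caffe. The Python below assumes two embedding vectors have already been extracted from the network (the extraction itself is not shown) and uses hypothetical function names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embeddings; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def normalized_distance(a, b):
    """Euclidean distance between the L2-normalised embeddings: sqrt(2 - 2 * cos)."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * cosine_similarity(a, b)))
```

A pair is then usually accepted as the same person when the similarity exceeds a threshold chosen on a validation set.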







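Adding the ArcFace loss function

A question about adding your loss function: so far I have only added the cosin_add_m_layer-related proto and parameters. During training I get many outputs like: costheta >1 ************ 1.58. What could be the cause?
Also, is training with the Caffe version similar to MXNet, i.e. first train with softmax only for 120,000 steps, then add the ArcFace loss layer and fine-tune?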



Do the convolution layers use a bias (bias_term)?



name: "ArcFace"
layer {
name: "data"
type: "Data"
top: "data"
top: "label"
include {
phase: TRAIN
transform_param {
resize_param {
prob: 1
resize_mode: WARP
height: 128
width: 128
interp_mode: LINEAR
interp_mode: AREA
interp_mode: CUBIC
interp_mode: LANCZOS4
mirror: True
crop_h: 128
crop_w: 128
#distort_param {
# brightness_prob: 0.5
# brightness_delta: 32
# contrast_prob: 0.5
# contrast_lower: 0.5
# contrast_upper: 1.5
# hue_prob: 0.5
# hue_delta: 18
# saturation_prob: 0.5
# saturation_lower: 0.5
# saturation_upper: 1.5
# random_order_prob: 0.
data_param {
source: "/media/zz/7c333a37-0503-4f81-8103-0ef7e776f6fb/Face_Data/casia_extract_aligned_train_9204cls_lmdb"
batch_size: 512
backend: LMDB
layer {
name: "data"
type: "Data"
top: "data"
top: "label"
include {
phase: TEST
transform_param {
resize_param {
prob: 1
resize_mode: WARP
height: 128
width: 128
interp_mode: LINEAR
crop_h: 128
crop_w: 128
data_param {
source: "/media/zz/7c333a37-0503-4f81-8103-0ef7e776f6fb/Face_Data/casia_extract_aligned_test_9204cls_lmdb"
batch_size: 2
backend: LMDB
############## CNN Architecture ###############
layer {
name: "data/bias"
type: "Bias"
bottom: "data"
top: "data/bias"
param {
lr_mult: 0
decay_mult: 0
bias_param {
filler {
type: "constant"
value: -128
layer {
name: "conv1"
type: "Convolution"
bottom: "data/bias"
top: "conv1"
param {
lr_mult: 1
decay_mult: 1
param {
lr_mult: 2
decay_mult: 0
convolution_param {
num_output: 32
kernel_size: 7
pad: 3
stride: 1
weight_filler {
type: "gaussian"
std: 0.01
bias_filler {
type: "constant"
value: 0
layer {
name: "conv1_bn"
type: "BatchNorm"
bottom: "conv1"
top: "conv1"
layer {
name: "conv1_scale"
type: "Scale"
bottom: "conv1"
top: "conv1"
scale_param {
filler {
value: 1
bias_term: true
bias_filler {
value: 0
layer {
name: "conv1_relu"
type: "ReLU"
bottom: "conv1"
top: "conv1"
layer {
name: "pool1"
type: "Pooling"
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
layer {
name: "pool1_1"
type: "Pooling"
bottom: "pool1"
top: "pool1_1"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
layer {
name: "conv2_1"
type: "Convolution"
bottom: "pool1_1"
top: "conv2_1"
param {
lr_mult: 1
decay_mult: 1
param {
lr_mult: 2
decay_mult: 0
convolution_param {
num_output: 32
kernel_size: 1
stride: 1
pad: 0
weight_filler {
type: "gaussian"
std: 0.01
bias_filler {
type: "constant"
value: 0
layer {
name: "conv2_1_bn"
type: "BatchNorm"
bottom: "conv2_1"
top: "conv2_1"
layer {
name: "conv2_1_scale"
type: "Scale"
bottom: "conv2_1"
top: "conv2_1"
scale_param {
filler {
value: 1
bias_term: true
bias_filler {
value: 0
layer {
name: "conv2_1_relu"
type: "ReLU"
bottom: "conv2_1"
top: "conv2_1"
layer {
name: "conv2_2"
type: "Convolution"
bottom: "conv2_1"
top: "conv2_2"
param {
lr_mult: 1
decay_mult: 1
param {
lr_mult: 2
decay_mult: 0
convolution_param {
num_output: 64
kernel_size: 3
stride: 1
pad: 1
weight_filler {
type: "gaussian"
std: 0.01
bias_filler {
type: "constant"
value: 0
layer {
name: "conv2_2_bn"
type: "BatchNorm"
bottom: "conv2_2"
top: "conv2_2"
layer {
name: "conv2_2_scale"
type: "Scale"
bottom: "conv2_2"
top: "conv2_2"
scale_param {
filler {
value: 1
bias_term: true
bias_filler {
value: 0
layer {
name: "conv2_2_relu"
type: "ReLU"
bottom: "conv2_2"
top: "conv2_2"
layer {
name: "pool2"
type: "Pooling"
bottom: "conv2_2"
top: "pool2"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
layer {
name: "conv3_1"
type: "Convolution"
bottom: "pool2"
top: "conv3_1"
param {
lr_mult: 1
decay_mult: 1
param {
lr_mult: 2
decay_mult: 0
convolution_param {
num_output: 64
kernel_size: 1
pad: 0
stride: 1
weight_filler {
type: "gaussian"
std: 0.01
bias_filler {
type: "constant"
value: 0
layer {
name: "conv3_1_bn"
type: "BatchNorm"
bottom: "conv3_1"
top: "conv3_1"
layer {
name: "conv3_1_scale"
type: "Scale"
bottom: "conv3_1"
top: "conv3_1"
scale_param {
filler {
value: 1
bias_term: true
bias_filler {
value: 0
layer {
name: "conv3_1_relu"
type: "ReLU"
bottom: "conv3_1"
top: "conv3_1"
layer {
name: "conv3_2"
type: "Convolution"
bottom: "conv3_1"
top: "conv3_2"
param {
lr_mult: 1
decay_mult: 1
param {
lr_mult: 2
decay_mult: 0
convolution_param {
num_output: 96
kernel_size: 3
pad: 1
stride: 1
weight_filler {
type: "gaussian"
std: 0.01
bias_filler {
type: "constant"
value: 0
layer {
name: "conv3_2_bn"
type: "BatchNorm"
bottom: "conv3_2"
top: "conv3_2"
layer {
name: "conv3_2_scale"
type: "Scale"
bottom: "conv3_2"
top: "conv3_2"
scale_param {
filler {
value: 1
bias_term: true
bias_filler {
value: 0
layer {
name: "conv3_2_relu"
type: "ReLU"
bottom: "conv3_2"
top: "conv3_2"
layer {
name: "conv4_1"
type: "Convolution"
bottom: "conv3_2"
top: "conv4_1"
param {
lr_mult: 1
decay_mult: 1
param {
lr_mult: 2
decay_mult: 0
convolution_param {
num_output: 96
kernel_size: 1
pad: 0
stride: 1
weight_filler {
type: "gaussian"
std: 0.01
bias_filler {
type: "constant"
value: 0
layer {
name: "conv4_1_bn"
type: "BatchNorm"
bottom: "conv4_1"
top: "conv4_1"
layer {
name: "conv4_1_scale"
type: "Scale"
bottom: "conv4_1"
top: "conv4_1"
scale_param {
filler {
value: 1
bias_term: true
bias_filler {
value: 0
layer {
name: "conv4_1_relu"
type: "ReLU"
bottom: "conv4_1"
top: "conv4_1"
layer {
name: "conv4_2"
type: "Convolution"
bottom: "conv4_1"
top: "conv4_2"
param {
lr_mult: 1
decay_mult: 1
param {
lr_mult: 2
decay_mult: 0
convolution_param {
num_output: 128
kernel_size: 3
pad: 1
stride: 1
weight_filler {
type: "gaussian"
std: 0.01
bias_filler {
type: "constant"
value: 0
layer {
name: "conv4_2_bn"
type: "BatchNorm"
bottom: "conv4_2"
top: "conv4_2"
layer {
name: "conv4_2_scale"
type: "Scale"
bottom: "conv4_2"
top: "conv4_2"
scale_param {
filler {
value: 1
bias_term: true
bias_filler {
value: 0
layer {
name: "conv4_2_relu"
type: "ReLU"
bottom: "conv4_2"
top: "conv4_2"
layer {
name: "conv5_1"
type: "Convolution"
bottom: "conv4_2"
top: "conv5_1"
param {
lr_mult: 1
decay_mult: 1
param {
lr_mult: 2
decay_mult: 0
convolution_param {
num_output: 96
kernel_size: 1
pad: 0
stride: 1
weight_filler {
type: "gaussian"
std: 0.01
bias_filler {
type: "constant"
value: 0
layer {
name: "conv5_1_bn"
type: "BatchNorm"
bottom: "conv5_1"
top: "conv5_1"
layer {
name: "conv5_1_scale"
type: "Scale"
bottom: "conv5_1"
top: "conv5_1"
scale_param {
filler {
value: 1
bias_term: true
bias_filler {
value: 0
layer {
name: "conv5_1_relu"
type: "ReLU"
bottom: "conv5_1"
top: "conv5_1"
layer {
name: "pool3"
type: "Pooling"
bottom: "conv5_1"
top: "pool3"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
layer {
name: "fc1"
type: "InnerProduct"
bottom: "pool3"
top: "fc1"
param {
lr_mult: 1
decay_mult: 1
param {
lr_mult: 2
decay_mult: 0
inner_product_param {
num_output: 1024
weight_filler {
type: "xavier"
bias_filler {
type: "constant"
value: 0
layer {
name: "fc1_bn"
type: "BatchNorm"
bottom: "fc1"
top: "fc1"
layer {
name: "fc1_scale"
type: "Scale"
bottom: "fc1"
top: "fc1"
scale_param {
filler {
value: 1
bias_term: true
bias_filler {
value: 0
layer {
name: "fc1_relu"
type: "ReLU"
bottom: "fc1"
top: "fc1"
layer {
name: "fc2"
type: "InnerProduct"
bottom: "fc1"
top: "fc2"
param {
lr_mult: 1
decay_mult: 1
param {
lr_mult: 2
decay_mult: 0
inner_product_param {
num_output: 128
weight_filler {
type: "xavier"
bias_filler {
type: "constant"
value: 0
layer {
name: "fc2_norm"
type: "NormalizeJin"
bottom: "fc2"
top: "fc2_norm"
norm_jin_param {
across_spatial: true
scale_filler {
type: "constant"
value: 1.0
channel_shared: true
############### Arc-Softmax Loss ##############

layer {
name: "fc6_changed"
type: "InnerProduct"
bottom: "fc2_norm"
top: "fc6"
inner_product_param {
num_output: 9204
normalize: true
weight_filler {
type: "xavier"
bias_term: false
layer {
name: "cosin_add_m"
type: "CosinAddm"
bottom: "fc6"
bottom: "label"
top: "fc6_margin"
cosin_add_m_param {
m: 0.1
include {
phase: TRAIN

layer {
name: "fc6_margin_scale"
type: "Scale"
bottom: "fc6_margin"
top: "fc6_margin_scale"
param {
lr_mult: 0
decay_mult: 0
scale_param {
type: "constant"
value: 64
include {
phase: TRAIN

layer {
name: "softmax_loss"
type: "SoftmaxWithLoss"
bottom: "fc6_margin_scale"
bottom: "label"
#bottom: "label"
#bottom: "data"
top: "softmax_loss"
loss_weight: 1
include {
phase: TRAIN

layer {
name: "Accuracy"
type: "Accuracy"
bottom: "fc6"
bottom: "label"
top: "accuracy"
include {
phase: TEST

I0627 17:38:58.567371 6757 solver.cpp:224] Iteration 450 (2.13816 iter/s, 4.67691s/10 iters), loss = 87.3365
I0627 17:38:58.567402 6757 solver.cpp:243] Train net output #0: softmax_loss = 87.3365 (* 1 = 87.3365 loss)
I0627 17:38:58.567409 6757 sgd_solver.cpp:137] Iteration 450, lr = 0.00314
I0627 17:39:03.256306 6757 solver.cpp:224] Iteration 460 (2.13288 iter/s, 4.6885s/10 iters), loss = 87.3365
I0627 17:39:03.256340 6757 solver.cpp:243] Train net output #0: softmax_loss = 87.3365 (* 1 = 87.3365 loss)
I0627 17:39:03.256347 6757 sgd_solver.cpp:137] Iteration 460, lr = 0.00314
I0627 17:39:07.941520 6757 solver.cpp:224] Iteration 470 (2.13457 iter/s, 4.68478s/10 iters), loss = 87.3365
I0627 17:39:07.941551 6757 solver.cpp:243] Train net output #0: softmax_loss = 87.3365 (* 1 = 87.3365 loss)
I0627 17:39:07.941558 6757 sgd_solver.cpp:137] Iteration 470, lr = 0.00314
I0627 17:39:12.623337 6757 solver.cpp:224] Iteration 480 (2.13612 iter/s, 4.68139s/10 iters), loss = 87.3365
I0627 17:39:12.623456 6757 solver.cpp:243] Train net output #0: softmax_loss = 87.3365 (* 1 = 87.3365 loss)
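
The constant 87.3365 in the log above is a recognisable value: Caffe's softmax loss clamps the predicted probability at FLT_MIN before taking the log, so a loss stuck at exactly this number commonly means the probability of the true class has underflowed (often after the logits diverged). A quick Python check of the arithmetic, with a hypothetical helper name:

```python
import math

FLT_MIN = 2.0 ** -126  # smallest positive normal float32, about 1.1755e-38

def clamped_softmax_loss(prob_of_true_class):
    """Per-sample loss with the probability clamped at FLT_MIN before the log."""
    return -math.log(max(prob_of_true_class, FLT_MIN))

# A probability of zero (or anything below FLT_MIN) gives about 87.3365.
stuck_loss = clamped_softmax_loss(0.0)
```

Lowering the learning rate or starting with a smaller margin and scale are the usual things to try in that situation, though the log alone does not show which one applies here.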


Hi xialuxi:
It looks like CosinAddmLayer is exactly the arc loss (the combined margin case with m1=1, m3=0). So why does CosinAddmLayer handle cos_t[i * dim + gt] > 1.0f and cos_t[i * dim + gt] <= threshold, while combined_margin_layer does not need to?

