Image Similarity Siamese Network

Overview

With this kernel I am trying to run a simple test of using Siamese networks for similarity on a slightly more complicated problem than standard MNIST. The idea is to take a randomly initialized network and train it on pairs of images to tell how similar they are. Such a model should make it much easier to perform tasks like visual search over a database of images, since it reduces each comparison to a simple similarity metric between 0 and 1 instead of working with raw 2D arrays.

In [1]:
import numpy as np
import os
import pandas as pd
from keras.preprocessing.image import ImageDataGenerator
from keras.utils.np_utils import to_categorical
import matplotlib.pyplot as plt
Using TensorFlow backend.
/opt/conda/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
  return f(*args, **kwds)

Load and Organize Data

Here we load and organize the data so we can easily use it inside Keras models.

In [2]:
from sklearn.model_selection import train_test_split
data_train = pd.read_csv('../input/fashion-mnist_train.csv')
X_full = data_train.iloc[:,1:]
y_full = data_train.iloc[:,:1]
x_train, x_test, y_train, y_test = train_test_split(X_full, y_full, test_size = 0.3)
In [3]:
x_train = x_train.values.reshape(-1, 28, 28, 1).astype('float32') / 255.
x_test = x_test.values.reshape(-1, 28, 28, 1).astype('float32') / 255.
y_train = y_train.values.astype('int')
y_test = y_test.values.astype('int')
print('Training', x_train.shape, x_train.max())
print('Testing', x_test.shape, x_test.max())
Training (42000, 28, 28, 1) 1.0
Testing (18000, 28, 28, 1) 1.0
In [4]:
# reorganize by groups
train_groups = [x_train[np.where(y_train==i)[0]] for i in np.unique(y_train)]
test_groups = [x_test[np.where(y_test==i)[0]] for i in np.unique(y_test)]
print('train groups:', [x.shape[0] for x in train_groups])
print('test groups:', [x.shape[0] for x in test_groups])
train groups: [4165, 4155, 4162, 4196, 4258, 4246, 4239, 4184, 4230, 4165]
test groups: [1835, 1845, 1838, 1804, 1742, 1754, 1761, 1816, 1770, 1835]

Batch Generation

Here the idea is to make usable batches for training the network. We need to create parallel inputs for the A and B images, where the output is the similarity score. We make the naive assumption that the similarity is 1 if the two images come from the same group and 0 otherwise.

If we sampled pairs completely at random, most of them would come from different groups (with 10 balanced classes, only about 10% of random pairs match), so the generator below explicitly draws half matching and half non-matching pairs.

In [5]:
def gen_random_batch(in_groups, batch_halfsize = 8):
    out_img_a, out_img_b, out_score = [], [], []
    all_groups = list(range(len(in_groups)))
    for match_group in [True, False]:
        group_idx = np.random.choice(all_groups, size = batch_halfsize)
        out_img_a += [in_groups[c_idx][np.random.choice(range(in_groups[c_idx].shape[0]))] for c_idx in group_idx]
        if match_group:
            b_group_idx = group_idx
            out_score += [1]*batch_halfsize
        else:
            # anything but the same group
            non_group_idx = [np.random.choice([i for i in all_groups if i!=c_idx]) for c_idx in group_idx] 
            b_group_idx = non_group_idx
            out_score += [0]*batch_halfsize
            
        out_img_b += [in_groups[c_idx][np.random.choice(range(in_groups[c_idx].shape[0]))] for c_idx in b_group_idx]
            
    return np.stack(out_img_a,0), np.stack(out_img_b,0), np.stack(out_score,0)

Validate Data

Here we make sure the generator is doing something sensible by showing a few image pairs together with their similarity percentage.

In [6]:
pv_a, pv_b, pv_sim = gen_random_batch(train_groups, 3)
fig, m_axs = plt.subplots(2, pv_a.shape[0], figsize = (12, 6))
for c_a, c_b, c_d, (ax1, ax2) in zip(pv_a, pv_b, pv_sim, m_axs.T):
    ax1.imshow(c_a[:,:,0])
    ax1.set_title('Image A')
    ax1.axis('off')
    ax2.imshow(c_b[:,:,0])
    ax2.set_title('Image B\n Similarity: %3.0f%%' % (100*c_d))
    ax2.axis('off')

Feature Generation

Here we build the feature generation network that turns images into feature vectors. The network starts off randomly initialized and will (hopefully) be trained to generate useful vector features from the input images.

In [7]:
from keras.models import Model
from keras.layers import Input, Conv2D, BatchNormalization, MaxPool2D, Activation, Flatten, Dense, Dropout
img_in = Input(shape = x_train.shape[1:], name = 'FeatureNet_ImageInput')
n_layer = img_in
for i in range(2):
    n_layer = Conv2D(8*2**i, kernel_size = (3,3), activation = 'linear')(n_layer)
    n_layer = BatchNormalization()(n_layer)
    n_layer = Activation('relu')(n_layer)
    n_layer = Conv2D(16*2**i, kernel_size = (3,3), activation = 'linear')(n_layer)
    n_layer = BatchNormalization()(n_layer)
    n_layer = Activation('relu')(n_layer)
    n_layer = MaxPool2D((2,2))(n_layer)
n_layer = Flatten()(n_layer)
n_layer = Dense(32, activation = 'linear')(n_layer)
n_layer = Dropout(0.5)(n_layer)
n_layer = BatchNormalization()(n_layer)
n_layer = Activation('relu')(n_layer)
feature_model = Model(inputs = [img_in], outputs = [n_layer], name = 'FeatureGenerationModel')
feature_model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
FeatureNet_ImageInput (Input (None, 28, 28, 1)         0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 26, 26, 8)         80        
_________________________________________________________________
batch_normalization_1 (Batch (None, 26, 26, 8)         32        
_________________________________________________________________
activation_1 (Activation)    (None, 26, 26, 8)         0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 24, 24, 16)        1168      
_________________________________________________________________
batch_normalization_2 (Batch (None, 24, 24, 16)        64        
_________________________________________________________________
activation_2 (Activation)    (None, 24, 24, 16)        0         
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 12, 12, 16)        0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 10, 10, 16)        2320      
_________________________________________________________________
batch_normalization_3 (Batch (None, 10, 10, 16)        64        
_________________________________________________________________
activation_3 (Activation)    (None, 10, 10, 16)        0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 8, 8, 32)          4640      
_________________________________________________________________
batch_normalization_4 (Batch (None, 8, 8, 32)          128       
_________________________________________________________________
activation_4 (Activation)    (None, 8, 8, 32)          0         
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 4, 4, 32)          0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 32)                16416     
_________________________________________________________________
dropout_1 (Dropout)          (None, 32)                0         
_________________________________________________________________
batch_normalization_5 (Batch (None, 32)                128       
_________________________________________________________________
activation_5 (Activation)    (None, 32)                0         
=================================================================
Total params: 25,040
Trainable params: 24,832
Non-trainable params: 208
_________________________________________________________________

Siamese Model

We apply the feature generating model to both images and then combine the resulting feature vectors to predict whether the two images are similar or not. The model is designed to be very simple. The ultimate idea is that when a new image arrives, its feature vector can be calculated with the FeatureGenerationModel, while the feature vectors of all existing images have been pre-calculated and stored in a database. The similarity head can then be applied with a few vector additions and multiplications to find the most similar images, and those operations could be implemented as a stored procedure or similar task inside the database itself, since they do not require an entire deep learning framework to run (a sketch of this idea appears after the model is compiled below).

In [8]:
from keras.layers import concatenate
img_a_in = Input(shape = x_train.shape[1:], name = 'ImageA_Input')
img_b_in = Input(shape = x_train.shape[1:], name = 'ImageB_Input')
img_a_feat = feature_model(img_a_in)
img_b_feat = feature_model(img_b_in)
combined_features = concatenate([img_a_feat, img_b_feat], name = 'merge_features')
combined_features = Dense(16, activation = 'linear')(combined_features)
combined_features = BatchNormalization()(combined_features)
combined_features = Activation('relu')(combined_features)
combined_features = Dense(4, activation = 'linear')(combined_features)
combined_features = BatchNormalization()(combined_features)
combined_features = Activation('relu')(combined_features)
combined_features = Dense(1, activation = 'sigmoid')(combined_features)
similarity_model = Model(inputs = [img_a_in, img_b_in], outputs = [combined_features], name = 'Similarity_Model')
similarity_model.summary()
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
ImageA_Input (InputLayer)       (None, 28, 28, 1)    0                                            
__________________________________________________________________________________________________
ImageB_Input (InputLayer)       (None, 28, 28, 1)    0                                            
__________________________________________________________________________________________________
FeatureGenerationModel (Model)  (None, 32)           25040       ImageA_Input[0][0]               
                                                                 ImageB_Input[0][0]               
__________________________________________________________________________________________________
merge_features (Concatenate)    (None, 64)           0           FeatureGenerationModel[1][0]     
                                                                 FeatureGenerationModel[2][0]     
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 16)           1040        merge_features[0][0]             
__________________________________________________________________________________________________
batch_normalization_6 (BatchNor (None, 16)           64          dense_2[0][0]                    
__________________________________________________________________________________________________
activation_6 (Activation)       (None, 16)           0           batch_normalization_6[0][0]      
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 4)            68          activation_6[0][0]               
__________________________________________________________________________________________________
batch_normalization_7 (BatchNor (None, 4)            16          dense_3[0][0]                    
__________________________________________________________________________________________________
activation_7 (Activation)       (None, 4)            0           batch_normalization_7[0][0]      
__________________________________________________________________________________________________
dense_4 (Dense)                 (None, 1)            5           activation_7[0][0]               
==================================================================================================
Total params: 26,233
Trainable params: 25,985
Non-trainable params: 248
__________________________________________________________________________________________________
In [9]:
# setup the optimization process
similarity_model.compile(optimizer='adam', loss = 'binary_crossentropy', metrics = ['mae'])
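
As mentioned in the Siamese Model section, everything after the merge_features layer is just a handful of dense, batch-norm and activation steps, so once the model is trained it can in principle be re-implemented with plain array arithmetic against pre-computed feature vectors. Below is a minimal, optional sketch of that idea in NumPy; it is not part of the original kernel. The layer names (dense_2, batch_normalization_6, ...) are taken from the summary printed above and are auto-generated by Keras, so they may differ in another session, and the batch-norm epsilon is assumed to be the Keras default of 1e-3.

def _bn(x, gamma, beta, mean, var, eps=1e-3):
    # inference-time batch normalization: scale and shift with the stored moving statistics
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def head_score_np(feat_a, feat_b, model=similarity_model):
    # Score pre-computed (n, 32) feature vectors with the trained similarity head,
    # using only NumPy. Layer names come from the summary above and are session-dependent.
    x = np.concatenate([feat_a, feat_b], axis=-1)
    for dense_name, bn_name in [('dense_2', 'batch_normalization_6'),
                                ('dense_3', 'batch_normalization_7')]:
        W, b = model.get_layer(dense_name).get_weights()
        gamma, beta, mean, var = model.get_layer(bn_name).get_weights()
        x = np.maximum(_bn(x.dot(W) + b, gamma, beta, mean, var), 0)  # dense -> BN -> ReLU
    W, b = model.get_layer('dense_4').get_weights()
    return 1.0 / (1.0 + np.exp(-(x.dot(W) + b)))  # final sigmoid

With the 32-dimensional vectors of all catalogue images pre-computed and stored, a query then only needs feature_model.predict on the new image plus this head, which is exactly the kind of arithmetic a database-side procedure can carry out.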

Visual Model Feedback

Here we visualize what the model does by taking a small sample of randomly selected A and B images, the first half from the same category and the second half from different categories. We then show the actual similarity (1 for the same category, 0 for different categories) as well as the similarity predicted by the model. The first run here uses a completely untrained network, so we do not expect meaningful results.

In [10]:
def show_model_output(nb_examples = 3):
    pv_a, pv_b, pv_sim = gen_random_batch(test_groups, nb_examples)
    pred_sim = similarity_model.predict([pv_a, pv_b])
    fig, m_axs = plt.subplots(2, pv_a.shape[0], figsize = (12, 6))
    for c_a, c_b, c_d, p_d, (ax1, ax2) in zip(pv_a, pv_b, pv_sim, pred_sim, m_axs.T):
        ax1.imshow(c_a[:,:,0])
        ax1.set_title('Image A\n Actual: %3.0f%%' % (100*c_d))
        ax1.axis('off')
        ax2.imshow(c_b[:,:,0])
        ax2.set_title('Image B\n Predicted: %3.0f%%' % (100*p_d))
        ax2.axis('off')
    return fig
# a completely untrained model
_ = show_model_output()

In [11]:
# make a generator out of the data
def siam_gen(in_groups, batch_size = 32):
    while True:
        pv_a, pv_b, pv_sim = gen_random_batch(in_groups, batch_size//2)
        yield [pv_a, pv_b], pv_sim
# we want a constant validation group to have a frame of reference for model performance
valid_a, valid_b, valid_sim = gen_random_batch(test_groups, 1024)
loss_history = similarity_model.fit_generator(siam_gen(train_groups), 
                               steps_per_epoch = 500,
                               validation_data=([valid_a, valid_b], valid_sim),
                                              epochs = 10,
                                             verbose = True)
Epoch 1/10
500/500 [==============================] - 61s 122ms/step - loss: 0.6475 - mean_absolute_error: 0.4596 - val_loss: 0.5082 - val_mean_absolute_error: 0.3765
Epoch 2/10
500/500 [==============================] - 62s 125ms/step - loss: 0.5057 - mean_absolute_error: 0.3619 - val_loss: 0.4097 - val_mean_absolute_error: 0.2911
Epoch 3/10
500/500 [==============================] - 62s 124ms/step - loss: 0.4534 - mean_absolute_error: 0.3099 - val_loss: 0.3535 - val_mean_absolute_error: 0.2392
Epoch 4/10
500/500 [==============================] - 63s 126ms/step - loss: 0.4163 - mean_absolute_error: 0.2806 - val_loss: 0.3348 - val_mean_absolute_error: 0.2129
Epoch 5/10
500/500 [==============================] - 62s 124ms/step - loss: 0.4000 - mean_absolute_error: 0.2643 - val_loss: 0.3252 - val_mean_absolute_error: 0.2093
Epoch 6/10
500/500 [==============================] - 62s 124ms/step - loss: 0.3865 - mean_absolute_error: 0.2524 - val_loss: 0.3139 - val_mean_absolute_error: 0.2002
Epoch 7/10
500/500 [==============================] - 57s 114ms/step - loss: 0.3862 - mean_absolute_error: 0.2520 - val_loss: 0.3087 - val_mean_absolute_error: 0.2068
Epoch 8/10
500/500 [==============================] - 54s 108ms/step - loss: 0.3654 - mean_absolute_error: 0.2395 - val_loss: 0.3098 - val_mean_absolute_error: 0.1921
Epoch 9/10
500/500 [==============================] - 53s 106ms/step - loss: 0.3677 - mean_absolute_error: 0.2368 - val_loss: 0.3099 - val_mean_absolute_error: 0.1943
Epoch 10/10
500/500 [==============================] - 54s 108ms/step - loss: 0.3660 - mean_absolute_error: 0.2347 - val_loss: 0.3044 - val_mean_absolute_error: 0.1942
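
The History object returned by fit_generator is stored in loss_history but never used above. As a small optional addition (not part of the original kernel), the recorded curves can be plotted to check for overfitting; the dictionary keys below match the metric names printed in the training log.

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(loss_history.history['loss'], label='train')
ax1.plot(loss_history.history['val_loss'], label='validation')
ax1.set_title('Binary cross-entropy')
ax1.legend()
ax2.plot(loss_history.history['mean_absolute_error'], label='train')
ax2.plot(loss_history.history['val_mean_absolute_error'], label='validation')
ax2.set_title('Mean absolute error')
ax2.legend()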
In [12]:
_ = show_model_output()

T-Shirt vs Ankle Boot Plot

Here we take a random t-shirt image and a random ankle boot image (categories 0 and 9) and use the network to calculate their similarity to every image in the test set.

In [13]:
t_shirt_vec = np.stack([train_groups[0][0]]*x_test.shape[0],0)
t_shirt_score = similarity_model.predict([t_shirt_vec, x_test], verbose = True, batch_size = 128)
ankle_boot_vec = np.stack([train_groups[-1][0]]*x_test.shape[0],0)
ankle_boot_score = similarity_model.predict([ankle_boot_vec, x_test], verbose = True, batch_size = 128)
18000/18000 [==============================] - 21s 1ms/step
18000/18000 [==============================] - 20s 1ms/step
In [14]:
obj_categories = ['T-shirt/top','Trouser','Pullover','Dress',
                  'Coat','Sandal','Shirt','Sneaker','Bag','Ankle boot'
                 ]
colors = plt.cm.rainbow(np.linspace(0, 1, 10))
plt.figure(figsize=(10, 10))

for c_group, (c_color, c_label) in enumerate(zip(colors, obj_categories)):
    plt.scatter(t_shirt_score[np.where(y_test == c_group), 0],
                ankle_boot_score[np.where(y_test == c_group), 0],
                marker='.',
                color=c_color,
                linewidth=1,
                alpha=0.8,
                label=c_label)
plt.xlabel('T-Shirt Dimension')
plt.ylabel('Ankle-Boot Dimension')
plt.title('T-Shirt and Ankle-Boot Dimension')
plt.legend(loc='best')
plt.savefig('tshirt-boot-dist.png')
plt.show(block=False)

Examining the Features

Here we aim to answer a more general question: did the Feature Generation model actually produce useful features, and how can we visualize them?

In [15]:
x_test_features = feature_model.predict(x_test, verbose = True, batch_size=128)
18000/18000 [==============================] - 11s 612us/step
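
An optional way to quantify whether these features are useful (this check is not part of the original kernel) is to fit a simple nearest-neighbour classifier on one half of the test features and score it on the other half; anything well above the 10% chance level suggests the embedding separates the classes.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
feat_a, feat_b, lab_a, lab_b = train_test_split(x_test_features, y_test.ravel(),
                                                test_size=0.5, random_state=2018)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(feat_a, lab_a)
print('1-NN accuracy on the learned features: %.3f' % knn.score(feat_b, lab_b))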

Neighbor Visualization

For this we use the t-SNE neighborhood embedding to project the features onto a 2D plane and see whether the layout roughly corresponds to the groups. We use the test data for this example as well, since the training data has already been seen by the model during fitting.

In [16]:
%%time
from sklearn.manifold import TSNE
tsne_obj = TSNE(n_components=2,
                         init='pca',
                         random_state=101,
                         method='barnes_hut',
                         n_iter=500,
                         verbose=2)
tsne_features = tsne_obj.fit_transform(x_test_features)
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 18000 samples in 0.226s...
[t-SNE] Computed neighbors for 18000 samples in 4.028s...
[t-SNE] Computed conditional probabilities for sample 1000 / 18000
[t-SNE] Computed conditional probabilities for sample 2000 / 18000
[t-SNE] Computed conditional probabilities for sample 3000 / 18000
[t-SNE] Computed conditional probabilities for sample 4000 / 18000
[t-SNE] Computed conditional probabilities for sample 5000 / 18000
[t-SNE] Computed conditional probabilities for sample 6000 / 18000
[t-SNE] Computed conditional probabilities for sample 7000 / 18000
[t-SNE] Computed conditional probabilities for sample 8000 / 18000
[t-SNE] Computed conditional probabilities for sample 9000 / 18000
[t-SNE] Computed conditional probabilities for sample 10000 / 18000
[t-SNE] Computed conditional probabilities for sample 11000 / 18000
[t-SNE] Computed conditional probabilities for sample 12000 / 18000
[t-SNE] Computed conditional probabilities for sample 13000 / 18000
[t-SNE] Computed conditional probabilities for sample 14000 / 18000
[t-SNE] Computed conditional probabilities for sample 15000 / 18000
[t-SNE] Computed conditional probabilities for sample 16000 / 18000
[t-SNE] Computed conditional probabilities for sample 17000 / 18000
[t-SNE] Computed conditional probabilities for sample 18000 / 18000
[t-SNE] Mean sigma: 0.097702
[t-SNE] Computed conditional probabilities in 1.213s
[t-SNE] Iteration 50: error = 82.1846161, gradient norm = 0.0019173 (50 iterations in 27.468s)
[t-SNE] Iteration 100: error = 80.4134293, gradient norm = 0.0010669 (50 iterations in 26.792s)
[t-SNE] Iteration 150: error = 79.5910645, gradient norm = 0.0007335 (50 iterations in 27.382s)
[t-SNE] Iteration 200: error = 79.0950394, gradient norm = 0.0005696 (50 iterations in 27.344s)
[t-SNE] Iteration 250: error = 78.7620468, gradient norm = 0.0004646 (50 iterations in 27.556s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 78.762047
[t-SNE] Iteration 300: error = 3.2698922, gradient norm = 0.0012206 (50 iterations in 27.653s)
[t-SNE] Iteration 350: error = 2.7692475, gradient norm = 0.0006349 (50 iterations in 27.760s)
[t-SNE] Iteration 400: error = 2.4634285, gradient norm = 0.0004027 (50 iterations in 27.423s)
[t-SNE] Iteration 450: error = 2.2674994, gradient norm = 0.0002826 (50 iterations in 27.257s)
[t-SNE] Iteration 500: error = 2.1313024, gradient norm = 0.0002141 (50 iterations in 26.836s)
[t-SNE] Error after 500 iterations: 2.131302
CPU times: user 9min 7s, sys: 1min 10s, total: 10min 18s
Wall time: 4min 39s
In [17]:
obj_categories = ['T-shirt/top','Trouser','Pullover','Dress',
                  'Coat','Sandal','Shirt','Sneaker','Bag','Ankle boot'
                 ]
colors = plt.cm.rainbow(np.linspace(0, 1, 10))
plt.figure(figsize=(10, 10))

for c_group, (c_color, c_label) in enumerate(zip(colors, obj_categories)):
    plt.scatter(tsne_features[np.where(y_test == c_group), 0],
                tsne_features[np.where(y_test == c_group), 1],
                marker='o',
                color=c_color,
                linewidth=1,
                alpha=0.8,
                label=c_label)
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.title('t-SNE on Testing Samples')
plt.legend(loc='best')
plt.savefig('clothes-dist.png')
plt.show(block=False)

In [18]:
feature_model.save('fashion_feature_model.h5')
In [19]:
similarity_model.save('fashion_similarity_model.h5')
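
The saved .h5 files can be reloaded in a later session without re-defining the architecture. A minimal sketch (not part of the original kernel, assuming the files sit in the working directory) is shown below, using a pair of test images as a stand-in for new data.

from keras.models import load_model
reloaded_feature_model = load_model('fashion_feature_model.h5')
reloaded_similarity_model = load_model('fashion_similarity_model.h5')
new_a, new_b = x_test[:1], x_test[1:2]
print('feature vector shape:', reloaded_feature_model.predict(new_a).shape)
print('predicted similarity: %2.1f%%' % (100 * reloaded_similarity_model.predict([new_a, new_b])[0, 0]))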