Top-k Graph Search via Hierarchical Representative Matrices

License: CC 4.0 BY-SA

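These notes look at matrix compression for graph similarity queries: compressing the matrix block by block reduces memory use and speeds up queries. They also cover how clustering the graph feature vectors improves query efficiency, and how the method performs in experiments on different databases.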

"""
time: 2020.5.26
reference:
1. https://blog.youkuaiyun.com/qq_37475168/article/details/103616443
2. <<top k graph similarity>>
"""
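Auxiliary lemma:
Suppose we have the following matrix,
$\begin{bmatrix} 13 &7 &4 &16 \\ 12&14 &2 &12 \end{bmatrix}$. The claim is that the sum of the column minima is less than or equal to the sum, over the blocks of a partition, of each block's minimum row sum. For example, the minima of columns 1, 2, 3, 4 are 12, 7, 2, 12, and 12 + 7 + 2 + 12 = 33. Now split the matrix as follows,
$\begin{bmatrix} 13 &7 \\ 12&14 \end{bmatrix} \begin{bmatrix} 4 &16 \\ 2&12 \end{bmatrix}$. In the first block the sums of rows 1 and 2 are 20 and 26, so the minimum row sum is 20 (the maximum is 26). In the second block the sums are 20 and 14, so the minimum is 14 (the maximum is 20). The block minima add up to 34, and 33 ≤ 34. Taking the maximum row sum of each block instead gives 26 + 20 = 46, a looser bound that still holds: 33 ≤ 34 ≤ 46.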
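The lemma can be checked numerically on the 2×4 example. The sketch below is only a sanity check, not code from the paper; the variable names are made up for illustration. It computes the sum of the column minima and, for the two column blocks, the sum of the minimum and of the maximum row sums.

```python
# The two rows of the example matrix.
a = [13, 7, 4, 16]
b = [12, 14, 2, 12]

# Column ranges of the two blocks: columns 0-1 and columns 2-3.
blocks = [(0, 2), (2, 4)]

# Sum of the per-column minima.
col_min_sum = sum(min(x, y) for x, y in zip(a, b))

# Per block: row sums of a and b, then take the min (tight bound) or max (loose bound).
block_min_sum = sum(min(sum(a[s:e]), sum(b[s:e])) for s, e in blocks)
block_max_sum = sum(max(sum(a[s:e]), sum(b[s:e])) for s, e in blocks)

print(col_min_sum, block_min_sum, block_max_sum)  # 33 34 46
```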

This result can be applied when shrinking the matrix from one level to the next.
Using the lemma:
Suppose the feature matrix $W^{0}$ of a graph set g1, g2, g3, g4, together with a vector $v^{0}$, is as follows,
$W^{0}= \begin{bmatrix} 1 &2 &9 &4 &7 &1 \\ 2 &3 &2 &2 &7 &6 \\ 3 &3 &2 &1 &1 &9 \\ 4 &6 &5 &1 &2 &1 \end{bmatrix} v^{0} = \begin{bmatrix} 1 & 1 & 3 &8 &1 &1 \end{bmatrix}$. With the block partition used in the paper, $W^{0}$ splits into the following 6 blocks:
$\begin{bmatrix} 1&2 \\ 2&3 \end{bmatrix} \begin{bmatrix} 9&4 \\ 2&2 \end{bmatrix} \begin{bmatrix} 7&1 \\ 7&6 \end{bmatrix}$
$\begin{bmatrix} 3&3 \\ 4&6 \end{bmatrix} \begin{bmatrix} 2&1 \\ 5&1 \end{bmatrix} \begin{bmatrix} 1&9 \\ 2&1 \end{bmatrix}$
Following the "compressed matrix" idea, each block is replaced by its largest row sum and each pair of entries of $v^{0}$ by its sum, which gives $W^{1}=\begin{bmatrix} 5 &13 &13 \\ 10 &6 &10 \end{bmatrix}$ and $v^{1}=\begin{bmatrix} 2 &11 &2 \end{bmatrix}$.
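The compression step can be sketched in a few lines of Python. The rule used here (largest row sum per block for the matrix, plain sum per block for the vector) is inferred from the numbers in the example, and the function names `compress_matrix` and `compress_vector` are hypothetical, so treat this as an illustration rather than the paper's exact definition.

```python
from math import ceil

def compress_matrix(W, R, C):
    """Replace each R x C block of W by the largest row sum inside the block."""
    n, m = len(W), len(W[0])
    out = []
    for bi in range(ceil(n / R)):
        rows = W[bi * R:(bi + 1) * R]
        out.append([
            max(sum(row[bj * C:(bj + 1) * C]) for row in rows)
            for bj in range(ceil(m / C))
        ])
    return out

def compress_vector(v, C):
    """Replace each run of C consecutive entries of v by their sum."""
    return [sum(v[j:j + C]) for j in range(0, len(v), C)]

W0 = [
    [1, 2, 9, 4, 7, 1],
    [2, 3, 2, 2, 7, 6],
    [3, 3, 2, 1, 1, 9],
    [4, 6, 5, 1, 2, 1],
]
v0 = [1, 1, 3, 8, 1, 1]

W1 = compress_matrix(W0, R=2, C=2)
v1 = compress_vector(v0, C=2)
print(W1)  # [[5, 13, 13], [10, 6, 10]]
print(v1)  # [2, 11, 2]
```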
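The quantity studied here is the sum of the intersection (the minimum in each position) of two vectors with the same superscript, written with the XOR symbol $\bigoplus$. At level 1, $W^{1}_{0} \bigoplus v^{1} = 15$. At level 0, $W^{0}_{0} \bigoplus v^{0} = 11$ and $W^{0}_{1} \bigoplus v^{0} = 8$. Since $15 > \max(11, 8)$, the $\bigoplus$ value of the first row at level 1 is an upper bound for the $\bigoplus$ values of the first two rows at level 0. Likewise, the $\bigoplus$ value of the second row at level 1 is an upper bound for the $\bigoplus$ values of rows 3 and 4 at level 0.
It follows that the $\bigoplus$ value of any of rows 1 to n of $W^{0}$ never exceeds the $\bigoplus$ value of the row that covers it at a compressed level $l$.
So an upper bound on the $\bigoplus$ value for a group of graphs in $W^{i-1}$ can be obtained on demand, where $W^{i-1}_{r}$ denotes row $r$ of the level $i-1$ matrix. Row $t$ of the level-$i$ matrix, $W^{i}_{t}$, bounds the rows $W^{i-1}_{r}$ with $r$ in the range $[t \cdot R^{i}, (t+1) R^{i} - 1]$. In the example above $R = 2$, so row 0 of $W^{1}$ covers rows 0 to 1 of $W^{0}$.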
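To make the bound concrete, the following sketch computes the $\bigoplus$ value (sum of element-wise minima) for every row of the example at level 0 and level 1, then checks that each level-1 row bounds the two level-0 rows it covers. The helper name `intersect_sum` is hypothetical, and the matrices are copied from the example above.

```python
def intersect_sum(w, v):
    """Sum of element-wise minima of two equal-length count vectors."""
    return sum(min(a, b) for a, b in zip(w, v))

W0 = [
    [1, 2, 9, 4, 7, 1],
    [2, 3, 2, 2, 7, 6],
    [3, 3, 2, 1, 1, 9],
    [4, 6, 5, 1, 2, 1],
]
v0 = [1, 1, 3, 8, 1, 1]
W1 = [[5, 13, 13], [10, 6, 10]]
v1 = [2, 11, 2]
R = 2  # row block size used in the example

level0 = [intersect_sum(row, v0) for row in W0]
level1 = [intersect_sum(row, v1) for row in W1]
print(level0)  # [11, 8, 7, 8]
print(level1)  # [15, 10]

# Row t of level 1 covers rows t*R .. (t+1)*R - 1 of level 0.
for t, bound in enumerate(level1):
    covered = level0[t * R:(t + 1) * R]
    assert bound >= max(covered)
```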

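Question:
How are the graph vectors clustered when the RMS is built? The paper gives an answer, but I did not fully understand it.
The original text reads: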
Firstly, we cluster graph feature vectors in W by the cardinality of their feature multisets, and sort each cluster by their cardinality. So graph feature vectors in the same cluster have the same cardinality and graph feature vectors in the adjacent clusters have similar cardinality.
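In other words:
The graph feature vectors in W are first grouped by the cardinality of their feature multisets, and the clusters are ordered by cardinality. Concretely, take the graphs with 1 feature and put them in cluster A, then the graphs with 2 features and put them in cluster B, and so on. After that, order the clusters A, B, C, ... by their number of features.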

Secondly, we put similar graphs together within each cluster. We recognize high-frequency features from W firstly, and then sort feature vectors in each cluster by the sum of their high-frequency features. The reason of sorting feature vectors is based upon the following two observations: i) the distribution of features in W is not uniform; only a few of features occur in most of the graphs, which are denoted as high frequency features; ii) the average values of high-frequency features are greater than that of other features significantly, thus the similarity of two graphs is mainly determined by high-frequency features.
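That is, within each cluster, similar graphs are placed next to each other: the high-frequency features are identified from W first, and the feature vectors in each cluster are then sorted by the sum of their high-frequency features.

The sorting is motivated by two observations:
1) the feature distribution is not uniform: only a few features, the high-frequency features, occur in most of the graphs;
2) the average values of the high-frequency features are significantly greater than those of the other features, so the similarity of two graphs is mainly determined by the high-frequency features.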
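The two-step ordering can be sketched as a single sort with a composite key. This is a minimal sketch under assumptions that the quoted text does not settle: "high-frequency" is taken to mean the `num_high_freq` features that occur in the largest number of graphs, both sort keys are ascending, and the function name `order_feature_vectors` and the small matrix are invented for illustration.

```python
def order_feature_vectors(W, num_high_freq=1):
    """Return row indices of W ordered by (cardinality, high-frequency sum)."""
    m = len(W[0])
    # Number of graphs in which each feature occurs at least once.
    doc_freq = [sum(1 for row in W if row[j] > 0) for j in range(m)]
    # Assumption: the most widely occurring features are the high-frequency ones.
    high_freq = sorted(range(m), key=lambda j: -doc_freq[j])[:num_high_freq]

    def key(i):
        row = W[i]
        cardinality = sum(row)                    # size of the feature multiset
        hf_sum = sum(row[j] for j in high_freq)   # sum over high-frequency features
        return (cardinality, hf_sum)

    return sorted(range(len(W)), key=key), high_freq

# Toy feature matrix: 4 graphs, 3 features.
W = [
    [2, 1, 0],  # g0: cardinality 3
    [0, 1, 1],  # g1: cardinality 2
    [1, 0, 2],  # g2: cardinality 3
    [2, 0, 0],  # g3: cardinality 2
]
order, high_freq = order_feature_vectors(W)
print(high_freq)  # [0]
print(order)      # [1, 3, 2, 0]
```

Graphs g1 and g3 (cardinality 2) come first, then g2 and g0 (cardinality 3); inside each cluster the graph with the smaller count of feature 0 comes first.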

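Algorithm 2:
Input: the pre-built RMS (the hierarchical representative matrices), q (the query graph), k (the number of results)
Output: the k graphs most similar to q
An example. Suppose the hierarchical representative matrices have three levels, W0, W1, W2, and that for each level the similarity upper bound against the corresponding RVS q0, q1, q2 has been computed in advance and stored as triples (l, r, val).
[Figure: handwritten sketch of the three-level example]

Initialize heap P1 and put all triples of the last RMS level into it (the last level does not necessarily have a single row). In the example, P1 starts with a single triple.
Initialize P2, which stores the similarities of the candidate graphs (triples whose level index is 0) found during the search. It starts with K zeros.

Pop the triple with the largest val from P1 and check its l. If the triple belongs to the bottom level W0, that is, it stands for an actual graph, add that graph to the result set. Otherwise, look at the val of each row it covers in the previous-level matrix W(l-1). (Note: if l-1 = 0, val is the actual similarity between a graph and the query q; if l-1 != 0, val is an upper bound on the similarity between a set of graphs and q.) If the val of such a row is greater than the threshold held in heap P2 (its smallest element), push the row's triple into P1. If the "previous level" being visited is W0, the values in P2 must be updated; otherwise P2 is left unchanged.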
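The search loop can be sketched with two `heapq` heaps. This is a reading of the steps above, not the paper's code. It makes three assumptions that are flagged in the comments: the bounds are computed on the fly instead of being stored in advance, P2 is a min-heap whose smallest element acts as the pruning threshold, and the loop stops once k graphs are in the result. The function name `top_k` and the level-2 data (obtained by compressing the earlier example once more with R = C = 2) are illustrative.

```python
import heapq

def intersect_sum(w, q):
    """Sum of element-wise minima of two count vectors."""
    return sum(min(a, b) for a, b in zip(w, q))

def top_k(levels, queries, R, k):
    """levels[l] is the matrix of level l, queries[l] the query vector of level l."""
    top = len(levels) - 1
    # P1: max-heap of triples, stored as (-val, l, r) because heapq is a min-heap.
    p1 = [(-intersect_sum(row, queries[top]), top, r)
          for r, row in enumerate(levels[top])]
    heapq.heapify(p1)
    p2 = [0] * k  # P2: min-heap holding the k best level-0 similarities seen so far
    result = []
    while p1 and len(result) < k:  # assumption: stop once k graphs are reported
        neg_val, l, r = heapq.heappop(p1)
        if l == 0:
            result.append((r, -neg_val))  # (graph index, similarity)
            continue
        below = levels[l - 1]
        # Rows of the previous level covered by row r.
        for c in range(r * R, min((r + 1) * R, len(below))):
            val = intersect_sum(below[c], queries[l - 1])  # assumption: computed lazily
            if val > p2[0]:
                heapq.heappush(p1, (-val, l - 1, c))
                if l - 1 == 0:
                    heapq.heapreplace(p2, val)  # only exact similarities update P2
    return result

W0 = [[1, 2, 9, 4, 7, 1], [2, 3, 2, 2, 7, 6], [3, 3, 2, 1, 1, 9], [4, 6, 5, 1, 2, 1]]
W1 = [[5, 13, 13], [10, 6, 10]]
W2 = [[18, 13]]
q0, q1, q2 = [1, 1, 3, 8, 1, 1], [2, 11, 2], [13, 2]

print(top_k([W0, W1, W2], [q0, q1, q2], R=2, k=2))  # [(0, 11), (1, 8)]
```

In this run the second row of W1 (bound 10) is expanded, but its two graphs (similarities 7 and 8) do not beat the threshold 8 already in P2, so they never enter P1.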

Question:
I understand the search procedure, but its intuitive meaning is still unclear to me.

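Experiments:
1) Building the feature tree
The paper builds features with WL subtree iteration. It defines the t-hop feature tree, which is another name for the k-adj tree (k-adj is not the same as this paper, but the idea behind building the index is the same). The experiments compare the effect of t on classification accuracy, with t set to 1, 2, 3, 4, 5, 6, 7, 8. At t = 3 the classification accuracy has already stabilized, so the remaining experiments use t = 3. The notes below record how the patterns are represented when t = 2.
To build the patterns of a water molecule (H2O) starting from t = 0, first convert the vertex labels to numbers: H and O become distinct IDs such as 1 and 2 (in the AIDS dataset these are numbers that uniquely identify the atom type). Taking the vertices in the order H, O, H and writing each vertex's own label followed by its neighbors' labels gives pattern_0 = {1,2}, pattern_1 = {2,11}, pattern_2 = {1,2}; these three patterns are the 0-hop encoding. Next, define a hash function that maps each pattern to a new ID, injectively over the 2 distinct patterns above, for example hash(pattern_0) = 3 and hash(pattern_1) = 4. The labels of the water molecule then become 3 4 3, and the neighbor sets become {3,4}, {4,3,3}, {3,4}.
These are the 1-hop patterns.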
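One relabeling round on the water molecule can be sketched as follows. The vertices are taken in the order H, O, H with H = 1 and O = 2; the dictionary `table` plays the role of the injective hash, handing out fresh IDs starting at 3. The function name `wl_relabel` and the choice of a dictionary as the hash are assumptions for illustration.

```python
def wl_relabel(labels, adjacency, table, next_id):
    """One WL round: pattern = (own label, sorted neighbor labels) -> new ID."""
    patterns = []
    for v, lab in enumerate(labels):
        neigh = tuple(sorted(labels[u] for u in adjacency[v]))
        patterns.append((lab, neigh))
    new_labels = []
    for p in patterns:
        if p not in table:       # unseen pattern: assign the next free ID
            table[p] = next_id
            next_id += 1
        new_labels.append(table[p])
    return patterns, new_labels, next_id

# Water: H - O - H, with H -> 1 and O -> 2.
labels = [1, 2, 1]
adjacency = [[1], [0, 2], [1]]

table = {}
patterns, new_labels, next_id = wl_relabel(labels, adjacency, table, next_id=3)
print(patterns)    # [(1, (2,)), (2, (1, 1)), (1, (2,))]
print(new_labels)  # [3, 4, 3]

# A second round on the new labels gives the next level of patterns.
patterns2, _, _ = wl_relabel(new_labels, adjacency, table, next_id)
print(patterns2)   # [(3, (4,)), (4, (3, 3)), (3, (4,))]
```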

The referenced blog post (reference 1) explains this process.
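2) Effect of the block sizes of the compressed matrix, R (row block size) and C (column block size), on query time and memory overhead.
Average query time: given 1000 query graphs and a database of 3000 graphs, measure the mean time to return the results.
Memory overhead: this is not the memory occupied at run time but a ratio, namely the total size of the compressed (non-original) matrices divided by the total size of the original matrix. The experiments show that with the row block size R fixed, a larger column block size C gives a longer query time.
A larger C also gives a smaller memory ratio, which is what one would expect.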
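The memory ratio can be estimated by counting matrix entries. The sketch below assumes that every level uses the same R and C, that the number of levels is given, and that only matrix entries are counted; none of these details are confirmed by the notes, and `memory_ratio` is a hypothetical helper.

```python
from math import ceil

def memory_ratio(n, m, R, C, levels):
    """Entries of all compressed levels divided by entries of the n x m original."""
    compressed = 0
    rows, cols = n, m
    for _ in range(levels):
        rows, cols = ceil(rows / R), ceil(cols / C)
        compressed += rows * cols
    return compressed / (n * m)

# The 4 x 6 example with two compressed levels.
small_c = memory_ratio(4, 6, R=2, C=2, levels=2)  # (2*3 + 1*2) / 24
large_c = memory_ratio(4, 6, R=2, C=3, levels=2)  # (2*2 + 1*1) / 24
print(small_c > large_c)  # True
```

With R fixed, the larger C yields the smaller ratio, in line with the trend reported in the experiment.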
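Question: in experiment 2), the larger R and C are, the smaller the RMS, so I would expect queries to get faster. Why do the measured times show the opposite?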

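3) Evaluating the effect of clustering the graph feature vectors on query time. Metric: time.

When the RMS is built, the graph vectors are not simply arranged by graph id. The authors introduce a refinement here:
First, graph vectors with the same number of features (the sum of the row vector) are clustered together (as explained above), and the clusters are sorted by feature count. Then, inside each cluster, the vectors are sorted again by the sum of each graph's high-frequency features.
[Figure: statistics on the high-frequency features]
The figure is not very intuitive. Its point is that the high-frequency features are few in number, yet they determine the similarity between two graphs.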

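Question: what is the role of the vertical axis "average values" in panel (b)?
On the AIDS database, 1000 graphs are selected as queries and the remaining 40000 form the database; the search returns the top 1, top 5, top 10, and top 50 graphs most similar to each query. Finding the 50 most similar graphs takes no more than 25 ms. (This seems unreliable to me.)

On the NCI database (about 4000 graphs in total), 1/5 of the data (about 4000 × 0.2 = 800 graphs) is selected as queries and the remaining graphs (about 3200) form the database; the search returns the top 1, 5, 10, and 50 most similar graphs.

On the NCI109 database (about 4000 graphs in total), the same procedure is applied.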
