A1063. Set Similarity

Given two sets of integers, the similarity of the sets is defined to be Nc/Nt*100%, where Nc is the number of distinct common numbers shared by the two sets, and Nt is the total number of distinct numbers in the two sets. Your job is to calculate the similarity of any given pair of sets.
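
In set notation, the definition can be written as follows, where A and B (symbols introduced here for illustration) stand for the two sets after duplicates are removed:

\[ \text{similarity}(A,B)=\frac{N_c}{N_t}\times 100\%=\frac{|A\cap B|}{|A\cup B|}\times 100\%, \qquad |A\cup B|=|A|+|B|-|A\cap B| \]

The last identity means Nt can be obtained from the two set sizes and Nc, without building the union explicitly.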

Input Specification:

Each input file contains one test case. Each case first gives a positive integer N (<=50), which is the total number of sets. Then N lines follow, each giving a set as a positive integer M (<=10^4) followed by M integers in the range [0, 10^9]. After the input of sets, a positive integer K (<=2000) is given, followed by K lines of queries. Each query gives a pair of set numbers (the sets are numbered from 1 to N). All the numbers in a line are separated by a space.

Output Specification:

For each query, print in one line the similarity of the sets, in the percentage form accurate up to 1 decimal place.

Sample Input:
3
3 99 87 101
4 87 101 5 87
7 99 101 18 5 135 18 99
2
1 2
1 3
Sample Output:
50.0%
33.3%
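
The sample can be checked by hand. After removing duplicates, set 1 is {99, 87, 101}, set 2 is {87, 101, 5}, and set 3 is {99, 101, 18, 5, 135}. For query "1 2", the common numbers are {87, 101} (Nc = 2) and the distinct numbers overall are {99, 87, 101, 5} (Nt = 4). For query "1 3", the common numbers are {99, 101} (Nc = 2) and the distinct numbers overall are {99, 87, 101, 18, 5, 135} (Nt = 6):

\[ \frac{2}{4}\times 100\% = 50.0\%, \qquad \frac{2}{6}\times 100\% \approx 33.3\% \]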

Code:

#include <cstdio>
#include <set>

using namespace std;

const int N = 51;
set<int> st[N];

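// Print the similarity of sets x and y as a percentage with one decimal place.
void compare(int x,int y)
{
    // totalNum starts as |st[y]| and grows by one for each element of st[x]
    // that is missing from st[y], so it ends as the size of the union;
    // sameNum counts the elements in the intersection.
    int totalNum = st[y].size(),sameNum = 0;
    for(set<int>::iterator it = st[x].begin(); it != st[x].end(); it++)
    {
        if(st[y].find(*it) != st[y].end())
        {
            sameNum++;
        }
        else
        {
            totalNum++;
        }
    }
    printf("%.1f%%\n",sameNum * 100.0 / totalNum);
}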

int main()
{
    int n,k,s1,s2;
    scanf("%d",&n);
    for(int i = 1; i <= n; i++)
    {
        int m,e;
        scanf("%d",&m);
        for(int j = 0; j < m; j++)
        {
            scanf("%d",&e);
            st[i].insert(e);
        }
    }

    scanf("%d",&k);
    for(int i = 0; i < k; i++)
    {
        scanf("%d%d",&s1,&s2);
        compare(s1,s2);
    }

    return 0;
}
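
### Token-Based Similarity Methods

In natural language processing, token-based similarity methods are mainly used to assess how strongly words or phrases are related. They are important for understanding contextual associations in text and play a significant role in tasks such as named entity recognition (NER) and relation extraction (RE)[^2].

#### Cosine Similarity

One widely used technique represents words as vectors and uses the angle between two word vectors to quantify their semantic distance:

\[ \text{cosine\_similarity}(A,B)=\frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2 \cdot \sum_{i=1}^{n} b_i^2}} \]

where \( A=(a_1,a_2,\ldots,a_n) \) and \( B=(b_1,b_2,\ldots,b_n) \) are the distributed representations of the two terms. The closer the angle between them is to zero, the more similar the two objects are; a larger angle indicates a larger difference[^1].

```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def calculate_cosine_sim(vec_a, vec_b):
    """
    Compute the cosine similarity between two vectors.

    Parameters:
        vec_a : array-like, shape = [n_features]
            The first input vector.
        vec_b : array-like, shape = [n_features]
            The second input vector.

    Returns:
        float: a value in the interval [-1, 1] indicating how closely the two are related.
    """
    # Convert the inputs to NumPy arrays and reshape them into 2-D row vectors,
    # which is the input shape cosine_similarity expects.
    a = np.array([vec_a])
    b = np.array([vec_b])

    return cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0][0]
```

#### Jaccard Index

Another simple and effective approach takes a set-theoretic view: treat the tokens of each text as the distinct elements of a set, giving \( S_A \) and \( S_B \), and define:

\[ J(S_A,S_B)=\frac{|S_A \cap S_B|}{|S_A \cup S_B|} \]

This ratio is the number of members the two sets share divided by the total number of distinct members, so it directly reflects how much the two sets overlap[^3].

```python
def jaccard_index(set_a, set_b):
    """Calculate the Jaccard index between two sets."""
    intersection = len(set.intersection(set_a, set_b))
    union = len(set.union(set_a, set_b))

    # Two empty sets have no defined ratio; return 0 by convention.
    if union == 0:
        return 0

    return intersection / union
```

Each of the two methods has its own strengths and weaknesses; which one to use depends on the requirements of the application and the characteristics of the data.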