1107. Social Clusters (30)

This post covers the problem of finding clusters of people with shared hobbies in a social network. Given each person's hobby list, a union-find (disjoint-set) structure merges everyone who shares a hobby, directly or transitively, and the size of each resulting cluster is reported. In the sample below, persons 2, 4, 6, and 8 are linked through hobby 4; persons 3, 5, and 7 through hobbies 3 and 5; and person 1 is alone, giving the cluster sizes 4 3 1.

When you register on a social network, you are always asked to specify your hobbies in order to find some potential friends with the same hobbies. A "social cluster" is a set of people who have some of their hobbies in common. You are supposed to find all the clusters.

Input Specification:

Each input file contains one test case. For each test case, the first line contains a positive integer N (<=1000), the total number of people in a social network. Hence the people are numbered from 1 to N. Then N lines follow, each gives the hobby list of a person in the format:

Ki: hi[1] hi[2] ... hi[Ki]

where Ki (>0) is the number of hobbies, and hi[j] is the index of the j-th hobby, which is an integer in [1, 1000].

Output Specification:

For each case, print in one line the total number of clusters in the network. Then in the second line, print the numbers of people in the clusters in non-increasing order. The numbers must be separated by exactly one space, and there must be no extra space at the end of the line.

Sample Input:
8
3: 2 7 10
1: 4
2: 5 3
1: 4
1: 3
1: 4
4: 6 8 1 5
1: 4
Sample Output:
3
4 3 1
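
One parsing detail worth noting before the code: the "Ki:" prefix can be consumed directly by putting the literal colon inside the scanf format string, so no manual token splitting is needed. A minimal, self-contained sketch of just that step (the printf body is illustrative only):

#include <cstdio>

int main()
{
    int n;
    scanf("%d", &n);                // number of people
    for (int i = 1; i <= n; i++)
    {
        int k;
        scanf("%d:", &k);           // "%d:" reads Ki, then matches the literal ':'
        for (int j = 0; j < k; j++)
        {
            int h;
            scanf("%d", &h);        // j-th hobby index of person i
            printf("person %d has hobby %d\n", i, h);
        }
    }
    return 0;
}

With the input format handled, the complete union-find solution: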
#include <algorithm>
#include <cstdio>
using namespace std;

const int maxn = 1010;      // people and hobby indices are both at most 1000
int father[maxn] = {0};     // father[i]: parent of i in the union-find forest
int isRoot[maxn] = {0};     // after counting, isRoot[r] holds the size of the cluster rooted at r
int course[maxn] = {0};     // course[h]: first person seen who has hobby h (0 = not seen yet)

// Walk up the parent links to find the root of x's set.
int findFather(int x)
{
    while (x != father[x])
    {
        x = father[x];
    }
    return x;
}

// Merge the sets containing a and b.
void Union(int a, int b)
{
    int faA = findFather(a);
    int faB = findFather(b);
    if (faA != faB)
    {
        father[faA] = faB;
    }
}

// Every person starts as the root of a singleton set.
void init(int n)
{
    for (int i = 1; i <= n; i++)
    {
        father[i] = i;
        isRoot[i] = 0;
    }
}

// Comparator for non-increasing order.
bool cmp(int a, int b)
{
    return a > b;
}

int main()
{
    int n, k, h;
    scanf("%d", &n);
    init(n);
    for (int i = 1; i <= n; i++)
    {
        scanf("%d:", &k);           // the ':' in the format string consumes the literal colon
        for (int j = 0; j < k; j++)
        {
            scanf("%d", &h);
            if (course[h] == 0)     // first person with hobby h becomes its representative
            {
                course[h] = i;
            }
            Union(i, course[h]);    // everyone sharing hobby h joins that person's set
        }
    }
    for (int i = 1; i <= n; i++)
    {
        isRoot[findFather(i)]++;    // accumulate each cluster's size at its root
    }
    int ans = 0;
    for (int i = 1; i <= n; i++)
    {
        if (isRoot[i] != 0)         // each nonzero entry marks one cluster
        {
            ans++;
        }
    }
    printf("%d\n", ans);
    sort(isRoot + 1, isRoot + n + 1, cmp);  // sizes first (descending), zeros last
    for (int i = 1; i <= ans; i++)
    {
        printf("%d", isRoot[i]);
        if (i < ans) printf(" ");   // exactly one space between numbers, none at the end
    }
    return 0;
}
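
The findFather above rewalks the whole parent chain on every call, which is fine at N <= 1000 but degrades toward O(N) per query on long chains. A standard refinement is path compression: after locating the root, point every visited node directly at it. A hedged drop-in sketch of that variant (same signature and father[] array as above):

// Path-compressed find: pass 1 locates the root, pass 2 repoints
// every node on the walked path directly at that root.
int findFather(int x)
{
    int root = x;
    while (root != father[root])
    {
        root = father[root];
    }
    while (x != root)
    {
        int next = father[x];
        father[x] = root;
        x = next;
    }
    return root;
}

Combined with union by size or rank, this makes each operation run in near-constant amortized time; for this problem's limits the plain version already passes, so the change is optional hardening.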

