Great question!

We want to run **semantic vectorization + K-Means clustering** on a structured dataset of academic papers (with fields such as `title`, `abstract`, `keywords`, and `topics`). The goal:

> Automatically discover the latent research-topic groups behind these papers.

We will use a **Sentence-BERT model** to encode the text semantically, combining several fields (title, abstract, keywords) into a richer semantic representation, then run K-Means clustering and evaluate the results.
---
## ✅ Data Description
The DataFrame you provided contains the following columns:
- `title`: paper title
- `authors`: paper authors
- `groups`: assigned conference track(s) (e.g. NMLA, AIW)
- `keywords`: keywords, separated by newlines (`\n`)
- `topics`: fine-grained topic labels
- `abstract`: paper abstract

We will build the input text from **title + abstract + keywords**.
---
## ✅ Complete Python Implementation
```python
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
from tqdm import tqdm
# -------------------------------
# Step 1: Load and preprocess the data
# -------------------------------
# Assume your data is already loaded as a DataFrame (referred to as df below)
# Example loading:
# df = pd.read_csv('papers.csv')
# 👇 Here we construct a small sample by hand (replace this with your real data)
data = {
'title': [
'Kernelized Bayesian Transfer Learning',
'Source Free Transfer Learning for Text Classification',
'A Generalization of Probabilistic Serial to Random Assignment',
'Lifetime Lexical Variation in Social Media',
'Hybrid Singular Value Thresholding for Tensor Completion',
'Mapping Users Across Networks by Manifold Alignment',
'Compact Aspect Embedding For Diversified Query Suggestion',
'Contraction and Revision over DL-Lite TBoxes',
'Zero Pronoun Resolution as Ranking',
'Supervised Transfer Sparse Coding'
],
'abstract': [
'Transfer learning considers related but distinct domains...',
'Transfer learning uses relevant auxiliary data...',
'The probabilistic serial (PS) rule is one of the most...',
'As the rapid growth of online social media attracts...',
'In this paper, we study the low-rank tensor completion...',
'Nowadays many people are members of multiple online networks...',
'Diversified query expansion (DQE) based approaches aim to...',
'An essential task in managing DL ontologies is belief change...',
'Compared to overt pronoun resolution, there is less work on zero pronouns...',
'A combination of sparse coding and transfer learning...'
],
'keywords': [
'cross-domain learning\ndomain adaptation\nkernel methods',
'Transfer Learning\nAuxiliary Data Retrieval\nText Classification',
'social choice theory\nvoting\nfair division\nrandom assignment',
'Generative model\nSocial Networks\nAge Prediction',
'tensor completion\nlow-rank recovery\nhybrid sv thresholding',
'Social Networks\nManifold Alignment\nHypergraph Matching',
'query expansion\nsearch result diversification\naspect embedding',
'Belief Change\nDescription Logic\nNon-monotonic Reasoning',
'Zero Pronouns\nText Mining\nNatural Language Processing',
'Sparse coding\nTransfer learning\nSupervised learning'
],
'groups': [
'NMLA', 'AIW & NMLA', 'GTEP', 'NLPTM', 'KRR & ML',
'AIW', 'AIW', 'KRR', 'NLPTM', 'NMLA'
]
}
df = pd.DataFrame(data)
# Preprocessing: merge title + abstract + keywords into one semantically rich text
def create_input_text(row):
    title = row['title']
    abstract = row.get('abstract', '')
    # Normalize keywords: treat \n like commas, then join into a space-separated string
    keywords = row.get('keywords', '') or ''
    keywords = ' '.join([k.strip() for k in keywords.replace('\n', ',').split(',') if k.strip()])
    # Concatenate all parts
    full_text = f"{title} . {abstract} . {keywords}"
    return full_text.strip()

# Build the merged text column
tqdm.pandas()  # show a progress bar
df['input_text'] = df.progress_apply(create_input_text, axis=1)

print("Sample input text:")
print(df['input_text'][0][:200] + "...")
# -------------------------------
# Step 2: Encode with Sentence-BERT
# -------------------------------
# Load a pretrained semantic model (small and efficient, well suited to clustering)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate sentence embeddings (batching is handled automatically)
embeddings = model.encode(df['input_text'].tolist(), convert_to_numpy=True, show_progress_bar=True)

print(f"Embedding shape: {embeddings.shape}")  # should be (n_samples, 384)
# -------------------------------
# Step 3: Pick the best k via silhouette score
# -------------------------------
# silhouette_score requires 2 <= k <= n_samples - 1, so cap the range at len(df)
k_range = range(2, min(10, len(df)))
silhouette_scores = []

print("\n🔍 Computing silhouette scores for different values of k...")
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(embeddings)
    score = silhouette_score(embeddings, labels)
    silhouette_scores.append(score)
    print(f"k={k}, Silhouette Score: {score:.4f}")
# Plot silhouette score vs. k
plt.figure(figsize=(9, 5))
plt.plot(k_range, silhouette_scores, 'bo-', label='Silhouette Score')
plt.title('Silhouette Score vs. Number of Clusters (k)')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.grid(True)
plt.xticks(k_range)
plt.legend()
plt.tight_layout()
plt.show()
# Best k
best_k = k_range[np.argmax(silhouette_scores)]
print(f"\n✅ Recommended number of clusters: k = {best_k}")
# -------------------------------
# Step 4: Final clustering with the best k
# -------------------------------
final_kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=10)
cluster_labels = final_kmeans.fit_predict(embeddings)

# Attach cluster labels to the DataFrame
df['cluster'] = cluster_labels

print("\n📊 Clustering results:")
for i in range(best_k):
    print(f"\n--- Cluster {i} ---")
    cluster_papers = df[df['cluster'] == i]
    for _, row in cluster_papers.iterrows():
        print(f"  • {row['title']} [{row['groups']}]")
# -------------------------------
# Step 5: Characterize each cluster with keywords
# -------------------------------
from sklearn.feature_extraction.text import TfidfVectorizer

print("\n🔍 Top keywords per cluster (extracted automatically via TF-IDF):")
vectorizer = TfidfVectorizer(stop_words='english', max_features=20)

# Build one combined document per cluster
corpus_by_cluster = []
for i in range(best_k):
    texts_in_cluster = df[df['cluster'] == i]['input_text'].tolist()
    combined_text = ' '.join(texts_in_cluster)
    corpus_by_cluster.append(combined_text)

# Fit the vectorizer on all cluster documents at once
tfidf = vectorizer.fit(corpus_by_cluster)

for i in range(best_k):
    vec = tfidf.transform([corpus_by_cluster[i]])
    feature_names = tfidf.get_feature_names_out()
    scores = vec.toarray()[0]
    top_indices = np.argsort(scores)[-8:][::-1]  # top 8 keywords, highest score first
    top_words = [feature_names[idx] for idx in top_indices if scores[idx] > 0]
    print(f"Cluster {i} keywords: {', '.join(top_words)}")
```
---
## ✅ Interpreting the Output
### 📌 Example output might look like this:
```
✅ Recommended number of clusters: k = 4
📊 Clustering results:
--- Cluster 0 ---
• Kernelized Bayesian Transfer Learning [NMLA]
• Supervised Transfer Sparse Coding [NMLA]
• Source Free Transfer Learning... [AIW & NMLA]
--- Cluster 1 ---
• A Generalization of Probabilistic Serial... [GTEP]
• Contraction and Revision over DL-Lite TBoxes [KRR]
• Zero Pronoun Resolution as Ranking [NLPTM]
...
🔍 Top keywords per cluster:
Cluster 0 keywords: transfer, learning, sparse, coding, kernel, bayesian, domain, adaptation
Cluster 1 keywords: logic, reasoning, belief, change, description, non-monotonic, revision, contraction
```
👉 We can interpret Cluster 0 as "**transfer learning and sparse modeling**", Cluster 1 as "**knowledge representation and logical reasoning**", and so on.
---
## ✅ Summary of Advantages
| Strength | Explanation |
|------|------|
| **Strong semantic understanding** | SBERT recognizes that "transfer learning" and "domain adaptation" are related concepts (see the sketch below) |
| **Multi-field fusion** | Combining title, abstract, and keywords gives a more complete picture of each paper |
| **No labels needed** | Fully unsupervised clustering, well suited to exploratory analysis |
| **Scales well** | Handles clustering of thousands of papers |
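
To make the first row concrete, here is a minimal sketch (reusing the same `all-MiniLM-L6-v2` model) showing that semantically related phrases get a much higher cosine similarity than unrelated ones; the exact scores will vary with the model version:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
emb = model.encode(['transfer learning', 'domain adaptation', 'random assignment'],
                   convert_to_numpy=True)

# The related pair is expected to score clearly higher...
print(util.cos_sim(emb[0], emb[1]).item())
# ...than this unrelated pair
print(util.cos_sim(emb[0], emb[2]).item())
```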
---
## ✅ Next Steps
1. **Visualization**: project the embeddings to 2D with UMAP/t-SNE and color the points by cluster (first sketch below).
2. **Cluster naming**: extract keywords with TF-IDF or KeyBERT to generate cluster names automatically (second sketch below).
3. **Compare with manual grouping**: measure how well the clusters agree with the original `groups` field (third sketch below).
4. **Outlier inspection**: papers from different groups that land in the same cluster may point to cross-cutting innovations.
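
For suggestion 1, a minimal sketch assuming `umap-learn` is installed (`pip install umap-learn`), reusing `embeddings` and `df` from the code above; `n_neighbors` is set low here only because the toy sample has just 10 papers:

```python
import umap
import matplotlib.pyplot as plt

# Reduce the 384-dim SBERT embeddings to 2D for plotting
reducer = umap.UMAP(n_components=2, n_neighbors=5, random_state=42)
coords = reducer.fit_transform(embeddings)  # shape: (n_samples, 2)

plt.figure(figsize=(7, 6))
scatter = plt.scatter(coords[:, 0], coords[:, 1], c=df['cluster'], cmap='tab10')
plt.colorbar(scatter, label='cluster')
plt.title('UMAP projection of SBERT embeddings')
plt.tight_layout()
plt.show()
```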
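
For suggestion 2, a sketch using KeyBERT (`pip install keybert`), which ranks candidate phrases by their embedding similarity to each cluster's combined text; treat the `top_n` and n-gram settings as starting points to tune:

```python
from keybert import KeyBERT

kw_model = KeyBERT(model='all-MiniLM-L6-v2')
for i in range(best_k):
    # Concatenate all papers in the cluster into one document
    cluster_doc = ' '.join(df[df['cluster'] == i]['input_text'])
    keywords = kw_model.extract_keywords(
        cluster_doc,
        keyphrase_ngram_range=(1, 2),  # unigrams and bigrams
        stop_words='english',
        top_n=3,
    )
    print(f"Cluster {i}: {', '.join(kw for kw, _ in keywords)}")
```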
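
For suggestion 3, a sketch comparing the clusters against the manual `groups` labels with two standard agreement metrics; multi-group papers (e.g. 'AIW & NMLA') are reduced to their first group here purely for illustration:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Reduce multi-label groups to a single primary group (an illustrative simplification)
primary_group = df['groups'].str.split(' & ').str[0]

ari = adjusted_rand_score(primary_group, df['cluster'])
nmi = normalized_mutual_info_score(primary_group, df['cluster'])
print(f"Adjusted Rand Index:    {ari:.3f}")  # 1.0 = perfect agreement, ~0.0 = random
print(f"Normalized Mutual Info: {nmi:.3f}")
```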
---