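Suppose I have used BERTopic to build static topic representations of English patent abstracts. On that basis, I now need to produce dynamic topic representations by combining BERTopic's built-in topics_over_time dynamic topic modeling with an adjusted c-TF-IDF algorithm. The dynamic topic representation uses three time periods: t1 = 2000-2010, t2 = 2011-2018, and t3 = 2019-2024. Finally, the average of the c-TF-IDF values of the current stage and the previous stage is taken as the weight score of the current stage, and the 15 words with the highest weight scores are taken as the keywords of each dynamic topic, forming the dynamic topic word list. Note that the English patent abstracts, already tokenized and with stop words and punctuation removed, are stored in the 'Tokenized_Abstract' field of 'tokenized_abstract.csv' and were already loaded during static topic modeling. The dates corresponding to the abstracts are stored in the 'Date' field of 'tokenized_abstract.csv' and have not been loaded yet. The static topic model that has already been run was configured as follows:
from sentence_transformers import SentenceTransformer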
import numpy as np
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

# Step 1 - Extract embeddings
# Raw string so the backslashes in the Windows path are not read as escapes
embedding_model = SentenceTransformer(r"C:\Users\18267\.cache\huggingface\hub\models--sentence-transformers--all-mpnet-base-v2\snapshots\9a3225965996d404b775526de6dbfe85d3368642")
embeddings = np.load('clean_emb_last.npy')  # precomputed document embeddings
print(f"Embedding shape: {embeddings.shape}")

# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=7, n_components=10, min_dist=0.0, metric='cosine', random_state=42)

# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_samples=7, min_cluster_size=60, metric='euclidean',
                        cluster_selection_method='eom',
                        prediction_data=True)

# Step 4 - Tokenize topics
# Combine custom stop words with scikit-learn's English stop words
custom_stop_words = ['h2', 'storing', 'storage', 'include', 'comprise',
                     'utility', 'model', 'disclosed', 'embodiment', 'invention', 'prior', 'art',
                     'according', 'present', 'method', 'system', 'device', 'may', 'also', 'use',
                     'used', 'provide', 'wherein', 'configured', 'predetermined', 'plurality',
                     'comprising', 'consists', 'following', 'characterized', 'claim', 'claims',
                     'said', 'first', 'second', 'third', 'fourth', 'fifth', 'one', 'two', 'three', 'hydrogen']

# Create combined stop words set
all_stop_words = set(custom_stop_words).union(ENGLISH_STOP_WORDS)
vectorizer_model = CountVectorizer(stop_words=list(all_stop_words))

# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer()

# All steps together
topic_model = BERTopic(
    embedding_model=embedding_model,    # Step 1 - Extract embeddings
    umap_model=umap_model,              # Step 2 - Reduce dimensionality
    hdbscan_model=hdbscan_model,        # Step 3 - Cluster reduced embeddings
    vectorizer_model=vectorizer_model,  # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model,          # Step 5 - Extract topic words
    top_n_words=50
)
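For reference, the three periods stated at the top (t1 = 2000-2010, t2 = 2011-2018, t3 = 2019-2024) could be encoded roughly as below. This is only a sketch: the names `PERIODS` and `year_to_period` are hypothetical, and it assumes each 'Date' value can be reduced to a four-digit year.

```python
# Hypothetical helper: map a publication year to one of the three periods.
# Period boundaries are inclusive and taken from the description above.
PERIODS = [("t1", 2000, 2010), ("t2", 2011, 2018), ("t3", 2019, 2024)]

def year_to_period(year):
    """Return the period label for a year, or None if it falls outside all periods."""
    for label, start, end in PERIODS:
        if start <= year <= end:
            return label
    return None

print([year_to_period(y) for y in (2005, 2011, 2024, 1999)])
```

With pandas, the labels could then be derived from the 'Date' column with something like `pd.to_datetime(df['Date']).dt.year.map(year_to_period)`, assuming the dates parse cleanly.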
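Now, please provide the Python code that carries out the dynamic topic representation on top of the static topic representation. The adjusted c-TF-IDF formula is as follows: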
$ \text{c-TF-IDF}_{w,c,r} = \frac{\left(\sqrt{\frac{f_{w,c,r}}{f_c}} + \sqrt{\frac{f_{w,c,r-1}}{f_c}}\right) \cdot \log\left(1 + \frac{M - cf_w + 0.5}{cf_w + 0.5}\right)}{2} $
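Explanation:
Here, $f_{w,c,r}$ is the frequency of word $w$ in cluster $c$ at stage $r$, and $f_c$ is the number of words in cluster $c$. $M$ is the average number of words per cluster, and $cf_w$ is the frequency of word $w$ across all clusters.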
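To make the formula concrete, here is a minimal NumPy sketch of the adjusted score, not a definitive implementation. It assumes the per-period word counts are already available as an array `counts[r, c, w]`. It also assumes $f_c$ and $cf_w$ are totals over all periods, and that the first period, which has no predecessor, reuses its own term in place of the missing previous one. The function names `adjusted_ctfidf` and `top_words` are hypothetical, and every cluster is assumed to contain at least one word.

```python
import numpy as np

def adjusted_ctfidf(counts):
    """counts[r, c, w] = frequency of word w in cluster c during period r."""
    counts = np.asarray(counts, dtype=float)
    f_c = counts.sum(axis=(0, 2))    # words per cluster (assumption: over all periods)
    M = f_c.mean()                   # average number of words per cluster
    cf_w = counts.sum(axis=(0, 1))   # frequency of each word across all clusters
    tf = np.sqrt(counts / f_c[None, :, None])
    # Term for the previous period; assumption: period 0 reuses its own term.
    prev = np.concatenate([tf[:1], tf[:-1]], axis=0)
    idf = np.log(1.0 + (M - cf_w + 0.5) / (cf_w + 0.5))
    return (tf + prev) * idf[None, None, :] / 2.0

def top_words(scores, vocab, k=15):
    """Return the k highest-scoring words for every (period, cluster) pair."""
    order = np.argsort(-scores, axis=-1)[..., :k]
    return [[[vocab[i] for i in row] for row in period] for period in order]

# Toy example: 2 periods, 2 clusters, 3 words
toy = np.array([[[4, 0, 0], [0, 1, 3]],
                [[1, 3, 0], [0, 4, 0]]])
scores = adjusted_ctfidf(toy)
print(top_words(scores, ["w0", "w1", "w2"], k=2))
```

In a real run the count array would presumably be built by applying the fitted `vectorizer_model` to the documents of each topic within each period. How the first period is handled is a design choice that the formula itself leaves open.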