Extracting keywords with the KEA algorithm
The previous post covered BERT-based keyword extraction, but it produced too few keywords, so additional methods are needed to supplement them; the first one I tried is the KEA algorithm.
The KEA algorithm
KEA identifies candidate keyphrases with lexical methods, computes feature values for each candidate, and uses a machine learning algorithm to predict which candidates are good keyphrases.
1. First, candidate keyphrases are selected according to a set of rules; the authors propose three rules in the paper:
(1) Candidate phrases are limited to a certain maximum length (usually three words).
(2) Candidate phrases cannot be proper names (i.e. single words that only ever appear with an initial capital).
(3) Candidate phrases cannot begin or end with a stopword.
2. TF-IDF features are extracted for the candidates (together with the position of their first occurrence, as in the code below).
3. A Naive Bayes model scores the candidates, and the highest-ranked ones are selected as keywords (see the usage sketch below).
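Putting the three steps together, the whole pipeline can be run with the pke library linked in the references. Below is a minimal usage sketch: the input file sample.txt and the model path kea_model.pickle are placeholders, and the exact method signatures are my reading of pke's supervised Kea API, so they should be checked against the installed version.

import pke

# 1. initialize a KEA extractor
extractor = pke.supervised.Kea()

# 2. load the document to analyse
extractor.load_document(input='sample.txt', language='en')

# 3. select 1-3 gram candidates that pass the rules above
extractor.candidate_selection()

# 4. compute the tf*idf / first-occurrence features and score the
#    candidates with a trained KEA model (path is a placeholder)
extractor.candidate_weighting(model_file='kea_model.pickle')

# 5. keep the 10 highest-scoring candidates as keywords
keyphrases = extractor.get_n_best(n=10)
print(keyphrases)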
Main code (taken from the pke implementation of KEA)
1. Selecting candidates according to the rules
def candidate_selection(self, stoplist=None, **kwargs):
    """Select 1-3 grams of `normalized` words as keyphrase candidates.

    Candidates that start or end with a stopword are discarded. Candidates
    that contain punctuation marks (from `string.punctuation`) as words are
    filtered out.

    Args:
        stoplist (list): the stoplist for filtering candidates, defaults
            to the nltk stoplist.
    """

    # select ngrams from 1 to 3 grams
    self.ngram_selection(n=3)

    # filter candidates containing punctuation marks
    self.candidate_filtering(list(string.punctuation))

    # initialize stoplist list if not provided
    if stoplist is None:
        stoplist = self.stoplist

    # filter candidates that start or end with a stopword
    for k in list(self.candidates):

        # get the candidate
        v = self.candidates[k]

        # delete if candidate contains a stopword in first/last position
        words = [u.lower() for u in v.surface_forms[0]]
        if words[0] in stoplist or words[-1] in stoplist:
            del self.candidates[k]
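The stoplist parameter makes the third rule configurable. As a quick example, an explicit NLTK stoplist can be passed in instead of the default (sample.txt is again a placeholder):

from nltk.corpus import stopwords
import pke

extractor = pke.supervised.Kea()
extractor.load_document(input='sample.txt', language='en')

# any 1-3 gram whose first or last word is in this list is discarded
extractor.candidate_selection(stoplist=stopwords.words('english'))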
2. Extracting TF-IDF features
def feature_extraction(self, df=None, training=False):
    """Extract features for each keyphrase candidate. Features are the
    tf*idf of the candidate and its first occurrence relative to the
    document.

    Args:
        df (dict): document frequencies, the number of documents should be
            specified using the "--NB_DOC--" key.
        training (bool): indicates whether features are computed for the
            training set for computing IDF weights, defaults to false.
    """

    # initialize default document frequency counts if none provided
    if df is None:
        logging.warning('LoadFile._df_counts is hard coded to {}'.format(
            self._df_counts))
        df = load_document_frequency_file(self._df_counts, delimiter='\t')

    # initialize the number of documents as --NB_DOC--
    N = df.get('--NB_DOC--', 0) + 1
    if training:
        N -= 1

    # find the maximum offset
    maximum_offset = float(sum([s.length for s in self.sentences]))

    for k, v in self.candidates.items():

        # get candidate document frequency
        candidate_df = 1 + df.get(k, 0)

        # hack for handling training documents
        if training and candidate_df > 1:
            candidate_df -= 1

        # compute the tf*idf of the candidate
        idf = math.log(N / candidate_df, 2)

        # add the features to the instance container
        self.instances[k] = np.array([len(v.surface_forms) * idf,
                                      v.offsets[0] / maximum_offset])

    # scale features
    self.feature_scaling()
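To make the two features concrete, here is a small self-contained sketch that reproduces the computation above for a single candidate; all the counts are made up:

import math

# toy numbers for one candidate (all values are made up)
N = 1001                # number of documents (--NB_DOC-- + 1)
candidate_df = 1 + 40   # 1 + document frequency of the candidate
tf = 3                  # number of surface forms in the current document
first_offset = 12       # word offset of the first occurrence
maximum_offset = 480.0  # total number of words in the document

# feature 1: tf * idf, with idf = log2(N / df)
idf = math.log(N / candidate_df, 2)
tf_idf = tf * idf

# feature 2: relative position of the first occurrence
first_occurrence = first_offset / maximum_offset

print(tf_idf, first_occurrence)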
3. Training the Naive Bayes model and saving it
def train(training_instances, training_classes, model_file):
    """Train a Naive Bayes classifier and store the model in a file.

    Args:
        training_instances (list): list of features.
        training_classes (list): list of binary values.
        model_file (str): the model output file.
    """
    clf = MultinomialNB()
    clf.fit(training_instances, training_classes)
    dump_model(clf, model_file)
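At prediction time, scoring and ranking the candidates (step 3 above) amounts to asking the trained classifier for the probability of the positive class and sorting by it. This is a minimal self-contained sketch with toy features, not the pke code itself:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# toy training data: [tf*idf, first occurrence] features and binary labels
X_train = np.array([[6.2, 0.02], [0.8, 0.75], [4.1, 0.10], [0.3, 0.90]])
y_train = [1, 0, 1, 0]

clf = MultinomialNB()
clf.fit(X_train, y_train)

# score unseen candidates by the probability of being a keyphrase
candidates = {'keyword extraction': [5.5, 0.03], 'the results': [0.5, 0.80]}
X_test = np.array(list(candidates.values()))
scores = clf.predict_proba(X_test)[:, 1]

# rank candidates from best to worst
ranking = sorted(zip(candidates, scores), key=lambda x: -x[1])
print(ranking)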
References
https://www.cs.waikato.ac.nz/ml/publications/2005/chap_Witten-et-al_Windows.pdf
https://github.com/boudinfl/pke