Extracting keywords with the KEA algorithm
The previous post covered BERT-based keyword extraction, but it produced too few keywords, so additional methods are needed to supplement them; the first one I tried is the KEA algorithm.
The KEA algorithm
KEA identifies candidate keyphrases with lexical methods, computes feature values for each candidate, and uses a machine learning algorithm to predict which candidates are good keyphrases.
1. First, candidate keyphrases are selected according to a set of rules; the authors propose three rules in the paper:
(1) Candidate phrases are limited to a certain maximum length (usually three words).
(2) Candidate phrases cannot be proper names (i.e. single words that only ever appear with an initial capital).
(3) Candidate phrases cannot begin or end with a stopword.
2. TF-IDF features are extracted for the candidates (together with the position of their first occurrence, as in the code below).
3. A Naive Bayes model scores the candidates, and the highest-ranked ones are selected as keywords (see the usage sketch below).
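Putting the three steps together, the whole pipeline can be run with the pke library linked in the references. Below is a minimal usage sketch: the input file sample.txt and the model path kea_model.pickle are placeholders, and the exact method signatures are my reading of pke's supervised Kea API, so they should be checked against the installed version.

import pke

# 1. initialize a KEA extractor
extractor = pke.supervised.Kea()

# 2. load the document to analyse
extractor.load_document(input='sample.txt', language='en')

# 3. select 1-3 gram candidates that pass the rules above
extractor.candidate_selection()

# 4. compute the tf*idf / first-occurrence features and score the
#    candidates with a trained KEA model (path is a placeholder)
extractor.candidate_weighting(model_file='kea_model.pickle')

# 5. keep the 10 highest-scoring candidates as keywords
keyphrases = extractor.get_n_best(n=10)
print(keyphrases)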
Main code (taken from the pke implementation of KEA)
1. Selecting candidates according to the rules
def candidate_selection(self, stoplist=None, **kwargs):
    """Select 1-3 grams of `normalized` words as keyphrase candidates.

    Candidates that start or end with a stopword are discarded. Candidates
    that contain punctuation marks (from `string.punctuation`) as words are
    filtered out.

    Args:
        stoplist (list): the stoplist for filtering candidates, defaults
            to the nltk stoplist.
    """

    # select ngrams from 1 to 3 grams
    self.ngram_selection(n=3)

    # filter candidates containing punctuation marks
    self.candidate_filtering(list(string.punctuation))

    # initialize stoplist list if not provided
    if stoplist is None:
        stoplist = self.stoplist

    # filter candidates that start or end with a stopword
    for k in list(self.candidates):

        # get the candidate
        v = self.candidates[k]

        # delete if candidate contains a stopword in first/last position
        words = [u.lower() for u in v.surface_forms[0]]
        if words[0] in stoplist or words[-1] in stoplist:
            del self.candidates[k]
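The stoplist parameter makes the third rule configurable. As a quick example, an explicit NLTK stoplist can be passed in instead of the default (sample.txt is again a placeholder):

from nltk.corpus import stopwords
import pke

extractor = pke.supervised.Kea()
extractor.load_document(input='sample.txt', language='en')

# any 1-3 gram whose first or last word is in this list is discarded
extractor.candidate_selection(stoplist=stopwords.words('english'))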
2. Extracting TF-IDF features
def feature_extraction(self, df=None, training=False):
    """Extract features for each keyphrase candidate. Features are the
    tf*idf of the candidate and its first occurrence relative to the
    document.

    Args:
        df (dict): document frequencies, the number of documents should be
            specified using the "--NB_DOC--" key.
        training (bool): indicates whether features are computed for the
            training set for computing IDF weights, defaults to false.
    """

    # initialize default document frequency counts if none provided
    if df is None:
        logging.warning('LoadFile._df_counts is hard coded to {}'.format(
            self._df_counts))
        df = load_document_frequency_file(self._df_counts, delimiter='\t')

    # initialize the number of documents as --NB_DOC--
    N = df.get('--NB_DOC--', 0) + 1
    if training:
        N -= 1

    # find the maximum offset
    maximum_offset = float(sum([s.length for s in self.sentences]))

    for k, v in self.candidates.items():

        # get candidate document frequency
        candidate_df = 1 + df.get(k, 0)

        # hack for handling training documents
        if training and candidate_df > 1:
            candidate_df -= 1

        # compute the tf*idf of the candidate
        idf = math.log(N / candidate_df, 2)

        # add the features to the instance container
        self.instances[k] = np.array([len(v.surface_forms) * idf,
                                      v.offsets[0] / maximum_offset])

    # scale features
    self.feature_scaling()
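To make the two features concrete, here is a small self-contained sketch that reproduces the computation above for a single candidate; all the counts are made up:

import math

# toy numbers for one candidate (all values are made up)
N = 1001                # number of documents (--NB_DOC-- + 1)
candidate_df = 1 + 40   # 1 + document frequency of the candidate
tf = 3                  # number of surface forms in the current document
first_offset = 12       # word offset of the first occurrence
maximum_offset = 480.0  # total number of words in the document

# feature 1: tf * idf, with idf = log2(N / df)
idf = math.log(N / candidate_df, 2)
tf_idf = tf * idf

# feature 2: relative position of the first occurrence
first_occurrence = first_offset / maximum_offset

print(tf_idf, first_occurrence)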
3. Training the Naive Bayes model and saving it
def train(training_instances, training_classes, model_file):
    """Train a Naive Bayes classifier and store the model in a file.

    Args:
        training_instances (list): list of features.
        training_classes (list): list of binary values.
        model_file (str): the model output file.
    """
    clf = MultinomialNB()
    clf.fit(training_instances, training_classes)
    dump_model(clf, model_file)
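At prediction time, scoring and ranking the candidates (step 3 above) amounts to asking the trained classifier for the probability of the positive class and sorting by it. This is a minimal self-contained sketch with toy features, not the pke code itself:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# toy training data: [tf*idf, first occurrence] features and binary labels
X_train = np.array([[6.2, 0.02], [0.8, 0.75], [4.1, 0.10], [0.3, 0.90]])
y_train = [1, 0, 1, 0]

clf = MultinomialNB()
clf.fit(X_train, y_train)

# score unseen candidates by the probability of being a keyphrase
candidates = {'keyword extraction': [5.5, 0.03], 'the results': [0.5, 0.80]}
X_test = np.array(list(candidates.values()))
scores = clf.predict_proba(X_test)[:, 1]

# rank candidates from best to worst
ranking = sorted(zip(candidates, scores), key=lambda x: -x[1])
print(ranking)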
References
https://www.cs.waikato.ac.nz/ml/publications/2005/chap_Witten-et-al_Windows.pdf
https://github.com/boudinfl/pke