A Text Summarization Algorithm Based on Density-Based Selection

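This article covers the core ideas behind text summarization and its use in news. As background, it mentions Yahoo's acquisition of a startup that reportedly used machine learning to show users news summaries instead of full articles, with the aim of improving the user experience. It then outlines how summarization works, from simply taking the first sentence or paragraph to the more involved approach of computing a shortest summary from keyword weights and positions. The focus is the density-based summarization algorithm, which splits the article into sentences, computes a weight for each one, and selects the key sentences by weight to form the summary. The algorithm considers features such as a sentence's consistency with the title, its position, its length, and how well it matches the keywords. A simple code example computes the consistency between a sentence and the keywords, with the caveat that it still needs validation before real use.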


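Text summarization extracts the main information from an article and expresses the whole article in a short, condensed form. It is used frequently in search. Some time ago Yahoo acquired a startup for 30 million US dollars. The company reportedly uses a machine learning method to summarize news and, unlike the traditional approach of pushing complete news stories, shows users summaries instead. This post only introduces a simple summarization algorithm.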

Summarization algorithms

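Current implementations mostly select representative sentences or paragraphs from the body text to form the summary. The simple ones just take the first sentence or first paragraph of the article; the more complex ones use an algorithm to extract the key sentences or paragraphs. A common approach is to extract keywords first, compute their weights, and then compute the shortest summary from the positions at which the keywords occur in the article. For details, see the shortest-summary algorithm in the book 编程之美 (The Beauty of Programming).

This post introduces the density-based summarization algorithm. The article is first split into sentences, a weight is computed for each sentence, and the weights decide which sentences become the summary. The main features that affect a sentence's weight are its consistency with the title, its position in the article and in its paragraph, its length, and its consistency with the article's keywords. The code below computes the consistency between a sentence and the article's keywords. It is fairly simple but has not been validated; it only sketches the idea. A detailed application scenario will be added later.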
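The sentence-weighting idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the method used by any particular product: the helper names (`tokens`, `split_sentences`, `title_score`, `position_score`, `length_score`, `summarize`), the linear position decay, the ideal length of 20 words, and the equal default weights are all hypothetical choices made for the example. The keyword feature is passed in as a function so that any keyword score can be plugged in.

```python
import re


def tokens(text):
    """Lowercased word tokens; a simple regex tokenizer (assumption)."""
    return re.findall(r"\w+", text.lower())


def split_sentences(text):
    """Naive split after '.', '!' or '?' followed by whitespace (assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def title_score(sentence, title):
    """Fraction of the title's words that also occur in the sentence."""
    title_words = set(tokens(title))
    if not title_words:
        return 0.0
    return len(title_words & set(tokens(sentence))) / len(title_words)


def position_score(index, total):
    """Earlier sentences score higher; linear decay is an assumption."""
    return 1.0 - index / total


def length_score(sentence, ideal=20):
    """1.0 at the assumed ideal word count, falling off linearly to 0."""
    n = len(tokens(sentence))
    return max(0.0, 1.0 - abs(ideal - n) / ideal)


def summarize(title, text, keyword_score, top_n=1,
              weights=(1.0, 1.0, 1.0, 1.0)):
    """Return the top_n highest-weighted sentences in document order."""
    sentences = split_sentences(text)
    scored = []
    for i, s in enumerate(sentences):
        features = (
            title_score(s, title),
            position_score(i, len(sentences)),
            length_score(s),
            keyword_score(s),
        )
        weight = sum(w * f for w, f in zip(weights, features))
        scored.append((weight, i, s))
    best = sorted(scored, key=lambda t: -t[0])[:top_n]
    return [s for _, _, s in sorted(best, key=lambda t: t[1])]


if __name__ == "__main__":
    text = ("The weather was nice today. "
            "Yahoo buys a news summary startup for a large sum. "
            "Nobody commented.")
    # The keyword feature is switched off here to keep the example small.
    print(summarize("Yahoo buys startup", text, lambda s: 0.0))
```

In practice the weights would need tuning against real articles, and the sentence splitter would have to handle abbreviations and non-English punctuation.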

#!/usr/bin/python
# -*- coding: utf-8 -*-
import math

"""
@author: xyl
This is an example of text summarization, intended as a base algorithm
for news recommendation.

sentence: the sentence to score
keyWords: the top keywords of the article or sentence
"""


def dbs(sentence, keyWords):
    # Split the sentence on whitespace; the tokenizer can be changed.
    words = sentence.lower().split()
    if len(words) == 0:
        return 0
    count = 1
    for word in words:
        if keyWords.hasKey(word):  # check whether the word is a keyword
            count += 1
    summ = 0.0
    firstWord = None   # (position, score) of the most recent keyword
    secondWord = None  # (position, score) of the keyword before it
    for index in range(len(words)):
        # Look up the word; -1 means it is not a keyword.
        wordIndex = keyWords.getWordIndex(words[index])
        if wordIndex > -1:
            # The word's weight, which can be computed by tf or tf*idf.
            score = keyWords.getScore(wordIndex)
            if firstWord is None:
                firstWord = (index, score)
            else:
                secondWord = firstWord
                firstWord = (index, score)
                distance = firstWord[0] - secondWord[0]
                # Adjacent keywords add more the closer together they are.
                summ += (firstWord[1] * secondWord[1]) / math.pow(distance, 2)
    # Normalize by the keyword count.
    return summ / float(count * (count + 1))


def sbs(sentence, keyWords):
    words = sentence.lower().split()
    if len(words) == 0:
        return 0
    summ = 0.0
    for index in range(len(words)):
        wordIndex = keyWords.getWordIndex(words[index])
        if wordIndex > -1:
            summ += keyWords.getScore(wordIndex)
    # Average keyword weight per word of the sentence.
    return summ / float(len(words))
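The code above relies on a keyWords object that the post does not define, so it cannot be run as it stands. To experiment with the two scores, a plain dict mapping each keyword to its weight can stand in for that object. The version below is a minimal sketch under that assumption. The parameter name `keyword_scores` and the sample weights are made up for illustration, and the normalization by k * (k + 1) is one reading of the intended formula rather than something the post confirms.

```python
def dbs(sentence, keyword_scores):
    """Density score: pairs of neighbouring keywords, weighted by closeness.

    keyword_scores is an assumed stand-in for the keyWords object:
    a dict from lowercased keyword to weight (e.g. tf or tf*idf).
    """
    words = sentence.lower().split()
    if not words:
        return 0.0
    # One plus the number of keyword occurrences in the sentence.
    k = sum(1 for w in words if w in keyword_scores) + 1
    total = 0.0
    prev = None  # (position, score) of the previous keyword, if any
    for pos, word in enumerate(words):
        if word in keyword_scores:
            score = keyword_scores[word]
            if prev is not None:
                prev_pos, prev_score = prev
                total += prev_score * score / (pos - prev_pos) ** 2
            prev = (pos, score)
    # Assumed normalization: divide by k * (k + 1).
    return total / (k * (k + 1))


def sbs(sentence, keyword_scores):
    """Summation score: average keyword weight per word of the sentence."""
    words = sentence.lower().split()
    if not words:
        return 0.0
    return sum(keyword_scores.get(w, 0.0) for w in words) / len(words)


if __name__ == "__main__":
    # Hypothetical keyword weights, for illustration only.
    weights = {"yahoo": 3.0, "startup": 2.0, "summary": 1.0}
    sentence = "Yahoo buys a summary startup"
    print(dbs(sentence, weights), sbs(sentence, weights))
```

A sentence whose keywords sit next to each other gets a higher dbs value than one with the same keywords spread far apart, while sbs only looks at how much keyword weight the sentence carries per word.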