Web Page Keyword Frequency Calculation (Term Frequency, JS Version)

License: CC 4.0 BY-SA

Tags: #word #js

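This is a JavaScript implementation of keyword frequency calculation for web pages. It needs no dictionary: it splits the page content, extracts Chinese and English words, counts how often each one occurs, and sorts them from most to least frequent. The code covers string handling, the use of a dictionary object, and sorting by match count.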

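No dictionary is required. The page content is split directly to extract words, and the number of occurrences of each word is counted and sorted from most to least frequent. Chinese and English words are told apart.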
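
Before the full listing, the idea of counting and ranking without a dictionary can be sketched in a few lines. This is only a simplified illustration, not the left-to-right / right-to-left matching used in the code below: `countKeywords` and its `topN` parameter are hypothetical names, a `Map` stands in for the dictionary object, and Chinese text is assumed to be approximated by overlapping two-character pairs.

```javascript
// Hypothetical helper for illustration only; it is not the article's keywords() constructor.
function countKeywords(text, topN = 10) {
  const counts = new Map();
  const bump = (word) => counts.set(word, (counts.get(word) || 0) + 1);

  // English words: runs of ASCII letters, compared case-insensitively.
  for (const word of text.match(/[A-Za-z]+/g) || []) {
    bump(word.toLowerCase());
  }

  // Chinese text: runs of CJK ideographs, counted as overlapping 2-character pairs.
  // Assumption: with no dictionary, pairs are a rough stand-in for real words.
  for (const run of text.match(/[\u4e00-\u9fff]+/g) || []) {
    for (let i = 0; i + 2 <= run.length; i++) {
      bump(run.slice(i, i + 2));
    }
  }

  // Sort by count, most frequent first, and keep the top N entries.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, topN);
}

const sample = "JavaScript 词频 统计, javascript 词频 js";
console.log(countKeywords(sample, 2));
```

For the sample string, "javascript" and "词频" each occur twice, so they are the two entries kept. A real page would need the boundary handling and longer candidate words that the listing below works through.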


// 石卓林 (Shi Zhuolin), 2008-7-12, second version: matches left-to-right and right-to-left
function keywords(ftitle,ftbody){
	this.trim = function(text){return text.replace(/(^\s*)|(\s*$)/g,'');}
	this.title = ftitle;
	this.tbody = ftbody.replace(/(\s+)/g,' ');//.substr(40,400); // would keep only the slice most likely to hold the main content; these numbers need tuning
	this.tbody = this.trim(this.tbody);
	this.tbodylen = this.tbody.length;
	this.chardic = new ActiveXObject('Scripting.Dictionary'); // ActiveXObject is available only in Internet Explorer on Windows
	this.tempasc = 0;
	this.tempchar = '';
	this.tempcharat='';
	this.endchar = '。，：…　（—）》《'; // Chinese punctuation and the full-width space, checked as boundary characters
	this.chscount = 0;
	this.keys = new Array();
	var oldchar='',oldcount=0;
	for(var i=0;i<this.tbodylen;i++){
		this.chscount = 0;
		for(var j=1;j<=15;j++){ // longest English word considered: 15 characters
			this.tempchar = this.tbody.substr(i,j);
			this.tempasc = this.tempchar.charCodeAt(j-1);
			this.tempcharat = this.tempchar.charAt(j-1);
			if((this.endchar.indexOf(this.tempcharat) != -1)||(this.tempasc >=0 && this.tempasc <= 47)||(this.tempasc >=58 &&am