Development Notes: A Text-Content-Based Recommender System

Latest recommended article published on 2024-05-24 16:24:03

IMISer2016



License: CC 4.0 BY-SA

Category: Text Mining. Tags: text mining, python-gensim, recommendation

Article link: https://blog.youkuaiyun.com/IMISer2016/article/details/81626702

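This post documents the development of a text-content-based recommender system, covering the recommendation logic, the algorithm implementation (gensim), memory management, and writing a Python REST interface. The algorithms involve the TF-IDF, LDA, and LSI models; content is recommended to users by computing similarity.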


Development Diary: A Text-Content-Based Recommender System

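This post covers:

Recommendation model logic
Basic algorithm implementation and testing (calling gensim and other modules)
Solving the out-of-memory problem
Writing a Python REST service interface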

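Recommendation model logic

The basic idea revolves around a JSON file: the information for each text is stored in the JSON file, and the file is refreshed on a schedule from the updated dictionary and corpus.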
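As a rough illustration of this idea, the sketch below keeps per-document information in a JSON file and merges newly processed documents into it. The record fields (`title`, `tokens`) and the function names `save_store`, `load_store`, and `refresh_store` are hypothetical; the post does not show the actual schema or update code.

```python
import json
import os
import tempfile


def save_store(path, records):
    """Write the whole store via a temp file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese text readable in the file.
        json.dump(records, f, ensure_ascii=False, indent=2)
    os.replace(tmp_path, path)


def load_store(path):
    """Return the stored records, or an empty dict if the file does not exist."""
    if not os.path.exists(path):
        return {}
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def refresh_store(path, new_docs):
    """Merge newly processed documents (doc_id -> info) into the store."""
    records = load_store(path)
    records.update(new_docs)
    save_store(path, records)
    return records
```

A scheduled job could call `refresh_store` after the dictionary and corpus have been rebuilt. Writing to a temporary file first means a reader never sees a half-written JSON file.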

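Basic algorithm implementation and testing

Text preprocessing

Preprocessing Chinese text consists of removing punctuation, removing stop words, and word segmentation (written as a separate preprocessing module).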

import codecs
import re

import jieba


class preprocess():
    stopword_filepath = "./stopwordList/stopword.txt"

    def __init__(self):
        # Per-instance set: a class-level list would be shared by all instances
        # and grow on every instantiation; a set also gives O(1) lookups in cut().
        self.stopwords = set()
        self.__readin_stop()

    def __readin_stop(self):
        # One stop word per line; blank lines are skipped rather than ending the read.
        with codecs.open(self.stopword_filepath, 'r', 'utf-8') as file_obj:
            for line in file_obj:
                line = line.strip('\r\n')
                if line:
                    self.stopwords.add(line)

    def clean_doc(self, doc):
        # Chinese characters fall in the Unicode range 4e00-9fa5; everything else
        # (punctuation, digits, Latin letters, whitespace) is dropped.
        pattern = re.compile(r'[\u4e00-\u9fa5]+')
        return ''.join(pattern.findall(doc))

    def cut(self, doc):
        # Segment with jieba and drop stop words.
        return [item for item in jieba.cut(doc) if item not in self.stopwords]

This module takes a raw text string as input and outputs a list of tokens.
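To make the input and output concrete, here is a standalone sketch of the cleaning and stop-word steps using only the standard library. The token list is a hand-written stand-in for what a segmenter such as jieba might return, and the stop-word set and the helper names `keep_chinese` and `drop_stopwords` are made up for illustration.

```python
import re

# Same character class as clean_doc above: keep Chinese characters only.
CHINESE_RUN = re.compile(r'[\u4e00-\u9fa5]+')


def keep_chinese(doc):
    """Drop everything except Chinese characters (punctuation, digits, Latin)."""
    return ''.join(CHINESE_RUN.findall(doc))


def drop_stopwords(tokens, stopwords):
    """Filter an already-segmented token list against a stop-word set."""
    return [t for t in tokens if t not in stopwords]


raw = "今天的天气，真的很好！Python 3.7"
cleaned = keep_chinese(raw)  # "今天的天气真的很好"

# Hypothetical segmentation of `cleaned`; a real run would call jieba.cut here.
tokens = ["今天", "的", "天气", "真的", "很", "好"]
filtered = drop_stopwords(tokens, {"的", "很"})  # ["今天", "天气", "真的", "好"]
```

Note that cleaning before segmentation also removes Latin text and digits, so terms such as "Python" never reach the dictionary. Whether that is acceptable depends on the content being recommended.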

Building the dictionary and the corpus

The dictionary is built from the segmentation results of all texts collected so far and assigns an id to each word: dict = [(w1,id),(w2,id),(w3,id),…,(wn,id)]
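The word-to-id mapping can be sketched in plain Python as follows. In a gensim-based pipeline this role is normally played by `gensim.corpora.Dictionary`; the function `build_dictionary` below is a hypothetical stand-in that only imitates the idea and is not gensim's implementation.

```python
def build_dictionary(tokenized_docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


# Two already-segmented example documents (made-up data).
docs = [
    ["文本", "推荐", "系统"],
    ["文本", "挖掘"],
]
token2id = build_dictionary(docs)
# token2id == {"文本": 0, "推荐": 1, "系统": 2, "挖掘": 3}
```

Because ids depend on insertion order, the dictionary has to be saved and reused. Rebuilding it from a differently ordered set of texts would change the ids and invalidate any stored corpus.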

The corpus is generated by looking up each text's tokens in the dictionary, in the following format:

corpus = [
    [
        #doc1
        (w,id),(w,id),(w,id).....(w,id)
    ],[
        #doc2
        (w,id),(w,id),....(w,id)
    ],
    ......[
        #doc_n
    ]
]

The above is the basic format of the corpus.
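The layout above writes each entry as (w,id). In gensim, `Dictionary.doc2bow` returns a bag-of-words vector of (token_id, count) pairs for each document, so a corpus built that way would hold ids and frequencies rather than the words themselves. The plain-Python sketch below imitates that conversion; `doc_to_bow` is a hypothetical helper, and the dictionary and documents are made-up data.

```python
from collections import Counter


def doc_to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


token2id = {"文本": 0, "推荐": 1, "系统": 2, "挖掘": 3}
docs = [
    ["文本", "推荐", "系统", "文本"],
    ["文本", "挖掘", "未知词"],  # "未知词" is not in the dictionary
]
corpus = [doc_to_bow(tokens, token2id) for tokens in docs]
# corpus == [[(0, 2), (1, 1), (2, 1)], [(0, 1), (3, 1)]]
```

Each inner list is one document. Models such as TF-IDF, LSI, and LDA take this sparse form as input.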

def __set_origin_corpus(self):
        if('dictionary.txt' in os.listdir()):
            print("----------corpus already built------
