Query term weight calculation

After segmenting the query, we need to compute a weight for each of its terms. Term weighting plays an important role in text retrieval, text relevance scoring, keyword extraction, and similar tasks.
For example, take the query "荷尔蒙是什么?" ("What is hormone?"). Searching directly on the segmented terms may give poor results: when no exact match exists, terms like "什么" ("what") and "是" ("is") add heavy noise to the result set.
A manual analysis might assign weights such as: 什么 0.1, 是 0.1, 荷尔蒙 0.8.

Methods for computing term weights

1. tf-idf — the most common way to compute term weights
a. The term frequency tf of word w in document d is the ratio of w's occurrence count count(w, d) to the total number of words size(d) in d:
tf(w, d) = count(w, d) / size(d)
b. The inverse document frequency idf of word w over the whole collection D is the logarithm of the ratio between the total number of documents n and the number of documents docs(w, D) that contain w:
idf(w, D) = log(n / docs(w, D))
tf-idf is a relatively crude method, but it is very practical.
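As a minimal sketch, the snippet below weights the terms of a segmented query by tf-idf against a toy collection. Note two assumptions not in the formulas above: tf is taken over the query itself (since we are weighting query terms, not document terms), and idf uses the common smoothed variant log((n + 1) / (df + 1)) to avoid division by zero.

```python
import math
from collections import Counter

def tf_idf_weights(query_terms, docs):
    """Weight each query term by tf-idf.

    query_terms: list of terms from the segmented query.
    docs: the collection, each document a list of terms.
    """
    n = len(docs)
    tf = Counter(query_terms)
    weights = {}
    for term, count in tf.items():
        # tf over the query string, since we are weighting query terms
        tf_val = count / len(query_terms)
        # smoothed idf (an assumption; a common variant of the formula above)
        df = sum(1 for d in docs if term in d)
        idf_val = math.log((n + 1) / (df + 1))
        weights[term] = tf_val * idf_val
    return weights

# Toy collection of segmented documents
docs = [
    ["荷尔蒙", "是", "一种", "激素"],
    ["什么", "是", "激素"],
    ["激素", "是", "什么"],
]
print(tf_idf_weights(["荷尔蒙", "是", "什么"], docs))
# 荷尔蒙 gets the highest weight; 是 and 什么 come out near zero.
```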

2. Term weight computation in specific settings
a. Mine search-session data: a user's intent rarely changes within a single session, so terms corresponding to the session's core intent should receive high weights.
b. Analyze the titles of frequently clicked documents: the more clicks a document receives, the higher the weight of the core terms in its title (see the sketch after this list).
c. In app-distribution data, the importance of the terms in an app's name can be derived from its download count.
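A hedged sketch of idea (b): assuming we have (segmented title, click count) pairs from a click log, term weights can be estimated by accumulating clicks over the terms of clicked titles. The data format and the normalization are assumptions for illustration.

```python
from collections import defaultdict

def click_based_term_weights(clicked_titles):
    """Estimate term weights from (segmented_title, click_count) pairs.

    Terms appearing in frequently clicked titles accumulate more weight;
    weights are normalized to sum to 1 at the end.
    """
    raw = defaultdict(float)
    for title_terms, clicks in clicked_titles:
        for term in set(title_terms):  # count each term once per title
            raw[term] += clicks
    total = sum(raw.values()) or 1.0
    return {term: w / total for term, w in raw.items()}

# Hypothetical click log
log = [
    (["荷尔蒙", "的", "作用"], 120),
    (["什么", "是", "荷尔蒙"], 80),
    (["荷尔蒙", "失调", "怎么办"], 60),
]
print(click_based_term_weights(log))  # 荷尔蒙 dominates
```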

3. Predicting query term weights with LR
Building on the strategies above, we can train a logistic regression (LR) model to predict query term weights. Candidate features:
Term itself: named-entity type, part of speech, idf, position in the query, term length.
Term-term: co-occurrence strength, whether the pair is inverted, term category, and literal-overlap features between the term and the other terms in the query. A minimal model sketch follows.
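Here is a minimal sketch of such an LR model with scikit-learn, using a handful of the features listed above. The feature values and labels are hypothetical placeholders; a real system would extract them from labeled (query, term) pairs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per (query, term) pair. Feature columns (a subset of the list above):
# [idf, is_named_entity, position_in_query, term_length, cooccurrence_score]
# All values and labels below are hypothetical placeholders.
X = np.array([
    [5.2, 1, 0, 3, 0.8],   # e.g. "荷尔蒙": high idf, named entity
    [0.3, 0, 1, 1, 0.2],   # e.g. "是": function word
    [0.5, 0, 2, 2, 0.3],   # e.g. "什么": question word
    [4.8, 1, 0, 2, 0.7],
    [0.4, 0, 1, 1, 0.1],
])
y = np.array([1, 0, 0, 1, 0])  # 1 = important term, 0 = unimportant

model = LogisticRegression()
model.fit(X, y)

# Use the predicted probability of the "important" class as the term weight
weights = model.predict_proba(X)[:, 1]
print(weights)
```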

Pitfalls where term weights meet retrieval:
Term omission at retrieval time: some terms do not appear in the pages of good results and are not marked retrieval-optional, so those good results cannot be retrieved at all.
Term omission at scoring time: meaningless terms need to be marked ranking-optional, otherwise document scores come out distorted.
Term importance: an important term that must not be dropped may appear in a document that is nevertheless irrelevant, so term presence alone does not guarantee relevance.
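To make "optional" terms concrete, here is a hedged sketch that splits query terms into required and optional by weight, so the retrieval layer can drop low-weight terms when no full match exists and the ranker can down-weight them. The threshold value is an assumption.

```python
def mark_optional_terms(term_weights, threshold=0.2):
    """Split query terms into required and optional by weight.

    Terms below the threshold (e.g. "是", "什么") become optional:
    they may be dropped at retrieval time and ignored at scoring time.
    """
    required = [t for t, w in term_weights.items() if w >= threshold]
    optional = [t for t, w in term_weights.items() if w < threshold]
    return required, optional

weights = {"荷尔蒙": 0.8, "是": 0.1, "什么": 0.1}
print(mark_optional_terms(weights))  # (['荷尔蒙'], ['是', '什么'])
```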
