[ACL 2024] PokeMQA: Programmable knowledge editing for Multi-hop Question Answering

①Existing problems: current methods for cascading knowledge updates rely on a single mixed prompt that covers question decomposition, answer generation, and conflict checking against edited facts. Coupling these functionally different tasks in one prompt can make them interfere with each other

②The authors therefore propose Programmable knowledge editing for Multi-hop Question Answering (PokeMQA)

2.2. Introduction

①Example of Multi-hop question answering (MQA):

where blue lines are correct reasoning

②Methods for updating outdated knowledge (the authors use the second one):

parameter-modification based editing	modifies the internal model weights according to edited facts through meta-learning, fine-tuning, or knowledge locating
memory-based editing	leverages an external memory to explicitly store the edited facts (or termed as edits) and reason over them, while leaving LLMs parameters unchanged

③Existing challenges for MQA under knowledge editing: a) conflict detection, b) adding knowledge-editing instructions to the prompt introduces noise

2.3. Multi-hop Question Answering under Knowledge Editing

（1）Notations

①A triplet $\left ( s,r,o \right )$ with subject $s$ , object $o$ and relation $r$ , such as:

$(\text{Messi}, \text{play for}, \text{Inter Miami})$

②An edit updates this fact to, for example:

$(\text{Messi}, \text{play for}, \text{Boca Juniors})$

③Multi-hop question: $Q$ , whose answer requires sequentially querying and retrieving multiple facts

④Chain of facts:

$\langle(s_1,r_1,o_1),\ldots,(s_n,r_n,o_n)\rangle$

where $s_{i+1}=o_{i}$ , $o_n$ is the final answer

⑤The unique inter-entity path $\mathcal{P}=\langle s_{1},o_{1},\ldots,o_{n}\rangle$

⑥Except for $s_1$ , none of the other entities $o_{1},\ldots,o_{n}$ is allowed to appear in $Q$

⑦Edited facts: a single edit to one fact, such as

$(s_i,r_i,o_i)\rightarrow e=(s_{i},r_{i},o_{i}^{*})$

cascades through the subsequent facts in the chain:

$\langle(s_{1},r_{1},o_{1}),\ldots,(s_{i},r_{i},o_{i}^{*}),\ldots,(s_{n}^{*},r_{n},o_{n}^{*})\rangle$

and the inter-entity path becomes:

$\mathcal{P}^{*}=\langle s_{1},o_{1},\ldots,o_{i}^{*},\ldots,o_{n}^{*}\rangle$
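The cascading effect of a single edit can be illustrated with a small sketch. It assumes facts are stored as a dictionary from (subject, relation) to object; the function name `follow_chain` and the symbolic entity names are made up for illustration and do not come from the paper.

```python
def follow_chain(facts, start, relations, edits=None):
    """Return the inter-entity path <s1, o1, ..., on> for a sequence of relations."""
    edits = edits or {}
    path = [start]
    current = start
    for relation in relations:
        key = (current, relation)
        # An edit overrides the original fact with the same (subject, relation).
        current = edits[key] if key in edits else facts[key]
        path.append(current)
    return path


# Symbolic toy facts mirroring the notation above (hypothetical data).
facts = {
    ("s1", "r1"): "o1",
    ("o1", "r2"): "o2",
    ("o1*", "r2"): "o2*",
}
edits = {("s1", "r1"): "o1*"}  # the single edit e = (s1, r1, o1*)

original_path = follow_chain(facts, "s1", ["r1", "r2"])
edited_path = follow_chain(facts, "s1", ["r1", "r2"], edits)
print(original_path)  # ['s1', 'o1', 'o2']
print(edited_path)    # ['s1', 'o1*', 'o2*']
```

Only the first hop is edited, yet the second hop changes as well because its subject is now the new object, which is the cascade described above.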

（2）MQA under knowledge editing

①A set of edits: $\mathcal{E}=\{e_{1},\ldots,e_{m}\}$

②A language model: $f$

③Edited language model: $f_{\mathrm{edit}}$

（3）Edit scope

①The scope of an edit, $S(e)$ , is the set of questions, including differently phrased ones, whose answer is determined by the edit $e$

Vocabulary: syntactic (adj.), relating to syntax

2.4. Programmable Editing in Memory of Multi-hop Question Answering

2.4.1. Workflow of PokeMQA

①Illustration of PokeMQA:

where the Knowledge Prompt Generator uses an external knowledge base to help decompose the original $Q$ , and the Scope Detector checks each subquestion against the stored edits before an answer is produced

②When receiving a set of edits $\mathcal{E}=\{e_{1},\ldots,e_{m}\}$ , PokeMQA first uses manually defined templates to convert each edit triplet $e$ into a natural language statement $t$ (in the figure, this seems to correspond to decomposing the original question into two subquestions:

Original question: Who is the head of state of the country where Messi ( $s_1$ ) holds a citizenship?

Subquestion 1 ( $t_1$ ): What is the country of citizenship ( $r_1$ ) of Messi ( $s_1$ )?

Answer 1: United States ( $o_1$ )

Subquestion 2 ( $t_2$ ): Who is the head of state ( $r_2$ ) of United States ( $o_1$ )?

Answer 2: Joe Biden ( $o_2$ )

), then explicitly stores them in an external memory $\mathcal{M}=\{t_{1},\ldots,t_{m}\}$ for query and retrieval.
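A minimal sketch of the template step, assuming one hand-written template per relation. The templates, the dictionary `TEMPLATES`, and the helper names are hypothetical; these notes do not record the templates actually used in the paper.

```python
# Hypothetical per-relation templates (not taken from the paper).
TEMPLATES = {
    "country of citizenship": "{s} is a citizen of {o}.",
    "head of state": "The head of state of {s} is {o}.",
}


def edit_to_statement(edit):
    """Convert an edit triplet (s, r, o*) into a natural language statement t."""
    s, r, o = edit
    return TEMPLATES[r].format(s=s, o=o)


def build_memory(edits):
    """External memory M = {t_1, ..., t_m}: one statement per edit."""
    return [edit_to_statement(e) for e in edits]


edits = [("Messi", "country of citizenship", "United States")]
memory = build_memory(edits)
print(memory)  # ['Messi is a citizen of United States.']
```

The language model's parameters are never touched here; the edits only live in `memory`, which is what makes the approach memory-based.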

③The model is taught to execute 3 tasks through few-shot prompts:

1	Identify the next subquestion (i.e., atomic question) conditioned on the input question and the current inference state in LLMs
2	Detect whether this subquestion falls within the edit scope and generate the answer
3	Extract the answer entity for this subquestion in LLMs

④⭐Previous work always generates a tentative answer from the model and then retrieves edited facts for each subquestion, which is hard to do reliably with few-shot prompts. PokeMQA instead first looks the subquestion up in $\mathcal{M}=\{t_{1},\ldots,t_{m}\}$ : if a matching edit is found, the answer is taken from $\mathcal{M}$ ; otherwise the model generates the answer itself

⑤The key entity extracted from $Q$ is added to the prompt, because decomposing the input to obtain the first subquestion is difficult
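The control flow described in this subsection can be sketched as a loop over subquestions. Everything model-related is replaced by toy stand-ins: `toy_next_subquestion`, `toy_find_edit`, `toy_extract_answer`, and `toy_llm_answer` are hypothetical placeholders for the few-shot-prompted LLM and the scope detector, and the keyword matching is only an illustration, not the paper's method.

```python
def answer_multi_hop(question, memory, next_subquestion, find_edit,
                     extract_answer, llm_answer, max_hops=4):
    """Resolve a multi-hop question one subquestion at a time."""
    state = []  # (subquestion, answer) pairs resolved so far
    for _ in range(max_hops):
        subq = next_subquestion(question, state)
        if subq is None:
            break
        entry = find_edit(subq, memory)          # is subq in some edit's scope?
        if entry is not None:
            answer = extract_answer(subq, entry)  # answer comes from the memory
        else:
            answer = llm_answer(subq)             # model answers on its own
        state.append((subq, answer))
    return state


# ---- Toy stand-ins (hypothetical; a real system would call an LLM) ----
SUBQUESTION_PLAN = [
    "What is the country of citizenship of Messi?",
    "Who is the head of state of {prev}?",
]


def toy_next_subquestion(question, state):
    if len(state) == len(SUBQUESTION_PLAN):
        return None
    prev = state[-1][1] if state else ""
    return SUBQUESTION_PLAN[len(state)].format(prev=prev)


def toy_find_edit(subq, memory):
    for entry in memory:
        if entry["subject"] in subq and entry["keyword"] in subq:
            return entry
    return None


def toy_extract_answer(subq, entry):
    return entry["object"]


TOY_LLM = {"Who is the head of state of United States?": "Joe Biden"}


def toy_llm_answer(subq):
    return TOY_LLM.get(subq, "unknown")


memory = [{"subject": "Messi", "keyword": "citizenship",
           "statement": "Messi is a citizen of United States.",
           "object": "United States"}]
question = "Who is the head of state of the country where Messi holds a citizenship?"
trace = answer_multi_hop(question, memory, toy_next_subquestion, toy_find_edit,
                         toy_extract_answer, toy_llm_answer)
final_answer = trace[-1][1]
print(final_answer)  # Joe Biden
```

The first hop is answered from the memory and the second by the stand-in model, matching the Messi example from the figure.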

2.4.2. Programmable Scope Detector

（1）Architectures

①Scope detector: $g(t,q):{\mathcal{T}}\times\mathcal{Q}\to[0,1]$ , which predicts the probability that an atomic question $q$ falls into the scope of the edit statement $t$ (in terms of the edit $e$ )

②They employ 2 complementary models for expressiveness and computational efficiency: $g_\phi$ and $g_\psi$

③ $g_\phi$ (pre-detector $M_{\mathrm{pre}}$ ) computes the embeddings of $t$ and $q$ separately and models the log-likelihood as the negative squared Euclidean distance in the embedding space; it is used to filter out irrelevant edits

④ $g_\psi$ (conflict disambiguator $M_{\mathrm{dis}}$ ) concatenates $t$ and $q$ into a single input for a sequence classification task, which gives a more accurate classification
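A rough sketch of how the two detectors could be chained. It assumes the pre-detector probability is exp(-||emb(t) - emb(q)||^2), that both stages share a 0.5 threshold, and that the disambiguator is a callable taking the index of a candidate statement; these choices, the toy 2-D embeddings, and the function names are illustrative assumptions, not the paper's implementation.

```python
import math


def pre_detector_score(t_emb, q_emb):
    """g_phi: log-likelihood modelled as the negative squared Euclidean distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(t_emb, q_emb))
    return math.exp(-sq_dist)  # probability in (0, 1]


def detect_edit(q_emb, statement_embs, disambiguator, threshold=0.5):
    """Return the index of the edit statement whose scope covers q, or None."""
    # Stage 1: cheap filtering with separately computed embeddings.
    candidates = [i for i, t_emb in enumerate(statement_embs)
                  if pre_detector_score(t_emb, q_emb) >= threshold]
    if not candidates:
        return None
    # Stage 2: the more expensive joint model runs only on the survivors.
    best = max(candidates, key=disambiguator)
    return best if disambiguator(best) >= threshold else None


# Toy data: two nearby statements and one far away (hypothetical embeddings).
statement_embs = [(0.0, 0.0), (0.1, 0.0), (3.0, 3.0)]
toy_g_psi = {0: 0.9, 1: 0.2, 2: 0.8}  # stand-in for g_psi on the concatenated (t, q)

hit = detect_edit((0.0, 0.1), statement_embs, toy_g_psi.get)
miss = detect_edit((10.0, 10.0), statement_embs, toy_g_psi.get)
print(hit, miss)  # 0 None
```

The split reflects the stated trade-off: the embedding model is cheap because statement embeddings can be computed once, while the joint classifier is more expressive but has to be run per pair.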

（2）Training scope detector

①Training set: $\mathcal{D}_{\mathrm{train}}=\{(t_{1},q_{1}),\ldots,(t_{m},q_{m})\}$

②BCE loss:

$\mathcal{L}=-\log g(t_i,q_i)-\mathbb{E}_{q_n\sim P_n(q)}\left[\log(1-g(t_i,q_n))\right]$

where $P_n$ denotes negative sampling distribution
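The loss can be evaluated numerically as follows. This is a plain-Python sketch in which the expectation over $P_n$ is approximated by the mean over a finite list of sampled negatives; the function name `scope_detector_loss` and the `eps` clamp are additions for illustration.

```python
import math


def scope_detector_loss(pos_prob, neg_probs, eps=1e-12):
    """BCE-style loss for one positive pair (t_i, q_i) and sampled negatives q_n.

    pos_prob  : g(t_i, q_i), should be pushed toward 1
    neg_probs : list of g(t_i, q_n), should be pushed toward 0
    """
    pos_term = -math.log(max(pos_prob, eps))
    # Monte Carlo estimate of the expectation over the negative distribution.
    neg_term = -sum(math.log(max(1.0 - p, eps)) for p in neg_probs) / len(neg_probs)
    return pos_term + neg_term


good = scope_detector_loss(0.9, [0.1, 0.2])  # confident and correct detector
bad = scope_detector_loss(0.4, [0.6, 0.7])   # detector confuses the pairs
print(good < bad)  # True
```

Because $g$ takes values in $[0,1]$, both terms are non-negative, so the loss is bounded below by 0 and approaches it only when the positive score tends to 1 and the negative scores tend to 0.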

③ $M_{\mathrm{pre}}$ and $M_{\mathrm{dis}}$ are trained separately

（3）Model selection

①The authors design Success Rate and Block Rate to guide early stopping

②Success Rate measures how accurately the detector retrieves the correct edit statement $t_i$ for a target question $q_i$ from a set of candidates

$SR=\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\left[\bigwedge_{(t,q)\in\mathcal{D}_{val}}(g(t_i,q_i)\geq g(t,q_i))\right]$

where $\mathbf{1}\left [ \cdot \right ]$ is the indicator function, $N$ denotes the size of the validation set $\mathcal{D}_{val}$ , and $\bigwedge$ is the logical conjunction ("and") over all candidates

③Block Rate quantifies how well the detector suppresses unrelated edit statements for a target question:

$BR=\frac{1}{N}\sum_{i=1}^N\mathbf{1}\left[\bigwedge_{(t,q)\in\mathcal{D}_{val}^-}(g(t,q_i)<0.5)\right]$

where $\mathcal{D}_{val}^{-}=\mathcal{D}_{val}-\{(t_{i},q_{i})\}$
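Both metrics can be computed from a matrix of detector scores. The sketch below assumes `scores[j][i]` holds $g(t_j,q_i)$ for the validation pairs, so the diagonal contains the matching pairs; the matrix values are made up.

```python
def success_rate(scores):
    """Fraction of questions q_i whose own statement t_i gets the top score."""
    n = len(scores)
    hits = sum(
        all(scores[i][i] >= scores[j][i] for j in range(n))
        for i in range(n)
    )
    return hits / n


def block_rate(scores, threshold=0.5):
    """Fraction of questions q_i for which every unrelated statement stays below the threshold."""
    n = len(scores)
    hits = sum(
        all(scores[j][i] < threshold for j in range(n) if j != i)
        for i in range(n)
    )
    return hits / n


# scores[j][i] = g(t_j, q_i); rows are statements, columns are questions.
scores = [
    [0.9, 0.2, 0.6],
    [0.1, 0.8, 0.7],
    [0.3, 0.1, 0.4],
]
sr = success_rate(scores)
br = block_rate(scores)
print(round(sr, 3), round(br, 3))  # 0.667 0.667
```

For the third question, the matching statement scores 0.4 while two unrelated statements score 0.6 and 0.7, so it counts as a failure for both metrics.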

2.4.3. Knowledge Prompt Generator

①They introduce knowledge prompt generator $M_{\mathrm{gen}}$ (ELQ model) to quickly link $Q$ to an entity from Wikidata

②Wikidata stores facts as triplets $(s,r,o)$

③2 basic membership properties $\mathcal{R}=[r_1,r_2]$ , where $r_{1}=\text{instance of}$ , $r_{2}=\text{subclass of}$ (this is where "Messi, a human" in the top-left panel of the overview figure comes from)
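A small sketch of how such a knowledge prompt might be assembled. The substring-based `link_entity` is only a toy substitute for the ELQ entity linker, `TOY_KB` is a hypothetical stand-in for Wikidata, and the output format "entity, a type" is inferred from the figure rather than confirmed by the paper.

```python
# The two membership properties, tried in order.
MEMBERSHIP_PROPERTIES = ["instance of", "subclass of"]

# Hypothetical miniature stand-in for Wikidata triplets (s, r) -> o.
TOY_KB = {("Messi", "instance of"): "human"}


def link_entity(question, known_entities):
    """Toy entity linker: return the first known entity mentioned in the question."""
    for entity in known_entities:
        if entity in question:
            return entity
    return None


def knowledge_prompt(question, kb):
    """Build a short prompt such as 'Messi, a human' for the key entity in Q."""
    entities = sorted({s for s, _ in kb})
    entity = link_entity(question, entities)
    if entity is None:
        return ""
    for relation in MEMBERSHIP_PROPERTIES:
        value = kb.get((entity, relation))
        if value is not None:
            article = "an" if value[0].lower() in "aeiou" else "a"
            return f"{entity}, {article} {value}"
    return entity  # entity found, but no membership fact available


question = "Who is the head of state of the country where Messi holds a citizenship?"
prompt = knowledge_prompt(question, TOY_KB)
print(prompt)  # Messi, a human
```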

2.5. Experimental Setup

①Knowledge editing dataset: MQUAKE, which includes MQUAKE-CF-3K, based on counterfactual edits, and MQUAKE-T, with temporal knowledge updates

②Number of hops per question in the dataset: $k\in \left \{ 2,3,4 \right \}$

2.5.1. Evaluation Metrics

①Metrics: multi-hop accuracy and hop-wise answering accuracy (Hop-Acc)

2.5.2. Baselines Methods & Language Models

①Compared parameter updating methods: FT, ROME, MEMIT

②Compared memory-based method: MeLLo

③LLMs: LLaMa-2-7B, Vicuna-7B, GPT-3.5-turbo-instruct

2.5.3. Implementation Details

①Both $g_\phi$ and $g_\psi$ are fine-tuned from DistilBERT

②Sampling method: stratified sampling

2.6. Performance Analysis

2.6.1. Main Results

①Performance table:

②Accuracy and Hop-Acc of LLaMa-2-7B on MQUAKE-CF-3K:

2.6.2. Ablation Study

①Module ablation in GPT-3.5-turbo-instruct on MQUAKE-CF-3K:

②Module ablation table:

2.7. Related Work

①Knowledge editing methods

2.8. Conclusion

①Limitations: a) accuracy of retrieval, b) safe technique required

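3. Supplementary knowledge

3.1. Knowledge editing

(1) Definition

Knowledge editing is the process of modifying, updating, correcting, or extending the existing knowledge in a knowledge store (such as a knowledge graph, a knowledge base, or a model's knowledge representation). It is not limited to adding new facts; it also covers modifying, deleting, and correcting erroneous knowledge, as well as introducing new context and relations on top of existing knowledge.

In natural language processing and knowledge graphs, the goal of knowledge editing is to make the knowledge in a knowledge base or model more accurate, consistent, and up to date. This is essential for the performance of intelligent systems (such as question answering and reasoning systems), especially in settings where knowledge changes over time.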

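(2) Applications

①Correcting errors: if a fact in the knowledge base is wrong (for example, an incorrect date, place, or person), knowledge editing can correct it.

②Extending and updating knowledge: the knowledge base must be updated continuously as new information arrives. For example, new scientific discoveries and new social events need to be added through knowledge editing.

③Countering bias: biased or inaccurate information in the knowledge base (such as stereotypes or political bias) can also be corrected through knowledge editing, improving the fairness and accuracy of the knowledge.

④Automated learning: some systems automatically extract new facts from large amounts of text and edit them into an existing knowledge base, so that the system keeps improving its knowledge.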

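(3) Challenges

①Complex reasoning: when editing existing knowledge, its relations to other facts may need to be considered, to ensure that newly added knowledge does not break the consistency of the existing knowledge.

②Knowledge verification: effective methods are needed to verify that an edit is correct. In knowledge graphs in particular, confirming that a newly added fact does not conflict with existing knowledge is a challenge.

③Handling knowledge from different sources: how to integrate information from multiple sources and edit it consistently, and in particular how to make a reasonable editing decision when the facts from different sources disagree.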

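(4) Technical implementation

①Rule-based editing: the knowledge base is updated with a set of hand-written rules. For example, manual rules can check and correct the accuracy of facts such as dates and places.

②Learning-based editing: machine learning methods (such as text classification and entity relation extraction) automatically detect errors or inconsistencies in the knowledge base and apply edits automatically.

③Human-machine collaboration: manual review is combined with automated editing to ensure the quality and accuracy of edits. For example, an automated system can flag potential errors while human experts make the final decision.

④Natural language generation (NLG) and reasoning: language models (such as GPT and BERT) automatically generate or infer new knowledge in order to update a knowledge graph. For example, a large pretrained language model can extract new facts from text and insert them into the knowledge base as edited facts.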

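3.2. BCE variant for contrastive learning and discriminative models

(1) Formula

$\mathcal{L}=-\log g(t_i,q_i)-\mathbb{E}_{q_n\sim P_n(q)}\left[\log(1-g(t_i,q_n))\right]$

(2) Explanation

This formula describes contrastive learning with one positive sample and several negative samples (averaged): the first term covers the positive pair and the second term covers the negative pairs. The authors define $g(t,q):{\mathcal{T}}\times\mathcal{Q}\to[0,1]$ , so minimizing the first term pushes $g(t_i,q_i)$ toward 1, i.e. a high matching score. For the mismatched pairs in the second term, $g(t_i,q_n)$ should be as small as possible, so that the second term also approaches 0. Since $g\in[0,1]$ , both $-\log g$ and $-\log(1-g)$ are non-negative, so the loss is always greater than or equal to 0.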