Original paper: https://arxiv.org/pdf/2410.19750
THE GEOMETRY OF CONCEPTS: Sparse Autoencoder Feature Structure
Yuxiao Li*, Eric J. Michaud*, David D. Baek*, Joshua Engels*, Xiaoqing Sun, Max Tegmark†
Massachusetts Institute of Technology
Cambridge, MA 02139, USA
ABSTRACT
Sparse autoencoders have recently produced dictionaries of high-dimensional vectors corresponding to the universe of concepts represented by large language models. We find that this concept universe has interesting structure at three levels: 1) The “atomic” small-scale structure contains “crystals” whose faces are parallelograms or trapezoids, generalizing well-known examples such as (man:woman::king:queen). We find that the quality of such parallelograms and associated function vectors improves greatly when projecting out global distractor directions such as word length, which is efficiently done with linear discriminant analysis. 2) The “brain” intermediate-scale structure has significant spatial modularity; for example, math and code features form a “lobe” akin to functional lobes seen in neural fMRI images. We quantify the spatial locality of these lobes with multiple metrics and find that clusters of co-occurring features, at coarse enough scale, also cluster together spatially far more than one would expect if feature geometry were random. 3) The “galaxy” scale large-scale structure of the feature point cloud is not isotropic, but instead has a power law of eigenvalues with steepest slope in middle layers. We also quantify how the clustering entropy depends on the layer.
1 INTRODUCTION
The past year has seen a breakthrough in understanding how large language models work: sparse autoencoders have discovered large numbers of points (“features”) in their activation space that can be interpreted as concepts (Huben et al., 2023; Bricken et al., 2023). Such SAE point clouds have recently been made publicly available (Lieberum et al., 2024), so it is timely to study their structure at various scales.¹ This is the goal of the present paper, focusing on three separate spatial scales. In Section 3, we investigate if the “atomic” small-scale structure contains “crystals” whose faces are parallelograms or trapezoids, generalizing well-known examples such as (man:woman::king:queen). In Section 4, we test if the “brain” intermediate-scale structure has functional modularity akin to biological brains. In Section 5, we study the “galaxy” scale large-scale structure of the feature point cloud, testing whether it is more interestingly shaped and clustered than an isotropic Gaussian distribution.
2 RELATED WORK
SAE feature structure: Sparse autoencoders have relatively recently garnered attention as an approach for discovering interpretable language model features without supervision, with relatively few works examining SAE feature structure. Bricken et al. (2023) and Templeton et al. (2024) both visualized SAE features with UMAP projections and noticed that features tend to group together in “neighborhoods” of related features, in contrast to the approximately-orthogonal geometry observed in the toy model of Elhage et al. (2022). Engels et al. (2024) find examples of SAE structure where multiple SAE features appear to reconstruct a multi-dimensional feature with interesting geometry, and multiple authors have recently speculated that SAE vectors might contain more important structures (Mendel, 2024; Smith, 2024). Bussmann et al. (2024) suggest that SAE features are in fact linear combinations of more atomic features, and discover these more atomic latents with “meta SAEs”. Our discussion of crystal structure in SAE features is related to this idea.
*Equal Contribution
† tegmark@mit.edu
¹ Throughout the paper, we use the JumpReLU SAEs (Rajamanoharan et al., 2024) of “Gemma Scope” (Lieberum et al., 2024) for gemma-2-2b and gemma-2-9b (Team et al., 2024). By “SAE point cloud”, we refer to the collection of unit-norm decoder feature vectors learned by the SAE. Unless otherwise stated, we use residual stream SAEs (2b-pt-res or 9b-pt-res) with 16k features.
Function vectors and word embedding models: Early word embedding methods such as GloVe and Word2vec were found to contain directions encoding semantic concepts, e.g. the well-known formula f(king) - f(man) + f(woman) = f(queen) (Drozd et al., 2016; Pennington et al., 2014; Ma & Zhang, 2015). More recent research has found similar evidence of linear representations in sequence models trained only on next token prediction, including Othello board positions (Nanda et al., 2023; Li et al., 2022), the truth value of assertions (Marks & Tegmark, 2023), and numeric quantities such as longitude, latitude, birth year, and death year (Gurnee & Tegmark, 2023; Heinzerling & Inui, 2024). Recent works have found causal function vectors for in-context learning (Todd et al., 2023; Hendel et al., 2023; Kharlapenko et al., 2024), which induce the model to perform a certain task. Our discussion of crystal structures builds upon these previous works of finding task vectors or parallelogram structures in language models.
3 “ATOM” SCALE: CRYSTAL STRUCTURE
Figure 1: Parallelogram and trapezoid structure is revealed (left) when using LDA to project out distractor dimensions, tightening up clusters of pairwise Gemma-2-2b activation differences (right).
In this section, we search for what we term crystal structure in the point cloud of SAE features. By this we mean geometric structure reflecting semantic relations between concepts, generalizing the classic example of $(\mathbf{a}, \mathbf{b}, \mathbf{c}, \mathbf{d}) = (\text{man}, \text{woman}, \text{king}, \text{queen})$ forming an approximate parallelogram where $\mathbf{b} - \mathbf{a} \approx \mathbf{d} - \mathbf{c}$. This can be interpreted in terms of two function vectors $\mathbf{b} - \mathbf{a}$ and $\mathbf{c} - \mathbf{a}$ that turn male entities female and turn entities royal, respectively. We also search for trapezoids with only one pair of parallel edges $\mathbf{b} - \mathbf{a} \propto \mathbf{d} - \mathbf{c}$ (corresponding to only one function vector); Fig. 1 (right) shows such an example with $(\mathbf{a}, \mathbf{b}, \mathbf{c}, \mathbf{d}) = (\text{Austria}, \text{Vienna}, \text{Switzerland}, \text{Bern})$, where the function vector can be interpreted as mapping countries to their capitals.
We search for crystals by computing all pairwise difference vectors and clustering them, which should result in a cluster corresponding to each function vector. Any pair of difference vectors in a cluster should form a trapezoid or parallelogram, depending on whether the difference vectors are normalized or not before clustering (or, equivalently, whether we quantify similarity between two difference vectors via Euclidean distance or cosine similarity).
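To make this concrete, a minimal sketch of the difference-clustering search is given below, assuming the unit-norm decoder directions are loaded as a NumPy array. The name `decoder_vecs`, the subsampling, and the choice of k-means are illustrative assumptions on our part, not the paper's exact pipeline.

```python
# Sketch: search for crystal structure by clustering pairwise differences
# of SAE decoder directions. `decoder_vecs` is an illustrative placeholder
# for the (n_features x d) unit-norm decoder matrix.
import numpy as np
from sklearn.cluster import KMeans

def difference_clusters(decoder_vecs, n_clusters=100, normalize=True,
                        subsample=200, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(decoder_vecs), size=min(subsample, len(decoder_vecs)),
                     replace=False)
    V = decoder_vecs[idx].astype(np.float32)
    # All pairwise differences v_i - v_j (the i == j rows are zero).
    diffs = (V[:, None, :] - V[None, :, :]).reshape(-1, V.shape[1])
    diffs = diffs[np.linalg.norm(diffs, axis=1) > 1e-6]  # drop i == j rows
    if normalize:
        # Unit-normalizing means only the *direction* of each difference
        # matters (cosine similarity), so clusters correspond to trapezoids;
        # the unnormalized Euclidean version corresponds to parallelograms.
        diffs /= np.linalg.norm(diffs, axis=1, keepdims=True)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(diffs)
    return idx, km.labels_, km.cluster_centers_
```

Any two difference vectors assigned to the same cluster then nominate a candidate quadruplet $(\mathbf{a}, \mathbf{b}, \mathbf{c}, \mathbf{d})$ to inspect.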
Our initial search for SAE crystals found mostly noise. To investigate why, we focused our attention on Layers 0 (the token embedding) and 1, where many SAE features correspond to single words. We then studied Gemma-2-2b residual stream activations for previously reported word $\mapsto$ word function vectors from the dataset of Todd et al. (2023), which clarified the problem. Figure 1 illustrates that candidate crystal quadruplets are typically far from being parallelograms or trapezoids. This is consistent with multiple papers pointing out that (man, woman, king, queen) is not an accurate parallelogram either.
Figure 2: Features in the SAE point cloud that tend to fire together within documents are seen to also be geometrically co-located in functional “lobes”, here down-projected to 2D with t-SNE, with point size proportional to feature frequency. A 2-lobe partition (left) is seen to break the point cloud into roughly equal parts, active on code/math documents and English-language documents, respectively. A 3-lobe partition (right) is seen to mainly subdivide the English lobe into a part for short messages and dialogue (e.g. chat rooms and parliament proceedings) and one primarily containing long-form scientific papers.
We found the reason to be the presence of what we term distractor features. For example, we find that the horizontal axis in Figure 1 (right) corresponds mainly to word length (Appendix B, Figure 10), which is semantically irrelevant and wreaks havoc on the trapezoid (left), since “Switzerland” is much longer than the other words.
To eliminate such semantically irrelevant distractor vectors, we wish to project the data onto a lower-dimensional subspace orthogonal to them. For the Todd et al. (2023) dataset, we do this with Linear Discriminant Analysis (LDA) (Xanthopoulos et al., 2013), which projects onto signal-to-noise eigenmodes where “signal” and “noise” are defined as the covariance matrices of inter-cluster variation and intra-cluster variation, respectively. Figure 1 illustrates that this dramatically improves the cluster and trapezoid/parallelogram quality, highlighting that distractor features can hide existing crystals.
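A minimal sketch of this projection with scikit-learn's LDA follows, assuming each difference vector is labeled by the word $\mapsto$ word relation it instantiates; `diff_vecs` and `relation_labels` are illustrative names.

```python
# Sketch: suppress distractor directions with Linear Discriminant Analysis.
# `diff_vecs` holds pairwise activation differences and `relation_labels`
# tags which function-vector cluster each difference belongs to (both
# illustrative placeholders).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def project_out_distractors(diff_vecs, relation_labels, n_components=2):
    # LDA projects onto directions maximizing inter-cluster variance
    # relative to intra-cluster variance, so directions like word length,
    # which vary within every cluster but not between them, are discarded.
    lda = LinearDiscriminantAnalysis(n_components=n_components)
    return lda.fit_transform(diff_vecs, relation_labels)
```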
4 “BRAIN” SCALE: MESO-SCALE MODULAR STRUCTURE
We now zoom out and look for larger-scale structure. In particular, we investigate if functionally similar groups of SAE features (which tend to fire together) are also geometrically similar, forming “lobes” in the activation space.
In animal brains, such functional groups are well-known clusters in the 3D space where neurons are located. For example, Broca’s area is involved in speech production, the auditory cortex processes sound, and the amygdala is primarily associated with processing emotions. We are curious whether we can find analogous functional modularity in the SAE feature space.
We test a variety of methods for automatically discovering such functional “lobes” and for quantifying if they are spatially modular. We define a lobe partition as a partition of the point cloud into $k$ subsets (“lobes”) that are computed without positional information. Instead, we identify such lobes based on them being functionally related, specifically, tending to fire together within a document.
To automatically identify functional lobes, we first compute a histogram of SAE feature co-occurrences. We take gemma-2-2b and pass documents from The Pile (Gao et al., 2020) through it. In this section, we report results with a layer 12 residual stream SAE with 16k features and an average L0 of 41. For this SAE, we record the features that fire (we count a feature as firing if its hidden activation is $> 1$). Features are counted as co-occurring if they both fire within the same block of 256 tokens; this length provides a coarse “time resolution” allowing us to find features that tend to fire together within the same document rather than just at the same token. We use a max context length of 1024, and only use one such context per document, giving us at most 4 blocks (and histogram updates) per document of The Pile. We compute histograms across 50k documents. Given this histogram, we compute an affinity score between each pair of SAE features based on their co-occurrence statistics and perform spectral clustering on the resulting affinity matrix.
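A sketch of the histogram computation is below, assuming block-level firing indicators have already been extracted; `block_firings` is an illustrative placeholder rather than code from the paper.

```python
# Sketch: build feature co-occurrence counts from block-level firing
# indicators. `block_firings` is an illustrative (n_blocks x n_features)
# boolean array: entry [b, f] is True if feature f had hidden activation
# > 1 anywhere within 256-token block b.
import numpy as np

def cooccurrence_counts(block_firings):
    X = block_firings.astype(np.float32)
    joint = X.T @ X            # joint[i, j] = # blocks where i and j both fire
    marginal = X.sum(axis=0)   # # blocks where each feature fires
    n_blocks = X.shape[0]
    return joint, marginal, n_blocks
```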
We experiment with the following notions of co-occurrence-based affinity: simple matching coefficient, Jaccard similarity, Dice coefficient, overlap coefficient, and Phi coefficient, which can all be computed just from a co-occurrence histogram. In Appendix A.1, we review definitions for each of these, and in Figure 5 illustrate how the choice between them affects the resulting lobes.
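For concreteness, here is a sketch of two of these affinities computed from the joint and marginal counts above, followed by spectral clustering on the resulting matrix. Clipping negative Phi values to zero before clustering is our own assumption, since scikit-learn's precomputed spectral clustering expects non-negative affinities.

```python
# Sketch: co-occurrence affinities from the counts above (n_blocks is the
# total number of 256-token blocks), then spectral clustering into lobes.
import numpy as np
from sklearn.cluster import SpectralClustering

def jaccard(joint, marginal):
    union = marginal[:, None] + marginal[None, :] - joint
    return joint / np.maximum(union, 1)

def phi(joint, marginal, n_blocks):
    # Phi is the Pearson correlation between two binary firing indicators:
    # (n*n_ij - n_i*n_j) / sqrt(n_i * n_j * (n - n_i) * (n - n_j)).
    joint, marginal = joint.astype(np.float64), marginal.astype(np.float64)
    num = n_blocks * joint - marginal[:, None] * marginal[None, :]
    den = np.sqrt(marginal[:, None] * marginal[None, :]
                  * (n_blocks - marginal)[:, None]
                  * (n_blocks - marginal)[None, :])
    return num / np.maximum(den, 1e-9)

def lobes_from_affinity(affinity, n_lobes=3, seed=0):
    affinity = np.maximum(affinity, 0)  # clip negatives (our assumption)
    return SpectralClustering(n_clusters=n_lobes, affinity="precomputed",
                              random_state=seed).fit_predict(affinity)
```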
Our null hypothesis is that functionally similar points (of commonly co-occurring SAE features) are uniformly distributed throughout the activation space, showing no spatial modularity. In contrast, Figure 2 shows lobes that appear visually quite spatially localized. To quantify how statistically significant this is, we use two approaches to rule out the null hypothesis.
- While we can cluster features based on whether they co-occur, we can also perform spectral clustering based on the cosine similarity between SAE feature decoder vectors. Given a clustering of SAE features using cosine similarity and a clustering using co-occurrence, we compute the mutual information between these two sets of labels. In some sense, this directly measures the amount of information about geometric structure that one gets from knowing functional structure. We report the adjusted mutual information (Vinh et al., 2009) as implemented by scikit-learn (Pedregosa et al., 2011), which corrects for chance agreements between the clusterings.
- Another conceptually simple approach is to train models to predict which functional lobe a feature is in from its geometry. To do this, we take a given set of lobe labels from our co-occurrence-based clustering and train a logistic regression model to predict these labels directly from the point positions, using an 80-20 train-test split and reporting the balanced test accuracy of this classifier. Both tests are sketched in the code after this list.
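A sketch of both tests, assuming `func_labels` come from the co-occurrence-based clustering and `decoder_vecs` are the unit-norm decoder directions; the names and hyperparameters are illustrative.

```python
# Sketch: (1) adjusted mutual information between geometric and functional
# clusterings, and (2) a logistic-regression probe predicting functional
# lobe from feature geometry. Variable names are illustrative.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import adjusted_mutual_info_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

def geometry_functional_ami(decoder_vecs, func_labels, n_lobes, seed=0):
    # Geometric clustering from cosine similarity; for unit-norm vectors the
    # dot product is the cosine, and +1 keeps the affinity non-negative.
    cos = decoder_vecs @ decoder_vecs.T
    geom_labels = SpectralClustering(n_clusters=n_lobes, affinity="precomputed",
                                     random_state=seed).fit_predict(cos + 1)
    return adjusted_mutual_info_score(func_labels, geom_labels)

def lobe_probe_accuracy(decoder_vecs, func_labels, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(decoder_vecs, func_labels,
                                              test_size=0.2, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return balanced_accuracy_score(y_te, clf.predict(X_te))
```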
Figure 3 shows that for both measures, the Phi coefficient wins, delivering the best correspondence between functional lobes and feature geometry. To show that this is statistically significant, we randomly permute the cluster labels from the cosine similarity-based clustering and measure the adjusted mutual information. We also randomly re-initialize the SAE feature decoder directions from a random Gaussian and normalize, and then train logistic regression models to predict functional lobe from these feature directions. Figure 3 (bottom) shows that both tests rule out the null hypothesis at high significance, at 954 and 74 standard deviations, respectively, clearly demonstrating that the lobes we see are real and not a statistical fluke.
To assess what each lobe specializes in, we run 10k documents from The Pile through gemma-2-2b, and again record which SAE features at layer 12 fire within blocks of 256 tokens. Each document in The Pile is tagged with a name specifying the subset of the corpus that document is from. For each 256-token block within a document of a given type, we record which lobe had the highest proportion of its SAE features firing. Across thousands of documents, we can then build a histogram of which lobes were maximally activating for each document type. We show these results for three lobes, computed with the Phi coefficient as the co-occurrence measure, in Figure 4. This forms the basis for our lobe labeling in Figure 2.
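A sketch of this tallying procedure, with `block_firings`, `block_subset_names`, and `lobe_of_feature` as illustrative placeholders:

```python
# Sketch: per Pile subset, count which lobe has the highest fraction of
# its features firing in each 256-token block. All names are illustrative.
from collections import Counter, defaultdict
import numpy as np

def lobe_histograms(block_firings, block_subset_names, lobe_of_feature, n_lobes):
    lobe_sizes = np.bincount(lobe_of_feature, minlength=n_lobes)
    hists = defaultdict(Counter)
    for firing, subset in zip(block_firings, block_subset_names):
        # firing: boolean vector over features for this block.
        counts = np.bincount(lobe_of_feature[firing], minlength=n_lobes)
        frac = counts / np.maximum(lobe_sizes, 1)
        hists[subset][int(frac.argmax())] += 1  # winning lobe for this block
    return hists
```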
The effects of the five different co-occurrence measures are compared in Figure 5. Although we found Phi to be best, all five are seen to discover the “code/math lobe”.