Large-Scale Molecular Clustering on ECFP4 Fingerprints with chemfp

After a large-scale virtual screen, the resulting small-molecule dataset is too large for RDKit to cluster, but chemfp can handle clustering at the million-molecule scale. In practice, the rdkit2fps command had a low fingerprint-generation success rate, so on an Ubuntu 18.04 WSL system the author created two conda environments, one running a new chemfp release and one running the old release, to handle fingerprint generation and clustering respectively.

After a large-scale virtual screen of a compound library, you often end up with a result set of several hundred thousand to over a million small molecules, and RDKit struggles to cluster at that scale. Searching online turned up chemfp, which can cluster millions of molecules; the original write-up is 基于分子指纹的大规模分子聚类 - 知乎 (zhihu.com). I ran into some problems in practice, though: the rdkit2fps command had a low success rate when generating fingerprints. After studying chemfp, I worked out the method below.

The system used is Ubuntu 18.04 under WSL; the workflow is as follows.

1. Install Anaconda

There are plenty of tutorials online, so I won't repeat them here.

chemfp's rdkit2fps command (which requires RDKit) fails on some molecules when generating ECFP4 fingerprints: of the 480,000+ molecules I converted, only 180,000+ succeeded, whereas the ob2fps command (which requires Open Babel) converted every one. Only the newer chemfp releases can generate ECFP4 fingerprints with ob2fps, while the old chemfp is the free release and still provides the clustering functionality. I therefore created two conda environments: one runs a new chemfp to generate the ECFP4 fingerprints, and the other runs the old chemfp to do the clustering.
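For reference, the rdkit2fps route that produced the low success rate would look roughly like the line below. This is a sketch: I believe --morgan and --radius are the relevant rdkit2fps options (Morgan radius 2 is the ECFP4-like setting), but flag names can differ between chemfp versions, so verify with rdkit2fps --help.

rdkit2fps --morgan --radius 2 sdf_scoure78.sdf -o sdf_scoure78.fps   # assumed flags; ECFP4-like Morgan fingerprints via RDKit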


2. Create the openbabel environment and generate the FPS fingerprint file

conda create -n openbabel python=3.6   # create an environment named openbabel with Python 3.6
conda activate openbabel
conda install -c conda-forge openbabel   # installs Open Babel 3.1.1
python -m pip install chemfp -i https://chemfp.com/packages/    # install chemfp from its own package index
ob2fps --ECFP4 sdf_scoure78.sdf -o sdf_scoure78.fps   # convert the SDF file into an ECFP4 FPS fingerprint file with ob2fps
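To sanity-check the conversion, compare the number of fingerprints written against the number of input records. In the FPS format, header lines start with # and every remaining line is one fingerprint; in an SDF file, each record ends with a $$$$ line:

grep -vc '^#' sdf_scoure78.fps   # number of fingerprints generated
grep -c '^\$\$\$\$' sdf_scoure78.sdf   # number of molecules in the input SDF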

3. Create the chemfp environment

conda create -n chemfp python=2.7  # create an environment named chemfp with Python 2.7
conda activate chemfp
python2.7 -m pip install chemfp # install chemfp 1.6, the free release
Create a taylor_butina.py script file in the working directory; its contents are given at the end of this post.
python taylor_butina.py --profile --threshold 0.6 test.fps -o test.clusters  # run taylor_butina.py to produce the compound cluster file
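Here --threshold 0.6 means two fingerprints count as neighbors when their Tanimoto similarity is at least 0.6. Based on the script's report function, the output file has the following general shape (the IDs below are made-up placeholders):

2 true singletons
=> mol0001 mol0042

1 false singletons
=> mol0107

3 clusters
mol0005 has 24 other members
=> mol0008 mol0113 ...

True singletons have no neighbor at the threshold at all, false singletons have neighbors that were all claimed by other clusters, and each cluster is listed with its centroid ID, member count, and member IDs, largest cluster first.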

The taylor_butina.py script:

#!/usr/bin/env python

from __future__ import print_function, division  # division: human_memory needs true division under Python 2.7

# An implementation of the Taylor-Butina clustering algorithm for
# chemical fingerprints.

# Robin Taylor, "Simulation Analysis of Experimental Design Strategies
# for Screening Random Compounds as Potential New Drugs and
# Agrochemicals", J. Chem. Inf. Comput. Sci., 1995, 35 (1), pp 59-67.
# DOI: 10.1021/ci00023a009
#
# Darko Butina, "Unsupervised Data Base Clustering Based on Daylight's
# Fingerprint and Tanimoto Similarity: A Fast and Automated Way To
# Cluster Small and Large Data Sets", J. Chem. Inf. Comput. Sci.,
# 1999, 39 (4), pp 747-750
# DOI: 10.1021/ci9803381

# Copyright 2017 Andrew Dalke <dalke@dalkescientific.com>
# Distributed under the "MIT license"
#
# Permission is hereby granted, free of charge, to any person
# obtaining a copy of this software and associated documentation files
# (the "Software"), to deal in the Software without restriction,
# including without limitation the rights to use, copy, modify, merge,
# publish, distribute, sublicense, and/or sell copies of the Software,
# and to permit persons to whom the Software is furnished to do so,
# subject to the following conditions:
# 
# The above copyright notice and this permission notice shall be
# included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
# BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
# ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

__version__ = "1.0"

import sys
import argparse
import time

# I use the third-party "psutil" package to get memory use information.
try:
    import psutil
except ImportError:
    # If it isn't available, then don't report memory use.
    psutil = None
    def get_memory_use():
        return None
else:
    import os
    _process = psutil.Process(os.getpid())
    def get_memory_use():
        info = _process.memory_info()
        return info.rss # or info.vms?

# This requires chemfp. See http://chemfp.com/
import chemfp
from chemfp import search
        
# Convert the number of bytes into a more human-readable form.
def human_memory(n):
    if n < 1024:
        return "%d B" % (n,)
    for unit, denom in (("KB", 1024), ("MB", 1024**2),
                         ("GB", 1024**3), ("PB", 1024**4)):
        f = n/denom
        if f < 10.0:
            return "%.2f %s" % (f, unit)
        if f < 100.0:
            return "%d %s" % (round(f, 1), unit)
        if f < 1000.0:
            return "%d %s" % (round(f, 0), unit)
    return ">1TB ?!?"

##### Measure the current time and memory use, so I can generate a delta report.
class ProfileStats(object):
    def __init__(self, timestamp, memory_use):
        self.timestamp = timestamp
        self.memory_use = memory_use

def get_profile_stats():
    return ProfileStats(time.time(), get_memory_use())

def get_profile_time():
    return ProfileStats(time.time(), None)

def profile_report(title, start, end):
    dt = end.timestamp - start.timestamp
    if start.memory_use is None or end.memory_use is None:
        # Memory use not available.
        sys.stderr.write("%s time: %.1f sec\n" % (title, dt))
    else:
        delta_memory = end.memory_use - start.memory_use
        memory_str = human_memory(delta_memory)
        sys.stderr.write("%s time: %.1f sec memory: %s\n" % (title, dt, memory_str))

######

# The results of the Taylor-Butina clustering
class ClusterResults(object):
    def __init__(self, true_singletons, false_singletons, clusters):
        self.true_singletons = true_singletons
        self.false_singletons = false_singletons
        self.clusters = clusters

# The clustering implementation
def taylor_butina_cluster(similarity_table):
    # Sort the rows so that fingerprints with more hits come first;
    # those are the most likely cluster centroids (hence the
    # reverse=True sort). Break ties arbitrarily by fingerprint
    # index; since the arena orders fingerprints by popcount, this
    # tends to put larger structures first.

    centroid_table = sorted(((len(indices), i, indices)
                                 for (i,indices) in enumerate(similarity_table.iter_indices())),
                            reverse=True)

    # Apply the leader algorithm: walk the candidates in order and
    # classify each fingerprint as a true singleton, a false
    # singleton, or a cluster centroid.
    true_singletons = []
    false_singletons = []
    clusters = []

    seen = set()
    for (size, fp_idx, members) in centroid_table:
        if fp_idx in seen:
            # Can't use a centroid which is already assigned
            continue
        seen.add(fp_idx)

        if size == 0:
            # No other fingerprint is within the threshold;
            # this is a true singleton.
            true_singletons.append(fp_idx)
            continue

        # Figure out which ones haven't yet been assigned
        unassigned = set(members) - seen

        if not unassigned:
            # All of its neighbors were already claimed by other
            # clusters; this is a false singleton.
            false_singletons.append(fp_idx)
            continue

        # This is a new cluster.
        clusters.append( (fp_idx, unassigned) )
        seen.update(unassigned)

    # Return the results:
    return ClusterResults(true_singletons, false_singletons, clusters)

def report_cluster_results(cluster_results, arena, outfile):
    true_singletons = cluster_results.true_singletons
    false_singletons = cluster_results.false_singletons
    clusters = cluster_results.clusters

    # Sort the singletons by id.
    print(len(true_singletons), "true singletons", file=outfile)
    print("=>", " ".join(sorted(arena.ids[idx] for idx in true_singletons)),
          file=outfile)
    print(file=outfile)

    print(len(false_singletons), "false singletons", file=outfile)
    print("=>", " ".join(sorted(arena.ids[idx] for idx in false_singletons)),
          file=outfile)
    print(file=outfile)

    # Sort so the cluster with the most compounds comes first,
    # then by alphabetically smallest id
    def cluster_sort_key(cluster):
        centroid_idx, members = cluster
        return -len(members), arena.ids[centroid_idx]

    clusters.sort(key=cluster_sort_key)

    print(len(clusters), "clusters", file=outfile)
    for centroid_idx, members in clusters:
        print(arena.ids[centroid_idx], "has", len(members), "other members", file=outfile)
        print("=>", " ".join(arena.ids[idx] for idx in members), file=outfile)


#### Command-line driver

p = parser = argparse.ArgumentParser(
    description="An implementation of the Taylor-Butina clustering algorithm using chemfp")
p.add_argument("--threshold", "-t", type=float, default=0.8,
               help="threshold similarity (default: 0.8)")
p.add_argument("--output", "-o", metavar="FILENAME",
               help="output filename (default: stdout)")
p.add_argument("--profile", action="store_true",
               help="report time and memory use")
p.add_argument("--version", action="version",
               version="spam %(prog)s " + __version__ + " (using chemfp " + chemfp.__version__ + ")")
p.add_argument("fingerprint_filename", metavar="FILENAME")

# Turn the --output option into a file object and close function.
def _close_nothing():
    pass

def open_output(parser, filename):
    # Open the named file, or fall back to stdout when filename is None.
    if filename is None:
        return sys.stdout, _close_nothing
    try:
        outfile = open(filename, "w")
    except IOError as err:
        parser.error("Cannot open --output file: %s" % (err,))
    return outfile, outfile.close

## 

def main(args=None):
    args = parser.parse_args(args)

    if args.profile and psutil is None:
        sys.stderr.write("WARNING: Must install the 'psutil' module to see memory statistics.\n")

    # Load the fingerprints
    start_stats = get_profile_stats()
    try:
        arena = chemfp.load_fingerprints(args.fingerprint_filename)
    except IOError as err:
        sys.stderr.write("Cannot open fingerprint file: %s" % (err,))
        raise SystemExit(2)

    # Make sure I can generate output before doing the heavy calculations
    outfile, outfile_close = open_output(parser, args.output)

    try:
        load_stats = get_profile_stats()

        # Generate the NxN similarity matrix for the given threshold
        similarity_table = search.threshold_tanimoto_search_symmetric(
            arena, threshold = args.threshold)
        similarity_stats = get_profile_stats()

        # Do the clustering
        cluster_results = taylor_butina_cluster(similarity_table)
        cluster_stats = get_profile_stats()

        # Report the results
        report_cluster_results(cluster_results, arena, outfile)

        # Report the time and memory use.
        if args.profile:
            print("#fingerprints:", len(arena), "#bits/fp:", arena.num_bits, "threshold:", args.threshold,
                  "#matches:", similarity_table.count_all(), file=sys.stderr)
            profile_report("Load", start_stats, load_stats)
            profile_report("Similarity", load_stats, similarity_stats)
            profile_report("Clustering", similarity_stats, cluster_stats)
            profile_report("Total", start_stats, get_profile_time())
    finally:
        outfile_close()

if __name__ == "__main__":
    main()
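The same pipeline can also be driven from an interactive Python session using the chemfp calls the script itself makes. A minimal sketch, assuming the script above is saved as taylor_butina.py in the current directory:

from __future__ import print_function
# Minimal interactive sketch: load the fingerprints, build the NxN
# similarity table at the threshold, then cluster with the function
# defined in taylor_butina.py above.
import chemfp
from chemfp import search
from taylor_butina import taylor_butina_cluster

arena = chemfp.load_fingerprints("test.fps")
similarity_table = search.threshold_tanimoto_search_symmetric(
    arena, threshold=0.6)
results = taylor_butina_cluster(similarity_table)
print(len(results.clusters), "clusters,",
      len(results.true_singletons), "true singletons,",
      len(results.false_singletons), "false singletons")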
