The PDB Module in Biopython: Parsing and Manipulating Protein 3D Structure Data

Article link: https://blog.youkuaiyun.com/gitblog_00993/article/details/148524124

biopython: Official git repository for Biopython (originally converted from CVS). Project address: https://gitcode.com/gh_mirrors/bi/biopython

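Overview

Biopython's PDB module (Bio.PDB) is a toolkit for working with the crystal structures of biological macromolecules. The PDB file format is no longer the standard: since 2014, PDBx/mmCIF has been the standard format of the PDB archive. This tutorial shows how to use Bio.PDB to read, parse, and manipulate the various structure file formats.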

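File format support

Biopython's PDB module supports several structure file formats:

PDB - the legacy format, which is no longer being updated
mmCIF - the current standard format
BinaryCIF - a binary encoding of mmCIF
MMTF - a compact binary format
PQR - a PDB variant that carries atomic charge and radius information
PDBML - the XML representation of PDB data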

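Reading structure files

Reading mmCIF files

mmCIF is the current standard format of the PDB archive, and Biopython offers two ways to parse it. The first builds a Structure object: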

from Bio.PDB.MMCIFParser import MMCIFParser
parser = MMCIFParser()
structure = parser.get_structure("1fat", "1fat.cif")

Alternatively, use the dictionary interface to access the raw mmCIF tags directly:

from Bio.PDB.MMCIF2Dict import MMCIF2Dict
mmcif_dict = MMCIF2Dict("1fat.cif")
# Each value is a list of strings, even for single-valued tags
solvent_content = mmcif_dict["_exptl_crystal.density_percent_sol"]
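
To see the dictionary interface without downloading anything, here is a minimal sketch that feeds a tiny hand-written mmCIF fragment to MMCIF2Dict through an in-memory handle. The data block name `demo` and all values are invented for the example. It assumes a recent Biopython release, where single items and loop columns both come back as lists of strings, so numbers have to be converted explicitly.

```python
from io import StringIO

from Bio.PDB.MMCIF2Dict import MMCIF2Dict

# Hand-written mmCIF fragment; the block name and the values are invented.
MMCIF_TEXT = """data_demo
_exptl_crystal.density_percent_sol 46.1
loop_
_atom_type.symbol
C
N
O
"""

# MMCIF2Dict accepts an open handle as well as a file name.
mmcif_dict = MMCIF2Dict(StringIO(MMCIF_TEXT))

# A single-valued tag is still a one-element list of strings.
solvent_content = float(mmcif_dict["_exptl_crystal.density_percent_sol"][0])

# A loop column is a list with one string per row.
elements = mmcif_dict["_atom_type.symbol"]

print(solvent_content)
print(elements)
```

The same pattern applies to a real file: pass its path instead of the StringIO object.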

Reading PDB files

Although PDB is no longer the standard format, Biopython still parses it:

from Bio.PDB.PDBParser import PDBParser
parser = PDBParser(PERMISSIVE=1)  # PERMISSIVE mode tolerates common file problems instead of raising exceptions
structure = parser.get_structure("1fat", "pdb1fat.ent")

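Reading PQR files

PQR files carry atomic charge and radius information:

from Bio.PDB.PDBParser import PDBParser
# is_pqr is an option of the parser itself; get_structure() takes only an id and a file
pqr_parser = PDBParser(PERMISSIVE=1, is_pqr=True)
structure = pqr_parser.get_structure("1fat", "1fat.pqr")  # "1fat.pqr" is a placeholder PQR file name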

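Structure representation: the SMCRA architecture

Biopython represents structure data with the SMCRA (Structure/Model/Chain/Residue/Atom) architecture:

Structure: the top-level object, containing one or more Models
Model: crystal structures usually have a single Model, NMR structures have several
Chain: a molecular chain, such as chain A or chain B of a protein
Residue: an amino acid or nucleotide residue
Atom: the atom-level data

This hierarchy can be traversed with simple Python indexing: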

# Get the first Model
first_model = structure[0]

# Get chain A
chain_A = first_model["A"]

# Get residue number 100
residue_100 = chain_A[100]

# Get the CA atom
ca_atom = residue_100["CA"]
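
To make the hierarchy concrete, the sketch below assembles a tiny structure by hand from the Bio.PDB entity classes and then walks it level by level. The structure id `demo`, the two-residue chain, and the coordinates are all invented for illustration; in practice a structure normally comes from one of the parsers above.

```python
import numpy as np

from Bio.PDB.Atom import Atom
from Bio.PDB.Chain import Chain
from Bio.PDB.Model import Model
from Bio.PDB.Residue import Residue
from Bio.PDB.Structure import Structure

# Build Structure -> Model -> Chain top-down (all ids are made up).
structure = Structure("demo")
first_model = Model(0)
chain_a = Chain("A")
structure.add(first_model)
first_model.add(chain_a)

# Two residues, each with a single CA atom, placed 3.8 Å apart along x.
for resseq, resname, x in [(1, "GLY", 0.0), (2, "ALA", 3.8)]:
    residue = Residue((" ", resseq, " "), resname, "")
    coord = np.array([x, 0.0, 0.0], dtype="f")
    # Arguments: name, coord, B-factor, occupancy, altloc, full name,
    # serial number, element.
    residue.add(Atom("CA", coord, 20.0, 1.0, " ", " CA ", resseq, "C"))
    chain_a.add(residue)

# Every level is iterable, so nested loops visit the whole hierarchy.
for model in structure:
    for chain in model:
        for residue in chain:
            for atom in residue:
                print(model.id, chain.id, residue.get_resname(), atom.get_name())

# Subtracting two Atom objects gives the distance between them.
distance = float(structure[0]["A"][1]["CA"] - structure[0]["A"][2]["CA"])
print(round(distance, 2))
```

Indexing a chain with a plain integer, as in `structure[0]["A"][1]`, is shorthand for the full residue id `(" ", 1, " ")`.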

Writing structure files

Biopython can write structures in several formats:

Writing mmCIF files

from Bio.PDB import MMCIFIO
io = MMCIFIO()
io.set_structure(structure)
io.save("out.cif")

Writing PDB files

from Bio.PDB import PDBIO
io = PDBIO()
io.set_structure(structure)
io.save("out.pdb")

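A Select subclass can be used to write only part of the structure:

from Bio.PDB import PDBIO, Select

class GlySelect(Select):
    # Keep only glycine residues
    def accept_residue(self, residue):
        return residue.get_resname() == "GLY"

io = PDBIO()
io.set_structure(structure)
io.save("gly_only.pdb", GlySelect())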
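
The selection mechanism can be checked end to end without touching the disk. The following is a sketch under stated assumptions: the three-residue structure is built by hand with invented coordinates, `ResnameSelect` is a hypothetical helper that generalizes the glycine filter, and the output goes to an in-memory buffer that is then parsed back.

```python
from io import StringIO

import numpy as np

from Bio.PDB import PDBIO, PDBParser, Select
from Bio.PDB.Atom import Atom
from Bio.PDB.Chain import Chain
from Bio.PDB.Model import Model
from Bio.PDB.Residue import Residue
from Bio.PDB.Structure import Structure


class ResnameSelect(Select):
    """Keep only residues with the given name (illustrative helper)."""

    def __init__(self, resname):
        self.resname = resname

    def accept_residue(self, residue):
        return residue.get_resname() == self.resname


# Tiny hand-built structure with invented coordinates: GLY 1, ALA 2, GLY 3.
structure = Structure("demo")
model = Model(0)
chain = Chain("A")
structure.add(model)
model.add(chain)
for resseq, resname in [(1, "GLY"), (2, "ALA"), (3, "GLY")]:
    residue = Residue((" ", resseq, " "), resname, "")
    coord = np.array([3.8 * resseq, 0.0, 0.0], dtype="f")
    residue.add(Atom("CA", coord, 20.0, 1.0, " ", " CA ", resseq, "C"))
    chain.add(residue)

# PDBIO.save accepts an open handle as well as a file name.
buffer = StringIO()
writer = PDBIO()
writer.set_structure(structure)
writer.save(buffer, ResnameSelect("GLY"))

# Parse the written text back to confirm which residues were kept.
buffer.seek(0)
filtered = PDBParser(QUIET=True).get_structure("filtered", buffer)
kept = [(res.get_id()[1], res.get_resname()) for res in filtered.get_residues()]
print(kept)
```

Select also offers accept_model, accept_chain, and accept_atom, so the same approach works at any level of the hierarchy.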

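Handling special structural issues

Disordered atoms and residues

Biopython handles disorder with the DisorderedAtom and DisorderedResidue classes. These wrappers hide the complexity by forwarding method calls to the currently selected alternative, so disordered atoms and residues can be used like ordinary ones.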
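
A short sketch of how this looks in practice: the code below generates three ATOM records for one serine whose OG atom has two alternate locations, parses them from memory, and switches between the alternatives. The `atom_record` helper and all coordinates and occupancies are invented for the example; only the parser and the DisorderedAtom methods are Biopython's own.

```python
from io import StringIO

from Bio.PDB import PDBParser


def atom_record(serial, name, altloc, resname, resseq, xyz, occupancy, element):
    """Format one fixed-column ATOM record for chain A (demo helper)."""
    x, y, z = xyz
    b_factor = 20.0
    return (
        f"{'ATOM':<6}{serial:5d} {name:^4s}{altloc}{resname:>3s} A{resseq:4d}{'':4}"
        f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{b_factor:6.2f}{'':10}{element:>2s}"
    )


# One serine whose OG atom has alternate locations A and B (invented data).
pdb_text = "\n".join(
    [
        atom_record(1, "CA", " ", "SER", 1, (0.0, 0.0, 0.0), 1.00, "C"),
        atom_record(2, "OG", "A", "SER", 1, (1.0, 1.0, 0.0), 0.60, "O"),
        atom_record(3, "OG", "B", "SER", 1, (1.0, -1.0, 0.0), 0.40, "O"),
    ]
)

structure = PDBParser(QUIET=True).get_structure("demo", StringIO(pdb_text))
og = structure[0]["A"][1]["OG"]  # a DisorderedAtom wrapping both alternates

disorder_flag = og.is_disordered()        # 2 marks a DisorderedAtom
altloc_ids = og.disordered_get_id_list()  # altloc ids of the alternatives

# Calls are forwarded to the selected alternative, which by default is
# the one with the highest occupancy.
default_altloc = og.get_altloc()

# Switch to the other conformer explicitly.
og.disordered_select("B")
selected_altloc = og.get_altloc()
selected_y = float(og.get_coord()[1])

print(disorder_flag, altloc_ids, default_altloc, selected_altloc, selected_y)
```

Because the default selection follows occupancy, code that ignores disorder silently works with the most populated conformer.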

Missing residues

Missing residues can be checked through the header information:

if structure.header["has_missing_residues"]:
    missing = structure.header["missing_residues"]
    print(f"Number of missing residues: {len(missing)}")
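
Beyond counting, it is often useful to know where the gaps are. The helper below is a sketch that groups the entries by chain. It assumes each entry of `header["missing_residues"]` is a dictionary with `chain`, `res_name`, and `ssseq` keys, which matches the PDB header parser in recent Biopython releases but is worth verifying against the installed version. The function name `missing_by_chain` and the sample data are hypothetical.

```python
from collections import defaultdict


def missing_by_chain(missing_residues):
    """Group missing-residue entries by chain id.

    Assumes each entry is a dict with "chain", "res_name" and "ssseq" keys.
    """
    grouped = defaultdict(list)
    for entry in missing_residues:
        grouped[entry["chain"]].append((entry["ssseq"], entry["res_name"]))
    # Sort each chain's residues by sequence number.
    return {chain: sorted(items) for chain, items in grouped.items()}


# Hand-written stand-in for structure.header["missing_residues"].
example_missing = [
    {"model": None, "res_name": "SER", "chain": "A", "ssseq": 2, "insertion": None},
    {"model": None, "res_name": "MET", "chain": "A", "ssseq": 1, "insertion": None},
    {"model": None, "res_name": "LYS", "chain": "B", "ssseq": 57, "insertion": None},
]

summary = missing_by_chain(example_missing)
for chain_id, residues in summary.items():
    print(chain_id, residues)
```

With a parsed structure, the call would be `missing_by_chain(structure.header["missing_residues"])`.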

Practical tips

Getting the full ID: every entity can report its full ID path

residue.get_full_id()  # returns e.g. ("1abc", 0, "A", (" ", 10, "A"))

Traversing the structure:

# Iterate over all atoms
for atom in structure.get_atoms():
    print(atom)

Parent-child relationships:

parent = atom.get_parent()  # Get the residue the atom belongs to
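
The full ID and the parent links are two views of the same hierarchy. As a minimal sketch, the code below builds a one-atom structure by hand (the ids and coordinates are invented), prints the full IDs, and then climbs from the atom back to the structure with get_parent().

```python
import numpy as np

from Bio.PDB.Atom import Atom
from Bio.PDB.Chain import Chain
from Bio.PDB.Model import Model
from Bio.PDB.Residue import Residue
from Bio.PDB.Structure import Structure

# Minimal hand-built hierarchy (ids and coordinates are invented).
structure = Structure("1abc")
model = Model(0)
chain = Chain("A")
residue = Residue((" ", 10, "A"), "GLY", "")
atom = Atom("CA", np.zeros(3, dtype="f"), 20.0, 1.0, " ", " CA ", 1, "C")
structure.add(model)
model.add(chain)
chain.add(residue)
residue.add(atom)

# The full id joins the ids of every level from the structure downwards.
residue_full_id = residue.get_full_id()
atom_full_id = atom.get_full_id()
print(residue_full_id)
print(atom_full_id)

# Climb from the atom back up to the structure via get_parent().
levels = []
entity = atom
while entity is not None:
    levels.append(entity.get_level())
    entity = entity.get_parent()
print(levels)
```

The level codes are the initials of the SMCRA names, so the loop reports the path from Atom up to Structure.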

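Summary

Biopython's PDB module provides flexible tools for working with protein 3D structure data. This tutorial covered how to read, parse, and manipulate the various structure file formats, and how to use the SMCRA architecture to access and modify structure data. Biopython supports both simple structural analysis and more involved structure manipulation.

For more advanced use, explore the structure alignment, geometry calculation, and structure validation features described in the Biopython documentation.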

Authoring statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.