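This article introduces how to use BeautifulSoup4 and walks through a practical exercise: scraping the titles, abstracts, and PMIDs of medical literature from PubMed.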
01
BeautifulSoup4
Installation and Basic Usage
Installation
pip install beautifulsoup4 -i https://pypi.tuna.tsinghua.edu.cn/simple
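To confirm the installation worked, a quick check like the one below may help. It is only a minimal sketch: it imports the package and prints the version string it reports; the variable name `installed_version` is chosen here for illustration.

```python
# Minimal sanity check after installation: import the package and
# read the version string it exposes.
import bs4

installed_version = bs4.__version__  # e.g. a string such as "4.x.y"
print("beautifulsoup4 version:", installed_version)
```

If the import fails with ModuleNotFoundError, pip most likely installed the package into a different Python environment than the one running the script.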
Usage
You can pass a document to the BeautifulSoup constructor either as a string or as an open file handle.
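# Import the class
from bs4 import BeautifulSoup
# Instantiate from an open file handle; the with-block closes the file afterwards
with open("index.html", encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, "html.parser")
# Instantiate from a string, naming the parser explicitly
soup = BeautifulSoup("<html>data</html>", "html.parser")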
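Once a soup object exists, the parsed tree can be queried right away. The sketch below is a small illustration rather than part of the original walkthrough: the HTML string, the `intro` class name, and the variable names are all made up for the example.

```python
from bs4 import BeautifulSoup

# A made-up document used only for illustration.
html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p class='intro'>Hello</p></body></html>"
)

# Parse the string with Python's built-in parser.
soup = BeautifulSoup(html, "html.parser")

# Read the text of the <title> tag.
title_text = soup.title.string

# Find the first <p> whose class is "intro" and extract its text.
intro_text = soup.find("p", class_="intro").get_text()

print(title_text)
print(intro_text)
```

The same `find` and `get_text` calls are the kind of building blocks a scraper would use to pull titles and abstracts out of a fetched page.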
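If no parser is specified, BeautifulSoup picks the best HTML parser among those installed (it prefers lxml, then html5lib, then Python's built-in html.parser). You can also name the parser yourself through the features argument.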
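Because the automatic choice depends on which parsers happen to be installed, it can be safer to name the parser and fall back explicitly. The following is a sketch under that assumption: `make_soup` and its `preferred` parameter are hypothetical helpers written for this example, not part of the library, while `FeatureNotFound` is the exception bs4 raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: try each named parser in order and
    return the soup together with the parser name that worked."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup, used_parser = make_soup("<p>data</p>")
print("parser used:", used_parser)
print(soup.p.get_text())
```

Since html.parser ships with Python, the fallback should always succeed, and the script reports which parser it ended up with instead of switching silently between machines.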
Source code
def __init__(self, markup="", features=None, builder=None,
             parse_only=None, from_encoding=None, exclude_encodings=None,
             element_classes=None, **kwargs):
    """Constructor.
    :param markup: A string or a file-like object representing
     markup to be parsed.
    :param features: Desirable features of the parser to be
     used. This may be the name of a specific parser ("lxml",
     "lxml-xml", "html.parser", or "html5lib") or it may be the
     type of markup to be used ("html", "html5", "xml"). It's
     recommended that you name a specific parser, so that
     Beautiful Soup gives you the same results across platforms
     and virtual environments.
    :param builder: A TreeBuilder subclass to instantiate (or
     instance to use) instead of looking one up based on
     `features`. You only need to use this if you've implemented a
     custom TreeBuilder.
    :param parse_only: A SoupStrainer. Only parts of the document
     matching the SoupStrainer will be considered. This is useful
     when parsing part of a document that would o