NLP Basics (1): Acquiring and Processing Text Data

Article link: https://blog.youkuaiyun.com/tilblackout/article/details/147259690

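Among the many branches of artificial intelligence, natural language processing (NLP) is one of the closest to everyday life. Intelligent customer service, voice assistants, and recommender systems all rely on NLP behind the scenes. The goal of NLP is to let computers understand, analyze, and even generate human-like language.

Before starting a real NLP project, however, we usually face a very practical question: how do we obtain clean, structured text data? This step looks unremarkable, but it is the foundation of the entire NLP pipeline.

So in this article we first look at the data sources available in practice and then at how to extract text from each of them.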

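1 Data Collection

Commonly used data sources include internal company data (databases, cloud storage) and public data (government websites, encyclopedias); data can also be extracted from web pages with a crawler.

1.1 Collecting Data from PDFs

The PyPDF2 library makes it easy to extract text from a PDF.

This approach does not work for scanned PDFs (pages stored as images).

Install and import the library: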

!pip install PyPDF2
import PyPDF2
from PyPDF2 import PdfReader  # PdfFileReader is deprecated and raises an error in PyPDF2 3.x

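Extract the text:

with open("file.pdf", "rb") as pdf:       # the with block closes the file automatically
    pdf_reader = PyPDF2.PdfReader(pdf)    # PdfFileReader in PyPDF2 releases before 2.0
    print(len(pdf_reader.pages))          # number of pages (formerly numPages)
    page = pdf_reader.pages[0]            # first page (formerly getPage(0))
    print(page.extract_text())            # formerly extractText()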

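1.2 Collecting Data from Word Documents

The python-docx library, which is imported as docx, can read .docx files. First install and import it:

!pip install python-docx
from docx import Document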

Extract the text:

document = Document("file.docx")  # Document accepts a file path directly
# Join paragraphs with newlines so their boundaries are preserved
docu = "\n".join(para.text for para in document.paragraphs)
print(docu)

1.3 Collecting Data from JSON

The json library parses JSON content:

# import requests  # needed only to fetch JSON from a website
import json

Suppose there is a local file named quote.json:

{
  "contents": {
    "quotes": [
      {
        "quote": "Where there is ruin, there is hope for a treasure.",
        "length": "50",
        "author": "Rumi",
        "tags": [
          "failure",
          "inspire",
          "learning-from-failure"
        ],
        "category": "inspire",
        "date": "2018-09-29",
        "title": "Inspiring Quote of the day",
        "id": "dPKsui4sQnQqgMnXHLKtfweF"
      }
    ]
  }
}

Read the JSON and extract its contents:

import json

# Read the local JSON file
with open('quote.json', 'r', encoding='utf-8') as f:
    res = json.load(f)

# Print the pretty-printed contents
print(json.dumps(res, indent=4))

# Extract the quote and its author
q = res['contents']['quotes'][0]
print(q['quote'], '\n--', q['author'])
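
Real-world JSON often lacks some of the fields we expect, in which case direct indexing such as res['contents']['quotes'][0] raises KeyError or IndexError. Below is a minimal sketch of a more defensive version, assuming the same structure as quote.json. The helper name first_quote and the inline sample string are made up for illustration.

```python
import json

# Inline sample with the same shape as quote.json (trimmed for brevity)
raw = ('{"contents": {"quotes": [{"quote": '
       '"Where there is ruin, there is hope for a treasure.", "author": "Rumi"}]}}')

def first_quote(data):
    """Return (quote, author) of the first entry, or None if the structure is missing."""
    quotes = data.get("contents", {}).get("quotes", [])
    if not quotes:
        return None
    q = quotes[0]
    return q.get("quote"), q.get("author", "unknown")

res = json.loads(raw)        # json.loads parses a string; json.load reads a file object
print(first_quote(res))
print(first_quote({}))       # None, because there is no "contents" key
```

Whether to return None, raise an error, or fall back to a default depends on how the data is consumed later; the sketch only shows the pattern.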

1.4 Collecting Data from HTML Pages

To parse the contents of a web page (an HTML page), we can use the bs4 (BeautifulSoup) library.

Install and import the library:

!pip install bs4
import urllib.request as urllib2
from bs4 import BeautifulSoup

Next, fetch the page's HTML:

response = urllib2.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')
html_doc = response.read()

Parse the HTML:

soup = BeautifulSoup(html_doc, 'html.parser')
strhtm = soup.prettify()
print(strhtm[:1000])

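Extract the contents of specific tags:

print(soup.title)
print(soup.title.string)
# Plain text of the first <a> tag and the first <b> tag on the page
# (.string is None when the tag contains more than one child node)
print(soup.a.string)
print(soup.b.string)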

Extract every instance of a given tag:

for x in soup.find_all('a'):
    print(x.string)

Extract the text of every paragraph (p tag) on the page:

for x in soup.find_all('p'):
    print(x.text)

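1.5 Parsing Text with Regular Expressions

Regular expressions are another way to parse text data; in Python they are provided by the re library.

Commonly used flags:

re.I: case-insensitive matching
re.L: locale-dependent matching (bytes patterns only)
re.M: multi-line matching (^ and $ match at every line)
re.S: makes . match any character, including a newline
re.U: Unicode matching (already the default for str patterns in Python 3)
re.X: verbose mode, which allows whitespace and comments inside the pattern for readability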
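
To make the flags concrete, here is a small sketch that compares matching with and without each of the commonly used ones. The sample text and variable names are invented for illustration.

```python
import re

text = "Hello NLP\nhello world"

# re.I: case-insensitive, so both "Hello" and "hello" match
ci = re.findall(r"hello", text, flags=re.I)
print(ci)                                   # ['Hello', 'hello']

# re.M: ^ matches at the start of every line, not only at the start of the string
first_only = re.findall(r"^\w+", text)
every_line = re.findall(r"^\w+", text, flags=re.M)
print(first_only, every_line)               # ['Hello'] ['Hello', 'hello']

# re.S: "." also matches the newline, so the match can span both lines
no_dotall = re.search(r"NLP.hello", text)
dotall = re.search(r"NLP.hello", text, flags=re.S)
print(no_dotall is None, dotall is not None)  # True True

# re.X: whitespace and comments inside the pattern are ignored
date = re.compile(r"""
    (\d{4})   # year
    -(\d{2})  # month
    -(\d{2})  # day
""", flags=re.X)
parts = date.search("published 2018-09-29").groups()
print(parts)                                # ('2018', '09', '29')
```

Flags can be combined with the | operator, for example flags=re.I | re.M.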

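Common regex syntax:

[ab]: matches a single character, a or b
[^ab]: matches any character except a and b
[a-z]: matches any character from a to z
[^a-z]: matches any character outside a to z
[a-zA-Z]: matches any upper- or lowercase letter
.: matches any single character (except a newline, unless re.S is set)
\s: matches a whitespace character
\S: matches a non-whitespace character
\d: matches a digit
\D: matches a non-digit
\w: matches a word character (letter, digit, or underscore)
\W: matches a non-word character
(a|b): matches a or b
a?: matches a zero or one time
a*: matches a zero or more times
a+: matches a one or more times
a{3}: matches a exactly 3 times
a{3,}: matches a 3 or more times
a{3,6}: matches a 3 to 6 times
^: matches the start of the string
$: matches the end of the string
\b: matches a word boundary
\B: matches a non-word boundary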
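
A few of these constructs in action, as a minimal sketch. The sample sentence is made up, and the patterns are chosen only to illustrate the syntax, not as production-grade extractors.

```python
import re

s = "Order 66 shipped on 2018-09-29; call 555-1234 or email bob_smith@example.com"

digits = re.findall(r"\d+", s)             # + : one or more digits
print(digits)                              # ['66', '2018', '09', '29', '555', '1234']

four = re.findall(r"\b\d{4}\b", s)         # {4} with \b: exactly four digits as a whole word
print(four)                                # ['2018', '1234']

caps = re.findall(r"[A-Z][a-z]+", s)       # character classes: capital then lowercase letters
print(caps)                                # ['Order']

verbs = re.findall(r"(call|email)", s)     # alternation
print(verbs)                               # ['call', 'email']

spellings = re.findall(r"colou?r", "color colour colouur")  # ? : the "u" is optional
print(spellings)                           # ['color', 'colour']
```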

re.match() vs re.search()

re.match(): matches only at the beginning of the string  
re.search(): scans the whole string for the first match
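
The difference is easiest to see on a single string. A short sketch with an invented example sentence:

```python
import re

s = "NLP is fun; I like NLP"

m1 = re.match(r"NLP", s)    # succeeds: the string starts with "NLP"
m2 = re.match(r"fun", s)    # None: "fun" is not at position 0
m3 = re.search(r"fun", s)   # succeeds: search scans forward until it finds "fun"

print(m1.span())            # (0, 3)
print(m2)                   # None
print(m3.span())            # (7, 10)
```

Both functions return None when nothing matches, so check the result before calling methods such as .span() or .group() on it.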

Here are a few regular expression examples:

(1) Tokenizing

import re
re.split(r'\s+', 'I like this book.')  # raw string avoids the invalid-escape warning for \s
# Output: ['I', 'like', 'this', 'book.']
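
Splitting on whitespace leaves punctuation attached to the neighboring word ('book.' in the output above). One possible refinement, shown here as a sketch rather than a full tokenizer, is to use re.findall with a pattern that treats runs of word characters and single punctuation marks as separate tokens:

```python
import re

sentence = "I like this book."

# \w+ matches a run of word characters; [^\w\s] matches one punctuation character
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)   # ['I', 'like', 'this', 'book', '.']

# For plain whitespace splitting, str.split() with no argument also
# ignores leading, trailing, and repeated whitespace
simple = "  I like  this book. ".split()
print(simple)   # ['I', 'like', 'this', 'book.']
```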

(2) Extracting email addresses

doc = "For more details please mail us at: xyz@abc.com, pqr@mno.com"
addresses = re.findall(r'[\w\.-]+@[\w\.-]+', doc)
for address in addresses:
    print(address)
# Output:
# xyz@abc.com
# pqr@mno.com

(3) Replacing email addresses

doc = "For more details please mail us at xyz@abc.com"
new_email_address = re.sub(r'([\w\.-]+)@([\w\.-]+)', r'pqr@mno.com', doc)
print(new_email_address)
# Output:
# For more details please mail us at pqr@mno.com
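
The pattern in example (3) defines two capture groups, but the replacement string does not use them. As a small sketch of what the groups make possible, the replacement below refers back to the second group (the domain) with \2 and masks only the user name. The masking format is an arbitrary choice for illustration.

```python
import re

doc = "For more details please mail us at xyz@abc.com"

# \1 is the user name, \2 is the domain; keep the domain and hide the user name
masked = re.sub(r"([\w.-]+)@([\w.-]+)", r"***@\2", doc)
print(masked)   # For more details please mail us at ***@abc.com
```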

(4) Finding and replacing text

import re

text = """
*** START OF THIS PROJECT ***
This is a sample text document created for testing regex operations.
It includes numbers like 123, special symbols like % and $, and some repeated patterns.
We will do a search and replace operation using regular expressions.
End of the sample.
"""

# Use re.search to find where the start marker ends
start = re.search(r"\*\*\* START OF THIS PROJECT \*\*\*", text).end()
# Keep only the text after the marker
content = text[start:]

# Clean with re.sub: keep only letters, digits, and periods, then lowercase
def preprocess(sentence):
    return re.sub('[^A-Za-z0-9.]+', ' ', sentence).lower()

# Process the text
processed_content = preprocess(content)

# Print the results
print("Original content:\n", content)
print("\nProcessed content:\n", processed_content)

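1.6 Analyzing Text

# Assumes `book` holds the raw text of a document and
# processed_book = preprocess(book), using preprocess from example (4)

# Count occurrences of "the" (substring match, so "there" also counts)
len(re.findall(r'the', processed_book))  # Output: 302 (depends on the text)

# Replace ' i ' with ' I '
processed_book = re.sub(r'\si\s', " I ", processed_book)

# Find strings of the form abc--xyz
re.findall(r'[a-zA-Z0-9]*--[a-zA-Z0-9]*', book)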
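
Because these snippets rely on a text that is not included here, the following self-contained sketch applies the same kind of analysis to a short invented sample. It also contrasts a substring count with a whole-word count and uses collections.Counter for word frequencies, which is one common choice but not the only one.

```python
import re
from collections import Counter

sample = "the cat sat on the mat. there is the other cat."

# Substring count: also counts "the" inside "there" and "other"
substring_hits = len(re.findall(r"the", sample))
print(substring_hits)        # 5

# Whole-word count: \b restricts matches to the standalone word "the"
word_hits = len(re.findall(r"\bthe\b", sample))
print(word_hits)             # 3

# Word frequencies over lowercase alphanumeric tokens
words = re.findall(r"[a-z0-9]+", sample)
top = Counter(words).most_common(2)
print(top)                   # [('the', 3), ('cat', 2)]
```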

1.7 String Handling

Common string operations:

s.find(t)
s.rfind(t)
s.index(t)
s.rindex(t)
s.join(text)
s.split(t)
s.splitlines()
s.lower()
s.upper()
s.title()
s.strip()
s.replace(t, u)
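
The sketch below exercises most of the methods in this list on one invented string, mainly to show how find and index differ when the substring is missing.

```python
s = "  Natural Language Processing  "
t = s.strip()                          # remove leading and trailing whitespace

print(t.find("Language"))              # 8: index of the first match
print(t.rfind("a"))                    # 13: index of the last "a"
print(t.find("xyz"))                   # -1: find returns -1 when nothing matches

try:
    t.index("xyz")                     # index raises instead of returning -1
except ValueError as err:
    print("index failed:", err)

print(t.lower(), "|", t.upper(), "|", "nlp basics".title())
words = t.split(" ")                   # ['Natural', 'Language', 'Processing']
joined = "-".join(words)               # the separator string joins the list items
print(joined)                          # Natural-Language-Processing
lines = "line one\nline two".splitlines()
print(lines)                           # ['line one', 'line two']
print(t.replace("Natural", "Neural"))  # Neural Language Processing
```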

Indexing, slicing, and replacing:

String_v1 = "I am exploring NLP"
print(String_v1[0])           # Output: I
print(String_v1[5:14])        # Output: exploring

String_v2 = String_v1.replace("exploring", "learning")
print(String_v2)              # Output: I am learning NLP

Concatenating two strings:

s1 = "nlp"
s2 = "machine learning"
s3 = s1 + s2
print(s3)  # Output: nlpmachine learning

Finding the index of a substring:

var = "I am learning NLP"
f = "learn"
print(var.find(f))  # Output: 5

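1.8 Web Scraping

Before scraping a site, read its terms of use and confirm that scraping is permitted. Web scraping means extracting data from web pages at scale and storing it locally or in a database, for example to analyze information about users or products.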
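
Besides the terms of use, many sites publish a robots.txt file that states which paths crawlers may visit. It is not a substitute for reading the terms, but checking it is a reasonable first step. Below is a minimal sketch using the standard library's urllib.robotparser. The robots.txt content and the crawler name my-crawler are hypothetical and are inlined so that the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (a real one lives at https://<site>/robots.txt)
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())   # parse rules from a list of lines

allowed = rp.can_fetch("my-crawler", "https://example.com/articles/1")
blocked = rp.can_fetch("my-crawler", "https://example.com/private/data")
print(allowed)   # True
print(blocked)   # False
```

Against a live site, rp.set_url("https://example.com/robots.txt") followed by rp.read() fetches and parses the file instead of the inline string.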

First install the dependencies:

!pip install bs4
!pip install requests

Import the libraries:

import requests
from bs4 import BeautifulSoup

Send the request and parse the page:

url = 'https://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

Extract the data we need:

for p in soup.find_all('p'):
    print(p.text)

Depending on the page structure, we can locate the content to extract by tag name (such as div or span), class name (class_="xxx"), or ID (id="xxx").
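
As a sketch of these three ways of locating content, the example below parses a small inline HTML fragment instead of a live page. The tag names, the class names title and post, and the id main are all invented for illustration; a real page needs selectors that match its own structure.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, inlined so that no network access is needed
html = """
<div id="main">
  <h1 class="title">NLP Basics</h1>
  <div class="post"><span>First post</span></div>
  <div class="post"><span>Second post</span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# By tag name and class (class_ has a trailing underscore because class is a Python keyword)
title = soup.find("h1", class_="title").get_text()
posts = [d.get_text(strip=True) for d in soup.find_all("div", class_="post")]

# By id
main = soup.find(id="main")

# The same lookup written as a CSS selector
spans = [s.get_text() for s in soup.select("#main .post span")]

print(title)       # NLP Basics
print(posts)       # ['First post', 'Second post']
print(main.name)   # div
print(spans)       # ['First post', 'Second post']
```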

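2 Summary

Extracting text data is the first step of natural language processing, and a critical one. Only once we can efficiently obtain data from sources such as local files and web pages can the later cleaning, modeling, and analysis work proceed smoothly.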