Python 3
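Extracting Text Features with the Bag-of-Words (BoW) Model and Word2Vec
Experiment Objectives
Understand the concepts behind the bag-of-words (BoW) model and Word2Vec, and how they are applied in natural language processing.
Learn how to implement the bag-of-words model and Word2Vec in Python.
Deepen your understanding of text feature extraction and similarity computation through hands-on experiments.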
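Experiment Content
Data preparation:
Provide a simple text dataset, for example a few short passages of English text.
Task breakdown:
Task 1: Preprocess the text data (tokenize and remove punctuation).
Task 2: Implement the bag-of-words (BoW) model to extract text features, then compute text similarity.
Task 3: Use Word2Vec to extract text features, then compute text similarity.
Task 4: Compare the similarity results produced by the two methods.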
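Tasks 1 and 2 can be sketched with the standard library alone. The block below is a minimal illustration, not a reference solution: the three sample sentences and the helper names (`preprocess`, `build_vocab`, `bow_vector`, `cosine_similarity`) are assumptions made for this example, and cosine similarity is assumed as the similarity measure because the task description does not name one.

```python
import math
import re
from collections import Counter

# Hypothetical sample texts; the exercise only asks for a few short English passages.
docs = [
    "The cat sat on the mat.",
    "A cat was sitting on the mat!",
    "Stock prices fell sharply today.",
]

def preprocess(text):
    """Task 1: lowercase the text, drop punctuation, and split into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocab(token_lists):
    """Map each distinct token to a column index (sorted for reproducibility)."""
    words = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate(words)}

def bow_vector(tokens, vocab):
    """Task 2: count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    # Dicts preserve insertion order, so positions follow the vocabulary indices.
    return [counts.get(tok, 0) for tok in vocab]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

tokens = [preprocess(d) for d in docs]
vocab = build_vocab(tokens)
vectors = [bow_vector(t, vocab) for t in tokens]

sim_01 = cosine_similarity(vectors[0], vectors[1])  # two sentences about a cat
sim_02 = cosine_similarity(vectors[0], vectors[2])  # unrelated sentences
print(f"doc0 vs doc1: {sim_01:.3f}")
print(f"doc0 vs doc2: {sim_02:.3f}")
```

Because BoW vectors only count exact word matches, two documents that share no tokens get a similarity of exactly zero, no matter how related their meanings are.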
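Experiment report:
Document the experiment process, including the code implementation and an analysis of the results.
Analyze how the choice of method affects the computed text similarity.
Summarize the results and discuss how the bag-of-words model and Word2Vec are used for text feature extraction.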
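For Tasks 3 and 4, one common way to turn word vectors into a document vector is to average them; the sketch below assumes that approach, which the assignment does not prescribe. `train_word2vec` uses the gensim 4.x `Word2Vec` class and is only defined here, not called, since it needs gensim to be installed. The hyperparameters, the helper names, and the 2-dimensional `toy_wv` vectors are all assumptions made for illustration; the toy vectors are invented, not trained results.

```python
import math

def train_word2vec(tokenized_docs, vector_size=50, seed=42):
    """Train a small Word2Vec model with gensim 4.x (imported lazily)."""
    from gensim.models import Word2Vec
    model = Word2Vec(
        sentences=tokenized_docs,
        vector_size=vector_size,
        window=5,
        min_count=1,  # keep every word, since the corpus is tiny
        workers=1,    # a single worker makes runs easier to reproduce
        seed=seed,
    )
    return model.wv  # KeyedVectors: supports `word in wv` and `wv[word]`

def doc_vector(tokens, wv, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [wv[t] for t in tokens if t in wv]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Invented 2-dimensional vectors, used only to show the mechanics.
toy_wv = {"cat": [1.0, 0.0], "kitten": [0.9, 0.1], "stock": [0.0, 1.0]}

vec_cat = doc_vector(["cat"], toy_wv, 2)
vec_kitten = doc_vector(["kitten"], toy_wv, 2)
vec_stock = doc_vector(["stock"], toy_wv, 2)

print("cat vs kitten:", round(cosine_similarity(vec_cat, vec_kitten), 3))
print("cat vs stock: ", round(cosine_similarity(vec_cat, vec_stock), 3))
```

With a real model, `doc_vector(tokens, train_word2vec(tokenized_docs), 50)` would replace the toy dictionary. The point of the comparison is that dense vectors can give a nonzero similarity to documents with no words in common, whereas count-based BoW vectors cannot. On a corpus of only a few sentences, trained Word2Vec vectors are likely to be noisy, so the results should be interpreted with care.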
!pip install gensim -t /home/aistudio/external-libraries
ERROR: Can not combine '--user' and '--target'
Run time: 17.739 s. End time: 2025-11-20 22:41:34
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple gensim --no-user -t /home/aistudio/external-libraries
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple, https://pypi.tuna.tsinghua.edu.cn/simple/, https://pypi.mirrors.ustc.edu.cn/simple/, https://mirrors.aliyun.com/pypi/simple/, https://pypi.org/simple/
Collecting gensim
Downloading gensim-4.4.0-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.5 kB)
Collecting numpy>=1.18.5 (from gensim)
Downloading numpy-2.2.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Collecting scipy>=1.7.0 (from gensim)
Downloading scipy-1.15.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting smart_open>=1.8.1 (from gensim)
Downloading smart_open-7.5.0-py3-none-any.whl.metadata (24 kB)
Collecting wrapt (from smart_open>=1.8.1->gensim)
Downloading wrapt-2.0.1-cp310-cp310-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl.metadata (9.0 kB)
Downloading gensim-4.4.0-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.6 MB)
━━━━━━━━━━━━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━ 12.8/27.6 MB 14.9 kB/s eta 0:16:31
ERROR: Exception:
Traceback (most recent call last):
File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 438, in _error_catcher
yield
File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 561, in read
data = self._fp_read(amt) if not fp_closed else b""
File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 527, in _fp_read
return self._fp.read(amt) if amt is not None else self._fp.read()
File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/pip/_vendor/cachecontrol/filewrapper.py", line 98, in read
data: bytes = self.__fp.read(amt)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/http/client.py", line 465, in read
s = self.fp.read(amt)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/socket.py", line 705, in readinto
return self._sock.recv_into(b)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/ssl.py", line 1274, in recv_into
return self.read(nbytes, buffer)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/ssl.py", line 1130, in read
return self._sslobj.read(len, buffer)
TimeoutError: The read operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/pip/_internal/cli/base_command.py", line 105, in _run_wrapper
status = _inner_run()
File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/pip/_internal/cli/base_command.py", line 96, in _inner_run
return self.run(options, args)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/pip/_internal/cli/req_command.py", line 67, in wrapper
return func(self, options, args)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/pip/_internal/commands/install.py", line 379, in run
requirement_set = resolver.resolve(
File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/resolver.py", line 179, in resolve
self.factory.preparer.prepare_linked_requirements_more(reqs)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 554, in prepare_linked_requirements_more
self._complete_partial_requirements(
File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 469, in _complete_partial_requirements
for link, (filepath, _) in batch_download:
File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/pip/_internal/network/download.py", line 184, in __call__
for chunk in chunks:
File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/pip/_internal/cli/progress_bars.py", line 55, in _rich_progress_bar
for chunk in iterable:
File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/pip/_internal/network/utils.py", line 65, in response_chunks
for chunk in response.raw.stream(
File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 622, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 560, in read
with self._error_catcher():
File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/pip/_vendor/urllib3/response.py", line 443, in _error_catcher
raise ReadTimeoutError(self._pool, None, "Read timed out.")
pip._vendor.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.
Note: you may need to restart the kernel to use updated packages.
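The log above ends in a read timeout while downloading the gensim wheel from files.pythonhosted.org. One thing that may help is retrying with a longer timeout and more retries; whether it succeeds depends on the network. The sketch below only builds such a command and makes the target directory importable, since packages installed with `-t` are not on `sys.path` by default. The helper names and the values 120 and 10 are assumptions chosen for illustration; `--default-timeout` and `--retries` are standard pip options.

```python
import importlib.util
import sys

TARGET_DIR = "/home/aistudio/external-libraries"  # path used in the commands above
MIRROR = "https://pypi.tuna.tsinghua.edu.cn/simple"

def build_pip_command(package, timeout=120, retries=10):
    """Build a pip install command with a longer timeout and more retries."""
    return [
        sys.executable, "-m", "pip", "install",
        "--default-timeout", str(timeout),  # seconds; illustrative value
        "--retries", str(retries),          # illustrative value
        "-i", MIRROR,
        "--no-user",                        # avoids the '--user' / '--target' conflict
        "-t", TARGET_DIR,
        package,
    ]

def ensure_on_path(directory):
    """Append the directory to sys.path once so its packages can be imported."""
    if directory not in sys.path:
        sys.path.append(directory)

def is_importable(module_name):
    """Return True if Python can locate the module without importing it."""
    return importlib.util.find_spec(module_name) is not None

cmd = build_pip_command("gensim")
ensure_on_path(TARGET_DIR)
print(" ".join(cmd))
print("gensim importable:", is_importable("gensim"))
```

To actually run the install, the command list can be passed to `subprocess.run(cmd, check=True)`. Restart the kernel afterwards, as the note above suggests.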