Downloading the Boost library
Select Documentation on the right.
![![[Pasted image 20250210000332.png]]](https://i-blog.csdnimg.cn/direct/873eedc23e9f4492ad4c0e4b162d56c1.png)
Select the latest version, 1.87.0.
![![[Pasted image 20250210000517.png]]](https://i-blog.csdnimg.cn/direct/9702d199dd9644fe9480536161396f02.png)
The latest version can be downloaded here on the home page.
Setting up the project structure
1. Create the boost_searcher directory
mkdir boost_searcher
![![[Pasted image 20250215082843.png]]](https://i-blog.csdnimg.cn/direct/c48adb06b91e460884f9faf51a123a2a.png)
2. Change into the boost_searcher directory
cd boost_searcher
![![[Pasted image 20250215083008.png]]](https://i-blog.csdnimg.cn/direct/6c1d2a81f91143419cb37500c224bbb3.png)
3. Install the rz command (provided by the lrzsz package)
yum install lrzsz
![![[Pasted image 20250215083557.png]]](https://i-blog.csdnimg.cn/direct/d586f7ce50af44c9819735d8ced88b3b.png)
4. Upload the Boost archive, which contains the web pages to be searched
rz
![![[Pasted image 20250215084553.png]]](https://i-blog.csdnimg.cn/direct/f27f2827f5494d008bdad81b2cf4d48a.png)
![![[Pasted image 20250215084730.png]]](https://i-blog.csdnimg.cn/direct/24c8daa617b24caaba1897208435efab.png)
Upload complete.
5. Extract the archive
tar xzf boost_1_87_0.tar.gz
![![[Pasted image 20250215084939.png]]](https://i-blog.csdnimg.cn/direct/673a3b91946d4429b6f10d6a8d5c5577.png)
Extraction complete.
6. The archive can now be deleted
rm boost_1_87_0.tar.gz
![![[Pasted image 20250215085251.png]]](https://i-blog.csdnimg.cn/direct/fd8de35df125421da5b9c6a628ae6c66.png)
7. Create the data directory and the input directory beneath it
mkdir -p data/input
![![[Pasted image 20250215085444.png]]](https://i-blog.csdnimg.cn/direct/083a5a80161e4e0db4c53d02d0b800e8.png)
data/input holds the data source: the more than 8,000 HTML documents to be searched.
8. Copy everything under doc/html in the Boost library into data/input
cp -rf boost_1_87_0/doc/html/* data/input/
![![[Pasted image 20250215090315.png]]](https://i-blog.csdnimg.cn/direct/1db6fe533f9941009244a4657544d976.png)
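For now, only the HTML files under boost_1_87_0/doc/html are needed; they are used to build the index.
Writing the Parser module for tag removal and data cleaning
1. Create a parser file that strips the tags from the web pages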
touch parser.cc
![![[Pasted image 20250215090908.png]]](https://i-blog.csdnimg.cn/direct/ab471d4c86a44b669acc18b33afef6ba.png)
The raw data has to be turned into tag-free data. A raw document looks like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Chapter 30. Boost.Process</title>
<link rel="stylesheet" href="../../doc/src/boostbook.css" type="text/css">
<meta name="generator" content="DocBook XSL Stylesheets V1.79.1">
<link rel="home" href="index.html" title="The Boost C++ Libraries BoostBook
Documentation Subset">
<link rel="up" href="libraries.html" title="Part I. The Boost C++ Libraries
(BoostBook Subset)">
<link rel="prev" href="poly_collection/acknowledgments.html"
title="Acknowledgments">
<link rel="next" href="boost_process/concepts.html" title="Concepts">
</head>
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084"
alink="#0000FF">
<table cellpadding="2" width="100%"><tr>
<td valign="top"><img alt="Boost C++ Libraries" width="277" height="86"
src="../../boost.png"></td>
<td align="center"><a href="../../index.html">Home</a></td>
<td align="center"><a href="../../libs/libraries.htm">Libraries</a></td>
<td align="center"><a href="http://www.boost.org/users/people.html">People</a>
</td>
<td align="center"><a href="http://www.boost.org/users/faq.html">FAQ</a></td>
<td align="center"><a href="../../more/index.htm">More</a></td>
</tr></table>
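`<>`: these are HTML tags. They carry no value for searching and need to be removed. Tags generally come in pairs.
2. Create a raw_html directory under data to hold the processed content (run the command below from inside the data directory)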
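One simple way to drop the tags is a two-state scan: while inside `<...>` characters are skipped, otherwise they are kept. The sketch below illustrates that idea only. `StripTags` is a hypothetical name, and the approach assumes that `<` and `>` appear only as tag delimiters, so it does not handle comments, scripts, or HTML entities.

```cpp
#include <string>

// Hypothetical helper: remove everything between '<' and '>' and
// replace line breaks with spaces so the result is a single line.
std::string StripTags(const std::string& html) {
    enum class State { kTag, kContent };
    State state = State::kContent;
    std::string out;
    out.reserve(html.size());
    for (char c : html) {
        switch (state) {
        case State::kTag:
            // Inside a tag: skip characters until the closing '>'.
            if (c == '>') {
                state = State::kContent;
            }
            break;
        case State::kContent:
            if (c == '<') {
                state = State::kTag;
            } else {
                // Keep text; turn newlines into spaces so words stay apart.
                out.push_back((c == '\n' || c == '\r') ? ' ' : c);
            }
            break;
        }
    }
    return out;
}
```

For example, `StripTags("<title>Chapter 30. Boost.Process</title>")` yields `Chapter 30. Boost.Process`.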
mkdir raw_html
![![[Pasted image 20250215091707.png]]](https://i-blog.csdnimg.cn/direct/b5432cd6f16348d081098fd8f01ec68f.png)
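Strip the tags from every document, then write all of them into the same file.
The content of each document must not contain any \n.
Documents are separated from one another by \3.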
![![[Pasted image 20250209235520.png]]](https://i-blog.csdnimg.cn/direct/d6591fb22f0d4d7f983c693462b7203a.png)
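The output format described above can be sketched as follows, assuming each document has already had its tags removed. The names `JoinDocs`, `SplitDocs`, and `SaveDocs` are hypothetical, and the output file name is left to the caller because the original does not specify one.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

constexpr char kDocSep = '\3';  // separator between documents

// Join cleaned documents into one blob separated by kDocSep.
// Any remaining '\n' is replaced by a space so no document contains one.
std::string JoinDocs(const std::vector<std::string>& docs) {
    std::string out;
    for (std::size_t i = 0; i < docs.size(); ++i) {
        if (i > 0) {
            out.push_back(kDocSep);
        }
        for (char c : docs[i]) {
            out.push_back(c == '\n' ? ' ' : c);
        }
    }
    return out;
}

// Inverse of JoinDocs: split a blob back into documents on kDocSep.
std::vector<std::string> SplitDocs(const std::string& blob) {
    std::vector<std::string> docs;
    if (blob.empty()) {
        return docs;
    }
    std::size_t start = 0;
    while (true) {
        const std::size_t pos = blob.find(kDocSep, start);
        if (pos == std::string::npos) {
            docs.push_back(blob.substr(start));
            break;
        }
        docs.push_back(blob.substr(start, pos - start));
        start = pos + 1;
    }
    return docs;
}

// Write the joined blob to path; returns false if the file cannot be written.
bool SaveDocs(const std::string& path, const std::vector<std::string>& docs) {
    std::ofstream out(path, std::ios::binary);
    if (!out) {
        return false;
    }
    const std::string blob = JoinDocs(docs);
    out.write(blob.data(), static_cast<std::streamsize>(blob.size()));
    return static_cast<bool>(out);
}
```

`\3` works as a separator because it is a control character that does not normally occur in document text, so splitting on it later is unambiguous.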



