Using GooSeeker web scraping together with CMarkup XML processing



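Over the past couple of days I needed to scrape some news, so I picked two lists: Sina News and Sohu News. The scraped results are XML files that follow the structure defined in MetaStudio. I then used CMarkup, a C++ library, to turn the titles and news content in the XML into txt files for later use.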

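My previous scraping job was only 200-odd pages, so I muddled through with Ctrl+C and Ctrl+V like a chump. This time there were far too many pages to survive that way, so I sat down and learned these two tools. From starting to learn to finishing the task took about two and a half days, roughly one day per tool, learning while using them. The task is finally done, so I'm writing it down while it's fresh for future reference.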

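The previous post gave a rough overview of GooSeeker. I wrote it right after my first truly successful scrape. The next day I found that the locating no longer worked: many Sina news pages carry video, and some of the scraped pages even contained code, so they couldn't be processed directly. I explored how definitions work in MetaStudio's bucketEditor some more and finally settled on an information structure that captures the title and the text content no matter what kind of page it is. Sohu News followed the same pattern and was finished quickly.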

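The improvements were:

1. Defined a multi-level structure

2. Used GooSeeker's precise-location marker FreeFormat to locate content precisely

3. Double-clicking an information property also seems to offer an option to capture only the text content

The buckets for the two sites belong to these themes: test_sinanews_chuck (list page) + test_sinanews_chuck_content (news page) for Sina, and chuck_sohunews_list (list page) + chuck_sohunews_content (news page) for Sohu.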

Now on to how to use CMarkup.

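What GooSeeker produces is well-formed XML. I wasn't smart enough to write my own parser, and when I found a ready-made one I just used it. While trying to write a parser I ran into problems handling Chinese text, which is a blind spot for me...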

These are the CMarkup functions I mainly used:

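A quick aside. Including only Markup.h in my source file produced this error: fatal error LNK1120: 9 unresolved externals

Switching to including Markup.cpp fixed it. I don't know the reason yet... what a chump... it is probably something very basic... sigh, I still haven't finished C++ Primer...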

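1. CMarkup xml; xml.Load("some_file.xml"); My XML files were already scraped, so all I had to do was load them. CMarkup can also create XML files from scratch, which is impressive; I'll look into that later.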

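2. xml.FindElem(); xml.FindChildElem()

The first one finds the current main element and then its sibling elements at the same level, so it is effectively a FindNextElem() function. The second one looks among the children of the current main element and sets the child position. In the documentation's own words: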

The FindElem should be thought of as "FindNextElem" because it goes forward from the current main position to the next matching sibling element.

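The documentation is very well written, and reading it yourself should solve most problems. English rusty? Well, who told you not to study it properly...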

3. xml.GetElemContent() returns the content inside the current main element.

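4. xml.IntoElem(), xml.OutOfElem() and xml.FindElem() are the three functions that change the current main position. The first goes into the element, down to its child. The second goes back out to the parent element. The third moves to the next sibling element.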

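My understanding is that the xml object maintains three positions: the parent position, the main position and the child position. IntoElem() turns the main position into the parent position, and the current child position becomes the new main position. If the child position has not been set yet, there is no main position after IntoElem(), so calls such as GetElemContent() have nothing to work on. That is why I call FindChildElem() first to set the child position. FindChildElem() does not change the main position.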

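5. The three positions in item 4 are genuinely hard to keep straight, which is where these debugging lifesavers come in:

xml.GetTagName() xml.GetChildTagName()

They return the tag names at the main position and the child position respectively. Ridiculously handy.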

Now for the code.

#include<fstream>
#include<iostream>
#include"findFile.h"
#include<string>
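// Workaround: pulls CMarkup's definitions straight into this file.
// The usual way is to include Markup.h and add Markup.cpp to the project.
#include"Markup.cpp"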
using namespace std;

CString fileNames[5000];
int main()
{
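	int i,j,fileNum=0;
	// findFiles comes from findFile.h (not shown here); it appears to fill
	// fileNames with the file names under raw\ (without extension) and set fileNum
	findFiles("raw\\",fileNames,fileNum);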
/***** Sohu News XML processing (disabled) starts here
	for(int i=0;i<fileNum;++i)
	//for(i=0;i<1;++i)
	{
		ofstream fout("processed\\"+fileNames[i]+".txt");

		CString newsName;
		CString newsBody;
	    CMarkup xml;
		xml.Load("raw\\"+fileNames[i]+".xml");
		xml.FindElem();
		//cout<<"*****[ "<<xml.GetTagName()<<" ]*****"<<endl;
		xml.IntoElem();
		for(j=0;j<8;++j)
			xml.FindElem();
		//cout<<"*****[ "<<xml.GetTagName()<<" ]*****"<<endl;
		xml.IntoElem();
		xml.FindChildElem();
		//cout<<"*****[ "<<xml.GetTagName()<<" ]*****"<<endl;
		xml.IntoElem();
		xml.FindChildElem();
		//cout<<"*****[ "<<xml.GetTagName()<<" ]*****"<<endl;

//**********************写入标题****************************
		newsName=xml.GetElemContent();
		fout<<newsName<<endl;

		xml.FindElem();
		xml.IntoElem();
		xml.FindChildElem();
		xml.IntoElem();
		xml.FindChildElem();

		bool areWeThere=xml.IntoElem();
		xml.FindChildElem();
	//	cout<<"****main position element[ "<<xml.GetTagName()<<" ]****"<<endl;
	//	cout<<"****child position element[ "<<xml.GetChildTagName()<<" ]****"<<endl;
		CMarkup tempMark=xml;
		while(areWeThere)
		{
			xml.IntoElem();
			fout<<xml.GetElemContent()<<endl;
			xml.OutOfElem();
			areWeThere=xml.FindElem();
			xml.FindChildElem();
		//	xml=tempMark;
		//	areWeThere=xml.FindElem();
		//	xml.FindChildElem();
		//	tempMark=xml;
		//
		}

		//fin.close();
		fout.close();
		//fin.clear(ios::goodbit);
		fout.clear(ios::goodbit);
	}
    Sohu News XML processing (disabled) ends here */
	for(i=0;i<fileNum;++i)
	{
		cout<<"processing file NO."<<i<<endl;
		ofstream fout("processed\\"+fileNames[i]+".txt");
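		CMarkup xml;
		// Load returns false if the file cannot be read or is not well-formed
		if(!xml.Load("raw\\"+fileNames[i]+".xml"))
		{
			cout<<"failed to load file NO."<<i<<endl;
			continue;
		}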

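		// Reach the 新闻 (news) element: go into the root element,
		// then step forward through its child elements with 8 FindElem calls
		xml.FindElem();
		xml.IntoElem();
		for(j=0;j<8;++j)
			xml.FindElem();

		//cout<<"-----main position element:[ "<<xml.GetTagName()<<" ]"<<endl;
		// Get the 新闻标题 (news title): go two levels down,
		// setting the child position before each IntoElem
		xml.FindChildElem();			// initialize the child position
		xml.IntoElem();
		xml.FindChildElem();
		xml.IntoElem();
		fout<<xml.GetElemContent()<<endl;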

		// Move on to the 新闻主题 element, which holds the article text,
		// and loop over its 段落 (paragraph) elements
		xml.FindElem();
		//cout<<"-----main position element:[ "<<xml.GetTagName()<<" ]"<<endl;
		xml.FindChildElem();
		xml.IntoElem();
		xml.FindChildElem();
		xml.IntoElem();
		//cout<<"-----main position element:[ "<<xml.GetTagName()<<" ]"<<endl;
		xml.FindChildElem();			// set the child position to the first 段落
		bool areWeThere=xml.IntoElem();
		while(areWeThere)
		{
			xml.FindChildElem();
			xml.IntoElem();
			fout<<xml.GetElemContent()<<endl;	// write one paragraph per line
			xml.OutOfElem();
			areWeThere=xml.FindElem();	// next 段落 sibling; false when none are left
		}
 	}
	return 0;
}

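That's it. Thinking back, I had never really settled down and dug into a technology in earnest before. Most of the time I backed off as soon as things got hard, like a chump... so I spent plenty of time without making much progress. This time went well. A bit of encouragement for myself: keep up this kind of effort from now on and do good work.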

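Are things in this world difficult or easy? Do them, and the difficult becomes easy; leave them undone, and the easy becomes difficult. Is learning difficult or easy? Study, and the difficult becomes easy; do not study, and the easy becomes difficult.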

Put in 10,000 hours, and having studied computer science won't have been for nothing...