Larbin Search Engine Source Code Walkthrough (5): Interface Functions That Let Secondary Developers Further Process Fetched Pages

Latest recommended article published on 2012-05-25 18:45:53

License: CC 4.0 BY-SA

Tags:

#search-engine #html #output #url #thread #user

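This article walks through the interface Larbin exposes for processing crawl results. It covers the hooks for successfully fetched pages and for failed fetches, which are the places where a secondary developer can plug in custom handling.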


The output.h file

// Larbin
// Sebastien Ailleret
// 03-02-00 -> 16-03-00

#ifndef OUTPUT_H
#define OUTPUT_H

#include "global.h"

/*
 * Role of this file in the program: it gives secondary developers an
 * interface through which they can flexibly process the pages that were
 * crawled. This covers both pages that were fetched normally and pages
 * whose fetch failed. In short, it is an extension point.
 */

/** The fetch failed
 * @param u the URL of the doc
 * @param reason reason of the fail
 */
void fetchFail (url *u, FetchError reason);

/** In this thread, end user manage the result of the crawl
 */
void *startOutput (void *none);

#endif // OUTPUT_H
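To make the extension point more concrete, here is a minimal sketch of how a secondary developer might fill in the `fetchFail` hook so that it tallies failures per reason. It is not Larbin's own code: the `url` struct, the shortened `FetchError` enum, the `failureCounts` map and the `failuresFor` helper are stand-ins invented for this illustration so the snippet compiles on its own. Only the signature of `fetchFail` is taken from output.h.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Stand-in types (assumptions): the real url and FetchError come from
// Larbin's own headers and are richer than what is shown here.
enum FetchError { success, noDNS, noConnection, timeout };

struct url {
    std::string text;  // hypothetical field holding the address as a string
};

// Hypothetical tally of failures per reason, filled in by fetchFail below.
static std::map<FetchError, std::size_t> failureCounts;

// Same signature as the hook declared in output.h.
void fetchFail(url *u, FetchError reason) {
    assert(u != nullptr);     // assumes the crawler always passes a valid URL
    ++failureCounts[reason];  // operator[] creates the entry at 0 if missing
}

// Helper for reading the tally back (not part of Larbin).
std::size_t failuresFor(FetchError reason) {
    auto it = failureCounts.find(reason);
    return it == failureCounts.end() ? 0 : it->second;
}
```

In this sketch, calling `fetchFail` twice with `timeout` and then asking `failuresFor(timeout)` gives 2. A real hook could just as well write a log line or queue the URL for a retry; the tally is only one possible choice.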

The output.cc file

// Larbin
// Sebastien Ailleret
// 03-02-00 -> 06-06-00

#include <iostream.h>

#include "types.h"
#include "global.h"
#include "xfetcher/file.h"
#include "xinterf/output.h"
#include "xutils/debug.h"

/** A page has been loaded successfully
 * @param page the page that has been fetched
 */
//////////////////////////////////////////////////////
//
// Purpose: process a fetched page, mainly its content
// Parameter: html *page  pointer to the page object to process
// Returns: void
// Note: this is where the user adds whatever processing is needed
//       for each html file.
//
//////////////////////////////////////////////////////
void loaded (html *page) {
  // Here should be the code for managing everything
  // cout << "fetched : ";
  // page->getUrl()->print();
}

/** The fetch failed (but this page is interesting)
 * @param u the URL of the doc
 * @param reason reason of the fail
 */
//////////////////////////////////////////////////////
//
// Purpose: handle a fetch error for a page marked as interesting
// Parameters: url *u             URL of the page
//             FetchError reason  the error
// Returns: void
// Note: the actual handling is implemented by the user as needed.
//
//////////////////////////////////////////////////////
void fetchFailInteresting (url *u, FetchError reason) {
  // Here should be the code for managing everything
}

/** The fetch failed
 * @param u the URL of the doc
 * @param reason reason of the fail
 */
//////////////////////////////////////////////////////
//
// Purpose: handle a fetch error for a page that is not interesting
// Parameters: url *u             URL of the page
//             FetchError reason  the error type
// Returns: void
// Note: there is no implementation here; the user fills in the code.
//
//////////////////////////////////////////////////////
void fetchFail (url *u, FetchError reason) {
  // Here should be the code for managing everything
}

/** It's over with this file
 * report the situation ! (and make some stats)
 */
//////////////////////////////////////////////////////
//
// Purpose: process each html page individually
// Parameters: html *parser    the page object
//             FetchError err  the fetch result for that page
// Returns: void
// Note: the processing here concerns the page content, not its links.
//       Most of it is left as hooks for the user to implement.
//
//////////////////////////////////////////////////////
static void endOfLoad (html *parser, FetchError err) {
  answers(err);
  switch (err) {
  case success:
    if (parser->isInteresting()) {
      // Defined in debug.h; it only increments a counter:
      //   #define interestingPage() interestingPage++
      interestingPage();
      loaded(parser);  // defined above
    }
    break;
  default:
    if (parser->isInteresting()) {
      fetchFailInteresting(parser->getUrl(), err);  // defined above
    } else {
      fetchFail(parser->getUrl(), err);  // defined above
    }
    break;
  }
}

/** In this thread, end user manage the result of the crawl
 */
//////////////////////////////////////////////////////
//
// Purpose: process every page that was crawled
// Parameter: void *none  unused
// Returns: void *  NULL (never reached in practice, because the loop
//          below does not exit)
//
//////////////////////////////////////////////////////
void *startOutput (void *none) {
  crash("Output on");
  for (;;) {
    /*
     * From the global class:
     *   // free connection for fetchOpen : connections waiting for end user
     *   static ConstantSizedFifo<Connexion> *userConns;
     * This is a FIFO queue of Connexion objects, each of which holds the
     * information about one fetched page. The Connexion class is defined
     * in global.h.
     */
    Connexion *conn = global::userConns->get();  // take the next connection object
    // Process the html file attached to the connection.
    endOfLoad((html *)conn->parser, (enum FetchError) conn->pos);
    conn->recycle();                // release what the connection recorded
    global::freeConns->put(conn);   // put the connection back on the free-connection queue
  }
  return NULL;
}
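The control flow of `endOfLoad` and `startOutput` can be tried outside Larbin with a small self-contained sketch. Everything below is a stand-in written for illustration: `Page` plays the role of `html`, `Conn` plays the role of `Connexion`, two `std::queue` objects replace `global::userConns` and `global::freeConns`, and `hookLog` simply records which hook ran. Unlike the real `startOutput`, which runs in its own thread and loops forever, the hypothetical `drainOutput` stops as soon as the queue is empty so the result can be inspected.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <vector>

// Stand-ins (assumptions), not Larbin's real types.
enum FetchError { success, noDNS, timeout };

struct Page {             // plays the role of html
    std::string address;
    bool interesting;     // what isInteresting() would report
};

struct Conn {             // plays the role of Connexion
    Page *parser;
    int pos;              // carries the FetchError code, as in the original cast
};

// Records which hook ran for which page.
std::vector<std::string> hookLog;

void loaded(Page *p) { hookLog.push_back("loaded " + p->address); }
void fetchFailInteresting(Page *p, FetchError) {
    hookLog.push_back("failInteresting " + p->address);
}
void fetchFail(Page *p, FetchError) { hookLog.push_back("fail " + p->address); }

// Same three-way dispatch as endOfLoad: a successful but uninteresting
// page triggers no hook at all.
void endOfLoad(Page *p, FetchError err) {
    assert(p != nullptr);
    if (err == success) {
        if (p->interesting) loaded(p);
    } else if (p->interesting) {
        fetchFailInteresting(p, err);
    } else {
        fetchFail(p, err);
    }
}

// Finite version of the startOutput loop: take, dispatch, recycle, return.
void drainOutput(std::queue<Conn *> &userConns, std::queue<Conn *> &freeConns) {
    while (!userConns.empty()) {
        Conn *c = userConns.front();
        userConns.pop();
        endOfLoad(c->parser, static_cast<FetchError>(c->pos));
        c->parser = nullptr;  // stands in for conn->recycle()
        c->pos = 0;
        freeConns.push(c);    // hand the connection back for reuse
    }
}
```

Feeding this sketch one page of each kind makes the dispatch visible: an interesting success reaches `loaded`, an interesting failure reaches `fetchFailInteresting`, an uninteresting failure reaches `fetchFail`, and an uninteresting success is dropped silently. Every connection ends up on the free queue regardless of outcome.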