Fetching Web Pages with a C++ Crawler on Linux

Latest recommended article published on 2022-04-21 17:05:07

Original

License: CC 4.0 BY-SA

Article tags:

#crawler #web-scraping #c++ #linux #socket

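This article shows how to implement a simple web crawler in C++ on Linux that fetches the content of a given URL over a socket.

It is easy to use: pass in a URL and it returns the content of the corresponding page.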
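To make the flow concrete: a crawler of this kind splits the URL into a host and a page path, resolves the host, connects a TCP socket, and then sends a plain-text HTTP request. The sketch below shows only that request text. The helper name `buildGetRequest`, the choice of HTTP/1.1, and the `Connection: close` header are assumptions made for illustration; they are not taken from the article's code.

```cpp
#include <string>

// Hypothetical helper (not from the original article): builds the text of a
// minimal HTTP/1.1 GET request for the given host and page path.
std::string buildGetRequest(const std::string &host, const std::string &pagePath) {
    std::string request;
    request += "GET " + pagePath + " HTTP/1.1\r\n";  // request line
    request += "Host: " + host + "\r\n";             // Host header is required in HTTP/1.1
    request += "Connection: close\r\n";              // ask the server to close after responding
    request += "\r\n";                               // an empty line ends the header section
    return request;
}
```

For example, `buildGetRequest("example.com", "/")` produces a request whose first line is `GET / HTTP/1.1`. Asking for `Connection: close` keeps the reading side simple: the client can keep calling `recv` until it returns 0, which signals that the server has closed the connection.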

#include <iostream>
#include <string>
#include <netdb.h>        // gethostbyname, struct hostent
#include <netinet/in.h>   // struct sockaddr_in
#include <string.h>
#include <stdlib.h>       // exit
using namespace std;

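// Split a URL into its host name and page path; the path defaults to "/".
void parseHostAndPagePath(const string &url, string &hostUrl, string &pagePath){
    hostUrl=url;
    pagePath="/";
    // Strip the scheme only when it is at the very start of the URL.
    // Note: removing "https://" does not add TLS; a plain socket still speaks HTTP only.
    if(hostUrl.compare(0,7,"http://")==0)
        hostUrl.erase(0,7);
    else if(hostUrl.compare(0,8,"https://")==0)
        hostUrl.erase(0,8);
    // find() returns string::npos (not -1 as an int) when nothing is found.
    string::size_type pos=hostUrl.find("/");
    if(string::npos!=pos){
        pagePath=hostUrl.substr(pos);
        hostUrl=hostUrl.substr(0,pos);
    }
}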

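// Fetch the page at the given URL and return its content.
string getPageContent(const string &url){
    struct hostent *host;
    string hostUrl, pagePath;
    parseHostAndPagePath(url, hostUrl, pagePath);
    // Resolve the host name; gethostbyname returns NULL on failure.
    if(NULL==(host=gethostbyname(hostUrl.c_str()))){
        cerr<<"gethostbyname error"<<endl;
        exit(1);
    }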

    struct sockaddr_in pin;
    in