PHPCrawl webcrawler library/framework

本文介绍了 PHPCrawl,一款用 PHP 编写的高效网页爬虫库。该库支持高度配置,包括 URL 过滤、内容类型过滤、处理 cookie 等功能,并提供了详细的使用示例。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

http://phpcrawl.cuab.de/about.html

About PHPCrawl


PHPCrawl is a framework for crawling/spidering websites written in the programming language PHP, so just call it a webcrawler-library or crawler-engine for PHP

PHPCrawl "spiders" websites and passes information about all found documents (pages, links, files ans so on) for futher processing to users of the library.

It is high configurable and provides several options to specify the behaviour of the crawler like URL- and Content-Type-filters, cookie-handling, robots.txt-handling, limiting options, multiprocessing and much more.

PHPCrawl is completly free opensource software and is licensed under the  GNU GENERAL PUBLIC LICENSE v2 .

To get a first impression on how to use the crawler you may want to take a look at the  quickstart guide  or an  example  inside the manual section.
A complete reference and documentation of all available options and methods of the framework can be found in the  classreferences-section

The current version of the phpcrawl-package and older releases can be downloaded from a  sourceforge-mirror .

Note to users of phpcrawl version 0.7x or before: Although in version 0.8 some method-names and parameters have changed, it should be fully compatible to older versions of phpcrawl.


<?php

// It may take a whils to crawl a site ...
set_time_limit(10000);

// Inculde the phpcrawl-mainclass
include("libs/PHPCrawler.class.php");

// Extend the class and override the handleDocumentInfo()-method 
class MyCrawler extends PHPCrawler 
{
  function handleDocumentInfo($DocInfo) 
  {
    // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br>").
    if (PHP_SAPI == "cli") $lb = "\n";
    else $lb = "<br />";

    // Print the URL and the HTTP-status-Code
    echo "Page requested: ".$DocInfo->url." (".$DocInfo->http_status_code.")".$lb;
    
    // Print the refering URL
    echo "Referer-page: ".$DocInfo->referer_url.$lb;
    
    // Print if the content of the document was be recieved or not
    if ($DocInfo->received == true)
    {
        file_put_contents("yuqingpeoplecomcn/".urlencode($DocInfo->url),$DocInfo->content);
        echo "Content received: ".$DocInfo->bytes_received." bytes".$lb;
    }
    else
      echo "Content not received".$lb; 
    
    // Now you should do something with the content of the actual
    // received page or file ($DocInfo->source), we skip it in this example 
    
    echo $lb;
    
    flush();
  } 
}

// Now, create a instance of your class, define the behaviour
// of the crawler (see class-reference for more options and details)
// and start the crawling-process.

$crawler = new MyCrawler();

// URL to crawl
$crawler->setURL("yuqing.people.com.cn");

$crawler->enableResumption();
// Only receive content of files with content-type "text/html"
//$crawler->addContentTypeReceiveRule("#text/html#");

// Ignore links to pictures, dont even request pictures
//$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");

// Store and send cookie-data like a browser does
$crawler->enableCookieHandling(true);

// Set the traffic-limit to 1 MB (in bytes,
// for testing we dont want to "suck" the whole site)
//$crawler->setTrafficLimit(1000 * 1024);

$crawler->setWorkingDirectory("/dev/shm/");
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
// Thats enough, now here we go
//$crawler->go();
$crawler->goMultiProcessed(5);

// At the end, after the process is finished, we print a short
// report (see method getProcessReport() for more information)
$report = $crawler->getProcessReport();

if (PHP_SAPI == "cli") $lb = "\n";
else $lb = "<br />";
    
echo "Summary:".$lb;
echo "Links followed: ".$report->links_followed.$lb;
echo "Documents received: ".$report->files_received.$lb;
echo "Bytes received: ".$report->bytes_received." bytes".$lb;
echo "Process runtime: ".$report->process_runtime." sec".$lb; 
?>

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值