Today I read an article online covering the basics of scraping articles, so here is my own humble attempt. Look, code!
<?php
header('Content-Type:text/html;charset=UTF-8');
require "mysql.class.php";
$db = new Mysql_DB("localhost","root","root","caiji");
// URL of the index page to scrape
$url = "http://cn.jokes.yahoo.com/jok/index.html";
// Fetch the page source; stop if the download fails
$r = file_get_contents($url);
if ($r === false) die('Failed to fetch ' . $url);
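// Regex to match; the capture group is the part of each joke URL between the host and ".html"
$preg = '/hspace=5><a href="http:\/\/cn\.jokes\.yahoo\.com\/(.*)\.html" class=list target=_blank>/isU';
// Run the regex search
preg_match_all($preg, $r, $title);
// Count the titles found
$count = count($title[1]);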
//echo $count;die;
// Writing the article contents and titles to the database all in one pass would hang the server, so the work is split into two steps
for($i=0;$i<$count;$i++){
$jurl = "http://cn.jokes.yahoo.com/" .$title[1][$i]. ".html";
echo $jurl;
echo "<br>";
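// Note: the captured value is the URL path, and it is what gets stored as the title
echo $tt = $title[1][$i];
// addslashes() is a stopgap so quotes in scraped text cannot break the SQL; prefer the database class's own escaping or prepared statements if available
$db->query("insert into demo01 set url='" . addslashes($jurl) . "',title='" . addslashes($tt) . "'");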
}
// Read back the URLs that were just written
$res = $db->get_all("select * from demo01");
//echo "<pre>";
//print_r($res);
foreach($res as $k=>$v){
$c = file_get_contents($v['url']);
if ($c === false) continue; // skip pages that fail to download
$tt = $v['title'];
echo $tt;
echo "<br>";
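// The U (ungreedy) flag stops at the first </div>, so a nested <div> would cut the match short
$p = '/<div id="newscontent">(.*)<\/div>/isU';
if (!preg_match($p, $c, $content)) continue; // no content block on this page
$text = $content[0];
// If the page at that URL is GBK-encoded, don't forget iconv
$text1 = iconv("GBK","UTF-8",$text);
if ($text1 === false) continue; // conversion failed
echo $text1;
// addslashes() is a stopgap so quotes in the article body cannot break the SQL
$db->query("insert into demo011 set title='" . addslashes($tt) . "',content='" . addslashes($text1) . "'");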
}
unset($res);
echo 'ok';
?>
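The two regex steps and the GBK conversion in the script can also be pulled out into small functions, which makes them easy to try without a database. What follows is only a sketch under assumptions: `extract_links()`, `extract_content()` and `to_utf8()` are hypothetical names, the link pattern is a slightly stricter variant of the one above, and the UTF-8 check is a heuristic, not real charset detection.

```php
<?php
// Sketch only: these helper names are hypothetical and not part of the script above.

// Return absolute URLs for every list link found in the index page HTML.
function extract_links(string $html, string $base = "http://cn.jokes.yahoo.com/"): array
{
    // [^"]+ keeps a match inside one href value; preg_quote escapes the dots and slashes.
    $pattern = '/<a href="' . preg_quote($base, '/') . '([^"]+)\.html" class=list target=_blank>/i';
    if (!preg_match_all($pattern, $html, $m)) {
        return [];
    }
    $urls = [];
    foreach (array_unique($m[1]) as $path) {
        $urls[] = $base . $path . ".html";
    }
    return $urls;
}

// Return the inner HTML of the newscontent div, or null when it is missing.
// The lazy (.*?) stops at the first </div>, so nested divs are not handled.
function extract_content(string $html): ?string
{
    if (preg_match('/<div id="newscontent">(.*?)<\/div>/is', $html, $m)) {
        return $m[1];
    }
    return null;
}

// Convert text to UTF-8 unless it already is valid UTF-8.
// Assumption: anything that is not valid UTF-8 is in $from (GBK by default).
function to_utf8(string $text, string $from = "GBK"): ?string
{
    // With the /u flag, preg_match returns 1 only for a valid UTF-8 subject.
    if (preg_match('//u', $text) === 1) {
        return $text;
    }
    $converted = @iconv($from, "UTF-8", $text);
    return $converted === false ? null : $converted;
}
```

Inside the loops above, the idea would be to call `extract_links($r)` in place of the first `preg_match_all`, and `to_utf8(extract_content($c))` after checking that `extract_content()` did not return null.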
Ta-da, a small scraper is done. How to extend the code from here is up to you.
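One possible extension, for what it's worth: scraped text often contains quotes, and the INSERT statements in the script build their SQL by string interpolation. Below is a minimal sketch of the same insert done with a prepared statement. It assumes PDO instead of the `Mysql_DB` class (whose API is not shown in this post), and `save_link()` is a hypothetical helper name; only the table and column names come from the script.

```php
<?php
// Sketch only: save_link() is a hypothetical helper. The table and columns
// (demo01, url, title) come from the script above; using PDO is an assumption.

function save_link(PDO $pdo, string $url, string $title): void
{
    // Placeholders keep quotes in scraped text from breaking or altering the SQL.
    $stmt = $pdo->prepare("INSERT INTO demo01 (url, title) VALUES (:url, :title)");
    $stmt->execute([':url' => $url, ':title' => $title]);
}

// Example usage, reusing the connection values from the script:
// $pdo = new PDO("mysql:host=localhost;dbname=caiji;charset=utf8mb4", "root", "root");
// $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// save_link($pdo, "http://cn.jokes.yahoo.com/jok/1.html", "It's a title");
```

The same pattern would apply to the second table, with `title` and `content` as the bound parameters.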
Reposted from: https://blog.51cto.com/xpmozong/483415