
Introduction
A web crawler (also called an ant or a spider) is a program that automatically fetches page data from the World Wide Web. Crawlers are generally used to download large numbers of pages for later processing by a search engine; dedicated programs (such as Lucene or DotLucene) then index the downloaded pages to speed up searching. Crawlers can also serve as link checkers or HTML validators, and a more recent use is checking e-mail addresses to guard against trackback spam.
Crawler Overview
This article presents a simple crawler written in C#. Given a target URL, it crawls that site. Usage is straightforward: type the address of the site you want to crawl and press "GO".
The crawler keeps a queue of URLs waiting to be fetched, the same basic design used by large search engines. Crawling is multithreaded: each thread takes a URL from the queue, downloads the page, and saves it to the designated storage area (Storage in the figure). Web requests are made with the C# Sockets library. The links in the page currently being processed are parsed out and pushed back into the URL queue (the settings include an option for the crawl depth).
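The flow just described (take a URL from the queue, download the page, store it, extract its links, and enqueue them until the configured depth is reached) can be sketched in a few lines. This is only an illustration, not the article's code: it is single-threaded and uses WebClient instead of the crawler's raw sockets, and the storageFolder and maxDepth parameters are assumed.
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

class CrawlLoopSketch
{
    static void Crawl(Uri start, string storageFolder, int maxDepth)
    {
        var queue = new Queue<(Uri Uri, int Depth)>();   // the URL queue
        var visited = new HashSet<string>();             // avoid fetching the same page twice
        queue.Enqueue((start, 0));

        using (var client = new WebClient())
        {
            while (queue.Count > 0)
            {
                var (uri, depth) = queue.Dequeue();
                if (depth > maxDepth || !visited.Add(uri.AbsoluteUri))
                    continue;

                string html;
                try { html = client.DownloadString(uri); }
                catch (WebException) { continue; }       // skip pages that fail to download

                // save the page into the storage folder
                string file = Path.Combine(storageFolder,
                    Uri.EscapeDataString(uri.AbsoluteUri) + ".html");
                File.WriteAllText(file, html);

                // parse out the links and push them back into the queue
                foreach (Match m in Regex.Matches(html, "href\\s*=\\s*[\"'](?<u>[^\"'#]+)"))
                {
                    if (Uri.TryCreate(uri, m.Groups["u"].Value, out Uri link))
                        queue.Enqueue((link, depth + 1));
                }
            }
        }
    }
}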
Crawler States
The program lets you view three kinds of state:
The list of crawling threads
The details of each crawling thread
Error messages
Threads view
The threads view lists all of the worker threads. Each thread takes a URI from the URI queue and connects to it.
Requests view
The requests view lists the most recently downloaded pages and shows the details from their HTTP headers.
The header of each request looks something like this:
GET / HTTP/1.0
Host: www.cnn.com
Connection: Keep-Alive
HTTP/1.0 200 OK
Date: Sun, 19 Mar 2006 19:39:05 GMT
Content-Length: 65730
Content-Type: text/html
Expires: Sun, 19 Mar 2006 19:40:05 GMT
Cache-Control: max-age=60, private
Connection: keep-alive
Proxy-Connection: keep-alive
Server: Apache
Last-Modified: Sun, 19 Mar 2006 19:38:58 GMT
Vary: Accept-Encoding,User-Agent
Via: 1.1 webcache (NetCache NetApp/6.0.1P3)
Parsing page ...
Found: 356 ref(s)
http://www.cnn.com/
http://www.cnn.com/search/
http://www.cnn.com/linkto/intl.html
Settings
The program exposes a number of settings, including the allowed MIME types, the destination storage folder, the maximum number of crawling threads, and so on.
File Types
MIME types are the file types the crawler is allowed to download. The crawler comes with a default list of downloadable types, and the user can add, edit, and delete MIME types. The user can also simply choose to allow all MIME types.
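As an illustration only (this is not the article's code, and the default list contents are assumptions), applying the MIME-type setting usually amounts to checking a response's Content-Type against the allowed list before saving it:
using System;
using System.Collections.Generic;

class MimeFilter
{
    // "allow all MIME types" option from the settings dialog
    public bool AllowAllMimeTypes = false;

    // default, user-editable list of downloadable types (contents here are assumptions)
    public List<string> AllowedMimeTypes = new List<string>
        { "text/html", "text/plain", "image/gif", "image/jpeg" };

    // decide whether a response may be saved, based on its Content-Type header
    public bool IsDownloadable(string contentType)
    {
        if (AllowAllMimeTypes) return true;
        if (string.IsNullOrEmpty(contentType)) return false;
        // strip parameters such as "; charset=utf-8" before comparing
        string mime = contentType.Split(';')[0].Trim().ToLowerInvariant();
        return AllowedMimeTypes.Contains(mime);
    }
}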
Output
Output settings include the download folder and the number of requests to keep in the requests view for reviewing request details.
Connections
Connection settings include the following options; a rough sketch of a matching settings class follows the list:
- Thread count: the number of crawling threads working in parallel.
- Thread sleep time when refs queue empty: how long each thread sleeps when the refs queue is empty.
- Thread sleep time between two connections: how long each thread sleeps after handling a request. This value matters because it keeps the crawler from being blocked by hosts for putting too heavy a load on them.
- Connection timeout: the connection timeout for the sockets.
- Navigate through pages to a depth of: the crawl depth.
- Keep same URL server: restricts the crawl to the same host as the original URL.
- Keep connection alive: keeps the socket connection open for subsequent requests, avoiding the reconnect time.
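A hedged sketch of what a class holding these options might look like; the member names and default values are assumptions for illustration, not the article's actual settings class:
class ConnectionSettings
{
    public int ThreadCount = 5;                  // crawling threads working in parallel
    public int SleepWhenQueueEmptyMs = 500;      // sleep when the refs queue is empty
    public int SleepBetweenConnectionsMs = 200;  // sleep after each request, so hosts are not overloaded
    public int ConnectionTimeoutMs = 30000;      // socket connection timeout
    public int NavigationDepth = 3;              // how many levels of links to follow
    public bool KeepSameUrlServer = true;        // stay on the original URL's host
    public bool KeepConnectionAlive = true;      // reuse open sockets to avoid reconnecting
}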

Advanced
Advanced settings include the following; a rough filtering sketch follows the list:
- The code page used to encode the text of downloaded pages.
- A user-defined list of restricted words, so the user can screen out unwanted pages.
- A user-defined list of restricted host extensions, to avoid being blocked by such hosts.
- A user-defined list of restricted file extensions, to avoid parsing non-text data.
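Again purely as an illustration (not the article's code), these lists are typically applied as simple filters: a URL is skipped when its host or file extension is restricted, and a downloaded page is dropped when it contains a restricted word:
using System;
using System.Collections.Generic;
using System.Linq;

class AdvancedFilter
{
    public List<string> RestrictedWords = new List<string>();          // words that mark unwanted pages
    public List<string> RestrictedHostExtensions = new List<string>(); // e.g. ".gov"
    public List<string> RestrictedFileExtensions = new List<string>(); // e.g. ".exe", ".zip"

    // skip URLs whose host or file extension is on a restricted list
    public bool IsUrlAllowed(Uri uri)
    {
        string host = uri.Host.ToLowerInvariant();
        string path = uri.AbsolutePath.ToLowerInvariant();
        return !RestrictedHostExtensions.Any(ext => host.EndsWith(ext)) &&
               !RestrictedFileExtensions.Any(ext => path.EndsWith(ext));
    }

    // drop pages that contain any restricted word
    public bool IsPageAllowed(string pageText)
    {
        return !RestrictedWords.Any(w =>
            pageText.IndexOf(w, StringComparison.OrdinalIgnoreCase) >= 0);
    }
}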
Points of Interest
Keep-alive connections:
Keep-alive is a request from the client asking the server to keep the connection open after the response is finished, so that subsequent requests can reuse it. It is done by adding an HTTP header to the request sent to the server, as in the following request:
GET /CNN/Programs/nancy.grace/ HTTP/1.0
Host: www.cnn.com
Connection: Keep-Alive
HTTP/1.0 200 OK
Date: Sun, 19 Mar 2006 19:38:15 GMT
Content-Length: 29025
Content-Type: text/html
Expires: Sun, 19 Mar 2006 19:39:15 GMT
Cache-Control: max-age=60, private
Connection: keep-alive
Proxy-Connection: keep-alive
Server: Apache
Vary: Accept-Encoding,User-Agent
Last-Modified: Sun, 19 Mar 2006 19:38:15 GMT
Via: 1.1 webcache (NetCache NetApp/6.0.1P3)
The server can end a keep-alive session at any time by replying to a later request with Connection: Close, as in this response:
HTTP/1.0 200 OK
Date: Sun, 19 Mar 2006 19:38:15 GMT
Content-Length: 29025
Content-Type: text/html
Expires: Sun, 19 Mar 2006 19:39:15 GMT
Cache-Control: max-age=60, private
Connection: Close
Server: Apache
Vary: Accept-Encoding,User-Agent
Last-Modified: Sun, 19 Mar 2006 19:38:15 GMT
Via: 1.1 webcache (NetCache NetApp/6.0.1P3)
WebRequest and WebResponse problems:
When I started this article I used the WebRequest and WebResponse classes, as in the following code:
WebRequest request = WebRequest.Create(uri);
WebResponse response = request.GetResponse();
Stream streamIn = response.GetResponseStream();
BinaryReader reader = new BinaryReader(streamIn, TextEncoding);
byte[] RecvBuffer = new byte[10240];
int nBytes, nTotalBytes = 0;
// read the response body in 10 KB chunks
while((nBytes = reader.Read(RecvBuffer, 0, 10240)) > 0)
{
    nTotalBytes += nBytes;
    // ...
}
reader.Close();
streamIn.Close();
response.Close();
To support keep-alive connections and work with the socket directly, the crawler replaces these classes with its own MyWebRequest and MyWebResponse:
request = MyWebRequest.Create(uri, request /* to Keep-Alive */, KeepAlive);
MyWebResponse response = request.GetResponse();
byte[] RecvBuffer = new byte[10240];
int nBytes, nTotalBytes = 0;
// receive the response body straight from the socket
while((nBytes = response.socket.Receive(RecvBuffer, 0,
    10240, SocketFlags.None)) > 0)
{
    nTotalBytes += nBytes;
    // ...
    // with keep-alive the socket stays open, so stop reading once the
    // declared Content-Length has been received
    if(response.KeepAlive && nTotalBytes >= response.ContentLength
        && response.ContentLength > 0)
        break;
}
if(response.KeepAlive == false)
    response.Close();
/* reading the response header, byte by byte, until the blank line that terminates it */
Header = "";
byte[] bytes = new byte[10];
while(socket.Receive(bytes, 0, 1, SocketFlags.None) > 0)
{
    Header += Encoding.ASCII.GetString(bytes, 0, 1);
    if(bytes[0] == '\n' && Header.EndsWith("\r\n\r\n"))
        break;
}
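Once the raw header string has been collected this way, fields such as Content-Length and Connection can be pulled out of it. The following is only a sketch of that step, not MyWebResponse's actual parsing code:
using System;

static class HeaderParsing
{
    // extract Content-Length and the keep-alive flag from a raw HTTP header block
    public static void Parse(string header, out int contentLength, out bool keepAlive)
    {
        contentLength = -1;
        keepAlive = false;
        foreach (string line in header.Split(
            new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries))
        {
            int colon = line.IndexOf(':');
            if (colon < 0) continue;                       // skips the status line as well
            string name  = line.Substring(0, colon).Trim();
            string value = line.Substring(colon + 1).Trim();
            if (name.Equals("Content-Length", StringComparison.OrdinalIgnoreCase))
                int.TryParse(value, out contentLength);
            else if (name.Equals("Connection", StringComparison.OrdinalIgnoreCase))
                keepAlive = value.Equals("keep-alive", StringComparison.OrdinalIgnoreCase);
        }
    }
}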
// number of running threads
private int nThreadCount;
private int ThreadCount
{
    get { return nThreadCount; }
    set
    {
        Monitor.Enter(this.listViewThreads);
        try
        {
            for(int nIndex = 0; nIndex < value; nIndex++)
            {
                // check if thread not created or not suspended
                if(threadsRun[nIndex] == null ||
                    threadsRun[nIndex].ThreadState != ThreadState.Suspended)
                {
                    // create new thread
                    threadsRun[nIndex] = new Thread(new ThreadStart(ThreadRunFunction));
                    // set thread name equal to its index
                    threadsRun[nIndex].Name = nIndex.ToString();
                    // start thread working function
                    threadsRun[nIndex].Start();
                    // check if the thread hasn't been added to the view yet
                    if(nIndex == this.listViewThreads.Items.Count)
                    {
                        // add a new line in the view for the new thread
                        ListViewItem item = this.listViewThreads.Items.Add(
                            (nIndex+1).ToString(), 0);
                        string[] subItems = { "", "", "", "0", "0%" };
                        item.SubItems.AddRange(subItems);
                    }
                }
                // check if the thread is suspended
                else if(threadsRun[nIndex].ThreadState == ThreadState.Suspended)
                {
                    // get thread item from the list
                    ListViewItem item = this.listViewThreads.Items[nIndex];
                    item.ImageIndex = 1;
                    item.SubItems[2].Text = "Resume";
                    // resume the thread
                    threadsRun[nIndex].Resume();
                }
            }
            // update the thread count value
            nThreadCount = value;
        }
        catch(Exception)
        {
        }
        Monitor.Exit(this.listViewThreads);
    }
}
// add a URI to the shared refs queue (thread-safe)
void EnqueueUri(MyUri uri)
{
    Monitor.Enter(queueURLS);
    try
    {
        queueURLS.Enqueue(uri);
    }
    catch(Exception)
    {
    }
    Monitor.Exit(queueURLS);
}

// take the next URI from the shared refs queue, or null if the queue is empty
MyUri DequeueUri()
{
    Monitor.Enter(queueURLS);
    MyUri uri = null;
    try
    {
        uri = (MyUri)queueURLS.Dequeue();
    }
    catch(Exception)
    {
    }
    Monitor.Exit(queueURLS);
    return uri;
}
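The Monitor.Enter/Monitor.Exit pairs above are what C#'s lock statement expands to (lock additionally releases the monitor in a finally block if an exception escapes). Assuming the same queueURLS field and MyUri type from the code above, EnqueueUri could equivalently be written as:
void EnqueueUri(MyUri uri)
{
    // lock acquires the monitor on queueURLS and releases it even if Enqueue throws
    lock (queueURLS)
    {
        queueURLS.Enqueue(uri);
    }
}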