pre-emptive multithreading web spider铪铪铪



 


This article was contributed by Sim Ayers.


The Win32 API supports applications that are pre-emptively multithreaded. This is a very useful and powerful feature of Win32 for writing MFC Internet spiders. The Spider project is an example of how to use pre-emptive multithreading to gather information on the web with a spider/robot using the MFC WinInet classes.

This project produces a spidering program that checks web sites for broken URL links. Link verification is done only on HREF links. It displays a continuously updated list of URLs in a CListView that reports the status of each HREF link. The project could be used as a template for gathering and indexing information to be stored in a database file for queries.

Search engines gather information on the web using programs called robots. Robots (also called web crawlers, spiders, worms, web wanderers, and scooters) automatically gather and index information from around the web, and then put that information into databases. (Note that a robot will index a page, and then follow the links on that page as a source for new URLs to index.) Users can then construct queries to search these databases to find the information they want.

By using pre-emptive multithreading, you can index a web page of URL links and start a new thread to follow each new URL link to a new source of URLs to index.

The project uses an MDI CDocument with a custom MDI child frame to display a CEditView when downloading web pages and a CListView when checking URL links. The project also uses the CObArray, CInternetSession, CHttpConnection, CHttpFile, and CWinThread MFC classes. The CWinThread class is used to produce multiple threads instead of using the asynchronous mode in CInternetSession, which is really a holdover from the 16-bit Windows WinSock platform.

The Spider project uses simple worker threads to check URL links or download a web page. The CSpiderThread class is derived from the CWinThread class, so each CSpiderThread object can use the CWinThread MESSAGE_MAP() functions. By declaring a DECLARE_MESSAGE_MAP() in the CSpiderThread class, the user interface stays responsive to user input. This means you can check the URL links on one web server and at the same time download and open a web page from another web server. The only time the user interface will become unresponsive to user input is when the thread count exceeds MAXIMUM_WAIT_OBJECTS, which is defined as 64.

In the constructor for each new CSpiderThread object, we supply the ThreadProc function and the thread parameters to be passed to the ThreadProc function.

	CSpiderThread* pThread;
	pThread = NULL;
	pThread = new CSpiderThread(CSpiderThread::ThreadFunc, pThreadParams); // create a new CSpiderThread object

In the CSpiderThread constructor we set the CWinThread* m_pThread pointer in the thread parameters structure, so we can point to the correct instance of this thread:

pThreadParams->m_pThread = this;

The CSpiderThread ThreadProc function:

// simple worker thread proc function
UINT CSpiderThread::ThreadFunc(LPVOID pParam)
{
	ThreadParams* lpThreadParams = (ThreadParams*)pParam;
	CSpiderThread* lpThread = (CSpiderThread*)lpThreadParams->m_pThread;

	lpThread->ThreadRun(lpThreadParams);

	// Use SendMessage instead of PostMessage here to keep the current thread count
	// synchronized. If the number of threads is greater than MAXIMUM_WAIT_OBJECTS (64)
	// the program will become unresponsive to user input.
	::SendMessage(lpThreadParams->m_hwndNotifyProgress,
		WM_USER_THREAD_DONE, 0, (LPARAM)lpThreadParams); // deletes lpThreadParams and decrements the thread count
	return 0;
}

The structure passed to the CSpiderThread ThreadProc function:

typedef struct tagThreadParams
{
	HWND m_hwndNotifyProgress;
	HWND m_hwndNotifyView;
	CWinThread* m_pThread;
	CString m_pszURL;
	CString m_Contents;
	CString m_strServerName;
	CString m_strObject;
	CString m_CheckURLName;
	CString m_String;
	DWORD m_dwServiceType;
	DWORD m_ThreadID;
	DWORD m_Status;
	URLStatus m_pStatus;
	INTERNET_PORT m_nPort;
	int m_Type;
	BOOL m_RootLinks;
} ThreadParams;

After the CSpiderThread object has been created, we use the CreateThread function to start the execution of the new thread object.

	if (!pThread->CreateThread())   // starts execution of a CWinThread object
	{
		AfxMessageBox("Cannot start new thread");
		delete pThread;
		pThread = NULL;
		delete pThreadParams;
		return FALSE;
	}

Once the new thread is running, we use the ::SendMessage function to send messages to the CDocument's CListView with the status structure of the URL link.

	if (pThreadParams->m_hwndNotifyView != NULL)
		::SendMessage(pThreadParams->m_hwndNotifyView, WM_USER_CHECK_DONE, 0, (LPARAM)&pThreadParams->m_pStatus);

The structure used for URL status:

typedef struct tagURLStatus
{
	CString m_URL;
	CString m_URLPage;
	CString m_StatusString;
	CString m_LastModified;
	CString m_ContentType;
	CString m_ContentLength;
	DWORD   m_Status;
} URLStatus, *PURLStatus;

Each new thread creates a new CMyInternetSession (derived from CInternetSession) object with EnableStatusCallback set to TRUE, so we can check the status on all InternetSession callbacks. The dwContext ID for callbacks is set to the thread ID.

BOOL CInetThread::InitServer()
{
	try
	{
		m_pSession = new CMyInternetSession(AgentName, m_nThreadID);
		int ntimeout = 30;  // very important; can cause a server time-out if set too low,
		                    // or hang the thread if set too high

		/* The time-out value in milliseconds to use for Internet connection requests.
		If a connection request takes longer than this time-out, the request is canceled.
		The default time-out is infinite. */
		m_pSession->SetOption(INTERNET_OPTION_CONNECT_TIMEOUT, 1000 * ntimeout);

		/* The delay value in milliseconds to wait between connection retries. */
		m_pSession->SetOption(INTERNET_OPTION_CONNECT_BACKOFF, 1000);

		/* The retry count to use for Internet connection requests. If a connection
		attempt still fails after the specified number of tries, the request is canceled.
		The default is five. */
		m_pSession->SetOption(INTERNET_OPTION_CONNECT_RETRIES, 1);

		m_pSession->EnableStatusCallback(TRUE);
	}
	catch (CInternetException* pEx)
	{
		// catch errors from WinInet
		//pEx->ReportError();
		m_pSession = NULL;
		pEx->Delete();
		return FALSE;
	}
	return TRUE;
}

The key to using the MFC WinInet classes in a single-threaded or multithreaded program is to surround all MFC WinInet class functions with try/catch block statements. The Internet is very unstable at times, or the web page you are requesting may no longer exist, which is guaranteed to throw a CInternetException error.

	try
	{
		// some MFC WinInet class function
	}
	catch (CInternetException* pEx)
	{
		// catch errors from WinInet
		//pEx->ReportError();
		pEx->Delete();
		return FALSE;
	}

The maximum count of threads is initially set to 64, but you can configure it to any number between 1 and 100. A number that is too high will result in failed connections, which means you will have to recheck the URL links.

A rapid-fire succession of HTTP requests in a /cgi-bin/ directory could bring a server to its knees. The Spider program sends out about 4 HTTP requests a second; 4 * 60 = 240 a minute, which can also bring a server to its knees. Be careful about which servers you check. Each server keeps a log with the IP address of the agent that requested the web file. You might get some nasty email from an angry web server administrator.

You can prevent any directory from being indexed by creating a robots.txt file for that directory. This mechanism is usually used to protect /cgi-bin/ directories, since CGI scripts take more server resources to retrieve.

When the Spider program checks URL links, its goal is to not request too many documents too quickly. The Spider program adheres somewhat to the Standard for Robot Exclusion. This standard is a joint agreement between robot developers that allows WWW sites to limit which URLs a robot requests. By using the standard to limit access, the robot will not retrieve any documents that web servers wish to disallow.

Before checking the root URL, the program checks to see if there is a robots.txt file in the main directory. If the Spider program finds a robots.txt file, the program will abort the search. The program also checks for the META tag in all web pages. If it finds a <meta name="robots" content="noindex,nofollow"> tag, it will not index the URLs on that page.

Build:
Windows 95
MFC/VC++ 5.0
WININET.H dated 9/25/97
WININET.LIB dated 9/16/97
WININET.DLL dated 9/18/97

Problems:
Can't seem to keep the thread count below 64 at all times.
Limit of 32,767 URL links in the CListView.
Doesn't parse all URLs correctly; the program occasionally crashes when using CString functions with complex URLs.

Resources:
Internet Tools - Fred Forester
Multithreading Applications in Win32
Win32 Multithreaded Programming

Download source code and example (65 KB)

Last updated: 21 June 1998



