Overall flowchart for crawling the forum pages:

1. Start.
2. Enter a keyword and obtain the keyword search-results page.
3. Check whether the result is an app page. If not, the crawl of this result ends.
4. If it is, log in; otherwise posttime cannot be obtained.
5. Download the page and extract tid and posttime.
6. Reply to the thread using tid and posttime.
7. Download the page again and crawl the apk download link.
8. End.

Flowchart of how SpiderTrigger invokes Spider:

1. Start.
2. Spider.setDownloader().
3. Spider.addUrl(); internally, addUrl() calls addRequest(new Request(url)) and then signalNewUrl().
4. Spider.thread(20), which starts 20 threads.
5. run().

Flowchart of the Spider workflow (Spider.run()):

1. Spider.run() begins.
2. checkRunningStat() checks whether the spider is already running.
3. initComponent():
   - If the downloader is null, the default HttpClientDownloader is used.
   - downloader.setThread(int thread) is called; setThread() calls httpClientGenerator.setPoolSize(thread), which builds a thread pool sized according to thread.
4. A request is taken from the toDoQueue.
5. If the request is null (that is, the toDoQueue is empty): when no thread in the threadPool is still alive, break out of the loop; otherwise waitNewUrl().
6. Otherwise, the request is assigned to the final variable requestFinal, and the threadPool starts a thread that calls processRequest().
7. If the request needs to be retried: rmRetryRequestFromQueue(page), then extractAndAddRequests(page, true), then sleep(site.getSleepTime()).
8. Otherwise: apk = pageProcessor.process(page), then extractAndAddRequests(page, spawnUrl).

Sequence diagram of the interaction among SpiderTrigger, Spider, MyHttpClientDownloader, and PageProcessor:

- SpiderTrigger -> Spider: setDownloader()
- SpiderTrigger -> Spider: addUrl()
- SpiderTrigger -> Spider: thread(20), starting 20 spider threads for crawling
- Spider: run() starts the thread
- Spider: checkRunningStat() checks whether the spider is running
- Spider: initComponent() checks or initializes the downloader, the thread pool, and the other components
- Spider: a child thread is started to call processRequest()
- Spider -> MyHttpClientDownloader: processRequest() calls downloader.download(request, Spider); here the page is downloaded with a GET request
- MyHttpClientDownloader -> Spider: return page
- Spider: if the request needs to be retried, it must first be removed from the donePushQueue; the removed element is the head of the queue. Then sleep().
- Spider -> PageProcessor: process(page) to obtain the apk
- PageProcessor -> Spider: return the apk

Flowchart of download() in the Downloader, which downloads a page:

1. downloader.download() begins.
2. site = task.getSite(); the Site is supplied by the PageProcessor when the Spider object is created.
3. The page is then downloaded with httpClient.
4. getHttpUriRequest() obtains the HttpUriRequest:
   - selectRequestMethod() obtains the requestBuilder: String method = request.getMethod() gets the request method (defaulting to GET when it is null); the method is set and the corresponding requestBuilder is returned.
   - The headers are set.
   - requestConfigBuilder is configured, including the proxy, timeouts, and so on, and is passed to requestBuilder.
5. The resulting httpUriRequest has its request headers set and is executed to obtain an httpResponse.
6. On success, handleResponse() produces the page, which is returned.
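The polling loop of Spider.run() described above can be sketched with the JDK's concurrency utilities. This is a minimal stand-in, not WebMagic's actual code: toDoQueue, threadPool, requestFinal, and processRequest() here are simplified stand-ins named after the walkthrough, and the break/waitNewUrl() branch is collapsed into a plain break.

```java
import java.util.concurrent.*;

public class SpiderLoopSketch {
    // Simplified stand-ins for the spider's internals described in the post.
    static final BlockingQueue<String> toDoQueue = new LinkedBlockingQueue<>();
    static final ExecutorService threadPool = Executors.newFixedThreadPool(4);
    static int processed = 0;

    // Stand-in for processRequest(): just count the handled requests.
    static void processRequest(String request) {
        synchronized (SpiderLoopSketch.class) { processed++; }
    }

    public static void main(String[] args) throws Exception {
        toDoQueue.add("http://example.com/1");
        toDoQueue.add("http://example.com/2");
        while (true) {
            // Take a request from the to-do queue; null means the queue is empty.
            String request = toDoQueue.poll(100, TimeUnit.MILLISECONDS);
            if (request == null) {
                // The real Spider checks whether any worker thread is still
                // alive and either breaks or calls waitNewUrl(); we just stop.
                break;
            }
            final String requestFinal = request; // final copy for the worker thread
            threadPool.execute(() -> processRequest(requestFinal));
        }
        threadPool.shutdown();
        threadPool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(processed); // prints 2
    }
}
```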
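The retry step in the sequence diagram (a request that needs retrying is first removed from the donePushQueue, and the removed element is the queue head) can be illustrated with a plain Deque. popRetry() is a hypothetical helper for this sketch, not part of the framework.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class RetrySketch {
    // On retry, the element removed from donePushQueue is the queue head,
    // so it can then be re-queued for another download attempt.
    static String popRetry(Deque<String> donePushQueue) {
        return donePushQueue.poll();
    }

    public static void main(String[] args) {
        Deque<String> donePushQueue = new ArrayDeque<>();
        donePushQueue.add("http://example.com/thread?tid=1");
        donePushQueue.add("http://example.com/thread?tid=2");

        // The first request failed and needs a retry: remove the head.
        String retried = popRetry(donePushQueue);
        System.out.println(retried);              // the head element
        System.out.println(donePushQueue.size()); // one element remains
    }
}
```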
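The getHttpUriRequest()/selectRequestMethod() steps (pick the request method, defaulting to GET when request.getMethod() is null; set headers; apply the timeout configuration) can be sketched with the JDK's java.net.http builder in place of Apache HttpClient's RequestBuilder. buildRequest() and the header value are illustrative assumptions, not the framework's code.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

public class BuildRequestSketch {
    // Mirrors the described flow: choose the method (GET by default),
    // set a header, and apply a timeout from Site-like configuration.
    static HttpRequest buildRequest(String url, String method, Duration timeout) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url))
                .timeout(timeout)                            // requestConfig step
                .header("User-Agent", "spider-sketch/0.1");  // header step
        if (method == null || method.equalsIgnoreCase("GET")) {
            builder.GET();                                   // default when method is null
        } else if (method.equalsIgnoreCase("POST")) {
            builder.POST(HttpRequest.BodyPublishers.noBody());
        } else {
            throw new IllegalArgumentException("Illegal request method: " + method);
        }
        return builder.build();
    }

    public static void main(String[] args) {
        HttpRequest req = buildRequest("http://example.com/", null, Duration.ofSeconds(10));
        System.out.println(req.method()); // prints GET
    }
}
```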