Crawling data from the web is one of the hottest activities of the current big-data era. There are plenty of methods, languages, and frameworks for the job: the most mainstream crawling languages today, Python and R, make scraping both convenient and fast. But for the many Java users out there, Java handles data crawling perfectly well too, for example with the widely used webmagic framework, or a combination of httpclient + HtmlCleaner + XPath, and so on.
Crawling data mainly involves the following steps:
1. Download the page at a given URL
2. Parse the downloaded page
3. Manage the URLs to be crawled
4. Store the downloaded data
...and so on. A rough sketch of how these stages fit together is given right below.
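Before getting to the real code, here is a minimal, hypothetical sketch of how those stages might be wired together. The CrawlerSketch/Downloader/Parser/Store names are mine, purely for illustration; they are not part of any framework.

import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// A minimal, hypothetical sketch of the four stages wired together;
// the interface names are illustrative, not part of any framework.
public class CrawlerSketch {
    interface Downloader { String download(String url); }           // 1. fetch the page body
    interface Parser { List<String> extractLinks(String html); }    // 2. parse it for new URLs
    interface Store { void save(String url, String html); }         // 4. persist the result

    public static void crawl(String seed, Downloader d, Parser p, Store s) {
        Queue<String> frontier = new ArrayDeque<>();                 // 3. URL management
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            String html = d.download(url);
            s.save(url, html);
            frontier.addAll(p.extractLinks(html));
        }
    }
}

In this post only the Downloader part matters; the rest is just context.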
Today the main task is to analyze the download code.
public Page startDown(String url) {
    Page page = new Page();
    // Build a default HttpClient
    HttpClientBuilder custom = HttpClients.custom();
    CloseableHttpClient build = custom.build();
    // Wrap the URL in a GET request
    HttpUriRequest request = new HttpGet(url);
    try {
        // Execute the request and read the response body as a string
        CloseableHttpResponse execute = build.execute(request);
        HttpEntity entity = execute.getEntity();
        String string = EntityUtils.toString(entity);
        page.setUrl(url);
        page.setPageString(string);
        execute.close(); // release the connection
    } catch (ClientProtocolException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return page;
}
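A quick note before the analysis: Page is just a container class of my own; its source is not shown here, but something as small as the following is enough for startDown to compile (the field names are inferred from the setters used above).

// A guess at the Page container used above, inferred from the setters the method calls.
public class Page {
    private String url;         // the URL that was downloaded
    private String pageString;  // the raw HTML of the page

    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }

    public String getPageString() { return pageString; }
    public void setPageString(String pageString) { this.pageString = pageString; }
}

Calling startDown("https://example.com") would then give you a Page whose getPageString() holds the raw HTML.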
The startDown method above is the core of the page download; let's analyze it. Before looking at how the download actually happens, imagine for a moment: what does downloading a page generally require?
1. A network connection. Without one you cannot even open a web page, let alone download it, so the first thing to care about is the network itself (see the sketch right after this list).
2. Given a connection, connect to the target described by the URL.
3. And then, obviously, download the page content.
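For point 1, startDown itself never probes connectivity; HttpClient simply throws an IOException when the network is unavailable. If you did want an explicit pre-check, a rough sketch using plain java.net (nothing HttpClient requires) might look like this:

import java.io.IOException;
import java.net.InetAddress;
import java.net.URI;

// Hypothetical pre-check: resolve the host from the URL and probe it.
// Note: isReachable relies on ICMP/echo and may be blocked even when HTTP works,
// so treat a negative result as a hint, not proof.
public class NetCheck {
    public static boolean hostReachable(String url, int timeoutMillis) {
        try {
            String host = URI.create(url).getHost();
            return host != null && InetAddress.getByName(host).isReachable(timeoutMillis);
        } catch (IOException e) {
            return false;
        }
    }
}

With that aside out of the way, back to the first line of startDown: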
HttpClientBuilder custom = HttpClients.custom();
What does this line really do? It simply creates an HttpClientBuilder:
public static HttpClientBuilder custom() {
    return HttpClientBuilder.create();
}
Looking one level further down:
public static HttpClientBuilder create() {
    return new HttpClientBuilder();
}
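As an aside: if you do not plan to customize anything, HttpClients.createDefault() is the usual shortcut and ends up building essentially the same default client. A tiny sketch of the two spellings side by side:

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class BuilderShortcut {
    public static void main(String[] args) throws Exception {
        // The two-step custom().build() used in startDown()...
        CloseableHttpClient viaBuilder = HttpClients.custom().build();
        // ...yields the same kind of default client as the one-step shortcut.
        CloseableHttpClient viaDefault = HttpClients.createDefault();
        viaBuilder.close();
        viaDefault.close();
    }
}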
Now that the builder exists, what is it for? Building the client that will actually send our requests:
CloseableHttpClient build = custom.build();
Let's look inside build():
public CloseableHttpClient build() {
// Create main request executor
// We copy the instance fields to avoid changing them, and rename to avoid accidental use of the wrong version
PublicSuffixMatcher publicSuffixMatcherCopy = this.publicSuffixMatcher;
if (publicSuffixMatcherCopy == null) {
publicSuffixMatcherCopy = PublicSuffixMatcherLoader.getDefault();
}
HttpRequestExecutor requestExecCopy = this.requestExec;
if (requestExecCopy == null) {
requestExecCopy = new HttpRequestExecutor();
}
HttpClientConnectionManager connManagerCopy = this.connManager;
if (connManagerCopy == null) {
LayeredConnectionSocketFactory sslSocketFactoryCopy = this.sslSocketFactory;
if (sslSocketFactoryCopy == null) {
final String[] supportedProtocols = systemProperties ? split(
System.getProperty("https.protocols")) : null;
final String[] supportedCipherSuites = systemProperties ? split(
System.getProperty("https.cipherSuites")) : null;
HostnameVerifier hostnameVerifierCopy = this.hostnameVerifier;
if (hostnameVerifierCopy == null) {
hostnameVerifierCopy = new DefaultHostnameVerifier(publicSuffixMatcherCopy);
}
if (sslcontext != null) {
sslSocketFactoryCopy = new SSLConnectionSocketFactory(
sslcontext, supportedProtocols, supportedCipherSuites, hostnameVerifierCopy);
} else {
if (systemProperties) {
sslSocketFactoryCopy = new SSLConnectionSocketFactory(
(SSLSocketFactory) SSLSocketFactory.getDefault(),
supportedProtocols, supportedCipherSuites, hostnameVerifierCopy);
} else {
sslSocketFactoryCopy = new SSLConnectionSocketFactory(
SSLContexts.createDefault(),
hostnameVerifierCopy);
}
}
}
@SuppressWarnings("resource")
final PoolingHttpClientConnectionManager poolingmgr = new PoolingHttpClientConnectionManager(
RegistryBuilder.<ConnectionSocketFactory>create()
.register("http", PlainConnectionSocketFactory.getSocketFactory())
.register("https", sslSocketFactoryCopy)
.build(),
null,
null,
null,
connTimeToLive,
connTimeToLiveTimeUnit != null ? connTimeToLiveTimeUnit : TimeUnit.MILLISECONDS);
if (defaultSocketConfig != null) {
poolingmgr.setDefaultSocketConfig(defaultSocketConfig);
}
if (defaultConnectionConfig != null) {
poolingmgr.setDefaultConnectionConfig(defaultConnectionConfig);
}
if (systemProperties) {
String s = System.getProperty("http.keepAlive", "true");
if ("true".equalsIgnoreCase(s)) {
s = System.getProperty("http.maxConnections", "5");
final int max = Integer.parseInt(s);
poolingmgr.setDefaultMaxPerRoute(max);
poolingmgr.setMaxTotal(2 * max);
}
}
if (maxConnTotal > 0) {
poolingmgr.setMaxTotal(maxConnTotal);
}
if (maxConnPerRoute > 0) {
poolingmgr.setDefaultMaxPerRoute(maxConnPerRoute);
}
connManagerCopy = poolingmgr;
}
ConnectionReuseStrategy reuseStrategyCopy = this.reuseStrategy;
if (reuseStrategyCopy == null) {
if (systemProperties) {
final String s = System.getProperty("http.keepAlive", "true");
if ("true".equalsIgnoreCase(s)) {
reuseStrategyCopy = DefaultConnectionReuseStrategy.INSTANCE;
} else {
reuseStrategyCopy = NoConnectionReuseStrategy.INSTANCE;
}
} else {
reuseStrategyCopy = DefaultConnectionReuseStrategy.INSTANCE;
}
}
ConnectionKeepAliveStrategy keepAliveStrategyCopy = this.keepAliveStrategy;
if (keepAliveStrategyCopy == null) {
keepAliveStrategyCopy = DefaultConnectionKeepAliveStrategy.INSTANCE;
}
AuthenticationStrategy targetAuthStrategyCopy = this.targetAuthStrategy;
if (targetAuthStrategyCopy == null) {
targetAuthStrategyCopy = TargetAuthenticationStrategy.INSTANCE;
}
AuthenticationStrategy proxyAuthStrategyCopy = this.proxyAuthStrategy;
if (proxyAuthStrategyCopy == null) {
proxyAuthStrategyCopy = ProxyAuthenticationStrategy.INSTANCE;
}
UserTokenHandler userTokenHandlerCopy = this.userTokenHandler;
if (userTokenHandlerCopy == null) {
if (!connectionStateDisabled) {
userTokenHandlerCopy = DefaultUserTokenHandler.INSTANCE;
} else {
userTokenHandlerCopy = NoopUserTokenHandler.INSTANCE;
}
}
String userAgentCopy = this.userAgent;
if (userAgentCopy == null) {
if (systemProperties) {
userAgentCopy = System.getProperty("http.agent");
}
if (userAgentCopy == null) {
userAgentCopy = VersionInfo.getUserAgent("Apache-HttpClient",
"org.apache.http.client", getClass());
}
}
ClientExecChain execChain = createMainExec(
requestExecCopy,
connManagerCopy,
reuseStrategyCopy,
keepAliveStrategyCopy,
new ImmutableHttpProcessor(new RequestTargetHost(), new RequestUserAgent(userAgentCopy)),
targetAuthStrategyCopy,
proxyAuthStrategyCopy,
userTokenHandlerCopy);
execChain = decorateMainExec(execChain);
HttpProcessor httpprocessorCopy = this.httpprocessor;
if (httpprocessorCopy == null) {
final HttpProcessorBuilder b = HttpProcessorBuilder.create();
if (requestFirst != null) {
for (final HttpRequestInterceptor i: requestFirst) {
b.addFirst(i);
}
}
if (responseFirst != null) {
for (final HttpResponseInterceptor i: responseFirst) {
b.addFirst(i);
}
}
b.addAll(
new RequestDefaultHeaders(defaultHeaders),
new RequestContent(),
new RequestTargetHost(),
new RequestClientConnControl(),
new RequestUserAgent(userAgentCopy),
new RequestExpectContinue());
if (!cookieManagementDisabled) {
b.add(new RequestAddCookies());
}
if (!contentCompressionDisabled) {
if (contentDecoderMap != null) {
final List<String> encodings = new ArrayList<String>(contentDecoderMap.keySet());
Collections.sort(encodings);
b.add(new RequestAcceptEncoding(encodings));
} else {
b.add(new RequestAcceptEncoding());
}
}
if (!authCachingDisabled) {
b.add(new RequestAuthCache());
}
if (!cookieManagementDisabled) {
b.add(new ResponseProcessCookies());
}
if (!contentCompressionDisabled) {
if (contentDecoderMap != null) {
final RegistryBuilder<InputStreamFactory> b2 = RegistryBuilder.create();
for (Map.Entry<String, InputStreamFactory> entry: contentDecoderMap.entrySet()) {
b2.register(entry.getKey(), entry.getValue());
}
b.add(new ResponseContentEncoding(b2.build()));
} else {
b.add(new ResponseContentEncoding());
}
}
if (requestLast != null) {
for (final HttpRequestInterceptor i: requestLast) {
b.addLast(i);
}
}
if (responseLast != null) {
for (final HttpResponseInterceptor i: responseLast) {
b.addLast(i);
}
}
httpprocessorCopy = b.build();
}
execChain = new ProtocolExec(execChain, httpprocessorCopy);
execChain = decorateProtocolExec(execChain);
// Add request retry executor, if not disabled
if (!automaticRetriesDisabled) {
HttpRequestRetryHandler retryHandlerCopy = this.retryHandler;
if (retryHandlerCopy == null) {
retryHandlerCopy = DefaultHttpRequestRetryHandler.INSTANCE;
}
execChain = new RetryExec(execChain, retryHandlerCopy);
}
HttpRoutePlanner routePlannerCopy = this.routePlanner;
if (routePlannerCopy == null) {
SchemePortResolver schemePortResolverCopy = this.schemePortResolver;
if (schemePortResolverCopy == null) {
schemePortResolverCopy = DefaultSchemePortResolver.INSTANCE;
}
if (proxy != null) {
routePlannerCopy = new DefaultProxyRoutePlanner(proxy, schemePortResolverCopy);
} else if (systemProperties) {
routePlannerCopy = new SystemDefaultRoutePlanner(
schemePortResolverCopy, ProxySelector.getDefault());
} else {
routePlannerCopy = new DefaultRoutePlanner(schemePortResolverCopy);
}
}
// Add redirect executor, if not disabled
if (!redirectHandlingDisabled) {
RedirectStrategy redirectStrategyCopy = this.redirectStrategy;
if (redirectStrategyCopy == null) {
redirectStrategyCopy = DefaultRedirectStrategy.INSTANCE;
}
execChain = new RedirectExec(execChain, routePlannerCopy, redirectStrategyCopy);
}
// Optionally, add service unavailable retry executor
final ServiceUnavailableRetryStrategy serviceUnavailStrategyCopy = this.serviceUnavailStrategy;
if (serviceUnavailStrategyCopy != null) {
execChain = new ServiceUnavailableRetryExec(execChain, serviceUnavailStrategyCopy);
}
// Optionally, add connection back-off executor
if (this.backoffManager != null && this.connectionBackoffStrategy != null) {
execChain = new BackoffStrategyExec(execChain, this.connectionBackoffStrategy, this.backoffManager);
}
Lookup<AuthSchemeProvider> authSchemeRegistryCopy = this.authSchemeRegistry;
if (authSchemeRegistryCopy == null) {
authSchemeRegistryCopy = RegistryBuilder.<AuthSchemeProvider>create()
.register(AuthSchemes.BASIC, new BasicSchemeFactory())
.register(AuthSchemes.DIGEST, new DigestSchemeFactory())
.register(AuthSchemes.NTLM, new NTLMSchemeFactory())
.register(AuthSchemes.SPNEGO, new SPNegoSchemeFactory())
.register(AuthSchemes.KERBEROS, new KerberosSchemeFactory())
.build();
}
Lookup<CookieSpecProvider> cookieSpecRegistryCopy = this.cookieSpecRegistry;
if (cookieSpecRegistryCopy == null) {
final CookieSpecProvider defaultProvider = new DefaultCookieSpecProvider(publicSuffixMatcherCopy);
final CookieSpecProvider laxStandardProvider = new RFC6265CookieSpecProvider(
RFC6265CookieSpecProvider.CompatibilityLevel.RELAXED, publicSuffixMatcherCopy);
final CookieSpecProvider strictStandardProvider = new RFC6265CookieSpecProvider(
RFC6265CookieSpecProvider.CompatibilityLevel.STRICT, publicSuffixMatcherCopy);
cookieSpecRegistryCopy = RegistryBuilder.<CookieSpecProvider>create()
.register(CookieSpecs.DEFAULT, defaultProvider)
.register("best-match", defaultProvider)
.register("compatibility", defaultProvider)
.register(CookieSpecs.STANDARD, laxStandardProvider)
.register(CookieSpecs.STANDARD_STRICT, strictStandardProvider)
.register(CookieSpecs.NETSCAPE, new NetscapeDraftSpecProvider())
.register(CookieSpecs.IGNORE_COOKIES, new IgnoreSpecProvider())
.build();
}
CookieStore defaultCookieStore = this.cookieStore;
if (defaultCookieStore == null) {
defaultCookieStore = new BasicCookieStore();
}
CredentialsProvider defaultCredentialsProvider = this.credentialsProvider;
if (defaultCredentialsProvider == null) {
if (systemProperties) {
defaultCredentialsProvider = new SystemDefaultCredentialsProvider();
} else {
defaultCredentialsProvider = new BasicCredentialsProvider();
}
}
List<Closeable> closeablesCopy = closeables != null ? new ArrayList<Closeable>(closeables) : null;
if (!this.connManagerShared) {
if (closeablesCopy == null) {
closeablesCopy = new ArrayList<Closeable>(1);
}
final HttpClientConnectionManager cm = connManagerCopy;
if (evictExpiredConnections || evictIdleConnections) {
final IdleConnectionEvictor connectionEvictor = new IdleConnectionEvictor(cm,
maxIdleTime > 0 ? maxIdleTime : 10, maxIdleTimeUnit != null ? maxIdleTimeUnit : TimeUnit.SECONDS);
closeablesCopy.add(new Closeable() {
@Override
public void close() throws IOException {
connectionEvictor.shutdown();
}
});
connectionEvictor.start();
}
closeablesCopy.add(new Closeable() {
@Override
public void close() throws IOException {
cm.shutdown();
}
});
}
return new InternalHttpClient(
execChain,
connManagerCopy,
routePlannerCopy,
cookieSpecRegistryCopy,
authSchemeRegistryCopy,
defaultCookieStore,
defaultCredentialsProvider,
defaultRequestConfig != null ? defaultRequestConfig : RequestConfig.DEFAULT,
closeablesCopy);
}
}
A headache to read, right? In essence, build() checks every piece of network-related configuration, fills in defaults where nothing was set, and assembles the client that will create the network connections.
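The upside of all those null checks is that every default can be overridden before build() is called. As a hedged example (the setters are real HttpClientBuilder methods, the values are arbitrary), a crawler will typically at least want timeouts, a User-Agent, and connection-pool limits:

import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class CustomClient {
    public static CloseableHttpClient build() {
        // Connect/read timeouts so a slow site cannot hang the crawler (values are arbitrary).
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(5000)
                .setSocketTimeout(10000)
                .build();
        return HttpClients.custom()
                .setDefaultRequestConfig(config) // becomes defaultRequestConfig in build()
                .setUserAgent("my-crawler/0.1")  // becomes userAgentCopy above
                .setMaxConnTotal(50)             // maxConnTotal for the pooling manager
                .setMaxConnPerRoute(10)          // maxConnPerRoute likewise
                .build();
    }
}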
Next comes actually going out on the network.
HttpUriRequest request=new HttpGet(url);
Seeing this, you should immediately think of a GET request, and that is exactly what it is:
public HttpGet(final String uri) {
super();
setURI(URI.create(uri));
}
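HttpGet only wraps the URI; request headers are set on the request object separately. For crawling it is common to add a User-Agent and Accept header at this point; a small sketch (the header values are just examples):

import org.apache.http.client.methods.HttpGet;

public class RequestSetup {
    public static HttpGet buildRequest(String url) {
        HttpGet get = new HttpGet(url);
        // Optional headers; many sites respond differently without a browser-like User-Agent.
        get.setHeader("User-Agent", "Mozilla/5.0 (compatible; my-crawler/0.1)");
        get.setHeader("Accept", "text/html,application/xhtml+xml");
        return get;
    }
}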
At this point the request object is ready; execute() is what actually sends it over the network and returns the response:
CloseableHttpResponse execute = build.execute(request);
@Override
public CloseableHttpResponse execute(
final HttpUriRequest request) throws IOException, ClientProtocolException {
return execute(request, (HttpContext) null);
}
Going one level down:
public CloseableHttpResponse execute(
final HttpUriRequest request,
final HttpContext context) throws IOException, ClientProtocolException {
Args.notNull(request, "HTTP request");
return doExecute(determineTarget(request), request, context);
}
And one level further, determineTarget works out which host to connect to:
private static HttpHost determineTarget(final HttpUriRequest request) throws ClientProtocolException {
// A null target may be acceptable if there is a default target.
// Otherwise, the null target is detected in the director.
HttpHost target = null;
final URI requestURI = request.getURI();
if (requestURI.isAbsolute()) {
target = URIUtils.extractHost(requestURI);
if (target == null) {
throw new ClientProtocolException("URI does not specify a valid host name: "
+ requestURI);
}
}
return target;
}
In other words, it will dig the target host out no matter what; once the host is resolved, the request goes over the wire and the page is downloaded.
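In practice this just means the URL passed to startDown must be absolute. A tiny sketch (the class name is mine) of what the extractHost call above returns for each kind of URI:

import java.net.URI;
import org.apache.http.HttpHost;
import org.apache.http.client.utils.URIUtils;

public class TargetDemo {
    public static void main(String[] args) {
        // An absolute URI yields a usable target host...
        HttpHost target = URIUtils.extractHost(URI.create("https://example.com/index.html"));
        System.out.println(target); // prints something like https://example.com
        // ...while a relative URI has no host: determineTarget then returns null,
        // and the request fails later unless a default target host was configured.
        System.out.println(URIUtils.extractHost(URI.create("/index.html"))); // null
    }
}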
Finally, we get hold of the message entity.
HttpEntity entity = execute.getEntity();
/**
* Obtains the message entity of this response, if any.
* The entity is provided by calling {@link #setEntity setEntity}.
*
* @return the response entity, or
* {@code null} if there is none
*/
HttpEntity getEntity();
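One caveat before wrapping up: getEntity() may return null, and the response should be closed (or the entity fully consumed) so the pooled connection can be reused. A slightly more defensive version of the read step, as a sketch rather than the original code, might look like this:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class SafeDownload {
    public static String fetch(String url) throws IOException {
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(new HttpGet(url))) {
            int status = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            if (status != 200 || entity == null) {
                return null; // caller decides how to handle non-200 / empty responses
            }
            // Fall back to UTF-8 when the server does not declare a charset.
            return EntityUtils.toString(entity, StandardCharsets.UTF_8);
        } // closing the response releases the connection back to the pool
    }
}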
That completes the walkthrough of the download flow and its code.
The full implementation will be posted on my Gitee (码云) repository.