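Suppose we are using jsoup to scrape a website and are interested in two kinds of links: the pagination links and the individual content links.
A pagination link looks like this: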
<a class="page-link _Xo" href="/toutiao/31891686/p4/">
A content link looks like this:
<a title="社会共识" href="/toutiao/31891686/276320419">
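To fetch the two kinds of links separately, use the /31891686/ segment in the middle of the href as a filter string and discard any URL that does not contain it. The class attribute then tells the pagination links apart from the content links: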
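String SelectStr;
switch (linktype) {
    case PAGENO:
        // href matches the regex AND the class attribute equals "page-link _Xo"
        SelectStr = "a[href~=.*/" + FilterStr + "/.*][class=page-link _Xo]";
        break;
    case LINKS:
        // href matches the regex AND the class attribute is NOT "page-link _Xo"
        SelectStr = "a[href~=.*/" + FilterStr + "/.*][class!=page-link _Xo]";
        break;
    case NONE:
    default:
        // no filtering: every anchor that has an href
        SelectStr = "a[href]";
        break;
}
result = document.select(SelectStr);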
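The selector-building step above can be pulled into a small helper so the strings can be checked without fetching a page. This is only a minimal sketch: the class name `LinkSelectors`, the enum `LinkType`, and the method `buildSelector` are hypothetical names chosen for illustration, and only the selector strings themselves come from the snippet above. It uses the Java standard library only; the returned string is meant to be passed to `document.select(...)`.

```java
// Hypothetical helper; the names here are illustrative and not from the original code.
class LinkSelectors {
    enum LinkType { PAGENO, LINKS, NONE }

    // Builds the jsoup selector string for the requested kind of link.
    static String buildSelector(LinkType linkType, String filterStr) {
        // [href~=regex] keeps anchors whose href matches the regular expression.
        String hrefFilter = "a[href~=.*/" + filterStr + "/.*]";
        switch (linkType) {
            case PAGENO:
                // Pagination links: class attribute equals "page-link _Xo".
                return hrefFilter + "[class=page-link _Xo]";
            case LINKS:
                // Content links: same href filter, but not the pagination class.
                return hrefFilter + "[class!=page-link _Xo]";
            case NONE:
            default:
                // No filtering: every anchor with an href attribute.
                return "a[href]";
        }
    }

    public static void main(String[] args) {
        // Print the three selector strings for the filter string used in the text.
        for (LinkType type : LinkType.values()) {
            System.out.println(type + " -> " + buildSelector(type, "31891686"));
        }
    }
}
```

Keeping the string construction separate from the `select` call makes it easy to print or log the selector when a query unexpectedly returns nothing.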
Another example, from a different site:
<a class="yt-uix-sessionlink yt-uix-tile-link spf-link yt-ui-ellipsis yt-ui-ellipsis-2" dir="ltr" title="EP24 DAY12" aria-describedby="description-id-630364" data-sessionlink="ei=Oim2XrycEMWyiga4p6yQBA&feature=c4-videos-u" href="/watch?v=4RyTKa-hnxc" rel="nofollow">EP24 DAY12</a>
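Here the class can be used for selection. The class attribute is long and contains several class names separated by spaces; in the selector, simply replace each space with a dot. Note that a class selector must also begin with a dot (.).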
String SelectStr = ".yt-uix-sessionlink.yt-uix-tile-link.spf-link.yt-ui-ellipsis.yt-ui-ellipsis-2";
Elements elements = glDoc.select(SelectStr);
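Writing the dotted selector by hand is error-prone when the class attribute is this long, so the conversion can be automated. The following is a rough sketch using only the Java standard library; `ClassSelectors` and `toClassSelector` are hypothetical names introduced here, not part of jsoup. The result is meant to be passed to `select(...)` as in the lines above. A chained class selector of this form matches elements that carry all of the listed classes.

```java
// Hypothetical helper; not part of jsoup or the original code.
class ClassSelectors {
    // Turns a class attribute value such as "a b c" into the selector ".a.b.c".
    static String toClassSelector(String classAttr) {
        String trimmed = classAttr.trim();
        if (trimmed.isEmpty()) {
            return ""; // no class names, so there is nothing to select on
        }
        // Split on runs of whitespace, then prefix every class name with a dot.
        return "." + String.join(".", trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // The class attribute copied from the anchor element shown above.
        String classAttr = "yt-uix-sessionlink yt-uix-tile-link spf-link yt-ui-ellipsis yt-ui-ellipsis-2";
        System.out.println(toClassSelector(classAttr));
    }
}
```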