HTML Parser

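This article presents an example application of an HTML parser implemented in Java: it fetches a specified web page and extracts information from it, including the text, the links, and the title.

HTML Parser parses HTML and extracts the information you need from it.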

Java version:

http://htmlparser.sourceforge.net/

http://www.ibm.com/developerworks/cn/java/l-html-parser/

 

An HTMLParser usage example:

package com.amigo.htmlparser;

import java.io.*;
import java.net.URL;
import java.net.URLConnection;

import org.htmlparser.filters.*;
import org.htmlparser.*;
import org.htmlparser.nodes.*;
import org.htmlparser.tags.*;
import org.htmlparser.util.*;
import org.htmlparser.visitors.*;


public class HTMLParserTest {
   
    public static void main(String args[]) throws Exception {
        String path = "http://www.blogjava.net/amigoxie";
        URL url = new URL(path);
        URLConnection conn = url.openConnection();
        // A plain GET sends no request body, so setDoOutput(true) is not needed.

        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader and the underlying stream.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                sb.append(inputLine);
                sb.append("\n");
            }
        }

        String result = sb.toString();

        // Parse the downloaded page in two different ways.
        readByHtml(result);
        readTextAndLinkAndTitle(result);
    }
   
   
    // Prints the page title, then the plain text of the page body.
    public static void readByHtml(String content) throws Exception {
        Parser myParser = Parser.createParser(content, "utf8");
        // HtmlPage is a visitor that collects the title and body while traversing.
        HtmlPage visitor = new HtmlPage(myParser);
        myParser.visitAllNodesWith(visitor);

        String title = visitor.getTitle();
        System.out.println(title);

        NodeList nodelist = visitor.getBody();
        System.out.print(nodelist.asString().trim());
    }

   
    // Prints every text node, link URL, and title found in the page.
    public static void readTextAndLinkAndTitle(String result) throws Exception {
        Parser parser = Parser.createParser(result, "utf8");
        // Accept a node if it is a text node, a link tag, or the title tag.
        NodeFilter textFilter = new NodeClassFilter(TextNode.class);
        NodeFilter linkFilter = new NodeClassFilter(LinkTag.class);
        NodeFilter titleFilter = new NodeClassFilter(TitleTag.class);
        OrFilter lastFilter = new OrFilter();
        lastFilter.setPredicates(new NodeFilter[] { textFilter, linkFilter, titleFilter });
        NodeList nodelist = parser.parse(lastFilter);
        Node[] nodes = nodelist.toNodeArray();

        for (int i = 0; i < nodes.length; i++) {
            Node node = nodes[i];
            // Declared per iteration so each node starts from an empty value.
            String line = "";
            if (node instanceof TextNode) {
                TextNode textnode = (TextNode) node;
                line = textnode.getText();
            } else if (node instanceof LinkTag) {
                LinkTag link = (LinkTag) node;
                line = link.getLink();
            } else if (node instanceof TitleTag) {
                TitleTag titlenode = (TitleTag) node;
                line = titlenode.getTitle();
            }

            // Skip nodes that contain nothing but whitespace.
            if (isTrimEmpty(line))
                continue;
            System.out.println(line);
        }
    }
   
   
    // Returns true if the string is null, empty, or contains only whitespace.
    public static boolean isTrimEmpty(String astr) {
        return isBlank(astr) || isBlank(astr.trim());
    }


    // Returns true if the string is null or has zero length.
    public static boolean isBlank(String astr) {
        return (null == astr) || (astr.length() == 0);
    }
}
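
For comparison, the link-extraction step can also be sketched without any third-party dependency. The example below does not use the HTMLParser library; it relies on the HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator). The class name JdkLinkExtractor and the method extractLinks are hypothetical names chosen for illustration, and the href values are returned exactly as written in the page, without resolving relative URLs.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (not part of HTMLParser): collects href values of <a> tags.
final class JdkLinkExtractor {
    private JdkLinkExtractor() {}

    static List<String> extractLinks(String html) {
        final List<String> links = new ArrayList<>();
        // The callback is invoked by the parser for every start tag it encounters.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations,
            // which is appropriate because the input is already a decoded String.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected for an in-memory StringReader; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return links;
    }
}
```

Calling JdkLinkExtractor.extractLinks(result) with the page content downloaded in main would return the raw link targets as a list, which can then be printed or filtered in the same way as the LinkTag results above.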
PHP version:

http://sourceforge.net/projects/simplehtmldom/

http://sourceforge.net/projects/html-parser/

 

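// Try to download $url up to 3 times ($url must be set by the caller).
$content = '';
$i = 0;
while ($content == '' && $i < 3) {
  // @ suppresses the warning; file_get_contents returns false on failure.
  $content = @file_get_contents($url);
  $i++;
}
// Give up only if every attempt failed.
if ($content == '') exit("next");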
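
The retry idea in the PHP snippet above (attempt the download a limited number of times, then give up) can be sketched in Java as well, to match the main example. This is a minimal sketch under stated assumptions: RetryFetch and fetchWithRetry are hypothetical names, the fetch operation is passed in as a Supplier, and a runtime exception is treated the same way as an empty result.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper: retries a fetch operation a bounded number of times.
final class RetryFetch {
    private RetryFetch() {}

    // Calls fetch up to maxAttempts times and returns the first non-empty result.
    // Returns Optional.empty() if every attempt fails or yields no content.
    static Optional<String> fetchWithRetry(Supplier<String> fetch, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String content;
            try {
                content = fetch.get();
            } catch (RuntimeException e) {
                // Assumption: a failed attempt is handled like an empty response.
                content = null;
            }
            if (content != null && !content.isEmpty()) {
                return Optional.of(content);
            }
        }
        return Optional.empty();
    }
}
```

A caller would wrap its own download code in the supplier and check isPresent() on the result; an empty Optional corresponds to the exit("next") branch in the PHP version.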

 

.NET version:

http://download.youkuaiyun.com/source/737172
