skrape[it] Open-Source Project Tutorial

skrape[it] is a Kotlin-based testing, scraping, and parsing library that can analyze and extract data from HTML, whether rendered on the server or on the client. It places particular emphasis on ease of use and readability by providing an intuitive DSL. It aims to be a testing library, but it can also be used to scrape websites conveniently. Project repository: https://gitcode.com/gh_mirrors/sk/skrape.it

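1. Project Overview

skrape[it] is a Kotlin-based library for testing and scraping HTML/XML. It can be used seamlessly in Spring Boot, Ktor, Android, and other Kotlin-JVM projects. It emphasizes ease of use and readability, which it achieves through an intuitive DSL (domain-specific language). skrape[it] is aimed primarily at testing, but it is also convenient for web scraping.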

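Key Features

Parsing and deserialization: parses HTML/XML from websites, local HTML files, and HTML strings, and deserializes it into data classes or POJOs.
DSL selectors: provides a DSL for selecting HTML elements and supports CSS selector syntax.
HTTP client: provides a concise HTTP client interface, supports request options such as headers and cookies, and can handle client-side rendered pages.
Compatibility: not tied to a specific test runner or framework, and open to use with other assertion libraries.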

2. Quick Start

Installation

Gradle

dependencies {
    implementation("it.skrape:skrapeit:1.2.2")
}
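
The example test in this tutorial is annotated with JUnit 5's `@Test`, so a Gradle build also needs JUnit on the test classpath. The fragment below is a minimal sketch of a `build.gradle.kts`, assuming the Kotlin JVM plugin is already applied; the JUnit version shown is an assumption and is not prescribed by skrape[it].

```kotlin
// build.gradle.kts (fragment, Gradle Kotlin DSL)
dependencies {
    implementation("it.skrape:skrapeit:1.2.2")
    // JUnit 5 supplies the @Test annotation; the version here is an assumption.
    testImplementation("org.junit.jupiter:junit-jupiter:5.10.0")
}

tasks.test {
    // JUnit 5 tests are only discovered when the JUnit Platform is enabled.
    useJUnitPlatform()
}
```

Because skrape[it] is not tied to a test runner, another runner or assertion library could be substituted here.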

Maven

<dependency>
    <groupId>it.skrape</groupId>
    <artifactId>skrapeit</artifactId>
    <version>1.2.2</version>
</dependency>

Example Code

Parsing HTML and extracting data

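import it.skrape.core.htmlDocument
// The matcher and list-extension imports below are assumed from the skrape[it] 1.x package layout.
import it.skrape.matchers.toBe
import it.skrape.matchers.toContain
import it.skrape.selects.html5.h1
import it.skrape.selects.html5.p
import it.skrape.selects.text
import org.junit.jupiter.api.Test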

class HtmlParsingExample {
    @Test
    fun `can read and return html from String`() {
        htmlDocument("""
            <html>
                <body>
                    <h1>welcome</h1>
                    <div>
                        <p>first p-element</p>
                        <p class="foo">some p-element</p>
                        <p class="foo">last p-element</p>
                    </div>
                </body>
            </html>
        """) {
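            // Select <h1> elements and assert on the first match.
            h1 {
                findFirst {
                    text toBe "welcome"
                }
            }
            // Narrow the selection to <p> elements with class "foo".
            p {
                withClass = "foo"
                findFirst {
                    text toBe "some p-element"
                    className toBe "foo"
                }
            }
            // Without a class filter, all three <p> elements are selected.
            p {
                findAll {
                    text toContain "p-element"
                }
                findLast {
                    text toBe "last p-element"
                }
            }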
        }
    }
}
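
The test above only asserts on the parsed document. The same DSL blocks can also return values, which is one way to collect parsed HTML into a data class as mentioned in the feature list. The following is a minimal sketch, assuming skrape[it] 1.2.2 and that `htmlDocument`, `findFirst`, and `findAll` return the result of their lambda; `PageSummary`, `summarize`, and `sampleHtml` are hypothetical names introduced for illustration and are not part of the library.

```kotlin
import it.skrape.core.htmlDocument
import it.skrape.selects.html5.h1
import it.skrape.selects.html5.p

// Hypothetical target type for the extracted data (not part of skrape[it]).
data class PageSummary(val headline: String, val fooParagraphs: List<String>)

// Same markup as in the test above.
val sampleHtml = """
    <html>
        <body>
            <h1>welcome</h1>
            <div>
                <p>first p-element</p>
                <p class="foo">some p-element</p>
                <p class="foo">last p-element</p>
            </div>
        </body>
    </html>
""".trimIndent()

// Parses the HTML string and returns the selected values instead of asserting on them.
fun summarize(html: String): PageSummary = htmlDocument(html) {
    PageSummary(
        // Text of the first <h1> element.
        headline = h1 { findFirst { text } },
        // Text of every <p> element that carries the class "foo".
        fooParagraphs = p {
            withClass = "foo"
            findAll { map { it.text } }
        }
    )
}

fun main() {
    println(summarize(sampleHtml))
}
```

Returning an immutable data class from the parsing block keeps the scraping logic in one place and lets any assertion library check the result afterwards.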