Preface
This article looks at how to parse documents with langchain4j and Apache Tika.
Steps
pom.xml
<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-document-parser-apache-tika</artifactId>
    <version>1.0.0-beta1</version>
</dependency>
example
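import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentParser;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.document.parser.apache.tika.ApacheTikaDocumentParser;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TikaTest {
    private static final Logger log = LoggerFactory.getLogger(TikaTest.class);

    public static void main(String[] args) {
        // Tika detects the file type itself, so the same parser is used for xlsx, docx, pdf, etc.
        String path = System.getProperty("user.home") + "/downloads/tmp.xlsx";
        DocumentParser parser = new ApacheTikaDocumentParser();
        Document document = FileSystemDocumentLoader.loadDocument(path, parser);
        log.info("textSegment:{}", document.toTextSegment());
        log.info("meta data:{}", document.metadata().toMap());
        log.info("text:{}", document.text());
    }
}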
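Given a file path, ApacheTikaDocumentParser parses the file and returns a Document object, whatever the input format. The Document can be converted to a TextSegment, which can then be embedded and stored in a vector database, as in the following snippet.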
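// In-process embedding model; it ships in a separate langchain4j embeddings artifact
EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();
// Assumption: an in-memory store is used here; the original does not show how embeddingStore is created
EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
TextSegment segment1 = document.toTextSegment();
Embedding embedding1 = embeddingModel.embed(segment1).content();
embeddingStore.add(embedding1, segment1);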
Source code
DocumentParser
dev/langchain4j/data/document/DocumentParser.java
public interface DocumentParser {

    /**
     * Parses a given {@link InputStream} into a {@link Document}.
     * The specific implementation of this method will depend on the type of the document being parsed.
     * <p>
     * Note: This method does not close the provided {@link InputStream} - it is the
     * caller's responsibility to manage the lifecycle of the stream.
     *
     * @param inputStream The {@link InputStream} that contains the content of the {@link Document}.
     * @return The parsed {@link Document}.
     * @throws BlankDocumentException when the parsed {@link Document} is blank/empty.
     */
    Document parse(InputStream inputStream);
}
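The Javadoc above leaves the stream lifecycle to the caller: the parser reads the InputStream but does not close it. The following is a minimal sketch of that contract using only the JDK (Java 11+). SimpleParser, TrackingStream, UTF8_PARSER and parseAndClose are hypothetical names made up for illustration and are not part of langchain4j; IllegalArgumentException stands in for BlankDocumentException.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in types for illustration only; this is not the langchain4j API.
class StreamLifecycleSketch {

    /** Simplified parser contract: reads the stream but never closes it. */
    interface SimpleParser {
        String parse(InputStream inputStream);
    }

    /** In-memory stream that records whether close() was called. */
    static class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(String content) {
            super(content.getBytes(StandardCharsets.UTF_8));
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Decodes UTF-8 text and rejects blank input; it does not close the stream. */
    static final SimpleParser UTF8_PARSER = in -> {
        try {
            String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            if (text.isBlank()) {
                // Stand-in for the BlankDocumentException mentioned in the Javadoc
                throw new IllegalArgumentException("blank document");
            }
            return text;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    };

    /** The caller owns the stream: try-with-resources closes it once parsing is done. */
    static String parseAndClose(TrackingStream stream) {
        try (stream) {
            return UTF8_PARSER.parse(stream);
        }
    }

    public static void main(String[] args) {
        TrackingStream direct = new TrackingStream("hello");
        System.out.println(UTF8_PARSER.parse(direct) + ", closed=" + direct.closed);

        TrackingStream managed = new TrackingStream("tika");
        System.out.println(parseAndClose(managed) + ", closed=" + managed.closed);
    }
}
```

When calling a real DocumentParser directly, the same pattern applies: open the stream in a try-with-resources block and pass it to parse inside that block.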
