TextExtract(1)Tika Basic

License: CC 4.0 BY-SA

Tags: #java

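This article shows how to use Apache Tika to handle many file formats, including text, pictures, audio and video, through its command-line interface and GUI. It demonstrates how to extract content and metadata from files and how to identify the language, with simple Java code examples.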

1. Introduction
Tika supports many different file formats, including audio, video, pictures and text files.
The Tika bundle includes tika-app, a single runnable jar that works as both a GUI and a command-line tool.

Command-line interface + GUI
Language identifier + Tika Facade + MIME Type
Parser
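
To make the "MIME Type" piece a bit more concrete, here is a minimal sketch of the general idea of detecting a type from the leading "magic" bytes of a file. It uses only the JDK and is not Tika's implementation; the class MagicSniffSketch and its sniff helper are hypothetical names made up for this illustration, and it knows just three signatures. In real code the Tika facade's detect method does this job, with a far larger set of rules.

```java
import java.nio.charset.StandardCharsets;

public class MagicSniffSketch {

    // Hypothetical helper, not part of Tika: maps a few well-known
    // leading byte signatures to MIME types.
    static String sniff(byte[] prefix) {
        if (startsWith(prefix, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(prefix, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) {
            return "image/png";
        }
        if (startsWith(prefix, new byte[] {'P', 'K', 3, 4})) {
            return "application/zip";
        }
        // Fallback for content that matches none of the signatures above
        return "application/octet-stream";
    }

    // True when data begins with exactly the bytes in magic
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHeader = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(pdfHeader)); // prints application/pdf
    }
}
```

Content-based detection like this still works when a file has a wrong or missing extension, which is why a parser can be chosen automatically for an arbitrary input.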

There are 3 files to download:
http://mirrors.sonic.net/apache/tika/tika-server-1.10.jar
http://apache.mirrors.hoobly.com/tika/tika-app-1.10.jar
http://ftp.wayne.edu/apache/tika/tika-1.10-src.zip
The source code is managed by Maven, so I can build it directly.
> mvn clean install -DskipTests=true

Running the command below, or double-clicking the tika-app jar, starts the GUI.
> java -jar tika-app-1.10.jar --gui

In the GUI we can choose a file and switch between views to see the different kinds of content Tika extracts from it.
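
The same jar can also be driven without the GUI. Below is a small sketch that builds a tika-app command line in Java and launches it with ProcessBuilder only if the jar is present in the working directory. The class name TikaAppCommandSketch and the helper buildCommand are hypothetical, and the jar location is an assumption. As far as I know, tika-app accepts options such as --text, --metadata, --language and --detect, but it is worth confirming with --help for the version in use.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

public class TikaAppCommandSketch {

    // Builds the argument list for one tika-app run.
    // jarPath and inputFile are whatever paths exist on your machine.
    static List<String> buildCommand(String jarPath, String option, String inputFile) {
        return Arrays.asList("java", "-jar", jarPath, option, inputFile);
    }

    public static void main(String[] args) throws Exception {
        // Assumed locations; adjust to where the jar and the document really are
        String jarPath = "tika-app-1.10.jar";
        String inputFile = "/opt/data/resume/3-resume.pdf";

        List<String> command = buildCommand(jarPath, "--text", inputFile);
        System.out.println(String.join(" ", command));

        // Launch only when the jar is actually there, so the sketch runs anywhere
        if (new File(jarPath).isFile()) {
            int exitCode = new ProcessBuilder(command).inheritIO().start().waitFor();
            System.out.println("exit code: " + exitCode);
        }
    }
}
```

Passing the arguments as a list rather than one string avoids quoting problems when a file path contains spaces.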

2. Try the Packages in Java Code
The simplest Java code to fetch the content of a file:
package com.sillycat.resumeparse;

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TestFunMain {

    static final String file = "/opt/data/resume/3-resume.pdf";

    public static void main(String[] args) {
        // Create a Tika instance with the default configuration
        Tika tika = new Tika();
        // Parse the file and print out the extracted text content
        String text = null;
        try {
            text = tika.parseToString(new File(file));
        } catch (IOException | TikaException e) {
            e.printStackTrace();
        }
        // Prints "null" if parsing failed above
        System.out.print(text);
    }
}

Fetch the Metadata and Identify the Language
package com.sillycat.resumeparse;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class TestFunMain {

    static final String file = "/opt/data/resume/3-duffy.pdf";

    public static void main(String[] args) {
        Tika tika = new Tika();
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        ParseContext context = new ParseContext();
        Metadata metadata = new Metadata();

        // fetch the content through the Tika facade
        String text = null;
        try {
            text = tika.parseToString(new File(file));
        } catch (IOException | TikaException e) {
            e.printStackTrace();
        }
        // System.out.print(text);

        // fetch the metadata; the handler collects the body text at the same time.
        // try-with-resources closes the stream even if parsing fails
        try (InputStream stream = new FileInputStream(file)) {
            parser.parse(stream, handler, metadata, context);
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }
        // System.out.println(handler.toString());

        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }

        // identify the language from the text the handler already holds;
        // parsing the file again into the same handler would append the text twice
        LanguageIdentifier object = new LanguageIdentifier(handler.toString());
        System.out.println("Language name: " + object.getLanguage());
    }
}

References:
https://tika.apache.org/
https://github.com/luohuazju/sillycat-resume-parse
http://itindex.net/detail/41933-apache-tika-%E9%80%9A%E7%94%A8
http://m.yiibai.com/tika/tika_content_extraction.html

Books:
Tika in Action