TextExtract(1)Tika Basic

1. Introduction
Apache Tika supports many different file formats, including audio, video, images and text files.
The Tika distribution provides tika-app, a single runnable jar that contains both a GUI and a command-line tool.

Command-line interface + GUI
Language identifier + Tika Facade + MIME Type
Parser
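
To make the MIME type piece more concrete, here is a minimal sketch that uses the detect methods of the Tika facade. It assumes tika-app-1.10.jar (or tika-core) is on the classpath; the class name MimeDetectSketch and the file name resume.pdf are made up for illustration only.

```java
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;

public class MimeDetectSketch {

    // Detect the media type from the file name alone; the file is not opened.
    static String detectByName(String fileName) {
        return new Tika().detect(fileName);
    }

    // Detect the media type from the leading bytes of a document.
    static String detectByBytes(byte[] prefix) {
        return new Tika().detect(prefix);
    }

    public static void main(String[] args) {
        // Hypothetical file name, used only for name-based detection
        System.out.println(detectByName("resume.pdf"));

        // A PDF file starts with the "%PDF-" marker
        byte[] header = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detectByBytes(header));
    }
}
```

Name-based detection is cheap but trusts the extension, while byte-based detection looks at the actual content, so the two can disagree for a renamed file.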

Three files are available for download:
http://mirrors.sonic.net/apache/tika/tika-server-1.10.jar
http://apache.mirrors.hoobly.com/tika/tika-app-1.10.jar
http://ftp.wayne.edu/apache/tika/tika-1.10-src.zip
The source code is managed by Maven, so I can build it directly.
> mvn clean install -DskipTests=true

The tika-app jar can be started from the command line or by double-clicking it.
> java -jar tika-app-1.10.jar --gui

In the GUI, we can open files and switch between views to see the different content Tika extracts from them.

2. Try the Packages in Java Code
The simplest Java code to fetch the content of a file:
package com.sillycat.resumeparse;

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TestFunMain {

    static final String file = "/opt/data/resume/3-resume.pdf";

    public static void main(String[] args) {
        // Create a Tika instance with the default configuration
        Tika tika = new Tika();
        // Parse the file and print out the extracted text content
        String text = null;
        try {
            text = tika.parseToString(new File(file));
        } catch (IOException | TikaException e) {
            e.printStackTrace();
        }
        System.out.print(text);
    }
}
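
One detail worth knowing about parseToString: by default the Tika facade caps the returned string at 100,000 characters, so long documents come back truncated. The sketch below shows how the cap can be changed with setMaxStringLength. It assumes tika-app-1.10.jar is on the classpath (so that a plain text parser is available); the class name ParseLimitSketch, the helper extract and the sample sentence are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class ParseLimitSketch {

    // Extract text from in-memory bytes.
    // maxLength caps the length of the returned string; -1 disables the cap.
    static String extract(byte[] data, int maxLength) {
        Tika tika = new Tika();
        tika.setMaxStringLength(maxLength);
        try {
            // parseToString closes the stream it is given
            return tika.parseToString(new ByteArrayInputStream(data));
        } catch (IOException | TikaException e) {
            throw new IllegalStateException("Text extraction failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample input standing in for a real document
        byte[] sample = "Hello Tika, this is a plain text sample."
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(extract(sample, -1));
    }
}
```

Disabling the cap is convenient for small files, but for very large documents a streaming content handler is probably a better fit than holding the whole text in one string.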

Fetch the Metadata and Identify the Language
package com.sillycat.resumeparse;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class TestFunMain {

    static final String file = "/opt/data/resume/3-duffy.pdf";

    public static void main(String[] args) {
        Tika tika = new Tika();
        String text = null;
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        ParseContext context = new ParseContext();
        Metadata metadata = new Metadata();

        // fetch the content
        try {
            text = tika.parseToString(new File(file));
        } catch (IOException | TikaException e) {
            e.printStackTrace();
        }
        // System.out.print(text);

        // fetch the metadata; the handler collects the body text in the same pass
        try (InputStream stream = new FileInputStream(file)) {
            parser.parse(stream, handler, metadata, context);
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }
        // System.out.println(handler.toString());

        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }

        // identify the language from the body text collected above;
        // parsing the file a second time is not needed
        LanguageIdentifier identifier = new LanguageIdentifier(handler.toString());
        System.out.println("Language name :" + identifier.getLanguage());
    }
}
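
Language identification also works on any plain string, without parsing a file first. The sketch below assumes the LanguageIdentifier class from Tika 1.10 as used in this post (it may be deprecated or moved in later releases); the class name LanguageSketch and the sample sentence are made up for illustration. Besides getLanguage, it calls isReasonablyCertain, which reports whether the match is strong enough to trust.

```java
import org.apache.tika.language.LanguageIdentifier;

public class LanguageSketch {

    // Return the language code of the best-matching built-in language profile.
    static String identify(String text) {
        return new LanguageIdentifier(text).getLanguage();
    }

    public static void main(String[] args) {
        // Hypothetical sample text; longer input generally gives a more reliable match
        String sample = "Apache Tika is a toolkit that detects and extracts metadata "
                + "and text from many different kinds of documents, and this "
                + "sentence is written in plain English for the language test.";

        LanguageIdentifier identifier = new LanguageIdentifier(sample);
        System.out.println("Language name :" + identifier.getLanguage());
        System.out.println("Reasonably certain :" + identifier.isReasonablyCertain());
    }
}
```

Short strings such as a single name or title give the identifier very little to work with, so checking isReasonablyCertain before relying on the result seems a sensible habit.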
