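Preface
Extracting all of the text from a PDF is fairly easy, but table content and body text usually come out mixed together and have to be separated.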
1. Dependencies
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>3.0.0-alpha2</version>
</dependency>
<dependency>
    <groupId>technology.tabula</groupId>
    <artifactId>tabula</artifactId>
    <version>1.0.3</version>
</dependency>
2. Code example
The code is as follows (example):
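import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.TextPosition;
import technology.tabula.ObjectExtractor;
import technology.tabula.Page;
import technology.tabula.RectangularTextContainer;
import technology.tabula.Table;
import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;

// MyPDFTextStripper is not shown in this post. It is a custom subclass of PDFTextStripper
// whose myGetCharactersByArticle() exposes the protected getCharactersByArticle().
private static void readTextAndTable(String path) throws IOException {
    // try-with-resources closes both output files and the document even if extraction fails
    try (FileOutputStream textOut = new FileOutputStream("text.txt");
         FileOutputStream tableOut = new FileOutputStream("table.txt");
         PDDocument pd = Loader.loadPDF(new File(path))) {
        ObjectExtractor oe = new ObjectExtractor(pd);
        // Tabula's lattice-mode extractor (detects tables from ruling lines)
        SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
        int pageCount = pd.getNumberOfPages();
        // Read one page at a time
        for (int i = 0; i < pageCount; i++) {
            System.out.println("Page " + (i + 1));
            Page page = oe.extract(i + 1);
            // Extract every table on the page
            List<Table> tables = sea.extract(page);
            // Top and bottom y of each table, used below to skip text that lies inside a table
            List<double[]> loc = new ArrayList<>();
            for (Table table : tables) {
                loc.add(new double[]{table.y, table.y + table.height});
                for (List<RectangularTextContainer> row : table.getRows()) {
                    String line = row.stream()
                            .map(RectangularTextContainer::getText)
                            .map(o -> o.replace("\r", ""))
                            .collect(Collectors.joining(" | "));
                    tableOut.write((line + "\n").getBytes(StandardCharsets.UTF_8));
                }
                // Blank line between tables
                tableOut.write("\n".getBytes(StandardCharsets.UTF_8));
            }
            // Extract all of the text on the page
            MyPDFTextStripper stripper = new MyPDFTextStripper();
            // Output in reading order (sorted by position)
            stripper.setSortByPosition(true);
            stripper.setStartPage(i + 1);
            stripper.setEndPage(i + 1);
            stripper.setParagraphStart("\t");
            stripper.setParagraphEnd(" ");
            // The returned string is ignored; the call fills the stripper's character lists
            stripper.getText(pd);
            double lastY = 0;
            List<List<TextPosition>> charactersByArticle = stripper.myGetCharactersByArticle();
            StringBuilder paragraph = new StringBuilder();
            outer:
            for (TextPosition word : charactersByArticle.get(0)) {
                float endY = word.getY();
                // Skip text whose y lies between the top and bottom borders of a table
                for (double[] l : loc) {
                    if (l[0] < endY && l[1] > endY) {
                        continue outer;
                    }
                }
                if (!word.getUnicode().equals(" ")) {
                    paragraph.append(word.getUnicode());
                } else if (lastY == endY) {
                    // A space on the same line stays inside the current paragraph
                    paragraph.append(word.getUnicode());
                } else {
                    // A space on a new line: write out the buffered paragraph and start a new one
                    if (!paragraph.toString().equals(" ")) {
                        paragraph.append("\n");
                        textOut.write(paragraph.toString().getBytes(StandardCharsets.UTF_8));
                    }
                    paragraph = new StringBuilder();
                    paragraph.append(word.getUnicode());
                    lastY = endY;
                }
            }
            // Write whatever is still buffered so the last paragraph of the page is not lost
            if (!paragraph.toString().trim().isEmpty()) {
                paragraph.append("\n");
                textOut.write(paragraph.toString().getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}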
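The step that keeps table text out of text.txt is a plain interval test on the y coordinate, so it can be tried on its own without a PDF. Below is a minimal sketch of that test. The class `TableBandFilter`, its nested `Band` type, and the sample coordinates are hypothetical names and values used only for illustration; they are not part of PDFBox or Tabula.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: tests whether a y coordinate falls inside any table's vertical band. */
final class TableBandFilter {

    /** One band per table: top and bottom edge, both measured from the top of the page. */
    static final class Band {
        final double top;
        final double bottom;

        Band(double top, double bottom) {
            this.top = top;
            this.bottom = bottom;
        }

        /** Strict comparison on both sides, mirroring the check in the example above. */
        boolean contains(double y) {
            return top < y && y < bottom;
        }
    }

    /** Returns true if y lies strictly inside at least one band. */
    static boolean insideAnyBand(List<Band> bands, double y) {
        for (Band band : bands) {
            if (band.contains(y)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Band> bands = new ArrayList<>();
        bands.add(new Band(100.0, 200.0)); // made-up table between y=100 and y=200
        bands.add(new Band(400.0, 450.0)); // made-up second table

        System.out.println(insideAnyBand(bands, 150.0)); // true: inside the first table
        System.out.println(insideAnyBand(bands, 250.0)); // false: between the two tables
        System.out.println(insideAnyBand(bands, 100.0)); // false: exactly on the top border
    }
}
```

Because only y is compared, any text that sits to the left or right of a table at the same height is also dropped. If a document has text beside its tables, comparing x against the table's left and right edges as well would be one way to narrow the filter.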
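Summary
That is all for this post. It only briefly shows how to separate body text from table content; the extracted text output still needs further tuning.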
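One possible adjustment concerns how lines are detected. The example compares y coordinates with `==`, and glyphs on the same visual line can differ by a fraction of a point, so an exact comparison may split a line too often. The sketch below groups glyphs into lines with a tolerance instead. It is only an illustration under assumptions: `LineGrouper`, `Glyph`, and the tolerance of 2.0 are hypothetical, and a suitable tolerance depends on the font size of the document.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: groups glyphs into lines when their y values are close enough. */
final class LineGrouper {

    /** Minimal stand-in for a positioned character (text plus its y coordinate). */
    static final class Glyph {
        final String text;
        final float y;

        Glyph(String text, float y) {
            this.text = text;
            this.y = y;
        }
    }

    /** Starts a new line whenever a glyph's y differs from the line's first glyph by more than tolerance. */
    static List<String> groupIntoLines(List<Glyph> glyphs, float tolerance) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        float lineY = 0f;
        boolean started = false;
        for (Glyph g : glyphs) {
            boolean newLine = started && Math.abs(g.y - lineY) > tolerance;
            if (newLine) {
                lines.add(current.toString());
                current.setLength(0);
            }
            if (!started || newLine) {
                lineY = g.y; // remember the y of the first glyph on this line
                started = true;
            }
            current.append(g.text);
        }
        if (current.length() > 0) {
            lines.add(current.toString()); // do not lose the last line
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("H", 100.0f));
        glyphs.add(new Glyph("i", 100.4f)); // slightly different y, same visual line
        glyphs.add(new Glyph("!", 99.8f));
        glyphs.add(new Glyph("O", 114.0f)); // next line
        glyphs.add(new Glyph("k", 114.2f));

        System.out.println(groupIntoLines(glyphs, 2.0f)); // prints [Hi!, Ok]
    }
}
```

With PDFBox, the `text` and `y` values would come from `TextPosition.getUnicode()` and `TextPosition.getY()`.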