Fixing out-of-memory errors when using BufferedImage with icepdf

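A recent project needed to convert PDFs to images. After looking at various open-source tools online, I found that icepdf seems to be fairly widely used.
In practice, though, I ran into a few problems.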

1. A JPEG 2000 error:

ImageIO missing required plug-in to read JPEG 2000 images. 
You can download the JAI ImageIO Tools from: http://www.oracle.com/technetwork/java/current-142188.html

Solution:
You need two packages: jai-imageio-core-1.3.1.jar and jai-imageio-jpeg2000-1.3.0.jar.
Add both jars to your project and the error goes away.
I pulled them in through Maven:
<repositories>
  <repository>
    <id>bintray-jai-imageio</id>
    <name>jai-imageio at bintray</name>
    <url>https://dl.bintray.com/jai-imageio/maven/</url>
    <snapshots>
      <enabled>false</enabled>
    </snapshots>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>com.github.jai-imageio</groupId>
    <artifactId>jai-imageio-core</artifactId>
    <version>1.3.1</version>
  </dependency>
  <dependency>
    <groupId>com.github.jai-imageio</groupId>
    <artifactId>jai-imageio-jpeg2000</artifactId>
    <version>1.3.0</version>
  </dependency>
</dependencies>
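
To confirm that the jars were actually picked up, you can ask ImageIO whether a reader is registered for the format. This is only a minimal sketch: the class name ImageIoPluginCheck is made up for illustration, and "jpeg2000" is an assumed format name for the plug-in, so check ImageIO.getReaderFormatNames() if it reports false unexpectedly.

```java
import javax.imageio.ImageIO;

final class ImageIoPluginCheck {

    /** Returns true when ImageIO has at least one reader registered under this format name. */
    static boolean hasReader(String formatName) {
        return ImageIO.getImageReadersByFormatName(formatName).hasNext();
    }

    public static void main(String[] args) {
        // Pick up plug-ins that were added to the classpath after ImageIO was first loaded.
        ImageIO.scanForPlugins();
        // "jpeg2000" is an assumed format name, not taken from the article.
        System.out.println("JPEG 2000 reader available: " + hasReader("jpeg2000"));
    }
}
```

If this prints false with both jars on the classpath, the dependency was probably not resolved or was packaged without its META-INF/services entries.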

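2. An out of memory error, probably caused by the way BufferedImage was used:
Following the icepdf example on GitHub worked fine for a handful of PDFs, but I convert a large number of them, and the process eventually ran out of memory.
The fix is to set each BufferedImage to null as soon as you are done with it, so the JVM knows it can be reclaimed, and to call System.gc(); after every capture to release resources.
My understanding: setting the reference to null tells the JVM that the object can be collected, and System.gc(); asks the JVM to run a collection. It is only a request, so it will not necessarily reclaim the object you just set to null; it may reclaim other unused objects instead.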
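
To see why a batch runs out of memory so quickly, it helps to estimate how large one rendered page is. The sketch below assumes an A4 page of 595 x 842 points (the page size is my assumption, not something measured from the PDFs in question) and the 2.5 scale used in the capture code; ImageBudget is a hypothetical helper name. A TYPE_INT_RGB image stores one 4-byte int per pixel.

```java
final class ImageBudget {

    /** Approximate pixel-buffer size, in bytes, of a TYPE_INT_RGB image for a scaled page. */
    static long estimateBytes(float pageWidthPt, float pageHeightPt, float scale) {
        // Same truncation as the (int) casts in the capture code.
        int width = (int) (pageWidthPt * scale);
        int height = (int) (pageHeightPt * scale);
        return (long) width * height * 4L; // one 4-byte int per pixel
    }

    public static void main(String[] args) {
        // Assumed A4 page (595 x 842 pt) at scale 2.5 -> 1487 x 2105 pixels.
        System.out.println(estimateBytes(595f, 842f, 2.5f) + " bytes"); // prints "12520540 bytes"
    }
}
```

That is roughly 12.5 MB for a single page, before counting anything icepdf allocates while parsing and painting, so images that stay reachable across many PDFs add up fast.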
The icepdf capture code is as follows:
package com.sun.pdfImage;

import java.awt.Graphics;
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

import org.icepdf.core.pobjects.Document;
import org.icepdf.core.pobjects.PDimension;
import org.icepdf.core.pobjects.Page;
import org.icepdf.core.util.GraphicsRenderingHints;
/*
 * PDF to image
 */
public class Icepdf {  
    public static void pdf2Pic(String pdfPath){

        File pdf = new File(pdfPath);
        if (!pdf.exists()){
            System.out.println("PDF capture failed: " + pdfPath + " does not exist");
            return;
        }

        // Derive the output name by swapping the extension for .jpg
        int dot = pdfPath.lastIndexOf('.');
        String imgName = "";
        if (dot > -1) {
            imgName = pdfPath.substring(0, dot) + ".jpg";
            File file = new File(imgName);
            if (file.exists()){
                // already captured, nothing to do
                return;
            }
        }
        else{
            System.out.println("PDF capture failed: " + pdfPath + " has no file extension");
            return;
        }
    	
        Document document = new Document();
        try {
            // setFile declares checked exceptions, so it has to sit inside the try block
            document.setFile(pdfPath);

            float scale = 2.5f;   // zoom factor
            float rotation = 0f;  // rotation in degrees

            for (int i = 0; i < document.getNumberOfPages(); i++) {
                // only the first page is captured
                if (i > 0)
                    break;

                System.out.println("PDF capture start: " + pdfPath);
                Page page = document.getPageTree().getPage(i);
                page.init();
                PDimension sz = page.getSize(Page.BOUNDARY_CROPBOX, rotation, scale);
                int pageWidth = (int) sz.getWidth();
                int pageHeight = (int) sz.getHeight();
                BufferedImage image = new BufferedImage(pageWidth,
                        pageHeight,
                        BufferedImage.TYPE_INT_RGB);
                Graphics g = image.createGraphics();
                page.paint(g, GraphicsRenderingHints.PRINT,
                        Page.BOUNDARY_CROPBOX, rotation, scale);
                g.dispose();

                // capture the page image to file
                System.out.println("PDF capture: " + imgName);
                File file = new File(imgName);
                ImageIO.write(image, "jpg", file);

                // release the image's resources and drop the reference
                image.flush();
                image = null;
            }
        } catch (Exception e) {
            System.out.println(e.getMessage());
            System.out.println("icepdf capture error");
        } finally {
            // release the document even when rendering failed
            document.dispose();
            document = null;
        }

        System.gc();
    }  
    
    public static void main(String[] args) {  
        String filePath = "C:/Users/Administrator/Desktop/30.pdf";  
        pdf2Pic(filePath);  
    }  
}  
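
Since the problem only shows up when many PDFs are converted, here is one possible way to drive pdf2Pic over a whole folder, one file at a time. It is a sketch, not part of the original code: PendingPdfs, find and jpgFor are hypothetical names, and it assumes all the PDFs sit in a single directory. It mirrors the check in pdf2Pic by skipping any PDF that already has a .jpg next to it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class PendingPdfs {

    /** Lists the PDFs directly inside dir that do not yet have a matching .jpg next to them. */
    static List<Path> find(Path dir) {
        // try-with-resources closes the directory stream once the list is built
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(p -> p.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .filter(p -> !Files.exists(jpgFor(p)))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Same naming rule as pdf2Pic: replace the extension with .jpg. */
    static Path jpgFor(Path pdf) {
        String name = pdf.getFileName().toString();
        return pdf.resolveSibling(name.substring(0, name.lastIndexOf('.')) + ".jpg");
    }

    public static void main(String[] args) {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path pdf : find(dir)) {
            // In the real program this is where Icepdf.pdf2Pic(pdf.toString()) would be called.
            System.out.println("pending: " + pdf);
        }
    }
}
```

Because each call to pdf2Pic creates and disposes its own Document and BufferedImage, nothing from one file stays reachable while the next one is processed.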

If that still does not work, the only option left is to give the program more memory yourself:

java -Xms128m -Xmx2048m -jar TEST.jar
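
To check that the -Xmx setting really reached the JVM, you can print the maximum heap at startup. A minimal sketch, with HeapInfo as a made-up class name; Runtime.maxMemory() reports the most memory the JVM will try to use, which is usually close to, but not exactly, the -Xmx value.

```java
final class HeapInfo {

    /** Maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```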
