ICCIS2025

LongDocAI: A Quantized Modular Pipeline for Multimodal PDF Summarization and QA

Abstract

"The rapid expansion of scientific literature has made it increasingly difficult for researchers to extract meaningful insights from dense academic documents. We introduce LongDocAI, a multimodal, multi-model framework designed to efficiently understand and interact with scholarly PDFs at scale. The system combines Donut, an OCR-free document parser, with a hybrid summarization module based on BART-large-CNN, trained on over 200,000 paper-summary pairs for summarization and 30,000+ question-answer examples for question answering. To ensure faster and more resource-efficient inference, we apply LLM.int8 post-training quantization across all transformer models, significantly reducing memory usage and latency without compromising output quality. LongDocAI is built as a unified pipeline that handles both layout-aware parsing and semantic-level understanding through summarization and interactive question answering. In our evaluations, the system achieved a 15% improvement in ROUGE-L, a 28% boost in METEOR, and a QA accuracy of 85.3%, based on expert human assessments. Quantization further led to a 70% reduction in model size and a 40% decrease in inference time, making the system suitable for real-time or edge deployment. By integrating multimodal document understanding, summarization, question answering, and quantization into a single, modular pipeline, LongDocAI offers a scalable, lightweight solution for navigating and understanding long-form academic texts, benefiting researchers, reviewers, and knowledge platforms alike."

Objective

Develop a quantized multimodal framework for scientific document summarization.

Methodology

Combined the OCR-free Donut parser with BART-large-CNN. Applied LLM.int8 post-training quantization.
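
To illustrate the idea behind 8-bit post-training quantization, the following is a minimal sketch of row-wise absmax int8 quantization, the basic building block that LLM.int8-style methods start from. It is not the authors' implementation and it omits the mixed-precision handling of outlier features that the full method adds. The function names and the toy weight matrix are hypothetical and chosen only for illustration.

```python
# Minimal sketch of row-wise absmax int8 quantization (assumption: this is an
# illustration of the general idea, not the pipeline used in LongDocAI).

def quantize_absmax(row):
    """Map one row of floats to integers in [-127, 127] plus a float scale."""
    absmax = max(abs(x) for x in row)
    if absmax == 0.0:
        # An all-zero row needs no scaling; any non-zero scale works.
        return [0] * len(row), 1.0
    scale = absmax / 127.0
    quantized = [max(-127, min(127, round(x / scale))) for x in row]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from int8 codes and their scale."""
    return [q * scale for q in quantized]


def quantize_matrix(weights):
    """Quantize each row independently so one large row does not hurt others."""
    return [quantize_absmax(row) for row in weights]


# Hypothetical toy weights; real models have millions of such rows.
weights = [
    [0.12, -0.50, 0.33, 1.00],
    [2.00, -0.25, 0.00, 0.75],
]

packed = quantize_matrix(weights)
restored = [dequantize(q, s) for q, s in packed]

# Largest absolute reconstruction error over the whole matrix.
max_err = max(
    abs(a - b)
    for w_row, r_row in zip(weights, restored)
    for a, b in zip(w_row, r_row)
)
print("max reconstruction error:", max_err)
```

Each weight is stored as one 8-bit code plus a shared per-row scale, and the rounding error per element is bounded by half the row's scale. This is where the memory saving comes from.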

Results & Conclusion

15% improvement in ROUGE-L, 28% boost in METEOR, 85.3% QA accuracy. 70% reduction in model size and 40% decrease in latency.
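
For readers unfamiliar with the summarization metric reported here, the following is a minimal sketch of a sentence-level ROUGE-L F-score based on the longest common subsequence (LCS) of tokens. Lowercased whitespace tokenization and the default `beta=1.0` are assumptions made for this illustration. The source does not state which ROUGE implementation or settings were used.

```python
# Minimal sketch of sentence-level ROUGE-L (assumption: lowercase whitespace
# tokenization; published evaluations typically use a dedicated ROUGE package).

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-score combining LCS-based precision and recall."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(cand_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


# Hypothetical example sentences, not taken from the evaluation data.
score = rouge_l("the cat sat on the mat", "the cat is on the mat")
print(round(score, 4))
```

In this example the LCS is "the cat on the mat" (5 tokens), so precision and recall are both 5/6 and the F-score is 5/6.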