Apache POI-XWPF: Read MS Word DOCX Header, Footer, Paragraph and Table Example
February 04, 2015
This page provides Apache POI-XWPF API examples to read the header, footer, paragraphs and tables of an MS Word DOCX file. Start with XWPFDocument, which represents the DOCX file. Different POI-XWPF classes extract different parts of the data: XWPFHeader and XWPFFooter read the header and footer respectively, XWPFParagraph reads paragraphs and XWPFTable reads the tables of a DOCX. We can also read the complete text of a DOCX in one go by using XWPFWordExtractor.
Read Header and Footer using XWPFHeader and XWPFFooter
Apache POI provides XWPFHeader and XWPFFooter to read the header and footer respectively. First create an XWPFDocument object from the DOCX file; the demo below passes it an OPCPackage opened on a FileInputStream. Then create an XWPFHeaderFooterPolicy object by passing the XWPFDocument instance. Fetch instances of XWPFHeader and XWPFFooter from the XWPFHeaderFooterPolicy object using the methods below.
XWPFHeaderFooterPolicy.getFirstPageHeader(): Provides the header of the first page.
XWPFHeaderFooterPolicy.getDefaultHeader(): Provides the default header of the DOCX file, used on every page that has no more specific header.
XWPFHeaderFooterPolicy.getFirstPageFooter(): Provides the footer of the first page.
XWPFHeaderFooterPolicy.getDefaultFooter(): Provides the default footer of the DOCX file, used on every page that has no more specific footer.
Find the sample demo for getDefaultHeader and getDefaultFooter.
ReadDOCXHeaderFooter.java
package com.concretepage;

import java.io.FileInputStream;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.model.XWPFHeaderFooterPolicy;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFFooter;
import org.apache.poi.xwpf.usermodel.XWPFHeader;

public class ReadDOCXHeaderFooter {
    public static void main(String[] args) {
        // try-with-resources closes the input stream once the document is loaded
        try (FileInputStream fis = new FileInputStream("D:/docx/read-test.docx")) {
            XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
            XWPFHeaderFooterPolicy policy = new XWPFHeaderFooterPolicy(xdoc);

            // read header (null when the document defines no default header)
            XWPFHeader header = policy.getDefaultHeader();
            if (header != null) {
                System.out.println(header.getText());
            }

            // read footer (null when the document defines no default footer)
            XWPFFooter footer = policy.getDefaultFooter();
            if (footer != null) {
                System.out.println(footer.getText());
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
Output
This is header
This is footer
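The demo above reads only the default header. As a minimal sketch of the first-page getter from the list above, the following helper returns the text of the header for page one and guards against null, since the getters return null when the document does not define that header. The class name FirstPageHeaderSketch and the method firstPageHeaderText are hypothetical names chosen for illustration. Falling back to the default header is a design choice of this sketch, not something POI does for you. The main method runs against an empty in-memory document, which assumes a recent POI release; for a real file, load the XWPFDocument as in the demo above.

```java
import org.apache.poi.xwpf.model.XWPFHeaderFooterPolicy;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFHeader;

public class FirstPageHeaderSketch {

    // Hypothetical helper: text of the header used on page one, or "" if the
    // document defines neither a first-page header nor a default header.
    // Declared "throws Exception" because older POI releases declare checked
    // exceptions on the XWPFHeaderFooterPolicy constructor.
    static String firstPageHeaderText(XWPFDocument doc) throws Exception {
        XWPFHeaderFooterPolicy policy = new XWPFHeaderFooterPolicy(doc);

        // Prefer the dedicated first-page header when the document has one.
        XWPFHeader header = policy.getFirstPageHeader();
        if (header == null) {
            // Assumption of this sketch: fall back to the default header.
            header = policy.getDefaultHeader();
        }
        return header == null ? "" : header.getText();
    }

    public static void main(String[] args) throws Exception {
        // An empty in-memory document has no headers, so this exercises the
        // null-safe path; brackets make the empty result visible.
        try (XWPFDocument doc = new XWPFDocument()) {
            System.out.println("[" + firstPageHeaderText(doc) + "]");
        }
    }
}
```

The same pattern applies to getFirstPageFooter() and getDefaultFooter() with XWPFFooter in place of XWPFHeader.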
Read Paragraph using XWPFParagraph
Apache POI provides the XWPFParagraph class to fetch paragraph text. Using XWPFDocument.getParagraphs(), we get the list of all paragraphs of the document. Find the example.
ExtractParagraphDOCX.java
package com.concretepage;

import java.io.FileInputStream;
import java.util.List;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

public class ExtractParagraphDOCX {
    public static void main(String[] args) {
        try {
            FileInputStream fis = new FileInputStream("D:/docx/read-test.docx");
            XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));

            // getParagraphs() returns the paragraphs of the document body in order
            List<XWPFParagraph> paragraphList = xdoc.getParagraphs();
            for (XWPFParagraph paragraph : paragraphList) {
                System.out.println(paragraph.getText());
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
Output
This is body content of Page One.
This is body content of Page Two.
Read Table using XWPFTable
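Apache POI provides the XWPFTable class to fetch table data. We can get this object in two ways. The first is to use XWPFDocument directly.
List<XWPFTable> tables = XWPFDocument.getTables()
The second is to go through an IBodyElement. Note that getBody() returns the body that contains the element, so getTables() returns every table of that body, not only the current element.
IBodyElement.getBody().getTables()
Find the example for the second way.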
ExtractTableDOCX.java
package com.concretepage;

import java.io.FileInputStream;
import java.util.Iterator;
import java.util.List;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.BodyElementType;
import org.apache.poi.xwpf.usermodel.IBodyElement;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFTable;

public class ExtractTableDOCX {
    public static void main(String[] args) {
        try {
            FileInputStream fis = new FileInputStream("D:/docx/read-test.docx");
            XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
            Iterator<IBodyElement> bodyElementIterator = xdoc.getBodyElementsIterator();
            while (bodyElementIterator.hasNext()) {
                IBodyElement element = bodyElementIterator.next();
                // compare the enum directly instead of its name
                if (element.getElementType() == BodyElementType.TABLE) {
                    List<XWPFTable> tableList = element.getBody().getTables();
                    for (XWPFTable table : tableList) {
                        System.out.println("Total Number of Rows of Table:" + table.getNumberOfRows());
                        System.out.println(table.getText());
                    }
                    // getTables() already returned every table of the body, so stop
                    // here to avoid printing them again for each TABLE element
                    break;
                }
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
Output
Total Number of Rows of Table:2
Row 1- column 1 Row 1- column 2
Row 2- column 1 Row 2- column 2
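The demo above covers only the IBodyElement route. As a rough sketch of the first way, XWPFDocument.getTables(), the following reads each table cell by cell through getRows() and getTableCells() instead of taking the whole table text at once. The class TableCellsSketch and its methods readRows and sampleDocument are hypothetical names introduced for illustration. To stay independent of any file on disk, the sketch builds a 2x2 table in memory, writes it to a byte array and reopens it. It assumes a POI release in which XWPFDocument is Closeable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;

public class TableCellsSketch {

    // Hypothetical helper: the rows of every top-level table, each row as a
    // list of cell texts.
    static List<List<String>> readRows(XWPFDocument doc) {
        List<List<String>> rows = new ArrayList<>();
        for (XWPFTable table : doc.getTables()) {
            for (XWPFTableRow row : table.getRows()) {
                List<String> cells = new ArrayList<>();
                for (XWPFTableCell cell : row.getTableCells()) {
                    cells.add(cell.getText());
                }
                rows.add(cells);
            }
        }
        return rows;
    }

    // Builds a 2x2 sample table, saves it to memory and reopens it, so the
    // document is read the same way a file from disk would be.
    static XWPFDocument sampleDocument() throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (XWPFDocument draft = new XWPFDocument()) {
            XWPFTable table = draft.createTable(2, 2);
            for (int r = 0; r < 2; r++) {
                for (int c = 0; c < 2; c++) {
                    table.getRow(r).getCell(c)
                         .setText("Row " + (r + 1) + "- column " + (c + 1));
                }
            }
            draft.write(out);
        }
        return new XWPFDocument(new ByteArrayInputStream(out.toByteArray()));
    }

    public static void main(String[] args) throws IOException {
        try (XWPFDocument doc = sampleDocument()) {
            for (List<String> row : readRows(doc)) {
                System.out.println(String.join(" | ", row));
            }
        }
    }
}
```

Reading cell by cell keeps the row and column position of each value, which is useful when the table data has to be mapped to fields rather than printed.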
Extract Complete Data using XWPFWordExtractor
Apache POI provides the XWPFWordExtractor class to fetch the complete text of every page of a DOCX.
BasicTextExtractor.java
package com.concretepage;

import java.io.FileInputStream;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class BasicTextExtractor {
    public static void main(String[] args) {
        try {
            FileInputStream fis = new FileInputStream("D:/docx/read-test.docx");
            XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));

            // the extractor returns header, body, table and footer text together
            XWPFWordExtractor extractor = new XWPFWordExtractor(xdoc);
            System.out.println(extractor.getText());
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
Output
This is header
This is body content of Page One.
Row 1- column 1 Row 1- column 2
Row 2- column 1 Row 2- column 2
This is body content of Page Two.
This is footer
Gradle file for Apache POI-XWPF
Find the Gradle file to resolve the JAR for Apache POI-XWPF.
build.gradle
apply plugin: 'java'
apply plugin: 'eclipse'
archivesBaseName = 'ApachePOI'
version = '1'
repositories {
    mavenCentral()
}
dependencies {
    compile 'org.apache.poi:poi-ooxml:3.11'
}