Apache POI-XWPF: Read MS Word DOCX

By Arvind Rai, February 04, 2015
This page provides Apache POI-XWPF API examples for reading the header, footer, paragraphs and tables of an MS Word DOCX file. Start with XWPFDocument, which loads the DOCX file. Different POI-XWPF classes extract different parts of the document: XWPFHeader and XWPFFooter read the header and footer respectively, XWPFParagraph reads paragraphs and XWPFTable reads tables. We can also read the complete text of a DOCX in one go by using XWPFWordExtractor.

Read Header and Footer using XWPFHeader and XWPFFooter

Apache POI provides XWPFHeader and XWPFFooter to read the header and footer respectively. First create an XWPFDocument object from the DOCX file. Then create an XWPFHeaderFooterPolicy object by passing the XWPFDocument instance. Fetch the XWPFHeader and XWPFFooter instances from the XWPFHeaderFooterPolicy object using the methods below.

XWPFHeaderFooterPolicy.getFirstPageHeader(): Provides the header of the first page.
XWPFHeaderFooterPolicy.getDefaultHeader(): Provides the default header of the DOCX file, used on every page that has no more specific header.
XWPFHeaderFooterPolicy.getFirstPageFooter(): Provides the footer of the first page.
XWPFHeaderFooterPolicy.getDefaultFooter(): Provides the default footer of the DOCX file, used on every page that has no more specific footer.
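
These getters return null when the document does not define that kind of header or footer, so a lookup usually needs a fallback. The following is a minimal sketch of one way to do that, not part of the original demo. The class name HeaderFooterLookup and its two helper methods are hypothetical, and the file path is taken from the command line.

```java
import java.io.FileInputStream;
import org.apache.poi.xwpf.model.XWPFHeaderFooterPolicy;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFFooter;
import org.apache.poi.xwpf.usermodel.XWPFHeader;

public class HeaderFooterLookup {

    // Hypothetical helper: text of the first-page header, falling back to
    // the default header, or "" when the document defines neither.
    static String firstPageHeaderText(XWPFHeaderFooterPolicy policy) {
        XWPFHeader header = policy.getFirstPageHeader();
        if (header == null) {
            header = policy.getDefaultHeader();
        }
        return header == null ? "" : header.getText();
    }

    // Hypothetical helper: the same fallback for the footer.
    static String firstPageFooterText(XWPFHeaderFooterPolicy policy) {
        XWPFFooter footer = policy.getFirstPageFooter();
        if (footer == null) {
            footer = policy.getDefaultFooter();
        }
        return footer == null ? "" : footer.getText();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("Usage: HeaderFooterLookup <path-to-docx>");
            return;
        }
        // try-with-resources closes the stream even if reading fails
        try (FileInputStream fis = new FileInputStream(args[0])) {
            XWPFDocument xdoc = new XWPFDocument(fis);
            XWPFHeaderFooterPolicy policy = new XWPFHeaderFooterPolicy(xdoc);
            System.out.println(firstPageHeaderText(policy));
            System.out.println(firstPageFooterText(policy));
        }
    }
}
```

Returning an empty string instead of null keeps the calling code free of null checks. Whether that is the right default depends on the application.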

Find the sample demo for getDefaultHeader and getDefaultFooter.
ReadDOCXHeaderFooter.java
package com.concretepage;
import java.io.FileInputStream;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.model.XWPFHeaderFooterPolicy;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFFooter;
import org.apache.poi.xwpf.usermodel.XWPFHeader;
public class ReadDOCXHeaderFooter {
   public static void main(String[] args) {
     // try-with-resources closes the input stream even if reading fails
     try (FileInputStream fis = new FileInputStream("D:/docx/read-test.docx")) {
       XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
       XWPFHeaderFooterPolicy policy = new XWPFHeaderFooterPolicy(xdoc);
       // read header; it is null when the document has no default header
       XWPFHeader header = policy.getDefaultHeader();
       if (header != null) {
         System.out.println(header.getText());
       }
       // read footer; it is null when the document has no default footer
       XWPFFooter footer = policy.getDefaultFooter();
       if (footer != null) {
         System.out.println(footer.getText());
       }
     } catch(Exception ex) {
       ex.printStackTrace();
     }
   }
}
Find the output.
This is header
This is footer 

Read Paragraph using XWPFParagraph

Apache POI provides the XWPFParagraph class to fetch paragraph text. XWPFDocument.getParagraphs() returns the list of paragraphs in the document body; text inside tables, headers and footers is not included. Find the example.
ExtractParagraphDOCX.java
package com.concretepage;
import java.io.FileInputStream;
import java.util.List;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
public class ExtractParagraphDOCX {
   public static void main(String[] args) {
     // try-with-resources closes the input stream even if reading fails
     try (FileInputStream fis = new FileInputStream("D:/docx/read-test.docx")) {
       XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
       // paragraphs of the document body, in document order
       List<XWPFParagraph> paragraphList = xdoc.getParagraphs();
       for (XWPFParagraph paragraph : paragraphList) {
         System.out.println(paragraph.getText());
       }
     } catch(Exception ex) {
       ex.printStackTrace();
     }
   }
}
Find the output.
This is body content of Page One.

This is body content of Page Two. 
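
The blank line in the output above comes from a paragraph without text. The following is a minimal sketch of how such paragraphs could be skipped. The class name NonEmptyParagraphs and the helper nonEmptyParagraphTexts are hypothetical names chosen for this illustration, and the file path is read from the command line.

```java
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

public class NonEmptyParagraphs {

    // Hypothetical helper: texts of all body paragraphs that contain
    // something other than whitespace, in document order.
    static List<String> nonEmptyParagraphTexts(XWPFDocument xdoc) {
        List<String> texts = new ArrayList<>();
        for (XWPFParagraph paragraph : xdoc.getParagraphs()) {
            String text = paragraph.getText();
            if (text != null && !text.trim().isEmpty()) {
                texts.add(text);
            }
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("Usage: NonEmptyParagraphs <path-to-docx>");
            return;
        }
        try (FileInputStream fis = new FileInputStream(args[0])) {
            XWPFDocument xdoc = new XWPFDocument(fis);
            for (String text : nonEmptyParagraphTexts(xdoc)) {
                System.out.println(text);
            }
        }
    }
}
```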

Read Table using XWPFTable

Apache POI provides the XWPFTable class to fetch table data. We can get this object in two ways. First, directly from XWPFDocument.
List<XWPFTable> tables = XWPFDocument.getTables() 
Second, through IBodyElement. Note that getBody().getTables() returns every table of the enclosing body, not only the current element.
IBodyElement.getBody().getTables() 
Now find the example to extract the data of tables within a DOCX.
ExtractTableDOCX.java
package com.concretepage;
import java.io.FileInputStream;
import java.util.Iterator;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.BodyElementType;
import org.apache.poi.xwpf.usermodel.IBodyElement;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFTable;
public class ExtractTableDOCX {
   public static void main(String[] args) {
     // try-with-resources closes the input stream even if reading fails
     try (FileInputStream fis = new FileInputStream("D:/docx/read-test.docx")) {
       XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
       Iterator<IBodyElement> bodyElementIterator = xdoc.getBodyElementsIterator();
       while (bodyElementIterator.hasNext()) {
         IBodyElement element = bodyElementIterator.next();
         if (element.getElementType() == BodyElementType.TABLE) {
           // A TABLE body element is itself the XWPFTable. Calling
           // element.getBody().getTables() here would return all tables of
           // the body on every iteration and print duplicates.
           XWPFTable table = (XWPFTable) element;
           System.out.println("Total Number of Rows of Table:" + table.getNumberOfRows());
           System.out.println(table.getText());
         }
       }
     } catch(Exception ex) {
       ex.printStackTrace();
     }
   }
}
Find the output.
Total Number of Rows of Table:2
Row 1- column 1	Row 1- column 2
Row 2- column 1	Row 2- column 2 
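
XWPFTable.getText() returns the whole table as one string. When individual cell values are needed, the rows and cells can be walked instead. The following is a minimal sketch under that assumption. The class name TableCells and the helper cellTexts are hypothetical, and the file path is read from the command line.

```java
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;

public class TableCells {

    // Hypothetical helper: one inner list per row, one string per cell.
    static List<List<String>> cellTexts(XWPFTable table) {
        List<List<String>> rows = new ArrayList<>();
        for (XWPFTableRow row : table.getRows()) {
            List<String> cells = new ArrayList<>();
            for (XWPFTableCell cell : row.getTableCells()) {
                cells.add(cell.getText());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("Usage: TableCells <path-to-docx>");
            return;
        }
        try (FileInputStream fis = new FileInputStream(args[0])) {
            XWPFDocument xdoc = new XWPFDocument(fis);
            // getTables() lists the top-level tables of the document body
            for (XWPFTable table : xdoc.getTables()) {
                for (List<String> row : cellTexts(table)) {
                    System.out.println(String.join(" | ", row));
                }
            }
        }
    }
}
```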

Extract Complete Data using XWPFWordExtractor

Apache POI provides the XWPFWordExtractor class to extract the complete text of a DOCX, including headers, footers and tables, in one call.
BasicTextExtractor.java
package com.concretepage;
import java.io.FileInputStream;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
public class BasicTextExtractor {
   public static void main(String[] args) {
     // try-with-resources closes the input stream even if reading fails
     try (FileInputStream fis = new FileInputStream("D:/docx/read-test.docx")) {
       XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
       // the extractor collects header, body, table and footer text
       XWPFWordExtractor extractor = new XWPFWordExtractor(xdoc);
       System.out.println(extractor.getText());
     } catch(Exception ex) {
       ex.printStackTrace();
     }
   }
}
Find the output.
This is header
This is body content of Page One.

Row 1- column 1	Row 1- column 2
Row 2- column 1	Row 2- column 2

This is body content of Page Two.

This is footer 
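
Because the extractor output also contains table text, it can serve for a simple search across the whole document. The following is a minimal sketch of that idea, not a definitive implementation. The class name DocxSearch and the helper containsText are hypothetical, and the file path and search term are read from the command line.

```java
import java.io.FileInputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class DocxSearch {

    // Hypothetical helper: true when the extracted text of the document
    // (paragraphs, tables, headers, footers) contains the search term.
    static boolean containsText(XWPFDocument xdoc, String term) {
        XWPFWordExtractor extractor = new XWPFWordExtractor(xdoc);
        String text = extractor.getText();
        return text != null && text.contains(term);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("Usage: DocxSearch <path-to-docx> <term>");
            return;
        }
        try (FileInputStream fis = new FileInputStream(args[0])) {
            XWPFDocument xdoc = new XWPFDocument(fis);
            System.out.println(containsText(xdoc, args[1]));
        }
    }
}
```

The comparison is case-sensitive and matches only within the extracted plain text, so a term split by a row or paragraph boundary is not found.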

Gradle file for Apache POI-XWPF

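Find the Gradle file that resolves the JAR dependencies for Apache POI-XWPF. On Gradle 7 and later, use implementation instead of compile, because the compile configuration has been removed.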
build.gradle
apply plugin: 'java'
apply plugin: 'eclipse'
archivesBaseName = 'ApachePOI'
version = '1' 
repositories {
    mavenCentral()
}
dependencies {
    compile 'org.apache.poi:poi-ooxml:3.11'
} 
The input DOCX file read-test.docx has been attached in ZIP file.

Download Source Code
