Microsoft Office documents carry metadata properties such as "Title", "Author", "Comments", "Keywords", "CreateDateTime" and "LastSaveDateTime".
Apache POI HPSF, part of the Java API for Microsoft documents, is a neat library for extracting such properties from Word, Excel or PowerPoint documents in the binary OLE2 formats (.doc, .xls, .ppt). This is useful when users upload MS Office files and you want to show or edit the metadata before the uploaded files are stored in a content repository or elsewhere. Let's write a Java class named MsOfficeExtractor.
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.lang.reflect.Method;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.poi.hpsf.PropertySetFactory;
import org.apache.poi.hpsf.SummaryInformation;
import org.apache.poi.poifs.eventfilesystem.POIFSReader;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderEvent;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderListener;
public class MsOfficeExtractor
{
 private String[] properties;
 private Map<String, Method> methodMap;
 public MsOfficeExtractor(final String[] properties)
 {
  this.properties = (properties == null ? new String[] {} : properties);
  methodMap = new HashMap<String, Method>();
  try {
   // Iterate over this.properties: the constructor argument itself may be null
   for (int i = 0; i < this.properties.length; i++) {
    methodMap.put(this.properties[i], SummaryInformation.class.getMethod("get" + this.properties[i], (Class[]) null));
   }
  } catch (SecurityException e) {
   // error handling
  } catch (NoSuchMethodException e) {
   // error handling
  }
 }
 
 public Map<String, Object> parseMetaData(final byte[] data)
 {
  if (properties.length == 0) {
   return Collections.emptyMap();
  }
  InputStream in = null;
  try {
   in = new ByteArrayInputStream(data);
   POIFSReader poifsReader = new POIFSReader();
   MetaDataListener metaDataListener = new MetaDataListener();
   poifsReader.registerListener(metaDataListener, "\005SummaryInformation");
   poifsReader.read(in);
   return metaDataListener.metaData;
  } catch (final IOException e) {
   // error handling
  } catch (final RuntimeException e) {
   // error handling
  } finally {
   if (in != null) {
    try {
     in.close();
    } catch (IOException e) {
     // nothing to do
    }
   }
  }
  // Reached only when parsing failed: return an empty result
  return Collections.emptyMap();
 }
}
The constructor expects the names of the properties we want to extract. Valid names correspond to the getters defined in org.apache.poi.hpsf.SummaryInformation (for example, "Title" maps to getTitle()). The map methodMap maps each property name to the getter method that the listener, explained below, invokes. The core method is parseMetaData, which expects the content of an MS Office file as a byte array. We now need a listener class, MetaDataListener, which is called while parsing. Because it reads properties and methodMap, it has to be a non-static inner class of MsOfficeExtractor.
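// Non-static inner class: declare it inside the body of MsOfficeExtractor
// so that it can read the fields properties and methodMap
public class MetaDataListener implements POIFSReaderListener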
{
 public final Map<String, Object> metaData;
 public MetaDataListener()
 {
  metaData = new HashMap<String, Object>();
 }
 public void processPOIFSReaderEvent(final POIFSReaderEvent event)
 {
  try {
   final SummaryInformation summaryInformation = (SummaryInformation) PropertySetFactory.create(event.getStream());
   for (int i = 0; i < properties.length; i++) {
    Method method = methodMap.get(properties[i]);
    Object propertyValue = method.invoke(summaryInformation, (Object[]) null);
    metaData.put(properties[i], propertyValue);
   }
  } catch (final Exception e) {
   // error handling
  }
 }
}
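The reflection mechanism behind methodMap can be tried in isolation, without POI on the classpath. The following is a minimal sketch, not the extractor itself: DocumentInfo is a hypothetical stand-in for SummaryInformation with two getters, and GetterLookupSketch, buildMethodMap and readValues are illustrative names. Unlike the constructor above, it skips unknown property names instead of stopping at the first one, which is one possible way to fill in the error handling.

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

class GetterLookupSketch
{
 // Hypothetical stand-in for SummaryInformation, so the sketch runs without POI
 public static class DocumentInfo
 {
  public String getTitle() { return "Quarterly report"; }
  public String getAuthor() { return "Jane Doe"; }
 }

 // Maps each property name to the public no-argument getter "get" + name.
 // Names without a matching getter are skipped (an assumption of this sketch).
 static Map<String, Method> buildMethodMap(final Class<?> type, final String[] names)
 {
  final Map<String, Method> methodMap = new HashMap<String, Method>();
  for (final String name : names) {
   try {
    methodMap.put(name, type.getMethod("get" + name));
   } catch (NoSuchMethodException e) {
    // unknown property name: leave it out of the map
   }
  }
  return methodMap;
 }

 // Invokes every looked-up getter on the target and collects the results
 static Map<String, Object> readValues(final Object target, final Map<String, Method> methodMap)
 {
  final Map<String, Object> values = new HashMap<String, Object>();
  for (final Map.Entry<String, Method> entry : methodMap.entrySet()) {
   try {
    values.put(entry.getKey(), entry.getValue().invoke(target));
   } catch (Exception e) {
    throw new IllegalStateException("Could not read " + entry.getKey(), e);
   }
  }
  return values;
 }

 public static void main(final String[] args)
 {
  final String[] names = {"Title", "Author", "NoSuchProperty"};
  final Map<String, Method> methodMap = buildMethodMap(DocumentInfo.class, names);
  System.out.println(readValues(new DocumentInfo(), methodMap));
 }
}
```

Skipping unknown names keeps the remaining properties usable. Whether that is preferable to rejecting the whole list depends on how the names reach the extractor.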
The goal of this listener is to build a map from the requested property names to their extracted values. Usage is simple:
// initialize extractor
String[] poiProperties = new String[] {"Comments", "CreateDateTime", "LastSaveDateTime"};
MsOfficeExtractor msOfficeExtractor = new MsOfficeExtractor(poiProperties);
// get byte array of any MS office document
byte[] data = ...
// extract metadata
Map<String, Object> metadata = msOfficeExtractor.parseMetaData(data);
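Since the goal is to show the metadata to the user, the values in the returned map usually need to be turned into text first. SummaryInformation returns java.util.Date for CreateDateTime and LastSaveDateTime and String for Comments, and a property that is not set in the document may come back as null. The following is a minimal sketch of one way to handle this; MetaDataFormatter, toDisplayMap and the date pattern are illustrative assumptions, not part of POI.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TimeZone;

class MetaDataFormatter
{
 // Converts extracted values to display strings: dates are formatted in UTC
 // with a fixed pattern, null becomes an empty string, anything else uses toString()
 static Map<String, String> toDisplayMap(final Map<String, Object> metadata)
 {
  final SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
  dateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
  final Map<String, String> display = new LinkedHashMap<String, String>();
  for (final Map.Entry<String, Object> entry : metadata.entrySet()) {
   final Object value = entry.getValue();
   if (value == null) {
    display.put(entry.getKey(), "");
   } else if (value instanceof Date) {
    display.put(entry.getKey(), dateFormat.format((Date) value));
   } else {
    display.put(entry.getKey(), value.toString());
   }
  }
  return display;
 }

 public static void main(final String[] args)
 {
  // Sample values standing in for the result of parseMetaData
  final Map<String, Object> metadata = new LinkedHashMap<String, Object>();
  metadata.put("Comments", "Draft for review");
  metadata.put("CreateDateTime", new Date(0L));
  metadata.put("LastSaveDateTime", null);
  System.out.println(toDisplayMap(metadata));
 }
}
```

A fixed time zone keeps the output independent of the server's default settings; a real application would probably format dates in the user's locale instead.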
In the next post I will show how to extract metadata from MS Outlook msg files. Apache POI-HSMF, Java API to access MS Outlook msg files, has limits and is not flexible enough. I will present my own powerful solution.