Microsoft Office documents carry metadata properties such as "Title", "Author", "Comments", "Keywords", "CreateDateTime", "LastSaveDateTime", and so on.
Apache POI HPSF, part of the Java API for Microsoft documents, is a neat library for extracting such properties from Word, Excel or PowerPoint documents (the binary .doc, .xls and .ppt formats). It is useful if users upload MS Office files and you would like to show or edit the metadata before the uploaded files are stored in a content repository or somewhere else. Let's write a Java class named MsOfficeExtractor.
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.lang.reflect.Method;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.poi.hpsf.PropertySetFactory;
import org.apache.poi.hpsf.SummaryInformation;
import org.apache.poi.poifs.eventfilesystem.POIFSReader;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderEvent;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderListener;

public class MsOfficeExtractor
{
    private String[] properties;
    private Map<String, Method> methodMap;

    public MsOfficeExtractor(final String[] properties)
    {
        this.properties = (properties == null ? new String[] {} : properties);
        methodMap = new HashMap<String, Method>();
        try {
            // loop over the null-safe field, not the (possibly null) argument
            for (int i = 0; i < this.properties.length; i++) {
                // "Title" is resolved to SummaryInformation.getTitle(), and so on
                methodMap.put(this.properties[i],
                    SummaryInformation.class.getMethod("get" + this.properties[i], (Class<?>[]) null));
            }
        } catch (SecurityException e) {
            // error handling
        } catch (NoSuchMethodException e) {
            // error handling
        }
    }

    public Map<String, Object> parseMetaData(final byte[] data)
    {
        if (properties.length == 0) {
            return Collections.emptyMap();
        }

        InputStream in = null;
        try {
            in = new ByteArrayInputStream(data);
            POIFSReader poifsReader = new POIFSReader();
            MetaDataListener metaDataListener = new MetaDataListener();
            // the listener is only called for the summary information stream
            poifsReader.registerListener(metaDataListener, "\005SummaryInformation");
            poifsReader.read(in);
            return metaDataListener.metaData;
        } catch (final IOException e) {
            // error handling
        } catch (final RuntimeException e) {
            // error handling
        } finally {
            if (in != null) {
                try {
                    in.close();
                } catch (IOException e) {
                    // nothing to do
                }
            }
        }

        // reached only if parsing failed
        return Collections.emptyMap();
    }
}
The constructor expects the names of the properties we want to extract. A valid name is the name of a getter in org.apache.poi.hpsf.SummaryInformation without the "get" prefix, e.g. "Title" for getTitle(). The map methodMap maps each property to be extracted to the getter method that the listener, explained below, will call. The core method is parseMetaData, which expects the content of an MS Office file as a byte array. We now need a listener class, MetaDataListener, which is called during parsing.
// Declared inside MsOfficeExtractor as an inner class, because it reads
// the outer fields properties and methodMap.
private class MetaDataListener implements POIFSReaderListener
{
    public final Map<String, Object> metaData;

    public MetaDataListener()
    {
        metaData = new HashMap<String, Object>();
    }

    public void processPOIFSReaderEvent(final POIFSReaderEvent event)
    {
        try {
            final SummaryInformation summaryInformation =
                (SummaryInformation) PropertySetFactory.create(event.getStream());
            for (int i = 0; i < properties.length; i++) {
                Method method = methodMap.get(properties[i]);
                if (method == null) {
                    // the getter could not be resolved in the constructor
                    continue;
                }
                Object propertyValue = method.invoke(summaryInformation, (Object[]) null);
                metaData.put(properties[i], propertyValue);
            }
        } catch (final Exception e) {
            // error handling
        }
    }
}
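The reflective lookup shared by the constructor and the listener can be tried on its own, without POI. The following is only a sketch: GetterLookup and its nested FakeSummary class are hypothetical names I made up as a stand-in for SummaryInformation. Unlike the constructor above, it catches NoSuchMethodException per name, so one misspelled property does not prevent the remaining ones from being resolved.

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

public class GetterLookup {

    // Hypothetical stand-in for SummaryInformation so the sketch runs without POI.
    public static class FakeSummary {
        public String getTitle() { return "Quarterly report"; }
        public String getAuthor() { return "jdoe"; }
    }

    // Resolves the public no-argument method "get" + name for every requested name.
    // Names without a matching getter are left out of the result.
    public static Map<String, Method> resolve(Class<?> type, String... names) {
        Map<String, Method> methods = new HashMap<>();
        for (String name : names) {
            try {
                methods.put(name, type.getMethod("get" + name));
            } catch (NoSuchMethodException e) {
                // unknown property name: skip it and keep going
            }
        }
        return methods;
    }

    // Invokes every resolved getter on the target and collects the results by name.
    public static Map<String, Object> read(Object target, Map<String, Method> methods) {
        Map<String, Object> values = new HashMap<>();
        for (Map.Entry<String, Method> entry : methods.entrySet()) {
            try {
                values.put(entry.getKey(), entry.getValue().invoke(target));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot read " + entry.getKey(), e);
            }
        }
        return values;
    }
}
```

Calling GetterLookup.resolve(GetterLookup.FakeSummary.class, "Title", "Bogus") yields a map with a single entry for "Title", because FakeSummary has no getBogus() method.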
MetaDataListener builds a map from each requested property name to its extracted value. Usage is simple:
// initialize extractor
String[] poiProperties = new String[] {"Comments", "CreateDateTime", "LastSaveDateTime"};
MsOfficeExtractor msOfficeExtractor = new MsOfficeExtractor(poiProperties);
// get byte array of any MS office document
byte[] data = ...
// extract metadata
Map<String, Object> metadata = msOfficeExtractor.parseMetaData(data);
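The returned map holds values of mixed types: for the properties above, getComments() returns a String, while getCreateDateTime() and getLastSaveDateTime() return a java.util.Date. If the metadata is to be shown to the user, the values have to be turned into text first. Here is a minimal sketch of that step. The class name MetadataFormatter, the ISO-8601 date format and the empty string for missing values are my own assumptions, not something POI prescribes.

```java
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataFormatter {

    // Converts extracted property values into display strings.
    // Dates become ISO-8601 timestamps in UTC, null becomes an empty string,
    // everything else falls back to toString().
    public static Map<String, String> toDisplayStrings(Map<String, Object> metadata) {
        Map<String, String> display = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : metadata.entrySet()) {
            Object value = entry.getValue();
            String text;
            if (value == null) {
                text = "";
            } else if (value instanceof Date) {
                text = ((Date) value).toInstant().toString();
            } else {
                text = value.toString();
            }
            display.put(entry.getKey(), text);
        }
        return display;
    }
}
```

With the map from the usage example, MetadataFormatter.toDisplayStrings(metadata) gives a Map<String, String> that can be handed to a view as is.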
In the next post I will show how to extract metadata from MS Outlook msg files. Apache POI-HSMF, the Java API for accessing MS Outlook msg files, has limitations and is not flexible enough, so I will present my own solution.