Thoughts on software development: How to configure Apache Jackrabbit for binary content search?

The configuration file repository.xml is described in detail in the Apache Jackrabbit documentation. To support binary content search, the workspace and versioning sections of that file must be extended as follows:

<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
    <param name="path" value="${wsp.home}/index"/>
    <param name="extractorPoolSize" value="2"/>
    <param name="supportHighlighting" value="true"/>
    <param name="textFilterClasses"
      value="org.apache.jackrabbit.extractor.PlainTextExtractor,
      org.apache.jackrabbit.extractor.MsWordTextExtractor,
      org.apache.jackrabbit.extractor.MsExcelTextExtractor,
      org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
      org.apache.jackrabbit.extractor.PdfTextExtractor,
      org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
      org.apache.jackrabbit.extractor.RTFTextExtractor,
      org.apache.jackrabbit.extractor.HTMLTextExtractor,
      org.apache.jackrabbit.extractor.XMLTextExtractor"/>
</SearchIndex>

With this configuration, the content of the following document types becomes full-text searchable:

Plain Text
MS Word
MS Excel
MS PowerPoint
PDF
OpenOffice
RTF
HTML
XML

There are two points to consider. First, after a document is added, the content repository needs some time to parse the document's content and extract the information it needs. In my tests I had to wait about 7 seconds.

// add document and save changes
....

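// sleep 7 seconds to allow the content of the document to be indexed
try {
    Thread.sleep(7000);
} catch (InterruptedException e) {
    // restore the interrupt flag instead of silently swallowing the exception
    Thread.currentThread().interrupt();
}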

// do full text search
....
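
A fixed sleep is fragile, since the time the indexer needs will probably vary with document size and load. A minimal sketch of an alternative, assuming the full text search can be wrapped in a condition that reports whether the new document is already found: poll that condition until it holds or a timeout expires. The class name IndexWait and the method waitUntil are hypothetical and not part of Jackrabbit.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical helper: polls a condition instead of sleeping for a fixed time. */
final class IndexWait {

    private IndexWait() {
    }

    /**
     * Checks the condition repeatedly, pausing intervalMillis between checks.
     * Returns true as soon as the condition holds, false if the timeout
     * elapses first or the thread is interrupted. The condition is always
     * checked at least once.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingMillis = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (remainingMillis <= 0) {
                return false;
            }
            try {
                // never sleep past the deadline, and never pass a value below 1 ms
                Thread.sleep(Math.max(1L, Math.min(intervalMillis, remainingMillis)));
            } catch (InterruptedException e) {
                // restore the interrupt flag and give up
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

For example, IndexWait.waitUntil(() -> searchFinds("jackrabbit"), 10000, 250) would repeat the check every 250 ms for at most 10 seconds, where searchFinds stands for your own full text query.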

The second point concerns the configuration of node types. Full-text search works if you use the standard node type nt:file, which contains a child node jcr:content of type nt:resource. If you use custom node types, you must ensure that the node type describing the binary content has at least two properties: jcr:data (where the content is stored) and jcr:mimeType. The mime type property is very important: without it, no text is extracted (which makes sense, doesn't it?). Here is an example in XML notation:

<nodeType name="cssns:resource"
          isMixin="false"
          hasOrderableChildNodes="false"
          primaryItemName="jcr:data">
    <supertypes>
        <supertype>nt:base</supertype>
        <supertype>mix:referenceable</supertype>
    </supertypes>
    <propertyDefinition name="jcr:mimeType"
                        requiredType="String"
                        autoCreated="false"
                        mandatory="true"
                        onParentVersion="COPY"
                        protected="false"
                        multiple="false">
    </propertyDefinition>
    <propertyDefinition name="jcr:data"
                        requiredType="Binary"
                        autoCreated="false"
                        mandatory="true"
                        onParentVersion="COPY"
                        protected="false"
                        multiple="false">
    </propertyDefinition>
</nodeType>
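
Since jcr:mimeType is mandatory in this definition, the application has to supply a value whenever it stores a file. A minimal sketch, assuming the mime type is derived from the file extension: the class MimeTypes and the method forFileName are hypothetical names, and the table only covers the document types listed above. The values are the commonly used MIME types for these formats; which exact strings each extractor accepts should be checked against your Jackrabbit version.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: maps a file name to a value for jcr:mimeType. */
final class MimeTypes {

    // Commonly used MIME types for the formats covered by the extractors above.
    private static final Map<String, String> BY_EXTENSION = Map.ofEntries(
            Map.entry("txt", "text/plain"),
            Map.entry("doc", "application/msword"),
            Map.entry("xls", "application/vnd.ms-excel"),
            Map.entry("ppt", "application/vnd.ms-powerpoint"),
            Map.entry("pdf", "application/pdf"),
            Map.entry("odt", "application/vnd.oasis.opendocument.text"),
            Map.entry("rtf", "application/rtf"),
            Map.entry("html", "text/html"),
            Map.entry("htm", "text/html"),
            Map.entry("xml", "text/xml"));

    private MimeTypes() {
    }

    /** Returns the mime type for the file's extension, or empty if it is unknown. */
    static Optional<String> forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty();
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(BY_EXTENSION.get(extension));
    }
}
```

A caller would set the result as the jcr:mimeType property of the content node and decide for itself what to do with unknown extensions, for example store the file without expecting it to be searchable.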

<nodeType name="cssns:file"
          isMixin="false"
          hasOrderableChildNodes="false"
          primaryItemName="jcr:content">
    <supertypes>
        <supertype>mix:versionable</supertype>
        <supertype>cssns:hierarchyNode</supertype>
    </supertypes>
    <propertyDefinition name="cssns:size"
                        requiredType="Long"
                        autoCreated="true"
                        mandatory="true"
                        onParentVersion="COPY"
                        protected="false"
                        multiple="false">
        <defaultValues>
            <defaultValue>-1</defaultValue>
        </defaultValues>
    </propertyDefinition>
    <childNodeDefinition name="jcr:content"
                         defaultPrimaryType=""
                         autoCreated="false"
                         mandatory="true"
                         onParentVersion="COPY"
                         protected="false"
                         sameNameSiblings="false">
        <requiredPrimaryTypes>
            <requiredPrimaryType>cssns:resource</requiredPrimaryType>
        </requiredPrimaryTypes>
    </childNodeDefinition>
</nodeType>
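
With such node types in place, a full text search can target the custom resource type directly. A minimal sketch, assuming the search is issued as a JCR XPath query using jcr:contains: the class FullTextQuery and the method xpathFor are hypothetical names. The helper only doubles apostrophes so that the term stays a valid XPath string literal; it does not escape the special characters of the search expression itself.

```java
/** Hypothetical helper: builds a JCR XPath statement for a full text search. */
final class FullTextQuery {

    private FullTextQuery() {
    }

    /**
     * Builds a statement that matches all nodes of the given node type whose
     * indexed content contains the search term.
     */
    static String xpathFor(String nodeType, String searchTerm) {
        // an apostrophe inside an XPath string literal is written as two apostrophes
        String escaped = searchTerm.replace("'", "''");
        return "//element(*, " + nodeType + ")[jcr:contains(., '" + escaped + "')]";
    }
}
```

The resulting string can be passed to QueryManager.createQuery(statement, Query.XPATH) from the JCR API; for the node types above, the call would be FullTextQuery.xpathFor("cssns:resource", "jackrabbit").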

1 comment:

murali krishna, March 22, 2013 at 6:53 AM
Your code helped me a lot while configuring, but I still need a path for configuring Jackrabbit with JBoss. I am using JBoss 7.1.1 Final and Jackrabbit 2.4.3. Please help me out.


Monday, June 28, 2010
