The configuration file repository.xml is described in detail in the Apache Jackrabbit documentation. The workspace and versioning configuration sections must be extended to support binary content search as follows:
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
<param name="path" value="${wsp.home}/index"/>
<param name="extractorPoolSize" value="2"/>
<param name="supportHighlighting" value="true"/>
<param name="textFilterClasses"
value="org.apache.jackrabbit.extractor.PlainTextExtractor,
org.apache.jackrabbit.extractor.MsWordTextExtractor,
org.apache.jackrabbit.extractor.MsExcelTextExtractor,
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
org.apache.jackrabbit.extractor.PdfTextExtractor,
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
org.apache.jackrabbit.extractor.RTFTextExtractor,
org.apache.jackrabbit.extractor.HTMLTextExtractor,
org.apache.jackrabbit.extractor.XMLTextExtractor"/>
</SearchIndex>
With this configuration, the content of the following document types becomes full text searchable:
- Plain Text
- MS Word
- MS Excel
- MS PowerPoint
- PDF
- OpenOffice
- RTF
- HTML
- XML
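To illustrate what a full text search against this index might look like, here is a minimal sketch that builds a JCR XPath statement using the jcr:contains function. The class FullTextQuery and its method are hypothetical helpers, not part of Jackrabbit. To sidestep the escaping rules for search expressions, the sketch only accepts terms made of letters, digits and spaces.

```java
/** Hypothetical helper (not part of Jackrabbit) that builds a full text XPath statement. */
final class FullTextQuery {

    private FullTextQuery() {}

    /**
     * Builds an XPath statement that matches nodes of the given type
     * whose indexed content contains the search term.
     *
     * @param nodeType trusted node type name, e.g. "nt:resource"
     * @param term     plain search term (letters, digits and spaces only)
     */
    static String xpathContains(String nodeType, String term) {
        if (term == null || term.trim().isEmpty()) {
            throw new IllegalArgumentException("search term must not be empty");
        }
        // Assumption: rejecting special characters is simpler and safer
        // than escaping them for this sketch.
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (!Character.isLetterOrDigit(c) && c != ' ') {
                throw new IllegalArgumentException("unsupported character: " + c);
            }
        }
        return "//element(*, " + nodeType + ")[jcr:contains(., '" + term + "')]";
    }
}
```

The resulting string could then be passed to QueryManager.createQuery(statement, Query.XPATH) from the JCR API.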
There are two points to consider. First, after a document has been added, the content repository needs some time to parse its content and extract the text to be indexed. In my tests I had to wait about 7 seconds.
// add document and save changes
....
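// sleep 7 seconds to allow the document content to be indexed
try {
    Thread.sleep(7000);
} catch (InterruptedException e) {
    // restore the interrupt flag instead of silently swallowing the interruption
    Thread.currentThread().interrupt();
}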
// do full text search
....
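A fixed pause is either too long or too short, depending on the machine and the document size. As a possible alternative, the following sketch polls a condition until it becomes true or a timeout expires. IndexWait and awaitIndexed are hypothetical names, and the condition (for example "the search returns at least one hit") has to be supplied by the caller; nothing here is Jackrabbit API.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper (not part of Jackrabbit) that waits for the index to catch up. */
final class IndexWait {

    private IndexWait() {}

    /**
     * Polls the condition until it returns true or the timeout expires.
     *
     * @param indexed   returns true once the search finds the new document
     * @param timeoutMs maximum total waiting time in milliseconds
     * @param pollMs    pause between two checks in milliseconds (expected to be positive)
     * @return true if the condition held in time, false on timeout or interruption
     */
    static boolean awaitIndexed(BooleanSupplier indexed, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!indexed.getAsBoolean()) {
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0) {
                return false; // deadline passed without a hit
            }
            try {
                // never sleep past the deadline
                Thread.sleep(Math.min(pollMs, remainingMs));
            } catch (InterruptedException e) {
                // restore the interrupt flag and give up
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

A call could look like IndexWait.awaitIndexed(() -> hasHits("term"), 10000, 250), where hasHits is a placeholder for your own search code.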
The second point concerns the configuration of node types. Full text search works if you use the standard node type nt:file, which contains a child node jcr:content of type nt:resource. If you use custom node types, you must ensure that the node type describing the binary content has at least two properties: jcr:data (the content is stored here) and jcr:mimeType. The second property is very important: without the MIME type there is no text extraction (logical, isn't it?). Here is an example in XML notation:
<nodeType name="cssns:resource"
isMixin="false"
hasOrderableChildNodes="false"
primaryItemName="jcr:data">
<supertypes>
<supertype>nt:base</supertype>
<supertype>mix:referenceable</supertype>
</supertypes>
<propertyDefinition name="jcr:mimeType"
requiredType="String"
autoCreated="false"
mandatory="true"
onParentVersion="COPY"
protected="false"
multiple="false">
</propertyDefinition>
<propertyDefinition name="jcr:data"
requiredType="Binary"
autoCreated="false"
mandatory="true"
onParentVersion="COPY"
protected="false"
multiple="false">
</propertyDefinition>
</nodeType>
<nodeType name="cssns:file"
isMixin="false"
hasOrderableChildNodes="false"
primaryItemName="jcr:content">
<supertypes>
<supertype>mix:versionable</supertype>
<supertype>cssns:hierarchyNode</supertype>
</supertypes>
<propertyDefinition name="cssns:size"
requiredType="Long"
autoCreated="true"
mandatory="true"
onParentVersion="COPY"
protected="false"
multiple="false">
<defaultValues>
<defaultValue>-1</defaultValue>
</defaultValues>
</propertyDefinition>
<childNodeDefinition name="jcr:content"
defaultPrimaryType=""
autoCreated="false"
mandatory="true"
onParentVersion="COPY"
protected="false"
sameNameSiblings="false">
<requiredPrimaryTypes>
<requiredPrimaryType>cssns:resource</requiredPrimaryType>
</requiredPrimaryTypes>
</childNodeDefinition>
</nodeType>
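Since the jcr:mimeType property decides whether text is extracted at all, it helps to set it consistently when storing a file. The following sketch derives a MIME type from the file extension. MimeTypes and forFileName are hypothetical names, and the table lists commonly used MIME types for the formats above; whether each Jackrabbit extractor accepts exactly these values is an assumption that should be checked against the extractor classes.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper (not part of Jackrabbit) that maps file extensions to MIME types. */
final class MimeTypes {

    // Assumption: commonly used MIME types for the formats listed in this post.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "txt", "text/plain",
            "doc", "application/msword",
            "xls", "application/vnd.ms-excel",
            "ppt", "application/vnd.ms-powerpoint",
            "pdf", "application/pdf",
            "odt", "application/vnd.oasis.opendocument.text",
            "rtf", "application/rtf",
            "html", "text/html",
            "xml", "text/xml");

    private MimeTypes() {}

    /** Returns the MIME type for the file name, or empty if the extension is unknown. */
    static Optional<String> forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty(); // no extension present
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(BY_EXTENSION.get(extension));
    }
}
```

The value found this way could be written to the content node with Node.setProperty("jcr:mimeType", value) before the session is saved. For unknown extensions it is probably better to skip indexing expectations than to guess a type.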
Your code helped me a lot while configuring, but I need a way to configure Jackrabbit for JBoss.
I am using JBoss 7.1.1 Final and Jackrabbit 2.4.3. Please help me out.