Reliable indexing of custom properties in Alfresco

04.07.2012.

In business applications that rely on the Alfresco repository, a common requirement is linking a document to records in external systems. The unique business keys used for this (for example invoice numbers, applicant identifiers or transaction identifiers) are typically stored in Alfresco as custom properties. External applications then use Alfresco search to find and retrieve the document related to a given business key. In some cases Alfresco can return more than one document, which breaks the logic of the external application. This blog entry explains how that can happen and how to change the default Alfresco configuration to prevent it.

Alfresco uses the Apache Lucene indexing engine (from version 4 it can also use the more capable Apache Solr). An external application typically invokes a customized web script to find a unique document and retrieve it for presentation to the end user. If the business key used for searching contains any of the characters that the underlying indexing engine treats as token separators, the search can produce unexpected results. This is not a big problem for users searching through Alfresco Share or Explorer, since they can easily spot such cases and choose the appropriate document from the list. Applications that rely on finding exactly one document in the repository, however, might break.
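
One way for the external application to fail loudly instead of silently picking the wrong document is to check the size of the result list before using it. The following is only a minimal sketch: the class name SingleResultGuard and the method requireSingle are hypothetical, and the list of strings stands in for whatever result type the web script actually returns.

```java
import java.util.List;

public class SingleResultGuard {

    // Hypothetical helper: returns the only element of the result list,
    // or throws if the search returned zero or several documents.
    static <T> T requireSingle(List<T> results, String businessKey) {
        if (results.size() != 1) {
            throw new IllegalStateException(
                "Expected exactly one document for key '" + businessKey
                + "' but found " + results.size());
        }
        return results.get(0);
    }

    public static void main(String[] args) {
        // A unique hit is returned as-is.
        System.out.println(requireSingle(List.of("doc-1"), "test"));

        // Two hits for the same key are reported instead of being used.
        try {
            requireSingle(List.of("doc-1", "doc-2"), "test");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Such a guard does not fix the indexing problem, but it turns an ambiguous search result into an explicit error that is easy to find in the logs.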

In our case the unique business key was stored in a custom property. The requirements were simple: the value must be unique for each document, and the allowed characters are alphanumerics plus some special characters such as the underscore.

<property name="my:uniqueID">
      <type>d:text</type>
      <mandatory>true</mandatory>
</property>

The default Lucene index setup for a property is:

<index enabled="true">
      <atomic>true</atomic>
      <stored>false</stored>
      <tokenised>true</tokenised>
</index>

After the system had been in production for some time, the repository contained documents with similar unique identifiers (e.g. test and my_test), and this caused problems in the external application. We discovered that searching for a document with uniqueID == test returned two documents instead of one.
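
For illustration, a search like this is usually expressed in the Alfresco Lucene query language as a property query of the form @my\:uniqueID:"test". The sketch below only builds such a query string; the class and method names are hypothetical, and the query actually issued by the web script in our project is not shown here.

```java
public class UniqueIdQuery {

    // Escapes the two characters that would end or alter a quoted phrase.
    static String escapePhrase(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds a property query such as @my\:uniqueID:"my_test".
    // The colon in the prefixed property name has to be escaped.
    static String exactMatchQuery(String prefixedProperty, String value) {
        String field = prefixedProperty.replace(":", "\\:");
        return "@" + field + ":\"" + escapePhrase(value) + "\"";
    }

    public static void main(String[] args) {
        System.out.println(exactMatchQuery("my:uniqueID", "my_test"));
    }
}
```

Note that quoting the value in the query does not by itself guarantee an exact match: what the query matches against is determined by how the property was indexed.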

The cause was the way Lucene is configured: with the default Alfresco configuration it tokenizes strings on the underscore, among other special characters. In other words, the unique values from the example above were indexed as “test”, “my” and “test” instead of “test” and “my_test”, which explains why the result set contained more than one document.
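
The effect can be reproduced outside Alfresco with a deliberately simplified model of a tokenised index. This is not the analyzer Alfresco actually uses: the sketch assumes that every character other than a letter or digit is a separator and that a document matches when all query tokens appear among its indexed tokens. All names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TokenisedIndexDemo {

    // Simplified stand-in for a tokenising analyzer: splits on anything that
    // is not a letter or digit (so '_' is a separator) and lower-cases.
    static List<String> tokenise(String value) {
        List<String> tokens = new ArrayList<>();
        for (String part : value.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Returns the stored values that match the query under either index mode.
    static List<String> search(List<String> storedValues, String query, boolean tokenised) {
        List<String> hits = new ArrayList<>();
        for (String stored : storedValues) {
            boolean hit = tokenised
                ? tokenise(stored).containsAll(tokenise(query)) // token-level match
                : stored.equals(query);                         // whole value is one term
            if (hit) {
                hits.add(stored);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> ids = List.of("test", "my_test");
        System.out.println(search(ids, "test", true));  // prints [test, my_test]
        System.out.println(search(ids, "test", false)); // prints [test]
    }
}
```

With tokenisation the value my_test contributes the token test to the index, so a search for test finds both documents; when the whole value is indexed as a single term, only the exact identifier matches.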

Once the cause of the problem was identified, it was quite easy to fix by disabling tokenisation for the property in the custom model:

<property name="my:uniqueID">
      <type>d:text</type>
      <mandatory>true</mandatory>
      <index enabled="true">
            <tokenised>false</tokenised>
      </index>
</property>

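After this change it is necessary to perform a complete rebuild of the index so that the new index setting takes effect for the custom property. With the Lucene index subsystem, this is typically done by setting index.recovery.mode=FULL in alfresco-global.properties and restarting the repository. Otherwise Alfresco can still return unexpected search results for documents created before the change was made.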