Saturday, August 01, 2020

SOLR: Non-English (Non-Latin) Characters in Field Names

The SOLR documentation states the following requirement for field names:

The name of the field. Field names should consist of alphanumeric or underscore characters only and not start with a digit. . . .
While working on a dictionary website, I created JSON documents whose field names were in Hindi. After indexing the data, I was surprised to see that each field name had been replaced by a run of underscores, e.g. the field name शब्द became ____. Going by the SOLR documentation, शब्द consists only of alphabetic characters and should have been allowed as a field name.

It looks like the SOLR developers assumed that only the 26 letters of the Latin script count as alphabetic characters. Stating this assumption explicitly in the documentation would have been helpful.

On closer scrutiny of the solrconfig.xml file, I found the following configuration, which replaces every character in a field name that is not a Latin letter, a digit, an underscore, a hyphen or a period with an underscore while indexing the data.

<updateProcessor class="solr.FieldNameMutatingUpdateProcessorFactory" name="field-name-mutating">
   <str name="pattern">[^\w-\.]</str>
   <str name="replacement">_</str>
</updateProcessor>
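
The effect of this pattern can be reproduced outside SOLR with plain java.util.regex, where \w matches only [a-zA-Z_0-9] unless Unicode character classes are switched on. The following is a minimal sketch, assuming the processor simply replaces every match of the pattern with the replacement string; the class name DefaultFieldNamePattern and the method mutate are made up for illustration and are not part of SOLR. The Unicode escapes in the code spell शब्द.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of SOLR.
final class DefaultFieldNamePattern {

    // Same regex as in the solrconfig.xml snippet above.
    // In Java, \w is ASCII-only by default: [a-zA-Z_0-9].
    private static final Pattern DEFAULT_PATTERN = Pattern.compile("[^\\w-\\.]");

    // Replaces every character matched by the pattern with an underscore.
    static String mutate(String fieldName) {
        return DEFAULT_PATTERN.matcher(fieldName).replaceAll("_");
    }

    public static void main(String[] args) {
        // Four Devanagari characters: three letters and one virama sign.
        String hindi = "\u0936\u092C\u094D\u0926";
        System.out.println(mutate(hindi));           // prints ____
        System.out.println(mutate("first-name.v2")); // prints first-name.v2
        System.out.println(mutate("my field"));      // prints my_field
    }
}
```

None of the four characters in शब्द falls within [a-zA-Z_0-9], so each one is replaced, which accounts for the four underscores seen after indexing.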

Changing the pattern regex for FieldNameMutatingUpdateProcessorFactory to something like the one below allows SOLR to accept non-Latin letters in field names.

   <str name="pattern">[\s]</str>

[\s] replaces only whitespace and lets every other character through, so it is too permissive for real-life use. The pattern should be restricted further, so that only the limited set of characters one intends to use in field names is left untouched.
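
One possible way to restrict the pattern is to keep Unicode letters, combining marks and digits along with underscore, period and hyphen, and replace everything else. The sketch below tries such a regex, [^\p{L}\p{M}\p{N}_.-], with plain java.util.regex. This regex is my own illustration and not something taken from the SOLR documentation; whether it behaves the same way inside solrconfig.xml is an assumption that should be verified against your SOLR version. The class name RestrictedFieldNamePattern and the method sanitize are hypothetical, and the Unicode escapes spell शब्द.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of SOLR.
final class RestrictedFieldNamePattern {

    // Illustrative regex (an assumption, not a SOLR default):
    // \p{L} = letters of any script, \p{M} = combining marks such as the
    // Devanagari virama, \p{N} = digits of any script.
    private static final Pattern RESTRICTED =
            Pattern.compile("[^\\p{L}\\p{M}\\p{N}_.-]");

    // Replaces every character outside the allowed set with an underscore.
    static String sanitize(String fieldName) {
        return RESTRICTED.matcher(fieldName).replaceAll("_");
    }

    public static void main(String[] args) {
        String hindi = "\u0936\u092C\u094D\u0926";
        System.out.println(sanitize(hindi).equals(hindi)); // prints true
        System.out.println(sanitize("my field"));          // prints my_field
        System.out.println(sanitize("a/b"));               // prints a_b
    }
}
```

Including \p{M} matters for Hindi, because signs such as the virama in शब्द are combining marks rather than letters and would otherwise still be turned into underscores.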