Monday, 18 March 2013

Apache Solr and complex data architecture

At the moment I'm working on a highly relational content architecture and the related Solr backend. It's quite special from a practical use case point of view. In a nutshell, it's one huge logical monster of a content item that is split across several content types and reference fields so that we can keep the data properly normalized.

Still, this example may serve as one way of dealing with relational data in Solr. The problem with this setup is that in search we're interested in the huge logical unit, not in the technical elements such as content types. Just to give you an example: the item of interest would be a human being, which technically contains an appendix, gut, fingernails, blood type, behavior ... and many of these would be redundant. The attributes of blood, for instance, can be reused among all entities sharing the same blood type.

By default - in Drupal - you can index content from different content types. Apache Solr is very flexible. It uses a document based index, almost like a NoSQL store, except that here the fields have types. The way Drupal prepares items for Solr is basically to collect the fields and then render the 'search_index' view mode with everything you added to that view mode. (This can be overridden in many ways, e.g. with Display Suite or a template.) But if we want precise property search on each of the related content types, we really have to tell Solr how those fields relate to the main content items.

I found Snufkin's great article about custom data indexing. It's almost what I needed, but I decided to go for a slightly simpler way. (And I also needed a Drupal 7 solution.)

My very first idea was to index every related content item and then post-process the search results to work out which master content item each hit belongs to. That is awful, eww. Then I looked at Solr itself and found that it can do joins. That looks fine, but as far as I can see it's not designed to hold business logic, and I would like to keep the flexibility. Then I realized I can do whatever I want with the index document that Drupal builds for Solr. As the node module builds up the index document by adding all the fields, I can add my own logical elements in the same way. This way, when I'm looking for - let's say - a certain blood type, Solr will find the blood type properties directly on the indexed document of each human. So again: I'm attaching all related attributes (in Drupal these are node reference chains) to the index document of the main content type. The basic workflow is the following:

  • when indexing the master content
    • load all related node references
    • add all referenced attributes to the index document
  • when searching
    • refer to the additional fields

Pretty easy, huh? To make the attachment there is a hook to use - hook_apachesolr_index_document_build():

function myhuman_module_apachesolr_index_document_build(ApacheSolrDocument $document, $entity, $entity_type, $env_id) {
  $blood_type_node = human_module_get_humans_blood_type($entity);
  $blood_type_items = field_get_items('node', $blood_type_node, 'field_blood_type_name');
  // Only attach the value if the referenced node actually has one.
  if (!empty($blood_type_items)) {
    $document->setField('ss_blood_type', $blood_type_items[0]['value']);
  }
}

Then reindex the content:

$ drush solr-delete-index
$ drush solr-index
$ drush cron

And force Solr to finish processing quickly by calling:


You will see that the newly indexed content has the referenced value. The superawesome thing is that the ApacheSolr Views module can use any of the Solr backends as storage, and its field definitions are updated whenever new index attributes appear in Solr. However, it's going to be more or less typeless. Well, in general. If you're interested in the type capabilities of Solr you can check the schema.xml. You'll find definitions like these:

   <!-- We use long for integer since 64 bit ints are now common in PHP. -->
   <dynamicField name="is_*"  type="long"    indexed="true"  stored="true" multiValued="false"/>
   <dynamicField name="im_*"  type="long"    indexed="true"  stored="true" multiValued="true"/>
   <!-- List of floats can be saved in a regular float field -->
   <dynamicField name="fs_*"  type="float"   indexed="true"  stored="true" multiValued="false"/>
   <dynamicField name="fm_*"  type="float"   indexed="true"  stored="true" multiValued="true"/>
   <!-- List of doubles can be saved in a regular double field -->
   <dynamicField name="ps_*"  type="double"   indexed="true"  stored="true" multiValued="false"/>
   <dynamicField name="pm_*"  type="double"   indexed="true"  stored="true" multiValued="true"/>

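Simply put, the first letter of the prefix stands for the type (i for integer, f for float, p for double) and the second indicates the quantity (s for single-valued, m for multi-valued). You can find more about the schema here.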
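To make the naming convention concrete, here is a minimal sketch of a helper that builds a dynamic field name from a type and a cardinality. The function name myhuman_module_solr_field_name() and the type labels are made up for this illustration and are not part of the apachesolr module. The prefixes are limited to the ones that appear in this post (the excerpt above plus the ss_ prefix used in the hook), so check your own schema.xml before relying on them.

```php
<?php

/**
 * Builds a dynamic field name such as "ss_blood_type" or "im_ages".
 *
 * Hypothetical helper for illustration only.
 */
function myhuman_module_solr_field_name($type, $multiple, $name) {
  // First letter: the value type, matching the prefixes in schema.xml.
  $type_prefixes = array(
    'string' => 's',
    'integer' => 'i',
    'float' => 'f',
    'double' => 'p',
  );
  if (!isset($type_prefixes[$type])) {
    throw new InvalidArgumentException("Unknown Solr type: $type");
  }
  // Second letter: "s" for single-valued, "m" for multi-valued.
  $cardinality = $multiple ? 'm' : 's';
  return $type_prefixes[$type] . $cardinality . '_' . $name;
}
```

Calling it with ('string', FALSE, 'blood_type') gives the ss_blood_type key used in the hook above, and ('integer', TRUE, 'ages') gives im_ages.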

Having said that, you're now almost finished with the search part. One thing you still have to ensure is that the index gets updated. Solr only cares about the content of the documents it receives, so it doesn't observe the relations between them. What I added is a simple Rules action (this way you can set up a rule for the specific content changes that you're interested in) that does a preorder walk of the node reference tree to detect all the referencing items (humans) that you have to invalidate in the index. That can be done by simply saving the content:

$node = node_load($ID);
node_save($node);
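The walk itself could look roughly like the sketch below. It is only an illustration under assumptions: the function name myhuman_module_collect_referrers() is hypothetical, and the $referenced_by map (node ID => IDs of the nodes referencing it) is a plain array so that the example runs without Drupal. On a real site you would build that map from your node reference fields.

```php
<?php

/**
 * Collects every node that directly or indirectly references $nid.
 *
 * Hypothetical helper; $referenced_by maps a node ID to the IDs of the
 * nodes that reference it.
 */
function myhuman_module_collect_referrers($nid, array $referenced_by, array &$seen = array()) {
  $result = array();
  $parents = isset($referenced_by[$nid]) ? $referenced_by[$nid] : array();
  foreach ($parents as $parent_nid) {
    // Skip nodes already visited, so shared references are not reported
    // twice and cyclic references cannot loop forever.
    if (isset($seen[$parent_nid])) {
      continue;
    }
    $seen[$parent_nid] = TRUE;
    // Preorder: record the referrer first, then walk its own referrers.
    $result[] = $parent_nid;
    $result = array_merge($result, myhuman_module_collect_referrers($parent_nid, $referenced_by, $seen));
  }
  return $result;
}

// Made-up data: blood type 10 is referenced by nodes 20 and 21, which are
// in turn referenced by humans 30, 31 and 32.
$referenced_by = array(
  10 => array(20, 21),
  20 => array(30, 31),
  21 => array(31, 32),
);
$to_reindex = myhuman_module_collect_referrers(10, $referenced_by);
```

Every ID in $to_reindex is then a candidate for the load-and-save step shown above.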

And that part is probably the bottleneck of the process. Imagine you update a blood type: how many humans have to be updated? So you need the quite awesome DrupalQueue and probably hook_cron_queue_info() to help with performance.
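A minimal sketch of how the queue could be wired up in Drupal 7 follows. The module name, the queue name myhuman_module_reindex and the two helper functions are assumptions made for this example; only DrupalQueue, hook_cron_queue_info(), node_load() and node_save() are Drupal core APIs. The worker needs a bootstrapped Drupal 7 site to actually run.

```php
<?php

/**
 * Implements hook_cron_queue_info().
 */
function myhuman_module_cron_queue_info() {
  return array(
    // Hypothetical queue name.
    'myhuman_module_reindex' => array(
      'worker callback' => 'myhuman_module_reindex_worker',
      // Spend at most 30 seconds per cron run on this queue.
      'time' => 30,
    ),
  );
}

/**
 * Queues node IDs for later processing instead of saving them all at once.
 */
function myhuman_module_queue_reindex(array $nids) {
  $queue = DrupalQueue::get('myhuman_module_reindex');
  foreach ($nids as $nid) {
    $queue->createItem($nid);
  }
}

/**
 * Queue worker: re-saves a single node, as in the snippet above.
 */
function myhuman_module_reindex_worker($nid) {
  $node = node_load($nid);
  if ($node) {
    node_save($node);
  }
}
```

The Rules action would then only call myhuman_module_queue_reindex() with the collected node IDs, and cron works through the queue in small time slices.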


