Deterministic IDs with Content Hub CMP Connector for Sitecore

Data consistency and the same IDs between environments

Content Hub CMP Connector for Sitecore is a great way to synchronize data between Content Hub CMP and Sitecore without a ton of work, but what happens if the out-of-the-box connector doesn’t meet all your data requirements?

That was the exact challenge I faced on a recent project. The project had two requirements that the connector did not sufficiently meet:

1. Data Consistency

  • no duplicate items
  • no data missing

2. Same item IDs between Sitecore environments

  • easy to move/search between environments
  • allow us to do reverse look-ups without depending on the links database or Solr
  • synced items can be deleted and recreated and all relationships are maintained

Luckily, with a few tweaks to the connector, we can fulfill both requirements. First, we’ll look at the out-of-the-box critical path processes the connector takes to create/update items. Then, we’ll identify the root issue causing data inconsistencies and different IDs between environments. Finally, we’ll go over a solution and how to implement it.

If you want to jump straight to the code and templates, the repository can be found at: https://github.com/ksuamel/content-hub-extensions

Out-of-the-box functionality

Let's begin by understanding how the CMP connector for Sitecore creates items and finds existing items by default. The connector uses a series of pipelines to handle its functionality, but the “cmp.importEntity” pipeline is what we’re looking for. This pipeline is responsible for creating and updating items coming from Content Hub.

While this pipeline has multiple processors each responsible for a portion of the item creation/update process, we’ll focus on only three of them, since they are the only ones responsible for item lookups, creations, and updates.

  • Sitecore.Connector.CMP.Pipelines.ImportEntity.SearchIndex
  • Sitecore.Connector.CMP.Pipelines.ImportEntity.EnsureItem
  • Sitecore.Connector.CMP.Pipelines.ImportEntity.EnsureRelation

Let's take a closer look at each of these processors to see how they accomplish their tasks. A warning for the faint of heart: we’ll be looking at source code that has been decompiled from DLLs, so be ready for some not-so-pretty code.

Sitecore.Connector.CMP.Pipelines.ImportEntity.SearchIndex

This processor’s job is to query for an item and see if it already exists within Sitecore. Internally, it does this by using Solr to query the entities bucket.

[Picture 1: decompiled source of the SearchIndex processor]

Once the item is found, the processor assigns it to the args.Item property. This approach seems straightforward, but the biggest red flag is the dependency on a search index. A search index may seem like a safe dependency, yet it can fall out of sync with the database, or the Solr cluster may be temporarily down. In either case, the connector can reach the wrong conclusion about whether an item exists. Let’s investigate the processors that run next to see what side effects this may have.

Sitecore.Connector.CMP.Pipelines.ImportEntity.EnsureItem

The EnsureItem processor's job is to create a new Sitecore item to represent the CMP entity if the args.Item property is null.

[Picture 2: decompiled source of the EnsureItem processor]

On the surface, this processor also looks OK, but on closer inspection, we find two things that cause issues for our two requirements. The first is a by-product of the previous processor incorrectly determining whether an item exists. When that happens, this processor creates a duplicate item with no safeguard. The second, which is related to the first, is that this processor lets Sitecore determine the ID for its new items. We’ll discuss why these Sitecore-generated IDs need to change later.
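
To make the duplicate scenario concrete, here is a toy model of the two steps described so far. It is a sketch under assumptions, not connector code: the IndexBackedImporter class, its two dictionaries, and the RebuildIndex method are hypothetical stand-ins for the Sitecore database and the Solr index.

```csharp
using System;
using System.Collections.Generic;

// Toy model only: hypothetical stand-ins, not connector or Sitecore types.
public class IndexBackedImporter
{
    // "Database": item ID -> identifier of the CMP entity the item represents.
    public Dictionary<Guid, string> Database { get; } = new Dictionary<Guid, string>();

    // "Search index": entity identifier -> item ID. Only as fresh as the last rebuild.
    public Dictionary<string, Guid> Index { get; } = new Dictionary<string, Guid>();

    public Guid Import(string entityIdentifier)
    {
        // Step 1 (the SearchIndex role): ask the index whether the item exists.
        if (Index.TryGetValue(entityIdentifier, out Guid existing))
            return existing;

        // Step 2 (the EnsureItem role): nothing found, so create an item with a random ID.
        Guid id = Guid.NewGuid();
        Database[id] = entityIdentifier;
        return id;
    }

    // Brings the index back in line with the database.
    public void RebuildIndex()
    {
        Index.Clear();
        foreach (KeyValuePair<Guid, string> pair in Database)
            Index[pair.Value] = pair.Key;
    }
}
```

In this model, importing the same entity twice without a RebuildIndex call in between leaves two entries in Database, each with a different random ID.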

Sitecore.Connector.CMP.Pipelines.ImportEntity.EnsureRelation

This processor is the most complex, but its job is relatively simple. When the processor receives an item, it determines which item it is related to and does two things. First, it makes our item (parent) point to the referenced item (child). Second, it goes to the referenced item (child) and creates a reverse relationship to our original item (parent). The processor’s code is quite large, so let's just focus on the parts we care about:

[Picture 3: decompiled source excerpt from the EnsureRelation processor]

The decompiled code is not pretty, but what's noteworthy is that once again we see “SearchItemUnderItemBucket”, which uses Solr to query for existing items. We’re starting to see a pattern here...

Root issue

After looking at the core processors, we find that Solr is used during the most crucial moments to find existing items to handle item updates and relationships. We also found that when new items are being created, the connector is letting Sitecore generate a new random ID for our item. While this keeps things simple, it causes issues when the index is out of sync with our Sitecore database. It also means that if you’re syncing multiple Sitecore environments, each one will have a different ID for items representing the same entity inside of CMP. To overcome these issues, let's just get rid of the Solr dependency! But how are we going to do that?

Solution

The main things we’ll need to change are how items are created and queried. Instead of relying on Solr, we’ll use deterministic IDs that allow us to query the database directly. With deterministic IDs, a given input always produces the same output ID.

At a high level, here are the changes we’ll make:

1. Generate a deterministic ID given a specific input string.

2. Add ID Provider configurations for CMP entity mappings to support deterministic IDs.

3. Update processors to use the new ID Provider to query and create/update items.

Note: While the next sections won’t include every change required, like creating new templates to support our enhancement, the full source code and serialized items can be found at: https://github.com/ksuamel/content-hub-extensions

Generate Deterministic IDs

To create deterministic IDs, we’ll generate an MD5 hash from our input string and use its 16 bytes as the input for our ID. An MD5 digest is always 16 bytes long, the same size as a GUID, so it fits exactly. Because the ID is derived deterministically from a unique input, the same input always yields the same ID.

Here is some pseudocode that may help demystify how it works internally.

// Create the item
var newItemId = CreateDeterministicId("[UNIQUE_INPUT]"); // output → "{123-123-123-123}"
var newItem = CreateItem("item name", newItemId); // item is created in the database with the deterministic ID

// Query the item
var itemId = CreateDeterministicId("[UNIQUE_INPUT]"); // same input, same output → "{123-123-123-123}"
var item = database.GetItem(itemId); // we know what the entity's ID should be, so we can go straight to the database

if (item == null) return "doesn't exist in db";
return "item found!";

The full code for generating an ID given an input can be found at: https://github.com/ksuamel/content-hub-extensions/blob/main/src/Foundation/ContentHubExtensions/code/Utils/HashUtil.cs
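
As a rough illustration of the idea, and not a copy of the repository's HashUtil, a deterministic ID could be derived as shown here. The DeterministicId class and its Create method are names made up for this sketch, and hashing the input as UTF-8 is an assumption.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration; the repository's HashUtil may differ.
public static class DeterministicId
{
    // Maps an input string to a GUID by hashing it with MD5.
    public static Guid Create(string input)
    {
        if (string.IsNullOrEmpty(input))
            throw new ArgumentException("Input must be a non-empty string.", nameof(input));

        using (MD5 md5 = MD5.Create())
        {
            // An MD5 digest is exactly 16 bytes, which is what the Guid constructor expects.
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(input));

            // Note: Guid stores its first three fields little-endian, so the GUID's
            // text form will not read exactly like the MD5 hex string.
            return new Guid(hash);
        }
    }
}
```

Calling DeterministicId.Create with the same string returns the same Guid every time, on any machine. In Sitecore, that Guid can then be wrapped in a Sitecore.Data.ID before it is used to create or fetch an item.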

Next, let's configure Sitecore to automatically generate a unique input for each of our entity mappings.

Add ID Providers

To automatically create a unique ID for each entity, we need to add ID Providers to our Sitecore CMP configuration. These items are custom, so make sure you’ve added the templates from the repository to your solution before you continue.

Let’s look at how we configure an ID provider within Sitecore.

Entity Mapping: There are no custom templates here; it’s just an entity mapping we’ve configured using out-of-the-box CMP templates.

[Picture 4: entity mapping item in Sitecore]

ID provider: Here we find the first bit of custom configuration we’re applying over the out-of-the-box CMP connector.

[Picture 5: ID provider item in Sitecore]

In an ID Provider we reference the entity the ID provider is for, and then we provide the property name that we want to use. We use both of these values to generate an input that looks like [entity.DefinitionName]-[entity.getFieldValue(ItemIdProperty)]. Because of this, the value of the property we use must be unique across all entities of the same entity type. For example, if I’m using the property “externalId” for my CMP entity mapping “Example Entity”, the value of that field should be unique across all “Example Entity” entities. That way, when they are synced into Sitecore, they all generate a unique item ID for each unique entity inside of CMP.
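
The input format described above can be sketched as a small helper. IdProviderInput.Build is a hypothetical name, the sample values are invented, and the guard clauses are my own addition rather than something the connector or the repository is known to do.

```csharp
using System;

// Hypothetical helper for illustration; not part of the connector.
public static class IdProviderInput
{
    // Builds the hash input in the form "[DefinitionName]-[property value]".
    public static string Build(string definitionName, string propertyValue)
    {
        if (string.IsNullOrWhiteSpace(definitionName))
            throw new ArgumentException("Definition name is required.", nameof(definitionName));

        // A blank value would give every such entity the same input, and so the same ID.
        if (string.IsNullOrWhiteSpace(propertyValue))
            throw new ArgumentException("The ID property must have a value.", nameof(propertyValue));

        return definitionName + "-" + propertyValue;
    }
}
```

For example, IdProviderInput.Build("Example Entity", "ext-001") returns "Example Entity-ext-001". Prefixing the definition name keeps two entities of different types from sharing an input when they happen to have the same property value.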

Update Processors

In this section, we’ll create new processors that leverage our new ID Providers and deterministic IDs. We’ll update the SearchIndex, EnsureItem, and EnsureRelation processors. I’ve truncated the code snippets to only the modifications we’ve made, but the full code can be found in the repository.

SearchIndex

Above, we learned that the first processor we care about is the SearchIndex processor, which uses Solr to query items. Let’s remove the Solr dependency by creating our own “SearchDatabase” processor that uses our new ID Providers instead.

public class SearchDatabase : ImportEntityProcessor
{
    public override void Process(ImportEntityPipelineArgs args, BaseLog logger)
    {
        ...
    }
}

We’ll leave most of the original Sitecore code as is and change only the lines that determine whether an item already exists in Sitecore.
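
The effect of looking items up by a computed ID, combined with creating them under that same ID, can be sketched with an in-memory stand-in. This is not the repository's processor code: DeterministicImporter, IdFor, and Import are hypothetical names, and the dictionary plays the role of one Sitecore database.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical in-memory stand-in for one Sitecore environment; not connector code.
public class DeterministicImporter
{
    // "Database": item ID -> the input string the ID was derived from.
    public Dictionary<Guid, string> Database { get; } = new Dictionary<Guid, string>();

    // Derives the item ID from "[DefinitionName]-[property value]" using MD5.
    public static Guid IdFor(string definitionName, string propertyValue)
    {
        string input = definitionName + "-" + propertyValue;
        using (MD5 md5 = MD5.Create())
        {
            return new Guid(md5.ComputeHash(Encoding.UTF8.GetBytes(input)));
        }
    }

    // Returns true when a new item was created, false when it already existed.
    public bool Import(string definitionName, string propertyValue)
    {
        Guid id = IdFor(definitionName, propertyValue);

        // Lookup step: go straight to the "database" by ID, with no index involved.
        if (Database.ContainsKey(id))
            return false;

        // Create step: the new item gets the computed ID, not a random one.
        Database[id] = definitionName + "-" + propertyValue;
        return true;
    }
}
```

Importing the same entity twice into one DeterministicImporter creates a single entry, and two separate instances (think of two environments) end up with the same ID for that entity.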

EnsureItem

If SearchIndex, now SearchDatabase, fails to find the item inside of Sitecore, the EnsureItem processor will create a new item for us. Let’s update this processor to use our ID Providers as well.

public class EnsureItem : ImportEntityProcessor
{
    public override void Process(ImportEntityPipelineArgs args, BaseLog logger)
    {
        ...
    }
}

                    

EnsureRelation

Finally, when the item is newly created or is being updated, EnsureRelation updates any reference fields. The changes are spread throughout this processor, so a snippet would be too large. The full source code can be found in the repo.

Configuration Changes

Now that we have our code changes in place, let's make sure to override the default connector processors with ours.

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/" xmlns:set="http://www.sitecore.net/xmlconfig/set/" xmlns:role="http://www.sitecore.net/xmlconfig/role/">
  <sitecore>
    <pipelines>
      <cmp.importEntity role:require="ContentManagement or Standalone">
        <processor type="Foundation.ContentHubExtensions.Pipelines.cmp.importEntity.SearchDatabase, Foundation.ContentHubExtensions"
                   resolve="true"
                   patch:instead="*[@type='Sitecore.Connector.CMP.Pipelines.ImportEntity.SearchIndex, Sitecore.Connector.CMP']" />

        <processor type="Foundation.ContentHubExtensions.Pipelines.cmp.importEntity.EnsureItem, Foundation.ContentHubExtensions"
                   resolve="true"
                   patch:instead="*[@type='Sitecore.Connector.CMP.Pipelines.ImportEntity.EnsureItem, Sitecore.Connector.CMP']" />

        <processor type="Foundation.ContentHubExtensions.Pipelines.cmp.importEntity.EnsureRelation, Foundation.ContentHubExtensions"
                   resolve="true"
                   patch:instead="*[@type='Sitecore.Connector.CMP.Pipelines.ImportEntity.EnsureRelation, Sitecore.Connector.CMP']"  />
      </cmp.importEntity>
    </pipelines>
  </sitecore>
</configuration>

Conclusion

Content Hub CMP Connector for Sitecore is a great way to synchronize data between Content Hub CMP and Sitecore, but it has its pitfalls. When data consistency and matching item IDs across Sitecore environments are key requirements, we must update the connector's processors to use deterministic IDs. With deterministic IDs in place, we remove the Solr dependency from these processors and instead use an MD5 hash to generate our Sitecore item IDs.