Sitecore “Batch” Connector, part 1: Download Content from Content Hub

Introduction

Sitecore offers a couple of great options for synchronizing content between its Content Hub and Sitecore XM/XP products:

The challenge

Both options are viable in most scenarios, but each has its limits. One of our clients had been using Sitecore Connect to synchronize changes from Content Hub to Sitecore XM, and it worked well enough until the volume of messages exceeded the connector's bandwidth: the number of daily updates grew to tens of thousands, and the combination of a complex data schema and Content Hub API throttling limits led to very long processing times.

Removing the sync altogether and serving content directly from Content Hub via the Sitecore Edge GraphQL APIs would have been the best solution, but we didn't have the bandwidth or budget to rewrite the front end. The Sitecore Connect platform could have been a solution too, but it was too new at the time. The business needed a solution right away, so I built a custom batch sync to satisfy the following requirements.

All posts in this Sitecore “Batch” Connector series

Requirements

  • Performance: process tens to hundreds of thousands of daily updates
  • Allow a full resync of all data, if needed
  • Data consistency & idempotence: syncing and re-syncing an item with the same source ID must result in the same item with the same Sitecore ID created in Sitecore XM/XP
  • Configurable mappings: allow configuring mappings of source (Content Hub) entities to destination templates and fields

The Solution

The solution consists of two modules: the Downloader (a WebJob) and the Importer (Sitecore PowerShell scripts & a scheduled job). The whole solution (excluding config files) can be found in this GitHub repo.

Batch Downloader

In this case, we have an Azure WebJob written in .NET. This WebJob wakes up and runs every XX minutes to download all created and updated entities from the source, which in this case is Content Hub. However, this approach can be used with any other source of data or content, as long as it provides suitable APIs for querying new and updated content.

The process iterates through a list of entity names. For each entity, it queries Content Hub to retrieve a list of all entities of that type that have been created or updated since the last download.

The download results for each entity type are saved into a JSON file. The filename includes the entity type and a timestamp representing the time of download.

Each time the downloader runs, it finds the last processed file and uses the embedded timestamp to calculate the timespan delta for the next request. For example, it requests all entities of a given type that were created or updated between the given timestamp and the current time.
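This delta calculation can be sketched as follows. This is a minimal illustration, not the actual WebJob code: `DeltaWindow` and its method names are hypothetical, and it assumes the filename pattern `{EntityDefinition}_{yyyyMMddTHHmmss}_{NNN}.json` used later in this post.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of the actual solution.
public static class DeltaWindow
{
	// Parses the timestamp embedded in a downloaded file name, e.g.
	// "M.AssetType_20240131T120000_000.json" -> 2024-01-31 12:00:00
	public static DateTime? ParseTimestamp(string fileName)
	{
		var parts = Path.GetFileNameWithoutExtension(fileName).Split('_');
		if (parts.Length < 2)
			return null;

		return DateTime.TryParseExact(parts[parts.Length - 2], "yyyyMMddTHHmmss",
			CultureInfo.InvariantCulture, DateTimeStyles.None, out var ts)
			? ts
			: (DateTime?)null;
	}

	// Finds the most recent timestamp across all processed files; falls back
	// to a "full resync" start date when no files have been processed yet.
	public static DateTime GetModifiedAfter(string processedFolderPath, DateTime fullResyncStart)
	{
		if (!Directory.Exists(processedFolderPath))
			return fullResyncStart;

		var timestamps = Directory.GetFiles(processedFolderPath, "*.json")
			.Select(ParseTimestamp)
			.Where(t => t.HasValue)
			.Select(t => t.Value)
			.ToList();

		return timestamps.Any() ? timestamps.Max() : fullResyncStart;
	}
}
```

The returned value becomes the `modifiedAfter` lower bound of the next query, with the current time as the upper bound.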

Batch Importer

This part is a Sitecore PowerShell job, which wakes up every XX minutes, looks for new files, imports their content, and then moves them into the processed folder.

The batch importer is described in the next post in this series.

Batch Downloader Implementation

The batch downloader is an Azure WebJob. I chose a WebJob because it can run on the same instance as the Sitecore CM but in a separate process, which means no conflicts with the Sitecore CM instance and no DLL hell to deal with.

See this post for more details on running Azure WebJobs in a Sitecore instance.

Below is a high-level summary of the main functional elements of the Batch Downloader .NET solution.

Configuration JSON

{
  "Entities": [
    {
      "EntityDefinition": "GVI.AssetCollection"
    },
    {
      "EntityDefinition": "M.AssetType"
    }
    //...more entities here
  ],
  "WebRootPath": "your web app root path",
  "LogsPath": "where file logs to be saved",
  "IncomingFolderRelativePath": "App_Data\\ContentHubData\\Incoming",
  "ProcessedFolderRelativePath": "App_Data\\ContentHubData\\Processed",
  "DeliveryHostUrl": "your Content Hub instance",
  "NamespaceGuid": "Seed Guid to generate Sitecore item IDs from (more details to follow...)"
}
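For illustration, the JSON above could be bound to settings classes like these. The class names are my own, not from the actual solution, and I'm assuming Json.NET (`JsonConvert`), which the solution already uses for serialization elsewhere.

```csharp
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

// Hypothetical settings classes mirroring the configuration JSON above.
public class EntityConfigurationSetting
{
	public string EntityDefinition { get; set; }
}

public class DownloaderSettings
{
	public List<EntityConfigurationSetting> Entities { get; set; }
	public string WebRootPath { get; set; }
	public string LogsPath { get; set; }
	public string IncomingFolderRelativePath { get; set; }
	public string ProcessedFolderRelativePath { get; set; }
	public string DeliveryHostUrl { get; set; }
	public string NamespaceGuid { get; set; }
}

// Usage sketch:
// var settings = JsonConvert.DeserializeObject<DownloaderSettings>(
//     File.ReadAllText("appsettings.json"));
```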

Program.cs is the entry point.

// To learn more about the Microsoft Azure WebJobs SDK, please see https://go.microsoft.com/fwlink/?LinkID=320976
internal class Program
{
	static void Main()
	{
		try
		{
			var config = new JobHostConfiguration();

			if (config.IsDevelopment)
			{
				config.UseDevelopmentSettings();
			}

			#region Read configs
			//Read configuration settings from appsettings.json...
			#endregion

			//Configure a file logger, which might be the easiest option, but consider using other types of Azure diagnostics & logging.
			//This documentation article might be a good starting point
			Log.Logger = new LoggerConfiguration()
				.MinimumLevel.Debug()
				.WriteTo.File(logFilePath)
				.WriteTo.Console(restrictedToMinimumLevel: LogEventLevel.Information)
				.CreateLogger();

			//Create the "Incoming" folder if it doesn't exist
			if (!Directory.Exists(incomingFolderPath))
			{
				Log.Warning($"Creating incoming folder {incomingFolderPath}");
				Directory.CreateDirectory(incomingFolderPath);
			}

			//Create the "Processed" folder if it doesn't exist
			if (!Directory.Exists(processedFolderPath))
			{
				Log.Warning($"Creating processed folder {processedFolderPath}");
				Directory.CreateDirectory(processedFolderPath);
			}

			//Download configured entities to file (since last update)
			if (entityConfigurations != null)
			{
				var runner = new BatchSyncRunner(endpointUri, clientId, clientSecret, userName, password, baseUrl, deliveryHostUrl, namespaceGuid);
				var batchStartTime = DateTime.Now;
				Log.Information($"Starting download batch. Start time: {batchStartTime}");

				foreach (var entityConfiguration in entityConfigurations)
				{
					//Download the entity data and save it to a JSON file...
Downloading Content Hub Entities

Initialize the Content Hub client with the .NET SDK

public IWebMClient CreateClient(string endpointUri, string clientId, string clientSecret, string userName, string password)
{
	Uri endpoint = new Uri(endpointUri);

	// Enter your credentials here
	OAuthPasswordGrant oauth = new OAuthPasswordGrant
	{
		ClientId = clientId,
		ClientSecret = clientSecret,
		UserName = userName,
		Password = password
	};

	// Create the Web SDK client
	IWebMClient client = MClientFactory.CreateMClient(endpoint, oauth);
	return client;
}

Read all entities of a given type that were created or modified after a given datetime

/// <summary>
/// Get all entities with the given definition name, created or modified since the given date
/// </summary>
/// <param name="definitionName"></param>
/// <param name="modifiedAfter"></param>
/// <returns></returns>
public async Task<List<IEntity>> GetEntitiesByDefinition(string definitionName, DateTime modifiedAfter)
{
	List<IEntity> results = new List<IEntity>();
	var query = Query.CreateQuery(entities =>
					from e in entities
					where e.DefinitionName == definitionName && e.ModifiedOn >= modifiedAfter
					select e);

	var scroller = _client.Querying.CreateEntityScroller(query, TimeSpan.FromSeconds(30), EntityLoadConfiguration.Full);

	while (await scroller.MoveNextAsync().ConfigureAwait(false))
	{
		var items = scroller.Current.Items;
		if (items != null && items.Any())
		{
			results.AddRange(items);
		}
	}

	return results;
}

Idempotency (consistency) of Sitecore item IDs, based on source Content Hub item IDs

Generate a consistent Sitecore item ID from the source ID and a "seed namespace" Guid. This is one of the foundational parts of this approach: it allows an entity and its dependent/child entities to be created without index lookups (finding a Sitecore item ID based on a source ID).

using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

namespace GVI.ContentHub.Sync.WebJob.Core
{
    /// <summary>
    /// Helper class to create deterministic Guids based on given ID.
    /// </summary>
    public static class GuidUtils
    {
        /// <summary>
        /// Creates a name-based UUID using the algorithm from RFC 4122 §4.3, using SHA1
        /// (version 5). This is useful for creating predictive Guid based on content.
        /// </summary>
        /// <param name="namespaceId">A known namespace to create the UUID within</param>
        /// <param name="name">The name (within the given namespace) to make the Guid from</param>
        /// <returns></returns>
        public static Guid Create(Guid namespaceId, string name)
        {
            if (name is null)
                throw new ArgumentNullException(nameof(name));

            return Create(namespaceId, Encoding.UTF8.GetBytes(name));
        }

        /// <summary>
        /// Creates a name-based UUID using the algorithm from RFC 4122 §4.3, using SHA1
        /// (version 5). This is useful for creating predictive Guid based on content.
        /// </summary>
        /// <param name="namespaceId">A known namespace to create the UUID within</param>
        /// <param name="nameBytes">The name (within the given namespace) to make the Guid from</param>
        /// <returns>A UUID derived from the namespace and name</returns>
        public static Guid Create(Guid namespaceId, byte[] nameBytes)
        {
            const int version = 5;

            // convert the namespace UUID to network order (step 3)
            byte[] namespaceBytes = namespaceId.ToByteArray();
            SwapByteOrder(namespaceBytes);

            // compute the hash of the namespace ID concatenated with the name (step 4)
            byte[] data = namespaceBytes.Concat(nameBytes).ToArray();
            byte[] hash;
            using (var algorithm = SHA1.Create())
                hash = algorithm.ComputeHash(data);

            // most bytes from the hash are copied straight to the bytes of the new GUID (steps 5-7, 9, 11-12)
            byte[] newGuid = new byte[16];
            Array.Copy(hash, 0, newGuid, 0, 16);

            // set the four most significant bits (bits 12 through 15) of the time_hi_and_version field to the appropriate 4-bit version number from Section 4.1.3 (step 8)
            newGuid[6] = (byte)(newGuid[6] & 0x0F | version << 4);

            // set the two most significant bits (bits 6 and 7) of the clock_seq_hi_and_reserved to zero and one, respectively (step 10)
            newGuid[8] = (byte)(newGuid[8] & 0x3F | 0x80);

            // convert the resulting UUID to local byte order (step 13)
            SwapByteOrder(newGuid);
            return new Guid(newGuid);
        }

        /// <summary>
        /// Converts a Guid to/from network order (MSB-first)
        /// </summary>
        /// <param name="guid"></param>
        private static void SwapByteOrder(byte[] guid)
        {
            SwapBytes(guid, 0, 3);
            SwapBytes(guid, 1, 2);
            SwapBytes(guid, 4, 5);
            SwapBytes(guid, 6, 7);
        }

        private static void SwapBytes(byte[] guid, int left, int right)
        {
            byte temp = guid[left];
            guid[left] = guid[right];
            guid[right] = temp;
        }
    }
}
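For example, two calls with the same seed namespace and source ID always produce the same Guid, which is what makes re-syncing idempotent. This usage sketch relies on the `GuidUtils` class above; the namespace Guid is a made-up sample, not a real seed value.

```csharp
using System;

class GuidUtilsExample
{
	static void Main()
	{
		// Sample seed namespace; in practice this comes from the NamespaceGuid config setting.
		var ns = new Guid("6ba7b810-9dad-11d1-80b4-00c04fd430c8");

		// Re-syncing the entity with source ID 12345 always maps to the same Sitecore item ID.
		Guid first = GuidUtils.Create(ns, "12345");
		Guid second = GuidUtils.Create(ns, "12345");
		Console.WriteLine(first == second); // True

		// A different source ID yields a different, but equally stable, Guid.
		Guid other = GuidUtils.Create(ns, "12346");
		Console.WriteLine(first == other); // False
	}
}
```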

Read entity data by iterating through entity fields

public List<EntityContent> GetEntitiesContent(EntityConfiguration entityMapping, DateTime modifiedAfter)
{
	var startTime = DateTime.Now;
	var renditionMappingNames = entityMapping.RenditionRelations != null
		? entityMapping.RenditionRelations.Keys.Select(x => x.ToLower()).ToList()
		: new List<string>();

	var task = _client.GetEntitiesByDefinition(entityMapping.EntityDefinition, modifiedAfter);
	var result = task.WaitAndUnwrapException();
	var entities = new List<EntityContent>();
	foreach (var entity in result)
	{
		try
		{
			if (entity != null && entity.Id.HasValue && !string.IsNullOrEmpty(entity.Identifier))
			{
				var timestamp = entity.ModifiedOn ?? entity.CreatedOn;
				var entityContent = new EntityContent(entity.Id.Value, entity.Identifier, timestamp, _namespaceGuid);

				foreach (var property in entity.Properties)
				{
					try
					{
						//iterate through entity properties and copy their data into the map collection...
Saving downloaded content to JSON files

int entityCount = entitiesData.Count;
int fileCount = (int)Math.Ceiling((double)entityCount / maxEntityCountInFile);

for (int i = 0; i < fileCount; i++)
{
	//Math.Min sizes the last batch correctly, including when entityCount is an exact multiple of maxEntityCountInFile
	int entitiesToSave = Math.Min(maxEntityCountInFile, entityCount - i * maxEntityCountInFile);
	var entitiesBatch = entitiesData.Skip(i * maxEntityCountInFile).Take(entitiesToSave);

	string strNum = i.ToString("D3");
	var fileName = $"{entityConfiguration?.EntityDefinition}_{startTime.ToString("yyyyMMddTHHmmss")}_{strNum}.json";

	var filePath = Path.Combine(incomingFolderPath, fileName);
	var json = JsonConvert.SerializeObject(entitiesBatch);

	File.WriteAllText(filePath, json);
	Log.Information($"Saved {entitiesToSave} entities of type {entityConfiguration.EntityDefinition} to {filePath}");
}

Useful links