Navigating the Evolution: Next-Level Insights from Billion-Document Migrations

At Maretha Solutions, our team has the most real-world experience with the largest, longest-running, and most complex implementations of the Nuxeo Platform. We sometimes get called the “migration experts.” We sometimes get called the “DAM experts.” The truth is that we all love helping clients achieve their goals with Nuxeo. This blog post details the pattern we developed and follow for the majority of migrations we take on. Lessons from this pattern can be applied to any migration of data between distinct systems.

The screenshot above shows a production Nuxeo Platform implementation that serves over 10,000 daily users on a continuous basis.

The Nuxeo Platform is capable of so much more than storing 3.5 billion document records. Migrating many distinct systems of record into one is a challenge of both scale and complexity. In my previous blog post about FileNet to Nuxeo migrations (which applies when migrating any system into Nuxeo), I described the general process the team followed and the results we achieved. Now, one of these clients has a single repository with more than 3.5 billion documents stored.

The migration toolset we implemented on top of Nuxeo’s Bulk Action Framework is capable of ingesting 500 million documents over a weekend. That ingestion speed is an advantage for any migration. The toolset is not an extension to an existing SDK that wraps API requests into a product. Instead, it transforms the Nuxeo Platform itself by exposing a set of automation endpoints that extract data from cloud data sources and then map and index the content into Nuxeo (and MongoDB and Elasticsearch) at a rapid and configurable pace.
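To put that figure in perspective, here is a quick back-of-the-envelope calculation of the average rate it implies. The 48-hour window is an assumption made for illustration, not a measured run time, and the function name is hypothetical.

```python
def required_docs_per_second(total_docs: int, window_hours: float) -> float:
    """Average sustained rate needed to ingest total_docs within window_hours."""
    return total_docs / (window_hours * 3600)


# Assumption: "a weekend" is treated as a 48-hour window.
rate = required_docs_per_second(500_000_000, 48)
print(f"{rate:,.0f} documents per second")
```

Under that assumption the cluster has to sustain roughly 2,900 documents per second on average, which is why the rest of the pattern focuses on streaming and parallel processing rather than one API request per document.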

The architecture of the Nuxeo Platform makes highly scalable automations like this possible. Here is how this migration toolset leverages the platform architecture to achieve these results.

When doing a migration into Nuxeo, there are two main components of document records that have to be migrated. One is the binary files. These come from the source system and are usually saved in local storage (at a datacenter) or in cloud-hosted storage (S3 or another object store). Nuxeo uses an S3 object store by default, so this is what we used to bring in the binaries. AWS has a range of tools to upload files into S3 buckets, including hardware devices and REST APIs. We set up a client access S3 bucket in the same availability zone as the Nuxeo cluster. This bucket is configured with PutObject permissions so clients can add files to the bucket in whatever way they choose.
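As a rough illustration of what a PutObject-only grant on the client access bucket might look like, the sketch below builds a standard S3 bucket policy document. The bucket name, account ID, and role ARN are placeholders we made up for the example, not values from a real deployment, and a production policy would likely add further conditions.

```python
import json


def put_only_bucket_policy(bucket: str, principal_arn: str) -> dict:
    """Build a bucket policy that lets one principal upload objects and nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowClientUploads",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:PutObject",
                # Object-level actions apply to keys inside the bucket, hence "/*".
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


# Hypothetical names, used only for illustration.
policy = put_only_bucket_policy(
    "example-client-access-bucket",
    "arn:aws:iam::123456789012:role/example-client-upload",
)
print(json.dumps(policy, indent=2))
```

Granting only the upload action keeps the client side simple: whatever tool the client prefers can write into the bucket, but it cannot read or delete what is already there.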

As I mentioned in my previous blog post, binary files should be moved first. This is critical because the movement of binary files can be fast or slow depending on the size of the files and the network bandwidth of the originating system(s). Nuxeo secures each binary file by removing the extension from the filename and running the binary through a hashing algorithm.

We use a Lambda function inside AWS to process the inbound files from the originating system: removing the extension, hashing the filename, moving the file to the object store connected to the Nuxeo server, and producing a report file that contains the original filename and the corresponding hash key.
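The core of that processing step can be sketched in a few lines. This is a minimal illustration, not our Lambda code: it assumes the hash key is a digest of the file content, it assumes MD5 as the algorithm, and the function and field names are hypothetical. The S3 copy and the writing of the report file are left out.

```python
import hashlib
import os


def blob_key(content: bytes, algorithm: str = "md5") -> str:
    """Return the hex digest used as the object key (algorithm is an assumption)."""
    return hashlib.new(algorithm, content).hexdigest()


def report_row(original_key: str, content: bytes) -> dict:
    """Build one report entry mapping the original filename to its hash key."""
    filename = os.path.basename(original_key)
    stem, _extension = os.path.splitext(filename)  # drop the extension
    return {
        "original_filename": filename,
        "name_without_extension": stem,
        "hash_key": blob_key(content),
    }


row = report_row("exports/2023/invoice-0001.pdf", b"hello")
print(row)
```

If the key in a given deployment is derived from something other than the content, only `blob_key` needs to change; the report row keeps the link between the original filename and the stored object either way.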

To arrange the metadata for each document record in an effective and scalable way, we chose CSV files, as they are smaller and more secure than other options. JSON files are also a viable option because the format carries native data types; however, the file size is larger. The ingestion file type is a configuration option set in the code of the platform add-on.

With either file type, the best way to transmit these files to the Nuxeo cluster is to stream them from a cloud-connected S3 object store. In our testing, streaming from S3 is faster than reading the file from a locally attached EFS volume. This is because the file does not need to be fully loaded into memory and can be processed in chunks.
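The chunked processing idea can be illustrated with a small generator that turns a stream of CSV lines into fixed-size batches without ever holding the whole file in memory. This is a sketch under assumptions: the function name and batch size are hypothetical, and it expects an iterable of byte lines, such as the one boto3 provides through `get_object(...)["Body"].iter_lines()`. It also assumes no quoted field contains a line break.

```python
import csv
from typing import Dict, Iterable, Iterator, List


def iter_batches(
    lines: Iterable[bytes], batch_size: int = 1000
) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of CSV rows (as dicts) of at most batch_size rows each."""
    # Decode lazily so only the current line is in memory at any time.
    text_lines = (line.decode("utf-8") for line in lines)
    reader = csv.DictReader(text_lines)  # first line is treated as the header
    batch: List[Dict[str, str]] = []
    for row in reader:
        batch.append(row)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


sample = [b"id,title", b"1,Contract", b"2,Invoice", b"3,Memo"]
for batch in iter_batches(sample, batch_size=2):
    print(len(batch), batch[0]["title"])
```

Each batch can then be handed to the next stage as soon as it is complete, so processing starts long before the end of the file has been read.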

Stream Import Architecture

Here are the simple steps that the team follows to run an import of any size into the platform:

  1. Users use the AWS CLI (or any supported method) to upload CSV (or JSON) files created from the exporting system into the client access S3 bucket.
  2. Lambdas evaluate the content that is added to the client access S3 bucket and automatically move the metadata and hashed binary files into the S3 bucket connected to the Nuxeo cluster.
  3. An automation operation that references the metadata file driving the import is triggered through the API.
  4. Nuxeo begins streaming the metadata file from S3 and breaking the content into Kafka messages for processing.
  5. Messages are processed by streams, using the native Nuxeo Bulk Action Framework, and documents are created with references to the already existing binary files.
  6. In the background, the content is indexed into Elasticsearch (or OpenSearch) after the data is created in MongoDB.
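To make step 3 concrete, the sketch below builds the kind of HTTP request that triggers an automation operation through the Nuxeo REST API. The operation ID, the `metadataKey` parameter, and the host name are hypothetical stand-ins for the add-on's real names. Authentication is omitted, and the request is only constructed, not sent.

```python
import json
import urllib.request


def build_import_request(
    base_url: str, operation_id: str, metadata_key: str
) -> urllib.request.Request:
    """Build a POST request that asks Nuxeo to run an automation operation."""
    # Automation calls take their inputs in a JSON body with "params" and "context".
    body = json.dumps(
        {"params": {"metadataKey": metadata_key}, "context": {}}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/automation/{operation_id}",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},  # add credentials here
    )


# Hypothetical host, operation ID, and S3 key, used only for illustration.
request = build_import_request(
    "https://nuxeo.example.com/nuxeo",
    "Example.StreamImport",
    "imports/batch-001.csv",
)
print(request.get_method(), request.full_url)
```

Because the request only names the metadata file, the call returns quickly and the heavy lifting in steps 4 through 6 happens asynchronously inside the cluster.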

This pattern is now being used by several different clients of Nuxeo and Nuxeo Cloud. Not only does it come with a speed advantage over other import options, it is also versatile, capable, and adjustable to any client’s needs.

In the realm of digital content management, the journey we’ve embarked on with Nuxeo stands as a testament to innovation, scalability, and the transformative power of technology. Migrating over 3.5 billion documents into a single, unified repository is not just a technical achievement; it’s a strategic leap forward in how we access, manage, and leverage critical content. With more than 10,000 daily users now relying on this robust system, we’ve moved beyond mere storage to creating a dynamic, efficient ecosystem for content management. The success of this project, underscored by the high-speed, flexible, and efficient toolset we developed, illustrates not just the capabilities of the Nuxeo Platform but also the vast potential it holds for organizations seeking to revolutionize their content management strategies. As we continue to refine and deploy this pattern across various clients, it’s clear that the intersection of cloud technology, data migration, and content management is an evolving landscape ripe with opportunities for innovation and growth. This journey with Nuxeo is more than a milestone — it’s a blueprint for the future of enterprise content management, where speed, scalability, and precision converge to create unparalleled business value.

Contact us for more information about our existing migration tools.
