In this blog, we will try to explain some challenges that you might face, even with the best tools available, when you are migrating large volumes (hundreds of terabytes, thousands of tables) of data to the cloud. This information is based on practical experiences that our team has had while migrating large on-premise data warehouses.
In your cloud migration journey, after you are done with cloud assessments, target platform POCs and signoffs, the next step would be the actual migration. Basic requirements for any data migration project are similar: it should be quick, it should be automated (minimal manual intervention) and there should not be any data issues.
There are hundreds of tools out there that claim to fulfill these requirements and more. They’ll say that “all you have to do is select the source tables that you want to migrate and hit the migrate button and that’s it.” The tool will automatically create target schema, map data types, migrate data and also perform change data capture (CDC) for syncing. Apart from that these tools also have capabilities to scale, schedule, tokenize, etc.
Observed Gaps in Data Migration Tools
When you are assessing these tools its looks like a straightforward thing, but when you start large-scale implementations you might start observing some gaps. Some of these gaps may impact timelines, efforts, and the overall quality for your migration. Below is a listing of the gaps that we observed. You can have a look and see if you want to include these in your assessment phase itself so that your implementation runs smoothly, or at least you are better prepared.
1. Throttling and Elasticity
You need to apply throttling on data extraction processes because of the limited availability of resources on your source database. Since you want to migrate, you don’t want to spend more to add resources on your source system and at the same time, you don’t want your data migration processes to go all guns out on source database and impact the performance of your production applications/loads. You can control this by limiting the parallelism of extraction processes but that’s not an optimal solution. Most warehouses have variable load throughout the day which means I shouldn’t be running a constant number of processes all the time.
To optimally use the source database, you need a data migration tool to sense the load of the source database and accordingly throttle data extraction processes while scaling up or down. In practice, we had to create scripts to throttle the extraction process as none of the tools provided such a feature.
2. Efficient Data Validation
Data validation provides assurance that data is loaded without any consistency, truncation, formatting, or data type issues. If your source and target are different databases, then it becomes all the more important to perform data validation. You can’t validate thousands of tables manually. In our opinion it is always good to perform full data validation, however, it depends upon how critical the data is. We have observed the following issues with tools regarding data validation:
- While many tools don’t provide any data validation features, some tools provide limited (as explained in the below points) data validation.
- They could validate data only if tables had primary keys.
- Even if some tools can validate on the primary key, they perform so slow that it’s not possible to use this feature, and furthermore, they add double load on the source database.
The right tool should provide efficient data validation without putting extra load on the source database, irrespective of whether the table has a primary key or not.
3. Automatic Configuration and Scheduling
While most tools provide certain features to configure and schedule data migration, it still takes a considerable amount of manual effort for configuration. Imagine that you have thousands of tables and for each table you need to manually setup pipeline, specify primary keys (for validation), watermark column (for CDC), load modes (based on partitions), etc. and schedule these tables at particular time or priorities. It would be nice if the tool can get these configurations automatically, otherwise again it is required to create custom scripts to automate these things.
4. Fault Tolerance and Restartability
Most tools work well when they are on a smooth path and nothing is failing, but it becomes very difficult to manage when you have many processes that start to fail. We have observed the following issues for fault tolerance and restartability:
- On restart, it doesn’t start from the point of failure but right from the extraction.
- There were instances where duplicates were loaded in case of restart. It required a lot of manual hand-holding due to these issues in case it got failed.
- It was very hard or not possible to get a detailed error message to pinpoint the cause of failure. Only generic messages were available.
The right tool should perform automatic restart from the point of failure and it should ensure that there is no duplicate loading of records.
5. Upserts for CDC
Data syncing is almost always required since you want to continue with the source system until your target system is stable enough. Change data capture (CDC) is needed for data syncing and this feature is available in many tools. However, many tools don’t support updates/upserts on target for data synching, as they only support append. In data warehouses, there are often a large number of dimensions which are typically upsert tables, so it required to develop extra scripts to perform upserts. The right tool should have options to configure tables for append/upsert/replace mode for CDC.
6. Generate/Respond to Events for Integration
Sometimes data migration and syncing is just one part of your overall pipeline. You have some process upstream that should trigger the data synching process or you want to trigger/notify a downstream process once data synching is done. As trivial as it may sound, many tools do not provide such integration endpoints, nor can they be integrated with an external scheduler.
7. Data Type Support
Depending upon your use case of source and target database there might be some tricky datatypes that do not have a direct mapping in the target database. Many tools also have limitations of data type support per source or target database. You should run a metadata query on your source database and make sure that you have support for all data types.
Closing the Migration Tool Gaps
We hope that after sharing our experiences with large-scale cloud data migration initiatives, you are better equipped to assess your migration tool. At Bitwise, we realized that these gaps are making the process of migrating on-premise data to the cloud difficult for our customers, so we developed utilities and accelerators to streamline cloud data migrations.
Learn more about our cloud migration experiences and the solutions we developed to close the gaps with available tools.
Editor's Note: The blog was originally posted on October 2019 and recently updated on February 2023 for accuracy.