In this blog, we will try to explain some challenges that you might face, even with the best tools available, when you are migrating large volumes (hundreds of terabytes, thousands of tables) of data to the cloud. This information is based on practical experiences that our team has had while migrating large on-premise data warehouses.
In your cloud migration journey, after you are done with cloud assessments, target platform POCs and signoffs, the next step would be the actual migration. Basic requirements for any data migration project are similar: it should be quick, it should be automated (minimal manual intervention) and there should not be any data issues.
There are hundreds of tools out there that claim to fulfil these requirements and more. They’ll say that “all you have to do is select the source tables that you want to migrate and hit the migrate button and that’s it.” The tool will automatically create the target schema, map data types, migrate data and also perform change data capture (CDC) for syncing. Apart from that, these tools also have capabilities to scale, schedule, tokenize, etc.
When you are assessing these tools, migration looks straightforward, but once you start a large-scale implementation you may begin to notice some gaps. Some of these gaps can affect the timeline, effort and overall quality of your migration. Below is a list of the gaps that we observed. Consider including them in your assessment phase so that your implementation runs smoothly, or at least so that you are better prepared.
You need to throttle data extraction processes because resources on your source database are limited. Since you are migrating away from it, you don’t want to spend more on adding resources to the source system, and at the same time you don’t want your data migration processes to go all out on the source database and degrade the performance of your production applications and loads. You can control this by limiting the parallelism of extraction processes, but that is not an optimal solution. Most warehouses have a variable load throughout the day, which means you shouldn’t be running a constant number of processes all the time.
To use the source database optimally, you need a data migration tool that senses the load on the source database and throttles data extraction accordingly, scaling the number of processes up or down. In practice, we had to create scripts to throttle the extraction process, as none of the tools we used provided such a feature.
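The throttling idea above can be sketched roughly as follows. This is a minimal illustration, not the scripts we actually used: the function name `target_workers`, the `load_ceiling` threshold and the assumption that source load is available as a fraction between 0.0 and 1.0 (for example, CPU or active-session utilisation sampled from the source database) are all hypothetical.

```python
def target_workers(source_load: float, max_workers: int = 16,
                   min_workers: int = 1, load_ceiling: float = 0.8) -> int:
    """Return how many extraction workers to run for the current source load.

    source_load is assumed to be a utilisation fraction (0.0 to 1.0) sampled
    from the source database; how it is measured is left to the caller.
    """
    # At or above the ceiling, fall back to the minimum so production
    # workloads keep priority over the migration.
    if source_load >= load_ceiling:
        return min_workers
    # Otherwise scale the worker count linearly with the remaining headroom.
    headroom = (load_ceiling - source_load) / load_ceiling
    return max(min_workers, min(max_workers, round(max_workers * headroom)))
```

A scheduler loop could call this every few minutes and start or stop extraction processes to match the returned number, so that more workers run during quiet hours and fewer during peak load.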
Data validation provides assurance that data is loaded without any consistency, truncation, formatting, or datatype issues. If your source and target are different databases, it becomes all the more important to perform data validation. You can’t validate thousands of tables manually. In our opinion it is always good to perform full data validation, although how much you need depends on how critical the data is. We have observed the following issues with tools regarding data validation:
The right tool should provide efficient data validation without putting extra load on source database, irrespective of whether the table has primary key or not.
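One way to validate a table that has no primary key is to compare an order-independent fingerprint of its rows on both sides. The sketch below is only an illustration under assumptions: the function name `table_fingerprint` and the row encoding are hypothetical, and it assumes values are rendered identically by the source and target drivers. To avoid extra load on the source, it could be run over the rows already extracted for migration rather than over a second read of the source table.

```python
import hashlib

NULL_MARKER = "\x00NULL"   # hypothetical sentinel so NULL differs from ""
FIELD_SEPARATOR = "\x1f"   # unit separator, unlikely to appear in data


def table_fingerprint(rows):
    """Return (row_count, checksum) for an iterable of row tuples.

    The checksum is the sum of per-row hashes modulo 2**64, so it does not
    depend on row order and needs no primary key.
    """
    count = 0
    checksum = 0
    for row in rows:
        encoded = FIELD_SEPARATOR.join(
            NULL_MARKER if value is None else str(value) for value in row
        ).encode("utf-8")
        row_hash = int.from_bytes(hashlib.sha256(encoded).digest()[:8], "big")
        checksum = (checksum + row_hash) % 2**64
        count += 1
    return count, checksum
```

If the fingerprints of source and target match, the tables very likely hold the same rows; if they differ, a per-partition fingerprint can narrow down where the mismatch is.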
While most tools provide certain features to configure and schedule data migration, configuration still takes a considerable amount of manual effort. Imagine that you have thousands of tables, and for each table you need to manually set up a pipeline, specify primary keys (for validation), a watermark column (for CDC), load modes (based on partitions), etc., and then schedule these tables at particular times or priorities. It would be nice if the tool could derive these configurations automatically; otherwise you again have to create custom scripts to automate these things.
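A custom script for this can be as simple as deriving each table's pipeline settings from catalog metadata. The sketch below assumes the metadata has already been read into dictionaries; the field names (`name`, `columns`, `primary_key`, `partition_column`) and the rule of picking the first date or timestamp column as the watermark are hypothetical choices for illustration.

```python
def build_config(table):
    """Derive one table's pipeline configuration from its metadata."""
    # Pick the first date/timestamp-typed column as the CDC watermark, if any.
    watermark = next(
        (column for column, data_type in table["columns"].items()
         if data_type.lower().startswith(("timestamp", "datetime", "date"))),
        None,
    )
    return {
        "table": table["name"],
        "primary_key": table.get("primary_key", []),
        "watermark_column": watermark,
        # Partitioned tables are loaded partition by partition, others in full.
        "load_mode": "by_partition" if table.get("partition_column") else "full",
    }


def build_configs(tables):
    """Build configurations for every table in the metadata list."""
    return [build_config(table) for table in tables]
```

Tables for which no watermark column or primary key can be derived would still need a manual decision, but the list of such exceptions is usually much shorter than the full table list.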
Most tools work well when they are on a smooth path and nothing is failing, but it becomes very difficult to manage when you have many processes that start to fail. We have observed the following issues for fault tolerance and restartability:
The right tool should perform automatic restart from the point of failure and it should ensure that there is no duplicate loading of records.
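Restarting from the point of failure without loading duplicates usually comes down to recording a checkpoint after each committed batch. The following is a minimal in-memory sketch: the function `load_in_batches`, the `checkpoint` dictionary and the `fail_at_batch` switch (used only to simulate a failure) are hypothetical. In a real pipeline the batch write and the checkpoint update would need to be committed atomically, for example in the same target transaction, for the no-duplicates guarantee to hold.

```python
def load_in_batches(rows, target, checkpoint, batch_size=2, fail_at_batch=None):
    """Load rows into target in batches, resuming from the saved checkpoint."""
    # Resume where the previous run stopped; start at 0 on the first run.
    start = checkpoint.get("next_offset", 0)
    for offset in range(start, len(rows), batch_size):
        if fail_at_batch is not None and offset // batch_size == fail_at_batch:
            raise RuntimeError("simulated failure")
        target.extend(rows[offset:offset + batch_size])
        # Record progress only after the batch has been written.
        checkpoint["next_offset"] = offset + batch_size


rows = list(range(7))
target, checkpoint = [], {}
try:
    load_in_batches(rows, target, checkpoint, fail_at_batch=2)
except RuntimeError:
    pass  # the first two batches are loaded and checkpointed
load_in_batches(rows, target, checkpoint)  # restart resumes at the third batch
```

After the restart, `target` holds each of the seven rows exactly once, because the second run skips the batches that were already checkpointed.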
Data syncing is almost always required, since you want to keep running the source system until your target system is stable enough. Change data capture (CDC) is needed for data syncing, and this feature is available in many tools. However, many tools don’t support updates/upserts on the target for data syncing, as they only support append. Data warehouses often contain a large number of dimensions, which are typically upsert tables, so we had to develop extra scripts to perform upserts. The right tool should have options to configure tables for append/upsert/replace mode for CDC.
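The difference between the three modes can be shown with a small in-memory sketch. It is an illustration only: the function `apply_changes`, the list-of-dictionaries table representation and the default key column `id` are hypothetical, and a real implementation would issue the equivalent statements (for example a merge) against the target database.

```python
def apply_changes(target, changes, mode, key="id"):
    """Return the target rows after applying a batch of captured changes."""
    if mode == "append":
        # Fact-style tables: every change becomes a new row.
        return target + changes
    if mode == "replace":
        # Small reference tables: the batch replaces the whole table.
        return list(changes)
    if mode == "upsert":
        # Dimension-style tables: update rows with a matching key, insert the rest.
        merged = {row[key]: row for row in target}
        for row in changes:
            merged[row[key]] = row
        return list(merged.values())
    raise ValueError(f"unknown CDC mode: {mode}")
```

For a dimension holding `{"id": 1, "city": "Pune"}`, a change of `{"id": 1, "city": "Mumbai"}` produces one updated row in upsert mode, whereas append mode would leave two rows for the same key.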
Sometimes data migration and syncing is just one part of your overall pipeline. You may have an upstream process that should trigger the data syncing process, or you may want to trigger or notify a downstream process once data syncing is done. As trivial as it may sound, many tools do not provide such integration endpoints, nor can they be integrated with an external scheduler.
Depending on your combination of source and target database, there might be some tricky data types that have no direct mapping in the target database. Many tools also limit which data types they support for a given source or target database. You should run a metadata query on your source database to list every data type in use, and make sure that the tool supports all of them.
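The check described above can be scripted along these lines. This is a sketch under assumptions: the query uses the `information_schema.columns` view, which many but not all databases expose (some use their own catalog tables instead), and the function `unsupported_types` and the `supported` list of type names are hypothetical stand-ins for whatever your tool documents.

```python
# Metadata query to run on the source database (assumes information_schema).
METADATA_QUERY = """
SELECT table_name, column_name, data_type
FROM information_schema.columns
"""


def unsupported_types(columns, supported):
    """Group columns whose data type is not in the tool's supported list.

    columns is an iterable of (table_name, column_name, data_type) rows,
    such as the result of METADATA_QUERY.
    """
    supported_lower = {name.lower() for name in supported}
    missing = {}
    for table, column, data_type in columns:
        if data_type.lower() not in supported_lower:
            missing.setdefault(data_type.lower(), []).append(f"{table}.{column}")
    return missing
```

An empty result means every data type in use is covered; otherwise the result lists exactly which columns need a custom mapping or a workaround.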
We hope that after sharing our experiences with large-scale cloud data migration initiatives, you are better equipped to assess your migration tool. At Bitwise, we realized that these gaps are making the process of migrating on-premise data to the cloud difficult for our customers, so we developed utilities and accelerators to streamline cloud data migrations.
Learn more about our cloud migration experiences and the solutions we developed to close the gaps with available tools here.
Pushpender has been a leading member of the Bitwise Design and Architecture Research Team (DART) responsible for driving innovations in big data architecture, cloud computing, blockchains and microservices, and played a key role in developing Hydrograph as the project’s chief architect.