Cloud migration has become a key part of Digital Transformation initiatives today. Since there are considerable investments at stake, it requires careful planning and assessment to anticipate the challenges that you might face during your migration journey.
Migrating large volumes of data from on-premise systems to cloud is a highly complex and nuanced process. Even with the best tools available, migrating hundreds of terabytes and thousands of tables to cloud may pose unexpected challenges that can delay or derail your cloud migration process.
Before we begin…
Your data migration journey typically involves cloud assessments, target platform POCs and signoffs, and then the actual data migration.
Here are some basic requirements for any data migration project:
- It should be quick
- It should be automated and should require minimal manual intervention
- There should not be any data issues or losses
Many data migration tools in the market claim to fulfil these requirements and more. The most common offering is “All you have to do is select the source tables that you want to migrate and hit the migrate button. That’s it.” The tool will automatically create target schema, map data types, migrate data and also perform Change Data Capture (CDC) for syncing. Apart from that, these tools also have capabilities to scale, schedule, tokenize, etc.
During our many on-premise to cloud migration initiatives, we have observed and outlined some key challenges which are likely to occur despite using some of the best cloud migration tools in the market.
Some of the most common challenges with Data Migration Tools
When you are assessing these tools, they look straightforward at first glance. But when you begin large-scale implementations, you may start observing and experiencing some gaps.
Stumbling upon these gaps in the middle of the execution phase may impact timelines, effort, and the overall quality of your migration.
Considering these gaps in your assessment phase may help you choose the right data migration tool as part of your cloud migration strategy.
1. Throttling and Elasticity
During data migration, you may need to apply throttling on data extraction processes because of limited availability of resources on your source database.
Since you are migrating away from the source system, you don’t want to invest in additional resources for it; at the same time, you don’t want your data migration processes to overload the source database and degrade the performance of your production applications and loads. You can control this by limiting the number of parallel extraction processes, but that is still not an optimal solution: most data warehouses have variable load throughout the day, which rules out running a constant number of processes at all times.
To use the source database optimally, you need a data migration tool that can sense the load on the source database and throttle the extraction processes accordingly, scaling up or down. In practice, we had to create scripts to throttle the extraction process, as none of the existing data migration tools provided such a feature.
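A throttling script of this kind can be quite small. The sketch below assumes you can poll a CPU-load metric for the source database (how you obtain it is environment-specific); the thresholds and worker limits are illustrative and would be tuned to your source system’s capacity.

```python
# Illustrative thresholds; tune to your source database's actual capacity.
MAX_WORKERS = 16
MIN_WORKERS = 2
HIGH_LOAD = 0.80   # CPU fraction above which extraction is scaled down
LOW_LOAD = 0.50    # CPU fraction below which extraction is scaled up

def throttle_workers(current_workers: int, cpu_load: float) -> int:
    """Return the new extraction-worker count for the observed DB load.

    Called on each polling cycle: step down one worker under high load,
    step up one worker under low load, hold steady in between.
    """
    if cpu_load > HIGH_LOAD:
        return max(MIN_WORKERS, current_workers - 1)
    if cpu_load < LOW_LOAD:
        return min(MAX_WORKERS, current_workers + 1)
    return current_workers
```

A scheduler loop would call this every minute or so and resize the extraction pool, so the migration automatically backs off during production peak hours and speeds up overnight.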
2. Efficient Data Validation
Data validation provides assurance that data is loaded without any datatype, truncation, formatting, or consistency issues. If your source and target are different databases, then it becomes all the more important to perform data validation. You can’t validate thousands of tables manually. In our opinion, it is always good to perform full data validation, however it depends upon how critical the data is. We have observed the following issues with most of the migration tools regarding data validation:
- Many tools don’t provide any data validation features; those that do offer only limited validation, as explained below.
- Some tools could validate data only if tables had primary keys.
- Even tools that can validate on a primary key perform so slowly that the feature is impractical to use, and they double the load on the source database.
The right tool should provide efficient data validation without putting extra load on the source database, irrespective of whether the table has a primary key.
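One way to validate without a primary key is to compare order-independent fingerprints of source and target tables, computed once per side. The sketch below is an assumption about how this could be done (it is not a feature of any particular tool): XOR-combining per-row hashes is commutative, so row order does not matter, and each side reads its own data only once.

```python
import hashlib

def table_fingerprint(rows):
    """Order-independent fingerprint: XOR of per-row MD5 digests.

    `rows` is an iterable of tuples from a full table scan. Because XOR
    is commutative, source and target fingerprints match regardless of
    row order. Caveat (inherent to XOR folding): a row duplicated an
    even number of times cancels out, so pair this with a row count.
    """
    acc = 0
    for row in rows:
        digest = hashlib.md5("|".join(map(str, row)).encode()).hexdigest()
        acc ^= int(digest, 16)
    return acc
```

Comparing `(COUNT(*), fingerprint)` per table gives a cheap first-pass check across thousands of tables; mismatching tables can then be drilled into individually.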
3. Automatic Configuration and Scheduling
While most data migration tools provide certain features to configure and schedule data migration, it still takes a considerable amount of manual effort for configuration.
Imagine you have thousands of tables: for each table, you need to manually set up a pipeline; specify primary keys (for validation), the watermark column (for CDC), and load modes (based on partitions); and schedule the table at a particular time or according to its priority. Again, if the tool can’t derive these configurations automatically, you have to create custom scripts to automate the process.
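Such scripts typically derive each table’s configuration from catalog metadata rather than hand-entering it. The sketch below is a hypothetical example: it assumes you have already queried the source catalog for each table’s columns and primary-key columns, and it guesses the watermark column from a conventional name list (an assumption you would adapt to your schema).

```python
def build_table_config(table, columns, pk_cols,
                       watermark_candidates=("updated_at", "modified_ts")):
    """Derive one migration-pipeline config entry from catalog metadata.

    `columns` and `pk_cols` come from the source catalog (e.g. an
    information_schema query); the watermark candidate names are a
    convention assumed here, not a standard.
    """
    watermark = next((c for c in columns if c in watermark_candidates), None)
    return {
        "table": table,
        "primary_key": pk_cols or None,       # used for validation / upserts
        "watermark_column": watermark,        # used for CDC incremental pulls
        "load_mode": "incremental" if watermark else "full",
    }
```

Running this over the whole catalog produces a reviewable config file for thousands of tables in seconds, leaving only the exceptions to fix by hand.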
4. Fault Tolerance and Restartability
Most data migration tools work well when they are on a smooth path and nothing is failing. But they become very difficult to manage when many processes start to fail. We have observed the following issues with fault tolerance and restartability:
- On restart, the tool doesn’t resume from the point of failure but starts over from extraction.
- There were instances where duplicates were loaded after a restart; avoiding this required a lot of manual handholding whenever a process failed.
- During some of these failures, it was very hard to get a detailed error message to pinpoint the cause; only generic messages were available.
The right data migration tool must have provisions for such possibilities and should perform an automatic restart from the point of failure while ensuring that there is no duplicate loading of records.
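Restart-from-failure usually comes down to checkpointing which units of work have been committed. The sketch below is one minimal approach under assumed conventions (a JSON checkpoint file, per-table chunk indices, and a caller-supplied `load_chunk` function): chunks already recorded in the checkpoint are skipped on restart, so nothing is extracted or loaded twice.

```python
import json
import os

CHECKPOINT = "migration_checkpoint.json"  # illustrative path

def load_checkpoint():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {}

def save_checkpoint(state):
    # Write-then-rename so a crash never leaves a half-written checkpoint.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT)

def migrate_chunks(table, chunks, load_chunk):
    """Load `chunks` for `table`, resuming after the last committed chunk."""
    state = load_checkpoint()
    last_done = state.get(table, -1)
    for i, chunk in enumerate(chunks):
        if i <= last_done:
            continue              # already loaded before the failure: skip
        load_chunk(chunk)         # must commit before the checkpoint advances
        state[table] = i
        save_checkpoint(state)
```

Because the checkpoint only advances after a chunk commits, a re-run after any failure picks up exactly where the previous run stopped, with no duplicate rows.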
5. Upserts for Change Data Capture (CDC)
Data syncing is almost always required, since you want to continue running the source system until your target system is stable enough. Change Data Capture (CDC) is needed for data syncing, and this feature is available in many tools. However, many of these data migration tools don’t support updates/upserts on the target, as they only support append mode.
Data warehouses often contain a large number of dimension tables, which typically require upserts, so you end up developing extra scripts to perform them. The right tool should let you configure append, upsert, or replace mode per table for CDC.
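These extra scripts often amount to generating a MERGE statement per dimension table from the config metadata. The generator below is a hedged sketch: it emits ANSI-style MERGE text (exact syntax varies by target database, e.g. Snowflake and BigQuery support MERGE while some targets need `INSERT ... ON CONFLICT`), and the table/column names in the usage test are hypothetical.

```python
def merge_sql(target_table, stage_table, key_cols, all_cols):
    """Render an ANSI-style MERGE that upserts staged CDC rows.

    CDC deltas land in `stage_table`; rows matching on `key_cols` are
    updated in place, new rows are inserted.
    """
    on = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    sets = ", ".join(f"t.{c} = s.{c}" for c in all_cols if c not in key_cols)
    cols = ", ".join(all_cols)
    vals = ", ".join(f"s.{c}" for c in all_cols)
    return (
        f"MERGE INTO {target_table} t USING {stage_table} s ON {on} "
        f"WHEN MATCHED THEN UPDATE SET {sets} "
        f"WHEN NOT MATCHED THEN INSERT ({cols}) VALUES ({vals})"
    )
```

Driven by the same per-table config that records primary keys and load modes, one generator covers every upsert dimension without hand-written SQL.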
6. Generate/Respond to Events for Integration
Sometimes data migration and syncing is just one part of your overall pipeline. An upstream process may need to trigger the data syncing process, or you may want to trigger or notify a downstream process once syncing is done. As trivial as it may sound, many tools do not provide such integration endpoints, nor can they be integrated with an external scheduler.
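When the tool lacks integration hooks, a thin wrapper can emit a completion event itself. The sketch below assumes your orchestrator exposes a webhook endpoint (the URL and payload shape are assumptions, not any tool’s API): a small event payload is built after sync completes and POSTed downstream.

```python
import json
import time
import urllib.request

def build_sync_event(table, watermark):
    """Payload announcing that CDC sync for `table` reached `watermark`.

    The field names here are a convention assumed for illustration;
    match whatever contract your downstream consumers expect.
    """
    return {
        "event": "sync_complete",
        "table": table,
        "watermark": watermark,
        "emitted_at": int(time.time()),
    }

def notify_downstream(webhook_url, event):
    """POST the event to a downstream webhook so dependent jobs can start."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

The same pattern works in reverse: expose a tiny HTTP endpoint (or poll a queue) that kicks off the sync when an upstream process signals readiness.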
7. Data Type Support
Depending upon your source and target databases, there might be some tricky data types that have no direct mapping in the target database. Many data migration tools in the market have data type support limitations for particular source or target databases.
This requires you to run an additional metadata query on your source database to make sure all of its data types are supported.
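That metadata check can be scripted up front. The sketch below assumes an Oracle source (the `all_tab_columns` catalog view is real; the mapping table is an illustrative stub you would extend for your actual source/target pair): inventory the distinct types in use, then flag any with no target mapping before migration starts.

```python
# Illustrative source-to-target type map; extend for your databases.
SUPPORTED = {
    "NUMBER": "NUMERIC",
    "VARCHAR2": "STRING",
    "DATE": "TIMESTAMP",
    "CLOB": "STRING",
}

# Oracle catalog query; swap in information_schema.columns for other sources.
INVENTORY_SQL = """
SELECT data_type, COUNT(*) AS column_count
FROM all_tab_columns
GROUP BY data_type
"""

def unmapped_types(type_rows):
    """Given (data_type, count) rows from the catalog query, return the
    types with no target mapping, so gaps surface before migration."""
    return sorted(t for t, _count in type_rows if t.upper() not in SUPPORTED)
```

Any types this surfaces (spatial, XML, interval types are common offenders) can then be handled with explicit cast rules instead of failing mid-migration.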
Ensuring successful data migration to cloud
Large-scale data migration from on-premise systems to cloud requires carefully mapping out and preparing for these and other unanticipated challenges that may occur during the migration, whichever data migration tool you plan to use.
The key factors to achieving an efficient and successful cloud migration are:
- On-premise resources with cloud migration expertise
- Agile processes
- Continuous and effective collaboration with all teams involved
- Data governance
Ensuring that your cloud migration journey completes without glitches, delays, or, worse, crucial data losses requires you to fulfill the above criteria, avoid complete dependence on the data migration tool, and assess the tool thoroughly before you begin your on-premise to cloud migration initiative.