Best File Formats for Loading Data into Snowflake on Cloud

Snowflake, the software-as-a-service (SaaS) data warehouse available on the popular public cloud platforms, is a very flexible solution, especially for agile enterprises, because storage and compute scale separately and automatically (or on demand). Snowflake scales horizontally and can process large volumes of data using massively parallel processing (MPP) compute clusters, where each node in the cluster stores a portion of the data set locally. Snowflake supports a number of ETL, analytics and reporting tools through ODBC and JDBC drivers, and it also has connectors for popular big data processing tools like Apache Spark.

Bitwise has developed accelerators and utilities to help our clients efficiently and seamlessly migrate their on-premises data warehouses to Snowflake. While building these accelerators, we wanted to find out which file format loads data into Snowflake fastest. This blog discusses our findings.

Snowflake Load Performance and File Formats

Snowflake supports multiple file formats for loading data, including CSV, JSON, Avro, ORC, Parquet and XML. For our benchmark we considered only CSV, Avro, Parquet and ORC; since our core objective was to migrate traditional warehouses, whose data is flat and relational in nature, JSON and XML did not make sense.
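For context, each format is loaded with the same COPY INTO command, varying only the FILE_FORMAT clause. Below is a minimal sketch of how we parameterized that statement; the table and stage names are hypothetical placeholders, not the ones used in our benchmark.

```python
# Build a Snowflake COPY INTO statement for a given file format.
# Table and stage names are hypothetical placeholders.

SUPPORTED_FORMATS = {"CSV", "JSON", "AVRO", "ORC", "PARQUET", "XML"}

def build_copy_statement(table: str, stage: str, file_format: str) -> str:
    """Return a COPY INTO statement that loads files from `stage` into `table`."""
    fmt = file_format.upper()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {file_format}")
    return (
        f"COPY INTO {table} "
        f"FROM @{stage} "
        f"FILE_FORMAT = (TYPE = '{fmt}')"
    )

print(build_copy_statement("sales", "load_stage", "csv"))
# COPY INTO sales FROM @load_stage FILE_FORMAT = (TYPE = 'CSV')
```

Keeping the statement identical across runs (same table, same stage, same warehouse) is what makes the timings below directly comparable.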

Each file format was used to load the same amount of data (number of records) with the same resources from the same S3 bucket. Since this is a relative comparison, the absolute data size, cluster size, etc. don't matter. The diagram below shows the load time comparison across file formats.

[Chart: load time comparison of CSV, Avro, Parquet and ORC]

CSV is the winner, with Avro a close second. Frankly, we expected Avro to outperform CSV, but it was slightly slower. The only explanation we can think of is the additional cast operations (variant to target type) that Avro loads require.
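To make that cast overhead concrete: semi-structured formats land in a single variant column, so the COPY has to pull each field out of $1 and cast it to the target type. A minimal sketch of building such a transformed COPY follows; the table, stage and column names are hypothetical, not from our benchmark.

```python
# Sketch of a transformed COPY for Avro (or Parquet/ORC) staged files:
# every field is extracted from the variant column ($1) and cast to its
# target type. Column names and types here are hypothetical examples.

def build_variant_copy(table, stage, columns):
    """`columns` maps field name -> target Snowflake type.

    Returns a COPY INTO ... FROM (SELECT ...) statement with one
    variant-path cast per column.
    """
    casts = ", ".join(f"$1:{name}::{typ}" for name, typ in columns.items())
    return f"COPY INTO {table} FROM (SELECT {casts} FROM @{stage})"

sql = build_variant_copy(
    "orders", "avro_stage",
    {"order_id": "NUMBER", "order_date": "DATE"},
)
print(sql)
# COPY INTO orders FROM (SELECT $1:order_id::NUMBER, $1:order_date::DATE FROM @avro_stage)
```

A CSV load needs none of these per-field casts, which is consistent with the small edge we measured for CSV.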

Parquet and ORC are 150 to 200% slower. Again, this is just a guess, but it is probably because, even though Snowflake is a columnar store, its load function accesses data row by row, and Parquet and ORC are not well suited to row-based access.

Before finally deciding on the file format that works best for your requirement, you might want to consider these factors as well:

  1. File size – For every file format, we found the best load performance at roughly 80 MB per file chunk. Your results may vary; per Snowflake's recommendation, files should be between 10 and 100 MB for best performance.
  2. Complex and binary data types – If you need to migrate complex or binary (varbyte, BLOB, etc.) data types, you will have to encode them properly for CSV. Check the Snowflake documentation for supported encodings (hex worked best in our tests). Avro is more straightforward here because it requires no encoding.
  3. Validation errors – If you want to capture load error messages and records using Snowflake's validation function, CSV is the only option. All other formats load data into variant columns, on which you must apply conversion functions when loading from stage to target. Those conversions count as transformations, and transformed loads are not supported by validation.
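Points 1 and 2 can both be handled while generating the CSV files. The sketch below hex-encodes binary values (so the column can later be decoded in Snowflake, e.g. with TO_BINARY(col, 'HEX')) and splits the output into chunks near a target size such as 80 MB. It is a simplified illustration with made-up sample rows, not our production utility.

```python
import binascii
import csv
import io

def hex_encode_binary(value: bytes) -> str:
    """Hex-encode a binary value so it survives a CSV round trip.

    The staged column can then be decoded on the Snowflake side,
    e.g. with TO_BINARY(col, 'HEX')."""
    return binascii.hexlify(value).decode("ascii")

def split_rows(rows, target_bytes):
    """Yield CSV chunks, each roughly `target_bytes` in size.

    A chunk is emitted as soon as the buffer reaches the target, so
    chunk sizes land slightly above it; tune target_bytes (we used
    ~80 MB) to stay inside Snowflake's recommended file-size range."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow(row)
        if buf.tell() >= target_bytes:
            yield buf.getvalue()
            buf = io.StringIO()
            writer = csv.writer(buf)
    if buf.tell():
        yield buf.getvalue()  # final partial chunk

# Hypothetical sample rows: an id plus a hex-encoded binary payload.
rows = [(i, hex_encode_binary(b"\x00\xff")) for i in range(3)]
chunks = list(split_rows(rows, target_bytes=80 * 1024 * 1024))
```

In practice each chunk would be written to its own file in the stage bucket so Snowflake can load the files in parallel.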

Recap on Findings for Snowflake Load

Based on our experience, we recommend CSV and Avro as the preferred formats for loading data into Snowflake. Even if you plan to keep a copy of the data on object storage (S3, etc.) in ORC or Parquet format for direct query, it is advisable to create a separate temporary copy in CSV or Avro just for the Snowflake load.

We hope these findings help in your efforts to migrate data from your on-premises data warehouses to Snowflake. Bitwise offers a full range of cloud migration services to efficiently move your data and applications to the cloud, including a framework that expedites data migration from legacy database applications to Snowflake by combining best-of-breed technologies from the leading cloud providers with Bitwise-built utilities that fill the gaps in cloud provider tools and services.

Contact us to learn more on how we can accelerate your migration efforts.

Pushpender Garg

Pushpender has been a leading member of the Bitwise Design and Architecture Research Team (DART) responsible for driving innovations in big data architecture, cloud computing, blockchains and microservices, and played a key role in developing Hydrograph as the project’s chief architect.

