Snowflake, a software-as-a-service (SaaS) data warehouse available on the major public cloud platforms, is a very flexible solution, especially for agile enterprises, because storage and compute scale independently, either automatically or on demand. Snowflake scales horizontally and processes large volumes of data on massively parallel processing (MPP) compute clusters, where each node in the cluster stores a portion of the data set locally. Snowflake supports a number of ETL, analytics, and reporting tools through its ODBC and JDBC drivers, and it also has connectors for popular big data processing tools such as Apache Spark.
Bitwise has developed accelerators and utilities to help our clients efficiently and seamlessly migrate their on-premises data warehouses to Snowflake in the cloud. While developing these accelerators, we wanted to find out which file format is best for loading data into Snowflake. This blog discusses our findings.
Snowflake supports multiple file formats for loading data, including CSV, JSON, Avro, ORC, Parquet, and XML. For our benchmarking, we considered only CSV, Avro, Parquet, and ORC. Since our core objective was to migrate traditional warehouses, whose data is flat (tabular) in nature, it did not make sense to use JSON or XML.
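For context, bulk loads like the ones we benchmarked are driven by Snowflake's COPY INTO command with a FILE_FORMAT option. Below is a minimal sketch, in Python, of generating such statements; the table and stage names are hypothetical, and in practice you would run the resulting SQL through SnowSQL or a Snowflake connector.

```python
# Illustrative helper that builds a Snowflake COPY INTO statement for one of
# the formats we benchmarked. Table and stage names here are hypothetical.

def copy_into(table: str, stage: str, file_format: str) -> str:
    """Build a COPY INTO statement loading from a named stage with the
    given file format type."""
    supported = {"CSV", "AVRO", "PARQUET", "ORC"}
    fmt = file_format.upper()
    if fmt not in supported:
        raise ValueError(f"unsupported format: {file_format}")
    return f"COPY INTO {table} FROM @{stage} FILE_FORMAT = (TYPE = {fmt})"

print(copy_into("sales", "my_s3_stage", "csv"))
# COPY INTO sales FROM @my_s3_stage FILE_FORMAT = (TYPE = CSV)
```

In a real migration you would typically define a named file format object once and reference it from every COPY, rather than inlining the TYPE each time.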
Each file format was used to load the same amount of data (the same number of records) with the same resources from the same S3 bucket. Since this is a relative comparison, the actual data size, cluster size, etc. don't matter. The diagram below shows the load time comparison across the file formats.
CSV is the winner, with Avro a close second. Frankly, we expected Avro to perform better than CSV, but it was a bit slower. The only explanation we can think of is the additional cast operations (VARIANT to target type) that you need to perform when loading Avro.
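To make the cast overhead concrete: Avro files land in Snowflake as a single VARIANT column, so a COPY with a transformation is needed to pull each field out with a `$1:field::type` cast. A sketch of generating such a statement, with hypothetical table, stage, and column names:

```python
# Sketch of the extra cast step an Avro load needs: each VARIANT field
# ($1:field) is cast to its target type inside the COPY. Table, stage,
# and column names are hypothetical.

def copy_avro_with_casts(table: str, stage: str, casts: dict[str, str]) -> str:
    """Build a COPY INTO ... FROM (SELECT ...) statement that casts
    VARIANT fields to their target SQL types."""
    select_list = ", ".join(
        f"$1:{col}::{sql_type}" for col, sql_type in casts.items()
    )
    return (
        f"COPY INTO {table} FROM "
        f"(SELECT {select_list} FROM @{stage}) "
        f"FILE_FORMAT = (TYPE = AVRO)"
    )

print(copy_avro_with_casts("sales", "my_s3_stage",
                           {"id": "NUMBER", "amount": "FLOAT"}))
# COPY INTO sales FROM (SELECT $1:id::NUMBER, $1:amount::FLOAT
#   FROM @my_s3_stage) FILE_FORMAT = (TYPE = AVRO)
```

A CSV load needs no such SELECT transformation, since columns map positionally, which is consistent with the small edge we observed for CSV.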
Parquet and ORC are 150 to 200% slower, i.e., their load times are roughly 2.5 to 3 times that of CSV. Again, this is just a guess, but it is probably because even though Snowflake is a columnar store, its load function accesses the incoming data row by row. Since Parquet and ORC are not well suited to row-based access, loads from them are slower.
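To be precise about what "150 to 200% slower" means, a small sketch of the arithmetic, using made-up timings purely for illustration (they are not our benchmark numbers):

```python
# Hypothetical timings illustrating the "percent slower" arithmetic
# relative to the CSV baseline; all numbers are made up for illustration.

def percent_slower(t: float, baseline: float) -> float:
    """How much slower t is than baseline, as a percentage."""
    return (t / baseline - 1.0) * 100.0

csv_time = 100.0          # baseline, arbitrary units
parquet_time = 250.0      # hypothetical: 2.5x the CSV time
orc_time = 300.0          # hypothetical: 3x the CSV time

print(percent_slower(parquet_time, csv_time))  # 150.0
print(percent_slower(orc_time, csv_time))      # 200.0
```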
Before finally deciding on the file format that works best for your requirements, you might want to consider these factors as well:
Based on our experience, we recommend CSV and Avro as the preferred formats for loading data into Snowflake. Even if you plan to keep a copy of the data on object storage (S3, etc.) in ORC or Parquet format for direct querying, it is advisable to create another, temporary copy in CSV or Avro just for the Snowflake load.
We hope these findings are helpful in your efforts to migrate data from your on-premises data warehouses to Snowflake. Bitwise offers a full range of cloud migration services to efficiently move your data and applications to the cloud, including a framework that expedites data migration from legacy database applications to Snowflake by combining best-of-breed technologies from the leading cloud providers with Bitwise-built utilities that fill the gaps in cloud provider tools and services.
Contact us to learn more about how we can accelerate your migration efforts.
Pushpender has been a leading member of the Bitwise Design and Architecture Research Team (DART), responsible for driving innovations in big data architecture, cloud computing, blockchain, and microservices, and played a key role in developing Hydrograph as the project's chief architect.