Azure SQL Data Warehouse and Its Best Practices

What is Azure SQL Data Warehouse?

Massively Parallel Processing (MPP) Architecture & Its Components

Why Migrate to Azure SQL Data Warehouse?

  • A low-cost solution with scalability that enables users to pause or resume the databases within minutes.
  • A high-performance boost and the ability of globalization.
  • The ability to run up to 128 concurrent queries at one time using the Massive Parallel Processing with increased performance than you could ever get with a traditional on-prem SQL Server.
  • Smoothly create your jobs and build analytics along with any native connectivity with data integration and visualization services.
  • Guarantee of 99.9% availability with fully managed service, so you don’t need your DBAs to take care of any of this.
  • Built-in advanced security like encryption, connection security, authentication, and authorization.
  • Built-in intelligent caching that helps you accelerate access to your data and query performance.

Best Practices of Azure SQL Data Warehouse

1. Performance with Proper Distribution
  • Hash Distribution tables: A hash distributed table can deliver the highest query performance for joins and aggregations on large table columns. The hash function uses the values in the distribution column to assign each row to a distribution.
    • For example – A unique column is always preferred for hash distribution, resulting in optimum performance. However, for large tables, it is recommended to perform hash distribution. Also when you are joining two tables, the distribution of both the tables should be on the same column for best performance.
  • Round-Robin tables: A round-robin distributed table distributes data evenly across the table, but without any further optimization. A round robin table is fast for storing data, but query performance can often be improved with hash distributed tables. Joins on round-robin tables require reshuffling data which can degrade the performance of the query. In scenarios when you need to store data in tables but none of the columns are qualified for the distribution, round-robin tables method can be used.
  • Replicated tables: It is a table that is replicated caches a full copy of the table on each computes node. However, this reduces the DMS operation. Replicated tables are best utilized and only recommended for small tables. For example, a lookup table can be replicated for faster availability in joins and aggregations.
2. Selection of Data Warehouse Unit (DWU’s)
  • Gen1 data warehouses are measured in Data Warehouse Units (DWUs).
  • Gen2 data warehouses are measured in compute Data Warehouse Units (cDWUs).
  • Number of compute nodes required for the processing
  • Number of distribution per compute node
  • Memory required per data warehouse
  • Maximum concurrent queries to be executed (Consider execution sharing among resource classes like small, medium, large, etc. and small resource class.)
3. Fast Data Uploading Strategy
  • Push prepared data in Azure Blob Storage
  • Pull data into External table (polybase)
  • Move data by some transformation into staging tables
  • Merge the data of staging with the production tables
4. Reduce Cost by Writing an Optimized Query
5. Use Smaller Resource Class to Increase Concurrency & Larger for Query Performance
  • More parallelism can be achieved by using smaller resource class user in any applications. A large, time-consuming query should not be used by this resource group.
  • Certain queries like large joins, indexing, heavy loads on clustered columnstore tables, etc. will benefit from larger memory allocations which are available with the large resource group. A large resource group user should not be used in every scenario as it would lead to a queue of the other queries that will be blocked until the large resource group user query gets executed.
6. Use CTAS (Create Table As Select) to Enhance Query Performance and Joins