Best Practices

The goal of this section is to share our best practices, learning experiences and tips for implementing and integrating Redshift. Much documentation is available from AWS and other sites regarding each of these options, so we’ll just briefly share our opinions on each. If you have comments, findings or questions, please use the ‘Contact’ page to share.

Analyze

Stale table statistics cause poor query performance in every DBMS I have worked on, and Redshift is no different. Once statistics are updated, the system can accurately determine the proper query plan and memory allocation.

Therefore, run the ANALYZE command after inserting data into a table. Redshift will automatically determine whether the statistics need to be updated or can be skipped (meaning the existing statistics are still acceptable).

As part of AWS's ongoing efforts to make Redshift maintenance free, this feature is now automated.
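A minimal sketch, using a hypothetical table name:

-- refresh statistics after a load; Redshift skips the work if the existing
-- statistics are still fresh enough
analyze dw.sales_fact;

-- optionally restrict the work to the columns used in joins and filters
analyze dw.sales_fact predicate columns;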

For more developer details, reference the AWS Redshift developers guide.  https://docs.aws.amazon.com/redshift/latest/dg/r_ANALYZE.html

Column Compression

Column compression reduces the amount of data stored on disk, thus improving I/O performance dramatically. The result is exceptional query performance.

As of January 2017, Redshift supports a powerful new compression algorithm, ZSTD, which works across all data types and gives exceptional results on long VARCHARs. We generally apply this compression type as a default. This works very well, but if you are a purist, you can run the ANALYZE COMPRESSION utility to get specific recommendations based on your data characteristics. Note: ANALYZE COMPRESSION recommends ZSTD in almost all cases where LZO used to be the default compression method.

Do NOT apply compression to your sort key columns, as this will slow down your queries.
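A minimal sketch of this approach, using hypothetical table and column names:

create table dw.sales_fact (
    sale_date   date          encode raw,   -- sort key column: left uncompressed
    customer_id bigint        encode zstd,
    product_id  bigint        encode zstd,
    amount      decimal(12,2) encode zstd,
    notes       varchar(4000) encode zstd   -- zstd does well on long varchars
)
sortkey (sale_date);

-- or get data-driven recommendations instead of a blanket default
analyze compression dw.sales_fact;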

For more developer details, reference the AWS Redshift developers guide.  https://docs.aws.amazon.com/redshift/latest/dg/c_Compression_encodings.html#compression-encoding-list

Copy and Unload Commands

We have standardized our COPY and UNLOAD commands to ensure the data in S3 is in a consistent format.

We also do not use the aws_access_key_id/aws_secret_access_key credentials due to security concerns: if these keys get into the wrong hands, your data could be compromised. Instead, we always use IAM roles.

The options below take advantage of parallel processing, which is the default. We like this because the performance is spectacular.

Another useful find is MAXFILESIZE. Keeping this option relatively small means that in some cases the COPY will load more effectively when you move data between Redshift clusters that are defined with different node types, sizes, and counts. Likewise, we found an issue where unloading tables with a very large number of rows only works if MAXFILESIZE is kept small.

Using STATUPDATE ON keeps the statistics up to date, though some believe this causes more overhead than necessary. I have decided to keep it on and let Redshift determine whether the stats should be updated.

Examples of our standard COPY and UNLOAD commands are below.

copy schema.table
from 's3://…….manifest'
iam_role '…' delimiter '\325'
gzip null as '' escape manifest
ignoreblanklines maxerror 0
dateformat 'auto' timeformat 'auto'
statupdate on;

unload ('select * from schema.table')
to 's3://…….'
iam_role '…' delimiter '\325'
gzip null as '' escape manifest
maxfilesize 100 mb;

For more developer details, reference the AWS Redshift developers guide.
https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html

Dependency Errors

Error:  psql:/opt/temp/SQL.20200904.sql:140: ERROR:  cannot drop table dw.time_dim because other objects depend on it.  HINT:  Use DROP … CASCADE to drop the dependent objects too.

This most likely occurs when another object (typically a view) references the table; the error protects you from dropping the table and leaving the dependent view broken.

Options

1: Remove the Drop statement – Sounds too simple, but in many cases the drop is followed by a Create Table or CTAS statement.  If you are able to replace the Drop statement with a Truncate command, this alleviates the issue.

2: Implement Late Binding Views – Adding the clause ‘with no schema binding’ to your create view statement specifies that the view isn’t bound to the underlying database objects (i.e. tables, UDFs).  Therefore, you can create a view even if the underlying objects don’t exist.  Likewise, you can drop the underlying objects without affecting the view; Redshift only checks for the underlying objects at run time (see the sketch after these options).  https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_VIEW.html

Benefits: Changing the size of VARCHAR columns or adding new columns are non-issues.
Challenges: The view will become unusable when underlying objects or attributes become unavailable.

3: Manual Drop/Cascade – Adding the CASCADE option to the drop statement indicates that all dependent objects will also be dropped automatically.  Be careful!  Probably the least used option, but it will quickly remove all of the objects related to the underlying table.  https://docs.aws.amazon.com/redshift/latest/dg/r_DROP_TABLE.html

Benefits: Removes the underlying table and all dependent objects in one step.
Challenges: Any objects that are still required must be recreated, which means manual processing for locating, dropping and re-creating them.  Views and UDFs cannot be created until the underlying objects are reinstated.
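A minimal sketch of option 2, reusing the dw.time_dim table from the error above (the view name and columns are hypothetical):

-- a late-binding view is not bound to dw.time_dim, so the table can be
-- dropped and recreated without hitting the dependency error
create view dw.v_time_dim as
select date_key, calendar_date, fiscal_year
from dw.time_dim
with no schema binding;

-- Redshift only resolves dw.time_dim when dw.v_time_dim is queried
drop table dw.time_dim;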

Distribution Styles

The default distribution style is EVEN, and this is a great default, spreading your rows across the nodes in your Redshift cluster. The other options are ALL and KEY. We generally use ALL for small tables, which is another easy decision. Using the distribution style KEY requires more thought, and is generally used to colocate data with another table that shares the same key. This scenario generally occurs when you have two large tables with the same join key, so the rows with the same value in the ‘DISTKEY’ column end up on the same node.

The downside of a KEY distribution is row skew, caused by an uneven distribution of data across the nodes in the cluster.

Therefore, KEY is a good option if you have a specific query that needs to be optimized; otherwise, EVEN and ALL are excellent choices.
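A minimal sketch of the three styles, using hypothetical tables and columns:

-- EVEN (the default): rows are spread evenly across the slices
create table stg_events (event_id bigint, payload varchar(1000))
diststyle even;

-- ALL: a full copy on every node; an easy choice for small dimension tables
create table dim_country (country_code char(2), country_name varchar(100))
diststyle all;

-- KEY: rows with the same customer_id land on the same node, colocating the
-- join between two large tables distributed on the same column
create table fact_orders (order_id bigint, customer_id bigint, amount decimal(12,2))
diststyle key distkey (customer_id);

create table fact_payments (payment_id bigint, customer_id bigint, amount decimal(12,2))
diststyle key distkey (customer_id);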

For more developer details, reference the AWS Redshift developers guide.  https://docs.aws.amazon.com/redshift/latest/dg/t_Distributing_data.html

Large Tables

There are two types of large tables: those with lots of rows, and those with lots of columns. In both scenarios we have implemented the following:

Tables with a large number of rows can be partitioned into multiple tables and combined in a view using UNION ALL. This allows for better maintenance options on smaller tables. Likewise, if you use a partitioning scheme that allows old data to be removed with a table drop, then the maintenance requirements are minimal.
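A minimal sketch of this pattern, using hypothetical monthly tables:

-- dropping an old month becomes a cheap drop table instead of a large delete
-- plus vacuum; recreate the view whenever a partition is added or removed
create view dw.v_events as
select * from dw.events_2020_01
union all
select * from dw.events_2020_02
union all
select * from dw.events_2020_03;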

Tables with a large number of columns have their own challenges. We had defined tables with a large number of columns as part of our processing to denormalize many table joins into a single table. The benefit is that queries are very fast because the joins are done up front. The downside is table maintenance: VACUUM commands may fail because the command is restricted to a particular number of columns. Therefore, we execute deep copies to remove the deleted rows. For more information on our processing to execute a deep copy, check out the Redshift Table Data Replicator.
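A minimal sketch of one deep copy pattern, using hypothetical table names:

-- rebuild the table to leave the deleted rows behind when vacuum is not practical
create table dw.wide_table_new (like dw.wide_table);
insert into dw.wide_table_new select * from dw.wide_table;
drop table dw.wide_table;
alter table dw.wide_table_new rename to wide_table;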

For more developer details, reference the AWS Redshift developers guide. https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_usage.html

Short Query Acceleration

This seems to be a good option, and is now enabled by default on new Redshift clusters. If your cluster runs mostly user queries, this can help your workload significantly. But in our case, where the workload involves many COPY commands, we have decided to turn off SQA in order to define WLM queues with enough slots for the number of concurrent loads we process. This is due to the limitation SQA places on the number of slots you are able to define.
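As a rough sanity check on where queries are landing (a sketch; the service class numbers depend on your WLM configuration):

-- count recent queries per WLM service class; with SQA enabled, short queries
-- run in SQA's dedicated space instead of consuming user-defined slots
select service_class, count(*) as queries, avg(total_exec_time) as avg_exec_us
from stl_wlm_query
where queue_start_time > dateadd(hour, -24, getdate())
group by service_class
order by service_class;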

For more developer details, reference the AWS Redshift developers guide.  https://docs.aws.amazon.com/redshift/latest/dg/wlm-short-query-acceleration.html

Spectrum

So far we can only speak to the benefits, but we would like to publish performance metrics shortly comparing query performance for a table stored in the Redshift cluster vs. queried through Spectrum (in Parquet format). For now, here are the benefits:

Storing data in AWS S3 provides a great long-term solution for historical data, and minimizes the need to increase the size of your Redshift cluster.

Referencing data through Spectrum vs. directly from Redshift is pretty seamless from a user's perspective.
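A minimal sketch of exposing historical data through Spectrum (the schema, table, columns and role ARN are hypothetical):

-- register an external schema backed by the AWS Glue Data Catalog
create external schema spectrum_hist
from data catalog
database 'hist_db'
iam_role 'arn:aws:iam::…'
create external database if not exists;

-- define an external table over Parquet files in S3
create external table spectrum_hist.sales_archive (
    sale_date   date,
    customer_id bigint,
    amount      decimal(12,2)
)
stored as parquet
location 's3://…/sales_archive/';

-- queries look the same as against a local Redshift table
select sale_date, sum(amount) from spectrum_hist.sales_archive group by sale_date;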

For more developer details, reference the AWS Redshift developers guide.  https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html

Vacuum

When a row is deleted from a table in Amazon Redshift, the row is only logically deleted, and the disk space used by that row is not immediately recovered. The VACUUM command will both reclaim space from the rows that were flagged for deletion and sort the table based on its SORTKEY.

In most cases, it is desirable to split these operations and run the VACUUM DELETE ONLY command as required; it can be executed after every ETL operation that involves UPDATEs or DELETEs.

The VACUUM SORT operation, on the other hand, requires a significant amount of resources and time, and should be executed less frequently.
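A minimal sketch, using a hypothetical table name:

-- cheap: reclaim space from deleted rows after ETL with UPDATEs or DELETEs
vacuum delete only dw.sales_fact;

-- expensive: re-sort the table; schedule infrequently, during a quiet window
vacuum sort only dw.sales_fact;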

For more developer details, reference the AWS Redshift developers guide.  https://docs.aws.amazon.com/redshift/latest/dg/r_VACUUM_command.html

Workload Management

WLM is critical for managing query loads, and evicting bad queries. The default setup is not optimal for most workloads, so you will want to spend some time configuring changes in this utility.

The goal of WLM is to allow for the configuration of queues for concurrency and memory, and to isolate unpredictable workloads. In some cases, you can evict those queries deemed excessive. WLM allows for up to 8 queues with a total of 50 slots. Queries run in a slot as defined by WLM, and can be routed to another queue based on rules. Configuring WLM is a balancing act between the number of slots you have defined and the amount of memory your cluster provides. This can be an iterative process after reviewing your workload statistics.

Disk-based queries are terrible for a Redshift cluster, causing significant I/O overhead that impacts overall cluster performance. We typically evict any user query that goes disk-based, and we look to tune the query as required.
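A minimal sketch of how such queries can be found (the system view svl_query_summary flags disk-based steps):

-- recent steps that spilled to disk; the associated queries are candidates
-- for tuning or eviction
select query, step, rows, workmem, label
from svl_query_summary
where is_diskbased = 't'
order by query desc
limit 20;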

The other WLM configuration we typically implement is for long-running queries. These are infrequent, but if there are many, we move them to their own queue with a smaller number of slots. This keeps the primary queue for user queries more fluid, isolating the large queries.

Don’t allocate all of your memory to your defined slots. Leave approximately 10% unallocated for those queries that need additional memory.

For more developer details, reference the AWS Redshift developers guide.  https://docs.aws.amazon.com/redshift/latest/dg/cm-c-implementing-workload-management.html
