DBI_RS_Loader

Overview

The process to load data is driven by a rules table containing the details of how to load the data into Redshift.  The rules in this table allow data to be loaded in full or incrementally, with the code being generated within the process.  There is no scheduling involved in this process: a service continuously watches an S3 bucket for the arrival of new files, and once a file lands, the service initiates a background process to load the data into Redshift.  No coding is required for this processing; all of the code is dynamically generated based on the options in the rules table.

The DBI_RS_Loader process is a set of instructions that allows data to be loaded into Redshift without any coding.  The basic functions in a standard load include the following:

Table Rules

This table maintains the rules for all table loads into Redshift (and the datalake if desired).  Once a rule is defined, whenever a file is placed into the DBI_RS_Inbox, a process is executed to load the data into Redshift.  The following defines the attributes in the metadata table.

  • Activate_Flag (Y/N) – determines if the job should be actively evaluated for execution.
  • DB – The Redshift DB where the data will be processed.
  • Schema – This is the Redshift schema and/or datalake path for where the data will be loaded.
  • Table_File_Name – This is the primary key for this table.  Once a file arrives for processing, the prefix of the filename is used to look up a rule against this attribute.  If found, the details in this record are the parameters for the Redshift processing.
  • Alt_Table_Name – This is an optional field that can be used to redirect the data to load a different table name than defined in the field ‘Table_File_Name’.
  • Delim – An octal value representing the character used in the file for the delimiter.  Some common values are as follows:
    • 011 – tab
    • 054 – comma 
    • 174 – pipe 
    • 325 – a non-printable character that is seldom used elsewhere, but used extensively by DBI
  • RS_Trunc_Begin – Truncate the table at the start of the process.  This is actually implemented as a ‘delete’, which allows the process to manage the DB changes as a transaction.  Therefore, if an error occurs during the DB update, the ‘delete’ can be rolled back.
  • RS_Trunc_End – Truncate the table at the end of the process.  This is used primarily for cleaning up Redshift if you are storing the data only in a defined datalake.
  • Insert_Upsert – This option allows the process to dynamically generate the SQL logic for either an ‘insert’, or the combination of a ‘delete’/’insert’ to the target table.  If using the ‘upsert’ option, the generated ‘delete’/’insert’ SQL logic requires that a primary key be defined on the table.  Note that the primary key is not enforced by Redshift; rather, this process uses the defined primary key in the SQL logic.  Multiple attributes defined in the primary key will be concatenated together.  (A sketch of this generated logic follows this list.)
  • Datalake_Source_Flag – If the data reside ONLY on the S3 Datalake, this flag should be set to ‘Y’ to allow the data to be updated using the S3 data as the source.
  • Datalake_S3_Bucket – If a value exists in this field, logic is initiated to unload the data to S3 for integrating into Spectrum and Athena.
  • Datalake_Path – This allows for the data to be unloaded to a specified S3 bucket location.  If no path is defined, the ‘Schema’ name with the string ‘_dl’ appended is used automatically (e.g., DW_DL).
  • Datalake_Table_PK – This is a required field if ‘Insert_Upsert’ is set to ‘upsert’ and ‘Datalake_Source_Flag’ is set to ‘Y’.  This allows the data from the Datalake to be incrementally updated from the new dataset being processed.
  • Datalake_Partition – Adding a partition field to the data allows for optimized query performance.  For more details, reference the AWS document ‘Creating external tables for Amazon Redshift Spectrum’.  (A sketch of the unload and external-table pattern also follows this list.)
  • Clean_Data (Y/N/1/2/3) – This option allows files to be preprocessed in order to clean up dirty data.  This utility allows data to be loaded with less lead time, but for large files the additional processing time could be significant.  The options are defined as follows:
    • 1 – Removes embedded NULLs
    • 2 – Removes in-row carriage returns (fixing orphan rows)
    • 3 – Delimiter replacement (Redshift only allows single-character delimiters)
  • Delim_From – This field, paired with ‘Delim_To’, allows the delimiter in the file to be changed to the delimiter defined in ‘Delim_To’.  The primary use case is changing a multi-character delimiter to a single-character delimiter, which is a requirement for loading Redshift.
  • Delim_To – This is the value that the delimiter will be changed to, used in conjunction with the field ‘Delim_From’.
  • Copy_Commands – This field allows custom Redshift COPY commands to be specified.
  • Grouping_Name – This is a metadata field that can be used for reporting purposes.  This value is also carried forward in the ‘System_Log’ table.
  • Email_Notify – Allows an email update to be sent to one or more email addresses when this rule is used to update a table.  An email is sent on success or failure.
  • Alt_Load_Cluster_String – This option allows you to load data to an Alternate cluster (Full or Incremental)
  • Alt_Load_Data_Driver – tbd
  • Alt_Load_Num_Days_Back – tbd 
  • Owner – This is a metadata field which describes the name of the owner of this table.
  • Description – This is a metadata field which describes the table, its uses, and any other important details.
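
As an illustration of the kind of SQL this rule metadata can drive, below is a minimal sketch of a generated load for a pipe-delimited file.  The bucket path, IAM role, schema, table, primary key, and the use of an intermediate staging table are all assumptions for the example, not details prescribed by the loader.

    -- Hypothetical staged load for a pipe-delimited file (Delim = 174).
    COPY dw.stg_sales_detail
    FROM 's3://<bucket_name>/DBI_RS_Inbox/sales_detail.txt'
    IAM_ROLE 'arn:aws:iam::<account_id>:role/<redshift_copy_role>'
    DELIMITER '|';

    -- Target-table changes run inside one transaction so a failure can be rolled back.
    BEGIN;

    -- If RS_Trunc_Begin = 'Y', the generated statement is a full DELETE
    -- (a TRUNCATE would commit immediately and could not be rolled back):
    DELETE FROM dw.sales_detail;

    -- If Insert_Upsert = 'upsert', rows matching the declared primary key are
    -- deleted and the newly staged rows are then inserted:
    DELETE FROM dw.sales_detail
    USING dw.stg_sales_detail s
    WHERE dw.sales_detail.order_id = s.order_id;

    INSERT INTO dw.sales_detail
    SELECT * FROM dw.stg_sales_detail;

    COMMIT;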
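Likewise, when ‘Datalake_S3_Bucket’, ‘Datalake_Path’, and ‘Datalake_Partition’ are populated, the general pattern is to unload the data to S3 and expose it through a Spectrum external table.  The statements below are only a sketch of that pattern; the unload format, external schema name, partition column, and file layout are assumptions for illustration.

    -- Hypothetical unload of the refreshed table to the datalake path:
    UNLOAD ('SELECT * FROM dw.sales_detail')
    TO 's3://<datalake_bucket>/dw_dl/sales_detail/load_date=2024-01-01/part_'
    IAM_ROLE 'arn:aws:iam::<account_id>:role/<redshift_spectrum_role>'
    DELIMITER '|'
    ALLOWOVERWRITE
    GZIP;

    -- One-time external (Spectrum) table over the unloaded data; the external
    -- schema is assumed to already exist via CREATE EXTERNAL SCHEMA.
    CREATE EXTERNAL TABLE spectrum_dw_dl.sales_detail (
        order_id    varchar(64),
        order_total decimal(18,2)
    )
    PARTITIONED BY (load_date date)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION 's3://<datalake_bucket>/dw_dl/sales_detail/';

    -- Register the new partition (Datalake_Partition) after each unload:
    ALTER TABLE spectrum_dw_dl.sales_detail
    ADD IF NOT EXISTS PARTITION (load_date = '2024-01-01')
    LOCATION 's3://<datalake_bucket>/dw_dl/sales_detail/load_date=2024-01-01/';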

Dynamic Exit Calls

For custom logic at the time a file is being processed, exit calls are allowed at the start or at the end of the process.  Exit calls run on the AWS EC2 instance, but the code is sourced from the AWS S3 location s3://<bucket_name>/DBI_RS_Inbox/DBI_RS_Loader/sh_scripts/.  The filename for the logic must be configured as follows to ensure that the logic can be found dynamically.  Once you place a file that complies with the path location and filename convention, the process will dynamically execute the script as part of the load process.  The <Table_File_Name> below should match the same attribute used in the table rules.  (A hypothetical post-processing SQL example follows the file lists below.)

Shell scripts

  • <Table_File_Name>.pre.sh
  • <Table_File_Name>.post.sh

SQL

  • <Table_File_Name>.pre.sql
  • <Table_File_Name>.intermediate.sql
  • <Table_File_Name>.post.sql
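
As a hypothetical example, a <Table_File_Name>.post.sql exit script might refresh a dependent summary table after the base load completes.  The schema, table, and column names below are illustrative only and are not part of the loader itself.

    -- <Table_File_Name>.post.sql (hypothetical): rebuild a dependent summary
    -- after the base table has been loaded.
    BEGIN;

    DELETE FROM dw.sales_detail_daily;

    INSERT INTO dw.sales_detail_daily (order_date, order_count, order_total)
    SELECT order_date, COUNT(*), SUM(order_total)
    FROM dw.sales_detail
    GROUP BY order_date;

    COMMIT;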

Audit SQL

This occurs directly before the primary table is updated.  This feature allows you to execute a query in order to determine whether the process should stop before continuing.

  • <Table_File_Name>.audit_report.sql
  • <Table_File_Name>.audit_query.sql
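
A hedged example of an audit query is shown below.  It assumes the check is a simple row-count threshold against a staging table; the exact way the loader interprets the result in order to halt the process is not specified here, so this is illustrative only.

    -- <Table_File_Name>.audit_query.sql (hypothetical): verify that a reasonable
    -- number of rows were staged before the primary table is touched.
    SELECT CASE WHEN COUNT(*) < 1000 THEN 'FAIL' ELSE 'PASS' END AS audit_status,
           COUNT(*) AS staged_row_count
    FROM dw.stg_sales_detail;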

Loading Data

Service Oriented Architecture (No Scheduling Required)

Loading data in a SOA allows data to be loaded dynamically as a file is identified. This is driven by a set of rules that can be referenced at the time the load process begins execution. The primary advantage of this type of service is twofold.  First, the process does not need to be scheduled: the load is executed on the arrival of a file, with the prefix of the filename used to look up the associated load rule. Second, all of the required processing to load the data uses a single codebase.  Because the load service has a robust set of functions, it displaces much of the transformation logic required in traditional ETL. If custom logic is required, tailored SQL can be interjected at various points during the load process, including exit calls at the OS level. In general, however, I have found that all of our required and repeatable logic has been packed into the load service.

Therefore, this approach could displace thousands of jobs and job schedules, replacing them with a service driven by a rules file. The end result is a reusable codebase requiring much less system support and development.  Having seen this implementation at large, billion-dollar corporations, I can attest to the reuse, decreased support costs, and improved integration turnaround.

To solidify this concept further, think of what has occurred over the last 10 years with real-time data systems. We push and consume transaction data in what is referred to as a ‘Publish and Subscribe’ (Pub-Sub) model in order to keep transaction systems in sync, alleviating large batch processing. There is much benefit to this functionality, so why can’t we do the same in the BI space, but with large files? Likewise, most vendors working in the cloud already have similar solutions as part of their processing models, but on a much smaller scale.  Inevitably, the role of services around a platform will become more predominant as technologies mature.

An enterprise-ready service for loading data consists of a series of services to handle standard and custom processing. The execution of the process is based on the arrival of a file, with the applicable rule identified by the name of the data file. The rules file drives the functionality being reused, and thus does not require any coding. If required, however, SQL or shell scripts can be executed for specific pre- or post-processing requirements.

Loading Data Functions

Typical out-of-the-box functions that we have included are as follows.  Most of these are parameters stored as rules used when loading data into a database.

  • Allow for load files to be processed from source folders and FTP servers once they are available
  • Share files with other DB servers keeping them synchronized in parallel (DR and other non-prod systems)
  • Allow for multiple files to be concatenated into one
  • Execute DB load commands (system generated command) and allow for all DB specific parameters
  • Execute standard insert and/or delete logic against primary table (system generated commands)
  • Allow for data audit and reporting during the load process which is driven by user provided SQL.  For the reporting component, SQL is executed after the initial load of data, allowing for change data capture reporting prior to any substantial updates to the primary table.  A subsequent audit option allows for the results to be evaluated with the process being discontinued if desired.
  • Allow for system monitoring (HPOV/Tivoli), email, and text paging alerts on data load failures – including the job log, table DDL, data load error messages, and data record samples
  • Allow for SLA reporting and escalation reporting when specific data is not received or processed.

Loading Excel files into Redshift

This process allows for Excel (.xlsx) spreadsheets to be quickly consumed into Redshift.  In order to comply with the current functionality, the following rules apply:

  • Only the first tab of the spreadsheet is processed
  • This process uses the first row of the spreadsheet for the column names.  Since the table DDL is created automatically, be sure each column has a name; otherwise the column may be given a default name or be excluded.
  • The table definition created in Redshift defines each attribute as varchar(255).  This was a conscious decision to ensure the data loads successfully.
  • The Redshift table is only generated if it does not exist.  Therefore, subsequent changes to the column names or data types will persist across loads.  (A sketch of the generated DDL follows this list.)
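
For illustration, the automatically generated DDL for a hypothetical spreadsheet might resemble the following, with every column defined as varchar(255) and the table created only if it does not already exist.  The spreadsheet, column names, and the default ‘ods_excel’ schema (described under Default Load Rule below) are used here only as an example.

    -- Hypothetical DDL generated for a spreadsheet named 'Region Sales.xlsx'
    -- (default schema 'ods_excel', table name derived from the file name).
    CREATE TABLE IF NOT EXISTS ods_excel.regionsales (
        region       varchar(255),
        sales_rep    varchar(255),
        sales_amount varchar(255)
    );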

Default Load Rule

The easiest way to load data into Redshift involves only placing the file into the designated ‘DBI_RS_Inbox’.  Some characteristics of this process are as follows:

  • The default schema name is ‘ods_excel’
  • The default table name is the name of the file after removing spaces and the file extension
  • The first row is automatically removed

If you prefer more control specific to the table being loaded in Redshift, a custom rule can be created which overrides the default load rule.

Custom Load Rule

  • The table can be loaded into any schema in Redshift.
  • The data can be directed to load into any table name in Redshift.  The table will have to be created manually in the defined schema.  When you create the table, custom column names and data types can be implemented.  Although this may be desired, our recommendation is to use varchar(255) to ensure successful data loading, as there is typically no integrity on data types in Excel.
  • Specific file characteristics for the load process can be managed (e.g., header row, date and time formats, NULL handling, etc.)

 

Amazon Architecture Reference:  https://blogs.aws.amazon.com/bigdata/post/Tx24VJ6XF1JVJAA/A-Zero-Administration-Amazon-Redshift-Database-Loader
