DBI_SQL
Overview
This process allows for running SQL against a supported database. The results can be sent in an email, included as an attachment, and/or saved to a file that is placed in AWS S3. When the file is placed on AWS S3, the data often goes into the 'DBI_RS_Inbox' folder so it can be loaded into Redshift.
Benefits:
- Executes SQL against Oracle, Postgres, Informix, SQL Server, Redshift or MySQL
- Dynamic parameters – allows the code to be reused in subsequent executions with different parameter values. Parameters can be added to the command line, and the corresponding SQL can be modified to accept the variables in sequence. All variables in the code use the nomenclature '::P1::', '::P2::', and so on, with incrementing integer values.
- Query results can be emailed, attached as files, or copied to a named location in AWS S3
Usage: DBI_SQL.sh
-f <configuration file name>
-s <SQL file path>
-p <SQL parameter list>
-d <SQL parameter list delimiter>
-i <Audit flag> (Y/N)
-e <Email List>
-a <Attachment> (Y/N)
-m <AWS Destination>
-b <Datalake Path>
-c <Datalake Partition>
-r <Datalake Replace Insert Ind> (R/I)
-k <Process Key>
e.g. ./DBI_SQL.sh -f ./DBI_SQL.ini -s test.sql -m s3://dbintegrator-prod/DBI_RS_Inbox -e email@gmail.com -p 'DBI,100'
SQL file path – The file location and name of the SQL to be executed. Multiple SQL statements are allowed, separated by semicolons.
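For illustration, a SQL file could contain multiple statements such as the following (table names are hypothetical):
delete from stage_site_dim;
insert into stage_site_dim select * from site_dim_t;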
SQL Parameters
A list of parameters can be defined on the command line and substituted into the SQL at execution time. The parameter list is delimited by a comma by default, or by the delimiter specified with the -d option. An example of a parameter list is as follows:
-p 'c_site,site_dim_t,d_open > 2014-01-21'
The SQL in the file aligns with the parameter list using sequential placeholders of the format '::P1::', '::P2::', and so on. Therefore, an example of SQL using the above parameter list could be as follows:
select ::P2:: from ::P1:: where ::P3::
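With the parameter list above, the SQL would resolve to the following at execution time:
select site_dim_t from c_site where d_open > 2014-01-21
Because the substitution appears to be a straight text replacement, any quoting that a string or date literal requires should be included in the parameter value itself.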
Audit Capabilities
Data quality is probably THE most critical factor when measuring the success of the EDH. The value of the EDH is all about the data, and it has to be fresh, readily available and most importantly, it has to be right! Therefore, we have created a process that will allow us to measure data quality, stop processing before job completion (if required), and send alerts as they are identified. We can then stop feeds to downstream systems when data may be compromised, or discontinue the refresh of a denormalized table before its use. A less invasive process can allow for just alerting as required, and summarizing the results (the results of all audits are captured in our system log table).
This audit process is driven by standard SQL statements, so both users and IT members can provide them. The biggest challenge is determining the audit details, whether the audit should actually stop job processing, and what follow-up actions are required. We can start to answer these questions as we determine new data audit requirements and mature the IT support processes.
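For illustration, an audit could be a simple SQL check like the following (table and column names are hypothetical); a non-zero count would indicate a data quality issue that can trigger an alert or stop the job, depending on how the audit is configured:
select count(*) from site_dim_t where site_key is null;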
Datalake Options
Including the Datalake Path option triggers the SQL to be executed with the results stored in the AWS data lake, in the schema defined by the Datalake Path. The data will appear in Redshift or Spectrum as a table whose name is taken from the SQL file name. If you have a partition key, include that column name in the Datalake Partition option. The Datalake Replace Insert Ind controls whether the results of the SQL query replace the existing data (R) or are inserted into it (I).
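For example, an invocation using the Datalake options might look like the following (the schema path, file name, and partition column are illustrative):
./DBI_SQL.sh -f ./DBI_SQL.ini -s site_dim.sql -b datalake/sales -c order_date -r R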
Process Key
You can include a Process Key, which allows the SQL to be date driven for incremental updates. The Process Key is looked up in the MySQL table XXXXXXX. This table provides the column name used for incremental updates along with the last timestamp at which the process received updates. To use this value in your SQL, reference the variable ${LAST_PROCESS_DATETIME}.
IMPORTANT: When using this option (-k), you can have only one SQL statement in your SQL script file, the column used for incremental updates must exist in the 'select' statement, and you should have a 'where' clause similar to the one below.
select last_update_date, count(*)
from ar.ar_batches_all
where last_update_date > to_timestamp('${LAST_PROCESS_DATETIME}', 'yyyy-mm-dd hh24:mi:ss')
group by last_update_date;
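A matching invocation could then supply the Process Key on the command line (the key and file names are illustrative):
./DBI_SQL.sh -f ./DBI_SQL.ini -s ar_batches.sql -k AR_BATCHES -m s3://dbintegrator-prod/DBI_RS_Inbox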