DBI_DB_Extractor

Overview

The DBI_DB_Extractor is a rules-driven process that extracts data from a source system into a file that lands on AWS S3.  When a new table is requested, no new source code is required, only an entry in a metadata table defining the location of the source data.  The metadata table therefore contains one rule for each table being extracted.  Some high-level features are below:

  • Extracts from Oracle, SQL Server, Postgres, Informix, MySQL or Redshift
  • Single process executed via Rules table
  • Core extract SQL generated from the rule metadata
  • Allows for multiple partition extracts to be executed, synchronously or asynchronously
  • Allows for multiple jobs to be executed asynchronously with single command (using batch key)
  • Provides a governor for the max number of simultaneous extracts
  • Integrates with S3 and SNS
  • Results are written to a .gz file on AWS S3
  • Executes multiple jobs based on a ‘batch_key’, either synchronously or in the background when ‘background_flag’ = true (see the sketch after this list)
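
The batch and governor behavior can be pictured with a small sketch.  The Python example below shows one way a batch of rules could be started in sequence_no order while capping the number of simultaneous extracts; the rule fields mirror the Table Rules section, but the run_extract helper, the MAX_CONCURRENT_EXTRACTS constant and the launcher itself are hypothetical, not the tool's actual code (the real limit comes from the related .ini file).

    # Hypothetical sketch: run the active rules of one batch in sequence_no order,
    # with a governor on the maximum number of simultaneous extracts.
    from concurrent.futures import ThreadPoolExecutor

    MAX_CONCURRENT_EXTRACTS = 4  # assumed value; the tool reads this from its .ini file

    def run_extract(rule):
        """Placeholder for a single table extract driven by one metadata rule."""
        print(f"extracting {rule['schema_name']}.{rule['object_name']}")

    def run_batch(rules, batch_key, background=False):
        # Select the active rules for this batch and order them by sequence_no.
        batch = sorted(
            (r for r in rules if r["batch_key"] == batch_key and r["active_flag"]),
            key=lambda r: r["sequence_no"],
        )
        # The pool is the governor: at most MAX_CONCURRENT_EXTRACTS run at once.
        pool = ThreadPoolExecutor(max_workers=MAX_CONCURRENT_EXTRACTS)
        futures = [pool.submit(run_extract, r) for r in batch]
        if background:
            pool.shutdown(wait=False)   # fire and forget (background_flag = true)
        else:
            for f in futures:
                f.result()              # synchronous: wait for every extract
            pool.shutdown()

    rules = [
        {"batch_key": "nightly", "sequence_no": 1, "active_flag": True,
         "schema_name": "sales", "object_name": "order_header"},
        {"batch_key": "nightly", "sequence_no": 2, "active_flag": True,
         "schema_name": "sales", "object_name": "order_detail"},
    ]
    run_batch(rules, "nightly")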

Table Rules

  • Active_Flag – Determines if the job should be actively evaluated for execution
  • Batch_Key – A name that groups several records together so they can be called as a batch
  • Sequence_No – When the batch is executed, the jobs will be started in order using this sequence number.  Typically, only a limited number of extracts can run simultaneously, as defined in the related .ini file, so the extracts defined at the end of the batch may run considerably later than those at the start
  • Background_Flag – This flag defines if the extracts in the batch run in the background asynchronously (fire and forget), or synchronously
  • File_Delimiter – This defines the delimiter used once the file is generated
  • Row_Limit – This allows you to limit the number of records being extracted.  This option is useful in a non-prod environment for limiting the amount of data being extracted
  • Header_Row_Flag – This option allows you to include a header row with column names, or not.  It is a DBI standard to include the header row
  • Schema_Name – This is the schema name in the source system
  • Object_Name – This is the object that will be extracted. If the table name has a special character, include it in quotes, e.g. order’$’header
  • File_Name_Rename – If the file should be renamed, you can enter the name in this field, and the filename will include this name instead
  • Hint – For Oracle, this allows you to enter a hint which will be included in the Oracle query executed against the source system
  • Select_Column_Name – If you only want to include specific columns, you can list their names here separated by a space.  If this field is left blank, all of the columns are retrieved
  • Inc_Column_Name – If you want to pull the data since the last time you executed this process, then you can enter the DATETIME field name in this field. This works in conjunction with the next field.
  • Inc_Process_Date – If you populated the Inc_Column_Name field above, then the system will place the DATETIME into this field from the last record extracted.  Therefore, the next time the process runs, it will start from this DATETIME field value
  • Partition_Prefix – This allows your extract to be broken down into multiple extracts based on a specific field name, and works in conjunction with the next 2 fields.  This field also allows you to enter an entire custom ‘where’ clause (without the word ‘where’)
  • Operand – This can be =, >, <, >=, etc., or whatever syntax the source system allows
  • Partition1…Partition40 – These are the values with which the Partition_Prefix and Operand will be combined to generate an individual extract.  If you define 5 values in Partition1 – Partition5, then 5 extracts will be generated (see the sketch below).
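
To make the incremental and partition fields concrete, here is a short Python sketch of how the core SQL could be assembled from one rule: the Hint, Select_Column_Name, Inc_Column_Name / Inc_Process_Date and Partition_Prefix / Operand / PartitionN values each contribute a piece, and one query is produced per partition value.  The field names mirror the rules above, but the assembly logic itself is an assumption, not the tool's actual code.

    # Hypothetical sketch of core SQL generation from one metadata rule.
    # Field names mirror the rules table above; the assembly logic is assumed.
    def build_extract_queries(rule):
        if rule.get("select_column_name"):
            # Space-separated list from Select_Column_Name
            columns = ", ".join(rule["select_column_name"].split())
        else:
            columns = "*"

        hint = f"/*+ {rule['hint']} */ " if rule.get("hint") else ""  # Oracle only
        base = f"SELECT {hint}{columns} FROM {rule['schema_name']}.{rule['object_name']}"

        conditions = []
        # Incremental pull: only rows newer than the last recorded DATETIME.
        if rule.get("inc_column_name") and rule.get("inc_process_date"):
            conditions.append(f"{rule['inc_column_name']} > '{rule['inc_process_date']}'")

        # One query per defined partition value (Partition1 .. Partition40).
        partitions = [rule.get(f"partition{i}") for i in range(1, 41)]
        partitions = [p for p in partitions if p]

        queries = []
        if partitions:
            for value in partitions:
                clause = f"{rule['partition_prefix']} {rule['operand']} {value}"
                queries.append(base + " WHERE " + " AND ".join(conditions + [clause]))
        else:
            queries.append(base + (" WHERE " + " AND ".join(conditions) if conditions else ""))
        return queries

    rule = {"schema_name": "sales", "object_name": "order_header",
            "select_column_name": "order_id order_date amount",
            "inc_column_name": "last_update_dt", "inc_process_date": "2023-01-01 00:00:00",
            "partition_prefix": "region_id", "operand": "=",
            "partition1": "10", "partition2": "20"}
    for q in build_extract_queries(rule):
        print(q)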

Creating the Extracts as Data Files

Standardizing on an ETL tool does allow for a level of consistency in the development process.  But if an ETL tool is not used, as this paper proposes, then what considerations have to be addressed?

The first important concept is to generate the extracts as stand-alone data files.  This allows for integration into any type of analytical system: SAS, Hadoop, AWS S3 and various other purpose-built BI databases.  Furthermore, the same files can be easily copied to your stress and/or QA systems for parallel integration, as discussed later in this paper.

The second aspect to consider is the standard for file names and content.  This recovers much of the consistency lost when not using an ETL tool. Some of the considerations I have found to be valuable are as follows (several are illustrated in the sketch after the list):

  1. Maintain a consistent file name such as <table_name>.txt.<yyyymmdd_hhmmss>.gz
  2. Include a header row of the column names
  3. Maintain the file name with a consistent case, such as all lowercase
  4. Dates and timestamps in the file should be formatted consistently, e.g. YYYY-MM-DD and YYYY-MM-DD HH24:MI:SS.
  5. Use a delimiter that is as unique as possible.  I have found that an extended ASCII character such as Õ (hex value D5, decimal 213 or octal 325) can be an excellent choice.
  6. Dollar sign characters should be removed from numeric fields
  7. Nulls in the file should be represented as empty or as a specified character if you are planning to maintain null values.
  8. Exponential values should be represented as a numeric value only
  9. All NON PRINTABLE characters (e.g. the newline ‘\n’ and carriage return ‘\r’ characters) should be removed from within the data.
  10. All negative signs should be prefixed (i.e. negative 100 should be written as -100, not as 100-)
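
Several of these standards can be enforced at write time.  The Python sketch below is a minimal illustration assuming the rows arrive as dictionaries; the file naming, header row, Õ delimiter, empty-string nulls, timestamp format and removal of non-printable characters follow the list above, while the function and field names themselves are hypothetical.

    # Hypothetical sketch applying the file standards above: consistent name (1, 3),
    # header row (2), consistent timestamps (4), Õ delimiter (5), empty nulls (7),
    # plain numeric values (8), stripped control characters and $ signs (6, 9).
    import gzip
    import re
    from datetime import datetime

    DELIMITER = "\u00d5"                             # Õ, hex D5 / decimal 213
    NON_PRINTABLE = re.compile(r"[\x00-\x1f\x7f]")   # matches \n, \r and friends

    def clean(value):
        if value is None:
            return ""                                       # nulls as empty strings
        if isinstance(value, datetime):
            return value.strftime("%Y-%m-%d %H:%M:%S")      # YYYY-MM-DD HH24:MI:SS
        if isinstance(value, float):
            return f"{value:.10f}".rstrip("0").rstrip(".")  # no exponential notation
        # Strip control characters and dollar signs from everything else.
        return NON_PRINTABLE.sub(" ", str(value)).replace("$", "")

    def write_extract(table_name, columns, rows, out_dir="."):
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        # <table_name>.txt.<yyyymmdd_hhmmss>.gz, all lowercase
        path = f"{out_dir}/{table_name.lower()}.txt.{stamp}.gz"
        with gzip.open(path, "wt", encoding="utf-8") as f:
            f.write(DELIMITER.join(columns) + "\n")          # header row
            for row in rows:
                f.write(DELIMITER.join(clean(row.get(c)) for c in columns) + "\n")
        return path

    rows = [{"order_id": 1, "order_date": datetime(2023, 1, 2, 10, 30),
             "amount": -100.25, "note": "multi\nline text"}]
    print(write_extract("ORDER_HEADER", ["order_id", "order_date", "amount", "note"], rows))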