DBI_DB_Extractor

Overview

The DBI_DB_Extractor is a rules-driven process that extracts data from a source system into a file that lands on AWS S3.  When a new table is requested, no new source code is required, only an entry in a metadata table defining the location of the source data.  The metadata table therefore contains one rule for each table being extracted.  Some high-level features are below:

  • Extracts from Oracle, SQL Server, Postgres, Informix, MySQL or Redshift
  • Single process executed via Rules table
  • Core extract SQL generated from the rule metadata
  • Allows for multiple partition extracts to be executed, synchronously or asynchronously
  • Allows for multiple jobs to be executed asynchronously with single command (using batch key)
  • Provides a governor for the max number of simultaneous extracts
  • Integrates with S3 and SNS
  • Results are written to a .gz file on AWS S3
  • Executes multiple jobs based on a ‘batch_key’, either synchronously or in the background when ‘background_flag’ = true (see the sketch after this list)
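
The batch and governor behavior can be pictured with a small sketch.  The Python example below shows one way a batch of rules could be started in sequence_no order while capping the number of simultaneous extracts; the rule fields mirror the Table Rules section, but the run_extract helper, the MAX_CONCURRENT_EXTRACTS constant and the launcher itself are hypothetical, not the tool's actual code (the real limit comes from the related .ini file).

    # Hypothetical sketch: run the active rules of one batch in sequence_no order,
    # with a governor on the maximum number of simultaneous extracts.
    from concurrent.futures import ThreadPoolExecutor

    MAX_CONCURRENT_EXTRACTS = 4  # assumed value; the tool reads this from its .ini file

    def run_extract(rule):
        """Placeholder for a single table extract driven by one metadata rule."""
        print(f"extracting {rule['schema_name']}.{rule['object_name']}")

    def run_batch(rules, batch_key, background=False):
        # Select the active rules for this batch and order them by sequence_no.
        batch = sorted(
            (r for r in rules if r["batch_key"] == batch_key and r["active_flag"]),
            key=lambda r: r["sequence_no"],
        )
        # The pool is the governor: at most MAX_CONCURRENT_EXTRACTS run at once.
        pool = ThreadPoolExecutor(max_workers=MAX_CONCURRENT_EXTRACTS)
        futures = [pool.submit(run_extract, r) for r in batch]
        if background:
            pool.shutdown(wait=False)   # fire and forget (background_flag = true)
        else:
            for f in futures:
                f.result()              # synchronous: wait for every extract
            pool.shutdown()

    rules = [
        {"batch_key": "nightly", "sequence_no": 1, "active_flag": True,
         "schema_name": "sales", "object_name": "order_header"},
        {"batch_key": "nightly", "sequence_no": 2, "active_flag": True,
         "schema_name": "sales", "object_name": "order_detail"},
    ]
    run_batch(rules, "nightly")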

Table Rules

  • Active_Flag – Determines if the job should be actively evaluated for execution
  • Batch_Key – A name that groups several records together so they can be called as a batch
  • Sequence_No – When the batch is executed, the jobs will be started in order using this sequence number.  Typically, only a limited number of extracts can run simultaneously, as defined in the related .ini file, so the extracts defined at the end of the batch may run considerably later than those at the start
  • Background_Flag – This flag defines if the extracts in the batch run in the background asynchronously (fire and forget), or synchronously
  • File_Delimiter – This defines the delimiter used once the file is generated
  • Row_Limit – This allows you to limit the number of records being extracted.  This option is useful in a non-prod environment for limiting the amount of data being extracted
  • Header_Row_Flag – This option allows you to include a header row with column names, or not.  It is a DBI standard to include the header row
  • Schema_Name – This is the schema name in the source system
  • Object_Name – This is the object that will be extracted. If the table name has a special character, include it in quotes, e.g. order’$’header
  • File_Name_Rename – If the file should be renamed, you can enter the name in this field, and the filename will include this name instead
  • Hint – For Oracle, this allows you to enter a hint which will be included in the Oracle query executed against the source system
  • Select_Column_Name – If you only want to include specific columns, you can list their names here separated by a space.  If this field is left blank, all of the columns are retrieved
  • Inc_Column_Name – If you want to pull the data since the last time you executed this process, then you can enter the DATETIME field name in this field. This works in conjunction with the next field.
  • Inc_Process_Date – If you populated the Inc_Column_Name field above, then the system will place the DATETIME into this field from the last record extracted.  Therefore, the next time the process runs, it will start from this DATETIME field value
  • Partition_Prefix – This allows your extract to be broken down into multiple extracts based on a specific field name, and works in conjunction with the next 2 fields.  This field also allows you to enter an entire custom ‘where’ clause (without the word ‘where’)
  • Operand – This can be =, >, <, >=, etc., or whatever syntax the source system allows
  • Partition1…Partition40 – These are the values with which the Partition_Prefix and Operand will be combined to generate an individual extract.  If you define 5 values in Partition1 – Partition5, then 5 extracts will be generated (see the sketch below).
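
To make the incremental and partition fields concrete, here is a short Python sketch of how the core SQL could be assembled from one rule: the Hint, Select_Column_Name, Inc_Column_Name / Inc_Process_Date and Partition_Prefix / Operand / PartitionN values each contribute a piece, and one query is produced per partition value.  The field names mirror the rules above, but the assembly logic itself is an assumption, not the tool's actual code.

    # Hypothetical sketch of core SQL generation from one metadata rule.
    # Field names mirror the rules table above; the assembly logic is assumed.
    def build_extract_queries(rule):
        if rule.get("select_column_name"):
            # Space-separated list from Select_Column_Name
            columns = ", ".join(rule["select_column_name"].split())
        else:
            columns = "*"

        hint = f"/*+ {rule['hint']} */ " if rule.get("hint") else ""  # Oracle only
        base = f"SELECT {hint}{columns} FROM {rule['schema_name']}.{rule['object_name']}"

        conditions = []
        # Incremental pull: only rows newer than the last recorded DATETIME.
        if rule.get("inc_column_name") and rule.get("inc_process_date"):
            conditions.append(f"{rule['inc_column_name']} > '{rule['inc_process_date']}'")

        # One query per defined partition value (Partition1 .. Partition40).
        partitions = [rule.get(f"partition{i}") for i in range(1, 41)]
        partitions = [p for p in partitions if p]

        queries = []
        if partitions:
            for value in partitions:
                clause = f"{rule['partition_prefix']} {rule['operand']} {value}"
                queries.append(base + " WHERE " + " AND ".join(conditions + [clause]))
        else:
            queries.append(base + (" WHERE " + " AND ".join(conditions) if conditions else ""))
        return queries

    rule = {"schema_name": "sales", "object_name": "order_header",
            "select_column_name": "order_id order_date amount",
            "inc_column_name": "last_update_dt", "inc_process_date": "2023-01-01 00:00:00",
            "partition_prefix": "region_id", "operand": "=",
            "partition1": "10", "partition2": "20"}
    for q in build_extract_queries(rule):
        print(q)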

Creating the Extracts as Data Files

Standardizing on an ETL tool does allow for a level of consistency in the development process.  But if an ETL tool is not used, as this paper proposes, then what considerations have to be addressed?

The first important concept is to generate the extracts as stand-alone data files.  This allows for integration into any type of analytical system: SAS, Hadoop, AWS S3 and various other purpose-built BI databases.  Furthermore, the same files can be easily copied to your stress and/or QA systems for parallel integration, as discussed later in this paper.

The second aspect to consider is the standard for file names and content.  This recovers much of the consistency lost when not using an ETL tool. Some of the considerations I have found to be valuable are as follows (several are illustrated in the sketch after the list):

  1. Maintain a consistent file name such as <table_name>.txt.<yyyymmdd_hhmmss>.gz
  2. Include a header row of the column names
  3. Maintain the file name with a consistent case, such as all lowercase
  4. Dates and timestamps in the file should be formatted consistently, e.g. YYYY-MM-DD and YYYY-MM-DD HH24:MI:SS.
  5. Use a delimiter that is as unique as possible.  I have found that an extended ASCII character such as Õ (hex value D5, decimal 213 or octal 325) can be an excellent choice.
  6. Dollar sign characters should be removed from numeric fields
  7. Nulls in the file should be represented as empty or as a specified character if you are planning to maintain null values.
  8. Exponential values should be represented as a numeric value only
  9. All NON PRINTABLE characters (e.g. the newline ‘\n’ and carriage return ‘\r’ characters) should be removed from within the data.
  10. All negative signs should be prefixed (i.e. negative 100 should be written as -100, not as 100-)
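
Several of these standards can be enforced at write time.  The Python sketch below is a minimal illustration assuming the rows arrive as dictionaries; the file naming, header row, Õ delimiter, empty-string nulls, timestamp format and removal of non-printable characters follow the list above, while the function and field names themselves are hypothetical.

    # Hypothetical sketch applying the file standards above: consistent name (1, 3),
    # header row (2), consistent timestamps (4), Õ delimiter (5), empty nulls (7),
    # plain numeric values (8), stripped control characters and $ signs (6, 9).
    import gzip
    import re
    from datetime import datetime

    DELIMITER = "\u00d5"                             # Õ, hex D5 / decimal 213
    NON_PRINTABLE = re.compile(r"[\x00-\x1f\x7f]")   # matches \n, \r and friends

    def clean(value):
        if value is None:
            return ""                                       # nulls as empty strings
        if isinstance(value, datetime):
            return value.strftime("%Y-%m-%d %H:%M:%S")      # YYYY-MM-DD HH24:MI:SS
        if isinstance(value, float):
            return f"{value:.10f}".rstrip("0").rstrip(".")  # no exponential notation
        # Strip control characters and dollar signs from everything else.
        return NON_PRINTABLE.sub(" ", str(value)).replace("$", "")

    def write_extract(table_name, columns, rows, out_dir="."):
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        # <table_name>.txt.<yyyymmdd_hhmmss>.gz, all lowercase
        path = f"{out_dir}/{table_name.lower()}.txt.{stamp}.gz"
        with gzip.open(path, "wt", encoding="utf-8") as f:
            f.write(DELIMITER.join(columns) + "\n")          # header row
            for row in rows:
                f.write(DELIMITER.join(clean(row.get(c)) for c in columns) + "\n")
        return path

    rows = [{"order_id": 1, "order_date": datetime(2023, 1, 2, 10, 30),
             "amount": -100.25, "note": "multi\nline text"}]
    print(write_extract("ORDER_HEADER", ["order_id", "order_date", "amount", "note"], rows))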