By Sam Wuertz, Senior Technologist, and Sudhir Karusala, Business Intelligence Technologist

OVERVIEW

SAP Business Objects Data Services 4 is a powerful ETL platform. It enables rapid development of complex ETL, and if designed properly, facilitates tremendous throughput of high data volumes. However, proper design requires know-how, and the following may help avoid some hard learned lessons around ETL performance. This article will focus on the “Full SQL Push Down” method, and how that can help you quickly process hundreds of millions of records.

STRATEGY

ETL performance is all about efficiency, and enabling ETL and database engines to process quickly by doing fewer costly operations. The design of a very efficient data flow is not trivial, and requires significant knowledge and understanding of the SAP Data Services engine as well as how it interacts with the database engine. When processing millions of records, the most costly operation might be the act of moving all of that data from the database server to the ETL server and back again. A database engine can also use indexing to vastly improve operations on large datasets. The SQL Push Down method capitalizes on both of these principals.

SQL Push Down is best utilized under the following conditions:

  • The database hardware is sufficiently similar to, or better than, the ETL hardware.
  • The data flow in question is processing high volume (millions of records).
  • The data flow in question does not require many sub queries. Data Services can’t generate SQL statements with multiple levels of sub queries. The Data Transfer transform can be used as a work-around, but using it may be too costly.
  • The high volume data must share the same source and target database instance. Data Transfer can accommodate smaller datasets from other instances, but it won’t work well enough for a large dataset.
  • As a general guideline, most of the basic ETL transforms can be pushed down. However, Advanced and Data Quality transforms must take place in the Data Services engine.

THE DATA SERVICES ENGINE AND SQL PUSH DOWN

Data Services can’t be forced to push down SQL operations to the database. The Data Services engine “wants” to push down as much SQL as possible, but there are limitations that prevent this outcome. When that happens, the engine will not tell the developer why, but instead will decide to do more of the processing. A skilled developer can see what is causing this problem by reviewing the SQL statements that are, and are not, being pushed down to the database.

Consider the loading of a large fact table in a data warehouse. This is a typical scenario where full push down may be advantageous because of high volume, but challenging because of numerous joins and aggregations.

STEP 1: SINGLE SELECT STATEMENT

The first goal is to combine all of the SQL Select statements into a single statement. This is done by reviewing the generated SQL for the dataflow, and identifying the dataflow sections that are not being combined into the larger query. If one of the statements isn’t being combined, there’s always a technical reason for that. That section of the dataflow should be reviewed with the following in mind:

  1. Data Services can’t generate sub queries. As a result, any distinct, aggregation and ordering operations must be carefully implemented in the right place. These operations should only occur once unless a Data Transfer transformation is utilized. There are some exceptions if these operations are implemented sequentially in a transform.
  2. Embedded dataflows *can* be pushed down, but analyzing a dataflow that contains them is more complex.
  3. Many built-in functions, like decode, will push down properly. Custom functions will not push down.
  4. Built-in transforms like Conditional, Merge, Table Comparison, Pivot, UnPivot, History Preservation, and Hierarchy Flattening will not push down.
  5. Data Quality transforms won’t push down.

STEP 2: FULL PUSH DOWN

Now there is a single Select query for the dataflow in question. Note that performance of this dataflow has most likely already improved, since the join operations have been pushed down and only a single query has to be processed by the Data Services engine. However, any performance gains are contingent on hardware and database configurations, and may vary greatly.

The next step is turn that into an “INSERT…SELECT” statement (if it isn’t already). This is the Full Push Down that will achieve the greatest performance gain, since the Data Services engine will not have to process any data. In addition to the list above in Step 1 (which still applies), consider the following:

  1. There can be only one target table, since SQL only allows one target table per statement.
  2. Do not use bulk loading options.
  3. Set Lookup functions to NO_CACHE. Note however, that if full pushdown is not achieved, this should be set to something else because it is very inefficient if the Data Services engine has to process any data.

CONCLUSION

A Full SQL Push Down will usually result in drastic performance gains for high volume data movement and transformation. Even a Partial SQL Push Down (Single Select) can achieve respectable performance gains. Achieving these can be a challenge, since significant knowledge about both the Database and Data Services engines are necessary, but the guidelines above should help bridge that gap.