A Modern Data Warehouse Architecture: Part 2 – Make it a Logical Data Warehouse

In the first post of this series (here) I suggested that Hadoop could be effectively used as the platform for staging. Figure 2 describes the result. In this post I will extend the architecture by adding a data federation feature and turning the entire picture into a logical data warehouse… and then we’ll consider the consequences.

Figure 2. Add a Data Lake

add-a-data-lake1

 

 

 

 

 

 

 

 

Figure 3. Add a Federation Layer

add-ldw

 

 

 

 

 

 

 

 

Figure 3 shows this extension.. but lets quickly review what a federation fabric provides.

First, the fabric allows any of the tables and files in the picture to be registered as “virtual” tables. This includes tables in EDWs, and any marts, in one or more Hadoop systems, the sandbox, and even in the source systems. From here the fabric is viewed by any programs in the BI, Analytics, Apps layer as a single relational database composed of all of the registered tables.This fabric would consume ODBC and JDBC queries at a minimum and provide some level of data translation, function translation, and query translation to allow all of the virtual tables to be queried through the single SQL dialect offer by the fabric. Finally, the fabric would provide some measure of optimization to reduce the overhead of accessing these distributed systems. It is this optimization that is the main topic of the series I suggested above.

One of the important implications of th… one that is often overlooked before implementation… is that the queries emitted from the fabric add workload to the underlying databases. If, in Figure 3, the top EDW is 100% busy servicing un-federated queries then adding a workload that joins that data to the 2nd EDW’s data will overburden the system. This is why it is not usually sound design to integrate source, OLTP systems into a fabric. The OLTP systems are not likely to be optimized in any way for the resulting workload.

But the upside of the fabric is significant. Consider:

  • If we just said “yes” and loaded ungoverned data into a sandbox the business users can immediately access that data and join it to dimensions and facts deployed elsewhere in the enterprise.
  • Rogue data marts can now be integrated back into the fold.
  • Redundant data deployed to allow joins within a single database instance can be eliminated and the joins can be federated. Note that federation is no silver bullet… there may be performance reasons for co-locating data… but you now can consider the trade-offs (I’ll post later on a way to federate and improve performance).
  • Data bases products can be retired without affecting the programs that access them. If you have an old data mart built on a product that you would rather not support or license… you can move the data and re-point the virtual tables without impacting the tools and applications at all.
  • Data can be relocated based on economics and/or for performance reasons… this will be the subject of the next blog in this series� but as a teaser, remember the economics… Hadoop costs $1K/TB (ok… $1K-$4K) hardware included and commercial databases cost much more.

There is one final advantage to this that is strategic and important enough to deserve to not be buried in the previous bullet list… All of us have seen the database product market move in the last 6+ years from a place where the choices were DB2, Oracle, SQL Server, or Teradata… to a market with those products plus Exadata, plus two flavors of SQL Server, plus Netezza, plus Greenplum, plus Vertica… and then plus Impala, plus Hive, plus Tez, plus Spark, plus MongoDB, plus Cassandra, plus plus plus. I think that it is impossible to place database bets today with the confidence that these bets will pay off five years from now… and certainly there are no bets good for ten years. If you are betting on infrastructure to support the Internet of Things the bets are more risky still. In my opinion a federation layer provides critical insulation from this chaos. With federation you can build applications knowing that you can acquire and retire database products and not affect the queries. IMO this insulation is a strategic imperative…

Sorry… I got a little excited… I don’t usually spew foam. Let me try again. IMO you should seriously consider the benefits of federation in both your EDW architecture and in your enterprise data architecture.