In this series I will present a perspective on how the classic EDW architecture is changing under the influence of new technologies, new requirements, and new economics.
Part 1 will show the first of three architectural changes by introducing a data lake into the picture… in Part 2 we’ll extend the picture by adding a logical EDW layer… and in Part 3 we’ll consider the implications of Hadoop as an EDW Annex.
Figure 1. A Classic EDW
Figure 1 describes a classic EDW architecture with sources feeding a staging area, transformations feeding cleaned and certified data to the EDW, and data consumed by analytic applications. In this picture we assume that data marts are part of the EDW fabric and that it is the responsibility of the applications to know which database to query. In most enterprises there are rogue data marts with ungoverned data and these are an issue… but more on that later.
Note that Figure 1 also represents an optional sandbox area where uncertified and ungoverned data can be landed. This helps avoid paying the cost of data governance upfront… before data has proven it has value. A sandbox provides support for rogue data marts and cubes. Obviously use of uncertified data introduces issues… but many believe, as I do, that the trade-off of agility over governance can be productive under the right circumstances.
Figure 2. An EDW with a Data Lake
Figure 2 represents a step towards a modern EDW architecture with a Hadoop-based data lake replacing the staging area as well as providing support for a sandbox. A data lake provides all of the same capabilities as the staging area it replaced… it may be a landing zone for untransformed data… but it has several other important characteristics:
- A data lake can hold raw data forever rather than store it temporarily.
- A data lake has compute included so it can execute transformations and become a single platform for staging and ETL.
- A data lake has compute and tools included so it can be used to analyze raw data for trends and anomalies.
- A data lake can easily store semi-structured and unstructured data.
- A data lake can store big data.
Of course a staging area based on an RDBMS could do the same… but the economics are completely different. A Hadoop system provides storage and processing for as little as $1K/TB; the cost of EDW hardware and software ranges from $15K/TB to $50K/TB… 15X to 50X more. In addition, the data lake provides a cost-effective, extensible platform for building more sandboxes. If, by using Hadoop, you lower the upfront cost of a sandbox to $1K/TB then, when combined with the cost reduction from postponing the development of data quality and governance processes, you may find yourself in a position where you can just say “yes” to requests to add new data. At $1K/TB it may be unnecessary to force the business to build a business case and perform the difficult intellectual challenge of developing an ROI argument for data that has never been available and therefore has an unproven value proposition. Later, once the value is proven, you might move it to a governed state and to a governed platform.
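To make the arithmetic concrete, here is a minimal sketch of the comparison using the per-terabyte figures quoted above. The 20 TB sandbox size and the function names are my own assumptions for illustration… plug in your own numbers.

```python
# Per-terabyte costs quoted in the text (USD).
HADOOP_COST_PER_TB = 1_000
EDW_COST_PER_TB_LOW = 15_000
EDW_COST_PER_TB_HIGH = 50_000


def sandbox_cost(terabytes, cost_per_tb):
    """Upfront platform cost of a sandbox of the given size."""
    return terabytes * cost_per_tb


def cost_ratio(edw_cost_per_tb, hadoop_cost_per_tb=HADOOP_COST_PER_TB):
    """How many times more expensive the EDW is per terabyte."""
    return edw_cost_per_tb / hadoop_cost_per_tb


if __name__ == "__main__":
    size_tb = 20  # hypothetical sandbox size, not a figure from the article

    hadoop = sandbox_cost(size_tb, HADOOP_COST_PER_TB)
    edw_low = sandbox_cost(size_tb, EDW_COST_PER_TB_LOW)
    edw_high = sandbox_cost(size_tb, EDW_COST_PER_TB_HIGH)

    print(f"Hadoop sandbox ({size_tb} TB): ${hadoop:,}")
    print(f"EDW sandbox    ({size_tb} TB): ${edw_low:,} to ${edw_high:,}")
    print(f"Ratio: {cost_ratio(EDW_COST_PER_TB_LOW):.0f}X to "
          f"{cost_ratio(EDW_COST_PER_TB_HIGH):.0f}X")
```

This only covers the platform cost per terabyte… it leaves out the savings from postponing data quality and governance work, which would widen the gap further.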
Hadoop as an EDW staging area is not a new concept (see here)… the ETL vendors are supporting Hadoop as an ETL engine already. But a Hadoop staging approach starts to solve one of the nagging problems with the classic architecture: the lack of agility. Just saying “yes” to new data and corralling rogue marts provides a foundation to experiment and evolve while also providing the means to leverage successful experiments across the Enterprise.
There is more required to satisfy sandbox users… on to that in Part 2 – Logical Data Warehousing here.
- Wiktionary definition of a Data Lake here.
- Informatica Big Data Edition here.
- Curt Monash on IBM ETL here.