"If you can't explain it simply, you don't understand it well enough." ~ Albert Einstein

AWS Analytics – Use Cases

AWS Lake House

1. Lake House
• About integrating a data lake, a data warehouse, and purpose-built stores, enabling unified governance and easy data movement

AWS Data Lake

2. Data Lake
• centralized repository that allows you to store all your structured and unstructured data at any scale
• store data as-is, without having to first structure the data & run different types of analytics—from dashboards & visualizations to big data processing, real-time analytics & ML to guide better decisions.
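The "store as-is, apply structure later" idea (schema-on-read) can be sketched in a few lines of plain Python — the records and fields below are purely illustrative, not tied to any AWS API:

```python
import json

# Raw events land in the lake exactly as produced (schema-on-read):
# no upfront modelling, and each record may carry different fields.
raw_events = [
    '{"user": "alice", "action": "login", "ts": 1}',
    '{"user": "bob", "action": "purchase", "amount": 42.0, "ts": 2}',
    '{"user": "alice", "action": "purchase", "amount": 13.5, "ts": 3}',
]

# Structure is applied only at query time, per analysis:
events = [json.loads(e) for e in raw_events]
revenue = sum(e.get("amount", 0.0) for e in events if e.get("action") == "purchase")
print(revenue)  # -> 55.5
```

The same raw records can later serve a completely different analysis (e.g. login counts) without any re-ingestion, which is the core advantage over load-time schemas.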

AWS Data Warehouse

3. Data Warehousing
• Run SQL and complex, analytic queries against structured and unstructured data in your data warehouse and data lake, without the need for unnecessary data movement.
try Amazon Redshift

AWS Big Data

4. Big Data Processing
• Quickly and easily process vast amounts of data in your data lake or on-premises for data engineering, data science development, and collaboration.
try Amazon EMR

AWS Real-time Analytics

5. Real-time Analytics
• Collect, process, and analyze streaming data, and load data streams directly into your data lakes, data stores, and analytics services so you can respond in real time.
try Amazon MSK
try Amazon Kinesis

AWS Data Visualization

6. Data Visualization (Operational Analytics)
• Search, explore, filter, aggregate, and visualize your data in near real time for application monitoring, log analytics, and clickstream analytics.
try Amazon Elasticsearch Service

AWS Analytics – Services

Interactive Analytics
Big Data Processing
  Amazon EMR
Data Warehousing
Real-time Analytics
Operational Analytics
Dashboards & Visualizations
Visual Data Preparation
Object Storage
  Amazon S3
Backup & Archive
  AWS Backup
Data Catalog
  AWS Glue
Third-party Data
Frameworks & interfaces
Platform Services

Analytics – Ecosystem

2020 Data Landscape

Design Patterns for Cloud Analytics

1. Introduction

With the growing number of new technologies coming to market every year, it becomes very difficult for engineers and their leadership to choose the right combination of elements to get the "right solution" in place. In this article, I provide architectural patterns for a cloud-centric analytics platform, their pros and cons, and when each should be used.

1.1. Evaluation Criteria / Architectural Concerns

  • Concern 1: Availability — the desired uptime, depending on criticality and the use case.
  • Concern 2: Performance — how quickly the system responds to user activities or events under different workloads.
  • Concern 3: Reliability — how reliable the system needs to be under every type of failure, e.g. a broken disk, a node down, a data centre down.
  • Concern 4: Recoverability — how, and how quickly, the system recovers from failures. Some recoveries are automated, as in HDFS or S3; others, like node failure, need to be considered in advance.
  • Concern 5: Cost — how much money we are willing to spend to bring the solution up (infrastructure, development) and maintain it later (operations).
  • Concern 6: Scalability — how scalable the solution needs to be, i.e. peak-hour traffic, changing trends, growth over the next few years.
  • Concern 7: Manageability — how to ensure compliance, privacy and security requirements are met.
  • Enterprise Non-Real Time — used for consolidation of several regionally distributed Entry-Level data sets and/or 1–100 Bn. records with annual growth of up to 20%.
  • Enterprise Real-Time — used for real-time analytics and/or consolidation of several regionally distributed Entry-Level data sets with more than 100 Bn. records and high growth factors.

2. Entry-Level Solution

For small use cases (< 1 Bn. records), most of the transformations and dimensional storage can be kept within the BI tool itself. Modern BI solutions (Qlik, Tableau) come with in-memory storage directly linked to the self-discovery and dashboarding UI. However, they create a heavy query load on the source transactional databases when dynamically refreshing the dimensional models, which is why it is highly recommended to create a CDC (change data capture) copy of the original relational tables rather than linking directly into the transactional DBs. That is also advisable from a security perspective, based on the decoupling principle.
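The mechanics of maintaining such a CDC copy can be sketched in plain Python: a stream of change events (insert/update/delete) is applied to a replica that the BI tool queries instead of the source DB. The event shape and table content below are assumptions for illustration, not the format of any specific CDC product.

```python
# The decoupled copy the BI tool reads; keyed by primary key.
replica = {}

def apply_change(event):
    """Apply one change-data-capture event (insert/update/delete) to the replica."""
    op, key, row = event["op"], event["pk"], event.get("row")
    if op in ("insert", "update"):
        replica[key] = row
    elif op == "delete":
        replica.pop(key, None)

# Simulated CDC stream emitted by the source transactional DB:
cdc_stream = [
    {"op": "insert", "pk": 1, "row": {"customer": "ACME", "balance": 100}},
    {"op": "update", "pk": 1, "row": {"customer": "ACME", "balance": 80}},
    {"op": "insert", "pk": 2, "row": {"customer": "Globex", "balance": 50}},
    {"op": "delete", "pk": 2},
]
for event in cdc_stream:
    apply_change(event)

print(replica)  # -> {1: {'customer': 'ACME', 'balance': 80}}
```

The BI tool's refresh queries then hit only the replica, so the transactional DB sees the change stream once rather than repeated heavy scans.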

2.1. Entry Level Conceptual Architecture

Cloud Analytics — Entry Level Solution Conceptual Architecture
  • Store transitional data (consolidation, curation, enrichment);
  • Process extraction, transformation and loading of data from/to every storage layer;
  • Provide a self-discovery, dashboards and data-wrangling UI;
  • Efficiently serve the data into that UI.

2.2. Technology Architecture

First, let’s fit the selected technology stack into the conceptual model to get a better feel for the solution. In this scenario, the source system is SAP ERP, AWS is the cloud provider, and Tableau is the BI tool of choice.

Cloud Analytics — Entry Level Technology Architecture

2.3. Evaluation of the solution

Concern 1: Availability: AWS RDS supports HA through Multi-AZ deployments; AWS DMS needs to be configured for HA at the virtualization level; Tableau HA and load-balancing configuration on AWS is well documented, see [1] and [2]. HA of the on-prem-to-AWS channel is usually provided by the Direct Connect provider or achieved by setting up a separate backup VPN channel over the Internet.

3. Enterprise Level Non-Real Time Solution

For non-real-time use cases (distributed data sets with up to 100 Bn. records and a moderate growth rate), I would use a different approach, replacing RDS with cheaper intermediary storage on S3 and introducing a proper data integration toolset. Let’s look at how the conceptual architecture has to change in response to increased data volumes.

3.1. Conceptual Architecture

With larger data sets we have to introduce two new technical capabilities into our design. The first is the analytic database (e.g. Redshift, Snowflake), which is specifically designed to serve dimensional models and enable data wrangling and self-discovery over large data sets. Data in dimensional models is usually denormalized and, unlike in transactional DBs, not compliant with 3NF or 5NF. That denormalization speeds up data retrieval by reducing table joins.
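The trade-off can be made concrete with a tiny star-schema sketch in Python: dimensions are joined into the fact table once at load time, so every subsequent query scans a single wide table with no joins. Table and column names here are illustrative assumptions.

```python
# Normalized form: a fact table referencing two dimension tables.
dim_product = {10: {"product": "Widget", "category": "Tools"}}
dim_region = {"EU": {"region_name": "Europe"}}
fact_sales = [
    {"product_id": 10, "region_id": "EU", "amount": 250},
    {"product_id": 10, "region_id": "EU", "amount": 100},
]

# Denormalization: a one-off join at load time produces the wide table.
wide_sales = [
    {**row, **dim_product[row["product_id"]], **dim_region[row["region_id"]]}
    for row in fact_sales
]

# Queries now scan a single table; no joins at query time.
tools_revenue = sum(r["amount"] for r in wide_sales if r["category"] == "Tools")
print(tools_revenue)  # -> 350
```

The cost is duplicated dimension attributes on every fact row, which analytic databases offset with columnar storage and compression.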

Cloud Analytics — Enterprise Level Non-Real Time Solution Conceptual Architecture

3.2. Technology Architecture

Attunity (Qlik) has developed a very niche product that has gained popularity over recent years. It is a set of tools to move (Attunity Replicate) [4] and transform (Attunity Compose) [5] the data. These tools stand out from the rest thanks to their intuitive interface, simplicity and wide range of ready-made connectors.

Cloud Analytics — Enterprise Level Non-Real Time Solution Technology Architecture
  • take only what’s needed and leave unnecessary details behind;
  • consolidate and enrich data sets;
  • create specialized data presentations (materialized views) for different roles of users (the basic design principle of “need to know”).
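The "need to know" presentations in the last bullet can be sketched as role-specific projections of one curated data set, where each role sees only the columns it needs. The roles, columns and records below are assumptions for illustration:

```python
# Curated data set containing columns of different sensitivity.
orders = [
    {"order_id": 1, "customer": "ACME", "amount": 120, "card_last4": "4242"},
    {"order_id": 2, "customer": "Globex", "amount": 80, "card_last4": "0005"},
]

# Which columns each role is allowed to see ("need to know").
role_columns = {
    "analyst": {"order_id", "amount"},                      # no customer identity
    "account_manager": {"order_id", "customer", "amount"},  # no card data
}

def materialize(rows, role):
    """Produce the role-specific presentation of the data set."""
    cols = role_columns[role]
    return [{k: v for k, v in row.items() if k in cols} for row in rows]

analyst_view = materialize(orders, "analyst")
print(analyst_view[0])  # -> {'order_id': 1, 'amount': 120}
```

In the real architecture these projections would be materialized views in the analytic database, refreshed by the transformation layer rather than computed per query.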

3.3. Evaluation of the solution

Concern 1: Availability: AWS S3 is highly redundant, designed to provide 99.999999999% durability and 99.99% availability of objects. EMR is HA by design. Attunity Replicate needs to be configured for HA at the virtualization level on the source-system side. Attunity Compose supports an HA configuration with primary and secondary node installations [4]. Tableau and the other components are similar to the Entry-Level use case.

4. Enterprise Level Real-Time Solution

This option should be considered when you either have a large number of source systems or are building real-time analytics. This pattern sits at the centre of every real-time analytics solution, repeating what I described in one of my previous posts — Real-Time Security Data Lake [9].

4.1. Conceptual Architecture

The conceptual architecture for the downstream remains largely the same; the upstream, however, changes to add streaming capability for real-time data pipelines.

Real-Time Cloud Analytics — Conceptual Architecture

4.2. Technology Architecture

A real-time data pipeline introduces stringent requirements for response time at any scale. Even where vendors refer to near-real-time capability (e.g. in Compose), I would be hesitant to put it into this architecture. The Kinesis family (Firehose and Analytics), as well as Kafka, has been designed from the ground up for streaming use cases. When we use the right tools in the relevant scenarios, we often discover additional capabilities that we can start using with little effort.
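To illustrate the kind of computation a streaming engine such as Kinesis Data Analytics or Kafka Streams performs, here is a plain-Python sketch of a tumbling-window count over an event stream. The event shape and the 10-second window are illustrative assumptions, not any product's API:

```python
from collections import defaultdict

WINDOW_SECONDS = 10  # tumbling window size; an illustrative choice

def tumbling_counts(events):
    """Count events per (window_start, key) — the staple streaming aggregation."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS
        counts[(window_start, key)] += 1
    return dict(counts)

# Simulated stream of (timestamp_seconds, event_type) records:
stream = [(1, "click"), (3, "click"), (9, "buy"), (12, "click"), (15, "click")]
print(tumbling_counts(stream))
# -> {(0, 'click'): 2, (0, 'buy'): 1, (10, 'click'): 2}
```

A real streaming engine runs this continuously and emits each window's result as soon as the window closes, which is what makes the "respond in real time" use case from earlier in this post possible.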

Real-Time Cloud Analytics — Technology Architecture
Global Data Ingestion with Amazon CloudFront and Lambda@Edge

4.3. Evaluation of the solution

Concern 1: Availability: AWS Kinesis and CloudFront are HA by design.

5. Summary

The Entry-Level solution is easy to start with but limited in scalability: infrastructure and operational costs will increase linearly as data grows.
