Designing Cloud Data Platforms

1 Introducing the data platform

A Cloud Data Platform is a set of cloud-hosted data processing technologies. It enables an organization to process data from various sources and transform it into useful information. One can think of it as modern-day ETL built on cloud-managed services.

Cloud Data Platforms try to solve these V's:

  • Volume of incoming data
  • Variety of data formats from different sources
  • Velocity of serving data - streamed or per request
  • Veracity - accuracy and integrity of stored data
  • Value the data can provide to the business or its users

Cloud Data Platforms have these loosely coupled components at their core:

  • Ingestion Layer for getting the data in
  • Data Lake Storage - raw, unorganized, just about anything can be stored
  • Processing Layer for added data manipulation based on business rules
  • Serving Layer for extraction and data sharing

Clients of Cloud Data Platforms could be:

  • Data Warehouse - governed and designed for business users
  • Web APIs
  • Data Exports (CSV, Excel, etc.)
  • Business People and Teams

2 Why a data platform and not just a data warehouse

4 Getting data into the platform

 

[Figure: Cloud Data Platform. Inputs: Other Databases, File Dumps, API Requests, Other Sources. Platform layers: Ingestion, Storage, Processing, Serving. Outputs: Other Databases, File Dumps, API Requests. Callouts 1 to 5 are described in the list that follows.]

  1. Data can come from a variety of sources
  2. The Ingestion layer is the entry point for the Cloud Data Platform
  3. The data payload is stored then processed (or processed then stored) for manipulation
  4. Cleaned up data is then forwarded to the Serving layer
  5. Stakeholders can retrieve the data through a Data Warehouse, file exports, or applications that connect via API requests
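
The five steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular cloud service works: the names `Platform`, `ingest`, `process`, and `serve_json` are hypothetical, the "lake" is an in-memory list, and the business rule (trim whitespace, drop rows without an `id`) is invented for the example.

```python
import csv
import io
import json
from typing import Dict, List

Record = Dict[str, str]


def ingest(payload: str, fmt: str) -> List[Record]:
    """Ingestion layer: parse a payload from any source into records."""
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(payload)))
    if fmt == "json":
        return json.loads(payload)
    raise ValueError(f"unsupported format: {fmt}")


def process(records: List[Record]) -> List[Record]:
    """Processing layer: hypothetical business rule that trims whitespace
    and drops rows that have no id."""
    cleaned = []
    for row in records:
        row = {key: str(value).strip() for key, value in row.items()}
        if row.get("id"):
            cleaned.append(row)
    return cleaned


class Platform:
    """Toy platform: storage and serving are plain in-memory lists."""

    def __init__(self) -> None:
        self.lake: List[List[Record]] = []  # raw batches, stored untouched
        self.serving: List[Record] = []     # cleaned rows, ready for clients

    def load(self, payload: str, fmt: str) -> None:
        raw = ingest(payload, fmt)           # step 2: entry point
        self.lake.append(raw)                # step 3: store the raw payload
        self.serving.extend(process(raw))    # steps 3 and 4: clean, then forward

    def serve_json(self) -> str:
        """Serving layer: what an API client might receive (step 5)."""
        return json.dumps(self.serving)


platform = Platform()
platform.load("id,name\n1, Ada \n,Bob\n", "csv")            # a file dump
platform.load('[{"id": "2", "name": "Grace"}]', "json")     # an API request
print(platform.serve_json())
```

Keeping the raw batch in `lake` before processing mirrors the "stored then processed" order; swapping the two lines in `load` would give the "processed then stored" variant.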

5 Organizing and processing data

How data flows through several areas of the platform

[Figure: data flow through the platform areas. Labels: Landing, Staging, Archive, Failed (with Retry), Production, Data Warehouse, Data Output A, Data Output B, Data Output C. Callouts 1 to 4 are described in the list that follows.]

These "areas" can be as simple as folders, or as elaborate as separate machines.

  1. The Ingestion layer writes the data into the Landing area. This data is raw and untouched. A Landing area can also be a simple database table whose definition matches the source schema, which enforces the data structure.
  2. After processing, transformation, and/or manipulation, the data moves on to the Staging area. If there were no errors, the original data is Archived; otherwise it is moved into the Failed area to support retries.
  3. Staging data is then copied into the Production area for further inspection.
  4. If all validations pass, it can then be saved into the Data warehouse from which any relevant output is generated.
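
Treating the areas as plain folders, the flow above might look like the following sketch. The folder names, the `transform` rule, and the helper names are all hypothetical, and the sketch assumes that it is the raw landing file that gets archived or moved to the Failed area. Loading into the Data Warehouse (step 4) is left out.

```python
import shutil
import tempfile
from pathlib import Path

AREAS = ("landing", "staging", "archive", "failed", "production")


def init_areas(root: Path) -> None:
    """Create one folder per area."""
    for area in AREAS:
        (root / area).mkdir(parents=True, exist_ok=True)


def transform(text: str) -> str:
    """Hypothetical processing step: reject empty files, upper-case the rest."""
    if not text.strip():
        raise ValueError("empty file")
    return text.upper()


def process_landing(root: Path) -> None:
    """Step 2: process each landing file, then archive it or mark it failed."""
    for raw in sorted((root / "landing").iterdir()):
        try:
            result = transform(raw.read_text())
        except ValueError:
            # Keep the untouched input in "failed" so it can be retried later.
            shutil.move(str(raw), str(root / "failed" / raw.name))
            continue
        (root / "staging" / raw.name).write_text(result)
        shutil.move(str(raw), str(root / "archive" / raw.name))


def promote(root: Path) -> None:
    """Step 3: copy staged files into production for further inspection."""
    for staged in sorted((root / "staging").iterdir()):
        shutil.copy2(staged, root / "production" / staged.name)


def demo() -> dict:
    """Run the flow in a temporary directory and list each area's files."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        init_areas(root)
        (root / "landing" / "good.txt").write_text("hello")
        (root / "landing" / "bad.txt").write_text("")
        process_landing(root)
        promote(root)
        return {
            area: sorted(p.name for p in (root / area).iterdir())
            for area in AREAS
        }


print(demo())
```

A retry, in this sketch, would simply move a file from `failed` back into `landing` and run `process_landing` again.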

6 Real-time data processing and analytics

Cloud Services for real-time data

Vendor       | Real-time Storage    | Real-time Processing
AWS          | Kinesis Data Streams | Kinesis Data Analytics
Google Cloud | Pub/Sub              | Dataflow
Azure        | Event Hubs           | Stream Analytics

7 Metadata layer architecture

The book notes that there is no standard implementation for a Metadata Layer. It is commonly placed on top of the Ingestion Layer and records information about:

  1. The file or the data received
  2. The type of file or data received - CSV, Excel, JSON, etc.
  3. Timestamps for when the data was received and when processes were executed
  4. Source and Destination of the data
  5. Rows read and written
  6. The ingestion pipeline it was handed over to
  7. Status - Success and Error information
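
Since there is no standard implementation, the shape of such a record is a design choice. As one possible sketch, the seven items above could map onto a small Python dataclass. The class name `IngestionRecord`, its field names, and the status values are assumptions made for illustration, not something defined by the book or by any of the tools listed in this section.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class IngestionRecord:
    """Hypothetical metadata entry for one ingested file or payload."""
    name: str                                 # 1. the file or data received
    data_type: str                            # 2. CSV, Excel, JSON, ...
    source: str                               # 4. where the data came from
    destination: str                          # 4. where the data was written
    pipeline: str                             # 6. pipeline it was handed to
    received_at: datetime                     # 3. when the data arrived
    processed_at: Optional[datetime] = None   # 3. when processing finished
    rows_read: int = 0                        # 5. rows read
    rows_written: int = 0                     # 5. rows written
    status: str = "pending"                   # 7. pending, success, or error
    error: Optional[str] = None               # 7. error details, if any

    def mark_success(self, rows_read: int, rows_written: int) -> None:
        self.rows_read = rows_read
        self.rows_written = rows_written
        self.processed_at = datetime.now(timezone.utc)
        self.status = "success"

    def mark_error(self, message: str) -> None:
        self.processed_at = datetime.now(timezone.utc)
        self.status = "error"
        self.error = message


record = IngestionRecord(
    name="orders.csv",
    data_type="CSV",
    source="sftp://example.invalid/orders.csv",   # made-up source
    destination="landing/orders.csv",             # made-up destination
    pipeline="orders-batch",                      # made-up pipeline name
    received_at=datetime.now(timezone.utc),
)
record.mark_success(rows_read=10, rows_written=9)
print(asdict(record)["status"])
```

In practice these records would be written to a database or a log so that failed ingestions can be found and retried.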

Open Source Metadata Layer Implementations

These are open source products you can use but don't expect them to be drop-in solutions:

  1. Apache Atlas
  2. LinkedIn's DataHub
  3. Marquez (MarquezProject), from WeWork engineers

8 Schema management

9 Data access and security

Cloud Data Warehouse Providers

Feature                       | AWS Redshift       | Azure Synapse      | Google BigQuery
Based on relational technology? | Yes                | Yes                | No
Nested data structure support | Yes (through JSON) | Yes (through JSON) | Yes
Scaling                       | Manual             | Manual             | Automatic
Pricing                       | Pay per capacity   | Pay per capacity   | Pay per use OR capacity

10 Fueling business value with data platforms