The resources for the peer-reviewed paper Building A Modern Data Platform Based On The Data Lakehouse Architecture And Cloud-Native Ecosystem published on Springer Nature

aabouzaid/modern-data-platform-research-paper

Research Paper - Building A Modern Data Platform Based On The Data Lakehouse Architecture And Cloud-Native Ecosystem

This repository contains the resources for the research paper Building A Modern Data Platform Based On The Data Lakehouse Architecture And Cloud-Native Ecosystem. It is a proof of concept for the core of a Modern Data Platform that uses DataOps, Kubernetes, and the Cloud-Native ecosystem to build a resilient Big Data platform based on the Data Lakehouse architecture, which serves as the base for Machine Learning (MLOps) and Artificial Intelligence (AIOps).

Architecture

Core Components

The core components of the platform are:

  • Infrastructure (Kubernetes)
  • Data Ingestion (Argo Workflows + Python)
  • Data Storage (MinIO)
  • Data Processing/Query (Dremio)

Initial Model

To visualise the interactions of the current implementation, the C4 software architecture model (Context, Containers, Components, and Code) is used.

The following is a simplified view of the initial architecture model, with all abstraction levels combined into a single diagram.

Modern Data Platform Initial Architecture Model

Modern Data Platform Initial Data Flow

Prerequisites

The setup requires asdf, a Linux operating system, and Docker Engine (tested with asdf 0.11.1, Ubuntu 20.04.5 LTS, and Docker Engine Community 23.0.1).

The following tools are used in the development:

  • Helm
  • Kubectl
  • Kustomize

They can be installed at the corresponding versions via asdf:

asdf install
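
As a quick sanity check after installation, the following minimal sketch reports which of the tools listed above are still missing from the PATH. The helper name missing_tools is hypothetical and not part of this repository; it relies only on the standard shell built-in command -v.

```shell
# Print the name of every given tool that cannot be found on the PATH.
# Prints nothing when all tools are available.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

# Check the three development tools named in this README.
missing_tools helm kubectl kustomize
```

Empty output means all three tools were found. Running asdf current afterwards lists the versions that asdf resolves for the current directory.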

Clusters

Check the clusters section for more details about the infrastructure setup.

Applications

Check the applications section for more details about the application setup.

Pipelines

Check the pipelines section for more details about the pipeline setup.

Benchmarking

Check the benchmarking section for more details about the benchmarking setup.

Author Contributions

  • Ahmed AbouZaid: Conceptualization, Methodology, Software, Validation, Data curation, Writing–original draft, Writing–review & editing.
  • Peter J. Barclay: Conceptualization, Methodology, Writing–review & editing, Supervision.
  • Christos Chrysoulas: Conceptualization, Writing–review & editing.
  • Nikolaos Pitropakis: Conceptualization, Writing–review & editing.
