saLeox / Lambda-Architecture-BigData-Pipeline-Curation

Curation of a big data pipeline that applies the lambda architecture, combining batch and stream processing. Implemented with Spring Boot and Spark.

Lambda-Architecture-BigData-Pipeline-Curation

Architecture Overview

1. Ingest Layer

Set up a Kafka cluster with monitoring tools using Docker Compose.
Collect house transaction data as events and send them to a Kafka topic.
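
As a rough illustration of the second step, the sketch below models one house transaction as a keyed event whose value is a JSON string, which is the shape a Kafka producer would send. The record name and every field (`transactionId`, `district`, `areaSqm`, `price`, `timestampMs`) are hypothetical; the project's actual schema is not described here. The sketch uses only the Java standard library and assumes Java 16 or later for `record`.

```java
import java.util.Locale;

/** Hypothetical event shape; the field names are illustrative, not taken from the project. */
record HouseTransactionEvent(String transactionId, String district,
                             double areaSqm, double price, long timestampMs) {

    /** Record key: with Kafka's default partitioner, equal keys go to the same partition. */
    String key() {
        return district;
    }

    /** Record value: a flat JSON object built by hand to keep the sketch dependency-free. */
    String toJson() {
        return String.format(Locale.ROOT,
            "{\"transactionId\":\"%s\",\"district\":\"%s\",\"areaSqm\":%.1f,\"price\":%.2f,\"timestampMs\":%d}",
            escape(transactionId), escape(district), areaSqm, price, timestampMs);
    }

    /** Escapes backslashes and double quotes only; a real JSON library covers more cases. */
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

In a real producer, `key()` and `toJson()` would supply the key and value of the record sent to the topic; a JSON library or a schema-based format such as Avro would normally replace the hand-built string.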

2. Speed Layer

Kafka producer and consumer.
Streaming ETL that sinks to BigQuery.
Materialize a KTable of counts over a tumbling time window.
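
The windowed count can be pictured with a small plain-Java sketch. It does not use the Kafka Streams API; it only mimics the idea that each event falls into exactly one fixed-size, non-overlapping window aligned to the epoch, and that a count is kept per window. The class name `TumblingWindowCounter` and its methods are hypothetical, and the window size is an assumption since the project's value is not stated.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of tumbling-window counting; not the Kafka Streams API. */
final class TumblingWindowCounter {
    private final long windowSizeMs;
    // window start (epoch ms) -> number of events recorded in that window
    private final TreeMap<Long, Long> counts = new TreeMap<>();

    TumblingWindowCounter(long windowSizeMs) {
        if (windowSizeMs <= 0) {
            throw new IllegalArgumentException("windowSizeMs must be positive");
        }
        this.windowSizeMs = windowSizeMs;
    }

    /** Start of the epoch-aligned window that contains the timestamp. */
    long windowStart(long timestampMs) {
        return timestampMs - Math.floorMod(timestampMs, windowSizeMs);
    }

    /** Adds one event to the window its timestamp belongs to. */
    void record(long timestampMs) {
        counts.merge(windowStart(timestampMs), 1L, Long::sum);
    }

    /** Count for the window containing the timestamp, or 0 if no event was recorded. */
    long countAt(long timestampMs) {
        return counts.getOrDefault(windowStart(timestampMs), 0L);
    }

    /** Read-only copy of all window counts, ordered by window start. */
    Map<Long, Long> snapshot() {
        return Collections.unmodifiableMap(new TreeMap<>(counts));
    }
}
```

With a 60-second window, events at 0 ms and 59,999 ms share the window starting at 0, while an event at 60,000 ms opens the next one. Looking up a count by timestamp is also the kind of read an interactive query against the materialized table would perform.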

3. Batch Layer

Train the regression model with the machine learning API provided by Spark MLlib and persist it to GCS.
Use metrics to evaluate the performance of each model.
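
Which metrics the project uses is not stated. As a sketch under that caveat, the plain-Java class below computes three common regression metrics (RMSE, MAE and R²) from arrays of actual and predicted values. Spark MLlib's `RegressionEvaluator` offers comparable metrics; the class `RegressionMetrics` here is a hypothetical stand-in that only shows the arithmetic.

```java
/** Hypothetical helper showing how common regression metrics are computed. */
final class RegressionMetrics {
    private RegressionMetrics() { }

    /** Root mean squared error: square root of the mean squared residual. */
    static double rmse(double[] actual, double[] predicted) {
        check(actual, predicted);
        return Math.sqrt(sumSquaredError(actual, predicted) / actual.length);
    }

    /** Mean absolute error: mean of the absolute residuals. */
    static double mae(double[] actual, double[] predicted) {
        check(actual, predicted);
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - predicted[i]);
        }
        return sum / actual.length;
    }

    /** Coefficient of determination: 1 - SS_res / SS_tot (NaN if all actual values are equal). */
    static double r2(double[] actual, double[] predicted) {
        check(actual, predicted);
        double mean = 0.0;
        for (double a : actual) {
            mean += a;
        }
        mean /= actual.length;
        double ssTot = 0.0;
        for (double a : actual) {
            ssTot += (a - mean) * (a - mean);
        }
        if (ssTot == 0.0) {
            return Double.NaN;
        }
        return 1.0 - sumSquaredError(actual, predicted) / ssTot;
    }

    private static double sumSquaredError(double[] actual, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double residual = actual[i] - predicted[i];
            sum += residual * residual;
        }
        return sum;
    }

    private static void check(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
    }
}
```

Lower RMSE and MAE and higher R² indicate a better fit, so computing the same metrics on a held-out set for each candidate gives a consistent basis for comparing models.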

4. Serving Layer

Load the ML model inside the Spring Boot application to provide a prediction function.
Provide interactive queries based on Kafka time-window streaming.

Limitations and Future Enhancements

Integrate a neural network for the regression problem using deeplearning4j.
Use Flink and Alink to unify online and offline machine learning, since Spark MLlib provides only a streaming linear regression model for regression problems.
Embed pre-processing and feature engineering into the pipeline and automate them.
