Curation of a big data pipeline that applies the lambda architecture, combining batch and stream processing. Implemented with Spring Boot and Spark.

saLeox/Lambda-Architecture-BigData-Pipeline-Curation

Lambda-Architecture-BigData-Pipeline-Curation

Architecture Overview

1. Ingest Layer

  • Set up a Kafka cluster with monitoring tools using Docker Compose.
  • Collect the house transaction data as events and send them to a Kafka topic.
  • Kafka producer & consumer.
  • Streaming ETL that sinks to BigQuery.
  • Materialize a KTable of counts based on a tumbling time window.
  • Train the regression model with the machine learning API provided by Spark MLlib and persist it to GCS.
  • Use metrics to evaluate the performance of each model.
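
The README does not say which metrics are used to compare the models. As a minimal sketch, assuming the usual regression metrics (RMSE, MAE and R², which Spark MLlib's `RegressionEvaluator` also offers), the evaluation step could be written in plain Java as below. The class name `RegressionMetrics` and the sample prices are hypothetical and only illustrate the formulas; they are not taken from the repository.

```java
import java.util.Locale;

// Hypothetical helper, not part of the repository: computes common
// regression metrics from actual and predicted values.
public final class RegressionMetrics {
    private RegressionMetrics() {}

    // Root mean squared error: sqrt(mean((actual - predicted)^2)).
    public static double rmse(double[] actual, double[] predicted) {
        checkInput(actual, predicted);
        double sumSq = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double err = actual[i] - predicted[i];
            sumSq += err * err;
        }
        return Math.sqrt(sumSq / actual.length);
    }

    // Mean absolute error: mean(|actual - predicted|).
    public static double mae(double[] actual, double[] predicted) {
        checkInput(actual, predicted);
        double sumAbs = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sumAbs += Math.abs(actual[i] - predicted[i]);
        }
        return sumAbs / actual.length;
    }

    // Coefficient of determination: 1 - SS_res / SS_tot.
    // Returns NaN when all actual values are identical (SS_tot == 0).
    public static double r2(double[] actual, double[] predicted) {
        checkInput(actual, predicted);
        double mean = 0.0;
        for (double a : actual) {
            mean += a;
        }
        mean /= actual.length;
        double ssRes = 0.0;
        double ssTot = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double res = actual[i] - predicted[i];
            double dev = actual[i] - mean;
            ssRes += res * res;
            ssTot += dev * dev;
        }
        return ssTot == 0.0 ? Double.NaN : 1.0 - ssRes / ssTot;
    }

    private static void checkInput(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
    }

    public static void main(String[] args) {
        // Made-up house prices, for illustration only.
        double[] actual = {100, 200, 300, 400};
        double[] predicted = {110, 190, 310, 390};
        System.out.println(String.format(Locale.ROOT, "RMSE=%.3f MAE=%.3f R2=%.3f",
                rmse(actual, predicted), mae(actual, predicted), r2(actual, predicted)));
        // prints "RMSE=10.000 MAE=10.000 R2=0.992"
    }
}
```

Lower RMSE and MAE and a higher R² indicate a better fit, so the same held-out data can be scored for each trained model and the results compared side by side.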

4. Serving Layer

Limitations and Future Enhancements

  • Integrate a neural network to solve the regression problem using deeplearning4j.
  • Use Flink & Alink to achieve unified online and offline machine learning, since Spark MLlib only provides a streaming linear regression model for regression problems.
  • Embed pre-processing and feature engineering into the pipeline and automate them.
