
🚀 RAG Data Ingestion Pipeline for ML Workloads

Overview

This repository provides a scalable, modular pipeline for ingesting large-scale datasets into vector databases to power Retrieval-Augmented Generation (RAG) applications. The pipeline is optimized to handle millions of records, enabling fast and efficient similarity search for LLM-backed applications.

Key Features

✅ Parallel Embedding Generation: Uses Ray to distribute computation across multiple GPUs and CPUs.
✅ Vector Storage in OpenSearch: Implements Hierarchical Navigable Small World (HNSW) indexing for fast approximate nearest neighbor (ANN) search.
✅ Vector Storage in PostgreSQL (pgvector): Supports exact k-NN retrieval for precision-based searches.
✅ Optimized ETL Pipeline: Converts large-scale unstructured text data into vector embeddings efficiently.
✅ Scalability & Performance: Designed to handle millions of records for large-scale ML workloads.

🔗 How to Use

Follow these steps to set up and run the pipeline:

1️⃣ Convert Raw Data to Parquet Format

To improve processing speed and reduce storage overhead, convert your JSONL data into Parquet format:

python src/convert.py
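
Under the hood, the conversion can be as simple as a pandas round-trip. Here is a minimal sketch, assuming pandas with pyarrow installed; the paths match the project structure below, while the column layout comes from whatever your JSONL provides:

import pandas as pd

# lines=True parses one JSON object per line (the JSONL convention)
df = pd.read_json("data/raw/data.jsonl", lines=True)

# Parquet is columnar and compressed, so it is smaller on disk and much
# faster to re-read than raw JSONL during the embedding step
df.to_parquet("data/processed/data.parquet", index=False)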

2️⃣ Generate Vector Embeddings with Ray

Leverage Ray for distributed embedding generation across multiple GPUs:

python src/embeddings.py
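
As a rough illustration of the pattern, Ray Data can shard the Parquet file and run one embedding actor per GPU. A minimal sketch, assuming Ray 2.9+ and sentence-transformers; the model name, batch size, and actor count are illustrative, not values from this repository:

import ray
from sentence_transformers import SentenceTransformer

class Embedder:
    def __init__(self):
        # each actor loads the model once and reuses it across batches
        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    def __call__(self, batch):
        # batch is a dict of NumPy arrays; encode the text column in bulk
        batch["embedding"] = self.model.encode(list(batch["text"]))
        return batch

ds = ray.data.read_parquet("data/processed/data.parquet")
ds = ds.map_batches(Embedder, concurrency=4, num_gpus=1, batch_size=256)
ds.write_parquet("data/processed/embeddings.parquet")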

3️⃣ Store Embeddings in OpenSearch & PostgreSQL

Index the generated embeddings into OpenSearch and PostgreSQL (pgvector):

python src/opensearch_store.py
python src/pgvector_store.py
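
The two stores are set up differently: OpenSearch needs a k-NN index with an HNSW method on the vector field, while pgvector needs the extension plus a table with a vector column. A minimal sketch, assuming opensearch-py and psycopg2; the index name, table name, connection details, and 384-dimension embedding size are illustrative and must match your embedding model:

from opensearchpy import OpenSearch
import psycopg2

# --- OpenSearch: HNSW index for approximate nearest neighbor search ---
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
client.indices.create(
    index="rag-documents",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 384,
                    "method": {
                        "name": "hnsw",
                        "space_type": "cosinesimil",
                        "parameters": {"ef_construction": 128, "m": 16},
                    },
                },
            }
        },
    },
)

# --- PostgreSQL: pgvector extension and a vector(384) column for exact k-NN ---
conn = psycopg2.connect("dbname=rag user=postgres")
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(id bigserial PRIMARY KEY, text text, embedding vector(384));"
    )
conn.commit()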

4️⃣ Query for Contextual Document Retrieval

Run queries to fetch similar documents for RAG-based applications:

python src/query.py
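
In outline, a query embeds the question with the same model used at ingestion, then asks OpenSearch for approximate neighbors or pgvector for exact ones. A minimal sketch, assuming the index and table created above plus the pgvector Python package; names and the top-5 cutoff are illustrative:

import numpy as np
import psycopg2
from opensearchpy import OpenSearch
from pgvector.psycopg2 import register_vector
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
query_vec = model.encode("How do I scale embedding generation?")

# Approximate nearest neighbors via the OpenSearch HNSW index
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
resp = client.search(
    index="rag-documents",
    body={"size": 5,
          "query": {"knn": {"embedding": {"vector": query_vec.tolist(), "k": 5}}}},
)
ann_hits = [hit["_source"]["text"] for hit in resp["hits"]["hits"]]

# Exact k-NN via pgvector; <-> is the L2 distance operator
conn = psycopg2.connect("dbname=rag user=postgres")
register_vector(conn)  # lets psycopg2 pass NumPy arrays as vectors
with conn.cursor() as cur:
    cur.execute(
        "SELECT text FROM documents ORDER BY embedding <-> %s LIMIT 5",
        (np.array(query_vec),),
    )
    exact_hits = [row[0] for row in cur.fetchall()]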

Project Structure

rag-data-ingestion-pipeline/
├── data/
│   ├── raw/
│   │   └── data.jsonl
│   └── processed/
│       └── data.parquet
├── src/
│   ├── convert.py            # Converts JSONL to Parquet
│   ├── embeddings.py         # Handles embedding generation with Ray
│   ├── opensearch_store.py   # Stores embeddings in OpenSearch
│   ├── pgvector_store.py     # Stores embeddings in PostgreSQL
│   ├── query.py              # Queries vector databases for retrieval
│   └── pipeline.py           # Main script to run the ingestion pipeline
├── requirements.txt          # Python dependencies
└── README.md                 # Project documentation

🔥 Performance Insights

  • Ray speeds up embedding generation by distributing the workload across GPUs.
  • OpenSearch provides fast ANN search, while pgvector ensures precise k-NN retrieval.
  • Batching queries reduces latency: bulk retrieval is significantly faster than issuing queries one at a time.
  • Proper index configuration (HNSW in OpenSearch, IVFFlat in pgvector) enhances performance; a tuning sketch follows this list.
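
To make the last point concrete: in pgvector, the IVFFlat index partitions vectors into lists, and the ivfflat.probes setting controls how many lists each query scans. A minimal sketch, assuming the documents table from the ingestion step; the lists and probes values are illustrative starting points, not tuned numbers from this repository:

import psycopg2

conn = psycopg2.connect("dbname=rag user=postgres")
with conn.cursor() as cur:
    # IVFFlat trades a small amount of recall for much faster scans;
    # build it after the table is populated so the centroids are meaningful
    cur.execute(
        "CREATE INDEX IF NOT EXISTS documents_embedding_ivf "
        "ON documents USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);"
    )
    # more probes = higher recall but slower queries
    cur.execute("SET ivfflat.probes = 10;")
conn.commit()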

🚀 Future Enhancements

  • Support for additional vector stores, such as Pinecone and FAISS.
  • Integration with streaming data sources for real-time ingestion.
  • Advanced index tuning for even faster retrieval.

📒 Contributing

We welcome contributions! Feel free to submit PRs, suggest improvements, or open issues. Let's build scalable ML infrastructure together. 🔥

📜 License

This project is licensed under the MIT License.


Have feedback or ideas? Let's discuss! 🚀