
🚀 RAG Data Ingestion Pipeline for ML Workloads

Overview

This repository provides a scalable, modular pipeline for ingesting large-scale datasets into vector databases to power Retrieval-Augmented Generation (RAG) applications. The pipeline is optimized to handle millions of records, enabling fast and efficient similarity search for LLM-backed applications.

Key Features

✅ Parallel Embedding Generation: Uses Ray to distribute computation across multiple GPUs and CPUs.
✅ Vector Storage in OpenSearch: Implements Hierarchical Navigable Small World (HNSW) indexing for fast approximate nearest neighbor (ANN) search.
✅ Vector Storage in PostgreSQL (pgvector): Supports exact k-NN retrieval for precision-based searches.
✅ Optimized ETL Pipeline: Converts large-scale unstructured text data into vector embeddings efficiently.
✅ Scalability & Performance: Designed to handle millions of records for large-scale ML workloads.

🔗 How to Use

Follow these steps to set up and run the pipeline:

1️⃣ Convert Raw Data to Parquet Format

To improve processing speed and reduce storage overhead, convert your JSONL data into Parquet format:

python src/convert.py
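
Under the hood, the conversion can be as simple as a pandas round-trip. Here is a minimal sketch, assuming pandas with pyarrow installed; the paths match the project structure below, while the column layout comes from whatever your JSONL provides:

import pandas as pd

# lines=True parses one JSON object per line (the JSONL convention)
df = pd.read_json("data/raw/data.jsonl", lines=True)

# Parquet is columnar and compressed, so it is smaller on disk and much
# faster to re-read than raw JSONL during the embedding step
df.to_parquet("data/processed/data.parquet", index=False)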

2️⃣ Generate Vector Embeddings with Ray

Leverage Ray for distributed embedding generation across multiple GPUs:

python src/embeddings.py
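
As a rough illustration of the pattern, Ray Data can shard the Parquet file and run one embedding actor per GPU. A minimal sketch, assuming Ray 2.9+ and sentence-transformers; the model name, batch size, and actor count are illustrative, not values from this repository:

import ray
from sentence_transformers import SentenceTransformer

class Embedder:
    def __init__(self):
        # each actor loads the model once and reuses it across batches
        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    def __call__(self, batch):
        # batch is a dict of NumPy arrays; encode the text column in bulk
        batch["embedding"] = self.model.encode(list(batch["text"]))
        return batch

ds = ray.data.read_parquet("data/processed/data.parquet")
ds = ds.map_batches(Embedder, concurrency=4, num_gpus=1, batch_size=256)
ds.write_parquet("data/processed/embeddings.parquet")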

3️⃣ Store Embeddings in OpenSearch & PostgreSQL

Index the generated embeddings into OpenSearch and PostgreSQL (pgvector):

python src/opensearch_store.py
python src/pgvector_store.py
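
The two stores are set up differently: OpenSearch needs a k-NN index with an HNSW method on the vector field, while pgvector needs the extension plus a table with a vector column. A minimal sketch, assuming opensearch-py and psycopg2; the index name, table name, connection details, and 384-dimension embedding size are illustrative and must match your embedding model:

from opensearchpy import OpenSearch
import psycopg2

# --- OpenSearch: HNSW index for approximate nearest neighbor search ---
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
client.indices.create(
    index="rag-documents",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 384,
                    "method": {
                        "name": "hnsw",
                        "space_type": "cosinesimil",
                        "parameters": {"ef_construction": 128, "m": 16},
                    },
                },
            }
        },
    },
)

# --- PostgreSQL: pgvector extension and a vector(384) column for exact k-NN ---
conn = psycopg2.connect("dbname=rag user=postgres")
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(id bigserial PRIMARY KEY, text text, embedding vector(384));"
    )
conn.commit()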

4️⃣ Query for Contextual Document Retrieval

Run queries to fetch similar documents for RAG-based applications:

python src/query.py
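
In outline, a query embeds the question with the same model used at ingestion, then asks OpenSearch for approximate neighbors or pgvector for exact ones. A minimal sketch, assuming the index and table created above plus the pgvector Python package; names and the top-5 cutoff are illustrative:

import numpy as np
import psycopg2
from opensearchpy import OpenSearch
from pgvector.psycopg2 import register_vector
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
query_vec = model.encode("How do I scale embedding generation?")

# Approximate nearest neighbors via the OpenSearch HNSW index
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
resp = client.search(
    index="rag-documents",
    body={"size": 5,
          "query": {"knn": {"embedding": {"vector": query_vec.tolist(), "k": 5}}}},
)
ann_hits = [hit["_source"]["text"] for hit in resp["hits"]["hits"]]

# Exact k-NN via pgvector; <-> is the L2 distance operator
conn = psycopg2.connect("dbname=rag user=postgres")
register_vector(conn)  # lets psycopg2 pass NumPy arrays as vectors
with conn.cursor() as cur:
    cur.execute(
        "SELECT text FROM documents ORDER BY embedding <-> %s LIMIT 5",
        (np.array(query_vec),),
    )
    exact_hits = [row[0] for row in cur.fetchall()]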

Project Structure

rag-data-ingestion-pipeline/
├── data/
│   ├── raw/
│   │   └── data.jsonl
│   └── processed/
│       └── data.parquet
├── src/
│   ├── convert.py            # Converts JSONL to Parquet
│   ├── embeddings.py         # Handles embedding generation with Ray
│   ├── opensearch_store.py   # Stores embeddings in OpenSearch
│   ├── pgvector_store.py     # Stores embeddings in PostgreSQL
│   ├── query.py              # Queries vector databases for retrieval
│   └── pipeline.py           # Main script to run the ingestion pipeline
├── requirements.txt          # Python dependencies
└── README.md                 # Project documentation

🔥 Performance Insights

  • Ray speeds up embedding generation by distributing the workload across GPUs.
  • OpenSearch provides fast ANN search, while pgvector ensures precise k-NN retrieval.
  • Batching queries reduces latency: bulk retrieval is significantly faster than issuing queries one at a time.
  • Proper index configuration (HNSW in OpenSearch, IVFFlat in pgvector) enhances performance; a tuning sketch follows this list.
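
To make the last point concrete: in pgvector, the IVFFlat index partitions vectors into lists, and the ivfflat.probes setting controls how many lists each query scans. A minimal sketch, assuming the documents table from the ingestion step; the lists and probes values are illustrative starting points, not tuned numbers from this repository:

import psycopg2

conn = psycopg2.connect("dbname=rag user=postgres")
with conn.cursor() as cur:
    # IVFFlat trades a small amount of recall for much faster scans;
    # build it after the table is populated so the centroids are meaningful
    cur.execute(
        "CREATE INDEX IF NOT EXISTS documents_embedding_ivf "
        "ON documents USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);"
    )
    # more probes = higher recall but slower queries
    cur.execute("SET ivfflat.probes = 10;")
conn.commit()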

🚀 Future Enhancements

  • Support for additional vector stores, such as Pinecone and FAISS.
  • Integration with streaming data sources for real-time ingestion.
  • Advanced index tuning for even faster retrieval.

📒 Contributing

We welcome contributions! Feel free to submit PRs, suggest improvements, or open issues. Let's build scalable ML infrastructure together. 🔥

📜 License

This project is licensed under the MIT License.


Have feedback or ideas? Let's discuss! 🚀