This repository provides a scalable and modular pipeline for ingesting large-scale datasets into vector databases to power Retrieval-Augmented Generation (RAG) applications. The pipeline is optimized for handling millions of records, enabling fast, efficient similarity search to enhance LLM applications.
- Parallel Embedding Generation: Uses Ray to distribute computation across multiple GPUs and CPUs.
- Vector Storage in OpenSearch: Implements Hierarchical Navigable Small World (HNSW) indexing for fast approximate nearest neighbor (ANN) search.
- Vector Storage in PostgreSQL (pgvector): Supports exact k-NN retrieval for precision-based searches.
- Optimized ETL Pipeline: Converts large-scale unstructured text data into vector embeddings efficiently.
- Scalability & Performance: Designed to handle millions of records for large-scale ML workloads.
Follow these steps to set up and run the pipeline:
To improve processing speed and reduce storage overhead, convert your JSONL data into Parquet format:
python src/convert.py
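A minimal sketch of what such a conversion can look like, using pandas with the pyarrow engine. The file paths follow the project tree shown below; the single-file layout is an assumption, not the actual contents of `convert.py`:

```python
# Illustrative JSONL -> Parquet conversion (not the repository's convert.py).
import pandas as pd

INPUT_PATH = "data/raw/data.jsonl"           # assumed input location
OUTPUT_PATH = "data/processed/data.parquet"  # assumed output location

# Read newline-delimited JSON into a DataFrame.
df = pd.read_json(INPUT_PATH, lines=True)

# Parquet is columnar and compressed, so it is smaller on disk and far
# faster to scan in batch jobs than row-oriented JSONL.
df.to_parquet(OUTPUT_PATH, engine="pyarrow", index=False)
print(f"Wrote {len(df)} records to {OUTPUT_PATH}")
```

For datasets too large to fit in memory, the same conversion can be done in chunks with `pd.read_json(..., lines=True, chunksize=...)`.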
Leverage Ray for distributed embedding generation across multiple GPUs:
python src/embeddings.py
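A simplified sketch of distributed embedding generation with Ray Data. The model, the `text` column name, and the batch/concurrency settings are illustrative assumptions; substitute whatever the project actually uses:

```python
# Illustrative Ray Data embedding job (not the repository's embeddings.py).
import ray
from sentence_transformers import SentenceTransformer

class Embedder:
    """Stateful actor: loads the model once per replica, reuses it per batch."""

    def __init__(self):
        # Model choice is an assumption; swap in your own.
        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    def __call__(self, batch):
        # Ray Data passes batches as dicts of column-name -> numpy array.
        batch["embedding"] = self.model.encode(list(batch["text"]))
        return batch

ray.init()
ds = ray.data.read_parquet("data/processed/data.parquet")
ds = ds.map_batches(
    Embedder,
    batch_size=256,  # tune to available GPU memory
    concurrency=4,   # number of parallel actor replicas (Ray >= 2.9)
    num_gpus=1,      # one GPU per replica; drop for CPU-only runs
)
ds.write_parquet("data/processed/embeddings.parquet")
```

Using a class-based UDF matters here: each replica loads the model weights once rather than once per batch, which dominates throughput at the millions-of-records scale.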
Index the generated embeddings into OpenSearch and PostgreSQL (pgvector):
python src/opensearch_store.py
python src/pgvector_store.py
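The two stores play complementary roles: OpenSearch builds an HNSW graph for fast approximate search, while pgvector keeps exact vectors in Postgres. A hedged sketch of the index setup using opensearch-py and psycopg2; the index name, table name, connection details, and dimension are assumptions:

```python
# Illustrative index/table setup; names and dimension are assumptions.
from opensearchpy import OpenSearch
import psycopg2

DIM = 384  # must match the embedding model's output dimension

# --- OpenSearch: HNSW index for approximate nearest neighbor search ---
os_client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
os_client.indices.create(
    index="rag-documents",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "embedding": {
                    "type": "knn_vector",
                    "dimension": DIM,
                    "method": {"name": "hnsw", "space_type": "cosinesimil"},
                },
            }
        },
    },
)

# --- PostgreSQL + pgvector: exact k-NN over the same vectors ---
conn = psycopg2.connect("dbname=rag user=postgres")  # assumed connection
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute(
        f"CREATE TABLE IF NOT EXISTS documents "
        f"(id bigserial PRIMARY KEY, text text, embedding vector({DIM}));"
    )
```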
Run queries to fetch similar documents for RAG-based applications:
python src/query.py
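A minimal retrieval sketch against the OpenSearch index, reusing the index, field, and model names assumed above:

```python
# Illustrative ANN query via the OpenSearch k-NN plugin (assumed names).
from opensearchpy import OpenSearch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Embed the user question with the same model used at ingestion time.
query_vector = model.encode("example user question").tolist()

response = client.search(
    index="rag-documents",
    body={
        "size": 5,
        "query": {"knn": {"embedding": {"vector": query_vector, "k": 5}}},
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["text"][:80])
```

The retrieved passages are what you would then stuff into the LLM prompt for RAG.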
rag-data-ingestion-pipeline/
├── data/
│   ├── raw/
│   │   └── data.jsonl
│   └── processed/
│       └── data.parquet
├── src/
│   ├── convert.py            # Converts JSONL to Parquet
│   ├── embeddings.py         # Handles embedding generation with Ray
│   ├── opensearch_store.py   # Stores embeddings in OpenSearch
│   ├── pgvector_store.py     # Stores embeddings in PostgreSQL
│   ├── query.py              # Queries vector databases for retrieval
│   └── pipeline.py           # Main script to run ingestion pipeline
├── requirements.txt          # Python dependencies
└── README.md                 # Project documentation
- Ray speeds up embedding generation by distributing workload across GPUs.
- OpenSearch provides fast ANN search, while pgvector ensures precise k-NN retrieval.
- Batching queries reduces latency; bulk retrieval is significantly faster than per-query execution.
- Proper index configuration (HNSW in OpenSearch, IVF in pgvector) enhances performance; a tuning sketch follows this list.
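As an illustration of that last point, here is a hedged sketch of pgvector IVFFlat tuning; the index name and parameter values are assumptions to be validated against your own recall/latency measurements:

```python
# Illustrative pgvector IVFFlat tuning; parameter values are assumptions.
import psycopg2

conn = psycopg2.connect("dbname=rag user=postgres")  # assumed connection
with conn, conn.cursor() as cur:
    # IVFFlat clusters vectors into lists; a common starting point is
    # lists ~= sqrt(row_count) for datasets in the millions.
    cur.execute(
        "CREATE INDEX IF NOT EXISTS documents_embedding_ivf "
        "ON documents USING ivfflat (embedding vector_cosine_ops) "
        "WITH (lists = 1000);"
    )
    # probes trades recall for speed at query time (default: 1).
    cur.execute("SET ivfflat.probes = 10;")
```

Note that adding an IVF index makes pgvector queries approximate; drop the index (or query without it) when exact k-NN results are required.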
- Support for other vector databases like Pinecone and FAISS.
- Integration with streaming data sources for real-time ingestion.
- Advanced index tuning for even faster retrieval.
We welcome contributions! Feel free to submit PRs, suggest improvements, or open issues. Let's build scalable ML infrastructure together.
This project is licensed under the MIT License.
Have feedback or ideas? Let's discuss!