An agentic RAG system that helps users query Growth Lab-specific unstructured data.
Growth Lab Deep Search is an agentic AI system designed to answer complex questions about the Growth Lab's research and publications.
Key Features:
- Automated ETL pipeline for harvesting Growth Lab publications and academic papers
- Advanced OCR processing of PDF documents using modern tools
- Vector embeddings with hybrid search
- Agentic RAG system based on LangGraph
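As a concrete illustration of the agentic piece, a LangGraph-based RAG flow (the kind of thing `backend/service/graph.py` is meant to hold) could be wired up roughly as below. This is a minimal sketch: the state fields and node names are assumptions, not the actual implementation.

```python
from typing import TypedDict

from langgraph.graph import END, StateGraph


class RAGState(TypedDict):
    """Hypothetical graph state; the real fields live in backend/service/graph.py."""

    question: str
    documents: list[str]
    answer: str


def retrieve(state: RAGState) -> dict:
    # Placeholder: query the vector store (e.g., Qdrant) for relevant chunks.
    return {"documents": ["...retrieved chunks..."]}


def generate(state: RAGState) -> dict:
    # Placeholder: call an LLM with the question and the retrieved context.
    return {"answer": f"Answer to: {state['question']}"}


workflow = StateGraph(RAGState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("generate", generate)
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "generate")
workflow.add_edge("generate", END)
graph = workflow.compile()

# Example: graph.invoke({"question": "What has the Growth Lab published on export diversification?"})
```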
This is a rough outline of the intended directory structure. The actual structure might look different, but this should give an idea of the intended code organization.
```
gl_deep_search/
├── .github/
│   └── workflows/
│       ├── etl-pipeline.yml        # Scheduled ETL runs and deployment
│       ├── service-deploy.yml      # Service API deployment
│       └── frontend-deploy.yml     # Frontend deployment
├── .gitignore
├── README.md
├── pyproject.toml                  # Python project config for uv
├── docker-compose.yml              # Local development setup
├── docker-compose.prod.yml         # Production setup
│
├── backend/
│   ├── etl/
│   │   ├── Dockerfile              # ETL container configuration
│   │   ├── docker-compose.yml      # Local development setup
│   │   ├── config.yaml             # Default configuration
│   │   ├── .env.example            # Environment variables template
│   │   ├── pyproject.toml          # Python dependencies (uv)
│   │   ├── main.py                 # ETL orchestration entry point
│   │   ├── models.py               # Pydantic data models
│   │   ├── config.py               # Configuration management
│   │   ├── scrapers/
│   │   │   ├── __init__.py
│   │   │   ├── base.py             # Abstract scraper interface
│   │   │   ├── growthlab.py        # Growth Lab website scraper
│   │   │   └── openalex.py         # OpenAlex API client
│   │   ├── processors/
│   │   │   ├── __init__.py
│   │   │   ├── pdf_processor.py    # PDF processing and OCR
│   │   │   └── manifest.py         # Manifest management
│   │   ├── storage/
│   │   │   ├── __init__.py
│   │   │   ├── base.py             # Storage abstraction
│   │   │   ├── local.py            # Local filesystem adapter
│   │   │   └── gcs.py              # Google Cloud Storage adapter
│   │   ├── utils/
│   │   │   ├── __init__.py
│   │   │   ├── id_utils.py         # ID generation utilities
│   │   │   ├── async_utils.py      # Async helpers & rate limiting
│   │   │   ├── ocr_utils.py        # OCR interface
│   │   │   └── logger.py           # Logging configuration
│   │   └── tests/                  # Unit and integration tests
│   │
│   ├── service/                    # Main backend service (replaces "agent")
│   │   ├── Dockerfile              # Service Docker configuration
│   │   ├── .env.example            # Example environment variables
│   │   ├── main.py                 # FastAPI entry point
│   │   ├── routes.py               # API endpoints
│   │   ├── models.py               # Data models
│   │   ├── config.py               # Service configuration
│   │   ├── graph.py                # LangGraph definition
│   │   ├── tools.py                # Service tools
│   │   └── utils/
│   │       ├── retriever.py        # Vector retrieval
│   │       └── logger.py           # Logging and observability
│   │
│   ├── storage/                    # Storage configuration
│   │   ├── qdrant_config.yaml      # Qdrant vector DB config
│   │   └── metadata_schema.sql     # Metadata schema if needed
│   │
│   └── cloud/                      # Cloud deployment configs
│       ├── etl-cloudrun.yaml       # ETL Cloud Run config
│       └── service-cloudrun.yaml   # Service Cloud Run config
│
├── frontend/
│   ├── Dockerfile                  # Frontend Docker configuration
│   ├── .env.example                # Example environment variables
│   ├── app.py                      # Single Streamlit application file
│   └── utils.py                    # Frontend utility functions
│
└── scripts/                        # Utility scripts
    ├── setup.sh                    # Project setup
    ├── deploy.sh                   # Deployment to GCP
    └── storage_switch.sh           # Script to switch between local/cloud storage
```
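To make the storage layout above concrete, `backend/etl/storage/` suggests a small adapter interface so the pipeline can write to the local filesystem during development and to Google Cloud Storage in production. The sketch below is illustrative only; the class and method names are assumptions:

```python
from abc import ABC, abstractmethod
from pathlib import Path


class StorageBackend(ABC):
    """Hypothetical interface for backend/etl/storage/base.py."""

    @abstractmethod
    def save(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def load(self, key: str) -> bytes: ...


class LocalStorage(StorageBackend):
    """Local filesystem adapter (development)."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def save(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def load(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


class GCSStorage(StorageBackend):
    """Google Cloud Storage adapter (production)."""

    def __init__(self, bucket_name: str) -> None:
        from google.cloud import storage  # requires google-cloud-storage

        self.bucket = storage.Client().bucket(bucket_name)

    def save(self, key: str, data: bytes) -> None:
        self.bucket.blob(key).upload_from_string(data)

    def load(self, key: str) -> bytes:
        return self.bucket.blob(key).download_as_bytes()
```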
Tech stack:
- ETL Pipeline: GitHub Actions, modern OCR tools (Docling/Marker/Gemini Flash 2)
- Vector Storage: Qdrant for embeddings, with Cohere for reranking (see the sketch after this list)
- Agent System: LangGraph for agentic RAG workflows
- Backend API: FastAPI, Python 3.12+
- Frontend: Streamlit or Chainlit for MVP
- Deployment: Google Cloud Run
- Package Management: uv
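As a rough sketch of how the Qdrant and Cohere pieces of this stack might fit together, retrieval could pull dense candidates from Qdrant and then rerank them with Cohere before they reach the agent. The collection name, payload field, and the query embedding (passed in as a vector here) are assumptions:

```python
import os

import cohere
from qdrant_client import QdrantClient

qdrant = QdrantClient(url="http://localhost:6333")        # assumed local Qdrant instance
co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])  # loaded from .env in practice


def retrieve_and_rerank(query: str, query_vector: list[float], top_k: int = 20, top_n: int = 5) -> list[str]:
    # Stage 1: dense retrieval of candidate chunks from Qdrant.
    hits = qdrant.search(
        collection_name="growthlab_publications",  # hypothetical collection name
        query_vector=query_vector,
        limit=top_k,
    )
    documents = [hit.payload["text"] for hit in hits]  # assumes a "text" payload field

    # Stage 2: rerank the candidates with Cohere and keep the most relevant few.
    reranked = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=documents,
        top_n=top_n,
    )
    return [documents[r.index] for r in reranked.results]
```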
Requirements:
- Docker and Docker Compose
- Python 3.12+
- `uv` for project management (see the uv documentation)
- GCP account and credentials (for production)
- API keys for OpenAI, Anthropic, etc.
To set up the project:

- Clone the repository:

  ```bash
  git clone https://github.com/shreyasgm/gl_deep_search.git
  cd gl_deep_search
  ```

- Run `uv` in the CLI to check that it is available, then run `uv sync` to install dependencies and create the virtual environment. This installs only the core dependencies specified in the `pyproject.toml` file. To install dependencies that belong to a specific component (i.e., optional dependencies), use:

  ```bash
  # For a single optional component
  uv sync --extra etl

  # For multiple optional components (repeat the flag for each group)
  uv sync --extra etl --extra frontend
  ```

- To add new packages to the project, use the following format:

  ```bash
  # Add a package to a specific group (etl, service, frontend, dev, prod)
  uv add package_name --optional group_name

  # Example: add seaborn to the service group
  uv add seaborn --optional service
  ```

- Create and configure environment files:

  ```bash
  cp backend/etl/.env.example backend/etl/.env
  cp backend/service/.env.example backend/service/.env
  cp frontend/.env.example frontend/.env
  ```

- Add your API keys and configuration to the `.env` files.
The project uses Docker for consistent development and deployment environments:
- Start the complete development stack:

  ```bash
  docker-compose up
  ```

- Access local services:
  - Frontend UI: http://localhost:8501
  - Backend API: http://localhost:8000
  - API Documentation: http://localhost:8000/docs

- Run individual components:

  ```bash
  # Run only the ETL service
  docker-compose up etl

  # Run only the backend service
  docker-compose up service

  # Run only the frontend
  docker-compose up frontend
  ```
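The backend API reachable at http://localhost:8000 is served by FastAPI (`backend/service/main.py`). A minimal sketch of what an endpoint there might look like; the `/ask` route and payload shape are hypothetical:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Growth Lab Deep Search")


class AskRequest(BaseModel):
    question: str


@app.post("/ask")
def ask(payload: AskRequest) -> dict:
    # Placeholder: the real service would invoke the LangGraph workflow here
    # and return the generated answer plus its source documents.
    return {"answer": f"You asked: {payload.question}", "sources": []}
```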
The ETL pipeline supports both development and production environments through containerized deployment. Initial data ingestion and processing of historical documents are executed on High-Performance Computing (HPC) infrastructure using the SLURM workload manager. Incremental updates for new documents are handled through Google Cloud Run.
```bash
# Development: Execute ETL pipeline in local environment
docker-compose run --rm etl python main.py

# Production: Initial bulk processing via HPC/SLURM
sbatch scripts/slurm_etl_initial.sh

# Component-specific execution
docker-compose run --rm etl python main.py --component scraper
docker-compose run --rm etl python main.py --component processor
docker-compose run --rm etl python main.py --component embedder
```
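The `--component` flag above implies that `backend/etl/main.py` dispatches to individual pipeline stages. A minimal sketch of such an entry point, with placeholder stage functions standing in for the real implementations:

```python
import argparse


def run_scraper() -> None:
    """Placeholder: harvest Growth Lab publications and OpenAlex metadata."""


def run_processor() -> None:
    """Placeholder: OCR and chunk the downloaded PDFs."""


def run_embedder() -> None:
    """Placeholder: embed chunks and load them into the vector store."""


COMPONENTS = {
    "scraper": run_scraper,
    "processor": run_processor,
    "embedder": run_embedder,
}


def main() -> None:
    parser = argparse.ArgumentParser(description="Growth Lab Deep Search ETL pipeline")
    parser.add_argument(
        "--component",
        choices=list(COMPONENTS),
        help="Run a single pipeline stage; omit to run the full pipeline",
    )
    args = parser.parse_args()

    stages = [COMPONENTS[args.component]] if args.component else list(COMPONENTS.values())
    for stage in stages:
        stage()


if __name__ == "__main__":
    main()
```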
After the initial processing, data is migrated to Google Cloud Storage. Subsequent ETL operations are orchestrated through automated GitHub Actions workflows and executed on Google Cloud Run.
- Development occurs in a local Docker environment
- Code is pushed to GitHub
- GitHub Actions triggers:
  - Code testing
  - Building and publishing container images
  - Deploying to Cloud Run
Production infrastructure:
- ETL Pipeline: Scheduled Cloud Run jobs triggered by GitHub Actions
- Backend Service: Cloud Run with autoscaling
- Vector Database: Managed Qdrant instance or Qdrant Cloud
- Document Storage: Cloud Storage
- Frontend: Streamlit or Chainlit
```bash
# Deploy to development environment
./scripts/deploy.sh dev

# Deploy to production environment
./scripts/deploy.sh prod
```
- Create a feature branch from `main`
- Implement your changes with tests
- Submit a pull request for review
```bash
# Run tests
pytest

# Run with coverage
pytest --cov=backend
```
Development and production environments are managed through Docker and GitHub Actions:
```bash
# Deploy to development
./scripts/deploy.sh dev

# Deploy to production
./scripts/deploy.sh prod
```
- API keys and secrets are managed via `.env` files (not committed to GitHub)
- Production secrets are stored in GCP Secret Manager
- Access control is implemented at the API level
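As an illustration of that setup, a service could prefer values from the local `.env` in development and fall back to GCP Secret Manager in production. This is a sketch only; the project ID and secret names are placeholders:

```python
import os

from google.cloud import secretmanager  # requires google-cloud-secret-manager


def get_secret(name: str, project_id: str = "your-gcp-project") -> str:
    """Return a secret from the environment (.env) or, failing that, GCP Secret Manager."""
    # Local development: values come from the .env file loaded into the environment.
    value = os.getenv(name)
    if value:
        return value

    # Production: fetch the latest version of the secret from Secret Manager.
    client = secretmanager.SecretManagerServiceClient()
    resource = f"projects/{project_id}/secrets/{name}/versions/latest"
    response = client.access_secret_version(name=resource)
    return response.payload.data.decode("utf-8")
```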
This project is licensed under CC-BY-NC-SA 4.0. See the LICENSE file for details.