DocIntel is a document intelligence platform that handles document upload, processing, embedding, and semantic search, with AI-powered chat over your documents.
Features:
- Document Processing: Upload and process PDF, DOCX, PPTX, Excel files
- Intelligent Chunking: Smart document chunking for optimal retrieval
- Vector Embeddings: Generate and store embeddings for semantic search
- AI-Powered Chat: Chat with your documents using AI
- Citation Support: Get citations with source document context
- Multi-User Support: Document collections separated by user
- Parallel Processing: Efficient document processing with parallel embedding generation
- OCR Capabilities: Extract text from scanned documents and images
Tech stack:
- Backend: FastAPI
- Vector Database: Qdrant
- Embeddings: Azure OpenAI
- Containerization: Docker
- Deployment: Fly.io
Prerequisites:
- Python 3.8+
- Docker (optional)
To run locally:
- Clone the repository:
    git clone https://github.com/yourusername/docintel.git
    cd docintel
- Install dependencies:
    pip install -r requirements.txt
- Run the application:
    uvicorn app.main:app --reload
To run with Docker instead:
- Build the Docker image:
    docker build -t docintel .
- Run the container:
    docker run -p 8000:8000 docintel
Upload documents using the /upload endpoint. Supported file types include:
- PDF (.pdf)
- Word Documents (.docx)
- PowerPoint Presentations (.pptx)
- Excel Spreadsheets (.xlsx, .xls, .csv)
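A client can check a file's extension against the list above before uploading. The following is a minimal sketch, not part of DocIntel itself: the names SUPPORTED_EXTENSIONS and is_supported are hypothetical, and the server may apply its own, stricter validation.

```python
from pathlib import Path

# Extensions copied from the supported file types listed above.
# The constant and function names are illustrative, not DocIntel's own.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx", ".xls", ".csv"}


def is_supported(filename: str) -> bool:
    """Return True if the file's final extension is supported (case-insensitive)."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    for name in ["report.PDF", "slides.pptx", "notes.txt"]:
        print(name, is_supported(name))
```

Only the final suffix is checked, so a name like archive.tar.gz is treated as .gz and rejected.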
Use the /query endpoint to search through processed documents with natural language queries.
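As a rough illustration, a query request could be built with the Python standard library as shown below. This is a sketch under assumptions: the POST method, the JSON field names "query" and "top_k", and the base URL (port 8000, as in the Docker command) are not confirmed by this README. FastAPI serves interactive API docs at /docs by default, which is the place to check the actual schema.

```python
import json
import urllib.request

# Assumption: the service is reachable locally on port 8000.
BASE_URL = "http://localhost:8000"


def build_query_request(question: str, top_k: int = 5) -> urllib.request.Request:
    """Build (but do not send) a request for the /query endpoint.

    The HTTP method and the field names "query" and "top_k" are
    hypothetical; the real request schema may differ.
    """
    body = json.dumps({"query": question, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_query_request("What does the contract say about renewal?")
    print(request.get_method(), request.full_url)
    # With a running server, the request can be sent via
    # urllib.request.urlopen(request) and the JSON response read from it.
```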
Create a chat session with the /sessions endpoint and send messages to interact with your documents.
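The chat flow has two steps: create a session, then post messages to it. The sketch below only constructs the two requests. The paths come from the endpoint list in this README, while the base URL, the empty session body, and the "content" field name are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the service is reachable locally on port 8000.
BASE_URL = "http://localhost:8000"
JSON_HEADERS = {"Content-Type": "application/json"}


def create_session_request() -> urllib.request.Request:
    """Build a POST /sessions request. An empty JSON body is assumed."""
    return urllib.request.Request(
        url=f"{BASE_URL}/sessions",
        data=b"{}",
        headers=JSON_HEADERS,
        method="POST",
    )


def send_message_request(session_id: str, text: str) -> urllib.request.Request:
    """Build a POST /sessions/{session_id}/messages request.

    The "content" field name is hypothetical; check the API schema.
    """
    # Escape the id so it is safe to embed as a single path segment.
    safe_id = urllib.parse.quote(session_id, safe="")
    body = json.dumps({"content": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/sessions/{safe_id}/messages",
        data=body,
        headers=JSON_HEADERS,
        method="POST",
    )


if __name__ == "__main__":
    print(create_session_request().full_url)
    print(send_message_request("abc123", "Summarize the uploaded report.").full_url)
```

In practice the session id would be read from the response to the first request before building the second.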
Document endpoints:
- POST /upload - Upload and process a document
- GET /list - List all documents
- GET /{document_id} - Get document details
- DELETE /{document_id} - Delete a document
- GET /{document_id}/file - Download original document
- GET /statistics - Get document statistics
Chat endpoints:
- POST /sessions - Create a new chat session
- GET /sessions - List chat sessions
- GET /sessions/{session_id} - Get chat session details
- DELETE /sessions/{session_id} - Delete a chat session
- GET /sessions/{session_id}/messages - Get chat history
- POST /sessions/{session_id}/messages - Send a message
- GET /messages/{message_id} - Get a specific message
- GET /messages/{message_id}/citations - Get citations for a message
- GET /citations/{document_id}/{chunk_id} - Get citation source
DocIntel uses a modular architecture:
- Parser Module - Extracts text and metadata from different document types
- Chunking Module - Divides documents into manageable chunks for processing
- Embedding Module - Generates vector embeddings for document chunks
- Storage Module - Manages vector storage using Qdrant
- Chat Module - Handles chat sessions and message processing
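To make the chunk, embed, and retrieve flow concrete, here is a self-contained toy version. It is not DocIntel's implementation: word-window chunking is just one possible strategy, toy_embed is a hashed bag-of-words stand-in for the Azure OpenAI embeddings, and the in-memory search stands in for Qdrant. All function names and parameters are hypothetical.

```python
import math
import zlib
from typing import List, Tuple


def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into windows of chunk_size words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text: str, dim: int = 64) -> List[float]:
    """Hashed bag-of-words vector, L2-normalized. A stand-in for a real model."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def search(query: str, chunks: List[str], top_k: int = 3) -> List[Tuple[float, str]]:
    """Rank chunks by cosine similarity to the query (dot product of unit vectors)."""
    q = toy_embed(query)
    scored = [(sum(a * b for a, b in zip(q, toy_embed(c))), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


if __name__ == "__main__":
    document = "Qdrant stores the vectors. FastAPI serves the API. Fly.io hosts the app."
    pieces = chunk_text(document, chunk_size=5, overlap=1)
    for score, chunk in search("where are vectors stored", pieces, top_k=2):
        print(round(score, 3), chunk)
```

The overlap keeps sentences that straddle a chunk boundary retrievable from either side, at the cost of storing some words twice.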
DocIntel can be deployed to Fly.io using the included fly.toml configuration:
fly launch