AI-Driven Keyword Extraction & Tag Generation

This project provides an end-to-end, LLM-powered system for structured information extraction and tag generation from multimedia content. It is designed to help organizations understand which athletes, teams, disciplines, and events are being covered in media, and enrich visual content with consistent tags for search and content management.

Features

Entity Extraction from Articles
- Extracts company-related athletes, teams, disciplines, and events
- Configurable GPT-4.1-based pipeline with support for multiple reruns and temperature tuning
- Consolidation logic to merge multiple runs for maximum recall
Tag Generation from Images
- Uses GPT-4 Vision API to describe images with high-quality tags
- Detects subjects, actions, settings, brand elements, and technical components
- Tag consolidation prompt ensures consistent and relevant output
Prompt Engineering Framework
- Structured prompt design (Goal, Format, Constraints, Context)
- Evaluation loop based on test tiers: minimal, small, and full sample
- Built-in evaluation criteria (recall, precision, F1, confidence-based scoring planned)
Front-End Apps (Streamlit)
- Configurable UIs for both text and image pipelines
- Advanced settings panel for pro users to adjust model, temperature, runs

Tech Stack

Backend: Python 3, OpenAI GPT-4.1 via Responses API
Frontend: Streamlit

Evaluation Highlights

Multi-model and rerun comparison (GPT-4.1 vs GPT-4.1 mini)
A/B testing setup to compare prompts and models
Prompt consolidation via LLM to ensure structured, consistent results
Modular design for versioned prompt & model swapping

Project Structure

├── entityExtraction.py        # Entity extraction logic from article JSONs
├── entityExtraction.ipynb     # Jupyter notebook for iterative prompt tuning and testing
├── tagGeneration.py           # Image tag generation pipeline using GPT-4 Vision
├── tagGeneration.ipynb        # Visual exploration and prompt iterations for tagging
├── ProjectPresentation.pdf    # Project presentation for case study
├── LICENSE                    # MIT license
└── README.md                  # This file

Setup & Usage

Clone the repository
Install dependencies
Set environment variables
Run extraction or tagging
- Use entityExtraction.ipynb for articles
- Use tagGeneration.ipynb for image tag generation
Optionally, launch the Streamlit apps
- streamlit run entityExtraction.py
- streamlit run tagGeneration.py

Future Enhancements

Confidence scores for model outputs
- Domain-specific semantic validation
- Adaptive prompt rerunning based on low-confidence tags
- Fine-tuning or model personalization
Human-in-the-loop feedback interface

License

MIT License – see LICENSE for details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AI-Driven Keyword Extraction & Tag Generation

Features

Tech Stack

Evaluation Highlights

Project Structure

Setup & Usage

Future Enhancements

License

About

Releases 1

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.gitattributes		.gitattributes
LICENSE		LICENSE
ProjectPresentation.pdf		ProjectPresentation.pdf
README.md		README.md
entitiyExtraction.ipynb		entitiyExtraction.ipynb
entityExtraction.py		entityExtraction.py
tagGeneration.ipynb		tagGeneration.ipynb
tagGeneration.py		tagGeneration.py

License

thomaslaner/AiDrivenKeywordExtractionTagGeneration

Folders and files

Latest commit

History

Repository files navigation

AI-Driven Keyword Extraction & Tag Generation

Features

Tech Stack

Evaluation Highlights

Project Structure

Setup & Usage

Future Enhancements

License

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 1

Languages