This research explores data attribution methods for identifying subsets of a training set that contribute unequally to a model's performance. Two approaches are tested: (1) Captum's Layer Integrated Gradients for feature attribution in text classification, and (2) a self-implemented perplexity-based text-quality filtering method, inspired by a recent Meta research paper, for translation tasks. Additionally, we examine the effect of augmenting the top 10% and bottom 10% of training samples in both tasks using established techniques: Easy Data Augmentation (EDA), An Easier Data Augmentation (AEDA), backtranslation (English ↔ French), and paraphrasing with the instruction-following variant of the Llama 3.1 (8B) model.
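As a reference point, the sketch below shows how Layer Integrated Gradients can be applied to a BERT classifier with Captum. The checkpoint, baseline construction, and target class here are illustrative assumptions for a minimal runnable example; attribution/lig.py may differ in its details.

```python
# Minimal sketch: Layer Integrated Gradients over BERT's embedding layer.
# Model name, baseline, and target class are illustrative assumptions.
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

def forward_func(input_ids, attention_mask):
    # Return the class logits so attributions explain the target class score.
    return model(input_ids=input_ids, attention_mask=attention_mask).logits

text = "a gripping, beautifully shot film"
enc = tokenizer(text, return_tensors="pt")
input_ids, attention_mask = enc["input_ids"], enc["attention_mask"]

# Baseline: the same sequence with every non-special token replaced by [PAD].
baseline_ids = torch.full_like(input_ids, tokenizer.pad_token_id)
baseline_ids[0, 0] = tokenizer.cls_token_id
baseline_ids[0, -1] = tokenizer.sep_token_id

# Attribute the target-class logit to the embedding layer.
lig = LayerIntegratedGradients(forward_func, model.bert.embeddings)
attributions = lig.attribute(
    inputs=input_ids,
    baselines=baseline_ids,
    additional_forward_args=(attention_mask,),
    target=1,  # index of the class being explained (assumed binary head)
    n_steps=50,
)
# Collapse the embedding dimension to get one attribution score per token.
token_scores = attributions.sum(dim=-1).squeeze(0)
```

Summing token-level scores over a training example gives a single importance value, which can then be used to rank samples across the training set.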
|- assets
|- attribution
|-- lig.py
|-- perplexity.py
|- augmentation
|-- An Easier Data Augmentation
|-- Backtranslation
|-- Easy Data Augmentation
|-- Paraphrasing
|- datasets
|- plots
|- train
|-- bert.py
|-- mT5.py
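The perplexity-based filtering idea can be sketched as follows: score each sentence with a pretrained language model and keep only the most fluent fraction of the corpus. The scoring model (GPT-2) and the keep fraction below are assumptions for illustration; attribution/perplexity.py may use a different model and threshold.

```python
# Hedged sketch of perplexity-based quality filtering.
# Scoring model and keep fraction are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    # Per-token perplexity: exp of the mean negative log-likelihood.
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

sentences = ["The weather is nice today.", "asdf qwer zxcv uiop"]
scores = [(s, perplexity(s)) for s in sentences]

# Keep the lowest-perplexity (most fluent) fraction of the corpus.
keep_frac = 0.9  # illustrative threshold, not the paper's value
scores.sort(key=lambda x: x[1])
kept = [s for s, _ in scores[: int(len(scores) * keep_frac)]]
```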
The datasets for both tasks are available as .csv files in this repository.
Task 1 - Classification
- SST2 - binary classification dataset
- SST5 - fine-grained (5-class) sentiment classification dataset
- IMDB - binary classification dataset
- AGNews10K - multi-class (4-class) news topic classification dataset
Task 2 - Language Translation
- English to Afrikaans
- English to Welsh
- English to Czech
- English to Spanish
- English to Romanian
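On the augmentation side, the sketch below illustrates backtranslation (English → French → English) using MarianMT checkpoints. The model IDs are real Hugging Face identifiers, but the repository's backtranslation code may use different models or generation settings.

```python
# Backtranslation sketch: English -> French -> English with MarianMT.
# Model IDs exist on the Hugging Face Hub; parameters are illustrative.
from transformers import MarianMTModel, MarianTokenizer

def load(name):
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

en_fr_tok, en_fr = load("Helsinki-NLP/opus-mt-en-fr")
fr_en_tok, fr_en = load("Helsinki-NLP/opus-mt-fr-en")

def translate(texts, tok, model):
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch, max_new_tokens=128)
    return tok.batch_decode(generated, skip_special_tokens=True)

def backtranslate(texts):
    # Round-trip through French to produce paraphrase-like variants.
    return translate(translate(texts, en_fr_tok, en_fr), fr_en_tok, fr_en)

print(backtranslate(["The movie was surprisingly good."]))
```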