The OMOP CSV Validator is a CLI tool that validates CSV files against JSON schemas generated from OMOP Common Data Model (CDM) DDL files.

OMOP-CSV-Validator

Links

📦 CPAN Distribution: https://metacpan.org/pod/OMOP::CSV::Validator

OMOP CSV Validator

The OMOP CSV Validator is a CLI tool (and module) that validates OMOP CDM CSV files against their expected data types. Rather than relying solely on Types::Standard or similar libraries, it converts SQL schemas derived from the OMOP Common Data Model (CDM) PostgreSQL DDL files into JSON schemas. It then utilizes JSON::Validator, which scales efficiently with large datasets and provides meaningful error messages.
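
To make the DDL-to-JSON-schema idea concrete, here is a minimal sketch of how a single SQL column type might map to a JSON Schema fragment. The function name ddl_type_to_json and most of the mapping are assumptions made for illustration; only the TIMESTAMP to date-time translation is stated in this README, and the module's actual conversion may differ.

```shell
# Hypothetical helper: map one SQL column type to a JSON Schema fragment.
# Only the TIMESTAMP -> date-time rule is documented; the rest is assumed.
ddl_type_to_json() {
  # Lowercase the SQL type so TIMESTAMP and timestamp are treated alike.
  sql_type=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$sql_type" in
    integer|bigint)
      printf '{"type":"integer"}\n' ;;
    numeric|float)
      printf '{"type":"number"}\n' ;;
    date)
      printf '{"type":"string","format":"date"}\n' ;;
    timestamp)
      printf '{"type":"string","format":"date-time"}\n' ;;
    "varchar("*")")
      # Carry the declared length over as maxLength.
      len=${sql_type#"varchar("}
      len=${len%")"}
      printf '{"type":"string","maxLength":%s}\n' "$len" ;;
    *)
      # Fall back to an unconstrained string for anything else.
      printf '{"type":"string"}\n' ;;
  esac
}
```

For example, ddl_type_to_json 'varchar(50)' prints a string schema with maxLength 50. To see the schemas the tool really generates, use the --save-schemas option described under Options.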

Features

  • DDL Parsing: Automatically converts PostgreSQL OMOP CDM DDL into JSON schemas.
  • Version Independent: Works with any OMOP CDM DDL version (e.g., 5.3, 5.4).
  • CSV Validation: Validates CSV files using JSON::Validator.
  • Modular Design: Separate CLI and module for easy testing and integration.

Installation

This project uses cpanm along with a cpanfile to manage dependencies. It is recommended to install dependencies locally using local::lib.

Step 1: Install cpanminus

If you don't have cpanm installed, run:

sudo apt-get install cpanminus

If gcc and the other standard build tools are not installed, run:

sudo apt-get install gcc make libperl-dev

Step 2: Set Up local::lib

Configure a local library in your home directory. For example:

cpanm --local-lib=~/perl5 local::lib && eval $(perl -I ~/perl5/lib/perl5/ -Mlocal::lib)

Then add this setting to your shell profile (e.g., ~/.bashrc or ~/.zshrc) so that every new shell picks up your local library:

echo 'eval $(perl -I ~/perl5/lib/perl5/ -Mlocal::lib)' >> ~/.bashrc

Step 3: Download and Install

From CPAN

From any directory:

cpanm -n OMOP::CSV::Validator

From GitHub

  1. Clone the repository:
git clone https://github.com/mrueda/omop-csv-validator.git
cd omop-csv-validator
  2. Install the dependencies:
cpanm -n --installdeps .

This command reads the included cpanfile and installs all required dependencies into your local library directory.

Usage

Command-Line Interface

Once dependencies are installed, you can run the CLI tool as follows:

(If you installed from CPAN, you can simply run omop-csv-validator instead of bin/omop-csv-validator.)

bin/omop-csv-validator --ddl path/to/OMOPCDM_ddl.sql --input path/to/data.csv --sep ","

With the included example data:

bin/omop-csv-validator --ddl ddl/OMOPCDM_postgresql_5.4_ddl.sql -i example/DRUG_EXPOSURE.csv --sep $'\t'
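
If you have several CSV files to check, the same command can be wrapped in a loop. The following is a minimal sketch, not part of the distribution: validate_dir and the VALIDATOR variable are names invented here, and only the --ddl, --input and --sep options documented below are used. Whether the tool returns a non-zero exit status on validation errors is not stated in this README, so the sketch simply prints each file's output.

```shell
# Command to run; override it, e.g. VALIDATOR=bin/omop-csv-validator,
# when working from a repository checkout.
VALIDATOR=${VALIDATOR:-omop-csv-validator}

# Hypothetical helper: validate every *.csv file in a directory.
# $1 = DDL file, $2 = directory with CSV files, $3 = separator (default: comma)
validate_dir() {
  vd_ddl=$1
  vd_dir=$2
  vd_sep=${3:-,}
  for vd_file in "$vd_dir"/*.csv; do
    # Skip the literal pattern when the glob matches nothing.
    [ -e "$vd_file" ] || continue
    printf '== %s ==\n' "$vd_file"
    "$VALIDATOR" --ddl "$vd_ddl" --input "$vd_file" --sep "$vd_sep"
  done
}
```

With the included example data, this could be called as validate_dir ddl/OMOPCDM_postgresql_5.4_ddl.sql example $'\t'.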

[Figure: example of a validation error reported for table person]

Options

Usage:
      omop-csv-validator --ddl DDL.sql --input DATA.csv [--sep $'\t'] [--table person] [--save-schemas schemas.json]

Options:
    --ddl
        (required) Path to the PostgreSQL DDL file defining OMOP CDM table
        structures.

    --input
        (required) Path to the input CSV file to validate.

    --sep
        CSV field separator (default: comma). For tab, use: --sep $'\t'

    --table, -t
        (optional) Table name to validate against. If not provided, the
        script will attempt to derive the table name from the CSV filename.

    --save-schemas
        (optional) Path to a file where the DDL-derived schemas should be
        saved (in JSON format).

    --no-color, -nc
        (optional) Turn off colored output on STDOUT.

    --help, -h
        Display this help message.

    --version, -V
        Show the script's version (which corresponds to
        "OMOP::CSV::Validator::VERSION").
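
The exact rule used to derive the table name from the CSV filename is not spelled out above. As an illustration only, one plausible rule (strip the directory and extension, then lowercase) can be sketched like this; table_from_filename is a hypothetical helper, not something the tool provides.

```shell
# Hypothetical helper: guess an OMOP table name from a CSV path.
# Assumed rule: drop the directory and extension, then lowercase.
table_from_filename() {
  tf_name=$(basename "$1")
  tf_name=${tf_name%.*}    # remove the last extension, e.g. ".csv"
  printf '%s\n' "$tf_name" | tr '[:upper:]' '[:lower:]'
}
```

Under this assumption, example/DRUG_EXPOSURE.csv would map to drug_exposure. When a filename does not correspond to a table name, passing --table explicitly avoids any guesswork.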

Running Tests

To run the test suite, execute:

prove -l t/

Utilities

  • reorder-csv.pl

See the utils directory.

Notes

TIMESTAMP format

The TIMESTAMP field from the DDL is translated to the date-time format as defined in the JSON Schema specification (ISO 8601):

YYYY-MM-DDTHH:MM:SSZ

If your TIMESTAMP fields (e.g., visit_end_datetime) contain date-only values (YYYY-MM-DD), the validator reports the missing time component; you can ignore those validation errors.
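
If you would rather make date-only values pass validation than ignore the errors, one option is to pad them before validating. The sketch below is not part of this project; pad_date_column is a hypothetical helper, and appending T00:00:00Z assumes that midnight UTC is an acceptable stand-in for the missing time. It splits on the separator naively, so it does not handle quoted fields that contain the separator.

```shell
# Hypothetical helper: append a midnight UTC time to date-only values in
# one column. Reads a delimited file on stdin and writes it to stdout.
# $1 = 1-based column number, $2 = field separator (default: tab)
pad_date_column() {
  pd_sep=${2:-'\t'}
  awk -v col="$1" -v sep="$pd_sep" '
    BEGIN { FS = sep; OFS = sep }
    # Leave the header row (NR == 1) and full timestamps untouched.
    NR > 1 && $col ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ {
      $col = $col "T00:00:00Z"
    }
    { print }
  '
}
```

For example, pad_date_column 2 < input.tsv > padded.tsv rewrites the second column of a tab-separated file (input.tsv and padded.tsv are placeholder names).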

Integer Length

Depending on how you generate integers for person_id, you might end up with values longer than 20 characters. When this occurs, the validator will report an error like /person_id: Expected integer - got number. In such cases, you can safely ignore or discard the message.
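
To see in advance which rows are likely to trigger this message, you can list the identifiers that exceed a given number of characters. This is a minimal sketch under the same naive-splitting assumption (no quoted fields containing the separator); find_long_ids is a hypothetical helper, and the threshold is passed as a parameter rather than fixed.

```shell
# Hypothetical helper: report values in one column that are longer than a
# given number of characters. Reads a delimited file on stdin.
# $1 = 1-based column number, $2 = maximum length, $3 = separator (default: tab)
find_long_ids() {
  fl_sep=${3:-'\t'}
  awk -v col="$1" -v max="$2" -v sep="$fl_sep" '
    BEGIN { FS = sep }
    # Skip the header row; print "line number: value" for long values.
    NR > 1 && length($col) > max { print NR ": " $col }
  '
}
```

For example, find_long_ids 1 20 < PERSON.csv lists the lines whose first column is longer than 20 characters (PERSON.csv is a placeholder name).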

Author

Written by Manuel Rueda, PhD. Info about CNAG can be found at https://www.cnag.eu.

Contributing

Contributions, issues, and feature requests are welcome. Please check the issues page for details.

License

This project is released under the Artistic License 2.0.
