DEMIS Knowledge Base

Until sequence data upload via DEMIS becomes possible, an SFTP server at the RKI is used as an interim solution. Information on the upload process can be found here.

IGS Data Submission Format Specification

This document describes how to submit data via the RKI SFTP server.


Involved parties

  1. Data provider: An institution that delivers IGS-relevant data to the RKI (e.g., NRZ).
  2. Data receiver: The IGS team at the RKI.

Types of data

Data providers submit sequencing data and metadata that need processing within the IGS project. Each submitted sample should have a unique sample ID.

Sample IDs

Each sample must be identifiable by an ID that is unique among all samples submitted by the data provider. The sample ID may contain uppercase and lowercase characters, digits, dashes, and underscores (regular expression: [A-Za-z0-9_-]+).
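
A minimal sketch of checking a sample ID against the regular expression above. The function name is_valid_sample_id is hypothetical and not part of any IGS tooling.

```python
import re

# Pattern taken from the specification: letters, digits, underscores, dashes.
SAMPLE_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_sample_id(sample_id: str) -> bool:
    """Return True if the whole string consists only of allowed characters."""
    # fullmatch ensures no other characters appear anywhere in the ID.
    return SAMPLE_ID_PATTERN.fullmatch(sample_id) is not None


print(is_valid_sample_id("Sample-324"))  # True
print(is_valid_sample_id("G612 S1"))     # False (contains a space)
```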

Sequencing Data

NGS Reads

Two .fastq.gz files are expected for each sample: Forward and reverse reads (indicated by R1 and R2 in the filename).

File name pattern: {sample-id}_{...}.fastq.gz

Genomes

For each sample, one .fasta file is expected.

File name pattern: {sample-id}.fasta

Metadata

All pre-defined fields are described in detail in the Metadata Specification file provided with this documentation.

Sample metadata can be provided in three different formats:

  1. EXCEL: One .xlsx file containing all sample metadata of a submission. A template is provided with this documentation.
  2. JSON: One .json file for each sample.
  3. CSV/TSV: A .csv/.tsv file, mirroring the Excel format.

The CSV/TSV format is discouraged because it regularly leads to the following processing issues:

  • Inconsistent delimiting characters, e.g., semicolons in .csv files
  • Inconsistent escaping of entries containing delimiting characters or line breaks

Formatting of multi-value fields

The fields Files and Uploads can have multiple entries. In that case, the respective column names carry a counter after the first word (EXCEL, CSV, and TSV):

  • FILE_1_NAME
  • FILE_1_SHA256SUM
  • FILE_2_NAME
  • FILE_2_SHA256SUM
  • ...

In JSON, these fields should be sent as arrays:

{
    ...,
    "Files": [
        {
            "FILE_NAME": "...",
            "FILE_SHA256SUM": "..."
        },
        {
            "FILE_NAME": "...",
            "FILE_SHA256SUM": "..."
        }
    ]
}
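
The relationship between the numbered table columns and the JSON array can be sketched as follows. This is only an illustration of the mapping for the Files field, assuming columns named FILE_<counter>_<suffix>. It is not how convertSeqMetadata is implemented, and the function name collect_files and the checksum values are hypothetical placeholders.

```python
import re
from collections import defaultdict

# Assumed column layout: FILE_<counter>_<suffix>, e.g. FILE_1_NAME.
FILE_COLUMN = re.compile(r"FILE_(\d+)_(.+)")


def collect_files(row: dict) -> list:
    """Group the FILE_<n>_* columns of one table row into a list of objects."""
    grouped = defaultdict(dict)
    for column, value in row.items():
        match = FILE_COLUMN.fullmatch(column)
        # Skip unrelated columns and empty cells.
        if match and value not in (None, ""):
            counter, suffix = int(match.group(1)), match.group(2)
            # Drop the counter again: FILE_1_NAME becomes FILE_NAME.
            grouped[counter][f"FILE_{suffix}"] = value
    # Order the entries by their counter.
    return [grouped[counter] for counter in sorted(grouped)]


row = {
    "HOST": "Homo sapiens",
    "FILE_1_NAME": "sample1_1.fastq.gz",
    "FILE_1_SHA256SUM": "placeholder-checksum-1",
    "FILE_2_NAME": "sample1_2.fastq.gz",
    "FILE_2_SHA256SUM": "placeholder-checksum-2",
}
print({"Files": collect_files(row)})
```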

Examples

Templates and examples for EXCEL, JSON, TSV, and CSV are provided with this documentation.

JSON example:

{
    "MELDETATBESTAND": "CVDP",
    "SPECIES": "Severe acute respiratory syndrome coronavirus 2 (organism)",
    "LAB_SEQUENCE_ID": "2bf17a251b58a10e019f9a368b06ed20467414e4fafe69c09030918bf2b485b1",
    "DEMIS_NOTIFICATION_ID": "70fd3f6e-47f3-4f9c-bd2d-cf19e09f01ed",
    "STATUS": "final",
    "IGS_ID": "IMS-10087-CVDP-0016B98F-5D26-4455-BECE-A3D66EF7EE1B",
    "DATE_OF_SAMPLING": "2024-12-31",
    "DATE_OF_RECEIVING": "2024-12-31",
    "DATE_OF_SEQUENCING": "2024-12-31",
    "DATE_OF_SUBMISSION": "2024-12-31",
    "SEQUENCING_INSTRUMENT": "Illumina_MiSeq",
    "SEQUENCING_PLATFORM": "ILLUMINA",
    "ADAPTER": "Adapter1",
    "PRIMER_SCHEME": "Scheme1",
    "SEQUENCING_STRATEGY": "WGS",
    "ISOLATION_SOURCE": "Upper respiratory swab specimen (specimen)",
    "HOST": "Homo sapiens",
    "HOST_SEX": "female",
    "HOST_BIRTH_MONTH": "11",
    "HOST_BIRTH_YEAR": "1990",
    "SEQUENCING_REASON": "requested",
    "GEOGRAPHIC_LOCATION": "426",
    "ISOLATE": "Isolate 1",
    "AUTHOR": "Author A.",
    "NAME_AMP_PROTOCOL": "Protocol1",
    "PRIME_DIAGNOSTIC_LAB.DEMIS_LAB_ID": "DEMIS-10001",
    "PRIME_DIAGNOSTIC_LAB.NAME": "Lab1",
    "PRIME_DIAGNOSTIC_LAB.ADDRESS": "Labstr. 1",
    "PRIME_DIAGNOSTIC_LAB.POSTAL_CODE": "13353",
    "PRIME_DIAGNOSTIC_LAB.CITY": "Berlin",
    "PRIME_DIAGNOSTIC_LAB.FEDERAL_STATE": "Berlin",
    "PRIME_DIAGNOSTIC_LAB.COUNTRY": "DE",
    "PRIME_DIAGNOSTIC_LAB.EMAIL": "my@email.com",
    "SEQUENCING_LAB.DEMIS_LAB_ID": "DEMIS-10001",
    "SEQUENCING_LAB.NAME": "Lab1",
    "SEQUENCING_LAB.ADDRESS": "Labstr. 1",
    "SEQUENCING_LAB.POSTAL_CODE": "13353",
    "SEQUENCING_LAB.CITY": "Berlin",
    "SEQUENCING_LAB.FEDERAL_STATE": "Niedersachsen",
    "SEQUENCING_LAB.COUNTRY": "DE",
    "SEQUENCING_LAB.EMAIL": "my@email.com",
    "Files": [
        {
            "FILE_NAME": "sample1_1.fastq.gz",
            "FILE_SHA256SUM": "c16516c6d9b1dd9c0f1e8ce8baf43d42031a32fcf75f1d69f16eb4b24df6fecd"
        },
        {
            "FILE_NAME": "sample1_2.fastq.gz",
            "FILE_SHA256SUM": "c16516c6d9b1dd9c0f1e8ce8baf43d42031a32fcf75f1d69f16eb4b24df6fecd"
        }
    ],
    "Uploads": [
        {
            "UPLOAD_DATE": "2024-12-31",
            "UPLOAD_STATUS": "Accepted",
            "UPLOAD_SUBMITTER": "Tobias T.",
            "REPOSITORY_ID": "DESH-377b53a7-ce88-4346-8b24-5d10e66e9774",
            "REPOSITORY_NAME": "ENA",
            "REPOSITORY_LINK": "https://www.ena.de/DESH-377b53a7-ce88-4346-8b24-5d10e66e9774"
        }
    ]
}
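
The FILE_SHA256SUM values in the example are 64-character hexadecimal strings. A minimal sketch of producing such an entry, assuming the checksum is the lowercase hex SHA-256 digest of the file exactly as it is uploaded (an assumption; the Metadata Specification is authoritative). The helper names sha256sum and file_entry are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256sum(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large read files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_entry(path: Path) -> dict:
    """Build one element of the "Files" array for the given file."""
    return {"FILE_NAME": path.name, "FILE_SHA256SUM": sha256sum(path)}
```

Calling file_entry(Path("reads/sample1_1.fastq.gz")) would return a dictionary in the shape of the "Files" entries shown above.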


SFTP file structure

Sequencing data

All sequencing data (.fastq.gz and .fasta files) is stored inside the reads directory. File names have to start with the respective sample ID.

Valid file name patterns:

  • fastq: {sample-id}_{...}.fastq.gz
  • fasta: {sample-id}.fasta
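
Because sample IDs may themselves contain underscores, the sample ID cannot be recovered unambiguously from a .fastq.gz file name alone. A minimal sketch that instead checks a file name against a known sample ID; the function name matches_sample is hypothetical, and treating an empty {...} part as invalid is an assumption.

```python
def matches_sample(file_name: str, sample_id: str) -> bool:
    """Check a sequencing file name against the two documented patterns."""
    # fasta: {sample-id}.fasta
    if file_name == f"{sample_id}.fasta":
        return True
    # fastq: {sample-id}_{...}.fastq.gz, with a non-empty {...} part (assumed).
    prefix, suffix = f"{sample_id}_", ".fastq.gz"
    return (
        file_name.startswith(prefix)
        and file_name.endswith(suffix)
        and len(file_name) > len(prefix) + len(suffix)
    )


print(matches_sample("G612_S1_L000_R1_001.fastq.gz", "G612"))  # True
print(matches_sample("Sample-324.fasta", "Sample-324"))        # True
print(matches_sample("G86.fastq.gz", "G86"))                   # False
```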

Metadata

All sample metadata (.xlsx, .json, .csv, and .tsv files) is stored inside the metadata directory.

This directory can contain either:

  1. One .xlsx, .csv, or .tsv file (the exact filename is ignored by the data receiver), or
  2. One .json file per sample: {sample-id}_sequencing_metadata.json.
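
The two alternatives can be checked before uploading with a sketch like the following. It implements a strict reading of the rule above (mixed or unknown files count as invalid); whether the data receiver applies exactly this check is an assumption, and classify_metadata is a hypothetical helper name.

```python
TABLE_SUFFIXES = (".xlsx", ".csv", ".tsv")
JSON_SUFFIX = "_sequencing_metadata.json"


def classify_metadata(file_names: list) -> str:
    """Return 'table', 'json', or 'invalid' for the files in metadata/."""
    tables = [n for n in file_names if n.endswith(TABLE_SUFFIXES)]
    jsons = [n for n in file_names if n.endswith(JSON_SUFFIX)]
    others = [n for n in file_names if n not in tables and n not in jsons]
    if others:
        return "invalid"  # unexpected file type or JSON naming
    if len(tables) == 1 and not jsons:
        return "table"    # option 1: a single .xlsx, .csv, or .tsv file
    if jsons and not tables:
        return "json"     # option 2: one JSON file per sample
    return "invalid"      # empty directory, several tables, or a mix


print(classify_metadata(["IGS_Metadata.xlsx"]))  # table
```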

Submissions

Directory name

  • The data provider must create a folder inside the root /data/ directory on the SFTP server to hold all data belonging to a submission.
  • Each submission directory is named {date}-{name}, where {date} is the submission date in ISO 8601 format (YYYY-MM-DD). The name is optional and can contain uppercase and lowercase characters, digits, dashes, and underscores (regular expression: [a-zA-Z0-9_-]+). It is ignored by the data receiver.
  • Inside the submission directory are two sub-directories: reads and metadata.
  • Once the submission is complete, a marker file named submission-complete.txt should be created in the submission directory. When the data receiver detects this marker file, it validates the contents of this folder. It either:
    • transfers the data to the IGS systems, logs this action on the SFTP server, and deletes the submission folder, or
    • rejects the data, logs this action on the SFTP server, leaves the folder untouched, and notifies the data provider about the issue.

Examples:

  • /data/2023-12-24
  • /data/2024-01-07-January_submission_1
  • /data/2024-01-02-rest_of_2023_data
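
A minimal sketch of validating a submission directory name against the {date}-{name} convention, using the regular expression from the specification for the optional name. The function name parse_submission_dir is hypothetical.

```python
import re
from datetime import date

# {date} in YYYY-MM-DD form, optionally followed by "-{name}".
DIR_NAME = re.compile(r"(\d{4}-\d{2}-\d{2})(?:-([a-zA-Z0-9_-]+))?")


def parse_submission_dir(name: str):
    """Return (submission date, optional name) or raise ValueError."""
    match = DIR_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid submission directory name: {name!r}")
    # date.fromisoformat also rejects impossible dates such as 2024-02-30.
    return date.fromisoformat(match.group(1)), match.group(2)


print(parse_submission_dir("2023-12-24"))
print(parse_submission_dir("2024-01-07-January_submission_1"))
```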

Marker file

The marker file should always be named submission-complete.txt and placed inside the submission directory. It must be an empty file.
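
A minimal sketch of preparing this layout locally before mirroring it to the SFTP server with a client of your choice. Staging locally first is an assumption, not a requirement of this specification, and the helper names stage_submission and mark_complete are hypothetical. When uploading, the marker file should be transferred last so that the data receiver does not start validating an incomplete submission.

```python
import tempfile
from pathlib import Path


def stage_submission(root: Path, dir_name: str) -> Path:
    """Create the submission directory with its reads and metadata sub-directories."""
    submission = root / dir_name
    (submission / "reads").mkdir(parents=True, exist_ok=True)
    (submission / "metadata").mkdir(parents=True, exist_ok=True)
    return submission


def mark_complete(submission: Path) -> Path:
    """Create the empty marker file once all data has been placed."""
    marker = submission / "submission-complete.txt"
    # Writing zero bytes guarantees an empty file, even if one already existed.
    marker.write_bytes(b"")
    return marker


# Demonstration in a temporary directory standing in for /data/.
with tempfile.TemporaryDirectory() as tmp:
    submission = stage_submission(Path(tmp), "2024-01-31-January2024")
    # ... copy sequencing data into reads/ and metadata into metadata/ here ...
    mark_complete(submission)
    for entry in sorted(submission.iterdir()):
        print(entry.name)
```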

Example file structures

  1. Submission of NGS reads, using Excel as metadata format (samples: G612, G86)

    /data/
    └── 2024-01-31-January2024/
        ├── metadata/
        │   └── IGS_Metadata.xlsx
        ├── reads/
        │   ├── G612_S1_L000_R1_001.fastq.gz
        │   ├── G612_S1_L000_R2_001.fastq.gz
        │   ├── G86_S1_L000_R1_001.fastq.gz
        │   └── G86_S1_L000_R2_001.fastq.gz
        └── submission-complete.txt
  2. Submission of genomes, using JSON as metadata format (samples: Sample-324, Sample-84)

    /data/
    └── 2023-12-31/
        ├── metadata/
        │   ├── Sample-324_sequencing_metadata.json
        │   └── Sample-84_sequencing_metadata.json
        ├── reads/
        │   ├── Sample-324.fasta
        │   └── Sample-84.fasta
        └── submission-complete.txt
  3. Multiple submissions (2024-01-31-January2024 is marked as complete, 2024-02-29-February2024 is in progress)

    /data/
    ├── 2024-01-31-January2024/
    │   ├── metadata/
    │   │   └── IGS_Metadata.xlsx
    │   ├── reads/
    │   │   ├── G612_S1_L000_R1_001.fastq.gz
    │   │   ├── G612_S1_L000_R2_001.fastq.gz
    │   │   ├── G86_S1_L000_R1_001.fastq.gz
    │   │   └── G86_S1_L000_R2_001.fastq.gz
    │   └── submission-complete.txt
    └── 2024-02-29-February2024/
        ├── metadata/
        │   └── IGS_Metadata.xlsx
        └── reads/
    


Metadata files

When the metadata format is Excel, CSV, or TSV, the data receiver does not remove sample metadata from the file after a successful import. The data provider may either remove metadata that has been imported successfully or leave all submitted metadata in the file.

Data reception

The data receiver logs all actions in the file data-import.log.

Each row in this file contains the following pieces of information, separated by a comma:

  • Timestamp in ISO 8601 format, including UTC offset (e.g., 2022-09-11T10:35:04+02:00)
  • Performed operation: One of imported-sample, deleted-file, rejected
  • Details: Sample ID, file path, or other information like error messages
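
A minimal sketch of reading one row of data-import.log, assuming plain comma separation without quoting (the exact escaping used by the data receiver is not specified here). The row is split on the first two commas only, so commas inside the details are preserved. The name parse_log_line and the sample row are hypothetical.

```python
from datetime import datetime
from typing import NamedTuple

OPERATIONS = {"imported-sample", "deleted-file", "rejected"}


class LogEntry(NamedTuple):
    timestamp: datetime
    operation: str
    details: str


def parse_log_line(line: str) -> LogEntry:
    """Parse 'timestamp,operation,details' into a LogEntry."""
    # maxsplit=2 keeps any further commas inside the details field.
    timestamp, operation, details = line.rstrip("\n").split(",", 2)
    operation = operation.strip()
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation: {operation!r}")
    # fromisoformat understands UTC offsets such as +02:00.
    return LogEntry(datetime.fromisoformat(timestamp.strip()), operation, details.strip())


# Hypothetical row, made up for illustration only.
entry = parse_log_line("2022-09-11T10:35:04+02:00,rejected,G86: checksum mismatch, file skipped")
print(entry.operation, "|", entry.details)
```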


Validation

The tool 'IGS Toolbox' can be used to validate metadata files before uploading them to the SFTP server, to avoid issues with format and content.

It is a Python package that can be installed using pip.

URL: https://pypi.org/project/igs-toolbox

Install:

pip install igs-toolbox

For JSON files:

Usage: jsonChecker [OPTIONS]                                                                                                                                                
                                                                                                                                                                             
 Validate metadata json.                                                                                                                                                     
                                                                                                                                                                             
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *  --input               -i      FILE  Path to input json file. [default: None] [required]                                                                                │
│    --log_file            -l      FILE  Path to log file. [default: jsonChecker_2024-06-07T10-09-48.log]                                                                   │
│    --version             -V                                                                                                                                               │
│    --install-completion                Install completion for the current shell.                                                                                          │
│    --show-completion                   Show completion for the current shell, to copy it or customize the installation.                                                   │
│    --help                              Show this message and exit.                                                                                                        │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

For EXCEL, TSV, and CSV files (validates and converts files to JSON):

 Usage: convertSeqMetadata [OPTIONS]                                                                                                                                         
                                                                                                                                                                             
 Convert table of seq metadata to json files.                                                                                                                                
                                                                                                                                                                             
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *  --input               -i      FILE       Path to input excel or csv/tsv file. [default: None] [required]                                                               │
│ *  --output              -o      DIRECTORY  Path to output folder for json files. [default: None] [required]                                                              │
│    --log_file            -l      FILE       Path to log file. [default: convertSeqMetadata_2024-06-07T10-09-21.log]                                                       │
│    --version             -V                                                                                                                                               │
│    --install-completion                     Install completion for the current shell.                                                                                     │
│    --show-completion                        Show completion for the current shell, to copy it or customize the installation.                                              │
│    --help                                   Show this message and exit.                                                                                                   │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯