Until sequence data upload via DEMIS is possible, an SFTP server at the RKI is used as an interim solution. Information on the upload process can be found here.
IGS Data Submission Format Specification
This document describes how to submit data via the RKI SFTP server.
Involved parties
- Data provider: An institution that delivers IGS-relevant data to the RKI (e.g., NRZ).
- Data receiver: The IGS team at the RKI.
Types of data
Data providers submit sequencing data and metadata that need to be processed within the IGS project. Each submitted sample must have a unique sample ID.
Sample IDs
Each sample must be identifiable by an ID that is unique among all samples submitted by the data provider. The sample ID may contain uppercase and lowercase characters, digits, dashes, and underscores (regular expression: [A-Za-z0-9_-]+).
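The rule above can be checked locally before a submission is assembled. The following is a minimal sketch; the function names are hypothetical and not part of any IGS tooling, and only the regular expression is taken from this document.

```python
import re

# Regular expression taken from the sample ID rule in this document.
SAMPLE_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_sample_id(sample_id: str) -> bool:
    """Return True if the whole string consists only of allowed characters."""
    return SAMPLE_ID_PATTERN.fullmatch(sample_id) is not None


def find_duplicates(sample_ids):
    """Return IDs that occur more than once, in the order they first repeat."""
    seen, duplicates = set(), []
    for sample_id in sample_ids:
        if sample_id in seen and sample_id not in duplicates:
            duplicates.append(sample_id)
        seen.add(sample_id)
    return duplicates
```

Using fullmatch rather than search ensures that an ID such as "G 612" is rejected even though parts of it match the pattern.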
Sequencing Data
NGS Reads
Two .fastq.gz files are expected for each sample: forward and reverse reads (indicated by R1 and R2 in the filename).
File name pattern: {sample-id}_{...}.fastq.gz
Genomes
For each sample, one .fasta file is expected.
File name pattern: {sample-id}.fasta
Metadata
All pre-defined fields are described in detail in the Metadata Specification file provided with this documentation.
Sample metadata can be provided in three different formats:
- EXCEL: One .xlsx file containing all sample metadata of a submission. A template is provided with this documentation.
- JSON: One .json file for each sample.
- CSV/TSV: A .csv/.tsv file, mirroring the Excel format.
The CSV/TSV format is discouraged because it regularly leads to the following processing issues:
- Inconsistent delimiting characters, e.g., semicolons in .csv files
- Inconsistent escaping of entries containing delimiting characters or line breaks
Formatting of multi-value fields
The fields Files and Uploads can have multiple entries. In that case, the respective column names carry a counter after the first word (EXCEL, CSV, and TSV):
- FILE_1_NAME
- FILE_1_SHA256SUM
- FILE_2_NAME
- FILE_2_SHA256SUM
- ...
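As an illustration of how the numbered columns could be filled, the sketch below computes checksums and builds the column names for a list of files. It assumes that FILE_n_SHA256SUM holds the lowercase hexadecimal SHA-256 digest of the file content and that FILE_n_NAME holds the bare file name; both are assumptions, and the Metadata Specification remains authoritative. The helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256sum(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_columns(paths):
    """Build FILE_<n>_NAME / FILE_<n>_SHA256SUM columns, counting from 1."""
    columns = {}
    for counter, path in enumerate(paths, start=1):
        columns[f"FILE_{counter}_NAME"] = Path(path).name  # assumption: bare file name
        columns[f"FILE_{counter}_SHA256SUM"] = sha256sum(path)
    return columns
```

Reading in chunks keeps memory use flat for large .fastq.gz files.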
In JSON, these fields should be sent as arrays:
{
  "Files": [
    {
      "FILE_NAME": "...",
      "FILE_SHA256SUM": "..."
    },
    {
      "FILE_NAME": "...",
      "FILE_SHA256SUM": "..."
    }
  ]
}
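The relationship between the numbered columns and the JSON array can be sketched as follows. This is an illustration only, not the conversion performed by the IGS tooling; the function name is hypothetical and only the Files field is handled.

```python
import re

# Matches the numbered columns of the tabular formats, e.g. FILE_2_SHA256SUM.
COLUMN_PATTERN = re.compile(r"FILE_(\d+)_(NAME|SHA256SUM)")


def files_to_array(row: dict) -> dict:
    """Group FILE_<n>_* columns of one table row into the JSON "Files" array."""
    grouped = {}
    for column, value in row.items():
        match = COLUMN_PATTERN.fullmatch(column)
        if match is None or value in (None, ""):
            continue  # unrelated column or empty cell
        counter, suffix = int(match.group(1)), match.group(2)
        grouped.setdefault(counter, {})[f"FILE_{suffix}"] = value
    # Sort by counter so the array order follows FILE_1, FILE_2, ...
    return {"Files": [grouped[counter] for counter in sorted(grouped)]}
```

Sorting by the numeric counter avoids the FILE_10 before FILE_2 ordering that a plain string sort would produce.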
Examples
Templates and examples for EXCEL, JSON, TSV, and CSV are provided with this documentation.
JSON example:
SFTP file structure
Sequencing data
All sequencing data (.fastq.gz and .fasta files) is stored inside the reads directory. File names have to start with the respective sample ID.
Valid file name patterns:
- fastq: {sample-id}_{...}.fastq.gz
- fasta: {sample-id}.fasta
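A file name can be checked against a known sample ID as sketched below. The function names are hypothetical. Because this document does not state exactly where R1 and R2 appear in the name, the sketch assumes they form an underscore-delimited token, as in G612_S1_L000_R1_001.fastq.gz.

```python
def matches_sample(file_name: str, sample_id: str) -> bool:
    """Check a file name against the two valid patterns for a given sample ID."""
    if file_name.endswith(".fastq.gz"):
        return file_name.startswith(f"{sample_id}_")
    if file_name.endswith(".fasta"):
        return file_name == f"{sample_id}.fasta"
    return False


def read_direction(file_name: str, sample_id: str):
    """Return "R1" or "R2" for a reads file, or None if it cannot be determined."""
    if not (file_name.endswith(".fastq.gz") and file_name.startswith(f"{sample_id}_")):
        return None
    # Part between "{sample-id}_" and ".fastq.gz", split into tokens.
    rest = file_name[len(sample_id) + 1 : -len(".fastq.gz")]
    found = [token for token in rest.split("_") if token in ("R1", "R2")]
    return found[0] if len(found) == 1 else None
```

Since sample IDs may themselves contain underscores, the sample ID cannot be derived reliably from the file name alone: G6_1_R1.fastq.gz matches both the ID G6 and the ID G6_1. The check therefore starts from the list of known sample IDs.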
Metadata
All sample metadata (.xlsx, .json, .csv, and .tsv files) is stored inside the metadata directory.
This directory can contain either:
- One .xlsx, .csv, or .tsv file (the exact filename is ignored by the data receiver), or
- One .json file per sample: {sample-id}_sequencing_metadata.json.
Submissions
Directory name
- The data provider must create a folder inside the root /data/ directory on the SFTP server. This folder stores all data of the submission.
- Each submission directory is named {date}-{name}, where {date} is the submission date in ISO 8601 format (YYYY-MM-DD). The name is optional and can contain uppercase and lowercase characters, digits, dashes, and underscores (regular expression: [a-zA-Z0-9_-]+). It is ignored by the data receiver.
- Inside the submission directory are two sub-directories: reads and metadata.
- Once the submission is complete, a marker file named submission-complete.txt should be created in the submission directory. When the data receiver detects this marker file, it validates the content of this folder. It either:
  - transfers the data to the IGS systems, logs this action on the SFTP server, and deletes the submission folder, or
  - rejects the data, logs this action on the SFTP server, leaves the folder untouched, and notifies the data provider about the issue.
Examples:
/data/2023-12-24
/data/2024-01-07-January_submission_1
/data/2024-01-02-rest_of_2023_data
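The naming rule can be expressed in a few lines. The sketch below builds and parses submission directory names; the function names are hypothetical, and the patterns follow the rule stated in this section.

```python
import re
from datetime import date

NAME_PATTERN = re.compile(r"[a-zA-Z0-9_-]+")
# ISO 8601 date, optionally followed by "-" and a free-form name.
DIR_PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2})(?:-([a-zA-Z0-9_-]+))?")


def submission_dir_name(day: date, name: str = "") -> str:
    """Build "{date}" or "{date}-{name}" for a submission directory."""
    if name and NAME_PATTERN.fullmatch(name) is None:
        raise ValueError(f"invalid submission name: {name!r}")
    return f"{day.isoformat()}-{name}" if name else day.isoformat()


def parse_submission_dir(dir_name: str):
    """Return (date, name or None), or None if the directory name is invalid."""
    match = DIR_PATTERN.fullmatch(dir_name)
    if match is None:
        return None
    try:
        day = date.fromisoformat(match.group(1))  # rejects dates such as 2024-13-01
    except ValueError:
        return None
    return day, match.group(2)
```

Parsing the date with date.fromisoformat in addition to the regular expression catches names that have the right shape but are not valid calendar dates.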
Marker file
The marker file must always be named submission-complete.txt and placed inside the submission directory. It must be an empty file.
Example file structures
Submission of NGS reads, using Excel as metadata format (samples: G612, G86):
/data/
└── 2024-01-31-January2024/
    ├── metadata/
    │   └── IGS_Metadata.xlsx
    ├── reads/
    │   ├── G612_S1_L000_R1_001.fastq.gz
    │   ├── G612_S1_L000_R2_001.fastq.gz
    │   ├── G86_S1_L000_R1_001.fastq.gz
    │   └── G86_S1_L000_R2_001.fastq.gz
    └── submission-complete.txt
Submission of genomes, using JSON as metadata format (samples: Sample-324, Sample-84):
/data/
└── 2023-12-31/
    ├── metadata/
    │   ├── Sample-324_sequencing_metadata.json
    │   └── Sample-84_sequencing_metadata.json
    ├── reads/
    │   ├── Sample-324.fasta
    │   └── Sample-84.fasta
    └── submission-complete.txt
Multiple submissions (2024-01-31-January2024 is marked as complete, 2024-02-29-February2024 is in progress):
/data/
├── 2024-01-31-January2024
│   ├── metadata/
│   │   └── IGS_Metadata.xlsx
│   ├── reads/
│   │   ├── G612_S1_L000_R1_001.fastq.gz
│   │   ├── G612_S1_L000_R2_001.fastq.gz
│   │   ├── G86_S1_L000_R1_001.fastq.gz
│   │   └── G86_S1_L000_R2_001.fastq.gz
│   └── submission-complete.txt
└── 2024-02-29-February2024
    ├── metadata/
    │   └── IGS_Metadata.xlsx
    └── reads/
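If a submission is staged in a local folder before it is uploaded, the layout shown in these examples can be checked with a short script. This is a sketch under that assumption: the function names are hypothetical, the checks cover only the rules stated in this document, and they do not replace the validation performed by the data receiver.

```python
from pathlib import Path

MARKER_NAME = "submission-complete.txt"
TABULAR_SUFFIXES = {".xlsx", ".csv", ".tsv"}
JSON_SUFFIX = "_sequencing_metadata.json"


def check_submission(submission_dir):
    """Return a list of layout problems; an empty list means none were found."""
    root = Path(submission_dir)
    problems = []
    for sub_dir in ("reads", "metadata"):
        if not (root / sub_dir).is_dir():
            problems.append(f"missing sub-directory: {sub_dir}/")
    metadata_dir = root / "metadata"
    if metadata_dir.is_dir():
        files = [p for p in metadata_dir.iterdir() if p.is_file()]
        tabular = [p for p in files if p.suffix.lower() in TABULAR_SUFFIXES]
        per_sample = [p for p in files if p.name.endswith(JSON_SUFFIX)]
        if tabular and per_sample:
            problems.append("metadata/ mixes tabular and JSON metadata")
        elif len(tabular) > 1:
            problems.append("metadata/ contains more than one tabular file")
        elif not tabular and not per_sample:
            problems.append("metadata/ contains no metadata file")
    return problems


def mark_complete(submission_dir):
    """Create the empty marker file once the local layout check passes."""
    problems = check_submission(submission_dir)
    if problems:
        raise ValueError("; ".join(problems))
    marker = Path(submission_dir) / MARKER_NAME
    marker.write_bytes(b"")  # the marker must be an empty file
    return marker
```

Creating the marker last, and only after the check passes, mirrors the rule that the marker signals a complete submission.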
Metadata files
When the metadata format is Excel, CSV or TSV, the data receiver will not remove sample metadata after a successful import. The data provider may remove metadata after it was imported successfully or leave all submitted metadata in the file.
Data reception
The data receiver logs all actions in the file data-import.log.
Each row in this file contains the following pieces of information, separated by a comma:
- Timestamp in ISO 8601 format, including UTC offset (e.g., 2022-09-11T10:35:04+02:00)
- Performed operation: One of imported-sample, deleted-file, rejected
- Details: Sample ID, file path, or other information like error messages
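A row of this log could be parsed as sketched below. The sketch assumes that the fields are not quoted and that the details field is the last one, so any further commas (for example in an error message) belong to the details. Neither point is stated in this document, and the names used are hypothetical.

```python
from datetime import datetime
from typing import NamedTuple

OPERATIONS = {"imported-sample", "deleted-file", "rejected"}


class LogEntry(NamedTuple):
    timestamp: datetime
    operation: str
    details: str


def parse_log_line(line: str) -> LogEntry:
    """Split one log row into timestamp, operation, and details."""
    # Split at the first two commas only; later commas stay in the details.
    timestamp, operation, details = line.rstrip("\n").split(",", 2)
    operation = operation.strip()
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation: {operation!r}")
    return LogEntry(datetime.fromisoformat(timestamp.strip()), operation, details.strip())
```

A row with fewer than three fields raises a ValueError, which makes malformed rows visible instead of silently skipping them.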
Validation
The tool 'IGS Toolbox' can be used to validate metadata files before uploading them to the SFTP server, which helps to avoid issues with format and content.
It is a Python package that can be installed using pip.
URL: https://pypi.org/project/igs-toolbox
Install:
pip install igs-toolbox
For JSON files:
Usage: jsonChecker [OPTIONS]

Validate metadata json.

╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *  --input               -i  FILE  Path to input json file. [default: None] [required]                                │
│    --log_file            -l  FILE  Path to log file. [default: jsonChecker_2024-06-07T10-09-48.log]                   │
│    --version             -V                                                                                           │
│    --install-completion            Install completion for the current shell.                                          │
│    --show-completion               Show completion for the current shell, to copy it or customize the installation. │
│    --help                          Show this message and exit.                                                        │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
For EXCEL, TSV, and CSV files (validates the file and converts it to JSON):
Usage: convertSeqMetadata [OPTIONS]

Convert table of seq metadata to json files.

╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *  --input               -i  FILE       Path to input excel or csv/tsv file. [default: None] [required]                   │
│ *  --output              -o  DIRECTORY  Path to output folder for json files. [default: None] [required]                  │
│    --log_file            -l  FILE       Path to log file. [default: convertSeqMetadata_2024-06-07T10-09-21.log]           │
│    --version             -V                                                                                                │
│    --install-completion                 Install completion for the current shell.                                         │
│    --show-completion                    Show completion for the current shell, to copy it or customize the installation. │
│    --help                               Show this message and exit.                                                       │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
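To validate every JSON file of a submission in one go, the jsonChecker command can be called from a small script. The sketch below uses only the --input and --log_file options shown in the help text above. The function names are hypothetical, and because this document does not describe the exit codes of jsonChecker, they are returned unchanged and the log files remain the place to look for details.

```python
import shutil
import subprocess
from pathlib import Path


def json_checker_command(json_file, log_file=None):
    """Build the jsonChecker command line for one metadata file."""
    command = ["jsonChecker", "--input", str(json_file)]
    if log_file is not None:
        command += ["--log_file", str(log_file)]
    return command


def validate_directory(metadata_dir):
    """Run jsonChecker on every .json file and map file name to exit code."""
    if shutil.which("jsonChecker") is None:
        raise RuntimeError("jsonChecker not found; install igs-toolbox first")
    results = {}
    for json_file in sorted(Path(metadata_dir).glob("*.json")):
        completed = subprocess.run(
            json_checker_command(json_file), capture_output=True, text=True
        )
        # Assumption: the meaning of the exit code is not documented here.
        results[json_file.name] = completed.returncode
    return results
```

Passing the command as a list instead of a single string avoids shell quoting problems with file names.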