Isabl Docs
Importing Data
📦 Learn how to import raw data into Isabl using existing metadata.
Isabl-CLI enables tracking and managing of raw data, as well as reference resources that are a function of a genome assembly or an experimental technique.

Data Import

Isabl-CLI supports automated data import by recursively exploring data deposition directories and matching raw data files with identifiers registered in the database. For example, the client can be instructed to explore the /projects directory (A), retrieving only samples from Project 393, and match files using Sample Identifiers (B).
Isabl supports automatic import from data deposition directories.
Upon --commit, Isabl-CLI moves (or symlinks) matched files into scalable directory structures (C). Each experiment's data path is built from the last four digits of its primary key. For instance, data for Experiment 57395 will be stored at {storage-directory}/experiments/73/95/57395/. Each of the two-digit levels holds at most 100 folders, and 10 million experiments spread over the resulting 10,000 leaf folders give at most 1,000 subdirectories in any folder in the worst case.
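The path scheme described above can be sketched in a few lines of Python. This is an illustration only, not Isabl-CLI's implementation: `experiment_data_path` is a hypothetical helper, and zero-padding primary keys shorter than four digits is an assumption that the text does not confirm.

```python
from pathlib import PurePosixPath


def experiment_data_path(storage_directory, pk):
    """Build {storage-directory}/experiments/XX/YY/{pk} from the last four digits of pk."""
    # Assumption: keys with fewer than four digits are left-padded with zeros.
    digits = f"{pk:04d}"[-4:]
    return PurePosixPath(storage_directory) / "experiments" / digits[:2] / digits[2:] / str(pk)


# Example from the text: Experiment 57395 is stored under 73/95/57395.
print(experiment_data_path("/storage", 57395))
```

Because only the last four digits are used, consecutive primary keys land in different leaf folders, which keeps folder sizes balanced as experiments accumulate.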

Supported Data Formats

Isabl experiments can be linked to any kind of data. By default, Isabl will match the following data types:
RAW_DATA_FORMATS = [
    ("CRAM", "CRAM"),
    ("FASTQ_R1", "FASTQ_R1"),
    ("FASTQ_R2", "FASTQ_R2"),
    ("FASTQ_I1", "FASTQ_I1"),
    ("BAM", "BAM"),
    ("PNG", "PNG"),
    ("JPEG", "JPEG"),
    ("TXT", "TXT"),
    ("TSV", "TSV"),
    ("CSV", "CSV"),
    ("PDF", "PDF"),
    ("DICOM", "DICOM"),
    ("MD5", "MD5"),
]
If you need to support more raw data formats, add EXTRA_RAW_DATA_FORMATS to both the API and the client settings. This extends the valid data format choices in the backend and provides a validator for the new file format in the client settings; alternatively, you can provide a new data importer.
For example, to support the MAF format:
# In the api settings
EXTRA_RAW_DATA_FORMATS = [("MAF", "MAF")]

# In the cli client settings
EXTRA_RAW_DATA_FORMATS = [("\\.maf(\\.gz)?$", "MAF")]
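To see which file names such a client-side pattern would accept, the sketch below tests it with Python's `re` module. It assumes the intended pattern is the regular expression `\.maf(\.gz)?$` and that it is applied with a search against the file path; how Isabl-CLI actually applies the pattern (search vs. match, case sensitivity) is not stated here, and `is_maf` is a hypothetical helper.

```python
import re

# ".maf", optionally followed by ".gz", anchored at the end of the path.
MAF_PATTERN = re.compile(r"\.maf(\.gz)?$")


def is_maf(path):
    """Return True when the path ends in .maf or .maf.gz."""
    return MAF_PATTERN.search(path) is not None


for candidate in ["/data/sample.maf", "/data/sample.maf.gz", "/data/sample.maf.txt"]:
    print(candidate, is_maf(candidate))
```

The trailing `$` matters: without it, a name such as sample.maf.txt would also be accepted.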
Tip: subclassing isabl_cli.data.LocalDataImporter and overriding RAW_DATA_INSPECTORS might be enough to support new data formats.

Import Data from Yaml

Isabl-CLI also supports explicit importing into a single experiment by specifying absolute file paths and metadata in a yaml file via the import-data-from-yaml command. The metadata will be added to the file_data field in an experiment's raw_data.
The two main parameters to be specified when importing are:
    -fi: an argument that takes a pair of values (field, field value) to identify an experiment. For example, if you had an experiment with a system_id of TEST_EXPERIMENT_T01, the argument would look like:
        -fi system_id TEST_EXPERIMENT_T01
    --files-data: an argument that takes the absolute path to the yaml file containing absolute file paths and metadata. For example, if you had a yaml file /absolute/path/to/files_data.yaml with the following contents:
        /absolute/path/to/file_1.fastq.gz:
          metadata1: value1
          metadata2: value2
        /absolute/path/to/file_2.fastq.gz:
          metadata3: value3
          metadata4: value4
    the argument would look like:
        --files-data /absolute/path/to/files_data.yaml
Full command using examples above:
isabl import-data-from-yaml \
    -fi system_id TEST_EXPERIMENT_T01 \
    --files-data /absolute/path/to/files_data.yaml \
    --commit
View command details by running: isabl import-data-from-yaml --help
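When many files need to be imported, the yaml file can be generated rather than written by hand. The sketch below is one way to do it with only the Python standard library. `render_files_data` is a hypothetical helper, not part of Isabl-CLI, and it assumes that each absolute file path maps to a flat set of metadata keys whose values are plain scalars needing no yaml quoting.

```python
from pathlib import PurePosixPath


def render_files_data(files):
    """Render {absolute_path: {key: value}} as the yaml text passed to --files-data."""
    lines = []
    for path, metadata in files.items():
        # The command expects absolute paths, so reject relative ones early.
        if not PurePosixPath(path).is_absolute():
            raise ValueError(f"not an absolute path: {path}")
        lines.append(f"{path}:")
        for key, value in metadata.items():
            lines.append(f"  {key}: {value}")
    return "\n".join(lines) + "\n"


print(render_files_data({
    "/absolute/path/to/file_1.fastq.gz": {"metadata1": "value1", "metadata2": "value2"},
    "/absolute/path/to/file_2.fastq.gz": {"metadata3": "value3", "metadata4": "value4"},
}))
```

For metadata values that contain colons, quotes, or line breaks, a real yaml library would be the safer choice.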

Import Reference Data and BED files

You can link reference data to assemblies and techniques in several ways.
Arbitrary resources (e.g. gene annotations) can be registered for any assembly or technique:
isabl import-reference-data --help

# extra resources are included in the assembly directory
assemblies/
├── GRCh37
│   ├── chr_alias
│   │   └── hg19_alias.tab
│   ├── cytoband
│   │   └── cytoBand.txt
│   ├── genes
│   │   └── refGene.txt
│   └── genome_fasta
│       └── GRCh37.fasta ...
└── GRCm38

Import Assembly Reference Genome

Isabl can track resources for assemblies and techniques. For instance, it ensures that reference FASTA files are uniformly indexed, named, and tracked across genome builds:
# indexes are created with `bwa index`, `samtools faidx`, `samtools dict`
isabl import-reference-genome --help

# example of isabl assemblies directories
assemblies/
├── GRCh37
│   └── genome_fasta
│       ├── GRCh37.fasta
│       ├── GRCh37.fasta.amb
│       ├── GRCh37.fasta.ann
│       ├── GRCh37.fasta.bwt
│       ├── GRCh37.fasta.dict
│       ├── GRCh37.fasta.fai
│       ├── GRCh37.fasta.pac
│       └── GRCh37.fasta.sa
└── ...
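A quick way to confirm that a genome directory matches the layout above is to check for each expected index file. The following is a minimal sketch, assuming the suffixes shown in the example listing; `missing_index_files` is a hypothetical helper and not an Isabl-CLI function.

```python
from pathlib import Path

# Suffixes taken from the example listing (bwa index, samtools faidx, samtools dict).
INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".dict", ".fai", ".pac", ".sa")


def missing_index_files(fasta):
    """Return the expected index files that do not exist next to the FASTA file."""
    fasta = Path(fasta)
    expected = [fasta.with_name(fasta.name + suffix) for suffix in INDEX_SUFFIXES]
    return [path for path in expected if not path.exists()]


# An empty list means every index in the listing is present.
print(missing_index_files("assemblies/GRCh37/genome_fasta/GRCh37.fasta"))
```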

Import BED Files for Sequencing Techniques

Lastly, you can register BED files for any sequencing technique, which will be compressed, indexed, moved to the technique data directory, and registered in the database:
# compressed with bgzip, indexed with tabix
isabl import-bedfiles --help

# example of isabl technique directories
techniques/
├── 34
│   └── bed_files
│       └── GRCh37
│           ├── dna-td-hemepact-v4-grch37.baits.bed
│           ├── dna-td-hemepact-v4-grch37.baits.bed.gz
│           ├── dna-td-hemepact-v4-grch37.baits.bed.gz.tbi
│           ├── dna-td-hemepact-v4-grch37.targets.bed
│           ├── dna-td-hemepact-v4-grch37.targets.bed.gz
│           └── dna-td-hemepact-v4-grch37.targets.bed.gz.tbi
└── ...
Imported assets are available for systematic processing by Isabl applications.

Customizing Import Logic

All registration mechanisms are configurable and can be customized by providing an alternative Python subclass:
Setting Name                 Default
DATA_IMPORTER                isabl_cli.data.LocalDataImporter
REFERENCE_GENOME_IMPORTER    isabl_cli.data.LocalReferenceGenomeImporter
REFERENCE_DATA_IMPORTER      isabl_cli.data.LocalReferenceDataImporter
BED_IMPORTER                 isabl_cli.data.LocalBedImporter
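The defaults above are dotted import paths. How Isabl-CLI resolves them internally is not described here; the sketch below only illustrates the general pattern of turning such a string into a class, using a standard-library class as a stand-in. `import_from_string` is a hypothetical helper.

```python
from importlib import import_module


def import_from_string(dotted_path):
    """Resolve 'package.module.Name' to the object it names."""
    module_path, _, attribute = dotted_path.rpartition(".")
    if not module_path:
        raise ImportError(f"not a dotted path: {dotted_path}")
    return getattr(import_module(module_path), attribute)


# Stand-in for a value such as "isabl_cli.data.LocalDataImporter".
importer_class = import_from_string("collections.OrderedDict")
print(importer_class.__name__)
```

Under this pattern, pointing a setting at a custom subclass only requires that the subclass be importable from the environment where the client runs.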
Although only local storage is supported at the time of writing, Isabl-CLI's import capabilities could be extended to cloud solutions, including integration with cloud workbenches such as Arvados.