Writing Applications
⚡️ Isabl Applications enable you to systematically deploy data science tools across thousands of Experiments in a metadata-driven approach. Learn how to build them here.
Isabl Applications enable you to systematically deploy data science tools across thousands of Experiments in a metadata-driven approach. The most important things to know about applications are:
Applications are agnostic to the underlying tools being utilized.
Applications can submit analyses to multiple compute environments (local, cluster, cloud).
Results are stored as analyses for which uniqueness is a function of the experiments used.
Once implemented, applications can be deployed across any subset of experiments in the database.
During this tutorial we will build a hello world application that showcases the functionalities and advantages of processing data with Isabl. Here is a really simple example of an Isabl application that echoes an experiment's sample identifier and its raw data:
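A minimal sketch of what such an app could look like, assuming `options` is importable from `isabl_cli` and that field names such as `sample.identifier` and `raw_data` match your data model:

```python
from isabl_cli import AbstractApplication, options


class HelloWorldApp(AbstractApplication):

    NAME = "HELLO_WORLD"
    VERSION = "1.0.0"

    # one analysis per target experiment retrieved with --filters
    cli_options = [options.TARGETS]

    def get_command(self, analysis, inputs, settings):
        # echo the sample identifier and the experiment's raw data
        experiment = analysis.targets[0]
        return f"echo {experiment.sample.identifier} {experiment.raw_data}"
```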
This application can now be executed system-wide using:
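For example, assuming the application has been registered in the client settings and is exposed under a subcommand such as `hello-world-1.0.0` (the exact name depends on your configuration):

```bash
# deploy the app over all experiments matching the API filters
isabl hello-world-1.0.0 -fi sample.category TUMOR
```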
Results produced by applications are stored as analyses. The uniqueness of an analysis is determined by the experiments associated with it. Specifically, analyses can be linked to multiple target and reference experiments (e.g. tumor-normal pairs). The possibility of linking analyses to multiple experiments allows for a wide variety of experimental designs:
Single target analyses (e.g. quality control applications).
Tumor-normal pairs (e.g. variant calling applications).
One target vs. a pool of references (e.g. copy number applications).
Multiple targets against multiple references (e.g. all vs. all contamination testing).
Importantly, if someone tries to run the same application over the same experiments, a new analysis won't be created; instead, the existing one will be retrieved.
All Isabl Applications inherit from isabl_cli.AbstractApplication and are configured using a class-based approach. Your role is to override attributes and methods to drive the behavior of your app.
Applications are uniquely versioned by setting the NAME and VERSION attributes. The version of an application is not necessarily the version of the underlying tool being executed:
Optionally you can also set ASSEMBLY and SPECIES to version the application as a function of a given genome assembly. This is particularly useful for NGS applications, as often results are only comparable if data was analyzed against the same version of the genome:
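For instance, the versioning attributes of the hello world app could look as follows (the assembly and species values are just examples):

```python
class HelloWorldApp(AbstractApplication):

    NAME = "HELLO_WORLD"
    VERSION = "1.0.0"       # version of the app, not of the tool it wraps
    ASSEMBLY = "GRCh38"     # results are only comparable within this assembly
    SPECIES = "HUMAN"
```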
You can add additional metadata to be attached to the database object, such as an application description and URLs (or comma-separated URLs):
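As a sketch, and assuming the attributes are named application_description and application_url (verify these names against your isabl_cli version), the class above could be extended with:

```python
    # attribute names here are an assumption, double-check them in isabl_cli
    application_description = "Echoes a sample identifier and its raw data."
    application_url = "https://example.com/docs/hello-world"
```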
Applications can depend on multiple configurations such as paths to executables, reference files, compute requirements, etc. These settings are explicitly defined using the application_settings dictionary:
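A sketch of settings for the hello world app (the keys are illustrative), showing None for optional settings and NotImplemented for required ones, as described next:

```python
    application_settings = {
        "echo_path": "echo",               # path to the executable
        "extra_flags": None,               # optional setting
        "reference_data": NotImplemented,  # required, must be configured per client
    }
```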
Optional settings can be set to None, whilst required but not yet defined settings can be set to the NotImplemented Python object. Settings defined in the application Python class are considered to be the default settings, yet they can be overridden using the database application field settings.
You can make sure applications are properly configured by performing settings validation. To do so, simply define validate_settings and raise an AssertionError if something is not set properly:
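A minimal sketch, assuming validate_settings receives the resolved settings object:

```python
    def validate_settings(self, settings):
        # raise AssertionError if the application is misconfigured
        assert settings.echo_path, "echo_path must be set"
        assert settings.reference_data is not NotImplemented, (
            "reference_data has not been configured for this client"
        )
```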
To support CLI capabilities you have to tell the application how to link analyses to experiments using command-line options (a usage sketch follows the table below):
| isabl.options | Description |
| --- | --- |
| TARGETS | Enables --filters (-fi) to provide key-value pairs of RESTful API filters used to retrieve experiments (e.g. -fi sample.category TUMOR). Each experiment will be linked to a new analysis on a one-to-one basis using the analysis.targets field. |
| PAIRS, PAIR, PAIRS_FROM_FILE | Enables --pairs (-p), --pair (-p), --pairs-from-file (-pf) to provide pairs of target-reference experiments (e.g. -p TUMOR-ID NORMAL-ID). Each pair will be linked to a new analysis (the targets list is one experiment, the references list is one experiment). |
| REFERENCES, NULLABLE_REFERENCES | Enables --references-filters (-rfi) to provide filters to retrieve reference experiments. This has to be coupled with TARGETS; each analysis will then be linked to one target and to as many references as are retrieved. |
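As a usage sketch, assuming the options live in isabl_cli.options:

```python
from isabl_cli import AbstractApplication, options


class TumorNormalApp(AbstractApplication):

    NAME = "TUMOR_NORMAL_DEMO"
    VERSION = "1.0.0"

    # one analysis per tumor-normal pair provided with --pairs or --pairs-from-file
    cli_options = [options.PAIRS]
```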
When these options are not adequate for your experimental design, you can implement get_experiments_from_cli_options. This function takes the evaluated cli_options and must return a list of tuples: one tuple per analysis, each tuple with 2 lists: the target experiments and the reference experiments. Here is an example of an application that creates only one analysis linked to all Whole Genome experiments in a project:
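A rough sketch of that idea follows; the custom --project option, the filter names, and the api.get_instances call are assumptions used to illustrate the return contract:

```python
import click
from isabl_cli import AbstractApplication, api


class ProjectWideApp(AbstractApplication):

    NAME = "PROJECT_WIDE"
    VERSION = "1.0.0"

    # hypothetical custom option asking for a project primary key
    cli_options = [click.option("--project", required=True)]

    def get_experiments_from_cli_options(self, **cli_options):
        # retrieve all Whole Genome experiments of the given project
        experiments = api.get_instances(
            "experiments",
            projects=cli_options["project"],
            technique__method="WG",
        )

        # one tuple per analysis: ([targets], [references])
        return [(experiments, [])]
```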
By default, applications come with --force to remove and start analyses from scratch, --restart to run failed analyses again without trashing them, and --local to run analyses locally, one after the other:
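For example (the subcommand name is illustrative):

```bash
# re-run failed analyses without trashing existing results
isabl hello-world-1.0.0 -fi sample.category TUMOR --restart

# start from scratch and run locally, one analysis after the other
isabl hello-world-1.0.0 -fi sample.category TUMOR --force --local
```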
As such, the CLI configuration for our Hello World app will be reflected in its --help message.
You can use cli_options to include any other argument your app may need in order to successfully build and deploy data processing tools.
One of the advantages of metadata-driven applications is that we can prevent analyses that don't make sense, for example running a variant calling application on imaging data. Simply raise an AssertionError if something doesn't make sense, and the error message will be provided to the user:
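A sketch of this idea, assuming the validation hook is called validate_experiments and receives the targets and references (verify the hook name and field names in the isabl_cli reference):

```python
    def validate_experiments(self, targets, references):
        # hook name, signature, and field names are assumptions; the point is
        # simply raising AssertionError with a user-facing message
        for experiment in targets:
            assert experiment.technique.category == "DNA", (
                f"{experiment.system_id}: this application only supports DNA experiments"
            )
```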
All options passed in cli_options are available during get_command using the settings attribute run_args. In this simple example, we allowed the user to pass a custom --message.
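Continuing the hello world sketch from above (imports and versioning attributes omitted), and assuming custom arguments are declared as click options:

```python
    cli_options = [
        options.TARGETS,
        click.option("--message", default="hello world", help="message to echo"),
    ]

    def get_command(self, analysis, inputs, settings):
        # arguments provided through cli_options are exposed in settings.run_args
        message = settings.run_args["message"]
        sample = analysis.targets[0].sample.identifier
        return f"echo '{message}' {sample}"
```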
| Configuration Name | Type | Description |
| --- | --- | --- |
| get_requirements | Import String | An import string to a function that will determine LSF requirements as a function of the experimental methods, see below. |
| extra_args | String | Default qsub, bsub, or sbatch args to be used across all submissions. |
| throttle_by | Integer | The total number of analyses that are allowed to run at the same time (default is 50). |
The method get_requirements must take the application and a list of the targets' technique methods (which are submitted together in the same job array):
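A sketch only; the bsub-style flags and the "WG" method code are illustrative, not prescribed values:

```python
def get_requirements(application, targets_methods):
    # return scheduler arguments as a function of the experimental methods
    if "WG" in targets_methods:
        return "-n 8 -M 32"
    return "-n 2 -M 8"
```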
Apps can be programmatically triggered from Python using the run method:
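A minimal sketch, assuming api.get_instance is available and that run accepts a list of ([targets], [references]) tuples plus a commit flag:

```python
from isabl_cli import api

app = HelloWorldApp()
experiment = api.get_instance("experiments", "DEMO_EXPERIMENT_ID")  # hypothetical id

# one tuple per analysis: ([targets], [references]); commit=False performs a dry run
app.run([([experiment], [])], commit=True)
```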
You can provide a specification of your application results using the application_results dictionary. Each key is a result id and the value is a dictionary with specs of the result:
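For instance, a sketch of results for the hello world app (result ids and file names are illustrative):

```python
    application_results = {
        "count": {
            "frontend_type": "number",
            "description": "Number of words echoed by the command.",
            "verbose_name": "Word Count",
        },
        "input": {
            "frontend_type": "text-file",
            "description": "Copy of the input message.",
            "verbose_name": "Input",
            "pattern": "*input.txt",   # matched recursively within the analysis directory
        },
        "output": {
            "frontend_type": "text-file",
            "description": "File with the echoed message.",
            "verbose_name": "Output",
            "pattern": "*output.txt",
        },
    }
```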
Results can be paths to files, strings (e.g. MD5s), numbers, and any other serializable value. Here is a full list of the different specifications a result can have:
| Specification | Description |
| --- | --- |
| description | Information about the result (required). |
| verbose_name | Name displayed for the result in the results list (required). |
| frontend_type | Defines how the result is rendered in the frontend (see the table below). |
| optional | If False and the result is missing, an alert will be shown online (optional). |
| external_link | URL to a resource that may explain more about the result (optional). |
| pattern | A glob pattern used to recursively match a filename within the analysis folder (optional). |
| exclude | Any string to exclude among files that match pattern (optional). |
Specifying the result frontend_type is meant to define how to render it through Isabl Web. When set to None, the result will still be available in the analysis, but it won't be shown in the frontend.
Here is a full list of the result types that are supported for rendering in Isabl Web:
| Frontend Type | Description |
| --- | --- |
| text-file | Shown as a raw file; its content is streamed as the user requests it. |
| tsv-file | Can be shown as raw text or tabulated for easier inspection (e.g. VCF, TSV). |
| string, number | Shown as a string and can't be downloaded. |
| image | Previews are displayed in a gallery in the analysis view. |
| html, pdf | Rendered as HTML in an iframe. |
| igv_bam[:index] | Can be streamed and visualized in an embedded IGV viewer. If another result called bai is the BAM index, you can set it to igv_bam:bai. |
| None | For non-previewable files that are either large, compressed, or in an unsupported format (e.g. FASTQ, .RData objects, .pkl models). |
By default, analysis results are protected upon completion (i.e. permissions are set to read-only). If you want your application to be re-runnable indefinitely, set:
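Assuming the attribute controlling this behavior is application_protect_results (verify the name against your isabl_cli version):

```python
    # attribute name is an assumption; setting it to False skips the read-only protection
    application_protect_results = False
```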
When application_results is defined, you must implement get_analysis_results. This method must return a serializable dictionary of results, and it's only run after the analysis has completed successfully. For our example it can be something like:
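A sketch, assuming analyses expose their output directory as storage_url and that os.path.join is imported as join:

```python
    def get_analysis_results(self, analysis):
        # only called after the head job has completed successfully
        output = join(analysis.storage_url, "output.txt")

        with open(output) as f:
            count = len(f.read().split())

        return {
            "count": count,
            "input": join(analysis.storage_url, "input.txt"),
            "output": output,
        }
```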
When using the pattern property in the results' definition of application_results, the get_analysis_results method can be simplified, as Isabl will match the filename using the pattern and exclude fields by walking the analysis' output directory recursively.
For instance, this previous app only needs the count result to be defined, as input and output can be automatically matched, hence it can be simplified as:
In cases where all application_results are filenames with pattern, get_analysis_results is not even needed.
Isabl applications can produce auto-merge analyses at the project and individual level. For example, you may want to merge variants whenever new results are available for a given project, or update quality control reports when a new sample is added to an individual. A newly versioned analysis will be created for each type of auto-merge, and your role is to take a list of succeeded analyses and implement the merge logic.
The first argument in merge_project_analyses is the project-level analysis, which is unique per project and application. The second argument is a list of all completed analyses of this application for a given project. Your role is to merge the analyses' output into the project-level analysis directory. We need to define similar methods for the individual-level auto-merge. Let's say that our project-level merge logic is the same for individuals, then we can simply do:
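A sketch of both hooks (join again refers to os.path.join); the individual-level method name, merge_individual_analyses, and the result fields used are assumptions:

```python
    def merge_project_analyses(self, analysis, analyses):
        # concatenate each completed analysis' output into the project-level directory
        with open(join(analysis.storage_url, "merged.txt"), "w") as merged:
            for completed in analyses:
                with open(completed.results["output"]) as output:
                    merged.write(output.read())

    # assumption: reuse the same logic for the individual-level auto-merge
    merge_individual_analyses = merge_project_analyses
```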
If at any arbitrary time you want to test the auto-merge logic, use either of these two commands:
Here is an example for SGE:
This section lists additional optional functionality supported by Isabl applications, particularly dependencies on other applications, after-completion analysis status, and unique analyses per individual.
Application inputs are analysis-specific settings (settings are the same for all analyses, yet inputs are potentially different for each analysis). Inputs can be formally defined using application_inputs; inputs set to NotImplemented are considered required and must be resolved using get_dependencies:
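A rough sketch; the signature and return contract of get_dependencies shown here ([analysis keys], {input: value}) are assumptions, as is the bam field, so consult the isabl_cli reference:

```python
    application_inputs = {"bam": NotImplemented}

    def get_dependencies(self, targets, references, settings):
        # e.g. resolve the required BAM from a hypothetical field of the target
        target = targets[0]
        return [], {"bam": target.bam_files[0]}  # hypothetical field
```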
In certain cases you don't want your analyses to be marked as SUCCEEDED after completion, as you may want to flag them for manual review or leave them as a reminder that you need to run an extra step on them. For these cases, you may want to set the after-completion status to IN_PROGRESS:
It is possible to create applications that are unique at the individual level. To do so, set unique_analysis_per_individual = True. A good example of a unique-per-individual application could be a patient-centric report that aggregates results across all samples. If you are interested in how analyses for these applications are created, take a look at AbstractApplication.get_individual_level_analyses.
You can configure Isabl API to periodically check if any analysis has failed and send you email notifications. To do so, head to the admin site at /admin/django_celery_beat/periodictask/add/ and in Task (registered) select isabl_api.tasks.report_status_change_task, then create a 1 hour interval, and provide the following keyword arguments {"status": "FAILED", "seconds": 3600} (i.e. every hour check how many analyses failed in the past hour):
| Name | Description |
| --- | --- |
| commit | Enables running pytest with a --commit flag to actually run the applications (described below). |
| datadir | Path to the dummy data directory. Your tests directory comes with a data folder, which can be populated with dummy, small files, useful to run your apps. |
| tmpdir | A tmpdir fixture with extra utilities for Isabl apps (described below). |
Isabl CLI comes with a full set of factories that facilitate the creation of fake metadata. Here is an example of how to create two experiments for the same sample:
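A rough sketch, under the assumption that isabl_cli exposes a factories module with dict-producing factories such as SampleFactory and ExperimentFactory (names may differ in your version):

```python
from isabl_cli import api, factories

# create one sample and register two experiments pointing to it (sketch only)
sample = api.create_instance("samples", **factories.SampleFactory())
experiments = [
    api.create_instance("experiments", sample=sample, **factories.ExperimentFactory())
    for _ in range(2)
]
```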
Being able to actually run the applications (i.e. passing --commit) during testing might be something valuable to you. In the case of Next Generation Sequencing, for example, you could create fake BAMs and really small reference genomes (a few KBs) to test variant calling applications.
cli_options enabled us to run the app across multiple experiments using RESTful API filters (i.e. --filters). We will learn more about how to link experiments with analyses below.
You should store your applications and custom Isabl logic in your own Python package. Cookiecutter Apps will help you bootstrap your own Isabl project:
An example of a generated project is available. This project, and every project created with Cookiecutter Apps, includes the hello world application described in this tutorial, so check it out. Now let's learn about writing apps!
To make sure your applications are available when running isabl --help, add them to the client settings:
Note that application.settings are client-specific. This enables you to run the same application in different compute architectures. You can configure application.settings using the Django Admin site or the application method patch_application_settings:
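For example (the setting name passed here is illustrative):

```python
# patch this client's settings for the application from Python
HelloWorldApp().patch_application_settings(echo_path="/usr/bin/echo")
```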
Applications can be launched from both the command line and from Python (we will learn more about the latter in a later section).
The attribute cli_options is set to a list of options that will be used to retrieve experiments from the API and link them to new analyses. Out of the box, Isabl supports the following CLI options to retrieve experiments:
The --force flag will not completely remove the analyses, but will move them to a temporary trash directory. You may want to clean this location periodically using crontab -e:
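For instance (the trash path below is hypothetical; use your own location):

```bash
# every Sunday at 3am, empty the analyses trash directory (hypothetical path)
0 3 * * 0  rm -rf /data/isabl/.analyses_trash/*
```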
Analyses are understood to be unique if their targets, references, and application are the same. If you need custom get-or-create logic, you can override the get_or_create_analyses method.
AbstractApplication comes with utility methods that you may want to use. Here are some examples of commonly used ones:
Now that we know how to link analyses to experiments, let's dive into creating data processing commands. Our only objective is to use the analysis and settings objects to build a shell command and return it as a string (ignore inputs for now, we will learn more about it when specifying dependencies).
This also means we can perform actions when an analysis is trashed, which can be useful for cleaning up files:
Isabl is agnostic of the compute infrastructure you're working on and can be configured to work with different batch systems (e.g. local, HPC, cloud). Currently, Isabl supports local, LSF, SGE, and Slurm submissions; however, you can create a submitter for any other scheduler.
Isabl comes with prebuilt logic to submit thousands of analyses to LSF, SGE, and Slurm using Job Arrays. To do so, simply set the corresponding Isabl CLI setting as follows:
This submitter can check for the following configurations:
You can implement functions for other schedulers; the function must take a list of tuples, each tuple being an analysis and the analysis' head job script.
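A minimal sketch of a custom submitter that simply runs each head job serially (the real hook may receive additional arguments, so verify the signature against the isabl_cli reference):

```python
import subprocess


def my_submitter(command_tuples):
    # each tuple is (analysis, path to the head job script)
    for analysis, script in command_tuples:
        subprocess.check_call(["bash", script])
```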
Isabl applications can be run by multiple users in the same unix group. However, if applications are run by users other than the admin user, analyses will be set to FINISHED instead of SUCCEEDED. isabl process-finished can then be run by the admin user to copy and own the results and set the permissions to read-only whilst updating the analyses' status to SUCCEEDED. We recommend you add the following cron task to the admin user's profile using crontab -e:
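For example (the hourly interval is just a suggestion):

```bash
# run by the admin user: take ownership of FINISHED analyses and mark them SUCCEEDED
0 * * * *  isabl process-finished
```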
Tip: this is useful when creating !
Merge operations are triggered automatically when the last analysis that is meant to be merged finishes running. By default, the merge operation will be conducted right after the analysis is patched to SUCCEEDED. However, you can define how merge analyses are submitted using an Isabl CLI setting. For example in LSF:
Our goal is to make it extremely easy to test your applications. Ideal apps can be tested locally, with fake/dummy data, using factory-created database instances. Isabl CLI and Cookiecutter Apps come with a range of utilities to help you test your applications. If you created your project using Cookiecutter Apps, the following pytest fixtures are available to you:
commit: Enables you to run pytest using a --commit flag; this flag can be used to actually commit (run) the application.
tmpdir: This is like the regular pytest tmpdir fixture, yet it comes with some perks. First, it sets the current storage directory to a temporary directory. Second, it comes with tmpdir.docker, a method to create executable scripts that call docker containers with specific entrypoints. For example, tmpdir.docker("ubuntu", "echo") creates an executable script that calls echo using an Ubuntu image.
Look at the example below to learn how to use these fixtures.
We recommend limiting use of these factories to development instances of Isabl API. By default, ISABL_API_URL points to a local development instance.
Here is a comprehensive example to test our HelloWorldApp; projects created with Cookiecutter Apps will include this test:
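A rough sketch of such a test; the factory calls and the run signature are assumptions:

```python
def test_hello_world(commit, datadir, tmpdir):
    from isabl_cli import api, factories

    # register a fake experiment (factory and field names are assumptions)
    experiment = api.create_instance("experiments", **factories.ExperimentFactory())

    # run the application; pass --commit to pytest to actually execute the command
    HelloWorldApp().run([([experiment], [])], commit=commit)
```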
A second, much more comprehensive test creates two experiments per individual and validates, among other things, that the auto-merge was actually conducted and that the application behaves as expected.