Creating a Data Analysis Pipeline in Python
The goal of a data analysis pipeline in Python is to allow you to transform data from one state to another through a set of repeatable, and ideally scalable, steps. Problems for which I have used data analysis pipelines in Python include:
- Processing financial / stock market data, including text documents, into features for ingestion into a neural network used to predict the stock market.
- Analysis and visualization of genetic (sequencing) data for easy identification of genetically engineered organisms in complex samples.
- Analysis of user activity on websites to understand user behavior.
There are many tools available for creating data processing and analysis pipelines. This post focuses on Snakemake, a Python-based workflow management system that helps you create reproducible and scalable data analyses. Snakemake uses sets of rules to define the steps in an analysis, and it integrates smoothly with server, cluster, and cloud environments to allow easy scaling. While Snakemake is a very general framework for creating pipelines, most tutorials focus on bioinformatics applications.
The basic unit of code within Snakemake is a rule. Each rule defines the input files, the output files, and the steps to get from input to output (inline Python code, Python or R scripts, or shell commands).
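The original article illustrates this with a read-mapping rule; since that code is not reproduced here, the sketch below is a reconstruction under assumed names. The file paths, the reference index, the core variable, and the choice of bowtie2 as the aligner are all hypothetical stand-ins for whatever the real workflow uses.

```python
# Minimal sketch of a Snakemake rule (hypothetical paths and tool choice).
# It maps the sequencing reads for one sample against a reference index.

core = 8  # plain Python variable; in a larger Snakefile this is often defined elsewhere

rule map_reads:
    input:
        # {sample} is a wildcard: its value comes from whichever output file is requested
        reads="data/{sample}.fastq"
    output:
        "aligned/{sample}.sam"
    log:
        "logs/map_reads/{sample}.log"  # log files are kept even if the rule fails
    threads: core
    shell:
        "bowtie2 -p {threads} -x reference/genome_index "
        "-U {input.reads} -S {output} 2> {log}"
```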
The rule above illustrates many of the available features. Starting at the top, each rule is defined with a name that should describe its function. In this example, the rule is map_reads, i.e., it finds the original location of each short sequence of DNA (the output of sequencing) in the reference, a pre-compiled database of potential sequences. Within each rule there are three basic parts:
- Input: the paths to the files that are used as starting data for this rule
- Output: the paths to the files the rule produces
- Directions: the commands that create the output file(s) from the inputs, generally marked by shell or script to indicate a series of shell commands or a Python script.
In addition to these three basic parts of every rule, there are several other options that enable increased modularity and flexibility, such as the log option used above. Some of the available additions include (a sketch showing them in place follows this list):
- Log: specifies a log file for the rule; log files are not deleted even if the rule fails.
- Threads: the number of cores or threads to use during execution.
- Resources: the amount of memory or disk space to reserve for the rule.
- Message: a short summary of the rule that is printed during execution.
- Priority: a number indicating the urgency with which a rule should be executed; rules with higher priority are scheduled earlier.
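As a rough sketch of how these options attach to a rule, the mapping rule above might be extended as follows (the memory figure, message text, and priority value are illustrative assumptions, not the original post's settings):

```python
# Hypothetical extension of the mapping rule, showing the optional directives.
core = 8

rule map_reads:
    input:
        reads="data/{sample}.fastq"
    output:
        "aligned/{sample}.sam"
    log:
        "logs/map_reads/{sample}.log"            # kept even if the rule fails
    threads: core                                # cores made available to the command
    resources:
        mem_mb=8000                              # memory to reserve for this rule
    message:
        "Mapping reads for sample {wildcards.sample}"
    priority: 10                                 # higher-priority rules are scheduled first
    shell:
        "bowtie2 -p {threads} -x reference/genome_index "
        "-U {input.reads} -S {output} 2> {log}"
```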
All of these sections within a rule can make use of wildcards and variables, marked with curly brackets. Variables are defined with ordinary Python code and can be used within or outside of rules; in the sketch above, the core variable specifies the number of cores available and, in a larger Snakefile, would typically be defined elsewhere in the file. Wildcards are similar, but their value can change each time a rule is run; they are defined through their use in the input and output file names, like {sample} above.
To create the scalability and modularity desired from a pipeline, a rule named "all" is used to determine which rules should be run. In rule all, the desired output files are specified; from that, Snakemake determines which combination of other rules needs to be run to create the requested output files. Easy scalability comes from the expand function, which repeats the same analysis across multiple samples or sets of files. In the sketch below, this creates an alignment output file for each sample, as well as a summary file that captures information about all of the samples in one place.
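A minimal sketch of such a target rule, continuing the hypothetical mapping example (the sample names and the summary file path are assumptions, not the original post's code):

```python
# Hypothetical target rule: the files listed here determine what the workflow builds.
SAMPLES = ["sampleA", "sampleB", "sampleC"]   # plain Python list of sample names

rule all:
    input:
        expand("aligned/{sample}.sam", sample=SAMPLES),   # one alignment per sample
        "summary/alignment_summary.txt"                   # produced by another rule (not shown)
```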
The number and order of rules to run is determined by building a directed acyclic graph (DAG), with nodes representing rules and edges representing dependencies between files. Snakemake can display this graph (for example, via its --dag option), allowing users to understand at a glance what the pipeline will do and the order and dependencies within the process.
All of these capabilities and more combine to make Snakemake a powerful and flexible framework for creating data analysis pipelines. I am currently using it for all of the data analysis in Finding Engineering-Linked Indicators (FELIX), a program whose goal is to identify genetically engineered organisms in complex samples. For FELIX, I have used Snakemake to create pipelines that have been deployed on clusters to process 100 samples (hundreds of gigabytes of data), performing many hundreds of hours of compute in about two days. The pipeline includes aligning DNA sequences, assembling sequences that contain signs of engineering into larger constructs, identifying the makeup of those constructs, and creating visualizations so that subject matter experts can quickly determine whether a sample has been genetically engineered.
At the ODSC East 2020 Virtual Conference, Laura Seaman will conduct a workshop on using Snakemake to create data analysis pipelines in Python, covering some of the bioinformatics applications she has used it for.
Laura Seaman is a Senior Machine Intelligence Scientist at Draper, where she applies machine learning and bioinformatics algorithms to a variety of applications, including analysis of financial networks and identification of genetically engineered organisms. Dr. Seaman has a Bachelor of Science in Biological Engineering from the Massachusetts Institute of Technology, a Master of Arts in Statistics from the University of Michigan, and a Doctor of Philosophy in Bioinformatics from the University of Michigan.