This page is a quick catalogue of the scripts in the Aspen cluster's tnorth scratch bin folder. Many descriptions are partial and subject to revision. More complex scripts may eventually receive their own help pages documenting their use and purpose; if you would like a specific help page created soon, please contact the bioinformatics department.


Paths: /packages/tnorth/bin OR /scratch/bin

/OLD/<script>: Sub folder.
Contains scripts too old to be usable. If a script found here has no newer alternative in /scratch/bin and a newer version is needed, please make a request; otherwise these files will remain deprecated.
/Pyed-piper/<script>: Sub folder.
Contains Python-based edirect scripts. Many of these replace old Perl scripts that fulfill the same purpose.
deleteFiles.csh: C Shell Script.
Given a folder, deletes the files and folders within it that are bwamem, bowtie2, novo, snap, xml, spades, pilon, trimmomatic, or "working_directory".
g3-iterated.csh: C Shell Script.
Runs Glimmer3 on a given genome file, also takes in a tag to use for the output file naming.
replaceScripts.csh: C Shell Script.
Looks across a folder for spades soft links and creates a replacement for each. 
addSequenceDescriptions: Perl Script.
Takes in fasta and outputs a modified version with added descriptions if GAS emm type or Borrelia is present.
append2xls: Perl Script.
Takes in a tsv file, an xml file and optionally a name for the worksheet made. The tsv should be 2 columns of key/sample and value. The key is used to determine where in the xml to append the associated value data.
assembleUnmappedReads: Bash Script.
Creates an assembly, for each bam in the current directory, from all the unaligned segments.
canSnp: Python 3 Script.

This script was requested by Jolene Bowers and intern Stephanie Casey as a simplified version of one of Jason Sahl's scripts. They were comparing trees and wanted to identify SNPs where a pair of clades split.

Takes in a tsv file for a snp matrix and 2 other tsvs for groups 1 and 2 which should have each sample name on a separate line.

combineRuns: Bash Script.
Combines all R1 fastq.gz files in the current directory.
countReads: Bash Script.
Takes as arguments either *.fastq (all files in directory) or can be supplied with fastq file list. Counts all reads in given files and displays the count for each file to the console screen.
countReads_faster: Bash Script.
Counts all reads in all fastq files in the current directory. Output is displayed via individual lines printed to the console screen in the format of sample name, tab, and then the count.
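
For reference, read counting of this kind usually comes down to dividing the line count by four. The sketch below is not the Bash script itself; it assumes well-formed four-line FASTQ records, and the function names are made up for illustration.

```python
import gzip
import io


def count_reads(handle):
    """Count FASTQ records in an open text handle, assuming 4 lines per record."""
    lines = sum(1 for _ in handle)
    return lines // 4


def count_reads_in_file(path):
    """Open a plain or gzipped FASTQ file and count its reads."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_reads(handle)


# In-memory example with two reads, printed as: sample name, tab, count.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n")
print("sample1", count_reads(example), sep="\t")
```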
demultiplex: Bash Script.
Purpose: "Demultiplex a run from an Illumina sequencer. Works for all Illumina platforms."

Converts the bcl files found under the set run folder into fastq.gz files in the output directory, which will be created if it doesn't exist. You can include any additional parameters to bcl2fastq at the end. The job will be submitted to the job queueing system.

Arguments/Flag Options:

-h, --help => Prints usage message

-r, --runfolder_dir => Path to run folder directory

-o, --output_dir => Path to demultiplexed output

-s, --sample_sheet => Path to sample sheet

-l, --lane_splitting => Whether to split FASTQ files by lane, should be 0 for NextSeq, 1 for all others

-d, --debug_level => Minimum log level, recognized values: NONE,FATAL,ERROR,WARNING,INFO,DEBUG,TRACE

-p, --partition => SLURM partition to use, recognized values: gpu, defq

covid_demultiplex: Bash Script.
Mimics the code found in demultiplex.

Differences from demultiplex: covid_demultiplex adds an extra job after the normal demultiplex, which runs the covid pipeline script (/labs/COVIDseq/COVIDpoint/post_demux_pipeline/)

deleteFiles_new: Bash Script.
Searches the current directory for files related to bwamem, bowtie2, gatk, etc. Deletes all files matching types like .tsv, .csv, .bam, .sorted, and other similar types.
deleteFiles_user: Bash Script.
Searches the user's scratch folder for files related to bwamem, bowtie2, gatk, etc. Deletes all related files of types such as .tsv, .bam, .sorted, and so on.
dirtyDip: Bash Script.
Mimics the dirtySpades script.
Differences: The key difference is that dirtyDip uses the slurm based sbatch job system to handle tasks while dirtySpades uses the qsub command.
dirtySpades: Bash Script.
Run dirtyDip without arguments; it will automatically search for fastq files in the current directory.
Advanced Options are available if you need to override the defaults. See -h or --help

Synopsis: dirtyDip will create an assembly fasta for each fastq sample in the current directory.

The fastqs are trimmed with Trimmomatic and assembled with Spades. The assembly will be output to a file named ./<SAMPLE_NAME>.spades/contigs.fasta. 

A symbolic link to the assembly, ./<SAMPLE_NAME>.fasta, is created for convenience.

Unless the --single flag is set, the script will attempt to pair the fastqs assuming they use one of the following naming conventions:





If the --single flag is used, each read is assembled separately. The filename is used as the SAMPLE_NAME.

This script will not detect read files that do not match the expression: *R1*.fastq* (this includes not detecting *.fq files).

download_ncbi_set: Bash Script.

Given a tab-delimited file of user-chosen names (e.g. M013) and NCBI ids (e.g. NC_016928.1), this script downloads every NCBI id from the nucleotide database and saves each as a fasta file under its user-chosen name.


download_ncbi_set <tab delim id file>
downloadSRA: Bash Script.

Downloads a list of accession numbers from SRA. Should be SRP# or SRR#


-i <input file> => Specifies the input file. This should be a text file of accession numbers, one per line. 

-o <output directory> => Specifies the output directory. All read files will be downloaded to this directory. 

-h => Displays help message and exits.

Example: downloadSRA -i samples.txt -o /scratch/dlemmer/SRA/

extractBams: Perl Script.
Usage: extractBams dir_to_process results_dir
Uses two existing directories, the dir_to_process is searched for subdirectories and files wherein bam file info will be extracted and saved to the results_dir (which is NOT created if it does not exist already).
The following directories will be ignored if found: 

read_metrics, sai, bam_link_unique, bamcoverage_unique_1,

bamcoverage_unique_10, bamcoverage_unique_noINDEL_1,

bamcoverage_unique_noINDEL_10, SolSNP1, SolSNP10

fastaStats: Perl Script.
Displays the following info gathered from a fasta file:

Filename, Total contigs, Total nt, Mean length, Median length, Mode length, Max length, Min length, Length of each sequence
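
As a rough illustration of how these figures relate to the sequences, here is a small sketch. It is not the Perl script; the function names and dictionary keys are invented for the example, and it works on FASTA text held in memory rather than on a file.

```python
import statistics


def read_fasta_lengths(text):
    """Return the length of each sequence in FASTA-formatted text."""
    lengths = []
    for line in text.splitlines():
        if line.startswith(">"):
            lengths.append(0)  # start a new record
        elif lengths:
            lengths[-1] += len(line.strip())
    return lengths


def fasta_stats(text):
    """Summarise contig count, total bases and length statistics."""
    lengths = read_fasta_lengths(text)
    return {
        "total_contigs": len(lengths),
        "total_nt": sum(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "mode": statistics.mode(lengths),
        "max": max(lengths),
        "min": min(lengths),
        "lengths": lengths,
    }


# Three contigs of lengths 6, 4 and 4 (the first is wrapped over two lines).
fasta = ">c1\nACGT\nAC\n>c2\nACGT\n>c3\nACGT\n"
print(fasta_stats(fasta))
```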

fastqStats: Perl Script.
Displays the following info gathered from a fastq file:

Filename, Total contigs, Total nt, Mean length, Median length, Mode length, Max length, Min length, Length of each sequence

fixFasta: Perl Script.
Takes in a fasta file as only argument. Script must be edited to alter how sequences found will be 'fixed' and currently is set to do nothing.
gbk2fasta: Perl Script.
Takes in a genbank file as only argument. Converts the file to fasta and then writes out the file using the original file name to the same directory.
gcg2fasta: Perl Script.
Usage: gcg2fasta <file_or_directory_of_gcgs> <output_fasta>
Converts the gcg or sds file/files into fasta and then writes out to the given output.
generateGTF: Perl Script.
Uses a given fasta file to generate a gtf file.
getFlankingSequence: Python 3 script.
Given a reference, contig name, and position, extract n bases of flanking sequence on each side of the position

-r, --reference => Required. Reference fasta to extract from

-o, --out => Required. Output fasta file to write

-i, --input => Required. Input file listing contig::position within reference

-f, --flank => Optional. Number of bases of flanking sequence to return. Default=500 
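
The extraction step can be pictured with the sketch below. It is not the getFlankingSequence source: it assumes positions are 1-based, that the base at the position is included in the output, and that the window is clipped at contig ends. The helper names are hypothetical.

```python
def parse_target(line):
    """Split a 'contig::position' entry into (contig, int position)."""
    contig, position = line.strip().rsplit("::", 1)
    return contig, int(position)


def flanking_sequence(reference, contig, position, flank=500):
    """Return up to `flank` bases on each side of a 1-based position, plus the base itself."""
    sequence = reference[contig]
    index = position - 1  # convert to a 0-based index
    start = max(0, index - flank)  # clip at the start of the contig
    end = min(len(sequence), index + flank + 1)  # clip at the end of the contig
    return sequence[start:end]


# Hypothetical reference held as a dict of contig name -> sequence.
reference = {"contig1": "AAACCCGGGTTT"}
contig, position = parse_target("contig1::6")
print(flanking_sequence(reference, contig, position, flank=2))
```

With a flank of 2 around position 6 the window covers positions 4 to 8, giving CCCGG.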

getNumReads: Bash Script.
Search for all .bam files found in current directory and displays info line by line in tab separated order of: name, reads, mapped
getReadLength: Bash Script.
Search current directory for all files, expects fasta or fastq and will display a message for any non-fasta/fastq files looked at. Otherwise the script prints the length of all found sequences.
getVelvetInfo: Perl Script.
Usage: getVelvetInfo <dir_to_process>
Looks within the directory, processes all "*_logfile.txt" files from Velvet, and creates an assembly_details.txt file containing the following tab-separated columns: Sample, Chosen kmer, Num contigs, N50, Longest contig, Total bases in contigs
initDemultiplex: Perl Script.
Generate the samplesheet(s) and a script to demultiplex a HiSeq or GAIIx sequencing run.
Usage: initDemultiplex [options]

'runlog' => The sequencing log spreadsheet [REQUIRED]

'runid'   => The run ID, i.e. 'HiSeq0010'          [REQUIRED]

'fcid'      => The flowcell ID                             [REQUIRED]

'indexconverter' => The IndexConverter spreadsheet 



'control' => Whether a control was used, presence of this option means 'Y'

'operator'   => Initials of who started the run, default is 'HQ'

'recipe'       => Recipe used, default is '101x101withMP'

'initials'       => Initials of the person running the script, will be used in SampleSheet filename

'extracycle' => Whether the indexing reads run an extra cycle, default is false

'help'          => Print the help message

iqtree: Shell with Jenkins.
Jenkinsfile: A pipeline descriptor Jenkins file. Set for determining if shell status is normal or unstable.
jobstats: Python 3 script.

jobstats prints a summary table similar to the following command:

sacct -o jobid,jobname,reqmem,maxrss,reqcpus,usercpu,timelimit,elapsed,state

In addition, it includes a 'resource efficiency' report showing how well the allocated resources matched the used resources.

Usage: python [optional additional sacct args]

link_files: Bash Script.
Usage: link_files <txt file list> <place to begin search>
Searches based on the criteria and performs "ln -s..." to create links within the current directory.
make_gene_mlst_db: Bash Script.
Usage: make_gene_mlst_db <analysis type: gene or mlst> <organism i.e. saureus>
For gene mode: creates a blast database for each gene (all within the current directory)

For mlst mode: if not already downloaded, downloads MLST data from pubmlst for the given organism. Extracts the first alleles into separate files, creates a blast database for each housekeeping gene (all within the current directory)

make_MLST_db: Bash Script.
Usage: make_MLST_db <organism i.e. saureus> <path to mlst gene and profile files>

Downloads MLST data from pubmlst for the given organism, extracts the first alleles into separate files, and creates a blast database for each housekeeping gene (all within the current directory)

mash_screen: Slurm sbatch bash script.
Usage: sbatch mash_screen
Creates a slurm job array which takes all *.fastq.gz in current directory and calls the mash script
mergeReads: Bash Script.
Looks for all fastq/fq files and uses PEAR plus the pbs job manager to handle read merging.
mksquashfs: Non-TGen Compiled Code.
mlst: Perl Script.
Requires BLAT and gzip. Downloads PubMLST data and then uses the scheme set to create an output csv.
Usage: mlst [options] --scheme XXX <contigs.fasta> ...
CMD Options:

help, verbose, datadir, list, longlist, scheme, noheader, csv, nopath

mlst-download_pub_mlst: Bash Script.
Performs and handles the data for the following cmd call:

wget '' --no-check-certificate

rm -rf mlst_db/*

modifyFasta: Bash Script.
Usage: modifyFasta <fasta_file>
Modifies the fasta file, set to edit the name to replace '-' with '_'
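
The renaming rule can be sketched as below. This is a Python illustration rather than the Bash script, and it assumes only header lines (those starting with '>') are changed, so that gap characters inside sequences are left alone. The function name is hypothetical.

```python
def fix_headers(fasta_text):
    """Replace '-' with '_' in FASTA header lines only."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            line = line.replace("-", "_")  # sequence lines are not touched
        out.append(line)
    return "\n".join(out) + "\n"


print(fix_headers(">sample-01-contig-2\nAC-GT\n"), end="")
```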
nasp_matrix_to_fasta: Perl Script.

Converts a NASP snp matrix file into a fasta file


Copy the script to your local machine, then on the command line type:

nasp_matrix_to_fasta <source_file> <output_file> <max_seq_length>


    source_file - a tab delimited snp matrix file as output by NASP

    output_file - the name of the fasta file to output

    max_seq_length - this is the max length for each line in a sequence. 

     A line break is inserted at that point. Defaults to 60.


    fasta file (all snps for that organism concatenated together)
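
The conversion can be pictured with the sketch below. It is not the Perl script, and the matrix layout is an assumption: a header row, a first column holding the locus id, and every remaining column treated as one sample's calls (a real NASP matrix carries additional columns that this sketch does not handle). The function name is hypothetical.

```python
def matrix_to_fasta(matrix_text, max_seq_length=60):
    """Concatenate each sample column into one sequence, wrapped at max_seq_length."""
    rows = [line.split("\t") for line in matrix_text.splitlines() if line]
    header, data = rows[0], rows[1:]
    records = []
    for col in range(1, len(header)):  # column 0 is assumed to be the locus id
        sequence = "".join(row[col] for row in data)
        # Insert a line break every max_seq_length characters.
        wrapped = [
            sequence[i:i + max_seq_length]
            for i in range(0, len(sequence), max_seq_length)
        ]
        records.append(">" + header[col] + "\n" + "\n".join(wrapped))
    return "\n".join(records) + "\n"


# Hypothetical matrix with three SNP positions and two samples.
snp_matrix = "LocusID\tS1\tS2\nc::1\tA\tT\nc::2\tC\tC\nc::3\tG\tA\n"
print(matrix_to_fasta(snp_matrix, max_seq_length=2), end="")
```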

neben: Non-TGen Compiled Code.
Unknown TODO
new_script: Bash Script [Non-TGen].

Usage: new_script [-h|--help] [-q|--quiet] [-s|--root] [script]


new_script - Bash shell script template generator

Copyright 2012, William Shotts <>

This program is free software: you can redistribute it and/or modify

it under the terms of the GNU General Public License as published by

the Free Software Foundation, either version 3 of the License, or

(at your option) any later version.

This program is distributed in the hope that it will be useful,

but WITHOUT ANY WARRANTY; without even the implied warranty of

MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the

GNU General Public License at <> for

more details.

Revision history:

   2014-03-20  Corrected bug in insert_help_message() discovered by

             Lev Gorenstein <> (3.3)

   2014-01-21  Minor formatting corrections (3.2)

   2014-01-12  Various cleanups (3.1)

   2012-05-14  Created

paired_haplotype_smor.pl: Perl Script.
Usage: <inputfile.bam>
Using the input bam file this script checks the sequences inside against a list of specific positions: eisPlus1000, gyrAPlus1000, inhAPlus1000, katGPlus1000, rpoBPlus1000, and rrsPlus1000. It outputs by printing out a tab separated table with the following headers: Chromosome, Pos | #AA, #CC, #GG, #TT, #Hom, | %AA, %CC, %GG, %TT, | #A, #C, #G, #T, #Cov, | %A, %C, %G, %T, | Raw
paired_haplotype_smor_ASAP.pl: Perl Script.
Usage: <inputfile.bam>
Works just like paired_haplotype_smor.pl, but the position names have been shortened to just eis, gyrA, and so on (i.e. without the "Plus1000" text).
pbs_header.sh: Bash Script.

Version 1.1

Written by Joshua Colvin


Takes two or three arguments:

- Directory command should be executed in.

- Command to execute.

Note that it is important to surround entire command with quotes

so that pipes and redirects are handled properly.

- If optional third argument is present, it contains the name of the file

to create if the program fails.

(make sure file doesn't exist before running)

Example: echo_and_exec $PBS_O_WORKDIR "wc -l *.fastq > fastq_line_count.txt"

This script mainly contains a single function called echo_and_exec, which executes the given command via the PBS system and also prints the original command to the screen.

Since Aspen no longer uses PBS this script is likely DEPRECATED

pear_unmapped: Bash Script.
Without arguments, the script prints a SLURM sbatch template that will submit a job for each .bam file in the directory. To run the template you can either:

    1. pipe the template directly to sbatch:

       ./pear_unmapped | sbatch

    2. save the template to a file and then submit to sbatch:

       ./pear_unmapped > any_filename


      sbatch any_filename

plasmidOrGenome: Bash Script.
Usage: ./plasmidOrGenome <input_fasta>
Creates an output file that is <input name>.pog.txt which contains the following tab separated headers and then lines of tab separated data:
contig name, contig length, plasmid hit ratio, top hit length, top hit, determination.
pmi_demultiplex: Bash Script.
Help Msg: ./pmi_demultiplex -h
Demultiplex a run from an Illumina sequencer. Works for all Illumina platforms.
Usage: ./pmi_demultiplex [OPTIONS] [ADDITIONAL OPTIONS PASSED DIRECTLY TO bcl2fastq]

-r | --runfolder_dir => Path to run folder directory 

(default: ./ )

-o|--output_dir => Path to demultiplexed output 

(default: ./Data/Intensities/Basecalls/)

-s|--sample_sheet => Path to sample sheet 

(default: ./SampleSheet.csv/)

-l | --lane_splitting => Whether to split FASTQ files by lane, should be 0 for NextSeq, 1 for all others 

(default: 0)

-d | --debug_level => Minimum log level, recognized values: NONE,FATAL,ERROR,WARNING,INFO,DEBUG,TRACE

(default: WARNING)

-p | --partition => SLURM partition to use, recognized values: gpu, defq 

(default: defq)

prep_reads_with_numbers.py: Python Script.
Usage: python <read 1 file> <read 2 file> <new read 1 file> <new read 2 file>
This script takes in 2 read files and appends "-1" to every sequence name in read 1 and "-2" to every sequence name in read 2. The new versions are saved at the given new file locations for read 1 and read 2.
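
A minimal sketch of the renaming, not taken from prep_reads_with_numbers.py: it assumes four-line FASTQ records and that the suffix is attached to the first whitespace-delimited token of each header line. The function name is hypothetical.

```python
def tag_read_names(fastq_text, suffix):
    """Append `suffix` to the read name on every FASTQ header line (every 4th line)."""
    out = []
    for i, line in enumerate(fastq_text.splitlines()):
        if i % 4 == 0:
            # Keep any description after the first space unchanged.
            name, _, rest = line.partition(" ")
            line = name + suffix + (" " + rest if rest else "")
        out.append(line)
    return "\n".join(out) + "\n"


read1 = "@r1 1:N:0\nACGT\n+\nIIII\n"
read2 = "@r1 2:N:0\nTTGA\n+\nIIII\n"
print(tag_read_names(read1, "-1"), end="")
print(tag_read_names(read2, "-2"), end="")
```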
primerDimer: Python Script.
Usage: python primerDimer -h
 python primerDimer [-s, --sort] FASTA1 FASTA2

This script takes in 2 fasta files and prints out info such as interaction count, the primers score, a total sum of scores, bond strength and then finally which of the 2 will have greater coverage.
Usage: python <cluster_names> <consensus_file> <outputfile>
This script pulls clusters out of consensus.seqs to align reads using the list of cluster names given on the fasta/consensus file given and the results are then written to the output file.
Perl Script.
Usage: pull_contig_from_blat <fasta_file>
Creates a temporary file and then searches the display ids for "21863 217408 2870903 821+,...,3456+" and writes those to the temp file, and then finally overwrites the main file with the temp file.

Python Script.

Usage: python <gene_seqs> <psl_file> <fasta_file> <output_destination/file_name>

Gene sequences will only be identified by the first thing in the fasta header.

Only works on nucleotide  BLATs!!!

This script identifies if a gene seq given is a full hit within the fasta file or psl file and then writes the results to the given output fasta file.

Python Script.

Usage: python Pull_gene_sequence_fromPSL_v3 .py <gene_seqs> <psl_file> <fasta_file> <output_destination/file_name>

Works just like, but allows for splicing when determining gene sequence matches.
q.pl: Perl Script.
This script takes in a single argument which is to be a cmd to execute on the PBS job manager.
Since Aspen no longer uses PBS this script is likely DEPRECATED.
Shell Script.
This script kills a range or all of PBS/Torque jobs owned by the current user.
Since Aspen no longer uses PBS this script is likely DEPRECATED.
Perl Script.
This script kills a range or all of PBS/Torque jobs that possess a job id within the given range (inclusive)
Since Aspen no longer uses PBS this script is likely DEPRECATED.
Perl Script.
Usage: removeShortContigs <assembly.fasta> <cutoff_size>
Combs through the fasta and deletes any contigs which fall below the given size, prints out the number removed.
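
The filtering step can be sketched as follows. This is an illustration in Python rather than the Perl script; it assumes "fall below" means strictly shorter than the cutoff, and the function names are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of [name, sequence] records."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            records[-1][1] += line.strip()
    return records


def remove_short_contigs(text, cutoff):
    """Keep contigs with length >= cutoff; return (kept FASTA text, number removed)."""
    records = parse_fasta(text)
    kept = [(name, seq) for name, seq in records if len(seq) >= cutoff]
    fasta = "".join(">" + name + "\n" + seq + "\n" for name, seq in kept)
    return fasta, len(records) - len(kept)


assembly = ">a\nACGT\n>b\nAC\n>c\nACGTAC\n"
filtered, removed = remove_short_contigs(assembly, 4)
print(filtered, end="")
print("removed:", removed)
```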
rename: Perl Script (Non-TGen).
Usage: rename [-v] [-n] [-f] perlexpr [filenames]

Renames the filenames supplied according to the rule specified as the

first argument. The perlexpr argument is a Perl expression which is expected to modify the $_ string in Perl for at least some of the filenames specified. If a given filename is not modified by the expression, it will not be renamed. If no filenames are given on the command line, filenames will be read via standard input.

Example: To rename all files matching *.bak to strip the extension,

you might say

    rename 's/\.bak$//' *.bak

To translate uppercase names to lower, you'd use

    rename 'y/A-Z/a-z/' *

This script was developed by Robin Barker (,

from Larry Wall's original script eg/rename from the perl source.

This script is free software; you can redistribute it and/or modify it

under the same terms as Perl itself.

Perl Script.
Usage: renameContigs <dir_to_process>
Searches through the directory given for all .fasta and .fa files and then alters their sequence names to have their display id appended to the end separated by an underscore.
Perl Script.
Usage: renameContigs <dir_to_process>
Searches through the directory given for all .fasta and .fa files and then alters their sequence names to be just their sample name.
Python Script.
Usage: python renameSamples [-h, --help] [-c --column | -r --row | -a --fasta | -f --files ] [--sheet SHEET] [--old OLD] [--new NEW] <BOOK> <LOCATION>
This script can be set to rename various parts of files using a provided work book to determine what data to change to what. The book and the location of the files to change are required. The --old and --new flags are also required and set the old headers and the new headers that are to be used. The flags -c, -r, -f when set decide if the script is to edit those sections of the file. Lastly --sheet is optional and determines which sheet in the workbook to use, otherwise it defaults to the first sheet.
Perl Script.
Usage: <file to alter>
Renames the samples in the first column of the input file
Perl Script.
Usage: <fasta file to alter>
Renames the samples in the fasta file that is passed in
Perl Script.
Usage: <path to directory containing files>
Renames the files in the given directory
Perl Script.
Usage: <SNP pipeline results file to alter>
Renames the samples in the first line of the SNP pipeline results file
replaceScript and replaceScript.csh: Bash Script.
Fixes and replaces SPAdes links; currently set to write a test output at /scratch/dlemmer/replace_spades_links.txt
Bash Script.
Usage Example: ./ --input-directory /scratch/TGenNextGen/TGN-MiSeq0123/ProjectMayhem/ --output-directory ./ --genome-size 1.2 --read-length 250 --min-coverage 20 --min-gc 42

authors: Kristin Wiggins <> and Jason Travis <>

url: (restricted access)


Anyone is welcome to use this script when quickly analyzing any whole genome sequence data.  There is a special feature to demultiplex the data before the quick analysis, which requires special permissions, but the rest of the script will work for everyone!

This should only be used as a quick look at the data and should not be used in place of detailed analysis techniques.

Pipeline Synopsis:

Automated quality checks on sequence data with the final output being two separate files of "Passed QC" and "Failed QC" with their associated metrics of GC content, average Phred Score, and Quick Coverage estimates.

Optional- Demultiplex data first; create a new directory of passed files.

Python Script.
Usage: python reverseComplement [flag] <fasta, txt file, or string>

-f, --fasta => Reverse complement all sequences in the fasta file, output new fasta file

-t, --text, => Reverse complement all sequences in the text file, output new text file

-s, --string => Reverse complement the passed string, output to STDOUT
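
The core operation behind all three modes is the same. A minimal sketch, assuming plain A/C/G/T input (other characters such as N pass through unchanged); this is not the script's own code and the names are made up:

```python
# Translation table for the four standard bases in both cases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Complement each base, then reverse the string."""
    return sequence.translate(COMPLEMENT)[::-1]


print(reverse_complement("AACG"))
```

Here AACG complements to TTGC, which reversed gives CGTT.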

Bash/Sbatch Slurm Script.
Usage: sbatch RStudioServer
Loads singularity 3.3.0 and then:
"Starts RStudio Server on the cluster. Please run using sbatch. After starting, "cat rstudio-server.job.{slurmJobID}" for details"
run_bwa.pl: Perl Script.
Usage: -help
Performs read alignments across multiple references and calculates coverage stats. Runs bwa, solsnp, bam_coverage, read metrics and SnpPipeline-0.4.jar

  -alignment <type of alignment: single/paired>


  -analysis <type of analysis: gene/full/both/bamcov/none>


  -reference <comma separated list of reference prefixes>


  -organism <string>


  -p <path to sequence folder>


  -snp_pipe <to run or not to run SnpPipeline-0.4.jar. Values: y/n>


  -ext_p <path to external fastas>


  -aln <memory needed to run bwa and picard. Format: integer followed by kb/mb/gb  recommended minimum:4G> WILLOW ONLY


  -bamcov <memory needed to run solsnp1 and solsnp1. Format: integer followed by kb/mb/gb  recommended minimum:5G> WILLOW ONLY


  -covcalc <memory needed to run and Format: integer followed by kb/mb/gb  recommended minimum:2G > WILLOW ONLY


  -snppipeline <memory needed to run SnpPipeline-0.4.jar, snpfixer_040512.php and Format: integer followed by kb/mb/gb  recommended minimum:5G > WILLOW ONLY
Works just like, but has been edited to use /media/lumberyard/bin/bwa_match_auto_040512.pbs and ~${username}/lumberyard/bin/generic.pbs. This script also focuses on fastq files rather than the more compressed fastq.gz that was default in
Works just like, but has been edited to use different defaults, do_all_reads is true, memory required for all unique is 2gb instead of 1gb. Uses for pbs: /media/lumberyard/bin/bwa_match_auto_040512.pbs
Works just like, but has been edited to perform similarly to, but with added things to the pbs job to call novoalign.
Works just like, but has been edited to search through a sequences directory and make calls, ignoring duplicates.
Works just like, but has been edited to call /media/lumberyard/bin/bwa_match_auto_callreference.pbs to handle PBS commands
Perl Script
Usage: <assembly_file>|<directory> <output_file>
Performs one of 2 functions depending on the args used. If only 1 arg is provided, it is assumed to be an assembly file; the script gets the product and output for each primer set found within and writes to a fixed output file named after the original file. If 2 args are given (directory and output), the script uses the files found in the provided directory to collect data, which it then writes to the output file as a tsv table with the headers: Sample, ccr type, mec class, SCCmec Type
screenDir: Bash Script
Executes: sbatch --array=1-$(ls -1 *R1*.fastq.gz | wc -l) /scratch/bin/mash_screen.slurm
Fasta File





sdsi_pipeline_run: Bash Script
Usage: scriptTemplate [-s|-o|-r|-p|-h]


   -s | --sequence_directory  => Sequence Directory

   -o | --output_directory     => Output Directory

   -r | --reference_sdsi          => SDSI Reference File

   -p | --adapter_file             => Adapter File

   -h | --help                         => Print Help.

Loads an Anaconda environment and calls Snakemake to execute an SDSI pipeline, which then prompts the user to make any other necessary decisions and then calls /scratch/cridenour/Projects/SDSI/SDSIVisuals/SDSiVisuals/workflow/scripts/
Author: Chris Ridenour
Bash Script
When run checks current directory for fastq files and then uses PBS to run a spades job on each file, set to be non-trimming.
Perl Script


perl <fastafile> <numsnpstokeep> <numiterations> <outputfolder> <independent|linked> <isolate1> [isolate2] [isolate3]...

Opens the given fasta file and searches it for snps which meet the criteria. A random subset of them is then kept, with its size set by numsnpstokeep. Each randomly kept set is saved into a handful of files in the given output folder.
Perl Script
Usage: perl separateBSRResults [help | filter] <file>
Takes in a file and an optional filter value (default 0.8), and removes any line in the file whose value is greater than the filter. The script assumes the file contains tab-separated data and uses the first number found on each line for the comparison.

sequencingQC: Bash Script
Usage Example: ./sequencingQC --input-directory /scratch/TGenNextGen/TGN-MiSeq0123/ProjectMayhem --output-directory ./ --genome-size 1.2 --read-length 250 --min-coverage 20 --min-gc 42

Anyone is welcome to use this script when quickly analyzing any whole genome sequence data.  There is a special feature to demultiplex the data before the quick analysis, which requires special permissions, but the rest of the script will work for everyone!

This should only be used as a quick look at the data and should not be used in place of detailed analysis techniques.

Pipeline Synopsis:

Automated quality checks on sequence data with the final output being two separate files of "Passed QC" and "Failed QC" with their associated metrics of GC content, average Phred Score, and Quick Coverage estimates.

Optional- Demultiplex data first; create a new directory of passed files.

All functions assume they are called from the output_directory
Perl Script
Usage: perl <fastq A> <fastq B> <fastq file to output>
This script outputs the contents of A and then B (4 then 4) repeating into the set output file.
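
The "4 then 4" pattern is an interleave of four-line FASTQ records. A short sketch of that pattern, not the Perl script itself; it works on lists of lines and the function name is hypothetical:

```python
from itertools import islice


def interleave(lines_a, lines_b):
    """Alternate 4-line FASTQ records: one from A, then one from B, repeating."""
    a, b = iter(lines_a), iter(lines_b)
    out = []
    while True:
        record_a = list(islice(a, 4))
        record_b = list(islice(b, 4))
        if not record_a and not record_b:
            break  # both inputs are exhausted
        out.extend(record_a)
        out.extend(record_b)
    return out


fastq_a = ["@r1/1", "AC", "+", "II", "@r2/1", "GG", "+", "II"]
fastq_b = ["@r1/2", "GT", "+", "II", "@r2/2", "CC", "+", "II"]
print("\n".join(interleave(fastq_a, fastq_b)))
```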
Perl Script
Usage: perl <inputfile.bam> <min_coverage>
Prints results to stdout (cmd screen). This script checks the given bam file and outputs the following data under these headers (tab separated): Chromosome, Min proportion non-most-frequent call (excl), Max proportion non-most-frequent call (incl), Position count, All
This script uses position ranges: eisPlus1000, gyrAPlus1000, inhAPlus1000, 

katGPlus1000, rpoBPlus1000, rrsPlus1000
Perl Script
Usage: perl <inputfile.bam> <min_coverage>
Works just like except it does not use/include the position ranges found in the tb version.
Contains the following text:
sacct -o jobid,jobname,reqmem,maxrss,reqcpus,usercpu,timelimit,elapsed,state
Link to Deprecated version of this script, see
Perl Script
Usage: perl <fastafile> <numsnpstokeep> <numiterations> <outputfolder>
Searches through given fasta file for snps and generates a number of random batches equal to the number of iterations of a size equal to the number to keep, each saved in the output destination under their own naming scheme.
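
The batching described here can be sketched as below. This is not the Perl script: it assumes the fasta is an alignment of equal-length sequences, treats any column with more than one distinct character as a SNP, and draws each batch without replacement. The function names and the seed parameter are additions for illustration.

```python
import random


def variable_columns(sequences):
    """Return 0-based alignment columns where the sequences disagree."""
    return [i for i, column in enumerate(zip(*sequences)) if len(set(column)) > 1]


def random_snp_batches(snp_columns, num_to_keep, num_iterations, seed=None):
    """Draw num_iterations random batches, each holding num_to_keep SNP columns."""
    rng = random.Random(seed)
    return [
        sorted(rng.sample(snp_columns, num_to_keep))
        for _ in range(num_iterations)
    ]


aligned = ["ACGTA", "ACGAA", "TCGTC"]
snps = variable_columns(aligned)
print(snps)
print(random_snp_batches(snps, num_to_keep=2, num_iterations=3, seed=1))
```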
snpdist: Shell Script (Compiled, thus content not human readable)
Usage: ./snpdist -h [?]
PHP Script (Symbolic Link, real file is in: /nextgen_snp_pipeline/)
Usage: php snpfixer.php <smallfile.txt> <largefile.txt> <resultfile>

Takes in two '.txt' or '.xls' files that look like this: snp

Sample1_ID Sample2_ID Sample3_ID ... SNP1_ID A T T ... SNP2_ID G G T ...

SNP3_ID A A A ... ... It will then match up all SNPs that are present in

both files. If a call at the same position on the same sample differs

between the two, it will be changed to 'N'. Any SNP not present in both

files will be discarded. The resulting file will be returned. This tool

is useful if you have stringent requirements on what gets considered a

valid SNP, but then want a downstream tool not to assume excluded SNPs

must match the reference. this version removes all snp positions that 

have N >0 and/or if the snp position is not bi-allelic

Works like snpfixer.php. No code difference from what I can tell.
SnpPipeline.jar: TODO (compiled code, needs to be looked up) 
TODO (compiled code, needs to be looked up)
splitFastaSequences: Perl Script
Usage: perl splitFastaSequences <fasta_file>
Uses BioPerl to re-output the fasta file to a temp file, which then replaces the original. The main benefit of the script is that it ensures consistent formatting; it doesn't make any alterations to the data.
TODO Compiled code, needs lookup
TODO Compiled code, needs lookup
Bash Script
Usage: srst2_mlst [-s optional flag for species]
Runs the python script and pbs job manager to handle all available fastq files in run directory. Uses the srst2 module on aspen to process the read and species data.
srst2_mult is the previous version to srst2_multiple.
Bash Script
Usage: srst2_multiple
Works similarly to srst2_mlst but does not allow any arguments to be provided. Uses the PBS job manager and the srst2 module on all fastq files in the directory it's run in.
Bash Wrapper Script
Usage: srst2_wrapper <any args to pass along to>
Calls python /packages/srst2/0.1.4/scripts/ and pipes any args to it as if run in place of the wrapper.
Perl Script
Usage: perl <assembly_file>|<directory> <output_file>
Uses either an assembly file or a directory and output file. This script contains multiple primer sets which are used to perform checks on the input data. If just an assembly file is supplied, it checks for product across all primer sets and outputs the results to the screen. If a directory and output file are supplied, the directory is searched for any files that seem to be assembly files, i.e. fa, fasta, contains 'final_assemby', etc. Each of these files is then checked against the primer sets and its output is tabulated and stored in the output file. The product is calculated by creating an amplicon search using forward and reverse primers with the assembly data; this search can potentially find a position and PCR product size.
stress: TODO Compiled code lookup
Perl Script.
Usage: perl [options] <taxid1> [ <taxid2> ... <taxidN> ]

-exclude => Accepts taxids to exclude from the output (default: None)

-alias_file => File base name (no extension) to save results into an alias file.

-title => Title to include in the generated alias file. Required if alias_file is provided.

-url_api_ready => Produce output that can be used in the NCBI URL API (default: false)

-verbose, -v => Produce verbose output, can be specified multiple times for increased verbosity (default: false)

-help, -? => Displays this man page.

Retrieves TSA projects for given NCBI taxonomy IDs

AUTHOR: Christiam Camacho (
Perl Script.
Usage: perl [options] <taxid1> [ <taxid2> ... <taxidN> ]

-exclude => Accepts taxids to exclude from the output (default: None)

-alias_file => File base name (no extension) to save results into an alias file.

-title => Title to include in the generated alias file. Required if alias_file is provided.

-url_api_ready => Produce output that can be used in the NCBI URL API (default: false)

-verbose, -v => Produce verbose output, can be specified multiple times for increased verbosity (default: false)

-help, -? => Displays this man page.

Retrieves WGS projects for given NCBI taxonomy IDs.

AUTHOR: Christiam Camacho (

TODO Compiled code lookup
Python Script
Usage: python

Reads in a tab-delimited file and outputs a transposed version.

Use the --start-line and --end-line options to control the range of lines from a file that are used for the data


-d, --debug = display debug messages

--log = name of file to write log data to (defaults to STDERR)

-i, --in = name of file to read from (defaults to STDIN)

-o, --out = name of file to write to (defaults to STDOUT)

-s, --start-line = the 0-based line number of the first line to use as data (defaults to 0)

-e, --end-line = the 0-based line number of the last line to use as data (defaults to last line)
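
The transposition with a line range can be sketched as below. This is an illustrative Python sketch rather than the script's own code: the function name transpose_tsv is hypothetical, the end line is assumed to be inclusive, and padding ragged rows with empty fields is an assumption about how uneven input might be handled.

```python
import csv


def transpose_tsv(in_handle, out_handle, start_line=0, end_line=None):
    """Write a transposed copy of the selected tab-delimited lines.

    start_line and end_line are 0-based; end_line is treated as inclusive,
    and None means "through the last line".
    """
    rows = list(csv.reader(in_handle, delimiter="\t"))
    stop = len(rows) if end_line is None else end_line + 1
    selected = rows[start_line:stop]
    if not selected:
        return
    # Assumption: pad short rows with empty fields so ragged input still works.
    width = max(len(row) for row in selected)
    padded = [row + [""] * (width - len(row)) for row in selected]
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    # zip(*rows) turns the list of rows into a list of columns.
    writer.writerows(zip(*padded))
```

Called with sys.stdin and sys.stdout as the two handles, this would mirror the STDIN and STDOUT defaults listed above.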

trfTODO Compiled code lookup
Simple Shell File

for i in *R1*; do k=`echo $i | sed 's/_001.fastq.gz//g'`; j=`echo $i | sed 's/_R1_/_R2_/g'`; l=`echo $j | sed 's/_001.fastq.gz//g'`; java -jar /scratch/jsahl/tools/UGAP/bin/trimmomatic-0.30.jar PE -threads 8 $i $j "$k"_trim_paired_1.fastq.gz "$k"_trim_unpaired_1.fastq.gz "$l"_trim_paired_2.fastq.gz "$l"_trim_unpaired_2.fastq.gz ILLUMINACLIP:scriptseq_adapter_seqs.fasta:2:25:10 MINLEN:60; done
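
The file names in the loop above are all derived from the R1 file name. A small Python sketch of that derivation, for illustration only (the function name is hypothetical, and the sed substitutions are mirrored as literal string replacements):

```python
def trimmomatic_names(read1):
    """Derive the R2 input and the four output names from an R1 file name."""
    read2 = read1.replace("_R1_", "_R2_")           # j in the shell loop
    prefix1 = read1.replace("_001.fastq.gz", "")    # k in the shell loop
    prefix2 = read2.replace("_001.fastq.gz", "")    # l in the shell loop
    outputs = [
        prefix1 + "_trim_paired_1.fastq.gz",
        prefix1 + "_trim_unpaired_1.fastq.gz",
        prefix2 + "_trim_paired_2.fastq.gz",
        prefix2 + "_trim_unpaired_2.fastq.gz",
    ]
    return read2, outputs
```

For an input such as S1_L001_R1_001.fastq.gz, the paired outputs would be S1_L001_R1_trim_paired_1.fastq.gz and S1_L001_R2_trim_paired_2.fastq.gz.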

Bash Script
Runs Trimmomatic on all fastq files in the current directory using slurm sbatch
java org.usadellab.trimmomatic.TrimmomaticPE -threads 4 $read1 $read2 ${sample}_R1_paired.fastq.gz ${sample}_R1_unpaired.fastq.gz ${sample}_R2_paired.fastq.gz ${sample}_R2_unpaired.fastq.gz ILLUMINACLIP:/scratch/bin/illumina_adapters_all.fasta:4:30:10:1:true SLIDINGWINDOW:5:20 MINLEN:60
Bash Script
Runs Trimmomatic on all fastq files in the current directory using slurm sbatch
java -jar /packages/trimmomatic/0.36/trimmomatic-0.36.jar PE -threads 4 $read1 $read2 ${sample}_R1_paired.fastq.gz ${sample}_R1_unpaired.fastq.gz ${sample}_R2_paired.fastq.gz ${sample}_R2_unpaired.fastq.gz HEADCROP:0
Bash Script
Runs Trimmomatic on all fastq files in the current directory using PBS job manager
java -jar /scratch/bin/trimmomatic-0.32.jar PE -threads 4 $read1 $read2 ${sample}_R1_paired.fastq.gz ${sample}_R1_unpaired.fastq.gz ${sample}_R2_paired.fastq.gz ${sample}_R2_unpaired.fastq.gz ILLUMINACLIP:/scratch/bin/illumina_adapters_all.fasta:2:30:10 SLIDINGWINDOW:5:20 MINLEN:80
Bash Script
Runs Trimmomatic on all fastq files in the current directory using slurm sbatch
java -jar /packages/trimmomatic/0.36/trimmomatic-0.36.jar PE -threads 4 $read1 $read2 ${sample}_R1_paired.fastq.gz ${sample}_R1_unpaired.fastq.gz ${sample}_R2_paired.fastq.gz ${sample}_R2_unpaired.fastq.gz ILLUMINACLIP:/scratch/bin/illumina_adapters_all.fasta:2:25:10:1:true SLIDINGWINDOW:5:20 MINLEN:60
Bash Script
Runs Trimmomatic on all fastq files in the current directory using slurm sbatch
java -jar /packages/trimmomatic/0.36/trimmomatic-0.36.jar PE -threads 4 $read1 $read2 ${sample}_R1_paired.fastq.gz ${sample}_R1_unpaired.fastq.gz ${sample}_R2_paired.fastq.gz ${sample}_R2_unpaired.fastq.gz ILLUMINACLIP:/scratch/bin/illumina_adapters_no_readthrough.fasta:2:25:10 SLIDINGWINDOW:5:20 MINLEN:60
Bash Script
Runs Trimmomatic on all fastq files in the current directory using slurm sbatch
java -jar /scratch/bin/trimmomatic-0.32.jar PE -threads 4 $read1 $read2 ${sample}_R1_paired.fastq.gz ${sample}_R1_unpaired.fastq.gz ${sample}_R2_paired.fastq.gz ${sample}_R2_unpaired.fastq.gz ILLUMINACLIP:/scratch/bin/illumina_adapters_all.fasta:2:25:10 MINLEN:60
Bash Script
Usage: <genome> <reads_1> <reads_2>
Uses samtools, gsnap, and the perl script to align and create sorted coords to then run the perl script
Bash Script
Usage: <genome> <reads_1> <reads_2>

Uses samtools, gsnap, and the perl script to align and create sorted coords to then run the perl script. Also adds the use of bowtie.
Bash Script
A directory (usually a TGenNextGen sub folder) set inside the code will be looped over and have its mod date changed
Bash Script
Similar to updateModDate, but has only a READFILE variable available to change
Bash Script
Usage: upload2pathogen <name>

chmod -R o+r $name*.html $name

chmod o+x $name

rsync -r $name*.html $name ${USER}

usearchTODO Compiled code lookup
Perl Script

created by Arun Rawat Version 2.1

Modified on Aug 8th 2012

This version does not generate other files like mean, std dev.

vdbTODO Compiled code lookup
TODO Compiled code lookup
Bash Script
Only handles fastq.gz. Runs on all isolates listed in isolates.txt
Note: this script uses and depends on the PBS system
Usage: velvet_multiple [options] <isolates.txt> <reads_dir> <starting_kmer> <ending_kmer> [threads_to_use]
Note: velveth doesn't seem to abide by ncpus in PBS or threads_to_use (may esplode your computer)
-h, --help = display help text and more info
-a <date_time> = Declares the time after which the job is eligible for execution. The date_time argument is in the form: [[[[CC]YY]MM]DD]hhmm[.SS]
Bash Script
Only handles fastq.gz. Runs on all isolates listed in isolates.txt
Note: this script uses and depends on the PBS system
Usage: velvet_multiple [options] <isolates.txt> <reads_dir> <starting_kmer> <ending_kmer> [threads_to_use]
Note: velveth doesn't seem to abide by ncpus in PBS or threads_to_use (may esplode your computer)
-h, --help = display help text and more info
-a <date_time> = Declares the time after which the job is eligible for execution. The date_time argument is in the form: [[[[CC]YY]MM]DD]hhmm[.SS]
Python 3 Script
Usage: python [options] <barcode_file_x> <barcode_file_y>

-o = output file (default: stdout)

-m = min Hamming distance (default: 3)

--color-balance = treat G:T and A:C as equivalent (recommended)
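
The distance check behind these options can be sketched as follows. This is a hedged Python illustration, not the script's code: the function names are hypothetical, and it assumes the script reports barcode pairs whose Hamming distance falls below the minimum. With color balancing on, T is counted as G and C as A before comparing.

```python
from itertools import product

# Assumption: G:T and A:C equivalence is modelled by mapping T->G and C->A.
COLOR_CLASSES = str.maketrans("TC", "GA")


def hamming(a, b, color_balance=False):
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must be the same length")
    a, b = a.upper(), b.upper()
    if color_balance:
        a, b = a.translate(COLOR_CLASSES), b.translate(COLOR_CLASSES)
    return sum(x != y for x, y in zip(a, b))


def too_close(barcodes_x, barcodes_y, min_distance=3, color_balance=False):
    """Return (x, y, distance) for every pair closer than min_distance."""
    hits = []
    for x, y in product(barcodes_x, barcodes_y):
        distance = hamming(x, y, color_balance)
        if distance < min_distance:
            hits.append((x, y, distance))
    return hits
```

Note that color balancing can only lower a distance, so pairs that look safe without it may be flagged once it is enabled.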

Bash Script
Script to estimate depth of coverage (x) given a read file
Usage: yield_approximation <read file> <read length> <genome size>
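
A depth-of-coverage estimate of this kind is usually computed as total sequenced bases divided by genome size. The Python sketch below illustrates that arithmetic under stated assumptions; it is not the script's own code, the function names are hypothetical, and it assumes a single FASTQ read file with four lines per record and a fixed read length.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a FASTQ file (4 lines per record); .gz is supported."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def estimate_coverage(read_count, read_length, genome_size):
    """Approximate depth of coverage (x): total bases / genome size."""
    return read_count * read_length / genome_size
```

For example, one million reads of length 150 against a 5,000,000-base genome give estimate_coverage(1_000_000, 150, 5_000_000) = 30.0, i.e. roughly 30x.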