Location | scratch/bin/Pyed-piper OR /packages/tnorth/bin/Pyed-piper |
Modules [module load <insert here>] | Python/3.8.2-GCCcore-9.3.0 |
Github | [https://github.com/TGenNorth/Pyed-piper] |
This is a collection of Python scripts that use EDirect functions to query the NCBI databases for data such as sequences, assemblies, and accession numbers. Most scripts work by piping data into EDirect, hence the name: Python-to-EDirect piper scripts [Py->Ed Piper]. Because the scripts pipe their input and accept files of any size, very large query files may fail. In that case, consider running the scripts as sbatch Slurm job arrays instead, which staggers the data requests so that large or rapid calls do not overwhelm the database.
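One way to prepare a large query file for a Slurm job array is to split it into chunks and let each array task process only its own chunk. The following is a minimal sketch, not part of Pyed-piper: the helper names `chunk_lines` and `queries_for_task` are hypothetical, and it assumes the array is submitted with zero-based task indices (for example `--array=0-9` for ten chunks), which Slurm exposes to each task as `SLURM_ARRAY_TASK_ID`.

```python
import os


def chunk_lines(lines, n_chunks):
    """Split non-empty lines into n_chunks contiguous, nearly equal chunks."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    size, extra = divmod(len(lines), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional line.
        end = start + size + (1 if i < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks


def queries_for_task(lines, n_chunks, task_id=None):
    """Return the chunk for one array task (hypothetical helper).

    If task_id is not given, it is read from SLURM_ARRAY_TASK_ID,
    assuming a zero-based array; outside Slurm it falls back to 0.
    """
    if task_id is None:
        task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    return chunk_lines(lines, n_chunks)[task_id]
```

Each task could then write its chunk to a temporary file and pass that file to one of the scripts below as its input file, so that no single run has to handle the full query list.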
Script | CMD Line | Description |
genbankFetch | python genbankFetch.py <database> <query> <output_file> [-h, --help] | Runs esearch with the specified database and query, then passes the IDs found to an elink lookup against the nuccore database to fetch the FASTA data, which is compiled and written to the output file. |
getAccessionNumbers | python getAccessionNumbers.py <organism> <in_file> <out_file> [-h, --help] | Uses esearch with the specified organism plus data from an assembly or SRA file to output the related accession numbers. |
getARGOSReads | python getARGOSReads.py <srp_number> [-o, --output-file : DEFAULT=./ARGOSReads_<SRP#>.txt] [-h, --help] | Uses esearch with the specified SRP number to collect all run IDs and biosample names. |
getAssembly | python getAssembly.py <accession_number> [-o, --output-folder : DEFAULT=./] [-h, --help] | Queries the latest properties for the given accession number and compiles all IDs found into a FASTA file named "<accession_number>.fasta" in the designated output folder. |
getGeneCoords | python getGeneCoords.py <gene_name> [-f, --flanking : DEFAULT=0] [-o, --output-file : DEFAULT=geneCoords_<gene_name>.tsv] [-h, --help] | Using the provided gene name and species/organism, compiles CHR info into a TSV file, appending to the file if it already exists. |
getNCBIAssemblies | python getNCBIAssemblies.py <organism_name> [-o, --output-folder : DEFAULT=./] [-h, --help] | Uses the organism name to gather Assembly ID, Name, Organism, Assembly Level, Num Contigs, and Fasta File into a single output file, and also writes the assembly FASTA files to the output folder where possible. |
getSequences | python getSequences.py <in_file> [-o, --output-folder : DEFAULT=./] [-h, --help] | Takes an input file of line-separated queries for EDirect's esearch and outputs the assembly FASTA files where possible. |
getSRAData | python getSRAData.py <srp_number> [-o, --output-folder : DEFAULT=./] [-h, --help] | Queries the latest properties for the given SRP number, then compiles all IDs found and runs them through the sratoolkit script, which writes its output to the designated output folder. |
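The command lines in the table share one shape: a script name, positional arguments, and for most scripts an optional `-o` output location. As a rough illustration, the sketch below assembles such a command as an argument list that could be passed to `subprocess.run`. The helper `build_command` is hypothetical and not part of Pyed-piper, and `ACCESSION` is a placeholder rather than a real accession number.

```python
import shlex


def build_command(script, *positional, output=None):
    """Assemble an argv list for one script (hypothetical helper).

    `output` maps to the -o option listed in the table; leave it as None
    for scripts that take their output path as a positional argument,
    such as genbankFetch.py and getAccessionNumbers.py.
    """
    argv = ["python", script, *positional]
    if output is not None:
        argv += ["-o", output]
    return argv


# ACCESSION is a placeholder, not a real accession number.
cmd = build_command("getAssembly.py", "ACCESSION", output="./assemblies")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when a query or organism name contains spaces.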