Aspen edirect esearch elink sra assembly nuccore python : TGen North Bioinformatics Solutions

Location

scratch/bin/Pyed-piper OR /packages/tnorth/bin/Pyed-piper

Modules [module load <insert here>]

Python/3.8.2-GCCcore-9.3.0

Github

TGenNorth/Pyed-Piper

[https://github.com/TGenNorth/Pyed-piper]

This collection of Python scripts uses EDirect functions to query the NCBI databases for data such as sequences, assemblies, and accession numbers. Most of the scripts work by piping data into EDirect, hence the name: Python to EDirect piper scripts [Py->Ed Piper]. Because many of these scripts pipe data and accept input files of any size, very large query files may fail. In that case, consider running the scripts as sbatch Slurm job arrays, which stagger the data requests and avoid overwhelming the database with large or rapid calls.
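One way to stagger requests with a Slurm job array is to split a large query file into chunks and let each array task handle one chunk. The following is only a minimal sketch under stated assumptions: the helper names `split_queries` and `chunk_for_task` are hypothetical and not part of Pyed-Piper, and the chunk size is arbitrary. The one Slurm-specific piece is the `SLURM_ARRAY_TASK_ID` environment variable, which Slurm sets for every task in a job array.

```python
import os


def split_queries(queries, chunk_size):
    """Split a list of query strings into consecutive chunks of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [queries[i:i + chunk_size] for i in range(0, len(queries), chunk_size)]


def chunk_for_task(queries, chunk_size, task_id=None):
    """Return the chunk of queries that one Slurm array task should process.

    If task_id is not given, it is read from SLURM_ARRAY_TASK_ID (defaulting
    to 0 outside of Slurm). Hypothetical helper, not part of Pyed-Piper.
    """
    if task_id is None:
        task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    chunks = split_queries(queries, chunk_size)
    # An array index beyond the last chunk simply has nothing to do.
    if not 0 <= task_id < len(chunks):
        return []
    return chunks[task_id]
```

Each array task could write its chunk to a small file and pass that file to a script such as getSequences.py. Submitting with something like `sbatch --array=0-9%2 job.sh` (where `job.sh` is a placeholder name) limits the array to two tasks running at once, which spaces out the calls to NCBI.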

Script	CMD Line	Description
genbankFetch	python genbankFetch.py <database> <query> <output_file> [-h, --help]	Runs esearch with the specified database and query, passes the IDs found to an elink lookup against the nuccore database, and compiles the resulting FASTA data into the output file.
getAccessionNumbers	python getAccessionNumbers.py <organism> <in_file> <out_file> [-h, --help]	Runs esearch with the specified organism plus data from an assembly or SRA file and outputs the related accession numbers.
getARGOSReads	python getARGOSReads.py <srp_number> [-o, --output-file : DEFAULT=./ARGOSReads_<SRP#>.txt] [-h, --help]	Runs esearch with the specified SRP number to collect all run IDs and biosample names.
getAssembly	python getAssembly.py <accession_number> [-o, --output-folder : DEFAULT=./] [-h, --help]	Queries the latest properties for the given accession number and compiles all IDs found into a FASTA file named "<accession_number>.fasta" in the designated output folder.
getGeneCoords	python getGeneCoords.py <gene_name> [-f, --flanking : DEFAULT=0] [-o, --output-file : DEFAULT=geneCoords_<gene_name>.tsv] [-h, --help]	Using the provided gene name and species/organism, compiles CHR info into a TSV file, appending to the file if it already exists.
getNCBIAssemblies	python getNCBIAssemblies.py <organism_name> [-o, --output-folder : DEFAULT=./] [-h, --help]	Uses the organism name to gather Assembly ID, Name, Organism, Assembly Level, Num Contigs, and Fasta File into one output file, and also writes the assembly FASTA files to the output folder if able.
getSequences	python getSequences.py <in_file> [-o, --output-folder : DEFAULT=./] [-h, --help]	Takes an input file of line-separated queries for EDirect's esearch script and outputs the assembly FASTA files if able.
getSRAData	python getSRAData.py <srp_number> [-o, --output-folder : DEFAULT=./] [-h, --help]	Queries the latest properties for the given SRP number, compiles all IDs found, and runs them through the sratoolkit script, which writes to the designated output folder.
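The general "Python pipes into EDirect" pattern described for genbankFetch can be sketched as follows. This is not the actual source of genbankFetch.py. It is an assumed reading of the description above, using the standard EDirect commands `esearch`, `elink`, and `efetch`. The function names `build_fetch_pipeline` and `fetch_fasta` are hypothetical.

```python
import shlex
import subprocess


def build_fetch_pipeline(database, query):
    """Build an esearch | elink | efetch shell pipeline as a single string.

    shlex.quote protects the user-supplied database and query values from
    being interpreted by the shell. Hypothetical helper, not Pyed-Piper code.
    """
    return (
        f"esearch -db {shlex.quote(database)} -query {shlex.quote(query)}"
        " | elink -target nuccore"
        " | efetch -format fasta"
    )


def fetch_fasta(database, query, output_file):
    """Run the pipeline and write the FASTA output to output_file.

    Requires EDirect to be installed and on PATH. With shell=True and
    check=True, only the exit status of the last command in the pipe
    (efetch) is checked.
    """
    cmd = build_fetch_pipeline(database, query)
    with open(output_file, "w") as out:
        subprocess.run(cmd, shell=True, stdout=out, check=True)
```

For example, `fetch_fasta("assembly", "Yersinia pestis", "out.fasta")` would run the pipeline for an illustrative organism query and save the result, assuming network access and a working EDirect installation.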

EDirect Python Scripts List (Pyed Piper)