Usage

The getNCBI script uses Entrez Direct (eDirect) to download information from the NCBI database. A basic query can be sent to the database with the -edirect argument, and a more complex download job can be submitted with the -job argument.


Command Line:

getNCBI is invoked from the command line as follows:

$ getNCBI [-h] [-v] [-edirect <edirect expression>] [-job <job expression>]


Arguments:

getNCBI accepts the following arguments:

-h, --help
Displays the help menu for getNCBI before exiting.
-v, --version
Displays the version number for getNCBI before exiting.
-edirect
Allows users to submit a direct query to the NCBI database.
-job
Allows users to submit a job (getAssembly, getSRAdata, or continue) that downloads NCBI data.


Edirect

Basic Functions:

Descriptions for all basic functions of -edirect can be found here.


A simple esearch against the database can be run with the following command:

$ getNCBI -edirect "esearch -db assembly -query PRJEB2870"


This returns the following output on the command line:

<ENTREZ_DIRECT>
  <Db>assembly</Db>
  <WebEnv>NCID_1_51606105_130.14.22.215_9001_1498587436_1976020800_0MetA0_S_MegaStore_F_1</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>379</Count>
  <Step>1</Step>
  <Tool>getNCBI</Tool>
  <Email>dlemmer@tgen.org</Email>
</ENTREZ_DIRECT>
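
The returned XML can be post-processed with standard shell tools. The following is a minimal sketch, assuming getNCBI writes the XML to standard output with one element per line, as in the output above. The function name esearch_count is hypothetical and is not part of getNCBI:

```shell
# Hypothetical helper (not part of getNCBI): read ENTREZ_DIRECT XML on
# standard input and print the numeric value of its <Count> element.
esearch_count() {
    sed -n 's:.*<Count>\([0-9][0-9]*\)</Count>.*:\1:p'
}

# Possible usage, assuming getNCBI is on the PATH and prints to stdout:
#   getNCBI -edirect "esearch -db assembly -query PRJEB2870" | esearch_count
```

The count gives a quick check of how many records a query matches before a larger download job is submitted for it.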


Piping Commands:

The output of one -edirect function can also be piped into any other built-in -edirect function. For example, the previous esearch example can be extended to display the summary for all results by piping esearch into the esummary function.


This can be done with the following command:

$ getNCBI -edirect "esearch -db assembly -query PRJEB2870 | esummary"


This returns the following output on the command line:

...
<DocumentSummary>
  <Id>327081</Id>
  <RsUid>1749898</RsUid>
  <GbUid>1742378</GbUid>
  <AssemblyAccession>GCF_000982775.1</AssemblyAccession>
  <LastMajorReleaseAccession>GCF_000982775.1</LastMajorReleaseAccession>
  <ChainId>982775</ChainId>
  <AssemblyName>7748_4#96</AssemblyName>
...
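
A single field can be pulled out of the esummary output with standard shell tools. The following is a minimal sketch, assuming getNCBI writes the XML to standard output with one element per line, as in the excerpt above. The function name list_accessions is hypothetical and is not part of getNCBI:

```shell
# Hypothetical helper (not part of getNCBI): read esummary XML on standard
# input and print the value of every <AssemblyAccession> element, one per line.
# Lines such as <LastMajorReleaseAccession> do not match the pattern.
list_accessions() {
    sed -n 's:.*<AssemblyAccession>\([^<]*\)</AssemblyAccession>.*:\1:p'
}

# Possible usage, assuming getNCBI is on the PATH and prints to stdout:
#   getNCBI -edirect "esearch -db assembly -query PRJEB2870 | esummary" | list_accessions
```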


Job

The following is a list of functions that can be performed by the -job command:

getAssembly
Downloads assembly data from the NCBI database.
getSRAdata
Downloads SRA data from the NCBI database.
continue
Continues any previous job found in the current directory.


Get Assembly:

The getAssembly function is called with the symbol ga, followed by a query. For each ID related to the query, the data is downloaded into its own ID-specific file in the current directory. getNCBI downloads the data itself and writes it directly into each file.


A getAssembly job for the query "PRJEB2870" can be submitted with the following command:

$ getNCBI -job ga "PRJEB2870"


Each file will have the naming convention:

  • "<AssemblyAccession>_<SpeciesName>_<Sub_value>_<AssemblyStatus>.fasta"

For example:

  • GCF_001354535.1_Staphylococcus_aureus_FL365_Contig.fasta
  • GCF_001354515.1_Staphylococcus_aureus_FL355_Scaffold.fasta
  • GCF_001354495.1_Staphylococcus_aureus_FL387_Scaffold.fasta
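
To illustrate the convention, the sketch below assembles a file name from its four fields. It assumes that spaces in the species name become underscores, which the examples above suggest but the text does not state. The function name assembly_filename is hypothetical and is not part of getNCBI:

```shell
# Hypothetical helper (not part of getNCBI): build a file name of the form
# <AssemblyAccession>_<SpeciesName>_<Sub_value>_<AssemblyStatus>.fasta
#   $1 = assembly accession   $2 = species name
#   $3 = Sub_value            $4 = assembly status
# Assumption: spaces in the species name are replaced with underscores.
assembly_filename() {
    printf '%s_%s_%s_%s.fasta\n' \
        "$1" "$(printf '%s' "$2" | tr ' ' '_')" "$3" "$4"
}

# Example call:
#   assembly_filename GCF_001354535.1 "Staphylococcus aureus" FL365 Contig
```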


Get SRA Data:

The getSRAdata function is called with the symbol gs, followed by a query. For each ID related to the query, the data is downloaded into its own ID-specific file in the current directory. Unlike getAssembly, the download for each ID is submitted as a job to the cluster, which downloads the data and places it into the corresponding file in the working directory.


A getSRAdata job for the query "PRJNA240563" can be submitted with the following command:

$ getNCBI -job gs "PRJNA240563"


Each file will have the naming convention:

  • "<Run/Runs/acc>.fastq.gz"

For example:

  • SRR1548323.fastq.gz
  • SRR1548324.fastq.gz
  • SRR1548325.fastq.gz
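
Because the downloads run as cluster jobs, it can be useful to check which files have arrived. The following is a minimal sketch that relies only on the naming convention above. The function name missing_runs is hypothetical and is not part of getNCBI, and an existing file does not by itself prove that its download has finished:

```shell
# Hypothetical helper (not part of getNCBI): given a directory followed by
# run accessions, print each accession that has no <run>.fastq.gz file in
# that directory yet.
missing_runs() {
    dir=$1
    shift
    for run in "$@"; do
        [ -f "$dir/$run.fastq.gz" ] || printf '%s\n' "$run"
    done
}

# Example call, checking the current directory:
#   missing_runs . SRR1548323 SRR1548324 SRR1548325
```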


Continue:

The continue function is called with the symbol continue.


A previous job can be continued with the following command:

$ getNCBI -job continue