GenBank database files

Mike Dyall-Smith
This should be easy (but not for me so far). I want to do local blast searches, so I download the premade nr protein blast database from GenBank. It is split into 10 .tar.gz files.
    I've decompressed them all, and now I want to put all the file parts together. Can I simply concatenate all similar files? (e.g. all 10 parts of the .phd files). The Readme mentions use of an alias file, but I did not find this at all clear. A set of step-by-step decompression and restoration instructions would be useful. I could not find any.
Thanks for any assistance, Mike DS

___________________________________________________________
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using "reply all" in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

  http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:

  http://galaxyproject.org/search/mailinglists/

Re: GenBank database files

Peter Cock
On Sun, Apr 28, 2013 at 11:22 PM, Mike Dyall-Smith
<[hidden email]> wrote:
> This should be easy (but not for me so far). I want to do local blast searches, so I download the premade nr protein blast database from GenBank. It is split into 10 .tar.gz files.
>     I've decompressed them all, and now I want to put all the file parts together. Can I simply concatenate all similar files? (e.g. all 10 parts of the .phd files). The Readme mentions use of an alias file, but I did not find this at all clear. A set of step-by-step decompression and restoration instructions would be useful. I could not find any.
> Thanks for any assistance, Mike DS

Don't cat anything - just download all nr.*.tar.gz files, and
decompress them. You'll have a load of files including a
special alias file called nr.pal which is how BLAST knows
how to deal with the combined 'nr' database.
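
To see which volumes the alias file ties together, something like the sketch below may help. It assumes the alias file is plain text with a line that starts with DBLIST followed by the volume names (optionally double-quoted); list_volumes is a hypothetical helper name, not a BLAST command.

```shell
# list_volumes: print one volume name per line from a BLAST alias file.
# Assumption: the file contains a line of the form
#   DBLIST nr.00 nr.01 ...
# where names may or may not be wrapped in double quotes.
list_volumes() {
    awk '$1 == "DBLIST" {
        for (i = 2; i <= NF; i++) {
            gsub(/"/, "", $i)   # drop optional quotes around each name
            print $i
        }
    }' "$1"
}
```

For example, `list_volumes nr.pal` run in the database folder should print the volume prefixes, which you can compare against the files you actually have.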

Peter

Re: GenBank database files

Peter Cock
On Mon, Apr 29, 2013 at 3:27 AM, Mike Dyall-Smith
<[hidden email]> wrote:
> Dear Peter Cock, thanks for your advice. Just to be clear, do I leave the
> files within their decompressed folders or do I put all the individual files
> into one folder? I assume the former, but want to be sure.
> Thanks again, Mike DS

Hi Mike,

Unless you're using a graphical decompression tool which is trying
to be too helpful, each tar-ball does *not* decompress into its own
folder. The files should all be in the *same* folder.
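
If a tool has already unpacked each tar-ball into its own sub-folder, a rough sketch like this could pull the files back into one place. flatten_volumes is a hypothetical helper name, and it assumes the sub-folders sit one level below the current folder.

```shell
# flatten_volumes: move files named <db>.* out of immediate sub-folders
# into the current folder. Run it from the folder holding the sub-folders.
flatten_volumes() {
    db="$1"    # database prefix, e.g. nr or nt
    for f in */"$db".*; do
        # Skip the unexpanded pattern (no matches) and any directories.
        if [ -f "$f" ]; then
            mv "$f" .
        fi
    done
}
```

Usage would be `flatten_volumes nr`. Files with the same name in different sub-folders overwrite each other, so check the result with ls afterwards.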

I use this to verify the checksums,

$ md5sum --check nr.00.tar.gz.md5
nr.00.tar.gz: OK

Then I use this to decompress the tar-balls,

$ tar -zxvf nr.00.tar.gz
etc
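
The two steps above can be repeated for every volume with a small loop. This is only a sketch: unpack_volumes is a hypothetical helper name, and it assumes each nr.NN.tar.gz sits next to its nr.NN.tar.gz.md5 file in the current folder.

```shell
# unpack_volumes: verify and extract every downloaded volume of one database.
unpack_volumes() {
    db="$1"    # database prefix, e.g. nr or nt
    for archive in "$db".*.tar.gz; do
        # If the pattern matched nothing, there is nothing to unpack.
        [ -e "$archive" ] || { echo "no $db.*.tar.gz files found" >&2; return 1; }
        # Stop at the first archive whose checksum does not match.
        md5sum --check "$archive.md5" || return 1
        # Extract into the current folder, alongside the other volumes.
        tar -zxf "$archive" || return 1
    done
}
```

Run it from the database folder, for example `cd /data/blastdb/ncbi && unpack_volumes nr`.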

(Actually I don't do this personally any more - it has been set up
to happen automatically when the NCBI update the databases.)

We keep all our NCBI databases in the same folder,

$ ls /data/blastdb/ncbi/nr.*
/data/blastdb/ncbi/nr.00.phd
/data/blastdb/ncbi/nr.00.phi
/data/blastdb/ncbi/nr.00.phr
/data/blastdb/ncbi/nr.00.pin
/data/blastdb/ncbi/nr.00.pnd
/data/blastdb/ncbi/nr.00.pni
/data/blastdb/ncbi/nr.00.pog
/data/blastdb/ncbi/nr.00.ppd
/data/blastdb/ncbi/nr.00.ppi
/data/blastdb/ncbi/nr.00.psd
/data/blastdb/ncbi/nr.00.psi
/data/blastdb/ncbi/nr.00.psq
/data/blastdb/ncbi/nr.00.tar.gz
/data/blastdb/ncbi/nr.00.tar.gz.md5
...
/data/blastdb/ncbi/nr.10.phd
/data/blastdb/ncbi/nr.10.phi
/data/blastdb/ncbi/nr.10.phr
/data/blastdb/ncbi/nr.10.pin
/data/blastdb/ncbi/nr.10.pnd
/data/blastdb/ncbi/nr.10.pni
/data/blastdb/ncbi/nr.10.pog
/data/blastdb/ncbi/nr.10.ppd
/data/blastdb/ncbi/nr.10.ppi
/data/blastdb/ncbi/nr.10.psd
/data/blastdb/ncbi/nr.10.psi
/data/blastdb/ncbi/nr.10.psq
/data/blastdb/ncbi/nr.10.tar.gz
/data/blastdb/ncbi/nr.10.tar.gz.md5
/data/blastdb/ncbi/nr.pal

We can then refer to the NR database at the command line
as /data/blastdb/ncbi/nr or as just nr if the BLAST database
path is configured to check this folder.
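
One way to configure that path is the BLASTDB environment variable. The sketch below assumes the folder from the listing above; search_nr is a hypothetical wrapper, and the query and output file names are whatever you pass in.

```shell
# Point BLAST at the folder that holds nr.pal and the volume files.
BLASTDB=/data/blastdb/ncbi
export BLASTDB

# search_nr: run a protein search against nr using the short database name.
#   $1 = query FASTA file, $2 = output file
search_nr() {
    blastp -db nr -query "$1" -out "$2"
}
```

With that in place, `search_nr query.fasta results.txt` is equivalent to passing `-db /data/blastdb/ncbi/nr` explicitly.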

In this folder we also have other NCBI databases, like NT:

$ ls /data/blastdb/ncbi/nt.*
/data/blastdb/ncbi/nt.00.nhd
/data/blastdb/ncbi/nt.00.nhi
/data/blastdb/ncbi/nt.00.nhr
/data/blastdb/ncbi/nt.00.nin
/data/blastdb/ncbi/nt.00.nnd
/data/blastdb/ncbi/nt.00.nni
/data/blastdb/ncbi/nt.00.nog
/data/blastdb/ncbi/nt.00.nsd
/data/blastdb/ncbi/nt.00.nsi
/data/blastdb/ncbi/nt.00.nsq
/data/blastdb/ncbi/nt.00.tar.gz
/data/blastdb/ncbi/nt.00.tar.gz.md5
...
/data/blastdb/ncbi/nt.13.nhd
/data/blastdb/ncbi/nt.13.nhi
/data/blastdb/ncbi/nt.13.nhr
/data/blastdb/ncbi/nt.13.nin
/data/blastdb/ncbi/nt.13.nnd
/data/blastdb/ncbi/nt.13.nni
/data/blastdb/ncbi/nt.13.nog
/data/blastdb/ncbi/nt.13.nsd
/data/blastdb/ncbi/nt.13.nsi
/data/blastdb/ncbi/nt.13.nsq
/data/blastdb/ncbi/nt.13.tar.gz
/data/blastdb/ncbi/nt.13.tar.gz.md5
/data/blastdb/ncbi/nt.nal

Note you don't need to keep the *.tar.gz and the *.md5 files
once you've verified the checksum (using md5sum to detect
any data corruption during download) and decompressed the
tar-ball.
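
A cautious way to do that clean-up is to delete an archive only once every file it contains is present in the folder. This is a sketch under that assumption; cleanup_archives is a hypothetical helper name, and it relies on the file names containing no spaces.

```shell
# cleanup_archives: remove <db>.NN.tar.gz and its .md5 file once all of the
# archive's members exist in the current folder.
cleanup_archives() {
    db="$1"    # database prefix, e.g. nr or nt
    for archive in "$db".*.tar.gz; do
        [ -e "$archive" ] || continue
        # List the archive members; keep the archive if it cannot be read.
        members=$(tar -tzf "$archive") || { echo "cannot read $archive" >&2; continue; }
        missing=0
        for f in $members; do
            [ -e "$f" ] || missing=1
        done
        if [ "$missing" -eq 0 ]; then
            rm -f "$archive" "$archive.md5"
        else
            echo "keeping $archive: not fully extracted" >&2
        fi
    done
}
```

For example, `cleanup_archives nr` leaves any volume that has not been fully extracted untouched.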

Peter

P.S. This galaxy-user list is meant for discussion of using the
tools within Galaxy from an end user perspective. Although
there is talk about creating a new Galaxy mailing list specifically
for deployment questions like this, currently galaxy-dev is
preferred for this kind of discussion.