Administration Documentation

Sequence databases

The sequence databases can be indexed in a variety of ways, contain data in a variety of sequence formats and be accessed by a variety of methods. These are defined through a set of control files. For example, EMBL entries could be read by:

Setting up databases is covered in detail in the Admin Guide.


Admin Guide to setting up EMBOSS

You should read the EMBOSS Administrators Guide by David Martin, Peter Rice and Alan Bleasby.

There is also a short guide in German that shows how to install Kaptain and EMBOSS.


Example Databases

If you wish to gain experience in setting up various styles of databases under EMBOSS, some small example databases are included in the test directory that appears when you unpack the release.

The EMBOSS developers use them to test database indexing and sequence reading.

See directories:

test/data    (emrod (DNA) and swnew (protein) are in blast format)
test/embl    (*.dat for EMBL format, *.ref and *.seq for gcg format)
test/pir     (*.ref and *.seq for nbrf format)
test/swiss   (*.dat for swissprot format, 1 file)
test/swnew   (*.dat for swissprot format, 3 files)
test/wormpep (wormpep is in fasta and blast format)

If you use the emboss/emboss.default.template file to create your own emboss.default file, change the definition of emboss_tempdata at the top to point to your test directory and you can use the test databases as "tembl", "tsw" and so on. The databases contain the sequences in the program examples (see the web pages, or run the "tfm" program to see the documentation).
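The edit to emboss_tempdata described above can also be scripted. The following is a minimal Python sketch, not part of EMBOSS. It assumes the template defines the variable on a single line of the form "SET emboss_tempdata <path>"; check your copy of the template before relying on it. The function names and the example path are hypothetical.

```python
import re
from pathlib import Path

# Assumption: the template holds one line shaped like
#   SET emboss_tempdata <some path>
# Group 1 captures everything up to the path so it can be kept unchanged.
_TEMPDATA_LINE = re.compile(
    r"^([ \t]*set[ \t]+emboss_tempdata[ \t]+)\S.*$",
    re.IGNORECASE | re.MULTILINE,
)


def point_tempdata_at(template_text: str, test_dir: str) -> str:
    """Return the template text with emboss_tempdata set to test_dir."""
    # A function is used as the replacement so that backslashes in
    # test_dir are never interpreted as regex escapes.
    new_text, count = _TEMPDATA_LINE.subn(
        lambda match: match.group(1) + test_dir, template_text, count=1
    )
    if count == 0:
        raise ValueError("no emboss_tempdata definition found in template")
    return new_text


def write_emboss_default(template: Path, target: Path, test_dir: str) -> None:
    """Create target (your emboss.default) from the template file."""
    target.write_text(point_tempdata_at(template.read_text(), test_dir))
```

For example, write_emboss_default(Path("emboss/emboss.default.template"), Path("emboss.default"), "/opt/EMBOSS/test") would write a new emboss.default whose test databases point at /opt/EMBOSS/test, a path chosen here purely for illustration.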

You can also reindex these files yourself to test the dbi* programs and to test writing your own DB definitions for emboss.default.
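As one way to try reindexing, the sketch below builds a dbiflat command line for the EMBL-format files in test/embl and runs it only if the program is on your PATH. It is an illustration under assumptions: the qualifiers shown (-dbname, -idformat, -directory, -filenames, -auto) should be confirmed against the dbiflat documentation (for example with "tfm dbiflat"), the helper names are hypothetical, and the sketch assumes index files are written to the directory the program is run from.

```python
import shutil
import subprocess
from pathlib import Path


def dbiflat_command(dbname: str, id_format: str, data_dir: Path, pattern: str) -> list:
    """Build the argument list for one dbiflat run (nothing is executed)."""
    return [
        "dbiflat",
        "-dbname", dbname,            # name used for the database
        "-idformat", id_format,       # entry format, e.g. "EMBL"
        "-directory", str(data_dir),  # where the flat files live
        "-filenames", pattern,        # wildcard selecting the flat files
        "-auto",                      # accept defaults instead of prompting
    ]


def reindex(dbname: str, id_format: str, data_dir: Path, pattern: str) -> None:
    """Run dbiflat from data_dir; raises if the program is missing or fails."""
    if shutil.which("dbiflat") is None:
        raise RuntimeError("dbiflat not found on PATH; is EMBOSS installed?")
    subprocess.run(
        dbiflat_command(dbname, id_format, data_dir, pattern),
        cwd=data_dir,  # assumption: index files land in the working directory
        check=True,
    )
```

A call such as reindex("tembl", "EMBL", Path("test/embl"), "*.dat") would then correspond to the EMBL-format example directory listed above. The other dbi* programs take their own qualifiers, so this helper should not be reused for them without checking.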


Pre-indexed Databases

Don Gilbert (Indiana University) is making EMBOSS-format databanks available on a trial basis for public use: recent GenBank DNA plus non-redundant EMBL, GenPept, PIR and SwissProt. You can fetch these data from the IUBio Archive:

ftp://iubio.bio.indiana.edu/biomirror-gcg/

 Mar  6 22:36 Readme
 May 17 12:40 emboss.default.gz
 May 18 02:19 gcgdbconfigure
 May 18 02:19 gcgembl      (release 70, non-redundant w/ genbank)
 May 18 02:18 gcggenbank1  (core genbank, release 129)
 May 18 22:37 gcggenbank2  (est,gss of rel  129)
 May 17 22:13 gcggenpept   (release 129)
 May 17 20:38 gcgpir       (release 71)
 May 17 20:32 gcgswissprot (release 40)

These files are gzip-compressed, but should otherwise drop into an EMBOSS system after minor editing of the file paths in emboss.default. EMBOSS indices are included with each data set (total size about 60 GB uncompressed, 20 GB compressed).
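If you prefer to script the decompression, the following Python sketch unpacks every gzip file under a download directory using only the standard library. It assumes the compressed files carry a ".gz" suffix, as emboss.default.gz does in the listing above; the function name is hypothetical. Make sure the target disk has room for the uncompressed data before running it on the full set.

```python
import gzip
import shutil
from pathlib import Path
from typing import List


def gunzip_tree(root: Path, keep_original: bool = True) -> List[Path]:
    """Decompress every *.gz file below root, next to the original."""
    unpacked = []
    # sorted() collects the matches first, so files created during the
    # loop are never picked up by the directory walk.
    for gz_path in sorted(root.rglob("*.gz")):
        out_path = gz_path.with_suffix("")  # strip the trailing ".gz"
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)  # stream; never load a whole file
        if not keep_original:
            gz_path.unlink()
        unpacked.append(out_path)
    return unpacked
```

Passing keep_original=False removes each compressed file once it has been unpacked, which saves disk space but means the data must be fetched again if something goes wrong.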

This is a trial to see whether those of you who support EMBOSS want such a pre-digested set of data and indices. Let Don Gilbert know if you find it useful.