the Gene Ontology

GO Database Guide

  • What is the GO Database?
  • Availability
  • Pre-built database dumps
  • Build your own
  • Database Schema Documentation
  • Autodoc
  • Schema Diagram
  • Additional notes on the schema
  • Schema source code
  • Querying the Database
  • Querying via AmiGO
  • Querying via SQL
  • Querying via Perl
  • XML and RDF Queries
  • Finding out more
  • Contact
  • Frequently Asked Questions
  • Future extensions
  • Similar database schemas

What is the GO Database?

The GO Database is a relational database housing both the Gene Ontology and the annotations of genes and gene products to terms in the GO. The advantage of housing both ontologies and annotations in a single database is that powerful queries can be performed over annotations using the ontology.

The GO Database forms the base of the AmiGO browser and search engine. It is built from source data at regular intervals, and is currently housed as a MySQL database. The builds can be downloaded and installed on your local machine or queried remotely.

This page has the technical details concerning the GO database; if you are simply interested in browsing the database you may wish to proceed straight to AmiGO.

For a good online introduction to databases in bioinformatics, we highly recommend Relational Databases for Biologists.

Database Availability

Pre-built database dumps

The database is generated from the annotation sources and the latest version of the ontology at regular intervals. The database can be downloaded as a MySQL database dump, and reconstituted on any system where MySQL is running. Other database management systems can be used, but this is not trivial.

Older builds of the database are available from the GO archives.

Build your own

You can create your own instance of a GO Database, either by building one de novo, or by augmenting an existing build. You can load other OBO ontologies, or load your own annotations.

To build from scratch, you will need a MySQL server and the SQL source, available in go-dev CVS (see the sourceforge project page). Follow the instructions in the installation guide.

To load ontologies or data from OBO files or gene_association files, you will need to install go-db-perl, and use the load scripts there.

Database Schema

A relational schema specifies a collection of table definitions, providing structure for the data housed in the database instance. The schema for the GO database consists of tables for storing GO terms and graphs, as well as annotations and gene products.

Autodoc

See the automatically generated GO schema documentation.

The schema is partitioned into different modules or sub-schemas. This is purely a documentation convention. The go-graph module is for housing the ontology; the central tables are term and term2term. Annotations are stored in the go-association module; the main tables are association and gene_product. The GO database can also be enhanced with views; these are in the go-dev/sql/view directory. Autogenerated documentation is also available for all tables and all views.

Schema Diagram (ER)

The database structure can also be seen in this entity-relationship diagram:

[Figure: GO database diagram]

(thanks to Florian Leitner for the diagram)

Additional notes on the schema

Primary and foreign keys

By convention, every table in the schema uses a column named id as its primary key, and foreign key columns are named after what they reference, with an _id suffix. All foreign keys are explicitly declared in the schema source; however, MySQL drops these declarations. You should therefore always consult the documentation here rather than relying on the MySQL DESCRIBE command or the CREATE TABLE statements in the database dump. The GO database schema page allows you to traverse primary and foreign key links via internal HTML page links.

All keys are numeric surrogate identifiers. They are meaningless, not consistent between builds, and bear no correspondence to public identifiers such as GO:0008050.
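Because surrogate keys change between builds, queries are best anchored on the public accession and should use id columns only inside joins. The following is a minimal sketch of that pattern, not the real schema: an in-memory SQLite database stands in for MySQL, only a handful of columns are modelled, and the column names (acc, symbol, term_id, gene_product_id), the surrogate id values and the gene product geneA are illustrative assumptions that should be checked against the schema documentation.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the MySQL build, and only the
# columns needed for this illustration are created.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE term (id INTEGER PRIMARY KEY, acc TEXT, name TEXT);
CREATE TABLE gene_product (id INTEGER PRIMARY KEY, symbol TEXT);
CREATE TABLE association (
    id INTEGER PRIMARY KEY,
    term_id INTEGER,          -- refers to term.id
    gene_product_id INTEGER   -- refers to gene_product.id
);
-- Toy rows: the surrogate ids (17, 42) are arbitrary and would differ
-- from one build to the next.
INSERT INTO term VALUES (17, 'GO:0003677', 'DNA binding');
INSERT INTO gene_product VALUES (42, 'geneA');
INSERT INTO association VALUES (1, 17, 42);
""")


def products_for_accession(acc):
    """Return gene product symbols annotated directly to a public accession.

    The accession is the only value supplied by the caller; surrogate ids
    appear solely in the join conditions.
    """
    rows = conn.execute(
        """
        SELECT gp.symbol
        FROM term AS t
        JOIN association AS a ON a.term_id = t.id
        JOIN gene_product AS gp ON gp.id = a.gene_product_id
        WHERE t.acc = ?
        ORDER BY gp.symbol
        """,
        (acc,),
    ).fetchall()
    return [symbol for (symbol,) in rows]


print(products_for_accession("GO:0003677"))
```

A script written this way keeps working after a rebuild, whereas one that hard-codes term.id = 17 would not.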

Terms as nodes in a graph

The central concept in OBO style ontologies and in the GO database is the graph. The GO or OBO terms are nodes in the graph, and the relationships between them are arcs. This is handled by the tables term and term2term.

The terms constituting the nodes in an ontology graph represent the kinds of entities that exist within the domain of that ontology. The edges in the graph represent the relations that hold between these entities. The edge types or relations in GO are drawn from the OBO Relation Ontology.

Note that these edge types or relations are also stored in the term table. This allows us to reuse the same schema structures, and potentially allows us to have hierarchies of relations, which may be required in the future.
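To make the node and arc layout concrete, here is a small sketch in which the relation is_a is itself a row in the term table and each term2term row points at it. It runs against an in-memory SQLite stand-in, not the real database. The reduced column set, the toy rows, and the reading of term1_id as the parent and term2_id as the child are assumptions made for illustration; consult the schema documentation for the authoritative definitions.

```python
import sqlite3

# Assumption: a cut-down version of term and term2term in in-memory SQLite.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE term (id INTEGER PRIMARY KEY, acc TEXT, name TEXT);
CREATE TABLE term2term (
    id INTEGER PRIMARY KEY,
    relationship_type_id INTEGER,  -- term.id of the relation (edge type)
    term1_id INTEGER,              -- assumed here to be the parent
    term2_id INTEGER               -- assumed here to be the child
);
-- The relation is stored as an ordinary row in term.
INSERT INTO term VALUES (1, 'is_a', 'is_a');
-- Toy ontology rows, loosely modelled on the DNA helicase example.
INSERT INTO term VALUES (2, 'GO:0003677', 'DNA binding');
INSERT INTO term VALUES (3, 'GO:0003678', 'DNA helicase activity');
INSERT INTO term2term VALUES (1, 1, 2, 3);
""")


def direct_edges():
    """List every arc as (child name, relation name, parent name).

    The term table is joined three times: once for each end of the arc
    and once for the relation that labels it.
    """
    return conn.execute(
        """
        SELECT child.name, rel.name, parent.name
        FROM term2term AS e
        JOIN term AS child ON child.id = e.term2_id
        JOIN term AS rel ON rel.id = e.relationship_type_id
        JOIN term AS parent ON parent.id = e.term1_id
        ORDER BY child.name
        """
    ).fetchall()


for child, rel, parent in direct_edges():
    print(child, rel, parent)
```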

Traversing Graphs

When performing ontology-oriented queries, it is often necessary to do some kind of graph traversal. It is possible to use the term2term table to iterate through the graph, but this requires multiple SQL calls, because MySQL does not support transitive operators such as Oracle's CONNECT BY. Most implementations of SQL do not support the kind of recursive querying required to answer queries such as "find all DNA binding genes".
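The cost of iterating over term2term can be seen in a short sketch: each level of the graph needs its own SQL call. The example below uses an in-memory SQLite table with toy numeric ids and assumes term1_id is the parent and term2_id the child; it illustrates the general approach only and is not code from the GO tools.

```python
import sqlite3

# Assumption: a two-column stand-in for term2term with toy ids.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE term2term (term1_id INTEGER, term2_id INTEGER);  -- parent, child
INSERT INTO term2term VALUES (1, 2);
INSERT INTO term2term VALUES (2, 3);
INSERT INTO term2term VALUES (2, 4);
INSERT INTO term2term VALUES (1, 5);
""")


def descendants(root_id):
    """Collect all descendants of root_id by walking one level per query.

    Returns the set of descendant ids and the number of SQL calls issued.
    """
    seen = set()
    frontier = {root_id}
    queries = 0
    while frontier:
        placeholders = ",".join("?" * len(frontier))
        rows = conn.execute(
            f"SELECT term2_id FROM term2term WHERE term1_id IN ({placeholders})",
            tuple(frontier),
        ).fetchall()
        queries += 1
        # Only children not visited before form the next level.
        frontier = {child for (child,) in rows} - seen
        seen |= frontier
    return seen, queries


ids, calls = descendants(1)
print(sorted(ids), "found with", calls, "SQL calls")
```

The number of calls grows with the depth of the graph below the starting term, which is the overhead a precomputed closure avoids.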

This kind of query is possible with the GO database because we precompute the path from every node to all of its ancestors. This is known as the transitive closure of a relationship. The closure is stored in the graph_path table, which also holds the distance between terms.

In particular, we calculate the reflexive transitive closure, which means that every term is related to itself (the distance between the terms is zero). In practical terms, this makes it easier to write queries such as "find all DNA binding genes" - because such queries should return genes attached directly to DNA binding, as well as to children of DNA binding.
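The idea of a reflexive transitive closure with distances can be sketched in a few lines. This is not the algorithm used to build graph_path; it is a plain breadth-first walk over a toy edge list, and it records only the shortest distance for each ancestor and descendant pair, which is a simplifying assumption.

```python
from collections import deque


def reflexive_transitive_closure(edges):
    """Compute {(ancestor, descendant): shortest distance} from (parent, child) edges.

    Every node is paired with itself at distance 0 (the reflexive part), and
    with each of its ancestors at the length of the shortest upward path.
    """
    parents = {}
    nodes = set()
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
        nodes.update((parent, child))

    closure = {}
    for node in nodes:
        # Breadth-first search upwards from this node.
        distance = {node: 0}
        queue = deque([node])
        while queue:
            current = queue.popleft()
            for parent in parents.get(current, ()):
                if parent not in distance:
                    distance[parent] = distance[current] + 1
                    queue.append(parent)
        for ancestor, steps in distance.items():
            closure[(ancestor, node)] = steps
    return closure


# Toy chain of four terms; the names are illustrative only.
toy_edges = [
    ("root", "binding"),
    ("binding", "DNA binding"),
    ("DNA binding", "DNA helicase"),
]
closure = reflexive_transitive_closure(toy_edges)
for (ancestor, descendant), steps in sorted(closure.items()):
    print(ancestor, "->", descendant, "distance", steps)
```

Each key of the result corresponds to what one row of a closure table would hold: an ancestor, a descendant and a distance.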

Note that we can use the same table to also query for descendants - it is simply a matter of switching around term1 and term2 in the graph_path query constraints.
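A sketch of both query directions may help. The first function answers a "find all DNA binding genes" style query in a single SQL call by constraining term1 of graph_path; the second switches the constraint to term2 to list ancestors. The tables are reduced stand-ins created in in-memory SQLite, the rows are toy data, and the assumption that term1_id holds the ancestor and term2_id the descendant should be confirmed against the schema documentation.

```python
import sqlite3

# Assumption: reduced stand-ins for the real tables, filled with toy rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE term (id INTEGER PRIMARY KEY, acc TEXT, name TEXT);
CREATE TABLE graph_path (term1_id INTEGER, term2_id INTEGER, distance INTEGER);
CREATE TABLE gene_product (id INTEGER PRIMARY KEY, symbol TEXT);
CREATE TABLE association (term_id INTEGER, gene_product_id INTEGER);
INSERT INTO term VALUES (1, 'GO:0003677', 'DNA binding');
INSERT INTO term VALUES (2, 'GO:0003678', 'DNA helicase activity');
-- Reflexive rows (distance 0) plus one ancestor-to-descendant row.
INSERT INTO graph_path VALUES (1, 1, 0);
INSERT INTO graph_path VALUES (2, 2, 0);
INSERT INTO graph_path VALUES (1, 2, 1);
INSERT INTO gene_product VALUES (10, 'geneA');
INSERT INTO gene_product VALUES (11, 'geneB');
INSERT INTO association VALUES (1, 10);  -- geneA annotated to the parent term
INSERT INTO association VALUES (2, 11);  -- geneB annotated to the child term
""")


def genes_under(term_name):
    """Gene products annotated to a term or to any of its descendants."""
    rows = conn.execute(
        """
        SELECT DISTINCT gp.symbol
        FROM term AS ancestor
        JOIN graph_path AS p ON p.term1_id = ancestor.id
        JOIN association AS a ON a.term_id = p.term2_id
        JOIN gene_product AS gp ON gp.id = a.gene_product_id
        WHERE ancestor.name = ?
        ORDER BY gp.symbol
        """,
        (term_name,),
    ).fetchall()
    return [symbol for (symbol,) in rows]


def ancestors_of(term_name):
    """Same table with the constraint moved to term2: (ancestor name, distance)."""
    return conn.execute(
        """
        SELECT ancestor.name, p.distance
        FROM term AS descendant
        JOIN graph_path AS p ON p.term2_id = descendant.id
        JOIN term AS ancestor ON ancestor.id = p.term1_id
        WHERE descendant.name = ?
        ORDER BY p.distance
        """,
        (term_name,),
    ).fetchall()


print(genes_under("DNA binding"))
print(ancestors_of("DNA helicase activity"))
```

The distance 0 rows are what let genes_under return geneA, which is attached directly to the queried term, without a separate query for direct annotations.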

The diagram below shows an example of the reflexive transitive closure of DNA helicase and its ancestors. The dark lines indicate direct is_a relationships (stored in the term2term table); the dotted lines indicate the implied ancestral relationships (i.e. the closure), which are stored in the graph_path table.

[Figure: Transitive closure]

Schema Source

The SQL source for the schema is maintained in CVS on sourceforge, in the go-dev/sql directory. See the modules directory for the SQL DDL.

Querying the Database

Querying via AmiGO

The most common way to query the database is via the AmiGO browser interface. The Advanced Query provides a reasonably flexible way to query the database, but far more powerful queries can be executed using SQL.

Querying via SQL

You can query the database using SQL. Either download the GO MySQL dump and query your local copy (for example, through the mysql command line client), or connect to one of the GO database mirror nodes below using a client such as mysql or SquirrelSQL.

The examples below use the mysql command line client. Once connected, you can use the SHOW DATABASES command to see the available databases. To see when the database was built, use the following command:

SELECT * FROM instance_data;

EBI Mirror

db:       go_latest
user:     go_select
password: amigo
host:     mysql.ebi.ac.uk
port:     4085

Example connection from command line:

mysql -hmysql.ebi.ac.uk -ugo_select -pamigo -P4085 go_latest

Ensembl

Ensembl provides builds of the GO database going back several years.

user: anonymous
host: ensembldb.ensembl.org

The GO databases are typically named go_[ENSEMBL_BUILD_NO].

If you are unsure as to how to construct your query, the best place to start is the example queries page on the GO wiki.

Querying via Perl

You can connect to a local or remote MySQL installation using the Perl API, go-db-perl, which can be downloaded from CPAN. This module depends on go-perl; see the installation guide for more details. See also the GO software libraries page.

The API can be used to get terms, subgraphs and annotated gene products. It can also be used to perform analyses.

You can also use the Perl API interactively through a Perl shell, as follows (substituting connection details as appropriate):

GOshell.pl -d go_latest -dbuser go_select -dbauth amigo -port 4085 -h mysql.ebi.ac.uk

Type help for more instructions.

Alternate means of querying

There are also a few less conventional means of querying the database programmatically.

XML via DBStag

Nested XML can be automatically extracted from the database using DBStag. A number of SQL templates are available in the stag-templates directory in CVS.

SPARQL Endpoint

We have an experimental SPARQL endpoint for querying an RDF view of the GO Database. SPARQL queries can be executed using this interface.

See SPARQL-GO for full details. This service is highly experimental and not fully supported; if it proves useful it may be supported in future. Comments welcome!

Finding out more

Contact

If you have specific questions, either technical or content-related, regarding the database, please contact the GO helpdesk.

If you want to receive announcements and take part in discussions, you can sign up to the GO-Database mailing list.

Frequently Asked Questions

See the FAQ on the GO wiki.

Future extensions

There are a number of extensions to the schema planned over the coming years. These changes will be introduced in a way that retains backwards compatibility. You can see some examples of these in the form of unpopulated tables and columns in the database.

Deductive closure

A more advanced algorithm (e.g. the OBO-Edit reasoner) for computing the graph_path table may be used in future, allowing us to discriminate paths of different edge types. Amongst other things, it will help create relation-centric graph views.

Cross products

The schema already supports logical (complete) definitions via the complete relation qualifier, although these are not populated in the public GO database as the cross-products are still experimental.

For more background, see the logical definitions page on the GO wiki.

Similar database schemas

The GO schema is the predecessor of the ontology-based components of other bioinformatics schemas, including Chado, BioSQL and OBD. We have no plans to migrate to any of these schemas in the near future.


Last modified Monday, 28-Jan-2008 11:10:11 PST
Copyright © 1999-2009 the Gene Ontology