bioDBnet: Documentation: FAQs

Notes on bioDBnet

What is bioDBnet?

Network
What are bioDBnet connectors?
How are the databases integrated?
What is a main database?
If all input nodes are not connectors then is bioDBnet missing some conversions?
How is the network used in bioDBnet conversions?

Data
What kind of biological data is integrated?
What is the coverage for the different databases?
Is the data updated regularly?
How does the update process affect bioDBnet?
Why am I getting results for some but not all of my identifiers?

Data Type
Why am I not getting any results from db2db/dbReport/dbWalk?
Why is the data type that I use a node in bioDBnet but not an input node?
Why is my data type not included in bioDBnet?

dbReport links
non-B DB
polyBrowse
UCSC Browser
DAVID

Good to know stuff
How can I cite bioDBnet?
How can I create a hyperlink to bioDBnet?
What is the format for entering input values?
Is it necessary to give the Taxon ID when I use bioDBnet?
Is there any sort of precomputing?
Is there any statistical basis for the bioDBnet conversions?
What is an edge weight?
What does the path information in db2db results mean?

What is bioDBnet?
bioDBnet is a network of the major biological databases. There is a vast amount of information available in various formats and scattered across many resources. We have tried to bring this information together and make it available through an easy-to-use web resource. The major advantages that bioDBnet offers compared to similar resources are the simplicity of our model, the number and types of databases integrated, support for batch conversions and the integration process itself. bioDBnet is built so that it picks up all database updates seamlessly.

What are bioDBnet connectors?
bioDBnet uses only some of the nodes to build its connections. We chose Gene ID, UniProt Protein Name and Ensembl Gene ID as our connectors. Although it is possible to build any type of path to walk through the network, we chose only these nodes to ensure more accurate results.

How are the databases integrated?
We download many public databases from their FTP sites. (Refer to the database versions page for the current versions of the databases in bioDBnet and links to their FTP sites.) If the data is in tab-delimited files (e.g. EntrezGene, GO), we load these files directly into relational tables. For other databases (e.g. GenBank, UniProt), we have parsers that extract the required information into tab-delimited files, which are then loaded into the database. The tables are all in Oracle, but there is no dependency on a particular RDBMS, so they can be ported to other database management systems such as MySQL if needed. This database structure is completely independent of bioDBnet, in the sense that the tables have not been formatted for bioDBnet but were created to pull out the main information from the public data sources.
Within bioDBnet we have an XML file called bioDBnet.xml, which contains all the pair-wise mappings (the edges of bioDBnet) from each of the above relational tables. We then run a small Perl script that reads these edges and creates all the possible paths for bioDBnet using Gene ID, UniProt Entry Name and Ensembl Gene ID as the main connectors. These paths are stored in a relational table. At runtime, when a user chooses to convert from input 'A' to output 'B', we retrieve all the possible paths and follow them until a result is obtained or all the paths are exhausted. Because there is no precomputing of any sort, we can keep updating our databases independently of each other and on a regular basis.
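The path-building step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Perl script bioDBnet actually runs: the edge list is a hand-written stand-in for bioDBnet.xml, the node names 'Affy ID' and 'PDB ID' are used purely as examples, and the function name enumerate_paths and its max_len cut-off are hypothetical.

```python
from collections import defaultdict

# Illustrative stand-in for the pair-wise mappings held in bioDBnet.xml.
EDGES = [
    ("Affy ID", "Gene ID"),
    ("Gene ID", "Gene Symbol"),
    ("Gene ID", "UniProt Accession"),
    ("UniProt Accession", "PDB ID"),
]

# Connector nodes as named in the text above.
CONNECTORS = {"Gene ID", "UniProt Entry Name", "Ensembl Gene ID"}


def enumerate_paths(edges, source, target, connectors, max_len=4):
    """Return every simple path source -> target whose intermediate
    nodes are all connectors (hypothetical helper, depth-limited)."""
    graph = defaultdict(list)
    for a, b in edges:
        graph[a].append(b)

    paths = []
    stack = [[source]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == target and len(path) > 1:
            paths.append(path)
            continue
        if len(path) > max_len:
            continue
        # Only the starting node and connector nodes may be walked through.
        if node != source and node not in connectors:
            continue
        for nxt in graph[node]:
            if nxt not in path:  # keep paths simple (no revisits)
                stack.append(path + [nxt])
    return paths


print(enumerate_paths(EDGES, "Affy ID", "Gene Symbol", CONNECTORS))
print(enumerate_paths(EDGES, "Affy ID", "PDB ID", CONNECTORS))
```

In this toy network the second call finds no path, because the only route to 'PDB ID' passes through a node that is not a connector. That mirrors the db2db restriction discussed elsewhere in this FAQ.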

What is a main database?
At the Advanced Biomedical Computing Center we download data from the most commonly used publicly available biological databases. This data is parsed or formatted as needed and loaded into relational tables. We refer to these databases as main databases, as opposed to those whose information is obtained only by reference from the main databases. In other words, the databases that are not main databases do not have complete coverage in bioDBnet.

If all input nodes are not connectors then is bioDBnet missing some conversions?
Yes and no. Conversions that are feasible only by passing through nodes that are not bioDBnet connectors are not available through db2db, but they can be made using dbWalk, which places no restrictions on the network path to be followed.

How is the network used in bioDBnet conversions?
For most conversions, bioDBnet first tries to get all the connectors for the input identifier by following all possible paths in the network to each of the connectors. It then uses these connector values to get the actual output results. Once a result is obtained for an input value, bioDBnet does not traverse the remaining paths for that particular value. The order in which the paths are tried is based on their distance and path weights.
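As a rough sketch of this traversal, the Python below tries candidate paths in order and stops at the first one that yields a result. The exact ordering rule (shortest distance first, then highest path weight), the dictionary layout of a path, the lookup function and the toy mapping table are all assumptions made for illustration. They do not describe bioDBnet's internal implementation.

```python
# Toy single-step mappings: (source node, target node, value) -> results.
TABLE = {
    ("Gene Symbol", "Gene ID", "TP53"): ["7157"],
    ("Gene ID", "UniProt Accession", "7157"): ["P04637"],
}

# Hypothetical stored paths with made-up weights.
PATHS = [
    {"nodes": ["Gene Symbol", "Ensembl Gene ID", "UniProt Accession"],
     "distance": 2, "weight": 0.9},
    {"nodes": ["Gene Symbol", "Gene ID", "UniProt Accession"],
     "distance": 2, "weight": 0.5},
]


def lookup(src, dst, value):
    """One database query along a single edge (toy version)."""
    return TABLE.get((src, dst, value), [])


def convert(value, paths, lookup_fn):
    """Follow paths in priority order; return the first non-empty result
    together with the path that produced it."""
    # Assumed ordering: shorter distance first, then larger path weight.
    ordered = sorted(paths, key=lambda p: (p["distance"], -p["weight"]))
    for path in ordered:
        current = {value}
        nodes = path["nodes"]
        for src, dst in zip(nodes, nodes[1:]):
            current = {out for v in current for out in lookup_fn(src, dst, v)}
            if not current:
                break  # dead end, try the next path
        if current:
            return sorted(current), nodes  # remaining paths are skipped
    return [], None


print(convert("TP53", PATHS, lookup))
```

Here the higher-weighted path is tried first but gives nothing in the toy table, so the second path supplies the result.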

What kind of biological data is integrated?
bioDBnet connects many types of information, including but not limited to gene data from Gene and Ensembl; protein data from UniProt, Ensembl and RefSeq; microarray chip data from Affymetrix; annotation data from GO; and pathway data from KEGG and BioCarta.

What is the coverage for the different databases?
The coverage for a particular node depends on where the information is coming from and whether we maintain the database locally at ABCC. Please refer to the nodes page in the documentation for a complete list of the nodes and their coverage.

Is the data updated regularly?
Here at ABCC we update our databases as soon as an update is available. We have parsers and programs running, some of them as cron jobs, to fetch the updates regularly. The databases are maintained independently of bioDBnet, and all the queries and mappings in bioDBnet are done at runtime. At any given time the results should therefore be current, with a maximum expected lag of about a week between the data becoming available on a public FTP site and our loading it into the local database. Refer to the database versions page for the release versions and update dates.

How does the update process affect bioDBnet?
The updates on our production databases run as cron jobs at 5am (EST) on weekdays. We run the process in a way that keeps the effect on bioDBnet to a minimum. Depending on the type and number of databases being updated on a particular day, the process might take from a few seconds to a couple of minutes. To avoid missing any results, it may be better not to use bioDBnet between 5am and 6am (EST) on weekdays. Please bear in mind that although some results might be missed, bioDBnet will never give wrong results if accessed during the update period.

Why am I getting results for some but not all of my identifiers?
This can occur for several reasons:
(i) The coverage for that node might not be 100%.
(ii) Even when the coverage is 100% for both the input and output nodes, not every input value necessarily has a mapping to the output node.
(iii) Some of the identifiers may be obsolete, so bioDBnet might not have any information for them.
(iv) If a Taxon ID was entered, some of the identifiers might not belong to that taxon.

Why am I not getting any results from db2db/dbReport/dbWalk?
1) Check that the correct input type has been selected for your data. With the many different types of nodes in bioDBnet, it can be confusing to work out exactly what your identifiers are called. For example, 'Q8SZ22_DROME' is a UniProt Entry Name and not a UniProt Accession.
2) Check the format of the identifier selected. Refer to the nodes section in dbInfo for examples of your identifier type and make sure that you use exactly the same format. For example, bioDBnet does not use version numbers for RefSeq accessions, so NM_130786.2 would not give any results but NM_130786 would. EC 3.4.25.1 would return all the mappings for the EC number, but EC:3.4.25.1 or 3.4.25.1 will not give any results.
3) Make sure that bioDBnet currently has your identifiers. For example, not all Affymetrix probe set identifiers are covered in bioDBnet.
Let us know if you are not getting results even after using the correct data type and format. We will track down the problem and correct it if it is caused by an error in bioDBnet.
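The two format rules mentioned in point 2 can be applied automatically before submitting a list. The following Python sketch is only a client-side convenience under stated assumptions. The function names and regular expressions are hypothetical and cover just the RefSeq and EC cases given above, not every identifier type in bioDBnet.

```python
import re


def strip_refseq_version(accession):
    """Drop a trailing version suffix, e.g. NM_130786.2 -> NM_130786.
    Strings that do not look like a versioned RefSeq accession are
    returned unchanged (apart from surrounding whitespace)."""
    return re.sub(r"^([A-Z]{2}_\d+)\.\d+$", r"\1", accession.strip())


def normalise_ec(value):
    """Rewrite 'EC:3.4.25.1' or '3.4.25.1' in the 'EC 3.4.25.1' form.
    Anything that is not a four-part EC number is returned unchanged."""
    match = re.fullmatch(
        r"(?:EC[:\s]*)?(\d+(?:\.(?:\d+|-)){3})",
        value.strip(),
        flags=re.IGNORECASE,
    )
    return f"EC {match.group(1)}" if match else value


print(strip_refseq_version("NM_130786.2"))
print(normalise_ec("EC:3.4.25.1"))
```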

Why is the data type that I use a node in bioDBnet but not an input node?
We tried to use only the most commonly used data types as input nodes. Please email us the specifics, and if there is sufficient interest we will turn the node you are interested in into an input node.

Why is my data type not included in bioDBnet?
We have included only major databases in this version of bioDBnet. However, we make constant additions and changes to include data of wide interest. If there is something missing in bioDBnet, please let us know and we will try to include it in our next version.

non-B DB
non-B DB is a link to nonbdb, a database developed at the Advanced Biomedical Computing Center. It contains annotations of non-B DNAs for several mammalian species. The link is available for dbReport queries with 'Gene Symbol' for certain species like human and mouse.

polyBrowse
polyBrowse links to pbrowse, a GBrowse-based browser tool developed at the Advanced Biomedical Computing Center. It provides visualization and query capabilities for several genomic annotations. This link is also available for certain dbReport queries for human and mouse.

UCSC Browser
The UCSC Browser links to the genome browser tool from UCSC.

DAVID
The DAVID link is available for dbReports of human and mouse when multiple genes are queried at the same time. These genes are submitted as a gene list to the Database for Annotation, Visualization and Integrated Discovery (DAVID), which provides functional annotation and visualization tools to understand the biological meaning behind large lists of genes.

How can I cite bioDBnet?
If you have used bioDBnet as part of your research, please cite us with a reference to our website and/or by our publication:
Mudunuri,U., Che,A., Yi,M. and Stephens,R.M. (2009) bioDBnet: the biological database network.
Bioinformatics, 25, 555-556.

How can I create a hyperlink to bioDBnet?
You can create hyperlinks from your website to bioDBnet either to generate full database reports for your choice of identifier or for enabling database conversions.
For linking to dbReport:
http://biodbnet.abcc.ncifcrf.gov/db/dbReportRes.php?input=inputType&idList=value(s)
For linking to db2db:
http://biodbnet.abcc.ncifcrf.gov/db/db2dbRes.php?input=inputType&outputs[]=outputType&idList=value(s)
where inputType and outputType are the names of the input and output nodes. These names are case sensitive and must match the names given on the database nodes documentation page exactly. Example links:
http://biodbnet.abcc.ncifcrf.gov/db/dbReportRes.php?input=UniProt Accession&idList=Q68CK0
http://biodbnet.abcc.ncifcrf.gov/db/dbReportRes.php?input=Gene ID&taxonId=9606&idList=1, 3
http://biodbnet.abcc.ncifcrf.gov/db/db2dbRes.php?input=UniProt Accession&outputs[]=Gene ID&idList=Q68CK0
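Node names contain spaces and the db2db parameter name contains brackets, so links generated from a script are safer when percent-encoded. The Python sketch below builds both kinds of link with the standard library, using only the parameter names shown above (input, outputs[], idList, taxonId). The helper names are hypothetical, and it is an assumption that the server accepts the encoded form in the same way as the literal examples.

```python
from urllib.parse import urlencode

BASE = "http://biodbnet.abcc.ncifcrf.gov/db"


def db_report_url(input_type, ids, taxon_id=None):
    """Build a dbReport link for one input node type and a list of IDs."""
    params = [("input", input_type)]
    if taxon_id is not None:
        params.append(("taxonId", str(taxon_id)))
    params.append(("idList", ",".join(ids)))
    return f"{BASE}/dbReportRes.php?{urlencode(params)}"


def db2db_url(input_type, output_types, ids):
    """Build a db2db link; each output node becomes one outputs[] entry."""
    params = [("input", input_type)]
    params += [("outputs[]", out) for out in output_types]
    params.append(("idList", ",".join(ids)))
    return f"{BASE}/db2dbRes.php?{urlencode(params)}"


print(db_report_url("Gene ID", ["1", "3"], taxon_id=9606))
print(db2db_url("UniProt Accession", ["Gene ID"], ["Q68CK0"]))
```

urlencode writes spaces as '+' and escapes the brackets and commas, which keeps the link valid when it is embedded in HTML or sent by a script.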

What is the format for entering input values?
The input values can be entered as comma-separated values (CSV), on separate lines, or as a combination of both. In this version of bioDBnet we limit the number of values to 500; this restriction will likely be removed in our next version.
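For readers preparing lists programmatically, here is a minimal Python sketch of splitting such input and checking it against the 500-value limit. The function name and the choice to raise an error are assumptions for illustration. How bioDBnet itself handles an over-long list is not described here.

```python
import re

MAX_VALUES = 500  # limit stated above for this version of bioDBnet


def parse_input(text, limit=MAX_VALUES):
    """Split on commas and newlines, trim whitespace, drop empty entries."""
    values = [v.strip() for v in re.split(r"[,\n]", text)]
    values = [v for v in values if v]
    if len(values) > limit:
        raise ValueError(f"{len(values)} values given, limit is {limit}")
    return values


print(parse_input("1, 3\n7157,\n\n672"))
```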

Is it necessary to give the Taxon ID when I use bioDBnet?
Some identifiers, such as Gene ID, UniProt ID and UniGene ID, are unique, so the Taxon ID might not matter much when converting between them. Others, such as Gene Symbol, are not specific to an organism, so entering a Taxon ID limits the results to the organism of interest. Although it is not required, we suggest including a Taxon ID wherever possible.
In dbReports for EC Number, GSEA Standard Name, GO ID, InterPro ID and Pfam ID, results for human are displayed by default if a Taxon ID is not entered. This prevents the request from being killed for running out of memory, as the list of possible connections across all species for these identifiers would be far too large.

Is there any sort of precomputing?
No, bioDBnet does not do any kind of precomputing, nor does it try to map everything to a unique identifier. Because there are many local databases at ABCC, a database update runs almost every day, and having no precomputing allows us to keep up with the different update schedules of the various databases. Although there is no precomputing, bioDBnet is optimized and configured so that queries give immediate results.

Is there any statistical basis for the bioDBnet conversions?
For any conversion, bioDBnet traverses the network in a top-down approach, going first through the path with the shortest distance and the greatest likelihood of getting a result. It uses edge weights and path weights to determine the order in which the network is traversed, but none of these have any statistical significance, as the nature and coverage of the databases at ABCC are varied. Not all input nodes are from the main databases at ABCC, and not all databases at ABCC are complete, so the same value might mean different things in different conversions. The values would also not be the same for all species, as different species are covered differently in different databases, with most of them concentrating on human. A value of 0.6 might actually be 0.95 for human and 0.01 for a bacterial species. The weights are given only to provide a broad sense of the underlying databases and their coverage as a whole.

What is an edge weight?
Edge weight is the likelihood of getting an output node directly from an input node through a single database query.

Calculating the weight of an edge going from node 1 to node 2 (node 1 -> node 2)

        edge weight = tn1n2 / tn1
        where, tn1n2 = total distinct node 1 values connecting to node 2
               tn1   = total distinct node 1 values in the database
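
As a worked example of this ratio, the Python sketch below computes an edge weight from a small set of made-up values. The function name and the input layout (a list of node 1 values plus a list of node 1 to node 2 mapping pairs) are assumptions for illustration only.

```python
def edge_weight(node1_values, mappings):
    """edge weight = tn1n2 / tn1, following the definition above.

    node1_values: all node 1 values in the database
    mappings:     (node 1 value, node 2 value) pairs that exist
    """
    distinct_node1 = set(node1_values)
    tn1 = len(distinct_node1)
    if tn1 == 0:
        return 0.0  # no node 1 values at all, so nothing can be reached
    # Distinct node 1 values that connect to at least one node 2 value.
    tn1n2 = len({n1 for n1, _ in mappings} & distinct_node1)
    return tn1n2 / tn1


# Made-up data: 4 distinct node 1 values, 2 of which map to node 2.
genes = ["g1", "g2", "g3", "g4"]
gene_to_protein = [("g1", "p1"), ("g1", "p2"), ("g3", "p3")]
print(edge_weight(genes, gene_to_protein))  # 2 / 4 = 0.5
```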


What does the path information in db2db results mean?
db2db prints out the path(s) taken by bioDBnet to do the actual conversions. For each path it gives the nodes traversed, the path weight and the distance from the input node to the output node. Please note that in most cases the connectors are found first and then used to get the actual output results.

Calculating the weight of a path from node 1 to node n (node 1 -> node 2 ... -> node n)

        path weight = (e(1)(2) + e(2)(3) + ... + e(n-1)(n)) / d
        where, e(x)(y) = edge weight of node x -> node y
               d       = distance from node 1 to node n = n - 1
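
Put differently, the path weight is the mean of the edge weights along the path. A minimal Python sketch, with a hypothetical function name and made-up edge weights:

```python
def path_weight(edge_weights):
    """Average the edge weights along a path.

    edge_weights holds one weight per edge, so for a path over n nodes
    it has d = n - 1 entries.
    """
    d = len(edge_weights)
    if d == 0:
        raise ValueError("a path needs at least one edge")
    return sum(edge_weights) / d


# A three-node path (distance 2) with made-up edge weights.
print(path_weight([0.5, 1.0]))  # (0.5 + 1.0) / 2 = 0.75
```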