Wednesday, January 18, 2006

Finding good phylogenies using citation relationships

How does a person who is not an expert in a group of organisms find a good phylogeny to use for their work? Think of somebody interested in animal behaviour who needs a tree for their birds or fish. Ignoring the answers "ask a systematist" or (even worse) "become a systematist and build the tree themselves", how do we answer this query?

Google ranks sites using link structure (in essence, pages with lots of links that are themselves pointed to by lots of sites score highly). Could we use the same idea for scientific papers? The answer is of course we could, but whether it would generate useful results is an open question. I've been toying with Jon Kleinberg's ideas in "Authoritative sources in a hyperlinked environment". Kleinberg identifies "authorities" and "hubs", which are roughly analogous to highly cited papers and review articles, respectively.

So, the idea is this. For a collection of papers (such as those in TreeBASE, or those being assembled for birds by Katie Davis in my lab), use Google Scholar to extract citation information, build a graph and compute authorities and hubs using Kleinberg's algorithms. Based on a little play with TreeBASE (which I need to finish and write up, sigh), papers with high hub scores tend to be recent reviews, which may be good candidates for a place to start.
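To make the hub/authority computation concrete, here is a minimal sketch of Kleinberg's iterative scheme applied to a citation graph. The paper names and the citation dictionary are made up for illustration, and the step of pulling citations out of Google Scholar is not shown; the function name `hits` and the fixed iteration count are my own choices, not anything prescribed by the paper.

```python
from math import sqrt


def hits(citations, iterations=50):
    """Return (hub, authority) scores for a citation graph.

    citations maps each paper to the list of papers it cites.
    """
    papers = set(citations)
    for cited_list in citations.values():
        papers.update(cited_list)

    hub = {p: 1.0 for p in papers}
    auth = {p: 1.0 for p in papers}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of the papers citing it.
        auth = {p: 0.0 for p in papers}
        for citing, cited_list in citations.items():
            for cited in cited_list:
                auth[cited] += hub[citing]
        norm = sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}

        # Hub score: sum of the authority scores of the papers it cites.
        hub = {p: sum(auth[c] for c in citations.get(p, [])) for p in papers}
        norm = sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}

    return hub, auth


# Hypothetical toy graph: a review citing three primary papers.
toy_citations = {
    "review": ["paper_a", "paper_b", "paper_c"],
    "paper_a": ["paper_b"],
    "paper_c": ["paper_b"],
}
hub_scores, authority_scores = hits(toy_citations)
best_hub = max(hub_scores, key=hub_scores.get)
best_authority = max(authority_scores, key=authority_scores.get)
print(best_hub, best_authority)
```

In this toy graph the review, which cites everything, comes out as the top hub, and the paper everybody cites comes out as the top authority. Whether real citation data behave this neatly is exactly the open question.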

We could even test this. In the case of Katie's work on bird supertrees, we could compute a measure of fit between input trees and the supertree, and compare that with the score assigned to the paper containing the source tree. If my idea has value, papers that have "good" input trees will also have high scores based on citation structure (e.g., hubiness, or some other measure).
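One way such a comparison might be run is a rank correlation between the fit of each input tree and the citation score of its paper. The sketch below uses Spearman's rank correlation, which is my choice rather than anything settled above, and the numbers are invented purely to show the mechanics; they are not results.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither list is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented values, one entry per source paper (NOT real data).
tree_fit = [0.91, 0.55, 0.78, 0.30]    # fit of input tree to supertree
hub_score = [0.62, 0.20, 0.41, 0.05]   # citation-based score of the paper
rho = spearman(tree_fit, hub_score)
print(rho)
```

A rho near 1 would support the idea, a rho near 0 would suggest citation structure says little about tree quality.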

Monday, January 02, 2006

Wouldn't it be cool if ... GenBank watch

From the "wouldn't it be cool if" department, one thing I've often thought would be very handy is a web site that lists sequences in GenBank that are known (or suspected) to be problematic (especially sequences thought to have been misidentified). What I'd like to see is a site called something like "GenBank Watch" (a rip-off of Search Engine Watch) where this information is recorded.

There has been some commentary on this issue in the literature (Rytas Vilgalys's article in New Phytologist doi:10.1046/j.1469-8137.2003.00894.x, and James Harris' article in Trends in Ecology and Evolution doi:10.1016/S0169-5347(03)00150-2).

Some workers do make available lists of dubious sequences, such as the list of rejected sequences provided by the mor project. My concern is that a lot of this sort of information is buried in papers (e.g. this one suggesting AF203470 has been misidentified). Even worse, it may come to light when a manuscript is reviewed: the authors remove the sequences from their data set, but the important information (that the sequence is bogus) isn't mentioned in the paper.

Wouldn't it be great if there was a web site where one could go and search for a sequence by accession number to see if somebody had flagged that sequence as problematic? Ideally the site would enable users to comment on a sequence (for example, whether the sequence is bogus might be contentious), and it would also need a web service interface so the search could be automated.
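As a rough sketch of the core of such a site, the snippet below keeps a list of comments per accession number and answers "has anybody flagged this?". Everything here is hypothetical: the class and field names are invented, storage is just an in-memory dictionary, and I assume lookups should ignore case and any ".version" suffix on the accession.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Flag:
    reporter: str      # who raised the concern
    comment: str       # what they think is wrong
    source: str = ""   # optional pointer to a paper or list


def normalise(accession: str) -> str:
    """Assumption: treat 'af203470.1' and 'AF203470' as the same record."""
    return accession.strip().upper().split(".")[0]


class SequenceWatch:
    """Hypothetical in-memory registry of flagged accession numbers."""

    def __init__(self) -> None:
        self._flags: Dict[str, List[Flag]] = {}

    def flag(self, accession: str, reporter: str, comment: str,
             source: str = "") -> None:
        # Several people may comment on one sequence, so keep a list.
        key = normalise(accession)
        self._flags.setdefault(key, []).append(Flag(reporter, comment, source))

    def lookup(self, accession: str) -> List[Flag]:
        return list(self._flags.get(normalise(accession), []))

    def is_flagged(self, accession: str) -> bool:
        return bool(self.lookup(accession))


watch = SequenceWatch()
# Illustrative entry only; the reporter name is a placeholder.
watch.flag("AF203470", "example reporter",
           "suggested in a paper to have been misidentified")
print(watch.is_flagged("af203470.1"))
print(len(watch.lookup("AF203470")))
```

Keeping a list of flags per accession, rather than a single yes/no, leaves room for disagreement about whether a sequence really is bogus. A web service would only need to wrap `lookup` so that the check can be scripted.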

One for the "if I only had time" list.