iPhylo: June 2010

Roderic D. M. Page

Friday, June 25, 2010

Evolution 2010 talk schedule for iCal

In a moment of madness brought on by trying to make sense of 10 Mb of conference schedule for Evolution 2010, I extracted the text from the schedule and created a series of crude iCal files that I can add to my iCal calendar on my Mac (and hence sync to my iPhone). This way I can set reminders of specific talks I want to see.

I'm making these ical files available here, on the understanding that you can use them entirely at your own risk (some talks may be missing due to errors in parsing the file).

Tuesday, June 22, 2010

Mashing up NCBI and Wikipedia using treemaps

Having made a first stab at mapping NCBI taxa to Wikipedia, I thought it might be fun to see what could be done with it. I've always wanted to get quantum treemaps working (quantum treemaps ensure that the cells in the treemap are all the same size, see my 2006[!] blog post for further description and links). After some fussing I have some code that seems to do the trick. As an example, here is a quantum treemap for Laurasiatheria.

The diagram shows the NCBI taxonomy subtree rooted on Laurasiatheria, with images (where available) from Wikipedia for the children of the the children of that node. In other words, the images correspond to the tips of the tree below:

There's a lot to be done to tidy this up, but there is potential to create a nice, visual way to navigate through the NCBI taxonomy (it might work well on the iPhone or iPad, for example).

Thursday, June 17, 2010

NCBI to Wikipedia links are now live...

The 52,956 links from NCBI to Wikipedia that I've been busy creating are now "live." If you go to a NCBI taxon such as Sphaerius you'll see something like this:

Clicking the "Wikipedia" link takes you to the Wikipedia page for this taxon. You can see all the links to Wikipedia using the query loproviphylo[filter]. Here are some additional links to try:

NCBI	Wikipedia
8353	Xenopus
83698	Banksia
9766	Balaenoptera

Thanks to Scott Federhen and Kathy Kwan at NCBI for all their assistance in getting this into NCBI Linkout.

Fixing errors
There will be errors and omissions. The best way to fix these is by using the iPhylo Linkout wiki. The page for a NCBI taxon is always http://iphylo.org/linkout/Ncbi:xxxx where xxxx is the NCBI taxonomy id. You can edit/annotate the link there (click on the "edit with form" for a simple web form). I plan to regularly update the links based on this the wiki.

Future
NCBI Linkout provide access statistics so it will be interesting to see how much traffic goes from NCBI to Wikipedia. It will also be interesting to see if this is correlated with increased editing of those Wikipedia pages.

PLoS doesn't "get" the iPad (or the web)

PLoS recently announced a dedicated iPad app, that covers all the PLoS Journals, and which is available from the App Store. Given the statement that "PLoS is committed to continue pushing the boundaries of scientific communication" I was expecting something special. Instead, what we get (as shown) in the video below is a PDF viewer with a nice page turning effect (code here). Maybe it's Steve Job's fault for showing iBooks when he first demoed the iPad, but there desire to imitate 3D page turning effects leaves me cold (for a nice discussion of how this can lead to horribly mixed metaphors see iA's Designing for iPad: Reality Check).

PLoS iPad Demo from PLoS on Vimeo.

But I think this app shows that PLoS really don't grok the iPad. Maybe it's early days, but I find it really disappointing that page-turning PDFs is the first thing they come up with. It's not about recreating the paper experience on a device! There's huge scope for interactivity, which the PLoS app simply ignores — you can't select text, and none of the references. It also ignores the web (without which, ironically, PLoS couldn't exist).

Instead of just moaning about this, I've spent a couple of days fussing with a simple demo of what could be done. I've taken a PLoS paper ("Discovery of the Largest Orbweaving Spider Species: The Evolution of Gigantism in Nephila", doi:10.1371/journal.pone.0007516), grabbed the XML, applied a XSLT style sheet to generate some HTML, and added a little Javascript functionality. References are displayed as clickable links inline. If you click on one a window pops up displaying the citation, and it then tries to find it for you online (for the technically mined, it's using OpenURL and bioGUID). If it succeeds it displays a blue arrow — click that and you're off to the publisher's web site to view the article.

Figures are also links, click on and you get a Lightbox view of the image.
You can view this article live, in a regular browser or in iPad. Here's a video of the demonstration page:

PLoS paper on the iPad from Roderic Page on Vimeo.

This is all very crude and rushed. There's a lot more that could be done. For references we could flag which articles are self citations, we could talk to bookmarking services via their APIs to see which citations the reader already has, etc. We could also make data, sequences, and taxonomic names clickable, providing the reader with more information and avenues for exploration. Then there's the whole issue of figures. For graphs we should have the underlying data so that we can easily make new visualisations, phylogenies should be interactive (at least make the taxon names clickable), and there's no need to amalgamate figures into aggregates like Fig .2 below. Each element (A-E) should be separately addressable so when the text refers to Fig. 2D we can show the user just that element.

journal.pone.0007516.g002.png

The PLoS app and reactions to Elsevier's "Article 2.0" (e.g., Elsevier's 'Article of the Future' resembles websites of the past and The “Article of the Future” — Just Lipstick Again?) suggests publishers are floundering in their efforts to get to grips with the web, and new platforms for interacting with the web.

So, PLoS, I challenge you to show us that you actually "get" the iPad and what it could mean for science publishing. Because at the moment, I've seen nothing that suggests you grasp the opportunity it represents. Better yet, why not revisit Elsevier's Article 2.0 project and have a challenge specifically about re-imagining the scientific article? And please, no more page turning effects

Tuesday, June 08, 2010

Linking NCBI to Wikipedia

180px-Sphaerius.acaroides.Reitter.tafel64.jpg

In an earlier post I discussed linking NCBI taxonomy to Wikipedia. One way to tackle this is to add NCBI Taxonomy ID to Wikipedia pages. I reopened the case for adding the Taxonomy IDs to the Taxobox on each taxon page, but this met with substantial resistance. A modified proposal to add them elsewhere to the Wikipedia page seems to be gaining more support (or, at least, less vigorous resistance).

Meanwhile, there are other things that need to be done to linking NCBI and Wikipedia. One is to add Wikipedia page names to NCBI Linkout so that when viewing a NCBI taxon page you will see a link to Wikipedia if a page for the corresponding taxon exists. To create this linkout we need a mapping from NCBI to Wikipedia, and that's what I've been working on for the last few days.

The mapping is still in progress, but essentially I've taken a dump of the NCBI taxonomy for June 3, 2010, and matched the names with those in a the June 18, 2009 dump of Wikipedia that I've analysed elsewhere on this blog. I'll detail the various steps in the mapping elsewhere (there are issues such as synonyms, homonyms, Wikipedia redirects, etc.), but for now things seem to be working reasonably well.

The mapping is being created in a Semantic Mediawiki at http://iphylo.org/linkout/. When complete you will be able to up a NCBI taxon by either it's name (including synonyms and common names) or it's NCBI Taxonomy ID. Where possible I'm mapping the NCBI taxon to Wikipedia, and providing a snippet of text and an image.

I've also extracted bibliographic information from the citations.dmp file that comes with the NCBI dump. This contains the comments that you sometimes see on a taxon page. In a few cases I've added some information manually. For example, the beetle genus Sphaerius has a rather complicated nomenclatural history, which the NCBI page summarises as:

Due to a recent ruling (ICZN 2000), the family and generic names Sphaeriusidae Erichson, 1845, and Sphaerius Waltl, 1838, are both available names and have priority over Microsporidae Crotch, 1873 for the family name and Microsporus Kolenati, 1846 for the single included genus, respectively.

By looking through BioStor I've found some of the papers relating to this ICZN ruling, and added them to the wiki page http://iphylo.org/linkout/Ncbi:174920 (aficionados of zoological nomenclature may enjoy the complexity of the case, due to homonymy between the corresponding family name, Sphaeriidae, and a mollusc family of the same name).

Once thus mapping is complete, it will be time to think of how to get this into NCBI's Linkout, and also how to automatically update the mapping to reflect the growth of both the NCBI taxonomy and Wikipedia. If you visit http://iphylo.org/linkout/ please be aware that the mapping is still being written to the wiki (this is being done via API calls, and adding some 900,000 pages is going to take a while).

Sunday, June 06, 2010

Nexus Data Editor running on Mac OS X

In an earlier post I expressed my amazement that my venerable Nexus Data Editor (NDE) still had users, meaning I had to rebuild the installer so users could install NDE on Windows Vista. Now, Thomas Hauser has gone one better and created an installer for Mac OS X. Given that NDE is a Windows-only program, this is quite a feat. Thomas uses Mike Kroenenberg's (@k3erg) WineBottler to create a version of NDE that can be run on a Mac. WineBottler builds on Wine, which enables Windows software to run on Unix-like operating systems.

To run NDE on a Mac, first download WineBottler from http://winebottler.kronenberg.org/ and install. Then grab the file NDE.dmg Thomas has created. Install the file NDE in your Applications file and run it. Note that you will need X11 installed on your Mac. If you don't have this it should be on the installation disk that came with your computer. After a short pause you should see NDE appear in a X11 window. Below is a screen shot showing NDE editing the example file (Bembidion.nex) that comes with the program:

Many thanks to Thomas for his efforts in packaging NDE with Winebottler, and for making available the NDE.dmg file he created.

Wednesday, June 02, 2010

TreeBASE II RDF

One of the potentially powerful features of TreeBASE II is availability of a RDF version of a study. This means that, in principle, one could take the RDF for a TreeBASE study, combine it with RDF from other sources, and generate a richer view of a particular study. For example, if a TreeBASE study has a DOI, then we could link it to bibliographic details for the study, and through them to other information, such as GenBank sequences, specimens, etc. (see my little linked data browser for an example of some of this linking). If we added a phylogeny viewer, then we'd have a great tool for browsing the basic components of a phylogenetic study.

Unfortunately, we're not there yet. I've been trying to make sense of TreeBASE II RDF, and frankly, it's a mess. Here are some of the problems:

TreeBASE URIs aren't linked data compliant
The canonical URI for a study (e.g., http://purl.org/phylo/treebase/phylows/study/TB2:S10423) doesn't conform to the linked data approach. In fact, the URI crashes the linked data validator, so I tried another test.


curl --include 
   --header "Accept: application/rdf+xml" 
   http://purl.org/phylo/treebase/phylows/study/TB2:S10423

To be a valid linked data resource this request should return a 303 HTTP status code. Instead we get a 302 and some HTML. Linked data clients won't be able to extract information from this URI.

SKOS matching
There are some odd things going on in the RDF. It contains statements of the form:


<rdf:Description rdf:ID="otu1789319">
   <skos:closeMatch rdf:resource="http://purl.uniprot.org/taxonomy/76066.rdf">
</rdf:Description>

(I've tidied this up a little from the original, rather verbose RDF). This asserts that the TreeBASE OTU otu1789319 corresponds to the NCBI taxon with the taxonomy id 76066 (represented by the Uniprot URI). Except, it doesn't really. As far as I understand it, SKOS is about matching concepts, not documents. The URI http://purl.uniprot.org/taxonomy/76066.rdf is a document URI (specifically, a RDF document), the URI http://purl.uniprot.org/taxonomy/76066 is the taxon. The match should really be to http://purl.uniprot.org/taxonomy/76066. Then I've come across statements that match TreeBASE OTUs to http://purl.uniprot.org/taxonomy/0.rdf. This URI doesn't exist (we get a 404). This seems an odd way to say that we don't have a match -- if we don't have a match, don't include it in the RDF.

Local URIs for trees don't work
The RDF is full of local URIs such as http://purl.org/phylo/treebase/phylows/#tree1790755, which don't resolve. In fact they generate a rather spectacular Tomcat exception. I don't understand why we need local URIs. Everything in TreeBASE should have a global URI. Then we can avoid unnecessary statements such as:

http://purl.org/phylo/treebase/phylows/#tree1790755 owl:sameAs http://purl.org/phylo/treebase/phylows/tree/TB2:Tr7899

which links a local resource to a global one http://purl.org/phylo/treebase/phylows/tree/TB2:Tr7899. Incidentally, this URI doesn't resolve, despite claims that this bug has been fixed.

No links between tree and study
But the show stopper for me is that there is no link between a study and a tree! There is no triple in the RDF specifying any relationship between these two entities. To me this is just about the most important thing I need. I want to be able to query TreeBASE RDF using a study identifier (either from TreeBASE itself, or from an external identifier such as a DOI or a PubMed number). As it stands the TreeBASE II RDF is almost useless. I can't get it via a linked data client, it's full of URIs that don't resolve, and it lacks key triples that would glue things together.

RDF != XML

I can't help thinking that the RDF output hasn't been designed with end use in mind. I know from my own experience that it's not until you try to do something with the RDF that you realise how poor some design decisions may have been.

It's not enough to pump out RDF and hope for the best. RDF is not XML, which is just a verbose format for moving data around. RDF brings with it all sorts of expectations about how clients will resolve it, how they will interpret URIs, and the kinds of queries that will be performed. We are achingly close to being able to tie everything together, but not with RDF TreeBASE II is currently making available.