iPhylo: October 2007

Roderic D. M. Page

Wednesday, October 31, 2007

Amber spider

Really just a shameless attempt to get one over David Shorthouse, but there has been some buzz about Very High Resolution X-Ray Computed Tomography (VHR-CT) of a fossil of Cenotextricella simon.

The paper describing the work is in Zootaxa (link here). Zootaxa is doing great things for taxonomic publishing, but they really need to get some sort of stable identifier set up. Linking to Zootaxa articles is not straightforward. If they had DOIs (or even OpenURL access) they would make it much easier for people to convert lists of papers that include Zootaxa publications into lists of resolvable links.

Sunday, October 28, 2007

Universal Serial Item Names

Following on from the discussion of BHL and DOIs, I stumbled across some remarkable work by Robert Cameron at SFU. Cameron has developed Universal Serial Item Names (USIN). The approach is spelled out in detail in Towards Universal Serial Item Names (also on Scribd). This lengthy document deals with how to develop user-friendly identifiers for journal articles, books, and other documents. The solution looks less baroque than SICIs, which I've discussed earlier.

There is also a web site (USIN.org), complete with examples and source code. Identifiers for books are straightforward, for instance bibp:ISBN/0-86542-889-1 identifies a certain book.

For journals things are slightly more complicated. However, Cameron simplified things a little in his subsequent paper Scholar-Friendly DOI Suffixes with JACC: Journal Article Citation Convention (also on Scribd).

JACC (Journal Article Citation Convention) is proposed as an alternative to SICI (Serial Item and Contribution Identifier) as a convention for specifying journal articles in DOI (Digital Object Identifier) suffixes. JACC is intended to provide a very simple tool for scholars to easily create Web links to DOIs and to also support interoperability between legacy article citation systems and DOI-based services. The simplicity of JACC in comparison to SICI should be a boon both to the scholar and to the implementor of DOI responders.

USIN and JACC use the minimal number of elements needed to identify an article, such as journal code (e.g., ISSN or an accepted acronym), volume number, and starting page. Using ISSNs ensures globally unique identifiers for journals, but the scheme can also use acronyms, hence those journals that lack ISSNs could be catered for. The scheme is simple, and in many cases will provide the bare minimum of information necessary to locate an item via an OpenURL resolver. Indeed, one simple way to implement USIN identifiers would be to have a service that takes URIs of the form <journal-code>:<volume>@<page> and resolves them behind the scenes using OpenURL. Hence we get simple identifiers that are resolvable, without the baroque approach of SICIs.
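As a rough illustration of that last idea, here is a minimal Python sketch that turns an identifier of the form <journal-code>:<volume>@<page> into an OpenURL query. The resolver address is a placeholder, the function name is made up for this example, and the identifier syntax is only the simplified form described above, not necessarily the full USIN or JACC grammar.

```python
import re
from urllib.parse import urlencode

# Hypothetical resolver endpoint (an assumption); substitute your own OpenURL resolver.
RESOLVER = "https://resolver.example.org/openurl"

# Simplified form from the text: <journal-code>:<volume>@<page>
IDENTIFIER = re.compile(r"^(?P<journal>[^:@]+):(?P<volume>\d+)@(?P<page>\d+)$")
ISSN = re.compile(r"^\d{4}-\d{3}[\dXx]$")


def to_openurl(identifier: str) -> str:
    """Convert a <journal-code>:<volume>@<page> identifier into an OpenURL query."""
    match = IDENTIFIER.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a <journal-code>:<volume>@<page> identifier: {identifier!r}")
    journal = match.group("journal")
    params = {"genre": "article"}
    # An ISSN goes in the issn key; anything else is treated as a journal title or acronym.
    if ISSN.match(journal):
        params["issn"] = journal
    else:
        params["title"] = journal
    params["volume"] = match.group("volume")
    params["spage"] = match.group("page")
    return RESOLVER + "?" + urlencode(params)


print(to_openurl("Hydrobiologia:418@81"))
```

Whether the resolver can actually find the article from journal, volume, and starting page alone depends on the resolver, so treat this as a starting point.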

When I get the chance I may add support for something like this to bioGUID.

Saturday, October 27, 2007

Taxonomy is dead, long live taxonomy

No, not taxonomy the discipline (although I've given a talk asking this question), but taxonomy.zoology.gla.ac.uk, my long-running web server hosting such venerable software projects as TreeView, NDE, and GeneTree, along with my home page.

A series of power cuts in my building while I was away finally did for my ancient Sun Sparcstation5, running the CERN web server (yes, it's that old). I can remember the thrill (mixed with mild terror) of taking delivery of the Sparcstation and having to manually assemble it (the CD ROM and floppy drives came separately), and the painful introduction to the Unix command line. The joy of getting a web server to run (way back in late 1995), followed by Samba, AppleTalk, and CVS.

For the time being a backup copy of the documents and software hosted on the Sparcstation is being served from a Mac. The only tricky thing was setting up the CVS server that I use for version control for my projects. Yes, I know CVS is also ancient, and that Linus Torvalds will think me a moron, but for now it's what I use. CVS comes with Apple's developer tools, but I wanted to set up remote access. I found Daniel Côté's article Setting up a CVS server on Mac OS X and the Mac OSX Hints article Enable CVS pserver on 10.2 helpful. Basically I initialised a new CVS repository, then copied across the backed-up repository from a DVD. I then replaced some files in CVSROOT that listed things like the modules in the repository and notifications sent when code is committed. Getting the pserver up and running required some work. I created a file called cvspserver inside /etc/xinetd.d/, with the following contents.

service cvspserver
{
        disable         = no
        socket_type     = stream
        wait            = no
        user            = root
        server          = /usr/bin/cvs
        server_args     = -f --allow-root=/usr/local/CVS pserver
        groups          = yes
        flags           = REUSE
}

Then I started the service:

sudo /sbin/service cvspserver start

So far, so good, but I couldn't log in to CVS. Discovered that this is because Mac OS X uses ShadowHash authentication_authority. Hence, on a Mac CVS won't use the system user names and passwords (probably a good thing). Therefore, we uncomment the line

# Set this to "no" if pserver shouldn't check system users/passwords
SystemAuth=no

in the file CVSROOT/config, then create a file CVSROOT/passwd. This file contains the username, hashed password, and the actual Mac OS X username (nicely explained in Daniel Côté's article). To generate a hashed password, do this:

darwin: openssl passwd
Password: 123
Verifying - Password: 123
yrp85EUNQl01E
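The hash then goes into CVSROOT/passwd, one colon-separated line per user, in the order given above: CVS username, hashed password, Mac OS X username. A small Python sketch for building such a line, where the function name and the "rpage" account are purely illustrative:

```python
def passwd_line(cvs_user: str, hashed_password: str, system_user: str) -> str:
    """Build one CVSROOT/passwd entry: cvs_user:hashed_password:system_user."""
    for field in (cvs_user, hashed_password, system_user):
        # Colons separate the fields, so none of the fields may contain one.
        if not field or ":" in field or "\n" in field:
            raise ValueError(f"invalid passwd field: {field!r}")
    return f"{cvs_user}:{hashed_password}:{system_user}"


# Illustrative account name; the hash is the openssl output shown above.
print(passwd_line("rpage", "yrp85EUNQl01E", "rpage"))
```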

At last it all seems to work, and I can get back to coding. This is about as geeky as this blog gets, but if you want a real geek overload, spend some time listening to this talk by Linus Torvalds.

Thursday, October 25, 2007

BHL and DOIs

In a series of emails Chris Freeland, David Shorthouse, and I have been discussing DOIs in the context of the Biodiversity Heritage Library (BHL). I thought it worthwhile to capture some thoughts here.

In an email Chris wrote:

Sure, DOIs have been around for a while, but how many nomenclators or species databases record them? Few, from what I've seen - instead they record citations in traditional text form. I'm trying to find the middle ground between guys like the two of you, who want machine-readable lit (RDF), and most everyone else I talk with, including regular users of Botanicus & BHL, who want human-readable lit (PDF). I'm not overstating - it really does break down into these 2 camps (for now), with much more weight over on the PDF side (again, for now).

I think the perception that there are two "camps" is unfortunate. I guess for a working taxonomist, it would be great if for a given taxonomic name there was a way to see the original publication of the name, even if it is simply a bitmap image (such as a JPEG). Hence, a database that links names to images of text would be a useful resource. If this is what BHL is aiming for, then I agree, DOIs may seem to be of little use, apart from being one way to address the issue of persistent identifiers.

But it seems to me that there are lots of tasks for which DOIs (or more precisely, the infrastructure underlying them) can help. For example, given a bibliographic citation such as

Fiers, F. and T. M. Iliffe (2000) Nitocrellopsis texana n. sp. from central TX (U.S.A.) and N. ahaggarensis n. sp. from the central Algerian Sahara (Copepoda, Harpacticoida). Hydrobiologia, 418:81-97.

how do I find a digital version of this article? Given this citation

Fiers, F. & T. M. Iliffe (2000). Hydrobiologia, 418:81.

how do I decide that this is the same article? If I want to see whether somebody has cited this paper (and perhaps changed the name of the copepod) how do I do that? If I want to follow up the references in this paper, how do I do that?
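To make the "same article" question concrete, here is a deliberately naive Python sketch that reduces each citation string to a (journal, volume, starting page) key and compares the keys. The regular expression is an assumption tuned to the two citations above, and the function name is made up; real citation strings are far messier.

```python
import re

# Assumes the citation ends with "Journal, volume:startpage[-endpage]."
CITATION_TAIL = re.compile(
    r"(?P<journal>[A-Z][A-Za-z ]+),\s*(?P<volume>\d+)\s*:\s*(?P<spage>\d+)(?:-\d+)?\.?\s*$"
)


def citation_key(citation: str):
    """Return (journal, volume, starting page), or None if the pattern is not found."""
    match = CITATION_TAIL.search(citation)
    if match is None:
        return None
    return (
        match.group("journal").strip().lower(),
        int(match.group("volume")),
        int(match.group("spage")),
    )


full = ("Fiers, F. and T. M. Iliffe (2000) Nitocrellopsis texana n. sp. from central TX "
        "(U.S.A.) and N. ahaggarensis n. sp. from the central Algerian Sahara "
        "(Copepoda, Harpacticoida). Hydrobiologia, 418:81-97.")
short = "Fiers, F. & T. M. Iliffe (2000). Hydrobiologia, 418:81."

# Both citations reduce to the same key, so they are treated as the same article.
print(citation_key(full) == citation_key(short))
```

String matching like this breaks as soon as journal names are abbreviated or page numbers mistyped.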

These are the kinds of things that DOIs address. This article has the DOI doi:10.1023/A:1003892200897. This gives me a globally unique identifier for the article. The DOI foundation provides a resolver whereby I can go to a site that will provide me with access (albeit possibly for a fee) to the article. CrossRef provides an OpenURL service whereby I can:

Retrieve metadata about the article given the DOI

Search for a DOI given metadata
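The resolver link and both CrossRef requests boil down to building a URL. The Python sketch below only constructs the URLs and makes no network calls. The endpoint and the parameter names (pid, id, title, volume, spage, date, noredirect) are as I understand CrossRef's OpenURL interface, so check them against the current documentation; the pid value is a placeholder for a registered account email, and the function names are invented for this example.

```python
from urllib.parse import quote, urlencode

DOI_RESOLVER = "https://doi.org/"
# Assumed CrossRef OpenURL endpoint; verify against CrossRef's documentation.
CROSSREF_OPENURL = "https://www.crossref.org/openurl"


def doi_link(doi: str) -> str:
    """Resolvable link for a DOI."""
    return DOI_RESOLVER + quote(doi, safe="/:")


def metadata_for_doi(doi: str, pid: str) -> str:
    """URL asking CrossRef for the metadata behind a DOI."""
    params = {"pid": pid, "id": "doi:" + doi, "noredirect": "true"}
    return CROSSREF_OPENURL + "?" + urlencode(params)


def doi_for_metadata(journal: str, volume: str, spage: str, year: str, pid: str) -> str:
    """URL asking CrossRef for the DOI that matches journal, volume, page, and year."""
    params = {
        "pid": pid,
        "title": journal,
        "volume": volume,
        "spage": spage,
        "date": year,
        "noredirect": "true",
    }
    return CROSSREF_OPENURL + "?" + urlencode(params)


print(doi_link("10.1023/A:1003892200897"))
print(metadata_for_doi("10.1023/A:1003892200897", "me@example.org"))
print(doi_for_metadata("Hydrobiologia", "418", "81", "2000", "me@example.org"))
```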

To an end user much of this is irrelevant, but to people building the links between taxonomic names and taxonomic literature, these are pressing issues. I've previously given some examples where taxonomic databases such as Catalogue of Life and ITIS store only text citations, not identifiers (such as DOIs or Handles). As a result, the user has to search for each paper "by hand". Surely in an ideal world there would be a link to the publication? If so, how do we get there? How do IPNI, Index Fungorum, ITIS, Sp2000, ZooBank, and so on link their names and references to digitised content? This is where a CrossRef-style infrastructure comes in.

Publishers "get this". Given the nature of the web, where users expect to be able to follow links, CrossRef deals with the issue of converting the literature cited section of a paper into a set of clickable links. Don't we want the same thing for our databases of taxonomic names? And don't we want this for our taxonomic literature?

It is worth noting that the perception that DOIs only cover modern literature is erroneous. For example, here's the description of Megalania prisca Owen (doi:10.1098/rstl.1859.0002), which was published in 1859. The Royal Society of London has DOIs for articles published in the 18th century.

If the Royal Society can do this, why can't BHL?

Monday, October 22, 2007

Phyloinformatics workshop - primal scream

Argh!!! The phyloinformatics workshop at Edinburgh's eScience Centre is underway (program of talks available here as an iCalendar file), and I'm stranded in Germany for personal reasons I won't bore readers with. The best and brightest gather less than an hour from my home town to talk about one of my favourite subjects, and I can't be there. Talk about frustration!

How can they possibly proceed without yours truly to interject "it sucks" at regular intervals? What, things are going just fine? Next, you'll be suggesting that Systematic Biology can function without me as editor … wait, what's that you say? Jack's running the show without a hitch … gack, I'm redundant.

Monday, October 15, 2007

Getting into Nature ... sort of

The kind people at Nature have taken pity on my rapidly fading research career, and have highlighted my note "Towards a Taxonomically Intelligent Phylogenetic Database" in Nature Precedings (doi:10.1038/npre.2007.1028.1) on the Nature web site. Frankly this is probably the only way I'll be getting into Nature...

Sunday, October 14, 2007

Pygmybrowse GBIF classification

Here is a live demo of Pygmybrowse using the Catalogue of Life classification of animals provided by GBIF. It's embedded in this post in an <iframe> tag, so you can play with it. Just click on a node.

Taxa in bold have ten or more children; the number of children is displayed in parentheses "()". Each subtree is fetched on the fly from GBIF.

Friday, October 12, 2007

Pygmybrowse revisited

As yet another example of avoiding what I should really be doing, a quick note about a reworked version of PygmyBrowse (see earlier posts here and here). Last September I put together a working demo written in PHP. I've now rewritten it entirely in Javascript, apart from a PHP script that returns information about a node in a classification. For example, this link returns details about the Animalia in ITIS.

You can view the new version live. Going "view source" in your browser will show you the code. It's mostly a matter of Javascript and CSS, with some AJAX thrown in (based on the article Dynamic HTML and XML: the XMLHttpRequest object on Apple's ADC web site).

One advantage of making it entirely in Javascript is that it can be easily integrated into web sites that don't use PHP. As an example, David Shorthouse has inserted a version into the species pages in The Nearctic Spider Database (for example, visit the page for Agelena labyrinthica and click on the little "Browse tree" link).

Thursday, October 11, 2007

Frog deja vu

While using iSpecies to explore some names (e.g., Sooglossus sechellensis in the Frost et al. amphibian tree mentioned in the last post), I stumbled across two papers that both described a new genus of frog for the same two taxa in the Seychelles. The papers (doi:10.1111/j.1095-8312.2007.00800.x and www.connotea.org/uri/6567cfd7531a77588ee62d78e7b4359b) were published within a couple of months of each other.

Was about to blog this, when I discovered that Christopher Taylor had beaten me to it with his post Sooglossidae: Deja vu all over again. Amongst the commentary on this post is a note by Darren Naish (now here) pointing to an interesting article by Jerald D. Harris entitled ‘Published Works’ in the electronic age: recommended amendments to Articles 8 and 9 of the Code, in which he states:

I propose to the Commission that, under Article 78.3 (‘Amendments to the Code’), Articles 8 and 9 of the current Code require both pro- and retroactive (to the effective date of the Fourth Edition, 1 January 2000) modification to accommodate the following issue: documents published electronically with DOI numbers and that are followed by hard-copy printing and distribution be exempt from Article 9.8 and be recognized as valid, citable sources of zoological taxonomic information and that their electronic publication dates be considered definitive.

It's an interesting read.

Visualising very big trees Part VI

I've tidied up the big phylogeny viewer mentioned earlier, and added a simple web form for anybody interested to upload a NEXUS or Newick tree and have a play.

Examples:

Ants from TreeBASE

trichodectid lice from TreeBASE

Bats from TreeBASE

Frost et al amphibian tree

To create your own tree viewer, simply go to http://linnaeus.zoology.gla.ac.uk/~rpage/bigtrees/tv2/ and upload a tree. After some debugging code and images scroll past, a link to the widget appears at the bottom of the page. I'll tidy this all up when I get the chance, but for now it's good enough to play with.

Thursday, October 04, 2007

processing PhylOData (pPOD)

The first pPod workshop happened last month at NESCent, and some of the presentations are online on the pPod Wiki. Although I'm a "consultant" I couldn't be there, which is a pity because it looks to have been an interesting meeting. When pPod was first announced I blogged some of my own thoughts on phylogenetics databases. The first meeting had lots of interesting stuff on workflows and data integration, as well as outlining the problems faced by large-scale systematics. Some relevant links (partly here as a reminder to myself to explore these further):

Orchestra: a collaborative data sharing system
MIAPA: Minimal Information About a Phylogenetic Analysis (see also doi:10.1089/omi.2006.10.231)
BioSQL: persistent storage of sequences and features