Saturday, May 27, 2006

More on trees and Google Earth


Well, turns out Bill's not the only one putting trees on Google Earth. Declan Butler pointed me to Ogle Earth, where there is a teaser of some work on guiology.

Currently playing in iTunes: Crazy In Love by Beyoncé

Avian flu, phylogeny, and Google Earth

The penny just dropped (duh!).
Having mentioned Bill Piel's very cool visualisation of phylogenies on Google Earth

what about the other cool use of Google Earth in biology, namely Declan Butler's displays of the march of avian flu?

Instead of standard diagrams like this one from Ruben Donis' paper in Emerging Infectious Diseases:

why not take phylogenies for avian flu virus and add them to the data Declan is displaying? This could be a potentially compelling graphic, and a test of whether phylogenies add useful information to our understanding of what is going on.

Friday, May 26, 2006

TreeBASE meets Google Earth





Bill Piel has created a cool tool for creating KMZ files of phylogenies for Google Earth. From the web site:

One of the components of the CIPRES project is the development of TreeBASE II — a robust, scalable, and versatile re-design and re-engineering of TreeBASE. As part of this project, we are exploring other ways of browsing and visualizing trees. Google Earth is a fantastic 3-D browser for exploring geographic resources and has the potential to be a useful and fun tool for delivering biological information with a geographic component.


Google Earth (available for Windows and Mac OS X) is opening up all sorts of possibilities for biodiversity informatics (ants being one of the first examples). What is cool about Bill's work is that it goes beyond simple locality records.

As always, after pausing to say "wow", there are all sorts of things that one could think of adding. For example, some trees are clearer than others, depending on how well the geography and the trees match. I wonder if this could be used as a measure of how well geography "explains" the tree. For example, simple vicariance or serial dispersal would have few cross-overs, whereas a history of dispersal (or an old pattern with extinction, or a geography that has since changed) might be messier. Perhaps a metric could be developed for this. It strikes me as similar in spirit to trees for tandem duplications -- there's a nice spatial (albeit linear) order in a tree if the sequences are tandem duplications.

If the trees had dated nodes (i.e., were "chronograms"), presumably the dates could be used to set node heights, so you'd be able to view chronograms in three dimensions. Sort of a reverse onion, with the layers getting older as you go out. People could then see whether biogeographic patterns were of a similar age. This adds a spatial dimension to chronograms (see an earlier post on the analogy between genome browsers and chronograms).

As an aside, and because I was once a panbiogeography enthusiast, why haven't panbiogeographers leapt on Google Earth as a tool to display "tracks"? If ever there was an opportunity to drag that movement out of the doldrums, this is it.

Wednesday, May 24, 2006

Open Access taxonomy


Pyramica boltoni
Originally uploaded by Roderic Page.
Fussing around with ants, I stumbled across this paper (doi:10.1653/0015-4040(2006)89[1:PBANSO]2.0.CO;2) (if the DOI doesn't work, try this link), which describes a new species, Pyramica boltoni. This paper is Open Access, so the question arises, how do I get it into a triple store? I could add the metadata about the paper (it would be nice to do this automatically via Connotea and the DOI, but some BioOne DOIs aren't "live" yet), but what about things like the pictures?
For fun, I grabbed Fig. 1, uploaded it into iPhoto, then exported it to Flickr using the FlickrExport plugin.
Flickr has an API, hence the image (and the associated tags) could be retrieved automatically, meaning anybody with Connotea and Flickr accounts could contribute to a triple store.
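As a rough sketch of what this might look like (not code from an actual pipeline), here is a minimal Perl script that pulls the tags for a single photo from the Flickr REST API using flickr.photos.getInfo; the API key and photo id are placeholders, and a real script would use a proper XML parser rather than a regular expression:

#!/usr/bin/perl
# Minimal sketch: fetch a photo's tags from the Flickr REST API.
# The API key and photo id below are placeholders.
use strict;
use warnings;
use LWP::Simple;

my $api_key  = 'YOUR_API_KEY';   # placeholder
my $photo_id = '123456789';      # placeholder

my $url = 'http://api.flickr.com/services/rest/'
    . '?method=flickr.photos.getInfo'
    . "&api_key=$api_key&photo_id=$photo_id";

my $xml = get($url) or die "Flickr request failed\n";

# Pull out the tag names; a real script would use an XML parser
while ( $xml =~ m|<tag[^>]*>([^<]+)</tag>|g ) {
    print "$1\n";
}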

Sunday, May 21, 2006

Towards the ToL database - some visions


So, when I started this blog I promised to write something about phyloinformatics and the goal of a phylogenetic database. I've been playing around with various ideas, some of which have made it online, but most remain buried on various hard drives until they are written up to the point of being usable.

There are also numerous distractions and detours along the way, such as MyPHPBib, the Taxonomic Search Engine, and LSIDs, oh and iSpecies (which got me into trouble with Google); then there is a certain journal, and a certain person (but let's not go there...).

My point (and I do have one) is that maybe it's time to rethink some cherished ideas. Basically, my original goal of creating a phylogenetic database involved massive annotation, disambiguation of taxonomic names, and linking to global identifiers for taxonomic names, sequences, images, and publications. This is the project outlined at the start of this blog.

I still believe this would be worthwhile, and I've done a lot of the work for TreeBASE (e.g., mapping TreeBASE names to external databases, BLASTing sequences in TreeBASE to get accession numbers, etc.). This is a lot of work, though, and I wonder about scalability and involvement. In other words, can it cope with the amount of data and trees we generate, and how do we get people to contribute? So, here are a few different (not necessarily mutually exclusive) approaches.

Curation
Use TreeBASE as a seed and continue to grow that database, adding extensive annotations and cross links. Time consuming, but potentially very powerful, especially if the data are dumped into a triple store and cool ways to query it are developed.

Googolise everything
Use Google to crawl for NEXUS files (e.g., "#nexus" "begin data" format dna), extract them and put them into a database. Use string matching and BLAST to figure out what the files are about.

Phylogeny news
Monitor NCBI and journal RSS feeds; when new sequences or papers appear, extract popsets, use or build alignments, compute trees quickly, and whack them into a database. The interface would be something like Postgenomic (maybe using the same clustering algorithms to link related stories), or, even cooler, newsmap.


Connotea style

Inspired by projects like Connotea, perhaps the trick is to mobilise the community by lowering the barrier to entry. Instead of aiming for a carefully curated database, what if people could upload the basics (some sort of identifier for the paper, such as a DOI or a PubMed id, and one or more trees in Newick format)? I think this is what Wayne Maddison was suggesting when we chatted at the CIPRES meeting (see my earlier post about that meeting) -- if Wayne didn't suggest this, then my apologies. The idea would be that people could upload the bare minimum, but be able to annotate, comment, link, etc. Behind the scenes we would have scripts to look up whatever identifiers we have and extract the relevant metadata.

Saturday, May 20, 2006

Taxonomic Search Engine back online

My Taxonomic Search Engine is back online (mostly). This tool for searching multiple databases for scientific names was another casualty of hacking. Having been developed under PHP 4, it needed some work to play nicely with PHP 5. The changes were minor, and mainly concerned code involving XPath and XSLT. I've committed these changes to SourceForge. I've not got the Species 2000 service back up (this needs local data to be restored), and the LSIDs are broken due to problems with IBM's LSID Perl stack on my Fedora Core 4 machine (sigh).

Wednesday, May 17, 2006

AntBase and Web 2.0 business value

Dion Hinchcliffe has a piece entitled Creating real business value with Web 2.0 which lists AntBase.org (I think he actually means AntWeb) as an example of a non-commercial Web 2.0 service that demonstrates "scalable marshalling of underutilized data resources," and shows:


...how a scientific community turned massive taxonomy resources otherwise mouldering away in basements as lost specimens into a thriving online database of information that can be shared by all. Understanding the success and importance of both of these points to intriguing and largely unexploited possibilities that I predict will become more common and widespread in the near future.


The article comes with this graphic:



See also Dion's Thinking in Web 2.0: Sixteen ways (via Danny Ayers).

Tuesday, May 09, 2006

CrossRef's OpenURL resolver

CrossRef's OpenURL resolver can be used to find DOIs for papers or, given a DOI, to extract metadata. For example, consider Brian Fisher's article on silk production in Melissotarsus emeryi. This is in FORMIS, and the record is shown here in RIS format:


TY - JOUR
AU - Fisher, B.L.
AU - Robertson, H.G.
PY - 1999
TI - Silk production by adult workers of the ant Melissotarsus emeryi (Hymenoptera, Formicidae) in South African fynbos
SP - 78-83
JF - Insect. Soc.
JO - Insectes Sociaux
VL - 46
N1 - Part of FORMIS 00; from PSW,
KW - ant; Formicidae; Melissotarsus emeryi; Africa; South Africa; scientific; nest; tending Homoptera; silk; production; adult; worker; gland; cuticular depressions; hypostoma; silk brushes; nest construction; defense; diaspidid symbiont;
ID - 6573
ER -


We can find the DOI for this article using the following query:

http://www.crossref.org/openurl?&noredirect=true&aulast=Fisher%20LRM&title=Insectes%20Sociaux&volume=46&spage=78&date=1999, which yields the following XML:

<?xml version = "1.0" encoding = "UTF-8"?>
<crossref_result version="2.0" xmlns="http://www.crossref.org/qrschema/2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.crossref.org/qrschema/2.0 http://www.crossref.org/qrschema/crossref_query_output2.0.xsd">
<query_result>
<head>
<email_address>ckoscher@crossref.org</email_address>
<doi_batch_id>w001</doi_batch_id>
</head>
<body>
<query key="555-555" status="resolved">
<doi type="journal_article">10.1007/s000400050116</doi>
<issn type="print">00201812</issn>
<issn type="electronic">14209098</issn>
<journal_title match="exact">Insectes Sociaux</journal_title>
<author match="fuzzy">Fisher</author>
<volume match="exact">46</volume>
<issue>1</issue>
<first_page match="exact">78</first_page>
<year match="exact">1999</year>
<publication_type>full_text</publication_type>
<article_title>Silk production by adult workers of the ant Melissotarsus emeryi (Hymenoptera, Formicidae) in South African fynbos</article_title>
</query>
</body>
</query_result>
</crossref_result>


The DOI for this article is 10.1007/s000400050116. This service could be a simple way to get DOIs for recent papers in FORMIS, enabling us to get GUIDs for the articles, as well as providing a link to the article itself.
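As an illustration (a sketch only, not a script I'm actually running), here is how the query above could be made from Perl, pulling the DOI out of the returned XML with a crude regular expression; a real script would parse the XML properly (e.g., with XML::LibXML):

#!/usr/bin/perl
# Minimal sketch: look up a DOI via CrossRef's OpenURL resolver
use strict;
use warnings;
use LWP::Simple;
use URI::Escape;

my %article = (
    aulast => 'Fisher',
    title  => 'Insectes Sociaux',
    volume => '46',
    spage  => '78',
    date   => '1999',
);

my $url = 'http://www.crossref.org/openurl?noredirect=true&'
    . join( '&', map { "$_=" . uri_escape( $article{$_} ) } keys %article );

my $xml = get($url) or die "Could not contact CrossRef\n";

# Crude extraction of the DOI from the query result
if ( $xml =~ m|<doi[^>]*>([^<]+)</doi>| ) {
    print "DOI: $1\n";
}
else {
    print "No DOI found\n";
}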

Saturday, May 06, 2006

Updating ants



A triple store for ants is all very well, but it contains just the information available when the triple store was created. What about updating it? What about doing this automatically? Here are some ideas:

Connotea
Connotea provides semantically rich RSS feeds. We could subscribe to a feed for a tag (such as Formicidae) and extract recent posts, either using HTTP conditional GET (see the sketch at the end of this post) or by parsing the Connotea feed and using XPath to extract references more recent than a given date. Connotea makes extensive use of RDF in its RSS feeds, so it's easy to dump this into the triple store.
uBio
uBio's taxonomically intelligent RSS feed reader could be used to monitor publications on ants (e.g., Formicidae). uBio uses RSS 2.0, which doesn't include RDF (see the Wikipedia entry for RSS). One option would be to parse the RSS and see what we can extract from the links (e.g., whether they contain DOIs, are Ingenta feeds, etc.). If there are DOIs we could use CrossRef's OpenURL lookup. Or we could use the Connotea Web API: we'd upload the URLs, get Connotea to see what it can do with them, and then make use of its RSS feed. This also makes the information available to everybody for tagging.

GenBank
We could also track new sequences in GenBank (to do).
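As a sketch of the conditional GET idea mentioned under Connotea above: LWP's mirror() only downloads the feed if it has changed since our cached copy was saved. The tag feed URL is my guess at Connotea's format, so treat it as an assumption:

#!/usr/bin/perl
# Minimal sketch: poll a Connotea tag feed with an HTTP conditional GET
use strict;
use warnings;
use LWP::UserAgent;

my $feed  = 'http://www.connotea.org/rss/tag/Formicidae';   # assumed URL format
my $cache = 'formicidae.rss';

my $ua       = LWP::UserAgent->new;
my $response = $ua->mirror( $feed, $cache );   # sends If-Modified-Since for us

if ( $response->code == 304 ) {
    print "Feed unchanged since last check\n";
}
elsif ( $response->is_success ) {
    print "Feed updated; parse $cache and add new items to the triple store\n";
}
else {
    die 'Fetch failed: ' . $response->status_line . "\n";
}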

stamen design | big ideas worth pursuing


stamen (which brought us Mappr) has a nice discussion of data visualisation.

Currently playing in iTunes: Summertime by George Benson

Ants, RDF, and triple stores


Background
In order to explore the promise of RDF and triple stores we need some large, interesting data sets. Ants are cool: there is a lot of data available online (e.g., AntWeb at the California Academy of Sciences, Antbase at the American Museum of Natural History in New York, the Hymenoptera Name Server at Ohio State University, Chris Schmidt's ponerine.org, and Ant News), and they get good press (for example, the "Google ant").

Specimens


Firstly, we start with a Google Earth file for ants, obtained from AntWeb on Monday April 24th, 2006. AntWeb gives the link as http://www.antweb.org/antweb.kmz, which is a compressed KML file. However, this file merely gives the location of the actual data file, http://www.antweb.org/doc.kmz. Grab that file, expand it, and you get 27 Mb of XML listing 50,550 ant specimens and 1,511 associated images.

We use the Google Earth file because it gives us a dump of AntWeb in a reasonably easy to handle format (XML). I wrote a C++ program to parse the KML file and dump the information to RDF. One limitation is that my program dies on the recent KML files because they have very long lines; searching for <Placemark> and replacing it with \r<Placemark> in TextWrangler fixed the problem.

In order to keep things as simple and as generic as possible, I use Dublin Core metadata terms wherever possible, and the basic geo (WGS84 lat/long) vocabulary for geographical coordinates. The URI for the specimen is the URL (no LSIDs just yet).
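To give a flavour of the conversion (my actual converter is the C++ program mentioned above, so this is only a sketch), here is a minimal Perl equivalent using XML::Twig; the construction of the specimen URI from the placemark name is an assumption, and AntWeb's KML may well differ in detail:

#!/usr/bin/perl
# Minimal sketch: walk KML Placemarks and emit Dublin Core + geo triples
use strict;
use warnings;
use XML::Twig;

print qq{<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"\n}
    . qq{  xmlns:dc="http://purl.org/dc/elements/1.1/"\n}
    . qq{  xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">\n};

my $twig = XML::Twig->new(
    twig_handlers => {
        'Placemark' => sub {
            my ( $t, $placemark ) = @_;
            my $name   = $placemark->first_child_text('name');
            my $coords = $placemark->first_child('Point')
                ? $placemark->first_child('Point')->first_child_text('coordinates')
                : '';
            my ( $long, $lat ) = split /,/, $coords;   # KML order is long,lat
            # Assume the placemark name is the specimen code (an assumption)
            my $uri = "http://www.antweb.org/specimen.do?name=$name";
            print qq{<rdf:Description rdf:about="$uri">\n};
            print qq{  <dc:title>$name</dc:title>\n};
            print qq{  <geo:lat>$lat</geo:lat>\n}    if defined $lat;
            print qq{  <geo:long>$long</geo:long>\n} if defined $long;
            print qq{</rdf:Description>\n};
            $t->purge;   # keep memory use down on the 27 Mb file
        },
    },
);
$twig->parsefile( $ARGV[0] || 'doc.kml' );

print "</rdf:RDF>\n";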

In addition to the RDF, I generate two text dumps for further processing.

Images
As noted at iSpecies, we can automate the extraction of metadata from images using Exif tags. There is a vocabulary for describing Exif data in RDF, which I've adopted. However, I don't use all the tags, nor do I use IFD, which frankly I don't understand.

So, the basic idea is to have a Perl script that:

  1. Takes a list of AntWeb images (more precisely, the URLs for the images)

  2. Fetches each image in turn using LWP and writes it to a temporary folder

  3. Uses Image::EXIF to extract Exif tags

  4. Generates RDF
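Something along these lines (a sketch only; it uses Image::ExifTool rather than Image::EXIF, and the RDF fragment it prints is illustrative rather than complete):

#!/usr/bin/perl
# Minimal sketch of steps 1-4: fetch each image, read its Exif tags,
# and print a simple RDF description
use strict;
use warnings;
use LWP::Simple;
use Image::ExifTool;
use File::Temp qw(tempdir);

my $dir      = tempdir( CLEANUP => 1 );
my $exiftool = Image::ExifTool->new;

while ( my $url = <> ) {    # read image URLs, one per line
    chomp $url;
    my ($file) = $url =~ m|([^/]+)$|;
    my $path = "$dir/$file";
    next unless is_success( getstore( $url, $path ) );

    my $info = $exiftool->ImageInfo($path);
    print qq{<rdf:Description rdf:about="$url">\n};
    print qq{  <dc:format>image/jpeg</dc:format>\n};
    print qq{  <exif:imageWidth>$info->{ImageWidth}</exif:imageWidth>\n}
        if $info->{ImageWidth};
    print qq{  <exif:imageHeight>$info->{ImageHeight}</exif:imageHeight>\n}
        if $info->{ImageHeight};
    print qq{</rdf:Description>\n};
}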


Some AntWeb specific things include linking the image to the specimen, and linking to a Creative Commons license.
Here is an example:

<rdf:Description rdf:about="http://www.antweb.org/images/casent0005842/casent0005842_p_1_low.jpg" >
<dc:subject rdf:resource="http://www.antweb.org/specimen.do?name=casent0005842" />
<dc:type>image</dc:type>
<dc:publisher rdf:resource="http://www.antweb.org/"/>
<dc:format>image/jpeg</dc:format>
<exif:resolutionUnit>inches</exif:resolutionUnit>
<exif:yResolution>337.75</exif:yResolution>
<exif:imageHeight>64</exif:imageHeight>
<exif:imageWidth>112</exif:imageWidth>
<exif:xResolution>337.75</exif:xResolution>
</rdf:Description>



This RDF is generated from the image to the right. What is interesting about the Exif metadata is that it isn't generated from the AntWeb database itself, but from the images. Hence, unlike the Google Earth file, we are adding value rather than simply reformatting an existing resource.

Of course, there are some gotchas. Some images look like this ("image not available"), and the Exif Copyright tag has a stray null character (\0x00) appended at the end, which breaks the RDF. I fixed this with "Zap Gremlins" in TextWrangler.

Names
There is no single authoritative list of scientific names. I'm going to use the biggest (and best), uBio, specifically their findIT SOAP service. It might make more sense to use the Hymenoptera Name Server, but uBio serves RDF, and gets most of the ant names anyway because the Hymenoptera Name Server feeds names into ITIS, which in turn end up in uBio. The result of this mapping is a <dc:subject> tag for each specimen that links, using rdf:resource, to a uBio LSID. When we make the mapping, we write the uBio namebank ids to a separate file, which we then process to get the RDF for each name.
The script reads a list of specimens and taxon names, calls uBio's findIT SOAP service, and if it gets a direct match, writes some RDF linking the specimen URI to the uBio LSID. It also stores the uBio id in memory, and dumps these into a file for processing in the next step.

uBio metadata

Having mapped ant names to uBio, we can then go to uBio and use their LSID authority to retrieve metadata for each name in, you guessed it, RDF. We could resolve LSIDs, but for speed I'll "cheat" and append the uBio namebank ID to http://names.ubio.org/authority/metadata.php?lsid=.
So, armed with a Perl script, we read the list of uBio ids, fetch the metadata for each one, and dump it into a directory. I then run another Perl script that scans the directory for ".rdf" files and puts them in the triple store.
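The fetching step is simple enough; a minimal sketch (reading one namebank id per line, exactly as the "cheat" above describes):

#!/usr/bin/perl
# Minimal sketch: fetch uBio RDF metadata for a list of namebank ids
# and save each record to a directory ready for the triple store
use strict;
use warnings;
use LWP::Simple;

my $dir = 'ubio_rdf';
mkdir $dir unless -d $dir;

while ( my $id = <> ) {    # one namebank id per line
    chomp $id;
    my $rdf = get( 'http://names.ubio.org/authority/metadata.php?lsid=' . $id );
    next unless defined $rdf;
    open my $fh, '>', "$dir/$id.rdf" or die $!;
    print $fh $rdf;
    close $fh;
}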

NCBI

Sequences
I retrieved all ant sequences from GenBank by searching the taxonomy browser for Formicidae, downloading all the sequence gis, then running a Perl script that retrieved the XML record for each sequence and populated a MySQL database. I then found all sequences that include a specimen voucher field with CASENT%:


SELECT DISTINCT dbxref.id FROM
specimen INNER JOIN source USING (source_id)
INNER JOIN sequence_dbxref ON source.seq_id = sequence_dbxref.sequence_id
INNER JOIN dbxref USING (dbxref_id)
WHERE (code LIKE "CASENT%") AND (dbxref.namespace = "GI")

Next, we fetch these records from NCBI. This seems redundant, as we already have the information in a local MySQL database, but I want to use a simple script that takes a GI and outputs RDF, so that anybody can do this.
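The fetch itself is just a call to NCBI's EFetch; here is a minimal sketch (the XML-to-RDF conversion, which is the real point of the script, is left out):

#!/usr/bin/perl
# Minimal sketch: fetch a GenBank record by GI number using NCBI EFetch
use strict;
use warnings;
use LWP::Simple;

my $gi = shift or die "Usage: $0 <gi>\n";

my $url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
    . "?db=nucleotide&id=$gi&rettype=gb&retmode=xml";

my $xml = get($url) or die "EFetch failed for gi $gi\n";
print $xml;    # convert this XML to RDF in a separate step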

Names
In much the same way, I grabbed the TaxIds for ants with sequences, and grabbed RDF for each name.

PubMed
For PubMed records I wrote a simple Perl script that, given a list of PubMed identifiers, retrieves the XML record from NCBI and converts it to RDF using an XSLT style sheet. The script also gets the identifiers of any sequences linked to that PubMed record using ELink (see the sketch below), and uses the <dcterms:references> tag to model the relationship. For the ant project I only use PubMed ids for papers that include sequences that have CASENT specimens:

SELECT DISTINCT dbxref.id FROM
specimen INNER JOIN source USING (source_id)
INNER JOIN sequence_dbxref ON source.seq_id = sequence_dbxref.sequence_id
INNER JOIN dbxref USING (dbxref_id)
WHERE (code LIKE "CASENT%") AND (dbxref.namespace = "PUBMED")


Turns out there are only three such papers:

16601190

16214741

15336679
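For completeness, here is a minimal sketch of the ELink step mentioned above; it simply prints the GIs of the sequences linked to a given PubMed record:

#!/usr/bin/perl
# Minimal sketch: list nucleotide sequences linked to a PubMed record
use strict;
use warnings;
use LWP::Simple;

my $pmid = shift || '16214741';

my $url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi'
    . "?dbfrom=pubmed&db=nucleotide&id=$pmid";

my $xml = get($url) or die "ELink failed for PubMed id $pmid\n";

# Each linked sequence appears as an <Id> inside a <Link> element
while ( $xml =~ m|<Link>\s*<Id>(\d+)</Id>|g ) {
    print "$1\n";    # GI of a linked sequence
}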


FORMIS
We could add bibliographic data from FORMIS, which can be searched online here, and downloaded as EndNote files. This would be "fun" to convert to RDF.

PubMed Central
This search finds all papers on Formicidae in PubMed Central, which we could use as an easy source of XML data, in some cases with full text and citation links.

Triple store
The beauty of a triple store is that we can import all these RDF documents into a single store and query them. It doesn't matter that we have information about images in one file, information about specimens in another, and information about names in yet another file. If we use URIs consistently, it all comes together. This is data integration made easy.

Query

This RDQL query finds all images for Melissotarsus insularis


SELECT ?image WHERE
(?taxon, <dc:subject>, "Melissotarsus insularis")
(?specimen, <dc:subject>, ?taxon)
(?image, <dc:subject>, ?specimen)
(?image, <dc:type>,"image")
USING dc FOR <http://purl.org/dc/elements/1.1/>


Which returns two images.

OK, now for something a little more fun. The Smith et al. barcoding paper that surveyed ants in Madagascar has PubMed id 16214741 (this paper also has the identifier doi:10.1098/rstb.2005.1714). Given this id (recast as an LSID, urn:lsid:ncbi.nlm.nih.gov.lsid.zoology.gla.ac.uk:pubmed:16214741), we can find the geographic localities the authors sampled from using this query:


SELECT ?lat, ?long WHERE
(?nuc, <dcterms:isReferencedBy>, <urn:lsid:ncbi.nlm.nih.gov.lsid.zoology.gla.ac.uk:pubmed:16214741>)
(?nuc, <dc:source>, ?specimen)
(?specimen, <geo:lat>, ?lat)
(?specimen, <geo:long>, ?long)
USING dc FOR <http://purl.org/dc/elements/1.1/>
dcterms FOR <http://purl.org/dc/terms/>
geo FOR <http://www.w3.org/2003/01/geo/wgs84_pos#>


which gives four localities:


?lat ?long
"-13.263333" "49.603333"
"-13.464444" "48.551666"
"-13.211666" "49.556667"
"-14.4366665" "49.775"


We can also search our triple store using other identifiers, such as DOIs:


SELECT ?lat, ?long WHERE
(?pubmed, <dc:identifier>, <doi:10.1098/rstb.2005.1714>)
(?nuc, <dcterms:isReferencedBy>, ?pubmed)
(?nuc, <dc:source>, ?specimen)
(?specimen, <geo:lat>, ?lat)
(?specimen, <geo:long>, ?long)
USING dc FOR <http://purl.org/dc/elements/1.1/>
dcterms FOR <http://purl.org/dc/terms/>
geo FOR <http://www.w3.org/2003/01/geo/wgs84_pos#>


is the same query as above, but uses the DOI for the barcoding paper.

New inferences

One thing I noticed early on is that there are specimens that have been barcoded and are labelled in GenBank as unidentified (i.e., they have names like "Melissotarsus sp. BLF m1"), but the same specimen has a proper name in AntWeb (e.g., casent0107665-d01 is Melissotarsus insularis). Assuming the identification is correct (a big if), we can use this information to add value to GenBank. For example, a search of GenBank for sequences of Melissotarsus insularis finds nothing, but GenBank does have sequences for this taxon, albeit under the name "Melissotarsus sp. BLF m1".

This query searches the triple store for specimens that are named differently in AntWeb and GenBank. Often neither is a proper name; they are just different ways of saying "we don't know what this is". But in some cases the specimen does have a proper name attached to it:


SELECT ?specimen, ?ident, ?name WHERE
(?specimen, <dc:type>, "specimen")
(?specimen, <dc:subject>, ?ident)
(?nuc, <dc:source>, ?specimen)
(?nuc, <dc:subject>, ?taxid)
(?taxid, <dc:type>, "Scientific Name")
(?taxid, <dc:title>, ?name)
AND ?ident ne ?name
USING dc FOR <http://purl.org/dc/elements/1.1/>



Currently playing in iTunes: One by Mary J Blige & Bono
Currently playing in iTunes: Crazy (Single Version) by Gnarls Barkley

Thursday, May 04, 2006

Nascent: Open Text Mining Interface

From Nature's blog on web technology and science comes this post on Open Text Mining Interface (OTMI):


Every now and then a scientist contacts Nature asking for a machine-readable copy of our content (i.e., the XML) to use in text-mining research. We're usually happy to oblige, but there has to be a better way for everyone concerned, not least the poor researcher, who might have to contact any number of publishers and deal with many different content formats to conduct their work. Much better, surely, to have a common format in which all publishers can issue their content for text-mining and indexing purposes.


and further


The example of RSS shows how powerful a relatively simple common standard can be when it comes to aggregating content from multiple sources (even when it's messed up as badly as RSS ;). So maybe an approach like OTMI (or a better one dreamt up by someone else) can help those who want to index and text-mine scientific and other content. Like RSS, I think publishers might also come to see this as a kind of advert for their content because it should help interested readers to discover it. And on the basis that a something is always better than nothing, it also doesn't force publishers to give away the human-readable form of their content — they can limit themselves to snippets or even just word vectors if they want to.


Currently playing in iTunes: By the Time I Get to Phoenix by Glen Campbell

Tuesday, May 02, 2006