Sunday, February 12, 2006

Rob McCool on Rethinking the Semantic Web

Having read Rob McCool's articles on Rethinking the Semantic Web (brought to my attention by Bob McMorris' comment on my earlier post on globally unique identifiers), I think he makes very interesting points, but they are not all relevant to whether biodiversity informatics adopts RDF.

In terms of whether the dream of the Semantic Web will happen, I suspect he is right - technologies such as tags and microformats will be a lot easier to adopt, and will make more effective use of existing tools. I'm not writing the Semantic Web off, but McCool's point about keeping things very simple is, I think, on the money.

Much of the work on RDF and the Semantic Web has been done in academia, and most examples concern things such as relationships between people and projects (typically computer science projects in, you guessed it, the Semantic Web). Within a small academic community there is often a small problem scope, consistent vocabulary (or at least, it is tractable to develop either a vocabulary or a mapping between vocabularies), obvious identifiers, experience with ontologies, and a limited set of problems. My sense is that biodiversity informatics fits this model. If the goal is to integrate databases of taxonomic names, specimens, images, character data, DNA sequences, and publications, and make inferences based on this aggregation of information, then I feel the use of Semantic Web techniques will be quite tractable, indeed productive.
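To make the kind of integration I have in mind a little more concrete, here is a minimal sketch in Python of records from several databases expressed as subject-predicate-object triples, plus a query that follows the links between them. Everything in it is an assumption for illustration: the identifiers, the predicate names, and the taxon name are invented, and a real system would use an RDF library and agreed vocabularies rather than a set of tuples.

```python
# Toy triple store. All identifiers, predicates, and the taxon name are
# hypothetical; none come from a real database or vocabulary.
triples = {
    ("name:1", "hasNameString", "Aus bus"),
    ("specimen:42", "identifiedAs", "name:1"),
    ("sequence:AB000001", "sequencedFrom", "specimen:42"),
    ("image:7", "depicts", "specimen:42"),
    ("pub:99", "publishes", "name:1"),
}


def subjects(predicate, obj):
    """Return, sorted, every subject linked to obj by predicate."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)


def records_for_name(name_string):
    """Gather everything reachable from a taxon name by following links."""
    result = {"specimens": [], "sequences": [], "images": [], "publications": []}
    for name_id in subjects("hasNameString", name_string):
        result["publications"] += subjects("publishes", name_id)
        for specimen in subjects("identifiedAs", name_id):
            result["specimens"].append(specimen)
            # Sequences and images hang off the specimen, not the name,
            # so we reach them only by traversing the specimen record.
            result["sequences"] += subjects("sequencedFrom", specimen)
            result["images"] += subjects("depicts", specimen)
    return result


print(records_for_name("Aus bus"))
```

The point of the sketch is only that once the identifiers are shared, a question such as "which sequences exist for this name?" becomes a matter of following links across what were separate databases.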

In the same way, much of the scepticism about whether ontologies are actually useful in the real world (see Clay Shirky's brilliant Ontology is Overrated: Categories, Links, and Tags, or listen to an MP3) is probably well founded. Again, I think the issue is one of scope. Biologists are used to ontologies. After all, what is taxonomic classification but a large ontology with well developed rules for its construction and maintenance?

That said, there are areas in our field where insistence on RDF, controlled vocabularies, and ontologies will probably be counterproductive. Ontologies for morphological characters will, I suspect, prove hard to sell. Even though we have a history of shared terminology (think of papers establishing consistent numbering schemes for setae on insect heads), these shared vocabularies tend to have limited applicability unless they are very general (matching setae on the head of a fly and a louse is tricky), and if they are general (e.g., "legs") they are very low level. There is also the thorny issue that many aspects of morphology are not homologous in evolutionary terms (in what sense are the wings of a fly and a bird both "wings"?). Leaving aside the conceptual issues, this is one area where I think people will balk if it becomes a pain to use ontologies. It's hard enough getting people to use scientific names (never mind remembering that species names such as Homo sapiens should be written in italics). I suspect this is one area (along with scientific literature) where tagging will be a compelling alternative. For an example of the power of tagging literature see Connotea.


McCool's articles are available here:

1 comment:

Anonymous said...

I am pretty sure I agree with you that there is unlikely to be support for morphological characters described in RDF.

I don't even believe, unlike Kevin Thiele, the architect of LucID, that states should get GUIDs, though maybe characters should. However, my present thinking is that constrained vocabularies and descriptions, not characters, are the right granularity of descriptive data on which to put GUIDs.

Ironically, when the TDWG SDD effort http://wiki.cs.umb.edu/twiki/bin/view/SDD/WebHome
began I argued for RDF and was beaten down on the grounds that it was immature and too complex. (I always found this an odd argument, because it entailed that we should start all over, and that's hardly starting from a mature base. As to complexity, I quickly learned that morphological characters and states are in fact complex objects, and I am not alarmed at the complexity of what we produced in XML Schema, including some tools in support of generating import/export software and supporting developers of same.)

All that said, the question of RDF for taxon and specimen descriptions rests, for me, on the large number of arguments out there that RDF hasn't caught on much for anything other than discovery, and maybe not much for that, outside the academic world. I cited McCool because he is the most qualified person I've ever seen advance the argument he does. I still suspect RDF is probably pretty good at most of what we did in SDD. However, SDD was designed as an exchange language for descriptions. Most holders of descriptive data will still need to hold that data in databases, which are particularly unlikely to be triple stores for a long long long time. At least one blogger claims that no RDF enthusiast has been able to show him a single encoding of content in RDF
http://www.zacker.org/the-battle-for-the-semantic-web-rdf-vs-xml, and I rather suspect that no scientific publications, including those with GUID metadata in RDF or any other description language, have ever been coded in such a language, at least other than as a demonstration. I'd be very interested to see what it takes to code a systematics publication in RDF. Being part of the same project on legacy ant literature that you mention Donat Agosti is working on, I know what a daunting task this is if you haven't started from a database of descriptive data.

p.s. Please complain to your blogmaster about how the platform gratuitously truncates URLs making it pretty hard to follow links. At least for me.