Friday, March 15, 2013

BioNames ideas - automatically finding synonyms from the literature

One of the biggest pains (and self-inflicted wounds) in taxonomy is synonymy, the existence of multiple names for the same taxon. A common cause of synonymy is moving species to different genera in order to have their name reflect their classification. The consequence of this is any attempt to search the literature for basic biological data runs into the problem that observations published at different times by different researchers (e.g., taxonomists, ecologists, parasitologists) may use different names for the same taxon.

Existing taxonomic databases often have lists of synonyms, but these are incomplete, and typically don't provide any evidence why two names are synonyms.

Reading literature extracted form the Biodiversity Heritage Library I'm struck by how often I come across papers such as taxonomic revisions, museum catalogues, and checklists, that list two names as synonyms. Wouldn't it be great if we could mine these to automatically build lists of synonyms?

One quick and dirty way to do this is look for sets of names that have the same species name but different generic names, e.g.

  • Atlantoxerus getulus
  • Sciurus getulus
  • Xerus getulus

If such names appear on the same page (i.e., in close proximity) there's a reasonable chance they are synonyms. So, one of the features I'm building in BioNames is an index of names like this. Hence, if we are displaying a page for the name Atlantoxerus getulus that page could also display Sciurus getulus and Xerus getulus as possible synonyms.

There's a lot more that could be done with this sort of approach. For example, this approach only works if the the species name remains unchanged. To improve it we'd need to do things like handle changes to the ending of a species name to agree with the gender of the genus, and cases where the taxa are demoted to subspecies (or promoted to species).

If we were even clever we'd attempt to parse synonymy lists to extract even more synonyms (for an example see Huber and Klump (PDF available here):

Huber, R., & Klump, J. (2009). Charting taxonomic knowledge through ontologies and ranking algorithms. Computers & Geosciences, 35(4), 862–868. doi:10.1016/j.cageo.2008.02.016

Then there's the broader topic of looking at co-occurrence of taxonomic names in general. As I noted a while ago there are examples of pages in BHL that lists taxonomically unrelated taxa that are ecologically closely associated (e.g., hosts and parasites). Hence we could imagine automatically building host-parasite databases by mining the literature. Initially we could simply display lists of names that co-occur frequently. Ideally we'd filter out "accidental" co-occurrences, such as indexes or tables of contents, but there seems to be a lot of potential in automating the extraction of basic information from the taxonomic literature.