Friday, August 26, 2016

Displaying original species descriptions in BioNames

The goal of my BioNames project is to link every taxonomic name to its original description (initially focussing on animal names). The rationale is that taxonomy is based on evidence, and yet most of this evidence is buried in non-digitised or hard-to-find literature. Surfacing this information not only makes taxonomic evidence accessible (see Surfacing the deep data of taxonomy), it also surfaces a lot of basic biological information. In many cases the original taxonomic description will be an important source of information about what a species looks like, where it lives, and what it does.

To date I've focussed on linking names to publications, such as articles, on the grounds that this is the unit of citation in science. It's also the unit most often digitised and assigned an identifier, such as a DOI. But often taxonomists cite not an article but the individual page on which the description appears. In web-speak, taxonomists cite "fragment identifiers". Page-level identifiers are not often encountered in the digital world, in part because many digital representations don't have "pages". But this doesn't mean that we can't have identifiers for parts of an article. For example, in Fragment Identifiers and DOIs Martin Fenner gives examples of ways to link to specific parts of an online article. His examples work if the article is displayed as HTML. If we are working with XML (say, for a journal published by Pensoft), then we can use XPath to refer to sections of a document. Ultimately it would be nice to have stable identifiers for document fragments linked to taxonomic names, so that we can readily go from name to description (even better if that description were in machine-readable form). You could think of these as locators for "taxonomic treatments", e.g. Miller et al. 2015.
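To make the XPath idea concrete, here is a minimal sketch using Python's standard library. The XML, element names and ids are invented for illustration and are not Pensoft's actual schema; the point is only that an XPath expression can act as a locator for one part of a document.

```python
import xml.etree.ElementTree as ET

# Hypothetical article XML: element names and ids are illustrative only.
ARTICLE_XML = """
<article>
  <body>
    <sec id="intro"><title>Introduction</title><p>Background.</p></sec>
    <sec id="treatment-1"><title>Aus bus sp. n.</title><p>Description of the new species.</p></sec>
  </body>
</article>
"""


def resolve_fragment(xml_text, xpath):
    """Return the text of the first element matched by an XPath locator, or None."""
    root = ET.fromstring(xml_text)
    element = root.find(xpath)
    if element is None:
        return None
    # Collapse whitespace so the fragment reads as a single line of text.
    return " ".join("".join(element.itertext()).split())


# The XPath expression plays the role of a fragment identifier.
locator = ".//sec[@id='treatment-1']/p"
print(resolve_fragment(ARTICLE_XML, locator))
```

A locator like this is only as stable as the ids in the XML, which is one reason stable identifiers for fragments would be preferable.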

As a quick and dirty approach to this I've reworked BioNames to be able to show the page where a species name is first published. This only works if a number of conditions are met:

  • The BioNames database has the page number ("micro reference") for the name.
  • BioNames has the full text for the article, either from BioStor or a PDF.
  • The taxonomic name has been found in that text (e.g., by the Global Names GNRD service).
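The three conditions above can be sketched as a simple check. This is not BioNames code: the `NameRecord` class and its field names are hypothetical, chosen only to mirror the list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NameRecord:
    """Hypothetical record for a taxonomic name; field names are assumptions."""
    name: str
    micro_reference: Optional[str] = None  # page number cited for the name
    has_full_text: bool = False            # full text available (e.g. BioStor or PDF)
    pages_with_name: List[str] = field(default_factory=list)  # pages where a name finder located the name


def can_display_description_page(record):
    """True only if all three conditions for showing the page are met."""
    return (
        record.micro_reference is not None
        and record.has_full_text
        and len(record.pages_with_name) > 0
    )


complete = NameRecord("Aus bus", micro_reference="7", has_full_text=True, pages_with_name=["7"])
no_text = NameRecord("Cus dus", micro_reference="12")
print(can_display_description_page(complete), can_display_description_page(no_text))
```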

If these conditions are met, then BioNames will display the page, like this example (Belobranchus segura Keith, Hadiaty & Lord 2012).

Both the page image and OCR text (if available) are displayed. This is a first step towards (a) making stable identifiers available for these pages, and (b) making the text accessible for machine reading.

For some more examples, try Heterophasia melanoleuca kingi Eames 2002 (bird), Echinoparyphium anatis Fischthal & Kuntz 1976 (trematode), Bathymodiolus brooksi Gustafson, Turner, Lutz & Vrijenhoek 1998 (bivalve), Amolops cremnobatus Inger & Kottelat 1998 (frog), Leptothorax caesari Espadaler 1997 (ant), and Daipotamon minos Ng & Trontelj 1996 (crab).

Thursday, August 18, 2016

GBIF Challenge: €31,000 in prizes for analysing and addressing gaps and biases in primary biodiversity data

In a classic paper Boggs (1949) appealed for an “atlas of ignorance”, an honest assessment of what we know we don’t know:

Boggs, S. W. (1949). An Atlas of Ignorance: A Needed Stimulus to Honest Thinking and Hard Work. Proceedings of the American Philosophical Society, 93(3), 253–258. Retrieved from

This is the theme of this year's GBIF Challenge: Analysing and addressing gaps and biases in primary biodiversity data. "Gaps" can be gaps in geographic coverage, taxonomic group, or types of data. GBIF is looking for ways to assess the nature of the gaps in the data it is aggregating from its network of contributors.

How to enter

Details on how to enter are on the Challenge website; the deadline is September 30th.


One approach to gap analysis is to compare what we expect to see with what we actually have. For example, we might take a “well-known” group of organisms and use that to benchmark GBIF’s data coverage. A drawback is that the “well-known” organisms tend to be the usual suspects (birds, mammals, fish, etc.), and there is the issue of whether the chosen group is a useful proxy for other taxa. Another approach is to base the estimate of ignorance on the data itself. For example, OBIS has computed Hurlbert's index of biodiversity for its database. Can we scale these methods to the 600+ million records in GBIF? There are some clever methods for using resampling methods (such as the bootstrap) on large data sets that might be relevant, see
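For readers unfamiliar with Hurlbert's index, it is the expected number of species in a random sample of n records drawn without replacement: ES(n) = Σ_i [1 − C(N − N_i, n) / C(N, n)], where N_i is the number of records for species i and N is the total. The sketch below is a direct, unoptimised reading of that formula; the counts are made up, and this is not how OBIS computes it at scale.

```python
from math import comb


def hurlbert_es(counts, n):
    """Expected number of species in a random sample of n records (Hurlbert's ES(n)).

    counts: number of records for each species in one area (e.g. one grid cell).
    """
    total = sum(counts)
    if n < 1 or n > total:
        raise ValueError("n must be between 1 and the total number of records")
    denominator = comb(total, n)
    # For each species, 1 minus the probability that a sample of n records misses it.
    # math.comb returns 0 when n exceeds (total - count), so common species contribute 1.
    return sum(1 - comb(total - count, n) / denominator for count in counts)


# Made-up record counts for five species in one cell.
cell_counts = [50, 30, 10, 5, 5]
print(round(hurlbert_es(cell_counts, 20), 3))
```

Because the index standardises on a fixed sample size, cells with very different numbers of records can be compared, which is what makes it attractive for mapping sampling gaps.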

Another approach might be to compare different datasets for the same taxa, particularly if one data set is not in GBIF. Or perhaps we can compare datasets for the same taxa collected by different methods.

Or we could look at taxonomic gaps. In an earlier post The Zika virus, GBIF, and the missing mosquitoes I noted that GBIF's coverage of vectors of the Zika virus was very poor. How well does GBIF cover vectors and other organisms relevant to human health? Maybe we could generalise this to explore other taxa. It might, for example, be interesting to compare degree of coverage for a species with some measure of the "importance" of that species. Measures of importance could be based on, say, number of hits in Google Scholar for that species, size of Wikipedia page (see Wikipedia mammals and the power law), etc.

Gaps might also be gaps in data completeness, quality, or type.


This post has barely scratched the surface of what is possible. But I think one important thing to bear in mind is that the best analyses of gaps are those that lead to "actionable insights". In other words, if you are going to enter the challenge (and please do, it's free to enter and there's money to be won), how does your entry help GBIF and the wider biodiversity community decide what to do about gaps?

BioStor updates: nicer map, reference matching service

BioStor now has 150,000 articles. When I wrote a paper describing how BioStor worked it had 26,784 articles, so things have progressed somewhat!

I continue to tweak the interface to BioStor, trying different ways to explore the articles.

Spatial search

I've tweaked spatial search in BioStor. As blogged about previously I replaced the Google Maps interface with Leaflet.js, enabling you to draw a search area on the map and see a set of articles that mention that area. I've changed the base map to the prettier "terrain" map from Stamen, and added back the layer showing all the localities in BioStor. This gives you a much better sense of the geographic coverage in BioStor. This search interface still needs work, but is a fun way to discover content.


Reference matching

In the "Labs" section of the web site I've added a demonstration of BioStor's reconciliation service. This service is based on the Freebase reconciliation service used by tools such as OpenRefine, see Reconciliation Service API. The goal is to demonstrate a simple way to locate references in BioStor: paste references, one per line, click Match, and BioStor will attempt to find those references for you.
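As a rough sketch of what a client of such a service does, the code below turns pasted references into a batch of queries in the shape used by the Reconciliation Service API (a JSON object keyed `q0`, `q1`, ... sent as a `queries` parameter) and picks the best-scoring candidate from a response. It makes no network call; the sample response is invented, and nothing here is specific to BioStor's implementation.

```python
import json
from urllib.parse import urlencode


def build_reconciliation_queries(references_text, limit=3):
    """Turn pasted references (one per line) into a batch of reconciliation queries."""
    lines = [line.strip() for line in references_text.splitlines() if line.strip()]
    return {f"q{i}": {"query": ref, "limit": limit} for i, ref in enumerate(lines)}


def encode_request_body(queries):
    """Form-encode the batch as the 'queries' parameter expected by the service."""
    return urlencode({"queries": json.dumps(queries)})


def best_matches(response):
    """Map each query key to the id of its highest-scoring candidate, or None."""
    best = {}
    for key, value in response.items():
        results = value.get("result", [])
        best[key] = max(results, key=lambda r: r["score"])["id"] if results else None
    return best


pasted = """Boggs, S. W. (1949). An Atlas of Ignorance. Proc. Am. Philos. Soc. 93(3): 253-258.

A second reference pasted on its own line."""
queries = build_reconciliation_queries(pasted)
print(list(queries))

# Invented response in the general shape a reconciliation service returns.
sample_response = {
    "q0": {"result": [{"id": "111", "name": "An Atlas of Ignorance", "score": 92, "match": True}]},
    "q1": {"result": []},
}
print(best_matches(sample_response))
```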

This service is really intended to be used by tools like OpenRefine, but this web page helps me debug the service.


BioStor is part labour of love, part albatross around my neck. I'm always open to suggestions for improvements, or for articles to add (but remember that all content must first have been scanned and in the Biodiversity Heritage Library). If you are involved in publishing a journal and are interested in getting it into BHL, please get in touch.

Wednesday, August 17, 2016

Containers, microservices, and data

Some notes on containers, microservices, and data. The idea of packaging software into portable containers and running them either locally or in the cloud is very attractive (see Docker). Here are some use cases I'm interested in exploring.


In Towards a biodiversity knowledge graph (doi:10.3897/rio.2.e8767) I listed a number of services that are essentially self-contained, such as name parsers, reconciliation tools, resolvers, etc. Each of these could be packaged up and made into containers.


We can use containers to package database servers, such as CouchDB, ElasticSearch, and triple stores. Using containers means we don't need to go through the hassle of installing the software locally. Interested in RDF? Spin up a triple store, play with it, then switch it off if you decide it's not for you. If it proves useful, you can move it to the cloud and scale up (e.g.,


A final use case is to put individual datasets in a container. For example, imagine that we have a large Darwin Core Archive. We can distribute this as a simple zip file, but you can't do much with this unless you have code to parse Darwin Core. But imagine we combine that dataset with a simple visualisation tool, such as VESpeR (see doi:10.1016/j.ecoinf.2014.08.004). Users interested in the data could then play with the data without the overhead of installing specialist software. In a sense, the data becomes an app.

Friday, August 12, 2016

Spatial search in BioStor

I've been experimenting with simple spatial search in BioStor, as shown in the demo below. If you go to the map on BioStor you can use the tools on the left to draw a box or a polygon on the map, and BioStor will search its database for articles that mention localities that occur in that region. If you click on a marker you can see the title of the article; clicking on that title takes you to the article itself.

This is all rather crude (and a little slow), but it provides another way to dive into the content I've been extracting from the BHL. One thing I've been doing is looking at protected areas (often marked on Open Street Map), drawing a simple polygon around that area, then seeing if BioStor knows anything about that area.

For the technically minded, this tool is an extension of the Searching GBIF by drawing on a map demo, and uses the geospatial indexing offered by Cloudant.
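BioStor hands the geometry to Cloudant's geospatial index, so none of the following is what actually runs. But as a plain illustration of what "articles that mention localities in this polygon" means, here is a small ray-casting sketch; the locality records and article ids are invented, and the planar test ignores complications such as polygons crossing the antimeridian.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: is (lon, lat) inside a polygon given as a list of (lon, lat) vertices?"""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def articles_in_region(localities, polygon):
    """Return sorted ids of articles with at least one locality inside the polygon."""
    return sorted({
        loc["article"]
        for loc in localities
        if point_in_polygon((loc["lon"], loc["lat"]), polygon)
    })


# Invented locality records: each links an article id to a point it mentions.
localities = [
    {"article": "a1", "lon": 106.8, "lat": -6.7},
    {"article": "a2", "lon": 115.0, "lat": -8.5},
    {"article": "a1", "lon": 107.1, "lat": -6.9},
]
# A box drawn on the map, as (lon, lat) corners.
search_area = [(106.0, -7.5), (108.0, -7.5), (108.0, -6.0), (106.0, -6.0)]
print(articles_in_region(localities, search_area))
```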

Wednesday, August 10, 2016

On asking for access to data

In between complaining about the lack of open data in biodiversity (especially taxonomy), and scraping data from various web sites to build stuff I'm interested in, I occasionally end up having interesting conversations with the people whose data I've been scraping, cleaning, cross-linking, and otherwise messing with.

Yesterday I had one of those conversations at Kew Gardens. Kew is a large institution that is adjusting to a reduced budget, a changing digital landscape, and a rethinking of its science priorities. Much of Kew's data has not been easily accessible to the outside world, but this is changing. Part of the reason for this is that Defra, which part-funds Kew, is itself opening up (see Ellen Broad's fascinating post Lasers, hedgehogs and the rise of the Age of Yoghurt: reflections on #OpenDefra).

During this conversation I was asked "Why didn't you just ask for the data instead of scraping it? We would most likely have given it to you." My response to this was "well, you might have said no". In my experience saying "no" is easy because it is almost always the less risky approach. And I want a world where we don't have to ask for data, in the same way that we don't ask to get source code for open source software, and we don't ask to download genomic data from GenBank. We just do it and, hopefully, do cool things with it. Just as importantly, if things don't work out and we fail to make cool things, we haven't wasted time negotiating access for something that ultimately didn't work out. The time I lose is simply the time I've spent playing with the data, not any time negotiating access. The more obstacles you put in front of people playing with your data, the fewer innovative uses of that data you're likely to get.

But it was pointed out to me that a consequence of just going ahead and getting the data anyway is that it doesn't necessarily help people within an organisation make the case for being open. The more requests for access to data that are made, the easier it might be to say "people want this data, let's work to make it open". Put another way, by getting the data I want regardless, I sidestep the challenge of convincing people to open up their data. It solves my problem (I want the data now) but doesn't solve it for the wider community (enabling everyone to have access).

I think this is a fair point, but I'm going to try and wiggle away from it. From a purely selfish perspective, my time is limited, there are only so many things I can do, and making the political case for opening up specific data sets is not something I really want to be doing. In a sense, I'm more interested in what happens when the data is open. In other words, let's assume the battle for open has been won: what do we do then? So, I'm essentially operating as if the data is already open because I'm betting that it will be at some point in time.

Without wishing to be too self-serving, I think there are ways that treating closed data as effectively open can help make the case that the data should (genuinely) be open. For example, one argument for being open is that people will come along and do cool things with the data. In my case, "cool" means cross linking taxonomic names with the primary literature, eventually to original descriptions and fundamental data about the organisms tagged with the taxonomic names (you may feel that this stretches the definition of "cool" somewhat). But adding value to data is hard work, and takes time (in some cases I've invested years in cleaning and linking the data). The benefits from being open may take time, especially if the data is messy, or relatively niche so that few people are prepared to invest the time necessary to do the work.

Some data, such as the examples given in Lasers, hedgehogs and the rise of the Age of Yoghurt: reflections on #OpenDefra, will likely be snapped up and give rise to nice visualisations, but a lot of data won't. So, imagine that you're making the case for data to be open, and one of your arguments is "people will do cool things with it". Eventually you win that argument, the data is opened up... and nothing happens. Wouldn't it be better if once the data is open, those of us who have been beavering away with "illicit" copies of the data can come out of the woodwork and say "by the way, here are some cool things we've been doing with that data"? OK, this is a fairly self-serving argument, but my point is that while internal arguments about being open are going on I have three choices:

  1. Wait until you open the data (which stops me doing the work I want to do)
  2. Help make the case for being open (which means I engage in politics, an area in which I have zero aptitude)
  3. Assume you will be open eventually, and do the work I want to do so that when you're open I can share that work with you, and everyone else
Call me selfish, but I choose option 3.

Thursday, June 30, 2016

Aggregating annotations on the scientific literature: a followup on the ReCon16 hackday

In the previous post I sketched out a workflow to annotate articles and aggregate those annotations. I threw this together for the hack day at ReCon 16 in Edinburgh, and the hack day gave me a chance to (a) put together a crude visualisation of the aggregated annotations, and (b) recruit CrossRef's Rachael Lammey (@rachaellammey) into doing some annotations as well so I could test how easy it was to follow my garbled instructions and contribute to the project.

We annotated the paper A new species of shrew (Soricomorpha: Crocidura) from West Java, Indonesia (doi:10.1644/13-MAMM-A-215). If you have the extension installed you will see our annotations on that page; if not, you can see them using the proxy:

Rachael and I both used the IFTTT tool to send our annotations to a central store. I then created a very crude summary page for those annotations: When this page loads it queries the central store for annotations on the paper with DOI 10.1644/13-MAMM-A-215, then creates some simple summaries.

For example, here is a list of the annotations. The list is "typed" by tags, that is, you can tell the central store what kind of annotation is being made using the "tag" feature. In this example, we've picked out taxonomic names, citations, geographical coordinates, specimen codes, grants, etc.


Given that we have latitude and longitude pairs, we can generate a map.

The names of taxa can be enhanced by adding pictures, so we have a sense of what organisms the paper is about:


The metadata on the web page for this article is quite rich, and does a nice job of extracting it, so that we have a list of DOIs for many of the articles this paper cites. I've chosen to add annotations for articles that lack DOIs but which may be online elsewhere (e.g., BioStor).


What's next

This demo shows that it's quite straightforward to annotate an article and pull those annotations together to create a central database that can generate new insights about a paper. For example, we can generate a map even if the original paper doesn't provide one. Conversely, we could use the annotations to link entities such as museum specimens to the literature that discusses those specimens. Given a specimen code in a paper we could look up that code in GBIF (using GBIF's API, or a tool like "Material Examined", see Linking specimen codes to GBIF). Hence we could go from code in paper to GBIF, or potentially from GBIF to the paper that cites the specimen. Having a central annotation store potentially becomes a way to build a knowledge graph linking different entities that we care about.
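To illustrate the "code in paper to GBIF" step, the sketch below builds a query URL for GBIF's occurrence search, using the `catalogNumber` and `institutionCode` filters as I understand that API. It does not send the request, and the specimen code shown is a made-up placeholder. Splitting a raw specimen code from a paper into institution and catalogue number is the hard part, and is simply assumed to have been done already.

```python
from urllib.parse import urlencode

GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def build_gbif_specimen_query(catalog_number, institution_code=None):
    """Build an occurrence search URL for a specimen code (no request is made)."""
    params = {"catalogNumber": catalog_number}
    if institution_code:
        params["institutionCode"] = institution_code
    return GBIF_OCCURRENCE_SEARCH + "?" + urlencode(params)


# Placeholder specimen code, already split into institution and catalogue number.
print(build_gbif_specimen_query("12345", "MZB"))
```

An annotation tagged as a specimen code could carry the resulting URL, or the occurrence record it resolves to, giving the link from paper to GBIF.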

Of course, a couple of people manually annotating a few papers isn't scalable, but because has an API we can scale this approach (for another experiment see revisited: annotating articles in BioStor). For example, we have automated tools to locate taxonomic names in text. Imagine that we use those tools to create annotations across the accessible biodiversity literature. We can then aggregate those into a central store and we have an index to the literature based on taxonomic name, but we can also click on any annotation and see that name in context as an annotation on a page. We could manually augment those annotations, if needed, for example by correcting OCR errors.

I think there's scope here for unifying the goals of indexing, annotation, and knowledge graph building with a fairly small set of tools.

Thursday, June 23, 2016

Aggregating annotations on the scientific literature: a hack for ReCon 16

I will be at ReCon 16 in Edinburgh (hashtag #ReCon_16), the second ReCon event I've attended (see Thoughts on ReCon 15: DOIs, GitHub, ORCID, altmetric, and transitive credit). For the hack day that follows I've put together some instructions for a way to glue together annotations made by multiple people. It works by using IFTTT to read a user's annotation stream (i.e., the annotations they've made) and then post those to a CouchDB database hosted by Cloudant.

Why, you might ask? Well, I'm interested in using to make machine-readable annotations on papers. For example, we could select a pair of geographic co-ordinates (latitude and longitude) in a paper, tag it "geo", then have a tool that takes that annotation, converts it to a pair of decimal numbers and renders it on a map.


Or we could be reading a paper whose reference list lacks links to the cited literature (i.e., there are no DOIs). We could add those by selecting the reference, pasting in the DOI as the annotation, and tagging it "cites". If we aggregate all those annotations then we could write a query that lists all the DOIs of the cited literature (i.e., it builds a small part of the citation graph).
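The "cites" query could look something like the sketch below. The annotation records are invented and use a simplified shape (`uri`, `tags`, `text`), which is an assumption rather than the actual layout of the central store, and the filtering is done in plain Python rather than as a database view.

```python
import re

# Loose DOI pattern, good enough for illustration.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def cited_dois(annotations, paper_uri):
    """List the DOIs pasted into annotations tagged 'cites' on one paper."""
    dois = []
    for annotation in annotations:
        if annotation["uri"] != paper_uri or "cites" not in annotation["tags"]:
            continue
        match = DOI_PATTERN.search(annotation["text"])
        if match:
            doi = match.group(0).rstrip(".,;)")
            if doi not in dois:
                dois.append(doi)
    return dois


# Invented annotations from two users on the same (placeholder) paper URI.
PAPER = "https://example.org/paper"
annotations = [
    {"uri": PAPER, "tags": ["cites"], "text": "doi:10.1234/abc.1", "user": "user1"},
    {"uri": PAPER, "tags": ["geo"], "text": "", "user": "user1"},
    {"uri": PAPER, "tags": ["cites"], "text": "10.5678/xyz-2.", "user": "user2"},
    {"uri": "https://example.org/other", "tags": ["cites"], "text": "10.9999/zzz", "user": "user2"},
]
print(cited_dois(annotations, PAPER))
```

Each (paper, cited DOI) pair is one edge of the citation graph, so running this across papers accumulates the graph piece by piece.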

By aggregating across multiple users we effectively crowd source the annotation problem, but in a way that we can still collect those annotations. For this hack I'm going to automate this collection by enabling each user to create an IFTTT recipe that feeds their annotations into the database (they can switch this feature off at any time by switching off the recipe).

Manual annotation is not scalable, but it does enable us to explore different ways to annotate the literature, and what sort of things people may be interested in. For example, we could flag scientific names, grant numbers, localities, specimens, concepts, people, etc. We could explore what degree of post-processing would be needed to make the annotations computable (e.g., converting 8°07′45.73″S, 63°42′09.64″W into decimal latitude and longitude).
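As an example of the post-processing involved, here is a minimal sketch of that conversion. It assumes the selected text uses the degree, prime and double-prime characters exactly as in the example, with latitude first; coordinates in real papers come in many more formats than this pattern covers.

```python
import re

# Matches e.g. 8°07′45.73″S: degrees, minutes, seconds, hemisphere letter.
DMS_PATTERN = re.compile(r"(\d+)°(\d+)′(\d+(?:\.\d+)?)″([NSEW])")


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to decimal degrees (south and west are negative)."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in "SW" else value


def parse_coordinate_pair(text):
    """Parse 'lat, lon' in DMS notation into a (latitude, longitude) tuple of floats."""
    matches = DMS_PATTERN.findall(text)
    if len(matches) != 2:
        raise ValueError("expected exactly two DMS coordinates")
    (lat_d, lat_m, lat_s, lat_h), (lon_d, lon_m, lon_s, lon_h) = matches
    latitude = dms_to_decimal(int(lat_d), int(lat_m), float(lat_s), lat_h)
    longitude = dms_to_decimal(int(lon_d), int(lon_m), float(lon_s), lon_h)
    return latitude, longitude


lat, lon = parse_coordinate_pair("8°07′45.73″S, 63°42′09.64″W")
print(round(lat, 5), round(lon, 5))
```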

If this project works I hope to learn something about what people want to extract from the literature, and to what extent having a database of annotations can provide useful information. This will also help inform my thinking about automated annotation, which I've explored in revisited: annotating articles in BioStor.