Wednesday, September 03, 2008

Hell is other people's data

Starting to get serious about the Grand Challenge. First step is to parse the XML data Elsevier made available. Sadly this covers only Molecular Phylogenetics and Evolution for 2007; I would have liked the whole journal in XML to avoid the hassle of parsing PDFs. However, XML is not without its own problems. I'm slowly getting my head around Elsevier's XML (which is, it has to be said, documented in depth). Two tools I find invaluable are the oXygen XML editor and Marc Liyanage's TestXSLT application.

As a first attempt, I'm converting Elsevier XML into JSON (a much simpler format to handle). I'm just after what I regard as the core data, namely the bibliography and the tables (rich with GenBank accession numbers, specimen codes, and geocoordinates). There are a few "gotchas", such as missing namespace declarations that have to be added, and HTML entities that need to be declared. Then there's the fact that the XML describes both the document content and its presentation. Tables can get complicated (cells can span more than one row or column), which makes tasks such as identifying a cell's contents by the heading of its column a bit harder. I hope to put an XSLT style sheet online once I'm happy that it can handle most, if not all, of the tables I've come across. Then the fun of trying to extract the information can begin.
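To give a flavour of the row and column spanning problem, here's a rough sketch of the idea in Python (I'm actually doing this in XSLT, and the element and attribute names below, table, row, cell, rowspan, colspan, are placeholders rather than Elsevier's actual tag set). The trick is to expand spanned cells into a plain grid first, then key each data cell by the heading of its column and dump the result as JSON.

# Rough sketch of flattening a table with rowspan/colspan cells into JSON.
# NOTE: the element and attribute names (table, row, cell, rowspan, colspan)
# are placeholders, not Elsevier's tag set, and namespaces are ignored here
# to keep the example short.

import json
import xml.etree.ElementTree as ET


def expand_table(table):
    """Return the table as a list of rows, repeating spanned cells so that
    every row ends up with one entry per column."""
    grid = []
    carry = {}  # column index -> (text, rows still to fill) from rowspans above
    for row in table.findall("row"):
        out, col = [], 0
        for cell in row.findall("cell"):
            # first drop in any cells carried down from rowspans above
            while col in carry:
                text, left = carry.pop(col)
                out.append(text)
                if left > 1:
                    carry[col] = (text, left - 1)
                col += 1
            text = "".join(cell.itertext()).strip()
            colspan = int(cell.get("colspan", "1"))
            rowspan = int(cell.get("rowspan", "1"))
            for i in range(colspan):
                out.append(text)
                if rowspan > 1:
                    carry[col + i] = (text, rowspan - 1)
            col += colspan
        # flush any carried cells sitting at the end of the row
        while col in carry:
            text, left = carry.pop(col)
            out.append(text)
            if left > 1:
                carry[col] = (text, left - 1)
            col += 1
        grid.append(out)
    return grid


def table_to_json(xml_string):
    """Key each data cell by the heading of its column (assumed to be the
    first row) and return the table as JSON."""
    grid = expand_table(ET.fromstring(xml_string))
    headings, body = grid[0], grid[1:]
    return json.dumps([dict(zip(headings, row)) for row in body], indent=2)


if __name__ == "__main__":
    # toy example: the species name spans two rows
    example = """
    <table>
      <row><cell>Species</cell><cell>Voucher</cell><cell>GenBank</cell></row>
      <row><cell rowspan="2">Homo sapiens</cell><cell>MVZ123</cell><cell>AY123456</cell></row>
      <row><cell>MVZ124</cell><cell>AY123457</cell></row>
    </table>
    """
    print(table_to_json(example))

Run over a toy table where the species name spans two rows, this produces two JSON records that both carry the species name, which is the sort of flat structure I want before trying to pull out GenBank accession numbers, specimen codes, and geocoordinates.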
