For a long time I wanted to experiment with using GitHub as a tool to browse and manage the GBIF backbone taxonomy. Encouraged by similar sentiments from Rod Page, it would be nice to use git to keep track of versions and allow external parties to fork parts of the taxonomic tree and push back changes if desired. To top it off there is the great GitHub Treeslider to browse the taxonomy, so why not give it a try?
A GitHub filesystem taxonomy
I decided to export each taxon in the backbone as a folder that is named according to the canonical name, containing 2 files:
- README.md, a simple markdown file that gets rendered by github and shows the basic attributes of a taxon
- data.json, a complete json representation of the taxon as it is exposed via the new GBIF species API
The filesystem represents the taxonomic classification and taxon folders are nested accordingly, for example the species Amanita arctica is represented as:
This is just a first experimental step. One can improve the readme a lot to render more content in a human friendly way and include more data in the json file such as common names and synonyms.
Getting data into GitHub
It didn't take much to write a small NubGitExporter.java class that exports the GBIF backbone into the filesystem as described above. The export of the entire taxonomy, with it's currently 4.4 million taxa incl synonyms, took about one hour on a MacBook Pro laptop.
Not bad I thought, but then I tried to add the generated files into git and that's when I started to doubt. After waiting for half a day for git to add the files to my local index I decided to kill the process and start by only adding the smaller kingdoms first, excluding animals and plants. That left about 335.000 folders and 670.000 files to be added to git. Adding these to my local git still took several hours, committing and finally pushing them onto the GitHub server took yet another 2 hours.
Delta compression using up to 8 threads.
Compressing objects: 100% (1010487/1010487), done.
Writing objects: 100% (1010494/1010494), 173.51 MiB | 461 KiB/s, done.
Total 1010494 (delta 405506), reused 0 (delta 0)
To https://github.com/mdoering/backbone.git
After those files were added to the index committing a simple change to the main README file took 15 minutes to commit. Although I like the general idea and the pretty user interface I fear GitHub, and even git itself, are not made to be a repository of millions of files and folders.
First GitHub impressions
Browsing taxa in GitHub is surprisingly responsive. The fungi genus Amanita contains 746 species, but it loads very quickly. In that regard GitHub is much nicer to use than the one on the new GBIF species pages which of course shows much more information. The rendered readme file is not ideal as it's at the very bottom of the page, but showing information to humans that way is nice - and markdown could also be parsed by machines quite easily if we adopt a simple format, for example for every property create a heading with that name and put the content into the following paragraph(s).
The Amanita example also reveals a bug in the exporter class when dealing with synonyms (the Amanita readme contains the synonym information) and also with infraspecific taxa. For example Amanita muscaria contains some weird form information which is mapped erroneously to the species. This obviously should be fixed.
The GitHub browser sorts all files alphabetically. When mixing ranks (we skip intermediate unknown ranks in the backbone), for example see the Fungus kingdom, sorting by the rank first is desirable. We could enable this by naming the taxon folders accordingly, prefixing with an alphabetically correctly ordered rank.
I have not had the time to try to version branches of the tree and see how usable that is. I suspect the git performance to be really slow, but that might not be a blocker if we only do versioning of larger groups and rarely push & pull.

0 comments:
Post a Comment