Approximately 82% of the GBIF occurrence records have latitude and longitude recorded, but these often contain errors: typically typos, and often one or both of lat and long reversed (i.e. given the wrong sign). Map 1, below, plots all of the verbatim (i.e. completely unprocessed) records that have a latitude and longitude and claim to be in the USA. Note the common mistakes, which result in glaring errors: reversed longitude produces the near-perfect mirror image over China; reversed latitude produces a faint image over the Pacific off the coast of Chile; reversing both produces an even fainter image off Australia; and setting 0 for lat or long produces telltale straight lines along the Prime Meridian and the equator.
Map 1: Verbatim (unprocessed) occurrence data coordinates for the USA
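To make the ghost images concrete, here is a minimal Python sketch (not part of our actual workflow) of where a single US point ends up under each of the sign errors described above. The function names and the sample point are made up for illustration.

```python
def mirror_candidates(lat, lon):
    """Return the coordinates that each sign error would produce for a point."""
    return {
        "longitude_negated": (lat, -lon),   # the mirror image over China
        "latitude_negated": (-lat, lon),    # the image in the Pacific off Chile
        "both_negated": (-lat, -lon),       # the image off Australia
    }


def has_zero_coordinate(lat, lon):
    """True when either coordinate is exactly 0 (the straight-line artefacts)."""
    return lat == 0 or lon == 0


# An illustrative point in Kansas, roughly the middle of the contiguous USA.
kansas = (38.5, -98.0)
variants = mirror_candidates(*kansas)
for name, (lat, lon) in variants.items():
    print(f"{name}: lat={lat}, lon={lon}")
```

The same idea can be run in the other direction: if a record claims to be in the USA but only one of its sign-flipped variants actually lands there, that is a strong hint about which mistake was made.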
One of the goals of the GBIF Secretariat is to help publishers improve their data, and identifying and reporting back these types of problems is one way of doing that. Of course the current GBIF data portal attempts to filter these records before displaying them. The current system verifies that given coordinates fall within the country they claim by overlaying a 1 degree grid on the world map and identifying each grid cell as belonging to one or more countries. This overlay is curated by hand, and is therefore error prone, and its maintenance is time consuming.
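As a rough illustration of how a grid overlay of this kind behaves, here is a small Python sketch. The `GRID` contents, the function names and the choice of flooring coordinates to find a cell are all assumptions made for the example; the real overlay is a much larger hand-maintained data set.

```python
import math

# Hypothetical hand-curated overlay: (floor(lat), floor(lon)) -> ISO country codes.
GRID = {
    (48, -124): {"US", "CA"},  # a cell straddling the US/Canada border
    (38, -98): {"US"},
}


def grid_cell(lat, lon):
    """Map a coordinate to the 1 degree cell that contains it."""
    return (math.floor(lat), math.floor(lon))


def cell_allows(lat, lon, country):
    """True when the overlay lists `country` for the cell containing the point."""
    return country in GRID.get(grid_cell(lat, lon), set())


# Victoria (Canada) and Port Angeles (USA) fall in the same 1 degree cell,
# so the overlay accepts either country for both of them.
print(cell_allows(48.43, -123.37, "US"))  # Victoria claimed as US
print(cell_allows(48.12, -123.43, "CA"))  # Port Angeles claimed as CA
```

Any cell that touches a border has to list every country in it, so the check can never be more precise than the grid, and any cell missed or mislabelled by hand lets bad records through or rejects good ones.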
The results of doing a lookup against the overlay are shown in Map 2, where a number of bugs in the processing are apparent: parts of the mirror image over China are still visible; none of the coastal waters that are legally US territory (i.e. the Exclusive Economic Zone of 200 nautical miles off shore) are shown; the Aleutian Islands off the coast of Alaska are not shown; and some spots around the world are allowed through, including 0,0 and a few seemingly at random.
Map 2: Results of current data portal processing for occurrences in the USA
My work, then, was to build new processing into our Hive/Hadoop processing workflow that addresses these problems and produces a map that is as close to error free as possible. The starting point is a webservice that can answer the question "In what country (including coastal waters) does this lat/long pair fall?". This is clearly a GIS problem, and in GIS-speak this is a reverse geocode, something that PostGIS is well equipped to provide. Because country definitions and borders change semi-regularly, it seemed wisest to use a trusted source of country boundaries (shapefiles) that we could replace whenever needed. Similarly we needed the boundaries of Exclusive Economic Zones to cover coastal waters. The political boundaries come from Natural Earth, and the EEZ boundaries shapefile comes from the VLIZ Maritime Boundaries Geodatabase.
While this is not an especially difficult query to formulate, a word to the wise: if you're doing this kind of reverse geocode lookup, remember to scope the distance query with a bounding-box overlap check (the && operator), so that the distance is only calculated for geometries whose bounding box contains the point, like so:
where the_geom && ST_GeomFromText(#{point}, 4326) and ST_Distance(the_geom, ST_GeomFromText(#{point}, 4326)) < 0.001
This buys an order of magnitude improvement in query response time!
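For anyone who wants to see why the cheap check pays off, here is a plain Python analogy of the same idea: test the bounding box first and only run the expensive geometry test on the survivors. This is only a sketch. It uses a point-in-polygon (ray casting) test in place of the distance threshold in the query above, the toy regions are invented, and it says nothing about how PostGIS implements its index internally.

```python
def bbox(polygon):
    """Bounding box of a polygon given as a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_bbox(point, box):
    """Cheap test: is the point inside the axis-aligned box?"""
    x, y = point
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y


def in_polygon(point, polygon):
    """Expensive test: ray casting, visiting every edge of the polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def lookup(point, regions):
    """Return names of regions containing the point, checking bounding boxes first."""
    return [
        name
        for name, (box, polygon) in regions.items()
        if in_bbox(point, box) and in_polygon(point, polygon)
    ]


# Toy "countries": a square and a right triangle.
square = [(0, 0), (10, 0), (10, 10), (0, 10)]
triangle = [(20, 0), (30, 0), (20, 10)]
regions = {"A": (bbox(square), square), "B": (bbox(triangle), triangle)}

print(lookup((5, 5), regions))
print(lookup((28, 8), regions))  # inside B's bounding box but outside the triangle
```

With real country polygons of many thousands of vertices, almost every candidate is rejected by the four comparisons in `in_bbox` and the per-edge loop runs only for the one or two polygons that could plausibly match.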
With a thin webservice wrapper from Jersey, we have the GIS pieces built. We opted for a webservice approach so that we can ultimately expose this quality control utility externally. Since we process in Hadoop, we experienced huge stress on this web service - we were DDoS'ing ourselves. I mentioned a similar approach in my last entry, where we alleviated the problem with load balancing across multiple machines. And in case anyone is wondering why we didn't just use Google's reverse-geocoding webservice, the answer is twofold - first, it violates their terms of use, and second, even if we were allowed, they impose a rate limit on how many queries you can send over time, and that would have brought our workflow to its knees.
The last piece of the puzzle is adding the call to the webservice from a Hive UDF and adding it to our workflow, which is reasonably straightforward. The result of the new processing is shown in Map 3, where the problems of Map 2 are all addressed.
Map 3: Results of new processing workflow for occurrences in the USA
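The check that the UDF performs boils down to comparing the country a record claims with the countries the reverse geocode returns. The sketch below shows that logic in Python rather than as a real Hive UDF. The `reverse_geocode` callable is a hypothetical stand-in for the webservice call, and `stub_lookup` is a toy replacement so the example runs without a network; neither reflects the actual service interface.

```python
def check_country(lat, lon, claimed_country, reverse_geocode):
    """Return True when the claimed country is among those returned by the lookup.

    `reverse_geocode` is a hypothetical callable (lat, lon) -> collection of
    ISO country codes, standing in for the call to the webservice.
    """
    if lat is None or lon is None:
        return False
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return False  # not a valid coordinate at all, no need to ask the service
    return claimed_country in reverse_geocode(lat, lon)


def stub_lookup(lat, lon):
    """Toy stand-in: treat a rough box around the contiguous USA as "US"."""
    if 24 <= lat <= 50 and -125 <= lon <= -66:
        return {"US"}
    return set()


print(check_country(38.5, -98.0, "US", stub_lookup))  # a plausible US record
print(check_country(38.5, 98.0, "US", stub_lookup))   # the mirror image over China
print(check_country(0.0, 0.0, "US", stub_lookup))     # the 0,0 case
```

Passing the lookup in as a parameter keeps the comparison logic testable on its own, separately from whatever transport is used to reach the service.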
These maps and the mapping cleanup processing will replace the existing maps and processing in our data portal later this year, hopefully in as little as a few months.
You can find the source of the reverse-geocode webservice at the Google Code site for the occurrence-spatial project. Similarly you can browse the source of the Hadoop/Hive workflow and the Hive UDFs.


