Synchronizing occurrence records




This post should be read alongside Tim’s post about
Decoupling Components,
as it takes for granted some of the information written there.





During the last week, I’ve been learning and working with some
technologies related to the decoupling of components we want to
accomplish. Specifically, I’ve been working with the Synchronizer component of the event-driven architecture Tim described.





Right now, the synchronizer takes the responses from the
resources and gets those responses into the occurrence store (MySQL as of today, but not final). But there is more to it: the responses from the resources typically come from DiGIR, TAPIR
and BioCASe providers, which return their responses in XML format. So how does
all this data end up in the occurrence store? Well, fortunately my colleague Oliver Meyn
wrote a very useful library to unmarshall all these XML chunks into nice and simple objects, so on my side
I just have to worry about calling all those getter methods. Also, the synchronizer
acts as a listener on a message queue that stores all the resource responses
that need to be handled. All the queue’s nuts & bolts were worked out by Tim
and Federico Méndez. So yes, it has been a nice collaboration between many
developers inside the Secretariat, and it’s always nice to have this kind of head
start from your colleagues :)
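To make the flow concrete (consume a response from the queue, unmarshall it, hand the records over for insertion), here is a rough sketch. The real synchronizer is built on Oliver’s Java unmarshalling library; this is only an illustrative Python version, and the class, element and field names are my own assumptions, not the actual API.

```python
import queue
import threading
import xml.etree.ElementTree as ET

# Illustrative sketch only -- names like OccurrenceRecord, <record>,
# <catalogueNumber> etc. are hypothetical, not the real library's.

class OccurrenceRecord:
    """A simple object holding the fields parsed out of one XML record."""
    def __init__(self, catalogue_number, scientific_name, locality=None):
        self.catalogue_number = catalogue_number
        self.scientific_name = scientific_name
        self.locality = locality

def unmarshall(xml_response):
    """Turn one XML response chunk into a list of simple record objects."""
    root = ET.fromstring(xml_response)
    records = []
    for rec in root.iter("record"):
        records.append(OccurrenceRecord(
            catalogue_number=rec.findtext("catalogueNumber"),
            scientific_name=rec.findtext("scientificName"),
            locality=rec.findtext("locality"),
        ))
    return records

def synchronizer_worker(response_queue, store):
    """One of the worker threads attending the queue: pop, unmarshall, insert."""
    while True:
        xml_response = response_queue.get()
        if xml_response is None:          # poison pill: stop this worker
            response_queue.task_done()
            break
        for record in unmarshall(xml_response):
            store.append(record)          # stand-in for the MySQL insert
        response_queue.task_done()
```

In the real setup several of these workers run in parallel against the same queue (five threads in the tests below), with MySQL as the store instead of a list.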





So, getting back to my duties, I have to take all these objects and start
populating the occurrence target store, taking some precautions (e.g. not
inserting duplicate occurrence records, checking that some mandatory fields
are not null, and other requirements).
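Those precautions can be illustrated with a minimal filter. Note that the choice of fields used as the “natural key” for duplicate detection here (institution code, collection code, catalogue number) is my assumption for illustration, not necessarily what the synchronizer actually keys on.

```python
# Minimal sketch of the filtering step. Records are plain dicts for
# brevity; the natural-key fields are an assumption for illustration.

MANDATORY_FIELDS = ("institution_code", "collection_code", "catalogue_number")

def filter_records(records):
    """Drop records missing mandatory fields or duplicating an earlier one."""
    seen_keys = set()
    accepted = []
    for record in records:
        # Precaution 1: mandatory fields must not be null/empty
        if any(not record.get(field) for field in MANDATORY_FIELDS):
            continue
        # Precaution 2: skip duplicates of records already accepted
        key = tuple(record[field] for field in MANDATORY_FIELDS)
        if key in seen_keys:
            continue
        seen_keys.add(key)
        accepted.append(record)
    return accepted
```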





For now, it’s in development mode, but I have managed to run some tests and
extract some metrics that show the current performance, which definitely leaves room for improvement. For the tests, the message queue is first loaded with some responses that need to be attended to,
and afterwards I execute the synchronizer, which starts populating the occurrence
store. All these tests were done on my MacBook Pro, so response times
will definitely improve on a better box. So here are the metrics:





Environment:

  • MacBook Pro 2.4 GHz Core 2 Duo (4 GB memory)
  • Mac OS X 10.5.8 (Leopard)
  • Message queue & MySQL DB reside on different machines, but on the same intranet
  • Threads: the synchronizer spawns 5 threads to attend to the queue elements
  • Message queue: loaded with 552 responses (some responses are empty, to emulate a real-world scenario)
  • Records: 70,326 occurrence records in total across all responses








Results Test 1 (without filtering out records):

  • Extracting responses from queue
  • Unmarshalling
  • Inserting into a MySQL DB
  • 202,022 milliseconds (3 min, 22 secs)






Results Test 2 (filtering out records):

  • Extracting from queue
  • Unmarshalling
  • Filtering out records (duplicates, mandatory fields, etc.)
  • Inserting into MySQL DB
  • over 30 minutes... (big FAIL)


So, as you can see, there is MUCH room for improvement. Since I have just joined this particular project, I need to start down the long and tedious road of debugging why there is such a huge difference; obviously the filtering step needs major improvement. Some solutions come to mind: increasing the number of threads, improving memory consumption, and other not-so-obvious ones. I will try to keep you readers posted, hopefully with some more inspiring metrics, and for sure on a better box.
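I haven’t started profiling yet, but one plausible suspect (my assumption, not a measured result) is the number of database round trips the filtering adds: if the duplicate check issues one SELECT per record plus one single-row INSERT, the round trips dominate, whereas pre-loading the existing keys once and inserting in multi-row batches would shrink them dramatically. A back-of-the-envelope count:

```python
# Toy round-trip count, not the real synchronizer: compares a
# per-record SELECT + INSERT pattern against a batched approach.

def round_trips_per_record(n_records):
    """One SELECT (duplicate check) + one single-row INSERT per record."""
    return 2 * n_records

def round_trips_batched(n_records, batch_size=500):
    """Pre-load existing keys once, then one multi-row INSERT per batch."""
    batches = -(-n_records // batch_size)  # ceiling division
    return 1 + batches

# For the 70,326 records in the tests above:
# per-record: 140,652 round trips; batched (size 500): 142.
```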





I hope to report further improvements soon. See you for now.

















