Fast datetimes in MongoDB


One of the most common complaints about the Perl MongoDB driver is that it tries to be a little too clever. In the current production release of MongoDB.pm (version 0.46.2 as of this writing), all datetime values retrieved by a query are automatically instantiated as DateTime objects.

DateTime is a remarkable CPAN distribution. In fact, I would say that DateTime and its related distributions on CPAN comprise one of the best date and time manipulation libraries in any programming language. But that power comes with a cost. The DateTime codebase is large, and instantiating DateTime objects is expensive. The constructor performs a great deal of validation, and creates a large amount of metadata which is stored inside the object.

Upcoming changes to the Perl MongoDB driver solve this problem.

If you need to perform a series of complex arithmetic operations with dates, then the cost of DateTime is justified. But frequently, all you want is a simple read-only value that is sufficient for displaying to a user or saving elsewhere. If you are running queries involving a large number of documents, the automatic instantiation of thousands of complex objects becomes a barrier to performance.

DateTime::Tiny is a lightweight alternative. As its name suggests, it is quite tiny indeed. It does no validation, simply shoving whatever you pass to its constructor into an object. It has a couple methods for outputting formatted dates, and a convenience method for promoting an object to a full DateTime object if required. If you know that your date information came from another process that has already done the validation, and you know that you don't need to do any manipulation of the date data, then DateTime::Tiny is an excellent choice for speed.
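To illustrate (a minimal sketch, not from the original post), here is DateTime::Tiny in action: constructing an object by hand, printing it, and promoting it to a full DateTime only when real date math is needed.

```perl
use strict;
use warnings;
use feature 'say';

use DateTime::Tiny;

# No validation happens here; the fields are stored as-is.
my $dt = DateTime::Tiny->new(
    year   => 2012,
    month  => 4,
    day    => 11,
    hour   => 11,
    minute => 1,
    second => 37,
);

say $dt->as_string;    # 2012-04-11T11:01:37

# Promote to a full DateTime object only when you need real
# date arithmetic (requires DateTime to be installed).
my $full = $dt->DateTime;
say $full->day_name;
```

If you feed the constructor nonsense, you get nonsense back out — the validation burden is entirely on whoever produced the data.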

When I took over maintenance of the Perl MongoDB driver, I began a project to allow alternatives to the default DateTime method of handling datetime data in MongoDB documents. You can now set a per-connection dt_type attribute if you want DateTime::Tiny objects instead of the default.

But how much faster is it? I set out to find a useful data set with a lot of datetime values that I could import into MongoDB and use for testing. A co-worker suggested the GitHub Archive project, which aggregates voluminous data about various events in public GitHub repositories. I downloaded 24 JSON files covering the day of April 11, 2012. (I chose that date for the sole reason that it was the date used in the GitHub Archive example queries.)

The first step was to examine the JSON data to see what kind of date information was available. To my surprise, I found that the .json files returned by the GitHub Archive are not actually JSON, but streams of JSON objects concatenated together with no delimiter. After figuring out this unusual structure, I was able to coerce the data into a proper JSON array and then format it for human readability via the following bash one-liner.

for file in *.json; do
  perl -MFile::Slurp -MJSON::XS -E \
    '$new = "[" . join( "},{", split /}{/, read_file shift ) . "]";
     print JSON::XS->new->utf8(1)->pretty(1)->encode( decode_json $new )' \
    "$file" > "$file.new"
done

Upon examining the data, I was pleased to see it consisted of many highly structured documents with plenty of datetimes. In order to store them in MongoDB, the datetime strings inside the JSON objects would need to be parsed into DateTime objects for serialization. To my further surprise, I found that different event types in the GitHub data used different datetime formats. Some looked like 2012/04/05 11:37:28 -0700, whereas others looked like 2012-04-11T11:01:37Z. Fortunately, both formats are easy to parse and turn into DateTime objects.

use strict;
use warnings;
use v5.16;

use JSON::XS;
use File::Slurp;
use DateTime;
use MongoDB;

my $file = shift;
my $json = read_file $file;

my $data = JSON::XS->new->utf8(1)->decode( $json );

my $conn = MongoDB::Connection->new;
my $db = $conn->get_database( 'github' );
my $coll = $db->get_collection( 'events' );

sub traverse {
    my $node = shift;

    return if not defined $node;

    if ( ref $node eq ref [ ] ) {
        foreach my $item( @$node ) {
            traverse( $item ) if ref $item;
        }
    } elsif ( ref $node eq ref { } ) {
        foreach my $key ( keys %$node ) {
            my $val = $node->{$key};
            next if not defined $val;

            if ( ref $val ) {
                traverse( $val );
                next;
            }

            if ( $key =~ m{(pushed|created|closed|updated|merged)_at} ) {
                my $re = $val =~ m{/}
                    ? qr{(?<year>\d{4}) / (?<month>\d{2}) / (?<day>\d{2}) \s
                         (?<hour>\d{2}) : (?<minute>\d{2}) : (?<second>\d{2}) \s
                         (?<time_zone>[+-]\d{4}) }x
                    : qr{(?<year>\d{4}) - (?<month>\d{2}) - (?<day>\d{2}) T
                         (?<hour>\d{2}) : (?<minute>\d{2}) : (?<second>\d{2}) Z}x;

                # Skip values that don't match; otherwise %+ would still
                # hold the captures from a previous successful match.
                next unless $val =~ $re;

                # The slash format captures its own time_zone, which
                # overrides the GMT default here.
                $node->{$key} = DateTime->new( time_zone => 'GMT', %+ );
            }
        }
    }
}

traverse( $data );


$coll->insert( $_ ) for @$data;

This code traverses the nested structures in every event record, looking for things that look like dates. It then parses them and turns them into DateTimes, so the MongoDB driver will serialize them as such when importing into the database. The resulting documents are then stored in a collection called events. I didn't bother to create an _id field for each document; I instead relied on the driver to create default ObjectIDs for me.

The day's worth of data results in 133,790 documents in the events collection, each with at least a couple of datetimes.

How slow is DateTime compared to DateTime::Tiny or raw datetime strings when fetching thousands of documents? Let's use Perl's excellent Benchmark module to find out.

use strict;
use warnings;
use v5.16;

use Benchmark ':all';
use DateTime::Tiny;
use MongoDB 0.47;

my $conn = MongoDB::Connection->new;
my $db = $conn->get_database( 'github' );
my $coll = $db->get_collection( 'events' );


timethese(
    10,
    {
       datetime => sub { $conn->dt_type( 'DateTime' ); my @recs = $coll->find->all; },
       tiny => sub { $conn->dt_type( 'DateTime::Tiny' ); my @recs = $coll->find->all; },
       raw => sub { $conn->dt_type( undef ); my @recs = $coll->find->all }
    }
);

The results speak for themselves.

Benchmark: timing 10 iterations of datetime, raw, tiny...
  datetime: 428 wallclock secs (426.24 usr +  1.22 sys = 427.46 CPU) @  0.02/s (n=10)
       raw: 44 wallclock secs (42.58 usr +  0.52 sys = 43.10 CPU) @  0.23/s (n=10)
      tiny: 66 wallclock secs (64.68 usr +  0.99 sys = 65.67 CPU) @  0.15/s (n=10)

Unsurprisingly, raw datetime strings perform best, taking about one-tenth the time of instantiating full DateTime objects. But the more useful DateTime::Tiny performs almost as well, taking only slightly longer than the raw option to construct the same number of date objects.
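As an aside, Benchmark can also report relative speeds directly via cmpthese, which prints rates and percentage differences rather than raw timings. A standalone sketch (the subs here are stand-ins, since it doesn't touch MongoDB):

```perl
use strict;
use warnings;

use Benchmark qw( cmpthese );

# cmpthese prints a table of iteration rates and relative
# percentage differences, handy for comparisons like the one above.
cmpthese(
    -1,    # run each sub for at least one CPU second
    {
        join    => sub { my $s = join '-', '2012', '04', '11' },
        sprintf => sub { my $s = sprintf '%04d-%02d-%02d', 2012, 4, 11 },
    }
);
```

The same hashref of subs used with timethese above would work here unchanged.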

In conclusion, DateTime::Tiny is a worthy optimization for situations where you are grabbing large numbers of dates from MongoDB which require no manipulation. The ability to specify DateTime::Tiny and raw dt_types will be available in MongoDB's Perl driver release 0.47.


This page contains a single entry by Mike Friedman published on September 16, 2012 8:26 PM.