One of the most common complaints about the Perl MongoDB driver is that it tries to be
a little too clever. In the current production release of MongoDB.pm (version 0.46.2 as of
this writing), all datetime values retrieved by a query are automatically instantiated
as DateTime objects.
DateTime is a remarkable CPAN distribution. In fact, I would say that
DateTime and its related distributions on CPAN comprise one of the best
date and time manipulation libraries in any programming language. But that power comes with
a cost. The DateTime codebase is large, and instantiating DateTime
objects is expensive. The constructor performs a great deal of validation, and creates a
large amount of metadata which is stored inside the object.
Upcoming changes to the Perl MongoDB driver solve this problem.
If you need to perform a series of complex arithmetic operations with dates, then the
cost of DateTime is justified. But frequently, all you want is a simple read-only value that is
sufficient for displaying to a user or saving elsewhere. If you are running queries involving a large number
of documents, the automatic instantiation of thousands of complex objects becomes a barrier
to performance.
DateTime::Tiny is a
lightweight alternative. As its name suggests, it is quite tiny indeed. It does no
validation, simply shoving whatever you pass to its constructor into an object. It has a
couple of methods for outputting formatted dates, and a convenience method for promoting
an object to a full DateTime object if required. If you know that your date
information came from another process that has already done the validation, and you know
that you don't need to do any manipulation of the date data, then DateTime::Tiny
is an excellent choice for speed.
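Here is a minimal sketch of what that looks like in practice. It assumes DateTime::Tiny is installed (and DateTime too, for the promotion step); the date values are just sample data taken from the GitHub events discussed later.

```perl
use strict;
use warnings;
use DateTime::Tiny;

# No validation happens here: the values are stored exactly as given.
my $dt = DateTime::Tiny->new(
    year   => 2012,
    month  => 4,
    day    => 11,
    hour   => 11,
    minute => 1,
    second => 37,
);

# as_string formats the stored fields as an ISO 8601 style string.
my $iso = $dt->as_string;
print "$iso\n";

# Promote to a full DateTime object only when date math is needed.
# DateTime is loaded on demand, so it must be installed for this call.
my $full     = $dt->DateTime;
my $next_day = $full->add( days => 1 )->ymd;
print "$next_day\n";
```

The cheap object is enough for display, and the promotion cost is paid only for the few values that actually need arithmetic.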
When I took over maintenance of the Perl MongoDB driver, I began a project to allow
alternatives to the default DateTime method of handling datetime data in
MongoDB documents. You can now set a per-connection dt_type attribute if you
want DateTime::Tiny objects instead of the default.
But how much faster is it? I set out to find a useful data set with a lot of datetime values that I could import into MongoDB and use for testing. A co-worker suggested the GitHub Archive project, which aggregates voluminous data about various events in public GitHub repositories. I downloaded 24 JSON files covering the day of April 11, 2012. (I chose that date for the sole reason that it was the date used in the GitHub Archive example queries.)
The first step was to examine the JSON data to see what kind of date information was
available. To my surprise, I found that the .json files returned by the GitHub
Archive are not actually JSON, but streams of JSON objects concatenated together with
no delimiter. After figuring out this unusual structure, I was able to coerce the data into
a proper JSON array and then format it for human readability via the following bash one-liner.
for file in *.json; do
    perl -MFile::Slurp -MJSON::XS -E \
        '$new = "[" . join( "},{", split /\}\{/, read_file shift ) . "]";
         print JSON::XS->new->utf8(1)->pretty(1)->encode( decode_json $new )' "$file" \
        > "$file.new";
done
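To see what the one-liner is doing, here is the same split-and-rejoin idea as a small standalone script. It is only a sketch: the three-event sample string is made up for illustration, and it uses the core JSON::PP module instead of JSON::XS so that it runs without extra installs.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical sample: three JSON objects concatenated with no delimiter.
my $stream = '{"type":"PushEvent"}{"type":"WatchEvent"}{"type":"ForkEvent"}';

# Split on each "}{" boundary, rejoin with "},{", and wrap in brackets
# so that the result is a single JSON array.
my $array_text = '[' . join( '},{', split /\}\{/, $stream ) . ']';

my $events = decode_json($array_text);
printf "%d events\n", scalar @$events;
```

Note that this trick would break on any string value that itself contains the characters `}{`, so it is a quick fix for exploring data rather than a general parser.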
Upon examining the data, I was pleased to see it consisted of many highly structured
documents with plenty of datetimes. In order to store them in MongoDB, the datetime
strings inside the JSON objects would need to be parsed into DateTime objects
for serialization. To my further surprise, I found that different types of things in the
GitHub data had different kinds of datetime formats. Some looked like
2012/04/05 11:37:28 -0700, whereas others looked like
2012-04-11T11:01:37Z. Fortunately, both formats are easy to parse and turn into
DateTime objects.
use strict;
use warnings;
use v5.16;
use JSON::XS;
use File::Slurp;
use DateTime;
use MongoDB;

my $file = shift;
my $json = read_file $file;
my $data = JSON::XS->new->utf8(1)->decode( $json );

my $conn = MongoDB::Connection->new;
my $db   = $conn->get_database( 'github' );
my $coll = $db->get_collection( 'events' );

# Walk the nested structure, replacing datetime strings with DateTime objects.
sub traverse {
    my $node = shift;
    return if not defined $node;
    if ( ref $node eq ref [ ] ) {
        foreach my $item ( @$node ) {
            traverse( $item ) if ref $item;
        }
    } elsif ( ref $node eq ref { } ) {
        foreach my $key ( keys %$node ) {
            my $val = $node->{$key};
            next if not defined $val;
            # Recurse into nested structures; only plain strings are parsed.
            if ( ref $val ) {
                traverse( $val );
                next;
            }
            if ( $key =~ m{(pushed|created|closed|updated|merged)_at} ) {
                my $re = $val =~ m{/}
                    ? qr{(?<year>\d{4}) / (?<month>\d{2}) / (?<day>\d{2}) \s
                         (?<hour>\d{2}) : (?<minute>\d{2}) : (?<second>\d{2}) \s
                         (?<time_zone>[+-]\d{4}) }x
                    : qr{(?<year>\d{4}) - (?<month>\d{2}) - (?<day>\d{2}) T
                         (?<hour>\d{2}) : (?<minute>\d{2}) : (?<second>\d{2}) Z}x;
                # Leave the value alone if it matches neither format.
                next unless $val =~ $re;
                # A time_zone captured in %+ overrides the GMT default.
                $node->{$key} = DateTime->new( time_zone => 'GMT', %+ );
            }
        }
    }
}

traverse( $data );
$coll->insert( $_ ) for @$data;
This code traverses the nested structures in every event record, looking for things that
look like dates. It then parses them and turns them into DateTimes, so the
MongoDB driver will serialize them as such when importing into the database. The resulting
documents are then stored in a collection called events. I didn't bother to
create an _id field for each document; I instead relied on the driver to
create default ObjectIDs for me.
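The date parsing in the import script can be tried on its own, without MongoDB or DateTime. The sketch below wraps the same two regexes in a helper named datetime_args (a hypothetical name, not part of the script or of any library) that returns the argument list the script would hand to DateTime->new.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the key/value list the import script would
# pass to DateTime->new, or an empty list if neither format matches.
sub datetime_args {
    my $val = shift;
    my $re = $val =~ m{/}
        ? qr{(?<year>\d{4}) / (?<month>\d{2}) / (?<day>\d{2}) \s
             (?<hour>\d{2}) : (?<minute>\d{2}) : (?<second>\d{2}) \s
             (?<time_zone>[+-]\d{4}) }x
        : qr{(?<year>\d{4}) - (?<month>\d{2}) - (?<day>\d{2}) T
             (?<hour>\d{2}) : (?<minute>\d{2}) : (?<second>\d{2}) Z}x;
    return unless $val =~ $re;

    # In a hash, a later key wins, so a time_zone captured in %+
    # replaces the 'GMT' default listed before it.
    my %args = ( time_zone => 'GMT', %+ );
    return %args;
}

# The two sample strings quoted in the text.
my %slash = datetime_args('2012/04/05 11:37:28 -0700');
my %iso   = datetime_args('2012-04-11T11:01:37Z');

print "slash format time zone: $slash{time_zone}\n";
print "ISO format time zone:   $iso{time_zone}\n";
```

The slash format carries its own offset, which ends up as the time_zone argument, while the Z format has no time_zone capture and falls back to GMT.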
The day's worth of data results in 133,790 documents in the events
collection, each with at least a couple of datetimes.
How slow is DateTime compared to DateTime::Tiny or raw datetime
strings when fetching thousands of documents? Let's use Perl's excellent Benchmark
module to find out.
use strict;
use warnings;
use v5.16;
use Benchmark ':all';
use DateTime::Tiny;
use MongoDB 0.47;

my $conn = MongoDB::Connection->new;
my $db   = $conn->get_database( 'github' );
my $coll = $db->get_collection( 'events' );

timethese(
    10,
    {
        datetime => sub { $conn->dt_type( 'DateTime' );       my @recs = $coll->find->all; },
        tiny     => sub { $conn->dt_type( 'DateTime::Tiny' ); my @recs = $coll->find->all; },
        raw      => sub { $conn->dt_type( undef );            my @recs = $coll->find->all; },
    }
);
The results speak for themselves.
Benchmark: timing 10 iterations of datetime, raw, tiny...
datetime: 428 wallclock secs (426.24 usr + 1.22 sys = 427.46 CPU) @ 0.02/s (n=10)
raw: 44 wallclock secs (42.58 usr + 0.52 sys = 43.10 CPU) @ 0.23/s (n=10)
tiny: 66 wallclock secs (64.68 usr + 0.99 sys = 65.67 CPU) @ 0.15/s (n=10)
Unsurprisingly, raw datetime strings perform the best, taking about one tenth of the time
needed to instantiate full DateTime objects. But the more useful
DateTime::Tiny is not far behind: it takes about one and a half times as long as the raw
option to construct the same number of date objects, which is still roughly six and a half
times faster than DateTime.
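For anyone who wants to check the arithmetic, the ratios can be worked out from the CPU times in the benchmark output. This is just a sketch; the per-document figure assumes each iteration fetched all 133,790 documents in the collection.

```perl
use strict;
use warnings;

# CPU seconds for 10 iterations, copied from the benchmark output.
my %cpu = ( datetime => 427.46, raw => 43.10, tiny => 65.67 );

my $iterations = 10;
my $documents  = 133_790;    # assumed number of documents per iteration

for my $name ( sort keys %cpu ) {
    my $ratio      = $cpu{$name} / $cpu{raw};
    my $usec_per_doc = $cpu{$name} / $iterations / $documents * 1e6;
    printf "%-8s %5.2fx raw, %6.1f microseconds per document\n",
        $name, $ratio, $usec_per_doc;
}
```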
In conclusion, DateTime::Tiny is a worthy optimization for situations where
you are grabbing large numbers of dates from MongoDB which require no manipulation. The
ability to specify DateTime::Tiny and raw dt_types will be
available in MongoDB's Perl driver release 0.47.
