On benchmarks
https://grey-panther.net/2014/03/on-benchmarks.html
Fri, 28 Mar 2014 07:17:00 +0000

Numbers every programmer should know and their impact on benchmarks

Disclaimer: I don’t mean to be picking on the particular organizations / projects / people who I’ll mention below. They are just examples of a larger trend I observed.

Sometimes (most of the time?) we forget just how powerful the machines in our pockets / bags / desks are and accept the inefficiencies of the software running on them. But when we start to celebrate those inefficiencies, a line has to be drawn. Two examples:

In 2013 Twitter claimed a record Tweets Per Second (TPS – cute :-)) of ~143k. Let's round that up to 150k and do some back-of-the-envelope calculations:

  • Communication between the clients and Twitter: a tweet is 140 bytes (240 if we allow for Unicode). Let's multiply the 150k number by 10 (just to be generous – remember that 143k was already a big blip) – we get a bandwidth requirement of 343 MB/sec. Because tweets presumably go over TCP and ~20% of a TCP connection is overhead, you would need 428 MB/s of bandwidth – about 3.5 gigabits, or less than half of a 10 gigabit connection.
  • On the backend: let's assume we want triple redundancy (1 master + 2 replicas) and that the average tweet goes out to 9 subscribers. This means that internally we need to write each tweet 30 times (we assume a completely denormalized structure, so we write the tweet to each subscriber's timeline and to the user's own timeline, and do all of this thrice for redundancy). This means 10 GB/sec of data (13 if we're sending it over the network using TCP).
  • Thus ~100 servers would easily handle the load. And remember, this is 10x the peak traffic they experienced.

So why do they have 20 to 40 times that many servers? This means that less than 10% (!) of their server capacity is actually used for business functions.
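
To make the arithmetic above easier to check (and to redo with your own assumptions), here is a minimal Perl sketch of the same back-of-the-envelope calculation; every constant in it is one of the assumptions stated above, not a measured value:

use strict;
use warnings;

# Assumptions from the text above - tweak them as you see fit.
my $tweets_per_sec = 150_000 * 10;   # 10x the record peak, to be generous
my $tweet_bytes    = 240;            # upper bound per tweet, allowing for Unicode
my $fanout_writes  = 30;             # 9 subscribers + own timeline, times 3 replicas
my $tcp_overhead   = 1.25;           # ~20% of a TCP connection is overhead

my $ingress_mb = $tweets_per_sec * $tweet_bytes / 2**20;
printf "Client-facing ingress : %.0f MB/s (%.0f MB/s with TCP overhead)\n",
  $ingress_mb, $ingress_mb * $tcp_overhead;

my $backend_gb = $tweets_per_sec * $tweet_bytes * $fanout_writes / 2**30;
printf "Back-end write volume : %.1f GB/s (%.1f GB/s with TCP overhead)\n",
  $backend_gb, $backend_gb * $tcp_overhead;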

Second example: Google, together with DataStax, published a blog post about benchmarking a 300-node Cassandra cluster on Google Compute Engine. They claim a peak of 1.2M messages per second. Again, let's do some calculations:

  • The messages were 170 bytes in size. They were written to 2+1 nodes, which would mean ~600 MB/s of traffic (730 MB/s if sent over the network using TCP).
  • They used 300 servers but were also testing resiliency by removing 1/3 of the nodes, so let's be generous and say that the volume was divided over 100 servers.

This means that per server we use 7.3 MB/s of network traffic and 6 MB/s of disk traffic – 6% of a gigabit connection and about 50% of a medium-quality spinning-rust HDD.

My challenge to you: the next time you see such a benchmark, do a quick back-of-the-envelope calculation, and if it uses less than 60% of the available throughput, call the people on it!
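
If you want to make that sanity check a habit, a tiny helper along these lines is enough (a sketch only; the Cassandra numbers from above are plugged in as an example, and the 1 Gbit/s per-server network capacity is my assumption about the hardware, not a figure from the benchmark):

use strict;
use warnings;

# Report what fraction of the available per-server throughput a benchmark used.
sub utilization {
  my (%args) = @_;
  my $bytes_per_sec = $args{msgs_per_sec} * $args{msg_bytes} * $args{writes_per_msg};
  my $per_server    = $bytes_per_sec / $args{servers};
  return $per_server / $args{capacity_bytes_per_sec};
}

# The Cassandra example from above: 1.2M msgs/s, 170 bytes, 3 copies, 100 servers,
# measured against a 1 Gbit/s NIC (~125 MB/s) - an assumed capacity, adjust to taste.
my $net = utilization(
  msgs_per_sec           => 1_200_000,
  msg_bytes              => 170,
  writes_per_msg         => 3,
  servers                => 100,
  capacity_bytes_per_sec => 125e6,
);
printf "Network utilization: %.0f%% of a 1 Gbit link per server\n", 100 * $net;
printf "Call them on it!\n" if $net < 0.6;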

Is hand-writing assembly still necessary these days?
https://grey-panther.net/2011/02/is-hand-writing-assembly-still-necessary-these-days.html
Sun, 06 Feb 2011 07:14:00 +0000

Some time ago I came across the following article: Fast CRC32 in Assembly. It claimed that the assembly implementation was faster than the one implemented in C. Performance is something I've always been interested in, so I repeated and extended the experiment.

Here are the numbers I got. This is on a Core 2 Duo T5500 @ 1.66 GHz processor. The numbers express Mbits/sec processed:

  • The assembly version from the blogpost (table taken from here): ~1700
  • Optimized C implementation (taken from the same source): ~1500. The compiler used was Microsoft Visual C++ Express 2010
  • Unoptimized C implementation (i.e. a debug build): ~900
  • Java implementation using polynomials: ~100 (using JRE 1.6.0_23)
  • Java implementation using table: ~1900
  • Built-in Java implementation: ~1700
  • JavaScript implementation (for the fun of it; using the code from here, with one optimization – storing the table as numbers rather than strings) on Firefox 4.0 Beta 10: ~80
  • JavaScript on Chrome 10.0.648.18: ~40
  • (No IE9 test – they don’t offer it for Windows XP)

Final thoughts:

  • Hand-coding assembly is not necessary in 99.999% of cases (then again, 80% of all statistics are made up :-p). Using better tools or better algorithms (see “Java table based” vs. “Java polynomial”, and the sketch after this list) can give just as good a performance improvement. Maintainability and portability (almost always) trump performance
  • Be pragmatic. Are you sure that your performance is CPU bound? If you are calculating a CRC32 of disk files, a gigabit per second is more than enough
  • Revisit your assumptions periodically (especially if you are dealing with legacy code). The performance characteristics of modern systems (CPUs) differ enormously from the old ones. I would wager that on an old CPU with little cache the polynomial version would have performed much better, but now that CPU caches are measured in MB rather than KB, the table-based one wins
  • JavaScript engines are getting better and better.
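
For readers who haven't seen the table-based trick in action, here is a minimal sketch of it in Perl – not the code that was benchmarked above, just an illustration of the technique: precompute 256 partial results once, then do a single table lookup per input byte.

use strict;
use warnings;

# Build the 256-entry lookup table for the standard CRC-32 polynomial
# (0xEDB88320, reflected form) once, up front.
my @crc_table;
for my $i (0 .. 255) {
  my $c = $i;
  for (1 .. 8) {
    $c = ($c & 1) ? (0xEDB88320 ^ ($c >> 1)) : ($c >> 1);
  }
  $crc_table[$i] = $c;
}

# One table lookup and a couple of XORs/shifts per input byte.
sub crc32_table {
  my ($data) = @_;
  my $crc = 0xFFFFFFFF;
  for my $byte (unpack 'C*', $data) {
    $crc = $crc_table[($crc ^ $byte) & 0xFF] ^ ($crc >> 8);
  }
  return $crc ^ 0xFFFFFFFF;
}

# Standard check value: CRC-32 of "123456789" is 0xCBF43926.
printf "%08x\n", crc32_table('123456789');

The polynomial version runs the inner eight-iteration bit loop for every byte of the input instead; that per-byte loop is exactly the work the table amortizes away.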

Some other interesting remarks:

  • The source code can be found in my repo. Unfortunately I can’t include the C version since I managed to delete it by mistake 🙁
  • The file used to benchmark the different implementations was a PDF copy of the Producing Open Source Software book
  • The HTML5 File API is surprisingly inconsistent between Firefox and Chrome, so I needed to add the following line to keep them both happy: var blob = file.slice ? file.slice(start, len) : file;
  • The JavaScript code doesn’t work unless it is loaded via the http(s) protocol. Loading it from a local file gives “Error no. 4”, so I used a small Python web server
  • JavaScript timing has some issues, but my task took longer than 15 ms, so I got stable measurements
  • The original post mentions a variation of the algorithm which can take 16 bits at a time (rather than 8), which could result in a further speed improvement (and maybe it can be extended to 32 bits)
  • Beware of the “free” tools from Microsoft! This article would have been published sooner if it weren’t for the fact that MSVC++ 2010 Express requires an online registration, and when I had time I had no Internet access!
  • Update: If you want to run the experiment with GCC, you might find the following post useful: Intel syntax on GCC

Picture taken from TheGiantVermin’s photostream with permission.

Profile first!
https://grey-panther.net/2009/07/profile-first.html
Wed, 01 Jul 2009 13:57:00 +0000

I was using my code to parse a medium-sized CVS log and it was being slow (~1 min). So, like an idiot, I said: I know that the algorithm is quite slow, so I'll optimize it. Here is a version which is twice as fast as the original:

use strict;
use warnings;
use Test::More tests => 8;

is(cmp_cvs_tags('1.1', '1.1'),    0);
is(cmp_cvs_tags('1.1', '1.1.1'), -1);
is(cmp_cvs_tags('1.1', '1.2'),   -1);
is(cmp_cvs_tags('1.2', '1.2'),    0);
is(cmp_cvs_tags('1.2', '1.3'),   -1);

is(cmp_cvs_tags('1.1.1', '1.1'), 1);
is(cmp_cvs_tags('1.2',   '1.1'), 1);
is(cmp_cvs_tags('1.3',   '1.1'), 1);

sub cmp_cvs_tags {
  my ($a, $b) = @_;
  return 0 if ($a eq $b);
  
  # eliminate common prefix
  my ($alen, $blen) = (length($a), length($b));
  my $minlen = ($alen < $blen) ? $alen : $blen;
  for (my $i = 0; $i < $minlen; ++$i) {
    if (substr($a, $i, 1) ne substr($b, $i, 1)) {
      # first different character, cut until this point
      $a = substr($a, $i);
      $b = substr($b, $i);
      last;
    }
  }
  
  # split the remaining revision parts on the (literal) dots
  my @a_lst = split /\./, $a;
  my @b_lst = split /\./, $b;
  
  while (1) {
    my ($apart, $bpart) = (shift @a_lst, shift @b_lst);
    return -1 if (!defined $apart);
    return  1 if (!defined $bpart);
    next if ($apart == $bpart);
    return $apart <=> $bpart;
  } 
}

However, the whole script was still running slow. So I did what I should have done in the first place: ran Devel::NYTProf on it. After looking at the reports it was clear that 99% of the time was being spent in the read loop. D’oh!
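
For reference, running Devel::NYTProf boils down to two commands (assuming the module is installed from CPAN; parse_cvs_log.pl is just a stand-in name for whatever script you want to profile):

perl -d:NYTProf parse_cvs_log.pl   # runs the script and writes nytprof.out
nytprofhtml                        # converts nytprof.out into an HTML report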

Java numerical calculation benchmark
https://grey-panther.net/2009/01/java-numerical-calculation-benchmark.html
Thu, 01 Jan 2009 11:53:00 +0000

Update: it seems that the JITting process has improved quite a bit over the last years (which is to be expected), and the differences are much smaller (and in some cases in favor of Java). Also, the discussion below should be understood in the context of trigonometric functions rather than floating point operations in general.

In the same podcast I referenced in my previous entry (Java Posse #222) I heard the following claim (this is from memory, so it may not be 100% accurate): “Performing floating point calculations in Java is slower than native code by a factor of 6-7!”. If I recall correctly, the biggest problem was with some trigonometric functions.

This came as quite a surprise to me, given the consensus that JIT-ed code is very close to “compiled” code in performance (in the 90% range). This is why I considered the possible performance gains of things like Google Native Client less significant compared to its problems (dependence on the x86 platform, possible security risks, etc.).

Another field where I would have considered Java to be perfect, before hearing this, was the World Community Grid / BOINC project. Now, I’m all for using the extra power of computers to solve large problems; however, the model used by BOINC is a little problematic (IMHO) from a security standpoint: it repeatedly downloads both data and executables for each project that you take part in (I observed this because of a whitelisting solution that I use which prompts whenever an unknown executable is launched). I didn’t look into the safeguards (if any) BOINC has, but given that you run native code, I don’t know if there is much they can do. This is why I would have considered Java to be perfect for the job.

Looking for some more concrete results (because I’m too lazy to do my own benchmarking :-)), I came across this site: JavaNumerics, where they have a nice Benchmarks section. Unfortunately they don’t seem to have a comparison with native code, only different JVMs on different platforms. I also found this paper (warning! PDF) which claims that numerical computation in Java is 50%-60% slower than the equivalent compiled code. This is quite a lot, and a big waste for distributed projects.

The (sad) conclusion seems to be: Java doesn’t seem to be “there” yet for numerical calculations (just as it doesn’t seem to be “there” with JavaFX either :-(). Hopefully this will be resolved sooner rather than later and we can dedicate our computers to things like computing rainbow tables with an extra level of safety.

On javascript libraries
https://grey-panther.net/2007/01/on-javascript-libraries.html
Mon, 15 Jan 2007 07:20:00 +0000

I did a little project for school which used the dojo.gfx library. Here I’ll share some of the conclusions I arrived at. But first, a disclaimer: INAJD (I’m Not A JavaScript Developer). I dabble with it, but I’m not a professional. Now back to our topic:

JavaScript libraries are huge. After including the Dojo toolkit, the page took more than twice as long to load locally. Now, I’m not saying that they are worthless, because there is some very, very good code in them (like dojo.gfx, which is incredible), but with today’s browser performance they are not the way to go. Again, the main problem is that today’s browsers perform very poorly when importing JavaScript files, and while the library-management part of Dojo, for example, is very nice, it also results in the page loading slower (sometimes much slower). The Dojo guys are of course aware of this, and from what I understand they included the most common packages in the main dojo file to eliminate some of this burden, but that is still a workaround. What I would like to see is a package system which would resolve dependencies offline and generate a (compressed) JavaScript file that includes all the needed functions in one place.

As I found out, the JavaScript libraries debate has been going on for some time on blogs, so here are links to some articles (most of them don’t discuss the issue from the point of view of performance, but from the point of view of (beginner) developers and whether libraries benefit them):

The New Amateurs – by ppk at quirksmode. BTW, this guy really knows what he’s talking about when it comes to JavaScript! If you are looking for ways of doing something in JavaScript, you should definitely check out his site first.

The New Amateurs – part 2 – the sequel.

Again JavaScript libraries

fog of libraries

Too many libraries, not enough librarians

Your own personal library

Reducing the pain of adopting a JavaScript library

Dear JavaScript Library Developers… – a post I very much agree with. I didn’t have any previous experience with the Dojo toolkit, and when I went to the documentation site I was greeted by the following choices: Dojo Book, API Reference (not completed yet), Dojo Manual (obsolete) and Old Documentation Site. That’s three out of four documentation sources marked, explicitly or implicitly, as not really usable. Fortunately the warnings turned out to be untrue (at least so far) and everything described there worked just fine, but it was a little frightening. And by the way: the API documentation doesn’t work with Opera 9, which is a shame, since Opera has the best support for SVG and because of this it was the browser of my choice for this project.

The dark side of JavaScript libraries and Why good JavaScript libraries fail
