On benchmarks
Fri, 28 Mar 2014 – https://grey-panther.net/2014/03/on-benchmarks.html

Numbers every programmer should know and their impact on benchmarks

Disclaimer: I don’t mean to be picking on the particular organizations / projects / people who I’ll mention below. They are just examples of a larger trend I observed.

Sometimes (most of the time?) we forget just how powerful the machines in our pockets / bags / desks are and accept the inefficiencies of the software running on them. But when we start to celebrate those inefficiencies, a line has to be drawn. Two examples:

In 2013 Twitter claimed a record Tweets Per Second (TPS – cute :-)) of ~143k. Let’s round that up to 150k and do some back-of-the-envelope calculations:

  • Communication between the clients and Twitter: a tweet is 140 bytes (240 if we allow for Unicode). Let’s multiply the 150k number by 10 (just to be generous – remember that 143k was already a big blip) – we get a bandwidth requirement of 343 MB/sec. Because tweets presumably go over TCP, and ~20% of a TCP connection is overhead, you would need 428 MB/s of bandwidth – about 3.5 gigabit, or less than half of a 10 gigabit connection.
  • On the backend: let’s assume we want triple redundancy (1 master + 2 replicas) and that the average tweet goes out to 9 subscribers. This means that internally we need to write each tweet 30 times (we suppose a completely denormalized structure, so we also need to write the tweet to each subscriber’s timeline, and do all this thrice for redundancy). This means 10 GB/sec of data (13 if we’re sending it over the network using TCP).
  • Thus ~100 servers would easily handle the load. And remember, this is 10x the peak traffic they experienced (the sketch below reruns this arithmetic).
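Here is the same envelope math as a small script, so you can rerun it with your own assumptions – the constants are just the ones from the bullets above, nothing comes from Twitter itself:

#!/usr/bin/perl
use strict;
use warnings;

my $tps          = 150_000 * 10; # 10x the rounded-up record, to be generous
my $tweet_size   = 240;          # bytes, allowing for Unicode
my $tcp_overhead = 1.25;         # ~20% of a TCP connection is overhead
my $copies       = 30;           # 3x redundancy * (1 author + 9 subscribers)

# client <-> Twitter bandwidth
my $client_mb = $tps * $tweet_size / 2**20;
printf "client traffic: %.0f MB/s raw, %.0f MB/s over TCP\n",
  $client_mb, $client_mb * $tcp_overhead;

# backend write volume, fully denormalized
my $backend_gb = $tps * $tweet_size * $copies / 2**30;
printf "backend writes: %.0f GB/s raw, %.0f GB/s over TCP\n",
  $backend_gb, $backend_gb * $tcp_overhead;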

So why do they have 20 to 40 times that many servers? This means that less than 10% (!) of their server capacity is actually used for business functions.

Second example: Google, together with DataStax, published a blogpost about benchmarking a 300-node Cassandra cluster on Google Compute Engine. They claim a peak of 1.2M messages per second. Again, let’s do some calculations:

  • The messages were 170 bytes in size. They were written to 2+1 nodes, which would mean ~600 MB/s of traffic (730 MB/s if sent over the network using TCP).
  • They used 300 servers, but they were also testing resiliency by removing 1/3 of the nodes, so let’s be generous and say that the volume was divided over 100 servers.

This means that per server we use 7.3 MB/s of network traffic and 6 MB/s of disk traffic, or 6% of a Gigabit connection and about 50% of a medium-quality spinning-rust HDD.
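The same arithmetic per server, as a quick sanity check (again, only the numbers quoted above):

#!/usr/bin/perl
use strict;
use warnings;

my $msgs_per_sec = 1_200_000;
my $msg_size     = 170;  # bytes
my $replicas     = 3;    # written to 2+1 nodes
my $servers      = 100;  # generous: only 1/3 of the 300 nodes

my $disk_mb = $msgs_per_sec * $msg_size * $replicas / $servers / 2**20;
my $net_mb  = $disk_mb * 1.25; # ~20% TCP overhead

printf "per server: %.1f MB/s disk, %.1f MB/s network\n", $disk_mb, $net_mb;
printf "that is %.0f%% of a gigabit link\n", $net_mb * 2**20 * 8 / 1e9 * 100;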

My challenge to you: next time you see such a benchmark, do a quick back-of-the-envelope calculation, and if it uses less than 60% of the available throughput, call the people out on it!

The wrong time to update software…
Mon, 11 Apr 2011 – https://grey-panther.net/2011/04/the-wrong-time-to-update-software.html

…is when the user is the busiest, for example when s/he has just started your application. See for example the screenshot below with Adobe AIR (click through to see it in its full beauty).

The mistakes it makes:

  • It tries to do the update when I’m trying to start Grooveshark (it interferes with my intention)
  • It consumes 100% of a core by polling for the presence of running applications (I suppose), effectively obliging me to do the update. This is combined with frequent releases (which otherwise would be a good thing) for maximum annoyance.
  • Although you can’t see it in the screenshot, the updater has (had?) a bug when it asks for your sudo password: if you mistype it at first, it then asks for the root password (which doesn’t exist under Ubuntu by default) and then gets into some weird state until the next update is released.

To sum it up: you should download and install updates in the background (into a separate, versioned directory, always keeping just the two most recent versions – see the sketch below). Users shouldn’t be bothered with this, especially when they are trying to get work done!
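A minimal sketch of what that housekeeping could look like – the ~/.myapp/versions layout and the timestamp-named directories are my own assumptions for illustration, not how Adobe AIR actually works:

#!/usr/bin/perl
# Prune old versions, keeping only the two most recent ones.
# Assumes each update was unpacked into ~/.myapp/versions/<install epoch>.
use strict;
use warnings;
use File::Path qw(remove_tree);

my $base = "$ENV{HOME}/.myapp/versions";
opendir(my $dh, $base) or die "cannot open $base: $!";
# keep only well-formed (numeric) entries, sorted newest first
my @versions = sort { $b <=> $a } grep { /^\d+$/ } readdir($dh);
closedir($dh);

# everything past the first two is fair game
remove_tree("$base/$_") for @versions[2 .. $#versions];

The launcher would then simply start the newest directory, falling back to the previous one if the newest is broken or half-downloaded.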

Copyright is not theft!
Wed, 09 Jun 2010 – https://grey-panther.net/2010/06/copyright-is-not-theft.html

Recently quite a few copyright-related posts came up in my feedreader. This is of course a complicated and layered problem which can’t be solved in the couple of paragraphs of this blogpost, but at least I can post a bunch of great materials which should contribute to the edification of all of us.

From comixtalk.com: Copyright Is Not Theft

Also from comixtalk: Nina Paley Discusses State of Sita Sings The Blues. This is an animated movie (which you can watch for free) that had legal problems because of the backing soundtrack, even though the music in question was created in 1920, so it should be in the public domain.

You might also be interested in the documentary RiP! A Remix Manifesto (embedded below for your convenience). It is a Michael Moore-style documentary on the issue, and while it isn’t perfect, it manages to raise a lot of interesting points (BTW, I personally find the songs on the Grey Album much better than the originals on the Black Album – just a random example of how derivative works can improve on the original):

Finally, here is a presentation from TED comparing industries with different levels of copyright protection (via Slashdot):

This should be enough information to keep you outraged for weeks 🙂

PS. Just a quick rundown of my current opinion: all works are derivative. But even if we skip over this, long copyright terms stifle innovation. And even if we don’t consider (or don’t accept) this premise, labelling all copying as “theft” is wrong (depending on the case), (possibly purposefully) misleading, and unethical. For example, I possess the copyright for all the materials published on this blog (since they are my original work), but I explicitly grant anyone the right to reuse the content under the conditions of the CC-BY-SA 3.0 license.

PS #2: Another interesting documentary to watch is Patently Absurd. I didn’t include it above because it deals with patent law, not copyright – two domains which are frequently bundled together under the term "Intellectual Property" (together with trademark law), but in fact these three domains are completely separate and the laws governing them are distinct, so I didn’t want to add to the confusion.

PS #3: Technology != breaking the law. Just because I use BitTorrent, it doesn’t mean that I’m breaking copyright law! I might very well be downloading a Linux ISO (as I frequently do), one of the many free (as in freedom) materials from ClearBits (previously LegalTorrents) or a World of Warcraft patch, for that matter.

Dear people: try to think harder, even if it makes your head hurt!
Mon, 07 Jun 2010 – https://grey-panther.net/2010/06/dear-people-try-to-think-harder-even-if-it-makes-your-head-hurt.html

This is again a case of a couple of links on the same topic piling up in my reader (this tends to happen if you take a pause from blogging :-)):

The commonality between all these articles is that they make statements based on faulty questions (PHD Comics says it best). A website poll is not the same as a scientific study (to name just one of the problems: it has a selection bias towards the readers of the particular site – which wouldn’t be a problem if the results weren’t presented as applicable to the general population). And even if they were scientific studies, the purpose of a scientific study isn’t to find the absolute truth! It is to present a hypothesis which doesn’t contradict any of the current observations. But that doesn’t exclude the possibility that in the future there will be an observation which contradicts the hypothesis, and then the hypothesis must be changed.

On the hopelessness of pulling content from the interwebs
Mon, 10 May 2010 – https://grey-panther.net/2010/05/on-the-hopelessness-of-pulling-content-from-the-interwebs.html

In the last couple of weeks I had at least two cases where I saw a (provocative) post come up in my feedreader and clicked through to read the entire piece (BTW, partial feeds just suck!), only to find that the owner had removed the post. The first was from the DynDNS blog named “Open Dialogue” (apparently openness and censorship can co-exist in some people’s minds without their brains being blown up by the cognitive dissonance) and it said the following:

We hope we’re wrong, but it looks like DNS Made Easy (aka Tiggee LLC) is secretly behind DNSComparison.com

Let’s first start off by providing some definitions of key attributes that Dyn Inc. lives by across our organization and takes pride in while representing the DNS industry. These characteristics define us and make us the company we are today. Call us naïve, but we also “still” hold out hope that the rest of the DNS space (and the business world, in general) believes the same and truly means well when their actions might seem otherwise.

The second one comes from the MaraDNS blog (is there a pattern here? are there many frustrated people in the DNS space? :-p):

Xonotic: Type 2 Freetards can’t make content

If you want to piss a type 2 freetard off, take an open-source project, make it proprietary (after getting everything with copyright to the code to agree to the non-GPL terms), and sell the proprietary product.
This happened with Tux Racer. Boy were the freetards pissed off, whining about how the commercial game wasn’t very good, blah blah blah. But, bottom line: The developers worked hard making the program. They wanted to get paid for their work. The type 2 freetards felt something was stolen from them because the next version of their program was not open-source.
Another successful open-source game is now doing the same thing: Nexuiz, an excellent fun little first person shooter with everything (both the engine and the content) under a GPL-compatible license.
Well, the developers realized one day that they wanted to get paid for their work, so they decided to have a remake of Nexuiz for consoles that will be closed-sourced using different content.
The freetards went ballistic. It became a front page story at Freetard central. In short order, a fork was declared. Freetards everywhere talked about how evil Nexuiz was; their declarations were mainly based on ignorance; inaccurate posts accusing the Nexuiz development team of violating the GPL were posted everywhere.
The next Nexuiz will, for the record, be 100% legal: All of the Nexuiz code has been licensed for non-free use. The content will be, for the most part, entirely new. There is no GPL violation here.
Once the dust cleared, development on the fork (called Xonotic) stalled. One developer recently admitted, two months after declaring this fork, that…

It is my opinion that such actions inherently undermine the trust in a person / brand. They are also ineffective (proven by the fact that at least one person – i.e. myself – was able to read the content). My ideal publishing platform would be:

  • Versioned, so that everyone could look up what the text looked like at a given moment in time
  • Verified by a third-party agency (such as a timestamp signing service) which guarantees that it had a certain content at a given point in time (you don’t have to transmit the full text to such a service BTW – them signing a cryptographic hash is good enough; see the sketch after this list)
  • Digitally signed by the author
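For the timestamping bullet, here is a minimal sketch of the “hash, not the text” idea – Digest::SHA ships with Perl; the post.html filename and the service itself are placeholders:

#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# read the post to be timestamped (the filename is just an example)
my $post = do {
  open(my $fh, '<', 'post.html') or die "cannot open post.html: $!";
  local $/;  # slurp mode
  <$fh>;
};

# this digest is all the timestamping service ever needs to see
my $digest = sha256_hex($post);
print "submit to the timestamping service: $digest\n";

The service signs the pair (digest, current time); later, anyone holding the text can recompute the hash and verify that exactly this content existed at that moment, without the service ever having seen the text itself.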

We all make mistakes. Let’s act like grownups about it. Don’t try to wish things away. I understand that in some circumstances there are legal obligations to take things down, but at least post the takedown reason in those cases (e.g. “this post was taken down because of allegedly defamatory content – sorry”).

Picture taken from sara~’s photostream with permission.

Parsing pcap files with Perl
Fri, 19 Mar 2010 – https://grey-panther.net/2010/03/parsing-pcap-files-with-perl.html

Recently I was reading a blogpost on the BreakingPoint Labs blog about parsing pcap files with Perl and I immediately said to myself: it is impossible that there isn’t a module on CPAN for this, because Perl is great. It turns out I was right: there is Net::TcpDumpLog, which can be combined with the NetPacket family of modules to parse the higher-level protocols. Because example code is rather sparse in the POD pages of the respective modules, here is a small example to illustrate their use:


use strict;
use warnings;
use Net::TcpDumpLog;
use NetPacket::Ethernet;
use NetPacket::IP;
use NetPacket::TCP;

my $log = Net::TcpDumpLog->new();
$log->read("foo.pcap");

foreach my $index ($log->indexes) {
  # per-packet capture header: original/included lengths, drops, timestamp
  my ($length_orig, $length_incl, $drops, $secs, $msecs) = $log->header($index);
  my $data = $log->data($index);

  # peel the layers: Ethernet -> IP -> TCP, skipping everything else
  my $eth_obj = NetPacket::Ethernet->decode($data);
  next unless $eth_obj->{type} == NetPacket::Ethernet::ETH_TYPE_IP;

  my $ip_obj = NetPacket::IP->decode($eth_obj->{data});
  next unless $ip_obj->{proto} == NetPacket::IP::IP_PROTO_TCP;

  my $tcp_obj = NetPacket::TCP->decode($ip_obj->{data});

  # localtime() works in whole seconds; the sub-second part is printed separately
  my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime($secs);
  print sprintf("%02d-%02d %02d:%02d:%02d.%d",
    $mon + 1, $mday, $hour, $min, $sec, $msecs), # localtime()'s month is 0-based
    " ", $eth_obj->{src_mac}, " -> ",
    $eth_obj->{dest_mac}, "\n";
  print "\t", $ip_obj->{src_ip}, ":", $tcp_obj->{src_port},
    " -> ",
    $ip_obj->{dest_ip}, ":", $tcp_obj->{dest_port}, "\n";
}

The code does the following: it opens the pcap file named “foo.pcap”, iterates over all the packets (assuming they are all Ethernet frames) and looks for TCP packets. Finally, it prints out some information about each of them (capture time, source/destination MAC, source/destination IP:port). You can customize it to fit your needs.

Small, somewhat offtopic rant: one should always think at least twice before publishing code which does such elementary things. Find a library and use it. If it doesn’t work, try patching it so that it does and send the code back to the original author. Only if this fails should you start from scratch.

Reusing existing code has many advantages. From your point of view, you get code which has already worked for a couple of people – especially true for Perl modules, which have a strong culture of testing. Also, even “simple” problems like parsing a TCP packet have many corner cases which you will almost certainly miss on the first go; as a result, half of your time will be spent hunting them down and only half dedicated to solving the actual problem (and that’s if you are lucky – if you are unlucky, your code will silently skip over the special cases, which may render your entire analysis irrelevant).

Looking at it from the other side: the more usage concentrates on one way of doing “X”, the more that code gets used, and the more it gets used, the better tested it becomes – a positive feedback loop. Also, if you believe in the open-source ethos (and supposedly you do, since you published your code in the first place), you should consider maximizing the return while minimizing the effort needed.

Picture taken from greyloch’s photostream with permission.

Update: updated NetPacket link – thank you Anonymous.

You go PHD Comics!
Tue, 26 Jan 2010 – https://grey-panther.net/2010/01/you-go-phd-comics.html

PHD Comics is always great and hilarious (and worth subscribing to if you are even vaguely related to the academic world – like through a friend of a friend :-)), but there are those occasions when it is epic, like this one:

The media can almost never be trusted to get things right and we should get into the habit of questioning everything they deliver. Think for yourselves, people: get a grip on basic math and logic and learn how to dig up information!

PS. I’m not “new media”, I’m just a raindrop :-p.

A missed opportunity
Fri, 08 Jan 2010 – https://grey-panther.net/2010/01/a-missed-opportunity.html

The theory of capitalism (and I’m greatly oversimplifying here, I know) says that even if we all follow just our own self-interest, a global “good” will somehow emerge. This is what F-Secure is doing in their blogpost about a specific piece of ransomware which – if you get infected – encrypts your data and demands a certain amount of money to decrypt it.

Trouble is that their only recommendation is to “remind everyone to backup their important files regularly” (coincidentally – sarcasm, sarcasm – they have an online backup component in their suite). They could have at least mentioned that Sunbelt provides a tool which may decrypt the files (I say may, because I didn’t actually try the tool). This is even more inexplicable given the fact that they got the samples from Sunbelt (“Many thanks to Adam Thomas from Sunbelt for providing samples of the dropper”).

Shame on you F-Secure for putting a (possible) financial interest before the interest of your users!

So I don’t know about you, but instead of claiming that pure self-interest is the solution, I will go with:

Everything in moderation – including moderation.

Picture taken from d3stiny_sm4sher’s photostream with permission.

PS. Who wants to bet that – if these claims are brought to F-Secure’s attention – they will claim that they didn’t know about the removal tool?

Update: I’m not singling out F-Secure here; Zarestel Ferrer from CA just made a very similar blogpost: here are the facts (he did include some more technical detail, which is nice for us security geeks), and you should have used a security product to keep it out:

CA advises to keep your security products signature updated to prevent this kind of ransomware.

The plus side: he doesn’t necessarily pimp his company’s product. The minus: he doesn’t link to the Sunbelt decryption tool either. On the plus side, there is a comment facility on their website which visitors could use to mention the tool and thus help out people who lost data; on the negative side, it doesn’t work – not even with IE!

A game of Chinese whispers
Fri, 18 Dec 2009 – https://grey-panther.net/2009/12/a-game-of-chinese-whispers.html

Yet another example of real-life Chinese whispers in security journalism:

A Hungarian online news site published an article titled “Hackers tried to steal user data from Amazon” (here is a somewhat usable automatic translation for the non-Hungarian speakers). I assume that the information went like this:

What happened –> What the security company has written up about it –> What the “journalist” understood –> What s/he actually wrote.

What actually happened is that an Amazon EC2 instance rented by a third party was being used as a C&C server for a botnet. No Amazon user data compromise here, move along (also, this isn’t a new phenomenon at all).

To top it off, the article talks about the security issues involved in cloud computing. Surely they are paid by buzzwords / paragraph :-p.

As if you needed further proof that a large percentage of the news out there is false, even when there is no intent to “spin” it. Never attribute to malice what can be explained by stupidity, I suppose…

Picture taken from bignoseduglyguy’s photostream with permission.

Today’s fudbuster
Mon, 23 Nov 2009 – https://grey-panther.net/2009/11/todays-fudbuster.html

We begin today’s FUD-buster with – applause, please – cyberterrorism, via an “article”: Cyberterrorism: A look into the future. The article talks about Estonia (which is the poster child for “cyber” incidents these days) and says the following, amongst other equally high-quality content (emphasis added):

“The three-week cyberattack on Estonia threatened to black out the country’s digital infrastructure, infiltrating the websites of the nation’s banks and political institutions”

The article cites as its source (hey, at least they cite sources) an equally “well researched” piece from the Telegraph.co.uk which says almost the same thing. Now, I seemed to remember that the Estonia incident was just a large-scale DDoS attack, so I looked around for more reliable sources, like this article on Dark Reading, Authoritatively, Who Was Behind The Estonian Attacks? by Gadi Evron (or see this other article). This confirms what I remembered: it was a large-scale DDoS attack with some minor defacements, but in no way was anyone “infiltrating the websites”.

The second (unrelated, other than the fact that it is an overstatement) quote comes from the Kaspersky blog, where we can read that:

“a vast amount of pirate software nowadays contains trojans, both for the PC and Mac”

This depends very much on your interpretation of “vast amount” (ask me how I know :-P). Of the actual pirated software shared on limited networks like college campuses, very little is infected. What are extremely likely to be malicious are the crack / keygen websites: either they serve exploits directly or they bundle malware with the downloads. Another sneaky method, seen on P2P networks like Gnutella or eDonkey, is to run bots which respond to any search with an executable that contains the searched keywords in its name and is – of course – malicious. So, depending on your interpretation of “vast amount”, the claim doesn’t hold up.

The conclusion, as always: do your own research!

Picture taken from cooljinny’s photostream with permission.
