Remembering the OG ad/malware blocking hosts file – https://grey-panther.net/2022/09/remembering-the-og-ad-malware-blocking-hosts-file.html – Tue, 06 Sep 2022 12:53:44 +0000

For the longest time, the first thing I installed on new computers / computers I was asked to “help with” was the MVPS hosts file (archive.org link). I credit this file with keeping many, many computers safe and running the way their owners intended for almost two decades now.
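
For anyone who has never looked inside such a file: it simply maps known ad/malware domains to a non-routable address, so lookups for them go nowhere. The entries below only illustrate the format – they are made-up examples, not lines from the actual MVPS file:

0.0.0.0 ads.example.com
0.0.0.0 tracker.example.net
0.0.0.0 malware-downloads.example.org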

Sadly, it seems like the maintainer might have passed away sometime last year (or is at least gravely ill). From the page:

Folks … sorry for the delay (again) in getting out an update … just got out of the Hospital … I now have some severe health issues to deal with (complete Kidney failure … need a Kidney transplant) plus another operation … large needles inserted into my spine …however I will try to better maintain the MVPS HOSTS file. Well just got back from Hospital again (excessive water in lungs)

If you could … please consider a donation. Thanks to all that contributed … every little bit helps.

https://winhelp2002.mvps.org/hosts.htm (archive.org link)

So, I donated – may it be of some use to them / their family! And I encourage you to do the same if you have benefited from this great file!

As for alternatives, there are several good ones:

  • I now use nextdns.io on the machines/mobile devices I maintain
  • pi-hole is also an alternative
  • Specifically for Windows, HostsMan is a good tool for managing/updating hosts files
  • Browser plugins like uBlockOrigin are also very useful

For the last decade it has been the case – and it continues to be the case, in my opinion – that ad/tracker blocking is the single most effective way to keep devices from being infected with all kinds of malware (and it generally makes web browsing faster too!).

Oracle cloud – https://grey-panther.net/2022/07/oracle-cloud.html – Sun, 24 Jul 2022 17:12:54 +0000

As they say – people don’t use Oracle because the IT department chose it :). This is probably also true for their cloud offering :). Just off the top of my head:

  • Arcane login procedure
    • that doesn’t support 2FA
    • that prompts you to change your (randomly generated, high-entropy, kept in a password manager) password, even though NIST has recommended against this practice for many years
    • which fails to actually log you out (!) – discovered this when I was trying to verify that my updated password worked
  • Machine console sometimes works and sometimes doesn’t
  • Arcane procedure to attach disks to VMs (to be fair: they show the commands in a popup window)
    • And even with these commands one can’t switch the boot disk of a given VM

They have a generous amount of free credit, but I wouldn’t recommend them for production use.

Useful Cloudflare infos – https://grey-panther.net/2022/07/useful-cloudflare-infos.html – Sun, 24 Jul 2022 13:30:35 +0000

Trying to set up Cloudflare Access, it seems that some pieces of information are hard to find:

  • The tunnel communicates over 7844/udp (important in case you want/have a restrictive firewall and/or your cloud provider requires you to configure the node-independent firewall)
  • The authenticated user is specified by the Cf-Access-Authenticated-User-Email header. Other useful headers are Cf-Connecting-Ip and Cf-Ipcountry (see the sketch below for how a backend might read them).
  • To link the authentication with the tunnel you desire, simply configure the “Self-hosted application” on the same (sub)domain as the tunnel.
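
As an illustration of how a backend behind the tunnel could pick up the identity that Access injects, here is a minimal Flask sketch (my own example, not an official Cloudflare one):

from flask import Flask, request

app = Flask(__name__)

@app.route("/")
def whoami():
    # header set by Cloudflare Access once the user has authenticated
    email = request.headers.get("Cf-Access-Authenticated-User-Email", "anonymous")
    # extra context forwarded by Cloudflare
    ip = request.headers.get("Cf-Connecting-Ip", "unknown")
    country = request.headers.get("Cf-Ipcountry", "unknown")
    return f"Hello {email} from {ip} ({country})"

if __name__ == "__main__":
    app.run(port=8080)
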
A fresh start with… WordPress :) – https://grey-panther.net/2022/05/a-fresh-start-with-wordpress.html – Sun, 15 May 2022 13:41:32 +0000

In 2016 I wrote A fresh start with Pelican. And now, 6 years later, I’m writing this. Lots has changed since then and lots has stayed the same. It still fills me with joy to write texts that may be useful to somebody.

So, what’s to like about WordPress? For one, it can do blogs (and websites in general – so I don’t have to keep up with the latest (micro)formats and trust that it handles them reasonably well) and for most usual things (like code highlighting), there are well supported plugins. It’s also F/LOSS software and portable – I must say I quite liked the interview with Matt on FLOSS Weekly.

Another big thing is that it supports comments – something which static websites generally don’t – and the alternatives (like Disqus) don’t respect users’ privacy at the level I would like them to.

So type away your comments! (also, if you’re on the feedburner feed, please switch over to https://grey-panther.net/feed, because who knows how long the former will be around!).

But there are also a couple of things not to like about WordPress – for one, using it, I’m painting a big target on my back (lots of WordPress sites are getting hacked every day). I do believe that I’ve taken reasonable precautions against this (stay tuned for a description on how this is set up!), but it’s a risk.

Also, running dynamic websites is not free (though not astronomically expensive either). My main worry around this is that if I become incapacitated for a longer time, this content will disappear (and one big reason for me to start writing the blog again is to have documentation for my family for such cases – so that they can get technical help to access – and maintain, if they wish – all the digital trinkets I’m creating). Also, stay tuned for my plans around this problem; the short version is that I’m planning to mirror the content periodically to several “free” providers and hope that at least one of the mirrors will be around long enough.

Until the next time!

Image credits to rawpixel.com through PxHere.

An interesting proof for Pythagoras’s theorem – https://grey-panther.net/2017/01/an-interesting-proof-for-pythagorass-theorem.html – Thu, 05 Jan 2017 07:06:00 +0000

I recently saw an interesting proof for Pythagoras’s theorem in the MathHistory series which I wanted to share with y’all 🙂

So, a quick reminder: Pythagoras’s theorem says that if we have a right-angled (90 degree) triangle, then there is the following relation between the lengths of the sides:

a = sqrt(b^2 + c^2) (where a is the length of the longest side) – and vice-versa.

The proof goes like this: let’s rewrite the formula as a^2 = b^2 + c^2. We can interpret this geometrically as: (for a right-angled triangle) the area of the square constructed on the longest side is equal to the sum of the areas of the two squares constructed on the shorter sides.

And now the proof goes as follows:

  • consider a right angled triangle
  • "clone" it 4 times and put it together such that the longer sides form a square. Now the area of the inner square is a^2 while the area of the big square is a^2 + 4*At (At is the area of a triangle)
  • rearrange the triangles as shown. The outer square is still of the same size (the length of its side – b+c – is unchanged) but now its area can be written as b^2 + c^2 + 4*At. Hence a^2 + 4*At = b^2 + c^2 + 4*At, which can be simplified to a^2 = b^2 + c^2, or if you prefer, to a = sqrt(b^2 + c^2). (The algebra is spelled out right below.)
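
Using At = b*c/2 (the area of one right-angled triangle with legs b and c), the algebra is:

(b + c)^2 = a^2 + 4*At
b^2 + 2*b*c + c^2 = a^2 + 4*(b*c/2)
b^2 + 2*b*c + c^2 = a^2 + 2*b*c
b^2 + c^2 = a^2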

I only had one nagging feeling after seeing this proof – how do we know that the first big square constructed is actually a square? Can’t it be that its "edges" are not straight lines, but slightly crooked, like below?

Fortunately we can use the fact that the angles in a triangle add up to 180 degrees (ie. a straight line) and show that the sides of the outer square are indeed straight lines:

Finding the N-th word in a complete dictionary – https://grey-panther.net/2017/01/finding-the-n-th-word-in-a-complete-dictionary.html – Mon, 02 Jan 2017 04:49:00 +0000

Problem statement

Find the N-th word in a dictionary which contains all the words of length at most M that can be generated from a given alphabet (sorted by the conventional dictionary sorting rule / lexicographical order).

As a short detour: why did I become interested in this? It was during my investigation of the upper limit for the number of strings formed from a given alphabet that can be encoded in a given number of bits. Even more concretely: what is the upper limit for the length of a DNA/RNA string formed from nucleotides (ie. a string with alphabet [A,C,G,T]) that can be encoded in 64 bits? Note: the problem actually requires a codec (ie. both enCOding and DECoding), so we’ll solve a somewhat more generic problem than just the one-way lookup described in the title.

The first solution which came to mind was to use some bits for the length and the remaining bits to encode the nucleotides (2 bits / nucleotide); however, the question remained: how many bits for the length? And is the solution optimal?

So finally I came up with the following formulation: consider that we have a dictionary of all the possible nucleotide strings of length at most M. Now let the 64-bit value just be an index into this dictionary. This is guaranteed to be the optimal solution (if we assume that the probability of occurrence for every string is the same). Now we need three things:

  1. what is the largest value of M for which the index can be stored on 64 bits?
  2. a time and space efficient way (ie. not generating the entire dictionary and keeping it in memory for lookup) to get the index of a given string (the enCOde step)
  3. the same to get the word at a given index (the DECode step)

There is also a somewhat related problem on Project Euler (24: Lexicographic permutations) – that wasn’t the inspiration though, I found out about it later.

Some initial observations

Just by writing out the complete set of words of length at most M formed from a given alphabet we can make some observations. For example consider the alphabet [A,B] and write out:

  • the words of length 0: '' (the empty string)
  • the words of length 1: A and B
  • the words of length 2: AA, AB, BA and BB

So pretty quickly we can see that for a given alphabet and a given length we have exactly len(alphabet) ** length possible words (where ** is the exponentiation operator – ie. a ** b is the b-th power of a), since: we have length positions, at each position we can have one of the len(alphabet) characters, thus the total possibilities are len(alphabet) * len(alphabet) * ... length times which is len(alphabet) to power length.

After this we can ask "how many strings of length less than or equal to M are there?" (question 1 from the initial problem statement). This is simply sum(len(alphabet) ** i for i in [0, M]) – a geometric series which sums to (La ** (M + 1) - 1) / (La - 1), where La = len(alphabet).

So for example if we have the alphabet [A, C, G, T] and 64 bits available, we can encode strings of at most 31 characters: the dictionary for M=31 has (4^32 - 1)/3 ≈ 6.1 * 10^18 entries, which still fits into 64 bits, while the one for M=32 no longer does.
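
A quick way to double-check that limit (a small sketch of mine, not part of the original code):

def dict_size(la, max_len):
    # number of words of length at most max_len over an alphabet of size la
    return (la ** (max_len + 1) - 1) // (la - 1)

m = 0
while dict_size(4, m + 1) <= 2 ** 64:
    m += 1
print(m)  # 31 - the largest M whose dictionary still fits into a 64 bit index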

Finding the index of a string

To find this we just need to count how many strings there are in the dictionary before our string (remember the dictionary is in lexicographical order).

A concrete example: our dictionary contains all the words of length at most 3 (M=3) formed from the alphabet [A, B]. What is the index of the word BA? (we consider that index 0 is '' – the empty string, index 1 is A, index 2 is AA and so on).

What is the position of BA in our dictionary?

If we only had words of length exactly K, we could compute this by considering BA a number in base 2 (binary) where A=0 and B=1, transforming it to base 10 and having our answer (ie. BA -> 10b -> 2 -> BA is at position 2 – or is the 3rd word – in the dictionary AA, AB, BA, BB).

However our dictionary contains all words of length exactly 0, 1, 2 and 3. So just consider each in turn!

In a dictionary containing the words from the alphabet [A, B] of exactly length:

  • K=0: BA would have index 1
  • K=1: BA would have index 2 which is the same as indexOf(B) + 1
  • K=2: BA would have index 2
  • K=3: BA would have index 4, which is the same as indexOf(BAA)

Summing these up: 1 + 2 + 2 + 4 = 9, which is indeed the index of BA in the full dictionary with M=3.

So, to find the index of a string:

  • Go from 0 to M (the maximum length allowed for words in our dictionary)
  • Generate a word of length K from our word by either (assuming our strings are zero indexed):
    • Taking the characters 0 to K (exclusive) if K < len(word)
    • Padding the word with the first character of the alphabet up to length K
  • Finding the index of this (sub)word in a dictionary that contains words of length exactly K by considering the (sub)word as a value written in base La (La == len(alphabet)). Add 1 if we’re in the first case, since the shorter prefix itself also comes before our longer word.
  • Sum up all the values

Or in Python 3 code:

def indexOf(self, word):
    assert len(word) <= self.__max_len
    result = 0
    # for every length i in [0, max_len], count how many words of exactly
    # that length come before our word in lexicographic order
    for i in range(0, self.__max_len + 1):
        if i < len(word):
            # shorter lengths: the prefix itself also precedes our word, hence the +1
            subword = word[:i]
            result += self.__valueInBaseN(subword) + 1
        else:
            # lengths >= ours: pad with the first letter of the alphabet
            subword = word + (i - len(word)) * self.__alphabet[0]
            result += self.__valueInBaseN(subword)
    return result

Finding the N-th string

Finally getting at the problem stated in the title. For this I noted how the dictionary can be constructed for length M:

  • the dictionary for M=0 is just '' (the empty string) and for M=1 the empty string plus the alphabet itself.
  • for M>1 take the dictionary for M-1 and prefix it with each of the characters from the alphabet. Finally add the empty string as the first element.

So for example if we have [A, B] as the alphabet then:

  • the dictionary for M=1 is 0: '', 1: A, 2: B
  • to construct the dictionary for M=2 we replicate the above dictionary 2 times, first prefixing it with A, then with B and finally we add the empty string in front:
0: ''  1: A    4: B
       2: AA   5: BA
       3: AB   6: BB

This suggests an algorithm for finding the solution:

  • take the value. Decide which "column" it would be in.
    • you know the number of words in each column: (len(dictionary) - 1) / len(alphabet)
    • len(dictionary) is sum(len(alphabet) ** i for i in [0, K]) (see the initial observations)
    • this can also be precomputed for efficiency
  • the column index gives you the index of the letter in the alphabet
  • now subtract from the value the index of the first word in the given column. If you get 0, stop.
  • otherwise decrease K by one and look up the new value in the dictionary of words of length at most K.

A small worked example:

  • let’s say we have [A, B] as the alphabet and M=2. We want to find the word at index 5 (which is BA if you take a peek at the table above). So:
  • in each column we have 3 words, so 5 falls into the 2nd column (the column with index 1), which gives us "B" as the first letter
  • now subtract 4 (the index of the first word in the 2nd column – B) from 5 which leaves us with 1
  • now find the word with index 1 in a dictionary with M=1 which is "A"
  • thus the final word is "BA"

Or in Python 3 code:

def wordAt(self, index):
    assert 0 <= index <= self.__lastIndex
    result, current_len = '', self.__max_len
    while index > 0:
        # number of words per "column" (per first letter) at this length
        words_per_column = self.__wordsPerLetterForLen[current_len]
        # the column tells us which letter of the alphabet comes next
        column_idx = (index - 1) // words_per_column
        result += self.__alphabet[column_idx]
        # continue the lookup in that column's sub-dictionary
        index_of_first_word_in_col = 1 + column_idx * words_per_column
        index -= index_of_first_word_in_col
        current_len -= 1
    return result
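
The two methods above also rely on a few private helpers (__valueInBaseN, __wordsPerLetterForLen, __lastIndex, __max_len, __alphabet) that are only shown in the linked GitHub repository. To make the snippets self-contained, here is a minimal sketch of what they could look like – the names mirror the methods above, but the bodies are my reconstruction, not the original code:

class WordDictionary:
    def __init__(self, alphabet, max_len):
        self.__alphabet = alphabet
        self.__max_len = max_len
        # size of the dictionary of all words of length at most k, for every k in [0, max_len]
        sizes = [sum(len(alphabet) ** i for i in range(k + 1)) for k in range(max_len + 1)]
        self.__lastIndex = sizes[max_len] - 1
        # number of words per "column" (per first letter) for every maximum length k
        self.__wordsPerLetterForLen = [(s - 1) // len(alphabet) for s in sizes]

    def __valueInBaseN(self, word):
        # interpret the word as a number written in base len(alphabet)
        value = 0
        for ch in word:
            value = value * len(self.__alphabet) + self.__alphabet.index(ch)
        return value

    # indexOf and wordAt from above go here, unchanged

With these in place the worked examples check out:

d = WordDictionary(['A', 'B'], 3)
print(d.indexOf('BA'))  # 9
print(d.wordAt(9))      # BA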

Note: you can find a different algorithm to do the same on math.stackexchange.com, however I found the above to be visually more intuitive.

Can we do it simpler?

So we solved the initial problem (both the one stated in the title and the one which motivated this journey), however it took over a thousand words to describe and justify it. Can we do it simpler? Turns out yes! We just need to abandon our attachment to the lexicographical order and say that as long as we have a bijective encoding and decoding pair with the property decode(encode(word)) == word, we are satisfied.

A simple and efficient function is the transformation of the word from base La (length of alphabet) to base 10 and vice-versa. For example if we have [A, C, G, T] as the alphabet and GAT as the word we can do:

  • encode: 2*(4**2) + 0*(4**1) + 3*(4**0), which is 35
  • decode: 35 is decomposed into powers of 4 as above, and the digits 2, 0, 3 correspond to GAT

Again, the ordering will not be lexicographical (A, AA, AB, ...) but rather a numerical-order kind of thing (A, B, AA, AB, ...), but the algorithm is much simpler and, in the case where La is a power of two, very efficient to implement on current CPUs, since division / remainder can be done using bit-shifts / masking.
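
A minimal sketch of this base-conversion idea (my code, not from the original post). One caveat worth stating explicitly: the positional value alone cannot distinguish, say, A from AA, so this sketch assumes the word length is known at decode time (for example stored in a few separate bits):

ALPHABET = ['A', 'C', 'G', 'T']

def encode(word):
    # treat the word as a number written in base len(ALPHABET)
    value = 0
    for ch in word:
        value = value * len(ALPHABET) + ALPHABET.index(ch)
    return value

def decode(value, length):
    # reverse of encode: peel off base-4 digits, least significant first
    chars = []
    for _ in range(length):
        value, digit = divmod(value, len(ALPHABET))
        chars.append(ALPHABET[digit])
    return ''.join(reversed(chars))

print(encode('GAT'))  # 35
print(decode(35, 3))  # GAT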

More speculation

I didn’t actually want to encode DNA/RNA sequences, but rather mutations/variations, which are pairs of sequences (something like G -> TC or GT -> ''). Now I could just divide the 64 bits into two 32-bit chunks, but the same question as before would arise: is this the optimal way of encoding?

So we reach for the same solution: what if we had a dictionary of all the variants ('' -> A, '' -> AA, ...) and just indexed into it? How would we construct such a dictionary and how would we order it?

Turns out there is an algorithm inspired by the proof that there are the same number of natural numbers as there are rational ones. However that doesn’t give us a way to find the N-th element in the sequence but a Calkin–Wilf sequence does.

So we can have the following algorithm:

  • represent the from -> to pair as two numbers A and B (see the discussion so far for how to do that)
  • use the Calkin-Wilf sequence (combined with the continued fraction formula) to find the index of A/B
  • or conversely use the sequence to transform the index into the A/B fraction and then transform the numerator and denominator into the original sequences

This is just speculation but it should work in theory. Also, it is fairly complicated so perhaps there is a better way to do it by making some simplifying assumptions? (like us eliminating the lexicographic ordering requirement).

Source code

A complete implementation of the above algorithms (with tests!) in Python 3 can be found on GitHub.

The limits of science – https://grey-panther.net/2016/09/the-limits-of-science.html – Thu, 08 Sep 2016 04:44:00 +0000

In a lot of ways Science has become the religion of the day. We can’t go more than a day without hearing / seeing / reading a "news" story about "scientists" saying something about something important. We can’t help but feel dazzled, confused, perplexed, overwhelmed by these announcements. And we have discussions with others:

  • I heard that scientists say that X cures cancer.
  • Didn’t you hear? Scientists announced that X causes cancer.

How is this really different from saying "my shaman said"?

The core principles of science

I’m going to argue that scientific thinking is really in a different league – but it also has many limitations. The core ideas of scientific thinking are:

  • We can observe things and those observations have some relation to the objective reality
  • This objective reality is probabilistically deterministic (ie. if we flip a coin it will land either heads or tails but it generally won’t turn into a unicorn and fly away)
  • Based on our observations we can construct models / hypotheses about cause and effect. While there can be an infinite number of hypotheses, we assume that reality has a simple and elegant structure and thus prefer the hypothesis which explains the largest category of events with the minimal assumptions
  • Scientific hypotheses need to be testable (also called "falsifiable") – ie. they need to be forward looking / have predictive power. For example if I say "objects always fall downwards" then you or I can take a ball, a key, a rock, etc. and verify that indeed, when I let go of it, it falls towards the ground. However such positive outcomes have little value (they increase the likelihood that the hypothesis / theory is true only by a small amount) and the most valuable things are falsifications – when somebody makes a prediction based on the theory but then a different outcome is observed – which most likely means that the theory is false (unless there was an error in the experiment).

This looks an awful lot like "the ten commandments of the science religion", doesn’t it? After all, there is no intrinsic reason to believe these principles. There are only two things in favor of this mindset:

This is how our world seems to work – and thus these principles seem intrinsically true to a lot of us. Then again, this "gut feeling" is no different from being convinced that a given kind of deity governs our lives.

What is different from every other religion/philosophy however is that it contains the framework to extend it. You just observe, theorize and then try to falsify your theory. Voilà! You’re adding to the scientific knowledge. All other religions are closed while the religion of science is open.

Science is not "truth"

You might have observed that the principles – as described in the previous section – have somewhat of a wishy-washy nature. I keep using words like "likely" and "usually".

For example I said "if we flip a coin it will land either heads or tails but it generally won’t turn into a unicorn and fly away". I can’t say for certain that it won’t turn into a unicorn, but so far nobody has reported a case of this happening, while we have observed the coin landing heads or tails a lot of times, so we’ll assign a very small probability to something else happening.

Scientific results are always probabilistic, but that is just life: if I go out tomorrow I might get hit by a car, but I most probably won’t. Note that other religions generally avoid saying anything about "this life/world" and reserve their proclamations for "the other/after life". At least science is trying to make some predictions, even though they sometimes turn out to be false.

This bears repeating: Science is not "truth". There is no such thing as a "scientific fact". Our scientific knowledge is simply a collection of theories which we have failed to falsify – as of yet. Now, some of those theories have been around for a long time and have held up in many situations, and it’s prudent to act as if they were the absolute truth, but we must accept that there is always a small possibility that they might turn out to be false.

To hammer on this idea some more: all of the above is true even if we were to "do science perfectly". However we are humans: subject to our biases, feelings and other motivations.

Science is not math

(a couple of words about induction)

Related to "there is no scientific truth" is the mirage of mathematics in science, which goes something like this: "we all know that 2+2 is 4, this scientist is using a mathematical formula, so whatever comes out of the formula must be true".

No!

Mathematics and science use very different ways of reasoning:

In math we use deduction: we state some ground rules ("axioms") and from those we deduce (prove) all the other statements. All the deduced statements are "true" (if we didn’t make any mistakes) since they are derived from the axioms which we assume to be true. This doesn’t mean however that they necessarily have any relation to the objective reality. (And just a side-note: even if we stay within the bounds of such imaginary systems – ie. don’t try to apply them to the "real world" – we will hit some fundamental limitations ^1)

Science however uses induction: if I see that both an apple and a rock are falling down, I guess that both "falling down" events happen because of the same cause. Such "rules of thumb" ^2 work generally but can give the "wrong" result sometimes (as opposed to the rules of deduction, which are always correct if we accept the initial axioms). Of course for a scientist these cases of "unexpected outcome" are the most interesting ones since they signal an opportunity to learn something new, but they can be most distressing if we think of science as "the source of truth".

So how does science use mathematics? It is just a precise way to describe hypotheses / theories and to manipulate them. Thus we might say that the maximum distance of a thrown ball is described by the equation v^2 / g, and we can make predictions about the distance a ball will travel before throwing it (and verify after the fact that the prediction was – mostly – correct), but the fact that we formulated the theory in mathematical terms does not make it fundamentally more true than any other scientific theory. It is still "the best theory we have for now which isn’t refuted by evidence".

Other limitations of science

Coming back to the idea that science is not the absolute truth even if we were to "do science perfectly":

We don’t do science perfectly:

  • We have biases and when we have a theory we might not look "hard enough" for evidence to refute it
  • The academic structure is not set up to encourage verification of results: publishing replication results or negative results is discouraged (many scientific journals don’t even accept them for publication)
  • Science is entirely probabilistic (it only tells you what probably is true), however statistics (the branch of mathematics which deals with probabilities) is complicated and it’s easy to make mistakes when judging what constitutes probably true (Is Most Published Research Wrong?)
  • In some very interesting fields it is hard to conduct experiments (more on this in the next section)

And as if the above wasn’t enough, there is the problem of communicating the results (the step where "there is a 60% probability that eating a bit of chocolate improves bone density in white women over 50" turns into "Science Fact! Women should eat chocolate daily!").

To give some uplifting news: these issues are known and people are working on them. There have been scientific journals set up to publish replication studies and/or studies with negative results. There is a movement to encourage researchers to pre-register their studies (to state up front what data they’ll collect and how they will analyze it) and publish the results even if they are negative. Finally, some journals require scientists to provide a "simplified abstract" when publishing their research, which journalists can adapt more easily.

However these are hard issues and we can help out by not having unrealistic expectations.

Words about human-centric fields

A couple of final words about the science in fields like health (mental and physical) or economics: these are the fields which are the most important to us but where the difficulties presented multiply:

  • Generally we can’t do randomized double-blind trials where we select a group of people and infect them with AIDS or make them live on less than $1 a day (you know, because we value human life)
  • This means that we can only study people who already are in this situation but that makes it very likely that we confuse cause and effect
  • Even if the experiment is non-intrusive (or we’re doing an observational study) it is very hard to get a diverse set of participants (ie. most psychology experiments are done on young white males in the US – no wonder that they don’t replicate across the world)

Again, people are working on addressing these issues, but it is just one more reason not to believe the "fad" articles published daily and shared virally on social media.

Science is not perfect, but it’s the best that we have.

A fresh start with Pelican – https://grey-panther.net/2016/01/a-fresh-start-with-pelican.html – Mon, 04 Jan 2016 04:15:56 +0000

Here we are in 2016, trying to start blogging again. Using Pelican is more complicated than it needs to be :-(.

On benchmarks – https://grey-panther.net/2014/03/on-benchmarks.html – Fri, 28 Mar 2014 07:17:00 +0000

Numbers every programmer should know and their impact on benchmarks

Disclaimer: I don’t mean to be picking on the particular organizations / projects / people who I’ll mention below. They are just examples of a larger trend I observed.

Sometimes (most of the time?) we forget just how powerful the machines in our pockets / bags / desks are and accept the inefficiencies of the software running on them. When we start to celebrate those inefficiencies, a line has to be drawn though. Two examples:

In 2013 Twitter claimed a record Tweets Per Second (TPS – cute :-)) of ~143k. Let’s round that up to 150k and do some back-of-the-envelope calculations:

  • Communication between the clients and Twitter: a tweet is 140 bytes (240 if we allow for unicode). Let’s multiply the 150k number by 10 (just to be generous – remember that 143k was already a big blip) – we get a bandwidth requirement of 343 MB/sec. Because tweets presumably go over TCP and ~20% of a TCP connection is overhead, you would need 428 MB/s of bandwidth – about 3.5 gigabit, or less than half of a 10 gigabit connection.
  • On the backend: let’s assume we want triple redundancy (1 master + 2 replicas) and that the average tweet goes out to 9 subscribers. This means that internally we need to write each tweet 30 times (we assume a completely denormalized structure, so we also need to write the tweet to each user’s timeline, and do all of this thrice for redundancy). This means 10 GB/sec of data (13 if we’re sending it over the network using TCP).
  • Thus ~100 servers would be able to easily handle the load. And remember this is 10x of the peak traffic they experienced.

So why do they have 20 to 40 times that many servers? This means that less than 10% (!) of their server capacity is actually used for business functions.
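
The same back-of-the-envelope calculation, in a few lines of Python (the sizes and factors are the assumptions listed above):

tweets_per_sec = 150_000 * 10  # 10x the claimed peak, to be generous
tweet_bytes = 240              # unicode tweet
client_bw = tweets_per_sec * tweet_bytes / 2**20  # ~343 MB/s towards the clients
client_bw_tcp = client_bw / 0.8  # ~429 MB/s with ~20% TCP overhead (the post rounds it to 428)
writes_per_tweet = 3 * (1 + 9)   # 3x redundancy; author's timeline + 9 subscribers
backend_bw = tweets_per_sec * tweet_bytes * writes_per_tweet / 2**30  # ~10 GB/s internally
print(round(client_bw), round(client_bw_tcp), round(backend_bw, 1))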

Second example: Google, together with DataStax, came out with a blogpost about benchmarking a 300 node Cassandra cluster on Google Compute Engine. They claim a peak of 1.2M messages per second. Again, let’s do some calculations:

  • The messages were 170 bytes in size. They were written to 2+1 nodes which would mean ~600 MB/s of traffic (730 MB/s if over the network using TCP).
  • They used 300 servers but were also testing the resiliency by removing 1/3 of the nodes, so let’s be generous and say that the volume was divided over 100 servers.

This means that per server we use 7.3 MB/s of network traffic and 6 MB/s of disk traffic, or 6% of a Gigabit connection and about 50% of a medium-quality spinning-rust HDD.

My challenge to you is: next time you see such benchmarks, do a quick back-of-the-envelope calculation and if it uses less than 60% of the available throughput, call the people on it!

Proxying pypi / npm / etc for fun and profit! – https://grey-panther.net/2014/02/proxying-pypi-npm-etc-for-fun-and-profit.html – Wed, 05 Feb 2014 15:26:00 +0000

Package managers for source code (like pypi, npm, nuget, maven, gems, etc) are great! We should all use them. But what happens if the central repository goes down? Suddenly all your continuous builds / deploys fail for no reason. Here is a way to prevent that:

Configure Apache as a caching proxy fronting these services. This means that you can tolerate downtime of the services and you get quicker builds (since you don’t need to contact remote servers). It also has a security benefit (you can firewall off your build server so that it can’t make any outgoing connections) and it’s nice to avoid consuming the bandwidth of those registries (especially since they are provided for free).

Without further ado, here are the config bits for Apache 2.4

/etc/apache2/force_cache_proxy.conf – the general configuration file for caching:

# Security - we don't want to act as a proxy to arbitrary hosts
ProxyRequests Off
SSLProxyEngine On
 
# Cache files to disk
CacheEnable disk /
CacheMinFileSize 0
# cache up to 100MB
CacheMaxFileSize 104857600
# Expire cache in one day
CacheMinExpire 86400
CacheDefaultExpire 86400
# Try really hard to cache requests
CacheIgnoreCacheControl On
CacheIgnoreNoLastMod On
CacheStoreExpired On
CacheStoreNoStore On
CacheStorePrivate On
# If remote can't be reached, reply from cache
CacheStaleOnError On
# Provide information about cache in reply headers
CacheDetailHeader On
CacheHeader On
 
# Only allow requests from localhost
<Location />
        Order Deny,Allow
        Deny from all
        Allow from 127.0.0.1
</Location>
 
<Proxy *>
        # Don't send X-Forwarded-* headers - don't leak local hosts
        # And some servers get confused by them
        ProxyAddHeaders Off
</Proxy>

# Small timeout to avoid blocking the build too long
ProxyTimeout    5

Now with this prepared we can create the individual configurations for the services we wish to proxy:

For pypi:

# pypi mirror
Listen 127.1.1.1:8001

<VirtualHost 127.1.1.1:8001>
        Include force_cache_proxy.conf

        ProxyPass         /  https://pypi.python.org/ status=I
        ProxyPassReverse  /  https://pypi.python.org/
</VirtualHost>

For npm:

# npm mirror
Listen 127.1.1.1:8000

<VirtualHost 127.1.1.1:8000>
        Include force_cache_proxy.conf

        ProxyPass         /  https://registry.npmjs.org/ status=I
        ProxyPassReverse  /  https://registry.npmjs.org/
</VirtualHost>

After configuration you need to enable the site (a2ensite) as well as the needed modules (a2enmod – ssl, cache, cache_disk, proxy, proxy_http).

Finally you need to configure your package manager clients to use these endpoints:

For npm you need to edit ~/.npmrc (or use npm config set) and add registry = http://127.1.1.1:8000/

For Python / pip you need to edit ~/.pip/pip.conf (I recommend also having a download-cache, as per Stavros’s post):

[global]
download-cache = ~/.cache/pip/
index-url = http://127.1.1.1:8001/simple/

If you use setuptools (why!? just stop and use pip :-)), your config is ~/.pydistutils.cfg:

[easy_install]
index_url = http://127.1.1.1:8001/simple/

Also, if you use buildout, the needed config adjustment in buildout.cfg is:

[buildout]
index = http://127.1.1.1:8001/simple/

This is mostly it. If your client is using any kind of local caching, you should clear your cache and reinstall all the dependencies to ensure that Apache has them cached on the disk. There are also dedicated solutions for caching the repositories (for example devpi for python and npm-lazy-mirror for node), however I found them somewhat unreliable and with Apache you have a uniform solution which already has things like startup / supervision implemented and which is familiar to most sysadmins.
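
To check that responses are actually being served from the cache, you can look at the headers added by the CacheDetailHeader / CacheHeader directives – something along these lines (the package path is just an example):

curl -s -D - -o /dev/null http://127.1.1.1:8001/simple/pip/ | grep -i '^x-cache'

The second identical request should show a cache hit.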
