Problem statement
Find the N
-th word in a dictionary which contains all the words that can be generated from a given alphabet of length at most M
(and sorted by the conventional dictionary sorting rule / lexicographical order).
As a short detour: why did I become interested in it? It was during my investigation of the upper limit for the number of strings formed from a given alphabet that can be encoded in a given number of bits. Even more concretely: what is the upper limit for the length of a DNA/RNA string formed from nucleotides (ie. a string with alphabet [A,C,G,T]
) that can be encoded on 64 bits. Note: the problem statement that we need a codec (ie. both enCOding and DECoding, so we’ll solve a bit more generic problem than just the one-way one described in the title).
The first solution which came to mind was to use some bits for the length and the remaining bits to encode the nucleotides (2 bit / nucleotide) however the question remained: how many bits for the length? And is the solution optimal?
So finally I came up with the following formulation: consider that we have a dictionary of all the possible nucleotide strings for length at most M
. Now let the 64 bit value just be an index in this dictionary. This is guaranteed to be the optimal solution (if we assume that the probability of occurrence for every string is the same). Now we need three things:
- what is the largest value of
M
for which the index can be stored on 64 bits? - a time and space efficient way (ie. not generating the entire dictionary and keeping it in memory for lookup) to get the index of a given string (the enCOde step)
- the same to get the word at a given index (the DECode step)
There is also a somewhat related problem on Project Euler (24: Lexicographic permutations) – that wasn’t the inspiration though, I found out about it later.
Some initial observations
Just by writing out the complete set of words of length at most M
formed from a given alphabet we can make some observations. For example consider the alphabet [A,B]
and write out:
- the words of length 0:
''
(the empty string) - the words of length 1:
A
andB
- the words of length 2:
AA
,AB
,BA
andBB
So pretty quickly we can see that for a given alphabet and a given length we have exactly len(alphabet) ** length
possible words (where **
is the exponentiation operator – ie. a ** b
is the b-th power of a), since: we have length
positions, at each position we can have one of the len(alphabet)
characters, thus the total possibilities are len(alphabet) * len(alphabet) * ...
length
times which is len(alphabet)
to power length
.
After this we can ask "how many strings of length less than or equal to M
are there"? (question 1 from the initial problem statement). This is simply sum(len(alphabet) ** i for i in [0, M])
, also known as the geometric progression: (1 - La ** M) / 1 - La
where La = len(alphabet)
.
So for example if we have the alphabet [A, C, G, T]
and 64 bits available we can encode at most 32 characters according to Wolfram Alpha.
Finding the index of a string
To find this we just need to count how many strings there are in the dictionary before our string (remember the dictionary is in lexicographical order).
A concrete example: our dictionary contains all the words of length at most 3 (M=3
) formed from the alphabet [A, B]
. What is the index of the word BA
? (we consider that index 0 is ''
– the empty string, index 1 is A
, index 2 is AA
and so on).
What is the position of BA
in our dictionary?
If we would only have words of length exactly K
we could compute this by considering BA
a number in base 2 (binary) where A=0
and B=1
, transform it to base 10 and have our answer (ie BA
-> 10b
-> 2
-> BA
is at position 2 – or is the 3rd word – in the dictionary AA, AB, BA, BB
).
However our dictionary contains all words of length exactly 0, 1, 2 and 3. So just consider each in turn!
In a dictionary containing the words from the alphabet [A, B]
of exactly length:
K=0
:BA
would have index 1K=1
:BA
would have index 2 which is the same asindexOf(B) + 1
K=2
:BA
would have index 2K=3
:BA
would have index 10, which is the same asindexOf(BAA)
So, to find the index of a string:
- Go from 0 to
M
(the maximum length allowed for words in our dictionary) - Generate a word of length
K
from our word by either (assuming our strings are zero indexed):- Taking the characters
0
toK
(exclusive) ifK < len(word)
- Padding the word with the first character of the alphabet up to length
K
- Taking the characters
- Finding the index of this (sub)word in a dictionary that contains words of length exactly
K
by considering the (sub)word as a value written in baseLa
(La == length(alphabet)
). Add 1 if we’re in the first case since the longer word would come after the shorter ones. - Sum up all the values
Or in Python 3 code:
def indexOf(self, word):
assert len(word) <= self.__max_len
result = 0
for i in range(0, self.__max_len + 1):
if i < len(word):
subword = word[:i]
result += self.__valueInBaseN(subword) + 1
else:
subword = word + (i - len(word)) * self.__alphabet[0]
result += self.__valueInBaseN(subword)
return result
Finding the N-th string
Finally getting at the problem stated in the title. For this I noted how the dictionary can be constructed for length M
:
- the dictionary for
M=0
is just''
(the empty string) and forM=1
the empty string plus the alphabet itself. - for
M>1
take the dictionary forM-1
and prefix it with each of the characters from the alphabet. Finally add the empty string as the first element.
So for example if we have [A, B]
as the alphabet then:
- the dictionary for
M=1
is0: '', 1: A, 2: B
- to construct the dictionary for
M=2
we replicate the above dictionary 2 times, first prefixing it withA
, then withB
and finally we add the empty string in front:
0: '' 1: A 4: B
2: AA 5: BA
3: AB 6: BB
This suggests an algorithm for finding the solution:
- take the value. Decide in "column" it would be.
- you know the number of words in each column:
len(dictionary) - 1 / len(alphabet)
len(dictionary)
issum(len(alphabet) ** i for i in [0, K])
(see the initial observations)- this can also be precomputed for efficiency
- you know the number of words in each column:
- the column index gives you letter index in the alphabet
- now subtract from the value the index of the first word in the given column. If you get 0, stop.
- Otherwise make
K
one less and look up the new value in the dictionary of length at mostK
.
A small worked example:
- lets say we have
[A, B]
as the alphabet andM=2
. We want to find the word at 5 (which isBA
if you take a peak at the table above). So: - in each column we have 3 words, so 5 is in the 2nd row (the row with index 1) which gives us "B" as the first letter
- now subtract 4 (the index of the first word in the 2nd column –
B
) from 5 which leaves us with 1 - now find the word with index 1 in a dictionary with
M=1
which is "A" - thus the final word is "BA"
Or in Python 3 code:
def wordAt(self, index):
assert 0 <= index <= self.__lastIndex
result, current_len = '', self.__max_len
while index > 0:
words_per_column = self.__wordsPerLetterForLen[current_len]
column_idx = (index - 1) // words_per_column
result += self.__alphabet[column_idx]
index_of_first_word_in_col = 1 + column_idx * words_per_column
index -= index_of_first_word_in_col
current_len -= 1
return result
Note: you can find a different algorithm to do the same on math.stackexchange.com, however I found the above to be visually more intuitive.
Can we do it simpler?
So we solved the initial problem (both the one stated in the title and the one which motivated this journey) however it took over a thousand words to describe and justify it. Can we do simpler? Turns out yes! We just need abandon our attachment to the lexicographical order and say that as long as we have a bijective encoding and decoding function with the property decode(encode(word)) == word
we are satisfied.
A simple and efficient function is the transformation of the word from base La
(length of alphabet) to base 10 and vice-versa. For example if we have [A, C, G, T]
as the alphabet and GAT
as the word we can do:
- encode:
2*(4**2) + 0*(4**1) + 3*(4**0)
which is 33 - decode: 33 is written as powers of 4 as above and 2, 0, 3 corresponds to
GAT
Again, the ordering will not be lexicographical (A, AA, AB, ...
) but rather a numerical-order kind-of (A, B, AA, AB, ...
) but the algorithm is much simpler and in the case that La
is a power of two, very efficient to implement on current CPUs since division / remainder can be done using bit-shifts / masking.
More speculation
I didn’t actually want to encode DNA/RNA sequences, but rather mutations/variations which are pair of sequences (something like G -> TC
or GT -> ''
). Now I could just divide the 64 bits into two 32 bit chunks but the same initial question would arise: is this the most optimal way for encoding?
So we go the same solution: what if we would have a dictionary of all the variants ('' -> A, '' -> AA, ...
) and just index into it. How would we construct such a dictionary and how would we order it?
Turns out there is an algorithm inspired by the proof that there are the same number of natural numbers as there are rational ones. However that doesn’t give us a way to find the N-th element in the sequence but a Calkin–Wilf sequence does.
So we can have the following algorithm:
- represent the pair
to -> from
as two numbersA
andB
(refer to the discussion until now how we can do that) - use the Calkin-Wilf sequence (combined with the continued fraction formula) to find the index of
A/B
- or conversely use the sequence to transform the index into the
A/B
fraction and then transform the numerator and denominator into the original sequences
This is just speculation but it should work in theory. Also, it is fairly complicated so perhaps there is a better way to do it by making some simplifying assumptions? (like us eliminating the lexicographic ordering requirement).
Source code
A complete implementation of the above algorithms (with tests!) in Python 3 can be found on GitHub.