Continuing with selected sections of Classic Computer Science Problems in Python (CCSPiP) by David Kopec, we have a brief foray into data compression. This post corresponds to Chapter 1 Section 2 in CCSPiP. The source code presented here is available on GitHub.
I won't pretend to know much about genes or bioinformatics, but it is an interesting subject on my list of things to learn about someday. Nucleotides and genes make great content for simple compression. There are four potential values, A, C, G, and T which are used to form genes in DNA. For simplicity, let's assume each nucleotide character in a gene string takes up a byte, which is eight bits. By comparison, as we have only four values, we could store this in binary with two bits - a space savings of 75%. What a bargain!
Unsurprisingly, the first thing we'll look at is compressing the gene and producing a compressed value via an integer.
To compress a string such as
CAGTCGATAGAGTAT, we call
compress/1 and it will return
1389938867. This value can then be used to restore the original value at a later time. The amount of compression occurring is negligible given the small initial value, but that's okay! I doubt many would want to read huge string anyways... for large strings, the compression really kicks in. Note that if you know about DNA and base pairs, the initial value might not make any sense given that nature has "rules" and "laws" and whatnot. Instead of complying, we reject reality and make bioinformatics people cringe with our hand wavy explanations!
The relevant compression code:
import Bitwise, only: :macros @spec compress(String.t) :: integer def compress(gene) do String.upcase(gene) |> String.graphemes() |> Enum.reduce(1, &convert_to_bit/2) end @spec convert_to_bit(String.t, integer) :: integer def convert_to_bit(nucleotide, acc) do shiftedAcc = acc <<< 2 cond do nucleotide == "A" -> shiftedAcc ||| 0 nucleotide == "C" -> shiftedAcc ||| 1 nucleotide == "G" -> shiftedAcc ||| 2 nucleotide == "T" -> shiftedAcc ||| 3 end end
The main tidbits of interest are the various bitwise operations available from the Bitwise Elixir module. Depending on the character, we use a bitwise OR
||| operation against the shifted accumulated value.
compress/1 function splits the string into characters and reduces them into the single returned integer with the
convert_to_bit/2 function. This function uses two bitwise operations, left shift
<<<, and OR
|||. The left shift is used to make space in the accumulator for the new incoming value and the OR adds the necessary bits. So for each character, we move everything over two bits and then OR in the new bits based on the character begin matched in the case statement. For Elixir, it doesn't matter if we represent the bits as decimal versus binary, the Bitwise library handles them all the same.
Note that we use
String.graphemes/1 rather than
String.codepoints/1 but it doesn't really matter here as we assume each character takes one byte and the set of acceptable characters is limited to
ACGT... otherwise the
convert_to_bit/2 function errors out.
Now we can verify our work by decompressing the returned integer.
The code to decompress an integer back into the original gene is somewhat more interesting, but only because I couldn't find existing functions to do what I wanted to do. For instance, calculating the number of bits a given integer actually takes up. There probably exists an obvious and trivial answer, but such is life. This came to a head in the
@spec count_bits(integer, integer) :: integer def count_bits(n, acc \\ 0) def count_bits(n, acc) when n > 0 do count_bits(n >>> 1, acc + 1) end def count_bits(_, acc), do: acc
Basically, we recursively shrink the integer with the right shift bitwise
>>> until it is zero and increment the accumulator for each pass. There's probably a higher level option available, but this worked well enough.
Another pain point came up while trying to shift
i on only even values. The most obvious solution, and the one that mirrors the CCSPiP book, is a stepped range. In Python, this is a built in parameter of their range function... however Elixir ranges don't let us produce a stepped range. One solution is to create a comprehension generator to be used in the
Enum.reduce during decompression:
@spec stepped_bit_range(integer, integer) :: list(integer) def stepped_bit_range(n, step) do half = div(count_bits(n), 2) - 1 Enum.map(0..half, fn x -> x * step end) end
Here we want to create a list based on the value
n. We do this by calculating the number of bits
n actually uses, halve that and produce a list increasing by the value
x * step. For
n = 8 this would produce
[0, 2, 4, 6]. Ultimately, this worked well enough and we could move onto the meat of the problem... turning an integer back into a string.
To decompress the an integer, we utilize the
stepped_bit_range/2 described above and iterate over the provided even values in an
@spec decompress(integer) :: String.t def decompress(value) do # subtract 1 to offset sentinel value during compression range = stepped_bit_range(value - 1, 2) Enum.reduce(range, "", fn(i, acc) -> bits = value >>> i &&& 3 convert_to_char(bits, acc) end) |> String.reverse() end @spec convert_to_char(integer, String.t) :: String.t def convert_to_char(bits, acc) do cond do bits == 0 -> acc <> "A" bits == 1 -> acc <> "C" bits == 2 -> acc <> "G" bits == 3 -> acc <> "T" end end
decompress/1 we bitwise right shift
>>> the original starting
i, which increases each time, and then perform a bitwise AND
&&& on the resulting value. This basically does the opposite of the
compress/1 function. Instead of appending an empty two bits for each character and then adding in (via the
|||) the correct bit value we instead remove bits with
>>> and then extract just the two bits we need with
&&&. Afterwards, we have
convert_to_char/2 do the opposite of
convert_to_bit/2 and append a character to the accumulated string.
One thing of interest, the resulting string is initially backwards! This is mostly to remain consistent with CCSPiP, but we could have instead joined the two strings in the opposite way, like
bits == 0 -> "A" <> acc.
Trivial bits of pain
That's all there is to it. We can confirm the entire process is correct by compressing genes, decompressing them, and confirming the original value matches the ending value. This is a nice property and perfect for some tests utilizing the StreamData library.
Generally speaking, I found this translation from Python to Elixir nontrivial due to self inflicted missteps. It has been quite some time since I dealt with bitwise operations, which slowed the process down. Another, slightly embarrasing, tripping point was that I had the notion to use Bitstrings to produce the binary values. Luckily I eventually realized it was not necessary and could use integers. All in all, the compression covered might be basic compared to more robust current solutions, but the lessons learned were certainly valuable.