DNA Search with Elixir
We revisit the land of bioinformatics and try to find a codon in a proverbial gene haystack. Linear and binary search are implemented in Elixir to find if a gene contains a specific codon.

Continuing with selected sections of Classic Computer Science Problems in Python (CCSPiP) by David Kopec, we do some searching with DNA. This post corresponds to Chapter 2 Section 1. The source code presented here is available on GitHub. On an unrelated note, I'd like to thank David Kopec for giving the previous post on encryption a mention on Twitter - we're famous now!
Put your genes and codon on as it is cold outside
Terrible wordplay with this section's title, I know. Whose got time to put your jeans and coat on when you are trying to find if a gene has a codon?
In addition to compression, nucleotides and genes make great content for search problems too. As a refresher, there are four potential values for a nucleotide, A, C, G, and T and a gaggle of nucleotides form a gene.
Furthermore, three nucleotides form a codon and codons are codes for specific amino acids, which when combined with other amino acids form proteins. I've never heard of codons prior to CCSPiP, so I'll just roll with it and blindly take the book at its word. One common task is to check a gene for the existence of a specific codon. We will cover how to do so with the search classics - linear and binary search.
But first, here are some type aliases clarify the code and understand the @spec
lines later on:
@type nucleotide :: non_neg_integer
# should only ever have exactly 3 elements
# note that we do not use a tuple like CCSPiP as lists are better suited
@type codon :: list(nucleotide)
@type gene :: list(codon)
In CCSPiP, tuples are used instead of lists with regards to representing gene
and codon
types. For convenience we'll use lists in Elixir as tuples do not implement the Enumerable protocol. Basically this means we cannot easily leverage the Enum
or Stream
modules with tuples without first implementing the protocol specifically for tuples.
Codons, Codons Everywhere
Before we can search, we need to consider a minor tidbit particular to this problem. A codon is three consecutive nucleotides so we need to break the gene into all possible codons itself and compare them against the provided codon. Alternatively, maybe we could do some sort of sliding window substring, but let's remain faithful to the CCSPiP approach.
An outrageously convenient way to create all possible codons is to use Enum.chunk_every/3
after we transform the string:
@spec grapheme_to_nucleotide(String.t()) :: nucleotide
defp grapheme_to_nucleotide(nucleotide) do
nucleotide = String.upcase(nucleotide)
cond do
nucleotide == "A" -> 0
nucleotide == "C" -> 1
nucleotide == "G" -> 2
nucleotide == "T" -> 3
end
end
@spec string_to_nucleotides(String.t()) :: list(nucleotide)
def string_to_nucleotides(str) do
str
|> String.graphemes()
|> Enum.reduce([], &[grapheme_to_nucleotide(&1) | &2])
end
@spec string_to_gene(String.t()) :: gene
def string_to_gene(str) do
str
|> string_to_nucleotides()
|> Enum.chunk_every(3, 1, :discard)
end
One thing to note is that Elixir does not have anything quite like Python's IntEnum for converting nucleotides into integer values, so we have the function grapheme_to_nucleotide/1
, which takes a single character.
If you aren't familiar with the shorthand anonymous functions, the line with Enum.reduce/2
could be rewritten as:
|> Enum.reduce([], fn elem, acc ->
[character_to_nucleotide(elem) | acc]
end)
Now that we have transformed the gene into a list of codons (which are just lists of integers), we can get on to the actual searching.
The Straight Path
The code for the linear search is contained in the linear_contains?/2
function. It is fairly small due to using lists rather than tuples and the Enum
module.
@spec linear_contains?(String.t(), String.t()) :: boolean
def linear_contains?(gene, key_codon) do
gene = string_to_gene(gene)
codon = string_to_nucleotides(key_codon)
Enum.any?(gene, &(&1 == codon))
end
The gene
and key_codon
arguments are strings and we convert those into their respective codon collection and the specific codon we are searching for. From there, we check if the collection of codons from the gene contains the matching codon and we return a boolean accordingly.
Lets Split Up and Search
The binary_contains?/2
function is a bit more involved. One important point is that the collection of codons must be sorted before we can perform a binary search against it. Similar to the linear search, we must transform the strings first. Afterwards, we call binary_search/4
to check for a matching codon.
@spec binary_contains?(String.t(), String.t()) :: boolean
def binary_contains?(gene, key_codon) do
# must be sorted for binary search to work properly
sorted_gene = Enum.sort(string_to_gene(gene))
codon = string_to_nucleotides(key_codon)
binary_search(sorted_gene, codon, 0, length(sorted_gene) - 1)
end
@spec binary_search(gene, codon, non_neg_integer, non_neg_integer) :: boolean
defp binary_search(gene, key_codon, low, high) when low <= high do
mid = div(low + high, 2)
gene_codon = Enum.at(gene, mid)
cond do
gene_codon < key_codon -> binary_search(gene, key_codon, mid + 1, high)
gene_codon > key_codon -> binary_search(gene, key_codon, low, mid - 1)
gene_codon == key_codon -> true
end
end
defp binary_search(_, _, low, high) when low > high do
false
end
The binary search function recursively calls itself until we either find a matching codon or no match. The latter is the second binary_search/4
function where low
is greater than high
. One nuance to appreciate is that accessing a list via its index is rather inefficient. In Elixir, lists are linked lists so we can end up traversing a large portion of the list to reach that particular element. The functions could be further refined to be more efficient but this approach will suffice for now.
Found It!
In hindsight, this section is rather enjoyable as it had a variety of interesting elements to it... expanding material from a previous section, significant differences between Python and Elixir, and a binary search. Some of the code is reused in future sections of CCSPiP so it will be interesting to see how the code evolves from this initial pass.
If you like what you have read, please consider sharing this post via your preferred social network. When a post is popular it provides a certain sense of achievement and motivates me to write more horrible puns while rambling about a topic. It is quite the win-win situation for everyone. As always, if you see an error, please feel free to reach out.