DNA Search with Elixir
We revisit the land of bioinformatics and try to find a codon in a proverbial gene haystack. Linear and binary search are implemented in Elixir to find if a gene contains a specific codon.
Continuing with selected sections of Classic Computer Science Problems in Python (CCSPiP) by David Kopec, we do some searching with DNA. This post corresponds to Chapter 2 Section 1. The source code presented here is available on GitHub. On an unrelated note, I'd like to thank David Kopec for giving the previous post on encryption a mention on Twitter - we're famous now!
Put your genes and codon on as it is cold outside
Terrible wordplay with this section's title, I know. Whose got time to put your jeans and coat on when you are trying to find if a gene has a codon?
In addition to compression, nucleotides and genes make great content for search problems too. As a refresher, there are four potential values for a nucleotide, A, C, G, and T and a gaggle of nucleotides form a gene.
Furthermore, three nucleotides form a codon and codons are codes for specific amino acids, which when combined with other amino acids form proteins. I've never heard of codons prior to CCSPiP, so I'll just roll with it and blindly take the book at its word. One common task is to check a gene for the existence of a specific codon. We will cover how to do so with the search classics - linear and binary search.
But first, here are some type aliases clarify the code and understand the
@spec lines later on:
@type nucleotide :: non_neg_integer # should only ever have exactly 3 elements # note that we do not use a tuple like CCSPiP as lists are better suited @type codon :: list(nucleotide) @type gene :: list(codon)
In CCSPiP, tuples are used instead of lists with regards to representing
codon types. For convenience we'll use lists in Elixir as tuples do not implement the Enumerable protocol. Basically this means we cannot easily leverage the
Stream modules with tuples without first implementing the protocol specifically for tuples.
Codons, Codons Everywhere
Before we can search, we need to consider a minor tidbit particular to this problem. A codon is three consecutive nucleotides so we need to break the gene into all possible codons itself and compare them against the provided codon. Alternatively, maybe we could do some sort of sliding window substring, but let's remain faithful to the CCSPiP approach.
An outrageously convenient way to create all possible codons is to use
Enum.chunk_every/3 after we transform the string:
@spec grapheme_to_nucleotide(String.t()) :: nucleotide defp grapheme_to_nucleotide(nucleotide) do nucleotide = String.upcase(nucleotide) cond do nucleotide == "A" -> 0 nucleotide == "C" -> 1 nucleotide == "G" -> 2 nucleotide == "T" -> 3 end end @spec string_to_nucleotides(String.t()) :: list(nucleotide) def string_to_nucleotides(str) do str |> String.graphemes() |> Enum.reduce(, &[grapheme_to_nucleotide(&1) | &2]) end @spec string_to_gene(String.t()) :: gene def string_to_gene(str) do str |> string_to_nucleotides() |> Enum.chunk_every(3, 1, :discard) end
One thing to note is that Elixir does not have anything quite like Python's IntEnum for converting nucleotides into integer values, so we have the function
grapheme_to_nucleotide/1, which takes a single character.
If you aren't familiar with the shorthand anonymous functions, the line with
Enum.reduce/2 could be rewritten as:
|> Enum.reduce(, fn elem, acc -> [character_to_nucleotide(elem) | acc] end)
Now that we have transformed the gene into a list of codons (which are just lists of integers), we can get on to the actual searching.
The Straight Path
The code for the linear search is contained in the
linear_contains?/2 function. It is fairly small due to using lists rather than tuples and the
@spec linear_contains?(String.t(), String.t()) :: boolean def linear_contains?(gene, key_codon) do gene = string_to_gene(gene) codon = string_to_nucleotides(key_codon) Enum.any?(gene, &(&1 == codon)) end
key_codon arguments are strings and we convert those into their respective codon collection and the specific codon we are searching for. From there, we check if the collection of codons from the gene contains the matching codon and we return a boolean accordingly.
Lets Split Up and Search
binary_contains?/2 function is a bit more involved. One important point is that the collection of codons must be sorted before we can perform a binary search against it. Similar to the linear search, we must transform the strings first. Afterwards, we call
binary_search/4 to check for a matching codon.
@spec binary_contains?(String.t(), String.t()) :: boolean def binary_contains?(gene, key_codon) do # must be sorted for binary search to work properly sorted_gene = Enum.sort(string_to_gene(gene)) codon = string_to_nucleotides(key_codon) binary_search(sorted_gene, codon, 0, length(sorted_gene) - 1) end @spec binary_search(gene, codon, non_neg_integer, non_neg_integer) :: boolean defp binary_search(gene, key_codon, low, high) when low <= high do mid = div(low + high, 2) gene_codon = Enum.at(gene, mid) cond do gene_codon < key_codon -> binary_search(gene, key_codon, mid + 1, high) gene_codon > key_codon -> binary_search(gene, key_codon, low, mid - 1) gene_codon == key_codon -> true end end defp binary_search(_, _, low, high) when low > high do false end
The binary search function recursively calls itself until we either find a matching codon or no match. The latter is the second
binary_search/4 function where
low is greater than
high. One nuance to appreciate is that accessing a list via its index is rather inefficient. In Elixir, lists are linked lists so we can end up traversing a large portion of the list to reach that particular element. The functions could be further refined to be more efficient but this approach will suffice for now.
In hindsight, this section is rather enjoyable as it had a variety of interesting elements to it... expanding material from a previous section, significant differences between Python and Elixir, and a binary search. Some of the code is reused in future sections of CCSPiP so it will be interesting to see how the code evolves from this initial pass.
If you like what you have read, please consider sharing this post via your preferred social network. When a post is popular it provides a certain sense of achievement and motivates me to write more horrible puns while rambling about a topic. It is quite the win-win situation for everyone. As always, if you see an error, please feel free to reach out.