CCSPiP

DNA Search with Elixir

We revisit the land of bioinformatics and try to find a codon in a proverbial gene haystack. Linear and binary search are implemented in Elixir to find if a gene contains a specific codon.

Scott Bauer

Oct 12, 2019 • 4 min read

Continuing with selected sections of Classic Computer Science Problems in Python (CCSPiP) by David Kopec, we do some searching with DNA. This post corresponds to Chapter 2 Section 1. The source code presented here is available on GitHub. On an unrelated note, I'd like to thank David Kopec for giving the previous post on encryption a mention on Twitter - we're famous now!

Put your genes and codon on as it is cold outside

Terrible wordplay with this section's title, I know. Whose got time to put your jeans and coat on when you are trying to find if a gene has a codon?

In addition to compression, nucleotides and genes make great content for search problems too. As a refresher, there are four potential values for a nucleotide, A, C, G, and T and a gaggle of nucleotides form a gene.

Furthermore, three nucleotides form a codon and codons are codes for specific amino acids, which when combined with other amino acids form proteins. I've never heard of codons prior to CCSPiP, so I'll just roll with it and blindly take the book at its word. One common task is to check a gene for the existence of a specific codon. We will cover how to do so with the search classics - linear and binary search.

But first, here are some type aliases clarify the code and understand the @spec lines later on:

@type nucleotide :: non_neg_integer

# should only ever have exactly 3 elements
# note that we do not use a tuple like CCSPiP as lists are better suited
@type codon :: list(nucleotide)

@type gene :: list(codon)

In CCSPiP, tuples are used instead of lists with regards to representing gene and codon types. For convenience we'll use lists in Elixir as tuples do not implement the Enumerable protocol. Basically this means we cannot easily leverage the Enum or Stream modules with tuples without first implementing the protocol specifically for tuples.

Codons, Codons Everywhere

Before we can search, we need to consider a minor tidbit particular to this problem. A codon is three consecutive nucleotides so we need to break the gene into all possible codons itself and compare them against the provided codon. Alternatively, maybe we could do some sort of sliding window substring, but let's remain faithful to the CCSPiP approach.

An outrageously convenient way to create all possible codons is to use Enum.chunk_every/3 after we transform the string:

@spec grapheme_to_nucleotide(String.t()) :: nucleotide
defp grapheme_to_nucleotide(nucleotide) do
  nucleotide = String.upcase(nucleotide)

  cond do
    nucleotide == "A" -> 0
    nucleotide == "C" -> 1
    nucleotide == "G" -> 2
    nucleotide == "T" -> 3
  end
end

@spec string_to_nucleotides(String.t()) :: list(nucleotide)
def string_to_nucleotides(str) do
  str
  |> String.graphemes()
  |> Enum.reduce([], &[grapheme_to_nucleotide(&1) | &2])
end

@spec string_to_gene(String.t()) :: gene
def string_to_gene(str) do
  str
  |> string_to_nucleotides()
  |> Enum.chunk_every(3, 1, :discard)
end

One thing to note is that Elixir does not have anything quite like Python's IntEnum for converting nucleotides into integer values, so we have the function grapheme_to_nucleotide/1, which takes a single character.

If you aren't familiar with the shorthand anonymous functions, the line with Enum.reduce/2 could be rewritten as:

|> Enum.reduce([], fn elem, acc ->
  [character_to_nucleotide(elem) | acc]
end)

Now that we have transformed the gene into a list of codons (which are just lists of integers), we can get on to the actual searching.

The Straight Path

The code for the linear search is contained in the linear_contains?/2 function. It is fairly small due to using lists rather than tuples and the Enum module.

@spec linear_contains?(String.t(), String.t()) :: boolean
def linear_contains?(gene, key_codon) do
  gene = string_to_gene(gene)
  codon = string_to_nucleotides(key_codon)

  Enum.any?(gene, &(&1 == codon))
end

The gene and key_codon arguments are strings and we convert those into their respective codon collection and the specific codon we are searching for. From there, we check if the collection of codons from the gene contains the matching codon and we return a boolean accordingly.

Lets Split Up and Search

The binary_contains?/2 function is a bit more involved. One important point is that the collection of codons must be sorted before we can perform a binary search against it. Similar to the linear search, we must transform the strings first. Afterwards, we call binary_search/4 to check for a matching codon.

@spec binary_contains?(String.t(), String.t()) :: boolean
def binary_contains?(gene, key_codon) do
  # must be sorted for binary search to work properly
  sorted_gene = Enum.sort(string_to_gene(gene))
  codon = string_to_nucleotides(key_codon)
  binary_search(sorted_gene, codon, 0, length(sorted_gene) - 1)
end

@spec binary_search(gene, codon, non_neg_integer, non_neg_integer) :: boolean
defp binary_search(gene, key_codon, low, high) when low <= high do
  mid = div(low + high, 2)
  gene_codon = Enum.at(gene, mid)

  cond do
    gene_codon < key_codon -> binary_search(gene, key_codon, mid + 1, high)
    gene_codon > key_codon -> binary_search(gene, key_codon, low, mid - 1)
    gene_codon == key_codon -> true
  end
end

defp binary_search(_, _, low, high) when low > high do
  false
end

The binary search function recursively calls itself until we either find a matching codon or no match. The latter is the second binary_search/4 function where low is greater than high. One nuance to appreciate is that accessing a list via its index is rather inefficient. In Elixir, lists are linked lists so we can end up traversing a large portion of the list to reach that particular element. The functions could be further refined to be more efficient but this approach will suffice for now.

Found It!

In hindsight, this section is rather enjoyable as it had a variety of interesting elements to it... expanding material from a previous section, significant differences between Python and Elixir, and a binary search. Some of the code is reused in future sections of CCSPiP so it will be interesting to see how the code evolves from this initial pass.

If you like what you have read, please consider sharing this post via your preferred social network. When a post is popular it provides a certain sense of achievement and motivates me to write more horrible puns while rambling about a topic. It is quite the win-win situation for everyone. As always, if you see an error, please feel free to reach out.

Photo by Ousa Chea on Unsplash

Put your genes and codon on as it is cold outside

Codons, Codons Everywhere

The Straight Path

Lets Split Up and Search

Found It!

Sign up for more like this.