Press "Enter" to skip to content

Find repeated 10 char sequences in DNA sample

Ajk 0

Find repeated 10 char sequences in DNA sample is just a modified version of Find all repeated substrings of size K, which is a pretty common question that gets asked in interviews. Since I did not find appropriate information about this question online, I decided to take a stab at it.

So first off let’s try to picture what we are dealing with and make some assumptions clear.

  • The DNA sample sequence is a string of length N.
  • The DNA sample sequence is composed of the letters ‘A’, ‘C’, ‘G’, ‘T’ only.
  • Simply printing the repeated substrings is going to be enough.

To put into perspective the sequence we are dealing with should just be a long stream of characters as the picture below shows:

DNA Sequence
DNA Sequence

Given this information, coming up with a quick brute force solution should be relatively simple. Let’s see the following algorithm:

Find repeated 10 char sequences in DNA sample – Brute Force algorithm

  1. For each 10 character sequence in the string.
  2. Compare it with each other 10 character sequence.
  3. If the same print it out.

Since this algorithm has 2 loops, one inside of the other, it is going to have a O(n2) runtime complexity. Given that we are not using any auxiliary data structure, the algorithm is said to take O(1) space complexity. Let us try to quickly put this solution into code. It should not take more than 5 minutes.

Find repeated 10 char sequences in DNA sample – Brute Force solution

As with most interview questions, the brute force solution is not usually the best one. Indeed there should be much better solutions to finding repeated 10-character sequences in a DNA sample. Common approaches that should be immediately be considered are Sorting and HashMaps.

  • Sorting gives us a similar reduced problem with only O(n*log(n)) runtime complexity added. Therefore promising to possibly reduce our overall runtime. Unfortunately for the problem at hand sorting does not make much sense, since it goes against the main ask of the problem, which is to find common subsequences of the existing string.
  • HashMaps on the other hand, provide an easy to use data structures that often does a space-time tradeoff, since it adds a significant space overhead of usually O(n), but in doing so it also simplifies the problem, since you can access any element in O(1) time. Turns out HashMaps could really help us in our specific problem. After all, if we simplify the checking of the 10 char sequence to O(1), then our brute force algorithm could improve to O(n) runtime.

So after this short reasoning we should be able to devise an optimized algorithm:

Find repeated 10 char sequences in DNA sample – Optimized algorithm

  1. For each 10 character sequence in the string.
  2. If the entry exists in the HashMap, print it out.
  3. If it does not exist then put it there.

Note that depending on our problem (whether we print out once or many times duplicate sequences), we can also put a counter in the HashMap, so we only print out the sequence once.

Seems like we found a solution worth coding. So let us do just that. For this solution we will use Java.

Find repeated 10 char sequences in DNA sample – Optimized solution

With this structured thought process, we found the solution in much less time than what we would have were we to just try to find the best solution by throwing random solutions to the problem.

So let us analyze this last solution. The time complexity as we mentioned before is O(n) since we are going through the DNA sequence only once. However the space complexity has been also increased to O(n) since we are using an helper data structure. To be more specific, we are actually using 10n chars and n ints auxiliary space. Which means roughly 24n bytes. With DNA sequences possibly being in the billions this, number could turn out pretty large (possibly up to a gig of memory), so could we do anything to improve on this solution?

Well it turns out some further thought into this process could give us a few solutions that are a bit more optimal. Since they are interchangeable I will mention them both.

Find repeated 10 char sequences in DNA sample – Further optimal ideas

    1. Hashing char sequences. Instead of putting the whole 10 character subsequence in the HashMap, we could just hash it into a smaller version. Supposing that we have 4 to the power of 10 possible sequences, that will mean 2 to the power of 20, which is roughly 3 bytes and could comfortably fit into an integer. Comparing that with the initial storage of 10 characters (20 bytes), we will see that we have made gains of roughly 6x.
    2. Using modified tries. If we have a modified version of a trie to store the 10-character subsequences, we could just check there to see if a substring already exists. This really goes along with the previous concept. Since there are only 4 to the power of 10 possible subsequences and given pointers to some static char references we could really solve this in just around 1 MB of space. This is quite some gain and further more, in some sense we are guaranteeing constant space solutions when n grows really large.

Finally, for these kind of problems a data structure that is usually useful is Suffix Tree. Although a very useful data structure, I decided to not describe it in detail, as I feel like it is probably out of scope for a regular interview. It also might take a lot longer to develop a solution with it, than the allotted time. Said that, if you mention it, you could get some bonus brownie points.

Hope you guys enjoyed taking a look at the solution and analysis of this find repeated 10 char sequences in DNA sample problem… And I’ll see you guys next time 😀

The following two tabs change content below.
If you like one of my posts the best way to support is give it a thumbs up, comment, or share it on social media 🙂