Last year I had the good fortune of taking an undergraduate algorithms course with Larry Ruzzo. One of my favorite parts of the course was applying dynamic program to approximate a real-world problem in the field of computational biology: RNA secondary structure prediction. I ended up implementing the algorithm in a couple languages (I was learning Nim at the time) and swore I would write it up, so this post will give an introduction to the Nussinov algorithm, along with some of an implementation in Nim.
- The problem context
- Subproblems and optimal substructure
The problem context
Disclaimer: I am a computer scientist1, not a biologist. I am sure that the basic Nussinov algorithm is a drastic simplification of real-world computational biology, so if you would like to learn more about the biology side of this, ask your favorite (computational) biologist.
Most of us are familiar with the structure of DNA: a double helix that consists of the bases A, C, T, and G. The double helix is formed because the bases in each strand are paired with their corresponding bases (A/T and C/G) in the other strand – in other words, each strand in DNA is sort of an “inverse” of the other. RNA, on the other hand, is single stranded, so does not have a “natural” structure like DNA. However, the complementary bases (A/U and C/G) still have affinities for each other, so strands of RNA end up folding in on themselves to form a unique secondary structure. Knowledge of this secondary structure gives us insight into the behavior of the molecule, so it is very handy to be able to predict it.
Predicting the secondary structure is also a little tricky: any complementary bases can pair up with each other (with some restrictions we will detail below), but physics tells us that the eventual scondary structure will approximate the one with the optimum total free energy. Total free energy can be approximated by the number of pairings that form the secondary structure, so the “optimal” structure is approximately the structure with the largest number of pairings. Remember that a pairing is defined by two complementary bases “sticking” to each other.
How do we find the secondary structure with the largest number of pairings? This is where the Nussinov algorithm comes in. First, let’s formalize the idea of a pairing.
What is a pairing?
Let’s define a pairing in somewhat formal terms2. We can view a single-stranded RNA molecule as a sequence of symbols drawn from the alphabet . Let represent our RNA strand, where each . Pairings must form between complementary bases, and each base may pair with at most one other base (i.e. the base pairs form a matching). Using the notation from Algorithm Design, a base pairing is some set in which . Base pairings must also adhere to the following conditions:
- No sharp turns. There must be at least 4 bases between any base pair. Formally, if , then .
- No crossing. If and are in , then we must not have the case that . Intuitively, this means a base within any given pairing cannot pair with a base outside that pairing.
The naive solution
It is worth taking a moment to think about what a naive algorithm algorithm for this would look like: evaluate each possible set of pairings, and from the ones that follow the rules we’ve laid out (complementary bases, matching, no sharp turns, non-crossing), pick the largest set. There are on the order of pairings3, and each set can contain any number of those pairings. This leaves us with somewhere in the neighborhood of pairings, which is… big. big. As the number of bases grows, this solution quickly becomes infeasible. Clearly, we are going to have to do better.
Subproblems and optimal substructure
Intuition on subproblem decomposition
Let’s say we know the optimal solution (i.e. the largest set of pairings that follow the rules) for an RNA strand with bases. Does this knowledge make it any easier to find the solution for an RNA strand with length ? Assume that the first bases are the same: we have just added the base .
The simplest case is that the does not pair with any previous bases in the optimal solution, so the optimal solution is unchanged from the -base solution. Otherwise, will pair with some previous base . Due to the noncrossing condition, the optimal solution for this new strand will consist of the pairing and the optimal solutions for and . We’ve broken down the problem into subproblems.
------------------------------- | | * * * * * * * * * * * * * 1 2 t-1 t t+1 n n+1
First approach to dynamic programming
Dynamic programming is a great fit for this problem, since we have shown that the problem has optimal substructure: the optimal solution for the -length problem can be constructed from the optimal solutions of its subproblems. Let’s try to formalize this a little bit. Our first approach may be to define some function , which gives us the number of base pairings in the optimal solution for the strand . In this case, would be the number of pairings in the optimal solution for the whole strand.4
When we make a pairing for and , as discussed earlier, our solution consists of that pairing combined with the optimal solutions for and . The function we defined will only give us the optimal solution for , but we have no information about the subproblem. We will have to introduce another variable.
Introducing another variable
Let’s define a new function which represents the number of pairings in the optimal solution from to . Both of our subproblems can be represented by this function, and we can define it as a recurrence:
We maximize over , constrained to the values of that follow the rules for allowable pairings we defined above (no sharp turns, complementary bases, max one pairing per base, noncrossing). In English, we can understand this as saying that the optimal solution for the strand is either:
- The optimal solution for the strand (i.e. doesn’t get paired), or
- The pairing for the that follows the rules and maximizes the number of pairings in and , plus those optimal solutions to and .
We can now find the optimal solution in time: thats values of to calculate, and (approximately) possible values of to look over for each value of . This is much better than our previous exponential-time solution.
As is usually the case in dynamic program, we need to build up solutions to subproblems in an order such that each later subproblem consists of previous, already-calculated subproblems. Looking at our definition of above, we can see that each calculation of relies on subproblems strictly smaller than . If we calculate values for in order of ascending strand size, we should be good.
Our base cases will be those strands that are too short to contain any pairings. By the “no sharp turns” rule, we can declare that for all . We will then build up solutions starting at and going up to (for a strand of length ). In Nim, this looks something like the folllowing:
# calculate the OPT array for a given set of bases proc nussinov(bases: seq[RNABase]): seq[seq[int]] = let n = bases.len result = newSeqWith(n, newSeq[int](n)) # Initialize OPT[i,j] = 0 where i >= j - 4 for i in 0..<n: for j in 0..<n: if i >= j - 4: result[i][j] = 0 # Calculate progressively larger subproblems according # to the recurrence given in the textbook for k in 5..<n: for i in 0..<(n-k): let j = i + k var max_val = result[i][j-1] for t in i..<(j-4): if complementary(bases[t], bases[j]): let before_val = if t > 0: result[i][t-1] else: 0 let after_val = if j > 0: result[t+1][j-1] else: 0 let cur_val = 1 + before_val + after_val max_val = max(max_val, cur_val) result[i][j] = max_val
A matrix is a great way to represent what’s going on as this code runs, so let’s draw it out. Our matrix is “truncated” ( starts at 6) because we only need to consider subproblems of and up (by the no sharp turns rule again). Let’s see what happens as we fill it in.
Sequence: ACCGGUAGU Fill in values for k = 5 4 |0 0 0 4 |0 0 0 0 3 |0 0 3 |0 0 1 2 |0 2 |0 0 i = 1 | i = 1 |1 -------- -------- j = 6 7 8 9 j = 6 7 8 9 k = 6 k = 7 4 |0 0 0 0 4 |0 0 0 0 3 |0 0 1 1 3 |0 0 1 1 2 |0 0 1 2 |0 0 1 1 i = 1 |1 1 i = 1 |1 1 1 -------- -------- j = 6 7 8 9 j = 6 7 8 9 k = 8 4 |0 0 0 0 3 |0 0 1 1 2 |0 0 1 1 i = 1 |1 1 1 2 -------- j = 6 7 8 9
The value in the lower right tells us that the optimal solution has two pairings.
We know how to calculate the matrix, but weren’t we looking for a solution that actually gave us the pairings? To get the pairings, we have to backtrack through our matrix – another common idea in dynamic programming. The pseudocode looks like this:
function traceback(i, j): if i >= j: return if OPT[i][j] == OPT[i][j-1]: # optimal solution does not contain (i, j) traceback(i, j - 1) else: find the t that maximizes 1 + OPT(i, t - 1) + OPT(t + 1, j) mark (t, j) as a pairing traceback(i, t - 1) traceback(t + 1, j)
This traceback has an runtime. If we wanted to improve the runtime of the traceback, we could make space tradeoffs and store the actual optimal pairings (rather than just the number of them) in the matrix. However, this doesn’t improve the overall asymptotic runtime of the full solution, since the step still dominates the step.
Technically, I have no credentials – my graduation date is in about 5 weeks. ↩
Many of these pairings will not be valid by our rules, but I don’t believe this changes the asymptotic bound. ↩
Note that gives us the optimal number of pairings for the strand from to , not the pairings themselves. This is okay – later, we will trace back through our calculated values of to recover the optimal solution. ↩