02-9-2014, 12:07 PM | #21 | |
Batch Manager
Game Manager, Song Release Coordinator
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,855
|
Re: Python pattern matching algorithm tuple error
Quote:
I'm lost in that block of code you wrote though. Like I mentioned I'm new to python so seeing all that code is like drinking out of a fire hose; mind giving some comments please?
__________________
Last edited by DossarLX ODI; 02-9-2014 at 12:09 PM.. |
|
02-9-2014, 12:15 PM | #22 |
x'); DROP TABLE FFR;--
Join Date: Nov 2010
Posts: 6,332
|
Re: Python pattern matching algorithm tuple error
I mean right now I'm using a memoization class which is the "lazy man's" way to use dynamic programming in Python (and in this case it's memoizing a lot more than it needs to, but I digress -- this was just to ensure the score's right and recursion is easy to code).
Basically, at each step of the DP, it takes the max of three possibilities: 1. Inserting a gap in the second sequence, incurring the g-cost, and then continuing onward 2. Inserting a gap in the first sequence, incurring the g-cost, and then continuing onward 3. Not inserting a gap, so taking the score of the base-pair, and continuing onward If we run out of bases in a particular string, then we just incur gap-costs for the rest (however long the remaining string is) Since it's constantly taking the max of all three possibilities at each step, and recursing at each step, you get the maximum possible score from all possible ways of inserting gaps/not inserting gaps every which way. For really long strings, this could take forever, but that's what DP is for: Memoization (in this case) is just caching results at intermediate steps so they don't have to get re-calculated over and over again (which is the heart of DP). For instance, if I asked you to add 1+2+3+4+5, you'd tell me "15!" Now if I asked, "Okay, now what is 1+2+3+4+5+10", you'd very quickly tell me "25!" In this case, you knew it was just 10+your previous answer, so you didn't re-calculate "1+2+3+4+5+10" from scratch. Memoization is basically the same thing. Disclaimer: not a CS expert, so I'm sure someone will come along and point out all the flaws in what I've said Last edited by Reincarnate; 02-9-2014 at 12:25 PM.. |
02-9-2014, 12:28 PM | #23 |
x'); DROP TABLE FFR;--
Join Date: Nov 2010
Posts: 6,332
|
Re: Python pattern matching algorithm tuple error
Let's say we had the two sequences:
s1 = "ATG" s2 = "TG" so there are three options: 1. insert a gap in the second sequence, continue with the rest: "A" + "TG" "-" + "TG" so here we take the gap cost of -5 and continue with the two substrings "TG" and "TG" 2. insert a gap in the first sequence, continue with the rest: "-" + "ATG" "T" + "G" so here we take the gap cost of -5 and continue with the two substrings "ATG" and "G" 3. Take the current base pair, continue with the rest "A" + "TG" "T" + "G" so here we take whatever matrix["A"]["T"] is and continue with the two substrings "TG" and "G" And so each time we "continue with substrings" we are performing the same DP calculation on smaller sub-problems. Every time we encounter a new subproblem, we're always taking the max of these three options. As there are many ways to arrive at the same sub-problems, we cache the results as we calculate stuff so we can re-use them again later if we encounter them. |
02-9-2014, 12:29 PM | #24 |
Batch Manager
Game Manager, Song Release Coordinator
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,855
|
Re: Python pattern matching algorithm tuple error
So essentially you are remembering which position of the matrix you are in so when you consider the maximum of three possibilities you can just start from that index again. That makes sense.
Also, that means you're splitting the string into a smaller string each time you do that correct? Comments in that code would be very useful because I'm getting the syntax for Python down to see what your code is actually saying. matrix[i,1:] # Take ith row, go from column 1 to the last column for each row matrix[0] = g # entire first row all gap penalties? matrix[:,0] = g # make first column all gap penalties? string[::-1] # Reverse string alphabet.index( char ) # Turn character into an integer based on alphabet values # [x,y] is interpreted as (x,y) # [x][y] is the index in the matrix
__________________
Last edited by DossarLX ODI; 02-9-2014 at 12:35 PM.. |
02-9-2014, 12:39 PM | #25 |
x'); DROP TABLE FFR;--
Join Date: Nov 2010
Posts: 6,332
|
Re: Python pattern matching algorithm tuple error
does this help
Code:
class memoize: def __init__(self, fn): self.fn = fn self.memo = {} def __call__(self, *args, **kwds): import cPickle str = cPickle.dumps(args, 1)+cPickle.dumps(kwds, 1) if not self.memo.has_key(str): self.memo[str] = self.fn(*args, **kwds) return self.memo[str] #memoization decorator, so any time dpScore(args) is called with arguments we've seen before, #we don't re-calculate it -- we return the answer that we cached for it @memoize def dpScore( mat, baseMap, seq1, seq2, g = -5 ): if len(seq1)*len(seq2)==0: #if one or the other is length 0, i.e. an empty string return g*abs(len(seq1)-len(seq2)) #then we incur gaps for the length of the difference #note that if both strings are length 0, no gap costs are incurred #seq1[1:] means "everything but the first character of seq1" #This is (iirc) called a string-slice operation. #edit: here http://techearth.net/python/index.php5?title=Python:Basics:Slices possibleScore1 = g + dpScore( mat, baseMap, seq1[1:], seq2, g ) possibleScore2 = g + dpScore( mat, baseMap, seq1, seq2[1:], g ) #these two options reflect using a gap at the front of seq2 and seq1, respectively. possibleScore3 = mat[baseMap[seq1[0]]][baseMap[seq2[0]]] + dpScore( mat, baseMap, seq1[1:], seq2[1:], g ) #we aren't using a gap here, but we're looking at the first char of each sequence, #using the baseMap to get the correct indices for the matrix, and then using the matrix #to get the value we need to add to this score. return max([possibleScore1, possibleScore2, possibleScore3]) #return the max score from the three possibilities mat = [[20, 10, 5, 5], [10, 20, 5, 5], [5, 5, 20, 10], [5, 5, 10, 20]] alphabet = "TCAG" baseMap = {base:index for index,base in enumerate(alphabet)} #creates a lookup for the matrix. So baseMap["T"] = 0, baseMap["C"] = 1, etc. print dpScore(mat, baseMap, 'ATCGTAGTA', 'ATGTTAT') A matrix, here, is just a list of lists, so mat[i][j] is telling you which sub-list and which element of that sub-list to look at. Last edited by Reincarnate; 02-9-2014 at 12:44 PM.. |
02-9-2014, 12:49 PM | #26 |
x'); DROP TABLE FFR;--
Join Date: Nov 2010
Posts: 6,332
|
Re: Python pattern matching algorithm tuple error
Using a dictionary object to cache results instead of a memoization decorator:
Code:
cache = {} ''' cache here is a dictionary object, which means you do cache[key] and you get a corresponding value back. The key can be (mostly) anything you want (IIRC, as long as it's an immutable object like a tuple). A tuple is something like (1,2,3) versus a list [1,2,3]. Tuples are basically "immutable" lists, which means they can't be modified directly. By the way, these triple-apostrophe things are how you do comment-blocks. Pound/hashtag sign is for a line-comment. ''' def dpScore( mat, baseMap, seq1, seq2, g = -5 ): if len(seq1)*len(seq2)==0: #if one or the other is 0, i.e. an empty string return g*abs(len(seq1)-len(seq2)) #then we incur gaps for the length of the difference #note that if both strings are length 0, no gap costs are incurred if (seq1,seq2) in cache: return cache[(seq1,seq2)] #if we've already solved the sub-problem for our two strings, return the answer we cached #seq1[1:] means "everything but the first character of seq1" possibleScore1 = g + dpScore( mat, baseMap, seq1[1:], seq2, g ) possibleScore2 = g + dpScore( mat, baseMap, seq1, seq2[1:], g ) #these two options reflect using a gap at the front of seq2 and seq1, respectively. possibleScore3 = mat[baseMap[seq1[0]]][baseMap[seq2[0]]] + dpScore( mat, baseMap, seq1[1:], seq2[1:], g ) #we aren't using a gap here, but we're looking at the first char of each sequence, #using the baseMap to get the correct indices for the matrix, and then using the matrix #to get the value we need to add to this score. cache[(seq1,seq2)] = max([possibleScore1, possibleScore2, possibleScore3]) #cache the result return cache[(seq1,seq2)] #return the max score from the three possibilities mat = [[20, 10, 5, 5], [10, 20, 5, 5], [5, 5, 20, 10], [5, 5, 10, 20]] alphabet = "TCAG" baseMap = {base:index for index,base in enumerate(alphabet)} #creates a lookup for the matrix. So baseMap["T"] = 0, baseMap["C"] = 1, etc. print dpScore(mat, baseMap, 'ATCGTAGTA', 'ATGTTAT') Last edited by Reincarnate; 02-9-2014 at 01:04 PM.. Reason: added comments |
02-9-2014, 12:51 PM | #27 |
Batch Manager
Game Manager, Song Release Coordinator
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,855
|
Re: Python pattern matching algorithm tuple error
Currently looking at your comments to see what you're getting at. From the memoize comments,
Code:
def dpScore(): # function stuff dpScore = memoize(dpScore) @memoize def dpScore(): # function stuff |
02-9-2014, 01:27 PM | #28 |
x'); DROP TABLE FFR;--
Join Date: Nov 2010
Posts: 6,332
|
Re: Python pattern matching algorithm tuple error
If that code makes sense then you can try an iterative-based approach to the DP using arrays (in Python, technically lists).
Note that all possible subproblems are basically of form: (some substring of the first main sequence, some substring of the second main sequence) where for us, in our example the main sequences are 'ATCGTAGTA' and 'ATGTTAT' So if sequence 1 is length I and sequence 2 is length J, we can have a matrix of size (I+1)*(J+1) and each element of this matrix will correspond to the maximal score you can get from the substrings. (the +1 on each side is because it is possible to compare a substring against a 0-length string) So for example, if we call our matrix "dp," then we might have dp[3][1] or something, which means we're looking at the maximal score that comes from the subproblem ("GTAGTA", "TGTTAT"). In other words, the index-3 character onward for the first main sequence, and the index-1 character onward for the second main sequence. This means, at the very end, dp[0][0] will be our answer. I'll explain this later when I have time (or someone else can explain it) -- I have to run out for the next few hours: (I also haven't debugged this yet so there may be a silly error somewhere, I'll fix later) Code:
mat = [[20, 10, 5, 5], [10, 20, 5, 5], [5, 5, 20, 10], [5, 5, 10, 20]] alphabet = "TCAG" baseMap = {base:index for index,base in enumerate(alphabet)} seq1='ATCGTAGTA' seq2='ATGTTAT' g=-5 dp = [[0 for j in xrange(len(seq2)+1)] for i in xrange(len(seq1)+1)] #creates a list of lists, which we use as a matrix for i in xrange(len(seq1)+1): dp[i][len(seq2)] = g*abs(len(seq1)-i) #initialize the cases where we have second substring empty for j in xrange(len(seq2)+1): dp[len(seq1)][j] = g*abs(len(seq2)-j) #initialize the cases where we have first substring empty #http://www.velocityreviews.com/forums/t636275-how-to-make-a-reverse-for-loop-in-python.html for i in xrange(len(seq1)-1,-1,-1): #len(seq1)-1 <= i <= 0, iterating down by -1 for j in xrange(len(seq2)-1,-1,-1): #len(seq2)-1 <= j <= 0, iterating down by -1. possibleScore1 = g + dp[i+1][j] possibleScore2 = g + dp[i][j+1] possibleScore3 = mat[baseMap[seq1[i]]][baseMap[seq2[j]]] + dp[i+1][j+1] dp[i][j] = max([possibleScore1, possibleScore2, possibleScore3]) print dp[0][0] Last edited by Reincarnate; 02-9-2014 at 01:39 PM.. Reason: made a quick fix, running out now |
02-9-2014, 01:44 PM | #29 |
Batch Manager
Game Manager, Song Release Coordinator
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,855
|
Re: Python pattern matching algorithm tuple error
The memoize class constructor stuff in the beginning of the recursive algorithm code will need to be discussed more later, I sent a PM about possibly discussing over some kind of instant messanging program. I'll stick with the iterative solution since it doesn't require constructing a class.
Also your iterative code works.
__________________
Last edited by DossarLX ODI; 02-9-2014 at 01:50 PM.. |
02-9-2014, 02:29 PM | #30 |
x'); DROP TABLE FFR;--
Join Date: Nov 2010
Posts: 6,332
|
Re: Python pattern matching algorithm tuple error
Post any questions you have and I'll address them when I can (I am not at comp atm)
(It's probably best to ignore the solution with the memo class for easiest understanding -- use the one with the cache dict instead, or the iterative translation) |
02-9-2014, 05:34 PM | #31 |
Batch Manager
Game Manager, Song Release Coordinator
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,855
|
Re: Python pattern matching algorithm tuple error
This picture is showing what's happening in the iterative loops for our dp matrix. dp[0][0] gives us the final score of the best path (or paths in the case of a tied score) available. It's basically doing what I said the backtrace algorithm did in my earlier posts in the thread -- the path is there, but the sequences are in reverse so when the strings are saved they have to be reversed to be the normal alignment. Now there's a problem of saving the new aligned sequences into two string variables. possibleScore1 could have the same value as possibleScore2 when max is called etc. so saving the characters of the string is a lot more complex than what I originally thought. This is to save the string of the best alignment. |
02-9-2014, 08:19 PM | #32 |
x'); DROP TABLE FFR;--
Join Date: Nov 2010
Posts: 6,332
|
Re: Python pattern matching algorithm tuple error
If you want to return a score-maximizing solution, you could use a second matrix to keep the corresponding substrings based on maximal score at each step of the dp, or something:
untested air code: Code:
g=-5 dp = [[0 for j in xrange(len(seq2)+1)] for i in xrange(len(seq1)+1)] bestPieces = [[("","") for j in xrange(len(seq2)+1)] for i in xrange(len(seq1)+1)] for i in xrange(len(seq1)+1): dp[i][len(seq2)] = g*abs(len(seq1)-i) bestPieces[i][len(seq2)] = (seq1[i:],"-"*abs(len(seq1)-i)) for j in xrange(len(seq2)+1): dp[len(seq1)][j] = g*abs(len(seq2)-j) bestPieces[len(seq1)][j] = ("-"*abs(len(seq2)-j),seq2[j:]) for i in xrange(len(seq1)-1,-1,-1): for j in xrange(len(seq2)-1,-1,-1): possibleScore1 = g + dp[i+1][j] possibleScore2 = g + dp[i][j+1] possibleScore3 = mat[baseMap[seq1[i]]][baseMap[seq2[j]]] + dp[i+1][j+1] bestScore = max([possibleScore1, possibleScore2, possibleScore3]) dp[i][j] = bestScore if bestScore==possibleScore1: bestPieces[i][j] = (seq1[i] + bestPieces[i+1][j][0], "-" + bestPieces[i+1][j][1]) elif bestScore==possibleScore2: bestPieces[i][j] = ("-" + bestPieces[i][j+1][0], seq2[j] + bestPieces[i][j+1][1]) else: bestPieces[i][j] = (seq1[i] + bestPieces[i+1][j+1][0], seq2[j] + bestPieces[i+1][j+1][1]) print dp[0][0], bestPieces[0][0] Last edited by Reincarnate; 02-9-2014 at 09:45 PM.. |
Currently Active Users Viewing This Thread: 1 (0 members and 1 guests) | |
Thread Tools | |
Display Modes | |
|
|