Go Back   Flash Flash Revolution > General Discussion > Technology

Reply
 
Thread Tools Display Modes
Old 02-9-2014, 12:07 PM   #21
DossarLX ODI
Batch Manager
Game Manager, Song Release Coordinator
Game ManagerSimfile JudgeFFR Simfile AuthorD7 Elite KeysmasherFFR Veteran
 
DossarLX ODI's Avatar
 
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,855
Default Re: Python pattern matching algorithm tuple error

Quote:
Originally Posted by Reincarnate View Post
Is it possibly something like this?

Code:
class memoize:
    def __init__(self, fn):
        self.fn = fn
        self.memo = {}
    def __call__(self, *args, **kwds):
        import cPickle
        str = cPickle.dumps(args, 1)+cPickle.dumps(kwds, 1)
        if not self.memo.has_key(str):
            self.memo[str] = self.fn(*args, **kwds)
        return self.memo[str]

@memoize
def dpScore( mat, baseMap, seq1, seq2, g = -5 ):
    if len(seq1)*len(seq2)==0: #one or the other is 0
        return g*abs(len(seq1)-len(seq2)) #the rest shall be gaps, damn it

    possibleScore1 = g + dpScore( mat, baseMap, seq1[1:], seq2, g )
    possibleScore2 = g + dpScore( mat, baseMap, seq1, seq2[1:], g )
    possibleScore3 = mat[baseMap[seq1[0]]][baseMap[seq2[0]]] + dpScore( mat, baseMap, seq1[1:], seq2[1:], g )
    
    return max([possibleScore1, possibleScore2, possibleScore3])



mat = [[20, 10, 5, 5],
       [10, 20, 5, 5],
       [5, 5, 20, 10],
       [5, 5, 10, 20]]

alphabet = "TCAG"
baseMap = {base:index for index,base in enumerate(alphabet)}
print dpScore(mat, baseMap, 'ATG', 'GGAAT')
I am using 2.7.3 (I should probably upgrade but this version has always done everything I've needed so I haven't changed it yet)
I tested this out with ATG and GGAAT and it gave me 20 (like in the document).

I'm lost in that block of code you wrote though. Like I mentioned I'm new to python so seeing all that code is like drinking out of a fire hose; mind giving some comments please?
__________________
Quote:
Originally Posted by hi19hi19 View Post
oh boy, it's STIFF, I'll stretch before I sit down at the computer so not I'm not as STIFF next time I step a file

Last edited by DossarLX ODI; 02-9-2014 at 12:09 PM..
DossarLX ODI is offline   Reply With Quote
Old 02-9-2014, 12:15 PM   #22
Reincarnate
x'); DROP TABLE FFR;--
Retired StaffFFR Veteran
 
Reincarnate's Avatar
 
Join Date: Nov 2010
Posts: 6,332
Default Re: Python pattern matching algorithm tuple error

I mean right now I'm using a memoization class which is the "lazy man's" way to use dynamic programming in Python (and in this case it's memoizing a lot more than it needs to, but I digress -- this was just to ensure the score's right and recursion is easy to code).

Basically, at each step of the DP, it takes the max of three possibilities:

1. Inserting a gap in the second sequence, incurring the g-cost, and then continuing onward
2. Inserting a gap in the first sequence, incurring the g-cost, and then continuing onward
3. Not inserting a gap, so taking the score of the base-pair, and continuing onward

If we run out of bases in a particular string, then we just incur gap-costs for the rest (however long the remaining string is)

Since it's constantly taking the max of all three possibilities at each step, and recursing at each step, you get the maximum possible score from all possible ways of inserting gaps/not inserting gaps every which way.

For really long strings, this could take forever, but that's what DP is for:

Memoization (in this case) is just caching results at intermediate steps so they don't have to get re-calculated over and over again (which is the heart of DP).

For instance, if I asked you to add 1+2+3+4+5, you'd tell me "15!"
Now if I asked, "Okay, now what is 1+2+3+4+5+10", you'd very quickly tell me "25!"

In this case, you knew it was just 10+your previous answer, so you didn't re-calculate "1+2+3+4+5+10" from scratch. Memoization is basically the same thing.

Disclaimer: not a CS expert, so I'm sure someone will come along and point out all the flaws in what I've said

Last edited by Reincarnate; 02-9-2014 at 12:25 PM..
Reincarnate is offline   Reply With Quote
Old 02-9-2014, 12:28 PM   #23
Reincarnate
x'); DROP TABLE FFR;--
Retired StaffFFR Veteran
 
Reincarnate's Avatar
 
Join Date: Nov 2010
Posts: 6,332
Default Re: Python pattern matching algorithm tuple error

Let's say we had the two sequences:

s1 = "ATG"
s2 = "TG"

so there are three options:



1. insert a gap in the second sequence, continue with the rest:

"A" + "TG"
"-" + "TG"

so here we take the gap cost of -5 and continue with the two substrings "TG" and "TG"



2. insert a gap in the first sequence, continue with the rest:

"-" + "ATG"
"T" + "G"

so here we take the gap cost of -5 and continue with the two substrings "ATG" and "G"



3. Take the current base pair, continue with the rest

"A" + "TG"
"T" + "G"

so here we take whatever matrix["A"]["T"] is and continue with the two substrings "TG" and "G"

And so each time we "continue with substrings" we are performing the same DP calculation on smaller sub-problems. Every time we encounter a new subproblem, we're always taking the max of these three options. As there are many ways to arrive at the same sub-problems, we cache the results as we calculate stuff so we can re-use them again later if we encounter them.
Reincarnate is offline   Reply With Quote
Old 02-9-2014, 12:29 PM   #24
DossarLX ODI
Batch Manager
Game Manager, Song Release Coordinator
Game ManagerSimfile JudgeFFR Simfile AuthorD7 Elite KeysmasherFFR Veteran
 
DossarLX ODI's Avatar
 
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,855
Default Re: Python pattern matching algorithm tuple error

So essentially you are remembering which position of the matrix you are in so when you consider the maximum of three possibilities you can just start from that index again. That makes sense.

Also, that means you're splitting the string into a smaller string each time you do that correct?

Comments in that code would be very useful because I'm getting the syntax for Python down to see what your code is actually saying.
matrix[i,1:] # Take ith row, go from column 1 to the last column for each row
matrix[0] = g # entire first row all gap penalties?
matrix[:,0] = g # make first column all gap penalties?
string[::-1] # Reverse string
alphabet.index( char ) # Turn character into an integer based on alphabet values
# [x,y] is interpreted as (x,y)
# [x][y] is the index in the matrix
__________________
Quote:
Originally Posted by hi19hi19 View Post
oh boy, it's STIFF, I'll stretch before I sit down at the computer so not I'm not as STIFF next time I step a file

Last edited by DossarLX ODI; 02-9-2014 at 12:35 PM..
DossarLX ODI is offline   Reply With Quote
Old 02-9-2014, 12:39 PM   #25
Reincarnate
x'); DROP TABLE FFR;--
Retired StaffFFR Veteran
 
Reincarnate's Avatar
 
Join Date: Nov 2010
Posts: 6,332
Default Re: Python pattern matching algorithm tuple error

does this help

Code:
class memoize:
    def __init__(self, fn):
        self.fn = fn
        self.memo = {}
    def __call__(self, *args, **kwds):
        import cPickle
        str = cPickle.dumps(args, 1)+cPickle.dumps(kwds, 1)
        if not self.memo.has_key(str):
            self.memo[str] = self.fn(*args, **kwds)
        return self.memo[str]

#memoization decorator, so any time dpScore(args) is called with arguments we've seen before,
#we don't re-calculate it -- we return the answer that we cached for it
@memoize
def dpScore( mat, baseMap, seq1, seq2, g = -5 ):
    if len(seq1)*len(seq2)==0: #if one or the other is length 0, i.e. an empty string
        return g*abs(len(seq1)-len(seq2)) #then we incur gaps for the length of the difference

    #note that if both strings are length 0, no gap costs are incurred

    #seq1[1:] means "everything but the first character of seq1"
    #This is (iirc) called a string-slice operation. 
    #edit: here http://techearth.net/python/index.php5?title=Python:Basics:Slices

    possibleScore1 = g + dpScore( mat, baseMap, seq1[1:], seq2, g )
    possibleScore2 = g + dpScore( mat, baseMap, seq1, seq2[1:], g )
    #these two options reflect using a gap at the front of seq2 and seq1, respectively.
    
    possibleScore3 = mat[baseMap[seq1[0]]][baseMap[seq2[0]]] + dpScore( mat, baseMap, seq1[1:], seq2[1:], g )
    #we aren't using a gap here, but we're looking at the first char of each sequence,
    #using the baseMap to get the correct indices for the matrix, and then using the matrix
    #to get the value we need to add to this score.

    return max([possibleScore1, possibleScore2, possibleScore3])
    #return the max score from the three possibilities



mat = [[20, 10, 5, 5],
       [10, 20, 5, 5],
       [5, 5, 20, 10],
       [5, 5, 10, 20]]

alphabet = "TCAG"
baseMap = {base:index for index,base in enumerate(alphabet)}
#creates a lookup for the matrix. So baseMap["T"] = 0, baseMap["C"] = 1, etc.

print dpScore(mat, baseMap, 'ATCGTAGTA', 'ATGTTAT')
And yeah when it comes to list-type objects in Python, like L = [1,4,7,3], then doing something like L[2] gives you the third element of the list (due to 0-indexing, L[0] is the first element).

A matrix, here, is just a list of lists, so mat[i][j] is telling you which sub-list and which element of that sub-list to look at.

Last edited by Reincarnate; 02-9-2014 at 12:44 PM..
Reincarnate is offline   Reply With Quote
Old 02-9-2014, 12:49 PM   #26
Reincarnate
x'); DROP TABLE FFR;--
Retired StaffFFR Veteran
 
Reincarnate's Avatar
 
Join Date: Nov 2010
Posts: 6,332
Default Re: Python pattern matching algorithm tuple error

Using a dictionary object to cache results instead of a memoization decorator:

Code:
cache = {} 
'''
cache here is a dictionary object, which means you do cache[key] and you get a corresponding value back. 
The key can be (mostly) anything you want (IIRC, as long as it's an immutable object like a tuple).
A tuple is something like (1,2,3) versus a list [1,2,3]. Tuples are basically "immutable" lists, 
which means they can't be modified directly.

By the way, these triple-apostrophe things are how you do comment-blocks. Pound/hashtag sign is for a line-comment.
'''

def dpScore( mat, baseMap, seq1, seq2, g = -5 ):
    if len(seq1)*len(seq2)==0: #if one or the other is 0, i.e. an empty string
        return g*abs(len(seq1)-len(seq2)) #then we incur gaps for the length of the difference

    #note that if both strings are length 0, no gap costs are incurred

    if (seq1,seq2) in cache: return cache[(seq1,seq2)] 
    #if we've already solved the sub-problem for our two strings, return the answer we cached

    #seq1[1:] means "everything but the first character of seq1"

    possibleScore1 = g + dpScore( mat, baseMap, seq1[1:], seq2, g )
    possibleScore2 = g + dpScore( mat, baseMap, seq1, seq2[1:], g )
    #these two options reflect using a gap at the front of seq2 and seq1, respectively.
    
    possibleScore3 = mat[baseMap[seq1[0]]][baseMap[seq2[0]]] + dpScore( mat, baseMap, seq1[1:], seq2[1:], g )
    #we aren't using a gap here, but we're looking at the first char of each sequence,
    #using the baseMap to get the correct indices for the matrix, and then using the matrix
    #to get the value we need to add to this score.

    cache[(seq1,seq2)] = max([possibleScore1, possibleScore2, possibleScore3]) #cache the result
    return cache[(seq1,seq2)]
    #return the max score from the three possibilities



mat = [[20, 10, 5, 5],
       [10, 20, 5, 5],
       [5, 5, 20, 10],
       [5, 5, 10, 20]]

alphabet = "TCAG"
baseMap = {base:index for index,base in enumerate(alphabet)}
#creates a lookup for the matrix. So baseMap["T"] = 0, baseMap["C"] = 1, etc.

print dpScore(mat, baseMap, 'ATCGTAGTA', 'ATGTTAT')

Last edited by Reincarnate; 02-9-2014 at 01:04 PM.. Reason: added comments
Reincarnate is offline   Reply With Quote
Old 02-9-2014, 12:51 PM   #27
DossarLX ODI
Batch Manager
Game Manager, Song Release Coordinator
Game ManagerSimfile JudgeFFR Simfile AuthorD7 Elite KeysmasherFFR Veteran
 
DossarLX ODI's Avatar
 
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,855
Default Re: Python pattern matching algorithm tuple error

Currently looking at your comments to see what you're getting at. From the memoize comments,

Code:
def dpScore():
    # function stuff
dpScore = memoize(dpScore)

@memoize
def dpScore():
    # function stuff
Do the same exact thing then.
__________________
Quote:
Originally Posted by hi19hi19 View Post
oh boy, it's STIFF, I'll stretch before I sit down at the computer so not I'm not as STIFF next time I step a file
DossarLX ODI is offline   Reply With Quote
Old 02-9-2014, 01:27 PM   #28
Reincarnate
x'); DROP TABLE FFR;--
Retired StaffFFR Veteran
 
Reincarnate's Avatar
 
Join Date: Nov 2010
Posts: 6,332
Default Re: Python pattern matching algorithm tuple error

If that code makes sense then you can try an iterative-based approach to the DP using arrays (in Python, technically lists).

Note that all possible subproblems are basically of form:
(some substring of the first main sequence, some substring of the second main sequence)

where for us, in our example the main sequences are 'ATCGTAGTA' and 'ATGTTAT'

So if sequence 1 is length I and sequence 2 is length J, we can have a matrix of size (I+1)*(J+1) and each element of this matrix will correspond to the maximal score you can get from the substrings. (the +1 on each side is because it is possible to compare a substring against a 0-length string)

So for example, if we call our matrix "dp," then we might have dp[3][1] or something, which means we're looking at the maximal score that comes from the subproblem ("GTAGTA", "TGTTAT"). In other words, the index-3 character onward for the first main sequence, and the index-1 character onward for the second main sequence.

This means, at the very end, dp[0][0] will be our answer.


I'll explain this later when I have time (or someone else can explain it) -- I have to run out for the next few hours:
(I also haven't debugged this yet so there may be a silly error somewhere, I'll fix later)

Code:
mat = [[20, 10, 5, 5],
       [10, 20, 5, 5],
       [5, 5, 20, 10],
       [5, 5, 10, 20]]

alphabet = "TCAG"
baseMap = {base:index for index,base in enumerate(alphabet)}


seq1='ATCGTAGTA'
seq2='ATGTTAT'
g=-5


dp = [[0 for j in xrange(len(seq2)+1)] for i in xrange(len(seq1)+1)] #creates a list of lists, which we use as a matrix

for i in xrange(len(seq1)+1):
    dp[i][len(seq2)] = g*abs(len(seq1)-i)  #initialize the cases where we have second substring empty
for j in xrange(len(seq2)+1):
    dp[len(seq1)][j] = g*abs(len(seq2)-j) #initialize the cases where we have first substring empty


#http://www.velocityreviews.com/forums/t636275-how-to-make-a-reverse-for-loop-in-python.html

for i in xrange(len(seq1)-1,-1,-1): #len(seq1)-1 <= i <= 0, iterating down by -1
    for j in xrange(len(seq2)-1,-1,-1): #len(seq2)-1 <= j <= 0, iterating down by -1.
        possibleScore1 = g + dp[i+1][j]
        possibleScore2 = g + dp[i][j+1]
        possibleScore3 = mat[baseMap[seq1[i]]][baseMap[seq2[j]]] + dp[i+1][j+1]
        dp[i][j] = max([possibleScore1, possibleScore2, possibleScore3])
        
print dp[0][0]

Last edited by Reincarnate; 02-9-2014 at 01:39 PM.. Reason: made a quick fix, running out now
Reincarnate is offline   Reply With Quote
Old 02-9-2014, 01:44 PM   #29
DossarLX ODI
Batch Manager
Game Manager, Song Release Coordinator
Game ManagerSimfile JudgeFFR Simfile AuthorD7 Elite KeysmasherFFR Veteran
 
DossarLX ODI's Avatar
 
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,855
Default Re: Python pattern matching algorithm tuple error

The memoize class constructor stuff in the beginning of the recursive algorithm code will need to be discussed more later, I sent a PM about possibly discussing over some kind of instant messanging program. I'll stick with the iterative solution since it doesn't require constructing a class.

Also your iterative code works.
__________________
Quote:
Originally Posted by hi19hi19 View Post
oh boy, it's STIFF, I'll stretch before I sit down at the computer so not I'm not as STIFF next time I step a file

Last edited by DossarLX ODI; 02-9-2014 at 01:50 PM..
DossarLX ODI is offline   Reply With Quote
Old 02-9-2014, 02:29 PM   #30
Reincarnate
x'); DROP TABLE FFR;--
Retired StaffFFR Veteran
 
Reincarnate's Avatar
 
Join Date: Nov 2010
Posts: 6,332
Default Re: Python pattern matching algorithm tuple error

Post any questions you have and I'll address them when I can (I am not at comp atm)

(It's probably best to ignore the solution with the memo class for easiest understanding -- use the one with the cache dict instead, or the iterative translation)
Reincarnate is offline   Reply With Quote
Old 02-9-2014, 05:34 PM   #31
DossarLX ODI
Batch Manager
Game Manager, Song Release Coordinator
Game ManagerSimfile JudgeFFR Simfile AuthorD7 Elite KeysmasherFFR Veteran
 
DossarLX ODI's Avatar
 
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,855
Default Re: Python pattern matching algorithm tuple error



This picture is showing what's happening in the iterative loops for our dp matrix. dp[0][0] gives us the final score of the best path (or paths in the case of a tied score) available.

It's basically doing what I said the backtrace algorithm did in my earlier posts in the thread -- the path is there, but the sequences are in reverse so when the strings are saved they have to be reversed to be the normal alignment.

Now there's a problem of saving the new aligned sequences into two string variables. possibleScore1 could have the same value as possibleScore2 when max is called etc. so saving the characters of the string is a lot more complex than what I originally thought. This is to save the string of the best alignment.
__________________
Quote:
Originally Posted by hi19hi19 View Post
oh boy, it's STIFF, I'll stretch before I sit down at the computer so not I'm not as STIFF next time I step a file
DossarLX ODI is offline   Reply With Quote
Old 02-9-2014, 08:19 PM   #32
Reincarnate
x'); DROP TABLE FFR;--
Retired StaffFFR Veteran
 
Reincarnate's Avatar
 
Join Date: Nov 2010
Posts: 6,332
Default Re: Python pattern matching algorithm tuple error

If you want to return a score-maximizing solution, you could use a second matrix to keep the corresponding substrings based on maximal score at each step of the dp, or something:

untested air code:

Code:
g=-5
dp = [[0 for j in xrange(len(seq2)+1)] for i in xrange(len(seq1)+1)]
bestPieces = [[("","") for j in xrange(len(seq2)+1)] for i in xrange(len(seq1)+1)]

for i in xrange(len(seq1)+1):
    dp[i][len(seq2)] = g*abs(len(seq1)-i)
    bestPieces[i][len(seq2)] = (seq1[i:],"-"*abs(len(seq1)-i))
for j in xrange(len(seq2)+1):
    dp[len(seq1)][j] = g*abs(len(seq2)-j)
    bestPieces[len(seq1)][j] = ("-"*abs(len(seq2)-j),seq2[j:])

for i in xrange(len(seq1)-1,-1,-1):
    for j in xrange(len(seq2)-1,-1,-1):
        possibleScore1 = g + dp[i+1][j]
        possibleScore2 = g + dp[i][j+1]
        possibleScore3 = mat[baseMap[seq1[i]]][baseMap[seq2[j]]] + dp[i+1][j+1]
        bestScore = max([possibleScore1, possibleScore2, possibleScore3])
        dp[i][j] = bestScore

        if bestScore==possibleScore1:
            bestPieces[i][j] = (seq1[i] + bestPieces[i+1][j][0], "-" + bestPieces[i+1][j][1])
        elif bestScore==possibleScore2:
            bestPieces[i][j] = ("-" + bestPieces[i][j+1][0], seq2[j] + bestPieces[i][j+1][1])
        else:
            bestPieces[i][j] = (seq1[i] + bestPieces[i+1][j+1][0], seq2[j] + bestPieces[i+1][j+1][1])
        
print dp[0][0], bestPieces[0][0]
You don't have to "reverse" anything (if I am understanding you correctly) -- although technically, sure, DP is generally a bottom-up type of thing where you work backwards from smaller subproblems and work your way towards the final result you're after. But you don't have to store things in reverse if you don't want to.

Last edited by Reincarnate; 02-9-2014 at 09:45 PM..
Reincarnate is offline   Reply With Quote
Reply


Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 
Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are Off
[IMG] code is On
HTML code is Off

Forum Jump



All times are GMT -5. The time now is 05:38 AM.


Powered by vBulletin® Version 3.8.1
Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
Copyright FlashFlashRevolution