Go Back   Flash Flash Revolution > General Discussion > Technology
Register FAQ Community Calendar Today's Posts Search

Reply
 
Thread Tools Display Modes
Old 02-8-2014, 11:15 PM   #1
DossarLX ODI
Batch Manager
Game Manager, Song Release Coordinator
Game ManagerSimfile JudgeFFR Simfile AuthorD7 Elite KeysmasherFFR Veteran
 
DossarLX ODI's Avatar
 
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,860
Default Python pattern matching algorithm tuple error

I’m new at Python so this is like drinking out of a fire hose. I’ll explain the context of what the problem is here to give a better idea of what to discuss.

This program gives a pairwise DNA alignment using “dynamic programming”. A score is assigned based on how well the pairs line up; preferably the optimal alignment should be reached, but that can be discussed later.

The “substitution matrix” here considers the possible pair matchings and assigns a score value to each. Every time there is the same letter, it is 20 points. This table shows the values.
Code:
alphabet = "TCAG" # 0 for T, 1 for C, 2 for A, 3 for G.


In the case of dynamic programming, an optimal alignment is accomplished by considering three options for each location in the DNA sequence, and selecting the best option before continuing on.
(Conceptually, the second sequence would be the columns of the matrix, and the first sequence is the rows of the matrix. This is just an example)


Don’t worry about affine gap penalties (consecutive gaps), there’s just only a gap penalty variable required here and it’s set to -5 points. A gap (indel) is when we can’t distinguish whether a letter was inserted into a string or removed from the other string. For alignment purposes we can’t tell so gaps are used when necessary.

So now onto the problems!

In order to get the aligned results of the two DNA sequences, it looks like one would have to:
1.) Make the scoring matrix. <-- Scoring matrix maintains the current alignment score for the particular alignment.
2.) Make the arrow matrix. <-- determines the optimal score path. It is created at the same time the scoring matrix is.
3.) Get path from the arrow matrix through a backtracing algorithm and reverse the string after.
4.) Pass the two strings to a scoring function and calculate the score.

Code:
# Our two DNA sequences.
seq1 = 'ATGTTAT'
seq2 = 'ATCGTAG'
alphabet = "TCAG" # 0 for T, 1 for C, 2 for A, 3 for G.

mat =       [
                 [20, 10, 5, 5],
                 [10, 20, 5, 5],
                 [5, 5, 20, 10],
                 [5, 5, 10, 20],
            ]
Here the DNA sequences are being converted to integers to represent the index on the original substitution matrix (0 for T, 1 for C, etc.) and this new substitution matrix also aligns with the scoring matrix.

Code:
"""
mat is the substitution matrix being passed.
abet is the alphabet associated with that matrix.
sq1 is the first DNA sequence.
sq2 is the second DNA sequence.
Here we are making a substitution matrix that aligns with the scoring matrix.
"""
def alignSubMatrix( mat, alphabet, seq1, seq2 ):
    l1, l2 = len( seq1 ), len( seq2 ) # Get the lengths of our two DNA strings
    subvals = zeros( ( l1+1,l2+1 ) , int ) # Matrix of integers, add 1 to the lengths
    # Convert the DNA sequences to numbers. si for substitution index.
    si1 = zeros( l1,int ) # Array of integers, length of horizontal dimension
    si2 = zeros( l2,int ) # Array of integers, length of vertical dimension
    # In this for loop, first DNA sequence gets converted to a set of integers.
    for i in range( l1 ):
        si1[i] = alphabet.index( seq1[i] )
    # In this for loop, second DNA sequence gets converted to a set of integers.
    for i in range( l2 ):
        si2[i] = alphabet.index( seq2[i] )
    # Fill in one row of the substitution matrix at a time.
    for i in range( 1, l1+1 ):
        # Subtracting 1 from index to line up sub matrix with scoring matrix.
        subvals[i,1:] = mat[ si1[i-1] , si2[i-1] ] # Take ith row of subvals, go from column 1 to the last column for each row
    return subvals


In this code we’re making the Scoring and Arrow Matrices. The first row and column is filled entirely with gap penalties.

Code:
"""
This function computes the scoring matrix and the arrow matrix.
subvals are the substitution matrix values from alignSubMatrix.
sq1 and sq2 are the DNA sequences.
g is the gap penalty.
"""
def SandAmatrix( subvals, alphabet, seq1, seq2, g = -5 ):
    l1, l2 = len( seq1 ), len( seq2)
    sMatrix = zeros( ( l1+1,l2+1 ) , int )
    aMatrix = zeros( ( l1+1,l2+1 ) , int )
    # Create first row and first column. Remember they're all gap penalties
    sMatrix[0] = arange( l2+1 )*g # make first row all gap penalties
    sMatrix[:,0] = arange( l1+1 )*g # make first column all gap penalties
    aMatrix[0] = ones( l2+1 ) # make first row all ones
    for i in range( 1, l1+1 ):
        for j in range( 1, l2+1 ):
            pathcheck = zeros( 3 ) # 3-element array of zeros.
            pathcheck[0] = sMatrix[i-1,j] + g # If we go right
            pathcheck[1] = sMatrix[i,j-1] + g # If we go down
            x = alphabet.index( seq1[i-1] )
            y = alphabet.index( seq2[j-1] )
            pathcheck[2] = sMatrix[i-1,j-1] + mat[x,y] # If we go diagonally
            sMatrix[i,j] = pathcheck.max()
            aMatrix[i,j] = pathcheck.argmax()
    return sMatrix, aMatrix


Making the arrow matrix from the previous block of code allows us to backtrace the path to return the aligned DNA sequences to get their scores calculated.

Code:
 """
Extract the aligned sequences from the arrow matrix.
Start from the bottom right, work to the top left.
"""
def Backtrace( aMatrix, seq1, seq2 ):
    st1, st2 = ","
    y,x = aMatrix.shape
    ok = 1
    # x for horizontal position, y for vertical position in the arrow matrix
    y -= 1
    x -= 1
    # Conceptually here, second sequence is at the top, first sequence is along the side.
    while ok:
        if aMatrix[y,x] == 0: # indicates backtrace moves up
            st1 += '-'
            y -= 1
        elif aMatrix[y,x] == 1: # indicates backtrace moves left
            st1 += '-'
            st2 += seq2[x-1]
        elif aMatrix[y,h] == 2: # indicates backtrace moves diagonally up-left
            st1 += seq1[y-1]
            st2 += seq2[x-1]
            y -= 1
            x -= 1
        if y==0 and x==0: # This means we are at the top left
            ok = 0
    # Now we need to reverse the strings, since we went bottom right to top left
    st1 = st1[::-1] # Reverse the first string.
    st2 = st2[::-1] # Reverse the second string.
    return st1, st2


Now what’s confusing here is why when I did mat[x,y] in this stringScore function it wasn’t converted into an integer, I got an error about a tuple:

>>> t1 = stringScore( mat, alphabet, seq1 , seq2 , g = -5 )

Traceback (most recent call last):
File "<pyshell#31>", line 1, in <module>
t1 = stringScore( mat, alphabet, seq1 , seq2 , g = -5 )
File "C:\bioinformatics\hw2.py", line 126, in stringScore
score -= mat[x,y] # Search substitution matrix for value to subtract according to pair.
TypeError: list indices must be integers, not tuple
>>>

Code:
"""
Calculate score of the two DNA sequences.
They are stored as strings in two arrays.
"""
def stringScore( mat, alphabet, seq1 , seq2 , g = -5 ):
    length = len( seq1 )
    score = 0
    for i in range (length):
        if seq1[i] != seq2[i]:
            x = alphabet.index( seq1[i] )
            y = alphabet.index( seq2[i] )
            score -= mat[x,y] # Search substitution matrix for value to subtract according to pair.                  
        elif seq1[i] == '-' or seq2[i] == '-':
            score += g # Gap penalty
        else:
            score += 20 # This means the two letters match.
    return score
__________________
Quote:
Originally Posted by hi19hi19 View Post
oh boy, it's STIFF, I'll stretch before I sit down at the computer so not I'm not as STIFF next time I step a file
DossarLX ODI is offline   Reply With Quote
Old 02-9-2014, 12:22 AM   #2
beary605
FFR Veteran
D8 Godly KeysmasherFFR Veteran
 
beary605's Avatar
 
Join Date: Oct 2008
Location: Canada
Posts: 448
Default Re: Python pattern matching algorithm tuple error

A really simple error: you need to use map[x][y] instead of map[x,y] ("x,y" gets interpreted as (x, y), so you access map[ (x, y) ] which Python doesn't like)

I'll look over the rest of the code to see if there's any other errors

edit: wow I did not know you could put commas like that in splices... see if the above works if not then... hm

Last edited by beary605; 02-9-2014 at 03:48 AM..
beary605 is offline   Reply With Quote
Old 02-9-2014, 09:29 AM   #3
DossarLX ODI
Batch Manager
Game Manager, Song Release Coordinator
Game ManagerSimfile JudgeFFR Simfile AuthorD7 Elite KeysmasherFFR Veteran
 
DossarLX ODI's Avatar
 
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,860
Default Re: Python pattern matching algorithm tuple error

Quote:
Originally Posted by beary605 View Post
A really simple error: you need to use map[x][y] instead of map[x,y] ("x,y" gets interpreted as (x, y), so you access map[ (x, y) ] which Python doesn't like)
I changed it to what you said and got this, thanks

>>> t1 = stringScore( mat, alphabet, seq1 , seq2 , g = -5 )
>>> t1
65
>>>

Which looks correct, because the two strings would result in:
20 + 20 + (-5) + (-5) + 20 + 20 + (-5) = 80 - 15 = 65
Those negatives being the mismatches.

Now I'll see what goes wrong with the other functions. I know that the scoring function works when the strings are aligned, so that's a good start.

Edit: Ok, another problem

Code:
>>> from numpy import *

Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    from numpy import *
  File "C:\Python27\lib\site-packages\numpy\__init__.py", line 153, in <module>
    from . import add_newdocs
  File "C:\Python27\lib\site-packages\numpy\add_newdocs.py", line 13, in <module>
    from numpy.lib import add_newdoc
  File "C:\Python27\lib\site-packages\numpy\lib\__init__.py", line 8, in <module>
    from .type_check import *
  File "C:\Python27\lib\site-packages\numpy\lib\type_check.py", line 11, in <module>
    import numpy.core.numeric as _nx
  File "C:\Python27\lib\site-packages\numpy\core\__init__.py", line 6, in <module>
    from . import multiarray
ImportError: DLL load failed: %1 is not a valid Win32 application.
I'm trying to use numpy but there's that ImportError message. I downloaded a 64-bit version of opencv but that still didn't solve it. I am running a 64-bit machine (Windows 7) and I know numpy is 32-bit but there doesn't seem to be much on handling this online, maybe someone else can help?
__________________
Quote:
Originally Posted by hi19hi19 View Post
oh boy, it's STIFF, I'll stretch before I sit down at the computer so not I'm not as STIFF next time I step a file

Last edited by DossarLX ODI; 02-9-2014 at 09:52 AM..
DossarLX ODI is offline   Reply With Quote
Old 02-9-2014, 09:34 AM   #4
Reincarnate
x'); DROP TABLE FFR;--
Retired StaffFFR Veteran
 
Reincarnate's Avatar
 
Join Date: Nov 2010
Posts: 6,332
Default Re: Python pattern matching algorithm tuple error

I am way too lazy to read through all this, can you just explain what you need to do in like 2 sentences -- I'll whip up the solution in Python real quick and you can look at it or something

you have two sequences
seq1 = 'ATGTTAT'
seq2 = 'ATCGTAG'

what's the score and why

(or are you trying to do something else?)
Reincarnate is offline   Reply With Quote
Old 02-9-2014, 09:57 AM   #5
Reincarnate
x'); DROP TABLE FFR;--
Retired StaffFFR Veteran
 
Reincarnate's Avatar
 
Join Date: Nov 2010
Posts: 6,332
Default Re: Python pattern matching algorithm tuple error

something like this maybe

Code:
def stringScore( mat, baseMap, seq1, seq2, g = -5 ):

    if len(seq1) != len(seq2):
        raise Exception("Fix your shit, both strings must be of equal length")

    score = 0
    
    for base1,base2 in zip(seq1,seq2):
        if base1 != base2:
            score -= mat[baseMap[base1]][baseMap[base2]]
        elif base1 == "-" or base2 == "-":
            score += g
        else:
            score += mat[baseMap[base1]][baseMap[base2]]
        
    return score



mat = [[20, 10, 5, 5],
       [10, 20, 5, 5],
       [5, 5, 20, 10],
       [5, 5, 10, 20]]

alphabet = "TCAG"
baseMap = {base:index for index,base in enumerate(alphabet)}

seq1 = 'ATGTTAT'
seq2 = 'ATCGTAG'

print stringScore(mat, baseMap, seq1, seq2)

Last edited by Reincarnate; 02-9-2014 at 10:02 AM..
Reincarnate is offline   Reply With Quote
Old 02-9-2014, 09:58 AM   #6
DossarLX ODI
Batch Manager
Game Manager, Song Release Coordinator
Game ManagerSimfile JudgeFFR Simfile AuthorD7 Elite KeysmasherFFR Veteran
 
DossarLX ODI's Avatar
 
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,860
Default Re: Python pattern matching algorithm tuple error

Quote:
Originally Posted by Reincarnate View Post
I am way too lazy to read through all this, can you just explain what you need to do in like 2 sentences -- I'll whip up the solution in Python real quick and you can look at it or something

you have two sequences
seq1 = 'ATGTTAT'
seq2 = 'ATCGTAG'

what's the score and why

(or are you trying to do something else?)
Yeah I can make a tl;dr version, my post above also shows an error when I try importing numpy. However I still have to give lots of details here because this assignment involves creating a path

The program is supposed to work for any two sequences e.g.
ATGAATGCGATTTCGGGTGGCC
TTGGCAGGACATGAAGTTCGATACGGAA

Obviously the first string is too short so that's when we have to insert gaps and there are gap penalties to make it the same length. The gap penalty is just some integer, say -5.

In a nutshell we're doing a 5-way comparison:
- Gaps.
- A, G, C, or T (those are the four letters).

This is the score table (substitution matrix).


In dynamic programming you're considering an optimal sequence alignment by considering three options for each location in the sequence: down, right, or down-right. Visual example:


So basically, we have our substitution matrix (4x4) which has the values for a pair of letters found, listed in the score table. Then the scoring matrix maintains the current alignment score for the particular alignment. The arrow matrix determines the optimal score path (It is created at the same time the scoring matrix is). Then we backtrace that path to get the strings aligned so they're the same length, and finally use the score function.
__________________
Quote:
Originally Posted by hi19hi19 View Post
oh boy, it's STIFF, I'll stretch before I sit down at the computer so not I'm not as STIFF next time I step a file

Last edited by DossarLX ODI; 02-9-2014 at 10:00 AM..
DossarLX ODI is offline   Reply With Quote
Old 02-9-2014, 10:04 AM   #7
Reincarnate
x'); DROP TABLE FFR;--
Retired StaffFFR Veteran
 
Reincarnate's Avatar
 
Join Date: Nov 2010
Posts: 6,332
Default Re: Python pattern matching algorithm tuple error

so like this?

Code:
def stringScore( mat, baseMap, seq1, seq2, g = -5 ):
    score = g*abs(len(seq1)-len(seq2)) #if one string is longer than the other, gap penalty = g*(difference in lengths)
    
    for base1,base2 in zip(seq1,seq2):
        if base1 != base2:
            score -= mat[baseMap[base1]][baseMap[base2]]
        else:
            score += mat[baseMap[base1]][baseMap[base2]]
    
    return score
sorry still in tl;dr mode

are you trying to basically say "we don't know where the real gaps are, so let's try all possible ways of inserting gaps to determine the maximal possible score?"


ATGAATGCGATTTCGGGTGGCC-------
TTGGCAGGACATGAAGTTCGATACGGAA

probably a shitty score due to all mismatches, but if we had

-ATG-AATG-CGATT-TC-GG--GTGGCC
TTGGCAGGACATGAAGTTCGATACGGAA

or something (just making up a shitty example), then the gaps could possibly be matches and we'd have a better score

Last edited by Reincarnate; 02-9-2014 at 10:06 AM..
Reincarnate is offline   Reply With Quote
Old 02-9-2014, 10:06 AM   #8
DossarLX ODI
Batch Manager
Game Manager, Song Release Coordinator
Game ManagerSimfile JudgeFFR Simfile AuthorD7 Elite KeysmasherFFR Veteran
 
DossarLX ODI's Avatar
 
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,860
Default Re: Python pattern matching algorithm tuple error

Quote:
Originally Posted by Reincarnate View Post
are you trying to basically say "we don't know where the real gaps are, so let's try all possible ways of inserting gaps to determine the maximal possible score?"


ATGAATGCGATTTCGGGTGGCC-------
TTGGCAGGACATGAAGTTCGATACGGAA

probably a shitty score due to all mismatches, but if we had

-ATG-AATG-CGATT-TC-GG--GTGGCC
TTGGCAGGACATGAAGTTCGATACGGAA

or something (just making up a shitty example), then the gaps could possibly be matches and we'd have a better score
Yes, the gaps are supposed to align the sequences better. From a practical standpoint we also don't readily know where the gaps are so they have to be inserted based on how much the score will improve.

The score is calculated at the end. The scoring matrix is just made so the arrow matrix can be made to determine the best path for alignment. Then after those paths are backtraced (and then the strings reversed), the alignments are sent to the score function (essentially we're making a new string with dashes in it to indicate gaps since it was too short).

Gaps here are a pain in the ass to work with. The gaps are considered by the arrow matrix which considers the best possible path by checking how the score changes, because like I mentioned some strings aren't of equal length and that's when the gaps come in.

I guess another way of tl;dr this is that it's getting two strings with four possible letters, and if they are not the same length then gaps have to be inserted to make the letters line up better for an optimal score.
__________________
Quote:
Originally Posted by hi19hi19 View Post
oh boy, it's STIFF, I'll stretch before I sit down at the computer so not I'm not as STIFF next time I step a file

Last edited by DossarLX ODI; 02-9-2014 at 10:15 AM..
DossarLX ODI is offline   Reply With Quote
Old 02-9-2014, 10:26 AM   #9
Reincarnate
x'); DROP TABLE FFR;--
Retired StaffFFR Veteran
 
Reincarnate's Avatar
 
Join Date: Nov 2010
Posts: 6,332
Default Re: Python pattern matching algorithm tuple error

So if two bases match, they basically add the corresponding matrix value (typically 20), but if they mismatch, they subtract the matrix value?

And using a gap incurs a -5 penalty?
Reincarnate is offline   Reply With Quote
Old 02-9-2014, 10:29 AM   #10
DossarLX ODI
Batch Manager
Game Manager, Song Release Coordinator
Game ManagerSimfile JudgeFFR Simfile AuthorD7 Elite KeysmasherFFR Veteran
 
DossarLX ODI's Avatar
 
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,860
Default Re: Python pattern matching algorithm tuple error

Quote:
Originally Posted by Reincarnate View Post
So if two bases match, they basically add the corresponding matrix value (typically 20), but if they mismatch, they subtract the matrix value?

And using a gap incurs a -5 penalty?
From my understanding yes. It makes no sense that the score would increase if there is a mismatch, and it should subtract the matrix value. Subtracting 1 from the score for a mismatch sounds incredibly stupid and I'll be mad if that's actually the case here, but just go by the matrix values in that table.

Edit: Back. That was faster than I planned
__________________
Quote:
Originally Posted by hi19hi19 View Post
oh boy, it's STIFF, I'll stretch before I sit down at the computer so not I'm not as STIFF next time I step a file

Last edited by DossarLX ODI; 02-9-2014 at 10:49 AM..
DossarLX ODI is offline   Reply With Quote
Old 02-9-2014, 10:48 AM   #11
Reincarnate
x'); DROP TABLE FFR;--
Retired StaffFFR Veteran
 
Reincarnate's Avatar
 
Join Date: Nov 2010
Posts: 6,332
Default Re: Python pattern matching algorithm tuple error

Untested code, let me know if this is returning the right values for you -- if so then I'll clean up the code and make it a little more readable / with better coding practices / whatever -- just want to ensure it's returning the right vals first

Code:
class memoize:
    def __init__(self, fn):
        self.fn = fn
        self.memo = {}
    def __call__(self, *args, **kwds):
        import cPickle
        str = cPickle.dumps(args, 1)+cPickle.dumps(kwds, 1)
        if not self.memo.has_key(str):
            self.memo[str] = self.fn(*args, **kwds)
        return self.memo[str]



@memoize
def dpScore( mat, baseMap, seq1, seq2, gapsLeft, g ):
    if (len(seq1)==0 and gapsLeft>0) or gapsLeft<0: return -9999999999999999999999 #lazy right now, using a sentinel
    if len(seq1)==0 or len(seq2)==0: return 0

    possibleScore1 = g + mat[baseMap[seq1[0]]][baseMap[seq1[0]]] + dpScore( mat, baseMap, seq1[1:], seq2, gapsLeft-1, g )

    if seq1[0] == seq2[0]: matchScore = mat[baseMap[seq1[0]]][baseMap[seq2[0]]]
    else: matchScore = -mat[baseMap[seq1[0]]][baseMap[seq2[0]]]

    possibleScore2 = matchScore + dpScore( mat, baseMap, seq1[1:], seq2[1:], gapsLeft, g )
    
    return max(possibleScore1, possibleScore2)



def bestStringScore( mat, baseMap, seq1, seq2, g = -5 ):
    gapsLeft = abs(len(seq1)-len(seq2))
    if len(seq1) > len(seq2):
        return dpScore(mat, baseMap, seq1, seq2, gapsLeft, g)
    else:
        return dpScore(mat, baseMap, seq2, seq1, gapsLeft, g)



mat = [[20, 10, 5, 5],
       [10, 20, 5, 5],
       [5, 5, 20, 10],
       [5, 5, 10, 20]]

alphabet = "TCAG"
baseMap = {base:index for index,base in enumerate(alphabet)}

seq1 = 'ATGTTAT'
seq2 = 'ATCGTAG'

print bestStringScore(mat, baseMap, seq1, seq2)
Reincarnate is offline   Reply With Quote
Old 02-9-2014, 10:53 AM   #12
DossarLX ODI
Batch Manager
Game Manager, Song Release Coordinator
Game ManagerSimfile JudgeFFR Simfile AuthorD7 Elite KeysmasherFFR Veteran
 
DossarLX ODI's Avatar
 
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,860
Default Re: Python pattern matching algorithm tuple error

Yes, copied that code snippet and got 65 for those sequences (same with what I got).
__________________
Quote:
Originally Posted by hi19hi19 View Post
oh boy, it's STIFF, I'll stretch before I sit down at the computer so not I'm not as STIFF next time I step a file
DossarLX ODI is offline   Reply With Quote
Old 02-9-2014, 11:06 AM   #13
Reincarnate
x'); DROP TABLE FFR;--
Retired StaffFFR Veteran
 
Reincarnate's Avatar
 
Join Date: Nov 2010
Posts: 6,332
Default Re: Python pattern matching algorithm tuple error

I mean ideally you should also test with sequences that are of different lengths and then check to see if the resulting maximum score is what you get by hand (inserting gaps manually etc)
Reincarnate is offline   Reply With Quote
Old 02-9-2014, 11:11 AM   #14
DossarLX ODI
Batch Manager
Game Manager, Song Release Coordinator
Game ManagerSimfile JudgeFFR Simfile AuthorD7 Elite KeysmasherFFR Veteran
 
DossarLX ODI's Avatar
 
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,860
Default Re: Python pattern matching algorithm tuple error

I'm testing it with

seq1 = 'ATGTTAT'
seq2 = 'ATCGTAGTA'

seq1 = 'ATGTTA-T-'
seq2 = 'ATCGTAGTA'

20 + 20 - 5 - 5 + 20 + 20 - 5 + 20 - 5 = 80

seq1 = 'AT-GTTAT-'
seq2 = 'ATCGTAGTA'

20 + 20 - 5 + 20 + 20 - 5 - 10 + 20 -5 = 75

ATGTTA-T- would be the optimal sequence here. The second solution was slightly lower since AG is worth 10, not 5.

When I input into the program it said 105.

Edit: Ok so what I consider a possibility is that it did this:

AT-GTTAT-
ATCGTAGTA

20 + 20 - 5 + 20 + 20 + 5 + 10 + 20 - 5 = 105

Edit 2: Optimal score looks like 100 actually.

seq1 = 'AT-GTTA-T-'
seq2 = 'ATCGT-AGTA'

20 + 20 - 5 + 20 + 20 - 5 + 20 - 5 + 20 - 5 = 100
__________________
Quote:
Originally Posted by hi19hi19 View Post
oh boy, it's STIFF, I'll stretch before I sit down at the computer so not I'm not as STIFF next time I step a file

Last edited by DossarLX ODI; 02-9-2014 at 11:39 AM..
DossarLX ODI is offline   Reply With Quote
Old 02-9-2014, 11:43 AM   #15
Reincarnate
x'); DROP TABLE FFR;--
Retired StaffFFR Veteran
 
Reincarnate's Avatar
 
Join Date: Nov 2010
Posts: 6,332
Default Re: Python pattern matching algorithm tuple error

I think maybe I am not clear how you are supposed to be scoring stuff

is this right?

Code:
seq1: --ATGTTAT
seq2: ATCGTAGTA
score: -45

seq1: -A-TGTTAT
seq2: ATCGTAGTA
score: -45

seq1: -AT-GTTAT
seq2: ATCGTAGTA
score: -50

seq1: -ATG-TTAT
seq2: ATCGTAGTA
score: -25

seq1: -ATGT-TAT
seq2: ATCGTAGTA
score: 0

seq1: -ATGTT-AT
seq2: ATCGTAGTA
score: 0

seq1: -ATGTTA-T
seq2: ATCGTAGTA
score: -5

seq1: -ATGTTAT-
seq2: ATCGTAGTA
score: 20

seq1: A--TGTTAT
seq2: ATCGTAGTA
score: -20

seq1: A-T-GTTAT
seq2: ATCGTAGTA
score: -25

seq1: A-TG-TTAT
seq2: ATCGTAGTA
score: 0

seq1: A-TGT-TAT
seq2: ATCGTAGTA
score: 25

seq1: A-TGTT-AT
seq2: ATCGTAGTA
score: 25

seq1: A-TGTTA-T
seq2: ATCGTAGTA
score: 20

seq1: A-TGTTAT-
seq2: ATCGTAGTA
score: 45

seq1: AT--GTTAT
seq2: ATCGTAGTA
score: 5

seq1: AT-G-TTAT
seq2: ATCGTAGTA
score: 30

seq1: AT-GT-TAT
seq2: ATCGTAGTA
score: 55

seq1: AT-GTT-AT
seq2: ATCGTAGTA
score: 55

seq1: AT-GTTA-T
seq2: ATCGTAGTA
score: 50

seq1: AT-GTTAT-
seq2: ATCGTAGTA
score: 75

seq1: ATG--TTAT
seq2: ATCGTAGTA
score: 5

seq1: ATG-T-TAT
seq2: ATCGTAGTA
score: 30

seq1: ATG-TT-AT
seq2: ATCGTAGTA
score: 30

seq1: ATG-TTA-T
seq2: ATCGTAGTA
score: 25

seq1: ATG-TTAT-
seq2: ATCGTAGTA
score: 50

seq1: ATGT--TAT
seq2: ATCGTAGTA
score: 5

seq1: ATGT-T-AT
seq2: ATCGTAGTA
score: 5

seq1: ATGT-TA-T
seq2: ATCGTAGTA
score: 0

seq1: ATGT-TAT-
seq2: ATCGTAGTA
score: 25

seq1: ATGTT--AT
seq2: ATCGTAGTA
score: 30

seq1: ATGTT-A-T
seq2: ATCGTAGTA
score: 25

seq1: ATGTT-AT-
seq2: ATCGTAGTA
score: 50

seq1: ATGTTA--T
seq2: ATCGTAGTA
score: 55

seq1: ATGTTA-T-
seq2: ATCGTAGTA
score: 80

seq1: ATGTTAT--
seq2: ATCGTAGTA
score: 55
max score 80?

score: +matrix[base1][base2] if base1 and base2 match (e.g. "AA"), -matrix[base1][base2] if they don't match (e.g. "GC"), and -5 if we have a gap (e.g. "T-")

Wait you can insert gaps in both sequences, or just the shorter of the two?

(I also don't want to waste your time if I am going off-track with all this -- feel free to ignore if you're on a tight deadline)

Last edited by Reincarnate; 02-9-2014 at 11:45 AM..
Reincarnate is offline   Reply With Quote
Old 02-9-2014, 11:50 AM   #16
DossarLX ODI
Batch Manager
Game Manager, Song Release Coordinator
Game ManagerSimfile JudgeFFR Simfile AuthorD7 Elite KeysmasherFFR Veteran
 
DossarLX ODI's Avatar
 
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,860
Default Re: Python pattern matching algorithm tuple error

You can insert gaps in both sequences. This picture is straight out of the assignment; the table is annoying as fuck to read so I'll basically just say what he was getting at.

Original sequences:
seq1: 'ATG'
seq2: 'GGAAT'

The two alignments listed

G G A A T
- - A T G

G G A A T -
- - - A T G

give the same score.

For the first one: (0) - 5 - 5 + 20 + 5 + 5 = 20
For the second one: (0) - 5 - 5 - 5 + 20 + 20 - 5 = 20



It pisses me off how ambiguous my professor was on this assignment. Apparently the score has to add up everything, even mismatches, and only penalize for gaps. This is stupid.

Quote:
Originally Posted by Reincarnate View Post
(I also don't want to waste your time if I am going off-track with all this -- feel free to ignore if you're on a tight deadline)
The current code I have doesn't work since I can't even import numpy or scipy. There's something about them being 32-bit when I have a 64-bit machine. Even then I don't even know if the code from the textbook even works because well, the variable names used were absolute garbage and it had errors when I tried running it.
__________________
Quote:
Originally Posted by hi19hi19 View Post
oh boy, it's STIFF, I'll stretch before I sit down at the computer so not I'm not as STIFF next time I step a file

Last edited by DossarLX ODI; 02-9-2014 at 11:55 AM..
DossarLX ODI is offline   Reply With Quote
Old 02-9-2014, 11:54 AM   #17
Reincarnate
x'); DROP TABLE FFR;--
Retired StaffFFR Veteran
 
Reincarnate's Avatar
 
Join Date: Nov 2010
Posts: 6,332
Default Re: Python pattern matching algorithm tuple error

So it's always +matrix[base1][base2] and then -5 if we have "*-" or "-*" where * is A,C,T, or G?
Reincarnate is offline   Reply With Quote
Old 02-9-2014, 11:57 AM   #18
Reincarnate
x'); DROP TABLE FFR;--
Retired StaffFFR Veteran
 
Reincarnate's Avatar
 
Join Date: Nov 2010
Posts: 6,332
Default Re: Python pattern matching algorithm tuple error

You can use numpy if you're using a 64-bit machine (that's what I'm on) -- you just have to make sure you're using the right versions of Python/Numpy/etc.
Reincarnate is offline   Reply With Quote
Old 02-9-2014, 11:57 AM   #19
DossarLX ODI
Batch Manager
Game Manager, Song Release Coordinator
Game ManagerSimfile JudgeFFR Simfile AuthorD7 Elite KeysmasherFFR Veteran
 
DossarLX ODI's Avatar
 
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,860
Default Re: Python pattern matching algorithm tuple error

Yes, if there is a dash it is a gap penalty. If the letters don't match up it still adds positively to the score if they're not gaps (which I think is very counterintuitive).

I am using Python 2.7.6 by the way, which are you using?
__________________
Quote:
Originally Posted by hi19hi19 View Post
oh boy, it's STIFF, I'll stretch before I sit down at the computer so not I'm not as STIFF next time I step a file
DossarLX ODI is offline   Reply With Quote
Old 02-9-2014, 11:59 AM   #20
Reincarnate
x'); DROP TABLE FFR;--
Retired StaffFFR Veteran
 
Reincarnate's Avatar
 
Join Date: Nov 2010
Posts: 6,332
Default Re: Python pattern matching algorithm tuple error

Is it possibly something like this?

Code:
class memoize:
    def __init__(self, fn):
        self.fn = fn
        self.memo = {}
    def __call__(self, *args, **kwds):
        import cPickle
        str = cPickle.dumps(args, 1)+cPickle.dumps(kwds, 1)
        if not self.memo.has_key(str):
            self.memo[str] = self.fn(*args, **kwds)
        return self.memo[str]

@memoize
def dpScore( mat, baseMap, seq1, seq2, g = -5 ):
    if len(seq1)*len(seq2)==0: #one or the other is 0
        return g*abs(len(seq1)-len(seq2)) #the rest shall be gaps, damn it

    possibleScore1 = g + dpScore( mat, baseMap, seq1[1:], seq2, g )
    possibleScore2 = g + dpScore( mat, baseMap, seq1, seq2[1:], g )
    possibleScore3 = mat[baseMap[seq1[0]]][baseMap[seq2[0]]] + dpScore( mat, baseMap, seq1[1:], seq2[1:], g )
    
    return max([possibleScore1, possibleScore2, possibleScore3])



mat = [[20, 10, 5, 5],
       [10, 20, 5, 5],
       [5, 5, 20, 10],
       [5, 5, 10, 20]]

alphabet = "TCAG"
baseMap = {base:index for index,base in enumerate(alphabet)}
print dpScore(mat, baseMap, 'ATCGTAGTA', 'ATGTTAT')
I am using 2.7.3 (I should probably upgrade but this version has always done everything I've needed so I haven't changed it yet)

Last edited by Reincarnate; 02-9-2014 at 12:05 PM..
Reincarnate is offline   Reply With Quote
Reply


Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are Off
[IMG] code is On
HTML code is Off

Forum Jump



All times are GMT -5. The time now is 01:43 AM.


Powered by vBulletin® Version 3.8.1
Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
Copyright FlashFlashRevolution