02-8-2014, 11:15 PM | #1 |
Batch Manager
Game Manager, Song Release Coordinator
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,860
|
Python pattern matching algorithm tuple error
I’m new at Python so this is like drinking out of a fire hose. I’ll explain the context of what the problem is here to give a better idea of what to discuss.
This program gives a pairwise DNA alignment using “dynamic programming”. A score is assigned based on how well the pairs line up; preferably the optimal alignment should be reached, but that can be discussed later. The “substitution matrix” here considers the possible pair matchings and assigns a score value to each. Every time there is the same letter, it is 20 points. This table shows the values. Code:
alphabet = "TCAG" # 0 for T, 1 for C, 2 for A, 3 for G. In the case of dynamic programming, an optimal alignment is accomplished by considering three options for each location in the DNA sequence, and selecting the best option before continuing on. (Conceptually, the second sequence would be the columns of the matrix, and the first sequence is the rows of the matrix. This is just an example) Don’t worry about affine gap penalties (consecutive gaps), there’s just only a gap penalty variable required here and it’s set to -5 points. A gap (indel) is when we can’t distinguish whether a letter was inserted into a string or removed from the other string. For alignment purposes we can’t tell so gaps are used when necessary. So now onto the problems! In order to get the aligned results of the two DNA sequences, it looks like one would have to: 1.) Make the scoring matrix. <-- Scoring matrix maintains the current alignment score for the particular alignment. 2.) Make the arrow matrix. <-- determines the optimal score path. It is created at the same time the scoring matrix is. 3.) Get path from the arrow matrix through a backtracing algorithm and reverse the string after. 4.) Pass the two strings to a scoring function and calculate the score. Code:
# Our two DNA sequences. seq1 = 'ATGTTAT' seq2 = 'ATCGTAG' alphabet = "TCAG" # 0 for T, 1 for C, 2 for A, 3 for G. mat = [ [20, 10, 5, 5], [10, 20, 5, 5], [5, 5, 20, 10], [5, 5, 10, 20], ] In this code we’re making the Scoring and Arrow Matrices. The first row and column is filled entirely with gap penalties. Making the arrow matrix from the previous block of code allows us to backtrace the path to return the aligned DNA sequences to get their scores calculated. Now what’s confusing here is why when I did mat[x,y] in this stringScore function it wasn’t converted into an integer, I got an error about a tuple: >>> t1 = stringScore( mat, alphabet, seq1 , seq2 , g = -5 ) Traceback (most recent call last): File "<pyshell#31>", line 1, in <module> t1 = stringScore( mat, alphabet, seq1 , seq2 , g = -5 ) File "C:\bioinformatics\hw2.py", line 126, in stringScore score -= mat[x,y] # Search substitution matrix for value to subtract according to pair. TypeError: list indices must be integers, not tuple >>> Code:
""" Calculate score of the two DNA sequences. They are stored as strings in two arrays. """ def stringScore( mat, alphabet, seq1 , seq2 , g = -5 ): length = len( seq1 ) score = 0 for i in range (length): if seq1[i] != seq2[i]: x = alphabet.index( seq1[i] ) y = alphabet.index( seq2[i] ) score -= mat[x,y] # Search substitution matrix for value to subtract according to pair. elif seq1[i] == '-' or seq2[i] == '-': score += g # Gap penalty else: score += 20 # This means the two letters match. return score |
02-9-2014, 12:22 AM | #2 |
FFR Veteran
Join Date: Oct 2008
Location: Canada
Posts: 448
|
Re: Python pattern matching algorithm tuple error
A really simple error: you need to use map[x][y] instead of map[x,y] ("x,y" gets interpreted as (x, y), so you access map[ (x, y) ] which Python doesn't like)
I'll look over the rest of the code to see if there's any other errors edit: wow I did not know you could put commas like that in splices... see if the above works if not then... hm Last edited by beary605; 02-9-2014 at 03:48 AM.. |
02-9-2014, 09:29 AM | #3 | |
Batch Manager
Game Manager, Song Release Coordinator
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,860
|
Re: Python pattern matching algorithm tuple error
Quote:
>>> t1 = stringScore( mat, alphabet, seq1 , seq2 , g = -5 ) >>> t1 65 >>> Which looks correct, because the two strings would result in: 20 + 20 + (-5) + (-5) + 20 + 20 + (-5) = 80 - 15 = 65 Those negatives being the mismatches. Now I'll see what goes wrong with the other functions. I know that the scoring function works when the strings are aligned, so that's a good start. Edit: Ok, another problem Code:
>>> from numpy import * Traceback (most recent call last): File "<pyshell#2>", line 1, in <module> from numpy import * File "C:\Python27\lib\site-packages\numpy\__init__.py", line 153, in <module> from . import add_newdocs File "C:\Python27\lib\site-packages\numpy\add_newdocs.py", line 13, in <module> from numpy.lib import add_newdoc File "C:\Python27\lib\site-packages\numpy\lib\__init__.py", line 8, in <module> from .type_check import * File "C:\Python27\lib\site-packages\numpy\lib\type_check.py", line 11, in <module> import numpy.core.numeric as _nx File "C:\Python27\lib\site-packages\numpy\core\__init__.py", line 6, in <module> from . import multiarray ImportError: DLL load failed: %1 is not a valid Win32 application.
__________________
Last edited by DossarLX ODI; 02-9-2014 at 09:52 AM.. |
|
02-9-2014, 09:34 AM | #4 |
x'); DROP TABLE FFR;--
Join Date: Nov 2010
Posts: 6,332
|
Re: Python pattern matching algorithm tuple error
I am way too lazy to read through all this, can you just explain what you need to do in like 2 sentences -- I'll whip up the solution in Python real quick and you can look at it or something
you have two sequences seq1 = 'ATGTTAT' seq2 = 'ATCGTAG' what's the score and why (or are you trying to do something else?) |
02-9-2014, 09:57 AM | #5 |
x'); DROP TABLE FFR;--
Join Date: Nov 2010
Posts: 6,332
|
Re: Python pattern matching algorithm tuple error
something like this maybe
Code:
def stringScore( mat, baseMap, seq1, seq2, g = -5 ): if len(seq1) != len(seq2): raise Exception("Fix your shit, both strings must be of equal length") score = 0 for base1,base2 in zip(seq1,seq2): if base1 != base2: score -= mat[baseMap[base1]][baseMap[base2]] elif base1 == "-" or base2 == "-": score += g else: score += mat[baseMap[base1]][baseMap[base2]] return score mat = [[20, 10, 5, 5], [10, 20, 5, 5], [5, 5, 20, 10], [5, 5, 10, 20]] alphabet = "TCAG" baseMap = {base:index for index,base in enumerate(alphabet)} seq1 = 'ATGTTAT' seq2 = 'ATCGTAG' print stringScore(mat, baseMap, seq1, seq2) Last edited by Reincarnate; 02-9-2014 at 10:02 AM.. |
02-9-2014, 09:58 AM | #6 | |
Batch Manager
Game Manager, Song Release Coordinator
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,860
|
Re: Python pattern matching algorithm tuple error
Quote:
The program is supposed to work for any two sequences e.g. ATGAATGCGATTTCGGGTGGCC TTGGCAGGACATGAAGTTCGATACGGAA Obviously the first string is too short so that's when we have to insert gaps and there are gap penalties to make it the same length. The gap penalty is just some integer, say -5. In a nutshell we're doing a 5-way comparison: - Gaps. - A, G, C, or T (those are the four letters). This is the score table (substitution matrix). In dynamic programming you're considering an optimal sequence alignment by considering three options for each location in the sequence: down, right, or down-right. Visual example: So basically, we have our substitution matrix (4x4) which has the values for a pair of letters found, listed in the score table. Then the scoring matrix maintains the current alignment score for the particular alignment. The arrow matrix determines the optimal score path (It is created at the same time the scoring matrix is). Then we backtrace that path to get the strings aligned so they're the same length, and finally use the score function.
__________________
Last edited by DossarLX ODI; 02-9-2014 at 10:00 AM.. |
|
02-9-2014, 10:04 AM | #7 |
x'); DROP TABLE FFR;--
Join Date: Nov 2010
Posts: 6,332
|
Re: Python pattern matching algorithm tuple error
so like this?
Code:
def stringScore( mat, baseMap, seq1, seq2, g = -5 ): score = g*abs(len(seq1)-len(seq2)) #if one string is longer than the other, gap penalty = g*(difference in lengths) for base1,base2 in zip(seq1,seq2): if base1 != base2: score -= mat[baseMap[base1]][baseMap[base2]] else: score += mat[baseMap[base1]][baseMap[base2]] return score are you trying to basically say "we don't know where the real gaps are, so let's try all possible ways of inserting gaps to determine the maximal possible score?" ATGAATGCGATTTCGGGTGGCC------- TTGGCAGGACATGAAGTTCGATACGGAA probably a shitty score due to all mismatches, but if we had -ATG-AATG-CGATT-TC-GG--GTGGCC TTGGCAGGACATGAAGTTCGATACGGAA or something (just making up a shitty example), then the gaps could possibly be matches and we'd have a better score Last edited by Reincarnate; 02-9-2014 at 10:06 AM.. |
02-9-2014, 10:06 AM | #8 | |
Batch Manager
Game Manager, Song Release Coordinator
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,860
|
Re: Python pattern matching algorithm tuple error
Quote:
The score is calculated at the end. The scoring matrix is just made so the arrow matrix can be made to determine the best path for alignment. Then after those paths are backtraced (and then the strings reversed), the alignments are sent to the score function (essentially we're making a new string with dashes in it to indicate gaps since it was too short). Gaps here are a pain in the ass to work with. The gaps are considered by the arrow matrix which considers the best possible path by checking how the score changes, because like I mentioned some strings aren't of equal length and that's when the gaps come in. I guess another way of tl;dr this is that it's getting two strings with four possible letters, and if they are not the same length then gaps have to be inserted to make the letters line up better for an optimal score.
__________________
Last edited by DossarLX ODI; 02-9-2014 at 10:15 AM.. |
|
02-9-2014, 10:26 AM | #9 |
x'); DROP TABLE FFR;--
Join Date: Nov 2010
Posts: 6,332
|
Re: Python pattern matching algorithm tuple error
So if two bases match, they basically add the corresponding matrix value (typically 20), but if they mismatch, they subtract the matrix value?
And using a gap incurs a -5 penalty? |
02-9-2014, 10:29 AM | #10 | |
Batch Manager
Game Manager, Song Release Coordinator
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,860
|
Re: Python pattern matching algorithm tuple error
Quote:
Edit: Back. That was faster than I planned
__________________
Last edited by DossarLX ODI; 02-9-2014 at 10:49 AM.. |
|
02-9-2014, 10:48 AM | #11 |
x'); DROP TABLE FFR;--
Join Date: Nov 2010
Posts: 6,332
|
Re: Python pattern matching algorithm tuple error
Untested code, let me know if this is returning the right values for you -- if so then I'll clean up the code and make it a little more readable / with better coding practices / whatever -- just want to ensure it's returning the right vals first
Code:
class memoize: def __init__(self, fn): self.fn = fn self.memo = {} def __call__(self, *args, **kwds): import cPickle str = cPickle.dumps(args, 1)+cPickle.dumps(kwds, 1) if not self.memo.has_key(str): self.memo[str] = self.fn(*args, **kwds) return self.memo[str] @memoize def dpScore( mat, baseMap, seq1, seq2, gapsLeft, g ): if (len(seq1)==0 and gapsLeft>0) or gapsLeft<0: return -9999999999999999999999 #lazy right now, using a sentinel if len(seq1)==0 or len(seq2)==0: return 0 possibleScore1 = g + mat[baseMap[seq1[0]]][baseMap[seq1[0]]] + dpScore( mat, baseMap, seq1[1:], seq2, gapsLeft-1, g ) if seq1[0] == seq2[0]: matchScore = mat[baseMap[seq1[0]]][baseMap[seq2[0]]] else: matchScore = -mat[baseMap[seq1[0]]][baseMap[seq2[0]]] possibleScore2 = matchScore + dpScore( mat, baseMap, seq1[1:], seq2[1:], gapsLeft, g ) return max(possibleScore1, possibleScore2) def bestStringScore( mat, baseMap, seq1, seq2, g = -5 ): gapsLeft = abs(len(seq1)-len(seq2)) if len(seq1) > len(seq2): return dpScore(mat, baseMap, seq1, seq2, gapsLeft, g) else: return dpScore(mat, baseMap, seq2, seq1, gapsLeft, g) mat = [[20, 10, 5, 5], [10, 20, 5, 5], [5, 5, 20, 10], [5, 5, 10, 20]] alphabet = "TCAG" baseMap = {base:index for index,base in enumerate(alphabet)} seq1 = 'ATGTTAT' seq2 = 'ATCGTAG' print bestStringScore(mat, baseMap, seq1, seq2) |
02-9-2014, 10:53 AM | #12 |
Batch Manager
Game Manager, Song Release Coordinator
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,860
|
Re: Python pattern matching algorithm tuple error
Yes, copied that code snippet and got 65 for those sequences (same with what I got).
|
02-9-2014, 11:06 AM | #13 |
x'); DROP TABLE FFR;--
Join Date: Nov 2010
Posts: 6,332
|
Re: Python pattern matching algorithm tuple error
I mean ideally you should also test with sequences that are of different lengths and then check to see if the resulting maximum score is what you get by hand (inserting gaps manually etc)
|
02-9-2014, 11:11 AM | #14 |
Batch Manager
Game Manager, Song Release Coordinator
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,860
|
Re: Python pattern matching algorithm tuple error
I'm testing it with
seq1 = 'ATGTTAT' seq2 = 'ATCGTAGTA' seq1 = 'ATGTTA-T-' seq2 = 'ATCGTAGTA' 20 + 20 - 5 - 5 + 20 + 20 - 5 + 20 - 5 = 80 seq1 = 'AT-GTTAT-' seq2 = 'ATCGTAGTA' 20 + 20 - 5 + 20 + 20 - 5 - 10 + 20 -5 = 75 ATGTTA-T- would be the optimal sequence here. The second solution was slightly lower since AG is worth 10, not 5. When I input into the program it said 105. Edit: Ok so what I consider a possibility is that it did this: AT-GTTAT- ATCGTAGTA 20 + 20 - 5 + 20 + 20 + 5 + 10 + 20 - 5 = 105 Edit 2: Optimal score looks like 100 actually. seq1 = 'AT-GTTA-T-' seq2 = 'ATCGT-AGTA' 20 + 20 - 5 + 20 + 20 - 5 + 20 - 5 + 20 - 5 = 100
__________________
Last edited by DossarLX ODI; 02-9-2014 at 11:39 AM.. |
02-9-2014, 11:43 AM | #15 |
x'); DROP TABLE FFR;--
Join Date: Nov 2010
Posts: 6,332
|
Re: Python pattern matching algorithm tuple error
I think maybe I am not clear how you are supposed to be scoring stuff
is this right? Code:
seq1: --ATGTTAT seq2: ATCGTAGTA score: -45 seq1: -A-TGTTAT seq2: ATCGTAGTA score: -45 seq1: -AT-GTTAT seq2: ATCGTAGTA score: -50 seq1: -ATG-TTAT seq2: ATCGTAGTA score: -25 seq1: -ATGT-TAT seq2: ATCGTAGTA score: 0 seq1: -ATGTT-AT seq2: ATCGTAGTA score: 0 seq1: -ATGTTA-T seq2: ATCGTAGTA score: -5 seq1: -ATGTTAT- seq2: ATCGTAGTA score: 20 seq1: A--TGTTAT seq2: ATCGTAGTA score: -20 seq1: A-T-GTTAT seq2: ATCGTAGTA score: -25 seq1: A-TG-TTAT seq2: ATCGTAGTA score: 0 seq1: A-TGT-TAT seq2: ATCGTAGTA score: 25 seq1: A-TGTT-AT seq2: ATCGTAGTA score: 25 seq1: A-TGTTA-T seq2: ATCGTAGTA score: 20 seq1: A-TGTTAT- seq2: ATCGTAGTA score: 45 seq1: AT--GTTAT seq2: ATCGTAGTA score: 5 seq1: AT-G-TTAT seq2: ATCGTAGTA score: 30 seq1: AT-GT-TAT seq2: ATCGTAGTA score: 55 seq1: AT-GTT-AT seq2: ATCGTAGTA score: 55 seq1: AT-GTTA-T seq2: ATCGTAGTA score: 50 seq1: AT-GTTAT- seq2: ATCGTAGTA score: 75 seq1: ATG--TTAT seq2: ATCGTAGTA score: 5 seq1: ATG-T-TAT seq2: ATCGTAGTA score: 30 seq1: ATG-TT-AT seq2: ATCGTAGTA score: 30 seq1: ATG-TTA-T seq2: ATCGTAGTA score: 25 seq1: ATG-TTAT- seq2: ATCGTAGTA score: 50 seq1: ATGT--TAT seq2: ATCGTAGTA score: 5 seq1: ATGT-T-AT seq2: ATCGTAGTA score: 5 seq1: ATGT-TA-T seq2: ATCGTAGTA score: 0 seq1: ATGT-TAT- seq2: ATCGTAGTA score: 25 seq1: ATGTT--AT seq2: ATCGTAGTA score: 30 seq1: ATGTT-A-T seq2: ATCGTAGTA score: 25 seq1: ATGTT-AT- seq2: ATCGTAGTA score: 50 seq1: ATGTTA--T seq2: ATCGTAGTA score: 55 seq1: ATGTTA-T- seq2: ATCGTAGTA score: 80 seq1: ATGTTAT-- seq2: ATCGTAGTA score: 55 score: +matrix[base1][base2] if base1 and base2 match (e.g. "AA"), -matrix[base1][base2] if they don't match (e.g. "GC"), and -5 if we have a gap (e.g. "T-") Wait you can insert gaps in both sequences, or just the shorter of the two? (I also don't want to waste your time if I am going off-track with all this -- feel free to ignore if you're on a tight deadline) Last edited by Reincarnate; 02-9-2014 at 11:45 AM.. |
02-9-2014, 11:50 AM | #16 |
Batch Manager
Game Manager, Song Release Coordinator
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,860
|
Re: Python pattern matching algorithm tuple error
You can insert gaps in both sequences. This picture is straight out of the assignment; the table is annoying as fuck to read so I'll basically just say what he was getting at.
Original sequences: seq1: 'ATG' seq2: 'GGAAT' The two alignments listed G G A A T - - A T G G G A A T - - - - A T G give the same score. For the first one: (0) - 5 - 5 + 20 + 5 + 5 = 20 For the second one: (0) - 5 - 5 - 5 + 20 + 20 - 5 = 20 It pisses me off how ambiguous my professor was on this assignment. Apparently the score has to add up everything, even mismatches, and only penalize for gaps. This is stupid. The current code I have doesn't work since I can't even import numpy or scipy. There's something about them being 32-bit when I have a 64-bit machine. Even then I don't even know if the code from the textbook even works because well, the variable names used were absolute garbage and it had errors when I tried running it.
__________________
Last edited by DossarLX ODI; 02-9-2014 at 11:55 AM.. |
02-9-2014, 11:54 AM | #17 |
x'); DROP TABLE FFR;--
Join Date: Nov 2010
Posts: 6,332
|
Re: Python pattern matching algorithm tuple error
So it's always +matrix[base1][base2] and then -5 if we have "*-" or "-*" where * is A,C,T, or G?
|
02-9-2014, 11:57 AM | #18 |
x'); DROP TABLE FFR;--
Join Date: Nov 2010
Posts: 6,332
|
Re: Python pattern matching algorithm tuple error
You can use numpy if you're using a 64-bit machine (that's what I'm on) -- you just have to make sure you're using the right versions of Python/Numpy/etc.
|
02-9-2014, 11:57 AM | #19 |
Batch Manager
Game Manager, Song Release Coordinator
Join Date: Mar 2008
Location: USA
Age: 29
Posts: 14,860
|
Re: Python pattern matching algorithm tuple error
Yes, if there is a dash it is a gap penalty. If the letters don't match up it still adds positively to the score if they're not gaps (which I think is very counterintuitive).
I am using Python 2.7.6 by the way, which are you using? |
02-9-2014, 11:59 AM | #20 |
x'); DROP TABLE FFR;--
Join Date: Nov 2010
Posts: 6,332
|
Re: Python pattern matching algorithm tuple error
Is it possibly something like this?
Code:
class memoize: def __init__(self, fn): self.fn = fn self.memo = {} def __call__(self, *args, **kwds): import cPickle str = cPickle.dumps(args, 1)+cPickle.dumps(kwds, 1) if not self.memo.has_key(str): self.memo[str] = self.fn(*args, **kwds) return self.memo[str] @memoize def dpScore( mat, baseMap, seq1, seq2, g = -5 ): if len(seq1)*len(seq2)==0: #one or the other is 0 return g*abs(len(seq1)-len(seq2)) #the rest shall be gaps, damn it possibleScore1 = g + dpScore( mat, baseMap, seq1[1:], seq2, g ) possibleScore2 = g + dpScore( mat, baseMap, seq1, seq2[1:], g ) possibleScore3 = mat[baseMap[seq1[0]]][baseMap[seq2[0]]] + dpScore( mat, baseMap, seq1[1:], seq2[1:], g ) return max([possibleScore1, possibleScore2, possibleScore3]) mat = [[20, 10, 5, 5], [10, 20, 5, 5], [5, 5, 20, 10], [5, 5, 10, 20]] alphabet = "TCAG" baseMap = {base:index for index,base in enumerate(alphabet)} print dpScore(mat, baseMap, 'ATCGTAGTA', 'ATGTTAT') Last edited by Reincarnate; 02-9-2014 at 12:05 PM.. |
Currently Active Users Viewing This Thread: 1 (0 members and 1 guests) | |
|
|