MIT CompBio

Lecture 1 Introduction

Lecture 2 Dynamic Programming(动态规划)

Module 1: Aligning and modeling genomes


人的基因组1.5%编码20000多个基因,与其他物种高度保守,剩余序列不保守。

Introduction to sequence alignment

genomes change over time.

occam’s razor:找最短路径

比对的本质就是两个字符串相似性的比较,是否gaps allow, 找到longest common subsequence.

Formualation1. longest common substring(no gap)

Formualation2. longest common subsequnece( gaps allowed)

edit distance: number of change needed for S1-S2,uniform scoring function

Formualtion 3. Sequence alignment

碱基替换的方式有两种,一种allow gaps(fixed penalty),所有替换打分类似。另一种varing penalties for edit operations.包含transition(嘌呤和嘌呤,嘧啶和嘧啶),tranversion(嘌呤嘧啶随意转换) ,polymerase(不懂)。

Formulation 4. varing gap cost models

linear gap penality(用时短), affine gap penalty,general gap penalty,frame-aware gap penalty(蛋白比对,multiples of 3 disrupt coding regions), seek sulicated regions, rearrangements

Introduction to principles of dynical programming

由于找到最优序列比对是一个指数级的运算,所以省时省力,需要使用动态规划。(指数型到多项式)

斐波那契数列:

1
2
3
4
5
6
7
8
9
10
11
12
13

# top down
def fibonacci(n):
if n==1 or n==2: return 1
return fibonacci(n-1)+fibonacci(n-2)

# bottom up
def fibonacci(n):
fib_table[1]=1
fib_table[2]=1
for i in range(3,n+1):
fib_table[i]=fib_table[i-1]+fib_table[i-2]
return fib_table[n]

DP for sequence alignment

    1. Parameterization 参数化
    1. sub-problem space 划分亚空间
    1. traversal order 遍历的顺序
    1. recursion formula 递归公式
    1. trace-back

Apply dynamic programming to sequence alignment?

key: score is additive, smaller to larger

calculate maximun alignment score of longer sequences based on previously-computed to scores of shorter sequences.