富集分析

Posted on 2020-12-23

前景基因：差异基因
背景基因：对于有参的物种来说，一般建议用参考基因组上的全部基因，对于无参的物种来说，选择组装出来的全部unigenes作为背景基因。

https://www.sohu.com/a/393592549_278730

可以对GO分析的go term 进行汇总并且制图。

http://revigo.irb.hr/

机器学习白板学习-shuhuai008

Posted on 2020-12-03

数学概念

Posted on 2020-11-26

广义线性模型GLM及负二项回归

参考，小白鱼的生统笔记写的非常棒：

https://cloud.tencent.com/developer/article/1667175

一般线性模型:假设应变量Y呈正态分布

$u_y=\beta_0+\sum_{j+=1}^p\beta_jX_j$

广义线性模型：$g(u_y)$代表了条件均值的函数（指数、泊松、二项式、负二项式等），因此应变量Y服从指数分布的某一种，因此广义线性模型也涵盖了许多非线性模型的存在。

$g(u_y)=\beta_0+\sum_{j+=1}^p\beta_jX_j$

生物数据常见计数型数据，常常偏离正态性，之前常用泊松回归，现在更多被负二项回归取代，可解决过大离差问题，广泛用于计数型因变量的生物统计领域。基于二、三代测序获得的基因表达值通常以reads count值等表示，就是典型的计数型数据。要计算基因表达值这类计数型数值在组间的差异，常规的t检验等方法的统计功效会降低，这时负二项回归就是很好的选择。这也是那些总所周知的基因表达分析R包如edgeR、DESeq2等广为流行的原因，它们的原理就是负二项回归。
负二项回归

java入门

Posted on 2020-11-25

JAVA 基本语法

参考链接：https://www.jianshu.com/p/472c84347b97

简介

JVM(java virtual machine)
specification, implemenation and instance

public class HelloWord{
    public static void main(String []args) {  ## java程序总入口
        System.out.println("Hello World");
    }

}

标注

类：首字母大写
方法：首字母小写
源文件：与类名相同，后缀为.java

注释

//单行注释
/*多行注释
*多行注释第二行
/*多行注释第三行*/

类

方法在类中，比如eating是Person的方法

public class Person{
String name;
int age;
boolean sex;
void eating(){ //方法1
    }
void running(){ //方法2
    }
void sleeping(){
    }
}

类的构造方法

# 指定构造方法
class Name{
public sam(String name){ //默认了构造方法
    System.out.println("Sam");
}
}

# 构造对象
public class Car{
    public Puppy(String name){
        // 这个构造器仅有一个参数: name
        System.out.println("My car brand is :"+name);
    }
    public static void main(String []args){
        // 下面的语句将创建一个Car对象
        Puppy myPuppy=new Car( "Benz" );
    }
}

#output
My car brand is :Benz

创建对象

对象时根据类创建的，在java中，使用关键字来创建一个新的对象，创建对象的步骤：

1. 声明：声明一个对象，包括对象名称和对象类型
1. 实例化：使用关键字new来创建一个对象
1. 初始化：使用new创建对象时，会调用构造方法初始化对象

public class Car {
    String brandname;
    public Car(String name){
        // 这个构造器有且仅有一个参数：name
        System.out.println("Car's brand is:" + name);
    }
    public void setbrand(String name){
        brandname = name;
        /void(空)方法不需要返回值/
    }
    public string getname(){
        System.out.println("Brand's name is :"+brandname);
        return brandname;
        /*设定为String类型的方法必须返回String类型的值*/
    }
    public static void main(String[] args){
        // 创建对象
        Car c=new Car("BMW");
        // 通过方法来设定车名
        c.setbrand("BMW");
        // 调用另外一个方法获取
        c.getname();
        //直接访问成员变量
        System.out.println("String Value :" + c.brandname);
    }
}


#otutput
Car's brand is:BMW
Brand's name is :BMW
String Value :abc

规则

源文件声明是有规则的。当在一个源文件中定义多个类，并且还有import语句和package语句时，要特别注意这些规则。

一个源文件中只能有一个public类

一个源文件可以有多个非public类

源文件的名称应该和public类的类名保持一致。例如：源文件中public类的类名是Employee，那么源文件应该命名为Employee.java。

如果一个类定义在某个包中，那么package语句应该在源文件的首行。

如果源文件包含import语句，那么应该放在package语句和类定义之间。如果没有package语句，那么import语句应该在源文件中最前面。

import语句和package语句对源文件中定义的所有类都有效。在同一源文件中，不能给不同的类不同的包声明。

MIT CompBio

Posted on 2020-11-24

Lecture 1 Introduction

Lecture 2 Dynamic Programming(动态规划)

Module 1: Aligning and modeling genomes

人的基因组1.5%编码20000多个基因，与其他物种高度保守，剩余序列不保守。

Introduction to sequence alignment

genomes change over time.

occam’s razor:找最短路径

比对的本质就是两个字符串相似性的比较，是否gaps allow, 找到longest common subsequence.

Formualation1. longest common substring(no gap)
Formualation2. longest common subsequnece( gaps allowed)
edit distance: number of change needed for S1-S2,uniform scoring function

Formualtion 3. Sequence alignment
碱基替换的方式有两种，一种allow gaps(fixed penalty)，所有替换打分类似。另一种varing penalties for edit operations.包含transition（嘌呤和嘌呤，嘧啶和嘧啶），tranversion（嘌呤嘧啶随意转换），polymerase（不懂）。

Formulation 4. varing gap cost models
linear gap penality(用时短), affine gap penalty,general gap penalty,frame-aware gap penalty(蛋白比对，multiples of 3 disrupt coding regions), seek sulicated regions, rearrangements

Introduction to principles of dynical programming

由于找到最优序列比对是一个指数级的运算，所以省时省力，需要使用动态规划。(指数型到多项式)

斐波那契数列:


# top down
def fibonacci(n):
 if n==1 or n==2: return 1
 return fibonacci(n-1)+fibonacci(n-2)

 # bottom up
 def fibonacci(n):
 fib_table[1]=1
 fib_table[2]=1
 for i in range(3,n+1):
 fib_table[i]=fib_table[i-1]+fib_table[i-2]
 return fib_table[n]

DP for sequence alignment

1. Parameterization 参数化
1. sub-problem space 划分亚空间
1. traversal order 遍历的顺序
1. recursion formula 递归公式
1. trace-back

Apply dynamic programming to sequence alignment?

key: score is additive, smaller to larger

calculate maximun alignment score of longer sequences based on previously-computed to scores of shorter sequences.

shell命令

Posted on 2020-11-22

统计文件夹中文件的个数

1	ls -l \| grep "^-" \| wc -l

统计文件的行数和字数

1 2	wc -l log.txt wc -w log.txt

比较数值

a=10
b=1
[ $a -lt $b ] # 注意括号两边需要留空格

# 报错:integer expression expected
int=${frac%.*} #需要将字符型或者num型的转化为int型
[ $int -lt $b]

新建文件夹

1
2
3

if [ ! -d "$outfolder" ];then
    mkdir $outfolder
fi

shell命令的理论学习

查看当前目录下一级文件夹的大小，不展示所有文件

du -lh –max-depth=1

删除带有特定字符文件

1	find ./ -name unaligned_mc_tagged_polyA_filtered* \| xargs rm

fastq及bam文件的操作

# 查看第九列为88的内容
samtools view human19.bam |awk -F'\t' 'function abs(x){return((x<0.0)? -x:x)} abs($9)==88{print $0}'

# 打印2-4行
sed -n '2~4p' human19_19bp_R1.fastq

# 打印具有固定字符的行
cat ../unaligned_human_R1.fastq| grep -A 3  'CAAGGCGACGAACTAGGCGC:E100019795L1C032R02303048097|CAAGGCGACGAACTAGGCGC:E100019795L1C032R02303048097|CAAGGCGACGAACTAGGCGC:E100019795L1C032R02303179234|CAAGGCGACGAACTAGGCGC:E100019795L1C032R02303179234|CAAGGCGACGAACTAGGCGC:E100019795L1C032R02400179922|CAAGGCGACGAACTAGGCGC:E100019795L1C032R02400179922' #-A：显示匹配到字符那行的后面n行 

cat ../unaligned_human_R1.fastq| egrep -A 3  'CAAGGCGACGAACTAGGCGC:E100019795L1C032R02303048097|CAAGGCGACGAACTAGGCGC:E100019795L1C032R02303048097‘ # 相当于grep -E,匹配比grep更多。

桌面卡死的操作，重新登陆。

1	pkill Xorg #但是之前的进展看不多了

less human.sorted_peaks.narrowPeak |sed ‘s/“ “/\t/g’|awk {p’rint $1”\t”$2”\t”$3”\t”$4”\t”$5}’>MW_human.bed

统计fastq的reads 数量

1	zcat H_R1.fq.gz \| echo $((`wc -l`/4))

每日一记20201122

Posted on 2020-11-22

度量两个排序列表的相似性

https://zhuanlan.zhihu.com/p/63279107
基于排序之间的相互关系

1. Kendall Tau
1. Spearman’s Footrule: 计算一个列表改为另一个列表最少需要移动各个元素的距离和
  基于集合的度量
  不理解

shell 统计

SampleReadsFromBamFiles

Posted on 2020-11-19

下采样一定数量的reads

http://seqanswers.com/forums/showthread.php?t=41983
https://bioinformatics.stackexchange.com/questions/402/how-can-i-downsample-a-bam-file-while-keeping-both-reads-in-pairs
方法1

(
samtools view -H [bamfile];
samtools view -F 0x004 [bamfile] |
java -jar StreamSampler.jar [# of reads to sample] [total # reads]
) |
samtools -bS - > [sampled bam file]

方法2

function SubSample {
## Calculate the sampling factor based on the intended number of reads:
FACTOR=$(samtools idxstats $1 | cut -f3 | awk -v COUNT=$2 'BEGIN {total=0} {total += $1} END {print COUNT/total}')

if [[ $FACTOR > 1 ]]
  then 
  echo '[ERROR]: Requested number of reads exceeds total read count in' $1 '-- exiting' && exit 1
fi

sambamba view -s $FACTOR -f bam -l 5 $1

}

## Usage example, selecting 100.000 reads:
SubSample in.bam 100000 > subsampled.bam

下采样一定百分比的reads

samtools view -bs 42.1 in.bam > subsampled.bam

samtool命令

Posted on 2020-11-19

参考https://blog.csdn.net/sunchengquan/article/details/85176940

下采样一定数量的reads

http://seqanswers.com/forums/showthread.php?t=41983
https://bioinformatics.stackexchange.com/questions/402/how-can-i-downsample-a-bam-file-while-keeping-both-reads-in-pairs

方法1

## 需要额外下载StreamSampler.jar，并且有bam文件版本限制，不好用
(
samtools view -H [bamfile];
samtools view -F 0x004 [bamfile] |
java -jar StreamSampler.jar [# of reads to sample] [total # reads]
) |
samtools -bS - > [sampled bam file]

方法2 sample 1000 reads/cell，可用

##在shell直接执行，存在一个问题就是subsample的不是正好1000，可能是999。因为flagstat里面有两个reads相加，所以需要double number。

frac=$( samtools flagstat $outbam | cut -f3 | awk 'BEGIN {total=0} {total += $1} END {frac=2000/total;{print frac}}')
int=${frac%.*}
if [ $int -eq 0 ];then
    samtools view -bs $frac $outbam > $outfolder3$samplebam
fi


## 原始网页写法，但是运行terminal会闪退。

function SubSample {
## Calculate the sampling factor based on the intended number of reads:
FACTOR=$(samtools idxstats $1 | cut -f3 | awk -v COUNT=$2 'BEGIN {total=0} {total += $1} END {print COUNT/total}')

if [[ $FACTOR > 1 ]]
  then 
  echo '[ERROR]: Requested number of reads exceeds total read count in' $1 '-- exiting' && exit 1
fi

samtools view -s $FACTOR -f bam -l 5 $1
}

## Usage example, selecting 100.000 reads:
SubSample in.bam 100000 > subsampled.bam

下采样一定百分比的reads

1	samtools view -bs 42.1 in.bam > subsampled.bam

过滤低mapping质量的reads,并且sam转bam

1
2
3

samfile=out.sam
outbam=filtered.bam
samtools view -h -q 10 $samfile | samtools view -bS > $outbam

过滤PCR duplication

1	samtools rmdup in.bam -o rmdup.bam

bam, sam的相互转化

1 2	samtools view -bS in.sam> out.bam samtools view -h -o out.sam in.bam

1 2	/samtools view -h -f 0x002 file.bam \|\ egrep "(^@\|XT\:A\:U)"

从bam file中提取固定标签的文件

cd /media/ggj/home/ggj/tmp/DarkReaction/barcode/nofilter
BAM_FILE="/media/ggj/home/ggj/tmp/DarkReaction/barcode/nofilter/unaligned_tagged_filtered.bam"
filter_file="/media/ggj/home/ggj/tmp/DarkReaction/barcode/barcode_88K_tag.txt"
# Save the header lines
samtools view -H $BAM_FILE > SAM_header

# Filter alignments using filter.txt. Use LC_ALL=C to set C locale instead of UTF-8
samtools view $BAM_FILE | LC_ALL=C grep -F -f $filter_file > filtered_SAM_body

# Combine header and body
cat SAM_header filtered_SAM_body > filtered.sam

# Convert filtered.sam to BAM format
samtools view -b filtered.sam > filtered.bam

每日文献

Posted on 2020-11-16

Feilijiang

生物信息记录分享博客