MATLAB installation

Posted on 2021-03-30

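Notes:
The university-provided version is about 20 GB, while the compressed installer from the official website is 226 MB; so far I have not found any difference between the two. On Linux the installer often runs into network connection problems, so I recommend the offline installation method: log in to your official account and download the license file (license.lic) and the installation key.

MATLAB has a fairly mature community, and most questions can be answered there.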

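Change the current directory
cd NEWfolder

Read and write CSV files
csvread() csvwrite()
(MathWorks now recommends readmatrix() and writematrix() instead, available since R2019a.)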

How to install Gephi
https://www.jianshu.com/p/73e5d81ddee7

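How to add MATLAB add-ons

Method 1: search directly inside MATLAB: APPS > Get More Apps.
Method 2: download the add-on package, unzip it, and add it to the path with addpath().

Gephi tutorial

Note that the imported DOT file contains a few extra lines.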

Single Cell Genomics Day 2021

Posted on 2021-03-28

Rahul Satija: Single-cell genomics, recent advances and future directions

1. Multimodality spanning the central dogma
  RNA+ surface protein(CITE-seq)
  RNA+ATAC(10X multiome)
  Drawback: these assays run ATAC on single nuclei, so they cannot capture surface proteins.

mtscATAC/ICICLE/ASAP-seq: optimized permeabilization enables ATAC-seq on whole cells

TEA-seq：scATAC+RNA+protein

Dogma-seq:scATAC+RNA+protein: surface protein, transcriptome, accessible chromatin, mtDNA mutations

inCITE-seq: intracellular protein + RNA

1. Reference mapping to single-cell atlas
  symphony
  scArches
  Seurat V4
  azimuth.hubmapconsortium.org: a web application that uses an annotated reference dataset to automate the processing, analysis, and interpretation of a new single-cell RNA-seq experiment.

1. Comparative analysis of scRNA-seq datasets

changes in cell-type abundance
changes in expression within a cell type
What about datasets without discrete clusters?

Three recommended methods:
Milo
muscat
MELD (Quantifying the effect of experimental perturbations at single-cell resolution)

1. High-resolution spatial transcriptomics

10x Visium: low resolution
Slide-seq: high resolution, small size
Slide-seq2: higher sensitivity
DBiT-seq (10 µm), Seq-Scope (< 1 µm), PIXEL-seq (< 1 µm), Stereo-seq (< 1 µm)

1. Spatial deconvolution based on scRNA-seq
  Core problem:
  deconvolution approaches must handle batch/platform effects

Recommended methods:
cell2location (Visium)
RCTD (Slide-seq2)

1. Next-generation single cell screens (Perturb-seq+)

multimodal perturb-seq: ECCITE-seq, Perturb-CITE-seq

1. Understanding the cellular origins of COVID-19
  ACE2/TMPRSS2 expression
  Dysfunctional myeloid cells
1. Summary

Question:
Can we use CellRank or RNA velocity to infer directions across conditions?
As long as there is a change, it can be mapped.

Cole Trapnell: Studying development robustness at the whole-scale embryo scale and sin

sci-Plex: profiles cells from thousands of specimens

Zebrafish atlas: 0-96 hpf, 900,000 cells, 859 embryos, 85 cell types, 15 time points
Genetic perturbations during zebrafish development: 3.1 million cells, 812 embryos, 5 time points, 22 perturbations, 18-72 hpf
Scientific question: how do gene circuits stabilize the developmental program?

Neural crest cells diverge into different cell types
Knocking out tfap2a and foxd3 affects neural cell development

Quantified with a metric: % increase in CV (observed/expected)
Mean and variance in count data are often correlated, so a generalized linear model is used to capture the mean-variance trend. This was later changed to a beta-binomial distribution model, which fits the recovered cell-type count data (Gamma GLM fit).

Perturbations can change the observed variance: cell number/cell and std. dev./mean
Knocking out the two genes affects not only the number of cells of each cell type but also the variance of each cell type.
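
The CV metric above can be sketched in a few lines. This is only an illustration under my own assumptions: the per-embryo counts are made up, the function names are hypothetical, and the control CV stands in for the "expected" CV (in the talk the expectation presumably comes from the fitted mean-variance model).

```python
import statistics

def coefficient_of_variation(counts):
    """CV = sample standard deviation / mean of the per-embryo counts."""
    return statistics.stdev(counts) / statistics.mean(counts)

def percent_increase_in_cv(observed_cv, expected_cv):
    """Percent increase of the observed CV over the expected CV."""
    return 100.0 * (observed_cv - expected_cv) / expected_cv

# Hypothetical counts of one cell type in five embryos per condition
control_counts = [100, 110, 90, 105, 95]
knockout_counts = [40, 120, 10, 90, 60]

cv_control = coefficient_of_variation(control_counts)
cv_knockout = coefficient_of_variation(knockout_counts)

# Assumption: the control CV is used as the expected CV
print(round(cv_control, 3), round(cv_knockout, 3))
print(round(percent_increase_in_cv(cv_knockout, cv_control), 1))
```

A knockout that lowers the mean count and also widens the spread across embryos shows up as a large positive percentage here.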

How does temperature stress destabilize embryonic development?

Raising the temperature:
embryos develop faster
staging embryos using variability in cell-type frequencies

Exploring variability in gene regulation across developing embryonic tissues

R command notes

Posted on 2021-03-10

Reading specific rows or columns of a data frame

http://shichangshun.cn/2015/11/09/%E6%8A%80%E6%9C%AF/2015-11-09-Read-rows-col/

if (!dir.exists("read")) {
  dir.create("read")
} # create the working-directory folder
setwd("./read") # set the working directory
if (!dir.exists("data")) {
  dir.create("data")
} # create the data folder

Repeating a command

var.genes <- Reduce(intersect, var.genes1)  # var.genes1 is a list of vectors; intersect is applied across all of them

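Fast correlation computation

library(HiClimR)
fastCor(mat)  # mat: a numeric matrix (placeholder name); correlations are computed between its columns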

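Replacing values in a data frame

method1: use a wrapper function

currentid  # vector of the current labels
newid      # vector of the replacement labels, in the same order
pbmc$ident <- plyr::mapvalues(pbmc$ident, from = currentid, to = newid)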
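
method2: use gsub

anno$celltype <- anno$cluster_new
# wrap the labels in underscores so that e.g. "1" does not also match inside "11"
anno$celltype <- paste0("_", anno$celltype, "_")
ma$cluster <- paste0("_", ma$cluster, "_")
for (i in unique(ma$cluster)) {
  # look up the replacement by matching the cluster label; indexing ma$V2 with the string itself returns NA
  anno$celltype <- gsub(pattern = i, replacement = ma$V2[ma$cluster == i][1], x = anno$celltype, fixed = TRUE)
  print(i)
}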

method3: use %>%

trimws

Removes leading or trailing spaces, newlines, tabs, and carriage returns.

trimws(gene, which = "both", whitespace = "[ \t\r\n]")

scale

library(scales)
rss_ct_scale <- apply(rss_ct, 1, rescale, to = c(-2, 2))  # rescale each row to [-2, 2]; apply() returns the result transposed

ggplot: multiple panels in one figure

Example

######################################  same mouse
library(gridExtra)
library(ggplot2)
library(ggpmisc)  # provides stat_fit_glance()
plot_correlation <- function(i){
  name1 <- paste0('T',paste0(i,"human"))
  name2 <- paste0('T',paste0(i+1,"human"))
  p<-ggplot(ave_teratoma, aes(x=get(name1), y=get(name2))) +
    geom_point(alpha=1, size=1) +
    geom_smooth(method="lm") +
    theme_bw() +
    xlab(name1)+ylab(name2)+
    ggtitle(paste0("Mouse",i/2-3))+
    theme(text = element_text(size=10), legend.title=element_blank(),plot.title = element_text(hjust = 0.5))+
    stat_fit_glance(method = 'lm',
                    method.args = list(formula = y ~ x),
                    mapping = aes(label = sprintf('R^2~"="~%.3f~~italic(P)~"="~%.2g', after_stat(r.squared), after_stat(p.value))),
                    parse = TRUE, label.x = 0.95, label.y = 0.95)
}

plot_list <- lapply(seq(8,18,2), plot_correlation)
n <- length(plot_list)
nCol <- floor(sqrt(n))
grid.arrange(grobs = plot_list, ncol = nCol) ## display plot
ggsave(file = "same_mouse_correlation.png", arrangeGrob(grobs = plot_list, ncol = 2),height = 9,width = 6)  ## save plot

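Genomics knowledge

Posted on 2021-03-08

Summary of using the MEME website:

Original link: https://blog.csdn.net/weixin_43569478/article/details/108079461

Suitable for analyzing motif information in large sets of sequences. First, de novo motifs are predicted with MEME and DREME. CentriMo then identifies motifs that are significantly enriched in the central region of the sequences, while Tomtom compares the predicted de novo motifs against the known motifs in a chosen database to determine how similar they are. Finally, FIMO predicts the binding sites of the motifs on the input sequences.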

Python commands

Posted on 2021-03-07

String replacement

print(ss.replace('\n', ''))

Running shell commands from Python

import subprocess
subprocess.call(command, shell=True)
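
subprocess.call only returns the exit code. When the command's output is needed as well, subprocess.run (with capture_output, Python 3.7+) is a common alternative. A minimal sketch; the command is only a placeholder that invokes the Python interpreter so that the example runs anywhere:

```python
import subprocess
import sys

# Placeholder command: run the current Python interpreter to print a word.
# Passing a list (without shell=True) avoids shell quoting problems.
command = [sys.executable, "-c", "print('hello')"]

# check=True raises CalledProcessError if the command exits with a non-zero code
result = subprocess.run(command, capture_output=True, text=True, check=True)
output = result.stdout.strip()
print(result.returncode, output)
```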

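Checking whether a string contains a substring in Python

substring in string  # the in operator; on a list it tests whole-element membership instead
Series.isin(values)  # pandas: element-wise membership test on whole values; use Series.str.contains() for substrings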
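
The difference between substring tests and membership tests is easy to mix up, so here is a small sketch. The strings and gene names are made-up examples:

```python
import pandas as pd

text = "single cell genomics"

# Substring test on a plain string
has_cell = "cell" in text

# On a list, `in` compares whole elements, not substrings
words = text.split()
in_list = "cell" in words
partial_in_list = "gen" in words

# pandas: isin() matches whole values, str.contains() matches substrings
genes = pd.Series(["Sox2", "Sox10", "Pax6"])
exact = genes.isin(["Sox2", "Pax6"]).tolist()
partial = genes.str.contains("Sox", regex=False).tolist()

print(has_cell, in_list, partial_in_list)
print(exact, partial)
```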

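Slicing

data[[x,xx,x]] # on a pandas DataFrame this selects columns by name (on a NumPy array it selects rows); for DataFrame rows use data.loc[[...]] or data.iloc[[...]]
data.loc # select by row and column labels
data.iloc[:,1:2] # select by integer position (here: all rows, the column at position 1)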
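
A small sketch of the three selection styles on a toy DataFrame. The table and its labels are hypothetical:

```python
import pandas as pd

data = pd.DataFrame(
    {"geneA": [1, 2, 3], "geneB": [4, 5, 6], "geneC": [7, 8, 9]},
    index=["cell1", "cell2", "cell3"],
)

cols = data[["geneA", "geneC"]]       # columns by name
rows = data.loc[["cell1", "cell3"]]   # rows by label
cell = data.loc["cell2", "geneB"]     # a single value by row and column label
by_pos = data.iloc[:, 1:2]            # all rows, the column at position 1 (end is exclusive)

print(cols, rows, cell, by_pos, sep="\n")
```

Note that .loc slices include the end label, whereas .iloc slices exclude the end position, like ordinary Python slices.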

Python equivalents of R's paste

paste strings

datause.columns = ["COL01." + str(i) for i in datause.columns]

["s" + str(i) for i in range(1, 11)]  # xrange in Python 2

list(map('s{}'.format, range(1, 11)))

list(map(lambda x: "s" + str(x), range(1, 11)))

paste two columns

anno['day'] = anno['day'].astype(str)
anno['louvain13'] = anno['louvain13'].astype(str)
anno['day_cluster'] = anno['day'] + "_" + anno['louvain13']

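Computing correlations

Pearson correlation directly from a data frame

import numpy as np
import pandas as pd

datause1 = datause_gene.to_numpy()
pcc_gene = np.corrcoef(datause1)  # each row is treated as one variable
pcc_gene = pd.DataFrame(pcc_gene)
pcc_gene.index = datause_gene.index.values
pcc_gene.columns = datause_gene.index.values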
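
To check the orientation, the same result can be obtained with DataFrame.corr, which correlates columns rather than rows. A minimal sketch on a made-up genes-by-cells matrix:

```python
import numpy as np
import pandas as pd

# Hypothetical genes x cells expression matrix
expr = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [2.0, 4.0, 6.0, 8.5],
     [4.0, 3.0, 2.0, 1.0]],
    index=["geneA", "geneB", "geneC"],
    columns=["c1", "c2", "c3", "c4"],
)

# np.corrcoef treats each row as one variable (gene-gene correlation)
pcc_np = pd.DataFrame(np.corrcoef(expr.to_numpy()),
                      index=expr.index, columns=expr.index)

# DataFrame.corr correlates columns, so transpose first to get the same matrix
pcc_pd = expr.T.corr(method="pearson")

print(pcc_np.round(3))
```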

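Group-wise mean

df = pd.DataFrame([['a', 'man', 120, 90],
                   ['b', 'woman', 130, 100],
                   ['a', 'man', 110, 108],
                   ['a', 'woman', 120, 118]], columns=['level', 'gender', 'math','chinese'])
# restrict to the numeric columns; averaging the string column 'level' raises an error in pandas 2.x
group = df.groupby('gender')[['math', 'chinese']].mean()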

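Column and row sums

colsum = mtx_tf.apply(lambda x: x.sum())          # same as mtx_tf.sum(axis=0)
rowsum = mtx_tf.apply(lambda x: x.sum(), axis=1)  # same as mtx_tf.sum(axis=1)

unique() is called here as a pd.Series method, so convert the values to a pd.Series first.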

zz=barcode_list3.values()
zz=list(zz)
df = pd.Series( (v for v in zz) )
len(df.unique())
uu=df.unique()
pd.DataFrame(uu).to_csv("barcode3_96.csv")

Reading a pickle file

import pickle

# the with block closes the file automatically
with open("/media/ggj/home/ggj/Documents/data/fei/Rscripts/MW_script/barcode3_96_bc.pickle2", "rb") as RT_bc3:
    barcode_list3 = pickle.load(RT_bc3)

Waddington-OT

Posted on 2021-02-18

Hands-On Machine Learning

Posted on 2021-02-12

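Chapter 1: The machine learning landscape

1. Types of machine learning systems

Whether they are trained with human supervision

Supervised learning: classification and predicting target values, e.g. k-Nearest Neighbors, linear regression, logistic regression, SVM, decision trees and random forests, neural networks

Unsupervised learning: clustering (K-means, hierarchical cluster analysis, expectation maximization); dimensionality reduction and visualization (PCA, kernel PCA, LLE, t-SNE); association rule learning (Apriori, Eclat); anomaly detection

Semi-supervised learning, e.g. the restricted Boltzmann machines (RBMs) in deep belief networks
Reinforcement learning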

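Whether they can learn incrementally on the fly

Batch learning: the full batch
Online learning: mini-batches

Instance-based versus model-based learning

Instance-based: memorize the examples, then generalize to new instances using a similarity measure.
Model-based: tune the parameters so that the model fits the training set.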
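
The contrast between the two approaches can be sketched with NumPy alone. The data are made up, the function names are hypothetical, and k-nearest-neighbours and a straight-line fit are simply my picks as one representative of each family:

```python
import numpy as np

# Hypothetical training data: y is roughly 2*x + 1
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

def predict_instance_based(x_new, k=2):
    """k-nearest neighbours: average the targets of the k most similar examples."""
    distances = np.abs(x_train - x_new)
    nearest = np.argsort(distances)[:k]
    return float(y_train[nearest].mean())

# Model-based: estimate slope and intercept once; predictions then use only these parameters
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def predict_model_based(x_new):
    """Linear model: plug the new input into the fitted line."""
    return float(slope * x_new + intercept)

print(predict_instance_based(2.4), predict_model_based(2.4))
print(predict_instance_based(10.0), predict_model_based(10.0))
```

Inside the training range the two predictions are close. Far outside it (x = 10) the instance-based prediction stays near the values of the nearest stored examples, while the linear model extrapolates along the fitted line.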

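1. Main challenges

Insufficient training data
Nonrepresentative training data
Poor-quality data: irrelevant features
Overfitting the training data (remedies: simplify the model (fewer parameters, fewer attributes, constrain the model), gather more data, reduce the noise in the data)
Underfitting the training data (remedies: choose a model with more parameters, feed better features to the learning algorithm (feature engineering), reduce the constraints on the model)

1. Testing and validation
  Test set and training set

Cross-validation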
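
A minimal sketch of k-fold cross-validation with NumPy only. The data, the helper name k_fold_indices, and the choice of a straight-line model are assumptions made for illustration:

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is used for validation exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Hypothetical data: y = 3*x plus Gaussian noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + rng.normal(0, 1, size=50)

fold_rmse = []
for train_idx, val_idx in k_fold_indices(len(x), k=5):
    # Fit on the training folds only, evaluate on the held-out fold
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
    pred = slope * x[val_idx] + intercept
    fold_rmse.append(float(np.sqrt(np.mean((pred - y[val_idx]) ** 2))))

print(np.mean(fold_rmse))
```

Averaging the per-fold error gives a less noisy estimate of generalization error than a single train/validation split.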

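Chapter 2: End-to-end machine learning project

1. Working with real data

Look at the big picture

Frame the problem: answer the questions from the classification of machine learning systems above

Select a performance measure: root mean squared error (RMSE, L2 norm), mean absolute error (MAE, L1 norm)

Check the assumptions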
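
The two performance measures can be written out directly. A minimal sketch with made-up values; the function names are my own:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference (L2-style)."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute differences (L1-style)."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(err)))

# Hypothetical labels and predictions; the last prediction is a large miss
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 3.0]

print(rmse(y_true, y_pred), mae(y_true, y_pred))
```

Because the errors are squared before averaging, RMSE weights the single large miss more heavily than MAE does, which is why MAE is often preferred when the data contain many outliers.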

1. Get the data

Download the data

import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # create the local folder if it does not exist yet
    os.makedirs(housing_path, exist_ok=True)
    # save the archive inside housing_path (joining onto the URL was a bug)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # extract the archive next to it; the with block closes the file
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

# Load the data
import pandas as pd
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path,"housing.csv")
    return pd.read_csv(csv_path)

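Look at the data

dataframe.head()
dataframe.info()
dataframe.describe()

# Pay attention to the minimum and maximum of the label and to the distribution of the data.
# The next line is a Jupyter magic; omit it in a plain script.
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50,figsize=(20,15))
plt.show()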

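Create the data sets

When sampling to create the data sets, consider stratified sampling rather than purely random sampling.
Stratified sampling requires every stratum to contain enough instances, so that no important information is left out.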
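
A minimal sketch of a stratified split using pandas only (GroupBy.sample needs pandas 1.1 or later). The table, the column names, and the 20% test fraction are assumptions for illustration:

```python
import pandas as pd

# Hypothetical data: 'stratum' is a category with very unequal group sizes
df = pd.DataFrame({
    "stratum": ["low"] * 60 + ["mid"] * 30 + ["high"] * 10,
    "value": range(100),
})

# Stratified sampling: draw 20% within every stratum
test_set = df.groupby("stratum").sample(frac=0.2, random_state=42)
train_set = df.drop(test_set.index)

test_counts = test_set["stratum"].value_counts().to_dict()
print(test_counts)
```

Each stratum keeps the same share in the test set as in the full table, whereas a purely random 20% draw could by chance contain very few or no "high" rows.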

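1. Discover and visualize the data to gain insights

Visualizing geographical data

When plotting the data, setting alpha to 0.1 makes the point density visible; a simple clustering can also help.

Looking for correlations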

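Mathematical models

Posted on 2021-01-22

Binomial and negative binomial distributions

https://www.zhihu.com/question/24253978?sort=created

Generalized linear models

Negative binomial generalized linear models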

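Computing conserved sites

Posted on 2021-01-20

1. Jiaqi recommended PAML, which computes the dN/dS value (written as "ds/da" in my original note). It requires orthologous genes. I am not sure whether the output is a per-base or a per-gene value; this needs another look. Cited 5,284 times.

Finding conserved genes
https://www.researchgate.net/post/What_is_the_best_way_to_see_how_conserved_a_gene_is_across_different_species

1. PHAST (Cold Spring Harbor), which can compute a conservation score for every base
  Published in 2011 (PHAST and RPHAST: phylogenetic analysis with space/time models), cited 262 times
1. GERP, from Peking University; I am still not clear about what exactly it does.
  https://m.ensembl.org/info/genome/compara/conservation_and_constrained.html

CellOracle does not require matched ATAC-seq data (single-cell or bulk); sci-ATAC data (multiple tissues of adult mice) can be used as a reference instead.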

scan motif
https://www.bbsmax.com/A/pRdBgqL2zn/

https://www.cnblogs.com/leezx/p/11995314.html

Evolution

Posted on 2020-12-25

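II. How to build a phylogenetic tree

Fossils
Morphology
Behavior
Ecology
Geographic location: in general, geographic distance is proportional to genetic distance.

III. Determining polarity

For the states of a character, which came first and which came later, i.e. whether a state is ancestral or derived

1. Vertical comparison: the fossil record
1. Horizontal comparison: outgroup analysis (an outgroup is a group more ancient than the group under study)
  Assume that the character state found in the outgroup is the more ancient one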

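IV. How to deal with trees that give different results

Skepticism drives scientific progress. When you arrive at a different conclusion, go and look for evidence to prove your case!

V. How to evaluate the most parsimonious tree

1. Check with statistical methods
The most widely used is the bootstrap: if more than 70% of the trees support a branch, it is relatively reliable; below 50% it is unreliable.
Randomly resample the matrix, build a tree from each resample, and compute the support.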
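
The resampling step can be sketched as follows. This is a deliberately crude illustration under my own assumptions: the toy alignment is made up, the function names are hypothetical, and instead of building a full tree per replicate it only asks whether two taxa are each other's closest relatives by Hamming distance.

```python
import random

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def bootstrap_support(alignment, taxon_a, taxon_b, n_replicates=200, seed=0):
    """Fraction of bootstrap replicates in which taxon_a and taxon_b are
    each other's nearest neighbours (a stand-in for tree building)."""
    rng = random.Random(seed)
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    hits = 0
    for _ in range(n_replicates):
        # Resample the alignment columns with replacement
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        resampled = {n: [alignment[n][c] for c in cols] for n in names}

        def nearest(name):
            others = [o for o in names if o != name]
            return min(others, key=lambda o: hamming(resampled[name], resampled[o]))

        if nearest(taxon_a) == taxon_b and nearest(taxon_b) == taxon_a:
            hits += 1
    return hits / n_replicates

# Hypothetical toy alignment: 4 taxa, 12 sites
alignment = {
    "A": "ACGTACGTACGT",
    "B": "ACGTACGTACGA",
    "C": "ACTTAGGTTCGA",
    "D": "TCTTAGGTTCGC",
}
support = bootstrap_support(alignment, "A", "B")
print(support)
```

In a real analysis each replicate would be passed to a tree-building method, and the support of a branch is the fraction of replicate trees that contain it.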

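2. Build phylogenetic trees with several methods:
Distance methods: use a clustering algorithm to obtain a tree from the data
Parsimony: the shorter the path, the more likely the tree
Maximum likelihood
Bayesian methods

Use an optimality criterion (objective function) to compute and measure how well a tree fits the data; the tree with the best score is taken as the estimate of the true tree.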

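VI. What does a rooted phylogenetic tree tell us?

1. Whether the group of organisms under study derives from a single ancestor, or comprises all descendants of a single ancestor.
  Monophyletic group: descendants of a single common ancestor
  Paraphyletic group: descendants of the same most recent common ancestor, but not all of its descendants (it generally contains one complete clade)
  Polyphyletic group: descendants of different most recent ancestors
1. The order of speciation

Scale bar:
degree of difference (distance)
divergence time (time)

1. The origin of characters and their biological significance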

Feilijiang

A blog for recording and sharing bioinformatics notes
