Here we describe the
Cancer Cell Line Encyclopedia (CCLE): a compilation of gene expression,
chromosomal copy number, and massively parallel sequencing data from 947
human cancer cell lines.
Three types of data were collected:
The mutational status
of >1,600 genes was determined by targeted massively parallel
sequencing, followed by removal of variants likely to be germline events.
Moreover, 392 recurrent mutations affecting 33 known cancer genes were assessed by mass spectrometric genotyping [13].
DNA copy number was
measured using high-density single nucleotide polymorphism arrays
(Affymetrix SNP 6.0; Supplementary Methods).
Finally, mRNA expression levels were obtained for each of the lines using Affymetrix U133 plus 2.0 arrays.
These data were also used to confirm cell line identities.
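The expression data is by far the most widely used, because it is the simplest to work with; most bioinformatics analysts only ever use this part of the dataset!
The mutation data, by contrast, does not come from high-throughput sequencing in the usual sense (it is targeted sequencing of a gene panel), and many people have never even heard of SNP 6.0 array data.
The paper's supplementary materials describe the cell lines in detail.
The CCLE data can be downloaded from the Broad Institute and is also deposited in the GEO database; I prefer the GEO version.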
This SuperSeries is composed of the following SubSeries:
GSE36133 Expression data from the Cancer Cell Line Encyclopedia (CCLE)
GSE36138 SNP array data from the Cancer Cell Line Encyclopedia (CCLE)
The metadata of the GSE36133 study describes the cancer type each cell line was derived from!
Some people like to call this metadata "clinical data".
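library(GEOquery)
library(Biobase)  # provides pData(), phenoData() and exprs()
# Download the expression SubSeries; getGEO() returns a list of ExpressionSet objects
ccleFromGEO <- getGEO("GSE36133")
# Sample-level metadata, one row per cell line
annotBlock1 <- pData(phenoData(ccleFromGEO[[1]]))
dim(annotBlock1)
# [1] 917  38
exprSet <- exprs(ccleFromGEO[[1]])
dim(exprSet)
# [1] 18926   917
## The expression matrix covers 18926 genes; the columns are the 917 cell lines and the rows are genes identified by Entrez ID
keyColumns <- c("title", "source_name_ch1", "characteristics_ch1", "characteristics_ch1.1",
                "characteristics_ch1.2")
options(stringsAsFactors = FALSE)
allAnnot <- annotBlock1[, keyColumns]
## These columns hold the most important metadata: the supplier or institution each cell line was obtained from, the tissue, the cancer classification, and so on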
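GEO sample tables usually store the characteristics columns as "key: value" strings, so a little cleanup is needed before they can be used for grouping. Below is a minimal sketch of how that could be done and how the result could be used to subset the expression matrix. It runs on toy data so that it works offline; the sample IDs, the values, the field names (`primary site`, `histology`) and the helper `stripKey` are all assumptions made for illustration, not the actual contents of GSE36133.

```r
# Toy stand-in for allAnnot; every value here is made up for illustration
allAnnot <- data.frame(
  title = c("lineA", "lineB", "lineC"),
  characteristics_ch1 = c("primary site: lung", "primary site: breast",
                          "primary site: lung"),
  characteristics_ch1.1 = c("histology: carcinoma", "histology: carcinoma",
                            "histology: carcinoma"),
  stringsAsFactors = FALSE
)
rownames(allAnnot) <- c("GSM1", "GSM2", "GSM3")

# Hypothetical helper: drop the leading "key: " part of a characteristics string
stripKey <- function(x) sub("^[^:]*:[[:space:]]*", "", x)

allAnnot$site <- stripKey(allAnnot$characteristics_ch1)
allAnnot$histology <- stripKey(allAnnot$characteristics_ch1.1)

# Count cell lines per primary site
siteCounts <- table(allAnnot$site)

# Toy expression matrix with samples in columns, the same layout exprs() returns
exprSet <- matrix(1:6, nrow = 2,
                  dimnames = list(c("g1", "g2"), rownames(allAnnot)))

# Keep only the columns (cell lines) annotated as lung
lungSamples <- rownames(allAnnot)[allAnnot$site == "lung"]
lungExpr <- exprSet[, lungSamples, drop = FALSE]
```

With the real data, the same pattern should apply once you have checked which field each `characteristics_ch1*` column actually holds, for example with `head(allAnnot)`.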
Cell Line Gene Sets (an overview of 1,035 cell lines)
1035 sets of genes with
high or low expression in each cell line relative to other cell lines
from the CCLE Cell Line Gene Expression Profiles dataset.
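To make "high or low expression in each cell line relative to other cell lines" concrete, here is a rough sketch of one way such gene sets could be derived: z-score every gene across cell lines, then collect the genes above or below a cutoff for each line. The z-score approach, the cutoff of 1.5 and the toy matrix are assumptions for illustration; the source does not say how the published gene sets were actually computed.

```r
# Toy expression matrix: 3 genes x 5 cell lines (made-up values)
exprSet <- rbind(
  gene1 = c(10, 0, 0, 0, 0),   # clearly high in line1
  gene2 = c(0, 0, 0, 0, -10),  # clearly low in line5
  gene3 = c(1, 2, 3, 4, 5)     # no strong outlier
)
colnames(exprSet) <- paste0("line", 1:5)

# z-score each gene (row) across cell lines; scale() works on columns,
# hence the transposes. Genes with zero variance would yield NaN here.
zScores <- t(scale(t(exprSet)))

cutoff <- 1.5  # assumed threshold, chosen only for this toy example

# For every cell line, list the genes that are unusually high or low
highSets <- lapply(colnames(zScores), function(cl) {
  rownames(zScores)[zScores[, cl] > cutoff]
})
lowSets <- lapply(colnames(zScores), function(cl) {
  rownames(zScores)[zScores[, cl] < -cutoff]
})
names(highSets) <- colnames(zScores)
names(lowSets) <- colnames(zScores)
```

Each element of `highSets` and `lowSets` is then a character vector of gene names for one cell line, which is the same shape as a gene-set collection with one "up" and one "down" set per line.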
Some articles about the CCLE database:
Anticancer drug sensitivity analysis: An integrated approach applied to Erlotinib sensitivity prediction in the CCLE database