
Sequencing data in the dbGaP database can certainly be applied for successfully

Published: 2025-05-20 16:37:41    Publisher: 远客网络


I. Sequencing data in the dbGaP database can certainly be applied for successfully

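Raw sequencing data are usually uploaded to NCBI's SRA database and mirrored at EBI. In practice, NCBI's prefetch command downloads slowly, so a script that downloads the fastq files directly from EBI is recommended.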

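The script batch-downloads fastq files from a path file obtained through an EBI search; users have to configure it themselves. However, not every database offers fully open access. Take "Progressive immune dysfunction with advancing disease stage in renal cell carcinoma" as an example: its data are located at ncbi.nlm.nih.gov/projec... and cover three sequencing technologies and 13 patients. This database requires an access application, although two of the application requirements are already met.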
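The batch download described above could be sketched as follows. This is only an illustration, not the original script: it assumes the path file is a tab-separated report with a `fastq_ftp` column whose entries are semicolon-separated paths (the layout the ENA file report at EBI uses), and the function names `fastq_urls` and `download_all`, the `ftp` scheme default, and the sample accession in the usage note are all hypothetical.

```python
import csv
import io
import os
import urllib.request


def fastq_urls(filereport_tsv, scheme="ftp"):
    """Collect download URLs from the fastq_ftp column of a TSV path file.

    Assumption: each fastq_ftp cell holds one or more paths separated by ';'
    and the paths carry no scheme prefix.
    """
    urls = []
    reader = csv.DictReader(io.StringIO(filereport_tsv), delimiter="\t")
    for row in reader:
        for path in (row.get("fastq_ftp") or "").split(";"):
            path = path.strip()
            if path:
                urls.append(f"{scheme}://{path}")
    return urls


def download_all(urls, dest_dir="."):
    """Download every URL into dest_dir, skipping files that already exist."""
    os.makedirs(dest_dir, exist_ok=True)
    for url in urls:
        target = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
        if not os.path.exists(target):
            urllib.request.urlretrieve(url, target)
```

Calling `fastq_urls(open("filereport.tsv").read())` and passing the result to `download_all` would fetch the files one by one; the file name `filereport.tsv` is a placeholder for whatever path file the EBI search produced.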

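In most cases there is no need to apply for the raw data. More than ten similar studies have reported single-cell data for ccRCC, and other datasets have already released their raw sequencing data, so there is no need to dwell on this one. In addition, the article provides expression matrices as supplementary files; the files are 143M and 1.7G in size, with GUIDs 217E8B40-EB49-4FF5-AEF5-57BBBA4DAE61 and 34477B69-0F73-4D9A-B926-66981E1D5D4A respectively.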

II. Introduction to the GWAS Catalog database

The NHGRI-EBI Catalog of published genome-wide association studies

A database maintained by EBI that collects published GWAS studies

Last data release on 2019-09-24: 4220 publications, 107486 SNPs, 157336 associations. Genome assembly GRCh38.p12, dbSNP Build 151, Ensembl Build 96.

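Search by phenotype: for example breast carcinoma, which returns well-standardized phenotype information. EFO, like GO, is a classification scheme, in this case for phenotypes. The search also returns the genes associated with the phenotype.

Search by SNP: for example rs7329174, which returns detailed information on the variant and the corresponding genes.

Search by author name: Yao, which returns the related publications.

Search by chromosomal band: for example 2q37.1 (cytogenetic region).

Search by region: for example 6:16000000-25000000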
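The query types listed above have recognizably different shapes, which can be made concrete with a small classifier. This is a hypothetical sketch for illustration only and says nothing about how the Catalog's own search box is implemented; the function name `classify_query`, the regular expressions, and the category labels are assumptions.

```python
import re

# Assumed patterns, based only on the example queries in the text.
RSID = re.compile(r"^rs\d+$", re.IGNORECASE)
CYTOBAND = re.compile(r"^(\d{1,2}|X|Y)[pq]\d+(\.\d+)?$")
REGION = re.compile(r"^(\d{1,2}|X|Y|MT):(\d+)-(\d+)$")


def classify_query(query):
    """Guess which kind of search a query string represents."""
    q = query.strip()
    if RSID.match(q):
        return "snp"
    if CYTOBAND.match(q):
        return "cytogenetic_region"
    m = REGION.match(q)
    if m and int(m.group(2)) <= int(m.group(3)):
        return "region"
    # Trait names and author names cannot be told apart by shape alone.
    return "text"
```

Free-text queries such as a trait name or an author name both fall into the `"text"` category here, since only the search backend can decide which records they match.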

Although it is called a database, it is really just a single table; it can be downloaded in full and is only about 100 MB.

DATE ADDED TO CATALOG*+: Date a study is published in the catalog

PUBMEDID*+: PubMed identification number

FIRST AUTHOR*+: Last name and initials of first author

DATE*+: Publication date (online (epub) date if available)

JOURNAL*+: Abbreviated journal name

DISEASE/TRAIT*+: Disease or trait examined in study

INITIAL SAMPLE DESCRIPTION*+: Sample size and ancestry description for stage 1 of GWAS (summing across multiple Stage 1 populations, if applicable)

REPLICATION SAMPLE DESCRIPTION*+: Sample size and ancestry description for subsequent replication(s) (summing across multiple populations, if applicable)

REGION*: Cytogenetic region associated with rs number

CHR_ID*: Chromosome number associated with rs number

CHR_POS*: Chromosomal position associated with rs number

REPORTED GENE(S)*: Gene(s) reported by author

MAPPED GENE(S)*: Gene(s) mapped to the strongest SNP. If the SNP is located within a gene, that gene is listed. If the SNP is intergenic, the upstream and downstream genes are listed, separated by a hyphen.

UPSTREAM_GENE_ID*: Entrez Gene ID for nearest upstream gene to rs number, if not within gene

DOWNSTREAM_GENE_ID*: Entrez Gene ID for nearest downstream gene to rs number, if not within gene

SNP_GENE_IDS*: Entrez Gene ID, if rs number within gene; multiple genes denotes overlapping transcripts

UPSTREAM_GENE_DISTANCE*: distance in kb for nearest upstream gene to rs number, if not within gene

DOWNSTREAM_GENE_DISTANCE*: distance in kb for nearest downstream gene to rs number, if not within gene

STRONGEST SNP-RISK ALLELE*: SNP(s) most strongly associated with trait + risk allele (? for unknown risk allele). May also refer to a haplotype.

SNPS*: Strongest SNP; if a haplotype it may include more than one rs number (multiple SNPs comprising the haplotype)

MERGED*: denotes whether the SNP has been merged into a subsequent rs record (0 = no; 1 = yes)

SNP_ID_CURRENT*: current rs number (will differ from strongest SNP when merged = 1)

CONTEXT*: SNP functional class

INTERGENIC*: denotes whether SNP is in intergenic region (0 = no; 1 = yes)

RISK ALLELE FREQUENCY*: Reported risk/effect allele frequency associated with strongest SNP in controls (if not available among all controls, among the control group with the largest sample size). If the associated locus is a haplotype the haplotype frequency will be extracted.

P-VALUE*: Reported p-value for strongest SNP risk allele (linked to dbGaP Association Browser). Note that p-values are rounded to 1 significant digit (for example, a published p-value of 4.8 x 10^-7 is rounded to 5 x 10^-7).

P-VALUE (TEXT)*: Information describing context of p-value (e.g. females, smokers).

OR or BETA*: Reported odds ratio or beta-coefficient associated with strongest SNP risk allele. Note that if an OR < 1 is reported this is inverted, along with the reported allele, so that all ORs included in the Catalog are > 1. Appropriate unit and increase/decrease are included for beta coefficients.

95% CI (TEXT)*: Reported 95% confidence interval associated with strongest SNP risk allele, along with unit in the case of beta-coefficients. If 95% CIs are not published, we estimate these using the standard error, where available.

PLATFORM (SNPS PASSING QC)*: Genotyping platform manufacturer used in Stage 1; also includes notation of pooled DNA study design or imputation of SNPs, where applicable

CNV*: Study of copy number variation (yes/no)

ASSOCIATION COUNT+: Number of associations identified for this study
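Because the download is a single table, the columns described above can be read with nothing more than a TSV parser. The sketch below is a minimal example under stated assumptions: the file is tab-separated with a header row using the column names listed above, the function name `filter_associations` is hypothetical, and the default cutoff of 5e-8 is the conventional genome-wide significance threshold rather than anything prescribed by the Catalog.

```python
import csv
import io


def filter_associations(tsv_text, trait_keyword, max_p=5e-8):
    """Return (SNPS, DISEASE/TRAIT, P-VALUE) for rows matching a trait keyword.

    Rows whose P-VALUE cannot be parsed as a number are skipped.
    """
    hits = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        trait = row.get("DISEASE/TRAIT") or ""
        try:
            p_value = float(row.get("P-VALUE"))
        except (TypeError, ValueError):
            continue
        if trait_keyword.lower() in trait.lower() and p_value <= max_p:
            hits.append((row.get("SNPS") or "", trait, p_value))
    return hits
```

For a real download the text would come from reading the file from disk, for instance `filter_associations(open(path, encoding="utf-8").read(), "breast carcinoma")`, where `path` is wherever the table was saved.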

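What is an Experimental Factor Ontology trait?

What is a cytogenetic region? See karyotype.

What is "trait + risk allele"? Here the concepts of SNP and allele need to be kept apart: a SNP is a site, while an allele is the base found at that site. Think about the two strands of DNA, and about polyploidy.

What is risk/effect allele frequency?

What kind of measure is the odds ratio in GWAS? From the wiki: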

The odds ratio is the ratio of two odds, which in the context of GWA studies are the odds of case for individuals having a specific allele and the odds of case for individuals who do not have that same allele.

As an example, suppose that there are two alleles, T and C. The number of individuals in the case group having allele T is represented by 'A' and the number of individuals in the control group having allele T is represented by 'B'. Similarly, the number of individuals in the case group having allele C is represented by 'X' and the number of individuals in the control group having allele C is represented by 'Y'. In this case the odds ratio for allele T is A:B (meaning 'A to B', in standard odds terminology) divided by X:Y, which in mathematical notation is simply (A/B)/(X/Y).

When the allele frequency in the case group is much higher than in the control group, the odds ratio is higher than 1, and vice versa for lower allele frequency. Additionally, a P-value for the significance of the odds ratio is typically calculated using a simple chi-squared test. Finding odds ratios that are significantly different from 1 is the objective of the GWA study because this shows that a SNP is associated with disease.[18]
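The definition quoted above can be checked with a few lines of arithmetic. The sketch below computes (A/B)/(X/Y) and a Pearson chi-squared test on the same 2x2 table; it assumes no continuity correction and uses the closed-form p-value for one degree of freedom, and the function names and the example counts are made up for illustration.

```python
import math


def allelic_odds_ratio(a, b, x, y):
    """Odds ratio for allele T: (A/B) / (X/Y).

    a, b: case and control counts carrying allele T
    x, y: case and control counts carrying allele C
    """
    return (a / b) / (x / y)


def chi_squared_2x2(a, b, x, y):
    """Pearson chi-squared statistic (no continuity correction) and p-value
    for the 2x2 table [[a, b], [x, y]], with 1 degree of freedom."""
    n = a + b + x + y
    numerator = n * (a * y - b * x) ** 2
    denominator = (a + b) * (x + y) * (a + x) * (b + y)
    chi2 = numerator / denominator
    # For 1 degree of freedom, P(chi2 > c) = erfc(sqrt(c / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: T is seen in 30 cases and 10 controls,
# C in 20 cases and 40 controls.
odds_ratio = allelic_odds_ratio(30, 10, 20, 40)   # (30/10) / (20/40) = 6.0
chi2, p_value = chi_squared_2x2(30, 10, 20, 40)
```

With these counts allele T is enriched among cases, so the odds ratio is above 1 and the chi-squared statistic is large, matching the qualitative description in the quoted passage.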

What is MAF? The frequency of the minor allele.

What annotations can GWAS data carry? Phenotype annotation, population and linkage disequilibrium (LD) information.

What are CP loci? An effective region associated with at least two phenotypes.

Quality Control Procedures for Genome Wide Association Studies

Data quality control in genetic case-control association studies

Minor allele frequency (MAF) > 0.01: statistical power is extremely low for rare SNPs. This is easy to understand: a very rare SNP needs a very large sample before there is enough power.

Hardy-Weinberg equilibrium (HWE) test p-value > 5E-05.

Missing genotype rate < 10%: genotypes are classified as missing if the genotype-calling algorithm cannot infer the genotype with sufficient confidence. The rate can be calculated across each individual and/or SNP.
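The three thresholds above can be applied per SNP from its genotype counts. The following is a rough sketch rather than the procedure of the cited papers: it assumes a biallelic SNP, uses a one-degree-of-freedom chi-squared goodness-of-fit test for HWE (an exact test is often preferred in practice), and the function name `snp_qc` and its return layout are hypothetical.

```python
import math


def snp_qc(n_aa, n_ab, n_bb, n_missing,
           maf_min=0.01, hwe_p_min=5e-5, missing_max=0.10):
    """Apply MAF, HWE and missingness filters to one biallelic SNP."""
    called = n_aa + n_ab + n_bb
    total = called + n_missing
    if called == 0:
        return {"maf": 0.0, "hwe_p": 0.0, "missing_rate": 1.0, "passed": False}

    missing_rate = n_missing / total
    freq_a = (2 * n_aa + n_ab) / (2 * called)
    maf = min(freq_a, 1 - freq_a)

    # Genotype counts expected under Hardy-Weinberg equilibrium.
    expected = (called * freq_a ** 2,
                2 * called * freq_a * (1 - freq_a),
                called * (1 - freq_a) ** 2)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    hwe_p = math.erfc(math.sqrt(chi2 / 2))  # 1 degree of freedom

    passed = maf > maf_min and hwe_p > hwe_p_min and missing_rate < missing_max
    return {"maf": maf, "hwe_p": hwe_p,
            "missing_rate": missing_rate, "passed": passed}
```

A SNP with counts 360/480/160 and no missing calls sits exactly at Hardy-Weinberg proportions and passes, whereas one with no heterozygotes at all (500/0/500) is rejected by the HWE filter.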

What is the Experimental Factor Ontology?

What is LD information (r2 and D' values)?

Mathematical properties of the r2 measure of linkage disequilibrium
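As a pointer toward the r2 and D' question, the usual textbook definitions for two biallelic loci can be written out directly. This sketch assumes haplotype frequencies are already known (in real data they usually have to be estimated from unphased genotypes), reports D' as an absolute value, and uses a hypothetical function name `ld_stats`.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, r2, D') for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a:  frequency of allele A at the first locus
    p_b:  frequency of allele B at the second locus
    """
    d = p_ab - p_a * p_b

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0

    # D is bounded by the allele frequencies; D' rescales it to [0, 1].
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    return d, r2, d_prime
```

When the two alleles always travel together (for example `ld_stats(0.5, 0.5, 0.5)`) both r2 and D' equal 1; when the loci are independent, so that `p_ab` equals `p_a * p_b`, both are 0.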


III. Three internationally renowned protein databases

The three internationally renowned protein databases are UniProt, The Human Protein Atlas, and PhosphoSitePlus.

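1. UniProt database

UniProt (Universal Protein Resource) is a database commonly used in proteomics. It was formed in 2002 by uniting three protein databases: Swiss-Prot (itself established in 1986), TrEMBL, and PIR-PSD. Rich in information and broad in coverage, it is widely regarded as the first-choice free protein database.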

2. The Human Protein Atlas database

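The Human Protein Atlas holds tissue and cellular distribution information for nearly 30000 human proteins and can be queried free of charge.

The project, supported by Sweden's Knut & Alice Wallenberg Foundation, uses immunohistochemistry to examine the distribution and expression of each protein in 48 normal human tissues, 20 tumor tissues, 47 cell lines, and 12 types of blood cells. The results are presented as at least 576 immunohistochemistry staining images, which are reviewed and indexed by specialists so that the staining results are adequately representative.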

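3. PhosphoSitePlus database

PhosphoSitePlus is a free resource jointly developed by CST and NIH. It collects a very large number of protein modification sites identified through scientific research, including phosphorylation, methylation, acetylation, and ubiquitination, and also includes some modification sites discovered by CST that have not been published.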

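The database is dynamic, open, highly interactive, and continuously updated. It helps in studying the role of PTMs in normal and pathological cells and tissues, and it is also a powerful tool for discovering new disease biomarkers and drug targets.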

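The protein database HPDB, established in May 2005, displays the three-dimensional structures of biological macromolecules dynamically: with mouse clicks a user can zoom in on a molecular structure, locate atoms, and measure the distance between atoms, which makes it usable for teaching or research. Its intended users are students, teachers, and researchers in the life sciences, medicine, pharmacy, agriculture, forestry, and related fields who are proficient in Chinese.

Structural features are described in Chinese, with the original English text provided for verification. Readers who are comfortable with English are encouraged to access RCSB PDB directly, which both reduces network congestion and avoids any inconvenience caused by imperfect translation in HPDB.

HPDB provides a Chinese translation of the structure description for each protein (except for the most recently added molecules), covering the qualitative description of the structure, the source of the sample, the expression vector, the host, the chemical analysis method, the components of the structure, and so on. This information is stored in the database together with the protein structure data, so HPDB supports queries in Chinese.

Although HPDB translates the "structure description" section, it makes no changes to the primary sequence or to the macromolecular coordinate data, in order to keep the data reliable and accurate. The database retains the original experimental data files as verified by RCSB PDB and keeps both the PDB file format and the protein identifiers.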
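Since the coordinate files keep the PDB format, the atom-to-atom distance that the viewer measures can also be computed directly from two ATOM records. The sketch below relies on the fixed-column layout of the PDB format, in which x, y, and z occupy columns 31-38, 39-46, and 47-54; the helper `atom_line` only builds simplified records for the demonstration, and all function names and the example coordinates are hypothetical.

```python
import math


def atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a simplified fixed-column ATOM record (demonstration helper)."""
    return "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f" % (
        serial, name, res_name, chain, res_seq, x, y, z)


def parse_xyz(line):
    """Read x, y, z (in angstroms) from an ATOM or HETATM record."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    # Columns 31-38, 39-46, 47-54 in 1-based numbering.
    return float(line[30:38]), float(line[38:46]), float(line[46:54])


def distance(line_a, line_b):
    """Euclidean distance between the atoms of two records."""
    return math.dist(parse_xyz(line_a), parse_xyz(line_b))


# Two made-up atoms placed 5 angstroms apart.
first = atom_line(1, "N", "MET", "A", 1, 0.0, 0.0, 0.0)
second = atom_line(2, "CA", "MET", "A", 1, 3.0, 4.0, 0.0)
separation = distance(first, second)
```

The same `parse_xyz` would work on lines read from a downloaded coordinate file, as long as they are ATOM or HETATM records in the standard column layout.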