Preliminary Analysis of SARS Coronavirus Proteins (II)

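Shi Lei (石磊), Zhang Qipeng (张其鹏), Rui Wei (芮伟), Lu Ming (卢铭)
(Bioinformatics Group, Peking University Health Science Center, Beijing 100083)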

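1. Overview
Coronaviruses are a large family of RNA viruses with a very broad host range: many vertebrates, including mammals and birds, can be infected. The virion has a hexagonal structure (see the figure at right). Its inner core contains the RNA, the nucleocapsid protein N, and the membrane protein M. The M protein has two possible topologies: either both termini face the outside of the membrane, or only the N-terminus does. The envelope contains two or three proteins: the S protein, the E protein (the small membrane protein), and hemagglutinin-esterase (HE). The S protein is regarded as the key protein in coronavirus infection, and structurally it usually has a cysteine-rich region near the C-terminus.
The SARS virus is a newly discovered coronavirus. A total of 11 CDSs were found in its genome. Sequence alignment shows that the putative orf1ab polyprotein, a 7074-amino-acid polypeptide encoded from nucleotide 250 onward, is very similar to the corresponding segment of murine hepatitis virus (MHV). Based on existing knowledge of that MHV segment, the putative orf1ab polyprotein was further divided into 14 proteins (see Table 1), including the RNA-dependent RNA polymerase. Another CDS, the orf1a polyprotein, is simply the first 4383 amino acids of the putative orf1ab polyprotein. Of the other 9 CDSs, 4 have highly homologous sequences in BLAST, corresponding to the S, M, E, and N proteins; for the remaining 5, no homologous sequence was found. We therefore infer that the SARS virus has 23 proteins in total (14 from the orf1ab polyprotein plus the 9 other CDSs), listed in two tables. Table 1 lists the putative proteins within the putative orf1ab polyprotein. Table 2 lists the 11 CDS fragments, all of which except the putative orf1ab polyprotein and the orf1a polyprotein are inferred to be putative proteins. Fig 1 marks the genome, the CDSs, and the inferred protein fragments.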
Fig 1. SARS genome
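2. Prediction methods:
Protein secondary structure, motifs, signal peptides, disulfide bonds, and other features were predicted with existing prediction software that is freely available to all researchers on the Internet, mainly the PredictProtein server, PROSITE, and SignalP. In addition, the structures and functions of highly homologous proteins were used to infer some of the more complex structures and possible functions of the unknown proteins.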
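As an illustration of what a motif scan of the PROSITE kind involves, the following is a minimal sketch that converts a PROSITE-style pattern into a regular expression and scans a sequence with it. It is not the PROSITE server's own implementation. The function names `prosite_to_regex` and `find_motif` are hypothetical, the test sequence is made up, and only the basic pattern syntax is handled (residues, `x`, `[...]` classes, `{...}` exclusions, repeat counts, and `<` / `>` anchors outside brackets).

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern such as 'N-{P}-[ST]-{P}' to a regex.

    Simplified sketch: anchors written inside brackets (e.g. '[ST>]')
    are not supported.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # '<' pins the match to the N-terminus, '>' to the C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Optional repeat count, e.g. x(2) or x(2,4).
        repeat = ""
        m = re.fullmatch(r"(.+?)\((\d+(?:,\d+)?)\)", element)
        if m:
            element, repeat = m.group(1), "{" + m.group(2) + "}"
        if element == "x":
            core = "."                          # any residue
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"   # any residue except these
        else:
            core = element                      # single residue or [ABC] class
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motif(sequence: str, pattern: str):
    """Return (1-based position, matched text) for every hit, overlaps included."""
    # A lookahead lets overlapping occurrences be reported as well.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    toy = "MKNGSANPSA"  # made-up sequence, not from the SARS genome
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_motif(toy, "N-{P}-[ST]-{P}"))
```

Short patterns like this one match by chance quite often, so hits from such a scan are candidates to be checked against homology and structure, not conclusions.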
 
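3. Results:
Table 1 and Table 2 list the 23 proteins that the SARS virus may have. The prediction results for each protein are given in the tables; clicking "Go" in the Structure column shows the detailed prediction results.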
 
 
Table 1. Putative proteins in the putative orf1ab polyprotein
 
 
| Product | Location | Structure | Description |
|---|---|---|---|
| putative leader protein | 250..786 | | PL1-PRO cleavage product |
| putative counterpart of MHV p65 protein | 787..2703 | | |
| putative coronavirus nsp1 | 2704..9969 | | contains predicted phosphoesterase (similar to the Appr-1'-p processing enzyme) formerly known as 'X-domain', papain-like proteinase domain similar to that of MHV PLP-2, and hydrophobic domains |
| putative coronavirus nsp2 (3CL-PRO) | 9970..10887 | | presumably mediates cleavages downstream from nsp1; 3C-like proteinase |
| putative coronavirus nsp3 (HD2) | 10888..11757 | | hydrophobic domain |
| putative coronavirus nsp4 | 11758..12006 | | |
| putative coronavirus nsp5 | 12007..12600 | | |
| putative coronavirus nsp6 | 12601..12939 | | |
| putative coronavirus nsp7 | 12940..13356 | | formerly known as growth-factor-like protein |
| putative coronavirus nsp9 (RdRp) | 13357..13383,13383..16151 | | RNA-dependent RNA polymerase |
| putative coronavirus nsp10 (MB, NTPase/HEL) | 16152..17954 | | metal-binding domain, NTPase/helicase domain |
| putative coronavirus nsp11 | 17955..19535 | | |
| putative coronavirus nsp12 | 19536..20573 | | |
| putative coronavirus nsp13 | 20574..21467 | | |
 
     
 
Table 2. CDS fragments in the SARS genome
 
 
| Product | Location | Length (aa) | Structure | Predicted description |
|---|---|---|---|---|
| putative orf1ab polyprotein | 250..21470 | 7074 | | Chain A, Structure Of Coronavirus Main Proteinase Reveals Combination Of A Chymotrypsin Fold With An Extra Alpha-Helical Domain |
| orf1a polyprotein | 250..13398 | 4383 | | Chain A, Structure Of Coronavirus Main Proteinase Reveals Combination Of A Chymotrypsin Fold With An Extra Alpha-Helical Domain |
| putative E2 glycoprotein precursor | 21477..25244 | 1256 | | E2 glycoprotein; similar sequences found by alignment |
| putative uncharacterized protein | 25253..26077 | 275 | | Unknown; apparently a new protein, with no similar sequence found |
| putative uncharacterized protein | 25674..26138 | 155 | | Unknown; apparently a new protein, with no similar sequence found |
| putative small envelope protein E | 26102..26332 | 77 | | envelope protein |
| putative protein M | 26383..27048 | 222 | | aligns with matrix glycoprotein [porcine hemagglutinating encephalomyelitis virus] |
| putative uncharacterized protein | 27059..27250 | 64 | | Unknown; apparently a new protein, with no similar sequence found |
| putative uncharacterized protein | 27258..27626 | 123 | | Unknown; apparently a new protein, with no similar sequence found |
| putative nucleocapsid protein | 28105..29373 | 423 | | aligns with nucleocapsid protein [Murine hepatitis virus] |
| putative uncharacterized protein | 28115..28411 | 99 | | Unknown; apparently a new protein, with no similar sequence found |
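
The lengths listed in Table 2 can be reproduced from the location column: each value equals the nucleotide span divided by three. If the locations follow the GenBank convention of including the stop codon, this count would be one more than the number of residues. The sketch below shows the arithmetic. The function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and the two-segment location used for the orf1ab polyprotein is an assumption made by analogy with the nsp9 entry in Table 1 (a segment boundary at nucleotide 13383 counted twice); Table 2 itself gives only 250..21470.

```python
def span_length(location: str) -> int:
    """Nucleotides covered by a location such as '250..786' or
    '13357..13383,13383..16151' (1-based, inclusive segments)."""
    total = 0
    for segment in location.split(","):
        start, end = (int(x) for x in segment.split(".."))
        total += end - start + 1
    return total


def codon_count(location: str) -> int:
    """Number of codons in the span; raises if it is not a whole number."""
    nt = span_length(location)
    if nt % 3 != 0:
        raise ValueError(f"{location}: {nt} nt is not a multiple of 3")
    return nt // 3


if __name__ == "__main__":
    # Locations copied from Table 2.
    for name, loc in [
        ("orf1a polyprotein", "250..13398"),
        ("putative E2 glycoprotein precursor", "21477..25244"),
        ("putative small envelope protein E", "26102..26332"),
        ("putative nucleocapsid protein", "28105..29373"),
    ]:
        print(name, codon_count(loc))
    # Assumed two-segment location for orf1ab (see the note above).
    print("putative orf1ab polyprotein", codon_count("250..13383,13383..21470"))
```

A single contiguous span 250..21470 covers 21221 nucleotides, which is not divisible by three, so the listed length of 7074 is only reproduced when one nucleotide is read twice.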
 
 
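4. Conclusions
The S protein is regarded as the key protein in coronavirus infection. Our multiple sequence alignment of the currently known coronavirus S proteins found a cysteine-rich region near the C-terminus (this region is associated with coronavirus infection). Secondary structure prediction for the putative S protein (listed in Table 2 as the putative E2 glycoprotein precursor) likewise indicates a C-terminal cysteine-rich region and a transmembrane region, and a possible signal peptide (residues 1-13) was found at the N-terminus. Sequence alignment shows that it is highly homologous to the spike glycoprotein of murine hepatitis virus. Given these structural similarities, we infer that this is the S protein of the SARS virus, that it is mainly involved in the infection process, and that it is a possible drug target.
The M protein is a membrane protein. Sequence alignment shows that the SARS putative protein M is highly homologous to the M protein of bovine coronavirus, and secondary structure analysis finds three transmembrane segments, with the N-terminus outside the membrane and the C-terminus inside.
For five CDSs no homologous sequence was found, and we cannot yet infer their function or localization. We will next analyze their function further on the basis of motif analysis. For the other proteins, we will make partial functional predictions based on the known partial structures and homologous proteins.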
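
A cysteine-rich region of the kind described for the S protein can be located with a simple sliding-window count. The sketch below is an illustration only, not the method used in this analysis, which relied on multiple sequence alignment and the prediction servers. The function name `cysteine_windows`, the window size, and the threshold are all arbitrary assumptions, and the test sequence is made up.

```python
def cysteine_windows(sequence: str, window: int = 20, min_cys: int = 5):
    """Return (start, end, count) for every window holding at least
    min_cys cysteines. Positions are 1-based and inclusive."""
    seq = sequence.upper()
    hits = []
    # Slide one residue at a time; a sequence shorter than the window
    # is treated as a single window.
    for start in range(max(len(seq) - window + 1, 1)):
        count = seq[start:start + window].count("C")
        if count >= min_cys:
            hits.append((start + 1, min(start + window, len(seq)), count))
    return hits


if __name__ == "__main__":
    # Made-up sequence with a cysteine cluster near the C-terminus.
    toy = "A" * 30 + "CCACCACC" + "A" * 5
    for start, end, count in cysteine_windows(toy, window=10, min_cys=6):
        print(f"residues {start}-{end}: {count} cysteines")
```

Overlapping hits belong to the same cluster; merging them and comparing their position with the sequence length indicates whether the region lies near the C-terminus.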
 
 
References:
1. PHD: predicting one-dimensional protein structure by profile-based neural networks.
2. Prediction of protein secondary structure at better than 70% accuracy.
3. Topology prediction for helical transmembrane proteins at 86% accuracy.
4. The PROSITE database, its status in 1999.
5. Coronavirus derived expression systems.