UniGene Build Procedure (NCBI) (Original Text)

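This article describes in detail the genome-based clustering of transcript sequences: how several types of evidence are used to determine the boundaries of a transcription locus and to identify the transcripts that represent it, and how sequences are aligned and clustered to produce a UniGene build.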

Genome-Based UniGene Build Procedure

The availability of genomic sequence is helpful for identifying sets of transcript sequences that correspond to distinct transcription loci or to annotated genes, which is the goal of UniGene. The procedure used for genome-based clustering of transcript sequences is described here.

Several types of evidence are used to identify a transcription locus' boundaries and to identify which transcripts represent the locus. Although determining gross structure and transcript representation of many genes is not sensitive to the details of transcript mapping, there are cases where the details are important: overlapping genes on opposite strands, or genes located within introns of other genes, for example. To accurately resolve these and similar cases, we identify genes by incorporating evidence in order of confidence, beginning with the strongest data.

Annotations of characterized genes on the genomic sequence are recorded. Annotated genes include those supported by experimentally confirmed RefSeqs as well as transcription loci that are predicted to encode a protein realized in a gene model. These annotated exon boundaries and the association of exons with genes form a skeleton that subsequent analysis can extend but cannot contradict.

Transcribed sequences that can be stringently aligned to genomic sequence with a requirement of splice site consensus are used to enumerate additional exon-intron boundaries. Not all sequence alignments satisfy this stringent requirement. Any sequences sharing an exon-intron boundary that can be identified with only one gene are grouped together.

Unspliced sequences, as well as sequences for which the splicing location or orientation is uncertain, are associated with an overlapping exon if one exists, or placed against the genome if not. Sequence orientation is used where there is possible ambiguity of gene orientation.

Sequences that do not align to genomic sequence are grouped together, and transcribed sequences within an interval smaller than 3000 nt that have a common clone of origin are grouped together.

Clusters that do not correspond to an annotated gene and are less than 500 bases 3' of another cluster are likely alternative 3' termini, and are merged into the upstream cluster. This merging is not transitive.
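The 3' merging rule above can be sketched in a few lines. This is only an illustration under stated assumptions: all clusters lie on the plus strand, each cluster is a hypothetical dict with `id`, `start`, `end`, and `annotated` fields, only adjacent clusters are compared, and "not transitive" is read as "a cluster that was itself merged does not pull in further downstream clusters". The source does not spell out any of these details.

```python
MAX_GAP = 500  # distance threshold stated in the text, in bases

def merge_three_prime_clusters(clusters):
    """Map each cluster id to the id of the cluster it ends up in.

    clusters: list of dicts with 'id', 'start', 'end', 'annotated'
    (hypothetical layout), all assumed to be on the plus strand.
    """
    ordered = sorted(clusters, key=lambda c: c["start"])
    target = {c["id"]: c["id"] for c in ordered}
    for upstream, downstream in zip(ordered, ordered[1:]):
        gap = downstream["start"] - upstream["end"]
        # Only unannotated clusters lying just 3' of a neighbor are candidates.
        if downstream["annotated"] or not (0 <= gap < MAX_GAP):
            continue
        # Assumed reading of "not transitive": never merge into a cluster
        # that has itself already been merged upstream.
        if target[upstream["id"]] == upstream["id"]:
            target[downstream["id"]] = upstream["id"]
    return target

example = [
    {"id": "A", "start": 1000, "end": 2000, "annotated": True},
    {"id": "B", "start": 2300, "end": 2600, "annotated": False},
    {"id": "C", "start": 2900, "end": 3100, "annotated": False},
    {"id": "D", "start": 5000, "end": 5500, "annotated": False},
]
merged = merge_three_prime_clusters(example)
```

In this example B is folded into A, while C stays separate even though it lies within 500 bases of B, because B has already been merged.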

Transcript-Based Build Procedure

Clustering is the process of finding subsets of sequences that belong together within a larger set. This is done by converting discrete similarity scores to Boolean links between sequences. That is, two sequences are considered linked if their similarity exceeds a threshold. UniGene clustering proceeds in several stages, with each stage adding less reliable data to the results of the preceding stage. This staged clustering affords greater control than a more egalitarian treatment of all links between sequences.
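As a minimal sketch of turning similarity scores into Boolean links, the following groups sequences by single linkage with a union-find structure. The score scale, the threshold value, and the function name `cluster_by_threshold` are assumptions made for illustration; the source does not give UniGene's actual scoring or cutoffs.

```python
from collections import defaultdict

def cluster_by_threshold(ids, scores, threshold):
    """Link two sequences when their score exceeds the threshold,
    then return the connected components as sorted lists."""
    parent = {s: s for s in ids}

    def find(x):
        # Follow parent pointers to the root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        if score > threshold:          # Boolean link
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[rb] = ra

    groups = defaultdict(set)
    for s in ids:
        groups[find(s)].add(s)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical pairwise similarity scores.
example_scores = {("A", "B"): 0.97, ("B", "C"): 0.96, ("C", "D"): 0.50}
clusters = cluster_by_threshold(["A", "B", "C", "D"], example_scores, 0.95)
```

Here A, B, and C end up in one cluster because A links to B and B links to C, even though A and C were never compared directly. Staging, as described above, amounts to running such a pass repeatedly with progressively less reliable links.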

Screening for contaminants, repeats, and low-complexity sequence is performed. Low-complexity screening is performed using NCBI's Dust program. Mitochondrial and ribosomal sequences are screened for, as are vector contaminants and repetitive elements. After screening, a sequence must contain at least 100 informative bp to be a candidate for entry into UniGene.
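The 100 bp admission rule can be illustrated with a short check. The sketch assumes that screening has already marked uninformative positions either as `N` or as lowercase letters; that masking convention, and the function names, are assumptions rather than a description of NCBI's pipeline.

```python
MIN_INFORMATIVE_BP = 100  # threshold stated in the text

def informative_length(masked_seq):
    """Count unmasked bases, assuming masked positions are 'N' or lowercase."""
    return sum(1 for base in masked_seq if base in "ACGT")

def is_candidate(masked_seq):
    """True if enough informative sequence remains after screening."""
    return informative_length(masked_seq) >= MIN_INFORMATIVE_BP

long_enough = is_candidate("ACGT" * 25)            # 100 informative bases
too_short = is_candidate("ACGT" * 24 + "NNNN")     # 96 informative bases
soft_masked = is_candidate("acgt" * 50 + "ACGT")   # only 4 informative bases
```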

Builds are either genome based or transcript based, as described here. Sequence records in distinct Entrez Gene records are used to group mRNAs (see the Entrez Gene help for details). Links to HomoloGene clusters are computed for all mRNA sequences via translating searches against proteins in HomoloGene. The set of mRNA sequences is compared with itself. Sequence pairs that are sufficiently similar are linked together to form initial clusters, as long as sequences in each initial cluster do not correspond to multiple Entrez Gene records or multiple HomoloGene entries.
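One way to picture the constraint on initial clusters is a union-find pass that refuses any merge that would mix two gene records. The sketch below handles only Entrez Gene IDs (the same idea would apply to HomoloGene entries), processes links in the order given, and uses hypothetical names such as `constrained_clusters` and `gene_of`. The order in which links are considered can change the result, and the source does not say how UniGene resolves that.

```python
def constrained_clusters(links, gene_of):
    """Cluster linked mRNAs without mixing Entrez Gene records.

    links: iterable of (seq_a, seq_b) pairs already judged similar enough.
    gene_of: dict mapping a sequence to its gene ID, or None if unassigned.
    Sequences that appear in no link are not reported.
    """
    parent = {}
    genes = {}  # root -> set of gene IDs seen in that cluster

    def find(x):
        if x not in parent:
            parent[x] = x
            gene = gene_of.get(x)
            genes[x] = {gene} if gene is not None else set()
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        combined = genes[ra] | genes[rb]
        if len(combined) > 1:
            continue  # merging would span two gene records: drop the link
        parent[rb] = ra
        genes[ra] = combined

    clusters = {}
    for seq in parent:
        clusters.setdefault(find(seq), []).append(seq)
    return sorted(sorted(members) for members in clusters.values())

gene_ids = {"m1": 101, "m2": None, "m3": 202, "m4": 202}
initial = constrained_clusters(
    [("m1", "m2"), ("m2", "m3"), ("m3", "m4")], gene_ids
)
```

The link between m2 and m3 is dropped because it would place gene 101 and gene 202 in the same cluster.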

Links between ESTs and mRNA are added to these clusters. The set of ESTs is compared with sequences from the set of initial clusters using megaBLAST, and sufficiently similar sequence pairs are added to the clusters. Links that would join the initial mRNA-based clusters are discarded. EST to EST links are also generated and used to extend the initial clusters and to generate clusters composed solely of ESTs.

Clone-based edges are added; these allow non-overlapping 5' and 3' ESTs to be assigned to the same cluster. Because of imperfect clone labeling, a single clone-ID based edge is insufficient to merge two clusters. Clone IDs that link at least two 5' ends from one cluster with at least two 3' ends from another cluster are found, and the two clusters are merged.
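The clone-based rule can be sketched as a count of distinct clones supporting each pair of clusters. This reading assumes that "at least two 5' ends ... with at least two 3' ends" means at least two distinct clones, each contributing a 5' EST to one cluster and a 3' EST to the other; the tuple layout and the function name are hypothetical.

```python
from collections import defaultdict

def clone_supported_merges(ests, min_clones=2):
    """Return (five_prime_cluster, three_prime_cluster) pairs that are
    supported by at least min_clones distinct clone IDs.

    ests: iterable of (clone_id, end, cluster_id), with end "5" or "3".
    """
    five = defaultdict(set)   # clone_id -> clusters holding its 5' EST
    three = defaultdict(set)  # clone_id -> clusters holding its 3' EST
    for clone_id, end, cluster_id in ests:
        (five if end == "5" else three)[clone_id].add(cluster_id)

    support = defaultdict(set)  # (c5, c3) -> supporting clone IDs
    for clone_id in five.keys() & three.keys():
        for c5 in five[clone_id]:
            for c3 in three[clone_id]:
                if c5 != c3:
                    support[(c5, c3)].add(clone_id)

    return {pair for pair, clones in support.items()
            if len(clones) >= min_clones}

example_ests = [
    ("clone1", "5", "X"), ("clone1", "3", "Y"),
    ("clone2", "5", "X"), ("clone2", "3", "Y"),
    ("clone3", "5", "X"), ("clone3", "3", "Z"),  # single clone: not enough
]
to_merge = clone_supported_merges(example_ests)
```

Requiring two independent clones means that one mislabeled clone, such as `clone3` here, cannot join two clusters on its own.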

Any resulting cluster that does not contain a sequence with a polyadenylation signal or tail is discarded. Clusters that meet these criteria are called anchored clusters, because their 3' ends are presumed to be known.
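A rough sketch of the anchoring filter follows. It treats a sequence as carrying 3' evidence if it ends in a run of A's or contains the canonical polyadenylation signal hexamer AATAAA near its 3' end. The tail length, the search window, and the assumption that every sequence is given 5' to 3' on the transcript strand are illustrative choices, not values taken from the source.

```python
POLYA_SIGNAL = "AATAAA"  # canonical polyadenylation signal hexamer
MIN_TAIL = 10            # assumed minimum run of A's counted as a tail
SIGNAL_WINDOW = 50       # assumed 3' window searched for the signal

def has_three_prime_evidence(seq):
    """True if the sequence shows a poly(A) tail or a nearby signal."""
    seq = seq.upper()
    has_tail = seq.endswith("A" * MIN_TAIL)
    has_signal = POLYA_SIGNAL in seq[-SIGNAL_WINDOW:]
    return has_tail or has_signal

def anchored_clusters(clusters):
    """Keep clusters in which at least one member has 3' evidence.

    clusters: dict mapping cluster_id to a list of sequences.
    """
    return {cid: seqs for cid, seqs in clusters.items()
            if any(has_three_prime_evidence(s) for s in seqs)}

example_clusters = {
    "c1": ["GATTACA" * 5 + "AATAAA" + "GCGCGCGCGC"],  # signal near 3' end
    "c2": ["GATTACA" * 10],                           # no 3' evidence
    "c3": ["GCGC" * 10 + "A" * 12],                   # poly(A) tail
}
anchored = anchored_clusters(example_clusters)
```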

ESTs that do not belong to an anchored cluster are rechecked at a lower level of stringency than in the preceding passes. An EST that passes this less stringent test is then added to the cluster that contains the sequence that is the best match to the EST; it is a guest member.

Clusters of size 1 (that is, clusters that seem to identify infrequently expressed genes) are compared against the rest of the sequences in UniGene at a lower level of stringency and merged with the cluster containing the most similar sequence.

The resulting clusters are compared with the preceding week's build and renumbered in an attempt to maintain continuity. Because the sequences that make up a cluster may change from week to week and because the cluster identifier may disappear (typically when two clusters merge), using the cluster identifier as a reference is ill-advised. Using the GenBank (GB) accession numbers of the sequences that make up the cluster is a safe alternative.

