human molecular genetics 4th edition pdf download

Abstract

To facilitate gene finding and for the investigation of human molecular genetics on a genome scale, we present a comprehensive survey on various statistical features of human exons. We first show that human exons with flanking genomic DNA sequences can be classified into 12 mutually exclusive categories. This classification could serve as a standard for future studies so that direct comparisons of results can be made. A database for eight categories (related to human genes in which coding regions are split by introns) was built from GenBank release 87.0 and analyzed by a number of methods to characterize statistical features of these sequences that may serve as controls or regulatory signals for gene expression. The statistical information compiled includes profiles of signals for transcription, splicing and translation, various compositional statistics and size distributions. Further analyses reveal novel correlations and constraints among different splicing features across an internal exon that are consistent with the Exon Definition model. This information is fundamental for a quantitative view of human gene organization, and should be invaluable for individual scientists to design human molecular genetics experiments.

Introduction

Almost all the nuclear genes coding for proteins in eukaryotes are split into exon and intron sequences. Thus questions such as 'what makes an exon an exon?' and 'how is an exon recognized by the gene expression machinery?' are of major importance to the understanding of gene expression and regulation. The task of delineating exon-intron organization is even more challenging in vertebrates than in lower eukaryotes because an average vertebrate gene consists of multiple small exons separated by introns that are 10 or 100 times larger. As the Human Genome Project enters a large-scale sequencing phase, identifying exons has also become a bottle-neck in genome annotation. In the early 1980s, the splicing site consensus (1) and the weight matrix method (2) were developed by DNA sequence comparisons. Senapathy et al. (3) later compiled more comprehensive sequence statistics on major categories of GenBank release 57.0. The statistical features of promoters (4) and exon/intron size distributions (5,6) have also been studied carefully for vertebrates. There have been many good reviews on important aspects of gene recognition methods (7–9) and on assessment of different protein-coding measures (10).

To take advantage of a much larger set of human-specific sequence data available today, to facilitate experiments on human molecular genetics and to meet the need for developing better human exon recognition methods, we have extended our statistical analysis of fission yeast genes (11) to human exons and their flanking regions. Recently, accumulating experimental evidence has led to the Exon Definition model (12), which argues that, in vertebrate pre-mRNAs with large introns, the initial recognition unit of splicing is an exon defined by the interactions of splicing factors across the exon. This implies that splicing signals may be correlated across an exon and they cannot be recognized as independent sequence features. In this survey, we report our results on systematic analyses of many individual features, and we demonstrate the existence of some novel correlations and constraints among different features that may be relevant to human exon recognition and to understanding of gene expression and regulation. As the central theme in molecular biology is the structure-function relationship, distinct statistical sequence structures can often suggest, or ought to be explained by, their functions. Putting various gene sequence information in one place will help to speed up this functional interpretation.

Exon Classification

Exons are classified into the following 12 categories (Fig. 1), according mainly to what transcriptional or translational boundaries an exon contains [we shall refer to the poly(A) site as the end of the last exon]: (1) a 5uexon is the 5′-terminal untranslated exon in a gene; (2) a 3uexon is the 3′-terminal untranslated exon; (3) a 5utexon is the 5′-terminal exon having a 5′-untranslated region (5′UTR) followed by a coding sequence (CDS); (4) a 3tuexon is the 3′-terminal exon having a 3′UtR a CDS; (5) an iutexon is an internal exon having a 3′ portion of the 5′UTR followed by a CDS; (6) an ituexon is an internal exon having a 5′ portion of 3′UTR following a CDS; (7) an iuexon is an internal untranslated exon; (8) an itexon is an internal translated exon. An exon in categories 9 and 10 has to contain the complete CDS: (9) a 5utuexon does not contain the transcriptional end; (10) a 3utuexon does not contain the transcriptional start; (11) a 5-3utuexon contains both; and (12) an iutuexon contains neither. Because of annotational ambiguities in distinguishing between truly intronless CDSs and mRNAs, all the analyses reported in this study were done for the first eight categories, which consist of 271 5uexons, 38 3uexons, 482 5utexons, 553 3tuexons, 174 iutexons, 69 ituexons, 34 iuexons and 3440 itexons. Up until now, the focus of study has been mainly itexons, for obvious reasons. As the human genome will be completely sequenced, it is time to address issues related to all types of exons. This is the first systematic classification of exons which could serve as a standard for future studies so that direct comparison of results can be made.

Figure 1

Exon classification. All exons can be classified into these 12 mutually exclusive classes. On the top, a schematic gene model is depicted which indicates how some types of exons may be organized in a gene.

Figure 1

Statistical characterization of individual sequence features

Size and compositional characteristics

Exon size distributions. The size distributions of exons that have a definite size (no '>' or '<' in their annotation) in the different categories as well as the corresponding quantile statistics are plotted in Figure 2. These results indicate that, in general, 5utexons or iuexons are relatively short (mostly <100 nt) and 3tuexons and 3uexons are relatively long (mostly −300–500 nt). While itexon sizes have a tight log-normal distribution centered around log10 (130 nt), the sizes of 3uexons are extremely heterogeneous (all >100 nt). There also seems to be a sharp drop-off for iutexon sizes >200 nt and for 3uexon sizes <100 nt. The extreme values, although somewhat dependent upon the data set, do give some idea about possible size constraints. There seems to be no minimum constraint on the size of an itexon: we found the smallest (4 nt) was the exon 3 of the human TNNI1 gene in the initial human exon data set, although, after the data cleaning process, the minimum in our final itexon collection was 15 nt. These distributions are very useful, for example in comparative studies or in exon-trapping experimental designs.

Coding fraction and UTR distributions. The coding fraction is defined as the ratio of the CDS size over the exon size. The histograms of the coding fractions of the exons in four categories are plotted in Figure 3. These data clearly suggest that (i) translation is equally likely to start anywhere in the first exon, but it is more likely to start near the beginning of an internal exon; (ii) translation is more likely to stop near the beginning of the last exon, but it is more likely to stop near the end of an internal exon. At this point, one could only speculate on the biological implications. For example, why do most ituexons terminate translation near their ends? It is known that premature termination codons (PTCs) upstream of the distal third of penultimate exons trigger transcript degradation while more 3′ PTCs fail to signal transcript targeting (13). Could this be a mechanism to prevent bona fide termination codons being recognized as 'premature'?

A total of 410 5′UTRs and 432 3′UTRs (these numbers are different from those in exon size distributions because no definite boundary at the CDS end of the exon is required) were extracted in full length from 5utexons and 3tuexons, respectively. Their distributions and the quantile statistics are also plotted in Figure 3. In the human GLA gene for α-D-galactosidase A (X14448), the minimum 3′UTR (-2 nt in our definition) results from the fact that the CDS (including the stop codon) ends at 11 268 and the poly(A) site is at 11 266. UTR size distributions are useful, for example, when analyzing human expressed sequence tags (ESTs) and cDNA clones.

Mono- and dinucleotide compositions. Because compositional measures depend on the G+C content of the isochore (14) in which a gene is residing, for convenience, we compiled the statistics separately for data from low GC (<0.5) and high GC (≥0.5) genomic loci. The average G+C content of our data set is 0.53, but the average G+C content of all genomic loci from which our data set was extracted is only 0.49. The fundamental mono-and dinucleotide compositions in Table 1 are for the following groups of sequences: upstream flanks (216 nt), upstream UTRs, upstream introns (54 nt), CDS in each frame, downstream introns (54 nt), downstream UTRs and downstream flanks (216 nt). The values are represented as the percentage difference relative to the average of the total data set.

In contrast to the averages, some of the salient features are given below. At low GC loci, the average G+C contents of the various genomic regions have the following order: uutr>uflk> cds>din>dflk>dutr>uin (see Table 1 for the notations). A codon has a consensus of RWY (in IUPAC ambiguity codes). At high GC loci, the average G+C contents of intron elements are boosted in such a way that the new order becomes: uutr>uflk>diri> average cds>uiri>dflk>dutr (notice the dramatic increase of G din and of C in uin both at the expense of T); a codon has a different consensus of SWS. At the dinucleotide level, the lack of self-complementary pairs is quite obvious. At low GC loci, they are CG and GC (CG is especially scarce because of the methylation decay); at high GC loci, they are TA AT. However, they are relatively enriched at the 5′ (for CG and GC) and the 3′ end (for TA AT) of a gene where they may play some role in the control signals of transcription (e.g. the CpG islands are often found near the 5′ end of a housekeeping gene, and the AATAAA motifs are often found near the polyadenylation site at the 3′ end of a gene. In uins, the richness of dipyrimidines is clearly caused by the poly(Y) splicing signal near the 3′ splice site (3′ss), and the C content is controlled by the G+C level of the genomic loci. Also, the poorness of AG in uin may be caused by avoiding the confusion of the true 3′ss. Unexpectedly, the richness of GG in din (which is related to the G-string excess, see later discussions) is only associated with high GC loci. Because of the coding constraints, the dinucleotide bias in CDS is strongly correlated with the codon bias below (see also 15).

Figure 2

Exon size distributions for the first eight categories (4731 exons from GenBank: 87.0). The quantile statistics were calculated in log10 (bp) scale and converted back into units of nucleotides for ease of comparison.

Figure 2

Figure 3

$Coding fraction (top two panels) and UTR distributions (bottom panel). The coding fraction is defined by the ratio of the size of the coding portion over the total size of an exon. The quantile statistics for the UTRs were calculated in log10 (bp) scale and converted back into units of nucleotide.$

Coding fraction (top two panels) and UTR distributions (bottom panel). The coding fraction is defined by the ratio of the size of the coding portion over the total size of an exon. The quantile statistics for the UTRs were calculated in log10 (bp) scale and converted back into units of nucleotide.

Figure 3

Table 1.

Mono- and dinucleotide compostions for low/high G+C loci expressed as percentage difference from the average (of the entire category 1–8 data set)

Table 1.

Mono- and dinucleotide compostions for low/high G+C loci expressed as percentage difference from the average (of the entire category 1–8 data set)

Codon usage. The trinucleotide statistics are presented as in the human codon usage Table (Table 2). The most abundant codons (CUG, GAR and AAG) are related to dominant dinucleotides in the first coding frame (GA, AA for genes at low GC loci, and CU, GA for genes at high GC loci). The least abundant codons (NUA and NCG) are related to the rare dinucleotide at the second position (CG for genes at low GC loci and UA for genes at high GC loci). There are other examples, such as the GCC codon for alanine and the CAG codon for glutamine, that have strong isochore bias. The most abundant amino acids (leucine and serine) are the ones that have most codons and have no rare dinucleotide at the first position. Stop codon UAA is particularly avoided for genes at high GC loci, although all three stop codons are used equally for genes at low GC loci.

Hexamer (6-tuple) statistics. Hexamer frequency has been used widely as a major discriminant factor in exon/intron identification (10,16). We have calculated hexamer frequencies f_exon (from all the CDSs in all frames) and f_intro (from the introns of 43 complete sequenced genes). The ratio f_exon /f_intro human is less discriminating than in the fission yeast (11) because the human splice sites are more degenerate (see the splicing signals later). In Figure 4, the dot charts of some extreme ranking frequency differences are depicted separately for low and high G+C loci. Most of the intron characteristics can be explained by the run of Ts or As. However, at high GC loci, the presence of CA repeats and G-strings is apparent, especially the complementary pair GGGAGG/CCTCCC, which occurs much more often in introns. In the coding regions, the information is dominated by the hexamers consisting of frequent codons (especially in tandem repeat) that are also highly biased by the G+C content. Both codon usage and hexamer statistics are very useful, for example, when choosing appropriate restriction enzymes or designing various probes.

table 2

Human codon usage table for low/high G+C loci (in percent)

table 2

Human codon usage table for low/high G+C loci (in percent)

Signal profiles

All the signal profiles below are essential in delineating human gene organization. They also provide information about factor-binding energies around recognition sites (17).

Promoter signals. Human promoter sequences are difficult to identify and are poorly annotated in the public databases. We used a refining procedure to obtain the various promoter signal profiles (Table 3), where the vertebrate scores of Bucher (4) were used as the initial approximate matrices (their positional coordinates were also adopted). The windows and the number of sequences used in the search are shown, together with the mean distance of the signals from the transcriptional start (CAAT and GC boxes were searched on both strands). This simple refining procedure was able to produce reasonable promoter signal profiles consistent with the well-known consensus (4). Of course, genes that have no TATA or CAAT box would have contributed 'noise' to the profile frequencies (such noise could be reduced by imposing a minimum score requirement, as was done for the branch site profile below). As mentioned in (4), we also found a 1 nt shift in the human Capsite annotations. Measuring from the end of a box, the TATA box is found ∼25 nt and the CAAT box ∼100 nt (with larger deviations) upstream from the transcriptional start site.

Pre-mRNA 3′ end processing signals. The AATAAA box and the poly(A) site profiles are shown in Table 4. The first AATAAA box profile was obtained by aligning all of the annotated poly(A) signals. The second was obtained from a subset of the signals that occur within a 50 nt window upstream of the poly(A) site. The second profile allowed us to estimate the mean distance (16 nt) between the T in the box to the poly(A) site. The first poly(A) site profile was also obtained simply from aligning all of the annotated poly(A) sites. Due to the uncertainty in identifying a precise poly(A) site, we do not see the CA consensus from this profile. In an attempt to correct some possible errors, we re-aligned the sequences if there was a CA within a ±2 nt distance from the annotated poly(A) site. We believe this poly(A) site profile (shown at the bottom of the figure) may be closer to the truth.

Translational signals. Both the translational start and stop profiles were obtained by aligning the corresponding sequences according to the annotations (to avoid possible errors, only the consensus sequences starting with ATG or ending with one of the three stop codons compiled; Table 5). The human translational start profile is consistent with the general consensus for all vertebrates: GCCGCCRCCATGG (18). From 531 sequences, we also found that start codons as the first, second, third or fourth ATG in the open reading frame (ORF) 474, 51, five and one times, respectively [which would be in favor of the 'first ATG rule' and is very similar to our previous finding for the fission yeast (11)]. The translational stop profile shows the ratio among different stop codons TAA:TAG:TGA ∼1:1:2 (obtained from TAA+TAG ∼TGA and TAA+TGA ∼3 TAG according to the matrix at positions −2 and −1), which may also be seen from Table 2.

Splicing signals. The 5′ and 3′ splice site profiles were obtained by aligning annotated sequences obeying the GT-AG rule (Table 6). We also did separate profiles for the splice sites adjacent to UTR and those adjacent to CDS; we did not find any substantial differences (data not shown). From the general mononucleotide compositional analysis above, it was shown that the compositional property in the flanking intron regions depends strongly on the G+C content. One could get more insight by comparing splicing signal profiles for low and high G+C content; Table 6 is such a comparison calculated from itexons. Although the human splice site consensus more or less agrees with the general consensus for most vertebrates, AG|GTRAGT for the 5′ss and (Y)nNCAG|G for the 3′ss, the G+C content greatly affects the nature of purine or pyrimidine constituents. At low G+C loci, the 5′ss consensus may be better described as AG|GTAAGT and the 3′ss consensus as (Y)nNYAG|G.

Figure 4

Major differences in 6-tuple frequency between coding sequences (in all frames)fcds and intronsfintron. Only the top and the bottom 20 6-tuples are displayed.

Major differences in 6-tuple frequency between coding sequences (in all frames)f _cds and intronsf _intron. Only the top and the bottom 20 6-tuples are displayed.

Figure 4

Major differences in 6-tuple frequency between coding sequences (in all frames)fcds and intronsfintron. Only the top and the bottom 20 6-tuples are displayed.

Major differences in 6-tuple frequency between coding sequences (in all frames)f _cds and intronsf _intron. Only the top and the bottom 20 6-tuples are displayed.

The branch site profile was again obtained by the refining method above, where the vertebrate scores (19) were used as the initial approximate matrix. We took a window (of size 41 nt) 10 nt upstream of the 3 ss end where most reported branch points were found (20). To reduce the noise created by genes that have their branch point located outside of the window, we imposed a minimum score of 3, corresponding to the 1st quantile of the maximum score distribution (the absolute conservation of A at the branch point was obtained automatically as a consequence). The average distance of the branch point from the 3′ss end was found to be 26 nt. Again the profile is biased by the G+C content. In contrast to the vertebrate consensus CTRAY, YTVAY for the low G+C content and CTSAY for the high G+C content may be more specific to the human sequence.

In addition to these conventional measures, we also examined many other statistics that may play a role in mRNA splicing. Some of these are plotted in Figure 5a-f (all the maximum values were limited by the searching window size used).

We plotted the distribution of the distance from the branch point to the 3′ss end (Fig. 5a); its quantile statistics are: Min. = 10, 1st Qu. = 20, Median = 25, Mean = 26, 3rd Qu. = 32 and Max. = 46. Most branch points are located 15–30 bp upstream of the 3′ss end. Branch points outside this region are suboptimal; this happens in many alternatively spliced introns (21).

We plotted the distribution of the distance from the 5′ss end to the closest (with respect to the CDS) downstream in-frame stop codon (Fig. 5b); its quantile statistics are: Min. = 2, 1st Qu. = 1, Median = 11, Mean = 16, 3rd Qu. = 27 and Max. = 51 (−2 occurs when the 5′ss looks like TA|GT where TAG is the first stop 'masked' by the 5′ss boundary). The expected number of the first downstream stop codons found by chance in random sequences at each position is indicated by a dotted line. This is three times the expected number of the first in-frame downstream stop codons found by chance. The striking peak at 1 gives rise to the 5′ss consensus G|GTRAG, and there is a second peak at position 5 which may also correspond to a functional role in splicing. The fact that most significant stops are very close to the 5 ss supports the idea that the 5 ss may be there originally to mask the downstream nonsense codons (22) and to allow longer (and hence more complex) gene products to be coded. Many non-functional excess stops would have long ago been washed out by mutations in evolution.

Table 3.

Promoter signal profiles (in percent)

Table 3.

Promoter signal profiles (in percent)

We plotted the distance from the closest upstream AG to the 3′ss AG (Fig. 5c); its quantile statistics are: Min. = 2, 1st Qu. = 22, Median = 29, Mean = 30, 3nd Qu. = 37 and Max. = 52. We can see a sharp drop-off at short distances (<18 nt). Compared with Figure 5a, it is clear that the first AG downstream of the branch site is most likely to be used as the 3′ss. This rule works much better for the fission yeast (11).

Table 4.

Pre-mRNA terminational processing signal profiles (in percent)

Table 4.

Pre-mRNA terminational processing signal profiles (in percent)

To study more flanking intron features quantitatively, we looked at the Y-string in polypyrimidine [poly(Y)] tracts of the 3′ intron region and the possible G-string excess in the 5′ intron region.

The polypyrimidine tract is known to play an important role in human pre-mRNA splicing (23). A Y-string is a tandem stretch of pyrimidines. We extracted the maximum Y-string (closest to the 3′ss end) from the 54 nt upstream flanking intron region of 4417 exons and plotted the size and distance distributions in Figure 5d and e. The quantile statistics for the sizes are: Min. 2, 1st Qu. = 6, Median = 8, Mean = 9, 3rd Qu. = 11, Max. = 32; and for the distances are: Min. = 2, 1st Qu. = 4, Median = 9, Mean = 14, 3rd Qu. = 21, Max. = 50. Compared with the same statistics from random sequences having the same nucleotide composition (also shown in the figure), we see that the maximum Y-strings have a much longer length and occur very close to the 3′ss end in real introns (a t-test, df = 7899, showed the true mean 9.19 is outside of the 95% confidence interval of the random sample mean 6.62 ± 2.45). [A non parametric test (Wilcoxon rank sum) test (24) was also performed that definitely ruled out the null hypothesis that the medians of the two sample distributions are the same which had a P-value <10⁻¹⁰.] We expect the size effect would be larger if approximate Y-strings were used (allowing one mismatch, for example).

Table 5.

Translational signal profiles (in percent)

Table 5.

Translational signal profiles (in percent)

Table 6.

Splicing signal profiles (in percent)

Table 6.

Splicing signal profiles (in percent)

The G-string excess was hinted at in the above discussion of mono- and dinucleotide distributions. It was first reported by Solovyev et al. (25). We measured this feature in a more quantitative way as follows. Using itexons, for each minimum G-string size i (i = 1, 2, 3, 4, 5), we counted the number of tandem G of size i and larger in a 54 nt window on each side of the 5′ss boundary. Let the difference of the number on the intron side to the exon side be D d = 1, 0 or −1 depending on whether D is >, = or < 0. The average D (unshaded) and d (shaded) are plotted in Figure 5f. One can see both measures peaked at 3, indicating that G triplets are the most over-represented G-strings on the intron side on average. Recent experiments indicated the G triplets located throughout a class of small vertebrate introns intron borders and regulate splice site selection (26). It is possible that such G may be related to hnRNP sites [such as hnRNP A1 sites which have a consensus of UAGGGU (27)].

Correlations and Constraints Among Different Sequence Features

To explore possible relationships among different features across an exon, we used itexons that have complete flanking intron information. (For convenience of statistical tests, all variables in Figure 6a-d were normalized—subtracted the mean and divided by the standard deviation.)

Strong compensatory constraint between the two splice sites across a short exon. In Figure 6a, we have plotted the upstream 3′ss score against the downstream 5′ss score for each exon (we chose 0 as the minimum score cut-off, because the few splice sites with negative scores were mostly annotation errors). There appears to be no correlation for the whole data set at first sight. However, if we highlighted the exons that have a short size (< 61 nt) and have a relatively long upstream intron (> 200 nt), there appeared to be a constraint (represented by the dotted line) that restricted these exons into the upper-right corner. To test the significance of the constraint, we found that the P-value (see Database and Methods) was 0.029 (N = 77, the constraint is y = −2.8x − 3.5). Therefore, this constraint is statistically significant at the 95% confidence level. Such a constraint is consistent with the Exon Definition model: both splice sites have to be consensual for a short exon not to be skipped and, when the 3′ss is weak in a long intron, a stronger downstream 5′ss is needed to compensate for exon definition. Experimentally, it was observed that mutation of a 5′ss depressed the removal of the upstream intron 20-fold (28). Strengthening a naturally weak 5′ss of an internal exon by making it a better fit to the consensus increased in vitro splicing of the upstream intron (29,30). In vivo, mutant 5′ splice sites usually were suppressed by second mutations that improved the 3′ss across the exon (31,32).

Correlation between the splice sites and exon size. To see how the minimum quality of splice sites depends on exon size, we made a box plot of the total splice site score (i.e. the sum of the 5′ss score and 3′ss score across an exon) as a function of the internal exon size (Fig. 6b, in log10). We observed that when the exon size is near the mean or larger, there appears to be no restriction on the splice site scores (except the absolute minimum); however, when the exon size decreases, the splice site scores tend to go up systematically. This is also consistent with Exon Definition in the sense that the interacting splicing factors across an exon may require an optimal interaction range. Experimentally, in addition to exon skipping, the other major phenotype resulting from mutation of a splice site in a human gene is activation of a cryptic site with the right polarity (33). When a constitutively recognized internal exon was internally deleted below 50 nt, it was skipped by the in vivo splicing machinery (34). Increasing the strength of the splice sites could revert the mutant (35).

Figure 5

Novel splice site features. (a) The distance from the branch point to the 3′ss is mostly between 15 and 30 bp. (b) The distance from the 5′ss to the first downstream in-frame nonsense codon is mostly 1 bp. The dotted line indicates the expected number of the first downstream nonsense codon which is three times the expected number of the first in-frame downstream nonsense codon. (c) The distance from the 3′ss to the first upstream AG is mostly between 18 and 35 bp. (d) The poly(Y) tract in the 3′ sequence of real introns is larger than would be found by random chance. (The histogram for the random model with the same nucleotide composition is cross-hatched.) (e) The distance from the poly(Y) tract to the 3′ss is much shorter than would be found by chance. (The histogram for the random model is cross-hatched.) (f) G-string excess in the 5′ss intron region is mostly G-triplets. D and d are two different statistical measures (see text). The peak at 3 bp indicates that G-triplets are more enriched in the 5′ss intron region relative to the adjacent upstream exon region.

Figure 5

Figure 6

and constraints among different sequence features. (a) Scatter plot of 5′ss scores versus 3′ss scores (both normalized, see text). The highlighted dots represent the short exons (<61 bp and flanked by an upstream intron >200 bp). That fact that these highlighted dots are bounded by the dotted line indicates the compensatory constraint between minimum qualities of upstream 3′ss and downstream 5′ss across these short exons. (b) Correlation between the splice site quality (measured by the sum of the flanking splice site scores) and the exon size (normalized in log10 scale) is apparent for short exons. (c) The constraint on minimum upstream intron size for large exons is indicated by the dotted line. (d) The constraint between the upstream poly(Y) (measured by the normalized ratio of YYY frequency to Y) and downstream G-strings (measured by the normalized ratio of GGG frequency to G) is indicated by the dotted line. (e) A 3 -tuple clustering tendency correlation is indicated by the diagonal stripe in the scatter-plot of the 3-tuple clustering tendency (which measures how likely it is that one would find mononucleotide runs) in the upstream flanking intron region versus the downstream flanking intron region. (f) Scatter-plots of A+T contents for the upstream flanking intron region versus the downstream flanking intron region, for the exon region versus the whole genomic region and for the total flanking intron region versus the whole genomic region are displayed. The A+T content correlations are indicated by the diagonal stripes.

Figure 6

Constraint between the upstream intron size and exon size. To prevent steric hindrance between splicing factors, a minimum intron size is generally expected. This is the case, as may be seen in Figure 6c where the upstream intron size is plotted against the exon size (both in log10). With our data set, this minimum was 24 nt (for the two introns in the human parvalbumin gene); the others were 60 nt or larger. Unexpectedly, our data also show a restriction at the upper-right corner. Namely, a long exon is often accompanied by a short upstream intron. A significance test arrived at P = 0.025 (N = 1228, the constraint is y = −x + 3.88), implying that the constraint is significant at the 95% level. According to the Exon Definition, when the exon size becomes very large, the upstream intron may only be recognized through the Intron Definition mode (33), which would require a shorter intron size (see also ref. 36).

Constraint between the poly(Y) tract and the downstream G-strings. To demonstrate that the G-string feature mentioned above is also correlated with the poly(Y) tract, we designed a simple measure as follows: for a G-string measure, we counted the total size of G-strings (G runs of 3 nt or more) in a 54 nt window downstream of a 5′ss and normalized it by dividing the total number of G; we constructed a similar measure for the poly(Y) tract by counting the total size of Y-strings (pyrimidine runs of 3 nt or more) in a 30 nt window upstream of the 3′ss and by dividing the total number of pyrimidine residues. As shown by the dotted line in Figure 6d, the relationship manifests itself again as a constraint: when the G-string content downstream of an itexon is too high (>1 on the normalized scale), the poly(Y) tract upstream has to be of better quality (more Y-triplets clustering). Again the features on the opposite sides of an itexon appear to be 'talking' to each other. The P-value for the constraint is 0.019 (N = 1586 and the line is y = 1.7x − 5.58), again significant at the 95% level. It is possible that the G-strings bind the Y-strings across an itexon transiently to help initial exon recognition during RNA splicing.

Correlation of 3-tuple clusterings in the flanking introns. Due to runs of As or Ts (as shown, for example, in Fig. 4), most introns have less complexity than exons (37). We wanted to see if there are any correlations between clusterings in different regions. We use x = P_k N_k (N_k − 1)/L − 2 as our 3-tuple clustering measure (this measure has been used by Roman Tatusov in his low complexity filtering software—dust) where N_k is the number of a 3-tuple k, L is the window length and the sum is over all possible (64) 3-tuples. The larger this number is, the stronger is the 3-tuple clustering. We had calculated x for each side of the 3′ or 5′ splice site with a window of length 54 nt; we observed that most of the introns have lower clustering tendency than the exons, and there were no correlations between different exon regions or between an exon region and an intron region (data not shown). However, there was a strong correlation between the upstream intron region and the downstream intron region for a subset of itexon data (as indicated by the diagonal stripe in Fig. 6e). The correlation coefficient was cor = 0.84 with a 90% confidence interval of (0.825,0.846) (defined in Database and Methods).

Correlation between flanking A+T contents across an itexon. Finally, we show a correlation between mononucleotide compositions in the two flanking intron regions (54 nt each) across an itexon. We have calculated A+T content in each of the following regions: upstream flanking intron, itexon and downstream flanking intron, and compared them with the A+T content of their genomic locus. We found that the correlation between flanking introns across an itexon was the strongest, with cor=0.78 and a 90% confidence interval of (0.758,0.802) (as seen in Fig. 6f). Presumably this correlation is caused mainly by the isochore effect (14), as may be seen from the subplots in Figure 5f, which shows that genomic A+T content is clustered into islands and the flanking intron A+T content correlates with the genomic more positively than the exon A+T content (the straight lines are the equal A+T content lines). This correlation is remarkably strong because only very short (54 nt) flanking intron sequences were used and the poly(Y) or the G-string in these regions would have to adjust its composition to accommodate the correlation. In fact, all the signal profiles are GC content-dependent as shown earlier. That is why exon identification in a low GC locus is very difficult as the exon GC content is constrained to be almost as low as introns and intergenic regions.

Conclusions

In summary, we have analyzed statistical characteristics of many sequence features in human exons and their flanking regions, including some very subtle and complex ones. We have begun to reveal several novel correlations and constraints among different features. Some of these are in accord with recent experimental observations; others are still a mystery awaiting functional interpretation. We should emphasize that most correlations are between extreme values (mathematically, the relationships can only be expressed as inequality constraints). For features that are close to their consensus, they are quite free to vary and the exon will still be defined (recognized). This type of degeneracy is quite typical for biological systems (e.g. non-lethal mutations of protein sequences are often tolerated because the resulting change of a structure is not critical to the function). However, under special conditions where the general constraints are violated, the exons have to be defined by a delicate balance of multiple features and/or a requirement for additional new features [such as the secondary structures (38), purine-rich enhancers (39), etc.]. These have been observed often in alternatively spliced genes (40). All correlations related to gene structures are likely to be complex, because they resulted from dynamic interactions of many macromolecules during evolution. However, these inter-dependencies among different features are just as important as each individual feature. This is analogous to the fact that protein function cannot be worked out by structure characterization alone without considering interactions between subcomponents or with substrates. More rigorously characterizing existing features, further discovering new features, and quantitatively exploring novel feature relationships will be the keys for understanding gene structures and for improving exon discrimination methods. Many statistical findings reported here have already been utilized in a new gene-finding method (41) which has substantially improved the accuracy in human exon prediction. Better understanding of the architecture of genes will become the prerequisite for innovative experimental designs in functional studies.

Database and Methods

Human exons (in nuclear protein-coding genes) were extracted from GenBank release 87.0, and the data set was processed to remove redundant copies and checked for data integrity. Starting with 31 202 human sequences extracted out of gbpri.seq, we filtered out viruses, mitochondria, RNAs, pseudogenes, Ig genes, MHC genes, redundant large family genes and identical copies. Only 6152 sequences remained. Of these, 780 contained complete CDSs which belong to the last four categories (see Fig. 1). We sorted the rest in ascending order according to their size, and extracted all the exons that had no >90% maximum similarity to other larger exons. Many errors were removed or corrected during the entire analysis by comparison against the original publications. The final data set consisted of 5061 exons (representing 2705 intron-containing genes) belonging to the first eight categories (see Fig. 1). These exons and their flanking regions (54 nt into introns or 210 nt into intergenic regions) have been deposited in FASTA format at the anonymous ftp site phage.cshl.org in the directory pub/science/human\_exons. The accession number and the coding frame information are also retained in each exon record.

All profiles were defined by positional dependent frequency matrices (17)

formula

where n_αx is the counts of a k-tuple α at position x. The scores were defined by

formula

where P ₀= (¼)^k (if α is a k-tuple) and P_αx is the Bayesian posterior probability (42) given by

formula

We chose the pseudocount a_α be

formula

multiplied by the background frequency of α. The counts were obtained either from the known alignments or by the following refining procedure. Starting with an approximate matrix (we took the corresponding vertebrate matrix; one may also take a matrix resulting from a known alignment of subset data), align the signals (within a specified window) that have the maximum score and calculate the new matrix. One then iterates this procedure until it converges.

To assess a straight line constraint in a plan, we used the normal approximation (with a mean of zero and standard deviation of unity) by normalizing (subtracting the mean and dividing by the standard deviation) the two sample variables x_i and y_i . (The variables may need to be transformed, such as by taking the logarithm of exon sizes, so that their distribution can be approximated by the normal distribution.) Assuming x, y are independent, the null hypothesis is that the observed N points are not restricted by a straight line (generalization to an arbitrary curve is straight forward) y = ax + b. The probability f of finding one point in the restricted region is given by F(d), where F is the cumulative distribution function and

formula

is the distance of the constraint line from the origin. The P-value for finding all N points on one side of the constraint is f^N.

The definition of the correlation coefficient cor of two vectors of sample data X,Y, is the standard:

formula

where µ₁, µ₂ and σ₁, σ₂ are the means and standard deviations, respectively, of X and Y. The confidence interval for cor is obtained by the procedure described in (43).

Most of the analysis and graphics were done with SPLUS (44).

Acknowledgements

This work was supported by NIH grant HG00010. The author is grateful for referees' detailed suggestions on how to improve the manuscript.

References

, .

Organization and expression of eukaryotic split genes coding for proteins

Annu. Rev. Biochem.

1981

, vol.

(pg.

349

383

)

A catalogue of splice junction sequences

Nucleic Acids Res

1982

, vol.

(pg.

459

472

)

, , .,

Methods Enzymol

1990

, vol.

183

(pg.

252

278

)

Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences

J. Mol. Biol.

1990

, vol.

212

(pg.

563

578

)

A survey on intron and exon lengths

Nucleic Acids Res

1988

, vol.

(pg.

9893

908

)

Structure of vertebrate genes: a statistical analysis implicating selection

J. Mol. Evol.

1988

, vol.

(pg.

)

Computer methods for analyzing sequence recognition of nucleic acids

Annu. Rev. Biophys. Chem.

1988

, vol.

(pg.

241

263

)

Finding protein coding regions in genomic sequences

Methods Enzymol.

1990

, vol.

183

(pg.

163

180

)

Global methods for the computer prediction of protein-coding regions in nucleotide sequences

Biotechnol. Software

1990

, vol.

(pg.

)

, .

Assessment of protein coding measures

Nucleic Acids Res.

1992

, vol.

(pg.

6441

6450

)

, .

Fission yeast gene structure and recognition

Nucleic Acids Res.

1994

, vol.

(pg.

1750

1759

)

, , .

Exon definition may facilitate splice site selection in RNAs with multiple exons

Mol. Cell. Biol.

1990

, vol.

(pg.

)

, .

Nonsense but not missense mutations can decrease the abundance of nuclear mRNA for the mouse major urinary protein, while both types of mutations can facilitate exon skipping

Mol. Cell.

1994

, vol.

(pg.

6326

6336

)

, , .

Compositional patterns in vertebrate genomes: conservation and change in evolution

J. Mol. Evol.

1988

, vol.

(pg.

)

, .

What drives codon choices in human genes?

J. Mol. Biol.

1996

, vol.

262

(pg.

459

472

)

, , .

K-tuple frequency analysis: from intron/exon discrimination to T-cell epitope mapping

Methods Enzymol.

1990

, vol.

183

(pg.

237

252

)

Consensus patterns in DNA

Methods Enzymol

1990

, vol.

183

(pg.

211

221

)

An analysis of 5′-noncoding sequences from 699 vertebrate messenger RNAs

Nucleic Acids Res.

1987

, vol.

(pg.

8125

8148

)

, .

Distribution and consensus of branch point signals in eukaryotic genes: a computerized statistical analysis

Nucleic Acids Res.

1990

, vol.

(pg.

3015

3019

)

Biochemical mechanisms of constitutive and regulated pre-mRNA splicing

Annu. Rev. Cell Biol.

1991

, vol.

(pg.

559

599

)

, .

Branch point selection in alternative splicing of tropomyosin pre-mRNAs

Nucleic Acids Res.

1989

, vol.

(pg.

5633

5650

)

Possible evolution of splice-junction signals in eukaryotic genes from stop codons

Proc. Natl Acad. Sci. USA

1988

, vol.

(pg.

1129

1133

)

, , .

Functional analysis of the polypyrimidine tract in pre-mRNA splicing

Nucleic Acids Res.

1997

, vol.

(pg.

888

896

)

Nonparametrics: Statistical Methods Based on Ranks

1975

San Francisco, CA

Holden and Day

, , .

Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames

Nucleic Acids Res.

1994

, vol.

(pg.

5156

5163

)

, , .

An intron splicing enhancer containing a G-rich repeat facilitates inclusion of a vertebrate micro-exon

RNA

1996

, vol.

(pg.

342

353

)

, , , .

An intron element modulating 54 splice site selection in the hnRNP A1 pre-mRNA interacts with hnRNP A1

Mol. Cell. Biol.

1997

, vol.

(pg.

1776

1786

)

, .

Effect of 5′ splice site mutations on splicing of the preceding intron

Mol. Cell. Biol.

1990

, vol.

(pg.

6299

6305

)

, , .

Control of alternative splicing by the differential binding of U1 small nuclear ribonucleoprotein particle

Science

1991

, vol.

251

(pg.

1045

1050

)

, , , .

Combinatorial splicing of exon pairs by two-site binding of U1 small nuclear ribonucleoprotein particle

Mol. Cell. Biol.

1991

, vol.

(pg.

5919

5928

)

, , , .

Splicing mutants and their second-site suppressors at the dihydrofolate reductase locus in Chinese hamster ovary cells

Mol. Cell. Biol.

1993

, vol.

(pg.

5085

5098

)

, , .

Alternative splicing of beta-tropomyosin pre-mRNA: multiple cis-elements can contribute to the use of the 5′- and 3′-splice sites of the nonmuscle/smooth muscle exon 6

Nucleic Acids Res.

1994

, vol.

(pg.

2318

2325

)

Exon recognition in vertebrate splicing

J. Biol. Chem.

1995

, vol.

270

(pg.

2411

2414

)

, .

Selection of splice sites in pre-mRNAs with short internal exons

Mol. Cell. Biol.

1991

, vol.

(pg.

6075

6083

)

, .

Cooperation of pre-mRNA sequence elements in splice site selection

Mol. Cell. Biol.

1992

, vol.

(pg.

2108

2114

)

, , .

Architectural limits on split genes

Proc. Natl Acad. Sci. USA

1996

, vol.

(pg.

15081

15085

)

, .

Complexity charts can be used to map functional domains in DNA

Gene Anal. Technol. Appl.

1990

, vol.

(pg.

)

, , , , , .

RNA secondary structure repression of a muscle-specific exon in HeLa cell nuclear extracts

Science

1991

, vol.

252

(pg.

1823

1828

)

, , .

Polypurine sequences within a downstream exon function as a splicing enhancer

Mol. Cell. Biol.

1994

, vol.

(pg.

1347

1354

)

, , , .

A sequence compilation and comparison of exons that are alternatively spliced in neurons

Nucleic Acids Res.

1994

, vol.

(pg.

1515

1526

)

Identification of protein coding regions in the human genome by quadratic discriminant analysis

Proc. Natl Acad. Sci. USA

1997

, vol.

(pg.

565

568

)

, .

The calculation of posterior distribution by data augmentation

J. Am. Stat. Assoc.

1987

, vol.

pg.

528

, .,

Statistical Methods

1980

th edn

Ames, IA

Iowa State University Press

S-PLUS User's Manual

1991

, vol.

Vol. 1

Seattle, WA

Statistical Sciences, Inc.

Chap. 6, Sept

Posted by: toneytoneytalforde0273221.blogspot.com

Source: https://academic.oup.com/hmg/article/7/5/919/759675

human molecular genetics 4th edition pdf download

Abstract

Introduction

Exon Classification

Statistical characterization of individual sequence features

Size and compositional characteristics

Signal profiles

Correlations and Constraints Among Different Sequence Features

Conclusions

Database and Methods

Acknowledgements

References

Post a Comment

0 Comments

Popular Posts

Spray Paint For Canvas Material / Amazing Spray Paint Project Ideas to Beautify Your Home : Hi my name is mullerjeanfrancois i do art,if you interested in any piece please let me know.i hope you enjoy my art i am an artist who works in many different types of media;

Matt Chrome Spray Paint - Home Made Spray Booth - YouTube : Visit your local store for the widest range of products.

Over The Stool Bathroom Shelves : Diy Pantry Shelves - Decor Ideas - Cardboard furniture. this particular piece was a tedious nightmare.

Random Posts