J. Biol. Chem., Vol. 276, Issue 30, 28327-28339, July 27, 2001
From the
Received for publication, September 12, 2000, and in revised form, March 26, 2001
As experimental technologies for
characterization of proteomes emerge, bioinformatic analysis of the
data becomes essential. Separation and identification technologies
currently based on two-dimensional gels/mass spectrometry provide the
inherent analytical power required. This strategy involves protein spot
digestion and accurate mass mapping together with computational
interrogation of available data bases for protein functional
identification. When either no exact match is found or when the
possible matches only partially account for molecular weights actually
observed, peptide sequencing by tandem mass spectrometry has emerged as the methodology of choice to provide the basic additional information required. To evaluate the capabilities of bioinformatics methods employed for identifying homologs of a protein of interest, we attempted to identify the major proteins from the 20 S proteasome of
Trypanosoma brucei from sequence information determined
by mass spectrometry. The results suggest that neither the
traditional query engines, BLAST and FASTA, nor specialized software
developed for analysis of sequence information obtained by mass
spectrometry can identify even closely related sequences at
statistically significant scores. To address this deficit, new
bioinformatics approaches were developed for concomitant use of the
multiple fragments of short sequence typically available from methods
of tandem mass spectrometry. These approaches rely on the occurrence of
congruence across searches of multiple fragments from a single protein.
This method resulted in sharply better statistical significance values
for correct hits in the data base output relative to those achieved for
independent searches using single sequence fragments.
Fueled by the genome projects, encyclopedic increases in the
banking of newly obtained, comprehensive biological data are transforming studies of biology and medicine (1). As the post-genomic era moves into high gear, new "high throughput" technologies are allowing characterization of gene expression profiles, comparisons of
genomic complements, and identification of the genetic markers associated with normal, pathological, or environmentally triggered states. Yet information derived from full analysis of genomics alone is
clearly inadequate to explain the complexities of cell biology. Recent
studies showing differences between the genome and the proteome suggest
that the profound understanding we seek will require the complete and
direct characterization of the proteome as well (2, 3).
Peptide mass mapping by
MALDI-TOF MS (4) or
liquid chromatography-electrospray ionization MS (5, 6),
combined with interrogation of sequence data bases (7-12), currently
is the most widely employed strategy for the identification of
expressed proteins. This methodology involves electrophoretic
separation of proteins at sub-picomole levels, digestion with trypsin,
and measurement of the molecular weights of the resulting peptide mixture by mass spectrometry. This strategy can routinely identify proteins whose actual sequences are present in protein, gene, or EST
data bases (13-16, 12, 17, 18). In addition, one would also like to
find related proteins even when the actual protein sequence is absent
from the data bases. Here, the identity of the protein of interest can
frequently be inferred if the similarities to homologs in available
data bases are sufficiently high. Such identification by inference
cannot be achieved solely by mass mapping, however, because patterns of
tryptic digestion are generally not conserved; hence, only nearly
identical homologs are likely to be found. Thus, additional
experimental information, viz. amino acid sequence, is
required to identify proteins whose sequences are not present in the
data bases per se or that may differ even slightly from
related sequences.
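The poor conservation of tryptic digestion patterns can be illustrated with a minimal Python sketch. The two sequences are hypothetical, the cleavage rule is simplified (cleavage after every Lys or Arg, with no missed cleavages and no proline exception), and the function names are illustrative rather than part of any published software.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the sum of residue masses


def tryptic_digest(sequence):
    """Cleave after every K or R (simplified rule, no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of the neutral peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


# Hypothetical homologs that differ by a single Lys -> Gln substitution.
protein_a = "MAGKLSDERTTIFSPEGR"
protein_b = "MAGQLSDERTTIFSPEGR"

for name, protein in (("A", protein_a), ("B", protein_b)):
    for peptide in tryptic_digest(protein):
        print(name, peptide, round(peptide_mass(peptide), 4))

# Only peptides untouched by the substitution remain shared.
shared = set(tryptic_digest(protein_a)) & set(tryptic_digest(protein_b))
print("shared peptides:", sorted(shared))
```

In this toy case one substitution removes a cleavage site, so two of the three peptide masses from protein A have no counterpart in protein B even though the sequences are nearly identical.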
Over the last 5 years, sequence information obtained directly from
protein has been generated primarily by methods of
tandem mass spectrometry, based upon collision-induced dissociation of
tryptic fragments in protein digests. During this time, mass
spectrometry has emerged as the only technology sufficiently sensitive,
rapid, and robust to provide this information on a proteome-wide scale.
New instrumentation is currently being developed to accommodate the
high throughputs required (19). Sequence information, which is much
better conserved among related proteins than are tryptic digestion
patterns, can then be used for identification of proteins with close or
sometimes even remote homologs in the data bases. Searching tools
currently in use with such partial sequence data include simple string
searches (20) and algorithms such as BLAST (21), Gapped BLAST (22), and
FASTA (23). When closely related homologs exist in the data bases or
when the available sequence fragments represent highly conserved
regions of a homolog, these approaches can be quite successful.
However, when only more remote homologs are present or when homologs
exist only as partial sequences, as in EST data bases, sequence
information from mass spectrometric methods may be insufficient to
identify the protein of interest due to two major problems. First, the
statistical significance of matches in data base searches is
length-dependent (24), whereas segments of sequence
obtained by mass spectrometry of tryptic peptides are generally short,
often 10 amino acids or less. As a consequence, searching algorithms
often fail to find homologs at statistically significant scores.
Second, the location and coverage of peptides relative to the parent
protein is highly variable and may represent regions of low or no
similarity to a homolog in the data base. Thus, a score of poor
statistical significance could signify an absence of homologs in the
data base or that the particular peptides sequenced may represent
poorly conserved regions of even fairly close homologs. The inability to distinguish between these two possibilities could result in misleading interpretation of the novelty of the sequence of interest. In a high throughput experiment, the consequences could also be substantial, resulting in a decision to divert resources elsewhere or
to over-commit resources toward further analysis of the protein or
protein system in question.
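The length dependence of statistical significance can be made concrete with the Karlin-Altschul expectation, E = K·m·n·exp(-λS), where m is the query length, n the data base length, and S the raw score. The sketch below is illustrative only: the values of λ and K are approximate figures for ungapped BLOSUM62 scoring, while the data base size and the average score per matched residue are assumptions chosen for the example, and edge-effect corrections applied by real search programs are ignored.

```python
import math

LAMBDA = 0.318  # approximate ungapped BLOSUM62 value (assumption)
K = 0.134       # approximate ungapped BLOSUM62 value (assumption)


def expected_hits(score, query_len, db_len, lam=LAMBDA, k=K):
    """Karlin-Altschul expectation: chance hits scoring at least `score`."""
    return k * query_len * db_len * math.exp(-lam * score)


DB_RESIDUES = 100_000_000  # hypothetical data base size, in residues
SCORE_PER_RESIDUE = 5      # rough score per identically matched residue

# Even a perfectly matched 10-residue peptide scores too low to stand out,
# whereas longer perfect matches quickly become significant.
for length in (10, 20, 40):
    score = SCORE_PER_RESIDUE * length
    e_value = expected_hits(score, length, DB_RESIDUES)
    print(f"length={length:2d}  score={score:3d}  E={e_value:.2e}")
```

Under these assumptions, a perfect match of 10 residues is expected to arise by chance more than once per search, which is consistent with the difficulty described above for peptides of 10 amino acids or less.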
Here we describe new strategies for more effective identification of
both close and remote homologs from partial sequence information using
the identification and preliminary structural analysis of the 20 S
proteasome from Trypanosoma brucei (25) as an example.
Proteasomes are multicatalytic proteolytic complexes found in the cells
of most organisms and play important roles in protein turnover in both
cytosol and nucleus. These roles include the removal of abnormal,
misfolded, or improperly assembled proteins associated with a wide
range of biological functions in stress response, cell differentiation,
metabolic adaptation, and cellular immune responses (26, 27). The
proteasome provides a good model system for this study because it is a
macromolecular assemblage whose component subunits cannot be directly
inferred from genomic information alone, thus requiring direct
characterization of its protein components. Furthermore, its relative
complexity provides an appropriate challenge for initial development of
high throughput strategies for characterization of an important protein
machine (28).
A combination of modern mass spectrometric techniques and newly
developed informatics approaches was used to identify and characterize
subunits of the proteasome. This is the first time that such a complex
system has been investigated in this manner, and the results offer evidence for the
value of the approach. Further development of these informatics
strategies could extend substantially the usefulness of mass
spectrometric sequence information for identifying proteins in species
whose genomes have not been sequenced.
Protein Separation by Two-dimensional Gel Electrophoresis and In-Gel Digestion
Preparation and electrophoretic separation of 20 S proteasomes
from T. brucei were performed as described previously (25, 29). After silver staining, protein Spots were excised from the gel.
Minced gel pieces were washed with 25 mM
NH4HCO3 in 50% ACN, dried in a Speedvac, and
then rehydrated in 25 mM NH4HCO3 solution containing trypsin. After overnight digestion at 37 °C, peptides were extracted, once with HPLC grade water and then 3 times
with 50% ACN, 5% trifluoroacetic acid. The combined supernatants were
dried by Speedvac and then redissolved in 50% ACN, 5% trifluoroacetic acid. Aliquots were analyzed directly without separation. The remainder
was separated by reverse phase HPLC on a Vydac microbore C18 column
(1.0 mm × 15 cm), fractions from which were collected and concentrated.
Mass Spectrometric Analysis of Tryptic Peptides
Molecular masses of tryptic peptides were determined by
analyzing 1/20th of each unseparated digest and 1/10th of each HPLC fraction by MALDI-TOF MS (Voyager DESTR, PerSeptive Biosystems, Framingham, MA). Peptides were cocrystallized with equal volumes of a
saturated solution of
In determining peptide sequences, all spectra were interpreted manually
using known rules and established principles (31, 32). Potential
ambiguities exist for isomeric Ile/Leu and isobaric Gln/Lys, and
frequently in the assignments of terminal residues. Here PSD was the
primary technique, in which the reflectron voltage in a MALDI-TOF mass
spectrometer was scanned to select fragment ions (33), supplemented by
high quality tandem MS from the QqoaTOF mass spectrometer (30). PSD
data are generally inferior because the mass calibration is variable, whereas
newer instruments such as the QqoaTOF give high quality tandem data
with sufficiently accurate mass measurements to increase the success of
data base searching (34) and to assist manual interpretation by
removing the ambiguities arising from isobaric ions. The MALDI-TOF/TOF
tandem mass spectrometer is another new alternative that has the
additional advantage of high energy CID giving more comprehensive
fragmentation, allowing Ile and Leu to be differentiated (19, 35).
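The two kinds of ambiguity can be separated by comparing monoisotopic residue masses at different mass tolerances. The following Python sketch is illustrative; the tolerance values are assumptions and do not describe the performance of any particular instrument.

```python
from itertools import combinations

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def ambiguous_pairs(tolerance_da):
    """Residue pairs whose masses differ by no more than the tolerance."""
    return sorted(
        (a, b)
        for a, b in combinations(sorted(RESIDUE_MASS), 2)
        if abs(RESIDUE_MASS[a] - RESIDUE_MASS[b]) <= tolerance_da
    )


# Coarse mass measurement (assumed 0.1 Da): both pairs are ambiguous.
print(ambiguous_pairs(0.1))
# Accurate mass measurement (assumed 0.01 Da): Gln/Lys is resolved,
# but isomeric Ile/Leu cannot be separated by mass alone.
print(ambiguous_pairs(0.01))
```

Gln and Lys differ by about 0.036 Da, so accurate mass measurement can distinguish them, whereas Ile and Leu have identical masses and require additional fragmentation information such as that from high energy CID.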
Data Base Searching and Analysis of Sequence Information
A static data base, non-redundant Static, was generated by
downloading the non-redundant National Center for Biotechnology Information protein data base (as of 3/11/99) and modified by removing
sequences matching 3 of our query sets that had been submitted to the
data base following their identification early in this project (29).
All searches used either the BLOSUM62MS scoring matrix, a version of
BLOSUM62 (36), modified to account for Ile/Leu and Gln/Lys ambiguities
associated with determination of protein sequences using mass
spectrometry (37) or the PAM30MS matrix, modified in an analogous
manner from the PAM30 matrix (38).
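One plausible way to adapt a scoring matrix to these ambiguities is sketched below in Python. This is an assumption made for illustration, not the published construction of BLOSUM62MS or PAM30MS (see Ref. 37): residues that mass spectrometry may not distinguish are simply scored as identities. Only the BLOSUM62 entries for Ile, Leu, Gln, and Lys are included, and the peptide pair is hypothetical.

```python
def key(a, b):
    """Order-independent key for a residue pair."""
    return tuple(sorted((a, b)))


# Excerpt of BLOSUM62 covering only the four residues of interest.
BLOSUM62_EXCERPT = {
    key("I", "I"): 4, key("L", "L"): 4, key("I", "L"): 2,
    key("Q", "Q"): 5, key("K", "K"): 5, key("K", "Q"): 1,
    key("I", "Q"): -3, key("I", "K"): -3,
    key("L", "Q"): -2, key("L", "K"): -2,
}

AMBIGUOUS_GROUPS = [("I", "L"), ("K", "Q")]


def make_ms_matrix(matrix, groups):
    """Copy the matrix, scoring ambiguous pairs like an identity (assumption)."""
    ms_matrix = dict(matrix)
    for a, b in groups:
        ms_matrix[key(a, b)] = max(matrix[key(a, a)], matrix[key(b, b)])
    return ms_matrix


def ungapped_score(seq1, seq2, matrix):
    """Sum of pairwise scores for two equal-length sequences."""
    return sum(matrix[key(x, y)] for x, y in zip(seq1, seq2))


ms_matrix = make_ms_matrix(BLOSUM62_EXCERPT, AMBIGUOUS_GROUPS)

# Hypothetical case: the same peptide read with every ambiguous residue swapped.
print(ungapped_score("LIQK", "ILKQ", BLOSUM62_EXCERPT))
print(ungapped_score("LIQK", "ILKQ", ms_matrix))
```

With the unmodified excerpt, a peptide read as ILKQ is penalized against a data base sequence LIQK even though the two are indistinguishable by low energy tandem mass spectrometry; the modified matrix removes that penalty.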
String Searches Using MS-Pattern (Formerly
MS.Edman)--
MS-Pattern is a program within the ProteinProspector
suite (prospector.ucsf.edu/) (20) that searches the data bases and retrieves entries associated with a text string comprised of a keyword,
accession number, or sequence.
Gapped BLAST for Single and Catenated Sequences, Automation of
Multiple Searches with Multi-BLAST and Congruence Analysis Using
MS-Shotgun--
All types
of BLAST searches were performed using version 2.0a19 of BLASTP as
implemented by Washington University, St. Louis. Further information
about this implementation is available
(blast.wustl.edu/blast/README.html). Congruence analysis of Gapped
BLAST (BLAST 2.0) searches was carried out for all sequence fragments
from a given protein. The MultiBLAST Suite was written in Perl and
Python to facilitate the generation and submission of multiple
permutations and catenations of peptide sequences to Gapped BLAST,
input as text files containing all sequence fragments for a single
Spot. Default options for all Gapped BLAST searches were E = 1000, B = 200, and V = 200. MS-Shotgun, a modification of Shotgun
(39), was written for congruence analysis of MS data. For simple
analysis, searches used individual peptide sequences as queries and
then determined congruence across these searches by determining which
hits were found in common by more than one peptide sequence fragment.
The number of query sequences that found each hit is designated the
"Shotgun Score." For MS-Shotgun with catenation analysis, files
containing the total number of peptide sequences from a single Spot
(n) were generated, permuted, and catenated to give all
possible sequence orders. Gapped BLAST searched each of the
n! catenations, and then congruence analysis was performed
across all output files from either the set of single peptides or the
n! catenations from the peptide sequences from a given Spot.
Hits were sorted either by Shotgun Score or by p values.
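The two operations described above, generation of all n! catenations and calculation of the Shotgun Score, can be sketched in Python as follows. This is a minimal illustration and not the MS-Shotgun or MultiBLAST code itself: the peptide sequences and hit identifiers are hypothetical, and the Gapped BLAST searches that would produce the hit lists are replaced by hard-coded stand-ins.

```python
from collections import Counter
from itertools import permutations


def catenations(peptides):
    """Yield (sort_order, catenated_sequence) for all n! peptide orderings.

    The sort order is written as a string of peptide indices, e.g. "021".
    This labeling assumes fewer than 10 peptides per Spot.
    """
    for order in permutations(range(len(peptides))):
        label = "".join(str(i) for i in order)
        yield label, "".join(peptides[i] for i in order)


def shotgun_scores(hits_per_query):
    """Count, for each hit, the number of queries that found it."""
    counts = Counter()
    for hits in hits_per_query.values():
        counts.update(set(hits))  # a hit counts once per query
    return counts


# Hypothetical peptide sequences from a single Spot.
peptides = ["AAK", "CCR", "DDK"]
for label, sequence in catenations(peptides):
    print(label, sequence)

# Hypothetical search outputs: query label -> data base entries found.
hits_per_query = {
    "1.0": ["entry_P1", "entry_P2"],
    "1.1": ["entry_P2", "entry_P3"],
    "1.2": ["entry_P2"],
}
for hit, score in shotgun_scores(hits_per_query).most_common():
    print(hit, score)
```

Because the number of catenations grows as n!, a Spot with 8 peptide sequences would already require 40,320 searches, which is one practical reason for automating the submission step.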
During preparation of this manuscript, other approaches using
congruence analysis came to our attention. These include extensions to
the CIDentify program developed by Johnson and Taylor (40), i.e. the CID Results Compiler, and the FASTS program, a
modified version of the FASTA search
engine.3 For comparison with
our methods, data base searches were also performed with the FASTS
program. Like MS-Shotgun, FASTS exploits the advantages of using all of
the available sequence information from a single Spot to enhance the
sensitivity and significance of the searches. The modifications
reflected in FASTS relative to the parent FASTA search engine include
allowing submission of multiple short sequences from one protein as a
single query to the search engine and removal of the penalty for gaps
between query sequences. Pairwise similarity scores for all the
peptides that show matches to a given hit are included in the
calculation of statistical significance. Pairwise alignments for all
the hits are also reported in the FASTS output, showing which query
sequences were used in the match and where they align. Because neither
a description nor an evaluation of the FASTS algorithm has yet been published by its authors, we did not consider it appropriate to present
extensive comparisons between our MS-Shotgun with catenation analysis
and FASTS. For this reason, data from our FASTS analysis of the
proteasome data are not provided in our figures or tables. To give the
reader some idea of the relative performance of MS-Shotgun with
catenation and FASTS on our data, important points of similarity and
difference between the two approaches are provided under "Results and
Discussion," however.
The parent search engine, FASTA, was also tested using all individual
peptide sequences as queries to confirm that there was no absolute
advantage for this study in using FASTA over BLAST. This test was
performed using both the BLOSUM62MS and the PAM30MS scoring matrices.
Evaluation of Functional Assignments Using Multiple
Alignments--
As required, functional assignments inferred from the
data base search outputs were further evaluated using multiple
alignments of top scoring protein hits from each MS-Shotgun with
catenation search. Percent identities were calculated for these hits to
identify a divergent set of sequences from available species when
possible. These were aligned with the individual peptide sequences
using either Pileup (41) or ClustalW (42). To evaluate the correctness of the linear ordering, we subsequently compared the ordering of each
catenated sequence to the ordering of those same peptides within the
sequences of the full-length proteasome protein subunits that were
obtained independently from DNA sequencing after the completion of this
study (29).4 While the data
base search scores for these full-length sequences were useful for
comparison with scores obtained for sequence fragments obtained from
sequencing by mass spectrometry, these full-length sequences were not
used in the bioinformatic analysis of the sequences obtained from mass spectrometry.
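The comparison of a catenation order with the true linear order of the peptides can be sketched as below. The Python code is a simplified illustration under stated assumptions: the full-length sequence and peptides are hypothetical, peptides are located by exact matching after collapsing the Ile/Leu and Gln/Lys ambiguities, and peptides containing other sequencing errors would simply not be found.

```python
def normalize(sequence):
    """Collapse residues that mass spectrometry may not distinguish."""
    return sequence.replace("I", "L").replace("K", "Q")


def true_order(full_length, peptides):
    """Return peptide indices, as a string, ordered by position in the protein.

    Peptides that cannot be located are omitted from the result.
    """
    target = normalize(full_length)
    positions = []
    for index, peptide in enumerate(peptides):
        position = target.find(normalize(peptide))
        if position >= 0:
            positions.append((position, index))
    return "".join(str(index) for _, index in sorted(positions))


# Hypothetical full-length subunit and peptide sequences (indices 0, 1, 2).
full_length = "MSTIKAGYDRALEVFSPDGRLFQVEYAQK"
peptides = ["LFKVEYAQK", "AGYDR", "AIEVFSPDGR"]

order = true_order(full_length, peptides)
print("true sort order:", order)
```

A best scoring catenation whose sort order equals the string returned here would be judged correctly ordered.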
The overall strategy used for identifying unknowns is illustrated
in the flow chart in Fig. 1A.
The protocols for congruence analyses using the MS-Shotgun program are
illustrated graphically in Fig. 1B.
Separation, Digestion, and Analysis of Tryptic Peptides
The 20 S proteasome of T. brucei separated by
two-dimensional gel electrophoresis and stained with silver yielded 26 protein Spots (Fig. 2), the pattern of
which was highly reproducible from batch to batch although minor Spot
intensities varied across different preparations. Compared with
two-dimensional gel analysis of the 20 S proteasome from rat (18), the
two-dimensional fingerprint showed far fewer Spots, indicating that the
trypanosome proteasome is likely to be less complex. Fifteen major
Spots were excised from the gel, digested overnight with trypsin, and
analyzed by MALDI-TOF. The peak list for each protein digest was
searched against the calculated peptide mass values for proteins in the data bases, using the known specificity for trypsin to cleave at
arginine and lysine. When this experiment was performed, no cDNA
sequences of 20 S proteasome subunits from T. brucei were present in public data bases, so mass mapping was limited to searches for homologous proteins that might show conserved tryptic peptides. Not surprisingly, no proteasome entries from other species
were found; therefore, after separation of digests by capillary HPLC, peptide sequencing by tandem mass spectrometry was undertaken on all
abundant peaks from each Spot (2-8 sequences were obtained per Spot).
The length of sequences ranged from 5 to 18 residues, averaging ~10
amino acids (Table I). The variability in
the number and length of the sequence fragments from each Spot was
typical of an analysis of this type.
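The mass mapping search described above amounts to counting how many observed peptide masses fall within a tolerance of the calculated tryptic peptide masses for each data base entry. A minimal Python sketch follows; the masses, entry names, and the 50 ppm tolerance are hypothetical, and real search programs use more elaborate scoring.

```python
def count_matches(observed, theoretical, tol_ppm=50.0):
    """Number of observed masses within tol_ppm of any theoretical mass."""
    matched = 0
    for mass in observed:
        if any(abs(mass - t) / t * 1e6 <= tol_ppm for t in theoretical):
            matched += 1
    return matched


def rank_entries(observed, database, tol_ppm=50.0):
    """Rank data base entries by the number of matched peptide masses."""
    scored = [
        (count_matches(observed, masses, tol_ppm), name)
        for name, masses in database.items()
    ]
    return sorted(scored, key=lambda item: (-item[0], item[1]))


# Hypothetical peak list from one Spot and calculated masses for two entries.
observed = [1000.50, 1500.75, 2000.10]
database = {
    "entry_A": [1000.51, 1500.74, 2500.00],
    "entry_B": [1000.90, 1800.00],
}

for matches, name in rank_entries(observed, database):
    print(name, matches)
```

When no entry matches more than a small fraction of the peak list, as was the case for the trypanosome proteasome, mass mapping alone cannot identify the protein.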
Identification of Possible Homologs from Searches
By using new informatics approaches developed here, we could identify the functions of the proteasome proteins represented by 9 of the 15 major Spots marked with arrows in Fig. 2 as follows: Spots 1, 2, 4, 6, 7, 11, 12, 15, and 17. Of the remainder, functions for Spots 3, 5, 8, 9, and 13 could not be predicted from sequence information (see "Spots That Required Additional Biological Information for Provisional Identification" below). Spot 10 appears to contain sequences identical to those in both Spots 11 and 12, suggesting either that this protein in T. brucei has a different origin from other known proteasomes or that artifacts, such as contamination with other proteins, were introduced during protein isolation and separation. Although such complexities will likely be a difficult and ubiquitous problem for high throughput proteomics, analysis of these problems is beyond the scope of this study. Therefore, data are not presented for Spot 10.
Our new informatics approaches are based on the hypothesis that examination of congruence across searches for a set of peptides from the same protein is useful for distinguishing real homologs from unrelated hits, even though scores for individual data base searches using each of these peptide sequences may fall below the level of statistical significance. This idea has been applied successfully using the original implementation of the Shotgun program to the problem of deducing function of proteins when only remote homologs of those proteins are present in the data bases (39). In those studies (43-45), however, Shotgun analysis used full-length proteins as queries rather than short sequence fragments such as those described in this study. Data base search engines such as BLAST and FASTA are designed to perform pairwise comparisons between a query sequence and each sequence in a data base.
As described below, the searches performed in this study could be divided into the following three types with respect to the nature of the query sequences and processing of data base search output: 1) searches using single sequences of tryptic peptides as queries, performed one at a time; 2) congruence analysis across the search outputs of all of these single searches from the same Spot; or 3) congruence analysis across the outputs of all possible permutations/catenations of the peptide sequences from the same Spot (MS-Shotgun with catenation). The results are summarized in Tables II-IV and discussed below. Table II compares single searches using simple BLAST analysis and congruence analysis using MS-Shotgun with catenation for the 9 Spots identified using those tools. Because we wanted to compare simple BLAST searches with MS-Shotgun with catenation searches using the same conditions, we chose to use the BLOSUM62MS scoring matrix for both types of search because of the following: 1) it is the default matrix of choice for the occasional or non-expert user; 2) no clearly optimal choice of matrix can be predicted from theory for use with the fragmentary sequence information represented by our data (see under "A Note on the Importance of the Scoring Matrix in Data Base Searches Using Peptide Sequences Obtained from Tandem Mass Spectrometry"); and 3) we had seen insufficient differences between the BLOSUM62MS and the theoretically more optimal PAM30MS matrix in searches using individual peptide sequences for many of the Spots we investigated. However, because the searches of individual peptide sequence fragments using the PAM30MS matrix did give slightly improved p values on a consistent basis, results using this matrix are also included in Table II.
Functional assignments for all of the Spots are given in Table III and are based on the functions of the proteasome sequences appearing in the top 10 scoring hits from the MS-Shotgun with catenation analysis for each Spot or as described below. As required for evaluation, multiple alignments of the peptide sequences obtained from mass spectrometry and the set of closest likely homologs determined from our analysis were used to confirm the catenation order and mapping of the short peptide sequences onto the homologs. Although none of the functional assignments provided in Table III have been verified unequivocally by experimental characterization of the proteins, recent determinations of full-length cDNA sequences for all these genes allow high credibility assignments to be made from data base searches, independent of this study. The correctness of these assignments is supported by the statistically significant similarities of these full-length sequences to other proteasome sequences (also shown in Table III).
For Spots difficult to identify solely from sequence information, highly provisional assignments of subunit function could be made using additional information obtained from keyword searches of the MS-Shotgun with catenation results. These assignments are also included in Table III and an accompanying discussion of keyword searching is given under "Spots That Required Additional Biological Information for Provisional Identification." Table IV summarizes results from MS-Shotgun with catenation analysis and keyword searches for Spots that could not be identified from sequence analysis alone, Spots 3, 5, 8, 9, and 13. To illustrate some important features of these results, a detailed discussion is also presented below for Spots 1 and 7.
Simple Search Methods Using Individual Peptide Sequences without Congruence
These are the types of searches generally performed with sequence data obtained by tandem mass spectrometry, using specific software that includes the MS-Pattern (12) and CIDentify (37) programs. The BLAST (version 1.4) (21), Gapped-BLAST (version 2.0x) (22), or FASTA (23) programs are also used (46-48). Simple analyses of single sequences of peptide fragments were compared here using MS-Pattern (detailed results not shown) and Gapped BLAST (Table II). MS-Pattern is a simple string-searching algorithm designed to accommodate the difficulties in using incomplete sequence information obtained from poor quality tandem mass spectra such as those of the MALDI post-source decay (PSD) type. Here, one has neither control over nor any way of predicting the degree of fragmentation likely to be generated and recorded for any given peptide sequence. These accommodations are made by allowing a variable number of substitutions (errors) in the query sequence relative to a data base hit. "Uncertain" amino acids in the query sequences can alternatively be input either as specifically defined elements or as a list of sequences encompassing all alternative possibilities. MS-Pattern has no explicit scoring function, however, making it difficult to distinguish the best matches. Gapped BLAST, in contrast, is among the best tools for analysis of sequence similarity. It incorporates a powerful scoring function that provides both a raw score for each hit and a measure of its statistical significance.
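A string search that tolerates a variable number of substitutions can be sketched in a few lines of Python. This is an illustration of the general idea and not the MS-Pattern implementation: the data base entries and query are hypothetical, and treating Ile/Leu and Gln/Lys as equivalent is an assumption made for the example.

```python
def normalize(sequence):
    """Collapse residues that mass spectrometry may not distinguish."""
    return sequence.replace("I", "L").replace("K", "Q")


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def pattern_search(query, database, max_mismatches=1):
    """Return (mismatch_count, entry_name, position) for every tolerated match."""
    query = normalize(query)
    hits = []
    for name, sequence in database.items():
        sequence = normalize(sequence)
        for pos in range(len(sequence) - len(query) + 1):
            window = sequence[pos:pos + len(query)]
            count = mismatches(query, window)
            if count <= max_mismatches:
                hits.append((count, name, pos))
    return sorted(hits)  # fewest mismatches first


# Hypothetical data base entries and peptide query.
database = {
    "hypo_1": "MAGTLEVFSPDGR",
    "hypo_2": "MKKTAEVYSPEGR",
}
query = "IEVFSPDGR"

print(pattern_search(query, database, max_mismatches=1))
print(pattern_search(query, database, max_mismatches=3))
```

The only basis for ranking hits in such a search is the number of mismatches, which reflects the limitation noted above: without a scoring function, there is no measure of whether a match with three substitutions is a homolog or a chance occurrence.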
Although the subunit types of the proteins represented by Spots 1, 4, 7, 11, 15, and 17 could be provisionally identified using both MS-Pattern (data not shown) and simple Gapped BLAST searches, the number of proteasome sequences obtained in the top 10 hits for each method was highly variable and shows that the identification of proteasome homologs is dependent on whether the subset of peptide sequences that happen to be available include one or more regions that are highly conserved in a homolog in the data base. Spot 1, represented by three peptide sequences (Table I), provides a
good example of this type of information. Sequence 1.0 is highly
conserved in all
In a similar manner, the subunit type represented by Spot 4 was also fairly easy to identify due to the presence of a close homolog in the data base, the full-length sequence of the C2 subunit from Trypanosoma cruzi (established later to be 81% identical to the C2 subunit from T. brucei). Four of the seven sequences available for Spot 4 were not useful for identifying the correct function, however. Again, the substantial variability among p values across all of the peptide sequences available for Spot 4 suggests that these scoring functions are not dependable for these short query sequences. Similar issues pertain to the identification of Spots 7, 11, 15, and 17 from these simple searches. For Spots 2, 3, 5, 6, 8, 9, 12, and 13, neither MS-Pattern nor Gapped BLAST produced information sufficient for functional identification. For nearly all of the peptides associated with these Spots, the top 10 hits failed to include any proteasome proteins; for some Spots, many thousands of hits contained no proteasome proteins. BLAST searches using either the BLOSUM62MS or the PAM30MS matrix also did not achieve statistically significant scores for any of these peptides. Overall, the MS-Pattern and simple BLAST searches were only marginally useful for identification of most Spots, although in a few cases highly conserved peptide sequences were available that matched well with homologs in the data base. In general, neither the number of available peptide fragments nor their lengths were as important as a high degree of conservation of a peptide sequence fragment with a homolog in the data base. Although further statistical analysis will be required to explore this issue more systematically, our experience using the limited number of proteins evaluated for the 20 S proteasome dramatically illustrates the difficulties encountered in using these search methods to provide credible identification/prediction of function.
By using MS-Pattern searches, for the most easily identifiable Spots (1, 4, 7, 11, 15, and 17), the total number of hits per Spot ranged from 3 to 32,988. For these results, except for rank ordering based on the number of mismatched residues allowed, there was no way to evaluate whether a sequence listed in the top 10 hits was more likely to be a homolog than a hit further down the list. The poor p values obtained for the great majority of the single sequence fragment searches showed that BLAST is not well suited for identification of remote homologs from these data either. The degree to which these algorithms gave both unreliable and unpredictable results was surprising given their widespread use for this purpose. The poor performance of these algorithms for the 20 S proteasome raises an important warning for the common use of these simple searching algorithms for identification of function from sequence information typically obtained from mass spectrometry. This problem can be expected to be greatly amplified for high throughput experiments where functional assignments from data base searches will depend on automated evaluation without recourse to manual review by scientists with expertise in each system in question.
Investigation of Spots Using Congruence Analysis
Searches using automated congruence analysis were performed using our program, MS-Shotgun, with and without catenation of peptide sequences. Overall, simple congruence analysis with Shotgun across the outputs of either MS-Pattern or Gapped BLAST searches (Fig. 1B, left side) using all peptide sequences in independent searches gave no substantial improvement over BLAST searching without Shotgun analysis (data not shown); e.g. Spots that could not be identified using simple searching with MS-Pattern or BLAST were not identifiable using simple congruence analysis either.
This was perhaps due to the inherent limitations of data base search programs in finding true positive hits from such short peptide sequences (see below).
In comparison to searches performed for each peptide independently, and to simple congruence analysis without catenation, MS-Shotgun analysis with catenation resulted in sometimes greatly amplified scores for the correct hits for Spots 1, 2, 4, 6, 7, 11, 12, 15, and 17 (Table II). The implementation of the FASTS program used in this study showed a similar level of improvement in distinguishing true homologs (data not shown).5 Detailed discussion of the results for Spots 1 and 7 is provided below.
Improvement in the Ability to Identify Spot 1--
For Spot 1, the additional information available from MS-Shotgun with catenation analysis produced scores with improved statistical significance (p values) relative to BLAST searches on single peptide sequences (Table II). Because it could take advantage of the information from all three of the peptide sequences available for Spot 1 in consonance, MS-Shotgun with catenation analysis resulted in an enriched clustering of proteasome sequences within the top 10 hits. The catenation analysis was also useful to distinguish the order of the peptides. Considering all hits, the best p value for a proteasome hit is 5 orders of magnitude better than the best p value for a non-proteasome hit using the BLOSUM62MS matrix and 2 orders of magnitude better using the PAM30MS matrix. For 9 of the top 10 hits, the most statistically significant score among all six of the possible permutations was that obtained by the catenation representing the correct sort order, 021 (data not shown).
To evaluate how similar the
Improvement in the Ability to Identify Spot 7--
For Spot 7, 81 of the top scoring 87 hits represent proteasome sequences (data not
shown) with the p values for the highest scoring proteasome
hit improving from statistically insignificant for all of the BLAST
searches on single peptide sequences to
2.5e
Because only a limited number of preset gap penalties are available for use with Gapped BLAST, it is currently difficult to determine the advantages of gapped over ungapped alignments with this type of data. As with many other aspects of these analyses, comparisons of gapped and ungapped alignment outputs for different Spots show great variability in the statistical significance of the scores. From a preliminary examination, it appears that the usefulness of gapped alignments (generated using Gapped BLAST, e.g. BLAST 2.0x versions) over ungapped alignments (generated using BLAST version 1.4) is substantial for some proteins but can even result in less significant scores for others (data not shown). The reasons for this apparently anomalous behavior again appear to be complex, showing large variation with other parameters such as scoring matrix. One likely reason for the uneven performance of the gapping function with these data is that gap penalties are assessed both in terms of a penalty for opening a gap and a penalty for extension of a gap. Thus, when the gaps reflecting missing information in our catenations are long, the gap penalty becomes substantial, adversely affecting the score. When gaps reflecting missing information are short, gap penalties are low, allowing a scoring advantage for the gapped alignment. Since there is no way to predict the length of gaps due to missing information from the peptide sequence information itself, it is impossible to set appropriate gap penalties for any given set of catenations. Moreover, the current gap length limit for scoring a gapped alignment is 40 amino acids. As peptide sequences from our data frequently were separated by gaps of >40 amino acids due to missing information, it is not possible to evaluate the performance of the gapping function for these catenations. A solution to this problem will require modification of the search algorithm to treat unnatural gaps due to missing information, e.g. 
in our catenations of sequence fragments, differently from "natural" gaps due to evolutionary divergence, which the algorithm was designed to accommodate. Despite the complexities associated with this issue, the use of Gapped BLAST appears to provide advantages for data base searching with the "unnatural" sequences represented by our catenations. For the best scoring catenation of Spot 7, as illustrated in Fig. 4A, all four peptides were used in the alignments, with gaps between peptides 3 and 2 inserted by the Gapped BLAST algorithm in order to obtain the best alignment to each full-length homolog sequence. Note, however, that the gap between peptides 3 and 2 is relatively short. Fig. 4B illustrates the complex and variable nature of these data with regard to which and how much of the available sequence fragments contribute to the overall scores. In this figure, the four peptide sequences from Spot 7 are shown as ordered in the three best scoring catenation orders for a single high scoring hit. Note that the alignment for sort order 1320 uses all four sequence fragments in the alignment with the hit sequence, whereas the contribution of sequence fragment 0 to the alignment in sort order 1302 is minimal, making the effective sort order for this search 132, consistent with the best scoring sort order of 1320. The alignment for sort order 0132 does not even include sequence fragment 0 in the alignment, again making the effective sort order for this search 132. The poor contribution provided by sequence fragment 0 to these alignments is also consistent with the results for simple BLAST searches (Table II), which shows that no hits representing proteasome sequences were obtained for sequence 7.0 using simple BLAST searches.
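The open/extend argument made above can be illustrated with a minimal sketch of an affine gap cost. The penalty values used here (11 to open, 1 per gapped residue) are assumed purely for illustration and are not the settings reported in this study; conventions also differ on whether the opening cost already covers the first gapped residue.

```python
def affine_gap_cost(gap_length: int, gap_open: int = 11, gap_extend: int = 1) -> int:
    """Cost of a single gap under an affine model: open + extend * length.

    gap_open and gap_extend are illustrative defaults (an assumption),
    not the parameters used in the original searches.
    """
    if gap_length < 0:
        raise ValueError("gap_length must be non-negative")
    if gap_length == 0:
        return 0  # no gap, no penalty
    return gap_open + gap_extend * gap_length


# A short gap between adjacent fragments is cheap relative to a long one,
# so the same alignment score survives a short gap but not a long one.
short_gap = affine_gap_cost(5)
long_gap = affine_gap_cost(40)
```

Under this model the cost grows linearly with the length of the missing region, which is why long stretches of missing information between catenated fragments depress the alignment score.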
Another important way in which MS-Shotgun with catenation analysis
improves the ability to identify function from sequence analysis is its
ability to separate correct from incorrect scoring orders. This
advantage was substantial for Spots 4, 6, 7, 11, and 17. The difference
in p values obtained for correct and incorrect sort orders is illustrated graphically for Spot 7 in Fig. 5, which shows that the scores for the
catenation representing the correct sort order, 1320, and for
variants of the effective sort order, 132, have the best
p values and stand out well from the background scores
representing other catenations.
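The notion of an "effective sort order" used above can be sketched as follows. This is an illustrative helper only, assuming single-character fragment labels (as in the sort orders 1320, 1302, and 0132) and assuming that the set of fragments contributing nothing to an alignment has already been determined by inspecting that alignment; the function names are hypothetical.

```python
from typing import Iterable


def effective_order(sort_order: str, non_contributing: Iterable[str]) -> str:
    """Drop fragments that do not contribute to the alignment."""
    dropped = set(non_contributing)
    return "".join(label for label in sort_order if label not in dropped)


def is_consistent(effective: str, reference: str) -> bool:
    """True if `effective` occurs, in order, as a subsequence of `reference`."""
    remaining = iter(reference)
    # `label in remaining` advances the iterator, enforcing left-to-right order.
    return all(label in remaining for label in effective)
```

For the Spot 7 example, removing fragment 0 from either 1302 or 0132 leaves 132, which is consistent with the best scoring order 1320.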
Spots That Required Additional Biological Information for Provisional Identification-- For the unknown proteins in Spots 3, 5, 8, 9, and 13, identification of the best candidate homologs was not possible using sequence information alone, even with the aid of MS-Shotgun with catenation. Table IV shows that few proteasome hits appeared among the top 10 hits for any of these Spots using MS-Shotgun with catenation. The p values for statistical significance are also unimpressive and in all cases worse than the values for non-proteasome hits. These results are not surprising and help to remind us that function cannot always be assigned from sequence information. This is especially true for proteomics studies in which only a small amount of sequence information, representing varied and unknown coverage of the parent protein, is available. Despite the failure of these methods to produce any high scoring proteasome hits for these Spots, congruence analysis gave some information that in conjunction with biological information allowed provisional assignments of function. Because this experiment was designed to isolate only proteins from the 20 S proteasome complex, the appearance of known proteasome proteins in our search outputs helped to assign a subunit type for each Spot. This was implemented through automated keyword searches of the data base search outputs using the keywords "proteasome," "multicatalytic," "endopeptidase," and "macropain." Table III gives tentative assignments of subunit identity for Spots 3, 5, 8, 9, and 13, based on the best scoring hits from the MS-Shotgun with catenation analysis and/or the keyword searches. Although these assignments all include the correct subunit identification among the likely solutions obtained from this strategy, it should be emphasized that the partial sequence information available for these Spots was insufficient for credible and definitive assignment of their functions. 
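The automated keyword screen described above might look like the following minimal sketch. Only the four keywords come from the text; the function name, the case-insensitive matching, and the assumption that each hit is available as a one-line description string are illustrative choices, and the descriptions in the usage example are hypothetical.

```python
KEYWORDS = ("proteasome", "multicatalytic", "endopeptidase", "macropain")


def keyword_hits(descriptions):
    """Return (rank, description) for hits whose description names a keyword.

    `descriptions` is assumed to be ordered by search rank, best hit first.
    """
    matches = []
    for rank, description in enumerate(descriptions, start=1):
        lowered = description.lower()
        if any(keyword in lowered for keyword in KEYWORDS):
            matches.append((rank, description))
    return matches


# Hypothetical hit descriptions, for illustration only.
example = [
    "Proteasome subunit alpha type 1",
    "heat shock protein 70",
    "MULTICATALYTIC ENDOPEPTIDASE complex chain",
]
found = keyword_hits(example)
```

Keeping the rank alongside each match makes it easy to see how far down the output the proteasome hits fall relative to non-proteasome hits.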
The tentative nature of such assignments is emphatically illustrated in Table IV by the small number of proteasome sequences found from keyword searches compared with the total number of hits in the search outputs and the fact that hits representing non-proteasome sequences scored with better statistical significance than proteasome sequences. Although keyword searching is easily automatable, the time-intensive individual analysis required to evaluate the data for these difficult-to-identify Spots would preclude the use of this approach in a high throughput setting.

A Note on the Importance of the Scoring Matrix in Data Base Searches Using Peptide Sequences Obtained from Tandem Mass Spectrometry-- Information theory analysis (24) suggests that the PAM30 matrix would be more effective for generating statistically significant scores than the BLOSUM62 matrix for the searches using the short peptide fragments typified by our data. However, except for those few peptides in our data set that are highly similar to a region of a homologous full-length protein in the data bases, we found only slight improvement using either the PAM30 or the PAM30MS matrix compared with their BLOSUM62 counterparts (data shown in the Tables only for the -MS versions of these matrices). As shown in Table II, use of the PAM30MS matrix for the searches using individual peptide sequences resulted in 3-5 orders of magnitude improvement in p values over the results obtained from use of the BLOSUM62MS matrix for peptides 1.0, 4.4, 4.5, 4.6, and 15.1. These peptides are the highest scoring peptides from all data base searches of this type that we performed and were among the few peptides containing sufficient information for identification of the proteasome subunit from a simple search of a single peptide. For the more difficult searches, neither PAM30 nor PAM30MS performed sufficiently better than BLOSUM62MS to allow identification of those proteins from simple data base searches. 
One reason that use of a scoring matrix predicted by theory to be optimal with these data was not more advantageous likely lies, again, in the manner in which the peptide sequence fragments represent coverage of the parent protein. Altschul (24) has predicted the ranges of local alignment lengths for which specific PAM matrices are appropriate. For example, the minimum sequence length required to distinguish a statistically significant alignment from one likely to occur by chance is 12 amino acids using the PAM30 matrix and 31 amino acids using a PAM120 matrix. Although the BLOSUM series matrices cannot be cleanly correlated with the PAM matrices with respect to these predictions, it has been suggested that the BLOSUM series does not include any matrices suitable for the shortest queries (see the BLAST tutorial provided by National Center for Biotechnology Information). But the short sequences available from tandem mass spectrometry correspond not to local alignment regions found by algorithms such as BLAST but rather to stretches of sequence that may only fortuitously and occasionally correspond to regions that would be matched in a BLAST alignment of two full-length sequences. This means that matches between our query sequences and even close homologs in the data bases will frequently score with p values that are below the level of statistical significance whether PAM30 or BLOSUM62 is used. An example is provided by the sequence fragments available for Spot 1, as discussed above and shown in the multiple alignment given in Fig. 3.
One of these sequences, peptide 1.0, matches perfectly or nearly
perfectly a highly conserved region in all of the homologs shown in
Fig. 3. Thus, this peptide, which fortuitously corresponds to a high
scoring alignment region for BLAST, would be expected to score better
with the PAM30 matrix than the BLOSUM62 matrix, which is indeed the
case. The best p value for peptide 1.0 using BLOSUM62MS is
0.34; using PAM30MS, it is 4.4e These results and those provided in Table II for other Spots show that neither the use of BLOSUM62MS nor of PAM30MS provides statistically significant scores for many of the peptides available for the T. brucei proteasome. Thus, our results reflect the need for more sophisticated solutions than those provided by standard approaches to data base searching, which were developed, after all, for distinctly different purposes. Moreover, the variability in performance of both of these matrices with these data leads to great uncertainty in the identification of these proteins and emphasizes the importance of using the longest possible sequence lengths for data base searching. By using our protocol, this can be achieved by catenation of all available sequence fragments from each protein to provide the longest possible query sequence. As described above, using all possible permutations of those catenations as queries in data base searches provides a distribution of scores from which the correctly ordered catenation may, in some cases, be distinguished. But although the improvement provided by this approach over that performed by searching with single sequence fragments is substantial, it is not yet a mature solution. Based on all of the results described in this study, several areas in which future research may prove fruitful are described below.

Future Directions-- As the volume of data from high throughput proteome analysis increases, automated assignment of function from sequence information inferred from mass spectrometry will be required. Our results suggest that current informatics approaches do not provide an adequate solution. Although the new approaches described in this work enhanced our ability to assign subunit identities to the T. brucei proteasome, the complexities associated with analysis of this type of data suggest that substantial further development of the bioinformatic approaches is required. 
In particular, the variability in how well any set of peptide sequences from a given protein represents conserved elements of a homolog in the data base confounds a systematic and rigorous solution to the problem. Our experience with these data does, however, suggest several directions in which improvement may be achieved. Thus, whereas the nature of the available data may obviate an elegant solution, practical improvements may be sufficient to allow us to find more effectively the relevant relationships that will help us establish identity or obtain clues about function of unknown proteins. We envision that a better tuning of our congruence approach might be achieved in several ways as follows. First, alter the data base search algorithm to treat the missing information in our catenated sequences, e.g. regions of the associated full-length sequence for which no sequence information was obtained, differently from the normal types of gaps expected in an alignment of a query sequence and a putative data base homolog. The Gapped BLAST algorithm was designed to handle the latter types of gaps but not the former. Implementation of such a strategy could also remove the problems associated with the current allowed maximum gap length of 40 amino acids. We are exploring ways in which such changes could be implemented with respect to both the availability of BLAST source code and to the consequences of these changes for the number of false positives that might be obtained at statistically significant scores. Second, adapt search engines to accommodate better the idiosyncratic nature of sequence information obtained from mass spectrometry. These could include uncertainty in the ordering of the first two residues of a sequence and the inability to distinguish Ile/Leu and Gln/Lys, although high energy CID can eliminate the ambiguity in distinguishing Ile and Leu (31, 32, 19) and accurate measurement of relevant sequential ion series can distinguish Gln and Lys (12). 
We have noted that the use of either the BLOSUM62MS matrix or the PAM30MS matrix, which score Ile/Leu and Gln/Lys equivalently, substantially improves our ability to identify remote homologs. Recent implementation of congruence/catenation methods in the string searching algorithm MS-Pattern, initially designed to accommodate the ambiguities inherent in sequence data obtained from Edman degradation or PSD mass spectrometry, also shows great promise for improved identification of remote homologs (see Footnote 6). Other accommodations for the deficiencies observed in CID spectral sequence fragment content, such as the common lack of observation of b1 ions, could improve data base search results. Some such accommodations, mentioned previously, were incorporated into the original design of the MS-Pattern algorithm. The types and sources of possible subsequence ambiguities, e.g. residue orders or isobaric substitutions, as well as the improvements in quality and completeness of de novo sequence being obtained from new generation tandem instrumentation will be discussed elsewhere. By using the information and the models provided by MS-Pattern and CIDentify, it should be possible to implement efficient searching protocols that allow submission of each peptide sequence in a format that accommodates the likely alternate versions. Third, systematically investigate other scoring matrices and other scoring parameters to determine a generally better set of default parameters for use with sequence information obtained from tandem mass spectrometry. Although our initial comparisons of the BLOSUM62MS and PAM30MS matrices suggest that the problem is more complicated than that for full-length "natural" sequences, it is still likely that improvement could be obtained, perhaps through use of initial parameterizing runs followed by additional searches using those optimized parameters. 
More useful parameterization may also require knowing whether and how the lengths of catenated peptide sequences can be correlated to specific scoring matrices now in use. Fourth, develop an efficient strategy for minimizing the number of catenation searches that must be performed for large numbers of peptide sequences. The number of catenations for all possible permutations of n sequences from a given Spot is n!. For proteins for which several peptide sequences are available, MS-Shotgun with catenation analysis therefore requires a large number of searches, e.g. n = 6 requires 6! = 720 searches. For high throughput analysis of perhaps tens of thousands of spots per day (see Footnote 7), the number of searches that are feasible to perform for each protein will likely be many fewer than this. Several simple solutions to this problem are being evaluated. Fifth, determine the relative value of pairwise versus
multiple alignments for evaluation of putative homology. The pairwise solution is less computationally intensive and much easier to evaluate for statistical significance. Such an approach has been described by Damer et al. (49) who employed a strategy using pairwise alignments for confirming similarities in a forerunner of the
FASTS program. However, the multiple alignment solution can also be
automated, by using similar strategies to those implemented at the
National Library of Medicine BLAST server
(www.ncbi.nlm.nih.gov/BLAST/). For the relatively sparse and
discontinuous sequence information generally available from mass
spectrometric sequencing, it appears particularly important to provide
the more extensive context available from multiple alignments. Also,
although the version of FASTS used for this study found correct
catenation orders about as well as did MS-Shotgun with catenation, we
also found examples of pairwise alignments in the FASTS output in which
as many as four peptide sequences from a given Spot mapped very well
onto a data base sequence that was clearly a false positive. Such false
positives would appear less credible in the data base output if
multiple alignments were provided for evaluation of data base hits.
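The n! growth noted in the fourth point above can be made concrete with a short sketch that enumerates every catenation order for a set of fragments. It assumes that fragments are simply joined end to end with no separator, which may differ from the actual MS-Shotgun implementation; the function name is hypothetical and the fragment strings in the usage example are placeholders, not sequences from this study.

```python
from itertools import permutations
from math import factorial


def catenations(peptides):
    """Yield (sort_order, query) for every ordering of the fragments.

    sort_order is a tuple of fragment indices, e.g. (1, 3, 2, 0);
    query is the fragments joined end to end in that order (an assumption).
    """
    for order in permutations(range(len(peptides))):
        yield order, "".join(peptides[i] for i in order)


# Placeholder fragments, for illustration only.
fragments = ["AAA", "CCC", "DDD", "EEE"]
queries = dict(catenations(fragments))

# Number of searches needed for n fragments is n!.
searches_for_six = factorial(6)
```

Because each additional fragment multiplies the number of queries, any strategy that prunes orderings early (for example, by dropping fragments that never contribute to an alignment) reduces the search load multiplicatively.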
We have shown that the data base search tools and protocols
commonly used for assignment of function from full-length protein sequences are largely inadequate for dealing with sequence information generally obtained from tandem mass spectrometry, even when close homologs exist in the data bases. By taking advantage of congruence across data base search results for all peptide sequences available for
a single protein, we have developed an approach that substantially improves our ability to identify remote homologs. By allowing all
available sequence information from a single Spot to be used in
consonance, matches can be found with homologs at substantially improved statistical significance scores. In parallel with the ongoing
optimization of these methods to suit the idiosyncratic nature of mass
spectrometric data, automated analysis will be improved to meet the
needs of high throughput experimental characterization of proteomes,
particularly those for which the full genomic sequence is unavailable.
* The mass spectrometry performed at the University of California, San Francisco, Mass Spectrometry Facility, was supported by National Institutes of Health Grant NCRR RR01614 (to A. L. B.), bioinformatics was supported by National Institutes of Health Grant RO1 GM60595 (to P. C. B.), and biochemistry was supported by National Institutes of Health Grant R01 AI21786 (to C. C. W.). The costs of publication of this article were defrayed in part by the payment of page charges. The article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact. The nucleotide sequence(s) reported in this paper has been submitted to the GenBank™/EMBL Data Bank with accession number(s) AF198386, AF148125, AF198387, AF169652, AF140353, AJ131148, AF169651, AJ131043, AJ130820, AF169653, AF226673, AF226674, AF148124, AF290945.
¶ To whom correspondence should be addressed. Tel.: 415-476-3784; Fax: 415-476-0688; E-mail: babbitt@cgl.ucsf.edu.
Published, JBC Papers in Press, April 17, 2001, DOI 10.1074/jbc.M008342200
2 The original Shotgun program is available by E-mail request. As the research issues described under "Results and Discussion" are solved, additional software will be developed for public use.
3 W. Pearson, unpublished information (available online, fasta.bioch.virginia.edu/).
4 C. C. Wang, unpublished information.
5 An updated implementation of FASTS that became available after our initial analyses were completed has apparently been optimized to identify sequences that are already present in the data bases more effectively. For us, this has resulted in poorer performance with the 20 S proteasome data, for which only homologous sequences were present in the data bases. The version of FASTS used for this study is no longer available.
6 P. R. Baker, R. J. Jacobs, L. Huang, K. F. Medzihradszky, M. A. Baldwin, and A. L. Burlingame, manuscript in preparation.
7 The near future needs for high throughput mass spectrometric analysis of protein spots were described by M. L. Vestal at the meeting of the American Society for Mass Spectrometry, September 1999, Asilomar, CA, and have also been observed from recent results using MALDI-TOF TOF derived from CID spectra (L. Huang, M. A. Baldwin, K. E. Williams, C. B. Silbohm, K. F. Medzihradszky, P. R. Baker, D. A. Maltby, M. Pallavicini, J. Campbell, P. Juhasz, S. A. Martin, M. L. Vestal, and A. L. Burlingame, manuscript in preparation).
The abbreviations used are: MALDI-TOF, matrix-assisted laser desorption/ionization-time-of-flight; ACN, acetonitrile; CID, collision-induced dissociation; EST, expressed sequence tags; HPLC, high performance liquid chromatography; MS, mass spectrometry; PSD, post-source decay.
Copyright © 2001 by The American Society for Biochemistry and Molecular Biology, Inc.