JBC

HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
 QUICK SEARCH:   [advanced]


     


Originally published In Press as doi:10.1074/jbc.M204161200 on August 16, 2002

J. Biol. Chem., Vol. 277, Issue 48, 45765-45769, November 29, 2002
This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow All Versions of this Article:
277/48/45765    most recent
M204161200v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Right arrow Citation Map
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Chou, K.-C.
Right arrow Articles by Cai, Y.-D.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Chou, K.-C.
Right arrow Articles by Cai, Y.-D.
Social Bookmarking
 Add to CiteULike   Add to Complore   Add to Connotea   Add to Del.icio.us   Add to Digg   Add to Reddit   Add to Technorati  
What's this?

Using Functional Domain Composition and Support Vector Machines for Prediction of Protein Subcellular Location*

Kuo-Chen ChouDagger and Yu-Dong Cai§

From Dagger  Upjohn Laboratories, Pharmacia, Kalamazoo, Michigan 49001-4940 and § Shanghai Research Centre of Biotechnology, Chinese Academy of Sciences, Shanghai 200233, China

Received for publication, April 29, 2002, and in revised form, August 8, 2002

    ABSTRACT
TOP
ABSTRACT
INTRODUCTION
THEORY
RESULTS AND DISCUSSION
CONCLUSION
REFERENCES

Proteins are generally classified into the following 12 subcellular locations: 1) chloroplast, 2) cytoplasm, 3) cytoskeleton, 4) endoplasmic reticulum, 5) extracellular, 6) Golgi apparatus, 7) lysosome, 8) mitochondria, 9) nucleus, 10) peroxisome, 11) plasma membrane, and 12) vacuole. Because the function of a protein is closely correlated with its subcellular location, with the rapid increase in new protein sequences entering into databanks, it is vitally important for both basic research and pharmaceutical industry to establish a high throughput tool for predicting protein subcellular location. In this paper, a new concept, the so-called "functional domain composition" is introduced. Based on the novel concept, the representation for a protein can be defined as a vector in a high-dimensional space, where each of the clustered functional domains derived from the protein universe serves as a vector base. With such a novel representation for a protein, the support vector machine (SVM) algorithm is introduced for predicting protein subcellular location. High success rates are obtained by the self-consistency test, jackknife test, and independent dataset test, respectively. The current approach not only can play an important complementary role to the powerful covariant discriminant algorithm based on the pseudo amino acid composition representation (Chou, K. C. (2001) Proteins Struct. Funct. Genet. 43, 246-255; Correction (2001) Proteins Struct. Funct. Genet. 44, 60), but also may greatly stimulate the development of this area.

    INTRODUCTION
TOP
ABSTRACT
INTRODUCTION
THEORY
RESULTS AND DISCUSSION
CONCLUSION
REFERENCES

According to the localization or compartment in a cell, proteins are generally classified into the following 12 categories: 1) chloroplast, 2) cytoplasm, 3) cytoskeleton, 4) endoplasmic reticulum, 5) extracellular, 6) Golgi apparatus, 7) lysosome, 8) mitochondria, 9) nucleus, 10) peroxisome, 11) plasma membrane, and 12) vacuole. Given the sequence of a protein, how can we predict which category or subcellular location it belongs to? This is certainly a very important problem because the subcellular location of a protein is closely correlated with its biological function. Although the information about protein subcellular location can be determined by conducting various experiments, that is both time consuming and costly. Because of the fact that the number of sequences entering into databanks has been rapidly increasing, e.g. in 1986 the total sequence entries in SWISS-PROT (1) was only 3,939 while the number was increased to 80,000 in 1999, the problem has become an urgent challenge. Particularly, it is anticipated that many more new protein sequences will be derived soon because of the recent success of the human genome project, which has provided an enormous amount of genomic information in the form of 3 billion base pairs assembled into tens of thousands of genes. Therefore, the challenge will become even more urgent and critical. Actually, many efforts have been made trying to develop some computational methods for quickly predicting the subcellular locations of proteins (2-13). It is instructive to point out that, of these algorithms, most are based on the amino acid composition alone without including any sequence-order effects, and some (9, 12, 13) are based on the pseudo amino acid composition that incorporated partial sequence-order effects. To further improve the prediction quality, a logical and key step would be to find an effective way to incorporate the sequence-order effects. The present study was initiated in an attempt to explore a different approach to incorporate these kinds of effects. The core of the new approach is based on a novel concept, the so-called "functional domain composition," as will be further described below.

    THEORY
TOP
ABSTRACT
INTRODUCTION
THEORY
RESULTS AND DISCUSSION
CONCLUSION
REFERENCES

The Functional Domain Composition Representation

To improve the quality of statistical prediction for protein subcellular location, one of the most important steps is to give an effective representation for a protein. This is indeed a crucial problem but meanwhile a quite subtle one, which might lead us to face the dilemma discussed below. According to common sense, an effective representation should include as much information a protein has as possible. Compared with the amino acid composition (14-16) and the pseudo amino acid composition (12), the entire protein sequence contains of course the most complete information. Unfortunately, if using the entire sequence of a protein as its representation to formulate the statistical prediction algorithm, one would face the difficulty of dealing with almost an infinity of sample patterns, as elaborated by Chou (12). Accordingly, to formulate a feasible statistical prediction algorithm, a protein must be expressed in terms of a set of discrete numbers. The earliest approach (2-5) in this regard was to use the amino acid composition that consists of 20 components representing the occurrence frequencies of the 20 native amino acids in a protein. However, if using the amino acid composition as the representation for a protein, all the sequence-order effects would be missed. Therefore we are actually confronted with the dilemma that, if wishing to include the complete information, the prediction would become unfeasible; if wishing to make the prediction feasible, some important information must be ignored. In view of this, can we find a compromise scenario, i.e. a new protein representation that is constituted by a set of discrete numbers but that also contains as much of the sequence-order effects as possible? The introduction of the pseudo amino acid composition is a pioneer effort in this regard that has no doubt made one important step forward for such a goal. The pseudo amino acid composition consists of 20 + lambda  discrete numbers, where the first 20 numbers are the same as those in the amino acid composition and the remaining numbers represent lambda  different ranks of sequence-correlation factors (12). In this paper, we would like to introduce a completely different set of discrete numbers; i.e. instead of using each of the 20 amino acid components or each of the 20 + lambda  pseudo amino acid components as a vector base to define a protein, we shall use each of the native functional domains as a vector base to define a protein.

By searching and clustering 139,765 annotated protein sequences, Murvai et al. (17) have constructed a data base called SBASE-A that contains 2005 sequences with well known structural and functional domain types. With each of the 2005 functional domains as a vector-base, a protein can be defined as a 2005-dimensional (D)1 vector according to the following procedures. 1) Use BLASTP to compare a protein with each of the 2005 domain sequences in SBASE-A to find the high-scoring segment pairs (HSPs) and the smallest sum probability (P). A detailed description about this operation can be found in Altschul (18). 2) If the HSP score 75 and P < 0.8 in comparing the protein sequence with the ith domain sequence, then the ith component of the protein in the 2005-D space is assigned 1; otherwise, 0. 3) The protein can thus be explicitly formulated as follows.
<B><UP>X</UP></B>=<FENCE><AR><R><C>x<SUB>1</SUB></C></R><R><C>x<SUB>2</SUB></C></R><R><C>⋮</C></R><R><C>x<SUB><UP>i</UP></SUB></C></R><R><C>⋮</C></R><R><C>x<SUB>2005</SUB></C></R></AR></FENCE>, (Eq. 1)
where
x<SUB><UP>i</UP></SUB>=<FENCE><AR><R><C>1, <UP>when HSP score>></UP> 75<UP> and </UP>P<0.8</C></R><R><C>0, <UP>otherwise</UP></C></R></AR></FENCE> (Eq. 2)

Defined in this way, a protein corresponds to a 2005-D vector X with each of the 2005 functional domain sequences as a base for the vector space; i.e. rather than the 20-D space (15) of the amino acid composition approach or the (20 + lambda )-D space of the pseudo amino acid composition approach (12), a protein is represented in terms of the functional domain-composition. By using such a representation, not only some sequence-order effects but also some functional information is included. In other words, the representation thus obtained for a protein would bear some sequence-order mark as well as the structural and functional type mark. Because the function of a protein is closely related to its subcellular location, the prediction algorithm established based on the new representation would naturally incorporate those factors that might be directly correlated with the protein subcellular location.

Support Vector Machines

Support Vector Machines (SVMs) are kinds of learning machines based on statistical learning theory. The most remarkable characteristics of SVMs are the absence of local minima, the sparseness of the solution, and the use of the kernel-induced feature spaces. The basic idea of applying SVMs to pattern classification can be outlined as follows. First, map the input vectors into a feature space (possible with a higher dimension) either linearly or non-linearly, which is relevant to the selection of the kernel function. Then, within the feature space, seek an optimized linear division; i.e. construct a hyper-plane that can separate two classes (this can be extended to multi-classes) with the least error and maximal margin. The SVMs training process always seeks a global optimized solution and avoids over-fitting, so it has the ability to deal with a large number of features. A complete description to the theory of SVMs for pattern recognition is given in the book by Vapnik (19). SVMs have been used to deal with protein fold recognition (20), protein-protein interaction prediction (21), and protein secondary structure prediction (22).

In this paper, the Vapnik's Support Vector Machine (23) was introduced to predict protein subcellular location. Specifically, SVMlight, which is an implementation (in C Language) of SVM for the problems of pattern recognition, was used for computations. The optimization algorithm used in SVMlight can be found in Joachims (24). The relevant mathematical principles can be briefly formulated as follows.

Given a set of n samples, i.e. a series of input vectors
<B><UP>X</UP></B><SUB>k</SUB> &egr; ℜ<SUP>&tgr;</SUP> (k=1,…,N), (Eq. 3)
where Xk can be regarded as the kth protein or vector defined in the 2005-D space according to the functional domain composition (see Eq. 1), and Rtau is a Euclidean space with tau  dimensions. Because the multiclass identification problem can always be converted into a two-class identification problem, without loss of the generality, the formulation below is given for the two-class case only. Suppose the output derived from the learning machine is expressed by yk epsilon {+1,-1} (k = 1,  ... , N) where the indexes -1 and +1 are used to stand for the two classes concerned, respectively. The goal here is to construct one binary classifier or derive one decision function from the available samples that has a small probability of misclassifying a future sample. Here both the basic linear separable case and the most useful linear non-separable case for most real life problems are taken into consideration.

The Linear Separable Case In this case, there exists a separating hyper-plane whose function is W · X + b = 0, whose implication is shown in the following equation.
y<SUB>k</SUB>(<B><UP>W</UP></B>·<B><UP>X</UP></B><SUB>k</SUB>+b)≥1, (k=1,…,N) (Eq. 4)
By minimizing ∥W∥2 subject to the above constraint, the SVM approach will find a unique separating hyper-plane. Here ∥W∥2 is the Euclidean norm of W, which maximizes the distance between the hyper-plane or the optimal separating hyper-plane (25) and the nearest data points of each class. The classifier thus obtained is called the maximal margin classifier. By introducing Lagrange multipliers alpha i, and using the Karush-Kuhn-Tucker conditions (26, 27) as well as the Wolfe dual theorem of optimization theory (28), the SVM training procedure amounts to solving the following convex quadratic programming problem
<B><UP>Max: </UP></B><LIM><OP>∑</OP><LL>i=1</LL><UL>N</UL></LIM> &agr;<SUB>i</SUB>−<FR><NU>1</NU><DE>2</DE></FR> <LIM><OP>∑</OP><LL>i=1</LL><UL>N</UL></LIM> <LIM><OP>∑</OP><LL>j=1</LL><UL>N</UL></LIM> &agr;<SUB>i</SUB>&agr;<SUB>j</SUB><UP>y</UP><SUB>i</SUB><UP>y</UP><SUB>j</SUB><B><UP>X</UP></B><SUB>i</SUB> · <B><UP>X</UP></B><SUB>j</SUB> (Eq. 5)
subject to the following two conditions.
&agr;<SUB>i</SUB>≥0, (i=1, 2,…,N) (Eq. 6)

<LIM><OP>∑</OP><LL>i=1</LL><UL>N</UL></LIM> &agr;<SUB>i</SUB><UP>y<SUB>i</SUB>=0</UP> (Eq. 7)
The solution is a unique globally optimized result, which can be expressed with the following expansion.
<B><UP>W</UP></B>=<LIM><OP>∑</OP><LL>i=1</LL><UL>N</UL></LIM> y<SUB>i</SUB>&agr;<SUB>i</SUB><B><UP>X</UP></B><SUB>i</SUB> (Eq. 8)
Only if the corresponding alpha i > 0, are these Xi called the Support Vectors. Now suppose X is a query protein defined in the same 2005-D space based on the functional domain composition (see Eq. 1). After the SVM has been trained, the decision function for identifying which class the query protein belongs to can be formulated as
f(<B><UP>X</UP></B>)=<UP>sgn</UP> <FENCE><LIM><OP>∑</OP><LL>i=1</LL><UL>N</UL></LIM> y<SUB>i</SUB>&agr;<SUB>i</SUB><B><UP>X</UP></B> · <B><UP>X</UP></B><SUB>i</SUB>+b</FENCE> (Eq. 9)
where sgn() in the above equation is a sign function, which equals to +1 or -1 when its argument is >= 0 or <0, respectively.

The Linear Non-separable Case For this case two important techniques are needed that are given below.

The "Soft Margin" Technique-- To allow for training errors, Cortes and Vapnik (25) introduced the slack variables shown below.


&zgr;<SUB>i</SUB>>0 (i=1,…,N), (Eq. 10)
The relaxed separation constraint is given below.
y<SUB>i</SUB>(<B><UP>W</UP></B>·<B><UP>X</UP></B><SUB>i</SUB>+b)≥1−&zgr;<SUB>i</SUB>, (i=1,…,N) (Eq. 11)
The optimal separating hyper-plane can be found by minimizing
<FR><NU>1</NU><DE>2</DE></FR>∥<B><UP>W</UP></B>∥<SUP>2</SUP>+c<LIM><OP>∑</OP><LL>i=1</LL><UL>N</UL></LIM> &zgr;<SUB>i</SUB> (Eq. 12)
where c is a regularization parameter used to decide a trade-off between the training error and the margin.

The "Kernel Substitution" Technique-- The SVM performs a nonlinear mapping of the input vectors from the Euclidean space 194 d into a higher dimensional Hilbert space H, where the mapping is determined by the kernel function. Then like in the linear separable case, it finds the optimal separating hyper-plane in the Hilbert space H that would correspond to a non-linear boundary in the original Euclidean space. Two typical kernel functions are listed below


K(<B><UP>X</UP></B><SUB>i</SUB>,<B><UP>X</UP></B><SUB>j</SUB>)=(<B><UP>X</UP></B><SUB>i</SUB> · <B><UP>X</UP></B><SUB>j</SUB>+1)<SUP>&tgr;</SUP> (Eq. 13)

K(<B><UP>X</UP></B><SUB>i</SUB>,<B><UP>X</UP></B><SUB>j</SUB>)=<UP>exp </UP><FENCE><UP>−</UP>r∥<B><UP>X</UP></B><SUB>i</SUB>−<B><UP>X</UP></B><SUB>j</SUB>∥<SUP>2</SUP></FENCE> (Eq. 14)
where the first one is called the polynomial kernel function of degree tau , which will eventually revert to the linear function when tau  = 1, the second one is called the radial basic function kernel. Finally, for the selected kernel function, the learning task amounts to solving the following quadratic programming problem
<B><UP>Max:</UP></B> <LIM><OP>∑</OP><LL>i=1</LL><UL>N</UL></LIM> &agr;<SUB>i</SUB>−<FR><NU>1</NU><DE>2</DE></FR> <LIM><OP>∑</OP><LL>i=1</LL><UL>N</UL></LIM> <LIM><OP>∑</OP><LL>i=1</LL><UL>N</UL></LIM> &agr;<SUB>i</SUB>&agr;<SUB>j</SUB>y<SUB>i</SUB>y<SUB>j</SUB>K(<B><UP>X</UP></B><SUB>i</SUB> · <B><UP>X</UP></B><SUB>j</SUB>) (Eq. 15)
which is subject to the following equations.
0≤a<SUB>i</SUB>≤c, (i=1, 2,…,N) (Eq. 16)

<LIM><OP>∑</OP><LL>i=1</LL><UL>N</UL></LIM> &agr;<SUB>i</SUB>y<SUB>i</SUB>=0 (Eq. 17)
Accordingly, the form of the decision function is given by the equation shown below.
f(<B><UP>X</UP></B>)=<UP>sgn</UP> <FENCE><LIM><OP>∑</OP><LL>i=1</LL><UL>N</UL></LIM>y<SUB>i</SUB>&agr;<SUB>i</SUB>K(<B><UP>X</UP></B>,<B><UP>X</UP></B><SUB>i</SUB>)+b</FENCE> (Eq. 18)
For a given dataset, only the kernel function and the regularity parameter c must be selected to specify the SVM.

    RESULTS AND DISCUSSION
TOP
ABSTRACT
INTRODUCTION
THEORY
RESULTS AND DISCUSSION
CONCLUSION
REFERENCES

To facilitate comparison, the same dataset constructed by Chou and Elrod (6) was used to demonstrate the current method. However, as mentioned in Chou (12), because the change of code names, some protein sequences could no longer be retrieved from the SWISS-PROT data bank (1). Of the 2319 proteins originally listed in Appendix A of Chou and Elrod (6), 2191 protein sequences were retrieved. These sequences consist of 145 chloroplast proteins, 571 cytoplasm, 34 cytoskeleton, 49 endoplasmic reticulum, 224 extracellular, 25 Golgi apparatus, 37 lysosome, 84 mitochondria, 272 nucleus proteins, 27 peroxisome, 699 plasma membrane, and 24 vacuole.

During the operation, the width of the Gaussian radial basic functions was selected for minimizing the estimation of the Vapnik-Chervonenkis dimension (19). The parameter c that controlled the error-margin trade-off was set at 1000. After being trained, the hyper-plane output by the SVM was obtained. This indicates that the trained model, i.e. hyper-plane output that includes the important information, has the function to identify the protein subcellular locations.

The demonstration was conducted by the three most typical approaches in statistical prediction (29), i.e. the resubstitution test, jackknife test, and independent dataset test as reported below.

Resubstitution Test-- The so-called resubstitution test is an examination for the self-consistency of an identification method. When the resubstitution test is performed for the current study, the subcellular location of each protein in the dataset is in turn identified using the rule parameters derived from the same dataset, the so-called training dataset. The success rate thus obtained for predicting the 12 subcellular locations of the 2191 proteins is summarized in Table I, from which we can see that 1913 proteins were correctly predicted for their subcellular locations and that only 278 proteins were incorrectly predicted. The overall success rate is 87.3%, indicating that after being trained, the SVMs model has grasped the complicated relationship between the functional domain composition and the subcellular location of proteins. However, during the process of the resubstitution test, the rule parameters derived from the training dataset include the information of the query protein later plugged back in the test. This will certainly underestimate the error and enhance the success rate because the same proteins are used to derive the rule parameters and to test themselves. Accordingly, the success rate thus obtained represents an optimistic estimation (6, 15, 30, 31). Nevertheless, the resubstitution test is absolutely necessary because it reflects the self-consistency of an identification method, especially for its algorithm part. An identification algorithm certainly cannot be deemed as a good one if its self-consistency is poor. In other words, the resubstitution test is necessary but not sufficient for evaluating an identification method. As a complement, a cross-validation test for an independent testing dataset is needed because it can reflect the effectiveness of an identification method in practical application. This is important especially for checking the validity of a training data base: whether it contains sufficient information to reflect all the important features concerned so as to yield a high success rate in application.

                              
View this table:
[in this window]
[in a new window]
 
Table I
Overall rates of correct prediction for the 12 subcellular locations of proteins by different algorithms and test methods

Jackknife Test-- As is well known, the independent dataset test, subsampling test and jackknife test are the three methods often used for cross-validation in statistical prediction. Among these three, however, the jackknife test is deemed as the most effective and objective one; see, e.g. a relevant review (29) for a comprehensive discussion about this and a monograph (32) for the mathematical principle. During jackknifing, each protein in the dataset is in turn singled out as a tested protein and all the rule parameters are calculated based on the remaining proteins. In other words, the subcellular location of each protein is identified by the rule parameters derived using all the other proteins except the one which is being identified. During the process of jackknifing both the training dataset and testing dataset are actually open, and a protein will in turn move from one to the other. The results of jackknife test thus obtained for the 2191 proteins are given in Table I as well.

Independent Dataset Test-- Moreover, as a demonstration of practical application, predictions were also conducted for an independent dataset based on the rule parameters derived from the 2191 proteins in the training dataset. The independent dataset was also adopted from Ref. 6. However, for the same reason as mentioned in Ref. 12, of the 2591 independent proteins originally studied by Ref. 6, only 2494 protein sequences were retrieved. They are: 112 chloroplast proteins, 761 cytoplasm, 19 cytoskeleton, 106 endoplasmic reticulum, 95 extracellular, 4 Golgi apparatus, 31 lysosome, 163 mitochondria, 418 nucleus proteins, 23 peroxisome, and 762 plasma membrane. None of these proteins occurs in the training dataset. The predicted results thus obtained for the 2494 proteins in the independent dataset are also summarized in Table I, from which we can see that 2037 proteins were correctly predicted for their subcellular locations, and only 478 proteins were incorrectly predicted. The overall success rate is 81.7%.

Furthermore, to facilitate comparison, listed in Table I are also the results predicted by various other methods on the same datasets. From Table I the following can be observed. 1) The success prediction rates by both the functional domain composition approach and the pseudo amino acid composition approach are significantly higher than those by the simple geometry approaches (14, 33) and the ProtLock algorithm (3). This is fully consistent with what is expected because both of these two approaches bear the marks of some sequence-order effects although by means of different avenues. 2) A comparison between the functional domain composition approach and the pseudo amino acid composition approach indicates that the success rates by the former are higher than the latter in both the self-consistency test and independent dataset test, indicating the current functional domain composition approach is quite promising with a considerable potential for further development. However, its success rate by the jackknife test is lower than that of the pseudo amino acid approach (12) by using the augmented covariant discriminant algorithm (9) and even 1.4% lower than that of the conventional amino acid approach by using the covariant discriminant algorithm (6). The setback might be due to the reason that the functional domain data base used in the current study is far from a complete one yet. Also, some subsets in the training dataset are too small (e.g. with less than 50 sequences) to have a high cluster-tolerant capacity (34) for the 2005-D functional domain composition space. It is anticipated that with the continuous improvement of the functional domain data base as well as the training data of small subsets by adding into them more new proteins that have been found belonging to the subcellular locations defined by these subsets the setback would be naturally overcome. As a demonstration, the testing dataset was incorporated into the original training dataset to form an augmented training dataset of 2191 + 2494 = 4685 proteins, and then the jackknife test was reapplied. It was observed that the success rate thus obtained by the current functional domain composition approach was increased from 66.7 to 87.9%, but the corresponding rate by the augmented covariant discriminant algorithm was increased from 73.0 to 79.5%, convincingly indicating the strong potential of the current approach.

The goal of this study is not to determine the possible upper limit of the success rate for the prediction of protein subcellular location, but to propose a novel and different approach to incorporate the sequence-order effect as well as the factors of structural and functional types that might help to open a new avenue to further increase our ability or options to deal with this very complicated and difficult problem. It should be realized that it is too premature to construct a complete or quasi-complete training dataset based on the knowledge available so far. Without a complete or quasi-complete training dataset, any attempt to determine such an upper limit would be unjustified, and the result thus obtained might be misleading no matter how powerful the prediction algorithm is.

It should be pointed out that some proteins are known to be shuttled from one subcellular compartment to another and back again. For example, if a query protein is shuttled between nucleus and cytoplasm, then only half-correction should be counted even its subcellular location was predicted to be one of the two locations. However, cases like that would not happen in the current study. This is because the same dataset constructed by Chou and Elrod (6) was used to demonstrate the current method. Compared with the other datasets in this area that only cover two-five subcellular locations, the dataset used here has covered many more subcellular locations. Even though, as clearly described by the authors of Ref. 6, "Sequences annotated by two or more locations are not included because of a lack of uniqueness. For example, a protein sequence labeled with "Golgi and nuclear" or "chloroplast or mitochondria" was omitted." Those proteins that are known to be shuttled between subcellular compartments must be annotated by two or more locations in SWISS-PROT. According to the screen procedure, they were already excluded from the dataset.

    CONCLUSION
TOP
ABSTRACT
INTRODUCTION
THEORY
RESULTS AND DISCUSSION
CONCLUSION
REFERENCES

The above results, together with those obtained by the covariant discriminant prediction algorithm (6) and those further improved by introducing some sequence-order effects (9, 12), have indicated that the subcellular locations of proteins are predictable with a considerable accuracy. The development in statistical prediction of protein attributes generally consists of two cores: one is to construct a training dataset and the other is to formulate a prediction algorithm. The latter can be further separated into two subcores: one is how to give a mathematical expression to effectively represent a protein and the other is how to find an operational equation to accurately perform the prediction. The process in expressing a protein from the 20-D amino acid composition space (14, 15, 33, 35) to the (20 + lambda )-D pseudo amino acid composition space (12) and to the current 2005-D functional domain composition space reflects the development of defining a protein in terms of different mathematical representations. The process in conducting prediction using from the simple geometry distance algorithm (14, 33, 35), to the Mahalanobis distance algorithm (3, 15, 16), to the covariant discriminant algorithm (6, 7, 36-38), and to the current SVM algorithm reflects the development of computation by means of different mathematical operations. One of the remarkable advantages for the pseudo amino acid composition representation is to use a set of simple and intuitive sequence-order-coupling modes to directly incorporate the sequence-order effects, while a remarkable advantage of the functional domain composition representation is to use the functional domain data base to incorporate the information of not only some sequence-order effects but also the structural and functional types. Each of the two representations has its own advantage. For some cases, the functional domain composition representation yields better results than the pseudo amino acid composition representation; but for some other cases, the outcome may be just the reverse. This is just exactly the same in comparison of the covariant discriminant algorithm with the SVM algorithm. Therefore, when we are still in the situation of lacking a complete training dataset and functional domain data base, it would be wise to complement the covariant discriminant algorithm based on the pseudo amino acid composition representation with the SVM algorithm based on the functional domain composition representation for conducting practical predictions. Finally, the functional domain composition approach might have more room and potential for further development because it incorporates both the sequence-order information and the functional type information.

The program of the new prediction method, called CLPFD (Cellular Location Prediction based on Functional Domains), is available by contacting Y. D. Cai at y.cai@umist.ac.uk.

    FOOTNOTES

* The costs of publication of this article were defrayed in part by the payment of page charges. The article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

Current address and to whom correspondence should be addressed: Biomolecular Sciences Dept., UMIST, P. O. Box 88, Manchester, M60 1QD, United Kingdom. Tel.: 44-161-2008936; Fax: 44-161-2360409; E-mail: y.cai@umist.ac.uk.

Published, JBC Papers in Press, August 16, 2002, DOI 10.1074/jbc.M204161200

    ABBREVIATIONS

The abbreviations used are: D, dimensional; HSP(s), high-scoring segment pair(s); SVM(s), support vector machine(s).

    REFERENCES
TOP
ABSTRACT
INTRODUCTION
THEORY
RESULTS AND DISCUSSION
CONCLUSION
REFERENCES

1. Bairoch, A., and Apweiler, R. (2000) Nucleic Acids Research 25, 31-36[Abstract/Free Full Text]
2. Nakashima, H., and Nishikawa, K. (1994) J. Mol. Biol. 238, 54-61[CrossRef][Medline] [Order article via Infotrieve]
3. Cedano, J., Aloy, P., P'erez-pons, J. A., and Querol, E. (1997) J. Mol. Biol. 266, 594-600[CrossRef][Medline] [Order article via Infotrieve]
4. Reinhardt, A., and Hubbard, T. (1998) Nucleic Acids Res. 26, 2230-2236[Abstract/Free Full Text]
5. Chou, K. C., and Elrod, D. W. (1998) Biochem. Biophys. Res. Commun. 252, 63-68[CrossRef][Medline] [Order article via Infotrieve]
6. Chou, K. C., and Elrod, D. W. (1999) Protein Eng. 12, 107-118[Abstract/Free Full Text]
7. Chou, K. C., and Elrod, D. W. (1999) Proteins Struct. Funct. Genet. 34, 137-153[CrossRef][Medline] [Order article via Infotrieve]
8. Chou, K. C. (2000) Curr. Protein Peptide Sci. 1, 171-208[CrossRef][Medline] [Order article via Infotrieve]
9. Chou, K. C. (2000) Biochem. Biophys. Res. Comm. 278, 477-483[CrossRef][Medline] [Order article via Infotrieve]
10. Cai, Y. D., and Chou, K. C. (2000) Mol. Cell Biol. Res. Comm. 4, 172-173[Medline] [Order article via Infotrieve]
11. Cai, Y. D., Liu, X. J., Xu, X. B., and Chou, K. C. (2000) Mol. Cell Biol. Res. Comm. 4, 230-233[CrossRef][Medline] [Order article via Infotrieve]
12. Chou, K. C. (2001) Proteins Struct. Funct. Genet. 43, 246-255[CrossRef][Medline] [Order article via Infotrieve] Correction (2001) Proteins Struct. Funct. Genet. 44, 60
13. Cai, Y. D., Liu, X. J., Xu, X. B., and Chou, K. C. (2002) J. Cell. Biochem. 84, 343-348[CrossRef][Medline] [Order article via Infotrieve]
14. Nakashima, H., Nishikawa, K., and Ooi, T. (1986) J. Biochem. 99, 152-162
15. Chou, K. C. (1995) Proteins Struct. Funct. Genet. 21, 319-344[CrossRef][Medline] [Order article via Infotrieve]
16. Chou, K. C., and Zhang, C. T. (1994) J. Biol. Chem. 269, 22014-22020[Abstract/Free Full Text]
17. Murvai, J., Vlahovicek, K., Barta, E., and Pongor, S. (2001) Nucleic Acids Res. 29, 58-60[Abstract/Free Full Text]
18. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) J. Mol. Biol. 215, 403-410[CrossRef][Medline] [Order article via Infotrieve]
19. Vapnik, V. (1998) Statistical Learning Theory , Wiley-Interscience, New York
20. Ding, C. H., and Dubchak, I. (2001) Bioinformatics 17, 349-358[Abstract/Free Full Text]
21. Bock, J. R., and Gough, D. A. (2001) Bioinformatics 17, 455-460[Abstract/Free Full Text]
22. Hua, S. J., and Sun, Z. R. (2001) J. Mol. Biol. 308, 397-407[CrossRef][Medline] [Order article via Infotrieve]
23. Vapnik, V. N. (1995) The Nature of Statistical Learning Theory , Springer-Verlag New York Inc., New York
24. Joachims, T. (1999) in Advances in Kernel Methods-Support Vector Learning (Schölkopf, B. , Burges, C. J. C. , and Smola, A. J., eds) , pp. 169-184, MIT Press, Cambridge, MA
25. Cortes, C., and Vapnik, V. (1995) Machine Learning 20, 273-293
26. Karush, W. (1939) Department of Mathematics , University of Chicago, Chicago
27. Cristianini, N., and Shawe-Taylor, J. (2000) Support Vector Machines , Cambridge University Press, Cambridge, UK
28. Wolfe, P. (1961) Q. Appl. Math. 19, 239-244
29. Chou, K. C., and Zhang, C. T. (1995) Crit. Rev. Biochem. Mol. Biol. 30, 275-349[Medline] [Order article via Infotrieve]
30. Cai, Y. D. (2001) Proteins Struct. Funct. Genet. 43, 336-338[CrossRef][Medline] [Order article via Infotrieve]
31. Zhou, G. P., and Assa-Munt, N. (2001) Proteins Struct. Funct. Genet. 44, 57-59[CrossRef][Medline] [Order article via Infotrieve]
32. Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979) Multivariate Analysis, pp. 322 and 381, Academic Press, London
33. Chou, P. Y. (1980) Abstracts of Papers, Part I Second Chemical Congress of the North American Continent, Las Vegas, May 1, 1980 , Plenum Press, New York
34. Chou, K. C. (1999) Biochem. Biophys. Res. Comm. 264, 216-224[CrossRef][Medline] [Order article via Infotrieve]
35. Chou, P. Y. (1989) in Prediction of Protein Structure and the Principles of Protein Conformation (Fasman, G. D., ed) , pp. 549-586, Plenum Press, New York
36. Chou, K. C., Liu, W., Maggiora, G. M., and Zhang, C. T. (1998) Proteins Struct. Funct. Genet. 31, 97-103[Medline] [Order article via Infotrieve]
37. Liu, W., and Chou, K. C. (1998) J. Prot. Chem. 17, 209-217[CrossRef][Medline] [Order article via Infotrieve]
38. Zhou, G. P. (1998) J. Prot. Chem. 17, 729-738[CrossRef][Medline] [Order article via Infotrieve]


Copyright © 2002 by The American Society for Biochemistry and Molecular Biology, Inc.
Add to CiteULike CiteULike   Add to Complore Complore   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us   Add to Digg Digg   Add to Reddit Reddit   Add to Technorati Technorati    What's this?


This article has been cited by other articles:


Home page
Protein Eng Des SelHome page
H.-B. Shen and K.-C. Chou
Nuc-PLoc: a new web-server for predicting protein subnuclear localization by fusing PseAA composition and PsePSSM
Protein Eng. Des. Sel., November 10, 2007; (2007) gzm057v1.
[Abstract] [Full Text] [PDF]


Home page
Protein Eng Des SelHome page
H.-B. Shen and K.-C. Chou
Gpos-PLoc: an ensemble classifier for predicting subcellular localization of Gram-positive bacterial proteins
Protein Eng. Des. Sel., January 23, 2007; (2007) gzl053v1.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
K. Lee, D.-W. Kim, D. Na, K. H. Lee, and D. Lee
PLPD: reliable protein localization prediction from imbalanced and overlapped datasets
Nucleic Acids Res., October 18, 2006; 34(17): 4655 - 4666.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
J. Guo and Y. Lin
TSSub: eukaryotic protein subcellular localization by extracting features from profiles
Bioinformatics, July 15, 2006; 22(14): 1784 - 1785.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
Z. R. Li, H. H. Lin, L. Y. Han, L. Jiang, X. Chen, and Y. Z. Chen
PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence.
Nucleic Acids Res., July 1, 2006; 34(Web Server issue): W32 - W37.
[Abstract] [Full Text] [PDF]


Home page
Plant Physiol.Home page
S. Li, D. W. Ehrhardt, and S. Y. Rhee
Systematic Analysis of Arabidopsis Organelles and a Protein Localization Database for Facilitating Fluorescent Tagging of Full-Length Arabidopsis Proteins
Plant Physiology, June 1, 2006; 141(2): 527 - 539.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
A. Hoglund, P. Donnes, T. Blum, H.-W. Adolph, and O. Kohlbacher
MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition
Bioinformatics, May 15, 2006; 22(10): 1158 - 1165.
[Abstract] [Full Text] [PDF]


Home page
Protein Sci.Home page
S. Matsuda, J.-P. Vert, H. Saigo, N. Ueda, H. Toh, and T. Akutsu
A novel representation of protein sequences for prediction of subcellular location using support vector machines
Protein Sci., November 1, 2005; 14(11): 2804 - 2813.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
P. M. Kasson, J. B. Huppa, M. M. Davis, and A. T. Brunger
A hybrid machine-learning approach for segmentation of protein localization data
Bioinformatics, October 1, 2005; 21(19): 3778 - 3786.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
M. Bhasin and G. P. S. Raghava
GPCRsclass: a web tool for the classification of amine type of G-protein-coupled receptors
Nucleic Acids Res., July 1, 2005; 33(suppl_2): W143 - W147.
[Abstract] [Full Text] [PDF]


Home page
J. Biol. Chem.Home page
A. Garg, M. Bhasin, and G. P. S. Raghava
Support Vector Machine-based Method for Subcellular Localization of Human Proteins Using Amino Acid Compositions, Their Order, and Similarity Search
J. Biol. Chem., April 15, 2005; 280(15): 14427 - 14432.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
K.-C. Chou and Y.-D. Cai
Predicting protein localization in budding Yeast
Bioinformatics, April 1, 2005; 21(7): 944 - 950.
[Abstract] [Full Text] [PDF]


Home page
Genome Res.Home page
M. S. Scott, D. Y. Thomas, and M. T. Hallett
Predicting Subcellular Localization via Protein Motif Co-Occurrence
Genome Res., October 1, 2004; 14(10a): 1957 - 1966.
[Abstract] [Full Text] [PDF]


Home page
J. Biol. Chem.Home page
M. L. Rosen, M. Edman, M. Sjostrom, and A. Wieslander
Recognition of Fold and Sugar Linkage for Glycosyltransferases by Multivariate Sequence Analysis
J. Biol. Chem., September 10, 2004; 279(37): 38683 - 38692.
[Abstract] [Full Text] [PDF]


Home page
Protein Eng Des SelHome page
M. Wang, J. Yang, G.-P. Liu, Z.-J. Xu, and K.-C. Chou
Weighted-support vector machines for predicting membrane protein types based on pseudo-amino acid composition
Protein Eng. Des. Sel., June 1, 2004; 17(6): 509 - 516.
[Abstract] [Full Text] [PDF]


Home page
J. Biol. Chem.Home page
M. Bhasin and G. P. S. Raghava
Classification of Nuclear Receptors Based on Amino Acid Composition and Dipeptide Composition
J. Biol. Chem., May 28, 2004; 279(22): 23262 - 23266.
[Abstract] [Full Text] [PDF]