Crystal Structure of the Cytoskeleton-associated Protein Glycine-rich (CAP-Gly) Domain*

Cytoskeleton-associated proteins (CAPs) are involved in the organization of microtubules and transportation of vesicles and organelles along the cytoskeletal network. A conserved motif, CAP-Gly, has been identified in a number of CAPs, including CLIP-170 and dynactins. The crystal structure of the CAP-Gly domain of Caenorhabditis elegans F53F4.3 protein, solved by single wavelength sulfur-anomalous phasing, revealed a novel protein fold containing three β-sheets. The most conserved sequence, GKNDG, is located in two consecutive sharp turns on the surface, forming the entrance to a groove. Residues in the groove are highly conserved as measured from the information content of the aligned sequences. The C-terminal tail of another molecule in the crystal is bound in this groove.

The cytoskeleton maintains the hierarchy of a eukaryotic cell to control the spatial distribution and the movement of vesicles, organelles, and large protein complexes. It mainly involves three types of protein filaments: microtubules, intermediate filaments, and actin filaments. There are extensive interactions between the filaments mediated by a large number of cytoskeleton-associated proteins (1,2). For instance, microtubules and actin filaments are cross-linked by microtubule-associated proteins such as myosin-kinesin complexes, myosin-CLIP-170 complexes, and dynein-dynactin complexes (1,3). These protein molecules or molecular assemblies have one end that binds to the microtubule and another end that binds to different vesicles or organelles. Movements related to nucleus positioning or mitotic spindle orientation are orchestrated between microtubules and actin filaments through dynamic interactions of different cytoskeleton-associated proteins.
The cytoskeleton is a dynamic network that disassembles and reassembles constantly according to functional requirements. Regulation of the dynamic assembly of the cytoskeleton and motor proteins involves other non-motor accessory proteins. The term "+TIPs" has recently been proposed to describe microtubule plus-end-tracking proteins (4). This class of proteins tends to localize to the plus end of microtubules, and these proteins are the most conserved microtubule-associated proteins. One common motif, named the glycine-rich cytoskeleton-associated protein (CAP-Gly) domain, was first discovered in restin (or CLIP-170), the prototype +TIP, and three other proteins (5). This domain is present in a Pfam family of 69 proteins (6). The amino acid sequence of the CAP-Gly domain is conserved from the simple single cell eukaryote Saccharomyces cerevisiae to humans. The functions of this family of proteins are mostly hypothetical. They are grouped together because of the existence of this CAP-Gly domain. Therefore, the CAP-Gly domain must have a signature role in this protein family. The CAP-Gly domain can appear once or multiple times in one protein, either in the N terminus or the C terminus. It can also be found in some motor-associated proteins, such as dynactin chain 1 (7) and Drosophila kinesin-73 (8). The two CAP-Gly domains in CLIP-170 located at the N terminus were shown to mediate the binding of the CLIP-170 protein to the microtubules (9). The C-terminal region of CLIP-170 carries the domains for organelle binding, suggesting that CLIP-170 is a microtubule-organelle linker protein (10). The CAP-Gly domains of CLIP-170 bind specifically to the growing (plus) end of the microtubule (11), whereas the opposite end of CLIP-170 mediates interactions with dynactin (12) or myosin (13). This characteristic feature might allow CLIP proteins to participate in the dynamic control of cargo transport along microtubules or between microtubules and actin filaments.
The genome of Caenorhabditis elegans has been completely sequenced, and several proteins were predicted to have a CAP-Gly domain based on homology. The genes from C. elegans have been selected as targets by the Southeast Collaboratory for Structural Genomics (14). Here we report the structure solution of the CAP-Gly domain of the C. elegans F53F4.3 protein with a novel fold, which provided an example for how high throughput x-ray crystallography might impact on our knowledge of new genes identified from genome sequencing.

EXPERIMENTAL PROCEDURES
Protein Preparation and Crystallization-Based on the annotation of the C. elegans genome performed by the C. elegans sequencing consortium (15), primers were designed for cloning all predicted cDNAs into an ENTRY vector in the GATEWAY cloning system (Invitrogen) (16). Our group then took these ENTRY vectors and systematically recombined them with an expression vector, pDEST 17.1, in a 96-well format. Recombinant protein expression from each vector was screened, and vectors that expressed soluble proteins in Escherichia coli were selected. F53F4.3 was one of the selected vectors that produced a high amount of soluble protein when protein expression in E. coli was induced at 18°C. This expression vector produced a protein that has a hexahistidine tag and an 8-amino acid peptide in front of the N terminus and an 8-amino acid peptide following the C terminus of the gene. The His tag was included for purification by a nickel affinity column, and the 8-amino acid peptides at each terminus were the result of the recombination reaction during cloning (16). After nickel affinity, fast protein liquid chromatography gel filtration (S-75, Amersham Biosciences), and ion exchange (Resource Q, Amersham Biosciences) chromatography purification, the purified protein was concentrated to about 10 mg/ml for crystal screening in a robotic system. Before any crystals were produced, it was noticed that the purified F53F4.3 protein had been cleaved into two halves in the solution stored at 4°C. The fragments were isolated from the SDS-PAGE gel and subjected to mass spectrometry analysis for their molecular weights and automated N-terminal amino acid sequencing. The C-terminal fragment corresponds to residues 121-229 and was subsequently cloned into a pET 28b vector (Novagen) for protein expression. The resulting protein with a hexahistidine tag, a thrombin cleavage site in front of the N terminus of the fragment, and no extra amino acids at the C terminus was purified by the same protocol and subjected to crystal screening. Large single crystals (~0.5 mm) were observed when the reservoir solution contained 1.8 M ammonium sulfate and 0.1 M MES buffer at pH 6.5. Removal of the His tag by thrombin cleavage did not appear to affect crystallization. Diffraction data were collected with crystals prepared using proteins without the His tag.
* The work was funded by a grant from NIGMS, National Institutes of Health (1P50-GM62407). The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
X-ray Diffraction Data-Data for phasing were collected to 2.5-Å resolution on a single crystal flash-cooled to 100 K at beamline ID-17 (IMCA-CAT), Advanced Photon Source (APS), Argonne National Laboratory, using a MarResearch 165-mm CCD detector and 1.74-Å x-rays following the protocol described in Liu et al. (17). HKL2000 (18) was used to determine a data collection strategy (99% completion), and data collection was divided into four passes. The first and third passes cover one anomalous diffraction wedge (86-146°), whereas the second and fourth passes cover the other anomalous diffraction wedge (266-326°). The oscillation angle was 1° per frame for all the passes. Data processing was carried out using HKL2000 (18). The processed data from different passes were then merged. Data collection and processing statistics are given in Table I.
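The geometry of the data collection described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original protocol: the function names are illustrative, the wedges are assumed to be covered without overlap at 1° per frame, and the constant hc is rounded to 12.398 keV·Å.

```python
# Rounded physical constant h*c in keV * Angstrom (assumption: 4 significant digits suffice).
HC_KEV_ANGSTROM = 12.398


def photon_energy_kev(wavelength_angstrom: float) -> float:
    """Convert an x-ray wavelength in Angstrom to a photon energy in keV."""
    return HC_KEV_ANGSTROM / wavelength_angstrom


def frames_in_wedge(start_deg: float, end_deg: float, oscillation_deg: float) -> int:
    """Number of frames needed to cover a rotation wedge without overlap."""
    return round((end_deg - start_deg) / oscillation_deg)


def opposite_wedge(start_deg: float, end_deg: float) -> tuple:
    """Wedge rotated by 180 degrees, where Friedel mates of the first wedge are recorded."""
    return ((start_deg + 180.0) % 360.0, (end_deg + 180.0) % 360.0)


if __name__ == "__main__":
    first = (86.0, 146.0)
    second = opposite_wedge(*first)
    per_pass = frames_in_wedge(*first, oscillation_deg=1.0)
    print("energy at 1.74 A (keV):", round(photon_energy_kev(1.74), 3))
    print("second wedge:", second)
    print("frames per pass:", per_pass, "total over four passes:", 4 * per_pass)
```

Under these assumptions the second wedge (266-326°) is the first wedge rotated by 180°, which is what one would expect from a strategy designed to record both members of each Bijvoet pair close together in time.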
Phase Determination-Low resolution data (3.5 Å) were used to locate the sulfur atom sites. A total of 1707 Bijvoet reflection pairs greater than 2σ, between 20.0 and 3.5-Å resolution (Bijvoet difference R-factor was 1.5%), were used to determine three sulfur atom positions (see Fig. 1A) with SOLVE V1.16 (19). These three sulfur sites could also be easily determined by the program SHELXD (20). PHASES (21) was then used to refine the sulfur positions and to locate one additional sulfur site by Bijvoet difference Fourier analysis. Based on the final refined protein structure, these heavy atom sites correspond to the sulfur atoms of two cysteines, the sulfur atom of one methionine, and a possible Cl⁻ ion in the solvent region. All heavy atom sites were treated as sulfur for the phase calculations.
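To give a feel for how weak the sulfur anomalous signal is, the expected Bijvoet ratio can be estimated with the widely used Hendrickson-Teeter approximation, ⟨|ΔF|⟩/⟨|F|⟩ ≈ √2 · √N_A · f″ / (√N_P · Z_eff). The sketch below is an illustration only and is not the calculation used by the authors. The value f″ ≈ 0.70 electrons for sulfur at 1.74 Å, the effective atomic number of 6.7, and the atom count of about 850 non-hydrogen atoms for a 109-residue fragment are all assumptions made for this example.

```python
import math


def expected_bijvoet_ratio(n_anomalous: int, f_double_prime: float,
                           n_protein_atoms: int, z_eff: float = 6.7) -> float:
    """Hendrickson-Teeter style estimate of <|dF|>/<|F|> for a single wavelength.

    n_anomalous     -- number of anomalous scatterers (e.g. ordered sulfur atoms)
    f_double_prime  -- imaginary scattering factor f'' in electrons (assumed value)
    n_protein_atoms -- approximate number of non-hydrogen protein atoms (assumed value)
    z_eff           -- effective atomic number of an average protein atom
    """
    numerator = math.sqrt(2.0) * math.sqrt(n_anomalous) * f_double_prime
    denominator = math.sqrt(n_protein_atoms) * z_eff
    return numerator / denominator


if __name__ == "__main__":
    # Assumed inputs: 3 sulfur sites, f'' ~ 0.70 e at 1.74 A, ~850 protein atoms.
    ratio = expected_bijvoet_ratio(3, 0.70, 850)
    print(f"expected Bijvoet ratio: {100 * ratio:.2f}%")
```

With these assumed inputs the estimate comes out at roughly 1%, the same order of magnitude as the Bijvoet difference R-factor quoted above, which is why highly redundant and accurate data are needed for this kind of phasing.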
The four sulfur sites were used to resolve the enantiomorphic space groups P6₁22 and P6₅22. The handedness test feature in ISAS2001 (22) was used at 20.0-3.5-Å resolution to identify P6₁22 as the correct space group because of the high figure-of-merit and low map-inversion R-value. The same four sulfur sites were used to estimate the protein phases at 3.0-Å resolution using iterative single wavelength anomalous signal (ISAS) (22). After three filters and two cycles of phase extension, the final average figure-of-merit and map-inversion R-value were 0.73 and 0.281, respectively. The original electron density map is shown in Fig. 1B, in which the polypeptide chain could be traced for about 65 residues. After a poly-Ala model was fitted into the density, the original phases and the calculated phases from the poly-Ala model were provided to the wARP program (23) with diffraction data between 50 and 1.8 Å collected on a Raxis IV detector mounted on a Rigaku RU300 generator. A new poly-Ala model with 95 residues was generated. Sequence assignment and refinement were carried out with the programs O (24) and XPLOR3.5 (25). The final model of 109 residues was refined against a data set to 1.77-Å resolution collected at APS (see Table I).
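The two candidate hands in the test above differ by an inversion of the heavy atom substructure through the origin, combined with a switch to the enantiomorphic space group. The sketch below illustrates only that bookkeeping step; it is not the ISAS2001 implementation, and the fractional coordinates in the example are hypothetical.

```python
# Enantiomorphic partner for each space group of the pair discussed in the text.
ENANTIOMORPH = {"P6122": "P6522", "P6522": "P6122"}


def invert_sites(sites):
    """Invert fractional coordinates through the origin and wrap them into [0, 1)."""
    return [tuple((-c) % 1.0 for c in xyz) for xyz in sites]


def other_hand(space_group, sites):
    """Return the enantiomorphic space group together with the inverted sites."""
    return ENANTIOMORPH[space_group], invert_sites(sites)


if __name__ == "__main__":
    # Hypothetical fractional coordinates, for illustration only.
    trial_sites = [(0.25, 0.5, 0.125)]
    print(other_hand("P6122", trial_sites))
```

In practice each hand is carried through phasing separately, and the one giving the better statistics (here, a higher figure-of-merit and a lower map-inversion R-value) is kept.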

RESULTS
Structure Determination-Crystals of the wild type recombinant F53F4.3 protein (residues 121-229) were grown in conditions found in a robotic screen using 96-well sitting drop plates. The optimal reservoir solution consisted of 1.8 M ammonium sulfate and 0.1 M MES buffer at pH 6.5. The protein drop contained 5 μl of the reservoir solution and 5 μl of the protein solution at 10 mg/ml concentration in 20 mM HEPES buffer, pH 7.4. Crystals of 0.4 × 0.4 × 0.3 mm in size appeared in the drops in 4 days at room temperature. The space group was determined to be P6₁22 with a = b = 64.16 Å, c = 101.95 Å from x-ray diffraction data collected on a Raxis-IV image plate detector (MSC). To obtain phases for solving the crystal structure, the ISAS method was employed (22). It has been argued that the anomalous signal from the sulfur atoms present in the wild type proteins would be sufficient for phase determination using only single wavelength x-ray data (17). To collect x-ray diffraction data with the high accuracy required for the ISAS method, crystals of F53F4.3 were cryo-cooled to 100 K, and x-ray data were collected at a wavelength of 1.74 Å to increase the anomalous signal from sulfur atoms while minimizing crystal damage. After screening a number of crystals, a data set to 2.5-Å resolution was completed from the best crystal. This moderate resolution data set was collected for phasing only. Three sulfur atom positions could be located by the SOLVE program using SAS data. The Patterson peaks corresponding to the sulfur sites were clearly visible (Fig. 1A). A fourth site was found in further refinement. Phases derived from these anomalous sites were refined and extended by solvent flattening (22). An electron density map was calculated, and polypeptide chain tracing and some side chain identification were possible (Fig. 1B).
A complete structure model corresponding to residues 135-229 was built in steps and refined to 1.77 Å using a high resolution data set and the X-plor program (26). The structure refinement statistics are summarized in Table I.
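The unit cell parameters reported above allow a quick plausibility check of the crystal packing through the Matthews coefficient. The following is a minimal sketch under stated assumptions: one molecule per asymmetric unit, 12 asymmetric units per cell for P6₁22, and a molecular mass estimated as 109 residues times an average of 110 Da per residue. None of these derived numbers are taken from the paper.

```python
import math


def hexagonal_cell_volume(a: float, c: float) -> float:
    """Volume in cubic Angstrom of a hexagonal cell (a = b, gamma = 120 degrees)."""
    return a * a * c * math.sin(math.radians(120.0))


def matthews_coefficient(cell_volume: float, n_asymmetric_units: int,
                         molecular_mass_da: float, molecules_per_asu: int = 1) -> float:
    """Matthews coefficient V_M in cubic Angstrom per dalton."""
    return cell_volume / (n_asymmetric_units * molecules_per_asu * molecular_mass_da)


def solvent_fraction(v_m: float) -> float:
    """Approximate solvent fraction using the common 1.23 / V_M rule of thumb."""
    return 1.0 - 1.23 / v_m


if __name__ == "__main__":
    volume = hexagonal_cell_volume(64.16, 101.95)
    mass = 109 * 110.0  # assumed average residue mass of 110 Da
    v_m = matthews_coefficient(volume, 12, mass)
    print(f"cell volume: {volume:.0f} A^3")
    print(f"V_M: {v_m:.2f} A^3/Da, solvent: {100 * solvent_fraction(v_m):.0f}%")
```

Under these assumptions V_M comes out near 2.5 Å³/Da with about half of the cell occupied by solvent, values that are typical for protein crystals and compatible with a single molecule in the asymmetric unit.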
Overall Structure-The C-terminal domain of F53F4.3 is roughly spherical with a three-layer β/β structure (Fig. 2). The ordered part of the N terminus has a 9-residue α-helix (Asp-135 to Lys-144) with relatively higher average temperature factors. This α-helix was preceded by 17 disordered residues that do not make any stable contacts with the rest of this domain and could be available for intermolecular interactions. The three-layer β/β structure contains seven antiparallel β-strands. The first layer consists of β1, β2a, and β7. The second layer starts with the extension of β-strand 2 (β2b) and continues with β3 and β6. The last layer contains β4 and β5. The topology of the CAP-Gly domain could be defined as three antiparallel β-sheets of 3-3-2 strands. The two three-stranded sheets are L-shaped to each other with the last strand of the first sheet extended continuously to the first strand of the second sheet. The two-stranded sheet is on top of the second three-stranded sheet. The C-terminal 6 residues (Leu-224 to Ile-229) protrude out of the globular domain. Considering the disordered residues at the N terminus prior to the α-helix, the CAP-Gly domain appears to be a rigidly packed motif in the middle of two extended polypeptides. The structure of this domain is likely to represent a new protein fold. A search for three-dimensional homology (27) found no similar structures from the non-redundant subset of the Protein Data Bank.
Conserved Features-The sequence of the F53F4.3 domain was provided to the EBI server (www.ebi.ac.uk/fasta33) to search the Swiss Protein Database using the FASTA alignment method (28). The 58 aligned sequences, ranging from 55 to 22% identity, corresponded to the CAP-Gly motif (5,29). The Shannon information content (30) is used as the measure of sequence conservation (Fig. 3A). This rigorous statistical procedure compares each position in the given alignment with the probability of randomly distributed amino acids weighted by the observed distribution of each amino acid in the C. elegans genome. A variety of structural alphabets may be employed in this analysis, for example, grouping the amino acids by charge or chemical class. The data presented here treat each of the 20 amino acids independently. A portion of the alignment and its annotation are shown in Fig. 3B.
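The conservation measure described above can be illustrated with a short calculation of per-column information content. This is a sketch, not the authors' program: it assumes the relative-entropy form I = Σ p_i log2(p_i / q_i), where p_i is the observed frequency of amino acid i in a column and q_i is a background frequency. A uniform background is used by default as a stand-in for the C. elegans genome composition, and the three short sequences in the example are toy data, not taken from the 58-sequence alignment.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_information(column, background=None):
    """Information content in bits of one alignment column relative to a background."""
    residues = [r for r in column if r in AMINO_ACIDS]  # gaps and unknowns are skipped
    if not residues:
        return 0.0
    if background is None:
        # Assumption: uniform background; a genome-derived distribution can be passed in.
        background = {aa: 1 / 20 for aa in AMINO_ACIDS}
    counts = Counter(residues)
    total = len(residues)
    info = 0.0
    for aa, n in counts.items():
        p = n / total
        info += p * math.log2(p / background[aa])
    return info


def alignment_information(sequences, background=None):
    """Per-column information content for equal-length aligned sequences."""
    return [column_information(col, background) for col in zip(*sequences)]


if __name__ == "__main__":
    toy_alignment = ["GKNDG", "GKHDG", "GKNSG"]  # toy sequences for illustration
    for position, bits in enumerate(alignment_information(toy_alignment), start=1):
        print(position, round(bits, 2))
```

With a uniform background an invariant column reaches the maximum of log2(20), about 4.32 bits, and a column in which all 20 amino acids are equally frequent scores 0 bits. Grouping amino acids into a reduced alphabet, as mentioned in the text, would only change the symbol set and the background distribution.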
The longest stretch of highly conserved sequence corresponds to Gly-189 to Gly-193, which is located at the surface of the domain between β3 and β4. This 5-residue GKNDG segment is conserved in most homologous sequences except in a few sequences where the asparagine is substituted by a histidine or the aspartic acid is substituted by a serine. A loop structure opposite to this conserved stretch, Val-156 to Met-160, has higher average temperature factors, indicating that this part of the structure may be flexible. This is a highly variable region among the homologous sequences, which could present a characteristic recognition site for other protein interactions if the conserved stretch is to associate with the cytoskeleton. There is a conserved aromatic cluster including Tyr-168, Phe-174, Trp-179, Tyr-200, Phe-201, and Phe-210. The hydrophobic cluster may represent a stabilizing core structure. There is also a significant groove adjacent to the conserved [...]. The C-terminal tail is packed into the groove of a symmetry-related molecule. The side chain of residue Glu-228 forms a buried salt bridge with Arg-162 and a hydrogen bond with residue Tyr-184. Residue 162 or its neighbor is usually basic according to the sequence alignment. Hydrogen bonds are also formed between the main chain atoms of residue Ile-229 and the symmetry-related molecule. Lys-190 is also close to the acidic C terminus.
The CAP-Gly domains contain a large number of conserved glycine residues. Two of them, Gly-193 and Gly-197, are involved in the formation of two consecutive sharp turns with the second one as a type II β-turn. This unique double-turn motif exposes the most conserved region (GKNDG) in the CAP-Gly domain. The residue after the GKNDG sequence is often a serine/threonine, which forms a hydrogen bond with the side chain of residue Asp-192. Other conserved glycine residues include: Gly-170 and Gly-177, which are involved in bending β2b and β3, making the loop connecting β2b and β3 perpendicular; Gly-189, which is in the middle of an extended loop containing the most conserved stretch; Gly-181, which is packed closely to the conserved Phe-210 (3.5 Å); and Gly-208, which is packed closely to Phe-201 (4 Å). These conserved glycine residues may be required for maintaining the folding of the three-layer sheet.

Fig. 3 (legend, partially recovered). The surface plot adjacent to the ribbon is rotated ~90 degrees about the y-axis, and the next surface is rotated 180 degrees more. A, temperature factors. The color coding for the refined atomic B-factors is: yellow, 36; orange, 42; red, 48. In A, the calculation is based on the 58-protein sequence alignment as described under "Results" (also see panel B). The color coding for the information content in bits is: cyan, 2.75; light blue, 3.25; deep blue, 3.75. B, sequence alignment to the CAP-Gly domain of C. elegans F53F4.3. The amino acid sequences of 20 representative members of the 58 proteins are aligned as described under "Results." The 20 chosen had identities >30% and the most descriptive annotation, which is displayed at the end of each sequence. The observed secondary structure of F53F4.3 is displayed as a schematic above the sequences. The blue rectangles along the secondary structure denote the information content of the entire 58-protein alignment, as in panel A. The residue numbering above the sequences is for C. elegans F53F4.3. Charged residues (D, E, H, K, R) are in red.

DISCUSSION
The crystal structure of a widely distributed CAP-Gly domain is reported at high resolution. This domain was isolated from the C. elegans F53F4.3 protein when expressed in E. coli. Based on previous sequence alignment, the CAP-Gly domain was proposed to contain 52 amino acids (Pfam: PF01302) (5). These 52 amino acids constitute the conserved core region of this domain. However, the crystal structure revealed that this domain consists of 84 amino acids in this protein. One α-helix and three β-sheets form a novel protein fold not observed previously. Two conserved regions are identified on the surface based on information content analysis of the amino acid sequences from this protein family. A surface loop containing residues 189-193 (GKNDG) could present itself for interaction with the microtubule. The second feature is a groove that was occupied in our crystal structure by the C terminus of a neighboring molecule. A helix at the N terminus is loosely formed in this crystal structure and is located upstream from the CAP-Gly domain of C. elegans F53F4.3. The function of F53F4.3 is unknown in C. elegans.
The CAP-Gly domain was first identified by comparing CLIP-170 (or restin), a filament-associated protein abundant in the tumoral cells characteristic for Hodgkin's disease (31), with other proteins known to be cytoskeleton-associated proteins (5).
Recently, this motif of three repeats was also found in the familial cylindromatosis tumor suppressor gene CYLD (32). Most proteins contain only one CAP-Gly motif, whereas some may have two or three. The wide distribution of the CAP-Gly motif in cytoskeleton-associated proteins suggests that this domain could be a common adhesive domain for attachment to microtubules. For instance, the cytoplasmic linker protein CLIP-170 was shown to be required for the binding of endocytic vesicles to microtubules in vitro and to be colocalized with endocytic organelles in vivo (9). There are two CAP-Gly domains in CLIP-170. When these two domains were deleted from CLIP-170, its binding to microtubules was diminished. The purified CLIP-170 forms an elongated dimer of a central coiled-coil structure with its N-terminal domain binding to microtubules (10).
There are three potential structural regions that could interact with the cytoskeleton. First, the conserved patch identified on the surface is central in the CAP-Gly domain, which could offer a point of contact with microtubules. The specially arranged glycine residues render an unusual structural feature protruding out on the protein surface. The highly conserved residues in this region could readily be available to fit into a receptive region on microtubules. Second, a groove that holds the C terminus of the neighboring molecule might also be a candidate for binding the filamentous proteins. In this structure, the ordered residues extend to the last residue of the C terminus. In other CAP-Gly-containing proteins, this domain is usually located in the middle of a long polypeptide chain. The C terminus could hypothetically represent the binding peptide from the cytoskeleton if such a peptide is required for binding. Third, the N terminus of the CAP-Gly domain might be involved in interactions with other cytoskeleton-associated proteins (5). F53F4.3 does not form a dimer in vitro. However, the helix region of the CAP-Gly domain in other CAP-Gly proteins could be more extended and may contain residues that induce coiled-coil interactions. Generally speaking, the coiled-coil interaction of the CAP-Gly domain, if present, is likely to be with other cytoskeleton accessory proteins or to be involved in dimerization as in CLIP-170 (10). The helix region has a very high B-factor in this structure and is preceded by a long disordered region. It is possible that a helix would only be stable in the formation of a multiplex protein structure. The precise pattern of the CAP-Gly domain interactions with the cytoskeleton remains to be delineated. This new structure, however, could provide the platform for mapping these critical interactions.
The CAP-Gly domain could be viewed as a microtubule association module in cytoskeleton-associated proteins that may contain one, two, or three such domains with modularly increased affinity. The adhesive property of the CAP-Gly domain to microtubules is in another sense analogous to that of some DNA binding domains such as zinc fingers (33). Both microtubules and DNA are biopolymers. The CAP-Gly domain or zinc fingers are protein modular units that are used for association to the polymer in a consecutive manner depending on the affinity requirement. The specificity for the binding location could be provided by sequence variations of the nonconserved regions or by interactions with other regions. The CAP-Gly domain has also been found in dynein/dynactin (p150Glued) complexes. The conventional view is that the motor proteins could get on and off the microtubule by alternating motor domain binding to walk along the filament. The function of the CAP-Gly domain in this mobile complex may be to act as a tether to "glue" the complex on the microtubule. This would keep the traveling complex on the right track without totally falling off the microtubule.
We demonstrated here that protein crystal structures could be determined in a high throughput (HTP) manner using the naturally present sulfur sites as anomalous scatterers provided that high quality crystals could be grown. The prerequisite for growing high quality protein crystals is the production of highly purified, rigid, globular protein molecules. It seems inevitable that subcloning of digested fragments from soluble protein preparations will frequently be employed to identify the portion of a gene product that would ultimately be crystallized. Sequence analysis of protein homology may not predict the correct structural domain that forms crystals. In this case, the predicted domain was much smaller than the one that was suitable for crystallization. When such a strategy is implemented, novel protein structure folds that are correlated to significant biological functions might be solved without prior sequence-based target selection. Since each protein family classified by sequence homology contains many members, it is difficult to predict which representative member would yield useful crystals for structure determination. Shotgun screening in an HTP manner is an alternative approach to complement other target selection strategies.