Localization of a New Prostate-specific Antigen-related Serine Protease Gene, KLK4, Is Evidence for an Expanded Human Kallikrein Gene Family Cluster on Chromosome 19q13.3–13.4

The human tissue kallikrein (KLK) family of serine proteases, which is important in post-translational processing events, currently consists of just three genes, tissue kallikrein (KLK1), KLK2, and prostate-specific antigen (PSA) (KLK3), clustered at chromosome 19q13.3–13.4. We identified an expressed sequence tag from an endometrial carcinoma cDNA library with 50% identity to the three known KLK genes. Primers designed to putative exon 2 and exon 3 regions from this novel kallikrein-related sequence were used to screen, by polymerase chain reaction, five cosmids spanning 130 kb around the KLK locus on chromosome 19. This new gene, which we have named KLK4, is 25 kb downstream of the KLK2 gene and follows a region that includes two other putative KLK-like gene fragments. KLK4 spans 5.2 kb, has an identical genomic structure (five exons and four introns) to the other KLK genes, and is transcribed on the reverse strand, in the same direction as KLK1 but opposite to that of KLK2 and KLK3. It encodes a 254-amino acid prepro-serine protease that is most similar (78% identical) to pig enamel matrix serine protease but is also 37% identical to PSA. These data suggest that the human kallikrein gene family locus on chromosome 19 is larger than previously thought and also indicate a greater sequence divergence within this family compared with the highly conserved rodent kallikrein genes.

The tissue or glandular kallikrein (KLK) genes are members of a highly conserved multigene family encoding serine proteases that are involved in the post-translational modification of many polypeptides and are central to many biological processes (reviewed in Refs. 1 and 2). The human KLK gene family has previously been thought to contain only three members: KLK1, encoding true tissue kallikrein (3-5); KLK2 (6); and KLK3, encoding prostate-specific antigen (PSA) (7-9). The large size of this gene family in other species such as rat and mouse, where kallikrein-like proteins are encoded by 13-24 genes (10-13), and the recent identification of a fragment of a putative human kallikrein gene adjacent to the KLK2 gene on a human genomic cosmid (14) suggested that the human KLK gene family may be larger than previously thought. Several early estimates of the size of this family, using Southern analysis, predicted just 3-4 genes (3,5,6,15). However, heterologous probing of a human genomic Southern blot with a monkey kallikrein cDNA suggested that it may contain as many as 19 different genes (16). Determination of the true size of the human gene family is important to our understanding of the contribution of the kallikreins to human biology and their roles as peptide processing enzymes.
The three best characterized KLK families (human, rat, and mouse) are all localized within gene clusters on single chromosomes. The human family currently spans 60 kb on chromosome 19q13.3-13.4 (17). The rat family, although in two clusters, is tightly linked at one locus spanning 480 kb on chromosome 1 (11,18,19). Similarly, the mouse kallikrein family has been localized to a single 310-kb locus on chromosome 7, a position that is analogous to the human KLK locus on chromosome 19 (4,12,20). Overall, these findings suggest that if there are more members of the human kallikrein gene family, they should be localized in close proximity to the current KLK locus on chromosome 19q13.3-13.4.
All kallikrein genes described to date are 5-7 kb in size and are structurally identical in the genomic arrangement of the five exons and four introns (2,4,6,8,10,12,13). The considerable homology shared by members of these gene families, both within and between species, suggests that they are derived from gene duplications, exon shuffling, and subsequent divergence from a common ancestor (19,21,22). Whereas the rat and murine family members are highly homologous (74-89% protein similarity), the three fully characterized members of the human family have lower homology (52-78% protein similarity), suggesting that the human KLK gene family has diverged further than in the rodent (1,3-10,12,19). In support of this notion, three novel serine proteases, namely protease M (23,24), normal epithelial cell-specific 1 (NES1) (25,26), and prostase (27), each with just 23-44% similarity at the protein level to PSA or KLK1, have been identified recently and reported to lie on chromosome 19q13.3-13.4. However, their precise localization in relation to the KLK locus is not known.
Thus, further characterization of the human KLK locus on chromosome 19 will determine whether these novel kallikrein-related proteases are part of the human KLK family and also may identify other members of this important family of serine proteases. We are currently mapping and sequencing several cosmids and BACs that span over 430 kb surrounding the known KLK locus. A concurrent search of the expressed sequence tag (EST) database for novel kallikrein-related sequences identified an EST sequence (EST 40886; accession number AA336074) from an endometrial carcinoma cDNA library that was 50% similar at the nucleotide level to the exon 2-exon 3 regions of the three characterized human KLK genes. We now report the cloning and mapping of the gene encoding this kallikrein-related cDNA to the KLK locus. Although it has also recently been described as a prostate-specific serine protease, prostase (27), we have called this endometrial- and prostate-expressed gene KLK4, given its close relationship to the other previously described KLK genes.

EXPERIMENTAL PROCEDURES
Cosmid DNA Clones-Five cosmid clones (F24242, R28297, F18306, F22702, R28781) that represent a region previously positive on genomic Southern blots with the pHRK-1 KLK1 probe (4) were used for the following studies. These cosmids span ~130 kb near the q13.3-13.4 band junction on the metric physical map of human chromosome 19. Using the Qiagen Midiprep kit, DNA was isolated from these cosmids transformed in Escherichia coli DH5α cells and grown in a 500-ml culture of Luria-Bertani medium supplemented with 15 µg/ml chloramphenicol.
Genomic PCR-Primers were designed from the putative exon 2 and exon 3 sequences of the endometrial carcinoma EST 40886 to amplify this region, including intron 2, from the five cosmids as potential positive templates. PCR amplification was carried out in an Omnigene thermocycler with the following PCR conditions: 4 mM MgCl2, 200 µM each deoxyribonucleoside triphosphate, 10 ng of each primer, and 1.3 units of Taq polymerase in 1× PCR buffer (Roche Biochemical). Cycling conditions included an initial denaturation at 94°C for 5 min, followed by 30 cycles of 94°C for 30 s, 58°C for 30 s, and 72°C for 30 s, with a final extension of 7 min at 72°C. PCR products were separated by electrophoresis through 1% agarose gels and isolated using the Qiagen Gel Extraction kit.
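The cycling program above can be written down as data, which makes it easy to check the total programmed hold time. The following is only a minimal sketch: the `Step` class and `hold_time_seconds` function are hypothetical names introduced for illustration, and ramp times between temperatures (instrument dependent, not stated here) are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One temperature hold in a thermocycler program (illustrative type)."""
    name: str
    temp_c: int
    seconds: int


# Values taken from the cycling conditions described in the text.
INITIAL = Step("initial denaturation", 94, 5 * 60)
CYCLE = [
    Step("denaturation", 94, 30),
    Step("annealing", 58, 30),
    Step("extension", 72, 30),
]
N_CYCLES = 30
FINAL = Step("final extension", 72, 7 * 60)


def hold_time_seconds(initial, cycle, n_cycles, final):
    """Sum of programmed hold times; ramping between steps is not included."""
    per_cycle = sum(step.seconds for step in cycle)
    return initial.seconds + n_cycles * per_cycle + final.seconds


total = hold_time_seconds(INITIAL, CYCLE, N_CYCLES, FINAL)
print(f"programmed hold time: {total / 60:.0f} min")
```

Under these assumptions the program holds for 57 min in total; the actual run time on a given instrument would be longer because of ramping.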
Subcloning and Sequencing of the Genomic KLK4 Clone-The KLK4 genomic PCR fragment from cosmid F22702 was subcloned using the Promega pGEM-T Easy™ cloning kit and sequenced fully with the ABI PRISM™ Dye Terminator Cycle Sequencing Ready Reaction Kit (Perkin-Elmer) using 500 ng of plasmid and 3.2 pmol of primer. Extension products were purified by ethanol precipitation and analyzed using an automated Applied Biosystems 373A DNA Sequencer. Further sequence was obtained by primer walking along F22702. Sequence alignments were performed using the Align program at the Genestream network server, IGH, France.
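The identity percentages quoted in this paper derive from pairwise alignments. As a rough illustration only (this is not the Align program, whose scoring parameters are not described here), percent identity over an already aligned pair of sequences could be computed as below. The function name, the choice to exclude gapped columns from the denominator, and the toy sequences are all assumptions made for the sketch; other definitions of identity use different denominators.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # gapped columns are skipped (an assumption of this sketch)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy aligned sequences, invented for illustration; not KLK sequence.
toy_a = "ATGGC-ACTT"
toy_b = "ATGACGAC-T"
print(percent_identity(toy_a, toy_b))
```

In the toy pair, eight columns are ungapped and seven of them match.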

RESULTS
Amplification of EST 40886 Gene Sequence-PCR analysis of the five cosmids spanning the KLK locus, with EST 40886-specific primers designed to putative exon 2 and 3 regions, identified a band of ~480 bp from the F22702 and R28781 cosmids (data not shown). Comparison of the intron sizes of other kallikreins predicted that, if this gene were in fact a kallikrein, this fragment could be up to 1.5 kb in size. Although the fragment was smaller than predicted, DNA sequencing identified it as the expected product, and analysis of the sequence indicated that the intron was in the correct position. Sequence comparison of this intron with PSA intron 2 suggested that the difference in size was due to the absence of Alu and MER repeat units within the novel genomic sequence (data not shown).
KLK4 Gene Characterization-The sequence of the KLK4 gene, comprising the coding region and part of the 5'- and 3'-noncoding sequence, was obtained by primer walking along the F22702 cosmid and is shown in Fig. 1. Comparison with the amino acid sequence, deduced from a cDNA isolated in parallel from a prostate cDNA library (data not shown), indicated that the KLK4 gene consists of five exons and four introns, although the presence of a further upstream exon cannot be ruled out. All of the intron/exon junctions conformed to the consensus sequence for eukaryotic splice sites and were in comparable positions to those of the other human KLK gene family members (Figs. 1 and 2). In addition, the exonic positions of the serine protease catalytic triad residues (His-71, Asp-116, Ser-207), the signal peptide, and the putative cleavage site of the mature enzyme are common to all four KLK genes, although KLK4 is missing a "kallikrein loop" region (Figs. 1 and 2). KLK4 encodes a 254-amino acid prepro-serine protease that is 78% identical to pig enamel matrix serine protease and ~37% identical to PSA, tissue kallikrein, and the K2 enzyme. The amino acid sequence was identical to the serine protease sequence recently reported as prostase (27) except for a Gln (Q) to His (H) change at residue 197 that results from a single nucleotide change in the codon (CAA to CAC). With the addition of the 3'-untranslated region from the prostase cDNA sequence (27) (GenBank™ accession number AF113140), the full size of the gene is estimated at ~5 kb, again like that reported for the KLK1-3 genes (4,6,8).
Sequence Analysis of the 5'-flanking Region-By comparison of KLK4 with PSA, the sequence upstream of the KLK4 coding region can be presumed to include both the 5'-untranslated region of exon 1 and potential promoter sequence. This sequence lacks any obvious TATA box, CAAT box, or GC-rich regions, but the sequence at -35 to -21 from the ATG start site (ctaagtctcagtcc) has two overlapping regions with similarity to the initiator sequence (ctcantct) described for a TATA-less promoter. The initiator is often located from -3 to +5 relative to the transcription initiation site (28), which might therefore be the A nucleotide in these sequences. Although promoter elements for PSA are located up to 6 kb from the start of the gene (29), analysis of the 553 bp of 5'-flanking region of KLK4 presented in Fig. 1 indicates the presence of sequences that show partial homology to known androgen regulatory elements (AREs) in the proximal PSA and KLK2 promoters (30,31). Characterization of the full promoter sequence of this gene is required to confirm whether these are indeed functional elements.
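The comparison with the initiator consensus can be illustrated with a simple sliding-window scan. This is a sketch, not the procedure used for the analysis above: the function name `scan_motif` and the one-mismatch threshold are assumptions, "n" in the consensus is treated as matching any base, and the reported offsets are 0-based positions within the quoted 14-nucleotide string rather than genomic coordinates.

```python
def scan_motif(sequence: str, consensus: str, max_mismatches: int = 1):
    """Return (offset, window, mismatches) for windows close to the consensus.

    'n' in the consensus matches any base and is never counted as a mismatch.
    """
    sequence = sequence.lower()
    consensus = consensus.lower()
    width = len(consensus)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(
            1 for base, ref in zip(window, consensus)
            if ref != "n" and base != ref
        )
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


upstream = "ctaagtctcagtcc"   # sequence quoted in the text
initiator = "ctcantct"        # initiator consensus quoted in the text

hits = scan_motif(upstream, initiator)
for start, window, mismatches in hits:
    print(start, window, mismatches)
```

With this threshold the scan reports two windows, at offsets 0 (ctaagtct) and 6 (ctcagtcc), each differing from the consensus at one position. The two windows overlap by two nucleotides, and each has an A at the consensus "a" position.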
KLK Gene Linkage-Five cosmids that span the 170 kb of chromosome 19q13.3-13.4 that incorporates the 60-kb kallikrein locus were digested with EcoRI for Southern analysis. Hybridization with the KLK4 cDNA sequence identified two bands in cosmid F22702 of 4.3 and 0.5 kb and two bands of 9.26 and 0.5 kb from the adjacent R28781 cosmid (Fig. 3). The fainter hybridizing bands in lanes 1-3 are cross-hybridization of the KLK1 and KLK3 genes located on these cosmids. The EcoRI restriction fragments on each of the five cosmids that were positive with the KLK1, KLK2, and KLK3 probes and two additional but small KLK-like fragments are indicated in Fig. 4. Comparison of the Southern analysis with the EcoRI restriction map of the cosmid clones indicated that the KLK4 gene is located 25 kb downstream of KLK2 (Fig. 4). Restriction enzyme analysis of the KLK4 gene sequence obtained from the F22702 cosmid identified an EcoRI site in the fifth exon. This indicates that the KLK4 gene is transcribed on the reverse strand, in the same direction as KLK1 but in the opposite direction to KLK2 and KLK3. A comparison of these results with the genomic sequence that is now available for other cosmids in this region, F25479 and R33359, confirms both the distance between all four KLK genes and the direction of transcription.

DISCUSSION
In the data presented here, we have precisely defined, for the first time, the locations of the three previously known KLK genes in the kallikrein gene locus on chromosome 19q13.3-13.4. In addition, we have identified, within the same locus, a fourth member of this family of serine proteases that we have designated KLK4. Several criteria, which are particularly pertinent to the rodent KLK/Klk families, have been used previously to determine inclusion of a gene in this family. These are a high conservation of sequence with the tissue kallikrein (KLK1) gene, a similar genomic organization and size, and localization within a cluster of related genes on the one chromosome in that species. As noted below, the KLK4 gene fits all these criteria, although in terms of sequence it would appear to have diverged further.
The KLK4 gene consists of five exons and four introns spanning [...] (2, 4, 6, 8, 10, 12, and 19). The exon-intron organization of serine protease genes is well documented (32), with these genes being separated into five different groups based on intron position and position of the amino acid residues (His-71, Asp-116, Ser-207, chymotrypsin numbering) that make up the catalytic triad. Several factors clearly indicate that KLK4 belongs to the group that includes all the human KLK family, as well as NES1 and trypsinogen. First, the region encoding the signal peptide is in the first coding exon. Second, the catalytic residues His-71 and Ser-207 are separated in different exons and almost adjacent to the exon boundaries. Third, there is only one intron between the exons coding for the His-71 and Asp-116 catalytic residues. Finally, the cleavage site in the zymogen or proenzyme that leads to activation of the mature enzyme is localized in the same exon as the His-71 residue.
Because the gene structure and coding sequence in all serine proteases, including members of the trypsin family and the kallikreins, are very similar, these genes are thought to have derived from a common ancestor (19,21,22,32). During evolution, some members of this ancestral gene family have split to different chromosomes, e.g. trypsin to chromosome 7 and tissue plasminogen activator to chromosome 8, but KLK4 and the rest of the KLK gene family remain close to each other on chromosome 19, suggesting that they constitute a sub-family of these related serine proteases. Three Alu repeated sequences are associated with the KLK2 gene: one in the second intron and two approximately 0.4 and 1.2 kb upstream (33). Alu and MER repeats are also found in the second intron of PSA and KLK1 in a very similar gene organization (4,8), but the KLK4 intron 2 lacks these repeat sequences. This suggests that KLK4 may have separated from the ancestral serine protease gene before the formation of the rest of the kallikrein genes, and it could possibly be a precursor to these other genes. Full characterization of all KLK genes and other serine proteases in this locus will further clarify their relationships and the ultimate size of this sub-family of serine proteases.
The human KLK gene family is clustered in a 60-kb region on chromosome 19q13.3-q13.4 in the order centromere-KLK1-KLK3-KLK2-telomere (17). From the chromosomal sequence that is now available for this region, we have confirmed that the distance from KLK1 to KLK3 is at least 26 kb and that from KLK3 to KLK2 is 12 kb. We have mapped KLK4 25 kb downstream of KLK2 and have determined that it is transcribed in the same direction as KLK1 and in the opposite direction to PSA and KLK2. Two KLK-like fragments (KLK frag #1, KLK frag #2), also identified (14) between KLK2 and KLK4, do not appear to date to be part of full-length genes and may constitute pseudogenes. Two other serine protease genes have also been mapped close to the human KLK gene family. Fluorescence in situ hybridization, somatic cell, and radiation hybrid mapping have localized the recently identified NES1 gene to chromosome 19q13.3 (26), although its precise distance from the current KLK genes is not known. A second serine protease gene, protease M (also known as zyme and neurosin) (23,24), was also localized to this same 19q13.3-13.4 region by fluorescence in situ hybridization (23). These authors suggested that this gene is also close to the KLK locus and proposed that the order of these genes is centromere-protease M-NES1-KLK1-PSA-KLK2-telomere. Although the information gained from radiation hybrid and somatic cell mapping and fluorescence in situ hybridization is less precise than the restriction mapping and sequencing approach that we have employed, it is likely that these two genes (NES1 and protease M) are still part of a larger KLK cluster in this locus, akin to that seen in the rodent families (11,12). We therefore propose that the order of the expanded kallikrein locus on chromosome 19 is centromere-KLK1-PSA-KLK2-KLK frag #1-KLK frag #2-KLK4-telomere, with the inclusion of protease M and NES1, although the precise order and distance of protease M and NES1 from the other KLK genes are yet to be defined.
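The order, spacing, and orientation of the four full-length genes described above can be summarized in a small data structure. The sketch below is illustrative only and rests on several assumptions: the `Gene` class and helper names are hypothetical; "-" is used to label the strand on which KLK4 and KLK1 are transcribed and "+" the opposite strand; the reported distances are treated as consecutive offsets with gene lengths ignored; and the "at least 26 kb" figure is entered as 26, so cumulative values are rough lower bounds. The two KLK-like fragments, protease M, and NES1 are omitted because their positions are not given.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass(frozen=True)
class Gene:
    """Illustrative record for one gene in the locus."""
    name: str
    strand: str               # "+" or "-" (labelling convention of this sketch)
    gap_from_prev_kb: float   # reported distance from the preceding gene


# Centromere-to-telomere order as described in the text.
LOCUS = [
    Gene("KLK1", "-", 0),
    Gene("KLK3", "+", 26),  # "at least 26 kb", entered as a lower bound
    Gene("KLK2", "+", 12),
    Gene("KLK4", "-", 25),
]


def cumulative_offsets(genes):
    """Approximate offset of each gene from the first gene, in kb."""
    offsets = accumulate(g.gap_from_prev_kb for g in genes)
    return {g.name: off for g, off in zip(genes, offsets)}


def same_direction(genes, name):
    """Names of the other genes transcribed in the same direction as `name`."""
    strand = next(g.strand for g in genes if g.name == name)
    return [g.name for g in genes if g.strand == strand and g.name != name]


offsets = cumulative_offsets(LOCUS)
print(offsets)
print(same_direction(LOCUS, "KLK4"))
```

Under these assumptions KLK4 lies at least 63 kb from KLK1, and KLK1 is the only other gene in the list transcribed in the same direction as KLK4.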
The human KLK gene family is clearly less conserved than its rodent counterparts, whose enzymes share 75-85% identity (10,12,19). The encoded proteins of PSA and KLK2 are highly conserved (78% identity); however, they are less similar to KLK1 (52-60% identity) (3-9). KLK4 would appear to have diverged further, being only 37% similar to PSA, KLK2, and KLK1 and no longer containing a "kallikrein loop," a region of 11 amino acids encoded in exon 3 and thought to be important in the substrate specificity of these enzymes (1,2,7). However, the conservation is greater at the nucleotide level (50%), as noted previously for KLK1-3 (74-85%, Ref. 9). Of note, the encoded protein of KLK4 is most similar to a pig enamel matrix protease (78% identity), indicating that it is probably the human homologue of that enzyme. Moreover, if NES1 and protease M are also part of the human KLK gene family, along with KLK4, they may constitute a sub-group within this family that has diverged further in sequence from its namesake, tissue kallikrein or KLK1.
The KLK4 gene was also recently cloned by Nelson et al. (27) but given the name prostase because it was suggested to be expressed specifically in the prostate. From our earlier identification of an EST KLK4 sequence that was derived from an endometrial carcinoma cDNA library and evidence that KLK4 is expressed in other tissues, albeit at lower levels, we would suggest that the KLK gene family designation (34) is more appropriate than a tissue-specific nomenclature that is difficult to verify extensively. Similarly, the recent designation of prostase as PRSS17, and thus part of a family of serine proteases that are dispersed across several chromosomes, also should be modified given our localization of KLK4 (alias prostase/PRSS17) to the KLK locus. Although the function of this new gene is unknown, it is likely to be as important in post-translational processing events as tissue kallikrein, PSA, and the K2 enzyme have proven to be in a large range of biological events (1,2). The reported androgen dependence of prostase expression (26) and the putative ARE-like sequences noted here in the 5'-flanking region of the KLK4 gene would support this notion. The further characterization of the human KLK locus is likely to provide the location of further KLK-related genes, either known (NES1, protease M) or as yet unknown, that will add to this diverse family of serine proteases.