Fidelity of prespacer capture and processing is governed by the PAM-mediated interactions of Cas1-2 adaptation complex in CRISPR-Cas type I-E system

Prokaryotes deploy CRISPR-Cas–based RNA-guided adaptive immunity to fend off mobile genetic elements such as phages and plasmids. During CRISPR adaptation, which is the first stage of CRISPR immunity, the Cas1-2 integrase complex captures invader-derived prespacer DNA and specifically integrates it at the leader-repeat junction as spacers. For this integration, several variants of CRISPR-Cas systems use Cas4 as an indispensable nuclease for selectively processing the protospacer adjacent motif (PAM) containing prespacers to a defined length. Surprisingly, however, a few CRISPR-Cas systems, such as type I-E, are bereft of Cas4. Despite the absence of Cas4, how the prespacers show impeccable conservation for length and PAM selection in type I-E remains intriguing. Here, using in vivo and in vitro integration assays, deep sequencing, and exonuclease footprinting, we show that Cas1-2/I-E—via the type I-E–specific extended C-terminal tail of Cas1—displays intrinsic affinity for PAM containing prespacers of variable length in Escherichia coli. Although Cas1-2/I-E does not prune the prespacers, its binding protects the prespacer boundaries from exonuclease action. This ensures the pruning of exposed ends by exonucleases to aptly sized substrates for integration into the CRISPR locus. In summary, our work reveals that in a few CRISPR-Cas variants, such as type I-E, the specificity of PAM selection resides with Cas1-2, whereas the prespacer processing is co-opted by cellular non-Cas exonucleases, thereby offsetting the need for Cas4.

ture that comprises an array of direct repeats (~30-40 bp) (5), which are partitioned by short spacer sequences of viral origin (6,7). The repository of spacers acts as a vaccination card, and this genetic memory acquired during pathogenic invasions guides the adaptive immune response by CRISPR-Cas (1,2). The molecular choreography of CRISPR-Cas defense is divided into three stages: (i) adaptation, (ii) maturation, and (iii) interference (3). Upon phage attack, the CRISPR adaptation machinery derives short stretches of nucleic acids (prespacers) from the invaders and incorporates them into the CRISPR array. This process generates an infection memory and immunizes the host (2,4,8-10). An A-T-rich leader region present upstream of the repeat-spacer array regulates the transcription of the CRISPR locus, and the subsequent Cas nuclease-mediated processing of this inert transcript generates short regulatory mature CRISPR RNA (11-14). In association with other Cas proteins, the CRISPR RNA forms a surveillance complex that detects recurring infections by MGE via base complementarity. This event triggers the Cas nucleases to annihilate the MGE, thereby interfering with the spread of infection (1-3,13).
The CRISPR adaptation machinery expands the spacer archive, thus bestowing on the host a compendium of immunological memory to counter the ever-evolving phages. Highly conserved Cas1 and Cas2 spearhead the spacer acquisition (9,10,15); however, previous investigations have also demonstrated the indispensable role of additional Cas proteins and host factors in the uptake of spacers that are derived from phage DNA and RNA (16-28). Although the spacer sequences are extremely variable, the majority of the organisms (CRISPR types I, II, and V) display conservation of a 2-5-nt protospacer adjacent motif (PAM) at the site of spacer origin in the MGE (8,10,29-31). In addition to specifying the prespacer region for integration, PAM also guides the differentiation between self and nonself during the interference step (32-34). Mutations in the PAM region on MGE impair target recognition, which allows the MGE to evade the CRISPR-Cas immune response. In such circumstances, the imperfectly paired surveillance complex at the target region directs the interference machinery and Cas1-2 to mount an inflammatory immune response by rapidly acquiring more spacers, a process termed "primed adaptation" (24,35-38).
The CRISPR adaptation pathway proceeds via two sequential events: the capture of the prespacer fragments from invading MGE and the site-specific integration of these captured fragments at the leader-repeat junction. The CRISPR adaptation machinery of Escherichia coli (type I-E) derives its spacers predominantly from the DNA debris generated during the action of a multisubunit RecBCD DNA repair complex (18). Modulation of the helicase, exonuclease, and endonuclease activities of RecBCD results in the production of ssDNA fragments ranging from tens to thousands of nucleotides in length (39,40). However, the legitimate spacers in E. coli are strictly 33 bp in length and abut a 5′-AAG-3′ PAM (where "G" is destined to be the first residue of the spacer) (10,30,33,35,41,42). Additionally, structural studies have demonstrated that Cas1-2/I-E efficiently binds to partial duplex prespacers that are 33 bp in length (42,43). Such spacer-sized DNA fragments are exceedingly rare among the RecBCD products. Moreover, a previous study also demonstrated the incorporation of 33-bp spacers that are directly acquired from electroporated longer DNA duplexes (63 bp) (44). These findings point to the involvement of an additional DNA-trimming step that generates suitable substrates for CRISPR adaptation.
Recent studies in Sulfolobus solfataricus (type I-A), Sulfolobus islandicus (type I-A), Bacillus halodurans (type I-C), Synechocystis sp. 6803 (type I-D), Pyrococcus furiosus (type I-G), and Geobacter sulfurreducens (type I-U) (25,45-51) highlighted the indispensable role of Cas4 nuclease in PAM selection and prespacer processing. Cas4 is prevalent in type I CRISPR-Cas systems except in subtypes I-E and I-F (15). In one case, it was observed that an extended variant of Cas2 with an unorthodox C-terminal DnaQ exonuclease domain assists Streptococcus thermophilus DGCC7710 (type I-E) in prespacer trimming (52). The Cas2-DnaQ domain fusion is not ubiquitous, and even model organisms for spacer acquisition studies such as E. coli (type I-E) do not harbor it. Although recent studies envisage the involvement of exonucleases during spacer acquisition in E. coli (53), the molecular events guiding PAM selectivity and prespacer processing remain obscure.
Upon production of legitimate prespacers, the Cas1-2 integrase complex catalyzes the prespacer incorporation at the leader-adjoining repeat (10,23,24,41,54-56). This polarized mechanism records the chronology of infections by positioning the new spacer toward a promoter-encompassing leader. During recurring infections, swift expression of the latest spacers ensures a productive fight by a rapid and robust immune response (54). In conjunction with Cas1-2, various conserved DNA motifs present in the leader and repeat regions mandate the fidelity of prespacer integration (16,17,54,57-65). In E. coli (type I-E), upon interaction with these conserved regions, a sequence-specific genome architectural protein termed integration host factor (IHF) restructures the leader and generates a docking site for Cas1-2 integrase at the leader-repeat junction (16,17,66). In contrast, the presence of a Cas1-2-binding site abutting the leader proximal region obviates the involvement of host factors during spacer acquisition in the CRISPR type II-A system (54,67,68).
Unlike many prokaryotes that encode type I CRISPR-Cas systems, E. coli (type I-E) lacks Cas nucleases such as Cas4 and Cas2-DnaQ to generate processed prespacers for efficient homing at the CRISPR locus (15). Nevertheless, this inadequacy does not appear to hinder PAM selection or spacer size preference (10,30,33,41,44). Intrigued by these observations, we sought to understand how prespacers are selected and tailored to the appropriate size for CRISPR adaptation. Here, we demonstrate that PAM-directed interactions with longer DNA fragments signal Cas1-2 to demarcate potential prespacer boundaries. Upon supplementing the reaction with exonucleases to mimic the cellular environment, we found that the Cas1-2-DNA nucleoprotein complex could protect DNA fragments of ~33 bp in length. Further, we show that these protected fragments could be efficiently integrated into the CRISPR locus. These findings clarify the mechanism by which E. coli efficiently scales fragments of foreign DNA to generate viable prespacers of the desired length, in contrast to other CRISPR-Cas subtypes that possess dedicated prespacer-processing nucleases such as Cas4.
Upon incubation with Cas1-2 and IHF, prespacer integration proceeds via a trans-esterification reaction, wherein the 3′-OH of the Cas1-2-bound prespacer makes two nucleophilic attacks at the target sites to ligate itself into the CRISPR array (L-R1 junction in the top strand and R1-S1 junction in the bottom strand) (Fig. 1A) (55,56). To precisely identify the site of nucleophilic attack, we employed CRISPR DNA substrates that were 5′-end-labeled with fluorescein (FAM) either on the top strand (CD-T* in Fig. 1A (i)) or on the bottom strand (CD-B* in Fig. 1A (ii)). Likewise, to monitor the prespacer ligation, we used an unlabeled CRISPR DNA (CD-U in Fig. 1A (iii and iv)) and prespacers with FAM at their 5′-ends. Here, we observed that P23[3′-5], P23[5′-5], or P33 alone could make a successful nucleophilic attack at the integration sites and generate top strand (L) and bottom strand (R′) cleavage products from CD-T* and CD-B*, respectively (lanes 3-5 in Fig. 1 (C and D); Fig. S1 (B and C)). Further, utilizing 5′-FAM-labeled prespacers, we observed the ligation of P23[3′-5], P23[5′-5], and P33 at these nicked sites on the top strand (P + R) and bottom strand (L′ + P) (lanes 3, 5, and 7 in Fig. 1E; Fig. S1 (B and C)).

Mechanism of prespacer capture and processing in E. coli
Interestingly, we did not observe bands corresponding to the integration products when we substituted the reaction mixtures with P33[ss], P23[3′-10], and P63 (lanes 6, 7, and 8 in Fig. 1 (C and D); lanes 9, 11, and 13 in Fig. 1E; Fig. S1 (B and C)). These findings suggest that either duplex (P33) or partial duplex (comprising a 3′-overhang (P23[3′-5]) or a 5′-overhang (P23[5′-5])) prespacers with an effective length of 33 nt are strictly required during CRISPR adaptation. This bias in prespacer size preference could arise from weakened Cas1-2 interaction with long substrate precursors (such as P23[3′-10] and P63, with effective lengths of 43 and 63 nt, respectively) and/or inefficient integration of such DNA fragments at the target site in the CRISPR locus. To test whether longer prespacers (>33 nt) interact weakly with Cas1-2, thereby leading to inefficient integration, we performed EMSA to assess the interaction between Cas1-2 integrase and various prespacers (Fig. S2 and Fig. 1F). This indicated that the affinity of Cas1-2 toward P23[3′-10] (KD = 748.5 ± 163.7 nM) is comparable with its affinity for P23[3′-5] (KD = 648.5 ± 136.2 nM). Similarly, the affinity of P63 (KD = 2.842 ± 0.372 μM) for Cas1-2 is on par with that of P33 (KD = 2.885 ± 0.613 μM). These experiments suggest that Cas1-2 can interact with DNA fragments of varying lengths; however, integration at the target site could be achieved only with DNA fragments that have an effective length of 33 nt (Fig. 1G).

Figure 1A (partial caption): ... and prespacer (P) used in the Cas1-2-mediated prespacer integration assay. Regions corresponding to the 69-bp leader (L in blue), 28-bp repeats (R1-R2 in red), 33-bp spacer 1 (S1 in cyan), and 19-bp spacer 2 (S2 in magenta) of the CRISPR DNA are indicated. Two integration events resulting from 3′-OH nucleophilic attack of the Cas1-2-bound prespacer at the top strand (L-R1 junction) and bottom strand (R1-S1 junction), and their respective denatured DNA fragments, are shown. The designs of the integration assays used to determine the positions of nucleophilic attack (top strand (i) and bottom strand (ii)) and prespacer ligation (top strand (iii) and bottom strand (iv)) are displayed.
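The effective-length reasoning can be illustrated with a minimal sketch. It assumes the naming convention implied by the text, in which P23[3′-5] denotes a 23-bp duplex carrying a 5-nt overhang at each end (23 + 2 × 5 = 33 nt), whereas P33 and P63 are blunt duplexes. The class and function names are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass

SPACER_LENGTH_NT = 33  # E. coli type I-E spacer length stated in the text


@dataclass(frozen=True)
class Prespacer:
    """Hypothetical description of a prespacer substrate."""
    name: str
    duplex_bp: int        # length of the double-stranded core
    overhang_nt: int = 0  # overhang at EACH end (assumed symmetric)

    @property
    def effective_length(self) -> int:
        # Duplex core plus one overhang on either side.
        return self.duplex_bp + 2 * self.overhang_nt


def is_integration_sized(p: Prespacer) -> bool:
    """True when the effective length equals the spacer length."""
    return p.effective_length == SPACER_LENGTH_NT


substrates = [
    Prespacer("P23[3'-5]", 23, 5),
    Prespacer("P23[5'-5]", 23, 5),
    Prespacer("P33", 33),
    Prespacer("P23[3'-10]", 23, 10),
    Prespacer("P63", 63),
]

for p in substrates:
    print(f"{p.name:<11} {p.effective_length:>2} nt  "
          f"integration-sized: {is_integration_sized(p)}")
```

Size is a necessary but not a sufficient criterion in this sketch: the single-stranded P33[ss] also measures 33 nt, yet it did not yield integration products in the assay.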

Cas1-2 foothold protects potential prespacer regions during exonuclease action
CRISPR adaptation in E. coli mandates precisely sized prespacers (Fig. 1). To meet this need, long DNA fragments generated during RecBCD activity have to be trimmed further by nuclease action. Of the multitude of proteins encoded by the cas operon, only Cas1 and Cas2 contribute to naive spacer acquisition in E. coli. We did not detect processing of longer prespacers (P23[3′-10] and P63) even at higher Cas1-2 concentrations (Fig. S3A).
The type I-E system is devoid of the prespacer-processing nuclease Cas4, and this deficit cannot be complemented by Cas1-2 alone (Fig. S3A). Hence, we predicted the involvement of cytoplasmic exonucleases in trimming the longer prespacers to a suitable length. Having established that Cas1-2 indeed binds DNA fragments of variable length (Fig. 1F and Fig. S2), we sought to test the fate of Cas1-2-bound DNA fragments against exonucleases. Here, the Cas1-2-P63 nucleoprotein complex was treated with a mixture containing 5′→3′-acting T5 exonuclease (T5exo) and 3′→5′-acting exonuclease III (ExoIII) (Fig. 2A). To our surprise, we identified a smear of protected DNA fragments (P63exo+) that ranged from 30 to 40 nt in the sample containing both P63 and Cas1-2 (lane 8 in Fig. 2A and lanes 4-13 in Fig. S3B), whereas such protection was not observed when we treated P63 in the absence of Cas1-2 or in the presence of either Cas1 or Cas2 alone (lanes 5, 6, and 7 in Fig. 2A). Notably, the length of the protected fragments corresponded to the legitimate spacer size in E. coli (~33 nt). Additionally, this protection was absent when Cas1-2-P63 was treated with ExoIII alone (lane 11 in Fig. 2A). ExoIII generates 5′-overhangs, to which Cas1-2 binds weakly (compare P23[3′-5] and P23[5′-5] in Fig. 1F and Fig. S2). Therefore, ExoIII appears to dislodge the weakly bound Cas1-2 from its position on the prespacer. In contrast, the treatment of Cas1-2-P63 with 5′→3′-acting T5exo resulted in incompletely digested fragments that were predominantly longer (~45 nt) than P63exo+ (compare lanes 8 and 13 in Fig. 2A).
As the Cas1-2-protected P63exo+ fragments were approximately of E. coli spacer size, we wondered whether they could act as potential prespacers for integration. To test this hypothesis, we purified P63exo+ DNA fragments and used them as prespacers in a spacer integration assay. In line with the previous experiment (Fig. 1), we did not observe any integration events when we employed the longer prespacer P63 (lanes 3 and 7 in Fig. 2B). To our surprise, integration was observed when P63exo+ was employed (lanes 4 and 8 in Fig. 2B; lane 7 in Fig. S3C). In this case, although we observed efficient nucleophilic attack at the top strand (L in lane 4 of Fig. 2B), the integration at the bottom strand seemed to be sparse (R′ in lane 8 of Fig. 2B). Taken together, these observations suggest that Cas1-2 binding to large DNA fragments secures the boundaries of suitable prespacers from the exonucleolytic action of cellular nucleases in E. coli (Fig. 2C).

PAM-directed binding of Cas1-2 defines the boundary for prespacers
Because Cas1-2 binding was shown to mark the spacer boundaries (Fig. 2), we sought to identify and map these regions. As no protected DNA fragments appeared upon treatment of the Cas1-2-P63 complex with ExoIII alone (lane 11 in Fig. 2A), we used T5exo digestion (lane 13 in Fig. 2A) to demarcate the prespacer boundaries defined by Cas1-2. To accomplish this, we utilized variants of 63-bp blunt-ended prespacers that carry a fluorescein-labeled 3′-end (6-FAM) either on the top (P63T*) or on the bottom (P63B*) strand (Fig. 3A). Further, to identify the footprints of Cas1-2 on these 3′-end-labeled prespacers, we incubated the Cas1-2-bound prespacer complex with T5exo. Here, Cas1-2 bound to the prespacer acts as a roadblock that stalls the 5′→3′ progression of T5exo. The length of the resultant labeled fragments specifies the stalling points of the exonuclease, which, in turn, indicate the binding position of Cas1-2 on the prespacer. Utilizing this approach, we mapped the cleavage termination points on the top and bottom strands of the prespacer. After T5exo treatment of P63T*, we observed an ~28-nt labeled fragment. This is indicative of an inherent nuclease stalling point (in the absence of roadblocks such as Cas1-2) around the 28th nt position from the labeled end in P63T* (lane 2 in Fig. 3B; lane 12 in Fig. 2A). However, owing to complete exonucleolytic cleavage of the bottom strand, we did not observe such stalling on P63B* (lane 10 in Fig. 3B). Upon T5exo treatment of the Cas1-2-bound P63T* complex, we noticed a shift in the nuclease stalling point to about the 45th nt position from the labeled end (lane 4 in Fig. 3B). This maps the Cas1-2-binding position to around 45 nt from the labeled end (P63T* in Fig. 3A). Notably, this binding position of Cas1-2 on P63T* is localized around a cognate PAM sequence (5′-AAG-3′, located 47 to 49 nt upstream of the labeled position) (P63T* in Fig. 3A).
Prompted by this finding, we were interested in identifying the extent of protection that Cas1-2 could confer on the bottom strand upon its binding at the PAM region. To accomplish this, we treated the Cas1-2-bound P63B* complex with T5exo. Here, the length of the protected fragments after exonuclease treatment indicated that Cas1-2 can guard a region spanning ~45 nt from the labeled end in P63B* (lane 12 in Fig. 3B and P63B* in Fig. 3A). As the PAM residues are positioned 14 nt from the labeled end of P63B*, the effective length of the protected prespacer from the PAM is ~30 nt. Overall, these results suggest a PAM-dependent mechanism by which Cas1-2 could selectively acquire 33-nt prespacers.
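The footprint arithmetic for the bottom strand can be written out as a short sketch. It takes the two values quoted in the text (a stall point ~45 nt and a PAM 14 nt from the labeled end of P63B*) and computes their difference, 45 − 14 = 31 nt, which is consistent with the ~30-nt estimate given the resolution of gel-based measurements. The function name is hypothetical.

```python
def protected_length_from_pam(stall_nt: int, pam_offset_nt: int) -> int:
    """Distance (nt) between the PAM and the exonuclease stall point.

    Both arguments are measured from the labeled 3' end of the same strand.
    """
    if stall_nt < pam_offset_nt:
        raise ValueError("stall point lies between the label and the PAM")
    return stall_nt - pam_offset_nt


# Values read from the text for P63B*; gel-derived, hence approximate.
STALL_NT = 45       # T5exo stalls ~45 nt from the labeled end
PAM_OFFSET_NT = 14  # PAM residues sit 14 nt from the labeled end

print(protected_length_from_pam(STALL_NT, PAM_OFFSET_NT))
```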
To test the role of PAM in defining the protospacer boundary, we employed mutated P63 DNA fragments (P63mPAMT* and P63mPAMB* in Fig. 3A) that are devoid of any cognate E. coli PAM sequence (5′-AWG-3′, where W represents A/T). These labeled fragments were incubated with Cas1-2 and later treated with T5exo. Here, we found extended smears and multiple bands upon employing P63mPAMT* (Fig. 3A and lane 8 in Fig. 3B). The varied length of the resultant labeled fragments is indicative of numerous stalling points on these P63 mutants (Fig. 3A) that occurred due to Cas1-2 binding. These results highlight that Cas1-2 is specifically recruited to a PAM-containing region, which plays a crucial role in defining prespacer boundaries, whereas such specificity is lost when the DNA fragments lack PAM. In that case, the promiscuous interaction of Cas1-2 generates illicit prespacers that deviate from the productive length for integration (lanes 8 and 16 in Fig. 3B; P63mPAMT* and P63mPAMB* in Fig. 3A).
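Whether a DNA fragment is "devoid of any cognate PAM" can be checked computationally by scanning both strands for the 5′-AWG-3′ motif. The following is a minimal sketch of such a scan; the function names and the example sequences are hypothetical and are not the actual P63 or P63mPAM sequences.

```python
import re

# 5'-AWG-3' with W = A or T; the lookahead lets overlapping motifs be found.
PAM_PATTERN = re.compile(r"(?=(A[AT]G))")

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_pams(seq: str) -> list[tuple[str, int, str]]:
    """List (strand, 0-based start, motif) for every AWG on either strand.

    Positions on the '-' strand index into the reverse-complemented
    sequence, i.e. they are counted 5'->3' along that strand.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for match in PAM_PATTERN.finditer(s):
            hits.append((strand, match.start(), match.group(1)))
    return hits


def lacks_cognate_pam(seq: str) -> bool:
    """True when neither strand contains a 5'-AWG-3' motif."""
    return not find_pams(seq)


# Hypothetical toy sequences, for illustration only.
print(find_pams("CCAAGCC"))
print(lacks_cognate_pam("CCCCGGGG"))
```

A design along these lines could be used to confirm that a mutated substrate carries no AAG or ATG on either strand before it is used as a PAM-free control.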

Intrinsic specificity of Cas1-2 circumvents the requirement of Cas4 during PAM selection in E. coli
Having found the role of PAM-mediated interaction in selecting the prespacers for uptake, we attempted to understand the intrinsic molecular principles that confer precision to Cas1-2 in PAM selectivity and prespacer scaling. Previous structural studies of CRISPR adaptation complex in E. coli suggested that the extended Cas1 C-terminal tail of apoCas1-2 complex (69) gets organized around the PAM residues upon binding to prespacer DNA (Fig. 4) (42). In particular, the Gln-287 and Ile-291 residues of this proline-rich C-terminal tail make direct contacts with the nucleotides of PAM (Fig. 4B), thus possibly imparting the PAM specificity. In another striking feature, a pair of Tyr-22 residues that are derived from two different Cas1 protomers scales a 23-bp duplex region of prespacer via stacking interactions at either end (Fig. 4B) (42,43). This gating mechanism at the Cas1-2 platform seems to scale
Similar to WT Cas1-2, T5exo treatment of Y22A-P63B* generated a nuclease stalling point at ~45 nt, albeit with reduced efficiency (compare lanes 4 and 8 in Fig. 5A). These findings indicate that Cas1 Y22A does not alter PAM specificity (WT and Y22A in Fig. 5B). Despite having a stronger affinity for P63 (Y22A KD = 1.781 ± 0.202 μM versus WT KD = 2.842 ± 0.372 μM) (Figs. S2E and S5C), the absence of the Tyr-22-mediated stacking interaction seems to reduce the ability of Y22A to protect the prespacer against nuclease action. Additionally, a shift in the nuclease stalling point to 60 nt from the labeled end of P63B* was observed for the ΔC and 5M variants (lanes 6 and 10 in Fig. 5A). Due to its low affinity, 5M seems to display reduced protection of prespacers from nuclease action (Fig. S5, A and B). The shift in this nuclease stalling point indicates that the ΔC and 5M variants display impaired PAM specificity and interact randomly at the ends of P63B* (ΔC and 5M in Fig. 5B). To fortify our observations, we sought to understand the impact of these mutations on the sequence composition of acquired spacers by expressing Cas1-2 variants in E. coli IYB5101. We observed that the ΔC and 5M mutations partly reduced the spacer incorporation efficacy of Cas1-2 (compare lanes 2, 4, and 6 in Fig. 5C). Surprisingly, Y22A displayed a drastic reduction of spacer uptake in vivo (compare lanes 2 and 8 in Fig. 5C), despite showing spacer integration in vitro (Fig. S4B).
Expanded CRISPR arrays corresponding to the expression of each mutant were purified, and the sequences of newly incorporated spacers were derived from high-throughput sequencing. In line with previous studies, we observed that the spacers originated from both genome and plasmid (44). Irrespective of the mutations in Cas1-2, the length of the incorporated spacers is strictly conserved (i.e. 33 nt) (Fig. 5D). This finding suggests that Tyr-22-mediated stacking interaction with prespacer or the Cas1 C-terminal restructuring is dispensable for the scaling of prespacers. These spacer sequences were mapped onto the plasmid and genome to identify the PAM. Despite the display of precise prespacer scaling by Cas1-2 variants, the specificity toward PAM region appears to be profoundly altered (Fig. 5E).
In concurrence with previous studies (30,44), we observed that most of the spacers acquired by WT Cas1-2 encompass a conserved PAM region (5′-AAG-3′, where "G" indicates the +1-position of the 33-nt spacer) (Fig. 5E). In line with the nuclease protection assay (Fig. 5A), we did not observe any preference toward the PAM region when we employed 5M or ΔC (Fig. 5E). This finding bolsters the involvement of the Cas1 C-terminal tail in PAM selectivity. Despite the reduced efficiency in spacer acquisition in vivo (Fig. 5C), surprisingly, Y22A displayed a remarkable precision for PAM selectivity, suggesting that this mutation bestowed high fidelity with respect to PAM recognition (Fig. 5E).
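The PAM assignment described above (spacer positions +1 onward taken from the sequencing reads, positions −2 and −1 taken from the plasmid or genome sequence) can be sketched as follows. This is only an illustrative outline, not the analysis pipeline used in the study: it assumes exact, first-match mapping on a linear sequence, and all function names and toy sequences are hypothetical (real spacers are 33 nt long).

```python
from collections import Counter
from typing import Iterable, Optional

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def pam_for_spacer(spacer: str, source: str) -> Optional[str]:
    """Return positions -2, -1, +1 for the first exact match of the spacer.

    Both strands of the source are searched. None is returned when the
    spacer does not map or lies too close to the 5' end of the sequence.
    """
    spacer, source = spacer.upper(), source.upper()
    for strand_seq in (source, reverse_complement(source)):
        start = strand_seq.find(spacer)
        if start >= 2:
            # Two upstream nucleotides plus the first spacer nucleotide.
            return strand_seq[start - 2:start + 1]
    return None


def pam_counts(spacers: Iterable[str], source: str) -> Counter:
    """Tally the PAM triplets of all spacers that can be mapped."""
    counts: Counter = Counter()
    for spacer in spacers:
        pam = pam_for_spacer(spacer, source)
        if pam is not None:
            counts[pam] += 1
    return counts


def length_distribution(spacers: Iterable[str]) -> Counter:
    """Count spacers by length (nt)."""
    return Counter(len(s) for s in spacers)


# Hypothetical toy data: a 10-nt "spacer" that begins at the G of an AAG.
source = "CCAAGCTGACTGACTTTT"
spacers = ["GCTGACTGAC", "GGGGGGGGGG"]
print(pam_counts(spacers, source))
print(length_distribution(spacers))
```

The per-position frequencies in such a tally are the kind of input from which a sequence logo of the −2, −1, and +1 positions could be drawn.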

Discussion
The CRISPR system in E. coli (type I-E) displays precise scaling of prespacer length and stringent selectivity of PAM (30,41,44). Here, we attempted to uncover the elusive molecular events that drive the generation of competent substrates for homing at the leader-repeat junction by a CRISPR adaptation complex. During naive adaptation, prespacers are predominantly foraged from the DNA fragments generated by RecBCD during DNA repair (18), whereas during primed adaptation they derive from fragments generated by Cas3 (35,36,70). Because RecBCD and Cas3 are helicase-nuclease enzymes, their action generates ssDNA fragments of varied length (39,71). Despite the association of ssDNA fragments with Cas1 in vivo (72), their integration at the CRISPR locus has been highly inefficient compared with duplex and partial-duplex DNA forms (Fig. 1) (56). This suggests that duplex formation by annealing of complementary ssDNA strands upon helicase-nuclease action could generate prespacers that elicit CRISPR adaptation.
We demonstrate that irrespective of being a duplex or partial duplex, the overall length of the prespacer must be ~33 nt (i.e. E. coli spacer length) for successful integration (Fig. 1). A previous study has shown that in vitro reconstituted Cas1-2 alone can process prespacers with 10-nt 3′-overhangs to the legitimate spacer size (42). In contrast, our experiments show the absence of such prespacer processing even with increasing Cas1-2 concentrations, despite the presence of PAM (P23[3′-10] in Fig. S3A). We posit that these variations could be due to differences in the purification strategies for the Cas1-2 complex in the two studies. Here, we employed Cas1-2 that was generated as a complex in vivo and was further purified extensively, whereas the previous study (42) utilized in vitro reconstituted Cas1-2 that potentially contained unassembled, free Cas1 protomers. The use of high concentrations of the in vitro reconstituted Cas1-2, and thus the increased presence of free Cas1, a known endonuclease (73), could have resulted in the cleavage of the prespacer overhangs (42). Alternatively, because the nuclease activity was seen only at micromolar concentrations of in vitro reconstituted Cas1-2 (42), whereas the integrase activity requires just nanomolar concentrations (17,56), trace contamination by cellular nucleases is also possible.

Figure 5, D and E (partial caption): The x axis depicts the length of the spacer (nt), whereas normalized frequency (density) is indicated on the y axis. E, illustration depicting the PAM preference of Cas1-2 variants (WT, ΔC, 5M, and Y22A) during the in vivo integration assay (C). The +1 to +33 sequence of each spacer (gray ladder) was extracted from high-throughput sequencing data. Subsequently, sequence information of the −1 and −2 positions of each spacer was derived from the respective plasmid/genome sequence. The conservation profile of PAM sequences (−2, −1, and +1 in red) corresponding to the respective Cas1-2 variant is shown as a sequence logo.
Previous structural studies of the E. coli Cas integrase complex have revealed that a 33-bp length of DNA can be exactly accommodated between the two active-site regions of Cas1-2 (42,43). This suggests that the Cas1-2 foothold can mask only the 33-bp region, leaving the rest exposed to potential nuclease action. Mimicking such conditions, we incubated Cas1-2-bound longer DNA fragments (P63) with 3′→5′-acting (ExoIII) and 5′→3′-acting (T5exo) exonucleases. These reactions generated fragments in the range of the cognate E. coli spacer size (Fig. 2A). Furthermore, these processed DNA fragments were readily utilized as substrates by Cas1-2 integrase (Fig. 2B). Likewise, in B. halodurans (type I-C) and S. thermophilus DGCC7710 (type I-E), the nuclease action of Cas4 and DnaQ (an auxiliary domain of Cas2), respectively, on Cas1-2-bound DNA generates integration-competent prespacers (45,52). Intriguingly, as observed in E. coli, when Cas1-2 of these systems was presented with legitimately sized prespacers, the need for nuclease processing was obviated (45,52). This implies that Cas1-2 alone can catalyze spacer integration, but the generation of productive prespacers involves the action of an additional nuclease. The lack of a known prespacer-processing enzyme such as Cas4 in E. coli led us to hypothesize that the trimming action could be complemented by other cellular nucleases. Unlike in other CRISPR variants (25,45-50), the productive pruning of prespacers by non-Cas nucleases is not sequence-specific (Fig. 2). Moreover, a parallel study on prespacer generation in E. coli also independently demonstrated that generic nucleases (such as DNA polymerase III or exonuclease T) are sufficient for trimming the prespacers upon PAM recognition by Cas1-2 (74). These observations explain why specific nucleases, such as Cas4, are not required for prespacer processing in E. coli (see below).
In addition to spacer length conservation, most prokaryotes display selective uptake of phage-origin prespacers that are bordered by a PAM (8). Recent in vivo studies in various type I organisms (I-A, I-D, I-G, and I-U) have underscored the indispensable requirement of Cas4 in PAM selection as well as in prespacer processing (46-50). Moreover, in vitro studies performed with the adaptation complexes of B. halodurans (type I-C) and S. solfataricus (type I-A) revealed that Cas4 nuclease avoids the processing of free DNA ends that are devoid of a PAM sequence (25,45). This preferential activity of Cas4 seems to act as a critical checkpoint in ensuring the productive uptake of infection memory by Cas1-2 in the hosts.
A previous study in E. coli showed that upon expression of Cas1-2, a 33-nt prespacer bordered by PAM originated from the longer electroporated DNA (P63) (44). Interestingly, we observed the protection of the same region by Cas1-2 when we performed a nuclease footprinting assay on P63 (Fig. 3). These experiments highlight that the Cas1-2 complex alone is sufficient to recognize PAM in E. coli. The footprinting experiments also demonstrate binding of substrates at multiple points when the PAM residues of P63 were mutated (Fig. 3). These nonspecific interactions of Cas1-2 could generate a heterogeneous population of protected prespacers. This may explain why the adaptation complexes of various type I organisms infrequently take up prespacers with an erroneous PAM (WT in Fig. 5E) (35,75-80).
Structural analysis of the Cas1-2-prespacer complex highlights the features that could lead to precise scaling and PAM selection of prespacers. A platform formed by the interaction of a Cas2 dimer with two Cas1 dimers on either side houses the 23-bp duplex region of the prespacer (42,43). Stationed at either end of this duplex is the aromatic ring of a Tyr-22 residue that stacks against the prespacer at the border of the Cas1 catalytic groove and directs the 3′-overhang to position its fifth nt at the catalytic site. Thus, Tyr-22-guided meticulous placement of the DNA substrate seems to dictate the length of the prespacer (42,43). Furthermore, the flexible C-terminal tail of Cas1 is molded around the PAM region. The absence of such molecular architecture upon mutating the PAM hints at the role of the C-terminal tail in PAM recognition (42). Deployment of Cas1-2 variants that carry either a deletion of the Cas1 C-terminal tail (ΔC) or Y22A in spacer integration assays helped to unveil the role of these structural entities in determining the PAM selection (Fig. 5). As shown here, the deletion of the C-terminal tail resulted in impaired PAM recognition (Fig. 5A) and led to an uptake of prespacers that lacked PAM (ΔC in Fig. 5E).
A comparison of the structures of Cas1 from various type I organisms (Fig. 6, A-D) revealed a striking contrast between the Cas1 C-terminal tail of type I-E and those of other subtypes. The C-terminal tail of E. coli Cas1, at 31 amino acids (aa), is noticeably longer than the 12-aa tails of Archaeoglobus fulgidus, Pyrococcus horikoshii, and B. halodurans (Fig. 6) (42,81,82). Additional comparative analysis reinforced the observation that type I-E encompasses the longest C-terminal tail, with an average length of 29 aa (Fig. 6E and Figs. S6-S9). Notably, CRISPR-Cas subtypes with shorter Cas1 C-terminal tails, such as types I-A, I-B, and I-C, encompass Cas4 (15), and previous studies suggest the indispensability of Cas4 in promoting PAM specificity (25,45-47). In contrast to these systems, it appears that the extended C-terminal tail of Cas1 in E. coli (type I-E) compensates for the lack of Cas4 by guiding the PAM selection.
In the case of the Y22A variant, although prespacer integration is unaffected in vitro (Fig. S4B), in line with previous observations (43), a marked decrease in spacer integration efficiency was observed in vivo (Fig. 5C). Y22A conferred reduced prespacer protection against the nucleases compared with WT (Fig. 5A). This observation highlights the critical role of the Tyr-22 residue in providing the WT with a better grip on bound DNA (Fig. 5A). As Y22A could lack such interactions with its substrates, nucleases might readily dislodge it from the bound prespacer (Fig. 5A). This action appears to limit substrate availability and impede spacer integration in vivo (Fig. 5C). Despite the reduction in spacer acquisition potential, Y22A showed higher fidelity toward "AAG" PAM selection than WT (Fig. 5E). As discussed above, Y22A displays a reduced grip on the prespacers during nuclease-mediated processing (Fig. 5A). This weak prespacer binding could be further disrupted in the absence of PAM due to the loss of interactions with the Cas1 C-terminal tail. Therefore, only the presence of the cognate PAM (AAG) is likely to allow Y22A to retain its hold on DNA during prespacer processing, leading to selective enrichment (Fig. 5E).

Mechanism of prespacer capture and processing in E. coli
The strategic position of Cas1 Tyr-22 in the adaptation complex appears to have a key role in defining the prespacer length (Fig. 4B). Interestingly, our experiments with Y22A resulted in the uptake of prespacers that were predominantly 33 bp in length (Fig. 5D). These findings negate the involvement of Tyr-22-stacking interactions in deciding the prespacer boundary. Recent studies in type V-C demonstrated that a mini-integrase complex constituted by Cas1 tetramer prefers short (18-bp) spacers (83); likewise, in E. coli, the Cas1-2 structural framework alone appears to be a critical parameter in gauging the length of spacers (42,43).
Our work in conjunction with the previous studies allows us to propose an updated model for prespacer capture in type I systems (Fig. 7). During CRISPR adaptation, the dispensability of sequence-specific auxiliary nucleases such as Cas4 seems to be contingent on the type of molecular players that are involved in PAM selection. Although Cas1-2 integrase catalyzes the prespacer homing, in the majority of type I systems, PAM selection and prespacer processing require Cas4 (25,45-47). In contrast, in the type I-E system, the intrinsic affinity of Cas1-2 integrase alone is sufficient to recognize cognate PAM. This lineage-specific remarkable adaptation of Cas1-2/I-E offsets the requirement of PAM-specifying Cas4 nuclease. Instead, generic cellular non-Cas nucleases are co-opted to trim the exposed DNA ends of the Cas1-2-prespacer complex for generating the legitimate prespacers for integration.

Construction of plasmids
Lists of plasmids, strains, and oligonucleotides used in this study are detailed in Tables S1-S3.

Expression and purification of proteins
IHF and Cas1 were purified as described before (17). To purify Cas2, E. coli BL21(DE3) harboring pMS-Cas2 was grown in autoinduction LB medium supplemented with 100 μg/ml kanamycin at 37°C, 180 rpm. Upon reaching an A600 of 0.6, the temperature was shifted to 16°C, and thereafter the growth and induction were continued for 16 h. Subsequently, cells were harvested and washed two times with Buffer 1A (20 mM HEPES-NaOH, pH 7.4, 500 mM KCl, and 10% glycerol). The bacterial pellet was resuspended in Buffer 1A containing 1 mM phenylmethylsulfonyl fluoride, and the cells were lysed by sonication. Here, Cas2 carries His6-tagged MBP-SUMO as an N-terminal fusion and a Strep-II tag at the C-terminal end. The clarified fraction of the lysate was applied to a 5-ml MBPTrap HP column (GE Healthcare), followed by a washing step with Buffer 1A. Thereafter, the bound proteins were eluted with Buffer 1A containing 10 mM maltose. Eluted fractions were mixed with SUMO protease (Ulp1 403-621) (in a 400:1 ratio of His-MBP-SUMO-Cas2-strep/Ulp1 403-621) (84), and the incubation was continued for 60 min at 25°C. Following this, the mixture was loaded onto a 5-ml HiTrap IMAC HP column (GE Healthcare) five times to facilitate binding of histidine-tagged MBP-SUMO-Cas2-strep, MBP-SUMO, and Ulp1 403-621. Column flow-through containing Cas2-strep was concentrated using a centrifugal membrane filter (Sartorius). To remove any trace protein contaminants, the concentrated sample was loaded onto a HiLoad Superdex 200 pg gel filtration column (GE Healthcare) that was pre-equilibrated with Buffer 1B (20 mM HEPES-NaOH, pH 7.4, 150 mM KCl, and 10% glycerol). Eluted fractions containing Cas2-strep were pooled, concentrated, snap-frozen in liquid nitrogen, and stored at −80°C until required.
Figure legend (fragment): … (error bars)) of each subtype is indicated at the respective position. n corresponds to the number of Cas1 sequences from each subtype that are considered for the analysis (see "Experimental procedures").
The integrase complex comprising untagged Cas1 and C-terminal His6-tagged Cas2 was expressed and purified as described before (61) with minor modifications. Here, E. coli BL21(DE3) transformed with pCas1-2H was grown in 2XYT broth supplemented with 100 μg/ml spectinomycin at 37°C, 180 rpm until an A600 of 0.6 was reached. Thereafter, protein expression was induced by the addition of 0.7 mM IPTG, and the growth was continued at 25°C for 24 h. Subsequently, cells were harvested and washed two times with Buffer 2A (20 mM HEPES-NaOH, pH 7.4, 150 mM KCl, 10% glycerol, and 30 mM imidazole). The pellet was resuspended in Buffer 2A containing 1 mM phenylmethylsulfonyl fluoride, and cells were lysed by sonication. Thereafter, the lysate was clarified and loaded onto a 5-ml HiTrap IMAC HP column (GE Healthcare), followed by a washing step with Buffer 2A. A linear gradient of imidazole (0.03-0.5 M) in Buffer 2A was applied to elute the proteins that were bound to the column resin. The purified fractions that contained the Cas1-2 complex were pooled and concentrated using a centrifugal membrane filter (Sartorius). To remove trace protein contaminants and uncomplexed Cas2, the concentrate was further purified using a HiLoad Superdex 200 pg gel filtration column (GE Healthcare) that was pre-equilibrated with Buffer 2B (20 mM HEPES-NaOH, pH 7.4, 150 mM KCl, and 10% glycerol). Eluted fractions containing Cas1-2 integrase were pooled, concentrated, snap-frozen in liquid nitrogen, and stored at −80°C until required. A similar procedure was implemented to purify the 5M, ΔC, and Y22A Cas1 variants of Cas1-2 from IPTG-induced E. coli BL21(DE3) cells that harbor p5M, pΔC, and pY22A, respectively.
Oligonucleotides (Table S3) were mixed in a buffer containing 10 mM Tris-Cl, pH 8.5. These mixtures were heated to 95°C and gradually allowed to cool to room temperature to facilitate the formation of duplex and partial-duplex prespacers. In the case of P33ss, a 33-nt-long single-stranded oligonucleotide was used as a prespacer.
Figure legend: The prespacer production in CRISPR systems encompasses a subset of two events: (i) PAM-directed prespacer capture and (ii) processing of the selected prespacer to a defined length for integration at the CRISPR locus. Generally, Cas1-2 captures long dsDNA fragments that are produced by reannealing of ssDNA products (in brown) derived from DNA repair pathways (such as RecBCD in E. coli) or CRISPR interference. Although in type I-E it is clear that DNA capture by Cas1-2 precedes the prespacer processing event, the order of DNA capture and processing is yet to be understood in other CRISPR-Cas subtypes. The Cas4 (in green) in CRISPR-Cas subtypes I-A, I-C, I-D, and I-G or the Cas4 domain of the Cas4-Cas1 fusion in type I-U trims the DNA upon recognizing the PAM region (in magenta) (25,45-51). A second copy of Cas4 in type I-G was shown to trim the non-PAM end upon recognizing a short motif (47), whereas in other subtypes, it is not clear whether Cas4 processes this end. Here, the dual role of PAM recognition and prespacer processing by Cas4 propels CRISPR adaptation. Unlike this, in the type I-E system, the intrinsic affinity of Cas1-2 integrase toward PAM itself is sufficient to define the potential prespacer regions. This aspect of Cas1-2/I-E precludes the involvement of any PAM-specific Cas nucleases (such as Cas4) in prespacer selection. As Cas1-2/I-E protects the prespacer boundaries efficiently upon recognizing the PAM, any common cellular non-Cas nucleases (in cyan) could trim the exposed ends to generate aptly sized prespacers for integration into the CRISPR locus.

In vitro integration assay
The in vitro integration assays were performed as described previously (17) with minor modifications. Briefly, 210 nM Cas1, Cas2, or Cas1-2 (WT, ΔC, Y22A, and 5M) was mixed with a 550 nM concentration of the desired prespacer and incubated at room temperature for 5 min. To this mixture, 0.5 μM IHF and 21 nM CRISPR DNA substrate were added, and incubation was continued at 37°C for 60 min in integrase buffer (20 mM HEPES-NaOH, pH 7.4, 25 mM KCl, 10 mM MgCl2, and 1 mM DTT). Subsequently, the reaction mixtures were supplemented with an equal volume of stopping solution (95% formamide, 5 mM EDTA, and 0.025% SDS) followed by heating at 95°C for 20 min. These samples were loaded onto preheated 12% denaturing acrylamide gels that were maintained at 50°C and electrophoresed in 1× TBE. Subsequently, gels were stained with EtBr and visualized using a gel documentation system (Bio-Rad), whereas in the assays that involved FAM-labeled CRISPR DNA or prespacers, gels were imaged without any post-staining step. All of the integration experiments were independently repeated at least twice, and representative gel pictures are shown.
To further verify the formation of the Cas1-2 nucleoprotein complex, the release of prespacers was monitored upon Proteinase K treatment of Cas1-2 in each assay. To achieve this, an aliquot of sample containing prespacer and 3 μM Cas1-2 was mixed with 1 mg/ml Proteinase K and incubated at 37°C for 15 min.

Exonuclease treatment of Cas1-2-bound DNA fragments
Exonuclease treatment was performed to identify the extent of protection conferred by binding of Cas1-2 onto a long DNA fragment. 40 μl of 0.5 μM P63 and a 6 μM concentration of either Cas1 or Cas2 or Cas1-2 in prespacer-binding buffer were incubated at 37°C for 45 min. Subsequently, 20-μl aliquots of these samples were supplemented with 3 units of either T5 exonuclease (New England Biolabs) or exonuclease III (New England Biolabs) or 3 units of the mixture containing both exonucleases, and incubation was continued for 60 min at 37°C. Thereafter, all of the samples were mixed with an equal volume of denaturation buffer that contains 200 mM Tris-Cl, pH 8.3, 200 mM boric acid, 20 mM EDTA, 0.05% SDS, and 8 M urea, followed by heating at 95°C for 15 min. These samples were loaded onto preheated 20% denaturing acrylamide gels that were maintained at 50°C and electrophoresed in 1× TBE. Subsequently, gels were stained with EtBr and visualized using a gel documentation system. This experiment was independently repeated three times, and a representative gel picture is shown.

Exonuclease footprinting
T5 exonuclease-mediated footprinting was performed to identify the interaction boundaries of Cas1-2 on longer prespacer DNA fragments. Here, 40 μl of a 0.5 μM concentration of the desired fluorescein-labeled P63 variant (P63T*, P63B*, P63mPAMT*, and P63mPAMB*) was mixed with 6 μM WT or one of the mutant variants of Cas1-2 (Y22A, ΔC, and 5M) in prespacer-binding buffer and incubated for 45 min at 37°C. Subsequently, 20-μl aliquots of these samples were supplemented with 3 units of T5 exonuclease, and incubation was continued for 60 min at 37°C. Thereafter, all of the samples were mixed with an equal volume of denaturation buffer that contains 200 mM Tris-Cl, pH 8.3, 200 mM boric acid, 20 mM EDTA, 0.05% SDS, and 8 M urea, followed by heating at 95°C for 15 min. These samples were loaded onto preheated 20% denaturing acrylamide gels that were maintained at 50°C and electrophoresed in 1× TBE. Subsequently, gels were directly visualized using a gel documentation system. All of the footprinting experiments were repeated at least twice, and representative gel pictures are shown.

Spacer acquisition assays
The in vivo spacer acquisition assays were performed as described previously (10,17) with minor modifications. After transformation with plasmids (pCas1-2H, p5M, pY22A, and pΔC), E. coli IYB5101 expressing WT or a mutant of Cas1-2 was subjected to three cycles of growth and induction in LB medium supplemented with 100 μg/ml spectinomycin, 0.2% L-arabinose, and 0.1 mM IPTG for 16 h at 37°C. After each cycle, cultures were diluted 1:300 with fresh LB medium containing the aforementioned supplements, and the growth was continued for 16 h. Thereafter, genomic DNA was isolated according to the manufacturer's protocol (HiPurA bacterial genomic DNA purification kit, Himedia) and used as a template for PCR to monitor spacer integration at CRISPR 2.1. All of the PCR-amplified samples were resolved on 1.5% agarose gels to identify the DNA bands corresponding to parental and expanded arrays (parental array + n × 61 bp, where n is a positive integer). DNA quantities corresponding to parental and expanded arrays were quantified by densitometric analysis. Utilizing these values, the percentage of spacer integration for each Cas1-2 variant was estimated (% integration = ((amount of expanded array)/(amount of parental array + amount of expanded array)) × 100).
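The two calculations in this paragraph can be illustrated with a minimal Python sketch. It only restates the expressions given above (parental array + n × 61 bp, and the percentage of integration from densitometric band quantities); the function names and the example parental amplicon size are hypothetical and are not taken from the study.

```python
# Minimal sketch of the arithmetic described in the text.
# Function names and any example values are illustrative assumptions.

SPACER_REPEAT_UNIT_BP = 61  # size added per acquired spacer, as stated in the text


def expanded_array_size(parental_bp: int, n_spacers: int) -> int:
    """Expected amplicon size after n acquisitions: parental + n x 61 bp."""
    if n_spacers < 1:
        raise ValueError("n_spacers must be a positive integer")
    return parental_bp + n_spacers * SPACER_REPEAT_UNIT_BP


def percent_integration(parental_amount: float, expanded_amount: float) -> float:
    """% integration = expanded / (parental + expanded) x 100."""
    total = parental_amount + expanded_amount
    if total <= 0:
        raise ValueError("total band quantity must be positive")
    return 100.0 * expanded_amount / total
```

For instance, with a hypothetical parental amplicon of 300 bp, `expanded_array_size(300, 1)` gives the band size expected after a single acquisition, and `percent_integration` takes the two densitometric quantities in any consistent unit.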

High-throughput sequencing and analysis
To understand the effect of Cas1-2 mutants on prespacer scaling and PAM selectivity, high-throughput sequencing was performed to derive the sequences of newly incorporated prespacers. Expanded CRISPR arrays corresponding to the expression of each Cas1-2 variant were extracted from the agarose gels (QIAquick Gel Extraction Kit, Qiagen). Approximately 200 ng of each PCR product was further purified using HighPrep magnetic beads (MAGBIO). These purified samples were subjected to DNA end repair and adaptor ligation using the Illumina-compatible NEXTflex Rapid DNA sequencing kit (BIOO Scientific, Austin, TX). Subsequently, the ligated DNA products were purified with HighPrep magnetic beads, and further enrichment was achieved by eight cycles of PCR with Illumina-compatible primers (NEXTflex DNA-sequencing kit). These amplicons were subjected to an additional step of purification with HighPrep magnetic beads and were sequenced on a MiSeq 300 paired-end platform.
The paired-end reads were subjected to several preprocessing steps as described below. First, both F and R reads with a Phred score less than 20 were removed using fastq_quality_trimmer from the FASTX-Toolkit version 0.0.13. The remaining F and R reads were trimmed in paired-end mode to remove F (5′-AGATCGGAAGAGCACACGTCTGAACTCCAGTCA-3′) and R (5′-AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT-3′) adapter sequences using Cutadapt 1.18 (86). Following this, the leader-proximal spacer sequence (S0) was selectively retrieved in FASTA format. These S0 sequences, derived from E. coli expressing WT and mutants of Cas1-2, were searched against the plasmid (pCas1-2[K]) (85) and the E. coli K-12 MG1655 genome (GenBank™ assembly accession: GCA_000005845.2) using BLASTN (87). From the BLAST hits, we identified the location of spacer sequences on the plasmid and the E. coli K-12 MG1655 genome and extracted the triplet sequence corresponding to the PAM. The conservation of the PAM was analyzed using WebLogo (88). For all sequence manipulations, including the extraction of S0 and PAM sequences, we employed custom-written Python code utilizing the Biopython library (89).
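The PAM extraction step can be sketched as follows. This is not the authors' script, which is not reproduced in the text; it is a plain-Python illustration under stated assumptions: hit coordinates are 1-based and inclusive, a minus-strand hit is reported with start greater than end (as in BLAST tabular output), the PAM is taken as the three nucleotides immediately 5′ of the protospacer on the strand that matches the spacer, and hits too close to the end of a linear reference are skipped rather than wrapped around a circular molecule. All function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_pam(reference: str, start: int, end: int, pam_len: int = 3) -> Optional[str]:
    """Return the pam_len nucleotides 5' of a protospacer hit.

    start/end are 1-based inclusive subject coordinates; start > end is
    treated as a minus-strand hit (assumption: BLAST tabular convention).
    Returns None when the flank would run off the linear reference.
    """
    if start <= end:  # plus-strand hit: flank lies just before `start`
        lo = start - 1 - pam_len
        if lo < 0:
            return None
        return reference[lo:start - 1].upper()
    # minus-strand hit: the 5' flank lies just after `start` on the plus strand
    hi = start + pam_len
    if hi > len(reference):
        return None
    return reverse_complement(reference[start:hi]).upper()


def pam_counts(reference: str, hits: Iterable[Tuple[int, int]]) -> Counter:
    """Tally PAM triplets over (start, end) hits, skipping unusable hits."""
    counts: Counter = Counter()
    for start, end in hits:
        pam = extract_pam(reference, start, end)
        if pam is not None:
            counts[pam] += 1
    return counts
```

The resulting tallies, or the list of extracted triplets, could then be passed to a sequence-logo tool; handling of circular references and ambiguous bases is deliberately left out of this sketch.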

Analysis of Cas1 C-terminal tail across CRISPR-Cas type I systems
Locus IDs corresponding to Cas1 genes of type I-A, I-B, I-C, and I-E were derived from the previous study (90). Utilizing these identifiers, we extracted Cas1 protein sequences of 36 type I-A, 150 type I-B, 129 type I-C, and 116 type I-E organisms from the NCBI protein database. Subsequently, we performed multiple sequence alignments in the T-COFFEE web server (91) for Cas1 proteins of each subtype separately (Files S1-S4). Utilizing Cas1 crystal structures of A. fulgidus (type I-A; PDB code 4N06), P. horikoshii (type I-B; PDB code 4WJ0), and E. coli (type I-E; PDB code 5DQZ) as a reference, C-terminal tail residues of each Cas1 were extracted from their multiple-sequence alignments. Owing to the absence of a type I-C Cas1 crystal structure in the PDB, the Cas1 (BH0341) structure of B. halodurans was predicted by the I-TASSER web server (82) using structures corresponding to PDB entries 3LFX, 4N06, and 2YZS as threading templates. I-TASSER predicted five different models for B. halodurans Cas1. Among these, the model with the highest confidence score (C-score = 1.25) was used as a structural reference for predicting the C-terminal tail residues from a multiple-sequence alignment of 129 type I-C Cas1 proteins.
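One way to picture the tail-extraction step is the sketch below. It assumes that the first tail residue of the structural reference is known by its residue number, maps that residue to an alignment column, and counts the non-gap residues from that column onward in every aligned sequence. The function names, the gap character, and the use of a sample standard deviation for the spread are assumptions for illustration; the text does not specify how the tail boundary or the error bars were computed.

```python
from statistics import mean, stdev
from typing import Dict, Tuple

GAP = "-"  # assumed gap character in the aligned sequences


def alignment_column_of_residue(aligned_reference: str, residue_number: int) -> int:
    """Map a 1-based residue number of the ungapped reference to a 0-based column."""
    seen = 0
    for column, char in enumerate(aligned_reference):
        if char != GAP:
            seen += 1
            if seen == residue_number:
                return column
    raise ValueError("residue_number exceeds the reference length")


def tail_lengths(aligned: Dict[str, str], start_column: int) -> Dict[str, int]:
    """Count non-gap residues from start_column to the end of each aligned sequence."""
    return {
        name: sum(1 for char in seq[start_column:] if char != GAP)
        for name, seq in aligned.items()
    }


def summarize(lengths: Dict[str, int]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation); needs at least two sequences."""
    values = list(lengths.values())
    return mean(values), stdev(values)
```

In use, the aligned sequences of one subtype would be loaded into a dictionary keyed by sequence name, the column would be derived from that subtype's structural reference, and `summarize` would give the per-subtype average tail length.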
Utilizing the ESPript server (92), the positions of various secondary structural elements were mapped onto the multiple-sequence alignments of type I-A, I-B, I-C, and I-E Cas1 proteins (Figs. S6-S9). All of the protein structural representations were generated using ChimeraX (93).