How repertoire data are changing antibody science

Antibodies are vital proteins of the immune system that rec-ognize potentially harmful molecules and initiate their removal. Mammals can efficiently create vast numbers of antibodies with different sequences capable of binding to any antigen with high affinity and specificity. Because they can be developed to bind to many disease agents, antibodies can be used as therapeutics. In an organism, after antigen exposure, antibodies specific to that antigen are enriched through clonal selection, expansion, and somatic hypermutation. The antibodies present in an organism therefore report on its immune status, describe its innate ability to deal with harmful substances, and reveal how it has previously responded. Next-generation sequencing technologies are being increasingly used to query the antibody, or B-cell receptor (BCR), sequence repertoire, and the amount of BCR data in public repositories is growing. The Observed Antibody Space database, for example, currently contains over a billion sequences from 68 different studies. Repertoires are available that represent both the naive state ( i.e. antigen-inexperienced) and that after immunization. This wealth of data has created opportunities to learn more about our immune system. In this review, we dis-cuss the many ways in which BCR repertoire data have been or could be exploited. We highlight its utility for providing insights into how the naive immune repertoire is generated and how it responds to antigens. We also consider how structural information can be used to enhance these data and may lead to more accurate depictions of the sequence space and to applications in the discovery of new therapeutics.

Antibodies are vital proteins of the immune system that recognize potentially harmful molecules and initiate their removal. Mammals can efficiently create vast numbers of antibodies with different sequences capable of binding to any antigen with high affinity and specificity. Because they can be developed to bind to many disease agents, antibodies can be used as therapeutics. In an organism, after antigen exposure, antibodies specific to that antigen are enriched through clonal selection, expansion, and somatic hypermutation. The antibodies present in an organism therefore report on its immune status, describe its innate ability to deal with harmful substances, and reveal how it has previously responded. Next-generation sequencing technologies are being increasingly used to query the antibody, or B-cell receptor (BCR), sequence repertoire, and the amount of BCR data in public repositories is growing. The Observed Antibody Space database, for example, currently contains over a billion sequences from 68 different studies. Repertoires are available that represent both the naive state (i.e. antigen-inexperienced) and that after immunization. This wealth of data has created opportunities to learn more about our immune system. In this review, we discuss the many ways in which BCR repertoire data have been or could be exploited. We highlight its utility for providing insights into how the naive immune repertoire is generated and how it responds to antigens. We also consider how structural information can be used to enhance these data and may lead to more accurate depictions of the sequence space and to applications in the discovery of new therapeutics.
Antibodies are proteins that play a key role in the adaptive immune response. They are produced by B cells and are either secreted or membrane-bound (in the latter case, they are known as B-cell receptors, or BCRs). They are able to neutralize and initiate the removal of foreign entities (known as antigens) from the body by binding to them (1). The ability of the immune system to respond to a huge range of antigens originates in the diversity of the antibodies that can be generatedantibodies can be produced that bind to nearly every antigen, with both high specificity and affinity (2). This property has made antibodies highly successful as therapeutics; to date, 87 have been approved for use in the clinic across a number of disease areas, and many more are undergoing clinical trials (3,4). Antibodies are currently the largest class of biotherapeutic (5).
It is estimated that the human antibody repertoire contains around 10 13 unique sequences (6). This diversity is a result of how the proteins are encoded in the genome. Antibodies are composed of two types of protein chain, known as the heavy and light chains (Fig. 1). Each of these is encoded by multiple gene segments that are spliced together using a process called V(D)J recombination (7). The sequence for the light-chain variable region (Fv) is made up of two segments: the variable segment (V) and the joining segment (J). The heavy chain is encoded from variable, joining, and diversity (D) segments. There are many genes for each of the V, D, and J segments, which can be matched up in different combinations to produce a diverse range of antibody sequences. Further diversity is introduced through the insertion or deletion of nucleotides at the segment junctions (8) and somatic hypermutation (a process through which the number of random mutations that occur is increased) (9). The majority of the variation in sequence occurs in the complementarity-determining regions, or CDRs-there are three of these on each of the heavy and light chains. The most variable of these is the H3 loop (the third CDR on the heavy chain), because the DNA encoding it is found at the join between the V, D, and J segments. By creating a large, diverse repertoire of antibody sequences, an individual is able to react to almost any antigen it may encounter.
The ability of an antibody to bind to its target antigen is governed by its three-dimensional structure. Knowledge of an antibody's structure therefore allows for a deeper understanding of its physicochemical properties than can be gained from sequence alone. The general structure of an antibody is depicted in Fig. 1 The heavy and light variable domains both adopt a b-sandwich structure known as the immunoglobulin fold. Framework (non-CDR) regions are very highly conserved between different antibodies; in accordance with the observed variability of antibody sequences, the structural diversity that allows binding to many different targets occurs mainly in the CDRs. These correspond to loops in the three-dimensional structure, which are responsible for most of the antigen-binding interactions (10). For five of the six CDRs (H1, H2, and L1-L3), structural diversity is limited-only a few different shapes have been observed, forming a set of discrete conformational classes known as canonical structures. However, as described above, the H3 loop is much more variable in sequence than the other CDRs and consequently is also more structurally diverse. It is thought that the H3 loop contributes the most to antigenbinding properties (11,12).
Upon exposure to an antigen, antibodies that are able to bind to it do so and are thus selected from the repertoire (clonal selection) (13). Having a large repertoire of antibodies present in the body at any time increases the chance that at least one has the ability to bind to the antigen, even if only weakly, thereby allowing the initiation of an appropriate immune response. B cells producing binding antibodies undergo cycles of proliferation (clonal expansion) with simultaneous somatic hypermutation (9) to produce antibodies with higher affinity. The antibody repertoire is consequently enriched with antibodies that bind to the target antigen.
The antibodies present in an organism therefore describe both its current and past immune status; what it is able to respond to, and what it has previously dealt with. Whereas previously only a handful of sequences could be obtained at a time, technological advances mean that large snapshots of this repertoire can now be obtained using next-generation sequencing approaches. This technique of BCR repertoire sequencing was first described by Glanville et al. in 2009 (14), and since then the volume of data available has increased exponentially (Fig.  2). As it is the H3 loop that mostly determines binding properties, many studies have focused only on sequencing this region. However, BCR repertoires containing full-length sequences are increasingly being produced-commonly only the heavy chain (15), but some studies have focused only the light chain (e.g. Refs. 16 and 17), and some data sets include both (e.g. Refs. 18 and 19). Recent advances in sequencing technology have led to a small but growing number of repertoires that also include native pairing information (i.e. which heavy-chain sequences belong with which light-chain sequences).
The largest repertoire sequencing study to date, by Briney et al. (20), alone resulted in a set of over 300 million heavy-chain sequences. In addition, many algorithms and pipelines have now been created that preprocess the generated data ready for analysis, performing tasks such as translation from nucleotides to amino acids, error estimation and correction, and sequence numbering (21). Recently, efforts have been made to create standardized, publicly available repositories for these sequencing data (e.g. iReceptor (22), VDJServer (23), Immu-neDB (24), and others (25)(26)(27)(28)(29)). This has provided researchers with easy access to a vast number of sequences and created Figure 1. A, antibody structure. An antibody is made up of four chains: two light (orange) and two heavy (blue). Each chain is made up of a series of domainsthe variable domains of the light and heavy chains together are known as the Fv region (shown on the right; PDB entry 12E8). The Fv features six loops known as CDRs (shown in dark blue); these are mainly responsible for antigen binding. B, example sequences for the VH and VL, highlighting the CDR regions and the genetic composition. opportunities for large-scale data mining. The Observed Antibody Space (OAS) database, for example, which collates fulllength variable region sequences, currently contains over 1 billion sequences spanning 68 different studies (28).
The studies included in OAS cover many different repertoire characteristics. Sequences are available for six different species, with the majority (64%) being human. Diseased states are represented (i.e. repertoires from individuals who have been exposed to a specific antigen) as well as healthy ones (meaning the individual has not been exposed to the antigen of interest and also has not suffered from a disorder of the immune system). Repertoires from vaccination studies also feature (e.g. HIV, hepatitis B, flu, etc.), and in some cases, OAS has the repertoires of the same individual both pre-and post-immunization. Although the snapshots of the repertoire achieved through sequencing are actually small relative to the potential number of antibodies present in an organism (e.g. data sets in OAS contain between 20,000 and 300 million redundant sequences) and most studies feature only the heavy chain or have no pairing information, the data available still provides opportunities to investigate many different aspects of the immune response. In this review, we explore what can be done with the wealth of antibody sequence data stored in repositories such as OAS. We give examples of how this data has been used to give insights into the workings of the immune system, look at how it can be enhanced with structural information, explore how it offers new avenues for therapeutic antibody discovery and development, and consider what advances may be made in the future.

Biological insights from antibody repertoire data
Until the advent of BCR repertoire sequencing, antibody sequences were analyzed in much smaller numbers (normally a few hundred B cells per experiment (15)), only a tiny fraction of the estimated total repertoire. This approach can be useful when investigating a few key antibodies (e.g. those that bind to an antigen of interest (e.g. Refs. 30 and 31)) but cannot give an in-depth view of the repertoire as a whole (e.g. little can be learned about its diversity). Analysis of larger repertoire snapshots, on the other hand, gives a much more detailed picture and can provide valuable insights into how the immune system works. It can be used to explain how in its naive state (i.e. before exposure to a given antigen) it is capable of protecting against such diverse threats and can give a deeper understanding of the processes that produce higher-affinity antibodies after antigen exposure.
Sequencing data has been used to learn more about the underlying mechanisms that shape the repertoire, such as V(D)J recombination (32,33). Increasing amounts of large-scale sequence data, along with the development of computational tools that annotate sequences with their V(D)J gene origins (34)(35)(36)(37), have allowed trends in this process to be identified. It has been shown that the process is intrinsically biased; the available V, D, and J segments in the genome are not used with the same frequency, and therefore some combinations are observed more commonly than others (14,(38)(39)(40)(41). Mathematical models of V(D)J recombination have been developed that repro-duce the natural biases (42,43). It has been proposed that this has the potential to aid in the discovery of new antibody therapeutics-replicating the underlying architecture of observed human repertoires should lead to the creation of more humanlike (and hence less immunogenic) screening libraries (44).
During the proliferation of B cells in clonal selection, the rate of mutation is increased up to 106-fold (45) compared with normal cells, due to somatic hypermutation (as described earlier). Variations on the original antigen-binding antibody sequence are therefore generated, and higher-affinity antibodies are iteratively produced. Repertoire data has been used to analyze this process (46)(47)(48)(49)(50). This has increased our understanding of mutation frequencies, substitution bias, and the location of mutation hot spots and, hence, how the repertoire reacts to an antigenic stimulus. For example, researchers have demonstrated that memory cells of different isotypes experience different selection pressures (46) and that substitution profiles vary between V genes (47), are dependent on neighboring bases, and are conserved across individuals (48). As in the case of V(D)J recombination, these insights have enabled accurate models of somatic hypermutation to be established (49,50). These models have led to the creation of software that simulates repertoires (51) and mean that more accurate B-cell lineages can be established (49). These phylogenies have the potential to be used in the identification of antibodies with high binding affinities (50).
Researchers have also investigated the interplay between all the processes that dictate repertoire diversity to ascertain how much is genetically predetermined and how much is antigendriven; analysis indicates that both are important factors, but genetics are more influential (39). Further research has compared the repertoires of humans and other species (52,53), revealing that immune system development is broadly similar across different mammals (53) and that mice BCR repertoires tend to be closer to germline sequences than those of humans (52). The effect of disease on the immune system has also been studied (54) and has indicated that repertoire analysis can have more practical applications; for example, it can be used to monitor the diversity of the repertoire before and after an organ transplant (55), and machine learning methods have been used to predict vaccination status or the presence of disease (56)(57)(58).
The overall architecture of the antibody repertoire can be investigated by inferring relationships between sequences (i.e. by predicting which ones originated from the same precursor antibody and hence which bind to the same antigen). One approach is to consider the repertoire as a network, with each sequence being a separate node and the presence of an edge between them indicating an evolutionary relationship (44). These relationships are normally defined based on sequence identity; for example, two sequences can be connected if they differ by one amino acid in their H3 region (44). Common network analysis metrics can then be used to explore the repertoire architecture-for example, the degree distribution (the degree of a node is the number of edges it is connected to) can reveal the presence or absence of clonal expansion (33), because highly connected nodes are likely to represent sequences derived from a common precursor during affinity maturation (Fig. 3).
Clonotyping is another related way of investigating the diversity of repertoires and, in particular, how they change upon antigen exposure. Similar antibody sequences are clustered into "clonotypes"; these are generally defined as sequences originating from the same V and J genes and with H3s that are the same length and similar in sequence (normally a sequence identity of 80-100%) (59-62), although alternative approaches have been used (63). Antibodies belonging to the same clonotype are assumed to share the same precursor sequence (i.e. they arose from the proliferation of the same B cell) and are therefore predicted to bind to the same epitope. This is therefore a method of monitoring the clonal selection and expansion that occurs after exposure to an antigen and can be used to identify the antibodies that bind to a particular target.
Because the repertoires of many individuals have now been sequenced, we can compare them to identify which characteristics of the repertoire are shared and which are unique to each organism. The idea of "public sequences" has recently been proposed-a set of sequences or clonotypes that are observed in the repertoires of two or more individuals (20,44,61,(64)(65)(66). One may expect that this is rare, due to the enormous potential number of sequences (estimated at 10 13 ) and the relatively small proportion of those sequences sampled in current data sets (the largest samples from a single individual currently have on the order of 10 6 sequences). However, whereas repertoires are largely unique to the organism (67), it has been shown that individuals share more heavy-chain sequences than would be expected by coincidence. Briney et al. (20), in their recent large-scale study, showed that in the repertoires of 10 individuals, on average 0.95% of clonotypes were shared between at least two subjects, and 0.022% were common to all 10. The pool of subjects contained both men and women, individuals from both Caucasian and African American ethnic backgrounds, and a variety of blood types; the authors report that the repertoires did not cluster based on these factors. The work of Soto et al. (64) indicates that this public subrepertoire could be even . The process of affinity maturation and methods of analyzing the resulting antibody repertoires. A, upon exposure to an antigen, those antibodies present in the naive repertoire that are able to bind to it proliferate, undergoing somatic hypermutation to produce variations upon the initial binder. Successive rounds of this process produce antibodies with high affinity. B, clonotyping groups antibodies in the repertoire based on sequence similarity; normally they must originate from the same V and J genes and have an H3 sequence identity of 80-100%. Antibodies of the same clonotype are predicted to bind to the same epitope. C, network analysis of antibody repertoires, where each node is a different sequence and edges are present between them if they meet set sequence similarity criteria. The lineages of different antibodies can be inferred using this method. larger, making up between 1 and 6% of the whole. Greiff et al. (68) have used machine learning techniques, trained on publicly available data sets such as those in OAS, to predict the public or private nature of a given sequence with 80% accuracy, hinting that this property is not random and that there are fundamental characteristics of the sequences that separate the two subsets. In their network-based analysis of antibody H3 sequences, where each node is a unique H3 sequence, Miho et al. (44) demonstrated that public clonotypes were among the most connected nodes (i.e. they are similar in sequence to many other nodes) and that most private clonotypes (74%) were connected to at least one public one. The removal of public clonotypes from the network therefore changed the underlying repertoire architecture; however, the system was robust to the removal of a large number of randomly selected clonotypes. This implies that public clonotypes are key in maintaining functional immunity against antigens, whereas the presence of other clonotypes is able to fluctuate over time.
Light chain data has also been analyzed; VL sequences are less diverse than their VH counterparts (52,69,70), so the percentage of the repertoire comprising public sequences is much larger. For instance, Soto et al., in a three-individual experiment, observed that 20-34% of light chains (of both k and l types) were shared by at least two people (64).
Overall, the presence of shared clonotypes across different individuals, although small, may signal the existence of a baseline common functionality of the immune system. This core subset of the repertoire may be responsible for an organism's response to common antigens (66), and it has been hypothesized that these public clonotypes are more likely to display low levels of immunogenicity and be more versatile binders and hence may be useful starting points in therapeutic development (71). 1 Combining sequence with structure Although much can be learned from sequences alone, it is the three-dimensional structure of the antibody that determines how it interacts with an antigen and therefore governs its binding properties (1,72). It is known that CDRs belonging to the same canonical class (i.e. that have nearly identical structures) can have very different sequences, and conversely H3 loops with similar sequences can adopt different conformations ( Fig. 4) (73). Therefore, by considering sequence alone (e.g. in clonotyping), antibodies may be grouped together that have structurally dissimilar binding sites, and vice versa (74). It is therefore crucial to consider structure as well as sequence to allow more accurate comparisons to be made and to properly understand antibody function.
Antibody structures can be obtained experimentally, normally through X-ray crystallography or NMR. However, the sequencestructure gap is large-whereas OAS consists of over a billion sequences, SAbDab, a database of publicly available antibody structures (75), currently contains ;4000 entries. This is because experimental structure determination is time-consuming and hence low-throughput; as such, it can be used to probe the chemistry of a select few sequences (76,77), but it cannot yet be used to structurally characterize a BCR repertoire.
Computational modeling offers an alternative. It has been shown that the majority of antibody sequences from BCR repertoires can be mapped to known structures (74). A number of algorithms have been developed that predict the structure of an antibody's Fv region from its sequence (78)(79)(80)(81)(82)(83)(84)(85)(86)(87)(88)(89)(90)(91). Due to the conserved nature of the antibody framework structure (see Fig. 1) and the existence of canonical classes, these tools generally rely on homology modeling (i.e. an existing structure with high sequence identity to a segment of or to the whole target is used as a template). Normally the structure is considered as separate regions, first the frameworks of the VH and VL and then the six CDRs. Separate templates may be chosen for the VH and VL; however, if a single template is available with high sequence identity to both chains, only one is required (78). In this case, the orientation of the two chains can be directly copied from the chosen template; otherwise, a further template that is similar in sequence to both chains is required, or the orientation between the chains must be predicted (92). The framework can be modeled with very high accuracy, typically with a root mean square deviation (RMSD) of below 1 Å. In the second Antibody Modelling Assessment (AMA-II), a blind test of prediction accuracy, VH and VL were modeled with an average backboneatom RMSD of 0.65 and 0.50 Å, respectively (87,88,90,(93)(94)(95)(96). Prediction of the orientation of the two domains was more challenging, however, with predicted tilt angles differing from the true angle by 5-12 8 (93).
Once a framework template has been selected, CDR structures can then be predicted, again using templates through knowledge-based loop modeling algorithms. As mentioned previously, in the majority of cases, CDRs L1-L3, H1, and H2 . Sequence is not always a reliable indicator of structural similarity. A, L1 loops of the PDB entries 3PHO (red) and 3QUM (blue). The two loops differ in sequence at every position except one (sequence identity = 10%); however, they have very similar conformations (RMSD = 0.60 Å). B, H3 loops of the PDB entries 5I1G (red) and 5I1C (blue). These loops have very similar se-quences (sequence identity = 92%) and therefore may be predicted to have similar structures; however, this is not the case (RMSD = 4.15 Å). RMSDs were calculated across all backbone atoms after superposition of the anchor residues (two residues on each side of the loop, shown in gray). adopt a limited number of known conformations known as canonical classes (97)(98)(99). As a result, they can be predicted accurately and quickly using this technique. Templates are selected from a database of known CDR structures based on sequence identity and the geometry of the anchor residues (the residues on either side of the CDR). The database of CDR structures can either include all known structures or be limited to the known conformations for the predicted canonical class of the target (78,80). Average RMSDs achieved during AMA-II ranged from 0.50 Å for L2 to 1.6 Å for L3 (93).
H3 can also be modeled using this method; however, its sequence and structural diversity compared with the other CDRs makes prediction more challenging (100). The H3 loop has also been shown to be structurally distinct from typical protein loops (101); researchers have therefore developed specialized software to model H3 loops more accurately (102)(103)(104)(105). Ab initio techniques, which create potential loop conformations without knowledge of templates, are often used here, either in isolation or in combination with knowledge-based strategies as a hybrid algorithm (102). Despite the existence of H3-specific prediction algorithms, H3 modeling remains challenging, achieving RMSDs normally in the region of 2-3 Å (74, 93). In addition, ab initio methods typically require much longer run times than knowledge-based methods, and therefore H3 prediction is currently the main bottleneck for accurate modeling of BCR repertoires. Attempts have been made to circumvent this issue, either by imposing an H3 length cutoff (long loops are modeled less accurately due to the absence of experimental data) (107) or by only considering those H3 sequences that can be confidently modeled using a knowledge-based algorithm (74,107). 1 Whereas this may introduce some biases into the analysis-for example, long H3 loop structures will be underrepresented in model libraries-it increases the confidence we have in the models that are considered and subsequently in the conclusions that are drawn.
Several studies have used antibody modeling to enhance the information given by BCR repertoires. DeKosky et al. (108) modeled 2,000 VH/VL pairs using RosettaAntibody (82,83), limiting their sequences to those with high-identity templates available. They analyzed the physico-chemical properties of the antibodies, such as solvent-accessible surface area and hydrophobicity, and were able to demonstrate how these properties change with antigen experience and link their observations to germline usage. Raybould et al. (106) used ABodyBuilder (78) to predict the structures of a large subset of a BCR repertoire (;19,000 sequences) and compared these models with those of a set of therapeutics to deduce which properties are required to reduce developability issues. Because antibody properties can be predicted with greater accuracy with the inclusion of structural data (109), models representing the repertoire have the potential to improve strategies such as directed design by using them as inputs to other computational tools (e.g. predictors of the sets of residues on the antibody and antigen that are involved in binding (known as the epitope and paratope respectively) and developability predictors).
One problem with modeling the antibody sequences obtained through repertoire sequencing is that they are nor-mally not paired (i.e. we do not know which VH belongs with which VL). Native pairings are important in creating accurate models that represent the repertoire and will affect the properties of the antibody, such as its folding, stability, expression, and binding. Pairing is currently thought to be mostly random (20,65), meaning that most VH chains are capable of associating with most VLs. Prediction of true pairings is therefore difficult. Techniques currently used to propose likely pairings include comparing all of the potential interfaces with those observed in known structures (106), 2 pairing based on the relative frequency of the sequences (110), or by constructing phylogenetic trees (111). Recently, experimental methods for immunoglobulin sequencing that preserve native pairings have been developed (112); as these techniques become more widespread, the amount of paired data will increase, and these approximations will no longer be required.
Producing complete models of the antibody variable region can be time-consuming; for example, in the study by DeKosky et al. (108), RosettaAntibody took 570,000 CPU hours to produce 2,000 models. Even for algorithms that are considered to be fast, execution times would be prohibitive-ABodyBuilder, for example, takes on average 567 CPU hours per 1,000 sequences (78). An alternative, faster method of characterizing a repertoire is the structural annotation of sequences. Instead of running a complete modeling protocol, sequences can be quickly matched up to their predicted templates using sequence identity. The conformations of the CDRs can be assigned by either exploiting a knowledge-based loop modeling algorithm (74) or a canonical class predictor (for the non-H3 CDRs) (99,107). Sequences can therefore be structurally annotated in much greater numbers than could be done using modeling tools. It has been shown that the majority of sequences can be mapped to an existing structure in this way (74).
SAAB (Structural Annotation of Antibodies) (74) and its successor SAAB1 (107) are algorithms that have been used to annotate millions of sequences with their proposed template structures, allowing thorough analysis of repertoire-wide structural properties. For example, Kovaltsuk et al. (107) investigated structural changes that occur with B-cell differentiation. Clustering based on their proposed H3 templates resulted in the separation of antibodies from different stages of the immune response, indicating that there are structural changes that occur as the response progresses. The effect of aging on the repertoire has also been studied in this way, revealing that older individuals have a higher number of antibodies that are structurally distinct from the germline (113).
The idea of public sequences has been extended to that of public structures. Instead of searching for sequences that are observed in the repertoires of multiple individuals, we can look instead for antibodies with shared backbone conformations, which may be a greater indicator of common functionality. Sequence-only analyses have shown that the shared space is present but only makes up a small percentage of the overall repertoire (20); however, by incorporating structure it can be seen that the public repertoire is likely to be much larger (107). 1 BCR repertoire sequencing and therapeutic discovery Discovering antibodies specific to an antigen of interest Currently, potential therapeutic antibodies are commonly discovered in two ways: through the immunization of an animal, such as a mouse, with the target antigen and subsequent extraction of the antibodies it produces and through phage display, where viruses displaying antibodies on their surface are screened against the target antigen. High-throughput sequencing of the antibody repertoire has been used successfully to enhance both approaches. For example, researchers have genetically engineered mice such that they contain human antibody genes-the antibodies produced by these mice are therefore less likely to be immunogenic. The "humanness" of the repertoire was validated through sequencing of the mouse BCR repertoire (114). Sequencing techniques have been used to characterize phage display libraries, to monitor their diversity and hence evaluate their capability of isolating antibodies that bind to different antigens (115). Screening libraries can also be designed using BCR repertoire data-Zhai et al. (116) and Prassler et al. (117) have shown how this is possible, by reproducing the observed amino acid usages at each sequence position. Both groups found that the antibodies in their libraries exhibited better expression levels than other synthetic libraries, with high genetic diversity, and they were able to isolate highaffinity antibodies for a range of different antigens.
It is now becoming possible to identify binders directly from BCR repertoire data. If an antibody that binds to the target antigen is already known, approaches such as clonotyping can be used to identify more potential binders with closely related sequences, expanding the pool of candidates that can be taken forward for further study. Known binders are not essential, however. The immunization of an organism with an antigen, as explained previously, leads to the enrichment of the repertoire with antibodies that bind to that antigen. Therefore, by analyzing how often a given sequence or clonotype appears in the repertoire after antigen exposure, specific antibodies can be identified. This approach can be used either to find antibodies that might work as therapeutics or to monitor the immune response during the development of vaccines (66,(118)(119)(120)(121)(122). The repertoires of multiple individuals who have been exposed to the same antigen can be investigated to find potential binders, by identifying common features that hint at shared functionality (e.g. identical H3 sequences) (123). The volume of data produced also means that deep learning techniques can be used effectively; for example Mason et al. (124) have generated neural networks that classify antibodies as HER2 binders or nonbinders based on sequence and thereby successfully identified 30 antigen-specific antibodies. BCR repertoire sequencing experiments have been carried out to discover binders for a wide range of antigens, including HIV (71,111,125,126), Ebola (127), hepatitis B (66,128), and many others (77,110,116,118,120,123,(128)(129)(130)(131)(132)(133).
Following the isolation of binders in this way, a small number can be taken forward as starting points for further development (77), or a larger number can be employed as a targeted screening library (110). A comparison between repertoire mining and phage display has demonstrated that the antibodies isolated by each method are not necessarily the same, and therefore it could be beneficial to use the two techniques together (129).
Much of the data from these experiments has been deposited in public sequence repertoires (28), meaning it can be exploited by other researchers in their therapeutic discovery pipelines, for example to provide new lead molecules. It has recently been shown that there is a close sequence match to many known therapeutic antibodies in the OAS database (134). Of 242 antibodies that are either currently used as therapeutics or undergoing clinical trials (Phase II or later), sequences with over 90% identity were available for 90 H chains and 158 light chains. Notably, for H3, which is thought to contribute the most to an antibody's binding properties, 54 perfect matches were found. Given the huge number of potential sequences, this is significantly more than would be expected by chance alone in a sequence database of this size (around 1 billion sequences) and implies that artificially developed sequences are not dissimilar from their natural counterparts. It therefore follows that natural sequence repertoires could potentially be mined for new therapeutic leads, perhaps removing the need for large-scale screening experiments at the beginning of an antibody discovery project.
Structural annotations and modeling can also be applied to discover antigen-specific antibodies. Krawczyk et al. (74) annotated ;3.4 million sequences from individuals who had been exposed to the influenza virus with their proposed templates and therefore whose repertoires were enriched with influenza-specific binders. They discovered that many of the templates assigned came from known influenza-binding antibodies. They therefore propose that sharing of a similar structural template could be an indication of similar specificity. Assuming that a structure of an antibody specific to a given antigen or epitope is known, antibodies can be selected from a repertoire if they are predicted to have a high degree of structural similarity to it. Other computational tools can also be exploited to find potential therapeutics: a large set of models generated from repertoire data can be used as an in silico screening library (135) 1 in conjunction with epitope predictors (109,(136)(137)(138)(139)(140)(141)(142)(143)(144), paratope predictors (72,(145)(146)(147)(148)(149)(150), and docking algorithms (83,(151)(152)(153)(154)(155)(156)(157)(158)(159)(160)(161)(162)(163)(164). As computational methods continue to improve and become faster, this approach will become more accurate and more feasible, potentially making an entirely in silico antibody discovery platform a reality.
However, issues arise due to most sequencing experiments focusing on only the heavy chain and unknown native pairings even when both the heavy and light chains are sequenced. Antibodies with high affinity and specificity are identified more often when the true VH/VL pairings are known (165); however, this is not achievable with most of the available data. As previously stated, single-cell approaches that retain pair information have been developed (112); however, the method is not as highthroughput as other sequencing techniques, so less data is currently available. In the future, this is likely to change, but for now, other approaches must be applied. For experiments resulting in both heavy-and light-chain sequences, pairings can be exhaustively tested for plausibility (135) 1 or by observing relative frequencies (110). Alternatively, especially when light chains have not been sequenced, it may be possible to use an artificial light chain with the ability to associate with a range of heavy chains (166). The concept of public sequences may also help here; a subset of the public light chain sequences could be used as a pairing library, as these sequences are clearly widely used and are therefore more likely to form successful pairings. In general, known public sequences may be a good place to start when attempting to discover a new therapeutic (e.g. in the design of a screening library), because they are likely to have low immunogenicity and be of high importance in the immune response to many common antigens.
Using BCR repertoire data to identify undesirable properties during therapeutic development Binding affinity is not the only feature of a potential therapeutic that needs to be optimized. In addition to being biologically active, it must be safe to administer to humans and be able to withstand the stresses of the production process (i.e. the antibody should have good "developability") (167). Antibodies discovered through the immunization of an organism (such as a mouse) against the target antigen cannot be used directly as therapeutics, because they would be identified as nonnative by the human immune system and would therefore cause an unwanted response themselves (168). Changes made to potential therapeutics during the development process can also introduce nonhuman-like characteristics. It is therefore desirable to be able to quantify the similarity of a sequence to those from natural human repertoires (its "humanness") and to propose changes that could be made to a sequence to make it more human and hence less likely to be rejected by a patient. This "humanization" process can be guided through comparisons with human BCR repertoires, because they are natural and represent what is "allowed" and what is safe in an organism (see Fig. 5). Previous work has used small sets of reference sequen-ces (such as known germline sequences) to infer humanness (169)(170)(171), but the growth of BCR repertoire sequencing has created new opportunities. The amount of data now available allows not only the identification of which amino acids are allowed at which positions, but also the investigation of residue couplings and covariation (172). Recently, Wollacott et al. (172) described a machine learning-based humanization method, trained on large sets of sequence data, and demonstrated that it outperformed other methods at evaluating the humanness of antibodies from sequence.
The chemical properties of a potential therapeutic can also cause problems, such as instability, self-association, high viscosity, polyspecificity, and poor expression (167). These characteristics can be determined experimentally; however, this is time-consuming and hence low-throughput, meaning the examination of thousands or millions of sequences from a BCR repertoire is not feasible. However, some of these properties can be predicted from the amino acid sequence of the antibody. For example, a number of sequence motifs have been identified that indicate sites of potential post-translational modification (78,173); hydrophobic residues in the CDRs are thought to lead to high aggregation, viscosity, and polyspecificity (167,(174)(175)(176)(177)(178)(179); patches of electrostatic charge on the antibody surface have been linked to high clearance rates and poor expression (180,181); and asymmetric charges of the heavy and light variable domains result in self-association and high viscosity (175,182). A number of computational tools have therefore been developed that predict these risk factors (e.g. Refs. 175-179 and 183). Whereas some of these attempt to predict solely from sequence, the majority require structural knowledge-for instance, it is important to know which residues are located on the antibody surface (176,177). The tools can be exploited during the identification of binders as described above to minimize issues further along the therapeutic development pipeline.
The properties described above can also be examined by calculating repertoire-wide distributions. As a simple example, consider the lengths of the CDRs. Using sequence repertoires, the distribution of observed lengths can be obtained. If a given length falls outside the range of this distribution, it can be assumed that this property is "unnatural," and therefore the antibody is more likely to have undesirable characteristics in vivo. Raybould et al. (106) used this approach, alongside the generation of antibody model libraries, to contextualize known therapeutic sequences against human repertoires. They were therefore able to define five developability guidelines that predict whether a given antibody will be successful as a therapeutic, based on total CDR length, patches of hydrophobicity, patches of positive and negative charge, and the overall surface charge of VH and VL domains. Testing the guidelines on sequences from two antibody discovery projects showed that this approach successfully highlighted candidates with known developability issues.
In summary, by representing the allowed antibody sequence space, BCR repertoires can be used to guide the antibody discovery and development process toward more successful therapeutic candidates. Using developability or humanness prediction algorithms in conjunction with in silico screening of BCR Figure 5. WebLogo representations for the second framework region (residues 39-55 in the IMGT numbering scheme) for known human and mouse antibody sequences. Hydrophobic amino acids are shown in red, hydrophilic in blue, and neutral in gray. Data were extracted from OAS; we only considered repertoires from individuals with no disease and no vaccine recorded. Whereas amino acid usage is the same at many positions along the sequence, it can be seen that there are differences that could potentially be used to measure "humanness" and guide the humanization process. For example, it is rare to observe lysine at position 43 in human antibodies, but this is common in mice. Changing a lysine to an arginine at this position in a potential therapeutic may therefore reduce immunogenicity.
repertoires should be of great benefit to the therapeutic development community, and as sequence repositories continue to grow and computational techniques become more sophisticated, we can expect more advances to be made.

Conclusions
Advances in next-generation sequencing and its increasing use in characterizing the immune system has led to the exponential growth of the number of known antibody sequences. Subsequently, there is now a wealth of information, which has increased opportunities for large-scale data mining. The amount of data presents its challenges, however. Curated, publicly available sequence repositories such as the OAS are addressing the problem of storage and accessibility, but changes may have to be made as we learn more about the needs of researchers wishing to use the data. The increase in the amount of data will also create computational obstacles; we must continue to develop methods that can analyze huge numbers of sequences in a time-and resource-efficient manner.
Repertoire data can be used to gain a deeper understanding of human immune system, including the mechanisms that drive repertoire diversity and its response to antigen exposure. Comparisons between individuals have detected the presence of a core set of shared sequences or clonotypes known as the public repertoire, potentially of great importance in protecting against common antigens.
The antigen-binding properties of antibodies are governed by their structures. Sequence-similar antibodies may adopt different structures, and vice versa; by using sequence alone, these subtleties are not discerned. The incorporation of structural information into repertoire analyses, through annotation or modeling, therefore allows more accurate comparisons to be made and hence provides a better representation of the repertoire space. Ongoing improvements in modeling algorithms, in particular increased speed and accuracy of H3 structure prediction, will mean that larger subsets of the repertoire can be analyzed in this manner and with more reliability. An increase in the number of available templates would also improve structural modeling-repertoire data itself may be used in this process, to highlight areas of sequence space for which structures are currently lacking.
Large-scale sequencing data can also be of great benefit during the discovery of antibodies for therapeutic use. Clonal selection and expansion leads to the enrichment of the repertoire with antigen binders post-exposure; these can be identified and used as starting points for further development. The presence of sequence-similar antibodies to known therapeutics in OAS (74) indicates that it should be possible to mine these repositories for new therapeutic leads without performing specific experiments. For example, in silico screening libraries could be developed, by combining BCR repertoire data with modeling protocols and other computational tools (e.g. docking algorithms) to select likely binders.
Currently, it is possible for the computational approaches such as those described in this review to be used in tandem with experimental work. For example, after a potential binder is identified experimentally, clonotyping can be used to select similar antibodies from a repertoire, thereby expanding the pool of candidates for further study. In the long term, however, the objective of many researchers is to make the discovery of new therapeutic antibodies completely computational, with little or no human input. Consolidating all of the knowledge gained from large-scale repertoire analysis may enable the creation of an in silico immune system, or at the least a completely human-like synthetic repertoire that can be screened to identify potential therapeutics. Although it is too soon to say whether an entirely in silico protocol would produce better results than an experimental one, it would remove the need for expensive and time-consuming experimental work and would mean the immunization of animals is no longer required. There are many obstacles to achieve this, perhaps most importantly in the initial selection of antibodies that bind to a specific antigen of interest-improvements in structural modeling, docking, and binding affinity prediction in particular will help this.
Even though there is a large quantity of data already available, there is a vast amount of the antibody sequence space that remains unknown. For example, at around one billion sequences (including redundant sequences), the Observed Antibody Space database represents less than 0.01% of the potential total number (predicted to be around 10 13 nonredundant sequences). Efforts should also be made to sequence repertoires with different attributes (e.g. ethnic background)-currently, this is not routinely disclosed, making analysis of its effect on the repertoire difficult. The continued growth of available sequence information should mean that currently unknown parts of sequence space are investigated, and therefore we should be able to analyze the workings of the immune system and predict antibody/repertoire properties more accurately. Importantly, with the development of experimental techniques that preserve the native VH-VL pairings, we will no longer have to rely on approximations and exhaustive combinatorics to achieve an accurate view of what binding sites are present. Overall, access to large-scale sequencing data has provided many opportunities to deepen our understanding of the immune system and improve our ability to design biotherapeutics and will surely continue to do so.
Conflict of interest-The authors declare that they have no conflicts of interest with the contents of this article.