Description
Efficient clone selection in B-cell research is pivotal to therapeutic antibody development. This study uses a BERT-based large language model (LLM) trained on paired-chain antibody sequences to analyse immune repertoires and support clonal selection alongside the established B-cell V-region recovery workflow (BCW). Two datasets were investigated: Dataset A covers paired sequencing samples from four stages of immunisation, and Dataset B contains antigen-mapped repertoires against multiple antigens along with gene expression data.
By applying the LLM to map clones from the repertoires and the BCW into a “sequence space”, we showed that clustering on LLM embeddings can yield more focussed clusters than sequence homology clustering. For Dataset A, we used LLM embeddings as input features to train an accurate predictor of immunisation time point. For Dataset B, UMAP projection separated the repertoires into well-defined clusters by genotype, and subsequent differential gene expression analysis identified potential marker genes for each sample. Our findings underscore the utility of the LLM in enriching clone selection by providing embeddings that capture phenotypic relationships between antibody sequences. Integrating advanced language models and traditional bioinformatics tools with antigen-mapped repertoires offers a potential avenue for accelerating the discovery and optimisation of therapeutic antibodies.
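To illustrate how embeddings can serve as input features for predicting an immunisation time point, the following is a minimal sketch under stated assumptions. It is not the study's pipeline. The vectors are synthetic stand-ins for LLM embeddings, and the nearest-centroid classifier is a deliberately simple substitute, because the description does not name the predictor that was used. All function names, the embedding dimension, and the noise level are hypothetical.

```python
import random
from statistics import fmean


def make_toy_embeddings(n_per_class, n_classes=4, dim=8, noise=0.5, seed=0):
    """Generate synthetic vectors standing in for LLM embeddings.

    ASSUMPTION: each of the four immunisation stages occupies its own
    region of the embedding space. Real embeddings would come from the
    antibody language model, not from this generator.
    """
    rng = random.Random(seed)
    X, y = [], []
    for label in range(n_classes):
        for _ in range(n_per_class):
            # Class `label` is shifted along its own axis, plus Gaussian noise.
            vec = [rng.gauss(5.0 if d == label else 0.0, noise) for d in range(dim)]
            X.append(vec)
            y.append(label)
    return X, y


def fit_centroids(X, y):
    """Compute the mean embedding (centroid) of each time-point label."""
    centroids = {}
    for label in set(y):
        members = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [fmean(column) for column in zip(*members)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], x))


def accuracy(centroids, X, y):
    """Fraction of vectors assigned to their true label."""
    hits = sum(predict(centroids, x) == t for x, t in zip(X, y))
    return hits / len(y)


# Train on one synthetic sample and evaluate on an independent one.
X_train, y_train = make_toy_embeddings(50, seed=0)
X_test, y_test = make_toy_embeddings(20, seed=1)
centroids = fit_centroids(X_train, y_train)
test_accuracy = accuracy(centroids, X_test, y_test)
print(f"held-out accuracy on toy data: {test_accuracy:.2f}")
```

The toy classes are well separated by construction, so the accuracy printed here says nothing about performance on real repertoires. With real data, the synthetic generator would be replaced by embeddings exported from the model, and evaluation should hold out whole samples or animals rather than individual sequences to avoid leakage between related clones.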