Anchor Clustering for million-scale immune repertoire sequencing data

Haiyang Chang, Daniel A. Ashlock, Steffen P. Graether, Stefan M. Keller
DOI: https://doi.org/10.1186/s12859-024-05659-z
IF: 3.307
2024-01-26
BMC Bioinformatics
Abstract: The clustering of immune repertoire data is challenging due to the computational cost associated with a very large number of pairwise sequence comparisons. To overcome this limitation, we developed Anchor Clustering, an unsupervised clustering method designed to identify similar sequences from millions of antigen receptor gene sequences. First, a Point Packing algorithm is used to identify a set of maximally spaced anchor sequences. Then, the genetic distance from each remaining sequence to every anchor sequence is calculated, and these distances are assembled into a distance vector per sequence. Finally, the distance vectors are clustered using unsupervised clustering. This process is repeated iteratively until the resulting clusters are small enough that pairwise distance comparisons can be performed.
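The three steps in the abstract can be illustrated with a minimal single-pass sketch. It is not the authors' implementation, and every name in it is hypothetical. It assumes Levenshtein edit distance as the genetic distance, greedy farthest-point selection as a stand-in for the Point Packing algorithm, and a simple leader clustering (with a made-up `radius` parameter) as a stand-in for the unsupervised clustering step. The iterative refinement of large clusters is omitted.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]


def pick_anchors(seqs, k):
    """Greedy farthest-point selection (assumed stand-in for Point Packing)."""
    anchors = [seqs[0]]
    # Distance from each sequence to its nearest anchor chosen so far.
    min_dist = [edit_distance(s, anchors[0]) for s in seqs]
    while len(anchors) < k:
        idx = max(range(len(seqs)), key=min_dist.__getitem__)
        if min_dist[idx] == 0:
            break  # every sequence already coincides with an anchor
        anchors.append(seqs[idx])
        for i, s in enumerate(seqs):
            min_dist[i] = min(min_dist[i], edit_distance(s, seqs[idx]))
    return anchors


def distance_vectors(seqs, anchors):
    """One vector per sequence: its distance to every anchor."""
    return [tuple(edit_distance(s, a) for a in anchors) for s in seqs]


def cluster_vectors(vectors, radius):
    """Leader clustering (assumed stand-in): join the first cluster whose
    leader lies within `radius` in Chebyshev distance, else start a new one."""
    leaders, labels = [], []
    for v in vectors:
        for cid, leader in enumerate(leaders):
            if max(abs(x - y) for x, y in zip(v, leader)) <= radius:
                labels.append(cid)
                break
        else:
            leaders.append(v)
            labels.append(len(leaders) - 1)
    return labels


# Toy input: two made-up families of CDR3-like amino acid strings.
seqs = ["CASSLGQGYEQYF", "CASSLGQGYEQFF", "CASSLGQAYEQYF",
        "CAWSVDRNTEAFF", "CAWSVDRNTEAFY", "CAWSVDKNTEAFF"]
anchors = pick_anchors(seqs, k=2)
vectors = distance_vectors(seqs, anchors)
labels = cluster_vectors(vectors, radius=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The point of the sketch is the cost profile: with n sequences and k anchors, only n·k sequence comparisons are needed to build the vectors, rather than the roughly n²/2 needed for all pairs, and the subsequent clustering operates on short numeric vectors instead of sequences.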
biochemical research methods, biotechnology & applied microbiology, mathematical & computational biology