MobiVision V(D)J Algorithm Introduction

Algorithm Overview

Barcode and UMI correction

The schematic diagram of the VDJ library generated by the MobiNova platform is as follows:

[Figure: structure of the VDJ library]

As the structure above shows, the 5' end of Read1 carries the cell label sequence (20 bp) and the UMI sequence (10 bp). To determine whether the cell label carried by Read1 is correct, MobiVision compares the cell label in each sequenced read with the cell labels in the known whitelist. Currently, the MobiCube high-throughput single-cell V(D)J v1.0 kit provides nearly 3,000,000 cell label sequences. Sequencing reads that meet either of the following conditions are retained:

  • The cell label of Read1 exists in the whitelist;
  • The cell label of Read1 does not exist in the whitelist, but its minimum Hamming distance to a whitelist cell label is <= 2; in this case the cell label in Read1 is corrected to that whitelist cell label.

For reads that pass this check, Read1 retains only the corrected cell label sequence and the UMI sequence; Read2 is not processed at this step.
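The whitelist check described above can be sketched as follows. This is a minimal illustration, not MobiVision's implementation: the function names are hypothetical, the linear scan would be far too slow for a whitelist of millions of labels, and dropping reads that are equally close to two whitelist labels is an assumption the text does not state.

```python
from typing import Optional, Set


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def correct_cell_label(label: str, whitelist: Set[str], max_dist: int = 2) -> Optional[str]:
    """Return the whitelist label for `label`, or None if the read is dropped."""
    if label in whitelist:
        return label  # exact match, nothing to correct
    best, best_dist, tie = None, max_dist + 1, False
    for candidate in whitelist:
        d = hamming(label, candidate)
        if d < best_dist:
            best, best_dist, tie = candidate, d, False
        elif d == best_dist:
            tie = True
    # Assumption: a label equally close to two whitelist entries is ambiguous
    # and the read is dropped rather than corrected.
    if best is None or tie:
        return None
    return best
```

For example, with the toy whitelist `{"AAAAAAAA", "CCCCCCCC"}`, the label `AAAAAATT` is corrected to `AAAAAAAA`, while `AAATTTAA` (three mismatches) is rejected.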

 

For the FASTQ data after cell label correction:

  • There may be a 13 bp TSO sequence at the 5' end of the Read1 fragment and a polyA sequence at its 3' end.
  • There may be a polyT sequence at the 5' end of the Read2 fragment and a 13 bp reverse-complemented TSO sequence at its 3' end.
  • TSO, polyA, polyT, and similar sequences reduce the alignment rate of the library. Therefore, any TSO, polyA, and polyT sequences at either end of the insert must be removed before alignment.
  • Removing adapter, polyA, and polyT sequences may leave inserts that are too short, and very short fragments are more likely to be misaligned. Therefore, after adapter removal, reads whose insert is shorter than 30 bp are filtered out.
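The trimming and length filter above can be illustrated with a short sketch. Several things here are assumptions: the function names are hypothetical, the TSO sequence is passed in by the caller because the text does not give it, the TSO is matched exactly (a real trimmer would tolerate mismatches), and the minimum homopolymer length of 10 is an arbitrary choice.

```python
import re

MIN_INSERT_LEN = 30  # from the text: inserts shorter than 30 bp are filtered out

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def trim_read1(seq: str, tso: str, min_poly: int = 10) -> str:
    """Strip a leading TSO and a trailing polyA run from a Read1 fragment."""
    if seq.startswith(tso):
        seq = seq[len(tso):]
    return re.sub(r"A{%d,}$" % min_poly, "", seq)


def trim_read2(seq: str, tso_rc: str, min_poly: int = 10) -> str:
    """Strip a leading polyT run and a trailing reverse-complemented TSO from Read2."""
    seq = re.sub(r"^T{%d,}" % min_poly, "", seq)
    if seq.endswith(tso_rc):
        seq = seq[: -len(tso_rc)]
    return seq


def passes_length_filter(seq: str) -> bool:
    """Keep only inserts of at least MIN_INSERT_LEN bases after trimming."""
    return len(seq) >= MIN_INSERT_LEN
```

Note that a trailing polyA regex also removes any A bases of the insert that happen to be adjacent to the tail, which is one reason production trimmers use scoring rather than a plain pattern.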

Check VDJ gene chain type

Align the inner primers to the FASTQ inserts, then calculate the ratio of reads aligned to TCR inner primers to all reads aligned to any inner primer. If the ratio is greater than 80%, the library is considered a TCR library; if the ratio is less than 20%, it is considered a BCR library; otherwise it is considered an ALL (BCR+TCR) library.
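The decision rule reduces to a single ratio and two thresholds. A minimal sketch, assuming the two read counts have already been obtained from the primer alignment (the function and parameter names are hypothetical):

```python
def classify_library(tcr_primer_reads: int, bcr_primer_reads: int) -> str:
    """Classify a library as TCR, BCR, or ALL from inner-primer read counts."""
    total = tcr_primer_reads + bcr_primer_reads
    if total == 0:
        raise ValueError("no reads aligned to any inner primer")
    ratio = tcr_primer_reads / total
    if ratio > 0.8:
        return "TCR"
    if ratio < 0.2:
        return "BCR"
    return "ALL"  # mixed BCR+TCR library
```

Both comparisons are strict, so a ratio of exactly 80% or 20% falls into the ALL category, matching the wording "greater than" and "less than".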

VDJ gene sequence filter

To keep assembly effective and fast, all reads are aligned to the VDJ reference sequences, and reads that do not align are discarded. Only aligned reads are used in the subsequent assembly.

Assemble contig

Reads from the same barcode are collected into one set of FASTQ files, and a de Bruijn graph algorithm is used to assemble the short reads into transcripts, yielding full-length sequences (contigs). Each base of a contig is assigned a base quality value, and the numbers of supporting UMIs and reads are also recorded. The same procedure is applied to every barcode to obtain the contigs for each barcode.
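To make the de Bruijn idea concrete, the toy sketch below builds a graph whose nodes are (k-1)-mers and walks every unambiguous path from a node with no incoming edge. It is an illustration only: the function names are hypothetical, and it omits error correction, branch resolution, base qualities, and UMI and read counting, all of which a real assembler needs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_graph(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def assemble(reads: Iterable[str], k: int) -> List[str]:
    """Extend each source node while it has exactly one successor."""
    graph = build_graph(reads, k)
    has_incoming = {succ for succs in graph.values() for succ in succs}
    starts = sorted(node for node in graph if node not in has_incoming)
    contigs = []
    for start in starts:
        contig, node, seen = start, start, {start}
        while len(graph.get(node, ())) == 1:
            (nxt,) = graph[node]
            if nxt in seen:  # stop instead of looping around a cycle
                break
            contig += nxt[-1]
            seen.add(nxt)
            node = nxt
        contigs.append(contig)
    return contigs
```

With the overlapping reads `ATGGCGT`, `GGCGTGC`, and `CGTGCAA` and `k = 4`, the walk reconstructs the single contig `ATGGCGTGCAA`.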

V(D)J Annotation

The purpose of V(D)J annotation is to identify contigs that encode a biologically functional receptor protein. Such a contig must meet the following conditions:

1. The structure is complete, that is, the sequence is full length.

2. It begins with a start codon, and there is no stop codon in the V-J region.

3. The distance from the start codon of the V gene to the last codon of the J gene is a multiple of 3, that is, (last codon of J - start codon of V) / 3 is an integer.

4. The sequence contains a CDR3 region, and the length of the region spanned by V-J is reasonable, which rules out structural abnormalities.

5. The difference between the total length of the V and J reference segments and the observed length from the first codon of V to the last codon of J is between -25 and 25 amino acids; for IGH, between -55 and 25 amino acids.
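The start-codon, stop-codon, and reading-frame conditions can be checked directly on the nucleotide sequence. The sketch below covers only those conditions; the function name and the coordinate convention (0-based start of the V start codon, end index just past the last J codon) are assumptions for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_in_frame_orf(contig: str, v_start: int, j_end: int) -> bool:
    """Check start codon, reading frame, and absence of stop codons in V-J.

    v_start: 0-based index of the first base of the V start codon.
    j_end:   index just past the last base of the last J codon.
    """
    region = contig[v_start:j_end]
    if len(region) % 3 != 0:  # V start and J end are not in the same frame
        return False
    if not region.startswith("ATG"):  # must begin with a start codon
        return False
    codons = (region[i:i + 3] for i in range(0, len(region), 3))
    return not any(codon in STOP_CODONS for codon in codons)
```

The full-length, CDR3, and length-difference conditions require the V and J reference annotations and are not modeled here.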

CDR3 is determined by searching for the conserved motifs that flank CDR3 on its left and right sides. A candidate CDR3 must start with a cysteine (C), be 5-27 amino acids long, and contain no stop codon. If more than one candidate is found, the one with the highest score is taken as the CDR3 region; if scores are tied, the longer sequence is selected.
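A rough sketch of this search on a translated contig is shown below. The text does not specify the motifs or the scoring scheme, so both are assumptions here: the right-hand motif is taken to be the commonly used `[FW]G.G` pattern of the J region, the left-hand anchor is simply the cysteine, and the score is a caller-supplied function that defaults to a constant so that only the "longer wins" tie-break applies. Stop codons are represented as `*`.

```python
import re
from typing import Callable, List, Optional

# Assumption: conserved J-region motif (F or W, then G, any residue, G).
RIGHT_MOTIF = re.compile(r"[FW]G.G")


def find_cdr3_candidates(aa: str, min_len: int = 5, max_len: int = 27) -> List[str]:
    """Return all C...F/W stretches of allowed length without a stop codon."""
    candidates = []
    for match in RIGHT_MOTIF.finditer(aa):
        end = match.start() + 1  # include the conserved F/W residue
        for start in range(max(0, end - max_len), end - min_len + 1):
            cand = aa[start:end]
            if cand[0] == "C" and "*" not in cand:
                candidates.append(cand)
    return candidates


def pick_cdr3(aa: str, score: Callable[[str], float] = lambda c: 0.0) -> Optional[str]:
    """Pick the highest-scoring candidate; ties go to the longer sequence."""
    candidates = find_cdr3_candidates(aa)
    if not candidates:
        return None
    return max(candidates, key=lambda c: (score(c), len(c)))
```

For the sequence `QVQLCASSLGQGNTEAFFGQGTRL`, the only candidate is `CASSLGQGNTEAFF`, which starts at the cysteine and ends at the F of the `FGQG` motif.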

Cell Calling

Cell calling is based on whether a barcode contains a valid contig: only a barcode with a valid contig is considered a real cell rather than an empty (blank) barcode or a doublet. In general, the following conditions must be met to select cells that express V(D)J genes. Only T or B cells undergo V(D)J rearrangement and produce full-length transcripts. A retained barcode must be supported by a sufficient UMI count to exclude background mRNA. In addition, each UMI must be supported by sufficient reads to exclude library contamination and sample index hopping.
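These criteria can be expressed as a small filter over the contigs of one barcode. The sketch below is illustrative only: the `Contig` structure and the function name are hypothetical, and the thresholds `min_umis` and `min_reads_per_umi` are placeholder values, since the text does not state the cutoffs MobiVision uses.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Contig:
    is_valid: bool                 # passed V(D)J annotation
    reads_per_umi: Dict[str, int]  # UMI sequence -> number of supporting reads


def is_cell(contigs: List[Contig], min_umis: int = 2, min_reads_per_umi: int = 2) -> bool:
    """Call a barcode a cell if any valid contig has enough well-supported UMIs."""
    for contig in contigs:
        if not contig.is_valid:
            continue
        # Count only UMIs with enough reads, so that UMIs arising from
        # contamination or index hopping do not contribute.
        supported = [u for u, n in contig.reads_per_umi.items() if n >= min_reads_per_umi]
        if len(supported) >= min_umis:
            return True
    return False
```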

Assignment of clonotype

Cell barcodes are grouped into clonotypes: barcodes whose paired receptor sequences are identical or similar are assigned to the same clonotype.

Clonotype results include the following fields, which can be used in downstream analysis:

1. clonotype_id: the identifier of the clonotype

2. frequency: the number of cell barcodes assigned to the clonotype_id

3. proportion: the proportion of cell barcodes assigned to the clonotype_id

4. CDR3_aa: the amino acid sequence of CDR3

5. CDR3_nt: the nucleotide sequence of CDR3
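As a simplified illustration of how such a table could be derived, the sketch below groups barcodes whose sets of (chain, CDR3 nucleotide sequence) pairs are exactly identical. Grouping of merely similar sequences is not modeled, and the function name, input layout, and `clonotypeN` identifier format are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

Chain = Tuple[str, str]  # (chain name, CDR3 nucleotide sequence)


def assign_clonotypes(cells: Dict[str, List[Chain]]) -> List[dict]:
    """Group barcodes with identical chain sets and summarize each group."""
    keys = {barcode: frozenset(chains) for barcode, chains in cells.items()}
    counts = Counter(keys.values())
    total = sum(counts.values())
    # Largest clonotype first; ties are broken by the sorted chain list.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], sorted(kv[0])))
    rows = []
    for rank, (key, n) in enumerate(ordered, start=1):
        rows.append({
            "clonotype_id": f"clonotype{rank}",
            "frequency": n,
            "proportion": n / total,
            "CDR3_nt": ";".join(f"{chain}:{seq}" for chain, seq in sorted(key)),
        })
    return rows
```

Using a `frozenset` as the grouping key makes the result independent of the order in which a barcode's chains are listed.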

Quality control report

While mobivision vdj runs, it collects statistics on the raw data and analysis results of the entire library and generates a quality control report at the end. The report is a faithful summary of the entire library, computed without any data screening or filtering, and is intended to give users a high-level view of the quality of the raw data and the analysis results. If necessary, users can adjust the library results based on the quality control report before starting downstream analysis.