MobiVision Epigenomics Algorithm Introduction - ATAC

Algorithm Overview

MobiVision ATAC is designed for analyzing single-cell ATAC-seq data generated from the MobiNova platform. The key analytical steps are illustrated in the following diagram:

Barcode Correction

The schematic diagram of the ATAC library generated by the MobiNova platform is shown below:

MobiDrop scATAC libraries are sequenced paired-end. Read 1, from 5' to 3', consists of the cell barcode, UMI, MEC fixed sequence, and insert DNA. When processing the input fastq data, mobivision atac first corrects the cell barcode in Read 1. A read whose cell barcode matches an entry in the built-in mobivision whitelist carries a valid cell barcode and proceeds to the next analysis step; a read whose cell barcode cannot be matched is invalid and is discarded. When a cell barcode is compared with the whitelist sequences, a Hamming distance of at most 1 per 10 bases is tolerated. In the output valid reads, the cell barcode associated with Read 1 is the corrected cell barcode. The cell barcode and UMI sequences are stored in the read ID, not in the read sequence.
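The whitelist matching described above can be sketched as follows. This is a minimal illustration, not the mobivision implementation: it assumes the tolerance is applied over the whole barcode (one mismatch per full 10 bases), and it assumes that a barcode matching more than one whitelist entry is discarded as ambiguous. The function names and the example whitelist are hypothetical.

```python
from typing import Iterable, Optional


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(observed: str, whitelist: Iterable[str]) -> Optional[str]:
    """Return the corrected barcode, or None if the read should be discarded."""
    entries = set(whitelist)
    if observed in entries:
        return observed  # exact match, nothing to correct
    # Assumption: "<= 1 per 10 bases" is read as len // 10 mismatches overall.
    max_mismatches = len(observed) // 10
    candidates = [
        bc for bc in entries
        if len(bc) == len(observed) and hamming(observed, bc) <= max_mismatches
    ]
    # Assumption: an ambiguous barcode (several candidates) is treated as invalid.
    return candidates[0] if len(candidates) == 1 else None


whitelist = ["ACGTACGTAC", "TTTTTTTTTT"]  # hypothetical 10-base whitelist
print(correct_barcode("ACGTACGTAA", whitelist))  # one mismatch, corrected
print(correct_barcode("ACGTACGAAA", whitelist))  # two mismatches, discarded
```

A linear scan over the whitelist is used here only for readability; a production tool would typically index the whitelist so that each lookup does not touch every entry.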

Reads with corrected cell barcodes then undergo adapter trimming. For Read 1, the MEB sequence is removed from the 3' end and the reverse complement of the MEC sequence is removed from the 5' end. For Read 2, the MEC sequence is removed from the 3' end. The allowed mismatch rate for adapter trimming is 0.1. Trimming yields valid, clean fastq files that are used for subsequent alignment.
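As an illustration of how a mismatch rate of 0.1 can be applied when trimming a 3' adapter, the sketch below scans for the leftmost position where the rest of the read matches the start of the adapter with at most 10% mismatches. It is a simplified, substitution-only model and not the trimmer used by mobivision atac; the adapter sequence, the `min_overlap` parameter, and the function name are assumptions made for the example.

```python
def trim_3prime_adapter(
    read: str,
    adapter: str,
    max_error_rate: float = 0.1,
    min_overlap: int = 3,  # assumption: ignore very short, likely random matches
) -> str:
    """Remove a (possibly partial) adapter from the 3' end of a read."""
    for start in range(len(read)):
        overlap = min(len(adapter), len(read) - start)
        if overlap < min_overlap:
            break  # remaining overlaps only get shorter
        window = read[start:start + overlap]
        mismatches = sum(r != a for r, a in zip(window, adapter[:overlap]))
        # Allowed mismatches scale with the length of the compared region.
        if mismatches <= int(max_error_rate * overlap):
            return read[:start]
    return read  # no adapter found


ADAPTER = "GATTACAGATTACA"  # hypothetical adapter, not the real MEB/MEC sequence
print(trim_3prime_adapter("CCCCCCCCCCCC" + ADAPTER, ADAPTER))
print(trim_3prime_adapter("CCCCCCCCCCCC" + "GATTA", ADAPTER))
```

Both calls print the 12-base insert: the first removes a full-length adapter, the second a partial adapter at the read end.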

Alignment

mobivision atac uses the built-in bowtie2 software for paired-end alignment, generating a .bam output file that includes both mapped and unmapped reads.

The aligned bam file is then filtered and deduplicated. Only paired-end alignments with MapQ ≥ 30 and a length ≤ 2000 bp are retained. Duplicate fragments, identified by identical cell barcode, chromosome name, alignment start, and alignment end, are removed, producing the filtered.bed file. This file is then used to generate a .bw file for visualization. For a dual-species sample, a separate .bw file is generated for each species.
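The filtering and deduplication rules above can be sketched on an in-memory list of fragments. This is only an illustration under stated assumptions: each fragment carries a single MapQ value for the read pair, coordinates are 0-based half-open so that length is `end - start`, and the first occurrence of a duplicate is the one kept. The `Fragment` type and function name are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Fragment(NamedTuple):
    barcode: str
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    mapq: int   # assumption: one MapQ value summarising the read pair


def filter_and_dedup(
    fragments: Iterable[Fragment], min_mapq: int = 30, max_len: int = 2000
) -> List[Fragment]:
    """Keep high-quality fragments and drop duplicates within each cell."""
    seen = set()
    kept = []
    for frag in fragments:
        if frag.mapq < min_mapq:
            continue  # low mapping quality
        if frag.end - frag.start > max_len:
            continue  # implausibly long fragment
        key = (frag.barcode, frag.chrom, frag.start, frag.end)
        if key in seen:
            continue  # duplicate of a fragment already kept for this barcode
        seen.add(key)
        kept.append(frag)
    return kept


frags = [
    Fragment("A", "chr1", 100, 300, 42),
    Fragment("A", "chr1", 100, 300, 40),   # duplicate within barcode A
    Fragment("B", "chr1", 100, 300, 42),   # same coordinates, different cell
    Fragment("A", "chr1", 100, 3000, 42),  # longer than 2000 bp
    Fragment("A", "chr2", 5, 50, 10),      # MapQ below 30
]
for frag in filter_and_dedup(frags):
    print(frag.barcode, frag.chrom, frag.start, frag.end)
```

Because the barcode is part of the deduplication key, identical coordinates observed in two different cells are kept as two fragments.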

Peak Calling and Annotation

The deduplicated and filtered filtered.bed file is used for peak calling with the built-in macs2 software in mobivision atac. If no peak type is specified, the narrow peak type is used by default. To call broad peaks, the --peaktypebroad parameter must be specified. If --control is specified, IgG data is used as the control during peak calling to correct for background noise. The final output is a peaks file with the extension .narrowPeak or .broadPeak.

The obtained peaks file is annotated based on the following principles:

● The promoter region is defined as the interval from 1000 bp upstream to 100 bp downstream of the transcription start site (TSS) (-1 kb, +100 bp).

● A distal peak is a peak that lies within 200 kb of its nearest TSS but does not fall within the promoter region.

● A peak that overlaps a transcript but satisfies neither of the two conditions above is also classified as a distal peak.

● Peaks that do not fall into any of the above categories are classified as intergenic peaks.
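The annotation rules above can be expressed as a small classifier. The sketch below is an interpretation, not the mobivision code: it assumes a peak "falls within" the promoter region when it overlaps it by at least one base, that distance to a TSS is measured from the nearest peak edge, and that coordinates are 0-based half-open. The `Transcript` type and all function names are hypothetical.

```python
from typing import NamedTuple, Sequence, Tuple


class Transcript(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    strand: str  # "+" or "-"


def tss(t: Transcript) -> int:
    """Transcription start site: first base on '+', last base on '-'."""
    return t.start if t.strand == "+" else t.end - 1


def promoter_interval(t: Transcript, up: int = 1000, down: int = 100) -> Tuple[int, int]:
    """Strand-aware (-1 kb, +100 bp) window around the TSS, half-open."""
    p = tss(t)
    if t.strand == "+":
        return max(0, p - up), p + down + 1
    return max(0, p - down), p + up + 1


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    return a_start < b_end and b_start < a_end


def annotate_peak(chrom: str, start: int, end: int,
                  transcripts: Sequence[Transcript],
                  distal_window: int = 200_000) -> str:
    same_chrom = [t for t in transcripts if t.chrom == chrom]
    # Rule 1: overlap with any promoter region.
    if any(overlaps(start, end, *promoter_interval(t)) for t in same_chrom):
        return "promoter"

    def distance_to_tss(t: Transcript) -> int:
        p = tss(t)
        if start <= p < end:
            return 0
        return start - p if p < start else p - (end - 1)

    # Rule 2: within 200 kb of the nearest TSS.
    if same_chrom and min(distance_to_tss(t) for t in same_chrom) <= distal_window:
        return "distal"
    # Rule 3: overlaps a transcript body far from any TSS.
    if any(overlaps(start, end, t.start, t.end) for t in same_chrom):
        return "distal"
    # Rule 4: everything else.
    return "intergenic"


genes = [Transcript("chr1", 10_000, 20_000, "+"),
         Transcript("chr1", 1_000_000, 1_500_000, "-")]
print(annotate_peak("chr1", 9_500, 9_700, genes))
print(annotate_peak("chr1", 1_100_000, 1_100_500, genes))
```

The second peak lies about 400 kb from the nearest TSS, so it is only classified as distal through the transcript-overlap rule.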

Valid Fragments

Valid fragments, also referred to as fragments in peaks, are fragments that overlap a peak region by at least one base. These fragments are recorded as fragmentsInPeaks and are used as the input for cell calling.
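The one-base overlap criterion and the per-barcode counts that feed cell calling can be sketched as below. This is a minimal illustration assuming 0-based half-open coordinates for both fragments and peaks; the tuple layouts and the function name are hypothetical, and the linear scan over peaks stands in for the interval index a real tool would use.

```python
from collections import Counter, defaultdict
from typing import Iterable, Tuple

Peak = Tuple[str, int, int]                # (chrom, start, end)
FragmentRec = Tuple[str, str, int, int]    # (barcode, chrom, start, end)


def count_fragments_in_peaks(
    fragments: Iterable[FragmentRec], peaks: Iterable[Peak]
) -> Counter:
    """Count, per barcode, the fragments sharing at least one base with a peak."""
    peaks_by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        peaks_by_chrom[chrom].append((start, end))

    counts = Counter()
    for barcode, chrom, f_start, f_end in fragments:
        # Half-open intervals overlap when each starts before the other ends.
        if any(f_start < p_end and p_start < f_end
               for p_start, p_end in peaks_by_chrom.get(chrom, ())):
            counts[barcode] += 1
    return counts


peaks = [("chr1", 100, 200)]
fragments = [
    ("A", "chr1", 50, 101),   # overlaps the peak by exactly one base
    ("A", "chr1", 200, 300),  # adjacent to the peak, no shared base
    ("B", "chr1", 150, 160),  # fully inside the peak
    ("B", "chr2", 150, 160),  # no peaks on this chromosome
]
print(count_fragments_in_peaks(fragments, peaks))
```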

Cell Calling

mobivision atac currently uses a dynamic threshold strategy for cell barcode filtering. First, all barcodes are sorted in descending order by the number of fragments falling within peak regions. The fragment count at the 95th percentile position of the expected cell number N (default 3000; that is, the 2850th position when N=3000) is taken as the value m, and m/10 is set as the threshold. All barcodes with fragment counts exceeding this threshold are identified as valid cells. For example, when N=3000 and m=20000, the threshold is 2000, and all barcodes with fragment counts exceeding 2000 are retained (9000 cells in the illustrated example). Because the threshold is derived from the data itself, the filtering criterion adjusts automatically to datasets of different scales.
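The threshold rule can be written out in a few lines. The sketch below follows the description above; the handling of inputs with fewer barcodes than the 95th percentile position (the position is clamped to the last barcode) is an assumption, and the function name is hypothetical.

```python
from typing import Dict, Set, Tuple


def call_cells(
    fragments_in_peaks: Dict[str, int], expected_cells: int = 3000
) -> Tuple[Set[str], float]:
    """Return the barcodes called as cells and the threshold that was used."""
    ranked = sorted(fragments_in_peaks.values(), reverse=True)
    if not ranked:
        return set(), 0.0
    # 1-based position at 95% of the expected cell number (2850 for N=3000).
    position = max(1, expected_cells * 95 // 100)
    # Assumption: clamp when the sample has fewer barcodes than that position.
    position = min(position, len(ranked))
    m = ranked[position - 1]
    threshold = m / 10
    cells = {bc for bc, n in fragments_in_peaks.items() if n > threshold}
    return cells, threshold


# Synthetic counts: 3000 rich barcodes, 500 moderate ones, 1000 background ones.
counts = {f"rich{i}": 20000 for i in range(3000)}
counts.update({f"mid{i}": 2500 for i in range(500)})
counts.update({f"bg{i}": 100 for i in range(1000)})

cells, threshold = call_cells(counts)
print(threshold, len(cells))
```

With these synthetic counts, m is 20000, the threshold is 2000, and the 3500 barcodes above it are called as cells; the 1000 background barcodes are excluded.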

Report Generation

Based on the above analysis results and intermediate data, a summary report of the sample analysis is generated, including the following five sections: Sequencing, Mapping, Cell, Targeting, and t-SNE Projection.

1. Sequencing: Primarily provides statistics on the sequencing quality of the input library.

2. Mapping: Summarizes the alignment results of the library.

3. Cell: Provides statistics on the final cell calling results and the generated matrix.

4. Targeting: Includes annotation statistics for fragments and peaks.

5. t-SNE Projection: Utilizes LSA for dimensionality reduction, t-SNE for mapping, and Louvain for clustering.