Interpretation of MobiVision transcriptomic analysis results

Output result file

mobivision quantify produces the following output files, 16 in total. SAMPLEID_outs is generated automatically by the software and does not need to be specified by the user:

_flagdone is the flag file indicating that the task ran successfully; it is written automatically once the mobivision quantify task completes;

_log is the log file generated while the task is running;

run_analysis_cmds.txt records the complete mobivision quantify command line;

SAMPLEID_Aligned.sort.bam records the read alignments, sorted by coordinate, in BAM format;

SAMPLEID_Aligned.sort.bam.bai is the index file for SAMPLEID_Aligned.sort.bam;

raw_cell_gene_matrix is the directory holding the raw, unfiltered matrix. It contains three files, features.tsv.gz, barcodes.tsv.gz, and matrix.mtx.gz, which are obtained by counting the alignments in the BAM file;

filtered_cell_gene_matrix is the directory holding the matrix after cell filtering. It contains the same three files: features.tsv.gz, barcodes.tsv.gz, and matrix.mtx.gz;

SAMPLEID_filtered.h5ad is filtered_cell_gene_matrix converted to h5ad format, which third-party software can read for in-depth analysis of the cell-by-gene expression matrix;

SAMPLEID_Report.json is the quality control report in JSON format, which third-party software can read and extract values from;

SAMPLEID_Report.html is the quality control report in HTML format, which visualizes the data so that users can judge the quality of the library at a glance;

SAMPLEID_summary.csv contains the library information;

result_mito_percentage.csv is the mitochondrial percentage file, which records the distribution of mitochondrial content across cells.

BAM file interpretation

mobivision quantify outputs a BAM alignment file once the analysis is complete. The BAM file records the detailed alignment information for the library. Users can use it to trace and correct the analysis results as needed, or to perform downstream analyses such as RNA velocity.

MAPQ

The mapping quality (MAPQ) is the fifth column of each alignment record in the BAM file. A read that aligns to a single, unique region of the genome has MAPQ = 255. When a read maps to more than one region of the genome (Nmap > 1), MAPQ = -10*log10(1-1/Nmap), where Nmap is the number of regions the read maps to.
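As a quick illustration of the multi-mapping formula above, the sketch below evaluates -10*log10(1-1/Nmap) for a few values of Nmap. The function name multimap_mapq is hypothetical, and the text does not say how the result is rounded to the integer stored in the BAM file, so the raw value is returned.

```python
import math


def multimap_mapq(n_map: int) -> float:
    """Raw value of -10*log10(1 - 1/Nmap) for a read mapped to n_map regions.

    Only meaningful for n_map > 1; uniquely mapped reads are reported as 255.
    How the value is rounded for storage is not stated in the text, so no
    rounding is applied here.
    """
    if n_map < 2:
        raise ValueError("the formula applies only when Nmap > 1")
    return -10 * math.log10(1 - 1 / n_map)


for n_map in (2, 3, 4, 10):
    print(n_map, round(multimap_mapq(n_map), 2))
```

The value falls as Nmap grows, so reads that map to many regions receive a MAPQ close to 0.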

Matrix interpretation

There are two sets of matrix files output by mobivision quantify, namely raw_cell_gene_matrix and filtered_cell_gene_matrix. Both sets of files contain three sub-files: features.tsv.gz, barcodes.tsv.gz, and matrix.mtx.gz. The specific file contents are as follows:

barcodes.tsv.gz

$ zcat barcodes.tsv.gz
AACAACACGAAAGTGGCTTA
AACAACACGAAGATTGTAAC
AACAACACGAATTACCAGAA
AACAACACGACGCTGAATGA
AACAACACGACGGACCAACA
AACAACACGACTACGTGAGG
AACAACACGAGGCCACACGC
AACAACACGAGGTTAGTACT
AACAAGTGATCAGCGATGTC
AACAAGTGATCGGTGTGAGT

Each row in the barcodes.tsv.gz file represents a cell label sequence.

features.tsv.gz

$ zcat features.tsv.gz
ENSMUSG00000102693.2	4933401J01Rik	Gene Expression
ENSMUSG00000064842.3	Gm26206	Gene Expression
ENSMUSG00000051951.6	Xkr4	Gene Expression
ENSMUSG00000102851.2	Gm18956	Gene Expression
ENSMUSG00000103377.2	Gm37180	Gene Expression
ENSMUSG00000104017.2	Gm37363	Gene Expression
ENSMUSG00000103025.2	Gm37686	Gene Expression
ENSMUSG00000089699.2	Gm1992	Gene Expression
ENSMUSG00000103201.2	Gm37329	Gene Expression
ENSMUSG00000103147.2	Gm7341	Gene Expression

From left to right, the columns of the features.tsv.gz file are the gene ID, the gene name, and the fixed string "Gene Expression".

matrix.mtx.gz

$ zcat matrix.mtx.gz
%%MatrixMarket matrix coordinate integer general
%
55416 6167 20865276
54 1 4
68 1 2
114 1 2
122 1 3
123 1 2
125 1 1
137 1 8

The matrix.mtx.gz file is a sparse matrix in MatrixMarket coordinate format. The third line gives, from left to right, the number of genes (rows), the number of cells (columns), and the number of non-zero entries stored in the file. From the fourth line onward, each line gives, from left to right, the gene index (the row number in features.tsv.gz), the cell label index (the row number in barcodes.tsv.gz), and the number of transcripts captured for that gene in that cell. The number of genes should match the number of rows in features.tsv.gz, and the number of cells should match the number of rows in barcodes.tsv.gz.
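The layout of matrix.mtx.gz can be checked with a few lines of Python. The sketch below uses only the standard library and writes a tiny made-up matrix to a temporary file so that it runs on its own; the function name read_mtx and the demo values are illustrative assumptions, not MobiVision output. In MatrixMarket coordinate format the third field of the size line counts the stored entries, so the total number of transcripts has to be obtained by summing the third column.

```python
import gzip
import os
import tempfile


def read_mtx(path):
    """Parse a gzipped MatrixMarket coordinate file into (size_line, entries)."""
    with gzip.open(path, "rt") as handle:
        # The banner and comment lines all start with '%'; skip them.
        data_lines = (line for line in handle if not line.startswith("%"))
        # Size line: rows (genes), columns (cells), stored entries.
        n_rows, n_cols, n_entries = map(int, next(data_lines).split())
        # Remaining lines: 1-based gene index, 1-based cell index, count.
        entries = [tuple(map(int, line.split())) for line in data_lines if line.strip()]
    return (n_rows, n_cols, n_entries), entries


# Made-up demo matrix: 5 genes, 2 cells, 3 stored entries.
demo_text = (
    "%%MatrixMarket matrix coordinate integer general\n"
    "%\n"
    "5 2 3\n"
    "1 1 4\n"
    "3 1 2\n"
    "5 2 7\n"
)
demo_path = os.path.join(tempfile.mkdtemp(), "matrix.mtx.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write(demo_text)

(n_genes, n_cells, n_entries), entries = read_mtx(demo_path)
total_umis = sum(count for _, _, count in entries)
print(n_genes, n_cells, n_entries, total_umis)
```

For a real matrix, n_genes and n_cells can be compared with the line counts of features.tsv.gz and barcodes.tsv.gz as a consistency check.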

Quality control report

After the mobivision quantify analysis is complete, an HTML quality control report is generated. The report comes in two forms, single-species and dual-species. Both forms are divided into six parts: Overview, Sample, Cells, Sequencing & Mapping, Data Distribution, and UMAP Projection. The report content is described below.

Single Species Report

Overview

The Sample column contains the following information:

Sample name
Reference genome name
Library preparation kit name
Analysis software name

Cells

In the single-species report, the left side of the Cells column shows the Barcode Rank Plot and the right side shows the cell-related indicators, which are consistent with those in the Overview column. To build the plot, the report counts the number of UMIs for each cell label and sorts the cell labels by UMI count from high to low; the position in this ordering is the cell label's serial number. For example, the cell label with the largest UMI count has serial number 1, and so on. The Barcode Rank Plot is drawn with the serial number on the x-axis and the corresponding UMI count on the y-axis. Users can also click the question mark in the upper right corner of a column to get more detailed help information (the same applies to the other columns).
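The ranking behind the Barcode Rank Plot can be sketched in a few lines of Python. This is a minimal illustration of the sorting step described above, not the report's own code; the function name barcode_rank and the UMI counts are made up for the example.

```python
def barcode_rank(umi_counts):
    """Return (rank, cell_label, umi_count) tuples, highest UMI count first."""
    ordered = sorted(umi_counts.items(), key=lambda item: item[1], reverse=True)
    # Rank 1 is the cell label with the most UMIs.
    return [(rank, label, count) for rank, (label, count) in enumerate(ordered, start=1)]


# Illustrative UMI counts per cell label (invented values).
umi_counts = {
    "AACAACACGAAAGTGGCTTA": 5200,
    "AACAACACGAAGATTGTAAC": 12,
    "AACAACACGAATTACCAGAA": 8900,
}

ranked = barcode_rank(umi_counts)
for rank, label, count in ranked:
    print(rank, label, count)
```

Plotting rank on the x-axis against UMI count on the y-axis gives the curve shown in the report.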

Sequencing & Mapping

The left side of the Sequencing & Mapping column shows the Sequencing Saturation Plot, and the right side shows the library's sequencing and alignment information. Users can use the Sequencing Saturation Plot to judge whether the library needs additional sequencing. If the sequencing saturation curve reaches a plateau or comes close to the short gray dashed line, additional sequencing is unlikely to capture many more genes or UMI molecules.

Data Distribution

Data Distribution shows violin plots for three quantities: mitochondrial content per cell, UMI count per cell, and gene count per cell. Taking mitochondrial content as an example, the short dotted line in the violin plot sits at about 3%, which means that the median mitochondrial content of the cells in the library is about 3%. From the shape of the violin plot, we can also see that the mitochondrial content of most cells in this library does not exceed 5%.

UMAP Projection

UMAP Projection contains two visualization images, in which each point represents a cell. The picture on the left colors each cell by its UMI count after UMAP dimensionality reduction, which shows how RNA content is distributed across cells. In the other picture, the results are colored.

Dual Species Report

Overview

Dual-species reports differ slightly in content from single-species reports. The four indicators in the first line of the dual-species report are as shown in the figure above. As in the single-species report, these four indicators can be used to judge the complexity and sequencing depth of the library, and therefore whether the quality of the library meets the user's expectations.

Sample

Same as the single-species report.

Cells

In the dual-species report, the Cells column extends the single-species indicators by reporting the cell number, median genes, and median UMIs for each species. Here, Estimated Number of Cells = Estimated Number of Cells (GRCh38) + Estimated Number of Cells (GRCm39) + Number of Barcodes with >1 Cell. Median Genes per Cell (GRCh38) is computed over all cells assigned to GRCh38, and Median Genes per Cell (GRCm39) is computed over all cells assigned to GRCm39. Median UMI Counts is calculated in the same way as Median Genes.

Sequencing & Mapping

The Sequencing & Mapping column extends the single-species statistics by reporting alignment to each genome separately. In the example above, 95.88% of the reads are aligned to the genome, of which 53.38% are aligned to the GRCh38 genome and 42.5% are aligned to the GRCm39 genome (95.88% = 53.38% + 42.5%). The other alignment results are handled the same way: in addition to the original statistic, the proportion from each genome is reported separately.

Data Distribution

The Data Distribution column reports the mitochondrial content, UMI counts, and gene counts of cells from each species (excluding multiplets). The Cell UMI Counts graph shows how the UMIs in each cell are split between the species. The report assigns a cell label to a species only when more than 90% of the UMIs in that cell label come from that species. For example, if 20% of the UMIs in a cell label align to species A and 80% align to species B, the cell is assigned to neither species A nor species B and is classified as a Multiplet, shown as gray dots in the figure above. Generally speaking, the lower the proportion of Multiplets, the fewer doublets or multiplets there are in the library.
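The 90% rule can be sketched as a small Python function. This is a minimal illustration under the rule stated above, not MobiVision's implementation; the function name classify_barcode and its parameters are hypothetical, and GRCh38 and GRCm39 are used as the two species labels from the example report.

```python
def classify_barcode(umis_grch38, umis_grcm39, threshold=0.9):
    """Assign a cell label to a species, or call it a Multiplet.

    A species is assigned only when MORE than `threshold` of the cell
    label's UMIs come from that species; otherwise the label is a Multiplet.
    """
    total = umis_grch38 + umis_grcm39
    if total == 0:
        raise ValueError("cell label has no UMIs")
    if umis_grch38 / total > threshold:
        return "GRCh38"
    if umis_grcm39 / total > threshold:
        return "GRCm39"
    return "Multiplet"


print(classify_barcode(950, 50))   # 95% of UMIs from GRCh38
print(classify_barcode(30, 970))   # 97% of UMIs from GRCm39
print(classify_barcode(200, 800))  # the 20% / 80% case from the text
```

With these inputs the three calls give GRCh38, GRCm39, and Multiplet respectively. A cell label at exactly 90% is also a Multiplet, because the rule requires more than 90%.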

UMAP Projection

Same as single species report.
