BAM to gene expression matrix (UMI counts per gene per cell), 10X

Question

I am trying to reproduce some results of a scRNA-seq experiment. However, I am new to the server-side aspect of such analyses and am very confused at the moment.

The data provided by the authors of the paper are in BAM format, and from there I wish to derive a gene expression matrix (UMI counts per gene per cell). The authors stated that they did this using the 10X Genomics pipeline. However, it seems that BAM files are more or less a final output of that pipeline, and the pipeline itself only accepts truly raw sequencing files (BCL/FASTQ) as input.

I have considered converting the BAM files back to FASTQ and running the pipeline from the beginning, but that feels like moving backwards.

I have also found some R packages that can read BAM files, such as Rsamtools, but I would much rather do this step on the server and then load a CSV into R for the downstream analysis, as the files are over 60 GB each.

Essentially, I am asking for some advice, or a vignette/article I can read, on how to get from the BAM outputs to the gene expression data. I have already read the cellranger pages on the 10X Genomics website, but they did not answer this question.

Thank you for your time,

swbarnes2 · Answer

You should be able to parse out what you need using the tags in the BAM file. The 10x Genomics website lists the tags they add.

https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/bam

Going backwards would also involve parsing the tags, because you have to make two fastq files, and a simple bam -> fastq pipeline won't do that correctly.
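
A minimal sketch of the tag-based approach, assuming the BAM carries the CB (corrected cell barcode), UB (corrected UMI) and GX (gene ID) tags described on the 10x page linked above. The function name count_umis and the file names are made up for illustration. It reads SAM text on stdin and prints one line per cell/gene pair with the number of distinct UMIs. It does not reproduce cellranger's own filtering (for example on mapping quality or cell calling), so the numbers may not match the authors' matrix exactly.

```shell
# Count distinct UMIs per (cell barcode, gene) from SAM text on stdin.
# Output columns: cell barcode, gene ID, UMI count (tab-separated, sorted).
count_umis() {
  awk -F'\t' '
    !/^@/ {                               # ignore SAM header lines
      cb = ub = gx = ""
      for (i = 12; i <= NF; i++) {        # optional tags start at column 12
        if ($i ~ /^CB:Z:/) cb = substr($i, 6)
        else if ($i ~ /^UB:Z:/) ub = substr($i, 6)
        else if ($i ~ /^GX:Z:/) gx = substr($i, 6)
      }
      # skip reads lacking a tag or assigned to several genes (";"-separated)
      if (cb == "" || ub == "" || gx == "" || index(gx, ";")) next
      key = cb "\t" gx "\t" ub
      if (!(key in seen)) { seen[key] = 1; n[cb "\t" gx]++ }
    }
    END { for (k in n) print k "\t" n[k] }
  ' | sort
}

# Example use (hypothetical file names; requires samtools):
#   samtools view possorted_genome_bam.bam | count_umis > umi_counts.tsv
```

Note that awk keeps every cell/gene/UMI combination in memory, which may be a lot for a 60 GB BAM; running it on one chromosome at a time is one possible workaround, since a gene sits on a single chromosome.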

Brunox13 · Answer

How to re-analyze 10X BAM files?

This is a great question and, honestly, I don't think there was an easy way to do this at the time it was asked. If you want to redo the authors' analysis to get the gene expression (gene-cell / feature-barcode) matrix, and they used cellranger (the official 10X pipeline software) to do that, you will need FASTQ files as input, because "10x pipelines require sequencer FASTQs (with embedded barcodes) as input."

To get FASTQ files from 10X BAM files, 10X Genomics provides a relatively new tool: bamtofastq. You can then use those FASTQ files as input to cellranger.

More info here: https://support.10xgenomics.com/docs/bamtofastq
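
The two steps described above can be sketched roughly as follows. This is only an outline under assumptions: the function names, the run helper and all paths are hypothetical, and only the basic bamtofastq <bam> <output-dir> form and the cellranger count options --id, --transcriptome and --fastqs are used. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
# Print commands instead of running them unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Step 1: convert an original 10X BAM back to FASTQ files.
# Arguments: <bam> <output-dir>
convert_bam() {
  run bamtofastq "$1" "$2"
}

# Step 2: rerun the counting pipeline on the regenerated FASTQs.
# Arguments: <run-id> <reference-transcriptome-dir> <fastq-dir>
count_genes() {
  run cellranger count --id="$1" --transcriptome="$2" --fastqs="$3"
}

# Example (hypothetical paths):
#   convert_bam possorted_genome_bam.bam fastq_from_bam
#   count_genes reanalysis /path/to/reference fastq_from_bam/<subdirectory>
```

As far as I can tell, bamtofastq writes the FASTQs into subdirectories of the output directory, so check what it produced before choosing the --fastqs path. Depending on the cellranger version, further options may be required.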

How to correctly download 10X BAM files from NCBI?

The one caveat is that bamtofastq needs a BAM file generated directly by 10X's cellranger (or the respective 10X pipeline, if you are not dealing with gene expression). A BAM file obtained by downloading an SRA run from NCBI and converting it to BAM won't work; you need the original BAM file (often found among the originally submitted files, under "Original format").

Because figuring this out is also not trivial, here is some additional help: an original 10X BAM file (if it was submitted in this format) can be downloaded from NCBI using prefetch from sra-tools with the --type option, for example:

prefetch --type TenX --max-size 100000000 SRR5167880

Because the file was larger than the default limit of 20 GB, I also had to raise the limit with the --max-size option.

Alternatively, you can follow the official 10X instructions to Download and Process BAM files of 1.3 Million Brain Cells.

I know this is an old thread, but I'm hoping my answer will be helpful to anyone who has the same question (as I did).
