Bioinformatics Asked on August 22, 2021
I’m dealing with a very large and sparse dataset, and the first issue I met occurred when I tried to use quickCluster, which failed with this error:
'cannot allocate vector of size 156.6 Mb'
Given that I can’t upgrade my computer’s RAM right now and can’t afford to use a cluster, I want to rely on some other strategy, such as a package that would let me work with sparse matrices. I’m thinking about SparseM, but since I don’t know this package well, I’d like to know how to shrink the RAM allocation for this kind of matrix. Any suggestion would be greatly appreciated!
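To illustrate the kind of saving I’m hoping for, here is a sketch with the Matrix package (made-up dimensions, roughly single-cell scale):
library(Matrix)
# ~1% non-zero entries, typical of scRNA-seq count data
m <- rsparsematrix(nrow = 20000, ncol = 5000, density = 0.01)
format(object.size(m), units = "Mb")
## about 11.5 Mb, versus roughly 763 Mb for the same matrix stored densely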
Ah, looks like I can't even procrastinate on StackExchange anymore without seeing work-related stuff. Oh well.
Anyway, the other answers and comments are way off. scran has supported sparse matrices for years, ever since we switched over to the SingleCellExperiment class as our basic data structure. quickCluster does no coercion to dense format unless you tell it so explicitly, e.g., with use.ranks=TRUE (in which case you're asking for ranks, so there's little choice but to collapse to a dense matrix).
You don't provide an MWE or your session information, but this is how it rolls for me:
# Using the raw counts in the linked dataset. Despite being
# called a CSV, it's actually space delimited... typical.
library(scater)
mat <- readSparseCounts("GBM_raw_gene_counts.csv", sep=" ")
# Making an SCE just for fun. Not strictly necessary for
# this example, but you'll find it useful later.
sce <- SingleCellExperiment(list(counts=mat))
library(scran)
system.time(clust <- quickCluster(sce))
## user system elapsed
## 3.170 0.174 3.411
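You can also confirm that nothing was silently densified along the way:
is(counts(sce), "sparseMatrix")
## [1] TRUE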
This is running on my laptop, which has 16 GB of RAM, though I'm definitely not using all of it. I only go full throttle when I'm working on some real data, e.g., the 300k HCA bone marrow dataset. Check out the book for more details.
Session info below; I don't know quite enough SO-fu to know how to collapse it.
R version 4.0.0 Patched (2020-04-27 r78316)
Platform: x86_64-apple-darwin17.7.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /Users/luna/Software/R/R-4-0-branch/lib/libRblas.dylib
LAPACK: /Users/luna/Software/R/R-4-0-branch/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] scran_1.16.0 scater_1.16.1
[3] ggplot2_3.3.2 SingleCellExperiment_1.10.1
[5] SummarizedExperiment_1.18.1 DelayedArray_0.14.0
[7] matrixStats_0.56.0 Biobase_2.48.0
[9] GenomicRanges_1.40.0 GenomeInfoDb_1.24.2
[11] IRanges_2.22.2 S4Vectors_0.26.1
[13] BiocGenerics_0.34.0
loaded via a namespace (and not attached):
[1] beeswarm_0.2.3 statmod_1.4.34
[3] tidyselect_1.1.0 locfit_1.5-9.4
[5] purrr_0.3.4 BiocSingular_1.4.0
[7] lattice_0.20-41 colorspace_1.4-1
[9] vctrs_0.3.1 generics_0.0.2
[11] viridisLite_0.3.0 rlang_0.4.6
[13] pillar_1.4.4 glue_1.4.1
[15] withr_2.2.0 BiocParallel_1.22.0
[17] dqrng_0.2.1 GenomeInfoDbData_1.2.3
[19] lifecycle_0.2.0 zlibbioc_1.34.0
[21] munsell_0.5.0 gtable_0.3.0
[23] rsvd_1.0.3 vipor_0.4.5
[25] irlba_2.3.3 BiocNeighbors_1.6.0
[27] Rcpp_1.0.4.6 edgeR_3.30.3
[29] scales_1.1.1 limma_3.44.3
[31] XVector_0.28.0 gridExtra_2.3
[33] dplyr_1.0.0 grid_4.0.0
[35] tools_4.0.0 bitops_1.0-6
[37] magrittr_1.5 RCurl_1.98-1.2
[39] tibble_3.0.1 crayon_1.3.4
[41] pkgconfig_2.0.3 ellipsis_0.3.1
[43] Matrix_1.2-18 DelayedMatrixStats_1.10.0
[45] ggbeeswarm_0.6.0 viridis_0.5.1
[47] R6_2.4.1 igraph_1.2.5
[49] compiler_4.0.0
Correct answer by wizard_of_oz on August 22, 2021
Not a direct solution, but some workarounds:

1. As far as I know, Seurat can work with sparse matrices.

2. The particular function of scran that you are using eats up quite some memory; I believe it is needed for the "normalization" step (that is how I used it, anyway). While the scaling normalization performed by this function is superior to crude "log normalization", you can give the latter a try, as it is far less computationally intensive (it does not do any clustering). Seurat, once again, can help with this; see the sketch after this list.

3. You can downsample your data to the extent that it fits into your RAM.

4. You can give Python a try. More and more single-cell packages are written in Python, to some extent because of the problem you have experienced. For example, Scanpy's output is comparable to that of Seurat, although I am not sure if you can use scaling normalization with Scanpy.
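A minimal sketch of the log-normalization route in Seurat, assuming your counts are already in a sparse dgCMatrix called mat (the parameter values below are just Seurat's defaults):
library(Seurat)
# CreateSeuratObject accepts a sparse dgCMatrix and keeps it sparse
seu <- CreateSeuratObject(counts = mat)
# Cheap library-size scaling followed by log-transformation; no clustering involved
seu <- NormalizeData(seu, normalization.method = "LogNormalize",
                     scale.factor = 10000)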
Answered by haci on August 22, 2021
Essentially you've hit a RAM bottleneck, and the calculation will slow to a crawl or, as in this instance, refuse to go forward. The usual way around this is to parallelize the calculation across the cores of your machine. This will likely remove the RAM bottleneck; don't ask me the computer-architecture reasons for why it works, but it works.

However, my knowledge of R is minimal and I wouldn't know how to parallelize an R calculation. It is certainly doable in Perl and Python, but the calculation needs to be written with parallelisation in mind.
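For what it might look like in R (untested by me, since R isn't my thing): I understand the Bioconductor convention is the BiocParallel package, and scran functions such as quickCluster accept a BPPARAM argument, so something like this sketch, reusing the sce object from the accepted answer:
library(BiocParallel)
library(scran)
# Spread the work over 4 worker processes via the BPPARAM argument
clust <- quickCluster(sce, BPPARAM = MulticoreParam(workers = 4))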
The other way in is to reconfigure your calculation to remove sparse matrices OR find someone doing NGS where they've confgured their machine around heavy RAM.
Looking at your calculation, I don't quite understand why you need a specific package. It looks like unsupervised machine learning, PCA and t-SNE, that sort of thing, and you don't need a given package to do that; you just need to vectorise the inputs. If you worked out the statistical components of scran, there are a few extremely strong R statisticians/bioinformaticians on site who would have zero problem replicating it within a few lines of code. It's not hard in Python's scikit-learn either. At a guess, they perform PCA and resolve it via t-SNE, and this gives nice clear clusters.
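As a rough sketch of that guess in R (irlba and Rtsne are my assumptions for the standard tools here, and logcounts stands in for your log-expression matrix; scran's actual internals may well differ):
library(irlba)  # fast truncated PCA that copes with large/sparse matrices
library(Rtsne)
# logcounts: hypothetical genes-by-cells log-expression matrix
pcs <- prcomp_irlba(t(logcounts), n = 50)$x   # cells in rows, top 50 PCs
tsne <- Rtsne(pcs, pca = FALSE)               # embed the PCs with t-SNE
clusters <- kmeans(tsne$Y, centers = 10)$cluster  # crude clustering on the embedding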
GCP is free for three months, so in the context of a single calculation it will cost you nothing.
Answered by M__ on August 22, 2021