Comments (16)

cjnolet avatar cjnolet commented on June 27, 2024

@JBreunig,

This is a great question. I admit we have mostly been testing on 32 GB GPUs, but I can generate some random data with known sparsity to get a feel for how far we can push different GPUs.

What is the ideal target size and sparsity for your GPU? Are there any public datasets with a similar size/sparsity?

from rapids-single-cell-examples.

JBreunig avatar JBreunig commented on June 27, 2024

Good examples would be here: http://mousebrain.org/downloads.html (either the aggregate loom https://storage.googleapis.com/linnarsson-lab-loom/l5_all.agg.loom, or you could see where you hit the limit by concatenating different subsets from http://mousebrain.org/loomfiles_level_L1.html).

For example, I'm currently processing a dataset of 330,000 cells including all of the above and a bunch of our datasets combined with batch correction.

cjnolet avatar cjnolet commented on June 27, 2024

@JBreunig,

Wanted to provide a small update just to let you know that I have been looking into this. I think the most straightforward solution here might be to use the unified virtual memory allocator. I’ll get an example together.

JBreunig avatar JBreunig commented on June 27, 2024

Ok, thanks! I'll close this.

cjnolet avatar cjnolet commented on June 27, 2024

@JBreunig,

I've made a modification to the notebook to enable the Unified Virtual Memory manager in RAPIDS & CuPy. Specifically, the change looks like this:

import cupy as cp
import rmm

rmm.reinitialize(
    managed_memory=True,  # allows oversubscription of GPU memory
    devices=0,  # GPU device IDs to register; by default registers only GPU 0
)

cp.cuda.set_allocator(rmm.rmm_cupy_allocator)

This should allow you to oversubscribe your GPU memory to use all available host memory and it will page memory onto the GPU as needed. While swapping pages can slow down the workflow, it does make it much easier to experiment and explore different datasets & sizes without having to think about available GPU memory.

For example, I was able to load the 10x 1.3M neuron dataset pretty easily onto a 32gb GPU and do all sorts of transformations to it without ever encountering an OOM.

If you get a chance, you should try this feature out and let us know if it allows you to scale higher than before. We're also very curious to know whether you still find it fast. I didn't notice much of a performance hit at all, and I wasn't even paying attention to how much memory I was using (I'm sure it was well over 100 GB by the time I was done).

JBreunig avatar JBreunig commented on June 27, 2024

I will try the 1.3M neuron dataset myself and some internal datasets at some point today.

Using this new modification looks excellent so far. Initial results: 54 seconds (GPU) vs. 563 seconds (CPU) on a 32-core/64-thread Threadripper with a 1080 Ti and 128 GB of RAM.

JBreunig avatar JBreunig commented on June 27, 2024

Just FYI, the 1.3M neuron dataset crashes the kernel on this line:

filtered = rapids_scanpy_funcs.filter_cells(sparse_gpu_array, min_genes=min_genes_per_cell, max_genes=max_genes_per_cell)

However, my other dataset of ~330K cells gets past this step. It doesn't appear to be a RAM/swap issue, since I'm nowhere near that ceiling. Perhaps it's a VRAM issue? I'll try again later.

cjnolet avatar cjnolet commented on June 27, 2024

@JBreunig,

I meant to respond to you earlier about this. Indeed, I noticed the crash as well while playing around with this. It's actually a known bug in cuSPARSE, and we're waiting for them to fix it. I managed to isolate the bug to entry 1057790 in the input data; the problem is in the conversion from the CPU array to the GPU.

If you slice off the first 1M records, or vstack everything up to 1057789 with everything above 1057791, the filter will work. I was able to run the 1M-record subset fairly easily all the way up to regress_out. We're also very close to merging a cuML PR that will enable sparse inputs for PCA (and doesn't require conversion to dense for the mean centering).
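The slice-and-vstack workaround described above can be sketched on the CPU side with SciPy. This is a minimal illustration, not the notebook's actual code: the matrix is random toy data, and the row index is a placeholder for the problematic entry.

```python
# Hedged sketch: drop a problematic row from a CSR matrix by slicing
# around it and re-stacking, before handing the data to the GPU.
import scipy.sparse as sp

def drop_row(mat, bad_row):
    """Return a CSR matrix with row `bad_row` removed via slicing + vstack."""
    return sp.vstack([mat[:bad_row], mat[bad_row + 1:]]).tocsr()

mat = sp.random(10, 5, density=0.3, format="csr", random_state=0)
filtered = drop_row(mat, 7)
assert filtered.shape == (9, 5)
```

On the real dataset the same pattern would slice around entry 1057790 instead; vstack returns COO, so the final `.tocsr()` restores the format the filtering functions expect.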

JBreunig avatar JBreunig commented on June 27, 2024

Just FYI, I'm consistently crashing the kernel here with your new notebook and rapids_scanpy_funcs.py file:

%%time
sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes, flavor="cell_ranger")
adata = adata[:, adata.var.highly_variable]

Any suggestions?

cjnolet avatar cjnolet commented on June 27, 2024

@JBreunig,

What is the shape of adata passed into the highly_variable_genes? Is it giving you any type of error at all before the kernel crashes?

JBreunig avatar JBreunig commented on June 27, 2024

(989838, 23781) right after this step (adding a line):

%%time
adata = anndata.AnnData(sparse_gpu_array.get())
adata.var_names = genes.to_pandas()
adata.shape

and there is no Python error, just the kernel crash and a system request to send a report.

Let me try shaving off cells to see if it's a memory issue.
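Shaving off cells to bisect for a size threshold could look something like the sketch below. The names and the dense toy matrix are illustrative; the real data is a sparse GPU array, but the row-subsampling pattern is the same.

```python
# Hedged sketch: subsample rows of the count matrix before the AnnData
# conversion, to test whether a smaller cell count avoids the crash.
import numpy as np

rng = np.random.default_rng(42)

def subsample_rows(mat, n_cells, rng):
    """Pick a random subset of rows, preserving their original order."""
    idx = rng.choice(mat.shape[0], size=n_cells, replace=False)
    return mat[np.sort(idx)]

counts = rng.poisson(1.0, size=(1000, 50))  # stand-in for the count matrix
small = subsample_rows(counts, 300, rng)
assert small.shape == (300, 50)
```

Bisecting on `n_cells` (e.g. 300K works, 400K crashes) narrows down whether the failure is tied to data volume rather than a specific record.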

cjnolet avatar cjnolet commented on June 27, 2024

@JBreunig,

I believe we might have hit a similar issue today where our Jupyter kernel crashed without giving any type of useful error information. I’m pretty sure it’s because we were running on a system that didn’t have enough main memory.

While the benefit of the managed memory option is the ability to oversubscribe GPU memory, it does increase the amount of main memory required.

Many of the CPU examples for the 1.3M-cell dataset indicate a requirement of at least 30 GB of main memory to do the processing end to end. I think you can get away with a smaller GPU when using managed memory, but this comes at the expense of needing more main memory.

JBreunig avatar JBreunig commented on June 27, 2024

I can run it fine with 300K cells (266 seconds), but somewhere a little above that it fails. At 500K it gets stuck and never finishes; at 700K or above it seems to crash the kernel. As I mentioned, I have 128 GB of RAM and 628 GB set aside for swap, but it doesn't appear to get near that limit, especially with 500K.

We are ordering workstations with 256 GB of RAM and hopefully, I'll add more VRAM with the next gen of video cards.

Update: a cell count somewhere between 350K and 400K causes the kernel crash for me. 350K took 299 seconds to finish, but 400K crashed the kernel.

JBreunig avatar JBreunig commented on June 27, 2024

Just FYI, I've upgraded to 256 GB of RAM and completely reinstalled the drivers and CUDA (from 10.1 to 10.2), and I still have problems with the code getting perpetually "stuck" in the regression or scaling steps: no kernel crashes lately, but a few CPU cores stay engaged by Python with no progress to completion. Have you run this code on a 2080 Ti or another non-TESLA card?

This only happens above 350K cells.

Is there any way to troubleshoot this?

cjnolet avatar cjnolet commented on June 27, 2024

@JBreunig,

Have you run this code on a 2080 Ti or other non-TESLA card?

Unfortunately, I don't have any 2080 Tis available to reproduce your problem on my end, and the 1M-cell notebook appears to work with the T4 instances in AWS, which rules out the problem being exclusive to the Turing architecture. This behavior does sound very strange, though.

Is there any way to troubleshoot this?

A lot of the time, errors end up printing in the terminal running the Jupyter notebook rather than in the notebook itself. Do you see any errors on the command line?

You can set verbose=True in the call to rapids_scanpy_funcs.regress_out, which will print something after every 500 cells are processed. If that's not enough, you can add more prints to the regress_out and scale functions in rapids_scanpy_funcs.

If you have a command-line available, can you also run the nvidia-smi command? That should at least help us determine if the GPU is being actively utilized when the code gets stuck.
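The verbose-logging idea above can be sketched as a simple progress wrapper. The per-cell work here is a placeholder, not the real regress_out; the point is that periodic prints localize where a long-running loop hangs.

```python
# Hedged sketch: print progress every 500 cells so a hang can be
# localized to a specific batch. The arithmetic is a stand-in for
# real per-cell work.
import numpy as np

def process_with_progress(rows, verbose=True):
    out = []
    for i, r in enumerate(rows):
        out.append(r * 2)  # placeholder for per-cell computation
        if verbose and (i + 1) % 500 == 0:
            print(f"processed {i + 1} cells")
    return np.array(out)

result = process_with_progress(np.arange(1200))
assert result.shape == (1200,)
```

If the prints stop advancing while `nvidia-smi` shows no GPU activity, the hang is likely in host-side code or in memory paging rather than in a kernel launch.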

JBreunig avatar JBreunig commented on June 27, 2024

Sorry for the delay. Coming back to this, it now seems to hang at:

%%time
sparse_gpu_array, genes = rapids_scanpy_funcs.filter_genes(sparse_gpu_array, genes, min_cells=1)

I'm guessing this is related to issue #53?
