
Eval bug: Unusually high RAM usage on Windows when running DeepSeek V3 Q2_K_XL/IQ2_XXS on hybrid CPU+GPU #13978

Open
@Que8549

Description


Name and Version

llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 5080, compute capability 12.0, VMM: yes
version: 5572 (7675c55)
built with MSVC 19.44.35207.1 for x64

Operating systems

Windows

GGML backends

CUDA

Hardware

AMD Ryzen 9 9950X3D CPU and 2 GPUs: NVIDIA RTX 5090 and RTX 5080.

Models

Unsloth models IQ2_XXS and Q2_K_XL from Hugging Face: https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD

Problem description & steps to reproduce

I am encountering the same issue reported here: #12651. That issue was closed, but is there a fix for it? I am hitting the exact same problem running unsloth's DeepSeek-V3-0324-UD-Q2_K_XL model with 2 GPUs (RTX 5090 and RTX 5080). I have tried setting LLAMA_CUDA_UNIFIED_MEMORY=1, but I still get an out-of-memory exception. This is the command I ran:

.\llama.cpp\build\bin\Release\llama-server ^
--model F:/local_llm/models/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf ^
--port 10000 ^
--ctx-size 8192 ^
--n-gpu-layers 10 ^
--tensor-split 0.6667,0.3333 ^
--cache-type-k q8_0 ^
--temp 0.3 ^
--min-p 0.01
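
For completeness, a minimal sketch of how I set the unified-memory variable before launching (cmd.exe; the launch arguments are the same as above, and I am not certain this variable name has any effect on Windows):

rem set in the same shell session, then run llama-server with the arguments above
set LLAMA_CUDA_UNIFIED_MEMORY=1
rem PowerShell equivalent:
rem $env:LLAMA_CUDA_UNIFIED_MEMORY = "1"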

First Bad Commit

No response

Relevant log output

print_info: n_head           = 128
print_info: n_head_kv        = 128
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 192
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 24576
print_info: n_embd_v_gqa     = 16384
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 18432
print_info: n_expert         = 256
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = yarn
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 0.025
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 671B
print_info: model params     = 671.03 B
print_info: general.name     = DeepSeek V3 0324 BF16
print_info: n_layer_dense_lead   = 3
print_info: n_lora_q             = 1536
print_info: n_lora_kv            = 512
print_info: n_embd_head_k_mla    = 0
print_info: n_embd_head_v_mla    = 0
print_info: n_ff_exp             = 2048
print_info: n_expert_shared      = 1
print_info: expert_weights_scale = 2.5
print_info: expert_weights_norm  = 1
print_info: expert_gating_func   = sigmoid
print_info: rope_yarn_log_mul    = 0.1000
print_info: vocab type       = BPE
print_info: n_vocab          = 129280
print_info: n_merges         = 127741
print_info: BOS token        = 0 '<|begin▁of▁sentence|>'
print_info: EOS token        = 1 '<|end▁of▁sentence|>'
print_info: EOT token        = 1 '<|end▁of▁sentence|>'
print_info: PAD token        = 1 '<|end▁of▁sentence|>'
print_info: LF token         = 201 'Ċ'
print_info: FIM PRE token    = 128801 '<|fim▁begin|>'
print_info: FIM SUF token    = 128800 '<|fim▁hole|>'
print_info: FIM MID token    = 128802 '<|fim▁end|>'
print_info: EOG token        = 1 '<|end▁of▁sentence|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 10 repeating layers to GPU
load_tensors: offloaded 10/62 layers to GPU
load_tensors:        CUDA0 model buffer size = 28222.66 MiB
load_tensors:        CUDA1 model buffer size = 12095.43 MiB
load_tensors:   CPU_Mapped model buffer size = 46729.58 MiB
load_tensors:   CPU_Mapped model buffer size = 47092.29 MiB
load_tensors:   CPU_Mapped model buffer size = 47190.81 MiB
load_tensors:   CPU_Mapped model buffer size = 46830.22 MiB
load_tensors:   CPU_Mapped model buffer size =  7958.07 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 8192
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 0.025
llama_context: n_ctx_per_seq (8192) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.49 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4480.00 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 4697620480
llama_init_from_model: failed to initialize the context: failed to allocate buffer for kv cache
common_init_from_params: failed to create context with model 'F:/local_llm/models/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf'
srv    load_model: failed to load model, 'F:/local_llm/models/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf'
srv   operator (): operator (): cleaning up before exit...
main: exiting due to model loading error
