Description
Name and Version
llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 5080, compute capability 12.0, VMM: yes
version: 5572 (7675c55)
built with MSVC 19.44.35207.1 for x64
Operating systems
Windows
GGML backends
CUDA
Hardware
AMD Ryzen 9 9950X3D CPU and two GPUs: NVIDIA RTX 5090 and RTX 5080.
Models
Unsloth DeepSeek-V3-0324 GGUF quantizations (IQ2_XXS and Q2_K_XL) from Hugging Face: https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD
Problem description & steps to reproduce
I am encountering the same issue reported in #12651. That issue was closed, but is there a fix for it? I hit the exact same failure running Unsloth's DeepSeek-V3-0324-UD-Q2_K_XL model with 2 GPUs (NVIDIA RTX 5090 and RTX 5080). I have tried setting LLAMA_CUDA_UNIFIED_MEMORY=1, but I still get an out-of-memory error (see the rough VRAM numbers after the command below). This is the command I ran:
.\llama.cpp\build\bin\Release\llama-server ^
--model F:/local_llm/models/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf ^
--port 10000 ^
--ctx-size 8192 ^
--n-gpu-layers 10 ^
--tensor-split 0.6667,0.3333 ^
--cache-type-k q8_0 ^
--temp 0.3 ^
--min-p 0.01
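From the log below, this looks like plain VRAM exhaustion on device 0 rather than anything exotic. A rough back-of-the-envelope, assuming the RTX 5090's 32 GB (about 32768 MiB), some of which is already taken by the driver, display, and CUDA context:

CUDA0 model buffer        28222.66 MiB  (from load_tensors)
CUDA0 KV cache request     4480.00 MiB  (the cudaMalloc that fails)
total                     32702.66 MiB  vs. ~32768 MiB physical on the 5090

So even before any compute buffers are allocated, the current --n-gpu-layers / --tensor-split combination leaves essentially no headroom on device 0. Lowering --ctx-size or shifting the tensor split toward CUDA1 would presumably shrink that 4480 MiB request, but I would still like to know whether the underlying issue from #12651 has a proper fix.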
First Bad Commit
No response
Relevant log output
print_info: n_head = 128
print_info: n_head_kv = 128
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 192
print_info: n_embd_head_v = 128
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 24576
print_info: n_embd_v_gqa = 16384
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 18432
print_info: n_expert = 256
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = yarn
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 0.025
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 671B
print_info: model params = 671.03 B
print_info: general.name = DeepSeek V3 0324 BF16
print_info: n_layer_dense_lead = 3
print_info: n_lora_q = 1536
print_info: n_lora_kv = 512
print_info: n_embd_head_k_mla = 0
print_info: n_embd_head_v_mla = 0
print_info: n_ff_exp = 2048
print_info: n_expert_shared = 1
print_info: expert_weights_scale = 2.5
print_info: expert_weights_norm = 1
print_info: expert_gating_func = sigmoid
print_info: rope_yarn_log_mul = 0.1000
print_info: vocab type = BPE
print_info: n_vocab = 129280
print_info: n_merges = 127741
print_info: BOS token = 0 '<|begin▁of▁sentence|>'
print_info: EOS token = 1 '<|end▁of▁sentence|>'
print_info: EOT token = 1 '<|end▁of▁sentence|>'
print_info: PAD token = 1 '<|end▁of▁sentence|>'
print_info: LF token = 201 'Ċ'
print_info: FIM PRE token = 128801 '<|fim▁begin|>'
print_info: FIM SUF token = 128800 '<|fim▁hole|>'
print_info: FIM MID token = 128802 '<|fim▁end|>'
print_info: EOG token = 1 '<|end▁of▁sentence|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 10 repeating layers to GPU
load_tensors: offloaded 10/62 layers to GPU
load_tensors: CUDA0 model buffer size = 28222.66 MiB
load_tensors: CUDA1 model buffer size = 12095.43 MiB
load_tensors: CPU_Mapped model buffer size = 46729.58 MiB
load_tensors: CPU_Mapped model buffer size = 47092.29 MiB
load_tensors: CPU_Mapped model buffer size = 47190.81 MiB
load_tensors: CPU_Mapped model buffer size = 46830.22 MiB
load_tensors: CPU_Mapped model buffer size = 7958.07 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8192
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 10000.0
llama_context: freq_scale = 0.025
llama_context: n_ctx_per_seq (8192) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.49 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4480.00 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 4697620480
llama_init_from_model: failed to initialize the context: failed to allocate buffer for kv cache
common_init_from_params: failed to create context with model 'F:/local_llm/models/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf'
srv load_model: failed to load model, 'F:/local_llm/models/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf'
srv operator (): operator (): cleaning up before exit...
main: exiting due to model loading error