add memory properties API #1263

Open

bratpiorka opened this issue Apr 15, 2025 · 26 comments

@bratpiorka
Contributor

bratpiorka commented Apr 15, 2025

UMF should offer a set of observability functions that can be used to retrieve the memory properties of memory allocated through UMF. Since these properties are closely tied to the provider used, the API should essentially return the provider's properties for a given pointer.

Requirements:

  • allow the user to get the memory type (CPU or GPU), or at least whether the ptr is CPU-accessible
  • allow access to provider-specific properties, such as the NUMA node, USM type (Host, Device, Shared), GPU device ID, context, and others

Currently, for a given ptr, a user can obtain a provider by calling:

umf_memory_pool_handle_t pool = umfPoolByPtr(ptr)
umf_memory_provider_handle_t provider
umfPoolGetMemoryProvider(pool, &provider)

After obtaining the provider, there are several options for retrieving the memory provider properties:

Proposal 1 - per-provider get/set functions

In this proposal, the user needs to be aware of the provider's type. Each provider's property can then be retrieved in a manner similar to how it was set during creation. Additionally, there could be extra functions that do not have a corresponding "set" function for provider properties, such as a function that retrieves the general type of memory (e.g., CPU or GPU). If a specific provider doesn't know how to populate a given property, we could return a new error code UMF_RESULT_INVALID_PARAM.

name = umfMemoryProviderGetName(provider)

umf_memory_type_t memory_type          // NEW enum: CPU or GPU
umf_usm_memory_type_t usm_memory_type  // host, device, shared

// NOTE: 'params' would first have to be obtained from the provider handle
switch(name)
    case "level_zero":
        // NEW per-property umf*MemoryProviderParamsGet*() call
        ret1 = umfLevelZeroMemoryProviderParamsGetMemoryType(params, &memory_type)
        // NEW return code
        if (ret1 == UMF_RESULT_INVALID_PARAM) {
            ...
        }
        ret2 = umfLevelZeroMemoryProviderParamsGetUSMMemoryType(params, &usm_memory_type)

    case "os":
        unsigned* numa_list = NULL
        unsigned numa_list_len = 0
        ret1 = umfOsMemoryProviderParamsGetNumaList(params, &numa_list, &numa_list_len)
        ret2 = umfOsMemoryProviderParamsGetMemoryType(params, &memory_type)
    ...

Proposal 2 - common properties structure

In this proposal, we could define a common structure for provider properties, along with a function that returns it based on the provider handle. In this structure, we could maintain some properties, such as is_cpu_accessible, in a common scope, while storing provider-specific properties in unions (they could be nested). Additionally, we could introduce the type of provider as one of the properties.

// NOTE: this NEW enum is defined in the public API
enum umf_memory_provider_type {
    UMF_MEMORY_PROVIDER_OS,
    UMF_MEMORY_PROVIDER_CUDA,
    UMF_MEMORY_PROVIDER_LEVEL_ZERO,
    UMF_MEMORY_PROVIDER_FILE,
}

// NOTE: this NEW struct is defined in the public API
struct umf_memory_provider_params_t {
    DWORD version // NOTE: this struct has to be versioned
    bool is_cpu_accessible
    umf_memory_provider_type provider_type 
    union {
        struct {
            int numa_node
            DWORD flags
        } os
        struct  {
            umf_usm_memory_type_t usm_type
            void* gpu_context
            union {
                struct {
                    void* device_ctx
                } level_zero
                struct {
                    int device_id
                } cuda
            } 
        } gpu
    }
}

umf_memory_provider_params_t params = umfMemoryProviderGetParams(provider)
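For illustration, consuming this struct could look like the sketch below (use_cuda_device and use_ze_context are placeholder client functions, not part of any API):

// sketch only - switch on the provider type, then read the matching union member
umf_memory_provider_params_t params = umfMemoryProviderGetParams(provider)
if (!params.is_cpu_accessible) {
    switch (params.provider_type) {
    case UMF_MEMORY_PROVIDER_CUDA:
        use_cuda_device(params.gpu.cuda.device_id, params.gpu.gpu_context)
        break
    case UMF_MEMORY_PROVIDER_LEVEL_ZERO:
        use_ze_context(params.gpu.level_zero.device_ctx, params.gpu.gpu_context)
        break
    default:
        break
    }
}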

Proposal 1 + 2 - generic per-property set of functions

This is a mix of proposals 1 and 2. The difference is that umf_memory_provider_params_t is hidden from the user, and there is a public list of generic (not provider-specific) per-property functions:

umfMemoryProviderGetParams(provider, &params)
umfMemoryProviderParamsGetType(params, &provider_type)
umfMemoryProviderParamsGetCPUAccessible(params, &is_cpu_accessible)
umf_result_t res = umfMemoryProviderParamsGetGPUContext(params, &gpu_context)

Proposal 3A - per-provider key-value set based on strings

Each provider could keep a key-value store holding its properties. Then, the user could get a specific property from a provider using its name.

// NOTE: this structure is non-public
struct umf_property_t {
    char* key
    void* value
}

// NOTE: this structure is non-public
struct umf_property_set_t {
    umf_property_t* properties
    size_t count
}

const umf_property_set_t* props = umfMemoryProviderGetProps(provider)
// NEW API call - return value in last param (void*)
// NOTE: do we always return void* or unsigned int? If not, the additional "size" param could be needed
umf_result_t res = umfGetPropValueByName(props, "memory_type", &memory_type)
// get name of the prop
const char* prop_name = NULL
umfGetPropName(props, id, &prop_name)

Proposal 3B - per-provider key-value set based on IDs

Similarly to the string-based Get functions from Proposal 3A, we could use property IDs. These could be pre-defined in public headers.

enum umf_property_id {
    UMF_MEMORY_TYPE,
    // OS provider
    UMF_NUMA_NODE_ID,
    // any GPU provider
    UMF_MEMORY_USM_TYPE,
    UMF_GPU_CONTEXT,
    // CUDA-specific
    UMF_DEVICE_ID,
    // Level-Zero specific
    UMF_DEVICE_CTX,
    // File-provider specific
    ....
    UMF_MAX_PROPERTY_ID
}

const umf_property_set_t* props = umfMemoryProviderGetProps(provider)
void* value = umfGetPropValueById(props, UMF_NUMA_NODE_ID)
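If the void*/size question from Proposal 3A applies here too, the ID-based getter could take an explicit output buffer size; a possible (hypothetical) signature:

// hypothetical typed variant with an explicit output buffer size
umf_result_t umfGetPropValueById(const umf_property_set_t* props, umf_property_id id,
                                 void* value, size_t value_size)

// usage
unsigned numa_node = 0
umfGetPropValueById(props, UMF_NUMA_NODE_ID, &numa_node, sizeof(numa_node))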

Proposal 4 - CTL

Similar to proposal 3A but based on CTL.

// get a value by name
umfCtlGet("umf.provider.by_handle.props.memory_type", provider, &memory_type)

// get a list of props
int count = 0;
umfCtlGet("umf.provider.by_handle.props.count", provider, &count)

// get name/val by id
const char* name = NULL;
umfCtlGet("umf.provider.by_handle.props.2.name", provider, &name)
void* val;
umfCtlGet("umf.provider.by_handle.props.2.val", provider, &val)

Additional Considerations - per allocation properties

Please note that in the proposals above, we assumed that all pointer properties could be derived from the provider properties. However, this is not the case for certain attributes, such as the unique ID of the allocation (see CU_POINTER_ATTRIBUTE_BUFFER_ID for CUDA and ze_memory_allocation_properties_t.id for Level Zero) or the page size. To query the page size of an allocation, the user could use the generic umfMemoryProviderGetMinPageSize(provider, ptr, &page_size) function. However, we still need to define a new umfMemoryProviderGetAllocationID(provider, ptr, &id) function for retrieving the allocation ID.

Additional per-pointer properties to consider are the base pointer and the size of the full allocation (see zeMemGetAddressRange).
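To make the per-allocation part concrete, a client-side sketch could look as follows. umfMemoryProviderGetMinPageSize() exists in UMF today; umfMemoryProviderGetAllocationID() is the new function proposed above, and umfMemoryProviderGetAddressRange() is only a hypothetical name for a zeMemGetAddressRange-style query:

size_t page_size = 0
umfMemoryProviderGetMinPageSize(provider, ptr, &page_size)  // existing UMF API

uint64_t alloc_id = 0
umfMemoryProviderGetAllocationID(provider, ptr, &alloc_id)  // NEW, proposed above

void* base = NULL
size_t size = 0
umfMemoryProviderGetAddressRange(provider, ptr, &base, &size)  // hypothetical, mirrors zeMemGetAddressRange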

Hybrid proposal

It is also worth noting that we could achieve both flexibility (as in Proposal 4 - CTL) and performance by caching per-provider properties on the user side:

// this struct is defined by the user and caches only the required provider properties
struct props_struct {
  int type; // set accordingly
  bool is_host;
  union {
    struct cuda_props cuda;
    struct l0_props l0;
  } data;
}

...

get_properties(ptr, props_cache) {
    // get the memory provider
    umf_memory_pool_handle_t pool = umfPoolByPtr(ptr)
    umfPoolGetMemoryProvider(pool, &provider)

    // props_struct is defined and filled by the user
    props_struct* props = props_cache[provider]
    if (props == NULL) {
        // slow path - query and cache the properties once per provider
        props = calloc(1, sizeof(*props)) // empty
        const char* adapter_name = umf_get_prop(provider, "adapter_name") // only an example - could be CTL
        if (strcmp(adapter_name, "CUDA") == 0) {
            props->type = PROPS_STRUCT_CUDA
            props->data.cuda.prop2 = umf_get_prop(provider, "prop2")
            props->data.cuda.prop3 = umf_get_prop(provider, "prop3")
        } else if (strcmp(adapter_name, "L0") == 0) {
            props->type = PROPS_STRUCT_L0
            props->data.l0.prop2 = umf_get_prop(provider, "prop2")
            props->data.l0.prop3 = umf_get_prop(provider, "prop3")
        }
        props_cache[provider] = props
    }

    // fast path - read from the cached entry
    bool is_host = props->is_host
}

Pros / Cons

| Criterion | Proposal 1: per-provider get/set funcs | Proposal 2: common props struct | Proposal 1+2: generic per-property funcs | Proposal 3A: per-provider key-value strings set | Proposal 3B: per-provider key-value ID set | Proposal 4: CTL | Proposal 5: Hybrid CTL |
|---|---|---|---|---|---|---|---|
| easy to implement | + yes | - complex structure, new ops method | - complex structure, new ops method | + easy | + easy | - complex | - complex + code at user side |
| consistent with how we set props of providers | + yes | - no | +/- somewhat | - no | - no | - no | - no |
| API defined in per-provider vs common headers | per-provider | common | common + per-provider | common, kv structures non-public | common | hidden | hidden |
| encapsulation | + yes | - no (types) | + yes if we keep some API in e.g. GPU headers | + yes | + yes | + yes | + yes |
| needs to be versioned | + no | - yes | + no | + no | - yes? | + no | + no |
| number of new API functions | - large | + small | +/- moderate | + small | + small | + none - uses existing | + none - uses existing |
| performance | + fast | + fast | + fast | - slow (string compare) | + fast | - slow | + fast |
| supports "common" properties | - no | + yes | + yes | + yes | + yes | + yes | + yes |
| supports user-defined providers | + yes | +/- only common props | +/- only common props | + yes | +/- yes, with some potential problems | + yes | + yes |
| supports user-defined properties | + yes | - no | + yes (tricky) | + yes | +/- yes, with some potential problems | + yes | + yes |
| user needs to know the type of provider for common props | - yes | + no | + no | + no | + no | + no | + no |
@bratpiorka bratpiorka added 1.0 API readiness Things to be improved in API before 1.0 stable release enhancement New feature or request labels Apr 15, 2025
@bratpiorka bratpiorka added this to the v0.12.x milestone Apr 15, 2025
@bratpiorka
Contributor Author

@lplewa @vinser52 please review

@bratpiorka
Contributor Author

@lplewa @vinser52 please re-review

Also, please note the new "Additional Considerations - per allocation properties" section, where I described an additional provider ops function, umfMemoryProviderGetAllocationID.

@lplewa
Contributor

lplewa commented Apr 18, 2025

I vote for option 5.
A cache on the user side is not complicated. We just have to document that memory properties do not change, so the user can cache them. There will not be a huge number of providers on the user side (I hope), so this cache should be really simple to implement and performant. If we drop the performance requirement on the UMF side, we can use the CTL API, so we do not have to introduce yet another interface.

@lplewa
Contributor

lplewa commented Apr 18, 2025

Also, we need an example showing the memory property API with a cache, so if users need one they can copy-paste it from the example.

@bratpiorka bratpiorka self-assigned this Apr 22, 2025
@irozanova

@lplewa, the problem is that not all properties have a 1:1 relation with the provider. For example, the memory ID is unique to the allocation, and the provider may include several allocations. In this case we would have to use 3 different caches to get all the properties: umfPoolGetMemoryProvider, the user cache props_cache[provider], and an additional "cache" inside the provider to get mem_id and other allocation-related properties (umfMemoryProviderGetAllocationID etc.). I am not sure that it will show good performance. But we may check.

I also do not see what advantages option 5 provides compared with using the native CUDA/Level Zero API. We can already get all properties from CUDA/Level Zero and then cache them.

@irozanova

A few more properties which are used in MPI: the base pointer and the size of the full allocation. We get them from zeMemGetAddressRange/cuMemGetAddressRange.

@vinser52
Contributor

General thoughts from my side:

  1. I do not think we should delegate any caching to the user side if it is possible to implement it inside UMF.
  2. We should avoid a string-based API if possible. As we agreed at the meeting, the CTL API and the Memory Property API target different use cases. CTL is good for debugging/tracing/logging, but the Memory Property API is targeted at use cases where the client is not aware of the pool/pointer and needs to discover the properties of the allocation.
  3. We are still missing the part where we test the proposals on real use cases (the MPI case), which would demonstrate how to use UMF in the client flow.

@bratpiorka
Contributor Author

Yes, we are aware of per-allocation properties vs per-provider properties. The user could cache only per-provider properties if needed. Also, I don't think caching per-allocation properties makes sense (consider free + alloc returning the same pointer from a different provider). IMO, for per-allocation props we should just extend our existing API (umfMemoryProviderGetMinPageSize etc.). So I would separate these problems and focus here only on per-provider properties.

@vinser52

  1. Our per-provider properties are already set in the provider - there is nothing to cache on the UMF side. Caching could only be a user-side workaround for any possible perf issues. It is also an example of how to "think" the UMF way, where we have a concept of providers (hence I think an example with cached CTL props would be nice to have).
  2. CTL doesn't require knowing the type of the pool/provider, so I don't understand in which scenario it can't be used. I understand the performance reasons - that is why I have a proposal based on CTL + caching, but if that isn't possible, which other option would you recommend?
  3. I would be happy to do any testing, but we should agree on which option we want to test/create a POC for.

@irozanova

irozanova commented Apr 22, 2025

> Also, I don't think caching per-allocation properties makes sense (consider free + alloc returning the same pointer from a different provider)

But it is the main area where we can improve performance in MPI. MPI cannot cache the memory ID because, as you said, it can change on free/alloc. But UMF controls free/alloc and can update the cache accordingly.

@vinser52
Contributor

For the mem ID, UMF can introduce its own mem ID and store it as part of the tracker.
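A minimal sketch of how such a tracker-held ID could look (all names here are hypothetical; the tracker entry layout is an assumption, not the actual UMF internals):

// hypothetical: the tracker assigns its own monotonically increasing ID to each
// coarse-grain allocation; the ID is created on provider alloc and dropped on free,
// so a reused VA gets a fresh ID
static uint64_t next_mem_id = 1  // would be incremented atomically

struct tracker_entry {
    void* base
    size_t size
    uint64_t mem_id  // set to next_mem_id++ when the provider allocation is tracked
}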

@bratpiorka
Contributor Author

> But it is the main area where we can improve performance in MPI. MPI cannot cache the memory ID because, as you said, it can change on free/alloc. But UMF controls free/alloc and can update the cache accordingly.

OK, to be more precise: there is no sense in caching per-alloc properties on the user side. Of course, we could do it on the UMF side (using our allocation tracker), but accessing it would require a different set of API functions which accept an allocation ptr instead of a provider handle as an argument.

@vinser52
Contributor

> I would be happy to do any testing, but we should agree on which option we want to test/create a POC for.

Under "testing" I meant that for every proposal, we should consider how it will be used by MPI. As a first step, we do not even need to create a working POC; just creating a simple code snippet (like you did to demonstrate the idea of caching on the user side) should be enough.

@vinser52
Contributor

> but accessing it would require a different set of API functions which accept an allocation ptr instead of a provider handle as an argument.

The next question is: why not make all Mem Property APIs accept a pointer as a parameter? Even though some properties are provider-specific, internally we can call umfPoolByPtr and umfPoolGetMemoryProvider, but this will be hidden from the user.

P.S.: I am not pushing for this approach, just trying to brainstorm and ask questions.
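A possible shape for such a pointer-based API (a hypothetical signature only; internally it would do the umfPoolByPtr + umfPoolGetMemoryProvider lookup mentioned above):

// hypothetical pointer-based getter - resolves the pool/provider internally
umf_result_t umfGetMemoryPropertyByPtr(const void* ptr, umf_property_id id,
                                       void* value, size_t value_size)

// usage - the client never touches pool or provider handles
umf_usm_memory_type_t usm_type
umfGetMemoryPropertyByPtr(ptr, UMF_MEMORY_USM_TYPE, &usm_type, sizeof(usm_type))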

@bratpiorka
Contributor Author

> but accessing it would require a different set of API functions which accept an allocation ptr instead of a provider handle as an argument.

> The next question is: why not make all Mem Property APIs accept a pointer as a parameter? Even though some properties are provider-specific, internally we can call umfPoolByPtr and umfPoolGetMemoryProvider, but this will be hidden from the user.

> P.S.: I am not pushing for this approach, just trying to brainstorm and ask questions.

Since we currently have at least 3 more per-allocation functions to consider (alloc ID, base ptr, and size), I also think this could be a good approach. And we can't use CTL here (but we can still use it internally in the implementation).

@irozanova

We may need two different types of API:

  1. Getting all memory properties (or some specific subset of properties)
  2. Getting one specific property

For both of them, ideally a single cache lookup should be enough to achieve the best performance; see the sketch below.
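Hypothetical signatures for the two shapes, extending the pointer-based idea above (umf_memory_properties_t is an assumed struct name, not an existing type; both calls could share one internal cache lookup):

// 1. get all properties (or a subset) in one call - one cache search
umf_result_t umfGetMemoryProperties(const void* ptr, umf_memory_properties_t* props)

// 2. get one specific property - also one cache search
umf_result_t umfGetMemoryProperty(const void* ptr, umf_property_id id,
                                  void* value, size_t value_size)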

@irozanova

irozanova commented Apr 22, 2025

One of the two types may not be needed if we can achieve the same behavior with the other one without overhead.

@vinser52
Contributor

We also need to consider representative benchmarks.

The compute-benchmarks suite already contains something for Level Zero:

  1. get_memory_properties_l0.cpp
  2. get_memory_properties_with_offseted_pointer_l0.cpp

We need to consider adding UMF versions of these, or creating other representative benchmarks, so that we can monitor improvements vs. L0.

@irozanova

irozanova commented Apr 22, 2025

> We also need to consider representative benchmarks.

I would also define "the roofline", for example the time of one search in the cache. We may choose some reasonable number of elements in the cache.

@bratpiorka
Contributor Author

@vinser52, please comment on which of the proposed options you vote for (even considering per-allocation properties - the problem we face there is the same as with providers), or do you have an idea for a different proposal?

> Under "testing" I meant that for every proposal, we should consider how it will be used by MPI. As a first step, we do not even need to create a working POC; just creating a simple code snippet (like you did to demonstrate the idea of caching on the user side) should be enough.

I would like to start with the most promising option. It would be a lot of work to create a POC for all the options.

> We need to consider adding UMF versions of these, or creating other representative benchmarks, so that we can monitor improvements vs. L0.

Yes, this would be helpful.

@lplewa
Contributor

lplewa commented Apr 23, 2025

@irozanova What exactly is the memory allocation ID, and how do you use it?

@irozanova

@lplewa, it is a unique ID that CUDA/L0 returns for each allocation. We have some caches associated with a pointer, and we need to know whether the pointer refers to the same memory or whether the memory was deallocated and a new allocation has the same address.

@lplewa
Contributor

lplewa commented Apr 23, 2025

> @lplewa, it is a unique ID that CUDA/L0 returns for each allocation. We have some caches associated with a pointer, and we need to know whether the pointer refers to the same memory or whether the memory was deallocated and a new allocation has the same address.

Do you need this information at the CUDA/L0 level, or do you need it at the UMF level? We can reuse memory without going back to the driver, so the ID at the driver level will be the same, but this might be a different allocation.

@irozanova

@lplewa, not necessarily the ID from CUDA/L0, but if the memory was not returned to the driver, then we need the same ID as before. A new ID would mean that the memory was returned to the driver and we need to remove outdated elements from the cache.

@vinser52
Contributor

> @lplewa, it is a unique ID that CUDA/L0 returns for each allocation. We have some caches associated with a pointer, and we need to know whether the pointer refers to the same memory or whether the memory was deallocated and a new allocation has the same address.

> Do you need this information at the CUDA/L0 level, or do you need it at the UMF level? We can reuse memory without going back to the driver, so the ID at the driver level will be the same, but this might be a different allocation.

In UMF terms, each coarse-grain allocation from the memory provider has a unique memory ID. The memory ID is used to detect a new allocation that has the same VA.

@ddurnov

ddurnov commented Apr 29, 2025

> @lplewa, it is a unique ID that CUDA/L0 returns for each allocation. We have some caches associated with a pointer, and we need to know whether the pointer refers to the same memory or whether the memory was deallocated and a new allocation has the same address.

Technically, these are the caches we are considering delegating to UMF going forward :)

@ddurnov

ddurnov commented Apr 29, 2025

> @lplewa, it is a unique ID that CUDA/L0 returns for each allocation. We have some caches associated with a pointer, and we need to know whether the pointer refers to the same memory or whether the memory was deallocated and a new allocation has the same address.

> Do you need this information at the CUDA/L0 level, or do you need it at the UMF level? We can reuse memory without going back to the driver, so the ID at the driver level will be the same, but this might be a different allocation.

> In UMF terms, each coarse-grain allocation from the memory provider has a unique memory ID. The memory ID is used to detect a new allocation that has the same VA.

Cache consistency/invalidation is what we wanted to stop doing at the higher-level runtime in general, i.e. the more caching happens in UMF, the better. The main scenario would be one where UMF intercepts alloc/free, so that cache invalidation happens inside some hook/interception layer.
