# ElasticDL Memory Leak Debug
Instances of ElasticDL, including the master, workers, and parameter servers (PS), can be killed by OOM because of memory leaks. In this document, we summarize the memory leak problems we found when training models using ElasticDL.
## `tf.keras.metrics` causes a memory leak in the master
In ElasticDL, each worker executes evaluation tasks and reports the model outputs and the corresponding labels to the master via gRPC. The master then calls `update_state` of `tf.keras.metrics` to calculate the metrics. Generally, a `grpc.server` uses multiple threads to process the gRPC requests from workers, so each thread calls `update_state`. The memory leak occurs when we use multiple threads in the `grpc.server`.
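A minimal sketch of the leaking pattern is shown below. It is not the actual ElasticDL code: it simply calls `update_state` of a shared `tf.keras.metrics.AUC` instance from a thread pool, the way a multi-threaded `grpc.server` handles evaluation reports; the metric, batch shapes, and iteration count are illustrative.

```python
# A minimal sketch, not the actual ElasticDL code: a shared
# tf.keras.metrics instance is updated from many threads, the way a
# multi-threaded grpc.server handles evaluation reports from workers.
from concurrent import futures

import numpy as np
import tensorflow as tf

auc = tf.keras.metrics.AUC()


def handle_evaluation_report(_):
    # Each handler thread updates the shared metric with reported
    # model outputs and labels.
    labels = np.random.randint(0, 2, size=(1024,))
    outputs = np.random.uniform(size=(1024,))
    auc.update_state(labels, outputs)


with futures.ThreadPoolExecutor(max_workers=64) as executor:
    # The process memory keeps growing while these updates run.
    list(executor.map(handle_evaluation_report, range(1000)))
```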
The resource configuration:
```bash
--master_resource_request="cpu=1,memory=1024Mi,ephemeral-storage=1024Mi" \
--worker_resource_request="cpu=1,memory=1024Mi,ephemeral-storage=1024Mi" \
--ps_resource_request="cpu=1,memory=1024Mi,ephemeral-storage=1024Mi" \
```
- Use multiple threads with `max_workers=64` in the master `grpc.server`, as shown below, and train a DeepFM model from the model zoo.
```python
def _create_master_service(self, args):
    self.logger.info("Creating master service")
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=64),
        ...
    )
```
Then, we view the used memory in the master.
- Use a single thread with `max_workers=1` in the master `grpc.server`, as shown below, and train a DeepFM model from the model zoo.
```python
def _create_master_service(self, args):
    self.logger.info("Creating master service")
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=1),
        ...
    )
```
Then, we view the used memory in the master.
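Since the leak disappears when only one thread calls `update_state`, one possible mitigation (a sketch under that assumption, not the ElasticDL implementation) is to keep the multi-threaded `grpc.server` for throughput but funnel every `update_state` call through a dedicated single-thread executor:

```python
# A sketch of one possible mitigation, not the ElasticDL implementation:
# keep the multi-threaded grpc.server for throughput, but run every
# update_state call on one dedicated thread, since the experiments above
# show that a single thread does not leak.
from concurrent import futures

import tensorflow as tf

auc = tf.keras.metrics.AUC()
# All metric updates go through this single-thread executor.
metric_executor = futures.ThreadPoolExecutor(max_workers=1)


def report_evaluation_metrics(labels, outputs):
    # Called concurrently by many gRPC handler threads; the actual
    # update_state always runs on the single metric thread.
    metric_executor.submit(auc.update_state, labels, outputs)
```

Whether this fully avoids the leak still needs verification; the experiments above only show that a single-threaded `grpc.server` does not leak.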
## Using multiple threads in the PS `grpc.server` also causes memory leaks
The resource configuration:
```bash
--master_resource_request="cpu=1,memory=1024Mi,ephemeral-storage=1024Mi" \
--worker_resource_request="cpu=1,memory=1024Mi,ephemeral-storage=1024Mi" \
--ps_resource_request="cpu=1,memory=1024Mi,ephemeral-storage=1024Mi" \
```
- Use multiple threads with `max_workers=64` in the PS `grpc.server`, as shown below, and train a DeepFM model from the model zoo.
```python
def prepare(self):
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=64),
        options=[
            ("grpc.max_send_message_length", GRPC.MAX_SEND_MESSAGE_LENGTH),
            (
                "grpc.max_receive_message_length",
                GRPC.MAX_RECEIVE_MESSAGE_LENGTH,
            ),
        ],
    )
```
Then, we view the used memory in the PS instances.
- Use a single thread with `max_workers=1` in the PS `grpc.server`, as shown below, and train a DeepFM model from the model zoo.
```python
def prepare(self):
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=1),
        options=[
            ("grpc.max_send_message_length", GRPC.MAX_SEND_MESSAGE_LENGTH),
            (
                "grpc.max_receive_message_length",
                GRPC.MAX_RECEIVE_MESSAGE_LENGTH,
            ),
        ],
    )
```
Then, we view the used memory in the PS instances.
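To view the used memory, we can check the pod metrics (for example with `kubectl top pod`) or log the resident set size from inside the process. The helper below is a hypothetical sketch using `psutil`, which is not part of ElasticDL:

```python
import os
import time

import psutil  # an assumption: psutil is not a dependency of ElasticDL


def log_used_memory(interval_seconds=30):
    """Periodically print the resident set size of the current process."""
    process = psutil.Process(os.getpid())
    while True:
        rss_mib = process.memory_info().rss / (1024 * 1024)
        print("used memory: %.1f MiB" % rss_mib)
        time.sleep(interval_seconds)
```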
## `tf.py_function` causes a memory leak in the ElasticDL `Embedding` layer

The memory leak occurs if we use `tf.py_function` to wrap `lookup_embedding` in the ElasticDL `Embedding` layer. The details are in [TensorFlow issue 35010](https://github.com/tensorflow/tensorflow/issues/35010).