# ElasticDL Memory Leak Debug
Instances of ElasticDL, including the master, workers, and parameter servers (PS), can be killed by OOM because of memory leaks. In this document, we summarize the memory leak problems we found when training models using ElasticDL.
## `tf.keras.metrics` causes a memory leak in the master
In ElasticDL, each worker executes evaluation tasks and reports the model outputs and the corresponding labels to the master via gRPC. The master then calls `update_state` of `tf.keras.metrics` to calculate the metrics. Generally, a `grpc.server` uses multiple threads to process the gRPC requests from workers, so each thread calls `update_state`. The memory leak occurs when we use multiple threads in the `grpc.server`.
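A minimal sketch of the leaking pattern is shown below. It is not the actual ElasticDL code: it simply calls `update_state` of a shared `tf.keras.metrics.AUC` instance from a thread pool, the way a multi-threaded `grpc.server` handles evaluation reports; the metric, batch shapes, and iteration count are illustrative.

```python
# A minimal sketch, not the actual ElasticDL code: a shared
# tf.keras.metrics instance is updated from many threads, the way a
# multi-threaded grpc.server handles evaluation reports from workers.
from concurrent import futures

import numpy as np
import tensorflow as tf

auc = tf.keras.metrics.AUC()


def handle_evaluation_report(_):
    # Each handler thread updates the shared metric with reported
    # model outputs and labels.
    labels = np.random.randint(0, 2, size=(1024,))
    outputs = np.random.uniform(size=(1024,))
    auc.update_state(labels, outputs)


with futures.ThreadPoolExecutor(max_workers=64) as executor:
    # The process memory keeps growing while these updates run.
    list(executor.map(handle_evaluation_report, range(1000)))
```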
The resource configuration:
```bash
--master_resource_request="cpu=1,memory=1024Mi,ephemeral-storage=1024Mi" \
--worker_resource_request="cpu=1,memory=1024Mi,ephemeral-storage=1024Mi" \
--ps_resource_request="cpu=1,memory=1024Mi,ephemeral-storage=1024Mi" \
```
- Use multiple threads with `max_workers=64` in the master `grpc.server`, as shown below, and train a DeepFM model from the model zoo.
```python
def _create_master_service(self, args):
    self.logger.info("Creating master service")
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=64),
        ...
    )
```
Then, we view the used memory in the master.
- Use a single thread with `max_workers=1` in the master `grpc.server`, as shown below, and train a DeepFM model from the model zoo.
```python
def _create_master_service(self, args):
    self.logger.info("Creating master service")
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=1),
        ...
    )
```
Then, we view the used memory in the master.
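Since the leak disappears when only one thread calls `update_state`, one possible mitigation (a sketch under that assumption, not the ElasticDL implementation) is to keep the multi-threaded `grpc.server` for throughput but funnel every `update_state` call through a dedicated single-thread executor:

```python
# A sketch of one possible mitigation, not the ElasticDL implementation:
# keep the multi-threaded grpc.server for throughput, but run every
# update_state call on one dedicated thread, since the experiments above
# show that a single thread does not leak.
from concurrent import futures

import tensorflow as tf

auc = tf.keras.metrics.AUC()
# All metric updates go through this single-thread executor.
metric_executor = futures.ThreadPoolExecutor(max_workers=1)


def report_evaluation_metrics(labels, outputs):
    # Called concurrently by many gRPC handler threads; the actual
    # update_state always runs on the single metric thread.
    metric_executor.submit(auc.update_state, labels, outputs)
```

Whether this fully avoids the leak still needs verification; the experiments above only show that a single-threaded `grpc.server` does not leak.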
## Using multiple threads in the PS `grpc.server` also causes memory leaks
The resource configuration:
```bash
--master_resource_request="cpu=1,memory=1024Mi,ephemeral-storage=1024Mi" \
--worker_resource_request="cpu=1,memory=1024Mi,ephemeral-storage=1024Mi" \
--ps_resource_request="cpu=1,memory=1024Mi,ephemeral-storage=1024Mi" \
```
- Use multiple threads with `max_workers=64` in the PS `grpc.server`, as shown below, and train a DeepFM model from the model zoo.
```python
def prepare(self):
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=64),
        options=[
            ("grpc.max_send_message_length", GRPC.MAX_SEND_MESSAGE_LENGTH),
            (
                "grpc.max_receive_message_length",
                GRPC.MAX_RECEIVE_MESSAGE_LENGTH,
            ),
        ],
    )
```
Then, we view the used memory in the PS instances.
- Use a single thread with `max_workers=1` in the PS `grpc.server`, as shown below, and train a DeepFM model from the model zoo.
```python
def prepare(self):
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=1),
        options=[
            ("grpc.max_send_message_length", GRPC.MAX_SEND_MESSAGE_LENGTH),
            (
                "grpc.max_receive_message_length",
                GRPC.MAX_RECEIVE_MESSAGE_LENGTH,
            ),
        ],
    )
```
Then, we view the used memory in the PS instances.
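To view the used memory, we can check the pod metrics (for example with `kubectl top pod`) or log the resident set size from inside the process. The helper below is a hypothetical sketch using `psutil`, which is not part of ElasticDL:

```python
import os
import time

import psutil  # an assumption: psutil is not a dependency of ElasticDL


def log_used_memory(interval_seconds=30):
    """Periodically print the resident set size of the current process."""
    process = psutil.Process(os.getpid())
    while True:
        rss_mib = process.memory_info().rss / (1024 * 1024)
        print("used memory: %.1f MiB" % rss_mib)
        time.sleep(interval_seconds)
```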
## `tf.py_function` causes a memory leak in the ElasticDL `Embedding` layer

The memory leak occurs if we use `tf.py_function` to wrap `lookup_embedding` in the ElasticDL `Embedding` layer. The details are in [TensorFlow issue 35010](https://github.com/tensorflow/tensorflow/issues/35010).