ElasticDL code overall review discussion

Bright edited this page Sep 26, 2019 · 11 revisions

20190924

Need further consideration

A task list commonly contains a run of consecutive training tasks followed by evaluation tasks. The dataset uses the prefetch function, so while a worker is handling the last training task in the run, prefetching pulls all the subsequent evaluation tasks into the same worker. Proposal: use two separate datasets for training and evaluation.
How should early stopping be done? Early stopping needs both the training and the evaluation metrics to make the decision. Proposal:
Refactor the evaluation process. Proposal:
Each time an ElasticDL command is executed, the client builds a new Docker image. Support reusing an existing image. Proposal:
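The prefetch issue in the first item can be reproduced without TensorFlow. The following is only a minimal sketch under stated assumptions: `TaskQueue` and `prefetching` are hypothetical stand-ins for the master's task dispatcher and the dataset's prefetch buffer, not ElasticDL or `tf.data` APIs. It shows that a mixed task stream hands an evaluation task to the worker while the last training task is still being processed, whereas separate streams do not.

```python
from collections import deque


class TaskQueue:
    """Hypothetical task source that records every task it hands out."""

    def __init__(self, tasks):
        self._tasks = deque(tasks)
        self.pulled = []

    def __iter__(self):
        return self

    def __next__(self):
        if not self._tasks:
            raise StopIteration
        task = self._tasks.popleft()
        self.pulled.append(task)
        return task


def prefetching(task_source, buffer_size):
    """Yield tasks in order while eagerly keeping up to buffer_size tasks buffered."""
    source = iter(task_source)
    buffer = deque()
    exhausted = False
    while True:
        # Refill the buffer before handing out the next task.
        while not exhausted and len(buffer) < buffer_size:
            try:
                buffer.append(next(source))
            except StopIteration:
                exhausted = True
        if not buffer:
            return
        yield buffer.popleft()


train_tasks = [("training", 0), ("training", 1)]
eval_tasks = [("evaluation", 0), ("evaluation", 1)]

# One mixed stream: consuming the last training task already pulls an evaluation task.
mixed_queue = TaskQueue(train_tasks + eval_tasks)
mixed_stream = prefetching(mixed_queue, buffer_size=2)
next(mixed_stream)
last_training_task = next(mixed_stream)

# Two separate streams: the training prefetch never touches the evaluation queue.
train_queue = TaskQueue(train_tasks)
eval_queue = TaskQueue(eval_tasks)
train_stream = prefetching(train_queue, buffer_size=2)
next(train_stream)
next(train_stream)
```

In this sketch, `mixed_queue.pulled` already contains an evaluation task once the worker holds the last training task, while `eval_queue.pulled` stays empty in the separate-stream variant.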

Include in next plan

Support failover for the EmbeddingService Redis cluster. At present, it is a single point of failure. Proposal: Redis is a temporary solution; the next plan is to use multiple parameter servers (PS).

Performance issues

During training and evaluation, we call GetModel to fetch the whole model whenever the model_version is updated, so the RPC payload is large.
In the Sync SGD scenario, the N workers do not all run at exactly the same speed. The larger N is, the more computation is spent on expired model versions.
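The Sync SGD point can be illustrated with a toy model. This is a sketch under assumptions, not the ElasticDL master: the class `SyncSGDMaster`, the threshold name `grads_to_wait`, and the rule that the master bumps the model version after accepting a fixed number of gradients and rejects gradients computed on older versions are all hypothetical simplifications. It also assumes every worker starts the round from the same model version.

```python
class SyncSGDMaster:
    """Toy master: accepts gradients only for the current model version."""

    def __init__(self, grads_to_wait):
        self.grads_to_wait = grads_to_wait  # assumed update threshold
        self.model_version = 0
        self.wasted_gradients = 0
        self._accepted = 0

    def report_gradient(self, worker_version):
        """Return True if the gradient is accepted, False if it is stale."""
        if worker_version != self.model_version:
            # The worker computed this gradient on an expired model version.
            self.wasted_gradients += 1
            return False
        self._accepted += 1
        if self._accepted == self.grads_to_wait:
            # Enough gradients collected: update the model and start a new round.
            self.model_version += 1
            self._accepted = 0
        return True


def simulate_round(num_workers, grads_to_wait):
    """All workers compute on version 0 and report one after another."""
    master = SyncSGDMaster(grads_to_wait)
    worker_versions = [master.model_version] * num_workers
    for version in worker_versions:
        master.report_gradient(version)
    return master.wasted_gradients


wasted_with_4_workers = simulate_round(num_workers=4, grads_to_wait=2)
wasted_with_8_workers = simulate_round(num_workers=8, grads_to_wait=2)
```

With the threshold held fixed, the slower workers' gradients arrive after the version has moved on, so the number of rejected gradients grows with the worker count in this toy model.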