Youngseok Yang, Jeongyoon Eo, Geon-Woo Kim, Joo Yeon Kim, Sanha Lee, Jangho Seo, Won Wook Song, Byung-Gon Chun
Efficient distributed execution of data processing applications, especially machine learning applications, is becoming increasingly important. At the same time, the characteristics of the resources and data involved in distributed data processing are becoming increasingly diverse. To handle such characteristics, data processing runtimes provide policy interfaces that enforce fine-grained control over computation scheduling and data transfer, but these interfaces do not guarantee correct application semantics and thus often require significant effort to use. On the other hand, policy interfaces that extend a high-level application programming model ensure correctness, but do not provide sufficiently fine-grained control. We describe Nemo, an optimization framework for distributed dataflow processing that provides fine-grained control for high performance while also ensuring correctness for ease of use. We combine several techniques to achieve this, including an intermediate representation, optimization passes, and runtime extensions. Our Nemo implementation demonstrates the degree of customization made possible by our approach. Evaluation results show that Nemo enforces fine-grained control to improve performance, and correctly applies different combinations of optimization passes across different batch and machine learning applications.