Research seminar: “Effective Deployment of Data-intensive Frameworks on Supercomputers”
Time: 14:30, 22 January 2018
Venue: Room 803, Building B1, Hanoi University of Science and Technology
Speaker: Dr. Đào Thành Chung, Department of Information Systems, Hanoi University of Science and Technology
Title: Effective Deployment of Data-intensive Frameworks on Supercomputers
The goal of this research is to achieve better performance for popular data-intensive frameworks, such as Hadoop and Spark, with only small modifications when they run on modern supercomputers. Big data analytics applications are in wide demand in both industry and academia for processing large-scale datasets. Compared with developing a new data-intensive application from scratch, building on existing popular frameworks is the better choice in terms of productivity and maturity. Thanks to their high-performance (and high-cost) hardware, supercomputers are potentially faster than commodity clusters, such as the Amazon EC2 cloud, at running data-intensive applications. However, current supercomputer designs focus on compute-intensive rather than data-intensive applications, so it is hard to extract the hardware's best performance when running data-intensive applications on supercomputers.
Because these frameworks are designed for commodity clusters, we observe two mismatches with the supercomputer execution environment: the lack of MPI-friendly dynamic process creation, and the lack of local disks. The first mismatch raises the question of how to provide fast, MPI-compatible dynamic process creation for popular data-intensive frameworks while still following the standard way of creating processes on supercomputers. The second raises the question of how, when in-memory storage is used to provide virtual local disks in place of physical ones, that storage should be deployed, and which deployment strategy works well on supercomputers. To address the first mismatch, we propose HPC-Reuse, a layer between YARN-like and PBS-like resource managers that provides better support for dynamic process management with MPI. Regarding the second mismatch, we report experiments comparing various deployment strategies of memcached-like in-memory storage for Hadoop on supercomputers.
Thanh-Chung Dao received his Ph.D. in Information Science and Technology from the University of Tokyo in 2017. He received a B.A. from Keio University, Japan, in 2011, and an M.S. from the University of Eastern Finland. His research interests include the deployment of data-intensive frameworks (e.g., Hadoop and Spark), big data applications on HPC clusters (e.g., DSLs for using MPI and CUDA), and MapReduce computation.