Research seminar “Effective Deployment of Data-intensive Frameworks on Supercomputers”

Time: 14:30, 22/1/2018
Venue: Room 803, Building B1, Đại học Bách khoa Hà Nội
Speaker: Dr. Đào Thành Chung, Department of Information Systems, Đại học Bách khoa Hà Nội
Title: Effective Deployment of Data-intensive Frameworks on Supercomputers
Abstract:
The goal of this research is to achieve better performance of popular data-intensive frameworks, such as Hadoop and Spark, with only small modifications when running on modern supercomputers. Big data analytics applications are in wide demand for processing large-scale datasets in both industry and academia. Compared with developing a new data-intensive application from scratch, building on existing popular data-intensive frameworks is a better choice in terms of productivity and maturity. Supercomputers are potentially faster than commodity clusters, such as the Amazon EC2 cloud, when running data-intensive applications, thanks to their high-performance, high-cost hardware. However, current supercomputer designs focus more on compute-intensive applications than on data-intensive ones, so it is hard to get the best performance out of the hardware when running data-intensive applications on supercomputers.
When running these frameworks on supercomputers, we observe two mismatches with the execution environment they were designed for, namely commodity clusters: a lack of MPI-friendly dynamic process creation and a lack of local disks. The first mismatch raises the question of how to provide fast, MPI-compatible dynamic process creation for popular data-intensive frameworks while still following the standard way of creating processes on supercomputers. The second mismatch raises the question of how, when in-memory storage is used to provide virtual local disks as a replacement for physical local disks, that in-memory storage should be deployed, and which deployment strategy works well on supercomputers. To overcome the first mismatch, we propose HPC-Reuse, which sits between YARN-like and PBS-like resource managers in order to provide better support for dynamic process management with MPI. Regarding the second mismatch, we report experiments comparing various deployment strategies of memcached-like in-memory storage for the Hadoop framework on supercomputers.
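To make the first mismatch concrete, the following minimal C/MPI sketch (illustrative only, not code from the talk or from HPC-Reuse, and with a hypothetical "./worker" executable) shows the kind of on-demand process creation that YARN-like frameworks rely on; on many supercomputers such calls are restricted because the PBS-like scheduler assumes a fixed set of processes per job.

    /* Illustrative sketch: dynamic process creation with MPI_Comm_spawn.
     * YARN-like frameworks start and stop workers on demand, while
     * PBS-like schedulers typically assume a fixed process count,
     * which is the first mismatch described in the abstract. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);

        MPI_Comm workers;   /* intercommunicator to the spawned workers */
        int errcodes[4];

        /* Ask the MPI runtime to launch 4 extra worker processes at run
         * time ("./worker" is a placeholder executable). On many
         * supercomputer installations this call is restricted or
         * unsupported because the batch scheduler has already allocated
         * a fixed node set for the job. */
        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &workers, errcodes);

        printf("Spawned 4 workers dynamically\n");

        MPI_Comm_disconnect(&workers);
        MPI_Finalize();
        return 0;
    }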
Short bio:
Thanh-Chung Dao received his Ph.D. in Information Science and Technology from the University of Tokyo in 2017. He received a B.A. from Keio University, Japan, in 2011 and an M.S. from the University of Eastern Finland. His research interests include the deployment of data-intensive frameworks (e.g., Hadoop and Spark), big data applications on HPC clusters (e.g., DSLs for using MPI and CUDA), and MapReduce computation.
We respectfully announce this seminar and cordially invite you to attend.