(BigData 2020) A Performance Prediction Model for Spark Applications

About This Webinar

Abstract: Apache Spark is a popular open-source distributed processing framework that enables efficient processing of massive amounts of data. It has a large number of parameters that need to be tuned to get the best performance. However, tuning these parameters manually is a complex and time-consuming task. Therefore, a robust performance model that predicts application execution time could greatly help in accelerating the deployment and optimization of big data applications relying on Spark. In this paper, we ran extensive experiments on a selected set of Spark applications that cover the most common workloads to generate a representative dataset of execution times. In addition, we extracted application and data features to build a machine-learning-based performance model that predicts the execution time of Spark applications. The experiments show that boosting algorithms achieved better results than the other algorithms evaluated.

Authors: Muhammad Usama Javaid, Florian Demesmaeker, Amir Kanoun, Sabri Skhiri and Amine Ghrab (Eura Nova, Belgium)

Email: usama.javaid@euranova.eu, florian.demesmaeker@euranova.eu, amir.kanoun@euranova.eu, sabri.skhiri@euranova.eu, amine.ghrab@euranova.eu

Usama Javaid is a Data Scientist and AI enthusiast. In 2016, he completed his bachelor's degree in computer science at the National University of Computer and Emerging Sciences, Islamabad, Pakistan. In 2018, he obtained his master's degree in machine learning and data mining from Université Jean Monnet, Saint-Étienne, France. He has been a Data Scientist at Eura Nova since July 2019. He is passionate about how AI can shape lives and society for the better, and about the future of technology.
