Abstract: Missing values can significantly affect the results of analyses and decision making in any field. Two major approaches deal with this issue: statistical methods and model-based methods. While the former introduces bias into the analyses, the latter is usually designed for limited and specific use cases. To overcome the limitations of both approaches, we present a stacked ensemble framework that integrates the adaptive random forest algorithm, the Jaccard index, and Bayesian probability. Because heterogeneous, distributed data from multiple sources poses a particular challenge, the model we build for our use case supports different data types: continuous, discrete, categorical, and binary. The proposed model tackles missing data in the broad context of massive data sources and varied data formats. We evaluated the framework extensively on five different datasets containing labelled and unlabelled data. The experiments showed that it produces encouraging and competitive results when compared to statistical and model-based methods. Since the framework works across various datasets, it overcomes the limitations of model-based methods identified in the literature review.
Authors: Andre L Costa Carvalho (Université du Québec, Canada); Darine Ameyed (ETS, Canada); Mohamed Cheriet (Ecole de technologie superieure (University of Quebec), Canada)
Email: andre-luis.costa-carvalho.1@ens.etsmtl.ca, darine.ameyed.1@ens.etsmtl.ca, mohamed.cheriet@etsmtl.ca