Time Estimation and Resource Minimization Scheme for Apache Spark and Hadoop Big Data Systems With Failures
Apache Spark and Hadoop are open-source frameworks for big data processing that have been adopted by many companies. In order to implement a reliable big data system that can satisfy processing target completion times, accurate resource provisioning and job execution time estimation are needed. In this paper, time estimation and resource minimization schemes for Spark and Hadoop systems are presented. The proposed models use the probability of failure in their estimations to more accurately capture the characteristics of real big data operations. The experimental results show that the proposed Spark adaptive failure-compensation and Hadoop adaptive failure-compensation schemes improve the accuracy of resource provisioning by accounting for failure events, which increases the scheduling success rate of big data processing tasks.
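To illustrate the core idea of folding a failure probability into a completion-time estimate, the following is a minimal sketch (not the paper's actual model): it assumes each task is retried independently until success, so the expected number of attempts is 1/(1 - p), and jobs run in waves limited by the number of parallel executor slots. The function names and the wave-based approximation are illustrative assumptions, not taken from the paper.

```python
import math

def expected_task_time(base_time: float, p_fail: float) -> float:
    """Expected time for a task retried until success.

    Assumption: attempts fail independently with probability p_fail,
    so the expected attempt count is 1 / (1 - p_fail) and the
    expected time is base_time / (1 - p_fail).
    """
    assert 0.0 <= p_fail < 1.0
    return base_time / (1.0 - p_fail)

def estimate_job_time(task_times: list[float], p_fail: float, slots: int) -> float:
    """Crude wave-based job estimate: tasks run in waves of `slots`
    parallel slots, and each wave costs its slowest (failure-inflated) task."""
    inflated = [expected_task_time(t, p_fail) for t in task_times]
    waves = math.ceil(len(inflated) / slots)
    return waves * max(inflated)

# 8 identical 10 s tasks, 10% failure probability, 4 parallel slots:
# 2 waves of tasks inflated to 10 / 0.9 ≈ 11.11 s each.
print(estimate_job_time([10.0] * 8, 0.1, 4))
```

A real estimator would use per-stage task time distributions and cluster-specific failure statistics rather than a single uniform failure probability, but this shows why ignoring failures systematically underestimates completion time.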
Big data, failure probability, Apache Spark, resilient distributed dataset (RDD), Apache Hadoop, MapReduce, cloud computing, job estimation, resource provisioning, performance optimization