Survey of Data Locality in Apache Hadoop
One of the key challenges in big data technology is the velocity at which data is processed. Hadoop, an open-source software framework, is the dominant technology supporting big data analytics, so researchers have tried to improve the performance of Hadoop systems. One line of Hadoop performance research is data locality, which has recently received attention as a way to increase Hadoop performance. Using current Hadoop software, researchers can investigate data locality through the Hadoop Distributed File System (HDFS), Yet Another Resource Negotiator (YARN), MapReduce, and other features. Data locality research has the potential to increase the performance of big data processing through scheduling, data placement, frameworks, and services. Here we introduce data locality in the Hadoop system, including the data-local, rack-local, and off-rack cases. We survey data locality research on scheduling, data placement, networking, partition/key, frameworks, and related topics. We categorize prior research by MapReduce step and find that some studies span more than one step. We also graph the data locality research to identify trends. This analysis shows effects that differ depending on the application. Specifically, the number of taskers and the data locations affect MapReduce performance. We also ran the TeraSort benchmark and WordCount on CloudLab and in a physical environment to show the effect of data locality in Hadoop.
Hadoop, data locality, MapReduce, YARN, HDFS