Abstract: As Internet usage and data volumes continue to grow, big data has become an extremely important and challenging problem for data centers. Many platforms and frameworks aim to bring cutting-edge technology to this problem. Apache Hadoop is a software framework for processing and storing big data on clusters, providing reliability, scalability, and distributed computing. Hadoop has a distributed file system (HDFS) for storing vast amounts of data in distributed environments, and it uses the MapReduce algorithm to process large amounts of data by parallelizing the workload and storage. In comparison to relational database systems, Hadoop works well with unstructured data. Our work focuses on the performance evaluation of Hadoop benchmarks, which are crucial for testing cluster infrastructure. Given the sensitivity and importance of data, it is essential to test clusters and distributed systems before deployment. Benchmark results can guide parameter optimization and performance tuning of the cluster. In this paper, we study and evaluate the performance of Hadoop and give a comprehensive listing of the benchmarks used to test it, with detailed information on their application and the procedures for running them. We construct a simulated distributed Hadoop cluster on VMware Workstation, consisting of multiple nodes running Hadoop. We have conducted a measurement study evaluating the performance of this simulated cluster under multiple scenarios. Our results demonstrate the trade-off between performance and flexibility. The evaluation focuses on throughput comparisons across the different scenarios, and Hadoop performance has been evaluated for data sets of different sizes in this simulation.
To measure the performance, we set up a Hadoop cluster with many nodes and use the file TestDFSIO.java of Hadoop version 1.2.1, which reports the data throughput, the average I/O rate, and the I/O rate standard deviation. Our results demonstrate that HDFS write performance scales well on both small and big data sets. Average HDFS read performance also scales well on big data sets, although it is lower there than on small data sets. The more nodes a write or read operation runs on, the faster it performs.
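
The three metrics named above can be illustrated with a minimal sketch. It assumes the usual TestDFSIO-style aggregation: throughput is total data divided by total I/O time, while the average I/O rate and its standard deviation are computed over the per-file rates. The function name `dfsio_summary` and the sample numbers are hypothetical and are not taken from the paper or from measured results.

```python
import math

def dfsio_summary(sizes_mb, times_s):
    """Aggregate per-file sizes (MB) and I/O times (s) into three metrics.

    Assumed definitions (a sketch, not the Hadoop source):
      throughput   = sum(sizes) / sum(times)
      average rate = mean of per-file rates (size / time)
      std dev      = sqrt(|mean(rate^2) - mean(rate)^2|)
    """
    if not sizes_mb or len(sizes_mb) != len(times_s):
        raise ValueError("need one positive time per file size")
    if any(t <= 0 for t in times_s):
        raise ValueError("I/O times must be positive")

    rates = [size / t for size, t in zip(sizes_mb, times_s)]
    n = len(rates)

    throughput = sum(sizes_mb) / sum(times_s)
    avg_rate = sum(rates) / n
    # Population variance of the per-file rates; abs() guards against
    # tiny negative values caused by floating-point rounding.
    variance = sum(r * r for r in rates) / n - avg_rate ** 2
    std_dev = math.sqrt(abs(variance))
    return throughput, avg_rate, std_dev

# Hypothetical example: two 100 MB files written in 10 s and 20 s.
throughput, avg_rate, std_dev = dfsio_summary([100, 100], [10, 20])
print(f"throughput={throughput:.2f} MB/s, "
      f"avg rate={avg_rate:.2f} MB/s, std dev={std_dev:.2f}")
```

Throughput and average I/O rate differ whenever the files do not all finish at the same speed: throughput weights slow files more heavily, while the average rate treats every file equally. A large standard deviation therefore suggests uneven performance across nodes.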

Data Migration at Scale for Distributed Systems: Hot and Cold Migration (HCM)

Distributed High-Dimension Matrix Operation Optimization on Spark

DataMPI: Extending MPI to Hadoop-Like Big Data Computing

Data Migration among Different Clouds

Performance Analysis of Distributed Computing Frameworks for Big Data Analytics: Hadoop Vs Spark

Data Processing Framework Using Apache and Spark Technologies in Big Data

Does Big Data Require Complex Systems? A Performance Comparison Between Spark and Unicage Shell Scripts

Pre-stack Kirchhoff Time Migration on Hadoop and Spark

Big data scalability based on Spark Machine Learning Libraries

Performance evaluation of Map-reduce jar pig hive and spark with machine learning using big data

Vhadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration

A Benchmarking Study to Evaluate Apache Spark on Large-Scale Supercomputers

To Migrate or not to Migrate: An Analysis of Operator Migration in Distributed Stream Processing

SparkDWM: a scalable design of a Data Washing Machine using Apache Spark

Optimal Operator State Migration for Elastic Data Stream Processing

A Survey on Geographically Distributed Big-Data Processing using MapReduce

Implementation and Performance Analysis of Apache Hadoop

Optimizing Data Migration Using Online Clustering

A Secure and Efficient Data Migration Over Cloud Computing

Evaluating Accumulo Performance for a Scalable Cyber Data Processing Pipeline

Benchmarking Apache Spark and Hadoop MapReduce on Big Data Classification