Big Data Performance Evaluation Using Machine Learning


  • Santosh Kumar J., Raghavendra B. K., Raghavendra S.


Abstract: Big data is voluminous, complex, and varied data that is difficult to process with traditional systems. Several frameworks exist for processing it, including Hadoop, Spark, and Flink. Languages used with these frameworks include Scala, Pig, Hive, NoSQL and, most importantly, Java, which is used across all of the frameworks. Spark is written in Scala, a language that avoids much of the boilerplate code that Java requires. Pig is a scripting language for processing unstructured data, while Hive and NoSQL are used to process structured data. In addition, Sqoop and Flume ingest structured and unstructured data, respectively, into HDFS. PySpark is the Python interface to Spark, and SparkR is the corresponding R interface. In this work we process data stored in the Parquet and ORC file formats, and our results indicate that Parquet is processed faster than ORC. We also find that tuning pipeline parameters with MLlib and MLflow improves big data processing performance.
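To make the file-format comparison concrete, the following is a minimal PySpark sketch of how such a measurement could be set up. It is not the authors' benchmark: the helper names (`median_seconds`, `compare_formats`, `run_demo`), the synthetic data, the column name `key`, and the output directory `/tmp/format_demo` are all assumptions chosen for illustration.

```python
import statistics
import time


def median_seconds(action, repeats=3):
    """Run `action` `repeats` times; return the median wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def compare_formats(spark, df, base_dir, column,
                    formats=("parquet", "orc"), repeats=3):
    """Write `df` once per format, then time reading it back.

    Summing `column` makes Spark read that column's values, so the
    timing reflects an actual scan rather than a metadata lookup.
    """
    results = {}
    for fmt in formats:
        path = f"{base_dir}/sample_{fmt}"  # hypothetical output location
        df.write.mode("overwrite").format(fmt).save(path)

        def read_back(p=path, f=fmt):
            spark.read.format(f).load(p).selectExpr(f"sum({column})").collect()

        results[fmt] = median_seconds(read_back, repeats)
    return results


def run_demo():
    # Imported lazily so the helpers above can be loaded without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("format-comparison").getOrCreate()
    # Synthetic single-column data set (an assumption, not the paper's data).
    df = spark.range(1_000_000).withColumnRenamed("id", "key")
    print(compare_formats(spark, df, "/tmp/format_demo", column="key"))
    spark.stop()
```

Calling `run_demo()` on a machine with Spark installed prints one median read time per format. Timings from a sketch like this depend heavily on data shape, compression settings, and cluster configuration, so they should be read as indicative only.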