
Spark & MapReduce: Introduction, Differences & Use Case


In this post, we will give an overview of Spark and MapReduce, cover the key differences between them, and discuss some example use cases for each.

We will also look at how Spark became so much more efficient at data processing than MapReduce.

Overview

MapReduce

MapReduce is a programming model and processing engine for processing and generating large data sets with a parallel, distributed algorithm on a cluster of computers.
MapReduce is composed of several components, including:

  • JobTracker — The master node that manages all jobs and resources in a cluster
  • TaskTrackers — Agents deployed to each machine in the cluster to run the map and reduce tasks
  • JobHistoryServer — A component that tracks completed jobs, and is typically deployed as a separate function or with JobTracker

MapReduce Architecture
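
To make the programming model itself concrete, here is a minimal sketch of the map and reduce phases written against an ordinary Scala collection. No Hadoop is involved and the input lines are purely illustrative; the point is only to show the two phases.

// Toy illustration of the MapReduce model on a local Scala collection.
// In a real cluster, the map output is shuffled across machines before
// each key's group is handed to a reduce task.
object MapReduceSketch {
  def main(args: Array[String]): Unit = {
    val lines = Seq("spark and mapreduce", "spark is fast")

    // Map phase: emit (word, 1) for every word in every line
    val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Shuffle + reduce phase: group by word and sum the counts
    val reduced = mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2).sum) }

    reduced.foreach(println)   // prints pairs such as (spark,2) and (mapreduce,1)
  }
}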

Spark

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.

It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Spark can be programmed in several languages: Scala, Python, R, and Java.

Apache Spark is a powerful open-source processing engine built around speed, ease of use, and sophisticated analytics.

Spark Architecture
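
As a flavour of those higher-level tools, the short sketch below queries structured data with Spark SQL. It assumes you are inside spark-shell, where the SparkSession is already available as spark, and that a hypothetical people.json file with records such as {"name": "Ana", "age": 34} exists; both the file and its contents are illustrative.

// Paste into spark-shell; `spark` (a SparkSession) is predefined there.
// people.json is a hypothetical input file used only for illustration.
val people = spark.read.json("people.json")       // DataFrame with an inferred schema
people.createOrReplaceTempView("people")          // make it queryable from SQL
spark.sql("SELECT name, age FROM people WHERE age > 30").show()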

Now that you have a basic overview of Spark and MapReduce, let’s look at the main differences between them.

Difference Between Spark & MapReduce

Spark keeps working data in memory, whereas MapReduce writes intermediate data to disk. Hadoop achieves fault tolerance through replication, whereas Spark uses a different data abstraction, the resilient distributed dataset (RDD), whose lineage-based recovery guarantees fault tolerance while minimizing network I/O.
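
You can see the lineage idea from the shell: an RDD remembers the chain of transformations that produced it, so a lost partition can be recomputed instead of being restored from a replicated copy. A small sketch for spark-shell (sc is the predefined SparkContext; sample.txt is the same input file used in the word count example later in this post):

// Build a word-count RDD, keep it in memory, and print its lineage.
val words  = sc.textFile("sample.txt").flatMap(_.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

counts.cache()                  // keep the result in memory for reuse
println(counts.toDebugString)   // prints the lineage graph Spark would replay
                                // to rebuild any partition lost to a failure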


Spark’s Major Use Cases Over MapReduce

  • Iterative algorithms in machine learning (see the sketch after this list)
  • Interactive data mining and data processing
  • Data warehousing: Spark SQL is largely compatible with Apache Hive and can run queries up to 100x faster than Hive on MapReduce
  • Stream processing: log processing and fraud detection in live streams for alerts, aggregates, and analysis
  • Sensor data processing: where data is fetched and joined from multiple sources, keeping the working dataset in memory makes it easy and fast to process
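
For the iterative case in particular, loading the dataset once, caching it, and then looping over it is where the in-memory model pays off; with MapReduce every pass would re-read the data from HDFS. A minimal, hypothetical sketch for spark-shell (the data and the toy gradient-descent fit are purely illustrative):

// `sc` is the predefined SparkContext in spark-shell.
// Hypothetical 1-D data; a real ML job would load feature vectors instead.
val points = sc.parallelize(Seq(1.0, 2.0, 4.0, 8.0)).cache()  // cached once, reused every pass
val n = points.count()

var w = 0.0                          // parameter being fitted (here it converges to the mean)
for (i <- 1 to 20) {                 // every iteration reuses the in-memory RDD
  val gradient = points.map(x => w - x).sum() / n
  w -= 0.5 * gradient
}
println(s"fitted value after 20 passes: $w")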

Below is a scenario that illustrates the benefits and advantages of Spark over MapReduce.

The same scenario is solved with both MapReduce and Spark, which makes it clear why one would opt for Spark rather than writing long MapReduce programs.

Scenario 1: Simple word count example in MapReduce and Spark

First, the MapReduce version; the equivalent Spark code follows.

MapReduce

Step 1: Create an input directory in HDFS for the text file to be processed.
hadoop fs -mkdir -p /user/$USER/input

Step 2: Copy the text file from the local file system to HDFS
hadoop fs -copyFromLocal sample.txt input

Step 3: Create the WordCount driver, mapper, and reducer (WordCount.java)

package wc;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new WordCount(), args);
        System.exit(res);
    }

    // Driver: configure the job, its input/output formats, types, and classes
    @Override
    public int run(String[] args) throws Exception {
        Path inputPath = new Path(args[0]);
        Path outputPath = new Path(args[1]);

        Configuration conf = getConf();
        Job job = Job.getInstance(conf, "WordCount");
        job.setJarByClass(WordCount.class);

        FileInputFormat.setInputPaths(job, inputPath);
        FileOutputFormat.setOutputPath(job, outputPath);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);  // combiner runs the reducer logic map-side
        job.setReducerClass(Reduce.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    // Mapper: emit (word, 1) for every word in every input line
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sum the counts for each word
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

Step 4: Compile the class against the Hadoop classpath, package it into a jar, and run the job
javac -classpath $(hadoop classpath) -d . WordCount.java
jar cf wordcount.jar wc/*.class
hadoop jar wordcount.jar wc.WordCount input output

Step 5: Check the job output (the reducers write files named part-r-00000, part-r-00001, ...)
hadoop fs -tail output/part-r-00000 | tail > sample-tail.out

Spark

Step 1: Open the Spark shell (spark-shell for Scala, or pyspark for Python)

Step 2: In the Scala shell, run the word count

val rdd1 = sc.textFile("sample.txt")                  // read the input file
val rdd2 = rdd1.flatMap(line => line.split(" "))      // split each line into words
val rdd3 = rdd2.map(word => (word, 1))                // emit (word, 1) pairs
val rdd4 = rdd3.reduceByKey((v1, v2) => v1 + v2)      // sum the counts per word
rdd4.collect()                                        // inspect the result in the shell
rdd4.saveAsTextFile("/user/input/wordcount")          // write the result to HDFS

Verdict

The roughly 60 lines of Java needed for a simple word count in MapReduce are reduced to fewer than 10 lines of Spark code, which shows how concise and convenient Spark is.

You will get to know all of this and dive deep into every concept related to Hadoop development and Apache Spark once you enroll in our Hadoop Developer Using Apache Spark training.

Another question that might come to your mind: what will you get when you enroll?

We are glad to tell you that:

Things you will get!!

  1. Live instructor-led online interactive sessions
  2. FREE unlimited retakes for the next 1 year
  3. FREE on-job support for the next 1 year
  4. Training material (presentation + step-by-step hands-on guide)
  5. Recordings of the live interactive sessions for lifetime access
  6. 100% money-back guarantee (if you attend the sessions, practice, and don’t get results, we’ll issue a full refund; see our Refund Policy)

Reference & Related

  1. Big Data on Wikipedia
  2. What is Big Data from Oracle
  3. Apache Hadoop

Next Task for you:

Did you get a chance to download the FREE Guide on Big Data Hadoop Development? If not, get it now by clicking on the link below.
Must-read doc: Big Data Hadoop Development


