Apache Spark RDD

Apache Spark RDD with Spark Tutorial, Introduction, Installation, Spark Architecture, Spark Components, Spark RDD, Spark RDD Operations, RDD Persistence, RDD Shared Variables, etc.

<< Back to APACHE

next → ← prev

What is RDD?

The RDD (Resilient Distributed Dataset) is the Spark's core abstraction. It is a collection of elements, partitioned across the nodes of the cluster so that we can execute various parallel operations on it.

There are two ways to create RDDs:

Parallelizing an existing data in the driver program
Referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Parallelized Collections

To create parallelized collection, call SparkContext's parallelize method on an existing collection in the driver program. Each element of collection is copied to form a distributed dataset that can be operated on in parallel.

val info = Array(1, 2, 3, 4)
val distinfo = sc.parallelize(info)

Now, we can operate the distributed dataset (distinfo) parallel such like distinfo.reduce((a, b) => a + b).

External Datasets

In Spark, the distributed datasets can be created from any type of storage sources supported by Hadoop such as HDFS, Cassandra, HBase and even our local file system. Spark provides the support for text files, SequenceFiles, and other types of Hadoop InputFormat.

SparkContext's textFile method can be used to create RDD's text file. This method takes a URI for the file (either a local path on the machine or a hdfs://) and reads the data of the file.

Now, we can operate data on by dataset operations such as we can add up the sizes of all the lines using the map and reduceoperations as follows: data.map(s => s.length).reduce((a, b) => a + b).

Next TopicRDD Operations

← prev next →

.Net

.NET Array Dictionary List String 2D Async DataTable Dates DateTime Enum File For Foreach Format IEnumerable If IndexOf Lambda LINQ Parse Path Process Property Regex Replace Sort Split Static StringBuilder Substring Switch Tuple

Java

Core Array ArrayList HashMap String 2D Cast Character Console Deque Duplicates File For Format HashSet If IndexOf Lambda Math ParseInt Process Random Regex Replace Sort Split StringBuilder Substring Switch Vector While

Related Links

Adjectives Ado Ai Android Angular Antonyms Apache Articles Asp Autocad Automata Aws Azure Basic Binary Bitcoin Blockchain C Cassandra Change Coa Computer Control Cpp Create Creating C-Sharp Cyber Daa Data Dbms Deletion Devops Difference Discrete Es6 Ethical Examples Features Firebase Flutter Fs Git Go Hbase History Hive Hiveql How Html Idioms Insertion Installing Ios Java Joomla Js Kafka Kali Laravel Logical Machine Matlab Matrix Mongodb Mysql One Opencv Oracle Ordering Os Pandas Php Pig Pl Postgresql Powershell Prepositions Program Python React Ruby Scala Selecting Selenium Sentence Seo Sharepoint Software Spellings Spotting Spring Sql Sqlite Sqoop Svn Swift Synonyms Talend Testng Types Uml Unity Vbnet Verbal Webdriver What Wpf

TheDeveloperBlog.com