Apache Spark Components

Spark Components

The Spark project consists of several tightly integrated components. At its core, Spark is a computational engine that schedules, distributes, and monitors applications made up of many computational tasks across a cluster.

Let's understand each Spark component in detail.

Spark Core

  • Spark Core is the heart of Spark and provides its basic functionality.
  • It contains the components for task scheduling, memory management, fault recovery, and interaction with storage systems.

Spark SQL

  • Spark SQL is built on top of Spark Core and provides support for structured data.
  • It lets you query data with SQL (Structured Query Language) as well as with the Apache Hive variant of SQL, called HQL (Hive Query Language).
  • It supports JDBC and ODBC connections, so existing databases, data warehouses, and business intelligence tools can connect to it.
  • It also supports various data sources, such as Hive tables, Parquet, and JSON.

Spark Streaming

  • Spark Streaming is a Spark component that supports scalable and fault-tolerant processing of streaming data.
  • It uses Spark Core's fast scheduling capability to perform streaming analytics.
  • It ingests data in mini-batches and applies RDD transformations to each batch.
  • Because of this design, applications written for streaming data can be reused to analyze batches of historical data with little modification.
  • The log files generated by web servers are a real-world example of a data stream.
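
The mini-batch idea above can be sketched in plain Python. This is only a conceptual model, not the Spark Streaming API: the helper `micro_batches`, the function `count_status_codes`, and the log format are all hypothetical names chosen for illustration. It shows one batch-level transformation being applied to each mini-batch of a stream and then, unchanged, to the full historical log.

```python
from collections import Counter
from itertools import islice


def micro_batches(records, batch_size):
    """Yield successive lists of up to batch_size records from any iterable."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def count_status_codes(batch):
    """Batch-level transformation: count HTTP status codes in log lines.

    Assumes a hypothetical log format: "<method> <path> <status>".
    """
    return Counter(line.split()[-1] for line in batch)


# Hypothetical web-server log lines standing in for a live stream.
log_stream = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200",
    "GET /about 200",
    "GET /secret 403",
]

# Streaming use: apply the transformation to each mini-batch as it arrives.
streaming_counts = [count_status_codes(b) for b in micro_batches(log_stream, 2)]

# Merge the per-batch results into a running total.
merged = Counter()
for counts in streaming_counts:
    merged.update(counts)

# Batch use: the same function analyzes the whole historical log unchanged.
historical_counts = count_status_codes(log_stream)
```

Merging the per-batch counts gives the same totals as the single historical run, which is the reuse property the list describes.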

MLlib

  • MLlib is Spark's machine learning library and contains a variety of machine learning algorithms.
  • These include correlations and hypothesis testing, classification and regression, clustering, and principal component analysis.
  • In benchmarks reported by the Spark project, it ran about nine times faster than the Hadoop disk-based implementation used by Apache Mahout.
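
To make the first item in the list of algorithms concrete, here is a minimal plain-Python sketch of the Pearson correlation coefficient. It does not use MLlib; the function name `pearson` is a hypothetical helper, and MLlib computes the same statistic in a distributed way over much larger datasets.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations.
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)


# Perfectly linear data gives a coefficient of 1 (or -1 when decreasing).
rising = pearson([1, 2, 3, 4], [2, 4, 6, 8])
falling = pearson([1, 2, 3], [3, 2, 1])
```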

GraphX

  • GraphX is a library for manipulating graphs and performing graph-parallel computations.
  • It lets you create a directed graph with arbitrary properties attached to each vertex and edge.
  • To manipulate graphs, it supports fundamental operators such as subgraph, joinVertices, and aggregateMessages.
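
GraphX itself is a Scala API, so the following plain-Python sketch only imitates the ideas behind the three operators named above. The class `PropertyGraph`, its method names and signatures, and the sample data are hypothetical and simplified; they are not the GraphX signatures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: int   # source vertex id
    dst: int   # destination vertex id
    attr: str  # arbitrary edge property


class PropertyGraph:
    """Toy directed graph with a property on every vertex and edge."""

    def __init__(self, vertices, edges):
        self.vertices = dict(vertices)  # vertex id -> property
        self.edges = list(edges)

    def subgraph(self, vpred=lambda vid, attr: True, epred=lambda e: True):
        """Keep vertices passing vpred and edges passing epred between kept vertices."""
        kept = {vid: a for vid, a in self.vertices.items() if vpred(vid, a)}
        edges = [e for e in self.edges
                 if e.src in kept and e.dst in kept and epred(e)]
        return PropertyGraph(kept, edges)

    def aggregate_messages(self, send, merge):
        """send(edge) yields (vertex id, message) pairs; merge combines two messages."""
        inbox = {}
        for e in self.edges:
            for vid, msg in send(e):
                inbox[vid] = merge(inbox[vid], msg) if vid in inbox else msg
        return inbox

    def join_vertices(self, table, fn):
        """Update the property of each vertex that has an entry in table."""
        updated = {vid: (fn(vid, a, table[vid]) if vid in table else a)
                   for vid, a in self.vertices.items()}
        return PropertyGraph(updated, self.edges)


# Hypothetical sample graph.
g = PropertyGraph(
    {1: "alice", 2: "bob", 3: "carol"},
    [Edge(1, 2, "follows"), Edge(3, 2, "follows"), Edge(2, 1, "blocks")],
)

# In-degree: every edge sends 1 to its destination; messages are summed.
in_degree = g.aggregate_messages(lambda e: [(e.dst, 1)], lambda a, b: a + b)

# Restrict the graph to "follows" edges only.
follows_only = g.subgraph(epred=lambda e: e.attr == "follows")

# Attach the computed in-degree to the vertex properties.
labelled = g.join_vertices(in_degree, lambda vid, name, deg: f"{name} ({deg})")
```

Vertices with no incoming edges receive no message, so `join_vertices` leaves their properties unchanged.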
