What is replication and why is it used in distributed systems?
Replication means keeping copies of the same data on multiple machines connected via a network, in order to reduce latency (keep data close to users), increase availability (fault tolerance), and increase read throughput.
How does leader-based replication work?
One replica is designated the leader and handles all writes; the resulting data changes are streamed to the follower replicas, which can serve read-only queries.
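A minimal, illustrative sketch of this flow in Python: the Leader and Follower classes below are hypothetical stand-ins for real replicas, and the "stream" is just a method call rather than a replication log sent over the network.

```python
# Toy sketch of leader-based replication (illustrative only; a real system
# streams a replication log over the network and must handle lag/failures).

class Follower:
    def __init__(self):
        self.data = {}                     # follower's local copy of the data

    def apply(self, change):               # apply a change received from the leader
        key, value = change
        self.data[key] = value

    def read(self, key):                   # followers serve read-only queries
        return self.data.get(key)

class Leader:
    def __init__(self, followers):
        self.data = {}
        self.followers = followers

    def write(self, key, value):           # all writes go through the leader
        self.data[key] = value
        for follower in self.followers:    # stream the change to every follower
            follower.apply((key, value))

followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("user:1", "Alice")
print(followers[0].read("user:1"))         # -> Alice
```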
What are the core features of Apache Spark?
Cluster-computing framework
Implicit parallel processing, with automatic handling of node failures
In-memory data processing (avoids frequent read/write operations)
Lazy loading of data
Supports computations on Resilient Distributed Datasets (RDDs)
Open-source (Apache project)
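A minimal PySpark sketch of these features, assuming pyspark is installed and a local file named data.txt exists (both are assumptions, not part of the original card): the transformations are lazy, the first action triggers execution, and cache() keeps the result in memory for reuse.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "features-demo")        # local cluster, all cores

lines = sc.textFile("data.txt")                       # lazy: nothing is read yet
words = lines.flatMap(lambda line: line.split())      # lazy transformation
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

counts.cache()                                        # keep the result in memory
print(counts.take(5))                                 # action: triggers execution
print(counts.count())                                 # reuses the cached data

sc.stop()
```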
What is the role of SparkContext in Apache Spark?
Main entry point for Spark functionality
Represents the connection to a Spark cluster
Used to create RDDs, accumulators, and broadcast variables
Sends tasks to the executors in the cluster
Creates RDDs which form the basis for task definitions
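A small sketch of those roles (assuming pyspark is installed; the names and values are made up for illustration): the SparkContext connects to a cluster (here a local one), creates an RDD, a broadcast variable, and an accumulator, and ships the map tasks to the executors.

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("context-demo").setMaster("local[2]")
sc = SparkContext(conf=conf)                 # entry point / connection to the cluster

rdd = sc.parallelize(range(10), 4)           # create an RDD with 4 partitions
lookup = sc.broadcast({0: "even", 1: "odd"}) # broadcast variable shared with executors
odd_count = sc.accumulator(0)                # accumulator aggregated across tasks

def classify(n):
    if n % 2:
        odd_count.add(1)                     # tasks add to the accumulator
    return (n, lookup.value[n % 2])          # tasks read the broadcast value

print(rdd.map(classify).collect())           # tasks are sent to the executors
print(odd_count.value)                       # -> 5, visible back on the driver

sc.stop()
```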
What are key characteristics of RDDs in Apache Spark?
Defines a dataset and the operations on it (where to get data, what to do)
Does not contain the actual data
Lazy loading: data is only loaded when computation starts
Distributed: operations can be parallelized across cluster nodes
Resilient: automatically handles node failures
Custom RDDs can be implemented
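A short sketch of these characteristics (assuming pyspark is installed): the RDD only records where the data comes from and which operations to apply, the work is split across partitions, and nothing runs until an action is called; a lost partition would be recomputed from this recorded lineage.

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "rdd-characteristics")

numbers = sc.parallelize(range(100), 4)      # RDD definition only: 4 partitions
doubled = numbers.map(lambda x: 2 * x)       # lazy: nothing is computed yet

print(doubled.getNumPartitions())            # -> 4 units of parallel work
print(doubled.sum())                         # action: computation starts now;
                                             # lost partitions are recomputed from
                                             # the lineage rather than re-fetched
sc.stop()
```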
What is data partitioning and why is it used?
Distributes data across multiple nodes
Breaks data into partitions (shards), each a small database
Enables scalability
Usually combined with replication, so copies of each partition are stored on multiple nodes (important for very large datasets and high query throughput)
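In Spark terms, partitioning a key-value RDD can be made explicit; the sketch below (pyspark assumed) hash-partitions a small pair RDD into three shards and uses glom() only to inspect which records ended up in which partition.

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "partitioning-demo")

pairs = sc.parallelize([("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5)])
sharded = pairs.partitionBy(3)               # hash-partition the keys into 3 shards

print(sharded.glom().collect())              # one list per partition, showing placement

sc.stop()
```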
How is key-value data partitioned across nodes and what challenges can arise?
Goal: Spread data evenly across nodes
Issue: Uneven distribution (skew) can create hot spots
Naive approach: assign records to nodes at random; data is spread evenly, but every read must query all nodes because there is no way to know which node holds a given key
Solution: use the record's key to control placement, e.g. partition by key range or by a hash of the key, so a record's location can be derived from its key (see the toy comparison below)
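A toy plain-Python comparison of the problem and the fix (the key names and node count are made up): partitioning 1,000 keys that share a common prefix by their first character sends everything to one node (a hot spot), while partitioning by a hash of the full key spreads them roughly evenly.

```python
from collections import Counter

keys = [f"user_{i}" for i in range(1000)]    # hypothetical keys with a shared prefix
NUM_NODES = 4

by_prefix = Counter(ord(k[0]) % NUM_NODES for k in keys)   # naive placement
by_hash = Counter(hash(k) % NUM_NODES for k in keys)       # hash of the full key

print("records per node (by prefix):", dict(by_prefix))    # all land on one node
print("records per node (by hash):  ", dict(by_hash))      # roughly even spread
```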
What is partitioning by key range and what are its characteristics?
Assigns a continuous range of keys to each partition
Key ranges need not be evenly spaced; partition boundaries are adapted to the data so that partitions stay roughly balanced
Keys within each partition are kept sorted, which makes range scans efficient (see the sketch below)
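A toy sketch of range partitioning in plain Python (the boundaries and keys are made up): each partition owns a contiguous key range, boundaries can be moved to rebalance, and keys stay sorted inside each partition so range scans are cheap.

```python
import bisect

boundaries = ["g", "n", "t"]    # partition 0: keys < "g", 1: < "n", 2: < "t", 3: the rest

def partition_for(key):
    return bisect.bisect_right(boundaries, key)

partitions = {i: [] for i in range(len(boundaries) + 1)}
for key in ["apple", "grape", "kiwi", "mango", "peach", "zucchini"]:
    partitions[partition_for(key)].append(key)

for pid in sorted(partitions):
    print(pid, sorted(partitions[pid]))      # keys kept sorted within each partition
```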
What are the main components in the structure of a Spark application?
Cluster Manager: Assigns resources (executors) to each Spark application (SparkContext)
SparkContext: Interfaces with the cluster manager and coordinates execution
Tasks: Each computation step is divided into tasks, one per data partition
Worker Nodes: Execute tasks using executors, which may cache data for efficiency
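A hedged sketch of how these components meet in code (pyspark assumed; the master URL and resource sizes are placeholders, not values from the original): the SparkConf tells the cluster manager what resources each executor should get, the SparkContext coordinates execution, and each action is split into tasks, one per partition, run by executors on the worker nodes.

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("structure-demo")
        .setMaster("spark://cluster-manager-host:7077")   # hypothetical standalone master
        .set("spark.executor.memory", "2g")               # resources per executor
        .set("spark.executor.cores", "2"))

sc = SparkContext(conf=conf)                 # SparkContext negotiates executors
                                             # with the cluster manager

rdd = sc.parallelize(range(8), 4)            # 4 partitions -> 4 tasks per stage
print(rdd.sum())                             # tasks run on executors on worker nodes

sc.stop()
```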