
Buffl

IMSE

by Julia O.

What are the four steps of data engineering?

Data Ingestion
Data Transformation
Data Storage
Data Retrieval

What comes before and after Data Engineering?

Before: Raw Data Generation

After: Data Use

What is Data Ingestion?

Collect, clean, and validate raw data (Extract)

What is Data Transformation?

Convert, clean, enrich and aggregate data (Transform)

What is Data Storage?

Store and manage data securely and scalably (Load)

What is Data Retrieval?

Access and extract data efficiently
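
The four steps above can be sketched end to end in a few lines. This is a minimal illustration, not a production pipeline: the CSV content, the table name `customer_totals`, and the function names are all invented for this example, and an in-memory SQLite database stands in for real storage.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file, API, or stream.
RAW_CSV = """order_id,customer,amount
1,alice,19.99
2,bob,
3,alice,5.01
"""

def ingest(raw_text):
    """Ingestion: read raw rows and drop records that fail validation (missing amount)."""
    rows = csv.DictReader(io.StringIO(raw_text))
    return [r for r in rows if r["amount"].strip()]

def transform(rows):
    """Transformation: convert types and aggregate the amount per customer."""
    totals = {}
    for r in rows:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + float(r["amount"])
    return totals

def store(conn, totals):
    """Storage: load the aggregated data into a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_totals (customer TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO customer_totals VALUES (?, ?)", totals.items())
    conn.commit()

def retrieve(conn, customer):
    """Retrieval: access the stored data with a query."""
    row = conn.execute(
        "SELECT total FROM customer_totals WHERE customer = ?", (customer,)
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
store(conn, transform(ingest(RAW_CSV)))
alice_total = retrieve(conn, "alice")
```

The row for `bob` is rejected during ingestion because its amount is empty, so only `alice` ends up in storage.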

Why are data moved?

Decision Making Support Systems:

Point of Sale System

Customer Relationship Management

Enterprise Resource Planning System

Website and Mobile App Analytics

Social Media Platforms

Primary Responsibility of Data Engineers

Design
Build
Maintain

the infrastructure that enables the seamless flow and utilization of data throughout the organization

Work areas of data engineers

Data Pipeline Development

Database Management

ETL Processes

Data Quality

Collaboration

What's the challenge of CERN?

Datagrid

What's the purpose of the Large Hadron Collider?

Test theories of particle physics

How many running jobs are on this global network?

36544

Number of active CPU cores

807139

What's the transfer rate in the data grid?

21.54 GiB/sec

What did the Large Hadron Collider discover?

Higgs Boson

What are the detectors of the Large Hadron Collider?

ATLAS, CMS, LHCb, ALICE

What does WLCG stand for?

Worldwide LHC Computing Grid

What is the WLCG?

A decentralized system (Grid Computing)

How many cores are on it?

1.4 million computer cores

What's the storage?

2 exabytes

How many tasks per day?

2 million

What's the transfer rate?

> 260 GB/s

What are the challenges of the WLCG?

Data Integration and Interoperability
Data Security and Privacy
Scalability
Data Lifecycle Management
Resource Allocation and Management
Real-time Processing and Streaming Data
Machine Learning and Advanced Analytics Integration
Cost Management

Worldwide LHC Computing Grid (WLCG) - Datagrid

What is the different focus of:

Data Engineers
Scientists
Analysts

Build data pipelines and infrastructure
Analyze data, build predictive models
Analyze trends, support decision-making

What are the different key skills of:

Data Engineers
Scientists
Analysts

SQL, ETL, cloud platforms, data architecture
Stats, ML, Python/R
Data visualization, data analysis tools

What are the different tools of:

Data Engineers
Scientists
Analysts

Spark, Hadoop, AWS
Python, R, TensorFlow, PyTorch
SQL, Excel, Tableau, Power BI, R, Python

What is the Output of

Data Engineers
Scientists
Analysts

Reliable, scalable data infrastructure
Predictive models, insights, experiments
Reports, dashboards, presentations

What is the different background of:

Data Engineers
Scientists
Analysts

Software engineering, IT
Math/Statistics, Computer Science
Business, Economics, Statistics

What's the key question of data engineers?

How can we design a data pipeline that ingests real-time streaming data from our website and ensures it is stored in a scalable, fault-tolerant data warehouse?

What's the key question of data scientists?

How can we use historical customer data to build a predictive model that forecasts customer churn over the next six months?

What is the key question of data analysts?

What are the key trends in our sales data over the last quarter, and how can we visualize these to support the marketing team’s decisions?

What are the elements of the data lifecycle?

Data Creation/Collection
Data Storage and Management
Data Processing and Analysis
Data Utilization and Sharing
Data Archiving and Disposal

What types of data sources are there?

Structured

Semi-structured

Unstructured

What data collection methods are there?

APIs

Databases

Streaming data

Web scraping

What are challenges in data collection?

Data quality, volume and variety

Can traditional relational databases satisfy the requirements of today’s scenarios like WLCG?

The old world

Millions of objects

100-byte objects

Can traditional relational databases satisfy the requirements of today’s scenarios like WLCG?

The “new” world

Billions of objects

Big objects (1MB)

Objects have behavior (methods)

What are new requirements for data management?

Explosion of unstructured and semistructured data (text, sensor data, social media feeds, JSON, etc.)
Complex data types (arrays, maps, nested structures) enable natural modeling of these data forms
Machine learning algorithms often work with complex data representations (text embeddings, image vectors, geospatial coordinates, etc.)

What is the evolutionary approach in Object-Relational Database Technologies?

Extends existing relational database models with object-oriented features
Focuses on backward compatibility with traditional SQL and relational schemas
Adds complex data types and structures, like nested tables or arrays
Gradual integration ensures a smoother transition for users and developers
Maintains the robustness and reliability of established relational databases

What is the revolutionary approach in Object-Relational Database Technologies?

Builds database systems fundamentally on object-oriented principles
Supports OO features like inheritance, polymorphism, and encapsulation natively
Redesigns data storage and access to be object-centric
Often requires a complete restructuring of existing database systems
Prioritizes object-oriented design over traditional relational models

Object-relational impedance mismatch

What is the definition of object-relational mapping (ORM)?

Technique that bridges the gap between object-oriented programming and relational databases.

What is the functionality of ORM?

Automates the translation of objects in code to relational tables in a database, simplifying data manipulation.

What are the benefits of ORM?

Simplifies database interactions, enhances productivity and maintains data integrity

What's the process of ORM?

Maps objects to database entities; CRUD operations in code translate to SQL queries.

What are use cases of ORM?

Widely used in applications requiring database interaction, such as web and enterprise applications.

What are popular ORM frameworks?

Hibernate (Java), Entity Framework (.NET), Django ORM (Python)

ORM Example with Python I

ORM Example with Python II
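
The original examples on these cards are not preserved in the text. As a stand-in, the sketch below shows the idea behind ORM using only the Python standard library: a class is mapped to a table, and CRUD calls on objects are translated into SQL. `TinyMapper` and `Customer` are hypothetical names invented here; real frameworks such as Hibernate, Entity Framework, or Django ORM offer far more (relationships, migrations, query builders), and this sketch is not safe for untrusted table or column names.

```python
import sqlite3
from dataclasses import dataclass, fields, astuple

@dataclass
class Customer:
    id: int
    name: str
    city: str

class TinyMapper:
    """Hypothetical minimal mapper: one dataclass <-> one table with an 'id' key."""

    def __init__(self, conn, cls):
        self.conn, self.cls = conn, cls
        self.table = cls.__name__.lower()            # class name -> table name
        self.cols = [f.name for f in fields(cls)]    # attributes -> columns
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {self.table} "
            f"({', '.join(self.cols)}, PRIMARY KEY (id))"
        )

    def create(self, obj):
        marks = ", ".join("?" for _ in self.cols)
        self.conn.execute(f"INSERT INTO {self.table} VALUES ({marks})", astuple(obj))

    def read(self, obj_id):
        row = self.conn.execute(
            f"SELECT {', '.join(self.cols)} FROM {self.table} WHERE id = ?", (obj_id,)
        ).fetchone()
        return self.cls(*row) if row else None       # row -> object

    def update(self, obj):
        sets = ", ".join(f"{c} = ?" for c in self.cols if c != "id")
        values = [getattr(obj, c) for c in self.cols if c != "id"]
        self.conn.execute(f"UPDATE {self.table} SET {sets} WHERE id = ?", (*values, obj.id))

    def delete(self, obj_id):
        self.conn.execute(f"DELETE FROM {self.table} WHERE id = ?", (obj_id,))

conn = sqlite3.connect(":memory:")
customers = TinyMapper(conn, Customer)
customers.create(Customer(1, "Ada", "Vienna"))   # INSERT
ada = customers.read(1)                           # SELECT
ada.city = "Graz"
customers.update(ada)                             # UPDATE
```

The application code only handles `Customer` objects; the SQL strings stay inside the mapper, which is the gap-bridging role the cards describe.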

Object-Relational DBMS vs. Object-Oriented DBMS:

Object-Relational DBMS

Hybrid Model: Combines relational and object-oriented database features.
Complex Data Types: Supports structured types and nested objects.
SQL Enhancements: Extends SQL to handle OO features like inheritance.

Object-Relational DBMS vs. Object-Oriented DBMS:

Object-Oriented DBMS

OO Principles: Built entirely on object-oriented concepts like encapsulation, inheritance, and polymorphism.
Direct Object Storage: Stores objects as they are used in OO programming languages.
No SQL Required: Interacts with data using object-oriented languages, bypassing traditional SQL.

Overview Object-Relational Database Management Systems (ORDBMS)

Definition

Added Features

Significance

Definition: ORDBMS extends traditional relational models with object orientation
Added Features: Supports objects, classes, and inheritance
Significance: Represents an advancement in database technology by integrating object-oriented concepts

ORDBMS Key Features

Complex Structured Data-Types
OO Methods
Inheritance
Polymorphism
Encapsulation
Database Integrity

ORDBMS Compatibility

Upward Compatibility: Maintains compatibility with existing SQL-based relational database languages
Integration with Existing Applications: Facilitates the use of object-relational features alongside current applications without disruption
Support for Legacy Systems: Allows for the enhancement of legacy relational databases, preserving previous investments

SQL:1999

SQL Standards Compliance: Adheres to SQL:1999 (SQL3), which introduced several extensions
Relational Principles Adherence: Upholds fundamental relational database concepts, ensuring declarative access and the enforcement of ACID properties
Object-Relational Features: Object-oriented features, more complex data types, user-defined types, and support for object behaviors and hierarchies
Recursive Queries: WITH clause
Triggers and Stored Procedures: More sophisticated data processing and business logic
Advanced Querying Capabilities: New operators and functions
Enhanced Data Integrity: Assertions and referential constraints

Which databases support the SQL:1999 standard (OO features)?

Oracle Database
IBM DB2
PostgreSQL
Microsoft SQL Server

SQL:1999 - Selected Extensions for Complex Types

User-Defined Types (UDTs)
User Defined Functions (UDF)
Inheritance
Reference Types
Collections
Large Object Types (LOBs)

ORDBMS - SQL Improvements Timeline

1999: Object-relational Features

2003: SQL/XML, improved UDTs

2006: XML Query Language

2016: SQL:2016: JSON support, Polymorphic Table Functions, Graph Querying

ORDBMS - Collection Type Example

ORDBMS - Inheritance Example

ORDBMS - User Type Definition (UDT) Example

ORDBMS - User Defined Functions (UDF) Example

Object-Oriented Database Management Systems

Object-Oriented Paradigm: Combines database functions with OO programming techniques.
Objects and Attributes: Real-world entities with data and methods.
Classes: Templates for creating objects, defining structure and behavior.
Encapsulation: Bundles data and methods and restricts direct access.
Inheritance: New classes inherit features from existing ones.
Polymorphism: Different classes, common interface; method redefinition in subclasses.
Object Identity: Unique identifiers for each object, independent of attributes.
Object Relationships and Associations: Manages complex inter-object relationships efficiently

OODBMS: Storage

Object Persistence: Objects exist beyond creation, ensuring data continuity.
Complex Structures: Naturally stores nested and interconnected object structures.
Object Identity: Each object is uniquely identified through persistent OIDs.
Direct Storage: Objects stored as-is, not in rows/columns.

OODBMS: Retrieval

Navigational Access: Retrieve by navigating through object network relationships.
Querying: Uses Object Query Language (OQL) for object-oriented data querying.
Indexing: Complex indexing methods for efficient object retrieval.
Characteristics-Based Retrieval: Search based on attributes, behavior, and relationships.
Handling Relationships: Exploits object associations for intuitive data access.

Query Languages in Object-Oriented Database Management Systems

Nature: Incorporates object-oriented concepts to extend traditional query capabilities for complex data handling.
Integration: Seamlessly integrates with object-oriented programming languages for data manipulation.
Object-Oriented Features: Supports encapsulation, polymorphism, inheritance in database queries.
Complex Data Types: Manages arrays, lists, and custom structures effectively in queries.
Complex Queries: Enables navigating through intricate object relationships and hierarchies.
Performance: Potentially slower for deep hierarchies and complex relational queries.
Extensibility: May offer extensibility for specific application requirements and functio

Object Query Language: Example I

Object Query Language: Example II

OODBMS: Use Cases (6)

CAD/CAM (Computer-Aided Design/Manufacturing)
Geographic Information System (GIS)
Digital Asset Management (DAM)
Content Management Systems (CMS)
Scientific Data
Multimedia Applications

Pros of OODBMS

Reduced Impedance Mismatch: Seamlessly maps object-oriented concepts to database representation, simplifying application logic
Complex Data Handling: Natively supports complex data structures, custom types, and rich object relationships
Intuitive Queries: Querying often follows natural object navigation and method calls

Cons of OODBMS

Less Mature Technology: Smaller market share compared to RDBMS solutions
Potential Learning Curve: OODBMS concepts and query languages might be less familiar to developers
Performance Considerations: Some OODBMS can face challenges in scenarios traditionally dominated by highly optimized relational databases
Vendor Support: May have fewer vendor choices or less extensive support networks compared to major RDBMS

Comparison RDBMS vs. ORDBMS vs. OODBMS

Popular ORDBMSs

PostgreSQL: An advanced ORDBMS that supports various data types, including custom types, and offers powerful programming and querying features.
Oracle Database: Oracle offers robust object-relational features and is widely used in enterprise environments for its robustness and extensive feature set.
IBM DB2: Offers strong object-relational capabilities, particularly in its advanced editions, and is widely used in enterprise and large-scale systems.
Microsoft SQL Server: While primarily a relational database, it has expanded to include more object-oriented features like user-defined types and functions.

Popular OODBMSs

InterSystems Caché: Known for its high performance, it integrates object database, SQL, and analytics capabilities.
db4o (database for objects): Specifically designed for object-oriented languages like Java and .NET.
ObjectDB: An object database for Java and JPA/JDO. It's very efficient for Java-based applications.
ObjectStore: A mature OODBMS with strong support for commercial applications, offering versions for Java, C++, and other languages.

ORDBMS vs. OODBMS Ranking Graph

Drivers of Database Evolution: From Traditional to Parallel and Distributed Systems

Parallel Database Architectures 3

Performance Metrics in Database Scalability

Speedup: the same job, more hardware, less time
Scaleup: bigger job, more hardware, same time
Throughput: more clients/servers, same response time
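
Speedup and scaleup are commonly expressed as ratios of elapsed times. The sketch below uses that common formulation; the function names and the timing values are invented for illustration and are not benchmark results.

```python
def speedup(time_on_small_system, time_on_large_system):
    """Same job on more hardware. With N times the hardware, a value of N is linear speedup."""
    return time_on_small_system / time_on_large_system

def scaleup(time_small_job_small_system, time_large_job_large_system):
    """Job and hardware grow by the same factor. A value of 1.0 is linear scaleup."""
    return time_small_job_small_system / time_large_job_large_system

# Hypothetical measurements in seconds (invented numbers).
s = speedup(120.0, 40.0)    # 1 node vs. 4 nodes on the same job
u = scaleup(120.0, 150.0)   # 1 node with 1x data vs. 4 nodes with 4x data
```

In this made-up case four nodes give a speedup of 3.0 rather than 4.0, and the scaleup is 0.8 rather than 1.0, which is the kind of sub-linear behavior that coordination overhead typically causes.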

Distributed Database Systems - Data Replication Pros and Cons

Multiple copies of data stored on different sites

+ Availability

+ Fast (local) access

+ Performance by parallel execution

- High data update cost (each replicate)

- Complex Concurrency-Management

Distributed Database Systems - Data Replication Synchronous vs. Asynchronous

Synchronous Replication

Changes are atomically applied to destination DBs
Consistency: Strong
Latency: Higher

Asynchronous Replication

Changes propagated to destination DBs with some delay
Consistency: Moderate
Latency: Lower

Distributed Database Systems – Design Considerations

Network Structure

Latency
Bandwidth
Network Partitioning

Design Considerations

Homogeneous vs. Heterogeneous Systems
Client-Server vs. Peer-to-Peer Architecture
Transparency

Single System Image (SSI) for Distributed Databases

Provides the illusion of a centralized system despite distributed data.
Key aspects:
- Abstraction (hides infrastructure complexity)
- Unified Interface (consistent way to interact with data)
- Global Schema (centralized view of all data)
Enables other transparencies (replication, fragmentation, location)

Transparency in Distributed Systems (Three types)

Replication transparency: Users view data items as logically unique and are not concerned about which data item is replicated.
Fragmentation transparency: Users do not need to know whether or how a relation has been fragmented.
Location transparency: Users are not required to know the physical location of a data item

Benefits of Transparency in Distributed Systems

Easier application development (data location agnostic)
Simpler data management (unified interface)
Improved scalability (easier to add resources)
Increased fault tolerance (redundancy hidden)

Examples of Transparency in Distributed Systems

E-commerce Platforms: Amazon, eBay, Alibaba
Content Sharing Services: Facebook, X (Twitter)
Cloud Storage Provider: Dropbox, Google Drive
Travel Booking Services: Expedia, Booking.com

Distributed Database Systems - Fragmentation

Partitions (fragments) are stored on different sites

+ Fast (local) access

+ Performance by parallel execution

- High cost for de-fragmentation of a relation

Horizontal Partitioning: Data rows are distributed across different sites.
Vertical Partitioning: Different columns of a table are stored at different sites.
Full Replication: Each site holds a complete copy of the database.

Vertical Fragmentation Example

Horizontal Fragmentation Example
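
The two fragmentation examples are not preserved in the text. The sketch below illustrates both ideas on a small in-memory relation; the `employees` data and the function names are invented for this illustration.

```python
# Hypothetical relation: a list of rows (dicts) with primary key "id".
employees = [
    {"id": 1, "name": "Ana", "site": "Vienna", "salary": 5200},
    {"id": 2, "name": "Ben", "site": "Graz", "salary": 4800},
    {"id": 3, "name": "Cem", "site": "Vienna", "salary": 6100},
]

def horizontal_fragments(rows, column):
    """Horizontal: whole rows are distributed, here one fragment per value of a column."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[column], []).append(row)
    return fragments

def vertical_fragment(rows, columns, key="id"):
    """Vertical: only some columns are kept; the key is repeated so rows can be re-joined."""
    return [{c: row[c] for c in (key, *columns)} for row in rows]

def rejoin(left, right, key="id"):
    """De-fragmentation of vertical fragments: join the fragments on the key."""
    right_by_key = {row[key]: row for row in right}
    return [{**l, **right_by_key[l[key]]} for l in left]

by_site = horizontal_fragments(employees, "site")        # rows split across sites
public = vertical_fragment(employees, ["name", "site"])  # columns split across sites
payroll = vertical_fragment(employees, ["salary"])
restored = rejoin(public, payroll)
```

Rebuilding the full relation needs a join across fragments, which is the de-fragmentation cost listed on the card above.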

Exploring Parallelism in Databases Intra-Query vs. Inter-Query:

Intra Query

Dividing a single query into subtasks
Parallel execution on a multi-core CPU
Applicable to operations like scans, joins, aggregations, sorting

Exploring Parallelism in Databases Intra-Query vs. Inter-Query:

Inter Query

Distributing query tasks across multiple database servers
Leveraging multiple CPUs and storage devices
Suitable for complex queries or large datasets

Distributed vs. Parallel Databases

What is MapReduce?

A programming model and an associated implementation for processing and generating large data sets.

What are the key features of MapReduce?

3 phases: Map, Shuffle, Reduce
Automatic parallelization and distribution of work
I/O Scheduling
Example "The Story of Sam"

MapReduce Framework for Large-Scale Data Processing:

Key Points

Simplifying Large-Scale Data Processing
Harnessing the power of multiple CPUs for efficient analysis
Distributing work across computing sites for scalability
Built-in fault tolerance for resilient computations

Map Reduce Image

Map Function in Map Reduce

Processing Individual Elements

Executes on every dataset element, transforming into a new key-value pair.

Partition in Map Reduce

Organizing Data

Shards the key-value pairs, grouping them based on the key for efficient processing.

Reduce Function in Map Reduce

Aggregating Results

Operates on each unique key, aggregating or summarizing associated values.

Map Reduce Key Value Pair Approach

Key: A unique identifier for each data element
Value: The actual data or a reference to its location.
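
The map, partition/shuffle, and reduce steps can be illustrated with the classic word-count task. This is a single-process sketch of the programming model only; a real framework would run the map and reduce calls in parallel on many machines. The function names and sample documents are invented here.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit one (key, value) pair per word."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle_phase(pairs):
    """Partition/shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the values of one key."""
    return (key, sum(values))

documents = ["the quick fox", "the lazy dog", "the fox"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle_phase(mapped)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
```

Because each key is reduced independently, the reduce calls can be distributed across machines without coordination between them.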

Hadoop Ecosystem Image

MapReduce Framework: Hadoop Ecosystem

Apache Foundation Java project since 2008
HDFS: distributed file system (similar to Google FS)
Error tolerance by 3-times (default) data replication
If no "heartbeat" of a node => the central node re-distributes data
Files are stored in chunks of fixed size (64 MB) => reasonable number of large files
Rack-aware file system => nobody knows where data is stored
Is not directly mountable by an operating system
Hadoop Map Reduce: Parallel Programming Framework
HBase: NoSQL database modeled after Google BigTable
YARN (Yet Another Resource Negotiator)

Limitations of Traditional Data Formats (Dataformats)

EDIFACT
CSV / TSV
Fixed-Width Text Files
Binary Formats
Proprietary Formats

What are the 8 Characteristics of Semi-Structured Data?

Self-Describing: Metadata included
Flexible Schema: Adaptable to changes
Hierarchical Structure: Nested elements
Inhomogeneous Structure: Varied formats
Implicit Schema: Structure suggested, not enforced
Graph-Like Model: Interconnected nodes
Platform-Independent: Universally accessible
Human and Machine-Readable: Easily processed and understood

Examples of Semi-Structured Data

What was XML designed for?

storage, transmission, and reconstruction of data

XML Key Facts

Standardized Data Interchange Format
XML became a W3C Recommendation in 1998
Tag-based Syntax
Foundation of several web technologies

For which technologies is XML the foundation?

Foundation of several web technologies
- XHTML
- RSS/ATOM
- AJAX (the X in AJAX)

What are factors that led to the rise of XML?

Flexibility and Simplicity: XML balances structure and ease of use compared to SGML or highly rigid formats like EDI.
Human and Machine Readability: XML is relatively easy for humans and computers to understand.
Web Compatibility: XML's integration with web technologies promoted its widespread adoption.

What are the goals of XML?

Human Readable: Understandable without specialized tools.
Data Sharing: Simplifies data sharing across platforms and applications.
SGML-Compatible: Maintains compatibility with its parent, SGML.
Ease of Processing: Programmatically parsable with standard libraries.
Support Diverse Applications: Adaptable to various use cases.

Document centric vs data centric XML: Document Centric

Focuses on representing the layout and formatting of a document.
Often used for human-readable content like reports, articles, or ebooks.
May contain large text sections with some embedded tags for structure or styling.
Example: A research paper with sections, paragraphs, and citations marked up using XML tags.

Document centric vs data centric XML: Data Centric

Focuses on representing the data itself in a structured way.
Often used for machine-readable information exchange like invoices, purchase orders, or scientific data.
Highly structured with well-defined elements and attributes containing specific data points.
Example: An invoice with elements for items, quantities, prices, and total amount.

Disadvantages of XML

Not suitable for very large datasets (multiple MB of data)
Images are not represented well
XML can quickly become difficult to read when complex

Use cases of XML (7)

RSS Feeds
SOAP Protocols
APIs (e.g., Google)
Weather Services
Healthcare Data Exchange
Financial Transactions
Microsoft Office

Types of XML Content

XML Document Declaration
Elements and attributes
Comments
Character Data
Processing Instructions
Entity References
Namespace

XML Document Declaration

Optional at the beginning of XML document
Specifies XML version and character encoding

Elements in XML

Primary building block
Must have valid names
Start tag and end tag
Can be nested
Must be properly closed

Attributes in XML

Additional information about elements
Defined within start tag of an element
Name-value pairs
Must have valid names
Appear only once on a given element
Must always be quoted

Text in XML

Actual data content within XML elements
Format: Characters, numbers or other data types

Comments in XML

Embed human-readable information
Used for adding notes or explanations
Not visible in output
Enclosed in <!-- and -->
Not allowed:
- Before document declaration
- Inside element brackets

Processing Instructions (PIs) in XML

Provide instructions for the XML processor
Contain application-specific directives
Form: <? targetName instruction ?>

CDATA Sections in XML

CDATA (Character Data) sections allow inclusion of text data that should not be parsed by XML processor like script or style code
Ensures special characters or sequences in the text do not interfere with the XML structure
CDATA sections are treated as plain text by the XML parser
Enclosed in <![CDATA[ …unescaped text data… ]]>

Entity References in XML

General Entities
Character Entities

Namespaces in XML

Namespaces ensure the unique identification of elements and attributes
Enable XML documents from different sources to be combined without name conflicts
Use URI references to differentiate similar elements with distinct meanings
A default namespace can be declared and applied to all unqualified elements
Prefixes before element names indicate the namespace and prevent ambiguity
Essential for extending XML languages, such as in XHTML or SVG
Support XML's extensibility and reusability across applications

Example XML with Namespaces
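
The original example is not preserved in the text. The sketch below shows a default namespace and a prefixed namespace keeping two `table` elements apart, parsed with Python's `xml.etree.ElementTree`. The document and both namespace URIs are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: "table" means two different things in two vocabularies.
DOC = """<catalog xmlns="http://example.org/books"
         xmlns:fn="http://example.org/furniture">
  <table>Table of contents</table>
  <fn:table>Oak dining table</fn:table>
</catalog>"""

root = ET.fromstring(DOC)

# The prefixes used for querying are local to this code; only the URIs must match.
ns = {"b": "http://example.org/books", "f": "http://example.org/furniture"}

book_table = root.find("b:table", ns).text        # unqualified element, default namespace
furniture_table = root.find("f:table", ns).text   # element with the fn: prefix
```

ElementTree stores each tag together with its URI, for example `{http://example.org/books}catalog`, which is why the two `table` elements never collide.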

XML Syntax

Must be well-formed
XML documents require a single root element to encapsulate all content
Every opening tag must be matched with a closing tag
- Empty Tags must be closed <hr />
Attribute values cannot be minimized
- Instead of <option selected>, use <option selected="selected">
Tags are case-sensitive and must be used consistently
Attribute values must always be quoted.
- Instead of <li id=1>, use <li id="1">
Nested elements must be correctly closed in the order they are opened.
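
A parser rejects any document that breaks these rules. The sketch below uses Python's standard `xml.etree.ElementTree` parser to test a few snippets; the helper name `is_well_formed` and the snippets are invented for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

results = {
    "quoted attribute": is_well_formed('<ul><li id="1">ok</li></ul>'),
    "unquoted attribute": is_well_formed("<ul><li id=1>bad</li></ul>"),
    "minimized attribute": is_well_formed("<option selected>bad</option>"),
    "wrong nesting order": is_well_formed("<p><b>bad</p></b>"),
    "two root elements": is_well_formed("<a/><b/>"),
    "case mismatch": is_well_formed("<Note>bad</note>"),
    "closed empty tag": is_well_formed("<hr />"),
}
```

Only the first and last snippets are accepted. Well-formedness is a purely syntactic check; whether a document is also valid against a DTD or schema is a separate question.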

XML Well-formed and Valid

Why Validation of XML Makes Sense:

Ensures XML documents conform to a predefined structure.
Improves data integrity and reliability in data exchange.
Facilitates interoperability between systems and applications.
Catches errors early in the development process, reducing costs.
Enables automated parsing and processing of XML documents.
Provides clear specifications for data formats and types.

Document Type Definitions (DTDs)

Original schema language for defining XML document structure.
Can be embedded in XML documents or defined externally.
Does not support data types other than strings.
Strict order in which elements appear.
Offers entity mechanism for reusing content.
Lacks support for namespaces.
More widely supported in legacy systems.

DTD Syntax and Building Blocks

Element declarations
Attribute declarations
Entity declarations
Notation declarations
PCDATA
Element quantifiers (?, *, +)
Choice (|) and sequence (,) operators

DTD Example I

DTD Example II

XML Schema Description

Richer and more powerful than DTDs.
Supports XML namespaces and multiple schemas in a single document.
Allows definition of custom data types and data type inheritance.
Can enforce the order of child elements.
Facilitates creation of reusable schema modules.
Enables default values and fixed values for elements and attributes
Better suited for modern, complex applications.

XML Schema Example

XSD Validation Example

DTD vs. XML Schema

XML related technologies

XPath
XSLT
XQuery
XPointer, XLink

What is XPath

Definition: A language used for navigating through elements and attributes in an XML document.
Key concept:
- Context Node
- Axis (way from context to selected node)
- Predicates (further refinement)

XPath Syntax I

XPath Syntax II

XPath Example

Simple Selection //book/title

XPath Example

Attribute Selection //book[@category="Science"]/title

XPath Example

Predicate Filtering //book[price>20]

XPath Example

Axes //book[author="Stephen Hawking"]/following-sibling::book/title

XPath Example

Functions //book[contains(title, 'Data')]
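
Some of these expressions can be tried with Python's standard library. `xml.etree.ElementTree` supports only a subset of XPath: path steps, attribute predicates, and child-text equality work, while numeric comparisons, `contains()`, and the `following-sibling` axis do not, so the sketch below does those filters in plain Python. Paths must also be written relative to an element (`.//book` instead of `//book`). The bookstore document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical bookstore document, invented for illustration.
DOC = """<bookstore>
  <book category="Science">
    <title>A Brief History of Time</title>
    <author>Stephen Hawking</author>
    <price>18.50</price>
  </book>
  <book category="Computing">
    <title>Data Pipelines in Practice</title>
    <author>Jane Doe</author>
    <price>42.00</price>
  </book>
</bookstore>"""

root = ET.fromstring(DOC)
books = root.findall(".//book")

# Simple selection and attribute selection are supported directly.
all_titles = [t.text for t in root.findall(".//book/title")]
science_titles = [t.text for t in root.findall(".//book[@category='Science']/title")]

# Child-text equality is supported; price > 20 and contains() are filtered in Python.
by_hawking = [b.findtext("title") for b in root.findall(".//book[author='Stephen Hawking']")]
expensive = [b.findtext("title") for b in books if float(b.findtext("price")) > 20]
data_books = [b.findtext("title") for b in books if "Data" in b.findtext("title")]
```

A full XPath engine would evaluate all five expressions from the cards directly.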

eXtensible Stylesheet Language Transformations (XSLT)

Purpose
- Transforms XML documents into different XML, HTML, or text formats
- Enables the separation of content and presentation.
Uses XSLT stylesheets to define transformation rules.
Operates as a template engine, matching patterns in the source XML.
Written in XML

XSLT Key Components

XSLT Use Cases

Generating dynamic web pages from XML data.
Converting XML data to PDF or other document formats.
Migrating data from one database to another.

XQuery

Is to XML what SQL is to databases
Language for querying XML data
Built on XPath expressions
Supported by all major databases
W3C Recommendation

XQuery Key Features

Functional: Built on functional programming concepts.
Rich Expressions: FLWOR (For, Let, Where, Order by, Return) for complex queries.
Versatile: Queries data that is fully structured, unstructured, or semi-structured.

XQuery Use cases

Transforming XML documents.
Aggregating data from multiple XML sources.
Searching text within XML documents for web services.

JavaScript Object Notation (JSON)

Lightweight, text-based, human-readable data format for structured data
Key Features
- Simplicity
- Language-Independent
- Universal
Syntactically Similar to JavaScript but with stricter rules
Independent Standard (ECMA-404, RFC 8259)

Popularity Overview

JSON Elements: Building Blocks of Data

JSON Example
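
The original example is not preserved in the text. The sketch below parses a small JSON text with Python's standard `json` module and shows how JSON's building blocks (object, array, string, number, boolean, null) map to Python types. The record itself is invented for illustration.

```python
import json

# Hypothetical record covering every JSON value type.
TEXT = """
{
  "name": "Ada",
  "age": 36,
  "active": true,
  "email": null,
  "skills": ["SQL", "Python"],
  "address": {"city": "Vienna"}
}
"""

record = json.loads(TEXT)                                   # JSON text -> Python objects
types = {key: type(value).__name__ for key, value in record.items()}
round_trip = json.loads(json.dumps(record))                 # Python objects -> JSON text -> back
```

Unlike JavaScript object literals, JSON requires double-quoted keys and strings and allows neither comments nor trailing commas, which is what the "stricter rules" on the card above refers to.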

XML vs. JSON: Mapping Challenges

Attributes vs. Elements:
- XML supports attributes (e.g., <person id="123">), but JSON does not. Attributes need to be converted to key-value pairs in JSON.
Mixed Content:
- XML allows elements to have both text and child elements (e.g., <tag>Text <child>value</child></tag>). JSON does not directly support this, requiring restructuring.
Array Representation:
- XML does not have a native array representation. Lists are represented by repeating elements, which must be interpreted as arrays during conversion to JSON.
Namespaces:
- XML supports namespaces (e.g., xmlns), which have no direct equivalent in JSON. This adds complexity during mapping.
Data Types:
- XML requires additional schemas to define data types, whereas JSON supports native types directly. Converting between the two may involve type inference or loss of type information.
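
Several of these challenges show up in even a small converter. The sketch below uses one possible convention, which is an assumption and not a standard: attributes become keys prefixed with `@`, mixed text goes under `#text`, and repeated child elements become a list. The function name and the sample document are invented for illustration.

```python
import json
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Convert an XML element to JSON-compatible Python data (one possible convention)."""
    node = {f"@{name}": value for name, value in elem.attrib.items()}  # attributes
    for child in elem:
        value = element_to_dict(child)
        if child.tag in node:
            # Repeated element: promote the existing entry to a list.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    text = (elem.text or "").strip()
    if text:
        if not node:
            return text          # leaf element: just its text
        node["#text"] = text     # text next to attributes or children
    return node

DOC = '<person id="123"><name>Ada</name><phone>111</phone><phone>222</phone></person>'
as_json = json.dumps(element_to_dict(ET.fromstring(DOC)))
```

The limits are visible in the result: `id` stays the string "123" because XML carries no type information without a schema, and a person with a single `phone` element would get a plain string instead of a one-element array.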

Comparison XML with JSON

NoSQL Definition from https://hostingdata.co.uk/nosql-database/

Next Generation Database Management Systems mostly addressing some of the points:

being non-relational,
distributed,
open-source and horizontally scalable.

The original intention has been modern web-scale database management systems. The movement began early 2009 and is growing rapidly. Often more characteristics apply such as: schema-free, easy replication support, simple API, eventually consistent / BASE (not ACID), a huge amount of data and more. So the misleading term "nosql" (the community now translates it mostly with "not only sql") should be seen as an alias to something like the definition above.

Reasons for the popularity of NoSQL:

Development Speed: Often faster development than SQL databases.
Data Versatility: Better suited for managing and easily evolving various data structures.
Cost-Effectiveness: Handling large data volumes can be more economical than with SQL.
Scalability and Uptime: Can better manage high traffic and maintain continuous uptime, unlike SQL.
Innovation Support: Supports new application paradigms more effectively.

NoSQL Characteristics

Schema Flexibility: Evolution over time
Scalability
High Performance & Low Latency
Specialized Data Models
Large Data Volumes
BASE vs. ACID

Core NoSQL Systems I

Key-value Stores:
- Simplest type: Data is stored as key-value pairs.
- Ideal for: Caching, session management, storing user preferences
- Examples: Redis, Memcached, Riak
Document Databases:
- Like JSON: Data is stored in document-like structures
- Ideal for: Content management, semi-structured data, flexible schemas
- Examples: MongoDB, Couchbase, Amazon DocumentDB, BaseX

Core NoSQL Systems II

Wide-Column Stores
- Table-like but flexible: Data is organized into rows and dynamic columns (columns can vary by row).
- Ideal for: Large-scale analytics, time-series data, event logging
- Examples: Cassandra, HBase
Graph Databases
- Nodes and relationships: Focus on representing relationships between data entities (nodes) and connections between them (edges).
- Ideal for: Social networks, recommendation engines, fraud detection
- Examples: Neo4j, JanusGraph
Multimodel Databases
- Support multiple data models within a single system
- Examples: ArangoDB, OrientDB, Cosmos DB

Sharding NoSQL

Why?

Horizontally partitioning a large database into smaller, independent pieces called "shards"

Scalability: Handle more data and requests.
Availability: Improve system resilience.
Performance: Faster query responses

Key Elements of Sharding NoSQL

Sharding Key: Decides data placement.
Sharding Function: Maps data to shards.
Query Router: Directs queries to the correct shard(s).
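
The three elements can be sketched with hash-based sharding, which is only one of several possible sharding functions (range-based and directory-based schemes are alternatives). The class `ShardRouter` and the in-memory dictionaries standing in for shard servers are invented for illustration.

```python
import hashlib

class ShardRouter:
    """Hypothetical query router over in-memory shards (each dict stands in for a server)."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def shard_for(self, sharding_key):
        """Sharding function: a stable hash of the sharding key, modulo the shard count."""
        digest = hashlib.sha256(str(sharding_key).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, sharding_key, record):
        self.shards[self.shard_for(sharding_key)][sharding_key] = record

    def get(self, sharding_key):
        """Route the lookup to the one shard that can hold the key."""
        return self.shards[self.shard_for(sharding_key)].get(sharding_key)

router = ShardRouter(shard_count=4)
for user_id in range(100):
    router.put(f"user:{user_id}", {"id": user_id})

found = router.get("user:42")
stored_total = sum(len(shard) for shard in router.shards)
```

A plain modulo scheme moves most keys when the shard count changes; production systems often use consistent hashing or range maps to limit that reshuffling.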

CAP Theorem

A distributed system can have at most "two of the three" properties:

Consistency
- Every read receives the most recent write or an error.
Availability
- Every request receives a (non-error) response without guaranteeing it contains the most recent write.
Partition Tolerance
- The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.

Visual Guide to NoSQL Systems

BASE: Basically Available, Soft State, Eventual Consistency

Promotes availability over consistency - “Optimistic approach”
Contrary concept to ACID - “Pessimistic” approach
Gives up strong consistency
“Soft state” (state of the system may change over time)
- Database changes between consistent and inconsistent state
- User has no guarantee to see only one version of data
- During inconsistency windows, different versions of data possible
Simplifies the redundancy management of data
- Less synchronization between replicates necessary
- Higher availability due to more replicated copies

Advantages of NoSQL Systems over Relational DBMS

Flexible Data Models: Accommodate evolving data structures without complex schema changes. Applications can be designed to work well with less rigid schemas.
Horizontal Scalability: Scale-out cost-effectively by adding commodity hardware.
High Performance for Specific Workloads: Optimized for fast reads and writes, particularly in key-value or denormalized data models
Developer-Friendly: Many NoSQL systems align with modern application development practices and data formats, reducing reliance on specialized DBAs.
Big Data Ready: Designed to handle massive data volumes
Lower Costs: Often leverages clusters of commodity servers, reducing hardware expenses

Drawbacks of NoSQL Systems Compared to Relational DBMS

Support & Maturity: Often open-source with varying support levels, still maturing.
Administration: Designed for simpler management, yet skilled oversight is beneficial.
Expertise: Growing developer community, but expertise less widespread than RDBMS.
Analytics & BI Focus: Optimized for web-scale operational needs, analytics features evolving.
Standardization & Transactions: Lacks a single standard, inconsistent support for complex transactions.

NoSQL Databases

Redis

Riak

MongoDB

Amazon DocumentDB

Couchbase

Apache HBase

Cassandra

Neo4j

JanusGraph

OrientDB

Microsoft Azure Cosmos DB

ArangoDB

BaseX

Key-Value Store Definition

A data storage system that resembles a dictionary or hash table. It stores data as a collection of key-value pairs, where a unique key is used to quickly retrieve the associated data record.

Key-Value Store Description

Simple key-value access
Flexible schema-less design
Queries are restricted to keys (focused queries)
Operations usually: put, get, delete
Advantages of decreased complexity
- High Scalability
- Efficient Distribution
- Fault tolerance
Foundation of MapReduce
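
Sketch: put, get, delete

A minimal in-memory sketch of the three typical operations, assuming a plain Python dictionary as the backing structure. The class name `KeyValueStore` is illustrative. Real systems add persistence, distribution, and replication on top.

```python
from typing import Any, Optional


class KeyValueStore:
    """Toy key-value store: values are opaque and only reachable by key."""

    def __init__(self) -> None:
        self._data: dict = {}

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value  # insert or overwrite

    def get(self, key: str) -> Optional[Any]:
        return self._data.get(key)  # None if the key is unknown

    def delete(self, key: str) -> bool:
        if key in self._data:
            del self._data[key]
            return True
        return False


cache = KeyValueStore()
cache.put("session:42", {"user": "julia", "cart": ["book"]})
print(cache.get("session:42"))
print(cache.delete("session:42"))
print(cache.get("session:42"))
```

The store cannot answer "which sessions contain a book?" without scanning every value, which is why queries are restricted to keys.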

Key Considerations for Key-Value Stores: Suitable and unsuitable for

Suitable

Simple data model
High performance for simple retrievals
Scalability

Unsuitable

Complex queries
Relational Data Management
Applications requiring ACID transactions

Key Considerations for Key-Value Stores: Advantages and Disadvantages

Advantages

Extremely fast reads/writes
Simple data model
Highly scalable

Disadvantages

Limited data modeling
No native support for complex queries
No relationships

Wide-Column Store

Two-dimensional Key-Value Store.
Columns not predefined (may vary from row to row)
Column families
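
Sketch: rows with varying columns

A minimal sketch of the two-dimensional idea, assuming nested dictionaries: row key, then column family, then column name. The class `WideColumnTable` and the sample families are hypothetical and do not mirror the API of HBase or Cassandra.

```python
from collections import defaultdict


class WideColumnTable:
    """Toy wide-column table: fixed column families, free columns per row."""

    def __init__(self, column_families) -> None:
        self.column_families = set(column_families)
        # row key -> column family -> column name -> value
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key: str, family: str, column: str, value) -> None:
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][column] = value

    def get(self, row_key: str, family: str, column: str):
        # .get() avoids creating empty rows as a side effect of reading.
        return self.rows.get(row_key, {}).get(family, {}).get(column)


users = WideColumnTable(["profile", "activity"])
users.put("u1", "profile", "name", "Ada")
users.put("u1", "profile", "email", "ada@example.org")
users.put("u2", "profile", "name", "Bob")
users.put("u2", "activity", "last_login", "2024-01-01")

print(users.get("u1", "profile", "email"))
print(users.get("u2", "profile", "email"))  # column absent in this row
```

Column families are declared up front, while the columns inside a family can differ from row to row.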

Key Considerations for Wide-Column Stores: Suitable and Unsuitable

Suitable

Large volumes of data with variable schema
Fast reads and writes
Scalability

Unsuitable

Complex transactions
Strong consistency across multiple operations
Complex data relationships

Key Considerations for Wide-Column Stores: Advantages and Disadvantages

Advantages

Highly flexible handling of varied column sets
Efficient for analytics

Disadvantages

Complexity in schema management
Less intuitive for relational data model users

Document Stores Definition

Document Stores are specifically designed to handle semi-structured data. They are a popular type of NoSQL database, with XML databases being a specialized subclass for XML document management.

Document Stores Description

Collection of documents (a document is equivalent to a row in an RDBMS)
Document formats: JSON, XML, YAML, …
Structured set of key/value pairs
Addressed via a unique key
Documents are treated as a whole (schema-free)
Access via API or query language
Supported: MapReduce
Not directly Supported: Joins
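
Sketch: a collection of schema-free documents

A minimal sketch, assuming JSON-like Python dictionaries as documents and a field named `_id` as the unique key. The `Collection` class and its `find` method are hypothetical and only mimic the general shape of a document store API.

```python
import json


class Collection:
    """Toy document collection: documents addressed by a unique key."""

    def __init__(self) -> None:
        self._docs = {}

    def insert(self, doc: dict) -> None:
        # Round-trip through JSON to store an independent, serializable copy.
        self._docs[doc["_id"]] = json.loads(json.dumps(doc))

    def get(self, doc_id: str):
        return self._docs.get(doc_id)

    def find(self, **criteria) -> list:
        # Match documents whose top-level fields equal all given criteria.
        return [
            doc for doc in self._docs.values()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]


robots = Collection()
robots.insert({"_id": "r1", "type": "robot", "model": "X200",
               "parts": ["arm", "gripper"]})
robots.insert({"_id": "r2", "type": "robot", "model": "X300",
               "maintenance": {"last_check": "2024-03-01"}})

print(robots.get("r1"))
print([doc["_id"] for doc in robots.find(model="X300")])  # prints: ['r2']
```

The two documents have different fields, yet live in the same collection. Combining data from two collections would have to be done in application code, since joins are not directly supported.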

Key Considerations for Document Stores: Suitable and unsuitable

Suitable

Flexible schema
Document encapsulation of data
Moderate relationship management

Unsuitable

Highly relational data
Applications requiring complex joins and multilevel transactions

Key Considerations for Document Stores: Advantages and disadvantages

Advantages

Flexible data model
Rich query capabilities

Disadvantages

Less efficient for complex queries involving multiple document relationships

Graph Databases Definition

A graph database stores data using nodes (data points), edges (relationships between the data), and properties (attributes). This structure prioritizes relationships, allowing for fast queries and intuitive visualization of complex, interconnected data

Graph Databases Description

Graph-oriented
Entity types are edges, nodes or attributes
No global key, no joins are necessary
Data are identified by relative position in the graph (traversal)
Nodes and edges can be labeled (used later for search)
No limitation in the number of edges and attributes per node
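
Sketch: traversal over labeled edges

A minimal sketch of a multi-hop traversal, assuming an adjacency-list graph with labeled, directed edges and a breadth-first search. The `Graph` class, the `FOLLOWS` label, and the sample data are illustrative assumptions.

```python
from collections import deque


class Graph:
    """Toy property graph: nodes with attributes, labeled directed edges."""

    def __init__(self) -> None:
        self.nodes = {}  # node id -> attribute dict
        self.edges = {}  # node id -> list of (label, target id)

    def add_node(self, node_id: str, **attributes) -> None:
        self.nodes[node_id] = attributes
        self.edges.setdefault(node_id, [])

    def add_edge(self, source: str, label: str, target: str) -> None:
        self.edges[source].append((label, target))

    def reachable(self, start: str, label: str, max_hops: int) -> list:
        """Nodes reachable from start via edges with this label, within max_hops."""
        seen = {start}
        queue = deque([(start, 0)])
        result = []
        while queue:
            node, hops = queue.popleft()
            if hops == max_hops:
                continue  # do not expand beyond the hop limit
            for edge_label, target in self.edges[node]:
                if edge_label == label and target not in seen:
                    seen.add(target)
                    result.append(target)
                    queue.append((target, hops + 1))
        return result


g = Graph()
for name in ("alice", "bob", "carol", "dave"):
    g.add_node(name, kind="person")
g.add_edge("alice", "FOLLOWS", "bob")
g.add_edge("bob", "FOLLOWS", "carol")
g.add_edge("carol", "FOLLOWS", "dave")

print(g.reachable("alice", "FOLLOWS", max_hops=2))  # prints: ['bob', 'carol']
```

The query follows edges from the start node instead of joining tables on a global key, which is what "identified by relative position" refers to.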

Key Considerations for Graph Databases: Suitable and Unsuitable

Suitable

Complex relationships
Data interconnectivity
Deep queries involving multiple hops
Dynamic, evolving data

Unsuitable

Simple, non-connected data
High throughput operations on massive datasets
Applying the same operation to a multitude of elements

Key Considerations for Graph Databases: Advantages and Disadvantages

Advantages

Highly optimized for relationships
Intuitive modeling
Visual representations

Disadvantages

Less performant for non-graph queries
Specialized query languages
Higher learning curve

Multimodel Database Management Systems

A multi-model database is a database management system that supports multiple data models on a single backend.

Multimodel Database Management System Description

Support different data models in the same database
Different data models can easily be combined in queries and even transactions.
Common features typically include
- Data storage, backup, and recovery
- Querying and indexing mechanisms by a unified query language
- ACID transactions (mostly in stand-alone mode only)
- Integration by the support of multiple data models depending on the application
- Advanced security features

Key Considerations for Multimodel Database Management Systems: Suitable and Unsuitable

Suitable

Support of multiple data models within a single backend
Diverse data types

Unsuitable

Simple applications with a single data model
Low complexity environments

Key Considerations for Multimodel Database Management Systems: Advantages and Disadvantages

Advantages

Master and administer with a single technology
Less locked to specific data models and limitations
More flexible in requirement changes

Disadvantages

Potentially complex to manage
Overhead from support of multiple models can impact performance

Scenario Example Multimodel Database

Production Line Robots:

Each robot is carefully tracked, with individual parts and maintenance history stored as JSON documents.

Part Relationships:

All robot components are interconnected in a detailed graph. This graph maps everything from tiny screws to complete robotic arms.

Problem:

A critical component on a robot arm breaks.

Task:

Identify a compatible replacement component that's in stock
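
Sketch: combining the document and graph models

A minimal sketch of the task, assuming part documents with an `in_stock` field and a compatibility graph stored as an adjacency list. All part names, the `in_stock` field, and the `find_replacements` function are hypothetical. A real multi-model database would typically express this as a single query across both models.

```python
# Document model: one JSON-like document per part.
parts = {
    "gripper-A": {"type": "gripper", "in_stock": 0},
    "gripper-B": {"type": "gripper", "in_stock": 3},
    "gripper-C": {"type": "gripper", "in_stock": 0},
    "arm-1": {"type": "arm", "in_stock": 1},
}

# Graph model: edges meaning "is compatible with".
compatible_with = {
    "gripper-A": ["gripper-B", "gripper-C"],
    "gripper-B": ["gripper-A"],
    "gripper-C": ["gripper-A"],
}


def find_replacements(broken_part: str) -> list:
    """Follow compatibility edges, then filter by stock level in the documents."""
    candidates = compatible_with.get(broken_part, [])
    return [part for part in candidates if parts[part]["in_stock"] > 0]


print(find_replacements("gripper-A"))  # prints: ['gripper-B']
```

The graph answers "what fits?", the documents answer "what is available?", and the result is the intersection of both.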

MongoDB Definition

MongoDB is a flexible, document-oriented NoSQL database that uses JSON-like structures for data storage. It's known for its scalability and ability to handle diverse data types.

MongoDB Description

Schema-less, document-oriented open-source database
Highly scalable, highly flexible
Manages collections of JSON-based documents (+BSON: Binary JSON)
Editions: Community Server, Enterprise Server, Atlas
Written in C++
Developed since 2007 by 10gen, now MongoDB Inc.
Consistency over Availability
API-support by many programming languages

Drivers MongoDB

C
C++
C#
Go
Java
Kotlin
Node.js
PHP
Python
Ruby
Rust
Scala
Swift
TypeScript
Elixir
Mongoose
Prisma
R

Advantages of MongoDB

Schema-less
Sharding (automatically)
MapReduce Support
Simple Replication with automated failover
Serverless access
GridFS (Load balancing, data replication features)
Simple Query Language

Schema-less design MongoDB

No predefined structure for documents.
Flexible data model adapts to changing needs.
Easy to add new fields or data types.
Handles unstructured and semi-structured data effectively.
Faster development cycles with less upfront design.
Enables agile iterations and rapid prototyping.
Reduces schema migration overhead.

Terminology: RDBMS vs MongoDB