What are the four steps of data engineering?
Data Ingestion
Data Transformation
Data Storage
Data Retrieval
What comes before and after Data Engineering?
Before: Raw Data Generation
After: Data Use
What is Data Ingestion?
Collect, clean, and validate raw data (Extract)
What is Data Transformation?
Convert, clean, enrich, and aggregate data (Transform)
What is Data Storage?
Store and manage data securely and scalably (Load)
What is Data Retrieval?
Access and extract data efficiently
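The four steps can be sketched end to end in a few lines. This is only a minimal illustration, assuming a made-up CSV feed of orders and an in-memory SQLite database as the store; the names RAW_CSV, ingest, transform, store, and retrieve are hypothetical and chosen for this sketch.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file, API, or stream.
RAW_CSV = """order_id,amount,country
1,19.99,de
2,,us
3,5.00,DE
"""

def ingest(text):
    """Ingestion: collect raw rows and drop the ones that fail validation."""
    rows = csv.DictReader(io.StringIO(text))
    return [r for r in rows if r["amount"]]  # skip rows with a missing amount

def transform(rows):
    """Transformation: convert types and normalize values."""
    return [(int(r["order_id"]), float(r["amount"]), r["country"].upper()) for r in rows]

def store(conn, records):
    """Storage: load the cleaned records into a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()

def retrieve(conn, country):
    """Retrieval: access the stored data with a query."""
    cur = conn.execute("SELECT SUM(amount) FROM orders WHERE country = ?", (country,))
    return cur.fetchone()[0]

conn = sqlite3.connect(":memory:")
store(conn, transform(ingest(RAW_CSV)))
total_de = retrieve(conn, "DE")
```

The order with the missing amount is rejected during ingestion, so it never reaches the store.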
Why are data moved?
Decision Making Support Systems:
Point of Sale System
Customer Relationship Management
Enterprise Resource Planning System
Website and Mobile App Analytics
Social Media Platforms
Primary Responsibility of Data Engineers
Design
Build
Maintain
the infrastructure that enables the seamless flow and utilization of data throughout the organization
Work areas of data engineers
Data Pipeline Development
Database Management
ETL Processes
Data Quality
Collaboration
What's the challenge of CERN?
Datagrid
What's the purpose of the Large Hadron Collider?
Test theories of particle physics
How many running jobs are on this global network?
36544
Number of active CPU cores
807139
What's the transfer rate in the data grid?
21.54 GiB/sec
What did the Large Hadron Collider discover?
Higgs Boson
What are the detectors of the Large Hadron Collider?
ATLAS, CMS, LHCb, ALICE
What does WLCG stand for?
Worldwide LHC Computing Grid
What is the WLCG?
A decentralized system (Grid Computing)
How many cores are on it?
1.4 million computer cores
What's the storage?
2 exabytes
How many tasks per day?
2 million
What's the transfer rate?
> 260 GB/s
What are the challenges of the WLCG?
Data Integration and Interoperability
Data Security and Privacy
Scalability
Data Lifecycle Management
Resource Allocation and Management
Real-time Processing and Streaming Data
Machine Learning and Advanced Analytics Integration
Cost Management
Worldwide LHC Computing Grid (WLCG) - Datagrid
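What is the different focus of data engineers, data scientists, and data analysts?
Data Engineers: Build data pipelines and infrastructure
Data Scientists: Analyze data, build predictive models
Data Analysts: Analyze trends, support decision-making
What are the different key skills of data engineers, data scientists, and data analysts?
Data Engineers: SQL, ETL, cloud platforms, data architecture
Data Scientists: Stats, ML, Python/R
Data Analysts: Data visualization, data analysis tools
What are the different tools of data engineers, data scientists, and data analysts?
Data Engineers: Spark, Hadoop, AWS
Data Scientists: Python, R, TensorFlow, PyTorch
Data Analysts: SQL, Excel, Tableau, Power BI, R, Python
What is the output of data engineers, data scientists, and data analysts?
Data Engineers: Reliable, scalable data infrastructure
Data Scientists: Predictive models, insights, experiments
Data Analysts: Reports, dashboards, presentations
What is the different background of data engineers, data scientists, and data analysts?
Data Engineers: Software engineering, IT
Data Scientists: Math/Statistics, Computer Science
Data Analysts: Business, Economics, Statistics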
What's the key question of data engineers?
How can we design a data pipeline that ingests real-time streaming data from our website and ensures it is stored in a scalable, fault-tolerant data warehouse?
What's the key question of data scientists?
How can we use historical customer data to build a predictive model that forecasts customer churn over the next six months?
What is the key question of data analysts?
What are the key trends in our sales data over the last quarter, and how can we visualize these to support the marketing team’s decisions?
What are the elements of the data lifecycle?
Data Creation/ Collection
Data Storage and Management
Data Processing and Analysis
Data Utilization and Sharing
Data Archiving and Disposal
What types of data sources are there?
Structured
Semi-structured
Unstructured
What data collection methods are there?
APIs
Databases
Streaming data
Web scraping
What are challenges in data collection?
Data quality, volume and variety
Can traditional relational databases satisfy the requirements of today’s scenarios like WLCG?
The old world
Millions of objects
100-byte objects
The “new” world
Billions of objects
Big objects (1MB)
Objects have behavior (methods)
What are new requirements for data management?
Explosion of unstructured and semistructured data (text, sensor data, social media feeds, JSON, etc.)
Complex data types (arrays, maps, nested structures) enable natural modeling of these data forms
Machine learning algorithms often work with complex data representations (text embeddings, image vectors, geospatial coordinates, etc.)
What is the evolutionary approach in Object-Relational Database Technologies?
Extends existing relational database models with object-oriented features
Focuses on backward compatibility with traditional SQL and relational schemas
Adds complex data types and structures, like nested tables or arrays
Gradual integration ensures a smoother transition for users and developers
Maintains the robustness and reliability of established relational databases
What is the revolutionary approach in Object-Relational Database Technologies?
Builds database systems fundamentally on object-oriented principles
Supports OO features like inheritance, polymorphism, and encapsulation natively
Redesigns data storage and access to be object-centric
Often requires a complete restructuring of existing database systems
Prioritizes object-oriented design over traditional relational models
Object-relational impedance mismatch
What is the definition of object-relational mapping (ORM)?
Technique that bridges the gap between object-oriented programming and relational databases.
What is the functionality of ORM?
Automates the translation of objects in code to relational tables in a database, simplifying data manipulation.
What are the benefits of ORM?
Simplifies database interactions, enhances productivity and maintains data integrity
What's the process of ORM?
Maps objects to database entities; CRUD operations in code translate to SQL queries.
What are Use cases of ORM?
Widely used in applications requiring database interaction, such as web and enterprise applications.
What are popular ORM frameworks?
Hibernate (Java), Entity Framework (.NET), Django ORM (Python)
ORM Example with Python I
ORM Example with Python II
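The idea behind the mapping can be shown without any framework. The following is a hand-rolled sketch using only the standard library (sqlite3 and dataclasses); it is not how Django ORM, Hibernate, or Entity Framework work internally, and the Customer class and the helpers create_table, save, and find_by_id are hypothetical names invented for this illustration.

```python
import sqlite3
from dataclasses import dataclass, fields, astuple

@dataclass
class Customer:
    # Each field maps to one column of the "customer" table.
    id: int
    name: str
    email: str

def create_table(conn, cls):
    """Derive a CREATE TABLE statement from the class definition."""
    type_map = {int: "INTEGER", str: "TEXT", float: "REAL"}
    cols = ", ".join(f"{f.name} {type_map[f.type]}" for f in fields(cls))
    conn.execute(f"CREATE TABLE {cls.__name__.lower()} ({cols})")

def save(conn, obj):
    """Create: translate an object into an INSERT statement."""
    table = type(obj).__name__.lower()
    placeholders = ", ".join("?" for _ in fields(obj))
    conn.execute(f"INSERT INTO {table} VALUES ({placeholders})", astuple(obj))

def find_by_id(conn, cls, obj_id):
    """Read: translate a lookup into a SELECT and rebuild the object."""
    row = conn.execute(
        f"SELECT * FROM {cls.__name__.lower()} WHERE id = ?", (obj_id,)
    ).fetchone()
    return cls(*row) if row else None

conn = sqlite3.connect(":memory:")
create_table(conn, Customer)
save(conn, Customer(1, "Ada", "ada@example.com"))
loaded = find_by_id(conn, Customer, 1)
```

Application code only touches Customer objects; the SQL is generated from the class definition, which is the translation step that real ORM frameworks automate far more completely.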
Object-Relational DBMS vs. Object-Oriented DBMS:
Object-Relational DBMS
Hybrid Model: Combines relational and object-oriented database features.
Complex Data Types: Supports structured types and nested objects.
SQL Enhancements: Extends SQL to handle OO features like inheritance.
Object-Oriented DBMS
OO Principles: Built entirely on object-oriented concepts like encapsulation, inheritance, and polymorphism.
Direct Object Storage: Stores objects as they are used in OO programming languages.
No SQL Required: Interacts with data using object-oriented languages, bypassing traditional SQL.
Overview Object-Relational Database Management Systems (ORDBMS)
Definition
Added Features
Significance
Definition: ORDBMS extends traditional relational models with object orientation
Added Features: Supports objects, classes, and inheritance
Significance: Represents an advancement in database technology by integrating object-oriented concepts
ORDBMS Key Features
Complex Structured Data-Types
OO Methods
Inheritance
Polymorphism
Encapsulation
Database Integrity
ORDBMS Compatibility
Upward Compatibility: Maintains compatibility with existing SQL-based relational database languages
Integration with Existing Applications: Facilitates the use of object-relational features alongside current applications without disruption
Support for Legacy Systems: Allows for the enhancement of legacy relational databases, preserving previous investments
SQL:1999
SQL Standards Compliance: Adheres to SQL:1999 (SQL3), which introduced several extensions
Relational Principles Adherence: Upholds fundamental relational database concepts, ensuring declarative access and the enforcement of ACID properties
Object-Relational Features: Object-oriented features, more complex data types, user-defined types, and support for object behaviors and hierarchies
Recursive Queries: WITH clause
Triggers and Stored Procedures: More sophisticated data processing and business logic
Advanced Querying Capabilities: New operators and functions
Enhanced Data Integrity: Assertions and referential constraints
Which databases support the SQL:1999 standard (OO features)?
Oracle Database
IBM DB2
PostgreSQL
Microsoft SQL Server
SQL:1999 - Selected Extensions for Complex Types
User-Defined Types (UDTs)
User Defined Functions (UDF)
Reference Types
Collections
Large Object Types (LOBs)
ORDBMS - SQL Improvements Timeline
1999: Object-relational Features
2003: SQL/XML, improved UDTs
2006: XML Query Language
2016: JSON support, Polymorphic Table Functions
2023: Graph Querying (SQL/PGQ)
ORDBMS - Collection Type Example
ORDBMS - Inheritance Example
ORDBMS - User Type Definition (UDT) Example
ORDBMS - User Defined Functions (UDF) Example
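The idea of a user-defined function, custom logic that can be called from inside a SQL statement, can be tried out with Python's built-in sqlite3 module. This is a stand-in only: SQLite is not an ORDBMS and this is not the SQL:1999 CREATE FUNCTION syntax. The function net_price, the product table, and the 19% rate are assumptions made for the sketch.

```python
import sqlite3

def net_price(gross, vat_rate):
    """User-defined logic: strip VAT from a gross price."""
    return round(gross / (1 + vat_rate), 2)

conn = sqlite3.connect(":memory:")
# Register the Python function under the SQL name net_price with 2 arguments.
conn.create_function("net_price", 2, net_price)

conn.execute("CREATE TABLE product (name TEXT, gross REAL)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?)", [("book", 11.90), ("pen", 2.38)]
)

# The UDF is used like a built-in SQL function.
rows = conn.execute(
    "SELECT name, net_price(gross, 0.19) FROM product ORDER BY name"
).fetchall()
```

In an ORDBMS such as PostgreSQL the function would be defined and stored in the database itself rather than registered from the client.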
Object-Oriented Database Management Systems
Object-Oriented Paradigm: Combines database functions with OO programming techniques.
Objects and Attributes: Real-world entities with data and methods.
Classes: Templates for creating objects, defining structure and behavior.
Encapsulation: Bundles data and methods and restricts direct access.
Inheritance: New classes inherit features from existing ones.
Polymorphism: Different classes, common interface; method redefinition in subclasses.
Object Identity: Unique identifiers for each object, independent of attributes.
Object Relationships and Associations: Manages complex inter-object relationships efficiently
OODBMS: Storage
Object Persistence: Objects exist beyond creation, ensuring data continuity.
Complex Structures: Naturally stores nested and interconnected object structures.
Object Identity: Each object is uniquely identified through persistent OIDs.
Direct Storage: Objects stored as-is, not in rows/columns.
OODBMS: Retrieval
Navigational Access: Retrieve by navigating through object network relationships.
Querying: Uses Object Query Language (OQL) for object-oriented data querying.
Indexing: Complex indexing methods for efficient object retrieval.
Characteristics-Based Retrieval: Search based on attributes, behavior, and relationships.
Handling Relationships: Exploits object associations for intuitive data access.
Query Languages in Object-Oriented Database Management Systems
Nature: Incorporates object-oriented concepts to extend traditional query capabilities for complex data handling.
Integration: Seamlessly integrates with object-oriented programming languages for data manipulation.
Object-Oriented Features: Supports encapsulation, polymorphism, inheritance in database queries.
Complex Data Types: Manages arrays, lists, and custom structures effectively in queries.
Complex Queries: Enables navigating through intricate object relationships and hierarchies.
Performance: Potentially slower for deep hierarchies and complex relational queries.
Extensibility: May offer extensibility for specific application requirements and functionality
Object Query Language: Example I
Object Query Language: Example II
OODBMS: Use Cases (6)
CAD/CAM (Computer-Aided Design/Manufacturing)
Geographic Information System (GIS)
Digital Asset Management (DAM)
Content Management Systems (CMS)
Scientific Data
Multimedia Applications
Pros of OODBMS
Reduced Impedance Mismatch: Seamlessly maps object-oriented concepts to database representation, simplifying application logic
Complex Data Handling: Natively supports complex data structures, custom types, and rich object relationships
Intuitive Queries: Querying often follows natural object navigation and method calls
Cons of OODBMS
Less Mature Technology: Smaller market share compared to RDBMS solutions
Potential Learning Curve: OODBMS concepts and query languages might be less familiar to developers
Performance Considerations: Some OODBMS can face challenges in scenarios traditionally dominated by highly optimized relational databases
Vendor Support: May have fewer vendor choices or less extensive support networks compared to major RDBMS
Comparison RDBMS vs. ORDBMS vs. OODBMS
Popular ORDBMSs
PostgreSQL: An advanced ORDBMS that supports various data types, including custom types, and offers powerful programming and querying features.
Oracle Database: Oracle offers robust object-relational features and is widely used in enterprise environments for its robustness and extensive feature set.
IBM DB2: Offers strong object-relational capabilities, particularly in its advanced editions, and is widely used in enterprise and large-scale systems.
Microsoft SQL Server: While primarily a relational database, it has expanded to include more object-oriented features like user-defined types and functions.
Popular OODBMSs
InterSystems Caché: Known for its high performance, it integrates object database, SQL, and analytics capabilities.
db4o (database for objects): Specifically designed for object-oriented languages like Java and .NET.
ObjectDB: An object database for Java and JPA/JDO. It's very efficient for Java-based applications.
ObjectStore: A mature OODBMS with strong support for commercial applications, offering versions for Java, C++, and other languages.
ORDBMS vs. OODBMS Ranking Graph
Drivers of Database Evolution: From Traditional to Parallel and Distributed Systems
Parallel Database Architectures (3)
Performance Metrics in Database Scalability
Speedup: the same job, more hardware, less time
Scaleup: bigger job, more hardware, same time
Throughput: more clients/servers, same response time
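Speedup and scaleup are usually expressed as ratios of elapsed times. The sketch below uses the common textbook definitions; the function names and the timings are hypothetical, not measurements from any real system.

```python
def speedup(time_small_system, time_large_system):
    """Same job on both systems; values > 1 mean the larger system is faster."""
    return time_small_system / time_large_system

def scaleup(time_small_job_small_system, time_big_job_large_system):
    """Job and hardware grow together; 1.0 means the elapsed time stayed constant."""
    return time_small_job_small_system / time_big_job_large_system

# Hypothetical measurements in seconds.
s = speedup(120.0, 40.0)    # 1 node vs. 4 nodes, same job
u = scaleup(120.0, 150.0)   # 1 node with 1x data vs. 4 nodes with 4x data
```

With four times the hardware, linear speedup would be 4.0 and linear scaleup would be 1.0; the sample values of 3.0 and 0.8 show sub-linear behavior.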
Distributed Database Systems - Data Replication Pros and Cons
Multiple copies of data stored on different sites
+ Availability
+ Fast (local) access
+ Performance by parallel execution
- High data update cost (each replica)
- Complex concurrency management
Distributed Database Systems - Data Replication Synchronous vs. Asynchronous
Synchronous Replication
Changes are atomically applied to destination DBs
Consistency: Strong
Latency: Higher
Asynchronous Replication
Changes propagated to destination DBs with some delay
Consistency: Moderate
Latency: Lower
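The difference between the two modes can be simulated in memory. This is a toy model, assuming a single primary and dictionaries as replicas; the classes Replica, SyncPrimary, and AsyncPrimary are hypothetical and ignore failures, ordering across keys, and networking.

```python
from collections import deque

class Replica:
    def __init__(self):
        self.data = {}

class SyncPrimary:
    """Synchronous: a write returns only after every replica has applied it."""
    def __init__(self, replicas):
        self.data, self.replicas = {}, replicas

    def write(self, key, value):
        self.data[key] = value
        for r in self.replicas:       # the caller waits for all replicas
            r.data[key] = value

class AsyncPrimary:
    """Asynchronous: a write returns immediately; changes are shipped later."""
    def __init__(self, replicas):
        self.data, self.replicas = {}, replicas
        self.log = deque()            # pending changes not yet propagated

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def propagate(self):
        while self.log:
            key, value = self.log.popleft()
            for r in self.replicas:
                r.data[key] = value

sync_replica, async_replica = Replica(), Replica()
SyncPrimary([sync_replica]).write("x", 1)

primary = AsyncPrimary([async_replica])
primary.write("x", 1)
stale_read = async_replica.data.get("x")   # replica has not seen the write yet
primary.propagate()
fresh_read = async_replica.data.get("x")
```

The stale read between write and propagate is the inconsistency window that asynchronous replication accepts in exchange for lower write latency.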
Distributed Database Systems – Design Considerations
Network Structure
Latency
Bandwidth
Network Partitioning
Design Considerations
Homogeneous vs. Heterogeneous Systems
Client-Server vs. Peer-to-Peer Architecture
Transparency
Single System Image (SSI) for Distributed Databases
Provides the illusion of a centralized system despite distributed data.
Key aspects:
Abstraction (hides infrastructure complexity)
Unified Interface (consistent way to interact with data)
Global Schema (centralized view of all data)
Enables other transparencies (replication, fragmentation, location)
Transparency in Distributed Systems (Three types)
Replication transparency: Users view data items as logically unique and are not concerned about which data item is replicated.
Fragmentation transparency: Users do not need to know whether or how a relation has been fragmented
Location transparency: Users are not required to know the physical location of a data item
Benefits of Transparency in Distributed Systems
Easier application development (data location agnostic)
Simpler data management (unified interface)
Improved scalability (easier to add resources)
Increased fault tolerance (redundancy hidden)
Examples of Transparency in Distributed Systems
E-commerce Platforms: Amazon, eBay, Alibaba
Content Sharing Services: Facebook, X (Twitter)
Cloud Storage Providers: Dropbox, Google Drive
Travel Booking Services: Expedia, Booking.com
Distributed Database Systems - Fragmentation
Partitions (fragments) are stored on different sites
- High cost for de-fragmentation of a relation
Horizontal Partitioning: Data rows are distributed across different sites.
Vertical Partitioning: Different columns of a table are stored at different sites.
Full Replication: Each site holds a complete copy of the database.
Vertical Fragmentation Example
Horizontal Fragmentation Example
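Both kinds of fragmentation can be illustrated on a small relation held as a list of dictionaries. The employees data and the helper names below are invented for this sketch; a real distributed DBMS would place each fragment on a different site.

```python
employees = [
    {"id": 1, "name": "Ann", "site": "Berlin", "salary": 5200},
    {"id": 2, "name": "Ben", "site": "Munich", "salary": 4800},
    {"id": 3, "name": "Cem", "site": "Berlin", "salary": 6100},
]

def horizontal_fragments(rows, key):
    """Split rows by the value of one attribute; every fragment keeps all columns."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[key], []).append(row)
    return fragments

def vertical_fragment(rows, columns, primary_key="id"):
    """Project a subset of columns; the primary key is repeated in every fragment."""
    return [{c: row[c] for c in (primary_key, *columns)} for row in rows]

def rejoin(left, right, primary_key="id"):
    """Reconstruct the relation from two vertical fragments by joining on the key."""
    right_by_key = {r[primary_key]: r for r in right}
    return [{**l, **right_by_key[l[primary_key]]} for l in left]

by_site = horizontal_fragments(employees, "site")
public = vertical_fragment(employees, ["name", "site"])
payroll = vertical_fragment(employees, ["salary"])
restored = rejoin(public, payroll)
```

Horizontal fragments are recombined with a union, vertical fragments with a join on the primary key; that join is the de-fragmentation cost mentioned above.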
Exploring Parallelism in Databases Intra-Query vs. Inter-Query:
Intra-Query
Dividing a single query into subtasks
Parallel execution on a multi-core CPU
Applicable to operations like scans, joins, aggregations, sorting
Inter-Query
Running multiple independent queries concurrently, distributed across multiple database servers
Leveraging multiple CPUs and storage devices
Suitable for complex queries or large datasets
Distributed vs. Parallel Databases
What is MapReduce?
A programming model and an associated implementation for processing and generating large data sets.
What are the key features of MapReduce
3 phases: Map, Shuffle, Reduce
Automatic parallelization and distribution of work
I/O Scheduling
Example "The Story of Sam"
MapReduce Framework for Large-Scale Data Processing:
Key Points
Simplifying Large-Scale Data Processing
Harnessing the power of multiple CPUs for efficient analysis
Distributing work across computing sites for scalability
Built-in fault tolerance for resilient computations
Map Reduce Image
Map Function in Map Reduce
Processing Individual Elements
Executes on every dataset element, transforming into a new key-value pair.
Partition in Map Reduce
Organizing Data
Shards the key-value pairs, grouping them based on the key for efficient processing.
Reduce Function in Map Reduce
Aggregating Results
Operates on each unique key, aggregating or summarizing associated values.
Map Reduce Key Value Pair Approach
Key: A unique identifier for each data element
Value: The actual data or a reference to its location.
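The map, partition, and reduce steps can be sketched as a single-process word count. This only illustrates the programming model; it is not Hadoop's API, it runs nothing in parallel, and the function names are chosen for this example.

```python
from collections import defaultdict

def map_fn(line):
    """Map: turn one input element into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle/partition: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce: aggregate the values of one key."""
    return key, sum(values)

def map_reduce(lines):
    mapped = [pair for line in lines for pair in map_fn(line)]
    return dict(reduce_fn(k, v) for k, v in shuffle(mapped).items())

counts = map_reduce(["big data big", "data grid"])
```

Because map_fn looks at one element at a time and reduce_fn at one key at a time, a framework is free to run both on many machines without changing the user's code.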
Hadoop Ecosystem Image
MapReduce Framework: Hadoop Ecosystem
Apache Foundation Java project since 2008
HDFS: distributed file system (similar to Google FS)
Error tolerance by 3-times (default) data replication
If no "heartbeat" of a node => the central node re-distributes data
Files are stored in chunks of fixed size (64 MB) => reasonable number of large files
Rack-aware file system => nobody knows where data is stored
Is not directly mountable by an operating system
Hadoop Map Reduce: Parallel Programming Framework
HBase: NoSQL database modeled after Google BigTable
YARN (Yet Another Resource Negotiator)
Limitations of Traditional Data Formats (Dataformats)
EDIFACT
CSV / TSV
Fixed-Width Text Files
Binary Formats
Proprietary Formats
What are the 8 Characteristics of Semi-Structured Data?
Self-Describing: Metadata included
Flexible Schema: Adaptable to changes
Hierarchical Structure: Nested elements
Inhomogeneous Structure: Varied formats
Implicit Schema: Structure suggested, not enforced
Graph-Like Model: Interconnected nodes
Platform-Independent: Universally accessible
Human and Machine-Readable: Easily processed and understood
Examples of Semi-Structured Data
What was XML designed for?
storage, transmission, and reconstruction of data
XML Key Facts
Standardized Data Interchange Format
XML became a W3C Recommendation in 1998
Tag-based Syntax
Foundation of several web technologies
For which technologies is XML the foundation?
XHTML
RSS/ATOM
AJAX (the X in AJAX)
What are factors that led to the rise of XML?
Flexibility and Simplicity: XML balances structure and ease of use compared to SGML or highly rigid formats like EDI.
Human and Machine Readability: XML is relatively easy for humans and computers to understand.
Web Compatibility: XML's integration with web technologies promoted its widespread adoption.
What are the goals of XML?
Human Readable: Understandable without specialized tools.
Data Sharing: Simplifies data sharing across platforms and applications.
SGML-Compatible: Maintains compatibility with its parent, SGML.
Ease of Processing: Programmatically parsable with standard libraries.
Support Diverse Applications: Adaptable to various use cases.
Document centric vs data centric XML: Document Centric
Focuses on representing the layout and formatting of a document.
Often used for human-readable content like reports, articles, or ebooks.
May contain large text sections with some embedded tags for structure or styling.
Example: A research paper with sections, paragraphs, and citations marked up using XML tags.
Document centric vs data centric XML: Data Centric
Focuses on representing the data itself in a structured way.
Often used for machine-readable information exchange like invoices, purchase orders, or scientific data.
Highly structured with well-defined elements and attributes containing specific data points.
Example: An invoice with elements for items, quantities, prices, and total amount.
Disadvantages of XML
Not suitable for very large datasets (multiple MB of data)
Images are not represented well
XML can quickly become difficult to read when complex
Use Cases of XML (7)
RSS Feeds
SOAP Protocols
APIs (e.g., Google)
Weather Services
Healthcare Data Exchange
Financial Transactions
Microsoft Office
Types of XML Content
XML Document Declaration
Elements and attributes
Comments
Character Data
Processing Instructions
Entity References
Namespace
XML Document Declaration
Optional at the beginning of the XML document
Specifies XML version and character encoding
Elements in XML
Primary building block
Must have valid names
Start tag and end tag
Can be nested
Must be properly closed
Attributes in XML
Additional information about elements
Defined within start tag of an element
Name-value pairs
Appear only once on a given element
Must always be quoted
Text in XML
Actual data content within XML elements
Format: Characters, numbers or other data types
Comments in XML
Embed human-readable information
Used for adding notes or explanations
Not visible in output
Enclosed in <!-- and -->
Not allowed
Before document declaration
Inside element brackets
Processing Instructions (PIs) in XML
Provide instructions for the XML processor
Contain application-specific directives
Form: <?targetName instruction?>
CDATA Sections in XML
CDATA (Character Data) sections allow the inclusion of text that should not be parsed by the XML processor, such as script or style code
Ensures special characters or sequences in the text do not interfere with the XML structure
CDATA sections are treated as plain text by the XML parser
Enclosed in <![CDATA[…unescaped text data…]]>
Entity References in XML
General Entities
Character Entities
Namespaces in XML
Namespaces ensure the unique identification of elements and attributes
Enable XML documents from different sources to be combined without name conflicts
Use URI references to differentiate similar elements with distinct meanings
A default namespace can be declared and applied to all unqualified elements
Prefixes before element names indicate the namespace and prevent ambiguity
Essential for extending XML languages, such as in XHTML or SVG
Support XML's extensibility and reusability across applications
Example XML with Namespaces
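A small document with a default namespace and a prefixed namespace can be read with Python's standard xml.etree.ElementTree module. The namespace URIs, element names, and values below are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one default namespace and one prefixed namespace.
DOC = """<catalog xmlns="http://example.org/books"
         xmlns:pub="http://example.org/publisher">
  <book>
    <title>Data Engineering</title>
    <pub:name>Example Press</pub:name>
  </book>
</catalog>"""

root = ET.fromstring(DOC)

# The prefixes used for querying are local to this code; only the URIs must match.
ns = {"b": "http://example.org/books", "p": "http://example.org/publisher"}

title = root.find("b:book/b:title", ns).text
publisher = root.find("b:book/p:name", ns).text
root_tag = root.tag   # ElementTree stores names as {namespace-uri}local-name
```

The unprefixed elements still belong to a namespace because of the default declaration, which is why the query needs the b: prefix even though the document does not use one.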
XML Syntax
Must be well-formed
XML documents require a single root element to encapsulate all content
Every opening tag must be matched with a closing tag
Empty tags must be closed, e.g. <hr />
Attribute values cannot be minimized
Instead of <option selected>, use <option selected="selected">
Tags are case-sensitive and must be used consistently
Attribute values must always be quoted.
Instead of <li id=1>, use <li id="1">
Nested elements must be correctly closed in the order they are opened.
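These rules can be checked mechanically, because a conforming XML parser rejects any document that is not well-formed. The sketch below uses the standard library parser; the helper is_well_formed and the sample snippets are invented for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "ok": '<ul><li id="1">a</li><hr /></ul>',
    "unquoted attribute": "<ul><li id=1>a</li></ul>",
    "minimized attribute": "<select><option selected>a</option></select>",
    "wrong nesting": "<b><i>text</b></i>",
    "case mismatch": "<Item>text</item>",
    "two roots": "<a/><b/>",
}
results = {name: is_well_formed(xml) for name, xml in samples.items()}
```

Only the first sample passes. Well-formedness says nothing about whether the document is valid against a DTD or schema; that is a separate check.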
XML Well-formed and Valid
Why Validation of XML Makes Sense:
Ensures XML documents conform to a predefined structure.
Improves data integrity and reliability in data exchange.
Facilitates interoperability between systems and applications.
Catches errors early in the development process, reducing costs.
Enables automated parsing and processing of XML documents.
Provides clear specifications for data formats and types.
Document Type Definitions (DTDs)
Original schema language for defining XML document structure.
Can be embedded in XML documents or defined externally.
Does not support data types other than strings.
Enforces a strict order in which elements appear.
Offers entity mechanism for reusing content.
Lacks support for namespaces.
More widely supported in legacy systems.
DTD Syntax and Building Blocks
Element declarations
Attribute declarations
Entity declarations
Notation declarations
PCDATA
Element quantifiers (?, *, +)
Choice (|) and sequence (,) operators
DTD Example I
DTD Example II
XML Schema Description
Richer and more powerful than DTDs.
Supports XML namespaces and multiple schemas in a single document.
Allows definition of custom data types and data type inheritance.
Can enforce the order of child elements.
Facilitates creation of reusable schema modules.
Enables default values and fixed values for elements and attributes
Better suited for modern, complex applications.
XML Schema Example
XSD Validation Example
DTD vs. XML Schema
XML related technologies
XPath
XSLT
XQuery
XPointer, XLink
What is XPath?
Definition: A language used for navigating through elements and attributes in an XML document.
Key concept:
Context Node
Axis (way from context to selected node)
Predicates (further refinement)
XPath Syntax I
XPath Syntax II
XPath Example:
Simple Selection //book/title
XPath Example
Attribute Selection //book[@category="Science"]/title
Predicate Filtering //book[price>20]
Axes //book[author="Stephen Hawking"]/following-sibling::book/title
Functions //book[contains(title, 'Data')]
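Some of these expressions can be tried with Python's standard library, with one caveat: xml.etree.ElementTree implements only a subset of XPath. Paths must be relative (.//book instead of //book), and comparisons such as price>20, functions such as contains(), and the following-sibling axis are not supported, so the sketch falls back to plain Python for those. The sample library document and its prices are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data.
DOC = """<library>
  <book category="Science">
    <title>A Brief History of Time</title>
    <author>Stephen Hawking</author>
    <price>25</price>
  </book>
  <book category="IT">
    <title>Data Pipelines</title>
    <author>J. Doe</author>
    <price>18</price>
  </book>
</library>"""
root = ET.fromstring(DOC)

# Simple selection and attribute selection work directly.
all_titles = [t.text for t in root.findall(".//book/title")]
science = [t.text for t in root.findall(".//book[@category='Science']/title")]

# A predicate on the text of a child element is also supported.
by_author = [b.find("title").text
             for b in root.findall(".//book[author='Stephen Hawking']")]

# Numeric comparison and contains() are emulated in Python.
expensive = [b.find("title").text for b in root.findall(".//book")
             if float(b.find("price").text) > 20]
with_data = [t.text for t in root.findall(".//book/title") if "Data" in t.text]
```

A full XPath 1.0 engine would accept all five expressions from the cards unchanged; this sketch stays within what the standard library offers.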
eXtensible Stylesheet Language Transformations (XSLT)
Purpose
Transforms XML documents into different XML, HTML, or text formats
Enables the separation of content and presentation.
Uses XSLT stylesheets to define transformation rules.
Operates as a template engine, matching patterns in the source XML.
Written in XML
XSLT Key Components
XSLT Use Cases
Generating dynamic web pages from XML data.
Converting XML data to PDF or other document formats.
Migrating data from one database to another.
XQuery
Is to XML what SQL is to databases
Language for querying XML data
Built on XPath expressions
Supported by all major databases
W3C Recommendation
XQuery Key Features
Functional: Built on functional programming concepts.
Rich Expressions: FLWOR (For, Let, Where, Order by, Return) for complex queries.
Versatile: Queries data that is fully structured, unstructured, or semi-structured.
XQuery Use cases
Transforming XML documents.
Aggregating data from multiple XML sources.
Searching text within XML documents for web services.
JavaScript Object Notation (JSON)
Lightweight, text-based, human-readable data format for structured data
Key Features
Simplicity
Language-Independent
Universal
Syntactically Similar to JavaScript but with stricter rules
Independent Standard (ECMA-404, RFC 8259)
Popularity Overview
JSON Elements: Building Blocks of Data
JSON Example
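A short document that uses every JSON building block (object, array, string, number, boolean, null) can be parsed with Python's json module. The field names and values are invented for this sketch.

```python
import json

# Hypothetical JSON text covering all value types.
text = """{
  "name": "ATLAS",
  "active": true,
  "cores": 1024,
  "load": 0.75,
  "tags": ["detector", "LHC"],
  "site": {"country": "CH", "rack": null}
}"""

doc = json.loads(text)

# JSON types map directly onto native Python types.
types = {key: type(value).__name__ for key, value in doc.items()}

# Serializing and parsing again yields an equal structure.
round_trip = json.loads(json.dumps(doc)) == doc
```

Unlike XML, the parsed result already carries numbers, booleans, and null as typed values without a separate schema.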
XML vs. JSON: Mapping Challenges
Attributes vs. Elements:
XML supports attributes (e.g., <person id="123">), but JSON does not. Attributes need to be converted to key-value pairs in JSON.
Mixed Content:
XML allows elements to have both text and child elements (e.g., <tag>Text <child>value</child></tag>). JSON does not directly support this, requiring restructuring.
Array Representation:
XML does not have a native array representation. Lists are represented by repeating elements, which must be interpreted as arrays during conversion to JSON.
Namespaces:
XML supports namespaces (e.g., xmlns), which have no direct equivalent in JSON. This adds complexity during mapping.
Data Types:
XML requires additional schemas to define data types, whereas JSON supports native types directly. Converting between the two may involve type inference or loss of type information.
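Several of these challenges show up in even a tiny converter. The sketch below applies one possible convention (attributes become "@name" keys, repeated child elements become arrays, text next to other content goes under "#text"); this convention and the function element_to_dict are assumptions for illustration, not a standard, and text that follows a child element in mixed content is simply dropped here.

```python
import json
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Convert an element to JSON-compatible data using one possible convention."""
    node = {f"@{k}": v for k, v in elem.attrib.items()}   # attributes -> "@name" keys
    for child in elem:
        value = element_to_dict(child)
        if child.tag in node:                              # repeated tag -> array
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    text = (elem.text or "").strip()
    if text:
        if node:
            node["#text"] = text      # text next to attributes or children
        else:
            return text               # leaf element: the value stays a string
    return node

xml = '<person id="123"><name>Ada</name><phone>1</phone><phone>2</phone></person>'
result = element_to_dict(ET.fromstring(xml))
as_json = json.dumps(result)
```

Note two losses: the id and the phone numbers arrive as strings because XML carries no type information, and a person with a single phone element would produce a scalar instead of a one-element array.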
Comparison XML with JSON
NoSQL Definition from https://hostingdata.co.uk/nosql-database/
Next Generation Database Management Systems mostly addressing some of the points:
being non-relational,
distributed,
open-source and horizontally scalable.
The original intention has been modern web-scale database management systems. The movement began early 2009 and is growing rapidly. Often more characteristics apply such as: schema-free, easy replication support, simple API, eventually consistent / BASE (not ACID), a huge amount of data and more. So the misleading term "nosql" (the community now translates it mostly with "not only sql") should be seen as an alias to something like the definition above.
Reasons for the popularity of NoSQL:
Development Speed: Often faster development than SQL databases.
Data Versatility: Better suited for managing and easily evolving various data structures.
Cost-Effectiveness: Handling large data volumes can be more economical than with SQL.
Scalability and Uptime: Can better manage high traffic and maintain continuous uptime, unlike SQL.
Innovation Support: Supports new application paradigms more effectively.
NoSQL Characteristics
Schema Flexibility: Evolution over time
High Performance & Low Latency
Specialized Data Models
Large Data Volumes
BASE vs. ACID
Core NoSQL Systems I
Key-value Stores:
Simplest type: Data is stored as key-value pairs.
Ideal for: Caching, session management, storing user preferences
Examples: Redis, Memcached, Riak
Document Databases:
Like JSON: Data is stored in document-like structures
Ideal for: Content management, semi-structured data, flexible schemas
Examples: MongoDB, Couchbase, Amazon DocumentDB, BaseX
Core NoSQL Systems II
Wide-Column Stores
Table-like but flexible: Data is organized into rows and dynamic columns (columns can vary by row).
Ideal for: Large-scale analytics, time-series data, event logging
Examples: Cassandra, HBase
Graph Databases
Nodes and relationships: Focus on representing relationships between data entities (nodes) and connections between them (edges).
Ideal for: Social networks, recommendation engines, fraud detection
Examples: Neo4j, JanusGraph
Multimodel Databases
Support multiple data models within a single system
Examples: ArangoDB, OrientDB, Cosmos DB
Sharding NoSQL
Why?
Horizontally partitioning a large database into smaller, independent pieces called "shards"
Scalability: Handle more data and requests.
Availability: Improve system resilience.
Performance: Faster query responses
Key Elements of Sharding NoSQL
Sharding Key: Decides data placement.
Sharding Function: Maps data to shards.
Query Router: Directs queries to the correct shard(s).
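The three elements can be sketched with in-memory dictionaries standing in for shards. This is a simplified model, assuming hash-based sharding with a fixed number of shards; the class ShardedStore and its methods are hypothetical and do not correspond to any particular database's API.

```python
import hashlib

class ShardedStore:
    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def shard_for(self, sharding_key):
        """Sharding function: stable hash of the key modulo the number of shards."""
        digest = hashlib.sha256(str(sharding_key).encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, sharding_key, record):
        """Router for writes: the record goes to exactly one shard."""
        self.shards[self.shard_for(sharding_key)][sharding_key] = record

    def get(self, sharding_key):
        """Router for key lookups: only one shard is consulted."""
        return self.shards[self.shard_for(sharding_key)].get(sharding_key)

    def scan(self, predicate):
        """Queries that do not use the sharding key must fan out to all shards."""
        return [r for shard in self.shards for r in shard.values() if predicate(r)]

store = ShardedStore(4)
for user_id in range(100):
    store.put(user_id, {"user_id": user_id, "premium": user_id % 10 == 0})

one_user = store.get(42)
premium_users = store.scan(lambda r: r["premium"])
```

A cryptographic hash is used instead of Python's built-in hash() because string hashes are randomized per process, which would send the same key to different shards after a restart. Plain modulo sharding also forces most data to move when the shard count changes, which real systems avoid with techniques such as consistent hashing.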
CAP Theorem
A distributed system can have at most "two of the three" properties:
Consistency
Every read receives the most recent write or an error.
Availability
Every request receives a (non-error) response without guaranteeing it contains the most recent write.
Partition Tolerance
The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.
Visual Guide to NoSQL Systems
BASE: Basically Available, Soft State, Eventual Consistency
Promotes availability over consistency - “Optimistic approach”
Contrary concept to ACID - “Pessimistic” approach
Forgoes strong consistency
"Soft state" (state of the system may change over time)
Database changes between consistent and inconsistent state
User has no guarantee to see only one version of data
During inconsistency windows, different versions of data possible
Simplifies the redundancy management of data
Less synchronization between replicas necessary
Higher availability due to more replicated copies
Advantages of NoSQL Systems over Relational DBMS
Flexible Data Models: Accommodate evolving data structures without complex schema changes. Applications can be designed to work well with less rigid schemas.
Horizontal Scalability: Scale-out cost-effectively by adding commodity hardware.
High Performance for Specific Workloads: Optimized for fast reads and writes, particularly in key-value or denormalized data models
Developer-Friendly: Many NoSQL systems align with modern application development practices and data formats, reducing reliance on specialized DBAs.
Big Data Ready: Designed to handle massive data volumes
Lower Costs: Often leverages clusters of commodity servers, reducing hardware expenses
Drawbacks of NoSQL Systems Compared to Relational DBMS
Support & Maturity: Often open-source with varying support levels, still maturing.
Administration: Designed for simpler management, yet skilled oversight is beneficial.
Expertise: Growing developer community, but expertise less widespread than RDBMS.
Analytics & BI Focus: Optimized for web-scale operational needs, analytics features evolving.
Standardization & Transactions: Lacks a single standard, inconsistent support for complex transactions.
NoSQL Databases
Redis
Riak
MongoDB
Amazon DocumentDB
Couchbase
Apache HBase
Cassandra
Neo4j
JanusGraph
OrientDB
Microsoft Azure Cosmos DB
ArangoDB
BaseX
Key-Value Store Definition
A data storage system that resembles a dictionary or hash table. It stores data as a collection of key-value pairs, where a unique key is used to quickly retrieve the associated data record.
Key-Value Store Description
Simple key-value access
Flexible schema-less design
Queries are restricted to keys (focused queries)
Operations usually: put, get, delete
Advantages of decreased complexity
High Scalability
Efficient Distribution
Fault tolerance
Foundation of MapReduce
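The put/get/delete interface above fits in a few lines. The following is a minimal in-memory sketch, assuming a plain Python dict as the backing structure. `KeyValueStore` and `shard_for` are hypothetical names, and the hash-based routing is only one possible way to show why distribution is simple when every query is keyed.

```python
import hashlib


class KeyValueStore:
    """In-memory key-value store exposing only put, get and delete."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        self._items[key] = value

    def get(self, key, default=None):
        return self._items.get(key, default)

    def delete(self, key):
        # Returns True if the key existed, False otherwise.
        if key in self._items:
            del self._items[key]
            return True
        return False


def shard_for(key, shard_count):
    """Pick a shard from the key alone, using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % shard_count


store = KeyValueStore()
store.put("user:42", {"name": "Ada"})
found = store.get("user:42")
removed = store.delete("user:42")
missing = store.get("user:42")
```

Because the key alone determines the shard, no node needs to know anything about the stored values to route a request.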
Key Considerations for Key-Value Stores: Suitable and unsuitable for
Suitable
Simple data model
High performance for simple retrievals
Unsuitable
Complex queries
Relational Data Management
Applications requiring ACID transactions
Key Considerations for Key-Value Stores: Advantages and Disadvantages
Advantages
Extremely fast reads/writes
Highly scalable
Disadvantages
Limited data modeling
No native support for complex queries
No relationships
Wide-Column Store
Two-dimensional Key-Value Store.
Columns not predefined (may vary from row to row)
Column families
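The "two-dimensional key-value" idea can be sketched with nested dictionaries: row key, then column family, then column name. This is an illustration only. `WideColumnTable` is a hypothetical class and does not reflect the storage layout of any specific system.

```python
from collections import defaultdict


class WideColumnTable:
    """Rows addressed by key; columns are grouped into fixed column families."""

    def __init__(self, column_families):
        self.column_families = set(column_families)
        # row key -> column family -> column name -> value
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        # Families are declared up front; columns inside them are not.
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][column] = value

    def get(self, row_key, family, column, default=None):
        return self.rows.get(row_key, {}).get(family, {}).get(column, default)

    def columns(self, row_key, family):
        # Column names present in this row (they may differ from row to row).
        return sorted(self.rows.get(row_key, {}).get(family, {}))


table = WideColumnTable(["profile", "activity"])
table.put("user1", "profile", "name", "Ada")
table.put("user1", "profile", "email", "ada@example.org")
table.put("user2", "profile", "name", "Linus")
table.put("user2", "activity", "last_login", "2024-01-01")
```

In this example `user1` and `user2` share the `profile` family but carry different columns in it.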
Key Considerations for Wide-Column Stores: Suitable and Unsuitable
Suitable
Large volumes of data with variable schema
Fast reads and writes
Unsuitable
Complex transactions
Strong consistency across multiple operations
Complex data relationships
Key Considerations for Wide-Column Stores: Advantages and Disadvantages
Advantages
Highly flexible handling of varied column sets
Efficient for analytics
Disadvantages
Complex schema management
Less intuitive for relational data model users
Document Stores Definition
Document Stores are specifically designed to handle semi-structured data. They are a popular type of NoSQL database, with XML databases being a specialized subclass for XML document management.
Document Stores Description
Collection of documents (equivalent to rows in an RDBMS)
Document formats: JSON, XML, YAML, …
Structured set of key/value pairs
Addressed via a unique key
Documents are treated as a whole (schema-free)
Access via API or query language
Supported: MapReduce
Not directly Supported: Joins
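The description above (documents as sets of key/value pairs, addressed by a unique key, stored as a whole, queried through an API) can be sketched as follows. `DocumentCollection` and its methods are hypothetical names for illustration, not the API of any real document store.

```python
import json
import uuid


class DocumentCollection:
    """Stores whole JSON documents under a unique key, with no fixed schema."""

    def __init__(self):
        self._docs = {}  # unique key -> serialized JSON text

    def insert(self, document, key=None):
        key = key or str(uuid.uuid4())
        self._docs[key] = json.dumps(document)  # the document is stored as a whole
        return key

    def get(self, key):
        raw = self._docs.get(key)
        return None if raw is None else json.loads(raw)

    def find(self, **criteria):
        # Simple API query: equality match on top-level fields only.
        results = []
        for raw in self._docs.values():
            doc = json.loads(raw)
            if all(doc.get(field) == value for field, value in criteria.items()):
                results.append(doc)
        return results


collection = DocumentCollection()
collection.insert({"type": "robot", "name": "arm-7", "axes": 6}, key="r1")
collection.insert({"type": "robot", "name": "cart-2"}, key="r2")  # no "axes" field
collection.insert({"type": "sensor", "unit": "celsius"}, key="s1")

robots = collection.find(type="robot")
```

The three documents have different fields, yet they live in the same collection. There is no join: related data either sits inside the document or has to be fetched with a second call.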
Key Considerations for Document Stores: Suitable and Unsuitable
Suitable
Flexible schema
Document encapsulation of data
Moderate relationship management
Unsuitable
Highly relational data
Applications requiring complex joins and multilevel transactions
Key Considerations for Document Stores: Advantages and Disadvantages
Advantages
Flexible data model
Rich query capabilities
Disadvantages
Less efficient for complex queries involving multiple document relationships
Graph Databases Definition
A graph database stores data using nodes (data points), edges (relationships between the data), and properties (attributes). This structure prioritizes relationships, allowing for fast queries and intuitive visualization of complex, interconnected data.
Graph Databases Description
Graph-oriented
Entity types are edges, nodes or attributes
No global key, no joins are necessary
Data are identified by relative position in the graph (traversal)
Nodes and edges can be labeled (used later for search)
No limitation in the number of edges and attributes per node
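Traversal over labeled edges, as described above, can be sketched with an adjacency list and a breadth-first search. The `Graph` class, the node names and the edge labels are all hypothetical. Real graph databases use dedicated query languages and storage engines.

```python
from collections import deque


class Graph:
    """Nodes with properties and labeled, directed edges."""

    def __init__(self):
        self.nodes = {}  # node id -> properties
        self.edges = {}  # node id -> list of (label, target id)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties
        self.edges.setdefault(node_id, [])

    def add_edge(self, source, label, target):
        self.edges.setdefault(source, []).append((label, target))
        self.edges.setdefault(target, [])

    def traverse(self, start, label, max_hops):
        """Node ids reachable from start via edges with this label, within max_hops."""
        seen = {start}
        queue = deque([(start, 0)])
        reached = []
        while queue:
            node, hops = queue.popleft()
            if hops == max_hops:
                continue  # do not expand beyond the hop limit
            for edge_label, target in self.edges[node]:
                if edge_label == label and target not in seen:
                    seen.add(target)
                    reached.append(target)
                    queue.append((target, hops + 1))
        return reached


graph = Graph()
for person in ("alice", "bob", "carol", "dave"):
    graph.add_node(person, kind="person")
graph.add_node("acme", kind="company")
graph.add_edge("alice", "KNOWS", "bob")
graph.add_edge("bob", "KNOWS", "carol")
graph.add_edge("carol", "KNOWS", "dave")
graph.add_edge("alice", "WORKS_AT", "acme")

within_two_hops = graph.traverse("alice", "KNOWS", max_hops=2)
```

The query follows edges from node to node instead of joining tables, which is why multi-hop questions are a natural fit for this model.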
Key Considerations for Graph Databases: Suitable and Unsuitable
Suitable
Complex relationships
Data interconnectivity
Deep queries involving multiple hops
Dynamic, evolving data
Unsuitable
Simple, non-connected data
High throughput operations on massive datasets
Applying the same operation to a multitude of elements
Key Considerations for Graph Databases: Advantages and Disadvantages
Advantages
Highly optimized for relationships
Intuitive modeling
Visual representations
Disadvantages
Less performant for non-graph queries
Specialized query languages
Higher learning curve
Multimodel Database Management Systems
A multi-model database is a database management system that supports multiple data models on a single backend.
Multimodel Database Management System Description
Support different data models in the same database
Different data models can easily be combined in queries and even transactions.
Common features typically include
Data storage, backup, and recovery
Querying and indexing mechanisms by a unified query language
ACID transactions (mostly in stand-alone mode only)
Integration by the support of multiple data models depending on the application
Advanced security features
Key Considerations for Multimodel Database Management Systems: Suitable and Unsuitable
Suitable
Support of multiple data models within a single backend
Diverse data types
Unsuitable
Simple applications with a single data model
Low complexity environments
Key Considerations for Multimodel Database Management Systems: Advantages and Disadvantages
Advantages
Master and administer with a single technology
Less locked to specific data models and limitations
More flexible in requirement changes
Disadvantages
Potentially complex to manage
Overhead from support of multiple models can impact performance
Scenario Example Multimodel Database
Production Line Robots:
Each robot is carefully tracked, with individual parts and maintenance history stored as JSON documents.
Part Relationships:
All robot components are interconnected in a detailed graph. This graph maps everything from tiny screws to complete robotic arms.
Problem:
A critical component on a robot arm breaks.
Task:
Identify a compatible replacement component that's in stock
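A possible way to solve the task combines both models in one query: the graph answers "what is compatible?", the documents answer "is it in stock?". The sketch below uses plain Python structures. All part names, the `in_stock` field and the `compatible_with` edges are invented for illustration and are not part of the scenario.

```python
# Document model: one JSON-like document per part (values are made up).
parts = {
    "gripper-a": {"name": "Gripper A", "installed_on": "arm-7", "in_stock": 0},
    "gripper-b": {"name": "Gripper B", "installed_on": None, "in_stock": 3},
    "gripper-c": {"name": "Gripper C", "installed_on": None, "in_stock": 0},
}

# Graph model: edges stating which part can replace which (assumed edge type).
compatible_with = {
    "gripper-a": ["gripper-b", "gripper-c"],
    "gripper-b": ["gripper-a"],
    "gripper-c": ["gripper-a"],
}


def find_replacements(broken_part_id):
    """Graph step: follow compatibility edges. Document step: filter by stock."""
    candidates = compatible_with.get(broken_part_id, [])
    return [part_id for part_id in candidates if parts[part_id]["in_stock"] > 0]


replacements = find_replacements("gripper-a")
```

In a multimodel database both steps could run inside a single query against one backend instead of two separate systems.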
MongoDB Definition
MongoDB is a flexible, document-oriented NoSQL database that uses JSON-like structures for data storage. It's known for its scalability and ability to handle diverse data types.
MongoDB Description
Schema-less, document-oriented open-source database.
Highly scalable, highly flexible
Manages collections of JSON-based documents (+BSON: Binary JSON)
Editions: Community Server, Enterprise Server, Atlas
Written in C++
Developed since 2007 by 10gen (now MongoDB Inc.)
Consistency over Availability
API support for many programming languages
MongoDB Drivers
C
C++
C#
Go
Java
Kotlin
Node.js
PHP
Python
Ruby
Rust
Scala
Swift
TypeScript
Elixir
Mongoose
Prisma
R
Advantages of MongoDB
Schema-less
Sharding (automatically)
MapReduce Support
Simple Replication with automated failover
Serverless access
GridFS (Load balancing, data replication features)
Simple Query Language
Schema-less design MongoDB
No predefined structure for documents.
Flexible data model adapts to changing needs.
Easy to add new fields or data types.
Handles unstructured and semi-structured data effectively.
Faster development cycles with less upfront design.
Enables agile iterations and rapid prototyping.
Reduces schema migration overhead.
Terminology: RDBMS vs MongoDB
MongoDB in DB Ranking
Disadvantages of MongoDB
Only limited transactions
No joins, but $lookup operator
No referential integrity
Eventual consistency
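The `$lookup` aggregation stage performs a left outer join against another collection, using the fields `from`, `localField`, `foreignField` and `as`. The sketch below shows such a stage as a Python dictionary and emulates its basic behavior in pure Python on lists of dicts, so it runs without a database. The collection and field names (`customers`, `customer_id`, `customer`) are assumptions for illustration, and the `lookup` helper is hypothetical, not part of any driver.

```python
# Shape of the stage as it would appear in an aggregation pipeline.
lookup_stage = {
    "$lookup": {
        "from": "customers",
        "localField": "customer_id",
        "foreignField": "_id",
        "as": "customer",
    }
}


def lookup(local_docs, foreign_docs, local_field, foreign_field, as_field):
    """Emulate the basic left outer join: attach all matches as an array."""
    joined = []
    for doc in local_docs:
        matches = [
            other for other in foreign_docs
            if other.get(foreign_field) == doc.get(local_field)
        ]
        # Documents without a match are kept, with an empty array.
        joined.append({**doc, as_field: matches})
    return joined


orders = [{"_id": 100, "customer_id": 1}, {"_id": 101, "customer_id": 9}]
customers = [{"_id": 1, "name": "Ada"}]

spec = lookup_stage["$lookup"]
joined = lookup(orders, customers, spec["localField"], spec["foreignField"], spec["as"])
```

Every order survives the join. An order without a matching customer simply gets an empty `customer` array.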
Relationship modeling MongoDB Nesting
Relationship modeling MongoDB Referencing
Denormalization MongoDB
Combination of two relations into one new relation
Increases data redundancy to avoid expensive lookups (controlled redundancy)
Improves read performance by reducing the need for joins.
Reduces query complexity by consolidating related data.
Simplifies data retrieval for frequently accessed information.
Trades storage space for faster query execution.
Requires careful management to maintain data consistency.
Suitable for read-heavy workloads with infrequent updates.
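The trade-off above can be shown with plain dictionaries: copying the customer name into each order removes the lookup on reads, but every copy has to be updated on a rename. The data and the helper names `denormalize` and `rename_customer` are hypothetical.

```python
customers = {1: {"name": "Ada", "city": "Zurich"}}

# Normalized: orders only reference the customer.
orders_normalized = [
    {"order_id": 100, "customer_id": 1},
    {"order_id": 101, "customer_id": 1},
]


def denormalize(orders, customers):
    """Copy the customer name into each order (controlled redundancy)."""
    return [
        {**order, "customer_name": customers[order["customer_id"]]["name"]}
        for order in orders
    ]


orders_denormalized = denormalize(orders_normalized, customers)


def rename_customer(customer_id, new_name):
    """The cost of redundancy: one logical change touches several documents."""
    customers[customer_id]["name"] = new_name
    updated = 0
    for order in orders_denormalized:
        if order["customer_id"] == customer_id:
            order["customer_name"] = new_name
            updated += 1
    return updated
```

Reading `orders_denormalized[0]["customer_name"]` needs no second lookup, which is the read-side gain. A call such as `rename_customer(1, "Ada L.")` has to rewrite both orders, which is why this pattern suits data that rarely changes.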
MongoDB Embed vs. Reference
Considerations for Embedding Information
Data with tight relationships
One-to-few relationships
Data does not change frequently
Atomic updates
Considerations for Using References
One-to-many relationships
Data with high update frequency
Data needs to be accessed independently
Normalization
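The two modeling options can be compared side by side as document shapes. This is a sketch with invented data: addresses are embedded (one-to-few, rarely changing), orders are referenced (one-to-many, accessed independently). The field names and the `orders_for` helper are assumptions for illustration.

```python
# Embedding: the few addresses live inside the user document,
# so one read (and one atomic update) covers everything.
user_embedded = {
    "_id": "u1",
    "name": "Ada",
    "addresses": [
        {"type": "home", "city": "Zurich"},
        {"type": "work", "city": "Geneva"},
    ],
}

# Referencing: orders are separate documents pointing back to the user.
users = {"u1": {"_id": "u1", "name": "Ada"}}
orders = [
    {"_id": "o1", "user_id": "u1", "total": 30},
    {"_id": "o2", "user_id": "u1", "total": 45},
    {"_id": "o3", "user_id": "u2", "total": 12},
]


def orders_for(user_id):
    """A second query is needed to resolve the reference."""
    return [order for order in orders if order["user_id"] == user_id]


home_city = user_embedded["addresses"][0]["city"]  # no extra query
ada_orders = orders_for("u1")                      # extra query via reference
```

Orders can be updated or queried without touching the user document, at the price of an additional lookup when both are needed together.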
Migration Example RDBMS to MongoDB
Primary-Secondary Replication MongoDB
Primary-Secondary Replication – Primary down
Primary-Secondary Replication – New Primary
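The three replication stages above (normal operation, primary down, new primary) can be walked through with a toy simulation. This is a strongly simplified sketch and not MongoDB's actual election protocol: replication here is synchronous, and the new primary is simply the live node with the longest operation log, provided a majority of nodes is still alive. `Node` and `ReplicaSet` are hypothetical names.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.oplog = []  # operations applied on this node, in order
        self.alive = True


class ReplicaSet:
    def __init__(self, names):
        self.nodes = [Node(name) for name in names]
        self.primary = self.nodes[0]

    def write(self, operation):
        # All writes go to the primary and are copied to live secondaries.
        if not self.primary.alive:
            raise RuntimeError("no primary available")
        self.primary.oplog.append(operation)
        for node in self.nodes:
            if node is not self.primary and node.alive:
                node.oplog.append(operation)

    def fail_primary(self):
        self.primary.alive = False

    def elect_new_primary(self):
        # Simplified rule: a majority must be alive, most complete log wins.
        candidates = [node for node in self.nodes if node.alive]
        if len(candidates) <= len(self.nodes) // 2:
            raise RuntimeError("no majority, cannot elect a primary")
        self.primary = max(candidates, key=lambda node: len(node.oplog))
        return self.primary


replica_set = ReplicaSet(["node-a", "node-b", "node-c"])
replica_set.write("insert robot arm-7")
replica_set.fail_primary()                  # primary down
new_primary = replica_set.elect_new_primary()  # new primary
replica_set.write("insert robot cart-2")    # writes continue
```

After the failover the set keeps accepting writes, while the failed node misses everything written in the meantime and would have to catch up when it rejoins.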