Storage Cluster accommodates large numbers of nodes, which communicate with
each other to replicate and redistribute data dynamically.

## Architecture

### The Ceph Storage Cluster

At its core, Ceph provides an infinitely scalable storage cluster based on
RADOS (Reliable Autonomic Distributed Object Store), a distributed storage
service that uses the intelligence in each node to secure data and provide it
to clients. A Ceph Storage Cluster consists of four daemon types: Ceph
Monitors, which maintain the master copy of the cluster map; Ceph OSD Daemons,
which check their own state and that of other OSDs; Ceph Managers, serving as
endpoints for monitoring and orchestration; and Ceph Metadata Servers (MDS),
which manage file metadata when CephFS provides file services.

Storage cluster clients and Ceph OSD Daemons use the CRUSH (Controlled
Replication Under Scalable Hashing) algorithm to compute data
location information, avoiding bottlenecks from central lookup tables. This
algorithmic approach enables Ceph's high-level features, including a native
interface to the storage cluster via librados and numerous service interfaces
built atop it.

### Data Storage and Organization

The Ceph Storage Cluster receives data from clients through various
interfaces—Ceph Block Device, Ceph Object Storage, CephFS, or custom
implementations using librados—and stores it as RADOS objects. Each object
resides on an Object Storage Device (OSD), with Ceph OSD Daemons controlling
read, write, and replication operations. The default BlueStore backend stores
objects in a monolithic, database-like fashion within a flat namespace, meaning
objects lack hierarchical directory structures. Each object has an identifier,
binary data, and name/value pair metadata, with clients determining object data
semantics.
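
The object model above can be sketched as a small data structure. This is an
illustration only: the field names are assumptions chosen for clarity, not
BlueStore's on-disk format or the librados API.

```python
from dataclasses import dataclass, field

@dataclass
class RadosObject:
    """Simplified sketch of a RADOS object: an identifier, opaque binary
    data, and name/value metadata pairs. There is no directory hierarchy;
    the oid alone locates the object within its pool's flat namespace."""
    oid: str                    # object identifier
    data: bytes = b""           # opaque payload; its semantics belong to the client
    xattrs: dict[str, bytes] = field(default_factory=dict)  # name/value metadata

obj = RadosObject(oid="rbd_data.1234", data=b"\x00" * 16)
obj.xattrs["owner"] = b"client.admin"
```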

### Eliminating Centralization

Traditional architectures often rely on centralized components such as
gateways, brokers, or APIs that can become failure points and performance
limits. In Ceph's RADOS data path, clients interact directly with OSDs based
on CRUSH-derived placement, avoiding a centralized lookup bottleneck (the
higher-level CephFS and RGW services additionally involve MDS and RGW
daemons on their paths). OSDs create object replicas on
other nodes to ensure data safety and high availability, while monitor clusters
ensure high availability. The CRUSH algorithm replaces centralized lookup
tables, distributing placement work across all OSD daemons and the clients
that communicate with them, and uses intelligent data replication to provide
the resiliency required for hyper-scale storage.

### Cluster Map and High Availability

For proper functioning, Ceph clients and OSDs require current cluster topology
information stored in the Cluster Map, which is in fact a collection of five
maps:

- the Monitor Map: the cluster fsid, plus the position, name, address, and
  port of each monitor
- the OSD Map: the cluster fsid, pool list, replica sizes, PG numbers, and
  OSD statuses
- the PG Map: PG versions, timestamps, and placement group details
- the CRUSH Map: storage devices, the failure domain hierarchy, and rules for
  traversing the hierarchy
- the MDS Map: the MDS map epoch, metadata storage pool, and metadata server
  information

Each map maintains a history of its operational state changes, and Ceph
Monitors maintain the master copy, including cluster members, states, changes,
and overall health.

Ceph uses monitor clusters for reliability and fault tolerance. To establish
consensus about cluster state, Ceph employs the Paxos algorithm, requiring a
majority of monitors to agree (one in single-monitor clusters, two in
three-monitor clusters, three in five-monitor clusters, and so forth). This
prevents issues when monitors fall behind due to latency or faults.
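
The majority sizes listed above follow a simple formula, floor(n/2) + 1,
which can be sketched as:

```python
def monitor_quorum(n_monitors: int) -> int:
    """Smallest number of monitors that constitutes a Paxos majority:
    floor(n/2) + 1."""
    if n_monitors < 1:
        raise ValueError("a cluster needs at least one monitor")
    return n_monitors // 2 + 1

# Matches the examples in the text: 1-of-1, 2-of-3, 3-of-5.
```

This is also why production clusters run an odd number of monitors: adding a
fourth monitor to a three-monitor cluster raises the quorum requirement
without improving fault tolerance.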

### Authentication and Security

The cephx authentication system authenticates users and daemons while
protecting against man-in-the-middle attacks, though it doesn't address
transport encryption or encryption at rest. Using shared secret keys, cephx
enables mutual authentication without revealing the keys themselves. The
protocol operates much like Kerberos, except that each monitor can
authenticate users and distribute keys, so there is no single point of
failure. The system issues session keys encrypted with users' permanent
secret keys, which clients use to request services. Monitors provide tickets
authenticating clients against OSDs handling data, with monitors and OSDs
sharing secrets enabling ticket use across any cluster OSD or metadata server.
Tickets expire, so an attacker cannot reuse obtained credentials; the scheme
protects against message forgery and alteration as long as the secret keys
remain secure before expiration.
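
The ticket mechanics can be illustrated with a deliberately simplified
sketch. This is not the real cephx protocol or wire format; it only
demonstrates the two ideas the text describes: a shared secret that any
monitor or OSD can verify against, and tickets that expire.

```python
import hashlib
import hmac
import json
import time

def issue_ticket(shared_secret: bytes, client: str, ttl: float = 60.0) -> dict:
    """Monitor side: mint a ticket naming the client, with an expiry.
    The MAC proves it was issued by a holder of the shared secret."""
    ticket = {"client": client, "expires": time.time() + ttl}
    payload = json.dumps(ticket, sort_keys=True).encode()
    ticket["mac"] = hmac.new(shared_secret, payload, hashlib.sha256).hexdigest()
    return ticket

def verify_ticket(shared_secret: bytes, ticket: dict) -> bool:
    """OSD side: any daemon holding the same secret can verify the ticket,
    and an expired ticket is rejected even if its MAC is valid."""
    payload = json.dumps(
        {"client": ticket["client"], "expires": ticket["expires"]},
        sort_keys=True).encode()
    expected = hmac.new(shared_secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, ticket["mac"]) \
        and time.time() < ticket["expires"]

secret = b"shared-between-monitors-and-osds"   # illustrative value
ticket = issue_ticket(secret, "client.admin")
```

Any tampering (for example, rewriting the client name) invalidates the MAC,
which is the forgery protection the paragraph above refers to.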

### Smart Daemons and Hyperscale

Ceph's architecture makes OSD Daemons and clients cluster-aware, unlike
centralized storage clusters requiring double dispatches that bottleneck at
petabyte-to-exabyte scale. Each Ceph OSD Daemon knows other OSDs in the
cluster, enabling direct interaction with other OSDs and monitors. This
awareness allows clients to interact directly with OSDs, and because monitors
and OSD daemons interact directly, OSDs leverage aggregate cluster CPU and RAM
resources.

This distributed intelligence provides several benefits:

- OSDs service clients directly, improving performance by avoiding the
  connection limits of a centralized interface.
- OSDs report their membership and status (up or down), and neighboring OSDs
  detect and report failures.
- Data scrubbing maintains consistency by comparing object metadata across
  replicas, while deeper scrubbing compares data bit-for-bit against
  checksums to find bad drive sectors.
- Replication is a client-OSD collaboration: clients use CRUSH to map objects
  to pools and placement groups and determine their locations, then write to
  the primary OSD, which replicates to the secondary OSDs.
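
The client-side placement and primary-copy replication described above can be
sketched as follows. Rendezvous hashing stands in for CRUSH here (an
assumption made for brevity; real CRUSH is hierarchical and failure-domain
aware), and an in-memory dict stands in for the OSDs.

```python
import hashlib

def acting_set(obj_name: str, osds: list[str], size: int = 3) -> list[str]:
    """Rank OSDs by a per-object hash and keep the top `size`; index 0 is
    the primary. Any client computes the same set from the same inputs."""
    ranked = sorted(
        osds,
        key=lambda o: hashlib.sha256(f"{obj_name}/{o}".encode()).digest(),
        reverse=True)
    return ranked[:size]

def client_write(obj_name: str, data: bytes, osds: list[str], store: dict) -> None:
    """The client writes only to the primary; the primary then replicates
    to the secondaries before the write is considered safe."""
    primary, *secondaries = acting_set(obj_name, osds)
    store.setdefault(primary, {})[obj_name] = data   # primary write
    for osd in secondaries:                          # fan-out replication
        store.setdefault(osd, {})[obj_name] = data

cluster: dict = {}
client_write("obj-1", b"payload", [f"osd.{i}" for i in range(6)], cluster)
```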

### Dynamic Cluster Management

Pools are logical partitions for storing objects, with clients retrieving
cluster maps from monitors and writing RADOS objects to pools. CRUSH
dynamically maps placement groups (PGs) to OSDs, with clients storing objects
by having CRUSH map each RADOS object to a PG. This abstraction layer between
OSDs and clients enables adaptive cluster growth, shrinkage, and data
redistribution when topology changes. The indirection allows dynamic
rebalancing when new OSDs come online.
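
The effect of this indirection can be demonstrated with a rendezvous-hashing
stand-in for CRUSH (an illustrative assumption, not Ceph's actual algorithm):
when an OSD is added, only the placement groups the new OSD "wins" are
remapped, while all others stay where they were.

```python
import hashlib

def pg_to_osd(pg: int, osds: list[str]) -> str:
    """Each PG independently picks the OSD with the highest hash score
    (rendezvous hashing); no central table records the result."""
    return max(osds, key=lambda o: hashlib.sha256(f"{pg}:{o}".encode()).digest())

osds = [f"osd.{i}" for i in range(4)]
before = {pg: pg_to_osd(pg, osds) for pg in range(128)}
after = {pg: pg_to_osd(pg, osds + ["osd.4"]) for pg in range(128)}

# Only PGs whose top score now belongs to osd.4 move; the rest stay put, so
# roughly 1/5 of the placement groups rebalance onto the new OSD.
moved = [pg for pg in before if before[pg] != after[pg]]
```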

Clients compute object locations rather than querying a central service,
needing only the object ID and the pool name. Ceph hashes the object ID,
takes the hash modulo the pool's number of PGs, retrieves the pool ID from
the pool name, and prepends the pool ID to the PG ID. This computation is
faster than a query over a session: CRUSH lets clients compute where objects
should be stored and contact the primary OSD directly for storage or
retrieval.
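
A minimal sketch of that computation, with zlib's crc32 standing in for
Ceph's rjenkins hash (and ignoring Ceph's "stable mod" refinement):

```python
import zlib

def placement_group_id(object_id: str, pool_id: int, pg_num: int) -> str:
    """Compute a pool.pg identifier for an object: hash the object ID, take
    it modulo the pool's PG count, and prepend the pool ID. (crc32 is an
    illustrative stand-in for Ceph's actual hash function.)"""
    pg = zlib.crc32(object_id.encode()) % pg_num
    return f"{pool_id}.{pg:x}"   # hex PG number in the familiar pool.pg form

# Only the object ID, the pool's ID, and its pg_num are needed: no lookup
# table and no round trip to a central server.
pg_id = placement_group_id("rbd_data.1234", 3, 64)
```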

### Client Interfaces

Ceph provides three client types: Ceph Block Device (RBD) offers resizable,
thin-provisioned, snapshottable block devices striped across clusters for high
performance; Ceph Object Storage (RGW) provides RESTful APIs compatible with
Amazon S3 and OpenStack Swift; and CephFS provides POSIX-compliant filesystems
that can be mounted via the kernel client or via FUSE. Modern applications
access storage through
librados, which provides direct parallel cluster access supporting pool
operations, snapshots, copy-on-write cloning, object read/write operations,
extended attributes, key/value pairs, and object classes.

The architecture demonstrates how Ceph's distributed, intelligent design
eliminates traditional storage limitations, enabling massive scalability while
maintaining reliability and performance through algorithmic data placement,
autonomous daemon operations, and direct client-storage interactions.

## See Also
The architecture of the Ceph cluster is explained in [the Architecture
chapter of the upstream Ceph
documentation](https://docs.ceph.com/en/latest/architecture/).