generated from cobaltcore-dev/repository-template
doc: add architecture information to ceph.md #12
Merged
senolcolak
merged 1 commit into
cobaltcore-dev:main
from
zdover23:docs-2025-03-23-ceph-architecture
Mar 26, 2026
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -13,6 +13,137 @@ Storage Cluster accommodates large numbers of nodes, which communicate with | |
| each other to replicate and redistribute data dynamically. | ||
|
|
||
| ## Architecture | ||
|
|
||
| ### The Ceph Storage Cluster | ||
|
|
||
| At its core, Ceph provides an infinitely scalable storage cluster based on | ||
| RADOS (Reliable Autonomic Distributed Object Store), a distributed storage | ||
| service that uses the intelligence in each node to secure data and provide it | ||
| to clients. A Ceph Storage Cluster consists of four daemon types: Ceph | ||
| Monitors, which maintain the master copy of the cluster map; Ceph OSD Daemons, | ||
| which check their own state and that of other OSDs; Ceph Managers, serving as | ||
| endpoints for monitoring and orchestration; and Ceph Metadata Servers (MDS), | ||
| which manage file metadata when CephFS provides file services. | ||
|
|
||
| Storage cluster clients and Ceph OSD Daemons use the CRUSH (Controlled | ||
| Replication Under Scalable Hashing) algorithm to compute data | ||
| location information, avoiding bottlenecks from central lookup tables. This | ||
| algorithmic approach enables Ceph's high-level features, including a native | ||
| interface to the storage cluster via librados and numerous service interfaces | ||
| built atop it. | ||
|
|
||
| ### Data Storage and Organization | ||
|
|
||
| The Ceph Storage Cluster receives data from clients through various | ||
| interfaces—Ceph Block Device, Ceph Object Storage, CephFS, or custom | ||
| implementations using librados—and stores it as RADOS objects. Each object | ||
| resides on an Object Storage Device (OSD), with Ceph OSD Daemons controlling | ||
| read, write, and replication operations. The default BlueStore backend stores | ||
| objects in a monolithic, database-like fashion within a flat namespace, meaning | ||
| objects lack hierarchical directory structures. Each object has an identifier, | ||
| binary data, and name/value pair metadata, with clients determining object data | ||
| semantics. | ||
|
|
||
| ### Eliminating Centralization | ||
|
|
||
| Traditional architectures rely on centralized components—gateways, brokers, or | ||
| APIs—that act as single points of entry, creating failure points and | ||
| performance limits. Ceph eliminates these centralized components, enabling | ||
| clients to interact directly with Ceph OSDs. OSDs create object replicas on | ||
| other nodes to ensure data safety and high availability, while running | ||
| monitors as a cluster removes the monitor as a single point of failure. The CRUSH algorithm replaces centralized lookup | ||
| tables, providing better data management by distributing work across all OSD | ||
| daemons and communicating clients, using intelligent data replication to ensure | ||
| resiliency suitable for hyper-scale storage. | ||
|
|
||
| ### Cluster Map and High Availability | ||
|
|
||
| For proper functioning, Ceph clients and OSDs require current cluster topology | ||
| information stored in the Cluster Map, actually a collection of five maps: the | ||
| Monitor Map (containing cluster fsid, monitor positions, names, addresses, and | ||
| ports), the OSD Map (containing cluster fsid, pool lists, replica sizes, PG | ||
| numbers, and OSD statuses), the PG Map (containing PG versions, timestamps, and | ||
| placement group details), the CRUSH Map (containing storage devices, failure | ||
| domain hierarchy, and traversal rules), and the MDS Map (containing MDS map | ||
| epoch, metadata storage pool, and metadata server information). Each map | ||
| maintains operational state change history, with Ceph Monitors maintaining | ||
| master copies including cluster members, states, changes, and overall health. | ||
|
|
||
| Ceph uses monitor clusters for reliability and fault tolerance. To establish | ||
| consensus about cluster state, Ceph employs the Paxos algorithm, requiring a | ||
| majority of monitors to agree (one in single-monitor clusters, two in | ||
| three-monitor clusters, three in five-monitor clusters, and so forth). This | ||
| prevents issues when monitors fall behind due to latency or faults. | ||
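
The majority rule in the paragraph above reduces to a one-line calculation. The following is a minimal sketch of that arithmetic only, not of Paxos itself; the function names `quorum_size` and `tolerated_failures` are hypothetical and chosen for illustration.

```python
def quorum_size(monitor_count: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    if monitor_count < 1:
        raise ValueError("a cluster needs at least one monitor")
    return monitor_count // 2 + 1


def tolerated_failures(monitor_count: int) -> int:
    """Monitors that can be lost while a majority can still form."""
    return monitor_count - quorum_size(monitor_count)


# Print monitor count, required majority, and tolerated failures.
for n in (1, 3, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Under this rule an even monitor count adds no extra fault tolerance: four monitors tolerate one failure, the same as three, which is one reason odd counts are the usual choice.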
|
|
||
| ### Authentication and Security | ||
|
|
||
| The cephx authentication system authenticates users and daemons while | ||
| protecting against man-in-the-middle attacks, though it doesn't address | ||
| transport encryption or encryption at rest. Using shared secret keys, cephx | ||
| enables mutual authentication without revealing keys. It works much like | ||
| Kerberos, except that every monitor can authenticate users and distribute | ||
| keys, eliminating single points of failure. The system issues session keys | ||
| encrypted with users' permanent secret keys, which clients use to request | ||
| services. Monitors provide tickets that authenticate clients to the OSDs | ||
| handling data; because monitors and OSDs share a secret, a ticket works with | ||
| any OSD or metadata server in the cluster. Tickets expire, so an attacker | ||
| cannot reuse a ticket or session key obtained earlier. This protects against | ||
| message forgery and alteration, provided the user's secret key is not | ||
| divulged before it expires. | ||
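
Two ideas from the paragraph above, proving knowledge of a shared secret without sending it and tickets that stop working after they expire, can be illustrated with a generic HMAC challenge-response. This is a sketch of the general technique under stated assumptions, not the cephx wire protocol: the ticket layout, the function names, and the use of SHA-256 are all hypothetical.

```python
import hashlib
import hmac
import os


def prove(secret: bytes, challenge: bytes) -> bytes:
    """Answer a challenge with a keyed digest; the secret itself is never sent."""
    return hmac.new(secret, challenge, hashlib.sha256).digest()


def verify(secret: bytes, challenge: bytes, proof: bytes) -> bool:
    """Check a proof in constant time against the locally held secret."""
    return hmac.compare_digest(prove(secret, challenge), proof)


def issue_ticket(service_secret: bytes, user: bytes, expires_at: int):
    """Hypothetical ticket: a payload plus a MAC that only services can check."""
    payload = user + b"|" + str(expires_at).encode()
    return payload, prove(service_secret, payload)


def ticket_valid(service_secret: bytes, payload: bytes, mac: bytes, now: int) -> bool:
    """Accept a ticket only if its MAC matches and it has not expired."""
    if not verify(service_secret, payload, mac):
        return False
    expires_at = int(payload.rsplit(b"|", 1)[1])
    return now < expires_at


user_secret = os.urandom(32)      # shared by the user and the monitors
service_secret = os.urandom(32)   # shared by the monitors and the OSDs

# Each side challenges the other, so authentication is mutual.
challenge = os.urandom(16)
print(verify(user_secret, challenge, prove(user_secret, challenge)))

payload, mac = issue_ticket(service_secret, b"client.example", expires_at=1000)
print(ticket_valid(service_secret, payload, mac, now=999))
print(ticket_valid(service_secret, payload, mac, now=1000))
```

Because any service holding `service_secret` can check the MAC, the same ticket is usable across all of them, which mirrors the "any cluster OSD or metadata server" property described above.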
|
|
||
| ### Smart Daemons and Hyperscale | ||
|
|
||
| Ceph's architecture makes OSD Daemons and clients cluster-aware, unlike | ||
| centralized storage clusters requiring double dispatches that bottleneck at | ||
| petabyte-to-exabyte scale. Each Ceph OSD Daemon knows other OSDs in the | ||
| cluster, enabling direct interaction with other OSDs and monitors. This | ||
| awareness allows clients to interact directly with OSDs, and because the | ||
| daemons coordinate among themselves, work that would overload a central | ||
| server is spread across the aggregate CPU and RAM of the cluster's nodes. | ||
|
|
||
| This distributed intelligence provides several benefits: OSDs service clients | ||
| directly, improving performance by avoiding centralized interface connection | ||
| limits; OSDs report membership and status (up or down), with neighboring OSDs | ||
| detecting and reporting failures; data scrubbing maintains consistency by | ||
| comparing object metadata across replicas, with deeper scrubbing comparing data | ||
| bit-for-bit against checksums to find bad drive sectors; and replication | ||
| involves client-OSD collaboration, with clients using CRUSH to determine object | ||
| locations, mapping objects to pools and placement groups, then writing to | ||
| primary OSDs that replicate to secondary OSDs. | ||
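
The difference between the two scrubbing depths described above can be sketched as follows. This is an illustrative model, not Ceph's scrub implementation: the `Replica` class is hypothetical, object size stands in for "metadata", SHA-256 stands in for whatever checksum the OSD actually uses, and the majority value is assumed to be the correct one.

```python
import hashlib
from collections import Counter
from dataclasses import dataclass


@dataclass
class Replica:
    osd_id: int
    size: int    # recorded object metadata
    data: bytes  # object payload as stored on this OSD


def _outliers(replicas, key):
    """Return OSD ids whose key(replica) differs from the most common value."""
    values = [key(r) for r in replicas]
    expected, _ = Counter(values).most_common(1)[0]
    return sorted(r.osd_id for r, v in zip(replicas, values) if v != expected)


def light_scrub(replicas):
    """Compare metadata only (cheap)."""
    return _outliers(replicas, lambda r: r.size)


def deep_scrub(replicas):
    """Compare a checksum of the full payload (expensive, catches bit rot)."""
    return _outliers(replicas, lambda r: hashlib.sha256(r.data).hexdigest())


replicas = [
    Replica(osd_id=0, size=4, data=b"ceph"),
    Replica(osd_id=1, size=4, data=b"ceph"),
    Replica(osd_id=2, size=4, data=b"cepj"),  # same size, corrupted payload
]
print("light scrub flags:", light_scrub(replicas))
print("deep scrub flags:", deep_scrub(replicas))
```

The corrupted replica keeps the same size, so only the deep pass can notice it, which is why the more expensive comparison is still needed periodically.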
|
|
||
| ### Dynamic Cluster Management | ||
|
|
||
| Pools are logical partitions for storing objects, with clients retrieving | ||
| cluster maps from monitors and writing RADOS objects to pools. CRUSH | ||
| dynamically maps placement groups (PGs) to OSDs, with clients storing objects | ||
| by having CRUSH map each RADOS object to a PG. This abstraction layer between | ||
| OSDs and clients enables adaptive cluster growth, shrinkage, and data | ||
| redistribution when topology changes. The indirection allows dynamic | ||
| rebalancing when new OSDs come online. | ||
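
To see why computed placement allows rebalancing without a lookup table, the sketch below uses rendezvous (highest-random-weight) hashing as a stand-in. It is not CRUSH: it ignores weights, failure domains, and rules, and the names `place` and `_score` are hypothetical. It only shows that every participant can compute the same PG-to-OSD mapping, and that adding an OSD moves only the PGs the new OSD wins.

```python
import hashlib


def _score(pg_id: str, osd: str) -> int:
    """Deterministic pseudo-random score for a (PG, OSD) pair."""
    digest = hashlib.sha256(f"{pg_id}/{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(pg_id, osds, replicas=3):
    """Pick `replicas` OSDs for a PG by highest score; the first acts as primary."""
    ranked = sorted(osds, key=lambda osd: _score(pg_id, osd), reverse=True)
    return ranked[:replicas]


osds = [f"osd.{i}" for i in range(6)]
pgs = [f"1.{i:x}" for i in range(64)]

before = {pg: place(pg, osds) for pg in pgs}
after = {pg: place(pg, osds + ["osd.6"]) for pg in pgs}  # a new OSD comes online

moved = [pg for pg in pgs if before[pg] != after[pg]]
print(f"{len(moved)} of {len(pgs)} PGs changed placement")
```

In this model a PG's placement changes only when the new OSD enters its top three, so the rest of the data stays where it is.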
|
|
||
| Clients compute object locations rather than querying, requiring only object ID | ||
| and pool name. Ceph hashes the object ID, takes the hash modulo the number of | ||
| PGs in the pool to get the PG ID, looks up the pool ID from the pool name, and | ||
| prepends the pool ID to the PG ID. This computation is faster than a query | ||
| session, with CRUSH enabling clients to compute expected object locations and | ||
| contact primary OSDs for storage or retrieval. | ||
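
The steps in the paragraph above can be sketched in a few lines. This is an approximation under stated assumptions: `zlib.crc32` is a stand-in for Ceph's own hash function, plain modulo stands in for Ceph's exact reduction step, and the pool table, pool name, object name, and `locate` function are all hypothetical.

```python
import zlib

# Hypothetical pool table: name -> (pool ID, number of PGs in the pool).
POOLS = {"liverpool": (4, 128)}


def locate(pool_name: str, object_id: str) -> str:
    """Return a '<pool ID>.<PG ID>' identifier for an object."""
    pool_id, pg_num = POOLS[pool_name]              # pool name -> pool ID
    object_hash = zlib.crc32(object_id.encode())    # stand-in hash, not Ceph's
    pg = object_hash % pg_num                       # hash modulo number of PGs
    return f"{pool_id}.{pg:x}"                      # prepend pool ID to PG ID


print(locate("liverpool", "john"))
```

No network round trip is involved: given the same pool table, every client and OSD derives the same identifier, and CRUSH then maps that PG to its OSDs.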
|
|
||
| ### Client Interfaces | ||
|
|
||
| Ceph provides three client types: Ceph Block Device (RBD) offers resizable, | ||
| thin-provisioned, snapshottable block devices striped across clusters for high | ||
| performance; Ceph Object Storage (RGW) provides RESTful APIs compatible with | ||
| Amazon S3 and OpenStack Swift; and CephFS provides POSIX-compliant filesystems | ||
| mountable as kernel objects or FUSE. Modern applications access storage through | ||
| librados, which provides direct parallel cluster access supporting pool | ||
|
Comment on lines +136 to +137

Fix CephFS mount terminology.

“mountable as kernel objects” on Line 136 is inaccurate phrasing. CephFS is typically mounted via the kernel client or via FUSE.

Suggested wording update:

-Amazon S3 and OpenStack Swift; and CephFS provides POSIX-compliant filesystems
-mountable as kernel objects or FUSE.
+Amazon S3 and OpenStack Swift; and CephFS provides POSIX-compliant filesystems
+that can be mounted via the kernel client or via FUSE.
||
| operations, snapshots, copy-on-write cloning, object read/write operations, | ||
| extended attributes, key/value pairs, and object classes. | ||
|
|
||
| The architecture demonstrates how Ceph's distributed, intelligent design | ||
| eliminates traditional storage limitations, enabling massive scalability while | ||
| maintaining reliability and performance through algorithmic data placement, | ||
| autonomous daemon operations, and direct client-storage interactions. | ||
|
|
||
| ## See Also | ||
| The architecture of the Ceph cluster is explained in [the Architecture | ||
| chapter of the upstream Ceph | ||
| documentation](https://docs.ceph.com/en/latest/architecture/) | ||
Clarify direct client↔OSD behavior scope.
The statement on Line 51-Line 53 is too broad. Direct client-to-OSD applies to RADOS/librados paths, but CephFS and RGW involve MDS/RGW components. Please narrow the wording to avoid architectural confusion.