
docs: add JEP-0014 Virtual Scalable Exporters proposal #744

Open
mangelajo wants to merge 4 commits into main from jep-0014-virtual-scalable-exporters


@mangelajo

Member

Summary

  • Adds JEP-0014: Virtual Scalable Exporters, proposing a controller-managed pool of virtual target instances with configurable autoscaling via per-provider CRDs (QEMUExporterPool, AndroidExporterPool, etc.)
  • Updates the JEP index in README.md to include JEP-0014

Test plan

  • Verify the document renders correctly in Sphinx docs
  • Review JEP content for completeness and accuracy
  • Confirm all template-required sections are present (Abstract, Motivation, Proposal, Design Decisions, Design Details, Test Plan, Acceptance Criteria, Backward Compatibility, Consequences, Rejected Alternatives)

Made with Cursor

@coderabbitai

coderabbitai Bot commented Jun 3, 2026

Contributor


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f0806445-50e1-4c09-843d-f956d62a9aa2

📥 Commits

Reviewing files that changed from the base of the PR and between b397149 and 0c8b743.

📒 Files selected for processing (1)
  • python/docs/source/contributing/jeps/index.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • python/docs/source/contributing/jeps/index.md

📝 Walkthrough

Adds JEP-0014 documentation proposing controller-managed, provider-specific autoscaling exporter pools with warm-pool leasing, Exporter.spec.enabled for graceful scale-down, reconciliation pseudocode, tests/acceptance criteria, phased implementation plan, and registers the JEP in the docs index.

Changes

JEP-0014 Virtual Scalable Exporters

| Layer / File(s) | Summary |
|---|---|
| Problem Statement and Core Proposal (`python/docs/source/contributing/jeps/JEP-0014-virtual-scalable-exporters.md`) | JEP header, abstract, motivation, user stories; introduces managed *ExporterPool CRDs, warm-pool leasing semantics, and example provider manifests. |
| Architecture, Deployment, and API (`python/docs/source/contributing/jeps/JEP-0014-virtual-scalable-exporters.md`) | Controller and pool-controller architecture, per-provider Deployment model (shared binary + --provider), scaling inputs (watching Leases + Exporters), instance lifecycle, hardware/compatibility notes, and Exporter.spec.enabled for coordinated scale-down. |
| Design Decisions and Reconciliation Logic (`python/docs/source/contributing/jeps/JEP-0014-virtual-scalable-exporters.md`) | Design rationale (pool-based scaling, rejection of per-lease parameters), reconciliation pseudocode, invariants, instance state model, component interactions, and failure modes. |
| Testing, Acceptance, and Backward Compatibility (`python/docs/source/contributing/jeps/JEP-0014-virtual-scalable-exporters.md`) | Test plan, acceptance and graduation criteria, and backward compatibility expectations (including Exporter.enabled defaulting). |
| Consequences, Risks, and Future Work (`python/docs/source/contributing/jeps/JEP-0014-virtual-scalable-exporters.md`) | Consequences, identified risks, rejected alternatives, prior art, unresolved/resolved questions, and future provider extensions outside the JEP scope. |
| Implementation Plan and References (`python/docs/source/contributing/jeps/JEP-0014-virtual-scalable-exporters.md`) | Phased implementation roadmap (Phase 1: Exporter.enabled; Phase 2: pool controller + QEMUExporterPool; Phase 3: more providers), implementation history, references, and license. |
| JEP Documentation Index Update (`python/docs/source/contributing/jeps/index.md`) | Registers JEP-0014 in Standards Track table (Draft) and adds the JEP to the JEP toctree. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

documentation

Suggested reviewers

  • kirkbrauer
  • bennyz
  • bkhizgiy
  • maboras-rh
  • raballew

Poem

🐰 I hopped through a JEP at break of day,
Pools of exporters lined the way,
Controllers hum and leases sing,
Warm instances ready for testing,
Docs tucked neat — the rabbit hops away.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: adding JEP-0014 documentation for Virtual Scalable Exporters, which aligns with the +920 lines added to the JEP document and index updates. |
| Description check | ✅ Passed | The description is directly related to the changeset, detailing the addition of JEP-0014 Virtual Scalable Exporters and index updates with a clear test plan. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Propose a Virtual Scalable Exporter subsystem for Jumpstarter that
manages pools of virtual targets with configurable autoscaling via
per-provider CRDs (QEMUExporterPool, AndroidExporterPool, etc.).

Co-authored-by: Cursor <cursoragent@cursor.com>
@mangelajo mangelajo force-pushed the jep-0014-virtual-scalable-exporters branch from 8928e29 to 08d970b Compare June 3, 2026 17:03

@coderabbitai coderabbitai Bot left a comment

Contributor

🧹 Nitpick comments (1)
python/packages/jumpstarter-driver-opendal/jumpstarter_driver_opendal/client.py (1)

82-84: 💤 Low value

Optional: consolidate the duplicated HTTP-URL detection. The same 3-line original_url block is now copy-pasted in write_from_path, _flash_single, and StorageMuxFlasherClient.flash. A tiny helper keeps the three sites from drifting.

♻️ Proposed helper
+def _http_original_url(path: PathBuf) -> str | None:
+    """Return the path as an HTTP(S) original_url, else None."""
+    if isinstance(path, str) and path.startswith(("http://", "https://")):
+        return path
+    return None

Then at each call site:

-        original_url = None
-        if isinstance(path, str) and path.startswith(("http://", "https://")):
-            original_url = path
+        original_url = _http_original_url(path)
         if operator is None:
             path, operator, _ = operator_for_path(path)

Also applies to: 636-638, 774-776

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@python/packages/jumpstarter-driver-opendal/jumpstarter_driver_opendal/client.py`
around lines 82 - 84, The HTTP-URL detection logic (setting original_url when
path is a string starting with "http://" or "https://") is duplicated in
write_from_path, _flash_single, and StorageMuxFlasherClient.flash; extract this
into a small helper function (e.g., is_http_url or extract_original_url) and
replace the three copy-pasted blocks with calls to that helper. Ensure the
helper accepts the same path argument(s) and returns the original_url (or None)
so callers in write_from_path, _flash_single, and StorageMuxFlasherClient.flash
keep the existing behavior without duplicated code.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1246f919-4b05-47a5-bbbd-68370cf50adc

📥 Commits

Reviewing files that changed from the base of the PR and between e654084 and 8928e29.

📒 Files selected for processing (4)
  • python/docs/source/internal/jeps/JEP-0014-virtual-scalable-exporters.md
  • python/docs/source/internal/jeps/README.md
  • python/packages/jumpstarter-driver-opendal/jumpstarter_driver_opendal/client.py
  • python/packages/jumpstarter-driver-opendal/jumpstarter_driver_opendal/driver_test.py

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/docs/source/contributing/jeps/JEP-0014-virtual-scalable-exporters.md`:
- Around line 647-653: Replace the placeholder "<!-- TODO: Detail specific test
cases -->" in the "Test Plan" section with concrete, verifiable test cases:
enumerate unit tests (functionality scenarios, inputs, expected outputs) for
exporter creation and scaling code paths (e.g., single-exporter, multi-exporter,
failure/retry), integration tests covering end-to-end export flows and
compatibility boundaries (formats, destinations, auth), performance/load tests
with target metrics (throughput, latency, resource usage) and pass/fail
thresholds, and regression/upgrade tests to assert behavior across version
changes; reference the "Test Plan" header and ensure each case includes scope,
steps, expected outcome, and acceptance criteria so reviewers can reproduce and
validate.
- Line 1: This JEP file is not included in any Sphinx toctree; open
python/docs/source/contributing/jeps/index.md and add an entry for
JEP-0014-virtual-scalable-exporters.md (exact filename) to the toctree so the
page is discovered by Sphinx; ensure the relative path and filename match the
JEP file and rebuild docs to confirm warnings are cleared.
- Around line 243-279: The fenced code blocks containing ASCII diagrams and
sequences are untyped and trigger MD040; update each triple-backtick fence
around the diagrams/blocks (e.g., the large Kubernetes ASCII diagram block and
the smaller sequence blocks like "for each *ExporterPool CR:" and "Provisioning
→ Ready...") by adding an explicit language identifier such as text (```text) to
the opening fence so the markdown linter stops flagging them; ensure you replace
the untyped ``` with ```text for all occurrences noted in the review (including
the blocks around lines referenced in the comment).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 35aa8d0c-1ccf-4eb6-9673-1944fc3dbb0f

📥 Commits

Reviewing files that changed from the base of the PR and between 8928e29 and 08d970b.

📒 Files selected for processing (2)
  • python/docs/source/contributing/jeps/JEP-0014-virtual-scalable-exporters.md
  • python/docs/source/contributing/jeps/index.md
✅ Files skipped from review due to trivial changes (1)
  • python/docs/source/contributing/jeps/index.md

Comment thread python/docs/source/contributing/jeps/JEP-0014-virtual-scalable-exporters.md Outdated
iterate quickly without waiting for scarce hardware.

- **As a** platform engineer, **I want to** declare a virtual target pool with
`minInstances: 2, maxInstances: 20`, **so that** there are always warm

Member

Shouldn't it be minWarmInstances and maxTotalInstances then?

Member Author

Done. Renamed throughout the document: minInstances → minWarmInstances and maxInstances → maxTotalInstances. CRD spec comments updated to reflect the semantics.

This comment was generated from a Cursor session.

The guiding principle is: **"Get me a target that matches my requirements."** The
distinction between physical and virtual is an implementation detail, not a
primary concern for the user. Virtual exporters simply appear in the same pool
as physical ones, differentiated only by labels.

Member

Yeah, that's a reasonable design decision. This would allow you to easily switch between virtual and physical by merely adding a label.

Each pool controller watches two key resources to make scaling decisions:

1. **Leases** — The controller watches for pending Leases whose label selectors
match the pool's labels. Pending leases with no available exporter signal

Member

what about scheduled leases?

Member Author

+1 we need to ignore them until it's time to make them effective.

Member Author

Addressed — moved "Scheduled leases" from Unresolved to Resolved Questions. The controller already supports Spec.BeginTime on Lease CRs; the pool controller simply ignores leases whose BeginTime is in the future when counting demand, and only pre-provisions instances as the scheduled time approaches.

This comment was generated from a Cursor session.
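
A minimal sketch of the demand-counting rule described in this reply, assuming a simplified `Lease` record. The `Lease` dataclass, `count_demand`, and the `lead_time` parameter are hypothetical names for illustration; only the idea of skipping leases whose `BeginTime` lies in the future comes from the thread.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Lease:
    """Hypothetical, simplified view of a Lease CR (not the real schema)."""

    name: str
    pending: bool
    begin_time: datetime | None = None  # mirrors Spec.BeginTime; None means "start now"


def count_demand(
    leases: list[Lease],
    now: datetime,
    lead_time: timedelta = timedelta(0),
) -> int:
    """Count pending leases that are due now or within lead_time.

    Leases scheduled further in the future are ignored, so they do not
    trigger scale-up until their start time approaches.
    """
    horizon = now + lead_time
    return sum(
        1
        for lease in leases
        if lease.pending
        and (lease.begin_time is None or lease.begin_time <= horizon)
    )
```

A non-zero `lead_time` is one way to pre-provision shortly before a scheduled lease begins; how far ahead to look is a tuning choice this sketch leaves open.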

Comment on lines +296 to +303
**Per-Provider Deployments (single image by default):** All provider
controllers are compiled into a single binary. Each Deployment in the cluster
passes a `--provider=<type>` flag to activate the corresponding reconciler.
This gives each provider isolated logs and independent restarts while
maintaining a single image to build and release. The per-provider `image`
override in the operator CR allows administrators to substitute a custom image
for a specific provider (e.g., a third-party provider distributed as its own
image) without affecting other providers.

Member

+1

for scalable testing. However:

- Virtual targets must faithfully emulate the interfaces exposed by physical
hardware (serial, network, storage, power) through the existing driver model.

Member

+inf

Member Author

hehe the LLM was a little bit creative in this, I think it needs some work

Member Author

ok, so I am proposing "0" for no limits... (at your own risk) :D

Member Author

Done — maxTotalInstances is now optional; omitting it or setting it to 0 means no upper limit.

This comment was generated from a Cursor session.
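
To make the "0 or omitted means unlimited" semantics concrete, here is a small sketch. The function name and the formula (leased + pending demand + warm buffer) are assumptions for illustration; the JEP's reconciliation pseudocode remains the authoritative definition.

```python
from __future__ import annotations


def desired_total(
    leased: int,
    demand: int,
    min_warm: int,
    max_total: int | None = None,
) -> int:
    """Target instance count for a pool.

    Assumed formula: instances currently leased, plus pending demand,
    plus the warm buffer (minWarmInstances), capped by maxTotalInstances.
    A cap of None or 0 is treated as "no upper limit".
    """
    wanted = leased + demand + min_warm
    if not max_total:  # None or 0: unlimited
        return wanted
    return min(wanted, max_total)
```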

Member

I am not sure what you are referring to here. The comment does not seem related to my support for proper interfaces.

Member

@raballew I think if we can use virtio here, we might be able to get very high fidelity for guests that run natively on virtio targets and we can avoid privileged Pods in this case :)

**Decision:** Pool-based with configurable min/max.

**Rationale:** Purely on-demand provisioning introduces unacceptable latency for
CI pipelines (VM boot + exporter registration can take 30-120s). A warm pool

Member

why does it take 2min?

@mangelajo mangelajo Jun 3, 2026

Member Author

  1. create exporter
  2. create the pod, node is assigned...
  3. the image is downloaded
  4. the exporter boots
  5. connects back and becomes ready.

Maybe it's closer to 10-15 seconds.

But it could be driver dependent, e.g. renode takes some time to initialize.

But it really doesn't make much difference: with the current design you can choose to set minInstances to 0 and ... get 0 warm instances, or more if you want any :) It becomes an admin decision and it doesn't really change the underlying code design much.

Member Author

Updated DD-1 — the cold-start estimate is now 10–60 seconds (not 2 minutes). The breakdown: image pull + VM boot + exporter registration. The previous number was too generous.

This comment was generated from a Cursor session.


```
Provisioning → Ready (warm pool) → Leased → Ready
└→ Terminating → (deleted if available instances>min)
```

Member

Does that mean we are reusing virtual instances (even just the exporter)?

Member Author

Addressed — clarified instance reuse in the Component Interaction section. Added a recycleStrategy field to the common pool spec with two options:

  • ExitAndReplace (default): the exporter process exits after lease release; the Deployment/ReplicaSet respawns a fresh Pod.
  • InPlaceReuse: the exporter performs internal cleanup and re-registers as ready for the next lease without restarting.

This comment was generated from a Cursor session.
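
The two recycle strategies can be read as a single branch in the instance state model. The sketch below is illustrative only: the enum and function names are hypothetical, and mapping `ExitAndReplace` onto the `Terminating` state is an assumption drawn from the lifecycle diagram quoted above, not something the JEP text in this thread confirms.

```python
from enum import Enum


class State(Enum):
    PROVISIONING = "Provisioning"
    READY = "Ready"
    LEASED = "Leased"
    TERMINATING = "Terminating"


class RecycleStrategy(Enum):
    EXIT_AND_REPLACE = "ExitAndReplace"  # described as the default
    IN_PLACE_REUSE = "InPlaceReuse"


def on_lease_released(strategy: RecycleStrategy) -> State:
    """State an instance moves to when its lease ends (assumed mapping)."""
    if strategy is RecycleStrategy.IN_PLACE_REUSE:
        # Exporter cleans up internally and re-registers without restarting.
        return State.READY
    # Exporter process exits; a fresh Pod is spawned to replace it.
    return State.TERMINATING
```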

Member

Virtual instances should be throwaway, imo.

mangelajo and others added 2 commits June 4, 2026 10:37
Add DD-4 explaining why per-lease parameters are not included in this
JEP. The same use case is served by creating separate pools with
different resource profiles, avoiding complexity across the Lease CRD,
controller, pool controllers, and driver templates.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Clarify warm pool rationale and cold-start latency range (10-60s)
- Rename minInstances/maxInstances to minWarmInstances/maxTotalInstances
- Make maxTotalInstances optional (0 or omitted means unlimited)
- Add Crossplane to Prior Art with rationale
- Resolve scheduled leases question via existing BeginTime mechanism
- Add DD-5: built-in scaling vs HPA/KEDA
- Add DD-4: per-lease parameters rejected in favor of pool flavors
- Add composite exporters and Corellium to Future Possibilities
- Clarify instance reuse with recycleStrategy field (ExitAndReplace default)
- Add language identifiers to untyped fenced code blocks
- Add Apache 2.0 license footer

Co-authored-by: Cursor <cursoragent@cursor.com>
CI pipelines (Pod scheduling + image pull + VM boot + exporter registration
typically takes 10-15s, and up to 60s with cold image pulls or heavy
providers). A warm pool
provides instant lease fulfillment for the common case. Setting `minWarmInstances: 0`

Member

are you referring to the exporter image or the "disk image" that the exporter will run?

@mangelajo mangelajo Jun 4, 2026

Member Author

The pull of the container image that runs the exporter in a pod. I guess that in most cases it will already be present on the node.

Member

Then this is a no-op, with time only spent on scheduling, boot and registration.

@kirkbrauer

Member

@mangelajo I would also propose a sidecar pattern for the exporters by default. This would prevent losing the main pod from bringing down the exporter too and add more flexibility for multi-device virtual benches in the future.


- [ ] `AndroidExporterPool` CRD and reconciler
- [ ] Provider authoring guide documenting how to add a new `*ExporterPool`

Member Author

Future ideas:

  • Priority selectors: I want "this"; if not available, "this other thing" ... otherwise ....

Member Author

Should this be admin configured or user configured?

This is probably related to Device Classes

Member

@mangelajo Humm, yeah we'll need to think about this if this is related to a DeviceClass or ExporterClass since the Exporter is the primary scheduling unit in this proposal.

New JEP files not listed in any toctree cause Sphinx build warnings,
which fail the check-warnings CI job.

Co-authored-by: Cursor <cursoragent@cursor.com>

@kirkbrauer kirkbrauer left a comment

Member

Thanks for putting this together, @mangelajo — the warm-pool + autoscaling direction is great and the design decisions are well argued. I've left a set of inline suggestions exploring whether we can lean harder on native Kubernetes primitives (so the proposal reads like standard k8s to cluster admins) and make the provider/device model more extensible.

The throughline of the inline comments:

  • Orchestration → ExporterSet (a ReplicaSet+HPA analog): selector + an embedded template, HPA/PDB scaling vocab (minReplicas/maxReplicas/minAvailableReplicas), the scale subresource, and Deployment-style status. Reuses the existing lease-selector→Exporter-label matching unchanged.
  • Separate the device from the exporter: keep Exporter as the minimum leased unit (Pod analog); move the provider-typed device into a first-class VirtualTarget.
  • CSI-style class + claim: a VirtualTargetClass (StorageClass analog) with a pluggable provisioner (k8s container / EC2 / Corellium / a vendor cloud-device API), an inline credentialsSecretRef, bindingMode (warm vs provision-on-lease), reclaimPolicy, and node scheduling; the typed *VirtualTarget is the claim.
  • Endorse per-provisioner, backend-aware autoscaling — the only ask is a consistent scaling API across provisioners.
  • Packaging: exporter as a native sidecar + an independent runtime container + the OS image as an OCI artifact (image volume); drivers attach over standard interfaces (serial/SPI/CAN/GPIO via virtio), reused across physical/virtual.
  • Node scheduling on the class (arch/KVM/GPU via tolerations + device resources).
  • A fidelity/cost ladder of multiple classes (software-emulated → cloud virtual device → real hardware), all selectable through one jmp lease.
  • A few Future Possibilities to keep the model open (cross-node accelerators, a *ProviderConfig for multi-account creds, a realized-instance CRD, an ExporterDeployment rollout tier, multi-target-per-exporter, a universal Target).

To make this concrete rather than abstract, I pushed a worked rewrite of the JEP on a separate branch so we can diff and discuss specifics:

These are suggestions for discussion, not blockers — happy to scope any of them down to Future Possibilities.


🤖 This review summary and the inline suggestions were drafted with AI assistance (Claude) and reviewed by me (the PR reviewer) before posting.


## Abstract

This JEP proposes a Virtual Scalable Exporter subsystem for Jumpstarter that

Member

We might want to specify that the thing that is really scaling here is not the exporter per se, but the targets themselves. For example, one exporter may have multiple targets, but I do understand this from the perspective of the "exporter" being the basic unit of scheduling, like a Pod in k8s.

target definition declares scaling parameters that let administrators tune the
trade-off between instant availability and resource consumption.

### Core Concept: Managed Pools with Scaling

Member

Suggestion: consider re-modeling the pool on native Kubernetes workload primitives so it reads like standard k8s to cluster admins. Replace the provider-typed *ExporterPool with a generic ExporterSet (a ReplicaSet+HPA analog): spec.selector + an inline spec.template, HPA/PDB scaling vocab (minReplicas/maxReplicas/minAvailableReplicas), Deployment-style status (replicas/readyReplicas/availableReplicas/leasedReplicas), and the scale subresource. This reuses the existing lease-selector→Exporter-label matching unchanged and lets kubectl scale/HPA/KEDA interoperate.

kind: ExporterSet                  # ≈ ReplicaSet + HPA
spec:
  minReplicas: 0
  maxReplicas: 20
  minAvailableReplicas: 2          # PDB-style warm buffer (ready & unleased)
  selector:
    matchLabels:
      board: rpi4
  template:                        # embedded template (Deployment idiom)
    metadata:
      labels:
        board: rpi4
        virtual: "true"
    spec:
      drivers:
        - type: jumpstarter_driver_power.driver.QemuPower
        # ...
status:                            # Deployment-style counters
  replicas: 5
  readyReplicas: 3
  availableReplicas: 3             # warm
  leasedReplicas: 2
# scale subresource: specReplicasPath=.spec.maxReplicas

🤖 Drafted with AI assistance (Claude) and reviewed by the PR reviewer before posting.

storage: 16Gi

# Exporter template (drivers exposed by each instance)
exporterTemplate:

Member

Suggestion: rather than the pool being the provider-typed device, separate the device into a first-class VirtualTarget, keeping Exporter as the minimum leased unit (Pod-equivalent). The drivers/exporterTemplate here would move under a typed *VirtualTarget (e.g. QEMUVirtualTarget) the Exporter owns. Keeps the lease flow unchanged + unified for physical/virtual, localizes provider typing, and leaves room for one Exporter to expose multiple VirtualTargets later (multi-device benches).

ExporterSet (generic)            ~ ReplicaSet + HPA
  └ Exporter (leasable; a Pod)   ~ Pod   ← minimum leased unit
      └ QEMUVirtualTarget        ~ the device (provider-typed)

🤖 Drafted with AI assistance (Claude) and reviewed by the PR reviewer before posting.


# Corellium-specific configuration
apiHost: app.corellium.com
apiCredentialsSecret: corellium-api-credentials # Secret with keys: token

Member

Suggestion: instead of inlining credentials/provisioning per pool, adopt the CSI StorageClass/PVC pattern. A cluster-scoped VirtualTargetClass holds the provisioner, an inline credentialsSecretRef, opaque parameters (apiHost/projectId/region), bindingMode (Immediate=warm vs WaitForFirstConsumer=provision-on-lease), and reclaimPolicy; the typed *VirtualTarget is the claim naming the class. Admins own classes + secrets; claim authors never touch credentials — like a PVC naming a StorageClass.

# cluster-scoped (StorageClass analog) — admins own it
kind: VirtualTargetClass
metadata:
  name: corellium-kronos
spec:
  provisioner: corellium.jumpstarter.dev
  credentialsSecretRef:
    name: corellium-creds
    namespace: jumpstarter
  parameters:
    apiHost: app.corellium.com
    projectId: "778f..."
  bindingMode: WaitForFirstConsumer    # Immediate = pre-warmed pool
  reclaimPolicy: Delete
---
# the typed claim just names the class (PVC analog) — no creds here
kind: CorelliumVirtualTarget
spec:
  virtualTargetClassName: corellium-kronos
  deviceFlavor: kronos

🤖 Drafted with AI assistance (Claude) and reviewed by the PR reviewer before posting.

Deployment manifest pointing to the same image with a different flag — no new
image build required.

### DD-3: CRD per provider vs. generic CRD

Member

Suggestion: a cleaner framing than "CRD per provider vs generic" (borrowed from CSI/CRI/Cluster-API): keep orchestration generic and make the device backend pluggable via a provisioner named on the class. The typed *VirtualTarget stays strongly-typed (your DD-3 win), while one provisioner string selects the backend — k8s container, EC2 instance, Corellium/REST, a vendor cloud-device API — all behind one interface, so the Exporter/lease experience is identical regardless of where the device runs. New backends add a claim kind + a provisioner, no pool-tier changes.

VirtualTargetClass.provisioner →
  qemu.jumpstarter.dev        →  k8s container (+ OS OCI image volume)
  ec2.jumpstarter.dev         →  AWS API
  corellium.jumpstarter.dev   →  Corellium REST API
# one typed *VirtualTarget claim interface; backend is pluggable
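
As a rough sketch of how a second backend would reuse the same shape, a QEMU class and claim might look like this. The names `qemu-arm64-warm` and `QEMUVirtualTarget` are hypothetical; only the `provisioner`, `bindingMode`, `reclaimPolicy`, and `virtualTargetClassName` fields come from the suggestion above.

```yaml
# Same class/claim split as the Corellium example, different provisioner string.
kind: VirtualTargetClass
metadata:
  name: qemu-arm64-warm            # hypothetical class name
spec:
  provisioner: qemu.jumpstarter.dev
  bindingMode: Immediate           # pre-warmed pool
  reclaimPolicy: Delete
---
kind: QEMUVirtualTarget            # hypothetical typed claim kind
spec:
  virtualTargetClassName: qemu-arm64-warm
```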

🤖 Drafted with AI assistance (Claude) and reviewed by the PR reviewer before posting.

maxTotalInstances: 20 # Scale up to 20 under load

# Node scheduling (shared across all pool CRDs, optional)
nodeSelector:

Suggestion: node scheduling (this nodeSelector) is really a property of the backend, so consider folding it into the VirtualTargetClass as a scheduling block: nodeSelector/nodeAffinity (e.g. kubernetes.io/arch), tolerations (for tainted KVM/GPU/baremetal/ARM nodes), and device resource requests (devices.kubevirt.io/kvm, nvidia.com/gpu). CSI precedent: StorageClass.allowedTopologies. The rendered Pod inherits the block, with an optional per-ExporterSet override.

kind: VirtualTargetClass
spec:
  scheduling:                    # inherited by the rendered exporter Pod
    nodeSelector:
      kubernetes.io/arch: arm64
    tolerations:
      - key: jumpstarter.dev/kvm
        operator: Exists
        effect: NoSchedule
    resources:
      limits:
        devices.kubevirt.io/kvm: "1"   # or nvidia.com/gpu
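
The optional per-ExporterSet override could be sketched as follows. This is only an assumption about the shape: the `ExporterSet` field names (`virtualTargetClassName`, `scheduling`) and both object names are hypothetical.

```yaml
# Hypothetical override: the set inherits the class's scheduling block
# and replaces only the nodeSelector.
kind: ExporterSet
metadata:
  name: qemu-x86-ci                # hypothetical name
spec:
  virtualTargetClassName: qemu-kvm # hypothetical class name
  scheduling:
    nodeSelector:
      kubernetes.io/arch: amd64    # overrides the class-level arm64 selector
```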

🤖 Drafted with AI assistance (Claude) and reviewed by the PR reviewer before posting.

`status.leaseRef` to remain empty).
3. Pool controller deletes the Pod and Exporter CR.

### Hardware Considerations

Suggestion: add a concrete fidelity/cost ladder showing one logical target served by multiple classes — a container-backed software emulator/simulator (cheap, CI-scale), an API/cloud-backed virtual device (higher fidelity, metered), and real hardware (full fidelity, scarce) — all selected via labels through one jmp lease. For example, a target that needs a GPU or specialized I/O device: a software-emulated class runs functional checks cheaply in CI, a cloud-backed virtual device adds higher fidelity, and real hardware is the ground truth. This illustrates why keeping the VirtualTarget/class abstraction generic pays off across fidelity tiers.

One logical target, selected via labels through jmp lease:

| class (provisioner) | fidelity | scale/cost | role |
| --- | --- | --- | --- |
| container sim (qemu) | low | cheap / CI | functional checks |
| cloud virtual device (api) | high | metered | higher-fidelity behavior |
| real hardware (exporter) | full | scarce | ground truth |
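
One way the label-based selection might look, as a sketch: each tier's Exporter carries the same `target` label and a different `fidelity` label. The label keys, values, and exporter names are all hypothetical.

```yaml
# Three exporters for one logical target; a lease selector such as
# target=gpu-board,fidelity=low would match only the first.
kind: Exporter
metadata:
  name: gpu-board-sim-0
  labels:
    target: gpu-board
    fidelity: low        # container sim (qemu)
---
kind: Exporter
metadata:
  name: gpu-board-cloud-0
  labels:
    target: gpu-board
    fidelity: high       # cloud virtual device (api)
---
kind: Exporter
metadata:
  name: gpu-board-hw-0
  labels:
    target: gpu-board
    fidelity: full       # real hardware
```

Dropping the `fidelity` label from the selector would let a lease land on whichever tier is free.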

🤖 Drafted with AI assistance (Claude) and reviewed by the PR reviewer before posting.

input, so they naturally do not scale up for future-dated leases until the
controller makes them effective.

## Future Possibilities

Suggestion: list a few forward-compat items so the model stays open to them: disaggregated/cross-node accelerators (ARM64 runtime bridged to a remote GPU via virtio-gpu/RDMA), a separate reusable *ProviderConfig CRD (multi-account credential reuse/rotation), a realized-instance CRD (PV analog) for static/pre-provisioned devices, an ExporterDeployment rollout tier (Deployment analog), multiple/spawned-on-lease VirtualTargets per Exporter, and a universal physical+virtual Target abstraction.
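
As one illustration of the reusable *ProviderConfig idea, credentials could move out of the class into a separate object that several classes reference. Everything here is an assumption: the `CorelliumProviderConfig` kind, the `providerConfigRef` field, and the object names are hypothetical.

```yaml
# Hypothetical provider config holding the account-level settings once.
kind: CorelliumProviderConfig
metadata:
  name: corellium-team-a
spec:
  apiHost: app.corellium.com
  credentialsSecretRef:
    name: corellium-creds
    namespace: jumpstarter
---
# A class then references the config instead of inlining credentials.
kind: VirtualTargetClass
metadata:
  name: corellium-kronos
spec:
  provisioner: corellium.jumpstarter.dev
  providerConfigRef:
    name: corellium-team-a
```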


🤖 Drafted with AI assistance (Claude) and reviewed by the PR reviewer before posting.

for scalable testing. However:

- Virtual targets must faithfully emulate the interfaces exposed by physical
hardware (serial, network, storage, power) through the existing driver model.

@raballew I think if we can use virtio here, we might get very high fidelity for guests that run natively on virtio targets, and we could avoid privileged Pods in that case :)


- [ ] `AndroidExporterPool` CRD and reconciler
- [ ] Provider authoring guide documenting how to add a new `*ExporterPool`

@mangelajo Hmm, yeah, we'll need to think about whether this relates to a DeviceClass or an ExporterClass, since the Exporter is the primary scheduling unit in this proposal.
