Scale Design
5.1 Spec Management at Scale
Problem: A Fortune 50 company with 2,000 sources, 500 actors, and 10,000 routes cannot manage its configuration as a single flat directory of YAML files.
Solution: Learn from Terraform. Introduce modules and workspaces.
Modules
For the full module design, see Section 12: Modules.
A module is a reusable, parameterized bundle: connectors + routes + transforms + prompt files — a complete workflow installable as a single package. Modules compose by instantiation, not merging — they declare connector dependencies and the user wires them up.
# Usage: two modules sharing one GitHub source
modules:
  - package: "@orgloop/module-code-review"
    params:
      github_source: github
      agent_actor: platform-agent

  - package: "@orgloop/module-ci-monitor"
    params:
      github_source: github
      agent_actor: platform-agent
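As an illustration of composition by instantiation, the sketch below expands a module's route templates by substituting the bound parameters. The ModuleDefinition shape, the ${params.*} placeholder syntax, and the instantiate helper are assumptions made for this example, not the actual module design (see Section 12).

// Illustrative sketch only: module instantiation as parameter substitution.
// ModuleDefinition, the ${params.*} syntax, and instantiate() are assumptions,
// not the real module design (see Section 12).

interface ModuleDefinition {
  package: string;
  // Route templates reference parameters symbolically, e.g. "${params.github_source}".
  routes: { name: string; source: string; actor: string }[];
}

function instantiate(
  module: ModuleDefinition,
  params: Record<string, string>,
): ModuleDefinition["routes"] {
  const substitute = (value: string): string =>
    value.replace(/\$\{params\.(\w+)\}/g, (_match, key: string) => {
      const bound = params[key];
      if (bound === undefined) throw new Error(`unbound module param: ${key}`);
      return bound;
    });

  // Each instantiation yields its own concrete routes; nothing is merged across
  // module instances -- they only share the connectors they are wired to.
  return module.routes.map((route) => ({
    name: substitute(route.name),
    source: substitute(route.source),
    actor: substitute(route.actor),
  }));
}

Under this reading, instantiating @orgloop/module-code-review and @orgloop/module-ci-monitor with the same github_source produces two independent sets of routes that read from one shared connector.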
Workspaces
Workspaces provide isolated state and configuration for different environments or organizational units.
$ orgloop workspace list
default*
staging
production

$ orgloop workspace select production
Switched to workspace "production"

Each workspace has its own:
- Event store / checkpoint state
- Runtime configuration (poll intervals, endpoints)
- Variable overrides
orgloop-enterprise/
├── orgloop.yaml              # Base config
├── workspaces/
│   ├── staging.yaml          # Override: staging endpoints, faster polling
│   └── production.yaml       # Override: production endpoints, slower polling
├── modules/
│   ├── team-standard/
│   └── compliance-audit/
└── teams/
    ├── platform/
    ├── frontend/
    ├── ml-infra/
    └── security/
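To make the override semantics concrete, here is a minimal sketch of applying a workspace file over the base configuration as a deep merge. The mergeConfig helper, the example keys, and the use of plain objects in place of parsed YAML are assumptions for illustration only.

// Minimal sketch: a workspace override file deep-merged over the base config.
// mergeConfig() and the example keys below are assumptions for illustration.

type Config = Record<string, unknown>;

function mergeConfig(base: Config, override: Config): Config {
  const result: Config = { ...base };
  for (const [key, value] of Object.entries(override)) {
    const existing = result[key];
    const bothObjects =
      value !== null && typeof value === "object" && !Array.isArray(value) &&
      existing !== null && typeof existing === "object" && !Array.isArray(existing);
    // Nested sections merge recursively; scalars and arrays are replaced wholesale.
    result[key] = bothObjects
      ? mergeConfig(existing as Config, value as Config)
      : value;
  }
  return result;
}

// e.g. values from orgloop.yaml overridden by workspaces/production.yaml:
const base: Config = { poll_interval: "1m", endpoint: "https://orgloop.internal" };
const production: Config = { poll_interval: "5m", endpoint: "https://orgloop.prod.internal" };
const effective = mergeConfig(base, production);
// effective -> { poll_interval: "5m", endpoint: "https://orgloop.prod.internal" }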
Plan/Apply Model
Borrowed directly from Terraform:
                  ┌──────────┐
YAML files ──────►│ validate ├──── syntax + schema errors
                  └────┬─────┘
                       │
                  ┌────▼─────┐
                  │   plan   ├──── "3 sources added, 1 route changed,
                  └────┬─────┘      2 transforms removed"
                       │
                  ┌────▼─────┐
                  │  start   ├──── start/update runtime
                  └──────────┘

orgloop plan computes a diff between the current running state and the desired state from YAML files. orgloop start reconciles. This gives operators visibility and control over changes.
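A sketch of the diff the plan step could compute, assuming each source, route, and transform is keyed by a stable name; the Spec shape and the fingerprint comparison are illustrative, not the actual implementation.

// Illustrative sketch of `orgloop plan`: diff running state vs. desired YAML state.
// Spec and fingerprint are assumptions for this example.

interface Spec {
  name: string;        // stable identifier (e.g. a route or source name)
  fingerprint: string; // hash of the normalized spec body
}

interface Plan {
  added: string[];
  changed: string[];
  removed: string[];
}

function computePlan(current: Spec[], desired: Spec[]): Plan {
  const currentByName = new Map(current.map((s): [string, Spec] => [s.name, s]));
  const desiredByName = new Map(desired.map((s): [string, Spec] => [s.name, s]));
  const plan: Plan = { added: [], changed: [], removed: [] };

  for (const [name, spec] of desiredByName) {
    const running = currentByName.get(name);
    if (!running) plan.added.push(name);
    else if (running.fingerprint !== spec.fingerprint) plan.changed.push(name);
  }
  for (const name of currentByName.keys()) {
    if (!desiredByName.has(name)) plan.removed.push(name);
  }
  return plan;
}

// `orgloop plan` would render the result, e.g.
// "3 sources added, 1 route changed, 2 transforms removed";
// `orgloop start` then applies exactly that diff.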
5.2 Runtime Scale
Architecture Tiers
┌───────────────────────────────────────────────────────────────┐
│ TIER 1: Single Process                                        │
│ (MVP, small teams, dev/test)                                  │
│                                                               │
│ ┌─────────┐   ┌──────────┐   ┌────────┐   ┌───────────────┐   │
│ │ Pollers │──►│ Event Bus│──►│ Router │──►│ Actor Delivery│   │
│ └─────────┘   │ (in-mem) │   └────────┘   └───────────────┘   │
│               └──────────┘                                    │
│                    │                                          │
│               ┌────▼─────┐                                    │
│               │ File WAL │                                    │
│               └──────────┘                                    │
└───────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────┐
│ TIER 2: Single Process + Queue                                │
│ (Medium orgs, hundreds of sources)                            │
│                                                               │
│ ┌─────────┐   ┌──────────────────┐   ┌────────┐   ┌─────────┐ │
│ │ Pollers │──►│ NATS / Redis     │──►│ Router │──►│ Deliver │ │
│ └─────────┘   │     Streams      │   └────────┘   └─────────┘ │
│               │   (persistent)   │                            │
│               └──────────────────┘                            │
└───────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────┐
│ TIER 3: Distributed                                           │
│ (Fortune 50, thousands of sources)                            │
│                                                               │
│ ┌──────────────┐     ┌───────────────────┐                    │
│ │ Poller Fleet │────►│ Kafka / NATS      │                    │
│ │ (N instances)│     │ (partitioned by   │                    │
│ └──────────────┘     │  source)          │                    │
│                      └─────────┬─────────┘                    │
│                     ┌──────────▼──────────┐                   │
│                     │ Router Fleet        │                   │
│                     │ (N instances,       │                   │
│                     │  partition-aware)   │                   │
│                     └──────────┬──────────┘                   │
│                     ┌──────────▼──────────┐                   │
│                     │ Delivery Fleet      │                   │
│                     │ (N instances)       │                   │
│                     └─────────────────────┘                   │
└───────────────────────────────────────────────────────────────┘

MVP ships Tier 1. But we design the internal interfaces so swapping the event bus implementation is a one-line config change.
Event Bus Interface
// The core abstraction that enables tiered scaling
export interface EventBus {
  /** Publish an event to the bus */
  publish(event: OrgLoopEvent): Promise<void>;

  /** Subscribe to events, optionally filtered */
  subscribe(filter: EventFilter, handler: EventHandler): Subscription;

  /** Acknowledge processing of an event (for at-least-once) */
  ack(eventId: string): Promise<void>;

  /** Get unacknowledged events (for recovery) */
  unacked(): Promise<OrgLoopEvent[]>;
}
// Implementations:
// - InMemoryBus → Tier 1 (dev/small)
// - FileWalBus  → Tier 1 (production small)
// - NatsBus     → Tier 2
// - RedisBus    → Tier 2
// - KafkaBus    → Tier 3
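For concreteness, a minimal sketch of what the Tier 1 InMemoryBus could look like against the interface above; the OrgLoopEvent, EventFilter, EventHandler, and Subscription stand-ins are assumptions for this example, since their real definitions live elsewhere in the spec.

// Minimal sketch of the Tier 1 InMemoryBus against the interface above.
// The OrgLoopEvent/EventFilter/EventHandler/Subscription stand-ins are assumptions.

interface OrgLoopEvent { id: string; source: string; type: string; payload?: unknown }
type EventFilter = (event: OrgLoopEvent) => boolean;
type EventHandler = (event: OrgLoopEvent) => void | Promise<void>;
interface Subscription { unsubscribe(): void }

class InMemoryBus implements EventBus {
  private subscribers: { filter: EventFilter; handler: EventHandler }[] = [];
  private pending = new Map<string, OrgLoopEvent>(); // unacked events

  async publish(event: OrgLoopEvent): Promise<void> {
    this.pending.set(event.id, event); // track until acked (at-least-once)
    for (const { filter, handler } of this.subscribers) {
      if (filter(event)) await handler(event);
    }
  }

  subscribe(filter: EventFilter, handler: EventHandler): Subscription {
    const entry = { filter, handler };
    this.subscribers.push(entry);
    return {
      unsubscribe: () => {
        this.subscribers = this.subscribers.filter((s) => s !== entry);
      },
    };
  }

  async ack(eventId: string): Promise<void> {
    this.pending.delete(eventId);
  }

  async unacked(): Promise<OrgLoopEvent[]> {
    return [...this.pending.values()];
  }
}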
Backpressure
When delivery to an actor fails or is slow:
- Per-route circuit breaker. After N consecutive failures the circuit opens and events queue (bounded). After a cooldown the circuit goes half-open and a single event is retried: success closes the circuit, failure reopens it. (A sketch follows after this list.)
- Bounded queues per actor. Each actor target has a configurable queue depth (default: 1000). When full, the oldest events are dropped with a pipeline.backpressure log entry. This prevents a slow actor from consuming all memory.
- Rate limiting. Configurable per route: max events/second delivered to an actor. Excess events queue (bounded).
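A sketch of the per-route circuit breaker state machine described in the first item above, with defaults matching the failure_threshold and cooldown values in the YAML below; the class and method names are illustrative, not the actual implementation.

// Sketch of a per-route circuit breaker; threshold and cooldown mirror the
// YAML configuration below. State handling is illustrative only.

type CircuitState = "closed" | "open" | "half-open";

class RouteCircuitBreaker {
  private state: CircuitState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(private failureThreshold = 5, private cooldownMs = 30_000) {}

  /** Should we attempt delivery right now? */
  canDeliver(now = Date.now()): boolean {
    if (this.state === "open" && now - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // allow a single probe delivery
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.state = "closed";
    this.failures = 0;
  }

  recordFailure(now = Date.now()): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open"; // stop deliveries; queued events wait out the cooldown
      this.openedAt = now;
    }
  }
}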
# Route-level delivery configuration
routes:
  - name: high-volume-source
    when:
      source: telemetry
      events: [resource.changed]
    then:
      actor: processor
      delivery:
        max_rate: 100/s       # Rate limit
        queue_depth: 5000     # Max queued events
        retry:
          max_attempts: 3
          backoff: exponential
          initial_delay: 1s
          max_delay: 60s
        circuit_breaker:
          failure_threshold: 5
          cooldown: 30s
      with:
        prompt_file: "./sops/telemetry-alert.md"

5.3 Event Persistence & Delivery Guarantees
Guarantee: At-least-once delivery.
This is the right default for OrgLoop’s use case. Actors may receive duplicate events — the dedup transform handles this for routes that need exactly-once semantics. At-least-once is achievable without the complexity of distributed transactions.
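As a rough illustration of how a dedup transform can give effectively-once processing on top of at-least-once delivery, here is a sketch keyed on the event id; the sliding-window size and the id-only event shape are assumptions for this example.

// Sketch of a dedup transform: drops events whose id was already seen.
// The window size and keying on `event.id` are assumptions for this example.

class DedupTransform {
  private seen = new Set<string>();
  private order: string[] = [];

  constructor(private maxWindow = 10_000) {}

  /** Returns the event if its id is unseen, or null if it is a duplicate. */
  apply<E extends { id: string }>(event: E): E | null {
    if (this.seen.has(event.id)) return null;
    this.seen.add(event.id);
    this.order.push(event.id);
    // Bound memory: evict the oldest ids once the window is full.
    if (this.order.length > this.maxWindow) {
      this.seen.delete(this.order.shift()!);
    }
    return event;
  }
}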
Implementation (Tier 1 — File WAL):
┌─────────────────────────────────────────────┐
│ Write-Ahead Log                             │
│                                             │
│ 1. Event received from source               │
│ 2. Write to WAL (fsync)                     │
│ 3. Process through pipeline                 │
│ 4. On successful delivery: mark WAL entry   │
│    as acknowledged                          │
│ 5. On crash: replay unacked WAL entries     │
│                                             │
│ WAL file:   ~/.orgloop/data/wal/            │
│ Format:     append-only JSONL               │
│ Rotation:   configurable (size/time)        │
│ Compaction: remove acked entries on rotate  │
└─────────────────────────────────────────────┘
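A compressed sketch of the WAL mechanics above, assuming one JSONL record per event plus "ack" marker records; the record shapes and file handling are illustrative, and a real implementation would also fsync on append and handle rotation and compaction.

// Sketch of a file WAL: append events, ack by id, replay unacked on startup.
// The JSONL record shapes below are assumptions; rotation/compaction omitted.
import { appendFileSync, readFileSync, existsSync } from "node:fs";

type WalRecord =
  | { kind: "event"; id: string; payload: unknown }
  | { kind: "ack"; id: string };

class FileWal {
  constructor(private path: string) {}

  /** Durably append an event before it enters the pipeline. */
  append(id: string, payload: unknown): void {
    // A production WAL would also fsync here (step 2 above).
    appendFileSync(this.path, JSON.stringify({ kind: "event", id, payload }) + "\n");
  }

  /** Mark an event as delivered. */
  ack(id: string): void {
    appendFileSync(this.path, JSON.stringify({ kind: "ack", id }) + "\n");
  }

  /** On startup: return events that were written but never acked. */
  replayUnacked(): WalRecord[] {
    if (!existsSync(this.path)) return [];
    const records = readFileSync(this.path, "utf8")
      .split("\n")
      .filter(Boolean)
      .map((line) => JSON.parse(line) as WalRecord);
    const acked = new Set(records.filter((r) => r.kind === "ack").map((r) => r.id));
    return records.filter((r) => r.kind === "event" && !acked.has(r.id));
  }
}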
State management:

Each source connector maintains a checkpoint — an opaque string (typically a timestamp or cursor) that tells the connector where to resume polling after a restart. Checkpoints are persisted to ~/.orgloop/data/checkpoints/.
// Checkpoint store
export interface CheckpointStore {
  get(sourceId: string): Promise<string | null>;
  set(sourceId: string, checkpoint: string): Promise<void>;
}
// Implementations:
// - FileCheckpointStore     → JSON file per source
// - InMemoryCheckpointStore → in-memory (for testing)
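A sketch of what FileCheckpointStore could look like (one JSON file per source, as noted above); the file naming and on-disk shape are assumptions for illustration.

// Sketch of FileCheckpointStore: one JSON file per source under the data dir.
// File naming and the on-disk shape are assumptions for this example.
import { promises as fs } from "node:fs";
import { join } from "node:path";

class FileCheckpointStore implements CheckpointStore {
  constructor(private dir: string) {}

  private fileFor(sourceId: string): string {
    return join(this.dir, `${sourceId}.json`);
  }

  async get(sourceId: string): Promise<string | null> {
    try {
      const raw = await fs.readFile(this.fileFor(sourceId), "utf8");
      return (JSON.parse(raw) as { checkpoint: string }).checkpoint;
    } catch {
      return null; // no checkpoint yet: the connector starts from its default position
    }
  }

  async set(sourceId: string, checkpoint: string): Promise<void> {
    await fs.mkdir(this.dir, { recursive: true });
    await fs.writeFile(this.fileFor(sourceId), JSON.stringify({ checkpoint }), "utf8");
  }
}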