Jobs

Monitor and manage background tasks across the platform.

What Are Background Jobs?

Background jobs are asynchronous tasks that run outside of user-facing requests. They handle the heavy lifting that keeps the platform running: processing uploaded media, sending emails, indexing content for search, executing scripts, transforming analytics data, and computing audience segments. When you publish a video and it gets transcoded into multiple formats, a background job does that work. When a search index is rebuilt after content changes, a background job handles it.

Jobs run automatically in response to platform events, but administrators can also monitor their progress, investigate failures, and intervene when things go wrong.

Job Lifecycle

Every job follows a defined lifecycle from creation to completion:

  1. Pending. The job has been created and is waiting to be picked up by a worker. Jobs in this state are sitting in a queue, ordered by priority and creation time.
  2. Running. A worker has claimed the job and is actively processing it. The job holds a lock that prevents other workers from picking up the same task.
  3. Complete. The job finished successfully. Its results (if any) are stored and any downstream actions — such as callbacks or child job creation — are triggered.
  4. Failed. The job encountered an error and could not complete. Depending on the type of failure, the job may be retried automatically or marked as permanently failed.

State    | Description                     | What Happens Next
Pending  | Queued and waiting for a worker | Picked up by the next available worker
Running  | Actively being processed        | Completes successfully or fails
Complete | Finished successfully           | Triggers callbacks and child jobs if applicable
Failed   | Encountered an error            | Retried automatically (transient) or stopped (permanent)
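
The lifecycle described above can be read as a small state machine. The following is a minimal sketch, not the platform's implementation: the names JobState, ALLOWED_TRANSITIONS, and transition are hypothetical, and the edge from Failed back to Pending is an assumption about how an automatic retry re-enters the queue.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETE = "complete"
    FAILED = "failed"


# Allowed moves between states. The FAILED -> PENDING edge models an
# automatic retry of a transient failure (an assumption for illustration).
ALLOWED_TRANSITIONS = {
    JobState.PENDING: {JobState.RUNNING},
    JobState.RUNNING: {JobState.COMPLETE, JobState.FAILED},
    JobState.COMPLETE: set(),
    JobState.FAILED: {JobState.PENDING},
}


def transition(current: JobState, target: JobState) -> JobState:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

In this sketch, Complete is terminal, so a completed job can never be picked up again by mistake.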

Distributed Execution

In a production deployment, multiple worker processes run in parallel across different servers. The job system ensures that only one worker processes a given job at a time, even when many workers are competing for tasks.

  • Lock management. When a worker picks up a job, it acquires an exclusive lock on that job. No other worker can claim the same job while the lock is held. Locks have expiration times to handle the case where a worker crashes without releasing its lock.
  • Work distribution. Jobs are distributed across workers based on queue assignment and availability. This ensures balanced load and prevents any single worker from becoming a bottleneck.
  • Idempotency. Jobs are designed to be safely re-executed if a worker fails mid-processing. This means the system can recover from worker crashes without producing duplicate results or corrupted data.

Lock expiration is a critical safety mechanism. If a worker crashes while processing a job, the lock eventually expires and another worker can pick up the job. Administrators can also manually clear stale locks when needed.
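
To make the interaction between exclusive locks and expiration concrete, here is a minimal in-memory sketch. LockTable, try_claim, release, and the 300-second default are hypothetical names and values chosen for illustration. A real deployment would keep locks in a shared store with an atomic claim operation, which this single-process example does not attempt to model.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class JobLock:
    owner: str
    expires_at: float


class LockTable:
    """In-memory stand-in for a shared lock store (hypothetical)."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl_seconds = ttl_seconds
        self._locks: Dict[str, JobLock] = {}

    def try_claim(self, job_id: str, worker_id: str,
                  now: Optional[float] = None) -> bool:
        """Claim the job unless another worker holds an unexpired lock."""
        now = time.time() if now is None else now
        held = self._locks.get(job_id)
        if held is not None and held.expires_at > now:
            return False  # lock is still live; leave the job alone
        # No lock, or the previous holder's lock expired (e.g. it crashed).
        self._locks[job_id] = JobLock(worker_id, now + self.ttl_seconds)
        return True

    def release(self, job_id: str, worker_id: str) -> bool:
        """Release the lock only if this worker still owns it."""
        held = self._locks.get(job_id)
        if held is not None and held.owner == worker_id:
            del self._locks[job_id]
            return True
        return False
```

The ownership check in release matters: a slow worker whose lock has already expired and been taken over must not be able to release the new holder's lock.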

Parent-Child Jobs

Complex workflows often require multiple processing steps that must execute in a specific order. The job system supports this through parent-child relationships and callbacks:

  • Child jobs. A running job can create one or more child jobs that handle sub-tasks. For example, a content publishing job might spawn separate child jobs for generating thumbnails, building search indexes, and sending notifications.
  • Callbacks. When a child job completes, the parent job is notified through a callback mechanism. The parent can then inspect the child's results and decide how to proceed.
  • Cascading completion. A parent job can wait for all of its children to complete before marking itself as finished. This ensures that multi-step workflows complete fully before triggering downstream processes.
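
The callback and cascading-completion behavior described above can be sketched as follows. This is an illustrative model only: the Job class and its method names are hypothetical, and it assumes a parent completes once its own work is done and every child has completed.

```python
from typing import List, Optional


class Job:
    """Toy job that completes only after its own work and all children finish."""

    def __init__(self, name: str, parent: Optional["Job"] = None):
        self.name = name
        self.parent = parent
        self.children: List["Job"] = []
        self.own_work_done = False
        self.state = "running"

    def spawn_child(self, name: str) -> "Job":
        child = Job(name, parent=self)
        self.children.append(child)
        return child

    def finish_own_work(self) -> None:
        self.own_work_done = True
        self._maybe_complete()

    def _on_child_complete(self, child: "Job") -> None:
        # Callback from a child: re-check whether the parent can finish now.
        self._maybe_complete()

    def _maybe_complete(self) -> None:
        if self.state == "complete":
            return
        children_done = all(c.state == "complete" for c in self.children)
        if self.own_work_done and children_done:
            self.state = "complete"
            if self.parent is not None:
                self.parent._on_child_complete(self)
```

For example, a publishing job that spawns a thumbnail child and a search-index child stays in the running state until both children have called finish_own_work, even if its own work finished first.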

Retry and Failure Handling

Not all failures are created equal. The job system distinguishes between transient failures that are likely to resolve on their own and permanent failures that require human intervention.

  • Transient failures. Network timeouts, temporary service unavailability, and resource contention are treated as transient. The job system automatically retries these jobs after a delay, giving the underlying issue time to resolve.
  • Permanent failures. Invalid data, missing dependencies, and logic errors are treated as permanent. These jobs are marked as failed immediately without retrying, since repeating the same operation would produce the same error.
  • Retry limits. Transient failures have a maximum retry count. If a job continues to fail after exhausting its retries, it is marked as permanently failed and flagged for administrator review.

Jobs that fail repeatedly may indicate a systemic issue, such as a misconfigured storage backend, an unreachable external service, or a bug in processing logic. Investigate the root cause rather than simply retrying failed jobs.
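
The distinction between transient and permanent failures, together with a retry limit, can be sketched as below. The exception classes, run_with_retries, and the exponential delay schedule are assumptions made for illustration. The text above only states that retries happen after a delay and are capped at a maximum count.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, contention)."""


class PermanentError(Exception):
    """Hypothetical marker for failures that retrying cannot fix (bad input)."""


def run_with_retries(task, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Run task(); return ("complete", result) or ("failed", exception)."""
    attempt = 0
    while True:
        try:
            return "complete", task()
        except PermanentError as exc:
            # Repeating the same operation would produce the same error.
            return "failed", exc
        except TransientError as exc:
            if attempt >= max_retries:
                # Retries exhausted: treat as permanently failed.
                return "failed", exc
            # Assumed schedule: the delay doubles on each retry.
            sleep(base_delay * 2 ** attempt)
            attempt += 1
```

Passing sleep as a parameter keeps the sketch testable without real waiting. With base_delay=1.0, a task that fails twice with TransientError and then succeeds waits 1.0 and then 2.0 seconds before completing.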

Administrative Operations

Administrators have several tools for managing the job system:

  • Expire stuck jobs. Jobs that have been in the running state longer than expected can be forcibly expired, releasing their locks and allowing them to be retried or marked as failed.
  • Clear stale locks. When workers crash or are restarted, they may leave behind locks that prevent other workers from picking up jobs. Administrators can clear these stale locks to restore normal processing.
  • View job history. See the full history of jobs including their creation time, processing duration, completion status, and any error messages. Use this to identify patterns in failures or performance bottlenecks.
  • Monitor active jobs. View all currently running and pending jobs to understand the current processing load and identify any jobs that may be stuck.

Job Queues

The platform organizes jobs into separate queues based on the type of work being performed. This separation ensures that a spike in one type of processing does not block other types from making progress.

Queue         | Work Type
Content       | Media transcoding, document processing, thumbnail generation, supplementary file handling.
Analytics     | Data transforms, aggregation jobs, report generation.
Segmentation  | Audience segment computation, membership recalculation.
Indexing      | Search index updates, content reindexing operations.
Notifications | Email delivery, push notifications, webhook dispatches.

Workers can be configured to process jobs from specific queues, allowing you to dedicate more resources to high-priority or resource-intensive work types.
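
As a rough illustration of queue-specific workers, the sketch below lets a worker pull only from the queues it has been assigned. QueueSet and its methods are hypothetical, the lowercase queue names simply mirror the queues listed above, and checking the assigned queues in the order given is an assumption rather than documented behavior.

```python
from collections import deque

QUEUES = ("content", "analytics", "segmentation", "indexing", "notifications")


class QueueSet:
    """Separate FIFO queue per work type (in-memory, for illustration)."""

    def __init__(self):
        self._queues = {name: deque() for name in QUEUES}

    def enqueue(self, queue, job_id):
        self._queues[queue].append(job_id)

    def next_job(self, allowed_queues):
        """Return (queue, job_id) from the first non-empty allowed queue."""
        for name in allowed_queues:
            pending = self._queues[name]
            if pending:
                return name, pending.popleft()
        return None  # nothing to do on this worker's queues

    def depth(self):
        """Number of waiting jobs per queue."""
        return {name: len(q) for name, q in self._queues.items()}
```

A worker dedicated to notifications would call next_job(["notifications"]) and would keep delivering email even while the content queue is backed up with transcoding work.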

Monitoring

Effective job monitoring helps you stay ahead of problems before they affect users. Key metrics to watch include:

  • Queue depth. How many jobs are waiting in each queue. A growing queue indicates that workers cannot keep up with demand.
  • Processing duration. How long jobs take to complete. Sudden increases may indicate performance degradation in a downstream service.
  • Failure rate. The ratio of failed jobs to total jobs. A rising failure rate warrants immediate investigation.
  • Lock age. How long active locks have been held. Locks held for much longer than the typical processing time suggest stuck jobs.
  • Retry count. How often jobs are being retried. Frequent retries indicate intermittent issues that may need attention.

Set up regular checks of the job administration interface during peak usage periods. Clearing stale locks and expiring stuck jobs proactively prevents backlogs from cascading into user-visible delays.
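
The metrics listed above could be computed from a snapshot of job records along the following lines. This is a sketch under stated assumptions: JobRecord, summarize, and the stuck_after_seconds threshold are hypothetical, and the failure rate here is taken over finished jobs (complete plus failed) rather than over every job in the snapshot.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class JobRecord:
    queue: str
    state: str  # "pending", "running", "complete", or "failed"
    locked_at: Optional[float] = None  # when the current lock was acquired
    retries: int = 0


def summarize(jobs: List[JobRecord], now: float,
              stuck_after_seconds: float) -> Dict[str, object]:
    """Compute queue depth, failure rate, lock age, and retry totals."""
    depth: Dict[str, int] = {}
    for job in jobs:
        if job.state == "pending":
            depth[job.queue] = depth.get(job.queue, 0) + 1

    finished = [j for j in jobs if j.state in ("complete", "failed")]
    failed = [j for j in finished if j.state == "failed"]

    lock_ages = [now - j.locked_at for j in jobs
                 if j.state == "running" and j.locked_at is not None]

    return {
        "queue_depth": depth,
        "failure_rate": len(failed) / len(finished) if finished else 0.0,
        "max_lock_age": max(lock_ages, default=0.0),
        # Locks held far longer than typical processing time suggest stuck jobs.
        "stuck_jobs": sum(1 for age in lock_ages if age > stuck_after_seconds),
        "total_retries": sum(j.retries for j in jobs),
    }
```

A reasonable value for stuck_after_seconds depends on the typical processing duration of each queue, so a single global threshold is a simplification.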