Jobs
Monitor and manage background tasks across the platform.
What Are Background Jobs?
Background jobs are asynchronous tasks that run outside of user-facing requests. They handle the heavy lifting that keeps the platform running: processing uploaded media, sending emails, indexing content for search, executing scripts, transforming analytics data, and computing audience segments, among other work. When you publish a video and it gets transcoded into multiple formats, a background job does that work. When a search index is rebuilt after content changes, a background job handles it.
Jobs run automatically in response to platform events, but administrators can also monitor their progress, investigate failures, and intervene when things go wrong.
Job Lifecycle
Every job follows a defined lifecycle from creation to completion:
- Pending. The job has been created and is waiting to be picked up by a worker. Jobs in this state are sitting in a queue, ordered by priority and creation time.
- Running. A worker has claimed the job and is actively processing it. The job holds a lock that prevents other workers from picking up the same task.
- Complete. The job finished successfully. Its results (if any) are stored, and any downstream actions, such as callbacks or child job creation, are triggered.
- Failed. The job encountered an error and could not complete. Depending on the type of failure, the job may be retried automatically or marked as permanently failed.
| State | Description | What Happens Next |
|---|---|---|
| Pending | Queued and waiting for a worker | Picked up by the next available worker |
| Running | Actively being processed | Completes successfully or fails |
| Complete | Finished successfully | Triggers callbacks and child jobs if applicable |
| Failed | Encountered an error | Retried automatically (transient) or stopped (permanent) |
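The lifecycle above can be read as a small state machine. The following is a minimal sketch, not the platform's implementation: `JobState`, `ALLOWED`, and `transition` are hypothetical names, and the `FAILED -> PENDING` edge is an assumption about how an automatic retry re-enters the queue.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETE = "complete"
    FAILED = "failed"


# Allowed transitions, read off the lifecycle table. The FAILED -> PENDING
# edge models an automatic retry after a transient failure (an assumption).
ALLOWED = {
    JobState.PENDING: {JobState.RUNNING},
    JobState.RUNNING: {JobState.COMPLETE, JobState.FAILED},
    JobState.COMPLETE: set(),
    JobState.FAILED: {JobState.PENDING},
}


def transition(current: JobState, target: JobState) -> JobState:
    """Return the target state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the allowed transitions in one table makes it easy to reject impossible moves, such as a completed job returning to the running state.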
Distributed Execution
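In a production deployment, multiple worker processes run in parallel across different servers. The job system ensures that only one worker processes a given job at a time, even when many workers are competing for tasks. Combined with the idempotency described below, this means each job takes effect once, even if it has to be re-run after a worker crash.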
- Lock management. When a worker picks up a job, it acquires an exclusive lock on that job. No other worker can claim the same job while the lock is held. Locks have expiration times to handle the case where a worker crashes without releasing its lock.
- Work distribution. Jobs are distributed across workers based on queue assignment and availability. This helps balance load and keeps any single worker from becoming a bottleneck.
- Idempotency. Jobs are designed to be safely re-executed if a worker fails mid-processing. This means the system can recover from worker crashes without producing duplicate results or corrupted data.
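To illustrate how an expiring lock lets another worker take over after a crash, here is a minimal in-memory sketch. `LockTable`, its methods, and the `ttl_seconds` parameter are hypothetical; a real deployment would keep locks in a shared store with an atomic compare-and-set, which this example does not attempt.

```python
import time
from typing import Dict, Optional, Tuple


class LockTable:
    """In-memory stand-in for a shared lock store (hypothetical)."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        # job_id -> (worker_id, expires_at)
        self._locks: Dict[str, Tuple[str, float]] = {}

    def acquire(self, job_id: str, worker_id: str, now: Optional[float] = None) -> bool:
        """Claim the job unless another unexpired lock exists."""
        now = time.monotonic() if now is None else now
        held = self._locks.get(job_id)
        if held is not None and held[1] > now:
            return False  # an unexpired lock is still held
        # No lock, or the previous holder's lock expired (e.g. it crashed).
        self._locks[job_id] = (worker_id, now + self.ttl)
        return True

    def release(self, job_id: str, worker_id: str) -> bool:
        """Release the lock only if this worker is the current holder."""
        held = self._locks.get(job_id)
        if held is None or held[0] != worker_id:
            return False
        del self._locks[job_id]
        return True
```

Because an expired lock can be taken over while the original worker is merely slow rather than dead, the job body still has to be idempotent, as the last bullet describes.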
Parent-Child Jobs
Complex workflows often require multiple processing steps that must execute in a specific order. The job system supports this through parent-child relationships and callbacks:
- Child jobs. A running job can create one or more child jobs that handle sub-tasks. For example, a content publishing job might spawn separate child jobs for generating thumbnails, building search indexes, and sending notifications.
- Callbacks. When a child job completes, the parent job is notified through a callback mechanism. The parent can then inspect the child's results and decide how to proceed.
- Cascading completion. A parent job can wait for all of its children to complete before marking itself as finished. This ensures that multi-step workflows complete fully before triggering downstream processes.
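The parent-child flow can be sketched as follows. This is an illustration under assumptions, not the platform's API: the `Job` class and its `spawn`, `complete`, and `on_child_complete` methods are hypothetical, and the parent here finishes automatically once every child has completed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Job:
    name: str
    parent: Optional["Job"] = field(default=None, repr=False)
    state: str = "running"
    children: List["Job"] = field(default_factory=list)
    child_results: Dict[str, object] = field(default_factory=dict)

    def spawn(self, name: str) -> "Job":
        """Create a child job for a sub-task."""
        child = Job(name=name, parent=self)
        self.children.append(child)
        return child

    def complete(self, result: object = None) -> None:
        """Mark this job complete and notify the parent, if any."""
        self.state = "complete"
        if self.parent is not None:
            self.parent.on_child_complete(self, result)

    def on_child_complete(self, child: "Job", result: object) -> None:
        """Callback: record the child's result, then cascade if all are done."""
        self.child_results[child.name] = result
        if all(c.state == "complete" for c in self.children):
            self.complete(self.child_results)
```

For example, a publishing job stays in the running state until its last child reports back:

```python
publish = Job("publish")
thumbs = publish.spawn("thumbnails")
index = publish.spawn("search-index")
thumbs.complete("3 thumbnails")   # publish is still running
index.complete("indexed")         # publish now completes
```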
Retry and Failure Handling
Not all failures are created equal. The job system distinguishes between transient failures that are likely to resolve on their own and permanent failures that require human intervention.
- Transient failures. Network timeouts, temporary service unavailability, and resource contention are treated as transient. The job system automatically retries these jobs after a delay, giving the underlying issue time to resolve.
- Permanent failures. Invalid data, missing dependencies, and logic errors are treated as permanent. These jobs are marked as failed immediately without retrying, since repeating the same operation would produce the same error.
- Retry limits. Transient failures have a maximum retry count. If a job continues to fail after exhausting its retries, it is marked as permanently failed and flagged for administrator review.
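The retry rules above can be sketched as a small wrapper. Everything named here is hypothetical: `TransientError`, `PermanentError`, `run_with_retries`, and the `needs_review` flag are illustrative, and the exponential backoff is an assumption, since the text only says that retries happen after a delay.

```python
import time
from typing import Callable


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a timeout)."""


class PermanentError(Exception):
    """Hypothetical marker for failures that retrying cannot fix."""


def run_with_retries(
    task: Callable[[], object],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Run task, retrying transient failures up to max_retries times."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return {"state": "complete", "attempts": attempts, "result": task()}
        except PermanentError as exc:
            # Repeating the same operation would produce the same error.
            return {"state": "failed", "attempts": attempts,
                    "error": str(exc), "needs_review": False}
        except TransientError as exc:
            retries_used = attempts - 1
            if retries_used >= max_retries:
                # Retries exhausted: flag for administrator review.
                return {"state": "failed", "attempts": attempts,
                        "error": str(exc), "needs_review": True}
            # Assumed backoff schedule: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** retries_used)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.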
Administrative Operations
Administrators have several tools for managing the job system:
- Expire stuck jobs. Jobs that have been in the running state for much longer than their typical processing time can be forcibly expired, releasing their locks and allowing them to be retried or marked as failed.
- Clear stale locks. When workers crash or are restarted, they may leave behind locks that prevent other workers from picking up jobs. Administrators can clear these stale locks to restore normal processing.
- View job history. See the full history of jobs including their creation time, processing duration, completion status, and any error messages. Use this to identify patterns in failures or performance bottlenecks.
- Monitor active jobs. View all currently running and pending jobs to understand the current processing load and identify any jobs that may be stuck.
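As a rough illustration of the "expire stuck jobs" operation, the sketch below scans job records and releases any that have run past a threshold. `JobRecord`, `expire_stuck_jobs`, and `max_runtime` are hypothetical names, and returning expired jobs to the pending state is an assumption; the platform may instead mark them as failed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobRecord:
    job_id: str
    state: str
    started_at: Optional[float] = None
    locked_by: Optional[str] = None


def expire_stuck_jobs(jobs: List[JobRecord], now: float, max_runtime: float) -> List[str]:
    """Release running jobs older than max_runtime; return their IDs."""
    expired = []
    for job in jobs:
        if job.state != "running" or job.started_at is None:
            continue
        if now - job.started_at > max_runtime:
            job.locked_by = None      # release the stale lock
            job.started_at = None
            job.state = "pending"     # assumption: expired jobs are re-queued
            expired.append(job.job_id)
    return expired
```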
Job Queues
The platform organizes jobs into separate queues based on the type of work being performed. This separation ensures that a spike in one type of processing does not block other types from making progress.
| Queue | Work Type |
|---|---|
| Content | Media transcoding, document processing, thumbnail generation, supplementary file handling. |
| Analytics | Data transforms, aggregation jobs, report generation. |
| Segmentation | Audience segment computation, membership recalculation. |
| Indexing | Search index updates, content reindexing operations. |
| Notifications | Email delivery, push notifications, webhook dispatches. |
Workers can be configured to process jobs from specific queues, allowing you to dedicate more resources to high-priority or resource-intensive work types.
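A worker that serves only some queues might pick its next job as in the sketch below. The `QueueSet` class is hypothetical, lower priority numbers are assumed to mean more urgent work, and a sequence counter stands in for creation time.

```python
import heapq
import itertools
from typing import Dict, Iterable, List, Optional, Tuple


class QueueSet:
    """Named queues, each ordered by priority and then insertion order."""

    def __init__(self, names: Iterable[str]) -> None:
        self._heaps: Dict[str, List[Tuple[int, int, str]]] = {n: [] for n in names}
        self._seq = itertools.count()  # stand-in for creation time

    def enqueue(self, queue: str, job_id: str, priority: int = 10) -> None:
        heapq.heappush(self._heaps[queue], (priority, next(self._seq), job_id))

    def next_job(self, subscribed: Iterable[str]) -> Optional[Tuple[str, str]]:
        """Pop the most urgent job among the queues this worker serves."""
        candidates = [(self._heaps[q][0], q) for q in subscribed if self._heaps[q]]
        if not candidates:
            return None
        _, queue = min(candidates)
        _, _, job_id = heapq.heappop(self._heaps[queue])
        return queue, job_id

    def depth(self, queue: str) -> int:
        return len(self._heaps[queue])
```

A worker dedicated to the Content queue never sees notification jobs, so a burst of transcoding work cannot delay email delivery handled by other workers.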
Monitoring
Effective job monitoring helps you stay ahead of problems before they affect users. Key metrics to watch include:
- Queue depth. How many jobs are waiting in each queue. A growing queue indicates that workers cannot keep up with demand.
- Processing duration. How long jobs take to complete. Sudden increases may indicate performance degradation in a downstream service.
- Failure rate. The ratio of failed jobs to total jobs. A rising failure rate warrants immediate investigation.
- Lock age. How long active locks have been held. Locks held for much longer than the typical processing time suggest stuck jobs.
- Retry count. How often jobs are being retried. Frequent retries indicate intermittent issues that may need attention.
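The metrics above could be derived from a snapshot of job records roughly as follows. `JobSample` and `summarize` are hypothetical, and the field names are assumptions rather than the platform's schema.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class JobSample:
    queue: str
    state: str  # "pending", "running", "complete", or "failed"
    started_at: Optional[float] = None
    finished_at: Optional[float] = None
    retries: int = 0


def summarize(jobs: List[JobSample], now: float) -> dict:
    """Compute the five monitoring metrics from a list of job samples."""
    pending = Counter(j.queue for j in jobs if j.state == "pending")
    durations = [
        j.finished_at - j.started_at
        for j in jobs
        if j.state == "complete" and j.started_at is not None and j.finished_at is not None
    ]
    lock_ages = [
        now - j.started_at
        for j in jobs
        if j.state == "running" and j.started_at is not None
    ]
    failed = sum(1 for j in jobs if j.state == "failed")
    return {
        "queue_depth": dict(pending),
        "mean_duration": mean(durations) if durations else None,
        "failure_rate": failed / len(jobs) if jobs else 0.0,
        "oldest_lock_age": max(lock_ages, default=0.0),
        "total_retries": sum(j.retries for j in jobs),
    }
```

Comparing `oldest_lock_age` with `mean_duration` gives a quick signal for stuck jobs: a lock many times older than the typical processing time deserves a look.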