
Worker Issues

Troubleshooting guide for the Marqov platform worker.


Worker Not Picking Up Jobs

Symptoms

Jobs stay in pending status indefinitely, and the worker logs show no job processing.

Diagnosis

  1. Check worker health:

    curl http://localhost:8080/health

    If this fails, the worker process is not running or is not reachable on port 8080.

  2. Check Supabase connection: Verify NEXT_PUBLIC_SUPABASE_URL and SUPABASE_SERVICE_ROLE_KEY are set correctly. The worker uses the service role key to poll for pending jobs.

  3. Check for pending jobs: In the Supabase SQL Editor:

    SELECT id, status, backend, created_at
    FROM job_runs
    WHERE status = 'pending'
    ORDER BY created_at DESC
    LIMIT 10;

  4. Check worker logs:

    • Local: Check terminal output.
    • ECS: Check CloudWatch Logs for the marqov-worker task.
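The credential check in step 2 can be scripted. The following is a minimal sketch, not part of the worker itself: the helper name `missing_env_vars` and the `REQUIRED_VARS` tuple are illustrative assumptions, and the check only confirms that the variables are non-empty, not that their values are valid for your Supabase project.

```python
import os

# Variables this guide names for the Supabase connection.
REQUIRED_VARS = ("NEXT_PUBLIC_SUPABASE_URL", "SUPABASE_SERVICE_ROLE_KEY")


def missing_env_vars(names, env=None):
    """Return the names that are unset or blank in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]
```

Calling `missing_env_vars(REQUIRED_VARS)` in the worker's environment returns an empty list when both variables are present. Running it inside the container (rather than on your workstation) checks what the worker process actually sees.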

Common causes

| Cause | Fix |
| --- | --- |
| Worker process crashed | Restart the worker. Check logs for the crash reason. |
| Wrong Supabase credentials | Verify SUPABASE_SERVICE_ROLE_KEY matches the production project. |
| Network issue to Supabase | Check DNS resolution and firewall rules from the worker’s network. |
| Worker polling interval | The worker polls on a fixed interval. Wait for the next poll cycle. |

Temporal Connection Failures

Symptoms

Worker starts but workflow execution fails. Jobs move to failed status with Temporal connection errors.

Diagnosis

  1. Check Temporal address:

    echo $TEMPORAL_ADDRESS
    • Local: should be localhost:7233
    • Production: should be the Temporal Cloud address
  2. Check Temporal is running (local):

    docker ps | grep temporal

    If no containers, start with:

    cd docker && docker-compose up -d
  3. Check namespace:

    echo $TEMPORAL_NAMESPACE

    Ensure it matches an existing namespace.

  4. Check API key (production): Ensure TEMPORAL_API_KEY is set and not expired. Temporal Cloud API keys have expiration dates.
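Before looking at namespaces or API keys, it can help to confirm that the address in TEMPORAL_ADDRESS accepts TCP connections at all. The sketch below uses only the standard library; `split_address` and `can_connect` are hypothetical helper names. A successful connection shows only that something is listening on that host and port. It does not validate TLS, the namespace, or the API key.

```python
import socket


def split_address(address):
    """Split 'host:port' into (host, port), raising ValueError if malformed."""
    host, _, port = address.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {address!r}")
    return host, int(port)


def can_connect(address, timeout=3.0):
    """Return True if a TCP connection to address succeeds within timeout."""
    host, port = split_address(address)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False
```

For a local setup, `can_connect("localhost:7233")` returning False suggests the Temporal containers are not running or the port is not published.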

Common causes

| Cause | Fix |
| --- | --- |
| Temporal not running | docker-compose up -d from docker/ directory |
| Wrong address | Fix TEMPORAL_ADDRESS environment variable |
| API key expired | Regenerate key in Temporal Cloud dashboard |
| Namespace mismatch | Fix TEMPORAL_NAMESPACE to match existing namespace |
| TLS issues | Temporal Cloud requires TLS. Ensure the SDK auto-negotiates correctly with TEMPORAL_API_KEY. |

AWS Credential Issues

Symptoms

Jobs fail when targeting AWS Braket backends (sv1, dm1, ionq-*, etc.) with credential errors.

Error messages

botocore.exceptions.NoCredentialsError: Unable to locate credentials
botocore.exceptions.ClientError: An error occurred (AccessDeniedException)

Diagnosis

  1. Check credentials are set:

    echo $AWS_ACCESS_KEY_ID
    echo $AWS_REGION

  2. Verify IAM permissions: The worker needs:

    • braket:* for submitting quantum tasks
    • s3:PutObject and s3:GetObject on the results bucket
    • iam:PassRole if using a custom execution role
  3. Check S3 bucket:

    echo $BRAKET_S3_BUCKET

    The bucket must exist in the same region and be configured for Braket.

Common causes

| Cause | Fix |
| --- | --- |
| Missing credentials | Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY |
| Wrong region | Set AWS_REGION to match the target device region |
| S3 bucket doesn’t exist | Create the bucket or fix BRAKET_S3_BUCKET |
| IAM permissions | Add required permissions to the worker’s IAM role |
| Cross-region access | Some QPUs are in specific regions (e.g., IQM in eu-north-1). The BraketExecutor handles cross-region access automatically by extracting the region from the device ARN. |
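To illustrate what extracting the region from a device ARN involves, here is a minimal sketch. It is not the BraketExecutor implementation: the function name `region_from_device_arn` and the `default_region` fallback are assumptions made for this example. The sketch relies on the standard ARN layout, where the region is the fourth colon-separated field. That field is empty for the on-demand simulators, so a fallback is needed.

```python
def region_from_device_arn(device_arn, default_region="us-east-1"):
    """Return the region field of a Braket device ARN.

    Falls back to default_region (an assumed default for this sketch) when
    the ARN carries no region, as with the on-demand simulators.
    """
    parts = device_arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "braket":
        raise ValueError(f"not a Braket device ARN: {device_arn!r}")
    return parts[3] or default_region
```

When debugging a region error, comparing this value with AWS_REGION and with the region of BRAKET_S3_BUCKET shows quickly whether a job is crossing regions.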

Docker Build Failures

Symptoms

docker build fails during the worker image build.

Common issues

Wrong platform:

ERROR: failed to solve: python:3.12-slim: no match for platform

Fix: Add --platform linux/amd64 to the build command.

ECR push 403 (attestation manifest):

403 Forbidden: manifest unknown

Fix: Add --provenance=false --sbom=false to the build command.

pip install failures:

ERROR: Could not find a version that satisfies the requirement ...

Fix: Check platform/requirements-worker-full.txt for version conflicts. Ensure the package index is accessible from the build environment.

marqov install failure:

ERROR: ... covalent ...

Fix: Ensure pip install --no-deps --no-cache-dir . is used (the --no-deps flag skips covalent, which is no longer needed).

Complete build command

docker build \
  --platform linux/amd64 \
  --provenance=false \
  --sbom=false \
  -t marqov-worker .

Health Check Failures

Symptoms

ECS marks the task as unhealthy and restarts it repeatedly. The service never reaches a steady state.

Diagnosis

  1. Check the health endpoint manually:

    curl -v http://localhost:8080/health
  2. Check if port 8080 is exposed: The Dockerfile exposes port 8080. Ensure the ECS task definition maps this port.

  3. Check start period: The Docker HEALTHCHECK has a 5-second start period. If the worker takes longer to initialize, increase this value.

  4. Check resource limits: If the container is OOM-killed during startup, increase the ECS task memory allocation.
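To choose a start period with some evidence behind it, you can measure how long the worker takes to answer the health endpoint after the container starts. The sketch below is one way to do that with the standard library; `seconds_until_healthy` is a hypothetical helper, and the timeout and interval defaults are arbitrary choices for illustration.

```python
import time
import urllib.error
import urllib.request

# The endpoint is local, so bypass any HTTP proxy configured in the environment.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def seconds_until_healthy(url, timeout=60.0, interval=0.5):
    """Poll url until it returns HTTP 200; return elapsed seconds, or None on timeout."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            with _opener.open(url, timeout=2) as response:
                if response.status == 200:
                    return time.monotonic() - start
        except (urllib.error.URLError, OSError):
            pass  # not listening yet, or the endpoint returned an error status
        time.sleep(interval)
    return None
```

Start the container, then call `seconds_until_healthy("http://localhost:8080/health")` right away. If the result is consistently above 5 seconds, the start period is likely too short for this image.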

ECS-specific checks

# Check service events
aws ecs describe-services \
  --cluster marqov-production \
  --services marqov-worker \
  --region us-east-1 \
  --query 'services[0].events[:5]'

# Check task status
aws ecs list-tasks \
  --cluster marqov-production \
  --service-name marqov-worker \
  --region us-east-1

Worker Crashes on Startup

Common causes

| Error | Cause | Fix |
| --- | --- | --- |
| ModuleNotFoundError: No module named 'marqov' | marqov not installed in the image | Ensure pip install --no-deps . runs in Dockerfile |
| ImportError: cannot import name ... | Version mismatch between marqov and worker code | Rebuild the image with latest code |
| KeyError: 'SUPABASE_SERVICE_ROLE_KEY' | Missing environment variable | Add it to ECS task definition |
| ConnectionRefusedError | Temporal not reachable on startup | Check TEMPORAL_ADDRESS; consider adding retry logic |
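The retry logic suggested for ConnectionRefusedError could look like the sketch below, which retries with exponential backoff. It is an illustration rather than the worker's actual startup code: `retry_with_backoff` and its parameters are hypothetical, and `connect` stands for whatever callable your worker uses to open the Temporal connection.

```python
import time


def retry_with_backoff(connect, attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call connect() until it succeeds, doubling the wait after each refused connection."""
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionRefusedError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            # Wait 1s, 2s, 4s, ... capped at max_delay.
            sleep(min(base_delay * 2 ** attempt, max_delay))
```

Capping the number of attempts keeps a misconfigured TEMPORAL_ADDRESS visible: the task still fails, but only after transient startup ordering problems have had time to clear.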

Performance Issues

Jobs taking too long

  1. Check queue depth: QPU backends can have long queues. Use BraketExecutor.get_queue_depth() or check the AWS Braket console.

  2. Check worker concurrency: The worker processes one job at a time by default. For parallel execution, scale the ECS service to multiple tasks.

  3. Check network latency: The worker makes HTTP calls to Supabase, Temporal, and AWS Braket. High latency on any of these increases total job time.

Memory issues

Large circuits or many-qubit simulations can consume significant memory. The local simulator (LocalExecutor) uses state vector simulation, which requires 2^n * 16 bytes for n qubits. For example:

  • 20 qubits: ~16 MB
  • 25 qubits: ~512 MB
  • 30 qubits: ~16 GB

Monitor container memory usage and increase ECS task memory if needed.
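The 2^n * 16 bytes formula can be turned into a quick sizing helper when choosing ECS task memory. This is a sketch with hypothetical function names. It estimates only the state vector itself, so the real process needs headroom on top for the Python runtime and any intermediate buffers.

```python
def state_vector_bytes(n_qubits, bytes_per_amplitude=16):
    """Memory for a full state vector: 2^n amplitudes of 16 bytes each."""
    return (2 ** n_qubits) * bytes_per_amplitude


def max_qubits_for_memory(available_bytes, bytes_per_amplitude=16):
    """Largest n whose state vector fits in available_bytes, or None if none fits."""
    amplitudes = available_bytes // bytes_per_amplitude
    if amplitudes < 1:
        return None
    # bit_length() - 1 is floor(log2(amplitudes)) for positive integers.
    return amplitudes.bit_length() - 1
```

For example, `state_vector_bytes(25)` is 512 * 2**20 bytes, which matches the 25-qubit figure above, and `max_qubits_for_memory(8 * 2**30)` gives 29 for a task with 8 GiB available to the state vector.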
