Worker Issues
Troubleshooting guide for the Marqov platform worker.
Worker Not Picking Up Jobs
Symptoms
Jobs stay in pending status indefinitely. No worker logs show job processing.
Diagnosis
- Check worker health:
  curl http://localhost:8080/health
  If this fails, the worker process is not running.
- Check Supabase connection: Verify NEXT_PUBLIC_SUPABASE_URL and SUPABASE_SERVICE_ROLE_KEY are set correctly. The worker uses the service role key to poll for pending jobs.
- Check for pending jobs: In the Supabase SQL Editor:
  SELECT id, status, backend, created_at FROM job_runs WHERE status = 'pending' ORDER BY created_at DESC LIMIT 10;
- Check worker logs:
  - Local: Check terminal output.
  - ECS: Check CloudWatch Logs for the marqov-worker task.
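The first two checks can also be scripted. The following is a minimal sketch, assuming the health endpoint returns a 2xx status when the worker is up; the function names `worker_is_healthy` and `missing_env_vars` are illustrative and not part of the Marqov codebase.

```python
import os
import urllib.request

# Variables named in the Supabase check above.
REQUIRED_VARS = ("NEXT_PUBLIC_SUPABASE_URL", "SUPABASE_SERVICE_ROLE_KEY")


def worker_is_healthy(url="http://localhost:8080/health", timeout=3.0):
    """Return True if the health endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError, refused connections and timeouts all derive from OSError.
        return False


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]
```

Calling `worker_is_healthy()` and `missing_env_vars()` from the worker's own shell or container checks the same environment the worker sees, which is usually what matters when credentials differ between hosts.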
Common causes
| Cause | Fix |
|---|---|
| Worker process crashed | Restart the worker. Check logs for the crash reason. |
| Wrong Supabase credentials | Verify SUPABASE_SERVICE_ROLE_KEY matches the production project. |
| Network issue to Supabase | Check DNS resolution and firewall rules from the worker’s network. |
| Worker polling interval | The worker polls on a fixed interval. Wait for the next poll cycle. |
Temporal Connection Failures
Symptoms
Worker starts but workflow execution fails. Jobs move to failed status with Temporal connection errors.
Diagnosis
- Check Temporal address:
  echo $TEMPORAL_ADDRESS
  - Local: should be localhost:7233
  - Production: should be the Temporal Cloud address
- Check Temporal is running (local):
  docker ps | grep temporal
  If no containers are listed, start them with:
  cd docker && docker-compose up -d
- Check namespace:
  echo $TEMPORAL_NAMESPACE
  Ensure it matches an existing namespace.
- Check API key (production): Ensure TEMPORAL_API_KEY is set and not expired. Temporal Cloud API keys have expiration dates.
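The environment checks above can be combined into one pre-flight function. This is a sketch under stated assumptions: it treats any address other than localhost or 127.0.0.1 as Temporal Cloud, and it cannot detect an expired key or a namespace that does not exist. The name `temporal_config_problems` is hypothetical.

```python
import os

LOCAL_HOSTS = ("localhost", "127.0.0.1")


def temporal_config_problems(env=None):
    """Return a list of human-readable problems found in the Temporal settings."""
    env = os.environ if env is None else env
    problems = []

    address = env.get("TEMPORAL_ADDRESS", "")
    host, _, port = address.rpartition(":")
    if not address:
        problems.append("TEMPORAL_ADDRESS is not set")
    elif not host or not port.isdigit():
        problems.append(f"TEMPORAL_ADDRESS should look like host:port, got {address!r}")
    elif host not in LOCAL_HOSTS and not env.get("TEMPORAL_API_KEY"):
        # Assumption: a non-local address means Temporal Cloud, which needs an API key.
        problems.append("TEMPORAL_API_KEY is not set for a non-local address")

    if not env.get("TEMPORAL_NAMESPACE"):
        problems.append("TEMPORAL_NAMESPACE is not set")

    return problems
```

An empty list only means the variables are present and well-formed; whether the namespace exists and the key is still valid has to be confirmed against Temporal itself.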
Common causes
| Cause | Fix |
|---|---|
| Temporal not running | Run docker-compose up -d from the docker/ directory |
| Wrong address | Fix TEMPORAL_ADDRESS environment variable |
| API key expired | Regenerate key in Temporal Cloud dashboard |
| Namespace mismatch | Fix TEMPORAL_NAMESPACE to match existing namespace |
| TLS issues | Temporal Cloud requires TLS. Ensure the SDK auto-negotiates correctly with TEMPORAL_API_KEY. |
AWS Credential Issues
Symptoms
Jobs fail when targeting AWS Braket backends (sv1, dm1, ionq-*, etc.) with credential errors.
Error messages
botocore.exceptions.NoCredentialsError: Unable to locate credentials
botocore.exceptions.ClientError: An error occurred (AccessDeniedException)
Diagnosis
- Check credentials are set:
  echo $AWS_ACCESS_KEY_ID
  echo $AWS_REGION
- Verify IAM permissions: The worker needs:
  - braket:* for submitting quantum tasks
  - s3:PutObject and s3:GetObject on the results bucket
  - iam:PassRole if using a custom execution role
- Check S3 bucket:
  echo $BRAKET_S3_BUCKET
  The bucket must exist in the same region and be configured for Braket.
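For comparison against the worker's actual IAM role, the permissions listed above can be written out as a policy document. The sketch below only encodes the three items from the list; the bucket name and role ARN are placeholders, `worker_policy` is a hypothetical helper, and a real deployment may need further statements or tighter resource scoping.

```python
import json


def worker_policy(bucket, execution_role_arn=None):
    """Build an IAM policy document covering the permissions listed above."""
    statements = [
        # Submitting quantum tasks.
        {"Effect": "Allow", "Action": "braket:*", "Resource": "*"},
        # Reading and writing objects in the results bucket.
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject"],
            "Resource": f"arn:aws:s3:::{bucket}/*",
        },
    ]
    if execution_role_arn:
        # Only needed when a custom execution role is used.
        statements.append(
            {"Effect": "Allow", "Action": "iam:PassRole", "Resource": execution_role_arn}
        )
    return {"Version": "2012-10-17", "Statement": statements}


# Placeholder bucket name, for illustration only.
print(json.dumps(worker_policy("example-results-bucket"), indent=2))
```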
Common causes
| Cause | Fix |
|---|---|
| Missing credentials | Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY |
| Wrong region | Set AWS_REGION to match the target device region |
| S3 bucket doesn’t exist | Create the bucket or fix BRAKET_S3_BUCKET |
| IAM permissions | Add required permissions to the worker’s IAM role |
| Cross-region access | Some QPUs are in specific regions (e.g., IQM in eu-north-1). The BraketExecutor handles cross-region access automatically by extracting the region from the device ARN. |
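To illustrate the cross-region row, the region is the fourth colon-separated field of an ARN. The function below is a sketch of that idea and is not the BraketExecutor implementation; `region_from_device_arn` and its `default_region` fallback are assumptions. Simulator ARNs such as the one for sv1 leave the region field empty, which is why a fallback is needed.

```python
def region_from_device_arn(device_arn, default_region=None):
    """Extract the region from a Braket device ARN, falling back to default_region."""
    # ARN layout: arn:partition:service:region:account-id:resource
    parts = device_arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "braket":
        raise ValueError(f"not a Braket ARN: {device_arn!r}")
    # An empty region field (on-demand simulators) falls back to the default.
    return parts[3] or default_region
```

A caller would typically pass the value of AWS_REGION as `default_region`, so that region-less simulator ARNs resolve to the worker's configured region.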
Docker Build Failures
Symptoms
docker build fails during the worker image build.
Common issues
Wrong platform:
ERROR: failed to solve: python:3.12-slim: no match for platform
Fix: Add --platform linux/amd64 to the build command.
ECR push 403 (attestation manifest):
403 Forbidden: manifest unknown
Fix: Add --provenance=false --sbom=false to the build command.
pip install failures:
ERROR: Could not find a version that satisfies the requirement ...
Fix: Check platform/requirements-worker-full.txt for version conflicts. Ensure the package index is accessible from the build environment.
marqov install failure:
ERROR: ... covalent ...
Fix: Ensure pip install --no-deps --no-cache-dir . is used (the --no-deps flag skips covalent, which is no longer needed).
Complete build command
docker build \
--platform linux/amd64 \
--provenance=false \
--sbom=false \
-t marqov-worker .
Health Check Failures
Symptoms
ECS marks the task as unhealthy and restarts it repeatedly. The service never reaches a steady state.
Diagnosis
- Check the health endpoint manually:
  curl -v http://localhost:8080/health
- Check if port 8080 is exposed: The Dockerfile exposes port 8080. Ensure the ECS task definition maps this port.
- Check start period: The Docker HEALTHCHECK has a 5-second start period. If the worker takes longer to initialize, increase this value.
- Check resource limits: If the container is OOM-killed during startup, increase the ECS task memory allocation.
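To choose a start period, it helps to measure how long the worker actually takes to become healthy. The sketch below polls a probe until it first succeeds and reports the elapsed time. `seconds_until_healthy` is a hypothetical helper, and `probe` is any zero-argument callable that returns True once the health endpoint responds.

```python
import time


def seconds_until_healthy(probe, timeout=60.0, interval=0.5,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

Start the container, run this immediately, and compare the result against the 5-second start period. A measured value close to or above it suggests the start period is too short.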
ECS-specific checks
# Check service events
aws ecs describe-services \
--cluster marqov-production \
--services marqov-worker \
--region us-east-1 \
--query 'services[0].events[:5]'
# Check task status
aws ecs list-tasks \
--cluster marqov-production \
--service-name marqov-worker \
--region us-east-1
Worker Crashes on Startup
Common causes
| Error | Cause | Fix |
|---|---|---|
| ModuleNotFoundError: No module named 'marqov' | marqov not installed in the image | Ensure pip install --no-deps . runs in the Dockerfile |
| ImportError: cannot import name ... | Version mismatch between marqov and worker code | Rebuild the image with the latest code |
| KeyError: 'SUPABASE_SERVICE_ROLE_KEY' | Missing environment variable | Add it to the ECS task definition |
| ConnectionRefusedError | Temporal not reachable on startup | Check TEMPORAL_ADDRESS; consider adding retry logic |
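The retry logic suggested for ConnectionRefusedError could take the form of exponential backoff around the connection call. This is a generic sketch: `connect` stands for whatever callable the worker uses to reach Temporal, and catching ConnectionRefusedError is an assumption based on the error named in the table; the actual client may raise a different exception type.

```python
import time


def connect_with_retry(connect, attempts=5, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call connect() until it succeeds, doubling the wait after each refusal."""
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionRefusedError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            # Waits base_delay, 2*base_delay, 4*base_delay, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

Bounding the number of attempts keeps a genuinely misconfigured TEMPORAL_ADDRESS visible as a crash instead of hiding it behind an endless retry loop.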
Performance Issues
Jobs taking too long
- Check queue depth: QPU backends can have long queues. Use BraketExecutor.get_queue_depth() or check the AWS Braket console.
- Check worker concurrency: The worker processes one job at a time by default. For parallel execution, scale the ECS service to multiple tasks.
- Check network latency: The worker makes HTTP calls to Supabase, Temporal, and AWS Braket. High latency on any of these increases total job time.
Memory issues
Large circuits or many-qubit simulations can consume significant memory. The local simulator (LocalExecutor) uses state vector simulation, which requires 2^n * 16 bytes for n qubits. For example:
- 20 qubits: ~16 MB
- 25 qubits: ~512 MB
- 30 qubits: ~16 GB
Monitor container memory usage and increase ECS task memory if needed.
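The figures above follow directly from the 2^n * 16 formula and can be reproduced or inverted with a few lines. In this sketch the helper names are illustrative, and the estimate covers only the state vector itself, not any other memory the worker process uses.

```python
BYTES_PER_AMPLITUDE = 16  # per-amplitude size from the formula above


def statevector_bytes(n_qubits):
    """Memory needed to hold a full state vector for n_qubits."""
    return (2 ** n_qubits) * BYTES_PER_AMPLITUDE


def max_qubits_for(memory_bytes):
    """Largest qubit count whose state vector fits in memory_bytes."""
    if memory_bytes < BYTES_PER_AMPLITUDE:
        raise ValueError("not enough memory for even a single amplitude")
    # floor(log2(memory_bytes / 16)) computed with integer arithmetic
    return (memory_bytes // BYTES_PER_AMPLITUDE).bit_length() - 1


for n in (20, 25, 30):
    print(f"{n} qubits: {statevector_bytes(n) / 2**20:,.0f} MiB")
```

For example, `max_qubits_for(8 * 2**30)` gives the upper bound for an 8 GiB task. In practice the usable limit is lower, because the container needs headroom beyond the state vector.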