Understanding Celery

I’m trying to understand Celery and its implications here, and I have a couple of questions:

  1. Currently, the image used for the worker is the same as the one for the “main” instance. Does this mean the worker image contains a bunch of frontend code that is never used? Is part of the effort to rearrange the code into frontend/backend intended to allow a separate worker image that only has the backend code?
  2. My understanding is that the main:worker relationship is currently 1:1, so while the worker operates on a task queue, only one task can be performed at a time (hence messages like “the workbench is busy, come back later”). If more workers were added, would the bottleneck then become interactions with the database (locks, etc.)?

Hi @markp,

Thanks for your questions!

Since the release of v7.6.1, the Docker image used for the Celery worker is the same as the one used for the main app. This means the worker also includes frontend code that it never actually uses. The recent effort to separate the frontend and backend code will allow us to explore creating a dedicated worker image in the future. A simpler image could contain only the backend code and the dependencies needed for Celery. We would love to reduce the image size, speed up deployments, and improve security by removing unnecessary components from the worker environment.

Currently, there’s just one Celery worker running, and only one background task is handled at a time. That’s why users see a “the workbench is busy” message if they try to start another task. The full text is: “If this message persists for longer than 30 seconds, the {validation/upload} process is busy with another Data Set. Please try again later.”

If you start more workers, you can run several tasks at once. As you mentioned, you then have to be careful with the database: if many tasks try to change the same data, they could run into issues such as database locks or conflicting writes. The next big challenge after adding more workers is making sure the database and the tasks are set up to safely handle multiple operations happening at the same time.
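To illustrate the kind of safeguard this implies, here is a minimal sketch of a "one task per data set" guard. It uses only the Python standard library, and the names `DatasetBusy` and `run_exclusive` are hypothetical; this is not how Specify implements it.

```python
import threading
from collections import defaultdict


class DatasetBusy(Exception):
    """Raised when another task already holds the data set (hypothetical name)."""


# One lock per data set id; the registry itself is guarded by its own lock.
_registry_guard = threading.Lock()
_dataset_locks = defaultdict(threading.Lock)


def run_exclusive(dataset_id, task):
    """Run task() only if no other task is currently working on dataset_id."""
    with _registry_guard:
        lock = _dataset_locks[dataset_id]
    # Non-blocking acquire: fail fast instead of waiting behind the other task.
    if not lock.acquire(blocking=False):
        raise DatasetBusy(f"data set {dataset_id} is busy, please try again later")
    try:
        return task()
    finally:
        lock.release()
```

Note that an in-process lock like this only protects tasks running inside the same process. With several worker processes or containers, the same idea would have to be enforced somewhere shared, for example with row-level locks in the database.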

Note that a single Celery worker (like the one used in Specify) is not inherently limited to one task at a time. A worker can run several processes or threads in parallel (limited by the number of CPU cores available to it), so it could process multiple tasks concurrently rather than one after another, subject to the risks mentioned above. We are planning to address ways to improve concurrency in upcoming releases (see #5337), and concurrent use is part of our third priority area for 2025.
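As a rough illustration of what worker concurrency buys, the toy model below treats a worker as a pool with a fixed number of slots. It uses the standard library's `ThreadPoolExecutor` rather than Celery itself, and `fake_task` and `run_batch` are made-up names for this sketch.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_task(seconds):
    """Stand-in for a background job such as a validation or upload."""
    time.sleep(seconds)
    return seconds


def run_batch(slots, n_tasks, seconds=0.2):
    """Run n_tasks through a pool with `slots` parallel slots.

    Returns (elapsed_seconds, results).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=slots) as pool:
        results = list(pool.map(fake_task, [seconds] * n_tasks))
    return time.perf_counter() - start, results


if __name__ == "__main__":
    serial, _ = run_batch(slots=1, n_tasks=4)
    parallel, _ = run_batch(slots=4, n_tasks=4)
    print(f"1 slot: {serial:.2f}s, 4 slots: {parallel:.2f}s")
```

In Celery itself, the number of slots is controlled by the `--concurrency` option of `celery worker` (or the `worker_concurrency` setting). Whether raising it is safe for Specify depends on the database concerns described above.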