Create the tasks table in chamberfile #6

Open
opened 2025-07-01 09:30:47 +00:00 by vaibhav · 0 comments
Owner

We could use a separate table to list the things we want to do, and then write a system that reads from that table and executes those tasks (more or less like a background job management system in a typical HTTP-based backend for a web service).
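To make the idea concrete, here is a minimal sketch of what such a table might look like, using Python's built-in sqlite3 module. The column names and statuses are purely illustrative assumptions, not the actual chamberfile schema:

```python
import sqlite3

# Illustrative only: one possible shape for a tasks table inside the
# chamberfile. Every column name here is an assumption for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tasks (
        id         INTEGER PRIMARY KEY,
        operation  TEXT NOT NULL,   -- e.g. 'read', 'write', 're-encrypt'
        payload    TEXT,            -- operation-specific details (JSON, say)
        status     TEXT NOT NULL DEFAULT 'pending',  -- pending/running/done/failed
        created_at TEXT NOT NULL DEFAULT (datetime('now'))
    )
""")

# Queueing a task is then just an INSERT.
conn.execute("INSERT INTO tasks (operation, payload) VALUES (?, ?)",
             ("re-encrypt", '{"new_key_id": 2}'))
row = conn.execute("SELECT operation, status FROM tasks").fetchone()
print(row)  # ('re-encrypt', 'pending')
```

Because the queue lives inside the same SQLite file as the data, enqueuing a task and the work it describes can share one transaction.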

Use case and benefit:

This helps any operation that runs for a long time and performs a series of time-consuming steps: for example, validating or verifying contents, re-encrypting the contents for a new key/password, or copying a large number of files in either direction. Such an operation can also demand a lot of storage. Breaking it into smaller tasks and keeping track of those is both easier and safer!

It would also allow us to defer the execution of one long-running operation until another finishes, helping us reach the end goal faster and more reliably.

Let's say we are trying to perform the following operations:

  1. Read (copy 100,000 files of varying sizes from the chamber to the disk) - In this case, we are "reading" from the chamberfile.
  2. Write (copy 50,000 files of varying sizes, but roughly the same cumulative size as the read operation) - In this case, we are "writing" to the chamberfile.
  3. Re-encrypt (this will both read and write all the files in the chamberfile) - We will read each file, encrypt it with a new key, store it, and update the fsindex (if applicable) accordingly. So we are doing both read and write.
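Each of these large operations can be broken into smaller task rows. The sketch below (a hypothetical helper, with the batch size and field names chosen only for illustration) splits one operation over 100,000 files into fixed-size batches:

```python
# Hypothetical sketch: split one large operation (e.g. "read" = copy
# 100,000 files out of the chamber) into smaller batch tasks. The batch
# size and the task fields are assumptions made for this example.
def make_batch_tasks(operation, total_files, batch_size=1000):
    tasks = []
    for start in range(0, total_files, batch_size):
        end = min(start + batch_size, total_files)
        tasks.append({
            "operation": operation,
            "first_file": start,       # index of first file in this batch
            "last_file": end - 1,      # index of last file in this batch
        })
    return tasks

tasks = make_batch_tasks("read", 100_000)
print(len(tasks))   # 100 batches of 1,000 files each
print(tasks[0])     # {'operation': 'read', 'first_file': 0, 'last_file': 999}
```

If the process is interrupted, only the in-flight batch needs to be retried; completed batches stay marked as done in the table.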

How having a tasks table would help

Imagine running these three operations at the same time. What are the challenges?

Keeping track of each operation on each file

We have to make sure that all the old files get re-encrypted, and that any new files created by the copy-in operation are encrypted with the new key only (while still guaranteeing that the old password keeps working until the re-encrypt is done). This is complicated to track. You can employ any number of strategies to deal with this - including a record-keeping table for each such operation, or a flag in the file_blobs table or the fsindex table, or anything else - but the complexity of such an operation will remain high. Breaking these long-running operations into smaller tasks that run in sequence avoids that complexity and improves reliability.

Total time taken would be lower

The storage medium is a SQLite file (a single one for now, at least until #7 gets completed). If you try to parallelize that many reads and writes, at least 3 factors kick in:

  1. The delays introduced purely by the storage medium: If you have an SSD connected to a fast enough bus (such as PCIe), you might think this will not have a significant impact on performance. But if the storage is a memory card (like a MicroSD), a USB pen drive, or an HDD with spinning platters, then parallel reads and writes are going to cause a lot of delay because of how such storage works.
  2. Resource allocation at the OS level: The OS scheduler has to share the same resource (your SQLite file) between readers and writers in a way that lets both proceed as fast as possible. Depending on how the scheduler works, access will get serialized anyway, or take a performance penalty at a minimum.
  3. Resource allocation and management in the code: By code, I mean both the logic we would write into our application (Chamber) and the SQLite implementation. The driver that handles operations against the database also introduces its own complexity: tracking which locks are active, acquiring and releasing them, managing transactions, and so on. On top of that, we have to write our own logic to manage multiple concurrent operations.

If we have a tasks table containing a sequential breakdown of what we intend to do (the three operations above), the total effort gets simpler: fewer locks and checks, and less load on both the OS and the storage medium.
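A minimal sketch of that sequential model (an assumed design, not Chamber's actual implementation): one worker drains the tasks table in insertion order, so only one long operation touches the SQLite file at a time.

```python
import sqlite3

def run_pending_tasks(conn, handlers):
    """Run pending tasks one at a time, oldest first (sketch only)."""
    completed = []
    while True:
        row = conn.execute(
            "SELECT id, operation FROM tasks "
            "WHERE status = 'pending' ORDER BY id LIMIT 1").fetchone()
        if row is None:
            break
        task_id, operation = row
        conn.execute("UPDATE tasks SET status = 'running' WHERE id = ?", (task_id,))
        handlers[operation](task_id)  # do the actual work for this task
        conn.execute("UPDATE tasks SET status = 'done' WHERE id = ?", (task_id,))
        completed.append(operation)
    return completed

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, "
             "operation TEXT, status TEXT DEFAULT 'pending')")
for op in ("read", "write", "re-encrypt"):
    conn.execute("INSERT INTO tasks (operation) VALUES (?)", (op,))

# Handlers are no-op stands-ins for the real read/write/re-encrypt logic.
order = run_pending_tasks(conn, {op: (lambda tid: None)
                                 for op in ("read", "write", "re-encrypt")})
print(order)  # ['read', 'write', 're-encrypt']
```

Since a single worker holds the write path, there is no contention between the three operations, and a crash leaves behind an obvious resume point (the first non-done row).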

Conclusion backing the decision

Given all the possible scenarios, use cases, and environments that Chamber might run under, it would serve us better to have a tasks table and use it for long-running operations.

NOTE: We will not work on this for the alpha release.

vaibhav added this to the Future project 2025-08-29 08:11:27 +00:00