SSD Particle Caching in cryoSPARC

Last Updated: March 9, 2020

Why is particle caching effective?

For classification, refinement, and reconstruction jobs that process particles, local SSDs on worker nodes can significantly speed up computation, because many cryo-EM algorithms access the data in random order and make multiple passes through it rather than reading it sequentially once. When you install cryoSPARC, you have the option of adding an ssd_path, a location on a fast drive on the worker node to which particles are copied, and from which they are read, during processing. CryoSPARC manages the SSD cache on each worker node transparently.

When you run jobs that have the Cache particle images on SSD option turned on, particles will be automatically copied to and read from the SSD path specified. Furthermore, if multiple jobs within the same project require the same particles, the cache will be re-used and the copying step is skipped. If more space is needed, previously cached data will be automatically deleted. Setting up an SSD cache is optional on a per-worker node basis, but it is highly recommended. Nodes reserved for pre-processing (motion correction, CTF estimation, particle picking, etc.) do not need to have an SSD.

A cryoSPARC job's stream log showing particles being cached on an SSD.

Hardware

The size of your typical cryo-EM single particle datasets will inform the size of SSD you choose. To store the largest particle stacks, we recommend 2 TB SSDs. You can calculate the size of a particle dataset with the following equation:

An equation to calculate the total size of a particle dataset.

For example: A 1,000,000-particle dataset with a box size of 256 will have a total size of 263.3 GB.

Solving the particle dataset size equation with 1MM particles at 256px.

For example: A 2,000,000-particle dataset with a box size of 432 will have a total size of 1.5 TB.

Solving the particle dataset size equation with 2MM particles at 432px.
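
The arithmetic behind these examples can be approximated in a few lines of Python. This is a rough sketch, assuming square images stored at 4 bytes per pixel (32-bit float) and ignoring file headers and metadata; the function name and the bytes_per_pixel parameter are illustrative and not part of cryoSPARC.

```python
def particle_stack_size_bytes(n_particles: int, box_size: int, bytes_per_pixel: int = 4) -> int:
    """Approximate on-disk size of a particle stack, ignoring headers.

    Assumes square images of box_size x box_size pixels and 4 bytes per
    pixel (32-bit float). Both are assumptions made for this sketch.
    """
    return n_particles * box_size * box_size * bytes_per_pixel


if __name__ == "__main__":
    for n, box in [(1_000_000, 256), (2_000_000, 432)]:
        size_gb = particle_stack_size_bytes(n, box) / 1e9  # decimal gigabytes
        print(f"{n:,} particles at {box} px: about {size_gb:,.1f} GB")
```

Under these assumptions the two datasets come out at roughly 262 GB and 1.49 TB, close to the totals quoted above. Because the size grows with the square of the box size, a modest increase in box size can move a dataset from fitting comfortably on an SSD to not fitting at all.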

Configuration

Installation

When installing cryoSPARC, use the --ssdpath parameter to specify the path to your SSD when you connect your worker to your instance. If you don't want to configure an SSD cache for a workstation node, specify the --nossd option instead.

bin/cryosparcw connect 
  --worker <worker_hostname> 
  --master <master_hostname> 
  --port <port_num>   
  --ssdpath <ssd_path>             : path to directory on local ssd

By default, if you specify the SSD path then the cache will be enabled with no quota or reserve.

Advanced Parameters

You can specify two advanced parameters to fine-tune your SSD cache:

--ssdquota: The maximum amount of space that cryoSPARC can use on the SSD (MB)

--ssdreserve: The minimum amount of free space to leave on the SSD (MB)

The above options are useful when you're setting up cryoSPARC on a common compute node that will share the SSD with other applications.
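
One way to read the two parameters together is sketched below. This is an interpretation of the descriptions above, not cryoSPARC's actual accounting: it assumes the quota caps the total size of the cache, the reserve caps how far free space may be drawn down, and it ignores the fact that previously cached data can be deleted to make room. The function and argument names are hypothetical.

```python
from typing import Optional


def space_available_mb(
    ssd_free_mb: float,
    cache_used_mb: float,
    quota_mb: Optional[float] = None,
    reserve_mb: Optional[float] = None,
) -> float:
    """Space (MB) the cache could still claim under a quota and a reserve.

    ssd_free_mb   : free space currently reported for the SSD
    cache_used_mb : space already occupied by cached data
    quota_mb      : cap on total cache size (None means no quota)
    reserve_mb    : free space that must remain untouched (None means no reserve)
    """
    # The reserve limits how far free space may be drawn down.
    limit = ssd_free_mb - (reserve_mb or 0.0)
    # The quota limits how large the cache itself may grow.
    if quota_mb is not None:
        limit = min(limit, quota_mb - cache_used_mb)
    return max(limit, 0.0)


if __name__ == "__main__":
    # Hypothetical shared node: 1.5 TB free, 200 GB already cached,
    # 1 TB quota, 100 GB reserve. The quota is the tighter constraint here.
    print(space_available_mb(1_500_000, 200_000, quota_mb=1_000_000, reserve_mb=100_000))
```

With neither option set, the whole free space counts as available, which matches the default of no quota and no reserve.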

Updating Configuration

You can update the SSD configuration at any time by running the connect command with the --update flag:

bin/cryosparcw connect
  --worker <worker_hostname>
  --master <master_hostname>
  --port <port_num>
  --update                         : update an existing worker configuration
  [--nossd]                        : connect worker with no SSD
  [--ssdpath <ssd_path> ]          : path to directory on local ssd
  [--ssdquota <ssd_quota_mb> ]     : quota of how much SSD space to use (MB)
  [--ssdreserve <ssd_reserve_mb> ] : minimum free space to leave on SSD (MB)

Use

Use the caching system when running a job

When you are running jobs that process particles (for example: Ab-Initio, Homogeneous Refinement, 2D Classification, 3D Variability), you will find a parameter at the bottom of the job builder under "Compute Settings" called Cache particle images on SSD. Turn this option off to load raw data from their original location instead.

The builder parameter in a cryoSPARC job that turns on or off particle caching.

Set a default parameter for the project

By default, the Cache particle images on SSD parameter is always on for every job you build, but if you'd like to keep this option off across all jobs in a project, you can set a project-level default by running the following command in a shell on the master node:

cryosparcm cli "set_project_param_default('PX', 'compute_use_ssd', False)"

where 'PX' is the Project UID you'd like to set the default for (e.g., 'P2').

You can undo this setting by running:

cryosparcm cli "unset_project_param_default('PX', 'compute_use_ssd')"

Tips and Tricks

Consolidating a Particle Stack

When a particle stack is larger than the space available on your SSD, you can optionally consolidate it before caching. This works if the current particle stack is a subset of the original particle stack. For example, if the cache reports that it is requesting to copy far more data than you expected ("SSD cache : cache requires 1000000.00 MB more on the SSD for files to be downloaded." and "SSD cache : cache successfully requested to check 2000000 files."), you can consolidate your particle stack so that only the particle subset you care about is cached.

You might run into this situation if you ran an "Inspect Picks" job after an "Extract From Micrographs" job, and you modified the picking thresholds of your particles to include a smaller subset than the original stack.

You might also run into this situation after a round of 2D Classification. When you select classes, you create metadata that specifies which subset of the particle stack to use. When using this particle subset in further processing, the caching system will require the entire stack of particles to be cached, even though only the smaller subset is required.

To consolidate your particle stack, build a "Downsample Particles" job, connect your particles, and run the job. There is no need to change any parameters: nothing about your particle dataset changes except the .cs metafile, which is recreated to reflect the smaller subset. You can use this smaller dataset to continue processing.
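
The reason a small selection can still trigger a large cache request is illustrated below. This is a toy model, assuming the cache copies whole files and that each particle's metadata points at the file holding its image; the file names, sizes, and the "path" field are all hypothetical.

```python
from typing import Dict, Iterable, List


def required_cache_mb(particles: Iterable[dict], file_sizes_mb: Dict[str, float]) -> float:
    """Total size of every distinct file referenced by at least one particle."""
    referenced = {p["path"] for p in particles}
    return sum(file_sizes_mb[path] for path in referenced)


# Hypothetical data: three stack files of 100 MB, four particles in each.
file_sizes_mb = {f"extract/mic_{i:03d}_particles.mrc": 100.0 for i in range(3)}
all_particles: List[dict] = [
    {"uid": i * 4 + j, "path": path}
    for i, path in enumerate(sorted(file_sizes_mb))
    for j in range(4)
]

# Keep one particle in four: the selection is 75% smaller ...
subset = [p for p in all_particles if p["uid"] % 4 == 0]

# ... but it still touches every file, so the cache request does not shrink.
print(required_cache_mb(all_particles, file_sizes_mb))  # 300.0
print(required_cache_mb(subset, file_sizes_mb))         # 300.0
```

Consolidation helps because the subset ends up described by its own files, so the cache request scales with the particles you kept rather than with the stack they were selected from.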

Troubleshooting

SSD cache : cache waiting for requested files to become unlocked.

This temporary message usually means that the files this job is trying to access are currently being cached by another job. For example, suppose you start two Refinement jobs (Job A and Job B) at the same time on the same node, both using a particle stack that hasn't been cached on the SSD before. Both jobs first try to copy all particles onto the SSD. If Job A acquires the lock on the files first, it starts copying them and Job B shows this message. When Job A finishes copying, it unlocks the files. Job B then proceeds, finds that the particles are already on the SSD, and skips the copy step.
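
The Job A / Job B sequence can be mimicked with two threads and a lock. This is an in-process analogy only: how cryoSPARC actually locks files across jobs is not described here, and every name in the sketch is made up for illustration.

```python
import threading

cached = set()            # stands in for the files already on the SSD
copies = []               # records which job actually performed the copy
lock = threading.Lock()   # stands in for the lock on the requested files


def run_job(name: str, stack: str) -> None:
    # Only one job at a time may hold the lock. The other waits here, which
    # corresponds to the "waiting for requested files to become unlocked" message.
    with lock:
        if stack in cached:
            return            # already on the SSD, so skip the copy step
        copies.append(name)   # simulate copying the particles
        cached.add(stack)


jobs = [
    threading.Thread(target=run_job, args=(name, "particles_J1"))
    for name in ("Job A", "Job B")
]
for t in jobs:
    t.start()
for t in jobs:
    t.join()

print(f"copied by: {copies}")
```

Whichever job wins the lock, exactly one copy takes place; the other job waits and then reuses the cached data.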

SSD cache : cache does not have enough space for download... but there are no files that can be deleted.

This message means that there is another cryoSPARC job or another application on the workstation taking up space on the SSD. If the former, the job showing this message will try to free up space as soon as it can, and it will continue processing. If there are files on the SSD that are not owned by cryoSPARC, it will not be able to delete them. It may be necessary to delete them manually.

FAQ

  • Is it safe to manually delete cache files for completed or unqueued/cleared jobs? Also, can I pre-cache with symlinks to skip caching?

    Yes, it is safe to delete cache files at any time, because the cache is read-only. And yes, the cache decides whether a file is already present based only on its path, size, and modification date, so symlinks should cause it to skip the copy. That said, it may be easier to set the SSD Cache parameter to False in each job that you queue up.

    Source: How to clear the cache in v2?
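
A check based on path, size, and modification date, as described in the answer above, might look like the following. This is a guess at the general shape of such a check rather than cryoSPARC's code; the looks_cached function and the file names are hypothetical, and the symlink step assumes a POSIX system.

```python
import os
import shutil
import tempfile


def looks_cached(source: str, cached_copy: str) -> bool:
    """True if cached_copy exists and matches source in size and modification time."""
    if not os.path.exists(cached_copy):
        return False
    src, dst = os.stat(source), os.stat(cached_copy)  # os.stat follows symlinks
    return src.st_size == dst.st_size and int(src.st_mtime) == int(dst.st_mtime)


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "particles.mrc")
    cached_copy = os.path.join(tmp, "cache_particles.mrc")
    with open(source, "wb") as f:
        f.write(b"\0" * 1024)

    before = looks_cached(source, cached_copy)      # nothing cached yet
    shutil.copy2(source, cached_copy)               # copy2 preserves the modification time
    after_copy = looks_cached(source, cached_copy)

    os.remove(cached_copy)
    os.symlink(source, cached_copy)                 # a symlink resolves to the source itself
    after_symlink = looks_cached(source, cached_copy)

print(before, after_copy, after_symlink)
```

Because a symlink reports the size and modification time of its target, a check of this kind cannot tell it apart from a real copy, which is why pre-placing symlinks would be expected to skip the copy step.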
