Tutorial: Building Jobs in cryoSPARC
One of cryoSPARC's staple features is its job builder, which lets you quickly create jobs by dragging and dropping the outputs of one job into the inputs of another. In cryoSPARC v2.3, we've revamped the job builder to make it more powerful: output data is easier to access, and you have more control over specific outputs.
Before we introduce the new job builder updates, let's dive into some details about how data is stored and used within cryoSPARC.
Inputs and Outputs in cryoSPARC
CryoSPARC handles bookkeeping and management of all files, inputs and outputs for every job type. The structure of inputs and outputs in cryoSPARC is designed to allow flexibility while removing ambiguities.
In cryoSPARC, the basic unit of data/metadata transferred between jobs is an item. Every item has a type, for example an exposure, particle, volume, mask, etc. Every item also has properties, with each property having a name (e.g. ctf) and sub-properties containing actual metadata values (e.g. ctf/astigmatism, etc.). A collection of items with the same type and same properties constitutes a dataset. A dataset is essentially a table, where each row is a single item and the columns are properties/sub-properties. Jobs input and output only datasets, so every type of data/metadata is stored in a dataset. On disk, datasets are stored in the .cs file format, which is a binary numpy format described in a later section.
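As a minimal sketch of the "dataset as a table" idea, the snippet below builds a small numpy structured array in which each row is an item and each column is a property/sub-property. The column names, dtypes, and values are illustrative assumptions, not cryoSPARC's actual schema.

```python
import numpy as np

# Hypothetical column layout: a uid plus two sub-properties of a "ctf" property.
# Field names and dtypes are assumptions made for illustration only.
particle_dtype = np.dtype([
    ("uid", np.uint64),
    ("ctf/defocus", np.float32),
    ("ctf/astigmatism", np.float32),
])

# Each row is one item; each column is a property/sub-property.
particles = np.array(
    [
        (101, 15000.0, 120.0),
        (102, 17500.0, 80.0),
        (103, 16200.0, 95.0),
    ],
    dtype=particle_dtype,
)

print(particles.dtype.names)         # the column names
print(particles["ctf/astigmatism"])  # one column across all items
print(particles[0])                  # one item (row)
```

Indexing by field name returns a whole column, while indexing by position returns a single item, which mirrors the row/column description above.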
Each job defines the outputs it creates in the following way:
- the job defines that it will output certain types of items
- for each type of item, the job defines certain properties that will be output. Each property is called a result. For example, a CTF estimation job would output a ctf result, containing sub-properties like defocus, astigmatism, spherical aberration, etc. The results that a job outputs are the basic component of what gets connected to other jobs.
- the job defines certain result-groups, each one a set of results that describe the same type of item. Thus a job can output a result-group defining particles, containing two results.
Each job also defines the inputs that it takes in:
- the job defines input-groups, each allowing a certain type of item like particles, volumes, etc.
- each input-group has slots, each taking in a particular kind of result. For example, an input-group taking in particles may have a slot for ctf and another for a different result.
- each input-group also defines the number of different result-groups that can be connected to it. In general, all the items from all connected result-groups are appended together to make one larger dataset that forms the input to the job. So for example, connecting two particle stacks to a single input-group will cause those stacks to be appended together.
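The appending behaviour described in the last bullet can be sketched with numpy as a row-wise concatenation of tables that share the same columns. The helper name `append_groups` and the column names are hypothetical; this illustrates the idea only and is not cryoSPARC's implementation.

```python
import numpy as np

# Two hypothetical particle stacks with identical columns.
dtype = np.dtype([("uid", np.uint64), ("ctf/defocus", np.float32)])
stack_a = np.array([(1, 15000.0), (2, 15200.0)], dtype=dtype)
stack_b = np.array([(3, 17000.0)], dtype=dtype)


def append_groups(*groups):
    """Append the rows of several same-typed tables into one larger table."""
    return np.concatenate(groups)


combined = append_groups(stack_a, stack_b)
print(combined["uid"])  # uids from both stacks, in connection order
```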
The reason for this abstraction of result-groups, etc. is that in cryoSPARC, most connections between jobs can be made simply at the group level, without having to specify particular files, paths, columns or rows in tables or text files. Subsets of datasets can be easily defined and passed around, and different subsets can be joined together as inputs to a further job. For advanced uses, however, the lower-level results allow a user to connect only certain metadata about an item from one job to another, or to override the metadata for certain properties in a result-group. Examples of how and when to use this capability follow.
To simplify long chains of processing, each job can input an arbitrary number of extra results that it doesn't actually need, and then output those results as "passthrough" metadata. The job neither reads nor modifies passthrough metadata; it just passes it along in its output so that subsequent jobs can use it without needing to be manually connected to an earlier output in the chain.
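A rough sketch of the passthrough idea follows. The hypothetical `run_job` function only declares two columns as required, adds one new column, and copies every other column to its output unchanged. All column names and the placeholder "score" are assumptions made for illustration.

```python
import numpy as np


def run_job(dataset, required=("uid", "blob/idx")):
    """Hypothetical job: needs only `required`, adds one new column,
    and passes every other column through untouched."""
    passthrough = [n for n in dataset.dtype.names if n not in required]
    # Output table = all input columns + one newly computed column.
    out_dtype = np.dtype(dataset.dtype.descr + [("alignments/score", np.float32)])
    out = np.zeros(len(dataset), dtype=out_dtype)
    for name in dataset.dtype.names:
        out[name] = dataset[name]  # required and passthrough columns are copied as-is
    out["alignments/score"] = np.arange(len(dataset), dtype=np.float32)  # placeholder result
    return out, passthrough


inp = np.array(
    [(1, 0, 15000.0), (2, 1, 16000.0)],
    dtype=[("uid", np.uint64), ("blob/idx", np.uint32), ("ctf/defocus", np.float32)],
)
out, passed = run_job(inp)
print(passed)              # columns the job did not need but still forwards
print(out["ctf/defocus"])  # unchanged values, available to later jobs
```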
.cs file format
CryoSPARC uses a simple common tabular format to store metadata about all types of items that are managed by the cryoSPARC system. Items include movies, micrographs, particles, volumes, and masks. Each item can have many different properties that are kept track of as the items progress through processing. Only some job types create items: Imports, particle extraction, ab-initio reconstruction, volume tools, etc. Most job types simply load items, process them to compute new properties of those items, and output the new properties. A collection of items of the same kind is called a dataset and can be represented in a single table of rows and columns.
In cryoSPARC, each item that is managed is assigned a unique identifier uid (a 64-bit integer). The uid is used to maintain correspondences across chains of processing jobs and to ensure that, regardless of the order in which a job outputs items, properties are always assigned to the correct item.
The tabular format that cryoSPARC uses for this metadata and uid is an array of C structures, implemented using numpy structured arrays. These arrays are stored in memory and on disk in the same format. On disk, we store these arrays in binary format in .cs files. Each .cs file in cryoSPARC contains a single table. Each row corresponds to a single item. A .cs file must contain a column for the uid of each item, and further columns define properties/sub-properties of that item. Multiple .cs files can therefore be used in aggregate to define all the properties of a set of items, since the rows in every table all have a uid that can be used to join the tables. In general, when multiple tables are used to specify a dataset, the dataset contains only the intersection of items included in each table.
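The uid-based join and its intersection semantics can be sketched as follows. The helper `join_on_uid` and the column names are hypothetical, and the sketch assumes uids are unique within each table; it is not cryoSPARC's own loading code.

```python
import numpy as np


def join_on_uid(left, right):
    """Join two tables on uid, keeping only items present in both."""
    # Common uids plus the row index of each one in `left` and in `right`.
    common, li, ri = np.intersect1d(left["uid"], right["uid"], return_indices=True)
    right_fields = [d for d in right.dtype.descr if d[0] != "uid"]
    out = np.zeros(len(common), dtype=np.dtype(left.dtype.descr + right_fields))
    for name in left.dtype.names:
        out[name] = left[name][li]
    for d in right_fields:
        out[d[0]] = right[d[0]][ri]
    return out


# Two hypothetical tables describing overlapping sets of items, in different row orders.
ctf = np.array(
    [(3, 300.0), (1, 100.0), (2, 200.0)],
    dtype=[("uid", np.uint64), ("ctf/defocus", np.float32)],
)
align = np.array(
    [(2, 0.25), (3, 0.5), (5, 0.75)],
    dtype=[("uid", np.uint64), ("alignments/shift", np.float32)],
)

joined = join_on_uid(ctf, align)
print(joined["uid"])  # only uids found in both tables
```

Items 1 and 5 each appear in only one table, so they are dropped from the joined dataset, and row order in the source tables does not affect which values end up on which item.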
New Job Outputs Tab
In cryoSPARC v2.3, there is now an outputs tab in the details view of every job. It contains sections for each output group, and within each section a list of all individual results, including passthroughs.
You can easily copy the path of an individual output or download the file directly using the copy and download buttons, respectively. It's also now possible to inspect or select different versions of an individual output by toggling the 'versions' section.
When building a job, you can drag and drop the header of the output group section to add the whole group. If you'd like to override a particular input slot, you can drag and drop the header of the individual output to a matching input slot. We'll see examples of how this can be useful later in this tutorial.
Updated Job Builder Inputs Section
CryoSPARC v2.3 also includes an update to the job builder's inputs section. In addition to removing an existing group, you can now clear individual input slots that are not required. You can use the outputs tab to drag and drop output groups and individual outputs into the matching slots. It is possible to override both optional and required input slots by connecting matching individual outputs.
There's now a requirements section for each group which specifies the minimum and maximum number of groups accepted, and whether or not repeat groups are accepted. The requirements section is highlighted in green when you start to drag a matching output group, and in red when that input group's requirements are not met.
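The kind of check behind the requirements section can be sketched as below. The `GroupRequirements` class, its field names, and the group identifiers are all hypothetical; the sketch only illustrates the three rules mentioned above (minimum count, maximum count, repeats) and is not cryoSPARC's validation logic.

```python
from dataclasses import dataclass


@dataclass
class GroupRequirements:
    min_groups: int      # fewest output groups that must be connected
    max_groups: int      # most output groups that may be connected
    allow_repeats: bool  # whether the same output group may be connected twice


def meets_requirements(req, connected):
    """Return True if the list of connected output groups satisfies `req`."""
    if not (req.min_groups <= len(connected) <= req.max_groups):
        return False
    if not req.allow_repeats and len(set(connected)) != len(connected):
        return False
    return True


req = GroupRequirements(min_groups=1, max_groups=2, allow_repeats=False)
print(meets_requirements(req, ["J1.particles"]))                  # within limits
print(meets_requirements(req, ["J1.particles", "J1.particles"]))  # repeated group
```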
Building Jobs with Input and Output Groups
Building a job from output groups covers most use cases. In cryoSPARC v2.3, this works exactly as in previous versions of cryoSPARC v2; simply drag and drop an output group from the overview or outputs tab in the job details view.
Updated Overview Tab Output Groups
The output groups list in the overview tab has been updated to be more user friendly and highlight key data. You can now download the latest version of all individual outputs by selecting from the download dropdown menu.
Fine-tuned Control over Individual Results
The addition of the outputs tab and the updated job builder inputs section allows for connecting low-level or individual outputs into an input group, overriding specific slots. This functionality allows advanced users to experiment more with their data, and also makes certain tasks in cryoSPARC possible. In this section, we'll cover two such use cases for fine-tuned control over individual results.
Use Case: Local Resolution Estimation
When building a local resolution estimation job, it's now possible to use the outputs tab to override inputs such as half_map_B with outputs from different jobs. The example below outlines the three-step process of using one input group in the local resolution estimation job builder to populate volume data from three separate jobs.
Use Case: Downsampled Particles
If you use the downsample particles job to shrink particles so that other jobs such as 2D classification run faster, you will later end up with a subset of particles, but you will need to reference the original (non-downsampled) particle data when running a refinement to get full particle resolution. In this case, you can use the outputs tab and override the particles.blob input slot with the non-downsampled data that you previously connected as an input group.
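Conceptually, overriding a slot amounts to keeping the selected subset of items but taking one property's columns from a different table, matched by uid. The sketch below illustrates that with numpy; the helper `override_slot` and all column names and values are hypothetical, and it assumes uids are unique in the source table.

```python
import numpy as np


def override_slot(group, source, prefix):
    """Replace every `prefix` column in `group` with the values that
    `source` holds for the same uids."""
    if not np.isin(group["uid"], source["uid"]).all():
        raise KeyError("source is missing some uids present in group")
    order = np.argsort(source["uid"])
    # Row index in `source` for each uid in `group`.
    pos = order[np.searchsorted(source["uid"], group["uid"], sorter=order)]
    out = group.copy()
    for name in group.dtype.names:
        if name.startswith(prefix):
            out[name] = source[name][pos]
    return out


blob_cols = [("uid", np.uint64), ("blob/idx", np.uint32), ("blob/size", np.uint32)]
# Original, full-size particles.
original = np.array([(10, 0, 256), (11, 1, 256), (12, 2, 256), (13, 3, 256)], dtype=blob_cols)
# Subset selected after working with downsampled data, carrying class assignments.
subset = np.array(
    [(12, 7, 64, 4), (10, 5, 64, 1)],
    dtype=blob_cols + [("class/idx", np.uint32)],
)

refine_input = override_slot(subset, original, prefix="blob/")
print(refine_input["blob/size"])  # sizes now come from the original particles
print(refine_input["class/idx"])  # other metadata of the subset is kept
```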
We're excited to be adding this new functionality into cryoSPARC; it makes accessing data and creating custom jobs easier than ever. If you have any feedback or questions about the new job builder, feel free to consult the cryoSPARC discussion forum.
Last Updated: September 28, 2018