Data model, workflow and experience of a biorepository using Specify

willem · February 19, 2025, 9:24am

While reading the ‘Tissue & Molecular Extract Collection Management Open Discussion Notes’ from the discussion on February 12, 2025, it occurred to me that our data model, workflow and experience may be of interest.

“Workflows for samples that do not have a voucher was a major concern. Several participants stated they have samples with vouchers and without a voucher in their collection. A member mentioned there may be an ear clip or sample they want to document but do not want to assign a catalog number to.”

At SAIAB we have observation catalogs (in Specify terms, collections) for fish, invertebrate and amphibian observations, which mirror the voucher collections for these disciplines. The observation catalogs were created specifically for cases where there is a tissue sample but no voucher, or where there is only an image or illustration. Many observation records have no physical evidence (but may still be useful).

The point is that in the Fish Observation catalog/collection, a collectionobject (CO) represents an individual organism, which has a determination and a collectingevent (CE), and through it a locality and geography. This CO (on the left side) can be linked to a CO (on the right side) in the Tissue Collection, which also represents an individual organism. There is a collectionrelationship between these two COs. The right-side CO has a preparation record of preptype ‘Tissue (unvouchered)’, and a value in the preparation.description field such as ‘whole’ or ‘fin-clip’. Because all the left-side COs have links to CEs and determinations, vertically integrating tissue and linked collecting event information from all left-side collections is simple (using the Specify interface).

In the Fish Voucher collection, the CO represents a jar. It can have links to more than one right-side CO (remember that each right-side CO is an individual sampled for tissue). In this case, the preparation type is ‘Tissue (vouchered)’, and the preparation.description field has a value like ‘RBH-2015_3’ (if the individual in the jar has a label through the gills, which would match the label in the freezer vial), or ‘cut left’ (meaning that the tissue sample was taken from the individual with a cut on its left side).

There should be only one right-side preparation record of type ‘tissue’ per right-side CO (in both the voucher and observation data models). But there can be an additional preparation for ‘DNA extract’, which is then assumed to be associated with the preparation of type ‘tissue’ in the same CO. So you could say that the right-side collection is a catalogue of each individual in the voucher collection (e.g. you’ve catalogued each of 10 individuals in this jar, but only a few fields in the preparation record are meaningful).
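
For readers who find it easier to follow a structure than a description, the left-side/right-side arrangement can be sketched roughly as below. This is only an illustration in Python: the class names, field names, collection names and catalog numbers are hypothetical simplifications, not the actual Specify tables or our data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Preparation:
    prep_type: str          # e.g. 'Tissue (vouchered)', 'Tissue (unvouchered)', 'DNA extract'
    description: str = ""   # e.g. 'fin-clip', 'RBH-2015_3', 'cut left'


@dataclass
class CollectionObject:
    collection_name: str    # hypothetical collection names are used below
    catalog_number: str
    preparations: List[Preparation] = field(default_factory=list)


@dataclass
class CollectionRelationship:
    left: CollectionObject   # a voucher jar or an observation record
    right: CollectionObject  # one sampled individual in the Tissue Collection


def has_single_tissue_prep(co: CollectionObject) -> bool:
    """Rule from the text: exactly one tissue preparation per right-side CO."""
    tissue = [p for p in co.preparations if p.prep_type.startswith("Tissue")]
    return len(tissue) == 1


def tissue_samples_for(left: CollectionObject,
                       relationships: List[CollectionRelationship]) -> List[CollectionObject]:
    """All right-side COs linked to one left-side CO (a jar may have several)."""
    return [r.right for r in relationships if r.left is left]


# Made-up example: one jar in the voucher collection, two sampled individuals.
jar = CollectionObject("Fish Voucher", "V-0001")
fish_1 = CollectionObject("Tissue", "T-0001", [
    Preparation("Tissue (vouchered)", "RBH-2015_3"),
    Preparation("DNA extract"),  # assumed to derive from the tissue prep in the same CO
])
fish_2 = CollectionObject("Tissue", "T-0002", [
    Preparation("Tissue (vouchered)", "cut left"),
])
relationships = [
    CollectionRelationship(jar, fish_1),
    CollectionRelationship(jar, fish_2),
]
```

The only rule encoded here is the one stated above: a right-side CO carries exactly one tissue preparation, optionally accompanied by a DNA extract.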

“Alternatively, another member mentioned that a tissue sample may be a sample of a specific preparation of an object such as a dried shell, SEM stub, or ethanol sample, and that relationship needs to be maintained.”

I would like to emphasize that correctly identifying the specific individual organism whose tissue was sampled can be a curatorial function, in that you will rely on the value in the preparation.description field to make this link. And you need to put in place strict workflows to ensure that the data quality is maintained at the required standard.

Justification:

We defend this data model by arguing that we ultimately want to create determination records on the right side. These are ‘molecular determinations’ (e.g. using COI sequence information), and they can easily be compared to ‘morphological determinations’ on the left side. This has scientific and curatorial value (e.g. when you suspect that a sample has been misplaced). In addition, the determination citation table may be used to record verified (e.g. published in literature) sequence identifiers/molecular determinations, which could be seen as ‘gold dust’ in the bean-counting paradigm, such as it is (throw your hands up here).
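
As a rough sketch of why having determinations on both sides is useful, the comparison amounts to something like the following. The function name, the tuple layout and the taxon values are hypothetical; in practice this would be a query across the related collections.

```python
def determination_conflicts(pairs):
    """Return the samples whose molecular and morphological determinations differ.

    pairs: iterable of (tissue_catalog_number, morphological_taxon, molecular_taxon),
    where molecular_taxon is None if the sample has not been sequenced yet.
    """
    return [
        (catalog_number, morphological, molecular)
        for catalog_number, morphological, molecular in pairs
        if molecular is not None and morphological != molecular
    ]


# Made-up data: T-002 disagrees, which might indicate a misplaced sample.
example_pairs = [
    ("T-001", "Taxon A", "Taxon A"),
    ("T-002", "Taxon A", "Taxon B"),
    ("T-003", "Taxon C", None),
]
conflicts = determination_conflicts(example_pairs)
```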

Mechanism:

We have developed a workflow that uses an Excel spreadsheet of data that will become right-side COs and preparations. (This spreadsheet also has a continuous sequence of freezer locations, which we fill consecutively as we enter each new batch; after a number of years it has about 15000 rows.) Two columns in this spreadsheet, collectionname and catalognumber, hold the values from the left side (e.g. the jar catalog number in the fish voucher collection). They are imported into the right-side CO table so that a collectionrelationship can be made after the data have been imported. I used to do this through an ODBC query in Microsoft Access, but recently the Workbench has become wise to collectionrelationships, and I can therefore report that this whole workflow is now possible within Specify.

It is extremely important to have 100% certainty about the left-side catalognumber and collectionname, but even if errors are made, there are post-import validation queries that can be run to check that a range of tissue sample numbers is associated with an expected range of event numbers or a particular collecting trip (any error is going to throw up an obvious red flag because there will suddenly be an unexpected event number or collecting trip). Again, it is the workflow and the quality of data that matter (e.g. if you are not using event numbers or collecting trips, maybe you should consider doing so), arguably more than designing a Specify schema to accommodate everyone’s perspectives or preferences. And I think this is the broader point I want to make.
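
The kind of post-import check described here can be sketched as follows. This is a minimal illustration, assuming the query results have been exported as a list of records; the field names (tissue_number, collecting_trip) and the trip labels are hypothetical.

```python
def unexpected_trips(records, first_number, last_number, expected_trips):
    """Flag tissue samples in a number range that belong to an unexpected collecting trip.

    records: list of dicts with 'tissue_number' (int) and 'collecting_trip' (str).
    expected_trips: set of trip names the batch is supposed to come from.
    """
    flagged = []
    for record in records:
        in_range = first_number <= record["tissue_number"] <= last_number
        if in_range and record["collecting_trip"] not in expected_trips:
            flagged.append(record)
    return flagged


# Made-up batch: sample 103 was linked to the wrong left-side record.
batch = [
    {"tissue_number": 101, "collecting_trip": "TRIP-A"},
    {"tissue_number": 102, "collecting_trip": "TRIP-A"},
    {"tissue_number": 103, "collecting_trip": "TRIP-B"},
    {"tissue_number": 104, "collecting_trip": "TRIP-A"},
]
red_flags = unexpected_trips(batch, 101, 104, {"TRIP-A"})
```

The same pattern works with event numbers instead of collecting trips: anything outside the expected set stands out immediately.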

A few times every year I then import the whole ‘freezer spreadsheet’ into an MS Access file, and I compare the data with the data in the Specify database (matching on the freezer location, which is a box/slot arrangement). Any number of comparisons will yield obvious errors (event, left-side collection, freezer location). I then update the freezer spreadsheet with the preparation.preparationid when I am satisfied that the records agree in every way, and this is the synchronisation that was discussed during the meeting. This also raises another question: to what degree is it useful or necessary to use data validation/correction/synchronisation techniques outside of the Specify application (e.g. connecting to the database from MS Access using ODBC)? Is this a skill that we can reasonably expect users to develop? How safe is it to allow a power user to connect to the back-end tables, or even potentially update data using such means?
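
In outline, the comparison works something like the sketch below. This is not the Access query we actually use; it is a simplified Python illustration, and the field names (box, slot, tube_number, preparation_id and so on) are hypothetical stand-ins for the spreadsheet columns and database fields.

```python
def synchronise(sheet_rows, db_rows,
                compare_fields=("tube_number", "collection_name", "catalog_number")):
    """Match spreadsheet rows to database rows on freezer location (box, slot).

    Rows that agree on every compared field receive the database preparation_id.
    Rows that disagree are returned as (location, [differing fields]) for checking.
    Rows with no database record at that location are left untouched
    (they are assumed to be not yet catalogued).
    """
    db_by_location = {(row["box"], row["slot"]): row for row in db_rows}
    mismatches = []
    for row in sheet_rows:
        location = (row["box"], row["slot"])
        db_row = db_by_location.get(location)
        if db_row is None:
            continue
        differing = [f for f in compare_fields if row.get(f) != db_row.get(f)]
        if differing:
            mismatches.append((location, differing))
        else:
            row["preparation_id"] = db_row["preparation_id"]
    return mismatches


# Made-up rows: slot 1 agrees, slot 2 has a tube number conflict, slot 3 is uncatalogued.
sheet = [
    {"box": 1, "slot": 1, "tube_number": "A1", "collection_name": "Fish Voucher", "catalog_number": "V-0001"},
    {"box": 1, "slot": 2, "tube_number": "A2", "collection_name": "Fish Voucher", "catalog_number": "V-0001"},
    {"box": 1, "slot": 3, "tube_number": "A3", "collection_name": "Fish Voucher", "catalog_number": "V-0002"},
]
database = [
    {"box": 1, "slot": 1, "tube_number": "A1", "collection_name": "Fish Voucher", "catalog_number": "V-0001", "preparation_id": 501},
    {"box": 1, "slot": 2, "tube_number": "A9", "collection_name": "Fish Voucher", "catalog_number": "V-0001", "preparation_id": 502},
]
problems = synchronise(sheet, database)
```

Only rows that agree in every compared field are stamped with the identifier, so a row without one is either still uncatalogued or needs attention.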

Experience:

Our experience since 2017 has been very good, and we find that there are a few minor modifications that might be nice to have. But even if nothing new were developed we would still be able to continue managing our biorepository data just fine. I’m grateful to Andy Bentley and others in the Specify team for helping me with the initial set-up.

I agree that the terminology of ‘left-side’ and ‘right-side’ is somewhat meaningless, or perhaps meaningful only to the database manager. This is why I have resolved to change my field captions today. Instead of ‘left-side rels’, I will have ‘tissue samples’, and instead of ‘right-side rels’ I will have ‘voucher/obs data’. Now it’s just a matter of knowing that these fields exist, which is the same expectation for any number of other fields. So I really believe that collectionrelationships are not intrinsically a complex phenomenon (or you would have to agree that the concept of a collecting event is also intrinsically complex).

Ultimately it depends on what you want to do with the data. If you want to publish genomic data on the GGBN Portal I would argue that your choice is clear.

Update to show modified schema captions.

This is the query builder as seen from within the Tissue Collection:

This is the query builder as seen from within the Voucher Collection:

willem · March 18, 2025, 8:30am

We at SAIAB are keen to propose a workflow based on our experience, which could potentially become the basis of future functionality in the Specify application. Further to the explanation above, we are keen to specifically develop the idea that each freezer should be represented by a Microsoft Excel spreadsheet. The main reason for this is that you want to be able to place uncatalogued samples in the freezer, and then gradually catalogue them and synchronise the spreadsheet and database (e.g. updating a column in the spreadsheet with the preparation.preparationID when a sample’s tube number and location match). Different versions of the spreadsheet will be stored in the database (e.g. maybe you synchronise every month). The current spreadsheet is the last one that was synchronised and exported from the database, and you will insert new sample data into the current xlsx, which will then be imported as a new ‘database version’. Now you need to import the new sample data into the database and synchronise what you can.

Working in Excel can be efficient (e.g. finding duplicates, copying etc.), and it lends itself well to a long sequence of consecutive freezer locations (e.g. the 81-slot box).
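
To illustrate why a consecutive sequence maps so naturally onto boxes, here is a small sketch that converts a running position in the spreadsheet into a box/slot pair. The 81-slot box comes from the example above; the 1-based numbering of boxes and slots is an assumption made for the illustration.

```python
SLOTS_PER_BOX = 81  # the 81-slot box mentioned above


def freezer_location(position):
    """Map a 0-based consecutive position to a (box, slot) pair, both 1-based."""
    if position < 0:
        raise ValueError("position must be zero or greater")
    box_index, slot_index = divmod(position, SLOTS_PER_BOX)
    return box_index + 1, slot_index + 1


first = freezer_location(0)            # first slot of the first box
last_in_box = freezer_location(80)     # last slot of the first box
next_box = freezer_location(81)        # first slot of the second box
```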

All of this could potentially happen in the existing Workbench application. The synchronisation functions would then be new functions of the Workbench.

I foresee challenges associated with different data models. How would the synchronisation be tailored to a database that uses related collections versus a database that represents tissue samples as preparations in the voucher collection?

At SAIAB we have been through the mill and basically seen it all - labelled but undocumented samples placed in the freezer (requiring epic freezer inventories); duplicate sample numbers (this is our speciality, and we are now embarking on a project to sequence these to disambiguate them); DNA extractions with sample numbers that can’t be related to the tissue samples from which they originated. You name it, we can tell you about it.

If you would like to collaborate with us and share your experience, this will help to give the Specify development team more ideas to develop a robust set of tools. Please send me an email if you would like to collaborate: w.coetzer@saiab.nrf.ac.za
