I do not have advice to offer, but instead am looking for it. From people who have already created a Specify database for their collection(s), what is some advice you wish you’d known? What are the most important questions to ask or issues to hammer out before beginning?
From the research I’ve done I’ve managed to identify two key issues:
Organization of the database
If you want to publish, mapping data fields to required or recommended Darwin Core fields for exporting
Are there other issues to consider? Do you have advice within those issues? Are there differences that one might not expect when setting up a database for different scientific fields? Tell the newbies what you’ve learned with your hard-won experience.
I wish I could go back in time (not that I am unhappy with how things turned out, but I have learned a ton in the three years since). Below are my own thoughts; Grant has also written a great set of articles recently on many of these topics, including the Data Conversion Process.
Architecture and setup choices
From the get-go, decide with all stakeholders whether you are going to aim to standardize across units as part of the transition, or use the transition to give units greater flexibility. This will inform your choices around whether you want multiple databases, disciplines, divisions etc. Disciplines are especially important because this is the level at which the internal schema tool is shared.
Having multiple divisions should be given careful consideration; in most instances, I would argue you want a single division.
While you can preload the taxon tree, you don’t have to; instead, the taxon tree can be populated with only the taxa the collection already has when the data is imported. This can be decided at the discipline level. Benefit: a cleaner taxon tree that doesn’t contain unnecessary nodes. Drawback: the collection will have to enter new nodes later, increasing the risk of spelling mistakes. Consider how likely it is that the collection will take in new taxa it has not seen before. The same applies to the Geography tree, though most stick with the pre-imported tree.
Data Cleaning, Standardization and Import
I recently stumbled upon pointblank; I would have used it for cleaning back and forth with collections had it been available back then. It should support most of the business rules that Specify applies through the workbench, with the exception of disambiguation.
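To illustrate the kind of business-rule checking meant here, below is a minimal sketch in plain pandas rather than pointblank itself (so as not to misrepresent pointblank’s API); the column names are hypothetical examples, not a real Specify schema:

```python
import pandas as pd

def validate_records(df):
    """Return {rule_name: [offending row indices]} for a few example rules."""
    problems = {}

    # Rule: catalogNumber must be present.
    problems["missing_catalog_number"] = df.index[df["catalogNumber"].isna()].tolist()

    # Rule: catalogNumber must be unique (ignoring missing values).
    dup = df["catalogNumber"].duplicated(keep=False) & df["catalogNumber"].notna()
    problems["duplicate_catalog_number"] = df.index[dup].tolist()

    # Rule: eventDate must parse as a date.
    parsed = pd.to_datetime(df["eventDate"], errors="coerce")
    problems["unparseable_date"] = df.index[parsed.isna() & df["eventDate"].notna()].tolist()

    # Keep only the rules that actually found problems.
    return {rule: rows for rule, rows in problems.items() if rows}

records = pd.DataFrame({
    "catalogNumber": ["BOT-001", "BOT-002", "BOT-002", None],
    "eventDate": ["2021-05-04", "not a date", "2021-06-10", "2021-07-01"],
})
report = validate_records(records)
```

A report like this can be round-tripped with the collection until it comes back empty, before anything touches the workbench.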
Split your workbench uploads into chunks; 3,000–5,000 rows per chunk seems optimal.
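The chunking itself is easy to script ahead of time. A small sketch using only the standard library, which splits a source CSV into numbered chunk files and repeats the header in each (file names and the demo data are made up for illustration):

```python
import csv
import tempfile
from pathlib import Path

def split_csv(src, out_dir, chunk_size=4000):
    """Split a workbench CSV into chunk files, repeating the header in each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    with open(src, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows, part = [], 1
        for row in reader:
            rows.append(row)
            if len(rows) == chunk_size:
                paths.append(_write_chunk(out_dir, part, header, rows))
                rows, part = [], part + 1
        if rows:  # final partial chunk
            paths.append(_write_chunk(out_dir, part, header, rows))
    return paths

def _write_chunk(out_dir, part, header, rows):
    path = out_dir / f"chunk_{part:03d}.csv"
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return path

# Demo: write a 10-row sample file and split it into chunks of 4 rows.
tmp = Path(tempfile.mkdtemp())
sample = tmp / "all.csv"
with open(sample, "w", newline="", encoding="utf-8") as f:
    w = csv.writer(f)
    w.writerow(["catalogNumber", "taxon"])
    w.writerows([[f"X-{i}", "Quercus alba"] for i in range(10)])

chunks = split_csv(sample, tmp / "chunks", chunk_size=4)
```

Each chunk is then a self-contained workbench dataset, so a failed upload only costs you one chunk’s worth of rework.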
Standardize the schema across all collections in a discipline wherever possible.
To your point about DwC: you don’t have to, but it is nice to use the Darwin Core equivalents wherever possible. However, fields don’t need to go by their DwC names. Many managers will disagree that recordedBy makes sense as the name for the collector field, and furthermore, Specify will split that into firstName and lastName anyway, so it can get very confusing trying to teach Specify terminology and DwC at the same time. You can leave the DwC mapping until after you have data in the system; it is always easy to map a Specify field to DwC terms later when preparing for publication. Note that Darwin Core Data Packages are also on the horizon!
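The “map later, at publication time” idea can be sketched as a plain lookup table applied in the export step; the internal field names here are hypothetical, and the first/last-name join mirrors the split Specify makes:

```python
# Hypothetical internal field names mapped to Darwin Core terms; the mapping
# lives only in the export step, so day-to-day users never see DwC names.
SPECIFY_TO_DWC = {
    "catalogNumber": "catalogNumber",
    "fullTaxonName": "scientificName",
    "localityName": "locality",
}

def to_dwc(record):
    """Render an internal record as a DwC-termed dict for publication."""
    out = {dwc: record[src] for src, dwc in SPECIFY_TO_DWC.items() if src in record}
    # Specify stores collector names split into first/last; DwC's recordedBy
    # expects a single string, so join the parts at export time.
    name = " ".join(
        p for p in (record.get("collectorFirstName"), record.get("collectorLastName")) if p
    )
    if name:
        out["recordedBy"] = name
    return out

published = to_dwc({
    "catalogNumber": "BOT-001",
    "fullTaxonName": "Quercus alba",
    "collectorFirstName": "Ada",
    "collectorLastName": "Lovelace",
})
```

Because the mapping is a single table, renaming an internal field only means editing one entry, not retraining anyone on DwC terminology.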
Forms
Don’t use the label= argument in form definitions for regular fields. This creates the opportunity for a field to have one caption in the schema, and another when viewed on the form. There should be a single source of truth (the schema definition) and it is better to use the schema because changes are applied across all collections in the discipline, and fields are easier to track/hide etc.
When first rolling out, it sometimes makes sense to disable the button beside query combo boxes for a few weeks, until users grasp the concept of relational databases. Otherwise, they will edit a taxon thinking they are just editing information for a single record. Erroneous additions are easier to fix later through merging.
I could go on but will leave it at that for now; always happy to chat if you have any specific questions! Others will have some great advice too, I am sure, and I’ll note that the Specify team offers consulting pricing options.
Would you suggest keeping separate departments each in their own instance/database? I saw that the Canadian Forestry Service decided to have each database (Inverts, Botany, Mycology) have its own instance, to facilitate backup and restoration of one department without impacting the other two. That feels like it makes a lot of sense, if having to restore a database is a real danger. But at the same time, having the Institution, Division, and Discipline levels be separate things in the scoping hierarchy seems to exist specifically to keep you from doing that.
Please tell me if this should have been created as an entirely new topic. I’m still learning the protocol.
I think in general, if a topic should be moved to a new topic Grant will move it (and it’s just our job to do a best-effort interpretation of the appropriate category to start).
The question of whether to use separate instances/databases is largely one of tradeoffs and of how you would like to set up your infrastructure. We can imagine two extremes to highlight these trade-offs:
Maximal decentralization
In this case, you would split everything up to the maximal extent. Separate databases and instances (or at least separate containers on the same hardware).
Pros:
Isolated rollbacks, fewer single points of failure, and potentially better disaster recovery[1].
Could have a different security setup for each (perhaps one is tightly firewalled with SSO set up, while another allows public anonymous access).
Can run different Specify versions on each (one always runs the latest, another runs one point release behind, etc.).
Cons:
Less efficient resource use. Code is duplicated, and each worker instance is more likely to sit idle. Compute pricing may improve with scale, and storage pricing (if you are using an asset server) would almost certainly get better with scale (i.e. it’s cheaper to get one server with 32 TB of storage than four with 8 TB each).
Duplicated effort. You would have to fix the same problem across all instances, instead of just doing it in one.
You have to manage updates for all instances. However, there is some code in the docker compositions repository that shows how you could manage this with a centralized JSON.
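To make the centralized-JSON idea concrete, here is a hypothetical sketch, not the actual scheme from the docker compositions repository: a single manifest pins each instance to a version, and a script turns it into per-instance commands (the `SPECIFY_VERSION` variable and project names are assumptions for illustration):

```python
import json

# Hypothetical central manifest pinning each instance to a Specify version.
MANIFEST = """
{
  "botany": "7.9.6",
  "inverts": "7.9.5",
  "mycology": "7.9.6"
}
"""

def plan_updates(manifest_json):
    """Turn the manifest into per-instance docker compose commands (dry run)."""
    manifest = json.loads(manifest_json)
    commands = []
    for instance, version in sorted(manifest.items()):
        # Pull the pinned image, then restart that instance's stack with it.
        commands.append(f"SPECIFY_VERSION={version} docker compose -p {instance} pull")
        commands.append(f"SPECIFY_VERSION={version} docker compose -p {instance} up -d")
    return commands

plan = plan_updates(MANIFEST)
```

Editing one file then drives every instance, which recovers some of the “fix it once” benefit of centralization even in a decentralized layout.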
Maximal Centralization
The reverse of the above: the pros become cons, and the cons become pros.
Note that if you run a single database backed up nightly, the cost of a rollback would be less than 24 hours of work. However, if you are operating at country scale, even that may be too costly to accept.
Consider people as well as hardware
As much as this is a hardware and infrastructure question, it is also about the people who will end up using the systems. Is one person going to make decisions for all deployments (it may make sense to have one instance), or will there be two or three technical managers (probably best to have one setup per person, unless they always agree on everything and coordinate well!)?
How would these people receive bug or help requests? Are the users well defined in groups, or could they all go through the same channels?
Lots to consider; there isn’t a single correct answer!
This is complex; it depends on the risks you are trying to avoid. Separate instances wouldn’t save you from a bad deployment if you end up deploying to all instances at once, or from a cyberattack if all instances are on the same network, etc. ↩︎