Variables in research databases need support

Variety in data-editors

During my work on the SWINNO project and its database, I noticed that the assistants who did the data work, whom I call data editors, have different strategies for entering data. They also know the rules that govern the entry of the variables to varying degrees, have different ways of dealing with uncertainty, work with different levels of precision, and differ in how critically they read the source material and in how fast they work. I am probably forgetting a few more dimensions. All these differences affect the quality of the data they produce, which can be problematic for the researchers who use the data.

The design of software tools : 4 layers of support

Part of the quality management is done or can be done through the design of the data entry tools. Such a tool can be a relational database, but it could also be a spreadsheet, or other software. It could even be pen and paper forms, but then the range of possibilities is more limited. Anyhow, the point I want to make in this post remains the same : in the data entry tool a variable should not only be a box where data is entered, it needs additional supporting interface elements. One can distinguish at least four layers of support.

1 Aids for data entry

The first layer of support concerns the aids for entering data. If, for example, a variable is a number between 1 and 5, one could use an entry box, or field, where the user types a number. However, a scroll list or drop-down list with the possible options would reduce the possibility that a user enters ‘0’ or a number greater than 5. Another example is the entry of dates, where the user is presented with a calendar from which to pick a date, rather than being allowed to enter a date ‘manually’. And so on. Such techniques are quite common in all kinds of web forms.
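To make the idea concrete, here is a minimal sketch in Python of a field that behaves like a drop-down list: it only accepts values from a fixed set of options. The class name `DropDown` and the field name `rating` are made up for this illustration; a real data entry tool would use the list widget of its own form toolkit.

```python
from dataclasses import dataclass


@dataclass
class DropDown:
    """A field that accepts only one of a fixed list of options (hypothetical helper)."""
    label: str
    options: tuple
    selected: object = None

    def select(self, value):
        # Reject anything that is not in the list, just as a drop-down would
        # simply not offer it.
        if value not in self.options:
            raise ValueError(f"{value!r} is not an option for '{self.label}'")
        self.selected = value


# A variable that must be a whole number between 1 and 5.
rating = DropDown(label="rating", options=(1, 2, 3, 4, 5))
rating.select(3)
```

With a free entry box the mistake has to be caught afterwards; with a restricted list the ‘0’ cannot be entered in the first place.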

2 A validation system

A second layer of support concerns data validation : a warning system for easily detectable mistakes, that is, mistakes that can be detected automatically. An icon appears next to the field to show whether the entered data passes the test(s). Usually a cross (✕) indicates that something is wrong and a check mark (✓) that all is okay. If something is wrong, the tool should explain what is wrong and how it can be fixed. A common example is the check of a password that has to comply with a set of constraints, for instance that it should contain at least ten characters, two capitals, four numbers and a punctuation mark.
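The password example can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the function names `validate_password` and `status_icon` are invented, ‘numbers’ is read as ‘digits’, and ‘punctuation mark’ is limited to the ASCII characters in Python's `string.punctuation`.

```python
import string


def validate_password(password: str) -> list:
    """Return a list of problems; an empty list means every test passed."""
    problems = []
    if len(password) < 10:
        problems.append("Use at least ten characters.")
    if sum(c.isupper() for c in password) < 2:
        problems.append("Include at least two capital letters.")
    if sum(c.isdigit() for c in password) < 4:
        problems.append("Include at least four digits.")
    # Assumption: only ASCII punctuation counts as a punctuation mark.
    if not any(c in string.punctuation for c in password):
        problems.append("Include at least one punctuation mark.")
    return problems


def status_icon(problems: list) -> str:
    """Check mark when nothing is wrong, cross otherwise."""
    return "✓" if not problems else "✕"


problems = validate_password("abc")
print(status_icon(problems))
for problem in problems:
    print(" -", problem)
```

Note that the function returns messages rather than just true or false, so that the interface can tell the user what is wrong and how to fix it.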

3 Guiding attention to instructions

Often, database designs, especially the ones working through websites, stop at the validation system because most data that users have to enter is concrete and not very complex: a date of birth, a phone number, a postal address, a credit card number and so on. In research, however, and in other more complex data-entry environments, drop-down lists and check marks are needed but far from sufficient. One obvious reason is that many of the rules of data entry cannot be checked automatically, for example because human interpretation of some source material is needed. (I know, you are thinking ‘bring in the machine learning and artificial intelligence’. If only it were that simple.)

So, a third layer of support is a system that brings the editing instructions to the constant attention of the editors so they can easily review them. At the SWINNO project, data editors used to enter data in a spreadsheet and could read the instructions on a couple of A4 sheets, a so-called ‘cheat sheet’. Although those cheat sheets were readily available, users had to divide their attention between the screen with the spreadsheet, the source material from which data was extracted (paper or on-line magazines and journals), and the cheat sheet. Would it not be much more practical if the instructions for a particular cell in the spreadsheet were shown right next to it? It would not only make it easier for the editors to review the instructions, but the close proximity would also remind them that there are instructions which should be followed in detail.
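One way to think about this, sketched in Python: the instruction is stored as part of the definition of the variable, so that whatever draws the entry field can always draw the instruction next to it. The field names and instruction texts below are hypothetical and are not taken from the SWINNO cheat sheets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    label: str
    instruction: str  # the part of the cheat sheet that applies to this field


# Hypothetical fields and instruction texts, for illustration only.
FIELDS = {
    "year_of_discovery": Field(
        label="Year of discovery",
        instruction="Enter a four-digit year as stated in the source.",
    ),
    "rating": Field(
        label="Rating",
        instruction="Choose a value from 1 to 5.",
    ),
}


def render_field(name: str) -> str:
    """Return the text for one field: the entry box with its instruction right below."""
    field = FIELDS[name]
    return f"{field.label}: [        ]\n  Instruction: {field.instruction}"


print(render_field("year_of_discovery"))
```

The design choice that matters here is that the instruction lives with the variable and not in a separate document, so the two cannot drift apart on the screen.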

4 Expression of doubt and uncertainty

The fourth and last layer of support concerns the expression of uncertainty. Instructions will only go so far in covering all the possible situations that the data editors might encounter in the source material. For example, in a chemicals database the year of discovery of a new material is needed. The newspaper article from year X only mentions ‘a few years ago’. What should the data editor enter if the instructions do not cover this scenario? The data editors may come up with their own solutions, which may lead to ‘dirty data’. But what if they had a means to express their doubt, for example through a notes field, or a qualifier field next to the ‘year of discovery’ field where they can select ‘<’, ‘≤’, ‘?’ or ‘??’ to express the uncertainty?
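A qualifier field of this kind could look roughly as follows in Python. This is a sketch under assumptions: the class name `QualifiedYear`, the empty string for ‘no doubt’, and the year 1980 in the example are all invented for illustration.

```python
from dataclasses import dataclass

# The qualifiers mentioned above, plus an empty string meaning "no doubt expressed".
ALLOWED_QUALIFIERS = ("", "<", "≤", "?", "??")


@dataclass(frozen=True)
class QualifiedYear:
    """A year together with the editor's expression of uncertainty (hypothetical)."""
    year: int
    qualifier: str = ""
    note: str = ""

    def __post_init__(self):
        # Only the agreed qualifiers are accepted, so the doubt itself stays clean data.
        if self.qualifier not in ALLOWED_QUALIFIERS:
            raise ValueError(f"unknown qualifier: {self.qualifier!r}")

    def __str__(self):
        return f"{self.qualifier}{self.year}"


# An article from a (made-up) year 1980 that only says 'a few years ago'.
entry = QualifiedYear(1980, "<", note="Article from 1980 says 'a few years ago'.")
print(entry, "|", entry.note)
```

Keeping the year, the qualifier and the note in separate fields means that researchers can later filter on the qualifier, for example to leave out all doubtful years, without having to parse free text.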

Conclusion : not all data quality can be achieved through the interface design

Summarizing : data editors show variety in quite a few dimensions when it comes to their data-entry work. For high-quality data such as is needed in research, this variety needs to be reduced as much as possible. This post outlines four layers of support that should go into the design of databases : input aids, validation feedback, guiding attention to instructions, and uncertainty expression.

I should add that I am not proposing that a good interface will reduce the differences between data editors and how they work to zero. A good interface is necessary but not sufficient. In fact, it is the last set of measures, taking effect at the moment the data entry gets done. Additional standardization work needs to be part of the design of the data-entry process. This starts with introductions and training before the data entry. During the data entry, collaboration between data editors (asking and answering each other's questions), review by a supervisor, and reviewing each other's work in special ‘calibration’ workshops play important roles as well.

However, in spite of the database design and the design of the data-entry process, some differences will remain, simply because people and data sources differ in so many ways that not all variety can be controlled for. Scientists will have to deal with that through their research design, i.e. in the ways they structure and analyze the data, evaluate the outcomes and draw their conclusions.