If you are interested in this solution, then watch out for a beta version for Mac that I am hoping to release late 2020/early 2021. Follow my blog or return to this page at a later date.
In this text, I will explain the problem that the sea turtle researchers at Osa Conservation asked me to solve. It turns out that they have a data problem that I am sure many other researchers and organizations have. The solution is a database tool to stitch data from multiple tables together, while addressing differences in formatting of the data, and while keeping track and documenting all modifications and calculations in metadata output files.
What is the problem?
In short, the problem is that you may have multiple tables with data about the same thing, but differently structured, and then you want to bring it all together in one table.
The sting of the issue sits in the part ‘differently structured’. This could mean a number of things, including (but certainly not limited to) the following. Some are illustrated in the image above.
- It could mean that the order of the columns is different. Someone may have changed it because it was handier at some point.
- It could mean that the names of the columns do not match between different files anymore. For example, ‘date’ becomes ‘date of purchase’.
- It could mean that the format of the data differs. For programmers, a notorious case is how dates can be differently formatted.
- It could mean that even though the names of the columns are the same, the meaning is different. For example ‘date’ could mean the date of an event in one file, and date on which the data was entered in another. This one is particularly difficult to spot if you only have the data but no description ( no metadata ) of it.
- Another problem could be that the unit of what is measured changes. Say from meters to yards, from Deutschmark to Euro, and so on. Yes, it should be mentioned in the column header, but is it?
- Similarly, coding system may change. Instead of the first names of the researchers, acronyms may be used. Or instead of yes/no, it could be yes/no/maybe/no-data.
- Finally, it could be that data is entered outside the area where you would expect it, meaning in columns without a heading or in rows far below where a block of rows ends. This could be accidental scrap data or intentional additions of for example summary data.
Some of these problems can not be solved unless you have additional information. If the header of a column only says ‘Date’ and you need to know if that is, say, the date of departure or the date of arrival, the data itself is not likely to tell you. In the remainder of this text, those types of problems are ignored because also a database tool will not provide the answer. You will have to do historical research or interview someone, or so.
Another type of challenge that the database can not solve is that the table is not actually a table, but a spreadsheet designed as a repetition of forms. See the two images that explain the difference.
So, setting aside those types of issues, each of the remaining ones from the list is not that difficult to solve, but of course, if left unchecked changes may accumulate over time so that the first version of a file would not recognize the last. This is probably how you ended up with this problem.
Can this problem be solved manually?
How you would solve this if you only had manual labor, the spreadsheet software and the files? Well, it is possible, but it may not be very effective or efficient, as I will discuss in this section.
Each of the problems listed above has its own particular solution. For example, if the order of the columns is not the same, you need to reorder them (by dragging or copy-pasting) in a uniform way before copy-pasting them all into one file. If the formatting of a date differs between rows or tables, they need to be harmonized with the formatting and calculation functions that are available in the spreadsheet software. And so on. Every problem simply needs to be addressed one-by-one. In my view, with patience, systematic checking and conscientious work, this way, you can disentangle the biggest ‘data spaghetti’ imaginable.
It has a couple of downsides though. First of all you need to have the right mindset for this kind of work. Patience is not in everybody’s dictionary. Systematic working is also not always a given and precise working is also not easy. Basically, it is a tedious job.
Secondly, if the set of files, or the number of columns in them is big, in practice, it will take a lot of tedious work. Even if you have the right mindset for it, you will have problems keeping up the required level of focus in the face of work interruptions, potential bad nights of sleep, work pressure and so on. Did you indeed check the format of column BC in file 29? Or did you stop at BE before going for a lunch break?
Thirdly, the details of the manual labor go lost in history, unless you have made really good notes. If you are working in science, a reviewer may want to see those notes, so they need to be fool-proof readable. One may have to do individual data corrections, which are virtually impossible to trace. And what do you do if you realize later on that your corrections need to be corrected?
Arguably, much of the cloud-based software can keep track of changes, but for how long will they provide the service, and for how long will you keep the files in the cloud?
So, yes, in theory the problem can be solved manually, but in practice it may be a different story. Since, I could not find an existing software tool, I decided to build one.
A table stitcher. How does it help?
The database – I haven’t found a good name for it so suggestions are welcome – helps solving this problem with the following strategy for stitching tables together.
1. Changes are added to the data and thus become traceable, reviewable, and adaptable.
Instead of altering the problematic data, the table stitcher will generate and export a changed version of the data. Basically, you first import all your problematic data and then specify and export a revision with a uniform structure and formatting. All the changes are not made to the data itself, but administered as instructions next to the data. This way you can always see how the final result was generated.
An added benefit is that if the end result is not good enough, you can simply change the instructions and generate a new version of the final result (nothing is really final, not in life, not in databases).
2. Integrated data-cleaning and formatting
The database allows you not only to make corrections to (that is, added to) individual cells in the import tables, it also gives you the possibility to change the formatting or regular mistakes in the data. For example, a column may contain dates in different formats, or someone often miss-spells a particular name. Advanced programming of specific transformations is also possible. With a simple test facility you can quickly see the result of the changed formatting and manual corrections.
3. Generating multiple new tables
Depending on your data and your wishes, it could be that the tables that you want to stitch together contain data for multiple export tables. Perhaps some tables have extra columns that do not occur with others, so you could make a separate export of those columns only.
4. Documentation, documentation, documentation
While you are working with the database, you have ample opportunity to add meta-data, including descriptions of the imported tables and all their columns, and the export table(s) and their columns. It is not necessary, but advisable because it will help you later on when you need to explain everything to your reviewers, your successors, or yourself a couple of years later. The database also has plenty of possibilities to leave notes about all the import and export tables, their columns and the instructions of how to generate an export table.
As I explained at the beginning, I am developing this database to help solve a problem for the sea turtle researchers, but I see potential for wider use. However, I could do with some help. So, if you have a problem like this, then please do contact me at firstname.lastname@example.org . We could help each other in different ways.
If you have this problem and would like to try out the database, then do let me know. I am interested in having the database beta tested. You will need to either have a Mac computer (you will receive stand-alone software), or a Windows computer with FileMaker Pro 16 or later installed (you will receive a FileMaker Pro file).
If you do have this problem, but don’t have time for testing the software, I would appreciate receiving sample data of your files. Just the first, say 100 rows would be enough. Please, also mention the magnitude of how many rows all tables have together (thousands, tens of thousands, … )
Finally, you could let me know how you would like to work with a database like this? Do you prefer to install software on your computer, or go on-line and work with the database in the cloud? Would you perhaps prefer not to do the work and make use of a service, i.e. I would be doing the work for you?
PS Technical restrictions
The database does have some restrictions. Among others, the following:
It can only deal with data that is expressed in text and will only export data as text. Data expressed as text includes numbers, times and dates because they are written in text. But it can not deal with audio, video and images.
It exports as tab-text, coma separated values (.csv), or .xlsx, but even in .xlsx all data will be formatted as text. So in your spreadsheet software, you will have to manually set cell-formatting to the desired format, or try its import function to automatically determine the format.
At present, the maximum number of columns for import and export is 104. In spreadsheet terms, the last column that is imported is CZ.