
Your data hub

The general idea

You’re likely to just start making your data with whatever software seems ‘natural’ to you. However, hold on for a moment, and consider the near future of your research project. Your table, notebook, network graph, or what have you, may quickly turn into a data hub. Perhaps you should consider a relational database.

Basic pros and cons of a relational database

One important hallmark of relational databases is that they are relatively flexible. To be precise, the data stored in tables and woven together through relationships can easily be restructured. Tables can be split up, data can be moved from one table to another, relationships can be redefined, et cetera.
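As a minimal sketch of what such a restructuring can look like, the example below splits one table into two and links them through a relationship. It assumes SQLite through Python's built-in sqlite3 module; the tables `observation` and `site` and their columns are hypothetical and only serve as an illustration.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Starting point: site details are repeated on every observation row.
CREATE TABLE observation (
    id INTEGER PRIMARY KEY,
    site_name TEXT,
    site_region TEXT,
    value REAL
);
INSERT INTO observation (site_name, site_region, value) VALUES
    ('Alpha', 'North', 1.5),
    ('Alpha', 'North', 2.5),
    ('Beta', 'South', 4.0);

-- Step 1: move the repeated site details into their own table.
CREATE TABLE site (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE,
    region TEXT
);
INSERT INTO site (name, region)
    SELECT DISTINCT site_name, site_region FROM observation;

-- Step 2: rebuild the observation table so it refers to a site by key.
CREATE TABLE observation_new (
    id INTEGER PRIMARY KEY,
    site_id INTEGER REFERENCES site(id),
    value REAL
);
INSERT INTO observation_new (id, site_id, value)
    SELECT o.id, s.id, o.value
    FROM observation AS o
    JOIN site AS s ON s.name = o.site_name;
DROP TABLE observation;
ALTER TABLE observation_new RENAME TO observation;
""")

# Each site is now stored once; the relationship restores the original rows.
site_count = conn.execute("SELECT COUNT(*) FROM site").fetchone()[0]
rows = conn.execute("""
    SELECT s.name, o.value
    FROM observation AS o
    JOIN site AS s ON s.id = o.site_id
    ORDER BY o.id
""").fetchall()
print(site_count, rows)
```

No data is lost in the split: the join at the end returns the same three observations, while each site is now stored only once.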

Secondly, relational databases are also quite good at importing and exporting data. Exporting in particular is very easy. One can bring data together in a particular view, if only for the export, and send it to the outside world, often in a choice of file formats.
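To make the idea of a view that exists only for the export more concrete, here is a small sketch, again assuming SQLite and Python's standard library. The tables `person` and `letter` and the view `letters_per_sender` are hypothetical names chosen for the illustration; the same view is written out once as CSV and once as JSON.

```python
import csv
import io
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows that also know their column names
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE letter (
    id INTEGER PRIMARY KEY,
    sender_id INTEGER REFERENCES person(id),
    year INTEGER
);
INSERT INTO person (id, name) VALUES (1, 'Ada'), (2, 'Ben');
INSERT INTO letter (sender_id, year) VALUES (1, 1850), (1, 1851), (2, 1850);

-- A view that brings the data together just for the export.
CREATE VIEW letters_per_sender AS
    SELECT p.name AS sender, COUNT(*) AS letters
    FROM letter AS l
    JOIN person AS p ON p.id = l.sender_id
    GROUP BY p.id, p.name;
""")

rows = conn.execute(
    "SELECT sender, letters FROM letters_per_sender ORDER BY sender"
).fetchall()

# Export 1: CSV text (written to a string here; a file works the same way).
csv_buffer = io.StringIO()
writer = csv.writer(csv_buffer, lineterminator="\n")
writer.writerow(rows[0].keys())           # header row from the column names
writer.writerows(tuple(r) for r in rows)  # one line per row of the view
csv_text = csv_buffer.getvalue()

# Export 2: JSON text built from the very same rows.
json_text = json.dumps([dict(r) for r in rows])

print(csv_text)
print(json_text)
```

The underlying tables stay untouched; only the view decides what the outside world gets to see.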

Thirdly, relational databases are not particularly suited for data analysis, although they often come packed with basic possibilities to summarize, do simple statistics, and generate graphs. For more advanced scientific analysis, however, they simply fall short. The reverse is also true: the software that is suited for advanced analysis usually does not offer the same flexibility to rearrange data as relational databases do.

These traits make a relational database an excellent hub for your research data. You produce and maintain your data in the database and, when the time comes, you export it as needed and import it into other software for analysis. This is illustrated in the figure below.

Figure: a data hub with a relational database at the center

But what if you only use one type of software for analysis?

You may object that, even though the figure suggests different types of analysis, you are doing only one, say statistics or geographical analysis. Would you still need a relational database? Arguably, the answer is ‘yes’ … but it depends. It depends on how likely you are to change the data structure and on whether the structure you end up with calls for a relational database. As soon as some form of repetition creeps into the data or into the data structure, you should consider a relational database. Read more about that on this page.

Dealing with inflexibility

In practice, there is a price to pay for the flexibility of relational databases. Although changing the data structure itself is easy, the hard part – when it comes to developing – is that much needs to change along with it: layouts, buttons and the programming that glues it together into a working tool. I see two ways to deal with this.

Firstly, at the beginning of your research, you may not have anything to start with. You’ll be developing your research question along with your needs for data and gradually build your data structure. While you do this, keep in mind that you may want to expand with additional data later on.

Secondly, if you are building your own database, simply try to postpone adding the ‘other’ stuff, i.e. the layouts, the buttons, the additional programming, as long as possible. And if you do add these, then keep them as simple as possible. Most likely, because you’re building the database yourself, you will become an expert on the database software and probably won’t need much more than the bare bones of tables and relationships and a few layouts to manage your data.

Knowing when to start the hub

If you are not starting with a relational database but with the preferred software of your discipline or institution, that will probably bring you quite far. The question then is when you should change to a relational database.

Above I already mentioned the occurrence of repetition in your data or in your data structure. Another indicator is that you find yourself doing a lot of ‘fiddling’ or ‘hacking’ of your software. It might indicate that you’ve reached the limits of data management with the software that you are using. Then it’s time to 1) get the data out, 2) move it into other software to be able to do what you need, and 3) import it back into the software that you were using for the analysis. You might do this one time as an exception. Then another and then another. When it happens a lot, you have adopted the hub system.

Then, of course, you may want to do a new type of analysis which requires different software. For example, you were doing statistics and want to do network analysis. If your statistical package allows you to export your data as ‘nodes’ and ‘edges’, you’re ready to go. If it does not, then – you guessed it – consider a relational database as a hub.
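As a rough sketch of what the hub offers in that situation, the example below derives a list of ‘nodes’ and a list of weighted ‘edges’ from two relational tables. It assumes SQLite through Python's built-in sqlite3 module; the tables `person` and `letter` are hypothetical, and real network software may expect a different column layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE letter (
    id INTEGER PRIMARY KEY,
    sender_id INTEGER REFERENCES person(id),
    recipient_id INTEGER REFERENCES person(id)
);
INSERT INTO person (id, name) VALUES (1, 'Ada'), (2, 'Ben'), (3, 'Cleo');
INSERT INTO letter (sender_id, recipient_id) VALUES (1, 2), (1, 2), (2, 3);
""")

# Nodes: one row per person.
nodes = conn.execute("SELECT id, name FROM person ORDER BY id").fetchall()

# Edges: one row per sender/recipient pair, weighted by the number of letters.
edges = conn.execute("""
    SELECT sender_id AS source, recipient_id AS target, COUNT(*) AS weight
    FROM letter
    GROUP BY sender_id, recipient_id
    ORDER BY source, target
""").fetchall()

print(nodes)
print(edges)
```

The same tables that feed a statistical export can thus feed a network export as well, without changing how the data is stored.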

May 2019