Databases for research header image

The choice of software depends on the structure of your research data.

Your choice of research tools is probably determined through your history. Your education and research experience have lead you to certain instruments of choice, that feel ‘natural’ to you. However, from time to time you may want, or are forced to, reconsider. There are some aspects to this. To name some: function ( What do you do with the tool? ), data structure ( What kind of data are you processing and what does it look like? ), costs and availability, ease of migration ( Can you move your data to the other software? Can you move it away again? ), transfer-ability to colleagues ( If you collaborate or hand over, how easily will your colleagues deal with your software? ). Here, I will look at the structure of research data. I will show some examples and discuss which software is in my view best suitable to handle the data.

It is important to realize that in some cases the creation and maintenance of the data is better done through software different from what is used to analyze it. An obvious example is using survey software to create a survey and collect the results, but using statistical software or spreadsheets to analyze it. Similarly, transcribing interview recordings, coding the resulting texts and then analyzing the coding could be supported by three different software packages.

In such cases, it is worth your while to consider if and how data can be transferred from one software package to another. Perhaps as important is it to realize that the data may need to be restructured as part of the transfer. If so, find out how you can do this because it may not be trivial. I am a big fan of relational databases, so call me biased if you like, but compared to other software, they are a marvel of flexibility and effectiveness when it comes to restructuring data.

No repetition, quantitative data

Perhaps your data looks like this.

Birth yearGenderHeightWeightDietSmoker
1995f16255Fish, no meatNo

Basically, every row contains an observation, there is no obvious repetition in the data and the data is numerical and/or highly codified. In such cases, your likely choice to produce and analyze the data could be spreadsheets, statistical software and survey software.

Repeating rows

Perhaps your data looks like this.

CompanyAddressCityCountryYearProfitNumber of
NanosoftBilvägen 14StockhomSweden1993 15000050-75
Carz ABSåggatan 168MalmöSweden19934756000250-500
Carz ABSåggatan 168MalmöSweden19944986000250-500
BAAFalkoner vej 5BAhusDenmark1993 9572025-50
NanosoftBilvgen 14StockholmSweden1994 15600050-75
NanosoftOrustgatan 98StockholmSweden1995 15200075-100

In this example, there is some data that differs from row to row, like the profit and the number of employees, but some data, covering multiple columns, is repeated in multiple rows. A company may move to a different location, but it is not likely to change every year. And perhaps, you are only interested in it’s current address, in which case it will be the same for every row that involves the company.

In such cases, it makes sense to separate the address data from the annual data in two separate tables. This way address info does not need to be repeated which reduces the chance for errors and is easier to maintain. This can be done in spreadsheet, and in simple cases it can even be used to re-combine the data when needed (through functions like Vertical Lookup). However, if you have more repeating data, leading to multiple separate tables, then you should consider using a relational database. The linking of data in different tables is typical what relational databases do.

It probably means that creating, storing and maintaining your data (in the relational database) gets separated from analyzing (in the spreadsheet or statistical software), because relational databases are not very suitable for that. They can produce basic summaries, overviews and aggregates (like average value), but a regression analysis is a different matter.

Repeating columns

Your data may look like this

CompanyChairAdvisorMember 1Member 2Member 3Member 4
Carz ABM KarlssonB AndersonE SvenssonK Lindberg
NanosoftS MattsonC EklundA HåkanssonE GunnarssonF ForsbergT Eliasson
BAAL NybergP BlomqvistM KarlssonG StrömS Holmgren

Here is no repetition of actual data, although that could happen as well, but here is a repetition of the type of column, viz. the member columns. Such data can be created, maintained and managed and analyzed in a spreadsheet, but notice the empty cells. If in 200 company records, 95% of the companies have three regular board members, three percent has seven members and 2% have sixteen members, then that will cause a lot of impractical empty space. For analyzing this data one might have to insist that all data for one company occupies exactly one row. For that I see no alternative. However, for creating, maintaining and managing this data a relational database would be far more effective. To analyse the data, an export to XL in the format above might be needed, but this is not difficult to achieve.

Repeating rows and columns

What is said about the repeating row-data and repeating columns typically also holds for a combination of the two. Such as in the following example, where M Karlsson is the chair of the board of Carz AB and advisor for BAA

Carz ABM KarlssonÅldermansgatan 38StockholmE SvenssonLuntvägen 529Uppsala
NanosoftS MattsonKlostergatan 142-ASödertaljeC EklundKlövervägen 11Göteborg
BAAL NyberNotariegränden 88KirunaM KarlssonÅldermansgatan 38Stockholm
K-branchM KarlssonÅldermansgatan 38StockholmL NyberNotariegränden 88Kiruna

Because the same name and address combinations can occur in multiple rows and different columns, manual maintenance for example because of a change of address is bound to cause errors sooner or later

Textual data

To not overlook the obvious, if your data exclusively consists of regular text for humans to read, then a word processor or even a simple text editor suffices. An important research tool that they offer is that it is easy to find words, phrases or parts of them. For many research purposes this is actually good enough. For example if you have a log of journal that you keep with notes about experiments, visits to archives, field visits, or interviews. The only thing you might want to do with these notes is read and re-read them or look up something specific.

However, if you want to use the text in a different way, such as organizing or analyzing, you may consider other types of software. The most advanced use probably is automatic text analysis. Basically you have the computer doing some analysis and give you results without any manual labor. Frankly, I know of no off-the-shelf software that does this. If you do, then please do let me know. If you want to go in this direction, you’re likely to hire a software engineer – assuming that you are a linguist.

Second most advanced would be semi-automatic analysis. Here humans would typically ‘code’ the text, i.e. apply labels to parts of the text, and then have software to analyze the codes: how often do codes occur, which codes co-occur, what is the average distance (in words or sentences) between selected codes, et cetera. It boils down to quantitative analysis of qualitative data. For this type of use, several software packages exist. Typical examples are NVIVO and ATLAS.ti.

If you are interested in ‘something in between’ reading and semi-automatic analysis, then you need to tinker with word processors, spread sheets or relational databases. The in-between area is where the text more or less remains as it is or where coding takes place at the level of paragraphs (rather than at words, phrases or sentences), and where you want to attach additional data to these paragraphs to allow for selecting or sorting.

For example, you took archival notes and you want to add the year(s) that your notes cover, the geographical area and the main topic. This can easily be achieved in a word processor that allows you to make tables.

YearLocationMain topicNotes
1824                 The Hague                                         Sewer canal                                                At the beginning of the 19th century, the city architect of The Hague, Zeger Reijers, designed a plan for the canal in combination with a new harbor. The plan was presented in 1824. King Willem I was very content with the plan and commissioned a study.
1830The HagueHarbor planIn 1830, king Willem I decided to scrap the plan for the harbor but wanted to continue with the sewer canal
1862The HagueCanal constructionThe war with Belgium delayed the digging of the canal but it was ‘completed’ in 1862. ‘Completed’ here meant that the canal did not actually reach the sea. See Vijfwinkel et al., 1986, p. 162-163 for more details

You end up having a table, which means that it could also be done in a spreadsheet, or a relational database.

A second point to notice is that, again, you end up having a table. It has big chunks of text in it which looks weird, but it is nonetheless a table, and you may end up with situations as described in the previous sections. So, if the data that you want to attach to the notes becomes more complex, your choice should be for relational database software rather than a spread sheet or word processor.

Images, video or sound recordings

With images, video and sound recordings, the basic argument is similar to that of text. For viewing and listening, a set of files on a computer suffices. Searching of a certain file is harder, although here Google may be of help, and searching within the files near impossible as far as I know. So, arguably, they are less accessible.

A quick internet search learns that there is a lot of video analysis software available, that allows human motion analysis, object tracking, face recognition and more. At first glance the application areas for this software is largely outside academic research, but that does not mean that it cannot be used for research applications. It is a field that I am not familiar with, so I will leave it at that.

Semi-automatic analysis is possible and in outline works the same as text analysis: humans code the contents of the files, and the software then analyses the coding. In case of video and sound, code can be applied to certain fragments. NVIVO and Atlas.ti are candidate software for this, even though they were not originally designed for it.

As with textual data, one might want to attach other types of data to the audio, video or images. Integrating them in a word processor or spreadsheet is not impossible but may not be very handy due to file-size problems. Most relational database software however is capable of storing entire files or even multiple files in a single field. Here too, file size may be an issue but the databases may be better capable to handle big size and volumes. For all these softwares it is possible to not store the actual files but links to the files on the operating system. It solves the size problem but seeing the additional data literally next to the additional data may be difficult.

Mixed data

Some branches in social science use ‘mixed data’ in more or less structured form. Mixed would mean for example text, but also quantitative data, field notes, photos and other media recordings. For more structured or systematic data the question of which software to use boils down to breaking down in similar parts and picking from the sections above.

For less structured data, a folder on your computer’s operating system is already a great start. It holds everything in one place and makes it tag-able and searchable, at least as far as text is involved. In addition, advanced note-taking software such as Evernote, Onenote and Bear does more or less the same.

Relational databases could contain less structured data, but it does not fit neatly in the concept of related tables. Instead a so-called xml database could do the job a lot better. I have no experience with any of such databases, so all tips are welcome.

Perhaps one warning about mixed data is in place. Unless you really know what you are doing, using less structured mixed data could be a sign that you are not sure which data structure to use. To me that means you do not have a good idea of where you want to go with your research. This is absolutely fine. You will probably decide at a later stage. However, precisely because of that, it might be best not to invest too much time and other resources in software that you may not continue using in the near future.

May 2019

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.