Have you ever been working in a spreadsheet program like Excel, fine-tuning figures and calculating means and other stats, only to discover you have to do it all again because your co-worker forgot to send you all the data? Or you proudly showed your supervisor your results, and they excitedly told you how awesome this is, but how much awesomer it would be if you added that latest fancy analysis, “that shouldn’t take you long, right?” And of course it does, because you have to go through the whole thing again.
And possibly even worse than knowing you have to spend a ton of time updating the tables and figures and stats is your “well-meaning” friend telling you that “if you had done all this in R, it would take you only 5 minutes to change the data, and then you could run your code to produce all your stats and figures automatically”...
I was that guy last week. And I agreed with my victim that a comment like that is not cool.
It’s true though! (That’s what they all say. Still not cool.) When you know how to do some coding, in R or Python or whatever language, a new world opens up and you can save an incredible amount of time. The problem is that you first need time to learn how to do it, and the steep learning curve is daunting, especially when there’s no close co-worker or friend who can get you started.
The big question is when it is best to stick to what you know, even if it is more time-consuming, and when it is worth investing your time in learning some R basics. There’s always something more pressing that keeps you from allocating time to learning R. It’s outside the comfort zone of what you know. It may be a completely new way of thinking, this programming logic with its for-loops and dataframes. And what the hell is the difference between a matrix and a dataframe anyway?
This is a conversation I’ve had with many people, and because learning R has improved my own professional life so dramatically, I used to get all excited and start advocating the world of R like a zealous preacher. Now, though, I am not as quick to recommend it, because I have come to realize that it really depends on the type of work you do, your background, and your natural motivation. That last one might even be the most important. I have always had this nerdy side (I asked my dad to get me books on programming when I was 10 because my games weren’t working well), and that clean, logical programming world always appealed to me and felt like a safe space. But that doesn’t mean it works for everyone.
It took me a while before it clicked and I got a sense of how useful coding is. It started like this. I was trying to estimate rodent home ranges, and that involved cleaning and preparing data in Excel, importing them into ArcGIS, selecting the right columns and coordinate system, waiting for the program to calculate the home ranges for certain periods, exporting the data as an Excel spreadsheet so I could check whether my method settings were good and how they could be better, and doing it all over again every time the smallest thing changed. When I explained this to Jonas, who had just started a postdoc in the research group, he offered to write me some simple code in MATLAB to automate a small but time-consuming part of what I was doing. When I saw the code, I was impressed by how short it was, and by how big an impact such a simple thing could have. I got a first sense of what other kinds of things could be done with this, and slowly started trying to write some code myself. It started with pretty simple things, but the more I used it, the more I realized how useful it could be. After a while I was hooked and vowed to use Excel only for entering data, and to do all the rest in R. It completely changed the way I do research, even the kinds of questions I ask and dare to tackle.
So, what are reasons for using R (or other programming languages)?
A first big one is automation. Say you have a dataset of weights of male and female lizards at different times, and you want to calculate the mean for each sex, make a figure of how weight changes over time, and test whether the males and females differ in weight. That’s pretty straightforward in Excel, and you can produce neat figures after playing around with colors etc. But then you realize that you hadn’t included the lizard data that were collected last month. That means you have to go through the whole process again. In R, re-calculating the mean and stats and re-producing the figure would require no changes to the code at all. All you need to do is add the new data to the original data file, and then you can just run the code again as is and everything is produced automatically. The question you might be asking yourself now is “ok, but how long did it take to write the code in the first place?” That’s the key point here, right? If you are completely new to R, this might take you somewhere between half a day and two days, because you have to learn it all. That’s a pretty bad return on investment, so there is little motivation to do it. But once you have learned how to write this code, it will take far less time to produce it again for different work, to adapt it to other datasets, and to do more complicated things. For someone with basic experience, a simple thing like this example would take maybe 15 minutes to write. Less if you have some previous examples saved. It would take 5 minutes for an experienced coder.
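To make the "just run it again" idea concrete, here is a minimal sketch of the lizard example. It is written in Python using only the standard library rather than in R, and everything in it is made up for illustration: the numbers, the column names (sex, month, weight_g) and the summarize function. It leaves out the figure, and it only computes Welch's t statistic, not a full test with a p-value.

```python
import csv
import io
import statistics
from collections import defaultdict

# Made-up example data. In practice this would be a data file
# that you keep adding rows to as new measurements come in.
CSV_TEXT = """sex,month,weight_g
F,1,10.0
F,2,11.0
F,3,12.0
M,1,13.0
M,2,15.0
M,3,17.0
"""


def summarize(csv_text):
    """Return the mean weight per sex and Welch's t statistic (males vs females)."""
    weights = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        weights[row["sex"]].append(float(row["weight_g"]))

    means = {sex: statistics.mean(values) for sex, values in weights.items()}

    males, females = weights["M"], weights["F"]
    # Welch's t statistic: difference in means divided by its standard error
    std_err = (
        statistics.variance(males) / len(males)
        + statistics.variance(females) / len(females)
    ) ** 0.5
    t_stat = (means["M"] - means["F"]) / std_err
    return means, t_stat


means, t_stat = summarize(CSV_TEXT)
print(means, round(t_stat, 2))

# Last month's data turned up? Add the rows and run the exact same code again.
updated_text = CSV_TEXT + "F,4,13.0\nM,4,19.0\n"
updated_means, updated_t = summarize(updated_text)
print(updated_means, round(updated_t, 2))
```

The point is the last few lines: when the data change, nothing in the analysis has to change. You add the rows and rerun.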
There are many more reasons to use R, of course. Let’s see. Some basic but super useful things would be:
Subsetting a dataset (e.g. now I only want data collected before last year) can be done using one short line of code and is intuitive and simple to learn. That would take about 25 seconds to write.
Changing the colors of the lines in a figure: 50 characters of code, 15 seconds to write.
Producing a different figure for each year in a long dataset, and saving it automatically as a jpg file: might need writing a “for loop” (takes a while to learn but opens up a whole world of possibilities), 4 lines of code, 1 minute to write.
Downloading data from a website and saving it on your computer as an Excel file: 2 to 5 lines of code, 2 minutes if you have the internet address.
Lowering the resolution of 500 images so they can be emailed: 3 lines of code, 2 minutes to write and maybe a few minutes to run depending on the size of the pictures.
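Two of the items above, subsetting and the for loop that saves one file per year, can be sketched in a few lines. Again this is Python with only the standard library, not R, and the records and file names are invented for the example. Since plotting needs an extra library, the loop here writes a small text summary per year as a stand-in for the jpg figure; the looping pattern is the same.

```python
import tempfile
from pathlib import Path

# Made-up records: (year, weight in grams)
records = [(2019, 10.0), (2019, 12.0), (2020, 11.0), (2020, 13.0), (2021, 14.0)]

# Subsetting: keep only the data collected before 2021, in one line
before_2021 = [r for r in records if r[0] < 2021]

# For loop: one output file per year, each named automatically.
# A temporary folder is used here so the sketch does not clutter your disk.
out_dir = Path(tempfile.mkdtemp())
for year in sorted({y for y, _ in records}):
    weights = [w for y, w in records if y == year]
    mean = sum(weights) / len(weights)
    (out_dir / f"summary_{year}.txt").write_text(f"{year}: mean weight {mean:.1f} g\n")

written = sorted(p.name for p in out_dir.iterdir())
print(before_2021)
print(written)
```

With 3 years or 30 years in the dataset, the loop stays exactly the same length.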
There’s so much it can do! If it exists in digital form, you can import it into R and do stuff with it.
A scanned image, a map, an excel file, a pdf, a text file, a word document, a sound file.
Bring it on!
It’s still hard, though, to decide whether to start learning it or not. My suggestion is to seriously consider it if you manipulate any type of data more than just occasionally. If you work with data on a regular basis, there’s no question about it: it will be worth investing your time, because you will gain so much time back. On top of that, as soon as you start learning even basic code, you will start to see its potential and ways to use it for different purposes. If you really don’t have time, consider working together with someone who does know how to code. Tell them what you are doing, and ask whether they can make that process faster and more efficient. Or ask them to write you code with clear explanations so that you can start using it yourself, even if you aren’t able to create it. This has the potential to free up a lot of time that you can then spend on more important work.
I know, I probably still sound like the preacher I’m trying not to be, and like that guy, but it’s because I’ve been there. It’s only because of Jonas that I started realizing how huge the impact of coding can be in the life of someone who works with data. I love it.