How I got to grips with stats.

If you’re an undergraduate or early MA student in the social sciences, you’re probably going to need to get to grips with stats. For most students in this area, this feels like a massive undertaking, and online rows about the best approach (e.g. Frequentism vs. Bayes – we’ll get to those later) and software package (e.g. SPSS vs. R) can make the process feel even more daunting, prompting the questions: Where do I start? And how do I get on top of this stuff most efficiently? This post tries to answer these questions by describing a few of the materials and tools that helped me. In short, how I learnt it is not how I’d recommend learning, and if you want to skip the waffle you can scroll to the end of this post to see the route I’d recommend with hindsight.

For me, stage one was Urdan’s (2010) Statistics in Plain English. Having not studied maths since secondary school, I found this a good entry point. It’s a clear, no-nonsense introduction to the fundamentals: sampling and research design, measures of central tendency and variability, the normal distribution, standardisation, standard errors, significance and effect sizes, as well as the battery of commonly used tests (correlation, t-tests, ANOVA, regression, etc.). One drawback of this text (which now feels fairly terminal) is the lack of any computational examples to work through, e.g. in SPSS or R. In retrospect, this means I could probably have skipped it and gone straight to something more hands-on. That said, this book was pivotal in my personal transition from stats luddite to marginally more informed stats luddite.

At this point I was vaguely aware of R, but didn’t have the guts to start coding. Stage two was learning SPSS to analyse my own data for my early MA assignments. My supervisor recommended Pallant’s (2013) SPSS Survival Manual: a pithy, step-by-step guide to implementing the major models. Particularly useful at this point were the ‘Example research questions’ and ‘What you will need’ (i.e. number and type of variables) sections included with each model description. For me, the very basic but essential take-home message from these tips was: consider your modelling options before you begin collecting data. This seems so obvious now, but it didn’t at the time – not without solid knowledge of the options available and their limitations. It’s worth mentioning that Pallant provides a companion website with practice datasets and answers, making it easy to learn methods you might not be using in your own work but need to be able to interpret.

stage.three <- c("R"). I used SPSS for my MA theses, and while they were being marked I decided to start learning R. If, like me at this stage in my stats odyssey, you’ve heard of R and downloaded it to see what the hype is about, you may (also like me) have been sorely disappointed. Yes, it’s free, increasingly ‘industry standard’, and people say it’s hugely powerful once you get into it – but it feels impenetrable at first. Almost all R manuals start with example code showing how to construct basic objects or vectors (e.g. a <- c(1, 2, 3)). Looking back, this is essential, but it inevitably prompts the thoughts: How do I go from this to a regression model? And do I really need to learn how, given that SPSS will let me do it through menus? The short answer to the latter question is that you probably don’t have to, but almost certainly should – not only because R is free, but because it will improve your knowledge of statistics enormously. Being explicit and putting everything in code makes your analysis much clearer, as well as shareable, which is in the interests of open and replicable science. It’s also huge fun, and you’ll quickly realise the initial hurdle was worth it.
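To preview the jump from toy vectors to an actual analysis, here’s a minimal sketch using base R and its built-in mtcars dataset (my example, not one from the texts above):

```r
# mtcars is a small dataset bundled with R: specs for 32 cars
data(mtcars)

# Fit a simple linear regression: fuel economy (mpg) predicted by weight (wt)
model <- lm(mpg ~ wt, data = mtcars)

# Coefficients, standard errors, t statistics, p values, and R-squared
summary(model)

# 95% confidence intervals for the intercept and slope
confint(model)
```

Three lines of code take you from raw data to a full regression output – exactly the leap those opening chapters on vectors are quietly preparing you for.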

The simple solution to R’s impenetrability problem is to invest in Andy Field’s Discovering Statistics Using R. I can’t recommend this book enough, and if you’re into the academic side of Twitter etc., you’ll see it pop up again and again – usually with people sharing pics of their dogs, cats, and children reading it (guilty). You’ll learn research design, stats, and R from the bottom up, and come out the other side with a solid understanding of what’s what. While reading and working through the text, get on YouTube: for me, Andy Field’s lectures were not only an invaluable complement to the text, but also a masterclass in engaging teaching. A notable omission from the text is meta-analysis, and I’d recommend reading this paper to fill that gap – it’s by Field and has the same feel as the main text. Briefly, another fantastic R resource, with an emphasis on visualisation and data manipulation, is Wickham and Grolemund’s R for Data Science, which is available here.

The stats books I’ve mentioned so far are all in what’s called the frequentist tradition. The meaning of that word will be clearer once you’ve been through Field’s book. Put a little crudely, a key point is that statistical procedures in this tradition typically return point estimates (or best guesses) of the values you’re interested in – for example, the mean difference in scores between young and old people on a memory test. You’ll also get a p value and (fingers crossed) an effect size, telling you whether the point estimate is significant and practically important. Another general approach is Bayesian statistics, which (again crudely put) outputs probability distributions rather than point estimates: you get a curve showing the relative probability of different values of the parameters you’re interested in (e.g. a mean or correlation). Interest in Bayes is growing, due largely to better explanatory materials and more powerful, simpler tools for doing the work. Unfortunately, it sometimes feels like Twitter is approaching civil war on this matter. Luckily, my personal journey into Bayes (stage four) began with Daniel Lakens’ excellent, and free, MOOC (massive open online course) Improving your Statistical Inferences. This course will make you an immeasurably better researcher, and provides a sober response to the frequentist vs. Bayesian dispute, showing that both approaches can be useful, and that animosity between camps might be reduced dramatically by improving the quality of statistical inferences made within camps. For instance, Bayesians’ widespread hatred of p values might be tempered a little (…possibly) if frequentists didn’t so frequently misinterpret them. The same goes for confidence intervals.
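To make the frequentist/Bayesian contrast above concrete, here’s a toy sketch of the same comparison run both ways. The data are simulated; t.test() is base R, and the Bayesian side assumes the BayesFactor package (one option among several):

```r
set.seed(42)

# Simulated memory scores for two hypothetical groups
young <- rnorm(30, mean = 75, sd = 10)
old   <- rnorm(30, mean = 70, sd = 10)

# Frequentist: a point estimate of the mean difference, a p value,
# and a 95% confidence interval
t.test(young, old)

# Bayesian: a Bayes factor quantifying the evidence for a difference,
# plus posterior samples giving a full distribution for the effect
library(BayesFactor)  # install.packages("BayesFactor")
bf <- ttestBF(x = young, y = old)
bf
samples <- posterior(bf, iterations = 10000)
summary(samples)
```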

Anyway, with my interest in Bayes primed, I came across a tweet from @profandyfield in which he linked two introductory texts – a recommendation you can trust in the stats world. The first of these was Kruschke’s (2014) Doing Bayesian Data Analysis. I haven’t read this, but it seems to have the status in the Bayesian tradition that Field’s book has in frequentism (just have a look at the reviews on Amazon). The second was Richard McElreath’s (2015) Statistical Rethinking. I have read this, and it’s stylistically and pedagogically superb. Statistical Rethinking is a PhD-level text, so don’t expect an easy ride – you’re going to need to have done some groundwork. That said, the examples McElreath chooses are fascinating, making it easy to immerse yourself in learning what can at first seem like quite intimidating techniques. For instance, you’ll learn about masked relationships by modelling associations between kilocalories of energy per gram of breastmilk, mother body mass, and neocortex mass across different primate species. McElreath provides a companion software package, datasets, and an online lecture series, which I recommend working through closely. (When learning anything in R, make a code notebook, interleave code with your comments – use # to write comment lines that won’t be executed – and revise.)
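As an illustration of that notebook habit, here’s a hedged sketch of the masked-relationship example. The dataset and variable names assume the milk data bundled with McElreath’s rethinking package, and I’ve used plain lm() for brevity where the book fits fully Bayesian models:

```r
# --- Notebook entry: masked relationships ---
# Why does neither predictor look important alone, but both matter together?

library(rethinking)  # companion package to Statistical Rethinking; provides data(milk)
data(milk)
d <- milk[complete.cases(milk$neocortex.perc), ]  # drop rows with missing neocortex values

# Each predictor alone shows a weak association with milk energy...
summary(lm(kcal.per.g ~ neocortex.perc, data = d))
summary(lm(kcal.per.g ~ log(mass), data = d))

# ...but together their opposite-signed effects unmask one another
summary(lm(kcal.per.g ~ neocortex.perc + log(mass), data = d))
```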

So that’s the path I’ve taken up to now: the first year of my PhD. Looking back, here’s what I’d recommend:

  • Start with Andy Field’s Discovering Statistics Using R.* Skip SPSS and go for the jugular. You’ll learn R and a huge amount about frequentist stats. Fill in with this meta-analysis paper. Should you skip frequentism and go straight to Bayes? Some researchers say yes. However, while I agree it’s learnable at undergrad level, especially given new tools like JASP (see below), the problem is that if you come to 99.9% of social science papers with knowledge of Bayes alone, you won’t be able to interpret or judge the quality of their results. For instance, Bayesian analyses don’t require correcting your alpha level for multiple comparisons, and if you haven’t learnt this process and then read a paper in the frequentist tradition that fails to do it (which is surprisingly common), you won’t spot the error (see the sketch after this list). It might be unpopular, and (again) I’m certainly not saying Bayes is too difficult for undergrads, but given the sheer quantity of papers published using p values etc., I’d still recommend starting there.
  • Next, take Daniel Lakens’ course. This will tighten up what you’ve learned already, teach you a huge amount more, make you incredibly positive about the direction science is taking (e.g. open source tools and pre-registration etc.), and provide the groundwork for moving toward Bayes. Highly, highly recommended.
  • Next, download JASP and the associated introductory papers Bayesian Inference for Psychology Parts I and II, available here. I didn’t mention this above because it’s not how I did it. But looking back I reckon this is probably the best next step into Bayes. JASP is free (again, a massive thank you to the developers) and the supplementary papers are clear and engaging. Commit two afternoons to working through them and you’re well on your way.
  • After that, read Kruschke’s (2014) Doing Bayesian Data Analysis or McElreath’s (2015) Statistical Rethinking, or both. McElreath himself recommends Kruschke’s book, saying they offer different perspectives.
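As promised in the first bullet, here’s a tiny sketch of correcting your alpha level for multiple comparisons, using base R’s p.adjust() on made-up p values:

```r
# Hypothetical p values from five pairwise comparisons
p <- c(0.008, 0.020, 0.035, 0.041, 0.300)

# Uncorrected, four of the five fall below alpha = .05
p < 0.05

# Bonferroni and Holm corrections adjust for the number of tests;
# after correction, far fewer comparisons survive
p.adjust(p, method = "bonferroni")
p.adjust(p, method = "holm")
```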

Throughout this process, keep an R code notebook, and immerse yourself in stats YouTube videos, blogs, and podcasts (see my links for some favourites). You’ll have moments when it feels like your brain is a wreck; when simple R code throws back an error message or when you start learning the principle of maximum entropy. But you’ll make massive gains in your ability to produce and read high quality research in both frequentist and Bayesian traditions. Also, remember we’re all everyday consumers and users of data – stats and data science are vital life skills, not just things you need to get through your science degree. 

Good luck.

* As a final note, it’s possible that you follow the links for the textbooks mentioned above, see the cost, and go on the hunt for PDFs. On Amazon UK, Discovering Statistics Using R is currently £44.20, while Doing Bayesian Data Analysis is £49.79. For students, that isn’t cheap, and I’ve seen a few complaints to that effect on Twitter. However, not only are these texts the essential elements of the approach I recommend above (and I’m confident a lot of researchers would vouch for their inclusion), they are the only financial investment you need to make. R = free. JASP = free. Daniel Lakens’ course = free. Bayesian Inference for Psychology Parts I and II papers = free. So for less than £100, you’ve got everything you need to take you to a fairly advanced level of understanding. Also, think of the hours these authors poured into those books, and the complementary datasets, video lectures, and R packages, and the cost seems pretty reasonable.