Statistical education software tool usage

A recent paper “The Teaching of Introductory Statistics: Results of a National Survey” by Chelsey Legacy et al. in Journal of Statistics and Data Science Education summarizes 228 responses to a 2019 survey of statistics teachers in colleges and universities. The following figure got my attention since I work on a statistical software product, JMP. The chart supposedly shows the percentage of respondents using software in each of five categories.

It’s shocking if Excel is really so dominant at the college level, and I wonder how the individual products in each category fare. Fortunately, the data is available to download. The tool usage data comes from the question, “Do students use the following desktop- or web-based applications/software to analyze data in your class?” which was asked once each for 18 different applications. Here’s an excerpt of the raw survey data, showing the presence of three possible responses: NA (no answer), No, and Yes.

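| t_Excel | t_Fathom | t_JMP | t_Minitab | t_Python | t_R_GUI | t_R_Studio |
| --- | --- | --- | --- | --- | --- | --- |
| Yes | No | No | No | No | No | No |
| Yes | No | No | No | No | No | Yes |
| Yes | No | No | No | No | No | No |
| Yes | No | No | No | No | Yes | Yes |
| Yes | No | Yes | Yes | No | No | No |
| No | No | No | No | No | No | No |
| NA | NA | NA | NA | NA | NA | Yes |
| Yes | No | No | No | No | No | No |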
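As a rough sketch of how such data can be tallied, the base R snippet below counts Yes, No, and NA responses per tool. It uses only the eight rows and seven columns of the excerpt above, not the full survey file, and the helper name `count_responses` and the `Missing` label are made up for illustration.

```r
# The eight excerpt rows shown above (7 of the 18 tool columns), not the full data.
excerpt <- read.table(text = "
t_Excel t_Fathom t_JMP t_Minitab t_Python t_R_GUI t_R_Studio
Yes No  No  No  No  No  No
Yes No  No  No  No  No  Yes
Yes No  No  No  No  No  No
Yes No  No  No  No  Yes Yes
Yes No  Yes Yes No  No  No
No  No  No  No  No  No  No
NA  NA  NA  NA  NA  NA  Yes
Yes No  No  No  No  No  No
", header = TRUE, colClasses = "character")  # the string "NA" is read as missing

# Hypothetical helper: tally one column, keeping missing values as their own count.
count_responses <- function(x) {
  c(Yes = sum(x == "Yes", na.rm = TRUE),
    No = sum(x == "No", na.rm = TRUE),
    Missing = sum(is.na(x)))
}

# One row per tool, one column per response type.
counts <- t(sapply(excerpt, count_responses))
print(counts)
```

Keeping the missing values as a separate count, rather than folding them into No right away, leaves the question of what NA means open until the response patterns have been inspected.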

Here’s a graphical summary of all the responses.

Excel indeed has the most Yes responses, but some others aren’t that far behind, and I would expect them to be on par with Excel when grouped into categories. Since each respondent can answer Yes to multiple software choices, it’s not so straightforward to aggregate the responses. Still, we can already see that R Studio alone has 50+ Yes responses, or about 25% of the total, which is already above the 10% claimed in the paper’s figure.

(Note: Tableau was grouped into “Other” instead of “GUI-Based” because of what appears to be a coding error in the paper, where it was not assigned a specific category.)

Missing responses

There are a fair number of NA responses, and it’s unclear what they mean in this context. Does an NA really represent a missing response, or is it the same as No? I can imagine the latter if a respondent answered Yes for some software and skipped the rest, so that the skipped questions were recorded as NA.

Here’s one way of looking at all the responses, grouped by the Yes-No-NA response pattern.

The top gray bar represents the 25 respondents who didn’t answer any of the software questions. I checked, and most of those respondents answered all the other questions in the survey, which suggests they may have left the software questions unanswered intentionally as a blanket No response. Either way, the relative results aren’t affected.

Very few respondents used both NA and No, so it seems reasonable to treat NA as No, at least for the upper bars where respondents gave only Yes and NA responses. After that, there is a group of No+NA responses. Oddly, they come from the first two respondents in the data set, who also left many other questions unanswered, making me wonder if there was some issue that got worked out in the survey administration. Fortunately, these are only a few responses, not enough to have a noticeable impact on the summary statistics.
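The grouping by response pattern can be sketched in base R as follows. This is only an illustration on the eight excerpt rows shown earlier (seven of the 18 tool columns), and the helper name `response_pattern` is hypothetical. The last two lines apply the assumption discussed above, that an unanswered question means the tool is not used.

```r
# The eight excerpt rows shown earlier (7 of the 18 tool columns), not the full data.
excerpt <- read.table(text = "
t_Excel t_Fathom t_JMP t_Minitab t_Python t_R_GUI t_R_Studio
Yes No  No  No  No  No  No
Yes No  No  No  No  No  Yes
Yes No  No  No  No  No  No
Yes No  No  No  No  Yes Yes
Yes No  Yes Yes No  No  No
No  No  No  No  No  No  No
NA  NA  NA  NA  NA  NA  Yes
Yes No  No  No  No  No  No
", header = TRUE, colClasses = "character")  # the string "NA" is read as missing

# Hypothetical helper: which of Yes / No / NA appear in one respondent's answers.
response_pattern <- function(row) {
  present <- c(
    if (any(row == "Yes", na.rm = TRUE)) "Yes",
    if (any(row == "No", na.rm = TRUE)) "No",
    if (anyNA(row)) "NA"
  )
  paste(present, collapse = "+")
}

# One pattern label per respondent, then a count of respondents per pattern.
patterns <- unname(apply(excerpt, 1, response_pattern))
print(table(patterns))

# Assumption: treat every unanswered question as a No.
recoded <- excerpt
recoded[is.na(recoded)] <- "No"
```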

Summarizing by software type

When multiple responses are allowed per respondent, there are a few ways to aggregate them. Since the caption of the original chart says “Percentage (and 95% CI) of STI respondents”, I aggregated them such that if one respondent is using three different GUI-Based tools, that counts once, not three times. The paper also introduces the chart with “For instructors who have students use software …”, which I interpret as ignoring the 27 respondents who had zero Yes responses.
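A minimal base R sketch of that respondent-level aggregation is below. It runs on the eight excerpt rows shown earlier rather than the full survey, so the resulting percentages are illustrative only. The tool-to-type mapping follows the categories used in the paper, and names such as `type_of` and `uses_type` are made up for this example.

```r
# The eight excerpt rows shown earlier (7 of the 18 tool columns), not the full data.
excerpt <- read.table(text = "
t_Excel t_Fathom t_JMP t_Minitab t_Python t_R_GUI t_R_Studio
Yes No  No  No  No  No  No
Yes No  No  No  No  No  Yes
Yes No  No  No  No  No  No
Yes No  No  No  No  Yes Yes
Yes No  Yes Yes No  No  No
No  No  No  No  No  No  No
NA  NA  NA  NA  NA  NA  Yes
Yes No  No  No  No  No  No
", header = TRUE, colClasses = "character")  # the string "NA" is read as missing

# Software type for each of the seven excerpt columns.
type_of <- c(t_Excel = "Excel", t_Fathom = "Pedagogical", t_JMP = "GUI-Based",
             t_Minitab = "GUI-Based", t_Python = "Syntax-Driven",
             t_R_GUI = "Syntax-Driven", t_R_Studio = "Syntax-Driven")

# TRUE where a respondent answered Yes; NA counts as not-Yes.
is_yes <- !is.na(excerpt) & excerpt == "Yes"

# One column per type: does the respondent use at least one tool of that type?
uses_type <- sapply(unique(type_of), function(ty) {
  rowSums(is_yes[, type_of[colnames(is_yes)] == ty, drop = FALSE]) > 0
})

# Keep only respondents with at least one Yes, then take the share per type.
any_software <- rowSums(uses_type) > 0
pct <- 100 * colMeans(uses_type[any_software, , drop = FALSE])
print(round(pct, 1))
```

Because each respondent contributes at most one Yes per type, a category does not gain or lose weight just because it contains more products.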

Recreating the original chart (except as a bar chart and without confidence intervals) reveals quite different results.

My findings align with my original stacked bar chart of the response breakdowns, indicating a potential error in the paper. I contacted the author in case there was a different intention for the percentages, but she only directed me to the R code that was included in the paper’s materials.

Diagnosing the difference

I don’t know R very well, but the code was written well enough that I could mostly follow along. Here are the critical sections. This first block also shows why Tableau gets coded as Other: it was left out of the “Software %in%” tests, so it falls through to the default case.

software = sti_2019 |>
  select(t_CODAP,t_Excel, t_Fathom, t_JMP, t_Minitab, t_Python,
         t_R_GUI, t_R_Studio, t_R_Studio_Cloud, t_SAS, t_SAS_U,t_SPSS, t_Stata,
         t_StatCrunch, t_Statkey, t_Tableau, t_TinkerPlots, t_Other) |>
  gather(
    key = Software, 
    value = Response
  ) |>
  mutate(
    Software = stringr::str_remove(Software, "t_"),
    Type = case_when(
      Software == "Excel" ~ "Excel",
      Software %in% c("CODAP", "Fathom", "TinkerPlots", "Statkey") ~ "Pedagogical",
      Software %in% c("JMP", "Minitab", "SPSS", "Stata", "StatCrunch") ~ "GUI-Based",
      Software %in% c("Python", "R_GUI", "R_Studio", "R_Studio_Cloud", "SAS", "SAS_U") ~ "Syntax-Driven",
      TRUE ~ "Other"
    )
  )
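One quick way to spot names that fall through to the default is to compare the selected columns against the names tested explicitly. The base R sketch below does that with the names copied from the block above; the variable names `tools`, `assigned`, and `fall_through` are my own.

```r
# The 18 tool names after the "t_" prefix is removed, as listed in select().
tools <- c("CODAP", "Excel", "Fathom", "JMP", "Minitab", "Python", "R_GUI",
           "R_Studio", "R_Studio_Cloud", "SAS", "SAS_U", "SPSS", "Stata",
           "StatCrunch", "Statkey", "Tableau", "TinkerPlots", "Other")

# Names that appear in an explicit test inside case_when().
assigned <- c("Excel",
              "CODAP", "Fathom", "TinkerPlots", "Statkey",
              "JMP", "Minitab", "SPSS", "Stata", "StatCrunch",
              "Python", "R_GUI", "R_Studio", "R_Studio_Cloud", "SAS", "SAS_U")

# Anything left over is caught by the TRUE ~ "Other" default.
fall_through <- setdiff(tools, assigned)
print(fall_through)
```

Only the literal Other column is expected to land in the default case, so Tableau showing up alongside it is the sign that it was left out of a category.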

The gather operation (which I know as “Stack” in JMP) restructures the 18 columns into 18 rows per respondent with two columns, Software and Response. For instance, here are the 18 rows for one respondent.

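| Software | Type | Response |
| --- | --- | --- |
| CODAP | Pedagogical | No |
| Excel | Excel | No |
| Fathom | Pedagogical | No |
| JMP | GUI-Based | No |
| Minitab | GUI-Based | No |
| Python | Syntax-Driven | NA |
| R_GUI | Syntax-Driven | No |
| R_Studio | Syntax-Driven | Yes |
| R_Studio_Cloud | Syntax-Driven | Yes |
| SAS | Syntax-Driven | No |
| SAS_U | Syntax-Driven | No |
| SPSS | GUI-Based | No |
| Stata | GUI-Based | No |
| StatCrunch | GUI-Based | No |
| Statkey | Pedagogical | No |
| Tableau | Other | No |
| TinkerPlots | Pedagogical | No |
| Other | Other | No |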

The next block of code summarizes that data without regard to the respondents. That is, instead of about 200 respondents, it’s looking at about 200 × 18 responses. So this respondent, who answered Yes for two of the six questions in the Syntax-Driven category, will contribute 2/6 in that group (2/5 once drop_na discards the NA for Python), 0/5 in the GUI-Based group, and 0/1 in the Excel group, when they should contribute 1/1, 0/1, and 0/1 respectively.

software_yes_tbl = software |>
  group_by(Type, Response) |>
  summarize(
    n = n()
  ) |>
  tidyr::drop_na() |>
  mutate(
    p = n / sum(n),
    N = sum(n)
  ) |>
  ungroup() |>
  filter(Response == "Yes") |>
  select(version = Type, n, p, N) |>
  mutate(question = "Software Type")
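To make the difference concrete, here is a small base R sketch applied to the single respondent listed earlier. It compares the share of Yes among that respondent's answered questions in each type (what the pooled summary effectively accumulates, since drop_na removes the NA rows) with a simple used-or-not flag per type. The column names `per_response` and `per_respondent` are hypothetical.

```r
# The 18 stacked rows for the one respondent shown earlier.
one <- read.table(text = "
Software Type Response
CODAP Pedagogical No
Excel Excel No
Fathom Pedagogical No
JMP GUI-Based No
Minitab GUI-Based No
Python Syntax-Driven NA
R_GUI Syntax-Driven No
R_Studio Syntax-Driven Yes
R_Studio_Cloud Syntax-Driven Yes
SAS Syntax-Driven No
SAS_U Syntax-Driven No
SPSS GUI-Based No
Stata GUI-Based No
StatCrunch GUI-Based No
Statkey Pedagogical No
Tableau Other No
TinkerPlots Pedagogical No
Other Other No
", header = TRUE, colClasses = "character")  # the string "NA" is read as missing

answered <- !is.na(one$Response)
said_yes <- answered & one$Response == "Yes"

# Counts per software type for this respondent.
types <- unique(one$Type)
comparison <- data.frame(
  Type = types,
  yes = sapply(types, function(ty) sum(said_yes[one$Type == ty])),
  answered = sapply(types, function(ty) sum(answered[one$Type == ty])),
  row.names = NULL
)

# Response-level share versus respondent-level flag (1 = uses the type at all).
comparison$per_response <- comparison$yes / comparison$answered
comparison$per_respondent <- as.numeric(comparison$yes > 0)
print(comparison)
```

For this respondent, the Syntax-Driven row has two Yes answers among five answered questions in the response-level view, but counts as a single user of that type in the respondent-level view.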

Effectively, categories containing many products get penalized in the final percentages. Here is my reproduction of the paper’s summarization. Though I don’t completely understand the R code, I think I got it right since the figures in the last column completely agree with the chart from the paper.

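| Type | Yes | No | NA | Yes / (Yes + No) |
| --- | --- | --- | --- | --- |
| Pedagogical | 54 | 681 | 177 | 7.3% |
| GUI-Based | 128 | 801 | 211 | 13.8% |
| Syntax-Driven | 101 | 994 | 273 | 9.2% |
| Excel | 93 | 99 | 36 | 48.4% |
| Other | 36 | 326 | 94 | 9.9% |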
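The arithmetic in this table can be checked with a few lines of base R. The counts are copied from the table, the `Tools` column is the number of products each type receives in the paper's code, and 228 is the number of survey responses. The column names `Missing`, `Pct`, and `RowsPerTool` are my own.

```r
# Counts copied from the table above; Tools = number of products per type.
tbl <- data.frame(
  Type = c("Pedagogical", "GUI-Based", "Syntax-Driven", "Excel", "Other"),
  Yes = c(54, 128, 101, 93, 36),
  No = c(681, 801, 994, 99, 326),
  Missing = c(177, 211, 273, 36, 94),
  Tools = c(4, 5, 6, 1, 2)
)

# Percentage as computed in the paper's code: Yes over non-missing responses.
tbl$Pct <- round(100 * tbl$Yes / (tbl$Yes + tbl$No), 1)

# Sanity check: each type should hold one row per respondent per tool.
tbl$RowsPerTool <- (tbl$Yes + tbl$No + tbl$Missing) / tbl$Tools
print(tbl)
```

Every type works out to 228 rows per tool, which confirms that the denominators are pooled responses rather than respondents.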

What now?

After being more sure of my calculations, I contacted the paper’s contact author again with the findings. Update: I just heard back, and they’re redoing the calculations and contacting the editor about making a correction.

Even if this is an error in the paper, what happens next? The growing trend where papers also publish their data and code is a giant step in the right direction, but as far as I know, there’s no mechanism for correcting errors in the code or the papers.

Epilogue

Still perplexed by the common use of Excel in college intro stats courses, I dug a little deeper into the data. There is a field for the type of institution, and a quick rework (without as much care for NA values) of the calculations does show a difference there, with two-year colleges being far less likely to use products in the syntax-driven category and more likely to use all the others, including Excel.

And going back to my original breakdown by product, Excel and especially StatCrunch, a web-based application, are more often used in two-year colleges while R is less often used there.

Another reason for the large number of Yes responses for Excel is that Excel is so ubiquitous that it always gets included, even though it may not be the main software product. While we can’t tell how much each respondent uses each tool, we can at least tell whether they’re using Excel alone or in conjunction with other tools.

This graph shows, for those 93 respondents who answered Yes to using Excel, the number of Yes responses they had across all software tools.

That first bar represents those educators who used Excel and only Excel. That’s not too many (18) out of the survey total (200+), which feels less alarming.
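The count behind that graph can be sketched in base R as follows. This runs on the eight excerpt rows shown earlier, which cover only seven of the 18 tools, so “Excel only” here just means no other Yes among those seven columns. The variable names are hypothetical.

```r
# The eight excerpt rows shown earlier (7 of the 18 tool columns), not the full data.
excerpt <- read.table(text = "
t_Excel t_Fathom t_JMP t_Minitab t_Python t_R_GUI t_R_Studio
Yes No  No  No  No  No  No
Yes No  No  No  No  No  Yes
Yes No  No  No  No  No  No
Yes No  No  No  No  Yes Yes
Yes No  Yes Yes No  No  No
No  No  No  No  No  No  No
NA  NA  NA  NA  NA  NA  Yes
Yes No  No  No  No  No  No
", header = TRUE, colClasses = "character")  # the string "NA" is read as missing

# TRUE where a respondent answered Yes; NA counts as not-Yes.
is_yes <- !is.na(excerpt) & excerpt == "Yes"

# Restrict to Excel users and count their Yes answers across all shown tools.
excel_users <- is_yes[, "t_Excel"]
yes_per_user <- rowSums(is_yes[excel_users, , drop = FALSE])
print(table(yes_per_user))

# Excel users whose only Yes is Excel itself.
excel_only <- sum(yes_per_user == 1)
print(excel_only)
```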

