More Math for Data Scientists, Not Less

Benjamin S. Baumer

As undergraduate data science programs continue to spring up across the country, the time has come for mathematical sciences departments to rethink their traditional sequence of courses. As Jo Hardin and Nick Horton (2017) point out, to do nothing is to risk being left behind—an outcome that will leave mathematical sciences departments with lower enrollments and data scientists with insufficient mathematical training. Data science students need more math, not less, but the typical sequence of mathematics courses does not meet their needs.



1 Comment

Daniel Kaplan
October 18, 2023 • 9:41 am

As an example of what can be done when the mathematics curriculum is reconsidered, I point to a course and corresponding textbook I helped develop at the US Air Force Academy: https://www.mosaic-web.org/MOSAIC-Calculus/. USAFA has a 9-credit-hour core math/stats requirement. As Ben Baumer points out, linear algebra is important for budding data scientists, though generally not linear algebra as it is traditionally taught. The 6-credit-hour intro calculus core, which has no prerequisites and is taken by about 800 cadets (out of 1200) each year, is multivariate and modeling-focused, engages data, and gives an introduction to linear algebra oriented to statistics. (It also includes a dynamical systems/DiffEQ section, since all cadets have a core engineering requirement as well.)

There are many opportunities to connect meaningfully to statistics in an intro calculus course. It’s low-hanging fruit to include the normal PDF-CDF pair of functions, which are connected by differentiation/integration: the PDF is the derivative of the CDF, and the CDF is the integral of the PDF. The pair provides function shapes essential to modeling (bump and sigmoidal functions). An even richer setting is Bayesian estimation with a continuous parameter. The construction of likelihood functions is an exercise in modeling. Calculation of posteriors is a two-step process, both steps being operations central to calculus: (1) multiplication of the likelihood function by the prior, (2) normalization of the result from (1) so that it integrates to 1. Computing is done with R, using the {mosaicCalc} package from CRAN.
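To make these two connections concrete, here is a minimal sketch of my own, not taken from the course materials. The course does its computing in R with {mosaicCalc}; this sketch uses plain Python with only the standard library as a stand-in. The data (7 successes in 10 trials), the flat prior, the grid size, and every function name are hypothetical choices made for illustration.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Bump-shaped normal density."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Sigmoid-shaped normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def trapezoid(ys, dx):
    """Trapezoid-rule integral of equally spaced values ys with spacing dx."""
    return dx * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

# PDF-CDF pair: the slope of the CDF at x0 should match the PDF at x0.
x0, h = 0.7, 1e-5
cdf_slope = (normal_cdf(x0 + h) - normal_cdf(x0 - h)) / (2.0 * h)

# Bayesian estimation of a continuous parameter p (a success probability).
# Hypothetical data: 7 successes in 10 trials.
successes, trials = 7, 10

def likelihood(p):
    """Binomial likelihood of the data, up to a constant factor."""
    return p ** successes * (1.0 - p) ** (trials - successes)

def prior(p):
    """Flat prior on [0, 1] (an assumption chosen for simplicity)."""
    return 1.0

n_grid = 1001
dp = 1.0 / (n_grid - 1)
grid = [i * dp for i in range(n_grid)]

# Step (1): multiply the likelihood by the prior.
unnormalized = [likelihood(p) * prior(p) for p in grid]

# Step (2): normalize so the posterior integrates to 1.
area = trapezoid(unnormalized, dp)
posterior = [u / area for u in unnormalized]

posterior_mean = trapezoid([p * d for p, d in zip(grid, posterior)], dp)

if __name__ == "__main__":
    print("CDF slope at x0:", cdf_slope, " PDF at x0:", normal_pdf(x0))
    print("posterior integrates to:", trapezoid(posterior, dp))
    print("posterior mean of p:", posterior_mean)
```

With a flat prior and binomial data, the exact posterior is a Beta(8, 4) distribution, so the numerical posterior mean can be checked against 8/12. Swapping in a different `prior` function changes only step (1); the normalization step stays the same.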

The follow-up statistics course was taught in 2022-23. The textbook for that course, https://dtkaplan.github.io/Lessons-in-statistical-thinking/, will come out in 2024.