What’s in a Word?
In their 2010 paper in The American Statistician, Deborah Nolan and Duncan Temple Lang describe the need for students to “compute with data” in order to answer statistical questions. Diane Lambert of Google calls this the capacity to “think with data.” Statistics graduates need to be able to manage data, analyze it accurately, and communicate findings effectively. The Wikipedia data-science entry states that “data scientists use the ability to find and interpret rich data sources, manage large amounts of data despite hardware, software, and bandwidth constraints, merge data sources, ensure consistency of data sets, create visualizations to aid in understanding data, build mathematical models using the data, present and communicate the data insights/findings to specialists and scientists in their team, and if required to a non-expert audience.” But what is the best word or phrase to describe these computational and data-related skills?
“Data wrangling” has been suggested as one possibility (and returned about 131,000 results on Google), though it connotes a long and complicated dispute, often involving livestock, which may not end well.
“Data grappling” is another option (about 7,500 results on Google), though this is perhaps less attractive, as it suggests pirates (and grappling hooks) or wrestling as a combat sport or for self-defense.
“Data munging” (about 35,000 results on Google) is a common term in computer science used to describe changes to data (both constructive and destructive), or mapping from one format to another. A disadvantage of this term is that it carries a somewhat pejorative connotation.
“Data tidying” (about 900 results on Google) brings to mind the ideas of “bringing order to” or “arranging neatly.”
“Data curation” (about 322,000 results on Google) is a term that focuses on the long-term use (and preservation) of data. While important, curation may be perceived as a dusty and stale task.
“Data cleaning” (or “data cleansing,” about 490,000 results on Google) is the process of identifying and correcting (or removing) invalid records from a data set. Other related terms include “data standardization” and “data harmonization.”
A search for “data manipulation” yielded about 740,000 results on Google. Interestingly, this term on Wikipedia redirects to the “misuse of statistics” page, implying the analyst might have malicious intentions and could torture the data to tell a particular story. The Wikipedia “data manipulation language” page has no such negative connotations (and describes the Structured Query Language [SQL] as one such language). This dual meaning stems from the definition (from Merriam-Webster) of “manipulate”:
- To manage or utilize skillfully
- To control or play upon by artful, unfair, or insidious means especially to one’s own advantage
“Data management” was the most common term, with more than 33,000,000 results on Google. The DAMA Data Management Body of Knowledge (DAMA-DMBOK) provides a definition: “Data management is the development, execution and supervision of plans, policies, programs, and practices that control, protect, deliver, and enhance the value of data and information assets.” While the term is somewhat clinical, and doesn’t necessarily capture the essential creativity required (and is decidedly non-sexy), data management may be the most appropriate phrase to describe the type of data-related skills students need to make sense of the information around them.