50 Years of Data Science. Donoho, D. Journal of Computational and Graphical Statistics, 26(4):745–766. 2017.
More than 50 years ago, John Tukey called for a reformation of academic statistics. In “The Future of Data Analysis,” he pointed to the existence of an as-yet unrecognized science, whose subject of interest was learning from data, or “data analysis.” Ten to 20 years ago, John Chambers, Jeff Wu, Bill Cleveland, and Leo Breiman independently once again urged academic statistics to expand its boundaries beyond the classical domain of theoretical statistics; Chambers called for more emphasis on data preparation and presentation rather than statistical modeling; and Breiman called for emphasis on prediction rather than inference. Cleveland and Wu even suggested the catchy name “data science” for this envisioned field. A recent and growing phenomenon has been the emergence of “data science” programs at major universities, including UC Berkeley, NYU, MIT, and most prominently, the University of Michigan, which in September 2015 announced a $100M “Data Science Initiative” that aims to hire 35 new faculty. Teaching in these new programs has significant overlap in curricular subject matter with traditional statistics courses; yet many academic statisticians perceive the new programs as “cultural appropriation.” This article reviews some ingredients of the current “data science moment,” including recent commentary about data science in the popular media, and about how/whether data science is really different from statistics. The now-contemplated field of data science amounts to a superset of the fields of statistics and machine learning, which adds some technology for “scaling up” to “big data.” This chosen superset is motivated by commercial rather than intellectual developments. Choosing in this way is likely to miss out on the really important intellectual event of the next 50 years.
Because all of science itself will soon become data that can be mined, the imminent revolution in data science is not about mere “scaling up,” but instead the emergence of scientific studies of data analysis science-wide. In the future, we will be able to predict how a proposal to change data analysis workflows would impact the validity of data analysis across all of science, even predicting the impacts field-by-field. Drawing on work by Tukey, Cleveland, Chambers, and Breiman, I present a vision of data science based on the activities of people who are “learning from data,” and I describe an academic field dedicated to improving that activity in an evidence-based manner. This new field is a better academic enlargement of statistics and machine learning than today’s data science initiatives, while being able to accommodate the same short-term goals. Based on a presentation at the Tukey Centennial Workshop, Princeton, NJ, September 18, 2015.

[Excerpt: The Common Task Framework]
To my mind, the crucial but unappreciated methodology driving predictive modeling’s success is what computational linguist Mark Liberman (Liberman 2010) has called the Common Task Framework (CTF). An instance of the CTF has these ingredients:

[::(a)] A publicly available training dataset involving, for each observation, a list of (possibly many) feature measurements, and a class label for that observation.

[::(b)] A set of enrolled competitors whose common task is to infer a class prediction rule from the training data.

[::(c)] A scoring referee, to which competitors can submit their prediction rule. The referee runs the prediction rule against a testing dataset, which is sequestered behind a Chinese wall. The referee objectively and automatically reports the score (prediction accuracy) achieved by the submitted rule.

All the competitors share the common task of training a prediction rule which will receive a good score; hence the phrase common task framework.

[...]
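The three CTF ingredients listed in the excerpt can be illustrated with a small sketch. This is not code from the article: the `Referee` class, the toy one-feature dataset produced by `make_data`, and the two example competitors are hypothetical names and assumptions introduced only to show how a public training set, competing prediction rules, and a referee with a sequestered test set might fit together.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Features = Sequence[float]
Example = Tuple[Features, int]
Rule = Callable[[Features], int]  # a prediction rule maps features to a class label


class Referee:
    """Ingredient (c): holds the sequestered test set and reports accuracy only."""

    def __init__(self, test_set: List[Example]) -> None:
        self._test_set = test_set  # assumption: never handed to competitors

    def score(self, rule: Rule) -> float:
        correct = sum(1 for x, y in self._test_set if rule(x) == y)
        return correct / len(self._test_set)


def make_data(n: int, rng: random.Random) -> List[Example]:
    """Hypothetical toy data: the label is 1 when the single feature exceeds 0.5."""
    data: List[Example] = []
    for _ in range(n):
        x = rng.random()
        data.append(((x,), 1 if x > 0.5 else 0))
    return data


def train_threshold_rule(train: List[Example]) -> Rule:
    """One competitor: put the threshold midway between the two class means."""
    ones = [x[0] for x, y in train if y == 1]
    zeros = [x[0] for x, y in train if y == 0]
    threshold = (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2
    return lambda x: 1 if x[0] > threshold else 0


def constant_rule(x: Features) -> int:
    """A baseline competitor that ignores the training data and predicts class 0."""
    return 0


rng = random.Random(0)
training_set = make_data(200, rng)      # ingredient (a): public training data
referee = Referee(make_data(100, rng))  # ingredient (c): sequestered test data

# Ingredient (b): each competitor submits a rule; the referee reports its score.
leaderboard: Dict[str, float] = {
    "threshold": referee.score(train_threshold_rule(training_set)),
    "constant": referee.score(constant_rule),
}
for name, accuracy in sorted(leaderboard.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {accuracy:.2f}")
```

Competitors in this sketch see only `training_set`, and the only feedback they get is the accuracy returned by `Referee.score`, which mirrors the separation the excerpt describes.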
[Required Skills]
The Common Task Framework imposes numerous demands on workers in a field:

[::] The workers must deliver predictive models which can be evaluated by the CTF scoring procedure in question. They must therefore personally submit to the information technology discipline imposed by the CTF developers.

[::] The workers might even need to implement a custom-made CTF for their problem; so they must both develop an information technology discipline for evaluation of scoring rules and they must obtain a dataset which can form the basis of the shared data resource at the heart of the CTF.

In short, information technology skills are at the heart of the qualifications needed to work in predictive modeling. These skills are analogous to the laboratory skills that a wet-lab scientist needs to carry out experiments. No math required.

[...]

[The Full Scope of Data Science]
[...]
The larger vision posits a professional on a quest to extract information from data - exactly as in the definitions of data science we saw earlier. The larger field cares about each and every step that the professional must take, from getting acquainted with the data all the way to delivering results based upon it, and extending even to that professional’s continual review of the evidence about best practices of the whole field itself.

Following Chambers, let us call the collection of activities mentioned until now “lesser data science” (LDS) and the larger would-be field greater data science (GDS). Chambers and Cleveland each parsed out their enlarged subject into specific divisions/topics/subfields of activity. I find it helpful to merge, relabel, and generalize the two parsings they proposed. This section presents and then discusses this classification of GDS.
[::The Six Divisions]
The activities of GDS are classified into six divisions:

[::1] Data Gathering, Preparation, and Exploration
[::2] Data Representation and Transformation
[::3] Computing with Data
[::4] Data Modeling
[::5] Data Visualization and Presentation
[::6] Science about Data Science

[...]
@article{donoho50YearsData2017,
  title = {50 Years of Data Science},
  author = {Donoho, David},
  date = {2017-10-02},
  journaltitle = {Journal of Computational and Graphical Statistics},
  volume = {26},
  pages = {745--766},
  issn = {1061-8600},
  doi = {10.1080/10618600.2017.1384734},
  url = {https://doi.org/10.1080/10618600.2017.1384734},
  urldate = {2019-08-07},
  abstract = {More than 50 years ago, John Tukey called for a reformation of academic statistics. In “The Future of Data Analysis,” he pointed to the existence of an as-yet unrecognized science, whose subject of interest was learning from data, or “data analysis.” Ten to 20 years ago, John Chambers, Jeff Wu, Bill Cleveland, and Leo Breiman independently once again urged academic statistics to expand its boundaries beyond the classical domain of theoretical statistics; Chambers called for more emphasis on data preparation and presentation rather than statistical modeling; and Breiman called for emphasis on prediction rather than inference. Cleveland and Wu even suggested the catchy name “data science” for this envisioned field. A recent and growing phenomenon has been the emergence of “data science” programs at major universities, including UC Berkeley, NYU, MIT, and most prominently, the University of Michigan, which in September 2015 announced a \$100M “Data Science Initiative” that aims to hire 35 new faculty. Teaching in these new programs has significant overlap in curricular subject matter with traditional statistics courses; yet many academic statisticians perceive the new programs as “cultural appropriation.” This article reviews some ingredients of the current “data science moment,” including recent commentary about data science in the popular media, and about how/whether data science is really different from statistics. The now-contemplated field of data science amounts to a superset of the fields of statistics and machine learning, which adds some technology for “scaling up” to “big data.” This chosen superset is motivated by commercial rather than intellectual developments. Choosing in this way is likely to miss out on the really important intellectual event of the next 50 years. 
Because all of science itself will soon become data that can be mined, the imminent revolution in data science is not about mere “scaling up,” but instead the emergence of scientific studies of data analysis science-wide. In the future, we will be able to predict how a proposal to change data analysis workflows would impact the validity of data analysis across all of science, even predicting the impacts field-by-field. Drawing on work by Tukey, Cleveland, Chambers, and Breiman, I present a vision of data science based on the activities of people who are “learning from data,” and I describe an academic field dedicated to improving that activity in an evidence-based manner. This new field is a better academic enlargement of statistics and machine learning than today’s data science initiatives, while being able to accommodate the same short-term goals. Based on a presentation at the Tukey Centennial Workshop, Princeton, NJ, September 18, 2015.

[Excerpt: The Common Task Framework]
To my mind, the crucial but unappreciated methodology driving predictive modeling’s success is what computational linguist Mark Liberman (Liberman 2010) has called the Common Task Framework (CTF). An instance of the CTF has these ingredients:

[::(a)] 	
A publicly available training dataset involving, for each observation, a list of (possibly many) feature measurements, and a class label for that observation.

[::(b)] 	
A set of enrolled competitors whose common task is to infer a class prediction rule from the training data.

[::(c)] 	
A scoring referee, to which competitors can submit their prediction rule. The referee runs the prediction rule against a testing dataset, which is sequestered behind a Chinese wall. The referee objectively and automatically reports the score (prediction accuracy) achieved by the submitted rule.

[\textbackslash n] All the competitors share the common task of training a prediction rule which will receive a good score; hence the phrase common task framework.

[\textbackslash n] [...]

[Required Skills]
The Common Task Framework imposes numerous demands on workers in a field:

[::] The workers must deliver predictive models which can be evaluated by the CTF scoring procedure in question. They must therefore personally submit to the information technology discipline imposed by the CTF developers.

[::] The workers might even need to implement a custom-made CTF for their problem; so they must both develop an information technology discipline for evaluation of scoring rules and they must obtain a dataset which can form the basis of the shared data resource at the heart of the CTF.

[\textbackslash n] In short, information technology skills are at the heart of the qualifications needed to work in predictive modeling. These skills are analogous to the laboratory skills that a wet-lab scientist needs to carry out experiments. No math required.

[\textbackslash n] [...]

[The Full Scope of Data Science]
[...]
The larger vision posits a professional on a quest to extract information from data - exactly as in the definitions of data science we saw earlier. The larger field cares about each and every step that the professional must take, from getting acquainted with the data all the way to delivering results based upon it, and extending even to that professional’s continual review of the evidence about best practices of the whole field itself.

[\textbackslash n] Following Chambers, let us call the collection of activities mentioned until now “lesser data science” (LDS) and the larger would-be field greater data science (GDS). Chambers and Cleveland each parsed out their enlarged subject into specific divisions/topics/subfields of activity. I find it helpful to merge, relabel, and generalize the two parsings they proposed. This section presents and then discusses this classification of GDS.

[::The Six Divisions]
The activities of GDS are classified into six divisions:

[::1] 	
Data Gathering, Preparation, and Exploration

[::2] 	
Data Representation and Transformation

[::3] 	
Computing with Data

[::4] 	
Data Modeling

[::5] 	
Data Visualization and Presentation

[::6]
Science about Data Science

[\textbackslash n] [...]},
  keywords = {~INRMM-MiD:z-IDJAC3LQ,bias-disembodied-science-vs-computational-scholarship,computational-science,computational-science-literacy,data-sharing,data-transformation-modelling,data-uncertainty,knowledge-integration,reproducible-research,statistics},
  number = {4}
}
