Principles and steps of Data Science
Oct 9th 2022
Before adding the Neuro prefix, what is Data Science?
@ Long story short
Data Science is a discipline with at least 50 years of history that combines multiple intellectual fields to study the entire data life cycle and its applications.
@ The Medical Subject Headings (MeSH) definition
Data Science is an interdisciplinary field involving processes, theories, concepts, tools, and technologies, that enable the review, analysis, and extraction of valuable knowledge and information from structured and unstructured (raw) data.
(From the MeSH vocabulary thesaurus, introduced in 2019.)
@ Short History of the term
Some of the earliest mentions of Data Science go back to Peter Naur, another great computer scientist whom I admire, who in 1974 wrote the book Concise Survey of Computer Methods.
Naur explains that "Concise survey of computer oriented datalogy" would have been a preferable title for his book. However, the earlier dissolution of an ambitious 1966 project named "Plan for a course of datalogy and datamatics" justified the new term, Data Science.
He developed the idea in part 1.8 of the book: A Basic Principle of Data Science.
He wrote: "The data representation must be chosen with due regard to the transformation to be achieved and the data processing tools available... Three remarks are relevant: (1) Since data science is concerned with methods of construction of data processes, it is consistent that its basic principles come as design guidance. (2) The principle is consistent with the idea of freedom to choose the data representation. (3) The regard to the data processing tools is consistent with the notion that basically data are things to be processed."
I recommend reading this book, or at least pages 28-31. Below is a link and preview, thanks to the great Internet Archive, or you can buy it via an Amazon affiliate link.
Aside from the Computer Science strand, it is obvious to me that Statistics, as an applied mathematical discipline, is very, very close to Data Science. It is from statistical work that the core of Data Science was born and grew. In fact, the three quotes below come from pure statisticians, from statisticians on healthcare teams, or from the Handbook of Statistics.
@ A 2017 statistician's point of view of Data Science
"The would-be notion takes data science as the science of learning from data, with all that this entails. It is matched to the most important developments in science which will arise over the coming 50 years. As scientific publication itself becomes a body of data that we can analyze and study, there are staggeringly large opportunities for improving the accuracy and validity of science, through the scientific study of the data analysis that scientists have been doing."
As a footnote: "Farther below, we will use shortened formulations such as “science itself becomes a body of data.”"
Quote from David Donoho, Department of Statistics, Stanford University, in a Creative Commons article:
Donoho, D. (2017). 50 Years of Data Science. Journal of Computational and Graphical Statistics, 26(4), 745-766. doi: 10.1080/10618600.2017.1384734
@ A 2022 healthcare view of Data Science
"Data science is an interdisciplinary field that extracts knowledge and insights from many structural and unstructured data, using scientific methods, data mining techniques, machine-learning algorithms, and big data."
Quote taken from this excellent Creative Commons article:
Subrahmanya, S., Shetty, D. K., Patil, V., Hameed, et al. (2022). The role of data science in healthcare advancements: applications, benefits, and future prospects. Irish journal of medical science, 191(4), 1473–1483. doi: 10.1007/s11845-021-02730-z
@ Data Science and Data Engineering, relation with AI
"Data science is an umbrella term used for referring to concepts and practices of subset of the topics under artificial intelligence (AI) methodologies. AI is actually a framework to define notion of intelligence in software systems or devices in terms of knowledge representation and reasoning methodologies. There are two main types of reasoning methods deductive and inductive over data. The major class of machine learning and deep learning methods come under inductive reasoning where essentially, missing pieces of information are interpolated based on existing data through numerical transformations. However, today AI is mostly identified with deduction systems while it is actually a comprehensive school of thought and formal framework. The AI framework offers rigor and robustness to the solutions developed and there is still scope for onboarding today's deep learning solutions and reap benefits of sturdiness. Data science is about end to end development of a smart solution that involves creation of pipelines for activities for data generation, business decision making and solution maintenance with humans in loop. Data generation is a cycle of activities involving collection, refinement, feature transformations, devising more insightful heuristic measures based on domain peculiarities and iterations to enhance quality of data driven decisions. Business decision making is pipeline of activities involving designing mappers from data to business decisions. The mappers are typically machine learning methods which are fine tuned to give best possible performance in a given period of study subject to business constraints. The mappers are fine tuned based on quality and magnitude of data and subdata. Solution maintenance is a critical component that involves setting up alarms to detect when a given decision maker model no longer works as desired. 
The maintenance work calls for repair actions such as identifying data to gather, comparative metrics of different models and monitoring the patterns and trends in the input data."
Abstract taken from this chapter of the Handbook of Statistics, Volume 43, Principles and Methods for Data Science:
Kalidas Yeturu, Chapter 3 - Machine learning algorithms, applications, and practices in data science, Editor(s): Arni S.R. Srinivasa Rao, C.R. Rao, Handbook of Statistics, Elsevier, Volume 43, 2020, Pages 81-206. doi:10.1016/bs.host.2020.01.002.
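The "solution maintenance" idea in the abstract above (an alarm that fires when a decision model no longer works as desired) can be illustrated with a minimal sketch. This is not taken from the chapter: the class name ErrorAlarm, the window size, and the threshold are all hypothetical choices made only for illustration.

```python
from collections import deque
from statistics import fmean


class ErrorAlarm:
    """Flag a model when its recent mean absolute error gets too large.

    `window` and `threshold` are illustrative assumptions, not recommended values.
    """

    def __init__(self, window=5, threshold=2.0):
        self.errors = deque(maxlen=window)  # keeps only the most recent errors
        self.threshold = threshold

    def update(self, predicted, actual):
        """Record one prediction/outcome pair; return True if the alarm fires."""
        self.errors.append(abs(predicted - actual))
        window_full = len(self.errors) == self.errors.maxlen
        # Wait for a full window so one early outlier does not trigger the alarm.
        return window_full and fmean(self.errors) > self.threshold


if __name__ == "__main__":
    alarm = ErrorAlarm(window=3, threshold=1.0)
    for predicted, actual in [(1, 1), (2, 2), (3, 3), (4, 8)]:
        print(predicted, actual, alarm.update(predicted, actual))
```

In practice the repair actions named in the abstract (gathering new data, comparing models, monitoring input trends) would follow once such an alarm fires.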
@ The steps of Data Science
Data Collection, Data Preparation (Data Exploration, Data Cleaning, Feature Selection, Data Transformation), Model Training (Model Building, Iterative Model Training, Model Hyperparameter Tuning), Model Prediction (Trained Model, Prediction).
(Image inspired by Kaggle template information)
Kaggle Free Machine Learning Course
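To make the four steps above concrete, here is a minimal sketch in plain Python (standard library only). Everything in it is an assumption made for illustration: the data are synthetic (noisy samples of y = 3x + 2), the model is a one-feature ridge regression, and the function names and candidate hyperparameter values are hypothetical. With a single feature there is nothing to select, so Feature Selection is not shown.

```python
import random
import statistics


def collect_data(n=200, seed=0):
    """Data Collection: simulate noisy samples of y = 3x + 2 (synthetic data)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        y = 3.0 * x + 2.0 + rng.gauss(0.0, 1.0)
        # About 5% of the targets are dropped so the cleaning step has work to do.
        rows.append((x, None if rng.random() < 0.05 else y))
    return rows


def prepare_data(rows):
    """Data Preparation: clean missing targets, then standardize the feature."""
    clean = [(x, y) for x, y in rows if y is not None]       # data cleaning
    xs = [x for x, _ in clean]
    mean, std = statistics.fmean(xs), statistics.pstdev(xs)  # data exploration
    scaled = [((x - mean) / std, y) for x, y in clean]       # data transformation
    return scaled, (mean, std)


def fit_ridge(data, lam):
    """Model Building: one-feature ridge regression in closed form."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in data)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / (sxx + lam)  # lam shrinks the slope toward zero
    return slope, y_bar - slope * x_bar


def mean_squared_error(model, data):
    slope, intercept = model
    return statistics.fmean([(slope * x + intercept - y) ** 2 for x, y in data])


def train_model(data, lambdas=(0.0, 1.0, 10.0, 100.0), seed=0):
    """Model Training: fit one model per candidate lam, keep the best on validation."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    train, valid = shuffled[:cut], shuffled[cut:]
    candidates = [(lam, fit_ridge(train, lam)) for lam in lambdas]
    return min(candidates, key=lambda c: mean_squared_error(c[1], valid))


def predict(model, scaler, x_raw):
    """Model Prediction: apply the stored scaling, then the trained model."""
    slope, intercept = model
    mean, std = scaler
    return slope * (x_raw - mean) / std + intercept


if __name__ == "__main__":
    data, scaler = prepare_data(collect_data())
    best_lam, model = train_model(data)
    print("chosen lam:", best_lam)
    print("prediction at x = 5:", predict(model, scaler, 5.0))
```

Note that the scaling parameters learned during preparation are reused at prediction time, so new inputs go through the same transformation as the training data.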
@ The five "tribes" or schools of the Machine Learning strand of Data Science
(Image inspired by the Machine Learning for Dummies book)
@ Data Science and beyond
IMHO, no strict definition is really needed, because the message in the term "data science" is that now (as always) we need the best "datalogy" possible. To achieve that, multiple disciplines need to be used, constantly updated, and maintained with scientific rigor. That turns the study of the entire data life cycle into a science in its own right, and its applications into a new and very important engineering field: data engineering.
What this means for the basic and applied scientific disciplines: keep doing science as always, but now with emphasis on the correct use of newer, better, and more powerful data tools. The best possible treatment of data has been a must in science from the beginning.
Behind this idea lies the observation that if enough facts of a new type appear, and if it is worth the effort, a science that studies them will develop, along with its corresponding applications.
For example, the terms AI Science and VR Science are emerging, and if an extraterrestrial intelligence makes contact, EI Science will emerge.
Note: I'm aware of the need for Information Theory here. In another post we will look at the levels and the flow that start from Data as the elementary input block: from Data with Context emerges Information, Information with Meaning builds Knowledge, and Knowledge with Insight finally reaches Wisdom.