Simplicity and Complexity in Data Science

coding + programming data technical skill Nov 27, 2023
Two women thinking about data science.

This cartoon captures the essence of one of data science's key dualities: complexity versus simplicity. On the surface, it seems straightforward, encapsulated by the elementary linear equation "y = mx + b," a fundamental concept taught in high school algebra that describes a straight line with 'm' as the slope and 'b' as the y-intercept. This simplicity is what many initially believe data science revolves around: simple models explaining predictable outcomes.

However, the cartoon then juxtaposes this simplicity with the often intimidating complexity beneath the surface of data science, as demonstrated by the more complex linear regression equation "Y = a + βX + e".

This second equation represents a predictive model where 'Y' is the outcome variable, 'a' is the intercept, 'β' is the slope coefficient for the predictor 'X', and 'e' is the error term, accounting for randomness or variance not explained by the model.
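To make the roles of 'a', 'β', and 'e' concrete, here is a minimal sketch that simulates data from "Y = a + βX + e" and then recovers the intercept and slope with ordinary least squares. The parameter values, the sample size, and the function names `simulate` and `fit_ols` are assumptions chosen for illustration; they do not come from the cartoon.

```python
import random

# Hypothetical "true" parameters, chosen only for this illustration.
TRUE_A = 2.0      # intercept a
TRUE_BETA = 0.5   # slope coefficient β


def simulate(n=1000, seed=0):
    """Draw X, then build Y = a + βX + e with a random error term e."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    # rng.gauss(0, 1) plays the role of e: variation not explained by X.
    ys = [TRUE_A + TRUE_BETA * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def fit_ols(xs, ys):
    """Ordinary least squares estimates for a single predictor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    beta_hat = sxy / sxx
    a_hat = y_mean - beta_hat * x_mean
    return a_hat, beta_hat


xs, ys = simulate()
a_hat, beta_hat = fit_ols(xs, ys)

# Residuals are the fitted model's estimates of the error term e.
residuals = [y - (a_hat + beta_hat * x) for x, y in zip(xs, ys)]

print(f"estimated a = {a_hat:.3f}, estimated beta = {beta_hat:.3f}")
```

The estimates should land close to, but not exactly on, the values used to generate the data. That gap is the error term at work.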

The multitude of terms for 'Y' and 'X' arises from the interdisciplinary nature of data science. Each field, from economics and statistics to machine learning and computer science, has historically developed its own lingo. These terminologies reflect the context and the nuanced approaches of the disciplines as they integrate data science into their frameworks.

The illustration further delves into the multifaceted terminology used to describe the variables 'Y' and 'X'. Each term reflects a different perspective or application within various disciplines:

Y Terms Further Explained:

- Regressed: From regression analysis, suggesting 'Y' is being estimated or predicted. Originates from 'regression to the mean,' a concept introduced by Francis Galton.
- Response: Indicates 'Y' is responding to changes in 'X'. Suggests an action-reaction relationship, akin to stimulus-response in psychology.
- Endogenous: An econometrics term implying 'Y' is determined within the system being modeled. Derived from Greek, meaning 'developing from within,' used in economics to represent internal factors.
- Target: Common in machine learning, where 'Y' is the goal for prediction. Perhaps adopted from sports such as target shooting, symbolizing the goal to be achieved.
- Predicted: Implies 'Y' is the outcome of the prediction process. The etymology is simple: straight from the Latin 'praedicere,' meaning 'to declare beforehand.'
- Explained: In statistical models, 'Y' is the part that can be explained by the independent variables. Reflects that a portion of the variation in 'Y' can be elucidated by 'X', stemming from the basic principle of cause and effect.
- Dependent: Indicates 'Y' depends on the values of 'X'. A direct translation of the mathematical relationship where one variable's value is dependent on another's.
- Output: In computer programming, 'Y' is the result produced by the algorithm. A term to denote what comes out of a process or function.

X Terms Further Explained:

- Regressor: Suggests 'X' is used to estimate 'Y'. A statistical term for a variable used for prediction, often used interchangeably with 'predictor'.
- Covariate: Implies 'X' co-varies with 'Y'. From 'co-' meaning 'together' and 'variate,' indicating variables that vary together.
- Exogenous: From econometrics, indicating 'X' is external to the system. Greek for 'developing from outside,' used in econometrics for external factors.
- Feature: Machine learning terminology for input variables. In machine learning, 'features' are the measurable properties or characteristics recorded for each observation.
- Predictor: 'X' is used to predict 'Y'. Implies a forecast, an estimation of future values.
- Explainer: 'X' explains the variation in 'Y'. Comes from the role of 'X' in explaining changes in 'Y'.
- Independent: 'X' varies independently of 'Y'. Denotes 'X' as a variable whose variation does not depend on that of another. In practice, however, causality often runs in both directions.
- Input: In computing, 'X' is the data fed into a model. Common in computer science, referring to data entered into a system or function.
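One way to see that these vocabularies describe the same thing is to fit one line to one dataset under several naming conventions. The sketch below is illustrative only: the data points are made up, and `fit_line` is a hypothetical helper, not a function from any particular library.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single X and a single Y."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Made-up observations, for illustration only.
hours_studied = [1.0, 2.0, 3.0, 4.0, 5.0]
exam_score = [52.0, 55.0, 61.0, 64.0, 68.0]

# Statistics: a regressor and a response.
regressor, response = hours_studied, exam_score

# Econometrics: an exogenous and an endogenous variable.
exogenous, endogenous = hours_studied, exam_score

# Machine learning: a feature and a target.
feature, target = hours_studied, exam_score

fits = {
    "statistics": fit_line(regressor, response),
    "econometrics": fit_line(exogenous, endogenous),
    "machine learning": fit_line(feature, target),
}

# The labels differ, but the fitted intercept and slope are identical.
for field, (intercept, slope) in fits.items():
    print(f"{field}: intercept = {intercept:.2f}, slope = {slope:.2f}")
```

Whatever the field calls them, the arithmetic never sees the names, only the numbers.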

This cartoon thus serves as a reminder that data science, while rooted in basic principles, spans a spectrum of complexity informed by its rich interdisciplinary contributions. Each term carries with it a legacy of knowledge, shaping how data scientists articulate and harness the power of data.
