Data Science vs. Data Engineering

Data Science is a relatively young term for a relatively old field. In general, it tends to mean applied statistics plus some other skill base: stats + computer science, stats + software engineering, stats + dataviz, etc. The term itself is controversial, with many people upset that it exists at all because they see data science as an evolution of statistics rather than a separate field.

Further, with the growth of large-scale data processing frameworks and tools (Hadoop, Spark, etc.), the title of Data Engineer has begun to crop up in place of DBA, software engineer, or another more traditional title.

So what is Data Science? What is Data Engineering? And why have separate terms to begin with?

Science vs. Engineering

To understand the differences between statistics, programming, data science and data engineering, we first need to think about the differences between science and engineering in general.

Science, in general, follows the scientific method.  The idea is that given some observation, the scientist can formulate a question regarding the observation, posit some hypothesis to answer that question, develop a way to test that hypothesis, test it, and either reject, alter, confirm or expand that hypothesis.  With a body of hypotheses regarding an observation or set of observations, the scientific community can then develop generalized theories that fit with all available data, observations, and existing theories to try to explain phenomena in general.

Engineering, in general, does not follow this method. Engineers start with requirements or an end goal of some sort, and follow a defined process to apply the theories and methods produced by science to accomplish those goals. In engineering there are multiple methods for this process, including agile, waterfall, and systems engineering.

So on the surface, the primary difference between science and engineering is theoretical vs. applied, but in a practical sense the difference is in workflow. A scientist starts with an observation and moves forward into the unknown, while an engineer starts with a known end goal and works backwards to find a solution that meets that need.

Tools vs. Methods

For job titles, or other descriptions of one's work, there are two primary paths one can take: a description of tools or a description of methods.

A description of tools tends to be what engineers gravitate towards, as they tend to think of themselves professionally in terms of skills or tools used. This gives rise to titles like Java Developer or CAD Designer. The focus of the title is the tool used, not the methodology followed.

Alternatively, one can choose a title based on the process followed. Rather than the much more specific title of Hadoop Developer, one could be a Data Engineer, because they follow the engineering process on projects related to data processing. Likewise, one could be a Data Scientist rather than a statistician, because they follow the scientific method on projects related to data analysis.

Data Science vs. Data Engineering

Data science is the role of applying the scientific method to projects involving data analysis. The tools involved in this role are numerous, and include those from statistics, UX/UI design, computer science, pure math, and the domains specific to the data itself. It doesn't live above or alongside any of these fields (notably statistics and computer science), nor is it a watered-down version of any of them. It is, simply, a description of process and domain.

Data scientists are not defined by the tools that they use (R vs. Python vs. Java vs. whatever), nor by their ability to clearly communicate the outputs of their analyses (though that is very important). A data scientist is, rather, defined by the method they use for their work. If a data set arrives in their hands and they select the end goal first and then apply known techniques to accomplish that goal, they are practicing data engineering, not data science.

Data engineering is the role of applying the engineering methodology to projects involving data analysis. It also involves many of the tools from computer science and software engineering (datastores, cluster computing, complexity analysis, etc.), but is narrowed in scope to projects that support data analysis. Again, it is simply a description of process and domain.

There is, of course, a tremendous amount of overlap between these two roles in the real world, but there remains a tremendous amount of noise in the industry about how a data scientist is "a statistician who knows Hadoop" or a "software developer who is good at math", or other distinctions based entirely on skills and tools. This of course gives rise to conflict with people who have been good at math and software for decades but didn't get the new hot job title.

To generalize the idea of a method-domain title, a title ought to reflect the methodology a person uses to do their job and the domain in which they apply that method. Optionally, prepend a seniority measure. Examples are:

  1. Data Scientist
  2. Chemical Engineer
  3. Data Engineer

As opposed to:

  1. R Developer
  2. Gasification Specialist
  3. Hadoop Architect
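
The method-domain pattern can be sketched in a few lines of Python. This is only an illustration: the function name `method_domain_title` and its parameters are made up for this example, and "Senior" is just an assumed seniority label.

```python
from typing import Optional


def method_domain_title(domain: str, method: str,
                        seniority: Optional[str] = None) -> str:
    """Compose a title as [seniority] + domain + method.

    Hypothetical helper for illustration only. Note that there is no
    parameter for a tool or skill (R, Hadoop, ...), which is the point.
    """
    parts = [seniority, domain, method]
    # Drop the seniority slot when it is not given.
    return " ".join(part for part in parts if part)


print(method_domain_title("Data", "Scientist"))
print(method_domain_title("Chemical", "Engineer"))
# "Senior" is an assumed example of a seniority measure.
print(method_domain_title("Data", "Engineer", seniority="Senior"))
```

A title like Hadoop Architect has no place in this scheme, because Hadoop is a tool rather than a domain or a method.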

All in all, it's a mess, but I believe that sticking to method-domain titles and keeping skills out of it makes much more sense.

Will

Will has a background in Mechanical Engineering from Auburn, but mostly just writes software now. He was the first employee at Predikto, and is currently building out the premier platform for predictive maintenance in heavy industry there as Chief Scientist. When not working on that, he is generally working on something related to Python, data science or cycling.

2 Comments

  1. I never thought of applying it in the literal sense of science and engineering. I suppose they work in tandem: the data scientist finds the data and the engineer analyzes it. I will have to look into it further to decide what I like doing. Thank you for the information.

  2. I didn't realize that data science is applying the scientific method to a project involving data analysis. It seems like it would be really beneficial to have data science software that would allow for easier access to how the scientific method looks for your project and will ease the recording of the information you acquire as you experiment. One of my sons will be starting a science project soon for school so maybe I'll have to invest in something like that for our younger kids too.
