Data Science is a relatively young term for a relatively old field. In general, it tends to mean applied statistics plus some other skill base: stats + computer science, stats + software engineering, stats + dataviz, and so on. There is controversy around the term itself, with many people upset that it exists at all, because data science is more of an evolution of statistics than a separate field.
Further, with the growth of large data processing frameworks and tools (Hadoop, Spark, etc.), the title of Data Engineer has begun to crop up in place of DBA, software engineer, or another more traditional title.
So what is Data Science? What is Data Engineering? And why have separate terms to begin with?
Science vs. Engineering
To understand the differences between statistics, programming, data science and data engineering, we first need to think about the differences between science and engineering in general.
Science, in general, follows the scientific method. The idea is that given some observation, the scientist can formulate a question regarding the observation, posit some hypothesis to answer that question, develop a way to test that hypothesis, test it, and either reject, alter, confirm or expand that hypothesis. With a body of hypotheses regarding an observation or set of observations, the scientific community can then develop generalized theories that fit with all available data, observations, and existing theories to try to explain phenomena in general.
Engineering, in general, does not follow this method. Engineers start with requirements or an end goal of some sort, and follow a defined process to apply the theories and methods output from science to accomplish those goals. In engineering there are multiple methods for this process, including agile, waterfall, and systems engineering.
So on the surface, the primary difference between science and engineering is theoretical vs. applied, but in a practical sense the difference is in workflow. A scientist starts with an observation and moves forward into the unknown, while an engineer starts with a known end goal and works backwards to find a solution that meets that need.
Tools vs. Methods
For job titles, or other descriptions of one's work, there are two primary paths one can take: a description of tools or a description of methods.
A description of tools tends to be what engineers gravitate towards, as they tend to think of themselves professionally in terms of the skills or tools they use. This gives rise to titles like Java Developer or CAD Designer. The focus of the title is the tool used, not the methodology followed.
Alternatively, one can choose a title based on the process followed. Rather than the much more specific title of Hadoop Developer, one could be a Data Engineer, because they follow the engineering process on projects related to data processing. Likewise, one could be a Data Scientist rather than a Statistician, because they follow the scientific method on projects related to data analysis.
Data Science vs. Data Engineering
Data science is the role of applying the scientific method to projects involving data analysis. The tools involved in this role are numerous, and include those from statistics, UX/UI design, computer science, pure math, and the domains specific to the data itself. It doesn't live above or alongside any of these fields (notably statistics and computer science), nor is it a watered-down version of any of them. It is, simply, a description of process and domain.
Data scientists are not defined by the tools that they use (R vs. Python vs. Java vs. whatever), nor by their ability to clearly communicate the outputs of their analyses (though that is very important). A data scientist is, rather, defined by the method they use for their work. If a data set arrives in their hands and they select the end goal and apply known techniques to accomplish that goal, then they are practicing data engineering, not data science.
Data engineering is the role of applying the engineering methodology to projects involving data analysis. It also involves many of the tools from computer science and software engineering (datastores, cluster computing, complexity analysis, etc.), but is narrowed in scope to projects that support data analysis. Again, it is simply a description of process and domain.
There is, of course, a tremendous amount of overlap between these two roles in the real world, but there remains a tremendous amount of noise in the industry about how a data scientist is "a statistician who knows Hadoop" or "a software developer who is good at math", or other distinctions based entirely on skills and tools. This of course gives rise to conflict with people who have been good at math and software for decades but didn't get the new hot job title.
To generalize the idea of a method-domain title, one's title ought to reflect the methodology used to accomplish one's job and the domain in which that method is applied. Optionally, prepend a seniority measure. Examples are:
- Data Scientist
- Chemical Engineer
- Data Engineer
As opposed to:
- R Developer
- Gasification Specialist
- Hadoop Architect
All in all, it's a mess, but I believe that sticking to method-domain titles and keeping skills and tools out of them makes much more sense.