A long time ago I wrote a short post on the differences between data engineers and data scientists. My reasoning back then was that a data engineer is someone who applies engineering methodologies to data problems, while a data scientist is someone who applies the scientific method to data problems. Pretty straight forward, and I think it holds true. For way more in depth discussion on that, check out that post.
Today I'm here to write a little on the consequences of those two intrinsically tied roles having such different methodologies. As data science groups pop up in more and more organizations, where the work, who the report to, and how they track themselves are getting to be more front of mind questions, and to be honest there isn't a great best practice for it that I know of. Sometimes they work under engineering, sometimes it's an independent group, sometimes it's one data scientist per business unit all working separately. There are, however, I think a few guiding principles that hold true regardless.
The customer of the data engineer is data science. Data engineering is there to provide the platform and the tools for data science to create useful insights and push them into production, whatever that means (dashboard, live scoring system, or whatever). Data science is there to take those tools and that data, derive some business value, and make sure it's capitalized on.
It's a simple interaction strained by a universal tension: data engineers work by restricting the domain, data scientists work by expanding it. To derive suitable value from limited data, a data scientist must often use new packages, techniques, paradigms, data types, etc. The data scientist adds their value by their ability to apply this vast universe of options well upon the data. This is a nightmare for data engineers though, who provide a stable and usable platform by limiting that universe to something manageable. With limited time and money, this can be an extremely difficult balance to strike. When the data scientists and data engineers reside in different business units or groups, this can be even harder, as corporate politics begin to impact the back and forth.
I personally advocate for the two teams to work closely together, but hold separate planning meetings / scrums. The management of the two types of work from a day to day standpoint tend to be very different, as discussed in the previous post, but to manage the balance between the two they need common management and incentives. I find it very useful to have a representative from each group in all planning meetings or scrums to keep the information flow going, and ideally an individual that acts as the universal go-between, splitting time in each group.
Finally, there is one other aspect of this sort of arrangement that I think provides value: innovation. New techniques, possibilities, or data can come from either data science or data engineering, so keeping the two groups closer together rather than siloed keeps those improvements from getting "stuck" in one group, and not communicated to the other.
So what do you think? What division, if any should a data science team be in? How do they communicate and interact with the platform side of the house?e