Launching and Scaling Data Science Teams: Project Summary

Best practice playbooks for building data science into your organization

Ian Macomber
8 min read · Dec 27, 2018


This is the final output of a Fall semester independent project on data science leadership and organization within companies. I am in the second year of my MBA at Harvard Business School, and Kris Ferreira in the Technology and Operations Management (TOM) unit is my advisor. I interviewed 29 people (DS individual contributors, DS managers, non-technical business leaders) across 13 companies. I used those interviews to develop a set of best-practices playbooks for launching and scaling data science teams and careers: one for data scientists, one for data science managers, and one for business stakeholders.

Intro

The idea for this project came to me when I was onboarding at a new job. Data science onboarding is always a messy process, but I like to start in this order: data, codebase, workflow. Diving into a data set requires stretching your arms wide to find everywhere that anyone is logging something you might be interested in, and figuring out the company-specific nuances. It’s asking questions like this: “What database are the retargeting emails stored in? What’s the login info for that server? What’s the difference between agent_id, uu_id, and agentuu_id? Why are there negative and duplicated numbers in the order_id column?” Broadly, what tables and databases will I be using every day and ultimately come to know by heart?
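Questions like these translate directly into quick sanity checks you can run on day one. A minimal sketch (the table and column names are hypothetical, mirroring the examples above) that flags the negative and duplicated order_id values, using an in-memory SQLite database:

```python
import sqlite3

# Hypothetical orders table, mirroring the onboarding questions above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, agent_id TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(101, "a1"), (102, "a2"), (102, "a2"), (-7, "a3")],  # one dup, one negative
)

# Negative order_ids: often refunds, adjustments, or test rows -- worth asking about.
negatives = conn.execute(
    "SELECT order_id FROM orders WHERE order_id < 0"
).fetchall()

# Duplicated order_ids: often retries, re-logging, or join fan-out upstream.
dupes = conn.execute(
    "SELECT order_id, COUNT(*) FROM orders GROUP BY order_id HAVING COUNT(*) > 1"
).fetchall()

print(negatives)  # [(-7,)]
print(dupes)      # [(102, 2)]
```

The answers to these checks are rarely in documentation; they live in the heads of the people who built the logging, which is why onboarding starts with conversations as much as queries.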

The codebase comes next: how have those that came before me built pipelines, predictions, solutions, and analysis with this data they’ve unearthed? What problems have they already solved that I won’t need to solve for myself?

And finally, the workflow: how does this team write code? What are the standards they hold themselves to? How do they prioritize projects, validate and communicate results, share knowledge, and make themselves better?

The data sets I found were standard — mostly structured SQL tables, distributed across a few file systems, with some additional text and image data. The codebase and workflow, however, were very different from what I was used to. Team members wrote, tested, and saved code on their local machines. Code repositories were personal (“Ian’s git repo”) instead of project-based (“pricing algorithm v2”). Team members worked in R and Python to solve the same problems. The team sat together, but didn’t work together.

I had worked in a similar role on similar projects with similar data sets before, but organizationally this was a big difference. My past team had a standardized workflow, a shared set of development tools, and shared coding standards. The team and its workflow were structured to increase knowledge transfer and mentorship between team members, resulting in better code, better business outcomes, and a stronger team. You were evaluated not just on your work as an individual contributor, but as a team member: partly on how well you developed others and improved the team’s workflow, standards, and tools.

My experience forced me to realize that, while there are many incredible ways to train a data scientist, there are few resources or experts on how to launch and scale a data science team within a company. This integration of data science teams into broader organizations can lead to difficulties, misunderstandings, and operational inefficiency. Individual contributor data scientists (ICs), data science management, and non-technical business leaders all have an opportunity to help bridge the information gap between organizations, and to change the culture around what a data scientist, and a data science team, actually does.

I surveyed ICs, data science management, and non-technical business stakeholders across companies that rely on their data science teams to solve core problems. I tried to find out how they work, communicate, collaborate, prioritize, and evaluate — as well as the blind spots and misalignments in that process. I used these conversations to write three best-practices guides for launching and scaling data science teams and careers: one for data scientists, one for data science management, and one for business stakeholders dependent on a data science team. But to understand today’s problems with data science teams, we have to understand how we got here.

The Rise of Data Science

The rise of “Data Science” has been rapid and well documented. According to Google Trends, the term started to take off in 2012, which is also when Harvard Business Review named “Data Scientist” the sexiest job of the 21st century. A modern company that throws off millions of data points is expected to have a data science team, and there is broad agreement that talented labor is in short supply. The HBR article describes the challenge in 2012:

If capitalizing on big data depends on hiring scarce data scientists, then the challenge for managers is to learn how to identify that talent, attract it to an enterprise, and make it productive. None of those tasks is as straightforward as it is with other, established organizational roles. Start with the fact that there are no university programs offering degrees in data science. There is also little consensus on where the role fits in an organization, how data scientists can add the most value, and how their performance should be measured.

Obviously much has changed: six years later, there are now 160+ university degrees with “Data Science” in the title. But many of the challenges and opportunities remain the same. A data scientist can still come from anywhere. The process of becoming a data scientist is accessible in a way very few desirable, high-paying jobs are. In addition to the paid degrees, there’s Andrew Ng’s Coursera machine learning course, edX classes by MITx, HarvardX, ColumbiaX, or UCSanDiegoX, physical and web-based textbooks, YouTube channels, personal blogs, company blogs, and more.

The range of educational options at every price point, and the diversity of data scientists’ backgrounds, lead to a wide range of skill sets, interests, and problem-solving styles. Within a company, this can be a huge positive for diversity of thought, creating very strong internal learning environments. It can also lead to huge differences in business acumen, context, interests, and general job experience. There are brilliant data scientists who have never booked a conference room in Outlook, moved someone to bcc:, or had a performance review. Kaggle has a Competitions Grandmaster who is seventeen years old, illustrating both how meritocratic open-source education is and that a seventeen-year-old can be the best at this specific evaluation of talent.

Kaggle

Kaggle is a data science challenge-based website, where users of various talent levels come together to compete on building accurate models. A company interested in recruiting data scientists publishes training and test data sets, explains the relevant features, and asks users to predict an output. An example competition: Airbnb “challenges you to predict in which country a new user will make his or her first booking” by issuing disguised data on users, browser sessions, and countries. The format for submission is highly specified, as is the evaluation metric. Individuals submit predictions, and the most accurate go to the top of the public leaderboard. The top submissions are frequently separated by minuscule margins. The company can use the competition leaders to identify and recruit top talent.
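As a concrete sketch of that mechanic (the ids, column names, and metric here are hypothetical, loosely following the Airbnb example): a submission is typically a small CSV mapping each test id to a prediction, scored against held-out truth with a single metric such as accuracy:

```python
import csv
import io

# Hypothetical held-out truth and a model's predictions (user id -> country).
truth = {"u1": "US", "u2": "FR", "u3": "US"}
preds = {"u1": "US", "u2": "US", "u3": "US"}

# Write the submission in the highly specified format: one row per test id.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "country"])
for user_id, country in preds.items():
    writer.writerow([user_id, country])

# Score against the single evaluation metric (plain accuracy here).
accuracy = sum(preds[u] == truth[u] for u in truth) / len(truth)
print(round(accuracy, 3))  # 0.667
```

Real competitions use richer metrics (log loss, NDCG, AUC), but the structure is the same: one file, one number, one leaderboard position.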

Kaggle competitions can be a great resource for developing and demonstrating modeling skills. The problem with Kaggle is the appeal of Kaggle: it gives you all the data you need and tells you what to predict. In practice, that’s the hardest part to do. From a data science manager I interviewed:

The things that really move the needle are how you actually assemble data, formulate the problem and tell the model what it is it needs to learn or predict. I’ve seen people that are Kaggle data science types be pretty ineffective in a data science setting. Kaggle is bullshit and not what data scientists do, most systems are not like that. We spend most of our time asking, what is the problem that needs to be solved and how do you formulate it? Once you get to the model fitting phase, the implementation is trivial, and the returns to knob tuning I’ve found to be quite low.

Kaggle-style competitions (and most interviews) also do not evaluate time and complexity tradeoffs, because they’re inherently scored on a single criterion. An industry data science solution must weigh complexity. Netflix famously held a competition to improve its recommendation algorithm; just as famously, it did not implement the winning algorithm: “The additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.” Kaggle does not force competitors to develop a framework for evaluating whether it makes sense to unlock 0.1% gains via another hundred lines of code, or another two days spent on a project.

Excellence in turning data into a prediction with respect to a certain metric is a critical skill. However, as Max Woolf writes: “Modeling is just one part (and often the easiest part) of a very complex system”. If you can do this and only this, you are the data science equivalent of a goal-line running back, and you have ignored the remaining 90%. A data science education, regardless of source, can leave ICs immensely talented at building and tuning complex models, but unprepared for their first day of work. So how do you grow, scale, and manage teams of data scientists?

Being a data science manager is not easy, nor are there many people with the relevant experience to lead data science teams. If modern data science has existed for under ten years, then by definition no one has ten years of experience managing it. Due to the sheer recency of the industry, best practices are hard to come by. Resources are limited, role models who can teach how to lead are few, and the challenges are remarkably consistent.

Data science managers are tasked with assembling a team of ICs and understanding their backgrounds, motivations, strengths, and weaknesses. They must face difficult decisions when structuring both their group and their data architecture. Questions of “ownership” will come up constantly. Leaders must help fill in the gaps that data scientists have: how do you give your people the business context they need to be successful? How do you establish the frameworks for prioritizing projects, code standards, and tradeoffs between time and complexity? How do you teach your team to be successful? Managers must align organization, data platforms, project workflow, and overall culture to build an excellent team, all while knowing that there isn’t an established answer to the question: “what does the ideal data science team look like?”

Leaders of data scientists must also specify the manner in which business stakeholders interact with the data science unit. When a department has a request, how is that work communicated, prioritized, executed, and reported on? What shape does an end product take, and who is responsible for the final output going forward? Although it should be trivial, it’s often hard to even agree on when a project is finished. How do individuals on the business side support the data scientists, and vice versa? How are those conversations facilitated?

The relationship between business stakeholders and data science teams varies wildly from company to company. “Non-technical” is a spectrum, and companies sit at very different points along it. The strength of a company’s “data culture” and the level of analytic understanding among business stakeholders are highly predictive of how well the data science team fits in. If you’re a data science manager, how do you ensure your team’s work can even be consumed and implemented by the necessary stakeholders?

Business stakeholders also face an uphill battle. Data scientists can always sound like they’re speaking a different language. If you’re interacting with a data science team for the first time, it can feel like they’re a completely different species. How do you bridge this gap? You know “data science” involves using data to drive business, but how do you learn what this team can even do for you, never mind whether they can actually execute on that plan? How do you prioritize what you’re asking for, and how do you make better decisions as a result? How do data scientists enable your success, and how do you enable theirs?

I am hoping these best-practice playbooks will help.


Leading Analytics @ramp. Intersection of Data and Business Leadership. Previously @drizly, @harvardHBS, @zillow, @wayfair, @dartmouth