10 June 2024, by Andreas Helfenstein
How (and Why) Data Scientists Stay Relevant in the Age of AutoML and Co-Pilot
The Evolution of AI Terminology
A joke used to circulate among data scientists whenever they were asked to explain the difference between AI and machine learning. The tongue-in-cheek witticism, which elicited a smirk from any self-respecting data scientist, stated: "If it is written in Python, it is called machine learning. If it is written in PowerPoint, it is called AI." The implication was that "AI" was a fashionable yet vague buzzword that described everything and nothing. In corporate slideshows, it resembled a magic 8-ball that seemed to solve any ill-defined business problem. Real scientists, or so the predominant opinion went, used the term "machine learning" instead.
Attentive readers might object at this point and (correctly) point out that "artificial intelligence" is now widely used in scientific communications and has its well-deserved spot when discussing large language models, generative AI, and related developments. Clearly, something has changed in the perception of this term.
Adapting to Changing Terminology and Trends
Changes in language often indicate changes in the lives and experiences of its speakers. The adoption of AI terminology by data scientists reflects the field's progress in recent years. As the people working most closely with AI, data scientists also witness changes in their job descriptions and ponder how to best adapt to new trends, challenges, and requirements. This article shares views on the skills becoming more important for success, how to develop a career, and how to hire and train talent prepared for an AI-centered future.
The Role of the Cloud
The cloud is hardly new; shared, remote resources have long been a staple of enterprise IT. Innovations in large language models, generative AI, and foundation models have, however, reshaped how the cloud is used and reinforced its dominance. The vast resources needed to train and maintain these models often make in-house development infeasible or uneconomical. For most use cases, leveraging solutions from hyperscalers like Google, Amazon, or Microsoft is the preferred approach.
In practice, this means it is no longer enough to run code in a remote session on a virtual machine. Data scientists need to know which APIs and products are available, along with their features, limitations, potential rate limits, and cost implications. Being able to navigate documentation, terminology, and product names is crucial, even though these may change rapidly at the whim of marketing.
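As a minimal sketch of what this can look like in code, the snippet below calls a hosted model endpoint while backing off when the rate limit is hit and capping token usage to keep costs predictable. The endpoint URL, response schema, and parameter names are placeholders rather than any specific vendor's API:

```python
import time
import requests

API_URL = "https://example-llm-provider.com/v1/generate"  # placeholder endpoint, not a real product
API_KEY = "..."  # in practice, fetched from a secret manager rather than hard-coded

def call_model(prompt: str, max_retries: int = 5) -> str:
    """Call a hosted model API, backing off when the rate limit is hit."""
    for attempt in range(max_retries):
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"prompt": prompt, "max_tokens": 256},  # capping tokens keeps costs predictable
            timeout=30,
        )
        if response.status_code == 429:   # rate limit exceeded
            time.sleep(2 ** attempt)      # exponential backoff before retrying
            continue
        response.raise_for_status()
        return response.json()["text"]    # assumed response schema
    raise RuntimeError("Rate limit not cleared after retries")
```

Keeping retries, timeouts, and token budgets explicit in the code makes both the failure modes and the cost drivers visible to the whole team.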
Embracing MLOps
The reliance on cloud-based products and managed services has significantly impacted the structure and architecture of ML projects and pipelines. Turning data into insights is no longer a straightforward matter of running a script from start to finish. Instead, it involves an intricate interplay of API calls, remote resources, and microservices.
To keep this complex machinery running smoothly, a solid MLOps framework is essential. Such a framework comprises a set of rules, tools, and best practices that help teams build AI products that remain reproducible, understandable, and maintainable, both during development and in production. For data scientists, this means mastering the necessary tools and resources, developing and sharing best practices, and keeping a vigilant eye on pipeline statuses, data drift, and other quality metrics.
In practice, this might involve setting up automated monitoring systems that alert the team to any deviations in data quality or performance metrics. For example, a sudden shift in data distribution might indicate a data drift, which could affect model performance. By addressing these issues promptly, data scientists can ensure their models remain accurate and reliable.
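As a rough illustration, not any specific team's setup, a drift check can be as simple as comparing a production batch of a feature against its training-time distribution with a two-sample Kolmogorov-Smirnov test; the data and threshold below are invented for the example:

```python
import numpy as np
from scipy.stats import ks_2samp

def check_drift(reference: np.ndarray, current: np.ndarray, alpha: float = 0.01) -> bool:
    """Flag drift when the current batch no longer looks like the reference distribution."""
    statistic, p_value = ks_2samp(reference, current)
    return p_value < alpha  # small p-value: the distributions likely differ

# Hypothetical usage: reference data from training time vs. this week's production batch
rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)
current = rng.normal(loc=0.4, scale=1.0, size=5_000)  # shifted mean simulates drift

if check_drift(reference, current):
    print("Data drift detected: trigger an alert or a retraining run")
```

In a real pipeline, a check like this would run on a schedule per feature and feed an alerting system rather than a print statement.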
Maintaining Scientific Rigor
Data scientists are, at their core, still scientists, and their ability to plan, execute, and analyze experiments has not (yet) been replaced by neural networks. A well-planned experiment prevents unnecessary use of expensive resources and enables the evaluation and monitoring of performance and business impact. Knowing how to use and interpret different measures and metrics, coupled with a solid understanding of statistics, is a requirement for developing a high-impact AI solution. While a "traffic light"-style widget might be a good high-level indicator of how the business is doing, it falls short when answering questions like "why did something change?", "how do we improve?", or "which project should resources be allocated to?".
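To make this concrete, here is a small sketch of the kind of analysis this implies: deciding whether a new model variant actually outperforms the baseline in an A/B test, rather than reading the difference off a dashboard. The click counts are invented for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical A/B test: clicks vs. no-clicks for two model variants
#                     clicks  no_clicks
observed = np.array([[320,    4_680],    # variant A (baseline model)
                     [360,    4_640]])   # variant B (new model)

chi2, p_value, dof, expected = chi2_contingency(observed)

if p_value < 0.05:
    print(f"Difference is statistically significant (p = {p_value:.3f})")
else:
    print(f"No significant difference detected (p = {p_value:.3f}); "
          "more data or a better-designed experiment may be needed")
```

Framing the comparison as a hypothesis test also forces the team to decide up front on sample sizes, significance levels, and the metric that actually matters to the business.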
Navigating Linguistics
Large language models strive to imitate natural language (that is, human speech), but at the end of the day, they are still more or less deterministic machines, albeit very sophisticated ones. While the term "prompt engineering" is sometimes derided as being skill-wise on the same level as "expert googling", one still needs to recognize the quirks and imprecisions inherent to human language. Especially in production systems, where prompts cannot be tweaked on the fly and their outputs manually cleaned up afterwards, writing a prompt that delivers consistent, unambiguous, and trustworthy results requires finesse and mastery of the language. The principle of "garbage in, garbage out" still applies to models with billions of parameters, except that unlike code, the language model might be too arrogant to admit that its carefully calculated answer belongs in the landfill.
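One way this plays out in practice, sketched below with a hypothetical support-ticket classifier, is constraining the prompt to an explicit label set and output format and then validating the model's answer instead of trusting it blindly. The template, labels, and helper function are illustrative and not tied to any particular model or library:

```python
import json

PROMPT_TEMPLATE = """You are a support-ticket classifier.
Classify the ticket below into exactly one of: "billing", "technical", "other".
Respond with JSON only, in the form {{"category": "<label>", "confidence": <number between 0 and 1>}}.
Do not add explanations or extra text.

Ticket:
\"\"\"{ticket_text}\"\"\"
"""

def parse_response(raw: str) -> dict:
    """Validate the model's output instead of trusting it blindly."""
    data = json.loads(raw)  # raises an error on malformed JSON
    if data.get("category") not in {"billing", "technical", "other"}:
        raise ValueError(f"Unexpected category: {data.get('category')!r}")
    return data

prompt = PROMPT_TEMPLATE.format(ticket_text="I was charged twice for my subscription.")
# `prompt` would then be sent to whichever hosted model the pipeline uses,
# and the raw answer passed through parse_response() before any downstream use.
```

The point is less the specific wording than the discipline: fixed vocabulary in, machine-checkable structure out, and a hard failure when the model drifts from either.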
Understanding Regulatory Requirements
Research institutes and businesses are not the only ones keeping an eye on AI: government bodies are picking up the pace to bring the Wild West era of AI to an end. In recent years, regulations on ethics, applicable use, risk management, and data protection have been proposed, ratified, and entered into force across the globe. These rules and regulations, together with the uncertainty about what might come next, have a direct impact on the daily work of data scientists. A good understanding of a model's inner workings, its risks, impacts, and limitations is necessary to ensure compliance and avoid potentially hefty fines.
Enhancing Communication Skills
The popularity of AI, its omnipresence in the media, and its huge potential bring ever more stakeholders flocking to the data team with questions, ideas, and concerns about new technologies. Instead of having highly specific conversations about activation layers, AUC metrics, or new Python libraries, data scientists increasingly find themselves interacting with people of various backgrounds, interests, and priorities. As in-house experts on AI, they balance business interests with technical feasibility and manage expectations to counteract the overconfident marketing messages of AI companies.
Rapid developments in data science have changed how people perceive the opportunities, risks, and potential of AI. Data scientists, as the primary users of these algorithms, experience these changes firsthand. By evolving their role from pure developers of ML models into planners, implementers, and analysts of AI products, data scientists are well placed to claim their spot in the data landscape as versatile experts, ready to handle whatever the future brings.