As artificial intelligence (AI) and machine learning (ML) gain market momentum, foundation models are being consumed and managed differently. Increasingly, organizations want to build and manage their own models that balance cost, reliability, and specificity of model outputs. Over the past 12 months, this balance has become easier to achieve for these reasons:
- More diverse and efficient foundation models — The recent proliferation of foundation models in different sizes and types has lowered the barrier to industry- or organization-specific training and fine-tuning. This yields higher-quality, more accurate outputs from existing foundation models or from new models derived from them.
- Improved tooling and IT operational alignment — Data science tooling has matured, in the form of both notebooks and IDEs. The industry has also taken steps to better align traditional IT tools such as Kubernetes and CI/CD pipelines with AI and ML needs (see the sketch after this list), removing barriers between data scientists and IT operations teams.
- More robust infrastructure — Another key driver of lower costs for custom models has been the rapid expansion and improvement of AI and ML infrastructure, including GPUs, TPUs, and other accelerators. This has reduced costs drastically, to the point that some training implementations cost a small fraction of what they did even a year ago.
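To make the tooling alignment concrete, here is a minimal sketch of how a training workload can be submitted to shared, IT-managed infrastructure using the official `kubernetes` Python client. It assumes a cluster whose nodes expose GPUs through the NVIDIA device plugin (the `nvidia.com/gpu` resource); the job name, namespace, container image, and entry point are illustrative placeholders, not part of any particular product.

```python
# Minimal sketch: submit a one-off GPU training Job to a Kubernetes cluster.
# Assumes the official `kubernetes` Python client and a local kubeconfig.
from kubernetes import client, config

config.load_kube_config()  # authenticate using the local kubeconfig

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="finetune-demo"),  # hypothetical name
    spec=client.V1JobSpec(
        backoff_limit=0,  # do not retry silently; surface failures to the team
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="trainer",
                        image="registry.example.com/ml/finetune:latest",  # placeholder
                        command=["python", "train.py"],
                        resources=client.V1ResourceRequirements(
                            # Request one GPU so the scheduler places the pod
                            # on a GPU node alongside other tenants' workloads.
                            limits={"nvidia.com/gpu": "1"}
                        ),
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="ml-team", body=job)
```

The same job definition could equally be produced by a CI/CD pipeline, which is the kind of alignment between data science and IT operations described above.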
While these factors are sure to further accelerate the development of AI and ML applications, challenges remain for organizations that want to host their own AI or ML infrastructure, including:
- Choice — An internally built and managed AI and ML environment may limit an organization’s choices for infrastructure, networking, models, and tooling due to availability or budgetary constraints. Rapid changes in the market may also prompt the organization to regret recent decisions.
- Costs — The costs of change as infrastructure matures are well understood, but AI and ML infrastructure is a particularly complex and expensive undertaking that requires constant human attention. There are also costs associated with how GPUs are utilized and governed amid competing priorities and projects.
- Time to market — An additional challenge with the large GPU clusters that drive neural network training is managing reliability as components inevitably fail. These failures can halt or lose entire training runs, which slows time to market (see the checkpointing sketch below).
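Periodic checkpointing is the standard mitigation for this failure mode: if training state is persisted to durable storage at regular intervals, a failed component costs only the work done since the last save rather than the whole run. Below is a minimal PyTorch sketch under that assumption; the checkpoint path and the surrounding model and optimizer objects are illustrative, and managed offerings such as SageMaker HyperPod aim to automate this style of recovery at cluster scale.

```python
# Minimal checkpoint/resume sketch in PyTorch. The path is a placeholder for
# durable shared storage; `model` and `optimizer` are assumed to exist elsewhere.
import torch

CKPT_PATH = "/mnt/shared/checkpoint.pt"  # hypothetical durable location

def save_checkpoint(model, optimizer, epoch, step):
    # Persist everything needed to resume, so a node failure costs at most
    # the work done since the last save.
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": epoch,
            "step": step,
        },
        CKPT_PATH,
    )

def load_checkpoint(model, optimizer):
    # Resume from the most recent checkpoint if one exists; otherwise start fresh.
    try:
        state = torch.load(CKPT_PATH, map_location="cpu")
    except FileNotFoundError:
        return 0, 0
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"], state["step"]
```

A training loop would call load_checkpoint once at startup and save_checkpoint every N steps, so that a restarted job picks up where the last save left off.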
So, although the barriers to entry for building and training AI and ML models continue to shrink, organizations may want to consider a managed end-to-end offering such as Amazon SageMaker HyperPod to improve time to market and avoid the complexities and costs of self-managed, on-premises infrastructure.
Download the research paper to read more.
Table of Contents
- Introduction
- The Amazon Web Services Approach to AI
- A Closer Look at Infrastructure for Training and Inference
- SageMaker and the Maturation of AI and ML
- SageMaker HyperPod and Reducing Time to Market
- Other Benefits of SageMaker HyperPod
- Conclusion
Companies Cited:
- AWS