Model Deployment and Prediction Service
- The first portion of building an ML system consists of steps such as:
  - Data Sourcing
  - Data Storage
  - Data Preprocessing
  - Model Architecture Selection
  - Model Framework Selection
  - Model Development
  - Model Evaluation
- The second portion of building an ML system consists of steps such as:
  - Model Deployment
  - Model Scaling, depending on the number of users
  - Model Compression, depending on where the model will run or be stored
  - Infrastructure and resource architecture, depending on model requirements such as the number of users, latency, and type of application
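As a rough illustration of how the number of users and the latency feed into infrastructure sizing, the sketch below estimates a replica count using Little's law (requests in flight = arrival rate × latency). The function name `replicas_needed`, the `headroom` factor, and every workload number are hypothetical values chosen for the example, not figures from these notes.

```python
import math


def replicas_needed(peak_requests_per_second, latency_seconds,
                    concurrency_per_replica, headroom=0.7):
    """Estimate how many model replicas a workload needs.

    Little's law gives the average number of requests in flight:
    arrival rate * time spent per request. `headroom` is an assumed
    target utilization so replicas are not run at 100% capacity.
    """
    in_flight = peak_requests_per_second * latency_seconds
    usable_slots_per_replica = concurrency_per_replica * headroom
    return max(1, math.ceil(in_flight / usable_slots_per_replica))


# Hypothetical workload: 2000 requests/s at peak, 50 ms per request,
# each replica able to process 8 requests concurrently.
estimate = replicas_needed(2000, 0.05, 8)
print(estimate)
```

A real capacity plan would also account for traffic bursts, batching, and hardware limits, so this is only a back-of-the-envelope starting point.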
- The first portion is usually connected to the second by a step called exporting the model: the model built in the development environment is exported to a format that can be used in the production environment
- We export the model parameters, the model definition, and any model-specific metadata. The runtime then uses these to load the model and pass input data through it to produce an output
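The export step can be sketched with a toy example. The code below is a minimal illustration using only the Python standard library: it writes a linear model's parameters, definition, and metadata to a JSON file, then rebuilds a prediction function from that file on the "runtime" side. The file layout and the names `export_model` and `load_model` are assumptions made for this example; real frameworks use their own export formats.

```python
import json
import os
import tempfile


def export_model(path, weights, bias, metadata):
    """Development side: write definition, parameters, and metadata to one file."""
    artifact = {
        "definition": {"type": "linear", "n_features": len(weights)},
        "parameters": {"weights": weights, "bias": bias},
        "metadata": metadata,
    }
    with open(path, "w") as f:
        json.dump(artifact, f)


def load_model(path):
    """Runtime side: read the artifact and rebuild a predict function."""
    with open(path) as f:
        artifact = json.load(f)
    if artifact["definition"]["type"] != "linear":
        raise ValueError("unsupported model type")
    weights = artifact["parameters"]["weights"]
    bias = artifact["parameters"]["bias"]

    def predict(features):
        if len(features) != len(weights):
            raise ValueError("wrong number of features")
        return sum(w * x for w, x in zip(weights, features)) + bias

    return predict


# Development environment: export the (hypothetical) trained model.
model_path = os.path.join(tempfile.mkdtemp(), "model.json")
export_model(model_path, weights=[0.5, -1.0], bias=2.0,
             metadata={"version": "0.1", "feature_names": ["x1", "x2"]})

# Production environment: load the artifact and pass input through the model.
predict = load_model(model_path)
result = predict([4.0, 1.0])
print(result)
```

The point of the split is that `load_model` needs nothing from the training code: everything required for inference travels inside the exported artifact.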
- A major part of the inference phase is understanding:
  - How does the model generate inferences?
    - Batch vs. online prediction
  - Where does the inference take place?
    - In the cloud or on edge devices
  - What is the purpose of serving these models?
    - The answer informs the expected number of Daily Active Users, the latency each use case can tolerate, and other infrastructure-related requirements
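To make the batch vs. online distinction concrete, here is a minimal sketch. The `model` function is a hypothetical stand-in for a trained model, and the user IDs and feature values are made up for illustration.

```python
def model(features):
    # Hypothetical stand-in for a trained model: a fixed linear score.
    return 2.0 * features[0] + features[1]


def run_batch_job(user_features):
    """Batch prediction: score every known user ahead of time and store the results."""
    return {user_id: model(f) for user_id, f in user_features.items()}


def serve_batch(cache, user_id, default=None):
    """At request time, batch serving is only a lookup in the stored results."""
    return cache.get(user_id, default)


def serve_online(features):
    """Online prediction: run the model when the request arrives."""
    return model(features)


known_users = {"u1": [1.0, 0.5], "u2": [0.0, 3.0]}
cache = run_batch_job(known_users)

batch_hit = serve_batch(cache, "u1")   # precomputed score
batch_miss = serve_batch(cache, "u3")  # new user: nothing was precomputed
online = serve_online([2.0, 1.0])      # computed on demand
print(batch_hit, batch_miss, online)
```

The lookup path is fast but can only answer for inputs seen when the batch job ran, while the online path handles any input at the cost of running the model inside the request's latency budget.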
- Companies already serve a large number of models, because the models address different use cases or sub-problems and therefore rely on different architectures, sometimes different layers, or different model parameters
- We also have to expect the data distribution to shift in production, because real usage is unpredictable in ways that may not have been accounted for during training
- This leads to the requirement that the system support continuous model deployment and use the new data to mitigate the accuracy drop caused by data distribution shift
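One simple way to notice a shift is to compare a feature's production values against its training values and flag the model for retraining when they diverge. The sketch below measures how many standard errors the production mean sits from the training mean. This check, the `threshold` of 3.0, and the sample values are illustrative assumptions; production systems often use richer statistical tests and monitor many features at once.

```python
import math
import statistics


def mean_shift_score(train_values, prod_values):
    """Return how many standard errors the production mean is from the training mean."""
    train_mean = statistics.mean(train_values)
    train_std = statistics.stdev(train_values)
    difference = abs(statistics.mean(prod_values) - train_mean)
    standard_error = train_std / math.sqrt(len(prod_values))
    if standard_error == 0:
        # Constant training feature: any difference counts as an unbounded shift.
        return math.inf if difference > 0 else 0.0
    return difference / standard_error


def should_retrain(train_values, prod_values, threshold=3.0):
    """Flag retraining when the shift score exceeds an assumed threshold."""
    return mean_shift_score(train_values, prod_values) > threshold


train = [1.0, 2.0, 3.0, 4.0, 5.0]
similar_prod = [2.0, 3.0, 4.0, 3.0, 3.0]
shifted_prod = [7.0, 8.0, 9.0, 8.0, 8.0]

print(should_retrain(train, similar_prod))
print(should_retrain(train, shifted_prod))
```

In a continuous deployment setup, a positive result from a check like this could trigger retraining on the newly collected data followed by redeployment of the updated model.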