ML Systems
As ML models grow, a new set of systems problems is emerging. They involve:
- Distributed Learning: Running neural network training algorithms across a fleet of machines.
- ML Compilers
- Custom Hardware for ML training and inference jobs
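As a rough illustration of the distributed learning item above, here is a minimal sketch of synchronous data-parallel SGD. It makes several assumptions that are not in these notes: the "workers" are simulated inside a single process, the model is a one-parameter linear fit, and the names `local_gradient`, `all_reduce_mean`, and `train` are hypothetical helpers, not any library's API. A real system would replace `all_reduce_mean` with a collective operation over the network.

```python
# Toy synchronous data-parallel SGD. Workers are simulated in one process;
# all function names here are illustrative, not taken from a real framework.

def local_gradient(w, shard):
    """Mean gradient of 0.5 * (w*x - y)^2 over one worker's data shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for an all-reduce collective: every worker receives the mean."""
    return sum(values) / len(values)

def train(data, num_workers=4, lr=0.01, steps=200):
    # Split the data into equally sized shards, one per simulated worker.
    shard_size = len(data) // num_workers
    shards = [data[i * shard_size:(i + 1) * shard_size]
              for i in range(num_workers)]
    w = 0.0  # every worker holds an identical copy of the parameter
    for _ in range(steps):
        # Each worker computes a gradient on its own shard only.
        grads = [local_gradient(w, shard) for shard in shards]
        # Gradients are averaged, then every replica applies the same update.
        w -= lr * all_reduce_mean(grads)
    return w

data = [(float(x), 3.0 * x) for x in range(1, 9)]  # samples from y = 3x
w = train(data)
```

Because the shards are equally sized, the averaged gradient equals the full-batch gradient, so this toy run behaves like single-machine gradient descent and `w` approaches 3.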
Backlog of papers yet to read
- Jupiter Evolving: Poutievski et al., 2022
- TPUv4: Jouppi et al., 2023
- TensorFlow: Abadi et al.
- Pathways: Barham et al., 2022
- Cores that don't count: Hochschild et al., 2021