ML Systems

With the growth of ML models, a new set of systems problems is emerging. They involve:

  1. Distributed Learning: running NN learning algorithms across a fleet of machines.
  2. ML Compilers
  3. Custom Hardware for ML training and inference jobs

Backlog of papers yet to read

  1. Jupiter Evolving: Poutievski et al., 2022
  2. TPUv4: Jouppi et al., 2023
  3. TensorFlow: Abadi et al.
  4. Pathways: Barham et al., 2022
  5. Cores that don't count: Hochschild et al., 2021