Applying your server-side deep learning skills on Apple devices may be easier than you think. Yunfei Cheng and I evaluated the learning curve by comparing MLX kernels running on the Metal GPUs of Apple Silicon chips with PyTorch kernels on CUDA GPUs. The image below depicts the scalability of scaled dot-product attention (SDPA) and linear projection on the M1 Max, M2 Ultra, A100, and H100. The x-y plane represents the beam shape used in our Recurrent Drafting work (https://lnkd.in/dvrvUwbU). All of these kernels show a similar scalability trend as the beam shape grows. Interestingly, the performance difference between CUDA and Metal in SDPA is considerably smaller than in linear projection. For example, linear projection showed a roughly 100x performance difference between the M1 Max and the H100, whereas SDPA showed only about a 25x difference on the same hardware.
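To give a feel for how such a measurement looks on the MLX side, here is a minimal sketch that times SDPA and a linear projection on the Metal GPU. The beam shapes, head count, and hidden size are illustrative assumptions, not the exact configuration behind the figure; `mx.fast.scaled_dot_product_attention` is MLX's built-in SDPA kernel, and `mx.eval` forces MLX's lazy graph to actually run.

```python
# Hedged sketch: timing SDPA and a linear projection with MLX on Metal.
# The beam shapes and tensor sizes below are assumptions for illustration.
import time
import mlx.core as mx

def time_op(fn, warmup=3, iters=20):
    """Run fn several times, forcing evaluation, and return mean latency in ms."""
    for _ in range(warmup):
        mx.eval(fn())
    start = time.perf_counter()
    for _ in range(iters):
        mx.eval(fn())
    return (time.perf_counter() - start) / iters * 1e3

num_heads, head_dim, hidden = 32, 128, 4096          # assumed model dimensions
for num_beams, beam_len in [(1, 16), (4, 16), (8, 32), (16, 64)]:  # assumed beam shapes
    q = mx.random.normal((num_beams, num_heads, beam_len, head_dim))
    k = mx.random.normal((num_beams, num_heads, beam_len, head_dim))
    v = mx.random.normal((num_beams, num_heads, beam_len, head_dim))
    x = mx.random.normal((num_beams * beam_len, hidden))
    w = mx.random.normal((hidden, hidden))

    sdpa_ms = time_op(lambda: mx.fast.scaled_dot_product_attention(
        q, k, v, scale=head_dim ** -0.5))
    linear_ms = time_op(lambda: x @ w)  # linear projection as a plain matmul
    print(f"beam=({num_beams},{beam_len})  sdpa={sdpa_ms:.3f} ms  linear={linear_ms:.3f} ms")
```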
How does MLX on Metal perform on machine learning workloads? Yi Wang and I conducted a set of benchmarks using the M1 Max and M2 Ultra with MLX, and the A100 and H100 with PyTorch, to compare the performance of two fundamental operations, SDPA and linear projection. A surprising finding is how close the M2 Ultra comes to the A100, underscoring the potential of on-device machine learning. The benchmarks also reveal distinct performance trends: linear projection shows a linear increase in latency with larger input sizes, while SDPA's latency grows super-linearly due to its higher computational complexity. Interestingly, the performance disparity in SDPA is much less pronounced than in linear projection. For instance, linear projection demonstrates a nearly 100x performance difference between the M1 Max and the H100, whereas SDPA shows only about a 25x difference on the same hardware. These findings highlight the promise of on-device machine learning, and we look forward to further performance gains, particularly as Metal advances.
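For comparison, the CUDA-side measurement with PyTorch could look like the sketch below. The tensor sizes mirror the MLX sketch above and are likewise assumptions rather than the exact benchmark setup; `torch.nn.functional.scaled_dot_product_attention` is PyTorch's SDPA entry point, and explicit `torch.cuda.synchronize()` calls keep asynchronous kernel launches from skewing the timings.

```python
# Hedged sketch: timing SDPA and a linear projection with PyTorch on CUDA.
# Sizes mirror the MLX sketch above and are illustrative assumptions.
import time
import torch
import torch.nn.functional as F

def time_op(fn, warmup=3, iters=20):
    """Time a CUDA op; synchronize so queued kernels finish before the timer stops."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3

device, dtype = "cuda", torch.float16
num_heads, head_dim, hidden = 32, 128, 4096          # assumed model dimensions
for num_beams, beam_len in [(1, 16), (4, 16), (8, 32), (16, 64)]:  # assumed beam shapes
    q = torch.randn(num_beams, num_heads, beam_len, head_dim, device=device, dtype=dtype)
    k, v = torch.randn_like(q), torch.randn_like(q)
    x = torch.randn(num_beams * beam_len, hidden, device=device, dtype=dtype)
    w = torch.randn(hidden, hidden, device=device, dtype=dtype)

    sdpa_ms = time_op(lambda: F.scaled_dot_product_attention(q, k, v))
    linear_ms = time_op(lambda: x @ w)  # linear projection as a plain matmul
    print(f"beam=({num_beams},{beam_len})  sdpa={sdpa_ms:.3f} ms  linear={linear_ms:.3f} ms")
```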