Apple is entering the LLM space, but with a strong hardware play. Here are the 10 big ideas from Apple's LLM paper.
The GPU in your pocket could be running a local GPT-3.5-class model by next year, one shared across applications.
Efficient LLM Inference on Limited DRAM
The paper tackles running large language models on devices whose DRAM is smaller than the model itself. The core method: keep the model parameters in flash memory and load only the parameters needed for the current computation into DRAM during inference, shrinking the memory footprint while keeping latency practical.
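To make the idea concrete, here is a minimal sketch in Python. The file name, sizes, and the `load_active_rows` helper are all illustrative, not from the paper; a file-backed NumPy array stands in for weights resident on flash.

```python
import numpy as np

# Illustrative sizes, not from the paper.
ROWS, COLS = 10_000, 512

# Stand-in for flash-resident weights: a file-backed array.
# (Random data here; on device this would be the pretrained FFN matrix.)
flash = np.memmap("ffn_weights.bin", dtype=np.float16,
                  mode="w+", shape=(ROWS, COLS))
flash[:] = np.random.randn(ROWS, COLS).astype(np.float16)
flash.flush()

dram_cache = {}  # neuron id -> its weight row, held in DRAM

def load_active_rows(active_neurons):
    """Copy only the rows for currently active neurons into DRAM."""
    for i in active_neurons:
        if i not in dram_cache:
            dram_cache[i] = np.array(flash[i])  # flash -> DRAM read (copy)
    return np.stack([dram_cache[i] for i in active_neurons])

# Each token activates only a sparse subset of FFN neurons, so DRAM
# holds a small working set instead of the full matrix.
rows = load_active_rows([3, 17, 42])
print(rows.shape)  # (3, 512)
```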
Flash Memory for LLM Storage
The study treats flash memory as the storage tier for LLM weights. Flash offers far more capacity than DRAM but much lower bandwidth and higher read latency, so the paper proposes hardware-aware strategies, such as reading larger contiguous chunks and minimizing total data transferred, to keep inference fast despite the slower medium.
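The intuition that larger reads use flash bandwidth better can be probed with a toy benchmark like the one below. This is not the paper's measurement methodology, and on a desktop OS the page cache will inflate the numbers; it only illustrates the per-read overhead that makes many tiny reads wasteful.

```python
import os
import tempfile
import time

# Write a 64 MiB scratch file, then read it back with different chunk
# sizes. Small chunks pay per-read overhead on every call.
path = os.path.join(tempfile.gettempdir(), "flash_probe.bin")
SIZE_MB = 64
with open(path, "wb") as f:
    f.write(os.urandom(SIZE_MB * 1024 * 1024))

def throughput(chunk_bytes):
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while f.read(chunk_bytes):
            pass
    return SIZE_MB / (time.perf_counter() - start)  # MB/s

for chunk in (4 * 1024, 256 * 1024, 4 * 1024 * 1024):
    print(f"{chunk // 1024:>5} KiB chunks: {throughput(chunk):8.1f} MB/s")
```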
Windowing and Row-Column Bundling Techniques
The paper proposes two key techniques. 'Windowing' reduces data transfer by keeping the parameters of neurons activated over a sliding window of recent tokens resident in DRAM, so each new token only requires loading the newly activated neurons from flash. 'Row-column bundling' stores each FFN neuron's row of the up projection together with its column of the down projection contiguously, so one larger, sequential read fetches both. Both are sketched in code below.
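Here is a small Python sketch of both ideas under simplifying assumptions: the window length, matrix sizes, and helper names (`step`, `read_bundle`) are illustrative, and real activations would come from the model's sparsity predictor rather than hand-picked sets.

```python
from collections import deque

import numpy as np

# --- Windowing (illustrative): keep parameters for neurons activated
# in the last WINDOW tokens in DRAM; per new token, load only newly
# activated neurons and evict those that fell out of the window.
WINDOW = 5
recent = deque(maxlen=WINDOW)   # per-token sets of active neuron ids
resident = set()                # neuron ids currently held in DRAM

def step(active_now):
    recent.append(set(active_now))
    needed = set().union(*recent)
    to_load = needed - resident    # incremental flash reads
    to_evict = resident - needed   # DRAM freed
    resident.clear()
    resident.update(needed)
    return to_load, to_evict

# --- Row-column bundling (illustrative): for FFN neuron i, store the
# i-th row of the up projection and the i-th column of the down
# projection contiguously, so a single read fetches both.
d_model, d_ff = 8, 16
up = np.random.randn(d_ff, d_model).astype(np.float16)    # rows: neurons
down = np.random.randn(d_model, d_ff).astype(np.float16)  # cols: neurons
bundled = np.concatenate([up, down.T], axis=1)  # shape (d_ff, 2*d_model)

def read_bundle(i):
    chunk = bundled[i]                       # one contiguous read
    return chunk[:d_model], chunk[d_model:]  # up row, down column

loads, evicts = step({1, 4, 9})   # neurons to fetch / release this token
up_row, down_col = read_bundle(4)
```

Bundling doubles the size of each read without increasing the number of reads, which plays directly to flash memory's preference for larger sequential accesses.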