PowerInfer is an LLM inference engine designed to maximize model inference performance on consumer-grade hardware.
This article is a quick overview. If you want to learn more, here's the GitHub repository explaining how to use the engine: https://github.com/SJTU-IPADS/PowerInfer
This engine claims extraordinary performance: roughly ten times faster than llama.cpp.
But how does it achieve that?
In pursuit of performance, the team behind the engine chose a design based on the concept of locality. Rather than computing every neuron and every connection on every pass, the engine skips those that are unlikely to matter, based on how they are actually used. It relies on two ideas: "sparse activation," the observation that only a small fraction of neurons meaningfully contribute to any given inference step, and "hot/cold" neurons, a split based on how frequently each neuron activates, which is used to predict which neurons will be active.
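To make this more concrete, here is a minimal NumPy sketch of the idea (an illustration only, not code from PowerInfer; the layer sizes, the weight names W_up/W_down, and the 20% "hot" threshold are all invented). Neuron activation frequencies are profiled offline to split a toy ReLU feed-forward layer into hot and cold neurons, and at inference time only the neurons predicted to be active are computed:

```python
# Illustrative sketch only -- not PowerInfer's actual code. It mimics the two
# ideas described above: skip neurons predicted to stay inactive (sparse
# activation) and split neurons into "hot" (frequently active) and "cold"
# (rarely active) sets based on observed activation frequency.
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, NEURONS = 64, 1024                      # toy sizes, chosen arbitrarily
W_up = rng.standard_normal((HIDDEN, NEURONS))   # hypothetical FFN weights
W_down = rng.standard_normal((NEURONS, HIDDEN))

def ffn_dense(x):
    """Reference path: compute every neuron (ReLU makes the sparsity visible)."""
    return np.maximum(x @ W_up, 0.0) @ W_down

def ffn_sparse(x, predicted_active):
    """Compute only the subset of neurons predicted to be active."""
    idx = np.flatnonzero(predicted_active)
    return np.maximum(x @ W_up[:, idx], 0.0) @ W_down[idx, :]

# Offline profiling: how often does each neuron actually fire on sample inputs?
calib = rng.standard_normal((512, HIDDEN))
freq = (calib @ W_up > 0).mean(axis=0)
hot = freq >= np.quantile(freq, 0.8)            # top ~20% -> keep resident on the GPU
cold = ~hot                                     # the rest -> served from CPU/RAM on demand

# Online inference: a real engine uses a small learned predictor per layer;
# here we "predict" by peeking at the pre-activations, so the sparse path
# matches the dense one.
x = rng.standard_normal(HIDDEN)
active = (x @ W_up) > 0
print("active neurons:", int(active.sum()), "of", NEURONS)
print("hot among active:", int((active & hot).sum()),
      "| cold among active:", int((active & cold).sum()))
print("max abs error vs dense:", float(np.max(np.abs(ffn_dense(x) - ffn_sparse(x, active)))))
```

In PowerInfer itself, small per-layer predictors estimate which neurons will fire and the hot neurons are preloaded into GPU memory; the sketch only shows why skipping inactive neurons leaves the output essentially unchanged when the activation function is ReLU.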
Here's an interesting article on sparse activation: https://medium.com/geekculture/sparse-weight-activation-training-reduce-memory-and-training-time-in-machine-learning-8c0fad7d5def
Implementing these concepts significantly boosts the inference engine's performance while keeping quality degradation to a minimum.
If you compile PowerInfer with GPU support enabled, the engine distributes the workload between the CPU and GPU to get the best possible performance.
The same goes for memory: the engine combines RAM and VRAM to load models that would not fit in VRAM alone. Naturally, the more VRAM available, the better the inference performance, and an option is available to cap VRAM usage if needed.
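As a rough illustration of how a VRAM cap can drive that CPU/GPU split, here is a small hypothetical Python sketch (the byte counts, budget values, and function names are made up; the real engine decides placement with an offline profiling and planning step): the hottest neurons are assigned to VRAM until the budget is exhausted, and the rest stay in system RAM for the CPU to handle.

```python
# Illustrative sketch only -- a toy placement policy, not PowerInfer's API or
# its actual option names. The hottest neurons get VRAM first, until a
# user-defined budget runs out; everything else stays in system RAM.
import numpy as np

rng = np.random.default_rng(1)
NEURONS = 1024
BYTES_PER_NEURON = 2 * 64 * 4            # up- and down-projection rows, fp32, toy sizes
freq = rng.beta(0.5, 4.0, NEURONS)       # skewed "activation frequency" profile

def place_neurons(freq, vram_budget_bytes):
    """Return a boolean mask: True = neuron weights live in VRAM, False = RAM."""
    in_vram = np.zeros(NEURONS, dtype=bool)
    remaining = vram_budget_bytes
    for i in np.argsort(-freq):          # hottest neurons first
        if remaining < BYTES_PER_NEURON:
            break
        in_vram[i] = True
        remaining -= BYTES_PER_NEURON
    return in_vram

for budget_mib in (0.05, 0.2, 0.5):
    in_vram = place_neurons(freq, budget_mib * 1024**2)
    print(f"VRAM budget {budget_mib:4.2f} MiB -> {int(in_vram.sum()):4d}/{NEURONS} neurons on the GPU")
```

The placement only affects latency, not the model's output, which is why more VRAM translates into better performance rather than better quality.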
This inference engine is currently compatible with two model families:
Falcon-40B
Llama2 (7B, 13B, and 70B)
The team has announced that support for the Mistral-7B model is coming in the near future.
PowerInfer has been tested and is supported on the following configurations:
x86-64 CPU and NVIDIA GPU, on Linux and Windows
Apple M-series CPU on macOS (functional, but not yet optimized for the Metal API)
For my part, I tested it on Intel Xeon E5-2650 v2 processors, which lack AVX2 instructions, and the models appeared to infer correctly, with no errors raised.
In my opinion, this inference engine looks promising, given the goals it was designed for: enabling easy and efficient local inference on consumer-grade hardware.
This is just the beginning of the project. Updates on its progress are available here: https://github.com/orgs/SJTU-IPADS/projects/2/views/2
As more features are developed, inference engines like this could help democratize local inference on personal machines.
To go further, if you enjoy research papers, here's the PowerInfer whitepaper: https://arxiv.org/abs/2312.12456