PowerInfer is an LLM inference engine designed to maximize model inference performance on consumer-grade hardware.
This article is a quick overview. If you want to learn more, here's the GitHub repository explaining how to use the engine: https://github.com/SJTU-IPADS/PowerInfer
This engine claims extraordinary performance: roughly ten times faster than llama.cpp.
But how does it achieve that?
In pursuit of performance, the team behind the engine chose a design based on the concept of locality. Rather than computing every neuron and every connection on every pass, the engine skips those that are unlikely to matter, based on how they are actually used. It relies on two ideas: "sparse activation," the observation that only a small fraction of neurons meaningfully contribute to any given inference step, and "hot/cold" neurons, a split based on how frequently each neuron activates, which is used to predict which neurons will be active.
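To make this more concrete, here is a minimal NumPy sketch of the idea (an illustration only, not code from PowerInfer; the layer sizes, the weight names W_up/W_down, and the 20% "hot" threshold are all invented). Neuron activation frequencies are profiled offline to split a toy ReLU feed-forward layer into hot and cold neurons, and at inference time only the neurons predicted to be active are computed:

```python
# Illustrative sketch only -- not PowerInfer's actual code. It mimics the two
# ideas described above: skip neurons predicted to stay inactive (sparse
# activation) and split neurons into "hot" (frequently active) and "cold"
# (rarely active) sets based on observed activation frequency.
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, NEURONS = 64, 1024                      # toy sizes, chosen arbitrarily
W_up = rng.standard_normal((HIDDEN, NEURONS))   # hypothetical FFN weights
W_down = rng.standard_normal((NEURONS, HIDDEN))

def ffn_dense(x):
    """Reference path: compute every neuron (ReLU makes the sparsity visible)."""
    return np.maximum(x @ W_up, 0.0) @ W_down

def ffn_sparse(x, predicted_active):
    """Compute only the subset of neurons predicted to be active."""
    idx = np.flatnonzero(predicted_active)
    return np.maximum(x @ W_up[:, idx], 0.0) @ W_down[idx, :]

# Offline profiling: how often does each neuron actually fire on sample inputs?
calib = rng.standard_normal((512, HIDDEN))
freq = (calib @ W_up > 0).mean(axis=0)
hot = freq >= np.quantile(freq, 0.8)            # top ~20% -> keep resident on the GPU
cold = ~hot                                     # the rest -> served from CPU/RAM on demand

# Online inference: a real engine uses a small learned predictor per layer;
# here we "predict" by peeking at the pre-activations, so the sparse path
# matches the dense one.
x = rng.standard_normal(HIDDEN)
active = (x @ W_up) > 0
print("active neurons:", int(active.sum()), "of", NEURONS)
print("hot among active:", int((active & hot).sum()),
      "| cold among active:", int((active & cold).sum()))
print("max abs error vs dense:", float(np.max(np.abs(ffn_dense(x) - ffn_sparse(x, active)))))
```

In PowerInfer itself, small per-layer predictors estimate which neurons will fire and the hot neurons are preloaded into GPU memory; the sketch only shows why skipping inactive neurons leaves the output essentially unchanged when the activation function is ReLU.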
Here's an interesting article on sparse activation: https://medium.com/geekculture/sparse-weight-activation-training-reduce-memory-and-training-time-in-machine-learning-8c0fad7d5def
Implementing these concepts significantly boosts the inference engine's performance while keeping quality degradation to a minimum.
If you compile PowerInfer with GPU support enabled, the engine distributes the workload between the CPU and GPU to get the best possible performance.
The same goes for memory: the engine combines RAM and VRAM to load models that would not fit in VRAM alone. Naturally, the more VRAM available, the better the inference performance, and an option is available to cap VRAM usage if needed.
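As a rough illustration of how a VRAM cap can drive that CPU/GPU split, here is a small hypothetical Python sketch (the byte counts, budget values, and function names are made up; the real engine decides placement with an offline profiling and planning step): the hottest neurons are assigned to VRAM until the budget is exhausted, and the rest stay in system RAM for the CPU to handle.

```python
# Illustrative sketch only -- a toy placement policy, not PowerInfer's API or
# its actual option names. The hottest neurons get VRAM first, until a
# user-defined budget runs out; everything else stays in system RAM.
import numpy as np

rng = np.random.default_rng(1)
NEURONS = 1024
BYTES_PER_NEURON = 2 * 64 * 4            # up- and down-projection rows, fp32, toy sizes
freq = rng.beta(0.5, 4.0, NEURONS)       # skewed "activation frequency" profile

def place_neurons(freq, vram_budget_bytes):
    """Return a boolean mask: True = neuron weights live in VRAM, False = RAM."""
    in_vram = np.zeros(NEURONS, dtype=bool)
    remaining = vram_budget_bytes
    for i in np.argsort(-freq):          # hottest neurons first
        if remaining < BYTES_PER_NEURON:
            break
        in_vram[i] = True
        remaining -= BYTES_PER_NEURON
    return in_vram

for budget_mib in (0.05, 0.2, 0.5):
    in_vram = place_neurons(freq, budget_mib * 1024**2)
    print(f"VRAM budget {budget_mib:4.2f} MiB -> {int(in_vram.sum()):4d}/{NEURONS} neurons on the GPU")
```

The placement only affects latency, not the model's output, which is why more VRAM translates into better performance rather than better quality.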
This inference engine is currently compatible with two model families:
Falcon-40B
Llama2 (7B, 13B, and 70B)
The team has announced that support for the Mistral-7B model is coming in the near future.
PowerInfer has been tested and is supported on the following configurations:
x86-64 CPU and NVIDIA GPU, on Linux and Windows
Apple M-series CPU on macOS (functional, but not yet optimized for the Metal API)
For my part, I tested it on Intel Xeon E5-2650 v2 processors, which lack AVX2 instructions, and the models appeared to infer correctly, with no errors raised.
In my opinion, this inference engine looks promising, given the goals it was designed for: enabling easy and efficient local inference on consumer-grade hardware.
This is just the beginning of the project. Updates on its progress are available here: https://github.com/orgs/SJTU-IPADS/projects/2/views/2
As more features are developed, inference engines like this could help democratize local inference on personal machines.
To go further, if you enjoy research papers, here's the PowerInfer whitepaper: https://arxiv.org/abs/2312.12456