Questions on my mind

On model architectures <> hardware, the market for consumer GPU inference, world models and more.

Apr 22, 2025

A set of questions that have been on my mind lately - get in touch if you have thoughts!

Local AI

What is the implication of the adoption of GPT style models for image and video generation ala gpt-4o on the consumer hardware market?

GPT style LLMs are typically memory bandwidth bound and image/video diffusion models are typically compute bound. As a result, Apple Silicon, which has plenty of unified memory and bandwidth, can be a competitive solution for the former, but falls short for the latter. Could a shift towards GPT-style image/video models and a memory bandwidth bound regime make Apple Silicon a more attractive deployment target?

Can hardware distribution amongst consumers and support for a CUDA backend to enable transferring models seamlessly between Nvidia GPUs and Apple Silicon (and day 0 support of new models from researchers) allow MLX to eat away at Nvidia's advantage for consumer GPU inference?

Will generative world models (i.e. generate and manipulate 3D digital world in real-time) also make their way to consumer devices?

At first glance, this might seem unlikely given the amount of compute that video generation already demands which is likely even greater for a world model. But…

In the long term, it feels counterintuitive for the dominant regime for using a generative world model to involve compressing world knowledge into a set of weights only to decompress from latent space to video frames, compress into a format that can be sent across the network to be downloaded on a potentially flaky home Internet connection and decompressed for playback on the user's device. This might be even more costly and operationally burdensome than cloud gaming infra! In an ideal world, the weights would be on the user's device and there would be a single decompression step from latent space to video frames on the device bypassing the additional forms of compression/decompression that are required to get video onto the user’s device. The argument against this today is that the model inference cannot be efficiently run on consumer devices. But, up until last week, it also didn’t seem likely that decent video generation would be in reach for most consumers devices and now folks are generating pretty good quality videos (relative to the current SOTA in the cloud) on laptops. Additionally, consumer compute has historically improved at a faster rate than bandwidth and if that trend continues I’d expect the economics of running world models locally to become more attractive over time.

What are the biggest bottlenecks for improving the quality of small models? Are more experiments with quantization (and things like quantization aware training as demonstrated by Google with the latest Gemma 3 release), distillation enough or do we need something else?

If you can reasonably predict expert popularity for a MoE, could a viable strategy to use larger models on resource constrained devices (i.e. mobile phones) be to load as many popular experts as possible on-device and offload the remaining popular experts into the cloud with the expectation that the router will end up keeping many requests on-device?

The hardware utilization of local inference is poor in a chatbot context because when you serve a single user the batch size is 1 and you don’t get to take advantage of parallelization. In an async agent context, is this no longer a con if you split a task into many small sub-tasks that can be parallelized in a batch with size > 1 with a local model (similar to what is shown in the Minions protocol)?

Will model routing and collaborative protocols (i.e. Minions) that use local and cloud models together be important infrastructure or will increases in quality of small models eventually make them unnecessary?

How would personalization and memory for local-first AI apps work if most inference queries do not leave the user's device?

What types of applications only make sense if model inference is unmetered?

What is the analogy for a game engine like Unreal/Unity in the local AI ecosystem?

What is the analogy for Steam in the local AI ecosystem?

Agents

In an agent-to-agent handoff how does agent A give agent B access to resources required for a delegated task?

Will most agent-to-agent transactions happen inside or outside of firm boundaries?

If many agent-to-agent transactions happen outside of firm boundaries how do you solve trust and security issues like prompt injection attacks (CaMeL seems to have some useful ideas here)?

Distilled Context

Discussion about this post