LLMs Everywhere: Running 70B models in browsers and iPhones using MLC — with Tianqi Chen of CMU / OctoML
Description
We have just announced our first set of speakers at AI Engineer Summit! Sign up for the livestream or email sponsors@ai.engineer if you’d like to support.
We are facing a massive GPU crunch. As both startups and VCs hoard Nvidia GPUs the way countries stockpile nuclear warheads, tweets about GPU shortages have become increasingly common.
But what if we could run LLMs with AMD cards, or without a GPU at all? There’s just one weird trick: compilation. And there’s one person uniquely qualified to do it.
We had the pleasure of sitting down with Tianqi Chen, an Assistant Professor at CMU, where he both teaches the MLC course and runs the MLC group. You might also know him as the creator of XGBoost, Apache TVM, and MXNet, and as the co-founder of OctoML.
The MLC (short for Machine Learning Compilation) group has released a lot of interesting projects:
* MLC Chat: an iPhone app that lets you run models like RedPajama-3B and Vicuna-7B on-device. It gets up to 30 tok/s!
* Web LLM: Run models like LLaMA-70B in your browser (!!) to offer local inference in your product.
* MLC LLM: a framework that allows any language model to be deployed natively across different hardware and software stacks.
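To get a feel for why a 3B or 7B model fits on a phone while a 70B model needs a browser tab on a well-equipped machine, here is a back-of-the-envelope memory estimate. The 4-bit quantization and the ~10% overhead figure are our assumptions for illustration, not numbers from the episode:

```python
# Rough memory footprint of quantized model weights (illustrative only).
# Assumptions: 4 bits per parameter, plus ~10% overhead for quantization
# scales and runtime buffers.

def approx_weight_gb(n_params_billion: float, bits: int = 4, overhead: float = 0.10) -> float:
    """Approximate on-device memory needed for model weights, in GB."""
    bytes_total = n_params_billion * 1e9 * bits / 8
    return bytes_total * (1 + overhead) / 1e9

for name, n in [("RedPajama-3B", 3), ("Vicuna-7B", 7), ("LLaMA-70B", 70)]:
    print(f"{name}: ~{approx_weight_gb(n):.1f} GB at 4-bit")
```

By this estimate, a 3B model needs under 2 GB (comfortably within an iPhone's memory budget), while a 70B model needs on the order of 38 GB, which is why it targets desktops and browsers on high-memory machines rather than phones.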
The MLC group has just announced new support for AMD cards; we previously talked about the shortcomings of ROCm, but using MLC you can get performance very close to the NVIDIA counterparts. This is great news for founders and builders, as AMD cards are more readily available. Here are their latest results on AMD's 7900s vs some of the top NVIDIA consumer cards.
If you just can’t get a GPU at all, MLC LLM also supports ARM and x86 CPU architectures as targets by leveraging LLVM. While the speed isn’t comparable to GPU inference, it allows non-time-sensitive inference to run on commodity hardware.
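Much of what a compiler like this buys you comes from transformations such as operator fusion: collapsing a chain of tensor operations into a single loop so the CPU or GPU never materializes intermediate buffers. This toy pure-Python sketch (not MLC code) illustrates the idea:

```python
# Toy illustration of operator fusion, one of the transformations an
# ML compiler applies when lowering a model to CPU/GPU code.

def scale_then_shift_unfused(xs, a, b):
    # Two separate "ops": two passes over the data, with an
    # intermediate list allocated in between.
    scaled = [a * x for x in xs]
    return [s + b for s in scaled]

def scale_then_shift_fused(xs, a, b):
    # The fused version: one pass, no intermediate buffer.
    return [a * x + b for x in xs]

data = [1.0, 2.0, 3.0]
assert scale_then_shift_unfused(data, 2.0, 1.0) == scale_then_shift_fused(data, 2.0, 1.0)
print(scale_then_shift_fused(data, 2.0, 1.0))  # [3.0, 5.0, 7.0]
```

The two functions compute the same result; the fused one simply avoids a memory round-trip, which is where much of the speedup comes from on real hardware.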
We also enjoyed getting a peek into TQ’s process, which involves a lot of sketching.
With all the other work going on in this space, from projects like ggml and Ollama, we’re excited to see GPU access become less and less of a barrier to getting models into more people’s hands, and to see innovative software solutions to hardware problems!
Show Notes
* TQ’s Projects:
* XGBoost
* MXNet
* MLC