Learn how to use GenAI locally on an offline device using WebLLM and Blazor WebAssembly.
Generative AI (GenAI) and large language models (LLMs) have been the hottest topics of 2024. GenAI tools use LLMs to generate content using natural language processing. Using these tools is seamless and easy for the end user, and leads to an overall better user experience for completing tasks. However, running GenAI involves significant computational resources, security concerns and other trade-offs. In this article, you’ll learn to use GenAI locally on an offline device using WebLLM and Blazor WebAssembly.
This post is part of the annual C# Advent calendar—the time of year where C# developers get together to post 50 articles in 25 days! Snippets included in this post are also part of the Blazor Holiday Calendar, where 25 interactive snippets are shared throughout the holiday month.
Providing an LLM locally allows the application to utilize GenAI without the need for external cloud services. The reasons for moving LLM operations to a local device depend on the needs of your application, and the approach comes with a range of trade-offs, including privacy, cost, latency and the ability to work offline.
All of these concerns can be addressed with WebLLM, Blazor WebAssembly and a little setup.
WebLLM is a high-performance, in-browser language model inference engine. It leverages WebGPU for hardware acceleration, enabling powerful language model operations directly within web browsers without the need for server-side processing. This means you can run large language models (LLMs) entirely in the browser, which can reduce latency and enhance privacy since no data is sent to external servers.
It’s important before continuing to understand the current limitations of WebLLM. Since WebLLM utilizes WebGPU, browser support varies. At the time of writing, even though Apple is well positioned with processing power for LLMs, it fails to implement WebGPU in the Safari browser. This means WebLLM cannot be used on iOS-based devices.
WebGPU-enabled browsers may also require additional setup to take advantage of the device’s GPU. Windows 11 can incorrectly prioritize the CPU over the GPU on some laptops, so additional steps may be required to set the correct priority; otherwise, the user may encounter poor performance.
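Because support varies, it can help to test for WebGPU before attempting to load a model. The following is a minimal sketch (the function name is just an example) that could live in any JavaScript module in your app and be called ahead of initialization:
// Minimal WebGPU capability check (sketch). navigator.gpu is only defined
// in browsers that implement WebGPU, so this fails fast on Safari/iOS.
export async function isWebGpuAvailable() {
    if (!("gpu" in navigator)) {
        return false;
    }
    // requestAdapter() can still resolve to null on unsupported hardware.
    const adapter = await navigator.gpu.requestAdapter();
    return adapter !== null;
}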
WebLLM’s in-browser inference allows LLMs to run directly in the browser, using WebGPU for efficient performance. It offers full OpenAI API compatibility, seamlessly integrating functionality like JSON mode, function calling and streaming. It supports a wide range of models, including Llama, Phi, Gemma, Mistral and more, and it facilitates custom model integration, enabling easy deployment and use of tailored models.
Like most browser-based technologies including Blazor itself, some JavaScript bootstrapping is required to get things going. Throughout the remainder of this article, you’ll learn the steps required to get WebLLM working in the browser. In this example, an empty Blazor app will be used as the base application.
Let’s begin by adding WebLLM to a Blazor application in the simplest way possible. From here we can validate that the implementation is working and increase the abstraction level as needed. The first step will be to see if we can boot WebLLM from our Blazor application using its index.html page. This will introduce you to the WebLLM JavaScript module and its initProgressCallback.
Add a new JavaScript file to the application’s wwwroot folder named webllm-interop.js. The file will eventually hold the interoperability code used to communicate between the Blazor framework and WebLLM.
Next, you’ll need to import the WebLLM module. In this scenario, we’ll assume that this is the only JavaScript in the application, and rather than adding the complexity of npm, we’ll fetch the module directly from a CDN. In webllm-interop.js, the module is imported.
import * as webllm from "https://esm.run/@mlc-ai/web-llm";
With the module added, WebLLM is initialized by creating an instance of an engine. When calling CreateMLCEngine, a callback is used to capture the model loading progress. To display the progress, a simple console.log will be used to verify the code is working. In addition, the selectedModel value specifies the desired LLM to use; this value can be used to change the underlying LLM. The current selectedModel, Llama-3.2-1B-Instruct-q4f16_1-MLC, is a smaller model that requires less loading time. Beware, some models can require several GB of data transfer.
// Callback function to update model loading progress
const initProgressCallback = (initProgress) => {
console.log(initProgress);
}
const selectedModel = "Llama-3.2-1B-Instruct-q4f16_1-MLC";
const engine = await webllm.CreateMLCEngine(
selectedModel,
{ initProgressCallback: initProgressCallback }, // engineConfig
);
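If you want to browse other valid selectedModel values, recent versions of the web-llm package expose a prebuilt app config listing the built-in model IDs. Treat the property name below as an assumption to verify against the version you install:
// Log the built-in model IDs so you can pick a selectedModel value.
// Assumes webllm.prebuiltAppConfig.model_list exists in your web-llm version.
console.log(webllm.prebuiltAppConfig.model_list.map(m => m.model_id));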
As a temporary check, add the script to the index.html file. This will allow the module to be tested to check that the CDN and initialization code are working properly. In later steps, you’ll use a dynamic import with Blazor’s interop to fetch the module in place of this code.
<head>
...
<!-- Temporary, we'll dynamically import this later-->
<script type="module" src="webllm-interop.js"></script>
</head>
After saving the index.html file, run the Blazor application. Opening the browser’s developer tools will show the initialization process within the console’s output window. From this window, expand one of the JSON objects and observe its properties. The object will need to be replicated in C# later, so make sure to copy an instance of the object somewhere for later use.
{
"progress": 0.1661029919791963,
"timeElapsed": 7,
"text": "Fetching param cache[19/108]: 716MB fetched. 16% completed, 7 secs elapsed. It can take a while when we first visit this page to populate the cache. Later refreshes will become faster."
}
So far you’ve updated the index.html file to include a new script tag for webllm-interop.js. This script was temporarily added and will be dynamically imported later. Next, you created the webllm-interop.js file. In this file, you imported the web-llm module, set up a callback to log the progress of the model loading, and initialized an MLCEngine with the model.
Now that your application is booting up, you can begin integrating with Blazor using the JavaScript interop.
Blazor’s interop APIs allow the application to communicate with browser’s JavaScript instance. This means we can take advantage of the browser’s dynamic JavaScript module loading API and import the required JavaScript when it’s needed by the application.
Start by creating a new service class named WebLLMService in the Blazor application. The service will provide an abstraction layer between the individual components in the Blazor app and the JavaScript module. In WebLLMService, a reference to the browser’s JavaScript instance is captured. Because of Blazor’s boot sequence, we’ll ensure the module is loaded only once JavaScript is available and the module is actually required. This is done through lazy loading, using a thread-safe Lazy wrapper around IJSObjectReference. When the WebLLMService instance is created, the import is deferred; the module is fetched the first time the Lazy value is awaited, once the IJSRuntime is ready.
private readonly Lazy<Task<IJSObjectReference>> moduleTask;
private const string ModulePath = "./webllm-interop.js";
public WebLLMService(IJSRuntime jsRuntime)
{
moduleTask = new(() => jsRuntime.InvokeAsync<IJSObjectReference>(
"import", $"{ModulePath}").AsTask());
}
Next, the WebLLMService is added to the Blazor application’s service collection. The service collection is used to create an instance of WebLLMService and resolve references to the instance when a component requests it. In Program.cs, register the WebLLMService as follows.
builder.Services.AddScoped<WebLLMService>();
await builder.Build().RunAsync();
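For context, in the default Blazor WebAssembly template this registration sits alongside the root component setup. The following is a sketch assuming that standard template; your namespaces and root components may differ:
using Microsoft.AspNetCore.Components.Web;
using Microsoft.AspNetCore.Components.WebAssembly.Hosting;

var builder = WebAssemblyHostBuilder.CreateDefault(args);
builder.RootComponents.Add<App>("#app");
builder.RootComponents.Add<HeadOutlet>("head::after");

// In Blazor WebAssembly, a scoped service effectively lives for the lifetime of the app.
builder.Services.AddScoped<WebLLMService>();

await builder.Build().RunAsync();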
With the WebLLMService added to the application and the dynamic module import in place, the temporary import in index.html can be removed.
<head>
...
- <!-- Temporary, we'll dynamically import this later-->
- <script type="module" src="webllm-interop.js"></script>
</head>
Thus far, you added the WebLLMService class in WebLLMService.cs to lazily load webllm-interop.js using IJSRuntime. Additionally, you registered WebLLMService in Program.cs and removed the temporary script import for webllm-interop.js from index.html. At this point the application should run as it did before. Check the console log messages to ensure the initialization process is still working.
Now that the app is capable of calling JavaScript code from Blazor, the progress callback needs to push messages back to the application.
Now you’ll need to establish two-way communication between webllm-interop.js and WebLLMService. This is accomplished through Blazor’s invokeMethodAsync (JavaScript) and JSInvokable (C#) APIs.
Start by updating webllm-interop.js. Create a variable named engine to hold the instance returned by CreateMLCEngine. To make the initialize function available so it’s callable from Blazor, add the export keyword to initialize.
When initialize is called, we’ll also need to capture a reference to the calling .NET instance; in this case, it will be an instance of WebLLMService passed in through an argument named dotnet. Store the instance in a module-level variable named dotnetInstance; this will be used by other functions to invoke methods on WebLLMService. In addition, remove the const selectedModel, since the selectedModel argument can be used to set the model when initialize is called from WebLLMService.
import * as webllm from "https://esm.run/@mlc-ai/web-llm";
var engine; // <-- hold a reference to MLCEngine in the module
var dotnetInstance; // <-- hold a reference to the WebLLMService instance in the module
- //const selectedModel = "Llama-3.2-1B-Instruct-q4f16_1-MLC";
const initProgressCallback = (initProgress) => { ... }
+ export async function initialize(selectedModel, dotnet) {
+ dotnetInstance = dotnet; // <-- WebLLMService instance
- // const engine = await webllm.CreateMLCEngine(
engine = await webllm.CreateMLCEngine(
selectedModel,
{ initProgressCallback: initProgressCallback }, // engineConfig
);
}
Next, update initProgressCallback so it invokes a method on the .NET instance. The .NET method, OnInitializing, will be created in the next steps. In the following code, the initProgressCallback method uses invokeMethodAsync on the dotnetInstance object to call OnInitializing, passing in the initProgress JSON.
// Callback function to update model loading progress
const initProgressCallback = (initProgress) => {
- // console.log(initProgress);
// Make a call to .NET with the updated status
+ dotnetInstance.invokeMethodAsync("OnInitializing", initProgress);
}
When the callback is made, Blazor will need to serialize the initProgress data to an object in C#. A record named InitProgress is created in a file named ChatModels.cs. The InitProgress record is a C# representation of the object used by WebLLM to indicate progress.
public record InitProgress(float Progress, string Text, double TimeElapsed);
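Blazor’s JavaScript interop serializes interop payloads with camelCase property names, so the progress, text and timeElapsed values map onto the PascalCase record properties automatically. If you want to sanity-check the mapping outside of Blazor, the following System.Text.Json snippet is a small illustration (the JSON values here are made up, not captured output):
using System.Text.Json;

// Illustrative payload shaped like the initProgress object logged earlier.
var json = """{ "progress": 0.16, "timeElapsed": 7, "text": "Fetching param cache..." }""";

// Case-insensitive matching mimics what the interop serializer does for us.
var status = JsonSerializer.Deserialize<InitProgress>(json,
    new JsonSerializerOptions { PropertyNameCaseInsensitive = true });

Console.WriteLine($"{status!.Progress:P0} after {status.TimeElapsed} secs");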
Next, you will update WebLLMService to initialize WebLLM and receive a progress callback. In WebLLMService, the selectedModel is added as a field so the model can be set from Blazor. Then an InitializeAsync method is added; this method invokes the initialize function in JavaScript via the interop API’s InvokeVoidAsync method. When InvokeVoidAsync is called, a DotNetObjectReference is passed to JavaScript so callbacks can be invoked on the source instance.
using Microsoft.JSInterop;
public class WebLLMService
{
private readonly Lazy<Task<IJSObjectReference>> moduleTask;
private const string ModulePath = "./webllm-interop.js";
public WebLLMService(IJSRuntime jsRuntime) { ... }
+ private string selectedModel = "Llama-3.2-1B-Instruct-q4f16_1-MLC";
+ public async Task InitializeAsync()
+ {
+ var module = await moduleTask.Value;
+ await module.InvokeVoidAsync("initialize", selectedModel, DotNetObjectReference.Create(this));
+ // Calls webllm-interop.js initialize (selectedModel, dotnet )
+ }
Receiving callbacks through the JavaScript interop is done by adding an event that performs actions when the callback occurs. You’ll use an OnInitializingChanged event to communicate the progress carried in the InitProgress argument. A method named OnInitializing is created with the JSInvokable attribute, making it callable from the JavaScript module. When invoked, OnInitializing simply raises the OnInitializingChanged event and passes along the arguments received from JavaScript.
+ public event Action<InitProgress>? OnInitializingChanged;
+ // Called from JavaScript
+ // dotnetInstance.invokeMethodAsync("OnInitializing", initProgress);
+ [JSInvokable]
+ public Task OnInitializing(InitProgress status)
+ {
+ OnInitializingChanged?.Invoke(status);
+ return Task.CompletedTask;
+ }
}
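Because the service lazily imports a JavaScript module, it’s also good practice to release that module when the service is no longer needed. The snippet below is an optional sketch, not part of the original steps: it assumes WebLLMService is changed to implement IAsyncDisposable.
// Optional cleanup sketch (assumes the class declaration becomes
// "public class WebLLMService : IAsyncDisposable").
public async ValueTask DisposeAsync()
{
    if (moduleTask.IsValueCreated)
    {
        var module = await moduleTask.Value;
        await module.DisposeAsync();
    }
}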
With two-way communication established between WebLLMService and JavaScript, you’ll update the user interface (UI) of the Home page to show the initialization progress.
To begin, the WebLLMService is injected, making an instance of the service available to the page. Then, a progress field is added to hold and display the current status of the initialization process. When the Home component initializes, it subscribes a delegate method, OnWebLLMInitialization, to the OnInitializingChanged event.
Next, InitializeAsync is called on the service to start the initialization process. When OnInitializingChanged is triggered, the OnWebLLMInitialization method is called, updating the progress variable and signaling StateHasChanged to display the new information.
+ @inject WebLLMService llm
<PageTitle>Home</PageTitle>
<h1>Hello, world!</h1>
Welcome to your new app.
+ Loading: @progress
+ @code {
+ InitProgress? progress;
+
+ protected override async Task OnInitializedAsync()
+ {
+ llm.OnInitializingChanged += OnWebLLMInitialization;
+ try
+ {
+ await llm.InitializeAsync();
+ }
+ catch (Exception e)
+ {
+ // Potential errors: No browser support for WebGPU
+ Console.WriteLine(e.Message);
+ throw;
+ }
+ }
+
+ private void OnWebLLMInitialization(InitProgress p)
+ {
+ progress = p;
+ StateHasChanged();
+ }
+ }
A live example can be seen in the following Blazor REPL:
Using this process, you added a loading display for WebLLM initialization. The WebLLMService and its corresponding JavaScript were enhanced to invoke round-trip events to communicate progress updates. Adding conversational elements to the application will follow this pattern.
In order to have a conversation with the LLM, the application will need to send messages to the model and receive responses using the interop. Before you can create these interactions, you’ll need additional data transfer objects (or DTOs) to bridge the gap between JavaScript and C# code. These DTOs are added to the existing ChatModels.cs file; each DTO is represented by a record.
// A chat message
public record Message(string Role, string Content);
// A partial chat message
public record Delta(string Role, string Content);
// Chat message "cost"
public record Usage(double CompletionTokens, double PromptTokens, double TotalTokens);
// A single choice containing a partial chat message
public record Choice(int Index, Message? Delta, string Logprobs, string FinishReason);
// A chat stream response
public record WebLLMCompletion(
string Id,
string Object,
string Model,
string SystemFingerprint,
Choice[]? Choices,
Usage? Usage
)
{
// The final part of a chat message stream will include Usage
public bool IsStreamComplete => Usage is not null;
}
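To see why the records are shaped this way, it helps to picture the chunks WebLLM streams back. Because WebLLM mirrors the OpenAI chat completions format, each chunk looks roughly like the following; this example is illustrative with made-up values, and only the final chunk carries a usage object:
{
  "id": "chatcmpl-123",
  "object": "chat.completion.chunk",
  "model": "Llama-3.2-1B-Instruct-q4f16_1-MLC",
  "system_fingerprint": "",
  "choices": [
    { "index": 0, "delta": { "role": "assistant", "content": "Hello" }, "logprobs": null, "finish_reason": null }
  ],
  "usage": null
}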
Next, the webllm-interop.js module is updated to add streaming chat completions. A function named completeStream is added as an export, making it available to the Blazor application. The completeStream function calls the MLCEngine object’s chat.completions.create function.
When invoking chat.completions.create, the stream and include_usage options are set to true. These options request the chat response in a streaming format and indicate when the response is complete by including usage statistics. Each streamed chunk includes the message’s role, delta and other metadata; on the Blazor side it will be represented by a WebLLMCompletion. As each chunk is generated, it is returned to Blazor through an interop callback to the ReceiveChunkCompletion method.
import * as webllm from "https://esm.run/@mlc-ai/web-llm";
var engine; // <-- hold a reference to MLCEngine in the module
var dotnetInstance; // <-- hold a reference to the WebLLMService instance in the module
const initProgressCallback = (initProgress) => { ... }
export async function initialize(selectedModel, dotnet) { ... }
+ export async function completeStream(messages) {
+ // Chunks is an AsyncGenerator object
+ const chunks = await engine.chat.completions.create({
+ messages,
+ temperature: 1,
+ stream: true, // <-- Enable streaming
+ stream_options: { include_usage: true },
+ });
+
+ for await (const chunk of chunks) {
+ //console.log(chunk);
+ await dotnetInstance.invokeMethodAsync("ReceiveChunkCompletion", chunk);
+ }
+ }
With the JavaScript updated, the corresponding interop methods can be added to WebLLMService. In WebLLMService, you’ll need to create a method named CompleteStreamAsync that takes a collection of messages and invokes the completeStream JavaScript function. This starts generating chunks, which are sent back to the corresponding ReceiveChunkCompletion method as its response argument. Similar to the initialization process, an event associated with the callback is raised so that a delegate can be assigned to it. The OnChunkCompletion event will fire for each chunk generated until the Usage property is populated, causing IsStreamComplete to return true.
private readonly Lazy<Task<IJSObjectReference>> moduleTask;
private const string ModulePath = "./webllm-interop.js";
private string selectedModel = "Llama-3.2-1B-Instruct-q4f16_1-MLC";
public WebLLMService(IJSRuntime jsRuntime) { ... }
public async Task InitializeAsync() { ... }
public event Action<InitProgress>? OnInitializingChanged;
+ public async Task CompleteStreamAsync(IList<Message> messages)
+ {
+ var module = await moduleTask.Value;
+ await module.InvokeVoidAsync("completeStream", messages);
+ }
+ public event Func<WebLLMCompletion, Task>? OnChunkCompletion;
+ [JSInvokable]
+ public Task ReceiveChunkCompletion(WebLLMCompletion response)
+ {
+ OnChunkCompletion?.Invoke(response);
+ return Task.CompletedTask;
+ }
Finally, the UI is updated in Home to incorporate the chat interface. For a simple chat UI, add a textbox and a button for submitting the chat message. In addition, a loop is used to iterate over the messages sent to and generated by the LLM. You’ll also need a display for the currently streaming text that shows incoming chunks.
+ <h1>WebLLM Chat</h1>
+ <ul>
+ @foreach (var message in messages)
+ {
+     <li>@message.Role: @message.Content</li>
+ }
+ </ul>
+ @if (progress is not null && progress.Text.Contains("Finish loading"))
+ {
+ <div>
+ <input type="text" @bind-value="@Prompt" disabled="@isResponding" />
+ <button @ref="PromptRef" type="submit" @onclick="StreamPromptRequest" disabled="@isResponding">Chat</button>
+ </div>
+ }
+ <p>@streamingText</p>
Loading: @progress
The code for the Home page is updated with fields for the page state, including messages, Prompt, streamingText and a flag for indicating work in progress, isResponding. A method named StreamPromptRequest sets up and starts the chat request via the service’s CompleteStreamAsync method. The service will process the request and trigger OnChunkCompletion for each response, returning a WebLLMCompletion. As each chunk is received, its content is appended to the streamingText string until the last item arrives. When OnChunkCompletion receives the final chunk, the completed message is added to messages and the isResponding flag is reset to false.
@code {
+ string? Prompt { get; set; } = "";
+ ElementReference PromptRef;
+ List<Message> messages = new List<Message>();
+ bool isResponding;
+ string streamingText = "";
InitProgress? progress;
protected override async Task OnInitializedAsync()
{
llm.OnInitializingChanged += OnWebLLMInitialization;
+ llm.OnChunkCompletion += OnChunkCompletion;
try
{
await llm.InitializeAsync();
}
catch (Exception e)
{
// Potential errors: No browser support for WebGPU
Console.WriteLine(e.Message);
throw;
}
}
private void OnWebLLMInitialization(InitProgress p)
{
progress = p;
StateHasChanged();
}
+ private async Task OnChunkCompletion(WebLLMCompletion response)
+ {
+ if (response.IsStreamComplete)
+ {
+ isResponding = false;
+ messages.Add(new Message("assistant", streamingText));
+ streamingText = "";
+ Prompt = "";
+ await PromptRef.FocusAsync();
+ }
+ else
+ {
+ streamingText += response.Choices?.ElementAtOrDefault(0)?.Delta?.Content ?? "";
+ }
+ StateHasChanged();
+ await Task.CompletedTask;
+ }
+ private async Task StreamPromptRequest()
+ {
+ if (string.IsNullOrEmpty(Prompt))
+ {
+ return;
+ }
+ isResponding = true;
+ messages.Add(new Message("user", Prompt));
+ await llm.CompleteStreamAsync(messages);
+ }
}
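If you want the assistant to follow standing instructions throughout the conversation, one option (a sketch, not part of the steps above) is to seed the messages list with a system message before the first prompt is sent, for example in OnInitializedAsync:
// Optional: seed the conversation with a system prompt (sketch).
// The "system" role is part of the OpenAI-style message format WebLLM accepts.
messages.Add(new Message("system", "You are a concise, friendly assistant."));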
The basic components for chat are now complete, and you can carry out a full chat conversation within the browser. This application does not use any server and can run completely offline, incurring no expense from LLM services like OpenAI or Azure.
A live example can be seen in the following Blazor REPL:
In this article, you learned to use WebLLM and Blazor WebAssembly to create a GenAI application. WebLLM is a plausible solution for embedding an LLM in your application to include GenAI. However, using LLMs in the browser does come with some limitations regarding compatibility and initialization time. The use of GenAI in web applications is quickly becoming mainstream. As models become more efficient and devices more performant, embedded or on-device solutions like WebLLM will find their place in modern solutions.
For an upgraded UI, Telerik UI for Blazor and the Telerik Design System include components and themes well suited to GenAI applications. The live example below is a more detailed version that includes a loading progress indicator, chat bubbles and more.
Ed