
Learn how to use GenAI locally on an offline device using WebLLM and Blazor WebAssembly.

Generative AI (GenAI) and large language models (LLMs) have been the hottest topics of 2024. GenAI tools use LLMs to generate content using natural language processing. Using these tools is seamless and easy for the end user, and leads to an overall better user experience for completing tasks. However, running GenAI involves significant computational resources, security concerns and other trade-offs. In this article, you’ll learn to use GenAI locally on an offline device using WebLLM and Blazor WebAssembly.

This post is part of the annual C# Advent calendar, the time of year when C# developers get together to post 50 articles in 25 days! Snippets included in this post are also part of the Blazor Holiday Calendar, where 25 interactive snippets are shared throughout the holiday month.

Providing an LLM locally allows the application to utilize GenAI without the need for external cloud services. The reasons for moving LLM operations to a local device depend on the needs of your application and come with a range of benefits and trade-offs, including:

  • On-device LLM: Enabling powerful LLM operations directly within web browsers without server-side processing.
  • Cost efficiency: Avoids ongoing costs associated with server usage and data transfers.
  • Privacy: Your data stays on your device, reducing the risk of exposure to external servers.
  • Speed: Once the model is downloaded (the initial download is slow), on-device processing can be faster since it eliminates the need to send and receive data over the internet.
  • Offline capability: You can use the LLM without an internet connection, which is great for remote or restricted areas.
  • Customization: Tailor the model to your specific needs and preferences, as it can learn from your unique usage patterns.

All of these needs can be addressed with WebLLM, Blazor WebAssembly and a little setup.

What Is WebLLM?

WebLLM is a high-performance, in-browser language model inference engine. It leverages WebGPU for hardware acceleration, enabling powerful language model operations directly within web browsers without the need for server-side processing. This means you can run large language models (LLMs) entirely in the browser, which can reduce latency and enhance privacy since no data is sent to external servers.

Before continuing, it’s important to understand the current limitations of WebLLM. Since WebLLM utilizes WebGPU, browser support varies. At the time of writing, even though Apple is well positioned with processing power for LLMs, it has not shipped WebGPU in the Safari browser. This means WebLLM cannot be used on iOS-based devices.


WebGPU-enabled browsers may also require additional setup to take advantage of the device’s GPU. Windows 11 can incorrectly prioritize the CPU over the GPU on some laptops. Additional steps may be required to set the correct priority, or the user may encounter poor performance.
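
Because support varies, it can be worth feature-detecting WebGPU before attempting to download a model, and showing a friendly message when it’s missing. The following is a minimal sketch using the standard navigator.gpu API; the helper names are hypothetical and not part of the article’s sample code, so adapt them as needed.

// Hypothetical helpers to feature-detect WebGPU before initializing WebLLM
export function isWebGpuSupported() {
    // navigator.gpu is only defined in WebGPU-enabled browsers
    return "gpu" in navigator;
}

export async function hasGpuAdapter() {
    // Confirms a usable GPU adapter can actually be acquired
    if (!("gpu" in navigator)) return false;
    const adapter = await navigator.gpu.requestAdapter();
    return adapter !== null;
}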

In-browser inference allows LLMs to run directly in the browser, using WebGPU for efficient performance. WebLLM offers full OpenAI API compatibility, seamlessly integrating functionality like JSON mode, function calling and streaming. It supports a wide range of models, including Llama, Phi, Gemma, Mistral and more, and it facilitates custom model integration, enabling easy deployment and use of tailored models.
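
To get a feel for that OpenAI-style API, here is a minimal, non-streaming sketch of a chat completion. It assumes an engine instance created with webllm.CreateMLCEngine (which you’ll set up in the following sections), and the prompt text is purely illustrative.

// Minimal non-streaming chat request against WebLLM's OpenAI-compatible API
// (assumes `engine` was created with webllm.CreateMLCEngine, shown later)
const reply = await engine.chat.completions.create({
    messages: [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: "Say hello in five words." },
    ],
});

console.log(reply.choices[0].message.content); // the generated text
console.log(reply.usage);                      // token usage statistics, when provided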

Like most browser-based technologies including Blazor itself, some JavaScript bootstrapping is required to get things going. Throughout the remainder of this article, you’ll learn the steps required to get WebLLM working in the browser. In this example, an empty Blazor app will be used as the base application.

Some JavaScript Assembly Required

Let’s begin by adding WebLLM to a Blazor application in the simplest way possible. From here we can validate that the implementation is working and increase the abstraction level as needed. The first step will be to see if we can boot WebLLM from our Blazor application using its index.html page. This will introduce you to the WebLLM JavaScript module and its initProgressCallback.

Add a new JavaScript file to the application’s wwwroot folder named webllm-interop.js. The file will eventually hold the interoperability code used to communicate between the Blazor framework and WebLLM.

Startup

Next, you’ll need to import the WebLLM module. In this scenario, we’ll assume that this is the only JavaScript in the application, and rather than adding the complexity of npm, we’ll fetch the module directly from a CDN. In webllm-interop.js, the module is imported.

import * as webllm from "https://esm.run/@mlc-ai/web-llm";

With the module added, WebLLM is initialized by creating an instance of an engine. When calling CreateMLCEngine, a callback is used to capture the model loading progress. To display the progress, a simple console.log will be used to verify the code is working. In addition, the selectedModel value specifies the desired LLM to use; changing this value changes the underlying LLM. The current selectedModel, Llama-3.2-1B-Instruct-q4f16_1-MLC, is a smaller model that requires less loading time. Beware: some models can require several GB of data transfer.

// Callback function to update model loading progress
const initProgressCallback = (initProgress) => {
    console.log(initProgress);
}
const selectedModel = "Llama-3.2-1B-Instruct-q4f16_1-MLC";

const engine = await webllm.CreateMLCEngine(
    selectedModel,
    { initProgressCallback: initProgressCallback }, // engineConfig
);
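
If you’re unsure which model IDs are available, WebLLM ships with a prebuilt configuration you can inspect. The following exploratory snippet logs the bundled model IDs to the console; the exact shape of prebuiltAppConfig can vary between web-llm versions, so treat it as a sketch rather than part of the sample.

// Exploratory: list the model IDs bundled with WebLLM's prebuilt configuration
// (property names may differ across web-llm versions)
for (const model of webllm.prebuiltAppConfig.model_list) {
    console.log(model.model_id);
}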

As a temporary check, add the script to the index.html file. This will allow the module to be tested to check that the CDN and initialization code are working properly. In later steps, you’ll use a dynamic import with Blazor’s interop to fetch the module in place of this code.

<head>
    ...
    <!-- Temporary, we'll dynamically import this later-->
    <script type="module" src="webllm-interop.js"></script>
</head>

After saving the index.html file, run the Blazor application. Opening the browser’s developer tools will show the initialization process within the console’s output window. From this window, expand one of the JSON objects and observe its properties. The object will need to be replicated in C# later, so make sure to copy an instance of the object somewhere for later use.

{
    "progress": 0.1661029919791963,
    "timeElapsed": 7,
    "text": "Fetching param cache[19/108]: 716MB fetched. 16% completed, 7 secs elapsed. It can take a while when we first visit this page to populate the cache. Later refreshes will become faster."
}

So far you’ve updated the index.html file to include a new script tag for webllm-interop.js. This script was temporarily added and will be dynamically imported later. Next, you created the webllm-interop.js file. In this file, you imported the web-llm module, set up a callback to log the progress of the model loading, and initialized an MLCEngine with the model.

Now that your application is booting up, you can begin integrating with Blazor using the JavaScript interop.

Building the Interop Service

Blazor’s interop APIs allow the application to communicate with the browser’s JavaScript runtime. This means we can take advantage of the browser’s dynamic JavaScript module loading API and import the required JavaScript only when it’s needed by the application.

Start by creating a new service class named WebLLMService in the Blazor application. The service will provide an abstraction layer between the individual components in the Blazor app and the JavaScript module. In WebLLMService, a reference to the browser’s JavaScript runtime is captured. Because of Blazor’s boot sequence, we’ll ensure the module is loaded only when JavaScript is available and the module is actually needed. This is done through lazy loading, using a thread-safe Lazy<Task<IJSObjectReference>> wrapper. When the WebLLMService instance is constructed, the module import is deferred until the task is first awaited, at which point the IJSRuntime is ready to perform the dynamic import.

private readonly Lazy<Task<IJSObjectReference>> moduleTask;

private const string ModulePath = "./webllm-interop.js";

public WebLLMService(IJSRuntime jsRuntime)
{
	moduleTask = new(() => jsRuntime.InvokeAsync<IJSObjectReference>(
	"import", $"{ModulePath}").AsTask());
}

Next, the WebLLMService is added to the Blazor application’s service collection. The service collection is used to create an instance of WebLLMService and resolve references to the instance when a component requests it. In Program.cs, register the WebLLMService as follows.

builder.Services.AddScoped<WebLLMService>();
await builder.Build().RunAsync();

With the WebLLMService added to the application and the dynamic module import complete, the temporary import in index.html can be removed.

<head>
    ...
-    <!-- Temporary, we'll dynamically import this later-->
-    <script type="module" src="webllm-interop.js"></script>
</head>

Thus far, you added the WebLLMService class in WebLLMService.cs to lazily load webllm-interop.js using IJSRuntime. Additionally, you updated Program.cs to register the WebLLMService and removed the temporary script tag for webllm-interop.js from index.html. At this point, the application should run as it did before. Check the console log messages to ensure the initialization process is still working.

Now that the app is capable of calling JavaScript code from Blazor, the progress callback needs to push messages back to the application.

Calling Back to Blazor

Now you’ll need to establish two-way communication between webllm-interop.js and WebLLMService. This is accomplished through Blazor’s invokeMethodAsync (JavaScript) and JSInvokable (C#) APIs.

Start by updating webllm-interop.js. Create a variable named engine to hold the instance returned by CreateMLCEngine. To make the initialize function callable from Blazor, add the export keyword to initialize.

When initialize is called, we’ll also need to capture a reference to the calling .NET instance; in this case, it will be an instance of WebLLMService passed in through an argument named dotnet. Store the instance in a module-level variable named dotnetInstance, which will be used by other functions to invoke methods on WebLLMService. In addition, remove the const selectedModel, since the selectedModel argument can now be used to set the model when initialize is called from WebLLMService.

import * as webllm from "https://esm.run/@mlc-ai/web-llm";

var engine; // <-- hold a reference to MLCEngine in the module
var dotnetInstance; // <-- hold a reference to the WebLLMService instance in the module

- const selectedModel = "Llama-3.2-1B-Instruct-q4f16_1-MLC";
const initProgressCallback = (initProgress) => { ... }

+ export async function initialize(selectedModel, dotnet) {
+   dotnetInstance = dotnet; // <-- WebLLMService instance
-   const engine = await webllm.CreateMLCEngine(
+   engine = await webllm.CreateMLCEngine(
        selectedModel,
        { initProgressCallback: initProgressCallback }, // engineConfig
    );
}

Next, update initProgressCallback so it invokes a method on the .NET instance. The .NET method OnInitializing will be created in the next steps. In the following code, the initProgressCallback method uses invokeMethodAsync on the dotnetInstance object to call OnInitializing, passing in the initProgress JSON.


// Callback function to update model loading progress
const initProgressCallback = (initProgress) => {
-    // console.log(initProgress);
     // Make a call to .NET with the updated status
+    dotnetInstance.invokeMethodAsync("OnInitializing", initProgress);
}

When the callback is made, Blazor will need to deserialize the initProgress data into a C# object. A record named InitProgress is created in a file named ChatModels.cs. The InitProgress record is a C# representation of the object used by WebLLM to indicate progress.

public record InitProgress(float Progress, string Text, double TimeElapsed);

Next, you will update WebLLMService to initialize WebLLM and receive progress callbacks. In WebLLMService, the selectedModel is added as a field so the model can be set from Blazor. Then an InitializeAsync method is added; this method invokes the initialize function in JavaScript via the interop API’s InvokeVoidAsync method. When InvokeVoidAsync is called, a DotNetObjectReference is passed to JavaScript so callbacks can be invoked on the source instance.

using Microsoft.JSInterop;
public class WebLLMService
{
    private readonly Lazy<Task<IJSObjectReference>> moduleTask;

    private const string ModulePath = "./webllm-interop.js";
    
    public WebLLMService(IJSRuntime jsRuntime) { ... }

+   private string selectedModel = "Llama-3.2-1B-Instruct-q4f16_1-MLC";
    
+   public async Task InitializeAsync()
+   {
+   	var module = await moduleTask.Value;
+   	await module.InvokeVoidAsync("initialize", selectedModel, DotNetObjectReference.Create(this));
+		// Calls webllm-interop.js    initialize  (selectedModel, dotnet                            ) 
+   }

Receiving callbacks through the JavaScript interop is done by adding an event that performs actions when the callback occurs. You’ll use the OnInitializingChanged event to communicate progress through the InitProgress argument. A method named OnInitializing is created with the JSInvokable attribute, making it callable from the JavaScript module. When invoked, OnInitializing simply raises the OnInitializingChanged event and passes along the arguments received from JavaScript.

+   public event Action<InitProgress>? OnInitializingChanged;

+   // Called from JavaScript
+   // dotnetInstance.invokeMethodAsync("OnInitializing", initProgress);
+   [JSInvokable]
+   public Task OnInitializing(InitProgress status)
+   {
+   	OnInitializingChanged?.Invoke(status);
+   	return Task.CompletedTask;
+   }
}

With two-way communication established between WebLLMService and JavaScript, you’ll update the user interface (UI) of the Home page to show the initialization progress.

To begin, the WebLLMService is injected making an instance of the service available to the page. Then, a progress field is added to hold and display the current status of the initialization process. When the Home component initializes, it subscribes a delegate method OnWebLLMInitialization to the OnInitializingChanged event.

Next, InitializeAsync is called on the service to start the initialization process. When OnInitializingChanged is triggered, the OnWebLLMInitialization method is called, updating the progress field and calling StateHasChanged to display the new information.

+ @inject WebLLMService llm

<PageTitle>Home</PageTitle>

<h1>Hello, world!</h1>

Welcome to your new app.

+ Loading: @progress

+ @code {
+    InitProgress? progress;
+
+    protected override async Task OnInitializedAsync()
+    {
+        llm.OnInitializingChanged += OnWebLLMInitialization;
+        try
+        {
+            await llm.InitializeAsync();
+        }
+        catch (Exception e)
+        {
+            // Potential errors: No browser support for WebGPU
+            Console.WriteLine(e.Message);
+            throw;
+        }
+    }
+
+    private void OnWebLLMInitialization(InitProgress p)
+    {
+        progress = p;
+        StateHasChanged();
+    }
+ }

A live example can be seen in the following Blazor REPL:


Using this process, you added a loading display for WebLLM initialization. The WebLLMService and its corresponding JavaScript were enhanced to invoke round-trip events to communicate progress updates. Adding conversational elements to the application will follow this pattern.

Making Conversation

In order to have a conversation with the LLM, the application will need to send messages to and receive responses from the LLM using the interop. Before you can create these interactions, you’ll need additional data transfer objects (DTOs) to bridge the gap between JavaScript and C# code. These DTOs are added to the existing ChatModels.cs file; each DTO is represented by a record.

// A chat message
public record Message(string Role, string Content); 
// A partial chat message
public record Delta(string Role, string Content); 
// Chat message "cost"
public record Usage(double CompletionTokens, double PromptTokens, double TotalTokens); 
// One streamed choice containing a partial (delta) chat message
public record Choice(int Index, Message? Delta, string Logprobs, string FinishReason ); 

// A chat stream response
public record WebLLMCompletion(
	string Id,
	string Object,
	string Model,
	string SystemFingerprint,
	Choice[]? Choices,
	Usage? Usage
	)
{
    // The final part of a chat message stream will include Usage
	public bool IsStreamComplete => Usage is not null;
} 

Next, the webllm-interop.js module is updated to add streaming chat completions. A function named completeStream is added as an export, making it available to the Blazor application. The completeStream function calls the MLCEngine object’s chat.completions.create function.

When invoking chat, the stream and include_usage options are set to true. These options request the chat response in a streaming format and indicate when the response is complete by including usage statistics. Each chunk of the stream maps to a WebLLMCompletion, which includes the message’s role, delta and other metadata. As each chunk is generated, it is returned to Blazor through an interop callback to the ReceiveChunkCompletion method.

import * as webllm from "https://esm.run/@mlc-ai/web-llm";

var engine; // <-- hold a reference to MLCEngine in the module
var dotnetInstance; // <-- hold a reference to the WebLLMService instance in the module

const initProgressCallback = (initProgress) => { ... }

export async function initialize(selectedModel, dotnet) { ... }

+ export async function completeStream(messages) {
+ 	// Chunks is an AsyncGenerator object
+ 	const chunks = await engine.chat.completions.create({
+ 		messages,
+ 		temperature: 1,
+ 		stream: true, // <-- Enable streaming
+ 		stream_options: { include_usage: true },
+ 	});
+ 
+ 	for await (const chunk of chunks) {
+ 		//console.log(chunk);
+ 		await dotnetInstance.invokeMethodAsync("ReceiveChunkCompletion", chunk);
+ 	}
+ }

With the JavaScript updated, the corresponding interop methods can be added to WebLLMService. In WebLLMService, you’ll need to create a method named CompleteStreamAsync that takes a collection of messages and invokes the completeStream JavaScript function. This function starts generating chunks, which are sent back through the ReceiveChunkCompletion callback as its response argument. Similar to the initialization process, an event associated with the callback is raised so a delegate can be assigned to it. The OnChunkCompletion event fires for each chunk generated until the Usage property is populated, causing IsStreamComplete to return true.


private readonly Lazy<Task<IJSObjectReference>> moduleTask;

private const string ModulePath = "./webllm-interop.js";

private string selectedModel = "Llama-3.2-1B-Instruct-q4f16_1-MLC";

public WebLLMService(IJSRuntime jsRuntime) { ... }

public async Task InitializeAsync() { ... }

public event Action<InitProgress>? OnInitializingChanged;

+ public async Task CompleteStreamAsync(IList<Message> messages)
+ {
+ 	var module = await moduleTask.Value;
+ 	await module.InvokeVoidAsync("completeStream", messages);
+ }
 
+ public event Func<WebLLMCompletion, Task>? OnChunkCompletion;
 
+ [JSInvokable]
+ public Task ReceiveChunkCompletion(WebLLMCompletion response)
+ {
+ 	OnChunkCompletion?.Invoke(response);
+ 	return Task.CompletedTask;
+ }

Finally, the UI is updated in Home to incorporate the chat interface. For a simple chat UI, add a textbox and button for submitting the chat message. In addition, a loop is used to iterate over the messages sent to and generated by the LLM. You’ll also need an element that displays the currently streaming text as chunks arrive.


+ <h1>WebLLM Chat</h1>

+ @foreach (var message in messages)
+ {
+    <ul>
+         <li>@message.Role: @message.Content</li>
+     </ul>
+ }

+ @if (progress is not null && progress.Text.Contains("Finish loading"))
+ {
+     <div>
+         <input type="text" @bind-value="@Prompt" disabled="@isResponding" />
+         <button @ref="PromptRef" type="submit" @onclick="StreamPromptRequest" disabled="@isResponding">Chat</button>
+     </div>
+ }

+ <p>@streamingText</p>

Loading: @progress

The code for the Home page is updated with fields for the page state, including messages, Prompt, streamingText and a flag indicating work in progress, isResponding. A method named StreamPromptRequest sets up and initiates the chat request via the service’s CompleteStreamAsync method. The service will process the request and trigger OnChunkCompletion for each response, returning a WebLLMCompletion. As each chunk is received, its content is appended to the streamingText string until the last item arrives. When OnChunkCompletion receives the final chunk, the completed message is added to messages and the isResponding flag is reset to false.

@code {
+     string? Prompt { get; set; } = "";
+     ElementReference PromptRef;
+     List<Message> messages = new List<Message>();
+     bool isResponding;
+     string streamingText = "";
      InitProgress? progress;

    protected override async Task OnInitializedAsync()
    {
        llm.OnInitializingChanged += OnWebLLMInitialization;
+         llm.OnChunkCompletion += OnChunkCompletion;
        try
        {
            await llm.InitializeAsync();
        }
        catch (Exception e)
        {
            // Potential errors: No browser support for WebGPU
            Console.WriteLine(e.Message);
            throw;
        }
    }

    private void OnWebLLMInitialization(InitProgress p)
    {
        progress = p;
        StateHasChanged();
    }

 +   private async Task OnChunkCompletion(WebLLMCompletion response)
 +   {
 +       if (response.IsStreamComplete)
 +       {
 +           isResponding = false;
 +           messages.Add(new Message("assistant", streamingText));
 +           streamingText = "";
 +           Prompt = "";
 +           await PromptRef.FocusAsync();
 +       }
 +       else
 +       {
 +           streamingText += response.Choices?.ElementAtOrDefault(0)?.Delta?.Content ?? "";
 +       }
 +       StateHasChanged();
 +       await Task.CompletedTask;
 +   }

 +   private async Task StreamPromptRequest()
 +   {
 +       if (string.IsNullOrEmpty(Prompt))
 +       {
 +           return;
 +       }
 +       isResponding = true;
 +       messages.Add(new Message("user", Prompt));
 +       await llm.CompleteStreamAsync(messages);
 +   }
}

The basic components for chat are now complete, and you can have a full chat experience within the browser. This application does not use any server and can run completely offline, incurring no expense from LLM services like OpenAI or Azure.

A live example can be seen in the following Blazor REPL:


Conclusion

In this article, you learned how to use WebLLM and Blazor WebAssembly to create a GenAI application. WebLLM is a viable option for embedding an LLM directly in your application. However, using LLMs in the browser does come with limitations with regard to compatibility and initialization time. The use of GenAI in web applications is quickly becoming mainstream, and as models become more efficient and devices more performant, embedded or on-device solutions like WebLLM will find their place in modern applications.

Blending AI with UI

For an upgraded UI, Telerik UI for Blazor and the Telerik Design System include components and themes well suited to GenAI applications. The live example below is more detailed, featuring a loading progress indicator, chat bubbles and more.



About the Author

Ed Charbeneau

Ed Charbeneau is a web enthusiast, speaker, writer, design admirer, and Developer Advocate for Telerik. He has designed and developed web based applications for business, manufacturing, systems integration as well as customer facing websites. Ed enjoys geeking out to cool new tech, brainstorming about future technology, and admiring great design. Ed's latest projects can be found on GitHub.
