AI in SG: Guide to Running Local LLMs on the NVIDIA Jetson Orin Nano

Cloud-tethered artificial intelligence presents unavoidable compromises in data privacy, subscription economics, and network reliance. This comprehensive operational blueprint guides the ambitious technologist through transforming an 8GB NVIDIA Jetson Orin Nano into a completely self-contained, power-efficient, edge-AI powerhouse. By leveraging Ollama and Open WebUI under an optimised JetPack environment, you will deploy state-of-the-art language models like Llama 3.2 and DeepSeek R1 locally, bypassing cloud infrastructure entirely while remaining anchored to Singapore’s strict data protection standards.

The Sovereign Desk in a Connected Hub

Step out onto a balcony in Duxton Hill on a stickily warm Singapore afternoon, and you are surrounded by the physical reality of a global financial nexus. Below, heritage shophouses host boutique venture funds; down the road, the skyscrapers of Raffles Place pulse with data flying to hyper-scale cloud facilities across the island. Yet, for the modern enterprise, the independent consultant, or the discerning software engineer, that continuous digital tether to distant servers is becoming an architectural liability.

As Singapore accelerates its National AI Strategy 2.0 (NAIS 2.0), the conversation has shifted from mere adoption to data sovereignty, capital efficiency, and system resilience. Relying on commercial cloud APIs means exposing proprietary intelligence, submitting to unpredictable token-pricing schemes, and accepting latency that degrades the user experience.

The elegant alternative is local execution: edge compute that processes language models entirely within your physical perimeter. Enter the NVIDIA Jetson Orin Nano Developer Kit. No larger than a deck of playing cards and sipping less power than a designer table lamp, this small-form-factor module is rated by NVIDIA at up to 67 trillion operations per second (TOPS) of AI performance. That headline figure applies to the higher-power "Super" mode; in the 15W profile used throughout this guide, the rated figure is 40 TOPS.

For the uninitiated, configuring an embedded Linux board to serve complex transformers may seem daunting. This guide demystifies the entire deployment arc. From bare silicon and NVMe storage orchestration to the nuances of unified memory management and sleek web interfaces, you will discover how to construct a whisper-quiet, local AI assistant tailored for the modern, privacy-first workflow.

The Silicon at the Edge: Why Jetson Orin Nano?

When choosing hardware for local language model inference, beginners frequently look to consumer desktops equipped with monolithic graphics cards or lightweight hobbyist platforms like the Raspberry Pi 5. The former is loud, power-hungry, and aggressively expensive; the latter lacks the specialized hardware architecture required to execute matrix multiplication at speed.

+-----------------------------------------------------------------------+
|               NVIDIA Jetson Orin Nano 8GB Architecture                |
+-----------------------------------------------------------------------+
|                                                                       |
|    +-----------------------+        +------------------------+        |
|    |          CPU          |        |  Ampere GPU            |        |
|    +-----------+-----------+        |  (1,024 CUDA Cores,    |        |
|                |                    |   32 Tensor Cores)     |        |
|                |                    +-----------+------------+        |
|                |                                |                     |
|                +----------------+---------------+                     |
|                                 |                                     |
|                                 v                                     |
|                 +------------------------------+                      |
|                 |  8GB LPDDR5 Unified Memory   |                      |
|                 |     (68 GB/s Bandwidth)      |                      |
|                 +---------------+--------------+                      |
|                                 |                                     |
+---------------------------------|-------------------------------------+
                  +------------------------------+
                  |  M.2 Key M NVMe SSD Storage  |
                  +------------------------------+

The Jetson Orin Nano occupies a unique technological sweet spot due to three distinct architectural advantages:

Ampere Architecture Tensor Cores: Unlike traditional CPUs that process calculations sequentially, the Orin Nano features 1,024 CUDA cores and 32 Tensor Cores. These are hardwired for the low-precision mathematics (FP16 and INT4/INT8) that modern Large Language Models (LLMs) rely on for rapid token generation.
Unified Memory Architecture (UMA): In a standard PC, data must constantly traverse the bottleneck of the PCIe bus between system RAM and GPU VRAM. The Jetson utilizes a single pool of high-speed LPDDR5 memory shared dynamically between the CPU and GPU. This is immensely beneficial for LLMs, where the entirety of the model's weights must reside in memory during inference.
Unrivalled Energy Efficiency: Operating within a highly flexible 7-watt to 15-watt power envelope, the Jetson delivers substantial compute density per watt. In a nation like Singapore, where commercial electricity rates reflect global energy realities and sustainability is legally mandated via green building codes, running a 15W edge node continuously is vastly more sensible than running a 600W desktop rig.

Step 1: The Physical Foundations

Before executing a single terminal command, you must assemble a stable hardware baseline. The standard Orin Nano developer kit requires a few deliberate additions to handle the prolonged thermal and read/write stresses of large language model inference.

Hardware Prerequisites

NVIDIA Jetson Orin Nano Developer Kit (8GB Version): Ensure you procure the 8GB variant. The 4GB model is excellent for computer vision but lacks the memory capacity required to hold a modern quantized language model alongside an operating system.
M.2 NVMe PCIe SSD (256GB Minimum): While the Jetson can boot from a MicroSD card, doing so for LLMs is an exercise in frustration. Model weights are multi-gigabyte files that must be read from storage every time a model is loaded; a MicroSD card will bottleneck both your boot times and your model load cycles. Choose a fast NVMe drive (such as a Samsung 980 or Crucial P3).
Official Power Supply & Active Cooling Fan: Ensure your kit includes the 45W power supply and the official heatsink/fan assembly. LLM inference drives the GPU to maximum utilization, and passive cooling will trigger thermal throttling within minutes.
Peripherals for Initial Configuration: An HDMI/DisplayPort monitor, a USB keyboard and mouse, and an Ethernet cable or the included Wi-Fi module attached to your local network.

Step 2: Provisioning the Environment

We begin by flashing the operating system and optimizing the environment for memory-intensive workloads.

Flashing the Operating System

For beginners, the most direct path is flashing NVIDIA’s official JetPack 6 image directly onto your storage medium using a secondary computer. Download the official JetPack NVMe/SD Card image from the NVIDIA Developer portal and utilize an application such as BalenaEtcher to write the image to your drive.

Once flashed, insert the NVMe SSD into the M.2 Key M slot underneath the Jetson carrier board, connect your peripherals, and apply power. Follow the on-screen Ubuntu initialization prompts to set your username, password, and system language (defaulting to English - UK/Singapore for system consistency).

Maximizing the Power Profiles

By default, the Jetson limits its power consumption to preserve thermals. We need to unlock its full potential. Open your terminal (Ctrl+Alt+T) and execute the following commands to set the power mode to Max Performance:

Bash

# Set the performance profile to 15W Max Capacity
sudo nvpmodel -m 0

# Force the system clocks to lock at their maximum frequencies
sudo jetson_clocks
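To confirm that the profile actually changed, nvpmodel can be queried with its -q flag. The helper below is a minimal sketch, assuming the query output contains a line shaped like "NV Power Mode: 15W"; the function name power_mode_name is illustrative, and the mode names themselves differ between JetPack releases.

```shell
# Extract the active power-mode name from `nvpmodel -q` output read on stdin.
# Assumption: the output contains a line of the form "NV Power Mode: <name>".
power_mode_name() {
    sed -n 's/^NV Power Mode:[[:space:]]*//p' | head -n 1
}

# Usage on the Jetson (left as a comment so the snippet is safe to source anywhere):
#   sudo nvpmodel -q | power_mode_name
```

If the printed name is not the 15W profile, re-run sudo nvpmodel -m 0 and query again.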

The nvpmodel setting is saved and survives a reboot, but jetson_clocks must be re-run after each restart unless you enable it at boot. The jetson-stats utility, an indispensable tool created by the open-source community for monitoring system thermals and resource allocation, lets you do that from its control page:

Bash

sudo apt update && sudo apt install -y python3-pip
sudo pip3 install jetson-stats

Restart your system after installation. Running jtop in your terminal will now present an elegant, real-time dashboard of your CPU cores, GPU utilization, power draw, and internal temperatures.

Allocating Swap Space

Because the Orin Nano possesses 8GB of physical RAM shared entirely with the GPU, space is at an absolute premium. Operating systems require buffers, and when loading a 4GB or 5GB model, you risk triggering Linux’s "Out of Memory" (OOM) killer, which will abruptly terminate your processes.

To mitigate this, we create a generous swap file on the high-speed NVMe drive to act as an overflow reservoir for rarely used memory pages. Swap is far slower than RAM, so treat it as a safety net rather than as extra capacity for model weights:

Bash

# Disable any active swap partitions
sudo swapoff -a

# Allocate a 16-Gigabyte file on your NVMe storage
sudo fallocate -l 16G /swapfile

# Secure the file permissions
sudo chmod 600 /swapfile

# Format the file as Linux Swap space
sudo mkswap /swapfile

# Enable the swap file immediately
sudo swapon /swapfile

# Make the configuration permanent across boots
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
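A quick check that the swap file is active can save confusion later. The sketch below reads the SwapTotal field that Linux exposes in /proc/meminfo (reported in kB) and converts it to MiB; the function name swap_total_mib is illustrative.

```shell
# Print the total swap size in MiB from /proc/meminfo-style text on stdin.
# /proc/meminfo reports SwapTotal in kB, so dividing by 1024 gives MiB.
swap_total_mib() {
    awk '/^SwapTotal:/ { printf "%d\n", $2 / 1024 }'
}

# Usage:
#   swap_total_mib < /proc/meminfo   # close to 16384 after the steps above
#   swapon --show                    # lists /swapfile when it is active
```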

Step 3: Deploying Ollama and the Language Models

With the operating system optimized and protected against memory spikes, we deploy the inference engine. We will use Ollama, a streamlined framework that handles model downloads, serves pre-quantized model weights, and exposes a local API.

Historically, setting up acceleration on ARM64 architectures required compiling complex toolchains from source. Ollama provides native, out-of-the-box support for the NVIDIA Jetson platform via its automated installation script, linking directly to your pre-installed JetPack CUDA drivers.

Executing the Installation

Run the official installation script within your terminal:

Bash

curl -fsSL https://ollama.com/install.sh | sh

The script automatically detects the Jetson’s Ampere GPU architecture, configures the necessary environment variables, and establishes a system background service (systemd) running on local port 11434.
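Before pulling any models, it is worth confirming that the service is listening. The sketch below builds endpoint URLs for the local API; /api/version and /api/tags are existing Ollama endpoints, while the function name ollama_url and the variables OLLAMA_HOST_ADDR and OLLAMA_PORT are hypothetical names introduced only for this example.

```shell
# Build the URL of an Ollama HTTP endpoint. Host and port default to the
# values described above (127.0.0.1:11434). OLLAMA_HOST_ADDR and OLLAMA_PORT
# are illustrative variables, not settings that Ollama itself reads.
ollama_url() {
    printf 'http://%s:%s%s\n' "${OLLAMA_HOST_ADDR:-127.0.0.1}" "${OLLAMA_PORT:-11434}" "$1"
}

# Usage on the Jetson:
#   systemctl is-active ollama                # prints "active" when the service is up
#   curl -fsS "$(ollama_url /api/version)"    # returns a small JSON version object
#   curl -fsS "$(ollama_url /api/tags)"       # lists locally installed models
```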

Relocating the Model Storage Directory

By default, Ollama saves model weights within the system root directory (/usr/share/ollama/.ollama/models). If your primary operating system structure sits on a confined partition, this will rapidly exhaust your space. Let us redirect this directory to a dedicated space on your spacious NVMe storage:

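Bash

# Create a dedicated directory for your models and hand it to the ollama service account
sudo mkdir -p /home/$USER/ollama_models
sudo chown -R ollama:ollama /home/$USER/ollama_models

# Let the ollama account traverse your home directory (Ubuntu creates it with mode 750)
sudo chmod o+x /home/$USER

# Edit the systemd service file to inject the environment variable
sudo systemctl edit ollama.service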

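A text editor will open on an override file. Insert the following lines precisely to override the default storage configuration. systemd does not expand shell variables such as $USER, so write the path out in full (yourname below is a placeholder for your own login name):

Ini, TOML

[Service]
Environment="OLLAMA_MODELS=/home/yourname/ollama_models"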

Save the file (Ctrl+O, Enter) and exit (Ctrl+X). Reload the system configurations and restart the background daemon to apply your changes:

Bash

sudo systemctl daemon-reload
sudo systemctl restart ollama
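To check that the override was picked up, you can ask systemd which environment it passes to the service. The sketch below parses the output of systemctl show; the function name models_dir_from_env is illustrative, and the parsing assumes the models path contains no spaces.

```shell
# Extract the OLLAMA_MODELS value from `systemctl show ... --property=Environment`
# output read on stdin. Assumes the path itself contains no spaces.
models_dir_from_env() {
    sed 's/^Environment=//' | tr ' ' '\n' | sed -n 's/^OLLAMA_MODELS=//p' | head -n 1
}

# Usage on the Jetson:
#   systemctl show ollama --property=Environment | models_dir_from_env
```

An empty result suggests the override file was not saved or the daemon was not reloaded.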

Selecting and Running Your First Model

In the world of local language models, scale is dictated by parameter count. A model's parameter count is the number of learned weights it contains, and it determines both how capable the model can be and how much memory it needs. Because the Orin Nano gives us roughly 6.5GB of usable memory after the OS reserves its share, we must target models that have undergone quantization (a compression technique that reduces each weight from 16 bits to around 4 bits, with only a modest loss of output quality).
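The effect of quantization on memory can be estimated with simple arithmetic: parameters multiplied by bits per weight, divided by 8 to get bytes. The sketch below is a back-of-envelope helper (weights_gb is an illustrative name). Real 4-bit files come out somewhat larger than this estimate because quantization formats store extra scaling data and keep some tensors at higher precision.

```shell
# Approximate weight memory in GB for a model with the given parameter count
# (in billions) stored at the given number of bits per weight.
# Formula: parameters x bits / 8 bytes, with 1 GB taken as 10^9 bytes.
weights_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# weights_gb 3 16   -> 6.0   (3B parameters at 16-bit: too large alongside the OS)
# weights_gb 3 4    -> 1.5   (same model at 4-bit)
# weights_gb 7 4    -> 3.5   (7B parameters at 4-bit)
```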

For the Orin Nano 8GB, two models stand out as exceptional choices:

Llama 3.2 (3B Parameters): Highly articulate, fast, and light enough to leave plenty of operational memory overhead.
DeepSeek R1 (1.5B or 7B Parameters, Quantized): Highly sought after for its explicit step-by-step reasoning paths.

Let us pull and run the Llama 3.2 (3B) model first:

Bash

ollama run llama3.2:3b

Available Memory & Quantization Balance

+-------------------------------------------------------------------------------+
| Total Physical Memory: 8.0 GB (Unified RAM/VRAM)                              |
+-------------------------------------------------------------------------------+
| [ OS & System Services Overhead: ~1.5 GB ]                                    |
|                                                                               |
| [ Free Liquid Memory for Inference: ~6.5 GB ]                                 |
|   |                                                                           |
|   +---> Llama 3.2 (3B, Q4 Quantized)   ~2.0 GB  [Optimal / Ultra-Fast]        |
|   |                                                                           |
|   +---> DeepSeek R1 (7B, Q4 Quantized) ~4.7 GB  [Maximum Safe Threshold]      |
+-------------------------------------------------------------------------------+

Ollama will display a progress bar as it fetches the model layers. Once complete, you will be presented with an interactive prompt. Type a question, such as "Explain the economic impact of the Straits of Malacca on global maritime logistics," and observe the output.

On the Jetson Orin Nano, a quantized 3B model will stream text back at approximately 25 to 30 tokens per second—comfortably faster than the reading speed of an average adult. To exit the interactive prompt, simply type /bye.
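A rough way to sanity-check a throughput figure like this is to divide memory bandwidth by model size. The sketch below rests on a common rule of thumb, stated here as an assumption: token generation is limited by memory bandwidth, and every generated token requires reading all of the weights once. It ignores the KV cache and compute time, so it gives a ceiling rather than a prediction. The function name decode_ceiling_tps is illustrative.

```shell
# Upper bound on generation speed in tokens per second, assuming each token
# requires one full read of the weights from memory.
# Arguments: memory bandwidth in GB/s, model size in GB.
decode_ceiling_tps() {
    awk -v bw="$1" -v size="$2" 'BEGIN { printf "%.1f\n", bw / size }'
}

# decode_ceiling_tps 68 2.0   -> 34.0   (3B model at ~2.0 GB)
# decode_ceiling_tps 68 4.7   -> 14.5   (7B model at ~4.7 GB)
```

With the 68 GB/s bandwidth and ~2.0 GB model size quoted earlier, the ceiling of about 34 tokens per second sits just above the observed 25 to 30.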

Step 4: Crafting the Interface with Open WebUI

While interacting with an artificial intelligence via the command-line interface appeals to engineers, a web application offers a more practical experience for day-to-day work. We will deploy Open WebUI, an open-source, web-based interface that mirrors the clean layout of commercial chat systems, running entirely within a local Docker container on the Jetson.

Configuring Docker Permissions

JetPack arrives with Docker pre-installed, but it requires root execution by default. Let us grant your standard user profile permission to handle containers without constantly prefixing commands with sudo:

Bash

sudo usermod -aG docker $USER

Log out of your Ubuntu session and log back in to apply these user group modifications.

Deploying the Open WebUI Container

Run the following multi-line terminal instruction to pull and spin up the user interface container. Notice that we specifically tell Docker to utilize the nvidia container runtime, allowing the software inside to tap directly into the underlying hardware acceleration:

Bash

docker run -d \
  --network=host \
  --runtime=nvidia \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Accessing the System

Once the container initializes (you can track its state by typing docker logs -f open-webui), open a web browser on the Jetson and navigate to:

http://127.0.0.1:8080

If you wish to access the system from another device on your local Singapore office network (such as a MacBook or iPad), replace 127.0.0.1 with the internal IP address of your Jetson board (discoverable by executing hostname -I).
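The address to type on the other device can be assembled from the hostname -I output. The sketch below takes the first address in the list, which is an assumption: on a board with several interfaces (Wi-Fi, Ethernet, Docker bridges) the first entry is usually, but not always, the one on your office network. The function name webui_url is illustrative.

```shell
# Build the Open WebUI address from `hostname -I` style output
# (space-separated addresses). The first address is used.
webui_url() {
    printf '%s\n' "$1" | awk '{ printf "http://%s:8080\n", $1 }'
}

# Usage on the Jetson:
#   webui_url "$(hostname -I)"
```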

+------------------------------------------------------------------------+
|                       Open WebUI Browser Portal                        |
+------------------------------------------------------------------------+
| [ Select Model: llama3.2:3b v ]                                        |
|                                                                        |
| User: How do local data privacy frameworks apply here?                 |
|                                                                        |
| AI Assistant: Operating locally on the Jetson Orin Nano means your     |
| data never leaves your physical device, completely satisfying the      |
| stringent processing standards mandated by Singapore's Personal Data   |
| Protection Act (PDPA).                                                 |
|                                                                        |
|  +------------------------------------------------------------------+  |
|  | Message Llama 3.2...                                             |  |
|  +------------------------------------------------------------------+  |
+------------------------------------------------------------------------+

Upon your first visit, you will be prompted to create an initial administrative user account. This profile exists entirely within the local container database on your desk; no data is synchronised across external servers. Select your downloaded model from the drop-down menu at the top of the interface, and your bespoke, private AI console is fully operational.

The Singapore Context: Sovereign AI for Local Enterprises

The legal and operational ramifications of running local AI models on hardware like the Jetson Orin Nano are significant for organisations operating within Singapore.

Total Compliance with the PDPA

Singapore’s Personal Data Protection Act (PDPA) imposes strict obligations on how organisations collect, use, and disclose personal data. Sending sensitive corporate files, financial statements, or legal contracts to external cloud LLM providers can easily result in inadvertent compliance breaches if those providers use your data for model training.

By grounding your computation within an isolated Jetson Orin Nano ecosystem, the data stays inside your own perimeter. Your inputs never traverse external networks, which removes the third-party disclosure and overseas transfer questions that cloud APIs raise. Obligations such as consent, purpose limitation, and protection of stored data still apply, but they are far easier to meet when nothing leaves the device. This makes it an excellent option for law firms in Chinatown, medical clinics in Novena, or boutique financial advisory firms on Shenton Way that handle protected client information.

Operational Cost Control

Cloud infrastructure pricing can be highly unpredictable. Token costs accumulate rapidly when scaling automation pipelines or deploying customer-facing tools. The Jetson Orin Nano represents a predictable, fixed capital expenditure.

Annual Cost Projection Comparison

+-------------------------------------------------------------------------------+
| Enterprise Cloud API Subscriptions (Continuous API Pools)                     |
| $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$ (~S$1,200 - S$3,500+)  |
+-------------------------------------------------------------------------------+
| Local Jetson Edge Deployment (Hardware CapEx + 15W Electricity)               |
| $$$$$$$$$$$ (~S$550 One-Time DevKit + S$35 Annual Utility)                    |
+-------------------------------------------------------------------------------+

After accounting for the initial purchase price of the developer kit and an NVMe drive (roughly S$550 to S$650 combined), the ongoing operational cost is limited to the electricity drawn by a 15W device. Running flat out all year, that is 15 W × 8,760 hours, or about 131 kWh. Assuming a tariff of around S$0.27 per kWh, this comes to roughly S$35 per year, and less in practice because the board idles well below 15W.
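The electricity figure is easy to recompute for your own tariff. The sketch below assumes continuous operation at a fixed wattage; the S$0.27 per kWh rate in the example is an assumed placeholder rather than a quoted tariff, and the function names are illustrative.

```shell
# Energy used in one year (kWh) by a device drawing a constant number of watts.
annual_kwh() {
    awk -v w="$1" 'BEGIN { printf "%.1f\n", w * 24 * 365 / 1000 }'
}

# Yearly electricity cost for a constant draw in watts at a given price per kWh.
annual_cost() {
    awk -v w="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", w * 24 * 365 / 1000 * rate }'
}

# annual_kwh 15          -> 131.4
# annual_cost 15 0.27    -> 35.48   (assumed tariff of S$0.27 per kWh)
```

Substitute the rate from your own utility bill to get a figure for your premises.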

Industrial Applications in Logistics and Maritime Tech

Beyond typical office productivity, the Jetson Orin Nano module is designed for embedded deployment, although the developer kit itself is a bench-top board that needs a suitable enclosure for harsh conditions. In the context of Singapore's maritime and logistics sectors, a properly housed unit can be deployed directly into warehouses in Jurong or onto vessels operating in the Singapore Strait. It can summarise local logs, scan customs manifests, and structure sensor data in real time, completely independent of cellular or satellite internet connections.

Advanced Troubleshooting for Beginners

Even with a detailed roadmap, running complex neural networks on tiny silicon modules can throw a few curveballs. Here is how to navigate the most common teething issues.

1. Severe Stuttering and Sluggish Generation

If your model responses drop to 1 or 2 tokens per second, check whether your system is thermal throttling or running in a low-power mode. Open a secondary terminal window and type jtop. Verify that the power mode it reports (also available from sudo nvpmodel -q) is the 15W profile. If it indicates a 7W budget, execute sudo nvpmodel -m 0 again.

Additionally, ensure that the cooling fan is spinning. If it remains stationary under heavy load, use jtop’s interactive menus (navigating with the number keys to the control tabs) to force the fan profile to an aggressive cooling curve.

2. Context Window Contraction (The Prefill Crash)

When passing lengthy documents (such as a 30-page PDF) into a model running via Open WebUI, the Jetson might suddenly reboot or freeze. This occurs because processing a long initial text prompt requires a massive amount of memory for the "KV Cache" (the model's short-term memory of the conversation).

If you encounter this, split the document and feed it to the model in smaller chunks. You can also cap the context length within the Open WebUI advanced model settings by setting the num_ctx parameter to 2048 or 4096 tokens rather than raising it towards the model's maximum.
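The memory cost of a longer context can be estimated directly. The sketch below uses the standard KV cache size formula: 2 (keys and values) × layers × KV heads × head dimension × context tokens × bytes per element. The model shape in the examples (28 layers, 8 KV heads, head dimension 128, 16-bit cache) is an assumption that approximates a 3B-class model; check the model card for the real values. The function name kv_cache_mib is illustrative.

```shell
# KV cache size in MiB.
# Arguments: layers, KV heads, head dimension, context tokens, bytes per element.
kv_cache_mib() {
    echo $(( 2 * $1 * $2 * $3 * $4 * $5 / 1048576 ))
}

# Assumed 3B-class shape: 28 layers, 8 KV heads, head dimension 128, 2-byte cache.
# kv_cache_mib 28 8 128 2048 2     -> 224
# kv_cache_mib 28 8 128 4096 2     -> 448
# kv_cache_mib 28 8 128 131072 2   -> 14336   (far beyond the board's 8GB)
```

Under these assumptions a 4096-token window costs under half a gigabyte, while a window in the hundred-thousand-token range would need more memory than the board has.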

3. The "Ollama Cannot Communicate with GPU" Error

If your system logs (viewable with journalctl -u ollama) indicate that Ollama is falling back to pure CPU execution (resulting in painfully slow output), it means Ollama could not load your CUDA libraries. This typically happens if the system has not been rebooted since setup or if you're running an incompatible configuration.

You can verify your CUDA installation status by running:

Bash

nvcc --version

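If the command is not found, nvcc may simply be missing from your PATH: JetPack normally installs it under /usr/local/cuda/bin, so try /usr/local/cuda/bin/nvcc --version before drawing conclusions. If CUDA is genuinely absent, your JetPack installation may be incomplete or corrupt. Re-flash your NVMe storage with a fresh, clean copy of JetPack 6, which includes pre-verified CUDA libraries.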
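The lookup can be made a little more forgiving by checking the usual CUDA install directory as well as the PATH. The sketch below assumes /usr/local/cuda/bin as the default location; the function name find_nvcc is illustrative.

```shell
# Print the path to nvcc: first from PATH, then from a CUDA bin directory
# (default /usr/local/cuda/bin). Returns 1 if nvcc cannot be found.
find_nvcc() {
    if command -v nvcc >/dev/null 2>&1; then
        command -v nvcc
    elif [ -x "${1:-/usr/local/cuda/bin}/nvcc" ]; then
        printf '%s\n' "${1:-/usr/local/cuda/bin}/nvcc"
    else
        return 1
    fi
}

# Usage on the Jetson:
#   nvcc_path=$(find_nvcc) && "$nvcc_path" --version
#   export PATH=/usr/local/cuda/bin:$PATH    # makes plain `nvcc` work in this shell
```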

Conclusion & Takeaways

Running language models locally changes your relationship with artificial intelligence. It transitions AI from an expensive utilities-based service controlled by a handful of corporate conglomerates into a private asset that sits directly on your desk. The NVIDIA Jetson Orin Nano provides beginners with an affordable, highly capable entry point into this space without requiring a massive, power-hungry desktop tower.

Key Practical Takeaways

Memory Dictates Scale: Stick strictly to highly quantized (4-bit) language models within the 1.5B to 3B parameter window for fluid performance. Do not attempt to run unquantized 7B or 8B models without expecting severe memory bottlenecks.
Storage Matters: Never attempt to serve large language models from a standard MicroSD card. Secure a fast M.2 NVMe SSD to preserve your patience and protect your system's components from read/write degradation.
Lock the Clocks: Always enforce the maximum power profile (sudo nvpmodel -m 0 and sudo jetson_clocks) before initializing heavy background processes to ensure consistent performance.
Embrace the Edge: Leverage local deployment to build workflows that remain fully compliant with regional data laws like Singapore's PDPA, safeguarding your intellectual property and client privacy.

Frequently Asked Questions

Can I run the larger Llama 3.1 8B or Mistral 7B models on the Jetson Orin Nano 8GB? Yes, but you must look for highly compressed versions (specifically Q2_K or Q3_K_S quantizations). An 8B parameter model at standard 4-bit precision requires roughly 4.8GB of space just for its weights, which leaves very little memory for the operating system and conversation context. For smooth day-to-day use, models like Llama 3.2 (3B) offer a much better balance of speed and intelligence on this specific hardware.

How does the Jetson Orin Nano compare to a Raspberry Pi 5 for local AI? The Jetson Orin Nano is fundamentally different from a Raspberry Pi 5 for AI workloads. While the Raspberry Pi 5 relies on its standard CPU cores for calculations, the Orin Nano includes an integrated NVIDIA Ampere GPU with dedicated Tensor Cores. This allows the Orin Nano to run language models significantly faster while drawing a comparable amount of power.

Do I need a continuous internet connection once the models are installed? Not at all. Once you have successfully flashed JetPack, installed Ollama, and pulled down your chosen models, the Jetson Orin Nano can be disconnected from the internet entirely. It will continue to process text prompts, run code generation, and host the Open WebUI panel on your local network completely offline, providing total isolation for sensitive data.

Sunday, May 31, 2026
