2026-03-31 · 9 min read · sota.io team

Deploy Futhark to Europe: GPU Array Computing on EU Infrastructure in 2026

Writing fast parallel code is usually an exercise in managing complexity that should not exist. You describe the computation you want (multiply every element of an array by a constant, sum the results, find the maximum) and then you rewrite it in a different language to make it fast. CUDA kernels. OpenCL dispatch code. Thread pools, work queues, memory transfer boilerplate. The algorithm is three lines; the parallel version is three hundred.

Futhark is built on a different premise: if your program is purely functional and operates on arrays, parallelism can be derived automatically. You write sequential array operations using map, reduce, scan, and filter. The Futhark compiler analyses the data dependencies and generates parallel GPU kernels or vectorised CPU code from your sequential description. The correctness argument (that parallel execution produces the same result as the sequential semantics) is the compiler's responsibility, not yours.

Futhark was created at DIKU, the Department of Computer Science at the University of Copenhagen 🇩🇰, by a team of Danish and European researchers. It is a rare example of an EU-origin GPU programming language designed for automatic parallelisation. Futhark backends target NVIDIA CUDA, OpenCL (AMD and Intel GPUs), and multicore CPU: the same code, compiled once, runs across hardware targets. For EU compute backends serving scientific workloads, ML inference, financial risk modelling, or image processing, Futhark provides GPU-class performance from purely functional sequential code.

Futhark applications deploy to sota.io on EU infrastructure with full GDPR compliance. This guide shows how.

The European Futhark Team

Futhark is a research language created entirely within the European academic system. Every primary contributor is affiliated with EU institutions.

Troels Henriksen 🇩🇰, a Danish computer scientist at DIKU, the Department of Computer Science at the University of Copenhagen (Datalogisk Institut, Københavns Universitet), is the creator and primary developer of Futhark. Henriksen began the Futhark project during his PhD at DIKU, defended in 2017 under the title "Design and Implementation of the Futhark Programming Language." His thesis laid out the core technical contributions: the array programming model, the fusion optimisation pipeline (eliminating intermediate arrays), the flattening transformation that converts nested parallelism into flat GPU-executable parallelism, and the CUDA and OpenCL code generation backends. Henriksen has continued developing Futhark as a research project and production tool at DIKU. He maintains the language, the compiler, the standard library, and the package ecosystem (futhark pkg). His work represents a strand of EU programming language research focused on a practical problem: GPU programming is too difficult for domain scientists (climate researchers, bioinformaticians, financial engineers) who need high-performance array computations but should not need to learn CUDA to get them.

Martin Elsman 🇩🇰, a Danish professor at DIKU, is a co-designer of Futhark's type system and co-author of the foundational Futhark papers. Elsman's research spans programming language theory, type systems, and high-performance functional programming. His contributions to Futhark include the size-polymorphic type system (arrays are typed with their sizes; the type checker enforces size consistency statically) and the work on in-place updates with uniqueness types (allowing safe mutation inside otherwise pure functions). Elsman is also the principal investigator on EU-funded research grants that have supported Futhark's development.

Cosmin Oancea 🇷🇴, a Romanian associate professor at DIKU, is a co-author of Futhark's GPU backend and the primary author of its loop transformation and kernel optimisation passes. Oancea's research focuses on the compiler analysis that enables automatic GPU parallelisation: how to detect and exploit irregular parallelism, how to map nested parallel loops onto flat GPU execution models, and how to optimise memory access patterns for GPU caches. His work on flattening nested parallelism (transforming deeply nested map/reduce compositions into GPU-executable flat parallel operations) is the core transformation that makes Futhark's automatic parallelisation practical on real hardware.

Niels G.W. Serup 🇩🇰, a Danish researcher at DIKU, contributed to Futhark's package manager (futhark pkg), the interactive tooling, and the language server. His work on tooling lowered the barrier to entry for domain scientists at EU research institutions who want to use Futhark for real computations.

Robert Schenck 🇩🇪, a German researcher who collaborated with the DIKU group, contributed to Futhark's automatic differentiation (AD) support. Futhark now supports reverse-mode AD for differentiating Futhark programs, making it suitable for ML-adjacent workloads (gradient computation for custom neural network layers, scientific parameter estimation). Schenck's work connects Futhark to the EU machine learning research community that needs differentiable array programming at GPU speed.

Why Futhark for EU Compute Backends

Automatic parallelisation from purely functional code. Futhark eliminates the gap between algorithm and implementation. You write map (\x -> x * 2.0f32) data and the compiler generates a parallel GPU kernel. You write reduce (+) 0 (map f data) and the compiler fuses the map and reduce into a single GPU kernel with no intermediate array allocation. The sequential semantics guarantee: your parallel program produces the same result as running the operations one at a time. For EU research backends (climate modelling, genomics, financial simulation), this means domain experts can write correct algorithms in familiar functional style without learning GPU programming.
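
The sequential semantics the compiler must preserve can be stated in plain Python. A sketch (the data values and f are hypothetical stand-ins, with f chosen as squaring for illustration):

```python
from functools import reduce

data = [1.0, 2.0, 3.0, 4.0]

# Futhark: map (\x -> x * 2.0f32) data
doubled = [x * 2.0 for x in data]

# Futhark: reduce (+) 0 (map f data), with f = squaring as a stand-in
def f(x):
    return x * x

sum_of_squares = reduce(lambda acc, x: acc + x, (f(x) for x in data), 0.0)
```

Whatever the GPU actually does, Futhark guarantees the observable results equal these sequential ones.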

Size-polymorphic type system eliminates out-of-bounds errors. Futhark arrays are typed with their sizes. [n]f32 is an array of n 32-bit floats; [m][n]f32 is an m×n matrix. Size variables propagate through the type system: map f (a: [n]f32) : [n]f32 says the output has the same length as the input, statically checked; zip (a: [n]f32) (b: [n]f32) : [n](f32, f32) requires both arrays to have the same size, enforced at compile time. Out-of-bounds access is structurally impossible for operations that respect the size types. For EU data processing backends (medical imaging, genomics pipelines), this means array shape errors are caught at compile time, not as runtime crashes in production.

Multiple hardware backends from one source. The same Futhark source compiles to CUDA (NVIDIA GPUs), OpenCL (AMD/Intel GPUs and FPGAs), multicore C (CPU parallel via pthreads), sequential C (single-threaded CPU), and ISPC (Intel SIMD). You do not write different code for different hardware. For EU scientific computing services that need to run on different infrastructure (GPU-accelerated containers for training workloads, CPU containers for inference, edge devices for sensor processing), Futhark's multi-backend model means one codebase serves all deployment targets.

Python bindings for production service integration. Futhark compiles to Python modules via the futhark pyopencl backend, or to C libraries that the futhark-ffi package wraps as Python objects. Either way, the result is a Python object with methods corresponding to your Futhark entry points. You call ctx.matmul(a, b) from a FastAPI handler; the arrays transfer to the GPU, compute, and return as NumPy arrays. For EU backend services that need GPU-accelerated endpoints alongside standard HTTP APIs, this means Futhark integrates into standard Python/FastAPI/Django stacks without a foreign runtime.

Purely functional guarantees for multi-user compute isolation. Futhark programs have no mutable global state and no I/O in compute kernels. Every entry point takes inputs and produces outputs with no side effects outside the explicit GPU memory allocations for that computation. For EU services processing personal data (genomics analysis, medical imaging, financial risk), this means there is no mechanism by which computation for one user can affect another user's results: pure functions are the strongest possible isolation guarantee.

Fusion eliminates intermediate allocations. Futhark's compiler performs producer-consumer fusion: when the output of one operation is immediately consumed by another, the compiler merges them into a single GPU pass with no intermediate array written to GPU memory. map f (map g data) becomes a single kernel; filter p (map f data) fuses into one pass. For EU analytics backends processing large datasets (Eurostat statistics, ECB financial data, genomic reference panels), fusion can reduce memory bandwidth requirements by 2–5× compared to naive element-wise GPU kernels.
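
A back-of-the-envelope model (illustrative numbers, not a benchmark) shows where the bandwidth savings come from: an unfused map-then-reduce writes the intermediate array and reads it back, while the fused kernel touches memory once.

```python
# Memory-traffic model for reduce (+) 0 (map f data) over n 32-bit floats.
n = 100_000_000                    # hypothetical array size
bytes_per_elem = 4

# Unfused: the map reads n elements and writes n (the intermediate array);
# the reduce then reads those n elements back.
unfused_traffic = 3 * n * bytes_per_elem

# Fused: one pass reads n elements; the running sum stays in registers.
fused_traffic = 1 * n * bytes_per_elem

ratio = unfused_traffic / fused_traffic    # 3.0 for this two-stage pipeline
```

Longer pipelines eliminate more intermediates, which is where the 2–5× range comes from in practice.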

Futhark Language Essentials

Futhark's syntax is clean and functional. Entry points are marked entry and are exposed as methods in the compiled Python/C bindings:

-- Basic array operations
entry map_scale [n] (data: [n]f32) (factor: f32) : [n]f32 =
  map (* factor) data

-- Reduction with fusion
entry sum_of_squares [n] (data: [n]f32) : f32 =
  reduce (+) 0.0f32 (map (** 2.0f32) data)

-- Matrix multiplication (automatically parallelised)
entry matmul [n][m][p] (a: [n][m]f32) (b: [m][p]f32) : [n][p]f32 =
  map (\row ->
    map (\col ->
      f32.sum (map2 (*) row col)
    ) (transpose b)
  ) a

-- Scan (prefix sum): GPU-parallel via a parallel prefix algorithm
entry prefix_sum [n] (data: [n]f32) : [n]f32 =
  scan (+) 0.0f32 data

The [n], [m], [p] in the function signatures are size parameters: the type checker verifies, statically, that a has m columns and b has m rows. No runtime shape checks; the type system handles it.
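
For contrast, the dynamically shaped equivalent in NumPy only discovers a mismatch at runtime; this hypothetical sketch triggers exactly the error class the Futhark type checker rules out before the program ever runs:

```python
import numpy as np

a = np.zeros((4, 3), dtype=np.float32)   # an n×m matrix: n=4, m=3
b = np.zeros((5, 2), dtype=np.float32)   # inner dimensions disagree: 3 vs 5

try:
    a @ b                                 # the shape error surfaces only here
    shape_error = False
except ValueError:
    shape_error = True
```

With size types, matmul (a: [n][m]f32) (b: [m][p]f32) simply refuses to type-check for these shapes; there is no runtime error to reach.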

GDPR-Compliant Analytics: Statistical Aggregations

A common EU backend pattern is computing aggregate statistics on personal data without retaining individual records. Futhark's pure functional model is ideal: the computation transforms data to aggregates with no mechanism for storing intermediate state:

-- Histogram computation for anonymised statistics
-- (indices passed to reduce_by_index must be i64 in current Futhark)
entry age_histogram [n] (ages: [n]i32) (bins: i64) : []i32 =
  let bin_size = 100 / i32.i64 bins
  let bin_idx = map (\age -> i64.min (bins - 1) (i64.i32 (age / bin_size))) ages
  in reduce_by_index (replicate bins 0) (+) 0 bin_idx (replicate n 1)

-- K-anonymity check: count per group, flag groups below threshold
entry k_anonymity_mask [n] (group_ids: [n]i64) (groups: i64) (k: i32) : [n]bool =
  let counts = reduce_by_index (replicate groups 0) (+) 0 group_ids (replicate n 1)
  in map (\gid -> counts[gid] >= k) group_ids

-- Differential privacy: add Laplace noise to counts
entry add_laplace_noise [n] (counts: [n]f32) (sensitivity: f32) (epsilon: f32) (noise: [n]f32) : [n]f32 =
  let scale = sensitivity / epsilon
  in map2 (\c z -> c + scale * z) counts noise

reduce_by_index is Futhark's histogram primitive: it computes a scatter-reduction in a single GPU pass. For GDPR Article 89 research exemption workloads (statistical analysis with appropriate safeguards), Futhark's pattern of transforming personal data to anonymised aggregates with no intermediate persistence matches the data minimisation principle of GDPR Article 5(1)(c).
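
As a test oracle for the Futhark histogram, NumPy's np.add.at performs the same scatter-reduction (sequentially on CPU); a sketch with hypothetical ages and five 20-year bins:

```python
import numpy as np

ages = np.array([23, 45, 67, 12, 89, 34, 55], dtype=np.int32)
bins = 5
bin_size = 100 // bins                     # 20-year buckets

# Same clamped binning as the Futhark kernel
bin_idx = np.minimum(bins - 1, ages // bin_size)

hist = np.zeros(bins, dtype=np.int32)
np.add.at(hist, bin_idx, 1)                # scatter-add: the reduce_by_index analogue
```

Running the compiled Futhark entry point on the same inputs should reproduce this histogram exactly.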

Financial Risk: Monte Carlo on GPU

EU financial services regulation (MiFID II, Solvency II, FRTB) requires large-scale risk computations. Monte Carlo simulation, the standard technique, parallelises trivially on GPU and is a natural fit for Futhark:

-- Black-Scholes Monte Carlo option pricing
-- paths: [num_paths]f32 of standard normal samples
entry bs_monte_carlo [n]
    (paths: [n]f32)
    (spot: f32) (strike: f32) (rate: f32) (vol: f32) (T: f32)
    : f32 =
  let dt = T  -- one step to maturity suffices for a European payoff
  let drift = (rate - 0.5f32 * vol * vol) * dt
  let diffusion = vol * f32.sqrt dt
  let terminal_prices = map (\z ->
      spot * f32.exp (drift + diffusion * z)
    ) paths
  let payoffs = map (\s -> f32.max 0.0f32 (s - strike)) terminal_prices
  let mean_payoff = f32.sum payoffs / f32.i64 n
  in mean_payoff * f32.exp (-rate * T)

-- Value-at-Risk at confidence level alpha
-- (requires: futhark pkg add github.com/diku-dk/sorts, plus at top of file:
--  import "lib/github.com/diku-dk/sorts/radix_sort")
entry compute_var [n] (pnl: [n]f32) (alpha: f32) : f32 =
  let sorted = radix_sort_float f32.num_bits f32.get_bit pnl
  let idx = i64.f32 (f32.i64 n * (1.0f32 - alpha))
  in sorted[idx]

One million Monte Carlo paths take milliseconds on a GPU in Futhark. The same code runs on CPU (via the multicore backend) for smaller workloads. For EU fintech services computing real-time risk metrics for MiFID II best-execution reporting or Solvency II SCR calculations, Futhark provides GPU-class performance from functional sequential code.
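
A NumPy reference implementation is handy for validating the compiled kernel's output; this sketch prices the same European call and compares against the Black-Scholes closed form (illustrative parameters and a fixed seed, chosen here for the example):

```python
import numpy as np
from math import log, sqrt, exp, erf

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_closed_form(spot, strike, rate, vol, T):
    d1 = (log(spot / strike) + (rate + 0.5 * vol**2) * T) / (vol * sqrt(T))
    d2 = d1 - vol * sqrt(T)
    return spot * norm_cdf(d1) - strike * exp(-rate * T) * norm_cdf(d2)

def bs_monte_carlo(paths, spot, strike, rate, vol, T):
    # Mirrors the Futhark kernel: one step to maturity per path.
    drift = (rate - 0.5 * vol**2) * T
    diffusion = vol * np.sqrt(T)
    terminal = spot * np.exp(drift + diffusion * paths)
    payoffs = np.maximum(0.0, terminal - strike)
    return float(np.exp(-rate * T) * payoffs.mean())

rng = np.random.default_rng(42)
paths = rng.standard_normal(1_000_000).astype(np.float32)
mc = bs_monte_carlo(paths, spot=100.0, strike=105.0, rate=0.02, vol=0.2, T=1.0)
analytic = bs_closed_form(100.0, 105.0, 0.02, 0.2, 1.0)
```

With a million paths the Monte Carlo estimate lands within a few cents of the analytic price, which makes a good regression test for the GPU version.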

Scientific Computing: Signal Processing

EU research infrastructure (CERN, ECMWF, ESA) produces enormous sensor datasets. Futhark's array model maps directly to signal processing algorithms:

-- Convolution (1D)
entry convolve1d [n][m] (signal: [n]f32) (kernel: [m]f32) : [n]f32 =
  let half = m / 2
  in map (\i ->
    let padded_i j = if i + j - half < 0 || i + j - half >= n
                     then 0.0f32
                     else signal[i + j - half]
    in f32.sum (map2 (*) kernel (map padded_i (iota m)))
  ) (iota n)

-- Discrete Fourier Transform (naive, for illustration)
entry dft [n] (signal: [n]f32) : [n](f32, f32) =
  let twopi = 2.0f32 * f32.pi
  in map (\k ->
    let angle j = -twopi * f32.i64 (k * j) / f32.i64 n
    let real = f32.sum (map (\j -> signal[j] * f32.cos (angle j)) (iota n))
    let imag = f32.sum (map (\j -> signal[j] * f32.sin (angle j)) (iota n))
    in (real, imag)
  ) (iota n)

-- Normalise a batch of signals to [0,1] range
entry batch_normalise [b][n] (signals: [b][n]f32) : [b][n]f32 =
  map (\s ->
    let mn = f32.minimum s
    let mx = f32.maximum s
    let range = mx - mn
    in if range == 0.0f32 then s else map (\x -> (x - mn) / range) s
  ) signals

batch_normalise processes a batch of b signals, each of length n, in parallel across the batch dimension. On GPU, all b signals normalise concurrently. For EU medical device backends processing sensor batches (ECG signals, EEG readings, MRI reconstruction), Futhark's batch processing model maps directly onto GPU execution.
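
A NumPy mirror of batch_normalise makes a convenient oracle when testing the compiled kernel; a sketch with a hypothetical two-signal batch (note the constant signal passes through unchanged, matching the range == 0 branch):

```python
import numpy as np

def batch_normalise(signals):
    # Per-signal min-max normalisation to [0, 1]; constant signals pass through.
    mn = signals.min(axis=1, keepdims=True)
    mx = signals.max(axis=1, keepdims=True)
    rng = mx - mn
    safe = np.where(rng == 0.0, 1.0, rng)            # avoid division by zero
    return np.where(rng == 0.0, signals, (signals - mn) / safe)

batch = np.array([[0.0, 5.0, 10.0],
                  [2.0, 2.0, 2.0]], dtype=np.float32)
out = batch_normalise(batch)
```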

Serving Futhark via Python FastAPI

Futhark compiles to Python bindings. A production deployment wraps Futhark kernels in a FastAPI service:

# requirements.txt
# futhark-ffi, fastapi, uvicorn, numpy

import numpy as np
from futhark_ffi import Futhark
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# Compile: futhark c --library analytics.fut
# Then:    build_futhark_ffi analytics   (ships with the futhark-ffi package)
import _analytics
ctx = Futhark(_analytics)

app = FastAPI()

class ScaleRequest(BaseModel):
    data: list[float]
    factor: float

class RiskRequest(BaseModel):
    paths: list[float]    # standard normal samples
    spot: float
    strike: float
    rate: float
    vol: float
    T: float

@app.post("/api/scale")
async def scale_data(req: ScaleRequest):
    data_arr = np.array(req.data, dtype=np.float32)
    result = ctx.from_futhark(ctx.map_scale(data_arr, np.float32(req.factor)))
    return {"result": result.tolist()}

@app.post("/api/risk/option-price")
async def price_option(req: RiskRequest):
    paths_arr = np.array(req.paths, dtype=np.float32)
    price = ctx.bs_monte_carlo(
        paths_arr,
        np.float32(req.spot),
        np.float32(req.strike),
        np.float32(req.rate),
        np.float32(req.vol),
        np.float32(req.T)
    )
    return {"price": float(price)}

@app.get("/health")
async def health():
    return {"status": "ok"}

The Futhark context object manages device memory and kernel loading. Arrays cross the Python boundary as NumPy arrays, with copies only at that boundary. For EU compute services, this architecture separates concerns cleanly: HTTP routing in Python, compute kernels in Futhark, both running in the same container.

Deploying to sota.io

sota.io detects Dockerfiles automatically. A Futhark/Python service deploys as a multi-stage container:

# Stage 1: compile Futhark kernels
FROM haskell:9.4 AS futhark-builder

RUN cabal update && cabal install futhark

WORKDIR /build
COPY analytics.fut .
# Compile to a C library using the multicore backend (CPU-parallel; no GPU required on host)
RUN futhark multicore --library analytics.fut

# Stage 2: production service
FROM python:3.12-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy compiled Futhark C library and Python bindings
COPY --from=futhark-builder /build/analytics.c .
COPY --from=futhark-builder /build/analytics.h .

# Build Python CFFI bindings (build_futhark_ffi ships with the futhark-ffi package)
RUN build_futhark_ffi analytics

COPY main.py .

EXPOSE 8080
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]

The multicore backend (futhark multicore) runs parallel computations across all available CPU cores without requiring GPU hardware. Hetzner's dedicated server line (used by sota.io) provides high core counts with wide SIMD units; Futhark's ispc backend can additionally target SIMD explicitly where it pays off.

With the sota.io CLI:

sota deploy

The build runs in Germany. Your Futhark application runs in Germany. Personal data processed by your compute service never leaves EU jurisdiction.

GPU Backend for High-Performance Workloads

For GPU-accelerated deployments, sota.io supports GPU-enabled containers. Use the OpenCL backend:

FROM nvidia/opencl:devel AS futhark-builder
RUN apt-get update && apt-get install -y ghc cabal-install
RUN cabal update && cabal install futhark
WORKDIR /build
COPY analytics.fut .
# Compile to OpenCL: runs on NVIDIA/AMD/Intel GPUs
RUN futhark opencl --library analytics.fut

The Futhark pyopencl backend (futhark pyopencl --library analytics.fut) generates a Python module containing a class whose methods transfer arrays to the GPU, execute OpenCL kernels, and return NumPy arrays. A single kernel launch processes the whole array in parallel across GPU threads.

Environment Variables and Database

Futhark handles computation; Python handles I/O and database access. Read environment variables via standard Python:

import os

DATABASE_URL = os.environ["DATABASE_URL"]  # injected by sota.io
PORT = int(os.environ.get("PORT", "8080"))

sota.io injects DATABASE_URL when you provision a managed PostgreSQL database. Use asyncpg for non-blocking database access alongside Futhark GPU computation:

import asyncpg
import numpy as np

async def compute_and_store(pool: asyncpg.Pool, dataset_id: int, raw_data: list[float]):
    # 1. Load from EU database
    async with pool.acquire() as conn:
        rows = await conn.fetch(
            "SELECT value FROM measurements WHERE dataset_id = $1 ORDER BY timestamp",
            dataset_id
        )

    # 2. Compute in Futhark (CPU or GPU)
    data_arr = np.array([r["value"] for r in rows], dtype=np.float32)
    result = ctx.prefix_sum(data_arr)

    # 3. Store results back (parameterised: no SQL injection).
    # Note "values" is a reserved SQL keyword, so the column name must be quoted.
    async with pool.acquire() as conn:
        await conn.execute(
            'INSERT INTO results (dataset_id, "values") VALUES ($1, $2)',
            dataset_id,
            result.tolist()
        )

All database operations are parameterised. Raw data loads from Hetzner Germany. Futhark computes on Hetzner Germany CPUs (or GPUs). Results store to Hetzner Germany PostgreSQL. No data leaves EU jurisdiction at any point in the pipeline.

Automatic Differentiation for ML Inference

Futhark supports reverse-mode automatic differentiation (AD) via the vjp (vector-Jacobian product) combinator. This enables gradient computation for custom ML layers:

-- Simple neural network layer: linear + ReLU activation
def linear_relu [n][m] (W: [m][n]f32) (b: [m]f32) (x: [n]f32) : [m]f32 =
  let z = map2 (\row bi -> f32.sum (map2 (*) row x) + bi) W b
  in map (f32.max 0.0f32) z

-- Compute gradient of loss with respect to input x
entry compute_gradient [n][m]
    (W: [m][n]f32) (b: [m]f32) (x: [n]f32) (loss_grad: [m]f32)
    : [n]f32 =
  vjp (linear_relu W b) x loss_grad

vjp takes a function, an input, and a "seed" output adjoint, and returns the gradient with respect to the input, computed by reverse-mode AD automatically derived from the Futhark source. For EU ML inference services that need custom gradient computation (physics-informed neural networks, scientific ML models, Bayesian optimisation), Futhark's AD support provides GPU-accelerated gradient computation from functional array code.
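
The gradient vjp produces for this layer can be written out by hand; a NumPy sketch of the same reverse pass, checked against finite differences (hypothetical small shapes and random values for illustration):

```python
import numpy as np

def linear_relu(W, b, x):
    z = W @ x + b
    return np.maximum(0.0, z)

def linear_relu_vjp(W, b, x, loss_grad):
    # Reverse pass: the ReLU masks adjoints for inactive units,
    # then the linear layer pulls the adjoint back through W^T.
    z = W @ x + b
    dz = loss_grad * (z > 0.0)
    return W.T @ dz

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
b = rng.standard_normal(3)
x = rng.standard_normal(4)
g = rng.standard_normal(3)          # seed adjoint on the output

grad_x = linear_relu_vjp(W, b, x, g)

# Finite-difference check of <g, f(x)> along each input coordinate
eps = 1e-6
num = np.array([
    (g @ linear_relu(W, b, x + eps * e) - g @ linear_relu(W, b, x - eps * e)) / (2 * eps)
    for e in np.eye(4)
])
```

The Futhark compute_gradient entry point should agree with this oracle on the same inputs.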

EU Infrastructure and GDPR Compliance

Futhark compute kernels (CPU multicore / GPU)
        ↓
  FastAPI HTTP service
        ↓
  sota.io platform
        ↓
  Hetzner Germany (Frankfurt)
        ↓
  EU jurisdiction: no Cloud Act exposure

Deploying on sota.io keeps the entire stack, build, runtime, and data, inside EU jurisdiction.

For EU compute backends processing personal data (genomics analysis, medical signal processing, financial risk modelling, anonymised statistical analytics), sota.io provides the compliance layer that US-jurisdiction cloud providers cannot, combined with Futhark's purely functional guarantees at the compute layer.

Getting Started

Install Futhark (prebuilt packages are available for common platforms; building from source requires GHC with Stack or Cabal):

# macOS
brew install futhark

# Ubuntu (via Stack)
stack install futhark

# Docker (recommended for reproducible builds)
docker pull futharklang/futhark:latest

Create your first Futhark program:

-- hello.fut
entry main [n] (data: [n]f32) : f32 =
  reduce (+) 0.0f32 (map (** 2.0f32) data)

Compile and test:

# Compile to C executable for testing
futhark c hello.fut
echo "[3.0f32, 4.0f32, 5.0f32]" | ./hello    # input in Futhark literal syntax

# Compile to a C library, then build Python CFFI bindings (pip install futhark-ffi)
futhark c --library hello.fut
build_futhark_ffi hello

# Run Futhark's built-in benchmarking tool
futhark bench hello.fut --backend=c
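
Note that futhark bench discovers benchmarks from test stanzas embedded in comments; without one it has nothing to run. A minimal stanza to add at the top of hello.fut (input and expected output in Futhark literal syntax; 3² + 4² = 25):

```
-- ==
-- input { [3.0f32, 4.0f32] }
-- output { 25.0f32 }
```

The same stanza also drives futhark test hello.fut for correctness checking.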

Deploy to sota.io:

# Install sota CLI
curl -fsSL https://sota.io/install.sh | sh

# Authenticate
sota auth set-key YOUR_API_KEY

# Deploy from your project directory (with Dockerfile)
sota deploy

Your Futhark compute service is live at your-project.sota.io on German infrastructure within minutes.


sota.io is the EU-native PaaS for Futhark and GPU-accelerated compute backends: GDPR-compliant infrastructure, managed PostgreSQL, and zero-configuration TLS. Deploy your first Futhark application →