GSoC 2026 @ The Rust Foundation

Posted Jun 21, 2026

By Marcelo Atanasio Domínguez Mateo


A Frontend for Safe GPU Offloading in Rust

Basic Information

Name: Marcelo Dominguez
Github: https://github.com/sa4dus
Email: dmmarcelo27@gmail.com
LinkedIn: https://www.linkedin.com/in/dmmarcelo
Location: Madrid, Spain (UTC+1:00)
Mentor: Manuel Drehwald (@ZuseZ4)

About me

I am a fourth-year Mathematics and Software Engineering student. I was a GSoC 2025 contributor at The Rust Foundation, working on the autodiff feature, and I am currently contributing to the offload feature.

Contributions

rust-lang/rust: https://github.com/rust-lang/rust/pulls?q=is%3Apr+author%3ASa4dUs
GSoC 2025 Project: https://summerofcode.withgoogle.com/archive/2025/projects/USQvru7i

Project Details

Size: Large

Abstract

The offload feature already works in the Rust compiler, but it still lacks a safe and user-friendly interface. Right now, the only way to use it is through the offload intrinsic, which is discouraged for general use. The goal of this project is to build an ergonomic frontend, based on a function-like macro and safe abstractions, that makes offloading accessible and memory safe.

Primary Goals

Design an offload! macro that simplifies kernel launching on arbitrary devices.
Ensure no mutable aliasing across device threads via disjoint memory partitioning.
Provide an abstraction for device selection.
Define a clear and explicit model for host-to-device data transfer via Device<T>.

Secondary Goals

Support shared memory programming through a Shared<T> abstraction.
Offer composable partitioning strategies for common parallel indexing patterns.

Constraints

The frontend should not introduce performance overhead over the raw intrinsic.
While remaining safe, the frontend should restrict the user as little as possible.
Abstractions must not assume device architecture.

Technical Approach

Expose device selection

Device management is exposed as a safe wrapper over LLVM intrinsics such as:

__tgt_rtl_number_of_devices
__tgt_rtl_init_device

The API design will look something like:

  
struct OffloadDevice { ... }

OffloadDevice::all();
OffloadDevice::host();
OffloadDevice::from_index(idx);

impl Default for OffloadDevice { .. }

Note that OffloadDevice::default is well defined, as we can always assume the existence of at least the host device.
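As a rough illustration of the API shape above, here is a host-only sketch. It makes several assumptions that are not part of the proposal: `OffloadDevice` is modeled as an enum rather than a struct, `runtime_device_count` is a hypothetical stand-in for the real runtime query, `from_index` returns an `Option`, and the default is chosen to be the host.

```rust
/// Hypothetical stand-in for the runtime query; NOT an existing API.
/// Here we simply pretend that two accelerators are present.
fn runtime_device_count() -> usize {
    2
}

/// Identifies either the host or one accelerator known to the runtime.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OffloadDevice {
    Host,
    Accelerator(usize),
}

impl OffloadDevice {
    /// The host is always available.
    pub fn host() -> Self {
        OffloadDevice::Host
    }

    /// Returns `None` when `idx` is not a valid accelerator index.
    pub fn from_index(idx: usize) -> Option<Self> {
        (idx < runtime_device_count()).then(|| OffloadDevice::Accelerator(idx))
    }

    /// Host first, then every accelerator reported by the runtime.
    pub fn all() -> Vec<Self> {
        std::iter::once(OffloadDevice::Host)
            .chain((0..runtime_device_count()).map(OffloadDevice::Accelerator))
            .collect()
    }
}

impl Default for OffloadDevice {
    /// Assumption: fall back to the host, which always exists.
    fn default() -> Self {
        OffloadDevice::Host
    }
}
```

Returning `Option` from `from_index` is one way to keep invalid indices out of safe code; the final API may well choose differently.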

The `offload!` Macro

The core of the frontend is a declarative macro that handles the boilerplate of capturing variables and calling the intrinsic.

  
// Proposed general syntax
offload!(
    device=D,
    grid=G,
    block=B,
    kernel=K(..args)
);

Instead of reconstructing or duplicating the kernel definition, the macro only takes a reference to an already defined kernel function marked with the offload_kernel attribute, which makes a function eligible for offloading.

  
#[offload_kernel]
fn K(..args) { .. }

which will expand to:

  
#[cfg(host)]
extern "C" fn K(..args);

#[cfg(device)]
extern "gpu-kernel" fn K(..args) { .. };

The offload! macro expansion then generates the host-side call of the intrinsic.

  
unsafe {
    ::core::intrinsics::offload(
        D,
        G,
        B,
        K,
        (..args)
    )
}

As this can lead to verbose invocations, some parameters can be optional and will use a default value instead.

  
offload!(
    grid=G,
    kernel=K(..)
);
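To make the optional parameters more concrete, the following is a minimal `macro_rules!` sketch of how the key-value syntax could be parsed with defaults. Everything here is hypothetical: `offload_sketch!`, `LaunchConfig`, `launch_stub` and `add_kernel` are illustrative names, the defaults (device 0, grid and block of `(1, 1, 1)`) are assumptions, and `launch_stub` just runs the kernel once on the host instead of calling the intrinsic.

```rust
/// Launch parameters collected by the macro (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LaunchConfig {
    pub device: usize,
    pub grid: (u32, u32, u32),
    pub block: (u32, u32, u32),
}

impl Default for LaunchConfig {
    /// Assumed defaults used when a parameter is omitted.
    fn default() -> Self {
        LaunchConfig { device: 0, grid: (1, 1, 1), block: (1, 1, 1) }
    }
}

/// Stand-in for the intrinsic: runs the kernel once on the host and
/// returns the configuration it received together with the result.
pub fn launch_stub<R>(cfg: LaunchConfig, kernel: impl FnOnce() -> R) -> (LaunchConfig, R) {
    (cfg, kernel())
}

/// Example kernel used only for this illustration.
pub fn add_kernel(a: i32, b: i32) -> i32 {
    a + b
}

macro_rules! offload_sketch {
    // Last step: the kernel call, after all options were consumed.
    (@opts $cfg:ident; kernel = $k:ident ( $($arg:expr),* $(,)? ) $(,)?) => {
        launch_stub($cfg, || $k($($arg),*))
    };
    // Each option overrides one field, then recurses on the rest.
    (@opts $cfg:ident; device = $d:expr, $($rest:tt)*) => {{
        $cfg.device = $d;
        offload_sketch!(@opts $cfg; $($rest)*)
    }};
    (@opts $cfg:ident; grid = $g:expr, $($rest:tt)*) => {{
        $cfg.grid = $g;
        offload_sketch!(@opts $cfg; $($rest)*)
    }};
    (@opts $cfg:ident; block = $b:expr, $($rest:tt)*) => {{
        $cfg.block = $b;
        offload_sketch!(@opts $cfg; $($rest)*)
    }};
    // Entry point: start from the defaults.
    ($($input:tt)*) => {{
        let mut cfg = LaunchConfig::default();
        offload_sketch!(@opts cfg; $($input)*)
    }};
}
```

With this sketch, `offload_sketch!(grid = (4, 1, 1), kernel = add_kernel(2, 3))` leaves `device` and `block` at their assumed defaults. The real macro would expand to the intrinsic call instead of `launch_stub`.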

Type-check information

With the current intrinsic, it is possible to pass arguments of the wrong type (for example, *const T instead of *mut T), and the compiler cannot infer the return type (see the Zulip discussion). We cannot use traits like FnOnce to infer the return type of the intrinsic, which should be the same as the kernel’s return type.

Right now, our offload intrinsic is defined as:

  
pub const fn offload<F, T: crate::marker::Tuple, R>(f: F, args: T) -> R;

Since we want to accept the most general form of a kernel f, we cannot rely on FnOnce-like traits, so we should aim for a solution at the type-checking level.

Wrapper for device types

To support passing host data to kernels while maintaining memory safety, we introduce Device<T>, an opaque type that manages memory on the target device.

For now, as there are no clear rules about it, we will not define or expose any host-to-device conversion logic. If a type’s memory representation is not valid on the device, the type will not be supported.

Only types that are device-safe (see next section) are eligible for transfer into device memory.

Device<T> would look something like:

  
struct Device<T> {
    ptr: *mut T,
}

This type can also be useful when calling multiple cuBLAS/cuDNN functions in sequence: data can stay on the device, so the memory transfers between kernels can be omitted, which can improve performance significantly.
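The benefit of keeping data on the device between kernels can be illustrated with a host-only mock. This is only a sketch under strong assumptions: "device memory" is simulated with a `Vec<T>` instead of a raw pointer, and the methods `from_host`, `run`, `to_host` and `transfers` are hypothetical names that are not part of the proposal.

```rust
/// Host-only mock of `Device<T>`. A `Vec<T>` stands in for the raw
/// device pointer, and `transfers` counts simulated copies.
pub struct Device<T> {
    buf: Vec<T>,
    transfers: usize,
}

impl<T: Copy> Device<T> {
    /// Simulates one host-to-device copy.
    pub fn from_host(data: &[T]) -> Self {
        Device { buf: data.to_vec(), transfers: 1 }
    }

    /// Simulates a kernel that updates the buffer in place.
    /// No transfer happens here: the data is already "on the device".
    pub fn run(&mut self, kernel: impl Fn(&mut T)) {
        for x in &mut self.buf {
            kernel(x);
        }
    }

    /// Simulates one device-to-host copy.
    pub fn to_host(&mut self) -> Vec<T> {
        self.transfers += 1;
        self.buf.clone()
    }

    /// Number of simulated transfers so far.
    pub fn transfers(&self) -> usize {
        self.transfers
    }
}
```

Running two kernels back to back on the same `Device<f32>` costs two simulated transfers in total (one in, one out), instead of the four that a copy-in, copy-out cycle per kernel would need.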

Device-safe types

Define an OffloadSafe marker trait. This trait ensures a type T is device-safe.

  
trait OffloadSafe {}

It is pre-implemented for primitive types and for other types whose safety we can ensure:

  
impl<T: OffloadSafe, const N: usize> OffloadSafe for [T; N] {}
impl<T: OffloadSafe> OffloadSafe for [T] {}
impl<T: OffloadSafe + ?Sized> OffloadSafe for &T {}
impl<T: OffloadSafe + ?Sized> OffloadSafe for &mut T {}

It would be ideal to have automatic derivation, but padding in structs can be a problem.

The offload! macro will use trait bounds in its expanded code to trigger a compile-time error if a user tries to offload an incompatible type.

If a type T implements Clone, it is assumed to be safely duplicable in host memory, and therefore, eligible for implicit transfer into device memory. Any T: Clone is considered implicitly OffloadSafe for offload!, even without an explicit implementation.

As an alternative design, this check could also be moved entirely to a compiler-internal check by recursively visiting and validating argument types.

This would simplify the implementation and reduce the trait-system complexity but would produce less user-friendly errors.
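The trait-bound approach can be sketched as follows. The helper `require_offload_safe` is a hypothetical name for what the macro expansion might emit around each argument, and the primitive impls shown are only an illustrative subset. The implicit `T: Clone` rule is not modeled here.

```rust
/// Marker trait for device-safe types, as proposed above.
pub trait OffloadSafe {}

// Illustrative subset of the primitive impls.
impl OffloadSafe for f32 {}
impl OffloadSafe for u32 {}

// Containers and references are device-safe if their contents are.
impl<T: OffloadSafe, const N: usize> OffloadSafe for [T; N] {}
impl<T: OffloadSafe> OffloadSafe for [T] {}
impl<T: OffloadSafe + ?Sized> OffloadSafe for &T {}
impl<T: OffloadSafe + ?Sized> OffloadSafe for &mut T {}

/// Hypothetical helper that a macro expansion could wrap around each
/// argument: it only compiles when `T: OffloadSafe`, and otherwise
/// passes the value through unchanged.
pub fn require_offload_safe<T: OffloadSafe>(arg: T) -> T {
    arg
}
```

Passing a value whose type has no `OffloadSafe` impl to `require_offload_safe` is rejected at compile time with an unsatisfied trait bound error, which is the user-facing diagnostic this design relies on.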

Parallel Index Patterns

To guarantee memory safety across threads, execution is modeled as disjoint memory partitioning.

Each kernel is associated with a partitioning strategy that defines how input data is split into non-overlapping regions, each assigned to a single thread. Formally:

\[\forall i \neq j: R_i \cap R_j = \emptyset\]

where $R_i$ is the memory region assigned to thread $i$.

This ensures no mutable aliasing across threads and no data races by construction (assuming the pattern is implemented correctly).

A region is defined as:

  
enum Region<'a, T> {
    Element(&'a mut T),
    Slice(&'a mut [T]),
}

Not all OffloadSafe types admit safe partitioning. While OffloadSafe guarantees valid device memory representation, partitioning additionally requires an indexable and alias-free access model.

We introduce a second marker trait to distinguish types that can be safely partitioned across threads:

  
/// SAFETY:
/// - no aliasing between elements
/// - safe to split into disjoint mutable regions
trait Partitionable {}

This trait is stricter than OffloadSafe. Only structurally safe containers are partitionable by default:

  
impl<T: OffloadSafe> Partitionable for [T] {}
impl<T: OffloadSafe, const N: usize> Partitionable for [T; N] {}
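On the host, the disjointness requirement for slices can already be expressed with the standard library. The sketch below is only an analogy for the device-side model: `for_each_region` is a hypothetical helper that runs the "threads" sequentially, and it relies on `chunks_mut`, which hands out non-overlapping `&mut [T]` regions.

```rust
/// Splits `data` into at most `threads` contiguous regions and runs
/// `kernel(thread_id, region)` on each one, sequentially on the host.
/// Because `chunks_mut` yields non-overlapping mutable sub-slices,
/// no two regions alias each other.
pub fn for_each_region<T>(data: &mut [T], threads: usize, kernel: impl Fn(usize, &mut [T])) {
    assert!(threads > 0, "at least one thread is required");
    // Region length, rounded up so that every element is covered.
    // `max(1)` avoids a zero chunk size when `data` is empty.
    let chunk = data.len().div_ceil(threads).max(1);
    for (tid, region) in data.chunks_mut(chunk).enumerate() {
        kernel(tid, region);
    }
}
```

For example, ten elements split across four threads produce regions of length 3, 3, 3 and 1, and the borrow checker guarantees that each element is writable by exactly one of them.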

To define how threads map to memory regions, we can define partitioning strategies.

  
trait PartitioningStrategy {
    // `T: 'a` is required because `assign` borrows `data` for `'a`.
    type Region<'a, T>: 'a
    where
        T: 'a;

    fn assign<'a, T>(
        &self,
        thread_idx: (usize, usize, usize),
        block_idx: (usize, usize, usize),
        block_size: (usize, usize, usize),
        grid_size: (usize, usize, usize),
        data: &'a mut [T],
    ) -> Option<Self::Region<'a, T>>;
}

For example, a simple 1-dimensional linear indexing:

  
struct Linear1D;

impl PartitioningStrategy for Linear1D {
    type Region<'a, T> = &'a mut T where T: 'a;

    fn assign<'a, T>(
        &self,
        thread_idx: (usize, usize, usize),
        block_idx: (usize, usize, usize),
        block_size: (usize, usize, usize),
        _grid_size: (usize, usize, usize),
        data: &'a mut [T],
    ) -> Option<Self::Region<'a, T>> {
        // i = b_x * bs_x + t_x
        let i = block_idx.0 * block_size.0 + thread_idx.0;
        // Threads that fall outside the data get no region.
        data.get_mut(i)
    }
}

Some common patterns will be provided out of the box, for example:

Pattern	Index formula
`Linear1D`	$i = b_x \cdot bs_x + t_x$
`GridStride1D`	$i = (b_x \cdot bs_x + t_x) + k \cdot (gs_x \cdot bs_x)$
`Linear2D`	$i = (b_y \cdot bs_y + t_y) \cdot W + (b_x \cdot bs_x + t_x)$

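where $gs_\xi$, $bs_\xi$, $b_\xi$ and $t_\xi$ are the grid size, block size, block index and thread index along the $\xi$ axis, $W$ is the row width of the 2D data, and $k = 0, 1, 2, \dots$ counts the grid-stride iterations of a thread.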
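The index formulas can be checked with a small host-side sketch. The function names and the `Dim3` alias are hypothetical; tuples are ordered `(x, y, z)`, and the `len` bound in `grid_stride_1d` is an added assumption so that the sequence of indices is finite.

```rust
/// (x, y, z) triple, used for indices and sizes alike.
pub type Dim3 = (usize, usize, usize);

/// Linear1D: i = b_x * bs_x + t_x
pub fn linear_1d(block_idx: Dim3, block_size: Dim3, thread_idx: Dim3) -> usize {
    block_idx.0 * block_size.0 + thread_idx.0
}

/// GridStride1D: all indices i = (b_x * bs_x + t_x) + k * (gs_x * bs_x)
/// visited by one thread for k = 0, 1, 2, ..., limited to i < len.
pub fn grid_stride_1d(
    block_idx: Dim3,
    block_size: Dim3,
    thread_idx: Dim3,
    grid_size: Dim3,
    len: usize,
) -> Vec<usize> {
    let start = linear_1d(block_idx, block_size, thread_idx);
    // Total number of threads along x; `max(1)` guards against a zero step.
    let stride = (grid_size.0 * block_size.0).max(1);
    (start..len).step_by(stride).collect()
}

/// Linear2D: i = (b_y * bs_y + t_y) * W + (b_x * bs_x + t_x)
pub fn linear_2d(block_idx: Dim3, block_size: Dim3, thread_idx: Dim3, width: usize) -> usize {
    let row = block_idx.1 * block_size.1 + thread_idx.1;
    let col = block_idx.0 * block_size.0 + thread_idx.0;
    row * width + col
}
```

For instance, with a grid of 2 blocks of 4 threads, thread 1 of block 0 visits indices 1, 9 and 17 of a 20-element buffer under the grid-stride pattern, and no other thread visits those indices.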

Shared Memory Model

There’s currently a Tracking Issue for NVPTX shared memory #135516 and an open PR for introducing support for dynamic shared memory. The design space is still evolving.

If dynamic shared memory support is available during the project, the frontend will use gpu_launch_sized_workgroup_mem to expose it (more design details will be discussed during the Community Bonding Period). For static shared memory, integration will depend on its state when the coding period starts. If it has not been implemented by then, I’m happy to help with it as part of the project.

GPU synchronization primitives, such as NVIDIA’s syncthreads and AMD’s barrier, may be exposed through a thin wrapper.

For advanced users

Besides the safe abstraction, the frontend will also provide an unsafe macro that bypasses some of the safety checks, so users can have full control when they need it.

Documentation and testing

All the features above will require proper tests and documentation, which will also be an important part of the project.

Deliverables

Midterm Evaluation: device selection, wrapper types, the OffloadSafe trait and at least one PartitioningStrategy working. The frontend should be usable in a subset of cases without the offload! macro layer.
Final Evaluation: fully functional offload! macro integrating all components, tested and documented.

Project Timeline

Community Bonding Period

Refine the exact design of the frontend and its core details, and validate it with the mentor. Key design questions, especially around device-safe type validation and shared memory abstractions, will be discussed with the Rust community.

Week 1

Implement OffloadDevice using __tgt_rtl_* intrinsics.

Week 2

Implement Device<T> and an initial approach for T: Clone types.

Week 3

Define and implement OffloadSafe for primitive types, slices, arrays and references.

Week 4

Add type checking for the offload intrinsic’s return type and argument mutability.

Week 5

Define and implement Partitionable and Region with the required safety constraints.

Week 6

Implement PartitioningStrategy and the Linear1D strategy with correct index computation and region assignment, and run a full kernel end-to-end without the macro layer (Midterm evaluation).

Week 7

Implement parsing logic for offload! macro.

Week 8

Implement macro expansion to the intrinsic call and generated function wrappers.

Week 9

Compile-time checks to enforce OffloadSafe on arguments.

Week 10

Integrate partitioning in the macro execution flow.

Week 11

Implement shared memory using previous design discussions.

Week 12

Add an offload_unsafe! macro as a thin wrapper over the intrinsic.

Final Week

Write docs and prepare a complete end-to-end demo for the final submission.


This post is licensed under CC BY 4.0 by the author.