
Conversation

@newpavlov
Member

No description provided.

@newpavlov newpavlov requested a review from tarcieri December 15, 2025 13:51
@tarcieri
Member

I really don't understand the motivation for this or why you're making it a public function. Something like this seems like a potential replacement for the existing atomic_fence function.

Can you start with a demonstration of a problem in the existing implementation, then show what this solves?

Making it public seems like an implementation detail leaking out of the API.

@newpavlov
Member Author

newpavlov commented Dec 15, 2025

You can see a motivating example in the function docs. We cannot "zeroize" types like NonZeroU32, but they may still contain sensitive data. It could also work as a safer alternative to zeroize_flat_type. Another use case is a type which is defined in a different crate, does not provide direct access to its internal fields (but could be reset using, for example, Default::default()), and does not support zeroize.
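
For illustration, a minimal sketch of the NonZeroU32 case (KeyId is a made-up type; observe is the function proposed in this PR):

use core::num::NonZeroU32;

struct KeyId {
    // Zero is not a valid value, so `Zeroize` cannot be implemented.
    id: NonZeroU32,
}

impl Drop for KeyId {
    fn drop(&mut self) {
        // Overwrite the sensitive value with a dummy non-zero one...
        self.id = NonZeroU32::new(1).unwrap();
        // ...then ask the compiler to treat the write as observed.
        zeroize::observe(self);
    }
}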

I encountered a need for this function when working on the block buffer for rand_core. It explicitly does not want to depend on zeroize, but we would like to erase data from the block buffer defined in it. Right now this is done using an (IMO) ugly hack (see RustCrypto/stream-ciphers#491).

Something like this seems like a potential replacement for the existing atomic_fence function.

Yes. I plan to do it in a separate PR.

@tarcieri
Member

I'm not sure I understand the issue in RustCrypto/stream-ciphers#491 or what it's even trying to do... zeroize the entire keystream it generates?

@newpavlov
Member Author

newpavlov commented Dec 15, 2025

The current master version of rand_core implements a BlockRng wrapper which handles buffering of generated RNG blocks. We want to zeroize the results field in chacha20 AND we do not want rand_core to depend on zeroize, so it's resolved with the Generator::drop hack. This method gets called in the Drop impl of BlockRng with a reference to the results field.

With the observe function we could write:

// in rand_core
impl<R: Generator> BlockRng<R> {
    pub fn zeroize(&mut self) {
        self.results = Default::default();
    }
}

// in chacha20
struct ChachaRng(BlockRng<ChaChaBlockRng>);

impl Drop for ChachaRng {
    fn drop(&mut self) {
        self.0.zeroize();
        zeroize::observe(self);
    }
}

@tarcieri
Member

I'm still not sure I follow... what calls the drop method of Generator, and what's the problem with that?

Where exactly is the "hack"?

@tarcieri
Member

How is...

impl Drop for ChachaRng {
    fn drop(&mut self) {
        self.0.zeroize();
        zeroize::observe(self);
    }
}

...any different from...

impl Drop for ChachaRng {
    fn drop(&mut self) {
        self.0.zeroize();
    }
}

@newpavlov
Member Author

newpavlov commented Dec 15, 2025

what calls the drop method of Generator, and what's the problem with that?

The Drop impl of BlockRng. The Generator::drop method has nothing to do with the declared functionality of the trait, i.e. generation of block RNG data. Its sole purpose is to enable zeroization of results in BlockRng without breaking its invariant (it stores the cursor position in the first word, similarly to our block-buffer).
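
Roughly, the workaround has this shape (a hypothetical sketch; the actual trait and signatures in rand_core differ in details):

trait Generator {
    /// The trait's declared purpose: fill a block with RNG output.
    fn generate(&mut self, results: &mut [u32]);

    /// Unrelated to generation; exists solely so that `BlockRng` can ask
    /// the implementor to wipe `results` from its own `Drop` impl.
    fn drop(&mut self, results: &mut [u32]) {
        let _ = results; // default: do nothing
    }
}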

...any different from...

Note that zeroize in the snippets is defined in the rand_core crate and does not use the zeroize crate (I should've named it differently to make it less confusing). So the compiler is allowed to remove self.0.zeroize(); in your second snippet.

@tarcieri
Member

So the problem is that rand_core is shipping a half-assed version of zeroize internally, and you want a band-aid to make it more robust?

@newpavlov
Member Author

In a certain (uncharitable) sense, yes.

But while rand_core is indeed my primary motivation, I believe that the observe function would be useful even without it.

@tarcieri
Member

Perhaps rand_core could just implement volatile writes? I'm not sure this proposed approach is any less of a hack than what exists currently.

I do still think something like this, internally within zeroize, would be a useful replacement for atomic_fence though (where the latter has questionable value to begin with: #988)

@newpavlov
Member Author

newpavlov commented Dec 15, 2025

Perhaps rand_core could just implement volatile writes?

Not everyone needs RNG buffer zeroization. Introducing a crate feature for it is also not desirable. I think that zeroization should be handled on the chacha20 side.

I don't understand why you are against exposing the function. It's a simple, safe function which cannot be misused (as opposed to zeroize_flat_type) and can reduce the amount of unsafe in downstream crates while making the resulting code more efficient (e.g. by replacing inefficient scalar volatile writes with SIMD-based writes). Its only problem is that it operates in a somewhat grey zone without any guarantees from the compiler, but the same arguably applies to atomic_fence as well.

@tarcieri
Member

I'm a bit worried about the notion that there's a user pulling in one crate which is expected to do some otherwise insecure zeroing, then separately pulling in zeroize to observe the writes the other crate is supposedly doing. Ensuring the entire operation is secure relies on implicit coupling between those two crates which the user is orchestrating, and the user has little way to tell if e.g. the zeroing operation in one crate has been removed and the observe is useless.

Perhaps for a start you could use this technique as an internal replacement for atomic_fence, and as a followup we can talk about exposing it?

@newpavlov
Member Author

newpavlov commented Dec 15, 2025

Ensuring the entire operation is secure relies on implicit coupling between those two crates which the user is orchestrating

Yes, it's unfortunate, but sometimes there is just no other way. Potentially unreliable erasure is better than no erasure at all. And it's much better than risking UB with zeroize_flat_type.

Also note my point about efficiency. observe-based code results in better codegen. For example:

struct Foo {
    a: [u8; 32],
    b: [u64; 16],
    c: u64,
}

impl Drop for Foo {
    fn drop(&mut self) {
        // This impl is more efficient
        self.a = [0; 32];
        self.b = [0u64; 16];
        self.c = 0;
        zeroize::observe(self);

        // than this impl
        self.a.zeroize();
        self.b.zeroize();
        self.c.zeroize();
    }
}

See https://rust.godbolt.org/z/K7hxoePKE

Perhaps for a start you could use this technique as an internal replacement for atomic_fence, and as a followup we can talk about exposing it?

Personally, I don't see much point in it, but sure.

@tarcieri
Member

Also note my point about efficiency. observe-based code results in better codegen.

The existing performance problems are already noted in #743.

If you replaced atomic_fence with this style of optimization barrier (which would also address #988), we could consider getting rid of the volatile writes, at least on platforms that support asm! where we have guarantees.

@newpavlov
Member Author

at least on platforms that support asm! where we have guarantees.

Well, IIUC even asm! does not provide 100% guarantees here, i.e. a "sufficiently smart" compiler may in theory analyze the asm! block and decide that it can be safely eliminated. I would love to have a written guarantee that asm! is a black box which can not be messed with by the compiler (unless something like options(pure) is explicitly provided), but IIRC there is nothing like this right now. And, as you may remember, there are even surprising interactions between asm! blocks and target features.

cc @RalfJung

@newpavlov newpavlov changed the title zeroize: add observe function zeroize: replace atomic_fence with observe Dec 15, 2025
@newpavlov newpavlov changed the title zeroize: replace atomic_fence with observe zeroize: replace atomic_fence with optimization_barrier Dec 15, 2025
@tarcieri
Member

Well, IIUC even asm! does not provide 100% guarantees here, i.e. a "sufficiently smart" compiler may in theory analyze the asm! block and decide that it can be safely eliminated.

Do you have more information about that? I wasn't aware compilers would change codegen based on an analysis of the ASM.

@newpavlov
Member Author

newpavlov commented Dec 15, 2025

I did not mean that existing compilers do it, but that AFAIK it's not explicitly forbidden. For example, an advanced LTO pass could in theory apply optimizations directly to the generated assembly, thus removing "useless" writes to a stack frame which immediately gets released.

I tagged Ralf in the hope that he can clarify this point in case I am mistaken.

@RalfJung

RalfJung commented Dec 15, 2025

The details around asm! could easily fill a book, so without spending a lot of time (that I don't have) on digging into what exactly you are trying to achieve here, I can't give any solid advice. So just some notes:

  • Here's a comment I wrote where I sketched out a systematic way to reason about asm!, unfortunately without going into examples and details.
  • The compiler is not allowed to analyze the actual instructions in the asm block. (We haven't yet gotten LLVM to enshrine this promise in their docs, but we're hoping to get there.)
  • The concept of a "compiler barrier" doesn't really exist, and there's no sound reasoning principle that I am aware of that is based on "compiler barriers". They can be a useful guide for intuition, but actually reliable reasoning needs more solid foundations. (This is related to how the actual specification of atomic fences has nothing at all to do with preventing the reordering of memory accesses; it is entirely about establishing happens-before relationships between certain memory accesses, which impose constraints on observable program behavior, which must be preserved by compilation. Arguing about the order of accesses in the final generated program is like arguing about which instruction the compiler uses to perform division -- there's no stable guarantee of any sort, other than how it impacts observable program behavior. And here, "observable" does not include memory accesses, except if they are volatile.)

At the same time, the entire concept of zeroize is to try to do something that can't reliably be done in the Rust AM. From a formal opsem perspective, reliable zeroizing is impossible. (I wish it were different, but sadly that's where we are. Even if we wanted we couldn't do much about that in rustc until LLVM has proper support for this.) Best-effort zeroizing is a lot more about what compilers and optimizers actually do than about their specifications, and I'm the wrong person to ask about that.

@tarcieri
Member

The concept of a "compiler barrier" doesn't really exist, and there's no sound reasoning principle that I am aware of that is based on "compiler barriers"

@RalfJung what term would you use for this sort of tactical defense against code elimination that, at some future date, could be subject to another round of cat-and-mouse when the compiler outsmarts it?

@tarcieri
Member

I still think the best thing we could do with asm! is use it to actually implement an optimized memset/bzero routine, where we have guarantees it wouldn't be eliminated, unlike any Rust code we write (which would also address #743)
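
For instance, an x86-64-only sketch of such a routine (illustrative; asm_bzero is a hypothetical name, and a real implementation would need per-arch variants):

#[cfg(target_arch = "x86_64")]
unsafe fn asm_bzero(dst: *mut u8, len: usize) {
    // `rep stosb` stores AL into [RDI], RCX times; RDI and RCX are
    // clobbered by the instruction, so both are discarded outputs.
    core::arch::asm!(
        "rep stosb",
        inout("rdi") dst => _,
        inout("rcx") len => _,
        in("eax") 0u32, // AL = 0
        options(nostack, preserves_flags),
    );
}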

@newpavlov
Member Author

newpavlov commented Dec 15, 2025

If we are to trust the compiler not to mess with asm!, then I think the empty "observation" asm! should be sufficient. Using separate asm!s for every type, or one general memset-like function with separate asm!-based impls for every supported arch, would be less efficient and more error-prone.

@RalfJung

what term would you use for this sort of tactical defense against code elimination that, at some future date, could be subject to another round of cat-and-mouse when the compiler outsmarts it?

The paragraph you quoted was talking about reliable / sound reasoning principles.
If by "compiler barrier" you mean "a best-effort heuristic to steer the compiler away from things it shouldn't do", then I have no objections to the term (but I think that's not how many people use it). I also can't confidently give advice on that kind of a thing; you know a lot more about it than me.

@tarcieri
Member

I think we only need one implementation of memset/bzero per architecture.

The implementation of zeroize itself is already abstracted such that there are two places to actually invoke it from: volatile_write and volatile_set (the latter being the main place that would benefit)
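
For context, the scalar version of such a routine conceptually looks like this (a simplified sketch, not the crate's exact implementation):

unsafe fn volatile_set(dst: *mut u8, value: u8, count: usize) {
    for i in 0..count {
        // Each volatile write must be emitted, which is exactly why
        // this optimizes so poorly compared to plain writes.
        core::ptr::write_volatile(dst.add(i), value);
    }
}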

@newpavlov
Member Author

@RalfJung
Is the LTO optimization I described above technically legal? Or am I being too paranoid?

@tarcieri
Member

If by "compiler barrier" you mean "a best-effort heuristic to steer the compiler away from things it shouldn't do", then I have no objections to the term (but I think that's not how many people use it)

I’ve heard “optimization barrier” used for this by several people, though if that sounds like it has a property this isn’t providing, perhaps something like “optimization impediment” is clearer?

@newpavlov
Member Author

newpavlov commented Dec 15, 2025

@tarcieri

I think we only need one implementation of memset/bzero per architecture.

I only see disadvantages when compared to the observation asm!.

We don't win anything in language guarantees, while by making it general we prevent the compiler from applying optimizations which we want, such as unrolling loops, reusing zeroed registers, or SIMD-ifying the code when appropriate. And it's obviously much harder to write correctly than slapping the same empty asm! onto all arches.

@RalfJung

@RalfJung Is the LTO optimization I described above technically legal? Or am I being too paranoid?

That sounds like it may be in conflict with self-modifying code which needs the asm blocks to be unchanged. OTOH something like BOLT could do all sorts of "fun" things and we'd probably say it's fine, it's just incompatible with self-modifying code. But you're asking about things way outside of what I can confidently talk about. I suggest you bring this to the t-opsem Zulip; there are other people there that know more about these low-level things.

I’ve heard “optimization barrier” used for this by several people, though if that sounds like it has a property this isn’t providing, perhaps something like “optimization impediment” is clearer?

I always interpreted "barrier" as something that actually guarantees certain reorderings don't happen. So yeah I think a term that makes the best-effort nature more clear would be preferable.

@tarcieri
Member

I only see disadvantages when compared to the observation asm!.

By using asm! to write zeros rather than Rust code, you get actual guarantees those writes will be performed as opposed to a "compiler impediment" which may possibly be outsmarted by a sufficiently smart future compiler, as you yourself have argued (although to be fair, the latter is an exceedingly common way of implementing this kind of primitive).

We get somewhat similar guarantees out of ptr::write_volatile at the cost of performance, so if those writes weren't removed, we'd have a somewhat nice belt-and-suspenders setup were the optimization barrier/impediment to be added.

If we remove the volatile writes to improve performance, I don't think we really have any guarantees at that point, just something future compiler versions are unlikely to see through. Maybe that's fine, as I said earlier it seems fine for many e.g. libc implementations and their memset_s-style functions.
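
A minimal sketch of that belt-and-suspenders combination, assuming x86-64 where asm! is stable (wipe_u64 is a hypothetical helper; the barrier mirrors the shape shown later in this thread):

unsafe fn wipe_u64(slot: *mut u64) {
    // The volatile write must be emitted...
    core::ptr::write_volatile(slot, 0);
    // ...and the asm! block may read the pointed-to memory, so the
    // zeroed state must actually be present when it runs.
    core::arch::asm!(
        "# {}",
        in(reg) slot,
        options(readonly, preserves_flags, nostack),
    );
}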

@tarcieri
Member

Last time I asked about LTO I was told LTO would not touch asm!

@newpavlov
Member Author

newpavlov commented Dec 15, 2025

may possibly be outsmarted by a sufficiently smart future compiler

My point was that it also fully applies to asm!-based implementations of memset. For the above-mentioned BOLT there may be zero difference between a mov we got from a plain write, a write_volatile, or asm!. Such a tool would just see that we perform "useless" writes into memory which gets immediately released.

e.g. libc implementations and their memset_s-style functions.

IIUC they rely on dynamic linking to act as an optimization "impediment" (in other words, on keeping the functions "external" and thus out of the optimizer's reach). We probably do not want to rely on that by default.

@tarcieri
Member

My point was that it also fully applies to asm!-based implementations of memset

The major difference is you're still relying on Rust code to do the actual writing to memory, and I would be a lot more worried about that code being eliminated than I would anything in the asm! block.

IIUC they rely on dynamic linking to act as an optimization "impediment"

@newpavlov I think the most common way to implement them is with an ASM barrier immediately after calling the normal memset, similar to what is implemented in this PR.

Sidebar: we could potentially use those libc functions (#1254)

@newpavlov
Member Author

newpavlov commented Dec 15, 2025

The major difference is you're still relying on Rust code to do the actual writing to memory, and I would be a lot more worried about that code being eliminated than I would anything in the asm! block.

If we are to trust the black-box model of asm!, then the compiler MUST provide a pointer to zeroed memory to the observing asm! block. In a certain sense, the block acts as an external function.

Technically, IIUC the compiler is allowed to allocate a similarly sized buffer on the stack, zero it out, and then pass a pointer to it. But the same hypothetical applies to "secure" memsets as well (i.e. the compiler may call one on a copy of the zeroized object). But I believe we are in the realm of wild speculation at this point.

Comment on lines -414 to +411
- unsafe {
-     volatile_set((self as *mut Self).cast::<u8>(), 0, size_of::<Self>());
- }
+ unsafe { ptr::write_bytes(self, 0, 1) };
+
+ optimization_barrier(self);
Member

I would prefer the removal of volatile writes be done in a separate PR, so this PR does what it says on the label: replaces atomic_fence with optimization_barrier.

The removal of volatile writes removes the longstanding defense this crate has been built on in the past, and I think that much deserves to be in the commit message (and I would also prefer it be spelled out nicely, not jammed in like "replace atomic_fence with optimization_barrier and removes volatile writes").

Member

Also I think we should probably leave the volatile writes in place unless we're using asm!. Otherwise we're relying on black_box as the only line of defense.

Member Author

@newpavlov newpavlov Dec 15, 2025

I would prefer the removal of volatile writes be done in a separate PR

Ok. Will do it tomorrow.

Otherwise we're relying on black_box as the only line of defense.

Well, AFAIK it works without issues on the mainline compiler. Yes, we don't have any hard guarantees for it, but, as we discussed above, that applies to the asm! approach as well. IIRC black_box gets ignored by an alternative compiler (I forgot the name), but I don't think it supports the target arches in question in the first place, and I am not sure how it handles volatile ops.

We could discuss potential hardening approaches for black_box, but I would strongly prefer to concentrate on optimization_barrier doing what we want instead of piling different hacks on top of each other. In other words, I want the snippet in this comment to be practice-worthy.

Member

I think it's unacceptable for black_box to be the only line of defense

Member

@tarcieri tarcieri Dec 15, 2025

I would strongly prefer to concentrate on optimization_barrier doing what we want instead of piling different hacks on top of each other.

I also think this is a gross overstatement of the situation. This PR does not notably make the codebase smaller. What it does do is delete a lot of the SAFETY comments that describe the strategy the library uses.

Had you just changed volatile_write and volatile_set to call ptr::write_bytes, the PR itself could've included considerably fewer changes. They exist so there's a single consistent strategy used throughout the library, and as a place to document how that strategy works.

Adding a little bit of compile-time gating to implement those functions in terms of either volatile or non-volatile depending on if an asm! barrier is in-place is neither particularly complicated nor "piling different hacks on top of each other".

It's a question of "we get the guarantee from a volatile write" vs "we get the guarantee from an asm! optimization barrier" (which is, itself, perhaps overstating the situation). black_box has no guarantees. You would need to delete all the documentation that says this library has guarantees, and that's a change I don't want to make.
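
To make the trade-off concrete, a sketch of the gating described above (has_asm_barrier is a hypothetical cfg; the function name merely echoes the ones discussed here):

// With an asm! barrier at the call site, plain writes suffice.
#[cfg(has_asm_barrier)]
unsafe fn volatile_set(dst: *mut u8, value: u8, count: usize) {
    core::ptr::write_bytes(dst, value, count);
}

// Without one, fall back to per-byte volatile writes.
#[cfg(not(has_asm_barrier))]
unsafe fn volatile_set(dst: *mut u8, value: u8, count: usize) {
    for i in 0..count {
        core::ptr::write_volatile(dst.add(i), value);
    }
}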

Member Author

As I said, we could look into ways to harden black_box. For example, see here. We could also write:

pub fn observe<R>(val: &R) {
    // A volatile read forces the compiler to treat `*val` as observed;
    // `forget` avoids dropping the duplicate that `read_volatile` returns.
    core::mem::forget(unsafe { core::ptr::read_volatile(val) });
}

But, as expected, it results in pretty atrocious codegen. We could improve it a bit by casting the reference to [usize; N] if the size allows, but the #[inline(never)] solution looks better to me. Finally, some of the leftover targets support asm! on Nightly, so we could add an unstable feature for them.
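
For reference, a sketch of the #[inline(never)] variant mentioned above (still best-effort; black_box carries no guarantees):

#[inline(never)]
pub fn observe<R>(val: &R) {
    // An opaque, never-inlined call is hard for current optimizers to
    // see through, though nothing formally forbids it in the future.
    core::hint::black_box(val);
}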

@RalfJung

By using asm! to write zeros rather than Rust code, you get actual guarantees those writes will be performed

FWIW, I agree -- asm! blocks are the best option to ensure that something very specific actually happens in the final binary. Volatile reads/writes are basically on par; they are almost equivalent to an asm! block that just does a single read/write.

@newpavlov
Member Author

newpavlov commented Dec 16, 2025

@RalfJung
Is there any difference in zeroization reliability between the following two Drop impls?

pub struct Foo {
    a: u64,
}

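// Variant 1: zero the field in Rust, then "observe" it from an opaque asm! block.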
impl Drop for Foo {
    fn drop(&mut self) {
        self.a = 0;
        unsafe {
            core::arch::asm!(
                "# {}",
                in(reg) &self.a,
                options(readonly, preserves_flags, nostack),
            );
        }
    }
}

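// Variant 2: perform the zeroing write inside the asm! block itself (x86-64 assembly).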
impl Drop for Foo {
    fn drop(&mut self) {
        unsafe {
            core::arch::asm!(
                "mov qword ptr [{}], 0",
                in(reg) &mut self.a,
                options(preserves_flags, nostack),
            );
        }
    }
}

Assuming that the compiler is forbidden from analyzing the asm! body, I don't see any legal way for it to remove the write in the former case that would not also affect the latter case.

@RalfJung

As I said above, that's outside my expertise and I recommend asking on the t-opsem Zulip. At least to a first-order approximation, I'd prefer the 2nd variant as that guarantees that the actual instructions you wrote there are in the final binary, so you don't even have to begin reasoning about what exactly the compiler may or may not do.
