RISC-V has a constant time coding extension (Zkt, https://riscv-software-src.github.io/riscv-unified-db/manual...). I believe the idea of this is that your CPU will have a dedicated core that implements Zkt and performs a known set of operations in constant time. I'm not sure that anyone has actually implemented it.
Do other ISAs have anything similar?
Yes, x86:
https://www.intel.com/content/www/us/en/developer/articles/t...
Oooh, has it shipped anywhere already?
Everywhere. The constant-time bit defaults to "1" because current x86 CPUs already follow all of the promises of this. The feature exists to allow Intel to later add optional optimizations that are not constant-time for that particular subset of instructions.
ARM cores also have a similar set of promises.
No, Zkt is part of the application profiles, so it is expected to be implemented by high-performance out-of-order cores. It just defines a list of instructions that must have "data-independent timing".
Zkt is probably widely implemented even if not declared. Since most RISC-V cores are towards the lower end of the spectrum (in-order, etc.), you'd have to go to extra effort not to implement it.
The article details how current techniques for preventing timing attacks are becoming ineffective, for the following reasons:
> 1. Compilers are applying optimization techniques that are heuristically good for performance on general software, but happen to leak information through timing-based sidechannels when used on secret data. Since general-purpose CPUs and compilers are applied to a wide variety of tasks, compilers have no reason to stop doing such optimizations.
> 2. Constant-time coding techniques mostly aim at fooling the compiler, to prevent it from applying these optimizations. Since compilers keep getting smarter, these techniques lose their efficiency.
> 3. Just-in-time (JIT) compilers have access to runtime data; they can, and will, use such information to perform extra optimizations that static compilers cannot do, and further destroy the developer’s attempt at achieving constant-time code.
> 4. JIT compilation is becoming pervasive and is now employed in various contexts, in particular inside CPUs. Such compilers can do the same optimizations as “external” JIT compilers, but are not publicly documented, and the output of their “machine” code translation cannot be inspected to detect deviations from constant-time execution.
> 5. Modern hardware and software stacks use a very layered structure, with each layer striving to hide its internal functioning details from upper layers. The socio-economic context which allowed such hardware to exist inherently relies on such abstraction (under the name of “industrial secrecy”). An effective constant-time coding process would require a model of computation with strong guarantees on timing-related characteristics, that would be maintained through all these layers, down to the semiconductors that implement the logic gates. This industry-wide vertical cooperation is unlikely to happen.
Why is cooperation unlikely? AFAIK it’s not too hard to make a compiler support a function attribute that says “do not optimize this function at all”; I assume in most compilers the translation and optimization phases are mostly separate, and inter-procedural optimizations can treat such functions as externally linked (untouchable). I’m less familiar with assembly, but couldn’t processors expose instructions to enable or disable their JIT?
I imagine it’s not *easy* easy, because it requires every layer to cooperate, but if there’s demand I think it would be done.
IPv6 is in a similar situation where it requires widespread adoption, but I also suspect most people have issues with it and there isn’t much demand because NATs and whatever support for cellular networks has made it unnecessary.
> Why is cooperation unlikely? AFAIK it’s not too hard to make a compiler support a function attribute that says “do not optimize this function at all”
Compilers like Clang actually generate terrible unoptimized code; it's expected that a sufficiently smart optimizer (of which LLVM is one) will clean it up anyway, so Clang makes no attempt to generate good code on its own. Rust is similar. For example, a simple for-loop's induction variable is stored to and loaded from an alloca (i.e. the stack) on every use rather than being an SSA variable, so one of the first passes in the optimization pipeline promotes those allocas to SSA registers. Disabling that alone would cost a ton of performance, never mind the impact on instruction combining, value tracking, and scalar evolution, and crypto is pretty performance-sensitive, right after security.
BTW, Clang/LLVM already has such a function-level attribute, `optnone`, which was actually added to support LTO. But it's all or nothing; LLVM IR/Clang doesn't have the information needed to know which instructions are timing-sensitive.
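For illustration, here is roughly how that attribute gets applied (a sketch of a constant-time byte comparison, not code from the article; `optnone` is Clang-specific, and it only keeps the optimizer's hands off the function body — it promises nothing about what the CPU does):

    #include <stddef.h>
    #include <stdint.h>

    /* Clang: disable optimization for this one function. `noinline` keeps
       callers from inlining (and then re-optimizing) the body. */
    __attribute__((noinline, optnone))
    int ct_memcmp(const uint8_t *a, const uint8_t *b, size_t len)
    {
        uint8_t diff = 0;
        for (size_t i = 0; i < len; i++)
            diff |= a[i] ^ b[i];   /* accumulate differences, no early exit */
        return diff != 0;          /* 0 if equal, 1 otherwise */
    }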
> it’s not too hard to make a compiler support a function attribute that says “do not optimize this function at all”
Note that memset_s() is an existing special case already. https://en.cppreference.com/w/c/string/byte/memset
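For reference, a minimal sketch of that special case (C11 Annex K; note that glibc does not ship Annex K, so on many platforms you would reach for explicit_bzero() or SecureZeroMemory() instead):

    #define __STDC_WANT_LIB_EXT1__ 1   /* opt in to Annex K */
    #include <string.h>

    void wipe_key(unsigned char *key, size_t len)
    {
        /* Unlike a plain memset of a dying buffer, memset_s may not be
           elided by the compiler as a dead store. */
        memset_s(key, len, 0, len);
    }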
So basically compilers, and the optimizations inside CPUs, will keep aggressively optimizing code and re-creating the vulnerability.
Perhaps this indicates a missing concept in our compilation and execution of code: metadata that demands that a section not be optimized.
At least on the compiler level, LLVM already offers a "black box" intrinsic that nudges the optimizer to be maximally pessimistic about whatever it wraps.
It doesn't carry any security guarantees (Rust's documentation on their wrapper for it explicitly strikes out "cryptographic or security" purposes) but one could imagine a similar intrinsic with stronger abilities.
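For illustration, the moral equivalent in C (GCC/Clang) is an empty inline-asm "optimization barrier" that makes the compiler treat a value as opaque. Like Rust's black_box, this is a sketch of an optimization barrier only, not a security guarantee:

    #include <stdint.h>

    /* The empty asm claims to read and modify x, so the optimizer cannot
       constant-fold or branch on anything it "knows" about the value. */
    static inline uint32_t value_barrier(uint32_t x)
    {
        __asm__ volatile("" : "+r"(x));
        return x;
    }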
> Everyone will be forced to start vibe-coding, a fad type coding where you don't know the language, but you hope that the AI does, and will spin out reams of useless unintelligible random gibberish that no one understands, hopefully to compile into something that only by luck produces any answer at all, for investors to shout for glee, until CloudStrikes strikes again, and there will be no one to explain how it happened. Computers were not that useful in the first place.
I mean, I can walk you through how my C# gets compiled down to IL, but start going lower and, certainly by the time you're at assembly and machine code, it's as inscrutable to me as how AI spits out answers.
The days when someone could explain how every piece of these machines worked are long gone
It's surprisingly straightforward. If you run your application with `DOTNET_JitDisasm='methodName'` you will get the exact code that gets executed for a given method. It's pretty well-annotated and readable. You can also control JIT behavior with e.g. the AggressiveOptimization flag (which can ironically make performance worse); it bypasses tiered compilation and PGO optimization, always producing the same output.
But other than that yes, with techniques like load value prediction and memory-data dependent prefetching there is little you can do at the codegen level.
It’s as easily readable to me as well-formed, college-level French.
I can recognize some shared concepts and a general vibe, maybe, but I am materially incapable of reading it without hundreds of hours of study.
If you post a small snippet I can give a quick breakdown of what's going on. Assembly at its core is very simple - it's primarily just register moves, loads/stores, arithmetic operations, bitwise manipulation, calls and returns. As soon as you get familiar with the syntax, even without in-depth knowledge, you can get a general idea of what's going on when comparing to the original code side-by-side (obviously it's not going to work with async or methods with yield returns, but you get the idea).
Can an integrated CPU-FPGA design (like the ones AMD sells) be a solution to this problem? Point 5 of the paper says:
> Modern hardware and software stacks use a very layered structure, with each layer striving to hide its internal functioning details from upper layers. The socio-economic context which allowed such hardware to exist inherently relies on such abstraction (under the name of “industrial secrecy”). An effective constant-time coding process would require a model of computation with strong guarantees on timing-related characteristics, that would be maintained through all these layers, down to the semiconductors that implement the logic gates. This industry-wide vertical cooperation is unlikely to happen
It's not mentioned in the paper, but hardware description languages provide strong guarantees on timing-related characteristics at the semiconductor level. Hardware companies are willing to provide user access to FPGAs, as evidenced by the userspace FPGA subsystem in the Linux kernel.[1] You can buy Versal SoCs right now that combine ARM CPUs with FPGAs on one die, run Linux on them, and outsource your crypto tasks to the FPGA.
I see a future where we give up on constant-time coding for CPUs and move crypto libraries to FPGAs, given the hundreds of billions of dollars in our economy reliant on strong encryption. That'd be a world in which every corporate laptop is required to have an FPGA for compliance purposes.
It'd be interesting to know if anyone has tried this.
[1] https://www.phoronix.com/news/FPGA-User-Space-Interface-AMD
FPGAs are very expensive, so a TPU that implements the most important algorithms might be much cheaper, because the design cost can be amortized over the huge number of chips required.
I believe the challenge that FPGAs can address is sociotechnical, because the developer of a crypto library will have much more control over the stack and does not need to trust others.
Many high-frequency trading companies use FPGAs over ASICs for similar reasons. FPGAs are more expensive but allow them to have full control over the algorithms implemented and don't require giving secret information to a foundry.
In other words, eliminate the impedance mismatch between responsibility and control for developing a secure system.
It'll be cheaper to implement cryptography on an ASIC. But the author of this paper wants to control every single aspect of the encryption/decryption process. Said developer can confidently say the system is secure. You can't say you've delivered a secure system if you're getting your ASICs from another company that doesn't provide implementation details because it'd (justifiably) give others a competitive advantage.
The question I'd have is whether the cost difference between ASICs and FPGAs is worth it for the majority of applications. $1 or $10 on every CPU might mean a world in which every laptop has an FPGA, but $100? What about for server-side applications? Would a hyperscaler spend $1000 extra on every rack if it allowed guaranteed constant-time encryption?
It’s not about “giving secret information to a foundry”. It’s entirely the field-programmable (FP) feature. It’s also not really programmable in the sense that you would be sending in new instructions in real time; reconfigurable is a better word. So giving everyone an FPGA in their laptop isn’t really going to help anyone except some enthusiast who wants to try out some different algorithms.
My impression is that there’s a lot of mental shorthand in the chip design community and FPGAs are used for prototyping and then translated into ASICs for any niche or larger applications. I presume there’s a pretty straightforward translation process between the two, though no one has ever tried to explain it to me.
A very simple description of an FPGA is that it's got a bunch of switches on the ends of wires. Some of the switches can connect wires to logic elements and some can connect wires to other wires. In this view, programming an FPGA is just toggling switches so that some of them are "on" and the rest are "off".
The easiest migration from FPGA to ASIC is to just make a chip with the same circuit elements and wire segments, but instead of making switches, you just short out connections in the "on" state and leave the rest open.
The FPGA idea raises security questions of its own - how do you get the bitstream over securely, then the data and results, without leaking the data you're trying to protect or being vulnerable to evil clones of the bitstream? Or the FPGA debug JTAG interface?
Meanwhile Windows is requiring that every laptop have a small crypto processor for its own compliance processes (i.e. BitLocker).
I would assume the bitstream would only contain non-secret implementation details and keys would be loaded at runtime rather than being synthesized into the design.
In terms of moving the data over to the FPGA, I have no idea. But if it's all on the same die, it doesn't seem like there's a big physical attack surface that's different than just doing stuff on the CPU.
I’m confused by the quoted text because timing attacks rely on at-most behavior and abstraction layers on at-least behavior.
Abstractions cannot send messages before they receive them. So any delay you add at the top must be magnified as you go down. The exception to this is if the contents of the message require different behaviors for different payloads, in which case they are correct. But encrypted payloads are opaque to the layers they are sent through. You can observe who the message was sent to, and know the maximum amount of data the conversation might have contained. But not a very clear idea of how long it took to build the message, unless you’ve already compromised the machine.
Recent timing attacks rely on the observation that modern CPUs send some messages faster than other messages, based on predicting what they might contain, so some delays get magnified (those that deviate from expectations) while other delays (those that match prior data) get minimized as you go down. An encrypted payload leaks this same information too (the process is independent of what data is being transferred), although that leaked information is (hopefully) not useful since it just leaks the encrypted data, which (hopefully) looks like random noise. But that data has to be encrypted and decrypted at some point somewhere, which gives a point to attack.
> The intent is to produce the mask values without using a plain comparison (“x != 0”) so that the compiler does not notice that we are really making a conditional move. The compiler is not fooled.
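For reference, the kind of mask-based conditional move being described looks roughly like this (my own sketch, not the paper's condmove() code). The whole game is that a sufficiently clever optimizer may recognize it as a select anyway and lower it however it likes, which is exactly the paper's point:

    #include <stdint.h>

    /* Select a if flag is nonzero, else b, without writing "flag != 0". */
    static uint32_t ct_select(uint32_t flag, uint32_t a, uint32_t b)
    {
        /* mask is all-ones if flag != 0, all-zeros otherwise */
        uint32_t mask = (uint32_t)0 - ((flag | (0U - flag)) >> 31);
        return (a & mask) | (b & ~mask);
    }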
So practice your martial arts on the bow of a rowboat. Balance, Danielsan.
The whole timing thing is essentially an issue of getting off-balance as in martial arts and the solution is that there are actions you simply never do, and ones that look like they lack economy of motion because so much of the work is in keeping your center of gravity from shifting outside of your base. Tai chi is infinitely better at protecting you from concussing yourself on black ice than it is at protecting you from a mugger.
So if you have a conditional move for one calculation, then move a to b if !c and move it to d if c. And when the compilers chase that down in another generation, or someone notices that the L1 cache timing is different if d gets cold, then use both calculations in the output. Calculate complements where the runtime of a + !a = 1 to several decimal points.
Will it work for eternity? Unlikely. Will it work for a decade? Probably.
At https://rwc.iacr.org/2025/program.php you can see there is a talk scheduled to be given in a couple weeks titled "Testing Side-channel Security of Cryptographic Implementations against Future Microarchitectures" with the following bits in the abstract: "Using this framework, we conduct an empirical study of 18 proposed microarchitectural optimizations on 25 implementations of eight cryptographic primitives in five popular libraries. We find that every implementation would contain secret-dependent leaks, sometimes sufficient to recover a victim’s secret key, if these optimizations were realized. Ironically, some leaks are possible only because of coding idioms used to prevent leaks under the standard constant-time model."
So far, SIMD operations are constant time, since the whole vector needs to be computed, and not just a single element of it.
ChaCha20 or Kyber might be using that. Although, I don’t know if any cryptographic algorithms depend on vector-only calculation as a basis for their strength.
Alternatively, I suppose that specialised cryptographic CPUs could be made to follow strict standards ensuring constant-time runtimes.
> SIMD operations are constant time, since the whole vector needs to be computed
Is that actually guaranteed? Having to do the whole vector doesn't necessarily mean the time can't vary for different values for complex operations. (like division)
I can't find the details, but https://gleissen.github.io/papers/iodine.pdf at least mentions "Dually, if timing variability is unavoidable, e.g., in SIMD or floating-point units, making this variability explicit can better inform...", so it sounds like SIMD is at least situation-specific here.
I wonder if you could have a compiler that intentionally pads every instruction to use a vector register pair of <value, slow-const>, so that every value gets paired with whatever constant or other expression will cause the slowest execution of that vector instruction and with the maximum latency chain. Otherwise, it does look like the VDIV instructions can have variable latencies based on the data itself (and not just spectre-like speculation issues on the memory access patterns).
From https://uops.info/table.html
You don't have any guarantee that the SIMD operations are actually done in parallel (which is one of the assumptions needed for “the latency matches that of the slowest element”). E.g., I believe VPGATHERDD is microcoded (and serial) on some CPUs, NEON (128-bit) is designed to be efficiently implementable by running a 64-bit ALU twice, and AVX2 and AVX-512 are likewise double-pumped on some (many) CPUs…
It's not guaranteed. See section 7 of https://www.usenix.org/system/files/conference/usenixsecurit...
In MD5, there are calculations that, if they result in certain data ranges in certain bytes, trigger a secondary calculation to spread that change into other bytes. Like a non-Euclidean carry operation.
So I would think if you treat all of the bytes as one SIMD instruction, do a SIMD compare, and then fire another operation as a result… if you could guarantee that the third operation would have to fire for at least one of the vector values every time, then I think you’d have your solution.
I am familiar with the low-level arithmetic in MD5 (e.g. https://www.nayuki.io/res/cryptographic-primitives-in-plain-... ). Exactly which calculations are you pointing to that result in a secondary calculation?
Hmm. I could have sworn the rotation had a component of the calculation in it, but you're right, it's just the loop counter. SHA-1 and AES also vary by the round, but none of SHA-2, DES, or MD4 have either round or state variance, so if I'm misremembering a different algorithm it's pretty obscure. False memory, I guess.
For the 5 algorithms you mentioned, I know them well.
> SHA-1 and AES also vary by the round
Vary what by the round? SHA-1 ( https://www.nayuki.io/res/cryptographic-primitives-in-plain-... ) is structured very similarly to MD5, with one big difference being that SHA-1 doesn't vary the rotation by the round number. AES has a public sequence of round constants that gets applied to the key expansion, but otherwise doesn't do anything special to the ciphertext per round.
The logic that can cause timing issues includes using a memory lookup table for the S-box and finite-field multiplication in AES, as well as data-dependent rotations in rare ciphers like https://en.wikipedia.org/wiki/RC5 and RC6.
The tables have always made people nervous anyway. And the compromised EC entry just sealed the deal. How do we know they aren’t chosen by the NSA for mischief, even now that they should know better that someone else will figure out their trick too?
Most constant-time code today is done in assembly, which is a mild downgrade from when it was done in C.
Yes. I keep arguing that C is further and further from pretending to be a "portable assembler", and rather than trying increasingly elaborate tricks to get the compiler to generate the series of machine instructions you want, you should just... emit those instructions yourself in the first place.
The mention of CMOV is good. Additional tricks are available through vectorization - necessarily all elements of the vector have to execute at the same speed, which gives you extra options to put crafted data in a different column to force computation to proceed at a known rate.
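As a sketch of "emit those instructions yourself" (x86-64, GCC/Clang inline asm; my own example, not from the article), you can pin the selection to an actual CMOV so the compiler cannot turn it back into a branch — what the CPU then does with CMOV is a separate question:

    #include <stdint.h>

    /* r = cond ? a : b, forced through TEST + CMOVNE. */
    static uint64_t cmov_select(uint64_t cond, uint64_t a, uint64_t b)
    {
        uint64_t r = b;
        __asm__("test %1, %1\n\t"
                "cmovne %2, %0"
                : "+r"(r)
                : "r"(cond), "r"(a)
                : "cc");
        return r;
    }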
We should not forget that some CPUs offer dedicated instructions - e.g. Intel AES-NI.
Another solution is to just stop trying to do it on the general purpose processor altogether and move it to some sort of HSM or TPM. Put your eggs in a smaller, more defensible basket. In a world of GPUs and AI accelerators it's easier to ask for a crypto accelerator (or rather decelerator!).
Is there any good work on constant-time crypto on the GPU?
If anyone manages to make homomorphic encryption viable, that provides another solution, at a huge performance penalty.
I don't think it's anywhere close to viable to move the cryptographic parts of the data plane into HSMs/TPMs. There's just too much work to do there, and you have to move plaintext over unsecured channels to do it. That means that you have to put at least an ephemeral key into the CPU, and the rest follows.
AES-NI, the SHA instructions, and constant-time subsets of instructions are generally good enough that you can do this in assembly.
editorialized title, originally "Constant-Time Code: The Pessimist Case"
Most of the points brought up in the article have been happening already for a decade (or decades); it's not something that 'will soon' happen.
Give me hardware with a discrete cryptographic processing unit which is just a CPU that has no speculation, caching, or other nonsense. In other words, a CPU that actually works the way it's supposed to work. I don't care that idiots want their wrong software to be as fast as possible; I want correct software that's as fast as possible, and no faster.
As the paper notes (page 21 last paragraph), this isn’t just about cryptographic algorithms though, it’s about the processing of any “secret” data.
A TPM, HSM, or Intel AES-NI type solution?
You know given the rise of chiplets, a custom core isn’t a stupid idea…
Are we going to end up doing all our crypto on the big.LITTLE cores?
I wonder if they’re fast enough…
But they want to sell as many cpus as possible, right? So that market wins.
You can easily buy such CPUs in the microcontroller market, just go ahead. They're being sold by the billions.
Presumably, performance also matters.
Well, if you say you don't want caching and speculation, then you've essentially given up on performance already. There's a reason why these small cores are slow.
x86 and ARM both have options for executing certain instructions with data-independent timing. The problem here is that languages and compilers need to expose and honor data types that compile down to only those instructions.
This would be something like a 'secret' keyword that would qualify integer types; i.e. in C you could have a variable of type 'secret unsigned int' and the compiler would reject optimizations that would reveal timing information about the variable's contents, while optimizing the rest of the program's non-secret variables. Values can be promoted to secret but not demoted. Code that intends to reveal secrets (e.g. to write them out to secure storage) must explicitly cast away the secretness.
AFAIK Golang has crypto/subtle, but that's more special functions to do certain things in a data-independent way and not pervasive language support for it (and, presumably, they have to keep updating the compiler to not optimize that specific module).
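To make the idea concrete, here's a rough library-level sketch in C of what a 'secret' integer could look like today, with made-up names (secret_u32, ct_eq, secret_reveal); a real type qualifier would let the compiler enforce the promotion/demotion rules instead of relying on programmer discipline:

    #include <stdint.h>

    typedef struct { uint32_t v; } secret_u32;     /* deliberately opaque */

    static secret_u32 secret_from(uint32_t x)      { return (secret_u32){ x }; }
    static uint32_t   secret_reveal(secret_u32 s)  { return s.v; }  /* explicit demotion */

    /* Returns a secret 1 if a == b, secret 0 otherwise, with no
       data-dependent branch on the contents. */
    static secret_u32 ct_eq(secret_u32 a, secret_u32 b)
    {
        uint32_t x  = a.v ^ b.v;                         /* 0 iff equal */
        uint32_t eq = (uint32_t)(((uint64_t)x - 1) >> 63);
        return (secret_u32){ eq };
    }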
I didn't read TFA, but does this imply that all encryption needs to be done in a special enclave with dedicated hardware (that isn't vulnerable to this)?
One whole technique not mentioned in the paper or comments is bitslicing. For non-branching code (e.g. symmetric ciphers) it's guaranteed constant-time and it would be a remarkable compiler indeed which could introduce optimizations and timing variations to bit-sliced code...
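A minimal sketch of the idea (my own toy example): pack 64 independent 1-bit values into each uint64_t "bit plane", so one word-wide logic operation evaluates the same Boolean circuit for all 64 instances at once, with no data-dependent branches or table lookups for a compiler to specialize on:

    #include <stdint.h>

    /* One bitsliced full-adder step: 64 independent bit additions at once,
       one instance per bit position of the operands. */
    static void bitsliced_full_add(uint64_t a, uint64_t b, uint64_t cin,
                                   uint64_t *sum, uint64_t *cout)
    {
        *sum  = a ^ b ^ cin;                   /* 64 sum bits */
        *cout = (a & b) | (cin & (a ^ b));     /* 64 carry bits */
    }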
The author of the paper knows about bitslicing [1], so not mentioning it seems deliberate.
My guess is that bitslicing only gets you so far.
[1]: https://bearssl.org/constanttime.html#bitslicing
One old technique that can counter this is as follows:
1. Clock how long the operation takes on any type of data. By "operation" I mean something like a function call for a whole block or buffer.
2. Add a small margin of time to that.
3. Modify the function call to make sure it has waited at least that long before returning a response, whether it finished early or not.
The result prevents the function call from leaking timing information (a rough sketch follows below).
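A rough sketch of steps 1-3 above, using POSIX clock_gettime/nanosleep. BUDGET_NS and do_crypto_operation() are made-up placeholders, and the budget must exceed the worst-case runtime with margin or the scheme leaks on slow runs (the replies below get into why this still isn't enough):

    #define _POSIX_C_SOURCE 199309L
    #include <time.h>
    #include <stdint.h>

    #define BUDGET_NS 2000000L            /* 2 ms: assumed worst case + margin */

    extern void do_crypto_operation(void);   /* hypothetical workload */

    void padded_operation(void)
    {
        struct timespec start, now;
        clock_gettime(CLOCK_MONOTONIC, &start);

        do_crypto_operation();

        clock_gettime(CLOCK_MONOTONIC, &now);
        int64_t elapsed = (int64_t)(now.tv_sec - start.tv_sec) * 1000000000L
                        + (now.tv_nsec - start.tv_nsec);
        if (elapsed < BUDGET_NS) {
            struct timespec pad = { 0, (long)(BUDGET_NS - elapsed) };
            nanosleep(&pad, NULL);        /* burn the rest of the budget */
        }
    }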
High-assurance, security certification in the 1980's also required mitigating covert, storage channels. That would require, at a minimum, overwriting any memory used during the call and all the CPU registers before returning to the caller.
It might make the exploits mildly more difficult in some cases, but adding a synthetic delay (whether randomized or artificially inflated to meet a certain high water mark) isn't likely to help.
https://security.stackexchange.com/a/96493/271113
For example, consider the case of a cache-timing leak (rather than the classical "string compare" one). You'd have to have a lot of control over the behavior of your operating environment to guarantee it doesn't leak bits of your secret from cache misses.
When you add a delay to your processing, unless the same tasks are being executed, I would expect power-analysis leakage and random jitter from the kernel's scheduler to reveal that fact. It might be a more costly analysis to detect, but it's still there.
Generally, I think this is a doomed line of reasoning.
Actually, the link you provide seems to support the parent comment's suggestion, rather than detract from it.
The previous comment was suggesting making sure that every code path takes the same amount of time, not adding a random delay (which doesn't work).
And while I agree that power-analysis attacks etc. are still going to apply, the over-arching context here is just timing analysis.
The link I provided is about random delays being inferior to setting a high water mark, yes.
I'm not just echoing the argument made by the link, though. I'm adding to it.
I don't think the "cap runtime to a minimum value" strategy will actually help, due to how much jitter your cap measurements will experience from the normal operation of the machine.
If you filter it out when measuring, you'll end up capping too low, so some values will be above it. For a visualization, let's pretend that you capped the runtime at 80% of what it actually takes in the real world:
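    function biased() {
      // runtime never drops below the 0.8 "cap", but still varies above it
      return Math.max(0.8, Math.random())
    }
    let samples = [];
    for (let i = 0; i < 1000; i++) {
      samples.push(biased());
    }
    // Now plot `samples`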
Alternatively, let's say you cap it sufficiently high that there's always some slack time at the end. Will the kernel switch away to another process on the same machine?
If so, will the time between "the kernel has switched to another process since we're really idling" to "the kernel has swapped back to our process" be measurable?
It's better to make sure your actual algorithm is actually constant-time, even if that means fighting with compilers and hardware vendors' decisions.
Cache and power need to be exploited locally, generally. Introducing random delays to raise the noise floor would work for network services, I believe.
> power need[s] to be exploited locally
Not in the presence of DVFS, it turns out: https://www.hertzbleed.com/hertzbleed.pdf
It depends.
AES cache-timing was broken over a network (but required, like, 2^28 samples).
I wouldn't bet the farm on this line of thinking providing resilience. It might be just annoying enough for an attacker to not really bother. (Or maybe only if the target is, like, a software cryptocurrency wallet with enough value to loot if they're successful.)
My immediate thought upon seeing the condmove() example was "Slap 'volatile' on all those local variables". This forces a standards-compliant compiler to reload their values every time they are referenced. Basically, the compiler is no longer allowed to assume that the current thread of execution is the only actor that updates the variable -- any other mechanism (including another thread, but not limited to that -- e.g., a hardware DMA channel) could have modified the variable's value.
That doesn't solve the hardware-JIT-level problem, but it would appear to solve the compiler-level problem.
ETA: I only read up to p. 3, but I did search for the text "volatile" and didn't find it.
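For illustration, the same kind of mask-based select with the operands forced through volatile locals (my own sketch, not the paper's condmove()): the compiler must perform every read, which makes it much harder for it to see through the arithmetic, at the cost of a stack round-trip per access and with no say over what the CPU does downstream:

    #include <stdint.h>

    static uint32_t condmove_volatile(uint32_t flag, uint32_t a, uint32_t b)
    {
        volatile uint32_t f = flag, x = a, y = b;   /* may not be cached or folded */
        uint32_t mask = (uint32_t)0 - ((f | (0U - f)) >> 31);
        return (x & mask) | (y & ~mask);
    }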
(Url changed from https://eprint.iacr.org/2025/435, which points to this.)