Skip to content

Enforce JVM UTF-8 string limits in the backend#25300

Open
SolalPirelli wants to merge 4 commits intoscala:mainfrom
dotty-staging:solal/jvm-string-max
Open

Enforce JVM UTF-8 string limits in the backend#25300
SolalPirelli wants to merge 4 commits intoscala:mainfrom
dotty-staging:solal/jvm-string-max

Conversation

@SolalPirelli
Copy link
Contributor

Fixes #15850
Fixes #24597
Subsumes #19622

Enforce UTF-8 limits at bytecode generation time.

def genConstant(const: Constant): Unit = {
(const.tag/*: @switch*/) match {
private def genConstant(const: Constant, pos: SrcPos): Unit = {
(const.tag: @switch) match {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why that @switch was commented out but it seems to work fine so might as well

val mirrorMethodName = m.javaSimpleName
if !BCodeUtils.checkConstantStringLength(jgensig, mirrorMethodName, mdesc) then
report.error("Mirror method signature is too long for the JVM", m.srcPos)
return
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like having to duplicate this code for mirror generation, IMHO the backend should move towards a model where we first decide which classes and methods we're going to create, including JVM-specific synthetic stuff, and then emit these, but that's for later


object BCodeUtils {
val MAX_BYTES_PER_UTF8_CONSTANT = 65535
/** Checks that the given signature if present, or the concatenation of the given name and descriptor, do not exceed the JVM's UTF-8 text size limits. */
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very happy about having to implement length detection for Java's subset of UTF-8, but we can't reuse ASM's because it's private and catching exceptions to test their message would be both ugly and too late to provide a position in the error...

|$showErrors
|${expected.mkString("Unfulfilled expectations:\n", "\n", "")}
|${unexpected.mkString("Unexpected errors:\n", "\n", "")}
|""".stripMargin.trim.linesIterator.mkString("\n", "\n", "")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not directly related but it annoyed me when I put an // error on the wrong line not to have information on what went wrong. This is copied and adapted from the case just above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added that code several times! I just saw it again in this refreshed PR from 2022.
#15624

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good to know I'm not the only one... that PR added it for the 2nd case though (expectedErrors == 0), so this is not even a conflict :)

@SolalPirelli SolalPirelli marked this pull request as ready for review February 18, 2026 08:53
@SolalPirelli SolalPirelli requested a review from lrytz February 18, 2026 08:53
import dotty.tools.dotc.core.Contexts.*
import dotty.tools.dotc.core.Decorators.em

import scala.collection.JavaConverters.asScalaIteratorConverter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accidental?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I missed this file while self-reviewing, thanks for pointing it out

object BCodeUtils {
val MAX_BYTES_PER_UTF8_CONSTANT = 65535
/** Checks that the given signature if present, or the concatenation of the given name and descriptor, do not exceed the JVM's UTF-8 text size limits. */
def checkConstantStringLength(sig: String | Null, name: String, desc: String = ""): Boolean = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this method is a bit cryptic, having multiple methods (for signature, name+desc) would make things more clear. Eg, it's weird that we're passing a non-null signature and also a name at some callsites, but then the name is ignored.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I simplified it

if c <= 0x7F then 1
else if c <= 0x7FF then 2
else if Character.isHighSurrogate(c) || Character.isLowSurrogate(c) then 2
else 3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you provide a reference (in a code comment) to where this is specified?

also the comment above says "can't be more than 4 bytes per char" but here we never have more than 3.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added details in a comment at the top, which also led me to realize I had forgotten to handle the "null char" case because Java uses "modified UTF-8" which is... weird.

And indeed it's actually 3 bytes per char, I had mixed up char and codepoint in my mind when writing that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

compiletime.testing.typeCheckErrors can result in UTF8 string too large in JVM codegen "UTF8 string too large" in GenBCode

3 participants

Comments