Typing Objects in SSA form

_{Diving into how JSSAT upholds SSA form for objects in spite of mutability, loops, and cyclicity}

_{jssat • draft • purged • 2021-08-22 (edited: 2021-09-23)}

SSA form is a very common framework found in many compilers that makes writing optimization, tooling, and compilers surrounding an IR much easier to reason about. SSA form states that every variable can be assigned to exactly once, and must be defined before it is used. In JSSAT, there is the concept of an object as a list of key-value pairs, where keys (either known such as asdf or unknown such as "some kind of String") are paired to values. An object can be mutated and store references to other objects. Modelling the kind of object described is particularly challenging when the type of that object is not specified or known up-front. The remainder of this post is dedicated to explaining how JSSAT models objects, all within SSA form, by presenting a series of challenges and following up with a solution.

Preliminary: Loops

Before this article begins, having a solid concept of what a function is should be outlined. The headaches of loops are eliminated in the model used by JSSAT. The previous article goes into more depth, but all that is necessary to know is the basic gist of how loops are eliminated. Consider a program such as this one:

function sum(max) {
  let total = 0;
  for (let i = 0; i < max; i++) {
    total += i;
  }
  return total;
}

In JSSAT, that function will get broken up into a series of smaller ones:

function sum(max) {
  let total = 0;
  let i = 0;
  goto sum_check(max, total, i);
}

function sum_check(max, total, i) {
  let condition = i < max;
  if (condition) {
    goto sum_loop(max, total, i);
  } else {
    goto sum_end(total);
  }
}

function sum_loop(max, total, i) {
  let total2 = total = i;
  let i2 = i + 1;
  goto sum_check(max, total2, i2);
}

function sum_end(total) {
  return total;
}

Here, goto can be thought of as a guaranteed tail call. In addition, the conditional if statement is only allowed to be present at the end of a function. For more details on the specifics, it is again encouraged to read the previous post on JSSAT.

What model to use for an object?

In order for JSSAT to model an object, it must choose a model for an object that is aligned with the goals of JSSAT. Consider a struct in C:

struct Person {
  char* name;
  int age;
}

This C struct matches our definition of an object. It is a list of key-value pairs - there are two keys, name and age, which can have values within them. Unlike JSSAT, C does not support dynamic keys at runtime (and thank goodness it doesn't). Now, consider the usage of such an object within the following code:

struct Person new_person(char* name, int age) {
  struct Person person;
  // (1)
  person.name = name;
  // (2)
  person.age = age;
  // (3)
  return person;
}

In JSSAT, abstract interpretation is used to compute all possible paths of control flow, using information available throughout the program. What information is available at (1), (2), and (3) in the above C program?

At (1) it is known that a struct Person exists and can be referred to via person. Enough bytes on the stack have been allocated to contain one Person, but the value of those bytes are unknown.
At (2), it is known that person.name is now equivalent to char* name. The value of person.age is still unknown.
At (3), it is known that both person.name and person.age are defined with known values.

Now, consider the equivalent JavaScript program:

function newPerson(name, age) {
  let person = {};
  // (1)
  person.name = name;
  // (2)
  person.age = age;
  // (3)
  return person;
}

Now, the information is available at (1), (2), and (3) in the above JavaScript program is as follows:

At (1), it is known that person is a freshly created basic object. Operations such as person.toString() will work because the object inherits Object's prototype. Accessing any custom properties, such as the key name or age, will produce the value undefined.
At (2), it is now known that the name key of person is now equivalent to whatever the value of the first parameter was. person.name is equivalent to name.
At (3), it is now known that age too has also been set, similarly to name.

There's something fundamental to realize here - at every point in the program, what is known about person has changed, which can be used to prove different things. For example, in this C program at (1), nothing about the value of name could be proven except that it is unknown because a value had not been assigned to it. Afterwards (at (2)), it would be provable that if name != 0, then so too would person.name != 0. What sort of object model would arise if objects were thought about as displayed here?

The model of an object

Throughout the program, what is known about an object is often changing. That is, the facts about that object change. Consider the JavaScript program from before:

function newPerson(name, age) {
  let person = {};
  // (1)
  person.name = name;
  // (2)
  person.age = age;
  // (3)
  return person;
}

Now, the information available at (1), (2), and (3) in a fact-based model is as follows:

At (1), there are no facts known about person, except for a well-understood base case. That is, because there are no facts stating otherwise, it can be assumed that the prototype of person is the default prototype, and that attempting to get a field such as name or age will produce undefined.
At (2), there is one fact: person.name has the type of name. Attempting to get person.name will no longer result in undefined, but in whatever value name is.
At (3), there is another fact: person.age has the type of age. Similar to before, person.age will produce whatever value age would produce.

The "default set of facts" as mentioned would be troublesome to code tailored towards the JavaScript definintion of an object. As such, JSSAT use its own definition - referred to as a record. Thus, JSSAT can implement its own arbitrary semantics related to what the default set of facts are.

At any state in the program, a variable will have a type. In the event of records, this type is always changing as the code progresses. At any given state in the program, there will be a set of facts to represent a record. For example, consider the previous JavaScript program but now with type annotations:

function newPerson(name, age) {
  // name : String
  // age : Int
  let person = {};
  // person : Record { facts = [] }
  person.name = name;
  // person : Record { facts = [name => String] }
  person.age = age;
  // person : Record { facts = [name => String, age => Int] }
  return person;
}

The type of person is now constantly changing to reflect the set of facts that are true at any given point. The person right after creation has a distinctly different set of facts from the person being returned, even though it's the same object allocation.

In summary,

The model of an object in JSSAT is represented using records. Records are arbitrarily defined in JSSAT to behave in whatever way makes sense, so long as they can model JavaScript objects easily.
The type of a record is comprised of a list of facts.
To answer a question about a record (e.g. "do you have a name property?"), the facts are iterated through until the question can be answered.

Unfortunately, this spawns a series of questions:

How is SSA form preserved?
How are cyclic references, mutability, and recursion handled?
Where is the information for the facts of a record stored?
What happens when two different records combine into one?
How can JSSAT cache previously invoked functions?

Why SSA form appears to be broken

It may initially appear that SSA form is broken in JSSAT. Consider the following JavaScript code:

let a = {};
// a : {}
a.b = 1;
// a : { b: 1 }
a.c = 2;
// c : { b: 1, c: 2 }

Surely, SSA form must have been broken here! The type of a changes after a mutation. How can JSSAT claim to uphold SSA form while supporting mutability?

Why SSA form is upheld in JSSAT

SSA form is no stranger to challenges regarding mutability. Tangentially related, pointer provenance is still an open area of research. Thankfully, simple mutability in SSA form is a solved problem.

Consider a program which starts with i = 0, and loops infinitely. On each iteration of the loop, 1 is added to i. Here, such a program is showcased:

entry:
  i = 0
loop:
  i = i + 1
  jmp loop

This program is clearly not in SSA form, as i is assigned to multiple times. What if it were to be rewritten and ensure that there is exactly one variable for every assignment?

entry:
  i = 0
loop:
  j = i + 1
  jmp loop

This program is in SSA form, but it is incorrect. Every iteration of the loop will just perform j = 0 + 1. How do compilers that use SSA form solve this problem?

To answer this, the following C program (which obviously breaks SSA form) has been written:

void loop(int i) {
    while (1) {
        i += 1;
    }
}

Clang (a C/C++ compiler) generates LLVM IR from C code. LLVM IR is considered to be in SSA form. What does the generated LLVM IR look like?

; https://godbolt.org/z/s4cnd77fa
; irrelevant information has been stripped away manually

define void @loop(i32 %0) {
  %2 = alloca i32
  store i32 %0, i32* %2
  br label %3

3:
  %4 = load i32, i32* %2
  %5 = add i32 %4, 1
  store i32 %5, i32* %2
  br label %3
}

This might be difficult to read if you're unfamiliar with LLVM IR, so here is the above program translated back into C:

void loop(int initial) {
    int *ptr = (int*)alloca(sizeof(int)); // allocate `sizeof(int)` bytes on the stack
    *ptr = initial;
    goto infinite_loop;

  infinite_loop:
    int previous_value = *ptr;
    int next_value = previous_value + 1;
    *ptr = next_value;
    goto infinite_loop;
}

Notice the usage of pointers to store mutable data. The pointer itself does not change - it is assigned to exactly once. However, the memory that the pointer points to does change.

"The 'trick' here is that while LLVM does require all register values to be in SSA form, it does not require (or permit) memory objects to be in SSA form"
— LLVM Documentation

The LLVM compiler infrastructure is a production grade, battle tested piece of technology that adopts the notion that changing the memory a pointer points to is not a violation of SSA form, because SSA form is only upheld on all register values. In summary, LLVM supports SSA form only for register values.

JSSAT too adopts this stance. SSA form is only guaranteed to be upheld for all registers, but not for memory. However, this presents a new challenge:

How is the type of a register modelled if the type of a record is always changing?

The type of a JSSAT register

In JSSAT IR, registers get very simple types. For example, a is an Int { value = 3 }, and b is a Bool { value = true } in the following program:

let a = 3;
let b = true;

This makes the type trivial to carry around. Because records are possible to hold in a register, somehow that must be known in a similarly trivial manner whilst simultaneously conveying information about how the facts of the record are constantly changing. For example, consider the following program:

let a = {};
// a : {}
a.x = 1;
// a : { x: 1 }

In this program, the record a has the facts known about it changing. The actual record pointed to by a does not change, since JSSAT upholds SSA form for all registers, but the memory a points to is changing. How can this be modelled?

Modelling a record register

In JSSAT, there is a primitive notion of a string. Because in JavaScript strings are immutable, there is no concern regarding mutating a string. The following code shows usage of a string in JavaScript:

let s = "asdf";
s[0] = "x"; // does not change `s`
s += "jkl;";
return s; // "asdfjkl;"

Similarly, in JSSAT IR, the above program may look as follows:

const s = "asdf";
// ignoring s[0] = 'x' since that does nothing
const s2 = s + "jkl;";
return s2;

The compiler must compute the type of the s2 variable. If the type of s and s2 were to be a list of characters, that would be wasteful - there could be tons of these throughout the program! For the compiler to copying around and compare lists of bytes (the type of s and s2), it would be very non-trivial (specifically, O(n) time). Thus, a technique called string interning is used. With string interning, a middleman is introduced. This middleman, given a string, will return an abstract token that's smaller than the string and easier to copy or carry around. Then, the middleman can be given the token and will return back a reference to the string so that the contents of the string may be examined. This abstract token is often an integer for performance.

Thus, for records, they could simply be given a unique ID to represent that record. Then, that ID can be copied or carried around very easily. In addition, there could be a service that will store the facts of a record at any given moment based upon the ID, similar to the middleman used for strings. By labelling each record with an ID, that ID can be used to look up the facts of that record while also being extremely cheap to copy around.

In summary, a unique ID is given to each record. This ID is cheap to copy and use within register types, and a separate facility can be used to store the facts of a record.

However, that spawns a few questions that need answers:

How is the ID for the record determined, and how long is a record ID valid for?
What does the separate facility that is used to store the fact of a record look like?

Generating record IDs

For constants such as strings, the constant ID is generated based on whether the string interner has seen it before. However, a record may be mutated by facts, the technique used for constants cannot be used for records. Consider the following code:

let a = {};
let b = {};
a.x = 1;
b.y = 2;

Here, both a and b start off in the same state - as a new record {}. It would be improper to give them the same record ID, as they are different allocations that end up getting different facts applied to them.

The next consideration would be to use a globally monotonically increasing ID for every record. However, this falls flat in the fact of functions. Consider the following program:

function make() {
  let obj = {};
  return obj;
}

let a = make();
let b = make();
a.x = 1;
b.y = 2;

In JSSAT, to avoid duplicate work, the results of abstract interpretation are cached if the parameter types are the same. Here, make has no parameters, so it will always be cached. If a global monotonically increasing ID were to be used, the return type of this function would be a record of ID 0. Thus, a and b would both be classified as a record of ID 0, despite referring to different allocations. This is a similar predicament as before. How can records be given a unique ID, and not come across these problems?

Per-function record IDs

Instead of using a globally monotonically increasing ID, what if a monotonically increasing ID on a per-function basis were to be used? Consider a program similar to the one from before:

function make() {
  let obj = {};
  // obj : Record { id = 0 }
  let obj2 = {};
  // obj2 : Record { id = 1 }
  return obj;
}

let a = make();
// a : Record { id = ? }
let b = make();
// b : Record { id = ? }
a.x = 1;
b.y = 2;

Using a per-function monotonically increasing ID does not solve the problem on its own. In addition, it must be known whether or not record IDs are coming from a parent function, or generated within the function. In the event that a function generates its own IDs, then the knowledge that the function generated its own record ID should be propagated to the caller, so that the caller can differentiate its own record IDs from the record IDs the function generates. Upon differentiation, the caller can then re-map the caller's record ID into its own record ID. This would turn the previous program into the following:

function make() {
  let obj = {};
  // obj : Record { id = 0 }
  let obj2 = {};
  // obj2 : Record { id = 1 }
  return obj;
}

let mine = {};
// mine : Record { id = 0 }
let a = make();
// a : Record { id = 0 in `make` } -> Record { id = 1 }
let b = make();
// b : Record { id = 0 in `make` } -> Record { id = 2 }
a.x = 1;
b.y = 2;

Now that the caller is aware that the record ID returned to it was generated within the function it called, it can apply any mapping operation necessary to assimilate the record ID into its own store of records. In summary, every record is tagged from if it was generated within the current function or not, and is assimilated into the current function if it is from a child function. This also answers how long a record ID is valid for - on a per-function basis.

However, merging record IDs in to the parent brings rise to a peculiar question - what should be done in the event of an infinitely expanding record?

Infinitely Expanding Record

Consider the following function:

function f(x) {
  return f({ x: x });
}

f({});

The above function will make a record whose type will never be fully resolved. If the type of x is kept exact, this function will never be assigned the Never type as the return type because the parameter constantly changes. How should this function be handled?

Widening

In JSSAT, consider the following function:

function f(x) {
  return f(x + 1);
}

f(0);

Similarly, this function will seemingly never end. It will get called with f(0), f(1), f(2), to infinity. JSSAT has a solution for this, which is widening. After a certain threshold, if it notices that it's getting called with similar concrete types repeatedly, the argument type will be widened from 0 to Integer. Then, it will be able to discover that f(Integer) calls f(Integer), and is able to assign the Never type as the return type of the function.

The solution to this problem with records, is to widen the records to a Record. This kind of record, when queried about facts, will be completely unprovable. This also makes it a good solution for having "foreign functions", where the input provided by the caller can't be trusted to be typed sanely.

Consider the original function again, this time after the function has hit its threshold for widening a record:

function f(x: Record) {
  return f({ x: x });
}

f({});

Now, because x is an unknown record, there are very little optimizations that may get applied to it, but the problem is now solved.

Storing record facts

Something overlooked until now is where and how the record facts would be stored:

The model of an object:
Where is the information for the facts of a record stored?
Modelling a record register:
What does the separate facility that is used to store the fact of a record look like?

A naive solution would be to have a mapping of every record ID to their list of facts. While it could work, consider a program where the facts about a record change:

function f() {
  let a = {};
  // (1)
  // a : {}
  a.b = 1;
  // a : { b: 1 }
  a.c = 2;
  // a : { b: 1, c: 2 }
}

If the naive solution were to be used, then after execution, all that would be known about the record are two facts: that key b implies 1, and key c implies 2. But right after creation at (1), that is clearly not true!

Of course, it is unobservable that at (1) there are no facts. However, there is a scenario where it is observable — consider the following problematic program:

function f() {
  let a = {};
  a.a = 1;
  g(a);
  delete a.a;
  return a;
}

Here, the g function must be called with a parameter of type { a: 1 }, but at the end of the function, all that is known about a is that it is a {}. This is a problem, because the compiler must generate an LLVM structure for a - and if it was under the assumption that it did not need to allocate any fields for a, this would be a problem, as it clearly needs some fields in order to call g.

Thus, there needs to be a way to view an object as it changes throughout the program. How can this be modelled?

Transition trees

Rather than maintain a list of facts, what if a list of transitions were to be used? The program from before would look as follows:

function f() {
  let a = {};
  // facts: []
  a.a = 1;
  // facts: [set a => 1]
  g(a);
  delete a.a;
  // facts: [set a => 1, remove a]
  return a;
}

Now, the fact list-like nature of a record can still be used. In addition, now a history of transitions are preserved so that the compiler can see all the state necessary at any point in the program to get an idea of the maximal amount of data necessary to represent the object. If the function call g were to modify the object, then those transitions within g would be propagated to the caller.

A new problem: convergence

A while back, a question was asked:

The model of an object
What happens when two different records combine into one?

So far, the only problems touched on have been simple, sequential programs. It is very common for records to converge. Convergence here is defined as when two different definitions for an object must join into a single definition. Consider the following program:

function gen(cond) {
  if (cond) {
    return { a: 1 };
  } else {
    return { b: 2 };
  }
}

let obj = gen(randomBoolean());

In the above program, what is the type of obj? To make analysis easier, the above program has been converted into a pseudo JSSAT form:

function gen(cond) {
  if (cond) {
    goto gen_a();
  } else {
    goto gen_b();
  }
}

function gen_a() {
  return { a: 1 };
  // -> Record { id = 0, facts = [...] }
}

function gen_b() {
  return { b: 2 };
  // -> Record { id = 0, facts = [...] }
}

let obj = gen(randomBoolean());
// obj : Record { id = 0, facts = [...] } | Record { id = 1, facts = [...] }

In the above program, { a: 1 } inside of gen_a would get the type Record { id = 0, facts = [...] }. Likewise, { b: 2 } would get the type Record { id = 0, facts = [...] }. It may seem strange that they both get the same record ID, but because they are two different functions that do not start off with any objects, they will both happen to get the same record ID inside of the function. It is then up to the caller to analyze the gen function, and map the record IDs so that they do not overlap accordingly.

The key action to note is the unionization of the two records. Because they are distinct objects, their definition will turn into the union of two objects. This is how the problem of differing types being combined into one type is handled. When LLVM IR is emitted, this union may be emitted as a discriminated union.

Now notice what happens - there was first a divergence in the control flow in gen based on cond. Then, the control flow converged to the caller. This convergence is what necessitates some solution - in this case, unionization. However, there is another way for control flow to need to converge. Consider the following program:

function modify(obj, cond) {
  if (cond) {
    obj.a = 1;
  } else {
    obj.b = 2;
  }
}

let obj = {};
modify(obj, randomBoolean());

Here, control flow similarly diverges and then converges. However, rather than modify two distinct objects, a single object is being operated on. Below is the same program, but annotated to help visualize the problem:

function modify(obj, cond) {
  if (cond) {
    obj.a = 1;
    // obj : Record { id = 0, facts = [set a => 1] }
  } else {
    obj.b = 2;
    // obj : Record { id = 0, facts = [set b => 2] }
  }
  // obj : Record { id = 0, facts = [?] }
}

let obj = {};
modify(obj, randomBoolean());

In the following program, how are the set of facts unionized in such a way that preserves the information about the object fully?

Factual unions

One possibility to solve this would be to represent obj as a union of two records, such as the following:

function modify(obj, cond) {
  if (cond) {
    obj.a = 1;
  } else {
    obj.b = 2;
  }
  // obj : Record { id = 0, facts = [set a => 1] } | Record { id = 1, facts = [set b => 2] }
}

let obj = {};
modify(obj, randomBoolean());

However, this approach is incorrect because there should be a single record ID per allocation, and it would be hard to update all usages of the obj record to turn into a union.

Another possibility would be to infer the following possible type for obj:

interface Obj {
  a?: 1;
  b?: 2;
}

This type definition for obj is incorrect because it models two more states that obj cannot model. Those two states would be {} and { a: 1, b: 2 }.

The approach used in JSSAT is to support unions at the facts level itself. Thus, the type of obj would be as follows:

function modify(obj, cond) {
  if (cond) {
    obj.a = 1;
    // obj : Record { id = 0, facts = [set a => 1] }
  } else {
    obj.b = 2;
    // obj : Record { id = 0, facts = [set b => 2] }
  }
  // obj : Record { id = 0, facts = [set a => 1] | [set b => 2] }
}

let obj = {};
modify(obj, randomBoolean());

Now obj can both still refer to a singular allocation, and represent the two possible states exactly.

However, this prompts the question:

What is the type of a value in a factual union for a key?

Type of factual union key

To illustrate the question "What is the type of a value in a factual union for a key?", examine the following code:

let obj = {};

if (randomBoolean()) {
  obj.a = 1;
}

// obj : Record { facts = [set a => 1] | [] }

// what is `field`?
let field = obj.a;

In JavaScript, field would be (in TypeScript notation) 1 | undefined. However, this is JSSAT IR, where entirely different semantics can be arbitrarily defined. Here are two possible choices of action:

Error on the possibility of getting an empty field
This would be simple and straightforward.
Define getting a field that doesn't exist to return Nothing
This would simplify code, as a simple assertion instruction to assert the value isn't Nothing could be generated fairly easily compared to complicated control flow. In addition, there would be no need for an instruction to check if a field exists, which could provide a speed up. A rich definition of Nothing would be needed to ensure that it does not behave like a regular type and cannot be used as such, so that it truly represents nothing.

The second alternative is what JSSAT chooses to do. The precise semantics of Nothing are not specified here.

Caching function invocations

One of the last questions to answer is this one:

The model of an object
How can JSSAT cache previously invoked functions?

This question has three components:

How to determine if a function call has already been executed
How to execute an uncached function call
How to execute a cached function call

1. How to determine if a function call has already been executed

To determine if a function call should need to be re-executed, examine the following simple scenario:

let foo = {};
// foo : Record { facts = [] }

let bar = { a: 1 };
delete bar.a;
// bar : Record { facts = [set a => 1, remove a] }

function amend(x) {
  x.baz = "a";
}

amend(foo);
amend(bar);

In the above scenario, both foo and bar appear the same before the function is executed. That is, if a snapshot of the representation of foo and bar were to be taken before execution, they would be the same — an empty set of facts.

Taking this point into consideration, it is possible to determine if a function call has been executed by determining if it has been called before with a snapshot of the parameters.

2. How to execute an uncached function call

For executing an uncached function call, it seems easy — simply amend onto the information the caller has. However, this runs into problems because then it would not be possible to re-use the results of the cached function call in the callers, as the function must be executed once again.

Another possible way is to use the snapshot as a base, and use that to base facts off of. That way, one ends up with two distinct parts: all the modifications, and the base. Then, all callers can simply perform the same modifications, since they have a list of the precise modifications made.

3. How to execute a cached function call

Since cached function calls have a base snapshot and then a list of modifications, it would be possible to take a snapshot of the caller when it approaches the function. If the snapshots are the same, then the caller may perform the same modifications that are in the list of modifications of the cached function call.

The full picture

Quite aways back, a question was asked:

Why SSA form appears to be broken
How are cyclic references, mutability, and recursion handled?

To summarize:

Record IDs are used as the type of any record. Record IDs can be used as the value of a property within a record, supporting cyclic records. See: Modelling a record register
Transition trees are used to record a history of facts, supporting mutability. See: Transition trees
Convergence creates unions of records to support them together. See: Factual unions

Hopefully this paints a clearer picture of how records are modelled in JSSAT.

Recap

To recap, here's all the problems that have been mentioned and their solutions:

Q: What is the model of an object?
A: A series of facts
Q: How is SSA form preserved?
A: SSA form is only upheld on register values
Q: How are cyclic references, mutability, and recursion handled?
A: Record IDs, transition trees, and unions
Q: Where is the information for the facts of a record stored?
A: In a transition tree
Q: What happens when two different records combine into one?
A: Factual unions
Q: How is the type of a register modelled if the type of a record is always changing?
A: A record gets a unique ID
Q: How is the ID for the record determined, and how long is a record ID valid for?
A: Monotonically increasing ID on a per-function basis
Q: What does the separate facility that is used to store the fact of a record look like?
A: A transition tree
Q: What is the type of a value in a factual union for a key?
A: Nothing
Q: How can JSSAT cache previously invoked functions?
A: Snapshot and modifications

Conclusion

This post is dedicated to be a conceptual outline as to how JSSAT abstractly models records within its system. For concerns regarding soundness or something else, be sure to get in contact.

WIP

This post still requires work done to make the post more understandable and better.