previous article goes into more depth, but all that is necessary to know is the basic gist of how loops are eliminated. Consider a program such as this one:
function sum(max) {
let total = 0;
for (let i = 0; i < max; i++) {
total += i;
}
return total;
}
In JSSAT, that function will get broken up into a series of smaller ones:
function sum(max) {
let total = 0;
let i = 0;
goto sum_check(max, total, i);
}
function sum_check(max, total, i) {
let condition = i < max;
if (condition) {
goto sum_loop(max, total, i);
} else {
goto sum_end(total);
}
}
function sum_loop(max, total, i) {
let total2 = total = i;
let i2 = i + 1;
goto sum_check(max, total2, i2);
}
function sum_end(total) {
return total;
}
Here, goto
can be thought of as a guaranteed tail call. In addition, the conditional if
statement is only allowed to be present at the end of a function. For more details on the specifics, it is again encouraged to read the previous post on JSSAT.
In order for JSSAT to model an object, it must choose a model for an object that is aligned with the goals of JSSAT. Consider a struct in C:
struct Person {
char* name;
int age;
}
This C struct matches our definition of an object. It is a list of key-value pairs - there are two keys, name
and age
, which can have values within them. Unlike JSSAT, C does not support dynamic keys at runtime (and thank goodness it doesn't). Now, consider the usage of such an object within the following code:
struct Person new_person(char* name, int age) {
struct Person person;
// (1)
person.name = name;
// (2)
person.age = age;
// (3)
return person;
}
In JSSAT, abstract interpretation is used to compute all possible paths of control flow, using information available throughout the program. What information is available at (1)
, (2)
, and (3)
in the above C program?
At (1)
it is known that a struct Person
exists and can be referred to via person
. Enough bytes on the stack have been allocated to contain one Person
, but the value of those bytes are unknown.
At (2)
, it is known that person.name
is now equivalent to char* name
. The value of person.age
is still unknown.
At (3)
, it is known that both person.name
and person.age
are defined with known values.
Now, consider the equivalent JavaScript program:
function newPerson(name, age) {
let person = {};
// (1)
person.name = name;
// (2)
person.age = age;
// (3)
return person;
}
Now, the information is available at (1)
, (2)
, and (3)
in the above JavaScript program is as follows:
At (1)
, it is known that person
is a freshly created basic object. Operations such as person.toString()
will work because the object inherits Object's prototype. Accessing any custom properties, such as the key name
or age
, will produce the value undefined
.
At (2)
, it is now known that the name
key of person is now equivalent to whatever the value of the first parameter was. person.name
is equivalent to name
.
At (3)
, it is now known that age
too has also been set, similarly to name
.
There's something fundamental to realize here - at every point in the program, what is known about person
has changed, which can be used to prove different things. For example, in this C program at (1)
, nothing about the value of name
could be proven except that it is unknown because a value had not been assigned to it. Afterwards (at (2)
), it would be provable that if name != 0
, then so too would person.name != 0
. What sort of object model would arise if objects were thought about as displayed here?
Throughout the program, what is known about an object is often changing. That is, the facts about that object change. Consider the JavaScript program from before:
function newPerson(name, age) {
let person = {};
// (1)
person.name = name;
// (2)
person.age = age;
// (3)
return person;
}
Now, the information available at (1)
, (2)
, and (3)
in a fact-based model is as follows:
At (1)
, there are no facts known about person
, except for a well-understood base case. That is, because there are no facts stating otherwise, it can be assumed that the prototype of person
is the default prototype, and that attempting to get a field such as name
or age
will produce undefined
.
At (2)
, there is one fact: person.name
has the type of name
. Attempting to get person.name
will no longer result in undefined
, but in whatever value name
is.
At (3)
, there is another fact: person.age
has the type of age
. Similar to before, person.age
will produce whatever value age
would produce.
The "default set of facts" as mentioned would be troublesome to code tailored towards the JavaScript definintion of an object. As such, JSSAT use its own definition - referred to as a record. Thus, JSSAT can implement its own arbitrary semantics related to what the default set of facts are.
At any state in the program, a variable will have a type. In the event of records, this type is always changing as the code progresses. At any given state in the program, there will be a set of facts to represent a record. For example, consider the previous JavaScript program but now with type annotations:
function newPerson(name, age) {
// name : String
// age : Int
let person = {};
// person : Record { facts = [] }
person.name = name;
// person : Record { facts = [name => String] }
person.age = age;
// person : Record { facts = [name => String, age => Int] }
return person;
}
The type of person
is now constantly changing to reflect the set of facts that are true at any given point. The person
right after creation has a distinctly different set of facts from the person
being returned, even though it's the same object allocation.
In summary,
name
property?"), the facts are iterated through until the question can be answered.Unfortunately, this spawns a series of questions:
How is SSA form preserved?
How are cyclic references, mutability, and recursion handled?
Where is the information for the facts of a record stored?
What happens when two different records combine into one?
How can JSSAT cache previously invoked functions?
It may initially appear that SSA form is broken in JSSAT. Consider the following JavaScript code:
let a = {};
// a : {}
a.b = 1;
// a : { b: 1 }
a.c = 2;
// c : { b: 1, c: 2 }
Surely, SSA form must have been broken here! The type of a
changes after a mutation. How can JSSAT claim to uphold SSA form while supporting mutability?
SSA form is no stranger to challenges regarding mutability. Tangentially related, pointer provenance is still an open area of research. Thankfully, simple mutability in SSA form is a solved problem.
Consider a program which starts with i = 0
, and loops infinitely. On each iteration of the loop, 1
is added to i
. Here, such a program is showcased:
entry:
i = 0
loop:
i = i + 1
jmp loop
This program is clearly not in SSA form, as i
is assigned to multiple times. What if it were to be rewritten and ensure that there is exactly one variable for every assignment?
entry:
i = 0
loop:
j = i + 1
jmp loop
This program is in SSA form, but it is incorrect. Every iteration of the loop will just perform j = 0 + 1
. How do compilers that use SSA form solve this problem?
To answer this, the following C program (which obviously breaks SSA form) has been written:
void loop(int i) {
while (1) {
i += 1;
}
}
Clang (a C/C++ compiler) generates LLVM IR from C code. LLVM IR is considered to be in SSA form. What does the generated LLVM IR look like?
; https://godbolt.org/z/s4cnd77fa
; irrelevant information has been stripped away manually
define void @loop(i32 %0) {
%2 = alloca i32
store i32 %0, i32* %2
br label %3
3:
%4 = load i32, i32* %2
%5 = add i32 %4, 1
store i32 %5, i32* %2
br label %3
}
This might be difficult to read if you're unfamiliar with LLVM IR, so here is the above program translated back into C:
void loop(int initial) {
int *ptr = (int*)alloca(sizeof(int)); // allocate `sizeof(int)` bytes on the stack
*ptr = initial;
goto infinite_loop;
infinite_loop:
int previous_value = *ptr;
int next_value = previous_value + 1;
*ptr = next_value;
goto infinite_loop;
}
Notice the usage of pointers to store mutable data. The pointer itself does not change - it is assigned to exactly once. However, the memory that the pointer points to does change.
"The 'trick' here is that while LLVM does require all register values to be in SSA form, it does not require (or permit) memory objects to be in SSA form"
The LLVM compiler infrastructure is a production grade, battle tested piece of technology that adopts the notion that changing the memory a pointer points to is not a violation of SSA form, because SSA form is only upheld on all register values. In summary, LLVM supports SSA form only for register values.
JSSAT too adopts this stance. SSA form is only guaranteed to be upheld for all registers, but not for memory. However, this presents a new challenge:
In JSSAT IR, registers get very simple types. For example, a
is an Int { value = 3 }
, and b
is a Bool { value = true }
in the following program:
let a = 3;
let b = true;
This makes the type trivial to carry around. Because records are possible to hold in a register, somehow that must be known in a similarly trivial manner whilst simultaneously conveying information about how the facts of the record are constantly changing. For example, consider the following program:
let a = {};
// a : {}
a.x = 1;
// a : { x: 1 }
In this program, the record a
has the facts known about it changing. The actual record pointed to by a
does not change, since JSSAT upholds SSA form for all registers, but the memory a
points to is changing. How can this be modelled?
In JSSAT, there is a primitive notion of a string. Because in JavaScript strings are immutable, there is no concern regarding mutating a string. The following code shows usage of a string in JavaScript:
let s = "asdf";
s[0] = "x"; // does not change `s`
s += "jkl;";
return s; // "asdfjkl;"
Similarly, in JSSAT IR, the above program may look as follows:
const s = "asdf";
// ignoring s[0] = 'x' since that does nothing
const s2 = s + "jkl;";
return s2;
The compiler must compute the type of the s2
variable. If the type of s
and s2
were to be a list of characters, that would be wasteful - there could be tons of these throughout the program! For the compiler to copying around and compare lists of bytes (the type of s
and s2
), it would be very non-trivial (specifically, O(n) time). Thus, a technique called string interning is used. With string interning, a middleman is introduced. This middleman, given a string, will return an abstract token that's smaller than the string and easier to copy or carry around. Then, the middleman can be given the token and will return back a reference to the string so that the contents of the string may be examined. This abstract token is often an integer for performance.
Thus, for records, they could simply be given a unique ID to represent that record. Then, that ID can be copied or carried around very easily. In addition, there could be a service that will store the facts of a record at any given moment based upon the ID, similar to the middleman used for strings. By labelling each record with an ID, that ID can be used to look up the facts of that record while also being extremely cheap to copy around.
In summary, a unique ID is given to each record. This ID is cheap to copy and use within register types, and a separate facility can be used to store the facts of a record.
However, that spawns a few questions that need answers:
How is the ID for the record determined, and how long is a record ID valid for?
What does the separate facility that is used to store the fact of a record look like?
For constants such as strings, the constant ID is generated based on whether the string interner has seen it before. However, a record may be mutated by facts, the technique used for constants cannot be used for records. Consider the following code:
let a = {};
let b = {};
a.x = 1;
b.y = 2;
Here, both a
and b
start off in the same state - as a new record {}
. It would be improper to give them the same record ID, as they are different allocations that end up getting different facts applied to them.
The next consideration would be to use a globally monotonically increasing ID for every record. However, this falls flat in the fact of functions. Consider the following program:
function make() {
let obj = {};
return obj;
}
let a = make();
let b = make();
a.x = 1;
b.y = 2;
In JSSAT, to avoid duplicate work, the results of abstract interpretation are cached if the parameter types are the same. Here, make
has no parameters, so it will always be cached. If a global monotonically increasing ID were to be used, the return type of this function would be a record of ID 0
. Thus, a
and b
would both be classified as a record of ID 0
, despite referring to different allocations. This is a similar predicament as before. How can records be given a unique ID, and not come across these problems?
Instead of using a globally monotonically increasing ID, what if a monotonically increasing ID on a per-function basis were to be used? Consider a program similar to the one from before:
function make() {
let obj = {};
// obj : Record { id = 0 }
let obj2 = {};
// obj2 : Record { id = 1 }
return obj;
}
let a = make();
// a : Record { id = ? }
let b = make();
// b : Record { id = ? }
a.x = 1;
b.y = 2;
Using a per-function monotonically increasing ID does not solve the problem on its own. In addition, it must be known whether or not record IDs are coming from a parent function, or generated within the function. In the event that a function generates its own IDs, then the knowledge that the function generated its own record ID should be propagated to the caller, so that the caller can differentiate its own record IDs from the record IDs the function generates. Upon differentiation, the caller can then re-map the caller's record ID into its own record ID. This would turn the previous program into the following:
function make() {
let obj = {};
// obj : Record { id = 0 }
let obj2 = {};
// obj2 : Record { id = 1 }
return obj;
}
let mine = {};
// mine : Record { id = 0 }
let a = make();
// a : Record { id = 0 in `make` } -> Record { id = 1 }
let b = make();
// b : Record { id = 0 in `make` } -> Record { id = 2 }
a.x = 1;
b.y = 2;
Now that the caller is aware that the record ID returned to it was generated within the function it called, it can apply any mapping operation necessary to assimilate the record ID into its own store of records. In summary, every record is tagged from if it was generated within the current function or not, and is assimilated into the current function if it is from a child function. This also answers how long a record ID is valid for - on a per-function basis.
However, merging record IDs in to the parent brings rise to a peculiar question - what should be done in the event of an infinitely expanding record?
Consider the following function:
function f(x) {
return f({ x: x });
}
f({});
The above function will make a record whose type will never be fully resolved. If the type of x
is kept exact, this function will never be assigned the Never
type as the return type because the parameter constantly changes. How should this function be handled?
In JSSAT, consider the following function:
function f(x) {
return f(x + 1);
}
f(0);
Similarly, this function will seemingly never end. It will get called with f(0)
, f(1)
, f(2)
, to infinity. JSSAT has a solution for this, which is widening. After a certain threshold, if it notices that it's getting called with similar concrete types repeatedly, the argument type will be widened from 0
to Integer
. Then, it will be able to discover that f(Integer)
calls f(Integer)
, and is able to assign the Never
type as the return type of the function.
The solution to this problem with records, is to widen the records to a Record
. This kind of record, when queried about facts, will be completely unprovable. This also makes it a good solution for having "foreign functions", where the input provided by the caller can't be trusted to be typed sanely.
Consider the original function again, this time after the function has hit its threshold for widening a record:
function f(x: Record) {
return f({ x: x });
}
f({});
Now, because x
is an unknown record, there are very little optimizations that may get applied to it, but the problem is now solved.
Something overlooked until now is where and how the record facts would be stored:
- Where is the information for the facts of a record stored?
- What does the separate facility that is used to store the fact of a record look like?
A naive solution would be to have a mapping of every record ID to their list of facts. While it could work, consider a program where the facts about a record change:
function f() {
let a = {};
// (1)
// a : {}
a.b = 1;
// a : { b: 1 }
a.c = 2;
// a : { b: 1, c: 2 }
}
If the naive solution were to be used, then after execution, all that would be known about the record are two facts: that key b
implies 1
, and key c
implies 2
. But right after creation at (1)
, that is clearly not true!
Of course, it is unobservable that at (1)
there are no facts. However, there is a scenario where it is observable — consider the following problematic program:
function f() {
let a = {};
a.a = 1;
g(a);
delete a.a;
return a;
}
Here, the g
function must be called with a parameter of type { a: 1 }
, but at the end of the function, all that is known about a
is that it is a {}
. This is a problem, because the compiler must generate an LLVM structure for a
- and if it was under the assumption that it did not need to allocate any fields for a
, this would be a problem, as it clearly needs some fields in order to call g
.
Thus, there needs to be a way to view an object as it changes throughout the program. How can this be modelled?
Rather than maintain a list of facts, what if a list of transitions were to be used? The program from before would look as follows:
function f() {
let a = {};
// facts: []
a.a = 1;
// facts: [set a => 1]
g(a);
delete a.a;
// facts: [set a => 1, remove a]
return a;
}
Now, the fact list-like nature of a record can still be used. In addition, now a history of transitions are preserved so that the compiler can see all the state necessary at any point in the program to get an idea of the maximal amount of data necessary to represent the object. If the function call g
were to modify the object, then those transitions within g
would be propagated to the caller.
A while back, a question was asked:
- What happens when two different records combine into one?
So far, the only problems touched on have been simple, sequential programs. It is very common for records to converge. Convergence here is defined as when two different definitions for an object must join into a single definition. Consider the following program:
function gen(cond) {
if (cond) {
return { a: 1 };
} else {
return { b: 2 };
}
}
let obj = gen(randomBoolean());
In the above program, what is the type of obj
? To make analysis easier, the above program has been converted into a pseudo JSSAT form:
function gen(cond) {
if (cond) {
goto gen_a();
} else {
goto gen_b();
}
}
function gen_a() {
return { a: 1 };
// -> Record { id = 0, facts = [...] }
}
function gen_b() {
return { b: 2 };
// -> Record { id = 0, facts = [...] }
}
let obj = gen(randomBoolean());
// obj : Record { id = 0, facts = [...] } | Record { id = 1, facts = [...] }
In the above program, { a: 1 }
inside of gen_a
would get the type Record { id = 0, facts = [...] }
. Likewise, { b: 2 }
would get the type Record { id = 0, facts = [...] }
. It may seem strange that they both get the same record ID, but because they are two different functions that do not start off with any objects, they will both happen to get the same record ID inside of the function. It is then up to the caller to analyze the gen
function, and map the record IDs so that they do not overlap accordingly.
The key action to note is the unionization of the two records. Because they are distinct objects, their definition will turn into the union of two objects. This is how the problem of differing types being combined into one type is handled. When LLVM IR is emitted, this union may be emitted as a discriminated union.
Now notice what happens - there was first a divergence in the control flow in gen
based on cond
. Then, the control flow converged to the caller. This convergence is what necessitates some solution - in this case, unionization. However, there is another way for control flow to need to converge. Consider the following program:
function modify(obj, cond) {
if (cond) {
obj.a = 1;
} else {
obj.b = 2;
}
}
let obj = {};
modify(obj, randomBoolean());
Here, control flow similarly diverges and then converges. However, rather than modify two distinct objects, a single object is being operated on. Below is the same program, but annotated to help visualize the problem:
function modify(obj, cond) {
if (cond) {
obj.a = 1;
// obj : Record { id = 0, facts = [set a => 1] }
} else {
obj.b = 2;
// obj : Record { id = 0, facts = [set b => 2] }
}
// obj : Record { id = 0, facts = [?] }
}
let obj = {};
modify(obj, randomBoolean());
In the following program, how are the set of facts unionized in such a way that preserves the information about the object fully?
One possibility to solve this would be to represent obj
as a union of two records, such as the following:
function modify(obj, cond) {
if (cond) {
obj.a = 1;
} else {
obj.b = 2;
}
// obj : Record { id = 0, facts = [set a => 1] } | Record { id = 1, facts = [set b => 2] }
}
let obj = {};
modify(obj, randomBoolean());
However, this approach is incorrect because there should be a single record ID per allocation, and it would be hard to update all usages of the obj
record to turn into a union.
Another possibility would be to infer the following possible type for obj
:
interface Obj {
a?: 1;
b?: 2;
}
This type definition for obj
is incorrect because it models two more states that obj
cannot model. Those two states would be {}
and { a: 1, b: 2 }
.
The approach used in JSSAT is to support unions at the facts
level itself. Thus, the type of obj
would be as follows:
function modify(obj, cond) {
if (cond) {
obj.a = 1;
// obj : Record { id = 0, facts = [set a => 1] }
} else {
obj.b = 2;
// obj : Record { id = 0, facts = [set b => 2] }
}
// obj : Record { id = 0, facts = [set a => 1] | [set b => 2] }
}
let obj = {};
modify(obj, randomBoolean());
Now obj
can both still refer to a singular allocation, and represent the two possible states exactly.
However, this prompts the question:
To illustrate the question "What is the type of a value in a factual union for a key?", examine the following code:
let obj = {};
if (randomBoolean()) {
obj.a = 1;
}
// obj : Record { facts = [set a => 1] | [] }
// what is `field`?
let field = obj.a;
In JavaScript, field
would be (in TypeScript notation) 1 | undefined
. However, this is JSSAT IR, where entirely different semantics can be arbitrarily defined. Here are two possible choices of action:
Error on the possibility of getting an empty field
This would be simple and straightforward.
Define getting a field that doesn't exist to return Nothing
This would simplify code, as a simple assertion instruction to assert the value isn't Nothing
could be generated fairly easily compared to complicated control flow. In addition, there would be no need for an instruction to check if a field exists, which could provide a speed up. A rich definition of Nothing
would be needed to ensure that it does not behave like a regular type and cannot be used as such, so that it truly represents nothing.
The second alternative is what JSSAT chooses to do. The precise semantics of Nothing
are not specified here.
One of the last questions to answer is this one:
- How can JSSAT cache previously invoked functions?
This question has three components:
To determine if a function call should need to be re-executed, examine the following simple scenario:
let foo = {};
// foo : Record { facts = [] }
let bar = { a: 1 };
delete bar.a;
// bar : Record { facts = [set a => 1, remove a] }
function amend(x) {
x.baz = "a";
}
amend(foo);
amend(bar);
In the above scenario, both foo
and bar
appear the same before the function is executed. That is, if a snapshot of the representation of foo
and bar
were to be taken before execution, they would be the same — an empty set of facts.
Taking this point into consideration, it is possible to determine if a function call has been executed by determining if it has been called before with a snapshot of the parameters.
For executing an uncached function call, it seems easy — simply amend onto the information the caller has. However, this runs into problems because then it would not be possible to re-use the results of the cached function call in the callers, as the function must be executed once again.
Another possible way is to use the snapshot as a base, and use that to base facts off of. That way, one ends up with two distinct parts: all the modifications, and the base. Then, all callers can simply perform the same modifications, since they have a list of the precise modifications made.
Since cached function calls have a base snapshot and then a list of modifications, it would be possible to take a snapshot of the caller when it approaches the function. If the snapshots are the same, then the caller may perform the same modifications that are in the list of modifications of the cached function call.
Quite aways back, a question was asked:
Why SSA form appears to be broken
- How are cyclic references, mutability, and recursion handled?
To summarize:
Hopefully this paints a clearer picture of how records are modelled in JSSAT.
To recap, here's all the problems that have been mentioned and their solutions:
Q: How are cyclic references, mutability, and recursion handled?
Q: Where is the information for the facts of a record stored?
Q: What happens when two different records combine into one?
Q: How is the type of a register modelled if the type of a record is always changing?
Q: How is the ID for the record determined, and how long is a record ID valid for?
Q: What does the separate facility that is used to store the fact of a record look like?
Q: What is the type of a value in a factual union for a key?
A: Nothing
This post is dedicated to be a conceptual outline as to how JSSAT abstractly models records within its system. For concerns regarding soundness or something else, be sure to get in contact.
This post still requires work done to make the post more understandable and better.