AUTO-VECTORIZATION IN JUST-IN-TIME COMPILERS FOR DYNAMICALLY TYPED PROGRAMMING LANGUAGES
A computing device with an optimizing compiler is disclosed that is configured to generate optimized machine code including a vector operation corresponding to multiple scalar operations where the vector operation is a single operation on multiple pairs of operands. The optimizing compiler includes a vector guard condition generator configured to generate a vector guard condition for one or more vector operations, a mapping module to generate a mapping between elements of the vector guard condition and positions of the relevant scalar operations in the non-optimized machine code or intermediate representation of the source code, and a guard condition handler configured to initiate execution from a particular scalar operation in the non-optimized machine code or intermediate representation if the vector guard condition is triggered. The computing device may include a non-optimizing compiler and/or an interpreter to perform execution of the scalar operations if the vector guard condition is triggered.
The present Application for Patent is a Divisional of patent application Ser. No. 15/083,157 entitled “Auto-Vectorization in Just-in-Time Compilers for Dynamic Programming Languages” filed Mar. 28, 2016, pending, which claims priority to Provisional Application No. 62/144,252 entitled “Auto-Vectorization in Compilers for Dynamic Programming Languages” filed Apr. 7, 2015, and assigned to the assignee hereof and hereby expressly incorporated by reference herein.
BACKGROUND
Field
The present invention relates to computing devices. In particular, but not by way of limitation, the present invention relates to compiling or interpreting scripting code.
Background
More and more programs are utilizing source code constructs that are written in high level, dynamically-typed programming languages that must be compiled or interpreted before many other activities (e.g., layout calculations and rendering) associated with the constructs can be executed. By way of example, ECMAScript-based scripting languages (e.g., JavaScript® or Flash) are frequently used in connection with the content that they host. One of the most ubiquitous dynamically-typed languages is JavaScript which is run by a JavaScript engine that may be realized by a variety of technologies including interpretation-type engines, profile-guided just-in-time (JIT) compilation (e.g., trace based or function based), and traditional-function-based JIT compilation where native code is generated for the entire body of all the functions that get executed. Other dynamically-typed programming languages can be run by similar engines.
In virtual machines for dynamically-typed programming languages (e.g., JavaScript), performance is largely determined by characteristics of the global type state. Global type state can be thought of as a description of all program behavior and invariants across either a single run of a program or multiple runs. In a statically-typed programming language, global type state includes classes, class members, types of members, parameters, and variables, as well as any other type or structural information expressed explicitly or implicitly in the program source code. Programs written in static languages are usually faster to execute than those written in dynamic languages because type information is fully specified in source code at compile-time, and optimized code is generated based on it. Additionally, because type state doesn't change at run-time in statically typed programs, run-time type checks to verify and detect current types of the program variables are not necessary. However, programmers sometimes prefer to use dynamically-typed languages rather than statically-typed languages for several reasons, such as increased flexibility and simplicity. One tradeoff to using dynamically-typed languages is that the aspects of the global type state can change, which makes the compilation of optimized code imprecise, and sometimes wasteful.
Automatic vectorization is a special case of parallelism where a compiler converts a program from a scalar form, which processes a single pair of operands at a time, to a vector form, which processes multiple pairs of operands at once using a single vector operation. The conversion happens in the intermediate representation of the program that the compiler maintains internally after parsing the high level source code (e.g., C, C++, Java, JavaScript) of the input program; the compiler then generates machine code using vector instructions.
The compiler first analyzes the dependencies in its intermediate representation of the program to determine if it is safe to transform to the vector form. It then generates machine code by selecting the vector instructions present in the processor.
One of the requirements to perform vectorization is that the “type” of the variables that are grouped into a vector operand (e.g., the types of the different elements in an array) be the same and be statically determinable (e.g., completely known at compile time). This enables a uniformly packed (or a known pattern) data layout that becomes the vector operand and enables selection of the specific type of the vector instruction. But a challenge for performing vectorization for dynamically typed languages (e.g., JavaScript) is the “type” (e.g., “integer,” “floating point,” “string,” “character,” and “object”) of a variable/operand is not statically (at compile time) defined and can change during execution.
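The point about types changing during execution can be illustrated with a short JavaScript sketch. The array name B and the helper hasUniformType are hypothetical and chosen only for illustration; the sketch merely shows that a type assumption that holds at one moment may stop holding after a later store.

```javascript
// Hypothetical array: nothing in the source code fixes the element types.
const B = [1, 2, 3, 4];

// True when every element reports the same typeof tag as the first element.
const hasUniformType = (arr) => arr.every((v) => typeof v === typeof arr[0]);

const uniformBefore = hasUniformType(B); // all elements are numbers at this point

B[2] = "three"; // a later statement may store a string into the same array

const uniformAfter = hasUniformType(B); // the earlier assumption no longer holds
```

A vector operand built from B under the first assumption would therefore need a run-time check before each use.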
SUMMARY
An aspect of the present invention may be characterized as a method for compiling source code that includes generating an intermediate representation of the source code and creating and executing non-optimized machine code that includes multiple scalar operations. A determination is made whether the multiple scalar operations are frequently executed so that the non-optimized machine code may be optimized, and if so, the multiple scalar operations are transformed from a scalar form to a vector operation. A vector guard condition is created for one or more vector operations and optimized machine code is created that includes the vector operation and the vector guard condition. The optimized machine code is executed and an element of the vector guard condition in the optimized machine code is mapped to a particular scalar operation of the non-optimized machine code (or intermediate representation of the source code) if the vector guard condition is triggered during execution of the vector operation in the optimized machine code. The non-optimized code is then executed from the particular scalar operation if the optimized machine code fails the vector guard condition.
Another aspect may be characterized as a computing device for compiling source code that includes a non-optimizing compiler configured to generate non-optimized machine code that includes multiple scalar operations and an optimizing compiler configured to generate optimized machine code including a vector operation corresponding to the multiple scalar operations. The optimizing compiler includes a vector guard condition generator configured to generate a vector guard condition for one or more vector operations and a mapping module to generate a mapping between elements of the vector guard operation and positions in the non-optimized machine code or intermediate representation of the source code. The computing device also includes a guard condition handler that is configured to initiate execution of a particular scalar operation of the non-optimized machine code if the vector guard condition is triggered.
Yet another aspect includes a method for compiling source code that includes receiving source code of a dynamically-typed language; generating an intermediate representation from the source code; performing interpreted execution of the intermediate representation; and gathering profile information to determine if optimized machine code should be created or not. If optimized machine code is created, multiple scalar operations are transformed from a scalar form to a vector operation and a vector guard condition is created for one or more vector operations. Optimized machine code containing vector operations is then executed and an element of the vector guard operation is mapped to a particular scalar operation of the intermediate representation. If the vector guard condition is triggered during execution of the vector operation, then operation switches back to interpretation of the intermediate representation from the particular scalar operation.
Another aspect may be characterized as a computing device for compiling source code that includes an interpreter configured to interpret the intermediate representation of the source code and an optimizing compiler configured to generate optimized machine code including a vector operation corresponding to the multiple scalar operations. The optimizing compiler includes a vector guard condition generator configured to generate a vector guard condition for one or more vector operations, a mapping module to generate a mapping between elements of the vector guard operation and positions in the intermediate representation of the source code, and a guard condition handler configured to initiate interpretation of a particular scalar operation of the intermediate representation of the source code if the vector guard condition is triggered.
Various aspects are disclosed in the following description and related drawings to show specific examples relating to exemplary embodiments. Alternate embodiments will be apparent to those skilled in the pertinent art upon reading this disclosure, and may be constructed and practiced without departing from the scope or spirit of the disclosure. Additionally, well-known elements will not be described in detail or may be omitted so as to not obscure the relevant details of the aspects and embodiments disclosed herein.
The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments. Likewise, the term “embodiments” does not require that all embodiments include the discussed feature, advantage or mode of operation.
The terminology used herein describes particular embodiments only and should not be construed to limit any embodiments disclosed herein. As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes,” and/or “including,” when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As depicted, the computing device 100 in this embodiment includes a virtual machine 102 that is disposed to receive and process source code 104 so the instructions embodied in the source code 104 may be executed more quickly than prior art virtual machines. The source code 104 is generally in a dynamically-typed language such as JavaScript, LISP, SELF, Python, Perl, or ActionScript. The source code 104 may represent, for example, a website, a program, or an application, or any other computer instructions that may be written in dynamically-typed code.
The virtual machine 102 may be realized by a compilation-type engine, an interpreter engine, or a combination of both types of engines. In one embodiment, the depicted virtual machine 102 is realized by modifying a HotSpot™ just-in-time (JIT) compiler, which is a compiler for dynamically-typed languages. But it is contemplated that many kinds of compilation or interpretation engines, or hybrids of the two, may be modified in various embodiments without departing from the scope of the disclosure.
As shown, the virtual machine 102 in this embodiment includes both a non-optimizing compiler 106 (which can be replaced with an interpreter in some implementations, as discussed in connection with
Although the virtual machine 102 is depicted as including several functional components (e.g., the non-optimizing compiler 106, the optimizing compiler 108, and VM heap 110), it should be recognized that the several components need not be implemented as a part of a unitary construct. It should also be recognized that the components depicted in
In general, the depicted virtual machine 102 enables generation of efficient vectorized machine code of dynamically typed languages. For exemplary purposes, JavaScript is referred to throughout the present disclosure as the dynamically-typed code that may be used as the source code 104, non-optimized machine code 112, and optimized code 114, but other dynamically-typed languages such as LISP, SELF, Python, Perl, or ActionScript may also be utilized. Similarly, the non-optimizing compiler 106 and the optimizing compiler 108 are referred to as just-in-time (JIT) JavaScript compilers, but this is for purposes of being consistent with the use of JavaScript as the type of language that is used in the examples provided herein. Also, some implementations of the VM may use an interpreter instead of a non-optimizing compiler. Some implementations of the VM can use a combination of an interpreter and/or multiple levels of compilers, each of which can optimize code to various degrees based on its built-in capability.
As shown, the optimizing compiler 108 in this embodiment includes a vector guard generator 124, a guard condition handler 126 and a mapping module 128. In general, the vector guard generator 124 operates to create efficient vector guard conditions; the guard condition handler 126 operates to handle de-optimization in connection with the vector guard conditions being satisfied; and the mapping module 128 operates to enable a switch from the execution of optimized machine code (that includes vector operations) to suitable points in the scalar non-optimized code or interpreted execution with scalar operations.
Referring to
As shown, in this embodiment a VM heap 210 may only include the optimized code 114 because when a guard condition in the optimized code 114 is triggered, the interpreter 206 takes over interpretation from the IR (e.g., abstract syntax tree (AST)) of the source code 104 (e.g., JavaScript code) from a location identified by the map table 118. For the interpreter 206, the map table 118 includes the information about the points in the IR node to start interpretation from if the vector guard condition is triggered.
Although not depicted in
To better understand aspects of the present disclosure (which relate to dynamically-typed languages) it is helpful to understand important differences between dynamically-typed languages and statically typed languages. The following is a simple loop for illustration:
for(var i=0; i<256; ++i){A[i]=B[i]*i;}
The scalar code generated by compilers for statically typed languages may have the following in a loop body:
- Temp1=LOAD element “i” from B; //Temp1 and the LOAD instruction are based on the “Type” declared for array B
- Temp2=MPY Temp1, i; //Temp2 and the MPY instruction are based on the “Type” declared for “i” and “B”
- STORE Temp2 at element “i” of A; //The STORE instruction depends on the declared “Type” of array A
- i=ADD i, #1; //The INCREMENT or ADD instruction for the loop index depends on the “Type” declared for “i”
In contrast, for dynamically typed languages (e.g., JavaScript) there is no “type” declared in the source code, and the optimizing JIT compiler has to make various assumptions based on the type information gathered and insert guard conditions. The scalar code generated for the loop body in a dynamically typed language such as JavaScript may appear as:
- 1. <guard condition: check type of element at ith location in array B>
- 2. Temp1=LOAD element “i” from B;
- 3. <guard condition: check type of “i”>
- 4. Temp2=MPY Temp1, i;
- 5. <guard condition: check if element “i” in array A matches type of Temp2, else adjust array A to be able to hold Temp2 at “i”>
- 6. STORE Temp2 at element “i” of A
- 7. <guard condition: check type of “i”>
- 8. i=ADD i, #1
- 9. <guard condition: check if “i” can still be maintained in the same Type>
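The nine steps above can be sketched in JavaScript as follows. This is only an illustrative model, not the code an actual JIT compiler emits: the names GuardFailure and isSmallInt are hypothetical, the speculated type is assumed to be a 32-bit integer, and guard 5 is simplified to fail rather than adjust array A.

```javascript
// Hypothetical error type raised when a speculative type assumption fails.
class GuardFailure extends Error {
  constructor(step, i) {
    super(`guard ${step} failed at iteration ${i}`);
    this.step = step; // which numbered guard in the loop body failed
    this.i = i;       // loop index at the time of failure
  }
}

// Assumption of this sketch: the profiled type is a 32-bit integer.
const isSmallInt = (v) => Number.isInteger(v) && v >= -(2 ** 31) && v < 2 ** 31;

function guardedScalarLoop(A, B, n) {
  let i = 0;
  while (i < n) {
    if (!isSmallInt(B[i])) throw new GuardFailure(1, i);  // 1. type of B[i]
    const temp1 = B[i];                                   // 2. LOAD
    if (!isSmallInt(i)) throw new GuardFailure(3, i);     // 3. type of i
    const temp2 = temp1 * i;                              // 4. MPY
    if (!isSmallInt(temp2)) throw new GuardFailure(5, i); // 5. simplified: fail instead of adjusting A
    A[i] = temp2;                                         // 6. STORE
    if (!isSmallInt(i)) throw new GuardFailure(7, i);     // 7. type of i
    i = i + 1;                                            // 8. ADD
    if (!isSmallInt(i)) throw new GuardFailure(9, i);     // 9. i still fits the same type
  }
  return A;
}
```

Calling guardedScalarLoop with an array B that contains a string at index 2 would raise a GuardFailure identifying guard 1 and iteration 2, which is the kind of information a de-optimization step needs.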
It may be possible to eliminate or hoist (outside the loop body) some of the checks (e.g., the checks shown in lines 1, 3, 5, 7, or 9 of Loop Body 1) through known compiler analysis (e.g., range analysis) and optimizations (e.g., bounds check), but most often a few checks still remain.
A compiler for statically typed languages may also vectorize, by creating a temporary array (of length=vector length, e.g., 4 in this example), to hold the running values of “i” and increment each element by 1.
Vector_i[4]={0,1,2,3}; for(var i=0; i<256; i=i+4){A[i, i+1, i+2, i+3]=B[i, i+1, i+2, i+3]*Vector_i[0,1,2,3];}
The vector code generated by compilers for statically typed languages may have the following in the loop body:
- 1. Vector_Temp1=VECTOR_LOAD 4 elements starting at “i” from B;
- 2. Vector_Temp2=VECTOR_MPY Vector_Temp1, Vector_i;
- 3. VECTOR_STORE Vector_Temp2 at 4 elements starting at “i” of A
- 4. Vector_i =VECTOR_ADD Vector_i, “Const Vector #1 for each element in a vector”
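The four vector steps above can be modeled in JavaScript as a sketch. Plain JavaScript exposes no portable vector instructions, so each vector operation is emulated by a hypothetical 4-lane helper (vectorLoad, vectorMpy, vectorAdd, vectorStore). Typed arrays stand in for statically typed storage. As a further assumption of the sketch, every lane of the index vector is advanced by the vector length so that the lanes keep tracking i through i+3 on each pass.

```javascript
const VL = 4; // vector length used in the example

// Emulated vector helpers (hypothetical names); real machine code would use SIMD instructions.
const vectorLoad = (arr, i) => Array.from({ length: VL }, (_, k) => arr[i + k]);
const vectorMpy = (x, y) => x.map((v, k) => v * y[k]);
const vectorAdd = (x, y) => x.map((v, k) => v + y[k]);
const vectorStore = (arr, i, v) => v.forEach((val, k) => { arr[i + k] = val; });

function vectorizedLoop(A, B, n) {
  let vectorI = [0, 1, 2, 3];      // running values of i, one per lane
  const step = [VL, VL, VL, VL];   // assumption: advance each lane by the vector length
  for (let i = 0; i < n; i += VL) {
    const vectorTemp1 = vectorLoad(B, i);               // 1. VECTOR_LOAD
    const vectorTemp2 = vectorMpy(vectorTemp1, vectorI); // 2. VECTOR_MPY
    vectorStore(A, i, vectorTemp2);                      // 3. VECTOR_STORE
    vectorI = vectorAdd(vectorI, step);                  // 4. VECTOR_ADD
  }
  return A;
}
```

The sketch assumes n is a multiple of the vector length; a remainder loop would otherwise be needed.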
For dynamically typed languages (e.g., JavaScript), if all of the guard conditions can be hoisted outside the loop body or eliminated by compiler analysis and optimizations, the vectorized code can appear similar to the above (as in statically typed languages). But this is most unlikely to happen. For simplicity of explanation, assume that it is possible to optimize away (or hoist outside the loop) the guard conditions 1, 5, and 7 of the scalar code (shown in Loop Body 1) generated for the loop body in a dynamically typed language. In that case, the concept of a vector-guard condition must be introduced for the equivalent of the scalar guards 3 and 9 (in Loop Body 1) described above.
The vector code generated by compilers for dynamically typed languages may have the following in the loop body with the introduction of vector-guard conditions:
- 1. Vector_Temp1=VECTOR_LOAD 4 elements starting at “i” from B;
- 2. <Vector-guard condition: check the type of each element of Vector_i>
- 3. Vector_Temp2=VECTOR_MPY Vector_Temp1, Vector_i;
- 4. VECTOR_STORE Vector_Temp2 at 4 elements starting at “i” of A
- 5. Vector_i=VECTOR_ADD Vector_i, “Const Vector #1 at each element”
- 6. <Vector-guard condition: check if all elements of “Vector_i” can still be maintained in the same Type>
The vector guard condition generated by the vector guard generator 124 is a unified guard condition for all the different elements of the vector. The guard condition handler 126 detects if any of the vector elements failed the guard condition and also provides the position of the failed element in the vector.
Referring next to
But if there is a failure, the position of the failed element in the vector is additionally computed in the deferred computation path, which is taken only when de-optimization is needed. The failed position is needed to guide and perform effective de-optimization from the vector code and to select the suitable position in the scalar non-optimized machine code 112 (or in the interpreter 206 execution) to switch execution to. As discussed further herein, the vector guard condition logic 326 of the vector guard generator 124 may be implemented by a sequence of one or more vector/scalar instructions of the processor on which the optimized code 114 is running, based on the functionality the vector-guard condition is testing, and the guard condition handler 126 may handle de-optimization for vector guard conditions and enable a switch from a point in the vectorized optimized code 114 where a guard condition fails to suitable points in the scalar non-optimized machine code 112.
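The split between a cheap pass/fail test and a deferred position computation can be sketched as follows. The function names and the 32-bit integer speculation are assumptions made for illustration; an actual implementation would use processor vector/scalar instructions rather than JavaScript loops.

```javascript
// Assumption of this sketch: every lane is speculated to hold a 32-bit integer.
const laneIsInt32 = (v) => Number.isInteger(v) && v >= -(2 ** 31) && v < 2 ** 31;

// Fast path: a single pass/fail answer for the whole vector, with no position tracking.
function vectorGuardPasses(lanes) {
  let ok = true;
  for (const v of lanes) ok = ok && laneIsInt32(v);
  return ok;
}

// Deferred path: executed only after a failure, to locate the first failing lane.
function failedLanePosition(lanes) {
  return lanes.findIndex((v) => !laneIsInt32(v));
}

// Combined check: the position is computed only when the guard is triggered.
function checkVectorGuard(lanes) {
  if (vectorGuardPasses(lanes)) return { triggered: false, position: -1 };
  return { triggered: true, position: failedLanePosition(lanes) };
}
```

For example, checkVectorGuard([1, 2, 2 ** 31, 4]) reports a triggered guard at position 2, because the third lane no longer fits the speculated type.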
Referring to
Once a failure is detected by a vector-guard condition (Block 402), and the failed element position is computed in the deferred (non-optimized) path (Block 404), the next step is to handle de-optimization and switch to suitable points (identified by the failed element position) in the scalar non-optimized code (or the scalar operation execution point in the interpreter 206 when the VM 202 implementation is using the interpreter 206 instead of the non-optimizing compiler 106) and re-start execution with new type gathering. A challenge is to determine the efficient and functionally correct point in the scalar non-optimized code 112 (or in the interpreter 206 execution) to switch to, given that no 1:1 mapping exists between the vector code and the scalar code, unlike the mapping that exists between scalar optimized code and scalar non-optimized code.
As shown in
For each vector guard condition there exist multiple candidate points in the scalar non-optimized code 112 (or the execution of scalar operations as interpreted by the interpreter 206) depending on the vector length. But the most efficient point to switch to also depends on the data/control flow dependencies of the program code in the loop body. For example, for a vector length of 4 there are 4 elements in the vector, each representing one of 4 consecutive iterations of the loop. If the condition fails for the 3rd element, the efficient point in the scalar code may not be the beginning of the first iteration. Instead, it may be a point in the 2nd or the 3rd iteration based on the data/control flow dependencies of the code in the loop body, for example, when there are no recurrence dependencies in the loop iterations.
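One way to picture such a mapping is the following sketch of a map table keyed by guard and element position. The table layout, the field names (iterationOffset, scalarStep, hasRecurrence), and the policy are all hypothetical. The sketch also assumes that results of the elements before the failed one have already been committed when execution resumes at a later iteration.

```javascript
const VECTOR_LENGTH = 4;

// Build a hypothetical map table: one entry per (guard, lane) pair.
function buildMapTable(guards) {
  const table = new Map();
  for (const g of guards) {
    for (let lane = 0; lane < VECTOR_LENGTH; lane++) {
      // Without recurrence dependencies, resume at the failed lane's own iteration;
      // otherwise fall back conservatively to the first iteration of the vector group.
      const iterationOffset = g.hasRecurrence ? 0 : lane;
      table.set(`${g.id}:${lane}`, { iterationOffset, scalarStep: g.scalarStep });
    }
  }
  return table;
}

// Translate a failed lane into a scalar iteration and a step within the scalar loop body.
function resumePoint(table, guardId, lane, vectorBaseIteration) {
  const entry = table.get(`${guardId}:${lane}`);
  return {
    iteration: vectorBaseIteration + entry.iterationOffset,
    scalarStep: entry.scalarStep,
  };
}
```

With a vector that starts at iteration 8, a failure in the 3rd element (lane 2) of a guard without recurrences maps to scalar iteration 10, whereas a guard on a loop with recurrences maps back to iteration 8.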
Referring next to
At the optimizing compiler 108 an intermediate representation of the source code 104 is generated (Block 504), and multiple scalar operations in the intermediate representation are transformed from a scalar form to a vector form (Block 506). Referring to
As shown, in the context of the embodiment in
Referring briefly to
In the embodiment depicted in
Referring again to
As shown, the optimized machine code 114 is then executed (Block 516), and if a guard condition is triggered, the map table 118, 218 is accessed to map the element of the vector operation that failed to a particular scalar operation (Block 518). The non-optimized machine code (or interpreter if the implementation is employing the interpreter 206 instead of the non-optimizing compiler 106) is then executed from the particular scalar operation (Block 520).
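The hand-off from optimized vector execution to scalar execution can be sketched end to end as follows. This is a simplified model under stated assumptions: the function names are hypothetical, the speculated type is a 32-bit integer, simple index arithmetic stands in for the map table lookup, and the handler commits the results of the elements preceding the failed one before switching.

```javascript
const LANES = 4;
const fitsInt32 = (v) => Number.isInteger(v) && v >= -(2 ** 31) && v < 2 ** 31;

// Non-optimized scalar path: generic semantics, able to start at any iteration.
function scalarFrom(A, B, n, start) {
  for (let i = start; i < n; i++) A[i] = B[i] * i;
  return A;
}

// Optimized path: handles LANES iterations per pass under the int32 speculation.
// Returns the first iteration that still has to run in the scalar path.
function optimizedVectorLoop(A, B, n) {
  let i = 0;
  for (; i + LANES <= n; i += LANES) {
    const lanes = [];
    for (let k = 0; k < LANES; k++) lanes.push(B[i + k]);            // emulated VECTOR_LOAD
    if (!lanes.every(fitsInt32)) {                                    // vector guard, fast path
      const failed = lanes.findIndex((v) => !fitsInt32(v));           // deferred path
      for (let k = 0; k < failed; k++) A[i + k] = lanes[k] * (i + k); // commit earlier lanes
      return i + failed;                                              // mapped scalar iteration
    }
    for (let k = 0; k < LANES; k++) A[i + k] = lanes[k] * (i + k);    // emulated MPY and STORE
  }
  return i; // any remaining tail iterations also run in the scalar path
}

function run(A, B, n) {
  const resume = optimizedVectorLoop(A, B, n);
  return scalarFrom(A, B, n, resume);
}
```

In this sketch the final contents of A match what a purely scalar run would produce, which is the functional-correctness requirement any choice of resume point has to satisfy.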
The virtual machine 102, 202 then repeats the process of profile-based optimized code generation for this function. With the execution now shifted back to non-optimized machine code 112 (or the interpreted execution), new profiles and types are gathered again as the non-optimized code 112 is executed or interpreted. Once the background profiler and type collection module 122 determines a function of the non-optimized code 112 is “hot” enough for optimized compilation, the optimizing compiler 108 works to create new optimized code 114 for this function based on the newly gathered type and profile information. At this point the optimizing compiler 108 may re-generate optimized machine code 114 that may or may not employ similar or other forms of vector operations compared to the earlier version of the optimized machine code 114 (e.g., it may create machine code using the multiple scalar operations instead of the vector operation). Whether the newly generated optimized code 114 uses vector operations or not depends on the new profile and type information about the new dynamic behavior of the execution determined by the new run of the non-optimized code 112 (or interpreted source code 104).
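The re-profiling cycle described above can be modeled with a small sketch. The class name TieredFunction, the threshold HOT_THRESHOLD, and the rule that vectorization is chosen only when a single element type was observed are all assumptions made for illustration; actual virtual machines use considerably richer profiles.

```javascript
const HOT_THRESHOLD = 3; // hypothetical number of runs before a function counts as "hot"

class TieredFunction {
  constructor() {
    this.tier = "non-optimized";
    this.calls = 0;
    this.seenTypes = new Set(); // type profile gathered in the non-optimized tier
  }

  // Simulates one execution and returns the tier that ran it.
  invoke(values) {
    if (this.tier === "optimized") {
      const speculationHolds = values.every((v) => this.seenTypes.has(typeof v));
      if (speculationHolds) return "optimized";
      // Guard triggered: fall back and start gathering a fresh profile.
      this.tier = "non-optimized";
      this.calls = 0;
      this.seenTypes.clear();
    }
    for (const v of values) this.seenTypes.add(typeof v);
    this.calls += 1;
    if (this.calls >= HOT_THRESHOLD) this.tier = "optimized";
    return "non-optimized";
  }

  // Assumption of this sketch: vectorize only if the profile saw a single element type.
  get wouldVectorize() {
    return this.tier === "optimized" && this.seenTypes.size === 1;
  }
}
```

After a guard failure caused by mixed element types, the regenerated optimized code in this model no longer vectorizes, mirroring the statement that the new code depends on the newly gathered profile.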
Referring next to
And
Referring next to
Referring next to
Referring next to
The display 1318 generally operates to provide a presentation of content to a user, and may be realized by any of a variety of displays (e.g., CRT, LCD, HDMI, micro-projector and organic light emitting diode (OLED) displays). And in general, the nonvolatile memory 1320 functions to store (e.g., persistently store) data and executable code including code that is associated with the functional components depicted in
In many implementations, the nonvolatile memory 1320 is realized by flash memory (e.g., NAND or ONENAND memory), but it is certainly contemplated that other memory types may be utilized as well. Although it may be possible to execute the code from the nonvolatile memory 1320, the executable code in the nonvolatile memory 1320 is typically loaded into RAM 1324 and executed by one or more of the N processing components 1326.
The N processing components 1326 in connection with RAM 1324 generally operate to execute the instructions stored in nonvolatile memory to effectuate the functional components depicted in
The transceiver component 1328 includes N transceiver chains, which may be used for communicating with a network. Each of the N transceiver chains may represent a transceiver associated with a particular communication scheme. For example, each transceiver may correspond to protocols that are specific to local area networks, cellular networks (e.g., a CDMA network, a GPRS network, a UMTS network), and other types of communication networks.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims
1. A computing device for compiling source code, the device including:
- a non-optimizing compiler configured to generate non-optimized machine code that includes multiple scalar operations, each scalar operation includes a single pair of operands;
- an optimizing compiler configured to generate optimized machine code including a vector operation corresponding to the multiple scalar operations, the vector operation is a single operation on multiple pairs of operands, the optimizing compiler including: a vector guard condition generator configured to generate a vector guard condition for, at least, the vector operation; a mapping module to generate a mapping between elements of the vector guard condition and positions in the non-optimized machine code; and a guard condition handler configured to initiate execution of a particular scalar operation of the non-optimized machine code if the vector guard condition is triggered.
2. The computing device of claim 1, wherein the source code is a type selected from the group consisting of JavaScript, LISP, SELF, Python, Perl, and ActionScript.
3. The computing device of claim 1, wherein the vector guard condition generator is configured to generate a reference vector and the guard condition handler is configured to compare the reference vector with an output of the vector operation to determine if the vector guard condition is triggered.
4. A method for compiling source code, the method comprising:
- receiving source code of a dynamically-typed language wherein types of operations are not defined in the source code;
- generating an intermediate representation from the source code;
- performing interpreted execution of the intermediate representation;
- gathering profile information to determine if optimized machine code should be created or not;
- transforming multiple scalar operations in the intermediate representation from a scalar form to a vector operation, wherein each scalar operation includes a single pair of operands, and the vector operation is a single operation on multiple pairs of operands;
- creating a vector guard condition for, at least, a vector operation;
- creating optimized machine code that includes the vector operation and the vector guard condition;
- executing the optimized machine code containing the vector operation;
- mapping an element of the vector guard condition in the optimized machine code to a particular scalar operation of the intermediate representation if the vector guard condition is triggered during execution of the vector operation in the optimized machine code; and
- switching back to start interpretation of the intermediate representation from the particular scalar operation when the guard condition is triggered.
5. The method of claim 4, including:
- generating a reference vector; and
- comparing the reference vector with an output of the vector operation to determine if the vector guard condition is triggered.
6. The method of claim 4, including switching to execute the optimized machine code after starting the interpretation.
7. The method of claim 4, including:
- generating a mapping table that maps, for the vector guard condition, each of a plurality of element positions of the vector operation to a node in the intermediate representation of the source code.
8. A computing device for compiling source code, the computing device including:
- an interpreter configured to interpret an intermediate representation of the source code that includes multiple scalar operations, each scalar operation includes a single pair of operands;
- an optimizing compiler configured to generate optimized machine code including a vector operation corresponding to the multiple scalar operations, the vector operation is a single operation on multiple pairs of operands, the optimizing compiler including: a vector guard condition generator configured to generate a vector guard condition for one or more vector operations; a mapping module to generate a mapping between elements of the vector guard condition and positions in the intermediate representation of the source code; and a guard condition handler configured to initiate interpretation of a particular scalar operation of the intermediate representation of the source code if the vector guard condition is triggered.
9. The computing device of claim 8, wherein the source code is a type selected from the group consisting of JavaScript, LISP, SELF, Python, Perl, and ActionScript.
10. The computing device of claim 8, wherein the vector guard generator is configured to generate a reference vector, and wherein the guard condition handler is configured to compare the reference vector with an output of the vector operation to determine if the vector guard condition is triggered.
11. The computing device of claim 8, wherein the optimizing compiler is configured to switch back to execute the optimized machine code after interpreting the particular scalar operation of the intermediate representation of the source code.
12. The computing device of claim 8, wherein the mapping module is configured to map, for the vector guard condition, each of a plurality of element positions of the vector operation to a node in the intermediate representation of the source code.
Type: Application
Filed: Jun 5, 2017
Publication Date: Oct 5, 2017
Inventors: Subrato Kumar De (San Diego, CA), Zaheer Ahmad (San Diego, CA), Dineel Sule (San Diego, CA), Yang Ding (San Diego, CA)
Application Number: 15/614,000