Computing Machine Using a Matrix Space And Matrix Pointer Registers For Matrix and Array Processing
This disclosure relates to methods and mechanisms for matrix computing which include machine embodiments with one or more matrix storage spaces for holding matrices and arrays for computing, where a matrix or an array is accessible by its columns, by its rows, or both, individually, or concurrently. A set of methods and mechanisms to build a large capacity instruction set with multi-length instructions to load, store, and compute with these matrices and arrays are also disclosed. Methods and access control mechanisms with keys to secure, share, lock and unlock regions in the storage space for matrices and arrays under the control of an operating system or a virtual machine hypervisor by permitted threads and processes are also disclosed. Methods and mechanisms to handle long immediate operands for use by shorter instructions using a payload instruction are also disclosed. The structure of the instructions with key instruction fields and a method for determining instruction length are also disclosed.
The present application is a continuation of, and claims benefit of priority to the U.S. Non-Provisional patent application Ser. No. 16/396,680 titled “A COMPUTING MACHINE USING A MATRIX SPACE AND MATRIX POINTER REGISTERS FOR MATRIX AND ARRAY PROCESSING” filed on Apr. 27, 2019, which is a continuation-in-part of, and claims benefit of priority to U.S. Non-Provisional patent application Ser. No. 15/488,494 titled “A COMPUTING MACHINE ARCHITECTURE FOR MATRIX AND ARRAY PROCESSING” filed on Apr. 16, 2017, which claims benefit of priority to U.S. Provisional Application Serial No. U.S. 62/327,949 titled “A COMPUTING MACHINE ARCHITECTURE FOR MATRIX AND ARRAY PROCESSING” filed on Apr. 26, 2016, all incorporated by reference herein.
BACKGROUND
The prior art Reduced Instruction Set (RISC) Architectures have used fixed word length sizes for computing. With fixed word length the number of instructions in RISC architectures cannot grow over generations beyond a limit. They have been upgraded for SIMD computing with vector registers and vector computing units. In contrast, the so called Complex Instruction Set (CISC) Architectures for computing have utilized variable word length instructions. Their complexity often derives from the difficulty in determining the word length and the use of memory operands in a large number of instructions including those that use the Arithmetic Logic Units (ALUs) and other computational units. Many of these have been upgraded to perform SIMD computation with vector registers. Each has several disadvantages associated with their complexity or extensibility which may include higher power consumption, or lower performance in some cases.
The prior art may incorporate various embodiments of a register file comprising vectors of scalar values. In some prior art, each register of a file held one vector and a mask determined the valid values of the vector. In some prior art, a vector was simply a register in a SIMD register file. In both cases, instructions using these vectors may simply read one to three vector values from the corresponding registers, wherein the vector values may be computed by adding, subtracting or performing a simple arithmetic or logical operation between the vectors. The prior art may not implicitly recognize the data stored in the registers as a matrix or a plurality of matrices and hence may not determine matrix properties like rank, triangularity and such, using an easy and direct method without using many instructions.
In prior art, Matrix computations may be done by a Central Processing Unit using vector registers and SIMD instructions. All matrices are to be stored, loaded and processed as 1-dimensional vectors in prior art. In some cases, special purpose units called systolic arrays may be used to process matrices. A systolic array may be a grid-like structure of special processing elements that may process data much like an n-dimensional pipeline. Unlike a pipeline, the input data as well as partial results may flow through the array. Systolic arrays use a matrix of computational units (multiply accumulators) with local storage to hold the operands of computation.
SUMMARY
This disclosure presents methods and mechanisms for matrix computing and array processing. It also discloses an extensible computer-implemented instruction set with a capacity for a large number of instructions that allows for computing with arrays and matrices. The matrices and arrays may be held in a storage space called a Matrix Space and accessed by rows, or columns, or both, individually or concurrently, with the help of matrix pointer registers, for computing. It also discloses a set of machine instructions and methods to load, store and compute with these matrices, and methods and mechanisms to secure, share, lock and unlock regions in a Matrix Space under the control of an operating system or a virtual machine monitor. Also disclosed herein are methods and mechanisms for immediate mode addressing of immediate operands using ‘payload’ instructions. Use of the payload instructions allows more bits to be available for definition and decoding of a larger number of instructions, and hence allows the instruction set to grow significantly with newer instructions over many generations.
A system of one or more computers may be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation may cause the system to perform the actions. One or more computer programs may be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, may cause the apparatus to perform the actions. In some aspects, corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, may be configured to perform the actions of the methods. The present disclosure relates to a computer-implemented instruction set comprising: one or more matrix instructions that perform operations on one or more matrices and arrays, wherein a reading and decoding of the one or more matrix instructions allows for decoding and using an index of a matrix pointer register.
Implementations may comprise one or more of the following features. The computer-implemented instruction set wherein a reading and decoding of the one or more matrix instructions allows for writing and reading contents of a matrix pointer register which comprise a matrix allocation location. The computer-implemented instruction set wherein obtaining contents of the matrix pointer register allows for access to the one or more matrices and arrays. The computer-implemented instruction set wherein the one or more matrices and arrays accessed using a matrix pointer register reside in a matrix space. The computer-implemented instruction set may further comprise: one or more load matrix instructions; and one or more store matrix instructions. The computer-implemented instruction set may further comprise vector instructions configured to operate upon one or more of vector entities, wherein the vector entities comprise one or more scalars or packed and ordered groups of values. The computer-implemented instruction set may further comprise matrix instructions configured to operate upon one or more of matrices and arrays, and one or more vector entities. The computer-implemented instruction set wherein the matrix instructions comprise a plurality of highly structured multi-length instructions, and wherein each of the highly structured multi-length instructions comprises a multiple of 16 bits. The computer-implemented instruction set wherein the instruction set may comprise highly structured multi-length instructions. Implementations of the described techniques may comprise hardware, a method or process, or computer software on a computer-accessible medium.
In some aspects, corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, may be configured to perform the actions of the methods. The present disclosure relates to a matrix space comprising: an embedded storage configured to hold matrices and/or multidimensional arrays and/or vectors of values for computation whose elements are accessible by at least one of rows and columns; a set of matrix pointer registers configured to hold location, size and type information of matrices and arrays; and a set of machine instructions configured to execute one or more algorithms or programs.
Implementations may comprise one or more of the following features. The matrix space wherein the values comprise one or more numeric values, non-numeric values, or packed and ordered groups of scalars. The matrix space wherein the set of matrix pointer registers are configured to access the matrices or multidimensional arrays and vectors. The matrix space wherein the matrix space may comprise a random access memory that can be accessed by rows and columns, or both. The matrix space wherein the rows and columns are accessible in two dimensions. The matrix space wherein the matrix space is on a single semiconductor chip. The matrix space wherein the matrix space is on a plurality of semiconductor chips. The matrix space wherein the rows and columns are accessible concurrently. The matrix space wherein a member of the set of matrix pointer registers is configured to store a pair of numbers designating row and column addresses of an origin of the matrix. Implementations of the described techniques may comprise hardware, a method or process, or computer software on a computer-accessible medium.
In some aspects, corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, may be configured to perform the actions of the methods. The present disclosure relates to a method of executing one or more matrix instructions comprising the method steps of: decoding one or more opcode fields in the matrix instruction to determine functionality and operations to perform; decoding operands to determine at least one or more source and/or one or more destination matrix pointer registers; reading contents of at least one or more source and/or one or more destination matrix pointer registers; using the read contents in computing locations of one or more origins of one or more arrays; using the read contents in computing sizes and extents of the one or more arrays; using the read contents in computing types of values in the one or more arrays; reading out the one or more arrays by at least one of rows and columns to ports of a matrix space; using values in the one or more arrays in performing a computation; and writing computed results into one or more of scalar registers, vector registers, or by at least one of rows and columns into one or more destination arrays in a matrix space.
Implementations may comprise one or more of the following features. The method may further comprise the method step of computing an effective address of a memory location, wherein the effective address is configured to load data from a memory in a load matrix instruction. The method may further comprise the method step of computing an effective address of a memory location, wherein the effective address is configured to store data into a memory in a store matrix instruction. Implementations of the described techniques may comprise hardware, a method or process, or computer software on a computer-accessible medium.
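As a non-limiting illustration, the method steps summarized above (reading a matrix pointer register, computing the origin and extent of an array, and reading the array out by rows or by columns) may be sketched in Python as follows. This is a minimal software model under stated assumptions and not the disclosed hardware: the names `MatrixPointer`, `MatrixSpace`, `read_row`, `read_col` and `write_row`, the field layout of the pointer, and the 8 by 8 size of the storage are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MatrixPointer:
    """Hypothetical matrix pointer register: origin, extent and element type."""
    row: int                  # row address of the matrix origin in the matrix space
    col: int                  # column address of the matrix origin
    n_rows: int               # number of rows in the matrix
    n_cols: int               # number of columns in the matrix
    elem_type: str = "int32"  # type tag only; not interpreted in this sketch


class MatrixSpace:
    """Toy two-dimensional storage that can be read by rows or by columns."""

    def __init__(self, rows, cols):
        self.cells = [[0] * cols for _ in range(rows)]

    def read_row(self, mp, i):
        # Row i of the matrix, located relative to the origin held in the pointer.
        return self.cells[mp.row + i][mp.col:mp.col + mp.n_cols]

    def read_col(self, mp, j):
        # Column j of the matrix, read directly without any stride arithmetic.
        c = mp.col + j
        return [self.cells[r][c] for r in range(mp.row, mp.row + mp.n_rows)]

    def write_row(self, mp, i, values):
        if len(values) != mp.n_cols:
            raise ValueError("row length does not match the matrix extent")
        self.cells[mp.row + i][mp.col:mp.col + mp.n_cols] = list(values)


space = MatrixSpace(8, 8)
a = MatrixPointer(row=2, col=3, n_rows=2, n_cols=3)
space.write_row(a, 0, [1, 2, 3])
space.write_row(a, 1, [4, 5, 6])

# A computation that consumes the same matrix by rows and by columns.
row_sums = [sum(space.read_row(a, i)) for i in range(a.n_rows)]
col_sums = [sum(space.read_col(a, j)) for j in range(a.n_cols)]
```

In this sketch the instruction operand would only need to carry the index of a pointer such as `a`; the location, size and type of the matrix are obtained from the pointer contents.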
The present disclosure relates to a computer-implemented instruction set (henceforth also “instruction set” or “instruction set architecture”) comprising highly-structured instructions (henceforth “instructions” or “machine instructions”) and a machine implementation that uses such instructions to compute with matrices (also matrixes), arrays and vectors. A matrix or an array may both henceforth be referred to as a “matrix” or “array” (matrices or arrays in plurality) interchangeably without distinction in this disclosure except when distinguished. More specifically, the highly-structured instructions are designed to have their instruction lengths comprise exact multiples of 16 bits. To allow for a large number of instructions and a highly extensible computer-implemented instruction set, the disclosure also introduces a payload instruction in conjunction with methods to handle immediate operands. This mechanism allows a large number of instructions to be used in an implementation. As used herein, a 16-bit instruction refers to a machine instruction comprising 16 bits and a 32-bit instruction refers to a machine instruction comprising 32 bits. In some aspects, the terms 16-bit instruction and 32-bit instruction do not imply a size of the addressable memory space, the default sizes of the operands, or the data width used in an instruction.
As used herein, a type of element comprises a category of element, in contrast to determining a value of a number or letter. Identifying types of elements may allow for the processing of complex strings comprising combinations of letters, segments, and numbers. Identifying types may allow for separate processing of like types. As non-limiting illustrative examples, a type may comprise a Byte, Short integer, Integer Word, Long integer, Pointer (to a memory location), Ordered Pair of binary values, Ordered Quad of 4 binary values, Triad of 3 binary values, Half precision floating point values, Single precision floating point values, Double Precision floating point values, Extended Precision floating point values, Ordered Pair of floating point values, Ordered Quad of 4 floating point values, collections of Nibbles, bits, di-bits (2-bit values), and so on.
Referring now to
Referring now to
Referring now to
In some implementations, a 32-bit instruction 210 immediately adjacent to a 16-bit instruction 200 may jointly be treated as one 48-bit instruction based on the value of embedded OPM field 201B and Opcode0 204B in the 32-bit instruction. In some aspects, a 48-bit composite instruction may comprise a 16-bit instruction 212B with its LEN bit 201 indicating 16-bit length, and an adjacent 32-bit instruction 220 whose LEN bit 201A may indicate 32-bit length. In some embodiments, the 16-bit portion of the 48-bit instruction may be decoded by a 16-bit decoder while the 32-bit portion of it may be decoded by a 32-bit decoder and the two decoded results may be merged to form the final decoded operation. This may make the instruction decode much simpler even with a plurality of lengths that occur. In some embodiments, where the 16-bit instruction decode may determine that its Opcode0 204 may be a Payload Immediate, or a 48-bit instruction Opcode (as in the flowchart of
Referring now to
In some aspects, regions of a Matrix Space may be pre-allocated to predefined matrices, arrays, processes, process threads, data types, instruction sequences from a particular customer (or user/owner of some data/process—henceforth “customer”), or to a single thread of instructions, or even different virtual machines, and host and various guest operating systems, as non-limiting examples. In some aspects, the MPU may run an algorithm to determine where to put specific data based on user-friendly coding instructions and security considerations including ownership. The MPU may run off predefined criteria, such as word size or data type, as non-limiting examples.
In some implementations, this may allow the MPU to make better and more efficient use of a Matrix Space. This may also allow the MPU to have more overall space. In some aspects, the process shown in the Matrix Space may also be stride-less in order for the MPU to run at maximum efficiency, since the Matrix Space may be accessed by rows and/or by columns, and by both rows and columns concurrently, when necessary. In contrast to using strides to identify an adequate size in the Matrix Space on an as-needed basis, the present disclosure pre-allocates space (called matrix allocation, henceforth “allocation”) within regions in a Matrix Space as configured by matrix pointer registers in various embodiments. In some aspects, the Matrix Space may hold one or more matrices, and/or arrays and/or vectors comprising data in a manner configured by one or more matrix regions and matrix pointer registers. In some embodiments, a specific customer, or program thread, or process may have a pre-allocated space where the same pre-allocated space is used each time instructions are run for that specific customer or program thread, or process.
There may be a noticeable space optimization in the Matrix Space using pre-allocation instead of using stride. In some implementations, the overlap may be based on predefined, acceptable, or determined similarities, such as by data type, program type, or customer. For example, in some embodiments different data sets may have overlapping pre-allocated space for the same customer, or process thread, or process. In some embodiments, the pre-allocated space may comprise a 16-bit space, which may allow for data sets of 4, 8, and 16 bits. The determination may be manually selected by the user, or there may be an auto determination from the MPU based on the type of input and which organizational tool may best fit the need of the MPU.
Referring now to
Referring to
Accessing and Computing with a Matrix in a Matrix Space Using Matrix Pointer Registers
Referring to
In some aspects, a matrix or array in Matrix Space may be controlled, accessed, read out or written into by using the fields in a longer machine instruction with operands that provide the location, size and type of the said matrix or array, thereby not employing a matrix pointer register.
Referring to
As illustrative examples, the following may be a partial and exemplar list of matrix operations that may be performed in some embodiments: a matrix or array may be loaded from System Memory or a main memory and/or a cache into a Matrix Space; a matrix or array may be stored to System Memory or a main memory or a cache from a Matrix Space; individual rows or columns of elements of a matrix or array may be accessed for reading or writing or moving them to other storage elements in a matrix space; in some embodiments, rows or columns of a matrix may be used for vector operations with vectors in matrix space or in vector registers. In some embodiments, elements of rows and/or columns of a matrix or array, under control of a matrix instruction, may be counted, re-ordered, sorted, scaled, summed, multiplied, AND-ed, OR-ed, negated, logically inverted, tested, compared and zero-ed. Any number of other similar arithmetic, logical, and transport operations may be performed. In some embodiments, some further operations by matrix instructions may comprise the following: a matrix or array in a Matrix Space may be moved, copied, split, transposed, or reordered in part or in full. In some embodiments, a further list of operations by matrix instructions may comprise addition, subtraction, multiplication, convolution and other matrix arithmetic, logic, discrete math, string and flow control operations involving matrices, vectors, arrays, scalars or other multi-dimensional structures.
In some embodiments, linear algebra related operations comprising triangulation or linear transformation operations, tri-diagonalization, norm calculations, rotations, computing determinants or rank, auto-correlation and cross-correlation may be performed on matrices; sparse matrices or sparse arrays may be created/decompressed or compressed, and/or scattered/gathered, transported, and/or transformed; matrix or array arithmetic, logic, discrete math and flow control operations may be performed on sparse matrices and sparse arrays; other elementary matrix, array or graph processing including search, sort, swizzle, rearrange, filter, text and string processing, graph traversal, table pivoting and many others may be executed. In some embodiments, one or more neural computations and transcendental computations and dynamic programming computations comprising multi-dimensional convolutions, maxpooling, average pooling, sigmoid, and hyperbolic and trigonometric functions, max, min, minmax, softmax, pivoting, flattening, sampling, interpolation, decimation, ReLU, and operations for gradient descent may be performed by one or more matrix instructions.
In some embodiments, a scalar register may be added to or subtracted from a Matrix Pointer Register. An immediate value may be added to or subtracted from a Matrix Pointer Register. A Matrix Pointer may be moved to another Matrix Pointer or to a general register. A Matrix Pointer Register may be loaded and stored. Other operations may be performed on the contents of a Matrix Pointer Register.
Referring now to
Referring now to
Loading a Matrix from System Memory
Referring now to
As an illustrative example, following this flowchart in the context of some embodiments in
The contents of register 303 including its Type information may be set up appropriately for Matrix A prior to or at the completion of the LOAD Matrix instruction by itself or by one or more instructions in the program sequence. The contents of the data buffer may be read and transferred in chunks comprising a plurality of elements into the rows or columns, or both rows and columns of Matrix A into the allocated location 310 in Matrix Space 301 via a plurality of ports 320, 321, 326, 327. The LOAD Matrix instruction may then be retired, thereby completing the process.
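As a non-limiting illustration, the transfer of a data buffer into an allocated location of a Matrix Space, one row or one column per transfer, may be sketched in Python as follows. The function name `load_matrix`, its parameters, and the row-major ordering of the data buffer are assumptions made for this sketch; the disclosure does not prescribe a memory layout, and the ports are not modeled.

```python
def load_matrix(space, origin, shape, buffer, by_rows=True):
    """Copy a flat buffer into a 2-D storage, one row or one column per transfer.

    Assumes the buffer holds the matrix in row-major order (an assumption of
    this sketch, not a requirement of the disclosure).
    """
    r0, c0 = origin
    n_rows, n_cols = shape
    if len(buffer) != n_rows * n_cols:
        raise ValueError("buffer size does not match matrix shape")
    if by_rows:
        for i in range(n_rows):
            # Each chunk is one full row of the matrix.
            chunk = buffer[i * n_cols:(i + 1) * n_cols]
            space[r0 + i][c0:c0 + n_cols] = chunk
    else:
        for j in range(n_cols):
            # Every n_cols-th element of row-major data forms column j.
            chunk = buffer[j::n_cols]
            for i, value in enumerate(chunk):
                space[r0 + i][c0 + j] = value
    return space


memory_buffer = [1, 2, 3, 4, 5, 6]            # a 2 x 3 matrix in row-major order

space_by_rows = [[0] * 6 for _ in range(4)]
load_matrix(space_by_rows, origin=(1, 2), shape=(2, 3), buffer=memory_buffer)

space_by_cols = [[0] * 6 for _ in range(4)]
load_matrix(space_by_cols, origin=(1, 2), shape=(2, 3), buffer=memory_buffer,
            by_rows=False)
```

Both transfer orders leave the same contents at the allocated location, which reflects that the location may be written by rows, by columns, or by both.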
Storing a Matrix to System Memory
Referring now to
In some aspects, the user may follow the method in the flowchart shown in
Referring now to
For example, a customer may be pre-allocated a single Matrix Region, wherein instructions for the customer may be run only in the pre-allocated Matrix Region. When not in use, the Matrix Region may not be accessible by other customers or programs and may not be processed as available Matrix Space. This may allow for increased security. In some embodiments, the Matrix Space 901 may be divided into 4 matrix regions, each of which may be independently secured and/or shared by assigning them properties using one or more privileged instructions by an operating system or a virtual machine (VM) monitor (also referred to as a hypervisor) running on the machine. In some aspects, the properties of a region may be assigned by the OS or hypervisor based on policies that may be configured a priori and as requested by an application process. A process thread may make further OS calls to request a set of attribute values for sharing and security settings to govern the allocated region. In some implementations, at the time of region allocation, the OS may optionally clear the information content or values held in that region of the Matrix Space 901. In some embodiments, an allocation policy setting may be used to forbid any instruction from causing the contents of a region to be transferred to another region or be used as a source operand in a computation whose results go to another region. In some embodiments, regions in a Matrix Space 901 may be allocated and secured by an access control mechanism comprising a set of thread registers such as 910, a set of key registers such as 919 (Keys_0) (and also key registers Keys_1, Keys_2, Keys_3) and control logic in HW (not shown) working in conjunction with an OS or hypervisor.
In some embodiments, a region 930 (Region 0) may be allocated and secured for a thread Thread_A0 registered in thread register 910 of a process 902 with process identifier numbered or named as Process_A by an Operating System call or hypervisor call. This call may use a privileged instruction for matrix region allocation to assign a free region to a process for matrix computing among those available in a list maintained by the OS or the hypervisor.
Locking and Unlocking Allocated Regions on a Context Switch or an Interrupt
In some embodiments of
In some embodiments, a 0 value in the Thread Key field of a region may block all threads in a process from accessing the region, and an all 1s value (equal to signed value −1 in some aspects) in that Thread Key field may enable all threads of that process to access the region. Similarly, a 0 value in the Process Key field of a matrix region's Key register may prevent every process in the associated process group from accessing the region, and an all 1s value may enable all processes in the associated process group to access that region of Matrix Space 901.
In some aspects, key values other than 0 or all 1s may be leased to individual processes by an OS or hypervisor, wherein the leasing may allow the one or more individual processes to access specific regions of Matrix Space 901 leased to them by an OS or hypervisor while blocking all other processes. Such a capability may be required when an interrupt occurs, and the OS is required to run some other process or a thread that may not access a region. In some implementations, this may allow the OS to quickly swap out a process or thread while locking that matrix region to all others. Upon resumption of the process leasing the region, the HW conducting access control may check and unlock the region to the thread(s) holding the correct keys once again.
In some embodiments, region 930 (Region 0) inside a Matrix Space 901 may be controlled in part by a Thread Key field 920 in a Key Register 919. In some aspects, holding a unique and non-zero value Y in Thread Key field 920 that may be assigned by an OS exclusively may secure region 930 (Region 0) to a thread Thread_A0 registered in thread register 910. Here, key value Y, which may not be equal to all 1s (or all 0s), may authenticate and enable only a thread holding a corresponding private key such as Thread_A0 registered in thread register 910 of the Process_A 902 to access region 930 of the Matrix Space 901. In some other aspects, the private part of key value Y held by threads Thread_A0 and Thread_A1 assigned by an OS to them, non-exclusively between the two, may allow both of them to share and access region 930 (Region 0) while securing region 930 from other threads and processes. The exact encryption, decryption, key generation, key management, key assignment and key exchange schemes may be various and different in different embodiments.
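As a non-limiting illustration, the key semantics described above (a 0 value blocking all, an all 1s value enabling all, and any other value enabling only a holder of the corresponding key) may be sketched in Python as follows. The 16-bit key width, the names `key_allows` and `may_access`, the plain equality comparison standing in for whichever authentication scheme an embodiment uses, and the requirement that both the Process Key and the Thread Key checks pass are assumptions of this sketch.

```python
KEY_BITS = 16                    # assumed key width; not specified in the disclosure
ALL_ONES = (1 << KEY_BITS) - 1   # same bit pattern as the signed value -1


def key_allows(region_key, presented_key):
    """Apply the 0 / all-1s / specific-value rule to one key field."""
    if region_key == 0:
        return False             # locked to everyone
    if region_key == ALL_ONES:
        return True              # open to everyone at this level
    # Plain equality stands in for the unspecified authentication scheme.
    return presented_key == region_key


def may_access(region_keys, process_key, thread_key):
    """Assumed policy: both the process-level and thread-level checks must pass."""
    return (key_allows(region_keys["process"], process_key)
            and key_allows(region_keys["thread"], thread_key))


Y = 0x00A7                                            # hypothetical leased key value
region0 = {"process": ALL_ONES, "thread": Y}          # secured to holders of Y
region2 = {"process": 0, "thread": 0}                 # locked to all
region3 = {"process": ALL_ONES, "thread": ALL_ONES}   # open to all threads

holder_ok = may_access(region0, process_key=0x0001, thread_key=Y)
other_ok = may_access(region0, process_key=0x0001, thread_key=0x0042)
locked_ok = may_access(region2, process_key=ALL_ONES, thread_key=ALL_ONES)
open_ok = may_access(region3, process_key=0x0001, thread_key=0x0042)
```

Under this sketch, locking a region on a context switch would amount to writing 0 into its key fields, and unlocking it would amount to restoring the leased values.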
In some implementations, the Thread Key field 923 controlled by Process_C may have an all 1s value (equal to a signed constant −1) in the keys register Keys_3, which may allow all threads of Process_C to access Region 3. In some embodiments, both the Process Key Field such as 942 and Thread Key Field such as 922 may hold a 0 value for each. This may lock Region 2 to all processes and threads until an OS or hypervisor changes the keys. In some aspects, the OS or hypervisor may unlock the region by loading a correct set of keys to provide appropriate access. In some implementations, the Key field 950 may be used to put a region under the control of an OS by a Virtual Machine hypervisor. In some embodiments, it may be controlled by an OS to restrict access to a smaller pool of processes that comprise a Process Group.
In some embodiments, a subset of keys or key fields may control only process level access privileges. This may be beneficial for system performance and ease of use. In some embodiments, keys may be used to control locking and sharing properties of individual regions or group of regions. In some aspects, Regions may be controlled recursively using multiple keys, and sub-regions or partitions of regions may be controlled more finely or coarsely using one or more keys.
In some implementations, instructions to Lock and Unlock using operands to copy to, write to, or control key registers may be provided for use by a process or its thread(s) for locking and unlocking the matrix regions that hold their matrices or arrays for computations. In some embodiments, a mechanism to encrypt the contents of a region or the keys may require authentication to secure the locking process. In some embodiments, no authentication may occur or be required. In some aspects, a customizable authentication may be installed upon request.
Methods and Mechanisms for Handling Immediate Operands in Machine Instructions
Referring now to
Referring now to
In some embodiments, the method 1100 in the flowchart in
Referring now to
In some embodiments, a 16-bit long ADDI instruction such as 1254 using an immediate operand may have its immediate operand's length extended from a mere 4 bits to a longer 15 bits using a payload instruction such as 1251, 1253, 1255 with an immediate operand. In some embodiments a payload instruction such as 1253 may supply 24 bits to concatenate to the 4 bit immediate operand of the ADDI instruction 1254 to extend the value even to 28 bits while incurring very little additional cost. In some embodiments an assembly instruction sequence may use Payload 1251 instructions in conjunction with a modified ADDI instruction 1256 that does not contain any immediate operands (or with zero length immediate operand) and still compute the result using the method 1100. In some embodiments a plurality of payload instructions such as 1251 and/or 1253 may be cascaded in a sequence to create longer immediate operands using method 1100. In some embodiments, it may simplify the instruction decoder. In some aspects, it may be noted that the method disclosed in the disclosure is different from the prior art of loading a register with an operand using a move immediate instruction and then performing a second operation using that register operand. This is because any immediate operation itself may have its immediate operand extended using a Payload instruction and it may not consume an addressed register out of a register file. In some embodiments, the immediate operand length may be enhanced with each sequential Payload 1251 instruction before the immediate operand is consumed by an operation. In some embodiments the concatenation operation may be replaced with an addition or a logical operation.
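As a non-limiting worked example, the concatenation of a 24-bit payload with a 4-bit immediate operand to form a 28-bit value may be sketched in Python as follows. The function name `extend_immediate`, the placement of the payload bits in the high-order positions, and the treatment of both fields as unsigned are assumptions of this sketch; the disclosure does not fix the ordering or signedness.

```python
def extend_immediate(payload, payload_bits, imm, imm_bits):
    """Concatenate payload bits (assumed high-order) with an instruction immediate.

    Returns the extended value together with its total width in bits.
    """
    if not 0 <= payload < (1 << payload_bits):
        raise ValueError("payload does not fit in payload_bits")
    if not 0 <= imm < (1 << imm_bits):
        raise ValueError("immediate does not fit in imm_bits")
    return (payload << imm_bits) | imm, payload_bits + imm_bits


# A 24-bit payload followed by an ADDI carrying a 4-bit immediate.
value, width = extend_immediate(0xABCDEF, 24, 0x5, 4)
```

Here `value` is 0xABCDEF5 and `width` is 28, and no addressed register of a register file is consumed in forming the operand.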
Referring now to
In some embodiments the payload mechanism 1300 comprises a shifter 1302 receiving a first input operand from an instruction; shifter 1302 is controlled by a shift controller 1303 via a second input; the shifter is coupled to an immediate operand register 1301 that may receive its first operand from the shifter 1302, and a second operand from its own first output as shown in
In some embodiments, the shifter 1302 may shift the first output of an immediate operand register loaded by one or more payload instructions, and the shifted value is concatenated to an immediate value from a subsequent instruction to produce an immediate result; the immediate result is then used by the subsequent instruction. If the subsequent instruction is also a payload instruction then the process continues further till completion (i.e. some termination condition is satisfied).
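As a non-limiting illustration, the cascading behavior described above may be modeled in Python with a small stateful register, as follows. The class `ImmediateOperandRegister`, the function `run`, the 8-bit payload widths in the example, and the clearing of the register once a non-payload instruction consumes the operand are assumptions of this sketch and are not taken from the disclosure.

```python
class ImmediateOperandRegister:
    """Toy model of an immediate operand register fed through a shifter."""

    def __init__(self):
        self.value = 0
        self.width = 0

    def shift_in(self, bits, width):
        # Shift the held value left and concatenate the new bits below it.
        self.value = (self.value << width) | (bits & ((1 << width) - 1))
        self.width += width

    def consume(self, imm, imm_width):
        # The consuming instruction contributes its own immediate last.
        self.shift_in(imm, imm_width)
        result, width = self.value, self.width
        self.value = self.width = 0   # assumed: register clears after use
        return result, width


def run(stream):
    """stream: list of (mnemonic, immediate, width) tuples, in program order."""
    reg = ImmediateOperandRegister()
    results = []
    for mnemonic, imm, width in stream:
        if mnemonic == "PAYLOAD":
            reg.shift_in(imm, width)   # keep accumulating
        else:
            results.append((mnemonic, *reg.consume(imm, width)))
    return results


program = [
    ("PAYLOAD", 0xAB, 8),   # hypothetical 8-bit payloads
    ("PAYLOAD", 0xCD, 8),
    ("ADDI", 0x5, 4),       # consumes the accumulated operand
    ("ADDI", 0x3, 4),       # no payload precedes it: plain 4-bit immediate
]
results = run(program)
```

In this sketch the termination condition is simply the arrival of a non-payload instruction; other termination conditions are possible.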
In some other embodiments, in the context of
In some embodiments, a machine configured to use a computer-implemented instruction set may comprise highly structured multi-length instructions with lengths in exact multiples of 16 bits (i.e. 16 bits, 32 bits, 48 bits, 64 bits, and such) that may be designed for use in matrix, array, and vector processing along with general computing. This may also include graphics processing and neural network computations. In some aspects, the instructions may comprise a bit field that determines instruction length, differentiating 16-bit instructions from 32-bit or longer instructions; the position of this field may be invariant in all instructions and may occur in the portion first decoded.
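As a non-limiting illustration, determining instruction length from a length bit in the first decoded 16-bit portion may be sketched in Python as follows. The position of the length bit (`LEN_MASK`, taken here as the least significant bit), its polarity, and the function names are assumptions of this sketch; the formation of 48-bit composite instructions from adjacent 16-bit and 32-bit instructions is not modeled.

```python
LEN_MASK = 0x0001   # assumed position of the LEN bit in the first 16-bit parcel


def instruction_length(first_parcel):
    """Return the length in bits; assumed polarity: LEN=0 is 16-bit, LEN=1 is 32-bit."""
    return 32 if first_parcel & LEN_MASK else 16


def split_instructions(parcels):
    """Group a sequence of 16-bit parcels into instructions using only the LEN bit."""
    instructions = []
    i = 0
    while i < len(parcels):
        n = instruction_length(parcels[i]) // 16
        if i + n > len(parcels):
            raise ValueError("truncated instruction at end of stream")
        instructions.append(tuple(parcels[i:i + n]))
        i += n
    return instructions


stream = [0x1230, 0x4561, 0x789A, 0x0002]
instructions = split_instructions(stream)
```

Because the length bit sits at a fixed position in the portion first decoded, the boundary of each instruction is known before any other field is examined.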
In some aspects, a field comprising bits may be designated and used as a major opcode whose position in all instructions may be invariant and may occur in the portion first decoded. In some implementations, a field may comprise bits used to modify the functionality of the major opcode and may partition an instruction set into a plurality of sub-sets, which may be customized, such as based on business limitation, simpler design, or combinations thereof, as non-limiting examples. In some aspects, the position may be invariant in all instructions and occurs in the portion first decoded.
In some aspects, a field comprising bits may identify instructions used by one or more built-in special function units and application-specific co-processor units, wherein the position of the field may be invariant in all instructions and may occur in the portion first decoded. In some embodiments, a field comprising bits may be designated as a primary destination operand or a source operand whose position may be invariant in all instructions and may occur in the portion first decoded. In some implementations, various fields of bits may be designated for use as source operands, secondary destination operands, secondary, tertiary, or miscellaneous opcodes, row, column, or level designators, attributes, immediate values, memory pointers, or miscellaneous operands to control instruction execution.
In some implementations, an embedded storage, such as a matrix space, may be configured to hold or store matrices and single, double, or multi-dimensional arrays such as matroids and vectors, wherein the embedded storage may comprise rows and columns of elements of binary values of any type, either numeric or non-numeric. In some aspects, these elements may be controlled or accessed singly or in plurality, by rows, columns, or both, during transport and computation.
In some embodiments, a method and apparatus comprising a set of machine instructions (and their assembly language equivalent names) may be used to control, access, load, store, restore, set, transport, shift, manipulate, and perform operations, including logical, bit-manipulation, arithmetic, and non-arithmetic operations, in order to execute steps of algorithms or to manipulate the aforementioned vectors, arrays, matrices, or any of the contents held within the aforementioned matrix space, along with contents of other registers or storage outside the matrix space. These operations may act on a plurality of stored elements in parallel, which may occur simultaneously, concurrently, or concomitantly. In some implementations, hardware, methods, and instructions may control the state of a matrix space (including operations to reset, power on, power down, clock on, clock off, lock, secure, unlock, encrypt, decrypt, or otherwise control its state).
In some aspects, a set of one or more matrix pointer registers may be used to hold the location, size, and operand type information of matrices or arrays stored in the matrix space. In some implementations, a method and apparatus may address and control matrices or arrays stored in the matrix space using matrix pointer registers. In some embodiments, a matrix pointer register may hold a pair of row and column addresses of the origin position of a matrix, which may be a pre-designated element position in the matrix, such as a corner. In some aspects, the register may also hold the size of the matrix, given in terms of the number of elements in its rows and the number of elements in its columns (or in terms of numbers of rows and columns), thereby defining its extent.
In some embodiments, a matrix pointer register may be used to control, store, and access one or more elements of a matrix or array by its rows, columns, or both. In some aspects, the control, storage, and access may occur in patterns within the matrix or array, such as its diagonals, sub-diagonals, a triangular sub-array, a tri-diagonal sub-array, a rectangular sub-array, or a sub-array of a priori user-defined positions of the said matrix or array. In some implementations, there may be a plurality of machine instructions (and their assembly language equivalents) comprising the instruction set to control, access, load, store, restore, set, and compute using arithmetic, logical, and bit-manipulation operations with the contents of these registers and the contents of the vectors, matrices, and arrays inside, or associated with, the matrix space (including those held in system memory or other registers outside the matrix space).
In some embodiments, a type designation may identify the type of binary elements of a matrix.
As illustrative examples, the type designation may distinguish between bytes, short integers, integer words, long integers, pointers (to a memory location), half precision floating point numbers, single precision floating point numbers, double precision floating point numbers, extended and quad precision floating point numbers, ordered pairs (a collection of 2 values) of any integer types, ordered pairs of any floating point types, ordered quads (a collection of 4 values) of any integer types, ordered quads of any floating point types, triads (a collection of 3 values) of integer types, triads of floating point types, ordered quads, triads, or pairs of nibbles or bytes, and other types comprising values with no designated type that may comprise collections of a user-defined number of bits each.
In contrast to prior art that may identify a numeric value, the present invention may process complex strings comprising numbers, letters, segments, and combinations thereof. This may allow for separate processing of the different types, which may allow complex strings to be processed efficiently at relatively low computing cost. In some aspects, various methods may interpret ordered pairs of values as complex numbers, and quads and triads of binary values as points, triangles, or vectors in a geometric space, or as elements of a tensor, in computations using machine instructions. In some embodiments, various methods may be used to interpret these quads and triads of binary values as pixel intensities and colors, and as other possible groupings interpreted by instructions that act on them.
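As one illustration of interpreting ordered pairs as complex numbers, the sketch below multiplies two rows of ordered pairs element by element, treating each pair as (real part, imaginary part). The function names and the tuple representation are assumptions for the example and do not correspond to named instructions in this disclosure.

```python
from typing import List, Tuple

Pair = Tuple[int, int]  # ordered pair interpreted as (real, imaginary)


def complex_mul_pair(a: Pair, b: Pair) -> Pair:
    """Multiply two ordered pairs as complex numbers."""
    ar, ai = a
    br, bi = b
    # (ar + i*ai) * (br + i*bi) = (ar*br - ai*bi) + i*(ar*bi + ai*br)
    return (ar * br - ai * bi, ar * bi + ai * br)


def multiply_rows(row_a: List[Pair], row_b: List[Pair]) -> List[Pair]:
    """Element-wise complex product of two equally sized rows of pairs."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same number of elements")
    return [complex_mul_pair(x, y) for x, y in zip(row_a, row_b)]
```

The same pair of stored values could instead be treated as a 2-D point or as two independent integers; in this sketch the interpretation is chosen by the operation applied, not by the storage.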
Some embodiments may comprise a plurality of instruction structures and modes. In some aspects, individual instructions may compute with matrices and arrays, or their parts, comprising numeric or non-numeric binary values, along with a plurality of binary values that may be elements of other matrices or their parts, vector registers or their parts, scalar register operands, memory operands, and immediate values of a variety of types.
In some aspects, methods and accompanying logic may be used to access one or more matrices or arrays in an embodiment of the matrix space for an operation, wherein the contents of one or more matrix pointer registers, each of which may be associated with a matrix or array in the matrix space, may be readable concurrently or simultaneously. In some embodiments, a method may interpret the contents of the fields of a matrix pointer register as a row and column pair that addresses an origin or corner element of said matrix or array inside the matrix space. In some implementations, a method may identify the size as a pair of numbers giving the number of elements in the rows and columns of the said matrix or array.
In some aspects, a method may interpret the type field of the matrix pointer register, which may associate it with the type of elements of the said matrix or array. In some implementations, a set of methods and apparatus may access, read, and control one or more elements of a matrix or array by row, by column, or both, along with other operands such as vector registers, scalar values, or immediate operands from their locations of storage, and may also perform computation and generate results. In some embodiments, a set of methods and apparatus may store the results of computation into a matrix held inside a matrix space via its ports, or into vector registers or scalar registers, as the instruction may stipulate.
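The interpretation of a matrix pointer register as an origin, a size, and a type can be sketched as follows. This is a simplified software model under stated assumptions: the class names, the field names, the string type tag, and the list-of-lists storage are all hypothetical and stand in for hardware that the disclosure describes only in general terms.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MatrixPointer:
    """Hypothetical matrix pointer register contents."""

    origin_row: int  # row address of the origin (corner) element
    origin_col: int  # column address of the origin (corner) element
    n_rows: int      # number of rows in the matrix
    n_cols: int      # number of columns in the matrix
    elem_type: str   # assumed type tag, e.g. "int32"


class MatrixSpace:
    """Toy model of a matrix space addressable by rows and by columns."""

    def __init__(self, rows: int, cols: int) -> None:
        self.cells = [[0] * cols for _ in range(rows)]

    def write_row(self, mp: MatrixPointer, r: int, values: List[int]) -> None:
        if not 0 <= r < mp.n_rows or len(values) != mp.n_cols:
            raise IndexError("row access outside the matrix extent")
        base = mp.origin_col
        self.cells[mp.origin_row + r][base:base + mp.n_cols] = values

    def read_row(self, mp: MatrixPointer, r: int) -> List[int]:
        if not 0 <= r < mp.n_rows:
            raise IndexError("row access outside the matrix extent")
        base = mp.origin_col
        return self.cells[mp.origin_row + r][base:base + mp.n_cols]

    def read_col(self, mp: MatrixPointer, c: int) -> List[int]:
        if not 0 <= c < mp.n_cols:
            raise IndexError("column access outside the matrix extent")
        return [self.cells[mp.origin_row + i][mp.origin_col + c]
                for i in range(mp.n_rows)]

    def read_diagonal(self, mp: MatrixPointer) -> List[int]:
        # Main diagonal, starting at the origin element.
        n = min(mp.n_rows, mp.n_cols)
        return [self.cells[mp.origin_row + i][mp.origin_col + i]
                for i in range(n)]


space = MatrixSpace(8, 8)
mp0 = MatrixPointer(origin_row=2, origin_col=3, n_rows=2, n_cols=3,
                    elem_type="int32")
space.write_row(mp0, 0, [1, 2, 3])
space.write_row(mp0, 1, [4, 5, 6])
```

In this model an instruction would carry only the index of a matrix pointer register; the origin and size held in the register bound every row, column, or diagonal access.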
In some implementations, a method and apparatus may load one or more matrices or arrays from a memory, an embedded memory, or a processor cache into a matrix space using a load instruction. In some aspects, a set of methods and apparatus may store one or more matrices or arrays from a matrix space into a memory, an embedded memory, or a processor cache using a store instruction.
Some aspects may comprise an access control mechanism and a set of attributes to secure a matrix space or portions of it to make them accessible and controllable by specific threads of specific processes of specific operating systems running on a computing machine. In some embodiments, these may be defined as a spatial division of the matrix space into one or more regions controlled by a set of instructions and logic to control the security and sharing attributes of these regions.
In some embodiments, one or more regions may comprise one or more partitions, and the access control mechanism may comprise encryption, decryption and security hardware and a plurality of registers that may hold binary valued keys to block or enable access to one or more regions by specified threads belonging to specified processes that may lease these secret or encrypted keys from an operating system or a virtual machine hypervisor.
In some implementations, the keys may comprise one or more fields, and a plurality of canonical key values like 0 and 1 (all 1s in a binary word) may designate complete blocking or full access to all threads or all processes. In some aspects, a plurality of fields in keys may allow an operating system to control a region of matrix space as stipulated by a virtual machine hypervisor. In some embodiments, methods and logic may be used to lock or unlock access to each matrix region in the aforementioned matrix space by a thread of a process making a request to an operating system using a privileged instruction under operating system control.
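A key comparison of the kind described above might be modeled as in the sketch below. The 32-bit key width, the assignment of 0 to complete blocking and all 1s to full access, the exact-match rule for other keys, and the function name are assumptions made for illustration; the disclosure does not fix these details.

```python
KEY_BITS = 32                    # assumed key width
BLOCK_ALL = 0                    # assumed: canonical key that blocks everyone
ALLOW_ALL = (1 << KEY_BITS) - 1  # assumed: all 1s grants access to everyone


def may_access(region_key: int, thread_key: int) -> bool:
    """Decide whether a thread holding thread_key may access a region.

    Canonical region keys are checked first; any other region key
    requires the thread to present an identical leased key.
    """
    if region_key == BLOCK_ALL:
        return False
    if region_key == ALLOW_ALL:
        return True
    return thread_key == region_key
```

A fuller model would split the key into fields (for example, per hypervisor, operating system, process, and thread) and compare them separately, which this sketch omits.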
In some embodiments, a method and apparatus may comprise an immediate operand register that may be used in conjunction with a plurality of machine and assembly language instructions. In some aspects, a payload instruction may comprise an opcode and an immediate value operand that may be stored by a processing unit into an immediate operand register within it. In some implementations, a method and apparatus may decode the payload instruction with its immediate operand in a program sequence and pass the result for use with a preceding or succeeding instruction with or without an immediate operand for execution.
In some embodiments, a method and apparatus may comprise a shifter, a shift control register to hold a shift value, and an immediate operand register that may be able to hold a resultant immediate operand. In some aspects, a logic circuit may present an immediate operand from an instruction to the aforementioned shifter to perform a shift. In some implementations, the logic circuit may concatenate the result to the existing value in the immediate operand register. In some aspects, a logic circuit may compute a new shift value and place it into the shift control register prior to the next instruction. In some aspects, a mechanism may reset the aforementioned registers, and a method and apparatus may use the resultant immediate operand in the immediate operand register as an immediate operand in the execution of an instruction.
Claims
1-13. (canceled)
14. A computing system comprising:
- a matrix space to store at least one array, the matrix space separate from system memory, the matrix space controlled by a control logic circuit to configure and control matrix operations,
- wherein the control logic circuit is separate from a system memory controller,
- wherein the matrix space is configured to be accessible by rows and by columns,
- wherein two or more elements of a row of the at least one array are accessible simultaneously at a row port in response to a row address,
- and wherein two or more elements of a column of the at least one array are accessible simultaneously at a column port in response to a separate column address.
15. The computing system of claim 14, wherein an individual element included in the matrix space is concurrently accessible by the row containing the individual element and by the column containing the individual element.
16. The computing system of claim 14, wherein one or more elements of the row of the at least one array accessible at the row port are concurrently accessible with one or more elements of the column of the at least one array accessible at the column port.
17. The computing system of claim 14, wherein an origin of the at least one array comprises a row number and a column number.
18. The computing system of claim 14, wherein size of the at least one array comprises a first number of rows in the at least one array, and a second number of columns in the at least one array.
19. The computing system of claim 14, wherein a portion of the matrix space is pre-allocated to store the at least one array.
20. The computing system of claim 14, further comprising at least one mechanism to control a power state or a clock associated with the matrix space.
21. The computing system of claim 14, further comprising at least one execution unit coupled to the matrix space via at least one row port or via at least one column port, wherein the at least one execution unit is configured to use the at least one array in a computation.
22. The computing system of claim 14, further comprising:
- one or more load matrix instructions; or
- one or more store matrix instructions.
23. The computing system of claim 14, further comprising a matrix pointer register configured to store an origin of the at least one array, a size of the at least one array, and a type of the at least one array.
24. The computing system of claim 23, wherein the type of the at least one array identifies a type of elements in the at least one array in the matrix space, wherein the type is one of: byte, short integer, 32-bit integer, 64-bit integer, pointer to a memory location, half precision floating point number, single precision floating point number, double precision floating point number, string, ordered quad of integers, ordered quad of floating point numbers, ordered triad of integers, ordered triad of floating point numbers, ordered pair of integers, ordered pair of floating point numbers, ordered quad of bytes, ordered quad of nibbles, ordered triad of bytes, ordered triad of nibbles, ordered pair of bytes, ordered pair of nibbles, and a user defined type.
25. The computing system of claim 23, further comprising:
- at least one matrix instruction that references the at least one array, wherein the at least one matrix instruction comprises an operand that is an index of the matrix pointer register.
26. The computing system of claim 25, wherein the at least one matrix instruction is configured to operate upon the at least one array and at least one vector that comprises one or more scalars or packed and ordered groups of values.
27. The computing system of claim 25, wherein the at least one matrix instruction is configured to access a diagonal of the at least one array.
28. The computing system of claim 25, wherein the at least one matrix instruction is configured to access a transpose of a portion of the at least one array, or a triangular portion of the at least one array, or a multi-diagonal portion of the at least one array.
29. The computing system of claim 25, wherein the at least one matrix instruction is configured to operate on complex number elements of the at least one array when the type is one of: ordered pair of bytes or ordered pair of integers or ordered pair of short integers or ordered pair of floating point numbers, wherein a complex number element is represented as an ordered pair.
30. The computing system of claim 25, wherein the at least one matrix instruction is configured to perform a matrix multiplication operation or an array multiplication operation.
31. The computing system of claim 25, wherein the at least one matrix instruction is configured to count elements of a row of the at least one array, reorder the elements of the row of the at least one array, or sum the elements of the row of the at least one array.
32. The computing system of claim 25, wherein the at least one matrix instruction is configured to count elements of a column of the at least one array, reorder the elements of the column of the at least one array, or sum the elements of the column of the at least one array.
33. The computing system of claim 25, wherein the at least one matrix instruction is configured to access the at least one array in the matrix space, securely, and wherein the at least one matrix instruction references the at least one array in response to not failing a security check.
34. An array processing unit comprising a matrix space, separate from a system memory, and controlled by a control logic circuit to configure and control operations on a portion of the matrix space, wherein the matrix space is configured to be accessible by rows and by columns;
- at least one row port coupled to the matrix space that is configured to access a portion of a row of the matrix space, wherein elements of the portion of the row are accessed simultaneously at the at least one row port in response to a row address;
- at least one column port coupled to the matrix space that is configured to access a portion of a column of the matrix space, wherein elements of the portion of the column are accessed simultaneously at the at least one column port in response to a column address.
35. The array processing unit of claim 34, wherein a portion of the matrix space is accessed to read, write, set, clear, restore, transport, count, reorder, sort, scale, negate, invert, test, compare, operate on, or manipulate one or more elements in the matrix space.
36. The array processing unit of claim 34, wherein the at least one row port and the at least one column port are oriented perpendicular to one another in two dimensions.
37. The array processing unit of claim 34, further comprising a matrix pointer register configured to store an origin of an allocation in the matrix space, a size of the allocation, and a type.
38. The array processing unit of claim 37, wherein contents of the matrix pointer register determine an allocation of space in the matrix space.
39. The array processing unit of claim 37, wherein the allocation is associated with a sub-matrix, a portion of an array, multi-diagonal portions of a matrix, matroids, or a zero matrix.
40. The array processing unit of claim 37, further comprising at least one matrix instruction that references the allocation, wherein the at least one matrix instruction comprises an operand that is an index of the matrix pointer register.
41. The array processing unit of claim 40, wherein the at least one matrix instruction is configured to concurrently access the rows and the columns of at least one array.
42. A method for array computing comprising:
- providing a method to store at least one array in a matrix space, the matrix space separate from system memory, the matrix space controlled by a control logic circuit to configure and control matrix operations,
- wherein the control logic circuit is separate from a system memory controller,
- wherein the matrix space is configured to be accessible by rows and by columns,
- wherein two or more elements of a row of the at least one array are accessible simultaneously at a row port in response to a row address,
- and wherein two or more elements of a column of the at least one array are accessible simultaneously at a column port in response to a separate column address.
43. The method of claim 42, further comprising a method to configure an allocation of space for the at least one array.
Type: Application
Filed: Aug 28, 2023
Publication Date: Jan 4, 2024
Inventor: Sitaram Yadavalli (San Jose, CA)
Application Number: 18/239,031