Patents Assigned to Ascenium, Inc.
-
Publication number: 20260140778
Abstract: A processor core is accessed. The core is configured to execute instructions associated with an instruction set architecture (ISA). The core comprises a plurality of compute slices, a plurality of barrier register files, and a control unit. Each compute slice includes at least one arithmetic logic unit (ALU), a local register file, and is coupled to a successor compute slice and a predecessor compute slice by a barrier register file. Code associated with the ISA is evaluated, where the code includes a first loop. The evaluating includes generating iteration transfer information associated with the first loop. Each slice task within a plurality of slice tasks associated with the first loop is distributed to a compute slice. The processor core executes the plurality of slice tasks. Data forwarding between successive compute slices is based on the plurality of barrier register files and the iteration transfer information.
Type: Application
Filed: January 12, 2026
Publication date: May 21, 2026
Applicant: Ascenium, Inc.
Inventors: Hans Olle Viktor Fredriksson, Tore Jahn Bastiansen, Ove Brynestad, Øyvind Harboe
-
Patent number: 12578991
Abstract: Techniques for task processing based on a parallel processing architecture with distributed register files are disclosed. A two-dimensional array of compute elements is accessed. Each compute element is known to a compiler and is coupled to its neighboring compute elements. The array of compute elements is controlled on a cycle-by-cycle basis. The controlling is enabled by a stream of wide, variable length, control words generated by the compiler. Virtual registers are mapped to a plurality of physical register files distributed among one or more of the compute elements. Virtual registers are represented by the compiler. The mapping is performed by the compiler. A broadcast write operation is enabled to two or more of the physical register files. Operations contained in the control words are executed. Operations are enabled by at least one of the distributed physical register files. Implementation in separate compute elements enables parallel operation processing.
Type: Grant
Filed: May 25, 2022
Date of Patent: March 17, 2026
Assignee: Ascenium, Inc.
Inventor: Peter Foley
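As an illustration only, the virtual-to-physical register mapping and broadcast write described in this abstract might be modeled as follows. The class `DistributedRegisters`, its methods, and the `(file index, slot)` representation are hypothetical names chosen for this sketch, not taken from the patent; the mapping that the abstract assigns to the compiler is reduced here to a plain dictionary.

```python
class DistributedRegisters:
    """Toy model of physical register files spread across compute elements."""

    def __init__(self, num_files, regs_per_file):
        self.files = [[0] * regs_per_file for _ in range(num_files)]
        self.mapping = {}  # virtual register -> list of (file index, slot)

    def map_virtual(self, name, locations):
        # Stands in for the compiler's mapping decision (assumption)
        self.mapping[name] = list(locations)

    def write(self, name, value):
        # Broadcast write: every mapped physical copy receives the value
        for file_index, slot in self.mapping[name]:
            self.files[file_index][slot] = value

    def read(self, name, file_index):
        # Read the copy held in one specific physical register file
        for mapped_file, slot in self.mapping[name]:
            if mapped_file == file_index:
                return self.files[mapped_file][slot]
        raise KeyError(f"{name} is not mapped to register file {file_index}")


regs = DistributedRegisters(num_files=3, regs_per_file=4)
regs.map_virtual("v0", [(0, 1), (2, 3)])
regs.write("v0", 42)
print(regs.read("v0", 0), regs.read("v0", 2))  # 42 42
```

Keeping a copy of the value in more than one register file lets each compute element read it locally, which is one plausible reason for a broadcast write.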
-
Patent number: 12504958
Abstract: Techniques for task processing based on compute element processing using control word templates are disclosed. One or more control word templates are generated for use in a two-dimensional array of compute elements. Each compute element within the array is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements. Each control word template designates a topological set of compute elements from the array of compute elements. The one or more control word templates are customized with a specific set of compute element operations. The one or more control word templates that were customized are stored. The specific set of compute element operations is executed on the topological set of compute elements. The one or more control word templates that were stored are reused. The one or more control word templates that were stored are modified and executed using compute elements.
Type: Grant
Filed: December 23, 2022
Date of Patent: December 23, 2025
Assignee: Ascenium, Inc.
Inventors: Ionut Hristodorescu, Peter Foley
-
Publication number: 20250383878
Abstract: A processing unit is accessed that includes a plurality of compute slices, a control unit, and a global aliasing table (GAT). Each compute slice within the plurality of compute slices includes at least one execution unit, is known to a compiler, and is coupled to a successor compute slice and a predecessor compute slice. A first compute slice executes a load instruction. The load instruction is associated with a target address. The load instruction is predicted that it will alias with a previous store instruction. The previous store instruction executes on a previous compute slice among the plurality of compute slices. The predicting is based on the GAT. The load instruction is stalled until the previous store instruction completes execution on the previous compute slice. The load instruction is allowed to execute. The predicting includes searching, in the GAT, for an entry which includes the load instruction.
Type: Application
Filed: June 12, 2025
Publication date: December 18, 2025
Applicant: Ascenium, Inc.
Inventor: Hans Olle Viktor Fredriksson
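The predict-then-stall flow in this abstract can be sketched roughly as below. This is a toy model under stated assumptions: the abstract does not say how GAT entries are keyed or trained, so the sketch assumes an entry is keyed by the load's instruction address and names one store. `GlobalAliasingTable`, `train`, `predict`, and `may_issue_load` are hypothetical names.

```python
class GlobalAliasingTable:
    """Toy predictor: maps a load's instruction address to a store it may alias with."""

    def __init__(self):
        self.entries = {}  # load pc -> store pc (keying scheme is an assumption)

    def train(self, load_pc, store_pc):
        self.entries[load_pc] = store_pc

    def predict(self, load_pc):
        # Finding an entry that includes the load means "predicted to alias"
        return self.entries.get(load_pc)


def may_issue_load(gat, load_pc, completed_stores):
    """A load may issue unless its predicted aliasing store has not completed yet."""
    store_pc = gat.predict(load_pc)
    return store_pc is None or store_pc in completed_stores


gat = GlobalAliasingTable()
gat.train(load_pc=0x40, store_pc=0x20)
print(may_issue_load(gat, 0x40, completed_stores=set()))   # False: stall the load
print(may_issue_load(gat, 0x40, completed_stores={0x20}))  # True: store has completed
print(may_issue_load(gat, 0x44, completed_stores=set()))   # True: no GAT entry
```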
-
Patent number: 12493554
Abstract: Techniques for parallel processing using hazard detection and mitigation are disclosed. An array of compute elements is accessed. Each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements. Control for the compute elements is provided on a cycle-by-cycle basis. Control is enabled by a stream of wide control words generated by the compiler. Memory access operations are tagged with precedence information. The tagging is contained in the control words. The tagging is provided by the compiler at compile time. Memory access operations are monitored. The monitoring is based on the precedence information and a number of architectural cycles of the cycle-by-cycle basis. The tagging is augmented at run time, based on the monitoring. Memory access data is held before promotion, based on the monitoring.
Type: Grant
Filed: November 7, 2023
Date of Patent: December 9, 2025
Assignee: Ascenium, Inc.
Inventor: Peter Foley
-
Publication number: 20250341970
Abstract: Techniques for checking memory operations are disclosed. A processing unit is accessed, comprising compute slices, control unit, local memory disambiguation units (LMDUs), and a global MDU (GMDU). Each slice includes an execution unit and is coupled to successor and predecessor slices. Each slice is coupled to an LMDU. Each LMDU is coupled to the GMDU. A first slice executes a first slice task. The task includes a load instruction and address. The slice issues the load to an LMDU, saving load information in a memory operation table (MOT). For a not fully serviced load instruction, the LMDU sends the load information to the GMDU, storing load information in a global MOT (GMOT). The GMOT detects address aliasing between the load address and a previously issued address saved in the GMOT. The GMOT forwards memory information from previously issued memory instructions to the MOT to satisfy the load instruction.
Type: Application
Filed: May 2, 2025
Publication date: November 6, 2025
Applicant: Ascenium, Inc.
Inventors: Øyvind Harboe, Jacob John Vorland Taylor, Anders Schau Knatten
-
Publication number: 20250306930
Abstract: A processing unit is accessed, comprising compute slices, a control unit, local memory disambiguation units (LMDUs), and memory system. Each slice includes an execution unit and is coupled to successor and predecessor slices. Each slice is coupled to an LMDU. The control unit distributes a first slice task to a first slice coupled to a first LMDU. The first slice executes the first task. The task includes a load instruction including a load address. The first slice issues the load instruction to the first LMDU. The issuing saves load information in a memory operation table (MOT) within the LMDU. The LMDU detects, based on the MOT, address aliasing between the load address and a store address of a previous store instruction. The MOT forwards store information from the previous store instruction. The store information satisfies one or more bytes of data required for the load instruction.
Type: Application
Filed: March 28, 2025
Publication date: October 2, 2025
Applicant: Ascenium, Inc.
Inventors: Jacob John Vorland Taylor, Anders Schau Knatten
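To make the forwarding step concrete, here is a minimal byte-granular sketch of a table that detects address aliasing between a load and earlier stores and supplies whichever bytes it can. The structure is an assumption: the abstract does not describe the MOT's layout, and `MemoryOperationTable`, `record_store`, and `forward_load` are hypothetical names.

```python
class MemoryOperationTable:
    """Toy byte-granular table of pending stores (hypothetical structure)."""

    def __init__(self):
        self.pending = {}  # byte address -> most recently stored byte value

    def record_store(self, address, data):
        # Later stores overwrite earlier ones for the same byte address
        for offset, byte in enumerate(data):
            self.pending[address + offset] = byte

    def forward_load(self, address, size):
        """Return (bytes satisfied by forwarding, addresses still needed from memory)."""
        forwarded = {}
        missing = []
        for addr in range(address, address + size):
            if addr in self.pending:  # load byte aliases a previous store
                forwarded[addr] = self.pending[addr]
            else:
                missing.append(addr)
        return forwarded, missing


mot = MemoryOperationTable()
mot.record_store(0x100, b"\xAA\xBB")
forwarded, missing = mot.forward_load(0x101, 4)
print(forwarded, missing)  # {257: 187} [258, 259, 260]
```

In this example the load overlaps the store by a single byte, so forwarding satisfies one byte and the remaining three would have to come from elsewhere in the memory system.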
-
Publication number: 20250265088
Abstract: Techniques for parallel generation of blocks using a compiler are disclosed. A processing unit comprising compute slices, barrier register sets, a control unit, and a memory system is accessed. Each compute slice includes an execution unit and is coupled to other compute slices by a barrier register set. A compiler evaluates a compiled program that includes basic blocks, based on a control flow graph. A first hyperblock is created from at least two basic blocks. One or more branch instructions are replaced with skip instructions that direct instruction execution between basic blocks in the hyperblock. A first slice task is allocated to a first compute slice. A second slice task is allotted based on branch prediction. Pointers to the first compute slice and the second compute slice are initialized. The compiled program is executed, beginning with the first compute slice.
Type: Application
Filed: February 14, 2025
Publication date: August 21, 2025
Applicant: Ascenium, Inc.
Inventors: Peter Aaser, Hans Olle Viktor Fredriksson
-
Publication number: 20250085970
Abstract: Techniques for managing compute slice tasks are disclosed. A processing unit comprising compute slices, load-store units (LSUs), a control unit, and a memory system is accessed. The compute slices are coupled. Each compute slice includes an LSU which is coupled to a predecessor LSU and a successor LSU. A compiled program is executed as the control unit distributes slice tasks to the compute slices for execution. A slice task, which includes a load instruction, is distributed to a current compute slice. The current compute slice can execute the slice task speculatively. A previously executed store instruction is committed to memory by a predecessor LSU. Address aliasing is checked between an address associated with the previously executed store instruction and the load address associated with the load instruction. The slice task running on the current compute slice can be cancelled when aliasing is detected.
Type: Application
Filed: September 6, 2024
Publication date: March 13, 2025
Applicant: Ascenium, Inc.
Inventor: Jacob John Vorland Taylor
-
Publication number: 20250021405
Abstract: Techniques for task processing based on compiler-scheduled compute slices are disclosed. A processing unit comprising compute slices, barrier register sets, a control unit, and a memory system is accessed. Each compute slice includes an execution unit and is coupled to other compute slices by a barrier register set. A first slice task is distributed to a first compute slice. A second slice task is allotted to a second compute slice, based on a branch prediction logic. The second compute slice is coupled to the first by a first barrier register set. Pointers are initialized. A compiled program is executed, beginning at the first compute slice. The second slice task can be executed in parallel while a branch decision is being made. If the branch decision determines that the second slice task is not the next sequential slice task, results from the second compute slice are discarded.
Type: Application
Filed: July 11, 2024
Publication date: January 16, 2025
Applicant: Ascenium, Inc.
Inventors: Tore Jahn Bastiansen, Peter Aaser, Trond Hellem Bø
-
Publication number: 20240419507
Abstract: Techniques for monitoring block moves in an array of compute elements and applying backpressure are disclosed. An array of compute elements is accessed. The array of compute elements is coupled to at least one data cache. The data cache provides memory storage for the array of compute elements. Control for the array of compute elements is enabled by a stream of wide control words generated by the compiler. A load address and a store address comprising memory block move addresses are generated. The memory block move addresses point to memory storage locations in the at least one data cache. Load buffers are coupled to the array of compute elements. The load buffers are located adjacent to at least one edge of the array of compute elements. A memory block move is executed using at least one of the load buffers, based on the memory block move addresses.
Type: Application
Filed: August 30, 2024
Publication date: December 19, 2024
Applicant: Ascenium, Inc.
Inventor: Peter Foley
-
Publication number: 20240385965
Abstract: Techniques for task processing are disclosed. An array of compute elements is accessed. Each compute element within the array is known to a compiler and is coupled to its neighboring compute elements. The array of compute elements is coupled to at least one data cache. The data cache provides memory storage for the array. Control for the compute elements is provided on a cycle-by-cycle basis. Control is enabled by a stream of wide control words generated by the compiler. A load address and a store address are generated. The load and the store addresses comprise memory block move addresses. The memory block move addresses point to memory storage locations in the data cache. A memory block move is executed, based on the memory block move addresses. The data for the memory block move is transferred outside of the array.
Type: Application
Filed: July 26, 2024
Publication date: November 21, 2024
Applicant: Ascenium, Inc.
Inventor: Peter Foley
-
Publication number: 20240264974
Abstract: Techniques for parallel processing based on hazard mitigation avoidance are disclosed. An array of compute elements is accessed. Each compute element within the array is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements. Control for the compute elements is provided on a cycle-by-cycle basis. Control is enabled by a stream of wide control words generated by the compiler. Memory access operation hazard mitigation is enabled. The hazard mitigation is enabled by a control word tag. The control word tag supports memory access precedence information and is provided by the compiler at compile time. A hazardless memory access operation is executed. The hazardless memory access operation is determined by the compiler, and the hazardless memory access operation is designated by a unique set of precedence information contained in the tag. The tag is modified during runtime by hardware.
Type: Application
Filed: April 19, 2024
Publication date: August 8, 2024
Applicant: Ascenium, Inc.
Inventor: Peter Foley
-
Publication number: 20240193009
Abstract: Techniques for a parallel processing architecture for branch path suppression are disclosed. An array of compute elements is accessed. Each element is known to a compiler and is coupled to its neighboring elements. Control for the elements is provided on a cycle-by-cycle basis. Control is enabled by a stream of wide control words generated by the compiler. The control includes a branch. A plurality of compute elements is mapped. The mapping distributes parallelized operations to the compute elements. The mapping is determined by the compiler. A column of compute elements is enabled to perform vertical data access suppression and a row of compute elements is enabled to perform horizontal data access suppression. Both sides of the branch are executed. The executing includes making a branch decision. Branch operation data accesses are suppressed, based on the branch decision and an invalid indication. The invalid indication is propagated among compute elements.
Type: Application
Filed: February 23, 2024
Publication date: June 13, 2024
Applicant: Ascenium, Inc.
Inventor: Peter Foley
-
Publication number: 20240168802
Abstract: Techniques for parallel processing using hazard detection and store probes are disclosed. An array of compute elements is accessed. Each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements. Control for the compute elements is provided on a cycle-by-cycle basis. Control is enabled by a stream of wide control words generated by the compiler. Data to be stored by the array of compute elements is managed. The data to be stored is targeted to a data cache coupled to the array of compute elements. The managing includes detecting and mitigating memory hazards. Pending data cache accesses are probed for hazards. The examining comprises a store probe. Store data is committed to the data cache. The committing is based on a result of the store probe.
Type: Application
Filed: January 30, 2024
Publication date: May 23, 2024
Applicant: Ascenium, Inc.
Inventor: Peter Foley
-
Publication number: 20240078182
Abstract: Techniques for parallel processing based on parallel processing with switch block execution are disclosed. An array of compute elements is accessed. Each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements. Control for the compute elements is provided on a cycle-by-cycle basis. Control is enabled by a stream of wide control words generated by the compiler. A plurality of compute elements is initialized within the array with a switch statement. The switch statement is mapped into a primitive operation in each element of the plurality of compute elements. The initializing is based on a control word from the stream of control words. Each of the primitive operations is executed in an architectural cycle. A result is returned for the switch statement. The returning is determined by a decision variable.
Type: Application
Filed: November 13, 2023
Publication date: March 7, 2024
Applicant: Ascenium, Inc.
Inventor: Peter Foley
-
Publication number: 20240070076
Abstract: Techniques for parallel processing using hazard detection and mitigation are disclosed. An array of compute elements is accessed. Each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements. Control for the compute elements is provided on a cycle-by-cycle basis. Control is enabled by a stream of wide control words generated by the compiler. Memory access operations are tagged with precedence information. The tagging is contained in the control words. The tagging is provided by the compiler at compile time. Memory access operations are monitored. The monitoring is based on the precedence information and a number of architectural cycles of the cycle-by-cycle basis. The tagging is augmented at run time, based on the monitoring. Memory access data is held before promotion, based on the monitoring.
Type: Application
Filed: November 7, 2023
Publication date: February 29, 2024
Applicant: Ascenium, Inc.
Inventor: Peter Foley
-
Publication number: 20240028340
Abstract: Techniques for parallel processing based on a parallel processing architecture with bin packing are disclosed. An array of compute elements is accessed. Each compute element is known to a compiler and is coupled to its neighboring compute elements. A plurality of compressed control words is generated by the compiler. The plurality of control words enables compute element operation and compute element memory access. The compressed control words are operationally sequenced. The compressed control words are linked by the compiler. Linking information is contained in at least one field of each of the compressed control words. The compressed control words are loaded into a control word cache coupled to the array of compute elements. The compressed control words are loaded into the control word cache in an operationally non-sequenced order. The plurality of compressed control words is ordered into an operationally sequenced execution order, based on the linking information.
Type: Application
Filed: August 22, 2023
Publication date: January 25, 2024
Applicant: Ascenium, Inc.
Inventor: Peter Foley
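One way to picture the reordering step in this abstract is a walk along per-word link fields. The sketch below assumes each control word carries an identifier and a single "next" link, and that the first word's identifier is known; none of this is specified in the abstract, and `ControlWord`, `next_id`, and `sequence_control_words` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ControlWord:
    word_id: int            # hypothetical identifier for this control word
    payload: str            # stand-in for the compressed operation bits
    next_id: Optional[int]  # hypothetical link field; None marks the last word


def sequence_control_words(cache, first_id):
    """Return cached control words in execution order by following link fields."""
    by_id = {word.word_id: word for word in cache}
    ordered = []
    seen = set()
    current = first_id
    while current is not None:
        if current in seen:
            raise ValueError(f"link cycle at control word {current}")
        seen.add(current)
        word = by_id[current]
        ordered.append(word)
        current = word.next_id
    return ordered


# Words sit in the cache in a non-sequenced order, as bin packing might leave them
cache = [ControlWord(7, "store", None), ControlWord(3, "load", 5), ControlWord(5, "add", 7)]
print([w.payload for w in sequence_control_words(cache, first_id=3)])  # ['load', 'add', 'store']
```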
-
Publication number: 20230409328
Abstract: Techniques for task processing based on a parallel processing architecture with memory block transfers are disclosed. An array of compute elements is accessed. Each compute element is known to a compiler and is coupled to its neighboring compute elements. Control for the array is provided on a cycle-by-cycle basis. The control is enabled by a stream of wide control words generated by the compiler. A control word from the stream of control words includes a source address, a target address, a block size, and a stride. Memory block transfer control logic is used. The memory block transfer logic is implemented outside of the array of compute elements. A memory block transfer is executed. The memory block transfer is initiated by a control word from the stream of wide control words. Data for the memory block transfer is moved independently from the array of compute elements.
Type: Application
Filed: August 30, 2023
Publication date: December 21, 2023
Applicant: Ascenium, Inc.
Inventor: Peter Foley
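The four control-word fields named in this abstract (source address, target address, block size, stride) are enough to describe a strided copy. The sketch below is one possible reading, not the patented logic: it assumes the stride is the step between successive source elements and that the target is written contiguously. `block_transfer` and the flat `memory` list are hypothetical.

```python
def block_transfer(memory, source, target, block_size, stride=1):
    """Copy block_size elements from source (stepping by stride) to a contiguous target."""
    # Read everything first so overlapping source and target ranges behave predictably
    values = [memory[source + i * stride] for i in range(block_size)]
    for i, value in enumerate(values):
        memory[target + i] = value
    return memory


memory = list(range(16))
block_transfer(memory, source=0, target=8, block_size=4, stride=2)
print(memory[8:12])  # [0, 2, 4, 6]
```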
-
Publication number: 20230376447
Abstract: Techniques for parallel processing based on a parallel processing architecture with dual load buffers are disclosed. A two-dimensional array of compute elements is accessed. Each compute element is known to a compiler and is coupled to its neighboring compute elements. A first data cache is coupled to the array. The first data cache enables loading data to a first portion of the array. The first data cache supports an address space. A second data cache is coupled to the array. The second data cache enables loading data to a second portion of the array. The second data cache supports the address space. Instructions are executed within the array. Instructions executed within the first portion of the array of compute elements use data loaded from the first data cache, and instructions executed within the second portion of the array of compute elements use data loaded from the second data cache.
Type: Application
Filed: July 31, 2023
Publication date: November 23, 2023
Applicant: Ascenium, Inc.
Inventor: Peter Foley