Patents Assigned to SambaNova Systems, Inc.
-
Publication number: 20250258678

Abstract: A system including a reconfigurable processor, a runtime execution engine, a graph scheduler, and a communication scheduler is presented. The graph scheduler and the communication scheduler receive a dataflow graph and static schedules of graph and communication operations from a compiler. The graph scheduler and the communication scheduler generate new schedules of graph and communication operations based on user-defined schedules of graph and communication operations and the static schedules of graph and communication operations. The runtime execution engine uses the dataflow graph and the new schedules of graph and communication operations to configure an array of reconfigurable units in the reconfigurable processor for execution of the dataflow graph. The present technology also relates to a method of operating such a system, and to a non-transitory computer-readable storage medium including instructions that, when executed by a processing unit, cause the processing unit to operate such a system.

Type: Application
Filed: February 14, 2024
Publication date: August 14, 2025
Applicant: SambaNova Systems, Inc.
Inventors: Joshua Earle POLZIN, Arnav GOEL, Qi ZHENG, Conrad Alexander TURLIK, Arjun SABNIS, Jiayu BAI, Neal SANGHVI, Letao CHEN
-
Patent number: 12386602

Abstract: A method for improving throughput in a reconfigurable computing system includes detecting, in an algebraic representation of a computing task for a reconfigurable dataflow processor, an outer meta-pipeline loop, detecting an inner meta-pipeline loop nested within the outer meta-pipeline loop, and determining that the inner meta-pipeline loop and the outer meta-pipeline loop each conduct a common operation. The method also includes fusing the common operation for the inner meta-pipeline loop and the outer meta-pipeline loop into a single operation within the inner meta-pipeline loop. The instances of the common operation may be fused if the output of a first instance of the common operation is the source for a second instance of the common operation. Examples of the common operation include an accumulator operation, a re-read operation, and a temporal (chip buffer synchronized) operation such as a temporal concatenation operation and a temporal slicing operation.

Type: Grant
Filed: April 4, 2023
Date of Patent: August 12, 2025
Assignee: SambaNova Systems, Inc.
Inventors: Fei Wang, Weihang Fan, David Alan Koeplinger
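The fusion condition in the abstract above — two instances of a common operation may be fused when the output of the first instance is the source for the second — can be sketched as follows. The `Op` structure and all names are illustrative assumptions, not the patented implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Op:
    name: str
    kind: str                      # e.g. "accumulate", "re-read", "temporal_concat"
    source: Optional["Op"] = None  # op whose output this op consumes

def can_fuse(first: Op, second: Op) -> bool:
    """Two instances of a common operation are fusable when they share a kind
    and the output of the first instance is the source of the second."""
    return first.kind == second.kind and second.source is first

def fuse(first: Op, second: Op) -> Op:
    """Collapse both instances into a single operation within the inner loop."""
    assert can_fuse(first, second)
    return Op(name=f"{first.name}+{second.name}", kind=first.kind,
              source=first.source)

inner_acc = Op("inner_sum", "accumulate")             # inner meta-pipeline loop
outer_acc = Op("outer_sum", "accumulate", inner_acc)  # outer meta-pipeline loop
fused = fuse(inner_acc, outer_acc)
print(fused.name, fused.kind)  # inner_sum+outer_sum accumulate
```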
-
Patent number: 12380060

Abstract: A method for reducing latency and increasing throughput in a reconfigurable computing system includes receiving a compute graph for execution on a reconfigurable dataflow processor comprising a grid of compute units and a grid of memory units interconnected with a switching array. The compute graph includes a node specifying an operation on a tensor. The node may be split into multiple nodes that each specify the operation on a distinctive portion of the tensor to produce a first modified compute graph. The first modified compute graph may be executed. In addition, the multiple nodes may be within a single meta-pipeline stage and may be processed in parallel. Furthermore, the compute graph may further comprise a separate node for gathering the distinctive portions of the tensor into a complete tensor, to produce a second modified compute graph.

Type: Grant
Filed: May 25, 2023
Date of Patent: August 5, 2025
Assignee: SambaNova Systems, Inc.
Inventors: Yun Du, Gao Deng, Jianding Luo, Zhengyu Chen
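The split-and-gather transformation above can be modeled in a few lines of Python. This is a toy functional sketch — the real transformation rewrites graph nodes, and the chunking scheme here is an assumption:

```python
def split_node(op, tensor, num_splits):
    """Split `tensor` into `num_splits` distinctive portions, apply `op` to
    each portion (the parallel split nodes), then gather the portions back
    into a complete result (the separate gather node)."""
    chunk = (len(tensor) + num_splits - 1) // num_splits
    parts = [tensor[i * chunk:(i + 1) * chunk] for i in range(num_splits)]
    partials = [[op(x) for x in part] for part in parts]  # split nodes
    return [y for part in partials for y in part]         # gather node

result = split_node(lambda x: x * 2, [1, 2, 3, 4, 5, 6], num_splits=3)
print(result)  # [2, 4, 6, 8, 10, 12]
```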
-
Patent number: 12373182

Abstract: The technology disclosed provides a system that comprises a processor with computing units on an integrated circuit substrate. The processor is configured to map a program across multiple hardware stages with each hardware stage executing a corresponding operation of the program at a different stage latency dependent on an operation type and an operand format. The system further comprises a runtime logic that configures the compute units with configuration data. The configuration data causes first and second producer hardware stages in a given compute unit to execute first and second data processing operations and produce first and second outputs at first and second stage latencies, and synchronizes consumption of the first and second outputs by a consumer hardware stage in the given compute unit for execution of a third data processing operation by introducing a register storage delay that compensates for a difference between the first and second stage latencies.

Type: Grant
Filed: December 27, 2022
Date of Patent: July 29, 2025
Assignee: SambaNova Systems, Inc.
Inventors: Weiwei Chen, Raghu Prabhakar, David Alan Koeplinger
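The latency-balancing idea above reduces to a small calculation: the faster producer's output is held in registers for the difference in stage latencies so both operands reach the consumer stage together. A minimal sketch (function name and units are assumptions):

```python
def balance_delay(latency_a: int, latency_b: int) -> tuple[int, int]:
    """Return the register storage delay (in cycles) to insert on each
    producer path so both outputs arrive at the consumer simultaneously."""
    target = max(latency_a, latency_b)
    return target - latency_a, target - latency_b

# Producer A finishes in 3 cycles, producer B in 7: delay A's output by 4.
delay_a, delay_b = balance_delay(3, 7)
print(delay_a, delay_b)  # 4 0
```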
-
Patent number: 12367022

Abstract: In a method a computer-implemented efficiency analyzer selects operators from an intermediate representation of a dataflow program. The operators are included in a mapping of the operators to hardware of a computing system to execute the dataflow program. Based on the mapping and a description of the hardware, the efficiency analyzer computes an execution metric associated with executing the operators on the hardware. Based on the execution metric and hardware description, the efficiency analyzer determines an inefficiency metric, and based on the inefficiency metric, the efficiency analyzer determines an inefficiency associated with the dataflow program. The computing system to execute the dataflow program can comprise a coarse grain computing system and the hardware can include a reconfigurable processor of the computing system. A computer program product and a computing system to execute the dataflow program can implement the method.

Type: Grant
Filed: November 8, 2023
Date of Patent: July 22, 2025
Assignee: SambaNova Systems, Inc.
Inventors: Blaine Rister, Qingjian Li, Bowen Yang, Junjue Wang, Chen Liu, Zhuo Chen, Arvind Sujeeth, Sumti Jairath
-
Publication number: 20250231748

Abstract: A method for merging buffers and associated operations includes receiving a compute graph for a reconfigurable dataflow computing system and conducting a buffer allocation and merging process responsive to determining that a first operation specified by a first operation node is a memory indexing operation and that the first operation node is a producer for exactly one consuming node that specifies a second operation. The buffer allocation and merging process may include replacing the first operation node and the consuming node with a merged buffer node within the graph responsive to determining that the first operation and the second operation can be merged into a merged indexing operation and that the resource cost of the merged node is less than the sum of the resource costs of separate buffer nodes. A corresponding system and computer readable medium are also disclosed herein.

Type: Application
Filed: February 19, 2025
Publication date: July 17, 2025
Applicant: SambaNova Systems, Inc.
Inventors: David Alan KOEPLINGER, Adam BORDELON, Frank FAN, Kevin BROWN, Weiwei CHEN
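The merge test described above — an indexing producer with exactly one consumer, merged only when the merged node is cheaper than two separate buffers — can be sketched as follows. The `Node` structure, op names, and cost values are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Node:
    op: str
    resource_cost: int
    consumers: list

def can_merge(producer: Node) -> bool:
    """The first operation must be a memory indexing operation and the
    producer must feed exactly one consuming node."""
    return producer.op == "memory_index" and len(producer.consumers) == 1

def merge_profitable(producer: Node, consumer: Node, merged_cost: int) -> bool:
    """Replace the pair with a merged buffer node only when its resource cost
    is less than the sum of the separate buffer nodes' costs."""
    return (can_merge(producer)
            and merged_cost < producer.resource_cost + consumer.resource_cost)

consumer = Node("gather", 4, [])
producer = Node("memory_index", 3, [consumer])
print(merge_profitable(producer, consumer, merged_cost=5))  # True
```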
-
Publication number: 20250217240

Abstract: A data processing system comprises a coarse-grained reconfigurable (CGR) processor including an array of reconfigurable units which are configured to execute a dataflow graph. The system further includes a compiler coupled to provide a configuration file including a configuration for a set of components in a plurality of components in the array of reconfigurable units. The configuration file is coupled to configure the set of components using the configuration. An intelligent redundancy management framework (IRMF) checks the health of the configuration and identifies the configuration as defective if a component in the set of components is defective. The IRMF further performs a healing operation for the defective configuration by replacing the defective configuration with an alternate configuration that uses a different set of all-healthy components.

Type: Application
Filed: December 30, 2024
Publication date: July 3, 2025
Applicant: SambaNova Systems, Inc.
Inventors: Kyle MAY, Arnav GOEL, Qi ZHENG, Pushkar Shridhar NANDKAR
-
Publication number: 20250217125

Abstract: The present disclosure provides a method and system for efficiently compiling and executing a high-level program (e.g., artificial intelligence models) on a coarse-grained reconfigurable (CGR) processor comprising an array of CGR units. In one aspect, the system identifies a first occurrence of a section of code within the high-level program, and then creates a first instance of a hypersection based on the section of code. The system next identifies a subsequent occurrence of the section of code within the high-level program, and subsequently creates a second instance of the hypersection based on the section of code. Next, the system compiles the high-level program including the first and second instance of the hypersection. Subsequently, the system executes the high-level program including the first and second instance, which repeatedly executes the section of code. Segmenting the high-level program based on one or more pre-defined hypersections increases both compilation speed and compiler throughput.

Type: Application
Filed: March 24, 2025
Publication date: July 3, 2025
Applicant: SambaNova Systems, Inc.
Inventors: Jianding LUO, Yuan LIN
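The hypersection idea above — recognize repeated sections of the program so each distinct section is compiled once and reused — can be sketched with a simple deduplication table. Treating a section as a string key is a toy assumption; the identifiers are hypothetical:

```python
def build_hypersections(sections: list[str]) -> tuple[dict, list[int]]:
    """Map each distinct section body to a hypersection id and return the
    program as a sequence of hypersection ids."""
    table: dict[str, int] = {}
    program = []
    for body in sections:
        if body not in table:          # first occurrence: create an instance
            table[body] = len(table)   # (compilation would happen here, once)
        program.append(table[body])    # later occurrences reuse the instance
    return table, program

table, program = build_hypersections(["matmul;relu", "softmax", "matmul;relu"])
print(len(table), program)  # 2 [0, 1, 0]
```

Only two hypersections are compiled even though three sections execute, which is the claimed compilation-speed win.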
-
Patent number: 12346729

Abstract: A data processing system comprises a pool of reconfigurable data flow resources and a runtime processor. The pool of reconfigurable data flow resources includes arrays of physical configurable units and memory. The runtime processor includes logic to receive a plurality of configuration files for user applications. The configuration files include configurations of virtual data flow resources required to execute the user applications. The runtime processor also includes logic to allocate physical configurable units and memory in the pool of reconfigurable data flow resources to the virtual data flow resources and load the configuration files to the allocated physical configurable units. The runtime processor further includes logic to execute the user applications using the allocated physical configurable units and memory.

Type: Grant
Filed: June 20, 2023
Date of Patent: July 1, 2025
Assignee: SambaNova Systems, Inc.
Inventors: Ravinder Kumar, Conrad Alexander Turlik, Arnav Goel, Qi Zheng, Raghunath Shenbagam, Anand Misra, Ananda Reddy Vayyala, Pushkar Shridhar Nandkar
-
Publication number: 20250208839

Abstract: The disclosed technology relates to automatically optimizing the precision of data types in a computational graph, such as those used in machine learning and artificial intelligence applications. A representation of the computational graph is obtained. Nodes of the computational graph are assigned to one of three sets: a deny set, an allow set, or an infer set, based on a predefined policy. For nodes in the allow set, the method changes at least one of the input data precision, output data precision, or internal computation precision to a lower precision. For nodes in the infer set, the method propagates a data precision requirement from downstream nodes to upstream nodes. The method generates and stores computer instructions for executing the computational graph with the optimized precisions on one or more processors. This approach enhances performance and energy efficiency while maintaining model accuracy for the computational graph.

Type: Application
Filed: October 2, 2024
Publication date: June 26, 2025
Applicant: SambaNova Systems, Inc.
Inventors: Mark William Gottscho, Vidushi Goyal, Han Wang, Valentina Popescu, Yongning SHENG, Matthew William Ashcraft
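The three-set policy above can be sketched as a reverse pass over a topologically ordered graph: allow-set nodes are lowered, infer-set nodes inherit the strictest requirement of their downstream consumers, deny-set nodes keep full precision. The precision names and propagation rule are assumptions for illustration:

```python
def assign_precisions(graph, policy, default="fp32", low="bf16"):
    """`graph` maps node -> downstream nodes (dict insertion order is assumed
    to be a valid topological order); `policy` maps node -> 'deny', 'allow',
    or 'infer'. Returns node -> chosen precision."""
    prec = {}
    # Walk in reverse topological order so downstream nodes are decided first.
    for node in reversed(list(graph)):
        kind = policy[node]
        if kind == "allow":
            prec[node] = low             # explicitly lowered
        elif kind == "infer":
            downstream = [prec[d] for d in graph[node] if d in prec]
            # Propagate the strictest downstream requirement upstream.
            prec[node] = default if (not downstream or default in downstream) else low
        else:                            # deny: never lowered
            prec[node] = default
    return prec

graph = {"a": ["b"], "b": ["c"], "c": []}   # a feeds b, b feeds c
policy = {"a": "infer", "b": "allow", "c": "deny"}
print(assign_precisions(graph, policy))
```

Here `a` is lowered to bf16 because its only consumer `b` was lowered, while `c` stays at fp32 by policy.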
-
Patent number: 12340190

Abstract: According to a computing method, a compiler determines a recompute node included in a dataflow application and a checkpoint tensor produced by the recompute node. The compiler determines a recompute cost to recompute the checkpoint tensor, and a memory cost to checkpoint the checkpoint tensor in a memory. Based on the recompute cost and/or the memory cost, the compiler determines a solution cost and compares the solution cost to a solution threshold. Based on comparing the solution cost to the solution threshold, the compiler determines a checkpoint solution to execute the dataflow application. The checkpoint solution can comprise recomputing or checkpointing the checkpoint tensor. In some implementations, the compiler can determine a recompute ratio of the recompute cost to the memory cost and can compare the recompute ratio to the solution threshold. A computer program product and a computing system can implement aspects of the method.

Type: Grant
Filed: March 31, 2023
Date of Patent: June 24, 2025
Assignee: SambaNova Systems, Inc.
Inventors: Bowen Yang, Zhuo Chen, Fei Wang, Venkat Krishna Srinivasan, Chen Liu, Junjue Wang, Arvind Krishna Sujeeth, Sumti Jairath
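The ratio variant described above is a one-line decision rule: compute the recompute/memory cost ratio and compare it to the solution threshold. A minimal sketch (the direction of the comparison is an illustrative assumption):

```python
def checkpoint_solution(recompute_cost: float, memory_cost: float,
                        threshold: float) -> str:
    """Checkpoint the tensor when recomputing it is relatively expensive,
    i.e. when the recompute ratio exceeds the solution threshold."""
    ratio = recompute_cost / memory_cost
    return "checkpoint" if ratio > threshold else "recompute"

print(checkpoint_solution(recompute_cost=12.0, memory_cost=4.0, threshold=2.0))
# checkpoint
```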
-
Patent number: 12340195

Abstract: A system is presented that includes a communication link, a runtime processor coupled to the communication link, and a reconfigurable processor. The reconfigurable processor is adapted for generating an interrupt to the runtime processor in response to a predetermined event and includes multiple arrays of coarse-grained reconfigurable (CGR) units and an interface to the communication link that couples the reconfigurable processor to the runtime processor via the communication link. The runtime processor is adapted for configuring the interface to the communication link to provide access to the multiple arrays of coarse-grained reconfigurable units from a physical function driver and from at least one virtual function driver, and the reconfigurable processor is adapted for sending the interrupt to the physical function driver and to a virtual function driver of the at least one virtual function driver within the runtime processor.

Type: Grant
Filed: March 7, 2023
Date of Patent: June 24, 2025
Assignee: SambaNova Systems, Inc.
Inventors: Manish K. Shah, Paul Jordan, Maran Wilson, Ravinder Kumar
-
Publication number: 20250199788

Abstract: A method in a reconfigurable computing system includes connecting a plurality of tensor consumers to their corresponding tensor producers via skip-buffers, which generates a plurality of skip-buffers. The method includes determining that at least one skip-buffer of the plurality of skip-buffers corresponding to a first set of tensor consumers and at least one skip-buffer of the plurality of skip-buffers corresponding to a second set of tensor consumers, are compatible to wholly or partially merge. The method also includes merging, wholly or partially, the compatible skip-buffers to produce a merged skip-buffer having a minimal buffer depth. The described method may reduce memory unit consumption and latency.

Type: Application
Filed: March 5, 2025
Publication date: June 19, 2025
Applicant: SambaNova Systems, Inc.
Inventors: Fei WANG, David Alan KOEPLINGER, Kevin BROWN, Weiwei CHEN
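One plausible reading of the merge above: two skip-buffers carrying the same producer's tensor can share one physical buffer whose depth is the minimum that still serves the deeper consumer. The data structure, compatibility rule, and depth model are all assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class SkipBuffer:
    producer: str
    consumer: str
    depth: int   # pipeline stages the tensor must be held for this consumer

def compatible(a: SkipBuffer, b: SkipBuffer) -> bool:
    """Assume buffers carrying the same producer's tensor can merge."""
    return a.producer == b.producer

def merge(a: SkipBuffer, b: SkipBuffer) -> SkipBuffer:
    """One merged buffer, sized to the minimal depth that satisfies both
    consumers (the deeper skip distance)."""
    assert compatible(a, b)
    return SkipBuffer(a.producer, f"{a.consumer}|{b.consumer}",
                      depth=max(a.depth, b.depth))

m = merge(SkipBuffer("conv1", "add", 3), SkipBuffer("conv1", "concat", 5))
print(m.depth)  # 5
```

Two buffers of depth 3 and 5 become one of depth 5, saving three stages' worth of memory units.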
-
Publication number: 20250199985

Abstract: A reconfigurable dataflow unit (RDU) includes an intra-RDU network, an array of configurable units connected by an array level network, and function interfaces. The RDU also includes interface circuits coupled between the intra-RDU network and external interconnects. An interface circuit receives a first packet from the external interconnect, extracts a target RDU identifier, and compares the target RDU identifier to the value of the identity register. It communicates over the intra-RDU network to a function interface based on information in the first packet in response to the target RDU identifier being equal to the identity register. In response to the target RDU identifier not being equal to the identity register, the interface circuit retrieves another interface circuit identifier for the target RDU identifier from the pass-through table and sends the target RDU identifier and other information to the other interface circuit over the intra-RDU network.

Type: Application
Filed: February 25, 2025
Publication date: June 19, 2025
Applicant: SambaNova Systems, Inc.
Inventors: Paul JORDAN, Manish K. SHAH, Emre Ali BURHAN, Dawei HUANG, Yong QIN
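The routing decision above — deliver locally when the target RDU id matches the identity register, otherwise forward via the pass-through table — can be modeled in software. The packet fields and return encoding are hypothetical:

```python
def route_packet(packet: dict, identity: int, pass_through: dict) -> str:
    """Decide what an interface circuit does with an incoming packet."""
    target = packet["target_rdu"]
    if target == identity:
        # Target is this RDU: deliver over the intra-RDU network to the
        # function interface named in the packet.
        return f"deliver:{packet['function_interface']}"
    # Target is another RDU: forward to the interface circuit recorded in the
    # pass-through table for that target.
    return f"forward:{pass_through[target]}"

pass_through = {7: "if_east", 9: "if_west"}
print(route_packet({"target_rdu": 3, "function_interface": "dma0"}, 3, pass_through))
print(route_packet({"target_rdu": 7, "function_interface": "dma0"}, 3, pass_through))
# deliver:dma0
# forward:if_east
```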
-
Patent number: 12332836

Abstract: A cost estimation tool in a system for implementing an operation unit graph on a reconfigurable processor is presented, as well as a method of operating a cost estimation tool for determining scaled logical edge bandwidths in an operation unit graph in preparation of placing and routing the operation unit graph onto a reconfigurable processor. The cost estimation tool may be configured to receive the operation unit graph, divide the operation unit graph into first and second subgraphs, determine maximum latencies of the first and second subgraphs, and determine a scaled logical edge bandwidth of a logical edge that couples a first logical unit of M logical units in the first subgraph with a second logical unit of N logical units in the second subgraph based on M, N, and scaled bandwidth limits of the M and N logical units.

Type: Grant
Filed: July 13, 2023
Date of Patent: June 17, 2025
Assignee: SambaNova Systems, Inc.
Inventors: Yue Fu, Kin Hing Leung, Joshua Brot, Arvind Krishna Sujeeth, Sumti Jairath, Andrew Deng, Raghu Prabhakar
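The abstract does not give the exact formula, so the following is only a hypothetical sketch of one plausible scaling: an edge's bandwidth is bounded by each endpoint's scaled bandwidth limit divided across the M source-side and N destination-side units sharing it:

```python
def scaled_edge_bandwidth(m: int, n: int,
                          limit_src: float, limit_dst: float) -> float:
    """Bandwidth one logical edge can claim when M units share the source
    side's scaled limit and N units share the destination side's limit.
    (Illustrative model only; the patented formula may differ.)"""
    return min(limit_src / m, limit_dst / n)

print(scaled_edge_bandwidth(m=4, n=2, limit_src=32.0, limit_dst=12.0))  # 6.0
```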
-
Patent number: 12332837

Abstract: A sorting tool for determining an ordered sequence of nodes in an operation unit graph for placing and routing the operation unit graph onto a reconfigurable processor is presented, as well as a method of operating a sorting tool for determining an ordered sequence of nodes in an operation unit graph for placing and routing the operation unit graph onto a reconfigurable processor. The sorting tool is configured to receive the operation unit graph including a set of unsorted nodes and edges that interconnect nodes in the set of unsorted nodes, determine an ordered sequence of the nodes in the operation unit graph, and provide the ordered sequence of nodes for the placing and routing of the operation unit graph onto the reconfigurable processor.

Type: Grant
Filed: July 25, 2023
Date of Patent: June 17, 2025
Assignee: SambaNova Systems, Inc.
Inventors: Hong Suh, Sumti Jairath
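The abstract does not specify the ordering criterion; one standard way to derive an ordered sequence from an unsorted node set with edges is a topological sort (Kahn's algorithm), sketched here as an assumption about what such a sorting tool might do:

```python
from collections import deque

def order_nodes(nodes: list[str], edges: list[tuple[str, str]]) -> list[str]:
    """Emit nodes so every node appears after all of its producers."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)  # no pending producers
    ordered = []
    while ready:
        n = ready.popleft()
        ordered.append(n)
        for s in successors[n]:
            indegree[s] -= 1
            if indegree[s] == 0:     # all producers emitted; node is ready
                ready.append(s)
    return ordered

print(order_nodes(["load", "matmul", "store"],
                  [("load", "matmul"), ("matmul", "store")]))
# ['load', 'matmul', 'store']
```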
-
Patent number: 12333283

Abstract: In a method a compiler performs a trial compilation to a low level (LL) intermediate representation (IR) of a high level (HL) decision to execute a dataflow application on a computing system. The LLIR comprises hardware resources to execute the application based on the HL decision and the compiler determines a trial result based on LL execution metrics associated with the trial compilation. The compiler performs a trial compilation of a second HL decision to a second LLIR and determines a trial result based on LL execution metrics associated with the second trial compilation. The compiler evaluates the trial results and, based on the evaluations, selects one or both of the HL decisions for executing the dataflow application. A computer program product and a computing system can implement the method.

Type: Grant
Filed: March 31, 2023
Date of Patent: June 17, 2025
Assignee: SambaNova Systems, Inc.
Inventors: Blaine Rister, Haocheng Dong, David Alan Koeplinger, Yaqi Zhang, Junjue Wang, Zhuo Chen, Arvind Sujeeth
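The selection loop described above can be sketched generically: trial-compile each high-level decision, score the resulting low-level IR with execution metrics, keep the best. The toy "decision" (a tiling factor) and cost model below are invented for illustration:

```python
def select_decision(decisions, trial_compile, score):
    """trial_compile: HL decision -> LLIR; score: LLIR -> cost (lower is
    better). Returns the HL decision with the best trial result."""
    results = [(score(trial_compile(d)), d) for d in decisions]
    return min(results)[1]

# Toy stand-ins: cost trades replication (16 // d units) against overhead (3*d).
best = select_decision(
    [1, 2, 4, 8],
    trial_compile=lambda d: {"units": 16 // d + d * 3},
    score=lambda llir: llir["units"])
print(best)  # 2
```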
-
Patent number: 12333270

Abstract: A computation unit includes input lines to provide a floating-point value, a first lookup table, a second lookup table, a range detector, and an output stage. The input lines include exponent lines and mantissa lines. The first lookup table has a first address input coupled to a first subset of the input lines to provide a first output. The second lookup table has a second address input coupled to a second subset of the input lines to provide a second output. The range detector is coupled to at least some of the input lines and indicates whether the floating-point value provided on the input lines is within a specified range on a range output. The output stage is operatively coupled to the first output, the second output and the range output, to generate a function output based on the first output, the second output, and the range output.

Type: Grant
Filed: May 5, 2022
Date of Patent: June 17, 2025
Assignee: SambaNova Systems, Inc.
Inventors: Mingran Wang, Xiaoyan Li, Yongning Sheng
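The structure above — two lookup tables addressed by different bit-fields, a range detector, and an output stage that combines them — can be modeled loosely in software. This is a toy approximation of tanh with a coarse table plus a correction table and saturation outside the supported range; the table split and combination rule are assumptions, not the patented circuit:

```python
import math

def make_tables(f, lo, hi, bits=6):
    """Coarse table of function values plus a fine table of half-step
    corrections (a software stand-in for the two hardware LUTs)."""
    n = 1 << bits
    step = (hi - lo) / n
    coarse = [f(lo + i * step) for i in range(n)]
    fine = [f(lo + (i + 0.5) * step) - f(lo + i * step) for i in range(n)]
    return coarse, fine, step

def lut_eval(x, f, lo, hi, coarse, fine, step):
    if not (lo <= x < hi):                 # range detector: saturate outside
        return f(lo) if x < lo else f(hi)
    i = int((x - lo) / step)               # first table address
    frac = (x - lo) / step - i             # second table's contribution weight
    return coarse[i] + 2 * frac * fine[i]  # output stage combines both outputs

coarse, fine, step = make_tables(math.tanh, -4.0, 4.0)
approx = lut_eval(1.03, math.tanh, -4.0, 4.0, coarse, fine, step)
print(abs(approx - math.tanh(1.03)) < 0.01)  # True
```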
-
Publication number: 20250190192

Abstract: The technology disclosed relates to storing a dataflow graph with a plurality of compute nodes that transmit data between the compute nodes, and controlling data transmission between compute nodes in the plurality of compute nodes based on ready-to-read credit counters and write credit counters. For example, systems and methods according to this disclosure may control data transmission between compute nodes along the data connections between the compute nodes by selectively controlling writing of data based on both the ready-to-read credit counter and the write credit counter of a particular compute node of the plurality of compute nodes.

Type: Application
Filed: January 28, 2025
Publication date: June 12, 2025
Applicant: SambaNova Systems, Inc.
Inventors: Weiwei CHEN, Raghu PRABHAKAR, David Alan KOEPLINGER, Sitanshu GUPTA, Ruddhi CHAPHEKAR, Ajit PUNJ, Sumti JAIRATH
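Credit-based flow control of this kind is commonly modeled as a pair of counters per link: a producer may write only while it holds write credits, and a consumer may read only while the ready-to-read counter is positive. A minimal sketch (class and method names are illustrative):

```python
class CreditLink:
    def __init__(self, buffer_slots: int):
        self.write_credits = buffer_slots   # free downstream buffer slots
        self.ready_to_read = 0              # items awaiting the consumer

    def try_write(self) -> bool:
        if self.write_credits == 0:
            return False                    # downstream buffer full: stall
        self.write_credits -= 1
        self.ready_to_read += 1
        return True

    def try_read(self) -> bool:
        if self.ready_to_read == 0:
            return False                    # nothing to read yet: stall
        self.ready_to_read -= 1
        self.write_credits += 1             # return the slot to the producer
        return True

link = CreditLink(buffer_slots=2)
print(link.try_write(), link.try_write(), link.try_write())  # True True False
print(link.try_read(), link.try_write())                     # True True
```

The third write stalls because both buffer slots are occupied; a read frees a slot and writing resumes, so neither side can overrun the other.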
-
Publication number: 20250190749

Abstract: A device may pad a first input into a first padded input, read a first set of input tiles from the first padded input in a first input tiling configuration, process the first set of input tiles through a first section of a graph to generate a first set of output tiles in a first target tiling configuration, and pad the first set of output tiles to generate a first set of padded output tiles. A device may arrange the first set of padded output tiles into a second input comprising a second set of input tiles, read the second set of input tiles from the second input in a second input tiling configuration, and process the second set of input tiles through a second section of the graph to generate a second set of output tiles in a second target tiling configuration, different than the first target tiling configuration.

Type: Application
Filed: February 24, 2025
Publication date: June 12, 2025
Applicant: SambaNova Systems, Inc.
Inventors: Tejas Nagendra Babu NAMA, Ruddhi CHAPHEKAR, Ram SIVARAMAKRISHNAN, Raghu PRABHAKAR, Sumti JAIRATH, Junjue WANG, Kaizhao LIANG, Adi FUCHS, Matheen MUSADDIQ, Arvind Krishna SUJEETH
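The two-section flow above can be illustrated on a 1-D toy: pad the input so it divides evenly into tiles, run a section tile-by-tile, then pad and re-tile the outputs in a different configuration for the next section. Tile sizes and operations are arbitrary illustrative choices:

```python
def pad(xs: list[int], tile: int, value: int = 0) -> list[int]:
    """Pad so the length is a multiple of the tile size."""
    rem = len(xs) % tile
    return xs + [value] * ((tile - rem) % tile)

def read_tiles(xs: list[int], tile: int) -> list[list[int]]:
    return [xs[i:i + tile] for i in range(0, len(xs), tile)]

def run_section(tiles, op):
    return [[op(x) for x in t] for t in tiles]

x = pad([1, 2, 3, 4, 5], tile=2)                  # first padded input
tiles1 = read_tiles(x, 2)                         # first input tiling config
out1 = run_section(tiles1, lambda v: v + 10)      # first graph section
flat = pad([v for t in out1 for v in t], tile=3)  # pad/arrange the outputs
tiles2 = read_tiles(flat, 3)                      # second input tiling config
out2 = run_section(tiles2, lambda v: v * 2)       # second graph section
print(tiles1)  # [[1, 2], [3, 4], [5, 0]]
print(out2)    # [[22, 24, 26], [28, 30, 20]]
```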