Patents by Inventor Bharadwaj Pudipeddi
Bharadwaj Pudipeddi has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20210232451
Abstract: Embodiments of the present disclosure include an error recovery method comprising detecting a computing error, restarting a first artificial intelligence processor of a plurality of artificial intelligence processors processing a data set, and loading a model in the artificial intelligence processor, wherein the model corresponds to a same model processed by the plurality of artificial intelligence processors during a previous processing iteration by the plurality of artificial intelligence processors on data from the data set.
Type: Application
Filed: March 27, 2020
Publication date: July 29, 2021
Inventors: Bharadwaj Pudipeddi, Maral Mesmakhosroshahi, Jinwen Xi, Saurabh M. Kulkarni, Marc Tremblay, Matthias Baenninger, Nuno Claudino Pereira Lopes
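The restart-and-reload flow described in the abstract can be sketched in software as a retry loop that restores the model from the previous completed iteration (a simplified stand-in for the hardware mechanism; `step`, the dict-based model, and the toy failure are assumptions for illustration):

```python
def run_with_recovery(step, model, batches, max_retries=3):
    """Process batches with step(model, batch); on a computing error,
    reload the model that completed the previous iteration and retry."""
    checkpoint = dict(model)
    for batch in batches:
        for _ in range(max_retries):
            try:
                model = step(dict(checkpoint), batch)
                checkpoint = dict(model)   # model as of this completed iteration
                break
            except RuntimeError:
                continue                   # restart: retry from the reloaded model
        else:
            raise RuntimeError("unrecoverable after retries")
    return model

failures = {"left": 1}
def step(m, b):
    """Toy training step that fails once on batch 2 (assumed for the demo)."""
    if b == 2 and failures["left"]:
        failures["left"] -= 1
        raise RuntimeError("transient compute error")
    return {"sum": m["sum"] + b}

recovered = run_with_recovery(step, {"sum": 0}, [1, 2, 3])
```

The transient failure on the second batch is absorbed by reloading the checkpoint taken after the first batch, so processing completes as if no error occurred.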
-
Publication number: 20210064986
Abstract: Systems, methods, and apparatuses are provided for compressing values. A plurality of parameters may be obtained from a memory, each parameter comprising a floating-point number that is used in a relationship between artificial neurons or nodes in a model. A mantissa value and an exponent value may be extracted from each floating-point number to generate a set of mantissa values and a set of exponent values. The set of mantissa values may be compressed to generate a mantissa lookup table (LUT) and a plurality of mantissa LUT index values. The set of exponent values may be encoded to generate an exponent LUT and a plurality of exponent LUT index values. The mantissa LUT, mantissa LUT index values, exponent LUT, and exponent LUT index values may be provided to one or more processing entities to train the model.
Type: Application
Filed: September 3, 2019
Publication date: March 4, 2021
Inventors: Jinwen Xi, Bharadwaj Pudipeddi, Marc Tremblay
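The field extraction and LUT construction described above can be illustrated in plain Python: split each float32 into its exponent and mantissa bit fields, then deduplicate each set into a table plus per-element indices. This is a minimal sketch of the idea, not the patented encoder; `build_lut` and the toy weight list are assumptions for the example:

```python
import struct

def extract_fields(value):
    """Split a float32 into its 8-bit biased exponent and 23-bit mantissa."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return (bits >> 23) & 0xFF, bits & 0x7FFFFF

def build_lut(values):
    """Deduplicate values into a lookup table plus per-element index values."""
    lut = sorted(set(values))
    index = {v: i for i, v in enumerate(lut)}
    return lut, [index[v] for v in values]

weights = [0.5, 0.25, 0.5, 1.5]               # toy parameters (assumed)
exponents, mantissas = zip(*(extract_fields(w) for w in weights))
exp_lut, exp_idx = build_lut(exponents)        # exponent LUT + index values
man_lut, man_idx = build_lut(mantissas)        # mantissa LUT + index values
```

Because many trained weights share exponent and mantissa patterns, the tables are typically far smaller than the raw parameter array, and each original value is recoverable by reassembling the indexed fields.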
-
Publication number: 20210019634
Abstract: Methods, systems, apparatuses, and computer program products are described herein that enable execution of a large AI model on a memory-constrained target device that is communicatively connected to a parameter server, which stores a master copy of the AI model. The AI model may be dissected into smaller portions (e.g., layers or sub-layers), and each portion may be executed as efficiently as possible on the target device. After execution of one portion of the AI model is finished, another portion of the AI model may be downloaded and executed at the target device. This paradigm of executing one portion of the AI model at a time allows for dynamic execution of the large AI model.
Type: Application
Filed: September 30, 2019
Publication date: January 21, 2021
Inventors: Bharadwaj Pudipeddi, Marc Tremblay, Sujeeth Subramanya Bharadwaj, Jinwen Xi, Maral Mesmakhosroshahi
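The portion-at-a-time execution paradigm can be sketched as a loop that fetches one layer, runs it, and releases it before fetching the next, so only one portion's weights occupy device memory at any moment. The dict-based `parameter_server` and the toy layers are stand-ins assumed for illustration, not the real server protocol:

```python
def run_in_portions(parameter_server, order, x):
    """Execute a model one portion (layer) at a time on a constrained device."""
    for name in order:
        layer = parameter_server[name]  # download this portion from the server
        x = layer(x)                    # execute it on the target device
        del layer                       # free device memory for the next portion
    return x

# Hypothetical server holding the master copy as named callables.
server = {
    "dense1": lambda v: 2 * v,
    "relu": lambda v: max(v, 0),
    "dense2": lambda v: v - 3,
}
y = run_in_portions(server, ["dense1", "relu", "dense2"], 5)
```

The peak memory requirement is set by the largest single portion rather than by the whole model, which is what makes execution of a large model feasible on a small device.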
-
Publication number: 20210019152
Abstract: Methods, systems, apparatuses, and computer program products are described herein that enable execution of a large AI model on a memory-constrained target device that is communicatively connected to a parameter server, which stores a master copy of the AI model. The AI model may be dissected into smaller portions (e.g., layers or sub-layers), and each portion may be executed as efficiently as possible on the target device. After execution of one portion of the AI model is finished, another portion of the AI model may be downloaded and executed at the target device. To improve efficiency, the input samples may be divided into microbatches, and a plurality of microbatches executing in sequential order may form a minibatch. The size of the group of microbatches or minibatch can be adjusted to reduce the communication overhead. Multi-level parallel parameters reduction may be performed at the parameter server and the target device.
Type: Application
Filed: September 30, 2019
Publication date: January 21, 2021
Inventors: Bharadwaj Pudipeddi, Marc Tremblay, Sujeeth Subramanya Bharadwaj, Devangkumar Patel, Jinwen Xi, Maral Mesmakhosroshahi
-
Publication number: 20210019151
Abstract: Methods, systems, apparatuses, and computer program products are described herein that enable execution of a large AI model on a memory-constrained target device that is communicatively connected to a parameter server, which stores a master copy of the AI model. The AI model may be dissected into smaller portions (e.g., layers or sub-layers), and each portion may be executed as efficiently as possible on the target device. After execution of one portion of the AI model is finished, another portion of the AI model may be downloaded and executed at the target device. To improve efficiency, the input samples may be divided into microbatches, and a plurality of microbatches executing in sequential order may form a minibatch. The size of the group of microbatches or minibatch can be manually or automatically adjusted to reduce the communication overhead.
Type: Application
Filed: September 20, 2019
Publication date: January 21, 2021
Inventors: Bharadwaj Pudipeddi, Marc Tremblay, Gautham Popuri, Layali Rashid, Tiyasa Mitra, III, Mohit Mittal, Maral Mesmakhosroshahi
-
Publication number: 20200342288
Abstract: A distributed training system including a parameter server is configured to compress the weight matrices according to a clustering algorithm, and the compressed representation of each weight matrix may thereafter be distributed to training workers. The compressed representation may comprise a centroid index matrix and a centroid table, wherein each element of the centroid index matrix corresponds to an element of the corresponding weight matrix and comprises an index into the centroid table, and wherein each element of the centroid table comprises a centroid value. In a further example aspect, a training worker may compute an activation result directly from the compressed representation of a weight matrix and a training data matrix by performing gather-reduce-add operations that accumulate all the elements of the training data matrix that correspond to the same centroid value to generate partial sums, multiplying each partial sum by its corresponding centroid value, and summing the resulting products.
Type: Application
Filed: September 26, 2019
Publication date: October 29, 2020
Inventors: Jinwen Xi, Bharadwaj Pudipeddi
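The gather-reduce-add computation the abstract describes amounts to grouping input elements by centroid before multiplying, so each centroid value is multiplied only once per row instead of once per weight. A minimal sketch follows; the centroid table and index values are invented for the example:

```python
def compressed_dot(index_row, centroid_table, x):
    """Dot product of a centroid-compressed weight row with input vector x.

    Gather-reduce-add: every x element whose weight maps to the same
    centroid is accumulated into one partial sum, then each partial sum
    is multiplied by its centroid value and the products are summed."""
    partial = [0.0] * len(centroid_table)
    for c, xi in zip(index_row, x):
        partial[c] += xi                            # gather-reduce-add
    return sum(p * v for p, v in zip(partial, centroid_table))

centroid_table = [0.5, 2.0]      # two centroid values (toy, assumed)
index_row = [0, 1, 0, 1]         # compressed form of w = [0.5, 2.0, 0.5, 2.0]
result = compressed_dot(index_row, centroid_table, [1.0, 2.0, 3.0, 4.0])
```

With k centroids the row needs k multiplications instead of one per element, and the worker never has to materialize the uncompressed weight matrix.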
-
Patent number: 10627798
Abstract: In an embodiment of the invention, an apparatus comprises: a non-volatile memory device; a complex programmable logic device (CPLD) coupled to the non-volatile memory device; a field programmable gate array (FPGA) coupled to the CPLD; and a host coupled to the FPGA; wherein the apparatus triggers a switch of an FPGA image in the FPGA to another FPGA image. In another embodiment of the invention, a method comprises: triggering, by an apparatus, a switch of an FPGA image in a field programmable gate array (FPGA) to another FPGA image; wherein the apparatus comprises: a non-volatile memory device; a complex programmable logic device (CPLD) coupled to the non-volatile memory device; the field programmable gate array (FPGA) coupled to the CPLD; and a host coupled to the FPGA.
Type: Grant
Filed: June 29, 2018
Date of Patent: April 21, 2020
Assignee: BiTMICRO Networks, Inc.
Inventors: Federico Sambilay, Jr., Bharadwaj Pudipeddi, Richard A. Cantong, Joevanni Parairo
-
Publication number: 20190155735
Abstract: In an embodiment of the invention, an apparatus comprises: a central processing unit (CPU); a volatile memory controller; a non-volatile memory controller; a volatile memory coupled to the volatile memory controller; and a non-volatile memory coupled to the non-volatile memory controller; wherein a ratio of the non-volatile memory to the volatile memory is much less than a typical ratio. In another embodiment of the invention, a method comprises: receiving, by a central processing unit (CPU), a command; evaluating, by the CPU, the command; executing, by the CPU, a data software assist to perform the command or activating, by the CPU, a hardware accelerator module to perform the command; and responding, by the CPU, to the command.
Type: Application
Filed: June 29, 2018
Publication date: May 23, 2019
Inventors: Bharadwaj Pudipeddi, Richard A. Cantong, Marlon B. Verdan, Joevanni Parairo, Marvin Fenol
-
Publication number: 20190158384
Abstract: In an embodiment of the invention, an apparatus comprises: a requestor configured to transmit a first operand and a second operand, wherein the first operand is partitioned; a shared network configured to transmit the operands; a processing load balancer for receiving the operands; a plurality of processing elements that are configured to process the operands; and a private network configured to multicast the operands to the processing elements. In another embodiment of the invention, a method comprises: transmitting a first operand and a second operand from a requestor, wherein the first operand is partitioned; transmitting the operands along a shared network; receiving the operands by a processing load balancer; multicasting the operands by a private network; and processing the operands by a plurality of processing elements.
Type: Application
Filed: June 25, 2018
Publication date: May 23, 2019
Inventors: Bharadwaj Pudipeddi, Federico Sambilay, Richard A. Cantong
-
Publication number: 20190129882
Abstract: Disclosed techniques include platform optimization for multi-platform module design for performance scalability. A compute platform pluggable module form factor and functionality is obtained, where the form factor enables single socket plugging within a plurality of sockets on a compute platform. The form factor employs electrical connections in each socket. A scaling form factor commensurate with adjacent sockets on the compute platform is established. The adjacent sockets each provide similar functionality for modules, and the adjacent sockets can be used interchangeably without loss of functionality of the compute platform. A single, integrated, rigid module is provided according to the scaling form factor that plugs into the adjacent sockets of the compute platform. The module provides expanded functionality over a single-plug form factor module. The expanded functionality is enabled through use of the electrical connections of the adjacent sockets.
Type: Application
Filed: October 30, 2018
Publication date: May 2, 2019
Inventors: Bharadwaj Pudipeddi, Anthony Gallippi, Vijay Devadiga
-
Patent number: 10216596
Abstract: Embodiments of the invention provide a system and method to vastly improve the remote write latency (write to remote server) and to reduce the load that is placed on the remote server by issuing auto-log (automatic log) writes through an integrated networking port in the SSD (solid state drive). Embodiments of the invention also provide a system and method for a PCI-e attached SSD to recover after a failure detection by appropriating a remote namespace.
Type: Grant
Filed: December 31, 2016
Date of Patent: February 26, 2019
Assignee: BiTMICRO Networks, Inc.
Inventor: Bharadwaj Pudipeddi
-
Publication number: 20190018386
Abstract: In an embodiment of the invention, an apparatus comprises: a non-volatile memory device; a complex programmable logic device (CPLD) coupled to the non-volatile memory device; a field programmable gate array (FPGA) coupled to the CPLD; and a host coupled to the FPGA; wherein the apparatus triggers a switch of an FPGA image in the FPGA to another FPGA image. In another embodiment of the invention, a method comprises: triggering, by an apparatus, a switch of an FPGA image in a field programmable gate array (FPGA) to another FPGA image; wherein the apparatus comprises: a non-volatile memory device; a complex programmable logic device (CPLD) coupled to the non-volatile memory device; the field programmable gate array (FPGA) coupled to the CPLD; and a host coupled to the FPGA.
Type: Application
Filed: June 29, 2018
Publication date: January 17, 2019
Inventors: Federico Sambilay Jr., Bharadwaj Pudipeddi, Richard A. Cantong, Joevanni Parairo
-
Patent number: 10007561
Abstract: The invention is an apparatus for dynamic provisioning available as a multi-mode device that can be dynamically configured for balancing between storage performance and hardware acceleration resources on reconfigurable hardware such as an FPGA. An embodiment of the invention provides a cluster of these multi-mode devices that form a group of resilient storage and acceleration elements without requiring a dedicated standby storage spare. Yet another embodiment of the invention provides an interconnection network attached cluster configured to dynamically provision full acceleration and storage resources to meet an application's needs and end-of-life requirements of an SSD.
Type: Grant
Filed: April 10, 2017
Date of Patent: June 26, 2018
Assignee: BiTMICRO Networks, Inc.
Inventors: Bharadwaj Pudipeddi, Jeffrey Bunting, Lihan Chang
-
Patent number: 7813288
Abstract: A transaction detection device is described having inputs to couple to communication lines. Each communication line is to transport notification of a packet observed by a link probe within a computing system containing point-to-point links between nodes, each of said nodes having at least one processing core. The transaction detection device also comprises logic circuitry to determine from said notifications whether a looked-for transaction has occurred within said computing system.
Type: Grant
Filed: November 21, 2005
Date of Patent: October 12, 2010
Assignee: Intel Corporation
Inventors: Robert Roth, Bharadwaj Pudipeddi, Richard Glass, Madhu Athreya
-
Patent number: 7779210
Abstract: In one embodiment, the present invention includes a method for receiving a request for data in a home agent of a system from a first agent, prefetching the data from a memory and accessing a directory entry to determine whether a copy of the data is cached in any system agent, and forwarding the data to the first agent without waiting for snoop responses from other system agents if the directory entry indicates that the data is not cached. Other embodiments are described and claimed.
Type: Grant
Filed: October 31, 2007
Date of Patent: August 17, 2010
Assignee: Intel Corporation
Inventors: Bharadwaj Pudipeddi, Ghassan Khadder
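The latency win in the abstract comes from overlapping the memory prefetch with the directory lookup and skipping the snoop round-trip when the directory shows the line is uncached. A toy model of that decision follows; the dict-based `home_agent` structure and return labels are assumptions for illustration, not the hardware design:

```python
def handle_read(home_agent, requester, addr):
    """Home-agent read handling: prefetch the line from memory while the
    directory entry is checked; if no other agent caches a copy, forward
    the data without waiting for snoop responses."""
    data = home_agent["memory"][addr]                   # prefetch from memory
    sharers = home_agent["directory"].get(addr, set())  # directory lookup
    if not (sharers - {requester}):
        return data, "forward-early"                    # no snoops needed
    return data, "wait-for-snoops"                      # a cached copy may exist

agent = {"memory": {0x40: 7}, "directory": {0x40: set()}}
result = handle_read(agent, "core0", 0x40)
```

When no directory entry lists another sharer, the requester sees memory latency alone; snoop responses are only awaited when a cached copy might hold newer data.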
-
Patent number: 7765352
Abstract: A power control unit (PCU) may reduce the core wake-up latency in a computer system by concurrently waking-up the remaining cores after the first core is woken-up. The power control unit may detect arrival of a first, second, and a third interrupt directed at a first, second, and a third core. The power control unit may check whether the second interrupt occurs within a first period, wherein the first period is counted after waking-up of the first core is complete. The power control unit may then wake-up the second and the third core concurrently if the second interrupt occurs within the first period after the wake-up activity of the first core is complete. The first period may at least equal twice the time required for a first credit to be returned and next credit to be accepted.
Type: Grant
Filed: September 1, 2009
Date of Patent: July 27, 2010
Assignee: Intel Corporation
Inventors: Bharadwaj Pudipeddi, James S. Burns
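The wake-up policy above reduces to a timing check: if the second interrupt lands within a window after the first core finishes waking, the remaining cores are woken together rather than serially. A sketch of that decision logic follows; the function name, list encoding, and unit-less times are assumptions made for illustration:

```python
def plan_wakeup(irq_times, first_wake_done, window):
    """Choose which pending cores to wake now.

    irq_times[0] is the first interrupt (its core is already being woken);
    the rest are interrupts for the second, third, ... cores. If the second
    interrupt arrives within `window` after the first core's wake-up
    completes, wake all remaining cores concurrently; otherwise wake only
    the second core and handle the rest serially."""
    pending = irq_times[1:]
    if len(pending) > 1 and pending[0] <= first_wake_done + window:
        return pending          # concurrent wake-up of the remaining cores
    return pending[:1]          # serial wake-up: second core only, for now

# Second interrupt at t=12 falls within the window after the first core
# finishes waking at t=10, so cores two and three wake together.
plan = plan_wakeup([0, 12, 15], first_wake_done=10, window=4)
```

Waking the stragglers concurrently overlaps their wake-up latencies instead of stacking them end to end, which is where the latency reduction comes from.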
-
Publication number: 20090319712
Abstract: A power control unit (PCU) may reduce the core wake-up latency in a computer system by concurrently waking-up the remaining cores after the first core is woken-up. The power control unit may detect arrival of a first, second, and a third interrupt directed at a first, second, and a third core. The power control unit may check whether the second interrupt occurs within a first period, wherein the first period is counted after waking-up of the first core is complete. The power control unit may then wake-up the second and the third core concurrently if the second interrupt occurs within the first period after the wake-up activity of the first core is complete. The first period may at least equal twice the time required for a first credit to be returned and next credit to be accepted.
Type: Application
Filed: September 1, 2009
Publication date: December 24, 2009
Inventors: Bharadwaj Pudipeddi, James S. Burns
-
Patent number: 7603504
Abstract: A power control unit (PCU) may reduce the core wake-up latency in a computer system by concurrently waking-up the remaining cores after the first core is woken-up. The power control unit may detect arrival of a first, second, and a third interrupt directed at a first, second, and a third core. The power control unit may check whether the second interrupt occurs within a first period, wherein the first period is counted after waking-up of the first core is complete. The power control unit may then wake-up the second and the third core concurrently if the second interrupt occurs within the first period after the wake-up activity of the first core is complete. The first period may at least equal twice the time required for a first credit to be returned and next credit to be accepted.
Type: Grant
Filed: December 18, 2007
Date of Patent: October 13, 2009
Assignee: Intel Corporation
Inventors: Bharadwaj Pudipeddi, James S. Burns
-
Publication number: 20090158068
Abstract: A power control unit (PCU) may reduce the core wake-up latency in a computer system by concurrently waking-up the remaining cores after the first core is woken-up. The power control unit may detect arrival of a first, second, and a third interrupt directed at a first, second, and a third core. The power control unit may check whether the second interrupt occurs within a first period, wherein the first period is counted after waking-up of the first core is complete. The power control unit may then wake-up the second and the third core concurrently if the second interrupt occurs within the first period after the wake-up activity of the first core is complete. The first period may at least equal twice the time required for a first credit to be returned and next credit to be accepted.
Type: Application
Filed: December 18, 2007
Publication date: June 18, 2009
Inventors: Bharadwaj Pudipeddi, James S. Burns
-
Publication number: 20090113139
Abstract: In one embodiment, the present invention includes a method for receiving a request for data in a home agent of a system from a first agent, prefetching the data from a memory and accessing a directory entry to determine whether a copy of the data is cached in any system agent, and forwarding the data to the first agent without waiting for snoop responses from other system agents if the directory entry indicates that the data is not cached. Other embodiments are described and claimed.
Type: Application
Filed: October 31, 2007
Publication date: April 30, 2009
Inventors: Bharadwaj Pudipeddi, Ghassan Khadder