Patents by Inventor Nilesh Jain
Nilesh Jain has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20250124105
Abstract: Key-value (KV) caching accelerates inference in large language models (LLMs) by allowing the attention operation to scale linearly rather than quadratically with the total sequence length. Due to large context lengths in modern LLMs, KV cache size can exceed the model size, which can negatively impact throughput. To address this issue, KVCrush, which stands for KEY-VALUE CACHE SIZE REDUCTION USING SIMILARITY IN HEAD-BEHAVIOR, is implemented. KVCrush involves using binary vectors to represent tokens, where the vector indicates which attention heads attend to the token and which attention heads disregard the token. The binary vectors are used in a hardware-efficient, low-overhead process to produce representatives for unimportant tokens to be pruned, without having to implement k-means clustering techniques.
Type: Application
Filed: December 26, 2024
Publication date: April 17, 2025
Applicant: Intel Corporation
Inventors: Gopi Krishna Jha, Sameh Gobriel, Nilesh Jain
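The core idea above, representing each token by a binary vector of which heads attend to it, then grouping pruned tokens by identical patterns instead of running k-means, can be sketched as follows. This is a minimal illustration with made-up function names and a hypothetical attention threshold, not the patented implementation:

```python
import numpy as np

def head_behavior_vectors(attn_scores, threshold=0.01):
    """Binarize per-head attention: 1 if a head attends to the token above
    the threshold, 0 otherwise. attn_scores: (num_heads, num_tokens)."""
    return (attn_scores > threshold).astype(np.uint8).T  # (num_tokens, num_heads)

def pick_representatives(binary_vecs, pruned_idx):
    """Group pruned tokens that share the same head-behavior pattern and
    keep one representative per group -- no k-means clustering needed."""
    groups = {}
    for i in pruned_idx:
        key = binary_vecs[i].tobytes()  # hashable fingerprint of the pattern
        groups.setdefault(key, []).append(i)
    return [members[0] for members in groups.values()]
```

Grouping by exact byte pattern is hardware-friendly because it is a hash/bucket operation rather than an iterative distance computation.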
-
Publication number: 20250094712
Abstract: Key-value (KV) caching accelerates inference in large language models (LLMs) by allowing the attention operation to scale linearly rather than quadratically with the total sequence length. Due to large context lengths in modern LLMs, KV cache size can exceed the model size, which can negatively impact throughput. To address this issue, a multi-granular clustering-based solution for KV cache compression can be implemented. Key tensors and value tensors corresponding to unimportant tokens can be approximated using clusters created at different clustering levels with varying accuracy. Accuracy loss can be mitigated by using proxies produced at a finer-granularity clustering level for a subset of attention heads that are more significant. More significant attention heads can have a higher impact on model accuracy than less significant attention heads. Latency is improved by retrieving proxies from a faster memory for a subset of attention heads that are less significant, when the impact on accuracy is lower.
Type: Application
Filed: December 2, 2024
Publication date: March 20, 2025
Applicant: Intel Corporation
Inventors: Gopi Krishna Jha, Sameh Gobriel, Nilesh Jain
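The multi-granular idea, finer clustering (more clusters, better proxies) for important heads and coarser clustering for the rest, can be sketched with a tiny k-means. All names and the cluster counts are illustrative assumptions, not the patent's parameters:

```python
import numpy as np

def kmeans_proxies(vectors, k, iters=10, seed=0):
    """Tiny k-means: return a centroid proxy for each input vector."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((vectors[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (assign == c).any():
                centroids[c] = vectors[assign == c].mean(0)
    return centroids[assign]

def compress_kv(kv_per_head, head_importance, k_fine=8, k_coarse=2):
    """More significant heads get finer-granularity proxies (more clusters);
    less significant heads get coarser, cheaper proxies."""
    out = {}
    for h, kv in kv_per_head.items():
        k = k_fine if head_importance[h] else k_coarse
        out[h] = kmeans_proxies(kv, min(k, len(kv)))
    return out
```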
-
Publication number: 20250061316
Abstract: Key-value (KV) cache paging schemes can improve memory management for KV caches by storing a KV cache page having key tensors and value tensors for a fixed number of tokens in a fixed-sized block in the KV cache of a worker. To further improve memory management, the schemes can be modified to implement dynamic variable quantization. The quantization level of a KV cache page can be set based on a runtime importance score of the KV cache page. In addition, the quantization level of the KV cache page can be set based on the system load. The end result is a scheme that can achieve a high compression ratio of KV cache pages in the KV cache. Fitting more KV cache pages in the KV cache can lead to higher inference throughput, higher system-level user capacity, and higher end-to-end service availability.
Type: Application
Filed: November 1, 2024
Publication date: February 20, 2025
Applicant: Intel Corporation
Inventors: Sameh Gobriel, Nilesh Jain, Vui Seng Chua, Juan Pablo Munoz, Gopi Krishna Jha
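A minimal sketch of dynamic variable quantization: a policy picks a bit width per page from its importance score and the current system load, then the page is uniformly quantized. The thresholds and bit widths below are assumptions for illustration, not values from the patent:

```python
import numpy as np

def choose_bits(importance, load, high=8, low=4):
    """Hypothetical policy: important pages keep more bits; under heavy
    system load everything is compressed more aggressively."""
    bits = high if importance > 0.5 else low
    if load > 0.9:
        bits = min(bits, low)
    return bits

def quantize_page(page, bits):
    """Uniform symmetric quantization of one KV cache page."""
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.abs(page).max()) / qmax
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(page / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_page(q, scale):
    return q.astype(np.float32) * scale
```

Storing int8 (or packed int4) pages instead of fp16 tensors is what yields the higher compression ratio and lets more pages fit in the cache.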
-
Publication number: 20250055987
Abstract: Techniques related to distributing the video encoding processing of an input video across hardware and software systems. Such techniques include evaluating the content of the video and determining whether the encoding operation is best performed on the hardware system only, the software system only, or a hybrid hardware and software system.
Type: Application
Filed: August 22, 2024
Publication date: February 13, 2025
Applicant: Intel Corporation
Inventors: Brinda Ganesh, Nilesh Jain, Sumit Mohan, Faouzi Kossentini, Jill Boyce, James Holland, Zhijun Lei, Chekib Nouira, Foued Ben Amara, Hassene Tmar, Sebastian Possos, Craig Hurst
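The hardware/software/hybrid decision can be sketched as a simple content-driven dispatch policy. The complexity metric, thresholds, and latency budget below are invented for illustration; the patent's actual evaluation criteria are not specified here:

```python
def choose_encoder(complexity, latency_budget_ms):
    """Hypothetical dispatch: simple content goes to fixed-function hardware,
    complex content with a generous latency budget goes to software for
    better quality, and the middle ground is split across both."""
    if complexity < 0.3:
        return "hardware"
    if complexity > 0.7 and latency_budget_ms > 50:
        return "software"
    return "hybrid"
```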
-
Publication number: 20250028675
Abstract: Embodiments described herein include software, firmware, and hardware that provides techniques to enable deterministic scheduling across multiple general-purpose graphics processing units. One embodiment provides a multi-GPU architecture with uniform latency. One embodiment provides techniques to distribute memory output based on memory chip thermals. One embodiment provides techniques to enable thermally aware workload scheduling. One embodiment provides techniques to enable end to end contracts for workload scheduling on multiple GPUs.
Type: Application
Filed: August 1, 2024
Publication date: January 23, 2025
Applicant: Intel Corporation
Inventors: Joydeep Ray, Selvakumar Panneer, Saurabh Tangri, Ben Ashbaugh, Scott Janus, Abhishek Appu, Varghese George, Ravishankar Iyer, Nilesh Jain, Pattabhiraman K, Altug Koker, Mike MacPherson, Josh Mastronarde, Elmoustapha Ould-Ahmed-Vall, Jayakrishna P. S, Eric Samson
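Thermally aware workload scheduling, one of the embodiments above, can be sketched as: among GPUs with enough spare capacity, place the workload on the coolest one. The device fields and policy are assumptions for illustration only:

```python
def schedule_workload(workload_cost, gpus):
    """Hypothetical thermally aware scheduler.
    gpus: list of dicts with 'id', 'temp_c', and 'free_capacity'."""
    candidates = [g for g in gpus if g["free_capacity"] >= workload_cost]
    if not candidates:
        return None  # no GPU can host the workload right now
    return min(candidates, key=lambda g: g["temp_c"])["id"]
```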
-
Patent number: 12197601
Abstract: Examples described herein relate to offload circuitry comprising one or more compute engines that are configurable to perform a workload offloaded from a process executed by a processor based on a descriptor particular to the workload. In some examples, the offload circuitry is configurable to perform the workload, among multiple different workloads. In some examples, the multiple different workloads include one or more of: data transformation (DT) for data format conversion, Locality Sensitive Hashing (LSH) for neural network (NN), similarity search, sparse general matrix-matrix multiplication (SpGEMM) acceleration of hash based sparse matrix multiplication, data encode, data decode, or embedding lookup.
Type: Grant
Filed: December 22, 2021
Date of Patent: January 14, 2025
Assignee: Intel Corporation
Inventors: Ren Wang, Sameh Gobriel, Somnath Paul, Yipeng Wang, Priya Autee, Abhirupa Layek, Shaman Narayana, Edwin Verplanke, Mrittika Ganguli, Jr-Shian Tsai, Anton Sorokin, Suvadeep Banerjee, Abhijit Davare, Desmond Kirkpatrick, Rajesh M. Sankaran, Jaykant B. Timbadiya, Sriram Kabisthalam Muthukumar, Narayan Ranganathan, Nalini Murari, Brinda Ganesh, Nilesh Jain
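The descriptor-driven dispatch can be modeled in software: a descriptor names the workload type and carries its payload, and the engine routes it to the matching handler. The operation names and descriptor fields below are illustrative stand-ins, not the hardware's actual descriptor format:

```python
def make_offload_engine():
    """Toy model of configurable offload circuitry with per-workload
    descriptors; two of the listed workload types are sketched."""
    engines = {
        "data_transform": lambda p: [float(x) for x in p["values"]],   # format conversion
        "embedding_lookup": lambda p: [p["table"][i] for i in p["ids"]],
    }
    def submit(descriptor):
        return engines[descriptor["op"]](descriptor["payload"])
    return submit
```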
-
Patent number: 12166688
Abstract: Methods, apparatus, systems, and articles of manufacture are disclosed to optimize resources in edge networks. An example apparatus includes agent managing circuitry to invoke an exploration agent to identify platform resource devices, select a first one of the identified platform resource devices, and generate first optimization metrics for the workload corresponding to the first one of the identified platform resource devices, the first optimization metrics corresponding to a first path. The example agent is to further select a second one of the identified platform resource devices, generate second optimization metrics for the workload corresponding to the second one of the identified platform resource devices, the second optimization metrics corresponding to a second path.
Type: Grant
Filed: June 25, 2021
Date of Patent: December 10, 2024
Assignee: Intel Corporation
Inventors: Nilesh Jain, Rajesh Poornachandran, Eriko Nurvitadhi, Anahita Bhiwandiwalla, Juan Pablo Munoz, Ravishankar Iyer, Chaunte W. Lacewell
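The exploration agent's loop, score the workload against each candidate platform resource device (each a "path") and compare the resulting metrics, can be sketched as below. The `estimate` callback stands in for whatever optimization metric the agent generates; all names are assumptions:

```python
def explore_paths(workload, devices, estimate):
    """Hypothetical exploration agent: generate optimization metrics for the
    workload on every identified device, then return all metrics and the
    best path. estimate(workload, device) -> lower-is-better cost."""
    metrics = {d: estimate(workload, d) for d in devices}
    best = min(metrics, key=metrics.get)
    return metrics, best
```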
-
Patent number: 12130730
Abstract: An aspect of the present disclosure determines test cases to be run upon changes in software application code. In one embodiment, a system receives a test suite containing multiple test cases designed to perform the testing of a software application, the software application containing one or more components. The system executes each test case to determine a corresponding sequence of components executed in the software application for the test case, and then stores dependency data indicating for each test case the corresponding determined sequence of components. Upon determining that a first component has been changed, the system identifies a first set of test cases that cause execution of the first component by performing a reverse look-up in the dependency data. The system then includes the identified first set of test cases in the test cases to be run for re-testing the software application.
Type: Grant
Filed: July 14, 2022
Date of Patent: October 29, 2024
Assignee: Oracle International Corporation
Inventors: Nilesh Jain, Krishnananda Subbarao
-
Patent number: 12113853
Abstract: Example methods, apparatus, and systems to manage quality of service with respect to service level agreements in a computing device are disclosed. An example apparatus includes a first mesh proxy assigned to a first platform-agnostic application, the first mesh proxy to generate a first resource request signal based on a first service level agreement requirement from the first platform-agnostic application; a second mesh proxy assigned to a second platform-agnostic application, the second mesh proxy to generate a second resource request signal based on a second service level agreement requirement from the second platform-agnostic application; and a load balancer to allocate hardware resources for the first platform-agnostic application and the second platform-agnostic application based on the first resource request signal and the second resource request signal.
Type: Grant
Filed: September 25, 2020
Date of Patent: October 8, 2024
Assignee: Intel Corporation
Inventors: Francesc Guim Bernat, Andrew J. Herdrich, Kshitij Arun Doshi, Monica Kenguva, Ned M. Smith, Nilesh Jain, Brinda Ganesh, Rashmin Patel, Alexander Vul
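The load-balancing step, combining per-application resource request signals into hardware allocations, can be sketched as a proportional-share allocator. The proportional policy is an assumption for illustration; the patent does not commit to it:

```python
def allocate_resources(total_units, requests):
    """Toy load balancer: each mesh proxy has translated its app's SLA
    requirement into a numeric resource request signal; hardware resources
    are allocated proportionally to those signals."""
    demand = sum(requests.values())
    return {app: total_units * req / demand for app, req in requests.items()}
```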
-
Patent number: 12101475
Abstract: Techniques related to distributing the video encoding processing of an input video across hardware and software systems. Such techniques include evaluating the content of the video and determining whether the encoding operation is best performed on the hardware system only, the software system only, or a hybrid hardware and software system.
Type: Grant
Filed: December 18, 2020
Date of Patent: September 24, 2024
Assignee: Intel Corporation
Inventors: Brinda Ganesh, Nilesh Jain, Sumit Mohan, Faouzi Kossentini, Jill Boyce, James Holland, Zhijun Lei, Chekib Nouira, Foued Ben Amara, Hassene Tmar, Sebastian Possos, Craig Hurst
-
Publication number: 20240311951
Abstract: Described herein is a graphics processor configured to perform time-based frame predication to bypass execution of a command buffer based on a comparison with time stamps stored in a time stamp buffer that tracks execution time for command buffers. The graphics processor can bypass a frame that will not complete in time for a target display update and trigger neural frame generation to generate the frame data for the bypassed command buffer. Dynamic render scaling is also described.
Type: Application
Filed: September 29, 2023
Publication date: September 19, 2024
Applicant: Intel Corporation
Inventors: Selvakumar Panneer, Sarthak Rajesh Shah, Nilesh Jain, John Feit
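The predication decision, predicting the next command buffer's execution time from recorded time stamps and bypassing it if it would miss the display update, can be sketched as below. Using a moving average as the predictor is an assumption for illustration:

```python
def should_bypass(timestamps_ms, deadline_ms):
    """Predict the next command buffer's execution time from the time stamp
    buffer (simple moving average here) and bypass it when it would miss the
    target display update, which would trigger neural frame generation."""
    predicted = sum(timestamps_ms) / len(timestamps_ms)
    return predicted > deadline_ms
```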
-
Publication number: 20240311950
Abstract: Described herein is a graphics processor configured to perform time-based frame generation via a temporally aware machine learning model that enables the generation of a frame at a target timestamp relative to the render times of input frames. For example, for an extrapolated frame generated by the temporally aware machine learning model, a low relative timestamp would indicate that the extrapolated frame will appear close in time after the final frame in a sequence of frames and should be relatively close in appearance to the final frame. A higher relative timestamp would indicate that the extrapolated frame should depict a greater degree of evolution based on the optical flow.
Type: Application
Filed: September 29, 2023
Publication date: September 19, 2024
Applicant: Intel Corporation
Inventors: Selvakumar Panneer, Nilesh Jain, SungYe Kim
-
Publication number: 20240307773
Abstract: Described herein is a technique to enhance the responsiveness of gameplay for a 3D gaming application while maintaining the ability to enqueue multiple frames for processing on the GPU. Each frame or a set of workloads within a frame is submitted to the GPU with predication, such that the indicated rendering and resource manipulation commands are not actually performed if the predication condition is enabled. A low latency command can be submitted to the GPU via a copy engine command queue. The command will cause the copy engine to enable or disable predication for command buffers in the command queue. When predication for queued command buffers is enabled, command buffers for workloads that are not related to the workload that is generated in response to the user input are bypassed. High priority command buffers that include workloads generated in response to user input can then be executed immediately.
Type: Application
Filed: September 29, 2023
Publication date: September 19, 2024
Applicant: Intel Corporation
Inventors: Selvakumar Panneer, John Feit, Sarthak Rajesh Shah, SungYe Kim, Nilesh Jain
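The queue behavior described above can be modeled in a few lines: with predication enabled, queued command buffers are skipped unless they are high priority (e.g. generated in response to user input). The buffer fields are hypothetical, and the GPU/copy-engine mechanics are reduced to a flag:

```python
def run_queue(command_buffers, predication_enabled):
    """Toy model of predicated submission: when predication is enabled,
    ordinary queued command buffers are bypassed, and only high priority
    buffers (user-input responses) execute immediately."""
    executed = []
    for buf in command_buffers:
        if predication_enabled and not buf.get("high_priority", False):
            continue  # predication condition set -> commands not performed
        executed.append(buf["name"])
    return executed
```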
-
Patent number: 12079155
Abstract: Embodiments described herein include software, firmware, and hardware that provides techniques to enable deterministic scheduling across multiple general-purpose graphics processing units. One embodiment provides a multi-GPU architecture with uniform latency. One embodiment provides techniques to distribute memory output based on memory chip thermals. One embodiment provides techniques to enable thermally aware workload scheduling. One embodiment provides techniques to enable end to end contracts for workload scheduling on multiple GPUs.
Type: Grant
Filed: March 14, 2020
Date of Patent: September 3, 2024
Assignee: Intel Corporation
Inventors: Joydeep Ray, Selvakumar Panneer, Saurabh Tangri, Ben Ashbaugh, Scott Janus, Abhishek Appu, Varghese George, Ravishankar Iyer, Nilesh Jain, Pattabhiraman K, Altug Koker, Mike MacPherson, Josh Mastronarde, Elmoustapha Ould-Ahmed-Vall, Jayakrishna P. S, Eric Samson
-
Publication number: 20240185493
Abstract: Technology as described herein provides for generating an image via a generator network, including extracting structural relationship information from a text prompt, wherein the structural relationship information includes sentence features and token features, generating encoded text features based on the sentence features and on relation-related tokens, wherein the relation-related tokens are identified based on parsing text dependency information in the token features, and generating an output image based on combining, via self-attention and cross-attention layers, the encoded text features and encoded image features from an input image canvas. Embodiments further include applying a gating function to modify image features based on text features. The self-attention and cross-attention layers can be applied via a cross-modality network, the gating function can be applied via a residual gating network, and the relation-related tokens can be further identified via an attention matrix.
Type: Application
Filed: December 29, 2023
Publication date: June 6, 2024
Inventors: Peixi Xiong, Nilesh Jain
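The residual gating step, a gating function deciding how much the text signal modifies the image features, can be sketched in its simplest form. The shapes, the sigmoid gate, and the additive residual form are generic assumptions, not the patent's exact network:

```python
import numpy as np

def residual_gate(image_features, text_features, gate_weights):
    """Sketch of residual gating: a sigmoid gate computed from the text
    features controls how strongly they modify the image features."""
    gate = 1.0 / (1.0 + np.exp(-(text_features @ gate_weights)))  # scalar in (0, 1)
    return image_features + gate * text_features
```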
-
Publication number: 20240144030
Abstract: Methods, apparatus, systems, and articles of manufacture to modify pre-trained models to apply neural architecture search are disclosed. Example instructions, when executed, cause processor circuitry to at least access a pre-trained machine learning model, create a super-network based on the pre-trained machine learning model, create a plurality of subnetworks based on the super-network, and search the plurality of subnetworks to select a subnetwork.
Type: Application
Filed: June 8, 2022
Publication date: May 2, 2024
Inventors: Juan Pablo Muñoz, Nilesh Jain, Chaunté Lacewell, Alexander Kozlov, Nikolay Lyalyushkin, Vasily Shamporov, Anastasia Senina
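The final step, searching the subnetworks derived from the super-network to select one, reduces to scoring candidate configurations and keeping the best. The `score` callback stands in for whatever accuracy/latency evaluation the search uses; all names are illustrative:

```python
def search_subnetworks(super_network, configs, score):
    """Sketch of weight-sharing NAS search: each config activates one
    subnetwork of the super-network (e.g. a choice of layer widths);
    return the config whose subnetwork scores best."""
    return max(configs, key=lambda cfg: score(super_network, cfg))
```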
-
Publication number: 20240045685
Abstract: Systems, methods, and apparatuses relating to sparsity-based FMA. In some examples, an instance of a single FMA instruction has one or more fields for an opcode, one or more fields to identify a source/destination matrix operand, one or more fields to identify a first plurality of source matrix operands, one or more fields to identify a second plurality of matrix operands, wherein the opcode is to indicate that execution circuitry is to select a proper subset of FP8 data elements from the first plurality of source matrix operands based on sparsity controls from a first matrix operand of the second plurality of matrix operands and perform a FMA.
Type: Application
Filed: October 1, 2022
Publication date: February 8, 2024
Inventors: Menachem Adelman, Amit Gradstein, Alexander Heinecke, Christopher Hughes, Naveen Mellempudi, Shahar Mizrahi, Dana Rip, Simon Rubanovich, Uri Sherman, Guy Boudoukh, Evangelos Georganas, Nilesh Jain, Barukh Ziv
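The semantics, sparsity controls selecting which candidate source elements are live before a fused multiply-add, can be modeled numerically. This toy uses two candidate A operands and plain floats; the real instruction operates on FP8 matrix tiles with its own encoding:

```python
import numpy as np

def sparse_fma(acc, a0, a1, b, sparsity_ctrl):
    """Toy model of the sparsity-based FMA: per element, the sparsity
    control selects which of two candidate A operands is the live value,
    then a fused multiply-add accumulates a * b into acc."""
    a = np.where(sparsity_ctrl == 0, a0, a1)
    return acc + a * b
```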
-
Publication number: 20240029455
Abstract: Systems, apparatuses and methods may provide for technology that encodes multi-view visual data into latent features via an aggregator encoder, decodes the latent features into one or more novel target views different from views of the multi-view visual data via a rendering decoder, and decodes the latent features into an object label via a label decoder. The operation to decode the latent features via the rendering decoder and to decode the latent features via the label decoder occur at least partially at the same time. The operation to encode, via the aggregator encoder, the multi-view visual data into the latent features further includes operations to: perform, via the aggregator encoder, semantic object recognition operations based on radiance field view synthesis operations, and perform, via the aggregator encoder, radiance field view synthesis operations based on semantic object recognition operations.
Type: Application
Filed: September 27, 2023
Publication date: January 25, 2024
Inventors: Peixi Xiong, Nilesh Jain, Ravishankar Iyer, Mrutunjayya Mrutunjayya
-
Publication number: 20240020219
Abstract: An aspect of the present disclosure determines test cases to be run upon changes in software application code. In one embodiment, a system receives a test suite containing multiple test cases designed to perform the testing of a software application, the software application containing one or more components. The system executes each test case to determine a corresponding sequence of components executed in the software application for the test case, and then stores dependency data indicating for each test case the corresponding determined sequence of components. Upon determining that a first component has been changed, the system identifies a first set of test cases that cause execution of the first component by performing a reverse look-up in the dependency data. The system then includes the identified first set of test cases in the test cases to be run for re-testing the software application.
Type: Application
Filed: July 14, 2022
Publication date: January 18, 2024
Inventors: Nilesh Jain, Krishnananda Subbarao
-
Publication number: 20240007414
Abstract: Methods, apparatus, systems, and articles of manufacture are disclosed to optimize resources in edge networks. An example apparatus includes agent managing circuitry to invoke an exploration agent to identify platform resource devices, select a first one of the identified platform resource devices, and generate first optimization metrics for the workload corresponding to the first one of the identified platform resource devices, the first optimization metrics corresponding to a first path. The example agent is to further select a second one of the identified platform resource devices, generate second optimization metrics for the workload corresponding to the second one of the identified platform resource devices, the second optimization metrics corresponding to a second path.
Type: Application
Filed: June 25, 2021
Publication date: January 4, 2024
Inventors: Nilesh Jain, Rajesh Poornachandran, Eriko Nurvitadhi, Anahita Bhiwandiwalla, Juan Pablo Munoz, Ravishankar Iyer, Chaunte W. Lacewell