Patents by Inventor Sameh Gobriel
Sameh Gobriel has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 12293231
Abstract: Examples described herein include a device interface; a first set of one or more processing units; and a second set of one or more processing units. In some examples, the first set of one or more processing units are to perform heavy flow detection for packets of a flow and the second set of one or more processing units are to perform processing of packets of a heavy flow. In some examples, the first set of one or more processing units and second set of one or more processing units are different. In some examples, the first set of one or more processing units is to allocate pointers to packets associated with the heavy flow to a first set of one or more queues of a load balancer and the load balancer is to allocate the packets associated with the heavy flow to one or more processing units of the second set of one or more processing units based, at least in part, on a packet receive rate of the packets associated with the heavy flow.
Type: Grant
Filed: September 10, 2021
Date of Patent: May 6, 2025
Assignee: Intel Corporation
Inventors: Chenmin Sun, Yipeng Wang, Rahul R. Shah, Ren Wang, Sameh Gobriel, Hongjun Ni, Mrittika Ganguli, Edwin Verplanke
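The two-stage design in this abstract (a light first stage that detects heavy flows and hands their packet pointers to load-balancer queues, and a second stage that does the actual packet processing) can be illustrated with a minimal Python sketch. The threshold, the rate-based worker count, and all names below are illustrative assumptions, not the patented mechanism:

```python
# Minimal sketch of two-stage heavy-flow handling (all names hypothetical).
from collections import defaultdict, deque

HEAVY_THRESHOLD = 1000  # packets per interval that mark a flow "heavy" (assumed)

class HeavyFlowDetector:
    """First set of processing units: counts packets per flow id."""
    def __init__(self):
        self.counts = defaultdict(int)
        self.queues = defaultdict(deque)  # per-heavy-flow queues of packet pointers

    def on_packet(self, flow_id, pkt_ptr):
        self.counts[flow_id] += 1
        if self.counts[flow_id] >= HEAVY_THRESHOLD:
            self.queues[flow_id].append(pkt_ptr)  # hand off to the load balancer
            return True
        return False  # light flow: process in the first stage

class LoadBalancer:
    """Spreads a heavy flow over second-set workers based on its receive rate."""
    def __init__(self, num_workers):
        self.num_workers = num_workers

    def workers_for(self, flow_rate_pps, worker_capacity_pps=500_000):
        # Faster flows get more workers, capped at the pool size (assumed policy).
        return min(self.num_workers, max(1, flow_rate_pps // worker_capacity_pps + 1))
```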
-
Publication number: 20250124105
Abstract: Key-value (KV) caching accelerates inference in large language models (LLMs) by allowing the attention operation to scale linearly rather than quadratically with the total sequence length. Due to large context lengths in modern LLMs, KV cache size can exceed the model size, which can negatively impact throughput. To address this issue, KVCrush, which stands for KEY-VALUE CACHE SIZE REDUCTION USING SIMILARITY IN HEAD-BEHAVIOR, is implemented. KVCrush involves using binary vectors to represent tokens, where the vector indicates which attention heads attend to the token and which attention heads disregard the token. The binary vectors are used in a hardware-efficient, low-overhead process to produce representatives for unimportant tokens to be pruned, without having to implement k-means clustering techniques.
Type: Application
Filed: December 26, 2024
Publication date: April 17, 2025
Applicant: Intel Corporation
Inventors: Gopi Krishna Jha, Sameh Gobriel, Nilesh Jain
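The core idea in this abstract (a binary head-behavior vector per token, grouped by bit pattern to pick representatives without k-means) can be sketched in a few lines of Python. The attention threshold and function names are assumptions for illustration:

```python
import numpy as np

def head_behavior_vectors(attn, tau=0.01):
    # attn: [heads, tokens] attention mass each head assigns to each token.
    # Bit = 1 if the head attends to the token, 0 if it disregards it (assumed cutoff).
    return (attn > tau).astype(np.uint8)

def crush(vectors, unimportant):
    # Group unimportant tokens by identical head-behavior bit patterns and keep
    # one representative index per pattern -- no k-means clustering needed.
    reps = {}
    for t in unimportant:
        key = vectors[:, t].tobytes()
        reps.setdefault(key, t)
    return list(reps.values())

attn = np.random.rand(8, 32)                     # toy: 8 heads, 32 tokens
vecs = head_behavior_vectors(attn)
keep = crush(vecs, unimportant=range(16, 32))    # representatives for the pruned tail
```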
-
Publication number: 20250094712
Abstract: Key-value (KV) caching accelerates inference in large language models (LLMs) by allowing the attention operation to scale linearly rather than quadratically with the total sequence length. Due to large context lengths in modern LLMs, KV cache size can exceed the model size, which can negatively impact throughput. To address this issue, a multi-granular clustering-based solution for KV cache compression can be implemented. Key tensors and value tensors corresponding to unimportant tokens can be approximated using clusters created at different clustering levels with varying accuracy. Accuracy loss can be mitigated by using proxies produced at a finer-granularity clustering level for a subset of attention heads that are more significant. More significant attention heads can have a higher impact on model accuracy than less significant attention heads. Latency is improved by retrieving proxies from a faster memory for a subset of attention heads that are less significant, when the impact on accuracy is lower.
Type: Application
Filed: December 2, 2024
Publication date: March 20, 2025
Applicant: Intel Corporation
Inventors: Gopi Krishna Jha, Sameh Gobriel, Nilesh Jain
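A rough Python sketch of the multi-granular idea: significant heads get fine-grained cluster proxies, less significant heads get coarse ones. The bucketing below is a toy stand-in for a real clustering step, and the cluster counts are assumptions:

```python
import numpy as np

def cluster_proxies(kv, n_clusters):
    # Toy stand-in for clustering: bucket rows round-robin and average per bucket.
    idx = np.arange(len(kv)) % n_clusters
    centroids = np.stack([kv[idx == c].mean(axis=0) for c in range(n_clusters)])
    return centroids[idx]  # each row replaced by its cluster's proxy

def compress_head(kv_unimportant, head_is_significant):
    # Significant heads: finer granularity (more clusters, higher accuracy).
    # Less significant heads: coarse proxies, which could live in faster memory.
    n = 64 if head_is_significant else 8  # assumed cluster counts
    return cluster_proxies(kv_unimportant, min(n, len(kv_unimportant)))

kv = np.random.rand(128, 64)              # toy KV rows for one head's pruned tokens
approx = compress_head(kv, head_is_significant=False)
```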
-
Publication number: 20250061316
Abstract: Key-value (KV) cache paging schemes can improve memory management for KV caches by storing a KV cache page having key tensors and value tensors for a fixed number of tokens in a fixed-sized block in the KV cache of a worker. To further improve memory management, the schemes can be modified to implement dynamic variable quantization. The quantization level of a KV cache page can be set based on a runtime importance score of the KV cache page. In addition, the quantization level of the KV cache page can be set based on the system load. The end result is a scheme that can achieve a high compression ratio of KV cache pages in the KV cache. Fitting more KV cache pages in the KV cache can lead to higher inference throughput, higher system-level user capacity, and higher end-to-end service availability.
Type: Application
Filed: November 1, 2024
Publication date: February 20, 2025
Applicant: Intel Corporation
Inventors: Sameh Gobriel, Nilesh Jain, Vui Seng Chua, Juan Pablo Munoz, Gopi Krishna Jha
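The per-page policy described here (quantization level driven by a runtime importance score and by system load) might look like the following sketch; the score ranges, thresholds, and bit-widths are assumed for illustration:

```python
def page_quant_bits(importance, system_load):
    """Pick a quantization level for one KV cache page (assumed policy).

    importance  -- runtime importance score of the page, in [0, 1]
    system_load -- fraction of KV cache memory currently in use, in [0, 1]
    """
    levels = [8, 4, 2]  # candidate bit-widths, highest fidelity first (assumed)
    tier = 0
    if importance < 0.5:   # low-importance pages are quantized harder
        tier += 1
    if system_load > 0.8:  # memory pressure pushes every page down a level
        tier += 1
    return levels[min(tier, len(levels) - 1)]

print(page_quant_bits(importance=0.3, system_load=0.9))  # -> 2
```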
-
Patent number: 12197601
Abstract: Examples described herein relate to offload circuitry comprising one or more compute engines that are configurable to perform a workload offloaded from a process executed by a processor based on a descriptor particular to the workload. In some examples, the offload circuitry is configurable to perform the workload, among multiple different workloads. In some examples, the multiple different workloads include one or more of: data transformation (DT) for data format conversion, Locality Sensitive Hashing (LSH) for neural network (NN), similarity search, sparse general matrix-matrix multiplication (SpGEMM) acceleration of hash based sparse matrix multiplication, data encode, data decode, or embedding lookup.
Type: Grant
Filed: December 22, 2021
Date of Patent: January 14, 2025
Assignee: Intel Corporation
Inventors: Ren Wang, Sameh Gobriel, Somnath Paul, Yipeng Wang, Priya Autee, Abhirupa Layek, Shaman Narayana, Edwin Verplanke, Mrittika Ganguli, Jr-Shian Tsai, Anton Sorokin, Suvadeep Banerjee, Abhijit Davare, Desmond Kirkpatrick, Rajesh M. Sankaran, Jaykant B. Timbadiya, Sriram Kabisthalam Muthukumar, Narayan Ranganathan, Nalini Murari, Brinda Ganesh, Nilesh Jain
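The descriptor-driven dispatch this abstract describes (one descriptor per workload kind, routed to a matching compute engine) is easy to picture in software. A minimal sketch, with the descriptor fields and the engine-table shape assumed:

```python
from dataclasses import dataclass
from enum import Enum, auto

class OffloadOp(Enum):  # workload kinds named in the abstract
    DATA_TRANSFORM = auto()
    LSH = auto()
    SIMILARITY_SEARCH = auto()
    SPGEMM = auto()
    ENCODE = auto()
    DECODE = auto()
    EMBEDDING_LOOKUP = auto()

@dataclass
class Descriptor:  # per-workload descriptor; fields are assumptions
    op: OffloadOp
    src_addr: int
    dst_addr: int
    length: int

def submit(engine_table, desc: Descriptor):
    # The offload circuitry selects a compute engine based on the descriptor.
    return engine_table[desc.op](desc)

# Usage: register one handler per op, then submit descriptors to it.
engines = {OffloadOp.ENCODE: lambda d: f"encoded {d.length} bytes"}
print(submit(engines, Descriptor(OffloadOp.ENCODE, 0x1000, 0x2000, 4096)))
```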
-
Patent number: 11811660
Abstract: Apparatus, methods, and systems for tuple space search-based flow classification using cuckoo hash tables and unmasked packet headers are described herein. A device can communicate with one or more hardware switches. The device can include memory to store hash table entries of a hash table. The device can include processing circuitry to perform a hash lookup in the hash table. The lookup can be based on an unmasked key included in a packet header corresponding to a received data packet. The processing circuitry can retrieve an index pointing to a sub-table, the sub-table including a set of rules for handling the data packet. Other embodiments are also described.
Type: Grant
Filed: August 6, 2021
Date of Patent: November 7, 2023
Assignee: Intel Corporation
Inventors: Ren Wang, Tsung-Yuan C. Tai, Yipeng Wang, Sameh Gobriel
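The lookup path here (hash an unmasked header key, retrieve an index pointing to a sub-table of rules) maps naturally onto a cuckoo-style table with two candidate buckets per key. A minimal sketch, using blake2b as a stand-in hash and omitting cuckoo displacement on insert:

```python
import hashlib

def _h(key: bytes, seed: int, nbuckets: int) -> int:
    salt = seed.to_bytes(16, "little")  # blake2b accepts up to 16 salt bytes
    digest = hashlib.blake2b(key, salt=salt).digest()
    return int.from_bytes(digest[:8], "little") % nbuckets

class CuckooIndex:
    """Maps an unmasked header key to the index of a rule sub-table."""
    def __init__(self, nbuckets=1024):
        self.nbuckets = nbuckets
        self.buckets = [None] * nbuckets  # (key, subtable_index) pairs

    def insert(self, key: bytes, subtable_idx: int) -> bool:
        # Simplified: take the first free candidate bucket. Real cuckoo hashing
        # would displace an occupant into its alternate bucket on collision.
        for seed in (1, 2):
            b = _h(key, seed, self.nbuckets)
            if self.buckets[b] is None:
                self.buckets[b] = (key, subtable_idx)
                return True
        return False

    def lookup(self, key: bytes):
        for seed in (1, 2):  # each key has exactly two possible homes
            slot = self.buckets[_h(key, seed, self.nbuckets)]
            if slot and slot[0] == key:
                return slot[1]  # index of the sub-table holding the rules
        return None
```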
-
Publication number: 20230082780
Abstract: Examples described herein include a device interface; a first set of one or more processing units; and a second set of one or more processing units. In some examples, the first set of one or more processing units are to perform heavy flow detection for packets of a flow and the second set of one or more processing units are to perform processing of packets of a heavy flow. In some examples, the first set of one or more processing units and second set of one or more processing units are different. In some examples, the first set of one or more processing units is to allocate pointers to packets associated with the heavy flow to a first set of one or more queues of a load balancer and the load balancer is to allocate the packets associated with the heavy flow to one or more processing units of the second set of one or more processing units based, at least in part, on a packet receive rate of the packets associated with the heavy flow.
Type: Application
Filed: September 10, 2021
Publication date: March 16, 2023
Inventors: Chenmin Sun, Yipeng Wang, Rahul R. Shah, Ren Wang, Sameh Gobriel, Hongjun Ni, Mrittika Ganguli, Edwin Verplanke
-
Patent number: 11392298
Abstract: Examples may include techniques to control an insertion ratio or rate for a cache. Examples include comparing cache miss ratios for different time intervals or windows for a cache to determine whether to adjust a cache insertion ratio that is based on a ratio of cache misses to cache insertions.
Type: Grant
Filed: November 16, 2020
Date of Patent: July 19, 2022
Assignee: Intel Corporation
Inventors: Yipeng Wang, Ren Wang, Sameh Gobriel, Tsung-Yuan C. Tai
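The windowed feedback loop this abstract describes (compare miss ratios across time windows, then adjust how often misses trigger insertions) can be sketched as follows. The doubling/halving policy and bounds are assumptions, not the patented tuning rule:

```python
import random

class AdaptiveInserter:
    """Tunes the fraction of cache misses that trigger an insertion (assumed policy)."""
    def __init__(self, ratio=1.0):
        self.ratio = ratio  # insertions per miss, in (0, 1]
        self.prev_miss_ratio = None

    def end_of_window(self, misses, lookups):
        miss_ratio = misses / max(lookups, 1)
        if self.prev_miss_ratio is not None:
            if miss_ratio > self.prev_miss_ratio:
                # Miss ratio worsened: insert more aggressively.
                self.ratio = min(1.0, self.ratio * 2)
            else:
                # Miss ratio improved: back off to reduce insertion overhead.
                self.ratio = max(0.01, self.ratio / 2)
        self.prev_miss_ratio = miss_ratio

    def should_insert(self):
        return random.random() < self.ratio
```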
-
Publication number: 20220114270
Abstract: Examples described herein relate to offload circuitry comprising one or more compute engines that are configurable to perform a workload offloaded from a process executed by a processor based on a descriptor particular to the workload. In some examples, the offload circuitry is configurable to perform the workload, among multiple different workloads. In some examples, the multiple different workloads include one or more of: data transformation (DT) for data format conversion, Locality Sensitive Hashing (LSH) for neural network (NN), similarity search, sparse general matrix-matrix multiplication (SpGEMM) acceleration of hash based sparse matrix multiplication, data encode, data decode, or embedding lookup.
Type: Application
Filed: December 22, 2021
Publication date: April 14, 2022
Inventors: Ren Wang, Sameh Gobriel, Somnath Paul, Yipeng Wang, Priya Autee, Abhirupa Layek, Shaman Narayana, Edwin Verplanke, Mrittika Ganguli, Jr-Shian Tsai, Anton Sorokin, Suvadeep Banerjee, Abhijit Davare, Desmond Kirkpatrick
-
Patent number: 11201940
Abstract: Technologies for flow rule aware exact match cache compression include multiple computing devices in communication over a network. A computing device reads a network packet from a network port and extracts one or more key fields from the packet to generate a lookup key. The key fields are identified by a key field specification of an exact match flow cache. The computing device may dynamically configure the key field specification based on an active flow rule set. The computing device may compress the key field specification to match a union of non-wildcard fields of the active flow rule set. The computing device may expand the key field specification in response to insertion of a new flow rule. The computing device looks up the lookup key in the exact match flow cache and, if a match is found, applies the corresponding action. Other embodiments are described and claimed.
Type: Grant
Filed: January 4, 2018
Date of Patent: December 14, 2021
Assignee: Intel Corporation
Inventors: Yipeng Wang, Ren Wang, Antonio Fischetti, Sameh Gobriel, Tsung-Yuan C. Tai
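The compression rule stated in this abstract (the key field specification is the union of non-wildcard fields across the active rules) is concrete enough to sketch directly. Rule representation and field names below are assumptions:

```python
def key_field_spec(active_rules):
    # Compress the spec to the union of non-wildcard fields in the active rule set.
    spec = set()
    for rule in active_rules:  # rule: {field_name: match_value or None}
        spec |= {f for f, v in rule.items() if v is not None}  # None = wildcard
    return sorted(spec)

def lookup_key(packet, spec):
    # Extract only the fields named by the (compressed) key field specification.
    return tuple(packet[f] for f in spec)

# Example: a field that every rule wildcards never enters the lookup key.
rules = [{"ip_dst": "10.0.0.1", "tcp_dport": None, "vlan": None},
         {"ip_dst": "10.0.0.2", "tcp_dport": 443, "vlan": None}]
spec = key_field_spec(rules)                    # ['ip_dst', 'tcp_dport']
pkt = {"ip_dst": "10.0.0.2", "tcp_dport": 443, "vlan": 12}
print(lookup_key(pkt, spec))                    # ('10.0.0.2', 443)
```

Inserting a new rule that matches on `vlan` would expand the specification, as the abstract notes.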
-
Publication number: 20210367887
Abstract: Apparatus, methods, and systems for tuple space search-based flow classification using cuckoo hash tables and unmasked packet headers are described herein. A device can communicate with one or more hardware switches. The device can include memory to store hash table entries of a hash table. The device can include processing circuitry to perform a hash lookup in the hash table. The lookup can be based on an unmasked key included in a packet header corresponding to a received data packet. The processing circuitry can retrieve an index pointing to a sub-table, the sub-table including a set of rules for handling the data packet. Other embodiments are also described.
Type: Application
Filed: August 6, 2021
Publication date: November 25, 2021
Inventors: Ren Wang, Tsung-Yuan C. Tai, Yipeng Wang, Sameh Gobriel
-
Patent number: 11088951
Abstract: Apparatus, methods, and systems for tuple space search-based flow classification using cuckoo hash tables and unmasked packet headers are described herein. A device can communicate with one or more hardware switches. The device can include memory to store hash table entries of a hash table. The device can include processing circuitry to perform a hash lookup in the hash table. The lookup can be based on an unmasked key included in a packet header corresponding to a received data packet. The processing circuitry can retrieve an index pointing to a sub-table, the sub-table including a set of rules for handling the data packet. Other embodiments are also described.
Type: Grant
Filed: June 29, 2017
Date of Patent: August 10, 2021
Assignee: Intel Corporation
Inventors: Ren Wang, Tsung-Yuan C. Tai, Yipeng Wang, Sameh Gobriel
-
Patent number: 11005884
Abstract: A computing apparatus for providing a node within a distributed network function, including: a hardware platform; a network interface to communicatively couple to at least one other peer node of the distributed network function; a distributor function including logic to operate on the hardware platform, including a hashing module configured to receive an incoming network packet via the network interface and perform on the incoming network packet a first-level hash of a two-level hash, the first-level hash being a lightweight hash with respect to a second-level hash, the first-level hash to deterministically direct a packet to one of the nodes of the distributed network function as a directed packet; and a denial of service (DoS) mitigation engine to receive notification of a DoS attack, identify a DoS packet via the first-level hash, and prevent the DoS packet from reaching the second-level hash.
Type: Grant
Filed: September 29, 2017
Date of Patent: May 11, 2021
Assignee: Intel Corporation
Inventors: Sameh Gobriel, Christian Maciocco, Byron Marohn, Ren Wang, Tsung-Yuan C. Tai
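The two-level structure here (a cheap first-level hash that both steers packets to nodes and fingerprints DoS traffic before the expensive second level runs) can be sketched as follows. CRC32 as the lightweight hash and the 16-byte header slice are stand-ins chosen for illustration:

```python
import zlib

class Distributor:
    """First-level stage of a two-level hash: steer packets, filter DoS early."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.blocked = set()  # first-level fingerprints flagged as DoS traffic

    def first_level(self, pkt: bytes) -> int:
        # Lightweight relative to the second-level hash: CRC over header bytes.
        return zlib.crc32(pkt[:16])

    def direct(self, pkt: bytes):
        h = self.first_level(pkt)
        if h in self.blocked:
            return None  # dropped before any second-level work is spent
        return self.nodes[h % len(self.nodes)]  # deterministic node choice

    def on_dos_alert(self, attack_pkt: bytes):
        # Identify the attack by its cheap first-level fingerprint and block it.
        self.blocked.add(self.first_level(attack_pkt))
```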
-
Publication number: 20210110269
Abstract: Neural network dense layer sparsification and matrix compression is disclosed. An example of an apparatus includes one or more processors; a memory to store data for processing, including data for processing of a deep neural network (DNN) including one or more layers, each layer including a plurality of neurons, the one or more processors to perform one or both of sparsification of one or more layers of the DNN, including selecting a subset of the plurality of neurons of a first layer of the DNN for activation based at least in part on locality sensitive hashing of inputs to the first layer; or compression of a weight or activation matrix of one or more layers of the DNN, including detection of sparsity patterns in a matrix of the first layer of the DNN based at least in part on locality sensitive hashing of patterns in the matrix.
Type: Application
Filed: December 21, 2020
Publication date: April 15, 2021
Applicant: Intel Corporation
Inventors: Sameh Gobriel, Jesmin Jahari Tithi, Tsung-Yuan Tai
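The sparsification half of this abstract (use locality sensitive hashing of a layer's input to pick which neurons to activate) can be illustrated with random-hyperplane LSH, one common LSH family; the abstract does not specify the family, so treat this as an assumed instantiation:

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_sign(x, planes):
    # Random-hyperplane LSH: one bit per hyperplane, packed into an int.
    bits = (planes @ x > 0).astype(int)
    return int("".join(map(str, bits)), 2)

def active_neurons(x, W, planes):
    # Activate only neurons whose weight rows hash to the same bucket as
    # the input -- i.e., those likely to produce large dot products.
    target = lsh_sign(x, planes)
    return [i for i, w in enumerate(W) if lsh_sign(w, planes) == target]

d, n, nbits = 16, 64, 4
planes = rng.standard_normal((nbits, d))  # shared LSH hyperplanes
W = rng.standard_normal((n, d))           # dense layer weights [neurons, inputs]
x = rng.standard_normal(d)                # one input to the layer
idx = active_neurons(x, W, planes)        # the subset actually evaluated
```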
-
Publication number: 20210089216
Abstract: Examples may include techniques to control an insertion ratio or rate for a cache. Examples include comparing cache miss ratios for different time intervals or windows for a cache to determine whether to adjust a cache insertion ratio that is based on a ratio of cache misses to cache insertions.
Type: Application
Filed: November 16, 2020
Publication date: March 25, 2021
Inventors: Yipeng Wang, Ren Wang, Sameh Gobriel, Tsung-Yuan C. Tai
-
Patent number: 10938712
Abstract: Apparatus and method to facilitate networked compute node cluster routing are disclosed herein. In some embodiments, a compute node for cluster compute may include one or more input ports to receive data packets from first selected ones of a cluster of compute nodes; one or more output ports to route data packets to second selected ones of the cluster of compute nodes; and one or more processors, wherein the one or more processors includes logic to determine a particular output port, of the one or more output ports, to which a data packet received at the one or more input ports is to be routed, and wherein the logic is to exclude output ports associated with links indicated in fault status information as having a fault status from being the particular output port to which the data packet is to be routed.
Type: Grant
Filed: February 15, 2017
Date of Patent: March 2, 2021
Assignee: Intel Corporation
Inventors: Ken Schumm, Sameh Gobriel, Asif H. Haswarey, Tsung-Yuan Charlie Tai
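The routing rule in this abstract reduces to: never select an output port whose link is marked faulty. A minimal sketch, with the port-selection tiebreak being an assumption:

```python
def select_output_port(dest_id, ports, fault_status):
    """Choose an egress port for dest_id, skipping ports with faulty links.

    ports        -- candidate output port ids, in preference order
    fault_status -- dict mapping port id -> True if its link has a fault
    """
    healthy = [p for p in ports if not fault_status.get(p, False)]
    if not healthy:
        raise RuntimeError("no healthy route to destination")
    return healthy[dest_id % len(healthy)]  # simple deterministic spread (assumed)

print(select_output_port(7, ports=[0, 1, 2, 3], fault_status={1: True}))  # -> 2
```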
-
Patent number: 10845995
Abstract: Examples may include techniques to control an insertion ratio or rate for a cache. Examples include comparing cache miss ratios for different time intervals or windows for a cache to determine whether to adjust a cache insertion ratio that is based on a ratio of cache misses to cache insertions.
Type: Grant
Filed: June 30, 2017
Date of Patent: November 24, 2020
Assignee: Intel Corporation
Inventors: Yipeng Wang, Ren Wang, Sameh Gobriel, Tsung-Yuan Charlie Tai
-
Patent number: 10719442
Abstract: An apparatus and method for prioritizing transactional memory regions. For example, one embodiment of a processor comprises: a plurality of cores to execute threads comprising sequences of instructions, at least some of the instructions specifying a transactional memory region; a cache of each core to store a plurality of cache lines; transactional memory circuitry of each core to manage execution of the transactional memory (TM) regions based on priorities associated with each of the TM regions; and wherein the transactional memory circuitry, upon detecting a conflict between a first TM region having a first priority value and a second TM region having a second priority value, is to determine which of the first TM region or the second TM region is permitted to continue executing and which is to be aborted based, at least in part, on the first and second priority values.
Type: Grant
Filed: September 10, 2018
Date of Patent: July 21, 2020
Assignee: Intel Corporation
Inventors: Ren Wang, Raanan Sade, Yipeng Wang, Tsung-Yuan Tai, Sameh Gobriel
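The conflict rule in this abstract (on a conflict between two TM regions, the priority values decide which continues and which aborts) is small enough to model directly. The tie-breaking convention below is an assumption:

```python
from dataclasses import dataclass

@dataclass
class TMRegion:
    region_id: int
    priority: int  # higher value wins a conflict (assumed convention)

def resolve_conflict(a: TMRegion, b: TMRegion):
    """Return (continuing_region, aborted_region) for a detected conflict."""
    if a.priority >= b.priority:  # ties favor the first region (assumed)
        return a, b
    return b, a

winner, loser = resolve_conflict(TMRegion(1, priority=5), TMRegion(2, priority=9))
print(f"continue region {winner.region_id}, abort region {loser.region_id}")
```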
-
Patent number: 10623311
Abstract: Technologies for distributed table lookup via a distributed router include an ingress computing node, an intermediate computing node, and an egress computing node. Each computing node of the distributed router includes a forwarding table to store a different set of network routing entries obtained from a routing table of the distributed router. The ingress computing node generates a hash key based on the destination address included in a received network packet. The hash key identifies the intermediate computing node of the distributed router that stores the forwarding table that includes a network routing entry corresponding to the destination address. The ingress computing node forwards the received network packet to the intermediate computing node for routing. The intermediate computing node receives the forwarded network packet, determines a destination address of the network packet, and determines the egress computing node for transmission of the network packet from the distributed router.
Type: Grant
Filed: September 27, 2017
Date of Patent: April 14, 2020
Assignee: Intel Corporation
Inventors: Sameh Gobriel, Ren Wang, Christian Maciocco, Tsung-Yuan Tai
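The key step at ingress (hash the destination address to find which node's forwarding-table shard covers it) is a consistent partitioning of the routing table across nodes. A minimal sketch, with the hash choice and node naming assumed:

```python
import hashlib

def owner_node(dest_addr: str, nodes: list) -> str:
    # The hash key identifies the intermediate node whose forwarding-table
    # shard contains the routing entry for dest_addr.
    h = int(hashlib.sha256(dest_addr.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

nodes = ["node-a", "node-b", "node-c"]            # hypothetical cluster members
intermediate = owner_node("192.0.2.77", nodes)    # ingress forwards the packet here
print(intermediate)
```

The intermediate node then does the actual forwarding-table lookup and hands the packet to the egress node, per the abstract.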
-
Publication number: 20200081835
Abstract: An apparatus and method for prioritizing transactional memory regions. For example, one embodiment of a processor comprises: a plurality of cores to execute threads comprising sequences of instructions, at least some of the instructions specifying a transactional memory region; a cache of each core to store a plurality of cache lines; transactional memory circuitry of each core to manage execution of the transactional memory (TM) regions based on priorities associated with each of the TM regions; and wherein the transactional memory circuitry, upon detecting a conflict between a first TM region having a first priority value and a second TM region having a second priority value, is to determine which of the first TM region or the second TM region is permitted to continue executing and which is to be aborted based, at least in part, on the first and second priority values.
Type: Application
Filed: September 10, 2018
Publication date: March 12, 2020
Inventors: Ren Wang, Raanan Sade, Yipeng Wang, Tsung-Yuan Tai, Sameh Gobriel