Patents by Inventor Sameh Gobriel
Sameh Gobriel has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 12293231
Abstract: Examples described herein include a device interface; a first set of one or more processing units; and a second set of one or more processing units. In some examples, the first set of one or more processing units are to perform heavy flow detection for packets of a flow and the second set of one or more processing units are to perform processing of packets of a heavy flow. In some examples, the first set of one or more processing units and second set of one or more processing units are different. In some examples, the first set of one or more processing units is to allocate pointers to packets associated with the heavy flow to a first set of one or more queues of a load balancer and the load balancer is to allocate the packets associated with the heavy flow to one or more processing units of the second set of one or more processing units based, at least in part, on a packet receive rate of the packets associated with the heavy flow.
Type: Grant
Filed: September 10, 2021
Date of Patent: May 6, 2025
Assignee: Intel Corporation
Inventors: Chenmin Sun, Yipeng Wang, Rahul R. Shah, Ren Wang, Sameh Gobriel, Hongjun Ni, Mrittika Ganguli, Edwin Verplanke
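The two-stage design in this abstract (a light first stage that detects heavy flows and hands their packet pointers to load-balancer queues, and a second stage that does the actual packet processing) can be illustrated with a minimal Python sketch. The threshold, the rate-based worker count, and all names below are illustrative assumptions, not the patented mechanism:

```python
# Minimal sketch of two-stage heavy-flow handling (all names hypothetical).
from collections import defaultdict, deque

HEAVY_THRESHOLD = 1000  # packets per interval that mark a flow "heavy" (assumed)

class HeavyFlowDetector:
    """First set of processing units: counts packets per flow id."""
    def __init__(self):
        self.counts = defaultdict(int)
        self.queues = defaultdict(deque)  # per-heavy-flow queues of packet pointers

    def on_packet(self, flow_id, pkt_ptr):
        self.counts[flow_id] += 1
        if self.counts[flow_id] >= HEAVY_THRESHOLD:
            self.queues[flow_id].append(pkt_ptr)  # hand off to the load balancer
            return True
        return False  # light flow: process in the first stage

class LoadBalancer:
    """Spreads a heavy flow over second-set workers based on its receive rate."""
    def __init__(self, num_workers):
        self.num_workers = num_workers

    def workers_for(self, flow_rate_pps, worker_capacity_pps=500_000):
        # Faster flows get more workers, capped at the pool size (assumed policy).
        return min(self.num_workers, max(1, flow_rate_pps // worker_capacity_pps + 1))
```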
-
Publication number: 20250124105
Abstract: Key-value (KV) caching accelerates inference in large language models (LLMs) by allowing the attention operation to scale linearly rather than quadratically with the total sequence length. Due to large context lengths in modern LLMs, KV cache size can exceed the model size, which can negatively impact throughput. To address this issue, KVCrush, which stands for KEY-VALUE CACHE SIZE REDUCTION USING SIMILARITY IN HEAD-BEHAVIOR, is implemented. KVCrush involves using binary vectors to represent tokens, where the vector indicates which attention heads attend to the token and which attention heads disregard the token. The binary vectors are used in a hardware-efficient, low-overhead process to produce representatives for unimportant tokens to be pruned, without having to implement k-means clustering techniques.
Type: Application
Filed: December 26, 2024
Publication date: April 17, 2025
Applicant: Intel Corporation
Inventors: Gopi Krishna Jha, Sameh Gobriel, Nilesh Jain
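The core idea in this abstract (a binary head-behavior vector per token, grouped by bit pattern to pick representatives without k-means) can be sketched in a few lines of Python. The attention threshold and function names are assumptions for illustration:

```python
import numpy as np

def head_behavior_vectors(attn, tau=0.01):
    # attn: [heads, tokens] attention mass each head assigns to each token.
    # Bit = 1 if the head attends to the token, 0 if it disregards it (assumed cutoff).
    return (attn > tau).astype(np.uint8)

def crush(vectors, unimportant):
    # Group unimportant tokens by identical head-behavior bit patterns and keep
    # one representative index per pattern -- no k-means clustering needed.
    reps = {}
    for t in unimportant:
        key = vectors[:, t].tobytes()
        reps.setdefault(key, t)
    return list(reps.values())

attn = np.random.rand(8, 32)                     # toy: 8 heads, 32 tokens
vecs = head_behavior_vectors(attn)
keep = crush(vecs, unimportant=range(16, 32))    # representatives for the pruned tail
```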
-
Publication number: 20250094712
Abstract: Key-value (KV) caching accelerates inference in large language models (LLMs) by allowing the attention operation to scale linearly rather than quadratically with the total sequence length. Due to large context lengths in modern LLMs, KV cache size can exceed the model size, which can negatively impact throughput. To address this issue, a multi-granular clustering-based solution for KV cache compression can be implemented. Key tensors and value tensors corresponding to unimportant tokens can be approximated using clusters created at different clustering levels with varying accuracy. Accuracy loss can be mitigated by using proxies produced at a finer-granularity clustering level for a subset of attention heads that are more significant. More significant attention heads can have a higher impact on model accuracy than less significant attention heads. Latency is improved by retrieving proxies from a faster memory for a subset of attention heads that are less significant, when the impact on accuracy is lower.
Type: Application
Filed: December 2, 2024
Publication date: March 20, 2025
Applicant: Intel Corporation
Inventors: Gopi Krishna Jha, Sameh Gobriel, Nilesh Jain
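A rough Python sketch of the multi-granular idea: significant heads get fine-grained cluster proxies, less significant heads get coarse ones. The bucketing below is a toy stand-in for a real clustering step, and the cluster counts are assumptions:

```python
import numpy as np

def cluster_proxies(kv, n_clusters):
    # Toy stand-in for clustering: bucket rows round-robin and average per bucket.
    idx = np.arange(len(kv)) % n_clusters
    centroids = np.stack([kv[idx == c].mean(axis=0) for c in range(n_clusters)])
    return centroids[idx]  # each row replaced by its cluster's proxy

def compress_head(kv_unimportant, head_is_significant):
    # Significant heads: finer granularity (more clusters, higher accuracy).
    # Less significant heads: coarse proxies, which could live in faster memory.
    n = 64 if head_is_significant else 8  # assumed cluster counts
    return cluster_proxies(kv_unimportant, min(n, len(kv_unimportant)))

kv = np.random.rand(128, 64)              # toy KV rows for one head's pruned tokens
approx = compress_head(kv, head_is_significant=False)
```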
-
Publication number: 20250061316
Abstract: Key-value (KV) cache paging schemes can improve memory management for KV caches by storing a KV cache page having key tensors and value tensors for a fixed number of tokens in a fixed-sized block in the KV cache of a worker. To further improve memory management, the schemes can be modified to implement dynamic variable quantization. The quantization level of a KV cache page can be set based on a runtime importance score of the KV cache page. In addition, the quantization level of the KV cache page can be set based on the system load. The end result is a scheme that can achieve a high compression ratio of KV cache pages in the KV cache. Fitting more KV cache pages in the KV cache can lead to higher inference throughput, higher system-level user capacity, and higher end-to-end service availability.
Type: Application
Filed: November 1, 2024
Publication date: February 20, 2025
Applicant: Intel Corporation
Inventors: Sameh Gobriel, Nilesh Jain, Vui Seng Chua, Juan Pablo Munoz, Gopi Krishna Jha
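The per-page policy described here (quantization level driven by a runtime importance score and by system load) might look like the following sketch; the score ranges, thresholds, and bit-widths are assumed for illustration:

```python
def page_quant_bits(importance, system_load):
    """Pick a quantization level for one KV cache page (assumed policy).

    importance  -- runtime importance score of the page, in [0, 1]
    system_load -- fraction of KV cache memory currently in use, in [0, 1]
    """
    levels = [8, 4, 2]  # candidate bit-widths, highest fidelity first (assumed)
    tier = 0
    if importance < 0.5:   # low-importance pages are quantized harder
        tier += 1
    if system_load > 0.8:  # memory pressure pushes every page down a level
        tier += 1
    return levels[min(tier, len(levels) - 1)]

print(page_quant_bits(importance=0.3, system_load=0.9))  # -> 2
```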
-
Patent number: 12197601
Abstract: Examples described herein relate to offload circuitry comprising one or more compute engines that are configurable to perform a workload offloaded from a process executed by a processor based on a descriptor particular to the workload. In some examples, the offload circuitry is configurable to perform the workload, among multiple different workloads. In some examples, the multiple different workloads include one or more of: data transformation (DT) for data format conversion, Locality Sensitive Hashing (LSH) for neural network (NN), similarity search, sparse general matrix-matrix multiplication (SpGEMM) acceleration of hash based sparse matrix multiplication, data encode, data decode, or embedding lookup.
Type: Grant
Filed: December 22, 2021
Date of Patent: January 14, 2025
Assignee: Intel Corporation
Inventors: Ren Wang, Sameh Gobriel, Somnath Paul, Yipeng Wang, Priya Autee, Abhirupa Layek, Shaman Narayana, Edwin Verplanke, Mrittika Ganguli, Jr-Shian Tsai, Anton Sorokin, Suvadeep Banerjee, Abhijit Davare, Desmond Kirkpatrick, Rajesh M. Sankaran, Jaykant B. Timbadiya, Sriram Kabisthalam Muthukumar, Narayan Ranganathan, Nalini Murari, Brinda Ganesh, Nilesh Jain
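The descriptor-driven dispatch this abstract describes (one descriptor per workload kind, routed to a matching compute engine) is easy to picture in software. A minimal sketch, with the descriptor fields and the engine-table shape assumed:

```python
from dataclasses import dataclass
from enum import Enum, auto

class OffloadOp(Enum):  # workload kinds named in the abstract
    DATA_TRANSFORM = auto()
    LSH = auto()
    SIMILARITY_SEARCH = auto()
    SPGEMM = auto()
    ENCODE = auto()
    DECODE = auto()
    EMBEDDING_LOOKUP = auto()

@dataclass
class Descriptor:  # per-workload descriptor; fields are assumptions
    op: OffloadOp
    src_addr: int
    dst_addr: int
    length: int

def submit(engine_table, desc: Descriptor):
    # The offload circuitry selects a compute engine based on the descriptor.
    return engine_table[desc.op](desc)

# Usage: register one handler per op, then submit descriptors to it.
engines = {OffloadOp.ENCODE: lambda d: f"encoded {d.length} bytes"}
print(submit(engines, Descriptor(OffloadOp.ENCODE, 0x1000, 0x2000, 4096)))
```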
-
Patent number: 11811660
Abstract: Apparatus, methods, and systems for tuple space search-based flow classification using cuckoo hash tables and unmasked packet headers are described herein. A device can communicate with one or more hardware switches. The device can include memory to store hash table entries of a hash table. The device can include processing circuitry to perform a hash lookup in the hash table. The lookup can be based on an unmasked key included in a packet header corresponding to a received data packet. The processing circuitry can retrieve an index pointing to a sub-table, the sub-table including a set of rules for handling the data packet. Other embodiments are also described.
Type: Grant
Filed: August 6, 2021
Date of Patent: November 7, 2023
Assignee: Intel Corporation
Inventors: Ren Wang, Tsung-Yuan C. Tai, Yipeng Wang, Sameh Gobriel
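The lookup path here (hash an unmasked header key, retrieve an index pointing to a sub-table of rules) maps naturally onto a cuckoo-style table with two candidate buckets per key. A minimal sketch, using blake2b as a stand-in hash and omitting cuckoo displacement on insert:

```python
import hashlib

def _h(key: bytes, seed: int, nbuckets: int) -> int:
    salt = seed.to_bytes(16, "little")  # blake2b accepts up to 16 salt bytes
    digest = hashlib.blake2b(key, salt=salt).digest()
    return int.from_bytes(digest[:8], "little") % nbuckets

class CuckooIndex:
    """Maps an unmasked header key to the index of a rule sub-table."""
    def __init__(self, nbuckets=1024):
        self.nbuckets = nbuckets
        self.buckets = [None] * nbuckets  # (key, subtable_index) pairs

    def insert(self, key: bytes, subtable_idx: int) -> bool:
        # Simplified: take the first free candidate bucket. Real cuckoo hashing
        # would displace an occupant into its alternate bucket on collision.
        for seed in (1, 2):
            b = _h(key, seed, self.nbuckets)
            if self.buckets[b] is None:
                self.buckets[b] = (key, subtable_idx)
                return True
        return False

    def lookup(self, key: bytes):
        for seed in (1, 2):  # each key has exactly two possible homes
            slot = self.buckets[_h(key, seed, self.nbuckets)]
            if slot and slot[0] == key:
                return slot[1]  # index of the sub-table holding the rules
        return None
```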
-
Publication number: 20230082780
Abstract: Examples described herein include a device interface; a first set of one or more processing units; and a second set of one or more processing units. In some examples, the first set of one or more processing units are to perform heavy flow detection for packets of a flow and the second set of one or more processing units are to perform processing of packets of a heavy flow. In some examples, the first set of one or more processing units and second set of one or more processing units are different. In some examples, the first set of one or more processing units is to allocate pointers to packets associated with the heavy flow to a first set of one or more queues of a load balancer and the load balancer is to allocate the packets associated with the heavy flow to one or more processing units of the second set of one or more processing units based, at least in part, on a packet receive rate of the packets associated with the heavy flow.
Type: Application
Filed: September 10, 2021
Publication date: March 16, 2023
Inventors: Chenmin Sun, Yipeng Wang, Rahul R. Shah, Ren Wang, Sameh Gobriel, Hongjun Ni, Mrittika Ganguli, Edwin Verplanke
-
Patent number: 11392298
Abstract: Examples may include techniques to control an insertion ratio or rate for a cache. Examples include comparing cache miss ratios for different time intervals or windows for a cache to determine whether to adjust a cache insertion ratio that is based on a ratio of cache misses to cache insertions.
Type: Grant
Filed: November 16, 2020
Date of Patent: July 19, 2022
Assignee: Intel Corporation
Inventors: Yipeng Wang, Ren Wang, Sameh Gobriel, Tsung-Yuan C. Tai
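The windowed feedback loop this abstract describes (compare miss ratios across time windows, then adjust how often misses trigger insertions) can be sketched as follows. The doubling/halving policy and bounds are assumptions, not the patented tuning rule:

```python
import random

class AdaptiveInserter:
    """Tunes the fraction of cache misses that trigger an insertion (assumed policy)."""
    def __init__(self, ratio=1.0):
        self.ratio = ratio  # insertions per miss, in (0, 1]
        self.prev_miss_ratio = None

    def end_of_window(self, misses, lookups):
        miss_ratio = misses / max(lookups, 1)
        if self.prev_miss_ratio is not None:
            if miss_ratio > self.prev_miss_ratio:
                # Miss ratio worsened: insert more aggressively.
                self.ratio = min(1.0, self.ratio * 2)
            else:
                # Miss ratio improved: back off to reduce insertion overhead.
                self.ratio = max(0.01, self.ratio / 2)
        self.prev_miss_ratio = miss_ratio

    def should_insert(self):
        return random.random() < self.ratio
```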
-
Publication number: 20220114270
Abstract: Examples described herein relate to offload circuitry comprising one or more compute engines that are configurable to perform a workload offloaded from a process executed by a processor based on a descriptor particular to the workload. In some examples, the offload circuitry is configurable to perform the workload, among multiple different workloads. In some examples, the multiple different workloads include one or more of: data transformation (DT) for data format conversion, Locality Sensitive Hashing (LSH) for neural network (NN), similarity search, sparse general matrix-matrix multiplication (SpGEMM) acceleration of hash based sparse matrix multiplication, data encode, data decode, or embedding lookup.
Type: Application
Filed: December 22, 2021
Publication date: April 14, 2022
Inventors: Ren Wang, Sameh Gobriel, Somnath Paul, Yipeng Wang, Priya Autee, Abhirupa Layek, Shaman Narayana, Edwin Verplanke, Mrittika Ganguli, Jr-Shian Tsai, Anton Sorokin, Suvadeep Banerjee, Abhijit Davare, Desmond Kirkpatrick
-
Patent number: 11201940
Abstract: Technologies for flow rule aware exact match cache compression include multiple computing devices in communication over a network. A computing device reads a network packet from a network port and extracts one or more key fields from the packet to generate a lookup key. The key fields are identified by a key field specification of an exact match flow cache. The computing device may dynamically configure the key field specification based on an active flow rule set. The computing device may compress the key field specification to match a union of non-wildcard fields of the active flow rule set. The computing device may expand the key field specification in response to insertion of a new flow rule. The computing device looks up the lookup key in the exact match flow cache and, if a match is found, applies the corresponding action. Other embodiments are described and claimed.
Type: Grant
Filed: January 4, 2018
Date of Patent: December 14, 2021
Assignee: Intel Corporation
Inventors: Yipeng Wang, Ren Wang, Antonio Fischetti, Sameh Gobriel, Tsung-Yuan C. Tai
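The compression rule stated in this abstract (the key field specification is the union of non-wildcard fields across the active rules) is concrete enough to sketch directly. Rule representation and field names below are assumptions:

```python
def key_field_spec(active_rules):
    # Compress the spec to the union of non-wildcard fields in the active rule set.
    spec = set()
    for rule in active_rules:  # rule: {field_name: match_value or None}
        spec |= {f for f, v in rule.items() if v is not None}  # None = wildcard
    return sorted(spec)

def lookup_key(packet, spec):
    # Extract only the fields named by the (compressed) key field specification.
    return tuple(packet[f] for f in spec)

# Example: a field that every rule wildcards never enters the lookup key.
rules = [{"ip_dst": "10.0.0.1", "tcp_dport": None, "vlan": None},
         {"ip_dst": "10.0.0.2", "tcp_dport": 443, "vlan": None}]
spec = key_field_spec(rules)                    # ['ip_dst', 'tcp_dport']
pkt = {"ip_dst": "10.0.0.2", "tcp_dport": 443, "vlan": 12}
print(lookup_key(pkt, spec))                    # ('10.0.0.2', 443)
```

Inserting a new rule that matches on `vlan` would expand the specification, as the abstract notes.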
-
Publication number: 20210367887
Abstract: Apparatus, methods, and systems for tuple space search-based flow classification using cuckoo hash tables and unmasked packet headers are described herein. A device can communicate with one or more hardware switches. The device can include memory to store hash table entries of a hash table. The device can include processing circuitry to perform a hash lookup in the hash table. The lookup can be based on an unmasked key included in a packet header corresponding to a received data packet. The processing circuitry can retrieve an index pointing to a sub-table, the sub-table including a set of rules for handling the data packet. Other embodiments are also described.
Type: Application
Filed: August 6, 2021
Publication date: November 25, 2021
Inventors: Ren Wang, Tsung-Yuan C. Tai, Yipeng Wang, Sameh Gobriel
-
Patent number: 11088951
Abstract: Apparatus, methods, and systems for tuple space search-based flow classification using cuckoo hash tables and unmasked packet headers are described herein. A device can communicate with one or more hardware switches. The device can include memory to store hash table entries of a hash table. The device can include processing circuitry to perform a hash lookup in the hash table. The lookup can be based on an unmasked key included in a packet header corresponding to a received data packet. The processing circuitry can retrieve an index pointing to a sub-table, the sub-table including a set of rules for handling the data packet. Other embodiments are also described.
Type: Grant
Filed: June 29, 2017
Date of Patent: August 10, 2021
Assignee: Intel Corporation
Inventors: Ren Wang, Tsung-Yuan C. Tai, Yipeng Wang, Sameh Gobriel
-
Patent number: 11005884
Abstract: A computing apparatus for providing a node within a distributed network function, including: a hardware platform; a network interface to communicatively couple to at least one other peer node of the distributed network function; a distributor function including logic to operate on the hardware platform, including a hashing module configured to receive an incoming network packet via the network interface and perform on the incoming network packet a first-level hash of a two-level hash, the first-level hash being a lightweight hash with respect to a second-level hash, the first-level hash to deterministically direct a packet to one of the nodes of the distributed network function as a directed packet; and a denial of service (DoS) mitigation engine to receive notification of a DoS attack, identify a DoS packet via the first-level hash, and prevent the DoS packet from reaching the second-level hash.
Type: Grant
Filed: September 29, 2017
Date of Patent: May 11, 2021
Assignee: Intel Corporation
Inventors: Sameh Gobriel, Christian Maciocco, Byron Marohn, Ren Wang, Tsung-Yuan C. Tai
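The two-level structure here (a cheap first-level hash that both steers packets to nodes and fingerprints DoS traffic before the expensive second level runs) can be sketched as follows. CRC32 as the lightweight hash and the 16-byte header slice are stand-ins chosen for illustration:

```python
import zlib

class Distributor:
    """First-level stage of a two-level hash: steer packets, filter DoS early."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.blocked = set()  # first-level fingerprints flagged as DoS traffic

    def first_level(self, pkt: bytes) -> int:
        # Lightweight relative to the second-level hash: CRC over header bytes.
        return zlib.crc32(pkt[:16])

    def direct(self, pkt: bytes):
        h = self.first_level(pkt)
        if h in self.blocked:
            return None  # dropped before any second-level work is spent
        return self.nodes[h % len(self.nodes)]  # deterministic node choice

    def on_dos_alert(self, attack_pkt: bytes):
        # Identify the attack by its cheap first-level fingerprint and block it.
        self.blocked.add(self.first_level(attack_pkt))
```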
-
Publication number: 20210110269
Abstract: Neural network dense layer sparsification and matrix compression is disclosed. An example of an apparatus includes one or more processors; a memory to store data for processing, including data for processing of a deep neural network (DNN) including one or more layers, each layer including a plurality of neurons, the one or more processors to perform one or both of sparsification of one or more layers of the DNN, including selecting a subset of the plurality of neurons of a first layer of the DNN for activation based at least in part on locality sensitive hashing of inputs to the first layer; or compression of a weight or activation matrix of one or more layers of the DNN, including detection of sparsity patterns in a matrix of the first layer of the DNN based at least in part on locality sensitive hashing of patterns in the matrix.
Type: Application
Filed: December 21, 2020
Publication date: April 15, 2021
Applicant: Intel Corporation
Inventors: Sameh Gobriel, Jesmin Jahari Tithi, Tsung-Yuan Tai
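The sparsification half of this abstract (use locality sensitive hashing of a layer's input to pick which neurons to activate) can be illustrated with random-hyperplane LSH, one common LSH family; the abstract does not specify the family, so treat this as an assumed instantiation:

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_sign(x, planes):
    # Random-hyperplane LSH: one bit per hyperplane, packed into an int.
    bits = (planes @ x > 0).astype(int)
    return int("".join(map(str, bits)), 2)

def active_neurons(x, W, planes):
    # Activate only neurons whose weight rows hash to the same bucket as
    # the input -- i.e., those likely to produce large dot products.
    target = lsh_sign(x, planes)
    return [i for i, w in enumerate(W) if lsh_sign(w, planes) == target]

d, n, nbits = 16, 64, 4
planes = rng.standard_normal((nbits, d))  # shared LSH hyperplanes
W = rng.standard_normal((n, d))           # dense layer weights [neurons, inputs]
x = rng.standard_normal(d)                # one input to the layer
idx = active_neurons(x, W, planes)        # the subset actually evaluated
```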
-
Publication number: 20210089216
Abstract: Examples may include techniques to control an insertion ratio or rate for a cache. Examples include comparing cache miss ratios for different time intervals or windows for a cache to determine whether to adjust a cache insertion ratio that is based on a ratio of cache misses to cache insertions.
Type: Application
Filed: November 16, 2020
Publication date: March 25, 2021
Inventors: Yipeng Wang, Ren Wang, Sameh Gobriel, Tsung-Yuan C. Tai
-
Patent number: 10938712
Abstract: Apparatus and method to facilitate networked compute node cluster routing are disclosed herein. In some embodiments, a compute node for cluster compute may include one or more input ports to receive data packets from first selected ones of a cluster of compute nodes; one or more output ports to route data packets to second selected ones of the cluster of compute nodes; and one or more processors, wherein the one or more processors includes logic to determine a particular output port, of the one or more output ports, to which a data packet received at the one or more input ports is to be routed, and wherein the logic is to exclude output ports associated with links indicated in fault status information as having a fault status from being the particular output port to which the data packet is to be routed.
Type: Grant
Filed: February 15, 2017
Date of Patent: March 2, 2021
Assignee: Intel Corporation
Inventors: Ken Schumm, Sameh Gobriel, Asif H. Haswarey, Tsung-Yuan Charlie Tai
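The routing rule in this abstract reduces to: never select an output port whose link is marked faulty. A minimal sketch, with the port-selection tiebreak being an assumption:

```python
def select_output_port(dest_id, ports, fault_status):
    """Choose an egress port for dest_id, skipping ports with faulty links.

    ports        -- candidate output port ids, in preference order
    fault_status -- dict mapping port id -> True if its link has a fault
    """
    healthy = [p for p in ports if not fault_status.get(p, False)]
    if not healthy:
        raise RuntimeError("no healthy route to destination")
    return healthy[dest_id % len(healthy)]  # simple deterministic spread (assumed)

print(select_output_port(7, ports=[0, 1, 2, 3], fault_status={1: True}))  # -> 2
```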
-
Patent number: 10845995
Abstract: Examples may include techniques to control an insertion ratio or rate for a cache. Examples include comparing cache miss ratios for different time intervals or windows for a cache to determine whether to adjust a cache insertion ratio that is based on a ratio of cache misses to cache insertions.
Type: Grant
Filed: June 30, 2017
Date of Patent: November 24, 2020
Assignee: Intel Corporation
Inventors: Yipeng Wang, Ren Wang, Sameh Gobriel, Tsung-Yuan Charlie Tai
-
Patent number: 10719442
Abstract: An apparatus and method for prioritizing transactional memory regions. For example, one embodiment of a processor comprises: a plurality of cores to execute threads comprising sequences of instructions, at least some of the instructions specifying a transactional memory region; a cache of each core to store a plurality of cache lines; transactional memory circuitry of each core to manage execution of the transactional memory (TM) regions based on priorities associated with each of the TM regions; and wherein the transactional memory circuitry, upon detecting a conflict between a first TM region having a first priority value and a second TM region having a second priority value, is to determine which of the first TM region or the second TM region is permitted to continue executing and which is to be aborted based, at least in part, on the first and second priority values.
Type: Grant
Filed: September 10, 2018
Date of Patent: July 21, 2020
Assignee: Intel Corporation
Inventors: Ren Wang, Raanan Sade, Yipeng Wang, Tsung-Yuan Tai, Sameh Gobriel
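The conflict rule in this abstract (on a conflict between two TM regions, the priority values decide which continues and which aborts) is small enough to model directly. The tie-breaking convention below is an assumption:

```python
from dataclasses import dataclass

@dataclass
class TMRegion:
    region_id: int
    priority: int  # higher value wins a conflict (assumed convention)

def resolve_conflict(a: TMRegion, b: TMRegion):
    """Return (continuing_region, aborted_region) for a detected conflict."""
    if a.priority >= b.priority:  # ties favor the first region (assumed)
        return a, b
    return b, a

winner, loser = resolve_conflict(TMRegion(1, priority=5), TMRegion(2, priority=9))
print(f"continue region {winner.region_id}, abort region {loser.region_id}")
```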
-
Patent number: 10623311
Abstract: Technologies for distributed table lookup via a distributed router include an ingress computing node, an intermediate computing node, and an egress computing node. Each computing node of the distributed router includes a forwarding table to store a different set of network routing entries obtained from a routing table of the distributed router. The ingress computing node generates a hash key based on the destination address included in a received network packet. The hash key identifies the intermediate computing node of the distributed router that stores the forwarding table that includes a network routing entry corresponding to the destination address. The ingress computing node forwards the received network packet to the intermediate computing node for routing. The intermediate computing node receives the forwarded network packet, determines a destination address of the network packet, and determines the egress computing node for transmission of the network packet from the distributed router.
Type: Grant
Filed: September 27, 2017
Date of Patent: April 14, 2020
Assignee: Intel Corporation
Inventors: Sameh Gobriel, Ren Wang, Christian Maciocco, Tsung-Yuan Tai
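The key step at ingress (hash the destination address to find which node's forwarding-table shard covers it) is a consistent partitioning of the routing table across nodes. A minimal sketch, with the hash choice and node naming assumed:

```python
import hashlib

def owner_node(dest_addr: str, nodes: list) -> str:
    # The hash key identifies the intermediate node whose forwarding-table
    # shard contains the routing entry for dest_addr.
    h = int(hashlib.sha256(dest_addr.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

nodes = ["node-a", "node-b", "node-c"]            # hypothetical cluster members
intermediate = owner_node("192.0.2.77", nodes)    # ingress forwards the packet here
print(intermediate)
```

The intermediate node then does the actual forwarding-table lookup and hands the packet to the egress node, per the abstract.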
-
Publication number: 20200081835
Abstract: An apparatus and method for prioritizing transactional memory regions. For example, one embodiment of a processor comprises: a plurality of cores to execute threads comprising sequences of instructions, at least some of the instructions specifying a transactional memory region; a cache of each core to store a plurality of cache lines; transactional memory circuitry of each core to manage execution of the transactional memory (TM) regions based on priorities associated with each of the TM regions; and wherein the transactional memory circuitry, upon detecting a conflict between a first TM region having a first priority value and a second TM region having a second priority value, is to determine which of the first TM region or the second TM region is permitted to continue executing and which is to be aborted based, at least in part, on the first and second priority values.
Type: Application
Filed: September 10, 2018
Publication date: March 12, 2020
Inventors: Ren Wang, Raanan Sade, Yipeng Wang, Tsung-Yuan Tai, Sameh Gobriel