Patents by Inventor Yajun Ha
Yajun Ha has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 12292946
Abstract: A method for implementing formal verification of an optimized multiplier via symbolic computer algebra (SCA)-satisfiability (SAT) synergy includes: 1) systematically recovering, by a reverse engineering algorithm, an adder tree from an optimized multiplier; 2) generating, by a constraint satisfaction algorithm, a reference multiplier only by using an adder based on a constraint condition; and 3) combining, by an SCA-based and SAT-based verification method, complementary advantages of SCA and SAT. In the verification framework, the method introduces a reference multiplier generator for generating a correct reference multiplier. The correct reference multiplier has both a structure similar to a structure of the optimized multiplier and a clear adder boundary. The clear adder boundary allows proving correctness of the correct reference multiplier through SCA-based verification.
Type: Grant
Filed: December 4, 2024
Date of Patent: May 6, 2025
Assignee: SHANGHAITECH UNIVERSITY
Inventors: Rui Li, Lin Li, Yajun Ha
-
Patent number: 12292888
Abstract: A fast and energy-efficient K-nearest neighbors search accelerator for a large-scale point cloud is provided. A nearest sub-voxel-selection (NSVS) framework that performs search based on a double-segmentation-voxel-structure (DSVS) search structure is constructed, and a K-nearest neighbors search algorithm for a large-scale point cloud map is implemented on a field programmable gate array (FPGA). The K-nearest neighbors search accelerator is configured for constructing the DSVS search structure, and searching for K-nearest neighbors based on the DSVS search structure. An experimental result on a KITTI dataset shows that the K-nearest neighbors search accelerator has a search speed 9.1 times faster than a state-of-the-art FPGA implementation. In addition, the K-nearest neighbors search accelerator also achieves an optimal energy efficiency, and the optimal energy efficiency is 11.5 times and 13.5 times higher than state-of-the-art FPGA and GPU implementations respectively.
Type: Grant
Filed: December 18, 2024
Date of Patent: May 6, 2025
Assignee: SHANGHAITECH UNIVERSITY
Inventors: Yunhao Hu, Yajun Ha
-
Patent number: 12223691
Abstract: A max-flow/min-cut solution algorithm for early terminating a push-relabel algorithm is provided. The max-flow/min-cut solution algorithm is used for an application that does not require an exact maximum flow, and includes: defining an early termination condition of the push-relabel algorithm by a separation condition and a stable condition; determining that the separation condition is satisfied if there is no source node s, s ∈ S, in the set T at any time in an operation process of the push-relabel algorithm; determining that the stable condition is satisfied if there is no active node in the set T; and terminating the push-relabel algorithm if both the separation condition and the stable condition are satisfied. The early termination technique is proposed to greatly reduce redundant computations and ensure that the algorithm terminates correctly in all cases.
Type: Grant
Filed: September 22, 2021
Date of Patent: February 11, 2025
Assignee: SHANGHAITECH UNIVERSITY
Inventors: Xinzhe Liu, Guangyao Yan, Yajun Ha
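The early-termination test described in the abstract above can be sketched as a small predicate. This is a minimal illustration only: the abstract does not say how the set T is constructed, so the sketch takes T as an input, and the names `should_terminate`, `excess`, and `sink` are hypothetical. It also assumes that an "active node" is any node other than the sink with positive excess flow, which is the usual push-relabel convention.

```python
from typing import Dict, Hashable, Set

Node = Hashable


def should_terminate(T: Set[Node], sources: Set[Node],
                     excess: Dict[Node, int], sink: Node) -> bool:
    """Return True when both early-termination conditions hold.

    T is taken as given (assumption): the abstract does not define how it is built.
    """
    # Separation condition: no source node s in S appears in T.
    separated = T.isdisjoint(sources)
    # Stable condition: no active node in T (assumed: positive excess, not the sink).
    stable = all(excess.get(v, 0) <= 0 for v in T if v != sink)
    return separated and stable
```

A caller would evaluate this predicate between push/relabel rounds and stop as soon as it returns True, accepting an approximate flow value in exchange for fewer iterations.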
-
Patent number: 12217475
Abstract: Provided is a stream processing-based non-blocking oriented FAST and rotated BRIEF (ORB) feature extraction accelerator implemented by a field programmable gate array (FPGA), which mainly includes two innovations. First, a stream processing-based non-blocking hardware architecture and a cache management algorithm are provided. The accelerator precisely controls and buffers each column of an rBRIEF descriptor computation window by using an algorithm, allowing it to receive a new input pixel stream while computing a descriptor, thereby achieving non-blocking processing. Second, an efficient hardware sorting design embedded in the accelerator is provided. Based on a count sorting algorithm, minimal resources are used to implement rBRIEF sorting on hardware, and the rBRIEF sorting is embedded in the accelerator. The accelerator ensures quality of a feature point while achieving high-speed feature point extraction, without significantly reducing accuracy of ORB_SLAM and other algorithms.
Type: Grant
Filed: August 23, 2024
Date of Patent: February 4, 2025
Assignee: SHANGHAITECH UNIVERSITY
Inventors: Qixing Zhang, Yajun Ha
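Counting sort, which the abstract names as the basis of the hardware sorter, can be sketched in software as follows. This only illustrates the general technique: the key range (`max_score`), the choice to sort feature points by an integer score in descending order, and the function name are assumptions, not details taken from the patent.

```python
from typing import List, Tuple


def counting_sort_desc(points: List[Tuple[int, int]],
                       max_score: int) -> List[Tuple[int, int]]:
    """Stable counting sort of (score, point_id) pairs, highest score first.

    Assumes every score is an integer in [0, max_score].
    """
    # Histogram of scores.
    counts = [0] * (max_score + 1)
    for score, _ in points:
        counts[score] += 1

    # start[s] = output index of the first element with score s (descending order).
    start = [0] * (max_score + 1)
    offset = 0
    for s in range(max_score, -1, -1):
        start[s] = offset
        offset += counts[s]

    # Place each element; scanning in input order keeps the sort stable.
    out: List[Tuple[int, int]] = [(0, 0)] * len(points)
    for score, pid in points:
        out[start[score]] = (score, pid)
        start[score] += 1
    return out
```

Because the work is a histogram pass, a prefix-sum pass, and a placement pass, with no data-dependent comparisons, this style of sort tends to map onto fixed hardware more cheaply than comparison-based sorts when the key range is small.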
-
Patent number: 12181911
Abstract: An automatic overclocking controller based on circuit delay measurement is provided, including a central processing unit (CPU), a clock generator, and a timing delay monitor (TDM) controller. Compared with the prior art, the present disclosure has the following innovative points: a two-dimension-multi-frame fusion (2D-MFF) technology is used to process a sampling result, to eliminate sampling noise, and an automatic overclocking controller running on a heterogeneous field programmable gate array (FPGA) can automatically search for a highest frequency at which an accelerator can operate safely.
Type: Grant
Filed: July 21, 2023
Date of Patent: December 31, 2024
Assignee: SHANGHAITECH UNIVERSITY
Inventors: Weixiong Jiang, Yajun Ha
-
Publication number: 20240289914
Abstract: A graphics processing unit (GPU)-based logic rewriting acceleration method comprising parallelizing sub-procedures of And-Inverter Graph (AIG)-based logic rewriting. A recursive sub-procedure of the AIG-based logic rewriting is redesigned to be non-recursive, to provide sufficient parallelism for a GPU. In order to parallelize a replacement step on the GPU, the present disclosure uses a lock to ensure mutually exclusive access, which inevitably damages scalability of inter-node parallelism. In order to fully utilize the inter-node parallelism on a large scale, the present disclosure proposes a work scheduler that adds nodes with non-overlapping maximum fan-out-free cones (MFFCs) to a group, such that nodes in an MFFC can be deleted simultaneously without a conflict. In order to simultaneously create and delete a same node, the present disclosure also proposes a GPU-friendly graphical data structure to support these concurrent operations.
Type: Application
Filed: December 13, 2023
Publication date: August 29, 2024
Applicant: SHANGHAITECH UNIVERSITY
Inventors: Lin LI, Yajun HA
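The grouping idea behind the work scheduler (putting nodes whose MFFCs do not overlap into the same group) can be sketched on the CPU as a greedy partition. This is an assumption-laden illustration: the abstract does not describe the actual scheduling policy or data layout, and `group_disjoint_mffcs` and its input format (root node mapped to the set of nodes in its MFFC) are hypothetical.

```python
from typing import Dict, Hashable, List, Set

Node = Hashable


def group_disjoint_mffcs(mffcs: Dict[Node, Set[Node]]) -> List[List[Node]]:
    """Greedily partition root nodes into groups whose MFFCs do not overlap."""
    groups: List[List[Node]] = []
    covered: List[Set[Node]] = []  # covered[i] = union of MFFC nodes in groups[i]
    for root, cone in mffcs.items():
        for i, used in enumerate(covered):
            if used.isdisjoint(cone):
                # No shared node: this root can be processed alongside group i.
                groups[i].append(root)
                used |= cone
                break
        else:
            # Overlaps every existing group, so open a new one.
            groups.append([root])
            covered.append(set(cone))
    return groups
```

Within one group, every node deletion touches a distinct cone, so the members could in principle be handled in parallel without locking; groups themselves would be processed one after another.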
-
Publication number: 20240281282
Abstract: A window-based dynamic scrubbing scheduling method is provided. By dynamically scheduling a user task and a scrubbing task, the method can reduce scrubbing conflicts of a field-programmable gate array (FPGA) scrubbing module and scrub each user task in a timely manner as much as possible. The method greatly reduces area and energy consumption overheads of a hardware circuit, and improves system reliability. The method proposes a negotiation-driven scrubbing scheduling algorithm and an integer linear programming (ILP)-based optimization-driven scrubbing scheduling algorithm. Based on global conflict information, the algorithms in the method can scrub more user tasks and improve the system reliability. The method ensures reliability of a mixed-criticality task set system.
Type: Application
Filed: December 18, 2023
Publication date: August 22, 2024
Applicant: SHANGHAITECH UNIVERSITY
Inventors: Rui LI, Yajun HA
-
Publication number: 20240273273
Abstract: A disordered parallel maximum flow/minimum cut method implemented by an energy-efficient field-programmable gate array (FPGA) folds a single-layer large two-dimensional grid graph into a multi-layer small grid graph. The method enables a folding grid architecture to store and process a grid graph that is much larger than a processor array in size. The folding grid architecture endows a two-dimensional processor array with a degree of freedom in a vertical direction, such that the two-dimensional processor array can leverage a potential for parallel performance of the folding grid architecture based on the degree of freedom in the vertical direction. The folding grid architecture enables a small-sized processor array to have an ability to process a grid graph that is much larger than the small-sized processor array in size. In addition, based on axial symmetry of folding, the folding grid architecture can greatly reduce cross-boundary transmission of data in the processor array.
Type: Application
Filed: January 2, 2024
Publication date: August 15, 2024
Applicant: SHANGHAITECH UNIVERSITY
Inventors: Guangyao YAN, Xinzhe LIU, Yajun HA, Hui WANG
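One way to picture why a symmetric fold reduces cross-boundary traffic is the one-dimensional index mapping below. It is a sketch under a stated assumption, namely that the large grid is folded back and forth along one axis like a paper fan so that odd layers are mirrored; the abstract does not give the actual mapping, and `fold` and `width` are hypothetical names.

```python
from typing import Tuple


def fold(x: int, width: int) -> Tuple[int, int]:
    """Map a global column index x to (layer, local column) on a folded strip.

    Assumption: the grid is folded back and forth along one axis, so odd
    layers are mirrored relative to even layers.
    """
    layer, offset = divmod(x, width)
    local = offset if layer % 2 == 0 else width - 1 - offset
    return layer, local
```

With `width = 4`, global columns 3 and 4 sit on opposite sides of a fold line but map to `(0, 3)` and `(1, 3)`: the same local column on adjacent layers. Under this assumed mapping, neighbours across a fold stay on the same processing element and differ only in layer, which is the kind of saving the abstract attributes to axial symmetry.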
-
Publication number: 20240231415
Abstract: An automatic overclocking controller based on circuit delay measurement is provided, including a central processing unit (CPU), a clock generator, and a timing delay monitor (TDM) controller. Compared with the prior art, the present disclosure has the following innovative points: a two-dimension-multi-frame fusion (2D-MFF) technology is used to process a sampling result, to eliminate sampling noise, and an automatic overclocking controller running on a heterogeneous field programmable gate array (FPGA) can automatically search for a highest frequency at which an accelerator can operate safely.
Type: Application
Filed: July 21, 2023
Publication date: July 11, 2024
Applicant: SHANGHAITECH UNIVERSITY
Inventors: Weixiong JIANG, Yajun HA
-
Publication number: 20240230907
Abstract: An efficient K-nearest neighbor (KNN) method for a single-frame point cloud of a LiDAR and an application of the efficient KNN method for the single-frame point cloud of the LiDAR are provided, where the efficient KNN method for the single-frame point cloud of the LiDAR is accelerated by a field-programmable gate array (FPGA). In the efficient KNN method for the single-frame point cloud of the LiDAR, a data structure is established based on point cloud projection and a distance scale. The data structure ensures that adjacent points in space are organized in adjacent memories. A new data structure is efficiently constructed. An efficient nearest point search mode is provided.
Type: Application
Filed: November 8, 2023
Publication date: July 11, 2024
Applicant: SHANGHAITECH UNIVERSITY
Inventors: Jianzhong XIAO, Hao SUN, Qi DENG, Yajun HA
-
Publication number: 20240233815
Abstract: A dual-six-transistor (D6T) in-memory computing (IMC) accelerator supporting always-linear discharge and reducing digital steps is provided. In the IMC accelerator, three effective techniques are proposed: (1) A D6T bitcell can reliably run at 0.4 V and enter a standby mode at 0.26 V, to support parallel processing of dual decoupled ports. (2) An always-linear discharge and convolution mechanism (ALDCM) not only reduces a voltage of a bit line (BL), but also keeps linear calculation throughout an entire voltage range of the BL. (3) A bypass of a bias voltage time converter (BVTC) reduces digital steps, but still keeps high energy efficiency and computing density at a low voltage. A measurement result of the IMC accelerator shows that the IMC accelerator achieves an average energy efficiency of 8918 TOPS/W (8b×8b), and an average computing density of 38.6 TOPS/mm² (8b×8b) in a 55 nm CMOS technology.
Type: Application
Filed: October 9, 2023
Publication date: July 11, 2024
Applicant: SHANGHAITECH UNIVERSITY
Inventors: Hongtu ZHANG, Yuhao SHU, Yajun HA
-
Publication number: 20240233796
Abstract: An energy-efficient memory for cryogenic computing is provided. The energy-efficient memory includes a plurality of memory banks, where each of the memory banks includes a cryogenic semi-static, dual-port, boost-free gain cell (CSDB-GC) macro module, a universal address decoder, and a different address decoder. The CSDB-GC macro module includes a plurality of columns of local blocks, and each of the local blocks includes a plurality of CSDB-GC memory cells. A final measurement result of a 16 Kb CSDB-eDRAM shows that the 16 Kb CSDB-eDRAM achieves data retention time (DRT) of 16.67 seconds, which is 2.6 times longer than DRT of a state-of-the-art cryogenic eDRAM at a temperature of 4.2 K, and achieves lower refresh power (0.11 pW/Kb). In addition, the 16 Kb CSDB-eDRAM also achieves shorter access time, namely, 710 ps (1.41 GHz). Compared with the state-of-the-art work, the 16 Kb CSDB-eDRAM has a lowest dynamic power consumption overhead, namely, 49.23 uW/Kb.
Type: Application
Filed: November 9, 2023
Publication date: July 11, 2024
Applicant: SHANGHAITECH UNIVERSITY
Inventors: Yuhao SHU, Hongtu ZHANG, Yajun HA
-
Publication number: 20240220770
Abstract: A high-efficiency quantization method for a deep probabilistic network achieves good results through hybrid quantization, structure reformulation, and type optimization. Firstly, for a directed acyclic graph (DAG) structure, all nodes in the DAG are clustered, and each node is quantized by a specific arithmetic type based on the clustering category, to obtain a preliminarily quantized deep probabilistic network. Secondly, the multi-in nodes in the preliminarily quantized deep probabilistic network are reformulated based on the input weights: structural reformulation converts a multi-in node into a binary tree network containing only two-input nodes, and parametrical reformulation is performed on the reformulated structure. Finally, arithmetic types of all nodes are optimized by using an arithmetic type search method based on power consumption analysis and network accuracy analysis.
Type: Application
Filed: November 7, 2023
Publication date: July 4, 2024
Applicant: SHANGHAITECH UNIVERSITY
Inventors: Shen ZHANG, Xinzhe LIU, Yajun HA
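The structural and parametrical reformulation step can be illustrated for a single weighted-sum node. The sketch below is not the patented procedure: it assumes the multi-in node computes a sum of w_i * x_i with positive weights, merges the two smallest weights first (a Huffman-style ordering chosen only for illustration), and re-normalises child weights so the tree computes the same value. The names `binarize` and `evaluate` are hypothetical.

```python
import heapq
from itertools import count
from typing import List, Tuple, Union

# A tree is either a leaf (input index) or a two-input weighted-sum node
# (left_weight, left_subtree, right_weight, right_subtree).
Tree = Union[int, Tuple[float, "Tree", float, "Tree"]]


def binarize(weights: List[float]) -> Tuple[float, Tree]:
    """Turn one multi-input weighted sum into a tree of two-input sums.

    Returns (scale, tree) such that scale * evaluate(tree, x) == sum(w_i * x_i).
    Assumes a non-empty list of positive weights; merging the two smallest
    weights first is an assumption of this sketch.
    """
    tie = count()  # tie-breaker so the heap never compares trees
    heap = [(w, next(tie), i) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    while len(heap) > 1:
        wa, _, a = heapq.heappop(heap)
        wb, _, b = heapq.heappop(heap)
        total = wa + wb
        # Re-normalise child weights so the parent carries the combined weight.
        heapq.heappush(heap, (total, next(tie), (wa / total, a, wb / total, b)))
    scale, _, tree = heap[0]
    return scale, tree


def evaluate(tree: Tree, x: List[float]) -> float:
    """Evaluate a binarized tree on the input vector x."""
    if isinstance(tree, int):
        return x[tree]
    wl, left, wr, right = tree
    return wl * evaluate(left, x) + wr * evaluate(right, x)
```

For example, `binarize([0.1, 0.2, 0.3, 0.4])` yields a tree of three two-input nodes whose scaled output matches the original four-input sum up to floating-point rounding; each two-input node can then be assigned its own arithmetic type.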
-
Publication number: 20240221811
Abstract: An energy-efficient cryogenic-in-memory-computing (CIMC) accelerator includes cryogenic 3T (C3T) macros. Each of the C3T macros comprises a C3T array containing M rows×N columns of bitcells. An input signal is converted into a timing sequence signal of a corresponding pulse width by using a digital timing sequence converter array. A C3T bitcell of a corresponding row in the C3T macro is controlled to perform charging and discharging on a read bit line (RBL) of a corresponding column. A voltage on the RBL of the corresponding column is sampled by a sense amplifier configured in each C3T macro to obtain a final result. With adaptive reference voltage configuration and storage on the chip, this design can achieve fast and low-power Boolean/convolutional computing.
Type: Application
Filed: August 3, 2023
Publication date: July 4, 2024
Applicant: SHANGHAITECH UNIVERSITY
Inventors: Yuhao SHU, Hongtu ZHANG, Yajun HA
-
Publication number: 20240212175
Abstract: A global registration method based on spherical harmonic transformation (SHT) and iterative optimization is provided. Two assumptions are provided: firstly, it is predefined that a minimum percentage of a correct matching pair in an input point cloud is represented as a limit on a quantity of outliers in the point cloud, and secondly, a distance threshold used to determine the correct matching pair is preset based on a scenario and represented as a limited distance of an outlier in the point cloud. In the algorithm provided, the point cloud first undergoes coarse registration to obtain a plurality of search domains, and the search domains are sorted based on an evaluation criterion. A branch and bound method is used to exclude an incorrect search domain and obtain a final registration result.
Type: Application
Filed: November 23, 2023
Publication date: June 27, 2024
Applicant: SHANGHAITECH UNIVERSITY
Inventors: Chengzhang HE, Yajun HA
-
Publication number: 20240212748
Abstract: An ultra-low-voltage static random access memory (SRAM) cell for eliminating half-select-disturbance under a bit interleaving structure includes a cross-coupled inverter pair, two N-type write transistors NM1 and NM2, two P-type write transistors PM1 and PM2, and two N-type transistors NM3 and NM4, where the two N-type transistors NM3 and NM4 form a readout path. The present disclosure can be applied to applications with a storage requirement at an ultra-low voltage, especially applications with certain requirements for an access speed and reliability of an SRAM at a low voltage. Compared with other different SRAM cells, the ultra-low-voltage SRAM cell can achieve higher read and write working frequencies with similar energy consumptions.
Type: Application
Filed: August 14, 2023
Publication date: June 27, 2024
Applicant: SHANGHAITECH UNIVERSITY
Inventors: Yifei LI, Jian CHEN, Yajun HA, Hongyu CHEN
-
Publication number: 20240143883
Abstract: A layout method for a scalable multi-die network-on-chip FPGA architecture is provided. An application of the aforementioned layout method for the scalable multi-die network-on-chip FPGA architecture is further provided. A scalable multi-die FPGA architecture based on network-on-chip and a corresponding hierarchical recursive layout algorithm are provided, aiming to directly map a register transfer level dataflow design generated by existing high-level synthesis onto the provided interconnection architecture. The layout method can exploit the potential for hierarchical topology and make more efficient use of dedicated interconnection resources, such as cross-die nets, network-on-chips, and high-speed transceivers.
Type: Application
Filed: May 31, 2023
Publication date: May 2, 2024
Applicant: SHANGHAITECH UNIVERSITY
Inventors: Jianwen LUO, Yajun HA
-
Publication number: 20240135989
Abstract: A dual-six-transistor (D6T) in-memory computing (IMC) accelerator supporting always-linear discharge and reducing digital steps is provided. In the IMC accelerator, three effective techniques are proposed: (1) A D6T bitcell can reliably run at 0.4 V and enter a standby mode at 0.26 V, to support parallel processing of dual decoupled ports. (2) An always-linear discharge and convolution mechanism (ALDCM) not only reduces a voltage of a bit line (BL), but also keeps linear calculation throughout an entire voltage range of the BL. (3) A bypass of a bias voltage time converter (BVTC) reduces digital steps, but still keeps high energy efficiency and computing density at a low voltage. A measurement result of the IMC accelerator shows that the IMC accelerator achieves an average energy efficiency of 8918 TOPS/W (8b×8b), and an average computing density of 38.6 TOPS/mm² (8b×8b) in a 55 nm CMOS technology.
Type: Application
Filed: October 8, 2023
Publication date: April 25, 2024
Applicant: SHANGHAITECH UNIVERSITY
Inventors: Hongtu ZHANG, Yuhao SHU, Yajun HA
-
Publication number: 20240127466
Abstract: An energy-efficient point cloud feature extraction method based on a field-programmable gate array (FPGA) is mapped onto the FPGA for running. The energy-efficient point cloud feature extraction method based on the FPGA is applied to point cloud feature extraction in unmanned driving or an intelligent robot. Compared with an existing technical solution, the energy-efficient point cloud feature extraction method based on the FPGA has the following innovative points: a low-complexity projection method for organizing unordered and sparse point clouds, a high-parallel method for extracting a coarse-grained feature point, and a high-parallel method for selecting a fine-grained feature point.
Type: Application
Filed: September 19, 2023
Publication date: April 18, 2024
Applicant: SHANGHAITECH UNIVERSITY
Inventors: Hao SUN, Yajun HA
-
Publication number: 20240112443
Abstract: A max-flow/min-cut solution algorithm for early terminating a push-relabel algorithm is provided. The max-flow/min-cut solution algorithm is used for an application that does not require an exact maximum flow, and includes: defining an early termination condition of the push-relabel algorithm by a separation condition and a stable condition; determining that the separation condition is satisfied if there is no source node s, s ∈ S, in the set T at any time in an operation process of the push-relabel algorithm; determining that the stable condition is satisfied if there is no active node in the set T; and terminating the push-relabel algorithm if both the separation condition and the stable condition are satisfied. The early termination technique is proposed to greatly reduce redundant computations and ensure that the algorithm terminates correctly in all cases.
Type: Application
Filed: September 22, 2021
Publication date: April 4, 2024
Applicant: SHANGHAITECH UNIVERSITY
Inventors: Xinzhe LIU, Guangyao YAN, Yajun HA