Patents by Inventor Yajun Ha

Yajun Ha has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 12292946
    Abstract: A method for implementing formal verification of an optimized multiplier via symbolic computer algebra (SCA)-satisfiability (SAT) synergy includes: 1) systematically recovering, by a reverse engineering algorithm, an adder tree from an optimized multiplier; 2) generating, by a constraint satisfaction algorithm, a reference multiplier only by using an adder based on a constraint condition; and 3) combining, by an SCA-based and SAT-based verification method, complementary advantages of SCA and SAT. In the verification framework, the method introduces a reference multiplier generator for generating a correct reference multiplier. The correct reference multiplier has both a structure similar to a structure of the optimized multiplier and a clear adder boundary. The clear adder boundary allows proving correctness of the correct reference multiplier through SCA-based verification.
    Type: Grant
    Filed: December 4, 2024
    Date of Patent: May 6, 2025
    Assignee: SHANGHAITECH UNIVERSITY
    Inventors: Rui Li, Lin Li, Yajun Ha
  • Patent number: 12292888
    Abstract: A fast and energy-efficient K-nearest neighbors search accelerator for a large-scale point cloud is provided. A nearest sub-voxel-selection (NSVS) framework that performs search based on a double-segmentation-voxel-structure (DSVS) search structure is constructed, and a K-nearest neighbors search algorithm for a large-scale point cloud map is implemented on a field programmable gate array (FPGA). The K-nearest neighbors search accelerator is configured for constructing the DSVS search structure, and searching for K-nearest neighbors based on the DSVS search structure. An experimental result on a KITTI dataset shows that the K-nearest neighbors search accelerator has a search speed 9.1 times faster than a state-of-the-art FPGA implementation. In addition, the K-nearest neighbors search accelerator also achieves an optimal energy efficiency, and the optimal energy efficiency is 11.5 times and 13.5 times higher than state-of-the-art FPGA and GPU implementations respectively.
    Type: Grant
    Filed: December 18, 2024
    Date of Patent: May 6, 2025
    Assignee: SHANGHAITECH UNIVERSITY
    Inventors: Yunhao Hu, Yajun Ha
  • Patent number: 12223691
    Abstract: A max-flow/min-cut solution algorithm for early terminating a push-relabel algorithm is provided. The max-flow/min-cut solution algorithm is used for an application that does not require an exact maximum flow, and includes: defining an early termination condition of the push-relabel algorithm by a separation condition and a stable condition; determining that the separation condition is satisfied if there is no source node s, s ∈ S, in the set T at any time in an operation process of the push-relabel algorithm; determining that the stable condition is satisfied if there is no active node in the set T; and terminating the push-relabel algorithm if both the separation condition and the stable condition are satisfied. The early termination technique is proposed to greatly reduce redundant computations and ensure that the algorithm terminates correctly in all cases.
    Type: Grant
    Filed: September 22, 2021
    Date of Patent: February 11, 2025
    Assignee: SHANGHAITECH UNIVERSITY
    Inventors: Xinzhe Liu, Guangyao Yan, Yajun Ha
  • Patent number: 12217475
    Abstract: Provided is a stream processing-based non-blocking oriented FAST and rotated BRIEF (ORB) feature extraction accelerator implemented by a field programmable gate array (FPGA), which mainly includes two innovations. First, a stream processing-based non-blocking hardware architecture and a cache management algorithm are provided. The accelerator precisely controls and buffers each column of an rBRIEF descriptor computation window by using an algorithm, allowing it to receive a new input pixel stream while computing a descriptor, thereby achieving non-blocking processing. Second, an efficient hardware sorting design embedded in the accelerator is provided. Based on a count sorting algorithm, minimal resources are used to implement rBRIEF sorting on hardware, and the rBRIEF sorting is embedded in the accelerator. The accelerator ensures quality of a feature point while achieving high-speed feature point extraction, without significantly reducing accuracy of ORB_SLAM and other algorithms.
    Type: Grant
    Filed: August 23, 2024
    Date of Patent: February 4, 2025
    Assignee: SHANGHAITECH UNIVERSITY
    Inventors: Qixing Zhang, Yajun Ha
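
The abstract above names a count sorting algorithm as the basis of its sorting stage. As a rough software illustration of counting sort in general (not the patented hardware design), the sketch below ranks items by an integer score. The function name `counting_sort_desc`, the integer score range, and the descending order are assumptions made for this example only.

```python
def counting_sort_desc(scores, max_score):
    """Return indices of `scores` ordered from highest to lowest score.

    Assumes every score is an integer in [0, max_score] (illustrative only).
    """
    # Histogram: how many items carry each score value.
    counts = [0] * (max_score + 1)
    for s in scores:
        counts[s] += 1

    # start[v] = first output slot for items with score v, highest score first.
    start = [0] * (max_score + 1)
    pos = 0
    for v in range(max_score, -1, -1):
        start[v] = pos
        pos += counts[v]

    # Drop each index into its slot; equal scores keep their input order.
    order = [0] * len(scores)
    for i, s in enumerate(scores):
        order[start[s]] = i
        start[s] += 1
    return order


ranked = counting_sort_desc([3, 1, 3, 0, 2], max_score=3)
```

Counting sort needs no comparisons, only a histogram and a prefix pass, which is one plausible reason a small fixed score range suits a resource-constrained implementation.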
  • Patent number: 12181911
    Abstract: An automatic overclocking controller based on circuit delay measurement is provided, including a central processing unit (CPU), a clock generator, and a timing delay monitor (TDM) controller. Compared with the prior art, the present disclosure has the following innovative points: a two-dimension-multi-frame fusion (2D-MFF) technology is used to process a sampling result, to eliminate sampling noise, and an automatic overclocking controller running on a heterogeneous field programmable gate array (FPGA) can automatically search for a highest frequency at which an accelerator can operate safely.
    Type: Grant
    Filed: July 21, 2023
    Date of Patent: December 31, 2024
    Assignee: SHANGHAITECH UNIVERSITY
    Inventors: Weixiong Jiang, Yajun Ha
  • Publication number: 20240289914
    Abstract: A graphics processing unit (GPU)-based logic rewriting acceleration method comprising parallelizing sub-procedures of And-Inverter Graph (AIG)-based logic rewriting. A recursive sub-procedure of the AIG-based logic rewriting is redesigned to be non-recursive, to provide sufficient parallelism for a GPU. In order to parallelize a replacement step on the GPU, the present disclosure uses a lock to ensure mutually exclusive access, which inevitably damages scalability of inter-node parallelism. In order to fully utilize the inter-node parallelism on a large scale, the present disclosure proposes a work scheduler that adds nodes with non-overlapping maximum fan-out-free cones (MFFCs) to a group, such that nodes in an MFFC can be deleted simultaneously without a conflict. In order to simultaneously create and delete a same node, the present disclosure also proposes a GPU-friendly graphical data structure to support these concurrent operations.
    Type: Application
    Filed: December 13, 2023
    Publication date: August 29, 2024
    Applicant: SHANGHAITECH UNIVERSITY
    Inventors: Lin LI, Yajun HA
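
The abstract above describes a work scheduler that places nodes with non-overlapping MFFCs in the same group so they can be processed without conflicts. A minimal sketch of that grouping idea follows, using a simple greedy pass on the CPU. The input format (a dict from root node to the set of nodes in its MFFC), the function name `group_disjoint`, and the greedy first-fit strategy are assumptions for illustration and are not taken from the disclosure.

```python
def group_disjoint(mffcs):
    """Greedily partition root nodes into groups with pairwise disjoint MFFCs.

    `mffcs` maps a root node id to the set of node ids in its MFFC
    (hypothetical input format).
    """
    groups = []  # each entry: (list of roots, union of their MFFC nodes)
    for root, cone in mffcs.items():
        for roots, used in groups:
            if used.isdisjoint(cone):
                # No shared node, so this root can join the group safely.
                roots.append(root)
                used |= cone  # in-place update of the group's node set
                break
        else:
            # Overlaps every existing group: open a new one.
            groups.append(([root], set(cone)))
    return [roots for roots, _ in groups]


batches = group_disjoint({"a": {1, 2}, "b": {2, 3}, "c": {4}})
```

Here `a` and `b` share node 2 and land in different groups, while `c` overlaps neither and joins the first group.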
  • Publication number: 20240281282
    Abstract: A window-based dynamic scrubbing scheduling method is provided. By dynamically scheduling a user task and a scrubbing task, the method can reduce scrubbing conflicts of a field-programmable gate array (FPGA) scrubbing module and scrub each user task in a timely manner as much as possible. The method greatly reduces area and energy consumption overheads of a hardware circuit, and improves system reliability. The method proposes a negotiation-driven scrubbing scheduling algorithm and an integer linear programming (ILP)-based optimization-driven scrubbing scheduling algorithm. Based on global conflict information, the algorithms in the method can scrub more user tasks and improve the system reliability. The method ensures reliability of a mixed-criticality task set system.
    Type: Application
    Filed: December 18, 2023
    Publication date: August 22, 2024
    Applicant: SHANGHAITECH UNIVERSITY
    Inventors: Rui LI, Yajun HA
  • Publication number: 20240273273
    Abstract: A disordered parallel maximum flow/minimum cut method implemented by an energy-efficient field-programmable gate array (FPGA) folds a single-layer large two-dimensional grid graph into a multi-layer small grid graph. The method enables a folding grid architecture to store and process a grid graph that is much larger than a processor array in size. The folding grid architecture endows a two-dimensional processor array with a degree of freedom in a vertical direction, such that the two-dimensional processor array can leverage a potential for parallel performance of the folding grid architecture based on the degree of freedom in the vertical direction. The folding grid architecture enables a small-sized processor array to have an ability to process a grid graph that is much larger than the small-sized processor array in size. In addition, based on axial symmetry of folding, the folding grid architecture can greatly reduce cross-boundary transmission of data in the processor array.
    Type: Application
    Filed: January 2, 2024
    Publication date: August 15, 2024
    Applicant: SHANGHAITECH UNIVERSITY
    Inventors: Guangyao YAN, Xinzhe LIU, Yajun HA, Hui WANG
  • Publication number: 20240231415
    Abstract: An automatic overclocking controller based on circuit delay measurement is provided, including a central processing unit (CPU), a clock generator, and a timing delay monitor (TDM) controller. Compared with the prior art, the present disclosure has the following innovative points: a two-dimension-multi-frame fusion (2D-MFF) technology is used to process a sampling result, to eliminate sampling noise, and an automatic overclocking controller running on a heterogeneous field programmable gate array (FPGA) can automatically search for a highest frequency at which an accelerator can operate safely.
    Type: Application
    Filed: July 21, 2023
    Publication date: July 11, 2024
    Applicant: SHANGHAITECH UNIVERSITY
    Inventors: Weixiong JIANG, Yajun HA
  • Publication number: 20240230907
    Abstract: An efficient K-nearest neighbor (KNN) method for a single-frame point cloud of a LiDAR and an application of the efficient KNN method for the single-frame point cloud of the LiDAR are provided, where the efficient KNN method for the single-frame point cloud of the LiDAR is accelerated by a field-programmable gate array (FPGA). In the efficient KNN method for the single-frame point cloud of the LiDAR, a data structure is established based on point cloud projection and a distance scale. The data structure ensures that adjacent points in space are organized in adjacent memories. A new data structure is efficiently constructed. An efficient nearest point search mode is provided.
    Type: Application
    Filed: November 8, 2023
    Publication date: July 11, 2024
    Applicant: SHANGHAITECH UNIVERSITY
    Inventors: Jianzhong XIAO, Hao SUN, Qi DENG, Yajun HA
  • Publication number: 20240233815
    Abstract: A dual-six-transistor (D6T) in-memory computing (IMC) accelerator supporting always-linear discharge and reducing digital steps is provided. In the IMC accelerator, three effective techniques are proposed: (1) A D6T bitcell can reliably run at 0.4 V and enter a standby mode at 0.26 V, to support parallel processing of dual decoupled ports. (2) An always-linear discharge and convolution mechanism (ALDCM) not only reduces a voltage of a bit line (BL), but also keeps linear calculation throughout an entire voltage range of the BL. (3) A bypass of a bias voltage time converter (BVTC) reduces digital steps, but still keeps high energy efficiency and computing density at a low voltage. A measurement result of the IMC accelerator shows that the IMC accelerator achieves an average energy efficiency of 8918 TOPS/W (8b×8b), and an average computing density of 38.6 TOPS/mm² (8b×8b) in a 55 nm CMOS technology.
    Type: Application
    Filed: October 9, 2023
    Publication date: July 11, 2024
    Applicant: SHANGHAITECH UNIVERSITY
    Inventors: Hongtu ZHANG, Yuhao SHU, Yajun HA
  • Publication number: 20240233796
    Abstract: An energy-efficient memory for cryogenic computing is provided. The energy-efficient memory includes a plurality of memory banks, where each of the memory banks includes a cryogenic semi-static, dual-port, boost-free gain cell (CSDB-GC) macro module, a universal address decoder, and a different address decoder. The CSDB-GC macro module includes a plurality of columns of local blocks, and each of the local blocks includes a plurality of CSDB-GC memory cells. A final measurement result of a 16 Kb CSDB-eDRAM shows that the 16 Kb CSDB-eDRAM achieves data retention time (DRT) of 16.67 seconds, which is 2.6 times longer than DRT of a state-of-the-art cryogenic eDRAM at a temperature of 4.2 K, and achieves lower refresh power (0.11 pW/Kb). In addition, the 16 Kb CSDB-eDRAM also achieves shorter access time, namely, 710 ps (1.41 GHz). Compared with the state-of-the-art work, the 16 Kb CSDB-eDRAM has a lowest dynamic power consumption overhead, namely, 49.23 uW/Kb.
    Type: Application
    Filed: November 9, 2023
    Publication date: July 11, 2024
    Applicant: SHANGHAITECH UNIVERSITY
    Inventors: Yuhao SHU, Hongtu ZHANG, Yajun HA
  • Publication number: 20240220770
    Abstract: A highly efficient quantization method for a deep probabilistic network achieves good results through hybrid quantization, structure reformulation, and type optimization. Firstly, for a directed acyclic graph (DAG) structure, all nodes in the DAG are clustered, and each node is quantized by a specific arithmetic type based on the clustering category, to obtain a preliminarily quantized deep probabilistic network. Secondly, the multi-in nodes in the preliminarily quantized deep probabilistic network are reformulated based on the input weights: structural reformulation converts a multi-in node into a binary tree network containing only two-input nodes, and parametric reformulation is performed on the reformulated structure. Finally, arithmetic types of all nodes are optimized by using an arithmetic type search method based on power consumption analysis and network accuracy analysis.
    Type: Application
    Filed: November 7, 2023
    Publication date: July 4, 2024
    Applicant: SHANGHAITECH UNIVERSITY
    Inventors: Shen ZHANG, Xinzhe LIU, Yajun HA
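
The abstract above describes converting a multi-input node into a binary tree containing only two-input nodes, guided by the input weights. The sketch below shows one way such a restructuring could look. The rule used here, repeatedly pairing the two smallest-weight inputs (Huffman-style), is an assumption chosen for illustration; the source does not state the actual pairing rule, and the name `to_binary_tree` is hypothetical.

```python
import heapq
from itertools import count


def to_binary_tree(inputs):
    """Rewrite one multi-input node as nested two-input nodes.

    `inputs` is a non-empty list of (weight, name) pairs. Returns a nested
    (left, right) tuple whose leaves are the input names.
    """
    tie = count()  # tie-breaker so the heap never compares subtrees
    heap = [(w, next(tie), name) for w, name in inputs]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lightest entries into a single two-input node.
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tie), (a, b)))
    return heap[0][2]


tree = to_binary_tree([(5, "x"), (2, "y"), (3, "z")])
```

In this example the two lightest inputs `y` and `z` are merged first, and their combined node is then paired with `x`, so every internal node has exactly two inputs.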
  • Publication number: 20240221811
    Abstract: An energy-efficient cryogenic-in-memory-computing (CIMC) accelerator includes cryogenic 3T (C3T) macros. Each of the C3T macros comprises a C3T array containing M rows×N columns of bitcells. An input signal is converted into a timing sequence signal of a corresponding pulse width by using a digital timing sequence converter array. A C3T bitcell of a corresponding row in the C3T macro is controlled to perform charging and discharging on a read bit line (RBL) of a corresponding column. A voltage on the RBL of the corresponding column is sampled by a sense amplifier configured in each C3T macro to obtain a final result. With adaptive reference voltage configuration and storage on the chip, this design can achieve fast and low-power boolean/convolutional computing.
    Type: Application
    Filed: August 3, 2023
    Publication date: July 4, 2024
    Applicant: SHANGHAITECH UNIVERSITY
    Inventors: Yuhao SHU, Hongtu ZHANG, Yajun HA
  • Publication number: 20240212175
    Abstract: A global registration method based on spherical harmonic transformation (SHT) and iterative optimization is provided. Two assumptions are provided: firstly, it is predefined that a minimum percentage of a correct matching pair in an input point cloud is represented as a limit on a quantity of outliers in the point cloud, and secondly, a distance threshold used to determine the correct matching pair is preset based on a scenario and represented as a limited distance of an outlier in the point cloud. In the algorithm provided, the point cloud first undergoes coarse registration to obtain a plurality of search domains, and the search domains are sorted based on an evaluation criterion. A branch and bound method is used to exclude an incorrect search domain and obtain a final registration result.
    Type: Application
    Filed: November 23, 2023
    Publication date: June 27, 2024
    Applicant: SHANGHAITECH UNIVERSITY
    Inventors: Chengzhang HE, Yajun HA
  • Publication number: 20240212748
    Abstract: An ultra-low-voltage static random access memory (SRAM) cell for eliminating half-select-disturbance under a bit interleaving structure includes a cross-coupled inverter pair, two N-type write transistors NM1 and NM2, two P-type write transistors PM1 and PM2, and two N-type transistors NM3 and NM4, where the two N-type transistors NM3 and NM4 form a readout path. The present disclosure can be applied to applications with a storage requirement at an ultra-low voltage, especially applications with certain requirements for an access speed and reliability of an SRAM at a low voltage. Compared with other different SRAM cells, the ultra-low-voltage SRAM cell can achieve higher read and write working frequencies with similar energy consumptions.
    Type: Application
    Filed: August 14, 2023
    Publication date: June 27, 2024
    Applicant: SHANGHAITECH UNIVERSITY
    Inventors: Yifei LI, Jian CHEN, Yajun HA, Hongyu CHEN
  • Publication number: 20240143883
    Abstract: A layout method for a scalable multi-die network-on-chip FPGA architecture is provided. An application of the aforementioned layout method for the scalable multi-die network-on-chip FPGA architecture is further provided. A scalable multi-die FPGA architecture based on network-on-chip and a corresponding hierarchical recursive layout algorithm are provided, aiming to directly map a register transfer level dataflow design generated by existing high-level synthesis onto the provided interconnection architecture. The layout method can exploit the potential for hierarchical topology and make more efficient use of dedicated interconnection resources, such as cross-die nets, network-on-chips, and high-speed transceivers.
    Type: Application
    Filed: May 31, 2023
    Publication date: May 2, 2024
    Applicant: SHANGHAITECH UNIVERSITY
    Inventors: Jianwen LUO, Yajun HA
  • Publication number: 20240135989
    Abstract: A dual-six-transistor (D6T) in-memory computing (IMC) accelerator supporting always-linear discharge and reducing digital steps is provided. In the IMC accelerator, three effective techniques are proposed: (1) A D6T bitcell can reliably run at 0.4 V and enter a standby mode at 0.26 V, to support parallel processing of dual decoupled ports. (2) An always-linear discharge and convolution mechanism (ALDCM) not only reduces a voltage of a bit line (BL), but also keeps linear calculation throughout an entire voltage range of the BL. (3) A bypass of a bias voltage time converter (BVTC) reduces digital steps, but still keeps high energy efficiency and computing density at a low voltage. A measurement result of the IMC accelerator shows that the IMC accelerator achieves an average energy efficiency of 8918 TOPS/W (8b×8b), and an average computing density of 38.6 TOPS/mm² (8b×8b) in a 55 nm CMOS technology.
    Type: Application
    Filed: October 8, 2023
    Publication date: April 25, 2024
    Applicant: SHANGHAITECH UNIVERSITY
    Inventors: Hongtu ZHANG, Yuhao SHU, Yajun HA
  • Publication number: 20240127466
    Abstract: An energy-efficient point cloud feature extraction method based on a field-programmable gate array (FPGA) is mapped onto the FPGA for running. The energy-efficient point cloud feature extraction method based on the FPGA is applied to point cloud feature extraction in unmanned driving or an intelligent robot. Compared with an existing technical solution, the energy-efficient point cloud feature extraction method based on the FPGA has the following innovative points: a low-complexity projection method for organizing unordered and sparse point clouds, a high-parallel method for extracting a coarse-grained feature point, and a high-parallel method for selecting a fine-grained feature point.
    Type: Application
    Filed: September 19, 2023
    Publication date: April 18, 2024
    Applicant: SHANGHAITECH UNIVERSITY
    Inventors: Hao SUN, Yajun HA
  • Publication number: 20240112443
    Abstract: A max-flow/min-cut solution algorithm for early terminating a push-relabel algorithm is provided. The max-flow/min-cut solution algorithm is used for an application that does not require an exact maximum flow, and includes: defining an early termination condition of the push-relabel algorithm by a separation condition and a stable condition; determining that the separation condition is satisfied if there is no source node s, s ∈ S, in the set T at any time in an operation process of the push-relabel algorithm; determining that the stable condition is satisfied if there is no active node in the set T; and terminating the push-relabel algorithm if both the separation condition and the stable condition are satisfied. The early termination technique is proposed to greatly reduce redundant computations and ensure that the algorithm terminates correctly in all cases.
    Type: Application
    Filed: September 22, 2021
    Publication date: April 4, 2024
    Applicant: SHANGHAITECH UNIVERSITY
    Inventors: Xinzhe LIU, Guangyao YAN, Yajun HA