METHOD AND APPARATUS FOR VERTICAL LAYERED DECODING OF QUASI-CYCLIC LOW-DENSITY PARITY CHECK CODES BUILT FROM CLUSTERS OF CIRCULANT PERMUTATION MATRICES
This invention presents a method and the corresponding hardware apparatus for decoding LDPC codes using a vertical layered (VL) iterative message passing algorithm. The invention operates on quasi-cyclic LDPC (QC-LDPC) codes, for which the non-zero circulant permutation matrices (CPMs) are placed at specific locations in the parity-check matrix of the codes, forming concentrated clusters of CPMs. The purpose of the invention is to take advantage of the organization of CPMs in clusters in order to derive a specific hardware architecture, consuming less power than the classical VL decoders. This is achieved by minimizing the number of read and write accesses to the main memories of the design.
This invention generally relates to error correction coding for information transmission, storage and processing systems, such as wired and wireless communications systems, optical communications systems, computer memories, mass data storage systems, etc. More particularly, it relates to the simplification and the optimization of low complexity and low power architectures for the hardware implementation of vertical layered iterative low-density parity check (LDPC) decoders. The invention is specifically designed for quasi-cyclic LDPC (QC-LDPC) codes built from clusters of circulant permutation matrices (CPMs), with the main objective of reducing the power consumption of the memory accesses during the decoding process.
BACKGROUND OF THE INVENTION
Error correcting codes play a vital role in communication, computer, and storage systems by ensuring the integrity of data. The past decades have witnessed a surge in research in coding theory which resulted in the development of efficient coding schemes based on LDPC codes. Iterative message passing decoding algorithms together with suitably designed LDPC codes have been shown to approach the information-theoretic channel capacity in the limit of infinite codeword length. LDPC codes are standardized in a number of applications such as wireless networks, satellite communications, deep-space communications, and power line communications.
For an (N, K) LDPC code with length N and dimension K, the parity-check matrix (PCM) H of size M×N=(N−K)×N (assuming that H is full rank) is composed of a small number of non-zero entries, i.e. a small number of ones. We denote the degree of the n-th column, i.e. the number of ones in the n-th column, by dv(n), 1≤n≤N. Similarly, we denote the degree of the m-th row, i.e. the number of ones in the m-th row, by dc(m), 1≤m≤M. Further, we define the maximum degree for the rows and columns:
$d_{v,\max} = \max_{1 \le n \le N} d_v(n), \qquad d_{c,\max} = \max_{1 \le m \le M} d_c(m)$ (1)
When the number of ones in the columns and the rows of H is constant, the LDPC code is said to be regular, otherwise the LDPC code is said to be irregular. For regular LDPC codes, we have dv,max=dv=dv(n), 1≤n≤N, and dc,max=dc=dc(m), 1≤m≤M. The (dv, dc)-regular LDPC codes represent a particularly interesting type of LDPC codes. For this type, the code rate is R=K/N=1−dv/dc if the PCM H is full rank. Except when it is necessary for the clarity of the argumentation, we will drop the indices n or m in the notations for the degrees of the rows and columns. It is clear, however, that all embodiments of the present invention apply to both regular and irregular LDPC codes.
If a binary column vector of length N, denoted x=[x1, x2, . . . , xN]T, is a codeword, then it satisfies Hx=0, where the operations of multiplication and addition are performed in the binary field GF(2), and 0 is the length-M all-zero column vector. xT denotes the transposition of x, both for vectors and matrices. An element in a matrix can be denoted indifferently by Hm,n or H(m,n). Similarly, an element in a vector is denoted by xn or x(n). The horizontal concatenation and the vertical concatenation of vectors and matrices are denoted [A, B] and [A; B], respectively.
The present invention relates to the class of QC-LDPC codes. In QC-LDPC codes, the PCM H is composed of square blocks or sub-matrices of size L×L, as described in equation (2), in which each block Hi,j is either (i) an all-zero L×L block, or (ii) a circulant permutation matrix (CPM).
A CPM is defined as the power of a primitive element of a cyclic group. The primitive element is defined, for example, by the L×L matrix α shown in equation (3) for the case of L=8. As a result, a CPM αk with k∈{0, . . . , L−1} has the form of the identity matrix, shifted k positions to the left. Said otherwise, the row-index of the nonzero value of the first column of αk is k+1. The value of k is referred to as the CPM value. The main feature of a CPM is that it has only a single nonzero element in each row/column and can be defined by its first row/column together with a process to generate the remaining rows/columns. The simplicity of this process translates to the low complexity needed for realizing physical connections between subsets of codeword bits and subsets of parity-check equations in a QC-LDPC encoder or decoder.
The PCM of a QC-LDPC code can be conveniently represented by a base matrix (or protograph matrix) B, with Mb rows and Nb columns, which contains integer values indicating the powers of the primitive element for each block Hi,j. Consequently, the dimensions of the base matrix are related to the dimensions of the PCM in the following way: M=Mb L, N=Nb L, and K=Kb L (assuming that H is full rank). An example of matrices H and B for Mb×Nb=4×5 and L=8 is shown in equation (4).
where I=α0 is the identity matrix, and by convention α−∞=0 is the all-zero L×L matrix. In this invention, the rows of the base matrix will be denoted block-rows of the PCM, while the columns of the base matrix will be denoted block-columns.
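The construction of a CPM and the expansion of a base matrix into the PCM can be illustrated with a minimal sketch. It assumes NumPy, uses `None` as a stand-in for the all-zero block (written α−∞ in the text), and uses a small hypothetical base matrix rather than the one of equation (4); the names `cpm` and `expand_base_matrix` are illustrative only.

```python
import numpy as np

def cpm(k, L):
    """Return the L x L circulant permutation matrix alpha^k.

    Column c has its single 1 in row (c + k) mod L, so the first column
    has its nonzero entry at 1-based row index k + 1, as stated in the text.
    """
    return np.roll(np.eye(L, dtype=np.uint8), k, axis=0)

def expand_base_matrix(B, L):
    """Expand a base matrix B (nested lists) into the binary PCM H.

    Assumption of this sketch: None marks an all-zero L x L block,
    and an integer k marks the CPM alpha^k.
    """
    zero = np.zeros((L, L), dtype=np.uint8)
    return np.block([[zero if k is None else cpm(k, L) for k in row]
                     for row in B])

# Hypothetical 2 x 3 base matrix with L = 4 (not the matrix of equation (4)).
B_example = [[0, 2, None],
             [None, 1, 3]]
H_example = expand_base_matrix(B_example, 4)
```

With this convention, M=Mb L and N=Nb L hold by construction, and every row of `H_example` contains exactly as many ones as there are CPMs in its block-row.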
For QC-LDPC codes, a block-row of the parity-check matrix H, composed of L consecutive rows of the PCM, is referred to as a horizontal layer, or row-layer. For example, the i-th block-row in equation (2) defines the i-th row-layer. Similarly, a vertical layer, or column-layer, is composed of L consecutive columns of the PCM. For example, the j-th block-column in equation (2) defines the j-th column-layer.
The concept of layer can be further extended to the concept of generalized layer (GL). The definition follows:
- A generalized layer is defined as the concatenation of two or more layers of H, such that in each block-column of the submatrix defined by the generalized layer, there is at most one non-zero CPM while the other blocks are all-zero blocks.
- A full generalized layer has further the property that each block-column of the submatrix defined by the generalized layer contains exactly one non-zero CPM.
This definition allows that, for a QC-LDPC code with maximum column degree dv,max, the PCM could be designed with at least dv,max generalized layers. For simplicity of the presentation, and without loss of generality, we will assume that the number of GLs is always equal to the maximum column degree dv,max.
The parity-check matrix H can be conveniently represented by a bipartite Tanner graph G, consisting of a set of variable nodes (VN) V={v1, v2, . . . , vN} of cardinality N, and a set of check nodes (CN) C={c1, c2, . . . , cM} of cardinality M. The variable nodes represent the codeword bits and the check nodes represent the parity-check equations of the LDPC code. Variable nodes and check nodes are connected by edges, where an edge exists between nodes cm and vn if the corresponding element of the parity-check matrix is Hm,n=1. The degree of check node cm, denoted dc(m), is the number of variable nodes it is connected to, and the degree of variable node vn, denoted dv(n), is the number of check nodes it is connected to. An LDPC code is said to be regular if its Tanner graph has a constant variable node degree dv(n)=dv, ∀n, and a constant check node degree dc(m)=dc, ∀m. The LDPC code is said to be irregular otherwise. Let us further denote by (cm) the set of variable nodes connected to cm, and by (vn) the set of check nodes connected to vn.
An iterative decoder operating on a Tanner graph of an LDPC code exchanges messages between the VNs and the CNs, along the edges connecting the two kinds of nodes. An edge supports messages in the two directions: variable-to-check messages, denoted μv,c, and check-to-variable messages, denoted μc,v.
Also relevant to this invention is the concept of layered decoding that is used to improve the decoder convergence speed while still maintaining a low hardware complexity. Layered LDPC decoding schemes effectively improve the convergence by reducing the required number of decoding iterations needed to reach successful decoding. A layered decoder produces messages from a subset of the check nodes to a subset of the variable nodes, and then produces messages from a subset of the variable nodes to a subset of the check nodes.
An iterative decoder is usually defined by the VN update (VNU) processing, the CN update (CNU) processing, and the scheduling of the message computation. The scheduling defines the order in which the VNU and the CNU operations are performed in the entire Tanner graph of the LDPC code. There are three main types of scheduling for iterative message-passing LDPC decoders: (i) the flooding schedule, (ii) the horizontal layered (HL) scheduling, (iii) the vertical layered (VL) scheduling. The HL and VL schedules are typically used in conjunction with QC-LDPC codes. In HL decoding the message updating is performed row-layer by row-layer, while in VL decoding the message computation is performed column-layer by column-layer.
This invention concerns an iterative LDPC decoder following the VL scheduling. We will refer only to this particular scheduling throughout the description.
The present invention applies to any binary input symmetric channel, and can be generalized easily to channels with non-binary inputs. Let x be a codeword of a length N QC-LDPC code. The codeword is sent over a noisy memoryless channel with outputs y, whose values belong to a q-ary alphabet. The channel precision nq is the number of bits required to represent the q-ary alphabet, i.e. $2^{n_q} \geq q$.
The embodiments of the present invention are further related to a class of iterative message-passing decoders called finite alphabet iterative decoders (FAIDs). In these decoders, the messages μc,v and μv,c belong to a finite alphabet which consists of a finite (typically small) number of levels, denoted s. The s levels can be represented using ns bits of precision, such that $2^{n_s} \geq s$.
where ak≥al for any k>l. Note that the message alphabet and the channel alphabet can have different cardinalities, s≠q.
The VNU for a variable node v of degree dv in a FAID is implemented using a pre-defined function Φv: ×{}d
we use:
The VNU function can be optimized to improve the error-correction capability of the decoder. The VNU function for the channel value y=+Y can be deduced from the one with channel value y=−Y by symmetry:
The CNU function Φc used in FAID is similar to the function used in the min-sum decoder which is typically used in the state-of-the-art. If
represent the incoming messages to a node c with degree dc then Φc is given by
Depending on the scheduling type that is used, the CNU can be implemented in various ways. The specifics about the CNU implementation for VL decoding will be presented subsequently.
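As an illustration of the min-sum rule that the text says Φc resembles, the following minimal sketch computes one outgoing check-to-variable message from the dc−1 extrinsic incoming messages. It shows the standard min-sum form (product of the signs times the smallest magnitude) and is an assumption of this sketch, not a reproduction of the equation of the present description; the function name is hypothetical, and a zero message is treated as positive.

```python
def min_sum_cnu(incoming):
    """Standard min-sum check node rule (assumed form).

    incoming: the dc - 1 extrinsic variable-to-check messages (ints or floats).
    Returns the product of their signs times the smallest magnitude.
    """
    sign = 1
    for m in incoming:
        if m < 0:          # a zero message is treated as positive here
            sign = -sign
    return sign * min(abs(m) for m in incoming)
```

For example, `min_sum_cnu([3, -1, 2])` has one negative input and smallest magnitude 1, so the outgoing message is −1.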
Finally, in order to compute a hard-decision estimate of the codeword bit for the VN v, an a posteriori probability (APP) is computed using:
The hard-decision estimate of the n-th codeword bit, denoted x̂n, is equal to:
If the hard decision estimates have a syndrome equal to zero, i.e.,
Hx̂=s=0 (10)
then the decoder has successfully converged to a valid codeword.
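The zero-syndrome test of equation (10) can be sketched as follows, assuming NumPy and a toy parity-check matrix chosen only for illustration; the function names are hypothetical.

```python
import numpy as np

def syndrome(H, x_hat):
    """Compute s = H x_hat over GF(2)."""
    H = np.asarray(H, dtype=np.int64)
    x_hat = np.asarray(x_hat, dtype=np.int64)
    return (H @ x_hat) % 2

def is_codeword(H, x_hat):
    """True when the syndrome is all-zero, i.e. condition (10) holds."""
    return not syndrome(H, x_hat).any()

# Toy 2 x 3 matrix (illustrative only): its codewords are 000 and 111.
H_toy = [[1, 1, 0],
         [0, 1, 1]]
```

Here `is_codeword(H_toy, [1, 1, 1])` holds, while flipping any single bit produces a non-zero syndrome.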
Let us now describe the general principle of VL decoding of QC-LDPC codes, with a focus on the memory organization and the CNU processing. For each CN cm connected to dc VNs, there are dc incoming variable-to-check messages to the CNU, denoted μv
To each and every cm, we associate a check node state (CNS), defined as
where sm=Πn=1d
A pair (magk, indexk) in the magnitude state will be referred to as a magnitude pair. Each magnitude pair is composed of the magnitude and the index of one of the dc incoming variable-to-check messages to the CNU. For simplicity of the presentation, we dropped the index m of the CN in the definition of the magnitude pairs. In a magnitude state, the magnitudes are sorted in ascending order:
- mag1≤mag2≤ . . . ≤magk≤ . . . ≤magw
and the value of indexk indicates the index of the block-column whose message has magnitude magk. We further assume, for ease of presentation, that each CN has at most one VN neighbor in each block-column of the parity-check matrix, i.e. indexk≠indexl if k≠l. This condition is not mandatory for VL decoding, and the algorithm can be extended easily when a CN has more than one neighbor in the block-columns.
Throughout the description, we will describe the algorithms and the hardware architectures for the case of w=2 smallest magnitudes, and we will use the notations CNS(cm) and MAG, dropping the parameter w in the notations. Nonetheless, the current invention applies to VL decoders with other values of w with minor modifications. The collection of all CNSs, for the M check nodes of the LDPC code, is stored in a memory called check node memory (CNM).
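A check node state for w=2 can be sketched as a small record holding the accumulated sign and two magnitude pairs. The sketch below is illustrative only: the field names, the storage of the sign as a single bit, and the helper `c2v_magnitude` (the usual way a two-minimum state is consumed in min-sum style decoders) are assumptions, not the layout used by the apparatus.

```python
from dataclasses import dataclass

@dataclass
class CheckNodeState:
    sign: int    # accumulated sign, stored here as a bit: 0 for +, 1 for -
    mag1: int    # smallest magnitude
    index1: int  # block-column index of the message with magnitude mag1
    mag2: int    # second smallest magnitude
    index2: int  # block-column index of the message with magnitude mag2

def build_cns(messages):
    """Build a CNS from a dict {block-column index: signed v2c message}.

    Requires at least two messages, since w = 2 magnitudes are kept.
    """
    sign = 0
    pairs = []
    for idx, msg in messages.items():
        sign ^= 1 if msg < 0 else 0      # accumulate the sign bit
        pairs.append((abs(msg), idx))    # magnitude pair (mag, index)
    pairs.sort()                         # ascending magnitudes
    (mag1, index1), (mag2, index2) = pairs[0], pairs[1]
    return CheckNodeState(sign, mag1, index1, mag2, index2)

def c2v_magnitude(cns, block_col):
    """Extrinsic minimum: mag2 if block_col holds mag1, otherwise mag1."""
    return cns.mag2 if block_col == cns.index1 else cns.mag1
```

With messages `{1: 3, 4: -1, 7: 2}`, the state keeps the pairs (1, 4) and (2, 7), which satisfies mag1≤mag2.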
A general VL iterative decoder is presented in Algorithm 1. The algorithm takes as inputs the channel values, and produces the hard decision estimates x̂.
The Initialization step of the algorithm serves to compute the initial values of the CNSs. During the initialization, all variable-to-check messages μv
After the initialization step, the decoder runs for a maximum of Itmax iterations. During one decoding iteration, the message update is performed block-column by block-column, until all block-columns in the PCM have been processed. In the algorithm, and without loss of generality, we assume that the block-columns are processed sequentially from the first to the last one.
During each block-column processing, the computation of the messages and the update of the CNSs are organized in three steps: the CNU-Generator step, the VNU step and the CNU-Updater step.
New check-to-variable messages μc
Depending on the particular implementation of the algorithm and the type of variable node update Φv, the initialization step and the VNU step can change. For example, if a FAID algorithm is used in the VL decoder, the initialization is performed with the direct channel outputs y, while if a min-sum algorithm is used in the VL decoder, the initialization is performed with the LLRs.
The hard-decision estimates {x̂n}1≤n≤N, which constitute the output of the algorithm, are computed during the VNU step. They are deduced from the APPs (8) using the messages μc
the decision on the codeword bits using equation (9). If the hard-decision estimates verify the zero syndrome condition (10), then they form a valid codeword. The APP values can be computed at the end of the Itmax iterations, or alternatively can be computed during the decoding process, at the end of each iteration or at the end of each block-column processing. In case of a computation during decoding, the value of the syndrome Hx̂ can be used as an early stopping criterion. Whenever the syndrome is equal to 0, the decoder can be stopped since it has converged to a valid codeword.
As described in Algorithm 1, the CNM needs to be accessed several times during the iterative decoding. During each block-column processing, the CNU-Generator reads dv groups of L CNS values, stored at the addresses of the CNM corresponding to the block-rows that contain non-zero CPMs in the processed block-column. Similarly, the CNU-Updater accesses the CNM dv times for reading and dv times for writing during each block-column processing. This represents a total of 3 dv Nb accesses to this memory during one decoding iteration.
The CNM is a large memory, and the read/write (R/W) accesses to it represent a large portion of the total power consumed by the hardware architecture. The purpose of this invention is to reduce the number of accesses to the CNM, while still implementing an accurate VL decoding of the QC-LDPC code, without losing any error correction performance. We achieve this goal by proposing a specific LDPC code design, with an organization of the parity-check matrix in clusters of CPMs. The decoder architecture is implemented such that the modules, and especially the CNU-Updater, can process a collection of κ consecutive block-columns with fewer than 3 dv κ memory accesses. As a result, the invention targets a modified VL iterative decoder which will consume less power than the classical decoders, without sacrificing the error correction performance.
SUMMARY OF THE INVENTION
The present invention relates to a vertical layered iterative message passing algorithm to decode QC-LDPC codes.
The present invention relates to a method and hardware apparatus implementing vertical layered LDPC decoders targeting very low power consumption. This is achieved by designing a QC-LDPC code, for which the non-zero circulant permutation matrices (CPM) are placed at specific locations in the parity-check matrix of the code, forming concentrated clusters of CPMs.
The algorithm of the present invention passes messages from the variable nodes to the check nodes in the Tanner graph of the LDPC code, updating the messages with variable node update (VNU) processors and check node update (CNU) processors. The accumulated signs, the smallest magnitudes and the associated positions of the variable-to-check messages form the check node states (CNS), which are stored in a check node memory (CNM). The CNU is implemented in two steps, with two different processing units: the CNU-Generator and the CNU-Updater.
Specific implementations of the hardware modules of the decoder take advantage of the organization in clusters in order to minimize the number of read and write (R/W) accesses to the memories. For example, when processing a cluster of κ consecutive CPMs, the CNU-Generator reads the check node memory only once instead of κ times, and the CNU-Updater reads one time and writes one time in the check node memory instead of performing κ reads and κ writes. This gives a factor of κ reduction in the number of accesses to the check node memory. The larger the cluster size κ is, the larger the power saving will be.
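The access counts stated above can be tallied with a minimal sketch. It assumes that every CPM belongs to a full cluster of size κ, and it counts one access per group of L check node states; the function names and the numerical values are hypothetical.

```python
def cnm_accesses_classical(num_cpms):
    """Per CPM: 1 read (CNU-Generator) + 1 read and 1 write (CNU-Updater)."""
    return 3 * num_cpms

def cnm_accesses_clustered(num_cpms, kappa):
    """Per cluster of kappa CPMs: 1 generator read + 1 updater read + 1 write.

    Assumes all CPMs are grouped in full clusters of size kappa.
    """
    if num_cpms % kappa != 0:
        raise ValueError("num_cpms must be a multiple of kappa in this sketch")
    return 3 * (num_cpms // kappa)

# Hypothetical figures: 24 CPMs processed in clusters of size 4.
classical = cnm_accesses_classical(24)
clustered = cnm_accesses_clustered(24, 4)
```

In this hypothetical case the ratio `classical / clustered` equals the cluster size κ=4.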
Thanks to the reduction of R/W accesses, we propose to process several CPMs simultaneously using a single instance of the CNU-Updater. This can be achieved provided that the clusters of CPMs are placed at specific locations within a higher-order generalized layer. A higher-order generalized layer of order μ is defined by a sub-matrix of the parity-check matrix containing at most μ CPMs in each block-column of the sub-matrix.
Within a higher-order generalized layer of order μ, the placement of clusters needs to follow a particular constraint, called the non-colliding clusters (NCC) constraint. This constraint ensures that no two clusters have their last CPM in the same block-column, allowing the CNU-Updater to process multiple (up to μ) clusters in parallel.
We furthermore add the constraint that the set of clusters in a higher-order generalized layer of order μ can be decomposed into μ groups of non-overlapping clusters. Within a cluster group, no two clusters can have a CPM in the same block-column of the generalized layer. The organization of clusters into non-overlapping groups allows the efficient instantiation of the CNU-Updater hardware.
We describe several preferred embodiments of the invention, depending on the cluster size κ and the generalized layer order μ, each of which follows the NCC constraint. The preferred embodiments are denoted NCC(κ, μ), and we illustrate examples for the preferred cases NCC(2, 2), NCC(3, 2), NCC(3, 3), NCC(4, 4), NCC(6, 4) and NCC(8, 4).
In order to achieve the reduction of R/W accesses, the apparatus for the CNU-Updater makes use of specific units, called Pre-Updaters, which update local check-node states corresponding only to the CPMs inside a cluster, before updating the CNSs in the check-node memory at the end of the cluster processing. In a CNU-Updater for an order-μ generalized layer, there are μ Pre-Updaters, each one in charge of a group of non-overlapping clusters. Similarly, the CNU-Updater uses μ local Sign Accumulators to process in parallel the sign states of the CNS for the μ groups of clusters.
The present invention includes an Initializer module, which is in charge of computing the syndrome bits from the channel values. The syndrome bits are used to initialize the sign states of the CNSs before the first decoding iteration. The apparatus for the Initializer is impacted by the organization in clusters, and is implemented using μ local Sign Accumulators, one for each cluster group inside a higher-order GL.
The present invention includes also a Validator module, used to compute on the fly the syndrome bits of the hard decision estimates, and to stop the decoder whenever the syndrome is all-zero. In order to compute accurately the syndrome bits, stored in a Syndrome Memory, the apparatus for the Validator module computes for each higher-order GL full cluster syndromes and partial cluster syndromes. The partial cluster syndromes correspond to the value of the syndrome bits when the processed clusters are not finished. The Validator module then combines the full cluster syndrome and the partial cluster syndromes to compute the syndrome of the whole code and take the decision to stop the decoder when the whole code syndrome is all-zero.
For a more complete understanding of the invention, reference is made to the following description and accompanying drawings, in which:
The method in the present invention relates to an iterative message-passing LDPC decoder operating on a QC-LDPC code, whose parity-check matrix consists of circulant permutation matrices of size L×L. For a parity-check matrix with Nb block-columns and Mb block-rows, the j-th block-column contains dv(j) CPMs, and the i-th block-row contains dc(i) CPMs. For simplicity of the presentation, and when the context is clear, the indices i and j will be dropped from these notations.
The message-passing decoder of the present invention follows a vertical layered (VL) scheduling, in which the main processing modules are a variable node update (VNU) processor and a check node update (CNU) processor. The CNU processor is itself composed of two main modules: the CNU-Generator and the CNU-Updater. We can refer to Algorithm 1 for more details. The VL decoder processes the Nb block-columns of the QC-LDPC code in an arbitrary order, during one decoding iteration. We assume without loss of generality that the block-columns are processed sequentially from index j=1 to index j=Nb.
During one decoding iteration, the current block-column will be denoted as the processed block-column. Furthermore, each block-column is composed of a set of L variable nodes, which will be denoted as processed VN group. In each block-column, there are dv CPMs located in different block-rows. In a processed block-column, the CPMs are denoted processed CPMs, and the corresponding block-rows as processed block-rows. The set of L CNs in a processed block-row is called processed CN group. In each processed block-row, the set of L CNSs is denoted CNS group. A CNS group is composed of a sign state group and a magnitude state group.
All modules in the hardware implementation of the current invention will process groups of L data in parallel, accepting groups of L data as inputs and producing groups of L data as outputs. The type of data could be messages, syndrome bits, codeword bits, or check node states, depending on the module.
An apparatus for the top level architecture of the VL decoder proposed in this invention is depicted on
At the beginning of the decoding procedure, the Initializer module 202 takes groups of L channel values 200 as inputs, and uses them to initialize the CNSs in the CNM. It computes the initial sign states sm for all CNs, and the initial values of the magnitude states, which depend only on the magnitudes of the channel values. Alternatively, the initial magnitude states could be set by the Initializer to fixed, predetermined values. The initial sign states are either stored in a local memory, called Syndrome Memory, in the Initializer module, or directly stored in the CNM. During initialization, the channel signs are copied as initial values of the variable-to-check message signs, which are stored in the sign memory 206. After the initialization is performed, the CNM contains the initial values of the CNSs {sm; MAG}.
After initialization, the CNU and the VNU processors exchange messages iteratively, through the Barrel Shifter units. The barrel shifters re-order the message addresses according to the CPMs of the QC-LDPC code. A Barrel Shifter unit is composed of a maximum of dv,max barrel shifters, which can process all messages within a block-column in parallel.
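The re-ordering performed by one barrel shifter amounts to a cyclic rotation of a group of L messages by the CPM value. The sketch below is a software illustration only; the rotation direction is an assumption of this sketch, and the function name is hypothetical.

```python
def barrel_shift(group, k):
    """Cyclically rotate a group of L messages by k positions.

    Assumed direction: output[i] = group[(i + k) % L]. A negative k
    rotates the other way, so barrel_shift(barrel_shift(g, k), -k) == g.
    """
    L = len(group)
    k %= L
    return group[k:] + group[:k]
```

For a group `[0, 1, 2, 3]` and CPM value 1, this returns `[1, 2, 3, 0]`; applying the opposite shift restores the original order, which is what allows messages to travel in both directions between the VNU and the CNU processors.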
For a processed block-column of degree dv, the decoder proceeds as follows. The CNU-Generator 205 reads dv CNS groups from the CNM 203 and dv groups of message signs from the sign memory 206 to compute the check-to-variable message groups. The details of the CNU-Generator architecture are given later in the description. The check-to-variable messages μc
The VNU processor 208 receives dv check-to-variable message groups and the channel value group corresponding to the processed block-column. It computes dv variable-to-check messages μv
The CNU-Updater uses as inputs the variable-to-check messages from 208, the associated CNSs from 203 and the corresponding delayed signs from 207, and computes the new CNSs which are written in the CNM 203. The details about the functioning of the CNU-Updater are given later in the description.
Note that the signs that are needed to update the CNSs are the ones of the corresponding variable-to-check message μv
The computation and updating of messages and CNSs described above is repeated block-column by block-column until the entire parity-check matrix has been traversed, which constitutes one decoding iteration. Then, the decoding process starts again from the first block-column in the next decoding iteration.
The VL decoder in this invention is also equipped with a stopping criterion which allows us to output a valid codeword after any block-column processing. In order to do so, the VNU processor also computes APP values following (8) for all VNs in the processed block-column and computes hard-decision estimates x̂. The hard-decision estimates are sent to the Validator module 209 to check if the decoder has converged to a valid codeword. The Validator computes the syndrome bits of the LDPC code with the most recent values of x̂ received from the VNU processor, which are stored in a Syndrome Memory. The Validator stops the decoding whenever the syndrome bits are all zero (following Eq. (10)), meaning that the hard-decision estimates x̂ form a valid codeword 210.
In this invention, we propose to split the memories that store information about the CNs into one or several pieces, with the objective of reducing the number of R/W accesses in each memory piece to its minimum. This concerns the CNM 203 in the CNU processor, and the Syndrome Memories in the Initializer 202 and in the Validator 209.
Let us take the example of the CNM, and let Γ be the number of pieces composing the CNM. Each memory piece CNM(γ), γ=1, . . . , Γ, is associated with a set of CNs, and therefore with a set of block-rows of the parity-check matrix. The submatrix associated with the γ-th memory piece is denoted Hγ. Whenever the decoder processes a CPM that is located in Hγ, it will access data in the piece CNM(γ).
The number of pieces in which the CNM is split has a direct impact on the implementation of the hardware units of the CNU processor. Since there are Γ pieces of the CNM, the CNU-Updater 204 and the CNU-Generator 205 are composed of Γ units processing in parallel the CPMs for each submatrix Hγ.
The purpose of splitting the CNM into several pieces is to allow each of the Γ units in the CNU processor to access its data from an independent memory, at the same time. This can only be achieved if the CPMs in the QC-LDPC code are placed at specific locations. Similarly, the Initializer module 202 is in charge of computing the initial values of the sign states, which are equal to the syndrome bits, stored in a Syndrome Memory. Since there are Γ pieces of the Syndrome Memory, there are Γ units computing in parallel the syndrome bits from the channel signs.
Finally, the Validator module 209 computes the syndrome bits from the hard-decision estimates and stores them in a Syndrome Memory. There are also Γ units in the Validator, computing in parallel the syndrome bits.
In a classical QC-LDPC code, the submatrix Hγ is usually defined as either a layer or a generalized layer, which contains at most one CPM per block-column of the submatrix.
In the present invention, we introduce the new concept of higher-order Generalized Layer, or order-μ Generalized Layer (order-μ GL). The definition is a direct generalization of the classical generalized layers:
- An order-μ generalized layer is defined as the concatenation of block-rows of H, such that in each block-column of the submatrix defined by the higher-order generalized layer, there are at most μ non-zero CPMs while the other blocks are all-zero blocks.
- A full order-μ generalized layer has further the property that each block-column contains exactly μ non-zero CPMs.
An order-1 GL is a classical generalized layer. From the definition, it follows that the vertical concatenation of μ generalized layers forms an order-μ generalized layer. However, an order-μ generalized layer cannot always be decomposed into μ generalized layers.
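The definition above can be checked directly on a base matrix. The following minimal sketch assumes a base matrix stored as nested lists with `None` for all-zero blocks and 0-based block-row indices; the function names and the example matrix are hypothetical.

```python
def gl_order(B, block_rows):
    """Smallest mu for which the given block-rows form an order-mu GL,
    i.e. the maximum number of non-zero CPMs found in any block-column."""
    n_cols = len(B[0])
    return max(sum(B[i][j] is not None for i in block_rows)
               for j in range(n_cols))

def is_full_gl(B, block_rows, mu):
    """True when every block-column contains exactly mu non-zero CPMs."""
    n_cols = len(B[0])
    return all(sum(B[i][j] is not None for i in block_rows) == mu
               for j in range(n_cols))

# Hypothetical 3 x 3 base matrix (None = all-zero block).
B_example = [[0,    1,    None],
             [None, 2,    3],
             [4,    None, 5]]
```

Here the three block-rows together form a full order-2 GL, yet every pair of block-rows shares a block-column, so no two of them can be merged into a classical generalized layer. This is one small instance of an order-μ GL that does not split into μ generalized layers.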
In order to minimize the number of R/W accesses to the memory piece corresponding to a submatrix Hγ, the CPMs must be localized in a very specific way. They have to be concentrated in clusters of CPMs such that the modules can process multiple CPMs with a limited number of memory accesses.
Let us define a κ-cluster of CPMs as a κ-tuple of consecutive non-zero CPMs in a block-row of a submatrix:
Hm,1:κ=[Hm,1,Hm,2, . . . ,Hm,κ] (12)
where Hm,k is a non-zero CPM.
More generally, a κ-cluster of CPMs could contain fewer than κ non-zero CPMs within the cluster. For example, a κ-cluster with two all-zero blocks and κ−2 non-zero CPMs could have the following structure:
Hm,1:κ=[Hm,1,0,Hm,3, . . . ,0,Hm,κ] (13)
When a κ-cluster is full of CPMs, we will refer to it as κ-cluster or full κ-cluster, otherwise it will be referred to as a sparse κ-cluster. The objective of this organization of CPMs is to reduce to the minimum the number of required R/W accesses to the memories while processing the κ-cluster. The maximum hardware efficiency is achieved when the matrix is organized in full clusters, in which case we have a minimum number of memory accesses for a given number κ of processed CPMs. We will discuss only the case of full clusters in the rest of the description. The generalization of the hardware modules to sparse κ-clusters follows easily. The organization of CPMs in concentrated clusters has an impact on the hardware realization of the Initializer, of the CNU processor, and of the Validator.
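Full clusters can be located in one block-row of a base matrix by scanning for maximal runs of consecutive non-zero CPMs. The sketch below assumes `None` marks an all-zero block and reports 0-based inclusive block-column spans; it treats every maximal run as one full cluster and does not attempt to identify sparse clusters. The function name and the example row are hypothetical.

```python
def find_clusters(row):
    """Return (start, end) spans of maximal runs of non-zero CPMs in a block-row."""
    clusters = []
    start = None
    for j, entry in enumerate(row):
        if entry is not None and start is None:
            start = j                        # a new run begins
        elif entry is None and start is not None:
            clusters.append((start, j - 1))  # the run ended at the previous column
            start = None
    if start is not None:                    # run reaching the last block-column
        clusters.append((start, len(row) - 1))
    return clusters

# Hypothetical block-row: a 2-cluster followed by a 3-cluster.
row_example = [5, 1, None, None, 7, 0, 3, None]
```

For `row_example`, the spans are `(0, 1)` and `(4, 6)`, i.e. cluster sizes κ=2 and κ=3.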
In an order-μ GL, the κ-clusters need to be placed at very specific locations in order to avoid memory access port violations.
- [NCC] We impose the constraint that two or more clusters cannot end at the same block-column in the higher-order GL. We call this constraint non-colliding clusters (NCC) constraint.
We will assume throughout this description that the clusters in an order-μ GL follow the NCC constraint.
In several preferred embodiments of this invention, the clusters in a higher-order GL have the same size κ. Note that thanks to the definition of sparse clusters, any full cluster of size κ can be seen as a sparse κ′-cluster with κ′>κ by appending κ′−κ zero blocks to the cluster. In a full order-μ GL with constant size clusters, the NCC constraint ensures that clusters do not end at the same block-column, but also do not start at the same block-column. Furthermore, a full order-μ GL cannot be composed of κ-clusters with κ<μ, otherwise this would violate the NCC constraint.
It follows that in a full order-μ GL that follows the NCC constraint, we can split the set of κ-clusters into μ non-overlapping groups of clusters. For each group of non-overlapping clusters, there is at most one CPM in each block-column of the corresponding submatrix. In the preferred embodiments where the order-μ GL is full, for each group of non-overlapping clusters, there is exactly one CPM in each block-column.
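The NCC constraint and the split into non-overlapping groups can be illustrated with a minimal Python sketch. It rests on assumptions that are not part of the description: the GL is given as a 0/1 map of non-zero CPMs, clusters in a block-row are separated by at least one all-zero block, tail-biting is ignored, and the helper names (`find_clusters`, `satisfies_ncc`, `split_into_groups`) are hypothetical.

```python
from typing import List, Tuple

Cluster = Tuple[int, int, int]  # (block_row, first_col, last_col), inclusive


def find_clusters(base: List[List[int]]) -> List[Cluster]:
    """Return the runs of consecutive non-zero CPMs in each block-row.

    Assumption: no tail-biting, and clusters of one block-row are
    separated by at least one all-zero block.
    """
    clusters = []
    for m, row in enumerate(base):
        start = None
        for j, entry in enumerate(list(row) + [0]):  # sentinel closes a trailing run
            if entry and start is None:
                start = j
            elif not entry and start is not None:
                clusters.append((m, start, j - 1))
                start = None
    return clusters


def satisfies_ncc(clusters: List[Cluster]) -> bool:
    """NCC: no two clusters end at the same block-column."""
    ends = [last for _, _, last in clusters]
    return len(ends) == len(set(ends))


def split_into_groups(clusters: List[Cluster]) -> List[List[Cluster]]:
    """Greedy first-fit split into groups whose clusters share no block-column."""
    groups: List[List[Cluster]] = []
    for c in sorted(clusters, key=lambda c: c[1]):
        for g in groups:
            if g[-1][2] < c[1]:  # previous cluster of the group already ended
                g.append(c)
                break
        else:
            groups.append([c])
    return groups


# Order-2 GL with 2-clusters (not full, because tail-biting is not modeled).
base = [[1, 1, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 1]]
clusters = find_clusters(base)
print(satisfies_ncc(clusters), len(split_into_groups(clusters)))
```

In this example the three clusters end at block-columns 1, 2 and 3, so the NCC constraint holds, and the greedy split uses two groups, which matches the order of the GL.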
The organization of CPMs into clusters allows us to limit the number of R/W accesses to the CNM memory in the CNU-Updater, and to the Syndrome Memories in the Initializer and in the Validator. We read from these memories at the beginning of each processed cluster, and write to these memories at the end of each processed cluster, instead of doing the R/W for each CPM. For each cluster of size κ, the number of R/W accesses is reduced from 2κ to only 2. The reduction of R/W accesses creates free time slots during which the memories are not accessed, and that can be used to process simultaneously other clusters in the same block-column of the submatrix Hγ. As a result, the organization of CPMs in clusters allows us to consider submatrices that have more than one CPM per block-column, while still being able to process all CPMs in a block-column simultaneously, without memory access conflicts.
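As a small worked illustration of the access count, the following sketch compares the two access patterns for a given number of full κ-clusters. The function name and its arguments are hypothetical and only restate the 2κ versus 2 count given above.

```python
def rw_accesses(num_clusters: int, kappa: int, per_cluster: bool) -> int:
    """Count memory accesses for `num_clusters` full clusters of size `kappa`.

    per_cluster=False: one read and one write per CPM (2 * kappa per cluster).
    per_cluster=True:  one read at cluster start, one write at cluster end.
    """
    return num_clusters * (2 if per_cluster else 2 * kappa)


# Example: 4 clusters of size 3.
print(rw_accesses(4, 3, per_cluster=False), rw_accesses(4, 3, per_cluster=True))
```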
We now discuss several preferred embodiments of the invention, which relate to particular organizations of the clusters inside a higher-order GL. The length κ of the clusters and the order μ of the higher-order GL define each preferred embodiment, and the clusters are assumed to follow the NCC constraint. We will denote by NCC(κ,μ) the preferred embodiment describing the structure of a higher-order GL.
Note that the present invention is applicable to any cluster size, and any GL order. In addition, when the higher-order GL contains sparse clusters, κ and μ represent actually maximum values, instead of actual values, for the cluster length and GL order. The preferred embodiments will be presented assuming full clusters only, and full higher-order GL only, in which case we must have κ≥μ. The generalization to sparse clusters, and to non-full higher-order GLs follows easily.
Let us also note that we can combine GLs with different orders to form QC-LDPC codes with various VN degrees. For example, we can combine a full NCC(3,3) GL and a full NCC(3, 2) GL to obtain a regular QC-LDPC code with constant VN degree dv=5. When using sparse clusters instead of full clusters in this example, the GLs will not be full anymore, and we can build an irregular QC-LDPC code with VN degrees dv∈{3,4,5}. The obtained irregular QC-LDPC will be composed of 2 higher-order GLs, following the NCC constraint.
A first preferred embodiment of this invention concerns the case of full clusters of size κ=2 in an order-2 GL, denoted NCC(2, 2). We give an illustration following the preferred embodiment NCC(2, 2) in
Note that although the tail-biting property of the cluster organization is preferred, this is not mandatory. In the case where a cluster started at the end of the matrix does not finish cyclically at the beginning of the same block-row, the decoder can introduce pauses between iterations to ensure that the NCC constraint is fulfilled, and that no memory access port violation occurs. When the tail-biting property is enforced, no pause is necessary between decoding iterations.
In
The preferred embodiment denoted NCC(3, 3) is shown in 502. The order-3 GL is composed of clusters with length κ=3, which are split into 3 non-overlapping groups. Like in the other embodiments, the cluster organization and NCC constraint are tail-biting: a 3-cluster starts at the end of block-row 2 and finishes at the start of the same block-row, and similarly in block-row 6. In this preferred embodiment, in each and every block-column, the cluster in one group starts, the cluster in another group ends, and the cluster in the last group is in the middle.
In
In the preferred embodiment denoted NCC(6, 4), shown in 603, the clusters have length κ=6 and form an order-4 GL following the NCC constraint. We have therefore 4 non-overlapping groups in this preferred case.
Finally, the preferred embodiment denoted NCC(8, 4) is shown in 605. In this case, the order-4 GL is composed of length κ=8 clusters, arranged in 4 non-overlapping groups.
Let us now describe in detail the implementation and functioning of the modules in the decoder architecture that are affected by the organization in clusters. This concerns the Initializer module 202, the CNU-Updater 204 in the CNU processor, the CNU-Generator 205 in the CNU processor, and finally the Validator module 209. The other parts of the architecture follow the principles of a generic iterative VL QC-LDPC decoder, and are not affected by the organization in clusters of CPMs.
In a preferred embodiment of the Initializer apparatus, the initial magnitude states of the CNS are set to fixed, predetermined values, and the Initializer module is only in charge of computing the initial syndrome bits, which are used as initial values of the sign states in the CNSs. The initial sign states are equal to the syndrome bits computed from the channel signs.
The Initializer module is composed of Γ units processing in parallel the Γ higher-order GLs. We will describe the functioning of one unit, which is in charge of computing the syndrome bits of a single order-μ GL.
One unit of the Initializer takes as inputs groups of L channel signs corresponding to the VNs in the processed block-columns. Since there are μ CPMs in each block-column of the higher-order GL, the L channel signs are first barrel-shifted in accordance with the μ CPM shift values, and the μ groups of shifted signs are used in the Initializer to compute the syndrome bits of the corresponding block-rows.
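The barrel-shifting step can be sketched as follows. The rotation direction (left) and the helper name `barrel_shift` are assumptions made for illustration; the actual convention depends on how the CPM shift values are defined.

```python
from typing import List


def barrel_shift(bits: List[int], shift: int) -> List[int]:
    """Cyclically rotate a group of L bits to the left by `shift` positions."""
    s = shift % len(bits)
    return bits[s:] + bits[:s]


# One group of L = 4 channel signs, shifted once per CPM of the block-column
# (here mu = 2 CPMs with hypothetical shift values 1 and 3).
channel_signs = [1, 0, 0, 0]
shifted_groups = [barrel_shift(channel_signs, s) for s in (1, 3)]
print(shifted_groups)
```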
The incoming signs groups belonging to a κ-cluster are accumulated using Sign Accumulator units, which are used to compute local syndrome bits corresponding only to the channel signs associated with the κ-clusters. The local syndrome bits groups produced by the Sign Accumulator units are then used in the Initializer to compute the full syndrome.
Let us first describe the functioning of the Sign Accumulators.
For 2-clusters, the Sign Accumulator unit receives two channel signs groups, sequentially. The signs group corresponding to the first CPM of the cluster is stored in a register 702 and is xored with the signs group corresponding to the second CPM of the cluster, in order to obtain the local syndrome bits group 703 of the processed cluster. When the cluster size is large with κ>2, it is more efficient to implement a recursive computation of the syndrome bits group for the block-row containing the processed cluster. In
As an illustrative, non-limiting example,
The fact that the three inputs 801-803 belong to different cluster groups ensures that the channel signs associated with a given cluster group always arrive at the input of the same Sign Accumulator unit. Therefore, the Sign Accumulator units 804-806 effectively compute the local syndrome bits associated with the clusters of their designated cluster group, in κ successive steps. By virtue of the NCC constraint in an order-μ GL, only one cluster among the three groups has its last CPM in a given block-column. Let us assume that the cluster in group-A ends at block-column j. When processing block-column j, the Sign Accumulator for group-A 804 has finished the computation of the local syndrome bits group, which is selected by 807 and sent to the XOR 808 to be combined with the content of the Syndrome Memory 809 for the processed block-row. The output of the XOR contains the updated syndrome bits corresponding to all previously processed clusters in the processed block-row. The Sign Accumulators for group-B and group-C continue to accumulate their local syndrome bits, since at block-column j, the clusters in groups B and C have not ended.
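The behavior of one Initializer unit described above can be mimicked in software with the following sketch. It is a behavioral model under stated assumptions, not the hardware: the class name `InitializerUnit`, the input format (one optional tuple per cluster group), and the explicit `is_last_cpm` flag are hypothetical, and the signs are assumed to be already barrel-shifted.

```python
from typing import List, Optional, Tuple

Bits = List[int]


def xor_bits(a: Bits, b: Bits) -> Bits:
    return [x ^ y for x, y in zip(a, b)]


class InitializerUnit:
    """One Sign Accumulator per cluster group plus a Syndrome Memory."""

    def __init__(self, num_block_rows: int, L: int, num_groups: int):
        self.L = L
        self.syndrome_memory = [[0] * L for _ in range(num_block_rows)]
        self.accumulators = [[0] * L for _ in range(num_groups)]

    def process_block_column(
        self, inputs: List[Optional[Tuple[int, Bits, bool]]]
    ) -> None:
        """inputs[g] = (block_row, shifted_signs, is_last_cpm), or None if idle."""
        finished = 0
        for g, item in enumerate(inputs):
            if item is None:
                continue
            row, signs, is_last_cpm = item
            self.accumulators[g] = xor_bits(self.accumulators[g], signs)
            if is_last_cpm:
                finished += 1
                # Single read-modify-write of the memory per cluster.
                self.syndrome_memory[row] = xor_bits(
                    self.syndrome_memory[row], self.accumulators[g]
                )
                self.accumulators[g] = [0] * self.L
        if finished > 1:
            raise ValueError("NCC violated: two clusters end in one block-column")


# Three 2-clusters in two groups, L = 2, four block-columns.
unit = InitializerUnit(num_block_rows=3, L=2, num_groups=2)
unit.process_block_column([(0, [1, 0], False), None])
unit.process_block_column([(0, [1, 1], True), (1, [0, 1], False)])
unit.process_block_column([(2, [1, 0], False), (1, [1, 1], True)])
unit.process_block_column([(2, [0, 0], True), None])
print(unit.syndrome_memory)
```

The Syndrome Memory is touched only when a cluster ends, which is the access pattern that the NCC constraint is meant to keep conflict-free.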
Once all block-columns in the entire matrix have been processed, the syndrome bits stored in the Syndrome Memory 809 are output and used to initialize the sign states of the CNM, for the corresponding higher-order GL. Although we described only specific examples of cluster sizes and GL orders in
As shown in the apparatus of the top level architecture in
Let us first discuss the CNU-Generator module. In each order-μ GL, there are μ message generator units processing in parallel the μ groups of non-overlapping clusters. Each of the message generator units is in charge of reading the CNS group in the CNM, then reading the message signs groups from the previous iteration in the Sign Memory, and determining the signs and magnitudes of the check-to-variable messages groups sent to the VNU processor.
The CNU-Generator proceeds as follows. For a cluster of size κ, the message generator unit reads the associated CNS {sm; MAG} when the cluster starts, and generates κ groups of L check-to-variable messages, in κ successive steps. For each CPM in the cluster, the signs of the check-to-variable messages are computed as the XOR between the signs from the previous iteration and the values of the sign states sm of the CNSs. The magnitudes of the check-to-variable messages are equal to one of the two smallest magnitudes (mag1, mag2) of the magnitude states MAG, and are determined in the following way. The magnitude is equal to the second smallest magnitude mag2 if the index of the output message matches index1, and equal to the smallest magnitude mag1 otherwise. The L check-to-variable messages of each processed CPM in the cluster are then sent out to the VNU through the Barrel Shifters.
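For a single check node and a single output message, the generation rule above can be written as the following sketch. The function name `generate_c2v` and the scalar, one-message-at-a-time interface are assumptions made for illustration; signs are represented as bits (0 or 1).

```python
from typing import Tuple


def generate_c2v(
    sign_state: int, mag1: int, index1: int, mag2: int,
    prev_sign: int, index: int,
) -> Tuple[int, int]:
    """Return (sign bit, magnitude) of one check-to-variable message.

    sign_state : sign state s_m of the check node
    prev_sign  : sign of the message on this edge from the previous iteration
    index      : index of the output message, compared against index1
    """
    sign = sign_state ^ prev_sign
    magnitude = mag2 if index == index1 else mag1
    return sign, magnitude


# The edge that supplied the smallest magnitude receives mag2; others get mag1.
print(generate_c2v(1, 2, 5, 4, 0, 5), generate_c2v(1, 2, 5, 4, 1, 3))
```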
The VNU processor receives the check-to-variable messages groups and the channel values group corresponding to the processed block-column, and computes the variable-to-check messages groups that are sent to the CNU-Updater.
The CNU-Updater is in charge of computing the new CNSs, using the newly computed variable-to-check messages coming from the VNU. The module is composed of two parts: a magnitude states CNU-Updater and a sign states CNU-Updater.
In
We discuss only the preferred embodiments where the CNS is composed of w=2 magnitude pairs.
For each cluster group, the magnitudes of the input messages enter into Pre-Updater units 905-906. The Pre-Updater units are used to compute the local magnitude states MAG*={(mag*1,index*1); (mag*2,index*2)}, which correspond to the two smallest magnitudes and the corresponding indices, for the messages in the processed clusters only. More precisely, for a cluster of size κ, there are κ incoming variable-to-check messages from the VNU for each CNS in the processed block-row. The Pre-Updater computes and sorts the 2 smallest magnitudes (mag*1, mag*2) among the κ message magnitudes, and associates them with their local indices (index*1,index*2). The local index of a message indicates its location within the cluster, i.e. index*k∈{1, . . . , κ}.
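The Pre-Updater computation for one check node can be sketched as a running two-minimum search over the κ incoming magnitudes. The function name `pre_update` is hypothetical, and the tie-breaking rule (the earlier message wins) is an assumption, since the description does not specify it.

```python
from typing import List, Tuple

Entry = Tuple[float, int]  # (magnitude, local index in 1..kappa)


def pre_update(magnitudes: List[float]) -> Tuple[Entry, Entry]:
    """Return ((mag*1, index*1), (mag*2, index*2)) for one cluster.

    Messages arrive sequentially, so the two minima are tracked on the fly.
    """
    inf = float("inf")
    best: Entry = (inf, 0)
    second: Entry = (inf, 0)
    for k, mag in enumerate(magnitudes, start=1):
        if mag < best[0]:
            second, best = best, (mag, k)
        elif mag < second[0]:
            second = (mag, k)
    return best, second


# Cluster of size 3 with magnitudes 3, 1, 2.
print(pre_update([3, 1, 2]))
```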
When the Pre-Updater unit has finished the update for the processed cluster, the local magnitude states group for this cluster is selected by 909 and sent to the magnitude state CNU-Updater 911 in order to compute the new magnitude state, denoted MAGnew. The magnitude state CNU-Updater receives the local magnitude states from the Pre-Updater for each CNS in the processed block-row. It compares the two smallest magnitudes read from the CNM with the two smallest magnitudes in the local magnitude states, for a total of four magnitudes. The magnitude state CNU-Updater sorts these four magnitudes and outputs only the two smallest ones, together with their associated global indices.
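The four-way comparison can be sketched as follows for one check node. This sketch follows the comparison as stated above and nothing more; in particular, it does not model how magnitudes stored during earlier iterations are retired. The mapping from local to global index (first block-column of the cluster plus local offset) and the function name `update_magnitude_state` are assumptions.

```python
from typing import Sequence, Tuple

Entry = Tuple[int, int]  # (magnitude, index)


def update_magnitude_state(
    stored: Sequence[Entry], local: Sequence[Entry], cluster_first_col: int
) -> Tuple[Entry, Entry]:
    """Merge the stored state MAG with the local state MAG* of one cluster.

    stored : two (magnitude, global index) pairs read from the CNM
    local  : two (magnitude, local index in 1..kappa) pairs from the Pre-Updater
    """
    # Assumed index mapping: the cluster spans consecutive block-columns.
    local_global = [(mag, cluster_first_col + idx - 1) for mag, idx in local]
    candidates = sorted(list(stored) + local_global, key=lambda e: e[0])
    return candidates[0], candidates[1]


mag_new = update_magnitude_state(
    stored=((2, 0), (6, 7)), local=((1, 2), (4, 3)), cluster_first_col=10
)
print(mag_new)
```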
Let us now describe the functioning of the sign state CNU-Updater. For each cluster group, the signs of the input messages are combined with the delayed signs, and accumulated in order to compute the new sign states groups. The delayed signs correspond to the signs of the processed messages from the previous iteration. They are xored with the new message signs of the current iteration in order to detect sign changes.
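For one check node, the sign state update over a cluster can be sketched as below. The final XOR of the accumulated sign changes with the stored sign state is our reading of "accumulated in order to compute the new sign states" and is labeled here as an assumption; the function name `update_sign_state` is hypothetical.

```python
from typing import Sequence


def update_sign_state(
    sign_state: int, new_signs: Sequence[int], delayed_signs: Sequence[int]
) -> int:
    """Return the new sign state of one check node after a cluster.

    new_signs     : signs of the kappa messages of the current iteration
    delayed_signs : signs of the same messages from the previous iteration
    """
    accumulated = 0
    for new, old in zip(new_signs, delayed_signs):
        accumulated ^= new ^ old  # 1 when the sign of this message changed
    # Assumption: the stored sign state is flipped once per sign change.
    return sign_state ^ accumulated


# One sign change among three messages flips the sign state.
print(update_sign_state(1, [1, 0, 1], [1, 1, 1]))
```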
The sign changes serve as inputs to the Sign Accumulator units 907-908 for each cluster group. The Sign Accumulator units are identical to the ones used in the Initializer, and are described in
The CNU-Updater in
The CNU-Updater contains μ Pre-Updater units, each one processing the clusters within one non-overlapping group. We describe in this paragraph the functioning of a single Pre-Updater unit.
For the case of clusters of length κ>2 the Pre-Updater unit determines, for each CNS in the processed block-row, the two minimum magnitudes among the κ input messages magnitudes, and associates the two corresponding local indices.
Let us now present an apparatus for the implementation of the Validator module 209. The architecture of the Validator module is shown in
When processing a block-column, a group of L hard-decision estimates 1201, produced by the VNU processor, arrives at the input of the Validator. The hard decision memory 1202 contains a copy of the most recently computed hard-decision estimates. When the module receives L hard-decision estimates corresponding to the j-th block-column, the memory 1202 contains the hard-decision estimates of the current iteration for all block-columns k<j, while it contains the hard-decision estimates of the previous iteration for all block-columns k≥j.
The new hard-decision estimates for block-column j replace the ones from the previous iteration in the hard decision memory 1202. Additionally, a XOR is performed between the newly computed hard-decision estimates and the ones from the previous iteration. Therefore, the XOR unit 1204 outputs the changes in the hard-decision estimates, between the current iteration and the previous iteration. The changes in hard-decision estimates are cyclically shifted by the barrel shifters corresponding to the CPMs in each processed cluster. There is one barrel shifter unit for each group of clusters. Then, they are used as inputs to the Syndrome Updaters 1205 and 1206. Each Syndrome Updater for a higher-order GL contains a Syndrome Memory which stores the syndrome bits of the corresponding GL, and its purpose is to update the values of this memory using the changes in hard-decision estimates. The detailed description of the Syndrome Updater is given subsequently.
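The replace-and-XOR step on the hard decision memory can be sketched as follows. The list-of-lists memory layout and the function name `hard_decision_changes` are assumptions made for illustration.

```python
from typing import List


def hard_decision_changes(
    memory: List[List[int]], j: int, new_estimates: List[int]
) -> List[int]:
    """Store the new estimates of block-column j and return what changed.

    memory[j] holds the L estimates of block-column j from the previous
    iteration until it is overwritten here.
    """
    changes = [old ^ new for old, new in zip(memory[j], new_estimates)]
    memory[j] = list(new_estimates)
    return changes


memory = [[0, 1], [1, 1]]
changes = hard_decision_changes(memory, 1, [1, 0])
print(changes, memory)
```

Only the changes, not the estimates themselves, are forwarded to the Syndrome Updaters, so an unchanged block-column leaves the stored syndrome untouched.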
The outputs of the Syndrome Updaters are then used in the Zero Syndrome Check units 1207-1208 to detect whether the whole syndrome vector is all-zero. In case the syndrome is all-zero, a terminate signal 1210 indicates that the decoder can be stopped, since the hard-decision estimates which are output on 1211 form a valid codeword. The shift register 1203 is used to store the hard-decision estimates immediately after they are received by the module, while the Validator is determining whether the hard-decision estimates constitute a codeword. The shift register has a width of L bits and a depth equal to the total delay that is necessary for the Terminator unit 1209 to generate the terminate signal. This total delay includes the number of pipeline stages in the Validator, as well as the delay induced by the organization in clusters for the computation of the updated syndrome bits in the Syndrome Updaters 1205 and 1206.
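The role of the shift register and of the zero-syndrome test can be sketched as a fixed-depth delay line. The class name `OutputDelayLine`, the helper `syndrome_is_zero`, and the depth value used in the example are hypothetical; the actual depth depends on the pipeline and cluster delays mentioned above.

```python
from collections import deque
from typing import Iterable, List, Optional


def syndrome_is_zero(syndrome_groups: Iterable[List[int]]) -> bool:
    """True when every syndrome bit of every group is zero."""
    return not any(bit for group in syndrome_groups for bit in group)


class OutputDelayLine:
    """Delay each group of L hard-decision estimates by `depth` pushes."""

    def __init__(self, depth: int):
        self.register = deque(maxlen=depth)

    def push(self, group: List[int]) -> Optional[List[int]]:
        full = len(self.register) == self.register.maxlen
        delayed = self.register[0] if full else None
        self.register.append(group)  # the oldest entry drops out when full
        return delayed


line = OutputDelayLine(depth=2)
outputs = [line.push(g) for g in ([1], [2], [3], [4])]
print(outputs, syndrome_is_zero([[0, 0], [0, 0]]))
```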
Let us now describe an apparatus for the hardware implementation of the Validator Syndrome Updater units. The Validator is composed of Γ of these units, one for each higher-order GL. We show on
Each Sign Accumulator receives sequentially κ groups of L barrel-shifted hard-decision estimates 1301-1303. The hard-decision estimates are used to compute local syndrome bits corresponding to the CNs of the processed clusters. Thanks to the NCC constraint on the clusters, during processing of one block-column, no more than one of the μ Sign Accumulators will have completed the accumulation of a local syndrome.
Let us assume without loss of generality, and by way of example, that the cluster in group-A has its last CPM in the processed block-column j. Consequently, during processing of block-column j, the Sign Accumulator 1304 of group-A has accumulated the shifted hard decision estimates for all CPMs in the processed cluster, and computed a local syndrome for the full cluster. This output is denoted cluster local syndrome.
The Sign Accumulator 1304 computes the local syndrome for the group-A clusters, while the Sign Accumulators 1305 and 1306 compute the local syndromes for the clusters in the other two groups. The cluster local syndrome of the finished cluster in group-A is chosen by the multiplexer 1310 to update the Syndrome Memory 1312. In order to do so, the syndrome bits group in the Syndrome Memory corresponding to the block-rows of the finished cluster is xored with the cluster local syndrome and written back to the same location in the memory.
However, during the processing of block-column j, the clusters of groups B and C are not finished, and the content of the Syndrome Memory for the corresponding block-rows does not take into account the hard-decision estimates of the processed block-columns in group-B and group-C. In order to be able to stop the decoder after processing block-column j, the syndrome bits for the block-rows corresponding to group-B and group-C need to take into account the contribution of all the hard-decision estimates from block-columns k≤j. This is achieved by taking snapshots of the cluster local syndromes during processing of block-column j, for cluster groups B and C in the higher-order GL.
The enable signal 1300 triggers a snapshot of the cluster local syndromes output by the Sign Accumulators, and stores them in registers 1307-1309. At block-column j, the snapshot for group-B in register 1308 is the local syndrome of only part of the cluster in group-B, corresponding to the hard decision estimates shifted by the CPMs of the cluster with indices k≤j. The value in this register will be denoted partial cluster local syndrome for the cluster in group-B. Similarly, register 1309 contains the partial cluster local syndrome of only a part of the cluster in group-C. The snapshot for group-A in register 1307 is equal to the cluster local syndrome computed from all CPMs in the cluster, and is denoted full cluster local syndrome.
When selected by the multiplexer 1311, the partial cluster local syndromes for group-B and group-C are combined with the corresponding syndrome bits groups coming from the Syndrome Memory, to form partial cluster syndromes. The partial cluster syndrome for group-B, respectively for group-C, represents the syndrome bits values of the block-rows in group-B, respectively in group-C, when the snapshot was taken, i.e. during processing of block-column j. The multiplexer 1310 selects sequentially the full cluster local syndromes for the three cluster groups, and combines them with the content of the Syndrome Memory for the corresponding block-rows, in order to generate the full cluster syndromes for each group.
As a result, during processing of block-column j, the Syndrome Updater takes a snapshot of the partial cluster local syndrome for all groups, updates the Syndrome Memory for the block-rows of the group-A cluster, and outputs the full cluster syndrome for group-A on 1315. The partial cluster syndrome for group-A, which is equal to the full cluster syndrome is output on 1313. During processing of block-column j+1, the Syndrome Updater updates the Syndrome Memory for the block-rows of the group-B cluster, outputs the partial cluster syndrome for group-B on 1313, and outputs the full cluster syndrome for group-B on 1315. Finally, during processing of block-column j+2, the Syndrome Updater updates the Syndrome Memory for the block-rows of the group-C cluster, outputs the partial cluster syndrome for group-C on 1313, and outputs the full cluster syndrome for group-C on 1315. The register 1314 is added in order to ensure that the outputs 1313 and 1315 will correspond to the same cluster group.
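The way a snapshot is combined with the Syndrome Memory can be sketched as follows. This models only the XOR combination, not the three-cycle output schedule or the registers; the dictionary-based snapshot format and the function name `snapshot_syndromes` are assumptions.

```python
from typing import Dict, List, Tuple

Bits = List[int]


def snapshot_syndromes(
    syndrome_memory: List[Bits], snapshots: Dict[str, Tuple[int, Bits]]
) -> Dict[str, Bits]:
    """Combine each group's snapshot with the stored syndrome of its block-row.

    snapshots[group] = (block_row, cluster local syndrome at the snapshot).
    For a finished cluster the result is its full cluster syndrome; for an
    unfinished one it is the partial cluster syndrome up to block-column j.
    """
    return {
        group: [m ^ s for m, s in zip(syndrome_memory[row], local)]
        for group, (row, local) in snapshots.items()
    }


# Snapshot at block-column j: group A is finished, groups B and C are not.
memory = [[0, 1], [1, 1], [0, 0]]
snapshots = {"A": (0, [0, 1]), "B": (1, [1, 0]), "C": (2, [0, 0])}
result = snapshot_syndromes(memory, snapshots)
print(result)
```

In this example the block-row of group B still has a non-zero syndrome bit as of block-column j, so the decoder could not be stopped at that point.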
Each of the Zero Syndrome Check units 1207 and 1208 in the Validator module takes as inputs the full cluster syndromes and the partial cluster syndromes for the cluster groups of the corresponding higher-order GL. In the example of
Claims
1. A method for vertical layered decoding of quasi-cyclic low-density parity-check codes operating on a parity-check matrix with a structure composed of one or more higher-order generalized layers of order greater than or equal to one, wherein each higher-order generalized layer is composed of non-overlapping groups of clusters of one or more circulant permutation matrices (CPMs) with the number of non-overlapping groups at most equal to the order of the higher-order generalized layer, the method comprising:
- receiving, as inputs, channel values belonging to a channel output alphabet;
- using the channel values for initializing, iteratively processing groups of messages between variable nodes and check nodes within block-columns in an arbitrary order, and sequentially from one block-column to another block-column, generating hard decision estimates and validating to check if the hard-decision estimates constitute a codeword based upon which the decoding is terminated;
- computing, during the initializing, respective signs of the variable-to-check messages using the signs of the channel values;
- computing, during the initializing, the initial value of the sign state associated to each check node, by using the signs of the variable-to-check messages;
- further computing, during the initializing, the initial value of the magnitude state associated to each check node, using the channel values;
- storing the check node states in a check node memory, with each check node state associated to a check node comprising, a sign state of the associated check node computed from the signs of the variable-to-check messages, and a magnitude state composed of a set of values comprising one or more smallest magnitudes of the variable-to-check messages of the associated check node along with the same number of respective block-column indices;
- iteratively processing a block-column, wherein the iterative processing includes: computing one or more groups of new check-to-variable messages corresponding to each cluster of CPMs, where each cluster of CPMs belongs to a non-overlapping group of clusters in a higher-order generalized layer, using a check node update-generator (CNU-Generator) step, with inputs comprising the check node states and the signs of the one or more groups of variable-to-check messages corresponding to the cluster of CPMs; computing new variable-to-check messages with inputs comprising the channel values and the check-to-variable messages, using one or more variable node update functions; computing hard decision estimates using the channel values and the check-to-variable messages, updating the check node states corresponding to each cluster of CPMs, where each cluster belongs to a non-overlapping group of clusters in a higher-order generalized layer, to new values using a check node update-updater (CNU-Updater) step, with inputs comprising the current values of the check node states, one or more groups of variable-to-check messages corresponding to the cluster of CPMs, and the signs of the one or more groups of variable-to-check messages corresponding to the cluster of CPMs from the previous iteration.
- computing, during the validating, the syndrome bits associated to the check nodes corresponding to the entire parity check matrix to check if the hard-decision estimates constitute a codeword; and
- outputting the codeword, in accordance with the hard decision estimates constituting a codeword.
Type: Application
Filed: Feb 18, 2022
Publication Date: Aug 11, 2022
Inventors: David Declercq (Tucson, AZ), Benedict J. Reynwar (Tucson, AZ), Vamsi Krishna Yella (Tucson, AZ)
Application Number: 17/676,065