DCT (discrete cosine transform) conversion method and DCT conversion circuit system

Document No.: 142570. Publication date: 2021-10-22.

Reading note: this technology, "DCT transform method and DCT transform circuit system", was created by Zhang Peng (张鹏), Hao Zhijian (郝志坚), Xiang Guoqing (向国庆), Fan Yibo (范益波), Yan Wei (严伟) and Jia Huizhu (贾惠柱) on 2021-06-15. Abstract: The invention discloses a DCT transform method and circuit system. The method comprises: storing a residual block by columns into different first storage units; each time a row of the residual block is read out in parallel from the first storage units and row-transformed, writing each data item of the row-transformed data into a different second storage unit according to a preset diagonal read-write rule; and each time a column of row-transformed data is read out in parallel from the second storage units according to the diagonal read-write rule and column-transformed, storing each coefficient of the resulting column of transform coefficients into a different third storage unit, so that a column of transform coefficients can be read out in parallel from the third storage units each time for quantization. By optimizing the row and column transform circuits and the intermediate transpose storage, and in particular by improving the read-write rule of the intermediate storage structure, the number of cycles the row and column transform circuits spend accessing the intermediate storage structure is reduced, so that a higher throughput is achieved on a smaller circuit area and circuit performance is improved.

1. A method of DCT transformation, the method comprising:

storing input residual block data into different first storage units in columns;

after a row of residual block data is read out from each first storage unit in parallel and is subjected to row conversion, writing each data in the row-converted data into different second storage units according to a preset diagonal reading-writing rule;

and after reading out a column of data subjected to row transformation from each second storage unit in parallel according to the diagonal reading and writing rule and performing column transformation, respectively storing each transformation coefficient in a column of obtained transformation coefficients into different third storage units so as to enable a subsequent quantization module to read out a column of transformation coefficients from each third storage unit in parallel each time for quantization operation.

2. The method of claim 1, wherein storing the input residual block data in different first storage units by columns comprises:

dividing a first memory, according to the height of the residual block data, into a plurality of first storage units whose depth equals that height, wherein the number of the first storage units is equal to the width of the residual block data;

storing the residual block data into the first storage units by columns;

wherein data belonging to different first storage units can be read in parallel in the same cycle.

3. The method of claim 1, wherein writing each data item in the row-transformed data into a different second storage unit according to a preset diagonal read-write rule, each time a row of residual block data is read out from the first storage units in parallel and row-transformed, comprises:

adopting a first pipeline strategy, reading a row of residual block data from the first storage units in parallel, performing the row transform, and writing each data item in the row-transformed data into a different second storage unit according to the preset diagonal read-write rule;

wherein the first pipeline strategy is to read out the next row of residual block data from the first storage units in parallel while the previous row of residual block data is being row-transformed.

4. The method according to claim 1, wherein before writing each data item in the row-transformed data into a different second storage unit according to the preset diagonal read-write rule, the method further comprises:

determining the size of a required storage area according to the height and width of the residual block data;

and selecting a plurality of second storage units from a preset intermediate transposition storage structure based on the size of the storage area.

5. The method according to claim 1, wherein storing each transform coefficient in the obtained column of transform coefficients into a different third storage unit, each time a column of row-transformed data is read out in parallel from the second storage units according to the diagonal read-write rule and column-transformed, comprises:

adopting a second pipeline strategy, reading out a column of row-transformed data from the second storage units in parallel according to the diagonal read-write rule, performing the column transform, and storing each transform coefficient in the obtained column of transform coefficients into a different third storage unit;

wherein the second pipeline strategy is to read out the next column of row-transformed data from the second storage units in parallel while the previous column of row-transformed data is being column-transformed.

6. The method of claim 1, wherein reading out a column of row-transformed data in parallel from the second storage units according to the diagonal read-write rule and performing the column transform comprises:

reading out a column of row-transformed data in parallel from the second storage units according to the diagonal read-write rule;

cyclically shifting the column of row-transformed data in a preset direction according to the depth position of that column of row-transformed data in the second storage units;

and performing the column transform on the shifted column of row-transformed data.

7. The method according to claim 1, wherein before storing each transform coefficient in the obtained column of transform coefficients into a different third storage unit, the method further comprises:

dividing a second memory, according to the width of the residual block data, into a plurality of third storage units whose depth equals that width, wherein the number of the third storage units is equal to the height of the residual block data;

wherein data belonging to different third storage units can be read in parallel in the same cycle.

8. A DCT transform circuit system, the system comprising:

a first memory having as many first storage units as the width of the residual block data, each first storage unit being used for storing one column of the residual block data;

an intermediate transpose storage structure having a plurality of second storage units;

a row transform circuit for reading out a row of residual block data from the first storage units in parallel, performing a row transform, and writing each data item in the row-transformed data into a different second storage unit according to a preset diagonal read-write rule;

a second memory having as many third storage units as the height of the residual block data, each third storage unit being used for storing one row of the transform coefficients obtained by the DCT transform of the residual block data;

and a column transform circuit for reading out a column of row-transformed data from the second storage units in parallel according to the diagonal read-write rule, performing a column transform, and storing each transform coefficient in the obtained column of transform coefficients into a different third storage unit, so that a subsequent quantization module can read out a column of transform coefficients from the third storage units in parallel each time for quantization.

9. The system of claim 8, wherein the row transform circuit comprises:

a first reading module for reading out a row of residual block data in parallel from each first storage unit;

a first calculation module for performing the row transform on the row of residual block data that has been read out;

and a first writing module for writing each data item in the row-transformed data into a different second storage unit according to the preset diagonal read-write rule.

10. The system of claim 8, wherein the column transform circuit comprises:

a second reading module for reading out a column of row-transformed data in parallel from the second storage units according to the diagonal read-write rule;

a second calculation module for performing the column transform on the column of row-transformed data that has been read out, to obtain a column of transform coefficients;

and a second writing module for storing each transform coefficient in the obtained column of transform coefficients into a different third storage unit.

Technical Field

The invention relates to the technical field of video coding, in particular to a DCT (discrete cosine transform) transformation method and a DCT transformation circuit system.

Background

DCT (discrete cosine transform) means that the input residual block is first transformed by rows (row transform) and then transformed by columns (column transform); the order of the two one-dimensional transforms does not affect the final transform result. An intermediate transpose buffer is therefore needed to temporarily store the row-transformed data between the row transform and the column transform.

In the related art, a single-port ram is used for storage, which results in that a large number of cycles are wasted in reading and writing the ram between two conversions, and the throughput rate of the whole circuit is reduced.

Disclosure of Invention

The present invention provides a DCT transformation method and a DCT transformation circuit system for overcoming the above-mentioned deficiencies of the prior art, and the object is achieved by the following technical solutions.

A first aspect of the present invention provides a DCT transformation method, the method comprising:

storing input residual block data into different first storage units in columns;

A second aspect of the invention proposes a DCT transformation circuitry, said system comprising:

a first memory having a number of first storage units of a width of residual block data, each first storage unit for storing a column of data in the residual block data;

an intermediate transpose storage structure having a plurality of second storage units;

Based on the DCT transformation method and the DCT transformation circuit system described in the first and second aspects, the present application has the following advantages or benefits:

when the residual block data is input, the residual block data is stored into different first storage units according to columns, and because the data belonging to different first storage units can be read in parallel, the same row of data can be read simultaneously, so that a row of residual block data can be read out from each first storage unit in parallel to perform row transformation calculation, and the data reading efficiency is improved;

when the row-transformed data is stored in the intermediate transpose storage, each data item in the row-transformed data is written into a different second storage unit according to the diagonal read-write rule, so that the data of one column can be read simultaneously; a column of row-transformed data can thus be read out in parallel from the second storage units for the column transform calculation, which avoids repeated reading and writing of the intermediate transpose storage structure and reduces cycle consumption;

when the transformation coefficients are stored, the transformation coefficients in a column of transformation coefficients are respectively stored in different third storage units, and the coefficients belonging to different third storage units can be read in parallel, so that the same column of coefficients can be read simultaneously, and the data reading of a subsequent quantization module is facilitated.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a flowchart illustrating an embodiment of a DCT transform method according to an example embodiment of the present invention;

FIG. 2 is a schematic diagram illustrating a comparison between input storage with and without optimization in accordance with an exemplary embodiment of the present invention;

FIG. 3 is a schematic diagram illustrating a physical structure as an intermediate transpose storage structure in accordance with an exemplary embodiment of the present invention;

FIG. 4 is a schematic diagram illustrating the selection of a storage area in an intermediate transposed storage structure in accordance with an illustrative embodiment of the present invention;

FIG. 5 is a diagram illustrating a comparison of row and column shift circuitry before and after optimization of processing logic according to an exemplary embodiment of the present invention;

FIG. 6 is a diagram illustrating a 4x8 write effect according to an exemplary embodiment of the present invention;

FIG. 7 is a diagram illustrating an 8x4 write effect according to an exemplary embodiment of the present invention;

FIG. 8 is a schematic diagram illustrating a third memory cell structure according to an exemplary embodiment of the present invention;

FIG. 9 is a schematic diagram illustrating a DCT transform circuit system according to an exemplary embodiment of the present invention.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention. The word "if" as used herein may be interpreted as "upon", "when", or "in response to a determination", depending on the context.

In the design of a video coding and decoding chip at present, a main solution to the problem of transpose storage in the DCT conversion process is to use a bit width merging method, i.e., to use a ram with a large bit width, write one row of conversion data into the ram at a time, and when performing column conversion, read out a column of conversion data from different bit positions of the ram and send the column of conversion data to a conversion circuit for column conversion, thereby achieving reduction of the number of access cycles to the ram.

However, the above scheme mainly has the following two problems in the design of the codec chip:

1. As video resolution continues to increase, the size of the processed macroblocks becomes larger and larger; the largest block size in the AVS3 video coding standard may be 128x128. If the bit width merging method is used, a memory structure with a bit width of 128x128x8 and a depth of 1 is difficult to synthesize into RAM in an FPGA implementation.

2. In agile development with HLS, the compiler can produce the expected circuit only if the corresponding hardware behavior is modeled in C/C++ and the appropriate synthesis directives are added. The bit width merging behavior is difficult to express in C/C++, which increases the difficulty of software design and debugging.

In order to solve these technical problems, the invention provides an efficient DCT transform method and transform circuit system. By optimizing the design of the row and column transform circuits and the intermediate transpose storage structure, and in particular by improving the read-write rule of the intermediate storage structure, the number of cycles the row and column transform circuits spend accessing the intermediate storage structure is reduced, a higher throughput is achieved on a smaller circuit area, and circuit performance is improved.

The DCT transformation method and the transformation circuit system according to the present invention will be described in detail with specific embodiments.

Fig. 1 is a flowchart illustrating an embodiment of a DCT transformation method according to an exemplary embodiment of the present invention, including the following steps:

step 101: the input residual block data is stored in different first storage units by columns.

In some embodiments, when the residual block data is input, the first memory may be divided, according to the height of the residual block data, into a plurality of first storage units whose depth equals that height, the number of first storage units being equal to the width of the residual block data; the residual block data is then stored into the first storage units by columns.

Alternatively, if HLS synthesis is used, assuming that the residual block data is src, the optimization directive #pragma HLS ARRAY_PARTITION variable=src complete dim=2 may be used to divide one memory (RAM) into as many first storage units as the width of the residual block data, the depth of each first storage unit being the height of the residual block data, thereby improving the efficiency of reading data.

For example, if the memory is not divided using the above optimization directive, as shown in fig. 2 (a), the 4x4 residual block data is stored directly in one memory; because of the port limitation only one data item can be read out per cycle, so reading the 16 data items for the row transform takes at least 16 cycles. If the memory is divided using the optimization directive, as shown in fig. 2 (b), the 4x4 residual block data is stored by columns in the 4 resulting first storage units (bank0 to bank3), each with a depth of 4. Since data belonging to different banks can be read in the same cycle, one row is read per cycle and all 4 rows of residual block data can be read in only 4 cycles, which improves the throughput of the circuit.
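The effect of this partitioning can be sketched as a small software model in C++. This is only an illustration under stated assumptions, not the circuit of the invention: each bank is assumed to be single-ported (one read per cycle), and the names `BankedBlock`, `store_by_columns`, `read_row`, `cycles_single_memory` and `cycles_banked` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Software model of the first memory: one bank per block column, each
// bank as deep as the block is high. Assumption: one read port per bank.
struct BankedBlock {
    std::size_t height;
    std::size_t width;
    std::vector<std::vector<int>> banks;  // banks[col][row]
};

// Store a row-major residual block by columns into separate banks.
BankedBlock store_by_columns(const std::vector<int>& block,
                             std::size_t height, std::size_t width) {
    assert(block.size() == height * width);
    BankedBlock m{height, width,
                  std::vector<std::vector<int>>(width,
                                                std::vector<int>(height))};
    for (std::size_t r = 0; r < height; ++r)
        for (std::size_t c = 0; c < width; ++c)
            m.banks[c][r] = block[r * width + c];
    return m;
}

// Read one row: every bank is accessed exactly once, at the same depth,
// so in this model the whole row is available in a single cycle.
std::vector<int> read_row(const BankedBlock& m, std::size_t row) {
    assert(row < m.height);
    std::vector<int> out(m.width);
    for (std::size_t c = 0; c < m.width; ++c) out[c] = m.banks[c][row];
    return out;
}

// Cycles to read the whole block from one single-port memory
// (one element per cycle) versus from the banked layout (one row per cycle).
std::size_t cycles_single_memory(std::size_t height, std::size_t width) {
    return height * width;
}
std::size_t cycles_banked(std::size_t height) { return height; }
```

For a 4x4 block, `cycles_single_memory(4, 4)` gives 16 and `cycles_banked(4)` gives 4, matching the two cases compared in fig. 2.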

Step 102: after a row of residual block data is read out from each first storage unit in parallel and is subjected to row conversion, each data in the row-converted data is written into different second storage units according to a preset diagonal reading and writing rule.

Before step 102 is executed, an intermediate transpose storage structure for storing the row-transformed data needs to be prepared. Optionally, the preparation process includes: determining the size of the required storage area according to the height and width of the residual block data, and then selecting a plurality of second storage units from a preset intermediate transpose storage structure based on the size of the storage area.

In a specific implementation, in order to support DCT transforms up to a size of 64x64, 64 intermediate transpose storage structures with a depth of 64 are preset (their bit width is determined by the row-transformed data). The corresponding physical structure is shown in fig. 3, where bank0 to bank63 are the 64 intermediate transpose storage structures.

Optionally, if HLS synthesis is used, they are set up as follows: a two-dimensional array coef_tmp[64][64] of size 64x64 is declared and optimized with the synthesis directive #pragma HLS ARRAY_PARTITION variable=coef_tmp complete dim=2, which yields 64 intermediate transpose storage structures with a depth of 64.

Specifically, in order to unify the read-write rules for the residual block data with the height greater than the width and the residual block data with the height less than the width, it is assumed that the size of the storage area is MxM, and M is specifically a maximum value of the height and the width of the residual block data.

For example, as shown in fig. 4, the storage area required for 4x8 residual block data is 8x8, so 8 second storage units with a depth of 8 need to be selected from the preset intermediate transpose storage structure based on the 8x8 storage area.
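The selection of the storage area can be sketched in a few lines of C++. The function name `transpose_region_size` and the constant `kMaxBanks` are hypothetical; the limit of 64 reflects the 64 preset banks of depth 64 described above.

```cpp
#include <algorithm>
#include <cassert>

// Assumption taken from the text: 64 preset banks, each of depth 64.
constexpr int kMaxBanks = 64;

// Side length M of the M x M storage area used for a residual block:
// M is the larger of the block's height and width.
int transpose_region_size(int height, int width) {
    assert(height > 0 && width > 0);
    const int m = std::max(height, width);
    assert(m <= kMaxBanks);  // must fit into the preset structure
    return m;
}
```

Because the result depends only on the larger dimension, 4x8 and 8x4 blocks both select 8 banks of depth 8, which is what allows a single read-write rule to cover both shapes.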

In an optional embodiment, the main operation of the row and column transform circuits is to traverse all rows of the residual block data and apply an N-point transform to each. In order to save circuit area while maintaining performance, the present invention uses a single row transform circuit and pipelines it according to the logical dependencies of the data. That is, a first pipeline strategy is adopted: each time a row of residual block data is read out in parallel from the first storage units and row-transformed, each data item in the row-transformed data is written into a different second storage unit according to the preset diagonal read-write rule, so that the row transform is completed in fewer cycles and both circuit area and performance are taken into account.

The first pipeline strategy is to read out the next row of residual block data from the first storage units in parallel while the previous row of residual block data is being row-transformed.

For example, taking 4x4 residual block data, as shown in fig. 5: if the processing logic is not optimized with the pipeline strategy, then with a single row transform circuit all rows of the residual block data must be processed one after another, which is slow; if the processing logic is optimized with the pipeline strategy, then even with a single row transform circuit the next row of residual block data can be read out in parallel from the first storage units while the row transform of the previous row is being calculated.

Assume that reading, calculating and writing each take 1 cycle. As shown in table 1, without pipeline optimization 12 cycles are needed to process 4 rows of data. With pipeline optimization, the next row starts to be read while the previous row is being calculated, so the read of the next row overlaps with the calculation of the previous row and the calculation of the next row overlaps with the write of the previous row; only 6 cycles are needed to process 4 rows of data. Compared with the processing logic before optimization, the cycle count is halved while still using a single row transform circuit, which further improves the throughput of the circuit.

Policy	Circuit area (units)	Cycles
None	1	12
Pipeline strategy	1	6

TABLE 1
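The cycle counts in table 1 follow from a simple model, sketched below in C++. It assumes an ideal pipeline in which every stage (read, calculate, write) takes exactly one cycle and there are no stalls; the function names are hypothetical.

```cpp
#include <cassert>

// Each row passes through all stages before the next row starts.
int cycles_sequential(int rows, int stages) {
    assert(rows > 0 && stages > 0);
    return rows * stages;
}

// Ideal pipeline with one-cycle stages and no stalls: the first row
// needs `stages` cycles, and every later row completes one cycle
// after the row before it.
int cycles_pipelined(int rows, int stages) {
    assert(rows > 0 && stages > 0);
    return stages + rows - 1;
}
```

With 4 rows and 3 stages this gives 4 x 3 = 12 cycles without pipelining and 3 + 4 - 1 = 6 cycles with it. Under the same assumptions the saving grows with block height: 64 rows would take 192 cycles sequentially and 66 pipelined.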

In an optional embodiment, so that a row of row-transformed data can be written in one cycle and the column transform circuit can read out a column of row-transformed data in parallel in one cycle, the diagonal read-write rule is implemented as follows:

coef_tmp[bank'][depth] = coef_tmp[(bank + depth) % size][depth]

where coef_tmp[bank][depth] denotes the position the data would originally be written to without the diagonal read-write rule, depth is the depth value, and bank is the index of the second storage unit; coef_tmp[bank'][depth] denotes the position the data is written to under the diagonal read-write rule, and bank' is likewise the index of a second storage unit, that is, bank' = (bank + depth) % size; size denotes the number of second storage units, which equals their depth, i.e., the maximum of the width and height of the residual block data.
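Written as an address function, the rule looks as follows. This is a minimal sketch: it assumes, consistent with the positions worked out in this description, that without the rule the data item in row `row` and column `col` would go to bank `row` at depth `col`. The names `BankAddr` and `diagonal_address` are hypothetical.

```cpp
#include <cassert>

struct BankAddr {
    int bank;   // index of the second storage unit
    int depth;  // address inside that unit
};

// Map the item at (row, col) of the row-transformed block to its
// position under the diagonal rule: the depth stays equal to the
// column index, and the bank index is rotated by the depth.
BankAddr diagonal_address(int row, int col, int size) {
    assert(size > 0);
    assert(row >= 0 && row < size);
    assert(col >= 0 && col < size);
    return BankAddr{(row + col) % size, col};
}
```

For size 8, `diagonal_address(0, 2, 8)` yields bank 2 at depth 2, and `diagonal_address(2, 6, 8)` wraps around to bank 0 at depth 6.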

For example, taking 4x8 data, as shown in fig. 6, the row transform circuit performs the first row transform to obtain the row-transformed data of the first row, and using the diagonal read-write rule the 4 data items of that row are written in parallel along the diagonal, starting from the depth0 position of bank0: [bank0][depth0], [bank1][depth1], [bank2][depth2], and [bank3][depth3].

The positions under the diagonal read-write rule are calculated as follows. Take the 3rd data item of the first row: without the rule it would be written to [bank0][depth2]; with size = 8, under the rule it is written to [bank((0+2)%8)][depth2] = [bank2][depth2]. Take the 3rd data item of the second row: without the rule it would be written to [bank1][depth2]; with size = 8, under the rule it is written to [bank((1+2)%8)][depth2] = [bank3][depth2].

As can be seen from fig. 6, all data in each row are written into different second storage units, so that simultaneous parallel writing can be realized, and the row-transformed data in the same column in the residual block are all written into the same depth position of different second storage units.

Taking 8x4 data as an example, as shown in fig. 7, the row transform circuit performs the first row transform to obtain the row-transformed data of the first row, and using the diagonal read-write rule the 8 data items of that row are written in parallel along the diagonal, starting from the depth0 position of bank0: [bank0][depth0], [bank1][depth1], [bank2][depth2], [bank3][depth3], [bank4][depth4], [bank5][depth5], [bank6][depth6], [bank7][depth7].

The positions under the diagonal read-write rule are calculated as follows. Take the 3rd data item of the second row: without the rule it would be written to [bank1][depth2]; with size = 8, under the rule it is written to [bank((1+2)%8)][depth2] = [bank3][depth2]. Take the 7th data item of the third row: without the rule it would be written to [bank2][depth6]; with size = 8, under the rule it is written to [bank((2+6)%8)][depth6] = [bank0][depth6].

As can be seen from fig. 7, all data in each row are written into different second storage units, so that simultaneous parallel writing can be realized, and the row-transformed data in the same column in the residual block are all written into the same depth position of different second storage units.

It should be noted that the above read-write rule achieves the same purpose regardless of whether the data has a symmetric size (for example, 8x8) or an asymmetric size (for example, 8x4).
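That the rule avoids bank conflicts for both symmetric and asymmetric sizes can be checked with a short C++ sketch. It assumes the mapping bank' = (row + col) % size with size equal to the larger of the block's height and width, as described above; the function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns true if, under bank' = (row + col) % size, every row of a
// height x width block touches distinct banks (so it can be written
// in one cycle) and every column touches distinct banks at its single
// depth (so it can be read in one cycle).
bool diagonal_rule_is_conflict_free(int height, int width) {
    assert(height > 0 && width > 0);
    const int size = std::max(height, width);
    // Row writes: items (r, 0..width-1) must land in different banks.
    for (int r = 0; r < height; ++r) {
        std::vector<bool> used(size, false);
        for (int c = 0; c < width; ++c) {
            const int bank = (r + c) % size;
            if (used[bank]) return false;
            used[bank] = true;
        }
    }
    // Column reads: items (0..height-1, c) all sit at depth c and
    // must live in different banks.
    for (int c = 0; c < width; ++c) {
        std::vector<bool> used(size, false);
        for (int r = 0; r < height; ++r) {
            const int bank = (r + c) % size;
            if (used[bank]) return false;
            used[bank] = true;
        }
    }
    return true;
}
```

The check passes because consecutive indices modulo size are distinct as long as at most size of them are used, which holds for both the width and the height when size is their maximum.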

Step 103: and after reading out a column of data subjected to row transformation in parallel from each second storage unit according to a diagonal reading and writing rule and performing column transformation, respectively storing each transformation coefficient in a column of obtained transformation coefficients into different third storage units.

Before step 103 is executed, a third storage unit needs to be prepared in advance, because the transform coefficients obtained by column-transforming a column of row-transformed data must be written into a memory for subsequent reading and quantization. The preparation process may include: dividing the second memory, according to the width of the residual block data, into a plurality of third storage units whose depth equals that width, the number of third storage units being equal to the height of the residual block data, so that the transform coefficients can be stored in the third storage units by rows.

Alternatively, if HLS synthesis is used, assuming that the obtained transform coefficients are dst, the optimization directive #pragma HLS ARRAY_PARTITION variable=dst complete dim=1 may be used to divide one memory (RAM) into as many third storage units as the height of the transform coefficients, the depth of each third storage unit being the width of the transform coefficients, thereby improving the efficiency of subsequent data reading.

As shown in fig. 8, for 4x4 residual block data, 4x4 transform coefficients are finally obtained, and each obtained column of transform coefficients is written in parallel into the 4 resulting third storage units (bank0 to bank3), one coefficient per bank. Since transform coefficients belonging to different banks can be read in the same cycle, the subsequent quantization needs only 4 cycles to read all the transform coefficients, which improves the throughput of the circuit.

In an optional embodiment, based on the same principle as the row transform process described in step 102, the present invention uses a single column transform circuit and pipelines it according to the logical dependencies of the data. That is, a second pipeline strategy is adopted: each time a column of row-transformed data is read out in parallel from the second storage units according to the diagonal read-write rule and column-transformed, each transform coefficient in the obtained column of transform coefficients is stored in a different third storage unit, so that the column transform is completed in fewer cycles and both circuit area and performance are taken into account.

The second pipeline strategy is to read out the next column of row-transformed data from the second storage units in parallel while the previous column of row-transformed data is being column-transformed.

It should be noted that the column data must be presented to the column transform in the correct order. To this end, after a column of row-transformed data is read out in parallel from the second storage units according to the diagonal read-write rule, that column is cyclically shifted in a preset direction according to its depth position in the second storage units, which restores the correct order; the shifted column of row-transformed data is then column-transformed.

For example, as can be seen from Fig. 6 above, across the 8 second storage units bank0 to bank7, the first column of data, at depth position depth0, is already in the correct order after being read out in parallel, so no cyclic shift is needed (a shift of 0 positions). After the second column, at depth position depth1, is read out in parallel, the data written for row 8, column 2 sits in the first position, so the order is incorrect and a cyclic left shift of 1 position is needed. After the third column, at depth position depth2, is read out in parallel, the data written for row 7, column 3 sits in the first position and the data written for row 8, column 3 sits in the second position, so the order is incorrect and a cyclic left shift of 2 positions is needed. After the fourth column, at depth position depth3, is read out in parallel, a cyclic left shift of 3 positions is needed.

As can be seen from Fig. 7, across the 8 second storage units bank0 to bank7, the first column of data, at depth position depth0, is already in the correct order after being read out in parallel, so no cyclic shift is needed (a shift of 0 positions). After the second column, at depth position depth1, is read out in parallel, the first position is empty and the column's data begins at the second position, so a cyclic left shift of 1 position is needed to move the empty position to the end. After the third column, at depth position depth2, is read out in parallel, the first two positions are empty and the column's data begins at the third position, so a cyclic left shift of 2 positions is needed to move the two empty positions to the end. After the fourth column, at depth position depth3, is read out in parallel, the first three positions are empty and the column's data begins at the fourth position, so a cyclic left shift of 3 positions is needed. The pattern continues in the same way: after the 8th column, at depth position depth7, is read out in parallel, a cyclic left shift of 7 positions is needed to put the 8th column in the correct order.

Therefore, each time a column of row-transformed data is read out, it must be cyclically shifted to the left, and the number of positions shifted equals the index of the depth position that the column occupies in the second storage units.
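The shift rule can be checked with a short behavioral sketch. It assumes that the diagonal read-write rule places the element at row r, column c (both counted from 0) in bank (r + c) mod N at depth c. This mapping is an inference from the Fig. 6 example, where the row 8, column 2 element appears first at depth1; it is not a formula stated in the text, and the function names are hypothetical.

```python
N = 8  # number of second storage units (bank0..bank7), as in the 8-bank example


def diagonal_write(banks, row_index, row_data):
    """Write one row of row-transformed data, one element per bank.

    Assumed rule: element (r, c) goes to bank (r + c) % N at depth c.
    """
    for c, value in enumerate(row_data):
        banks[(row_index + c) % N][c] = value


def read_column(banks, depth):
    """Read the same depth from every bank in parallel, then shift left cyclically."""
    raw = [banks[b][depth] for b in range(N)]
    return raw[depth:] + raw[:depth]  # shift amount equals the depth index


# banks[b][d] is the word stored in bank b at depth d.
banks = [[None] * N for _ in range(N)]
# Label every element with its (row, column) so the ordering is easy to inspect.
block = [[(r, c) for c in range(N)] for r in range(N)]
for r in range(N):
    diagonal_write(banks, r, block[r])

if __name__ == "__main__":
    print([banks[b][1] for b in range(N)])  # raw read at depth1, before shifting
    print(read_column(banks, 1))            # after the cyclic left shift by 1
```

With this mapping, every row write and every column read touches each bank exactly once, which is what allows both to be done in parallel.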

This completes the transformation process shown in Fig. 1. On input, the residual block data is stored by column into different first storage units. Because data held in different first storage units can be read in parallel, all the data of one row can be read simultaneously, so a row of residual block data can be read out of the first storage units in parallel for row transformation, which improves data-reading efficiency.

Corresponding to the foregoing embodiments of the DCT transformation method, the present invention also provides embodiments of DCT transformation circuitry.

Fig. 9 is a schematic structural diagram of a DCT transform circuit system according to an exemplary embodiment of the present invention. The system includes a first memory 910, a row transform circuit 920, an intermediate transpose storage structure 930, a column transform circuit 940, and a second memory 950.

A first memory 910 having as many first storage units as the width of the residual block data, each first storage unit being configured to store one column of the residual block data;

an intermediate transposed storage structure 930 having a plurality of second storage units;

a row transform circuit 920, configured to read out a row of residual block data from each first storage unit in parallel, perform row transformation, and write each element of the row-transformed data into a different second storage unit according to a preset diagonal read-write rule;

a second memory 950 having as many third storage units as the height of the residual block data, each third storage unit being configured to store one row of the transform coefficients obtained by DCT-transforming the residual block data;

and a column transform circuit 940, configured to read out a column of row-transformed data from each second storage unit in parallel according to the diagonal read-write rule, perform column transformation, and store each transform coefficient in the resulting column of transform coefficients into a different third storage unit, so that a subsequent quantization module can read out one column of transform coefficients from the third storage units in parallel each time it performs a quantization operation.

For descriptions of the first memory 910, the row transform circuit 920, the intermediate transpose storage structure 930, the column transform circuit 940, and the second memory 950, reference may be made to the description of the embodiment shown in Fig. 1, which is not repeated here.

In an alternative embodiment, based on the related description about the row transformation process in the embodiment shown in fig. 1, as shown in fig. 9, the row transformation circuit 920 may specifically include a first reading module 921, a first calculating module 922, and a first writing module 923.

The first reading module 921 is configured to read out a row of residual block data from each first storage unit in parallel;

a first calculating module 922, configured to perform row transformation on the row of residual block data that has been read out;

the first writing module 923 is configured to write each element of the row-transformed data into a different second storage unit according to a preset diagonal read-write rule.

In an alternative embodiment, based on the above description about the column transformation process in the embodiment shown in fig. 1, as shown in fig. 9, the column transformation circuit 940 specifically includes a second reading module 941, a second calculating module 942, and a second writing module 943.

The second reading module 941 is configured to read out a column of data after row transformation from each second storage unit in parallel according to the diagonal reading-writing rule;

a second calculating module 942, configured to perform column transformation on the column of row-transformed data that has been read out, to obtain a column of transform coefficients;

a second writing module 943, configured to store each transform coefficient in the obtained column of transform coefficients into a different third storage unit.
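The data flow through the read, calculate, and write modules of both circuits can be sketched as a behavioral model. Several things here are assumptions rather than details from the source: a square 4x4 block, an orthonormal floating-point DCT-II standing in for whatever 1-D transform the circuits implement, the diagonal mapping bank = (r + c) mod N at depth c, and all function names.

```python
import math

N = 4  # assumed square block; also the number of banks in each memory

# Orthonormal DCT-II basis, used only as a stand-in 1-D transform.
C = [[math.sqrt((1 if k == 0 else 2) / N)
      * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
      for n in range(N)] for k in range(N)]


def transform_1d(vec):
    return [sum(C[k][n] * vec[n] for n in range(N)) for k in range(N)]


def dct_2d_banked(block):
    # First memory: bank c holds column c, so row r is one word per bank.
    first = [[block[r][c] for r in range(N)] for c in range(N)]
    second = [[0.0] * N for _ in range(N)]  # intermediate transpose storage
    third = [[0.0] * N for _ in range(N)]   # bank k holds coefficient row k

    # Row transform circuit: read a row, transform it, write it diagonally.
    for r in range(N):
        row = [first[c][r] for c in range(N)]
        for c, value in enumerate(transform_1d(row)):
            second[(r + c) % N][c] = value  # assumed diagonal rule

    # Column transform circuit: parallel read, left shift, transform, write.
    for c in range(N):
        raw = [second[b][c] for b in range(N)]
        column = raw[c:] + raw[:c]          # cyclic left shift by the depth
        for k, coeff in enumerate(transform_1d(column)):
            third[k][c] = coeff             # one coefficient per third bank
    return third


def dct_2d_reference(block):
    """Plain row-then-column transform with no banking, for comparison."""
    rows = [transform_1d(row) for row in block]
    cols = [transform_1d([rows[r][c] for r in range(N)]) for c in range(N)]
    return [[cols[c][k] for c in range(N)] for k in range(N)]


if __name__ == "__main__":
    sample = [[r * N + c for c in range(N)] for r in range(N)]
    for line in dct_2d_banked(sample):
        print([round(x, 3) for x in line])
```

Comparing `dct_2d_banked` with `dct_2d_reference` checks only the data movement (banking, diagonal write, shift), not the arithmetic of any particular hardware transform.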

Based on the circuit system shown in Fig. 9, on input, the residual block data is stored by column into different first storage units of the first memory. Because data held in different first storage units can be read in parallel, all the data of one row can be read simultaneously, so a row of residual block data can be read out of the first storage units in parallel for row transformation, which improves data-reading efficiency;

for the intermediate transpose storage, each element of the row-transformed data is written into a different second storage unit of the intermediate transpose storage structure according to the diagonal read-write rule, so that all the data of one column can be read simultaneously. A column of row-transformed data can therefore be read out of the second storage units in parallel for column transformation, which avoids repeated reads and writes of the intermediate transpose storage structure and reduces the number of cycles consumed;

when the transform coefficients are stored, the coefficients in a column of transform coefficients are stored in different third storage units of the second memory. Because coefficients held in different third storage units can be read in parallel, all the coefficients of one column can be read simultaneously, which simplifies data reading for the subsequent quantization module.
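The benefit of spreading a column across different storage units can be illustrated by counting read cycles. The sketch assumes that each storage unit serves one read per cycle and that the coefficient at row k, column c sits in third storage unit k at depth c. The single-bank layout is a hypothetical alternative included only for comparison.

```python
from collections import Counter


def cycles_needed(accesses):
    """Cycles to serve a set of (bank, depth) reads, assuming each bank
    can serve one read per cycle."""
    per_bank = Counter(bank for bank, _depth in accesses)
    return max(per_bank.values(), default=0)


N = 4  # 4x4 coefficient block, four third storage units

# Layout described above: one coefficient of the column in each bank.
banked_column_reads = [(k, 1) for k in range(N)]

# Hypothetical alternative: the whole column kept in a single bank.
single_bank_column_reads = [(1, k) for k in range(N)]

if __name__ == "__main__":
    print(cycles_needed(banked_column_reads))
    print(cycles_needed(single_bank_column_reads))
```

Under these assumptions, the banked layout reads a column in one cycle, while the single-bank layout needs one cycle per coefficient.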

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
