Multi-modal Mongolian-Chinese translation method based on a cyclic co-attention Transformer

Document No.: 1846598    Publication date: 2021-11-16

Reading note: This technique, "Multi-modal Mongolian-Chinese translation method based on a cyclic co-attention Transformer" (基于循环共同注意力Transformer的多模态蒙汉翻译方法), was designed and created by 苏依拉, 崔少东, 仁庆道尔吉, 吉亚图, 李雷孝, 石宝, 梁衍锋 and 吕苏艳 on 2021-07-14. Its main content is as follows: a multi-modal Mongolian-Chinese translation method based on a cyclic co-attention Transformer performs target detection on an input image with YOLO-V4, compares the Mongolian text with the target labels through correlation detection, retains the target images related to the Mongolian text, and encodes the Mongolian text into a tensor with an encoding layer; a re-parameterized VGG network and a triple attention mechanism extract and attend to the target image features, a modified bidirectional long short-term memory network lets the target image features and the encoded Mongolian text features (tensors) each interact several times, and the results are then fed into a cyclic co-attention Transformer network for Mongolian-Chinese translation; through several rounds of cyclic interaction the Mongolian language features are fully fused with the visual features, and the target language is output. The method captures feature information from both the visual and the language perspective and, through multiple rounds of cycling, can effectively improve translation quality and solve the problem of poor Mongolian translation quality.

1. A multi-modal Mongolian-Chinese translation method based on a cyclic co-attention Transformer, characterized by comprising the following steps:

Step 1, target detection

Carrying out target detection on an input image by using YOLO-V4, wherein the input image is an image description of a Mongolian text; comparing the Mongolian text with the target labels through correlation detection, eliminating the target images irrelevant to the Mongolian text and retaining those relevant to it; and encoding the Mongolian text into a tensor by using an encoding layer;

Step 2, feature extraction

Extracting and attending to the target image features by using a re-parameterized VGG network and a triple attention mechanism; letting the target image features and the encoded Mongolian text features (tensors) each interact several times in a modified bidirectional long short-term memory network; and then feeding them into a cyclic co-attention Transformer network;

Step 3, multimodal translation

Taking the target image features obtained after the several rounds of interaction and the encoded Mongolian text features as input, performing Mongolian-Chinese translation with the cyclic co-attention Transformer network, fully fusing the Mongolian language features and the visual features through several rounds of cyclic interaction, and outputting the target language.

2. The multi-modal Mongolian-Chinese translation method based on a cyclic co-attention Transformer according to claim 1, wherein the YOLO-V4 network is composed of CSPDenseNet, a path aggregation network and a regression prediction network; CSPDenseNet serves as the backbone network to extract image features, the path aggregation network serves as the neck network, and spatial pyramid pooling is added to CSPDenseNet to generate output of a fixed size.

3. The multi-modal Mongolian-Chinese translation method based on a cyclic co-attention Transformer according to claim 2, wherein CSPDenseNet is composed of a CBM module and a cross-stage partial module; the CBM module consists of a convolution layer, a batch normalization layer and a Mish activation function; the cross-stage partial module divides the input visual information into two parts, one of which is computed as in the original network while the other does not participate in the computation and is directly spliced with the result of the first part; the module consists of two branches, one convolving the trunk part and the other generating a residual edge, and the learning capability of the convolutional neural network is enhanced by cross-stage splicing and channel integration of the two branches;

the path aggregation network establishes a path from bottom-level features to top-level features in a bottom-up manner, shortening the propagation path from bottom-level to top-level feature information, accurately preserving spatial information and correctly locating pixels;

the spatial pyramid pooling applies three maximum pooling layers of different sizes to the convolutional features before the fully-connected layer and splices the results into a one-dimensional vector, so that the size of the input image is unconstrained.

4. The multi-modal Mongolian-Chinese translation method based on a cyclic co-attention Transformer according to claim 1, wherein the re-parameterized VGG divides the VGG network into a training stage and an inference stage; a multi-branch network structure is adopted in the training stage to improve model accuracy, a single-branch network structure is adopted in the inference stage, and re-parameterization is used to convert the weights of the multi-branch network into the single-branch network;

the triple attention mechanism calculates attention weights by capturing cross-dimensional interactions with a three-branch structure, and establishes dependencies between dimensions through rotation operations and residual transformations;

the modified bidirectional long short-term memory network lets the input $x_t$ of the current time step and the hidden state $h_{t-1}$ of the previous time step interact several times before they enter the long short-term memory network; the resulting vectors then serve as the input of the long short-term memory network to obtain a context-dependent representation.

5. The multi-modal Mongolian-Chinese translation method based on a cyclic co-attention Transformer according to claim 4, wherein the multi-branch network is composed of a large number of small networks; the re-parameterized VGG applies re-parameterization on the basis of the VGG network, using 3 × 3 convolution layers, batch normalization layers and ReLU activation functions, and introduces a residual branch and a 1 × 1 convolution branch; the cross-layer connections of the residual network are removed in favour of direct connections, and the inference network is turned into a single-path structure by fusing the branches;

in the triple attention mechanism, given an input tensor $x \in R^{C \times H \times W}$ representing the target image features extracted by the convolutional neural network, with C, H, W denoting the number of channels, the height and the width of the input feature set R respectively: in the first branch, C interacts with H; the input x is first rotated 90° counter-clockwise along the height direction to obtain $\hat{x}_1$ of shape (W × H × C), whose shape becomes (2 × H × C) after Z-pooling, recorded as $\hat{x}_1^*$; a k × k convolution layer and a batch normalization layer produce an output of shape (1 × H × C), a sigmoid activation function generates the attention weights, and the weights are finally applied to $\hat{x}_1$, which is rotated 90° clockwise along the height direction to keep the shape consistent with the input x; in the second branch, the channel dimension C interacts with W; the input x is first rotated 90° counter-clockwise along the width direction to obtain $\hat{x}_2$, whose shape becomes (2 × W × C) after Z-pooling, recorded as $\hat{x}_2^*$; a k × k convolution layer and a batch normalization layer produce an output of shape (1 × W × C), a sigmoid activation function generates the attention weights, the weights are applied to $\hat{x}_2$, and the result is rotated 90° clockwise along the width direction to keep the shape consistent with the input x; in the third branch, the input x is reduced by Z-pooling to $\hat{x}_3$ of shape (2 × H × W); the output of a k × k convolution layer and a batch normalization layer then generates attention weights of shape (1 × H × W), which are applied to the input x to obtain the result; finally, the tensors generated by the three branches are aggregated by averaging, where Z-pooling reduces the 0th dimension of a tensor to size 2 by concatenating average pooling and max pooling.

6. The method according to claim 1, wherein the cyclic co-attention Transformer network is composed of a cyclic co-attention Transformer layer, a Transformer module, a fusion layer and a Transformer decoder; the cyclic co-attention Transformer layer uses a multi-head attention mechanism to let the target image features obtained in step 2 and the encoded Mongolian text features interact cyclically before being sent to the Transformer module; the fusion layer then fuses the information, and the Transformer decoder decodes the fused information and outputs the target language.

7. The multi-modal Mongolian-Chinese translation method based on a cyclic co-attention Transformer according to claim 6, wherein the cyclic co-attention Transformer layer is composed of a visual module and a language module; the visual module receives the extracted target image features and the language module receives the encoded Mongolian text features; the information of each region in the image is used as context to weight the Mongolian text, or the image regions are weighted according to the Mongolian text context, so that the network captures visual information and Mongolian text information simultaneously; the visual information and the Mongolian text information interact several times in the cyclic co-attention Transformer layer, and the Transformer module encodes the output of the cyclic co-attention Transformer layer with a Transformer encoder.

8. The method according to claim 7, wherein the visual module and the language module are each composed of a multi-head attention mechanism, a batch normalization layer, an addition layer and a feed-forward layer, and the Transformer module is identical to a standard Transformer encoder.

9. The method according to claim 8, wherein, in the cyclic co-attention Transformer layer, the intermediate visual features and the intermediate language features are taken as the inputs of the visual and language streams; query, key and value matrices are obtained for the visual module and the language module through the standard Transformer calculation rules, and the key and value of each module serve as the input of the multi-head attention of the other module; the attention module adopts an image-conditioned language attention mechanism in the visual stream and a language-conditioned image attention mechanism in the language stream; the feed-forward layer is composed of two linear layers and a ReLU activation function and mainly fuses the word-vector information of the words in a sentence; it does not process temporal information and is only responsible for transforming the information at each position; the fusion layer splices the two outputs of the cyclic co-attention Transformer network together.

10. The method according to claim 8, wherein the Transformer decoder, building on the encoder, adopts a masked multi-head attention module to process self-attention over the previously output vocabulary; the decoding process is as follows: when decoding the current, i-th input vector, the (i−1)-th and earlier decoding results are available; the decoder decodes only one word per step, the output word is fed back as decoder input, and this is repeated until <eos> is decoded; the decoder output is converted into a probability vector by a linear mapping, the normalized class probabilities are then output through a softmax activation function, and the word corresponding to the maximum probability is taken as the output.

Technical Field

The invention belongs to the technical field of computer vision and machine translation, and particularly relates to a multi-modal Mongolian-Chinese translation method based on a cyclic co-attention Transformer.

Background

Machine translation, which converts one language into a target language, is an effective way to overcome language barriers. With the development of deep learning, machine translation based on deep learning has become mainstream; companies such as Google, Baidu, Youdao and iFLYTEK have conducted extensive research on machine translation and developed practical applications.

Before the advent of deep learning, machine translation went through the development of rule-based methods, corpus-based methods, and multi-method fusion. Mongolian machine translation started later than research on other languages, and data are scarce, so achieving high-quality translation is difficult. In 2017, L JINTING et al. proposed a Mongolian-Chinese translation model combining NMT with discrete lexicon probabilities, which mitigates the errors neural networks make when translating Mongolian low-frequency words and raises BLEU by 4.02 on a Mongolian-Chinese parallel corpus. In 2020, RQINGDAOERJI et al. proposed a Mongolian-Chinese translation model based on morpheme encoding and LSTM, using GRU-CRF to segment Mongolian words. The encoded Mongolian morpheme vectors serve as the LSTM input; the LSTM can retain important vector information, preventing the information loss caused by vanishing gradients and alleviating the large word-order differences between Mongolian and Chinese. Using a Mongolian-Chinese parallel corpus developed by Inner Mongolia University, experiments show that the morpheme-encoding LSTM model reaches 21.8 BLEU on sentences of more than 30 words, 1.6 BLEU higher than the PBMT model, and performs well on long-dependency problems.

At present, machine translation with deep learning is well developed for mainstream languages, but research on low-resource languages such as Mongolian remains scarce; in particular, data are severely lacking, and translation quality has not yet reached a good level.

Disclosure of Invention

To overcome the deficiencies of the prior art, the invention aims to provide a multi-modal Mongolian-Chinese translation method based on a cyclic co-attention Transformer which, starting from the way humans observe the world, uses a cyclic co-attention Transformer multi-modal network to capture feature information from both the visual and the language perspective; through multiple rounds of cycling it can effectively improve translation quality and solve the problem of poor Mongolian translation quality.

To achieve this purpose, the invention adopts the following technical scheme:

a multi-modal Mongolian Chinese translation method based on a cycle common attention transducer comprises the following steps:

Step 1, target detection

Carrying out target detection on an input image by using YOLO-V4, wherein the input image is an image description of a Mongolian text; comparing the Mongolian text with the target labels through correlation detection, eliminating the target images irrelevant to the Mongolian text and retaining those relevant to it; and encoding the Mongolian text into a tensor by using an encoding layer;

Step 2, feature extraction

Extracting and attending to the target image features by using a re-parameterized VGG network and a triple attention mechanism; letting the target image features and the encoded Mongolian text features (tensors) each interact several times in a modified bidirectional long short-term memory network; and then feeding them into a cyclic co-attention Transformer network;

Step 3, multimodal translation

Taking the target image features obtained after the several rounds of interaction and the encoded Mongolian text features as input, performing Mongolian-Chinese translation with the cyclic co-attention Transformer network, fully fusing the Mongolian language features and the visual features through several rounds of cyclic interaction, and outputting the target language.

The YOLO-V4 network is composed of CSPDenseNet, a path aggregation network and a regression prediction network; CSPDenseNet serves as the backbone network to extract image features, the path aggregation network serves as the neck network, and spatial pyramid pooling is added to CSPDenseNet to generate output of a fixed size.

CSPDenseNet comprises a CBM module and a cross-stage partial module. The CBM module consists of a convolution layer, a batch normalization layer and a Mish activation function. The cross-stage partial module divides the input visual information into two parts; one part is computed as in the original network, the other does not participate in the computation and is directly spliced with the result of the first part. The module consists of two branches, one convolving the trunk part and the other generating a residual edge; the learning capability of the convolutional neural network is enhanced by cross-stage splicing and channel integration of the two branches;

the path aggregation network establishes a path from bottom-level features to top-level features in a bottom-up manner, shortening the propagation path from bottom-level to top-level feature information, accurately preserving spatial information and correctly locating pixels;

the spatial pyramid pooling applies three maximum pooling layers of different sizes to the convolutional features before the fully-connected layer and splices the results into a one-dimensional vector, so that the size of the input image is unconstrained.

The re-parameterized VGG divides the VGG network into a training stage and an inference stage; a multi-branch network structure is adopted in the training stage to improve model accuracy, a single-branch network structure is adopted in the inference stage, and re-parameterization converts the weights of the multi-branch network into the single-branch network;

the triple attention mechanism calculates attention weights by capturing cross-dimensional interactions with a three-branch structure, and establishes dependencies between dimensions through rotation operations and residual transformations;

the modified bidirectional long short-term memory network lets the input $x_t$ of the current time step and the hidden state $h_{t-1}$ of the previous time step interact several times before they enter the long short-term memory network; the resulting vectors then serve as the input of the long short-term memory network to obtain a context-dependent representation.

The multi-branch network consists of a large number of small networks. The re-parameterized VGG applies re-parameterization on the basis of the VGG network, using 3 × 3 convolution layers, batch normalization layers and ReLU activation functions, and introduces a residual branch and a 1 × 1 convolution branch; the cross-layer connections of the residual network are removed in favour of direct connections, and the inference network is turned into a single-path structure by fusing the branches;

in the triple attention mechanism, given an input tensor $x \in R^{C \times H \times W}$ representing the target image features extracted by the convolutional neural network, with C, H, W denoting the number of channels, the height and the width of the input feature set R respectively: in the first branch, C interacts with H; the input x is first rotated 90° counter-clockwise along the height direction to obtain $\hat{x}_1$ of shape (W × H × C), whose shape becomes (2 × H × C) after Z-pooling, recorded as $\hat{x}_1^*$; a k × k convolution layer and a batch normalization layer produce an output of shape (1 × H × C), a sigmoid activation function generates the attention weights, and the weights are finally applied to $\hat{x}_1$, which is rotated 90° clockwise along the height direction to keep the shape consistent with the input x. In the second branch, the channel dimension C interacts with W; the input x is first rotated 90° counter-clockwise along the width direction to obtain $\hat{x}_2$, whose shape becomes (2 × W × C) after Z-pooling, recorded as $\hat{x}_2^*$; a k × k convolution layer and a batch normalization layer produce an output of shape (1 × W × C), a sigmoid activation function generates the attention weights, the weights are applied to $\hat{x}_2$, and the result is rotated 90° clockwise along the width direction to keep the shape consistent with the input x. In the third branch, the input x is reduced by Z-pooling to $\hat{x}_3$ of shape (2 × H × W); the output of a k × k convolution layer and a batch normalization layer then generates attention weights of shape (1 × H × W), which are applied to the input x to obtain the result. Finally, the tensors generated by the three branches are aggregated by averaging, where Z-pooling reduces the 0th dimension of a tensor to size 2 by concatenating average pooling and max pooling.

The cyclic co-attention Transformer network consists of a cyclic co-attention Transformer layer, a Transformer module, a fusion layer and a Transformer decoder; the cyclic co-attention Transformer layer adopts a multi-head attention mechanism to let the target image features obtained in step 2 and the encoded Mongolian text features interact cyclically and sends them to the Transformer module; the fusion layer then fuses the information, the Transformer decoder decodes the fused information, and the target language is output.

The cyclic co-attention Transformer layer is composed of a visual module and a language module; the visual module receives the extracted target image features and the language module receives the encoded Mongolian text features; the information of each region in the image is used as context to weight the Mongolian text, or the image regions are weighted according to the Mongolian text context, so that the network captures visual information and Mongolian text information simultaneously; the visual information and the Mongolian text information interact several times in the cyclic co-attention Transformer layer, and the Transformer module encodes the output of the cyclic co-attention Transformer layer with a Transformer encoder.

The visual module and the language module are each composed of a multi-head attention mechanism, a batch normalization layer, an addition layer and a feed-forward layer, and the Transformer module is identical to a standard Transformer encoder.

In the cyclic co-attention Transformer layer, the intermediate visual features and the intermediate language features are taken as the inputs of the visual and language streams; query, key and value matrices are obtained for the visual module and the language module through the standard Transformer calculation rules, and the key and value of each module serve as the input of the multi-head attention of the other module; the attention module adopts an image-conditioned language attention mechanism in the visual stream and a language-conditioned image attention mechanism in the language stream. The feed-forward layer is composed of two linear layers and a ReLU activation function and mainly fuses the word-vector information of the words in a sentence; it does not process temporal information and is only responsible for transforming the information at each position. The fusion layer splices the two outputs of the cyclic co-attention Transformer network together.

The Transformer decoder, building on the encoder, adopts a masked multi-head attention module to process self-attention over the previously output vocabulary. The decoding process is as follows: when decoding the current, i-th input vector, the (i−1)-th and earlier decoding results are available; the decoder decodes only one word per step, the output word is fed back as decoder input, and this is repeated until <eos> is decoded. The decoder output is converted into a probability vector by a linear mapping, the normalized class probabilities are then output through a softmax activation function, and the word corresponding to the maximum probability is taken as the output.

Compared with the prior art, the invention has the beneficial effects that:

1. The method is studied from the two directions of vision and Mongolian text; to address the poor translation quality of Mongolian, a multi-modal network based on the cyclic co-attention Transformer performs the translation task so as to improve translation quality.

2. To address the interaction between visual and Mongolian text information, the method uses a cyclic co-attention Transformer layer to let the visual information and the Mongolian text information interact, and strengthens the degree of interaction between visual and language information through multiple rounds of cycling.

3. To address the problem that channel attention and spatial attention are computed independently in the Convolutional Block Attention Module (CBAM), the invention introduces a triple attention mechanism into the re-parameterized VGG, performing cross-channel interaction by capturing the interrelation between the spatial and channel dimensions, thereby solving the independence problem of CBAM's channel and spatial attention computation.

4. To address the low accuracy and slow computation of the original VGG network, the invention uses the re-parameterized VGG to extract target image features. The re-parameterized VGG applies re-parameterization to decouple training from inference, and requires less memory.

5. For target detection, the method applies YOLO-V4 to the input image, performs image-text correlation detection after the targets are detected, finds the image targets related to the Mongolian text, and eliminates the irrelevant ones.

6. To address the independence of the inputs of the long short-term memory network, the invention uses a modified bidirectional long short-term memory network to let the input and the state interact over multiple rounds, strengthening the context-modelling capability.

Drawings

FIG. 1 is the cyclic co-attention Transformer based multi-modal translation network.

FIG. 2 shows the structure of DenseNet and CSPDenseNet.

Fig. 3 is a bottom-up path enhancement module structure.

Fig. 4 is a spatial pyramid pooling structure.

Fig. 5 is the re-parameterized VGG structure.

Fig. 6 is a triple attention mechanism configuration.

FIG. 7 is the modified bidirectional long short-term memory network.

FIG. 8 is the cyclic co-attention Transformer layer structure.

Detailed Description

The embodiments of the present invention will be described in detail below with reference to the drawings and examples.

The invention relates to a multi-modal Mongolian-Chinese translation method based on a cyclic co-attention Transformer; the overall network composition is shown in FIG. 1 and mainly comprises YOLO-V4, Triplet-RepVGG (triple-attention re-parameterized VGG), an encoding layer, a modified bidirectional long short-term memory network (modified BiLSTM), a cyclic co-attention Transformer layer, a Transformer module, a fusion layer and a Transformer decoder.

The multi-modal Mongolian-Chinese translation method based on a cyclic co-attention Transformer of the invention mainly comprises the following steps:

Step 1, target detection

The input image is an image description of a Mongolian text. Target detection is performed on the input image with YOLO-V4; the Mongolian text is compared with the target labels through correlation detection, target images irrelevant to the Mongolian text are removed and those relevant to it are retained; and the Mongolian text is encoded into a tensor by the encoding layer.

1) YOLO-V4 target detection network

The YOLO-V4 target detection network mainly comprises CSPDenseNet, a Path Aggregation Network and a regression prediction network. On the basis of the original YOLO target detection architecture, CSPDenseNet is used as the backbone network to extract image features, the path aggregation network serves as the neck network, and Spatial Pyramid Pooling is added to CSPDenseNet to generate output of a fixed size. Spatial pyramid pooling significantly increases the receptive field, separates out the most important contextual features, and has little impact on network speed. YOLO-V4 can be trained on a single GPU and achieves high speed in target detection.

A. CSPDenseNet

CSPDenseNet is mainly composed of Cross Stage Partial (CSP) modules and CBM modules.

The cross-stage partial module mainly addresses the heavy computation of the inference stage from the perspective of network structure design. It comprises two branches: one convolves the trunk part, the other generates a residual edge. The learning capability of the convolutional neural network is enhanced by cross-stage splicing of the two branches and channel integration. The cross-stage partial module divides the input visual information into two parts; one part is computed as in the original network, the other does not participate in the computation and is directly spliced with the result of the first part.

The CBM module consists of a convolution layer, a batch normalization layer and a Mish activation function. Introducing the cross-stage partial module enhances the learning capability of the CNN, maintaining accuracy while reducing weight, and also reduces the computation and memory requirements.

Each stage of DenseNet comprises a dense module and a transition layer; each dense module consists of k dense layers, and the output of the i-th dense module becomes the input of the (i+1)-th dense module after dimensionality reduction by the transition layer. DenseNet can be expressed by the following formulas:

$$x_1 = w_1 * x_0$$

$$x_2 = w_2 * [x_0, x_1]$$

$$x_i = w_i * [x_0, x_1, \ldots, x_{i-1}]$$

$$x_k = w_k * [x_0, x_1, \ldots, x_{k-1}]$$

where $*$ is the convolution operation, $w_i$ is the weight of the i-th dense module, $x_i$ is the output of the i-th dense module, and $[x_0, x_1, \ldots]$ denotes the concatenation of $x_0, x_1, \ldots$

If the weight is updated by adopting a back propagation algorithm, the weight updating formula is as follows:

$$w_1' = f(w_1, g_0)$$

$$w_2' = f(w_2, g_0, g_1)$$

$$w_3' = f(w_3, g_0, g_1, g_2)$$

$$w_i' = f(w_i, g_0, g_1, g_2, \ldots, g_{i-1})$$

$$w_k' = f(w_k, g_0, g_1, g_2, \ldots, g_{k-1})$$

where f is the weight update function, $g_i$ denotes the gradient propagated to the i-th dense module, and $w_i'$ is the updated i-th weight.

CSPDenseNet is mainly composed of partial dense modules and partial transition layers. In a partial dense module, the input visual feature map is divided through its channels into two parts, $x_0 = [x_0', x_0'']$, where $x_0'$ and $x_0''$ denote the channels of the first and the second part respectively. $x_0'$ is directly connected to the end of the stage, while $x_0''$ passes through the dense module. The partial transition layer proceeds as follows: the output of the dense layers $[x_0'', x_1, \ldots, x_k]$ passes through a transition layer; the output $x_T$ of that transition layer is concatenated with $x_0'$ and passed through another transition layer to obtain the output $x_U$.

The feed-forward equations of CSPDenseNet are as follows:

$$x_k = w_k * [x_0'', x_1, \ldots, x_{k-1}]$$

$$x_T = w_T * [x_0'', x_1, \ldots, x_k]$$

$$x_U = w_U * [x_0', x_T]$$

where $x_k$ denotes the output of the k-th dense layer, $x_T$ the output of the transition layer and $x_U$ the feed-forward output of the network; $w_k$, $w_T$ and $w_U$ are the weights of the k-th dense layer, the transition layer and the feed-forward output respectively.

The weight update formula of CSPDenseNet is as follows:

$$w_k' = f(w_k, g_0'', g_1, g_2, \ldots, g_{k-1})$$

$$w_T' = f(w_T, g_0'', g_1, g_2, \ldots, g_k)$$

$$w_U' = f(w_U, g_0', g_T)$$

where f is the weight update function; $w_k'$, $w_T'$ and $w_U'$ are the updated weights of the k-th dense layer, the transition layer and the feed-forward output respectively; $g_k$ denotes the gradient propagated to the k-th dense layer and $g_T$ the gradient propagated to the transition layer.

The Mish activation function can be represented by the following equation:

$$\mathrm{Mish}(x) = x \cdot \tanh\big(\ln(1 + e^x)\big)$$

where $e^x$ is the exponential function.

FIG. 2 shows DenseNet without the cross-stage partial module (a) and DenseNet with the cross-stage partial module, i.e. CSPDenseNet (b). CSPDenseNet divides the input visual information into two parts: branch 1 does not participate in the computation, branch 2 passes through the dense module and the transition layer with the same computation as in (a), and finally branch 1 and branch 2 are fused by a transition layer.
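The split-and-splice computation described above can be illustrated with a minimal PyTorch sketch. The layer sizes, the number of dense layers and the growth rate are illustrative assumptions, not the exact YOLO-V4 configuration:

```python
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    """Sketch of a cross-stage partial block (assumes an even channel count)."""
    def __init__(self, channels: int, dense_layers: int = 2, growth: int = 32):
        super().__init__()
        half = channels // 2
        layers, c = [], half
        for _ in range(dense_layers):
            layers.append(nn.Sequential(
                nn.Conv2d(c, growth, 3, padding=1, bias=False),
                nn.BatchNorm2d(growth),
                nn.Mish(),                       # CBM-style: Conv + BN + Mish
            ))
            c += growth
        self.dense = nn.ModuleList(layers)
        self.transition = nn.Conv2d(c, half, 1, bias=False)       # partial transition
        self.fuse = nn.Conv2d(channels, channels, 1, bias=False)  # final fusion

    def forward(self, x):
        x0, x1 = torch.chunk(x, 2, dim=1)        # branch 1 skips computation
        feats = [x1]
        for layer in self.dense:                 # branch 2: dense connections
            feats.append(layer(torch.cat(feats, dim=1)))
        xt = self.transition(torch.cat(feats, dim=1))
        return self.fuse(torch.cat([x0, xt], dim=1))  # cross-stage splice
```

The key design point is that branch 1 reaches the end of the stage without passing through the dense computation, shortening gradient paths and reducing duplicated gradient information.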

B. Path aggregation network

To address the problems that the path from low-level to high-level features is too long and that localizing the information flow is difficult, the path aggregation network establishes a bottom-up path from bottom-level to top-level features, shortening the propagation path of bottom-level feature information, accurately preserving spatial information, correctly locating pixels and strengthening the feature pyramid.

Specifically, the path aggregation network is adopted as the neck network of YOLO-V4; by adding bottom-up path enhancement, the propagation path length from low-level to high-level features in the convolutional neural network is shortened, so that information in the network can propagate further. Define the outputs as $\{N_2, N_3, N_4, N_5\}$; the calculation formula is:

$$N_i = \mathrm{conv}(r(P_i) + \mathrm{up}(N_{i-1})), \quad i \in \{2, 3, 4, 5\}$$

The path aggregation network reduces the propagation loss from the bottom-level features to the higher-level features by adding a bottom-up path to the network. The propagation path in the feature pyramid is $C_2 \to C_3 \to C_4 \to C_5 \to P_5$; the segment $C_2 \to \ldots \to C_5$ suffers a massive loss of bottom-level feature information. The propagation path of the path aggregation network is $C_2 \to P_2 \to N_2 \to \ldots \to N_5$, and the bottom-level features are better preserved through the two lateral connections.

FIG. 3 shows the bottom-up path enhancement module. Each feature map $N_i$ passes through a 3 × 3 convolution layer with stride 2; the result is then fused with the laterally connected feature map $P_{i+1}$; the fused feature map passes through another 3 × 3 convolution layer to generate $N_{i+1}$ as input to the next layer, until $P_5$ is reached. Finally the feature map set $\{N_2, N_3, N_4, N_5\}$ is output. All convolution layers have 256 channels, and each is followed by a ReLU activation function.
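A minimal sketch of this bottom-up augmentation, assuming 256-channel inputs $\{P_2, \ldots, P_5\}$ that already come from the top-down feature pyramid (module and parameter names are illustrative):

```python
import torch.nn as nn

class BottomUpPath(nn.Module):
    """Sketch of bottom-up path enhancement: N_{i+1} = conv(down(N_i) + P_{i+1})."""
    def __init__(self, channels: int = 256, levels: int = 4):
        super().__init__()
        self.down = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1)  # stride-2 downsample
            for _ in range(levels - 1))
        self.smooth = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1)            # 3x3 conv -> N_{i+1}
            for _ in range(levels - 1))
        self.act = nn.ReLU()

    def forward(self, p):            # p = [P2, P3, P4, P5], fine to coarse
        n = [p[0]]                   # N2 = P2
        for i in range(len(p) - 1):
            fused = self.down[i](n[-1]) + p[i + 1]      # lateral fusion
            n.append(self.act(self.smooth[i](fused)))
        return n                     # [N2, N3, N4, N5]
```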

C. Spatial pyramid pooling

The convolution layers place no requirement on the size of the input data, but the fully-connected layer requires a fixed input size, whereas in practice data sizes vary. To solve this problem and leave the input image size unfixed, spatial pyramid pooling is adopted. The convolutional features before the fully-connected layer are pooled by three maximum pooling layers of different sizes and spliced into a one-dimensional vector, so that the network has no requirement on the size of the input data. Spatial pyramid pooling fixes the size of the output through multi-level pooling and extracts multi-scale features of the input feature map, obtaining more comprehensive local information; it can effectively improve the performance of the convolutional neural network, removes the constraint that the input image size must be fixed, and adds scale invariance.

Given input data of size (C, H, W) — the number of channels, height and width respectively — and a pooling number of (n, n), the size and stride of each pooling window in the spatial pyramid pooling layer can be calculated as:

$$K = \left\lceil \frac{H}{n} \right\rceil, \qquad S = \left\lfloor \frac{H}{n} \right\rfloor$$

where K is the size of the pooling window, S is the pooling stride, and ceil and floor round up and down respectively.

The spatial pyramid pooling layer structure is shown in fig. 4; the input is the feature output by the convolution layer, which is max-pooled with pooling windows of 1 × 1, 2 × 2 and 4 × 4 respectively. The left part maps the features to 16 × 256, the middle part to 4 × 256, the right part to 1 × 256, and finally the three parts are fused into a one-dimensional vector of size 1 × 10752.
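The pooling-window arithmetic above is what adaptive pooling computes internally, so the layer can be sketched in a few lines of PyTorch (the bin sizes 1, 2, 4 follow fig. 4; the channel count is an illustrative assumption):

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(x: torch.Tensor, bins=(1, 2, 4)) -> torch.Tensor:
    """Max-pool the feature map to fixed grids (1x1, 2x2, 4x4) and splice
    the results into one one-dimensional vector per sample."""
    n = x.size(0)
    parts = [F.adaptive_max_pool2d(x, b).view(n, -1) for b in bins]
    return torch.cat(parts, dim=1)        # length independent of input H, W

x = torch.randn(1, 256, 13, 13)           # a conv feature map of arbitrary H, W
v = spatial_pyramid_pool(x)
print(v.shape)                            # torch.Size([1, 5376]) = 256*(1+4+16)
```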

D. Regression prediction network

YOLO-V4 predicts the offset relative to the grid cells of the feature map by bounding-box regression to determine the target centre, and simultaneously predicts scale coefficients for the width and height of the anchor box to determine the target size. The formulas are:

$$u_x = \sigma(t_x) + c_x, \quad u_y = \sigma(t_y) + c_y, \quad u_w = p_w e^{t_w}, \quad u_h = p_h e^{t_h}$$

where σ is the sigmoid activation function, $(u_x, u_y, u_w, u_h)$ are the centre coordinates, width and height of the real box in the feature map, $(t_x, t_y, t_w, t_h)$ are the predicted centre-point and width-height offsets, $(c_x, c_y)$ is the position in the feature map of the grid cell containing the centre of the real box, and $(p_w, p_h)$ are the width and height of the anchor box best matching the real box.
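A small sketch of this decoding step for a single prediction (the function name is hypothetical; the inputs are illustrative):

```python
import torch

def decode_box(t, cell, anchor):
    """u_x = sigma(t_x) + c_x, u_y = sigma(t_y) + c_y,
    u_w = p_w * exp(t_w),      u_h = p_h * exp(t_h)."""
    tx, ty, tw, th = t            # predicted offsets (t_x, t_y, t_w, t_h)
    cx, cy = cell                 # grid cell holding the target centre
    pw, ph = anchor               # best-matching anchor width and height
    return (torch.sigmoid(tx) + cx, torch.sigmoid(ty) + cy,
            pw * torch.exp(tw), ph * torch.exp(th))

u = decode_box(torch.tensor([0.2, -0.4, 0.1, 0.3]), (6.0, 4.0), (3.2, 2.1))
```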

E. Loss function

YOLO-V4 uses Complete Intersection over Union (CIOU) loss and Distance Intersection over Union (DIOU) loss to make the network's predicted boxes more accurate. DIOU directly minimizes the distance between the real box and the predicted box, which accelerates regression; CIOU makes the regression loss more accurate when the predicted box overlaps the target box, and converges faster.

The IOU is calculated as:

$$IOU = \frac{|X \cap X^{gt}|}{|X \cup X^{gt}|}$$

where X is the area of the box to be detected and $X^{gt}$ is the area of the real box.

DIOU stabilizes target-box regression by taking into account factors such as the distance between the target and the anchor point, the overlap ratio and the scale, avoiding divergence during training. The formula is:

$$L_{DIOU} = 1 - IOU + \frac{\rho^2(b, b^{gt})}{c^2}$$

where b and $b^{gt}$ are the centre points of the predicted box and the real box respectively, ρ denotes the Euclidean distance between the centre points, and c is the diagonal length of the smallest enclosing region covering both the predicted box and the real box.

CIOU adds an influence factor on the basis of DIOU:

$$L_{CIOU} = 1 - IOU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v, \qquad v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2, \qquad \alpha = \frac{v}{(1 - IOU) + v}$$

where α is the weighting function, v is a parameter measuring the consistency of the aspect ratio, $w^{gt}$ and $h^{gt}$ are the width and height of the real box, and w and h are the width and height of the predicted box.
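A sketch of the CIOU loss assembled from the three terms above, assuming boxes given as (x1, y1, x2, y2) corner tensors (helper name and epsilon are assumptions):

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIOU loss for one predicted and one ground-truth box, each a 4-tensor."""
    # intersection and union -> IOU
    ix1, iy1 = torch.max(pred[0], target[0]), torch.max(pred[1], target[1])
    ix2, iy2 = torch.min(pred[2], target[2]), torch.min(pred[3], target[3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_t = (target[2] - target[0]) * (target[3] - target[1])
    iou = inter / (area_p + area_t - inter + eps)
    # squared centre distance rho^2 and enclosing-box diagonal c^2
    rho2 = ((pred[0] + pred[2]) - (target[0] + target[2])) ** 2 / 4 + \
           ((pred[1] + pred[3]) - (target[1] + target[3])) ** 2 / 4
    cx1, cy1 = torch.min(pred[0], target[0]), torch.min(pred[1], target[1])
    cx2, cy2 = torch.max(pred[2], target[2]), torch.max(pred[3], target[3])
    c2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2 + eps
    # aspect-ratio consistency term v and its weight alpha
    w_p, h_p = pred[2] - pred[0], pred[3] - pred[1]
    w_t, h_t = target[2] - target[0], target[3] - target[1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / h_t) - torch.atan(w_p / h_p)) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```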

F. Correlation detection

Correlation detection first trains a Word2Vec model on a Mongolian corpus and extracts the keywords of the input Mongolian text with the TextRank algorithm; the trained Word2Vec model then encodes the target-category information output by the YOLO-V4 network and the keywords of the input Mongolian text into multi-dimensional word vectors. The similarity between a keyword and the target-category information output by the YOLO-V4 network is computed from the cosine distance between the vectors:

$$\mathrm{sim}(\mathrm{text}_i, \mathrm{image}_j) = \frac{\mathrm{text}_i \cdot \mathrm{image}_j}{\|\mathrm{text}_i\| \, \|\mathrm{image}_j\|}$$

where $\mathrm{text}_i$ is the i-th keyword vector of the Mongolian text and $\mathrm{image}_j$ is the j-th word vector of the target category.

The cosine similarity between each target category and the Mongolian text keywords is computed, and targets whose cosine similarity exceeds ρ are retained, where ρ is a threshold between 0 and 1. Targets whose categories are unrelated to the Mongolian text keywords are thus removed and related ones retained; correlation detection strengthens the relation between the Mongolian text and the image targets and thereby improves translation quality.
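A minimal sketch of this filtering step, assuming the Word2Vec vectors for the text keywords and the target labels have already been computed (names and the threshold value are illustrative):

```python
import numpy as np

def filter_targets(text_vecs, target_vecs, labels, rho=0.5):
    """Keep only detected targets whose label vector has cosine similarity
    greater than rho with at least one Mongolian keyword vector."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    kept = []
    for vec, label in zip(target_vecs, labels):
        if max(cos(t, vec) for t in text_vecs) > rho:
            kept.append(label)      # target related to the Mongolian text
    return kept
```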

Word2Vec is a shallow neural network that maps sparse word vectors into dense word vectors; the resulting word vectors contain contextual and semantic information. Word2Vec can be trained with the Skip-Gram or CBOW (Continuous Bag of Words) model: Skip-Gram computes the probability distribution of the context word vectors from the vector of the current word, while CBOW computes the probability distribution of the centre word from the context vectors. Skip-Gram is adopted here to build the Word2Vec model.

Skip-Gram generates context words from the centre word. Assuming the context words generated by a Mongolian centre word are mutually independent, the conditional probability G of generating all context words from any Mongolian centre word is:

$$G = \prod_{t=1}^{T} \prod_{-m \le j \le m,\, j \ne 0} P(W_{t+j} \mid W_t)$$

where m is the window size of the Mongolian context, T is the length of the Mongolian text sequence, $W_t$ is the Mongolian word at time step t, $W_{t+j}$ is the Mongolian word at time step t + j, and P is the conditional probability.

The TextRank algorithm extracts keywords from Mongolian text using a Mongolian text corpus. Its main idea is to construct a network from the adjacency relations between Mongolian words; the formula is:

$$T(v_i) = (1 - d) + d \sum_{v_j \in I(v_i)} \frac{w_{ji}}{\sum_{v_k \in O(v_j)} w_{jk}} T(v_j)$$

where $T(v_i)$ is the TextRank value of node $v_i$, d is the damping coefficient, $I(v_i)$ is the set of nodes pointing to $v_i$, $O(v_i)$ is the set of nodes $v_i$ points to, and $w_{ij}$ is the weight of the edge between nodes $v_i$ and $v_j$.
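A small sketch of this iteration on a weighted word-adjacency matrix (damping factor and iteration count are illustrative):

```python
import numpy as np

def textrank(W: np.ndarray, d: float = 0.85, iters: int = 50) -> np.ndarray:
    """TextRank scores for a weighted adjacency matrix W,
    where W[i, j] is the weight of the edge from node i to node j."""
    n = W.shape[0]
    out_w = W.sum(axis=1, keepdims=True)     # sum of outgoing edge weights
    out_w[out_w == 0] = 1.0                  # avoid division by zero
    P = W / out_w                            # normalized transition weights
    t = np.ones(n)
    for _ in range(iters):
        t = (1 - d) + d * P.T @ t            # the T(v_i) update above
    return t                                 # higher score = stronger keyword
```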

2) Coding layer

The encoding layer encodes the input Mongolian text into vectors with the Word2Vec model and then encodes those vectors with an embedding layer from deep learning. The embedding layer is a mapping from a semantic space to a vector space that preserves, as far as possible, the relations the original samples have in the semantic space; it reduces high-dimensional vectors to low-dimensional ones. Assuming the input data has size n × m and the output size n × d, an m × d tensor must be trained to transform the input data; this tensor is called the embedding layer and generally consists of several layers of fully-connected neural networks. A fully-connected neural network consists of a linear part and a non-linear part; the linear part is a simple linear weighted sum:

z=Wx+b

where the input data is $x = [x_0, x_1, \ldots, x_n]^T$, W is the weight matrix, $b = [b_0, b_1, \ldots, b_m]$ is the bias term, and $z = [z_0, z_1, \ldots, z_m]$ is the output.

The linear part analyzes the input data in multiple angles, and the nonlinear part maps the input data in a standardized way.
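A minimal sketch of such an embedding layer, here a single linear map followed by a non-linearity; the sizes n, m, d are illustrative assumptions:

```python
import torch
import torch.nn as nn

n, m, d = 32, 300, 64            # illustrative sizes
embed = nn.Sequential(
    nn.Linear(m, d),             # linear part: z = Wx + b
    nn.Tanh(),                   # non-linear part: normalized mapping
)
x = torch.randn(n, m)            # n Word2Vec vectors of dimension m
z = embed(x)                     # -> (n, d) low-dimensional encodings
```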

Step 2, feature extraction

The target image features are extracted and attended to by using a re-parameterized VGG network and a triple attention mechanism; the target image features and the encoded Mongolian text features (tensors) each interact several times in a modified bidirectional long short-term memory network, and are then fed into the cyclic co-attention Transformer network.

1) Triplet-RepVGG

A. Re-parameterized VGG

The re-parameterized VGG divides the network into a training stage and an inference stage. A multi-branch network structure is used in the training stage to improve model accuracy and avoid gradient dispersion during training. However, a multi-branch structure increases the amount of computation and slows prediction, so a single-branch network structure is adopted in the inference stage, and re-parameterization converts the weights of the multi-branch network into the single-branch network; the multi-branch network consists of a large number of small networks.

Specifically, the re-parameterized VGG applies re-parameterization on the basis of a VGG network: 3 × 3 convolution layers, batch normalization layers and ReLU activation functions are used, and a residual branch (as in ResNet) and a 1 × 1 convolution branch are introduced and stacked into the training model; the cross-layer connections of the residual network are removed in favour of direct connections, and the inference network becomes a single-path structure by fusing the branches. With this technique the re-parameterized VGG computes faster, needs less memory and is highly flexible.

The re-parameterization weight conversion proceeds as follows. Let $W^{(3)} \in R^{C_2 \times C_1 \times 3 \times 3}$ denote the kernel of the 3 × 3 convolution layer with $C_1$ input channels and $C_2$ output channels, and $W^{(1)} \in R^{C_2 \times C_1 \times 1 \times 1}$ the kernel of the 1 × 1 branch. $\nu^{(3)}, \sigma^{(3)}, \gamma^{(3)}, b^{(3)}$ denote the accumulated mean, standard deviation, learned scale factor and bias of the batch normalization layer after the 3 × 3 convolution; $\nu^{(1)}, \sigma^{(1)}, \gamma^{(1)}, b^{(1)}$ are the parameters of the batch normalization layer after the 1 × 1 convolution, and $\nu^{(0)}, \sigma^{(0)}, \gamma^{(0)}, b^{(0)}$ those of the batch normalization layer after the identity branch. Let $Z^{(1)}$ and $M^{(2)}$ denote the input and output respectively, and $*$ the convolution operator. Assuming $C_1 = C_2$, $H_1 = H_2$, $W_1 = W_2$, we have:

$$M^{(2)} = \mathrm{bn}(Z^{(1)} * W^{(3)}, \nu^{(3)}, \sigma^{(3)}, \gamma^{(3)}, b^{(3)}) + \mathrm{bn}(Z^{(1)} * W^{(1)}, \nu^{(1)}, \sigma^{(1)}, \gamma^{(1)}, b^{(1)}) + \mathrm{bn}(Z^{(1)}, \nu^{(0)}, \sigma^{(0)}, \gamma^{(0)}, b^{(0)})$$

where bn is the batch normalization function at inference time; for every channel $1 \le i \le C_2$:

$$\mathrm{bn}(M, \nu, \sigma, \gamma, b)_{:,i,:,:} = \left(M_{:,i,:,:} - \nu_i\right)\frac{\gamma_i}{\sigma_i} + b_i$$

The core of the weight transformation is to convert a batch normalization layer and the preceding convolution layer into a single convolution layer containing a bias vector. Assuming {W′, b′} are the kernel weights and bias transformed from {W, ν, σ, γ, b}, then:

$$W_i' = \frac{\gamma_i}{\sigma_i} W_i, \qquad b_i' = b_i - \frac{\nu_i \gamma_i}{\sigma_i}$$

the convolution operation after conversion is the same as the original convolution and batch normalization operation, as follows:

bn(Z*W,ν,σ,γ,b):,i,:,:=(Z*W′):,i,:,:+b′i

Fig. 5 shows the re-parameterized VGG structure. Panel A shows a residual network, mainly consisting of 3 × 3 convolution layers, 1 × 1 convolution layers, identity connections and ReLU activation functions. Panel B is the network structure of the re-parameterized VGG training stage, similar to the residual network; the main difference is that the 1 × 1 and identity connections in the re-parameterized VGG have no cross-layer propagation. The re-parameterized VGG contains two kinds of residual structures: one consists only of a 1 × 1 convolution layer, the other of a 1 × 1 convolution layer plus an identity connection. Panel C is the network structure of the re-parameterized VGG inference stage; the network consists of 3 × 3 convolution layers and ReLU activation functions, its structure is simple, and it accelerates model inference.
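The conv + BN folding at the heart of this conversion can be sketched as follows. This is a simplification that fuses one 3 × 3 branch; merging the 1 × 1 and identity branches additionally requires padding their kernels to 3 × 3 and summing, which is omitted here:

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BN layer into the preceding (bias-free) convolution:
    W'_i = (gamma_i / sigma_i) * W_i,  b'_i = b_i - nu_i * gamma_i / sigma_i."""
    std = torch.sqrt(bn.running_var + bn.eps)                  # sigma
    w = conv.weight * (bn.weight / std).reshape(-1, 1, 1, 1)   # scaled kernel
    b = bn.bias - bn.running_mean * bn.weight / std            # folded bias
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride, conv.padding, bias=True)
    fused.weight.data, fused.bias.data = w, b
    return fused

conv, bn = nn.Conv2d(64, 64, 3, padding=1, bias=False), nn.BatchNorm2d(64)
bn.eval()                                  # use accumulated statistics
single = fuse_conv_bn(conv, bn)            # one conv layer, same output
```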

B. Triple attention mechanism

The Z-pooling layer reduces the 0th dimension of a tensor to size 2 by concatenating average pooling and max pooling; it preserves most of the information while reducing depth, keeping the network lightweight. The formula is:

$$Z\text{-}pool(x) = [\mathrm{MaxPool}_{0d}(x), \mathrm{AvgPool}_{0d}(x)]$$

where 0d means that the max pooling and the average pooling operate over the 0th dimension; for example, a tensor of shape (C × W × H) becomes shape (2 × W × H) after Z-pooling.

The triple attention mechanism calculates attention weights by capturing cross-dimensional interactions using a three-branch structure. The dependency relationship between the dimensions is established through the rotation operation and the residual transformation, and the influence on the calculation amount of the network is small.

Given an input tensor $x \in R^{C \times H \times W}$ — the target image features extracted by the convolutional neural network, with C, H, W the number of channels, height and width of the input feature set R — the input is passed into the three branches of the triple attention module. In the first branch, C interacts with H: the input x is first rotated 90° counter-clockwise along the H axis, recorded as $\hat{x}_1$ of shape (W × H × C). After Z-pooling its shape becomes (2 × H × C), recorded as $\hat{x}_1^*$. A k × k convolution layer and a batch normalization layer produce an output of shape (1 × H × C), and a sigmoid activation function generates the attention weights, which are applied to $\hat{x}_1$; the result is finally rotated 90° clockwise along the H axis to keep the shape consistent with the input x.

The second branch operates like the first, except that C interacts with W: the input x is first rotated 90° counter-clockwise along the W axis to obtain $\hat{x}_2$, then $\hat{x}_2^*$ is obtained by Z-pooling; finally the output is rotated 90° clockwise along the W axis to keep the shape consistent with the input.

In the third branch, the input x is reduced by Z-pooling to $\hat{x}_3$ of shape (2 × H × W); a k × k convolution layer and a batch normalization layer followed by a sigmoid activation generate attention weights of shape (1 × H × W), which are applied to the input x to obtain the result; finally the tensors generated by the three branches are aggregated by simple averaging. For the input tensor $x \in R^{C \times H \times W}$, triple attention yields the output y:

$$y = \frac{1}{3}\left(\overline{\hat{x}_1\,\sigma(\psi_1(\hat{x}_1^*))} + \overline{\hat{x}_2\,\sigma(\psi_2(\hat{x}_2^*))} + x\,\sigma(\psi_3(\hat{x}_3))\right)$$

where σ denotes the sigmoid activation function and $\psi_1, \psi_2, \psi_3$ denote the convolution operations in the three branches.

Simplifying the above formula gives:

$$y = \frac{1}{3}\left(\overline{\hat{x}_1 \omega_1} + \overline{\hat{x}_2 \omega_2} + x\,\omega_3\right)$$

where $\omega_1, \omega_2, \omega_3$ are the cross-dimensional attention weights in the three branches and the overbar denotes a 90° clockwise rotation.

FIG. 6 shows the triple attention mechanism; the input tensors obtain the final result through three branches. The top branch computes the attention weights of channel dimension C and spatial dimension W, the middle branch computes the attention weights of channel dimension C and spatial dimension H, and the bottom branch captures the spatial dependency between H and W. The top and middle branches use rotation operations to establish the relation between the channel dimension and the spatial dimensions, and finally the three branches are aggregated by simple averaging.
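A sketch of Z-pooling and the three-branch aggregation in PyTorch; the permutations stand in for the 90° rotations described above, and the kernel size k = 7 is an assumption:

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Concatenate max- and average-pooling over the channel dimension,
    reducing (C x H x W) to (2 x H x W)."""
    def forward(self, x):                          # x: (N, C, H, W)
        return torch.cat([x.max(dim=1, keepdim=True).values,
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    """One branch: Z-pool, k x k conv + BN, sigmoid weights applied to input."""
    def __init__(self, k: int = 7):
        super().__init__()
        self.zpool = ZPool()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 1, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(1))
    def forward(self, x):
        return x * torch.sigmoid(self.conv(self.zpool(x)))

class TripletAttention(nn.Module):
    """Three branches: C-H, C-W and plain H-W interactions, averaged."""
    def __init__(self):
        super().__init__()
        self.branch_ch = AttentionGate()
        self.branch_cw = AttentionGate()
        self.branch_hw = AttentionGate()
    def forward(self, x):                          # x: (N, C, H, W)
        # swapping C with W (dims 1 and 3) realizes the rotation along H
        y1 = self.branch_ch(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        # swapping C with H (dims 1 and 2) realizes the rotation along W
        y2 = self.branch_cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        y3 = self.branch_hw(x)                     # spatial H-W branch
        return (y1 + y2 + y3) / 3                  # simple averaging
```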

C. Bidirectional long short-term memory network

The long short-term memory network is widely used in various tasks. The memory cell c and the hidden-layer output value h at the current time step are computed as follows:

$$f = \sigma(W_{fx} x + W_{fh} h_{prev} + b_f)$$

$$i = \sigma(W_{ix} x + W_{ih} h_{prev} + b_i)$$

$$j = \tanh(W_{jx} x + W_{jh} h_{prev} + b_j)$$

$$o = \sigma(W_{ox} x + W_{oh} h_{prev} + b_o)$$

$$c = f \odot c_{prev} + i \odot j$$

$$h = o \odot \tanh(c)$$

where σ is the sigmoid activation function, ⊙ is the element-wise product, and $W_{**}$ and $b_*$ are the weight matrices and biases respectively. f denotes the forget gate and $c_{prev}$ the previous memory cell; i is the input gate, j and o are the candidate state and the output gate, and c and h are the output values of the memory cell and the hidden layer at the current time step. The forget gate f forgets the previous memory cell $c_{prev}$, the input gate i controls the input of the current information, and the output gate o controls the output of the memory cell.

The bidirectional long short-term memory network consists of a forward and a backward long short-term memory network; it can capture the context information in a sequence, obtaining both the future and the past information of the sequence. The hidden-layer output at time t is computed as:

$$h_t = [\overrightarrow{h_t};\, \overleftarrow{h_t}]$$

where $\overrightarrow{h_t}$ denotes the forward output vector and $\overleftarrow{h_t}$ the backward output vector.

The main idea of the modified bidirectional long short-term memory network is to let $x_t$ and $h_{t-1}$ interact alternately before the bidirectional LSTM computation, where $r_1$ and $r_2$ denote the numbers of interactions of $x_t$ and $h_{t-1}$ respectively.

The vector interaction update formula is as follows:

$$x_i = 2\sigma(G_i h_{i-1}) \odot x_{i-2}, \quad \text{for odd } i \in [1 \ldots r]$$

$$h_i = 2\sigma(D_i x_{i-1}) \odot h_{i-2}, \quad \text{for even } i \in [1 \ldots r]$$

where the number of rounds r is a hyper-parameter (with r = 0 the model is a plain bidirectional long short-term memory network) and the matrices $G_i, D_i$ are randomly initialized. The constant 2 appears in the formulas because after the sigmoid activation the values lie in (0, 1), and repeated multiplication would drive them towards 0; multiplying by 2 keeps the values stable.

In the invention, the modified bidirectional long short-term memory network lets the input $x_t$ of the current time step and the hidden state $h_{t-1}$ of the previous time step interact several times before they enter the LSTM; the resulting vectors then serve as the LSTM input, strengthening the context-modelling capability of the network and yielding context-dependent representations. The Mongolian text features and the target image features interact separately in the modified bidirectional LSTM, which markedly enhances the feature representations and thus improves translation quality.

The interactions of the target image features and of the Mongolian text features in the modified bidirectional LSTM are independent of each other; the number of interactions is set manually, and empirically 4 or 5 rounds give the best results.

FIG. 7 shows the modified bidirectional long short-term memory network with 5 rounds of updates. The previous state $h_0 = h_{prev}$ gates $x_{-1} = x$ through a sigmoid activation to produce $x_1$; the linearly transformed $x_1$ gates $h_0$ to produce $h_2$; after several repeated gating rounds, the final values $h_*$ and $x_*$ of the sequence are fed into the bidirectional LSTM cell.
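A minimal sketch of these pre-LSTM gating rounds, assuming linear maps in place of the randomly initialized matrices $G_i$, $D_i$ above (class and parameter names are illustrative):

```python
import torch
import torch.nn as nn

class Mogrify(nn.Module):
    """Alternating pre-LSTM interaction rounds:
    x_i = 2*sigmoid(G_i h_{i-1}) * x_{i-2} for odd i,
    h_i = 2*sigmoid(D_i x_{i-1}) * h_{i-2} for even i."""
    def __init__(self, x_dim: int, h_dim: int, rounds: int = 5):
        super().__init__()
        self.rounds = rounds
        self.G = nn.ModuleList(nn.Linear(h_dim, x_dim, bias=False)
                               for _ in range(rounds))
        self.D = nn.ModuleList(nn.Linear(x_dim, h_dim, bias=False)
                               for _ in range(rounds))
    def forward(self, x, h):
        for i in range(1, self.rounds + 1):
            if i % 2:    # odd round: the state gates the input
                x = 2 * torch.sigmoid(self.G[i - 1](h)) * x
            else:        # even round: the input gates the state
                h = 2 * torch.sigmoid(self.D[i - 1](x)) * h
        return x, h      # final values, fed into the bidirectional LSTM cell
```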

Step 3, multimodal translation

Taking the interacted target image features and the encoded Mongolian text features as input, Mongolian-to-Chinese translation is performed with the cyclic common attention Transformer network. Through several rounds of cyclic interaction, the Mongolian language features and the visual features are fully fused, and the target language is output.

The cyclic common attention Transformer network consists of a cyclic common attention Transformer layer, a Transformer module, a fusion layer, and a Transformer decoder. The cyclic common attention Transformer layer uses a multi-head attention mechanism to cyclically interact the target image features obtained in step 2 with the encoded Mongolian text features and passes the result to the Transformer module; the fusion layer then fuses the information, and the Transformer decoder decodes the fused information and outputs the target language.

The cyclic interaction in the cyclic common attention Transformer layer differs from that of the deformed bidirectional LSTM. Its inputs are the visual feature information and the Mongolian language feature information; a visual module and a language module fuse and exchange them through multi-head attention layers, and the outputs of the two modules are fed back as their own inputs for the next round. This repeats k times; k defaults to 5 and can range from 1 to 10, where k = 1 means no cycling. The larger k is, the more memory the model occupies and the slower it runs. The interaction terminates after k cycles. Through multiple rounds of cycling, translation quality can be effectively improved, addressing the problem of poor Mongolian translation quality.

The cyclic common attention Transformer layer consists of a visual module and a language module. The visual module consists of a multi-head attention mechanism, a batch normalization layer, an addition layer, and a feed-forward layer; the language module has the same structure, except that its input is the encoded Mongolian text features while the visual module's input is the encoded image features (i.e. the target image features). The information of each image region serves as context to weight the Mongolian text, and conversely the image regions are weighted according to the Mongolian text context, so the network captures visual information and Mongolian text information simultaneously, improving performance on the translation task.

The specific architecture of the cyclic common attention Transformer layer is shown in FIG. 8. Building on the original Transformer encoder, the current visual features F_V and Mongolian language features F_W serve as the inputs of the visual module and the language module respectively. Following the standard Transformer calculation rules, the visual query Q_V, visual key K_V, visual value V_V, Mongolian language query Q_W, Mongolian language key K_W, and Mongolian language value V_W matrices are obtained. Q_V, K_W, V_W form the input of the visual module's multi-head attention layer, and V_V, K_V, Q_W form the input of the language module's multi-head attention layer. Both streams pass through the add-and-normalize layer and the feed-forward layer, and their outputs are fed back as the inputs of the visual and language modules to continue the cyclic interaction. After k cycles, the next-stage visual features and Mongolian language features are obtained.

The Transformer module is identical to a standard Transformer encoder and encodes the outputs of the visual module and the language module. By performing multiple rounds of cyclic interaction between the visual and language modules, the cyclic common attention Transformer layer fuses the visual and language information more thoroughly.

Define the intermediate visual features and intermediate Mongolian language features as F_V^(i) and F_W^(i). The query, key, and value matrices are obtained through the standard Transformer calculation rules, and the keys and values of each module are fed as the multi-head attention input of the other module. The attention module applies an image-conditioned language attention mechanism in the visual stream and a language-conditioned image attention mechanism in the language stream; the specific calculation formulas are as follows:

F_V^(i+1) = AddNorm(FFN(AddNorm(Multihead(Q_V^(i), K_W^(i), V_W^(i)))))

F_W^(i+1) = AddNorm(FFN(AddNorm(Multihead(Q_W^(i), K_V^(i), V_V^(i)))))

where FFN is the feed-forward neural network, Multihead is the multi-head attention mechanism, AddNorm denotes the addition-and-normalization operation, Q_V^(i), K_W^(i), V_W^(i) are the visual query, Mongolian language key, and Mongolian language value matrices of the i-th cycle, V_V^(i), K_V^(i), Q_W^(i) are the visual value, visual key, and Mongolian language query matrices of the i-th cycle, and k is the number of cycles.
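
A minimal PyTorch sketch of one co-attention round follows. It swaps queries and keys/values between the two streams exactly as the formulas prescribe; the model width, head count, hidden size, and the use of layer normalization are assumptions, since the document does not fix these details.

import torch.nn as nn

class CoAttentionRound(nn.Module):
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.attn_v = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.attn_w = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])
        self.ffn_v = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
        self.ffn_w = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))

    def forward(self, f_v, f_w):
        # Visual stream: visual query, Mongolian keys/values (Q_V, K_W, V_W).
        a_v, _ = self.attn_v(f_v, f_w, f_w)
        v = self.norm[0](f_v + a_v)
        v = self.norm[1](v + self.ffn_v(v))
        # Language stream: Mongolian query, visual keys/values (Q_W, K_V, V_V).
        a_w, _ = self.attn_w(f_w, f_v, f_v)
        w = self.norm[2](f_w + a_w)
        w = self.norm[3](w + self.ffn_w(w))
        return v, w

def co_attend(layer, f_v, f_w, k=5):
    # k cyclic rounds: each round's outputs become the next round's inputs.
    for _ in range(k):
        f_v, f_w = layer(f_v, f_w)
    return f_v, f_w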

Define the input set as X = {x_1, x_2, …, x_t, x_{t+1}, …}, where t ranges over the time steps {t | t = 1, 2, …, T}. The encoder takes X as input; X enters the attention layer, where three matrices (W^Q, W^K, W^V) learned during training generate a query vector Q, key vector K, and value vector V for each sample:

Q = XW^Q,  K = XW^K,  V = XW^V
the attention mechanism is widely applied to the fields of images, Mongolian texts and the like, the calculation speed of the point-by-point attention mechanism is higher, and meanwhile, the space is saved. The calculation formula is as follows:

wherein the content of the first and second substances,q, K, V respectively represent query, key, value, softmax is an activation function, dkIn order to input the dimensions of the document,as a scaling factor, when dkAt a large value, the dimension of the result obtained by multiplying the Q and K points is large, resulting in a result in a region where the gradient of the softmax activation function is small, and therefore divided by a scaling factor, so that the dimension can be reduced.
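
The formula translates directly into a few lines of PyTorch; the tensor shapes in the comments are assumptions.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q: (..., n_q, d_k), K: (..., n_k, d_k), V: (..., n_k, d_v)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # scale to keep softmax gradients healthy
    weights = F.softmax(scores, dim=-1)             # attention distribution over the keys
    return weights @ V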

The multi-head attention mechanism effectively abstracts context dependencies and captures syntactic and semantic features. Different weight matrices linearly map the input features into different information subspaces, and attention is computed within each subspace, allowing the latent structure and semantics of the Mongolian text to be learned. The formula is as follows:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O

where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

where W_i^Q, W_i^K, W_i^V are parameter matrices, Concat is the vector concatenation operation, h is the number of attention heads, and W^O is the linear mapping applied after concatenating the attention outputs of the individual heads.
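
For illustration, a compact multi-head attention module is sketched below. d_model = 512 and h = 8 are assumed values; the per-head projections W_i^Q, W_i^K, W_i^V are stacked into single linear layers, which is mathematically equivalent to applying them head by head.

import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_head = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)   # stacked W_i^Q
        self.w_k = nn.Linear(d_model, d_model)   # stacked W_i^K
        self.w_v = nn.Linear(d_model, d_model)   # stacked W_i^V
        self.w_o = nn.Linear(d_model, d_model)   # W^O after concatenation

    def forward(self, q, k, v):
        b = q.size(0)
        split = lambda x: x.view(b, -1, self.h, self.d_head).transpose(1, 2)
        Q, K, V = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        scores = Q @ K.transpose(-2, -1) / self.d_head ** 0.5   # scaled dot product per head
        heads = scores.softmax(dim=-1) @ V                      # head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
        concat = heads.transpose(1, 2).contiguous().view(b, -1, self.h * self.d_head)
        return self.w_o(concat)                                 # Concat(head_1, ..., head_h) W^O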

The feed-forward neural network consists of two linear layers and a ReLU activation function. It mainly fuses the word-vector information of the words in a sentence, with an effect similar to a 1×1 convolution in a convolutional neural network. The feed-forward network does not process temporal information; it only transforms the information at each position. The calculation formula is:

FFN(x)=max(0,xW1+b1)W2+b2

where W_1 and W_2 are weight matrices, and b_1 and b_2 are biases.
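
In PyTorch this is a two-layer sequential module; the hidden width of 2048 below is an assumption borrowed from the standard Transformer configuration.

import torch.nn as nn

# Position-wise feed-forward network: FFN(x) = max(0, xW1 + b1)W2 + b2,
# applied independently at every position.
ffn = nn.Sequential(
    nn.Linear(512, 2048),   # xW1 + b1
    nn.ReLU(),              # max(0, ·)
    nn.Linear(2048, 512),   # (·)W2 + b2
)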

The fusion layer concatenates the two outputs of the cyclic common attention Transformer network; the formula is as follows:

F=concat(FV,FL)

where F_V is the visual feature output by the visual module after interaction with the Mongolian text, F_L is the Mongolian language feature output by the language module after interaction with the visual features, and concat is the tensor concatenation operation.

Building on the encoder, the Transformer decoder uses a masked multi-head attention module to handle self-attention over the previously output words. The decoding process is as follows: when decoding the i-th position, the (i-1)-th and earlier decoding results are already available; the decoder outputs exactly one word per step, feeds that word back as its next input, and repeats until <eos> is decoded.

The output of the decoder is linearly mapped into a probability vector; the softmax activation function then outputs normalized class probabilities, and the word corresponding to the maximum probability is selected. The formula is as follows:

y=softmax(linear(o)W+b)

where o denotes the output of the decoder, linear is the linear function, and W and b denote the weight matrix and bias of the linear mapping respectively.
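
The decode-and-feed-back loop can be sketched as follows; `decoder` is a hypothetical callable that already includes the final linear projection to vocabulary logits, and bos_id/eos_id are assumed token ids.

import torch

def greedy_decode(decoder, memory, bos_id, eos_id, max_len=100):
    ys = torch.tensor([[bos_id]])                        # start from <bos>
    for _ in range(max_len):
        logits = decoder(ys, memory)                     # (1, len, vocab)
        probs = torch.softmax(logits[:, -1], dim=-1)     # softmax over the last step
        next_id = probs.argmax(dim=-1, keepdim=True)     # word with maximum probability
        ys = torch.cat([ys, next_id], dim=1)             # feed the word back as input
        if next_id.item() == eos_id:                     # stop at <eos>
            break
    return ys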

The whole process of the invention is as follows:

(1) perform target detection on the image with YOLO-V4;

(2) perform image-text correlation detection;

(3) extract image features with triple-RepVGG;

(4) interact the image features with the deformed bidirectional long short-term memory network;

(5) encode the Mongolian text with the encoding layer;

(6) interact the Mongolian text features with the deformed bidirectional long short-term memory network;

(7) interact the image and Mongolian text features with the cyclic common attention Transformer network;

(8) fuse the image and Mongolian text information with the fusion layer;

(9) predict with the Transformer decoder;

(10) train the network;

(11) evaluate the Mongolian-Chinese translation model with the BLEU score.
