Voice scoring method and system based on abstract extraction

Document No.: 70656    Publication date: 2021-10-01

Note: This technique, "Voice scoring method and system based on abstract extraction," was designed and created by 李苏梅, 陈泽铭, 李心广, 陈帅, 吴伟源, 卢树炜 and 马姗娴 on 2021-06-04. Abstract: The invention discloses a voice scoring method and system based on abstract extraction, wherein the method comprises: obtaining an examinee speech segment to be scored and segmenting it to obtain a plurality of speech sentences; performing text recognition and word segmentation on each speech sentence to obtain each text sentence and the text words forming it; calculating a word vector for each text word; performing weighted average processing on the word vectors of the text words in each text sentence to obtain a sentence vector for each text sentence; constructing a text network graph model and performing iterative computation with the TextRank algorithm to obtain an importance score for each text sentence; and acquiring the text sentences meeting preset conditions to form an abstract of the examinee speech segment, which is used for scoring the translation content of the examinee speech segment. With the method and system, examinee speech can be accurately recognized and abstract information accurately extracted, thereby improving the scoring accuracy for examinee speech.

1. A speech scoring method based on abstract extraction is characterized by comprising the following steps:

obtaining a voice segment of an examinee to be scored, and segmenting the voice segment to obtain a plurality of voice sentences;

performing text recognition and word segmentation on each voice sentence to obtain each text sentence and a plurality of text words forming the text sentence;

calculating a word vector for each of the text words;

carrying out weighted average processing on the word vector of each text word in each text sentence to obtain a sentence vector of each text sentence;

constructing a text network graph model according to the sentence vector of each text sentence; the text network graph model takes a sentence vector of each text sentence as a vertex and takes the similarity of the text sentences larger than a preset similarity threshold value as an edge;

iterative computation is carried out by adopting a TextRank algorithm to obtain the importance score of each text sentence;

acquiring text sentences meeting preset conditions, forming abstracts of the examinee speech segments, and scoring the translated contents of the examinee speech segments; wherein the preset conditions are as follows: the importance scores of the text sentences are larger than a preset score threshold value, or the text sentences are N text sentences with the highest importance scores.

2. The method according to claim 1, wherein the weighted average processing of the word vector of each text word in each text sentence to obtain the sentence vector of each text sentence comprises:

determining the weight of each text word according to a preset parameter factor and a set probability;

carrying out weighted average processing on the word vector of each text word in each text sentence through the following calculation formula to obtain an initial sentence vector of each text sentence:

v_s = (1/|s|) · Σ_{ω∈s} [a / (a + p(ω))] · v_ω

where s denotes a text sentence, |s| is the number of text words in the text sentence, ω denotes a text word, v_ω is the word vector of ω, a is a preset parameter factor, and p(ω) is the set probability;

and performing dimensionality reduction on each initial sentence vector to obtain a sentence vector of each text sentence.

3. The method for scoring a speech based on abstract extraction as claimed in claim 2, wherein the dimension reduction processing method comprises: singular value decomposition algorithm, principal component analysis algorithm, factor analysis algorithm or independent component analysis algorithm.

4. The method of claim 1, wherein the similarity of the text sentences is calculated by a cosine similarity algorithm or a longest common subsequence algorithm.

5. The speech scoring method based on abstract extraction as recited in claim 1, wherein the similarity of the text sentences is obtained by the following calculation formula:

S_i = (x_1, x_2, …, x_n);

S_j = (y_1, y_2, …, y_n);

Sim(S_i, S_j) = |{w | w ∈ S_i and w ∈ S_j}| / (log|S_i| + log|S_j|);

where Sim(S_i, S_j) is the similarity of the text sentences S_i and S_j, S_i and S_j represent different text sentences, n is the number of text words in a text sentence, x_n represents each text word constituting S_i, and y_n represents each text word constituting S_j.

6. The method for scoring a speech based on abstract extraction as claimed in claim 1, wherein the TextRank algorithm is specifically:

WS(V_i) = (1 − d) + d · Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] · WS(V_j)

where WS(V_i) is the importance score of a text sentence, V_i represents a vertex of the text network graph model, w_ij represents the weight of an edge of the text network graph model, In(V_i) is the set of vertices pointing to vertex V_i, Out(V_i) is the set of vertices pointed to by V_i, and d is a preset damping coefficient.

7. The speech scoring method based on abstract extraction as claimed in claim 1, wherein the obtaining of the examinee speech segments to be scored and the segmentation into a plurality of speech sentences specifically comprises:

obtaining a voice segment of an examinee to be scored;

windowing the examinee voice segment to be scored by adopting a preset window function to obtain a plurality of audio frames;

calculating the short-time average energy and the short-time average zero crossing rate of each audio frame;

and acquiring the audio frames of which the short-term average energy and the short-term average zero-crossing rate reach corresponding preset threshold values, and taking the audio frames as boundary cutting points to segment the examinee speech segments into a plurality of speech sentences.

8. The method according to claim 1, wherein the performing text recognition and word segmentation on each of the speech sentences to obtain each text sentence and a plurality of text words constituting the text sentence comprises:

performing MFCC (Mel-frequency cepstral coefficient) speech feature extraction on each speech sentence to obtain a language feature value;

inputting each language characteristic value into a BP neural network model which is trained in advance to perform text recognition, and obtaining each text sentence;

and performing word segmentation on each text sentence to obtain a plurality of text words forming the text sentences.

9. The method according to claim 1, wherein the calculating of the word vector for each text word comprises:

and calculating a word vector of each text word by using a preset word2vec model.

10. A speech scoring system based on abstract extraction, comprising:

the examinee voice segmentation module is used for acquiring examinee voice segments to be scored and segmenting the examinee voice segments to obtain a plurality of voice sentences;

the text word acquisition module is used for performing text recognition and word segmentation on each voice sentence to obtain each text sentence and a plurality of text words forming the text sentence;

the word vector calculation module is used for calculating a word vector of each text word;

a sentence vector calculation module, configured to perform weighted average processing on a word vector of each text word in each text sentence to obtain a sentence vector of each text sentence;

the text network graph building module is used for building a text network graph model according to the sentence vector of each text sentence; the text network graph model takes a sentence vector of each text sentence as a vertex and takes the similarity of the text sentences larger than a preset similarity threshold value as an edge;

the importance score calculation module is used for carrying out iterative calculation by adopting a TextRank algorithm to obtain an importance score of each text sentence;

the abstract extraction module is used for acquiring text sentences meeting preset conditions, forming an abstract of the examinee speech section and scoring the translation content of the examinee speech section; wherein the preset conditions are as follows: the importance scores of the text sentences are larger than a preset score threshold value, or the text sentences are N text sentences with the highest importance scores.

Technical Field

The invention relates to the technical field of voice recognition and evaluation, in particular to a voice scoring method and system based on abstract extraction.

Background

With the rapid development of computer science, the application of leading-edge technologies such as artificial intelligence and machine learning to speech has made speech intelligence a popular field. Automatic scoring of English oral retelling questions is a current research hotspot in the speech evaluation field: the examinee first listens to a played recording and then, after one minute of preparation, retells the recording according to the content heard. Manual scoring mainly focuses on two aspects, scoring of the translation content and scoring of the language expression, and the scoring of translation-content accuracy is the key to successful scoring. Generally, translation-content scoring mainly considers the number of key information points correctly translated in the examinee's answer, which involves abstract extraction technology.

In the prior art, among abstract extraction applications, the TF-IDF-based text abstract extraction method is the most basic and earliest statistics-based text abstract extraction algorithm. However, the inventors found that the prior art has at least the following problem: TF-IDF-based text abstract extraction does not take semantically related information into consideration but simply computes TF-IDF values directly, so the accuracy of the extracted abstract is not high.

Disclosure of Invention

The embodiment of the invention aims to provide a voice scoring method and system based on abstract extraction, which can accurately realize the recognition of the voices of examinees and the extraction of abstract information, thereby improving the scoring accuracy of the voices of the examinees.

In order to achieve the above object, an embodiment of the present invention provides a speech scoring method based on abstract extraction, including:

obtaining a voice segment of an examinee to be scored, and segmenting the voice segment to obtain a plurality of voice sentences;

performing text recognition and word segmentation on each voice sentence to obtain each text sentence and a plurality of text words forming the text sentence;

calculating a word vector for each of the text words;

carrying out weighted average processing on the word vector of each text word in each text sentence to obtain a sentence vector of each text sentence;

constructing a text network graph model according to the sentence vector of each text sentence; the text network graph model takes a sentence vector of each text sentence as a vertex and takes the similarity of the text sentences larger than a preset similarity threshold value as an edge;

iterative computation is carried out by adopting a TextRank algorithm to obtain the importance score of each text sentence;

acquiring text sentences meeting preset conditions, forming abstracts of the examinee speech segments, and scoring the translated contents of the examinee speech segments; wherein the preset conditions are as follows: the importance scores of the text sentences are larger than a preset score threshold value, or the text sentences are N text sentences with the highest importance scores.
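The steps above, from sentence vectors through the similarity graph and TextRank to the top-N abstract, can be sketched end-to-end on toy data. This is a minimal illustration rather than the patent's implementation: the word vectors, word probabilities, similarity threshold, and damping coefficient below are illustrative assumptions.

```python
import numpy as np

def sentence_vectors(sentences, word_vecs, word_prob, a=1e-3):
    # weighted average of word vectors, weight a / (a + p(w)) per word
    vecs = []
    for s in sentences:
        ws = [(a / (a + word_prob[w])) * word_vecs[w] for w in s]
        vecs.append(np.mean(ws, axis=0))
    return np.vstack(vecs)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def textrank(vecs, sim_threshold=0.0, d=0.85, iters=50):
    n = len(vecs)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                s = cosine(vecs[i], vecs[j])
                if s > sim_threshold:   # only similarities above the threshold become edges
                    W[i, j] = s
    out_sum = W.sum(axis=1)
    scores = np.ones(n)
    for _ in range(iters):
        new = np.zeros(n)
        for i in range(n):
            for j in range(n):
                if W[j, i] > 0 and out_sum[j] > 0:
                    new[i] += W[j, i] / out_sum[j] * scores[j]
        scores = (1 - d) + d * new
    return scores

# toy "transcribed" sentences over a tiny vocabulary with random word vectors
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "mat", "dog", "ran"]
word_vecs = {w: rng.normal(size=8) for w in vocab}
word_prob = {w: 1.0 / len(vocab) for w in vocab}
sents = [["the", "cat", "sat"], ["the", "cat", "sat", "mat"], ["dog", "ran"]]
scores = textrank(sentence_vectors(sents, word_vecs, word_prob))
summary_idx = np.argsort(scores)[::-1][:2]   # N = 2 highest-scoring sentences
```

Selecting by a score threshold instead of taking the top-N sentences corresponds to the alternative preset condition named in the text.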

Compared with the prior art, in the speech scoring method based on abstract extraction disclosed by the invention, after the examinee speech segment is processed to obtain the word vector of each text word, the WR algorithm is adopted to perform weighted average processing on the word vector of each text word in each text sentence to obtain the sentence vector of each text sentence; compared with the traditional weighted summation method, a more accurate sentence vector can be obtained. A text network graph model is constructed according to the sentence vector of each text sentence, with the sentence vector of each text sentence as a vertex and the similarity of text sentences greater than a preset similarity threshold as an edge; iterative computation is performed with the TextRank algorithm to obtain the importance score of each text sentence; and the text sentences meeting preset conditions are acquired to form the abstract of the examinee speech segment for scoring its translation content. By constructing the text graph model, the TextRank algorithm is improved and the abstract extraction effect is enhanced; compared with a neural network, the method is simpler and more efficient with no loss of effect.

As an improvement of the above scheme, the performing weighted average processing on the word vector of each text word in each text sentence to obtain a sentence vector of each text sentence specifically includes:

determining the weight of each text word according to a preset parameter factor and a set probability;

carrying out weighted average processing on the word vector of each text word in each text sentence through the following calculation formula to obtain an initial sentence vector of each text sentence:

v_s = (1/|s|) · Σ_{ω∈s} [a / (a + p(ω))] · v_ω

where s denotes a text sentence, |s| is the number of text words in the text sentence, ω denotes a text word, v_ω is the word vector of ω, a is a preset parameter factor, and p(ω) is the set probability;

and performing dimensionality reduction on each initial sentence vector to obtain a sentence vector of each text sentence.
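As a small numeric check of the weighting above, the following sketch computes one initial sentence vector; the two-dimensional word vectors and probabilities are made-up values, and the subsequent PCA/SVD dimension reduction step is omitted here.

```python
import numpy as np

def sif_sentence_vector(words, word_vecs, word_prob, a=1e-3):
    # v_s = (1/|s|) * sum over words of  a / (a + p(w)) * v_w
    weighted = [(a / (a + word_prob[w])) * word_vecs[w] for w in words]
    return np.mean(weighted, axis=0)

word_vecs = {"good": np.array([1.0, 0.0]), "morning": np.array([0.0, 1.0])}
word_prob = {"good": 0.02, "morning": 0.001}   # rarer word -> larger weight
v = sif_sentence_vector(["good", "morning"], word_vecs, word_prob)
```

The frequent word "good" is down-weighted relative to the rarer "morning", so the sentence vector leans toward the more informative word.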

As an improvement of the above scheme, the dimension reduction processing method includes: singular value decomposition algorithm, principal component analysis algorithm, factor analysis algorithm or independent component analysis algorithm.

As an improvement of the scheme, the method for calculating the similarity of the text sentences is a cosine similarity calculation method or a longest common subsequence algorithm.

As an improvement of the above scheme, the similarity of the text sentences is obtained by the following calculation formula:

S_i = (x_1, x_2, …, x_n);

S_j = (y_1, y_2, …, y_n);

Sim(S_i, S_j) = |{w | w ∈ S_i and w ∈ S_j}| / (log|S_i| + log|S_j|);

where Sim(S_i, S_j) is the similarity of the text sentences S_i and S_j, S_i and S_j represent different text sentences, n is the number of text words in a text sentence, x_n represents each text word constituting S_i, and y_n represents each text word constituting S_j.
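A direct reading of this similarity (shared words normalized by the log lengths of the two sentences, as in the original TextRank formulation) can be sketched as follows; the two sentences are toy examples:

```python
import math

def sentence_similarity(si, sj):
    # number of shared words, normalized by log(len(Si)) + log(len(Sj))
    common = len(set(si) & set(sj))
    return common / (math.log(len(si)) + math.log(len(sj)))

si = ["the", "cat", "sat", "on", "the", "mat"]
sj = ["a", "cat", "on", "a", "mat"]
sim = sentence_similarity(si, sj)   # shares "cat", "on", "mat"
```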

As an improvement of the above scheme, the TextRank algorithm specifically includes:

WS(V_i) = (1 − d) + d · Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] · WS(V_j)

where WS(V_i) is the importance score of a text sentence, V_i represents a vertex of the text network graph model, w_ij represents the weight of an edge of the text network graph model, In(V_i) is the set of vertices pointing to vertex V_i, Out(V_i) is the set of vertices pointed to by V_i, and d is a preset damping coefficient.
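The iteration can be sketched on a small weighted graph; the 3×3 similarity matrix and d = 0.85 below are illustrative values:

```python
def textrank_scores(W, d=0.85, iters=100, tol=1e-6):
    # WS(Vi) = (1 - d) + d * sum over j in In(Vi) of  w_ji / sum_k w_jk * WS(Vj)
    n = len(W)
    ws = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            acc = 0.0
            for j in range(n):
                out_j = sum(W[j])
                if W[j][i] > 0 and out_j > 0:
                    acc += W[j][i] / out_j * ws[j]
            new.append((1 - d) + d * acc)
        done = max(abs(a - b) for a, b in zip(new, ws)) < tol
        ws = new
        if done:
            break
    return ws

# undirected toy graph: sentence 0 is similar to both of the others
W = [[0.0, 0.8, 0.6],
     [0.8, 0.0, 0.0],
     [0.6, 0.0, 0.0]]
scores = textrank_scores(W)   # sentence 0 ends up with the highest score
```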

As an improvement of the above scheme, the obtaining of the examinee speech segment to be scored and the segmentation to obtain a plurality of speech sentences specifically include:

obtaining a voice segment of an examinee to be scored;

windowing the examinee voice segment to be scored by adopting a preset window function to obtain a plurality of audio frames;

calculating the short-time average energy and the short-time average zero crossing rate of each audio frame;

and acquiring the audio frames of which the short-term average energy and the short-term average zero-crossing rate reach corresponding preset threshold values, and taking the audio frames as boundary cutting points to segment the examinee speech segments into a plurality of speech sentences.

As an improvement of the above scheme, the performing text recognition and word segmentation on each of the speech sentences to obtain each text sentence and a plurality of text words constituting the text sentence specifically includes:

performing MFCC (Mel-frequency cepstral coefficient) speech feature extraction on each speech sentence to obtain a language feature value;

inputting each language characteristic value into a BP neural network model which is trained in advance to perform text recognition, and obtaining each text sentence;

and performing word segmentation on each text sentence to obtain a plurality of text words forming the text sentences.

As an improvement of the above scheme, the calculating a word vector of each text word specifically includes:

and calculating a word vector of each text word by using a preset word2vec model.

The embodiment of the invention also provides a voice scoring system based on abstract extraction, which comprises:

the examinee voice segmentation module is used for acquiring examinee voice segments to be scored and segmenting the examinee voice segments to obtain a plurality of voice sentences;

the text word acquisition module is used for performing text recognition and word segmentation on each voice sentence to obtain each text sentence and a plurality of text words forming the text sentence;

the word vector calculation module is used for calculating a word vector of each text word;

a sentence vector calculation module, configured to perform weighted average processing on a word vector of each text word in each text sentence to obtain a sentence vector of each text sentence;

the text network graph building module is used for building a text network graph model according to the sentence vector of each text sentence; the text network graph model takes a sentence vector of each text sentence as a vertex and takes the similarity of the text sentences larger than a preset similarity threshold value as an edge;

the importance score calculation module is used for carrying out iterative calculation by adopting a TextRank algorithm to obtain an importance score of each text sentence;

the abstract extraction module is used for acquiring text sentences meeting preset conditions, forming an abstract of the examinee speech section and scoring the translation content of the examinee speech section; wherein the preset conditions are as follows: the importance scores of the text sentences are larger than a preset score threshold value, or the text sentences are N text sentences with the highest importance scores.

Compared with the prior art, in the speech scoring method and system based on abstract extraction disclosed by the invention, the examinee speech segment to be scored is obtained and segmented into a plurality of speech sentences using double thresholds, according to the characteristics of human pronunciation and natural sentence breaks; this segmentation method is simple and fast yet effective. To address differences in pronunciation habits between speakers, dual-threshold classification is established, improving the stability and accuracy of sentence segmentation. A BP neural network model is used to perform text recognition on each speech sentence to obtain each text sentence, replacing the traditional HMM or DTW algorithm and greatly improving speech recognition accuracy. Each text sentence is segmented into the text words forming it, and a word2vec model is used to calculate the word vector of each text word. The WR algorithm is adopted to perform weighted average processing on the word vector of each text word in each text sentence to obtain the sentence vector of each text sentence, which is more accurate than the sentence vector obtained by the traditional weighted summation method.
A text network graph model is constructed according to the sentence vector of each text sentence, with the sentence vector of each text sentence as a vertex and the similarity of text sentences greater than a preset similarity threshold as an edge; iterative computation is performed with the TextRank algorithm to obtain the importance score of each text sentence; and the text sentences meeting preset conditions are acquired to form the abstract of the examinee speech segment for scoring its translation content. By constructing the text graph model, the TextRank algorithm is improved and the abstract extraction effect is enhanced; compared with a neural network, the method is simpler and more efficient with no loss of effect.

Drawings

FIG. 1 is a schematic diagram illustrating steps of a method for scoring a speech based on abstract extraction according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating the steps of a dual-threshold sentence segmentation method according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating a single neuron model in the BP neural network model according to an embodiment of the present invention;

FIG. 4 is a diagram of a BP neural network model in an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a speech scoring system based on abstract extraction according to a second embodiment of the present invention;

fig. 6 is a schematic structural diagram of a speech scoring system based on abstract extraction according to a third embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a schematic diagram illustrating steps of a speech scoring method based on abstract extraction according to an embodiment of the present invention. The voice scoring method based on abstract extraction provided by the embodiment of the invention is implemented through steps S1-S7:

and S1, obtaining the examinee voice segments to be scored, and segmenting to obtain a plurality of voice sentences.

Specifically, the examinee speech segment is the speech formed when, in the English oral retelling test, the examinee retells the played recording according to the content heard; it can be acquired through a microphone.

Because the examinee speech segment is a continuous passage of connected speech, in order to perform accurate speech recognition with a speech recognition technology, it must first be divided by a sentence segmentation algorithm into speech segments in units of sentences to facilitate subsequent processing.

In one embodiment, the examinee speech segment is segmented into speech sentences by a dual-threshold sentence segmentation method. Comparison shows that there is a pause between almost every pair of language units (paragraphs, sentences, words, etc.), where some features of the speech change significantly. For example, at sentence boundaries the energy of the audio drops significantly, while within a sentence it is significantly higher. Different sound segments have different energies, and the energy of pause segments is typically much smaller than the average energy, so an energy threshold can be estimated. However, this single criterion cannot segment sentences accurately, so the duration characteristic is used as well. Speech attenuation exists between all language units, only with different magnitudes; in view of this, a silence delay threshold is used for discrimination. The silence segments between different language units of each type of audio, particularly between sentences, are analyzed, the average and shortest lengths of the silence segments are counted, and a preset strategy is then adopted to obtain the silence delay threshold.

Referring to fig. 2, a schematic step diagram of a dual-threshold sentence segmentation method according to an embodiment of the present invention is shown.

Step S1 specifically includes steps S11 to S14:

s11, obtaining a voice segment of the examinee to be scored;

and S12, windowing the examinee voice segment to be scored by adopting a preset window function to obtain a plurality of audio frames.

In an embodiment of the invention, the audio is segmented by windowing, each segment being 10-30ms in length, called a frame, with partial overlap (frame shift) between adjacent frames. The extraction of speech features is usually performed in units of frames, based on the short-time stationarity of speech.

The preset window function includes but is not limited to: rectangular windows, Hanning windows, and Hamming windows.

The window functions are respectively:

rectangular window: w(n) = 1 for 0 ≤ n ≤ N − 1, and w(n) = 0 otherwise;

Hanning window: w(n) = 0.5 · [1 − cos(2πn / (N − 1))], 0 ≤ n ≤ N − 1;

Hamming window: w(n) = 0.54 − 0.46 · cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1;

where N is the window length. Different window functions are selected according to different requirements in the short-time analysis process.
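The three window functions can be generated directly and cross-checked against NumPy's built-ins, which implement the same formulas; N = 400 (a 25 ms frame at a 16 kHz sampling rate) is an illustrative window length:

```python
import numpy as np

N = 400   # e.g. a 25 ms frame at 16 kHz
n = np.arange(N)

rect = np.ones(N)                                     # rectangular window
hann = 0.5 * (1 - np.cos(2 * np.pi * n / (N - 1)))    # Hanning window
hamm = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))  # Hamming window
```

np.hanning(N) and np.hamming(N) produce identical arrays, which confirms the formulas above.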

By adopting the technical means of the embodiment of the invention, windowing the examinee speech segment to be scored makes the signal more continuous overall and avoids the Gibbs phenomenon. Moreover, after windowing, the originally aperiodic speech signal exhibits some of the characteristics of a periodic function.

And S13, calculating the short-time average energy and the short-time average zero crossing rate of each audio frame.

Short-time average energy: the energy function describes the change in the amplitude of the audio energy and can be used to separate silence from non-silence and unvoiced from voiced sound. The short-time average energy of the ith speech frame can be expressed as:

cumulative averaging of absolute values: E_i = (1/N) · Σ_{n=1}^{N} |x_i(n)|;

or cumulative averaging of squares: E_i = (1/N) · Σ_{n=1}^{N} x_i(n)²;

or cumulative averaging of the logarithms of the squares: E_i = (1/N) · Σ_{n=1}^{N} log[x_i(n)²];

where i is the audio frame number, N is the number of sampled values in the audio frame (i.e., the window width), and x_i(n) is the signal sample at the nth point in the ith frame.

It should be noted that the above three expressions are all ways of calculating the short-time average energy; one of them can be selected for the subsequent threshold decision according to the actual application requirements. In most scenarios, the short-time average energy obtained by cumulative averaging of squares is used for the threshold decision.
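The three energy definitions can be sketched as one function; the two test signals (a unit-amplitude tone and a 100× quieter copy standing in for a pause) are illustrative:

```python
import numpy as np

def short_time_energy(frame, mode="square"):
    # the three definitions from the text; "square" is the usual choice
    x = np.asarray(frame, dtype=float)
    if mode == "abs":
        return np.mean(np.abs(x))
    if mode == "square":
        return np.mean(x ** 2)
    if mode == "log":
        return np.mean(np.log(x ** 2 + 1e-12))  # epsilon avoids log(0)
    raise ValueError(mode)

loud = np.sin(np.linspace(0, 20 * np.pi, 400))   # speech-like frame
quiet = 0.01 * loud                              # pause-like frame
e_loud = short_time_energy(loud)
e_quiet = short_time_energy(quiet)
```

A pause frame's energy falls far below a speech frame's, which is exactly what the silence energy threshold exploits.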

Short-time average zero-crossing rate: for a discrete signal, a "zero crossing" occurs when two adjacent samples have different signs. The zero-crossing rate counts the number of zero crossings of the signal in a short time, so voiced and unvoiced sound, as well as voiced sound and silence, can easily be distinguished by it. The short-time average zero-crossing rate can be expressed as:

Z_i = (1/2) · Σ_{n=1}^{N} |sgn[x_i(n)] − sgn[x_i(n − 1)]|;

where sgn(·) is the sign function and x_i(n) is the signal sample at the nth point in the ith frame.
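The zero-crossing rate can be sketched with the sign-difference definition above; the low- and high-frequency test tones are illustrative stand-ins for voiced and unvoiced sound:

```python
import numpy as np

def short_time_zcr(frame):
    # half the mean absolute sign difference between adjacent samples
    s = np.sign(np.asarray(frame, dtype=float))
    return 0.5 * np.mean(np.abs(np.diff(s)))

t = np.arange(400)
voiced_like = np.sin(2 * np.pi * 2 * t / 400)     # 2 cycles: few crossings
unvoiced_like = np.sin(2 * np.pi * 80 * t / 400)  # 80 cycles: many crossings
z_low = short_time_zcr(voiced_like)
z_high = short_time_zcr(unvoiced_like)
```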

And S14, acquiring the audio frames of which the short-term average energy and the short-term average zero-crossing rate both reach corresponding preset threshold values, and taking the audio frames as boundary cutting points to segment the examinee speech segments into a plurality of speech sentences.

By analyzing the waveform of a particular piece of audio, the waveform amplitude is found to drop significantly at almost every pause in speech, so the short-time average energy in the time domain can be used to capture this change. For each type of audio, a threshold called the silence energy threshold is estimated; if the energy of a frame falls below this threshold, the frame is considered to have entered a pause interval of the speech.

The silence energy threshold detects speech boundaries well, but the attenuation does not occur only at sentence boundaries: it also appears between other semantic units, such as between paragraphs, between clauses, and even between words, which is obviously not desired. Re-analyzing the waveform shows that while amplitude attenuation does occur between all semantic units, there is another distinct characteristic: the duration of the attenuation differs. The attenuation is most obvious and lasts longest between paragraphs; between sentences it is also obvious but slightly shorter; between clauses it is shorter still; and between words both the amplitude and the duration of attenuation are not obvious. In view of this characteristic, a silence delay threshold is used to distinguish them. The silence segments between different semantic units of each type of audio, particularly between sentences, are analyzed; the average and shortest lengths of the silence segments are counted; and a certain strategy is then adopted to obtain the silence delay threshold, for example by multiplying the average segment length by a coefficient smaller than 1 or by directly using the shortest segment length. If the applied windows are non-overlapping, the selected segment length is simply divided by the window length to obtain the silence delay threshold in frames.

Based on this, a silence energy threshold corresponding to the short-time average energy E(i) and a silence time-domain threshold corresponding to the short-time average zero-crossing rate Z(i) can be set. If the short-time average energy of an audio frame is lower than the silence energy threshold and its short-time average zero-crossing rate is lower than the silence time-domain threshold, the audio frame is taken as a sentence-boundary cut point, thereby segmenting the examinee speech segment into a plurality of speech sentences.

By adopting the technical means of the embodiment of the invention, sentence segmentation is performed with double thresholds according to the characteristics of human pronunciation and natural sentence breaks, which improves the stability and accuracy of sentence segmentation with good effect. Moreover, because the pronunciation in the examinee speech segment is basically standard, the speech is clear, and the noise is low, no complex model or large amount of computation is needed: it suffices to analyze the time-domain characteristics of the examinee speech carefully and decide with the double thresholds, which effectively reduces the computation load while maintaining segmentation accuracy.
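The dual-threshold decision described above can be sketched as follows. The helper function, thresholds, and toy frames are hypothetical illustrations; real thresholds would be estimated from the audio as the text describes:

```python
import numpy as np

def dual_threshold_cuts(frames, energy_thresh, zcr_thresh, min_silence_frames):
    # a frame is "silent" when both features fall below their thresholds;
    # a run of min_silence_frames silent frames marks a sentence boundary
    cuts, run = [], 0
    for i, f in enumerate(frames):
        energy = np.mean(np.asarray(f) ** 2)
        zcr = 0.5 * np.mean(np.abs(np.diff(np.sign(f))))
        if energy < energy_thresh and zcr < zcr_thresh:
            run += 1
            if run == min_silence_frames:
                cuts.append(i - min_silence_frames + 1)
        else:
            run = 0
    return cuts

# toy sequence: 4 speech frames, a 3-frame pause, 4 more speech frames
rng = np.random.default_rng(1)
speech = [np.sin(np.linspace(0, 8 * np.pi, 160)) for _ in range(4)]
pause = [0.001 * rng.normal(size=160) for _ in range(3)]
frames = speech + pause + speech
cuts = dual_threshold_cuts(frames, energy_thresh=0.01,
                           zcr_thresh=0.9, min_silence_frames=3)
```

The pause run starting at frame index 4 is detected as the single cut point, matching the silence-delay criterion.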

It should be understood that, in practical applications, other speech sentence segmentation methods may also be adopted to segment the speech sentences of the examinee speech segments, which do not form specific limitations of the present invention.

And S2, performing text recognition and word segmentation on each voice sentence to obtain each text sentence and a plurality of text words forming the text sentence.

Specifically, step S2 includes steps S21 to S23:

S21, performing MFCC speech feature extraction on each speech sentence to obtain language feature values;

S22, inputting each language feature value into a pre-trained BP neural network model for text recognition to obtain each text sentence;

and S23, performing word segmentation on each text sentence to obtain a plurality of text words constituting the text sentence.

In the embodiment of the invention, a BP neural network model is adopted for text recognition, so that each speech sentence of the examinee is recognized and converted into text form. The BP neural network, also called the error back-propagation neural network, is a network model built by continuously adjusting the connection weights between nodes according to a feedback value.

Referring to figs. 3-4: fig. 3 is a schematic diagram of a single neuron model in a BP neural network model according to an embodiment of the present invention; fig. 4 is a diagram of a BP neural network model in an embodiment of the present invention. The overall structure is divided into an input layer, a hidden layer and an output layer, where the hidden layer can have one or several layers as the specific situation requires. The more hidden layers there are, the slower the neural network learns; according to the Kolmogorov theorem, with a reasonable structure and appropriate weights a 3-layer BP network can approximate any continuous function, so a 3-layer BP network with a relatively simple structure is selected.

As shown in fig. 3: yk represents the output value of neuron k at a certain moment; f is an activation function, also called a transfer function; uk represents the net input to the kth neuron, and can be found by:

Uk=Wk1*X1+Wk2*X2+...+Wkm*Xm+bk

x1, X2, … Xm represent a total of m input data; WK1, WK2, … WKm correspond to the weight of each input signal, respectively; bk is the offset value called threshold value.

The above-mentioned single neurons are connected to obtain the multi-layer neural network model shown in fig. 4, and the final output layer outputs the probability of each match.
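The single-neuron computation above can be sketched as follows (a minimal sketch; the sigmoid activation is an assumed choice, since the text does not fix f):

```python
import math

def neuron_output(x, w, b):
    """Net input U_k = W_k1*X_1 + ... + W_km*X_m + b_k,
    followed by the activation y_k = f(U_k)."""
    u = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-u))  # f: sigmoid (assumed)
```

With zero weights and zero bias the net input is 0, so the sigmoid returns 0.5; a strongly positive net input pushes the output toward 1.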

The training process of the BP neural network model is as follows: a number of speech sentences and the corresponding labelled text sentences are acquired in advance as the training data set. Speech parameters are extracted from each speech sentence; in the embodiment of the invention the extracted MFCC features form a two-dimensional matrix with an indefinite number of rows and 24 columns. The rows × 24 elements are arranged into a column vector, i.e. the feature vector of the utterance. Because the number of rows differs between utterances, the vector length is fixed at num = 600 according to research experience, and shorter vectors are directly padded with zeros. The network therefore has 600 input neurons.

The learning process consists of two stages: forward propagation of the signal and back propagation of the error. During forward propagation, input samples enter at the input layer, are processed layer by layer by the hidden layers, and are passed to the output layer. If the actual output of the output layer does not match the expected output, the process switches to the error back-propagation stage. In error back propagation, the output error is passed back through the hidden layers to the input layer in some form and distributed to all units of each layer, yielding an error signal for each unit; this signal is the basis for correcting the weights of that unit. The cycle of forward signal propagation and backward error propagation with weight adjustment in each layer repeats; this continuous weight adjustment is exactly the learning and training process of the network, and it continues until the output error of the network falls to an acceptable level or a preset number of learning iterations is reached.

In the figure, X represents the input layer, b the hidden layer and y the output layer; V_h1 represents the weight from the first neuron of the input layer to the h-th neuron of the hidden layer, and W_d1 represents the weight from the d-th neuron of the hidden layer to the first neuron of the output layer.

The output of the first neuron of the hidden layer is:

b_1 = f( Σ_i V_1i·X_i + λ_1 )

where f(·) is the activation function of the hidden layer and λ_1 is the bias of the first neuron of the hidden layer.

The first output of the output layer is:

y_1 = f( Σ_d W_d1·b_d + θ_1 )

where θ_i is the bias of the i-th neuron of the output layer and b_d is the output of the d-th hidden neuron.

For each prediction, the error is obtained with the following formula and the weights are adjusted accordingly:

E = ½ Σ_k ( ŷ_k − y_k )²

where ŷ_k is the predicted output of the network and y_k is the expected output of the sample.
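The forward pass and the squared-error formula can be combined into one gradient-descent step; a minimal sketch for a single sigmoid output neuron (the sigmoid activation and the learning rate 0.5 are assumed choices of this sketch):

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def train_step(x, w, b, target, lr=0.5):
    """One forward pass, squared-error evaluation and weight update
    for a single sigmoid neuron: w <- w - lr * dE/dw."""
    y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    err = 0.5 * (y - target) ** 2
    delta = (y - target) * y * (1.0 - y)  # dE/du for the sigmoid
    w = [wi - lr * delta * xi for wi, xi in zip(w, x)]
    b = b - lr * delta
    return w, b, err
```

Repeating the step on the same sample drives the error down, which is exactly the "continuous weight adjustment" described above.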

Then, according to the trained BP neural network model, text recognition is performed on each speech sentence to obtain each text sentence, and each text sentence is segmented into the text words that constitute it.

By adopting the technical means of this embodiment of the invention, speech recognition based on the BP neural network replaces the traditional HMM or DTW algorithm and makes full use of the resource advantages of the laboratory corpus, greatly improving the accuracy of speech recognition.

It should be noted that the above scenario is only used as an example, and in practical application, a word segmentation method in the prior art may also be used, which is not specifically limited herein.

And S3, calculating a word vector of each text word.

Specifically, a word vector of each text word is calculated by using a preset word2vec model.

The word2vec model uses a deep learning network to model the semantic relations between words and their contexts in the corpus data, so as to obtain low-dimensional word vectors. A word vector typically has about 100-300 dimensions, which avoids the high-dimensional sparsity problem of the traditional vector space model.

It should be noted that the word2vec model includes the Continuous Bag-of-Words Model (CBOW) and the Continuous Skip-gram Model (Skip-gram). Both models include an input layer, a hidden layer and an output layer. For constructing and training the word2vec model, refer to the prior art; details are not repeated here.
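For intuition, the Skip-gram variant trains on (center word, context word) pairs drawn from a sliding window over the text; a minimal sketch of the pair generation (the window size is an assumed hyper-parameter, and the actual embedding training is left to an existing word2vec implementation):

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) training pairs as used by the Skip-gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```

For the tokens ["a", "b", "c"] with window 1, the pairs are (a,b), (b,a), (b,c) and (c,b).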

And S4, carrying out weighted average processing on the word vector of each text word in each text sentence to obtain the sentence vector of each text sentence.

Specifically, step S4 includes steps S41 to S43:

S41, determining the weight of each text word according to a preset parameter factor and a set probability;

S42, performing weighted average processing on the word vector of each text word in each text sentence through the following calculation formula to obtain the initial sentence vector of each text sentence:

v_s = (1/|s|) · Σ_{ω∈s} [ a / (a + p(ω)) ] · v_ω

where s is a text sentence, |s| is the number of text words in s, ω is a text word, v_ω is its word vector, a is the preset parameter factor, and p(ω) is the set probability;

and S43, performing dimensionality reduction on each initial sentence vector to obtain a sentence vector of each text sentence.

The traditional way of obtaining a sentence vector from word vectors is to add up the word vectors of the words in the sentence and then average them; this is simple, but the results are often not outstanding.

In the embodiment of the invention, the WR algorithm is used as an unsupervised sentence modelling method to calculate sentence vectors. W stands for Weighted: each word vector in a sentence is weighted with a pre-estimated parameter. R stands for Removal: the irrelevant part of the sentence vector is removed, reducing its dimensionality.

First, each word vector in the sentence is weighted with the pre-estimated parameter a and the set probability p(ω); after weighted summation, each initial sentence vector is subjected to the removal step, which strips the irrelevant part and yields the sentence vector of each text sentence.

It should be noted that the parameter a is an empirical value that can be set according to the actual situation; illustratively, a ∈ [1e-4, 1e-3]. The set probability p(ω) is a word-frequency estimate, i.e. the probability of the text word occurring in the whole corpus, which can be computed in advance.

The dimension reduction processing method comprises: Singular Value Decomposition (SVD), Principal Component Analysis (PCA), Factor Analysis (FA), and Independent Component Analysis (ICA).

In one embodiment, the PCA algorithm is used to remove the irrelevant part of the vector and finally obtain the sentence vector: the first singular vector u of the matrix formed by the initial sentence vectors is computed, and for the initial sentence vector v_s of each text sentence the following is executed:

v_s′ = v_s − u·uᵀ·v_s

thereby obtaining the sentence vector v_s′ of each text sentence.
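A minimal sketch of the two WR steps in plain Python (the power-iteration estimate of the first singular vector u is an implementation choice of this sketch; a library SVD or PCA would serve equally):

```python
def sif_sentence_vectors(sentences, word_vec, p, a=1e-4, iters=50):
    """W step: v_s = (1/|s|) * sum over words w of a / (a + p(w)) * v_w.
    R step: v_s' = v_s - u u^T v_s, u being the first singular vector."""
    dim = len(next(iter(word_vec.values())))
    vs = []
    for s in sentences:
        acc = [0.0] * dim
        for w in s:
            weight = a / (a + p[w])
            acc = [ai + weight * vi for ai, vi in zip(acc, word_vec[w])]
        vs.append([ai / len(s) for ai in acc])
    # estimate the first principal direction u by power iteration
    u = [1.0] * dim
    for _ in range(iters):
        scores = [sum(vi * ui for vi, ui in zip(v, u)) for v in vs]
        u = [sum(sc * v[i] for sc, v in zip(scores, vs)) for i in range(dim)]
        norm = sum(x * x for x in u) ** 0.5 or 1.0
        u = [x / norm for x in u]
    out = []
    for v in vs:
        proj = sum(vi * ui for vi, ui in zip(v, u))
        out.append([vi - proj * ui for vi, ui in zip(v, u)])
    return out
```

When every word vector lies along one direction, that direction is exactly the removed component, so the resulting sentence vectors are zero.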

By adopting the technical means of this embodiment of the invention, the WR algorithm provides an efficient and convenient modelling method: compared with a neural network it takes far less time, yet its effect is comparable.

S5, constructing a text network graph model according to the sentence vector of each text sentence; the text network graph model takes the sentence vector of each text sentence as a vertex, and takes the similarity of the text sentences larger than a preset similarity threshold value as an edge.

And S6, carrying out iterative computation by adopting a TextRank algorithm to obtain the importance score of each text sentence.

Specifically, the TextRank algorithm divides a text into a number of constituent units (words and sentences), builds a graph model, and ranks the important components of the text with a voting mechanism; keyword extraction and summarization can thus be achieved with the information of a single document alone. The general TextRank model can be expressed as a directed weighted graph G = (V, E), consisting of a set of vertices V and a set of edges E, where E is a subset of V × V.

In the embodiment of the invention, the sentence vectors of the text sentences are taken as vertices, similarity is calculated from the obtained sentence vectors, the similarity between sentences serves as the weight of the edges between the nodes of the network graph, and the importance of each sentence unit is finally obtained by iterative calculation until convergence or until the upper limit on the number of iterations is reached.

Preferably, the method for calculating the similarity of text sentences includes: the cosine similarity algorithm and the longest-common-subsequence algorithm.

In an alternative embodiment, the similarity between text sentences is calculated with the cosine similarity algorithm through the following formulas:

S_i = (x_1, x_2, ..., x_n);

S_j = (y_1, y_2, ..., y_n);

Sim(S_i, S_j) = ( Σ_{k=1..n} x_k·y_k ) / ( sqrt(Σ_{k=1..n} x_k²) · sqrt(Σ_{k=1..n} y_k²) )

where Sim(S_i, S_j) is the similarity between the text sentences S_i and S_j, S_i and S_j denote different text sentences, n is the dimension of the sentence vectors, and x_k and y_k are the k-th components of the sentence vectors of S_i and S_j, respectively.

If the similarity between two text sentences is greater than the given similarity threshold, the two sentences are considered semantically related and are connected; the weight of the corresponding edge of the text network graph model is Sim(S_i, S_j).
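The cosine similarity used for the edge weights can be sketched as:

```python
def cosine_sim(si, sj):
    """Sim(S_i, S_j) = (S_i . S_j) / (|S_i| * |S_j|); 0 for a zero vector."""
    dot = sum(x * y for x, y in zip(si, sj))
    ni = sum(x * x for x in si) ** 0.5
    nj = sum(y * y for y in sj) ** 0.5
    return dot / (ni * nj) if ni and nj else 0.0
```

Collinear sentence vectors give similarity 1, orthogonal ones give 0.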

Further, the TextRank algorithm is specifically:

WS(V_i) = (1 − d) + d · Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] · WS(V_j)

where WS(V_i) is the importance score of a text sentence, V_i is a vertex of the text network graph model, w_ji is the weight of the edge from V_j to V_i, In(V_i) is the set of vertices pointing to V_i, Out(V_j) is the set of vertices pointed to by V_j, and d is a preset damping coefficient.

Illustratively, the damping coefficient d is taken as 0.85 to ensure convergence of the calculation.

Then, iterative computation is performed with the TextRank algorithm until the result converges or the upper limit on the number of iterations is reached, yielding the importance score of each text sentence.
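A minimal sketch of the weighted TextRank iteration over a similarity matrix (the convergence tolerance and iteration limit are assumed parameters):

```python
def textrank(sim, d=0.85, tol=1e-6, max_iter=200):
    """WS(Vi) = (1 - d) + d * sum_j sim[j][i] / sum_k sim[j][k] * WS(Vj),
    iterated until convergence. sim is a symmetric similarity matrix with
    sim[i][i] = 0; entries below the similarity threshold should be 0."""
    n = len(sim)
    ws = [1.0] * n
    for _ in range(max_iter):
        new = []
        for i in range(n):
            s = 0.0
            for j in range(n):
                out_j = sum(sim[j])
                if sim[j][i] and out_j:
                    s += sim[j][i] / out_j * ws[j]
            new.append((1 - d) + d * s)
        converged = max(abs(a - b) for a, b in zip(new, ws)) < tol
        ws = new
        if converged:
            break
    return ws
```

In a three-sentence graph where sentence 0 is connected to both others while they are not connected to each other, sentence 0 receives the highest importance score.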

S7, obtaining text sentences meeting preset conditions, forming abstracts of the speech segments of the examinees, and scoring the translated contents of the speech segments of the examinees; wherein the preset conditions are as follows: the importance scores of the text sentences are larger than a preset score threshold value, or the text sentences are N text sentences with the highest importance scores.

Specifically, after the importance score of each text sentence is obtained, the text sentences are arranged into a set in descending order of importance score. According to the required number of words or sentences, the first N text sentences (N ≥ 1) are extracted from the set to form the abstract; or, according to an importance-score requirement, the text sentences whose score is higher than the preset score threshold are extracted from the set to form the abstract.
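Both selection rules just described reduce to a sort-and-take over the scored sentences; a minimal sketch:

```python
def extract_summary(sentences, scores, top_n=None, score_threshold=None):
    """Return the top_n highest-scoring sentences, and/or those whose
    importance score exceeds score_threshold, in descending score order."""
    ranked = sorted(zip(sentences, scores), key=lambda t: t[1], reverse=True)
    if top_n is not None:
        ranked = ranked[:top_n]
    if score_threshold is not None:
        ranked = [(s, sc) for s, sc in ranked if sc > score_threshold]
    return [s for s, _ in ranked]
```

For example, with scores 0.2, 0.9 and 0.5, taking the top two returns the second and third sentences in that order.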

Further, the translation content of the examinee speech segment is scored against the abstract according to a preset scoring standard.

By adopting the technical means of this embodiment of the invention, the text is processed as a graph: the sentence vectors obtained with the WR algorithm and word2vec serve as vertices, the cosine similarity between sentences represents the edges between vertices, and the text graph model built in this way improves the TextRank algorithm and thus the abstract extraction effect. Compared with a neural network, the method adopted in the embodiment of the invention is simpler and more efficient, with no loss of effect.

As a preferred embodiment, the method further comprises steps S8 and S9:

s8, calculating the linguistic expression score of the speech segment of the examinee;

and S9, obtaining the total score of the speech segment of the examinee according to the translation content score and the speech expression score of the speech segment of the examinee.

Besides the scoring of the translation content, the scoring of spoken-English retelling questions also covers language expression. Therefore, speech-expression scoring is performed on the examinee speech segment, and the translation content score and the speech expression score are added, or added with weights, to obtain the total score of the examinee speech segment.

By adopting the technical means of this embodiment of the invention, the summarization algorithm is applied to the scoring of spoken-language retelling questions: it extracts the key information of the examinee's speech, pronunciation-quality scoring is performed on the speech, and the two are combined to give the examinee's answer a final score, improving scoring accuracy.

The accuracy and efficiency of the speech scoring method based on abstract extraction provided by the embodiment of the invention were tested by randomly selecting one short-passage retelling question, its standard answer and answer recordings from the Guangdong college entrance oral English examination for comparison.

400 answer recordings were selected for testing. According to the scoring levels high, middle and low, the answers of 8 examinees were randomly drawn from each level as samples, and the abstract-based scores obtained with the speech scoring method based on abstract extraction were compared with the teachers' scores; the results are shown in table 1.

The title is as follows:

in summary: tom wore that sisters are short of food and overwinter, and steal rice to her, but discover that the sisters are doing the same thing.

Key words: worry, harvest, add, pile, strange, asleep, hide, winter, same, farm.

TABLE 1 partial sample comparison of the inventive scores with the teacher scores

The student score levels are defined as follows: high means few information points are omitted and the speech is fluent with high recognizability; middle means some information points are omitted but the expression is normal and the recording content is basically recognizable; low means the information points are few, the language is not fluent enough, and the recording content is barely recognizable. The teacher average score is the average of the scores given to the recording by several college-entrance-examination markers.

Across the 24 groups of data, the error between the teacher scores and the scoring results of the invention is about 4.30%, which gives the method good reference value.

The level in table 1 is the examinee's answer level. It can be seen that the answer level of the test cases basically matches the keyword coverage detected by the system: high-level answers cover almost all the keywords, and the method of the invention recognizes them essentially correctly; correspondingly, low-level answers contain fewer keywords, so fewer are recognized by the method of the invention.

The 2000 results selected for the project were counted, and the number of differences between the manual recognition results and the system recognition results was compared; the data are shown in table 2 below:

table 2 difference data results

As can be seen from table 2, the man-machine agreement rate is about 84%; a difference of one keyword occurs in about 15% of cases, a difference of two keywords in about 2%, and there is no case with a difference of three keywords. The agreement between the system's keyword recognition and manual keyword recognition thus exceeds 80%, and the combined data show that the method can complete the summarization of the spoken test papers to a good degree.

The first embodiment of the invention provides a speech scoring method based on abstract extraction. An examinee speech segment to be scored is acquired and, using dual thresholds based on the characteristics of human pronunciation and the speaker's natural sentence breaks, is cut into a plurality of speech sentences; the cutting method is simple and fast yet effective. For the differences in pronunciation habits between speakers, the method establishes the dual-threshold classification to absorb those differences, improving the stability and accuracy of sentence segmentation. A BP neural network model performs text recognition on each speech sentence to obtain each text sentence, replacing the traditional HMM or DTW algorithm and greatly improving the accuracy of speech recognition. Each text sentence is segmented into the text words that constitute it, and the word vector of each text word is calculated with a word2vec model. The WR (Weighted-Removal) algorithm performs weighted average processing on the word vectors of the text words in each text sentence to obtain the sentence vector of each text sentence, which is more accurate than the sentence vector obtained by the traditional summation-and-averaging method.
A text network graph model is constructed from the sentence vectors, taking the sentence vector of each text sentence as a vertex and the similarities of text sentences greater than a preset similarity threshold as edges; iterative computation with the TextRank algorithm yields the importance score of each text sentence; the text sentences meeting the preset conditions are taken to form the abstract of the examinee speech segment, which is used to score its translation content. Building the text graph model in this way improves the TextRank algorithm and thus the abstract extraction effect; compared with a neural network, the method is simpler and more efficient with no loss of effect.

Fig. 5 is a schematic structural diagram of a speech scoring system based on abstract extraction according to the second embodiment of the present invention. The embodiment of the invention provides a speech scoring system 20 based on abstract extraction, which comprises: an examinee voice segmentation module 21, a text word acquisition module 22, a word vector calculation module 23, a sentence vector calculation module 24, a text network diagram construction module 25, an importance score calculation module 26 and a summary extraction module 27; wherein:

the examinee voice segmentation module 21 is used for acquiring examinee voice segments to be scored and segmenting the examinee voice segments to obtain a plurality of voice sentences;

the text word obtaining module 22 is configured to perform text recognition and word segmentation on each voice sentence to obtain each text sentence and a plurality of text words constituting the text sentence;

the word vector calculation module 23 is configured to calculate a word vector for each text word;

the sentence vector calculation module 24 is configured to perform weighted average processing on a word vector of each text word in each text sentence to obtain a sentence vector of each text sentence;

the text network diagram building module 25 is configured to build a text network diagram model according to the sentence vector of each text sentence; the text network graph model takes a sentence vector of each text sentence as a vertex and takes the similarity of the text sentences larger than a preset similarity threshold value as an edge;

the importance score calculating module 26 is configured to perform iterative calculation by using a TextRank algorithm to obtain an importance score of each text sentence;

the abstract extracting module 27 is configured to acquire text sentences meeting preset conditions, form an abstract of the examinee speech segment, and score translation contents of the examinee speech segment; wherein the preset conditions are as follows: the importance scores of the text sentences are larger than a preset score threshold value, or the text sentences are N text sentences with the highest importance scores.

It should be noted that, the speech scoring system based on abstract extraction provided in the embodiment of the present invention is used for executing all the process steps of the speech scoring method based on abstract extraction in the above embodiment, and the working principles and beneficial effects of the two are in one-to-one correspondence, so that details are not repeated.

The second embodiment of the invention provides a speech scoring system based on abstract extraction. It acquires an examinee speech segment to be scored and, using dual thresholds based on the characteristics of human pronunciation and the speaker's natural sentence breaks, cuts the examinee speech segment into a plurality of speech sentences; the cutting method is simple and fast yet effective. For the differences in pronunciation habits between speakers, the dual-threshold classification absorbs those differences, improving the stability and accuracy of sentence segmentation. A BP neural network model performs text recognition on each speech sentence to obtain each text sentence, replacing the traditional HMM or DTW algorithm and greatly improving the accuracy of speech recognition. Each text sentence is segmented into the text words that constitute it, and the word vector of each text word is calculated with a word2vec model. The WR (Weighted-Removal) algorithm performs weighted average processing on the word vectors of the text words in each text sentence to obtain the sentence vector of each text sentence, which is more accurate than the sentence vector obtained by the traditional summation-and-averaging method.
A text network graph model is constructed from the sentence vectors, taking the sentence vector of each text sentence as a vertex and the similarities of text sentences greater than a preset similarity threshold as edges; iterative computation with the TextRank algorithm yields the importance score of each text sentence; the text sentences meeting the preset conditions are taken to form the abstract of the examinee speech segment, which is used to score its translation content. Building the text graph model in this way improves the TextRank algorithm and thus the abstract extraction effect; compared with a neural network, the system is simpler and more efficient with no loss of effect.

Fig. 6 is a schematic structural diagram of a speech scoring system based on abstract extraction according to a third embodiment of the present invention. The embodiment of the present invention further provides a speech scoring system 30 based on abstract extraction, which includes a processor 31, a memory 32, and a computer program stored in the memory and configured to be executed by the processor, and when the processor executes the computer program, the speech scoring method based on abstract extraction as provided in the first embodiment is implemented.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.
