Keyboard musical instrument and computer-implemented method of keyboard musical instrument

Document No.: 1186277    Publication date: 2020-09-22

Reading note: This technology, "Keyboard musical instrument and computer-implemented method of keyboard musical instrument", was devised by 橘敏之 on 2020-03-16. Its main content is as follows: The invention provides a keyboard musical instrument and a computer-implemented method of the keyboard musical instrument. The keyboard musical instrument is provided with: a keyboard including a plurality of keys; a plurality of operation elements provided on the top surface of the instrument case on the rear side in the longitudinal direction of the keys, including a 1 st operation element and a 2 nd operation element, the 1 st operation element corresponding to 1 st section data of output voice data from a 1 st timing to before a 2 nd timing, and the 2 nd operation element corresponding to 2 nd section data of the voice data from the 2 nd timing to before a 3 rd timing; and at least 1 processor. The at least 1 processor determines an intonation of a 1 st pattern in accordance with a 1 st user operation on the 1 st operation element and instructs pronunciation of a voice corresponding to the 1 st section data with the intonation of the 1 st pattern, and determines an intonation of a 2 nd pattern in accordance with a 2 nd user operation on the 2 nd operation element and instructs pronunciation of a voice corresponding to the 2 nd section data with the intonation of the 2 nd pattern.

1. A keyboard musical instrument is provided with:

a keyboard comprising a plurality of keys;

a plurality of operation elements provided on the rear side of the keys in the longitudinal direction and on the top surface of the instrument case, including a 1 st operation element and a 2 nd operation element, the 1 st operation element corresponding to 1 st section data of the output voice data from the 1 st timing to before the 2 nd timing, the 2 nd operation element corresponding to 2 nd section data of the voice data from the 2 nd timing to before the 3 rd timing; and

at least 1 processor,

wherein the at least 1 processor determines an intonation of a 1 st pattern in accordance with a 1 st user operation on the 1 st operation element, and instructs pronunciation of a voice corresponding to the 1 st section data with the determined intonation of the 1 st pattern,

and the at least 1 processor determines an intonation of a 2 nd pattern in accordance with a 2 nd user operation on the 2 nd operation element, and instructs pronunciation of a voice corresponding to the 2 nd section data with the determined intonation of the 2 nd pattern.

2. The keyboard musical instrument according to claim 1,

when the number of pieces of section data included in the voice data is larger than the number of the plurality of operation elements, the at least 1 processor outputs the 1 st section data corresponding to the 1 st operation element, and then changes the section data corresponding to the 1 st operation element from the 1 st section data to section data subsequent to the 1 st section data.

3. The keyboard musical instrument according to claim 2,

when the number of the plurality of operation elements is assumed to be 8, the at least 1 processor associates the 1 st section data to the 8 th section data in the voice data with the plurality of operation elements at a certain timing,

the at least 1 processor outputs the 1 st section data corresponding to the 1 st operation element, and then changes the section data corresponding to the 1 st operation element from the 1 st section data to the 9 th section data following the 8 th section data.

4. The keyboard musical instrument according to claim 3,

the at least 1 processor adjusts at least one of the output of the 1 st section data and the output of the 2 nd section data so that the final pitch of the speech of the 1 st section data is continuously connected to the first pitch of the speech of the 2 nd section data.

5. The keyboard musical instrument according to claim 1,

the plurality of operating elements include a sliding operating element,

the at least 1 processor determines a certain intonation pattern from among a plurality of intonation patterns set in advance, based on a slide operation amount obtained by a slide operation on the slide operation element.

6. The keyboard musical instrument according to claim 1,

the at least 1 processor emits speech at a pitch specified by an operation to the keyboard.

7. The keyboard musical instrument according to any one of claims 1 to 6,

the keyboard musical instrument includes a memory storing a learned acoustic model obtained by machine learning processing of voice data of a singer, the learned acoustic model outputting data representing an acoustic feature amount of the voice of the singer by inputting arbitrary lyric data and arbitrary pitch data,

the at least 1 processor infers the voice of the singer based on data representing an acoustic feature amount of the voice of the singer, the data being output by the learned acoustic model in response to the input of the arbitrary lyric data and the arbitrary pitch data to the learned acoustic model,

the at least 1 processor imparts the intonation of the determined pattern to the voice of the singer for the 1 st section data, and outputs the 1 st section data.

8. A computer-implemented method of a keyboard musical instrument,

the keyboard instrument includes:

a keyboard comprising a plurality of keys;

a plurality of operation elements provided on the rear side of the keys in the longitudinal direction and on the top surface of the instrument case, including a 1 st operation element and a 2 nd operation element, the 1 st operation element corresponding to 1 st section data of the output voice data from the 1 st timing to before the 2 nd timing, the 2 nd operation element corresponding to 2 nd section data of the voice data from the 2 nd timing to before the 3 rd timing; and

at least 1 processor,

wherein the at least 1 processor carries out the following steps:

determining an intonation of a 1 st pattern in accordance with a 1 st user operation on the 1 st operation element,

instructing pronunciation of a voice corresponding to the 1 st section data with the determined intonation of the 1 st pattern,

determining an intonation of a 2 nd pattern in accordance with a 2 nd user operation on the 2 nd operation element, and

instructing pronunciation of a voice corresponding to the 2 nd section data with the determined intonation of the 2 nd pattern.

9. The method of claim 8,

when the number of pieces of section data included in the voice data is larger than the number of the plurality of operation elements, the at least 1 processor outputs the 1 st section data corresponding to the 1 st operation element, and then changes the section data corresponding to the 1 st operation element from the 1 st section data to section data subsequent to the 1 st section data.

10. The method of claim 9,

when the number of the plurality of operation elements is assumed to be 8, the at least 1 processor associates the 1 st section data to the 8 th section data in the voice data with the plurality of operation elements at a certain timing,

the at least 1 processor outputs the 1 st section data corresponding to the 1 st operation element, and then changes the section data corresponding to the 1 st operation element from the 1 st section data to the 9 th section data following the 8 th section data.

11. The method of claim 10,

the at least 1 processor adjusts at least one of the output of the 1 st section data and the output of the 2 nd section data so that the final pitch of the speech of the 1 st section data is continuously connected to the first pitch of the speech of the 2 nd section data.

12. The method of claim 8,

the plurality of operating elements include a sliding operating element,

the at least 1 processor determines a certain intonation pattern from among a plurality of intonation patterns set in advance, based on a slide operation amount obtained by a slide operation on the slide operation element.

13. The method of claim 8,

the at least 1 processor emits speech at a pitch specified by an operation to the keyboard.

14. The method according to any one of claims 8 to 13,

the keyboard musical instrument includes a memory storing a learned acoustic model obtained by machine learning processing of voice data of a singer, the learned acoustic model outputting data representing an acoustic feature amount of the voice of the singer by inputting arbitrary lyric data and arbitrary pitch data,

the at least 1 processor further carries out the following steps:

inferring the voice of the singer from data representing an acoustic feature amount of the voice of the singer, the data being output by the learned acoustic model in response to the input of the arbitrary lyric data and the arbitrary pitch data to the learned acoustic model, and

imparting the intonation of the determined pattern to the voice of the singer for the 1 st section data, and outputting the 1 st section data.

Technical Field

The present invention relates to a keyboard musical instrument capable of a performance such as rap, and to a computer-implemented method of the keyboard musical instrument.

Background

There is a singing style called rap. Rap is a musical style in which lyrics are sung in a spoken manner or the like in accordance with the rhythm or the temporal progression of the melody line of a piece of music. In rap in particular, individualized musical expression is possible by improvisationally changing the intonation of the voice.

Rap thus involves both lyrics and flow (rhythm and melody line), so the hurdle to singing with both at once is very high. However, if at least some of the musical elements included in the flow of rap are automated, even a beginner can become familiar with rap, as long as the remaining musical elements can be played on an electronic musical instrument or the like in accordance with the automated elements.

As a 1 st prior art for automating singing, there is known an electronic musical instrument that outputs a singing voice synthesized by a segment-concatenation type synthesis method of joining and processing recorded voice segments (for example, Japanese Patent Laid-Open No. 9-050287).

However, in the above-described conventional art, although pitches can be designated on the electronic musical instrument in accordance with the automatic progression of singing based on synthesized speech, the intonation that is characteristic of rap cannot be controlled in real time. Moreover, this is not limited to rap; it has generally been difficult to add intonation to a musical instrument performance.

Disclosure of Invention

According to the present invention, there is an advantage that a desired intonation can be added by a simple operation in a musical instrument performance or in singing.

In an example of the embodiment, a keyboard musical instrument includes: a keyboard comprising a plurality of keys; a plurality of operation elements provided on the rear side of the keys in the longitudinal direction and on the top surface of the instrument case, including a 1 st operation element and a 2 nd operation element, the 1 st operation element corresponding to 1 st section data of the output voice data from the 1 st timing to before the 2 nd timing, the 2 nd operation element corresponding to 2 nd section data of the voice data from the 2 nd timing to before the 3 rd timing; and at least 1 processor, wherein the at least 1 processor determines an intonation of a 1 st pattern in accordance with a 1 st user operation on the 1 st operation element and instructs pronunciation of a voice corresponding to the 1 st section data with the determined intonation of the 1 st pattern, and the at least 1 processor determines an intonation of a 2 nd pattern in accordance with a 2 nd user operation on the 2 nd operation element and instructs pronunciation of a voice corresponding to the 2 nd section data with the determined intonation of the 2 nd pattern.

Drawings

Fig. 1 shows an example of an appearance of an embodiment of an electronic keyboard instrument.

Fig. 2 is a block diagram showing an example of a hardware configuration of an embodiment of a control system of an electronic keyboard instrument.

Fig. 3 is a block diagram showing the main functions of the embodiment.

Fig. 4 is an explanatory diagram of a bend curve specifying operation using the bend slider and the bend switch according to the embodiment.

Fig. 5 shows an example of the data structure of the embodiment.

Fig. 6 shows an example of the data configuration of the bend curve setting table according to the embodiment.

Fig. 7 shows an example of the data structure of the bend curve table according to the embodiment.

Fig. 8 is a main flowchart showing an example of control processing of the electronic musical instrument according to the present embodiment.

Fig. 9 is a flowchart showing a detailed example of the initialization process, the music tempo change process, and the rap start process.

Fig. 10 is a flowchart showing a detailed example of the switching process.

Fig. 11 is a flowchart showing a detailed example of the bend curve setting process.

Fig. 12 is a flowchart showing a detailed example of the automatic performance interruption process.

Fig. 13 is a flowchart showing a detailed example of the rap playback process.

Fig. 14 is a flowchart showing a detailed example of the bend processing.

Detailed Description

Hereinafter, embodiments for carrying out the present invention will be described in detail with reference to the drawings. Fig. 1 shows an example of the external appearance of an electronic keyboard instrument 100 in which an automatic playing device as an information processing device is mounted. The electronic keyboard instrument 100 includes: a keyboard 101 composed of a plurality of keys as performance operation elements; a 1 st switch panel 102 for instructing various settings such as volume designation, tempo setting of rap reproduction, start of rap reproduction, and accompaniment reproduction; a 2 nd switch panel 103 for selecting a rap song, an accompaniment, and a tone; an LCD (Liquid Crystal Display) 104 for displaying lyrics, a musical score, and various setting information; a bend slider 105 (also referred to as slide operation elements 105) for specifying, at the time of rap reproduction, for example a bend curve, which is an intonation pattern for the pitch of the rap voice to be uttered; and a bend switch 106 for enabling or disabling the specification by the bend slider 105. Although not particularly shown, the electronic keyboard instrument 100 includes, on its bottom surface, side surface, rear surface, or the like, a speaker for emitting musical tones generated by the performance.

As shown in fig. 1, a plurality of operating elements (slide operating elements 105) are provided on the rear side in the length direction of the keys (the user playing the keyboard instrument is located on the front side in the length direction of the keys) and on the top surface (upper side) of the instrument case. The 1 st switch panel 102, the 2 nd switch panel 103, the LCD104, and the bend switch 106 are also provided on the top surface of the instrument case on the rear side in the longitudinal direction of the keys, similarly to the plurality of operation elements.

Instead of the slide operation elements 105, the plurality of operation elements may be rotary operation elements (knob operation elements) 105 or push-button operation elements 105.

Fig. 2 shows an example of a hardware configuration of an embodiment of a control system 200 of the electronic keyboard instrument 100 equipped with the automatic playing apparatus of fig. 1. In fig. 2, a CPU (central processing unit) 201, a ROM (read only memory) 202, a RAM (random access memory) 203, a sound source LSI (large scale integrated circuit) 204, a voice synthesis LSI205, a key scanner 206 to which the keyboard 101, the 1 st switch panel 102, the 2 nd switch panel 103, the bend slider 105, and the bend switch 106 of fig. 1 are connected, and an LCD controller 208 to which the LCD104 of fig. 1 is connected are each connected to a system bus 209 of the control system 200. Further, a timer 210 for controlling the sequence of the automatic performance is connected to the CPU201. The musical sound output data 218 and the vocal speech output data 217 output from the sound source LSI204 and the speech synthesis LSI205 are converted into an analog musical sound output signal and an analog vocal speech output signal by the D/a converters 211 and 212, respectively. The analog musical sound output signal and the analog vocal speech output signal are mixed by a mixer 213, and the mixed signal is amplified by an amplifier 214 and then output from a speaker or an output terminal, not particularly shown.

The CPU201 executes the automatic playing control program stored in the ROM202 while using the RAM203 as a work memory, thereby executing the control operation of the electronic keyboard instrument 100 of fig. 1. Further, the ROM202 stores music data including lyric data and accompaniment data in addition to the control program and various fixed data described above.

The CPU201 is provided with a timer 210 used in the present embodiment, for example, to count the progress of the automatic performance of the electronic keyboard instrument 100.

The sound source LSI204 reads musical tone waveform data from, for example, a waveform ROM, not shown, in accordance with a sound emission control instruction from the CPU201, and outputs the musical tone waveform data to the D/a converter 211. The sound source LSI204 is capable of simultaneously sounding up to 256 voices (256-voice polyphony).

When the text data of the lyrics and the information on the pitch are given as the rap data 215 from the CPU201, the voice synthesis LSI205 synthesizes the voice data of the rap voice corresponding thereto and outputs the synthesized voice data to the D/a converter 212.

The key scanner 206 always scans the key-on/off state of the keyboard 101 of fig. 1, and the switch operation states of the 1 st switch panel 102, the 2 nd switch panel 103, the bend slider 105, and the bend switch 106, and interrupts the CPU201 to transmit a state change.

The LCD controller 208 is an IC (integrated circuit) that controls the display state of the LCD104.

Fig. 3 is a block diagram showing main functions of the present embodiment. Here, the speech synthesis unit 302 is built in the electronic keyboard instrument 100 as one function executed by the speech synthesis LSI205 of fig. 2. The speech synthesis unit 302 inputs the vocal data 215 instructed from the CPU201 in fig. 2 by a vocal reproduction process to be described later, and synthesizes and outputs the vocal speech output data 217.

For example, as shown in fig. 3, the voice learning unit 301 may be installed as one function executed by an external server computer 300 different from the electronic keyboard instrument 100 of fig. 1. Alternatively, although not shown in fig. 3, the speech learning unit 301 may be incorporated in the electronic keyboard instrument 100 as one of the functions executed by the speech synthesis LSI205, as long as the speech synthesis LSI205 of fig. 2 has a margin in processing capability. Fig. 2 shows the sound source LSI 204.

The bend processing unit 320 is a function realized by the CPU201 of fig. 2 executing the programs of a bend curve setting process (see fig. 11) and a bend processing (see fig. 14) described later, and the bend processing unit 320 executes the following processing: it takes in the states of the bend slider 105 and the bend switch 106 shown in fig. 1 or fig. 2 from the key scanner 206 of fig. 2 via the system bus 209, and thereby gives, for example, a change of a bend curve, which is an intonation pattern, to the pitch of the rap voice.

For example, the speech learning unit 301 and the speech synthesis unit 302 in fig. 2 are installed according to the technique of "statistical speech synthesis by deep learning" described in non-patent document 1 below.

(non-patent document 1)

Kei Hashimoto and Shinji Takaki, "Statistical speech synthesis based on deep learning" (深層学習に基づく統計的音声合成), Journal of the Acoustical Society of Japan, Vol. 73, No. 1 (2017), pp. 55-62.

As shown in fig. 3, for example, the speech learning unit 301 of fig. 2, which is a function executed by the external server computer 300, includes a learning text analysis unit 303, a learning acoustic feature extraction unit 304, and a model learning unit 305.

The speech learning unit 301 uses, as the learning rap voice data 312, for example, data obtained by recording the voice of a certain rap singer singing a plurality of rap songs. As the learning rap data 311, lyric texts of each rap song are prepared.

The learning text analysis unit 303 inputs the learning rap data 311 including the lyric text and analyzes the data. As a result, the learning text analysis unit 303 estimates and outputs a learning language feature quantity sequence 313 as a discrete numerical value sequence, the learning language feature quantity sequence 313 representing phonemes, pitches, and the like corresponding to the learning rap data 311.

In accordance with the input of the above-described rap data 311, the learning acoustic feature amount extraction unit 304 inputs and analyzes the rap voice data 312 recorded by a microphone or the like by a singer singing the lyric text corresponding to the learning rap data 311. As a result, the learning acoustic feature value extraction unit 304 extracts and outputs a learning acoustic feature value sequence 314 in which the learning acoustic feature value sequence 314 indicates a speech feature corresponding to the learning rap speech data 312.

The model learning unit 305 estimates, by machine learning, an acoustic model that maximizes the probability of generating the learning acoustic feature quantity sequence 314 from the learning language feature quantity sequence 313 and the acoustic model. That is, the relationship between the language feature quantity sequence (text) and the acoustic feature quantity sequence (speech) is expressed by a statistical model called an acoustic model.

The model learning unit 305 outputs model parameters representing the acoustic model calculated as a result of the machine learning as a learning result 315.

For example, as shown in fig. 3, the learning result 315 (model parameter) may be stored in the ROM202 of the control system of the electronic keyboard instrument 100 of fig. 2 when the electronic keyboard instrument 100 of fig. 1 is shipped, and may be loaded from the ROM202 of fig. 2 to the acoustic model unit 306, which will be described later, in the speech synthesis LSI205 when the electronic keyboard instrument 100 is turned on. Alternatively, for example, as shown in fig. 3, the player may operate the second switch panel 103 of the electronic keyboard instrument 100 to download the learning result 315 from a network such as the internet or a USB (Universal Serial Bus) cable, not particularly shown, to the acoustic model unit 306, which will be described later, in the speech synthesis LSI205 via the network interface 219.

The speech synthesis unit 302, which is a function executed by the speech synthesis LSI205, includes a text analysis unit 307, an acoustic model unit 306, and a sound generation model unit 308. The speech synthesis unit 302 performs a statistical speech synthesis process of synthesizing the predicted vocal speech output data 217 corresponding to the vocal data 215 including the lyric text by predicting the vocal speech output data using a statistical model called an acoustic model set in the acoustic model unit 306.

In accordance with a performance by the player matching the automatic performance, the text analysis unit 307 inputs and analyzes the rap data 215, which includes information on the phonemes, pitches, and the like of the lyrics specified by the CPU201 of fig. 2. As a result, the text analysis unit 307 analyzes and outputs a language feature quantity sequence 316 representing phonemes, parts of speech, words, and the like corresponding to the rap data 215.

The acoustic model unit 306 inputs the language feature quantity sequence 316, and estimates and outputs a corresponding acoustic feature quantity sequence 317. That is, the acoustic model unit 306 estimates a value of the acoustic feature quantity sequence 317 that maximizes the probability of the acoustic feature quantity sequence 317 being generated from the language feature quantity sequence 316 input from the text analysis unit 307 and from the acoustic model set, as the learning result 315 of machine learning, in the model learning unit 305.

The utterance model unit 308 generates the rap speech output data 217 corresponding to the rap data 215 including the lyric text specified by the CPU201 by inputting the acoustic feature value sequence 317. The rap speech output data 217 is output from the D/a converter 212 of fig. 2 via the mixer 213 and the amplifier 214, and is output from a speaker not shown in particular.

The acoustic features expressed by the learning acoustic feature quantity sequence 314 and the acoustic feature quantity sequence 317 include spectral data modeling the human vocal tract and sound source data modeling the human vocal cords. As the spectral parameters, for example, a Mel cepstrum or Line Spectral Pairs (LSP) can be employed. As the sound source data, a fundamental frequency (F0) indicating the pitch frequency of the human voice and a power value can be used. The utterance model unit 308 includes a sound source generation unit 309 and a synthesis filter unit 310. The sound source generation unit 309 is a part that models the human vocal cords; by sequentially receiving the sequence of sound source data 319 input from the acoustic model unit 306, it generates a sound source signal consisting of, for example, a pulse train that repeats periodically at the fundamental frequency (F0) contained in the sound source data 319 and has the power value contained therein (in the case of voiced phonemes), white noise having the power value contained in the sound source data 319 (in the case of unvoiced phonemes), or a mixture of these. The synthesis filter unit 310 is a part that models the human vocal tract; it forms a digital filter modeling the vocal tract from the sequence of spectral data 318 sequentially input from the acoustic model unit 306, and generates and outputs the vocal speech output data 217 as a digital signal, using the sound source signal input from the sound source generation unit 309 as an excitation source signal.
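Although the embodiment realizes the utterance model inside the speech synthesis LSI205, the source-filter structure described above can be illustrated with a minimal Python sketch. The frame length, F0 values, and all-pole filter coefficients below are hypothetical placeholders, not values taken from the embodiment; the sketch only shows a pulse-train/noise excitation passed through a synthesis filter.

```python
import numpy as np

SR = 16000          # sampling rate assumed in the embodiment (16 kHz)
FRAME = 80          # 5 ms frame period at 16 kHz

def excitation(f0, power, frame_len=FRAME, sr=SR):
    """Sound source generation unit 309 (sketch): pulse train for voiced
    frames (f0 > 0), white noise for unvoiced frames (f0 == 0)."""
    if f0 > 0:
        sig = np.zeros(frame_len)
        period = int(sr / f0)
        sig[::period] = 1.0
    else:
        sig = np.random.randn(frame_len)
    return np.sqrt(power) * sig

def synthesis_filter(src, coeffs):
    """Synthesis filter unit 310 (sketch): a simple all-pole filter whose
    coefficients stand in for one frame of spectral data 318."""
    out = np.zeros(len(src))
    for n in range(len(src)):
        acc = src[n]
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                acc += a * out[n - k]
        out[n] = acc
    return out

# One voiced frame at F0 = 220 Hz followed by one unvoiced frame.
frames = [(220.0, 0.1), (0.0, 0.01)]
coeffs = [0.5, -0.1]                       # hypothetical filter coefficients
audio = np.concatenate([synthesis_filter(excitation(f0, p), coeffs)
                        for f0, p in frames])
print(audio.shape)                         # (160,) = two 5 ms frames
```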

The sampling frequency of the learning rap voice data 312 is, for example, 16 kHz (kilohertz). When, for example, mel-cepstral parameters obtained by mel-cepstral analysis processing are employed as the spectral parameters included in the learning acoustic feature quantity sequence 314 and the acoustic feature quantity sequence 317, the update frame period is, for example, 5 msec (milliseconds). Further, in the mel-cepstral analysis processing, the analysis window length is 25 msec, the window function is a Blackman window, and the analysis order is 24.
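A small sketch of the framing stated above (16 kHz sampling, 5 ms frame period, 25 ms Blackman window); the input signal is random placeholder audio, and the mel-cepstral analysis of order 24 itself is not implemented here.

```python
import numpy as np

SR = 16000                       # sampling frequency: 16 kHz
FRAME_SHIFT = SR * 5 // 1000     # 5 ms update frame period -> 80 samples
WIN_LEN = SR * 25 // 1000        # 25 ms analysis window -> 400 samples
window = np.blackman(WIN_LEN)    # Blackman window

def frames(signal):
    """Split a signal into overlapping analysis frames as described above.
    Each frame would then undergo mel-cepstral analysis of order 24
    (that analysis is not sketched here)."""
    out = []
    for start in range(0, len(signal) - WIN_LEN + 1, FRAME_SHIFT):
        out.append(signal[start:start + WIN_LEN] * window)
    return np.array(out)

sig = np.random.randn(SR)        # 1 second of placeholder audio
print(frames(sig).shape)         # (196, 400): 196 frames of 400 samples
```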

Next, a first embodiment of the statistical speech synthesis process configured by the speech learning unit 301 and the speech synthesis unit 302 in fig. 3 will be described. In the first embodiment of the statistical speech synthesis process, HMMs (Hidden Markov models) described in the above-described non-patent document 1 and the following non-patent document 2 are used as acoustic models expressed by the learning results 315 (Model parameters) set in the acoustic Model unit 306.

(non-patent document 2)

Shinji Sako, Keijiro Saino, Yoshihiko Nankaku, Keiichi Tokuda, and Tadashi Kitamura, "A singing voice synthesis system capable of automatically learning voice quality and singing style" (声質と歌唱スタイルを自動学習可能な歌声合成システム), IPSJ SIG Technical Report, Music and Computer (MUS), 2008(12(2008-MUS-074)), pp. 39-44, 2008-02-08.

In the first embodiment of the statistical speech synthesis process, the HMM acoustic model learns how the characteristic parameters of vocal cord vibration and vocal tract characteristics change over time when a singer utters lyrics along a certain melody. More specifically, the HMM acoustic model is a model in which the spectrum, the fundamental frequency, and their temporal structure obtained from the learning rap data are modeled in units of phonemes.

Next, a second embodiment of the statistical speech synthesis process configured by the speech learning unit 301 and the speech synthesis unit 302 in fig. 3 will be described. In the second embodiment of the statistical speech synthesis process, in order to predict the acoustic feature quantity sequence 317 from the language feature quantity sequence 316, the acoustic model unit 306 is implemented by a Deep Neural Network (DNN). Correspondingly, the model learning unit 305 in the speech learning unit 301 learns model parameters representing the nonlinear transformation function of each neuron in the DNN that transforms language feature quantities into acoustic feature quantities, and outputs these model parameters as the learning result 315 to the DNN of the acoustic model unit 306 in the speech synthesis unit 302.
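A toy sketch of the DNN mapping described above, from a per-frame linguistic feature vector to an acoustic feature vector. The layer sizes, random weights, and feature dimensions are arbitrary assumptions for illustration only and are unrelated to the learned acoustic model of the embodiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 50 linguistic features in, 25 acoustic features out
# (e.g. 24 mel-cepstral coefficients plus log F0).
W1, b1 = rng.standard_normal((50, 64)) * 0.1, np.zeros(64)
W2, b2 = rng.standard_normal((64, 25)) * 0.1, np.zeros(25)

def acoustic_model(linguistic_features):
    """Per-frame nonlinear transformation (sketch): one tanh hidden layer
    followed by a linear output layer."""
    h = np.tanh(linguistic_features @ W1 + b1)
    return h @ W2 + b2

frame = rng.standard_normal(50)          # one frame of language features 316
print(acoustic_model(frame).shape)       # (25,) acoustic feature frame 317
```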

The automatic performance operation of a song including rap according to the embodiment of the electronic keyboard instrument 100 of figs. 1 and 2, using the statistical speech synthesis process described with fig. 3, will be described in detail below. Fig. 4 is an explanatory diagram of the bend curve specifying operation using the bend slider 105 and the bend switch 106 of figs. 1 and 2 according to the present embodiment. In the present embodiment, for example, a bend curve, which is an intonation pattern of the pitch of the rap voice that changes within the period of each beat, can be specified for each beat (a predetermined unit of progression) of an automatically progressing rap song.

Using the bend slider 105 shown in fig. 4 as a specifying unit, the user specifies bend curves, and adds bends based on those specifications, in real time for the automatically progressing rap song, for example for every 16 consecutive beats (4 bars in the case of a 4/4 song). The bend slider 105 includes, for example, 16 sliders (only 8 are shown in the example of fig. 4), and each slider can specify, in order from left to right, the type of bend curve for each of the 16 beats being executed in the currently progressing rap song. A plurality of bend curve patterns 401 can be prepared as specifiable bend curves (in the example of fig. 4, 4 bend curve patterns 401, #0 to #3, are illustrated on the left side of the bend slider 105). For each of the 16 sliders of the bend slider 105, the user can designate one of the bend curve patterns 401 by the slide position of that slider.

For example, a bend switch 106, which also serves as a specifying unit and is composed of, for example, 16 switches, is disposed above the bend slider 105 composed of, for example, 16 sliders. Each switch of the bend switch 106 corresponds to the slider of the bend slider 105 disposed below it. By turning off the corresponding switch of the bend switch 106 for any of the 16 beats, the user can disable the corresponding slider of the bend slider 105. In this way, the bend effect can be kept from being applied to that beat.

The setting of the bend curve for each of the 16 consecutive beats by the bend slider 105 and the bend switch 106 described above is performed by the bend processing unit 320 described with fig. 3. The bend processing unit 320, which operates as an adding means, instructs the speech synthesis unit 302 (see figs. 2 and 3) of the intonation of the pitch of the rap voice corresponding to the bend curve specified by the bend slider 105 and the bend switch 106, for each of the 16 beats (4 bars in the case of 4/4 time) of the rap song automatically progressing in the automatic performance by the speech synthesis unit 302.

Specifically, for each advance of a beat, the bend processing unit 320 specifies pitch change information to the speech synthesis unit 302 based on the bend curve specified for that beat. The time resolution of the pitch bend within 1 beat is, for example, 48, and the bend processing unit 320 specifies pitch change information corresponding to the specified bend curve to the speech synthesis unit 302 at each timing obtained by dividing the 1 beat into 48. The speech synthesis unit 302 described with fig. 3 changes the pitch of the sound source data 319 output from the acoustic model unit 306 based on the pitch change information specified by the bend processing unit 320, and supplies the changed sound source data 319 to the sound source generation unit 309.
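A sketch of how per-tick pitch change information might be generated from one bend curve at a resolution of R = 48 steps per beat; the curve values and the callback standing in for the speech synthesis unit 302 are hypothetical placeholders.

```python
R = 48                                    # time resolution of the bend within 1 beat

# Hypothetical bend curve: rises to double pitch at mid-beat, then returns.
bend_curve = [1.0 + (1.0 - abs(2 * k / (R - 1) - 1.0)) for k in range(R)]

def apply_bend(base_f0_hz, curve, send_pitch):
    """For each of the R timings within the beat, multiply the pitch of the
    sound source information by the curve value and hand it to the speech
    synthesizer (here represented by a simple callback)."""
    for offset in range(R):
        send_pitch(offset, base_f0_hz * curve[offset])

apply_bend(220.0, bend_curve,
           lambda off, f0: print(f"tick {off:2d}: F0 = {f0:6.1f} Hz"))
```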

As described above, in the present embodiment, the progression of the lyrics and timing of the rap song is handed over to the automatic performance, and the user can specify a rap-like intonation pattern of the pitch curve for each unit of progression (for example, each beat), so that the user can easily enjoy a rap performance.

In particular, in this case, the user can specify a bend curve for each beat of the rap voice in real time, for every 16 beats of the automatically progressing automatic performance, using, for example, the bend slider 105 and the bend switch 106 corresponding to those 16 beats, and can thereby add rap-like expression to the performance while the rap song is automatically played.

Further, the user can also specify and store in advance a specification of a bend curve for each beat in association with, for example, a rap song to be automatically performed, and the bend processing unit 320 reads the specification of the bend curve when the automatic performance of the rap song is performed, and instructs the voice synthesizing unit 302 of the intonation of the pitch of the rap voice corresponding to the specified bend curve.

In this way, the user can take time in advance to carefully add intonation to the pitch of the rap voice for the rap song.

However, since the speech data (including various data forms such as music data, lyric data, and text data) generally has a larger number of sections than the number of the plurality of operation elements (the slide operation elements 105), the processor 201 performs processing of changing the section data corresponding to the 1 st operation element from the 1 st section data to the section data after the 1 st section data after outputting the 1 st section data corresponding to the 1 st operation element.

Assume, for example, that the number of the plurality of operation elements (slide operation elements 105) is 8. In this case, the processor 201 associates, at a certain timing, an interval of 2 bars in length in the voice data with the plurality of operation elements. That is, at a certain timing, the plurality of operation elements are associated as follows.

1 st operation element …… 1 st section data (section of the 1 st beat of the 1 st bar)

2 nd operation element …… 2 nd section data (section of the 2 nd beat of the 1 st bar)

3 rd operation element …… 3 rd section data (section of the 3 rd beat of the 1 st bar)

4 th operation element …… 4 th section data (section of the 4 th beat of the 1 st bar)

5 th operation element …… 5 th section data (section of the 1 st beat of the 2 nd bar)

6 th operation element …… 6 th section data (section of the 2 nd beat of the 2 nd bar)

7 th operation element …… 7 th section data (section of the 3 rd beat of the 2 nd bar)

8 th operation element …… 8 th section data (section of the 4 th beat of the 2 nd bar)

After the keyboard musical instrument outputs the 1 st section data corresponding to the 1 st operating element, the processor 201 performs a process of changing the section data corresponding to the 1 st operating element from the 1 st section data to the 9 th section data (for example, the 1 st beat section of the 3 rd bar) following the 8 th section data.

That is, during the performance, the section data assigned to the 1 st operation element is changed in the order 1 st section data → 9 th section data → 17 th section data → …. For example, at the timing when the sound emission of the singing voice has been completed up to the 4 th beat of the 1 st bar, the section data assigned to each operation element is as follows.

1 st operation element …… 9 th section data (section of the 1 st beat of the 3 rd bar)

2 nd operation element …… 10 th section data (section of the 2 nd beat of the 3 rd bar)

3 rd operation element …… 11 th section data (section of the 3 rd beat of the 3 rd bar)

4 th operation element …… 12 th section data (section of the 4 th beat of the 3 rd bar)

5 th operation element …… 5 th section data (section of the 1 st beat of the 2 nd bar)

6 th operation element …… 6 th section data (section of the 2 nd beat of the 2 nd bar)

7 th operation element …… 7 th section data (section of the 3 rd beat of the 2 nd bar)

8 th operation element …… 8 th section data (section of the 4 th beat of the 2 nd bar)

According to the present invention, even if the number of operation elements is limited, since the interval of the voice data allocated to one operation element is changed during the performance, there is an advantage that voice data of any length can be sung satisfactorily.
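The reassignment can be summarized with a small sketch: with 8 operation elements, the section assigned to an element advances by 8 once its current section has been output. The function and variable names are illustrative, not part of the embodiment.

```python
NUM_ELEMENTS = 8

# assigned[k] is the index (0-based) of the section data currently mapped
# to the (k+1)-th operation element; initially sections 1..8.
assigned = list(range(NUM_ELEMENTS))

def on_section_output(element_index):
    """After the section corresponding to an operation element has been
    output, advance that element to the section 8 positions later."""
    assigned[element_index] += NUM_ELEMENTS

# Playback finishes bar 1 (sections 1-4): elements 1-4 move to sections 9-12.
for k in range(4):
    on_section_output(k)
print([s + 1 for s in assigned])   # [9, 10, 11, 12, 5, 6, 7, 8]
```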

In addition, the combination of the intonation patterns assigned to the respective operation elements, for example the combination of the intonation pattern 401(#0) assigned to the 1 st operation element (1 st pattern) and the intonation pattern 401(#1) assigned to the 2 nd operation element (2 nd pattern), is not changed unless the operation elements 105 are operated. Therefore, once the combination of intonation patterns has been determined by operating the operation elements 105, the keyboard musical instrument can sound the voice data from beginning to end with the determined combination of intonation patterns without the user operating the operation elements 105 again. That is, in a performance carried out by the user operating the keyboard 101, the operation of the operation elements 105 for adding intonation to the song does not need to be performed. Therefore, there is an advantage that the user can concentrate on operating the keyboard 101.

Of course, if the user operates the operation element 105 during the performance, the combination of the intonation patterns can be changed at any time. That is, in the performance performed by the user operating the keyboard 101, the combinations of intonation patterns can be changed in accordance with the change in performance expression. Therefore, there is an advantage that the user can continue to perform more pleasantly.

In the embodiment of fig. 4, slide operation elements 105 are exemplified as the plurality of operation elements 105. In this case, the processor 201 determines a certain intonation pattern from among a plurality of intonation patterns set in advance, based on data indicating a slide operation amount acquired in accordance with the user's slide operation of a slide operation element 105. When the operation elements 105 are rotary operation elements 105, the intonation pattern may be determined based on data indicating the amount of rotation. When the operation elements 105 are button operation elements 105, the intonation pattern may be determined in accordance with whether the button is on or off.
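A sketch of one possible way to map an operation amount to one of the preset intonation patterns; the value range of the slider and the equal division into patterns are assumptions for illustration.

```python
NUM_PATTERNS = 4          # bend curves #0 to #3 in fig. 4

def pattern_from_slider(amount, max_amount=127):
    """Map a slide (or rotation) amount to one of the preset patterns."""
    step = (max_amount + 1) / NUM_PATTERNS
    return min(int(amount / step), NUM_PATTERNS - 1)

def pattern_from_button(is_on):
    """Button operation element: on selects pattern #1, off pattern #0."""
    return 1 if is_on else 0

print(pattern_from_slider(0), pattern_from_slider(127))   # 0 3
```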

In the present embodiment, the songs are synthesized according to the pitch data specified by the user operating the keyboard 101. That is, singing voice data corresponding to the lyrics and the specified pitch is generated in real time.

In the present embodiment, fig. 5 shows an example of the data structure of music data read from the ROM202 to the RAM203 in fig. 2. The data structure is based on a standard MIDI file format which is one of file formats for MIDI (Musical Instrument Digital Interface). The music data is constituted by blocks of data called chunks (chunk). Specifically, the music data is composed of a title block (header chunk) located at the head of the file, a track block (track chunk)1 following the title block and storing lyric data for lyric parts, and a track block 2 storing accompaniment data for accompaniment parts.

The header block is composed of the values ChunkID, ChunkSize, FormatType, NumberOfTrack, and TimeDivision. The ChunkID is the 4-byte ASCII code "4D546864" (hexadecimal) corresponding to the 4 half-width characters "MThd" indicating a header block. ChunkSize is 4-byte data indicating the data length of the FormatType, NumberOfTrack, and TimeDivision parts of the header block excluding ChunkID and ChunkSize; this data length is fixed at 6 bytes, "00000006" (hexadecimal). In the case of the present embodiment, FormatType is the 2-byte data "0001" (hexadecimal) indicating format 1, in which a plurality of tracks are used. In the case of the present embodiment, NumberOfTrack is the 2-byte data "0002" (hexadecimal) indicating that 2 tracks, corresponding to the lyric part and the accompaniment part, are used. TimeDivision is data indicating a time base value that expresses the resolution per quarter note; in the present embodiment it is the 2-byte data "01E0" (hexadecimal), which represents 480 in decimal.
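The fixed-length header block described above can be assembled directly from the listed values; a minimal sketch (the function name is hypothetical):

```python
import struct

def build_header_chunk(format_type=1, number_of_tracks=2, time_division=480):
    """MThd chunk: ChunkID 'MThd', ChunkSize fixed at 6, then FormatType,
    NumberOfTrack and TimeDivision as 16-bit big-endian values."""
    return (b"MThd"                              # ChunkID = 4D 54 68 64
            + struct.pack(">I", 6)               # ChunkSize = 00 00 00 06
            + struct.pack(">HHH", format_type, number_of_tracks, time_division))

header = build_header_chunk()
print(header.hex())   # '4d546864000000060001000201e0'
```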

The track blocks 1 and 2 are each composed of: a ChunkID, a ChunkSize, and DeltaTime_1[i] and Event_1[i] (in the case of track block 1/the lyric part) or DeltaTime_2[i] and Event_2[i] (in the case of track block 2/the accompaniment part) (0 ≤ i ≤ L for track block 1/the lyric part, 0 ≤ i ≤ M for track block 2/the accompaniment part). The ChunkID is the 4-byte ASCII code "4D54726B" (hexadecimal) corresponding to the 4 half-width characters "MTrk" indicating a track block. ChunkSize is 4-byte data indicating the data length of the part of each track block excluding ChunkID and ChunkSize.

DeltaTime_1[i] is variable-length data of 1 to 4 bytes indicating the waiting time (relative time) from the execution time of the immediately preceding Event_1[i-1]. Similarly, DeltaTime_2[i] is variable-length data of 1 to 4 bytes indicating the waiting time (relative time) from the execution time of the immediately preceding Event_2[i-1]. Event_1[i] is a meta event indicating the utterance timing and pitch of the lyrics of the rap in track block 1/the lyric part. Event_2[i] is a MIDI event indicating note-on or note-off, or a meta event indicating the time signature, in track block 2/the accompaniment part. For track block 1/the lyric part, in each performance data set DeltaTime_1[i] and Event_1[i], Event_1[i] is executed after waiting DeltaTime_1[i] from the execution time of the immediately preceding Event_1[i-1], thereby realizing the progression of the lyric utterances. On the other hand, for track block 2/the accompaniment part, in each performance data set DeltaTime_2[i] and Event_2[i], Event_2[i] is executed after waiting DeltaTime_2[i] from the execution time of the immediately preceding Event_2[i-1], thereby realizing the automatic accompaniment.
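A simplified sketch of consuming a track block's performance data sets: decode the 1-4 byte variable-length DeltaTime, wait that many TickTime units, then execute the Event. The event representation and the use of time.sleep as the waiting mechanism are simplifications, not the embodiment's interrupt-driven control.

```python
import time

def read_delta_time(data, pos):
    """Decode a 1-4 byte variable-length DeltaTime starting at pos.
    Returns (value in TickTime units, new position)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:
            return value, pos

def play_track(events, tick_time_sec):
    """events: list of (delta_time_ticks, callable). Wait DeltaTime_x[i]
    from the previous event, then execute Event_x[i]."""
    for delta, event in events:
        time.sleep(delta * tick_time_sec)
        event()

# 0x81 0x40 encodes 192 ticks; the trailing 0x00 would encode 0 ticks.
print(read_delta_time(bytes([0x81, 0x40, 0x00]), 0))   # (192, 2)
```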

Fig. 6 shows an example of the data configuration of the bend curve setting table 600, which stores the per-beat bend curve settings specified via the bend slider 105 and the bend switch 106 (see figs. 1, 2, and 4) and the bend processing unit 320 (see fig. 3). This bend curve setting table 600 is stored, for example, in the RAM203 of fig. 2. The bend curve setting table 600 stores, for every 16 consecutive beats, the bend curve number designated for each bar number and beat number. For example, the first 16-beat continuous data group 601(#0) stores bar numbers 0 to 3, beat numbers 0 to 3 within each bar, and bend curve numbers 0 to 3 (corresponding to 401(#0) to 401(#3) in fig. 4). For a beat turned off by the bend switch 106, the bend curve number is a Null value (indicated by "-" in fig. 6).

Fig. 7 shows a bend curve table 700, which stores, for example, 4 patterns of bend curves corresponding to the intonation patterns 401(#0) to 401(#3) in fig. 4. This bend curve table 700 is stored in the ROM202 of fig. 2, for example by factory setting. In fig. 7, 401(#0), 401(#1), 401(#2), and 401(#3) correspond to the bend curve patterns shown in fig. 4, and their respective start memory addresses on the ROM202 are, for example, BendCurve[0], BendCurve[1], BendCurve[2], and BendCurve[3]. R is the resolution of the bend curve, for example R = 48. In each bend curve, the address offset represents an offset value with respect to the start storage address; there is a storage area for each of the offset values 0 to R-1 (for example, 0 to 47), and a bend value is stored in each storage area. The bend value is a magnification with respect to the pitch value before modification; for example, a value of "1.00" indicates no pitch change, and a value of "2.00" indicates a pitch of 2 times.

Fig. 8 is a main flowchart showing an example of control processing of the electronic musical instrument according to the present embodiment. This control processing is, for example, an operation in which the CPU201 of fig. 2 executes a control processing program loaded from the ROM202 to the RAM 203.

The CPU201 first executes initialization processing (step S801), and then repeatedly executes a series of processing of steps S802 to S807.

In this iterative process, the CPU201 first executes a switching process (step S802). Here, the CPU201 executes processing corresponding to each switch operation of the first switch panel 102, the second switch panel 103, the bending slider 105, or the bending switch 106 of fig. 1 in accordance with an interrupt from the key scanner 206 of fig. 2.

Next, the CPU201 executes a keyboard process, that is, a process of determining whether or not a certain key of the keyboard 101 of fig. 1 is operated, in accordance with an interrupt from the key scanner 206 of fig. 2 (step S803). Here, the CPU201 outputs musical tone control data 216 for instructing the start or stop of sound generation to the sound source LSI204 of fig. 2 in response to a key depression or key release operation of a certain key performed by the player.

Next, the CPU201 performs display processing, that is, processing data that should be displayed in the LCD104 of fig. 1, and displays the data on the LCD104 via the LCD controller 208 of fig. 2 (step S804). The data displayed on the LCD104 includes, for example, lyrics corresponding to the vocal output data 217 to be played, a musical score of a melody corresponding to the lyrics, and various setting information.

Next, the CPU201 executes the rap reproduction process (step S805). In this process, the CPU201 executes the control process described in fig. 5 in accordance with the performance of the player, generates rap data 215, and outputs it to the voice synthesis LSI 205.

Next, the CPU201 executes sound source processing (step S806). In the sound source processing, the CPU201 executes control processing such as envelope control (envelope control) of a musical sound being generated in the sound source LSI 204.

Finally, the CPU201 determines whether or not the player has pressed a shutdown switch (not particularly shown) and has shut down (step S807). If the determination at step S807 is no, the CPU201 returns to the process at step S802. If the determination in step S807 is yes, the CPU201 ends the control process shown in the flowchart of fig. 8, and turns off the power supply to the electronic keyboard instrument 100.

Fig. 9(a), (b), and (c) are flowcharts showing detailed examples of the initialization process of step S801 in fig. 8, the music tempo change process of step S1002 in fig. 10 described later in the switching process of step S802 in fig. 8, and the rap start process of step S1006 in fig. 10.

First, in fig. 9(a), which shows a detailed example of the initialization process of step S801 of fig. 8, the CPU201 executes initialization of the TickTime. In the present embodiment, the progression of the lyrics and the automatic accompaniment proceeds in units of a time called TickTime. The time base value designated as the TimeDivision value in the header block of the music data of fig. 5 indicates the resolution per quarter note; if this value is, for example, 480, a quarter note has a duration of 480 TickTime. The values of the waiting times DeltaTime_1[i] and DeltaTime_2[i] in the track blocks of the music data of fig. 5 are also counted in units of TickTime. How many seconds 1 TickTime actually corresponds to differs depending on the tempo specified for the music data. With the tempo value Tempo [beats/minute] and the time base value TimeDivision, the number of seconds of the TickTime is calculated by the following equation (1).

TickTime [sec] = 60 / Tempo / TimeDivision   (1)

Therefore, in the initialization process illustrated in the flowchart of fig. 9(a), the CPU201 first calculates the TickTime [sec] by arithmetic processing corresponding to the above equation (1) (step S901). As for the tempo value Tempo, it is assumed that a predetermined initial value, for example 60 [beats/minute], is stored in the ROM202 of fig. 2. Alternatively, the tempo value at the time of the last end may be stored in a nonvolatile memory.

Next, the CPU201 sets, for the timer 210 of fig. 2, a timer interrupt based on the TickTime [sec] calculated in step S901 (step S902). As a result, an interrupt for the lyric progression, the automatic accompaniment, and the bend processing (hereinafter referred to as "automatic performance interrupt") is generated for the CPU201 every time the TickTime [sec] elapses in the timer 210. Therefore, in the automatic performance interruption process (fig. 12, described later) executed by the CPU201 in response to the automatic performance interrupt, control processing is executed so that the lyrics and the automatic accompaniment progress every 1 TickTime.

In addition, the bend processing described later is executed in a time unit obtained by multiplying 1 TickTime by D. This D is calculated by the following equation (2) using the time base value TimeDivision, which indicates the resolution per quarter note described with fig. 5, and the resolution R of the bend curve table 700 described with fig. 7.

D = TimeDivision / R   (2)

For example, as described above, a quarter note (1 beat in the case of 4/4 time) is 480 TickTime, so when R = 48, the bend processing is performed every D = 480 / 48 = 10 TickTime.
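Equations (1) and (2) restated as code, reproducing the 10-TickTime example with the values given in the text:

```python
def tick_time_sec(tempo_bpm, time_division):
    """Formula (1): seconds per TickTime."""
    return 60.0 / tempo_bpm / time_division

def bend_period_ticks(time_division, r):
    """Formula (2): bend processing period D in TickTime units."""
    return time_division // r

TIME_DIVISION = 480      # resolution per quarter note (TimeDivision)
R = 48                   # resolution of the bend curve table

print(tick_time_sec(60, TIME_DIVISION))        # ~0.00208 s at 60 beats/minute
print(bend_period_ticks(TIME_DIVISION, R))     # D = 10 TickTime
```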

Next, the CPU201 executes other initialization processing such as initialization of the RAM203 of fig. 2 (step S903). After that, the CPU201 ends the initialization processing of step S801 of fig. 8 illustrated in the flowchart of (a) of fig. 9.

The flowcharts of (b) and (c) of fig. 9 will be described later. Fig. 10 is a flowchart showing a detailed example of the switching process in step S802 in fig. 8.

The CPU201 first determines whether or not the lyric progression and the tempo of the music for automatic accompaniment are changed by the tempo change switch in the first switch panel 102 of fig. 1 (step S1001). If the determination is yes, the CPU201 executes a music tempo change process (step S1002). Details of this processing will be described later with reference to fig. 9 (b). If the determination at step S1001 is no, the CPU201 skips the processing at step S1002.

Next, the CPU201 determines whether a certain rap song is selected in the second switch panel 103 of fig. 1 (step S1003). If the determination is yes, the CPU201 executes the rap song reading process (step S1004). This processing is a process of reading music data having the data structure described in fig. 5 from the ROM202 to the RAM203 in fig. 2. Thereafter, data access to the track block 1 or 2 in the data structure illustrated in fig. 5 is performed on the music data read into the RAM 203. If the determination at step S1003 is no, the CPU201 skips the process at step S1004.

Next, the CPU201 determines whether or not the rap start switch is operated in the first switch panel 102 of fig. 1 (step S1005). If the determination is yes, the CPU201 executes the rap start processing (step S1006). Details of this processing will be described later with reference to (c) of fig. 9. If the determination in step S1005 is no, the CPU201 skips the process in step S1006.

Then, the CPU201 determines whether or not the bend curve setting start switch is operated on the 1 st switch panel 102 in fig. 1 (step S1007). If the determination is yes, the CPU201 executes the bending sound curve setting process based on the bending sound slider 105 and the bending sound switch 106 in fig. 1 (step S1008). The details of this processing will be described later with reference to fig. 11. If the determination at step S1007 is no, CPU201 skips the process at step S1008.

Finally, the CPU201 determines whether or not another switch is operated in the first switch panel 102 or the second switch panel 103 of fig. 1, and executes a process corresponding to each switch operation (step S1009). After that, the CPU201 ends the switching process of step S802 of fig. 8 illustrated in the flowchart of fig. 10.

Fig. 9(b) is a flowchart showing a detailed example of the music tempo change process in step S1002 in fig. 10. As described above, when the music tempo value is changed, the TickTime [ second ] is also changed. In the flowchart of fig. 9(b), the CPU201 executes control processing relating to the change of the TickTime [ sec ].

First, as in the case of step S901 in fig. 9(a) executed in the initialization processing in step S801 in fig. 8, the CPU201 calculates the TickTime [ sec ] by the arithmetic processing corresponding to the above expression (1) (step S911). The music Tempo value Tempo is stored in the RAM203 or the like after being changed by the music Tempo change switch in the first switch panel 102 in fig. 1.

Next, as in the case of step S902 of fig. 9(a) executed in the initialization process of step S801 of fig. 8, the CPU201 sets a timer interrupt based on the TickTime [ sec ] calculated in step S911 for the timer 210 of fig. 2 (step S912). After that, the CPU201 ends the music tempo change process of step S1002 in fig. 10 illustrated in the flowchart in fig. 9 (b).

Fig. 9 (c) is a flowchart showing a detailed example of the rap start processing in step S1006 in fig. 10.

First, the CPU201 initially sets to 0 the value of a variable ElapseTime on the RAM203, which indicates in units of TickTime the elapsed time from the start of the automatic performance while the automatic performance progresses. Similarly, the values of the variables DeltaT_1 (track block 1) and DeltaT_2 (track block 2) on the RAM203, which count in units of TickTime the relative time from the occurrence time of the immediately preceding event, are both initially set to 0. Next, the CPU201 initially sets to 0 the value of the variable AutoIndex_1 on the RAM203 for specifying the index i of the performance data sets DeltaTime_1[i] and Event_1[i] (1 ≦ i ≦ L-1) in track block 1 of the music data illustrated in fig. 5, and the value of the variable AutoIndex_2 on the RAM203 for specifying the index i of the performance data sets DeltaTime_2[i] and Event_2[i] (1 ≦ i ≦ M-1) in track block 2. The value of the variable DividingTime on the RAM203, which indicates the dividing time in units of TickTime, is initially set to D-1 using the value D calculated by the above equation (2). The value of the variable BendAdressOffset on the RAM203, which indicates the offset address on the bend curve table 700 described with fig. 7, is initially set to R-1 using the resolution R described with fig. 7, for example R-1 = 48-1 = 47 (the above is step S921). Thus, in the example of fig. 5, as the initial state, the first performance data set DeltaTime_1[0] and Event_1[0] in track block 1 and the first performance data set DeltaTime_2[0] and Event_2[0] in track block 2 are each referred to first.

Next, the CPU201 initially sets to 0 the value of a variable SongIndex on the RAM203, which indicates the current rap position (step S922).

Then, the CPU201 initially sets to 1 (progressing) the value of a variable SongStart on the RAM203, which indicates whether the progression of the lyrics and accompaniment is being performed (= 1) or not (= 0) (step S923).

After that, the CPU201 determines whether the player has made a setting, via the first switch panel 102 of fig. 1, to reproduce the accompaniment together with the reproduction of the rap lyrics (step S924).

If the determination at step S924 is yes, the CPU201 sets the value of the variable Bansou on the RAM203 to 1 (accompanied) (step S925). In contrast, if the determination at step S924 is no, the CPU201 sets the value of the variable Bansou to 0 (no accompaniment) (step S926). After the processing of step S925 or S926, the CPU201 ends the rap start processing of step S1006 of fig. 10 illustrated in the flowchart of (c) of fig. 9.

Fig. 11 is a flowchart showing a detailed example of the bend curve setting process in step S1008 in fig. 10. First, the CPU201 specifies a setting start position (bar number) in units of, for example, 16 beats (4 bars in the case of 4/4 time) (step S1101). Since the bend curve setting process can be executed in real time while the automatic performance progresses, the initial value is, for example, the 0th bar, and each time the setting for one group of 16 beats is completed, the next start positions (the 16th, the 32nd, and so on) can be designated automatically in sequence. In order to change the setting for the beat currently being performed, the user can also designate, via a switch on the first switch panel 102 not specifically shown, a group of 16 consecutive beats including the beat currently being performed as the setting start position.

Next, the CPU201 acquires from the ROM202 the rap lyric data for the 16 beats (4 bars) designated in step S1101 (step S1102). The CPU201 displays the acquired rap lyric data on, for example, the LCD104 of fig. 2, in order to assist the user in specifying the bend curves.

Next, the CPU201 sets the initial value of the beat position in the consecutive 16 beats to 0 (step S1103).

After initially setting the value of the variable i on the RAM203, which indicates the beat position within the 16 consecutive beats, to 0 in step S1103, the CPU201 repeatedly executes steps S1104 and S1105 (#0 to #3) for the 16 beats, incrementing i by 1 at a time in step S1106, until it is determined in step S1107 that the value of i exceeds 15.

In the above repeated processing, the CPU201 first reads, via the key scanner 206 of fig. 2, the slider value s at beat position i on the bend slider 105 described in fig. 4, and determines that value (step S1104).

Next, when the slider value at beat position i is s = 0, the CPU201 stores the number 0 of the bend curve 401(#0) in fig. 4 or fig. 7 in the bend curve number entry of the bend curve setting table 600 in fig. 6. The values of the bar number and beat number entries at this time are calculated by the following expressions (3) and (4) and stored (the above is step S1105(#0)).

Bar number = (bar number designated in S1101) + (integer part of i/4)   (3)

Beat number = (remainder of beat position i/4)   (4)

When the slider value at beat position i is s = 1, the CPU201 stores the number 1 of the bend curve 401(#1) in fig. 4 or fig. 7 in the bend curve number entry of the bend curve setting table 600 in fig. 6. The values of the bar number and beat number entries at this time are calculated by the above expressions (3) and (4) and stored (the above is step S1105(#1)).

When the slider value at beat position i is s = 2, the CPU201 stores the number 2 of the bend curve 401(#2) in fig. 4 or fig. 7 in the bend curve number entry of the bend curve setting table 600 in fig. 6. The values of the bar number and beat number entries at this time are calculated by the above expressions (3) and (4) and stored (the above is step S1105(#2)).

When the slider value at beat position i is s = 3, the CPU201 stores the number 3 of the bend curve 401(#3) in fig. 4 or fig. 7 in the bend curve number entry of the bend curve setting table 600 in fig. 6. The values of the bar number and beat number entries at this time are calculated by the above expressions (3) and (4) and stored (the above is step S1105(#3)).

When it is determined in step S1107 during the above repetition that the value of the variable i exceeds 15, the CPU201 ends the processing of the flowchart in fig. 11, that is, the bend curve setting process of step S1008 in fig. 10.
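
The loop of steps S1103 to S1107 can be summarized by the following sketch in C; read_slider() and the bend_curve_setting[] array standing in for the bend curve setting table 600 are hypothetical names introduced only for illustration.

    typedef struct {
        int bar;         /* bar number entry of table 600 */
        int beat;        /* beat number entry (0-3 at 4/4 time) */
        int curve_num;   /* bend curve number entry (0-3) */
    } BendCurveSetting;

    extern int read_slider(int beat_pos);           /* assumed accessor for the bend slider 105 */
    extern BendCurveSetting bend_curve_setting[];   /* assumed image of table 600 */

    void bend_curve_setting_process(int start_bar)  /* start_bar designated in step S1101 */
    {
        for (int i = 0; i <= 15; i++) {             /* 16 consecutive beats (steps S1106/S1107) */
            int s = read_slider(i);                 /* step S1104: slider value at beat position i */
            bend_curve_setting[i].bar       = start_bar + i / 4;  /* expression (3) */
            bend_curve_setting[i].beat      = i % 4;              /* expression (4) */
            bend_curve_setting[i].curve_num = s;                  /* step S1105(#0) to S1105(#3) */
        }
    }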

Fig. 12 is a flowchart showing a detailed example of the automatic performance interruption process executed based on the interrupt generated every TickTime [sec] by the timer 210 of fig. 2 (see step S902 of fig. 9(a) or step S912 of fig. 9(b)). The following processing is executed on the performance data sets of track block 1 and track block 2 of the music data illustrated in fig. 5.

First, the CPU201 executes a series of processing corresponding to track block 1 (steps S1201 to S1207). First, the CPU201 determines whether the SongStart value is 1, that is, whether the progression of the lyrics and accompaniment is instructed (step S1201).

If it is determined that the progression of the lyrics and the accompaniment is not instructed (no in step S1201), the CPU201 does not proceed with the lyrics and the accompaniment and directly ends the automatic performance interruption process illustrated in the flowchart of fig. 12.

When determining that the progression of the lyrics and accompaniment is instructed (yes in step S1201), the CPU201 first increments by 1 the value of the variable ElapseTime on the RAM203, which indicates the elapsed time in TickTime units from the start of the automatic performance. Since the automatic performance interruption process of fig. 12 occurs every TickTime [sec], the value of ElapseTime is simply the count of interrupts that have occurred so far. The value of ElapseTime is used to calculate the current bar number and beat number in step S1406 of the bend processing of fig. 14, which will be described later.

Next, the CPU201 determines, for track block 1, whether or not the DeltaT_1 value, which indicates the relative time from the occurrence time of the previous event, matches the waiting time DeltaTime_1[AutoIndex_1] of the performance data set to be executed next, indicated by the AutoIndex_1 value (step S1203).

If the determination at step S1203 is no, the CPU201 increments by +1 the DeltaT_1 value, which indicates the relative time from the occurrence time of the previous event for track block 1, and advances the time by 1 TickTime unit corresponding to the current interrupt (step S1204). After that, the CPU201 proceeds to step S1208 described later.

If the determination at step S1203 is yes, the CPU201 executes, for track block 1, the event Event_1[AutoIndex_1] of the performance data set indicated by the AutoIndex_1 value (step S1205). This event is a rap event containing lyric data.

Next, the CPU201 stores the AutoIndex_1 value, which indicates the position of the rap event to be reproduced next within track block 1, in the variable SongIndex on the RAM203 (step S1205).

Also, the CPU201 increments the AutoIndex _1 value for referring to the performance data set within the track block 1 by +1 (step S1206).

Further, for track block 1, the CPU201 resets to 0 the DeltaT_1 value, which indicates the relative time from the occurrence time of the rap event referred to this time (step S1207). After that, the CPU201 proceeds to the processing of step S1208.
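
The track block 1 part of the interrupt (steps S1203 to S1207) can be sketched in C as follows; the arrays DeltaTime_1[]/Event_1[] mirror the performance data sets of fig. 5, and execute_rap_event() is a placeholder assumed for the event execution of step S1205.

    #include <stdint.h>

    extern uint32_t DeltaTime_1[];        /* waiting times in TickTime units */
    extern const void *Event_1[];         /* rap events containing lyric data */
    extern uint32_t DeltaT_1, AutoIndex_1, SongIndex;
    extern void execute_rap_event(const void *ev);   /* assumed handler for step S1205 */

    void track1_tick(void)
    {
        if (DeltaT_1 != DeltaTime_1[AutoIndex_1]) {   /* step S1203 */
            DeltaT_1++;                               /* step S1204: only advance the time */
            return;                                   /* then continue with track block 2 */
        }
        execute_rap_event(Event_1[AutoIndex_1]);      /* step S1205 */
        SongIndex = AutoIndex_1;                      /* remember the rap position to reproduce */
        AutoIndex_1++;                                /* step S1206 */
        DeltaT_1 = 0;                                 /* step S1207 */
    }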

Next, the CPU201 executes a series of processing corresponding to track block 2 (steps S1208 to S1214). First, the CPU201 determines, for track block 2, whether or not the DeltaT_2 value, which indicates the relative time from the occurrence time of the previous event, matches the waiting time DeltaTime_2[AutoIndex_2] of the performance data set to be executed next, indicated by the AutoIndex_2 value (step S1208).

If the determination at step S1208 is no, the CPU201 increments by +1 the DeltaT_2 value, which indicates the relative time from the occurrence time of the previous event for track block 2, and advances the time by 1 TickTime unit corresponding to the current interrupt (step S1209). After that, the CPU201 proceeds to the bend processing of step S1211.

If the determination in step S1208 is yes, the CPU201 determines whether or not the value of the variable Bansou on the RAM203, which instructs accompaniment reproduction, is 1 (with accompaniment) (step S1210) (see steps S924 to S926 of fig. 9(c)).

If the determination at step S1210 is yes, the CPU201 executes the event EVENT_2[AutoIndex_2] relating to the accompaniment of track block 2, indicated by the AutoIndex_2 value (step S1211). If the EVENT_2[AutoIndex_2] executed here is, for example, a note-on event, a sounding command for an accompaniment musical tone is issued to the sound source LSI204 of fig. 2 with the key number and velocity specified by that note-on event. If, on the other hand, the EVENT_2[AutoIndex_2] is, for example, a note-off event, a muting command for the accompaniment musical tone currently sounding is issued to the sound source LSI204 of fig. 2 with the key number and velocity specified by that note-off event.

On the other hand, if the determination at step S1210 is no, the CPU201 skips step S1211 and does not execute the current accompaniment-related EVENT_2[AutoIndex_2]; it proceeds to the processing of the next step S1212 and executes only the control processing that advances the event, so that the accompaniment stays in step with the progression of the lyrics.

After step S1211, or when the determination at S1210 is no, the CPU201 increments by +1 the AutoIndex_2 value used to refer to the performance data sets for accompaniment data on track block 2 (step S1212).

Further, the CPU201 resets the DeltaT _2 value indicating the relative time from the occurrence time of the event executed this time to 0 for the track block 2 (step S1213).

Then, the CPU201 determines whether or not the waiting time DeltaTime_2[AutoIndex_2] of the performance data set to be executed next on track block 2, indicated by the AutoIndex_2 value, is 0, that is, whether or not that event is to be executed simultaneously with the current event (step S1214).

If the determination at step S1214 is no, the CPU201 proceeds to the bend processing of step S1211.

If the determination at step S1214 is yes, the CPU201 returns to step S1210 and repeats the control processing for the event EVENT_2[AutoIndex_2] of the performance data set to be executed next on track block 2, indicated by the AutoIndex_2 value. The CPU201 repeats the processing of steps S1210 to S1214 as many times as there are events to be executed simultaneously this time. This processing sequence is executed when a plurality of note-on events are to sound at the same timing, such as the notes of a chord.
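
The corresponding track block 2 handling, including the loop that picks up events whose waiting time is 0 (for example, the notes of a chord), can be sketched as follows; execute_accomp_event() is a placeholder assumed for the note-on/note-off commands issued to the sound source LSI204.

    #include <stdint.h>

    extern uint32_t DeltaTime_2[];
    extern const void *Event_2[];
    extern uint32_t DeltaT_2, AutoIndex_2, Bansou;
    extern void execute_accomp_event(const void *ev);   /* assumed: note-on/off toward LSI204 */

    void track2_tick(void)
    {
        if (DeltaT_2 != DeltaTime_2[AutoIndex_2]) {      /* step S1208 */
            DeltaT_2++;                                  /* step S1209 */
            return;                                      /* then proceed to the bend processing */
        }
        do {
            if (Bansou == 1)                             /* step S1210: accompaniment enabled? */
                execute_accomp_event(Event_2[AutoIndex_2]);  /* execute the accompaniment event */
            AutoIndex_2++;                               /* step S1212 */
            DeltaT_2 = 0;                                /* step S1213 */
        } while (DeltaTime_2[AutoIndex_2] == 0);         /* step S1214: simultaneous events */
    }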

After the processing of step S1209, or when the determination at step S1214 is no, the CPU201 executes the bend processing (step S1211). Here, the processing corresponding to the bend processing unit 320 of fig. 3, which applies the bend to the speech synthesis unit 302 of fig. 3, is actually executed based on the bend curve set for each bar and each beat within the bar in the bend curve setting table 600 illustrated in fig. 6 by the bend curve setting process of step S1008 of fig. 10. Details of this processing will be described later using the flowchart of fig. 14. After the bend processing, the current automatic performance interruption process shown in the flowchart of fig. 12 ends.

Fig. 13 is a flowchart showing a detailed example of the rap reproduction processing in step S805 in fig. 8.

First, the CPU201 determines whether or not a value other than the Null value has been set in the variable SongIndex on the RAM203 by step S1205 of the automatic performance interruption process of fig. 12 (step S1301). The SongIndex value indicates whether or not the current timing is a reproduction timing of the rap voice.

If the determination in step S1301 is yes, that is, if the current time point is a rap reproduction timing, the CPU201 determines whether or not a new key press by the user on the keyboard 101 of fig. 1 has been detected by the keyboard processing of step S803 of fig. 8 (step S1302).

If the determination of step S1302 is yes, the CPU201 sets the pitch specified by the player through that key as the utterance pitch, in a register not specifically shown or a variable on the RAM203 (step S1303).

Next, the CPU201 reads out the rap lyric character string from the rap event EVENT_1[SongIndex] on track block 1 of the music data on the RAM203, indicated by the variable SongIndex on the RAM203. The CPU201 generates the rap data 215 for uttering the rap voice output data 217 corresponding to the read lyric character string at the utterance pitch set from the key in step S1303, and instructs the speech synthesis LSI205 to perform the utterance processing (step S1305). By performing the statistical speech synthesis process explained in fig. 3, the speech synthesis LSI205 synthesizes and outputs, in real time, the rap voice output data 217 that sings the lyrics specified as music data from the RAM203 at the pitch of the key pressed by the player on the keyboard 101.

On the other hand, when the determination of step S1301 indicates that the current time point is a rap reproduction timing but no new key press is detected at step S1302, the CPU201 reads out the pitch data from the rap event EVENT_1[SongIndex] on track block 1 of the music data on the RAM203, indicated by the variable SongIndex on the RAM203, and sets that pitch as the utterance pitch in a register not specifically shown or a variable on the RAM203 (step S1304).

In the case of rap music, the utterance pitch may or may not be linked to the pitch of the melody in this way.

After that, the CPU201 generates the rap data 215 by executing the processing of step S1305 described above, and instructs the speech synthesis LSI205 to perform the utterance processing (step S1305); this rap data 215 causes the rap voice output data 217 corresponding to the lyric character string read out from EVENT_1[SongIndex] to be uttered at the utterance pitch set in step S1304. Even if the player does not press any key on the keyboard 101, the speech synthesis LSI205, by performing the statistical speech synthesis process illustrated in fig. 3, synthesizes and outputs the rap voice output data 217 that sings the lyrics specified as music data from the RAM203 at the pitch likewise specified by default in the music data.

After the processing of step S1305, the CPU201 stores the rap position just reproduced, indicated by the variable SongIndex on the RAM203, in the variable SongIndex_pre on the RAM203 (step S1306).

Then, the CPU201 clears the value of the variable SongIndex to the Null value, so that subsequent timings are treated as being outside a rap reproduction timing (step S1307). After that, the CPU201 ends the rap reproduction process of step S805 in fig. 8 illustrated in the flowchart of fig. 13.
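
The reproduction-timing branch of fig. 13 (steps S1301 to S1307) amounts to the following sketch; key_pressed(), key_pitch(), default_pitch(), lyric_of() and synthesize() are hypothetical helpers standing in for the keyboard scan result and for handing the rap data 215 to the speech synthesis LSI205, and the branch for key presses outside a reproduction timing (steps S1308 and S1309) is omitted.

    #include <stdint.h>

    #define NULL_INDEX 0xFFFFFFFFu             /* assumed encoding of the Null value */
    extern uint32_t SongIndex, SongIndex_pre;
    extern int  key_pressed(void);             /* assumed: new key detected by step S803? */
    extern int  key_pitch(void);               /* assumed: pitch of that key */
    extern int  default_pitch(uint32_t idx);   /* assumed: pitch stored in EVENT_1[idx] */
    extern const char *lyric_of(uint32_t idx); /* assumed: lyric string of EVENT_1[idx] */
    extern void synthesize(const char *lyric, int pitch);   /* assumed: builds rap data 215 */

    void rap_playback_at_timing(void)
    {
        if (SongIndex == NULL_INDEX)
            return;                                     /* step S1301: not a reproduction timing */
        int pitch = key_pressed() ? key_pitch()         /* step S1303: follow the player's key */
                                  : default_pitch(SongIndex);   /* step S1304: use the music data */
        synthesize(lyric_of(SongIndex), pitch);         /* step S1305 */
        SongIndex_pre = SongIndex;                      /* step S1306 */
        SongIndex = NULL_INDEX;                         /* step S1307 */
    }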

If the determination of step S1301 is no, that is, if the current time point is not a rap reproduction timing, the CPU201 determines whether or not a new key press by the player on the keyboard 101 of fig. 1 has been detected by the keyboard processing of step S803 of fig. 8 (step S1308).

When the determination of step S1308 is no, the CPU201 directly ends the rap playback processing of step S805 in fig. 8 illustrated in the flowchart of fig. 13.

When the determination of step S1308 is yes, the CPU201 generates rap data 215 and outputs it to the speech synthesis LSI205; this rap data 215 instructs the speech synthesis LSI205, for the rap voice output data 217 currently being uttered that corresponds to the lyric character string of the rap event EVENT_1[SongIndex_pre] on track block 1 of the music data on the RAM203 indicated by the variable SongIndex_pre, to change the pitch to the pitch based on the player's key detected in step S1308 (step S1309). At this time, in the rap data 215, if the lyric currently being uttered is, for example, the character string "ki", the frame starting from the latter half "/i/" of its phoneme string "/k/" "/i/" is set as the start position of the pitch change. By performing the statistical speech synthesis process described with reference to fig. 3, the speech synthesis LSI205 synthesizes and outputs the rap voice output data 217 in which the pitch of the rap voice currently being uttered is changed in real time to the pitch of the key pressed by the player on the keyboard 101.

By the above processing of step S1309, the utterance pitch of the rap voice output data 217 that began sounding at the original timing immediately before the current key timing is changed to the pitch played by the player, and the utterance continues from the current key timing.

After the process of step S1309, the CPU201 ends the rap reproduction process of step S805 of fig. 8 illustrated in the flowchart of fig. 13.

Fig. 14 is a flowchart showing a detailed processing example of the bend processing of step S1211 in the automatic performance interruption processing of fig. 12. First, the CPU201 increments the value of the variable DividingTime in the RAM203 by 1 (step S1401).

After that, the CPU201 determines whether or not the value of the variable DividingTime matches the value D calculated by the above equation (2) (step S1402). If the determination at step S1402 is no, the CPU201 directly ends the bend processing of step S1211 of fig. 12 illustrated in the flowchart of fig. 14. D indicates how many TickTime units are grouped together: the automatic performance interruption process of fig. 12 is executed every 1 TickTime, whereas the substantive part of the bend processing of fig. 14 called from it is executed only every D TickTime. For example, if D is set to 10, the bend processing is performed every 10 TickTime. Since the value of the variable DividingTime is initially set to D-1 in step S921 of the rap start processing of fig. 9(c), the determination in step S1402 is necessarily yes after the processing of step S1401 in the first automatic performance interruption process executed when the automatic performance starts.
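
The division of the interrupt rate by D can be sketched as follows, under the assumption that the body of the bend processing is executed only when this helper returns 1.

    #include <stdint.h>

    extern uint32_t DividingTime, D;

    int bend_tick_due(void)
    {
        DividingTime++;                /* step S1401 */
        if (DividingTime != D)         /* step S1402 */
            return 0;                  /* skip the rest of the bend processing this TickTime */
        DividingTime = 0;              /* step S1403 */
        return 1;                      /* run the bend processing body (once every D TickTime) */
    }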

If the determination in step S1402 is yes, the CPU201 resets the value of the variable DividingTime to 0 (step S1403).

Next, the CPU201 determines whether or not the value of the variable BendAdressOffset on the RAM203 matches the final address R-1 within one bend curve (step S1404). Here, it is determined whether or not the bend processing for one beat has finished. Since the value of the variable BendAdressOffset is initially set to R-1 in step S921 of the above-described rap start processing of fig. 9(c), the determination in step S1404 is necessarily yes in the first automatic performance interruption process executed when the automatic performance starts.

If the determination in step S1404 is yes, the CPU201 resets the value of the variable BendAdressOffset to 0, which indicates the head of a bend curve (see fig. 7) (step S1405).

After that, the CPU201 calculates the current bar number and beat number from the value of the variable ElapseTime (step S1406). In 4/4 time, the number of TickTime units per beat is given by the time division value, so the current bar number and beat number can be calculated by dividing the variable ElapseTime by the time division value and then dividing the result by 4 (the number of beats per bar).
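
A minimal sketch of this calculation at 4/4 time is shown below; TimeDivision is an assumed name for the time division value (the number of TickTime units per beat).

    #include <stdint.h>

    extern uint32_t ElapseTime, TimeDivision;

    void current_position(uint32_t *bar, uint32_t *beat)   /* corresponds to step S1406 */
    {
        uint32_t beats = ElapseTime / TimeDivision;   /* beats elapsed since the performance start */
        *bar  = beats / 4;                            /* 4 beats per bar at 4/4 time */
        *beat = beats % 4;
    }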

Next, the CPU201 acquires the bend curve number corresponding to the bar number and beat number calculated in step S1406 from the bend curve setting table 600 illustrated in fig. 6, and sets that value in the variable CurveNum on the RAM203 (step S1407).

On the other hand, if the value of the variable BendAdressOffset on the RAM203 has not reached the final address R-1 within one bend curve and the determination in step S1404 is no, the CPU201 increments by 1 the value of the variable BendAdressOffset, which indicates the offset address within the bend curve (step S1409).

Next, the CPU201 determines whether or not a bend curve number has been set in the variable CurveNum by the execution of step S1407 in the current or a previous automatic performance interruption process (step S1408).

If the determination of step S1408 is yes, the CPU201 acquires a pitch bend value (see fig. 7) from the address of the bend curve table 700 obtained by adding the offset value held in the variable BendAdressOffset to the head address, in the ROM202, of the bend curve data corresponding to the bend curve number held in the variable CurveNum (step S1410).

Finally, as in the case described in step S1309 of fig. 13, the CPU201 generates rap data 215 and outputs it to the speech synthesis LSI205; this rap data 215 instructs the speech synthesis LSI205, for the rap voice output data 217 currently being uttered that corresponds to the lyric character string of the rap event EVENT_1[SongIndex_pre] on track block 1 of the music data on the RAM203 indicated by the variable SongIndex_pre, to change the pitch to the pitch calculated from the bend value acquired in step S1410. The CPU201 then ends the bend processing of step S1211 of fig. 12 illustrated in the flowchart of fig. 14.
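
The table lookup of step S1410 and the subsequent pitch-change instruction can be sketched as follows; BendCurve[] is an assumed array of pointers to the R-sample curves of the bend curve table 700, and apply_pitch_bend() stands in for generating the rap data 215 toward the speech synthesis LSI205.

    #include <stdint.h>

    extern const int16_t *BendCurve[];        /* assumed: one pointer per bend curve number */
    extern uint32_t CurveNum, BendAdressOffset, SongIndex_pre;
    extern void apply_pitch_bend(uint32_t rap_index, int16_t bend_value);   /* assumed */

    void bend_apply(void)
    {
        int16_t bend = BendCurve[CurveNum][BendAdressOffset];   /* step S1410 */
        apply_pitch_bend(SongIndex_pre, bend);   /* re-pitch the rap voice currently sounding */
    }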

If no bend curve number has been set in the variable CurveNum and the determination in step S1408 is no, the user has disabled the bend curve setting for that beat, so the CPU201 directly ends the bend processing of step S1211 of fig. 12 illustrated in the flowchart of fig. 14.

As described above, in the present embodiment, bend processing corresponding to the bend curve that the user specified for each beat, either in real time or in advance, can be executed beat by beat.

In addition to the above-described embodiment, when different bend curves are designated across the boundary between one beat and the next, the bend processing unit 320 of fig. 3 can perform processing that carries over the last pitch of the preceding beat or temporally interpolates between the two pitches, so that the last pitch of the preceding beat as modified by its bend curve does not become discontinuous with the first pitch of the current beat. This makes it possible to reproduce a rap voice with good sound quality while suppressing the occurrence of unnatural sounds.

In the above-described embodiment, the user sets a bend curve for each beat in, for example, 16 consecutive beats (4 bars in the case of 4/4 time), but a user interface may be adopted in which a combination of bend curves for the 16 beats is designated at once. This makes it easy to designate and directly imitate the vocal expression of a well-known rap singer.

Further, it is possible to provide an emphasis unit that emphasizes the intonation by changing the bend curve in real time, or over a predetermined number of consecutive beats (for example, 4 beats) such as those at the beginning of each bar. This enables a more colorful rap presentation.

In the above-described embodiment, the bend processing is applied to the pitch of the rap voice as a pitch bend, but bend processing may also be applied to attributes of the sound other than pitch, such as its intensity or timbre. This enables a more colorful rap presentation.

In the above-described embodiment, the intonation pattern is specified for the rap voice, but an intonation pattern may also be specified for music information of instrument sounds other than the rap voice.

In the first embodiment of the statistical speech synthesis process using the HMM acoustic model described with reference to fig. 3 and fig. 4, subtle musical expressions such as those of a specific singer or singing style can be reproduced, and smooth voice quality free of concatenation distortion can be realized. Moreover, by transforming the learning result 315 (the model parameters), it is possible to adapt to other singers and to express various voices and emotions. Further, all the model parameters of the HMM acoustic model can be machine-learned from the learning rap data 311 and the learning rap speech data 312, so that the characteristics of a specific singer are acquired as an HMM acoustic model and a synthesis system expressing those characteristics can be constructed automatically at the time of synthesis. The fundamental frequency and duration of the voice basically follow the melody of the score and the tempo of the piece, and the temporal structure of pitch and rhythm can be determined uniquely from the score; an actual rap voice, however, is not rendered mechanically as in the score, and each singer has a style of his or her own in voice quality, pitch, and their temporal change. In the first embodiment of the statistical speech synthesis process using the HMM acoustic model, the time series of spectral information and pitch information in the rap voice can be modeled depending on the context, and by also taking the score information into account, a voice closer to an actual rap voice can be reproduced. The HMM acoustic model used in the first embodiment of the statistical speech synthesis process corresponds to a generative model of how the acoustic feature sequence of the voice, determined by the singer's vocal cord vibration and vocal tract characteristics, changes over time while lyrics are uttered along a certain melody. Furthermore, in the first embodiment of the statistical speech synthesis process, by using an HMM acoustic model that includes the context of the "deviation" between the notes and the voice, a synthesis of the rap voice is realized that can accurately reproduce singing styles that tend to change in complex ways depending on the vocal characteristics of the singer. By combining the technique of the first embodiment of the statistical speech synthesis process using such an HMM acoustic model with, for example, the real-time performance technique of the electronic keyboard instrument 100, the singing style and voice quality of the singer serving as the model, which could not be realized with a conventional electronic musical instrument based on the segment synthesis method or the like, can be reflected accurately, and a rap voice performance can be realized as if that rap singer were actually rapping in accordance with the keyboard performance or the like of the electronic keyboard instrument 100.

In the second embodiment of the statistical speech synthesis process using the DNN acoustic model described with reference to fig. 3 and fig. 5, the decision-tree-based context-dependent HMM acoustic model of the first embodiment is replaced by a DNN as the expression of the relationship between the linguistic feature sequence and the acoustic feature sequence. This makes it possible to express the relationship between the linguistic feature sequence and the acoustic feature sequence by a complex nonlinear transformation function that is difficult to express with a decision tree. Furthermore, in the decision-tree-based context-dependent HMM acoustic model, the learning data is also partitioned according to the decision tree, so the learning data assigned to each context-dependent HMM acoustic model is reduced. In the DNN acoustic model, in contrast, a single DNN is learned from all the learning data, so the learning data can be used efficiently. Therefore, the DNN acoustic model can predict the acoustic feature amounts with higher accuracy than the HMM acoustic model and can considerably improve the naturalness of the synthesized voice. In addition, the DNN acoustic model can use frame-level linguistic feature sequences. That is, because the temporal correspondence between the acoustic feature sequence and the linguistic feature sequence is determined in advance in the DNN acoustic model, frame-level linguistic feature amounts that are difficult to take into account in the HMM acoustic model, such as "the number of continuing frames of the current phoneme" or "the position of the current frame within the phoneme", can be used. By using such frame-level linguistic feature amounts, more detailed characteristics can be modeled, and the naturalness of the synthesized voice can be improved. By combining the technique of the second embodiment of the statistical speech synthesis process using such a DNN acoustic model with, for example, the real-time performance technique of the electronic keyboard instrument 100, the rap performance based on the keyboard performance or the like can approach the singing style and voice quality of the model rap singer even more naturally.

In the above-described embodiment, by adopting the technique of statistical speech synthesis processing as the speech synthesis method, a memory capacity far smaller than that of the conventional segment synthesis method can be realized. For example, while an electronic musical instrument using the segment synthesis method requires a memory with a storage capacity of several hundred megabytes for the speech segment data, a memory with a storage capacity of only a few megabytes suffices in the present embodiment to store the model parameters of the learning result 315 shown in fig. 3. Therefore, a lower-priced electronic musical instrument can be realized, and a high-quality rap performance system can be made available to a wider range of users.

Furthermore, in the conventional segment data method, the segment data must be adjusted by hand, so creating the data for a rap performance requires an enormous amount of time (on the order of years) and labor, whereas generating the model parameters of the learning result 315 for the HMM acoustic model or the DNN acoustic model in the present embodiment requires almost no data adjustment, and therefore only a fraction of that time and labor. This also contributes to a lower-priced electronic musical instrument. In addition, a general user can train the models with his or her own voice, a family member's voice, the voice of a famous person, or the like using the learning function built into the server computer 300 or the speech synthesis LSI205 available as a cloud service, and have the electronic musical instrument perform rap with that voice as the model. In this case as well, a rap performance that is far more natural and of higher quality than before can be realized with a lower-priced electronic musical instrument.

In the embodiments described above, the present invention was implemented for an electronic keyboard musical instrument, but the present invention can also be applied to other electronic musical instruments such as an electronic stringed musical instrument.

The speech synthesis method that can be used by the speech model unit 308 in fig. 3 is not limited to the cepstrum speech synthesis method, and various speech synthesis methods including the LSP speech synthesis method can be used.

In the above-described embodiment, the speech synthesis methods of the first embodiment of the statistical speech synthesis process using the HMM acoustic model and of the second embodiment using the DNN acoustic model have been described, but the present invention is not limited to these; any speech synthesis method, such as an acoustic model combining HMM and DNN, may be adopted as long as the technique of statistical speech synthesis processing is used.

In the above-described embodiment, the lyric information to be sung is provided as music data, but text data obtained by speech-recognizing, in real time, the content sung by the player may instead be provided in real time as the lyric information to be sung.
