Video processing method, device, equipment and medium


Reading note: This technology, "Video processing method, device, equipment and medium" (一种视频处理方法、装置、设备及介质), was designed and created by 吴益灵, 李明磊 and 李太松 on 2020-05-14. Its main content is as follows: the application provides a video processing method applied in the field of artificial intelligence, comprising: acquiring video data, where the video data includes image data and audio data; performing speech recognition on the audio data to obtain text data corresponding to the audio data and a time stamp, in the audio data, of the speech corresponding to at least one word in the text data; and segmenting the video data according to that time stamp to obtain at least one segment of data. Automatic segmentation of video is thereby achieved and segmentation efficiency is improved; moreover, because the video content is taken into account, segmentation precision is high. Annotation efficiency and annotation precision can therefore be improved, and annotation cost reduced.

1. A method of video processing, the method comprising:

acquiring video data, wherein the video data comprises image data and audio data;

performing voice recognition on the audio data to obtain text data corresponding to the audio data and a time stamp of voice corresponding to at least one word in the text data in the audio data;

and segmenting the video data according to the time stamp of the voice corresponding to at least one word in the text data in the audio data to obtain at least one segment of data.

2. The method of claim 1, wherein the segmenting the video data according to the time stamp, in the audio data, of the voice corresponding to the at least one word in the text data comprises:

determining sentence break positions of sentences in the text data;

and segmenting the video data according to the time stamp, in the audio data, of the voice corresponding to the word at the sentence-break position.

3. The method according to claim 1 or 2, characterized in that the method further comprises:

receiving, from a user through a graphical user interface (GUI), correction information on the at least one piece of data, wherein the correction information is used to modify the start and end times of the at least one piece of data.

4. The method according to any one of claims 1 to 3, further comprising:

and receiving labeling information of a user for the at least one piece of data, wherein the labeling information comprises any one or more of position information, attribute information, action information, and emotion information of a target in the image data.

5. The method of claim 4, wherein the receiving the labeling information of the user for the at least one piece of data comprises:

receiving the labeling information of the key frame in the at least one section of data from the user;

and determining the labeling information of the non-key frames in the at least one section of data according to the labeling information of the key frames.

6. The method according to any one of claims 1 to 5, wherein the segmenting the video data according to the time stamp, in the audio data, of the voice corresponding to at least one word in the text data comprises:

segmenting the image data according to the time stamp, in the audio data, of the voice corresponding to the at least one word in the text data; or

segmenting the image data and the audio data according to the time stamp, in the audio data, of the voice corresponding to the at least one word in the text data.

7. A video processing apparatus, characterized in that the apparatus comprises:

the communication module is used for acquiring video data, and the video data comprises image data and audio data;

the recognition module is used for carrying out voice recognition on the audio data to obtain text data corresponding to the audio data and a time stamp of voice corresponding to at least one word in the text data in the audio data;

and the segmentation module is used for segmenting the video data according to the time stamp, in the audio data, of the voice corresponding to at least one word in the text data, to obtain at least one section of data.

8. The apparatus of claim 7, wherein the segmentation module is specifically configured to:

determining sentence break positions of sentences in the text data;

and segmenting the video data according to the time stamp, in the audio data, of the voice corresponding to the word at the sentence-break position.

9. The apparatus of claim 7 or 8, wherein the communication module is further configured to:

receiving, from a user through a graphical user interface (GUI), correction information on the at least one piece of data, wherein the correction information is used to modify the start and end times of the at least one piece of data.

10. The apparatus of any of claims 7 to 9, wherein the communication module is further configured to:

and receiving labeling information of a user for the at least one piece of data, wherein the labeling information comprises any one or more of position information, attribute information, action information, and emotion information of a target in the image data.

11. The apparatus of claim 10, wherein the communication module is specifically configured to:

receiving the labeling information of the key frame in the at least one section of data from the user;

the device further comprises:

and the marking module is used for determining the marking information of the non-key frames in the at least one section of data according to the marking information of the key frames.

12. The apparatus according to any one of claims 7 to 11, wherein the segmentation module is specifically configured to:

segmenting the image data according to the time stamp, in the audio data, of the voice corresponding to the at least one word in the text data; or

segmenting the image data and the audio data according to the time stamp, in the audio data, of the voice corresponding to the at least one word in the text data.

13. A device, comprising a processor and a memory;

the processor is configured to execute the instructions stored in the memory to cause the device to perform the method of any one of claims 1 to 6.

14. A computer-readable storage medium comprising instructions that direct a device to perform the method of any of claims 1-6.

Technical Field

The present application relates to the field of Artificial Intelligence (AI), and in particular, to a video processing method, apparatus, device, and computer readable storage medium.

Background

With the rise of artificial intelligence, computer vision tasks such as video-based object detection, object tracking, and motion classification have made great progress. Specifically, tasks such as video-based object detection, object tracking, or motion classification can be achieved by training corresponding models through supervised learning. Supervised learning requires a large amount of labeled data, such as video data labeled with information on the position of an object, the category of the object, and the like.

Currently, the mainstream annotation method in the industry is for an annotator to watch the whole video, manually select the start and end time of each video segment, and then annotate each segment. This requires a lot of labor and time and increases the annotation cost.

Accordingly, there is a need to provide an efficient video processing method to reduce the labor and time required for annotating videos and to reduce the annotation cost.

Disclosure of Invention

The application provides a video processing method, which addresses the problem in the related art that a large amount of labor and time must be spent on annotation, improves annotation efficiency, and reduces annotation cost. The application also provides an apparatus, a device, a computer-readable storage medium, and a computer program product corresponding to the method.

In a first aspect, the present application provides a video processing method. The method is performed by a video processing system. The video processing system may be deployed in a cloud environment, an edge environment, or an end device. The video processing system comprises a plurality of parts, so it may also be deployed in a distributed manner across different environments. For example, parts of the video processing system may be deployed separately on three environments, namely a cloud environment, an edge environment, and an end device, or on any two of them.

Specifically, the video processing system obtains video data, where the video data includes image data and audio data. The video processing system then performs speech recognition on the audio data to obtain text data corresponding to the audio data and a time stamp, in the audio data, of the speech corresponding to at least one word in the text data. The video processing system may then segment the video data according to that time stamp to obtain at least one piece of data.
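The core segmentation step can be sketched in a few lines of Python. The sketch below is only an illustration of the described flow under an assumed data layout (a list of words with their start and end times from speech recognition); the function and variable names are hypothetical, not part of the application.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (word, start time in s, end time in s) from speech recognition


def segment_by_word_timestamps(words: List[Word], cut_indices: List[int],
                               video_duration: float) -> List[Tuple[float, float]]:
    """Return (start, end) times of segments, cutting after each word listed in cut_indices."""
    segments, start = [], 0.0
    for i in cut_indices:
        end = words[i][2]          # cut at the end time of the chosen word
        segments.append((start, end))
        start = end
    if start < video_duration:     # trailing segment after the last cut word
        segments.append((start, video_duration))
    return segments


# Example: cut after "sunrise" (index 3) in a 10 s video.
words = [("i", 0.5, 0.7), ("like", 0.8, 1.1), ("see", 1.2, 1.5), ("sunrise", 1.6, 2.2)]
print(segment_by_word_timestamps(words, [3], 10.0))   # [(0.0, 2.2), (2.2, 10.0)]
```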

On the one hand, the method simplifies user operation, since the user does not need to manually segment the video data; on the other hand, it takes the video content into account and therefore achieves higher segmentation accuracy. Annotating the at least one piece of data obtained by this method can improve annotation efficiency and precision and reduce annotation cost.

In some possible implementations, considering that the text is composed of sentences, the video processing system may also use the sentence break positions of the sentences in the text data when segmenting the video data. The video processing system may determine a sentence break position of a sentence in the text data and then segment the video data according to the time stamp, in the audio data, of the speech corresponding to a word at the sentence break position (for example, the word before the sentence break position). Segmenting the video data based on the words at the sentence-break positions is better suited to real scenes and yields higher segmentation precision.

In some possible implementations, the user may also modify the start-stop time of at least one piece of data through a Graphical User Interface (GUI). Correspondingly, the video processing system can also receive correction information of the at least one piece of data from the user through the GUI, and the correction information is used for modifying the starting time and the ending time of the at least one piece of data. Thus, the video segmentation precision can be further improved.

In some possible implementations, the user may also label at least one piece of data obtained by the segmentation. Correspondingly, the video processing system can receive the annotation information of the user on the at least one piece of data, so as to realize video annotation. The labeled data can be used for model training or model testing, and the model trained by the labeled data can be used for scenes such as target detection, target recognition and the like.

The annotation information comprises any one or more of position information, attribute information, action information and emotion information of the target in the image data. The position information of the object refers to the position information of the object in the video picture. The attribute information of the object is information describing the nature and/or characteristics of the object. Specifically, the attribute information of the target may include any one or more of category information and identity information of the target. The action information of the target refers to an action performed by the target. The emotional information of the target refers to the emotional information expressed by the target.

Because the user is not required to manually segment the whole video, user operation is simplified, the annotation workload is reduced, and annotation efficiency is improved. Moreover, the method takes the video content into consideration when segmenting the video, so the at least one segment of data obtained has higher precision, which in turn improves annotation precision.

In some possible implementations, to further simplify the user operation, the video processing system may receive annotation information of a key frame in the at least one piece of data from a user, and then determine annotation information of a non-key frame in the at least one piece of data according to the annotation information of the key frame. Thus, the labeling efficiency can be further improved.

In some possible implementations, in consideration of requirements of different scenes, the video processing system may segment image data in the video data according to a time stamp of voice corresponding to at least one word in the text data in the audio data, or segment both the image data and the audio data in the video data.

For example, in the position-labeling scene, the video processing system may segment the image data according to the time stamp to obtain at least one segment of image data, and each segment of image data includes consecutive image frames. For another example, in an emotion markup scene or other multi-modal markup scenes, the video processing system may segment both the image data and the audio data according to the timestamp to obtain at least one piece of video data.

In some possible implementations, before performing speech recognition on the audio data, the video processing system may further filter out non-speech segments in the audio data, so that the user does not have to listen to irrelevant segments during annotation, which saves the user's time and effort and improves annotation efficiency.

In a second aspect, the present application provides a video processing apparatus. The device comprises:

the communication module is used for acquiring video data, and the video data comprises image data and audio data;

the recognition module is used for carrying out voice recognition on the audio data to obtain text data corresponding to the audio data and a time stamp of voice corresponding to at least one word in the text data in the audio data;

and the segmentation module is used for segmenting the video data according to the time stamp, in the audio data, of the voice corresponding to at least one word in the text data, to obtain at least one section of data.

In some possible implementations, the segmentation module is specifically configured to:

determining sentence break positions of sentences in the text data;

and segmenting the video data according to the time stamp, in the audio data, of the voice corresponding to the word at the sentence-break position.

In some possible implementations, the communication module is further configured to:

receiving, from a user through a graphical user interface (GUI), correction information on the at least one piece of data, wherein the correction information is used to modify the start and end times of the at least one piece of data.

In some possible implementations, the communication module is further configured to:

and receiving labeling information of a user for the at least one piece of data, wherein the labeling information comprises any one or more of position information, attribute information, action information, and emotion information of a target in the image data.

In some possible implementations, the communication module is specifically configured to:

receiving the labeling information of the key frame in the at least one section of data from the user;

the device further comprises:

and the marking module is used for determining the marking information of the non-key frames in the at least one section of data according to the marking information of the key frames.

In some possible implementations, the apparatus further includes:

and the filtering module is used for filtering the voice-free segments in the audio data before carrying out voice recognition on the audio data.

In some possible implementations, the segmentation module is specifically configured to:

segmenting the image data according to the time stamp, in the audio data, of the voice corresponding to the at least one word in the text data; or

segmenting the image data and the audio data according to the time stamp, in the audio data, of the voice corresponding to the at least one word in the text data.

In a third aspect, the present application provides an apparatus comprising a processor and a memory. The processor and the memory are in communication with each other. The processor is configured to execute the instructions stored in the memory to cause the apparatus to perform the video processing method as in the first aspect or any implementation manner of the first aspect.

In a fourth aspect, the present application provides a computer-readable storage medium having instructions stored therein, where the instructions instruct a device to execute the video processing method according to the first aspect or any implementation manner of the first aspect.

In a fifth aspect, the present application provides a computer program product comprising instructions which, when run on a device, cause the device to perform the video processing method of the first aspect or any of the implementations of the first aspect.

On the basis of the implementations provided by the above aspects, the present application may further combine them to provide more implementations.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the embodiments are briefly described below.

Fig. 1 is an architecture diagram of a video processing system according to an embodiment of the present application;

fig. 2 is an architecture diagram of a video processing system according to an embodiment of the present application;

fig. 3 is a flowchart of a video processing method according to an embodiment of the present application;

FIG. 4 is a schematic view of an interface provided by an embodiment of the present application;

FIG. 5 is a schematic view of an interface provided by an embodiment of the present application;

fig. 6 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application;

fig. 7 is a schematic structural diagram of an apparatus according to an embodiment of the present application.

Detailed Description

The terms "first" and "second" in the embodiments of the present application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.

Some technical terms referred to in the embodiments of the present application will be first described.

Computer Vision (CV) refers to the identification, tracking, and/or measurement of objects using a camera and a computer instead of the human eye. The target is a specific object or a part of the specific object in the video. In some implementations, the target can be an animal, such as a human, cat, dog, or a part of such an animal, such as a human face. In other implementations, the target may also be a vehicle or license plate, etc. In different application scenarios, the objects that need to be identified, tracked and/or measured are not the same.

For computer vision tasks such as object detection, object tracking, action classification, and emotion recognition, better results can be obtained by processing a video comprising multiple frames of data. The video may be stored as digital video data in a storage medium such as a magnetic disk, a solid state drive (SSD), or flash memory, and played by a video player.

The video data includes at least audio data. In some implementations, the video data also includes image data. It should be noted that when the video data includes audio data and image data, the image data and the audio data may be generated by mixing, for example, by simultaneously capturing an image signal and an audio signal by a video capture device such as a video camera. Of course, the video data may also be generated by post-mixing, for example, capturing an image signal by a camera, capturing an audio signal by a separate audio capturing device such as a microphone, and then mixing the image signal captured by the camera and the audio signal captured by the audio capturing device to generate a video.

Video-based object detection, object tracking, motion classification, or emotion recognition may be implemented through supervised learning. Supervised learning means that model parameters are adjusted using a set of data with known output values so that the model achieves the required performance. When object detection, object tracking, motion classification, or emotion recognition is performed based on video, the data with known output values is video data annotated with any one or more of position information, attribute information, motion information, and emotion information (hereinafter simply referred to as labeled data). The attribute information may be category information, such as cat, dog, grass, or football. The attribute information may also be identity information, such as the names of people, for example Zhang San or Li Si.

Currently, labeled data is mainly obtained through manual annotation. Specifically, the annotator views the entire video, manually selects the start and stop times of each video segment, and then annotates each segment. This requires a lot of labor and time and increases the annotation cost. In some implementations, the video can also be divided into a plurality of segments simply by time, but this simple division does not consider the content of the video itself, so the division accuracy is low and the annotator still needs to re-divide each segment. The operation is therefore complex, annotation efficiency is reduced, and annotation cost increases.

In view of this, the present application provides a video processing method. The video processing method can be implemented by a video processing system. Specifically, the video processing system acquires video data, performs speech recognition on the audio data in the video data to obtain text data and a time stamp, in the audio data, of the speech corresponding to at least one word in the text data, and then segments the video data according to that time stamp to obtain at least one piece of data. On the one hand, this simplifies user operation, since the user does not need to manually segment the video data; on the other hand, the video content is taken into account, so the segmentation accuracy is high. Annotating the at least one piece of data obtained by this method can improve annotation efficiency and precision and reduce annotation cost.

In some implementations, the video processing system can be used not only as a system for assisting model training or testing, but also in other practical scenarios, such as scenarios for education, entertainment, and the like. For example, in an application such as child education, the video processing system may perform speech recognition on audio data in a video, obtain time stamps of text data and speech corresponding to at least one word in the text data in the audio data, segment the video data according to the time stamp of the speech corresponding to the at least one word in the text data in the audio data to obtain at least one segment of data, and label the at least one segment of data. When the video is played, if a user (for example, a child or a guardian) triggers an operation of displaying the annotation information, the annotation information can be displayed through a Graphical User Interface (GUI), so that the child can be helped to recognize the object or the part of the object in the physical world.

For convenience of description, the following is illustrated in a scenario in which a video processing method is applied to model training or model testing.

As shown in fig. 1, the video processing system may be deployed in a cloud environment, specifically on one or more computing devices (e.g., a central server) in the cloud environment. The video processing system may also be deployed in an edge environment, specifically on one or more computing devices (edge computing devices) in the edge environment, which may be servers, computing boxes, and the like. The cloud environment refers to a central cluster of computing devices owned by a cloud service provider for providing computing, storage, and communication resources; the edge environment refers to a cluster of edge computing devices geographically close to end devices (i.e., end-side devices) for providing computing, storage, and communication resources.

In some implementations, the video processing system may also be deployed on an end device. The end device includes, but is not limited to, a desktop computer, a notebook computer, a smart phone, and other user terminals. Video segmentation can be realized by running a video processing system on the user terminals, and video annotation can be realized based on at least one piece of data obtained by segmentation. The end device may also act as a video providing device for providing video data to a video processing system for video segmentation. When the end device is used only for providing video data, the end device may also be a camera, a radar, an infrared camera, or the like.

When the video processing system is deployed in a cloud environment or an edge environment, the video processing system can be provided for users to use in the form of a service. Specifically, a user can access a cloud environment or an edge environment through a browser, create an instance of the video processing system in the cloud environment or the edge environment, and then interact with the instance of the video processing system through the browser to achieve video segmentation. And, the video annotation can be further carried out based on at least one piece of data obtained by the segmentation.

The video processing system can also be deployed on a terminal device and provided for users in the form of clients. Specifically, the end device obtains an installation package of the video processing system, and the installation package is operated, so that the client of the video processing system is installed in the end device. And the end equipment realizes video segmentation by operating the client. And, the video annotation can be further carried out based on at least one piece of data obtained by the segmentation.

As shown in fig. 2, the video processing system includes multiple parts (e.g., multiple subsystems, each including multiple units), and therefore the various parts of the video processing system may also be deployed in a distributed manner across different environments. For example, parts of the video processing system may be deployed separately on three environments, namely a cloud environment, an edge environment, and an end device, or on any two of them.

In order to make the technical solution of the present application clearer and easier to understand, a video processing method is described below from the perspective of a video processing system.

Referring to fig. 3, a flow chart of a video processing method is shown, the method comprising:

s302: the video processing system obtains video data.

The video data includes audio data and image data. The audio data may include voice data. Voice is the sound made by humans through the vibration of the vocal organs (such as the larynx and vocal cords), and voice data is data formed by collecting such sound with a voice collection device (including a camera with an audio collection function). The audio data may also include non-voice data. In the embodiments of the present application, non-voice data refers to audio data that does not contain voice, for example, audio data containing only background sound, or silent audio data containing no sound at all.

In some implementations, the video processing system can acquire video signals (including at least audio signals) in real-time by a video acquisition device, such as a video camera, to obtain video data. The video processing system may also retrieve video data from storage devices, including local storage devices and remote storage devices. For example, the video processing system may obtain video data from a video platform server in an online or offline manner. The online acquisition of the video data refers to the online acquisition of a video stream. Obtaining video data in an offline manner refers to caching or downloading the video data of the entire video.

S304: and the video processing system performs voice recognition on the audio data to obtain text data and a time stamp of voice corresponding to at least one word in the text data.

Specifically, the video processing system extracts audio data from the video data, and then performs speech recognition on speech data in the audio data through an Automatic Speech Recognition (ASR) model, so as to obtain text data corresponding to the speech data and a time stamp of speech corresponding to at least one word in the text data.
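As a concrete illustration of the audio extraction step, the sketch below invokes the ffmpeg command-line tool to pull a mono 16 kHz WAV track out of a video file. The choice of ffmpeg and the parameter values are assumptions for illustration; the application does not prescribe a particular tool.

```python
import subprocess


def extract_audio(video_path: str, wav_path: str, sample_rate: int = 16000) -> None:
    """Extract the audio track of a video as mono 16-bit PCM WAV using ffmpeg.

    Assumes ffmpeg is installed; the tool choice is illustrative, not part of the application.
    """
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vn",                      # drop the image stream, keep audio only
         "-acodec", "pcm_s16le",     # 16-bit PCM
         "-ar", str(sample_rate),    # resample to 16 kHz, a common ASR input rate
         "-ac", "1",                 # mix down to mono
         wav_path],
        check=True,
    )
```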

An ASR model includes an acoustic model (AM) and a language model (LM). The acoustic model takes feature vectors as input and outputs phoneme information. The language model takes phoneme information as input and outputs probability values for individual words or word combinations. The ASR model may first perform feature extraction on the audio data, for example extracting linear prediction cepstral coefficient (LPCC) and/or Mel-frequency cepstral coefficient (MFCC) features, to obtain multi-dimensional feature vectors. Each frame of the audio data can be represented as a waveform, and the ASR model may perform feature extraction on the waveform corresponding to each frame of audio data to obtain a feature vector. The feature vectors are then input into the acoustic model to obtain phoneme information, and probability values for individual words or word combinations are determined through the language model and a dictionary. The text data corresponding to the voice data may be output based on these probability values.
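For the feature-extraction step, a minimal sketch is shown below using the librosa library to compute per-frame MFCC feature vectors; librosa and the frame parameters are assumptions for illustration, since the application does not name a specific library. The resulting matrix would be the feature-vector input to the acoustic model.

```python
import librosa


def extract_mfcc(wav_path: str, n_mfcc: int = 13):
    """Compute per-frame MFCC feature vectors for an audio file (illustrative only)."""
    signal, sr = librosa.load(wav_path, sr=16000)          # load and resample to 16 kHz
    # hop_length=160 gives a 10 ms frame shift at 16 kHz; one column per audio frame.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)
    return mfcc.T                                           # shape: (num_frames, n_mfcc)
```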

For ease of understanding, the following description is made with reference to specific examples.

In one example, the ASR model performs feature extraction on the audio data to obtain a feature vector [13454598 …], and then inputs the feature vector into the acoustic model to obtain the phoneme sequence "woxihuanrichu". The dictionary records correspondences between phonemes and characters, for example two different characters ("I" and "nest") both mapping to the pinyin "wo", two others ("happy" and "west") mapping to "xi", and further characters mapping to "huan", "ri", and "chu", and so on. The language model, in conjunction with the dictionary, outputs probability values for individual words or word combinations based on the input phonemes. For example, the language model may output the probabilities I: 0.0786, like: 0.0546, I like: 0.0898, see: 0.0854, sunrise: 0.1589, see sunrise: 0.6785. Based on these probability values, the ASR model may output the text data "I like to see sunrise".

Further, when the ASR model outputs the text data, it may also obtain the frame numbers of the audio data corresponding to each word (whether formed by a single character or by a combination of characters) and determine the time stamp of the speech corresponding to each word in the text data based on those frame numbers.
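Converting an audio frame number into a time stamp is a simple calculation once the frame shift is fixed. The sketch below assumes the 10 ms frame shift used in the MFCC sketch above; both the shift value and the function are illustrative.

```python
def frame_to_seconds(frame_index: int, hop_length: int = 160, sample_rate: int = 16000) -> float:
    """Map an audio frame index to a time stamp in seconds (10 ms shift at 16 kHz assumed)."""
    return frame_index * hop_length / sample_rate


# A word recognized over frames 150..220 spans roughly 1.5 s to 2.2 s of the audio.
print(frame_to_seconds(150), frame_to_seconds(220))   # 1.5 2.2
```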

In some implementations, in a multi-person conversation scene, the ASR model may also analyze the audio data to perform speech recognition separately on the speech data of different speakers, or to perform speech recognition only on the speech data of a specific speaker.

Considering that non-voice data may exist in the audio data, the video processing system may further filter out non-voice segments in the audio data before performing speech recognition, retaining the voice data. This avoids an annotator having to listen to non-voice data during annotation and shortens annotation time.

Specifically, the video processing system may filter out non-speech data through a voice activity detection (VAD) model, preserving the speech data. A VAD model typically consists of two parts: feature extraction and a speech/non-speech decision. The features used by a VAD model may be energy-based features, frequency-domain features, cepstral features, harmonic features, or long-term features. The decision criterion used by a VAD model may be a machine learning method, a threshold method, or the like.
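As an illustration of the energy-based features and the threshold decision mentioned above, the following is a minimal energy-threshold VAD sketch. It assumes a float waveform normalized to [-1, 1], and the threshold value is an arbitrary assumption; practical VAD models are considerably more elaborate.

```python
import numpy as np


def energy_vad(signal: np.ndarray, frame_len: int = 400, hop: int = 160,
               threshold: float = 1e-4) -> np.ndarray:
    """Return a boolean speech/non-speech decision per frame using short-time energy."""
    decisions = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = float(np.mean(frame ** 2))       # short-time energy of the frame
        decisions.append(energy > threshold)      # simple threshold decision
    return np.array(decisions)


# Frames judged non-speech can then be dropped before speech recognition.
```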

S306: and the video processing system divides the video data according to the voice corresponding to at least one word in the text data at the time stamp of the audio data to obtain at least one section of data.

The words in the text data correspond to the voice in the audio data, and the video processing system can therefore achieve automatic video segmentation according to the time stamps, in the audio data, of the voice corresponding to the words. The video processing system may segment the image data and the audio data in the video data together according to the time stamps to obtain at least one segment of video data, or segment only the image data in the video data according to the time stamps to obtain at least one segment of image data.

For example, in the position-labeled scene, the video processing system may segment the image data in the video data according to the time stamp. For another example, in an emotion labeling scenario, the video processing system may segment both the image data and the audio data in the video data according to the timestamp to determine emotion information based on the image data and the audio data together.

Considering that the text is composed of sentences, the video processing system may also use the sentence break positions of the sentences in the text data when segmenting the video data. Specifically, the video processing system may determine a sentence break position of a sentence in the text data and segment the video data according to the time stamp, in the audio data, of the voice corresponding to the word at the sentence break position, to obtain at least one segment of data.

In a specific implementation, the video processing system may detect the sentence break positions of the text data through a sentence boundary detection (SBD) model, determine the time stamp of the voice corresponding to the word at each sentence break position (for example, the word immediately before the sentence break position), and segment the video data according to that time stamp, thereby obtaining at least one piece of data. For example, the video processing system may divide the video data at the time stamp corresponding to each sentence break position, so that each resulting piece of data corresponds to one sentence.
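Once the (start, end) time stamps of the segments are known, for example from the sentence-break positions, the actual cutting of a video file could be delegated to a tool such as ffmpeg, as sketched below; the tool choice and parameters are assumptions, not part of the application.

```python
import subprocess


def cut_segment(video_path: str, start: float, end: float, out_path: str) -> None:
    """Cut the [start, end] span (in seconds) out of a video file with ffmpeg (illustrative)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-ss", str(start), "-to", str(end),   # segment start and end times in seconds
         "-c", "copy",                         # stream copy: fast, but cuts snap to keyframes
         out_path],
        check=True,
    )


# Example: the sentence whose last word ends at 2.2 s becomes its own clip.
# cut_segment("talk.mp4", 0.0, 2.2, "talk_seg_000.mp4")
```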

The sentence boundary detection model can be obtained by training. Specifically, the video processing system may construct an initial detection model, for example based on Bidirectional Encoder Representations from Transformers (BERT). For each word in the text data, the initial detection model outputs a label indicating whether the word is followed by a sentence break position, for example 1 if it is followed by a sentence break and 0 if it is not.

When training the initial detection model, the video processing system uses a large-scale punctuated corpus. If a word is followed by a period, question mark, or exclamation mark, the position after that word is regarded as a sentence break position, so the sentence break positions of the corpus are known. The video processing system inputs the corpus into the initial detection model, which outputs a label indicating whether each word is followed by a sentence break position. A loss value can then be determined from the predicted label, the expected label (determined from the known sentence break positions), and a loss function (such as a cross-entropy loss function), and the parameters of the initial detection model are updated based on the loss value, thereby training the model. Training can be stopped when the trained model meets a training end condition, for example when the loss value converges or falls below a preset value. The trained model can be used as the sentence boundary detection model for detecting sentence boundaries (i.e., sentence break positions) in the text data.
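The label construction described above, 1 if a word is followed by a sentence break and 0 otherwise, can be illustrated directly on a punctuated corpus. The tokenization and label scheme in the sketch below are simplified assumptions; a real pipeline would use the tokenizer of the chosen BERT model.

```python
import re
from typing import List, Tuple

SENTENCE_END = {".", "?", "!"}


def make_sbd_labels(punctuated_text: str) -> List[Tuple[str, int]]:
    """Turn a punctuated corpus into (word, label) pairs for training a boundary detector."""
    tokens = re.findall(r"\w+|[.?!]", punctuated_text)
    pairs = []
    for i, tok in enumerate(tokens):
        if tok in SENTENCE_END:
            continue                                    # punctuation itself is not a training token
        is_boundary = i + 1 < len(tokens) and tokens[i + 1] in SENTENCE_END
        pairs.append((tok, 1 if is_boundary else 0))    # 1: a sentence break follows this word
    return pairs


print(make_sbd_labels("I like sunrise. Do you?"))
# [('I', 0), ('like', 0), ('sunrise', 1), ('Do', 0), ('you', 1)]
```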

In some implementations, the video processing system can also group a plurality of associated sentences into one piece of data according to the textual context. For example, the video processing system may group the sentences of one set of dialogue into one piece of data, i.e., a piece of data may also correspond to a plurality of sentences. In this way, a piece of data covering a complete exchange between people can be obtained, and richer and more accurate information can be provided based on such a piece of data, thereby meeting actual business requirements.

In consideration of segmentation accuracy, the embodiments of the present application also provide a manual correction mechanism on top of the automatic segmentation. In some implementations, the video processing system can output the at least one piece of data obtained by the segmentation, and a user (e.g., an annotator) can determine whether the at least one piece of data is correct, for example whether it contains a complete sentence; if not, the user can modify the start time and/or end time of the at least one piece of data.

As shown in fig. 4, the interface 400 presents a video playback window 402 for displaying video pictures, and a playback progress bar is also displayed in the window. After the video processing system segments the video data according to the time stamps of the voice corresponding to the words at the sentence-break positions and obtains at least one segment of data, the start time and end time of each segment within the video data can be indicated by sliders 404. As shown in fig. 4, two sliders 404 connected by a double-headed arrow indicate the start time and end time of one segment of data. Note that in fig. 4 the two sliders of one period are not connected by a double-headed arrow, indicating that this period does not include voice data. The user may adjust the start and end times of a segment by dragging the sliders 404.

Correspondingly, the video processing system can receive the correction information of the at least one piece of data from the user through the GUI, where the correction information is used to modify the start and end times of the at least one piece of data. In this way, the user can further improve the segmentation accuracy with only a small number of modification operations.

S308: and the video processing system receives the labeling information of the user on the at least one piece of data.

The annotation information may be any one or more of position information, attribute information, motion information, and emotion information of the target in the image data.

The position information of the target refers to the position of the target in the video picture. In a specific implementation, the video processing system may use the position information of a target frame enclosing the target as the target position information. The target frame may be rectangular, circular, elliptical, or shaped to follow the contour of the target. Taking a rectangular target frame as an example, the target position information may be represented by the coordinates (x0, y0) of the center point of the rectangular frame and a scaling factor (rx, ry) on each coordinate axis.
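One plausible reading of the (x0, y0) and (rx, ry) representation, which is an assumption since the application does not fix the exact convention, is a box centre in pixels with half-extents given as fractions of the image size. Under that reading, the conversion to corner coordinates is as follows.

```python
from typing import Tuple


def center_scale_to_corners(x0: float, y0: float, rx: float, ry: float,
                            img_w: int, img_h: int) -> Tuple[float, float, float, float]:
    """Convert a (centre, per-axis scale) box to (left, top, right, bottom) pixel corners.

    Interpreting (rx, ry) as half-extents relative to the image size is an assumption
    for illustration; the application does not define the exact convention.
    """
    half_w, half_h = rx * img_w, ry * img_h
    return (x0 - half_w, y0 - half_h, x0 + half_w, y0 + half_h)


print(center_scale_to_corners(320, 240, 0.1, 0.15, 640, 480))  # (256.0, 168.0, 384.0, 312.0)
```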

The attribute information of the target is information describing the nature and/or characteristics of the target. Specifically, the attribute information of the target may include any one or more of category information and identity information of the target. For example, in an object detection scene, the various types of targets appearing in a piece of data can be labeled with categories such as cat, dog, grass, and football. For another example, in a face recognition scene, the attribute information may also be identity information, such as the names of people, for example Zhang San or Li Si.

The action information of the target refers to an action performed by the target. For example, in a smart city management scenario, the target may be a person, and the action information of the target indicates whether the person performs a fighting action. For another example, in a dance teaching scenario, the motion information of the target indicates a dance motion performed by a dance teacher.

The emotion information of the target refers to the emotion expressed by the target. When the video data includes only audio data, the annotation information may also be emotion information, for example the emotion expressed by the speaker, which may be determined from at least one of features such as semantics, tone, and intonation.

Specifically, the video processing system may load the at least one piece of data, for example a segment of video data, into the player to play the video segment. The user determines any one or more of the position information, attribute information, motion information, and emotion information of the target by viewing the video segment and then annotates that information through the GUI. Correspondingly, the video processing system receives the annotation information through the GUI.

As shown in fig. 5, when the user triggers the play control of the interface 400 to play the video, the user may select a face in the video picture by drawing a box with the mouse, thereby annotating the position information of the face. Correspondingly, the video processing system receives the annotation information (in this example, the position information of the face) through the interface 400.

In some implementations, the video processing system can also receive the annotation information in other manners, for example, the user can annotate the information in a voice manner, and the video processing system can also receive the annotation information through an audio capture device such as a microphone.

When annotating the at least one piece of data, the user can annotate every frame. In some implementations, to reduce the annotation workload, the user may also annotate only key frames. A key frame is a frame of particular significance within a piece of data, such as the start frame, the end frame, or a frame in which a target in the video picture performs a key movement or change. Correspondingly, the video processing system can receive the user's annotation information for the key frames in the at least one piece of data and then determine the annotation information of the non-key frames according to the annotation information of the key frames. A non-key frame is a frame in a piece of data other than a key frame.

Taking annotation of face position information as an example, when annotating the face in each segment of data, the user draws a box around the face region in the start and end frames with the mouse, and then around the face regions of other key frames. For the non-key frames, the video processing system calculates the annotation information by interpolation from the annotation information of the key frames. Further, for a multi-modal emotion analysis scene, the user can also label the emotion categories (such as happy, sad, angry, surprised, disgusted, afraid, and neutral) corresponding to the voice and the face, together with the positive/negative emotion value (valence) and the emotion intensity value (arousal).
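The filling-in of non-key-frame boxes from the key-frame boxes can be illustrated with simple linear interpolation between two annotated key frames; the [x, y, w, h] box format and the function name are assumptions for illustration.

```python
from typing import Dict, List


def interpolate_boxes(kf_a: int, box_a: List[float],
                      kf_b: int, box_b: List[float]) -> Dict[int, List[float]]:
    """Linearly interpolate [x, y, w, h] boxes for the frames between two key frames."""
    boxes = {}
    for frame in range(kf_a + 1, kf_b):
        t = (frame - kf_a) / (kf_b - kf_a)               # 0 at key frame A, 1 at key frame B
        boxes[frame] = [a + t * (b - a) for a, b in zip(box_a, box_b)]
    return boxes


# Key frames 10 and 14 were annotated by hand; frames 11-13 are filled in automatically.
print(interpolate_boxes(10, [100, 50, 40, 40], 14, [140, 50, 40, 40]))
# {11: [110.0, 50.0, 40.0, 40.0], 12: [120.0, ...], 13: [130.0, ...]}
```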

For each piece of data, for example each piece of video data, the annotation result may include the start and end times of the video segment, the position of the face rectangle in each frame, the text corresponding to the speech, the corresponding expression category, and the valence and arousal values. The annotation result may be saved in JavaScript Object Notation (JSON) format. In some implementations, the annotation result can also be saved in other formats, such as a text format.
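Such an annotation record might be serialized as sketched below; the field names and values are illustrative assumptions, not a format defined by the application.

```python
import json

# Illustrative annotation record for one segment of data; field names are assumptions.
record = {
    "start_time": 0.0,
    "end_time": 2.2,
    "text": "I like to see sunrise",
    "frames": [
        {"frame": 10, "face_box": [100, 50, 40, 40]},    # per-frame face rectangle [x, y, w, h]
        {"frame": 11, "face_box": [110, 50, 40, 40]},
    ],
    "emotion": {"category": "happy", "valence": 0.6, "arousal": 0.4},
}

with open("segment_000.json", "w", encoding="utf-8") as f:
    json.dump(record, f, ensure_ascii=False, indent=2)
```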

Based on the above description, an embodiment of the present application provides a video processing method. In this method, speech recognition is performed on the audio data in the video data to obtain text data, and the video data is then segmented using the time stamp, in the audio data, of the speech corresponding to at least one word in the text data. This reduces the manual segmentation the user must perform during annotation and improves segmentation efficiency. Moreover, because the video content is taken into account when segmenting the video data, the accuracy is high. Annotating the at least one piece of data obtained by this method reduces the annotation workload and improves annotation efficiency and accuracy.

The video processing method provided by the embodiment of the present application is described in detail above with reference to fig. 1 to 5, and the apparatus and the device provided by the embodiment of the present application are described below with reference to the drawings.

Referring to the schematic structural diagram of the video processing apparatus shown in fig. 6, the apparatus 600 includes:

a communication module 602, configured to obtain video data, where the video data includes image data and audio data;

a recognition module 604, configured to perform speech recognition on the audio data, to obtain text data corresponding to the audio data and a timestamp of a speech corresponding to at least one word in the text data in the audio data;

a segmentation module 606, configured to segment the video data according to the time stamp, in the audio data, of the voice corresponding to at least one word in the text data, to obtain at least one segment of data.

In some possible implementations, the segmentation module 606 is specifically configured to:

determining sentence break positions of sentences in the text data;

and segmenting the video data according to the time stamp, in the audio data, of the voice corresponding to the word at the sentence-break position.

In some possible implementations, the communication module 602 is further configured to:

receiving, from a user through a graphical user interface (GUI), correction information on the at least one piece of data, wherein the correction information is used to modify the start and end times of the at least one piece of data.

In some possible implementations, the communication module 602 is further configured to:

and receiving labeling information of a user for the at least one piece of data, wherein the labeling information comprises any one or more of position information, attribute information, action information, and emotion information of a target in the image data.

In some possible implementations, the communication module 602 is specifically configured to:

receiving the labeling information of the key frame in the at least one section of data from the user;

the apparatus 600 further comprises:

the labeling module 608 is configured to determine labeling information of a non-key frame in the at least one piece of data according to the labeling information of the key frame.

In some possible implementations, the apparatus 600 further includes:

a filtering module 610, configured to filter a segment without speech in the audio data before performing speech recognition on the audio data.

In some possible implementations, the segmentation module 606 is specifically configured to:

segmenting the image data according to the time stamp, in the audio data, of the voice corresponding to the at least one word in the text data; or

segmenting the image data and the audio data according to the time stamp, in the audio data, of the voice corresponding to the at least one word in the text data.

The video processing apparatus 600 according to the embodiment of the present application may correspond to performing the method described in the embodiment of the present application, and the above and other operations and/or functions of each module/unit of the video processing apparatus 600 are respectively for implementing the corresponding flow of each method in the embodiment shown in fig. 3, and are not repeated herein for brevity.

The embodiment of the application also provides a device 700. The device 700 may be an end-side device such as a laptop computer or a desktop computer, or may be a computer cluster in a cloud environment or an edge environment. The device 700 is specifically adapted to implement the functions of the video processing apparatus 600 in the embodiment shown in fig. 6.

Fig. 7 provides a schematic diagram of a structure of a device 700, and as shown in fig. 7, the device 700 includes a bus 701, a processor 702, a communication interface 703, and a memory 704. The processor 702, memory 704, and communication interface 703 communicate over a bus 701. The bus 701 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 7, but this is not intended to represent only one bus or type of bus. The communication interface 703 is used for communication with the outside. For example, video data is acquired, and correction information of at least one piece of data by a user is received through the GUI, or annotation information of at least one piece of data by the user is received, and the like.

The processor 702 may be a Central Processing Unit (CPU). The memory 704 may include volatile memory (volatile memory), such as Random Access Memory (RAM). The memory 704 may also include a non-volatile memory (non-volatile memory), such as a read-only memory (ROM), a flash memory, an HDD, or an SSD.

The memory 704 has stored therein executable code that the processor 702 executes to perform the video processing methods described above.

In particular, when the embodiment shown in fig. 6 is implemented and the modules of the video processing apparatus 600 described in the embodiment of fig. 6 are implemented by software, the software or program code required to perform the functions of the recognition module 604, the segmentation module 606, the labeling module 608, and the filtering module 610 in fig. 6 is stored in the memory 704. The functions of the communication module are implemented through the communication interface 703. The communication interface 703 receives the video data and transmits it to the processor 702 over the bus 701. The processor 702 executes the program code corresponding to the modules stored in the memory 704, such as the program code corresponding to the recognition module 604 and the segmentation module 606, to perform speech recognition on the audio data in the video data, obtain text data corresponding to the audio data and a time stamp, in the audio data, of the speech corresponding to at least one word in the text data, and segment the video data according to that time stamp, thereby obtaining at least one piece of data.

In some possible implementations, the communication interface 703 may further receive, through the GUI, correction information of the at least one piece of data from the user, in order to modify the start and end times of the at least one piece of data.

In some possible implementations, the communication interface 703 may further receive annotation information of the at least one piece of data by the user, where the annotation information includes any one or more of location information, attribute information, action information, and emotion information of the target in the image data.

In some possible implementations, the communication interface 703 may receive annotation information of a key frame in at least one piece of data from a user. Correspondingly, the processor 702 can also execute the program code corresponding to the annotation module 608 to determine the annotation information of the non-key frame in the at least one piece of data according to the annotation information of the key frame.

In some possible implementations, the processor 702 may further execute program code corresponding to the filtering module 610 to perform an operation of filtering a speech-free segment of the audio data before performing speech recognition on the audio data.

An embodiment of the present application further provides a computer-readable storage medium, which includes instructions for instructing a computer to execute the video processing method applied to the video processing apparatus 600.


The embodiment of the application also provides a computer program product, and when the computer program product is executed by a computer, the computer executes any one of the video processing methods. The computer program product may be a software installation package which may be downloaded and executed on a computer in the event that any of the aforementioned video processing methods are required.

Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus the necessary general-purpose hardware, and certainly also by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. Generally, functions performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function may vary, for example analog circuits, digital circuits, or dedicated circuits. For the present application, however, a software implementation is preferable in more cases. Based on such an understanding, the technical solutions of the present application may be embodied essentially in the form of a software product, which is stored in a readable storage medium, such as a floppy disk, a USB disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a training device, or a network device) to execute the methods according to the embodiments of the present application.

In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.

The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, training device, or data center to another website, computer, training device, or data center by wired means (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, or microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a training device or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), among others.
