Wrongly written character recognition method, device, equipment and readable storage medium

Document No.: 1889942    Publication date: 2021-11-26    Language: Chinese

Reading note: this technology, "Wrongly written character recognition method, device, equipment and readable storage medium", was created by 王晨琛 on 2021-03-01. Its main content is as follows: the application provides a wrongly written character recognition method, a wrongly written character recognition device, wrongly written character recognition equipment and a readable storage medium, relates to the technical field of artificial intelligence, and aims to improve the accuracy of recognizing wrongly written characters in media content. The method comprises the following steps: obtaining target comment data for published media content; extracting target text features corresponding to the target comment data according to the context information of the words contained in the target comment data; determining, based on the target text features, that the target comment data contains target comment content for wrongly written characters; and determining wrongly written character information in the media content based on the target comment content. The method can identify wrongly written characters or special words which have not appeared in historical media content, thereby improving the accuracy of identifying wrongly written characters in the media content; moreover, in the process of identifying wrongly written characters, the whole media content does not need to be detected, which improves the efficiency of identifying wrongly written characters in the media content.

1. A method for identifying wrongly written characters, comprising:

acquiring target comment data aiming at published media content;

extracting target text characteristics corresponding to the target comment data according to the context information of each word contained in the target comment data;

determining that the target comment data contains target comment content for wrongly written words based on the target text characteristics;

and determining wrongly written character information in the media content based on the target comment content.

2. The method of claim 1, wherein extracting the target text features of the target comment data according to the context information of each word contained in the target comment data comprises:

inputting the target comment data into a trained comment data classification model;

based on a language learning submodel in the comment data classification model, performing feature extraction on context information of each word contained in the target comment data to obtain target text features corresponding to the target comment data;

the language learning submodel is obtained by taking historical comment data as a training sample and training the language learning submodel for feature learning based on context information of each word contained in the training sample.

3. The method of claim 2, wherein the comment data classification model further comprises a prediction submodel, and determining that the target comment data contains target comment content for wrongly written words based on the target text features comprises:

inputting the target text features into the predictor model;

predicting a second association degree between the target text feature and a target data identification result based on a first association degree learned by the prediction submodel, wherein the first association degree is determined based on an association degree between a historical text feature corresponding to historical comment data and the target data identification result, and the target data identification result is used for representing comment contents aiming at wrongly-written words in the text data;

and if the second relevance is greater than the relevance threshold, determining that the target comment data contains the target comment content.

4. The method of claim 2, wherein the language learning submodel is trained by:

based on the historical comment data set, training the language learning submodel, wherein one training operation comprises the following steps: respectively executing text prediction operation on each historical comment data obtained from the historical comment data set, and determining a prediction deviation corresponding to each historical comment data; based on the prediction deviation corresponding to each historical comment data, parameter adjustment is carried out on the language learning submodel;

wherein the text prediction operation comprises:

performing word segmentation processing on one historical comment data according to a word segmentation rule associated with the language form of the one historical comment data in each historical comment data to obtain at least one word contained in the one historical comment data;

shielding part of words in the at least one word based on a preset word mask; and

determining context information of the partial words in the historical comment data, and selecting candidate words of which the matching degree with the determined context information meets a matching degree condition from a pre-configured candidate word bank, wherein the candidate word bank is determined based on the historical comment data set;

and determining the deviation information between the partial words and the selected candidate words as the prediction deviation corresponding to the historical comment data.

5. The method of any one of claims 1-4, wherein the determining wrongly written character information in the media content based on the target comment content comprises:

analyzing the target comment content based on a pre-configured regular expression for identifying the wrongly written word information to obtain a corresponding analysis result;

determining at least one wrongly-written word associated with the target comment content and text position information of the at least one wrongly-written word in the media content based on the analysis result;

and determining the at least one wrongly-written word and the text position information as the wrongly written character information in the media content.

6. The method of claim 5, wherein the method further comprises:

if the wrongly written character information is not obtained based on the analysis result, carrying out wrongly written character detection on the media content based on a pre-configured wrongly written character detection rule to obtain a detection result;

and determining whether the media content contains corresponding wrongly written character information according to the detection result.

7. The method of any one of claims 1-4, wherein the determining wrongly written character information in the media content based on the target comment content comprises:

acquiring account information of a target account issuing the target comment data;

determining a confidence level of the target comment data based on the account information;

and when the confidence level reaches a confidence threshold, determining wrongly written character information in the media content based on the target comment content.

8. An apparatus for identifying wrongly written characters, comprising:

the data acquisition unit is used for acquiring target comment data aiming at the published media content;

the feature extraction unit is used for extracting target text features corresponding to the target comment data according to the context information of each word contained in the target comment data;

the first identification unit is used for determining that the target comment data contains target comment content aiming at wrongly written words based on the target text characteristics;

and the second identification unit is used for determining wrongly written character information in the media content based on the target comment content.

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method of any one of claims 1-7.

10. A computer-readable storage medium having stored thereon computer instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1-7.

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for identifying wrongly written characters.

Background

In the related art, a wrongly-written character candidate set is generally created in advance, candidate words in the media content that appear in the candidate set are replaced, and whether a candidate word is a wrongly-written character is judged based on the degree to which the replacement affects the fluency of the text of the media content.

Disclosure of Invention

The embodiment of the application provides a wrongly written character recognition method, a wrongly written character recognition device, wrongly written character recognition equipment and a readable storage medium, which are used for improving the accuracy of recognizing wrongly written characters in media contents.

In a first aspect of the present application, a method for identifying wrongly written characters is provided, including:

acquiring target comment data aiming at published media content;

extracting target text characteristics corresponding to the target comment data according to the context information of each word contained in the target comment data;

determining that the target comment data contains target comment content for wrongly written words based on the target text characteristics;

and determining wrongly written character information in the media content based on the target comment content.

In a second aspect of the present application, there is provided a wrongly written character recognition apparatus, including:

the data acquisition unit is used for acquiring target comment data aiming at the published media content;

the feature extraction unit is used for extracting target text features corresponding to the target comment data according to the context information of each word contained in the target comment data;

the first identification unit is used for determining that the target comment data contains target comment content aiming at wrongly written words based on the target text characteristics;

and the second identification unit is used for determining wrongly written character information in the media content based on the target comment content.

In a possible implementation manner, the feature extraction unit is specifically configured to:

inputting the target comment data into a trained comment data classification model;

based on a language learning submodel in the comment data classification model, performing feature extraction on context information of each word contained in the target comment data to obtain target text features corresponding to the target comment data;

the language learning submodel is obtained by taking historical comment data as a training sample and training the language learning submodel for feature learning based on context information of each word contained in the training sample.

In a possible implementation manner, the comment data classification model further includes a predictor model, and the first identifying unit is specifically configured to:

inputting the target text features into the predictor model;

predicting a second association degree between the target text feature and a target data identification result based on a first association degree learned by the prediction submodel, wherein the first association degree is determined based on an association degree between a historical text feature corresponding to historical comment data and the target data identification result, and the target data identification result is used for representing comment contents aiming at wrongly-written words in the text data;

and if the second relevance is greater than the relevance threshold, determining that the target comment data contains the target comment content.

In a possible implementation manner, the feature extraction unit is further configured to train the language learning submodel by:

based on the historical comment data set, training the language learning submodel, wherein one training operation comprises the following steps: respectively executing text prediction operation on each historical comment data obtained from the historical comment data set, and determining a prediction deviation corresponding to each historical comment data; based on the prediction deviation corresponding to each historical comment data, parameter adjustment is carried out on the language learning submodel;

wherein the text prediction operation comprises:

performing word segmentation processing on one historical comment data according to a word segmentation rule associated with the language form of the one historical comment data in each historical comment data to obtain at least one word contained in the one historical comment data;

shielding part of words in the at least one word based on a preset word mask; and

determining context information of the partial words in the historical comment data, and selecting candidate words of which the matching degree with the determined context information meets a matching degree condition from a pre-configured candidate word bank, wherein the candidate word bank is determined based on the historical comment data set;

and determining the deviation information between the partial words and the selected candidate words as the prediction deviation corresponding to the historical comment data.

In a possible implementation manner, the second identifying unit is specifically configured to:

analyzing the target comment content based on a pre-configured regular expression for identifying the wrongly written word information to obtain a corresponding analysis result;

determining at least one wrongly-written word associated with the target comment content and text position information of the at least one wrongly-written word in the media content based on the analysis result;

and determining the at least one wrongly-written word and the text position information as the wrongly written character information in the media content.

In a possible implementation manner, the second identification unit is further configured to:

if the wrongly written character information is not obtained based on the analysis result, carrying out wrongly written character detection on the media content based on a pre-configured wrongly written character detection rule to obtain a detection result;

and determining whether the media content contains corresponding wrongly written character information according to the detection result.

In a possible implementation manner, the second identification unit is further configured to:

acquiring account information of a target account issuing the target comment data;

determining a confidence level of the target comment data based on the account information;

and when the confidence level reaches a confidence threshold, determining wrongly written character information in the media content based on the target comment content.

In a third aspect of the present application, a computer device is provided, which comprises a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of the first aspect when executing the program.

In a fourth aspect of the present application, a computer program product is provided that includes computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the first aspect described above.

In a fifth aspect of the present application, there is provided a computer readable storage medium having stored thereon computer instructions which, when run on a computer, cause the computer to perform the method according to the first aspect.

Due to the adoption of the technical scheme, the embodiment of the application has at least the following technical effects:

On the one hand, wrongly written characters in the media content are identified directly based on information in the comment data of the media content; this process does not require collecting wrongly written characters in advance to create a wrongly written character candidate set, so wrongly written characters or special words that have not appeared in historical media content can also be identified, which improves the accuracy of identifying wrongly written characters in the media content. On the other hand, in the process of identifying wrongly written characters, only the comment data needs to be identified and the whole media content does not need to be detected, which significantly narrows the detection range of wrongly written character identification and improves the efficiency of identifying wrongly written characters in the media content.

Drawings

Fig. 1 is a schematic diagram of an application scenario of wrongly written character recognition according to an embodiment of the present application;

fig. 2 is a schematic diagram illustrating a flow of a method for identifying wrongly written characters according to an embodiment of the present application;

fig. 3 is a schematic diagram of a process of obtaining comment data for media content according to an embodiment of the present application;

FIG. 4 is a diagram illustrating a structure of a review data classification model provided in an embodiment of the present application;

FIG. 5 is an exemplary diagram of a structure of a language learning submodel according to an embodiment of the present disclosure;

FIG. 6 is a schematic diagram illustrating a flow of a training operation for a language learning submodel according to an embodiment of the present disclosure;

fig. 7 is a complete flow chart of a method for identifying wrongly written characters according to an embodiment of the present application;

fig. 8 is a complete flow chart of a method for identifying wrongly written words according to an embodiment of the present application;

fig. 9 is a schematic structural diagram of an apparatus for identifying wrongly written characters according to an embodiment of the present application;

fig. 10 is a block diagram of a computer device according to an embodiment of the present application.

Detailed Description

In order to better understand the technical solutions provided by the embodiments of the present application, the following detailed description is made with reference to the drawings and specific embodiments.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

In order to facilitate those skilled in the art to better understand the technical solutions of the present application, some concepts related to the present application will be described below.

1) Media content

In the media era, media content can generally refer to audio, video, graphics, text, and the like. In the embodiments of the present application, the media content may be, but is not limited to, content pushed to an account by a content platform (for example, but not limited to, a content sharing system or a content recommendation system), and may be a multimedia resource composed of at least one of, or any combination of, text, audio, video, articles, pictures, and other information. In the embodiments of the present application, wrongly written characters may be identified for the text information in the media content. For example, when the media content is an article, the text information may be the text in the article; when the media content is an image, the text information may be the description text of the image; and when the media content is a video, the text information may include, but is not limited to, the description text of the video (for example, narration information or character introduction information of a TV series) or subtitles in the video.

2) Comment data and target comment data

The comment data in the embodiments of the present application may include, but is not limited to, comment information posted for the media content by an account that receives the media content; the target comment data is the comment data for media content on which wrongly written character recognition needs to be performed, and may also be understood as the comment data currently being processed, or the like.

3) BERT (Bidirectional Encoder Representations from Transformers) model

The BERT model is the encoder network (Encoder) of a bidirectional Transformer. The goal of the BERT model is to obtain, through training on a large-scale unlabeled corpus, a semantic representation (Representation) of text that contains rich semantic information, then to fine-tune this semantic representation for a specific Natural Language Processing (NLP) task, and finally to apply it to that task.

4) Natural language processing

Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics; research in this field therefore involves natural language, that is, the language people use every day, and is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.

Embodiments of the present application relate to Artificial Intelligence (AI) and machine learning techniques, and are designed based on computer vision techniques and Machine Learning (ML) in the AI.

Artificial intelligence is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence.

Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. The artificial intelligence technology mainly comprises a computer vision technology, a natural language processing technology, machine learning or deep learning and the like. With the research and progress of artificial intelligence technology, artificial intelligence is researched and applied in a plurality of fields, such as common smart homes, smart customer service, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, robots, smart medical treatment and the like.

Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, and inductive learning.

In order to make the objects, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The following explains the concept of the present application.

In the related art, wrongly written characters in media content are identified roughly as follows: a candidate set containing frequently occurring wrongly written characters is established by collecting known wrongly written character samples, candidate words in the media content that appear in the candidate set are replaced in certain ways, and whether a replaced candidate word is a wrongly written character is judged based on the degree to which the replacement affects the fluency of the text of the media content. Such a method can only detect wrongly written characters that exist in the candidate set and cannot find wrongly written characters that do not appear in it; in particular, it cannot handle special words that require strong prior knowledge (such as, but not limited to, names of people or newly coined internet expressions) and cannot determine whether such special words are wrongly written. For example, suppose a piece of media content states that the wife of A1 is A2, where A2 is a normal name but is not in fact the wife of A1; the current wrongly written character recognition method can only recognize that A2 is a correctly written name, and cannot recognize that A2 is not the wife of A1.

In view of the above, the inventors have devised a wrongly written character recognition method, apparatus, device and readable storage medium for improving the efficiency of identifying wrongly written characters in media content. The method takes into account that the comment data for a piece of media content may contain information related to wrongly written characters, so in the embodiments of the present application, whether the media content contains wrongly written characters is identified through the information in the comment data for the media content. Specifically, target comment data for published media content is acquired; target text features corresponding to the target comment data are extracted according to the context information of each word contained in the target comment data; whether the target comment data contains target comment content for wrongly written characters is determined based on the target text features; and when the target comment data contains such target comment content, wrongly written character information in the media content can be determined based on the target comment content.

As an embodiment, the language form of the text information in the media content is not limited in the embodiments of the present application and can be set by those skilled in the art according to actual needs; it may be, but is not limited to, at least one of Chinese, English, Korean, Japanese, Italian, Hindi, and the like. Likewise, the language form of the comment data for the media content is not limited and can be set by those skilled in the art according to actual needs; it may also be, but is not limited to, at least one of Chinese, English, Korean, Japanese, Italian, Hindi, and the like. In the following, the wrongly written character recognition method provided by the embodiments of the present application is described by taking Chinese as an example.

Furthermore, in order to further improve the accuracy of identifying wrongly written words in the media content, the language form of the comment data for the media content in the embodiment of the present application and the language of the text information in the media content may be consistent.

In order to more clearly understand the design idea of the present application, an application scenario in the embodiment of the present application is described below as an example.

Referring to fig. 1, an application scenario for identifying wrongly written characters is shown. The application scenario may include a terminal device 110, a content server 120, and a wrongly written character recognition server 130; the terminal device 110, the content server 120, and the wrongly written character recognition server 130 may communicate with each other via a network, wherein:

the terminal device 110 is configured to receive the media content and send the media content to the content server 120; the terminal device 110 may also receive the media content distributed by the content server 120, and, in response to a comment operation performed on the published media content by an account that receives the media content, obtain comment data for the media content and send the comment data to the wrongly written character recognition server 130.

As an embodiment, the terminal device 110 may have a client installed thereon, and the terminal device 110 may publish the media content to the content server 120 through the client and receive the media content distributed by the content server 120 through the client.

Content server 120 is configured to receive media content uploaded by terminal devices 110 and to distribute the media content to one or more terminal devices 110.

The wrongly written character recognition server 130 is configured to obtain target comment data for a published media content, extract a target text feature corresponding to the target comment data according to context information of each word included in the target comment data, and determine wrongly written character information in the media content based on the target comment content when it is determined that the target comment data includes a target comment content for a wrongly written character based on the target text feature.

The terminal device 110 in the embodiments of the present application may be a mobile terminal, a fixed terminal, or a portable terminal, such as a mobile handset, a station, a unit, a device, a multimedia computer, a multimedia tablet, an internet node, a communicator, a desktop computer, a laptop computer, a notebook computer, a netbook computer, a tablet computer, a Personal Communication System (PCS) device, a personal navigation device, a Personal Digital Assistant (PDA), an audio/video player, a digital camera or camcorder, a positioning device, a television receiver, a radio broadcast receiver, an electronic book device, a game device, or any combination thereof, including accessories and peripherals of these devices, or any combination thereof.

The content server 120 and the wrongly written character recognition server 130 in the embodiments of the present application may be the same server or different servers. The content server 120 and the wrongly written character recognition server 130 may each be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or cloud servers that provide basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDNs, and big data and artificial intelligence platforms (for example, the content server 120 may include, but is not limited to, the server 120-1, the server 120-2, or the server 120-3 illustrated in the figure; the wrongly written character recognition server 130 may include, but is not limited to, the server 130-1, the server 130-2, or the server 130-3 illustrated in the figure). The functions of the content server 120 may be implemented by one or more cloud servers or by one or more cloud server clusters, and the functions of the wrongly written character recognition server 130 may likewise be implemented by one or more cloud servers or cloud server clusters.

Cloud technology is a general term for network technologies, information technologies, integration technologies, management platform technologies, application technologies and the like that are applied based on the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud technology serves as an important support: background services of technical network systems, such as video websites, picture websites and other web portals, require a large amount of computing and storage resources. With the development and application of the internet industry, each item of content may carry its own identification mark that needs to be transmitted to a background system for logical processing; data of different levels are processed separately, and all kinds of industry data require strong system background support, which can only be realized through cloud technology.

Based on the application scenario of fig. 1, an example of a method for identifying wrongly written words in the embodiment of the present application is described below; referring to fig. 2, a schematic diagram of a method for identifying a wrongly written character according to an embodiment of the present application is shown, which includes the following steps:

in step S201, target comment data for the published media content is acquired.

As an embodiment, the target comment data may be, but is not limited to being, selected from a comment data set associated with the media content, where the comment data set includes the comment data for the media content; the comment data may include the comment content for the media content, and may further include at least one of the content identification of the media content, the account information of the account that issued the comment data, the time information of issuing the comment data, and the like.
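Purely for illustration, a comment data record of the kind described above might be represented as follows; the field names below are hypothetical and merely mirror the items listed in this embodiment (comment content, content identification, account information, publication time).

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CommentData:
    """One record in the comment data set associated with a piece of media content."""
    comment_content: str       # the comment text itself
    content_id: str            # content identification of the commented media content
    account_info: dict         # account information of the account that issued the comment
    published_at: datetime     # time information of issuing the comment data

# The comment data set for one piece of media content is simply a collection of such
# records; the target comment data is whichever record is currently selected for
# wrongly written character recognition.
comment_data_set: list[CommentData] = []
```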

As an embodiment, the comment data may be collected by the terminal device 110. Specifically, as shown in fig. 3 (a), an account receiving the media content may trigger a comment operation for the media content, and the terminal device 110 may, in response to the comment operation, obtain the comment content indicated by the comment operation as comment data and send the comment data to the wrongly written character recognition server 130; furthermore, the wrongly written character recognition server 130 may receive the comment data sent by each terminal device, record the comment data for the same media content into the comment data set associated with that media content, and then select comment data from the comment data set as the target comment data.

As an embodiment, the comment data may also be collected directly by the wrongly written character recognition server 130. Specifically, as shown in fig. 3 (b), an account receiving the media content may trigger a comment operation for the media content, and the terminal device 110 may send a comment instruction to the wrongly written character recognition server 130 in response to the comment operation; further, the wrongly written character recognition server 130 may, in response to the comment instruction, obtain the comment content indicated by the comment operation as comment data, record the comment data for the same media content into the comment data set associated with that media content, and then select comment data from the comment data set as the target comment data.

Step S202, extracting target text features corresponding to the target comment data according to the context information of each word included in the target comment data.

As an embodiment, in order to improve the accuracy of extracting the target text features, a model may be used for the extraction in the embodiments of the present application; for example, feature extraction may be performed on the target comment data using at least one of a Vector Space Model (VSM), a probabilistic statistical model of text, or a trained neural network model, so as to obtain the target text features. A more detailed method for extracting the target text features is further described below.

In the embodiments of the present application, the neural network model is not limited and can be set by those skilled in the art according to actual requirements; the neural network model may include, but is not limited to, a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a language learning model (such as a BERT model), an Ernie model, an ALBERT model, and the like.
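As a non-limiting sketch of the neural-network option mentioned above, a pre-trained BERT-style encoder can map a comment to a text feature vector. The snippet assumes the Hugging Face transformers library and the publicly available bert-base-chinese checkpoint; neither is prescribed by this embodiment, and any comparable model (Ernie, ALBERT, and so on) could be substituted.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

def extract_text_feature(comment: str) -> torch.Tensor:
    """Map one piece of target comment data to a target text feature vector."""
    inputs = tokenizer(comment, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = encoder(**inputs)
    # Mean-pool the per-token feature vectors, matching the averaging option
    # described for the language learning submodel later in this section.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

feature = extract_text_feature("这篇文章第三段把“的”写成了“地”")  # -> 768-dimensional feature vector
```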

Step S203, determining that the target comment data contains target comment content for wrongly written words based on the target text features.

For convenience of description, in the following content of the embodiments of the present application, target comment data containing target comment content for wrongly written words is referred to as wrongly written comment data, and target comment data not containing target comment content for wrongly written words is referred to as non-wrongly written comment data. In order to improve the efficiency and accuracy of identifying wrongly written comment data in the target comment data, in the embodiments of the present application, wrongly written comment data and non-wrongly written comment data in the target comment data may be identified through a trained comment data classification model.

Step S204, determining wrongly written character information in the media content based on the target comment content.

As an embodiment, in order to improve the efficiency and accuracy of determining the wrongly written character information in the media content based on the target comment content, in the embodiments of the present application the target comment content may be analyzed based on a preconfigured regular expression for identifying wrongly written character information, and the wrongly written character information in the media content is obtained based on the analysis result; the specific manner of determining the wrongly written character information in the media content based on the regular expression is described below.
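Ahead of that description, the sketch below shows what such a regular expression might look like; both the pattern and the comment phrasing it targets are hypothetical examples rather than the expression defined by this embodiment.

```python
import re

# Hypothetical pattern: matches comments of the form  "X" 应该是 "Y"  ("X" should be "Y").
TYPO_PATTERN = re.compile(r'"(?P<wrong>[^"]+)"\s*应该是\s*"(?P<right>[^"]+)"')

def parse_typo_comment(comment_content: str) -> list[tuple[str, str]]:
    """Return (wrongly written word, suggested correction) pairs found in the target comment content."""
    return [(m.group("wrong"), m.group("right")) for m in TYPO_PATTERN.finditer(comment_content)]

pairs = parse_typo_comment('第三段里的 "在坐" 应该是 "在座"')   # -> [('在坐', '在座')]
# The text position of each wrongly written word in the media content can then be
# located, for example with media_text.find(wrong_word).
```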

The comment data classification model mentioned in step S203 is further described below.

As an embodiment, an example of a comment data classification model is provided in the embodiments of the present application, please refer to fig. 4, which provides a schematic structural diagram of a comment data classification model, which may include, but is not limited to, an input layer, a language learning submodel, a prediction submodel, and an output layer, where:

the input layer is used for receiving the target comment data and transmitting the target comment data to the language learning submodel; the output layer is used for outputting the prediction result of the prediction submodel.

The language learning submodel can perform feature extraction on the context information of each word contained in the target comment data to obtain the target text features corresponding to the target comment data. In order to further improve the accuracy of the obtained target text features, the language learning submodel may perform feature extraction on the context information of each word separately to obtain a word feature vector mapped from each word, and then obtain the target text features based on these word feature vectors; for example, the average of the word feature vectors mapped from the words may be determined as the target text features, although the target text features are not limited to this average. The context information may include, but is not limited to, one of a word vector obtained by encoding (Embedding) the word, the position information of the word in the target comment data, and the data identifier of the target comment data in which the word is located.

As an embodiment, in order to improve the accuracy of extracting the target text features by the language learning submodel, in the embodiment of the present application, the feature learning training may be performed on the language learning submodel based on the context information of each word included in the training sample by using the historical comment data as the training sample; in the training process of the language learning submodel, the text features of the words can be learned in a mode of predicting the words based on the context information of the words in the historical comment data, and then the accuracy of the language learning submodel for extracting the text features can be improved in the training process, so that the accuracy of extracting the target text features is improved by the trained language learning submodel.

The prediction submodel may identify, based on the target text features, whether the target comment data is wrongly written comment data or non-wrongly written comment data. Specifically, in step S203, the target text features extracted by the language learning submodel may be, but are not limited to being, input into the prediction submodel; a second association degree between the target text features and a target data identification result is predicted based on the first association degree learned by the prediction submodel, and if the second association degree is greater than an association degree threshold, it is determined that the target comment data contains the target comment content. The first association degree is determined based on the association degree between the historical text features corresponding to historical comment data and the target data identification result, and the target data identification result is used for representing that the text data contains comment content for wrongly written words, that is, it represents that the comment data currently being processed is wrongly written comment data. The prediction submodel may be, but is not limited to, a binary classification model or another classification model.
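A minimal sketch of a prediction submodel of the binary-classification kind described above, written in PyTorch; the feature dimension, layer shape, and association degree threshold are illustrative assumptions rather than values fixed by this embodiment.

```python
import torch
from torch import nn

class PredictionSubmodel(nn.Module):
    """Scores how strongly a target text feature is associated with the target data
    identification result (i.e. 'this comment is about a wrongly written character')."""
    def __init__(self, feature_dim: int = 768):
        super().__init__()
        self.classifier = nn.Linear(feature_dim, 1)

    def forward(self, text_feature: torch.Tensor) -> torch.Tensor:
        # The sigmoid output plays the role of the second association degree.
        return torch.sigmoid(self.classifier(text_feature))

ASSOCIATION_THRESHOLD = 0.5                 # illustrative association degree threshold
predictor = PredictionSubmodel()
feature = torch.randn(768)                  # stand-in for a target text feature
is_typo_comment = bool(predictor(feature) > ASSOCIATION_THRESHOLD)
```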

As an embodiment, in the embodiment of the application, the language learning submodel may be trained to obtain a trained language learning submodel, and then, after the trained language learning submodel, a prediction submodel for identifying wrongly written comment data and non-wrongly written comment data is created to obtain an initial comment data classification model, and the initial comment data classification model is trained to obtain a trained comment data classification model.

The following further describes the training process of the language learning submodel and the comment data classification model.

Firstly, the structure of the language learning submodel related to the present application is exemplarily described, and the language learning submodel related to the present application may be, but is not limited to, a Bert model, a Fast-Bert model, a Tiny-Bert model, and the like; for the sake of understanding, a structural example diagram of a language learning submodel is provided herein, please refer to fig. 5, the language learning submodel may include a word representation layer 501, a feature extraction layer 502 and a feature output layer 503; wherein:

The word representation layer 501 may be, but is not limited to being, configured to perform word segmentation on comment data (such as the above target comment data or the above historical comment data) according to the word segmentation rule associated with the language form of the comment data, to obtain at least one word contained in the comment data, and then process the context information of each of the at least one word, for example by encoding (Embedding), to obtain a word representation of each word; the word representations may include, but are not limited to, the word representation E1, the word representation E2, ..., and the word representation EN (N is a positive integer) indicated in the figure. For the description of the context information, reference can be made to the above, and it is not repeated here.

The feature extraction layer 502 may be, but is not limited to being, configured to extract a word feature from the processed word representation of each word through multiple layers of Transformer units (Trm); as shown in the figure, the word features T1 to TN can be extracted from the word representations E1 to EN, respectively.

The feature output layer 503 may output, but is not limited to, the word features extracted by the feature extraction layer 502.

Next, the training process of the language learning submodel is explained in detail:

in the embodiment of the application, historical comment data can be used as a training sample, and the language learning submodel is subjected to feature extraction training based on the context information of each word contained in the training sample; specifically, the language learning submodel may be, but is not limited to, obtained by performing at least one training operation on the language learning submodel based on a historical comment data set including a plurality of historical comment data.

As an embodiment, in order to improve the accuracy of extracting text features by a language learning model, in the embodiment of the present application, a first training end condition may be set for a training process of a language learning submodel, and then, in a process of training the language learning submodel, when it is determined that the first training end condition is satisfied, a language learning submodel in training is output;

The first training end condition is not limited in the embodiments of the present application and can be set by those skilled in the art according to actual requirements; for example, the first training end condition may be, but is not limited to, one or any combination of the following training end conditions A1 to A3: training end condition A1) the number of training operations performed on the language learning submodel reaches a first count threshold; training end condition A2) the duration of the training operations on the language learning submodel reaches a first duration threshold; training end condition A3) the model prediction error of the language learning submodel currently being trained is less than a first prediction error threshold, where the model prediction error is further described below.
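For concreteness, the three end conditions could be checked as below; the threshold values are placeholders, since the embodiment deliberately leaves them to the implementer.

```python
import time

def first_training_end_reached(num_training_ops: int,
                               start_time: float,
                               model_prediction_error: float,
                               max_ops: int = 10_000,         # first count threshold (assumed)
                               max_seconds: float = 3600.0,   # first duration threshold (assumed)
                               error_threshold: float = 0.05  # first prediction error threshold (assumed)
                               ) -> bool:
    """Training may stop when any one (or any chosen combination) of A1-A3 holds."""
    a1 = num_training_ops >= max_ops                     # A1: number of training operations
    a2 = (time.time() - start_time) >= max_seconds       # A2: training duration
    a3 = model_prediction_error < error_threshold        # A3: current model prediction error
    return a1 or a2 or a3
```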

As an example, in a training operation, a text prediction operation may be performed on each historical comment data obtained from the historical comment data set by using a language learning submodel, a model prediction error of the language learning submodel is determined, and a model parameter of the language learning submodel is adjusted based on the model prediction error, and specifically, referring to fig. 6, a schematic diagram of a flow of the training operation is provided, and the training operation may include, but is not limited to, the following steps S601 and S602:

step S601, performing text prediction operation on each historical comment data obtained from the historical comment data set, and determining a prediction deviation corresponding to each historical comment data.

As an embodiment, the prediction deviation corresponding to one historical comment data may represent error information for text prediction of a part of words in the one historical comment data through a language learning submodel; in the text prediction operation for one piece of history comment data, the partial word may be predicted based on, but not limited to, context information of the partial word in the one piece of history comment data, and specifically, the text prediction operation may include, but is not limited to, the following steps S6011 to S6014.

Step S6011, performing word segmentation processing on one piece of historical comment data among the pieces of historical comment data according to the word segmentation rule associated with the language form of that piece of historical comment data, to obtain at least one word contained in that piece of historical comment data.

In the step, word segmentation processing is performed on the historical comment data based on the word segmentation rule associated with the language form of the historical comment data, so that the accuracy of the obtained words can be improved, and the accuracy of subsequent word prediction is improved.

As an embodiment, a person skilled in the art may determine the word segmentation rule associated with each language form according to the actual business requirement and the language characteristics of the language form, and in the embodiment of the present application, an exemplary description of several language form associated word segmentation rules will be given below.

Step S6012, masking a part of the at least one word based on a preset word mask.

As an embodiment, this step may, but is not limited to, randomly mask one or more of the at least one word using a preset word mask; the specific form of the word mask is not limited, and those skilled in the art may set it according to actual requirements.

Step S6013, determining the context information of the partial words in the one piece of historical comment data, and selecting, from a preconfigured candidate word library, candidate words whose matching degree with the determined context information satisfies a matching degree condition, wherein the candidate word library is determined based on the historical comment data set.

As an embodiment, before training the language learning submodel, word segmentation processing may be performed on each historical comment data in the historical comment data set, each word obtained through the word segmentation processing is determined as a candidate word, and the obtained candidate word set is determined as the above candidate word library, where a specific manner of performing word segmentation processing on each historical comment data may refer to step S6011, and a description is not repeated here.
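A minimal sketch of this candidate word library construction, assuming a `segment` callable that applies the word segmentation rule of step S6011 (one possible implementation of such a function is sketched at the end of this section):

```python
def build_candidate_word_library(historical_comment_data_set: list[str], segment) -> set[str]:
    """Candidate word library: every word produced by segmenting each piece of
    historical comment data in the historical comment data set."""
    return {word for comment in historical_comment_data_set for word in segment(comment)}
```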

As an embodiment, the matching degree condition is not limited, and a person skilled in the art may set it according to actual requirements; for example, the largest matching degree among the matching degrees between the candidate words and the determined context information may be determined as satisfying the matching degree condition, or the matching degree closest to a matching degree threshold among those matching degrees may be determined as satisfying the matching degree condition.

As an embodiment, in step S6013, the following operation may be, but is not limited to being, performed on each word in the partial words to select a candidate word matching that word and determine it as the predicted word corresponding to that word: for one word in the partial words, determine the context information of that word in the one piece of historical comment data as target context information, then determine the matching degree between each candidate word in the candidate word library and the target context information, select a candidate word whose matching degree reaches the matching degree threshold, and determine the selected candidate word as the predicted word matching that word. This text prediction procedure is only an exemplary illustration, and those skilled in the art can flexibly adopt other ways to predict the words.
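A toy sketch of steps S6012-S6013: mask one word, then pick from the candidate word library the candidate whose matching degree with the surrounding context is highest. The co-occurrence-count matching function is a deliberately simple stand-in; the embodiment leaves the actual matching degree computation open.

```python
from typing import Dict, List, Tuple

def predict_masked_word(words: List[str], masked_index: int,
                        candidate_word_library: List[str],
                        cooccurrence: Dict[Tuple[str, str], int],
                        word_mask: str = "[MASK]") -> str:
    """Mask the word at masked_index (step S6012) and select, from the candidate word
    library, the candidate whose matching degree with the remaining context is largest
    (step S6013)."""
    masked_words = [word_mask if i == masked_index else w for i, w in enumerate(words)]
    context = [w for w in masked_words if w != word_mask]

    def matching_degree(candidate: str) -> int:
        # Stand-in matching degree: co-occurrence counts between the candidate and
        # the context words, gathered from the historical comment data set.
        return sum(cooccurrence.get((candidate, c), 0) for c in context)

    return max(candidate_word_library, key=matching_degree)
```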

Step S6014, determining deviation information between the partial word and the selected candidate word as a prediction deviation corresponding to the history comment data.

As an example, the prediction deviation may, but is not limited to, characterize a deviation degree between the partial word and the selected candidate word, and the deviation degree may be inversely related to a matching degree between the partial word and the selected candidate word; in the embodiment of the present application, a specific manner for determining the predicted deviation may be set according to actual requirements, and several examples for determining the predicted deviation are given as follows:

in the embodiment of the application, the prediction deviation corresponding to the historical comment data may be determined based on the character string matching degree or semantic matching degree between the partial word and the selected candidate word, and if the partial word includes a plurality of words, the semantic matching degree between each word and the respective corresponding predicted word (i.e., the selected candidate word for each word) may be determined as the error information corresponding to each word, and then the average value of the error information corresponding to each word in the plurality of words may be determined as the prediction deviation corresponding to the historical comment data;

When the partial words in the embodiment of the present application include multiple words, the proportion of the masked words that are predicted correctly may also be determined as the prediction deviation corresponding to the one piece of historical comment data, where this proportion may be determined by, but is not limited to, formula (1):
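The body of formula (1) is not reproduced in this text; from the definitions and the worked example in the next paragraph, it can be reconstructed as follows (an inference from context, not a verbatim copy of the original formula):

```latex
P_1 = \frac{K_1}{K_2} \tag{1}
```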

In formula (1), K2 is the total number of masked words (the partial words shielded) in the one piece of historical comment data, K1 is the number of those masked words for which the selected candidate word is the masked word itself, and P1 is the prediction deviation corresponding to the one piece of historical comment data. For example, if the masked partial words in the historical comment data are word 1, word 2 and word 3, the candidate word selected for word 1 is word 1, the candidate word selected for word 2 is word 5, and the candidate word selected for word 3 is word 4, then K2 = 3, K1 = 1, and the prediction deviation corresponding to the historical comment data is 1/3.

Step S602, based on the prediction deviation corresponding to each historical comment data, performing parameter adjustment on the language learning submodel.

As an example, a model prediction error of the language learning submodel may be determined based on a prediction deviation corresponding to each historical comment data, and a parameter adjustment may be performed on the language learning submodel based on the model prediction error, such as, but not limited to, adjusting a model parameter of the language learning submodel toward a direction of reducing the model prediction error.

In order to improve the implementation flexibility of the scheme, the specific way of determining the model prediction error is not limited in the embodiments of the present application and may be set flexibly according to actual business requirements. For example, but not limited to, the average of the prediction deviations corresponding to the historical comment data may be determined as the model prediction error; the model prediction error of the language learning submodel may also be determined based on the principle of formula (2):
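The body of formula (2) is likewise not reproduced in this text. By analogy with formula (1) and using the definitions in the next paragraph, one plausible reconstruction is shown below; this exact form is an assumption:

```latex
P_2 = \frac{K_3}{K_4} \tag{2}
```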

in formula (2), K4 is the total number of historical comment data in the historical comment data set, K3 is the number of historical comment data with correct text prediction, and P2 is the model prediction error; the historical comment data with correct text prediction can be historical comment data with prediction deviation larger than a prediction deviation threshold value, and the historical comment data with correct text prediction can also be historical comment data of a part of words which are selected and shielded as candidate words.
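
The two aggregation options mentioned above could be sketched as follows (which option is used, and how "correct text prediction" is judged, is left open by the text):

    def model_prediction_error(per_sample_deviations, correct_flags=None):
        # Option 1 (formula (2)): K3 / K4, with K3 the number of historical
        # comment data judged as correctly predicted and K4 the total number.
        if correct_flags is not None:
            k4 = len(correct_flags)
            k3 = sum(1 for flag in correct_flags if flag)
            return k3 / k4 if k4 else 0.0
        # Option 2: average of the per-sample prediction deviations.
        return sum(per_sample_deviations) / len(per_sample_deviations)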

As an example, to further explain the scheme of the embodiment of the present application, the following gives examples of word segmentation rules associated with several language forms:

the word segmentation rule associated with a language form in the embodiment of the application may be, but is not limited to, dividing consecutive characters in the comment data to be segmented that together form a reference word in the reference word set associated with that language form into one word; a reference word may be, but is not limited to, a first character group composed of at least one character whose combined use frequency in the language form is higher than a first frequency threshold, where the arrangement positions of the characters in the first character group may be ordered, and the combined use frequency may be determined based on statistics of the historical use of each character in the language form; taking Chinese as the language form of the one historical comment data as an example, one Chinese character may be one character; assuming the first frequency threshold is 0.65, the frequency of the characters "long" and "river" being used in combination as "long river" is 0.68, the frequency of the characters "old" and "column" being used in combination as "column old" is 0.25, the frequency of the characters "yellow" and "river" being used in combination as "yellow river" is 0.75, and the frequency of the reversed combination "river yellow" is 0.20; then "long river" and "yellow river" may be used as reference words in the reference word set associated with Chinese, while "column old" and "river yellow" are not; further, if the comment data to be segmented includes the character "long" immediately followed by the character "river", the "long" and "river" can be divided into one word.

The word segmentation rule associated with the language form may also be dividing consecutive characters in the comment data to be segmented whose combined use frequency in that language form is greater than a second frequency threshold into one word; such consecutive characters are denoted as a second character group, and the following takes Chinese as the language form to illustrate how the combined use frequency is determined and how the division is obtained: if a second character group includes two consecutive characters B1 and B2, the probability of the character B2 appearing after the character B1 in Chinese can be determined as the combined use frequency of the second character group; for example, if the probability of the character "river" appearing after the character "long" in Chinese is 0.8, the combined use frequency of "long river" is 0.8; if a second character group includes three consecutive characters B3, B4 and B5, the combined use frequency of the second character group can be determined based on a first probability of the character B4 appearing after the character B3 in Chinese and a second probability of the character B5 appearing after the character B4 in Chinese, for example, the product of the first probability and the second probability can be directly determined as the combined use frequency of the second character group; as a specific example, if the characters "yellow", "pu" and "river" appear in consecutive positions in the comment data to be segmented, the second frequency threshold is 0.6, and the product of the corresponding first and second probabilities exceeds 0.6, then the consecutive "yellow", "pu" and "river" can be divided into one word "yellow pu river"; when the number of characters in a second character group exceeds 3, the combined use frequency can be determined in a similar manner, which is not described in more detail herein;

it should be noted that the above word segmentation rules associated with the language forms are only exemplary descriptions, and those skilled in the art can determine the word segmentation rule associated with each language form according to actual business requirements and the language characteristics of that language form; for different language forms, what counts as a character can be set according to the actual situation, for example, when the language form is Chinese, one Chinese character can be one character, and when the language form is English, one English word can be one character.
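
A minimal sketch of the second segmentation rule above, assuming a pre-computed table of conditional character-bigram probabilities estimated from the historical comment data set; the greedy longest-match strategy is an illustrative choice, not one prescribed by the text:

    from typing import Dict, List, Tuple

    def combined_usage_frequency(chars: str,
                                 bigram_prob: Dict[Tuple[str, str], float]) -> float:
        # Product of P(next character | previous character) over consecutive pairs.
        freq = 1.0
        for a, b in zip(chars, chars[1:]):
            freq *= bigram_prob.get((a, b), 0.0)
        return freq

    def segment(text: str, bigram_prob: Dict[Tuple[str, str], float],
                threshold: float = 0.6, max_word_len: int = 4) -> List[str]:
        # Greedily extend each word while the combined usage frequency of the
        # consecutive characters stays above the second frequency threshold.
        words, i = [], 0
        while i < len(text):
            end = i + 1
            for j in range(i + 2, min(i + max_word_len, len(text)) + 1):
                if combined_usage_frequency(text[i:j], bigram_prob) > threshold:
                    end = j
            words.append(text[i:end])
            i = end
        return words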

The following further describes the training process of the above comment data classification model.

As an embodiment, in the process of training the comment data classification model in the embodiment of the present application, the historical comment data may be labeled with a data type and then used as training samples, and the comment data classification model may be trained based on a training sample set composed of a plurality of such training samples. In the training process, the data type labeled on each historical comment data is used as a labeled data type; the historical text feature of each historical comment data is extracted through the trained language learning submodel; the data type of each historical comment data is estimated through the prediction submodel according to the historical text feature, and the estimated data type is used as a predicted data type; the prediction error of the comment data classification model is determined based on the deviation information between the labeled data type and the predicted data type corresponding to each historical comment data; and the parameters of the prediction submodel are adjusted toward the direction of reducing the prediction error of the comment data classification model until a second training end condition is met, whereupon the trained language learning submodel and the current prediction submodel are output as the trained comment data classification model;

the type of the marked data can be wrongly-written comment data or non-wrongly-written comment data, and if one piece of historical comment data contains target comment content for wrongly-written characters, the historical comment data can be marked as wrongly-written comment data; if one piece of history comment data does not contain the target comment content for the wrongly written word, the history comment data can be marked as non-wrongly written comment data.

As an embodiment, the second training end condition is not limited in this embodiment, and those skilled in the art may set it according to actual requirements; for example, but not limited to, the second training end condition may be set as one or any combination of the following training end conditions C1 to C3: training end condition C1) the number of training operations performed on the comment data classification model reaches a second count threshold; training end condition C2) the duration of the training operations performed on the comment data classification model reaches a second duration threshold; training end condition C3) the prediction error of the comment data classification model currently being trained is less than a second prediction error threshold; and so on.
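
A schematic loop showing how the training end conditions C1 to C3 might be combined; train_one_round, the thresholds, and the use of all three conditions at once are placeholder assumptions:

    import time

    def train_comment_classifier(model, training_samples, train_one_round,
                                 count_threshold=100, duration_threshold=3600.0,
                                 error_threshold=0.05):
        start = time.time()
        for round_index in range(count_threshold):               # condition C1
            prediction_error = train_one_round(model, training_samples)
            if time.time() - start >= duration_threshold:        # condition C2
                break
            if prediction_error < error_threshold:                # condition C3
                break
        return model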

As an embodiment, in order to improve the efficiency and flexibility of identifying wrongly written information in media content based on target comment data for the media content, in the embodiment of the present application, the target comment data in the target comment data set for the media content may be input into the comment data classification model in batches, or input one by one, for identification.

As an embodiment, in order to further improve the accuracy of identifying wrongly written characters in media contents in different language forms, historical comment data sets in different language forms can be respectively used to train corresponding language learning submodels and comment data classification models, so as to obtain language learning submodels and comment data classification models associated with the different language forms; when wrongly written characters of media content are identified, the language form of the media content can be determined as a target language form, comment data in the target language form is selected from the comment data set for the media content and processed as target comment data, and the comment data classification model associated with the target language form is then used to process the target comment data and determine whether the target comment data is wrongly written comment data; the specific processing can refer to the foregoing content and is not repeated here; the language learning submodel in the comment data classification model associated with the target language form is the trained language learning submodel associated with the target language form.

The following further describes the process in step S204 of analyzing the target comment content based on the regular expression and determining the wrongly written character information in the media content:

firstly, the regular expression related in the embodiment of the application is further explained; the regular expression may include, but is not limited to, at least one reference text for indicating a wrongly written word, and at least one of information such as a wrongly written word placeholder and a language symbol for indicating a wrongly written word, where the wrongly written word placeholder may be in one of the reference texts or between two reference texts, and a person skilled in the art may set the reference text and the wrongly written word placeholder in the regular expression according to actual needs, for example, the regular expression may be, but is not limited to, in the form of the following regular expressions 1 to 5:

regular expression 1: "S1";

regular expression 2: "S1-P1, S2";

regular expression 3: "P1-S1";

regular expression 4: "D1-P1-D2";

regular expression 5: "D1-P1-D2-E1";

in the regular expressions 1 to 5, S1 and S2 are different reference texts, P1 is a placeholder for a wrongly written word, D1 and D2 are different character segments in the same reference text, and E1 is a placeholder for a target word, which may be, but is not limited to, a placeholder for a correct word corresponding to the wrongly written word.

As an embodiment, in order to improve the accuracy of analyzing the target comment content, in the embodiment of the application, different regular expression sets may be preconfigured for comment data in different language forms, and then the target comment content is analyzed based on each regular expression in the regular expression set associated with the language form of the target comment data; for ease of understanding, the following regular expression sets associated in chinese are given by way of example in chinese, and may include, but are not limited to, the following examples 1 through 5:

example 1: "there is a word written incorrectly";

example 2: the "which" in the text seems to be wrongly written;

example 3: the "Wen" word was written incorrectly;

example 4: the "Wen" in "Liwen" was written incorrectly;

example 5: not "Liwen", but should be "Liwen" (the two "Liwen" representing different written forms of the name);

wherein, example 1 above is an example of regular expression 1, in which "there is a word written incorrectly" is S1 in regular expression 1; example 2 above is an example of regular expression 2, where "in the text" and "seems to be wrongly written" are S1 and S2 in regular expression 2, respectively, and "which" is P1 in regular expression 2; example 3 above is an example of regular expression 3, where "Wen" is P1 in regular expression 3 and "word was written incorrectly" is S1 in regular expression 3; example 4 above is an example of regular expression 4, where the segment containing in "Liwen" and the segment "was written incorrectly" are D1 and D2 in regular expression 4, respectively, and "Wen" is P1 in regular expression 4; example 5 above is an example of regular expression 5, where "not" and "but should be" are D1 and D2 in regular expression 5, the first "Liwen" (the wrongly written form) is P1 in regular expression 5, and the second "Liwen" (the correct form) is E1 in regular expression 5.
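
For illustration only, the sketch below encodes rough English analogues of the example patterns with Python regular expressions; the literal reference texts and group names are assumptions, and an actual system would use patterns written in the language form of the comments (more specific patterns are tried before the generic one):

    import re

    # Hypothetical English analogues of regular expressions 1 to 5 above.
    PATTERNS = [
        # example 5: not "X", but should be "Y"
        re.compile(r'not "(?P<wrong>[^"]+)",? but should be "(?P<correct>[^"]+)"'),
        # example 4: the "X" in "YX" was written incorrectly
        re.compile(r'the "(?P<wrong>[^"]+)" in "[^"]+" was written incorrectly'),
        # example 2: the "X" in the text seems to be wrongly written
        re.compile(r'the "(?P<wrong>[^"]+)" in the text seems to be wrongly written'),
        # example 3: the "X" word was written incorrectly
        re.compile(r'the "(?P<wrong>[^"]+)"(?: word)? was written incorrectly'),
        # example 1: a wrongly written word exists but is not named
        re.compile(r'there is a word written incorrectly'),
    ]

    def parse_target_comment(comment: str):
        # Returns (wrongly written word or None, correct word or None) for the
        # first matching pattern; returns None when no pattern matches at all.
        for pattern in PATTERNS:
            match = pattern.search(comment)
            if match:
                groups = match.groupdict()
                return groups.get("wrong"), groups.get("correct")
        return None

    print(parse_target_comment('not "Liwen", but should be "Liwen2"'))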

As an embodiment, the number of regular expressions in the regular expression set associated with each language form is not particularly limited and can be set by those skilled in the art according to actual requirements; in step S204, regular expressions may be selected in turn from the regular expression set associated with the language form of the target comment data, so as to analyze the target comment content in the target comment data.

Specifically, in step S204, the target comment content included in the target comment data may be analyzed based on the selected regular expression to obtain a corresponding analysis result; at least one wrongly-written word associated with the target comment content and the text position information of the at least one wrongly-written word in the media content are determined based on the analysis result; and the at least one wrongly-written word and the text position information are determined as the wrongly written word information in the media content.

As an embodiment, in order to improve the accuracy of an obtained analysis result, in the embodiment of the present application, the character matching degree between the target comment data and the reference text in the regular expression may be, but is not limited to, determined as the analysis result; specifically, in step S204, for each selected regular expression, the following operations may be performed to determine the wrongly written information in the media content:

determining the character matching degree between the target comment data and a regular expression; and if the character matching degree is greater than the matching degree threshold, determining, based on the text position information of the wrongly-written character placeholder in the regular expression, the character at the text position of the wrongly-written character placeholder in the target comment data as the wrongly written character, and determining the text position of the wrongly-written character placeholder as the text position information of the wrongly written character.
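
A sketch of this step under the assumption that the character matching degree is a simple similarity ratio between the target comment data and the expression's reference text (difflib is used purely as an illustration, and locating the extracted word in the media content by string search is likewise an assumption):

    from difflib import SequenceMatcher

    def locate_wrongly_written_word(comment: str, reference_text: str,
                                    wrong_word: str, media_text: str,
                                    matching_threshold: float = 0.6):
        # Character matching degree between the comment and the reference text.
        matching_degree = SequenceMatcher(None, comment, reference_text).ratio()
        if matching_degree <= matching_threshold:
            return None
        # Text position of the wrongly written word in the media content.
        position = media_text.find(wrong_word)
        if position < 0:
            return None
        return wrong_word, position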

As an embodiment, in order to improve the accuracy of detecting the wrongly written words, in the embodiment of the present application, if the wrongly written word information is not obtained based on the analysis result in step S204, the detection result may be obtained by performing the wrongly written word detection on the media content based on a pre-configured wrongly written word detection rule, and it is determined whether the media content includes the corresponding wrongly written word information according to the detection result.

As an embodiment, the cases where the wrongly written word information is not obtained based on the analysis result may include, but are not limited to, the following case 1) and case 2): case 1) in the regular expression set associated with the language form of the target comment data, there is no regular expression whose character matching degree with the target comment data is greater than the matching degree threshold; case 2) in the regular expression set associated with the language form of the target comment data, the regular expression whose character matching degree with the target comment data is greater than the matching degree threshold does not indicate specific wrongly written information, for example, but not limited to, the case where that regular expression is the regular expression 1 described above.

As an embodiment, the preconfigured erroneous word detection rule may be, but is not limited to, directly detecting the erroneous word in the media content based on a trained erroneous word detection model, or performing, by an auditor, the detection of the erroneous word on the media content.

As an embodiment, the wrongly written information indicated in the target comment data is not necessarily reliable. Therefore, in order to improve the accuracy of identifying wrongly written characters, in the embodiment of the application, after it is determined based on the target text features that the target comment data contains target comment content for wrongly written characters, and before the wrongly written information in the media content is determined based on the target comment content, the confidence of the target comment data may be determined based on account information of the target account that issued the target comment data, and the wrongly written information in the media content is determined based on the target comment content only after the confidence of the target comment data reaches a confidence threshold. In this way, target comment data carrying unreliable information can be filtered out, which improves the accuracy of identifying wrongly written characters. The account information may be, but is not limited to, account profile data of the target account, also called an account portrait or user profile, which refers to tagged information of the user associated with the account; the account profile data may include, but is not limited to, at least one of the following: demographic and social attribute information such as the user's gender, age, place of residence, native place, height, education background, love and marriage status, asset condition, income condition and occupation; account information such as the account level, account assets and account credit of the account; and information mined from the historical behavior data of the account.
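
A toy sketch of the confidence filtering; the profile fields and weights below are purely hypothetical examples of the account information mentioned above:

    def comment_confidence(account_info: dict) -> float:
        # Normalize a few (hypothetical) account signals into [0, 1] and combine them.
        level = min(account_info.get("account_level", 0), 10) / 10.0
        credit = min(account_info.get("account_credit", 0), 100) / 100.0
        history = min(account_info.get("valid_typo_reports", 0), 20) / 20.0
        return 0.3 * level + 0.4 * credit + 0.3 * history

    def use_target_comment(account_info: dict, confidence_threshold: float = 0.5) -> bool:
        return comment_confidence(account_info) >= confidence_threshold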

As an embodiment, since the target comment data is not necessarily accurate, the wrongly written information obtained in step S204 may be incorrect; therefore, in the embodiment of the present application, after the wrongly written information is obtained in step S204, it may be fed back to a content auditor, who determines whether the wrongly written information is correct and, if it is, corrects the wrongly written characters indicated by the wrongly written information in the media content.

As an embodiment, an example of a complete flow of a method for identifying wrongly written words is provided in the following, please refer to fig. 7, which specifically includes the following steps:

step S701 is to select currently unprocessed target comment data from a comment data set managed for the published media content.

Step S702, inputting the selected target comment data into a trained comment data classification model, and extracting target text characteristics corresponding to the target comment data through a language learning submodel in the comment data classification model.

And step S703, identifying the selected target comment data through a prediction submodel in the comment data classification model.

Step S704, determining whether the target comment data is wrongly written comment data, if so, proceeding to step S705, otherwise, proceeding to step S709.

Step S705, analyzing the target comment data based on the regular expression associated with the language form of the media content, and obtaining an analysis result.

Step S706, determining whether the analysis result contains the indication information of the wrongly written words, if yes, entering step S707, otherwise, entering step S708;

the wrongly written word indication information may be, but is not limited to, at least one wrongly written word associated with the target comment content in the target comment data, and the text position information of the at least one wrongly written word in the media content.

In step S707, the wrongly written information in the media content is determined based on the analysis result, and the process proceeds to step S709.

Step S708, performing a detection of the wrongly written characters on the media content based on a pre-configured detection rule of the wrongly written characters to obtain a detection result, and determining whether the media content includes corresponding information of the wrongly written characters according to the detection result, and proceeding to step S709.

Step S709, determining whether unprocessed target comment data exists in the comment data set associated with the media content, if yes, proceeding to step S701, otherwise, ending the processing.

The specific contents of the steps S701 to S709 can be referred to the above description, and the description is not repeated here.
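
Pulling the steps together, S701 to S709 could be sketched roughly as below, where classify_comment, parse_comment_for_typo and fallback_detection stand in for the comment data classification model, the regular-expression parsing and the pre-configured wrongly written character detection rule, respectively:

    def identify_wrongly_written_words(media_text, comment_set, classify_comment,
                                       parse_comment_for_typo, fallback_detection):
        wrongly_written_info = []
        for comment in comment_set:                               # S701 / S709
            if not classify_comment(comment):                     # S702 - S704
                continue
            parsed = parse_comment_for_typo(comment, media_text)  # S705 - S706
            if parsed is not None:                                # S707
                wrongly_written_info.append(parsed)
            else:                                                 # S708
                wrongly_written_info.extend(fallback_detection(media_text))
        return wrongly_written_info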

Please refer to fig. 8, which provides a specific example of the method for identifying wrongly written characters, explained here by taking an article as an example. After a user of the content platform creates an article, the article can be published to the content platform through a content production end, and the content platform can then distribute the article to content consumption ends through a content distribution outlet. As the article is continuously exposed and recommended in the information stream, users of the content platform can click and read the article and trigger comment operations on it; the wrongly written character recognition server 130 can then take the comment data obtained from these comment operations as target comment data, recognize the target comment data based on the comment data classification model, determine whether the target comment data contains target comment content for wrongly written characters, and, when it does, determine the wrongly written information in the article based on the target comment content. Further, the target comment data and the wrongly written information can be fed back to a content auditor of a manual auditing system; the content auditor can judge whether the wrongly written information is correct and, if it is, correct the wrongly written characters indicated by the wrongly written information.

On one hand, in the embodiment of the application, wrongly written character information in the media content can be identified based on the target comment data for the media content, which improves the accuracy of identifying wrongly written characters in the media content; experiments show that more than 90% of the wrongly written information in wrongly written comment data can be recognized by the wrongly written character recognition method provided in the embodiment of the application, and the recognition accuracy of the comment data classification model on wrongly written comment data can reach more than 95%; on the other hand, the content auditor can correct the wrongly written characters in the media content based on the identified wrongly written character information without checking the entire media content, which improves the efficiency of handling wrongly written characters in the media content.

Referring to fig. 9, based on the same inventive concept, an embodiment of the present application provides an apparatus 900 for identifying wrongly written characters, including:

a data acquisition unit 901 configured to acquire target comment data for the published media content;

a feature extraction unit 902, configured to extract, according to context information of each word included in the target comment data, a target text feature corresponding to the target comment data;

a first identifying unit 903, configured to determine, based on the target text feature, that the target comment data includes target comment content for a wrongly written word;

the second identifying unit 904 determines wrongly written information in the media content based on the target comment content.

As an embodiment, the feature extraction unit 902 is specifically configured to:

inputting the target comment data into a trained comment data classification model;

based on a language learning submodel in the comment data classification model, performing feature extraction on context information of each word contained in the target comment data to obtain target text features corresponding to the target comment data;

the language learning submodel is obtained by training the language learning submodel for feature learning based on the context information of each word contained in the training sample by using the historical comment data as the training sample.

As an embodiment, if the comment data classification model further includes a predictor model, the first identifying unit 903 is specifically configured to:

inputting the target text characteristics into the predictor model;

predicting a second degree of association between the target text feature and a target data recognition result based on a learned first degree of association of the prediction submodel, wherein the first degree of association is determined based on a degree of association between a history text feature corresponding to history comment data and the target data recognition result, and the target data recognition result is used for representing comment contents for wrongly written words in the text data;

and if the second relevance is greater than the relevance threshold, determining that the target comment data contains the target comment content.

As an embodiment, the feature extraction unit 902 is further configured to train the language learning submodel by:

based on the historical comment data set, training the language learning submodel, wherein one training operation comprises the following steps: respectively executing text prediction operation on each historical comment data obtained from the historical comment data set, and determining a prediction deviation corresponding to each historical comment data; based on the prediction deviation corresponding to each historical comment data, parameter adjustment is carried out on the language learning submodel;

wherein the text prediction operation comprises:

performing word segmentation processing on one historical comment data according to a word segmentation rule associated with the language form of the one historical comment data in each historical comment data to obtain at least one word contained in the one historical comment data;

shielding part of words in the at least one word based on a preset word mask; and

determining context information of the partial words in the historical comment data, and selecting candidate words of which the matching degree with the determined context information meets the matching degree condition from a pre-configured candidate word bank, wherein the candidate word bank is determined based on the historical comment data set;

and determining the deviation information between the partial words and the selected candidate words as the prediction deviation corresponding to the historical comment data.

As an embodiment, the second identifying unit 904 is specifically configured to:

analyzing the target comment content based on a pre-configured regular expression for identifying the wrongly written word information to obtain a corresponding analysis result;

determining at least one wrongly-written word associated with the target comment content and text position information of the at least one wrongly-written word in the media content based on the analysis result;

and determining the at least one wrongly-written word and the text position information as the wrongly written word information in the media content.

As an embodiment, the second identifying unit 904 is further configured to:

if the wrongly written character information is not obtained based on the analysis result, carrying out wrongly written character detection on the media content based on a pre-configured wrongly written character detection rule to obtain a detection result;

and determining whether the media content contains corresponding wrongly written information or not according to the detection result.

As an embodiment, the second identifying unit 904 is configured to:

acquiring account information of a target account issuing the target comment data;

determining the confidence of the target comment data based on the account information;

and when the confidence coefficient reaches a confidence coefficient threshold value, determining wrongly written information in the media content based on the target comment content.

As an example, the apparatus in fig. 9 may be used to implement any of the wrongly written character recognition methods discussed above.

Based on the same inventive concept as the method embodiments, an embodiment of the present application further provides a computer device. The computer device may be used for the wrongly written character recognition described above. In one embodiment, the computer device may be a server, such as the wrongly written character recognition server 130 shown in fig. 1. In this embodiment, the computer device may be configured as shown in fig. 10, and includes a memory 1001, a communication module 1003, and one or more processors 1002.

The memory 1001 is configured to store computer programs executed by the processor 1002. The memory 1001 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, programs required for running at least one function, and the like; the data storage area may store data created during operation, operation instruction sets, and the like.

The memory 1001 may be a volatile memory, such as a random-access memory (RAM); the memory 1001 may also be a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); or the memory 1001 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1001 may also be a combination of the above memories.

The processor 1002 may include one or more central processing units (CPUs), a digital processing unit, and the like. The processor 1002 is configured to implement the above method for identifying wrongly written characters when invoking the computer program stored in the memory 1001.

The communication module 1003 is used for communicating with the terminal device and other servers.

In the embodiment of the present application, the specific connection medium among the memory 1001, the communication module 1003, and the processor 1002 is not limited. In fig. 10, the memory 1001 and the processor 1002 are connected by a bus 1004, the bus 1004 is represented by a thick line in fig. 10, and the connection manner between other components is merely illustrative and not limited. The bus 1004 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 10, but this is not intended to represent only one bus or type of bus.

The memory 1001 stores therein a computer storage medium, and the computer storage medium stores computer-executable instructions for implementing the method for identifying wrongly written characters according to the embodiment of the present application. The processor 1002 is configured to execute the above-described method for identifying wrongly written characters.

Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media capable of storing program codes, such as a removable Memory device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, and an optical disk.

Alternatively, the integrated unit of the invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the above-mentioned method for recognizing wrongly written words according to the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.

Based on the same technical concept, the embodiment of the present application also provides a computer-readable storage medium, which stores computer instructions that, when executed on a computer, cause the computer to perform the method for identifying wrongly-written words as discussed above.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
