Chinese text audio program discovery method based on an Aho-Corasick pattern matching machine

Document No.: 1310321  Publication date: 2020-07-10

Reading note: This technology, a Chinese text audio program discovery method based on an Aho-Corasick pattern matching machine, was designed by 吴海旭 on 2019-10-28. Main content: The invention relates to a Chinese text audio program discovery method based on an Aho-Corasick pattern matching machine. Step one: construct a keyword lexicon for Chinese text audio programs. Step two: apply the Aho-Corasick pattern matching machine to automatically extract keywords from the titles and introductions of audio programs. Step three: if a Chinese text keyword is extracted, judge the program to be a Chinese text program. The invention discovers Chinese text audio programs efficiently and systematically, improving efficiency while greatly reducing enterprise labor costs.

1. A Chinese text audio program discovery method based on an Aho-Corasick pattern matching machine, characterized by comprising the following steps:

step one: constructing a keyword lexicon for Chinese text audio programs;

step two: applying the Aho-Corasick pattern matching machine to automatically extract keywords from the titles and introductions of the audio programs;

step three: if a Chinese text keyword is extracted, judging the program to be a Chinese text program.

2. The Chinese text audio program discovery method based on an Aho-Corasick pattern matching machine according to claim 1, wherein the keyword lexicon is constructed either by automatic crawling from vertical websites or by manual summarization from program titles and introductions.

3. The Chinese text audio program discovery method based on an Aho-Corasick pattern matching machine according to claim 1, wherein the Aho-Corasick pattern matching machine is constructed in two steps: in the first step, a goto function is constructed and construction of the output function begins; in the second step, a failure function is constructed and the output function is completed.

4. The Chinese text audio program discovery method based on an Aho-Corasick pattern matching machine according to claim 3, wherein constructing the goto function takes as input a keyword set K = {y1, y2, …, yk} and produces as output the goto function and a partial output function.

5. The Chinese text audio program discovery method based on an Aho-Corasick pattern matching machine according to claim 4, wherein constructing the failure function takes as input the goto function and the output function, and produces as output the failure function f and the output function output.

Technical Field

The invention belongs to the field of artificial intelligence, and particularly relates to a Chinese text audio program discovery method based on an Aho-Corasick pattern matching machine.

Background

With the rapid development of the mobile internet, audio products have sprung up like bamboo shoots after spring rain, greatly enriching people's cultural life. The number of audio programs is huge, and users who want to find their favorite programs quickly first need an efficient program discovery method. For example, a third-grade pupil who likes listening to recordings of Chinese texts needs to pick the Chinese text programs out of a large number of programs quickly. However, manually reviewing program titles and introductions for screening is tedious and labor-intensive, and it demands a fairly high level of literacy from the user. In addition, audio programs number in the hundreds of millions, so manual effort cannot screen Chinese texts quickly enough. An automatic, fast way to discover Chinese text audio programs is therefore needed.

At present there are two main ways to discover Chinese text audio programs. In the first, operators screen Chinese text programs manually, which is highly accurate but inefficient. In the second, a large number of Chinese text audio programs are labeled manually, a machine learning model is then built by supervised learning on features such as program titles and audio content, and programs are automatically classified into two classes on the basis of the pre-labeled data. The machine learning approach improves efficiency to some extent but has problems of its own. On the one hand, it depends on a large amount of labeled data, and the manual labeling is labor-intensive; on the other hand, the machine learning model is complex, and training and persisting the model are not lightweight.

In view of the above technical problems, the applicant proposes a Chinese text audio program discovery method based on an Aho-Corasick pattern matching machine, from which the present scheme arises.

Disclosure of Invention

The invention aims to provide a Chinese text audio program discovery method based on an Aho-Corasick pattern matching machine, so that Chinese text audio programs can be discovered quickly and efficiently.

To achieve this purpose, the invention provides the following technical scheme: a Chinese text audio program discovery method based on an Aho-Corasick pattern matching machine, comprising the following steps:

Step one: constructing a keyword lexicon for Chinese text audio programs;

Step two: applying the Aho-Corasick pattern matching machine to automatically extract keywords from the titles and introductions of the audio programs;

Step three: if a Chinese text keyword is extracted, judging the program to be a Chinese text program.

Further, the keyword lexicon is constructed either by automatic crawling from vertical websites or by manual summarization from program titles and introductions.

Further, the Aho-Corasick pattern matching machine is constructed in two steps: in the first step, a goto function is constructed and construction of the output function begins; in the second step, a failure function is constructed and the output function is completed.

Further, constructing the goto function takes as input a keyword set K = {y1, y2, …, yk} and produces as output the goto function and a partial output function.

Further, constructing the failure function takes as input the goto function and the output function, and produces as output the failure function f and the output function output.

The Aho-Corasick pattern matching machine is an automaton that matches every word in a lexicon in a single pass over a character string.

The principle of the Aho-Corasick pattern matching machine is as follows:

Let K be a set of keywords and x be a string. The problem is to locate and identify all substrings of x that are keywords in K. The Aho-Corasick pattern matching machine is a program whose input is x and whose output is the keywords of K matched in x, together with the positions at which they occur.

The Aho-Corasick pattern matching machine consists of a set of states, each represented by a number. It processes x by reading the characters of x one by one, making state transitions, and emitting output. The behavior of the matching machine is characterized by three functions: the goto function g, the failure function f, and the output function output.

The invention has the following beneficial effects: at present, the discovery of Chinese text audio programs relies on a large amount of manual labeling. On the one hand, manual labeling is inefficient; on the other hand, the labor it requires raises enterprise costs. With the Aho-Corasick pattern matching machine, Chinese text audio programs can be discovered efficiently and systematically, which improves efficiency and greatly reduces enterprise personnel costs.

Drawings

FIG. 1 is a flow chart of discovering Chinese text audio programs with the Aho-Corasick pattern matching machine; FIG. 2 shows the operation of the Aho-Corasick pattern matching machine.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

As shown in FIG. 1, the present embodiment discloses a Chinese text audio program discovery method based on an Aho-Corasick pattern matching machine, which includes the following steps:

Step one: constructing a keyword lexicon for Chinese text audio programs;

Step two: applying the Aho-Corasick pattern matching machine to automatically extract keywords from the titles and introductions of the audio programs;

Step three: if a Chinese text keyword is extracted, judging the program to be a Chinese text program.

Constructing the keyword lexicon:

To apply the Aho-Corasick pattern matching machine, a keyword lexicon must first be constructed.

The keyword lexicon is constructed in one of two ways:

First, by automatic crawling from vertical websites. For example, a large number of movie titles can be obtained from Douban, and a movie lexicon can be produced after data preprocessing.

Second, by manual summarization from program titles and introductions. The titles and introductions of audio programs are inspected by hand, and the corresponding keywords are summarized.
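As a minimal sketch of the data preprocessing mentioned above, the two keyword sources can be merged into one lexicon by normalizing, de-duplicating, and dropping empty entries. The function name `build_lexicon`, the NFKC normalization step, and the sample keywords are illustrative assumptions, not part of the disclosed method.

```python
import unicodedata


def build_lexicon(raw_keywords):
    """Normalize, de-duplicate, and filter candidate keywords.

    `raw_keywords` is any iterable of strings, for example lines crawled
    from a website or a list typed in by an operator (both hypothetical).
    """
    lexicon = []
    seen = set()
    for raw in raw_keywords:
        # NFKC folds full-width and half-width variants into one form.
        word = unicodedata.normalize("NFKC", raw).strip().lower()
        if not word or word in seen:
            continue  # skip empty entries and repeats
        seen.add(word)
        lexicon.append(word)
    return lexicon


crawled = ["语文课文 ", "课文朗读", "语文课文"]  # hypothetical crawl result
manual = ["三年级语文", "课文朗读", ""]          # hypothetical manual list
lexicon = build_lexicon(crawled + manual)
print(lexicon)
```

The order of first appearance is preserved, so the lexicon stays easy to review by hand.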

After the keyword lexicon is constructed, the Aho-Corasick pattern matching machine is constructed in two steps: in the first step, the goto function is constructed and construction of the output function begins; in the second step, the failure function is constructed and the output function is completed.

The specific algorithm is as follows:

(1) Constructing the goto function: the input is a keyword set K = {y1, y2, …, yk}; the output is the goto function g and a partial output function.

(2) Constructing the failure function: the input is the goto function g and the partial output function; the output is the failure function f and the completed output function output.
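The two construction steps above can be sketched as follows. This is an illustrative reading of the standard Aho-Corasick construction, not the applicant's code; the names `build_goto` and `build_failure` and the list-of-dicts representation are assumptions.

```python
from collections import deque


def build_goto(keywords):
    """Step (1): build the goto function (a trie) and the partial output."""
    goto = [{}]       # goto[state][char] -> next state; state 0 is the start
    output = [set()]  # output[state] -> keywords recognized in that state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(word)
    return goto, output


def build_failure(goto, output):
    """Step (2): build the failure function and complete the output."""
    failure = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states fail to the start state
    while queue:                     # breadth-first, shallow states first
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            # Follow failure links from r until a state has a goto on ch.
            state = failure[r]
            while state != 0 and ch not in goto[state]:
                state = failure[state]
            failure[s] = goto[state].get(ch, 0)
            # Keywords that end at the failure state also end at s.
            output[s] |= output[failure[s]]
    return failure


goto, output = build_goto(["he", "she", "his", "hers"])
failure = build_failure(goto, output)
print(failure)    # [0, 0, 0, 0, 1, 2, 0, 3, 0, 3]
print(output[5])  # the state reached by "she" also reports "he"
```

The breadth-first order matters: the failure state of s is always shallower than s, so its output set is already complete when it is merged into output[s].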

The constructed Aho-Corasick pattern matching machine can then match keywords. Its input is a character string x = a1a2…am; its output is the matched keywords and their positions in x.

The Aho-Corasick pattern matching machine is an automaton that matches every word in a lexicon in a single pass over a character string. The principle of the Aho-Corasick pattern matching machine is as follows:

Let K be a set of keywords and x be a string. The problem is to locate and identify all substrings of x that are keywords in K. The Aho-Corasick pattern matching machine is a program whose input is x and whose output is the keywords of K matched in x, together with the positions at which they occur.

The Aho-Corasick pattern matching machine consists of a set of states, each represented by a number. It processes x by reading the characters of x one by one, making state transitions, and emitting output. The behavior of the matching machine is characterized by three functions: the goto function g, the failure function f, and the output function output.

For example, for the keyword set {he, she, his, hers}, the three functions are shown in FIG. 1, where 0 is the initial state. The goto function takes a state and a character as input and outputs either another state or the signal fail. The failure function maps one state to another and is consulted whenever the goto function signals fail. The output function maps a state to a set of keywords (possibly empty).
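Because the figure is not reproduced here, the three functions for {he, she, his, hers} can be written out as plain data. The sketch below assumes the state numbering obtained by inserting the keywords in the order listed, and it assumes the usual convention that the initial state never fails but loops on unknown characters; `FAIL`, `GOTO`, `FAILURE`, and `OUTPUT` are illustrative names.

```python
FAIL = None  # stand-in for the "fail" signal

# goto function g; state 0 is the initial state.
GOTO = {
    0: {"h": 1, "s": 3},
    1: {"e": 2, "i": 6},
    2: {"r": 8},
    3: {"h": 4},
    4: {"e": 5},
    6: {"s": 7},
    8: {"s": 9},
}
# failure function f, defined for every state except 0.
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
# output function; states not listed map to the empty set.
OUTPUT = {2: {"he"}, 5: {"she", "he"}, 7: {"his"}, 9: {"hers"}}


def g(state, ch):
    """Return the next state, or FAIL when no transition exists."""
    nxt = GOTO.get(state, {}).get(ch, FAIL)
    if nxt is FAIL and state == 0:
        return 0  # assumed convention: the initial state loops, never fails
    return nxt


def output(state):
    """Return the (possibly empty) keyword set attached to a state."""
    return OUTPUT.get(state, set())


print(g(0, "h"))           # 1
print(g(0, "u"))           # 0
print(g(5, "r"))           # None, i.e. fail
print(g(FAILURE[5], "r"))  # 8: after failing from 5 to 2, "r" leads to 8
```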

The Aho-Corasick pattern matching machine operates as follows, where s is the current state and a is the current character.

1. If g(s, a) = s′, the machine moves to s′ and the next character becomes the current character. In addition, if output(s′) is not empty, the machine emits that set together with the position of the current character. This completes the cycle.

2. If g(s, a) = fail, the machine consults the failure function and makes a failure transition. If f(s) = s′, it takes s′ as the current state, keeps a as the current character, and returns to step 1.

When the Aho-Corasick pattern matching machine matches a character string, the time needed to extract keywords is linear in the length of the string, so the method is very efficient.
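The operating cycle and the final judgment step can be put together in one short sketch. It is a hedged illustration, not the patented implementation: the class `KeywordMatcher`, the function `is_chinese_text_program`, the 1-based end positions, and the sample lexicon and program titles are all assumptions.

```python
from collections import deque


class KeywordMatcher:
    """Minimal Aho-Corasick matcher (illustrative sketch)."""

    def __init__(self, keywords):
        self.goto = [{}]
        self.failure = [0]
        self.output = [set()]
        for word in keywords:
            self._insert(word)
        self._link()

    def _insert(self, word):
        state = 0
        for ch in word:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.failure.append(0)
                self.output.append(set())
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.output[state].add(word)

    def _link(self):
        queue = deque(self.goto[0].values())
        while queue:
            r = queue.popleft()
            for ch, s in self.goto[r].items():
                queue.append(s)
                state = self.failure[r]
                while state and ch not in self.goto[state]:
                    state = self.failure[state]
                self.failure[s] = self.goto[state].get(ch, 0)
                self.output[s] |= self.output[self.failure[s]]

    def search(self, text):
        """Yield (end_position, keyword) pairs; positions are 1-based."""
        state = 0
        for pos, ch in enumerate(text, start=1):
            # Case 2: goto fails, so follow failure links and keep ch current.
            while state and ch not in self.goto[state]:
                state = self.failure[state]
            # Case 1: goto succeeds (state 0 loops on unknown characters).
            state = self.goto[state].get(ch, 0)
            for word in sorted(self.output[state]):
                yield pos, word


def is_chinese_text_program(matcher, title, intro):
    """Step three: any keyword hit in the title or intro marks the program."""
    # The newline keeps a keyword from matching across the title/intro seam.
    return any(True for _ in matcher.search(title + "\n" + intro))


matcher = KeywordMatcher(["he", "she", "his", "hers"])
print(list(matcher.search("ushers")))  # [(4, 'he'), (4, 'she'), (6, 'hers')]

lesson_matcher = KeywordMatcher(["语文课文", "课文朗读"])  # hypothetical lexicon
print(is_chinese_text_program(lesson_matcher, "三年级语文课文朗读", ""))  # True
print(is_chinese_text_program(lesson_matcher, "流行音乐排行榜", "每周更新"))  # False
```

Each failure transition moves to a shallower state, and the depth grows by at most one per character read, so the total number of transitions stays proportional to the length of the string.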

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
