Construction method and system of semi-automatic translation bilingual template based on patent data

Document No.: 1466022  Publication date: 2020-02-21

Abstract: This technology, "Construction method and system of semi-automatic translation bilingual template based on patent data", was designed by Zhang Xiaofei, Zhang Qian, Fan Tingting, Ge Yuhui and Zhu Mingang on 2019-11-01. The invention relates to the technical field of machine translation, and in particular to a method for constructing semi-automatic translation bilingual templates based on patent data and to a semi-automatic translation system. The method comprises the following steps: acquiring a sentence-aligned bilingual corpus in the patent field; screening out, from the bilingual corpus, bilingual sentences whose translation is problematic; splitting and clustering the bilingual sentences to form a bilingual database, extracting constants and variables from the bilingual database, and establishing translation bilingual templates; and filtering and manually checking the translation bilingual templates to obtain qualified templates that meet the requirements. A semi-automatic translation system is built with this construction method in order to solve the technical problems of poor template precision and inaccurate translation in prior-art patent translation.

1. A method for establishing a semi-automatic translation bilingual template based on patent data, characterized by comprising the following steps:

S1) obtaining a sentence-aligned bilingual corpus in the patent field;

S2) screening the obtained bilingual corpus according to preset semantic and grammatical screening conditions, to screen out bilingual sentences with translation problems in the patent field;

S3) splitting the screened problematic bilingual sentences into a source language database and a target language database, clustering the source language database, and forming a bilingual database by putting the clustered source language database in correspondence with the target language database;

S4) extracting commonly used vocabulary entries, phrases or sentence segments from the bilingual database and marking them as constants, and marking the remaining vocabulary entries, phrases or sentence segments as variables; performing word-meaning, sentence-meaning and position matching between the constants in the source language database and the constants in the target language database to determine the correspondence between them; performing word-meaning, sentence-meaning and position matching between the variables in the source language database and the variables in the target language database to determine the correspondence between them; and establishing a translation bilingual template;

S5) setting a preset coverage filtering threshold range, and removing the translation bilingual templates that fall outside this range, to obtain filtered translation bilingual templates;

S6) manually verifying the filtered translation bilingual templates, setting a preset accuracy threshold range, and removing the filtered translation bilingual templates that fall outside this range, to obtain qualified translation bilingual templates that meet the requirements.

2. The method for building a semi-automated bilingual translation template based on patent data according to claim 1, characterized in that: the preset semantic grammar screening conditions include sentences with multiple layers of modifications, sentences with complex logical relations, sentences with inserted components or sentences exceeding a certain length.

3. The method for building a semi-automated bilingual translation template based on patent data according to claim 2, characterized in that: the sentences with multiple layers of modification are sentences with more than three layers of modification.

4. The method for building a semi-automated bilingual translation template based on patent data according to claim 3, wherein: clustering is to classify sentences that are identical or similar in a source language database into one unit language database, thereby forming a plurality of unit language databases.

5. The method for building a semi-automated translation bilingual template based on patent data according to any one of claims 1 to 4, characterized in that: the constants comprise words, sentences, paragraphs, punctuation marks or special characters; the variables comprise words, sentences, paragraphs, punctuation marks or special characters; after determining the corresponding relation between the constant in the source language database and the constant in the target language database, performing attribute limitation on the constant in the source language database and the constant in the target language database; after the corresponding relation between the variable in the source language database and the variable in the target language database is determined, limiting the attributes of the variable in the source language database and the variable in the target language database.

6. The method for building a semi-automated bilingual translation template based on patent data according to claim 5, wherein: the constants are determined as nouns, noun phrases or fixed word collocations obtained by summarizing the translation characteristics of patents and analyzing patents.

7. The method for building a semi-automated bilingual translation template based on patent data according to claim 6, wherein: the coverage preset filtering threshold range is 1-7 constants.

8. The method for building a semi-automated bilingual translation template based on patent data according to claim 7, wherein: the accuracy preset threshold range is more than or equal to 3 constants.

9. The method for building a semi-automated bilingual translation template based on patent data according to claim 8, wherein: the languages of the bilingual corpus include two of English, German, Japanese, Korean, Russian, or French.

10. A semi-automatic translation system based on patent data, characterized by comprising:

the patent bilingual corpus extraction module is used for extracting bilingual corpuses with aligned bilingual sentences in the patent field and sending the extracted bilingual corpuses to the screening module;

the screening module is used for screening the obtained bilingual corpus according to preset semantic grammar screening conditions, screening out bilingual sentences with problems in translation in the patent field, and sending the bilingual sentences with problems to the clustering module;

the clustering module is used for splitting the screened problematic bilingual sentences into a source language database and a target language database, clustering the source language database, forming a bilingual database by putting the clustered source language database in correspondence with the target language database, and sending the bilingual database to the constant and variable extraction and template establishment module;

the constant and variable extraction and template establishment module is used for extracting commonly used vocabulary entries, phrases or sentence segments from the bilingual database and marking them as constants, and marking the remaining vocabulary entries, phrases or sentence segments as variables; performing word-meaning, sentence-meaning and position matching between the constants in the source language database and the constants in the target language database to determine the correspondence between them; performing word-meaning, sentence-meaning and position matching between the variables in the source language database and the variables in the target language database to determine the correspondence between them; establishing a translation bilingual template, and sending the translation bilingual template to the filtering module;

the filtering module is used for filtering the translation bilingual templates by removing, according to the preset coverage filtering threshold range, those that fall outside the range, to obtain filtered translation bilingual templates, and sending the filtered translation bilingual templates to the checking module;

and the checking module is used for removing, according to the preset accuracy threshold range, the filtered translation bilingual templates that fall outside the range, to obtain qualified translation bilingual templates that meet the requirements.

Technical Field

The invention relates to the technical field of machine translation, in particular to a construction method of a semi-automatic translation bilingual template based on patent data and a semi-automatic translation system.

Background

Machine translation, also called automatic translation, is the process of using a computer to convert one natural language (the source language) into another natural language (the target language). Along with the great progress of deep learning research in 2013, machine translation based on artificial neural networks has gradually developed. At its core is a deep neural network with a massive number of nodes (neurons) that can learn translation knowledge automatically from a corpus, so a large, high-quality corpus plays an important role in improving machine translation quality. At present, the translation quality of neural machine translation is much better than that of statistical and rule-based machine translation, but in some respects, such as word-order adjustment in the translation, it still cannot meet translation requirements.

As an important kind of knowledge in machine translation systems, the translation bilingual template is an indispensable resource in many current machine translation and computer-aided translation systems. In early machine translation systems, translation bilingual templates were usually extracted from the corpus by hand. Kitano hand-coded translation rules in his system, using manually written matching expressions as translation templates. However, as the corpus grows, the manual approach becomes increasingly difficult and introduces many errors. Other scholars have proposed automatic methods for constructing machine translation templates, based either on analogy learning or on structural alignment. The former requires a very large amount of similar bilingual corpus, and the latter requires a highly accurate sentence parser for both languages; because of these constraints, neither type of automatic extraction method achieves satisfactory accuracy.

Therefore, in order to solve the above problems, it is urgently needed to invent a construction method of a semi-automatic translation bilingual template based on patent data and a semi-automatic translation system.

Disclosure of Invention

The object of the invention is to provide a method for constructing semi-automatic translation bilingual templates based on patent data, together with a semi-automatic translation system built by means of this method, so as to solve the technical problems of poor template precision and inaccurate translation in prior-art patent translation.

The invention provides the following scheme:

A method for establishing a semi-automatic translation bilingual template based on patent data, which comprises the following steps:

S1) obtaining a sentence-aligned bilingual corpus in the patent field;

S2) screening the obtained bilingual corpus according to preset semantic and grammatical screening conditions, to screen out bilingual sentences with translation problems in the patent field;

S3) splitting the screened problematic bilingual sentences into a source language database and a target language database, clustering the source language database, and forming a bilingual database by putting the clustered source language database in correspondence with the target language database;

S4) extracting commonly used vocabulary entries, phrases or sentence segments from the bilingual database and marking them as constants, and marking the remaining vocabulary entries, phrases or sentence segments as variables; performing word-meaning, sentence-meaning and position matching between the constants in the source language database and the constants in the target language database to determine the correspondence between them; performing word-meaning, sentence-meaning and position matching between the variables in the source language database and the variables in the target language database to determine the correspondence between them; and establishing a translation bilingual template;

S5) setting a preset coverage filtering threshold range, and removing the translation bilingual templates that fall outside this range, to obtain filtered translation bilingual templates;

S6) manually verifying the filtered translation bilingual templates, setting a preset accuracy threshold range, and removing the filtered translation bilingual templates that fall outside this range, to obtain qualified translation bilingual templates that meet the requirements.

Preferably, the preset semantic grammar filtering condition includes a sentence with multiple layers of modifications, a sentence with a complex logical relationship, a sentence with an inserted component, or a sentence exceeding a certain length.

Preferably, the sentence with multiple layers of modification is a sentence with more than three layers of modification.

Preferably, clustering is to classify identical or similar sentences in the source language database into one unit language database, thereby forming a plurality of unit language databases.

Preferably, the constants include words, sentences, paragraphs, punctuation marks or special characters; the variables comprise words, sentences, paragraphs, punctuation marks or special characters; after determining the corresponding relation between the constant in the source language database and the constant in the target language database, performing attribute limitation on the constant in the source language database and the constant in the target language database; after the corresponding relation between the variable in the source language database and the variable in the target language database is determined, limiting the attributes of the variable in the source language database and the variable in the target language database.

Preferably, the constants are determined as nouns, noun phrases or fixed word collocations obtained by summarizing the translation characteristics of patents and analyzing patents.

Preferably, the coverage preset filtering threshold ranges from 1 to 7 constants.

Preferably, the accuracy preset threshold range is greater than or equal to 3 constants.

Preferably, the languages of the bilingual corpus include two of English, German, Japanese, Korean, or French.

The invention also includes a semi-automated translation system based on patent data, comprising:

the patent bilingual corpus extraction module 210 is configured to extract bilingual corpuses with aligned bilingual sentences in the patent field, and send the extracted bilingual corpuses to the screening module;

the screening module 220 is configured to screen the obtained bilingual corpus according to a preset semantic grammar screening condition, screen out bilingual sentences with problems in translation in the patent field, and send the bilingual sentences with problems to the clustering module;

the clustering module 230 is used for splitting the screened bilingual sentences with problems, splitting the bilingual sentences into a source language database and a target language database, clustering the source language database, forming a bilingual database by corresponding the clustered source language database and the target language database, and sending the bilingual database to the constant and variable extraction and template establishment module;

a constant and variable extraction and template creation module 240 for extracting the commonly used vocabulary entries, phrases or sentence segments from the bilingual database and marking them as constants, and the remaining vocabulary entries, phrases or sentence segments as variables; respectively carrying out word meaning, sentence meaning and position matching on the constant in the source language database and the constant in the target language database, and determining the corresponding relation between the constant in the source language database and the constant in the target language database; respectively carrying out word meaning, sentence meaning and position matching on the variable in the source language database and the variable in the target language database, determining the corresponding relation between the variable in the source language database and the variable in the target language database, establishing a translation bilingual template, and sending the translation bilingual template to a filtering module;

the filtering module 250 is used for filtering the translated bilingual template, removing the translated bilingual template outside the preset filtering threshold range of the coverage rate according to the preset filtering threshold range of the set coverage rate to obtain a filtered translated bilingual template, and sending the filtered translated bilingual template to the verification module;

and the checking module 260 is used for removing the filtered translation bilingual template outside the preset threshold range of the accuracy rate according to the preset threshold range of the accuracy rate to obtain a qualified translation bilingual template meeting the requirement.

The invention has the following beneficial effects:

1. The invention provides a method for establishing a semi-automatic translation bilingual template based on patent data. The templates are extracted specifically from patent documents, and the method combines big-data statistics with the characteristics of patents to form a semi-automatic way of building bilingual templates. Before templates are built, sentences that current machine translation is likely to handle poorly are identified from the characteristics of the patent field and long-accumulated knowledge, and statistical analysis of these sentences over big data yields a database. Semantic and grammatical analysis is then used to cluster and sort the problem sentences: they are split into a source language database and a target language database, the source language database is clustered, and the clustered source language database is put in correspondence with the target language database to form a bilingual database. Commonly used vocabulary entries, phrases or sentence segments are extracted from the bilingual database and marked as constants, and the remaining ones are marked as variables. Word-meaning, sentence-meaning and position matching is performed between the constants in the two databases to determine their correspondence, and likewise between the variables, and a translation bilingual template is established. The templates are then filtered and manually checked to obtain qualified translation bilingual templates. With these templates, the translation of patent words and sentences is more accurate and easier for people to understand; translation quality and coverage are greatly improved, and the quality of machine translation is improved.

2. The invention discloses a patent data-based construction method and a semi-automatic translation system for a semi-automatic translation bilingual template.

Drawings

FIG. 1 is a block flow diagram of a construction method of a semi-automatic bilingual translation template based on patent data according to the invention;

FIG. 2 is a block diagram of a semi-automated translation system according to the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

Referring to fig. 1, the present embodiment provides a method for building a semi-automatic bilingual translation template based on patent data, including the following steps:

S1) obtaining a sentence-aligned bilingual corpus in the patent field;

S2) screening the obtained bilingual corpus according to preset semantic and grammatical screening conditions, to screen out bilingual sentences with translation problems in the patent field;

S3) splitting the screened problematic bilingual sentences into a source language database and a target language database, clustering the source language database, and forming a bilingual database by putting the clustered source language database in correspondence with the target language database;

S4) extracting commonly used vocabulary entries, phrases or sentence segments from the bilingual database and marking them as constants, and marking the remaining vocabulary entries, phrases or sentence segments as variables; performing word-meaning, sentence-meaning and position matching between the constants in the source language database and the constants in the target language database to determine the correspondence between them; performing word-meaning, sentence-meaning and position matching between the variables in the source language database and the variables in the target language database to determine the correspondence between them; and establishing a translation bilingual template;

S5) setting a preset coverage filtering threshold range, and removing the translation bilingual templates that fall outside this range, to obtain filtered translation bilingual templates;

S6) manually verifying the filtered translation bilingual templates, setting a preset accuracy threshold range, and removing the filtered translation bilingual templates that fall outside this range, to obtain qualified translation bilingual templates that meet the requirements.
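As an illustration of the constant and variable marking in step S4, the following is a minimal sketch for the source-language side only. It assumes that the tokens shared, in order, by every sentence of one cluster are treated as constants and that each run of remaining tokens becomes one variable slot; the function names, the `<X1>` slot notation and the use of Python's `difflib` for alignment are assumptions made for this sketch and are not taken from the disclosure.

```python
from difflib import SequenceMatcher


def common_tokens(a, b):
    """Return the tokens of `a` that also occur, in the same order, in `b`."""
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    shared = []
    for block in matcher.get_matching_blocks():
        shared.extend(a[block.a:block.a + block.size])
    return shared


def build_template(sentences):
    """Build a source-side template from the sentences of one cluster.

    Tokens common to all sentences are kept as constants; every run of
    other tokens is collapsed into a numbered variable slot (hypothetical
    notation <X1>, <X2>, ...).
    """
    token_lists = [s.split() for s in sentences]
    constants = token_lists[0]
    for tokens in token_lists[1:]:
        constants = common_tokens(constants, tokens)

    template, slot, next_const, in_gap = [], 0, 0, False
    for token in token_lists[0]:
        if next_const < len(constants) and token == constants[next_const]:
            template.append(token)      # constant: copied verbatim
            next_const += 1
            in_gap = False
        elif not in_gap:
            slot += 1                   # start of a new variable slot
            template.append(f"<X{slot}>")
            in_gap = True
    return " ".join(template)


cluster = [
    "The satellite of claim 2 , wherein the feed array is configured",
    "The ground station of claim 7 , wherein the radio equipment is configured",
]
print(build_template(cluster))
```

A full bilingual template would additionally need the target-language side and the constant-to-constant and variable-to-variable correspondences described in S4, which this sketch does not attempt.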

Specifically, the preset semantic grammar filtering condition includes sentences having multiple layers of modifications, sentences having complex logical relations, sentences having inserted components, or sentences exceeding a certain length.

Specifically, the sentence with multiple layers of modification is a sentence with three or more layers of modification.

Specifically, clustering is to classify identical or similar sentences in a source language database into one unit language database, thereby forming a plurality of unit language databases.

Specifically, the constants include words, sentences, paragraphs, punctuation marks or special characters; the variables comprise words, sentences, paragraphs, punctuation marks or special characters; after determining the corresponding relation between the constant in the source language database and the constant in the target language database, performing attribute limitation on the constant in the source language database and the constant in the target language database; after the corresponding relation between the variable in the source language database and the variable in the target language database is determined, limiting the attributes of the variable in the source language database and the variable in the target language database.

Specifically, the constants are determined as nouns, noun phrases or fixed word collocations obtained by summarizing the translation characteristics of patents and analyzing patents.

Specifically, the coverage preset filtering threshold range is 1-7 constants.
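The coverage filter can be pictured with a short sketch. It assumes, purely for illustration, that a template is a string whose variable slots are written `<X1>`, `<X2>`, and so on, and that the "number of constants" is the count of non-slot tokens; the disclosure does not define how constants are counted, so both the representation and the function names are hypothetical.

```python
import re

# Hypothetical slot notation: <X1>, <X2>, ...
SLOT = re.compile(r"^<X\d+>$")


def count_constants(template):
    """Count the tokens of a template that are not variable slots."""
    return sum(1 for token in template.split() if not SLOT.match(token))


def filter_by_coverage(templates, low=1, high=7):
    """Keep only templates whose constant count lies within [low, high]."""
    return [t for t in templates if low <= count_constants(t) <= high]


candidates = [
    "<X1> <X2>",                           # no constants: removed
    "The <X1> of claim <X2>",              # three constants: kept
    "a b c d e f g h <X1>",                # eight constants: removed
]
print(filter_by_coverage(candidates))
```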

Specifically, the accuracy preset threshold range is greater than or equal to 3 constants.

Specifically, the languages of the bilingual corpus include two of English, German, Japanese, Korean, or French.

The invention provides a method for establishing a semi-automatic translation bilingual template based on patent data. The templates are extracted specifically from patent documents, and the method combines big-data statistics with the characteristics of patents to form a semi-automatic way of building bilingual templates. Before templates are built, sentences that current machine translation is likely to handle poorly are identified from the characteristics of the patent field and long-accumulated knowledge, and statistical analysis of these sentences over big data yields a database. Semantic and grammatical analysis is then used to cluster and sort the problem sentences: they are split into a source language database and a target language database, the source language database is clustered, and the clustered source language database is put in correspondence with the target language database to form a bilingual database. Commonly used vocabulary entries, phrases or sentence segments are extracted from the bilingual database and marked as constants, and the remaining ones are marked as variables. Word-meaning, sentence-meaning and position matching is performed between the constants in the two databases to determine their correspondence, and likewise between the variables, and a translation bilingual template is established. The templates are then filtered and manually checked to obtain qualified translation bilingual templates. With these templates, the translation of patent words and sentences is more accurate and easier for people to understand.

The clustering procedure is as follows. Clustering is based on the K-means clustering algorithm, and the distance between sentences is calculated using term frequency-inverse document frequency (TF-IDF). The specific steps are:

1) performing word segmentation on the bilingual corpus;

2) calculating a TF-IDF value for each segmented word;

3) setting the number K of cluster center points and randomly selecting K cluster center points as the initial center points;

4) calculating the distance between each object and each cluster center point using the TF-IDF values;

5) assigning each object to the cluster center point closest to it;

6) recalculating the distance from the points in each class to the center point of that class;

7) assigning each object to its nearest cluster center point;

8) repeating steps 6 and 7 until no object is reassigned or a maximum number of iterations is reached.
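The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the invention: word segmentation is reduced to whitespace splitting, the distance is taken to be cosine distance between TF-IDF vectors with a smoothed IDF, each center is updated as the mean of its members (the usual K-means update), and the optional `init` parameter for fixing the initial centers is added only to make the example reproducible. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]  # naive segmentation (assumption)
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, so terms present in every document keep a non-zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(tokens).items()} for tokens in tokenized]


def cosine_distance(a, b):
    """Cosine distance between two sparse vectors (assumed distance measure)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Component-wise mean of sparse vectors, used as the updated center."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def kmeans(vectors, k, init=None, max_iter=100, seed=0):
    """Cluster sparse vectors; returns one cluster label per vector."""
    rng = random.Random(seed)
    start = list(init) if init is not None else rng.sample(range(len(vectors)), k)
    centers = [dict(vectors[i]) for i in start]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: cosine_distance(vec, centers[c]))
            for vec in vectors
        ]
        if new_labels == labels:        # no object was reassigned: stop
            break
        labels = new_labels
        for c in range(k):
            members = [vec for vec, lab in zip(vectors, labels) if lab == c]
            if members:                 # keep the old center if a cluster is empty
                centers[c] = mean_vector(members)
    return labels


sentences = [
    "the satellite of claim 2 wherein the feed array receives the first signals",
    "the ground station of claim 7 wherein the equipment receives the first signals",
    "the toy of claim 1 wherein the unit comprises a microphone and a controller",
]
labels = kmeans(tfidf_vectors(sentences), k=2, init=[0, 2])
print(labels)
```

In the shortened example, the two signal-reception sentences share more weighted vocabulary with each other than with the toy sentence, so they fall into the same cluster.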

Specifically, consider the following five sentences.

Sentence one: The satellite of claim 2, wherein the feed array being configured to receive the second portion of the first signals includes being configured to receive the second portion of the first signals during a scheduled, periodic time of a known duration in which the communication in the first geographic region is absent.

Sentence two: The ground base station of claim 7, wherein the radio-frequency equipment being configured to receive the second portion of the first signals includes being configured to receive the second portion of the first signals during a scheduled, periodic time of a known duration in which the communication in the first geographic region is absent.

Sentence three: The satellite of claim 2, wherein the feed array being configured to receive the second portion of the first signals includes being configured to receive the second portion of the first signals during a scheduled, periodic time of a known duration in which the communication in the first geographic region is absent.

Sentence four: The satellite of claim 2, wherein the feed array being configured to receive the second portion of the first signals includes being configured to receive the second portion of the first signals in an allocated portion of a frequency band during a scheduled time of a known duration in which the communication in the first geographic region is absent.

Sentence five: the interactive talking about of claim 1, wherein the toy unit further comprises a microphone being connected with the controller ic and configured to acquire a voice input, and an audio codec processor being connected to the microphone and the controller ic, the audio codec processor comprising an adc and a dac, and being configured to process voice input acquired by the microphone and send the processed audio data to the controller ic.

Observing these five sentences and applying the clustering process above, sentences one, two, three and four are similar in meaning, structure and vocabulary and are grouped into one cluster, while sentence five forms a cluster of its own.
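The clustering steps can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the whitespace tokenizer, the smoothed IDF formula, the Euclidean distance on normalised TF-IDF vectors, the shortened example sentences, and the function names `tfidf_vectors` and `kmeans` are all assumptions made for illustration. Centers are recomputed as cluster means, as in standard K-means, and the optional `init` argument exists only to make the example reproducible (the description above selects the initial centers at random).

```python
import math
import random

def tfidf_vectors(docs):
    """Turn tokenized sentences into L2-normalised sparse TF-IDF vectors (dicts)."""
    n = len(docs)
    df = {}
    for doc in docs:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    vectors = []
    for doc in docs:
        vec = {}
        for word in doc:
            vec[word] = vec.get(word, 0.0) + 1.0 / len(doc)      # term frequency
        for word in vec:
            vec[word] *= math.log((1 + n) / (1 + df[word])) + 1  # smoothed IDF (assumed variant)
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({w: v / norm for w, v in vec.items()})
    return vectors

def distance(a, b):
    """Euclidean distance between two sparse vectors."""
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in set(a) | set(b)))

def mean_vector(vectors):
    """Component-wise mean of sparse vectors."""
    total = {}
    for vec in vectors:
        for w, v in vec.items():
            total[w] = total.get(w, 0.0) + v
    return {w: v / len(vectors) for w, v in total.items()}

def kmeans(vectors, k, init=None, max_iter=100, seed=0):
    """Return one cluster label per vector; init optionally fixes the initial center indices."""
    if init is None:
        init = random.Random(seed).sample(range(len(vectors)), k)
    centers = [dict(vectors[i]) for i in init]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: distance(v, centers[c])) for v in vectors]
        if new_labels == labels:          # no object was reassigned
            break
        labels = new_labels
        for c in range(k):                # recompute each center as the mean of its members
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = mean_vector(members)
    return labels

# Shortened stand-ins for the five example sentences (hypothetical abbreviations).
sentences = [
    "the satellite of claim 2 wherein the feed array receives the first signals",
    "the ground base station of claim 7 wherein the equipment receives the first signals",
    "the satellite of claim 2 wherein the feed array receives the first signals periodically",
    "the satellite of claim 2 wherein the feed array receives the first signals in a frequency band",
    "the toy of claim 1 wherein the toy unit comprises a microphone and an audio codec",
]
docs = [s.lower().split() for s in sentences]
labels = kmeans(tfidf_vectors(docs), k=2, init=[0, 4])
print(labels)
```

With the initial centers fixed to the first and the last sentence, the four signal-related sentences end up with one label and the toy sentence with the other, which mirrors the grouping described above. With random initial centers the result can differ from run to run.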

To ensure that constants and variables correspond accurately in the bilingual template, constraints must be imposed when the template is constructed: on each constant itself, on each variable itself, between constants, and between variables. These constraints include, but are not limited to, constraints on the source language or the target language itself, constraints on a constant or a variable itself, constraints between the source language and the target language, constraints between constants, and constraints between variables. For example, the constant knowledge base in the present invention is a knowledge base of high-frequency common vocabulary obtained by summarizing the characteristics of patent translation and analyzing patents; it includes nouns, noun phrases, fixed collocations and the like, and its entries may be one-to-one, one-to-many or many-to-one. When determining the variable correspondence between the source language database and the target language database, the similarity between each variable in the target language database and each variable in the source language database is calculated, and the pair with the highest probability in the resulting matrix is selected as a match; a matched variable no longer takes part in the similarity calculation for other variables, and the calculation proceeds through the target language database in sequence.
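The variable matching described above can be read as a sequential greedy assignment over a similarity matrix. The sketch below shows one such reading in Python; the function name `match_variables`, the matrix layout (rows are target variables, columns are source variables) and the example scores are assumptions for illustration, not values from the invention.

```python
def match_variables(similarity):
    """Greedily match target variables to source variables.

    similarity[t][s] is the similarity between target variable t and
    source variable s. Target variables are processed in sequence; each
    takes the most similar source variable that is still unmatched.
    Returns a dict mapping target index to source index.
    """
    used_sources = set()
    mapping = {}
    for t, row in enumerate(similarity):
        candidates = [(score, s) for s, score in enumerate(row) if s not in used_sources]
        if not candidates:            # more target variables than source variables
            break
        _, best_source = max(candidates)
        mapping[t] = best_source
        used_sources.add(best_source)  # a matched variable is excluded from later steps
    return mapping

# Hypothetical 3x3 similarity matrix.
scores = [
    [0.20, 0.90, 0.10],
    [0.80, 0.85, 0.30],
    [0.40, 0.60, 0.50],
]
mapping = match_variables(scores)
print(mapping)
```

In this example the second target variable is most similar to source variable 1, but that variable is already matched, so it falls back to source variable 0. A one-to-many or many-to-many correspondence would need a different assignment rule than this one-to-one sketch.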

According to the characteristics of the patent, the similarity calculation steps are as follows:

1) collecting Chinese and English monolingual corpus databases;

2) collecting and determining Chinese and English stop word databases;

3) performing word segmentation on the collected Chinese and English monolingual corpus database to form a word segmentation database;

4) performing a minimum-distance calculation on the word segmentation database using a Word2Vec model: for each Chinese word, finding the word at minimum distance from it and the corresponding distance, and likewise for each English word;

5) performing similarity calculation on each found word with the minimum distance by using a TF-IDF algorithm;

Specifically, the input Chinese sentence (CN_1) is translated into an English sentence (EN_2), and the input English sentence (EN_1) is translated into a Chinese sentence (CN_2);

Word segmentation is performed on CN_1, CN_2, EN_1 and EN_2 respectively to form CN_11, CN_21, EN_11 and EN_21;

The word segmentation results are processed to form CN_12, CN_22, EN_12 and EN_22;

A minimum-distance calculation is performed on each Chinese word of CN_12 and CN_22 with the Word2Vec model to find, for each word in CN_12, the word in CN_22 at minimum distance from it and the corresponding distance;

The similarity of CN_12 and CN_22 is calculated;

wherein λ > 0.

Similarly, the similarity SIM_EN of EN_12 and EN_22 is calculated;

[Formula image BDA0002257184770000112 from the original publication; the expression is not recoverable from the text.]
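The minimum-distance step of the similarity calculation can be sketched as follows. Because the similarity formulas themselves survive only as images, this Python sketch covers only the nearest-word search; the Euclidean distance, the two-dimensional toy vectors standing in for Word2Vec embeddings, and the names `euclidean` and `nearest_words` are assumptions for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_words(words_a, words_b, embedding):
    """For each word in words_a, find the closest word in words_b.

    Returns a dict: word -> (closest word, distance).
    """
    result = {}
    for wa in words_a:
        best = min(words_b, key=lambda wb: euclidean(embedding[wa], embedding[wb]))
        result[wa] = (best, euclidean(embedding[wa], embedding[best]))
    return result

# Toy 2-D vectors (hypothetical); a trained Word2Vec model would supply real ones.
embedding = {
    "satellite": (1.0, 0.0),
    "spacecraft": (0.9, 0.1),
    "antenna": (0.0, 1.0),
    "aerial": (0.1, 0.9),
}
pairs = nearest_words(["satellite", "antenna"], ["aerial", "spacecraft"], embedding)
for word, (match, dist) in pairs.items():
    print(word, match, round(dist, 4))
```

Here the first word list plays the role of the processed original sentence and the second the role of the processed back-translated sentence in the same language. The resulting distances would then feed the similarity formula, whose exact form and the role of λ are not reproduced in this text.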

The correspondence between the constants in the source language database and the constants in the target language database is not limited to one-to-one; it may also be one-to-many, many-to-one or many-to-many.

Likewise, the correspondence between the variables in the source language database and the variables in the target language database is not limited to one-to-one; it may also be one-to-many, many-to-one or many-to-many.

To ensure the accuracy of the template, once the correspondence between the constants in the source language database and the constants in the target language database has been determined, attribute restrictions are applied to those constants; once the correspondence between the variables in the source language database and the variables in the target language database has been determined, attribute restrictions are applied to those variables.

The attribute restrictions comprise a start attribute, an end attribute, a contains attribute, a does-not-contain attribute, a part-of-speech attribute and a length attribute. Restricting the attributes helps to reduce the coverage rate of the bilingual translation template and to improve its accuracy rate.
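As an illustration of how such attribute restrictions might be represented, the following Python sketch models the six attribute kinds as optional fields of a data class. The class name `AttributeConstraint`, its field names, the whitespace-based length measure and the example values are all hypothetical; the text does not specify a data format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AttributeConstraint:
    """Optional restrictions on the text that may fill a constant or variable slot."""
    starts_with: Optional[str] = None   # start attribute
    ends_with: Optional[str] = None     # end attribute
    contains: Optional[str] = None      # contains attribute
    not_contains: Optional[str] = None  # does-not-contain attribute
    pos_tag: Optional[str] = None       # part-of-speech attribute
    max_length: Optional[int] = None    # length attribute, in whitespace tokens (assumed unit)

    def accepts(self, text: str, pos: Optional[str] = None) -> bool:
        """Return True if text (with optional part-of-speech tag) satisfies every set restriction."""
        if self.starts_with is not None and not text.startswith(self.starts_with):
            return False
        if self.ends_with is not None and not text.endswith(self.ends_with):
            return False
        if self.contains is not None and self.contains not in text:
            return False
        if self.not_contains is not None and self.not_contains in text:
            return False
        if self.pos_tag is not None and pos != self.pos_tag:
            return False
        if self.max_length is not None and len(text.split()) > self.max_length:
            return False
        return True

# Hypothetical restriction on a variable slot: a short noun phrase starting with "the ".
slot = AttributeConstraint(starts_with="the ", not_contains="wherein", pos_tag="NP", max_length=4)
print(slot.accepts("the feed array", pos="NP"))
print(slot.accepts("the feed array wherein", pos="NP"))
```

Each added restriction rejects more candidate fillers, which is consistent with the trade-off stated above: fewer sentences match the template, but the matches that remain are more likely to be translated correctly.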

The patent content feature library, the patent linguistic constraint library, the patent knowledge library and the like involved in the method are built from big-data statistics combined with the experience summarized by patent translators, and can be applied to various fields including, but not limited to, patents.

The invention also includes a semi-automated translation system based on patent data, comprising:

the patent bilingual corpus extraction module 210 is configured to extract bilingual corpuses with aligned bilingual sentences in the patent field, and send the extracted bilingual corpuses to the screening module;

the screening module 220 is configured to screen the obtained bilingual corpus according to a preset semantic grammar screening condition, screen out bilingual sentences with problems in translation in the patent field, and send the bilingual sentences with problems to the clustering module;

the clustering module 230 is used for splitting the screened bilingual sentences with problems into a source language database and a target language database, clustering the source language database, forming a bilingual database by aligning the clustered source language database with the target language database, and sending the bilingual database to the constant and variable extraction and template creation module;

a constant and variable extraction and template creation module 240 for extracting the commonly used vocabulary entries, phrases or sentence segments from the bilingual database and marking them as constants, and the remaining vocabulary entries, phrases or sentence segments as variables; respectively carrying out word meaning, sentence meaning and position matching on the constant in the source language database and the constant in the target language database, and determining the corresponding relation between the constant in the source language database and the constant in the target language database; respectively carrying out word meaning, sentence meaning and position matching on the variable in the source language database and the variable in the target language database, determining the corresponding relation between the variable in the source language database and the variable in the target language database, establishing a translation bilingual template, and sending the translation bilingual template to a filtering module;

the filtering module 250 is used for filtering the translation bilingual templates by removing those whose coverage rate falls outside a preset filtering threshold range, so as to obtain filtered translation bilingual templates, and for sending the filtered translation bilingual templates to the checking module;

and the checking module 260 is used for removing the filtered translation bilingual template outside the preset threshold range of the accuracy rate according to the preset threshold range of the accuracy rate to obtain a qualified translation bilingual template meeting the requirement.
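The last two modules amount to two successive threshold filters. The Python sketch below assumes that a coverage rate and an accuracy rate have already been measured for each template; the `Template` class, the function names, the example template strings and all numeric thresholds are hypothetical and serve only to show the flow from module 250 to module 260.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Template:
    source: str      # source-language pattern with variable slots
    target: str      # target-language pattern with the same slots
    coverage: float  # share of corpus sentences the template matches (assumed precomputed)
    accuracy: float  # share of matches translated correctly (assumed precomputed)

def filter_by_coverage(templates: List[Template], low: float, high: float) -> List[Template]:
    """Filtering step: keep templates whose coverage lies inside [low, high]."""
    return [t for t in templates if low <= t.coverage <= high]

def check_accuracy(templates: List[Template], low: float, high: float = 1.0) -> List[Template]:
    """Checking step: keep templates whose accuracy lies inside [low, high]."""
    return [t for t in templates if low <= t.accuracy <= high]

# Hypothetical templates and rates.
candidates = [
    Template("The {X} of claim {N}, wherein {Y}", "根据权利要求{N}所述的{X}，其中{Y}", 0.12, 0.97),
    Template("{X} is {Y}", "{X}是{Y}", 0.80, 0.55),          # too general: coverage out of range
    Template("wherein {X} comprises {Y}", "其中{X}包括{Y}", 0.08, 0.70),  # accuracy too low
]
filtered = filter_by_coverage(candidates, low=0.01, high=0.50)
qualified = check_accuracy(filtered, low=0.90)
print([t.source for t in qualified])
```

In the described system the accuracy check also involves manual verification, so the second function only stands in for the final acceptance decision once an accuracy figure is available.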

The embodiment also provides a computer system suitable for realizing the construction method of the semi-automatic translation bilingual template based on the patent data and the semi-automatic translation system. The computer system includes a processor and a computer-readable storage medium. The computer system may perform a method according to an embodiment of the invention.

In particular, the processor may comprise, for example, a general purpose microprocessor, an instruction set processor and/or related chip set and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor may also include on-board memory for caching purposes. The processor may be a single processing unit or a plurality of processing units for performing the different actions of the method flow according to embodiments of the present invention.

Computer-readable storage media, for example, may be non-volatile computer-readable storage media, specific examples including, but not limited to: magnetic storage devices, such as magnetic tape or Hard Disk Drives (HDDs); optical storage devices, such as compact disks (CD-ROMs); a memory, such as a Random Access Memory (RAM) or a flash memory; and so on.

The computer-readable storage medium may comprise a computer program that may comprise code/computer-executable instructions that, when executed by a processor, cause the processor to perform a method according to an embodiment of the invention or any variant thereof.

The computer program may be configured with computer program code, for example comprising computer program modules. For example, in an example embodiment, code in the computer program may include one or more program modules, including, for example, a patent bilingual corpus extraction module 210, a screening module 220, a clustering module 230, a constant and variable extraction and template creation module 240, a filtering module 250, and a checking module 260. It should be noted that the division and number of modules are not fixed, and those skilled in the art may use suitable program modules or program module combinations according to actual situations, which, when executed by a processor, enable the processor to perform the method according to the embodiments of the present invention or any variations thereof.

According to an embodiment of the present invention, at least one of the above modules may be implemented as a computer program module, which when executed by a processor, may implement the respective operations described above.

The present invention also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the present invention.

According to embodiments of the present invention, the computer readable storage medium may be a non-volatile computer readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The semi-automatic translation system is suitable for machine translation in the patent field, and can effectively improve the quality and the precision of machine translation.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
