Search-engine-based data optimization method, device, and equipment

Document No.: 1742159 Publication date: 2019-11-26

Reading note: This technology, "Search-engine-based data optimization method, device, and equipment", was designed by Zhang Hong (张洪) on 2019-08-02. Main content: Embodiments of this specification disclose a search-engine-based data optimization method, device, and equipment. The method comprises: performing a search crawl on data to be optimized by means of a search engine, to obtain crawled data of the data to be optimized; establishing relationship mapping data based on the crawled data of the data to be optimized; performing feature extraction on the relationship mapping data, to obtain feature extraction data; and optimizing the data to be optimized based on fused data of the feature extraction data, wherein the fused data is the data obtained by fusing the feature extraction data.

1. A search-engine-based data optimization method, comprising:

performing a search crawl on data to be optimized by means of a search engine, to obtain crawled data of the data to be optimized;

establishing relationship mapping data based on the crawled data of the data to be optimized;

performing feature extraction on the relationship mapping data, to obtain feature extraction data; and

optimizing the data to be optimized based on fused data of the feature extraction data, wherein the fused data is the data obtained by fusing the feature extraction data.

2. The method of claim 1, wherein performing the search crawl on the data to be optimized by means of the search engine, to obtain the crawled data of the data to be optimized, specifically comprises:

performing statistical analysis on the data to be optimized, to obtain a search seed; and

performing a search crawl based on preset rules, with the search seed as the search condition, to obtain the crawled data of the data to be optimized.

3. The method of claim 2, wherein performing statistical analysis on the data to be optimized, to obtain the search seed, specifically comprises:

performing statistical analysis with an enterprise name library as the data to be optimized, wherein the statistical analysis comprises statistical analysis of the number of enterprises associated with each word in the enterprise name library and of the length of each word in the enterprise name library; and

using each word in the enterprise name library as a search seed.

4. The method of claim 1, wherein establishing the relationship mapping data based on the crawled data of the data to be optimized specifically comprises:

based on the crawled data of the data to be optimized, establishing, by means of entity recognition, relationship mapping data between the search seed and the crawled data of the data to be optimized.

5. The method of claim 1, wherein performing feature extraction on the relationship mapping data, to obtain the feature extraction data, specifically comprises:

performing feature extraction based on the relationship mapping data and the data to be optimized, to obtain features of the data to be optimized.

6. The method of claim 5, wherein the features of the data to be optimized comprise:

an enterprise abbreviation sharing degree, an enterprise abbreviation recall degree, an enterprise abbreviation ambiguity degree, an enterprise abbreviation expressiveness degree, and an enterprise abbreviation containment degree;

the enterprise abbreviation sharing degree is $S_1$ [formula not reproduced in the source text];

the enterprise abbreviation recall degree is $S_2 = \log(N_{doc} + 1)$;

the enterprise abbreviation ambiguity degree is $S_3$ [formula not reproduced in the source text];

the enterprise abbreviation expressiveness degree is $S_4 = \log(N_{same\_doc} + 1)$;

the enterprise abbreviation containment degree is $S_5 = \log(N_{contains\_com} + 1)$;

wherein

$N_{com}$ is the number of enterprises corresponding to the enterprise abbreviation, $N_{doc}$ is the number of documents in which the enterprise abbreviation appears, $N_{doc\_com}$ is the number of enterprises corresponding to the search links of the enterprise abbreviation, $N_{same\_doc}$ is the number of search links shared by different enterprise abbreviations, and $N_{contains\_com}$ is the number of times the enterprise abbreviation is contained.

7. The method of any one of claims 1 to 6, wherein optimizing the data to be optimized based on the fused data of the feature extraction data specifically comprises:

fusing the feature extraction data with the weights corresponding to the feature extraction data, to obtain a scoring result for the data to be optimized; and

applying threshold filtering to the scoring result, so as to optimize the data to be optimized.

8. A search-engine-based data optimization device, comprising:

a crawling unit, which performs a search crawl on data to be optimized by means of a search engine, to obtain crawled data of the data to be optimized;

an association unit, which establishes relationship mapping data based on the crawled data of the data to be optimized;

an extraction unit, which performs feature extraction on the relationship mapping data, to obtain feature extraction data; and

an optimization unit, which optimizes the data to be optimized based on fused data of the feature extraction data, wherein the fused data is the data obtained by fusing the feature extraction data.

9. The device of claim 8, wherein performing the search crawl on the data to be optimized by means of the search engine, to obtain the crawled data of the data to be optimized, specifically comprises:

performing statistical analysis on the data to be optimized, to obtain a search seed; and

performing a search crawl based on preset rules, with the search seed as the search condition, to obtain the crawled data of the data to be optimized.

10. The device of claim 9, wherein performing statistical analysis on the data to be optimized, to obtain the search seed, specifically comprises:

performing statistical analysis with an enterprise name library as the data to be optimized, wherein the statistical analysis comprises statistical analysis of the number of enterprises associated with each word in the enterprise name library and of the length of each word in the enterprise name library; and

using each word in the enterprise name library as a search seed.

11. The device of claim 8, wherein establishing the relationship mapping data based on the crawled data of the data to be optimized specifically comprises:

based on the crawled data of the data to be optimized, establishing, by means of entity recognition, relationship mapping data between the search seed and the crawled data of the data to be optimized.

12. The device of claim 8, wherein performing feature extraction on the relationship mapping data, to obtain the feature extraction data, specifically comprises:

performing feature extraction based on the relationship mapping data and the data to be optimized, to obtain features of the data to be optimized.

13. The device of claim 12, wherein the features of the data to be optimized comprise:

an enterprise abbreviation sharing degree, an enterprise abbreviation recall degree, an enterprise abbreviation ambiguity degree, an enterprise abbreviation expressiveness degree, and an enterprise abbreviation containment degree;

the enterprise abbreviation sharing degree is $S_1$ [formula not reproduced in the source text];

the enterprise abbreviation recall degree is $S_2 = \log(N_{doc} + 1)$;

the enterprise abbreviation ambiguity degree is $S_3$ [formula not reproduced in the source text];

the enterprise abbreviation expressiveness degree is $S_4 = \log(N_{same\_doc} + 1)$;

the enterprise abbreviation containment degree is $S_5 = \log(N_{contains\_com} + 1)$;

wherein

$N_{com}$ is the number of enterprises corresponding to the enterprise abbreviation, $N_{doc}$ is the number of documents in which the enterprise abbreviation appears, $N_{doc\_com}$ is the number of enterprises corresponding to the search links of the enterprise abbreviation, $N_{same\_doc}$ is the number of search links shared by different enterprise abbreviations, and $N_{contains\_com}$ is the number of times the enterprise abbreviation is contained.

14. The device of any one of claims 8 to 13, wherein optimizing the data to be optimized based on the fused data of the feature extraction data specifically comprises:

fusing the feature extraction data with the weights corresponding to the feature extraction data, to obtain a scoring result for the data to be optimized; and

applying threshold filtering to the scoring result, so as to optimize the data to be optimized.

15. Search-engine-based data optimization equipment, comprising:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to:

perform a search crawl on data to be optimized by means of a search engine, to obtain crawled data of the data to be optimized;

establish relationship mapping data based on the crawled data of the data to be optimized;

perform feature extraction on the relationship mapping data, to obtain feature extraction data; and

optimize the data to be optimized based on fused data of the feature extraction data, wherein the fused data is the data obtained by fusing the feature extraction data.

Technical Field

This specification relates to the field of computer technology, and in particular to a search-engine-based data optimization method, device, and equipment.

Background

An enterprise name is the textual symbol that distinguishes one enterprise from others. It consists, in order, of four parts: the administrative division in which the enterprise is located, the trade name, the industry or business characteristics, and the organizational form. Industrial and commercial registration and tax registration additionally assign the enterprise a unique registration code. Of the four parts, the trade name is the most prominent feature distinguishing different enterprises.

At present, the public's perception of an enterprise is usually formed through the enterprise's abbreviation, products, or brands. For example, "China Construction Bank Co., Ltd.", "China Construction Bank Co., Ltd. Beijing Branch", and "China Construction Bank Co., Ltd. Shanghai Huangpu Sub-branch" are all perceived by the public as "Construction Bank". Likewise, in application scenarios such as search engines, users retrieve information by searching for "Construction Bank". Public perception of large enterprises such as "China Mobile", "China Life", and "KFC" is similarly formed through abbreviations, products, or brands. Referring to an enterprise by its abbreviation, product, or brand is thus a common cultural phenomenon. A single enterprise may also have multiple abbreviations, brand images, or product promotions, and therefore many ways of being referred to. For example, "Alibaba China Technology", "Alibaba China", "Alibaba Network", and "Alibaba" are abbreviations of "Alibaba Network Technology Co., Ltd.", which also has brand names or products such as "Taobao", "Lingshoutong", and "Tmall".

Because some enterprises have too many abbreviations, products, or brands, their enterprise information can be over-expressed, so that irrelevant information is attached to the enterprise. Alternatively, some abbreviations, products, or brands exist under multiple enterprises, so that using them as attachment information for an enterprise produces inaccurate attachments. Or the meaning of an abbreviation, product, or brand is too broad, or varies between scenarios, which makes it unsuitable as attachment information for an enterprise. For these reasons, the set of key nouns that can be used to refer to an enterprise, such as enterprise names, abbreviations, products, and brands, needs to be optimized.

Summary

Embodiments of this specification provide a search-engine-based data optimization method, device, and equipment, to solve the following technical problem: improving the accuracy and credibility of the data to be optimized.

To solve the above technical problem, the embodiments of this specification are implemented as follows:

A search-engine-based data optimization method provided by an embodiment of this specification comprises:

performing a search crawl on data to be optimized by means of a search engine, to obtain crawled data of the data to be optimized;

establishing relationship mapping data based on the crawled data of the data to be optimized;

performing feature extraction on the relationship mapping data, to obtain feature extraction data; and

optimizing the data to be optimized based on fused data of the feature extraction data, wherein the fused data is the data obtained by fusing the feature extraction data.

A search-engine-based data optimization device provided by an embodiment of this specification comprises:

a crawling unit, which performs a search crawl on data to be optimized by means of a search engine, to obtain crawled data of the data to be optimized;

an association unit, which establishes relationship mapping data based on the crawled data of the data to be optimized;

an extraction unit, which performs feature extraction on the relationship mapping data, to obtain feature extraction data; and

an optimization unit, which optimizes the data to be optimized based on fused data of the feature extraction data, wherein the fused data is the data obtained by fusing the feature extraction data.

An embodiment of this specification also provides search-engine-based data optimization equipment, comprising:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to:

perform a search crawl on data to be optimized by means of a search engine, to obtain crawled data of the data to be optimized;

establish relationship mapping data based on the crawled data of the data to be optimized;

perform feature extraction on the relationship mapping data, to obtain feature extraction data; and

optimize the data to be optimized based on fused data of the feature extraction data, wherein the fused data is the data obtained by fusing the feature extraction data.

At least one of the above technical solutions adopted by the embodiments of this specification can achieve the following beneficial effects:

In this embodiment, relationship mapping data between the data to be optimized and the crawl results is established from the crawl results for the data to be optimized. Feature extraction is then performed on the relationship mapping data, and the extracted features are fused, so as to optimize the data to be optimized. This reduces the search workload during data optimization and improves the accuracy and credibility of the data to be optimized.

Brief Description of the Drawings

To describe the technical solutions in the embodiments of this specification or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some of the embodiments recorded in this specification; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a framework diagram of a search-engine-based data optimization method provided by an embodiment of the present application;

Fig. 2 is a schematic diagram of a query result provided by an embodiment of the present application;

Fig. 3 is a schematic diagram of a crawl result provided by an embodiment of the present application;

Fig. 4 is a flow chart of a search-engine-based data optimization method provided by an embodiment of the present application;

Fig. 5 is a logical structure diagram of a search-engine-based data optimization method provided by an embodiment of the present application;

Fig. 6 is a schematic diagram of a search-engine-based data optimization device provided by an embodiment of this specification.

Detailed Description

To help those skilled in the art better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification are described clearly and completely below with reference to the drawings in those embodiments. The described embodiments are merely some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of this specification without creative effort shall fall within the scope of protection of the present application.

As noted above, the public's perception of an enterprise is usually formed through the enterprise's abbreviation, products, or brands. In practical applications, an enterprise name library containing information such as enterprise abbreviations and brand names can be built from an enterprise alias table, so that the public can use it for enterprise queries. The enterprise name library is the set of key nouns that can be used to refer to an enterprise, such as enterprise names, abbreviations, products, and brands.

Table 1 is a schematic of enterprise aliases provided by an embodiment of the present application.

Table 1

As shown in Table 1, "Shanghai 2345 Network Technology Co., Ltd." has too many abbreviations, products, or brands, so that irrelevant information is attached to the enterprise. Most of these entries, such as "browser" and "cartoon", are vague in meaning. In practical applications this prevents good information fusion between the abbreviations, products, or brands and the enterprise's full name.

Table 2 is another schematic of enterprise aliases provided by an embodiment of the present application. "Nongtou Company", as an enterprise abbreviation, is attached to dozens of enterprises, which reduces the accuracy of the attachment information and produces inaccurate attachments.

Table 2

Table 3 is a further schematic of enterprise aliases provided by an embodiment of the present application. Here "financial exchange" is a broad concept and is not suitable for describing "Tianjin Financial Assets Exchange Co., Ltd.". When the meaning of an abbreviation, product, or brand is this broad, or varies between scenarios, the credibility with which it refers to the enterprise changes from one scenario to another.

Table 3

In practical applications such as enterprise search, the search keyword directly determines the quantity and validity of the retrieved information, and thus whether more, and more accurate, information can be recalled. Search engines are usually built on an inverted index, which uses words as index keys after operations such as word segmentation and rewriting. When performing an enterprise search, choosing which keyword to use as the search seed is therefore critical. In addition, because a search engine or other technology is involved, matching retrieved information to an enterprise requires entity extraction on that information to determine which enterprise the entity refers to.

As an enterprise develops, its products and brand names change, and its name and abbreviations may change as well. The information in the enterprise name library must change accordingly.

The number of existing enterprises is enormous: more than 50 million enterprises currently exist nationwide, and together with their abbreviations, products, and brands the data volume is huge. In the prior art, ensuring the accuracy of the enterprise name library requires manual annotation and verification. Given the data volume, manual annotation is costly, inaccurate, and unreliable. Machine learning algorithms also require annotated data, so they too cannot extract good data features to overcome the problems of the prior art.

In view of the above problems, an embodiment of the present application provides a search-engine-based data optimization method that reduces the search workload and improves the accuracy and credibility of an enterprise's abbreviations, products, or brands as attachment information for the enterprise. The technical solutions provided by the embodiments of the present application are described in detail below with reference to the drawings.

Fig. 1 is a framework diagram of a search-engine-based data optimization method provided by an embodiment of the present application. The method specifically includes the following steps:

Step S101: perform a search crawl on the data to be optimized by means of a search engine, to obtain crawled data of the data to be optimized.

The enterprise name library is the set of key nouns used to refer to enterprises, such as enterprise names, abbreviations, products, and brands, and can be used in scenarios such as enterprise search. To guarantee the accuracy of retrieved information, the enterprise name library needs to be optimized. In one embodiment of the application, the enterprise name library is the data to be optimized. It should be noted that using the enterprise name library as the data to be optimized is only an example. The data to be optimized can also be other data sets, such as full names and their abbreviations, or full names and the products corresponding to them.

In one embodiment of the application, obtaining the crawled data of the data to be optimized specifically includes:

performing statistical analysis on the data to be optimized, to obtain a search seed;

performing a search crawl based on preset rules, with the search seed as the search condition, to obtain the crawled data of the data to be optimized.

Before a search crawl is performed on the data to be optimized, a suitable search seed must be found. In this application, the search seed is obtained by statistical analysis of the data to be optimized. In one embodiment of the application, the enterprise name library is taken as the data to be optimized and analyzed statistically, and each word in the enterprise name library is used as a search seed. The statistical analysis covers both the number of enterprises associated with each word in the enterprise name library and the length of each word in the enterprise name library. In a specific implementation, to guarantee the accuracy of subsequent results, the number of enterprises associated with each word should be kept under control for the business at hand and should not exceed a quantity threshold, which depends on business characteristics and/or business scenarios.
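The association count described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the dictionary layout of the name library, the function names, and the threshold value are all assumptions made for the example.

```python
from collections import defaultdict

def count_enterprises_per_word(name_library):
    """Map each word (abbreviation, brand, product) to the set of enterprises it is attached to.

    name_library is a hypothetical layout: {enterprise full name: [words attached to it]}.
    """
    word_to_enterprises = defaultdict(set)
    for enterprise, words in name_library.items():
        for word in words:
            word_to_enterprises[word].add(enterprise)
    return word_to_enterprises

def words_within_quantity_threshold(name_library, max_enterprises):
    """Keep only words associated with at most max_enterprises enterprises."""
    counts = count_enterprises_per_word(name_library)
    return sorted(w for w, ents in counts.items() if len(ents) <= max_enterprises)

# Toy data; the threshold of 2 is an arbitrary illustrative value.
name_library = {
    "Enterprise A Co., Ltd.": ["Alpha", "Nongtou"],
    "Enterprise B Co., Ltd.": ["Nongtou"],
    "Enterprise C Co., Ltd.": ["Nongtou", "Gamma"],
}
kept = words_within_quantity_threshold(name_library, max_enterprises=2)
print(kept)
```

Here "Nongtou" is attached to three enterprises and is dropped, while "Alpha" and "Gamma" are kept.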

Because of the way search engines work, a search seed may be segmented into words, among other operations, when it is used for a search crawl. To guarantee the accuracy of the search-crawl results, in a specific implementation the length of each word in the enterprise name library is analyzed statistically, and only words whose length falls within a search threshold range are used as search seeds. The search threshold range is determined by business characteristics and/or business scenarios and/or the characteristics of the search engine itself.

In one embodiment of the application, the enterprise name library is the data to be optimized and the words it contains are used as search seeds, so a search seed can be an enterprise's abbreviation, brand, or product. For convenience in the following embodiments, an enterprise's abbreviations, brands, and products are collectively referred to as enterprise abbreviations.

A word used as a search seed must not be too short, since overly short search seeds often introduce large information errors. In one embodiment of the application, the search seed therefore has a length of at least 3 characters, and 3 characters is the lower limit of the search threshold range. Because the search engine's word segmentation affects the accuracy of the search-crawl results, a search seed should not be too long either; the maximum applicable seed length is the upper limit of the search threshold range. In one embodiment of the application, 12 characters is the upper limit of the search threshold range. In another embodiment of the application, the upper limit is the maximum input length that the search engine itself allows.
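The length filter described above might look like the following sketch. The bounds of 3 and 12 characters come from the embodiments in the text; the function and constant names are illustrative, and length is measured in Unicode code points, which is an assumption about what "character" means here.

```python
MIN_SEED_LENGTH = 3   # lower limit of the search threshold range (from the text)
MAX_SEED_LENGTH = 12  # upper limit in one embodiment (from the text)

def select_search_seeds(words, min_len=MIN_SEED_LENGTH, max_len=MAX_SEED_LENGTH):
    """Return the words whose length lies within the search threshold range."""
    return [w for w in words if min_len <= len(w) <= max_len]

# Toy candidates: a 2-character word, two 4-character words, and a 13-character word.
candidates = ["建行", "建设银行", "阿里巴巴", "x" * 13]
seeds = select_search_seeds(candidates)
print(seeds)
```

In the other embodiment mentioned above, `max_len` would instead be set to the maximum input length that the search engine allows.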

The search seed is searched through the search engine's search function to obtain crawled data. In a specific implementation, the preset rules for the search crawl can be configured through the search engine. The preset rules include, but are not limited to, a time filter condition and the number of search-crawl results returned. To keep up with updates to the data on the servers behind the search engine, the time filter condition can define a search period in which the initial search period is longer and subsequent time windows are shorter. Specifically, the first search crawl can use a longer period, for example one year. For search crawls after the first, a shorter period, for example one month, can be set as the subsequent time window, which reduces the volume of crawled data while still keeping the crawled data up to date.
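The time filter condition can be sketched as a small helper that returns the date range to crawl. The one-year and one-month periods are the examples given in the text; representing them as 365 and 30 days, and the function name itself, are assumptions made for illustration.

```python
from datetime import date, timedelta

INITIAL_WINDOW = timedelta(days=365)   # "one year" for the first crawl (approximation)
FOLLOW_UP_WINDOW = timedelta(days=30)  # "one month" for later crawls (approximation)

def crawl_window(today, is_first_crawl):
    """Return the (start, end) dates that the time filter condition should cover."""
    span = INITIAL_WINDOW if is_first_crawl else FOLLOW_UP_WINDOW
    return today - span, today

first = crawl_window(date(2019, 8, 2), is_first_crawl=True)
later = crawl_window(date(2019, 8, 2), is_first_crawl=False)
print(first, later)
```

How the resulting range is passed to a particular search engine depends on that engine's own filtering options, which the text does not specify.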

With the search seed as the query condition, query results for the search seed are obtained through the search engine. Because of the search engine's own settings, queries may be segmented or rewritten and results may be supplemented, so some of the returned query results have very little relevance to the data to be optimized. To reduce the amount of data handled in the subsequent optimization while preserving its accuracy, the query results are further filtered: according to business characteristics and/or business scenarios, the first N query results are taken as the crawled data. In one embodiment of the application, N is 100.

To facilitate subsequent data optimization, the crawled data includes basic information, which includes but is not limited to: the search seed (query keyword, enterprise abbreviation, etc.), title, abstract, body text, and search link. The crawled data is stored in structured form. Structured storage applies the principle of a tree-shaped file system to a single file, so that a single file can contain "subdirectories" just as a file system does; a "subdirectory" can contain deeper "subdirectories", and each "directory" can contain multiple files. Content that would otherwise require multiple files is thus saved into one file as a hierarchical tree.
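As a rough sketch of the two points above, the following keeps the first N results per search seed and stores them as one nested document. The field names mirror the basic information listed in the text; the `CrawledRecord` class, the `build_store` helper, and the use of nested JSON as a stand-in for tree-structured storage in a single file are assumptions for illustration, not the storage format of the application.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class CrawledRecord:
    seed: str      # search seed (query keyword, enterprise abbreviation, etc.)
    title: str
    abstract: str
    body: str
    link: str      # search link

def build_store(results_by_seed, top_n=100):
    """Nest the first top_n records of each seed under that seed's key."""
    return {
        seed: [asdict(record) for record in records[:top_n]]
        for seed, records in results_by_seed.items()
    }

# Toy records with placeholder URLs; top_n=2 is used only to keep the example small.
records = [
    CrawledRecord("Seed A", f"title {i}", "abstract", "body", f"https://example.com/{i}")
    for i in range(3)
]
store = build_store({"Seed A": records}, top_n=2)
serialized = json.dumps(store, ensure_ascii=False)  # one file's worth of tree-shaped content
```

With the default `top_n=100`, the helper matches the embodiment in which N is 100.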

Step S103: establish relationship mapping data based on the crawled data of the data to be optimized.

As noted above, because of the way search engines work, some query results returned by the search engine do not match what was expected. Fig. 2 is a schematic diagram of a query result provided by an embodiment of the present application. When "Lushang Electronics" is used as the search seed, the query result shown in Fig. 2 does not match the search seed. Entity recognition is therefore needed to establish an accurate association between the data to be optimized and the crawled data. Specifically, based on the crawled data of the data to be optimized, entity recognition is used to establish relationship mapping data between the search seed and the crawled data of the data to be optimized.

Common entity recognition techniques include:

(1) String matching, which uses string matching algorithms to compute string containment or literal similarity. This approach is simple but sometimes ambiguous. For example, a search result retrieved for "five or six" reads: The government is strictly controlling "three public" consumption, and high-end restaurants (special-topic reading) have closed one after another. Wu Feng, regional sales manager for industrial and commercial shops at Centaline Property, revealed that five or six high-end restaurants have closed in the Zhujiang New City and Tianhe area, with each 400....... In this result, "five or six" is ambiguous. Computing enterprises from the title, abstract, and body text of the same search link can suffer from the same accuracy problem.

(2) Enterprise entity recognition, which identifies enterprise entities using named entity recognition (NER). Named entity recognition, also called "proper name recognition", identifies entities with specific meaning in text, mainly person names, place names, organization names, and other proper nouns. Current entity recognition is reasonably accurate, but it requires investing resources in data annotation and verification.

In this application, enterprise entity recognition is the preferred way of identifying enterprise entities.
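The ambiguity of string matching noted in item (1) can be illustrated with a short sketch. The English strings are stand-ins for the example in the text, and the helper names are illustrative; `difflib.SequenceMatcher` is used here as one possible measure of literal similarity, which the text does not prescribe.

```python
from difflib import SequenceMatcher

def contains_abbreviation(abbreviation, text):
    """String-containment check: True if the abbreviation occurs literally in the text."""
    return abbreviation in text

def literal_similarity(a, b):
    """Literal similarity in [0, 1], computed with difflib's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()

abbreviation = "five or six"
snippet = "five or six high-end restaurants in the area have closed"

# The containment check succeeds even though the snippet refers to a count of
# restaurants rather than to any enterprise, which is the ambiguity described above.
matched = contains_abbreviation(abbreviation, snippet)
score = literal_similarity(abbreviation, snippet)
```

A match of this kind says nothing about whether the text actually mentions the enterprise, which is why the text prefers enterprise entity recognition.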

Enterprise entity recognition is then applied to the crawl results obtained with the search seed as the query keyword, forming relationship mapping data among search seed, enterprise name, and search link. Continuing the earlier embodiment, in which the enterprise abbreviation is the search seed, the resulting relationship mapping data links enterprise abbreviation, enterprise name, and search link.
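One way to picture the resulting mapping is the sketch below. The recognizer is passed in as a function because the text does not specify a particular NER model; `toy_recognizer`, the record layout, and the enterprise names are hypothetical and exist only to make the example runnable.

```python
from collections import defaultdict

def build_relationship_mapping(crawled_records, recognize_enterprises):
    """Build {search seed: {enterprise name: set of search links}} from crawled records.

    recognize_enterprises is any callable that returns the enterprise names found in a text.
    """
    mapping = defaultdict(lambda: defaultdict(set))
    for record in crawled_records:
        text = " ".join((record["title"], record["abstract"], record["body"]))
        for enterprise in recognize_enterprises(text):
            mapping[record["seed"]][enterprise].add(record["link"])
    return mapping

# Hypothetical stand-in for an enterprise entity recognizer: a lookup of known names.
KNOWN_ENTERPRISES = ["Enterprise A Co., Ltd.", "Enterprise B Co., Ltd."]

def toy_recognizer(text):
    return [name for name in KNOWN_ENTERPRISES if name in text]

crawled = [
    {"seed": "Abbrev X", "title": "Enterprise A Co., Ltd. news", "abstract": "", "body": "",
     "link": "https://example.com/1"},
    {"seed": "Abbrev X", "title": "Unrelated", "abstract": "", "body": "About Enterprise B Co., Ltd.",
     "link": "https://example.com/2"},
    {"seed": "Abbrev X", "title": "Enterprise A Co., Ltd. again", "abstract": "", "body": "",
     "link": "https://example.com/3"},
]
mapping = build_relationship_mapping(crawled, toy_recognizer)
```

Counts such as the number of enterprises or links per abbreviation can then be read directly from this nested structure.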

Step S105: feature extraction is carried out to the relationship maps data, obtains feature extraction data

Feature extraction is performed based on the association mapping data obtained in step S103 and the data to be optimized, yielding the features of the data to be optimized. Note in particular that these features are obtained through statistical analysis of the association mapping data and the data to be optimized.

Continuing the previous embodiment, the enterprise name-enterprise abbreviation features are obtained through statistical analysis of the enterprise abbreviation-enterprise name-search link association mapping data and the enterprise name library.

The features of the data to be optimized include: enterprise abbreviation sharing degree, enterprise abbreviation recall degree, enterprise abbreviation ambiguity degree, enterprise abbreviation expressiveness degree, and enterprise abbreviation inclusion degree. Specifically:

(1) Enterprise abbreviation sharing degree

The more enterprises an enterprise abbreviation corresponds to, the lower its sharing degree. For example, "Nongtou Company" stands for many enterprises; because the abbreviation is shared by many enterprises, the word is weakly representative. Let the number of enterprises corresponding to the enterprise abbreviation be $N_{com}$; the enterprise abbreviation sharing degree $S_1$ is then computed from $N_{com}$ [formula not reproduced in the source text].

(2) Enterprise abbreviation recall degree

The larger the frequency of occurrence or the document count of an enterprise abbreviation, the stronger the word's recall and the clearer its meaning. In practical application scenarios, the recall degree reflects how well the abbreviation is recognized by the public: people describe an enterprise using abbreviations with strong recall, so a higher recall degree indicates that the abbreviation is more widely recognized. Let the number of documents in which the enterprise abbreviation appears be $N_{doc}$; the enterprise abbreviation recall degree is $S_2 = \log(N_{doc} + 1)$.

Note in particular that the document count of an enterprise abbreviation is the number of documents in the crawled data in which the abbreviation appears. As described above, the crawled data obtained by searching with the enterprise abbreviation as the search seed is stored in structured form. Because structured storage saves content that would otherwise require multiple files into a single file organized as a tree with levels, one document counts as one occurrence of the enterprise abbreviation.

(3) Enterprise abbreviation ambiguity degree

A search link may be attached to multiple enterprises in the enterprise name library. For a single enterprise abbreviation, if the documents it recalls all correspond to the same enterprise, the information for that abbreviation is more concentrated and its meaning is clearer. The ratio of the abbreviation's document count to its enterprise count is therefore used to express the enterprise abbreviation ambiguity degree (a higher ratio means the word is more representative of a specific enterprise). Let the number of enterprises corresponding to the abbreviation's search links be $N_{doc\_com}$; the enterprise abbreviation ambiguity degree $S_3$ is then computed from this ratio [formula not reproduced in the source text].

(4) Enterprise abbreviation expressiveness degree

To make the expressiveness degree easier to understand, Fig. 3 shows a schematic diagram of crawl results provided by an embodiment of this application. A certain enterprise has three abbreviations: abbreviation 1, abbreviation 2, and abbreviation 3. Each is used as a search seed for searching and crawling, and crawled data is obtained. Abbreviation 1 and abbreviation 2 have 56 identical search links, abbreviation 1 and abbreviation 3 have 31, and abbreviation 2 and abbreviation 3 have 43. Abbreviation 2 therefore shares the most links with the others (56 + 43 = 99, against 87 for abbreviation 1 and 74 for abbreviation 3) and is the most representative.

Based on the example in Fig. 3, an enterprise abbreviation expressiveness index is constructed: the network of identical documents crawled under an enterprise's alternative names is computed in order to extract the alias that best represents the enterprise. Let the number of identical search links shared by the different abbreviations of an enterprise be $N_{same\_doc}$; the enterprise abbreviation expressiveness degree is $S_4 = \log(N_{same\_doc} + 1)$.

(5) Enterprise abbreviation inclusion degree

Take the following example: "China Hengye Investment Fund Management (Beijing) Co., Ltd. (hereinafter 'China Hengye')". Here "China Hengye" is contained in the full name and therefore expresses "China Hengye Investment Fund Management (Beijing) Co., Ltd." well. An enterprise abbreviation inclusion index is constructed to capture this. Let the inclusion count of the enterprise abbreviation be $N_{contains\_com}$; the enterprise abbreviation inclusion degree is $S_5 = \log(N_{contains\_com} + 1)$.
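The three features whose formulas are given above ($S_2$, $S_4$, $S_5$) share the form $\log(N + 1)$ and can be sketched as below. This is an illustrative sketch under stated assumptions: the source does not give the logarithm base, so the natural logarithm is assumed; the function names are hypothetical; and summing each abbreviation's pairwise shared-link counts to obtain $N_{same\_doc}$ is one plausible reading of the Fig. 3 example, not something the text states outright. $S_1$ and $S_3$ are omitted because their formulas are not available.

```python
import math
from typing import Dict, Tuple


def log_feature(count: int) -> float:
    """Shared form log(N + 1) used by S2, S4 and S5 (natural log assumed)."""
    return math.log(count + 1)


def recall_degree(n_doc: int) -> float:
    """S2: based on the number of documents in which the abbreviation appears."""
    return log_feature(n_doc)


def expressiveness_degree(n_same_doc: int) -> float:
    """S4: based on search links shared with the enterprise's other abbreviations."""
    return log_feature(n_same_doc)


def inclusion_degree(n_contains_com: int) -> float:
    """S5: based on the inclusion count of the abbreviation."""
    return log_feature(n_contains_com)


def shared_link_totals(pairwise_shared: Dict[Tuple[str, str], int]) -> Dict[str, int]:
    """Sum, per abbreviation, the links it shares with every other abbreviation.

    Assumption: this total is used as N_same_doc for that abbreviation.
    """
    totals: Dict[str, int] = {}
    for (a, b), n in pairwise_shared.items():
        totals[a] = totals.get(a, 0) + n
        totals[b] = totals.get(b, 0) + n
    return totals


# Pairwise counts from the Fig. 3 example.
fig3 = {("abbr1", "abbr2"): 56, ("abbr1", "abbr3"): 31, ("abbr2", "abbr3"): 43}
totals = shared_link_totals(fig3)
most_representative = max(totals, key=totals.get)
```

With the Fig. 3 counts, the totals are 87, 99, and 74 for abbreviations 1, 2, and 3, so abbreviation 2 receives the highest expressiveness degree, in line with the conclusion in the text.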

Step S107: optimize the data to be optimized based on the fused data of the feature extraction data, where the fused data is the data obtained by fusing the feature extraction data.

Based on the fused data obtained in the preceding steps, the data to be optimized is optimized by scoring it, which specifically includes:

fusing the feature extraction data with the weights corresponding to the feature extraction data to obtain a scoring result for the data to be optimized;

applying threshold filtering to the scoring result, so as to optimize the data to be optimized.

In a specific implementation, the score of the data to be optimized is denoted $S_{final}$, with $S_{final} = \sum_i w_i S_i$, where $S_i$ is the feature extraction data obtained in step S105 and $w_i$ is the weight corresponding to that feature extraction data. Note in particular that $w_i$ can be adjusted according to actual business characteristics and/or business scenarios. For the multiple abbreviations under one enterprise (enterprise abbreviations, brands, products, and so on), the scoring results are filtered by a threshold chosen according to actual business characteristics and/or business scenarios, so that the reference relationships under the enterprise are more reasonable.
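The weighted fusion and threshold filtering can be sketched as follows. The feature keys, the weight values, the threshold, and the function names are hypothetical, chosen only for illustration; the source leaves the weights and the threshold to be set per business scenario. Keeping abbreviations whose score is at or above the threshold is also an assumption, since the text does not state the direction of the filter.

```python
from typing import Dict


def fuse_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """S_final = sum_i w_i * S_i over the extracted features."""
    return sum(weights[name] * value for name, value in features.items())


def filter_by_threshold(scores: Dict[str, float], threshold: float) -> Dict[str, float]:
    """Keep abbreviations scoring at or above the threshold (assumed direction)."""
    return {abbr: s for abbr, s in scores.items() if s >= threshold}


# Hypothetical feature values and weights for two abbreviations of one enterprise.
weights = {"S2": 0.5, "S4": 0.25, "S5": 0.25}
features_by_abbr = {
    "abbr_a": {"S2": 4.0, "S4": 2.0, "S5": 2.0},
    "abbr_b": {"S2": 1.0, "S4": 0.0, "S5": 2.0},
}
scores = {abbr: fuse_score(f, weights) for abbr, f in features_by_abbr.items()}
kept = filter_by_threshold(scores, threshold=2.0)
```

Because the weights are plain data, they can be tuned per business scenario without changing the scoring code.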

The above describes the framework of the data optimization method provided by the embodiments of this specification. Below, based on the framework shown in Fig. 1, the data optimization process is described in detail for a specific embodiment in which an enterprise name library is the data to be optimized. Fig. 4 is a flowchart of a search-engine-based data optimization method provided by an embodiment of this application, which specifically includes:

Step S401: perform statistical analysis on the enterprise name library as the data to be optimized.

Step S403: determine the search seeds based on the statistical analysis results of the enterprise name library.

Step S405: based on the search seeds, search and crawl using a search engine to obtain crawled data.

Step S407: based on the crawled data, use entity recognition technology to form enterprise abbreviation-enterprise name-search link association mapping data.

Step S409: perform feature extraction based on the association mapping data and the statistical analysis results of the enterprise name library to obtain enterprise name-enterprise abbreviation features.

Note that the enterprise name-enterprise abbreviation features can be used to analyze how strongly an enterprise abbreviation refers to an enterprise name, and thus to optimize the enterprise name library.

Step S411: fuse the enterprise name-enterprise abbreviation features with the weights corresponding to those features, so as to optimize the enterprise name library.

To aid understanding of the data optimization method provided by the embodiments of this application, a combined explanation is given below with reference to the data optimization methods shown in Fig. 1 and Fig. 4. Fig. 5 is a logic structure diagram of a search-engine-based data optimization method provided by an embodiment of this application, which specifically includes:

Search seeds are obtained from the enterprise name library; the search engine is queried with the search seeds to obtain search links; information is crawled from the search links; and finally the links are attached to enterprises using enterprise entity recognition technology and the enterprise name library. The information crawled from each search link includes the title, abstract, and content corresponding to the link. In this logic, search seed (enterprise abbreviation)-enterprise name-search link association mapping data is established, feature extraction is then performed on the generated association mapping data to obtain enterprise name-enterprise abbreviation features, and the strength with which an enterprise abbreviation refers to an enterprise name is analyzed, thereby optimizing the enterprise name library.

Note that optimizing the enterprise name library as the data to be optimized is only an illustrative example of this application. Full names and their abbreviations, full names and their corresponding products, and so on can also serve as the data to be optimized.

With the data optimization method provided by the embodiments of this specification, search-engine-based crawling addresses the prior-art problem of excessive information volume when searching for the data to be optimized. Further, establishing association mapping data from the crawled data and optimizing on that basis can compress the data volume, reduce ambiguous expressions, and improve the accuracy and credibility of the expressions.

The above describes in detail a search-engine-based data optimization method. Correspondingly, this specification also provides a search-engine-based data optimization apparatus, as shown in Fig. 6.

Fig. 6 is a schematic diagram of a search-engine-based data optimization apparatus provided by an embodiment of this specification. The apparatus includes:

a crawling unit 601, which searches and crawls for the data to be optimized based on a search engine to obtain crawled data of the data to be optimized;

an association unit 603, which establishes association mapping data based on the crawled data of the data to be optimized;

an extraction unit 605, which performs feature extraction on the association mapping data to obtain feature extraction data;

an optimization unit 607, which optimizes the data to be optimized based on the fused data of the feature extraction data, where the fused data is the data obtained by fusing the feature extraction data.

The crawling unit 601 is specifically configured to: perform statistical analysis on the data to be optimized to obtain search seeds;

based on preset rules, search and crawl with the search seeds as search conditions to obtain the crawled data of the data to be optimized.

The crawling unit 601 is further configured to: perform statistical analysis on the enterprise name library as the data to be optimized, where the statistical analysis includes statistics on the number of enterprises associated with each word in the enterprise name library and on the length of each word in the enterprise name library;

use each word in the enterprise name library as a search seed.
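The statistical analysis performed by the crawling unit can be sketched as below. This is a minimal illustration under stated assumptions: the name library is modeled as a mapping from each word (abbreviation) to the enterprise names it is associated with, and the function names and the optional `min_length` filter are hypothetical. The source only says that search seeds are chosen "based on preset rules" and that each word is used as a seed, so the default keeps every word.

```python
from typing import Dict, Iterable, List


def seed_statistics(name_library: Dict[str, Iterable[str]]) -> Dict[str, Dict[str, int]]:
    """Per word: number of distinct associated enterprises and word length."""
    return {
        word: {"n_enterprises": len(set(names)), "length": len(word)}
        for word, names in name_library.items()
    }


def select_seeds(stats: Dict[str, Dict[str, int]], min_length: int = 1) -> List[str]:
    """Return the words to use as search seeds.

    min_length is a hypothetical preset rule; the default keeps every word.
    """
    return sorted(w for w, s in stats.items() if s["length"] >= min_length)


# Hypothetical name library entries.
library = {
    "Nongtou": ["Nongtou Company A", "Nongtou Company B", "Nongtou Company A"],
    "China Hengye": ["China Hengye Investment Fund Management (Beijing) Co., Ltd."],
}
stats = seed_statistics(library)
seeds = select_seeds(stats)
```

The `n_enterprises` count computed here is the same quantity the feature extraction step later uses as $N_{com}$.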

The association unit 603 is specifically configured to: based on the crawled data of the data to be optimized, use entity recognition technology to establish association mapping data between the search seeds and the crawled data of the data to be optimized.

The extraction unit 605 is specifically configured to: perform feature extraction based on the association mapping data and the data to be optimized to obtain the features of the data to be optimized.

The features of the data to be optimized include:

enterprise abbreviation sharing degree, enterprise abbreviation recall degree, enterprise abbreviation ambiguity degree, enterprise abbreviation expressiveness degree, and enterprise abbreviation inclusion degree;

the enterprise abbreviation sharing degree $S_1$ is computed from $N_{com}$ [formula not reproduced in the source text];

the enterprise abbreviation recall degree is $S_2 = \log(N_{doc} + 1)$;

the enterprise abbreviation ambiguity degree $S_3$ is computed from $N_{doc}$ and $N_{doc\_com}$ [formula not reproduced in the source text];

the enterprise abbreviation expressiveness degree is $S_4 = \log(N_{same\_doc} + 1)$;

the enterprise abbreviation inclusion degree is $S_5 = \log(N_{contains\_com} + 1)$;

where

$N_{com}$ is the number of enterprises corresponding to the enterprise abbreviation, $N_{doc}$ is the number of documents in which the enterprise abbreviation appears, $N_{doc\_com}$ is the number of enterprises corresponding to the abbreviation's search links, $N_{same\_doc}$ is the number of identical search links shared by the different abbreviations of an enterprise, and $N_{contains\_com}$ is the inclusion count of the enterprise abbreviation.

The optimization unit 607 is specifically configured to: fuse the feature extraction data with the weights corresponding to the feature extraction data to obtain a scoring result for the data to be optimized;

apply threshold filtering to the scoring result, so as to optimize the data to be optimized.

Based on the same idea, the embodiments of this specification also provide a search-engine-based data optimization device, including:

at least one processor; and

a memory communicatively connected to the at least one processor; where

the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to:

search and crawl for the data to be optimized based on a search engine to obtain crawled data of the data to be optimized;

establish association mapping data based on the crawled data of the data to be optimized;

perform feature extraction on the association mapping data to obtain feature extraction data;

optimize the data to be optimized based on the fused data of the feature extraction data, where the fused data is the data obtained by fusing the feature extraction data.

The foregoing describes specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The embodiments in this specification are described in a progressive manner. For the same or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, because the apparatus, electronic device, and non-volatile computer storage medium embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, see the corresponding parts of the method embodiments.

The apparatus, electronic device, and non-volatile computer storage medium provided by the embodiments of this specification correspond to the method. They therefore have beneficial technical effects similar to those of the corresponding method. Because the beneficial technical effects of the method have been described in detail above, those of the corresponding apparatus, electronic device, and non-volatile computer storage medium are not repeated here.

In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (for example, an improvement to circuit structures such as diodes, transistors, or switches) or a software improvement (an improvement to a method flow). With the development of technology, however, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. It therefore cannot be said that an improvement to a method flow cannot be implemented with a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic function is determined by the user programming the device. Designers program a digital system onto a single PLD themselves, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, this programming is nowadays mostly done with "logic compiler" software, which is similar to the software compilers used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. Those skilled in the art should also understand that a hardware circuit implementing a logical method flow can be readily obtained simply by logic-programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.

A controller can be implemented in any suitable manner. For example, a controller can take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller can also be implemented as part of the control logic of a memory. Those skilled in the art also know that, besides implementing a controller purely as computer-readable program code, the method steps can be logic-programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the means included in it for implementing various functions can also be regarded as structures within the hardware component. Alternatively, the means for implementing various functions can be regarded as both software modules implementing the method and structures within the hardware component.

The systems, apparatuses, modules, or units described in the above embodiments can be implemented by a computer chip or an entity, or by a product having a certain function. A typical implementation device is a computer. Specifically, the computer can be, for example, a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.

For convenience of description, the above apparatus is described in terms of units divided by function. Of course, when implementing one or more embodiments of this specification, the functions of the units can be implemented in one or more pieces of software and/or hardware.

Those skilled in the art should understand that the embodiments of this specification can be provided as a method, a system, or a computer program product. Accordingly, the embodiments of this specification can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of this specification can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.

This specification is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of this specification. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data optimization device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data optimization device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data optimization device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, which implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions can also be loaded onto a computer or other programmable data optimization device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape and disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, commodity, or device that includes the element.

This specification can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and so on that perform specific tasks or implement specific abstract data types. This specification can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.

The embodiments in this specification are described in a progressive manner. For the same or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, see the corresponding parts of the method embodiment.

The foregoing is merely embodiments of this specification and is not intended to limit this application. For those skilled in the art, various modifications and changes to this application are possible. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of this application shall fall within the scope of the claims of this application.
