News coordinate matching method based on natural language processing

Document No.: 190630 | Publication date: 2021-11-02

Summary: This technique, "News coordinate matching method based on natural language processing" (一种基于自然语言处理的新闻坐标匹配方法), was designed by Xu Yan (许衍), Yang Peng (杨鹏), Fan Hongcheng (范宏城), Wu Xinyu (吴欣羽) and Liu Lei (刘磊) on 2021-02-02. The invention discloses a news coordinate matching method based on natural language processing. An address coordinate database, a news unit database and a feature word database are first established. News data is acquired, the news text is segmented, and a candidate event place array is built. Feature word matching is performed on each item of the candidate event place array according to the sum of the frequencies, in the feature word database, of all nearby feature words, giving a candidate event place frequency array. Weighted matching is then performed according to the position of each item in the article, giving a candidate event place weighted frequency array, and the item with the highest value is taken as the actual place where the event occurred (the final point). Place name matching is performed on the final point in the address coordinate database, and all POI points that may be the final point are screened out to form a candidate address array. Finally, the candidates are processed to match the news coordinates. The invention can identify the main place where a news event occurs and accurately display the event place on a map.

1. A news coordinate matching method based on natural language processing, characterized by comprising the following steps:

step one, crawling POI point data and province, city, district and street data from Amap (Gaode Map) through a software tool to establish an address coordinate database; acquiring the name and service range of each news unit from Baidu Map and Baidu Baike through a software tool to establish a news unit database; manually annotating the place where the event occurs in 10000 news items as manual training, recording the feature words near each event place at the same time, and establishing a feature word database according to the occurrence frequency of the different feature words;

step two, acquiring news data from major websites by using a software tool, wherein the data comprises the news article content, the news title and the news unit;

step three, segmenting the Chinese news text by using the ICTCLAS Chinese word segmentation tool, and identifying the words capable of representing places to form a candidate event place array;

step four, performing feature word matching on each item in the candidate event place array according to the sum of the frequencies, in the feature word database, of all feature words near the item, to obtain a candidate event place frequency array;

step five, performing weighted matching according to the position of each item of the array in the article, wherein an item appearing in the title is given the weight i, and items appearing in the body are given the weight j, decreased by h for each successive position, to obtain a candidate event place weighted frequency array; wherein the value range of i is [0.5, 1], the value range of j is [0.3, i], and the value range of h is [0.001, 0.03];

step six, taking the item with the highest value in the candidate event place weighted frequency array as the actual place of occurrence of the event, recorded as the final point;

step seven, performing place name matching on the final point in the address coordinate database, and screening out all POI points that may be the final point to form a candidate address array;

step eight, selecting in turn, from the candidate event place array, the candidate event places closest in position to the final point, the number of symbols separating a selected place from the final point being m and the number of paragraphs separating them being n; traversing the POI points of the candidate address array, calculating the closest distance k between each POI point and the selected candidate event place, and adding the value (n+1) × (m+1) × k to the weight of each POI point to obtain a weight sum array of the POI points, the coordinates of the point with the minimum value in the array being the matched news coordinates.

2. The news coordinate matching method based on natural language processing as claimed in claim 1, wherein the data in step one includes the name and coordinates of each POI point and the names of the province, city, district and street where the POI point is located.

3. The news coordinate matching method based on natural language processing as claimed in claim 1, wherein the news units in step one include news agencies, broadcasting stations, television stations, news magazine agencies, news documentary film studios and news photo agencies.

4. The news coordinate matching method based on natural language processing as claimed in claim 1, wherein the feature words in step one include one or more of verbs, adverbs, adjectives and punctuation marks.

5. The news coordinate matching method based on natural language processing as claimed in claim 1, wherein the words capable of representing places in step three include place names, transliterated place names, organization and group names, and place words.

6. The news coordinate matching method based on natural language processing as claimed in claim 1, wherein i is 0.99, j is 0.5, and h is 0.01 in step five.

Technical Field

The invention relates to a news coordinate matching method based on natural language processing, and belongs to the technical field of data mining and processing.

Background

News is narrative writing whose basic elements are person, time, place, event, cause and course of events. A news story, whether a brief, a correspondence piece or a feature, generally contains these elements. In other words, news is usually tied to a place; the few news items with no notion of place are outside the scope of this discussion.

A Chinese patent published on June 24, 2015 with publication number CN104731768 provides an event location extraction method for Chinese news text. It extracts three kinds of features from the news text (context, position and topology features) to form a feature vector, and uses a Random Forest classifier to identify event locations among the organization names, place nouns and place names obtained by word segmentation. On top of place name recognition, that method can further identify where a news event occurred, but it can only identify the event place in the text and cannot display the news on a map. The event place it identifies must therefore be converted into longitude and latitude coordinates before it can be shown on a map.

At present, the conventional approach is to convert an address into longitude and latitude coordinates through the geocoding API of Amap (Gaode Map) or Baidu Map. However, these APIs return only one coordinate, so when the location information is incomplete and several regions contain a place of the same name, a wrong coordinate is likely to be returned. The patent "A news event place name address matching method based on geographic feature hierarchical segmentation", publication number CN105404686, enables fast text capture of news events in an online network environment, Chinese word segmentation of news texts and matching of place name addresses, but it cannot handle places of the same name in several regions.

Therefore, there is an urgent need for a coordinate matching method capable of accurately showing an event location in news on a map.

Disclosure of Invention

In view of the problems in the prior art, the invention provides a news coordinate matching method based on natural language processing, which can identify the main place where a news event occurs and display that place on a map; and when several places share the name of the event place in the news, the correct one can be determined.

In order to achieve the purpose, the invention adopts the following technical scheme: a news coordinate matching method based on natural language processing comprises the following steps:

step one, crawling POI point data and province, city, district and street data from Amap (Gaode Map) through a software tool to establish an address coordinate database; acquiring the name and service range of each news unit from Baidu Map and Baidu Baike through a software tool to establish a news unit database; manually annotating the place where the event occurs in 10000 news items as manual training, recording the feature words near each event place at the same time, and establishing a feature word database according to the occurrence frequency of the different feature words;

step two, acquiring news data from major websites by using a software tool, wherein the data comprises the news article content, the news title and the news unit;

step three, segmenting the Chinese news text by using the ICTCLAS Chinese word segmentation tool, and identifying the words capable of representing places to form a candidate event place array;

step four, performing feature word matching on each item in the candidate event place array according to the sum of the frequencies, in the feature word database, of all feature words near the item, to obtain a candidate event place frequency array;

step five, performing weighted matching according to the position of each item of the array in the article, wherein an item appearing in the title is given the weight i, and items appearing in the body are given the weight j, decreased by h for each successive position, to obtain a candidate event place weighted frequency array; wherein the value range of i is [0.5, 1], the value range of j is [0.3, i], and the value range of h is [0.001, 0.03];

step six, taking the item with the highest value in the candidate event place weighted frequency array as the actual place of occurrence of the event, recorded as the final point;

step seven, performing place name matching on the final point in the address coordinate database, and screening out all POI points that may be the final point to form a candidate address array;

step eight, selecting in turn, from the candidate event place array, the candidate event places closest in position to the final point, the number of symbols separating a selected place from the final point being m and the number of paragraphs separating them being n; traversing the POI points of the candidate address array, calculating the closest distance k between each POI point and the selected candidate event place, and adding the value (n+1) × (m+1) × k to the weight of each POI point to obtain a weight sum array of the POI points, the coordinates of the point with the minimum value in the array being the matched news coordinates.

Preferably, the data in step one includes the name and coordinates of each POI point and the names of the province, city, district and street where the POI point is located.

Preferably, the news units in step one include news agencies, broadcasting stations, television stations, news magazine agencies, news documentary film studios and news photo agencies.

Preferably, the feature words in step one include one or more of verbs, adverbs, adjectives and punctuation marks.

Preferably, the words capable of representing places in step three include place names, transliterated place names, organization and group names, and place words.

Preferably, in step five, i is 0.99, j is 0.5, and h is 0.01.

Compared with the prior art, the invention first establishes an address coordinate database, a news unit database and a feature word database; acquires news data, segments the news text and builds a candidate event place array; performs feature word matching on each item of the candidate event place array according to the sum of the frequencies, in the feature word database, of all nearby feature words, to obtain a candidate event place frequency array; performs weighted matching according to the position of each item of the array in the article to obtain a candidate event place weighted frequency array, and takes the item with the highest value as the actual place of occurrence of the event (the final point); performs place name matching on the final point in the address coordinate database and screens out all POI points that may be the final point to form a candidate address array; and finally selects in turn, from the candidate event place array, the candidate event places closest in position to the final point, with m the number of symbols and n the number of paragraphs separating a selected place from the final point, traverses the POI points of the candidate address array, calculates the closest distance k between each POI point and the selected candidate event place, and adds the value (n+1) × (m+1) × k to the weight of each POI point to obtain a weight sum array of the POI points, the coordinates of the point with the minimum value in the array being the matched news coordinates. As a result, the main place where the news occurs can be identified and displayed on a map; and when several places share the name of the event place in the news, the correct one can be determined.

Detailed Description

The technical solution in the implementation of the present invention is described in detail below with reference to embodiments, which are only a part of embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

The invention provides a news coordinate matching method based on natural language processing, which comprises the following steps:

step one, crawling POI point data and province, city, district and street data from Amap (Gaode Map) through a software tool to establish an address coordinate database, wherein the data includes the name and coordinates of each POI point and the names of the province, city, district and street where the POI point is located; acquiring the name and service range of each news unit from Baidu Map and Baidu Baike through a software tool to establish a news unit database, wherein the news units include news agencies, communication agencies, broadcasting stations, television stations, news magazine agencies, news documentary film studios, news photo agencies and the like; manually annotating the place where the event occurs in 10000 news items as manual training, recording the feature words near each event place at the same time, and establishing a feature word database according to the occurrence frequency of the different feature words; the feature words include one or more of verbs, adverbs, adjectives and punctuation marks;

step two, acquiring news data from major websites by using a software tool, wherein the data comprises the news article content, the news title and the news unit;

step three, segmenting the Chinese news text by using the ICTCLAS Chinese word segmentation tool, and identifying among the resulting words the place names, transliterated place names, organization and group names, place words, other proper nouns and other words capable of representing places, to form a candidate event place array;

step four, performing feature word matching on each item in the candidate event place array according to the sum of the frequencies, in the feature word database, of all feature words near the item, to obtain a candidate event place frequency array;

step five, performing weighted matching according to the position of each item of the array in the article, wherein an item appearing in the title is given the weight i, and items appearing in the body are given the weight j, decreased by h for each successive position, to obtain a candidate event place weighted frequency array; wherein the value range of i is [0.5, 1], the value range of j is [0.3, i], and the value range of h is [0.001, 0.03];

step six, taking the item with the highest value in the candidate event place weighted frequency array as the actual place of occurrence of the event, recorded as the final point;

step seven, performing place name matching on the final point in the address coordinate database, and screening out all POI points that may be the final point to form a candidate address array;

step eight, selecting in turn, from the candidate event place array, the candidate event places closest in position to the final point, the number of symbols separating a selected place from the final point being m and the number of paragraphs separating them being n; traversing the POI points of the candidate address array, calculating the closest distance k between each POI point and the selected candidate event place, and adding the value (n+1) × (m+1) × k to the weight of each POI point to obtain a weight sum array of the POI points, the coordinates of the point with the minimum value in the array being the matched news coordinates.
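
Steps four to eight can be summarized in compact notation as follows. This is one possible reading rather than a formulation given in the text: the symbols F, f, N, w, p, S, W and the place labels are introduced here for illustration only, and summing the scores of items that name the same place is taken from the worked example, not from the step list.

```latex
\begin{aligned}
F(c) &= \sum_{t \in N(c)} f(t) \\
w(c) &= \begin{cases}
  i, & c \text{ appears in the title} \\
  j - (p_c - 1)\,h, & c \text{ is the } p_c\text{-th candidate in the body}
\end{cases} \\
S(\ell) &= \sum_{c \,:\, \mathrm{place}(c) = \ell} w(c)\,F(c),
  \qquad \ell^{*} = \arg\max_{\ell} S(\ell) \\
W(P) &= \sum_{c} (n_c + 1)(m_c + 1)\,k(P, c),
  \qquad P^{*} = \arg\min_{P} W(P)
\end{aligned}
```

Here N(c) is the set of feature words near candidate c, f(t) is the frequency of feature word t in the feature word database, the final point is the place with the largest S, and P ranges over the POI points of the candidate address array. In the last line c ranges over the selected candidate event places, with m_c symbols and n_c paragraphs separating c from the final point, and k(P, c) is the closest distance between P and c.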

Example:

step one, omitted;

step two, acquiring news by using a software tool and obtaining the data through the news link; the data obtained may be as follows:

News title: Place A builds a pioneer zone for high-quality development and surpassing

News unit: Website A

News content:

A meeting was held in a certain meeting room in Place A.

Remarkable results obtained in nearly five years

In the first year, get through the province a number of times a, and obtain multiple titles … …

Over the 5 years, the comprehensive strength of Place A has leapt forward: index A has risen to second place in the province, and index B ranks first in the province.

Innovation momentum is strong. Comprehensive reform in key fields has been carried out across the board, with 150 of 624 small reform tasks in 10 fields advanced cumulatively, and 12 first-level projects and 19 second-level projects accepted. Market entities have exceeded 130,000. National high-tech enterprises have increased from 156 to 597.

The living environment has been greatly upgraded, with a cumulative 55 reconstruction items A covering 1.52 million square meters, 265 reconstruction items B and 7 improved projects C.

An exciting plan for 2021 to 2025

In the opening year, Place A draws up its plans from a high starting point and a high standpoint, building the province's pioneer zone for high-quality development and surpassing: Demonstration Area A and Model Area B. After 5 years of planning, it aims to lead in multiple fields and then take another big step forward.

Focusing on the first generation, 2 billions of brand benches are built. Create Core Center H, and vigorously develop Area A, Area B and Area C.

The method aims to focus on the initial creation and cultivate a plurality of leading enterprises. A batch of high-level and high-skill talents are introduced accurately.

The focus is on the first position, the original advantages are consolidated and expanded, and more core regions such as an A core region, a B core region, a C core region and the like are created.

step three, segmenting the data with the word segmentation tool, where the segmentation result for the title is {Place A, pioneer zone}, and the segmentation results for the news body are {a certain meeting room in Place A}, {Place A, various honors}, {Place A}, {Community A}, {Place A, pioneer zone, Demonstration Area A, Model Area B}, {Core Center H, Area A, Area B, Area C} and {Core Area A, Core Area B, Core Area C};

identifying the words that can represent places gives [Place A, a certain meeting room in Place A, Place A, Place A, Community A, Place A, Demonstration Area A, Core Center H, Area A, Area B], which is the candidate event place array;
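
The filtering in step three can be sketched as below. This is a minimal sketch, not the tool's own interface: it assumes the segmenter has already produced (word, tag) pairs, and that the tags follow an ICTCLAS-style convention in which ns, nsf, nt, s and nz mark place names, transliterated place names, organization or group names, place words and other proper nouns. The function name and the sample sentence are hypothetical.

```python
# Tags assumed to follow an ICTCLAS-style part-of-speech convention:
# ns = place name, nsf = transliterated place name, nt = organization or
# group name, s = place word, nz = other proper noun.
PLACE_TAGS = {"ns", "nsf", "nt", "s", "nz"}


def candidate_places(tagged_tokens):
    """Return (token_index, word) pairs for place-like tokens, in text order."""
    return [
        (index, word)
        for index, (word, tag) in enumerate(tagged_tokens)
        if tag in PLACE_TAGS
    ]


# Hypothetical tagger output for one sentence; words and tags are illustrative.
tagged_sentence = [
    ("create", "v"),
    ("Core Center H", "ns"),
    (",", "w"),
    ("develop", "v"),
    ("Area A", "ns"),
    ("and", "c"),
    ("Area B", "ns"),
]

print(candidate_places(tagged_sentence))
```

Keeping the token index alongside each word preserves the order of appearance, which the position weighting in step five relies on.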

step four, performing feature word matching on each item according to the sum of the frequencies, in the feature word database, of all feature words near it:

the feature word near the first item, "Place A", is "creation", with frequency 0.01 in the feature word database;

the feature words near the second item, "a certain meeting room in Place A", are "at" and "on screen", with frequencies 0.3 and 0.01 in the feature word database;

the feature words near the third item, "Place A", are "and" and "in", with frequencies 0.2 and 0.3 in the feature word database;

the two feature words near the fourth item, "Place A", have frequencies 0.2 and 0.2 in the feature word database;

the two feature words near the fifth item, "Community A", have frequencies 0.2 and 0.03 in the feature word database;

the feature words near the sixth item, "Place A", are "and" and "collude", with frequencies 0.2 and 0.01 in the feature word database;

the two feature words near the seventh item, "Demonstration Area A", have frequencies 0.2 and 0.05 in the feature word database;

the feature words near the eighth item, "Core Center H", are "created" and "and", with frequencies 0.01 and 0.2 in the feature word database;

the feature words near the ninth item, "Area A", are "develop" and one further word, with frequencies 0.01 and 0.05 in the feature word database;

the feature words near the tenth item, "Area B", are "develop" and "and", with frequencies 0.01 and 0.05 in the feature word database;
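
The frequency sum in step four can be sketched as follows. The text does not define how close a word must be to count as "nearby", so the one-token window on each side is an assumption, as are the function names and the choice to score words missing from the database as 0. The two frequencies are the ones quoted for the second item.

```python
def nearby_words(tokens, index, window=1):
    """Tokens within `window` positions before and after tokens[index]."""
    before = tokens[max(0, index - window):index]
    after = tokens[index + 1:index + 1 + window]
    return before + after


def feature_score(tokens, index, freq_db, window=1):
    """Sum the database frequencies of the feature words around one candidate."""
    return sum(freq_db.get(word, 0.0) for word in nearby_words(tokens, index, window))


# Frequencies quoted in the example for the second item; unknown words score 0.
freq_db = {"at": 0.3, "on screen": 0.01}
tokens = ["at", "a certain meeting room in Place A", "on screen"]

score = feature_score(tokens, 1, freq_db)
print(round(score, 4))  # 0.31
```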

step five, setting i to be 0.99, j to be 0.5 and h to be 0.01, and carrying out weighted matching on each item;

the first term is 0.01 × 0.99 = 0.0099;

the second term is (0.3 + 0.01) × 0.5 = 0.155;

the third term is (0.2 + 0.3) × 0.49 = 0.245;

the fourth term is (0.2 + 0.2) × 0.48 = 0.192;

the fifth term is (0.2 + 0.03) × 0.47 = 0.1081;

the sixth term is (0.2 + 0.01) × 0.46 = 0.0966;

the seventh term is (0.2 + 0.05) × 0.45 = 0.1125;

the eighth term is (0.01 + 0.2) × 0.44 = 0.0924;

the ninth term is (0.01 + 0.05) × 0.43 = 0.0258;

the tenth term is (0.01 + 0.05) × 0.42 = 0.0252;
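
The ten products above can be reproduced with a short sketch. It assumes the reading used in the example: a title item gets weight i, and body items get j, j - h, j - 2h and so on in order of appearance. The function name is hypothetical, and the range check simply mirrors the ranges stated for i, j and h.

```python
def position_weights(in_title_flags, i=0.99, j=0.5, h=0.01):
    """Weight i for title items; j, j - h, j - 2h, ... for successive body items."""
    if not (0.5 <= i <= 1 and 0.3 <= j <= i and 0.001 <= h <= 0.03):
        raise ValueError("i, j or h is outside the ranges given in step five")
    weights = []
    body_rank = 0  # position among body items only
    for in_title in in_title_flags:
        if in_title:
            weights.append(i)
        else:
            weights.append(j - body_rank * h)
            body_rank += 1
    return weights


# Feature word frequency sums of the ten items from step four.
feature_sums = [0.01, 0.31, 0.5, 0.4, 0.23, 0.21, 0.25, 0.21, 0.06, 0.06]
in_title_flags = [True] + [False] * 9  # only the first item comes from the title

weighted = [
    round(total * weight, 4)
    for total, weight in zip(feature_sums, position_weights(in_title_flags))
]
print(weighted)
```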

items that refer to the same place are merged, giving the weighted frequency array [{Place A: 0.5435}, {a certain meeting room in Place A: 0.155}, {Community A: 0.1081}, {Demonstration Area A: 0.1125}, {Core Center H: 0.0924}, {Area A: 0.0258}, {Area B: 0.0252}];

step six, taking the item with the highest value in the array, {Place A: 0.5435}, so that "Place A" is the actual place of occurrence of the event, referred to below as the final point;
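
The merge and the selection of the final point can be sketched as below, using the ten weighted values of the example. The function name is hypothetical, and summing the scores of items that name the same place follows what the example does for "Place A".

```python
from collections import defaultdict


def merge_and_pick(weighted_items):
    """Sum scores per place name and return (totals, place with the highest total)."""
    totals = defaultdict(float)
    for place, score in weighted_items:
        totals[place] += score
    final_point = max(totals, key=totals.get)
    return dict(totals), final_point


# The ten weighted items of the example, in order of appearance.
weighted_items = [
    ("Place A", 0.0099),
    ("a certain meeting room in Place A", 0.155),
    ("Place A", 0.245),
    ("Place A", 0.192),
    ("Community A", 0.1081),
    ("Place A", 0.0966),
    ("Demonstration Area A", 0.1125),
    ("Core Center H", 0.0924),
    ("Area A", 0.0258),
    ("Area B", 0.0252),
]

totals, final_point = merge_and_pick(weighted_items)
print(final_point, round(totals[final_point], 4))  # Place A 0.5435
```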

step seven, performing place name matching on the final point in the address coordinate database to obtain the candidate address array [Place A in place a, Place A in place b, Place A in place c, Place A in place d, Place A in place e];
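
A minimal sketch of the lookup in step seven follows. The text does not specify the matching rule, so exact equality on the POI name is an assumption, and the record layout, field names and the extra "Community A" record are hypothetical. Coordinates are left out of the records because the example gives them only for the final result.

```python
def match_candidates(final_point, address_db):
    """Return every POI record whose name equals the final point's name."""
    return [record for record in address_db if record["name"] == final_point]


# Hypothetical address records: five same-named POIs in different places,
# plus one unrelated POI that must not be matched.
address_db = [
    {"name": "Place A", "region": "place a"},
    {"name": "Place A", "region": "place b"},
    {"name": "Place A", "region": "place c"},
    {"name": "Place A", "region": "place d"},
    {"name": "Place A", "region": "place e"},
    {"name": "Community A", "region": "place d"},
]

candidates = match_candidates("Place A", address_db)
print([record["region"] for record in candidates])
```

The point of the sketch is that a name lookup alone returns several same-named candidates, which is the ambiguity that step eight resolves.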

step eight, selecting in turn, from the candidate event place array, the candidate event places closest in position to the final point; taking "Community A" as an example, the number of symbols m between it and the nearest occurrence of the final point is 37, and the number of paragraphs n is 2;

calculating the closest distance k from each item in the candidate address array to "Community A" gives [{Place A in place a: 898.6}, {Place A in place b: 1324.4}, {Place A in place c: 1649.1}, {Place A in place d: 0.148}, {Place A in place e: 1185.8}]; applying the formula (n+1) × (m+1) × k, here 3 × 38 × k = 114 × k, gives the weight of each item in the candidate address array, namely [{Place A in place a: 102440.4}, {Place A in place b: 150981.6}, {Place A in place c: 187997.4}, {Place A in place d: 16.872}, {Place A in place e: 135181.2}]; the same calculation is carried out in turn for "a certain meeting room in Place A", "Community A", "Demonstration Area A", "Core Center H", "Area A" and "Area B";

after the weights are added (the specific calculation process is omitted), the minimum value belongs to "Place A in place d", which is the matched place of the news event, and its coordinates {lng: 119.3, lat: 26.08} are the matched news coordinates.
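
The weighting in step eight can be sketched as below. Only the "Community A" contribution is reproduced, because the example omits the distances for the other candidate event places; the function name and data layout are hypothetical, and the distances are taken as given, without assuming a unit or a distance formula.

```python
def accumulate_poi_weights(contributions):
    """Add (n + 1) * (m + 1) * k to each POI's total for every selected place.

    `contributions` is a list of (distances, m, n) tuples, where `distances`
    maps a POI label to its closest distance k from the selected place.
    """
    totals = {}
    for distances, m, n in contributions:
        factor = (n + 1) * (m + 1)
        for poi, k in distances.items():
            totals[poi] = totals.get(poi, 0.0) + factor * k
    return totals


# Closest distances k from each candidate address to "Community A" (m = 37, n = 2).
community_a_distances = {
    "Place A in place a": 898.6,
    "Place A in place b": 1324.4,
    "Place A in place c": 1649.1,
    "Place A in place d": 0.148,
    "Place A in place e": 1185.8,
}

totals = accumulate_poi_weights([(community_a_distances, 37, 2)])
best = min(totals, key=totals.get)
print({poi: round(weight, 3) for poi, weight in totals.items()})
print(best)  # Place A in place d
```

Passing further (distances, m, n) tuples for the remaining candidate event places would add their contributions to the same totals before the minimum is taken.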

Finally, through the cooperation of all the above steps, the main place where the news occurs can be identified and displayed on a map; and when several places share the name of the event place in the news, the correct one can be determined.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that it may be embodied in other specific forms without departing from its spirit or essential attributes. Furthermore, it should be understood that although this description is organized by embodiments, not every embodiment contains only a single independent technical solution; the description is written this way for clarity only. Those skilled in the art should treat the description as a whole, and the technical solutions in the embodiments may be appropriately combined to form other embodiments that they can understand.
