Travel note place name disambiguation method based on time geography

Document No.: 1087405. Publication date: 2020-10-20.

Abstract: This technology, a travel note place name disambiguation method based on time geography, was created by 尹章才 (Yin Zhangcai), 赵晓茹 (Zhao Xiaoru), 曹莉婷 (Cao Liting), and 李三娟 (Li Sanjuan) on 2020-06-03. The invention discloses a travel note place name disambiguation method based on time geography, comprising the following steps: 1) extracting the place names and their time labels from the travel note text, dividing the extracted place names into ambiguous and unambiguous place names, assigning a unique longitude and latitude position to each unambiguous place name, and listing all possible longitude and latitude positions for each ambiguous place name; 2) disambiguating using the potential path area (PPA); 3) disambiguating using the reachable domain at the determined time; 4) ranking using probabilistic time geography, that is, calculating a probability for each remaining candidate position of the ambiguous place name and sorting the results in descending order. Unlike earlier rule-based and similar methods, the invention disambiguates on the basis of time geography, is suited to travel note place names, supplements existing methods for fine-grained place names, and makes place name disambiguation more accurate.

1. A travel note place name disambiguation method based on time geography is characterized by comprising the following steps:

1) extracting place names and time labels thereof in the travel note text, dividing the extracted place names into ambiguous place names and unambiguous place names, allocating unique longitude and latitude positions to the unambiguous place names, and listing all possible longitude and latitude positions corresponding to the place names for the ambiguous place names;

2) disambiguating using the potential path area (PPA);

supposing that the place name L at which the tourist is located at time t has several ambiguous positions, and denoting the longitude and latitude of any one of them as L(x, y); selecting the two unambiguous place name positions L_i(x_i, y_i) and L_j(x_j, y_j) that precede and follow it in time, together with their times t_i and t_j, where t_i < t < t_j, as the start and end points of a tour segment; then, from the maximum possible speed V_m of the tourist, calculating the potential path area of the tourist under the start and end point constraints using the principles of time geography, and using it as the basis for disambiguation, namely: if an ambiguous position L(x, y) does not lie within the PPA, it is not the correct position of the place name; if exactly one longitude and latitude position remains for the ambiguous place name after this reduction, the travel note place name disambiguation is complete; otherwise, going to step 3);

the PPA comprises all locations the tourist can reach within the given time budget (t_j - t_i) and under the speed limit V_m;

for any longitude and latitude L(x, y) of the ambiguous place name L, eliminating the ambiguous position points that do not lie within the PPA, the judgment being made with the following formula:

$$g_{ij} = \left\{ (x, y) \;\middle|\; \sqrt{(x - x_i)^2 + (y - y_i)^2} + \sqrt{(x - x_j)^2 + (y - y_j)^2} \le (t_j - t_i)\, V_m \right\}$$

wherein g_ij is the PPA, i.e., the set of all locations the tourist can reach under the constraints of the start point L_i and the end point L_j; (x_i, y_i) and (x_j, y_j) are the coordinates of the start and end points, respectively; t_i and t_j are the start time and end time, respectively; and V_m is the maximum possible speed of the tourist;

3) disambiguating using the reachable domain at the determined time;

extracting from the travel note the time t at which the individual appears at the ambiguous place name, establishing the reachable domain at time t, and eliminating the candidate longitude and latitude positions of the ambiguous place name that do not lie within the reachable domain; if exactly one longitude and latitude position remains for the ambiguous place name after this reduction, the travel note place name disambiguation is complete; otherwise, going to step 4);

4) ranking using probabilistic time geography: calculating a probability for each of the remaining n candidate longitude and latitude positions of the ambiguous place name, and arranging the calculated results in descending order.

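2. The travel note place name disambiguation method based on time geography according to claim 1, wherein the reachable domain at time t in step 3) is f_i(t) ∩ p_j(t), where f_i(t) is the domain the tourist can reach at time t after departing from the start point L_i, and p_j(t) is the domain at time t from which the tourist can still arrive at the end point L_j; they are expressed as follows:

$$f_i(t) = \left\{ (x, y) \;\middle|\; \sqrt{(x - x_i)^2 + (y - y_i)^2} \le (t - t_i)\, V_m \right\}$$

$$p_j(t) = \left\{ (x, y) \;\middle|\; \sqrt{(x - x_j)^2 + (y - y_j)^2} \le (t_j - t)\, V_m \right\}$$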

3. The travel note place name disambiguation method based on time geography according to claim 1, wherein the probability in step 4) is calculated using the following formula:

In the formula, c_0(x_0, y_0) is the position point between the start point L_i and the end point L_j that corresponds to time t; c_k is the position point of the ambiguous place name with index k; k is a natural number in the interval [1, n]; and n is the total number of position points of the ambiguous place name.

Technical Field

The invention relates to natural language processing technology, and in particular to a travel note place name disambiguation method based on time geography.

Background

The continuous development and spread of the Internet have rapidly increased the amount of information online. The network has become a large database of digital texts and a main source from which people obtain geographic information; statistically, at least 70% of text documents contain geographic location references expressed as place names. In real life, this information is often ambiguous. "Zhongshan Park", for example, names many different parks, and such same-name geographic features make the position semantics uncertain. The ambiguous position semantics therefore need to be disambiguated by assigning a unique longitude and latitude.

Existing methods generally disambiguate by computing the geographic relevance between the ambiguous place name and the evidence found near it in the text, but the disambiguation quality degrades when the amount of evidence grows too large. In addition, because geographic scale is classified into only three levels (provincial, city, and county), the geographic relevance of many fine-grained and administrative place names cannot be told apart, which can cause disambiguation errors.

Travel notes are a representative type of text. They are published voluntarily by travelers, are based on the travelers' own experiences, mainly describe the course of a trip and the impressions it left, and are widely used for extracting geographic information. Although many place name disambiguation methods exist, different methods suit different types of text, and there is currently no disambiguation method aimed specifically at travel notes.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a travel note place name disambiguation method based on time geography that addresses the above defects in the prior art.

The technical scheme adopted by the invention for solving the technical problems is as follows: a travel note place name disambiguation method based on time geography comprises the following steps:

2) disambiguating using the potential path area (PPA);

supposing that the place name L at which the tourist is located at time t has several ambiguous positions, and denoting the longitude and latitude of any one of them as L(x, y); selecting the two unambiguous place name positions L_i(x_i, y_i) and L_j(x_j, y_j) that precede and follow it in time, together with their times t_i and t_j, where t_i < t < t_j, as the start and end points of a tour segment; then, from the maximum possible speed V_m of the tourist, calculating the Potential Path Area (PPA) of the tourist under the start and end point constraints using the principles of time geography, and using it as the basis for disambiguation, namely: if an ambiguous position L(x, y) does not lie within the PPA, it is not the correct position of the place name; if exactly one longitude and latitude position remains for the ambiguous place name after this reduction, the travel note place name disambiguation is complete; otherwise, going to step 3);

the PPA comprises all locations the tourist can reach within the given time budget (t_j - t_i) and under the speed limit V_m; an ambiguous position L(x, y) is judged against the PPA with the following formula:

$$g_{ij} = \left\{ (x, y) \;\middle|\; \sqrt{(x - x_i)^2 + (y - y_i)^2} + \sqrt{(x - x_j)^2 + (y - y_j)^2} \le (t_j - t_i)\, V_m \right\} \tag{1}$$

wherein g_ij is the PPA, i.e., the set of all locations the tourist can reach under the constraints of the start point L_i and the end point L_j; (x_i, y_i) and (x_j, y_j) are the coordinates of the start and end points, respectively; t_i and t_j are the start time and end time, respectively; and V_m is the maximum possible speed of the tourist;

3) disambiguating by using the reachable domain at the determined time;

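According to the scheme, the reachable domain at time t in step 3) is f_i(t) ∩ p_j(t), where f_i(t) is the domain the tourist can reach at time t after departing from the start point L_i, and p_j(t) is the domain at time t from which the tourist can still arrive at the end point L_j; they are expressed as follows:

$$f_i(t) = \left\{ (x, y) \;\middle|\; \sqrt{(x - x_i)^2 + (y - y_i)^2} \le (t - t_i)\, V_m \right\}$$

$$p_j(t) = \left\{ (x, y) \;\middle|\; \sqrt{(x - x_j)^2 + (y - y_j)^2} \le (t_j - t)\, V_m \right\}$$

According to the scheme, the probability in step 4) is calculated using the following formula: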

The invention has the following beneficial effects: it provides a disambiguation method based on time geography which, unlike earlier rule-based and similar methods, is suited to disambiguating travel note place names that carry time labels, supplements existing disambiguation methods for fine-grained place names, and makes place name disambiguation more accurate.

Drawings

The invention will be further described with reference to the accompanying drawings and examples, in which:

FIG. 1 is a flow chart of a method of an embodiment of the present invention;

FIG. 2 is a schematic diagram of time and place names extracted from the travel notes according to an embodiment of the present invention;

FIG. 3 is a display diagram of ambiguous and unambiguous location names according to an embodiment of the invention;

FIG. 4 is a schematic representation of the PPA disambiguation results of an embodiment of the invention;

FIG. 5 is a diagram illustrating the reachable domain disambiguation results of an embodiment of the present invention;

FIG. 6 is a graphical illustration of a probability disambiguation result of an embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in FIG. 1, a travel note place name disambiguation method based on time geography includes the following steps:

2) Disambiguation using the PPA.

During a tour, the tourist may record place names together with the times at which they were visited. Some place names correspond to several different longitude and latitude positions, which creates ambiguity. Suppose the place name L at which the tourist is located at time t has several ambiguous positions, and denote the longitude and latitude of any one of them as L(x, y). Select the two unambiguous place name positions L_i(x_i, y_i) and L_j(x_j, y_j) that precede and follow it in time, together with their times t_i and t_j, where t_i < t < t_j, as the start and end points of a tour segment. Then, from the maximum possible speed V_m of the tourist, calculate the Potential Path Area (PPA) of the tourist under the start and end point constraints using the principles of time geography, and use it as the basis for disambiguation: if an ambiguous position L(x, y) does not lie within the PPA, it is not the correct position of the place name. If exactly one longitude and latitude position remains for the ambiguous place name after this reduction, the travel note place name disambiguation is complete; otherwise, go to step 3).

The maximum possible speed V_m of the tourist during the tour can be estimated from the unambiguous place names and their time information.

The PPA comprises all locations the tourist can reach within the given time budget (t_j - t_i) and under the speed limit V_m.
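The PPA test in step 2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the embodiment: positions are taken to be (longitude, latitude) pairs in degrees, times are in hours, V_m is in km/h, and the haversine great-circle distance is swapped in for a planar distance so that positions in degrees can be compared with a budget in kilometres. The function names `haversine_km`, `in_ppa`, and `filter_by_ppa`, and the coordinates in the example, are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(p, q):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_ppa(candidate, start, end, t_i, t_j, v_max_kmh):
    """True if d(start, candidate) + d(candidate, end) <= (t_j - t_i) * V_m."""
    budget_km = (t_j - t_i) * v_max_kmh
    detour_km = haversine_km(start, candidate) + haversine_km(candidate, end)
    return detour_km <= budget_km


def filter_by_ppa(candidates, start, end, t_i, t_j, v_max_kmh):
    """Keep only the candidate positions that lie inside the PPA."""
    return [c for c in candidates if in_ppa(c, start, end, t_i, t_j, v_max_kmh)]


# Illustrative values only (not taken from the travel note).
start, end = (114.00, 30.00), (114.10, 30.00)
candidates = [(114.05, 30.05), (114.05, 30.50)]
kept = filter_by_ppa(candidates, start, end, t_i=9.0, t_j=10.0, v_max_kmh=20.0)
```

With a one-hour budget at 20 km/h, the first candidate requires a detour of roughly 15 km and is kept, while the second would require more than 100 km and is eliminated.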

Time geography:

The space-time prism is the region of space and time that contains every location an individual can reach, given a known start point, end point, departure time, arrival time, and maximum travel speed. The space-time prism at time t ∈ (t_i, t_j) is defined as:

$$Z_{ij}(t) = \{ (x, y; t) \mid f_i(t) \cap p_j(t) \} \tag{2}$$

The spatial extent of the prism at time t is the intersection of two sets: (1) all positions f_i(t) that can be reached by time t when starting from the start point; and (2) all positions p_j(t) from which the end point can still be reached when leaving at time t. Projecting the prism onto geographic space yields the potential path area (PPA), denoted g_ij in the figure. The PPA is an ellipse in two-dimensional geographic space that contains all locations reachable under the given time budget and speed limit, i.e., the range in which the individual may be found during [t_i, t_j]. The PPA thus represents the full range of an individual's activity over a period of time, whereas the reachable domain represents the range of activity at a single moment.
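The difference between the PPA and the reachable domain at a single time t can be made concrete with a short sketch. It assumes, for simplicity, that positions have already been projected to planar coordinates in kilometres, that times are in hours, and that the speed is in km/h; the function names `dist` and `in_reachable_domain` and the numbers in the example are hypothetical.

```python
import math


def dist(p, q):
    """Euclidean distance between two points in a projected plane (km)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def in_reachable_domain(c, start, end, t_i, t_j, t, v_max):
    """True if c lies in f_i(t) ∩ p_j(t), the intersection of two discs."""
    if not t_i <= t <= t_j:
        raise ValueError("t must lie within [t_i, t_j]")
    in_f = dist(start, c) <= (t - t_i) * v_max  # reachable from the start by time t
    in_p = dist(c, end) <= (t_j - t) * v_max    # end point still reachable after t
    return in_f and in_p


# Illustrative values only: a 2-hour budget at 10 km/h between points 10 km apart.
start, end = (0.0, 0.0), (10.0, 0.0)
near = in_reachable_domain((3.0, 0.0), start, end, t_i=0.0, t_j=2.0, t=0.5, v_max=10.0)
far = in_reachable_domain((8.0, 0.0), start, end, t_i=0.0, t_j=2.0, t=0.5, v_max=10.0)
```

In this example the point (8, 0) lies inside the PPA, since the detour 8 + 2 = 10 km is within the 20 km budget, yet it is outside the reachable domain at t = 0.5 h because only 5 km can be covered from the start by then. This is why the reachable domain can eliminate candidates that survive the PPA test.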

3) Disambiguation using the reachable domain at the determined time. Extract the time t at which the individual appears at the ambiguous place name, establish the reachable domain at time t, and eliminate the candidate longitude and latitude positions of the ambiguous place name that do not lie within the reachable domain. If exactly one longitude and latitude position remains for the ambiguous place name after this reduction, the travel note place name disambiguation is complete; otherwise, go to step 4).

4) Ranking using probabilistic time geography. Calculate a probability for each of the remaining n candidate longitude and latitude positions of the ambiguous place name, and arrange the calculated results in descending order.

Probabilistic time geography is an extension of time geography based on position probability: it distributes position probabilities over the reachable domain. The probability on the reachable domain is calculated using the following formula:

In the formula, c_0 is the position point between the start point and the end point that corresponds to time t, and c_k is the position point of the ambiguous place name with index k. In this embodiment, the straight-line shortest path between the start point and the end point is used.
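The ranking step can be illustrated with a hedged sketch. The probability formula itself is not reproduced in this text, so the weighting below is an assumption chosen only for illustration: each candidate c_k receives a weight inversely proportional to its distance from c_0, and the weights are normalised to sum to 1. The sketch also assumes that c_0 is obtained by linear interpolation in time along the straight segment from the start point to the end point (uniform speed), and that coordinates are planar. The names `anchor_point` and `rank_candidates` are hypothetical.

```python
import math


def anchor_point(start, end, t_i, t_j, t):
    """c_0: point on the straight start-end segment at time t.

    ASSUMPTION: the tourist moves at uniform speed along the segment.
    """
    r = (t - t_i) / (t_j - t_i)
    return (start[0] + r * (end[0] - start[0]),
            start[1] + r * (end[1] - start[1]))


def rank_candidates(candidates, c0, eps=1e-9):
    """Return (probability, candidate) pairs sorted in descending order.

    ASSUMPTION: weight = 1 / distance to c_0, normalised to sum to 1.
    This is an illustrative distance-decay rule, not the patent's formula.
    """
    weights = [1.0 / (math.hypot(c[0] - c0[0], c[1] - c0[1]) + eps)
               for c in candidates]
    total = sum(weights)
    pairs = [(w / total, c) for w, c in zip(weights, candidates)]
    return sorted(pairs, reverse=True)


# Illustrative values only.
c0 = anchor_point((0.0, 0.0), (10.0, 0.0), t_i=0.0, t_j=2.0, t=1.0)
ranked = rank_candidates([(5.0, 4.0), (5.0, 1.0), (5.0, 2.0)], c0)
```

Any monotonically decreasing function of the distance to c_0 would produce the same ordering; only the probability values would differ.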

A specific example follows.

Setting: the travel note to be disambiguated is published on the Ctrip platform and records a one-day visit to Wuhan. Every place visited in the travel note is described with a specific time, which meets the requirements of the method. The URL of the travel note used in this example is:

https://you.ctrip.com/travels/wuhan145/3787772.html?tdsourcetag=s_pctim_aiomsg

Step 1: The URL entered by the user is read, and all place names and times are extracted from the travel note using the natural language processing functions of Baidu AI. From this travel note, the time expressions "morning", "nine o'clock", "spring equinox", "twelve noon", "two hours", and "two o'clock" and the place names Wuhan University, Cherry Blossom Avenue, Haidilao, Wuhan, and Yellow Crane Tower can be extracted, as shown in FIG. 2.

Step 2: Determine the start and end points. Using the geocodeSearch method of the Baidu JavaScript API, the place names are divided into two groups: the ambiguous place name Haidilao, and the unambiguous place names Wuhan University, Cherry Blossom Avenue, and Yellow Crane Tower. Two of the unambiguous place names are then selected as the start and end points. To keep the method general, the first and last entries of the unambiguous place name array are chosen, so Wuhan University and Yellow Crane Tower serve as the start and end points. Their place names are converted into longitude and latitude by address resolution (geocoding).
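The grouping and endpoint selection in this step can be sketched independently of any map service. The sketch below assumes that a geocoding lookup has already returned a list of candidate positions for each place name, in the order the names appear in the travel note; it does not call the Baidu API. The function names `split_place_names` and `pick_endpoints` are hypothetical, and the coordinates are placeholders rather than real positions.

```python
def split_place_names(candidates_by_name):
    """Split place names by how many candidate positions each one has.

    candidates_by_name maps each place name, in travel-note order,
    to its list of candidate (lon, lat) positions.
    """
    unambiguous = {name: pos[0] for name, pos in candidates_by_name.items()
                   if len(pos) == 1}
    ambiguous = {name: pos for name, pos in candidates_by_name.items()
                 if len(pos) > 1}
    return unambiguous, ambiguous


def pick_endpoints(unambiguous):
    """Use the first and last unambiguous place names as start and end points."""
    names = list(unambiguous)  # dicts preserve insertion order (Python 3.7+)
    if len(names) < 2:
        raise ValueError("need at least two unambiguous place names")
    return names[0], names[-1]


# Placeholder coordinates for illustration only.
lookup = {
    "Wuhan University": [(1.0, 1.0)],
    "Cherry Blossom Avenue": [(1.1, 1.0)],
    "ambiguous place name": [(1.2, 1.3), (1.5, 0.9), (0.7, 1.1)],
    "Yellow Crane Tower": [(2.0, 1.0)],
}
unambiguous, ambiguous = split_place_names(lookup)
start_name, end_name = pick_endpoints(unambiguous)
```

Choosing the first and last unambiguous names gives the widest time window; the claims instead describe choosing the unambiguous places immediately before and after the ambiguous one, which would give a tighter PPA.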

Step 3: Obtain the array of candidate positions for the ambiguous place name. Unlike a quantitative georeferencing system, a place name is usually unique only within a certain geographic range, so the candidate positions are obtained with the Baidu LocalSearch method. In this example, the ambiguous place name Haidilao has 19 candidate positions in Wuhan. To make the disambiguation process easier to follow, the ambiguous and unambiguous place names are displayed on a map, as shown in FIG. 3.

Step 4: Determine the times and the maximum speed. The times recorded for the start and end points, nine o'clock in the morning at Wuhan University and two o'clock in the afternoon at Yellow Crane Tower, are taken as the start time and end time. A lower bound on the tourist's maximum speed is estimated from the start time, the end time, and the distance between the two places, and the maximum possible moving speed is then determined by combining this information with the tourist's mode of transportation.
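The lower bound mentioned in this step follows from simple arithmetic: the tourist must cover at least the straight-line distance between the start and end points within the time budget. The sketch below assumes (longitude, latitude) pairs in degrees and times in hours, and uses the haversine distance as the straight-line measure. The function names, the placeholder coordinates, and the rule of taking the larger of the lower bound and a typical speed for the transport mode are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(p, q):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_required_speed_kmh(start, end, t_i_h, t_j_h):
    """Lower bound on V_m: start-end distance divided by the time budget."""
    if t_j_h <= t_i_h:
        raise ValueError("end time must be later than start time")
    return haversine_km(start, end) / (t_j_h - t_i_h)


def choose_v_max(lower_bound_kmh, mode_speed_kmh):
    """ASSUMPTION: V_m is the larger of the lower bound and the mode's typical speed."""
    return max(lower_bound_kmh, mode_speed_kmh)


# Placeholder coordinates; 9:00 to 14:00 gives a five-hour budget.
lower = min_required_speed_kmh((114.0, 30.0), (114.0, 30.5), t_i_h=9.0, t_j_h=14.0)
v_max = choose_v_max(lower, mode_speed_kmh=30.0)
```

A V_m below the lower bound would make the PPA empty, because even the direct trip between the two endpoints would not fit in the time budget.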

Step 5: Disambiguation. The disambiguation process consists of three parts, as follows.

First, disambiguation is performed using the PPA. Each candidate longitude and latitude of the ambiguous place name Haidilao is substituted into formula (1), the PPA membership test. The geographic ellipse is made up of grid cells, each of which satisfies the condition that the sum of its distances to the two foci does not exceed the length of the major axis of the ellipse. Here the two foci are the two selected unambiguous place names (i.e., the start and end points), and the length of the major axis is the time interval between the start and end points multiplied by the maximum speed. As shown in FIG. 4, the candidate positions outside the PPA can be eliminated, while those inside the PPA need further confirmation.

Second, disambiguation is performed using the reachable domain at the determined time. The time t at which the individual appears at the ambiguous place name is extracted from the travel note, and the reachable domain at time t is calculated. Specifically, formulas (6) and (7) are evaluated to obtain two circles, and the intersection of the two circles is the reachable domain. As shown in FIG. 5, the candidate positions outside the reachable domain can be eliminated, while the three candidate positions inside it need further confirmation. Four points lie within the reachable domain in FIG. 5: one is the determined end point (Yellow Crane Tower), and the remaining three are ambiguous points.

Third, because the first two steps cannot determine a unique position for the ambiguous place name, disambiguation is finally performed using probabilistic time geography. Formula (8) gives the probabilities of the three remaining positions as 0.157, 0.683, and 0.160, so the ambiguous place name most likely refers to the position with the highest probability.

After sorting, the point with the highest probability is output as the suggested result, and the result point is marked on the map with a red icon overlay, as shown in FIG. 6.
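The final sort-and-select step can be reproduced directly from the three probabilities reported above. In this short sketch the labels "candidate 1" to "candidate 3" are placeholders, since the text does not name the three remaining positions.

```python
# Probabilities reported in the example for the three remaining positions.
# The candidate labels are placeholders, not names from the travel note.
probabilities = {"candidate 1": 0.157, "candidate 2": 0.683, "candidate 3": 0.160}

# Sort in descending order of probability and take the top entry as the suggestion.
ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
suggested, p_best = ranked[0]
```

The three values sum to 1, which is consistent with the probabilities being normalised over the remaining candidates.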

It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.
