Vehicle control device, vehicle control system, and vehicle control method

Document No.: 731960 Publication date: 2021-04-20

Summary: This technology, "Vehicle control device, vehicle control system, and vehicle control method", was created by 桥本洋介, 片山章弘, 大城裕太, 杉江和纪, and 冈尚哉 on 2020-10-14. Provided are a vehicle control device, a vehicle control system, and a vehicle control method. The vehicle control device includes a storage device and a processor. The storage device is configured to store relationship specifying data that specifies a relationship between a state of a vehicle and an action variable, which is a variable related to an operation of an electronic device in the vehicle. The processor is configured to calculate a reward corresponding to the operation of the electronic device. When a computational load of the processor is equal to or less than a predetermined load, the processor updates the relationship specifying data by using, as inputs to a predetermined update map, the state of the vehicle based on the acquired detection value, the value of the action variable used in the operation of the electronic device, and the reward corresponding to that operation.

1. A control device for a vehicle, comprising a storage device and a processor, wherein

the storage device is configured to store relationship specifying data that specifies a relationship between a state of a vehicle and an action variable that is a variable related to an operation of an electronic device in the vehicle,

the processor is configured to:

acquire a detection value of a sensor that detects the state of the vehicle,

operate the electronic device based on the value of the action variable determined from the acquired detection value and the relationship specifying data read out from the storage device,

calculate a reward based on the acquired detection value such that the reward is larger in a case where a characteristic of the vehicle satisfies a predetermined criterion than in a case where the characteristic of the vehicle does not satisfy the predetermined criterion, and

update the relationship specifying data, when the computational load of the processor is equal to or less than a predetermined load, by using as inputs to a predetermined update map the state of the vehicle based on the acquired detection value, the value of the action variable used in the operation of the electronic device, and the reward corresponding to the operation of the electronic device, and

the update map is a map that outputs the relationship specifying data updated so as to increase an expected return on the reward in a case where the electronic device is operated in accordance with the relationship specifying data.

2. The control device for a vehicle according to claim 1, wherein

the processor is configured to acquire a detection value of a sensor that detects a state of the vehicle including an internal combustion engine,

the processor is configured to operate an electronic device that controls the internal combustion engine, and

the processor is configured to update the relationship specifying data when a rotation speed of a crankshaft of the internal combustion engine is equal to or less than a predetermined speed, on the assumption that the computational load is then equal to or less than the predetermined load.

3. The control device for a vehicle according to claim 1 or 2, wherein

the processor is configured to update the relationship specifying data when the vehicle is stopped, on the assumption that the computational load is then equal to or less than the predetermined load.

4. A control system for a vehicle, comprising:

a 1st processor and a storage device mounted on a vehicle; and

a 2nd processor disposed external to the vehicle,

the storage device is configured to store relationship specifying data that specifies a relationship between a state of the vehicle and an action variable that is a variable related to an operation of an electronic device in the vehicle,

the 1st processor is configured to:

acquire a detection value of a sensor that detects the state of the vehicle,

operate the electronic device based on the value of the action variable determined from the acquired detection value and the relationship specifying data read out from the storage device, and

transmit the state of the vehicle based on the acquired detection value, the value of the action variable used for the operation of the electronic device, and the reward corresponding to the operation of the electronic device to the 2nd processor when the computational load of the 1st processor is equal to or less than a predetermined load,

the 2nd processor is configured to:

receive the state of the vehicle based on the acquired detection value, the value of the action variable used in the operation of the electronic device, and the reward corresponding to the operation of the electronic device, which are transmitted from the 1st processor,

update the relationship specifying data by using, as inputs to a predetermined update map, the state of the vehicle based on the acquired detection value, the value of the action variable used in the operation of the electronic device, and the reward corresponding to the operation of the electronic device, and

transmit the updated relationship specifying data to the storage device of the vehicle,

5. A vehicle control method for a vehicle that includes a processor and a storage device configured to store relationship specifying data that specifies a relationship between a state of the vehicle and an action variable that is a variable related to an operation of an electronic device in the vehicle, the vehicle control method comprising:

acquiring, by the processor, a detection value of a sensor that detects a state of the vehicle;

operating, by the processor, the electronic device based on the value of the action variable determined by the acquired detection value and the relationship specifying data read out from the storage device;

calculating, by the processor, a reward based on the acquired detection value such that the reward is greater if the characteristic of the vehicle satisfies a predetermined criterion than if the characteristic of the vehicle does not satisfy the predetermined criterion;

Technical Field

The invention relates to a vehicle control device, a vehicle control system, and a vehicle control method.

Background

For example, Japanese Patent Application Laid-Open No. 2016-.

Disclosure of Invention

However, the filter must set the operation amount of the throttle valve of the internal combustion engine mounted on the vehicle to an appropriate value according to the operation amount of the accelerator pedal, so adapting the filter costs a skilled person many man-hours. More generally, adapting the operation amount of an electronic device in the vehicle to the state of the vehicle has conventionally cost skilled persons many man-hours.

A first aspect of the present invention relates to a vehicle control device. The vehicle control device includes a storage device and a processor. The storage device is configured to store relationship specifying data that specifies a relationship between a state of a vehicle and an action variable, which is a variable related to an operation of an electronic device in the vehicle. The processor is configured to: acquire a detection value of a sensor that detects the state of the vehicle; operate the electronic device based on a value of the action variable determined from the acquired detection value and the relationship specifying data read out from the storage device; calculate a reward based on the acquired detection value such that the reward when a characteristic of the vehicle satisfies a predetermined criterion is larger than the reward when the criterion is not satisfied; and, when a computational load of the processor is equal to or less than a predetermined load, update the relationship specifying data by using, as inputs to a predetermined update map, the state of the vehicle based on the acquired detection value, the value of the action variable used in the operation of the electronic device, and the reward corresponding to that operation. The update map is a map that outputs the relationship specifying data updated so as to increase the expected return on the reward in a case where the electronic device is operated in accordance with the relationship specifying data.

In the above configuration, calculating the reward associated with the operation of the electronic device makes it possible to grasp what reward that operation yields. Moreover, because the relationship specifying data is updated from the reward by an update map that follows reinforcement learning, the relationship between the state of the vehicle and the action variable can be set appropriately while the vehicle travels. This reduces the man-hours a skilled person needs to set the relationship between the vehicle state and the action variable.

In addition, executing the update processing increases the computational load of the execution device. In the above configuration, the update processing is executed when the computational load is equal to or less than a predetermined value, which keeps the update processing from interfering with other tasks that the execution device must perform.

In the vehicle control device according to the first aspect, the processor may be configured to acquire a detection value of a sensor that detects a state of the vehicle including an internal combustion engine. The processor may also be configured to operate an electronic device that controls the internal combustion engine. The processor may be configured to update the relationship specifying data when the rotation speed of the crankshaft of the internal combustion engine is equal to or less than a predetermined speed, on the assumption that the computational load is then equal to or less than the predetermined load.

The processing for operating the operation unit to control the control amount of the internal combustion engine includes processing corresponding to the appearance interval of the compression top dead center, and therefore, when the rotation speed of the crankshaft is high, the calculation load for controlling the internal combustion engine is larger than that when the rotation speed of the crankshaft is low. In the above configuration, the update process is executed when the rotation speed is equal to or lower than the predetermined speed, so that it is possible to suppress the computational load of the execution device from becoming excessively large due to the computational load related to the control of the control amount of the internal combustion engine and the computational load of the update process.

In the vehicle control device according to the first aspect, the processor may be configured to update the relationship specifying data when the vehicle is stopped, assuming that the calculation load is equal to or less than the predetermined load.

When the vehicle is traveling, the computational load of the execution device tends to be larger than that during parking. In the above configuration, the update processing is executed when the vehicle is stopped, so that it is possible to suppress an excessive increase in the calculation load of the execution device due to the calculation load generated by the processing executed by the execution device in association with the traveling of the vehicle and the calculation load generated by the update processing.
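The two load conditions described above can be sketched as simple predicates that gate the update. This is a minimal illustration, not the patented implementation: the function names, the threshold value `NE_THRESHOLD_RPM`, and the use of vehicle speed to detect a stop are all assumptions, since the source only speaks of "a predetermined speed" and of the vehicle being stopped.

```python
from typing import Callable

# Hypothetical threshold; the source only says "a predetermined speed".
NE_THRESHOLD_RPM = 2000.0


def low_load_by_engine_speed(ne_rpm: float,
                             threshold_rpm: float = NE_THRESHOLD_RPM) -> bool:
    """Treat the load as low when the crankshaft rotation speed is at or below the threshold."""
    return ne_rpm <= threshold_rpm


def low_load_by_vehicle_stop(vehicle_speed_kmh: float) -> bool:
    """Treat the load as low when the vehicle is stopped (assumed: speed equals zero)."""
    return vehicle_speed_kmh == 0.0


def maybe_update(load_is_low: bool, update_fn: Callable[[], None]) -> bool:
    """Run the update of the relationship specifying data only under low load."""
    if not load_is_low:
        return False
    update_fn()
    return True
```

Either predicate can be passed to `maybe_update`; which one applies depends on the configuration chosen.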

A second aspect of the present invention relates to a vehicle control system. The vehicle control system includes a 1st processor and a storage device mounted on a vehicle, and a 2nd processor disposed outside the vehicle. The storage device is configured to store relationship specifying data that specifies a relationship between a state of the vehicle and an action variable, which is a variable related to an operation of an electronic device in the vehicle. The 1st processor is configured to: acquire a detection value of a sensor that detects the state of the vehicle; operate the electronic device based on a value of the action variable determined from the acquired detection value and the relationship specifying data read out from the storage device; calculate a reward based on the acquired detection value such that the reward is larger when a characteristic of the vehicle satisfies a predetermined criterion than when the criterion is not satisfied; and, when a computational load of the 1st processor is equal to or less than a predetermined load, transmit to the 2nd processor the state of the vehicle based on the acquired detection value, the value of the action variable used in the operation of the electronic device, and the reward corresponding to that operation.
The 2nd processor is configured to: receive the state of the vehicle based on the acquired detection value, the value of the action variable used in the operation of the electronic device, and the reward corresponding to that operation, which are transmitted from the 1st processor; update the relationship specifying data by using the received state, action variable value, and reward as inputs to a predetermined update map; and transmit the updated relationship specifying data to the storage device of the vehicle. The update map is a map that outputs the relationship specifying data updated so as to increase the expected return on the reward in a case where the electronic device is operated in accordance with the relationship specifying data.

In the above configuration, the 2nd execution device executes the update process, thereby reducing the computational load of the 1st execution device. Further, since the 1st execution device executes the vehicle-side transmission processing when the computational load is equal to or less than a predetermined value, it is possible to keep the computational load of the 1st execution device from becoming excessive because of the vehicle-side transmission processing.

A third aspect of the invention relates to a vehicle control method. The vehicle includes a processor and a storage device configured to store relationship specifying data that specifies a relationship between a state of the vehicle and an action variable, which is a variable related to an operation of an electronic device in the vehicle. The vehicle control method includes: acquiring, by the processor, a detection value of a sensor that detects the state of the vehicle; operating, by the processor, the electronic device based on the value of the action variable determined from the acquired detection value and the relationship specifying data read out from the storage device; calculating, by the processor, a reward based on the acquired detection value such that the reward is greater if a characteristic of the vehicle satisfies a predetermined criterion than if the criterion is not satisfied; and updating, by the processor, when the computational load of the processor is equal to or less than a predetermined load, the relationship specifying data by using, as inputs to a predetermined update map, the state of the vehicle based on the acquired detection value, the value of the action variable used in the operation of the electronic device, and the reward corresponding to that operation. The update map is a map that outputs the relationship specifying data updated so as to increase the expected return on the reward in a case where the electronic device is operated in accordance with the relationship specifying data.

Drawings

Features, advantages, and technical and industrial significance of exemplary embodiments of the present invention will be described below with reference to the accompanying drawings, in which like reference numerals represent like elements, and wherein:

Fig. 1 is a diagram showing a control device and a drive system of a vehicle according to embodiment 1.

Fig. 2 is a flowchart showing the procedure of processing executed by the control device according to the embodiment.

Fig. 3 is a flowchart showing the procedure of processing executed by the control device according to the embodiment.

Fig. 4 is a diagram showing a configuration of a vehicle control system according to embodiment 2.

Fig. 5 is a flowchart showing steps of processing executed by the vehicle control system.

Detailed Description

Hereinafter, embodiment 1 of the vehicle control device will be described with reference to the drawings. Fig. 1 shows the configuration of a drive system and a control device of a vehicle VC1 according to the present embodiment.

As shown in Fig. 1, a throttle valve 14 and a fuel injection valve 16 are provided in the intake passage 12 of the internal combustion engine 10 in this order from the upstream side. Air taken into the intake passage 12 and fuel injected from the fuel injection valve 16 flow into a combustion chamber 24, which is defined by a cylinder 20 and a piston 22, as an intake valve 18 opens. In the combustion chamber 24, the air-fuel mixture is burned by the spark discharge of the ignition device 26, and the energy generated by the combustion is converted into rotational energy of the crankshaft 28 via the piston 22. The burned air-fuel mixture is discharged as exhaust gas to the exhaust passage 32 as the exhaust valve 30 opens. A catalyst 34, an aftertreatment device that purifies the exhaust gas, is provided in the exhaust passage 32.

The input shaft 52 of the transmission 50 can be mechanically coupled to the crankshaft 28 via the torque converter 40 including the lockup clutch 42. The transmission 50 is a device that varies a speed ratio, which is a ratio of the rotation speed of the input shaft 52 to the rotation speed of the output shaft 54. The output shaft 54 is mechanically coupled to a drive wheel 60.

The control device 70 operates operating portions of the internal combustion engine 10 such as the throttle valve 14, the fuel injection valve 16, and the ignition device 26 in order to control the internal combustion engine 10 and control the torque, the exhaust gas component ratio, and the like, which are controlled amounts of the internal combustion engine 10. Further, the control device 70 operates the lock-up clutch 42 in order to control the engaged state of the lock-up clutch 42 with the torque converter 40 as a control target. The control device 70 operates the transmission 50 to control the transmission 50 and control the gear ratio as a control amount. Fig. 1 shows the operation signals MS1 to MS5 of the throttle valve 14, the fuel injection valve 16, the ignition device 26, the lock-up clutch 42, and the transmission 50, respectively.

The control device 70 refers to the intake air amount Ga detected by the airflow meter 80, the opening degree of the throttle valve 14 (throttle opening degree TA) detected by the throttle sensor 82, and the output signal Scr of the crank angle sensor 84 for controlling the control amount. The control device 70 refers to the amount of depression of the accelerator pedal 86 (accelerator operation amount PA) detected by the accelerator sensor 88 and the acceleration Gx in the front-rear direction of the vehicle VC1 detected by the acceleration sensor 90.

The control device 70 includes a CPU72, a ROM74, an electrically rewritable nonvolatile memory (storage device 76), and a peripheral circuit 78, and these components can communicate via a local network 79. Here, the peripheral circuit 78 includes a circuit that generates a clock signal that defines an internal operation, a power supply circuit, a reset circuit, and the like.

The ROM74 stores a control program 74a and a learning program 74b. The storage device 76, on the other hand, stores relationship specifying data DR that specifies the relationship between the accelerator operation amount PA and both the command value for the throttle opening degree TA (throttle opening degree command value TA*) and the retard amount aop of the ignition device 26. Here, the retard amount aop is a retard amount relative to a predetermined reference ignition timing, and the reference ignition timing is whichever of the MBT ignition timing and the knock limit point is on the retard side. The MBT ignition timing is the ignition timing at which the maximum torque is obtained (maximum torque ignition timing). The knock limit point is the advance limit of the ignition timing at which knocking can be kept within an allowable level under the assumed optimum conditions when a high-octane fuel with a high knock limit is used. The storage device 76 also stores torque output map data DT. The torque output map defined by the torque output map data DT is a map that takes the rotation speed NE of the crankshaft 28, the charging efficiency η, and the ignition timing as inputs and outputs the torque Trq.
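The relation between the reference ignition timing and the retard amount aop can be illustrated with a short sketch. It assumes, purely for illustration, that ignition timing is expressed in crank-angle degrees before top dead center, so that a larger number means a more advanced timing; the function names and units are hypothetical and do not come from the source.

```python
def reference_ignition_timing(mbt_deg_btdc: float,
                              knock_limit_deg_btdc: float) -> float:
    """Pick whichever of the MBT timing and the knock limit point is on the retard side.

    With timing expressed in degrees before top dead center (assumed here),
    the more retarded timing is the smaller value.
    """
    return min(mbt_deg_btdc, knock_limit_deg_btdc)


def retarded_ignition_timing(mbt_deg_btdc: float,
                             knock_limit_deg_btdc: float,
                             aop_deg: float) -> float:
    """Retard the reference ignition timing by the retard amount aop (non-negative)."""
    if aop_deg < 0:
        raise ValueError("the retard amount aop is assumed to be non-negative")
    return reference_ignition_timing(mbt_deg_btdc, knock_limit_deg_btdc) - aop_deg
```

Any feedback correction by knock control would be applied on top of the value returned here and is left out of the sketch.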

Fig. 2 shows a procedure of a process executed by the control device 70 according to the present embodiment. The process shown in Fig. 2 is realized by the CPU72 repeatedly executing the control program 74a stored in the ROM74, for example, at predetermined cycles. In the following, the step number of each process is written as a numeral prefixed with "S".

In the series of processes shown in Fig. 2, the CPU72 first acquires, as the state s, time-series data consisting of six sampled values "PA(1), PA(2), ..., PA(6)" of the accelerator operation amount PA (S10). Here, the sampled values constituting the time-series data are sampled at mutually different timings. In the present embodiment, the time-series data consists of six sampled values that are adjacent to one another in time when sampling is performed at a constant sampling period.
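One way to hold the six most recent samples is a fixed-length buffer. The sketch below is an assumption about how S10 could be realized; the class name `AcceleratorHistory` and the choice to return no state until six samples exist are illustrative and not taken from the source.

```python
from collections import deque
from typing import Optional, Tuple


class AcceleratorHistory:
    """Keeps the most recent samples of the accelerator operation amount PA."""

    def __init__(self, length: int = 6) -> None:
        # A deque with maxlen drops the oldest sample automatically.
        self._samples = deque(maxlen=length)

    def push(self, pa: float) -> None:
        """Store one sample taken at the constant sampling period."""
        self._samples.append(pa)

    def state(self) -> Optional[Tuple[float, ...]]:
        """Return (PA(1), ..., PA(6)) as the state s, or None until the buffer is full."""
        if len(self._samples) < self._samples.maxlen:
            return None
        return tuple(self._samples)
```

Returning a tuple makes the state hashable, which is convenient when it is later used as a key into a table-type function.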

Next, in accordance with the policy π specified by the relationship specifying data DR, the CPU72 sets an action a consisting of the throttle opening degree command value TA* and the retard amount aop that correspond to the state s acquired in the process of S10 (S12).

In the present embodiment, the relationship specifying data DR is data that defines the action value function Q and the policy π. The action value function Q is a table-type function that gives the expected return for the 8-dimensional independent variable formed by the state s (six dimensions) and the action a (two dimensions). The policy π defines the following rule: given a state s, preferentially select the action a that maximizes the action value function Q for that state (the greedy action), while still selecting the other actions a with a predetermined probability ε.

Specifically, the number of values that the independent variable of the action value function Q can take in the present embodiment is smaller than the number of all combinations of possible values of the state s and the action a, because some combinations are removed on the basis of human knowledge or the like. For example, a case where one of two adjacent sample values in the time-series data of the accelerator operation amount PA is the minimum value of PA and the other is the maximum value cannot arise from a human operating the accelerator pedal 86, so the action value function Q is not defined for it. In the present embodiment, this knowledge-based dimension reduction limits the number of values of the state s for which the action value function Q is defined to 10^4 or fewer, more preferably 10^3 or fewer.
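The selection rule of the policy π can be sketched as an ε-greedy choice over a table. This is a minimal sketch under stated assumptions: the table is modeled as a dictionary keyed by (state, action), missing entries default to 0.0, and the function name `select_action` is hypothetical. Exploring uniformly over all actions with probability ε gives the greedy action a total probability of 1 - ε + ε/|A|.

```python
import random
from typing import Dict, Hashable, Sequence, Tuple

Action = Tuple[float, float]  # (throttle opening command value, retard amount)


def select_action(q_table: Dict[Tuple[Hashable, Action], float],
                  state: Hashable,
                  actions: Sequence[Action],
                  epsilon: float,
                  rng=random) -> Action:
    """Select an action for `state` with an epsilon-greedy rule."""
    if rng.random() < epsilon:
        # Explore: every action, including the greedy one, is equally likely.
        return rng.choice(list(actions))
    # Exploit: take the action with the largest tabulated value (default 0.0).
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```

With `epsilon=0.0` the function always returns the greedy action, which is useful for checking the table contents.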

Next, based on the set throttle opening degree command value TA* and retard amount aop, the CPU72 outputs an operation signal MS1 to the throttle valve 14 to operate the throttle opening degree TA and outputs an operation signal MS3 to the ignition device 26 to operate the ignition timing (S14). The present embodiment takes as an example feedback control of the throttle opening degree TA to the throttle opening degree command value TA*, so the operation signals MS1 may differ from each other even when the throttle opening degree command values TA* are the same. Likewise, when known knock control (KCS) is performed, the ignition timing is the value obtained by retarding the reference ignition timing by the retard amount aop and then applying the feedback correction of the KCS. Here, the reference ignition timing is variably set by the CPU72 according to the rotation speed NE of the crankshaft 28 and the charging efficiency η. The rotation speed NE is calculated by the CPU72 based on the output signal Scr of the crank angle sensor 84. The charging efficiency η is calculated by the CPU72 based on the rotation speed NE and the intake air amount Ga.

Next, the CPU72 acquires the torque Trq of the internal combustion engine 10, the torque command value Trq* for the internal combustion engine 10, and the acceleration Gx (S16). Here, the CPU72 calculates the torque Trq by inputting the rotation speed NE, the charging efficiency η, and the ignition timing to the torque output map. The CPU72 sets the torque command value Trq* in accordance with the accelerator operation amount PA.

Next, the CPU72 determines whether the transition flag F is "1" (S18). A transition flag F of "1" indicates that the vehicle is in transient operation, and "0" indicates that it is not. When determining that the transition flag F is "0" (S18: no), the CPU72 determines whether the absolute value of the change amount ΔPA per unit time of the accelerator operation amount PA is equal to or greater than a predetermined amount ΔPAth (S20). Here, the change amount ΔPA may be, for example, the difference between the latest accelerator operation amount PA at the execution timing of the process of S20 and the accelerator operation amount PA one unit time before that timing.

If the CPU72 determines that the absolute value of the change amount ΔPA is equal to or greater than the predetermined amount ΔPAth (S20: yes), it substitutes "1" for the transition flag F (S22). On the other hand, if the CPU72 determines that the transition flag F is "1" (S18: yes), it determines whether a predetermined period has elapsed from the execution timing of the process of S22 (S24). Here, the predetermined period lasts until a state in which the absolute value of the change amount ΔPA per unit time of the accelerator operation amount PA is equal to or less than a specified amount, which is smaller than the predetermined amount ΔPAth, has continued for a predetermined time. When the CPU72 determines that the predetermined period has elapsed (S24: yes), it substitutes "0" for the transition flag F (S26).

When the process of S22 or S26 is completed, or when a negative determination is made in the process of S20 or S24, the CPU72 proceeds to the process of S28. In the process of S28, the CPU72 stores in the storage device 76 the state s acquired in the process of S10, the action a set in the process of S12, the torque Trq, the torque command value Trq*, and the acceleration Gx acquired in the process of S16, and the current value of the transition flag F. Upon completion of the process of S28, the CPU72 temporarily ends the series of processes shown in Fig. 2.
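The handling of the transition flag F in S18 to S26 behaves like a small two-state machine with hysteresis. The sketch below is one possible reading: the class and parameter names are hypothetical, and the "predetermined time" is counted in control cycles, which the source does not state.

```python
class TransitionFlag:
    """Tracks the transition flag F from the per-cycle change amount of PA."""

    def __init__(self, dpa_th: float, dpa_settle: float, settle_cycles: int) -> None:
        if not dpa_settle < dpa_th:
            raise ValueError("the settle amount must be smaller than the threshold")
        self.dpa_th = dpa_th                # predetermined amount (threshold on |dPA|)
        self.dpa_settle = dpa_settle        # smaller amount used to detect settling
        self.settle_cycles = settle_cycles  # assumed: predetermined time in cycles
        self.flag = 0
        self._quiet = 0                     # consecutive cycles with a small |dPA|

    def step(self, dpa: float) -> int:
        """Process one control cycle and return the current value of F."""
        if self.flag == 0:                      # S18: no
            if abs(dpa) >= self.dpa_th:         # S20
                self.flag = 1                   # S22
                self._quiet = 0
        else:                                   # S18: yes
            if abs(dpa) <= self.dpa_settle:     # S24: count the quiet cycles
                self._quiet += 1
            else:
                self._quiet = 0
            if self._quiet >= self.settle_cycles:
                self.flag = 0                   # S26
        return self.flag
```

Using a settle amount below the entry threshold prevents the flag from toggling on small fluctuations around a single threshold.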

Fig. 3 shows a procedure of processing executed by the control device 70 in the present embodiment. The process shown in Fig. 3 is realized by the CPU72 repeatedly executing the learning program 74b stored in the ROM74, for example, at predetermined cycles.

In the series of processes shown in Fig. 3, the CPU72 first determines whether the current time is the end of a trip (S30). Here, a trip refers to one period during which the travel permission signal of the vehicle is in the active (on) state. In the present embodiment, the travel permission signal corresponds to the ignition signal.

If the CPU72 determines that it is the end of the trip (S30: yes), it selects one period during which the transition flag F is constant, that is, one episode (S32). Each episode is either a period from the execution of the process of S26 to the execution of the process of S22 or a period from the execution of the process of S22 to the execution of the process of S26.
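Splitting the stored flag values into episodes amounts to finding runs of constant F. The sketch below assumes that one flag value was stored per control cycle and that the runs at the start and end of the trip also count as episodes; the function name is hypothetical.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def split_into_episodes(flags: Sequence[int]) -> List[Tuple[int, int, int]]:
    """Return (flag value, start index, end index) for each run of constant F.

    The end index is exclusive, so flags[start:end] is one episode.
    """
    episodes = []
    start = 0
    for value, run in groupby(flags):
        length = len(list(run))
        episodes.append((value, start, start + length))
        start += length
    return episodes
```

The returned index ranges can then be used to slice the stored time series of states, actions, torque, and acceleration for the selected episode.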

Next, as data for the following processes, the CPU72 reads the time-series data of the sets of three sampled values, namely the torque command value Trq*, the torque Trq, and the acceleration Gx, in the episode selected in the process of S32, together with the time-series data of the state s and the action a (S34). In Fig. 3, different numbers in parentheses denote values of a variable at different sampling timings. For example, the torque command value Trq*(1) and the torque command value Trq*(2) are values sampled at different timings. The time-series data of the action a belonging to the selected episode is defined as the action set Aj, and the time-series data of the state s belonging to that episode is defined as the state set Sj.

Next, the CPU72 determines whether the logical product of condition (a) and condition (b) is true (S36). Condition (a) is that the absolute value of the difference between the torque Trq and the torque command value Trq* is equal to or less than a predetermined amount ΔTrq. Condition (b) is that the acceleration Gx is equal to or greater than the lower limit value GxL and equal to or less than the upper limit value GxH.

Here, the CPU72 variably sets the predetermined amount ΔTrq according to the change amount ΔPA per unit time of the accelerator operation amount PA at the start of the episode and the value of the transition flag F. That is, for an episode in which the transition flag F is "1", which is an episode of transient operation, the CPU72 sets the predetermined amount ΔTrq to a larger value than for an episode in which the transition flag F is "0". The CPU72 also sets the predetermined amount ΔTrq to different values according to the sign of the change amount ΔPA.

Further, the CPU72 variably sets the lower limit value GxL according to the sign of the change amount ΔPA of the accelerator operation amount PA at the start of the episode and the value of the transition flag F. That is, when the change amount ΔPA is positive in a transient episode, the CPU72 sets the lower limit value GxL to a value greater than the lower limit value GxL for a steady-state episode. When the change amount ΔPA is negative in a transient episode, the CPU72 sets the lower limit value GxL to a value smaller than the lower limit value GxL for a steady-state episode.

The CPU72 likewise variably sets the upper limit value GxH according to the sign of the change amount ΔPA per unit time of the accelerator operation amount PA at the start of the episode and the value of the transition flag F. That is, when the change amount ΔPA is positive in a transient episode, the CPU72 sets the upper limit value GxH to a value greater than the upper limit value GxH for a steady-state episode. When the change amount ΔPA is negative in a transient episode, the CPU72 sets the upper limit value GxH to a value smaller than the upper limit value GxH for a steady-state episode.

If the CPU72 determines that the logical product is true (yes in S36), it substitutes "10" for the reward r (S38), whereas if it determines that the logical product is false (no in S36), it substitutes "-10" for the reward r (S40). When the processing of S38 or S40 is completed, the CPU72 updates the relationship specifying data DR stored in the storage device 76 shown in fig. 1. In the present embodiment, the ε-soft on-policy Monte Carlo method is used.
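The reward determination of S36 to S40 can be sketched as follows. This is a minimal illustration only: the function name, the argument names, and the assumption that conditions (a) and (b) must hold for every sample stored for the episode are not taken from the embodiment, and the two reward values are passed in as illustrative defaults. The thresholds are assumed to have already been chosen according to the transition flag F and the sign of ΔPA.

```python
def episode_reward(trq, trq_cmd, gx, delta_trq, gx_low, gx_high,
                   r_true=10, r_false=-10):
    """Return r_true if conditions (a) and (b) hold for all samples, else r_false.

    trq, trq_cmd, gx: equal-length sequences sampled during one episode
    (hypothetical layout). delta_trq, gx_low, gx_high: thresholds already
    selected for this episode.
    """
    for t, t_cmd, g in zip(trq, trq_cmd, gx):
        cond_a = abs(t - t_cmd) <= delta_trq   # torque follows its command value
        cond_b = gx_low <= g <= gx_high        # acceleration inside the allowed band
        if not (cond_a and cond_b):            # logical product is false (no in S36)
            return r_false                     # corresponds to S40
    return r_true                              # corresponds to S38
```

For example, `episode_reward([100, 102], [101, 101], [0.5, 0.6], 2.0, 0.0, 1.0)` evaluates both samples against a torque tolerance of 2.0 and an acceleration band of 0.0 to 1.0.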

That is, the CPU72 adds the reward r to each return R(Sj, Aj) specified by each pair of a state and the corresponding action read out in the process of S34 (S42). Here, "R(Sj, Aj)" collectively denotes the returns R for which the state is one of the elements of the state set Sj and the action is one of the elements of the action set Aj. Next, the CPU72 averages each return R(Sj, Aj) specified by the pairs of states and actions read out in the process of S34, and substitutes the result into the corresponding action value function Q(Sj, Aj) (S44). Here, the averaging may be a process of dividing the return R calculated in the process of S42 by the number obtained by adding a predetermined number to the number of times the process of S42 has been performed. The initial value of the return R may be set to the initial value of the action value function Q.
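The accumulation and averaging of S42 and S44 might look like the following sketch. The dictionary-based tables, the function name, and the default `prior_count=1` (standing in for the "predetermined number") are assumptions made for illustration; the embodiment only states that a tabular function is used.

```python
def update_action_values(episode, r, returns, counts, q,
                         q_init=0.0, prior_count=1):
    """Apply S42 and S44 to every (state, action) pair stored for one episode.

    episode: iterable of (state, action) tuples read out in S34.
    returns, counts, q: dicts keyed by (state, action), updated in place.
    q_init: initial value of Q, also used as the initial value of the return.
    """
    for key in episode:
        returns[key] = returns.get(key, q_init) + r          # S42: add reward r
        counts[key] = counts.get(key, 0) + 1                 # times S42 has run
        q[key] = returns[key] / (counts[key] + prior_count)  # S44: averaging
    return q
```

With `q_init=0.0` and `prior_count=1`, a first episode with reward 10 gives Q = 10 / 2 = 5.0 for each visited pair, so the initial value acts like one earlier observation.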

Next, for each state read out in the process of S34, the CPU72 substitutes into the action Aj the action, that is, the pair of the throttle opening degree command value TA and the retard amount aop, at which the corresponding action value function Q(Sj, a) takes its maximum value (S46). Here, "a" denotes any action that can be taken. Note that the action Aj takes a different value depending on the state read out in the process of S34, but the same symbol is used here to simplify the description.

Next, the CPU72 updates the corresponding policy π(Aj | Sj) for each state read out in the process of S34 (S48). That is, when the total number of actions is "|A|", the selection probability of the action Aj selected in S46 is set to "1 - ε + ε/|A|". Further, the selection probability of each of the "|A| - 1" actions other than the action Aj is set to "ε/|A|". Because the process of S48 is based on the action value function Q updated in the process of S44, the relationship specifying data DR, which specifies the relationship between the state s and the action a, is updated so that the return R increases.
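The greedy selection of S46 and the policy update of S48 can be illustrated for a single state as below. This is a sketch under assumptions: the function name and the dictionary representation of Q and of the policy are hypothetical, and unseen state-action pairs are assumed to have the value 0.0.

```python
def update_policy(q, state, actions, epsilon):
    """Return the epsilon-soft selection probabilities for one state.

    q: dict mapping (state, action) to the action value.
    actions: list of all actions that can be taken (|A| = len(actions)).
    """
    # S46: the action with the maximum action value becomes Aj.
    greedy = max(actions, key=lambda a: q.get((state, a), 0.0))
    base = epsilon / len(actions)              # epsilon / |A| for every action
    # S48: Aj gets 1 - epsilon + epsilon/|A|, the others get epsilon/|A|.
    return {a: (1.0 - epsilon + base) if a == greedy else base
            for a in actions}
```

With four actions and ε = 0.2, the greedy action receives probability 0.85 and each other action 0.05, so the probabilities sum to 1.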

When the process of S48 is completed, the CPU72 determines whether all the episodes stored in the process of S28 have been selected in the process of S32 and the processes of S34 to S48 have been completed for them (S50). If the CPU72 determines that there is an episode that has not yet been selected (no in S50), the process returns to S32 and that episode is selected. On the other hand, if the CPU72 determines that all episodes have been selected (yes in S50), or if a negative determination is made in the process of S30, the series of processes shown in fig. 3 is temporarily ended.

Here, the operation and effect of the present embodiment will be described. The CPU72 acquires time-series data of the accelerator operation amount PA in accordance with the operation of the accelerator pedal 86 by the user, and sets an action a consisting of the throttle opening degree command value TA and the retard amount aop in accordance with the policy π. Here, the CPU72 basically selects the action a that maximizes the expected return, based on the action value function Q defined by the relationship specifying data DR. However, with a predetermined probability ε, the CPU72 selects an action other than the one that maximizes the expected return, thereby exploring for the action a that maximizes the expected return. This enables the relationship specifying data DR to be updated by reinforcement learning as the user drives the vehicle VC1. Therefore, the throttle opening degree command value TA and the retard amount aop corresponding to the accelerator operation amount PA can be set to values appropriate for the traveling of the vehicle VC1 without excessively increasing the man-hours required of skilled engineers.
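Selecting an action in accordance with the policy, so that the greedy action is chosen most of the time and other actions are occasionally explored, can be sketched as a weighted random draw. The function name and the dictionary layout of one policy row are assumptions for illustration; `random.choices` is the Python standard-library call used for the draw.

```python
import random

def select_action(policy_row, rng=random):
    """Draw one action according to its selection probability.

    policy_row: dict mapping each action of one state to its probability.
    rng: the random module or a random.Random instance (for reproducibility).
    """
    actions = list(policy_row)
    weights = [policy_row[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable, which may be convenient when checking the exploration rate offline.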

In particular, in the present embodiment, the update process is performed at the end of the trip. At the end of the trip, the computational load related to the control of the internal combustion engine 10 is smaller than during the trip, and therefore the computational load of the CPU72 is smaller. Therefore, the processes of S32 to S50 can be appropriately executed by the CPU72.

According to the present embodiment described above, the following operational effects can be further obtained. (1) The independent variables of the action value function Q include time-series data of the accelerator operation amount PA. Thus, the value of the action a can be finely adjusted with respect to various changes in the accelerator operation amount PA, as compared with a case where only a single sampled value of the accelerator operation amount PA is used as an independent variable.

(2) The independent variables of the action value function Q include the throttle opening degree command value TA itself. Thus, the degree of freedom of exploration by reinforcement learning is more easily increased than in a case where, for example, a parameter of a model formula that models the behavior of the throttle opening degree command value TA is used as the independent variable relating to the throttle opening degree.

Embodiment 2

Hereinafter, embodiment 2 will be described with reference to the drawings, focusing on the differences from embodiment 1.

Fig. 4 shows a configuration of a control system for executing reinforcement learning in the present embodiment. In fig. 4, for convenience of explanation, the same reference numerals are given to components corresponding to those shown in fig. 1.

In addition to the control program 74a, a learning subroutine 74c is stored in the ROM74 in the vehicle VC1 shown in fig. 4. The control device 70 further includes a communicator 77. The communicator 77 is a device for communicating with the data analysis center 110 via the network 100 outside the vehicle VC1.

The data analysis center 110 analyzes data transmitted from a plurality of vehicles VC1, VC2, and so on. The data analysis center 110 includes a CPU112, a ROM114, an electrically rewritable nonvolatile memory (storage device 116), a peripheral circuit 118, and a communicator 117, and these components can communicate with each other through a local network 119. A main program 114a for learning is stored in the ROM114. The storage device 116 stores the relationship specifying data DR.

Fig. 5 shows a procedure of reinforcement learning according to the present embodiment. The processing shown on the left side of fig. 5 is realized by the CPU72 executing the learning subroutine 74c stored in the ROM74 shown in fig. 4. The processing shown on the right side of fig. 5 is realized by the CPU112 executing a main program 114a for learning stored in the ROM 114. In fig. 5, for convenience of explanation, the same step numbers are assigned to the processes corresponding to the processes shown in fig. 3. The processing shown in fig. 5 will be described below along the time series of reinforcement learning.

In the series of processes shown on the left side of fig. 5, when the CPU72 makes an affirmative determination in the process of S30, it operates the communicator 77 to transmit the data necessary for updating the relationship specifying data DR (S60). That is, the CPU72 transmits the time-series data of the state s, the action a, the torque Trq, the torque command value Trq*, the acceleration Gx, and the transition flag F that were stored in the process of S28 during the trip.

On the other hand, as shown on the right side of fig. 5, the CPU112 receives the data transmitted by the processing of S60 (S70), and executes the processing of S32 to S50. When the affirmative determination is made in the process of S50, the CPU112 operates the communicator 117 to transmit the updated relationship specifying data DR (S72). Further, the CPU112, upon completion of the process of S72, temporarily ends the series of processes shown on the right side of fig. 5.

On the other hand, as shown on the left side of fig. 5, the CPU72 receives the updated relationship specifying data DR (S62), and overwrites the relationship specifying data DR used in the processing of S12 with the data (S64).

Further, the CPU72 temporarily ends the series of processing shown on the left side of fig. 5 in the case where the processing of S64 is completed, or in the case where a negative determination is made in the processing of S30. As described above, in the present embodiment, the data analysis center 110 executes the update processing of the relationship specifying data DR, thereby reducing the calculation load of the CPU 72.
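The division of work between the vehicle and the data analysis center in fig. 5 can be sketched as two cooperating functions. This is only an in-process illustration: no real network is involved, and the function names, the payload layout, and the `exchange` and `update_fn` callbacks are all hypothetical stand-ins for the communicators 77 and 117 and for the processes of S32 to S50.

```python
def center_update(payload, relation_data, update_fn):
    """Right side of fig. 5: receive (S70), update per episode, return data (S72)."""
    for episode in payload["episodes"]:
        update_fn(relation_data, episode)      # stands in for S32 to S50
    return relation_data

def vehicle_end_of_trip(trip_ended, episodes, local_data, exchange):
    """Left side of fig. 5: send at end of trip (S60), receive (S62), overwrite (S64)."""
    if not trip_ended:                         # negative determination in S30
        return local_data
    updated = exchange({"episodes": episodes}) # S60 send, then S62 receive
    local_data.clear()                         # S64: overwrite the data used in S12
    local_data.update(updated)
    return local_data
```

Here `exchange` would wrap whatever transport is actually used; in a quick check it can simply be `lambda p: center_update(p, center_data, update_fn)`.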

According to the present embodiment described above, the following operational effect can be further obtained. (3) The CPU72 transmits the data necessary for updating the relationship specifying data DR at the end of the trip. Compared with transmitting the data during the trip, this reduces the computational load imposed on the CPU72 during the trip.

Corresponding relation

The correspondence between the matters in the above-described embodiment and the matters described in the section "means for solving the problem" is as follows.

The CPU72 and the ROM74 of the embodiment can be regarded as the execution means of the invention, and the storage device 76 of the embodiment can be regarded as the storage means of the invention. The processing of S10 and S16 in the embodiment can be regarded as the acquisition processing of the present invention, the processing of S14 in the embodiment can be regarded as the operation processing of the present invention, the processing of S36 to S40 in the embodiment can be regarded as the reward calculation processing of the present invention, and the processing of S42 to S48 in the embodiment can be regarded as the update processing of the present invention. The map specified by the instruction to execute the processing of S42 to S48 in the learning program 74b of the embodiment can be regarded as the update map of the present invention. The end of the trip in the embodiment can be regarded as a case where the calculation load of the present invention is equal to or less than the predetermined load. The CPU72 and the ROM74 of the embodiment can be regarded as the 1 st execution means of the invention, and the CPU112 and the ROM114 of the embodiment can be regarded as the 2 nd execution means of the invention. The process of S60 of the embodiment can be regarded as the vehicle-side transmission process of the present invention, and the process of S62 of the embodiment can be regarded as the vehicle-side reception process of the present invention. The process of S70 of the embodiment can be regarded as the external reception process of the present invention, and the process of S72 of the embodiment can be regarded as the external transmission process of the present invention.

Other embodiments

The present embodiment can be modified as follows. This embodiment mode and the following modifications can be combined with each other within a range not technically contradictory.

About action variables

In the above embodiment, the throttle opening degree command value TA is exemplified as the action variable relating to the opening degree of the throttle valve, but the present invention is not limited thereto. For example, the responsiveness of the throttle opening degree command value TA to the accelerator operation amount PA may be expressed by a dead time and a second-order lag filter, and a total of 3 variables, namely the dead time and the two variables defining the second-order lag filter, may be used as the variables relating to the opening degree of the throttle valve. In this case, however, the state variable is preferably the change amount per unit time of the accelerator operation amount PA instead of the time-series data of the accelerator operation amount PA.

In the above embodiment, the retard amount aop is exemplified as the action variable relating to the ignition timing, but the present invention is not limited thereto. For example, the ignition timing itself, which is subject to correction by the KCS, may be used.

In the above embodiment, the variable relating to the opening degree of the throttle valve and the variable relating to the ignition timing are exemplified as the action variables, but the present invention is not limited thereto. For example, the fuel injection amount may be used in addition to the variable relating to the opening degree of the throttle valve and the variable relating to the ignition timing. Further, of those 3 variables, only the variable relating to the opening degree of the throttle valve and the fuel injection amount, or only the variable relating to the ignition timing and the fuel injection amount, may be used as the action variables. Alternatively, only 1 of those 3 variables may be used as the action variable.

As described in the section on the internal combustion engine below, in the case of a compression ignition type internal combustion engine, a variable relating to the injection amount may be used instead of the variable relating to the opening degree of the throttle valve, and a variable relating to the injection timing may be used instead of the variable relating to the ignition timing. Further, it is preferable to add, in addition to the variable relating to the injection timing, a variable relating to the number of injections in 1 combustion cycle and a variable relating to the time interval between the end timing of one and the start timing of the other of 2 fuel injections that are adjacent in time series for one cylinder in 1 combustion cycle.

For example, when the transmission 50 is a stepped transmission, the current value of a solenoid valve that adjusts the engagement state of a clutch by hydraulic pressure may be used as an action variable. For example, when a hybrid vehicle, an electric vehicle, or a fuel cell vehicle is used as the vehicle, as described in the section on the vehicle below, the torque and/or the output of the rotating electric machine may be used as an action variable. For example, when a vehicle-mounted air conditioner including a compressor that is rotated by the rotational power of the crankshaft of the internal combustion engine is provided, the load torque of the compressor may be included in the action variables. In addition, when an electrically driven in-vehicle air conditioner is provided, the power consumption of the air conditioner may be included in the action variables.

About state

In the above embodiment, the time-series data of the accelerator operation amount PA is data consisting of 6 values sampled at equal intervals, but is not limited thereto. It suffices that the data consist of 2 or more sampled values with different sampling timings; data consisting of 3 or more sampled values, and data sampled at equal intervals, are more preferable.

The state variable relating to the accelerator operation amount is not limited to the time-series data of the accelerator operation amount PA, and may be, for example, the change amount per unit time of the accelerator operation amount PA, as described in the section on action variables.

For example, as described in the section on action variables, when the current value of the solenoid valve is used as an action variable, the state may include the rotation speed of the input shaft 52 of the transmission, the rotation speed of the output shaft 54, and the hydraulic pressure adjusted by the solenoid valve. For example, as described in the section on action variables, when the torque and the output of the rotating electric machine are used as action variables, the state may include the charging rate and the temperature of the battery. For example, as described in the section on action variables, when the load torque of the compressor and the power consumption of the air conditioner are included in the action variables, the state may include the temperature in the vehicle interior.

With respect to the relationship specifying data

In the above embodiment, the action value function Q is a tabular function, but is not limited thereto. For example, a function approximator may also be used.

For example, instead of using the action value function Q, the policy π may be expressed by a function approximator having the state s and the action a as independent variables and the probability of taking the action a as the dependent variable, and the parameters that define the function approximator may be updated according to the reward r.

Dimensionality reduction for tabular form data

The method of reducing the dimension of tabular data is not limited to the method described in the above embodiment. For example, since the accelerator operation amount PA rarely takes its maximum value, the action value function Q need not be defined for states in which the accelerator operation amount PA is equal to or greater than a predetermined amount, and the throttle opening degree command value TA for the case where the accelerator operation amount PA is equal to or greater than the predetermined amount may be adapted separately. For example, dimension reduction may also be performed by excluding, from the values the action can take, values at which the throttle opening degree command value TA is equal to or greater than a predetermined value.

However, dimension reduction is not essential. For example, in embodiment 2, if the computing power of the CPU72 and the storage capacity of the storage device 76 are sufficient, the action value function may be learned for only some of the actions before shipment of the vehicle, while all the actions may be made available for exploration after shipment. Because more data for learning can be secured after shipment than before shipment, this increases the number of actions that can be explored and makes it possible to find more appropriate actions.

Regarding updating mappings

The processing of S42 to S48 is exemplified by processing based on the ε-soft on-policy Monte Carlo method, but is not limited thereto. For example, processing based on the off-policy Monte Carlo method may be used. The method is also not limited to Monte Carlo methods; for example, an off-policy TD method may be used, an on-policy TD method such as the SARSA method may be used, or an eligibility trace method may be used for on-policy learning.

For example, when the policy π is expressed by a function approximator and directly updated based on the reward r, as described in the section on the relationship specifying data, the update map may be configured using a policy gradient method or the like.

In addition, the target directly updated by the reward r is not limited to only one of the action value function Q and the policy π. For example, the action value function Q and the policy π may each be updated, as in the actor-critic method. The actor-critic method is also not limited to this form; for example, the value function V may be the update target instead of the action value function Q.
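For comparison with the Monte Carlo update, one step of an on-policy TD update (SARSA) and of an off-policy TD update (Q-learning, named here as a common example of the off-policy TD family) can be sketched as follows. The step size `alpha`, the discount factor `gamma`, the dictionary table, and the existence of a per-step reward are assumptions of this sketch; the embodiment itself assigns one reward per episode.

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD: bootstrap from the action actually selected next."""
    old = q.get((s, a), 0.0)
    target = r + gamma * q.get((s_next, a_next), 0.0)
    q[(s, a)] = old + alpha * (target - old)
    return q[(s, a)]

def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy TD: bootstrap from the best action value in the next state."""
    old = q.get((s, a), 0.0)
    best_next = max(q.get((s_next, b), 0.0) for b in actions)
    target = r + gamma * best_next
    q[(s, a)] = old + alpha * (target - old)
    return q[(s, a)]
```

The only difference between the two is the bootstrap term: SARSA uses the value of the action the current policy actually chose, whereas Q-learning uses the maximum over all actions regardless of what the policy chose.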

Note that the "ε" that defines the policy π is not limited to a fixed value, and may be changed according to a predetermined rule in accordance with the degree of progress of learning.
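One possible rule of this kind is an exponential decay with a floor, sketched below. Measuring the progress of learning by the number of completed trips, as well as the function name and all numeric defaults, are assumptions chosen purely for illustration.

```python
def epsilon_for(trips_completed, eps_start=0.2, eps_min=0.02, decay=0.99):
    """Shrink the exploration probability as learning progresses.

    trips_completed: hypothetical measure of the degree of progress of learning.
    The result never falls below eps_min, so some exploration always remains.
    """
    return max(eps_min, eps_start * decay ** trips_completed)
```

Keeping a nonzero floor preserves the ε-soft property, that is, every action retains a positive selection probability.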

With respect to reward calculation processing

In the process of fig. 3, a reward is given according to whether or not the logical product of the condition (a) and the condition (b) is true, but the present invention is not limited thereto. For example, a process of giving a reward according to whether or not the condition (a) is satisfied and a process of giving a reward according to whether or not the condition (b) is satisfied may both be executed separately. Alternatively, only one of these two processes may be executed.

For example, the following processing may be performed: instead of uniformly giving the same reward when the condition (a) is satisfied, a larger reward is given when the absolute value of the difference between the torque Trq and the torque command value Trq* is small than when the absolute value is large. For example, the following processing may also be performed: instead of uniformly giving the same reward when the condition (a) is not satisfied, a smaller reward is given when the absolute value of the difference between the torque Trq and the torque command value Trq* is large than when the absolute value is small.
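A reward that varies with the size of the torque error, rather than being uniform, could be shaped as in the following sketch. The piecewise-linear shape, the function name, and the numeric range of -10 to 10 are assumptions for illustration only; any shape in which a smaller error yields a larger reward would fit the description.

```python
def graded_torque_reward(trq, trq_cmd, delta_trq, r_max=10.0):
    """Reward that decreases as the torque error grows (delta_trq must be > 0)."""
    ratio = abs(trq - trq_cmd) / delta_trq
    if ratio <= 1.0:
        # Condition (a) satisfied: r_max at zero error, r_max / 2 at the limit.
        return r_max * (1.0 - 0.5 * ratio)
    # Condition (a) not satisfied: penalty grows with the error, floor at -r_max.
    return max(-r_max, -0.5 * r_max * ratio)
```

For instance, with `delta_trq=2`, an error of 1 yields 7.5 and an error of 3 yields -7.5. The same idea could be applied to condition (b) by measuring how far the acceleration lies from its allowed band.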

For example, the following processing may be performed: instead of uniformly giving the same reward when the condition (b) is satisfied, the magnitude of the reward is varied according to the magnitude of the acceleration Gx. For example, the following processing may also be performed: instead of uniformly giving the same reward when the condition (b) is not satisfied, the magnitude of the reward is varied according to the magnitude of the acceleration Gx.

In the above embodiment, the reward r is given according to whether or not a criterion relating to drivability is satisfied, but the criterion relating to drivability is not limited to the above, and the reward may be set according to whether or not, for example, the noise and/or the vibration intensity satisfies a criterion. Moreover, any 1 or more of the following 4 may be adopted: whether the acceleration satisfies its criterion, whether the followability of the torque Trq satisfies its criterion, whether the noise satisfies its criterion, and whether the vibration intensity satisfies its criterion.

The reward calculation process is not limited to a process of giving the reward r according to whether or not a criterion relating to drivability is satisfied. For example, it may be a process of giving a larger reward when the fuel consumption rate satisfies a criterion than when it does not. For example, it may be a process of giving a larger reward when the exhaust characteristics satisfy a criterion than when they do not. Further, the reward calculation process may include 2 or 3 of the following 3 processes: a process of giving a larger reward when the criterion relating to drivability is satisfied than when it is not, a process of giving a larger reward when the fuel consumption rate satisfies its criterion than when it does not, and a process of giving a larger reward when the exhaust characteristics satisfy their criterion than when they do not.

As described in the section on action variables, for example, when the current value of the solenoid valve of the transmission 50 is used as an action variable, the reward calculation process may include at least one of the 3 processes (a) to (c) below.

(a) When the time required for switching the gear ratio of the transmission is within a predetermined time, a larger reward is given than when the predetermined time is exceeded.

(b) When the absolute value of the rate of change of the rotation speed of the input shaft 52 of the transmission is equal to or less than an input-side predetermined value, a larger reward is given than when the input-side predetermined value is exceeded.

(c) When the absolute value of the rate of change of the rotation speed of the output shaft 54 of the transmission is equal to or less than an output-side predetermined value, a larger reward is given than when the output-side predetermined value is exceeded.

As described in the section on action variables, for example, when the torque and the output of the rotating electric machine are used as action variables, the reward calculation process may include a process of giving a larger reward when the charging rate of the battery is within a predetermined range than when it is not, and a process of giving a larger reward when the temperature of the battery is within a predetermined range than when it is not. Further, as described in the section on action variables, for example, when the load torque of the compressor and the power consumption of the air conditioner are included in the action variables, a process of giving a larger reward when the temperature in the vehicle interior is within a predetermined range than when it is not may be added.

Control system for vehicle

The vehicle control system is not limited to a system including the control device 70 and the data analysis center 110. For example, a portable terminal carried by the user may be used instead of the data analysis center 110, so that the vehicle control system is configured by the control device 70 and the portable terminal. For example, the control device 70, the portable terminal, and the data analysis center 110 may together constitute the vehicle control system. This can be realized, for example, by having the portable terminal execute the process of S12.

About the execution device

The execution device is not limited to a device that includes the CPU72 (112) and the ROM74 (114) and executes software processing. For example, a dedicated hardware circuit such as an ASIC may be provided to perform, by hardware, at least a part of the processing performed by software in the above embodiment. That is, the execution device may have any one of the following configurations (a) to (c). (a) A processing device that executes all of the above processing in accordance with a program, and a program storage device such as a ROM that stores the program, are provided. (b) A processing device and a program storage device that execute a part of the above processing in accordance with a program, and a dedicated hardware circuit that executes the remaining processing, are provided. (c) A dedicated hardware circuit that executes all of the above processing is provided. Here, there may be a plurality of software execution devices each including a processing device and a program storage device, and a plurality of dedicated hardware circuits.

About a storage device

In the above embodiment, the storage device that stores the relationship specifying data DR and the storage device (ROM74) that stores the learning program 74b and the control program 74a are separate storage devices, but the present invention is not limited to this.

In relation to internal combustion engines

The internal combustion engine is not limited to one that includes, as a fuel injection valve, a port injection valve that injects fuel into the intake passage 12. It may be an internal combustion engine including an in-cylinder injection valve that injects fuel directly into the combustion chamber 24, or an internal combustion engine including both a port injection valve and an in-cylinder injection valve.

The internal combustion engine is not limited to a spark ignition type internal combustion engine, and may be, for example, a compression ignition type internal combustion engine that uses light oil (diesel fuel) or the like as fuel.

About a vehicle

The vehicle is not limited to a vehicle in which the only thrust generation device is an internal combustion engine, and may be, for example, a so-called hybrid vehicle including an internal combustion engine and a rotating electric machine. The vehicle may also be, for example, a so-called electric vehicle or fuel cell vehicle that does not include an internal combustion engine and includes a rotating electric machine as the thrust generation device.
