Vehicle control device, vehicle control system, and vehicle control method

Document No. 731963, publication date 2021-04-20

Reading note: This technology, "Vehicle control device, vehicle control system, and vehicle control method," was created by Yosuke Hashimoto, Akihiro Katayama, Yuta Oshiro, Kazunori Sugie, and Naoya Oka on 2020-10-16. Summary: The present disclosure relates to a vehicle control device, a vehicle control system, and a vehicle control method. The vehicle control device includes a storage device and an execution device. The storage device stores relationship specifying data that specifies a relationship between a state of the vehicle and an action variable, which is a variable related to an operation of an electronic device in the vehicle. The execution device is configured to execute: acquisition processing for acquiring a detection value of a sensor and driving preference information; operation processing for operating the electronic device; reward calculation processing for giving a larger reward when a characteristic of the vehicle satisfies a criterion than when it does not; and update processing for updating the relationship specifying data. The execution device is configured to output, based on an update map, the relationship specifying data updated so as to increase the expected return of the reward obtained when the electronic device is operated in accordance with the relationship specifying data.

1. A control device for a vehicle, characterized by comprising:

a storage device and an execution device, wherein

the storage device is configured to store relationship specifying data that specifies a relationship between a state of the vehicle and an action variable that is a variable related to an operation of an electronic device in the vehicle,

the execution device is configured to:

executing acquisition processing of acquiring a detection value of a sensor configured to detect a state of the vehicle and driving preference information as information relating to driving preference of a user;

executing operation processing of operating the electronic device based on the value of the action variable determined by the relationship specifying data and the detection value acquired by the acquisition processing;

executing reward calculation processing of giving, based on the detection value acquired by the acquisition processing, a greater reward in a case where a characteristic of the vehicle satisfies a criterion than in a case where the characteristic does not satisfy the criterion, the reward calculation processing being processing of giving different rewards for different pieces of driving preference information even when the characteristic relating to the behavior of the vehicle satisfies the same criterion;

executing update processing of updating the relationship specifying data with the state of the vehicle based on the detection value acquired by the acquisition processing, the value of the action variable used in the operation of the electronic device, and the reward corresponding to the operation as inputs to a predetermined update map; and

outputting, based on the update map, the relationship specifying data updated so as to increase an expected return of the reward in a case where the electronic device is operated in accordance with the relationship specifying data.

2. The vehicle control device according to claim 1,

the acquisition process includes the following processes: obtaining an evaluation of a behavior of the vehicle by a user as the driving preference information,

the reward calculation process includes the following process: in a case where the driving preference information indicating a low evaluation is acquired by the acquisition process, giving a reward different from that given before the acquisition of the evaluation, even if the characteristic relating to the behavior of the vehicle is the same.

3. The control device for a vehicle according to claim 1 or 2,

the driving preference information includes history information of acceleration in the front-rear direction of the vehicle.

4. The control device for a vehicle according to any one of claims 1 to 3,

the driving preference information includes history information of accelerator operation amounts.

5. The control device for a vehicle according to any one of claims 1 to 4,

the acquisition process includes the following processes: and acquiring an analysis result of the face image of the user as the driving preference information.

6. The control device for a vehicle according to any one of claims 1 to 5,

the state of the vehicle includes a change in the accelerator operation amount,

the reward calculation process includes the following process: when the acceleration in the front-rear direction of the vehicle accompanying the change in the accelerator operation amount satisfies a criterion, giving a larger reward than when the acceleration does not satisfy the criterion.

7. The vehicle control device according to claim 6,

the vehicle is provided with an internal combustion engine as a thrust force generation device of the vehicle,

the electronic apparatus includes a throttle valve of the internal combustion engine,

the action variable includes a variable relating to an opening degree of the throttle valve.

8. A control system for a vehicle, characterized in that,

comprises a storage device and an execution device,

the storage device stores relationship specifying data that specifies a relationship between a state of the vehicle and an action variable that is a variable related to an operation of an electronic device in the vehicle,

the execution device includes a 1st execution device mounted on the vehicle and a 2nd execution device that is a device other than an in-vehicle device,

the 1st execution device is configured to execute at least acquisition processing for acquiring a detection value of a sensor that detects a state of the vehicle and driving preference information that is information relating to a driving preference of a user, and operation processing for operating the electronic device based on a value of the action variable determined from the relationship specifying data and the detection value acquired by the acquisition processing,

the 2nd execution device is configured to execute at least update processing of updating the relationship specifying data with the state of the vehicle based on the detection value acquired by the acquisition processing, the value of the action variable used in the operation of the electronic device, and a reward corresponding to the operation as inputs to a predetermined update map.

9. A vehicle control method, the vehicle including a storage device configured to store relationship specifying data that specifies a relationship between a state of the vehicle and an action variable that is a variable related to an operation of an electronic device in the vehicle, and an execution device, the vehicle control method characterized by comprising:

executing, by the execution device, acquisition processing of acquiring a detection value of a sensor configured to detect a state of the vehicle and driving preference information as information relating to a driving preference of a user;

executing, by the execution device, an operation process of operating the electronic apparatus based on the value of the action variable determined by the relationship specifying data and the detection value acquired by the acquisition process;

executing, by the execution device, reward calculation processing of giving, based on the detection value acquired by the acquisition processing, a greater reward in a case where a characteristic of the vehicle satisfies a criterion than in a case where the characteristic does not satisfy the criterion, the reward calculation processing being processing of giving different rewards for different pieces of driving preference information even when the characteristic relating to the behavior of the vehicle satisfies the same criterion;

executing, by the execution device, an update process of updating the relationship specifying data with the state of the vehicle based on the detection value acquired by the acquisition process, the value of the action variable used in the operation of the electronic device, and the reward corresponding to the operation as inputs to a predetermined update map; and

outputting, by the execution device and based on the update map, the relationship specifying data updated so as to increase an expected return of the reward in a case where the electronic device is operated in accordance with the relationship specifying data.

Technical Field

The invention relates to a vehicle control device, a vehicle control system, and a vehicle control method.

Background

For example, Japanese Patent Application Laid-Open No. 2016- describes a device that operates a throttle valve, which is an operation portion of an internal combustion engine mounted on a vehicle, based on a value obtained by processing the operation amount of an accelerator pedal with a filter.

Disclosure of Invention

However, since the filter must set the operation amount of the throttle valve of the internal combustion engine mounted on the vehicle to an appropriate operation amount according to the operation amount of the accelerator pedal, adapting the filter takes a skilled worker many man-hours. In this way, skilled workers spend many man-hours adapting the operation amounts of electronic devices in a vehicle according to the state of the vehicle.

A vehicle control device according to aspect 1 of the present disclosure includes a storage device and an execution device. The storage device stores relationship specifying data that specifies a relationship between a state of the vehicle and an action variable that is a variable related to an operation of an electronic device in the vehicle. The execution device is configured to execute acquisition processing of acquiring a detection value of a sensor configured to detect a state of the vehicle and driving preference information as information relating to a driving preference of a user. The execution device is configured to execute operation processing of operating the electronic device based on the value of the action variable specified by the relationship specifying data and the detection value acquired by the acquisition processing. The execution device is configured to execute reward calculation processing of giving, based on the detection value acquired by the acquisition processing, a larger reward when a characteristic of the vehicle satisfies a criterion than when the characteristic does not satisfy the criterion. The reward calculation processing is processing of giving different rewards for different pieces of driving preference information even when the characteristic relating to the behavior of the vehicle satisfies the same criterion. The execution device is configured to execute update processing of updating the relationship specifying data with the state of the vehicle based on the detection value acquired by the acquisition processing, the value of the action variable used in the operation of the electronic device, and the reward corresponding to the operation as inputs to a predetermined update map. The execution device is configured to output, based on the update map, the relationship specifying data updated so as to increase the expected return of the reward in a case where the electronic device is operated in accordance with the relationship specifying data.

According to the above-described aspect 1, by calculating the reward associated with the operation of the electronic device, it is possible to grasp what reward the operation yields. Further, by updating the relationship specifying data based on the reward using the update map according to reinforcement learning, the relationship between the state of the vehicle and the action variable can be set to a relationship suitable for the traveling of the vehicle. Therefore, the man-hours required for a skilled worker to set an appropriate relationship between the state of the vehicle and the action variable during the traveling of the vehicle can be reduced.

In a case where the reward is uniquely determined, the relationship between the state of the vehicle and the action variable learned by reinforcement learning may not follow the driving preference of the user. In this respect, according to the above-described aspect 1, by acquiring the driving preference information and giving the reward based on the driving preference information in the reward calculation processing, the relationship specifying data can be updated by reinforcement learning into relationship specifying data that conforms to the driving preference of the user.
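The preference-dependent reward described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the criterion bounds, the preference labels, and the weight values are all assumptions introduced for the example.

```python
# Hedged sketch: a reward that is larger when the longitudinal
# acceleration Gx satisfies a criterion, and that differs for
# different driving preferences even for the same behavior.
# Criterion bounds, labels, and weights are illustrative assumptions.

def calculate_reward(acceleration_gx, criterion_min, criterion_max, preference):
    """Return a base reward depending on whether Gx satisfies the
    criterion, scaled by a preference-dependent weight so that the
    same vehicle behavior earns different rewards per preference."""
    meets_criterion = criterion_min <= acceleration_gx <= criterion_max
    base = 10.0 if meets_criterion else -10.0  # larger reward when criterion is met
    weights = {"sporty": 1.5, "normal": 1.0, "mild": 0.5}  # assumed values
    return base * weights.get(preference, 1.0)
```

With these assumed weights, an acceleration inside the criterion band earns 15.0 for a "sporty" preference but only 5.0 for a "mild" one, which is the mechanism by which reinforcement learning is steered toward the user's taste.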

In the vehicle control device, the acquisition process may include a process of obtaining, as the driving preference information, an evaluation of the behavior of the vehicle by a user, and the reward calculation process may include a process of giving, in a case where the driving preference information indicating a low evaluation is acquired by the acquisition process, a reward different from that given before the acquisition of the evaluation, even if the characteristic relating to the behavior of the vehicle is the same.

According to the above-described aspect 1, the evaluation made by the user is acquired as the driving preference information, and if the evaluation is low, the reward is changed. Then, by performing the operation processing using the relationship specifying data updated by the subsequent update processing, the evaluation made by the user can be improved.

In the vehicle control device, the driving preference information may include history information of acceleration in a front-rear direction of the vehicle. Since the history of the acceleration in the front-rear direction of the vehicle is information that differs according to the manner of operation of the accelerator by the user, the driving preference of the user is reflected in the history of the acceleration. In view of this, according to the above-described aspect 1, by acquiring the history of the acceleration as the driving preference information, the driving preference information can be acquired even if the user does not input the driving preference information.

In the vehicle control device, the driving preference information may include history information of an accelerator operation amount. Since the accelerator operation by the user differs depending on the driving preference of the user, the driving preference information is included in the history information of the accelerator operation amount. In view of this, according to the above-described aspect 1, by acquiring the history of the accelerator operation amount as the driving preference information, the driving preference information can be acquired even if the user does not input the driving preference information.
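As a hedged sketch of how history information of the accelerator operation amount might be condensed into such preference information (the use of the mean per-step change, the thresholds, and the label names are illustrative assumptions, not taken from the disclosure):

```python
# Hedged sketch: deriving a driving-preference label from the history
# of accelerator operation amounts PA (percent). Thresholds and labels
# are illustrative assumptions.

def classify_preference(pa_history, sporty_threshold=40.0, mild_threshold=10.0):
    """Classify the user's preference from the mean absolute per-step
    change of the accelerator operation amount in the history."""
    deltas = [abs(b - a) for a, b in zip(pa_history, pa_history[1:])]
    mean_delta = sum(deltas) / len(deltas) if deltas else 0.0
    if mean_delta >= sporty_threshold:
        return "sporty"   # large, abrupt pedal movements
    if mean_delta <= mild_threshold:
        return "mild"     # gentle pedal movements
    return "normal"
```

A classifier along these lines would let the device infer the preference without any explicit input from the user, which is the point made in the paragraph above.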

In the vehicle control device, the acquisition process may include a process of acquiring an analysis result of a face image of the user as the driving preference information. According to the above-described aspect 1, by acquiring the analysis result of the face image of the user as the driving preference information, the driving preference information can be acquired even if the user does not input the driving preference information.

In the vehicle control device, the state of the vehicle may include a change in an accelerator operation amount, and the reward calculation process may include a process of giving, when the acceleration in the front-rear direction of the vehicle accompanying the change in the accelerator operation amount satisfies a criterion, a larger reward than when the acceleration does not satisfy the criterion.

Since the magnitude of the acceleration in the front-rear direction of the vehicle generated by a change in the accelerator operation amount correlates strongly with the traveling performance of the vehicle, giving a reward according to whether the acceleration satisfies the criterion, as in the above-described aspect 1, makes it possible to learn by reinforcement learning the value of the action variable appropriate for making the traveling performance the desired performance.

In particular, according to the above-described aspect 1, by changing the manner in which the reward is given according to the driving preference information, it is possible to learn the value of the action variable appropriate for the running performance appropriate for the driving preference by reinforcement learning.

The vehicle may include an internal combustion engine as a thrust force generation device of the vehicle, and the electronic device may include a throttle valve of the internal combustion engine. In the vehicle control device, the action variable may include a variable relating to an opening degree of the throttle valve.

For example, in an internal combustion engine in which the injection amount is adjusted according to the intake air amount, the torque and/or the output of the internal combustion engine change greatly according to the opening degree of the throttle valve. Therefore, by using a variable relating to the opening degree of the throttle valve as the action variable with respect to the accelerator operation amount, the propulsion force of the vehicle can be adjusted appropriately.

A vehicle control system according to aspect 2 of the present disclosure includes a storage device and an execution device. The storage device stores relationship specifying data that specifies a relationship between a state of the vehicle and an action variable that is a variable related to an operation of an electronic device in the vehicle. The execution device includes a 1st execution device mounted on the vehicle and a 2nd execution device that is a device other than an in-vehicle device. The 1st execution device is configured to execute at least acquisition processing for acquiring a detection value of a sensor that detects a state of the vehicle and driving preference information that is information relating to a driving preference of a user, and operation processing for operating the electronic device based on a value of the action variable determined based on the relationship specifying data and the detection value acquired by the acquisition processing. The 2nd execution device is configured to execute at least update processing of updating the relationship specifying data with the state of the vehicle based on the detection value acquired by the acquisition processing, the value of the action variable used in the operation of the electronic device, and a reward corresponding to the operation as inputs to a predetermined update map.

According to the above-described aspect 2, by having the 2nd execution device execute the update processing, the computational load on the 1st execution device can be reduced compared with the case where the 1st execution device executes the update processing. Here, the 2nd execution device being a device different from the in-vehicle device means that the 2nd execution device is not mounted on the vehicle.

In a vehicle control method according to aspect 3 of the present disclosure, the vehicle includes a storage device configured to store relationship specifying data that specifies a relationship between a state of the vehicle and an action variable that is a variable related to an operation of an electronic device in the vehicle, and an execution device. The vehicle control method includes: executing, by the execution device, acquisition processing of acquiring a detection value of a sensor configured to detect a state of the vehicle and driving preference information as information relating to a driving preference of a user; executing, by the execution device, operation processing of operating the electronic device based on the value of the action variable determined by the relationship specifying data and the detection value acquired by the acquisition processing; executing, by the execution device, reward calculation processing of giving, based on the detection value acquired by the acquisition processing, a greater reward in a case where a characteristic of the vehicle satisfies a criterion than in a case where the characteristic does not satisfy the criterion, the reward calculation processing being processing of giving different rewards for different pieces of driving preference information even when the characteristic relating to the behavior of the vehicle satisfies the same criterion; executing, by the execution device, update processing of updating the relationship specifying data with the state of the vehicle based on the detection value acquired by the acquisition processing, the value of the action variable used in the operation of the electronic device, and the reward corresponding to the operation as inputs to a predetermined update map; and outputting, by the execution device and based on the update map, the relationship specifying data updated so as to increase an expected return of the reward in a case where the electronic device is operated in accordance with the relationship specifying data.

Drawings

Features, advantages, and technical and industrial significance of exemplary embodiments of the present invention will be described below with reference to the accompanying drawings, in which like reference numerals represent like elements, and wherein:

fig. 1 is a diagram showing a control device and a drive system according to embodiment 1.

Fig. 2 is a flowchart showing the procedure of processing executed by the control device according to embodiment 1.

Fig. 3 is a flowchart showing a detailed procedure of a part of the processing executed by the control device according to embodiment 1.

Fig. 4 is a flowchart showing the procedure of processing executed by the control device according to embodiment 1.

Fig. 5 is a flowchart showing the procedure of processing executed by the control device according to embodiment 2.

Fig. 6 is a diagram showing the configuration of the control system according to embodiment 3.

Fig. 7 is a flowchart showing steps of processing executed by the control system.

Detailed Description

Embodiment 1

Hereinafter, embodiment 1 of the vehicle control device will be described with reference to the drawings. Fig. 1 shows a configuration of a drive system and a control device of a vehicle VC1 according to the present embodiment.

As shown in fig. 1, a throttle valve 14 and a fuel injection valve 16 are provided in the intake passage 12 of the internal combustion engine 10 in this order from the upstream side, and air taken into the intake passage 12 and fuel injected from the fuel injection valve 16 flow into a combustion chamber 24, defined by a cylinder 20 and a piston 22, as the intake valve 18 opens. In the combustion chamber 24, the air-fuel mixture of fuel and air is burned by spark discharge from the ignition device 26, and the energy generated by the combustion is converted into rotational energy of the crankshaft 28 via the piston 22. The air-fuel mixture used for combustion is discharged as exhaust gas to the exhaust passage 32 as the exhaust valve 30 opens. A catalyst 34, an aftertreatment device for purifying exhaust gas, is provided in the exhaust passage 32.

The input shaft 52 of the transmission 50 can be mechanically connected to the crankshaft 28 via the torque converter 40, which includes a lock-up clutch 42. The transmission 50 is a device that changes the speed ratio (gear ratio), which is the ratio of the rotational speed of the input shaft 52 to the rotational speed of the output shaft 54. The output shaft 54 is mechanically coupled to a drive wheel 60.

The control device 70 controls the internal combustion engine 10, and operates operation portions of the internal combustion engine 10 such as the throttle valve 14, the fuel injection valve 16, and the ignition device 26 in order to control the torque, the exhaust gas component ratio, and the like, which are controlled amounts of the internal combustion engine 10. The control device 70 controls the torque converter 40, and operates the lock-up clutch 42 to control the engaged state of the lock-up clutch 42. The control device 70 controls the transmission 50, and operates the transmission 50 to control the gear ratio as a control amount. Fig. 1 shows the operation signals MS1 to MS5 of the throttle valve 14, the fuel injection valve 16, the ignition device 26, the lock-up clutch 42, and the transmission 50, respectively.

The control device 70 refers to the intake air amount Ga detected by the air flow meter 80, the opening degree of the throttle valve 14 (throttle opening degree TA) detected by the throttle sensor 82, and the output signal Scr of the crank angle sensor 84 in order to control the controlled amounts. Further, the control device 70 refers to the depression amount of the accelerator pedal 86 (accelerator operation amount PA) detected by the accelerator sensor 88, the acceleration Gx in the front-rear direction of the vehicle VC1 detected by the acceleration sensor 90, the face image of the user captured by the camera 92, and the value of the evaluation variable VV determined by operation of the evaluation switch 94. Here, the evaluation switch 94 is a human-machine interface with which the user of the vehicle VC1 selects one of three options related to the running performance of the vehicle VC1. The three options are three items regarding responsiveness: "too high", "just right", and "too low".

The control device 70 includes a CPU72, a ROM74, an electrically rewritable nonvolatile memory (storage device 76), and a peripheral circuit 78, which can communicate with one another via a local network 79. Here, the peripheral circuit 78 includes a circuit that generates a clock signal defining the internal operation, a power supply circuit, a reset circuit, and the like.

The ROM74 stores a control program 74a and a learning program 74b. The storage device 76, on the other hand, stores relationship specifying data DR that specifies the relationship between the accelerator operation amount PA and both a command value for the throttle opening degree TA (throttle opening degree command value TA*) and the retard amount aop of the ignition device 26. Here, the retard amount aop is a retard amount with respect to a reference ignition timing determined in advance as the more retarded of the MBT ignition timing and the knock limit point. The MBT ignition timing is the ignition timing at which the maximum torque is obtained (maximum-torque ignition timing). The knock limit point is the advance limit value of the ignition timing at which knock can be kept within an allowable level under assumed optimum conditions when a high-octane fuel with a high knock limit is used. The storage device 76 also stores torque output map data DT. The torque output map defined by the torque output map data DT takes as inputs the rotation speed NE of the crankshaft 28, the charging efficiency η, and the ignition timing aig, and outputs the torque Trq.
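The torque output map DT can be pictured as a gridded lookup table over (NE, η, aig). The following is a minimal sketch under assumed grid axes and a nearest-grid-point lookup; the actual map format and any interpolation scheme are not specified in this excerpt.

```python
import bisect

# Hedged sketch: the torque output map as a 3-D gridded table.
# Grid breakpoints and torque values are illustrative assumptions;
# a production map would typically interpolate between grid points.

class TorqueOutputMap:
    def __init__(self, ne_axis, eta_axis, aig_axis, table):
        # table[i][j][k] = torque at (ne_axis[i], eta_axis[j], aig_axis[k])
        self.ne_axis, self.eta_axis, self.aig_axis = ne_axis, eta_axis, aig_axis
        self.table = table

    @staticmethod
    def _nearest(axis, value):
        # index of the grid point closest to value (axis is sorted ascending)
        i = bisect.bisect_left(axis, value)
        if i == 0:
            return 0
        if i == len(axis):
            return len(axis) - 1
        return i if axis[i] - value < value - axis[i - 1] else i - 1

    def torque(self, ne, eta, aig):
        i = self._nearest(self.ne_axis, ne)
        j = self._nearest(self.eta_axis, eta)
        k = self._nearest(self.aig_axis, aig)
        return self.table[i][j][k]
```

This is the lookup the CPU72 would perform in step S16 below when it computes the torque Trq from NE, η, and the ignition timing.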

Fig. 2 shows the procedure of processing executed by the control device 70 according to the present embodiment. The processing shown in fig. 2 is realized by the CPU72 repeatedly executing the control program 74a and the learning program 74b stored in the ROM74, for example, at a predetermined cycle. In the following, each step number is prefixed with "S".

In the series of processes shown in fig. 2, the CPU72 first acquires, as the state s, time-series data consisting of 6 sampled values "PA(1), PA(2), …, PA(6)" of the accelerator operation amount PA (S10). Here, the sampled values constituting the time-series data are values sampled at mutually different timings. In the present embodiment, the time-series data consists of 6 sampled values adjacent to each other in time series, sampled at a constant sampling period.
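The construction of the state s from the 6 most recent samples of PA can be sketched as a fixed-length ring buffer. The class name and the handling of the warm-up phase before 6 samples exist are assumptions for illustration.

```python
from collections import deque

# Hedged sketch: maintaining the state s as the 6 most recent samples
# PA(1)..PA(6) of the accelerator operation amount, taken at a constant
# sampling period.

class AcceleratorStateBuffer:
    def __init__(self, n=6):
        self.samples = deque(maxlen=n)  # oldest sample is dropped automatically

    def sample(self, pa):
        self.samples.append(pa)

    def state(self):
        # Returns (PA(1), ..., PA(6)), oldest first, once 6 samples exist;
        # None during the warm-up phase (an assumed convention).
        if len(self.samples) < self.samples.maxlen:
            return None
        return tuple(self.samples)
```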

Next, the CPU72 sets an action a consisting of the throttle opening degree command value TA* and the retard amount aop according to the state s obtained in the processing of S10, in accordance with the policy π specified by the relationship specifying data DR (S12).

In the present embodiment, the relationship specifying data DR is data that specifies the action value function Q and the policy π. In the present embodiment, the action value function Q is a function in table form whose value is the expected return for each 8-dimensional argument consisting of the state s and the action a. The policy π defines the following rule: when the state s is given, the action a that maximizes the expected return of the action value function Q for the given state s (the greedy action) is preferentially selected, while other actions a are also selected with a predetermined probability.
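The policy π described above is, in effect, an ε-greedy rule over the tabular action value function Q. A minimal sketch, assuming Q is stored as a dictionary keyed by (state, action) pairs and that the "predetermined probability" is a fixed ε:

```python
import random

# Hedged sketch: epsilon-greedy selection over a tabular action value
# function Q. The dict encoding of Q, the default value 0.0 for unseen
# pairs, and the fixed epsilon are illustrative assumptions.

def select_action(q_table, state, actions, epsilon=0.1, rng=random):
    """Preferentially pick the greedy action (maximum expected return
    under Q for the given state), but with probability epsilon pick a
    random action to keep exploring."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```

With epsilon set to 0 this reduces to the purely greedy choice, which is useful when deploying the learned data without further exploration.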

Specifically, the number of values that the arguments of the action value function Q according to the present embodiment can take is obtained by reducing, based on human knowledge or the like, part of all the combinations of values that the state s and the action a can take. That is, for example, the action value function Q is not defined for the case where one of two adjacent sampled values in the time-series data of the accelerator operation amount PA is the minimum value of the accelerator operation amount PA and the other is the maximum value, because such a change cannot arise from a human operation of the accelerator pedal 86. In the present embodiment, dimension reduction based on human knowledge or the like limits the number of values that the state s of the action value function Q can take to at most 10^4, and more preferably to at most 10^3.

Next, the CPU72 outputs an operation signal MS1 to the throttle valve 14 to operate the throttle opening degree TA and an operation signal MS3 to the ignition device 26 to operate the ignition timing, based on the set throttle opening degree command value TA* and retard amount aop (S14). Here, the present embodiment exemplifies a case where the throttle opening degree TA is feedback-controlled to the throttle opening degree command value TA*; therefore, even if the throttle opening degree command value TA* is the same value, the operation signals MS1 may differ from one another. For example, when known knock control (KCS) is performed, the ignition timing is set to a value obtained by retarding, by the retard amount aop, the reference ignition timing feedback-corrected by the KCS. Here, the reference ignition timing is variably set by the CPU72 according to the rotation speed NE of the crankshaft 28 and the charging efficiency η. The rotation speed NE is calculated by the CPU72 based on the output signal Scr of the crank angle sensor 84. The charging efficiency η is calculated by the CPU72 based on the rotation speed NE and the intake air amount Ga.

Next, the CPU72 acquires the torque Trq of the internal combustion engine 10, the torque command value Trq* for the internal combustion engine 10, and the acceleration Gx (S16). Here, the CPU72 calculates the torque Trq by inputting the rotation speed NE, the charging efficiency η, and the ignition timing into the torque output map. The CPU72 sets the torque command value Trq* according to the accelerator operation amount PA.

Next, the CPU72 determines whether the transition flag F is "1" (S18). The transition flag F being "1" indicates that a transient operation is in progress, and its being "0" indicates that a transient operation is not in progress. When determining that the transition flag F is "0" (S18: NO), the CPU72 determines whether the absolute value of the change amount ΔPA per unit time of the accelerator operation amount PA is equal to or greater than a predetermined amount ΔPAth (S20). Here, the change amount ΔPA may be, for example, the difference between the latest accelerator operation amount PA at the time the process of S20 is executed and the accelerator operation amount PA one unit time earlier.

When the CPU72 determines that the absolute value of the change amount ΔPA is equal to or greater than the predetermined amount ΔPAth (S20: yes), it substitutes "1" for the transition flag F (S22). On the other hand, if the CPU72 determines that the transition flag F is "1" (S18: yes), it determines whether a predetermined period has elapsed since the execution of the process of S22 (S24). Here, the predetermined period is the period until the state in which the absolute value of the change amount ΔPA per unit time of the accelerator operation amount PA remains at or below a predetermined amount smaller than ΔPAth has continued for a predetermined time. When the CPU72 determines that the predetermined period has elapsed (S24: yes), it substitutes "0" for the transition flag F (S26).
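The flag logic of S18 through S26 can be summarized in a short sketch (hypothetical Python; the function name, the calm-down threshold of half ΔPAth, and the counter-based timing are illustrative assumptions, not taken from the disclosure):

```python
def update_transition_flag(flag, abs_delta_pa, delta_pa_th, calm_counter, calm_needed):
    """Return (new_flag, new_calm_counter, episode_ended) for one sampling step."""
    if flag == 0:
        if abs_delta_pa >= delta_pa_th:
            return 1, 0, True        # transient begins -> end the steady episode (S22)
        return 0, 0, False
    # flag == 1: wait until |dPA| stays below a smaller threshold long enough (S24)
    if abs_delta_pa <= delta_pa_th * 0.5:   # assumed "predetermined amount" below dPAth
        calm_counter += 1
    else:
        calm_counter = 0
    if calm_counter >= calm_needed:
        return 0, 0, True            # transient over -> end the transient episode (S26)
    return 1, calm_counter, False
```

Each episode therefore ends exactly at the boundary between steady and transient operation, which is what lets S32 apply different criteria to the two kinds of scenario.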

When the processing of S22 or S26 is completed, the CPU72 ends the scenario (episode) and updates the action value function Q by reinforcement learning (S28). Fig. 3 shows the details of the processing of S28.

In the series of processing shown in fig. 3, the CPU72 acquires the time-series data of the set of three sampled values, namely the torque command value Trq, the torque Trq, and the acceleration Gx, in the scenario that has just ended, together with the time-series data of the state s and the action a (S30). Here, when the process of S30 follows the process of S22, the transition flag F was continuously "0" during the latest scenario, and when it follows the process of S26, the transition flag F was continuously "1".

In fig. 3, variables in parentheses whose numbers are different represent values of variables at different sampling timings. For example, torque command value Trq (1) and torque command value Trq (2) are values whose sampling timings are different from each other. Further, time-series data of an action a belonging to the latest scenario is defined as an action set Aj, and time-series data of a state s belonging to the scenario is defined as a state set Sj.

Next, the CPU72 determines whether the logical product (AND) of condition (i), that the absolute value of the difference between the torque Trq and the torque command value Trq in the latest scenario is equal to or less than the predetermined amount ΔTrq, and condition (ii), that the acceleration Gx is equal to or greater than the lower limit GxL and equal to or less than the upper limit GxH, is true (S32).

Here, the CPU72 variably sets the predetermined amount ΔTrq in accordance with the change amount ΔPA per unit time of the accelerator operation amount PA at the start of the scenario. That is, when the CPU72 determines from that change amount ΔPA that the scenario is a transient one, it sets the predetermined amount ΔTrq to a larger value than for a steady scenario.

The CPU72 also variably sets the lower limit GxL according to the change amount ΔPA per unit time of the accelerator operation amount PA at the start of the scenario. That is, for a transient scenario in which the change amount ΔPA is positive, the CPU72 sets the lower limit GxL to a larger value than for a steady scenario, and for a transient scenario in which the change amount ΔPA is negative, it sets the lower limit GxL to a smaller value than for a steady scenario.

The CPU72 likewise variably sets the upper limit GxH according to the change amount ΔPA per unit time of the accelerator operation amount PA at the start of the scenario. That is, for a transient scenario in which the change amount ΔPA is positive, the CPU72 sets the upper limit GxH to a larger value than for a steady scenario, and for a transient scenario in which the change amount ΔPA is negative, it sets the upper limit GxH to a smaller value than for a steady scenario.

If the CPU72 determines that the logical product is true (S32: yes), it substitutes "10" for the reward r (S34); if the logical product is false (S32: no), it substitutes "-10" for the reward r (S36). When the processing of S34 or S36 is completed, the CPU72 updates the relationship specifying data DR stored in the storage device 76 shown in fig. 1. In the present embodiment, an ε-soft on-policy Monte Carlo method is used.
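The reward assignment of S32 through S36 amounts to a simple band check; a minimal sketch (hypothetical Python, with illustrative parameter names) might read:

```python
def compute_reward(trq, trq_cmd, gx, delta_trq, gx_low, gx_high):
    """+10 when the torque tracks its command within delta_trq AND the
    acceleration lies in [gx_low, gx_high]; otherwise -10 (S32-S36)."""
    on_target = abs(trq - trq_cmd) <= delta_trq and gx_low <= gx <= gx_high
    return 10.0 if on_target else -10.0
```

Because ΔTrq, GxL, and GxH are themselves adjusted per scenario type and per user evaluation, the same two-valued reward can encode quite different driving preferences.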

That is, the CPU72 adds the reward r to each return R(Sj, Aj) specified by each state and corresponding action read in the process of S30 (S38). Here, "R(Sj, Aj)" collectively denotes the return R whose state is one of the elements of the state set Sj and whose action is one of the elements of the action set Aj. Next, the return R(Sj, Aj) determined for each state and corresponding action read in the process of S30 is averaged and substituted into the corresponding action value function Q(Sj, Aj) (S40). Here, the averaging may be performed by dividing the return R accumulated by the processing of S38 by the number obtained by adding a predetermined number to the number of times the processing of S38 has been performed. The initial value of the return R may be set from the initial value of the corresponding action value function Q.
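The averaging described above can be kept incremental by storing the accumulated return and a visit count per (state, action) pair. A minimal sketch (hypothetical Python; the class and the prior count n0 standing in for the "predetermined number" are illustrative):

```python
class ActionValue:
    """Running average of returns for one (state, action) pair (S38-S40)."""
    def __init__(self, n0=1, q_init=0.0):
        self.n0 = n0                 # the "predetermined number" added to the count
        self.total = q_init * n0     # initial return derived from the initial Q value
        self.n = 0                   # number of times S38 has run for this pair
    def add_return(self, r):
        self.total += r              # S38: accumulate the reward into the return
        self.n += 1
    def q(self):
        return self.total / (self.n + self.n0)   # S40: averaged return
```

The prior count n0 keeps early estimates anchored to the pre-shipment value rather than letting a single episode dominate Q.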

Next, for each state read in the process of S30, the CPU72 substitutes into the action Aj the action that maximizes the corresponding action value function Q(Sj, A), that is, the pair of throttle opening degree command value TA and retard amount aop giving the maximum expected return (S42). Here, "A" denotes an arbitrary selectable action. Although the action Aj takes different values depending on which state read in the process of S30 it belongs to, the notation is simplified here and the same symbol is used.

Next, the CPU72 updates the corresponding policy π(Aj|Sj) for each of the states read by the process of S30 (S44). That is, with the total number of actions denoted "|A|", the selection probability of the greedy action Aj selected in S42 is set to "1-ε+ε/|A|", and the selection probability of each of the remaining "|A|-1" actions is set to "ε/|A|". Because the process of S44 is based on the action value function Q updated by the process of S40, the relationship specifying data DR specifying the relationship between the state s and the action a is thereby updated so as to increase the return R.
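The probabilities assigned in S44 can be sketched directly (hypothetical Python; the function name is illustrative):

```python
def epsilon_soft_probs(n_actions, greedy_index, epsilon):
    """S44: the greedy action gets 1 - eps + eps/|A|; every other action eps/|A|."""
    base = epsilon / n_actions
    probs = [base] * n_actions
    probs[greedy_index] = 1.0 - epsilon + base
    return probs
```

The probabilities always sum to one, so the policy remains a proper distribution while keeping a floor of ε/|A| on every action for exploration.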

When the process of S44 is completed, the CPU72 temporarily ends the series of processes shown in fig. 3. Returning to fig. 2, the CPU72 temporarily ends the series of processing shown in fig. 2 when the processing of S28 is completed or when a negative determination is made in the processing of S20 or S24. The processing of S10 to S26 is realized by the CPU72 executing the control program 74a, and the processing of S28 is realized by the CPU72 executing the learning program 74b. The relationship specifying data DR at the time of shipment of the vehicle VC1 is data learned in advance by executing the same processing as that shown in fig. 2 while simulating the traveling of the vehicle on a test bench.

Fig. 4 shows a procedure of the process of changing the reference in the process of S32. The process shown in fig. 4 is realized by repeatedly executing the learning program 74b stored in the ROM74 by the CPU72, for example, at predetermined cycles.

In the series of processes shown in fig. 4, the CPU72 first determines whether there is an evaluation input made by operating the evaluation switch 94 (S50). When the CPU72 determines that there is an evaluation input (S50: yes), it determines whether the evaluation input indicates "responsiveness too low" (S52). When it determines that the evaluation input indicates "responsiveness too low" (S52: yes), the CPU72 decreases the predetermined amount ΔTrq for transients, increases the upper limit GxH and the lower limit GxL when the change amount ΔPA is positive, and decreases the upper limit GxH and the lower limit GxL when the change amount ΔPA is negative (S54).

On the other hand, if the CPU72 makes a negative determination in the process of S52, it determines whether the evaluation input indicates "responsiveness excessive" (S56). When it determines that the evaluation input indicates "responsiveness excessive" (S56: yes), the CPU72 increases the predetermined amount ΔTrq for transients, decreases the upper limit GxH and the lower limit GxL when the change amount ΔPA is positive, and increases the upper limit GxH and the lower limit GxL when the change amount ΔPA is negative (S58).
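The criterion shifts of S54 and S58 can be captured in one small function (hypothetical Python; the step size and the string labels are illustrative assumptions):

```python
def adjust_criteria(evaluation, delta_trq, gx_low, gx_high, delta_pa_positive, step=0.1):
    """S52-S58: shift the reward criteria according to the user's evaluation input."""
    sign = 1.0 if delta_pa_positive else -1.0
    if evaluation == "too_low":        # responsiveness too low -> S54
        delta_trq -= step
        gx_low += sign * step
        gx_high += sign * step
    elif evaluation == "excessive":    # responsiveness excessive -> S58
        delta_trq += step
        gx_low -= sign * step
        gx_high -= sign * step
    return delta_trq, gx_low, gx_high
```

Note the sign flip on the acceleration band: for a deceleration transient (ΔPA negative) the band moves in the opposite direction, since a "more responsive" deceleration means a larger negative acceleration.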

Further, the CPU72 temporarily ends the series of processing shown in fig. 4 when the processing of S54 or S58 is completed or when a negative determination is made in the processing of S50 or S56.

The operation and effects of the present embodiment will now be described.

The CPU72 acquires time-series data of the accelerator operation amount PA in accordance with the user's operation of the accelerator pedal 86, and sets an action a comprising the throttle opening degree command value TA and the retard amount aop in accordance with the policy π. Here, the CPU72 basically selects the action a that maximizes the expected return based on the action value function Q specified in the relationship specifying data DR. However, by selecting actions other than that one at the predetermined rate ε, the CPU72 also searches for an action a with a still higher expected return. This enables the relationship specifying data DR to be updated by reinforcement learning as the user drives the vehicle VC1. Therefore, the throttle opening degree command value TA and the retard amount aop corresponding to the accelerator operation amount PA can be set to appropriate values during traveling of the vehicle VC1 without requiring excessive man-hours from skilled engineers.

In particular, in the present embodiment, the user can evaluate the traveling performance of the vehicle by operating the evaluation switch 94. The criterion on the absolute value of the difference between the torque Trq and the torque command value Trq and/or the criterion on the acceleration Gx used in giving the reward r is then changed according to the result of the user's evaluation. These criteria can thus be made appropriate to the user's driving preference, and as the reinforcement learning advances with the user's driving, the relationship specifying data DR can be updated to data suited to that preference.

According to the present embodiment described above, the following operational effects can be obtained.

(1) The argument of the action value function Q includes time-series data of the accelerator operation amount PA. The value of the action a can thus be adjusted finely for various changes in the accelerator operation amount PA, compared with a case where only a single sampled value of the accelerator operation amount PA is used as an argument.

(2) The argument of the action value function Q includes the throttle opening degree command value TA itself. This makes it easier to increase the degree of freedom of the search by reinforcement learning, compared with, for example, a case where a parameter of a model expression modeling the behavior of the throttle opening degree command value TA is used as the independent variable relating to the throttle opening degree.

Embodiment 2

Hereinafter, embodiment 2 will be described mainly focusing on differences from embodiment 1 with reference to the drawings.

Fig. 5 shows a procedure of the process of changing the reference in the process of S32 according to the present embodiment. The process shown in fig. 5 is realized by repeatedly executing the learning program 74b stored in the ROM74 by the CPU72, for example, at predetermined cycles. In fig. 5, for convenience, the same step numbers are assigned to the processes corresponding to those shown in fig. 4.

In the series of processing shown in fig. 5, the CPU72 first acquires the accelerator operation amount PA and the acceleration Gx (S60). Next, the CPU72 determines whether a predetermined period has elapsed since the change amount ΔPA per unit time of the accelerator operation amount PA became equal to or greater than the predetermined amount ΔPAth (S62). Here, the predetermined period is the period until a predetermined time has elapsed after the change amount ΔPA per unit time of the accelerator operation amount PA decreased again.

When the CPU72 determines that the predetermined period has elapsed (S62: yes), it acquires the face image data (S64). The CPU72 then determines, by analyzing the face image data, whether the user feels discomfort with the drivability, and stores the result in the storage device 76 (S66). When the process of S66 is completed or when a negative determination is made in the process of S62, the CPU72 determines whether the trip has ended (S68). Here, a trip is a period during which the travel permission signal of the vehicle is in the active (ON) state; in the present embodiment, the travel permission signal corresponds to the ignition signal.

If the CPU72 determines that the trip has ended (S68: yes), it reads the time-series data of the accelerator operation amount PA and the acceleration Gx acquired in S60 during the trip (S70). The CPU72 then determines whether the logical product of the following conditions (iii) to (v) is true (S72).

Condition (iii): the maximum value of the accelerator operation amount PA is equal to or greater than the predetermined value PAH. Here, the predetermined value PAH is set to a value larger than an assumed maximum value of the accelerator operation amount PA generated by an operation of the accelerator pedal 86 by a normal user.

Condition (iv): the maximum value of the acceleration Gx of the vehicle VC1 is equal to or greater than the predetermined value GxHH. Here, the predetermined value GxHH is set to a value larger than an assumed maximum value of the acceleration Gx generated by the operation of the accelerator pedal 86 by the ordinary user.

Condition (v): the analysis result of the face image data obtained in the process of S66 indicates (NG) that the user feels discomfort with the drivability.

If the CPU72 determines that the logical product of conditions (iii) to (v) is true (S72: yes), it executes the process of S54. That is, when the logical product of conditions (iii) and (iv) is true, the user may be depressing the accelerator pedal 86 hard because he or she wants to accelerate the vehicle VC1 briskly, and if condition (v) also holds, the pedal may be being depressed harder than by an ordinary user because the responsiveness of the vehicle VC1 is unsatisfactory. The conditions for giving the reward are therefore changed so that the acceleration performance of the vehicle VC1 is improved.

On the other hand, if the CPU72 determines that the logical and of the conditions (iii) to (v) is false (S72: no), it determines whether or not the logical and of the following conditions (vi), (vii), and (v) is true (S74).

Condition (vi): the maximum value of the accelerator operation amount PA is equal to or less than the predetermined value PAL. Here, the predetermined value PAL is set to a value smaller than an assumed maximum value of the accelerator operation amount PA generated by an operation of the accelerator pedal 86 by a normal user.

Condition (vii): the maximum value of the acceleration Gx of the vehicle VC1 is equal to or less than a predetermined value GxLL. Here, the predetermined value GxLL is set to a value smaller than an assumed maximum value of the acceleration Gx generated by the operation of the accelerator pedal 86 by a normal user.

If the CPU72 determines that the logical product of conditions (vi), (vii), and (v) is true (S74: yes), it executes the process of S58. That is, when the logical product of conditions (vi) and (vii) is true, the user of the vehicle VC1 tends to depress the accelerator pedal 86 more lightly than an ordinary user, and if condition (v) also holds, the acceleration applied to the vehicle VC1 may nevertheless be felt as excessive and unpleasant. The conditions for giving the reward are therefore changed so that the acceleration the user feels when the vehicle VC1 accelerates can be made smaller.
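The decisions of S72 and S74 can be condensed into a single classifier over the trip statistics and the face-image result (hypothetical Python; the names and return labels are illustrative):

```python
def infer_preference(max_pa, max_gx, user_discomfort,
                     pa_high, gx_high, pa_low, gx_low):
    """S72/S74: infer the desired criterion change from one trip's data."""
    if user_discomfort and max_pa >= pa_high and max_gx >= gx_high:
        return "improve_acceleration"   # conditions (iii)-(v) -> S54
    if user_discomfort and max_pa <= pa_low and max_gx <= gx_low:
        return "soften_acceleration"    # conditions (vi), (vii), (v) -> S58
    return "keep_criteria"
```

Trips in which the user showed no discomfort, or whose pedal and acceleration statistics fall between the two bands, leave the reward criteria untouched.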

Further, the CPU72 temporarily ends the series of processing shown in fig. 5 in a case where the processing of S54, S58 is completed or in a case where a negative determination is made in the processing of S68, S74. As described above, in the present embodiment, without the user having to perform an operation to input an evaluation of the running performance, the driving preference information can be acquired from the information during the driving of the vehicle VC1 by the user, and the condition for giving the reward can be changed accordingly.

Embodiment 3

Hereinafter, embodiment 3 will be described with reference to the drawings, mainly focusing on the differences from embodiment 1.

In the present embodiment, the update of the relationship specifying data DR is performed outside the vehicle VC 1. Fig. 6 shows a configuration of a control system for executing reinforcement learning in the present embodiment. In fig. 6, for convenience, the same reference numerals are given to components corresponding to those shown in fig. 1.

The ROM74 of the control device 70 in the vehicle VC1 shown in fig. 6 stores the control program 74a, but does not store the learning program 74 b. The control device 70 further includes a communication device 77. The communicator 77 is a device for communicating with the data analysis center 110 via the network 100 outside the vehicle VC 1.

The data analysis center 110 analyzes data transmitted from a plurality of vehicles VC1, VC2, … …. The data analysis center 110 includes a CPU112, a ROM114, an electrically rewritable nonvolatile memory (storage device 116), peripheral circuitry 118, and a communication device 117, which can communicate with one another via a local network 119. The ROM114 stores the learning program 74b, and the storage device 116 stores the relationship specifying data DR.

Fig. 7 shows a procedure of reinforcement learning according to the present embodiment. The processing shown in the left flow of fig. 7 is realized by the CPU72 shown in fig. 6 executing the control program 74a stored in the ROM 74. In addition, the processing shown in the right flow of fig. 7 is realized by the CPU112 executing the learning program 74b stored in the ROM 114. Note that, in fig. 7, for convenience, the same step numbers are given to the processes corresponding to the processes shown in fig. 3 and 4. The processing shown in fig. 7 will be described below in terms of a time series of reinforcement learning.

In the series of processes shown in the left-hand flow of fig. 7, the CPU72 executes the processes of S10 to S26. When the processing of S22 or S26 is completed, the CPU72 operates the communication device 77 to transmit the data necessary for the learning processing (S80). The transmitted data include the torque command value Trq, the time-series data of the torque Trq and the acceleration Gx, the state set Sj, and the action set Aj for the scenario that ended immediately before the processing of S22 or S26 was executed. The CPU72 then determines whether there is an evaluation input made by operating the evaluation switch 94 (S82), and if it determines that there is one (S82: yes), it operates the communication device 77 to transmit the data on the evaluation result (S84).

In contrast, as shown in the right flow of fig. 7, the CPU112 receives the data transmitted in the process of S80 (S100), and determines whether or not the evaluation result data is transmitted in the process of S84 (S102). When it is determined that the evaluation result data has been transmitted (yes in S102), the CPU112 receives the evaluation result (S104) and executes the processing from S52 to S58.

When the processing of S54, S58 is completed or when a negative determination is made in the processing of S56, S102, the CPU112 updates the relationship specifying data DR based on the data received through the processing of S100 (S28). Then, CPU112 determines whether or not the number of updates of relation specifying data DR is equal to or greater than a predetermined number (S106), and if it is determined that the number of updates is equal to or greater than the predetermined number (S106: yes), operates communicator 117 to transmit relation specifying data DR to vehicle VC1 that transmitted the data received through the processing of S100 (S108). Further, the CPU112 temporarily ends a series of processing shown in the right-hand flow of fig. 7 in a case where the processing of S108 is completed or in a case where a negative determination is made in the processing of S106.

On the other hand, as shown in the left flow of fig. 7, the CPU72 determines whether or not there is any update data (S86), and if it is determined that there is any update data (S86: yes), receives the updated relationship specifying data DR (S88). Then, the CPU72 rewrites the relationship specifying data DR stored in the storage device 76 with the received relationship specifying data DR (S90). Further, the CPU72 temporarily ends the series of processing shown in the left-hand flow of fig. 7 in the case where the processing of S90 is completed or in the case where a negative determination is made in the processing of S20, S24, S86.

As described above, according to the present embodiment, since the update process of the relation specifying data DR is performed outside the vehicle VC1, the calculation load of the control device 70 can be reduced.
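The center-side flow of S100 through S108, stripped to its accounting, might look like this (a toy Python sketch under the assumption that episodes arrive one at a time; the class and method names are illustrative, not from the disclosure):

```python
class LearningCenter:
    """S100-S108: update the relation-specifying data once per received
    episode (S28) and send the result back to the vehicle only after the
    number of updates reaches a predetermined count (S106-S108)."""
    def __init__(self, send_threshold):
        self.updates = 0
        self.threshold = send_threshold
    def receive_episode(self, episode_data):
        self.updates += 1                      # S28: one update per received episode
        if self.updates >= self.threshold:     # S106: enough updates accumulated?
            self.updates = 0
            return "updated_DR"                # S108: transmit updated data back
        return None                            # otherwise keep learning silently
```

Batching the transmissions this way trades freshness of the in-vehicle data DR for less network traffic between the vehicle and the center.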

The CPU72 and the ROM74 are an example of the execution device, and the storage device 76 is an example of the storage device. The processing of S10, S16, S50, S60, and S66 is an example of the acquisition processing, and the processing of S14 is an example of the operation processing. The processing of S32 to S36 is an example of the reward calculation processing, and the processing of S38 to S44 is an example of the update processing. The range defined by the lower limit GxL and the upper limit GxH is an example of the reference acceleration. The throttle opening degree command value TA is an example of the variable relating to the opening degree of the throttle valve. The CPU72 and the ROM74 are an example of the 1st execution device, and the CPU112 and the ROM114 are an example of the 2nd execution device.

Other embodiments

The present embodiment can be modified as follows. This embodiment mode and the following modifications can be combined with each other within a range not technically contradictory.

Information about driving preference

In the above embodiment, the user's evaluation of the behavior of the vehicle is obtained through the operation of the evaluation switch 94, but the present invention is not limited thereto. For example, if a microphone is provided in the vehicle VC1 and the user says "slow" during acceleration or the like, information indicating that the perceived acceleration is insufficient may be acquired.

In fig. 5, three pieces of information, namely the maximum value of the accelerator operation amount PA, the maximum value of the acceleration Gx, and the analysis result of the face image data, are used as the driving preference information, but the present invention is not limited thereto. Only one or two of these three pieces of information may be used. In addition, the minimum value of the acceleration Gx may be used, which makes it possible to treat the magnitude of the absolute value of the acceleration Gx during deceleration as driving preference information.

About action variables

In the above embodiment, the throttle opening degree command value TA is exemplified as the variable relating to the opening degree of the throttle valve serving as an action variable, but the present invention is not limited thereto. For example, the responsiveness of the throttle opening degree command value TA with respect to the accelerator operation amount PA may be expressed by a dead time and a second-order lag filter, and a total of three variables, the dead time and the two variables defining the second-order lag filter, may be set as the variables relating to the opening degree of the throttle valve. In this case, however, the state variable is preferably the change amount per unit time of the accelerator operation amount PA rather than its time-series data.
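Such a dead-time-plus-second-order-lag parameterisation can be sketched as a discrete filter (hypothetical Python; the coefficients a1, a2, b are the two lag parameters and gain in an assumed discrete-time form, and all names are illustrative):

```python
from collections import deque

class ThrottleResponse:
    """Throttle command = pedal input delayed by dead_steps samples, then passed
    through a discrete second-order lag y[k] = a1*y[k-1] + a2*y[k-2] + b*u[k]."""
    def __init__(self, dead_steps, a1, a2, b):
        self.buf = deque([0.0] * (dead_steps + 1), maxlen=dead_steps + 1)
        self.a1, self.a2, self.b = a1, a2, b
        self.y1 = self.y2 = 0.0   # previous two filter outputs
    def step(self, pa):
        self.buf.append(pa)       # newest sample in; oldest falls out of the window
        u = self.buf[0]           # input from dead_steps samples ago
        y = self.a1 * self.y1 + self.a2 * self.y2 + self.b * u
        self.y2, self.y1 = self.y1, y
        return y
```

With a1 = a2 = 0 and b = 1 the filter degenerates to a pure delay, which makes the dead-time part easy to verify in isolation.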

In the above embodiment, the retard amount aop is exemplified as the variable relating to the ignition timing as the action variable, but the present invention is not limited thereto. For example, the ignition timing itself may be a correction target of KCS.

In the above embodiment, the variable relating to the opening degree of the throttle valve and the variable relating to the ignition timing are exemplified as the action variable, but the present invention is not limited thereto. For example, the fuel injection amount may also be used in addition to the variable relating to the opening amount of the throttle valve and the variable relating to the ignition timing. In addition, regarding the 3 variables, only the variable relating to the opening amount of the throttle valve and the fuel injection amount, or only the variable relating to the ignition timing and the fuel injection amount may be used as the action variable. Further, regarding the above-mentioned 3 variables, only one of them may be adopted as the action variable.

In the case of a compression ignition type internal combustion engine as described in the column "related to the internal combustion engine", a variable relating to the injection amount may be used instead of the variable relating to the opening degree of the throttle valve, and a variable relating to the injection timing may be used instead of the variable relating to the ignition timing. Further, in addition to the variable relating to the injection timing, it is preferable to add a variable relating to the number of injections within one combustion cycle and/or a variable relating to the interval, within one combustion cycle, between the end timing of one of two temporally adjacent fuel injections for the same cylinder and the start timing of the other.

For example, when the transmission 50 is a stepped transmission, the current value of the solenoid valve that adjusts the engagement state of the clutch by hydraulic pressure may be used as the action variable. For example, as described in the column "vehicle" below, when a hybrid vehicle, an electric vehicle, or a fuel cell vehicle is used as the vehicle, the torque and/or the output of the rotating electric machine may be used as the action variable. For example, in the case of an in-vehicle air conditioner including a compressor rotated by the rotational power of the crankshaft of the internal combustion engine, the load torque of the compressor may be included in the action variable. In the case of an electrically operated in-vehicle air conditioner, the power consumption of the air conditioner may be included in the action variable.

About state

In the above embodiment, the time-series data of the accelerator operation amount PA consists of six values sampled at equal intervals, but the data is not limited thereto. Any data consisting of two or more sampled values at mutually different sampling timings may be used; data with three or more sampled values, or with equal sampling intervals, is more preferable.

The state variable related to the accelerator operation amount is not limited to the time-series data of the accelerator operation amount PA, and may be, for example, a change amount per unit time of the accelerator operation amount PA as described in the column of "action variable".

For example, when the current value of the solenoid valve is used as the action variable as described in the column of "action variable", the state may include the rotation speed of the input shaft 52, the rotation speed of the output shaft 54, and the hydraulic pressure adjusted by the solenoid valve of the transmission. For example, when the torque and/or the output of the rotating electrical machine is used as the action variable as described in the column of "action variable", the state may include the charging rate and the temperature of the battery. For example, as described in the column of "about action variables", when the load torque of the compressor and the power consumption of the air conditioner are included in the action, the temperature in the vehicle interior may be included in the state.

Dimension reduction for tabular form data

The dimension reduction method for table-type data is not limited to the one described in the above embodiment. For example, since the accelerator operation amount PA rarely reaches its maximum value, the action value function Q need not be defined for states in which the accelerator operation amount PA is equal to or greater than a predetermined amount, and the throttle opening degree command value TA and the like may be adapted separately for that case. For example, dimension reduction may also be performed by excluding, from the values the action can take, those at which the throttle opening degree command value TA is equal to or greater than a predetermined value.

However, dimension reduction is not essential. For example, if, as in embodiment 3, reinforcement learning is based on data from a plurality of vehicles and the computing power of the CPU72 and the storage capacity of the storage device 76 are sufficient, the action value function may be learned before shipment only for a part of the dimensions after reduction, while all actions are made searchable after shipment. Since more data for learning can be secured after shipment than before, the number of searchable actions can be increased and more appropriate actions can be found.

Specifying data about relationships

In the above embodiment, the action value function Q is a table-type function, but is not limited thereto. For example, a function approximator may also be used.

For example, instead of using the action value function Q, the policy π may be expressed by a function approximator that takes the state s and the action a as arguments and the probability of taking the action a as the dependent variable, and the parameters determining the function approximator may be updated according to the reward r.

Concerning handling of operations

For example, when the action value function is a function approximator as described in the column "relation-specifying data", all of the groups of discrete action values that served as arguments of the table-type function in the above embodiment may be input to the action value function Q together with the state s, and the action a that maximizes the action value function Q may be selected.

For example, when the policy π is a function approximator that takes the state s and the action a as arguments and the probability of taking the action a as its dependent variable, as described in the section on the relationship specifying data, the action a may be selected based on the probabilities represented by the policy π.
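Sampling according to those probabilities can be sketched as below; the action labels and probabilities are illustrative assumptions, and the fixed seed only makes the sketch reproducible.

```python
import random

# Sketch: sample an action a according to the probabilities produced by a
# parameterized policy pi(a | s) for the current state.
def sample_action(probs, rng=random.Random(0)):
    """probs: list of (action, probability) pairs summing to 1.
    Walk the cumulative distribution until the random draw falls inside."""
    r, cum = rng.random(), 0.0
    for action, p in probs:
        cum += p
        if r < cum:
            return action
    return probs[-1][0]  # guard against floating-point round-off


a = sample_action([("TA=20", 0.7), ("TA=40", 0.3)])
```

Higher-probability actions are chosen more often, which realizes the exploration inherent in a stochastic policy.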

Regarding the update map

The processing of S38 to S44 is an example of processing that uses the ε-soft on-policy Monte Carlo method, but the processing is not limited thereto. For example, processing that uses an off-policy Monte Carlo method may be used. The method is not limited to Monte Carlo methods either: for example, an off-policy TD (temporal-difference) method may be used, an on-policy TD method such as the SARSA (state-action-reward-state'-action') method may be used, and, as an on-policy learning method, the eligibility trace method may be used.
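As a minimal sketch of the SARSA alternative, one on-policy TD update of a tabular Q looks as follows; the learning rate α, discount γ, and the state/action labels are illustrative assumptions.

```python
# Sketch of one SARSA (on-policy TD) update step, an alternative to the
# epsilon-soft on-policy Monte Carlo update of S38-S44.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    """Q: dict mapping (state, action) -> value; updated in place.
    Uses the action a_next actually chosen by the current policy."""
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    td_error = td_target - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q


Q = {}
sarsa_update(Q, "s0", "a0", 1.0, "s1", "a1")
```

Because the bootstrapping target uses the action the policy actually takes next, the update stays on-policy, unlike Q-learning's max over actions.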

For example, when the policy π is expressed by a function approximator and is directly updated based on the reward r as described in the section on the relationship specifying data, the update map may be configured using a policy gradient method or the like.
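One concrete policy gradient form is a REINFORCE-style update of a softmax policy over discrete actions; the parameterization, learning rate, and return value below are all illustrative assumptions rather than the embodiment's design.

```python
import math

# Sketch: REINFORCE update for a softmax policy over two discrete actions,
# parameterized by per-action preferences theta.
def softmax(theta):
    z = [math.exp(t) for t in theta]
    total = sum(z)
    return [v / total for v in z]


def reinforce_step(theta, taken, ret, alpha=0.01):
    """theta: preference per action; taken: index of the sampled action;
    ret: observed return G. Uses grad log pi = indicator(a) - pi(a)."""
    pi = softmax(theta)
    return [t + alpha * ret * ((1.0 if i == taken else 0.0) - pi[i])
            for i, t in enumerate(theta)]


theta = reinforce_step([0.0, 0.0], taken=0, ret=1.0)
```

A positive return raises the preference of the action taken and lowers the others, which directly increases the probability the policy assigns to rewarded actions.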

In addition, the direct update target based on the reward r is not limited to only one of the action-value function Q and the policy π. For example, as in the Actor-Critic method, the action-value function Q and the policy π may each be updated. The Actor-Critic method is not limited to this form either; for example, the value function V may be the update target instead of the action-value function Q.
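A simplified sketch of the latter variant, where the critic maintains a state-value function V, follows. The preference-table actor update shown here just accumulates the TD error; a full implementation would scale it by the gradient of log π. All hyperparameters and labels are illustrative assumptions.

```python
# Sketch of one Actor-Critic step: the critic updates a state-value
# function V and the actor a preference table H, both from the TD error.
def actor_critic_step(V, H, s, a, r, s_next,
                      alpha_v=0.1, alpha_h=0.1, gamma=0.95):
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha_v * td_error            # critic update
    H[(s, a)] = H.get((s, a), 0.0) + alpha_h * td_error  # actor update
    return td_error


V, H = {}, {}
delta = actor_critic_step(V, H, "s0", "a0", 1.0, "s1")
```

Actions followed by a positive TD error gain preference, while the critic's V estimate moves toward the bootstrapped target.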

Note that the "ε" that specifies the policy π is not limited to a fixed value; it may be changed according to a predetermined rule that depends on the degree of progress of learning.
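Such a rule can be as simple as an exponential decay with a floor; using the episode count as the progress measure, and the specific constants below, are assumptions for illustration.

```python
# Sketch: decay the exploration rate epsilon of the epsilon-soft policy
# as learning progresses, never dropping below a floor.
def epsilon_schedule(episode, eps_start=0.3, eps_min=0.01, decay=0.999):
    """Exponential decay of epsilon with the episode count."""
    return max(eps_min, eps_start * (decay ** episode))


eps0 = epsilon_schedule(0)        # exploration-heavy at the start
eps_late = epsilon_schedule(10000)  # clamped to the floor later on
```

Keeping a small floor preserves some exploration so the policy can still adapt to changes, for example in the driving preference information.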

Regarding the reward calculation process

In the process of fig. 3, a reward is given when the logical AND of condition (i) and condition (ii) is true, but the present invention is not limited thereto. For example, a process of giving a reward according to whether condition (i) is satisfied and a process of giving a reward according to whether condition (ii) is satisfied may both be executed. Alternatively, only one of those two processes may be executed.

For example, instead of uniformly giving the same reward whenever condition (i) is satisfied, a process may be performed in which a larger reward is given as the absolute value of the difference between the torque Trq and its command value becomes smaller. Likewise, instead of uniformly giving the same reward whenever condition (i) is not satisfied, a process may be performed in which a smaller reward is given as the absolute value of that difference becomes larger.
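A minimal sketch of such a graded reward follows; the tolerance, gain, and torque values are illustrative assumptions, not values from the embodiment.

```python
# Sketch: grade the reward by the torque-tracking error instead of
# giving a fixed reward whenever condition (i) holds.
def torque_reward(trq, trq_cmd, tol=5.0, k=0.1):
    """Larger reward for smaller |Trq - command|; graded penalty beyond
    the tolerance tol. Gains are illustrative."""
    err = abs(trq - trq_cmd)
    if err <= tol:
        return 1.0 - k * err       # condition (i) satisfied: graded reward
    return -k * (err - tol)        # condition (i) violated: graded penalty


r_good = torque_reward(100.0, 102.0)   # small tracking error
r_bad = torque_reward(100.0, 120.0)    # large tracking error
```

Grading the reward gives the learning update a gradient toward tighter torque tracking instead of a flat pass/fail signal.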

For example, instead of uniformly giving the same reward when condition (ii) is satisfied, the magnitude of the reward may be changed according to the magnitude of the acceleration Gx. Likewise, instead of uniformly giving the same reward when condition (ii) is not satisfied, the magnitude of the reward may be changed according to the magnitude of the acceleration Gx.

For example, when the current value of the solenoid valve of the transmission 50 is used as the action variable as described in the section on the action variable, the reward calculation process may include at least one of the following three processes (a) to (c).

(a) A process of giving a larger reward when the time required for switching the gear ratio of the transmission is within a predetermined time than when the required time exceeds the predetermined time.

(b) A process of giving a larger reward when the absolute value of the change speed of the rotation speed of the input shaft 52 of the transmission is equal to or less than an input-side predetermined value than when it exceeds the input-side predetermined value.

(c) A process of giving a larger reward when the absolute value of the change speed of the rotation speed of the output shaft 54 of the transmission is equal to or less than an output-side predetermined value than when it exceeds the output-side predetermined value. In such a case, for example, when an evaluation indicating that the responsiveness is too low is input by operating the evaluation switch 94, the predetermined time may be shortened while the input-side predetermined value and/or the output-side predetermined value may be enlarged.
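The three processes above can be sketched as additive reward terms evaluated at each shift event; all thresholds and the ±1 reward magnitudes are illustrative assumptions.

```python
# Sketch of reward terms (a)-(c) for one shift event of the transmission 50.
# dn_in / dn_out: change speeds of the input-shaft 52 and output-shaft 54
# rotation speeds; thresholds are illustrative only.
def shift_reward(shift_time, dn_in, dn_out,
                 t_max=0.8, dn_in_max=500.0, dn_out_max=300.0):
    r = 0.0
    r += 1.0 if shift_time <= t_max else -1.0        # process (a)
    r += 1.0 if abs(dn_in) <= dn_in_max else -1.0    # process (b)
    r += 1.0 if abs(dn_out) <= dn_out_max else -1.0  # process (c)
    return r


r = shift_reward(0.5, 200.0, 100.0)  # quick, smooth shift
```

Adjusting `t_max`, `dn_in_max`, and `dn_out_max` corresponds to the threshold changes driven by the evaluation switch 94 described above.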

For example, when the torque and/or the output of the rotating electrical machine is used as the action variable as described in the section on the action variable, the reward calculation process may include the following: a process of giving a larger reward when the charging rate of the battery is within a predetermined range than when it is not; and a process of giving a larger reward when the temperature of the battery is within a predetermined range than when it is not. In this case, the change according to the driving preference information may be limited to the above condition (ii) and the like; alternatively, the predetermined range may be made variable according to the driving preference information so that condition (ii) and the like are more easily satisfied during transient operation.
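These range-based terms can be sketched as below; the state-of-charge and temperature bounds are illustrative assumptions, and passing different bounds stands in for making the ranges variable according to the driving preference information.

```python
# Sketch: reward terms that keep the battery charging rate (SOC) and the
# battery temperature inside predetermined ranges. Bounds are illustrative.
def battery_reward(soc, temp,
                   soc_range=(0.3, 0.8), temp_range=(10.0, 45.0)):
    r = 0.0
    r += 1.0 if soc_range[0] <= soc <= soc_range[1] else -1.0
    r += 1.0 if temp_range[0] <= temp <= temp_range[1] else -1.0
    return r


r_in = battery_reward(0.5, 25.0)   # both quantities inside their ranges
r_out = battery_reward(0.9, 50.0)  # both quantities outside their ranges
```

Widening `soc_range` or `temp_range` for a sportier driving preference would trade some battery protection margin for easier satisfaction of the acceleration-related condition (ii).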

For example, when the load torque of the compressor and/or the power consumption of the air conditioner is included in the action variable as described in the section on the action variable, a process may be added in which a larger reward is given when the temperature in the vehicle interior is within a predetermined range than when it is not. In this case, the change according to the driving preference information may be limited to the above condition (ii) and the like; alternatively, the predetermined range may be made variable according to the driving preference information so that condition (ii) and the like are more easily satisfied during transient operation.

Regarding the acquisition process

In the above embodiment, the user's evaluation of the traveling performance is obtained by acquiring the evaluation variable VV based on the output signal of the evaluation switch 94, but the present invention is not limited thereto. For example, a device that recognizes a spoken instruction may be provided instead of the evaluation switch 94, and the recognition result may be acquired as the evaluation variable VV.

Regarding the vehicle control system

In the example shown in fig. 7, the process of determining the action based on the policy π (the process of S12) is executed on the vehicle side, but the present invention is not limited thereto. For example, the data acquired by the process of S10 may be transmitted from the vehicle VC1, the data analysis center 110 may determine the action a using the transmitted data, and the determined action may be transmitted to the vehicle VC1.

The vehicle control system is not limited to a system including the control device 70 and the data analysis center 110. For example, a portable terminal of the user may be used instead of the data analysis center 110. Alternatively, the control device 70, the data analysis center 110, and the portable terminal may together constitute the vehicle control system. This can be realized, for example, by executing the process of S12 on the portable terminal.

Regarding the execution device

The execution device is not limited to a device that includes the CPU 72 (112) and the ROM 74 (114) and executes software processing. For example, a dedicated hardware circuit such as an ASIC may be provided to process, in hardware, at least a part of what is processed in software in the above embodiment. That is, the execution device may have any one of the following configurations (a) to (c).

(a) A configuration including a processing device that executes all of the above-described processes in accordance with a program, and a program storage device, such as a ROM, that stores the program.

(b) A configuration including a processing device and a program storage device that execute a part of the above-described processes in accordance with a program, and a dedicated hardware circuit that executes the remaining processes.

(c) A configuration including a dedicated hardware circuit that executes all of the above-described processes. Here, a plurality of software execution devices each including a processing device and a program storage device, or a plurality of dedicated hardware circuits, may be provided.

Regarding the storage device

In the above embodiment, the storage device that stores the relationship specifying data DR and the storage devices (the ROMs 74, 114) that store the learning program 74b and the control program 74a are separate storage devices, but the present invention is not limited thereto.

Regarding the internal combustion engine

The internal combustion engine is not limited to one that includes, as the fuel injection valve, a port injection valve that injects fuel into the intake passage 12; it may include an in-cylinder injection valve that injects fuel directly into the combustion chamber 24, or may include both a port injection valve and an in-cylinder injection valve.

The internal combustion engine is not limited to a spark ignition type internal combustion engine, and may be, for example, a compression ignition type internal combustion engine using light oil or the like as fuel.

Regarding the vehicle

The vehicle is not limited to one whose thrust generation device is only an internal combustion engine; for example, it may be a so-called hybrid vehicle including an internal combustion engine and a rotating electrical machine. It may also be a so-called electric vehicle or a fuel cell vehicle that does not include an internal combustion engine and includes a rotating electrical machine as the thrust generation device.
