Coordinated Power Smoothing Control Strategy of Multi-Wind Turbines and Energy Storage Systems in Wind Farm Based on MADRL

The randomness and volatility of wind power greatly affect the safety and economy of the power systems, and the wake effect of the wind farm aggravates the wind energy loss and the wind power fluctuation. Taking into consideration the wake effect of the wind farm, a new coordinated wind power smoothing control strategy for multi-wind turbines (M-WT) and energy storage systems (ESS) is proposed. The proposed method is based on a multi-agent deep reinforcement learning (MADRL), in which the relationship between output power and wake effect is firstly analyzed, and a power smoothing control model of the M-WT and ESS is established. MADRL is then introduced to optimize the power control of M-WT and ESS. In order to further increase the learning and training efficiency, an improved MADRL algorithm based on the partitioned experience buffer and prioritized experience replay is proposed, where the experience buffer is divided into positive, negative, and neutral experiences, and the experiences are sampled according to experience priority. The effectiveness of the proposed strategy is verified on the SimWindFarm platform. The results show that the proposed control strategy can maximize the economic benefits while further smoothing wind power fluctuations and increasing power generation.

Index Terms: Wind farm, energy storage systems, power control, wake effect, multi-agent deep reinforcement learning.

I. INTRODUCTION
Wind power generation is one of the most important means of addressing environmental pollution and energy crises [1]. Smoothing wind power fluctuations can effectively reduce their negative influence on the reliability and stability of power systems [2]. Increasing the active power and generating capacity of a wind farm has become an important way for the wind farm industry to reduce investment risks [3].
At present, there are two methods to stabilize wind power fluctuations: installing energy storage systems and power smoothing control of a wind turbine (WT) [4], [5], [6], [7]. The former can smooth wind power fluctuations effectively with high operability and little power loss but requires additional equipment costs. In [7], a hybrid energy storage system of supercapacitors and lithium batteries was used to smooth wind power fluctuations, and a stochastic optimization scheduling strategy for wind power smoothing was proposed. Lin et al. [8] proposed a long-term stable operation control method for wind power smoothing based on a dual-battery energy storage system (DBESS), to ensure the long-term stable operation of the DBESS while meeting the wind power demand.
The latter is generally realized through pitch angle control, rotor speed control, DC-side capacitor control, and grid-side converter control [9], [10]. Liao et al. [11] proposed a low-pass virtual filter method for power smoothing control of wind power generation systems. A new low-pass virtual filter is introduced in the rotor energy control loop of the wind power generation system so that the system has more power smoothing capability and stability. Xue et al. [12] proposed a power smoothing control strategy based on the adaptive capability of the WT, in which power smoothing is achieved through DC voltage control, rotor speed control, and pitch angle control. Power control of a WT smooths the power fluctuations at the cost of losing some power and increasing the fatigue load of the wind turbine. Most power smoothing control methods in the existing literature consider only the individual wind turbine level, whereas practical power smoothing control should be considered at the wind farm level. Moreover, power smoothing at the individual WT level does not necessarily smooth the total output power of the wind farm.
In terms of the power smoothing control of a wind farm, Zhu et al. [13] proposed a power smoothing control strategy for a wind farm based on power distribution. The output power of the wind turbines (WTs) is controlled through machine-side and grid-side converter control, and the output power of each WT is set by power distribution rules. This method can ensure the maximum output power of the wind farm while smoothing the output power fluctuations. However, the wind farm wake effect is not considered; the wake causes an uneven distribution of wind speed, affects the operating status of each WT in the wind farm, and further decreases the output power of the WTs and increases its fluctuations. Considering the wake effect of a wind farm, Howlader et al. [14] proposed a coordinated control strategy for smoothing the wind power of multi-wind turbines (M-WT). The influence of the wake effect on the output wind power is considered from the perspective of the M-WT in the wind farm, as is the influence of the wake effect on power smoothing under different tower spacings. However, the control structures in the above literature are designed on the basis of a wind farm model and ignore the error and uncertainty of that model.
Wind farm control requires accurate analysis of the wind farm dynamics, and the inevitable modeling errors and uncertainties lead to significant degradation of the control performance. In contrast, deep reinforcement learning (DRL) [15] can interact with a complex environment with no model or an inaccurate model to search for optimal control strategies that achieve long-term rewards and enhance adaptability and robustness. To address the wake effect between WTs and the randomness of the environment, a robust deep reinforcement learning method was proposed in [16] to deal with uncertain environmental conditions and strong aerodynamic interaction between WTs and thereby realize wind farm power tracking. Huang et al. [17] proposed a DRL-based control strategy for wind-solar energy storage systems to maximize the long-term benefits. The limitation of these methods is that a single agent has difficulty solving the M-WT control problem, so the complex control problem needs to be decomposed into a multi-agent cooperative problem. Multi-agent deep reinforcement learning (MADRL) [18] applies the principles and algorithms of DRL to multi-agent systems. It can organize multiple agents to learn by themselves and to solve complex problems cooperatively through the interaction between agents. In addition, compared with a single agent, multiple agents can share risks and improve system reliability. Therefore, MADRL has the potential to solve the control problems of complex, uncertain, and nonlinear systems such as the wind farm.
In this study, to overcome the limitations of the current wind power smoothing methods, a MADRL-based coordinated control strategy for M-WT and energy storage systems (ESS) is proposed. Mainstream WT control methods are studied only for controlling an individual WT. Smoothing the power of an individual WT does not necessarily smooth the output power of a wind farm, and such methods do not apply to wind farms consisting of multiple turbines. In this article, the output power of each WT in the wind farm is coordinated and controlled so that the sum of the powers of the WTs is smoothed. The proposed method is studied at the wind farm level, which avoids the inapplicability of individual WT control to the wind farm. Owing to the high controllability and fast response of the ESS, the ESS is used to smooth the high-frequency fluctuations that are difficult for the internal control of the wind farm to handle. The M-WT coordinated smooth power control smooths part of the power fluctuations in the wind farm and thus shares the task of power smoothing. The ESS in the proposed method deals with fewer power fluctuations than under individual control, so the ESS capacity configuration can be appropriately reduced, which reduces the investment cost of the ESS. At present, an accurate wind farm model is difficult to establish, and an inaccurate model leads to unsatisfactory control performance. To solve this problem, a power optimization control of the M-WT and ESS based on a multi-agent twin delayed deep deterministic policy gradient (MATD3) algorithm is proposed. The MATD3 algorithm is used to optimize the power control of the M-WT and ESS, and the power is corrected and compensated when the model has errors or the parameters are time-varying, so as to reduce the negative impact caused by the inaccurate model. In addition, the computational complexity and the number of experiences of the MATD3 algorithm increase exponentially under multi-agent tasks, so the learning ability of the algorithm decreases and the convergence speed slows down. In response to this problem, an improved MATD3 algorithm based on the partitioned buffer and priority experience replay is proposed to enhance the efficiency of the MATD3 algorithm, where the experience buffer is divided into positive experiences, negative experiences, and neutral experiences, and the experiences are then sampled according to the experience priority.
The main contributions of the article are as follows:
1) Different from the individual power smoothing control of the WT and ESS, the proposed control strategy combines the M-WT and ESS smooth power controls. Some power fluctuations are smoothed through coordinated power control among wind turbines, while the ESS smooths the high-frequency fluctuations that are difficult for the internal control of the wind farm to handle. The M-WT and ESS share the wind power fluctuations and relieve each other's power smoothing burden.
2) A new power optimization control method based on MATD3 for M-WT and ESS is proposed to reduce the negative effect caused by model uncertainty and to enhance reliability and robustness. Meanwhile, the power smoothness of the wind farm, the power generation of the wind farm, the load of the WT, and the loss of the ESS are taken as the reward functions of MATD3, to reduce the system loss on the premise of ensuring the power smoothness.
3) An improved MATD3 algorithm based on the partitioned experience buffer and priority experience replay (PEPE-MATD3) is proposed to enhance algorithm efficiency, where the experience buffer is divided into positive experiences, negative experiences, and neutral experiences according to the reward values of learning. The experiences are preferentially sampled according to the experience priority to filter out more useful experiences for policy learning, so as to improve the learning ability of the algorithm.
The rest of this article is organized as follows. Section II introduces the coordinated control system of an offshore wind farm and ESS. Section III introduces a power-optimized compensation based on PEPE-MATD3. Section IV verifies the effectiveness and feasibility of the proposed method through SimWindFarm simulations, and conclusions are drawn in Section V.

A. System Structure
The structure of the combined power generation system of an offshore wind farm and ESS is shown in Fig. 1. The offshore wind farm structure is mainly composed of the M-WT, ESS, transformers, and controller. According to the data of the wind farm and the ESS collected by the wind farm monitoring system, the controller controls the output power of the WTs and the ESS of the wind farm.
The power equation of the combined generation system of the offshore wind farm and the ESS is as follows:

$$P_{grid} = P_{farm} + P_{es}, \qquad P_{farm} = \sum_{i=1}^{n} P_{WTi} \tag{1}$$

where $P_{WTi}$ is the output power of the $i$th wind turbine, $n$ is the number of wind turbines, $P_{farm}$ is the output power of the wind farm, $P_{es}$ is the output power of the ESS, and $P_{grid}$ is the grid-connected power.

B. Overall Control Strategy
As shown in Fig. 2, the control structure consists of the power control based on a wake model and the power optimization based on MADRL. The former is composed of a wake model and the reference power setting. In the wake model part, the input wind speed of each wind turbine $[V_1, V_2, \ldots, V_n]$ is calculated according to the input wind speed of the wind farm $V_{farm}$ and the thrust coefficients of the WTs $[C_{T1}, C_{T2}, \ldots, C_{Tn}]$. Meanwhile, the power of WT_i, $P_{WTi,pre}$, and the power of the wind farm, $P_{farm,pre}$, are calculated. In the reference power setting part, the low-frequency power of the wind farm $P_L$ is obtained by passing $P_{farm,pre}$ through a Bessel low-pass filter, and the reference power of each wind turbine $P_{ref,i}$ is determined according to the proportion of each wind turbine power $P_{WTi,pre}/P_{farm,pre}$. The reference power of the ESS $P_{ref,es}$ is obtained from the difference between $P_L$ and $P_{grid}$, and the ESS is used to deal with the power fluctuations that are difficult for the WTs to handle.
The power optimization aims to improve wind power generation, further smooth the power fluctuations, and reduce the loss of the WTs and ESS. PEPE-MATD3 is used to optimize the power control of the M-WT and the ESS so as to reduce the impact of model errors and uncertainties. The cooperation among agents in PEPE-MATD3 is used to optimize and compensate for the reference power of each wind turbine and the ESS. Because the input wind speed of the WTs within each column of the wind farm is similar, each column is optimized and compensated by one agent, thereby reducing the computational complexity of PEPE-MATD3. Agent_1 in PEPE-MATD3 outputs the reference adjustment values $\Delta P_{ref,1\sim3}$, from which the new reference powers $P_{ref,1\sim3}$ of the wind turbines are obtained. The other agents act in the same way. The power optimization of the ESS is handled by a single agent, Agent_es, using the same optimization method as for the WTs.

C. Jensen Wake Model
In practical operation, a simplified wind farm wake model is crucial to reduce the calculation load and improve the real-time performance of active power output dispatching of the wind farm. Therefore, Jensen's wake model can be used to calculate the wake flow [20], [21]. In this model, $s$ is the distance from a certain position of the downstream wind turbine to the upstream wind turbine, $V_0$ is the incoming wind speed at infinity, and $D$ is the diameter of the upstream wind turbine. $V_{1,L}$ and $D_w$ are the wind speed and the wake section diameter at distance $s$ behind the upstream turbine, respectively, and $k$ is the expansion rate.
It is necessary to consider whether the downstream wind turbine is within the wake influence radius of the upstream wind turbine when calculating the wind speed of the downstream wind turbine.If it is within the wake influence radius, the wake influence of the upstream wind turbine should be considered.
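As an illustration of the single-wake calculation described above, the following minimal sketch uses the commonly cited form of Jensen's model, $D_w = D + 2ks$ and $V_{1,L} = V_0\,[1 - (1 - \sqrt{1 - C_T})(D/D_w)^2]$. This form is assumed here rather than taken from the article's own equations, and the function names, the lateral-offset test, and the example value $k = 0.04$ are hypothetical choices for the sketch.

```python
import math


def wake_diameter(rotor_diameter, distance, k):
    """Wake section diameter D_w at distance s, assuming linear expansion."""
    return rotor_diameter + 2.0 * k * distance


def jensen_wake_speed(v0, thrust_coeff, rotor_diameter, distance, k):
    """Wind speed V_{1,L} in the wake at distance s behind one upstream turbine."""
    d_w = wake_diameter(rotor_diameter, distance, k)
    deficit = (1.0 - math.sqrt(1.0 - thrust_coeff)) * (rotor_diameter / d_w) ** 2
    return v0 * (1.0 - deficit)


def is_in_wake(lateral_offset, rotor_diameter, distance, k):
    """True if a downstream hub lies within the wake influence radius.

    lateral_offset is the cross-wind distance between the two hubs
    (a simplification assumed for this sketch).
    """
    return abs(lateral_offset) <= wake_diameter(rotor_diameter, distance, k) / 2.0


if __name__ == "__main__":
    # 126 m rotor (NREL 5 MW), five diameters downstream, illustrative C_T and k.
    speed = jensen_wake_speed(v0=12.0, thrust_coeff=0.8,
                              rotor_diameter=126.0, distance=630.0, k=0.04)
    print(f"wake speed: {speed:.2f} m/s")
```

The sketch covers only a fully immersed downstream rotor; partial overlap would additionally require the overlap area ratio discussed in the text.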
The wind wheel wake of a wind turbine affected by a single wake is superimposed. The wind speed at the location of the downstream wind turbine is expressed in terms of two areas: $A_{overlap}$ is the overlap area between the wake of the upstream wind turbine and the wind wheel of the downstream wind turbine, and $A_0$ is the swept area of the downstream wind turbine.

D. Active Power Distribution of Wind Farm
The active power distribution of the wind farm is shown in Fig. 3. $V_i$ is calculated according to the wake model, and then $P_{WTi,pre}$ is calculated according to (7):

$$P_{WTi,pre} = \begin{cases} 0, & V_i < V_{start} \\ \frac{1}{2}\rho\pi R^2 V_i^3 C_p, & V_{start} \le V_i < V_{rate} \\ P_{rate}, & V_i \ge V_{rate} \end{cases} \tag{7}$$

where $P_{rate}$ is the rated power of the wind turbine, $\rho$ is the air density, $R$ is the radius of the wind wheel, $V_{rate}$ is the rated wind speed, and $V_{start}$ is the starting wind speed. When $V_i$ of the $i$th wind turbine is less than $V_{start}$, the wind turbine is in the start mode and $P_{WTi,pre} = 0$. When $V_i$ is greater than $V_{start}$ and less than $V_{rate}$, the wind turbine is in the maximum wind energy tracking mode, and $P_{WTi,pre} = \frac{1}{2}\rho\pi R^2 V_i^3 C_p$. When $V_i$ is greater than $V_{rate}$, the wind turbine is in the constant power mode, and $P_{WTi,pre} = P_{rate}$.
$P_{farm,pre}$ is obtained by summing $P_{WTi,pre}$ over all wind turbines:

$$P_{farm,pre} = \sum_{i=1}^{n} P_{WTi,pre}$$

From $P_{farm,pre}$, $P_L$ is obtained by Bessel low-pass filtering. The filter extracts from the original power signal the component $P_L$ whose fluctuation within 1 min is less than 10% of the rated power. The reference power $P_{ref,i}$ of each wind turbine is allocated in proportion to its predicted power:

$$P_{ref,i} = \frac{P_{WTi,pre}}{P_{farm,pre}} P_L$$

$P_{ref,i}$ is used as the input of the WT internal control, so as to control the WT output power. The methods in [22], [23] are adopted for the modeling and control of the WTs. Because the power control of the WTs cannot fully track the set reference power, the ESS is used to deal with the remaining power fluctuations. $P_{ref,es}$ is calculated from the difference between $P_L$ and $P_{grid}$.
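The three operating modes and the proportional allocation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the low-frequency power `p_low` is taken as an already filtered input (the Bessel filter itself is not implemented), and the numerical parameters in the usage lines are illustrative values only.

```python
import math


def predicted_power(v, v_start, v_rate, p_rate, rho, radius, cp):
    """Predicted WT power for wind speed v in the three modes described in the text."""
    if v < v_start:
        return 0.0                      # start mode
    if v < v_rate:
        # maximum wind energy tracking mode
        return 0.5 * rho * math.pi * radius ** 2 * v ** 3 * cp
    return p_rate                       # constant power mode


def reference_powers(p_pre, p_low):
    """Split the low-frequency farm power P_L in proportion to each P_WTi,pre."""
    p_farm_pre = sum(p_pre)
    if p_farm_pre == 0.0:
        return [0.0 for _ in p_pre]     # no predicted production: nothing to allocate
    return [p_low * p / p_farm_pre for p in p_pre]


if __name__ == "__main__":
    speeds = [12.0, 9.5, 8.0]           # illustrative wake-reduced wind speeds, m/s
    p_pre = [predicted_power(v, 3.0, 11.4, 5.0e6, 1.225, 63.0, 0.45) for v in speeds]
    print(reference_powers(p_pre, p_low=0.9 * sum(p_pre)))
```

By construction the allocated references sum to `p_low`, so the farm-level set-point is preserved while each turbine keeps its share of the predicted power.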

E. Control Structure of the ESS
The control structure of the ESS is shown in Fig. 4. $P_{ref,es}$ is optimized by agent_es of MADRL to obtain its new value. Overcharge and over-discharge protection is incorporated into the ESS control [24], [25]. When the SOC of the ESS is greater than 0.8, the overcharge protection acts, and the charging power is multiplied by a charge protection parameter $K_{es,c}$. When the SOC of the ESS is less than 0.2, the over-discharge protection acts, and the discharge power is multiplied by a discharge protection parameter $K_{es,dis}$. The calculations of $K_{es,c}$ and $K_{es,dis}$ are shown in (10) and (11), respectively. The error between the SOC-protected power reference value $P^{*}_{ref,es}$ and the actual power $P_{es}$ is sent to a PI controller to control the charge and discharge of the ESS.
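The SOC protection logic and the PI step described above can be sketched as below. Several points are assumptions made only for this illustration: the protection parameters are passed in by the caller because their formulas (10) and (11) are not reproduced here, positive power is taken to mean discharging, and the discrete PI form and all names are hypothetical.

```python
def protect_reference(p_ref_es, soc, k_c, k_dis, soc_high=0.8, soc_low=0.2):
    """Apply overcharge / over-discharge protection to the ESS power reference.

    Assumed sign convention: p_ref_es > 0 discharges, p_ref_es < 0 charges.
    """
    if p_ref_es < 0.0 and soc > soc_high:
        return p_ref_es * k_c           # scale down charging near full
    if p_ref_es > 0.0 and soc < soc_low:
        return p_ref_es * k_dis         # scale down discharging near empty
    return p_ref_es


def pi_step(error, integral, kp, ki, dt):
    """One step of a discrete PI controller; returns (output, updated integral)."""
    integral += error * dt
    return kp * error + ki * integral, integral


if __name__ == "__main__":
    p_star = protect_reference(p_ref_es=-2.0, soc=0.9, k_c=0.5, k_dis=0.5)
    command, state = pi_step(error=p_star - (-0.8), integral=0.0, kp=2.0, ki=0.5, dt=0.1)
    print(p_star, command, state)
```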

A. MATD3
As a new MADRL algorithm for solving continuous problems, MATD3 extends TD3 to the multi-agent domain [15], [19], [26]. MATD3 is similar to the multi-agent deep deterministic policy gradient (MADDPG) [27] in that it also uses a centralized training and decentralized execution framework. Each agent has two centralized critics and one actor (policy). The algorithm does not need to establish real communication rules, and it is easy to converge to the global optimum. Two critics $Q^{\pi}_{i,\theta_{1,2}}$ are introduced into MATD3, and the minimum of the two is taken when calculating the target value to reduce the impact of network overestimation. The TD target value $y_i$ of the $i$th agent is computed from the following quantities: $r_i$ is the reward value of the $i$th agent, $\gamma$ is the discount factor, $\pi_{\phi}$ is the strategy of the algorithm, and $\varepsilon$ is Gaussian noise. During training, the two critics of each agent can access the actions, states, rewards, and strategies of all agents from the experience buffer [19], thus realizing the interaction among agents. These two critics complete a centralized training task: they evaluate the values of their actions not only according to their own state but also considering the behaviors and states of the other agents. The actor, on the other hand, performs a decentralized task according to a policy: it only needs to take its own state into account and act accordingly.
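To make the clipped double-critic target concrete, the sketch below follows the standard TD3 form, $y_i = r_i + \gamma \min(Q_1, Q_2)$ evaluated at noise-perturbed target actions. It is an assumption that the article's target has exactly this form; the critic values are passed in as plain numbers, and the noise clipping, action bounds, and function names are hypothetical.

```python
import random


def smoothed_target_action(target_action, sigma, noise_clip, a_low, a_high, rng):
    """Target policy output plus clipped Gaussian noise, kept within action bounds."""
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, sigma)))
    return max(a_low, min(a_high, target_action + noise))


def td_target(reward, gamma, q1_next, q2_next, done=False):
    """TD target using the smaller of the two target-critic estimates."""
    if done:
        return reward                   # no bootstrap at a terminal step
    return reward + gamma * min(q1_next, q2_next)


if __name__ == "__main__":
    rng = random.Random(0)
    a_next = smoothed_target_action(0.3, sigma=0.2, noise_clip=0.5,
                                    a_low=-1.0, a_high=1.0, rng=rng)
    # In a full implementation q1_next and q2_next would come from the two
    # centralized target critics evaluated on all agents' next actions.
    print(a_next, td_target(reward=-1.5, gamma=0.95, q1_next=-20.0, q2_next=-22.0))
```

Taking the minimum biases the target downward, which is the mechanism the text refers to for reducing overestimation.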

B. Partitioned Experience Buffer and Priority Experience Replay
The structure of the partitioned experience buffer and priority experience replay (PEPE) is shown in Fig. 5. The proposed experience replay method first stratifies the experience buffer according to the impact of rewards on the agents' learning and then sets priority sampling.
Different experiences play different roles in the agents' learning. Positive experiences can accelerate the agents' learning, while negative experiences can improve the agents' generalization ability and anti-risk ability [28]. Agents generate many negative experiences in the early stage of training, which affects the agents' learning rate. In the middle stage of training, the experience buffer stores more neutral experiences, which affects the learning rate and anti-risk ability of the agents. In the late stage of training, the experience buffer stores many positive experiences, which affects the agents' anti-risk ability. To solve this problem, the experience buffer $R(s_{i,t}, a_{i,t}, s_{i,t+1}, r_{i,t})$ is divided into three areas according to the range of reward values, and a dynamic sampling ratio is set for each area. In the early stage of training, the number of positive experiences sampled is increased to improve the learning rate of the agents. In the middle stage of training, the number of positive and negative experiences sampled is increased to improve the anti-risk ability and the learning rate of the agents. In the late stage of training, the number of negative experiences sampled is increased to improve the anti-risk ability of the agents.
Here, $R_{pos}(s_{i,t}, a_{i,t}, s_{i,t+1}, r_{i,t,pos})$ is the positive experience area, $R_{neg}(s_{i,t}, a_{i,t}, s_{i,t+1}, r_{i,t,neg})$ is the negative experience area, and $R_{neu}(s_{i,t}, a_{i,t}, s_{i,t+1}, r_{i,t,neu})$ is the neutral experience area. $O_p$ and $O_n$ are the two boundary coefficients that separate the areas.
The sampling number in each area is determined from the following quantities. $B_{pos}$, $B_{neg}$, and $B_{neu}$ are the numbers of experiences sampled from the positive, negative, and neutral experience areas, respectively. $N_{pos}$, $N_{neg}$, and $N_{neu}$ are the numbers of experiences stored in the positive, negative, and neutral experience areas, respectively. $N_{sum}$ is the total number of experiences in the experience buffer, and $B_{size}$ is the minibatch size. The $B_{pos}$, $B_{neg}$, and $B_{neu}$ experiences are drawn from the three experience areas according to the sampling probability $P_{ES,i}$ and then aggregated into a minibatch to train the agents. A sampling priority is set in experience replay, and the most useful experiences are preferentially sampled to update the agents and improve the agents' learning efficiency. The experience priority $p_{t,i}$ is determined from the TD error $|\delta_{t,i}|$:

$$p_{t,i} = |\delta_{t,i}| + \epsilon$$

where $\epsilon$ is a small positive constant that prevents $p_{t,i}$ from being zero. The larger the TD error, the greater the role of the experience and the higher its priority; the smaller the TD error, the lower its priority. The probability $P_{ES,i}$ of an experience being selected is determined by its priority and by $\mu$, the weight factor of sampling, which represents the degree to which the priority influences the sampling probability. The algorithm pseudocode is shown in Table I.
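The partitioning and prioritized sampling described above can be sketched as follows. This is an illustration under explicit assumptions, not the article's Table I: the selection probability is taken in the usual prioritized-replay form $p^{\mu}/\sum p^{\mu}$ within each area, the per-area sampling ratios are supplied by the caller (the article's stage-dependent rule for $B_{pos}$, $B_{neg}$, $B_{neu}$ is not reproduced), sampling is done with replacement, and the class and method names are hypothetical.

```python
import random


class PartitionedBuffer:
    """Experience buffer split into positive, negative, and neutral areas."""

    def __init__(self, o_p, o_n, eps=1e-6, mu=0.6, seed=None):
        self.o_p, self.o_n = o_p, o_n          # boundary coefficients
        self.eps, self.mu = eps, mu            # priority offset and weight factor
        self.areas = {"pos": [], "neg": [], "neu": []}
        self.rng = random.Random(seed)

    def area_of(self, reward):
        if reward > self.o_p:
            return "pos"
        if reward < self.o_n:
            return "neg"
        return "neu"

    def add(self, transition, reward, td_error):
        priority = abs(td_error) + self.eps    # p = |delta| + eps
        self.areas[self.area_of(reward)].append((transition, priority))

    def probabilities(self, area):
        weights = [p ** self.mu for _, p in self.areas[area]]
        total = sum(weights)
        return [w / total for w in weights]

    def sample(self, batch_size, ratios):
        """Draw int(batch_size * ratio) experiences from each non-empty area."""
        batch = []
        for area, ratio in ratios.items():
            items = self.areas[area]
            count = int(batch_size * ratio)
            if not items or count == 0:
                continue
            picked = self.rng.choices(items, weights=self.probabilities(area), k=count)
            batch.extend(transition for transition, _ in picked)
        return batch


if __name__ == "__main__":
    buf = PartitionedBuffer(o_p=1.0, o_n=-1.0, seed=0)
    for step in range(30):
        reward = step % 5 - 2.0                # synthetic rewards in [-2, 2]
        buf.add(("s", "a", "s_next", reward), reward, td_error=0.1 * step)
    # Hypothetical early-stage ratios that favor positive experiences.
    print(len(buf.sample(8, {"pos": 0.5, "neg": 0.25, "neu": 0.25})))
```

Passing the ratios as an argument keeps the sketch neutral about how the early, middle, and late training stages set them.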

C. Power Optimization Strategy Based on PEPE-MATD3
In the original MATD3 algorithm, experience replay stores the experience data in the experience buffer, samples randomly from the buffer, and uses the sampled experiences to update the target strategy. The experience utilization of such an experience replay algorithm is not high, which affects the learning efficiency of the agents. Therefore, the PEPE algorithm is introduced into the MATD3 algorithm. First, the improved PEPE-MATD3 algorithm divides the experience pool into three layers (positive experiences, negative experiences, and neutral experiences) according to the different impacts of the reward values on learning. In the early, middle, and late stages of training, the number of samples in each layer is determined to improve the utilization efficiency of experiences and the learning efficiency of the agents. The experiences are then preferentially sampled according to the experience priority, and the experiences that are more useful for policy learning are selected on the basis of the stratification, so as to improve the learning ability of the algorithm. According to the state and the strategy, the agents take actions (the power optimization compensation amounts $\Delta P_{ref,1\sim10}$) to adjust the reference power of the wind turbines and receive the reward as feedback. The agents explore and learn during the trial-and-error training of adjusting the reference power until they learn the optimal strategy. The power optimization framework is shown in Fig. 6. A centralized training and decentralized execution architecture is adopted for PEPE-MATD3. Centralized training guides actor training through the critics of each agent, which can observe globally. The critics of each agent have access to the information of all other agents, which realizes the communication among multiple agents. Decentralized execution means that each agent's actor acts independently according to the state of the environment. Each column of wind turbines is controlled by one agent to reduce the computational load. Agent_1 optimizes the power through the output action values $a_1$, $a_2$, $a_3$ ($\Delta P_{ref,1\sim3}$) to obtain the new wind turbine reference powers $P_{ref,1\sim3}$. The other agents act in the same way. Agent_N of the ESS optimizes the power through the output action value $a_N$ ($\Delta P_{ref,es}$) to obtain the new reference power $P_{ref,es}$ of the ESS.

TABLE I MATD3 WITH PEPE
The environment provides $V_{farm}$, $P_{farm}$, $P_{grid}$, $P_{es}$, SOC, $V_i$, $P_{WTi}$, and $\beta_{WTi}$ information to each agent. The state space of the combined power generation system of the wind farm and ESS is defined as:

$$S = [V_{farm}(t), P_{farm}(t), P_{grid}(t), P_{es}(t), SOC(t), V_i(t), P_{WTi}(t), \beta_{WTi}(t)]$$

After observing the state information of the environment, the agent chooses an action in the action space according to its policy $\pi$. The actions $a_i$ and $a_N$ are the reference adjustment values $\Delta P_{ref,i}$ and $\Delta P_{ref,es}$ for each wind turbine and the ESS, respectively.

In the learning process, the reward function determines the tasks that each agent needs to complete as well as whether the agents cooperate or compete. To address the problems of power fluctuation, power loss, excessive load of the WTs, and ESS loss, the wind farm's power generation, the grid-connected power smoothness, and the pitch angle standard deviation are taken as the reward values for the WTs. The ESS can be protected according to its SOC; therefore, the level of SOC and the grid-connected power smoothness are taken as its reward values. $r_i(t)$ and $r_{es}(t)$ are the reward functions of the agents of the WTs and the ESS, respectively, and $\rho$, $\xi$, $\lambda_{wt}$, $\lambda_{es}$, $\sigma$, and $\zeta$ are the weight coefficients. The terms $y_{es,j}$ ($j = 0, 1, 2, 3$) appear in these reward functions.

$E_{farm}$ is defined as:

$$E_{farm} = \sum_{t=0}^{T} P_{farm}(t)\,\Delta t$$

where $E_{farm}$ is the power generation of the wind farm, $\Delta t$ is the time interval, and $T$ is the total time. $F_g$ is defined following [29] and represents the grid-connected power smoothness. The smaller $F_g$ is, the better the smoothing effect and the smaller the impact on the power grid. $\Delta P_{grid}$ is the absolute value of the grid-connected power fluctuation, and $P_{farm,rate}$ is the rated power of the wind farm.
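The two evaluation quantities named above can be illustrated with a short sketch. The energy sum follows directly from the description of $E_{farm}$. The fluctuation measure, however, is only a hypothetical proxy (mean absolute step change of $P_{grid}$ normalized by $P_{farm,rate}$) introduced for illustration; it is not the definition of $F_g$ from [29], and the function names are assumptions.

```python
def farm_energy(p_farm, dt_hours):
    """E_farm as the sum of P_farm(t) * dt over the horizon (MW and h give MWh)."""
    return sum(p * dt_hours for p in p_farm)


def mean_relative_fluctuation(p_grid, p_farm_rate):
    """Illustrative smoothness proxy, NOT the F_g of [29].

    Mean absolute change between consecutive samples of the grid-connected
    power, normalized by the rated power of the wind farm.
    """
    steps = [abs(b - a) for a, b in zip(p_grid, p_grid[1:])]
    if not steps:
        return 0.0
    return sum(steps) / (len(steps) * p_farm_rate)


if __name__ == "__main__":
    p_farm = [31.0, 33.5, 30.2, 32.8]          # MW, synthetic samples
    print(farm_energy(p_farm, dt_hours=0.25))
    print(mean_relative_fluctuation(p_farm, p_farm_rate=50.0))
```

As with $F_g$ in the text, a smaller value of the proxy corresponds to a smoother grid-connected power curve.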

A. System Parameter Configuration
To verify the effectiveness of the proposed strategy, a simulation model was established on SimWindFarm [22]. The simplified 5 MW FAST wind turbine model developed by NREL was used, which is composed of the aerodynamic model, drivetrain, generator, blade and tower model, and the wind turbine control strategies [22], [23]. The layout of the 10 NREL 5 MW wind turbines in the simulation model is shown in Fig. 7.
The charge/discharge power and the stored energy of the ESS can be obtained at any time during the whole operation period by simulating the scheduling operation. Therefore, the ESS capacity can be calculated and determined according to these results. The rated capacity of the ESS that meets the operational requirements of SOC is calculated following [30], where $E_{flu}(t)$ is the energy fluctuation of the ESS relative to the initial state at different times, $P_{es}(t)$ is the output power of the ESS, $\nu$ is the capacity configuration margin of the ESS, and $\eta_e$ is the charging and discharging efficiency of the ESS. The system parameters are shown in Table II.

TABLE II SYSTEM PARAMETERS
TABLE III HYPERPARAMETERS

B. Analysis of Training Results
MATD3 was used to optimize the power of the M-WT and the ESS to reduce the impact of model errors and uncertainties. The input wind speed of the WTs within each column of the wind farm is similar, so each WT column was optimized and compensated by one agent, thereby reducing the computational complexity. According to the wind farm layout in Fig. 7, the 10 WTs are controlled by four agents, and the ESS is controlled by another agent. Therefore, the total number of agents in this article is 5. To verify the training efficiency, an experiment was conducted with the MATD3 algorithm [19], the improved PEPE-MATD3 algorithm, and the multi-agent deep deterministic policy gradient (MADDPG) algorithm [27]. The actor network of the algorithm is shown in Fig. 8 (the network structure of each agent is the same).
To ensure the training effect, the agents carried out 200 rounds of trial-and-error training. The global reward index of the three algorithms is shown in Fig. 9. As can be observed from Fig. 9, the PEPE-MATD3 algorithm obtains higher reward values than the other two algorithms at the early stage of training, indicating the advantage of increasing the share of positive experiences at the early stage. Although the reward values of PEPE-MATD3 are lower than those of MATD3 during iterations 25-60, they are higher than those of the other two algorithms after 60 iterations, indicating that PEPE-MATD3 has a stronger learning ability. In addition, the reward values of PEPE-MATD3 settle at around -870, which is higher than those of the other two algorithms.

C. Simulation Experiment Analysis
To verify the effectiveness of the proposed control method, it was compared with the traditional control of the wind farm (WF), the SOC feedback control of the ESS [31], the dynamic allocation (DA)-based coordinated control of M-WT [13], the rule-based coordinated control of WF and ESS, and the MATD3-based coordinated control of WF and ESS (MATD3). When the average wind speed of the wind farm is 12 m/s and the wind direction is 0°, the grid-connected power of the wind farm with the six control methods is shown in Fig. 10. $P_{es}$ and the SOC of the ESS with different control methods are shown in Figs. 11 and 12, respectively.
The output coefficient OC of the ESS is computed from its SOC. The closer the SOC value is to 0.5, the smaller OC is, and the stronger the ESS's ability to cope with future power fluctuations.
The energy evaluation indexes are shown in Table IV, where $F_f$ is the smoothness of the output power of the wind farm, and $E_{farm}$ is the total power generation of the wind farm excluding the energy storage part.

TABLE IV ENERGY EVALUATION INDEXES
It can be observed from Fig. 10 that the grid-connected power curve of the SOC feedback control of the ESS is smoother than that of the DA-based coordinated control of M-WT, especially for high-frequency power fluctuations. At the same time, it can be observed from Fig. 11 that the ESS charges and discharges at a high frequency and can therefore handle power fluctuations quickly. The characteristics of the ESS thus compensate for the limited smoothing capability of the M-WT. As can be observed from Table IV, $F_g = 0.4234$ of the rule-based coordinated control of WF and ESS is significantly lower than $F_g = 0.5010$ of the SOC feedback control of ESS and $F_g = 0.8051$ of the DA-based coordinated control of M-WT. Moreover, it can be observed from Fig. 10 that the grid-connected power fluctuations of the rule-based coordinated control of WF and ESS are lower than those of these two methods, indicating that it smooths power fluctuations better than either individual control. OC = 0.1376 of the rule-based coordinated control of WF and ESS is lower than OC = 0.2093 of the SOC feedback control of ESS, indicating that it has a stronger ability to smooth future power fluctuations. It can be observed from Fig. 12 that the SOC of the rule-based coordinated control of WF and ESS is closer to the optimal value 0.5, indicating that the SOC of the ESS is kept in a safe range, that the ESS is less likely to overcharge or over-discharge, and that it can better cope with future power fluctuations. The reason is that the ESS control is supported by the WF smoothing power control: the power smoothing capability of the WF itself reduces the workload of the ESS and thereby improves the smoothing ability of the system. However, its $E_{farm}$ is 275.3 MW·h, which is smaller than that of the traditional wind farm control, indicating that the power fluctuations are reduced at the cost of energy loss.
As can be observed from Table IV, the proposed control method performs better in many respects than the other control methods mentioned above. The grid-connected power smoothness F g = 0.3950 of the proposed control method is significantly smaller than F g = 0.4234 of the rule-based coordinated control of WF and ESS. At the same time, it can be observed from Fig. 10 that the grid-connected power fluctuations of the proposed control method are lower than those of the rule-based coordinated control of WF and ESS, demonstrating that it smooths power fluctuations better than the rule-based coordinated control, which does not use multi-agent deep reinforcement learning. Compared with the rule-based coordinated control of WF and ESS, E farm = 305.8 MW•h of the proposed control method is increased by 10.40%, indicating that the proposed control method can increase the power generation of the wind farm while still ensuring smooth power. The reason is that by sacrificing some power generation of the upstream WTs, the influence of the wake effect on the downstream WTs is reduced and the wind energy captured by the downstream WTs is increased, so that the power generation of the whole wind farm is maximized. Specifically, the power generation of the wind farm is used as the reward function of the agents in the PEPE-MATD3 algorithm. Through continuous trial-and-error training, the agents find a strategy that maximizes the rewards according to the reward function. It can be observed from Fig. 12 that the SOC of the proposed control method is closer to the optimal value of 0.5, indicating that the SOC of the ESS is kept within a safe range, that overcharging and over-discharging are less likely, and that the ESS can better cope with future power fluctuations. Although F g = 0.3950 of the proposed control method is slightly larger than F g = 0.3902 of the MATD3-based coordinated control method, its E farm and OC are significantly better than those of the latter.
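The partitioned experience buffer with prioritized sampling that PEPE-MATD3 relies on can be illustrated with a minimal sketch. It only shows the idea of splitting experiences into positive, negative, and neutral partitions and drawing them according to priority. The reward threshold `neutral_band`, the per-partition sampling shares, and the caller-supplied priority are hypothetical choices made for illustration, not the settings used in this article.

```python
import random
from collections import deque


class PartitionedReplayBuffer:
    """Toy replay buffer split into positive, negative, and neutral partitions."""

    def __init__(self, capacity=1000, neutral_band=0.1, seed=0):
        # Assumption: |reward| <= neutral_band is treated as a neutral experience.
        self.neutral_band = neutral_band
        self.parts = {
            name: deque(maxlen=capacity)
            for name in ("positive", "negative", "neutral")
        }
        self.rng = random.Random(seed)

    def partition_of(self, reward):
        """Return the name of the partition that a reward falls into."""
        if reward > self.neutral_band:
            return "positive"
        if reward < -self.neutral_band:
            return "negative"
        return "neutral"

    def add(self, transition, reward, priority):
        # Assumption: the caller supplies a priority (for example |TD error|);
        # it is floored at a small positive value so every item can be drawn.
        item = (max(priority, 1e-6), transition)
        self.parts[self.partition_of(reward)].append(item)

    def sample(self, batch_size, shares=None):
        """Draw from each partition with probability proportional to priority."""
        # Assumption: fixed shares of the batch per partition.
        shares = shares or {"positive": 0.4, "negative": 0.4, "neutral": 0.2}
        batch = []
        for name, share in shares.items():
            items = list(self.parts[name])
            k = min(len(items), round(batch_size * share))
            if k == 0:
                continue
            weights = [priority for priority, _ in items]
            drawn = self.rng.choices(items, weights=weights, k=k)
            batch.extend(transition for _, transition in drawn)
        return batch
```

Sampling here is done with replacement inside each partition, which keeps the sketch short; an actual implementation would also update priorities after each training step.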
The tower root moment of the No. 1 WT under the different methods is compared in Fig. 13. The smoothness of the tower root moment of each WT, F M,i, is shown in Table V.
As can be observed from Fig. 13, the fluctuations of the tower root bending moment with the proposed method are significantly lower than those of the traditional control of WF, the DA-based coordinated control of M-WT, and the rule-based coordinated control of WF and ESS. They are also significantly lower than those of MATD3 during 5000 s to 7000 s, although the difference is not obvious over the whole run. At the same time, it can be found from Table V that the smoothness of the tower root bending moment of each WT with the proposed method is smaller than those of the other methods. The results also show that the proposed method reduces the fatigue load more than the other methods.
As can be observed from Fig. 14, the grid-connected power smoothness F g of these four methods shows a downward trend when the rated capacity of the ESS increases, indicating that an increased rated capacity improves the ability of the ESS to smooth wind power. Although the F g of the proposed method is slightly lower than that of the MATD3 method when the energy storage capacity ranges from 10 MW•h to 15 MW•h, it remains at roughly 0.4 even at low rated capacity, indicating more stable performance in terms of smoothness. As can be observed from Fig. 15, the OC of the ESS shows a downward trend with these four methods when the rated capacity of the ESS increases, and the OC of the proposed method is lower than those of the other four methods. This indicates that the proposed method still achieves a better capability of smoothing future power fluctuations when the capacity changes. Overall, the proposed method provides more stable performance and robustness under changing capacity. In addition, the results at different rated capacities verify that an ESS using the proposed method can be configured in many ways according to the investment budget.

D. Simulation Experiments Under Different Environmental Conditions
2) Average Wind Speed Variation in the Wind Farm: The average wind speed of the wind farm was varied from 10 m/s to 14 m/s. The smoothness of grid-connected power, the power generation, and the output coefficient of the ESS at different average wind speeds are shown in Figs. 16, 17, and 18, respectively. It can be observed from Fig. 16 that the F g of the proposed method is lower than those of the other methods as the average wind speed changes, indicating that the proposed method still guarantees better power smoothing performance under varying average wind speed. It can be observed from Fig. 17 that the power generation of the proposed method is higher than those of the other four methods when the average wind speed changes from 10 m/s to 14 m/s, indicating that the proposed method generates more energy over the full range of wind speeds tested. As can be observed from Fig. 18, although the OC of MATD3 is lower than that of the proposed control method when the average wind speed is between 12.5 m/s and 14 m/s, the OC of the proposed control method is lower than those of the other algorithms overall. In general, the proposed method performs well in smoothness, power generation, and ESS output coefficient under different wind speeds.
To verify the adaptability and robustness of the proposed control method, the average wind speed of the wind farm, the rated capacity of the ESS, and the turbulence intensity were changed simultaneously, and 200 sets of Monte Carlo simulations were conducted. The simulation results are shown in Fig. 19.
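As an illustration of how such a set of randomized scenarios could be generated, the following minimal sketch draws 200 parameter combinations. The wind speed range (10 m/s to 14 m/s) and the capacity range (4 MW•h to 15 MW•h) are borrowed from the single-parameter experiments; the turbulence intensity range, the uniform distributions, and all names are assumptions made for illustration, since the article does not state how the Monte Carlo samples were drawn.

```python
import random


def sample_scenarios(n_sets=200, seed=0):
    """Draw random (wind speed, ESS capacity, turbulence) scenario sets."""
    rng = random.Random(seed)
    scenarios = []
    for _ in range(n_sets):
        scenarios.append({
            # Range taken from the average wind speed experiment.
            "mean_wind_speed_mps": rng.uniform(10.0, 14.0),
            # Range taken from the energy storage capacity experiment.
            "ess_capacity_mwh": rng.uniform(4.0, 15.0),
            # Hypothetical range; only the value 0.05 appears in the text.
            "turbulence_intensity": rng.uniform(0.05, 0.15),
        })
    return scenarios


if __name__ == "__main__":
    for scenario in sample_scenarios(n_sets=3):
        print(scenario)
```

Each scenario would then be passed to one simulation run, and the resulting F g, E farm, and OC collected for a plot such as Fig. 19.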
It can be observed from Fig. 19 that the proposed method performs well in grid-connected power smoothness, power generation, and ESS output coefficient when different parameters vary at the same time, which also shows that the proposed method adapts well to a variety of environments with different parameters. However, it can be observed from Fig. 19(a) that the F g of the proposed method becomes larger when the turbulence coefficient is 0.05 and the rated capacity is as low as 4 MW•h, indicating that good power smoothing performance is difficult to guarantee with the proposed method in this case. It can be observed from Fig. 19(c) that the OC of the proposed method also becomes larger when the turbulence coefficient is 0.05, the rated capacity is as low as 4 MW•h, and the average wind speed is 14 m/s. In this case, it is difficult for the proposed method to deal with future power fluctuations, and the SOC is more prone to overcharging and over-discharging. It is also difficult for the proposed method to ensure good control performance when the average wind speed of the wind farm is too large or too small or the rated capacity is too low. Overall, the proposed method shows good adaptability and robustness in a random uncertain environment.
When the model has errors, the SOC of both the rule-based coordinated control of WF and ESS and the proposed control method changes markedly, deviating further from 0.5 and entering an unsafe range during 25000 s to 30000 s, indicating that the control effect is affected by the model errors. It can be observed from Figs. 21 and 22 and Table VI that the F g of the rule-based coordinated control method of WF and ESS is increased by 0.1063, the OC is increased by 0.0646, and the E farm is decreased by 30.1 MW•h when the model errors occur. On the other hand, the F g of the proposed method is increased by 0.0218, the OC is increased by 0.0345, and the E farm is reduced by 2.8 MW•h. These data show that, compared with the rule-based coordinated control method of WF and ESS, the smoothness, output coefficient, and energy generation of the proposed control method change less under model errors, so a better control effect is maintained. This indicates that the optimization compensation of PEPE-MATD3 reduces the impact of model inaccuracy, and it reflects the insensitivity to the model that deep reinforcement learning gives the proposed control.
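The comparison above can be made concrete with a short calculation on the reported index changes. The sketch below is an illustrative reading of the Table VI figures quoted in the text, not part of the original evaluation; the dictionary layout and the function name are hypothetical.

```python
# Changes in each index caused by model errors, as quoted in the text.
MODEL_ERROR_CHANGES = {
    "rule_based": {"F_g": 0.1063, "OC": 0.0646, "E_farm_MWh": -30.1},
    "proposed": {"F_g": 0.0218, "OC": 0.0345, "E_farm_MWh": -2.8},
}


def relative_sensitivity(changes, method="proposed", baseline="rule_based"):
    """Ratio of the magnitude of each index change to that of the baseline."""
    return {
        name: abs(changes[method][name]) / abs(changes[baseline][name])
        for name in changes[baseline]
    }


if __name__ == "__main__":
    for name, ratio in relative_sensitivity(MODEL_ERROR_CHANGES).items():
        print(f"{name}: {ratio:.3f}")
```

Under this reading, the index changes of the proposed method are roughly one fifth (F g), one half (OC), and one tenth (E farm) of those of the rule-based method.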

V. CONCLUSION
To address the problem of wind power smoothing, a coordinated power smoothing control strategy for M-WT and ESS based on the PEPE-MATD3 algorithm is proposed in this article. Firstly, a coordinated power control system of the M-WT and ESS based on the wake model is established. The coordinated power control among the M-WT is used to smooth power fluctuations, while the ESS is used to smooth the high-frequency fluctuations that are difficult to handle with the internal control of the wind farm. The power smoothing capabilities of the M-WT and ESS are combined to compensate for the shortcomings of individual WT control. Then, the PEPE-MATD3 algorithm is used to optimize the coordinated power control of the M-WT and ESS, and the negative impact of an inaccurate model is reduced by the model-free nature of the PEPE-MATD3 algorithm, so as to reduce the loss of the system while ensuring smooth power. Experimental results show that the proposed method is superior to M-WT control and ESS control alone in terms of power smoothness, power generation, and output capacity of the ESS, and that its performance is further improved by the PEPE-MATD3 optimization. The simulation results under different scenarios and model errors show that the proposed method can reduce the influence of model uncertainty, decompose the complex multi-objective optimization problem into a simpler multi-agent cooperative problem, and improve the robustness and stability of the system. Future work will focus on practical wind farm applications, and industrial experiments will be conducted when the experimental conditions permit. In addition, how to ensure system stability and safety during the trial-and-error training of deep reinforcement learning, and how to maximize the training effect of the agents at an acceptable trial-and-error cost, will also be investigated.

Fig. 1. Combined power generation system of an offshore wind farm and ESS.
The training hyperparameters (which are shared by the three algorithms), shown in Table III, were determined through multiple tests. The structural design of the critic network and

1) Changes in Energy Storage Capacity: With the other parameters unchanged, the energy storage capacity configuration was varied from 4 MW•h to 15 MW•h in the experiment. The F g of the wind farm and the OC of the ESS at different rated capacities of the ESS were analyzed. The results are shown in Figs. 14 and 15.

To simulate the errors between the established wake model and the actual wake environment, interference signals (Band-Limited White Noise) were added to the wake model on SimWindFarm. Simulations of the control effects with and without model errors were conducted for the rule-based coordinated control of WF and ESS and for the proposed control. The results are shown in Figs. 20, 21, and 22, and the corresponding evaluation indexes are shown in Table VI. As can be observed from Fig. 20, the calculated wind speed of the No. 4 WT differs obviously from the real wind speed when the model contains errors. It can be observed from Figs. 21 and 22 that the fluctuations of the grid-connected power are larger with the rule-based coordinated control of WF and ESS and the proposed control method when the model has errors.
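A minimal sketch of how such an interference signal could be produced outside the simulation platform is given below. It uses a sample-and-hold Gaussian sequence, which is one common discrete approximation of band-limited white noise, and adds it to a wind speed series from a wake model. The function names, the noise power, the sample time, and the example wind speed are all hypothetical and are not taken from the article or from SimWindFarm.

```python
import math
import random


def band_limited_white_noise(duration_s, dt, sample_time, noise_power, seed=0):
    """Piecewise-constant Gaussian noise, redrawn every `sample_time` seconds."""
    rng = random.Random(seed)
    # Assumption: standard deviation scaled as sqrt(noise_power / sample_time).
    sigma = math.sqrt(noise_power / sample_time)
    n_steps = int(round(duration_s / dt))
    hold = max(1, int(round(sample_time / dt)))
    noise, current = [], 0.0
    for k in range(n_steps):
        if k % hold == 0:
            # Draw a new value at the start of each hold interval.
            current = rng.gauss(0.0, sigma)
        noise.append(current)
    return noise


def perturb_wake_speed(model_speeds, noise):
    """Add the interference to model wind speeds, clipped at zero."""
    return [max(0.0, v + e) for v, e in zip(model_speeds, noise)]


if __name__ == "__main__":
    noise = band_limited_white_noise(10.0, 0.1, 1.0, noise_power=0.01)
    speeds = perturb_wake_speed([12.0] * len(noise), noise)
    print(min(speeds), max(speeds))
```

Comparing the perturbed series with the unperturbed one plays the role of the calculated versus real wind speed comparison in Fig. 20.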

Fig. 22. SOC of the ESS with/without model errors.
Coordinated Power Smoothing Control Strategy of Multi-Wind Turbines and Energy Storage Systems in Wind Farm Based on MADRL. Xin Wang, Member, IEEE, Jianshu Zhou, Bin Qin, and Lingzhong Guo, Member, IEEE

TABLE V. EVALUATION INDEXES OF THE WTS
Fig. 14. F g with different rated capacities of ESS.
and 22 and Table VI that the F g of the rule-based coordinated control method of WF and ESS is increased by 0.1063, the OC is