Multiplexing eMBB and URLLC in Wireless Powered Communication Networks: A Deep Reinforcement Learning-Based Approach

This letter investigates the multiplexing of enhanced mobile broadband (eMBB) and ultra-reliable low-latency communications (URLLC) services in a wireless powered communication network, where a hybrid access point coordinates the wireless energy transfer (WET) to users and receives information from them. Preemptive puncturing is adopted to multiplex URLLC traffic onto eMBB transmissions. Apart from the energy used for wireless information transmission (WIT), the remaining energy in each user's battery is reserved to avoid an energy shortage in future WIT. The problem is formulated to jointly allocate subcarriers, time, and energy so as to maximize the uplink eMBB sum rate under constraints on URLLC latency, radio frequency to direct current (RF/DC) sensitivity, user battery capacity, and subcarrier availability. We propose a deep reinforcement learning-based approach named mixed deep deterministic policy gradient (Mixed-DDPG), which decomposes the optimization problem into a discrete subproblem for subcarrier allocation and a continuous subproblem for time and energy allocation, and solves them alternately. Numerical results show that the proposed algorithm achieves a higher eMBB sum rate than existing schemes.


Xiaotian Jiang, Kai Liang, Xiaoli Chu, Senior Member, IEEE, Cheng Li, and George K. Karagiannidis, Fellow, IEEE

Xiaotian Jiang and Kai Liang are with the School of Telecommunications Engineering, Xidian University, Xi'an 710071, China (e-mail: x.jiang@stu.xidian.edu.cn; kliang@xidian.edu.cn).
Xiaoli Chu is with the Department of Electronic and Electrical Engineering, The University of Sheffield, S1 3JD Sheffield, U.K. (e-mail: x.chu@Sheffield.ac.uk).
Cheng Li is with the Department of System Engineering Division, Xi'an Aerospace Precision Electromechanical Institute, Xi'an 710199, China (e-mail: xevillee@126.com).
George K. Karagiannidis is with the Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece, and also with the Cyber Security Systems and Applied AI Research Center, Lebanese American University, Beirut 302060, Lebanon (e-mail: geokarag@auth.gr).
Digital Object Identifier 10.1109/LWC.2023.3290040

I. INTRODUCTION
With wireless devices becoming ubiquitous and carrying out various applications, the wireless powered communication network (WPCN) has emerged to solve the energy supply problem of energy-limited devices [1]. The user association and time allocation in a WPCN were jointly optimized by adopting the α-fair utility to maximize the sum, max-min, and proportional fairness rates in [2]. In [3], the total effective throughput was maximized by optimizing the tradeoff between the transmission time and packet error rate of a WPCN while meeting the effective information requirements. A point-to-point energy harvesting system was considered in [4] with finite blocklength, where an achievable channel coding rate and a mean delay of the system were investigated.
In addition, how to efficiently multiplex enhanced mobile broadband (eMBB) and ultra-reliable low-latency communications (URLLC) on a shared channel has become a major challenge for 5G wireless networks [5]. Due to their different requirements, eMBB and URLLC transmit at different time scales [6]. Specifically, the time domain is divided into equal slots and each slot is further divided into multiple minislots; eMBB transmissions are performed on slots to achieve a high data rate, while URLLC packets are transmitted on minislots to reduce latency. To optimally allocate radio resources when eMBB and URLLC are multiplexed, [7] adopts a preemptive puncturing method, i.e., an arriving URLLC packet is scheduled for transmission in the next minislot by preempting subcarriers already allocated to eMBB users, which is shown to achieve higher expected rates than static or semi-static allocation of spectrum resources. The authors in [5] maximized the eMBB throughput under URLLC constraints by jointly optimizing the traffic scheduling for eMBB and the preemptive puncturing for URLLC. In [6], a simplified model-free deep reinforcement learning-based approach was proposed to minimize the loss of eMBB transmission rate due to URLLC packet puncturing, under the assumptions that radio resources are allocated to eMBB users in advance and that each URLLC packet can preempt radio resources from multiple eMBB users.
However, the authors in [4] assumed an infinite battery capacity for the user, which is infeasible in practice. In [1], [2], [3], [4], uplink wireless information transmission (WIT) in each slot relies only on the energy harvested in the current slot, without any energy reservation; hence, the harvested energy in some slots may be insufficient for uplink WIT due to channel variations [8]. For example, a deep-fading channel reduces the energy harvested by the user while requiring more energy for uplink WIT. Moreover, the multiplexing of eMBB and URLLC services has not yet been studied for wireless energy transfer (WET) based WPCNs. As a result, the existing system models may not be readily applicable in WPCN scenarios where multiple services with different requirements, such as URLLC and eMBB, share the same spectrum.
In contrast to the above works, this letter investigates the multiplexing of eMBB and URLLC transmissions on a shared channel in a WPCN, where a hybrid access point (HAP) powers multiple users by WET with the consideration of energy reservation and preemptive puncturing. The contributions of this letter are summarized as follows:
(i) Unlike the existing works, we study the problem of how to multiplex eMBB and URLLC transmissions in the uplink of a WPCN while considering finite battery capacity at each user, radio frequency to direct current (RF/DC) sensitivity of the energy harvesting circuit, and energy reservation for each user's battery to ensure power supply for uplink transmission.
(ii) We formulate an optimization problem to maximize the uplink eMBB sum rate by jointly optimizing the allocation of subcarriers, time for WET, and energy reservation of each user's battery under the constraints of URLLC latency, RF/DC sensitivity, user's battery capacity, and subcarrier availability.
(iii) To solve this non-convex optimization problem that features mixed allocation of discrete subcarriers and continuous time and energy, we propose a deep reinforcement learning-based approach named mixed deep deterministic policy gradient (Mixed-DDPG) to decompose it into a discrete subproblem and a continuous subproblem and solve them alternately.
2162-2345 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.

II. SYSTEM MODEL AND PROBLEM FORMULATION
We consider a WPCN that includes a HAP and a set $\mathcal{U}$ of $U$ users with eMBB and URLLC transmission requirements. Each user has a rechargeable battery. For analytical tractability, it is assumed that the HAP and each user are equipped with a single antenna [9]. Let $\mathcal{B} = \{1, 2, \ldots, B\}$ denote the set of available subcarriers, each with a bandwidth of $f_b$ Hz. Thus, the total bandwidth is $\sum_{b \in \mathcal{B}} f_b$ Hz. A long time period is considered and is divided into $T$ equal slots, denoted by $\mathcal{T} = \{1, 2, \ldots, T\}$. Each slot has a duration $t_0$. On each subcarrier, we assume channel reciprocity and that the channel fading coefficient stays constant within a slot but may change across adjacent slots. The subcarriers are reclaimed and rescheduled to eMBB users at the beginning of each slot based on the channel state information (CSI) [5]. Due to the stringent latency requirement of URLLC transmissions, we adopt the "URLLC preemption" scheme [10], where each slot is further divided into minislots represented by $\mathcal{M} = \{1, 2, \ldots, M\}$, and an arriving URLLC packet is scheduled immediately for transmission in the next minislot by preempting the subcarriers already allocated to the same user for eMBB transmissions, without waiting for the eMBB transmissions on those subcarriers to finish [6]. Without loss of generality, we assume that each user has URLLC packets arriving at each minislot.
All users adopt the harvest-then-transmit protocol in each slot, where the users first harvest energy from the energy signal broadcast by the HAP and then transmit information to the HAP using the harvested energy. For instance, if the $u$th user is scheduled to transmit in the $t$th slot, then the $t$th slot is divided into a downlink WET phase of duration $\tau_{u,t} t_0$ and an uplink WIT phase of duration $(1-\tau_{u,t}) t_0$, where $\tau_{u,t} \in (0, 1)$. The HAP's downlink transmission power is assumed to be the same on each subcarrier and is denoted by $P_{\text{DL}}$. The received power $P^{r}_{u,t}$ of the $u$th user in the $t$th slot is given by (1), where $\eta_c$ is the energy conversion efficiency of the RF/DC circuit, $d_u$ is the distance between the $u$th user and the HAP, $\alpha$ is the path loss exponent, $h_{u,b,t} \sim \mathcal{CN}(0, 1)$ denotes the Rayleigh fading coefficient between the HAP and the $u$th user on subcarrier $b$ in the $t$th slot, and $x_{u,b,t,m} \in \{0, 1\}$ is a binary indicator of subcarrier allocation, where $x_{u,b,t,m} = 1$ means that subcarrier $b$ is allocated to the $u$th user in minislot $m$ of slot $t$ for eMBB transmission, and $x_{u,b,t,m} = 0$ otherwise. Since a user cannot harvest energy if its received power is less than the RF/DC circuit sensitivity $\varphi$, the received energy at the $u$th user in the $t$th slot is given by (2), where $\mathbb{1}(\cdot)$ is the binary indicator function.
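The effect of the RF/DC sensitivity threshold on harvesting can be illustrated with a minimal sketch. It assumes that the harvested energy is the received power multiplied by the WET duration whenever the received power reaches the sensitivity, and zero otherwise; this form, the function name `harvested_energy`, and the numerical values are illustrative assumptions rather than the letter's exact expression.

```python
def harvested_energy(received_power_w, tau, slot_s, sensitivity_w):
    """Energy (in joules) harvested during the WET phase of one slot.

    Assumed form (not taken from the letter):
        E = 1(P_r >= phi) * P_r * tau * t0
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie in (0, 1)")
    if received_power_w < sensitivity_w:
        # Below the RF/DC sensitivity the circuit harvests nothing.
        return 0.0
    return received_power_w * tau * slot_s


# Hypothetical values: 1 ms slot, 20 uW sensitivity, 40% of the slot for WET.
print(harvested_energy(50e-6, 0.4, 1e-3, 20e-6))  # above the threshold
print(harvested_energy(10e-6, 0.4, 1e-3, 20e-6))  # below the threshold
```

The hard threshold makes the harvested energy discontinuous in the received power, which is one reason a deep fade can leave a user with no new energy at all in a slot.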
The uplink transmission power of the $u$th user during the WIT phase in slot $t$ is given by (3), where $\rho_{u,t} \in [0, 1]$ is the proportion of energy reserved by the $u$th user in the $t$th slot for the WIT of slot $t + 1$, and $Q_{u,t}$ is the battery energy level of the $u$th user at the end of WET in slot $t$, which is updated according to (4), where $Q_{u,t-1}$ is the battery energy level of the $u$th user at the end of the WET in slot $t - 1$, and $Q_{\max}$ is the battery capacity. The uplink received signal-to-noise ratio (SNR) of the $u$th user on subcarrier $b$ in slot $t$ is given by (5), where $\sigma^2$ is the power spectral density of the additive noise.
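A minimal sketch of the energy reservation mechanism follows. It assumes that the reserved share of the previous battery level carries over, that newly harvested energy is added and clipped at the battery capacity, and that the unreserved share is spent evenly over the WIT phase. These update rules, the function names, and the numbers are assumptions made for illustration, not the letter's expressions.

```python
def battery_after_wet(q_prev, rho_prev, harvested, q_max):
    """Battery level at the end of WET in the current slot.

    Assumed update: the share rho_prev of the previous level was reserved,
    newly harvested energy is added, and the result is clipped at q_max.
    """
    return min(rho_prev * q_prev + harvested, q_max)


def uplink_power(q_now, rho, tau, slot_s):
    """Uplink power when the unreserved energy (1 - rho) * q_now is spent
    uniformly over the WIT phase of duration (1 - tau) * slot_s (assumed)."""
    return (1.0 - rho) * q_now / ((1.0 - tau) * slot_s)


# Hypothetical values: 25 uJ capacity, 1 ms slot.
q = battery_after_wet(q_prev=10e-6, rho_prev=0.5, harvested=2e-6, q_max=25e-6)
print(q, uplink_power(q, rho=0.25, tau=0.5, slot_s=1e-3))
```

Under these assumptions, a larger reservation proportion lowers the current uplink power but raises the starting battery level of the next slot, which is the tradeoff the energy reservation variable controls.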
Based on the Shannon capacity [9], the eMBB transmission rate $R^{\text{mbb}}_{u,t,m}$ of the $u$th user in minislot $m$ of slot $t$ is given by (6), where $y_{u,b,t,m} \in \{0, 1\}$ is the binary indicator of subcarrier preemption by URLLC packets. Specifically, $y_{u,b,t,m} = 1$ indicates that subcarrier $b$ is preempted by the $u$th user in minislot $m$ of slot $t$ for URLLC transmission, and $y_{u,b,t,m} = 0$ otherwise. To ensure that the URLLC packets of the $u$th user can only preempt the subcarriers that have been allocated to the $u$th user for eMBB transmissions, it is necessary that $y_{u,b,t,m} \le x_{u,b,t,m}$. Since the packet length of URLLC is typically much shorter than that of eMBB, using the Shannon capacity may significantly overestimate the achievable rate, and hence underestimate the delay, of URLLC transmissions [7]. Instead, the URLLC transmission rate of the $u$th user in minislot $m$ of slot $t$ can be calculated based on finite blocklength theory [10], as given by (7), where $n_u$ is the length (in symbols) of the codeword block for the $u$th user, $\varepsilon$ is the decoding error probability, and $Q^{-1}(\varepsilon)$ is the inverse of the Gaussian Q-function.
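The gap between the Shannon rate and the finite blocklength rate can be illustrated with the widely used normal approximation for the AWGN channel, in bits per channel use. The sketch below is a single-link illustration under that assumption; it omits the bandwidth scaling and the summation over preempted subcarriers that a per-user rate would need, and the function names and parameter values are hypothetical.

```python
import math
from statistics import NormalDist


def shannon_rate(snr):
    """Shannon capacity in bits per channel use."""
    return math.log2(1.0 + snr)


def finite_blocklength_rate(snr, n, eps):
    """Normal approximation of the achievable rate at blocklength n and
    decoding error probability eps (bits per channel use)."""
    dispersion = 1.0 - 1.0 / (1.0 + snr) ** 2      # channel dispersion V
    q_inv = NormalDist().inv_cdf(1.0 - eps)        # Q^{-1}(eps)
    penalty = math.sqrt(dispersion / n) * q_inv * math.log2(math.e)
    return max(shannon_rate(snr) - penalty, 0.0)


# Hypothetical link: SNR of 10 (linear), error probability 1e-5.
for n in (100, 1000):
    print(n, shannon_rate(10.0), finite_blocklength_rate(10.0, n, 1e-5))
```

The rate penalty shrinks as the blocklength grows, which is why the Shannon formula remains a reasonable model for long eMBB codewords but is optimistic for short URLLC packets.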
To maximize the eMBB sum rate of all the users, we formulate the optimization problem (P) in (8), where $X = \{x_{u,b,t,m}\}_{u \in \mathcal{U}, b \in \mathcal{B}, t \in \mathcal{T}, m \in \mathcal{M}}$, $Y = \{y_{u,b,t,m}\}_{u \in \mathcal{U}, b \in \mathcal{B}, t \in \mathcal{T}, m \in \mathcal{M}}$, $\tau = \{\tau_{u,t}\}_{u \in \mathcal{U}, t \in \mathcal{T}}$, $\rho = \{\rho_{u,t}\}_{u \in \mathcal{U}, t \in \mathcal{T}}$, $\omega_u$ is the minimum data rate requirement for eMBB transmission of the $u$th user, $F_u$ is the URLLC packet length (in bits) of the $u$th user, and $\psi$ is the maximum tolerable delay of URLLC packets. Constraints (8a) and (8b) ensure that each subcarrier is allocated to at most one user at any time, while (8c) is the subcarrier preemption availability constraint. Constraint (8d) is introduced to avoid energy overflow due to the battery capacity [11]. Constraint (8e) imposes the eMBB minimum data rate requirement. Constraint (8f) is the latency requirement for the URLLC transmission.
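The subcarrier-related constraints described above can be checked for a single minislot with a short sketch. It assumes that the allocation and preemption indicators are stored as nested lists indexed by user and subcarrier, and that the latency requirement takes the form packet length divided by URLLC rate not exceeding the maximum delay. The data layout, the function names, and that latency form are assumptions for illustration.

```python
def allocation_is_feasible(x, y):
    """Check one minislot: x[u][b] and y[u][b] are 0/1 indicators.

    - each subcarrier is allocated to at most one user;
    - a user may only preempt subcarriers allocated to itself (y <= x).
    """
    users = range(len(x))
    subcarriers = range(len(x[0]))
    for b in subcarriers:
        if sum(x[u][b] for u in users) > 1:
            return False
    return all(y[u][b] <= x[u][b] for u in users for b in subcarriers)


def urllc_latency_ok(packet_bits, rate_bps, max_delay_s):
    """Assumed latency check: transmission time F / R must not exceed psi."""
    return rate_bps > 0 and packet_bits / rate_bps <= max_delay_s


# Two users, two subcarriers: user 0 preempts its own subcarrier 0.
x = [[1, 0], [0, 1]]
y = [[1, 0], [0, 0]]
print(allocation_is_feasible(x, y), urllc_latency_ok(256, 1e6, 1e-3))
```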

III. MIXED-DDPG BASED RESOURCE ALLOCATION
To tackle the non-convex optimization problem (P) with mixed allocation of discrete subcarriers and continuous time and energy, we propose a novel alternating approach called Mixed-DDPG in this section. Specifically, we decompose problem (P) into a discrete subproblem and a continuous subproblem, where the discrete subproblem optimizes the allocation of subcarriers while the continuous subproblem optimizes the time allocation for WET and the energy reservation for WIT. The two subproblems are then solved alternately until convergence.

A. Discrete Subproblem
By fixing the continuous time allocation $\{\tau_{u,t}\}_{u \in \mathcal{U}}$ for WET and energy reservation $\{\rho_{u,t}\}_{u \in \mathcal{U}}$ for WIT of slot $t$ in problem (P), we obtain a discrete subproblem that optimizes the binary indicators of subcarrier allocation to eMBB transmission at the beginning of slot $t + 1$ and subcarrier preemption by URLLC packets in each minislot $m \in \mathcal{M}$ of slot $t + 1$, $\forall t \in \mathcal{T}$. Moreover, based on (1)-(6), for fixed $\{\tau_{u,t}\}_{u \in \mathcal{U}}$ and $\{\rho_{u,t}\}_{u \in \mathcal{U}}$, $R^{\text{mbb}}_{u,t,m}$ becomes independent across slots, and the eMBB sum rate of all the users can be maximized separately in each slot. Hence, for slot $t$, under the given time allocation $\{\tau_{u,t}\}_{u \in \mathcal{U}}$ for WET and energy reservation $\{\rho_{u,t}\}_{u \in \mathcal{U}}$ for WIT, problem (P) reduces to the discrete subproblem (P1) in (9), where $X_t = \{x_{u,b,t,m}\}_{u \in \mathcal{U}, b \in \mathcal{B}, m \in \mathcal{M}}$ and $Y_t = \{y_{u,b,t,m}\}_{u \in \mathcal{U}, b \in \mathcal{B}, m \in \mathcal{M}}$. We can show that subproblem (P1) is convex and can be solved by using existing convex optimization tools.

B. Continuous Subproblem
For given binary indicators $X$, $Y$ of subcarrier allocation to eMBB transmission and subcarrier preemption by URLLC packets, problem (P) reduces to the continuous subproblem (P2) in (10), which optimizes $\tau$ and $\rho$ subject to constraints including (8e), (8f), (8g), and (8h). We note that (P2) has a large state space, including $X$, $Y$, the CSI, and the battery status of all users in different slots, and hence is difficult to solve using conventional optimization methods, but it is amenable to deep reinforcement learning (DRL) [6]. Since the variables $\tau$ and $\rho$ are continuous, we adopt a model-free DRL method, i.e., DDPG, which has a continuous action space [10], to solve subproblem (P2). The DDPG state, action, and reward are defined as follows.
• State: $s_t = \{X_t, Y_t, H_t, Q_t\}$, where $H_t = \{h_{u,b,t}\}_{u \in \mathcal{U}, b \in \mathcal{B}}$ contains the CSI and $Q_t = \{Q_{u,t}\}_{u \in \mathcal{U}}$ denotes the battery status.
• Action: $a_t = \{\tau_t, \rho_t\}$, where $\tau_t = \{\tau_{u,t}\}_{u \in \mathcal{U}}$ denotes the time allocation for WET and $\rho_t = \{\rho_{u,t}\}_{u \in \mathcal{U}}$ denotes the energy reservation proportion.
• Reward: if action $a_t$ is chosen, the reward $r_t$ is given by (11), where $\delta > 0$ is the penalty factor. The penalty $\delta \sum_{m \in \mathcal{M}} \bar{R}^{\text{mbb}}_{u,t,m}$ is imposed by the system on the agent whenever any constraint of (P2) is violated, thereby avoiding overfitting.
DDPG consists of an actor network and a critic network for generating and evaluating policies, respectively [12]. Based on the input state $s_t$, the actor network $\mu(s_t|\theta^{\mu})$ selects the deterministic action as follows [10]:
$$a_t = \mu(s_t|\theta^{\mu}) + N_t,$$
where $\theta^{\mu}$ is the actor network parameter and $N_t$ is an additional noise term for action exploration that follows a normal distribution with mean $\mu_1$ and variance $\sigma_1^2$. For given $s_t$, $a_t$, and reward $r_t$, after randomly selecting a minibatch of $N$ tuples $\{(s_j, a_j, r_j, s_{j+1})\}_{j=1,\ldots,N}$ from the replay buffer $D$, which is introduced to reduce the correlation among training samples, the critic network generates a Q value $Q(s_j, a_j|\theta^{Q})$ [6] to assess the selected action and updates its parameter $\theta^{Q}$ by minimizing the loss
$$L(\theta^{Q}) = \frac{1}{N} \sum_{j=1}^{N} \left( y_j - Q(s_j, a_j|\theta^{Q}) \right)^2,$$
where $y_j = r_j + \gamma Q'(s_{j+1}, \mu'(s_{j+1}|\theta^{\mu'})|\theta^{Q'})$, $\gamma$ is the discount rate, and $\theta^{\mu'}$ and $\theta^{Q'}$ are the parameters of the target actor network $\mu'(s|\theta^{\mu'})$ and the target critic network $Q'(s, a|\theta^{Q'})$, respectively.
The actor policy is updated using the sampled deterministic policy gradient as follows:
$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{j=1}^{N} \nabla_{a} Q(s, a|\theta^{Q})\big|_{s=s_j,\, a=\mu(s_j|\theta^{\mu})} \, \nabla_{\theta^{\mu}} \mu(s|\theta^{\mu})\big|_{s=s_j}.$$
The parameters $\theta^{\mu'}$ of the target actor network and $\theta^{Q'}$ of the target critic network are soft updated as follows:
$$\theta^{\mu'} \leftarrow \zeta \theta^{\mu} + (1 - \zeta) \theta^{\mu'}, \qquad \theta^{Q'} \leftarrow \zeta \theta^{Q} + (1 - \zeta) \theta^{Q'},$$
where $0 < \zeta \ll 1$ is the soft updating rate [12].

Algorithm 1 Mixed-DDPG
2: Set $t = 0$. Randomly initialize $Q_t$, $X_t$, $Y_t$ and $D$.
3: for all slots of an episode do
4: ...
   Get the reward $r_t$ based on (11).
   Obtain new battery energy level $Q_{t+1}$ based on (4).
8: Obtain $X_{t+1}$ and $Y_{t+1}$ by solving subproblem (P1).
   ...
10: Randomly sample $N$ tuples from $D$ as training data.
11: ...
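The critic target, the critic loss, and the soft target update used in DDPG can be sketched in plain Python on scalar values and flat parameter lists. This is an illustrative sketch of the standard DDPG update rules rather than the authors' implementation; the function names and the toy numbers are hypothetical, and a real agent would apply the same rules to neural network parameters.

```python
def td_target(reward, next_q, gamma):
    """Target y_j = r_j + gamma * Q'(s_{j+1}, mu'(s_{j+1}))."""
    return reward + gamma * next_q


def critic_loss(targets, q_values):
    """Mean squared error between targets y_j and critic outputs Q(s_j, a_j)."""
    n = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / n


def soft_update(target_params, online_params, zeta):
    """theta' <- zeta * theta + (1 - zeta) * theta', element by element."""
    return [zeta * w + (1.0 - zeta) * wt
            for w, wt in zip(online_params, target_params)]


# Toy minibatch of two samples with hypothetical values.
rewards, next_qs, q_values = [1.0, 0.5], [2.0, 1.0], [2.5, 1.0]
targets = [td_target(r, q, gamma=0.9) for r, q in zip(rewards, next_qs)]
print(critic_loss(targets, q_values))
print(soft_update([0.0, 1.0], [1.0, 1.0], zeta=0.1))
```

A small soft updating rate keeps the target networks changing slowly, which stabilizes the targets that the critic regresses toward.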

C. The Mixed-DDPG Algorithm
Based on the aforementioned solutions to subproblems (P1) and (P2), we propose a Mixed-DDPG algorithm to solve problem (P), as shown in Algorithm 1. Specifically, we first initialize $Q_t$, $D$, $X_t$ and $Y_t$ in line 2, then obtain the optimal $\tau_t$ and $\rho_t$ by solving subproblem (P2) using the DDPG-based approach in lines 4-7 and 9-11. Next, based on the obtained $\tau_t$ and $\rho_t$, we get $X_{t+1}$ and $Y_{t+1}$ for the next slot by solving subproblem (P1) in line 8. The above steps repeat alternately and iteratively until the maximum number of slots per episode and the maximum number of episodes are reached.
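The alternating structure described above can be sketched as a loop over the slots of one episode. The two solver arguments are hypothetical placeholders: `solve_continuous` stands in for the DDPG step on (P2) and `solve_discrete` stands in for the solver of (P1). The stub solvers in the usage lines only show the data flow and carry no meaning for the actual resource allocation.

```python
def run_episode(num_slots, init_alloc, solve_continuous, solve_discrete):
    """Alternate between the continuous and discrete subproblems per slot.

    init_alloc       : initial subcarrier allocation (X, Y) for the first slot
    solve_continuous : callable (t, alloc) -> (tau, rho), placeholder for (P2)
    solve_discrete   : callable (t, tau_rho) -> alloc for slot t + 1,
                       placeholder for (P1)
    """
    alloc = init_alloc
    history = []
    for t in range(num_slots):
        tau_rho = solve_continuous(t, alloc)   # time and energy for slot t
        alloc = solve_discrete(t, tau_rho)     # allocation for slot t + 1
        history.append((tau_rho, alloc))
    return history


# Stub solvers (hypothetical): constant action, allocation label by slot.
trace = run_episode(
    num_slots=3,
    init_alloc="alloc_0",
    solve_continuous=lambda t, alloc: (0.5, 0.2),
    solve_discrete=lambda t, tau_rho: f"alloc_{t + 1}",
)
print(trace)
```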
We analyze the computational complexity of the proposed Mixed-DDPG as follows. Since the convex subproblem (P1) has $2UBTM$ variables and can be solved by existing convex optimization tools, the computational complexity of solving (P1) is $O((2UBTM)^3)$. For subproblem (P2), let $j_i$ and $k_i$ denote the sizes of the input and the output of layer $i$ of the DDPG, respectively, where $i \in \mathcal{I}$; then the complexity of solving (P2) is $O(\sum_{i \in \mathcal{I}} j_i k_i)$. Therefore, the total computational complexity of the proposed Mixed-DDPG is $O(F_e F_s((2UBTM)^3 + \sum_{i \in \mathcal{I}} j_i k_i))$, where $F_e$ and $F_s$ are the maximum number of episodes and the maximum number of slots per episode for Algorithm 1, respectively.
For performance comparison with the proposed Mixed-DDPG algorithm, we include in the simulations the following three benchmark algorithms: zero energy reservation DDPG (ZER-DDPG), which differs from the Mixed-DDPG only in $\rho = 0$; fixed energy reservation proportional DDPG (FERP-DDPG), which differs from the Mixed-DDPG only in $\rho = 0.5$; and fixed transmission time proportional DDPG (FTTP-DDPG), which differs from the Mixed-DDPG only in $\tau = 0.5$.
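Since each benchmark differs from Mixed-DDPG only by pinning one action component, the four schemes can be described as overrides applied to the agent's action. The sketch below assumes that an action is stored as a dictionary with keys `tau` and `rho`; this representation and the function name `apply_benchmark` are illustrative assumptions.

```python
# Fixed action components for each scheme, taken from the benchmark definitions.
BENCHMARK_OVERRIDES = {
    "Mixed-DDPG": {},             # both tau and rho are learned
    "ZER-DDPG": {"rho": 0.0},     # zero energy reservation
    "FERP-DDPG": {"rho": 0.5},    # fixed energy reservation proportion
    "FTTP-DDPG": {"tau": 0.5},    # fixed WET time proportion
}


def apply_benchmark(name, action):
    """Return a copy of the action with the scheme's fixed values applied."""
    fixed = BENCHMARK_OVERRIDES[name]
    return {key: fixed.get(key, value) for key, value in action.items()}


learned = {"tau": 0.3, "rho": 0.7}   # hypothetical actor output
for scheme in BENCHMARK_OVERRIDES:
    print(scheme, apply_benchmark(scheme, learned))
```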
Fig. 1 shows the rewards of the four algorithms versus the number of episodes for two maximum tolerable URLLC delays, $\psi = 1$ ms and $\psi = 0.1$ ms, where $\varphi = 20$ μW and $Q_{\max} = 25$ μJ. The proposed algorithm significantly outperforms the other three algorithms for both $\psi = 1$ ms and $\psi = 0.1$ ms because it can dynamically adjust the WET time allocation $\tau$ and the energy reservation proportion $\rho$ according to the users' battery levels and CSI. In addition, each considered algorithm achieves a smaller reward for a smaller $\psi$, because the delay constraint is violated more often during the algorithm execution.
In Fig. 2, we plot the eMBB sum rate of the four algorithms versus the battery capacity $Q_{\max}$ within an episode after convergence, where $\varphi = 20$ μW and $\psi = 1$ ms. We can see that the eMBB sum rate of all four algorithms increases as $Q_{\max}$ grows and eventually stabilizes. This is because a larger battery capacity can store more energy for higher data rates, but once the battery capacity exceeds the energy received, it no longer affects the data rate.
Fig. 3 depicts the eMBB sum rate of the four algorithms versus the RF/DC sensitivity $\varphi$ within an episode after convergence, where $Q_{\max} = 25$ μJ and $\psi = 1$ ms. It shows that the proposed algorithm maintains the highest eMBB sum rate and that the eMBB sum rate of all four algorithms decreases as $\varphi$ increases, because a higher $\varphi$ leads to less harvested energy and increases the chance of an energy shortage. We can also find that ZER-DDPG outperforms FERP-DDPG because flexibly changing $\tau$ affects both the amount of energy received in WET and the uplink transmission power in WIT, thereby affecting the system rate.
Fig. 4 depicts the eMBB sum rate versus the maximum tolerable delay $\psi$ of URLLC packets, where $\varphi = 20$ μW and $Q_{\max} = 25$ μJ. The figure shows that the proposed Mixed-DDPG algorithm outperforms the benchmark algorithms in terms of the eMBB sum rate. Moreover, for each considered algorithm apart from FTTP-DDPG, the eMBB sum rate increases as $\psi$ increases because a larger $\psi$ leads to fewer violations of the constraint in the DDPG-based algorithm, which results in a larger reward and therefore a higher eMBB sum rate. The eMBB sum rate of FTTP-DDPG is limited by its fixed WET time allocation of $\tau = 0.5$, which leaves insufficient time for uplink WIT.

V. CONCLUSION
This letter studies the multiplexing of eMBB and URLLC in the uplink WIT powered by downlink WET via preemption-based resource allocation in a WPCN, where the finite battery capacity and the RF/DC sensitivity of the energy-harvesting circuit are considered. The optimization of resource allocation is formulated as a problem that maximizes the eMBB sum rate of all users under all necessary constraints. To tackle this problem, we decompose it into two subproblems and propose a Mixed-DDPG algorithm to solve them alternately. The numerical results reveal that the proposed Mixed-DDPG algorithm can quickly converge to a stable state and achieve a higher eMBB sum rate than the existing schemes, but the performance is sensitive to the transmission time. In our future work, we will extend the proposed model and algorithm to more complex scenarios, such as multi-cell and reconfigurable intelligent surface-aided (RIS-aided) networks.

Manuscript received 18 May 2023; accepted 18 June 2023. Date of publication 27 June 2023; date of current version 9 October 2023. This work was supported in part by the National Key R&D Program of China under Grant 2021YFE0205200; in part by the Fundamental Research Funds for the Central Universities under Grant QTZX23031; in part by the National Natural Science Foundation of China under Grant 61901317; and in part by the European Commission's Horizon 2020 Research and Innovation Program under Grant 778305. The associate editor coordinating the review of this article and approving it for publication was K. Wang. (Corresponding author: Kai Liang.)
