Markov decision processes (MDP) and their model-free counterpart in reinforcement learning have known a large success in the last two decades. However, these successes often rely on quite exceptional hardware possibilities and cannot be applied in many "usual" context, where, for instance, the volume of data available or the amount of computing power is more restricted.To define the next generation of more "democratic" and widely applicable algorithms, such methods still need to deal with very demanding exploration issues. One way around this is to use underlying knowledge and structure present in many MDPs. This is especially true for problems related to scheduling and resources which shall provide central use cases for this proposal.

We will leverage structures of optimal policies to significantly improve both the exploration and the exploitation in the model free setting, as well as defining exploration schemes based on particle systems (e.g., Fleming Viot systems) to tackle use cases with sparse rewards.

In many real-life situations, the process we want to control is influenced by external processes. For example, computer systems are very sensitive to their surroundings. For example, meteorological effects in wireless networks affect the transmission rates of the users and the power consumption. Another application is processes whose dynamics are time-dependent.

We will focus on MDP problems living in Markov-modulated environments. In the literature, results on performance evaluation of Markov modulated queueing systems exist, however results on optimal control are very scarce. In particular, very little is known when the state of the environment is unobservable. We therefore want to combine reinforcement learning with MDP techniques allowing us to learn efficient planning controls. In particular, we aim to obtain controls with provable performance guarantees for limiting settings, such as a very fast (slow) changing environment.

Load balancing and scheduling refer to the most fundamental resource sharing problems that arise in nowadays computing networks, see Figure 1. Due to the versatility, the class of MDPs known as multi-armed bandit problems (MABP) and restless multi-armed bandit problems (RMABP) have received a lot of attention in recent years since they provide a natural framework to study resource allocation problems. MABP and RMABP have in common that their solution is given in terms of index policies, i.e., one can define for each bandit an index - that only depends on its own state - and the index policy activates in each time step the bandit with the highest index. Index policies are optimal for MABPs, known in this case as Gittins' index policy, and asymptotically optimal for RMABPs, known as Whittle's index policy.

Classical results in resource allocation, like optimality of Join the Shortest Queue or cmu-rule, can be interpreted in terms of optimality of Gittins' index in an MABP, and asymptotic optimality of the c\mu/\theta-rule in terms of asymptotic optimality of Whittle's index in a RMABP.

Some of the main challenges we plan to address are (i) characterizing performance guarantees for index policies in resource sharing, (ii) deriving scalable sharing policies for large-scale data, (iii) developing data-driven approaches to estimate index policies, etc.

In most unsupervised learning tasks such as clustering, classification, recommendation, or dimension reduction, a notion of similarity between points is both crucial and usually not directly available as an input. Instead, a notion of distance has to be guessed or inferred from the data itself. This is especially the case when the data is in fact in an unknown lower-dimensional manifold, or when the density of points is nonuniform, which is a typical situation in most applications. In these situations the Euclidean distance is typically misleading; this effect, being more dramatic as the dimension increases.

On the other hand, finding a global distance representing the data for a given learning problem is often very challenging, while in many unsupervised tasks like clustering might depend much more on the distance or similarity used to compare the data points than on the specific clustering algorithm used. Learning based on similarity estimation has been considered in diverse applications: time series clustering of chemical structures, genetic data, documents or texts.

We recently worked on defining a density-based estimator for weighted geodesic distances suitable for data lying on a manifold of lower dimension than ambient space and sampled from a possibly non-uniform density distribution. We aim at using these notions for completing unsupervised learning tasks like clustering and combine this with robust estimations.

Game theory provides a framework for decision making in strategic scenarios where agents with different objective functions compete for limited resources. It can be seen as a multiobjective counterpart to classical optimization theory where there is a common objective for all the agents. Game theory has been applied in a wide-range of fields including economics, biology, social sciences), computer science, communication networks, road traffic networks, etc.

This theme will investigate applications of both noncooperative and cooperative game theory in the context of communication and road networks. The main focus will be on resource sharing problems discussed in the previous themes but in the strategic setting of agents with potentially opposing objectives. In particular, the questions of interest in the noncooperative setting are: what are the equilibrium strategies of agents and how can they learn the equilibrium using some natural dynamics. The strategic setting also gives rise to the question of the Price of Anarchy (how bad is decentralized decision making compared to centralized one) and mechanism design (e.g., tolls, taxes, etc.) to reduce the Price of Anarchy. In the cooperative setting, coalition formation and fair sharing of gains will be the central problems. These questions are particularly important in self-organized networks in order to understand the form and shape of the subnetworks that will emerge as well as to design incentives to avoid free-riding whereby agents profit from the resources of others without contributing themselves.

One regime of interest in stochastic networks is when the number of nodes or agents is large. However, the dimension of the associated stochastic process grows with the population size and the analysis becomes intractable if finite dimensional methods are applied as it is. In this case, it is more pertinent to apply asymptotic methods such as mean-field, fluid limits, large deviations, rare events, etc. that are able to capture the essence of the stochastic process in a much lower dimensional space.

This theme explores various asymptotic methods for the analysis as well as for control and games in large dimensional stochastic networks. The conclusions drawn from the limiting regime with the help of these asymptotic methods can then be applied in the finite population networks. Bounds on the suboptimality gaps induced by the transition from the infinite to the finite dimensional case will be investigated.