Distributed Learning for Resilient Virtual Network Management at Scale

Project Coordinator (EU) :

Politecnico di Torino

Country of the EU Coordinator :

Italy

Organisation Type :

Academia

Project participants :

EU:

Guido Marchetto (PhD) is currently an Associate Professor with the Department of Control and Computer Engineering. His research topics cover distributed systems and formal verification of systems and protocols. His interests also include network protocols and network architectures. He is a Senior Member of the IEEE and serves as an Associate Editor of the IEEE Transactions on Vehicular Technology.

Fulvio Valenza (PhD) is currently an Assistant Professor. His research activity focuses on the orchestration and management of network functions in SDN/NFV-based networks, with particular reference to security policy enforcement.

Alessio Sacco is currently pursuing the Ph.D. degree in Computer Engineering. His research interests include architecture and protocols for network management; implementation and design of cloud computing applications; algorithms and protocols for service-based architecture, such as Software Defined Networks (SDN), used in conjunction with Machine Learning algorithms.

US:

Flavio Esposito is an Associate Professor with the Department of Computer Science at Saint Louis University (SLU). He also has an affiliation with the Parks College of Engineering at SLU. Flavio worked in industry for a few years, and his main research interests include network management, network virtualization, and distributed systems.

Princewill Okorie is an M.Sc. student at Saint Louis University (SLU). His research interests center on applying Machine Learning algorithms to modern-day problems, including congestion control techniques in computer networks, and balance prediction and transaction categorization in finance.

State of US partner :

Missouri

Starting date :

01 Mar 2022

NGI related Topic :

Open Internet Architecture and Renovation

Experiment description

Our project will contribute to enhancing the scalability and resiliency of future Internet infrastructures, to satisfy the needs of emerging, computationally intensive applications and services. Increasing network capabilities would unleash the power of network management towards autonomous decision-making. For example, recent studies have successfully applied Reinforcement Learning (RL) techniques to networking decisions, given the ability of RL to fit the network dynamics well without any prior knowledge.

Hence, while experimenting with the scalability of RL models, we plan to provide a novel auto-scaling network environment. Network auto-scaling techniques bring multiple advantages. For example, they may reduce the cost of resource management by deactivating resources that would otherwise incur unnecessary (energy) costs. At the same time, such solutions can provide redundant facilities to reroute traffic when the workload peaks at unexpected levels.

By continuously monitoring the state of the network, learning-driven network management agents can optimize network performance and users' experience, exploiting Software-Defined Network controllers or other tools to promptly change network states and configurations. The impact of our network auto-scaling solutions will be two-fold. On the one hand, network users could benefit from an improved Quality of Experience (QoE), which would open the path to innovative applications, also in strategic fields, such as healthcare or industry. On the other hand, infrastructure and service providers will benefit from reduced management costs, e.g., in terms of energy consumption for active links and nodes.

To test the autonomous run-time decision-making capabilities of future networks, we plan to use existing NSF-funded large-scale virtual network testbeds, such as Chameleon Cloud and GENI. Over these platforms, we plan to develop an open infrastructure that the networking community can access for further studies, analysis, and experiments. For example, split learning and federated learning have emerged as two of the most common methods for splitting the model and/or the input data. An open platform of this type would foster discussion about the efficacy of these methods as the network scenario changes.
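As a rough illustration of the federated learning side of this comparison, the sketch below shows a FedAvg-style aggregation step: each site trains locally and only parameter vectors are combined, weighted by the amount of local data. The project description does not state which aggregation rule is used, so the function name `federated_average` and the size-weighted rule are assumptions made purely for illustration.

```python
from typing import List, Sequence


def federated_average(client_weights: Sequence[Sequence[float]],
                      client_sizes: Sequence[int]) -> List[float]:
    """Size-weighted average of per-client parameter vectors.

    Hypothetical helper: the aggregation rule actually used in the
    project is not specified in the text.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    dim = len(client_weights[0])
    if any(len(w) != dim for w in client_weights):
        raise ValueError("all clients must share the same parameter shape")
    # Each coordinate is averaged across clients, weighted by local data size.
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


if __name__ == "__main__":
    # Two sites: the second one holds three times as much local data.
    print(federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

In split learning, by contrast, the model itself is partitioned across sites and intermediate activations rather than full parameter vectors are exchanged, so the communication pattern differs from the one sketched here.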

Exploring whether and to what extent these two approaches work in various network contexts is also one of the goals of this project. Once the distributed method is set, users will be able to specify the settings of the RL learning phase, such as the reward function of the model. At the same time, network operators can prototype with a variety of network and model settings, experimenting with new methods to share partial but sufficient information with other peers.
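To make the idea of a user-specified reward function more concrete, here is a minimal sketch of what such a setting might look like for an auto-scaling agent. It trades the fraction of demand served against the fraction of links kept active, mirroring the throughput and energy concerns mentioned above. The names `RewardConfig` and `autoscaling_reward`, the default weights, and the linear form are all hypothetical; the platform's real interface is not described in this text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    """User-tunable weights (hypothetical names and defaults)."""
    throughput_weight: float = 1.0
    power_weight: float = 0.5


def autoscaling_reward(throughput_mbps: float,
                       demand_mbps: float,
                       active_links: int,
                       total_links: int,
                       cfg: RewardConfig = RewardConfig()) -> float:
    """Reward = weighted served-demand fraction minus weighted active-link fraction."""
    if demand_mbps <= 0 or total_links <= 0:
        raise ValueError("demand and total_links must be positive")
    # Serving more than the demand earns no extra reward.
    served_fraction = min(throughput_mbps / demand_mbps, 1.0)
    # Proxy for energy cost: share of links that are switched on.
    active_fraction = active_links / total_links
    return (cfg.throughput_weight * served_fraction
            - cfg.power_weight * active_fraction)


if __name__ == "__main__":
    # Same traffic served with fewer active links yields a higher reward.
    print(autoscaling_reward(80.0, 100.0, 10, 10))
    print(autoscaling_reward(80.0, 100.0, 6, 10))
```

Raising `power_weight` would push an agent towards deactivating more links, while raising `throughput_weight` favors keeping capacity available.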

Impacts :

Impact 1: Enhanced EU – US cooperation in Next Generation Internet, including policy cooperation.
The project will facilitate the use of open networking testbeds by communities in the US and Europe. Our aim is to foster cooperation among EU and US partners working on distributed ML models for networks. Results from this project are going to impact the network management community directly, especially in industry. Some of the core ideas were sparked by discussions with US and EU industry members, such as Ericsson and Akamai. For example, performance engineers at Akamai have demonstrated interest in and support for the intersection between network virtualization and AI/ML.

Impact 2: Reinforced collaboration and increased synergies between the Next Generation Internet and the NSF-funded testbeds.
Recent advances in AI/ML tools have significant potential for achieving zero-touch “self-managing” networks that include a high degree of operational agility. An open platform would help the development of new learning-based mechanisms, enabling experimentation on disaggregated architectures in order to verify their validity on systems at scale. Starting from a general-purpose testbed like GENI, which provides researchers with virtual machines and virtual links to run software-defined networks, our solution will offer a web-based software platform to test autoscaling for network virtualization, impacting research and education in computer networks and AI. For example, the US partner (SLU) has been awarded grant #1908574, “Collaborative Research: HEECMA: A Hybrid Elastic Edge-Cloud Application Management Architecture”, to work in conjunction with this project.

Impact 3: Developing interoperable solutions and joint demonstrators, contributions to standards.
Our research plan concludes with the development of a testbed accessible from anywhere. The user can submit the desired model and topology, which will run over the offered servers. In this sense, the traditional testbeds will be extended by means of this joint effort between EU and US partners, who will cooperate on the development of the above-mentioned programming interface. The resulting solution will be an EU-US joint demonstrator that could facilitate further investigation by the network management research community.

Impact 4: An EU - US ecosystem of top researchers, hi-tech start-ups / SMEs and Internet-related communities collaborating on the evolution of the Internet.
This solution, accessible from anywhere, would make testbeds that are otherwise restricted to the US easier for the European community to use. Our platform will give the opportunity to study and experiment with the possible role of ML within the Next Generation Internet, not only for the design of the network but also for its operations. Although the Covid-19 pandemic will probably limit the exchange of researchers and students during the project, the joint activity will certainly promote future exchanges and further collaborations.

Results :

The main goal of the experiment is to assess the feasibility and validity of adopting a distributed approach for network management decisions such as auto-scaling. We will compare this approach against a centralized version and the state of the art, and we expect positive results on the following metrics:

Accuracy of the distributed RL model, in terms of utility values of reward functions. We expect negligible accuracy loss with respect to the centralized learning approach.
Learning speed of the model. We expect the distributed approach to speed up the training of the autoscaling model, measured as a speedup percentage.
Scalability of the solution with large-scale networks. We expect a smooth execution of the solution over topologies composed of dozens of network nodes.
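The first two metrics can be read as simple ratios against the centralized baseline. The sketch below shows one plausible way to compute them; the text does not define the formulas, so the function names and the exact definitions (relative reward loss, and training-time reduction as a percentage) are assumptions.

```python
def relative_reward_loss(centralized_reward: float,
                         distributed_reward: float) -> float:
    """Fraction of the centralized reward lost by the distributed model.

    Assumed definition: (centralized - distributed) / |centralized|.
    """
    if centralized_reward == 0:
        raise ValueError("centralized reward must be non-zero")
    return (centralized_reward - distributed_reward) / abs(centralized_reward)


def speedup_percent(centralized_time_s: float,
                    distributed_time_s: float) -> float:
    """Training-time reduction relative to the centralized baseline, in percent.

    Assumed definition: 100 * (centralized - distributed) / centralized.
    """
    if centralized_time_s <= 0:
        raise ValueError("centralized time must be positive")
    return 100.0 * (centralized_time_s - distributed_time_s) / centralized_time_s


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    print(relative_reward_loss(100.0, 98.0))
    print(speedup_percent(200.0, 150.0))
```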

These results will be analyzed at the end of the experiments, and they will provide guidelines for the implementation over even more complex systems.

Future Plan :

With this project coming to an end, we present a solution that can (and will in the future) serve as a platform to experiment with learning-based network management schemes. In particular, multi-agent RL approaches for managing next-generation networks can now be easily installed and run over GENI, one of the main NSF-funded testbeds available.
After investigating and developing our multi-agent solution, we ran an extensive set of experiments. The results confirmed that the use of AI/ML in auto-scaling multi-site SDN networks brings benefits in terms of attainable user throughput and power savings.
In the near future, we plan to experiment with the definition of an RL model that can be distributed by means of Federated Learning (FL). To this end, we plan to use a model based on a Graph Convolutional Network (GCN), which we argue can help learn per-topology decisions and handle node failures. The input features for this model are thus the nodes, each associated with its load over time.
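To give a feel for how a GCN would consume these features, the following is a minimal sketch of a single graph-convolution layer in plain Python. It uses the commonly used symmetric normalization with self-loops followed by a ReLU; the text does not specify the layer form, so this choice, together with the helper names and the toy topology, is an assumption. Each row of `features` holds one node's recent load samples.

```python
import math
from typing import List

Matrix = List[List[float]]


def normalized_adjacency(adj: Matrix) -> Matrix:
    """Return D^-1/2 (A + I) D^-1/2 for a non-negative adjacency matrix."""
    n = len(adj)
    # Self-loops let each node keep its own load in the aggregation.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj: Matrix, features: Matrix, weights: Matrix) -> Matrix:
    """One layer: ReLU(normalized_adjacency @ features @ weights)."""
    propagated = matmul(normalized_adjacency(adj), features)
    transformed = matmul(propagated, weights)
    return [[max(0.0, v) for v in row] for row in transformed]


if __name__ == "__main__":
    # Toy line topology 0 - 1 - 2; two load samples per node (made-up values).
    adj = [[0.0, 1.0, 0.0],
           [1.0, 0.0, 1.0],
           [0.0, 1.0, 0.0]]
    load = [[0.2, 0.3],
            [0.9, 0.8],
            [0.1, 0.1]]
    weights = [[1.0], [1.0]]  # untrained placeholder weights
    print(gcn_layer(adj, load, weights))
```

Because the layer depends on the topology only through the adjacency matrix, a failed node can be represented by removing its row and column, which is one reason such models are a natural fit for the node-failure case mentioned above.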