Publications
Publications by category in reverse chronological order. Generated by jekyll-scholar.
2026
- [arXiv] MAESTRO: Multi-Agent Evaluation Suite for Testing, Reliability, and Observability. Tie Ma, Yixi Chen, Vaastav Anand, and 8 more authors. 2026.
We present MAESTRO, an evaluation suite for the testing, reliability, and observability of LLM-based multi-agent systems (MAS). MAESTRO standardizes MAS configuration and execution through a unified interface, supports integrating both native and third-party MAS via a repository of examples and lightweight adapters, and exports framework-agnostic execution traces together with system-level signals (e.g., latency, cost, and failures). We instantiate MAESTRO with 12 representative MAS spanning popular agentic frameworks and interaction patterns, and conduct controlled experiments across repeated runs, backend models, and tool configurations. Our case studies show that MAS executions can be structurally stable yet temporally variable, leading to substantial run-to-run variance in performance and reliability. We further find that MAS architecture is the dominant driver of resource profiles, reproducibility, and cost-latency-accuracy trade-offs, often outweighing changes in backend models or tool settings. Overall, MAESTRO enables systematic evaluation and provides empirical guidance for designing and optimizing agentic systems.
- [INFOCOM] Following the Usage, Not the Request: Risk-Aware Task Scheduling with Overbooking in Edge Clouds. Tie Ma, Shan Zhang, Xiaoyu Zhang, and 3 more authors. In IEEE Conference on Computer Communications (INFOCOM), 2026.
Edge computing platforms are increasingly deployed to support delay-sensitive and resource-intensive applications. However, current task scheduling strategies, which rely on user-requested resources, often lead to low resource utilization and reduced platform profit because users tend to over-request resources. Overbooking, widely adopted in industries such as airlines and hotels, can improve utilization but introduces risk when task resource usage is uncertain. This paper applies resource overbooking to edge clouds, focusing on task scheduling optimization under uncertainty. We formulate the problem as a stochastic mixed integer program, which is proven to be NP-hard. To address this, we propose a risk evaluation scheme that quantifies the risk of overload with an additive error guarantee and without assuming specific resource usage distributions. Based on this scheme, we transform the problem into a more deterministic form and prove that the objective function is submodular. This enables us to design a greedy algorithm that achieves a (1 - 1/e)-approximation ratio with lower complexity. Extensive experiments on a real-world dataset demonstrate that the proposed algorithm improves profit by 0.23× to 3.35× and resource utilization by 0.16× to 0.75× while accurately controlling the risk associated with overbooking.
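As a rough illustration of why submodularity matters in the abstract above, the sketch below shows the textbook greedy rule for maximizing a monotone submodular function under a cardinality budget, which is the classic setting where a (1 - 1/e) guarantee holds. It is not the paper's scheduling algorithm: the function names `greedy_max` and `coverage`, the toy `covers` data, and the budget of 2 are all assumptions made for this example.

```python
from typing import Callable, Dict, Hashable, Iterable, Set


def greedy_max(
    candidates: Iterable[Hashable],
    value: Callable[[Set[Hashable]], float],
    budget: int,
) -> Set[Hashable]:
    """Pick up to `budget` items, each time adding the largest marginal gain.

    For a monotone submodular `value`, this classic rule is known to reach
    at least (1 - 1/e) of the optimal value under a cardinality constraint.
    """
    pool = list(candidates)
    chosen: Set[Hashable] = set()
    for _ in range(budget):
        base = value(chosen)
        best, best_gain = None, 0.0
        for item in pool:
            if item in chosen:
                continue
            gain = value(chosen | {item}) - base
            if gain > best_gain:
                best, best_gain = item, gain
        if best is None:  # no remaining item adds value
            break
        chosen.add(best)
    return chosen


# Toy objective (hypothetical data): set coverage, a standard submodular function.
covers: Dict[str, Set[int]] = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6, 7},
    "d": {1, 7},
}


def coverage(selection: Set[Hashable]) -> float:
    """Number of distinct elements covered by the selected items."""
    covered: Set[int] = set()
    for name in selection:
        covered |= covers[name]
    return float(len(covered))


picked = greedy_max(covers, coverage, budget=2)
print(sorted(picked), coverage(picked))
```

Here the first round selects "c" (gain 4) and the second selects "a" (gain 3), so two items cover all seven elements. A real scheduler would replace `coverage` with a profit objective and add resource and risk constraints, which this sketch omits.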
2025
- [TPDS] Doing More with Less: Balancing Probing Costs and Task Offloading Efficiency at the Network Edge. Xishuo Li, Shan Zhang, Tie Ma, and 2 more authors. IEEE Transactions on Parallel and Distributed Systems, 2025.
2024
- [NSDI] Klonet: an Easy-to-Use and Scalable Platform for Computer Networks Education. Tie Ma, Long Luo, Hongfang Yu, and 9 more authors. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2024.
Currently, one of the simplest and most effective ways for people to gain an in-depth understanding of computer networks is through hands-on practice and experimentation on software platforms. While education is important for the field of computer networks, existing platforms are inadequate in usability and scalability and fail to meet the teaching needs of computer networking education. This paper describes our experiences in designing and using Klonet, an emulation platform for computer networking education. Klonet is easy to use for both students and tutors: it has been carefully designed to lower the barrier to use, making hands-on practice more efficient. Klonet also demonstrates good scalability. It adopts a container-based distributed architecture and a virtual network embedding algorithm customized for this platform. Evaluation experiments show that Klonet exhibits better scalability, such as supporting more students with fewer hardware resources (i.e., servers) and deploying virtual network topologies more quickly. Furthermore, to ensure stability during teaching, Klonet enhances the robustness of its upper orchestrator and underlying virtual networks. So far, Klonet has been adopted in 3 universities and 4 courses, serving more than 800 students. We showcase Klonet’s usefulness in networking education with real use cases, including a scenario with 10,000 emulated routers. We also share our lessons learned from the 4 years of Klonet development and 2 years of operations.
2023
- [US Patent] Network emulation system supporting flexible and efficient dynamic experiment. Hongfang Yu, Gang Sun, Tie Ma, and 2 more authors. US Patent App. 17/988,777, 2023.
A network emulation system supporting a flexible and efficient dynamic experiment emulates a network based on container, veth-pair, traffic control (TC), and other technologies and sets a network state management model based on a key-value pair, thereby constructing a network emulation system supporting a flexible and efficient dynamic experiment. The system flexibly realizes the dynamic performance of a plurality of dimensions, namely, dynamic node start/stop, dynamic node attribute configuration, dynamic link start/stop, and dynamic link attribute configuration. Based on the network state management model, the network emulation system provides a concise and unified dynamic application programming interface (API) for an upper layer. Researchers can call the API in their network innovation programs at any time after an emulation network is deployed to achieve efficient, batch-processing, and programmable dynamic management. The network emulation system greatly facilitates the experimental work of the researchers in a dynamic scenario of the network.
2022
- [IWQoS] Flexible and Efficient Multicast Transfers in Inter-Datacenter Networks. Long Luo, Linjian Yu, Tie Ma, and 1 more author. In IEEE/ACM International Symposium on Quality of Service (IWQoS), 2022.
The explosive growth of global distributed services has led to a massive increase in bulk multicast data transfers over the inter-datacenter Wide-Area Network. While many solutions have been proposed to improve the performance of inter-DC bulk data transfers, they are insufficient to optimize multicast transfers because they fail to explore the characteristics of multicast transfers and network topology. This paper presents FlexCast, a flexible and efficient solution to optimize the completion times for multicast transfers. FlexCast takes advantage of topological characteristics to divide network sites into groups, partition receivers into subsets, and construct load-adaptive Steiner trees for receiver partitions to reduce completion time. It also employs a flexible multicast model for parallel transmission. For better performance FlexCast uses multiple scheduling policies to handle offline request submission, and for greater efficiency it adopts a combination of small-scale optimization and fast heuristic to address online request submission quickly. Simulations on real-world topologies show that FlexCast improves the completion time for multicast receivers by up to 80% compared to prior solutions.
- [GLOBECOM] vNetRadar: Lightweight and Network-Wide Traffic Measurement in Virtual Networks. Tie Ma, Jin Zhang, Long Luo, and 3 more authors. In IEEE Global Communications Conference (GLOBECOM), 2022.
Measuring traffic metrics is indispensable in virtual networks as it is the basis for a wide range of applications, such as network diagnostics and performance evaluation of network algorithms. However, existing measurement schemes fail to provide all of the following characteristics simultaneously: 1) fine-grained, i.e., obtaining per-packet information; 2) lightweight, i.e., low CPU and bandwidth overhead; 3) network-wide, i.e., obtaining metrics of the whole network, such as per-packet paths; and 4) easy-to-deploy, i.e., deployable without modifying Maximum Transmission Units (MTUs). We design vNetRadar, a virtual network measurement system that has all of these characteristics. Specifically, vNetRadar 1) identifies each packet without increasing its size, so that network-wide metrics are obtained without MTU modification, and 2) allocates each packet an area in memory, called a backpack, and carries metadata in it to largely reduce bandwidth overhead. vNetRadar is implemented with the extended Berkeley Packet Filter (eBPF) and runs mainly in kernel space, avoiding the CPU overhead of copying packets to user space when performing fine-grained measurement. Evaluation results show that the easy-to-deploy vNetRadar obtains fine-grained, network-wide metrics with low CPU and bandwidth overhead.
- [CN Patent] A network emulation system supporting flexible and efficient dynamic experiments. Hongfang Yu, Tie Ma, Jin Zhang, and 2 more authors. CN114844787B, 2022.
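The invention discloses a network emulation system supporting flexible and efficient dynamic experiments, belonging to the field of network communication technology. The invention emulates networks based on containers, veth-pair, tc, and other technologies, together with a network state management model based on key-value pairs, thereby constructing a network emulation system supporting flexible and efficient dynamic experiments. The system flexibly realizes dynamics in multiple dimensions, namely dynamic node start/stop, dynamic node attribute configuration, dynamic link start/stop, and dynamic link attribute configuration. Based on the network state management model, the invention provides a concise and unified dynamic API to the upper layer. Researchers can call the API from their own network innovation programs at any time after the emulated network is deployed, achieving efficient, batched, and programmable dynamic management. The invention greatly facilitates researchers' experimental work in dynamic network scenarios and injects new momentum into network innovation.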
2021
- [电信科学] Klonet: a network emulation experiment platform for technology innovation. Jingzhao Xie, Wei Shan, Chang Xiao, and 4 more authors. 电信科学 (Telecommunications Science), 2021.
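Innovation in key network technologies is an enabler of the network's future development and of progress across the whole industry. A network experiment platform that can verify innovative technologies conveniently and quickly is therefore urgently needed, so as to lower the barrier to innovation and encourage it. By analyzing the advantages of network emulation in network experiments and considering what network technology innovation requires of an experiment platform, this paper proposes Klonet, a network emulation experiment platform for technology innovation. The platform can scale flexibly, emulate experimental networks comprehensively, manage emulated networks at fine granularity, and streamline the experimental workflow. Using two concrete use cases, the paper analyzes how experiments are conducted on Klonet and verifies Klonet's ability to support network technology innovation.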
2020
- [IoTJ] Efficient Multisource Data Delivery in Edge Cloud With Rateless Parallel Push. Shouxi Luo, Tie Ma, Wei Shan, and 3 more authors. IEEE Internet of Things Journal (IoTJ), 2020.
As the key infrastructure for emerging 5G and Internet-of-Things (IoT) applications, micro data centers would be widely deployed at network edges to provide high-bandwidth low-latency cloud service. In these systems, applications would deliver large-size data objects among servers for various purposes like service deployment, application scale-up, and data duplication on demand. Accordingly, reducing delivery time is crucial for the optimization of service delay and system utilization. To accelerate the delivery, this article proposes a multisource-aware adaptive data transmission solution, Parallel Push (PPUSH), by leveraging the fact that data objects in the cloud are generally replicated among servers by design. At the high level, PPUSH achieves efficient delivery of multisource data by launching multiple push flows in parallel; at the low level, it decouples transfers from different sources by encoding data objects with the rateless RaptorQ code, and further employs novel congestion controls to prioritize the bandwidth allocation of concurrent tasks according to their remaining sizes. Fluid model analysis along with Mininet-based tests and packet-level simulation shows that, unlike DCTCP and other proposals, PPUSH is robust to packet loss and achieves provable prioritized bandwidth allocation. Extensive simulation results imply that, with these advantages, PPUSH could achieve very efficient data delivery by making use of all available data sources: for instance, compared with the straightforward design of equal-size task split and fair bandwidth allocation, its adaptive task assignment and prioritized traffic scheduling reduce the average task completion time in a tested scenario by 1.495× and 1.329×, respectively, demonstrating a total improvement of 1.586× when both are enabled at the same time.