[arxiv daily] New submissions for Wed, 19 May 21 ,today papers 6
# 学科: Distributed, Parallel, and Cluster Computing(cs.DC)
### A Scalable Concurrent Algorithm for Dynamic Connectivity
- **Authors:** Alexander Fedorov, Nikita Koval, Dan Alistarh
- **Subjects:** Data Structures and Algorithms (cs.DS); Distributed, Parallel, and Cluster Computing (cs.DC)
- **Arxiv link:** https://arxiv.org/abs/2105.08098
- **Pdf link:** https://arxiv.org/pdf/2105.08098
Dynamic Connectivity is a fundamental algorithmic graph problem, motivated by a wide range of applications to social and communication networks and used as a building block in various other algorithms, such as the bi-connectivity and the dynamic minimal spanning tree problems. In brief, we wish to maintain the connected components of the graph under dynamic edge insertions and deletions. In the sequential case, the problem has been well-studied from both theoretical and practical perspectives. However, much less is known about efficient concurrent solutions to this problem. This is the gap we address in this paper. We start from one of the classic data structures used to solve this problem, the Euler Tour Tree. Our first contribution is a non-blocking single-writer implementation of it. We leverage this data structure to obtain the first truly concurrent generalization of dynamic connectivity, which preserves the time complexity of its sequential counterpart, but is also scalable in practice. To achieve this, we rely on three main techniques. The first is to ensure that connectivity queries, which usually dominate real-world workloads, are non-blocking. The second non-trivial technique expands the above idea by making all queries that do not change the connectivity structure non-blocking. The third ingredient is applying fine-grained locking for updating the connected components, which allows operations on disjoint components to occur in parallel. We evaluate the resulting algorithm on various workloads, executing on both real and synthetic graphs. The results show the efficiency of each of the proposed optimizations; the most efficient variant improves the performance of a coarse-grained based implementation on realistic scenarios up to 6x on average and up to 30x when connectivity queries dominate.
### TRIM: A Design Space Exploration Model for Deep Neural Networks Inference and Training Accelerators
- **Authors:** Yangjie Qi, Shuo Zhang, Tarek M. Taha
- **Subjects:** Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR)
- **Arxiv link:** https://arxiv.org/abs/2105.08239
- **Pdf link:** https://arxiv.org/pdf/2105.08239
There is increasing demand for specialized hardware for training deep neural networks, both in edge/IoT environments and in high-performance computing systems. The design space of such hardware is very large due to the wide range of processing architectures, deep neural network configurations, and dataflow options. This makes developing deep neural network processors quite complex, especially for training. We present TRIM, an infrastructure to help hardware architects explore the design space of deep neural network accelerators for both inference and training in the early design stages. The model evaluates at the whole network level, considering both inter-layer and intra-layer activities. Given applications, essential hardware specifications, and a design goal, TRIM can quickly explore different hardware design options, select the optimal dataflow and guide new hardware architecture design. We validated TRIM with FPGA-based implementation of deep neural network accelerators and ASIC-based architectures. We also show how to use TRIM to explore the design space through several case studies. TRIM is a powerful tool to help architects evaluate different hardware choices to develop efficient inference and training architecture design.
### ModelPS: An Interactive and Collaborative Platform for Editing Pre-trained Models at Scale
- **Authors:** Yuanming Li, Huaizheng Zhang, Shanshan Jiang, Fan Yang, Yonggang Wen, Yong Luo
- **Subjects:** Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
- **Arxiv link:** https://arxiv.org/abs/2105.08275
- **Pdf link:** https://arxiv.org/pdf/2105.08275
AI engineering has emerged as a crucial discipline to democratize deep neural network (DNN) models among software developers with a diverse background. In particular, altering these DNN models in the deployment stage posits a tremendous challenge. In this research, we propose and develop a low-code solution, ModelPS (an acronym for "Model Photoshop"), to enable and empower collaborative DNN model editing and intelligent model serving. The ModelPS solution embodies two transformative features: 1) a user-friendly web interface for a developer team to share and edit DNN models pictorially, in a low-code fashion, and 2) a model genie engine in the backend to aid developers in customizing model editing configurations for given deployment requirements or constraints. Our case studies with a wide range of deep learning (DL) models show that the system can tremendously reduce both development and communication overheads with improved productivity. The code has been released as an open-source package at GitHub.
### TOD: Transprecise Object Detection to Maximise Real-Time Accuracy on the Edge
- **Authors:** JunKyu Lee, Blesson Varghese, Roger Woods, Hans Vandierendonck
- **Subjects:** Distributed, Parallel, and Cluster Computing (cs.DC)
- **Arxiv link:** https://arxiv.org/abs/2105.08668
- **Pdf link:** https://arxiv.org/pdf/2105.08668
Real-time video analytics on the edge is challenging as the computationally constrained resources typically cannot analyse video streams at full fidelity and frame rate, which results in loss of accuracy. This paper proposes a Transprecise Object Detector (TOD) which maximises the real-time object detection accuracy on an edge device by selecting an appropriate Deep Neural Network (DNN) on the fly with negligible computational overhead. TOD makes two key contributions over the state of the art: (1) TOD leverages characteristics of the video stream such as object size and speed of movement to identify networks with high prediction accuracy for the current frames; (2) it selects the best-performing network based on projected accuracy and computational demand using an effective and low-overhead decision mechanism. Experimental evaluation on a Jetson Nano demonstrates that TOD improves the average object detection precision by 34.7 % over the YOLOv4-tiny-288 model on average over the MOT17Det dataset. In the MOT17-05 test dataset, TOD utilises only 45.1 % of GPU resource and 62.7 % of the GPU board power without losing accuracy, compared to YOLOv4-416 model. We expect that TOD will maximise the application of edge devices to real-time object detection, since TOD maximises real-time object detection accuracy given edge devices according to dynamic input features without increasing inference latency in practice.
### DID-eFed: Facilitating Federated Learning as a Service withDecentralized Identities
- **Authors:** Jiahui Geng, Neel Kanwal, Martin Gilje Jaatun, Chunming Rong
- **Subjects:** Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
- **Arxiv link:** https://arxiv.org/abs/2105.08671
- **Pdf link:** https://arxiv.org/pdf/2105.08671
We have entered the era of big data, and it is considered to be the "fuel" for the flourishing of artificial intelligence applications. The enactment of the EU General Data Protection Regulation (GDPR) raises concerns about individuals' privacy in big data. Federated learning (FL) emerges as a functional solution that can help build high-performance models shared among multiple parties while still complying with user privacy and data confidentiality requirements. Although FL has been intensively studied and used in real applications, there is still limited research related to its prospects and applications as a FLaaS (Federated Learning as a Service) to interested 3rd parties. In this paper, we present a FLaaS system: DID-eFed, where FL is facilitated by decentralized identities (DID) and a smart contract. DID enables a more flexible and credible decentralized access management in our system, while the smart contract offers a frictionless and less error-prone process. We describe particularly the scenario where our DID-eFed enables the FLaaS among hospitals and research institutions.
### Durable Queues: The Second Amendment
- **Authors:** Gal Sela, Erez Petrank
- **Subjects:** Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS); Performance (cs.PF)
- **Arxiv link:** https://arxiv.org/abs/2105.08706
- **Pdf link:** https://arxiv.org/pdf/2105.08706
We consider durable data structures for non-volatile main memory, such as the new Intel Optane memory architecture. Substantial recent work has concentrated on making concurrent data structures durable with low overhead, by adding a minimal number of blocking persist operations (i.e., flushes and fences). In this work we show that focusing on minimizing the number of persist instructions is important, but not enough. We show that access to flushed content is of high cost due to cache invalidation in current architectures. Given this finding, we present a design of the queue data structure that properly takes care of minimizing blocking persist operations as well as minimizing access to flushed content. The proposed design outperforms state-of-the-art durable queues. We start by providing a durable version of the Michael Scott queue (MSQ). We amend MSQ by adding a minimal number of persist instructions, fewer than in available durable queues, and meeting the theoretical lower bound on the number of blocking persist operations. We then proceed with a second amendment to this design, that eliminates accesses to flushed data. Evaluation shows that the second amendment yields substantial performance improvement, outperforming the state of the art and demonstrating the importance of reduced accesses to flushed content. The presented queues are durably linearizable and lock-free. Finally, we discuss the theoretical optimal number of accesses to flushed content.