Research

AI system software 🤹🏻

In recent years, datacenters have evolved to accommodate a variety of heterogeneous workloads and devices. A prominent example is distributed deep learning training, which enables the development of large-scale AI models such as GPT, DALL-E, and LLaMA; such models can have over 530 billion parameters and use hundreds of GPU nodes. As a result, optimizing infrastructure utilization and efficiency has become crucial. However, recent data from major cloud providers such as Microsoft and Alibaba show average GPU utilization rates of only 52.4% and 25.4%, respectively. This underutilization is a considerable waste of datacenter resources and calls for more effective ways to improve efficiency.
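To make the scale of that waste concrete, the following minimal sketch turns the two reported utilization rates into idle GPU-hours. The cluster size (1,000 GPUs) and the one-day window are hypothetical values chosen only for illustration, and `idle_gpu_hours` is an illustrative helper, not part of any tool described here.

```python
def idle_gpu_hours(num_gpus: int, hours: float, utilization: float) -> float:
    """Return GPU-hours that were provisioned but left unused."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be in [0, 1]")
    return num_gpus * hours * (1.0 - utilization)


# Average GPU utilization rates quoted in the text above.
REPORTED_UTILIZATION = {"Microsoft": 0.524, "Alibaba": 0.254}

# Hypothetical cluster: 1,000 GPUs observed for 24 hours (assumption).
NUM_GPUS = 1000
HOURS = 24

for provider, utilization in REPORTED_UTILIZATION.items():
    idle = idle_gpu_hours(NUM_GPUS, HOURS, utilization)
    print(f"{provider}: about {idle:.0f} idle GPU-hours per day")
```

Under these assumptions, roughly half to three quarters of the provisioned GPU-hours go unused each day.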

One of the primary challenges in optimizing system infrastructure is that deep learning workloads have not yet been characterized by system software. Despite significant advances in the field, choosing the optimal training configuration, such as the GPU type and the number of GPUs, remains an open problem, which makes training times unpredictable. This in turn makes it difficult to optimize scheduling or other system techniques for AI tasks. In addition, severe overallocation of GPUs exists in datacenters. These inefficiencies call for better system management strategies.

The following are representative technologies from our research.

Workload characterization

In-depth profiling (introspection) of deep learning training and inference workloads from the OS perspective

Job scheduling

Efficient workload scheduling that meets service-level agreement (SLA) performance guarantees

Job placement and migration

Cluster- and planet-level workload placement and migration

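Datacenters have recently evolved to accommodate heterogeneous workloads and devices. A representative example is the training and inference of distributed deep learning models: training across hundreds of GPU nodes makes it possible to develop large-scale AI models such as GPT, DALL-E, and LLaMA, with over 530 billion parameters. In such environments, optimizing the efficiency of infrastructure such as GPUs is critical. However, recent data from major cloud providers such as Microsoft and Alibaba show average GPU utilization of only 52.4% and 25.4%, respectively, which means that very expensive equipment is largely wasted.

One of the main difficulties in optimizing system infrastructure is that new workloads (deep learning, for example) have not yet been clearly analyzed (profiled) by system software, nor has system software been optimized for them. Despite great progress in deep learning, choosing the best configuration for training or serving a given model, such as the type and number of GPUs, still depends entirely on developers' experience, and as a result training and inference times are uneven. Moreover, scheduling and system optimization based on workload characteristics are difficult, and jobs end up reserving more GPUs than training or inference actually needs, a form of overallocation.

To overcome this situation, our lab conducts a range of research. Our representative technologies are listed below.

Workload characterization

In-depth profiling (introspection) and prediction of deep learning training and inference workloads from the OS perspective

Workload scheduling

Research on efficient scheduling techniques that guarantee service performance (SLA)

Workload placement and migration

Cluster- and planet-level workload placement and migration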

Network system software 🖥️

Cloud computing is realized primarily through networking, which is used for either 1) device-to-device communication within a node or 2) node-to-node communication. The network infrastructure of a datacenter must therefore be virtualized to isolate performance and resource usage between tenants. However, current datacenters do not let tenants create or control their own virtual network infrastructure, such as virtual switches, virtual links, and virtual topologies. Instead, the virtual network infrastructure is configured and managed solely by datacenter administrators, in stark contrast to server virtualization. This limitation is critical because customized operation of network resources (e.g., in-network computing) is a major building block of upcoming networking systems such as those beyond 5G. System researchers have therefore sought to make the network infrastructure controllable by software (known as software-defined networking, or SDN) so that its users can freely access, virtualize, and customize it (programmability).

In this context, we are investigating network systems that are 1) more programmable, 2) higher-performance, and 3) more reliable, for connecting heterogeneous devices and enabling services. The following are examples of our technologies.

Network virtualization

Virtualizing physical infrastructure into isolated and independent virtual network resources through network hypervisor and programmable switches

High-performance networking stack

Optimized, high-performance kernel networking stack for ultra-low-latency systems and severely hardware-constrained IoT devices

AI-based optimized network systems

Intelligent network systems that automatically improve performance for users and optimize the utilization of network resources for end-to-end systems

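Cloud computing is realized mainly through "networking", which is used for 1) communication between devices within a node or 2) communication between nodes. In this context, the datacenter network infrastructure must be "virtualized" to isolate performance and resource usage between tenants. In today's datacenters, however, tenants cannot create or control virtual network infrastructure such as virtual switches, virtual links, and virtual topologies at all. This is very restrictive compared with server virtualization, where containers or VMs composed of arbitrary vCPU, vRAM, storage, and so on can be configured freely. In particular, because in-network computing is recognized as an important element of Beyond 5G systems, which have been actively studied since 5G, supporting full virtualization that extends to network resources is very important.

Motivated by this, system researchers are studying how to control the network infrastructure through "software" (software-defined networking, or SDN) so that its users can freely access, virtualize, and customize it. In collaboration with the Open Networking Foundation, our lab has secured core technologies in network virtualization and conducts research to support 1) more programmable, 2) higher-performance, and 3) more reliable services and communication between heterogeneous devices.

Network virtualization

Virtualizing physical infrastructure into isolated and independent virtual network resources through a network hypervisor and programmable switches

High-performance kernel networking stack

Optimized, high-performance kernel networking stack for ultra-low-latency systems and IoT devices with severely limited hardware resources

AI-based optimized network systems

Intelligent network systems that automatically improve performance for each user service and optimize end-to-end system resource utilization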
