Research

AI system software

In recent years, datacenters have evolved to accommodate a variety of heterogeneous workloads and devices. A prominent example is distributed deep learning training, which enables the development of large-scale AI models such as GPT, DALL-E, and LLaMA. Such models can have more than 530 billion parameters, and training them uses hundreds of GPU nodes. As a result, optimizing infrastructure utilization and efficiency has become crucial. However, recent data from major cloud providers such as Microsoft and Alibaba show average GPU utilization rates of only 52.4% and 25.4%, respectively. This underutilization represents a considerable waste of datacenter resources and highlights the need for more effective strategies to improve efficiency and utilization.
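To make the scale of this waste concrete, the utilization figures above can be turned into idle GPU-hours. The following is a minimal sketch, not a measurement: the cluster size of 1,000 GPUs and the function name `idle_gpu_hours_per_day` are assumptions made for illustration, and average utilization is treated simply as the fraction of GPU time in use.

```python
# Utilization figures quoted in the text; everything else is hypothetical.
REPORTED_UTILIZATION = {"Microsoft": 0.524, "Alibaba": 0.254}
HOURS_PER_DAY = 24


def idle_gpu_hours_per_day(num_gpus: int, utilization: float) -> float:
    """Return the GPU-hours per day left unused at a given average utilization."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return num_gpus * HOURS_PER_DAY * (1.0 - utilization)


if __name__ == "__main__":
    cluster_size = 1000  # hypothetical cluster size, for illustration only
    for provider, utilization in REPORTED_UTILIZATION.items():
        idle = idle_gpu_hours_per_day(cluster_size, utilization)
        print(f"{provider}: {idle:.0f} idle GPU-hours per day")
```

Under these assumptions, a 1,000-GPU cluster running at 52.4% utilization leaves roughly 11,424 GPU-hours unused each day, and one running at 25.4% leaves roughly 17,904.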

One of the primary challenges in optimizing system infrastructure is that system software has not yet characterized deep learning workloads. Despite significant advances in the field, the optimal training configuration for a given model, such as the GPU type and the number of GPUs, is still not known in advance, which makes training times unpredictable. Without predictable training times, it is hard to optimize scheduling or other system techniques for AI tasks. In addition, GPUs are severely overallocated in datacenters. These inefficiencies call for more effective system management strategies.

The following are representative technologies from our research.

Workload characterization

In-depth profiling (introspection) of deep learning training and inference workloads from the OS perspective

Job scheduling

Efficient workload scheduling to guarantee service performance (SLA)

Job placement and migration

Cluster- and planet-level workload placement and migration

Datacenters have recently evolved to accommodate many kinds of (heterogeneous) workloads and devices. A representative example is the training and inference of distributed deep learning models. Training of this kind uses hundreds of GPU nodes and makes it possible to develop large-scale AI models such as GPT, DALL-E, and LLaMA, which can have more than 530 billion parameters. In such an environment, optimizing the efficiency of infrastructure such as GPUs is very important. However, recent data from major cloud providers such as Microsoft and Alibaba show that average GPU utilization is only 52.4% and 25.4%, respectively, which means that very expensive equipment is largely going to waste.

One of the main difficulties in optimizing system infrastructure is that new workloads (deep learning, for example) have not yet been clearly analyzed (profiled) by, or optimized in, system software. Despite major advances in deep learning, choosing the configuration best suited to training or serving a particular model, such as the type and number of GPUs, still depends entirely on the developer's experience, and as a result training and inference times are also uneven. Furthermore, scheduling and system optimization based on workload characteristics are difficult, and this leads to a form of overallocation in which jobs claim more GPUs up front than training or inference actually needs.

To overcome this situation, our lab is conducting a range of research. Our representative technologies are listed below.

Workload characterization

In-depth profiling (introspection) and prediction of deep learning training and inference workloads from the OS perspective

Workload scheduling

Research on efficient scheduling techniques to guarantee service performance (SLA)

Workload placement and migration

Cluster- and planet-level workload placement and migration

Network system software 🖥️

Cloud computing is primarily realized through "networking," which is utilized for either 1) device-to-device communication within a node or 2) node-to-node communication. In this context, the network infrastructure of datacenters must be virtualized to isolate performance and resource usage between tenants. However, current datacenters do not allow tenants to create or control their virtual network infrastructure, such as virtual switches, virtual links, and virtual topologies. Instead, the virtual network infrastructure is configured and managed solely by the datacenter administrators, which is in stark contrast to server virtualization. In particular, considering that customized operations of network resources (e.g., in-network computing) are one of the major building blocks of upcoming networking systems such as those beyond 5G, this issue is critical. As a result, system researchers have sought to make the network infrastructure controllable by "software" (known as software-defined networking, or SDN) to enable users of the network infrastructure to freely access, virtualize, and customize it (programmability).
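The idea of tenant-controlled virtual links with isolated resource usage can be illustrated with a toy model. This is only a sketch under stated assumptions, not our actual system: the `NetworkHypervisor` and `PhysicalLink` classes, their fields, and the bandwidth-reservation admission rule are all hypothetical names and choices made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PhysicalLink:
    """A physical link with a fixed capacity and a running reservation total."""
    capacity_mbps: int
    reserved_mbps: int = 0


@dataclass
class NetworkHypervisor:
    """Toy hypervisor that admits tenant-defined virtual links onto physical links."""
    links: dict[tuple[str, str], PhysicalLink]
    # (tenant, virtual link name) -> key of the physical link that carries it
    mapping: dict[tuple[str, str], tuple[str, str]] = field(default_factory=dict)

    def add_virtual_link(self, tenant: str, name: str,
                         phys: tuple[str, str], mbps: int) -> bool:
        """Reserve bandwidth for a tenant's virtual link; reject if it would oversubscribe."""
        link = self.links[phys]
        if link.reserved_mbps + mbps > link.capacity_mbps:
            return False  # admitting this link would break isolation for other tenants
        link.reserved_mbps += mbps
        self.mapping[(tenant, name)] = phys
        return True


if __name__ == "__main__":
    hv = NetworkHypervisor({("s1", "s2"): PhysicalLink(capacity_mbps=10_000)})
    print(hv.add_virtual_link("tenant-a", "vlink0", ("s1", "s2"), 6_000))
    print(hv.add_virtual_link("tenant-b", "vlink0", ("s1", "s2"), 6_000))
    print(hv.add_virtual_link("tenant-b", "vlink0", ("s1", "s2"), 4_000))
```

In this sketch each tenant names its own virtual links, and the hypervisor only decides whether the request fits on the shared physical link. A real network hypervisor would also have to handle virtual switches, topologies, and forwarding state, which are omitted here.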

In this context, we are investigating network systems that are 1) more programmable, 2) higher-performance, and 3) more reliable for connecting heterogeneous devices and enabling services. The following are examples of our technologies.

Network virtualization

Virtualizing physical infrastructure into isolated and independent virtual network resources through a network hypervisor and programmable switches

High-performance networking stack

Optimized, high-performance kernel networking stack for ultra-low latency and for IoT devices with extremely constrained hardware

AI-based optimized network systems

Intelligent network systems that automatically improve performance for users and optimize the utilization of network resources for end-to-end systems

Cloud computing is realized mainly through "networking," which is used for 1) communication between devices within a node or 2) communication between nodes. In this context, the network infrastructure of a datacenter must be "virtualized" to separate performance and resource usage between tenants. In current datacenters, however, tenants cannot create or control virtual network infrastructure such as virtual switches, virtual links, and virtual topologies at all. This is very restrictive compared with server virtualization, where containers or VMs made up of arbitrary vCPU, vRAM, and storage can be configured freely. In particular, in-network computing is recognized as an important element of systems such as Beyond 5G, which have been actively studied since 5G, so supporting full virtualization that extends to network resources is very important.

Motivated by this, systems researchers are studying how to control the network infrastructure through "software" (software-defined networking, or SDN) so that users of the network infrastructure can freely access, virtualize, and customize it. In collaboration with the Open Networking Foundation, our lab is developing core technologies for network virtualization and is conducting research to support 1) more programmable, 2) higher-performance, and 3) more reliable services and communication among heterogeneous devices.

Network virtualization

Virtualizing physical infrastructure into isolated and independent virtual network resources through a network hypervisor and programmable switches

High-performance kernel networking stack

Optimized, high-performance kernel networking stack for ultra-low latency and for IoT devices with extremely limited hardware resources

AI-based optimized network systems

Intelligent network systems that automatically improve performance for each user service and optimize resource utilization across end-to-end systems