Research
AI system software 🤹🏻
In recent years, datacenters have evolved to accommodate a variety of heterogeneous workloads and devices. A prominent example is distributed deep learning training, which enables the development of large-scale AI models such as GPT, DALL-E, and LLaMA, with over 530 billion parameters trained across hundreds of GPU nodes. As a result, optimizing infrastructure utilization and efficiency has become crucial. However, recent data from major cloud providers such as Microsoft and Alibaba reveal average GPU utilization rates of only 52.4% and 25.4%, respectively. This underutilization represents a considerable waste of datacenter resources and highlights the need for more effective strategies to improve efficiency and utilization.
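To make the scale of this waste concrete, the short Python sketch below converts the reported utilization figures into idle capacity. The cluster size of 1,000 GPUs and the helper name `idle_gpu_equivalents` are illustrative assumptions, not figures or code from either provider.

```python
# Average GPU utilization as reported in the text above.
REPORTED_UTILIZATION = {"Microsoft": 0.524, "Alibaba": 0.254}


def idle_gpu_equivalents(total_gpus: int, utilization: float) -> float:
    """Return how many GPUs' worth of capacity sits idle on average."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return total_gpus * (1.0 - utilization)


if __name__ == "__main__":
    cluster_size = 1000  # hypothetical cluster size, chosen only for illustration
    for provider, utilization in REPORTED_UTILIZATION.items():
        idle = idle_gpu_equivalents(cluster_size, utilization)
        print(f"{provider}: about {idle:.0f} of {cluster_size} GPUs idle on average")
```

For the hypothetical 1,000-GPU cluster, these rates correspond to roughly 476 and 746 GPUs' worth of idle capacity, respectively.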
One of the primary challenges in optimizing system infrastructure is that system software does not yet characterize deep learning workloads. Despite significant advances in the field, the optimal training configuration, such as the GPU type and the number of GPUs, remains unknown, which makes training times unpredictable. This makes it impossible to optimize scheduling or other system techniques for AI tasks. Severe overallocation of GPUs also exists in datacenters. These resource inefficiencies call for more efficient system management strategies.
The following are representative technologies from our research.
Workload characterization
In-depth profiling (introspection) of deep learning training and inference workloads from the OS perspective
Job scheduling
Efficient workload scheduling that guarantees service performance (SLA)
Job placement and migration
Cluster- and planet-level workload placement and migration
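As a rough illustration of what SLA-aware job scheduling involves, the sketch below places jobs on a fixed GPU pool in earliest-deadline-first (EDF) order and reports whether each job finishes before its deadline. EDF is a textbook policy used here only as a stand-in; it is not our scheduler, and the `Job` fields, the function name `schedule_edf`, and the example workload are all hypothetical.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    gpus: int        # number of GPUs the job requests
    duration: float  # estimated run time, in hours
    deadline: float  # SLA deadline, in hours from time 0


def schedule_edf(jobs, total_gpus):
    """Greedy earliest-deadline-first placement without backfilling.

    Returns {job name: (start, finish, meets_sla)}.
    """
    running = []       # min-heap of (finish_time, gpus) for jobs already placed
    free = total_gpus  # GPUs not held by any placed job
    now = 0.0
    plan = {}
    for job in sorted(jobs, key=lambda j: j.deadline):
        if job.gpus > total_gpus:
            raise ValueError(f"{job.name} requests more GPUs than the pool has")
        # Wait for earlier jobs to finish until enough GPUs are free.
        while free < job.gpus:
            finish, gpus = heapq.heappop(running)
            now = max(now, finish)
            free += gpus
        start = now
        finish = start + job.duration
        heapq.heappush(running, (finish, job.gpus))
        free -= job.gpus
        plan[job.name] = (start, finish, finish <= job.deadline)
    return plan


if __name__ == "__main__":
    # Hypothetical workload on a hypothetical 8-GPU pool.
    workload = [
        Job("train-a", gpus=4, duration=2.0, deadline=3.0),
        Job("train-b", gpus=8, duration=1.0, deadline=4.0),
        Job("infer-c", gpus=4, duration=3.0, deadline=5.0),
    ]
    for name, (start, finish, ok) in schedule_edf(workload, total_gpus=8).items():
        print(f"{name}: start={start:.1f}h finish={finish:.1f}h SLA met={ok}")
```

The sketch assumes that run times are known in advance, which is exactly what workload characterization is meant to provide; without such estimates, the deadline check cannot be made before a job runs.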
In recent years, datacenters have evolved to accommodate diverse (heterogeneous) workloads and devices. A representative example is the training and inference of distributed deep learning models: training uses hundreds of GPU nodes to build large-scale AI models such as GPT, DALL-E, and LLaMA, with more than 530 billion parameters. In such environments, optimizing the efficiency of infrastructure such as GPUs is very important. However, recent data from major cloud providers such as Microsoft and Alibaba show average GPU utilization of only 52.4% and 25.4%, respectively, which means that very expensive equipment is being substantially wasted.
One of the main difficulties in optimizing system infrastructure is that new workloads (such as deep learning) have not yet been clearly analyzed (profiled) by system software, nor has system software been optimized for them. Despite major advances in deep learning, choosing the configuration best suited to training or serving a particular model, such as the type and number of GPUs, still depends entirely on developer experience, and training and inference times are uneven as a result. This in turn makes scheduling and system optimization based on workload characteristics difficult, and it leads to a form of overallocation in which jobs reserve more GPUs up front than training or inference actually needs.
To overcome these problems, our lab is pursuing a range of research; our representative technologies are listed below.
Workload characterization
In-depth profiling (introspection) and prediction of deep learning training and inference workloads from the OS perspective
Workload scheduling
Research on efficient scheduling techniques that guarantee service performance (SLA)
Workload placement and migration
Cluster- and planet-level workload placement and migration
Network system software 🖥️
Cloud computing is primarily realized through "networking," which is utilized for either 1) device-to-device communication within a node or 2) node-to-node communication. In this context, the network infrastructure of datacenters must be virtualized to isolate performance and resource usage between tenants. However, current datacenters do not allow tenants to create or control their virtual network infrastructure, such as virtual switches, virtual links, and virtual topologies. Instead, the virtual network infrastructure is configured and managed solely by the datacenter administrators, which is in stark contrast to server virtualization. In particular, considering that customized operations of network resources (e.g., in-network computing) are one of the major building blocks of upcoming networking systems such as those beyond 5G, this issue is critical. As a result, system researchers have sought to make the network infrastructure controllable by "software" (known as software-defined networking, or SDN) to enable users of the network infrastructure to freely access, virtualize, and customize it (programmability).
In this context, we are investigating network systems that are 1) more programmable, 2) higher-performance, and 3) more reliable for connecting heterogeneous devices and enabling services. The following are examples of our technologies.
Network virtualization
Virtualizing physical infrastructure into isolated and independent virtual network resources through a network hypervisor and programmable switches
High-performance networking stack
Optimized, high-performance kernel networking stack for ultra-low-latency and extremely hardware-constrained IoT devices
AI-based optimized network systems
Intelligent network systems that automatically improve performance for users and optimize the utilization of network resources for end-to-end systems
Cloud computing is realized mainly through "networking," which is used for 1) device-to-device communication within a node or 2) node-to-node communication. In this context, the datacenter network infrastructure must also be "virtualized" to isolate performance and resource usage between tenants. In current datacenters, however, tenants cannot create or control any virtual network infrastructure, such as virtual switches, virtual links, or virtual topologies. This is very restrictive compared with server virtualization, where tenants freely compose containers or VMs from arbitrary vCPU, vRAM, and storage. In particular, systems such as Beyond 5G, which have been actively studied since 5G, treat computing inside the network as a key element, so supporting full virtualization that extends to network resources is very important.
Motivated by this, systems researchers are studying how to control network infrastructure through "software" (software-defined networking, or SDN) so that users of the network infrastructure can freely access, virtualize, and customize it. In collaboration with the Open Networking Foundation, our lab is building core technologies for network virtualization, and we conduct research on supporting services and heterogeneous-device communication that are 1) more programmable, 2) higher-performance, and 3) more reliable.
Network virtualization
Virtualizing physical infrastructure into isolated and independent virtual network resources through a network hypervisor and programmable switches
High-performance kernel networking stack
Optimized, high-performance kernel networking stack for ultra-low-latency and extremely hardware-constrained IoT devices
AI-based optimized network systems
Intelligent network systems that automatically improve performance for each user service and optimize end-to-end system resource utilization