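Trends in High-Level Synthesis Automation Tools
[Translator's note] This is a survey of recent design techniques in the semiconductor design field. One should not be a picky eater even when studying English, so a paper like this is worth reading as general knowledge. It runs to about 30 pages and the body is fairly specialised, so I will excerpt and read only the abstract and the introduction. I will add my own notes on the technical terms that come up along the way.
[Source] IEEE Access, Vol. 8, 2020, https://ieeexplore.ieee.org/abstract/document/9195872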
---------------------------------------------------------------------------
Towards Automatic High-Level Code Deployment on Reconfigurable Platforms: A Survey of High-Level Synthesis Tools and Toolchains
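- The title alone is long and hard to parse. The individual words are basic, but so much is omitted and compressed that, unless you know which journal the paper appeared in, it is hard to guess what it is about. Google Translate rendered it into Korean as follows.
"재구성 가능한 플랫폼의 자동 고급 코드 배포를 향하여: 고급 합성 도구 및 도구 체인에 대한 조사" (literally: "Towards automatic advanced code distribution on reconfigurable platforms: an investigation of advanced synthesis tools and tool chains")
- Leaving the quality of that translation for later, a few foreign words stand out. Does this mean that '플랫폼' (platform), '코드' (code) and '체인' (chain) have become loanwords fully absorbed into Korean? If so, we ought to be able to use them appropriately. Bearing in mind that this paper appeared in IEEE Access, a journal for electrical and electronics engineers hosted on IEEE Xplore,
- 'platform' means an FPGA (Field Programmable Gate Array), on which the elements of various logic circuits are laid out in advance, and
- 'code' means source code that describes an algorithm in a programming language at a high level of abstraction. With that in mind, I rendered the title as follows.
"높은 추상화 수준에서 작성된 설계구문을 재구성 가능한 반도체 토대 위에 구현하는 자동화된 방법에 대하여: 고위 합성과 그와 동반하는 일련의 도구들의 동향보고" (literally: "On automated methods for implementing design code written at a high level of abstraction on a reconfigurable semiconductor base: a survey of high-level synthesis and the accompanying chain of tools")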
MOSTAFA W. NUMAN[1], BRADEN J. PHILLIPS[2], (Member, IEEE), GAVIN S. PUDDY[1], (Associate Member, IEEE), AND KATRINA FALKNER[1]
Corresponding author: Mostafa W. Numan (mostafa.numan@adelaide.edu.au)
[1] School of Computer Science, The University of Adelaide, Adelaide, SA 5005, Australia
[2] School of Electrical and Electronic Engineering, The University of Adelaide, Adelaide, SA 5005, Australia
This work was supported by the Maritime Division of the Defence Science and Technology Group, Australia.
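[Translation] This research was carried out with support from the Maritime Division of Australia's Defence Science and Technology Group.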
---------------------------------------------------------------------
ABSTRACT
Heterogeneous computing systems with tightly coupled processors and reconfigurable logic blocks provide great scope to improve software performance by executing each section of code on the processor or custom hardware accelerator that best matches its requirements and the system optimisation goals.
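[Translation] A heterogeneous computing system, in which general-purpose processors are tightly coupled with reconfigurable logic blocks, offers great scope for higher software performance: each section of code runs either on the general-purpose processor or on an accelerator built from dedicated circuits, whichever best matches its requirements and the system-level optimisation goals.
- heterogeneous computing system: a computing system built from different kinds of processing elements
- processors: general-purpose processors, i.e. CPUs
- reconfigurable logic blocks: the fabric that hosts dedicated circuits, i.e. an FPGA
- 'tightly coupled' stresses the close coupling between the general-purpose processor and the reconfigurable dedicated logic.
- 'Reconfigurable' is used here in much the same sense as 'programmable'.
- 'improve software performance' is the purpose of HLS.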
This article is motivated by the idea of a software tool that can automatically accomplish the task of deploying code, originally written for a conventional computer, to the processors and reconfigurable logic blocks in a heterogeneous system.
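[Translation] This paper starts from the idea of a software tool that would automatically deploy code, originally written to run on a conventional computer, onto a heterogeneous system made up of general-purpose processors and reconfigurable logic blocks.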
We undertake an extensive survey of high-level synthesis tools to determine how close we are to this vision, and to identify any capability gaps.
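[Translation] We surveyed high-level synthesis tools in depth to find out how close they have come to this vision and where their capabilities still fall short.
- high-level synthesis (HLS): synthesis of hardware from a description written at a high level of abstraction
- gaps: the capabilities the tools still lack. Much of the shortfall comes from the difference in abstraction level between the ways an algorithm can be described.
- General-purpose programming languages sit at a high level of abstraction.
- Dedicated hardware is described at RTL (Register-Transfer Level), a low level of abstraction.
- RTL (Register-Transfer Level): the description spells out details such as the number of clock cycles and the bit widths.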
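- [Translator's note] The abstraction gap described above can be sketched in C++. This is only an illustration under my own assumptions: the names `sum_high_level`, `Accumulator8` and `sum_rtl_style` are hypothetical and do not come from the paper. The first function ignores clocks and register widths entirely; the second models one addition per clock in an 8-bit register, which is the kind of detail an RTL description has to fix.

```cpp
#include <cstdint>
#include <vector>

// High-level view: no notion of clock cycles or register widths.
int sum_high_level(const std::vector<int>& xs) {
    int total = 0;
    for (int x : xs) total += x;
    return total;
}

// RTL-flavoured view (a software model, not real RTL):
// one addition per clock, held in a register of fixed width.
struct Accumulator8 {
    std::uint8_t acc = 0;   // 8-bit register: the value wraps modulo 256
    unsigned cycles = 0;    // number of clock edges seen so far

    void clock(std::uint8_t in) {
        acc = static_cast<std::uint8_t>(acc + in);
        ++cycles;
    }
};

Accumulator8 sum_rtl_style(const std::vector<std::uint8_t>& xs) {
    Accumulator8 reg;
    for (std::uint8_t x : xs) reg.clock(x);
    return reg;
}
```

Summing 200 and 100 gives 300 in the high-level version, but the 8-bit register ends at 44 (300 modulo 256) after two clocks. Choosing the width and the schedule is exactly what an HLS tool, or a hardware designer, has to decide.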
The survey is structured according to a new framework that clearly expresses the relationships between the many tools surveyed.
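[Translation] The survey is organised around a new framework that makes the relationships among the many tools surveyed clear.
- new framework: rather than ranking the tools or listing their pros and cons, the authors look at what characterises each tool and set out how the tools relate to one another.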
We find that none of the existing tools can deploy general high-level code without manual intervention.
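[Translation] We found that no existing tool can turn general high-level code into low-level RTL hardware without manual intervention.
- Even HDLs (Hardware Description Languages), which are mature by now, can be synthesised only when the code stays within certain forms (synthesizable subsets). How much more so for C++, which sits at a higher level of abstraction.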
Logic synthesis from arbitrary high-level code remains an open problem with dynamic data structures, function pointers and recursion all presenting challenges.
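[Translation] Logic synthesis from arbitrary high-level code remains an unsolved problem; dynamic data structures, function pointers and recursion all pose difficulties.
- dynamic data structures: structures built at run time, e.g. memory allocation, linked lists, stacks managed through a stack pointer.
- function pointers: the technique of storing a function's address in a pointer variable and calling the function through it.
- recursion: a function calling itself, as in fractal algorithms.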
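- [Translator's note] To make the three obstacles concrete, here is a minimal sketch of how such code is commonly rewritten by hand into a form with fixed bounds. It is my own illustration, not a method from the paper, and every name in it (`factorial_bounded`, `Op`, `apply`, `FixedStack`) is hypothetical.

```cpp
#include <array>
#include <cstddef>

// Recursion: the call depth depends on run-time data.
unsigned factorial_recursive(unsigned n) {
    return n <= 1 ? 1u : n * factorial_recursive(n - 1);
}

// Rewritten as a loop with a compile-time bound.
// Assumption: inputs above kMaxN are clamped (12! still fits in 32 bits).
constexpr unsigned kMaxN = 12;
unsigned factorial_bounded(unsigned n) {
    unsigned result = 1;
    for (unsigned i = 2; i <= kMaxN; ++i) {
        if (i <= n) result *= i;   // fixed trip count, guarded body
    }
    return result;
}

// Function pointer replaced by an opcode and a switch,
// so every possible callee is known at compile time.
enum class Op { Add, Sub, Mul };
int apply(Op op, int a, int b) {
    switch (op) {
        case Op::Add: return a + b;
        case Op::Sub: return a - b;
        case Op::Mul: return a * b;
    }
    return 0;
}

// Dynamic structure replaced by a fixed-capacity buffer.
template <std::size_t N>
struct FixedStack {
    std::array<int, N> data{};
    std::size_t size = 0;

    bool push(int v) {
        if (size == N) return false;   // full: no allocation possible
        data[size++] = v;
        return true;
    }
    bool pop(int& v) {
        if (size == 0) return false;   // empty
        v = data[--size];
        return true;
    }
};
```

Each rewrite trades generality for a bound that is known before synthesis. Making that trade by hand is the manual intervention the authors say no tool yet removes.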
Other challenges include automating the tasks of code partitioning, optimisation and design space exploration.
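[Translation] Other difficult tasks include automating code partitioning, optimisation and design space exploration.
- Recall the many machine-code generation options of a C++ compiler:
- optimising for code size or for execution speed
- function inlining
- calling conventions
- data type conversion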
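- [Translator's note] Design space exploration can be pictured as trying several implementation choices and keeping the best one that fits a budget. The sketch below is a toy under loudly stated assumptions: the cost model (latency = ceil(trip count / unroll factor), area = unroll factor) is invented for illustration, and the names `DesignPoint`, `estimate` and `explore` are hypothetical. Real tools estimate latency and area from the synthesised design.

```cpp
#include <vector>

struct DesignPoint {
    unsigned unroll;          // loop unroll factor being tried
    unsigned latency_cycles;  // estimated cycles (toy model)
    unsigned area_units;      // estimated area (toy model)
};

// Toy cost model (an assumption, not taken from any tool):
// unrolling by u divides the cycle count by u and multiplies the area by u.
// Precondition: unroll > 0.
DesignPoint estimate(unsigned trip_count, unsigned unroll) {
    return {unroll, (trip_count + unroll - 1) / unroll, unroll};
}

// Returns the lowest-latency candidate whose area fits the budget.
// The unroll=1 design is used as the baseline and assumed to always fit.
DesignPoint explore(unsigned trip_count, unsigned area_budget,
                    const std::vector<unsigned>& candidates) {
    DesignPoint best = estimate(trip_count, 1);
    for (unsigned u : candidates) {
        if (u == 0) continue;  // skip invalid factors
        DesignPoint p = estimate(trip_count, u);
        if (p.area_units <= area_budget &&
            p.latency_cycles < best.latency_cycles) {
            best = p;
        }
    }
    return best;
}
```

For a 64-iteration loop with an area budget of 8 and candidates 1, 2, 4, 8, 16, this picks an unroll factor of 8. The hard part the paper points to is doing this automatically with realistic estimates, over many interacting choices at once.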
-------------------------------------------------------------------------
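Even the abstract alone raises a great many points for discussion. The paper is rather long, but if you are interested in the latest semiconductor design techniques I recommend reading it in full.
SECTION I. Introduction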
For the last four decades, Moore's law and Dennard scaling have relentlessly delivered improvements in computing performance [1]. Since the early 2000s their impact has begun to wane and alternative ways to improve performance have begun to emerge.
Heterogeneous computing is a promising approach in which a group of processing nodes execute a workload in parallel. Given different kinds of nodes including multi-core CPUs, real-time processors, DSPs, GPUs, and accelerators on FPGAs or ASICs, the computing workload can be partitioned such that each part is executed on a processor that is well-matched to its requirements and the performance optimisation goals.
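- [Translator's note] The partitioning idea in this paragraph can be sketched as assigning each task to the kind of node with the lowest estimated cost. This is a deliberately naive illustration of my own: the `Task` type, the `partition` function and the per-node time estimates are hypothetical, and a realistic partitioner would also weigh data-transfer cost and resource limits.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// One task with an estimated run time (in ms) on each node kind.
// The index into est_ms identifies the node, e.g. 0 = CPU, 1 = GPU, 2 = FPGA.
struct Task {
    std::string name;
    std::vector<double> est_ms;
};

// Greedy partitioning: each task goes to the node with the smallest estimate.
// A task with no estimates is assigned to node 0.
std::vector<std::size_t> partition(const std::vector<Task>& tasks) {
    std::vector<std::size_t> assignment;
    assignment.reserve(tasks.size());
    for (const Task& t : tasks) {
        auto best = std::min_element(t.est_ms.begin(), t.est_ms.end());
        assignment.push_back(
            static_cast<std::size_t>(std::distance(t.est_ms.begin(), best)));
    }
    return assignment;
}
```

With a "parse" task estimated at {1.0, 5.0, 9.0} and a "fir" task at {8.0, 3.0, 0.5}, the first lands on node 0 (CPU) and the second on node 2 (FPGA).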
This article is concerned with the engineering task of writing software for a heterogeneous system and considers how close existing tools and technologies are to a fully automatic system in which high-level source code is partitioned and deployed to heterogeneous nodes with a minimum of human intervention. This is an ambitious scope so we constrain ourselves, in this article, to the task of deploying source code blocks onto custom FPGA logic.
It is possible, of course, to write software specifically for a particular heterogeneous system by manually partitioning tasks among the processors, and using the most appropriate programming language for each of the different processors. For example a Hardware Description Language (HDL) such as Verilog could be used for tasks executing on an FPGA, and CUDA for those on a GPU. An alternative, which has seen a great deal of research activity in recent years, is to use High-Level Synthesis (HLS) for generating hardware modules from code written in a High-Level Language (HLL) (such as C, C++ or Python).
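- [Translator's note] To give a feel for what HLS input written in a high-level language looks like, here is a small FIR filter step in plain C++, written in the restrained style such tools generally favour: fixed-size arrays, loops with constant bounds, no allocation. It is a sketch of my own, not code from the paper. `kTaps` and `fir_step` are hypothetical names, and the tool-specific directives (for example loop pipelining or unrolling pragmas) that a designer would normally add are left out.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kTaps = 4;  // number of filter taps, fixed at compile time

// Consumes one input sample and produces one output sample.
// shift_reg holds the last kTaps samples; in hardware it would map to registers.
int fir_step(std::array<int, kTaps>& shift_reg,
             const std::array<int, kTaps>& coeff,
             int sample) {
    // Shift the delay line by one position (constant trip count).
    for (std::size_t i = kTaps - 1; i > 0; --i) {
        shift_reg[i] = shift_reg[i - 1];
    }
    shift_reg[0] = sample;

    // Multiply-accumulate over all taps (constant trip count).
    int acc = 0;
    for (std::size_t i = 0; i < kTaps; ++i) {
        acc += shift_reg[i] * coeff[i];
    }
    return acc;
}
```

Feeding an impulse (1 followed by zeros) through coefficients {1, 2, 3, 4} returns 1, 2, 3, 4 on successive calls. The same function compiles and runs on a CPU, which is the simulation and debugging advantage the next paragraph describes.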
There are benefits of using HLS instead of HDL so that the entire application is in a high-level language: simulation speed is generally faster; debugging is less difficult; it is easier to explore and evaluate design alternatives; and the high-level language may include features that cannot be easily expressed in a HDL [2].
Although current HLS tools do not always produce performance-optimised implementations, applications without stringent performance requirements can be more quickly and easily developed using HLS. HLS software developers do NOT necessarily need to be FPGA or HDL experts, and optimisation opportunities can be exposed to the designer that cannot be easily explored via HDL approaches. In some cases, a project that would not have been practical in HDL, given its complexity, limited time frame and small development team, can be feasible in HLS at a low performance cost compared to an HDL-based approach [3]–[5].
An HLS-based design for a heterogeneous system could be started from scratch, or use pre-existing code originally written for a conventional CPU. Either way, to effectively use current HLS technology, system developers require considerable knowledge and experience in the application domain, computer programming, and HLS design flow.
Deploying pre-existing code written for a conventional CPU onto a heterogeneous system with the aim of improving performance or efficiency is even more difficult. The code needs to be substantially restructured to be synthesisable, and to produce optimised hardware. This needs to be done for all the code: not just the application source code but also any library functions it uses. To date Automatic Code Deployment (ACD) tools capable of performing this challenging task without human intervention have been the subject of limited research (e.g. [6], [7]) but a considerable amount of work has been done on automating some of the more challenging, tedious and time-consuming steps in this process (e.g. [8]–[11]).
This article surveys recent toolchains and workflows for high-level synthesis to FPGA with a focus on technologies that might eventually be used for automatic code deployment. Section I introduces HLS and the motivation for heterogeneous computing based on HLS. Section II categorises different contemporary approaches to deploy compute-intensive code segments to FPGA hardware accelerators. The categories introduced in Section II are used to organise a thorough survey, in Sections III and IV, of approaches that take a candidate function expressed in a HLL and produce low-level HDL suitable for FPGA deployment. The arguments in these sections focus on contemporary HLS tools currently used in academia or industry; legacy tools are included in summary tables for completeness. Specification of a hypothetical tool for ACD to FPGA, as well as a brief summary of progress reported in the literature towards making HLS-based FPGA code deployment less dependent on human judgement and proficiency, are provided in Section V.
SECTION II. High-Level Code Deployment Approaches
FIGURE 1. Design flows for high-level code deployment.
SECTION III. Behavioural Approach for FPGA Synthesis
FIGURE 2. A generic HLS design flow.
TABLE 1 Currently Available Commercial HLS Tools
TABLE 2 Currently Available Academic HLS Tools
----------------------------------------------------------------------
The original paper briefly explains how each HLS tool works, so do look it up. It is worth noting which techniques are being used.
SECTION VI. Conclusion
The motivation for this article was a vision of a software tool that could automatically deploy sections of code, originally written for a conventional CPU, to FPGA accelerators to achieve an implementation advantage, whether that be latency, throughput, energy or some other optimisation goal. How close are existing tools to delivering this vision, and where are the capability gaps?
To survey and evaluate existing tools we provided a classification of design flows in FIGURE 1. This classification has neatly expressed the relationships between many different hardware synthesis tools surveyed under three broad approaches: manual re-coding, behavioural synthesis, and dataflow synthesis, as well as variations of these.
Sections III and IV surveyed many commercial and research tools for code deployment, organised according to the framework in FIGURE 1. Wherever possible we have identified the pedigree of these tools and their relationship to other tools in the survey. For the more widely used or surveyed tools we have provided an overview of salient features and capabilities.
None of the existing tools are able to fulfil the vision of fully automatic deployment of general C/C++ code to a heterogeneous system of FPGAs and CPUs. Capability gaps include the generation of synthesisable HLS code from HLL code that uses pointers to pointers or functions, recursive functions, or dynamic memory allocations. Other challenges include efficient partitioning of the code, optimisation of generated hardware, and design space exploration. All of these are the subject of active research efforts as surveyed in Section V-B.
This work has also identified a trend that is somewhat orthogonal to the idea of automatic code deployment. There is currently significant work in approaches to express the application in a high level language that is more amenable for execution on diverse platforms. Examples include the growing proliferation of domain-specific languages or dataflow representations.