Thinking parallel: Multi-cores, virtual elasticity, and the application programmer # Thinking parallel: Multi-cores, virtual elasticity, and the application programmer ### The basics Task Interaction Graph (TIG) ### Sequential scheduling (one processor) $Makespan = T_{sequential}$ Parallel scheduling (three processors) Makespan = $$T_{\text{parallel}} + T_{\text{serial}}$$ Speedup = $T_{\text{sequential}} / (T_{\text{parallel}} + T_{\text{serial}})$ # Thinking parallel: Multi-cores, virtual elasticity, and the application programmer # Speedup versus Scalability - Ideal speedup = number of processors = |N| - > Amdahl's law<sup>3</sup>: Speedup = $$\frac{1}{(1-P) + \frac{P}{|N|}}$$ $\rightarrow$ Max speedup as $|N| \rightarrow \infty$ is $$\frac{1}{1-P}$$ > Example: With P = 90% the max speedup is 10 - Alternatively: Keep run time fixed, but increase problem size with the number of processors - > Hypothetical sequential run time $$T_{\text{sequential}} = T_{\text{serial}} + |\mathbf{N}| \times T_{\text{parallel}}$$ Gustafson-Barsis' law<sup>4</sup>: the <u>scaled</u> speedup is $$SS = T_{\text{sequential}} / (T_{\text{parallel}} + T_{\text{serial}})$$ $$= (T_{\text{serial}} + |N| \times T_{\text{parallel}}) / (T_{\text{parallel}} + T_{\text{serial}})$$ $$= |N| - \alpha (|N| - 1)$$ with $$\alpha = T_{\text{serial}} / (T_{\text{parallel}} + T_{\text{serial}})$$ <sup>[3]</sup> Gene M. Amdahl, 'Validity of the single processor approach to achieving large scale computing capabilities', in *Proceedings of the AFIPS spring joint computer conference*, Conference location: Atlantic City, New Jersey, USA, 1967, pp. 483–485. # Thinking parallel: Multi-cores, virtual elasticity, and the application programmer ### Parallel digital computers Burroughs' D825 4 processors (1962) Cray-1 (1976) Cray X-MP (1980) #### Distributed computing - Massive Parallel Processing = Dedicated interconnect - Cluster Computing = Ethernet + TCP/IP - GRID computing = Internet - Cloud = everything virtualised Shared memory # Thinking parallel: Multi-cores, virtual elasticity, and the application programm # Parallel computing<sup>†</sup> #### Hardware The execution platform #### Management software - Operating System (OS) - Platform control - Management middleware - Development support #### **Applications** The essential value † High Performance Computing, Many Task Computing, maybe High Throughput Computing, not Dedicated Computing # Thinking parallel: Multi-cores, virtual elasticity, and the application programme # More beyond Moore Gordon E. Moore's law<sup>5</sup>: "...the number of transistors on a chip will double every 18 months" Mostly process scaling with 0.7 every 24 months from 0.8 $\mu m$ in 1992 until ~2000 - Physical distance problems with scaling - Voltage supply scaling and increased leaks - Clock frequency: maximum around 5GHz - Power density wall and heat - Design complexity: exponential design team growth # Ihinking parallel: Multi-cores, virtual elasticity, and the application programm ### ...on heterogeneous cores "Assuming that this trend will follow Moore's Law scaling, mainstream systems will contain over 10 processing cores by the end of the decade, yielding unprecedented theoretical peak performance" Justin Rattner Vice president and chief technology officer, Intel (2005)<sup>10</sup> simultaneous multithreading (2004)<sup>8</sup> [7] J. M. Tendler et al, 'POWER4 system microarchitecture', IBM Journal of Research and Development, vol. 46, no. 1, pp. 5-25, Jan. 2002. [8] B. Sinharoy et al, 'POWER5 system microarchitecture', IBM Journal of Research and Development, vol. 49, no. 4.5, pp. 505-521, Jul. 2005 [9] Umesh Gajanan Nawathe *et al*, 'An 8-Core 64-Thread 64b Power-Efficient SPARC SoC', in *Digest of Technical Papers from the IEEE International Solid-State Circuits Conference (ISSCC 2007)*, Conference location: San Francisco, CA, USA, 2007, pp. 5.7: 108–110. [10] Justin Rattner, 'Multi-core to the masses', in *Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT 2005)*, Conference location: Saint Louis, Missouri, USA, 2005, p. 3. ### Hardware: The future - > Dedicated accelerators<sup>†</sup> or proven FPGA templates or design patterns - > Proven library of modules with predictable performance 15 - Parameterised interconnects - Standard adapters and interfaces - Reusable building blocks - > Asymmetric speedup ≥ symmetric speedup<sup>16</sup> - > "Software will take a more prominent role in the multi-core era"<sup>6</sup> - > "...the programming model for heterogeneous architectures is much more complicated" <sup>17</sup> † "Technology based on "manycore" will employ 100s to 1000s of CPU cores per chip (by2011)"14 <sup>[17]</sup> Geoffrey Blake *et al.* 'A survey of multicore processors', *IEEE Signal Processing Magazine*, vol. 26, no. 6, pp. 26–37, Nov. 2009. <sup>[14]</sup> John Shalf, 'The new landscape of parallel computer architecture', Journal of Physics: Conference Series, vol. 78, p. 012066, Jul. 2007. <sup>[15]</sup> John A Darringer, 'Multi-core design automation challenges', in *Proceedings of the 44th annual Design Automation Conference (DAC '07)*, San Diego, California, USA, 2007, pp. 760–764. <sup>[16]</sup> Mark D. Hill and Michael R. Marty, 'Amdahl's Law in the Multicore Era', Computer, vol. 41, no. 7, pp. 33-38, Jul. 2008. # Operating Systems - > "No current OS is really multithreaded"a - New approaches to operating systems - $XtreemOS^b = 7$ + - S(o)SS = Service Oriented Operating Systems<sup>18</sup> - Tessllation<sup>19</sup> - FUSE: OS support for easy HW accelerator integration<sup>20</sup> - Native hypervisors + microkernels - > Real-time OSd - a Lutz Schubert, Universität Ulm - b http://www.xtreemos.eu/ and http://research.cs.wisc.edu/condor/ - c http://www.soos-project.eu/ - d QNX Neutrino RTOS (http://www.gnx.com/products/neutrino-rtos/neutrino-rtos.html) or INTEGRITY (http://www.ghs.com/products/rtos/integrity.html) <sup>[19]</sup> Krste Asanovic et al. 'A view of the parallel computing landscape', Communications of the ACM, vol. 52, no. 10, pp. 56-67, Oct. 2009. <sup>[20]</sup> Aws Ismail and Lesley Shannon, 'FUSE: Front-End User Framework for O/S Abstraction of Hardware Accelerators', in *Proceedings of the 19th Annual International* Symposium on Field-Programmable Custom Computing Machines (FCCM 2011), Salt Lake City, Utha, USA, 2011, pp. 170–177. # High Throughput Mass Market ### > Desktop - Legacy software "does the job" - Current software developed for single-core - Per-application performance is important (only) if the load consists of only a few applications or if there are performancecritical applications<sup>33</sup> #### > Servers - Database and web servers are designed for high throughput - Idle time can be masked by multi-threading<sup>33</sup> - Large parallel application part = more cores, otherwise more complex cores<sup>16</sup> <sup>[30]</sup> Jacques A. Pienaar *et al.* 'MDR: performance model driven runtime for heterogeneous parallel platforms', in *Proceedings of the international conference on Supercomputing (ICS '11)*, Tucson, Arizona, USA, 2011, pp. 225–234. <sup>[31]</sup> Anders Logg et al. 'Past and Future Perspectives on Scientific Software', in Simula Research Laboratory, vol. Part II, Aslak Tveito et al. (Eds.) Springer, 2010, pp. 321-362. <sup>[32]</sup> Michela Becchi *et al.* 'Enabling Legacy Applications on Heterogeneous Platforms', in *Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar 2010)*, Berkeley, CA, USA, 2010 <sup>[33]</sup> Angela C. Sodan *et al.* 'Parallelism via Multithreaded and Multicore CPUs', *Computer*, vol. 43, no. 3, pp. 24–32, Mar. 2010. ### The future - > Pin-limits: no technology in sight!<sup>33</sup> - > Three scenarios<sup>35</sup> - "Drop the ball" ⇒ Cloud computing - "Niche markets" ⇒ Multimedia, gaming... - "Scalable, dependable, software development" <sup>13</sup> - > Urgently needed: "Parallel computing for all"™ - Standardised, industry accepted development platforms - Education of programmers on these platforms - Requirements engineering for parallelism # Development support http://www.open-mpi.org/ http://openmp.org http://www.openhmpp.org http://www.openacc-standard.org/ http://www.khronos.org/opencl/ http://upc.lbl.gov/ http://threadingbuildingblocks.org/http://software.intel.com/en-us/articles/intel-cilk-plus/http://software.intel.com/en-us/articles/intel-array-building-blocks/http://msdn.microsoft.com/en-us/library/hh265137%28v=vs.110%29 http://charm.cs.uiuc.edu/research/charm/ http://stellar.cct.lsu.edu/tag/parallex/ http://www.ni.com/labview/ http://chapel.cray.com/ http://x10-lang.org/ 13 ### Management software: The future - Operating systems - Thin generic access layers - Automatic adaptive schedulers learning the characteristics of the execution platform and the application - > Optimised computation kernels - BLAS, LINPACK, Diffpack, FFT, Boost,... - Learning based implementation selectors<sup>26</sup> and elastic computing<sup>27</sup> - Development support - Methodology: Thinking parallel, patterns<sup>28</sup> - Languages - > Enhanced or new languages, e.g. Lime<sup>29</sup> - > Model Driven Development and Runtime<sup>30</sup> - Integrated Development Environments - Cross and just-in-time compilers<sup>31</sup> - Tools to assist conversion of legacy software<sup>32</sup> Illustration from [28] [25] Cédric Augonnet and Raymond Namyst, 'A Unified Runtime System for Heterogeneous Multi-core Architectures', in *Proceedings of the Euro-Par 2008 Workshops on Parallel Processing*. Conference location: Las Palmas de Gran Canaria, Spain, 2008, vol. 5415, pp. 174–183. [26] Mario Kicherer *et al.* 'Seamlessly portable applications: Managing the diversity of modern heterogeneous systems', *ACM Transactions on Architecture and Code Optimization (TACO)*, vol. 8, no. 4, pp. 42:1–20, Jan. 2012. # Paradigm: The ACTOR model<sup>36</sup> - > Mathematical model of concurrent computation - Dynamic creation of actors - Asynchronous, unordered message passing - Concurrent computation - Dynamic topology addresses in messages - > Service oriented - > No threads no shared memory - > Legacy compliant # Thinking parallel: Multi-cores, virtual elasticity, and the application programm # Virtual elasticity - > Migrating actors - → Location: Data ←→ Algorithms - Scheduling of actors - Virtual threads - Bin packing - Communication aware - > Integrated interconnect Geir Horn Geir.Horn@mn.uio.no +47 93 05 93 35 In a soldier's stance, I aimed my hand At the mongrel dogs who teach Fearing not that I'd become my enemy In the instant that I preach My pathway led by confusion boats Mutiny from stern to bow Ah, but I was so much older then I'm younger than that now Yes, my guard stood hard when abstract threats Too noble to neglect Deceived me into thinking I had something to protect Good and bad, I define these terms Quite clear, no doubt, somehow Ah, but I was so much older then I'm younger than that now Bob Dylan My back pages