
Heterogeneous Parallel Programming: Week One, Part One

2012-11-29 17:32
Heterogeneous Parallel Programming

Wen-mei Hwu (instructor), Gang Liao (editor), greenhat1016@gmail.com

Lecture 0: Course Overview

Course Overview

People

Learn how to program heterogeneous parallel computing systems and achieve:

high performance and energy-efficiency

functionality and maintainability

scalability across future generations

Technical subjects

principles and patterns of parallel algorithms

processor architecture features and constraints

programming API, tools and techniques

Instructor: Wen-mei Hwu, w-hwu@illinois.edu; start your e-mail subject line with [Coursera]

Teaching Assistants: John Stratton, I-Jui (Ray) Sung, Xiao-Long Wu, Hee-Seok Kim, Liwen Chang, Nasser Anssari, Izzat El Hajj, Abdul Dakkak, Steven Wu, Tom Jablin

Contributors: David Kirk, John Stratton, Isaac Gelado, John Stone, Javier Cabezas, Michael Garland

Web Resources

Website: https://www.coursera.org/course/hetero

Handouts and lecture slides/recordings

Sample textbook chapters, documentation, software resources

Web board discussions

Channel for electronic announcements

Forum for Q&A - the TAs and Professors read the board, and your classmates often have answers

Grading

Quizzes: 50%

Labs (Machine Problems): 50%

Academic Honesty

You are allowed and encouraged to discuss assignments with other students in the class. Getting verbal advice/help from people who've already taken the course is also fine.

Any copying of code is unacceptable

This includes reading someone else's code and then going off to write your own.

Giving/receiving help on a quiz is unacceptable

Recommended Textbook/Notes

D. Kirk and W. Hwu, "Programming Massively Parallel Processors: A Hands-on Approach," Morgan Kaufmann Publishers, 2010, ISBN 978-0123814722

We will be using a pre-public-release of the 2nd Edition, made available to Coursera students at a special discount: http://store.elsevier.com/specialOffer.jsp?offerId=EST_PROG

Lab assignments will have accompanying notes

NVIDIA, CUDA C Programming Guide, Version 4.0, NVIDIA, 2011 (reference book)

This course evolved from UIUC's ECE498AL (now ECE408/CS483) into this Coursera offering.

Tentative Schedule

Week 1
Lecture 0: Course Overview
Lecture 1: Intro to Hetero Computing
Lecture 2: Intro to CUDA C
Lab-1: installation, vector addition (see the sketch after this schedule)

Week 2
Lecture 3: Data Parallelism Model
Lecture 4: CUDA Memory Model
Lab-2: simple matrix multiplication

Week 3
Lecture 5: Tiling and Locality
Lecture 6: Convolution
Lab-3: Tiled matrix multiplication

Week 4
Lecture 7: Tiled Convolution
Lecture 8: Reduction Trees
Lab-3: Tiled matrix multiplication

Week 5
Lecture 9: Streams and Contexts
Lecture 10: Hetero Clusters
Lab 4: Tiled convolution

Week 6
Lecture 11: OpenCL, OpenACC
Lecture 12: Thrust, C++AMP
Lecture 13: Summary
Lab 4: Tiled convolution
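As a preview of Lab-1's vector-addition topic, here is a minimal CUDA C sketch. It is only an illustration under assumed names (vecAdd, the h_/d_ prefixes), not the official lab skeleton, and it omits error checking for brevity:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Each thread adds one element; the bounds check handles n not
   divisible by the block size. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;                /* one million elements */
    size_t bytes = n * sizeof(float);

    /* Allocate and initialize host arrays. */
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    /* Allocate device arrays and copy the inputs over. */
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    /* Launch enough blocks to cover all n elements. */
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %.1f (expect 3.0)\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}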

Lecture 1.1: Introduction to Heterogeneous Parallel Computing

Heterogeneous Parallel Computing

Use the best match for the job (heterogeneity in a mobile SoC)

UIUC Blue Waters Supercomputer

Cray system & storage cabinets: >300
Compute nodes: >25,000
Usable storage bandwidth: >1 TB/s
System memory: >1.5 Petabytes
Memory per core module: 4 GB
Gemini interconnect topology: 3D Torus
Usable storage: >25 Petabytes
Peak performance: >11.5 Petaflops
Number of AMD Interlagos processors: >49,000
Number of AMD x86 core modules: >380,000
Number of NVIDIA Kepler GPUs: >3,000

CPUs and GPUs have very different design philosophies

CPUs: Latency Oriented Design

Large caches: Convert long latency memory accesses to short latency cache accesses

Sophisticated control

Branch prediction for reduced branch latency

Data forwarding for reduced data latency

Powerful ALU

Reduced operation latency

GPUs: Throughput Oriented Design

Small caches

To boost memory throughput

Simple control

No branch prediction

No data forwarding

Energy-efficient ALUs

Many ALUs, each long-latency but heavily pipelined for high throughput

Requires a massive number of threads to tolerate latencies
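To make the last point concrete, here is a small hedged sketch of the grid-stride-loop idiom (our illustration; the kernel name scale is assumed, not from the course). The launch exposes far more threads than there are ALUs, so while one warp waits on a memory access the scheduler issues arithmetic from another:

/* Grid-stride loop: each thread starts at its global index and strides by
   the total thread count, so any grid size covers all n elements while
   oversubscribing the machine enough to hide memory and ALU latency. */
__global__ void scale(float *x, float alpha, int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        x[i] = alpha * x[i];
}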

Winning Applications Use Both CPU and GPU

CPUs for sequential parts where latency matters

CPUs can be 10+X faster than GPUs for sequential code

GPUs for parallel parts where throughput wins

GPUs can be 10+X faster than CPUs for parallel code

Heterogeneous parallel computing is catching on

280 submissions to GPU Computing Gems and 90 articles included in two volumes.

Financial Analysis

Scientific Simulation

Engineering Simulation

Data Intensive Analytics

Medical Imaging

Digital Audio Processing

Computer Vision

Digital Video Processing

Biomedical Informatics

Electronic Design Automation

Statistical Modeling

Ray Tracing Rendering

Interactive Physics

Numerical Methods

Lecture 1.2: Software Cost in Heterogeneous Parallel Computing

Software Dominates System Cost

SW lines per chip increase at 2x/10 months

HW gates per chip increase at 2x/18 months
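As a rough worked comparison (our arithmetic, not a figure from the lecture): over five years (60 months), software growing 2x every 10 months expands by 2^(60/10) = 64x, while hardware growing 2x every 18 months expands by only 2^(60/18), roughly 10x, so code volume outruns gate count by about a factor of six.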

Future systems must minimize software redevelopment

(Figure published by IBM in 2010.)

Keys to Software Cost Control

Scalability

The same application runs efficiently on new generations of cores

The same application runs efficiently on more of the same cores

Portability

The same application runs efficiently on different types of cores

The same application runs efficiently on systems with different organizations and interfaces

Scalability and Portability

Performance growth with HW generations

Increasing number of compute units

Increasing number of threads

Increasing vector length

Increasing pipeline depth

Increasing DRAM burst size

Increasing number of DRAM channels

Increasing data movement latency

Portability across many different HW types

Multi-core CPUs vs. many-core GPUs

VLIW vs. SIMD vs. threading

Shared memory vs. distributed memory

The programming style we use in this course supports both scalability and portability through advanced tools.
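As one concrete, hedged illustration of that claim: the CUDA runtime lets a program size its grid from the device actually present, so the same code spreads across however many compute units a future part provides. The helper below is our own sketch (launch_scaled and the blocks-per-SM heuristic are assumptions; it reuses the grid-stride scale kernel sketched earlier):

#include <cuda_runtime.h>

/* The grid-stride kernel sketched earlier. */
__global__ void scale(float *x, float alpha, int n);

void launch_scaled(float *d_x, float alpha, int n)
{
    /* Query the device we are running on instead of hard-coding a size. */
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    /* A few blocks per multiprocessor is a common occupancy heuristic;
       the grid-stride loop keeps any grid size correct. */
    int threadsPerBlock = 256;
    int blocks = prop.multiProcessorCount * 8;
    scale<<<blocks, threadsPerBlock>>>(d_x, alpha, n);
}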