
NVIDIA CUDA Examples

CUDA (originally Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming interface that lets software use GPUs for accelerated general-purpose processing, an approach known as GPGPU. NVIDIA GPUs are built on the CUDA architecture, the scheme by which NVIDIA designs GPUs that handle both traditional graphics rendering and general-purpose computation, and applications written on the CUDA platform run across many generations of GPU architectures, including future ones. The NVIDIA CUDA Toolkit provides the development environment for creating high-performance, GPU-accelerated applications: you accelerate C or C++ code by moving its computationally intensive portions to the GPU, and you can develop, optimize, and deploy on embedded systems, desktop workstations, enterprise data centers, cloud platforms, and supercomputers. The CUDA compiler, nvcc, ships with the toolkit; CUDA code is typically stored in files with a .cu extension, and compiling a CUDA program is similar to compiling a C program.

The toolkit and its SDK include more than 100 code samples, utilities, whitepapers, and additional documentation to help you get started developing, porting, and optimizing applications for the CUDA architecture. The samples cover a wide range of applications and techniques, from simple ones such as C++ code integration and efficient loading and handling of custom data types to advanced examples such as image convolution and Black-Scholes and binomial options pricing, and the sample code is released free of charge for use in derivative works, whether academic, commercial, or personal. As of CUDA 11.6 the samples are maintained only in the NVIDIA/cuda-samples repository on GitHub rather than shipping inside the toolkit, and the companion CUDA Library Samples (NVIDIA/CUDALibrarySamples on GitHub) are released as open source under the 3-clause "New" BSD license. Through the samples you can exercise the latest hardware and driver features, including cooperative groups, Tensor Cores, managed memory, and direct-to-shared-memory loads, and containerized samples such as vectorAdd (vector addition) and nbody (a gravitational n-body simulation) can be used to validate the software configuration of GPUs.

For documentation, the CUDA C++ Programming Guide describes how to use the toolkit to obtain the best performance from NVIDIA GPUs, and the CUDA C++ Best Practices Guide collects best practices for the most important features. Beyond the available CUDA books, the CUDA Toolkit page, CUDA posts on the NVIDIA technical blog, and the CUDA documentation pages carry up-to-date information on the most recent CUDA versions and features. One introductory slide deck, Heterogeneous Computing (NVIDIA, 2011), assumes only a background in C or C++ and covers what you need to start programming in CUDA C; its 1D stencil example opens with #include <iostream>, #include <algorithm>, #define N 1024, and #define RADIUS 3.

The introductory post "An Even Easier Introduction to CUDA C++" teaches the basics by writing a simple program that allocates two arrays of numbers in memory accessible to the GPU and then adds them together on the GPU. In CUDA terminology, invoking such a function on the device is called a kernel launch, and the meaning of execution-configuration parameters such as (1,1) is discussed later in that tutorial.
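The code below is a minimal sketch in the spirit of that introduction, not the post's exact listing: it allocates two arrays with cudaMallocManaged so they are accessible from both the CPU and the GPU, launches an add kernel, and checks the result on the host.

```cpp
#include <cstdio>
#include <math.h>
#include <cuda_runtime.h>

// Kernel: add the elements of two arrays, one thread per element.
__global__ void add(int n, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] + y[i];
}

int main() {
    int n = 1 << 20;
    float *x, *y;

    // Unified memory is accessible from both the CPU and the GPU.
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // Launch enough 256-thread blocks to cover all n elements.
    int blocks = (n + 255) / 256;
    add<<<blocks, 256>>>(n, x, y);
    cudaDeviceSynchronize();

    // Every element of y should now be 3.0f.
    float maxError = 0.0f;
    for (int i = 0; i < n; ++i) maxError = fmaxf(maxError, fabsf(y[i] - 3.0f));
    printf("Max error: %f\n", maxError);

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```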
To compile the related SAXPY example, save the code in a file with a .cu extension, say saxpy.cu, and compile it with nvcc: nvcc -o saxpy saxpy.cu. We can then run the code, and ./saxpy prints Max error: 0.000000. On Windows, the packaged samples install under C:\ProgramData\NVIDIA Corporation\CUDA Samples\v11.4\<sample_dir>\; to build or examine all of the samples at once, use the complete solution files, and to build or examine a single sample, use its individual solution file. Several Windows samples demonstrate CUDA-DirectX interoperability, and building them requires Microsoft Visual Studio 2012 or higher, which provides the Microsoft Windows SDK for Windows 8.1.

CUDA is also accessible from Python. With CUDA Python and Numba you get the best of both worlds: rapid iterative development in Python and the speed of a compiled language targeting both CPUs and NVIDIA GPUs. To get started with Numba, download and install the Anaconda Python distribution, which includes many popular packages (NumPy, SciPy, Matplotlib, IPython, and so on). To tell Python that a function is a CUDA kernel, simply add the @cuda.jit decorator before its definition. One tutorial starts by writing a function that adds 0.5 to each cell of a 1D array; another demonstrates a simple Mandelbrot set kernel whose mandel_kernel function uses the cuda.threadIdx, cuda.blockIdx, cuda.blockDim, and cuda.gridDim structures provided by Numba to compute the global X and Y pixel indices for each thread. The profiler allows the same level of investigation as with CUDA C++ code, and C# developers have a similar experience: Mandelbrot C# code can be profiled in the CUDA source view, where the C# code is linked to the generated PTX.
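The "add 0.5 to each cell" function described above comes from a Numba tutorial written in Python; since the rest of this page is CUDA C++, here is an equivalent sketch in CUDA C++. The launch helper and its names are illustrative only and are not part of the original tutorial.

```cpp
#include <cuda_runtime.h>

// Equivalent of the "add 0.5 to every cell" kernel described above,
// written in CUDA C++ (the original tutorial expresses it with Numba's @cuda.jit).
__global__ void add_half(float *data, int n) {
    // Grid-stride loop so any launch configuration covers the whole array.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        data[i] += 0.5f;
    }
}

// Hypothetical helper showing a typical launch; d_data must already be a device pointer.
void add_half_on_device(float *d_data, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add_half<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();
}
```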
Several samples exist mainly to illustrate programming principles. The matrix multiplication sample implements exactly the same algorithm as Chapter 6 of the programming guide; it has been written for clarity of exposition, not with the goal of providing the most performant generic kernel for matrix multiplication. Its kernels map threads to matrix elements using a Cartesian (x, y) mapping rather than a row/column mapping to simplify the meaning of the automatic variables in CUDA C: threadIdx.x is horizontal and threadIdx.y is vertical. In one of the shared memory examples, shared memory is used to facilitate global memory coalescing on older CUDA devices (compute capability 1.1 or earlier); optimal global memory coalescing is achieved for both reads and writes because global memory is always accessed through the linear, aligned index t. The full code for that example is provided on GitHub.

To understand the hardware you are running on, note that many device attributes are available in the cudaDeviceProp structure; two important fields are major and minor, which together give the device's compute capability. The compute capability of a particular GPU should not be confused with the CUDA version (for example, CUDA 7.5, CUDA 8, CUDA 9), which is the version of the CUDA software platform. You could extend a small program to print out all of this data, but the deviceQuery sample provided with the CUDA samples already does so: it enumerates the properties of the CUDA devices present in the system and displays them in a human-readable format.
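For reference, a stripped-down version of that idea is sketched below; it prints only a few of the fields that deviceQuery reports and is not the sample itself.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Enumerate CUDA devices and print a handful of cudaDeviceProp fields.
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s\n", dev, prop.name);
        printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  Global memory:      %.1f GiB\n",
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        printf("  Multiprocessors:    %d\n", prop.multiProcessorCount);
    }
    return 0;
}
```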
For Tensor Core programming, a CUDA sample demonstrates a GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced in CUDA 9, employing the Tensor Cores first shipped in the Volta chip family for faster matrix operations; this key capability enables Volta to deliver 3X performance speedups in training and inference over the previous generation. The example is not tuned for high performance and mostly serves as a demonstration of the API, reaching roughly 72.5% of peak compute FLOP/s as written. For optimizations you might apply to get better performance, see the cudaTensorCoreGemm sample in the CUDA Toolkit; for the highest performance in production code, use cuBLAS. CUTLASS 1.0 is also available as open source at the CUTLASS repository (note that 1.0 changed substantially from the preview release described in the original announcement).

More broadly, simple examples like these can achieve very high bandwidth on GPUs; such computations are bandwidth-bound, but GPUs also excel at heavily compute-bound workloads such as dense matrix linear algebra, deep learning, image and signal processing, and physical simulations. CUDA 9 introduced Cooperative Groups, which extends the CUDA programming model so kernels can dynamically organize groups of threads; making synchronization an explicit part of the program ensures safety, maintainability, and modularity. CUDA Graphs reduce kernel launch overheads, and conditional IF nodes let a graph branch at run time: whenever an IF node is evaluated, its body graph is executed once if the condition is non-zero, and the samples show what you can do with conditional nodes. With the stream-ordered allocator, a call to cudaMalloc or cuMemCreate can cause CUDA to free unused memory from any memory pool associated with the device in the same process to serve the request, which is especially helpful when an application uses multiple libraries, some of which use cudaMallocAsync and some of which do not. For multi-GPU and multi-node programs, a CUDA-aware MPI implementation must handle each buffer differently depending on whether it resides in host or device memory; this is more efficient than staging buffers through host memory, as demonstrated with performance numbers from a CUDA+MPI Jacobi solver example.
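As a small illustration of the stream-ordered allocator mentioned above, the sketch below allocates a temporary buffer with cudaMallocAsync, uses it in a kernel, and frees it back to the pool on the same stream. The kernel and helper function are made up for this example, and cudaMallocAsync requires CUDA 11.2 or newer.

```cpp
#include <cuda_runtime.h>

__global__ void scale(float *p, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= s;
}

// Minimal sketch of stream-ordered allocation: the allocation, the kernel
// that uses it, and the free are all ordered on the same stream, and the
// freed memory returns to the stream's memory pool for reuse.
void scale_temp_buffer(int n, cudaStream_t stream) {
    float *tmp = nullptr;
    cudaMallocAsync(&tmp, n * sizeof(float), stream);
    cudaMemsetAsync(tmp, 0, n * sizeof(float), stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(tmp, n, 2.0f);
    cudaFreeAsync(tmp, stream);  // returned to the pool in stream order
}
```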
These examples, along with the NVIDIA deep learning software stack, are provided in a monthly updated Docker container on the NGC container registry (https://ngc.nvidia.com). Developers, researchers, and data scientists get easy access to NVIDIA-optimized deep learning framework containers with examples that are performance-tuned and tested for NVIDIA GPUs, which eliminates the need to manage packages and dependencies or to build frameworks from source; the containers include the latest NVIDIA examples from the repository and the latest NVIDIA contributions shared upstream to the respective frameworks, and all samples are optimized to take advantage of Tensor Cores and have been tested for accuracy and convergence. Popular generative AI reference workflows, optimized for accelerated infrastructure and a microservice architecture, are published in NVIDIA/GenerativeAIExamples, and you can access these reference implementations through NVIDIA NGC and GitHub.

The NVIDIA Container Runtime is a GPU-aware container runtime compatible with the Open Containers Initiative (OCI) specification used by Docker, CRI-O, and other popular container technologies, and a collection of containers exists specifically to run CUDA workloads on GPUs and validate their software configuration. Running, for example, nvidia-docker run --rm -ti nvidia/cuda:7.0 nvcc --version first reports "Unable to find image 'nvidia/cuda:7.0' locally", pulls the image layers (the Ubuntu base image plus the CUDA 7.0 toolkit), and then prints the compiler version. In the cloud, the NVIDIA-maintained CUDA Amazon Machine Image (AMI) on AWS comes pre-installed with CUDA and is available for use today, alongside other NVIDIA AMIs on AWS. On Windows, the CUDA on WSL User Guide covers NVIDIA GPU-accelerated computing on WSL 2; Windows Subsystem for Linux is a Windows feature that lets users run native Linux applications, containers, and command-line tools directly on Windows 11 and later OS builds. For verification on a cluster, you can run a simple CUDA sample that adds two vectors together by creating a file such as cuda-vectoradd.yaml that describes a pod running the vectorAdd sample (the original guide lists the exact contents) and submitting it to a GPU node.
On the library side, NVIDIA CUDA-X Libraries, built on CUDA, deliver dramatically higher performance than CPU-only alternatives across application domains including AI and high-performance computing, and NVIDIA CUDA-X AI is a complete deep learning software stack for building GPU-accelerated applications for conversational AI, recommendation systems, and computer vision, delivering leading performance on industry benchmarks such as MLPerf. To accelerate your applications you can call functions from drop-in libraries as well as develop custom applications in languages including C, C++, Fortran, and Python. The NVIDIA CUDA Deep Neural Network library (cuDNN) provides highly tuned implementations of standard routines such as forward and backward convolution, attention, matmul, pooling, and normalization; cuBLAS is the GPU-accelerated basic linear algebra (BLAS) library; and the CUDA Library Samples also cover cuRAND, NPP, nvJPEG, and nvCOMP, with more information available under GPU Accelerated Libraries.

RAPIDS cuDF is the CUDA DataFrame library for processing large amounts of data on an NVIDIA GPU; an introductory pandas tutorial discusses how cuDF is almost an in-place replacement for pandas and comes with a downloadable cuDF cheat sheet. A broader goal of the CUDA Python ecosystem work is to unify it around a single standard set of interfaces that provide full coverage of, and access to, the CUDA host APIs from Python. CUDA Fortran is designed to interoperate with other popular GPU programming models, including CUDA C, OpenACC, and OpenMP, and Julia's CUDAnative.jl has been benchmarked with implementations of several Rodinia benchmarks (CUDA C++ compiled with CUDA 8.0.61 for a GeForce GTX 1080 on Linux 4.9 with NVIDIA driver 375.66, against CUDAnative.jl 0.1 on Julia 0.6 with LLVM 3.9).

Farther afield, NVIDIA CUDA-Q is an open-source platform with a unified programming model for integrating and programming quantum processing units (QPUs), GPUs, and CPUs in one system. It provides a CUDA-like, kernel-based programming approach with a modern C++ focus, in which quantum device code is defined as standalone function objects or lambdas annotated with __qpu__ to indicate that they are compiled to and executed on the quantum device; it enables scalability and performance across heterogeneous QPU, CPU, GPU, and emulated quantum system elements, and examples illustrating application development are available in C++ and Python. In robotics, cuRobo runs on NVIDIA Jetson for embedded applications, results show it can generate motion plans within 100 ms (median) on a Jetson AGX Orin (an example integration is shown in that post's Figure 3), and NVIDIA Isaac Sim is used for rendering and examples.

Thrust is an open source project; it is available on GitHub and included in the NVIDIA HPC SDK and CUDA Toolkit, so if you have one of those SDKs installed, no additional installation or compiler flags are needed to use it, and Thrust is best learned through examples.
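In that spirit, here is a short illustrative Thrust sketch, written for this page rather than taken from the Thrust examples: it fills a device vector, squares each element on the GPU, and reduces the result without any hand-written kernels.

```cpp
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <cstdio>

int main() {
    // One million elements living in device memory.
    thrust::device_vector<float> v(1 << 20);
    thrust::sequence(v.begin(), v.end());                     // 0, 1, 2, ...

    // Square each element in place, then sum the results, all on the GPU.
    thrust::transform(v.begin(), v.end(), v.begin(), thrust::square<float>());
    float sum = thrust::reduce(v.begin(), v.end(), 0.0f, thrust::plus<float>());

    printf("sum of squares = %f\n", sum);
    return 0;
}
```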
For video, NVIDIA has provided hardware-accelerated video processing on GPUs for over a decade through the NVIDIA Video Codec SDK, a comprehensive set of APIs, high-performance tools, samples, and documentation for hardware-accelerated video encode and decode on Windows and Linux. A 1:N transcode-with-scaling example reads the file input.mp4 and transcodes it to two different H.264 videos at various output resolutions and bit rates; while using the GPU video encoder and decoder, the command also uses the scale_npp filter in FFmpeg to scale the decoded video output into the multiple desired resolutions.

For graphics, NVIDIA's CUDA driver supports DirectX on Microsoft platforms. Microsoft has announced DirectX Ray Tracing and NVIDIA has announced new hardware to take advantage of it, so now might be the time to look at real-time ray tracing. One option is the OptiX API, which uses CUDA as the shading language, offers CUDA interoperability, and accesses the Turing RT Cores for hardware acceleration; another is NVIDIA VKRay, a set of extensions that bring ray tracing functionality to the open, royalty-free Vulkan standard, so developers can confidently build Vulkan applications that take advantage of ray tracing, knowing that NVIDIA drivers fully support the extension. The CUDA Demo Suite contains pre-built applications that demonstrate the capabilities and details of NVIDIA GPUs; in the Supersonic Sled demo, which took advantage of PhysX, CUDA, DirectX 11, and 3D Vision, every moving object was physically simulated as the demo strapped you to a high-powered test rocket and hurtled you down a six-mile track in the Nevada desert at speeds in excess of 800 miles an hour. On the image-processing side, a plug-in example is based on the CUDA Toolkit Box Filter sample, adapted to perform multiple iterations for high quality and providing both a GPU pathway and a CPU fallback; performance is 12x faster than the single-core CPU fallback, and the sample also demonstrates self-profiling, displaying a console window with CPU and GPU timings.
On the tooling side, NVIDIA Nsight covers CUDA profiling, debugging, and optimization, and CUDA Developer Tools is a series of tutorial videos designed to get you started using the Nsight tools for CUDA development; NVIDIA Compute Sanitizer is a powerful companion that can save you time and effort while improving the reliability and performance of your CUDA applications, as explored in "Efficient CUDA Debugging: How to Hunt Bugs with NVIDIA Compute Sanitizer." The NVIDIA Deep Learning Institute (DLI) offers hands-on CUDA training through both fundamentals and advanced courses, you can learn using step-by-step instructions, video tutorials, and code samples, and how-to examples cover topics such as basic approaches to GPU computing, quickly integrating GPU acceleration into C and C++ applications, examining more deeply the various APIs available to CUDA applications, and best practices for the most important features. Teaching resources include Accelerated Computing with C/C++, Accelerate Applications on GPUs with OpenACC Directives, Accelerated Numerical Analysis Tools with GPUs, Drop-in Acceleration on GPUs with Libraries, and GPU Accelerated Computing with Python, for anyone who wants to learn more about accelerated computing on the Tesla platform and GPU computing with CUDA. For build integration, "A CUDA Example in CMake" walks through building CUDA with CMake, and its Listing 1 shows the CMake file for an example called particles. For cuFFT, the most common case is for developers to modify an existing CUDA routine (for example, filename.cu) to call cuFFT routines; in that case the include file cufft.h or cufftXt.h should be inserted into the .cu file and the library included in the link line.

The usual housekeeping applies. The CUDA Toolkit End User License Agreement covers the CUDA Toolkit, the CUDA Samples, the NVIDIA Display Driver, NVIDIA Nsight tools (Visual Studio Edition), and the associated documentation on CUDA APIs, programming model, and development tools, and by downloading and using the software you agree to fully comply with its terms. The release notes, the CUDA Features Archive (the list of CUDA features by release), and the CUDA Samples Reference Manual (TRM-06704-001 v11.4, January 2022) round out the documentation, together with the CUDA Quick Start Guide, which gives minimal first-steps instructions for getting CUDA running and verifying that a CUDA application can run on each supported platform. Versioned online documentation is available for CUDA Toolkit 12.0 through 12.6, older releases such as CUDA Toolkit 10.2 remain downloadable for Windows, Linux, and Mac OS X (only supported platforms are shown on the download page), and the CUDA Toolkit 7.0 Release Candidate was made available to NVIDIA Registered Developers; if you aren't a registered developer, register for free access at the NVIDIA Developer Zone. On compiler support, Table 1 of the Windows installation guide (Windows Compiler Support in CUDA 12.6) lists MSVC version 193x (Visual Studio 2022 17.x) as the supported compiler IDE, with native x86_64 builds supported (YES) and 32-bit-on-64-bit cross-compilation not supported; for GCC and Clang, the corresponding table gives the minimum and latest supported versions, and if your Linux distribution defaults to an older GCC toolchain it is recommended to upgrade to a newer toolchain for CUDA 11.0 or later. NVIDIA also notes that it is committed to ensuring its certification exams are respected and valued in the marketplace, holding NVIDIA Authorized Testing Partners (NATPs) accountable for taking appropriate steps to prevent and detect fraud and exam-security breaches, and the standard notice applies: the documentation is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. Among the people behind this material, Jason Sanders is a senior software engineer in the CUDA Platform group at NVIDIA who helped develop early releases of the CUDA system software and contributed to the OpenCL 1.0 specification, an industry standard for heterogeneous computing; Jake Hemstad is a senior developer technology engineer at NVIDIA who works on high-performance CUDA C++ software for accelerating data analytics, cares as much about developing high-quality software as about achieving optimal GPU performance, and advocates modern C++ design; and Greg Ruetsch is a senior applied engineer at NVIDIA who works on CUDA Fortran and performance optimization of HPC codes and holds a bachelor's degree in mechanical and aerospace engineering from Rutgers University and a Ph.D. in applied mathematics from Brown University.

Finally, developer-forum threads give a flavor of common questions. Texture memory is a frequent source of confusion: "I am very confused by the concept of texture memory. I want to find a simple example of using tex2D to read a 2D texture, mapping a 4-element linear array to a 2-by-2 2D texture, while avoiding cudaMallocPitch, CUDA arrays, and fancy channel descriptions; lots of code online is too complicated for me to understand and uses lots of parameters in its function calls. Can someone give or refer me to a really simple (texture memory for dummies) example of how texture is used and how it improves performance? The programming guide goes through some really difficult explanations without complete examples, and the documentation shows only a few subroutines, not the full program." Other threads cover driver and tooling trouble: a user on Ubuntu 18.04 with a GTX 650 Ti Boost, CUDA Toolkit 11, and 450.66 drivers spent two weeks suspecting the NVIDIA kernel modules because CUDA was somehow unable to communicate with the NVIDIA drivers; a cuda-gdb thread asks, "Does your application work without cuda-gdb? The log you posted suggests there might be an issue with the cudaLaunchKernelExC call"; a memory question reports that on Windows 7 with CUDA 7.5 and driver 354.13, even after cudaFree() has been called on all allocations and cudaDeviceReset() has been called, nvidia-smi still shows the allocated GPU memory in use while the application waits for a key press; and a cuBLASLt user trying the new library released with CUDA 10.1 can't get it working and is looking for working examples to modify, noting that the CUDA samples don't include one (even on GitHub).
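To answer the texture question in the simplest way I know, here is a small self-contained sketch written for this page, not taken from the forum thread: it copies a 4-element array into a 2x2 CUDA array, wraps it in a texture object, and reads it with tex2D. It does use a CUDA array, which the poster hoped to avoid, because that is the most straightforward correct path.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Read each texel of a 2x2 float texture with tex2D and print it.
__global__ void readTexture(cudaTextureObject_t tex, int width, int height) {
    int x = threadIdx.x, y = threadIdx.y;
    if (x < width && y < height) {
        // Add 0.5 to sample the center of the texel (unnormalized coordinates).
        float v = tex2D<float>(tex, x + 0.5f, y + 0.5f);
        printf("tex(%d, %d) = %f\n", x, y, v);
    }
}

int main() {
    const int width = 2, height = 2;
    float h_data[width * height] = {0.0f, 1.0f, 2.0f, 3.0f};  // the 4-element linear array

    // Copy the linear array into a 2D CUDA array.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &desc, width, height);
    cudaMemcpy2DToArray(arr, 0, 0, h_data, width * sizeof(float),
                        width * sizeof(float), height, cudaMemcpyHostToDevice);

    // Describe the resource and how it should be sampled.
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = arr;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.filterMode = cudaFilterModePoint;    // no interpolation
    texDesc.readMode = cudaReadModeElementType;  // return the raw float
    texDesc.normalizedCoords = 0;                // coordinates in texels

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);

    readTexture<<<1, dim3(width, height)>>>(tex, width, height);
    cudaDeviceSynchronize();

    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    return 0;
}
```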