"Fossies" - the Fresh Open Source Software Archive

Member "pocl-1.8/CHANGES" (12 Oct 2021, 32873 Bytes) of package /linux/misc/pocl-1.8.tar.gz:


    1 1.8 unreleased
    2 ==============
    3 
    4 Notable User Facing Changes
    5 ---------------------------
    6 
    7 - support for LLVM 13
    8 - CMake: Inter-Procedural Optimization is enabled on code of runtime library
    9   (libpocl.so is compiled with -flto on systems that support it).
   10 - LTTng tracing improved - more command types are traced, and also
   11   some synchronous API calls (like clCreateBuffer) are traced.
   12 - poclcc, tests and examples can be disabled with CMake options
   13 - Valgrind support improved by making Valgrind aware of pocl's
   14   reference counting of cl_* objects
   15 - kernels which are called by kernels are now force-inlined
   16 - Support for NetBSD.
   17 - Support for Unix systems without libdl.
   18 - PoCL can now (optionally) respond to SIGUSR2 by printing
   19   some live debug information.
   20 - improved SPIR support for CUDA devices
   21 
   22 Notable Bug Fixes
   23 -----------------
   24 
   25 - Fixed a potential crash on Unix systems without sysfs mounted.
   26 - Fixed compilation errors when building on macOS.
   27   - Fixed POCL_FAST_INIT macro; POCL_INIT_LOCK must be invoked with only one argument.
   28   - Fix bin/poclcc to not depend on OpenCL 2.0 symbols
   29 - Fixed miscompilation in kernel loops with multiple conditionals with barriers in them.
   30 
   31 Other
   32 -----
   33 - Add cmake options PARALLEL_COMPILE_JOBS, PARALLEL_LINK_JOBS to
   34   use ninja's separate compile and link job pools.
   35 
   36 - Improve memory architecture, buffer migration and allocation.
   37   Buffers are now allocated on a device when first used
   38   (previously each buffer was allocated on every device in context).
   39 
   40 - the single global LLVMContext was replaced with
   41   multiple LLVMContexts, one per OpenCL cl_context.
   42   OpenCL code can now be compiled in parallel
   43   when using separate cl_contexts. This feature
   44   is disabled by default since it significantly slowed
   45   down PyOpenCL. This should be resolved in the future by moving
   46   LLVM compilation into its own threads.
   47 
   48 - a new OpenCL extension was added to PoCL: cl_pocl_content_size.
   49   The extension allows the user to give an optimization hint to PoCL,
   50   which will be used internally by PoCL to optimize buffer transfers
   51   between multiple devices.
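
The switch from eager to on-first-use buffer allocation noted above can be
illustrated with a minimal C sketch. This is not PoCL's implementation:
sketch_buffer, its helper functions and the fixed device count are
hypothetical names chosen for illustration, and plain malloc() stands in
for a real device allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_NUM_DEVICES 4 /* hypothetical number of devices in a context */

/* Hypothetical buffer object: one (initially NULL) storage slot per device. */
typedef struct {
  size_t size;
  void *device_ptr[SKETCH_NUM_DEVICES];
} sketch_buffer;

/* Creating the buffer records its size but allocates nothing yet. */
void sketch_buffer_init(sketch_buffer *buf, size_t size) {
  buf->size = size;
  for (int d = 0; d < SKETCH_NUM_DEVICES; ++d)
    buf->device_ptr[d] = NULL;
}

/* Storage on device `dev` is allocated the first time that device uses
   the buffer; later calls return the same pointer. */
void *sketch_get_device_ptr(sketch_buffer *buf, int dev) {
  assert(dev >= 0 && dev < SKETCH_NUM_DEVICES);
  if (buf->device_ptr[dev] == NULL)
    buf->device_ptr[dev] = malloc(buf->size); /* stand-in for a device allocation */
  return buf->device_ptr[dev];
}

/* Counts how many devices currently hold storage for the buffer. */
int sketch_allocated_count(const sketch_buffer *buf) {
  int n = 0;
  for (int d = 0; d < SKETCH_NUM_DEVICES; ++d)
    n += (buf->device_ptr[d] != NULL);
  return n;
}

/* Releases all per-device storage. */
void sketch_buffer_destroy(sketch_buffer *buf) {
  for (int d = 0; d < SKETCH_NUM_DEVICES; ++d) {
    free(buf->device_ptr[d]);
    buf->device_ptr[d] = NULL;
  }
}
```

With eager allocation, all four slots would be filled at creation time; here
a buffer that is only ever used on one device occupies memory on that device
alone.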
   52 
   53 1.7 May 2021
   54 ============
   55 
   56 Notable User Facing Changes
   57 ---------------------------
   58 
   59 - Support for LLVM 12.
   60 - support for cross-compiling PoCL
   61 - Added support for the cl_nv_device_attribute_query extension on CUDA devices.
   62 - improved support for SPIR-V binaries when using CPU device:
   63   - improved local variables support
   64   - OpenCL 2.0 atomics are now supported
   65   - work_group_barrier, to_local/to_global are implemented
   66 - Implemented OpenCL 3.0 features
   67   - clGetDeviceInfo queries
   68     - CL_DEVICE_ATOMIC_MEMORY_CAPABILITIES (Minimal implementation)
   69     - CL_DEVICE_ATOMIC_FENCE_CAPABILITIES (Minimal implementation)
   70 
   71 1.6 December 2020
   72 =================
   73 
   74 Notable User Facing Changes
   75 ---------------------------
   76 
   77 - Support for LLVM 11.
   78 - CUDA kernels using constant __local blocks are now ABI incompatible with
   79   the previous release. Users need to delete their pocl cache.
   80 - SINGLE_LLVM_LIB CMake option removed. Instead, STATIC_LLVM is reintroduced
   81   and PoCL now relies on llvm-config to provide correct shared/static
   82   libraries for linkage.
   83 - improved debugging of OpenCL code with CPU driver.
   84   See doc/sphinx/source/debug.rst
   85 
   86 Optimizations
   87 -------------
   88 
   89 - Improved the PTX code generation for __local blocks. Previously constant
   90   __local blocks and __local arguments were using one dynamic shared CUDA
   91   memory block with offsets computed at runtime. Now, if there are no __local
   92   arguments, separate static shared CUDA memory is used. If there are
   93   __local arguments, the constant __local blocks are indexed with compile
   94   time constants. This improves the performance due to better SASS code
   95   generation because it avoids what appears to be a pointer aliasing issue.
   96 
   97   Running SHOC benchmark GEMM with size class 4 on a NVIDIA Titan X gives
   98   the following performance improvements.
   99 
  100     sgemm_n:        23.2%
  101     sgemm_t:        18.5%
  102     sgemm_n_pcie:   23.3%
  103     sgemm_t_pcie:   19.5%
  104     dgemm_n:        51.4%
  105     dgemm_t:         2.8%
  106     dgemm_n_pcie:   51.6%
  107     dgemm_t_pcie:    6.9%
  108 
  109 - Improved handling of command queue barriers: Previously an internal
  110   event was added from all previous commands to it, even with in-order
  111   queues, causing slowdown with applications that have a lot of commands.
  112   Now additional events are omitted in in-order queues. Measured with the
  113   PolyBench OpenCL Gramschmidt kernel, execution time went down from 44
  114   seconds to around 0.5 seconds.
  115 
  116 Notable Bug Fixes
  117 -----------------
  118 - Fix LLVM loop vectorizing remarks printing (POCL_VECTORIZER_REMARKS=1).
  119 - Fix an issue in which the loop vectorizer produced code with invalid
  120   memory reads (issue #757).
  121 - Fix compilation error when CMake option SINGLE_LLVM_LIB is set to OFF.
  122 - Fix a spurious dlerror message (Undefined symbol) printed after dlopen,
  123   caused by a previous libdl call in an ICD loader (issue #877).
  124 - [CPU] safety margin of pocl's CPU driver local memory allocation
  125   has been reduced to a much more reasonable value
  126 - [CPU] buffer size for OpenCL printf is now configurable with
  127   PRINTF_BUFFER_SIZE CMake variable
  128 - [CPU] local memory size reported is now the size of last level of non-shared
  129   data cache (usually L1 or L2 depending on CPU), if hwloc can determine it.
  130 
  131 Security
  132 --------
  133 
  134 - Added a build parameter HARDENING_ENABLE that applies hardening flags
  135   present in some modern compilers to produce a more secure libpocl.so,
  136   at some cost in performance.
  137 
  138 1.5 April 2020
  139 ==============
  140 
  141 Notable User Facing Changes
  142 ---------------------------
  143 - Support for LLVM 10.
  144 - POCL_TRACE_EVENT, POCL_TRACE_EVENT_OPT and POCL_TRACE_EVENT_FILTER
  145   environment variables were renamed to
  146   POCL_TRACING, POCL_TRACING_OPT and POCL_TRACING_FILTER, respectively.
  147 - Refactored the implementation of convert_T() OpenCL functions to better meet
  148   autovectorization criteria under LLVM, thus utilizing device's SIMD ISA
  149   capabilities where available; e.g. on an ARM64 Cortex-A72 convert_int8(short8)
  150   is 5.5x faster now when measured in a tight loop.
  151 - A lot of fixes.
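
For applications or launch wrappers that still set the old tracing variable
names, the rename above can be bridged with a small shim. The sketch below is
an illustration only: migrate_env() and migrate_pocl_tracing_env() are
hypothetical helpers, and it assumes PoCL reads these variables during its
initialization, so the shim has to run before the first OpenCL API call.

```c
#define _POSIX_C_SOURCE 200112L /* for setenv() */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* If only the old variable is set, copy its value to the new name.
   Returns 1 if a value was migrated, 0 if nothing was done, -1 on failure. */
int migrate_env(const char *old_name, const char *new_name) {
  assert(old_name != NULL && new_name != NULL);
  const char *old_val = getenv(old_name);
  if (old_val == NULL || getenv(new_name) != NULL)
    return 0; /* nothing to migrate, or the new name already wins */
  return setenv(new_name, old_val, 0) == 0 ? 1 : -1;
}

/* Applies the three renames listed in the 1.5 notes.
   Returns the number of migrated variables, or -1 on failure. */
int migrate_pocl_tracing_env(void) {
  static const char *const renames[][2] = {
    {"POCL_TRACE_EVENT", "POCL_TRACING"},
    {"POCL_TRACE_EVENT_OPT", "POCL_TRACING_OPT"},
    {"POCL_TRACE_EVENT_FILTER", "POCL_TRACING_FILTER"},
  };
  int migrated = 0;
  for (size_t i = 0; i < sizeof renames / sizeof renames[0]; ++i) {
    int r = migrate_env(renames[i][0], renames[i][1]);
    if (r < 0)
      return -1;
    migrated += r;
  }
  return migrated;
}
```

A value that is already present under the new name is never overwritten, so
the shim is safe to keep while both spellings are in circulation.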
  152 
  153 Usability
  154 ---------
  155 - Simple per-kernel execution time statistics, printed via atexit(), for quick
  156   and easy low-impact per-device profiling (relies purely on event time stamps).
  157   They can be enabled by setting the POCL_TRACING env variable to 'cq'.
  158 
  159 1.4 September 2019
  160 ==================
  161 
  162 Highlights
  163 ----------
  164 - pocl-accel: An example driver and support infrastructure for OpenCL 1.2
  165   CL_DEVICE_TYPE_CUSTOM hardware accelerators which implement a memory
  166   mapped control interface.
  167 
  168 - Improved SPIR and SPIR-V support. clCreateProgramWithIL() implemented,
  169   Kernel library (for CPU target) support for SPIR-mangling improved
  170 
  171 Kernel Compiler
  172 ---------------
  173 - Specialize work-group functions for global offset (0,0,0).
  174 - A pocl installation with clang, hwloc statically linked in
  175   is now relocatable.
  176 - Clang/LLVM versions older than 6.0 are no longer supported.
  177 - Create specialized work-group functions for small (defined by a device
  178   driver specific limit) grid dimensions.
  179 - Add Range Metadata to various ID queries etc. to improve vectorizing
  180   index computation to smaller lane widths and other optimizations.
  181 - Passes only the launched kernel to work-group generation and code gen, thus
  182   speeding up the compilation process.
  183 
  191 Notable Internal Changes
  192 ------------------------
  193 - Allow devices to utilize the ROCm-Device-Libs ocml
  194   builtins for their builtin libraries if seen fit.
  195   https://github.com/RadeonOpenCompute/ROCm-Device-Libs/tree/master/ocml
  196   was mirrored in lib/kernel, which makes it easy to cherry-pick
  197   implementations into a target's kernel library.
  198 
  199 - libltdl is replaced with libdl on UNIX platforms.
  200 
  201 Notable Bug Fixes
  202 -----------------
  203 - Fix a race condition in device initialization, which
  204   caused issues in applications that cause reinitialization
  205   of pocl device drivers (appeared in Glow's OCLTest).
  206 
  207 Device Driver Specific
  208 ----------------------
  209 - hsa-native: Downgraded the advertised version to 1.2
  210   which is closer to the truth (fixes OCLTest of Glow).
  211 - hsa-native: Add support for byval (struct) argument passing.
  212 - hsa-native: Allow offsets in block copy.
  213 
  214 1.3 April 2019
  215 ==============
  216 
  217 Highlights
  218 ----------
  219 - Support for Clang/LLVM 8.0.
  220 - Support ICD on OSX.
  221 
  222 Misc.
  223 -----
  224 - Ability to have size_t (basically derived from the largest supported
  225   object) smaller than CL_ADDRESS_BITS. This is an unofficial optional
  226   extension as the OpenCL standard mandates it to be the same.
  227 - POCL_EXTRA_BUILD_FLAGS can be used to force add extra build flags such as
  228   '-g' to all clBuildProgram() calls.
  229 - Allow building pocl without CPU backend drivers. When set to off,
  230   CPU will not appear in the list of OpenCL devices reported by pocl.
  231   Controllable via ENABLE_HOST_CPU_DEVICES=off cmake option.
  232 - Build logs are now produced also for illegal options passed to the kernel
  233   build e.g. via the options parameter of clBuildProgram().
  234 - hsa-native: Device side printf-support and alternative < 1.2 non-standard
  235   C99 printf exposing support.
  236 - pocl's binary format has been slightly updated (changes are listed at
  237   the top of the pocl_binary.c file) to version 7, but pocl can still read
  238   the previous version 6 format as well.
  239 - Allow local-size specialization also for SPMD-targeted kernels to enable
  240   compile-time optimization of code depending on the local dimensions.
  241 - Support older GLIBC versions.
  242 - HSA: Initial experimental support for native-ISA compilation on top of HSA
  243   runtime. Tested and works currently only on phsa-runtime. Can be enabled
  244   with ENABLE_HSAIL=off cmake option.
  245 - Add option to disable installing of OpenCL headers.
  246 
  247 Notable Bug Fixes
  248 -----------------
  249 - Fixed kernel debug symbol generation.
  250 - HSA: fix kernel caching.
  251 - Fix issue #661: clCreateImage doesn't fail with unsupported image type.
  252 - Fix issue #668: handle non-kernel functions with barriers properly.
  253 - Fix issue #671: Unable to build pocl with CUDA support with LLVM 7 and host GCC 8.2.
  254 - Fix image format/size handling with multiple devices in context.
  255 - Fix padding issue with context arrays that manifested as unaligned
  256   access errors after autovectorization.
  257 
  258 Notable Internal Changes
  259 ------------------------
  260 - Add group ids as hidden kernel arguments instead of digging them up
  261   from the context struct.
  262 - Ability to generate the final binary via separate assembly text + assembler
  263   call. Useful for supporting LLVM targets without direct binary emission
  264   support.
  265 - Use Clang's Driver API for launching the final linkage step. This way we
  266   utilize the toolchain registry with correct linkage steps required for
  267   the target at hand.
  268 - Add 'device_aux_functions' to the driver layer attributes. This can be used
  269   to retain device-specific functions required by the target across the
  270   pruning of unused globals.
  271 - The "default kernels" hack which was used to store kernel metadata, has
  272   been removed. Kernel metadata are now stored only once, in cl_program
  273   struct; every new cl_kernel struct holds only a pointer.
  274 - Major 'pthread' CPU driver cleanup.
  275 - Major Workgroup.cc cleanup.
  276 
  277 1.2 September 2018
  278 ==================
  279 
  280 - LLVM 7.0 is now supported.
  281 
  282 - Version 2.0 of hwloc library is supported.
  283 
  284 - device-side printf; more consistent printf output.
  285 
  286 1.1 March 2018
  287 ==============
  288 
  289 Highlights
  290 ----------
  291 
  292 - LLVM 6.0 is now supported.
  293 
  294 - Reintroduced experimental SPIR LLVM bitcode support to pocl.
  295   Requires LLVM 5 or newer. New experimental feature: SPIR-V support;
  296   requires a working llvm-spirv converter. Currently only loading
  297   of SPIR-V binaries by pocl is supported, not output.
  298   See docs/features.rst for more details.
  299 
  300 - Refactored pocl cache now does away with LLVM file locks and relies
  301   entirely on system calls for proper synchronization. Additionally,
  302   cache file writes are now fdatasync()ed.
  303 
  304 - Improved kernel compilation time (with cold cache). Improvement
  305   depends on sources - it's bigger for large programs with many kernels.
  306   Luxmark now compiles in seconds instead of dozens of seconds;
  307   internal pocl tests run in 30-50% less time.
  308 
  309 - LLVM Scalarizer pass is now only called for SPMD devices. Performance
  310   change varies across tests, but positive seems to outweigh negative.
  311 
  312 - Implemented uninitialization callback for device drivers. This is
  313   triggered when the last cl_context is released. Currently only the
  314   CPU driver implements the callback.
  315 
  316 - Removed libpoclu from installed files; this library contains helpers
  317   for pocl's internal tests, and from installed files was only used by
  318   poclcc, which has been updated to not rely on it.
  319 
  320 - POCL_MAX_WORK_GROUP_SIZE is now respected by all devices. This variable
  321   limits the reported maximum WG sizes & dimensions; tuning max WG size
  322   may improve performance due to cache locality improvement.
  323 
  324 - CL_PLATFORM_VERSION now contains much more information about how
  325   pocl was built.
  326 
  327 - For users still building with Vecmathlib, performance should be back
  328   to levels of pocl 0.14 (there was a huge drop caused by a change
  329   in -O0 optimization level of LLVM 5.0).
  330 
  331 - Improved support for ARM and ARM64 architectures. All internal tests
  332   now pass (on Cortex-A53 and Cortex-A15), although it's still far
  333   from full conformance.
  334 
  335 
  336 1.0 December 2017
  337 =================
  338 
  339 Highlights
  340 ----------
  341 - Improved automatic local work-group sizing on kernel enqueue, taking
  342   into account standard constraints, SIMD width for vectorization as
  343   well as the number of compute units available on the device.
  344 - Support for NVIDIA GPUs via a new CUDA backend (currently experimental).
  345 - Removed support for BBVectorizer.
  346 - LLVM 5.0 is now supported.
  347 - A few build options have been added for distribution builds,
  348   see README.packaging.
  349 - Somewhat improved scalability in the CPU driver. CPUs with many cores
  350   and programs using a lot of WIs with small kernels can run somewhat faster.
  351 - The OpenCL 1.2 conformance tests now pass with selected CPUs. There are some
  352   caveats though - see the documentation.
  353 - When conformance is enabled, some kernel library functions might be
  354   slower than in previous releases.
  355 - Pocl now reports OpenCL 1.2 instead of 2.0, except HSA enabled builds.
  356 - Updated format of pocl binaries, which is NOT backwards compatible.
  357   You'll need to clean any kernel caches.
  358 - Fixed several memory leaks.
  359 - Unresolved symbols (missing/misspelled functions etc) in a kernel will
  360   result in error in clBuildProgram() instead of pocl silently ignoring
  361   them and then aborting at dlopen().
  362 - New env variable POCL_MEMORY_LIMIT=<num> limits the Global memory size
  363   reported by pocl to <num> gigabytes.
  364 - New env variable POCL_AFFINITY (defaults to 0): if enabled, sets
  365   the affinity of each CPU driver pthread to a single core.
  366 - Improved AVX512 support (with LLVM 5.0). Note that even with LLVM 5.0
  367   there are still a few bugs (see pocl issue #555); AVX512 + LLVM 4.0 are
  368   a lot more broken, and probably not worth trying.
  369 - POCL_DEBUG env var has been revamped. You can now limit debuginfo to
  370   these categories (or their combination): all,error,warning,general,
  371   memory,llvm,events,cache,locking,refcounts,timing,hsa,tce,cuda
  372   The old setting POCL_DEBUG=1 now equals error+warning+general.
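
As a minimal sketch of selecting debug categories from inside a host program
rather than from the shell: set_pocl_debug() below is a hypothetical helper,
and the sketch assumes that categories are combined with commas as in the
list above and that pocl reads POCL_DEBUG during its initialization, so the
call has to precede the first OpenCL API call.

```c
#define _POSIX_C_SOURCE 200112L /* for setenv() */
#include <assert.h>
#include <stdlib.h>

/* Selects pocl debug categories for the current process.
   `categories` is a comma-separated list such as "error,warning,general".
   Returns 0 on success, -1 on failure (as setenv() does). */
int set_pocl_debug(const char *categories) {
  assert(categories != NULL);
  return setenv("POCL_DEBUG", categories, 1);
}
```

Calling set_pocl_debug("error,warning,general") requests the same output that
POCL_DEBUG=1 now stands for, while set_pocl_debug("llvm,cache") would limit
the messages to those two categories.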
  373 
  374 
  375 0.14 April 2017
  376 ===============
  377 
  378 Highlights
  379 ----------
  380 - Support for LLVM/Clang versions 3.9 and 4.0. Version 3.9 was the first
  381   release to include all frontend features for OpenCL 2.0.
  382 - Ability to build pocl in a mode without online compilation support,
  383   for running on hosts without LLVM using binaries compiled offline,
  384   e.g. with poclcc.
  385 - pocl's binary format now can contain all the necessary bits to
  386   execute the programs on a host without online compiler support.
  387 - Initial support for out-of-order execution of command queues.
  388 - It's now possible to cross-compile pocl when building an offline
  389   compiler build.
  390 - New driver api extension to support out-of-order and asynchronous
  391   devices/drivers.
  392 - Pthread and HSA drivers are now fully asynchronous.
  393 - CMake is now the only supported build system; autotools was removed.
  394 - LTTng tracing support
  395 
  396 OpenCL Runtime/Platform API support
  397 -----------------------------------
  398 - implemented clEnqueueBarrierWithWaitList
  399 - implemented clEnqueueMigrateMemObjects
  400 
  401 Other
  402 -----
  403 - Support for reqd_work_group_size attribute in the binary format and poclcc:
  404   Generates a static sized work-group function to help optimizations
  405   such as autovectorization.
  406 - HSA: added support for phsa (https://github.com/HSAFoundation/phsa)
  407 - A lot of bug and memory leak fixes. Some notable ones:
  408   - Issue #1, passing aggregates as kernel value parameters, can
  409     now be fixed with an LLVM patch.
  410   - Now it's possible to build pocl without using the fake address
  411     space ids, which were a source of many annoying issues.
  412 
  413 0.13 April 2016
  414 ===============
  415 
  416 Highlights
  417 -----------
  418 - Support for LLVM/Clang 3.8
  419 - initial (partial) OpenCL 2.0 support
  420   (only Shared Virtual Memory and Atomics are supported ATM)
  421 - CMake build system almost on parity with autotools
  422   (TCE, all external testsuites)
  423 - CMake build is now able to build multiple kernel libraries
  424   for different CPUs and let pocl select a suitable one at runtime
  425 
  426 Bugfixes
  427 ---------
  428 - clEnqueueCopyImage() now works properly
  429 - improved file locking (much less disk access to kernel cache)
  430 - Address spaces of structs are handled properly
  431 
  432 Other
  433 ------
  434 - removed custom buffer alloc from pthread device
  435 - removed IBM Cell support
  436 - removed support for older LLVM versions (before 3.7)
  437 - significantly higher performance with a lot of small kernel enqueues
  438   (due to improved file locking)
  439 - vecmathlib now supports AVX2
  440 - a few more HSA kernel library implementations: l/tgamma, erf(c), hypot
  441 - implemented OpenCL 2.0 API calls: clEnqueueSVM*, clSVMalloc/free,
  442   clEnqueueFillBuffer, clSetKernelExecInfo, clSetKernelArgSVMPointer,
  443   clCreateCommandQueueWithProperties - no device side queues yet
  444 - OpenCL 2.0 atomics (C11 atomics subset) for x86-64 and HSA
  445 - new testsuites: AMD SDK 3.0, Intel SVM
  446 - New CMake-only testsuites: ASL, clBLAS, clFFT, arrayfire
  447 - more debugging info (timing, mem stats)
  448 - ansi colors with POCL_DEBUG=1 if the output is a terminal
  449 
  450 
  451 0.12 October 2015
  452 =================
  453 
  454 Highlights
  455 ----------
  456 - Support for HSA-compliant devices (kernel agents). The GPU of AMD Kaveri
  457   now works through pocl with a bunch of test cases in the AMD SDK 2.9 example
  458   suite.
  459 - New and improved kernel cache system that enables caching
  460   kernels with #includes.
  461 - Support for LLVM/Clang 3.7.
  462 - Little endian MIPS32 now passes almost all pocl testsuite tests.
  463 
  464 OpenCL Runtime/Platform API support
  465 -----------------------------------
  466 - Transferred buffer read/write/copy offset calculation to device driver side.
  467   - these driver api functions have changed; got offset as a new argument.
  468 - Maximum allocation is not limited to 1/4th of total memory size.
  469 - Maximum image dimensions grow to fit maximum allocation.
  470 - clGetDeviceInfo() reports better information about CPU vendor and cache.
  471 - experimental clCreateSubDevices() for pthread CPU device.
  472 
  473 OpenCL C Builtin Function Implementations
  474 -----------------------------------------
  475 - Implemented get_image_dim().
  476 
  477 Bugfixes
  478 --------
  479 - Avoid infinite loops when users recycle an event waiting list.
  480 - Correctly report the base address alignment.
  481 - Lots of others.
  482 
  483 Misc
  484 ----
  485 - Tests now using new cl2.hpp, removing dependency on OpenGL headers
  486 
  487 
  488 0.11 March 2015
  489 ===============
  490 
  491 Highlights
  492 ----------
  493 - Support for LLVM/Clang 3.6
  494 - Kernel compiler cache.
  495 - Android support.
  496 
  497 Kernel compiler
  498 ---------------
  499 - Do not add implicit barriers to kernels without WG barriers
  500   to avoid WI context data overheads.
  501 - Setting the POCL_VECTORIZER_REMARKS env to 1 prints out the remarks
  502   produced by the LLVM loop vectorizer during kernel compilation.
  503 - Implicit work-group vectorizer improvements.
  506 
  507 OpenCL Runtime/Platform API support
  508 -----------------------------------
  509 - Minimal initial implementation for clCreateSubDevices()
  510 
  511 Bugfixes
  512 --------
  513 - Fix falsely detecting operations with side-effects (especially atomic
  514   operations) as uniform. This caused deadlock/race situations due to 
  515   illegal implicit barrier injection.
  516 - Fix several reference counting issues.
  517 - Memory leak fixes.
  518 - ARM/openSUSE build fixes.
  519 - Plenty of CMake fixes.
  520 
  521 New test/example cases
  522 ----------------------
  523 - Several Halide examples using its OpenCL backend added.
  524 - CloverLeaf
  525 
  526 Misc.
  527 -----
  528 - The old WIVectorizer (a BBVectorizer fork) was removed due to bit rot and
  529   its general hackiness.
  530 - Experimental Windows/Visual Studio support (in progress).
  531 - Initial support for MIPS architecture (with known issues).
  532 - Runtime debug printouts that can be enabled via POCL_DEBUG=1.
  533 - Streamlined the buffer allocation and fixed several issues with it.
  534 
  535  
  536 0.10 September 2014
  537 ===================
  538 
  539 This lists only the most interesting changes. Please
  540 refer to the version control log for a full listing.
  541 
  542 Highlights
  543 ----------
  544 - Support for LLVM/Clang 3.5 
  545 - Support for building using CMake (experimental with known issues).
  546 
  547 Bugfixes
  548 --------
  549 - TCE: kernel building was broken when running pocl
  550   from install location 
  551 - thread-safety (as required since OpenCL 1.1) improved
  552 
  553 Kernel compiler
  554 ---------------
  555 - Final code generation now done via LLVM API calls instead of
  556   calling the llc binary.
  557 - Sensible linking of functions from the monolithic kernel built-in
  558   library. Major compilation speedup for smaller kernels.
  559 
  560 OpenCL C Builtin Function Implementations
  561 -----------------------------------------
  562 - Improved support for halfN functions.
  563 - ilogb and ldexp available with vecmathlib
  564 
  565 OpenCL Runtime/Platform API support
  566 -----------------------------------
  567 - Implement clCreateKernelsInProgram()
  568 - OpenCL-C shuffle() and shuffle2() implementation added
  569 - Device probing modified to allow for device driver to detect device during
  570   runtime. POCL_DEVICES still supported.
  571 - Checks in clSetKernelArg() for argument validity
  572 - Checks in clEnqueueNDRangeKernel() for arguments to be all set
  573 - Implement clGetKernelArgInfo()
  574 - Implement clEnqueueCopyImage()
  575 
  576 Misc
  577 ----
  578 - ViennaCL testsuite updated to 1.5.1
  579 
  580 0.9 January 2014
  581 ================
  582 
  583 This lists only the most interesting changes. Please
  584 refer to the version control log for a full listing.
  585 
  586 Highlights
  587 ----------
  588 - Major improvements to the kernel compiler's vectorization 
  589   performance. Twofold speedups in some benchmarks
  590 - Support for most of the piglit CL tests
  591 
  592 OpenCL Runtime/Platform API support
  593 -----------------------------------
  594 - clCreateImage2D() and clCreateImage3D() implementation moved to 
  595   clCreateImage()
  596 - Image creation now uses clCreateBuffer() 
  597 - clBuildProgram: Propagate the supported -cl* compiler options to Clang's 
  598   OpenCL frontend. 
  599 - clFinish: works with commands with event wait lists.
  600 - Preliminary support for OpenCL 2.0 blocks
  601 - Added support for clEnqueueNativeKernel()
  602 
  603 Builtin Function Implementations (OpenCL 1.2 Section 6.12)
  604 ----------------------------------------------------------
  605 - Refactored read/write_image()-functions to support refactored device image 
  606   object. (Only functions used by SimpleImage test)
  607 - Introduced new macro based implementation for read/write_image()-functions
  608 - Added sampler implementation for CLK_ADDRESS_CLAMP and 
  609   CLK_ADDRESS_CLAMP_TO_EDGE (Only integer coords supported)
  610 - Most of the printf() format strings now work. Missing features:
  611   - long on 32-bit architectures
  612 
  613 Performance Improvements
  614 ------------------------
  615 - Kernel compiler now tries to avoid replicating uniform variables,
  616   this leads to less context data to be saved per work-item and cleaner
  617   kernel bitcode for later optimizations
  618 - Use a precompiled header for OpenCL C builtin declarations to speed up
  619   the kernel compilation
  620 - Kernel compiler vectorization optimizations:
  621   - Inject implicit barriers both to loop starts and ends to
  622     horizontally vectorize the inner loop.
  623   - Reduce "peeling" by minimizing the conditional barrier region
  624     by injecting implicit barrier close to the branch points for
  625     conditional barrier cases.
  626   - Breaking of vector datatypes for more efficient loop 
  627     vectorization.
  628   - Support LLVM 3.4 parallel loop metadata.
  629 
  630 Misc
  631 ----
  632 - Explicitly specify the target architecture/CPU for the 
  633   kernel compiler.
  634 - Kernel compiler frontend defaults to implementation using LLVM API
  635   directly instead of the scripts. 
  636 - __OPENCL_VERSION__ defined to 120
  637 - poclu: helpers for converting between the C float and OpenCL cl_half
  638   types
  639 - clEnqueueNativeKernel implemented
  640 - Static and cmake-builds of LLVM can now be used. 
  641 
  642 Bugfixes
  643 --------
  644 - Correct isequal, isnan, and similar routines
  645 
  646 0.8  August 2013
  647 ================
  648 
  649 This lists only the most interesting changes. Please
  650 refer to the version control log for a full listing.
  651 
  652 Overall
  653 -------
  654 
  655 - Added support for LLVM/Clang 3.3.
  656 - Dropped support for LLVM/Clang v3.1.
  657 - Removed the dependency on llvm-ld (which had been copied into the
  658   pocl tree as pocl-llvm-ld). Now uses llvm-link instead.
  659 - Project renamed to Portable Computing Language (pocl).
  660 - Luxmark v2.0 now works.
  661 - x86_64 can now use efficient math built-in function
  662   implementations from the vecmathlib project to avoid libm 
  663   calls and to exploit the SIMD instructions more efficiently
  664   in case of vector datatypes in the kernel.
  665 - Parallelize kernel inner loops "horizontally", if possible.
  666   This converts possibly sequential inner kernel loops to parallel 
  667   loops by effectively performing "loop interchange" of the
  668   work-item loop and the kernel's inner loop.
  669 - Added VexCL tests to the test suite. All but one of them
  670   work with pocl.
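
The "horizontal" inner-loop parallelization mentioned above can be pictured
with a plain C model of a work-group function. This is a sketch under stated
assumptions, not pocl's generated code: WG_SIZE, the function names and the
example kernel (a weighted sum per work-item) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define WG_SIZE 8 /* hypothetical work-group size */

/* Work-item loop outermost: for each work-item, the kernel's inner k-loop
   runs sequentially from start to end. */
void wg_wi_loop_outer(const int *in, const int *w, int *out, size_t k_count) {
  assert(in != NULL && w != NULL && out != NULL);
  for (size_t wi = 0; wi < WG_SIZE; ++wi) {
    int acc = 0;
    for (size_t k = 0; k < k_count; ++k)
      acc += in[k * WG_SIZE + wi] * w[k];
    out[wi] = acc;
  }
}

/* After loop interchange: the kernel's k-loop is outermost and the
   work-item loop is innermost. Iterations of the inner loop are independent
   of each other, which makes it a natural candidate for vectorization. The
   per-work-item accumulator becomes an array of per-work-item context data. */
void wg_interchanged(const int *in, const int *w, int *out, size_t k_count) {
  assert(in != NULL && w != NULL && out != NULL);
  int acc[WG_SIZE] = {0};
  for (size_t k = 0; k < k_count; ++k)
    for (size_t wi = 0; wi < WG_SIZE; ++wi)
      acc[wi] += in[k * WG_SIZE + wi] * w[k];
  for (size_t wi = 0; wi < WG_SIZE; ++wi)
    out[wi] = acc[wi];
}
```

Both functions compute the same result; the interchange is only legal when,
as here, the trip count of the inner kernel loop is the same for every
work-item.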
  671 
  672 Major bugfixes
  673 --------------
  674 - Fixed passing NULL as a buffer argument to clSetKernelArg
  675   (this time with a regression test added).
  676 - Constant BitCast expressions broken to variables to avoid
  677   crashing when copying a kernel with casts on automatic
  678   local pointers.
  679 - Fixes for i386/i686. Tested on Pentium4/Ubuntu 10.04 LTS.
  680 - Lots of API error checking added (found by the Piglit testing suite).
  681 - Fixed bug in select producing incorrect results when the third
  682   conditional argument is an unsigned scalar or vector.
  683 - Replaced deprecated SSE 4.1 assembly mnemonics in x86-64 min/max
  684   kernel functions that have since been removed in more recent
  685   versions of gas and llvm-as.
  686 - SPIR/LLVM IR 'byval' attributes are now handled correctly on
  687   kernel function arguments, allowing for structs and oversized
  688   vectors to be passed in with value semantics.
  689 - Fixed to work with the latest Khronos OpenCL headers for 1.2.
  690   Some issues fixed with the new cl.hpp.
  691 - The ICD dispatch table was too small which might have caused
  692   "interesting" behavior when calling the later functions in
  693   the table and not using ocl-icd as the dispatcher.
  694 - Several kernel compiler bugs fixed.
  695 - A multithreaded host application could free the same object
  696   multiple times due to a race issue.
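
The select() fix above concerns a builtin whose condition is interpreted
differently for scalars and vectors. The notes do not say what exactly was
wrong, so the following is only a reference model of the semantics defined
by the OpenCL C specification, written in plain C with hypothetical function
names; it is not pocl's kernel library code.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar select(a, b, c): any non-zero c selects b. */
uint32_t select_scalar_u32(uint32_t a, uint32_t b, uint32_t c) {
  return c ? b : a;
}

/* Vector select(a, b, c): per component, only the most significant bit of
   c[i] decides. Treating an unsigned c[i] like a boolean, or testing it
   with a signed "< 0" comparison, gives wrong results. */
void select_vec_u32(const uint32_t *a, const uint32_t *b, const uint32_t *c,
                    uint32_t *out, int n) {
  assert(n >= 0);
  for (int i = 0; i < n; ++i)
    out[i] = (c[i] & UINT32_C(0x80000000)) ? b[i] : a[i];
}
```

Note that a component value of 1 selects b in the scalar case but a in the
vector case, because its most significant bit is clear.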
  697 
  698 Platform Layer implementations (OpenCL 1.2 Chapter 4)
  699 -----------------------------------------------------
  700 - Return correctly formatted CL_DEVICE_VERSION and
  701   CL_DEVICE_OPENCL_C_VERSION.
  702 - clGetDeviceInfo: Use the 'cpufreq' sys interface of Linux for 
  703   querying the CPU clock frequency, if available.
  704 
  705 The OpenCL Runtime (OpenCL 1.2 Chapter 5)
  706 -----------------------------------------
  707 - clGetEventInfo: Support querying the command type, command queue,
  708   and the reference count of the event.
  709 
  710 Builtin Function Implementations (OpenCL 1.2 Section 6.12)
  711 ----------------------------------------------------------
  712 - convert_type* builtins now generated with a Python script by
  713   Victor Oliveira.
  714 - length() fingerprint was assuming two arguments instead of one.
  715 - The kernel bitcode library is now optimized when built in pocl. Speeds
  716   up kernel optimization for cases which use the kernel functions
  717   a lot.
  718 - Fix mul_hi() implementation
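
For reference, OpenCL C defines mul_hi(x, y) as the high half of the full-width product of x and y. The following host-side C sketch shows that semantics for the 32-bit case only. It is not pocl's implementation, and the ref_* function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* High 32 bits of the 64-bit unsigned product. */
uint32_t ref_mul_hi_u32(uint32_t x, uint32_t y)
{
    return (uint32_t)(((uint64_t)x * (uint64_t)y) >> 32);
}

/* High 32 bits of the 64-bit signed product.
   Assumes >> on a negative int64_t is an arithmetic shift, which is
   implementation-defined in C but holds on common compilers. */
int32_t ref_mul_hi_i32(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * (int64_t)y) >> 32);
}

/* Spot checks against values worked out by hand. */
void ref_mul_hi_self_check(void)
{
    assert(ref_mul_hi_u32(0xFFFFFFFFu, 0xFFFFFFFFu) == 0xFFFFFFFEu);
    assert(ref_mul_hi_i32(INT32_MIN, 2) == -1);
}
```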
  719 
  720 ICD
  721 ---
  722 - Fixed pocl tests to work when executed through the Khronos 
  723   supplied icd loader (needs a patch applied to the loader to be able to
  724   override the .icd search path).
  725 
  726 Misc.
  727 -----
  728 - Fix to the helper script search logic:
  729   Search from the BUILDDIR only if env POCL_BUILDING is defined.
  730   Otherwise search from PKGDATADIR first, then from the PATH.
  731 - Fixed memory leaks in clCreateContext* and clCreateKernel
  732 - Ensured that stored arguments are adequately aligned in
  733   clSetKernelArg and clEnqueueNDRangeKernel.
  734 
  735 0.7  January 2013
  736 =================
  737 
  738 This lists only the most interesting changes. Please
  739 refer to the version control log for a full listing.
  740 
  741 Overall
  742 -------
  743 - Support for LLVM 3.2.
  744 - Multi-WI work group functions can now be generated
  745   using loops which are only partially unrolled. Reduces
  746   code size explosion with large WGs in comparison to
  747   the full replication method.
  748 - PowerPC 64 support (tested on Cell/Debian Sid/PS3).
  749 - PowerPC 32 support (tested on Cell/Debian Sid/PS3).
  750 - ARM v7 support (on Linux)
  751 - Beginning of Cell SPU support (very experimental!).
  752 - Most of the AMD APP SDK OpenCL examples now work and have been
  753   added to the pocl test suite.
  754 - Most of the Parboil benchmark cases added to the test
  755   suite.
  756 
  757 Kernel Compiler Passes
  758 ----------------------
  759 - Several miscompilations and compiler crashes fixed.
  760 - Fixed multiple bugs in the work group vectorizer.
  761 - Updated the metadata format pocl uses to pass information
  762   to vectorization and the TCE backend to simplify debugging.
  763 - Kernel pointer arguments are no longer always marked 'noalias'
  764   (restricted). Doing so previously was a misunderstanding of the specs.
  765 - ConstantGEPs to static variables generated from automatic
  766   locals caused problems. They are now converted to normal GEPs
  767   using a pass from the SAFECode project.
  768 
  769 OpenCL Platform Layer implementations (OpenCL 1.2 Chapter 4)
  770 ------------------------------------------------------------
  771 - clGetDeviceInfo now uses the hwloc lib for device property
  772   queries. Many new queries implemented.
  773 - clGetKernelInfo (initial implementation)
  774 - clGetMemObjectInfo (initial implementation)
  775 - clGetCommandQueueInfo (initial implementation)
  776 - clReleaseDevice
  777 - clRetainDevice
  778 - Proper freeing of devices in clReleaseContext
  779 
  780 The OpenCL Runtime Implementations (OpenCL 1.2 Chapter 5)
  781 ---------------------------------------------------------
  782 - clBuildProgram: support for passing options to the compiler.
  783 - clEnqueueMarker
  784 
  785 OpenCL C Builtin Function Implementations (OpenCL 1.2 Section 6.12)
  786 -------------------------------------------------------------------
  787 - Atomic Functions (6.12.11)
  788 - get_global_offset() was not linked correctly
  789 
  790 Framework
  791 ---------
  792 
  793 - Made it possible to override the .cl -> .bc build command
  794   called by clBuildProgram per device.
  795 
  796 Device Drivers
  797 --------------
  798 
  799 - pthread/basic:
  800   * extract CPU clock frequency from /proc/cpuinfo, if available
  801   * return cl_khr_fp64 if doubles supported by the CPU
  802 - ttasim: support for explicitly calling custom/special operations
  803   through the vendor extensions API
  804 
  805 Misc.
  806 -----
  807 
  808 - Fixes for MacOSX builds.
  809 - Fixed passing NULL as a buffer argument to clSetKernelArg
  810 - Fixed a major bug when launching the same kernel multiple times:
  811   the arguments were not copied to the command object.
  812 - Fixed several issues with ICD; it is now considered stable enough
  813   to be used by default.
  814 
  815 0.6   August 2012
  816 =================
  817 
  818 Kernel library
  819 --------------
  820 
  821 - Added initial optimized kernel library for X86_64/SSE.
  822 - Preliminary support for ARM architectures on Linux
  823   (briefly tested on MeeGo/Nokia N9).
  824 
  825 Pthread device driver
  826 ---------------------
  827 
  828 - Multithreading at the work group granularity using pthreads.
  829 - Tries to figure out the optimal maximum number of
  830   threads for the system based on the available hardware
  831   threads. Currently works only in Linux using the
  832   /proc/cpuinfo interface.
  833 - Region-based customized memory allocator for speeding up buffer
  834   allocations.
  835 
  836 Kernel compiler
  837 ---------------
  838 
  839 - Most of the tricky work group barrier cases (barriers inside
  840   for-loops etc) now supported.
  841 - Support for local variables, also automatic locals.
  842 - Reuse previous compilation results, if available.
  843 - Automatic vectorization of work groups (multiple work items
  844   in parallel).
  845 
  846 Miscellaneous
  847 -------------
  848 - Installable Client Driver (icd) support.
  849 - Event profiling support (incomplete, works only for kernel and
  850   buffer read/write/map/unmap events).
  851 
  852 Known issues
  853 ------------
  854 
  855 - Non-pointer struct kernel arguments fail due to varying ABIs
  856   * https://bugs.launchpad.net/pocl/+bug/987905
  857 - Always produces "fully unrolled" chains of work items for
  858   work groups causing code size explosion for large WGs.
  859