Accelerating Contourlet Transform

Do and Vetterli [1] proposed the contourlet transform (CT), a multi-scale and directional representation of images, to overcome deficiencies of previously proposed transforms such as the wavelet and curvelet transforms. Many applications that used other transforms have since adopted the CT to take advantage of its higher performance. The widespread usage of the CT and today’s real-time needs demand faster execution of the transform. Accelerated solutions exist, but their lack of portability or high computational cost makes them disadvantageous in real-time applications. We therefore use modern GPUs for acceleration: the GPU is well suited to data-parallel computations such as the CT.


Major steps in our proposed GPU-based implementation:

  1. Parallelism extraction with complexity analysis of CT
  2. GPU-based implementation


In the first step, we concentrated on the CT structure. The CT consists of two major parts, each built from several blocks [1]:

  1. Laplacian Pyramid (LP)
  • Analysis and synthesis filters
  • Downsampling and upsampling blocks
  2. Directional Filter Bank (DFB)
  • Two-channel quincunx filter banks
  • Shearing operator: reordering of the image samples
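As a CPU reference for the LP blocks above, one analysis level can be sketched as lowpass filtering, downsampling by two in each dimension, and keeping the residual between the original and a prediction from the coarse image. This is only an illustrative sketch: the 3×3 box filter and nearest-neighbour predictor are placeholder assumptions, not the filters used in [1].

```c
/* Sketch of one Laplacian-pyramid analysis level (CPU reference).
 * A 3x3 box lowpass stands in for the actual CT filters [1]. */
static float lowpass_at(const float *img, int w, int h, int x, int y) {
    float s = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int xx = (x + dx + w) % w;      /* periodic extension */
            int yy = (y + dy + h) % h;
            s += img[yy * w + xx];
        }
    return s / 9.0f;
}

/* coarse: (w/2)x(h/2) downsampled lowpass; residual: w x h bandpass */
void lp_analysis(const float *img, int w, int h,
                 float *coarse, float *residual) {
    for (int y = 0; y < h / 2; ++y)
        for (int x = 0; x < w / 2; ++x)
            coarse[y * (w / 2) + x] = lowpass_at(img, w, h, 2 * x, 2 * y);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            /* nearest-neighbour upsampling as a simple predictor */
            float pred = coarse[(y / 2) * (w / 2) + (x / 2)];
            residual[y * w + x] = img[y * w + x] - pred;
        }
}
```

Each output sample depends only on a small input neighbourhood, which is what makes every one of these blocks a candidate for one-thread-per-pixel parallelization.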


Basic considerations for first step:

  • Parallelizing time-consuming filtering steps
  • Reducing the number of transfers between CPU (host) and GPU (device)
  • Parallelizing upsampling, downsampling and reordering steps
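The reordering step parallelizes trivially because it is a pure index permutation: every output sample is read from exactly one input sample. A minimal CPU sketch of such a row-shear reorder follows; the shear direction and amount here are illustrative assumptions, not the exact DFB operator.

```c
/* Sketch of a shearing reorder: row y is cyclically shifted by y
 * samples. In the GPU version each thread writes one output pixel,
 * so the whole pass is embarrassingly parallel. The exact shear used
 * by the DFB differs; this only illustrates the access pattern. */
void shear_rows(const float *in, float *out, int w, int h) {
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int xs = (x + y) % w;          /* sheared column, wrapped */
            out[y * w + x] = in[y * w + xs];
        }
}
```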


GPU implementation:

  1. Copy data into the GPU memory space:
  • The input image is placed in global memory, as in Fig. 1.
  • The filter kernel is placed in constant memory.
  • The final results are written back to global memory.
  2. Call the kernel using 16×16 thread blocks for memory-coalescing considerations.
  3. Optimize the implementation:
  • Memory alignment
  • Loop unrolling
  • Tiling of boundary cases
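The kernel-call step maps each output pixel to one thread in a 16×16 block, so consecutive threads in x touch consecutive addresses and global-memory accesses stay coalesced. Below is a CPU sketch of that per-thread index arithmetic: the two outer loop pairs stand in for the CUDA `blockIdx`/`threadIdx` grid, and `filt` stands in for the filter held in constant memory on the GPU.

```c
#define TILE 16

/* CPU stand-in for the CUDA launch: the outer loop pairs play the
 * role of blockIdx and threadIdx. Each "thread" computes one output
 * pixel of a 3x3 convolution; the boundary guard handles partial
 * tiles at the image edges. */
void conv3x3_tiled(const float *in, float *out,
                   const float filt[9], int w, int h) {
    for (int by = 0; by < (h + TILE - 1) / TILE; ++by)
    for (int bx = 0; bx < (w + TILE - 1) / TILE; ++bx)
        for (int ty = 0; ty < TILE; ++ty)
        for (int tx = 0; tx < TILE; ++tx) {
            int x = bx * TILE + tx;           /* global pixel coords */
            int y = by * TILE + ty;
            if (x >= w || y >= h) continue;   /* boundary-tile guard */
            float s = 0.0f;
            for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                int xx = (x + dx + w) % w;    /* periodic extension */
                int yy = (y + dy + h) % h;
                s += filt[(dy + 1) * 3 + (dx + 1)] * in[yy * w + xx];
            }
            out[y * w + x] = s;
        }
}
```

In the real kernel the tile would additionally be staged through shared memory; the sketch keeps only the indexing that the coalescing argument depends on.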


Figure 1. Periodic extended image formation inside GPU global memory using
thread blocks; the extended image is shown on the right.
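The extension of Fig. 1 can be formed explicitly: each thread fills one pixel of the padded image, locating its source pixel through modular arithmetic. A CPU sketch of that mapping follows; the border width `r` (the filter radius) is an assumed parameter.

```c
/* Build a periodically extended copy of a w x h image with a border
 * of r pixels on every side, as in Fig. 1. On the GPU each thread
 * fills one pixel of the (w+2r) x (h+2r) output in global memory. */
void periodic_extend(const float *in, float *ext, int w, int h, int r) {
    int we = w + 2 * r;
    int he = h + 2 * r;
    for (int y = 0; y < he; ++y)
        for (int x = 0; x < we; ++x) {
            int sx = ((x - r) % w + w) % w;   /* wrap into [0, w) */
            int sy = ((y - r) % h + h) % h;
            ext[y * we + x] = in[sy * w + sx];
        }
}
```

Precomputing the extended image once lets the filtering kernels drop per-pixel wrap-around tests from their inner loops.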


For a thorough assessment of our method, several experiments were conducted using C, CUDA C, and the MATLAB CT toolbox [4]. Comparisons were made between a CPU implementation of the CT and the parallelized CUDA implementation on different GPUs. Our experimental platform is an Intel Core 2 Duo CPU host with 4 GB of memory, equipped with an NVIDIA GeForce GTX 570 with 480 cores. An alternative GPU (GeForce 610M) was also used to extend the comparison of results. Experimental results show that, with existing GPUs, CT execution achieves more than a 19x speedup over the non-parallel CPU-based method. Computing the transform of a 512×512 image takes approximately 40 ms, which should be sufficient for real-time applications.

Figure 2. Execution times for contourlet transform decomposition.


Figure 3. Execution time comparison between the GPU implementation and the C and MATLAB
implementations of the contourlet transform on a CPU.



[1] M. N. Do and M. Vetterli, “The contourlet transform: An efficient directional multiresolution image representation,” IEEE Transactions on Image Processing, vol. 14, pp. 2091–2106, Dec. 2005.

[2] S. Katsigiannis, G. Papaioannou, and D. Maroulis, “A Contourlet Transform based algorithm for real-time video encoding,” SPIE Photonics Europe, Real-Time Image and Video Processing Conference, Brussels, Belgium, 16-19 April 2012.

[3] Y. Wei, Y. Zhu, F. Zhao, Y. Shi, and T. Mo, “Implementing Contourlet Transform for Medical Image Fusion on a Heterogeneous Platform,” International Conference on Scalable Computing and Communications, Dalian, pp. 115-120, 25-27 September 2009.

[4] M. N. Do, “Contourlet toolbox,” Available at: minhdo/software/contourlettoolbox.tar


Extracted Papers:

M. Mohrekesh, S. Azizi and S. Samavi, “Accelerating GPU implementation of contourlet transform,” in Proceedings of the Iranian Conference on Machine Vision and Image Processing (MVIP), September 2013.
