Coarse Grain Parallelization of the H.264 Decoder without a Start-Code Scanner
Date Issued
2011
Date
2011
Author(s)
Gurhanli, Ahmet
Abstract
Fine-grain methods for parallelizing the H.264 decoder offer good latency and lower
memory usage. However, they cannot match the scalability of coarse-grain approaches,
even assuming a well-designed entropy decoder that can feed the increasing number of
cores working in parallel. We introduce a GOP (Group of Pictures) level approach, chosen
for its high scalability, and discuss solutions to its well-known memory and latency
issues. Our design removes the need for the GOP start-code scanner used in earlier
methods, so that all cores can work on the decoding task. Our experiments showed that
memory initialization operations may substantially degrade the scalability of parallel
applications. The multicore cache architecture proved critical to achieving the desired
speedup. For FHD-resolution video, we observed a speedup of 7.51 with 8 processors having
separate caches, and a speedup of 14.46 with 15 processors when a cache is shared by
2 processors.
Subjects
Parallel Programming
Video Compression
High Performance Computing
Type
thesis
File(s)
Name
ntu-100-D95943040-1.pdf
Size
23.32 KB
Format
Adobe PDF
Checksum
(MD5):0c02c0e2e7def7a40363c0f9333b942a