H.264 compression - learn and use in practice

H.264 compression has become a standard in digital video recorders (DVRs). The most advanced devices let the user control the compression ratio and many parameters of the video stream sent to the network. To understand the impact of these settings and their allowed ranges, users should know the general principles of this video compression method.
The general principles of H.264 compression
H.264 is currently the most popular and most efficient video compression standard. The first version was released in 2003. H.264, also known as MPEG-4 Part 10 or AVC, was originally designed for transmitting video over networks, e.g. video conferences and movies. It has also found application in the compression of high definition video, including Blu-ray discs.
The high performance of H.264 is the result of prediction and estimation methods for selected frames of the video. A normal, uncompressed video signal consists of video frames displayed one after another in the correct time sequence.
Displaying uncompressed video

In H.264 there are 3 types of frames: I (Intra-coded), P (Predictive) and B (Bi-predictive). The I frames contain complete picture information. The P frames carry information about changes relative to adjacent P or I frames, allowing the picture to be reconstructed. The B frames complement the information about image changes over time and are intended to smooth the transition between two P frames or between an I and a P frame. The size of each frame depends on many factors, but it may be assumed that a P frame is about 60% of the size of an I frame, and a B frame may be as small as 10% of the size of an I frame.
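The effect of these proportions on the stream size can be sketched with a few lines of Python. This is only a rough illustration: the 60% and 10% figures are the rules of thumb quoted above, while the function name and the 12-frame sequence are hypothetical and do not describe the pattern used by any particular DVR.

```python
# Relative frame sizes quoted in the text (rules of thumb, not codec constants).
RELATIVE_SIZE = {"I": 1.0, "P": 0.6, "B": 0.1}


def relative_stream_size(frame_sequence: str) -> float:
    """Average frame size relative to sending every frame as an I frame."""
    total = sum(RELATIVE_SIZE[frame_type] for frame_type in frame_sequence)
    return total / len(frame_sequence)


# Hypothetical 12-frame sequence, chosen only for illustration.
example_sequence = "IBBPBBPBBPBB"
print(f"Relative size: {relative_stream_size(example_sequence):.2f}")
```

Under these assumptions the example sequence needs roughly a third of the data that twelve I frames would require.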
Compressed video stream

Decompression sequence

The H.264 algorithm is complicated. There are several types and degrees of compression, and the decoding of the stream may require considerable computing power. In standalone devices such as DVRs, the user interfaces are generally designed to allow for simple and fast setup. It should be remembered that:
  • the more motion in the camera view, the more data is required for a good estimate;
  • a lower number of I frames results in a less accurate picture of the actual situation;
  • in addition to the video data, the encapsulation in packets increases the total size of the transmitted data.
Control of H.264 in practice
Below, we discuss the most common parameters that the user can select and adjust. The demonstration device is the Hikvision DS-8108HDI-S DVR, but the description also applies to other DVRs based on H.264 compression.
The Hikvision DVR provides the possibility to change the following picture settings (remote configuration using iVMS4000 software):
Remote setting>Video Parameters window of Hikvision DS-8108HDI-S DVR
Encoding Parameters - most DVRs can send several independent streams to the network. If the upload bandwidth is too low, or the download bandwidth at the monitoring station is insufficient, a stream with lower requirements can be selected. The parameters of the "Main Stream" are identical to those of local recording.
Stream Type (Video or Audio & Video) - with this option the administrator can decide whether or not to send audio to the network. Audio recording options are available only from the local menu. The audio stream is negligibly small compared to the video stream, so this option can be used as additional protection against sending unwanted (audio) data, or simply as a sound on/off switch.
Resolution - number of pixels in one complete frame. This parameter of DVRs is usually given in the form of an acronym, based on the following unit:
  • CIF (Common Intermediate Format): 352x288 pixels (0.1 MP)
So the acronyms are multiples of the unit, with the numbers of pixels equal to:
  • 2CIF: 704x288 (202752 pixels, ca. 0.2 MP)
  • DCIF (Double CIF): 528x384 (202752 pixels, ca. 0.2 MP). The number of pixels is identical to 2CIF, but the image has a different aspect ratio.
  • 4CIF: 704x576 (0.4 MP) - currently regarded as the maximum resolution of analog CCTV.
  • (D1: 720x576 - often confused and used interchangeably with 4CIF).
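The pixel counts in the list above can be checked with a short script. It is a minimal sketch that uses only the dimensions given in the list; the dictionary and function names are illustrative.

```python
# Frame dimensions (width, height) in pixels, as listed in the text.
FORMATS = {
    "CIF": (352, 288),
    "2CIF": (704, 288),
    "DCIF": (528, 384),
    "4CIF": (704, 576),
    "D1": (720, 576),
}


def pixel_count(name: str) -> int:
    """Number of pixels in one complete frame of the given format."""
    width, height = FORMATS[name]
    return width * height


cif_pixels = pixel_count("CIF")
for name, (width, height) in FORMATS.items():
    pixels = pixel_count(name)
    # Show the size in megapixels and as a multiple of the CIF unit.
    print(f"{name:5s} {width}x{height}: {pixels} pixels, "
          f"{pixels / 1e6:.1f} MP, {pixels / cif_pixels:.2f} x CIF")
```

The output confirms that 2CIF and DCIF contain the same number of pixels, and that D1 is only slightly larger than 4CIF.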
The D1 resolution has its origins in the document issued by the ITU-R (International Telecommunication Union - Radiocommunication Sector) commonly known as Rec. 601 (BT.601). This document, issued in 1982, was designed to define the method of digitizing an analog interlaced video signal. One of the specified parameters was the YCbCr matrix defining the image information (luminance and two chrominance components related to the blue and red colors). The standard defined that each line contains 720 luminance samples and 360 samples of each chrominance component. The D1 resolution in the analog PAL system, first implemented in Sony and Bosch devices, means 720x576 pixels. After some time, it was replaced by the more common 4CIF resolution.
One of the basic parameters of an analog CCTV camera or monitor is horizontal resolution, specified as the maximum number of alternating light and dark vertical lines that can be resolved over a horizontal span equal to the height of the picture. It should not be confused with the number of horizontal scanning lines of a broadcast television system. A PAL camera generates 625 interlaced horizontal lines, 576 of which are active (carry video information), 25 times each second, as complete frames each consisting of two fields (odd and even) of 312.5 lines, scanned at a field frequency of 50 Hz. Thus, the horizontal scanning rate is 15625 Hz (625 x 25, or 312.5 x 50).
To make optimal use of the standard, with its 576 active horizontal lines and the classic 4:3 aspect ratio, the proportional number of horizontal points should reach 768 (576 x 4/3). This follows from the fact that conventional TV systems were originally designed to achieve equal horizontal and vertical resolution ("square pixels"). However, the actual vertical resolution is reduced by the Kell factor of about 0.7 (the scanning lines are not ideally aligned with the picture details).
Horizontal resolution (in TVL) is defined as the number of vertical black and white lines that can be distinguished over a span of 3/4 of the image width (equal to the image height).
CCTV cameras are available with various numbers of TV lines. The video from an analog camera is digitized to one of the CIF formats (CIF, 2CIF, etc.) with the same aspect ratio (4:3), but with pixels taking the place of scan lines. For good quality, higher resolutions (especially 4CIF) require cameras with a greater number of TV lines, matching the higher number of pixels (more than 528 TVL for 4CIF, i.e. 3/4 of 704) and providing more detailed images for conversion into digital form.
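The arithmetic behind these figures is simple enough to reproduce. The sketch below assumes only the PAL values and the 4:3 aspect ratio given in the text; the function names are illustrative.

```python
from fractions import Fraction

ASPECT_RATIO = Fraction(4, 3)   # classic TV aspect ratio (width : height)
PAL_TOTAL_LINES = 625           # lines per complete frame
PAL_ACTIVE_LINES = 576          # lines that carry video information
PAL_FRAMES_PER_SECOND = 25


def square_pixel_width(active_lines: int) -> int:
    """Horizontal points needed for equal horizontal and vertical resolution."""
    return int(active_lines * ASPECT_RATIO)


def matching_tvl(width_px: int) -> int:
    """TV lines matching a pixel width: TVL is counted over 3/4 of the width."""
    return int(width_px / ASPECT_RATIO)


print("Horizontal scanning rate:", PAL_TOTAL_LINES * PAL_FRAMES_PER_SECOND, "Hz")
print("Square-pixel width for PAL:", square_pixel_width(PAL_ACTIVE_LINES))
print("TVL matching 4CIF (704 px wide):", matching_tvl(704))
print("TVL matching CIF (352 px wide):", matching_tvl(352))
```

By the same rule, a camera feeding a CIF channel needs far fewer TV lines than one feeding a 4CIF channel.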
Bitrate Type - variable or fixed. With variable bitrate, the DVR is able to detect parameters of the analog signal from a camera and adjust the compression ratio (to increase or decrease the number of bits needed to render the video for one second of recording). This optimizes the quality of the recorded and/or transmitted images and the size of the data.
Max Bitrate - specifies the maximum number of bits per second that can be reached by the compressed video stream. This value, due to the nature of the H.264 compression, is an estimate. Some devices force minimum values of this parameter.
Frame Rate - determines the recording speed. The human eye perceives motion as smooth at 25 frames per second (fps) or more. However, 8- or 16-channel DVRs capable of recording all channels simultaneously at this speed are expensive, and they are necessary only where the monitored scene is highly dynamic and variable. Usually this parameter can be lower; often 6 fps is quite sufficient. In this case, the more important parameter is the shutter speed of the camera, because each frame of the video should be sharp to allow correct identification of people and objects.
Video Quality - this parameter can be set only in the case of variable bitrate (VBR). The higher the video quality, the lower the compression rate and vice versa. DVRs with VBR option usually preserve the image quality above a factory-set threshold.
Frame Type, I Frame Interval - video stream settings. These parameters relate to the type of the stream transmitted over the network from the DVR to a PC. They can be adjusted in some DVRs. The number (interval) of I frames determines the image quality, but also changes the demand for computing power needed to decode the stream by the client computer. In most cases, these settings are reserved and the stream is automatically generated by the device.
Configuration of parameters for individual needs
Video surveillance systems operate in various environments, so the encoding (compression) parameters should be set for the actual scene. Because the live video is not compressed, the influence of parameter changes can only be observed by transmission through the network or playing back recordings.
The two most common questions of novice users are:
  • How to set the video parameters to get the best quality, optimally adapted to the available network bandwidth?
  • What should be the bandwidth of the link for remote operation of the DVR?
The first criterion that should be followed when setting the parameters of the DVR is the quality of the recorded material. The recordings have to be sufficient for identification of people and analysis of events. Continuous remote monitoring through the network is of secondary importance, so it does not require the highest quality video.
According to the EN 50132-7 standard, the size of the representation of an object, e.g. a human figure, should be strictly connected with the aim of the operator: identification, recognition, detection or general observation. For crowd control the figures should cover at least 5% of the screen height, for recognition purposes at least 50%, and for identification purposes at least 120%. If the position of the camera and other parameters do not allow for such proportions, the resolution of the compressed video stream can be reduced.
The frame rate selected by the user should depend on the variability in the monitored environment. Usually, it is the maximum value offered by the DVR at the highest resolution.
Typical frame rates used in various environments:
6 fps    12 fps    25 fps
  • local stores
  • single-family homes
  • offices
  • housing developments
  • playgrounds
  • warehouses, stockrooms
  • production facilities
  • large stores
  • parking lots
  • areas with moderate traffic
  • train, bus etc. stations
  • stadiums
  • concert halls
  • discos
  • cities
  • areas with dense traffic


Bitrate and I frame interval are closely linked. An image compressed in H.264 consists of fully encoded frames and frames generated on the basis of (among other things) displacement vectors. The compression settings must depend on the dynamics of the image. Sometimes the amount of information needed to generate the P and B frames is so large that it exceeds the amount of data describing the "basic" image. However, an excessive number of I frames lowers the compression ratio (raising the bitrate) and may even degrade the image quality.
Some examples:

Picture 1
One I frame every 25 frames
Bitrate: 3.5 Mb/s
Picture 2 - worse quality - visible ghosting and pixels
One I frame every two frames
Bitrate: 5.8 Mb/s
Both images come from a megapixel camera and share the same resolution (1600x1200), nominal bitrate setting (3 Mb/s), and frame rate (25 fps). The first picture shows the image with an I frame repeated every 25 frames, i.e. once per second. The second picture shows the same image with an I frame repeated every second frame. Although Picture 2 is generated from more "full" frames, it is of lower quality, with visible ghosting and pixelation. In addition, the data stream that has to be transmitted over the network is much larger: 5.8 Mb/s instead of 3.5 Mb/s.
The optimal I frame frequency is 0.5-1 Hz, i.e. one I frame every one to two seconds. Ultimately, the value should be chosen by assessing the picture quality and taking the frame rate into account.
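Since DVR menus usually express this setting as an interval counted in frames, the 0.5-1 Hz recommendation has to be converted using the frame rate. A minimal sketch of that conversion follows; the function name is hypothetical, and the frame rates are the typical values used in this article.

```python
def i_frame_interval_range(frame_rate_fps: float,
                           min_hz: float = 0.5,
                           max_hz: float = 1.0) -> tuple[int, int]:
    """Return the (shortest, longest) I frame interval in frames.

    An I frame frequency of max_hz gives the shortest interval,
    a frequency of min_hz gives the longest one.
    """
    shortest = round(frame_rate_fps / max_hz)
    longest = round(frame_rate_fps / min_hz)
    return shortest, longest


for fps in (6, 12, 25):
    shortest, longest = i_frame_interval_range(fps)
    print(f"{fps} fps: one I frame every {shortest} to {longest} frames")
```

At 25 fps this gives an interval of 25 to 50 frames, which agrees with the setting used for Picture 1.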
Bitrate refers both to the data transfer over the network and the compression rate of the video stream that is saved locally. Bitrate settings for the same resolution may be different, if the compressed image is more or less detailed.

Picture 3
Bitrate: 256 kb/s
Picture 4
Bitrate: 896 kb/s
Pictures 3 and 4 show images recorded in 4CIF resolution at 12 fps. For Picture 3 the bitrate was set to 256 kb/s, and for Picture 4 to 896 kb/s. Scenes with moderate dynamics, with stationary objects and little movement, do not require high bitrates. The total bandwidth of the stream for Picture 3 is 290 kb/s, and for Picture 4 it is 970 kb/s.
In Picture 5, the moving objects (especially legs of the person) are blurred. This is due to too low bitrate - the decoder does not have enough data to generate precise predictive frames.
Picture 5
Too low bitrate - blurred moving objects
For areas where the video monitoring system plays a preventive role and is used to observe the scene for quick response to threats, the bitrate must be sufficiently high. In places where the system provides a general overview and uses motion detection, this parameter can be set at a relatively low level.

Picture 6
Too low bitrate
The details of the static image are blurred
Picture 7
Properly selected bitrate
The bitrate for Picture 6 is the same as for Picture 3, and for Picture 7 the same as for Picture 4 (256 kb/s and 896 kb/s respectively). The differences in rendering details are significant. In Picture 6, the descriptions on the remote control unit are illegible, and the lines on the test pattern as well as fine characters are blurred and can be easily confused. Additionally, uniform background areas show noise.
However, static images enable reduction of frame rates, so the effective bitrates can be lower, with the same image quality.
All adjustments of bitrate, resolution and frame rate are directly connected with the possible recording time in a specific storage space.
Recommended bitrate values for various resolutions and frame rates, with the estimated recording times on a 1 TB drive given in parentheses

Resolution           25 fps                                   12 fps                   6 fps
1080p (1920x1080)    3 Mb/s - 9 Mb/s (32-10 days)             4096 kb/s (23 days)      3072 kb/s (31 days)
UXGA (1600x1200)     3 Mb/s - 9 Mb/s (32-10 days)             4096 kb/s (23 days)      3072 kb/s (31 days)
720p (1280x720)      2.5 Mb/s - 6 Mb/s (47-16 days)           1792 kb/s (54 days)      1536 kb/s (63 days)
4CIF (704x576)       1536 kb/s - 2048 kb/s (63-47 days)       896 kb/s (107 days)      640 kb/s (151 days)
2CIF (704x288)       768 kb/s - 1024 kb/s (126-94 days)       448 kb/s (215 days)      320 kb/s (303 days)
CIF (352x288)        512 kb/s - 768 kb/s (189-126 days)       320 kb/s (303 days)      160 kb/s (606 days)
QCIF (176x144)       160 kb/s - 224 kb/s (606-433 days)       96 kb/s (1011 days)      80 kb/s (1213 days)

The estimated recording times on a 1 TB hard drive (for a single channel) for a given resolution, recording speed (frame rate), and bitrate. Due to the type of compression, these are rough estimates.
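The recording times follow directly from dividing the drive capacity by the bitrate. The sketch below shows the calculation under unit conventions that are an assumption, not something stated in the article: 1 kb is taken as 1024 bits and the 1 TB drive as 1000 GiB, because this combination reproduces most of the table's figures to within a day. The function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def recording_days(bitrate_kbps: float, capacity_gib: float = 1000) -> float:
    """Estimated single-channel recording time in days.

    Assumed units (not confirmed by the source): 1 kb = 1024 bits,
    and a "1 TB" drive holds 1000 GiB.
    """
    capacity_bits = capacity_gib * 2**30 * 8
    seconds = capacity_bits / (bitrate_kbps * 1024)
    return seconds / SECONDS_PER_DAY


for bitrate in (4096, 3072, 640, 160, 80):
    print(f"{bitrate} kb/s: about {int(recording_days(bitrate))} days")
```

For several channels recorded on the same drive, the time shrinks in proportion to the sum of their bitrates.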
For network preview of images in the highest quality offered by the device, the total stream will be the sum of the streams from all cameras. It is necessary to add about 20% to the values from the table, due to dividing the data into packets. For example, the stream from 16 cameras with 4CIF resolution at 12 fps, each with a moderate 640 kb/s bitrate will amount to:
1.2 * 16 * 640 kb/s = 12288 kb/s, which may be rounded up to 12.5-13 Mb/s
Such download rates are common even in home networks, but the available upload speed may be the bottleneck.
Usually the network preview is performed using an auxiliary stream with lower bandwidth. When images from 16 cameras are displayed in the classic 4 x 4 division, even a Full HD screen (1920x1080 pixels) shows each window at a resolution of only 480x270 pixels, so there is no need to transmit 4CIF images.
Of course, there are alternatives - for a routine monitoring system the auxiliary stream usually uses CIF resolution at 6 fps, which reduces the requirement for upload capacity of the network to:
1.2 * 16 * 160 kb/s = 3072 kb/s, which may be rounded to 3 Mb/s
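Both estimates use the same formula, so it can be wrapped in a small helper for trying other camera counts and bitrates. This is a sketch only; the 20% packet overhead is the approximate figure from the text, and the function name is hypothetical.

```python
PACKET_OVERHEAD = 1.2  # about 20% added for dividing the data into packets


def total_stream_kbps(cameras: int, bitrate_kbps: float,
                      overhead: float = PACKET_OVERHEAD) -> float:
    """Total network stream for a number of identical camera streams."""
    return overhead * cameras * bitrate_kbps


main_stream = total_stream_kbps(16, 640)       # 4CIF at 12 fps, 640 kb/s each
auxiliary_stream = total_stream_kbps(16, 160)  # CIF at 6 fps, 160 kb/s each
print(f"Main streams:      {main_stream:.0f} kb/s")
print(f"Auxiliary streams: {auxiliary_stream:.0f} kb/s")
```

The result is the upload speed needed at the DVR side and, equally, the download speed needed by the client.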
If necessary, the user can close the preview of all 16 cameras and select the image(s) from one or several camera points.
These restrictions apply only to WAN connections. Within a LAN, Fast Ethernet (100 Mb/s) or Gigabit Ethernet is usually sufficient for transmitting the video streams with the highest parameters.
Generally, it should be remembered that the DVR needs sufficient UPLOAD speed, whereas the network client requires adequate DOWNLOAD speed.
Video monitoring via mobile phones
The possibility of remote monitoring has become a standard in CCTV since the emergence of mobile operating systems. They allow for installation of various applications created in simple programming languages, both free and advanced commercial ones.
For a typical smartphone user, the most important issues are:
  • Required connection speed
  • Data transfer for a single channel
The DVR usually sends an auxiliary stream to mobile phones. Depending on the phone operating system and its version, as well as the model of the DVR, the user can simultaneously view a varying number of camera channels.
Usually, the maximum resolution of the secondary stream is CIF, which allows for a greater frame rate (smoothness of the video). A practical compromise between the bandwidth and image quality is CIF @ 12 fps with a stream rate of 320 kb/s. The online preview with these parameters consumes about 2.5 MB per minute and requires connection through a 3G network (UMTS) in order to ensure adequate download speed. Of course, the stream may be limited to a minimum level (e.g. QCIF @ 6 fps, 64 kb/s). Such transfer is possible even in 2.5G networks (EDGE), but these parameters are sufficient only for a very general observation.
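The data consumption of a mobile preview can be estimated from the stream rate alone. The sketch below assumes decimal units (1 kb = 1000 bits, 1 MB = 1,000,000 bytes) and ignores packet overhead, so it gives slightly less than the rounded 2.5 MB per minute quoted above; the function name is illustrative.

```python
def megabytes_per_minute(bitrate_kbps: float) -> float:
    """Data consumed by one minute of streaming, in MB.

    Assumed units: 1 kb = 1000 bits, 1 MB = 1,000,000 bytes.
    Packet overhead is not included.
    """
    bits_per_minute = bitrate_kbps * 1000 * 60
    return bits_per_minute / 8 / 1_000_000


for label, bitrate in (("CIF @ 12 fps", 320), ("QCIF @ 6 fps", 64)):
    print(f"{label}, {bitrate} kb/s: "
          f"about {megabytes_per_minute(bitrate):.2f} MB per minute")
```

Multiplying the per-minute value by the expected viewing time gives a rough idea of the monthly data allowance needed.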