
DSP-based Vision Comes Of Age
by Stephen Albanese, Matrox Imaging
Original article featured in DSP and Multimedia Technology, March/April 1997
"DSPs are popular processors for imaging boards since they deliver the level of power and flexibility necessary for demanding applications like on-line quality inspection and medical visualization. For a system designer contemplating using off-the-shelf hardware and software, however, there is more to a successful solution than just an on-board DSP. In imaging, board design also has to address high-throughput I/O and the ever-increasing need for higher performance."
Developers of industrial, medical, and scientific imaging systems face market pressure to reduce costs and time-to-market while increasing performance. The PC has become a formidable ally, thanks in part to enabling technologies such as PCI, Windows NT, Pentium Pro processors, and low-cost peripherals, and to the growing ease of use and abundance of software development tools. In high-end imaging, DSP-based PCI image processors are now used in machine vision, medical imaging, and image analysis applications where OEMs and integrators previously relied on proprietary or VME-based solutions. Matrox, a long-time imaging vendor, recently introduced a PC-based subsystem designed to offer better price/performance than other hardware and software solutions, with a built-in migration path to higher performance.
Matrox Genesis is based on a DSP optimized for imaging operations, the Texas Instruments TMS320C80. It integrates acquisition, processing, and display on a single PCI board, and its performance can be scaled by adding companion processor boards. This high degree of integration and flexible scalability reduce system complexity and overall cost.
DSP offers power
High-end imaging applications are typically pixel-processing intensive, with real-time performance requirements. Of course, real time is relative. Labels on bottles may need to be inspected at a rate of 20,000 bottles an hour. Produce may require on-line color grading on a line moving several feet per minute. In medical imaging, it is common to perform large-kernel convolutions on high-resolution images at 30 frames per second.
For applications like these, imaging hardware needs on-board processing and frame storage. The TI C80 DSP is a good, cost-effective starting point for these performance requirements, with enough power and programming flexibility to execute a number of imaging operations within real-time constraints. To off-load data management tasks and let the DSP realize its full processing potential, Matrox designed a Video Interface ASIC (VIA). For maximum performance, the VIA performs optimized transfers on and off the board, manages data streams, and handles bus arbitration, among other tasks.
The TI C80 is a multiprocessor DSP with one RISC master processor (MP) and four parallel processors (PPs), each tied to private SRAM and linked by a crossbar network. Data routing is handled by a dedicated transfer controller. As the main board block diagram shows (Figure 1), Matrox Genesis uses this chip as one of its processing components.
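The arithmetic behind these requirements is simple: a throughput figure becomes a time budget per item. The minimal Python sketch below converts the rates quoted above. `budget_ms` is a name chosen here for illustration, and the sketch assumes one image per bottle with the whole interval available for processing.

```python
def budget_ms(items_per_second: float) -> float:
    """Time available to process one item, in milliseconds."""
    return 1000.0 / items_per_second


# 20,000 bottles per hour, expressed in bottles per second.
bottle_budget = budget_ms(20_000 / 3600)
# A 30 frames-per-second camera delivers a new frame every ~33.3 ms.
frame_budget = budget_ms(30)

print(f"bottle: {bottle_budget:.1f} ms, frame: {frame_budget:.1f} ms")
# prints "bottle: 180.0 ms, frame: 33.3 ms"
```

The 33.3 ms frame period is the figure used as the real-time limit in the examples later in this article.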
Figure 1
More power - ASICs
While the TI C80 can perform point-to-point operations on a 512 x 512 x 8-bit image in 2-3 ms, neighborhood operations such as convolution and morphology, which are fundamental to imaging, are much slower. For these, the optional custom Matrox neighborhood operations accelerator (NOA) ASIC supplements the TI C80. It consists of a MAC (multiplier/accumulator) array capable of performing 32 simultaneous sum-of-products operations at 50 MHz, and it accelerates neighborhood operations by up to 20 times compared with the TI C80 alone.
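To put the NOA figures in perspective, each output pixel of a k x k convolution is a sum of k² products. The sketch below counts the multiply-accumulates (MACs) for a 7x7 convolution on a 512 x 512 image and converts them to an ideal runtime. It assumes, as our own reading of the figures above, that the array completes 32 MACs per 50 MHz clock cycle with every unit busy on every cycle. Border handling, memory transfers, and setup are ignored, so the result is a lower bound, not a measured timing.

```python
def mac_count(width: int, height: int, kernel_w: int, kernel_h: int) -> int:
    """MACs for a direct (non-separable) convolution evaluated at every pixel."""
    return width * height * kernel_w * kernel_h


def ideal_time_ms(macs: int, parallel_macs: int = 32, clock_hz: float = 50e6) -> float:
    """Runtime if every MAC unit does useful work on every clock cycle (assumed)."""
    cycles = macs / parallel_macs
    return cycles / clock_hz * 1000.0


macs = mac_count(512, 512, 7, 7)
print(f"{macs} MACs, at least {ideal_time_ms(macs):.2f} ms")
# prints "12845056 MACs, at least 8.03 ms"
```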
Interfacing to all devices
Besides powerful processing components, imaging applications require flexible acquisition. Given the wide range of cameras in use, imaging hardware targeted at industrial, medical, and scientific applications should interface to many kinds of input devices. In addition to standard RS-170/CCIR and RGB cameras, the following are becoming increasingly common: non-standard analog and digital cameras, both color and monochrome; progressive-scan cameras; cameras with higher frame rates (60 frames per second or more) and multiple output taps; and higher-resolution cameras. Web inspection is likely to require line-scan cameras, and PCB panel inspection might require a digital camera to capture images with less noise and greater accuracy than an analog camera provides.
To meet these acquisition demands, the Matrox Genesis main board's mezzanine grab module is fully programmable. It interfaces to virtually all analog and digital, monochrome and color, area-scan and line-scan video devices. Besides controlling data traffic, the VIA can reformat, in real time, pixel data from video devices that output pixels in non-contiguous order.
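The article does not list the readout formats the VIA handles. As a hypothetical illustration of what non-contiguous pixel data can mean, the sketch below rebuilds one video line from a two-tap camera in two made-up layouts: taps carrying alternate pixels, and taps reading the two halves of the line from opposite ends. The function names are ours. In Matrox Genesis, this kind of reordering is done in real time by the VIA, not in host software.

```python
from typing import List


def merge_interleaved(tap0: List[int], tap1: List[int]) -> List[int]:
    """Hypothetical layout: tap0 carries even columns, tap1 carries odd columns."""
    line: List[int] = []
    for even, odd in zip(tap0, tap1):
        line.extend((even, odd))
    return line


def merge_halves_reversed(left: List[int], right_reversed: List[int]) -> List[int]:
    """Hypothetical layout: the left tap reads the left half from left to right,
    while the right tap reads the right half from the far edge inward."""
    return left + right_reversed[::-1]


print(merge_interleaved([0, 2, 4], [1, 3, 5]))      # [0, 1, 2, 3, 4, 5]
print(merge_halves_reversed([0, 1, 2], [5, 4, 3]))  # [0, 1, 2, 3, 4, 5]
```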
High performance display
Display requirements vary with the application, but real-time display of processed images and some operator intervention are likely. So is a large desktop on which multiple windows can be arranged into an efficient workspace. Rectangles and crosshairs marking ROIs and edges may need to be drawn over a live video window for tasks such as component alignment. In medical or scientific imaging, graphics and text may be used to annotate images.
The integrated display on the Matrox Genesis main board is managed by a second VIA and incorporates the Matrox MGA 2064W graphics engine, up to 8 MB of WRAM, and a 220 MHz RAMDAC for resolutions up to 1600 x 1200 @85 Hz. This provides a large desktop as well as graphics acceleration. Dual frame buffers, one for the image (8 or 24-bit) and one for the overlay (8-bit), allow the desktop and other graphics to be superimposed nondestructively over the live video window.
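Keeping image and overlay in separate buffers is what makes the graphics non-destructive: the screen shows a combination of the two, but neither buffer is changed. A minimal sketch, assuming an overlay value of 0 means "transparent" (a hypothetical key; the article does not say how transparency is encoded) and treating pixel values directly rather than through a palette:

```python
TRANSPARENT = 0  # hypothetical key: "show the image pixel underneath"


def compose(image, overlay, key=TRANSPARENT):
    """Return what would appear on screen; neither input buffer is modified."""
    return [
        [img if ovl == key else ovl for img, ovl in zip(img_row, ovl_row)]
        for img_row, ovl_row in zip(image, overlay)
    ]


image = [[10, 20, 30], [40, 50, 60]]
overlay = [[0, 255, 0], [0, 0, 255]]  # two marker pixels, e.g. part of a crosshair
screen = compose(image, overlay)
print(screen)  # [[10, 255, 30], [40, 50, 255]]
```

Clearing the overlay back to the key value restores the unmodified image immediately, which is why markers can be drawn over live video without corrupting it.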
Dealing with I/O
To avoid bottlenecks, an imaging subsystem must handle the I/O requirements of simultaneous acquisition, processing, and display of high-speed video. Matrox Genesis sustains high-throughput I/O by using the VIA to control and manage all data interfaces and by using SDRAM for main memory. Multiple dedicated high-speed buses transfer data between devices on the main and processor boards, between the boards themselves, and to and from external resources such as the host PC (see Figure 2).
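Rough numbers show why I/O can dominate. The sketch below computes raw data rates for two image formats used elsewhere in this article. It assumes that a 10-bit pixel occupies 2 bytes and that each frame crosses memory about four times (grab, one read and one write during processing, and display); both are our assumptions. For comparison, the theoretical peak of a 32-bit, 33 MHz PCI bus is about 132 MB/s, so traffic of this volume cannot all share the host bus.

```python
def stream_mb_per_s(width: int, height: int, bytes_per_pixel: int, fps: float) -> float:
    """Data rate of one video stream in MB/s (1 MB = 1,000,000 bytes)."""
    return width * height * bytes_per_pixel * fps / 1e6


std = stream_mb_per_s(512, 512, 1, 30)      # 512 x 512 x 8-bit at 30 fps
hires = stream_mb_per_s(1024, 1024, 2, 30)  # 1k x 1k x 10-bit (2 bytes assumed) at 30 fps
passes = 4  # assumed: grab + processing read + processing write + display

print(f"{std:.1f} MB/s, {hires:.1f} MB/s, ~{hires * passes:.0f} MB/s total")
# prints "7.9 MB/s, 62.9 MB/s, ~252 MB/s total"
```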
Figure 2
Processing building blocks
System performance requirements vary from one application to the next. To address this, Matrox uses the concept of processing nodes, the basic building blocks of Matrox Genesis' processing power. A main board has one processing node, while each companion processor board has one or two. A node contains one TI C80, an optional NOA ASIC, a VIA ASIC, and up to 64 MB of SDRAM.
Operations sent to a processing node automatically use as many of the node's resources as they need to run at maximum speed. A developer does not have to optimize the use of the TI C80's MP and PPs or of the NOA; however, explicit control is available if required.
Flexible Scalability
For applications that demand it, multiple processing nodes can be used by adding companion processor boards to a main board (see Figure 2). Multiple nodes can be used in parallel, either using the SIMD (single instruction, multiple data) model or the MIMD (multiple instruction, multiple data) model.
There are several ways to divide an application between nodes. A developer can let each node grab a different part of the same input frame and work only on that part. This gives the lowest possible latency before an output image is produced, but it is unsuitable for algorithms in which a single processor needs access to the whole image. A second approach is to let each node grab and process a complete frame, with each successive frame going to a different node. Latency is longer, but performance scales linearly with the number of nodes, and almost any algorithm can be implemented, since each node sees a complete image. A third approach is to dedicate one node to grabbing and let it do the first part of the processing before passing partial results on to the next node in the pipeline.
In other words, Matrox Genesis supports parallel or pipeline topologies, or any combination of the methods above. Which model is best depends on the application, but several features help a developer program a multiprocessing algorithm on Matrox Genesis. First, grabbed images are broadcast to all nodes in the system. Each node can take the whole image or only part of it, which reduces transfers between nodes. Second, the VM Channel connects all nodes, so results can easily be sent to a specific node for further processing or display. In addition, each TI C80 can make random accesses to any other node over the PCI bus; this is normally used for message passing and for sharing small amounts of data.
Since every application is different, processing requirements vary. They depend on camera resolution and bit depth, and on how many operations, and of what types, must be performed within real-time constraints. For example, a 60 frames per second high-resolution camera imposes a tighter real-time requirement than a 30 fps standard camera. Certain higher-level algorithms, such as normalized correlation, demand more processing power than simpler operations such as image addition. When more power is needed, scalability gives developers a migration path to higher performance, so algorithms can be expanded later. The following two examples show how a developer can control the configuration of processing across multiple nodes.
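A first-order model makes the trade-off between the first two approaches concrete. In the sketch below (a simplification: perfect load balance, no overlap rows, no transfer overhead), splitting each frame across nodes cuts latency, while round-robin keeps the full single-node latency; ideal throughput is the same either way. The function names and the 60 ms workload are hypothetical.

```python
from typing import Tuple


def split_frame(t_frame_ms: float, nodes: int) -> Tuple[float, float]:
    """Each node grabs and processes 1/nodes of every frame.
    Returns (latency_ms, frames_per_second), assuming a perfect split."""
    latency = t_frame_ms / nodes
    return latency, 1000.0 / latency


def round_robin(t_frame_ms: float, nodes: int) -> Tuple[float, float]:
    """Each node processes complete frames in turn: latency stays at one full
    frame's processing time, while throughput grows with the node count."""
    return float(t_frame_ms), nodes * 1000.0 / t_frame_ms


# Hypothetical algorithm taking 60 ms per frame on one node, run on 3 nodes.
print(split_frame(60, 3))  # (20.0, 50.0)
print(round_robin(60, 3))  # (60.0, 50.0)
```

In practice the split-frame numbers are optimistic, because overlapping border rows and stitching add work that this model leaves out.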
Parallel processing example
Many medical imaging applications require real-time acquisition, processing, and display of images for visualization and review. Typical processing may include noise or temporal filtering, edge enhancement, and brightness/contrast adjustment. One main board might perform these operations on a 512 x 512 x 8-bit image at 30 frames per second in real time, that is, within 33 ms per frame. However, a 1k x 1k x 10-bit image at 30 fps may require the addition of a processor board.
For a 1k x 1k x 10-bit image, the noise/temporal filtering part of the example algorithm takes approximately 20 ms per frame on a single node. Edge enhancement using large 7x7 convolutions takes 26 ms, and the final brightness/contrast adjustment takes 11 ms. The complete algorithm therefore needs roughly 57 ms per frame on one node, well over the 33 ms available at 30 fps. In principle two nodes could keep up; three are used here to run in real time while leaving plenty of spare processing power in case additional operations are added to the algorithm in the future.
As illustrated in Figure 3, images are grabbed and broadcast simultaneously to all three nodes. Each node then takes a different region of interest to process. Each region is 1024 pixels wide by 350 rows high; a slight overlap between regions avoids border effects. The primary VIA of each node sends its processed region over the VM Channel to the display, where the main board's display VIA "stitches" the image back together in real time.
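The node count follows from dividing the total processing time by the frame period. A small sketch using the stage timings above, assuming the work divides perfectly across nodes and ignoring the small overlap between regions:

```python
import math

stages_ms = {"noise/temporal filter": 20, "7x7 edge enhancement": 26, "brightness/contrast": 11}
frame_period_ms = 1000 / 30  # ~33.3 ms at 30 fps

total = sum(stages_ms.values())  # 57 ms on one node
min_nodes = math.ceil(total / frame_period_ms)

for n in (min_nodes, 3):
    per_node = total / n  # assumes a perfect split of the image
    print(f"{n} nodes: {per_node:.1f} ms per frame, {frame_period_ms - per_node:.1f} ms spare")
# prints "2 nodes: 28.5 ms per frame, 4.8 ms spare"
# prints "3 nodes: 19.0 ms per frame, 14.3 ms spare"
```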
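The article gives the region size but not where the seams fall. One plausible scheme, sketched below with hypothetical function names, spaces three 350-row bands evenly over 1024 rows (13 rows of overlap between neighbors) and cuts each overlap in the middle when stitching. Every kept row then has at least 6 rows of context inside its band, more than the 3 rows on each side that a 7x7 kernel needs.

```python
from typing import List, Tuple

Band = Tuple[int, int]  # (first_row, end_row), end exclusive


def bands(height: int, band_height: int, nodes: int) -> List[Band]:
    """Evenly spaced, overlapping row bands that together cover the image."""
    if nodes < 2 or band_height * nodes < height:
        raise ValueError("need at least two bands that together cover the image")
    step = (height - band_height) / (nodes - 1)
    return [(round(i * step), round(i * step) + band_height) for i in range(nodes)]


def stitch_ranges(band_list: List[Band]) -> List[Band]:
    """Cut each overlap in the middle so every output row comes from one band."""
    cuts = [band_list[0][0]]
    for (_, prev_end), (next_start, _) in zip(band_list, band_list[1:]):
        cuts.append((prev_end + next_start) // 2)
    cuts.append(band_list[-1][1])
    return list(zip(cuts, cuts[1:]))


b = bands(1024, 350, 3)
print(b)                 # [(0, 350), (337, 687), (674, 1024)]
print(stitch_ranges(b))  # [(0, 343), (343, 680), (680, 1024)]
```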
Figure 3
Pipeline/round-robin processing
A traditional way to cope with demanding real-time inspection requirements has been to use pipeline hardware. While this can certainly meet some requirements, it can be expensive and inflexible: a dedicated hardware module is required for each operation in the pipe, and reconfiguring hardware for different applications can prove cost-prohibitive or impossible. Although a virtual pipeline can be built using Matrox Genesis main and processor boards, a more efficient approach is to let each frame of image data be processed on whichever node is available.
In this example (see Figure 4), BGA (ball grid array) packages are inspected for missing balls. First, opposite corners of the package are located to determine its exact orientation. Next, the source image is aligned to a reference image and subtracted from it. Finally, the remaining blobs (missing balls) are counted. In machine vision, algorithms are often unsuited to dividing up an image: operations such as pattern matching or blob analysis require a search over the whole frame.
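The last two steps can be sketched in a few lines of Python. The sketch assumes the corner search and alignment have already produced an image registered to the reference. It uses an absolute difference followed by a threshold (50 is an arbitrary example value) and counts 4-connected blobs. A production inspection would also reject blobs too small to be a ball; that filter is omitted here.

```python
from collections import deque


def missing_ball_count(aligned, reference, threshold=50):
    """Threshold |reference - aligned| and count 4-connected blobs of changed pixels."""
    h, w = len(reference), len(reference[0])
    mask = [[abs(reference[y][x] - aligned[y][x]) > threshold for x in range(w)]
            for y in range(h)]
    seen = [[False] * w for _ in range(h)]
    blobs = 0
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                blobs += 1  # new blob found: flood-fill it so it is counted once
                seen[y][x] = True
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return blobs


reference = [[0, 0, 0, 0, 0, 0],
             [0, 200, 200, 0, 200, 0],
             [0, 0, 0, 0, 0, 0]]
aligned = [[30, 0, 0, 0, 0, 0],      # small difference, below the threshold
           [0, 200, 200, 0, 0, 0],   # the ball at column 4 is missing
           [0, 0, 0, 0, 0, 0]]
print(missing_ball_count(aligned, reference))  # 1
```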
Figure 4
Finding the two corner models takes two pattern matches of about 5 ms each (10 ms in total). Subpixel alignment with a reference image (rotation with bilinear interpolation) takes 22 ms. Image subtraction with the reference image takes an additional 4 ms, and blob analysis to detect missing balls takes an average of 10 ms. The total processing time is therefore 46 ms per frame. Since the example uses a standard RS-170 camera, a new frame arrives roughly every 33 ms, so real-time execution requires two processing nodes. In this case, an optimum configuration might use a main board and a processor board with one node.
The first node processes a complete frame (n). Because that takes 46 ms, longer than one frame period, the second node processes the next frame (n+1) while the first node completes its task; each node therefore has about two frame periods (roughly 67 ms) per frame. With enough nodes, real-time operation can be maintained even for complex algorithms, faster cameras, and higher resolutions. It is also straightforward to code the application so that it uses as many processing nodes as required to maintain real-time performance.
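The round-robin scheme can be checked with a small simulation. The sketch assumes frame k becomes available at k x 33.3 ms and ignores grab and transfer time; the 46 ms workload is the total from the example above, and the function names are hypothetical. With two nodes no frame ever waits for a busy node; with one node, frames start to queue.

```python
import math
from typing import List, Tuple


def round_robin_schedule(period_ms: float, work_ms: float, nodes: int,
                         frames: int) -> List[Tuple[int, float, float]]:
    """Send frame k to node k % nodes. A frame starts when it is available
    (k * period_ms) or when its node is free, whichever is later.
    Returns (node, start_ms, finish_ms) for each frame."""
    free_at = [0.0] * nodes
    schedule = []
    for k in range(frames):
        node = k % nodes
        start = max(k * period_ms, free_at[node])
        free_at[node] = start + work_ms
        schedule.append((node, start, start + work_ms))
    return schedule


def keeps_up(schedule, period_ms: float) -> bool:
    """True if no frame ever waits for a busy node."""
    return all(start == k * period_ms for k, (_, start, _) in enumerate(schedule))


period = 1000 / 30          # RS-170: a new frame roughly every 33.3 ms
work = 2 * 5 + 22 + 4 + 10  # corner matches, alignment, subtraction, blob analysis
nodes = math.ceil(work / period)

print(work, nodes)                                                       # 46 2
print(keeps_up(round_robin_schedule(period, work, nodes, 10), period))   # True
print(keeps_up(round_robin_schedule(period, work, 1, 10), period))       # False
```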
Multilevel software support: minimal or no custom development
As software development can be the largest investment in any application solution, development tools such as libraries should provide not only comprehensive functionality but also the flexibility to accept custom operations when needed. Matrox supplies three levels for programming Matrox Genesis: the device-independent Matrox Imaging Library (MIL), the board-specific Matrox Genesis Native Library, and the Matrox Genesis Developer's Toolkit (DTK). The DTK is used in conjunction with Texas Instruments' TMS320C80 Software Development Tools to program the TI C80 DSP directly.
Comprehensive libraries
MIL and the Native Library have an extensive set of pre-built and optimized functions for point processing, morphology, neighborhood operations, statistics, geometric transforms, as well as blob analysis, pattern matching and alignment, gauging and JPEG compression. The two libraries generally have the same commands, with the Native Library offering some additional commands which exploit the specific features of the board.
If MIL is used to build the application, MIL commands simply call the equivalent Native Library commands. The Native Library is a set of stub routines, one for each supported opcode. Native Library commands are initiated by the host and make remote procedure calls to the actual processing functions on the C80 (the Native Library "Shell" resides in the on-board SDRAM processing memory). The commands automatically divide the task among the available processors (the TI C80's PPs and the NOA). If control of the processors is required, a set of commands is provided for it; the DTK is not required to control how tasks are divided between processors.
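The stub-and-shell arrangement described above is a form of remote procedure call: a host-side stub packs an opcode and its arguments into a message, and code on the board looks up the opcode and runs the matching function. The sketch below simulates both sides in one Python process. The opcode value, message layout, and every name in it are hypothetical, not the Native Library's actual interface; the `register` decorator only illustrates how a user-written function could join the same dispatch table.

```python
import struct
from typing import Callable, Dict, List

# Board side (hypothetical): image buffers and a dispatch table keyed by opcode.
BUFFERS: Dict[int, List[int]] = {1: [10, 20, 30]}
HANDLERS: Dict[int, Callable[[int, int], None]] = {}


def register(opcode: int):
    """Bind a processing function to an opcode; a user-written function added
    this way becomes callable from the host like a built-in one."""
    def wrap(fn):
        HANDLERS[opcode] = fn
        return fn
    return wrap


@register(0x01)
def add_constant(buf_id: int, value: int) -> None:
    """Saturating add on an 8-bit buffer."""
    BUFFERS[buf_id] = [min(p + value, 255) for p in BUFFERS[buf_id]]


def board_execute(message: bytes) -> None:
    """Shell side: decode the command and run the matching function."""
    opcode, buf_id, arg = struct.unpack("<III", message)
    HANDLERS[opcode](buf_id, arg)


# Host side: a thin stub that only packs the opcode and arguments.
def host_add_constant(buf_id: int, value: int) -> None:
    message = struct.pack("<III", 0x01, buf_id, value)
    board_execute(message)  # stands in for a transfer across the PCI bus


host_add_constant(1, 240)
print(BUFFERS[1])  # [250, 255, 255]
```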
Maximum flexibility/user extensible
An application can be built using MIL, the Native Library, or the DTK, or any combination of them. To keep code mostly portable, the majority of the application can be written with MIL commands, while MIL's native mode programming lets a developer call Native Library commands for specific tasks.
If portability is not required, the application can be written entirely with the Native Library. Since some specialized applications may require custom TI C80 functions, the Native Library is user extensible. Custom functions can be added to the library by using the DTK. These custom functions can then be seamlessly integrated into the application code. Complete applications developed with the Native Library can, if necessary, be ported to the TI C80 for on-board execution.
Final words
By building on the power of DSPs that are optimized for imaging, such as TI's C80, vendors like Matrox are able to provide PC-based technology for even the most demanding applications in industrial, medical, and scientific imaging. A flexible, off-the-shelf subsystem like Matrox Genesis solves I/O issues and processing requirements on a single board, and if required, offers the scalability to achieve almost any level of performance.